Robust methods for generalized partial linear partial additive models with an application to detection of disease outbreaks

by

Tae Yoon (Harry) Lee

B.Sc. (Hons.), The University of British Columbia, 2016

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Statistics)

The University of British Columbia (Vancouver)

August 2019

© Tae Yoon (Harry) Lee, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Robust methods for generalized partial linear partial additive models with an application to detection of disease outbreaks

submitted by Tae Yoon (Harry) Lee in partial fulfillment of the requirements for the degree of Master of Science in Statistics.

Examining Committee:

Matias Salibian-Barrera, Statistics (Supervisor)
Gabriela V. Cohen Freue, Statistics (Supervisory Committee Member)

Abstract

An essential function of the public health system is to detect and control disease outbreaks. The British Columbia Centre for Disease Control (BC CDC) monitors approximately 60 notifiable disease counts from 8 branch offices in 16 health service delivery areas. These disease counts exhibit a variety of characteristics, such as seasonality in meningococcal disease and a long-term trend in acute hepatitis A. As staff need to determine whether the reported counts are higher than expected, the detection process is both costly and fallible. To alleviate this problem, in the early 2000s the BC CDC commissioned an automated statistical method to detect disease outbreaks. The method is based on a generalized additive partially linear model and appears to capture the characteristics of disease counts. However, it relies on certain ad-hoc criteria to flag counts for an outbreak. The BC CDC is interested in considering other alternatives.

In this thesis, we discuss an outbreak detection method based on robust estimators.
It builds on recently proposed robust estimators for additive, generalized additive, and generalized linear models. Using real and simulated data, we compare our method with that of the BC CDC and other natural competitors and present promising results.

Lay Summary

An essential function of the public health system is to detect and control disease outbreaks. The British Columbia Centre for Disease Control (BC CDC) monitors approximately 60 notifiable disease counts from 8 branch offices in 16 health service delivery areas. As staff need to determine whether the reported counts are higher than expected, the detection process is both costly and fallible. To alleviate this problem, in the early 2000s the BC CDC commissioned an automated method to detect disease outbreaks. However, it relies on certain ad-hoc criteria which are not desirable. The BC CDC is interested in considering other alternatives.

We propose an outbreak detection method based on the recent literature in robust statistics. Using real and simulated data, we compare our method with that of the BC CDC and other natural competitors and present promising results.

Preface

This dissertation was prepared under the supervision of Professor Matias Salibian-Barrera. He suggested the research topic and provided relevant literature. Moreover, he checked and helped correct all the proofs, and commented on the writing to enhance clarity and coherence.

Chapter 3 describes a disease outbreak detection method used by the British Columbia Centre for Disease Control (BC CDC). The method was developed and implemented by Rick White, a deceased member of the Applied Statistical and Data Science Group in the Department of Statistics at UBC. The description is based on an unpublished and unfinished manuscript by Rick White and the BC CDC (White, 2000b) and a PowerPoint presentation (White, 2000a).

Chapter 4 describes a disease outbreak detection method based on robust estimators.
It is based on Alimadad and Salibian-Barrera (2011), Pan (2013), and Boente et al. (2017). An improvement to the computation of the parameter estimates in Alimadad and Salibian-Barrera (2011) and Pan (2013) was made after reading Aeberhard et al. (2014). The code from Aeberhard et al. (2014) for estimating the two sets of parameters of the Negative Binomial distribution in an alternating fashion was adapted and modified for the proposed method for Negative Binomial distributions.

Chapter 6 uses the datasets provided by the BC CDC.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Preliminaries
  2.1 Set-up
  2.2 GLM
    2.2.1 Iterated re-weighted local scoring algorithm (IRWLS)
    2.2.2 Quasi-likelihood
  2.3 GAM
    2.3.1 The backfitting algorithm
    2.3.2 Generalized local scoring algorithm (GLSA)
  2.4 GAPLM
    2.4.1 Backfitting estimator
    2.4.2 Speckman estimator
    2.4.3 Generalized Speckman estimator
    2.4.4 Connection to profile likelihood
3 PHIDO
  3.1 Model fitting
    3.1.1 Binomial
    3.1.2 Negative Binomial
  3.2 Alert score calculation
  3.3 Remark
4 Three robust methods to fit a GAPLM and outbreak detection
  4.1 Preliminary: robust linear regressions
  4.2 Robust approaches to fit a GAPLM
    4.2.1 The first approach: YP and WA
    4.2.2 The second approach: TL
    4.2.3 Comparison
  4.3 Outbreak detection
5 Simulation study
  5.1 Simulation setting
    5.1.1 Data generation
    5.1.2 Methods
    5.1.3 Evaluation measures
  5.2 Simulation results
    5.2.1 Fitted values
    5.2.2 Assessment of parameter estimation
    5.2.3 Goodness-of-fit
    5.2.4 Quantification of outliers: CRR and RP
6 Application: PHIDO datasets
  6.1 Exploratory data analysis
  6.2 Alert levels
7 Conclusion and future work
  7.1 Future work
Bibliography

List of Tables

Table 5.1 Simulation study parameters and their levels.
Table 5.2 The five methods that are compared in the simulation study. Their key differences in features are outlined.
Table 6.1 The alert levels for the last four weekly counts of Series 1 computed by PHIDO and WA. The overall alert score by PHIDO is -3.68, which indicates no alert.
Table 6.2 The alert levels for the last four weekly counts of Series 2 computed by PHIDO and WA. The overall alert score by PHIDO is -0.67, which indicates no alert.
Table 6.3 The alert levels for the last four weekly counts of Series A computed by PHIDO and WA. The overall alert score by PHIDO is 0.29, which indicates no alert.
Table 6.4 The alert levels for the last four weekly counts of Series B computed by PHIDO and WA. The overall alert score by PHIDO is 2.3, which indicates a medium alert.
Table 6.5 The alert levels for the last four weekly counts of Series C computed by PHIDO and WA. The overall alert score by PHIDO is -0.24, which indicates no alert.
Table 6.6 The alert levels for the last four weekly counts of Series D computed by PHIDO and WA. The overall alert score by PHIDO is -4.27, which indicates no alert.
Table 6.7 The alert levels for the last four weekly counts of Series E computed by PHIDO and WA. The overall alert score by PHIDO is -2.29, which indicates no alert.
Table 6.8 The alert levels for the last four weekly counts of Series F computed by PHIDO and YP. The overall alert score by PHIDO is 0.99, which indicates no alert.

List of Figures

Figure 1.1 The number of reported cases of meningococcal disease generally peaks in January, February, and March.
Figure 1.2 There has been a gradual decline in the reported cases of hepatitis A in the US from 2000 to 2015.
Figure 3.1 An example of the current PHIDO algorithm illustrates that some of the data points that are close to the fitted robust curve are classified as high and low alert points.
Figure 4.1 OLS, Huber's, and Tukey's biweight ρ, ψ, and W functions for c = 3.
Figure 4.2 Comparison of gam and robust gam fits for data generated from a Poisson distribution. This example is taken from Alimadad and Salibian-Barrera (2011). The robust gam fits by Alimadad and Salibian-Barrera (2011) and our proposed method show similar performance.
Figure 5.1 Left: f1(T1); Right: f2(T2). They are smooth, periodic, and centered.
Figure 5.2 The upward trend mean function is smooth and periodic.
Figure 5.3 The functional boxplot of the 100 fitted curves by the five methods in the absence of outliers: (a) AEBERHARD, (b) LEE, (c) MGCV, (d) PAN, and (e) PHIDO.
Figure 5.4 The functional boxplot of the 100 fitted curves by the five methods in the presence of outliers in the middle: (a) AEBERHARD, (b) LEE, (c) MGCV, (d) PAN, and (e) PHIDO.
Figure 5.5 The functional boxplot of the 100 fitted curves by the five methods in the presence of outliers at the end: (a) AEBERHARD, (b) LEE, (c) MGCV, (d) PAN, and (e) PHIDO.
Figure 5.6 The boxplots for each of the two β parameters for PAN, LEE, AEBERHARD, and MGCV: (a) no outliers, (b) outliers in the middle, and (c) outliers at the end.
Figure 5.7 The boxplots of the overdispersion parameter σ² for LEE, AEBERHARD, and MGCV: (a) no outliers, (b) outliers in the middle, and (c) outliers at the end.
Figure 5.8 The boxplots of the MSEs for PAN, LEE, AEBERHARD, and MGCV: (a) no outliers, (b) outliers in the middle, and (c) outliers at the end.
Figure 5.9 (a) and (b): the boxplots of the 100 CRRs at 1−α = 0.95 for PAN, LEE, AEBERHARD, and MGCV in the presence of outliers in the middle and at the end, respectively. (c) and (d): the boxplots of the 100 RPs at 1−α = 0.95 for PAN, LEE, AEBERHARD, and MGCV in the presence of outliers in the middle and at the end, respectively.
Figure 5.10 The functional boxplot of the 100 fitted curves by the five methods in the presence of outliers in the middle: (a) AEBERHARD, (b) LEE, (c) MGCV, (d) PAN, and (e) PHIDO.
Figure 5.11 The functional boxplot of the 100 fitted curves by the five methods in the presence of outliers at the end: (a) AEBERHARD, (b) LEE, (c) MGCV, (d) PAN, and (e) PHIDO.
Figure 6.1 The weekly counts of Series 1, 2, A, and B.
Figure 6.2 The weekly counts of Series C, D, E, and F.
Figure 6.3 Rough estimates of overdispersion for each month for Series 1, 2, A, and B. The red line indicates the intercept of 1.
Figure 6.4 Rough estimates of overdispersion for each month for Series C, D, E, and F. The red line indicates the intercept of 1.
Figure 6.5 The fitted values of PHIDO and LEE for Series 1, 2, A, and B.
Figure 6.6 The fitted values of PHIDO and LEE for Series C, D, E, and F.
Figure 6.7 The last year of the fitted values of PHIDO and LEE for Series 2.

Acknowledgments

To begin with, I want to thank my family for their unconditional support and love. Thanks to them, I have been able to freely explore my curiosity and seek the meaning of my life through a variety of experiences.

Next, I want to express my gratitude to my supervisor, Prof. Matias Salibian-Barrera, for his immense patience.
Under his guidance, I was able to complete the thesis smoothly. Whenever I was stuck, he derived things from scratch and clarified problems or hinted at the next steps for research. On a personal side, I have enjoyed his humor, which takes a while to notice, and have observed his love for mathematics, music, and soccer. In addition, I am grateful for his understanding and support with regard to my personal matters.

I also want to thank my other, unofficial supervisor, Prof. James Zidek. Jim and I have been working on a completely different research topic on units of measurement. I truly admire his passion for research and his everlasting curiosity that one would find in a young child. He has been supportive of several workshops and conferences, from which I have greatly learned and benefited. As both an academic and a life mentor, he has given me invaluable advice and shared many interesting stories.

I want to thank Prof. Gabriela V. Cohen Freue for generously and kindly reviewing my thesis and providing invaluable feedback on short notice. I want to thank Prof. Harry Joe for an infinite number of office hours. Whenever I had questions about statistics, I always checked first whether his door was open.

Lastly, I would like to thank everyone else in the department for creating a friendly and supportive environment for collaboration and research. Special thanks to Ali, Andrea, and Peggy for taking care of us in the back.

Chapter 1

Introduction

An essential function of the public health system is to detect and control disease outbreaks, where a disease outbreak is defined as "the occurrence of cases of disease in excess of what would be normally expected" (World Health Organization, 2018). In British Columbia, there are about 60 notifiable diseases or disease categories under the Health Act Communicable Disease Regulation.
These diseases exhibit a variety of characteristics, such as seasonality in meningococcal disease (Figure 1.1) and a gradual decline in acute hepatitis A over the past decade (Figure 1.2). As staff need to determine whether the reported counts are higher than expected, the detection process is fallible. Moreover, considering the complexity of disease characteristics and the paramount importance of notifying disease outbreaks, it requires highly experienced staff to be present at all times, and hence it is costly for the BC CDC. To alleviate these problems, the BC CDC commissioned an automated statistical method to detect disease outbreaks in the early 2000s.

The method was developed by White (2000b), and we shall call it PHIDO hereinafter. It is based on a generalized additive partial linear model (GAPLM), which combines the interpretability of generalized linear models (McCullagh and Nelder, 1989) with the flexibility of generalized additive models (Hastie and Tibshirani, 1986). It allows one to model the mean of the response variable (e.g. counts of disease) as a function of covariates for a variety of distributions in the exponential family. We formally define the framework and describe the classical fitting algorithms for GAPLMs in Chapter 2. In Chapter 3, we describe PHIDO in detail and explain why PHIDO relies on ad-hoc criteria to flag cases for an outbreak.

Our major contribution is the development of a theoretically sound version of the first of the two components of PHIDO, namely fitting a robust, outbreak-resistant curve to a disease dataset. The proposed method builds on recently developed robust estimators for additive, generalized additive, and generalized linear models. In Chapter 4, we present two robust fitting algorithms for GAPLMs and provide details for implementing them for Poisson and Negative Binomial (NB) distributions. Comparisons of the proposed methods and PHIDO in a simulated environment are reported in Chapter 5.
We show the application of our methods to the datasets provided by the BC CDC in Chapter 6. Finally, we summarize our contribution and discuss future work in Chapter 7.

Figure 1.1: The number of reported cases of meningococcal disease generally peaks in January, February, and March.

Figure 1.2: There has been a gradual decline in the reported cases of hepatitis A in the US from 2000 to 2015. (Source: CDC, National Notifiable Diseases Surveillance System (NNDSS).)

Chapter 2

Preliminaries

In this chapter, we formally introduce generalized linear models (GLM), generalized additive models (GAM), and generalized additive partial linear models (GAPLM). We mainly discuss how to fit these models. For asymptotic and inference results, see Härdle et al. (2012).

2.1 Set-up

The goal is to understand the relationship between Y and X, where Y is the response variable and X = (X_1, ..., X_p) is the vector of explanatory variables, or simply covariates. We assume the response variable follows a distribution in the exponential family of the form
\[
f_{Y|X}(y; x, \theta, \phi) = \exp\left( \frac{y\,\theta(x) - b(\theta(x))}{a(\phi)} + c(y, \phi) \right),
\]
where a, b, c are functions, φ is a scale parameter, and θ(x) is the canonical parameter that links Y and X. Furthermore, we restrict our attention to the distributions for which a(φ) = φ/ω, where ω is a known constant. For these distributions, we can always express the variance of Y as a function of µ(X): var(Y) = φV(µ), where µ is the conditional mean of Y given X, and V is called the variance function. This restriction covers the distributions that are commonly used in practice, such as Poisson and Negative Binomial (NB).
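As a quick numerical illustration of the relation var(Y) = φV(µ): for the Poisson distribution, V(µ) = µ and φ = 1, so the variance equals the mean. The following is our own simulation sketch (names and sample sizes are arbitrary), not code from this thesis:

```python
import numpy as np

# Sketch: for Poisson, phi = 1 and V(mu) = mu, so var(Y) = phi * V(mu) = mu.
rng = np.random.default_rng(0)
mu = 4.0
y = rng.poisson(mu, size=200_000)

print(y.mean())  # close to mu
print(y.var())   # close to phi * V(mu) = mu
```

For the NB distribution the variance exceeds the mean, which is how the variance function accommodates overdispersed counts.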
For details, see Section 2.1 of Wood (2006).

2.2 GLM

Generalized linear models can be viewed as an extension of linear models; they allow modeling a larger set of distributions in the exponential family, such as Poisson and NB (Nelder and Wedderburn, 1972). For example, a standard linear regression model would not be appropriate for modeling count data, since it allows nonsensical negative values. If the distribution of the response variable is a known member of the exponential family, only the relationship between the conditional mean of the response variable and the covariates needs to be specified. Specifically, we assume a linear form with a smooth monotonic link function g:
\[
\eta := g(\mu) = g(E(Y|X)) = \beta_0 + \sum_{i=1}^{p} \beta_i X_i, \qquad (2.1)
\]
\[
\mu = g^{-1}(\eta). \qquad (2.2)
\]
When the link function connects η and θ in a very specific way such that η = θ, it is called canonical; θ(X) then depends on X in a linear fashion. For example, the log link is the canonical link function for Poisson distributions. In Wood (2006), see Table 2.1 for more examples and Section 2.1.8 for advantages of using canonical links.

2.2.1 Iterated re-weighted local scoring algorithm (IRWLS)

Maximum likelihood estimation is used to estimate β = (β_0, ..., β_p) in Eq. (2.1):
\[
\hat{\beta}_{MLE} = \arg\max_{\beta} \sum_{i=1}^{n} l(\beta; y_i, x_i), \qquad (2.3)
\]
where l(·) is the log-likelihood function. The scale parameter φ is omitted from the estimation, as it appears as a constant multiplier in the score function (Eq. (2.4)) and consequently does not affect the estimation. The estimating equations are
\[
S(\beta) = \frac{\partial l}{\partial \beta}(\beta) \propto \sum_{i=1}^{n} \frac{y_i - \mu_i}{V(\mu_i)} \frac{\partial \mu_i}{\partial \beta} = 0. \qquad (2.4)
\]
Note that these equations are nonlinear in β. The Newton-Raphson method is a popular choice for solving nonlinear equations, provided that first and second order derivatives exist. This results in the following updating procedure:
\[
\hat{\beta}^r = \hat{\beta}^{r-1} - \left[ J^{-1}(\beta)\, \frac{\partial l}{\partial \beta}(\beta) \right]_{\beta = \hat{\beta}^{r-1}}, \qquad r = 1, 2, 3, \ldots,
\]
where
\[
J(\hat{\beta}) = \left[ \frac{\partial^2 l}{\partial \beta \partial \beta^T}(\beta) \right]_{\beta = \hat{\beta}}
\]
is the observed information matrix.
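As a concrete illustration of this updating procedure, here is a minimal sketch for a Poisson GLM with the canonical log link (our own code, not from the thesis; for this link the observed information is \(X^T \mathrm{diag}(\mu) X\) and does not depend on y, so the Newton-Raphson step is particularly simple):

```python
import numpy as np

# Newton-Raphson for a Poisson GLM with log link (illustrative sketch):
# score  = X^T (y - mu),   information = X^T diag(mu) X,  with mu = exp(X beta).
def poisson_newton(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)               # inverse link
        score = X.T @ (y - mu)              # dl/dbeta (phi omitted)
        info = X.T @ (mu[:, None] * X)      # observed information
        beta = beta + np.linalg.solve(info, score)
    return beta

rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
beta_true = np.array([0.5, 1.0])
y = rng.poisson(np.exp(X @ beta_true))
print(poisson_newton(X, y))  # close to beta_true
```

For non-canonical links the observed information involves y, which motivates the expected-information (Fisher scoring) variant discussed next.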
Sometimes it is difficult to calculate the second order derivative J in practice. In such a case, J is often approximated by its expectation, known as the Fisher information I(β) := −E_Y(J(β)). By the properties of an exponential family distribution, we have the identity
\[
I(\beta) = E_Y\!\left( \frac{\partial l}{\partial \beta}(\beta)\, \frac{\partial l}{\partial \beta^T}(\beta) \right)
\]
under the usual regularity conditions. It only involves the first order derivative and greatly simplifies computation:
\[
\hat{\beta}^r = \hat{\beta}^{r-1} + \left[ I(\hat{\beta}^{r-1}) \right]^{-1} \frac{\partial l}{\partial \beta}(\hat{\beta}^{r-1}). \qquad (2.5)
\]
By explicitly evaluating the second term in Eq. (2.5), we obtain the iterative re-weighted least squares (IRWLS) algorithm:
\[
\hat{\beta}^r = (X^T W^{r-1} X)^{-1} X^T W^{r-1} Z^{r-1}, \qquad (2.6)
\]
where the ith elements of the adjusted response variable Z and the diagonal weight matrix W are
\[
Z_i^{r-1} = \hat{\eta}_i^{r-1} + (y_i - \hat{\mu}_i^{r-1}) \left[ \frac{\partial \eta_i}{\partial \mu_i}(\mu_i) \right]_{\mu_i = \hat{\mu}_i^{r-1}} \qquad (2.7)
\]
and
\[
(W_i^{r-1})^{-1} = \left[ \left( \frac{d\eta_i}{d\mu_i}(\mu_i) \right)^2 V_i(\mu_i) \right]_{\mu_i = \hat{\mu}_i^{r-1}}, \qquad (2.8)
\]
respectively. X is the usual design matrix (which includes the vector of 1's for the intercept). In short, the algorithm is a simple iteration of weighted least squares of Z on X with weights W. The derivation of the IRWLS algorithm is readily available in GLM textbooks; for example, see Section 6.2 of Hastie and Tibshirani (1986).

2.2.2 Quasi-likelihood

So far, we have assumed the distribution of the response variable is known. In practice, the distribution may not be known precisely. McCullagh (1983) showed that quasi-likelihood can be used to develop methods to estimate parameters without imposing a distributional assumption, as long as the relationships between the variance and the mean and between the mean and the parameters are specified. For the exponential family distributions, the log quasi-likelihood is defined as
\[
Q_i(\mu_i) = \int_{y_i}^{\mu_i} \frac{y_i - u}{\phi V(u)}\, du. \qquad (2.9)
\]
It turns out that the estimating equations for the parameters based on the maximum quasi-likelihood method are equal to Eq. (2.4). Thus, the parameters can be easily estimated using the aforementioned IRWLS algorithm.

2.3 GAM

If the relationship is not linear in Eq.
(2.1), then a nonparametric regression can be employed instead. A nonparametric regression offers more flexibility than a parametric regression, as it does not require a rigid form of dependence. However, its flexibility comes at a cost: it has well-known problems, such as the curse of dimensionality. The curse of dimensionality arises when the number of covariates, p, is large; the number of data points required to ensure feasibility of a multi-predictor smoother increases at an exponential rate. It can be overcome by assuming an additive form for the predictors. If the relationship between the predictors and the mean is not additive, then an additive model will work as an approximation to the true relationship. The additive version of a GLM is called a GAM. We simply replace \(\sum_{i=1}^{p} \beta_i X_i\) in Eq. (2.1) with a sum of univariate smooth functions \(f_i(X_i)\), \(i = 1, \ldots, p\):
\[
\eta = g(\mu) = \beta_0 + f_1(X_1) + \cdots + f_p(X_p),
\]
where β_0 is the intercept, and E(f_j) = 0 is assumed to ensure identifiability of β_0. An advantage of a GAM is that it is still interpretable: each f_j(X_j) can be plotted against its corresponding X_j, and the marginal relationship can be studied.

2.3.1 The backfitting algorithm

We now discuss a popular estimation method for a GAM. The backfitting algorithm (Hastie and Tibshirani, 1986) is widely used to estimate the smooth functions. In fact, it is a general approach to fit an additive regression model, including generalized additive partial linear models (GAPLMs), which will be introduced in Section 2.4. The algorithm is extremely simple and intuitive. For simplicity, first consider a simple additive model,
\[
Y = \beta_0 + \sum_{j=1}^{p} f_j(X_j) + \varepsilon,
\]
where we assume E(ε|X) = 0 and E(f_j) = 0 for all j. The second assumption is often imposed for identifiability of β_0.
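As a concrete preview, the backfitting idea for this additive model — cycle through the components, smoothing the partial residuals of each against its own covariate, then re-center — can be sketched in a few lines. This is our own illustration, not the thesis code; a crude k-nearest-neighbour running-mean smoother stands in for a proper smoother, and all names are ours:

```python
import numpy as np

# Minimal backfitting sketch for Y = b0 + f1(X1) + f2(X2) + eps.
def knn_smooth(x, r, k=41):
    """Crude running-mean smoother: average r over the k nearest x-values."""
    out = np.empty_like(r)
    for i, xi in enumerate(x):
        idx = np.argsort(np.abs(x - xi))[:k]
        out[i] = r[idx].mean()
    return out

def backfit(y, x1, x2, n_iter=20):
    b0 = y.mean()
    f1 = np.zeros_like(y)
    f2 = np.zeros_like(y)
    for _ in range(n_iter):
        f1 = knn_smooth(x1, y - b0 - f2)   # smooth partial residual
        f1 -= f1.mean()                    # centering step
        f2 = knn_smooth(x2, y - b0 - f1)
        f2 -= f2.mean()
    return b0, f1, f2

rng = np.random.default_rng(2)
n = 600
x1, x2 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
# True components: sin(x1) and a centered quadratic (2/3 = E[0.5 * x2^2]).
y = 1.0 + np.sin(x1) + 0.5 * x2**2 - 2 / 3 + rng.normal(0, 0.1, n)
b0, f1, f2 = backfit(y, x1, x2)
print(b0)  # intercept estimate
```

Plotting `f1` against `x1` would recover a sine-like curve, which is exactly the component-wise interpretability described above.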
The backfitting algorithm works as follows:

Algorithm 1 Backfitting algorithm for an AM
Initialization: Let r = 0, β_0 = mean(y), f_j^r = 0 for all j.
repeat
  r ← r + 1
  for j = 1 to p do
    Partial residual: R_j^r = y − β_0 − Σ_{k=1}^{j−1} f_k^r(X_k) − Σ_{k=j+1}^{p} f_k^{r−1}(X_k).
    Backfitting: f̃_j^r ← S_j(R_j^r; X_j), where S_j is a univariate smoother.
  end for
  for j = 1 to p do
    Centering: f_j^r ← f̃_j^r − mean[f̃_j^r(X_j)].
  end for
until convergence

In Algorithm 1, the univariate smoother S_j could in principle be different for each j, but in practice it is often chosen to be the same across j. Common smoothers are the locally weighted scatterplot smoothing (LOESS) of Cleveland and Devlin (1988) and penalized splines. In this thesis, LOESS is chosen for ease of implementation and computation.

We briefly explain LOESS; see Loader (2006) for more information. LOESS fits a local polynomial regression for each data point, given the span, the degree of the (local) polynomial regressions, and weights. The span is the proportion of data points that are considered in fitting a local polynomial regression for a chosen point. The span controls the bias-variance trade-off. If the span is too small, then a noisy fit is likely to be obtained due to insufficient data, resulting in high variability. If the span is too big, then the fit might be too smooth to fit the data well. Similarly, the degree of the local polynomial regression controls the bias-variance trade-off. A low-degree polynomial may not fit the data as well as a high-degree polynomial, resulting in a higher bias. On the other hand, if the degree is too high, then the fit will be susceptible to small noise in the data, resulting in high variability. In practice, the degree is usually set to either 1 or 2, and the span is varied to adjust for the bias-variance trade-off.
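The building block of LOESS just described — a local low-degree fit at a target point, using only the nearest span·n observations with kernel proximity weights — can be sketched as follows. This is our own illustration, not the thesis implementation; the function name, the tricube kernel choice, and the `prior_w` argument for extra observation weights are ours:

```python
import numpy as np

# One LOESS building block: a local degree-1 (linear) fit at a point x0,
# tricube-weighted over the span * n nearest neighbours, optionally
# multiplied by pre-specified observation weights prior_w.
def local_linear_fit(x, y, x0, span=0.3, prior_w=None):
    n = len(x)
    k = max(2, int(np.ceil(span * n)))
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]                 # points within the span
    h = d[idx].max()                        # local bandwidth
    w = (1 - (d[idx] / h) ** 3) ** 3        # tricube proximity weights
    if prior_w is not None:
        w = w * prior_w[idx]                # pre-specified weights
    # Weighted least squares of y on (1, x - x0); intercept = fit at x0.
    A = np.column_stack([np.ones(k), x[idx] - x0])
    coef, *_ = np.linalg.lstsq(A * w[:, None] ** 0.5,
                               y[idx] * w ** 0.5, rcond=None)
    return coef[0]

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 400))
y = np.sin(x) + rng.normal(0, 0.1, 400)
print(local_linear_fit(x, y, 5.0, span=0.15))  # fitted value near sin(5.0)
```

Increasing `span` here makes the fit smoother but more biased, which is the trade-off described above.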
Keep in mind that there may be two kinds of weights: weights for the proximity of the data points to the point of interest (given by a kernel of choice, such as the tricube) and pre-specified weights of the data points. We will see shortly that pre-specified weights are critical for a backfitting algorithm for a GAM: instead of an ordinary univariate smoother, we will use a weighted univariate smoother, analogous to weighted least squares in IRWLS. See Cleveland and Devlin (1988) for details and my LOESS ShinyApp (https://tylee.shinyapps.io/LOESS ShinyApp/) to visually explore the effects of degree, span, and pre-specified weights.

Furthermore, notice that the intercept is not updated throughout the algorithm. Considering the population version of the algorithm reveals the reason. The (population version of the) backfitting step for the intercept is
\[
\beta_0 = E_{(Y,X)}\Big(Y - \sum_j f_j(X_j)\Big) = E_{(Y,X)}\Big(Y - \sum_j \big(\tilde{f}_j(X_j) - E_{(Y,X)}(\tilde{f}_j(X_j))\big)\Big) = E_{(Y,X)}(Y), \qquad (2.10)
\]
because expectation is a linear operator. This illustrates that the intercept does not need to be updated as long as the f_j are centered to be 0.

2.3.2 Generalized local scoring algorithm (GLSA)

To estimate the f's in a GAM, we use the estimation techniques via the expected log-likelihood given in Hastie and Tibshirani (1986). One intuitive derivation is presented here; for other derivations, see Hastie and Tibshirani (1986).

We first define the space for η(X) = β_0 + f_1(X_1) + ... + f_p(X_p). A Hilbert space is a complete, inner-product equipped vector space. Let H_j be the Hilbert space of measurable functions h_j(X_j) with the inner product defined as
\[
\langle h_j(X_j), h_j'(X_j) \rangle := E_{X_j}\, h_j(X_j)\, h_j'(X_j)
\]
and the following properties:
- (centered) E_{X_j}(h_j(X_j)) = 0;
- (square integrable) E_{X_j}[(h_j(X_j))^2] < ∞.

Let H be the space of centered, square integrable functions of X_1, ..., X_p and H_add := H_1 + ... + H_p be a closed linear subspace of H.

Suppose that the conditional density of Y is g(Y, ψ), where ψ = ψ(X) ∈ H is a regression parameter.
We wish to choose an additive function η(X) ∈ H_add ⊂ H such that the expected log-likelihood
\[
E(l(\eta(X), Y)) \qquad (2.11)
\]
is maximized, where l(·) is the log-likelihood. Even when the true model does not lie in the space H_add, Stone (1986) shows that the best additive approximation η* to the true model η exists and is unique under some regularity conditions. The additive approximation can be thought of as the orthogonal projection of the true model η onto the space H_add. This is analogous to viewing a linear regression as the orthogonal projection of the response variable onto the space spanned by the covariates.

Now we derive the estimation procedure for the f's. Suppose the log-likelihood is strictly concave in η for every Y. Then the first and second conditional (on X) Bartlett identities are satisfied (Bartlett, 1953):
- E_{Y|X}(l'(η(X), Y)|X) = 0;
- |E_{Y|X}(l''(η(X), Y)|X)| < ∞,

respectively. As the log-likelihood is strictly concave, the unique maximizer η* exists. Hence, the necessary and sufficient condition for η to be the maximizer is that it is a stationary point, where a stationary point is obtained when the Fréchet directional derivative with respect to η, evaluated at that point, is zero for all feasible directions h ∈ H_add (see Chapter 7 of Luenberger (1997)). Let
\[
G(\eta) = E_{X,Y}(l(\eta(X), Y))
\]
and define, for any h ∈ H_j for any j = 1, . . .
, p, and real-valued α,
\[
H_h(\alpha) = G(\eta^* + \alpha h) = E_{X,Y}(l(\eta^*(X) + \alpha h(X_j), Y)).
\]
Then for a maximizer η*, the stationarity condition implies
\[
\left. \frac{dH_h(\alpha)}{d\alpha} \right|_{\alpha = 0} = 0.
\]
More explicitly, assuming l(·) is bounded in both X and Y, we use Lebesgue's dominated convergence theorem to exchange the integral and the derivative:
\[
H_h'(\alpha) = \frac{dH_h(\alpha)}{d\alpha} = E_{X,Y}(l'(\eta^*(X) + \alpha h(X_j), Y)\, h(X_j)).
\]
Notice that, with X_{−j} denoting X without the jth element,
\begin{align*}
0 = H_h'(0) &= E_{X,Y}(l'(\eta^*(X), Y)\, h(X_j)) \\
&= E_{X_j}[E_{Y, X_{-j}|X_j}(l'(\eta^*(X), Y)\,|\,X_j)\, h(X_j)] \quad \text{by the law of total expectation,} \\
&= E_{X_j}[h(X_j)\, E_{Y, X_{-j}|X_j}(l'(\eta^*(X), Y)\,|\,X_j)] \quad \text{since } h(X_j) \text{ is } \sigma(X_j)\text{-measurable.}
\end{align*}
Thus, we have
\[
E_{X_j}[h(X_j)\, E_{Y, X_{-j}|X_j}(l'(\eta^*(X), Y)\,|\,X_j)] = 0. \qquad (2.12)
\]
Define
\[
g(X_j) := E_{Y, X_{-j}|X_j}(l'(\eta^*(X), Y)\,|\,X_j),
\]
which is a measurable function of X_j. We show that g(X_j) ∈ H_j by proving E_{X_j} g(X_j) = 0 and E_{X_j} g(X_j)^2 < ∞. For the first part,
\begin{align*}
E_{X_j}\, g(X_j) &= E_{X_j} E_{Y, X_{-j}|X_j}(l'(\eta^*(X), Y)\,|\,X_j) \\
&= E_{Y,X}\, l'(\eta^*(X), Y) \quad \text{by the law of total expectation} \\
&= E_X E_{Y|X}(l'(\eta^*(X), Y)\,|\,X) \quad \text{by the law of total expectation} \\
&= E_X\, 0 = 0 \quad \text{by the first conditional Bartlett identity.}
\end{align*}
Next, we show that E_{X_j} g(X_j)^2 < ∞:
\begin{align*}
E_{X_j}(g(X_j)^2) &= E_{X_j}[E_{Y, X_{-j}|X_j}(l'(\eta^*(X), Y)\,|\,X_j)]^2 \\
&\le E_{X_j}[E_{Y, X_{-j}|X_j}(l'(\eta^*(X), Y)^2\,|\,X_j)] \quad \text{by the conditional Jensen's inequality} \\
&= E_{Y,X}(l'(\eta^*(X), Y))^2 \quad \text{by the law of total expectation} \\
&= -E_{Y,X}[l''(\eta^*(X), Y)] \quad \text{since } Y \text{ belongs to an exponential family} \\
&= -E_X E_{Y|X}(l''(\eta^*(X), Y)\,|\,X) \quad \text{by the law of total expectation} \\
&< \infty \quad \text{by the second conditional Bartlett identity.}
\end{align*}
Therefore, g(X_j) ∈ H_j. Taking h(X_j) = g(X_j) in Eq. (2.12), we obtain
\[
E_{X_j}[E_{Y, X_{-j}|X_j}(l'(\eta^*(X), Y)\,|\,X_j)]^2 = 0.
\]
This implies that the optimality condition for η is
\[
E_{Y, X_{-j}|X_j}(l'(\eta(X), Y)\,|\,X_j) = 0 \quad \text{a.s. outside P-null sets of } X_j
\]
for each j ∈ {1, . . . , p}. Notice that the condition is nonlinear in η.
We can use Newton's method to solve for η in an iterative fashion, given an initial guess with associated η_old, W_old, and Z_old:

E( ∂l/∂η_old + (∂²l/∂η²_old)(η − η_old) | X_j )
= E[ (−∂²l/∂η²_old)( η_old + (−∂²l/∂η²_old)^{−1}(∂l/∂η_old) − η ) | X_j ]
= E( W_old(X)(Z_old − η) | X_j ).

As η is linear in the f_j's, using the linearity of conditional expectation we obtain an equation for f_j:

f_j(X_j) = E( W_old(X)(Z_old − Σ_{k≠j} f_k(X_k)) | X_j ) / E( W_old(X) | X_j )

for each j. Algorithm 2 implements this approach. Note that the intercept needs to be updated, since the adjusted response variable Z and the weights W are updated in each iteration.

Algorithm 2 Generalized local scoring algorithm for a GAM
Initialization: Let r = 0 and, for example, β_0^r = g(mean(y)), f_j^r = 0 ∀j. Construct η^r, µ^r, Z^r, and W^r.
repeat
  r ← r + 1
  for j = 1 to p do
    Partial residual: R_j^r = Z^{r−1} − β_0^{r−1} − Σ_{k=1}^{j−1} f_k^r(X_k) − Σ_{k=j+1}^{p} f_k^{r−1}(X_k).
    Backfitting: f̃_j^r ← S_j(R_j^r; W^{r−1}), where S_j is a weighted univariate smoother.
  end for
  for j = 1 to p do
    Centering: f_j^r ← f̃_j^r − mean[f̃_j^r(X_j)].
  end for
  Update the intercept: β_0^r ← Σ_i W_i^{r−1}(Z_i^{r−1} − Σ_{j=1}^p f_j^r) / Σ_i W_i^{r−1}.
  Update η^r, µ^r, Z^r, W^r.
until convergence

2.4 GAPLM
A GAPLM is the combination of a GLM and a GAM, inheriting both the interpretability of a GLM and the flexibility of a GAM:

η = g(µ) = X^t β + Σ_{j=1}^q f_j(T_j),   (2.13)

where the intercept is embedded in β, X is the set of covariates for the parametric component, and {T_1, . . . , T_q} is the set of covariates for the nonparametric component.

2.4.1 Backfitting estimator
Notice that (2.13) is still additive, so the backfitting method can be applied. More explicitly, we alternate between estimating the nonparametric and parametric components until convergence; see Algorithm 3.

Algorithm 3 Backfitting algorithm for a GAPLM
Initialization: Let r = 0 and, for example, β_0^r = g(mean(y)), β_j = f_m^r = 0 ∀j, m. Construct η^r, µ^r, Z^r, and W^r, where Z^r and W^r are defined in Eqs.
(2.7) and (2.8), respectively.
repeat
  r ← r + 1
  Parametric residual: PR^r = Z^{r−1} − Xβ^{r−1}.
  for j = 1 to q do
    Partial residual: R_j^r = PR^r − Σ_{k=1}^{j−1} f_k^r(T_k) − Σ_{k=j+1}^{q} f_k^{r−1}(T_k).
    Backfitting: f̃_j^r ← S_j(R_j^r; W^{r−1}), where S_j is a weighted univariate smoother.
  end for
  for j = 1 to q do
    Centering: f_j^r ← f̃_j^r − mean[f̃_j^r(T_j)].
  end for
  Nonparametric residual: NR^r = Z^{r−1} − Σ_{m=1}^q f_m^r.
  β^r = (X^T W^{r−1} X)^{−1}(X^T W^{r−1} NR^r).
  Update: η^r, µ^r, Z^r, W^r.
until convergence

2.4.2 Speckman estimator
Speckman (1988) suggested an alternative approach that removes the effects of the nonparametric covariates from both the response and the parametric explanatory variables before estimating β. To illustrate his idea, consider an additive partial linear model:

Y = X^t β + f(T) + ε   (2.14)

with E(ε|X, T) = 0. Taking the conditional expectation with respect to T, we have

E(Y|T) = E(X^t β|T) + f(T) + E(ε|T).   (2.15)

Next, subtract Eq. (2.15) from Eq. (2.14):

Y − E(Y|T) = X^t β − E(X^t β|T) + (ε − E(ε|T)).   (2.16)

By defining Ỹ := Y − E(Y|T), X̃ := X − E(X|T), and ε̃ := ε − E(ε|T), Eq. (2.16) can be rewritten as

Ỹ = X̃^t β + ε̃,

where E(ε̃) = 0 by the law of total expectation. Ỹ and X̃ are Y and X with the effects of T removed. To obtain Ỹ and X̃, we need the conditional expectations given T. By assuming that both E(Y|T) and E(X|T) are additive functions of T, we can use the backfitting algorithm to estimate them with LOESS. For other approaches, see Härdle et al. (2012).

2.4.3 Generalized Speckman estimator
To extend Speckman's approach to a GAPLM (Pan, 2013), we simply modify the step for estimating β in the backfitting algorithm: in Eq. (2.16), we replace Y with Z as defined in Eq. (2.7) and use the weights W defined in Eq. (2.8). See Algorithm 4 for the implementation. A caveat arises when estimating β.
The first column of X̃ becomes 0, since the first column of X is constant, a vector of 1's. To be able to estimate the intercept, Pan (2013) adds the column means of X and the mean of Z back to X̃ and Z̃, respectively. Alternatively, one can hold out the first column when smoothing X and append it to X̃ when estimating β. The latter approach appears to be simpler, as it avoids unnecessary computation.

Algorithm 4 Generalized Speckman algorithm for a GAPLM
Initialization: Let r = 0 and, for example, β_0^r = g(mean(y)), β_j = f_m^r = 0 ∀j, m. Construct η^r, µ^r, Z^r, and W^r.
repeat
  r ← r + 1
  X̃_{−1} = X_{−1} − S(X_{−1} ∼ T; W^{r−1}), where X_{−1} is the design matrix without the column for the intercept.
  Z̃ = Z − S(Z ∼ T; W^{r−1}).
  Update: β^r = (X̃^T W^{r−1} X̃)^{−1} X̃^T W^{r−1} Z̃, where X̃ = (1 X̃_{−1}).
  Parametric residual: PR^r = Z^{r−1} − Xβ^r.
  for j = 1 to q do
    Partial residual: R_j^r = PR^r − Σ_{k=1}^{j−1} f_k^r(T_k) − Σ_{k=j+1}^{q} f_k^{r−1}(T_k).
    Backfitting: f̃_j^r ← S_j(R_j^r; W^{r−1}).
  end for
  for j = 1 to q do
    Centering: f_j^r ← f̃_j^r − mean[f̃_j^r(T_j)].
    Adjusting the intercept: β_0^r ← β_0^r + mean[f̃_j^r(T_j)].
  end for
  Update: η^r, µ^r, Z^r, W^r.
until convergence

The Speckman estimator has a major advantage over the backfitting estimator: it is more resistant to the problems caused by strong dependence between additive terms (Pan, 2013), known as concurvity (Buja et al., 1989). Concurvity is the additive analog of multicollinearity. In the presence of multicollinearity, the parameters of a multiple linear model are not well defined because the cross-product matrix X^T X is (nearly) singular. A similar problem arises for an additive model when one of the additive terms can be approximated by a combination of the others.
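Speckman's partialling-out idea from Section 2.4.2 can be sketched in a few lines. The following is a minimal illustration, not an implementation from the references: a crude k-nearest-neighbour running mean (`knn_smooth`, our name) stands in for the LOESS smoother, and the simulated model has one parametric covariate X correlated with T and one smooth term f(T) = sin(2πT).

```python
import numpy as np

def knn_smooth(t, y, k=40):
    """Estimate E(y | t) by averaging the ~k nearest neighbours in t
    (a crude stand-in for the LOESS smoother discussed in the text)."""
    order = np.argsort(t)
    ys = y[order]
    n = len(ys)
    fitted = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - k // 2), min(n, i + k // 2 + 1)
        fitted[i] = ys[lo:hi].mean()
    out = np.empty(n)
    out[order] = fitted          # map back to the original ordering
    return out

rng = np.random.default_rng(1)
n = 1000
T = rng.uniform(0, 1, n)
X = 0.8 * T + rng.normal(0, 0.2, n)            # parametric covariate, correlated with T
Y = 2.0 * X + np.sin(2 * np.pi * T) + rng.normal(0, 0.5, n)

# Naive OLS of Y on X ignores f(T) and is biased because X and T are correlated.
b_naive = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

# Speckman: remove the effect of T from both Y and X, then regress residuals.
X_t = X - knn_smooth(T, X)
Y_t = Y - knn_smooth(T, Y)
b_speckman = np.sum(X_t * Y_t) / np.sum(X_t * X_t)
```

Because cov(X, sin(2πT)) < 0 here, the naive slope is pulled well below the true value 2, while the partialled-out estimate is close to it.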
The Speckman estimator shows superior performance in terms of bias and variance in the presence of high concurvity in simulations; see the references mentioned in Section 2.3.2 of Pan (2013) for more details.

2.4.4 Connection to profile likelihood
Another estimator that produces results similar to the generalized Speckman estimator is the profile likelihood estimator. Given a multi-dimensional kernel K_h(·) with a bandwidth matrix h, the local log-likelihood is defined as

l_h(f(T), β; Y, X, T) = Σ_{i=1}^n K_h(T − T_i) l(g(X_i^t β + f(T_i)), Y_i),

where l(·) is the log-likelihood, g(·) is the link function, and K_h(T − T_i) are the local/kernel weights. Holding β fixed, l_h(·) is maximized with respect to f(t). Then, using the estimated f̂(t), β is estimated by maximizing l_h. This two-step process is iterated until convergence. It has been shown that the profile likelihood estimator and the Speckman estimator coincide for APLMs when the kernel smoother used in the Speckman estimator and the kernel function are the same (Härdle et al., 2012). For GAPLMs, they produce similar results when the bandwidth is small or when the nonparametric component, f(T_i), is small in magnitude compared to the parametric component, X_i^t β. Using Nadaraya-Watson type kernel smoothing, Müller (2001) conducted a simulation study which showed that the Speckman estimator performs well for small sample sizes and gives results similar to those of the profile likelihood estimator for large sample sizes.

Chapter 3
PHIDO
In this chapter, we describe the method developed by Rick White for detecting disease outbreaks. His draft paper (White, 2000b) and his presentation, along with the code, were the only sources available due to his unexpected death in September 2016. It proved extremely difficult to follow his sharp thinking. We refer to his method as PHIDO.
Recall from the introduction that the characteristics of diseases vary greatly.
Hence, one objective of a disease outbreak detection method is to account for those variations, such as seasonality and long-term trends. The second objective is to identify outliers, i.e., disease counts that are higher than expected. The last objective is to provide an alert or severity level of an outbreak, where an outbreak is defined as an outlier or a series of outliers.
How do we identify outliers? One approach is to set a certain threshold such that disease counts above the threshold are classified as outliers. However, the threshold needs to vary among diseases and may change over time. Instead of manually setting a threshold for each of more than 100 disease series, we can use the data to decide which disease counts are outliers.
There is a branch of statistics, called robust statistics, which has been developed to deal with outliers. For more information on robust statistics, see Section 4.1. Viewing an outbreak as a set of outliers, White applied ad-hoc robust statistics to fit a robust curve to the data so that the curve is not affected by the outliers. Then, using the robust curve, he computed a test statistic on the counts of the last four weeks to provide an alert level of a disease outbreak (public health experts use the last four weeks for assessment). Note that PHIDO is designed to be applied to each disease dataset separately. In other words, it does not consider correlation among diseases or spatial correlation among counts for different health regions.
White assumes the following set-up, which we shall adopt for the rest of this report.
• (A1) The data are structured as a time series of weekly (or some other frequency) counts with a sample size of n: Y_t, t = 1, . . . , n.
• (A2) The conditional distribution of Y_t belongs to an exponential family. In particular, it is specified by a mean function λ(t) and variance function σ²λ(t). Here, σ² is called the dispersion parameter, which is assumed to stay constant over time t.
The reciprocal of it is called the dispersion index.
• (A3) The covariates are time covariates, X_t, that account for a long-term trend and seasonality. See Eq. (3.4) for the time covariates used in PHIDO.

λ(t) = E(Y_t).   (3.1)
σ²λ(t) = var(Y_t).   (3.2)

Recall that (A2) is sufficient to justify the condition for using quasi-likelihood estimation (see Section 2.2.2). The method can be broken down into two parts: 1) model fitting and 2) alert score calculation.

3.1 Model fitting
Unlike most parametric/semi-parametric model fitting procedures, where we specify a distribution of some kind, such as the Poisson or Negative Binomial, White only assumes that the conditional distribution of the response variable belongs to an exponential family and uses quasi-likelihood estimation under assumption (A2) to estimate the parameters. Thus, no prior knowledge of the specific conditional distribution of Y_t is required. Based on the estimated dispersion parameter σ² = var(Y_t)/λ(t), he assigns a (conditional) distribution for Y_t. If σ² = 1, the conditional distribution is a Poisson (POIS) distribution, whose mean and variance are equal, with λ(t) as its mean at time t. If σ² > 1, the ratio of the conditional variance of Y_t to the conditional mean λ(t) is greater than 1, implying that the data points are overdispersed with respect to a Poisson distribution. In such a case, a Negative Binomial (NB) distribution can be used as an alternative; it has an extra parameter to control the variance independently of the mean. Finally, if σ² < 1, then the conditional variance-mean ratio is less than 1, implying that the data points are underdispersed. In this case, a Binomial (Bin) distribution is used. In summary,

Y_t | X_t ∼ NB( r(t) = λ(t)/(σ²−1), p = 1/σ² )   if σ² > 1,
          POIS( λ(t) )                            if σ² = 1,
          Bin( N(t) = λ(t)/(1−σ²), p = 1−σ² )     if σ² < 1.   (3.3)

In practice, σ² is unlikely to be estimated as exactly 1. Thus, the conditional distribution will almost always be either NB or Bin.
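The mapping in Eq. (3.3) is straightforward to express in code. Below is a minimal sketch (function and variable names are ours, not White's), which, given λ(t) and the estimated dispersion σ², returns the implied distribution and its parameters, together with the implied mean and variance:

```python
def white_distribution(lam, sigma2):
    """Map (lambda(t), sigma^2) to the distribution chosen in Eq. (3.3).
    Returns (name, parameters, implied mean, implied variance)."""
    if sigma2 > 1:                      # overdispersed: Negative Binomial
        r, p = lam / (sigma2 - 1), 1 / sigma2
        return "NB", {"r": r, "p": p}, r * (1 - p) / p, r * (1 - p) / p**2
    if sigma2 == 1:                     # equidispersed: Poisson
        return "POIS", {"lam": lam}, lam, lam
    # underdispersed: Binomial (N(t) need not be an integer; see Section 3.1.1)
    N, p = lam / (1 - sigma2), 1 - sigma2
    return "Bin", {"N": N, "p": p}, N * p, N * p * (1 - p)
```

In every branch the implied mean is λ(t) and the implied variance is σ²λ(t), which is exactly what assumption (A2) requires.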
One may realize that the above parametrization is not standard: here λ(t) is the function of X_t, whereas the standard parameterization keeps the number of people under consideration (N or r) fixed and models the probability p of having the disease at time t. For each of Bin and NB, we present the standard parametrization obtained by assuming the corresponding distribution before fitting, and we compare it with White's parametrization.

3.1.1 Binomial
When Y_t is assumed to follow a Bin(N, p) distribution before fitting, the standard parameterization is to use a logit (canonical) link and model the probability, p, as a function of the covariates. Here we treat N as the total number of people under consideration at time t. This information is readily available in Canada through the Canadian Census.
The parameterization for Bin in (3.3) is obtained as follows. Using assumption (A2), we have

λ(t) = E(Y_t) = Np,
σ²λ(t) = var(Y_t) = Np(1−p).

Dividing the second equation by the first, we obtain σ² = 1−p, or p = 1−σ², and substituting p into the first equation yields the parametrization in Eq. (3.3). Since σ² is constant, p is forced to be constant; in turn, N must be a function of t. In the standard parametrization, by contrast, the number of people is readily available, so the focus can be placed on estimating the unknown probability p, which is very likely to change over time. On a side note, to compute probabilities when N(t) is not an integer, the beta distribution is used instead by invoking the following relationship:

P(Y_t ≤ y) = P(Z_t ≤ 1−p), where Z_t ∼ Beta(N(t)−y, y+1).

3.1.2 Negative Binomial
When Y_t is assumed to follow an NB distribution, the standard parameterizations for NB are NB1 (Cameron and Trivedi, 1998) and NB2 (Lawless, 1987).
The probability mass function of NB(τ, ζ) is

P(Y = y) = f(y; τ = τ(x), ζ = ζ(x)) = [Γ(τ+y) / (Γ(τ) y!)] · ζ^y / (1+ζ)^{τ+y},   τ > 0, ζ > 0, y = 0, 1, 2, . . . ,

with

E(Y) = τζ,
var(Y) = τζ(1+ζ).

NB1: τ = µγ^{−1} and ζ = γ.
NB2: τ = γ^{−1} and ζ = µγ.

For NB2, the (conditional) variance is

τζ(1+ζ) = γ^{−1}µγ(1+µγ) = µ(1+µγ).

The parameterization for NB in (3.3) is obtained as follows. Using assumption (A2), we have

λ(t) = E(Y_t) = r(t)(1−p)/p,
σ²λ(t) = var(Y_t) = r(t)(1−p)/p².

Dividing the second equation by the first gives σ² = 1/p, and after some basic algebra this yields the desired parameterization in Eq. (3.3). Again, the probability is fixed here.
To meet the first objective of a disease outbreak detection method, the mean function is assumed to be of the form

log(λ(t)) = α + βt + f(t) + c(t_c, t_s),   (3.4)

where t_c = cos(14πt/365.2425) and t_s = sin(14πt/365.2425). The first two terms and the last two terms correspond to the parametric and nonparametric components, respectively. The parametric component simply captures a linear trend; f(t) captures a long-term trend, while c(t_c, t_s) captures a seasonal trend. Inclusion of βt, f(t), and c(t_c, t_s) is decided as follows:
• c(·) is excluded if there are fewer than two years of data or fewer than 10 non-zero counts per year on average.
• f(t) and βt are excluded if there are fewer than 10 non-zero counts in the data.
A GAPLM fit is obtained with the backfitting algorithm as implemented in the library gam in R (R Core Team, 2018). LOESS is chosen as the smoother to estimate the nonparametric functions. The spans for c and f are determined by the following rule:
• the span for c is always 1 (i.e.
the whole data).
• the span for f is the smallest proportion of the data such that the (span × n) nearest neighbours of each point y_t contain at least two years of data and at least 10 non-zero counts.
The mean-covariate relationship is cleverly chosen to avoid the concurvity issue; if the seasonality term had been broken down into two additive terms, as in c(t_c, t_s) := c_1(t_c) + c_2(t_s), then it would have run into the concurvity issue, as noted in the help page of the concurvity function in the package mgcv, developed and maintained by Simon Wood: "it [concurvity] tends to be an issue in models including a smooth of time, along with smooths of other time varying covariates." More on concurvity issues with time series can be found in Dominici et al. (2002) and Ramsay et al. (2003).
Next, we discuss how the second objective of identifying outliers is accomplished. The approach White takes is to iteratively down-weight outliers. Standardized residuals or deviances from the fitted curve are used to assign weights to the corresponding data points; ideally, outliers would give large residuals and would therefore be assigned low weights. The model is then re-fitted with the weights, and this procedure is iterated until the weights converge.
An iterative approach depends on its initialization; a poor initialization can result in a non-optimal solution. White calculates the initial weights in a sensible but heuristic way. The weight of each data point is initially set to 1, and the data points are ordered: y_(k). The kth data point is defined to be a noticeable jump if y_(k) > 2 and y_(k−1) = 0, OR y_(k) > 2y_(k−1). We can set all the (ranked) data points that have a zero count to 1 for computing the initial weights and use the latter rule only. The initial weights are changed only for those in the lower (0-25th) or upper (75th-100th) quantiles that contain a noticeable jump. At each jump j, the jth initial weight is changed to 2y_(j−1)/y_(j).
Next, starting at the lower quartile and proceeding to the first data point, we set w_new,(k−1) = w_(k−1) w_(k). Similarly, starting at the upper quartile and proceeding to the last data point, we set w_new,(k+1) = w_(k+1) w_(k). Finally, all the initial weights are squared. The choice of 2 in the definition of a noticeable jump, as well as the choice of the first and last quartiles for changing the initial weights, is not supported or documented by White.
With the initial weights, the model is fitted by the function gam in the R library gam:

m <- gam(model, family = quasi(link = "log", variance = "mu"), weights = w),

where m is the fitted gam object. The next crucial step is to estimate the dispersion parameter σ², by which the conditional distribution of Y is chosen based on Eq. (3.3). If the lower quartile of Y is not zero, the dispersion parameter is calculated robustly by scaleTau2 in robustbase using the deviance residuals. Note that scaleTau2 is a general method that computes a robust estimate of scale as proposed by Maronna and Zamar (2002). If the lower quartile is zero, the dispersion estimate is based non-robustly on the Pearson residuals from m. It is not clear why different types of residuals are used, given that the dispersion estimate can differ significantly depending on them. Furthermore, the dispersion parameter is estimated non-robustly by the method of moments (gam calls glm::glm, and the method-of-moments estimator is used for the estimation of the scale parameter by glm). If it were estimated with MLE instead, at least 10% more efficiency could be achieved (Lawless, 1987).
Having decided the distribution, the weights need to be updated, as the approach is to iteratively update the weights until convergence.
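Before turning to the weight updates, our reading of the initial-weight heuristic described above can be sketched as follows. This is a reconstruction from the description, not White's code; in particular, the exact quartile bookkeeping is our interpretation.

```python
def initial_weights(y_sorted):
    """Heuristic initial weights for the ordered counts y_(1) <= ... <= y_(n).

    A point is a 'noticeable jump' if y_(k) > 2 * y_(k-1), with zero counts
    treated as 1; jumps are only acted on in the bottom and top quartiles.
    """
    n = len(y_sorted)
    y = [max(v, 1) for v in y_sorted]           # treat zero counts as 1
    w = [1.0] * n
    q1, q3 = n // 4, (3 * n) // 4
    for k in range(1, n):
        in_tail = k < q1 or k >= q3
        if in_tail and y[k] > 2 * y[k - 1]:     # noticeable jump
            w[k] = 2 * y[k - 1] / y[k]          # down-weight the jump point
    # propagate multiplicatively outward from the quartiles
    for k in range(q1, 0, -1):                  # lower quartile -> first point
        w[k - 1] = w[k - 1] * w[k]
    for k in range(q3, n - 1):                  # upper quartile -> last point
        w[k + 1] = w[k + 1] * w[k]
    return [wi ** 2 for wi in w]                # finally, square the weights
```

For instance, a top-ranked count that jumps from 8 to 40 receives the weight (2·8/40)² = 0.16, while points without a noticeable jump keep weight 1.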
To update the weights, three measures are calculated based on the standardized (weighted) deviance residuals, d(t), for each time t.
• P1(t): the probability that the order/rank statistic of the observations is less than or equal to d(t), multiplied by a factor of 2, with 1 then subtracted so that it ranges from −1 to 1. It measures how extreme d(t) is compared to the other standardized weighted deviance residuals.
• P2(t): the probability that a standard normal random variable is less than or equal to the weighted deviance residual, multiplied by a factor of 2, with 1 then subtracted so that it ranges from −1 to 1. This measures how extreme the weighted deviance residual is.
• P3(t): the product of P1 and P2. Note that it is positive if P1 and P2 have the same sign.
The weight of y_t is changed by a factor of 10(1 − P3(t)) if P3(t) is greater than c_W, where c_W is set to 0.9. The parameter c_W can be viewed as a tuning parameter that controls the level of robustness. The choice of 0.9 appears not to have been tested, based on the documentation. The model is re-fitted with these new weights until either the maximum number of iterations is reached or the weights converge in L1 norm.

3.2 Alert score calculation
With the final model, an ad-hoc test statistic is computed to determine whether there is an outbreak over the period of the last four weeks/observations in the data. First, White only considers the times for which y_t − λ̂_t has the same sign as y_T − λ̂_T. The backward cumulative counts are computed: y*_t = Σ_{i=t}^T y_i. Since the dispersion parameter is fixed over time, Y*_t ∼ F(Λ̂_t = Σ_{i=t}^T λ̂_i, σ̂²), where F is determined based on σ̂² as described in Section 3.1. For each of the last four weeks, π(t) = P(Y*_t > y*_t) is computed. If the last count y_T is greater than or equal to the estimated mean count λ̂_T, then he chooses the time t* such that π(t*) is the smallest.
In other words, he chooses the time at which it is most unlikely to observe a count bigger than the realized cumulative count. Otherwise, he chooses the time with the highest probability. The log odds ratio is used as a test statistic to reflect the high intensity of the outbreak (this is incorrectly documented in White (2000b), but implemented in R as):

z1 = log10( (1 − π(t*)) / π(t*) ).

The implementation is sensible, since the test statistic decreases in π(t*), as expected.
On the other hand, an outbreak of low intensity may not be apparent enough to cause an alarm over a period of only four weeks. To determine whether such a low but increasing trend is occurring, the rate of change of the long-term trend, f(t), is used. Based on this rate, White computes the second test statistic:

z2 = ( δ(n) − median_{t∈[2,n]}(δ(t)) ) / σ_τ,

where δ(t) = f(t) − f(t−1) and σ_τ is a τ-estimate of scale. The final alert score is a linear combination of the two test statistics:

z = (3z1 + z2) / √10,

where z appears to follow the standard normal distribution when the estimated mean function is not close to zero, according to a simulation study. This study has been replicated successfully. As the final score follows the standard normal distribution, it can easily be translated into an (upper-tail) p-value. Public health experts/staff can use it to assess the level of an outbreak. For practical purposes, 1.645 (95th quantile), 2.326 (99th quantile), and 3.090 (99.9th quantile) can be used to provide a low, medium, and high alert, respectively. Moreover, an upper-tail p-value for each data point can be provided based on the fitted model.

Figure 3.1: An example of the current PHIDO algorithm illustrates that some of the data points that are close to the fitted robust curve are classified as high and low alert points.

We will see in a simulation study in Chapter 5 that PHIDO does not down-weight outliers properly. As a result, PHIDO does not identify outliers well.
An example in the PHIDO manual (Figure 3.1) already somewhat illustrates this problem: data points that are very close to the fitted robust curve are given a high or low level of alert. Thus, the test statistic z would not be reliable, though it errs in a reassuring direction: it seems to produce false alerts (i.e., it falsely identifies normal counts as outliers).

3.3 Remark
It was realized near the end of the project that the current algorithm used by the BC CDC (call it CDC-PHIDO) differs slightly from PHIDO. A small empirical investigation revealed that the fitting method does not differ, but the calculation of the alert scores does. Thus, in this report, PHIDO was used for fitting, and then CDC-PHIDO was used to calculate alert scores.

Chapter 4
Three robust methods to fit a GAPLM and outbreak detection
In the last chapter, we saw that PHIDO lacks justification for how it robustly down-weights observations, and we noted the problems related to this. Moreover, it is not clear how PHIDO copes with different degrees of outliers. To mitigate these issues, we introduce three robust fitting methods for GAPLMs based on the recent literature: Pan (2013) and Boente et al. (2017). We refer to them as fitting methods because no work on inference has been accomplished. See the discussion in Alimadad and Salibian-Barrera (2011) for difficulties regarding the asymptotic analysis of these estimators.
However, there have been robust procedures with asymptotic analysis for simpler models. Bianco and Yohai (1996) propose robust and Fisher-consistent M-estimators for logistic regression models and prove that the estimators are consistent and asymptotically normal; Croux and Haesbroeck (2003) present a fast and stable algorithm to compute a robust estimator for the logistic regression model. Cantoni and Ronchetti (2001) extend M-estimators to generalized linear models based on quasi-likelihood.
Bianco et al. (2005) propose robust MM-estimators for linear models with error terms belonging to the exponential family and show their asymptotic normality. Boente et al. (2006) introduce robust estimators for a generalized partly linear model (i.e., a generalized partially linear model with one nonparametric component), Y_i | (X_i, T_i) ∼ F(·, µ_i) with µ_i = g(η(T_i) + X_i^t β), where F is a distribution in the exponential family, and show the asymptotic normality of the β estimators.
For hypothesis testing, Bianco and Martínez (2009) propose robust tests for the regression (β) parameter under a logistic model, and Boente and Rodriguez (2010) propose robust tests for the same regression parameter under a generalized partly linear model. Boente and Pardo-Fernández (2016) present robust tests for choosing between a generalized linear model and a generalized partly linear model.
Lastly, returning to our second objective of computing an alert score for a disease outbreak based on the last four disease counts, we simply flag individual counts based on the probability of observing a value greater than the observed count under the fitted distribution. Future work on detecting a disease outbreak is discussed in the last chapter.

4.1 Preliminary: robust linear regression
To begin with, we give a brief introduction to robust linear regression. Consider a simple linear regression model:

Y = X^t β + ε,   ε ∼ N(0, σ²),

where the goal is to find β under the condition that σ² is known. Maximum likelihood estimation of β is equivalent to minimizing the sum of the squared residuals:

min_β Σ_{i=1}^n ρ_LS(r_i),   (4.1)

where r_i = y_i − x_i^t β and ρ_LS(r_i) = r_i². It is well known that the LS/MLE estimate is heavily influenced by the presence of outliers (Maronna et al., 2018). In other words, one abnormal data point can lead to significantly misleading interpretations and results.
The idea is to control the possible detrimental effect of such abnormal points by using different ρ-functions in Eq. (4.1).
A class of regression M-estimates is given by minimizing

(1/n) Σ_{i=1}^n ρ_c( r_i(β)/σ ),

where a ρ-function satisfies the following properties (Maronna et al., 2018):
1. ρ(x) is a non-decreasing function of |x|.
2. ρ(0) = 0.
3. ρ(x) is increasing for x > 0 such that ρ(x) < ρ(∞).
4. If ρ is bounded, it is also assumed that ρ(∞) = 1.
Equivalently, the first-order optimality condition implies that the optimal β satisfies the following equations:

(1/n) Σ_{i=1}^n ψ_c( r_i(β)/σ ) x_i = 0,   (4.2)

where ψ_c is the derivative of ρ_c. The subscript c is an important tuning parameter that controls the impact of outliers. In the case of the Huber loss function, the standardized residuals that are greater than c in absolute value are clipped at c in Eq. (4.2), thereby managing the impact of large standardized residuals, i.e., of outliers. As c goes to ∞, ψ_c approaches the identity and we recover least squares, so the standardized residuals contribute equally. If c is too low, non-outliers may be undesirably down-weighted. Two examples of ρ-functions that are used in this thesis are Huber's and Tukey's bi-weight (or bi-square):

ρ_c^Huber(x) = x²               if |x| ≤ c;  2c|x| − c²   if |x| > c.   (4.3)
ψ_c^Huber(x) = x                if |x| ≤ c;  sgn(x)c      if |x| > c.   (4.4)
ρ_c^Tukey(x) = 1 − [1 − (x/c)²]³ if |x| ≤ c;  1            if |x| > c.   (4.5)
ψ_c^Tukey(x) = x[1 − (x/c)²]²    if |x| ≤ c;  0            if |x| > c.   (4.6)

Solving for β in Eq. (4.2) reveals how the standardized residuals are controlled through the different ρ_c-functions. We manipulate Eq. (4.2) by multiplying and dividing by r̃_i = r_i/σ:

(1/n) Σ_{i=1}^n [ψ_c(r̃_i(β)) / r̃_i(β)] r̃_i(β) x_i = (1/n) Σ_{i=1}^n W(r̃_i(β)) r̃_i(β) x_i = 0,   (4.7)

where W(·) is the weight function,

W(x) = ψ(x)/x if x ≠ 0;  ψ′(0) if x = 0.

The corresponding weight functions for Huber and Tukey's bi-weight are:

W^Huber(x) = 1               if |x| ≤ c;  c/|x|  if |x| > c.   (4.8)
W^Tukey(x) = (1 − (x/c)²)²   if |x| ≤ c;  0      if |x| > c.   (4.9)

In addition, this yields an iterative algorithm to solve for β; the algorithm is called Iteratively Re-Weighted Least Squares (IRWLS).
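The IRWLS idea can be sketched in a few lines, here with the Huber weights of Eq. (4.8). This is a minimal illustration under our own naming, with the scale σ re-estimated in each step by the normalized MAD rather than assumed known:

```python
import numpy as np

def huber_weight(x, c=1.345):
    """Huber weight function W(x) of Eq. (4.8)."""
    ax = np.abs(x)
    return np.where(ax <= c, 1.0, c / np.maximum(ax, 1e-12))

def irwls(X, y, c=1.345, n_iter=50):
    """Regression M-estimate via iteratively re-weighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]               # OLS start
    for _ in range(n_iter):
        r = y - X @ beta
        sigma = 1.4826 * np.median(np.abs(r - np.median(r)))  # MAD scale
        w = huber_weight(r / sigma, c)
        WX = X * w[:, None]                                   # weighted design
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)            # weighted LS step
    return beta

# Demonstration: 10% gross outliers pull OLS away; the M-estimate resists.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
y[:20] += 15.0                                                # contamination
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_rob = irwls(X, y)
```

The contaminated responses inflate the OLS intercept by roughly 0.1 × 15, while the Huber weights cap each outlier's contribution, leaving the robust fit near the true (1, 2).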
Notice that the arguments to these functions are standardized (equivalently, unitless) and hence scale-invariant. Consequently, c does not have to be rescaled whenever y is rescaled. The properties of ρ imply that ψ is odd and that ψ(x) ≥ 0 for x ≥ 0.
We briefly compare the OLS, Huber, and Tukey's bi-weight functions; see Fig. 4.1.
• OLS vs. Huber: when the absolute value of x is greater than c, ρ increases linearly rather than quadratically as with the squared loss function. Hence, the rate of change, the ψ-function, is flat past |x| = c. Thus, unlike the OLS weight function, which stays constant at 1, the Huber weight function decays beyond |x| = c.
• Huber vs. Tukey: the major difference is that Tukey's bi-weight ρ-function is bounded. The re-descending property of Tukey's ψ comes from this. Tukey's ψ starts from 0 at x = 0, goes either up or down depending on the sign of x as |x| increases, and returns to 0 at |x| = c. This implies that extreme outliers are essentially given a weight of 0 in Eq. (4.7), whereas they are penalized less severely by Huber's loss function.

Figure 4.1: OLS, Huber's, and Tukey's bi-weight ρ, ψ, and W functions for c = 3.

4.2 Robust approaches to fit a GAPLM
We discuss and derive three robust methods to fit a GAPLM. The first method is proposed by Pan (2013), and we refer to it as YP. In addition, using the derivation in Aeberhard et al. (2014), we present the second method, an improved version of YP, which we refer to as WA. Finally, we propose the third method based on Boente et al. (2017), and we refer to it as TL.
The names of the three methods come from taking the first and last letters of the corresponding main authors' names.

4.2.1 The first approach: YP and WA
Cantoni and Ronchetti (2001) proposed a class of robust M-estimators for GLMs based on a robust quasi-likelihood (consider the usual GLM setting and notation from Chapter 2). The corresponding estimating equations are

(1/n) Σ_{i=1}^n ψ(r_i/s_i)(∂µ_i/∂β) − α_i(β) = 0,   (4.10)

where
• r_i = y_i − µ_i;
• s_i = √(a(φ)V(µ_i)) are the square roots of the variances;
• α_i = E(ψ(r_i/s_i) ∂µ_i/∂β) are correction terms to ensure Fisher consistency (i.e., unbiasedness of Eq. (4.10)).
To solve Eq. (4.10), we present the robust IRWLS algorithm. Then, using the robust adjusted response and weight variables from the robust IRWLS algorithm, we extend it to GAMs (Alimadad and Salibian-Barrera, 2011), resulting in the robust GLSA algorithm. Then, using both robust algorithms, we extend to GAPLMs with the Speckman or backfitting approach, as described in Section 2.4 (Pan, 2013).
Recall that the robust M-estimator for β is the solution of the estimating equations:

0 = U(β) := Σ_{i=1}^n Ψ_c(β; y_i, x_i)   (4.11)
= Σ_{i=1}^n [ψ_c(r_i)/s_i(µ_i)] (∂µ_i/∂β) − α_i(β)   (4.12)
= Σ_{i=1}^n [ψ_c(r_i)/s_i(µ_i)] (∂µ_i/∂η_i) x_i − α_i(β)   (4.13)
= Σ_{i=1}^n [ψ_c(r_i) − E(ψ_c(r_i))] (1/s_i(µ_i)) (∂µ_i/∂η_i) x_i,   (4.14)

where
• r_i = (y_i − µ_i)/s_i are the standardized (unitless) residuals;
• c ∈ R_+ is the tuning constant;
• α_i(β) = E_{Y_i|x_i}[(ψ_c(r_i)/s_i(µ_i))(∂µ_i/∂η_i) x_i] are the correction terms introduced to ensure Fisher consistency (i.e., E(U(β)) = 0).
Hereinafter, we assume that the expectations are always taken with respect to Y_i given X_i. We define

h_i := ψ_c(r_i) − E(ψ_c(r_i)).

Using the Fisher scoring algorithm, we have the following updating procedure for the estimate of β from the rth step to the (r+1)th step:

I(β̂^r) β̂^{r+1} = I(β̂^r) β̂^r + U(β̂^r),   (4.15)

where

I(β̂) = −Σ_{i=1}^n E( ∂Ψ_c(β; Y_i, x_i)/∂β^T ) |_{β=β̂}

is the Fisher information matrix. Pan (2013) and Aeberhard et al.
(2014) compute I(β) differently (for GLMs), resulting in two different sets of adjusted response and weight variables and hence two different IRWLS algorithms.
Alimadad and Salibian-Barrera (2011) and Pan (2013) compute I(β) as follows:

−I(β)β = Σ_{i=1}^n E(∂h_i/∂β)(1/s_i)(∂µ_i/∂η_i) x_i^T β + E( h_i (∂s_i^{−1}/∂β)(∂µ_i/∂η_i) x_i^T β )   (4.16)
= Σ_{i=1}^n E(∂h_i/∂β)(1/s_i)(∂µ_i/∂η_i) x_i^T β + (∂s_i^{−1}/∂β)(∂µ_i/∂η_i) x_i^T β E(h_i)   (4.17)
= Σ_{i=1}^n E(∂h_i/∂β)(1/s_i)(∂µ_i/∂η_i) x_i^T β,   (4.18)

since E(h_i) = 0. Next, we explicitly expand E(∂h_i/∂β):

−∂h_i/∂β = −ψ′_c(r_i)(∂r_i/∂β) + ∂E(ψ_c(r_i))/∂β   (4.19)
= −ψ′_c(r_i) ∂[(y_i − µ_i(β))/s_i(β)]/∂β + ∂E(ψ_c(r_i))/∂β   (4.20)
= −ψ′_c(r_i)( −(∂µ_i/∂η_i)(∂η_i/∂β)(1/s_i) − ((y_i − µ_i)/s_i²)(∂s_i/∂η_i)(∂η_i/∂β) ) + (∂E(ψ_c(r_i))/∂η_i)(∂η_i/∂β)   (4.21)
= [ ψ′_c(r_i)( (∂µ_i/∂η_i)(1/s_i) + (r_i/s_i)(∂s_i/∂η_i) ) + ∂E(ψ_c(r_i))/∂η_i ] (∂η_i/∂β)   (4.22)
= [ ψ′_c(r_i)( (∂µ_i/∂η_i)(1/s_i) + (r_i/s_i)(∂s_i/∂η_i) ) + ∂E(ψ_c(r_i))/∂η_i ] x_i.   (4.23)

By defining

l_i := E(ψ′_c(r_i))(∂µ_i/∂η_i)(1/s_i) + E(ψ′_c(r_i) r_i)(1/s_i)(∂s_i/∂η_i) + ∂E(ψ_c(r_i))/∂η_i,   (4.25)

the expectation of ∂h_i/∂β is

E(∂h_i/∂β) = −l_i x_i.

Note that this requires the computation of E(ψ′_c(r_i)), E(ψ′_c(r_i) r_i), and ∂E(ψ_c(r_i))/∂η_i. The first two terms can be computed explicitly in many cases. The third one, on the other hand, is numerically approximated by a first-order divided difference. Finally, I(β) is

I(β) = Σ_{i=1}^n x_i (l_i/s_i)(∂µ_i/∂η_i) x_i^T   (4.26)
= X^T W_YP(β) X,   (4.27)

where

W_YP(β) = diag(w_i),   w_i = (l_i/s_i)(∂µ_i/∂η_i).

Rewriting the right-hand side of Eq. (4.15), we derive the robust IRWLS algorithm:

I(β̂)β̂ + U(β̂) = Σ_{i=1}^n [ l_i η_i (1/s_i)(∂µ_i/∂η_i) + h_i (1/s_i)(∂µ_i/∂η_i) ] x_i   (4.28)
= Σ_{i=1}^n w_i (η_i + h_i/l_i) x_i   (4.29)
= X^T W_YP Z_YP,   (4.30)

where z_i = η_i + h_i/l_i. Finally, we obtain the robust IRWLS algorithm:

β̂^{r+1} = (X^T W_YP(β̂^r) X)^{−1} X^T W_YP(β̂^r) Z_YP(β̂^r).

Aeberhard et al. (2014) compute I(β) differently from Pan by using theoretical results on M-estimators for GLMs. WA has a couple of advantages over YP. First, WA requires the computation of fewer terms. Second, it does not involve numerical approximations for the distributions we use in this thesis, the NB and Poisson distributions.
We provide a non-rigorous derivation of WA. For a rigorous derivation, see Chapter 4 of Hampel et al. (2011).
By Fisher consistency, we have E(Ψ_c(β; Y, x)) = 0. Under regularity conditions that allow differentiation and integration to be exchanged (for example, by the dominated convergence theorem when Ψ_c(β; Y, x) is bounded), we have

\[
\frac{\partial E(\Psi_c(\beta;Y,x))}{\partial\beta}
= E\!\left(\frac{\partial\Psi_c(\beta;Y,x)}{\partial\beta}\right)
= \int\left(\frac{\partial\Psi_c(\beta;y,x)}{\partial\beta}f_{Y|x}(y;x,\beta)+\Psi_c(\beta;y,x)\frac{\partial f_{Y|x}(y;x,\beta)}{\partial\beta}\right)dy = 0.
\]

This implies

\[
E(\Psi_c'(\beta;Y,x)) = -\int\Psi_c(\beta;y,x)\frac{f_{Y|x}'(y;x,\beta)}{f_{Y|x}(y;x,\beta)}f_{Y|x}(y;x,\beta)\,dy = -E(\Psi_c(\beta;Y,x)S(\beta)),
\]

where S is the score function. Using the fact that I(β) = −n E(Ψ'_c(β; Y, x)), we obtain a different expression for I(β):

\[
I(\beta) = \sum_{i=1}^{n}E(\Psi_c(\beta;Y_i,x_i)S_i(\beta)).
\]

We explicitly compute the right-hand side of the above equation:

\[
\begin{aligned}
I(\beta) &= \sum_{i=1}^{n}E\!\left[\left(\frac{\psi_c(r_i)}{s_i}\frac{\partial\mu_i}{\partial\eta_i}x_i-a_i(\beta)\right)S_i(\beta)\right] && (4.31)\\
&= \sum_{i=1}^{n}E\!\left[\left(\frac{\psi_c(r_i)}{s_i}\frac{\partial\mu_i}{\partial\eta_i}x_i-a_i(\beta)\right)\frac{y_i-\mu_i}{s_i^{2}}\frac{\partial\mu_i}{\partial\eta_i}x_i^{T}\right] && (4.32)\\
&= \sum_{i=1}^{n}x_i\left(\frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\right)^{2}E\!\left[\psi_c(r_i)r_i-E(\psi_c(r_i))r_i\right]x_i^{T} && (4.33)\\
&= \sum_{i=1}^{n}x_i\left(\frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\right)^{2}\left(E[\psi_c(r_i)r_i]-E(\psi_c(r_i))E[r_i]\right)x_i^{T} && (4.34)\\
&= \sum_{i=1}^{n}x_i\left(\frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\right)^{2}\left(E[\psi_c(r_i)r_i]-E(\psi_c(r_i))\cdot 0\right)x_i^{T} && (4.35)\\
&= \sum_{i=1}^{n}x_i\left(\frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\right)^{2}E[\psi_c(r_i)r_i]\,x_i^{T} && (4.36)\\
&= X^{T}W_{WA}(\beta)X, && (4.37)
\end{aligned}
\]

where W_WA(β) = diag(w_i) with w_i = ((1/s_i)(∂µ_i/∂η_i))² E[ψ_c(r_i) r_i].

Notice that only E(ψ_c(r_i) r_i) needs to be computed, whereas for YP two more terms need to be computed, one of which additionally requires numerical approximation. Similarly to YP, we derive a robust IRWLS algorithm, but with different Z and W (Aeberhard et al., 2014):

\[
\begin{aligned}
I(\hat\beta)\hat\beta + U(\hat\beta)
&= \sum_{i=1}^{n}\left[x_i\left(\frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\right)^{2}E[\psi_c(r_i)r_i]\,x_i^{T}\beta+\frac{\psi_c(r_i)}{s_i}\frac{\partial\mu_i}{\partial\eta_i}x_i-a_i(\beta)\right] && (4.38)\\
&= \sum_{i=1}^{n}x_i\left[\left(\frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\right)^{2}E[\psi_c(r_i)r_i]\,\eta_i+\frac{\psi_c(r_i)-E[\psi_c(r_i)]}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\right] && (4.39)\\
&= \sum_{i=1}^{n}x_i\left[\left(\frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\right)^{2}E[\psi_c(r_i)r_i]\,\eta_i+\frac{h_i}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\right] && (4.40)\\
&= \sum_{i=1}^{n}x_i\left(\frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\right)^{2}E[\psi_c(r_i)r_i]\left[\eta_i+\frac{h_i s_i}{E[\psi_c(r_i)r_i]\,\partial\mu_i/\partial\eta_i}\right] && (4.41)\\
&= X^{T}W_{WA}(\beta)Z_{WA}, && (4.42)
\end{aligned}
\]

where

\[
z_i = \eta_i+\frac{h_i s_i}{E[\psi_c(r_i)r_i]\,\partial\mu_i/\partial\eta_i}.
\]

To extend to GAMs, the proposed robust generalized local scoring algorithm is constructed similarly to the GLSA, replacing the classic MLE-based components by the corresponding robust components derived in the IRWLS algorithm (Aeberhard et al., 2014; Alimadad and Salibian-Barrera, 2011).
Next, using the robust IRWLS algorithm and GLSA, we extend to GAPLMs with the Speckman approach (Pan, 2013). We refer to these two robust fitting methods for GAPLMs as YP and WA. Note that Aeberhard et al. (2014) only present the IRWLS algorithm and hence do not extend to GAMs or GAPLMs.

We explicitly compute the terms for YP and WA when the conditional distribution of Y given x is a Poisson distribution or a Negative Binomial distribution. We begin with the Poisson distribution. Let j₁ = ⌊µ_i − c s_i⌋ and j₂ = ⌊µ_i + c s_i⌋. For YP, we need to calculate E(ψ_c(r_i)), E(ψ'_c(r_i)), E(ψ'_c(r_i) r_i), and ∂E(ψ_c(r_i))/∂η_i.

• E(ψ_c(r_i)):

\[
\begin{aligned}
E(\psi_c(r_i)) &= \sum_{j=-\infty}^{\infty}\psi_c\!\left(\frac{j-\mu_i}{s_i}\right)P(Y_i=j)\,1(j\ge 0)
= \sum_{j=0}^{\infty}\psi_c\!\left(\frac{j-\mu_i}{s_i}\right)P(Y_i=j)\\
&= \sum_{j=0}^{j_1}(-c)P(Y_i=j)+\sum_{j=j_1+1}^{j_2}\frac{j-\mu_i}{s_i}P(Y_i=j)+\sum_{j=j_2+1}^{\infty}c\,P(Y_i=j)\\
&= c\left(P(Y_i\ge j_2+1)-P(Y_i\le j_1)\right)+\sum_{j=j_1+1}^{j_2}\frac{j-\mu_i}{s_i}P(Y_i=j)\\
&= c\left(P(Y_i\ge j_2+1)-P(Y_i\le j_1)\right)+\frac{\mu_i}{s_i}\left(P(Y_i=j_1)-P(Y_i=j_2)\right).
\end{aligned}
\]

The last equality follows from the identity below, which in turn follows from the Poisson identity j P(Y_i = j) = µ_i P(Y_i = j − 1):

\[
\begin{aligned}
\sum_{j=j_1}^{j_2-1}P(Y_i=j) &= \sum_{j=j_1+1}^{j_2}P(Y_i=j)\frac{j}{\mu_i}\\
\Longleftrightarrow\quad P(Y_i=j_1)-P(Y_i=j_2) &= \sum_{j=j_1+1}^{j_2}P(Y_i=j)\left(\frac{j}{\mu_i}-1\right)\\
\Longleftrightarrow\quad \frac{\mu_i}{s_i}\left(P(Y_i=j_1)-P(Y_i=j_2)\right) &= \sum_{j=j_1+1}^{j_2}P(Y_i=j)\frac{j-\mu_i}{s_i}.
\end{aligned}
\]

• E(ψ'_c(r_i)):

\[
E(\psi_c'(r_i)) = \sum_{j=0}^{\infty}\psi_c'\!\left(\frac{j-\mu_i}{s_i}\right)P(Y_i=j)
= \sum_{j=j_1+1}^{j_2}P(Y_i=j) = P(Y_i\le j_2)-P(Y_i\le j_1).
\]

• E(ψ'_c(r_i) r_i):

\[
E(\psi_c'(r_i)r_i) = \sum_{j=j_1+1}^{j_2}\frac{j-\mu_i}{s_i}P(Y_i=j)
= \frac{\mu_i}{s_i}\left(P(Y_i=j_1)-P(Y_i=j_2)\right).
\]

Note that this term is a sub-case of the terms in the calculation of E(ψ_c(r_i)); this is one of the improvements over Pan (2013).

• ∂E(ψ_c(r_i))/∂η_i: We use the first-order difference method to approximate this term. Choose a small step size L, and set µ_{i0} = g⁻¹(η_i − L) and µ_{i1} = g⁻¹(η_i + L), where η_i = g(µ_i). Then we use the following as an approximation:

\[
\frac{\Delta E(\psi_c(r_i))}{\Delta\eta_i} = \frac{E(\psi_c(r_{i1}))-E(\psi_c(r_{i0}))}{2L}.
\]

For WA, we need to calculate E(ψ_c(r_i) r_i). See Aeberhard et al. (2014) for details.

In the case of the Poisson distribution, the scale parameter s_i = √µ_i is a function of the location parameter µ_i alone.
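The closed-form expression for E(ψ_c(r_i)) under the Poisson distribution can be checked against a brute-force truncated sum. The sketch below is illustrative (the function names and the truncation point are ours); it uses the Huber ψ_c and assumes µ − c√µ ≥ 0 so that j₁ ≥ 0:

```python
import math

def pois_pmf(j, mu):
    # Poisson probability mass function P(Y = j).
    return math.exp(-mu + j * math.log(mu) - math.lgamma(j + 1))

def e_psi_closed(mu, c):
    """Closed-form E(psi_c(r)) for Y ~ Poisson(mu), r = (Y - mu)/sqrt(mu).

    Assumes mu - c*sqrt(mu) >= 0 so that j1 >= 0.
    """
    s = math.sqrt(mu)
    j1, j2 = math.floor(mu - c * s), math.floor(mu + c * s)
    cdf = lambda k: sum(pois_pmf(j, mu) for j in range(0, k + 1))
    tail = 1.0 - cdf(j2)  # P(Y >= j2 + 1)
    return c * (tail - cdf(j1)) + (mu / s) * (pois_pmf(j1, mu) - pois_pmf(j2, mu))

def e_psi_brute(mu, c, jmax=200):
    # Direct truncated sum of psi_c((j - mu)/s) * P(Y = j).
    s = math.sqrt(mu)
    huber = lambda r: max(-c, min(c, r))
    return sum(huber((j - mu) / s) * pois_pmf(j, mu) for j in range(jmax + 1))

print(abs(e_psi_closed(10.0, 1.345) - e_psi_brute(10.0, 1.345)))  # ~0
```

The agreement of the two computations is a direct numerical confirmation of the Poisson identity used in the derivation above.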
On the other hand, NB2 distributions have two parameters: the scale and the location. To estimate these two parameters, one approach is the profile likelihood estimation method proposed by Aeberhard et al. (2014). It proceeds as follows:

1. Initialize a reasonable estimate of the scale parameter.
2. Holding the scale parameter estimate fixed, estimate the location parameter, µ, using the above M-estimation of the quasi-likelihood.
3. Given the estimate of the location parameter from the previous step, estimate the scale parameter using a robust M-estimator provided in Aeberhard et al. (2014).
4. Iterate the above two steps until convergence.

For NB2 distributions, the computation of the parametric components for WA can be found in Aeberhard et al. (2014); the corresponding computation for YP is not carried out in this thesis.

4.2.2 The second approach: TL

Another approach to fit a GAPLM, referred to as TL in this thesis, is motivated by a robust backfitting algorithm for an AM proposed by Boente et al. (2017). TL robustifies the parametric and nonparametric components separately in a backfitting manner. In other words, beginning with some initial estimate of the nonparametric components, we can use YP or WA to robustly estimate the parametric components on the residuals from the nonparametric components. Then we use a robust method to estimate the nonparametric components on the residuals from the parametric components. The robust method for the nonparametric components in this thesis is based on a recent robust backfitting algorithm for an additive model, which has been developed along with theoretical justification by Boente et al. (2017). In short, a robust smoother is used in place of a regular smoother. This robust backfitting algorithm is extended to a GAM in a heuristic manner using the components derived by MLE.

Recall that the idea is simply to robustify the backfitting step in the GLSA algorithm.
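Step 1 above calls for a reasonable initial estimate of the scale parameter. One simple possibility (an assumption on our part, not necessarily the initializer used by Aeberhard et al. (2014)) is a method-of-moments estimate obtained by inverting the NB2 variance relation Var(Y) = µ + σ²µ²:

```python
def init_sigma2(y):
    """Method-of-moments initial estimate of the NB2 dispersion sigma^2.

    Inverts Var(Y) = mu + sigma^2 * mu^2 using the sample mean and
    variance; clipped below at a small positive value so the estimate
    stays usable when the data look under-dispersed.
    """
    n = len(y)
    m = sum(y) / n
    v = sum((yi - m) ** 2 for yi in y) / (n - 1)
    return max((v - m) / m ** 2, 1e-6)

counts = [3, 7, 2, 11, 5, 0, 9, 4]
print(init_sigma2(counts))
```

Because this estimate is itself non-robust, it serves only as a starting value; the robust M-estimation in steps 2 to 4 refines it.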
Given the adjusted response variable, Z, and weights, W, we have the model

\[
Z = f_0+\sum_{i=1}^{p}f_i(T_i). \tag{4.43}
\]

Estimation of f_k is based on the following equation in the classic backfitting step:

\[
E\!\left(W\Big(Z-\sum_{i=0;\,i\ne k}^{p}\hat f_i\Big)\,\Big|\,T_k\right) = W f_k(T_k), \tag{4.44}
\]

where f_k is estimated with a univariate (weighted) smoother. Denoting the partial residual by

\[
\tilde Z = Z-\sum_{i=0;\,i\ne k}^{p}\hat f_i,
\]

estimation of f_k can be viewed as a minimization problem in the L₂ sense:

\[
\min\; E\!\left(\big(W(\tilde Z-f_k)\big)^{2}\,\big|\,T_k\right). \tag{4.45}
\]

One way to robustify this backfitting algorithm is to replace the L₂ loss function in Eq. (4.45) with a Huber-type loss function:

\[
\min\; E\!\left(\rho\big(W(\tilde Z-f_k)\big)\,\big|\,T_k\right). \tag{4.46}
\]

Then the backfitting step becomes

\[
E\!\left(W\,\Psi\big(W(\tilde Z-f_k)\big)\,\big|\,T_k\right) = 0, \tag{4.47}
\]

where the extra factor of W is introduced by the chain rule. Note that the expectation is taken with respect to Y, which is embedded in W and Z. See Algorithm 6 for the implementation of TL.

TL is compared with existing methods in a small empirical example for a GAM (note that a GAM is a GAPLM without the parametric linear components). Figure 4.2 shows a plot of data (green dots) generated from a Poisson distribution with the true mean as the solid grey curve, together with three GAM fits. The dashed red curve is a classic GAM fit using the gam function from the MGCV package, the dotted blue curve is the robust GAM fit by Alimadad and Salibian-Barrera (2011), and the dotdashed black curve is the robust GAM fit using the proposed method. The outliers are located at around x = 22 and x = 55. The robust methods are indeed robust to the outliers, and the two robust fitted curves behave similarly.

However, one concern with TL is the estimation of the intercept for a GAPLM. It is not clear how to estimate the intercept properly. Suppose that the intercept is embedded in the parametric component, and we start with estimation of the parametric components after some initialization of the nonparametric components.
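The replacement of the L₂ smoother by a robust one can be sketched with a toy univariate smoother. The code below (an illustrative sketch under our own simplifications, not the smoother of Boente et al. (2017)) computes, at each target point, a kernel-weighted Huber location estimate by iteratively reweighted averaging:

```python
import math

def huber_smooth(x, y, c=1.345, bw=1.0, iters=30):
    """Kernel-weighted Huber location smoother (illustrative sketch).

    At each target point x[i], solves sum_j k_ij * psi_c(y_j - f) = 0
    by iteratively reweighted averaging, where psi_c is Huber's function.
    A very large c reduces this to an ordinary kernel (L2) smoother.
    """
    fits = []
    for xi in x:
        k = [math.exp(-0.5 * ((xj - xi) / bw) ** 2) for xj in x]
        f = sum(ki * yj for ki, yj in zip(k, y)) / sum(k)  # L2 start
        for _ in range(iters):
            # Huber weights psi_c(r)/r, capped at 1, times kernel weights.
            w = [ki * min(1.0, c / abs(yj - f)) if yj != f else ki
                 for ki, yj in zip(k, y)]
            f = sum(wi * yj for wi, yj in zip(w, y)) / sum(w)
        fits.append(f)
    return fits

x = [float(t) for t in range(10)]
y = [0.0] * 10
y[5] = 100.0                          # one gross outlier
robust = huber_smooth(x, y)           # barely moved by the outlier
classic = huber_smooth(x, y, c=1e9)   # plain kernel smoother, for comparison
```

On this toy example, the classic smoother is pulled far toward the outlier near x = 5, while the Huber-weighted fit stays close to the underlying zero curve, which is exactly the effect Eq. (4.47) is meant to produce inside the backfitting loop.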
Then, the nonparametric components are estimated, but each nonparametric term, f̃_j, needs to be centered for a unique identification of the intercept term. Should the centering term, Σ_j mean(f̃_j), simply be added to the intercept term, or does the intercept term need to be estimated again given the estimates of the nonparametric and parametric components? The proposed algorithm avoids this problem by starting with the estimation of the nonparametric components.

Figure 4.2: Comparison of gam and robust gam fits for data generated from a Poisson distribution. This example is taken from Alimadad and Salibian-Barrera (2011). The robust gam fits by Alimadad and Salibian-Barrera (2011) and our proposed method show similar performance.

4.2.3 Comparison

Compared to YP or WA, TL offers a couple of advantages. It can use different robust loss functions when estimating the f_j and different tuning constants for the parametric and nonparametric components. A small empirical simulation study showed that YP and TL gave similar results; the tuning constant for YP had to be chosen smaller to achieve the same level of performance in terms of the estimation of the parametric coefficients (including the intercept) and the mean squared error between the fitted curves and the true curve. For ease of computation, we used WA for the simulation study and the application to real datasets.

4.3 Outbreak detection

Recall that PHIDO computes an alert score based on the last four counts of a disease series to evaluate whether there is a disease outbreak (see Chapter 3 for details). To compute the score, White incorporates the common characteristic of a disease outbreak that outbreak counts are serially clustered.
In this thesis, we simply flag a disease count, y_t, at time t as an outlier or part of a disease outbreak at a pre-specified level, α ∈ (0,1), if the tail probability of observing a value greater than y_t, given the estimated parameters (mean and/or dispersion), is less than the pre-specified level: P(Y_t > y_t | µ̂_t, σ̂²) < α. We use three values for α: 0.05, 0.01, and 0.001, representing low, moderate, and high alert levels, respectively.

Algorithm 5 YP: Robust generalized Speckman algorithm for a GAPLM.
  Initialization: Let r = 0 and, for example, β_0^r = g(median(y)), β_j = f_m^r = 0 ∀ j, m. Then construct robust η^r, µ^r, Z^r, W^r.
  repeat
    r ← r + 1
    X̃₋₁ = X₋₁ − S(X₋₁ ~ T; W^{r−1}). Note that X₋₁ is the design matrix without the column of ones for the intercept term.
    Z̃ = Z − S(Z ~ T; W^{r−1})
    β^r = (X̃ᵀ W^{r−1} X̃)⁻¹ (X̃ᵀ W^{r−1} Z̃)
    Parametric residual: PR^r = Z^{r−1} − X β^r
    for j = 1 to p do
      Partial residual: R_j^r = PR^r − Σ_{k=1}^{j−1} f_k^r(X_k) − Σ_{k=j+1}^{p} f_k^{r−1}(X_k)
      Backfitting: f̃_j^r ← S_j(R_j^r; W^{r−1}), where S_j is a weighted univariate smoother
    end for
    for j = 1 to p do
      Centering: f_j^r ← f̃_j^r − mean[f̃_j^r(X_j)]
      Adjusting the intercept: β_0^r ← β_0^r + mean[f̃_j^r(X_j)]
    end for
    Update: robust η^r, µ^r, Z^r, W^r
  until convergence

Algorithm 6 TL: Robust backfitting algorithm for a GAPLM.
  Initialization: r = 0 and, for example, β_0^r = g(median(y)), β_j = f_m^r = 0 ∀ j, m. Then construct non-robust η^r, µ^r, Z^r, W^r.
  repeat
    r ← r + 1
    Parametric residual: PR^r = Z^{r−1} − X β^r
    for j = 1 to p do
      Partial residual: R_j^r = PR^r − Σ_{k=1}^{j−1} f_k^r(X_k) − Σ_{k=j+1}^{p} f_k^{r−1}(X_k)
      Robust backfitting: f̃_j^r ← S_j(R_j^r; W^{r−1}, ψ_c), where S_j is a weighted robust univariate smoother
    end for
    for j = 1 to p do
      Centering: f_j^r = f̃_j^r − mean(f̃_j^r)
    end for
    Nonparametric residual: NR^r = Z^{r−1} − Σ_{j=1}^{p} f_j^r
    Use the first approach with NR^r as the response variable, β^{r−1} as the initial estimates, and p = 0 (no nonparametric component) to obtain β^r
    Update: non-robust η^r, µ^r, Z^r, W^r
  until convergence
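The flagging rule P(Y_t > y_t | µ̂_t, σ̂²) < α can be implemented directly from the NB2 pmf. The sketch below is illustrative (the function names are ours); it computes the upper-tail probability by summing the pmf and compares it against the three alert levels:

```python
import math

def nb2_pmf(y, mu, sigma2):
    """NB2 pmf with mean mu and variance mu + sigma2 * mu^2 (theta = 1/sigma2)."""
    theta = 1.0 / sigma2
    logp = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
            + theta * math.log(theta / (theta + mu))
            + y * math.log(mu / (theta + mu)))
    return math.exp(logp)

def alert_level(y, mu, sigma2, alphas=(0.05, 0.01, 0.001)):
    """Return the most severe alpha at which P(Y > y) < alpha, or None."""
    upper = 1.0 - sum(nb2_pmf(j, mu, sigma2) for j in range(int(y) + 1))
    hit = [a for a in alphas if upper < a]
    return min(hit) if hit else None

# A count of 30 against a fitted mean of 5 triggers the highest alert level,
# while a count at the fitted mean triggers none.
print(alert_level(30, mu=5.0, sigma2=0.25))  # -> 0.001
print(alert_level(5, mu=5.0, sigma2=0.25))   # -> None
```

In practice, µ̂_t and σ̂² would come from the fitted GAPLM at time t, and the returned level (0.05, 0.01, or 0.001) maps to the low, moderate, and high alerts described above.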
Then we look at the levels of the last four counts to decide qualitatively whether there is a disease outbreak or to call for further investigation. This simple approach neither provides an overall alert score, as PHIDO does, nor exploits the fact that counts of a disease outbreak are clustered. Developing a more sophisticated alert score for a disease outbreak is an area of future research.

Chapter 5
Simulation study

We compare the proposed robust method, WA, against PHIDO and other existing methods in the literature by examining parameter estimation, goodness-of-fit, and outlier detection on simulated datasets.

5.1 Simulation setting

We describe the setting used for the simulation.

5.1.1 Data generation

As real disease datasets often exhibit seasonality/periodicity and trends (see the next chapter for such instances), we incorporate these characteristics in our simulated datasets. We emulate two disease datasets. Both exhibit seasonality/periodicity; one has an upward trend, and the other a downward trend. We consider the usual GAPLM setting (as described in Section 2.4) and generate data from a Negative Binomial distribution:

\[
Y_t\,|\,X_t \sim \mathrm{NB}(\mu_t,\sigma^2), \quad t = 1,\dots,n,
\]

where σ² is the dispersion parameter, which is assumed to be constant across t and X_t. Based on the analysis of the real disease datasets provided by the BC CDC in the next chapter, most of the datasets are over-dispersed; hence, a NB distribution is chosen to generate such over-dispersed data. The sample size, n, is set to 168 to represent about three years of data, assuming that each sample represents one weekly count. The sample size is chosen to cover more than two years so that PHIDO accounts for seasonality (recall that PHIDO otherwise excludes the seasonality term from its model). We choose the NB2 parameterization described in Chapter 3.
As a reminder, its mean and variance are related as follows:

\[
\mathrm{Var}(Y_t\,|\,X_t) = V(\mu_t) = \mu_t+\sigma^2\mu_t^2.
\]

Two values are used for σ²: 0.5 and 0.25. On a side note, the reciprocal of σ² is called the index parameter (Lawless, 1987). The index parameter is more frequently used in practice, but we use the dispersion parameter for the greater numerical stability of the estimation of the parameters, as asserted by Aeberhard et al. (2014).

For the dataset with a decreasing trend, we specify the mean function as follows:

\[
\log(\mu_t) = \beta_0+\beta_1 t+f_1(x_1(t))+f_2(x_2(t)), \tag{5.1}
\]

where β_0 = 1, β_1 = −0.01, and f_1(x_1(t)) and f_2(x_2(t)) are smooth univariate functions to be described shortly. The β parameters represent the long-term trend, and their values are chosen so that not many zero counts are generated. The other two terms, f_1(x_1(t)) and f_2(x_2(t)), represent the seasonality/periodicity. Note that x_1(t) and x_2(t) are chosen to reflect the frequency at which the data are collected. As we assume the data are weekly counts, the corresponding nonparametric covariates are x_1(t) = cos(2πt · 7/365.2425) and x_2(t) = sin(2πt · 7/365.2425). Having chosen the nonparametric covariates, f_1 and f_2 are generated as shown in Figure 5.1, and the resulting mean function is smooth and periodic. To create the functions f_1 and f_2, step functions were created first and then smoothed by LOESS. Finally, the smoothed functions were approximated by polynomial regression:

• f_1(x_1) = 1.4 + 0.3x_1 + 0.24x_1² − 0.18x_1³.
• f_2(x_2) = 1.625 − 0.45x_2 − 0.25x_2² − 0.75x_2³ − 0.5x_2⁵.

In addition, f_1 and f_2 were centered to have zero mean, with their means added to the intercept β_0.

The dataset with an upward trend is obtained by simply reversing the dataset with the downward trend. Its mean function is provided in Figure 5.2. The upward trend is more similar to the real disease trends presented in the next chapter.
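The mean function (5.1) with the covariates and polynomials above can be evaluated directly. The sketch below reproduces µ_t for the downward-trend dataset (for brevity, the centering of f_1 and f_2 into the intercept is omitted, so the level differs from the thesis's centered version by a constant factor):

```python
import math

def mu(t, b0=1.0, b1=-0.01):
    """Mean function (5.1) for the downward-trend dataset (centering omitted)."""
    x1 = math.cos(2 * math.pi * t * 7 / 365.2425)
    x2 = math.sin(2 * math.pi * t * 7 / 365.2425)
    f1 = 1.4 + 0.3 * x1 + 0.24 * x1 ** 2 - 0.18 * x1 ** 3
    f2 = 1.625 - 0.45 * x2 - 0.25 * x2 ** 2 - 0.75 * x2 ** 3 - 0.5 * x2 ** 5
    return math.exp(b0 + b1 * t + f1 + f2)

means = [mu(t) for t in range(1, 169)]  # n = 168 weekly means
```

The resulting series is positive, roughly annual in period (365.2425/7 ≈ 52.2 weeks), and decays with t through the β_1 term; simulated counts would then be drawn as NB(µ_t, σ²).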
Hence, we provide the simulation results from the upward-trend dataset.

Figure 5.1: Left: f_1(T_1); Right: f_2(T_2). They are smooth, periodic, and centered.

To emulate a disease outbreak, i.e. a sequence of outliers, we artificially increase the mean, µ_t, for a chosen period. Specifically, we multiply the mean by a magnitude of 3 or 5, in the middle or at the end of the disease series, for either 10 or 15 percent of the series. For instance, suppose we emulate a dataset that has a disease outbreak in the middle of the series, using the following setting: the magnitude is 3, the location of the outliers is the middle, and the proportion of outliers is 10% of n. First, we calculate the number of outliers: the ceiling of 0.1 · n = 17. Then we choose 17 indices around the midpoint of the series, which are from 77 to 94. Finally, we multiply the original µ_t by a factor of 3 for t = 77, ..., 94.

Figure 5.2: The upward-trend mean function is smooth and periodic.

Table 5.1: Simulation study parameters and their levels.
  Name                 Levels
  σ²                   0.5, 0.25
  outlier type         no outlier, mid, end
  outlier proportion   0.10, 0.15
  outlier magnitude    3, 5
  c                    4, 5, 6
  span                 0.60, 0.75, 0.90

5.1.2 Methods

We describe the five methods, including our proposed robust method and PHIDO, that will be compared using the simulated datasets.

• PHIDO: the method introduced in Chapter 3.
• MGCV: the model fitted by the gam function in the MGCV package in R. It uses splines to estimate the nonparametric functions of a GAPLM and generalized cross-validation to select the tuning parameters for the splines.
• AEBERHARD: the robust M-estimator for Negative Binomial GLMs, developed by Aeberhard et al. (2014) and described in Section 4.2.1.
It does not incorporate nonparametric components.
• PAN: YP in Section 4.2.1 for Poisson distributions.
• LEE: WA in Section 4.2.1 for Negative Binomial distributions. Note that LEE is an extension of AEBERHARD that incorporates nonparametric components.

Table 5.2: The five methods that are compared in the simulation study. Their key differences in features are outlined.
  Algo        Nonpar   Robust   Overdispersion
  PHIDO       LOESS    Yes      Yes
  MGCV        spline   No       Yes
  AEBERHARD   NA       Yes      Yes
  PAN         LOESS    Yes      No
  LEE         LOESS    Yes      Yes

Their key differences lie in their ability to handle outliers, non-linearity, and over-dispersion in the response variable. Here are a priori expectations of how well MGCV, AEBERHARD, PAN, and LEE should perform. MGCV should fit the data better than the other methods in the absence of outliers, since it uses generalized cross-validation (CV) to tune the hyperparameters of its splines. For the three robust methods, by contrast, we simply try several values for the hyperparameters of the nonparametric components. While there is a way to implement a generalized CV for these robust methods as well (see Pan (2013)), it is not implemented in this thesis. In the presence of outliers, however, MGCV should not perform well, as it fits the outliers too. In contrast, AEBERHARD should be robust to outliers, but unable to fit the data well due to the absence of nonparametric components. PAN should be both robust to outliers and good at estimating the mean function; its weakness in flagging outliers should be exposed when the data are over-dispersed, as it assumes the response variable comes from a Poisson distribution. Finally, LEE should be able to handle over-dispersed data as well. Table 5.2 summarizes the key features of the methods.

We describe how the tuning parameters for the nonparametric components are chosen for each method. For PHIDO, see Chapter 3.
For MGCV, the tuning parameter is a smoothing parameter that controls overfitting, i.e. the trade-off between bias and variance, by penalizing the wiggliness of the spline. By default, a generalized cross-validation is run to optimize the smoothing parameter; see Chapter 4 of Wood (2017) for more information. LEE and PAN use LOESS instead of splines. The hyperparameters for LOESS are described in Section 2.3.1. For both methods, the degree of the local polynomial regressions is set to 1, and a tricube kernel is used. Three values of the span are used: 0.60, 0.75, and 0.90. For LEE, PAN, and AEBERHARD, three values of the robustness tuning constant c in ψ_c and ρ_c are used: 4, 5, and 6 (see Section 4.1 for the role of c). Although it is not done in this study, the hyperparameters of LEE, PAN, and AEBERHARD should ideally be tuned by, for example, the generalized robust CV developed by Alimadad and Salibian-Barrera (2011).

We run 100 replications for each combination of the simulation study parameters; the simulation study parameters and their levels are summarized in Table 5.1.

5.1.3 Evaluation measures

We compare the performance of the five methods based on the following measures. For each replication j of each combination, we compute:

• Mean Squared Error (MSE):

\[
\mathrm{MSE}_j = \frac{1}{n}\sum_{t=1}^{n}(\mu_t-\hat y_{tj})^2.
\]

MSE measures how close the fitted curve is to the true mean curve, since we want to evaluate how well a method can estimate the true mean in the presence of outliers. Note that it is not a measure of how well the training data are fit (for that, µ_t would be replaced by y_t).

• Absolute Proportional Bias (APB):

\[
\mathrm{APB}_{\gamma,j} = \frac{|\hat\gamma_j-\gamma|}{\gamma}
\]

for each parameter γ in the model. APB measures the bias of an estimated parameter relative to the true value of that parameter.
There are three parameters in the model: β_0, β_1, and σ².

• Reporting Proportion (RP):

\[
\mathrm{RP}_{1-\alpha} = \frac{\mathrm{RO}_{1-\alpha}}{\mathrm{TO}},
\]

where TO is the total number of true outliers and RO_{1−α} is the total number of outliers identified by the method at some pre-specified level, α ∈ (0,1). A case, or data point y_t at time t, is classified as an outlier at a pre-specified level α if the upper-tail probability of observing a value greater than y_t, given the estimated mean function and dispersion parameter, is less than that level: P(Y_t > y_t | µ̂_t, σ̂²) < α. We use three values for α: 0.05, 0.01, and 0.001, representing low, moderate, and high alert levels, respectively. The range of RP is from 0 to n/TO.

• Correct Reporting Rate (CRR):

\[
\mathrm{CRR}_{1-\alpha} = \frac{\mathrm{CO}_{1-\alpha}}{\mathrm{TO}},
\]

where CO_{1−α} is the number of outliers that are identified correctly by the method. The range of CRR is from 0 to 1.

As explained in Chapter 3, PHIDO computes a test statistic to decide whether an outbreak has occurred in the last four weeks of a disease series. However, we have not developed such a test statistic for the other four methods, and hence we cannot directly compare the outbreak-detection performance of any of the other methods with that of PHIDO. Instead, we can compare the fitted curves, RP_{1−α}, and CRR_{1−α}.

5.2 Simulation results

A Shiny app has been developed to explore the simulation results easily. In this thesis, we present the results based on the upward-trend datasets. For the full results, please explore the Shiny app: https://tylee.shinyapps.io/RGAPLM/.

5.2.1 Fitted values

Before proceeding to other quantitative measures, we qualitatively assess the fitted curves to gain insight. There are many dimensions along which to examine the fitted values.
For the same reason as in the examination of the parameter estimates, we look at the fitted curves by outlier type, keeping the other levels fixed.

We begin by examining functional boxplots of the 100 replications of fitted curves for each method. The red curve is the true mean, and the solid black line is the median of the 100 fitted curves. The grey shaded region is called the 50% sample central region of the functional boxplot; the central region is analogous to the interquartile range of a classical boxplot and simply covers the central 50% of the fitted curves. It is a useful description of the spread of the central samples, as it is not affected by extreme samples. The upper and lower border lines of the functional boxplot are the maximum and minimum of the samples, respectively, excluding outliers, where outlying curves are those beyond 1.5 times the range of the 50% central region. For more information, see Sun and Genton (2011).

Figure 5.3 shows the functional boxplots of the 100 fitted curves of AEBERHARD, LEE, MGCV, PAN, and PHIDO in the absence of outliers. The median curves of LEE and MGCV almost align with the true mean curve, and their central regions closely surround it. While the median curve of PAN is slightly under the true mean curve at the humps located at around 60, 80, 120, and 140 weeks, its central region also closely encapsulates the true mean curve. The median curve of PHIDO is less aligned with the true curve than that of PAN, and its central region is not as sharply defined as that of PAN at around weeks 80 and 130. The functional boxplot of AEBERHARD shows that it completely misses the key trend and does not fit the data well, as expected, due to its lack of nonparametric components.

Figure 5.4 shows the functional boxplots of the 100 fitted curves of AEBERHARD, LEE, MGCV, PAN, and PHIDO in the presence of outliers in the middle of the weeks. We saw that AEBERHARD missed the trend completely in the absence of outliers.
It still does in the presence of the outliers, but its central region captures several parts of the true mean curve. The median curve of LEE does not align with the true mean curve as closely as it did in the absence of outliers. The central region of LEE contains the true mean curve, but in the lower part of the region, indicating that it is affected by the outliers. The effect is much worse for MGCV, as expected, since MGCV is not robust to outliers: the true mean curve lies at about the lower borderline of the functional boxplot, yet the periodic trend is still captured well. PAN performs better than LEE, as the true mean curve lies closer to its median curve, and its central region is tighter. For PHIDO, we see that it is not robust to outliers in the middle; in fact, it is worse than MGCV. In the middle of the weeks, where the outliers are present, the true mean curve lies outside the boundary of the functional boxplot, meaning that the points of the true mean curve in that range would be regarded as outlying. In addition, the periodic trend is completely lost.

Figure 5.3: The functional boxplots of the 100 fitted curves by the five methods in the absence of outliers: (a) AEBERHARD, (b) LEE, (c) MGCV, (d) PAN, and (e) PHIDO.

Figure 5.5 shows the functional boxplots of the 100 fitted curves of AEBERHARD, LEE, MGCV, PAN, and PHIDO in the presence of outliers at the end of the weeks. Even though AEBERHARD misses the periodic trend, its lack of nonparametric components results in conservative fitted curves, such that the functional boxplot includes the true mean curve. For LEE, starting at Week 90, the median curve starts to inflate, and the true mean curve lies at the bottom of the central region and later at the lower boundary of the functional boxplot.
For MGCV, the inflation is more severe, and it starts to lose the periodic trend: the median curve shoots off at the end, and the true mean curve lies outside the functional boxplot. The median curve of PAN is the closest to the true mean curve up until Week 140. The periodic trend of PAN seems to be shifted by 5 weeks or so, and the true mean curve lies outside the functional boxplot at the end of the weeks. For PHIDO, the true mean curve is captured well by the central region before the period in which the outliers are present. In the time period of the outliers, the median curve blows up, indicating that PHIDO is not as robust as the other methods. It may be the case that PHIDO chooses a small span, which is not optimal for this particular dataset, explaining its lack of robustness in the middle of the weeks.

5.2.2 Assessment of parameter estimation

In the previous subsection, we qualitatively assessed the fits; in the subsequent subsections, we assess them quantitatively. First, we use APB to evaluate the estimates of the three parameters, β̂_0, β̂_1, and σ̂², from the four methods PAN, LEE, AEBERHARD, and MGCV. To show the distribution of the 100 estimates of each parameter under each combination, we use boxplots. As the results are similar across the different levels of the outlier proportion, outlier magnitude, span, tuning parameter, and over-dispersion parameter, we fix those parameters to 0.15, 1, 0.25, 4, and 0.6, respectively. For each parameter, we present three boxplots for the three levels of the outlier type: no outliers, outliers in the mid, and outliers at the end.

Figure 5.6a shows boxplots of APB for each β parameter for each of the four methods in the absence of outliers. We do not see any notable difference in the APB of the estimated β parameters across the four methods. In the presence of outliers in the mid, the APBs are roughly twice as high as those in the absence of outliers (Fig. 5.6b).
In addition, MGCV does not estimate β_0 as well as the other three methods; the median of its APBs for β̂_0 lies at the 75th percentile of the other methods. For β_1, the APBs given by PAN are slightly higher than those of the other three methods. In the presence of outliers at the end (Fig. 5.6c), the APBs given by LEE and AEBERHARD rank the lowest for both β_0 and β_1. LEE and AEBERHARD estimate the β parameters almost perfectly, as there are several zero APBs.

For σ², we report the APBs for LEE, AEBERHARD, and MGCV, as these methods account for overdispersion (Figures 5.7a, 5.7b, and 5.7c). Across all levels of the outlier type, the boxplots of the three methods overlap, and the boxplots of LEE are consistently lower than those of the other two methods. In the absence of outliers, the median APB for all three methods lies at about 0.25. It doubles in the presence of outliers, suggesting that none of the methods estimates the overdispersion parameter well in the presence of outliers.

Figure 5.4: The functional boxplots of the 100 fitted curves by the five methods in the presence of outliers in the mid: (a) AEBERHARD, (b) LEE, (c) MGCV, (d) PAN, and (e) PHIDO.

Figure 5.5: The functional boxplots of the 100 fitted curves by the five methods in the presence of outliers at the end: (a) AEBERHARD, (b) LEE, (c) MGCV, (d) PAN, and (e) PHIDO.
Figure 5.6: The boxplots of APB for each of the two β parameters for PAN, LEE, AEBERHARD, and MGCV: (a) no outliers, (b) outliers in the mid, and (c) outliers at the end.

Figure 5.7: The boxplots of APB for the overdispersion parameter σ² for LEE, AEBERHARD, and MGCV: (a) no outliers, (b) outliers in the mid, and (c) outliers at the end.

5.2.3 Goodness-of-fit

Next, we inspect MSE, a common measure of goodness-of-fit. As a reminder, MSE in this thesis is defined as the mean squared difference between the fitted values and the true mean values, not the observed values. Again, we look at the MSEs by outlier type using boxplots.

In the absence of outliers (Figure 5.8a), AEBERHARD has high MSE values, which are explained by its inability to fit the trend well. All the other methods give similar MSE values, which are about half of those of AEBERHARD.

In the presence of outliers (Figures 5.8b and 5.8c), MGCV now performs badly, as it is vulnerable to outliers. MSEs are significantly higher for the cases with outliers at the end of the weeks than for those with outliers in the middle. It should not be surprising that PAN and LEE perform similarly: their difference lies not in the mean function but in whether overdispersion is accounted for. Thus, their difference becomes pronounced when we examine their outlier-identification results via CRR and RP in the next subsection.

5.2.4 Quantification of outliers: CRR and RP

Based on the fitted models, we determine whether each observed data point is an outlier. We use 1 − α = 0.95 as the pre-specified level for RP and CRR; see Section 5.1.3 for the role of α.
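Given the set of indices flagged by a method and the set of true outlier indices, RP and CRR can be computed directly (a minimal sketch; the function and variable names are ours):

```python
def rp_crr(flagged, true_outliers):
    """Reporting Proportion and Correct Reporting Rate.

    RP  = (# flagged by the method) / (# true outliers)
    CRR = (# correctly flagged)     / (# true outliers)
    """
    flagged, true_outliers = set(flagged), set(true_outliers)
    to = len(true_outliers)
    return len(flagged) / to, len(flagged & true_outliers) / to

# 17 true outlying weeks; the method flags 4 points, 3 of them correctly.
print(rp_crr({75, 76, 80, 99}, set(range(75, 92))))  # -> (0.235..., 0.176...)
```

Note that RP counts all flags, correct or not, which is why a method that over-flags (such as PAN on over-dispersed data) can have a large RP while its CRR remains bounded by 1.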
For this subsection, we look at the results from the simulation settings with outliers in the mid and at the end. The outlier proportion, outlier magnitude, dispersion parameter, span, and robust tuning parameter are set to 0.1, 2, 0.25, 0.9, and 4, respectively. For the outliers in the mid, see Figures 5.10a-5.10e for functional boxplots of the fitted values for each method; for the outliers at the end, see Figures 5.11a-5.11e.

In the presence of outliers in the mid or at the end (Figures 5.9a and 5.9b), PAN appears at first glance to be the best at detecting outliers based on the CRR boxplots. However, notice that PAN flags about triple the number of true outliers (Figures 5.9c and 5.9d), implying that almost one-third of the data is classified as outliers by PAN. In contrast, the RP values of LEE and AEBERHARD are close to 1, and their CRR values are about 0.85 and 0.75 for the outliers in the mid and at the end, respectively. LEE is slightly better than AEBERHARD, as its medians of the CRR values are higher than those of AEBERHARD; moreover, its 25th percentiles lie at about the medians of AEBERHARD. To see why AEBERHARD performs so well despite not fitting the trend well, compare the functional boxplots of LEE and AEBERHARD: they are quite similar in the region of outliers.
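Assuming CRR denotes the proportion of true outliers that a method flags and RP the number of flagged points relative to the number of true outliers (a reading inferred from how the boxplots are discussed in this subsection; the helper below is illustrative, not the thesis's code), the two quantities can be computed as:

```python
def crr_rp(flagged_idx, true_outlier_idx):
    """CRR: fraction of true outliers that were flagged.
    RP: count of flagged points relative to the count of true outliers.
    RP near 1 together with high CRR indicates precise detection; an RP of
    about 3 (as for PAN) would mean roughly triple the true outliers flagged."""
    flagged, truth = set(flagged_idx), set(true_outlier_idx)
    crr = len(flagged & truth) / len(truth)
    rp = len(flagged) / len(truth)
    return crr, rp
```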
Lastly, MGCV has the lowest RP and CRR values, which are about half of those of LEE.

Figure 5.8: The boxplots of the MSEs for PAN, LEE, AEBERHARD, and MGCV: (a) no outliers, (b) outliers in the mid, and (c) outliers at the end.

Figure 5.9: (a) and (b): the boxplots of the 100 CRRs at 1−α = 0.95 for PAN, LEE, AEBERHARD, and MGCV in the presence of outliers in the mid and at the end, respectively. (c) and (d): the boxplots of the 100 RPs at 1−α = 0.95 for PAN, LEE, AEBERHARD, and MGCV in the presence of outliers in the mid and at the end, respectively.

Figure 5.10: The functional boxplots of the 100 fitted curves by the five methods in the presence of outliers in the mid: (a) AEBERHARD, (b) LEE, (c) MGCV, (d) PAN, and (e) PHIDO.

Figure 5.11: The functional boxplots of the 100 fitted curves by the five methods in the presence of outliers at the end: (a) AEBERHARD, (b) LEE, (c) MGCV, (d) PAN, and (e) PHIDO.

Chapter 6

Application: PHIDO datasets

In this chapter, we compare the performance of our proposed method, WA or YP
depending on the type of distribution of the response variable (Section 4.2.2), and PHIDO (Chapter 3) on eight disease datasets provided by the British Columbia Centre for Disease Control (BC-CDC). Some of the datasets are simulated, but we do not know which ones are real and which are simulated.

6.1 Exploratory data analysis

We begin with exploratory data analysis to choose an appropriate distribution, either Poisson or Negative Binomial. If it is Poisson, then we use YP; if it is Negative Binomial, then we use WA. Next we use the chosen method and PHIDO (recall that PHIDO does not require any distributional assumption) to fit each series. Since we do not have a method to select the optimal values of the tuning parameters or hyper-parameters, we try several values for them. We try 4, 5, and 6 for the robustness tuning parameter, c. For LOESS, used in the estimation of the nonparametric components, we try 0.5, 0.75, and 0.90 for the span, with the degree set to 1 (to review the role of these parameters, see Section 2.3.1). For covariates, we use

x1(t) = cos(2π · 7t / 365.2425) and x2(t) = sin(2π · 7t / 365.2425),

as the response variables of all eight datasets are weekly disease counts.

We plot the raw disease counts to note any features that might be helpful for modeling (Figures 6.1a-6.2d). Based on the scatter plots of the data, we see that periodicity exists in every dataset. There seems to be an upward trend for Series 2, D, and E; it is hard to determine whether there exists a trend for the other datasets based on the plots. Next, we plot a rough estimate of dispersion, the ratio of the sample variance to the sample mean, for each month, to choose an appropriate distribution (Figures 6.3 and 6.4). If most of the dispersion estimates are greater than 1, then we choose a Negative Binomial model to account for overdispersion. If they lie in (0.9, 1.1), then we use a Poisson model.
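The rough dispersion diagnostic described above can be sketched as follows; the helper name is hypothetical, but the statistic, the ratio of the monthly sample variance to the monthly sample mean of the weekly counts, is the one described in the text.

```python
import statistics

def monthly_dispersion(counts, months):
    """Rough overdispersion estimate per month: sample variance / sample mean
    of the weekly counts falling in that month. Values well above 1 suggest a
    Negative Binomial model; values near 1 suggest a Poisson model."""
    by_month = {}
    for c, m in zip(counts, months):
        by_month.setdefault(m, []).append(c)
    return {m: statistics.variance(v) / statistics.mean(v)
            for m, v in by_month.items()
            if len(v) > 1 and statistics.mean(v) > 0}
```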
By examining the dispersion plots, we see that all the datasets except the last one, Series F, have overdispersed counts. Hence, we use WA for Series 1, 2, and A-E and YP for Series F.

Figure 6.1: The weekly counts of Series 1, 2, A, and B. (a) Series 1: it exhibits periodicity but no clear trend. (b) Series 2: it exhibits periodicity and an upward trend. (c) Series A: it exhibits periodicity but no clear trend. (d) Series B: it exhibits periodicity but no clear trend.

Figure 6.2: The weekly counts of Series C, D, E, and F. (a) Series C: it exhibits periodicity but no clear trend. (b) Series D: it exhibits periodicity and an upward trend. (c) Series E: it exhibits periodicity and an upward trend. (d) Series F: it exhibits periodicity but no clear trend.

Figure 6.3: Rough estimates of overdispersion for each month for Series 1, 2, A, and B. (a) Series 1: a NB model would be appropriate. (b) Series 2: a NB model would be appropriate. (c) Series A: a NB model would be appropriate. (d) Series B: a NB model would be appropriate.
The red line indicates the reference value of 1.

Figure 6.4: Rough estimates of overdispersion for each month for Series C, D, E, and F. The red line indicates the reference value of 1. (a) Series C: a NB model would be appropriate. (b) Series D: a NB model would be appropriate. (c) Series E: a NB model would be appropriate. (d) Series F: a Poisson model would be appropriate.

6.2 Alert levels

We look at the fitted curves and alert levels to make an educated guess on whether an outbreak has occurred at the end of each disease series. An alert level is assigned to each observation based on the probability, under the fitted model, of observing that value or greater. We give a high, medium, or low alert if the probability is less than 0.001, 0.01, or 0.05, respectively; otherwise, no alert is flagged. Moreover, we colour-code the levels: red for high, gold for medium, green for low, and black for none. As a disease outbreak is determined based on the last four weekly counts, we look at the last four alert levels to make a decision.

Recall that each data series was fitted using the different levels of span and c for WA or YP. We provide the plot for one set of fitted values, where the span is 0.75 and c is 4. The reason is that the fits on the last four weekly counts vary so little that, in fact, the alert levels are identical across the different levels. Furthermore, the fits behave as expected; they become smoother as the span and c increase.

Looking at the full fitted curves (Figures 6.5a-6.6d), the fits are similar for Series 1, A, B, C, and F. For Series 2, we zoom in on the last year to examine the difference in the fitted curves closely (Figure 6.7).
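The mapping from tail probabilities to colour-coded alert levels uses the thresholds stated above; a minimal sketch (the function name is an assumption):

```python
def alert_colour(tail_prob):
    """Map P(Y >= observed count) under the fitted model to an alert colour,
    using the thresholds from Section 6.2."""
    if tail_prob < 0.001:
        return "red"    # high alert
    if tail_prob < 0.01:
        return "gold"   # medium alert
    if tail_prob < 0.05:
        return "green"  # low alert
    return "black"      # no alert
```

Applying this to the last four weekly counts of a series reproduces the colour columns in Tables 6.1-6.8.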
We see that the fit by LEE is much lower, especially at the peak in February and March. For Series D and E, the fits by PHIDO seem to follow the large counts at the end closely, whereas the fits by LEE are clearly lower than those of PHIDO; LEE perceives the points that PHIDO follows as outliers and appears to revert to the trend at the end of the series.

Tables 6.1-6.8 show the alert levels of the last four weekly counts along with the overall alert score that PHIDO calculates (see Chapter 3 for details). Except for Series B and C, the alert level colours are all black, indicating no outbreak according to either method. For Series B, WA's alert levels are higher than those of PHIDO: PHIDO gives only one medium alert, whereas WA gives one high alert and one low alert. As the BC-CDC prefers to be conservative, i.e. it prefers higher false alert levels to lower ones, WA appears to be the better method for this particular dataset. For Series C, which is mainly composed of zero values, WA assigns one low alert.

In short, we conclude that there is no disease outbreak except for Series B. This is indeed the correct answer, confirmed by the BC-CDC, and it is also the answer given by PHIDO using the alert scores. For these datasets, our proposed robust method gave alert levels at least as high as PHIDO's, and thereby appeared to be the more conservative method that the BC-CDC would prefer. Moreover, the level of robustness of our proposed model can be adjusted through the robustness tuning parameter and the span to manage different degrees of outbreaks.
Consequently, more accurate (in terms of the measures defined in Section 5.3.1) and hence more reliable results can be obtained.

Figure 6.5: The fitted values of PHIDO and LEE for Series 1, 2, A, and B.

Figure 6.6: The fitted values of PHIDO and LEE for Series C, D, E, and F.

Figure 6.7: The last year of the fitted values of PHIDO and LEE for Series 2.

Count  PHIDO  WA
34     black  black
42     black  black
45     black  black
42     black  black

Table 6.1: The alert levels for the last four weekly counts of Series 1 computed by PHIDO and WA. The overall alert score by PHIDO is -3.68, which indicates no alert.

Count  PHIDO  WA
671    black  black
651    black  black
680    black  black
674    black  black

Table 6.2: The alert levels for the last four weekly counts of Series 2 computed by PHIDO and WA. The overall alert score by PHIDO is -0.67, which indicates no alert.

Count  PHIDO  WA
20     black  black
29     black  black
22     black  black
27     black  black

Table 6.3: The alert levels for the last four weekly counts of Series A computed by PHIDO and WA. The overall alert score by PHIDO is 0.29, which indicates no alert.

Count  PHIDO  WA
0      black  black
4      black  green
37     gold   red
32     black  black

Table 6.4: The alert levels for the last four weekly counts of Series B computed by PHIDO and WA. The overall alert score by PHIDO is 2.3, which indicates a medium alert.

Count  PHIDO  WA
0      black  black
1      black  green
30     black  black
0      black  black

Table 6.5: The alert levels for the last four weekly counts of Series C computed by PHIDO and WA. The overall alert score by PHIDO is -0.24, which indicates no alert.

Count  PHIDO  WA
73     black  black
69     black  black
58     black  black
37     black  black

Table 6.6: The alert levels for the last four weekly counts of Series D computed by PHIDO and WA. The overall alert score by PHIDO is -4.27, which indicates no alert.

Count  PHIDO  WA
238    black  black
218    black  black
200    black  black
224    black  black

Table 6.7: The alert levels for the last four weekly counts of Series E computed by PHIDO and WA. The overall alert score by PHIDO is -2.29, which indicates no alert.

Count  PHIDO  YP
5      black  black
7      black  black
7      black  black
6      black  black

Table 6.8: The alert levels for the last four weekly counts of Series F computed by PHIDO and YP. The overall alert score by PHIDO is 0.99, which indicates no alert.

Chapter 7

Conclusion and future work

This dissertation was motivated by the demand for a theoretically sound statistical method for detecting disease outbreaks. We analyzed PHIDO, the statistical method currently used by the BC CDC, and identified several potential problems. In particular, it lacks theoretical justification for its robustifying step, which down-weights data points that are potentially outliers. To mitigate this problem, we presented two alternatives based on recent developments in robust estimators for additive, generalized additive, and generalized linear models. We showed the derivation of each robust method and provided the details of implementing them for Poisson and Negative Binomial distributions.
We demonstrated that their performance was superior to that of other methods, including PHIDO, on both the synthetic datasets and the PHIDO datasets. While we restricted ourselves to the application of disease outbreak detection, we envision these methods as an essential addition to robust statistics and hence a useful tool for applications such as disease outbreak detection.

7.1 Future work

Several research topics remain for future work:

• We focused only on developing robust methods for GAPLMs, and much of the work on inference and asymptotic analysis remains to be done. As mentioned in Chapter 4, see the list of references in Alimadad and Salibian-Barrera (2011) for the difficulties involved.

• For practical purposes, two things need to be implemented for the proposed methods: starting points and the tuning of hyper-parameters. Starting points, or initial estimates of the parameters, are crucial for ensuring convergence of the objective function to a true minimum. The implemented algorithm uses classical MLE estimates as the starting points, which may cause convergence issues. One solution is to use subsampling (see Section 5.7.2 in Maronna et al. (2018)): instead of using all the data points, we take subsamples to compute candidate starting estimates. Next, a set of levels was pre-assigned for the hyper-parameters, such as the bandwidth/span, in this thesis. In practice, the tuning parameters are chosen either by rule-of-thumb values (chosen under strong assumptions) or by cross-validation. Since outliers may still affect a goodness-of-fit criterion, such as the deviance, used for cross-validation, a robust version of the criterion must be used to handle the effect of outliers. See Section 3.3 in Alimadad and Salibian-Barrera (2011) for one such robust criterion.

• A disease outbreak is often a series of abnormally high reported cases; these points are clustered together. This information was used in PHIDO but not in the proposed methods.
Moreover, we completely ignored spatial correlation and correlation between different diseases.

Bibliography

Aeberhard, W. H., E. Cantoni, and S. Heritier (2014). Robust inference in the negative binomial regression model with an application to falls data. Biometrics 70(4), 920–931.

Alimadad, A. and M. Salibian-Barrera (2011). An outlier-robust fit for generalized additive models with applications to disease outbreak detection. Journal of the American Statistical Association 106(494), 719–731.

Bartlett, M. S. (1953). Approximate confidence intervals. II. More than one unknown parameter. Biometrika 40(3/4), 306–317.

Bianco, A. M., M. G. Ben, and V. J. Yohai (2005). Robust estimation for linear regression with asymmetric errors. Canadian Journal of Statistics 33(4), 511–528.

Bianco, A. M. and E. Martínez (2009). Robust testing in the logistic regression model. Computational Statistics & Data Analysis 53(12), 4095–4105.

Bianco, A. M. and V. J. Yohai (1996). Robust estimation in the logistic regression model. In Robust Statistics, Data Analysis, and Computer Intensive Methods, pp. 17–34. Springer.

Boente, G., X. He, J. Zhou, et al. (2006). Robust estimates in generalized partially linear models. The Annals of Statistics 34(6), 2856–2878.

Boente, G., A. Martínez, and M. Salibián-Barrera (2017). Robust estimators for additive models using backfitting. Journal of Nonparametric Statistics 29(4), 744–767.

Boente, G. and J. C. Pardo-Fernández (2016). Robust testing for superiority between two regression curves. Computational Statistics & Data Analysis 97, 151–168.

Boente, G. and D. Rodriguez (2010). Robust inference in generalized partially linear models. Computational Statistics & Data Analysis 54(12), 2942–2966.

Buja, A., T. Hastie, and R. Tibshirani (1989). Linear smoothers and additive models. The Annals of Statistics, 453–510.

Cameron, A. C. and P. K. Trivedi (1998). Regression Analysis of Count Data. Cambridge University Press.

Cantoni, E. and E. Ronchetti (2001). Robust inference for generalized linear models. Journal of the American Statistical Association 96(455), 1022–1030.

Cleveland, W. S. and S. J. Devlin (1988). Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association 83(403), 596–610.

Croux, C. and G. Haesbroeck (2003). Implementing the Bianco and Yohai estimator for logistic regression. Computational Statistics & Data Analysis 44(1-2), 273–295.

Dominici, F., A. McDermott, S. L. Zeger, and J. M. Samet (2002). On the use of generalized additive models in time-series studies of air pollution and health. American Journal of Epidemiology 156(3), 193–203.

Hampel, F. R., E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel (2011). Robust Statistics: The Approach Based on Influence Functions, Volume 196. John Wiley & Sons.

Härdle, W. K., M. Müller, S. Sperlich, and A. Werwatz (2012). Nonparametric and Semiparametric Models. Springer Science & Business Media.

Hastie, T. and R. Tibshirani (1986). Generalized additive models. Statistical Science 1(3), 297–310.

Lawless, J. F. (1987). Negative binomial and mixed Poisson regression. Canadian Journal of Statistics 15(3), 209–225.

Loader, C. (2006). Local Regression and Likelihood. Springer Science & Business Media.

Luenberger, D. G. (1997). Optimization by Vector Space Methods. John Wiley & Sons.

Maronna, R. A., R. D. Martin, V. J. Yohai, and M. Salibián-Barrera (2018). Robust Statistics: Theory and Methods (with R), Second Edition. John Wiley & Sons, Chichester.

Maronna, R. A. and R. H. Zamar (2002). Robust estimates of location and dispersion for high-dimensional datasets. Technometrics 44(4), 307–317.

McCullagh, P. (1983). Quasi-likelihood functions. The Annals of Statistics, 59–67.

McCullagh, P. and J. Nelder (1989). Binary data. In Generalized Linear Models, pp. 98–148. Springer.

Müller, M. (2001). Estimation and testing in generalized partial linear models: a comparative study. Statistics and Computing 11(4), 299–309.

Nelder, J. A. and R. W. M. Wedderburn (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A 135(3), 370–384.

Pan, Y. (2013). A robust fit for generalized partial linear partial additive models. Ph.D. thesis, University of British Columbia.

Ramsay, T. O., R. T. Burnett, and D. Krewski (2003). The effect of concurvity in generalized additive models linking mortality to ambient particulate matter. Epidemiology 14(1), 18–23.

Speckman, P. (1988). Kernel smoothing in partial linear models. Journal of the Royal Statistical Society, Series B (Methodological), 413–436.

Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. The Annals of Statistics, 590–606.

Sun, Y. and M. G. Genton (2011). Functional boxplots. Journal of Computational and Graphical Statistics 20(2), 316–334.

R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

White, R. (2000a). Presentation: A statistical method for detection of outbreaks in notifiable disease data. Unpublished.

White, R. (2000b). A statistical method for detection of outbreaks in notifiable disease data. Unpublished.

Wood, S. N. (2006). Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.

Wood, S. N. (2017). Generalized Additive Models: An Introduction with R (2nd ed.). Chapman and Hall/CRC Texts in Statistical Science.

World Health Organization (2018, April). Disease outbreaks. http://www.searo.who.int/topics/disease outbreaks/en/.