A Robust Fit for Generalized Partial Linear Partial Additive Models

by

Yiyang Pan

B.Sc., Zhejiang University, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

July 2013

© Yiyang Pan 2013

Abstract

In regression studies, semi-parametric models provide both flexibility and interpretability. In this thesis, we focus on a robust model-fitting algorithm for a family of semi-parametric models, the Generalized Partial Linear Partial Additive Models (GAPLMs), a hybrid of the widely used Generalized Linear Models (GLMs) and Generalized Additive Models (GAMs). Traditional model-fitting algorithms are mainly based on likelihood procedures. However, the resulting fits can be severely distorted by the presence of a small proportion of atypical observations (also known as "outliers") which deviate from the assumed model. Furthermore, traditional model diagnostic methods may also fail to detect outliers. To address these problems systematically, we develop a robust model-fitting algorithm which is resistant to the effect of outliers. Our method combines the backfitting algorithm and the generalized Speckman estimator to fit "partial linear partial additive" models. Instead of using the likelihood-based weights and adjusted response from the generalized local scoring algorithm (GLSA), we apply the robust weights and adjusted response derived from the robust quasi-likelihood proposed by Cantoni and Ronchetti (2001). We also extend previous methods by proposing a model prediction algorithm for GAPLMs. To compare our robust method with the non-robust one given by the R function gam::gam(), which uses the backfitting algorithm and the GLSA, we report the results of a simulation study.
The simulation results show that our robust fit can effectively resist the damage of outliers, and that it performs similarly to the non-robust fit on clean datasets. Moreover, our robust algorithm is observed to be helpful in identifying outliers, by comparing the fitted values with the observed responses. Finally, we apply our method to analyze a global phytoplankton dataset. We interpret the outliers reported by our robust fit with an exploratory analysis, and we see some interesting patterns among those outliers in the dataset. We believe our results can provide more information for related research.

Preface

The dataset in Chapter 5 was collected and cleaned by Daniel G. Boyce, Marlon R. Lewis and Boris Worm. I use their dataset in my thesis with their permission.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Notation
2 GLMs, GAMs and GAPLMs
  2.1 Generalized Linear Models
    2.1.1 Extend the Linear Models
    2.1.2 Model Definitions
    2.1.3 Iterative Weighted Least Square
    2.1.4 Justification of the Iterative Weighted Least Square Algorithm
    2.1.5 Quasi-Likelihood Functions
  2.2 Generalized Additive Models
    2.2.1 Additive Models
    2.2.2 Generalized Additive Models
  2.3 Generalized Partial Linear Partial Additive Models
    2.3.1 Fitting APLMs
    2.3.2 Fitting GAPLMs
3 Robust Estimates and Robust GAPLMs
  3.1 Robust Estimates
    3.1.1 The Influence of Outliers on LS
    3.1.2 Robust Regression M-estimate
  3.2 Robust GLM Regression M-estimates
    3.2.1 Fitting the Robust GLMs
    3.2.2 Computation of the Robust GLMs
  3.3 Robust GAPLMs M-estimates
    3.3.1 Model Prediction
4 Simulation Study
  4.1 Poisson Case
  4.2 Binomial Case
5 Example
  5.1 Study Background
  5.2 Scientific Problem and Results
6 Conclusions and Future Work
Bibliography

List of Tables

4.1 The table of means and standard deviations of MSEs for the Poisson case.
4.2 The table of means and standard deviations of APBs for the Poisson case. The fitted intercept β̂0 of GAM seems biased while its fitted slope seems accurate. This is not surprising since GAM estimates the coefficients in a different way.
4.3 The table of means and standard deviations of CRRs for the Poisson case, MAG=moderate.
4.4 The table of means and standard deviations of CRRs for the Poisson case, MAG=extreme.
4.5 The table of means and standard deviations of MRRs for the Poisson case, MAG=moderate.
4.6 The table of means and standard deviations of MRRs for the Poisson case, MAG=extreme.
4.7 The table of means and standard deviations of RPs for the Poisson case, MAG=moderate.
4.8 The table of means and standard deviations of RPs for the Poisson case, MAG=extreme.
4.9 The table of means and standard deviations of MSEs for the Binomial case.
4.10 The table of means and standard deviations of APBs for the Binomial case. The fitted intercept β̂0 of GAM seems biased while its fitted slope seems accurate. This is not surprising since GAM estimates the coefficients in a different way.
4.11 The table of means and standard deviations of MRRs for the Binomial case.
4.12 The table of means and standard deviations of RPs for the Binomial case.
5.1 The variables in the dataset.

List of Figures

1.1 An example of the effect of outliers. The true surface is in the upper panel. There are two cases of outliers shown in the lower left and lower right panels. The gray points are below the surfaces while the black points are above the surfaces. '+' points are outliers. The fits were calculated by the R function gam() in the package gam.
3.1 A simulated example of the effect of outliers.
The solid line is the LS fit with outliers, the dotted line is the robust fit with the Huber function (c = 1.2) and the dashed line is the LS fit without the 6 outliers. The details of the robust fit are introduced in the following sections.
3.2 The residual plots of the standardized residuals vs. fitted values, Q-Q plots of the standardized residuals and the Cook's distance plots of the two cases in Figure 3.1. The upper 3 plots are based on the left panel of Figure 3.1 while the lower 3 are based on the right panel. In the residual/Q-Q plots, the solid points are the outliers while the hollow ones are "good" data. The indices of the outliers are 16, 39, 36, 26, 28, 40.
4.1 The plane of covariates: the solid points are central points while the hollow points are marginal points.
4.2 Examples of simulated datasets with Poisson response: MAG=moderate, LOC=center and ν = 0.2. The solid points are "good" responses while '+' points are outliers. The gray points are below the surface while black points are above the surfaces. The bandwidth parameter is chosen by 5-fold CV based on uncontaminated datasets.
4.3 Examples of simulated datasets with Poisson response: MAG=moderate, LOC=margin and ν = 0.2. The solid points are "good" responses while '+' points are outliers. The gray points are below the surface while black points are above the surfaces. The bandwidth parameter is chosen by 5-fold CV based on uncontaminated datasets.
4.4 The plot of MSEs and APBs of β̂1 with MAG=moderate for the three fits. Solid line: RGAPLM fit; dashed line: GAM fit; dotted line: GAPLM fit.
4.5 The plot of CRRs and MRRs with δ = 5 × 10^{-4} and MAG=moderate for the three fits.
Solid line: RGAPLM fit; dashed line: GAM fit; dotted line: GAPLM fit.
4.6 Examples of simulated datasets with binomial response: LOC=center and ν = 0.2. The solid points are "good" responses while '+' points are outliers. The gray points are below the surface while black points are above the surfaces. The bandwidth parameter is chosen by 5-fold CV based on uncontaminated datasets.
4.7 Examples of simulated datasets with binomial response: LOC=margin and ν = 0.2. The solid points are "good" responses while '+' points are outliers. The gray points are below the surface while black points are above the surfaces. The bandwidth parameter is chosen by 5-fold CV based on uncontaminated datasets.
5.1 Figure 3a in Boyce, Lewis and Worm (2010): the ocean regions used to estimate the regional trends of Chl.
5.2 The fitted marginal additive effects of Year, as both a categorical variable and a continuous variable, for the Southern Pacific region. In the discrete case, the baseline year's effect was estimated by the transformed mean of that year, β̂11 = g(Chl̄_baseYear). The partial residuals are calculated as f̂(Year_i) + (Chl_i − Ĉhl_i)/√Ĉhl_i. The bold points are the discrete Year effects, the small solid points are partial residuals and the hollow round points are partial residuals of reported outliers.
5.3 The map of observations in the Southern Pacific at each level of Bath. Black points are the reported outliers of the robust fit in the continuous Year case.
5.4 The map of observations in the Southern Pacific in each season. Black points are the reported outliers of the robust fit in the continuous Year case.
5.5 The fitted marginal additive effects of Year, as both a categorical variable and a continuous variable, for the Southern Indian region. In the discrete case, the baseline year's effect was estimated by the transformed mean of that year, β̂11 = g(Chl̄_baseYear). The partial residuals are calculated as f̂(Year_i) + (Chl_i − Ĉhl_i)/√Ĉhl_i. The bold points are the discrete Year effects, the small solid points are partial residuals and the hollow round points are partial residuals of reported outliers.
5.6 The map of observations in the Southern Indian region at each level of Bath. Black points are the reported outliers of the robust fit in the continuous Year case.
5.7 The map of observations in the Southern Indian region in each season. Black points are the reported outliers of the robust fit in the continuous Year case.

Acknowledgements

Foremost, I would like to express my gratitude to my supervisor, Professor Matías Salibián-Barrera, for his insightful guidance over the last two years. His valuable comments and inspiration greatly helped me to improve the work in this thesis. Also, I am grateful to Professor Gabriela V. Cohen Freue for serving on my examining committee and for her suggestions. Moreover, I owe my thanks to all the professors and staff members of the Department of Statistics. Their efforts provided me with an enjoyable and precious working and studying experience in this big family. Finally, I would like to give my deep appreciation to my family for their support and encouragement during my studies. This thesis is dedicated to them.

Chapter 1

Introduction

In statistical studies, regression models play a critical role in exploring the variation in the response (or dependent variable) Y with some covariates (or independent variables, predictors) X1, X2, ..., Xp.
The essential approach of regression modelling is to assume that a function g : R → R of the conditional mean of Y given the predictors equals a multivariate function G : R^p → R of X1, X2, ..., Xp,

µ = E(Y |X1, X2, ..., Xp), g(µ) = G(X1, X2, ..., Xp). (1.1)

Depending on the type of the function G, we classify regression models of the form (1.1) into three categories: parametric, nonparametric and semi-parametric. In a parametric regression model, G(X1, X2, ..., Xp; θ) is a parametric multivariate function defined by the arguments X1, X2, ..., Xp and several parameters θ ∈ R^q. The unknown parameters θ determine the shape of the regression function G(X1, X2, ..., Xp). In a nonparametric regression model, G(X1, X2, ..., Xp) does not have an exact functional form. Instead, G(X1, X2, ..., Xp) is sometimes merely assumed to be a smooth function. A semi-parametric regression model is a hybrid of parametric and nonparametric models, so that the function G has both parametric and nonparametric components.

Being theoretically elegant and intuitively interpretable, the linear regression model (LM) might be the most commonly used parametric regression model. In a LM, the function G is a linear combination of the covariates and g is the identity function,

µ = E(Y |X1, X2, ..., Xp) = β0 + β1 X1 + β2 X2 + ... + βp Xp = η. (1.2)

In (1.2), β = θ = (β0, β1, β2, ..., βp) are the unknown linear coefficients which are estimated in the regression model. Strictly speaking, the G function of a LM only needs to be linear in the coefficients β; without loss of generality, we take G to be a linear combination of the covariates. Even though parametric models such as LMs enjoy appealing properties such as easy interpretation and estimation accuracy when the assumptions on G and g are correct, it is sometimes impossible to choose the most appropriate form of the function G just by looking at the data.
In cases where the assumed parametric function G(X1, X2, ..., Xp; θ) fails to capture the effect of the covariates on the response, nonparametric models can be applied. Additive Models (AMs) generalize LMs by replacing the linear form Σ_{j=1}^p βj Xj with a sum of unspecified univariate smooth functions fj : R → R of the covariates,

E(Y |X1, X2, ..., Xp) = f0 + f1(X1) + f2(X2) + ... + fp(Xp), (1.3)

where f0 ∈ R is the intercept term. Since no parametric form of G(X1, ..., Xp) is assumed, a model specified by (1.3) is more flexible than a LM, and it avoids the curse of dimensionality by assuming an additive form of G(X1, ..., Xp) rather than some other multivariate form.

In situations where we need both easy interpretability and flexibility, one option is to use semi-parametric models. For example, suppose there are two groups of covariates, X1, X2, ..., Xh and T1, T2, ..., Tk. In our regression model, the Xj, j = 1, 2, ..., h, are the predictors of interest while the Tl, l = 1, 2, ..., k, are just nuisance variables. In this context, we wish to model the Xj linearly and to assume the Tl have nonparametric additive effects on the conditional expectation of Y. Partial Linear Partial Additive Models (APLMs) combine the features of AMs and LMs by adding nonparametric terms to the linear components in (1.2). The name "(Generalized) Partial Linear Partial Additive Model" was introduced in Härdle et al. (2004) and we adopt it in this thesis. With the two groups of covariates X1, X2, ..., Xh and T1, T2, ..., Tk, the definition of an APLM is

µ = E(Y |X1, ..., Xh, T1, ..., Tk) = β0 + Σ_{j=1}^h βj Xj + Σ_{l=1}^k fl(Tl) = η. (1.4)

By definition (1.4), APLMs are semi-parametric models, and it is easy to see that (1.2)-(1.3) are special cases of APLMs. The formulae (1.2)-(1.4) show that the expectation of the response Y given the covariates can take any real value.
However, sometimes we need to restrict the range of this conditional expectation µ, or we have to assume a distribution of Y with support other than (−∞, +∞). Generalized Linear Models (GLMs) extend LMs, enabling us to deal with a large distribution family of responses Y. In a GLM, Y is assumed to have a distribution in the exponential family, and this family covers many distributions such as the Normal, Poisson, Binomial and Gamma. The fundamental feature of a GLM is that the conditional expectation µ = E(Y |X1, X2, ..., Xp) is transformed by a monotone function g to the linear predictor η defined in (1.2),

g(E(Y |X1, X2, ..., Xp)) = β0 + β1 X1 + β2 X2 + ... + βp Xp = η. (1.5)

Generalized Additive Models (GAMs) extend AMs in the same way that GLMs extend LMs. Through this extension, GAMs are able to model responses in the exponential family without parametric assumptions,

g(E(Y |X1, X2, ..., Xp)) = f0 + f1(X1) + f2(X2) + ... + fp(Xp). (1.6)

Following the same kind of extension, Generalized Partial Linear Partial Additive Models (GAPLMs) have a more general form than APLMs,

g(E(Y |X1, ..., Xh, T1, ..., Tk)) = β0 + Σ_{j=1}^h βj Xj + Σ_{l=1}^k fl(Tl). (1.7)

So GAPLMs can model semi-parametric relations between the response and the covariates within the exponential family.

Although the above regression models usually work well, it has been observed that a small proportion of "atypical" observations which deviate from the assumed model can have severe effects on model fitting. Such observations are called outliers, and they follow a different model or process; they often arise, for example, from faulty measurement, typing mistakes or misplaced decimal points. Outliers can be quite pervasive in quantitative research, causing adverse influence on model fitting.
Outlying responses are often much larger or smaller than the responses of observations with similar covariates. In regression, outlying covariates, which lie far away from the majority of the observed covariates, are called high leverage points. High leverage points are known to be very influential on the fit of a regression model; see Maronna, Martin and Yohai (2006). Since classical regression methods are usually based on optimizing loss functions which are sensitive to outliers, outliers distort the fit by forcing the optimization to sacrifice the fitting accuracy of the "good" observations in order to compensate for the damage of the outliers. As a result, the estimated parameters or smooth terms deviate from the true model, and the fitted model might favour the outliers, making the "good" observations seem atypical. The phenomenon where outliers distort the fitted model so much that the outliers' residuals become small while many "good" observations' residuals become large is called masking.

In some simple cases it is possible to spot and delete outliers by checking scatter plots. But that approach is subjective, and outliers are no longer obvious in higher-dimensional cases. For example, if p = 3, it is impossible to visualize the observations in a 4-dimensional space with scatter plots. A naive alternative is to draw the observations in marginal scatter plots, such as the 2- or 3-dimensional plots of Y vs X1, Y vs X2, and Y vs (X1, X2). However, an outlier might look like a "good" point in those marginal scatter plots.

A widely used GAPLM fitting algorithm is the generalized local scoring algorithm (GLSA), proposed by Hastie and Tibshirani (1986). However, it is also sensitive to outliers. To illustrate this problem, consider the following example,

Y |X, T ∼ Poisson(λ(X, T)),
log λ(X, T) = 0.05 + 0.02X² + 0.002 sin((T − 100)/20) exp(−|T − 100|/30).

Outliers are introduced by replacing some of the responses.
In the first case, we pick only points in the central region of the covariate plane (xi, ti); in the second case, we pick only points in the marginal region of the covariate plane. The details of this example are introduced in Chapter 4. All the picked observations are replaced by outlying values much larger than their original values. We use the R function gam() of the package gam to fit the GAPLM, with locally weighted polynomial regression (LOESS) as the smoother, to show the poor performance of the GLSA. The bandwidths of LOESS are chosen by leave-one-out cross-validation. The results are in Figure 1.1. From the lower panels of Figure 1.1, we can see that the outliers pull the fitted surfaces towards them, distorting the fit.

[Figure 1.1: An example of the effect of outliers. The true surface is in the upper panel. Two cases of outliers are shown in the lower left and lower right panels. The gray points are below the surfaces while the black points are above the surfaces; '+' points are outliers. The fits were calculated by the R function gam() in the package gam.]

Since classical methods are not suitable for dealing with outliers, new statistical methods have been developed to produce good fits on datasets with or without outliers. These methods are called robust methods (or resistant methods). Some robust methods for GLMs and GAMs are based on the works of Bianco and Yohai (1996), Cantoni and Ronchetti (2001), and Alimadad and Salibian-Barrera (2011).

This thesis is organized as follows. The background of (G)LMs, (G)AMs and (G)APLMs is introduced in Chapter 2, where we briefly review the model construction, classical model fitting algorithms and quasi-likelihood models. In Chapter 3, some basic knowledge of robust estimates and the derivation of the robust regression M-estimate are presented. We illustrate how we extend the robust algorithm for GLMs to GAPLMs and how we make predictions with our algorithm. A simulation study is reported and discussed in Chapter 4; in this study, we comprehensively compare the behaviour of the robust and classical approaches on data with and without outliers.
We conclude that, on datasets with Poisson or binomial responses, our robust algorithm performs similarly to the non-robust algorithms. However, the robust fit is better than the non-robust fit in terms of MSE, estimation bias and outlier reporting in the presence of outliers. Chapter 5 provides a real-data example which illustrates the application of our proposed algorithm. We compare and interpret the results of the robust and non-robust fits. After that, we report outliers in the dataset based on the robust fit and perform an exploratory analysis of the reported outliers. Chapter 6 summarizes the conclusions and limitations of this thesis, and proposes topics for future study.

1.1 Notation

Here we provide an overview of the notation used in the following chapters. Random variables are written in upper case. For example, the response random variable is Y, the parametric covariates are X1, X2, ..., Xh and the nonparametric covariates are T1, T2, ..., Tk. Realizations of both the response and the covariates are denoted by lower case letters. Vectors are written in bold characters. The vectors of all 1s and all 0s are denoted by 1 and 0, respectively. Suppose we have n realizations with which to fit a GAPLM; our sample is

y = (y1, y2, ..., yn)^T ∈ R^n,
(x^(1), x^(2), ..., x^(h)) = (xij)_{n×h} ∈ R^{n×h},
(t^(1), t^(2), ..., t^(k)) = (tij)_{n×k} ∈ R^{n×k}.

Note that in the population case, X = (X1, X2, ..., Xp)^T denotes the random vector of covariates, while in the sample case,

X = (xij)_{n×(p+1)} = (x1^T, x2^T, ..., xn^T)^T = (1, x^(1), x^(2), ..., x^(p))

is the design matrix, whose first column is 1 and whose i-th row is xi^T = (1, xi1, xi2, ..., xip). Matrices are denoted by bold upper case letters (their elements are usually the corresponding lower case Latin letters with subscripts) and model parameters are usually expressed with Greek letters.
For example, a matrix is written as A = (aij)_{p×q} ∈ R^{p×q} and the linear coefficient vector is β = (β0, β1, β2, ..., βp)^T. A linear predictor has the form xi^T β = β0 + Σ_{j=1}^p xij βj, i = 1, 2, ..., n, while an additive predictor has the form f0 + Σ_{j=1}^k fj(tij), where fj : R → R is an arbitrary smooth function. Expectation is denoted by E(·) and variance by var(·). A log-likelihood is denoted by l(θ; y, x).

Chapter 2

GLMs, GAMs and GAPLMs

In this chapter, we give a brief review of Generalized Linear Models, Generalized Additive Models, Generalized Partial Linear Partial Additive Models and quasi-likelihood functions. The content covers the model definitions and model fitting algorithms.

2.1 Generalized Linear Models

2.1.1 Extend the Linear Models

One purpose of regression models is to explain the relation between the response and the covariates in a dataset. In a regression model, there are usually two kinds of variation: systematic variation and random variation. Covariates are used to explain the systematic variation in the response, while the random variation expresses the uncertainty in the response. Depending on the type of response, the random variation can be assumed to follow a certain distribution. For example, we often assume that the random errors of a continuous response variable follow a Normal distribution. In most cases, we are interested in modelling the conditional mean of the response; so, as shown in (1.1), we take the conditional mean given the covariates as the systematic part and model it as a function of the covariates.

Generalized linear models were introduced by Nelder and Wedderburn (1972), and they are an extension of Linear Models (LMs). Linear Models are often used to model continuous response variables on the real axis (−∞, +∞). Based on (1.2), a LM has the form

Yi = β0 + β1 Xi1 + β2 Xi2 + ... + βp Xip + εi, for i = 1, 2, ..., n, E(εi) = 0, (2.1)

where n is the number of observations.
The random errors εi are assumed to be mutually independent. The linear coefficients βj are unknown parameters, to be estimated from the observations with a suitable algorithm. Least squares (LS) is a natural estimation method; LS minimizes the sum of squared errors,

β̂ = arg min_β Σ_{i=1}^n (yi − β0 − Σ_{j=1}^p βj xij)².

As shown in (2.1), the conditional mean of the response is assumed to be a linear combination of the covariates. This form naturally provides a way to predict the response with β̂ and covariates x = (1, x1, x2, ..., xp)^T,

Ŷ = µ̂ = β̂0 + Σ_{j=1}^p β̂j xj.

Sometimes we assume that the εi are normal random variables with constant variance, εi iid ∼ N(0, σ²). Then the response variable follows a Normal distribution whose conditional mean is of linear form given the covariates. Under this assumption, the LS estimate β̂ coincides with the MLE.

However, sometimes we need to restrict the range of the response Y, or we are modelling discrete response variables, and LMs no longer satisfy our needs. For example, if the response is the total number of customers of a Starbucks in a certain time period, the response can only take integer values. In this case, the mean of the response is positive, and it is improper to assume a continuous distribution such as the Normal for a count variable. Instead, the response is often assumed to be a Poisson random variable, based on the theory of Poisson processes. For another example, even when the response is continuous, if it can only take positive values it is still improper to assume a linear model in which the linear-form mean can take negative values. In Chapter 5, an analysis of a real dataset with a positive response variable is presented. For this dataset, we assume a Gamma distribution for the positive response given the covariates. GLMs are capable of fitting a larger class of responses.
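To see how a link function keeps the mean in its valid range, here is a small illustrative sketch of ours (not code from the thesis): the linear predictor η may be any real number, while the inverse link maps it into the admissible range of the mean.

```python
import numpy as np

def inverse_link(eta, link):
    """Map a linear predictor eta to the mean scale, mu = g^{-1}(eta)."""
    if link == "identity":   # LMs/AMs: the mean ranges over the whole real line
        return eta
    if link == "log":        # Poisson-type counts: mu = exp(eta) > 0
        return np.exp(eta)
    if link == "logit":      # Bernoulli/binomial: mu = 1/(1+exp(-eta)) in (0, 1)
        return 1.0 / (1.0 + np.exp(-eta))
    raise ValueError(f"unknown link: {link}")

eta = np.array([-3.0, 0.0, 3.0])    # the linear predictor is unconstrained
print(inverse_link(eta, "log"))      # strictly positive means
print(inverse_link(eta, "logit"))    # means constrained to (0, 1)
```

Whatever value η takes, the log link always yields a positive mean (suitable for the Starbucks count example) and the logit link always yields a probability in (0, 1).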
They generalize LMs in the following two ways (in the random component and the systematic component):

• In a GLM, the response given the covariates is assumed to follow a distribution in the exponential family. The exponential family covers both continuous distributions (e.g. Gaussian, Gamma) and discrete ones (e.g. Bernoulli, Poisson). So GLMs can be applied to the foregoing examples to model counts and positive values.

• Since the conditional mean of the response may not range over the whole real axis (−∞, +∞), the linear predictor ηi = β0 + Σ_{j=1}^p βj xij has to be transformed to the possible range of the mean. So in a GLM, ηi is transformed to the mean with a differentiable monotonic function,

ηi = g(µi), or µi = g^{-1}(ηi). (2.2)

In (2.2), g is called the link function; for a LM, the link function is the identity function.

2.1.2 Model Definitions

In a GLM, the response Y given the covariates is assumed to follow a distribution in the exponential family, which is defined by the probability density function or probability mass function

fY(y; θ, φ) = exp{ (yθ − b(θ))/a(φ) + c(y, φ) }, (2.3)

for some specific functions a, b and c; θ ∈ R is called the canonical parameter. The exponential family covers many commonly used distributions. For instance, we can rewrite the pdf of the Normal distribution in the form of (2.3),

fY(y; θ, φ) = (1/√(2πσ²)) exp(−(y − µ)²/(2σ²))
            = exp{ (yµ − µ²/2)/σ² − (y²/σ² + log(2πσ²))/2 }, (2.4)

so that θ = µ, φ = σ² and

a(φ) = φ, b(θ) = θ²/2, c(y, φ) = −(y²/σ² + log(2πσ²))/2.

In (2.3), a(φ) often has the form a(φ) = φ/ω. In this case, φ is called the dispersion parameter, which is constant over the observations, while ω is a known prior weight that can differ from observation to observation. For example, in a normal model where each observation is the mean of m independent variables, a(φ) = φ/m and ω = m.
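As a second worked instance of (2.3), alongside the Normal example above (this check is ours, not from the thesis), the Poisson(λ) probability mass function can be put in exponential-family form:

```latex
f_Y(y;\lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!}
             = \exp\left\{ y \log\lambda - \lambda - \log y! \right\},
```

so that θ = log λ, b(θ) = e^θ, a(φ) = 1 and c(y, φ) = −log y!. Since b′(θ) = b″(θ) = e^θ = λ, both the mean and the variance of a Poisson response equal λ.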
Based on (2.3), the mean and variance of Y can also be derived; they are

$$E(Y) = b'(\theta), \qquad \mathrm{var}(Y) = b''(\theta)\, a(\phi). \qquad (2.5)$$

The following shows how (2.5) is derived. Write $l(\theta, \phi; y) = \log f_Y(y; \theta, \phi)$ for the log likelihood of $\theta$ and $\phi$ given y,

$$l(\theta, \phi; y) = \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi).$$

The first and second derivatives of $l(\theta, \phi; y)$ with respect to $\theta$ are

$$\frac{\partial l}{\partial \theta} = \frac{y - b'(\theta)}{a(\phi)}, \qquad (2.6)$$
$$\frac{\partial^2 l}{\partial \theta^2} = \frac{-b''(\theta)}{a(\phi)}. \qquad (2.7)$$

An exponential family random variable Y has the following properties,

$$E\Big( \frac{\partial l}{\partial \theta} \Big) = 0, \qquad (2.8)$$
$$E\Big( \frac{\partial^2 l}{\partial \theta^2} \Big) + E\Big( \frac{\partial l}{\partial \theta} \Big)^2 = 0. \qquad (2.9)$$

The proof of (2.8)-(2.9) is based on the fact that the exponential family satisfies the regularity conditions which allow us to exchange the order of integration and differentiation. If we denote the cdf of the response Y as $F_Y(y; \theta, \phi)$, then

$$E\Big( \frac{\partial l}{\partial \theta} \Big) = E\Big( \frac{\frac{\partial}{\partial\theta} f_Y(Y; \theta, \phi)}{f_Y(Y; \theta, \phi)} \Big) = \int_{-\infty}^{+\infty} \frac{\frac{\partial}{\partial\theta} f_Y(y; \theta, \phi)}{f_Y(y; \theta, \phi)} \, dF_Y(y; \theta, \phi) = \frac{\partial}{\partial\theta}\, 1 = 0,$$

and

$$E\Big( \frac{\partial^2 l}{\partial \theta^2} \Big) = \int_{-\infty}^{+\infty} \frac{\frac{\partial^2}{\partial\theta^2} f_Y \cdot f_Y - \big( \frac{\partial}{\partial\theta} f_Y \big)^2}{f_Y^2} \, dF_Y = \frac{\partial^2}{\partial\theta^2} \int_{-\infty}^{+\infty} f_Y \, dy - \int_{-\infty}^{+\infty} \Big( \frac{\partial}{\partial\theta} \log f_Y \Big)^2 dF_Y = 0 - E\Big( \frac{\partial l}{\partial \theta} \Big)^2.$$

Taking expectations of (2.6)-(2.7) and plugging them into (2.8)-(2.9), it is easy to get $E(Y) = b'(\theta) = \mu$ and

$$\frac{-b''(\theta)}{a(\phi)} + \frac{E\big[ (Y - \mu)^2 \big]}{a^2(\phi)} = \frac{-b''(\theta)}{a(\phi)} + \frac{\mathrm{var}(Y)}{a^2(\phi)} = 0.$$

The equations in (2.5) indicate that the conditional mean $\mu$ only depends on $\theta$, and that the variance $\mathrm{var}(Y)$ is determined by $\mu$ and $\phi$:

$$\mathrm{var}(Y) = a(\phi)\, b''(\theta) = a(\phi)\, V(\mu), \qquad (2.10)$$

where the function V is called the variance function. From (2.10), we see that in a GLM the variance is not a free parameter: it depends on $\mu$.
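As a further worked example, following the same steps as the Normal case above (this example is added here for illustration; it is not part of the original derivation), the Poisson distribution can also be put in the form (2.3):

```latex
f_Y(y; \mu) = \frac{e^{-\mu}\mu^{y}}{y!}
            = \exp\{\, y\log\mu - \mu - \log y! \,\},
```

so that $\theta = \log\mu$, $b(\theta) = e^{\theta}$, $a(\phi) = 1$ and $c(y, \phi) = -\log y!$. Applying (2.5) gives $E(Y) = b'(\theta) = e^{\theta} = \mu$ and $\mathrm{var}(Y) = b''(\theta) a(\phi) = \mu$, i.e. the variance function in (2.10) is $V(\mu) = \mu$, recovering the familiar Poisson mean-variance relation.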
The link function g is often a monotonically increasing function, which is consistent with LMs, whose link function is the identity. In the case $\eta = \theta$, the link function is called the canonical link function. For models with a canonical link function, some theoretical and practical problems are easier to solve.

2.1.3 Iterative Weighted Least Square

GLMs are defined by

$$Y \mid X_1, X_2, \ldots, X_p \ \text{has a distribution in the exponential family,} \qquad (2.11a)$$
$$g(\mu) = g\big( E(Y \mid X_1, X_2, \ldots, X_p) \big) = \eta = \beta_0 + \sum_{j=1}^{p} X_j \beta_j. \qquad (2.11b)$$

To obtain the maximum likelihood estimates of the linear coefficients $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)$, we maximize the log likelihood with respect to $\beta$,

$$\hat\beta = \arg\max_{\beta} \sum_{i=1}^{n} l(\beta; y_i, x_i).$$

It is shown in the next section that the maximum likelihood estimate of $\beta$ can be obtained by iterative weighted least square (IRWLS) with adjusted dependent variables and the same covariates. The process is iterative because the weights are updated from the fitted values in each iteration. The algorithm is as described in the following paragraphs.

Suppose the current estimated linear coefficient is $\hat\beta^{old}$. The estimated linear predictor is $\hat\eta^{old} = X\hat\beta^{old}$, where X is the design matrix $X = (1, x^{(1)}, x^{(2)}, \ldots, x^{(p)}) \in \mathbb{R}^{n\times(p+1)}$, and the fitted values are $\hat\mu^{old} = (g^{-1}(\hat\eta_1^{old}), g^{-1}(\hat\eta_2^{old}), \ldots, g^{-1}(\hat\eta_n^{old}))^T$. Use those values to calculate the adjusted dependent variables $z^{old} = (z_1^{old}, z_2^{old}, \ldots, z_n^{old})^T$,

$$z_i^{old} = \hat\eta_i^{old} + (y_i - \hat\mu_i^{old}) \frac{d\eta}{d\mu}(\hat\mu_i^{old}), \qquad i = 1, 2, \ldots, n, \qquad (2.12)$$

and the weights $w^{old} = (w_1^{old}, w_2^{old}, \ldots, w_n^{old})^T$,

$$w_i^{old} = \Big[ \Big( \frac{d\eta}{d\mu}(\hat\mu_i^{old}) \Big)^2 V(\hat\mu_i^{old}) \Big]^{-1}, \qquad i = 1, 2, \ldots, n. \qquad (2.13)$$

In (2.12)-(2.13), $\frac{d\eta}{d\mu}(\hat\mu_i^{old})$ is the derivative of $\eta$ with respect to $\mu$, evaluated at $\hat\mu_i^{old}$.
After that, apply a weighted least square of $z^{old}$ with weights $w^{old}$ and design matrix X; the updated linear coefficients are

$$\hat\beta^{new} = \big( X^T \mathrm{Diag}(w^{old}) X \big)^{-1} X^T \mathrm{Diag}(w^{old})\, z^{old},$$

where $\mathrm{Diag}(w^{old})$ is the diagonal matrix whose diagonal entries are $w^{old}$. We continue this iterative process with $\hat\beta^{new}$ until the convergence criterion is met. Since understanding this algorithm is important for the derivation of the robust estimates in the next chapter, we show its derivation in the following section.

2.1.4 Justification of the Iterative Weighted Least Square Algorithm

Since the model fitting is based on maximizing the log likelihood $l(\beta; y, X) = \sum_{i=1}^{n} l(\beta; y_i, x_i)$, the MLE of $\beta$ in a GLM is the solution of the nonlinear estimating equations

$$\frac{\partial l}{\partial \beta}(\hat\beta) = (0, 0, \ldots, 0)^T.$$

To solve these equations, the Newton-Raphson method and the Fisher scoring algorithm are used. Newton-Raphson provides the updating step

$$\hat\beta^{new} = \hat\beta^{old} - \Big[ \frac{\partial^2 l}{\partial\beta\, \partial\beta^T}(\hat\beta^{old}) \Big]^{-1} \frac{\partial l}{\partial\beta}(\hat\beta^{old}), \qquad (2.14)$$

where $\frac{\partial^2 l}{\partial\beta\, \partial\beta^T}(\hat\beta^{old})$ is the Hessian matrix of $l(\beta; y)$. The Fisher scoring algorithm substitutes the Hessian in (2.14) with the negative Fisher information matrix,

$$-E\Big[ \frac{\partial l}{\partial\beta}(\hat\beta^{old}) \frac{\partial l}{\partial\beta^T}(\hat\beta^{old}) \Big], \qquad (2.15)$$

so that (2.14) becomes

$$\hat\beta^{new} = \hat\beta^{old} + \Big\{ E\Big[ \frac{\partial l}{\partial\beta}(\hat\beta^{old}) \frac{\partial l}{\partial\beta^T}(\hat\beta^{old}) \Big] \Big\}^{-1} \frac{\partial l}{\partial\beta}(\hat\beta^{old}).$$

First of all, to obtain the score function $\frac{\partial l}{\partial\beta}$ we must use the chain rule, because the probability density function is defined only in terms of $\theta_i$ and $\phi$, $i = 1, 2, \ldots, n$. The partial derivative of $l(\beta; y, x)$ with respect to the j-th coefficient $\beta_j$, $j = 0, 1, \ldots, p$, is

$$\frac{\partial l}{\partial \beta_j} = \frac{\partial l}{\partial \theta} \frac{\partial \theta}{\partial \mu} \frac{\partial \mu}{\partial \eta} \frac{\partial \eta}{\partial \beta_j}.$$

From the previous results, $\mu = b'(\theta)$ and $V = b''(\theta)$; $\partial l/\partial\theta$ is given by (2.6); and since $\eta = \sum_{j=0}^{p} \beta_j x_j$, $\partial\eta/\partial\beta_j = x_j$. We have

$$\frac{\partial l}{\partial \beta_j} = \frac{y - \mu}{a(\phi)} \frac{1}{V} \frac{d\mu}{d\eta}\, x_j = \frac{w}{a(\phi)}\, (y - \mu) \frac{d\eta}{d\mu}\, x_j,$$

where $w^{-1} = \big( \frac{d\eta}{d\mu} \big)^2 V$ has the same form as (2.13).
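To make the IRWLS update concrete, here is a minimal pure-Python sketch for a Poisson GLM with the canonical log link, where $d\eta/d\mu = 1/\mu$ and $V(\mu) = \mu$, so that $z_i = \eta_i + (y_i - \mu_i)/\mu_i$ and $w_i = \mu_i$. The data and the two-parameter model are illustrative assumptions, not taken from the thesis:

```python
import math

# Illustrative data: one covariate and Poisson-like counts.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [1, 1, 2, 4, 5, 9, 13, 20]
n = len(y)

mu = [yi + 0.5 for yi in y]          # initial fitted means
eta = [math.log(m) for m in mu]      # eta = g(mu) with the log link

for _ in range(50):
    # Adjusted response (2.12) and weights (2.13), log link case.
    z = [eta[i] + (y[i] - mu[i]) / mu[i] for i in range(n)]
    w = mu[:]                        # w = [(deta/dmu)^2 V(mu)]^{-1} = mu
    # Weighted least squares of z on (1, x): solve the 2x2 normal equations.
    sw = sum(w)
    swx = sum(w[i] * x[i] for i in range(n))
    swxx = sum(w[i] * x[i] ** 2 for i in range(n))
    swz = sum(w[i] * z[i] for i in range(n))
    swxz = sum(w[i] * x[i] * z[i] for i in range(n))
    det = sw * swxx - swx ** 2
    b1 = (sw * swxz - swx * swz) / det
    b0 = (swz - swx * b1) / sw
    eta = [b0 + b1 * xi for xi in x]
    mu = [math.exp(e) for e in eta]

# With the canonical link, the score equations (2.16) reduce to
# sum(y - mu) = 0 and sum((y - mu) * x) = 0 at the MLE.
score0 = sum(y[i] - mu[i] for i in range(n))
score1 = sum((y[i] - mu[i]) * x[i] for i in range(n))
```

At convergence both score sums are numerically zero, which is exactly the check that the iterative reweighting has reached the MLE of the next section's estimating equations.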
Notice that since the denominator $a(\phi)$ is a constant, it does not affect the maximum likelihood estimate of $\beta$. For convenience, we drop $a(\phi)$ in the following formulae. The estimating equations are

$$\frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{n} w_i (y_i - \mu_i) \frac{d\eta_i}{d\mu_i}\, x_{ij} = 0 \qquad \text{for } j = 0, 1, \ldots, p. \qquad (2.16)$$

At the population level, the (j, s) element of the Fisher information matrix is

$$E\Big( \frac{\partial l}{\partial\beta_j} \frac{\partial l}{\partial\beta_s} \Big) = E\Big[ \Big( \sum_{i=1}^{n} w_i (Y_i - \mu_i) \frac{d\eta_i}{d\mu_i}\, x_{ij} \Big) \Big( \sum_{i=1}^{n} w_i (Y_i - \mu_i) \frac{d\eta_i}{d\mu_i}\, x_{is} \Big) \Big].$$

Since for $q \neq l$, $Y_q$ and $Y_l$ are independent and each factor has mean zero by (2.8),

$$E\Big[ w_q (Y_q - \mu_q) \frac{d\eta_q}{d\mu_q}\, x_{qj} \Big]\, E\Big[ w_l (Y_l - \mu_l) \frac{d\eta_l}{d\mu_l}\, x_{ls} \Big] = 0,$$

the (j, s) element of the Fisher information matrix becomes

$$E\Big( \frac{\partial l}{\partial\beta_j} \frac{\partial l}{\partial\beta_s} \Big) = \sum_{i=1}^{n} E\Big[ w_i^2 (Y_i - \mu_i)^2 \Big( \frac{d\eta_i}{d\mu_i} \Big)^2 \Big] x_{ij} x_{is} = \sum_{i=1}^{n} w_i\, x_{ij} x_{is}. \qquad (2.17)$$

By checking the form of (2.17), it is easy to see that (2.17) is the (j, s) element of the matrix $X^T \mathrm{Diag}(w) X$, where $w = (w_1, w_2, \ldots, w_n)$. Based on (2.14), (2.15) and (2.17), if we denote $X^T \mathrm{Diag}(w^{old}) X$ as $A^{old}$ and $v^{old} = z^{old} - \eta^{old}$, the updating procedure is

$$\hat\beta^{new} = \hat\beta^{old} - \Big[ \frac{\partial^2 l}{\partial\beta\, \partial\beta^T}(\hat\beta^{old}) \Big]^{-1} \frac{\partial l}{\partial\beta}(\hat\beta^{old}) = \hat\beta^{old} + (A^{old})^{-1} X^T \mathrm{Diag}(w^{old})\, v^{old}$$
$$= \hat\beta^{old} + (A^{old})^{-1} X^T \mathrm{Diag}(w^{old}) (z^{old} - X\hat\beta^{old}) = \big( X^T \mathrm{Diag}(w^{old}) X \big)^{-1} X^T \mathrm{Diag}(w^{old})\, z^{old}. \qquad (2.18)$$

(2.18) is the iterative algorithm introduced in Section 2.1.3.

2.1.5 Quasi-Likelihood Functions

The model fitting algorithm in the previous sections is based on the likelihood, which requires us to know the random mechanism by which the data are generated. We construct the likelihood of the model by assuming a certain distribution for the response Y and by assuming how the distribution parameters are linked to the covariates $X_1, X_2, \ldots, X_p$; for example, in a GLM we assume $g(\mu) = X^T\beta$. The likelihood function of the distribution parameters $\theta$ (in GLMs, $\theta = \beta$) given the response y is $L(\theta; y) = f_Y(y; \theta)$.
The MLE of $\theta$ is obtained by maximizing the likelihood function with respect to $\theta$,

$$\hat\theta = \arg\max_{\theta \in \Omega} L(\theta; y),$$

where $\Omega$ is the parameter space. Directly maximizing the log likelihood $l(\theta; y) = \log L(\theta; y)$ is one way to fit the model. Sometimes the optimization can also be achieved by solving the estimating equations

$$S(\theta; y) = \frac{\partial}{\partial\theta} l(\theta; y) = 0,$$

where $S(\theta; y)$ is called the score function. Whether the solutions of the estimating equations are global maximizers has to be checked afterwards. Based on (2.8)-(2.9), for the exponential family the score function $S(\mu; y)$ has three properties,

$$E\big( S(\mu; Y) \big) = 0, \qquad (2.19a)$$
$$-E\Big( \frac{\partial S(\mu; Y)}{\partial\mu} \Big) = E\big( S^2(\mu; Y) \big), \qquad (2.19b)$$
$$\mathrm{var}\big( S(\mu; Y) \big) = E\big( S^2(\mu; Y) \big) = \frac{1}{a(\phi) V(\mu)}. \qquad (2.19c)$$

One desirable feature of GLMs is that the estimating function (2.16) is defined only through the mean and variance of the response Y. Often there is no theory about the probabilistic mechanism of the data. In such cases the likelihood approach is inappropriate, since we lack the information needed to assume a distribution for the dependent variable. However, we are actually interested in how the response's mean is affected by the covariates, and sometimes we can specify how the response's variability changes with its mean. Wedderburn (1974) proposed the quasi-likelihood function, which enables us to make statistical inference without making distributional assumptions. Like the score functions of GLMs, the quasi-likelihood function only requires us to specify how the response's mean is related to the covariates and how the response's variance is related to the mean. The variance is assumed to be either equal or proportional to some known function of the mean, $\mathrm{var}(Y) = \sigma^2 V(\mu)$, where $V: \mathbb{R} \to \mathbb{R}$ is a known function. The quasi-likelihood function is defined as

$$Q(\mu; y) = \int_{y}^{\mu} \frac{y - t}{\sigma^2 V(t)} \, dt. \qquad (2.20)$$

By analogy, the quasi-score function, which is the derivative of (2.20) with respect to $\mu$, is

$$U(\mu; y) = \frac{\partial}{\partial\mu} Q(\mu; y) = \frac{y - \mu}{\sigma^2 V(\mu)}. \qquad (2.21)$$

$U(\mu; y)$ has a similar form to the score function $S(\mu; y)$ of the exponential family, and it is easy to see that $U(\mu; y)$ has the same three properties as in (2.19). So the quasi-likelihood function $Q(\mu; y)$ should share some properties with the log likelihood function. Indeed, Wedderburn (1974) proved that the log likelihood is identical to the quasi-likelihood if and only if the dependent variable is in the exponential family. Just as for GLMs, we can choose a link function g and assume $g(\mu) = X^T\beta$ for the quasi-likelihood function (2.20). One appealing property of quasi-likelihood functions is that we can again use the iterative weighted least square to estimate the linear coefficients $\beta$. The estimating equations based on quasi-likelihood functions are

$$\frac{\partial}{\partial\beta} Q(\mu; y) = \frac{y - \mu}{\sigma^2 V(\mu)} \frac{\partial\mu}{\partial\beta} = 0,$$

which has the same form as (2.16). So, by the same approach as in GLMs, using the Newton-Raphson method and the Fisher scoring algorithm, we can derive the iterative weighted least square estimates of $\beta$.

Wedderburn (1974) mainly derived the quasi-likelihood functions for independent responses. McCullagh and Nelder (1989) discussed quasi-likelihood in more detail, including the quasi-likelihood for dependent observations. More importantly, they showed that when the mean-variance and mean-covariate relations are correctly specified, the quasi-likelihood estimates $\hat\beta$ are consistent and asymptotically Normal under second-moment assumptions, even when the response distribution is unknown. In this thesis, we mainly focus on independent observations.

2.2 Generalized Additive Models

2.2.1 Additive Models

There are several good features of GLMs:

(i) The iterative weighted least square algorithm is easy to apply.
(ii) It is easy to interpret the dependence of the response on each covariate and to make inferences based on the linear coefficients. Once we fit a model, we can examine the predictors separately in the absence of interactions.

(iii) It is easy to make predictions,

$$\hat y = \hat\mu = g^{-1}\Big( \hat\beta_0 + \sum_{j=1}^{p} \hat\beta_j x_j \Big).$$

The nonparametric generalization of the LM has the following general form,

$$Y = f(X_1, X_2, \ldots, X_p) + \epsilon, \qquad (2.22)$$

where the function $f: \mathbb{R}^p \to \mathbb{R}$ is an arbitrary multivariate smooth function. For complicated multivariate functions f, the model fitting can be very computationally intensive when the number of covariates p is large. Multivariate smoothers, such as kernel smoothers, are one way to fit the smooth function f: they locally approximate $f(x_1, x_2, \ldots, x_p)$ using the sample points in a neighborhood of $(x_1, x_2, \ldots, x_p)$. However, as the dimension p increases, we cannot get enough sample points in any given neighborhood, making the local fit impossible. This phenomenon is called the curse of dimensionality. Moreover, in a multivariate smooth function it usually becomes harder to interpret the contribution of every single covariate.

Additive models inherit one important feature of GLMs: the predictor effects have an additive form. AMs are defined as

$$Y = f_0 + \sum_{j=1}^{p} f_j(X_j) + \epsilon, \qquad (2.23)$$

where the errors $\epsilon$ are independent of the covariates $X_j$ and $E(\epsilon) = 0$. The $f_j$ are arbitrary univariate smooth functions, and to make the model identifiable it is natural to assume $E[f_j(X_j)] = 0$, since there is a free intercept parameter $f_0$. From (2.23), it is easy to see that, unlike in LMs, the effect of each predictor $X_j$ in an AM becomes a nonlinear function $f_j$. In addition, the definition (2.23) also avoids the curse of dimensionality in model fitting, by permitting an iterative univariate smoothing process to fit the additive terms in (2.23).
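The iterative univariate smoothing idea behind fitting (2.23) — the backfitting algorithm of the next section — can be sketched with a deliberately simple smoother. This is an illustrative assumption, not the thesis's implementation: each "smoother" below is just a simple linear regression of the partial residual on one covariate (a valid linear smoother), and the data are an exactly additive toy example:

```python
# Toy additive data: y = 1 + 2*x1 + 3*x2 exactly (no noise).
x1 = list(range(8))
x2 = [i % 3 for i in range(8)]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
n = len(y)

def linear_smooth(r, x):
    """Fit r ~ a + b*x by least squares; return the fitted values."""
    xbar = sum(x) / len(x)
    rbar = sum(r) / len(r)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxr = sum((x[i] - xbar) * (r[i] - rbar) for i in range(len(x)))
    b = sxr / sxx
    a = rbar - b * xbar
    return [a + b * xi for xi in x]

s0 = sum(y) / n          # intercept initialized at ybar
f1 = [0.0] * n
f2 = [0.0] * n
for _ in range(200):
    for f_self, f_other, x in ((f1, f2, x1), (f2, f1, x2)):
        # Smooth the partial residual against one covariate at a time.
        r = [y[i] - s0 - f_other[i] for i in range(n)]
        fit = linear_smooth(r, x)
        c = sum(fit) / n     # centralize; move the mean into the intercept
        for i in range(n):
            f_self[i] = fit[i] - c
        s0 += c

fitted = [s0 + f1[i] + f2[i] for i in range(n)]
```

Because y is exactly additive in x1 and x2, the cycled updates converge to a fit that reproduces y, and each estimated component has mean zero, mirroring the identifiability constraint $E[f_j(X_j)] = 0$.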
This algorithm is called backfitting. It was proposed by Hastie and Tibshirani (1986), and it can serve as a general approach to fit any additive-style regression model. From (2.23), we see that the conditional expectation of Y given all the covariates $X_j$ has the form (1.4). So for any $l \in \{1, 2, \ldots, p\}$,

$$E\Big[ Y - f_0 - \sum_{j \neq l} f_j(X_j) \,\Big|\, X_l \Big] = f_l(X_l). \qquad (2.24)$$

Formula (2.24) suggests that we can estimate $f_l$ by smoothing the partial residual $Y - f_0 - \sum_{j \neq l} f_j(X_j)$. The backfitting algorithm works by iteratively smoothing the partial residuals and updating the smooth terms $\hat f_j$. Although $E(\cdot \mid X_l)$ is an expectation over all directions $X_j$, $j \neq l$, Hastie and Tibshirani (1990) suggested that a one-dimensional smoother is a sufficient approximation. In this thesis, we adopt their backfitting algorithm and we use locally weighted polynomial regression (LOESS) as our univariate smoother. LOESS is introduced in the next section.

Mammen, Linton and Nielsen (1999) provided another type of backfitting, called smoothed backfitting. Their main idea is to estimate the conditional expectations "correctly" instead of approximating them with a one-dimensional smoother. Similar to (2.24), their idea is based on

$$f_l(X_l) = E(Y \mid X_l) - \sum_{j \neq l} E[f_j(X_j) \mid X_l],$$

where $E(Y \mid X_l)$ is estimated with a univariate smoother and $E[f_j(X_j) \mid X_l]$ is estimated by

$$\int f_j(t)\, \frac{\hat\varphi_{lj}(X_l, t)}{\hat\varphi_l(X_l)} \, dt.$$

Here the function $\varphi_l$ is a univariate estimator of the marginal pdf of $X_l$ and $\varphi_{lj}$ is a bivariate estimator of the joint pdf of $X_l$ and $X_j$. This type of backfitting is also an iterative process, so the estimation of $E[f_j(X_j) \mid X_l]$ in each iteration uses the smooth function $\hat f_j$ from the previous iteration.

Now we introduce one version of the backfitting algorithm to fit AMs. Suppose we have a sample with response y and covariates $x^{(1)}, x^{(2)}, \ldots, x^{(p)}$. To make our algorithm easier to extend to the GAM fitting algorithm, we introduce a weight vector $w = (w_1, w_2, \ldots, w_n)$; for AMs, $w = 1$. If we denote $Y = (Y_1, \ldots, Y_n)^T$, $X_j = (X_{1j}, \ldots, X_{nj})^T$ and $s_0 = f_0 1$, $s_j = (f_j(X_{1j}), f_j(X_{2j}), \ldots, f_j(X_{nj}))^T \in \mathbb{R}^n$, $j = 1, 2, \ldots, p$, our model can be written as

$$E[Y \mid X_1, X_2, \ldots, X_p] = s_0 + \sum_{j=1}^{p} s_j.$$

Backfitting estimates the smooth terms $s_j$, $j = 1, 2, \ldots, p$; the estimates are denoted $\hat s_j$. The algorithm is as follows.

(i) Initialize: $s_0 \leftarrow \bar y 1$, or any vector specified by the user. $\varepsilon$ is a small positive number for deciding iteration convergence. Then assign

$$s_1^{[0]} \leftarrow s_2^{[0]} \leftarrow \ldots \leftarrow s_p^{[0]} \leftarrow 0, \qquad y \leftarrow y - s_0, \qquad m \leftarrow 1, \qquad \delta \leftarrow +\infty,$$

in which $0 = (0, 0, \ldots, 0)^T \in \mathbb{R}^n$.

(ii) Update: $m \leftarrow m + 1$, and for $l = 1, 2, \ldots, p$,

$$s_l^{[m]} \leftarrow S\Big( y - \sum_{j=1}^{l-1} s_j^{[m]} - \sum_{j=l+1}^{p} s_j^{[m-1]} \sim x^{(l)}, w, \alpha, k \Big),$$

where $S(v \sim x^{(l)}, w, \alpha, k)$ denotes a spline or LOESS estimate of the vector v over the covariate $x^{(l)}$ with weights w. The arguments $\alpha$ and k are also specified by the user; they are introduced later. Centralize and update $s_l^{[m]}$,

$$c_l^{[m]} \leftarrow \frac{1}{n} \sum_{i=1}^{n} s_{li}^{[m]}, \qquad s_l^{[m]} \leftarrow s_l^{[m]} - c_l^{[m]} 1, \qquad l = 1, 2, \ldots, p.$$

Update $\delta$,

$$\delta \leftarrow \frac{\sum_{j=1}^{p} \| s_j^{[m]} - s_j^{[m-1]} \|}{\sum_{j=1}^{p} \| s_j^{[m-1]} \|},$$

where for a vector $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, $\|x\| = \sqrt{\sum_{j=1}^{n} x_j^2}$.

(iii) Repeat step (ii) until $\delta < \varepsilon$.

Suppose $\delta < \varepsilon$ at the m-th iteration; then update $\hat s_0 \leftarrow s_0^{[0]} + \big[ \sum_{h=1}^{p} c_h^{[m]} \big] 1$. The estimates are $\hat s_j = s_j^{[m]}$ and $c_j = c_j^{[m]}$, $j = 1, 2, \ldots, p$. Note that in step (ii) we centralize the temporary smooth terms to make the algorithm converge; this adjustment is introduced in Hastie, Tibshirani and Friedman (2009).

The definition and properties of LOESS

Here we introduce the univariate smoother LOESS. It can be shown that LOESS is a linear smoother: if we smooth y with covariate $x = (x_1, \ldots, x_n)$ and weights $w = (w_1, \ldots, w_n)$, there exists a matrix $S \in \mathbb{R}^{n\times n}$, which does not depend on y, such that the smoothed response can be expressed as $\hat y = S y$. This linear feature of LOESS helps us better understand some properties of the backfitting algorithm introduced in the previous section.

Suppose the model is $E(Y \mid X) = m(X)$, where m is a smooth function of X, and our sample is y and x. If we want to estimate the value $E(Y \mid X = x) = m(x)$ at a point x, a Taylor approximation of m in the neighborhood of x gives

$$m(t) \approx m(x) + m'(x)(t - x) + \ldots + m^{(k)}(x) \frac{(t - x)^k}{k!},$$

where t is in the neighbourhood of x. We approximate $E(Y_i \mid X_i)$ based on

$$\hat y_i(x) = \widehat{m(x)} + \widehat{m'(x)}(x_i - x) + \ldots + \widehat{m^{(k)}(x)} \frac{(x_i - x)^k}{k!}.$$

In the above formula we are interested in the value of $\widehat{m(x)}$, and we can estimate it along with the derivatives $\widehat{m^{(q)}(x)}$, $q = 1, \ldots, k$, in the following way. First, a good approximation $\hat y_i(x)$ should make the sum of squared errors $\sum_{i=1}^{n} (y_i - \hat y_i(x))^2$ small. Second, those $x_i$ which are closer to the point x should be given more importance in the corresponding squared error $(y_i - \hat y_i(x))^2$. A weighted least square can therefore be applied to estimate $m^{(q)}(x)$, $q = 0, 1, \ldots, k$. LOESS uses the tricube kernel to decide the importance of each point $x_i$. The tricube kernel is defined as

$$K_b(t) \propto \big( 1 - |t/b|^3 \big)^3 I(|t| < b), \qquad \int K_b(t)\, dt = 1,$$

where I is an indicator function and the bandwidth $b = B(\alpha, x, x)$ is decided by a proportion $\alpha$, the sample x and the point x. The function $B(\alpha, x, x)$ makes the bandwidth b just large enough to include a proportion $\alpha$ of the sample in the neighborhood $N_{x,b} = \{ t : |x - t| < b \}$,

$$B(\alpha, x, x) = \inf\{ b : \#\{ x_i \in N_{x,b} \} \geq \alpha n \}.$$

The weight in the weighted least square is $w_i K_b(x - x_i)$, $i = 1, 2, \ldots, n$, where $w_i$ is the pre-specified weight.
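A minimal sketch of the local fit just described, for the local-linear case ($k = 1$) with the tricube kernel and span $\alpha$. This is an illustration of the definitions above, not the LOESS implementation used in the thesis: prior weights are taken as $w_i = 1$, and the bandwidth rule is a simplified version of $B(\alpha, x, x)$:

```python
def tricube(u):
    """Tricube kernel on |u| < 1 (unnormalized)."""
    return (1 - abs(u) ** 3) ** 3 if abs(u) < 1 else 0.0

def loess_at(x0, xs, ys, alpha=0.4):
    """Local-linear (k = 1) LOESS estimate of m(x0)."""
    n = len(xs)
    k = max(3, int(alpha * n + 0.999))     # include at least alpha*n points
    d = sorted(abs(xi - x0) for xi in xs)
    b = d[k - 1] * (1 + 1e-9)              # bandwidth covering those points
    wk = [tricube((xi - x0) / b) for xi in xs]
    # Weighted least squares of ys on (1, x - x0); the intercept is m(x0).
    sw = sum(wk)
    su = sum(wk[i] * (xs[i] - x0) for i in range(n))
    suu = sum(wk[i] * (xs[i] - x0) ** 2 for i in range(n))
    sy = sum(wk[i] * ys[i] for i in range(n))
    suy = sum(wk[i] * (xs[i] - x0) * ys[i] for i in range(n))
    det = sw * suu - su ** 2
    b1 = (sw * suy - su * sy) / det
    b0 = (sy - su * b1) / sw
    return b0

# Sanity check: a local-linear fit reproduces an exactly linear signal,
# whatever the kernel weights are.
xs = [i / 10 for i in range(20)]
ys = [3 + 2 * xi for xi in xs]
est = loess_at(0.57, xs, ys)
```

The returned intercept of the shifted weighted regression is exactly the estimate $\widehat{m(x)} = e_1^T\hat\beta$ of the next display.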
The weighted least square is

$$\hat\beta(k, \alpha, x, x, w) = \arg\min_{\beta} \sum_{i=1}^{n} w_i K_b(x - x_i) \big[ y_i - \beta_0 - \beta_1 (x_i - x) - \ldots - \beta_k (x_i - x)^k \big]^2.$$

Let the design matrix and the diagonal weight matrix be

$$Z = \begin{pmatrix} 1 & x_1 - x & (x_1 - x)^2 & \ldots & (x_1 - x)^k \\ 1 & x_2 - x & (x_2 - x)^2 & \ldots & (x_2 - x)^k \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n - x & (x_n - x)^2 & \ldots & (x_n - x)^k \end{pmatrix} \in \mathbb{R}^{n\times(k+1)},$$

$$W = \mathrm{Diag}\big( w_1 K_b(x - x_1),\; w_2 K_b(x - x_2),\; \ldots,\; w_n K_b(x - x_n) \big) \in \mathbb{R}^{n\times n};$$

then the weighted least square estimate is

$$\hat\beta(k, \alpha, x, x, w) = \big( Z^T W Z \big)^{-1} Z^T W y.$$

Since the estimate at the point x depends on the choice of k, w, $\alpha$ and the covariates x, we denote it $\widehat{m(x)}_{k,\alpha,x,w} = \hat\beta_0(k, \alpha, x, x, w) = e_1^T \hat\beta(k, \alpha, x, x, w)$, where $e_1 = (1, 0, 0, \ldots, 0)^T$. Then $\widehat{m(x)}$ can be written as

$$\widehat{m(x)}_{k,\alpha,x,w} = e_1^T \big( Z^T W Z \big)^{-1} Z^T W y \overset{def}{=} a_{k,\alpha,x,w}(x)^T y. \qquad (2.25)$$

Based on (2.25), the LOESS estimate over the sample can be expressed as

$$S(y \sim x, w, \alpha, k) = \begin{pmatrix} a_{k,\alpha,x,w}(x_1)^T \\ a_{k,\alpha,x,w}(x_2)^T \\ \vdots \\ a_{k,\alpha,x,w}(x_n)^T \end{pmatrix} y \overset{def}{=} S_{k,\alpha,x,w}\, y. \qquad (2.26)$$

Note that (2.26) not only shows how LOESS calculates the smoothed observations $\hat y_i = a_{k,\alpha,x,w}(x_i)^T y$, but also how LOESS predicts the smoothed value at a new point x which does not belong to the sample x, namely $\hat y = a_{k,\alpha,x,w}(x)^T y$. From (2.26), the smoothing matrix $S_{k,\alpha,x,w}$ depends only on k, $\alpha$, x and w, so LOESS is a linear smoother.

Next we show that our backfitting estimates of the smooth terms also converge to a linear smoother. To keep it simple, we do not consider the intercept term $s_0$; this is reasonable because in our backfitting algorithm we only update the intercept after the smooth terms converge, so in the iterations we can regard $y - s_0^{[0]}$ as the new response vector. Suppose our algorithm converges, and the limits of the uncentralized smooth terms are $s_j^O$, $j = 1, 2, \ldots, p$. The centralized smooth terms are $s_j = s_j^O - \frac{1}{n} 1 1^T s_j^O$, so $s_j = s_j^O - c_j 1$ with $c_j = \frac{1}{n} 1^T s_j^O$. At convergence the smoothed response is $\tilde y = \sum_{j=1}^{p} s_j + \big( \sum_{j=1}^{p} c_j \big) 1$, and our estimates converge to those limits, $\hat s_j \to s_j$. Denote the weighted smoothing matrix of each variable $x^{(j)}$ as $S_j$, $j = 1, 2, \ldots, p$, such that for all $r \in \mathbb{R}^n$, $S(r \sim x^{(j)}, w, \alpha, k) = S_j r$ for some w, $\alpha$ and k. From the form of our backfitting algorithm, the limits $s_j^O$ are the solutions of

$$s_j^O = S_j\big[ y - (s_1 + \ldots + s_{j-1} + s_{j+1} + \ldots + s_p) \big] = S_j\big[ y - (s_1^O + \ldots + s_{j-1}^O + s_{j+1}^O + \ldots + s_p^O) \big] + \Big( \sum_{l \neq j} c_l \Big) 1.$$

Since smoothing a constant vector should return the same constant, $S_j 1 = 1$, $j = 1, 2, \ldots, p$, and the above equations are equivalent to

$$M \begin{pmatrix} s_1^O \\ s_2^O \\ \vdots \\ s_p^O \end{pmatrix} \overset{def}{=} \begin{pmatrix} I_n & S_1 - \frac{1}{n} 1 1^T & \ldots & S_1 - \frac{1}{n} 1 1^T \\ S_2 - \frac{1}{n} 1 1^T & I_n & \ldots & S_2 - \frac{1}{n} 1 1^T \\ \vdots & \vdots & \ddots & \vdots \\ S_p - \frac{1}{n} 1 1^T & S_p - \frac{1}{n} 1 1^T & \ldots & I_n \end{pmatrix} \begin{pmatrix} s_1^O \\ s_2^O \\ \vdots \\ s_p^O \end{pmatrix} = \begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_p \end{pmatrix} y \overset{def}{=} A y,$$

where $I_n$ is the n-dimensional identity matrix. So

$$\begin{pmatrix} s_1^O \\ s_2^O \\ \vdots \\ s_p^O \end{pmatrix} = M^{-1} A y.$$

Since the centralized smooth terms are $s_j = (I_n - \frac{1}{n} 1 1^T) s_j^O$, $j = 1, 2, \ldots, p$, they can be expressed in linear form as

$$\begin{pmatrix} s_1 \\ s_2 \\ \vdots \\ s_p \end{pmatrix} = \begin{pmatrix} I_n - \frac{1}{n} 1 1^T & O & \ldots & O \\ O & I_n - \frac{1}{n} 1 1^T & \ldots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \ldots & I_n - \frac{1}{n} 1 1^T \end{pmatrix} M^{-1} A y, \qquad (2.27)$$

where $O = (0)_{n\times n}$. If the initial value of the intercept term is $s_0^{[0]} = \bar y 1$, it is easy to show that the smoothed response at convergence, $\tilde y = s_0^{[0]} + \sum_{j=1}^{p} s_j^O$, can also be expressed as $\tilde y = S y$ for some $S \in \mathbb{R}^{n\times n}$ decided by $\alpha$, k, w and $x^{(1)}, x^{(2)}, \ldots, x^{(p)}$. This shows that the backfitting algorithm approximates a linear smoother.

2.2.2 Generalized Additive Models

GAMs generalize AMs in the same way that GLMs generalize LMs. The distribution of Y is in the exponential family; its mean given the covariates $X_1, X_2, \ldots, X_p$ is transformed by a link function $g: \mathbb{R} \to \mathbb{R}$; and the transformed mean is defined to be the sum of the additive components. As shown in (1.6), a GAM is defined by

$$g\big( E(Y \mid X_1, X_2, \ldots, X_p) \big) = f_0 + f_1(X_1) + f_2(X_2) + \ldots + f_p(X_p).$$

Since the MLE fitting algorithm of GLMs is based on an iterative weighted least square of the adjusted response with the weights, $f_0, f_1, \ldots, f_p$ can be estimated by analogy. Hastie and Tibshirani (1986) proposed the generalized local scoring algorithm (GLSA) to fit GAMs. The idea is to iteratively fit weighted AMs with the adjusted response defined in (2.12) and the weights defined in (2.13). The algorithm is as follows.

(i) Initialize:
• for all but the Binomial distribution, $s_0 \leftarrow g(\bar y) 1$;
• for the Binomial distribution with $n_i$ as the sizes of the response,

$$s_0 \leftarrow \Big( g\big(\tfrac{y_1}{n_1}\big), g\big(\tfrac{y_2}{n_2}\big), \ldots, g\big(\tfrac{y_n}{n_n}\big) \Big).$$

$\varepsilon$ is a small positive number used to decide convergence. Then assign

$$s_1^{[0]} \leftarrow s_2^{[0]} \leftarrow \ldots \leftarrow s_p^{[0]} \leftarrow 0, \qquad \delta \leftarrow +\infty,$$

in which $0 \in \mathbb{R}^n$. So $\eta^{[0]} \leftarrow \sum_{j=0}^{p} s_j^{[0]}$ and $\mu_i^{[0]} \leftarrow g^{-1}(\eta_i^{[0]})$. Then initialize the adjusted response and weights,

$$z_i^{[0]} \leftarrow \eta_i^{[0]} + (y_i - \mu_i^{[0]}) \frac{d\eta}{d\mu}(\mu_i^{[0]}), \qquad w_i^{[0]} \leftarrow \Big[ \Big( \frac{d\eta}{d\mu}(\mu_i^{[0]}) \Big)^2 V(\mu_i^{[0]}) \Big]^{-1},$$

and set $m \leftarrow 1$. For the Gaussian distribution, the dispersion parameter is estimated by $V(\mu_i^{[0]}) = \frac{1}{n-1} \sum_{j=1}^{n} (y_j - \mu_j^{[0]})^2$.

(ii) Fit a weighted additive model of $z^{[m-1]}$ with weights $w^{[m-1]}$. After that, update the smooth terms with the backfitting estimates $\hat s_j^{[m-1]}$,

$$s_j^{[m]} \leftarrow \hat s_j^{[m-1]}, \qquad j = 1, 2, \ldots, p,$$

and update the intercept $s_0^{[m]} \leftarrow s_0^{[m-1]} + \hat s_0^{[m-1]}$, the additive predictor $\eta^{[m]} \leftarrow \sum_{j=0}^{p} s_j^{[m]}$ and the mean $\mu^{[m]}$. Set

$$\delta \leftarrow \frac{\sum_{j=1}^{p} \| s_j^{[m]} - s_j^{[m-1]} \|}{\sum_{j=1}^{p} \| s_j^{[m-1]} \|}, \qquad m \leftarrow m + 1,$$

$$z_i^{[m]} \leftarrow \eta_i^{[m]} + (y_i - \mu_i^{[m]}) \frac{d\eta}{d\mu}(\mu_i^{[m]}), \qquad w_i^{[m]} \leftarrow \Big[ \Big( \frac{d\eta}{d\mu}(\mu_i^{[m]}) \Big)^2 V(\mu_i^{[m]}) \Big]^{-1}.$$

(iii) Repeat step (ii) until $\delta < \varepsilon$.
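The local scoring loop can be illustrated in a stripped-down case: a Bernoulli response, the logit link, and a single "smooth" term fitted by a weighted linear smoother (a valid linear smoother, though far simpler than the weighted backfitting with LOESS used in this thesis). With a linear smoother the loop reduces to IRWLS for logistic regression, so the canonical score equations can be checked at convergence. The data and smoother here are illustrative assumptions:

```python
import math

x = list(range(10))
y = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1]     # illustrative 0/1 response
n = len(y)

ybar = sum(y) / n
eta = [math.log(ybar / (1 - ybar))] * n        # s0 <- g(ybar)
mu = [1 / (1 + math.exp(-e)) for e in eta]

for _ in range(40):
    # Adjusted response and weights for the logit link:
    # deta/dmu = 1/(mu(1-mu)), V(mu) = mu(1-mu)  =>  w = mu(1-mu).
    w = [m * (1 - m) for m in mu]
    z = [eta[i] + (y[i] - mu[i]) / w[i] for i in range(n)]
    # "Weighted additive model" with one term: weighted linear fit of z on x.
    sw = sum(w)
    swx = sum(w[i] * x[i] for i in range(n))
    swxx = sum(w[i] * x[i] ** 2 for i in range(n))
    swz = sum(w[i] * z[i] for i in range(n))
    swxz = sum(w[i] * x[i] * z[i] for i in range(n))
    det = sw * swxx - swx ** 2
    b1 = (sw * swxz - swx * swz) / det
    b0 = (swz - swx * b1) / sw
    eta = [b0 + b1 * xi for xi in x]
    mu = [1 / (1 + math.exp(-e)) for e in eta]

# With the canonical logit link, the score equations at the MLE are
# sum(y - mu) = 0 and sum((y - mu) * x) = 0.
score0 = sum(y[i] - mu[i] for i in range(n))
score1 = sum((y[i] - mu[i]) * x[i] for i in range(n))
```

Swapping the single weighted linear fit for a full weighted backfitting pass over several smooth terms turns this sketch into the GLSA described above.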
2.3 Generalized Partial Linear Partial Additive Models

As mentioned briefly in Chapter 1, the Partial Linear Partial Additive Model is a simple extension of AMs obtained by adding linear terms to the additive components. As shown in (1.4), an APLM has the form

$$Y = \beta_0 + \sum_{j=1}^{h} \beta_j X_j + \sum_{l=1}^{k} f_l(T_l) + \epsilon, \qquad E(\epsilon) = 0, \qquad (2.28)$$

where $\epsilon$ is a random error independent of the covariates. In some cases we divide our covariates into two groups, $X_1, X_2, \ldots, X_h$ and $T_1, T_2, \ldots, T_k$, and fit a regression model in which $X_1, \ldots, X_h$ act as the parametric components while $T_1, \ldots, T_k$ act as the nonparametric components. By doing this, an APLM enjoys the benefits of both parametric and nonparametric models. In our practical examples, we are usually interested in interpretation and inference on the parametric terms; the nonparametric terms become nuisance components which provide flexibility to the model. By analogy, the "generalized" version of APLMs, the Generalized Partial Linear Partial Additive Models (GAPLMs), has the form

$$Y \mid X_1, \ldots, X_h, T_1, \ldots, T_k \ \text{follows a distribution in the exponential family,} \qquad (2.29a)$$
$$g\big( E(Y \mid X_1, \ldots, X_h, T_1, \ldots, T_k) \big) = \beta_0 + \sum_{j=1}^{h} \beta_j X_j + \sum_{l=1}^{k} f_l(T_l). \qquad (2.29b)$$

2.3.1 Fitting APLMs

As in the foregoing sections, we start with the fitting method for the non-generalized version, and to make it easier to illustrate, we first consider a more general case. The Partial Linear Models (PLMs) are

$$Y = \beta_0 + \sum_{j=1}^{h} \beta_j X_j + m(T) + \epsilon, \qquad (2.30)$$

where $T = (T_1, T_2, \ldots, T_k)^T$ and $m: \mathbb{R}^k \to \mathbb{R}$ is a smooth multivariate function. As discussed in Härdle et al. (2004), once we have calculated one type of component ($x_i^T\beta$ or the function m) of a PLM, the computation of the remaining component is straightforward. There are mainly two ways to fit (2.30).
• One way derives from the idea of (2.24): the parametric and nonparametric terms can be fitted with partial residuals,

$$E\Big( Y - \beta_0 - \sum_{j=1}^{h} \beta_j X_j \,\Big|\, T \Big) = m(T), \qquad (2.31)$$
$$E\big( Y - \beta_0 - m(T) \,\big|\, X \big) = \sum_{j=1}^{h} \beta_j X_j, \qquad (2.32)$$

where $X = (X_1, X_2, \ldots, X_h)$. This fitting algorithm is an iterative process. Following (2.32), we update the linear coefficients $\hat\beta^{new}$ by least squares of the partial residuals $Y - \widehat{m(T)}^{old}$ on the covariates X. Then we estimate the nonparametric term $\widehat{m(T)}^{new}$ using the partial residual $Y - X^T \hat\beta^{new}$. As stated in Section 2.2.1, the backfitting algorithm using partial residuals works as a general approach to fit additive-style models such as (2.30).

• The second way can be shown intuitively as follows. Taking the expectation of Y conditional on T gives

$$E(Y \mid T) = E(X^T\beta \mid T) + E(m(T) \mid T) + E(\epsilon \mid T). \qquad (2.33)$$

Subtracting (2.33) from (2.30), and using $E[m(T) \mid T] = m(T)$, we obtain

$$Y - E(Y \mid T) = \big( X - E(X \mid T) \big)^T \beta + \epsilon - E(\epsilon \mid T). \qquad (2.34)$$

By assumption, $E(\epsilon \mid X, T) = 0$, hence $E[\epsilon - E(\epsilon \mid T)] = 0$. The conditional expectations $E(Y \mid T)$ and $E(X \mid T)$ can be replaced by corresponding nonparametric estimators, denoted $\hat y$ and $\hat X$. In a sample, $\hat X \in \mathbb{R}^{n\times h}$ is the estimated design matrix whose columns are the smoothed vectors $x^{(j)}$ as functions of the $t^{(l)}$, and $\hat y \in \mathbb{R}^n$ is the smoothed response. With this substitution, (2.34) becomes a standard LM from which we can easily estimate the linear coefficient $\beta$. Then, plugging $\hat\beta$ into (2.31), we estimate the smooth function m by a nonparametric regression. For APLMs, the nonparametric estimator can be a multivariate kernel smoother or a backfitting estimator with certain univariate smoothers. This estimating idea was studied by Speckman (1988) and Robinson (1988); we call this method the Speckman estimator in the following sections. In this thesis, we apply the Speckman estimator and backfitting with LOESS to fit APLMs.
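The second (Speckman) approach can be sketched end to end on a toy PLM with one linear covariate and one nonparametric covariate, using a simple running-mean smoother in place of the kernel or backfitting smoothers discussed above. The data, the smoother, and the window size are all illustrative assumptions:

```python
# Toy PLM: y = 2*x + m(t) with m(t) = t^2, no noise.
n = 40
t = [i / n for i in range(n)]
x = [1 if i % 2 == 0 else -1 for i in range(n)]   # roughly unrelated to t
y = [2 * x[i] + t[i] ** 2 for i in range(n)]

def smooth(v):
    """Running-mean smoother over the sorted t grid: a crude E(.|t)."""
    out = []
    for i in range(n):
        lo, hi = max(0, i - 2), min(n, i + 3)
        out.append(sum(v[lo:hi]) / (hi - lo))
    return out

# Step 1 (cf. (2.34)): partial out t from x and y, then regress
# residual on residual to estimate beta.
sx, sy = smooth(x), smooth(y)
x_res = [x[i] - sx[i] for i in range(n)]
y_res = [y[i] - sy[i] for i in range(n)]
beta = (sum(x_res[i] * y_res[i] for i in range(n))
        / sum(v * v for v in x_res))

# Step 2 (cf. (2.31)): estimate m by smoothing the partial residual.
m_hat = smooth([y[i] - beta * x[i] for i in range(n)])
```

With this construction, beta lands close to the true linear coefficient 2 and m_hat recovers the smooth component up to a small smoothing bias; any reasonable smoother can be substituted for the running mean.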
Under regularity conditions, the Speckman estimate $\hat\beta$ is $\sqrt{n}$-consistent for $\beta$, meaning $\hat\beta - \beta = O_p(1/\sqrt{n})$. $\hat\beta$ is also asymptotically Normal, and there exists a consistent estimator of its limiting covariance matrix. In this thesis, we mainly discuss the fitting method and leave the estimator's asymptotic properties for future work.

Härdle et al. (2004) showed that, in fitting (G)APLMs, the Speckman estimator is similar to the Profile Likelihood Estimator, which is a two-step optimization of the local likelihood. The local likelihood function is

$$l_H(\mu; y) = \sum_{i=1}^{n} K_H(t - T_i)\, l\big( g^{-1}[x_i^T\beta + m(t)];\, y_i \big),$$

where $l\big( g^{-1}[x_i^T\beta + m(t)];\, y_i \big)$ is the log likelihood function of the exponential family and $K_H(\cdot)$ is a kernel function. In the first step, fixing $\beta$, this local likelihood is maximized with respect to m(t),

$$m_\beta(t) = \arg\max_{m(t)} \sum_{i=1}^{n} K_H(t - T_i)\, l\big( g^{-1}[x_i^T\beta + m(t)];\, y_i \big).$$

Then a profile likelihood of $\beta$ is constructed,

$$\sum_{i=1}^{n} K_H(t - T_i)\, l\big( g^{-1}[x_i^T\beta + m_\beta(t)];\, y_i \big). \qquad (2.35)$$

In the second step, the profile likelihood is maximized with respect to $\beta$, yielding the profile likelihood estimates $\hat\beta$ and $m_{\hat\beta}(t)$. This algorithm is also an iterative process. It can be shown that for APLMs the Profile Likelihood Estimator coincides with the Speckman estimator using a kernel smoother with the same kernel function $K_H$ as in (2.35); see Härdle et al. (2004). However, it is easier to extend the Speckman estimator with GLSA to fit GAPLMs than to use the profile likelihood estimator. In this thesis, we only discuss the Speckman estimator. At the end of this chapter, a comparison of standard backfitting, the Speckman estimator and the profile likelihood estimator is provided.

2.3.2 Fitting GAPLMs

As shown in Section 2.3.1, two classes of estimation procedures can be used to fit APLMs:

• Backfitting type: alternating between parametric and nonparametric estimations.
• Speckman estimator / profile likelihood type: estimate the linear parameters with the partial response and partial covariates; then estimate the nonparametric terms with partial residuals.

Based on the foregoing sections, it is straightforward to extend the APLM algorithm to fit GAPLMs. Recall that for GAMs the fitting algorithm is GLSA with adjusted response and weights defined in (2.12)-(2.13). As a hybrid of APLMs and GAMs, GAPLMs can be fitted with GLSA and the Speckman estimator; this estimator is called the Generalized Speckman Estimator.

Suppose our sample is y, $(x^{(1)}, x^{(2)}, \ldots, x^{(h)})$ and $(t^{(1)}, t^{(2)}, \ldots, t^{(k)})$. The design matrix for the parametric covariates is $X = (1, x^{(1)}, x^{(2)}, \ldots, x^{(h)})$. For any vector $v \in \mathbb{R}^n$, if we smooth it using our backfitting algorithm with weights w, we denote the smoothed vector by $\tilde v_{X,w}$. For the matrix X, we denote $\tilde X_{X,w} = (\tilde 1_{X,w}, \tilde x^{(1)}_{X,w}, \tilde x^{(2)}_{X,w}, \ldots, \tilde x^{(h)}_{X,w})$, the matrix whose columns are the smoothed columns of X. The fitting algorithm based on the generalized Speckman estimator is as follows.

(i) Initialize:
• for all but the Binomial distribution, $\eta^{[0]} \leftarrow g(\bar y) 1$;
• for the Binomial distribution with $n_i$ as the sizes of the response,

$$\eta^{[0]} \leftarrow \Big( g\big(\tfrac{y_1}{n_1}\big), g\big(\tfrac{y_2}{n_2}\big), \ldots, g\big(\tfrac{y_n}{n_n}\big) \Big),$$

and $\mu_i^{[0]} \leftarrow g^{-1}(\eta_i^{[0]})$. $\varepsilon$ is a small positive number used to decide convergence. Then assign

$$s_1^{[0]} \leftarrow s_2^{[0]} \leftarrow \ldots \leftarrow s_k^{[0]} \leftarrow 0, \qquad \beta^{[0]} \leftarrow (0, 0, \ldots, 0), \qquad \delta \leftarrow +\infty,$$

in which $0 \in \mathbb{R}^n$. Calculate the adjusted response and weights,

$$z_i^{[0]} \leftarrow \eta_i^{[0]} + (y_i - \mu_i^{[0]}) \frac{d\eta}{d\mu}(\mu_i^{[0]}), \qquad w_i^{[0]} \leftarrow \Big[ \Big( \frac{d\eta}{d\mu}(\mu_i^{[0]}) \Big)^2 V(\mu_i^{[0]}) \Big]^{-1};$$

for the Gaussian distribution, $V(\mu_i^{[0]}) = \frac{1}{n-1} \sum_{j=1}^{n} (y_j - \mu_j^{[0]})^2$. Set $m \leftarrow 1$.

(ii) (a) Calculate the partial design matrix and the partial adjusted response with weights $w^{[m-1]}$,

$$X_{res}^{[m-1]} = X - \tilde X_{X,w^{[m-1]}} + \tfrac{1}{n} 1 1^T X, \qquad z_{res}^{[m-1]} = z^{[m-1]} - \tilde z_{X,w^{[m-1]}} + \tfrac{1}{n} 1 1^T z^{[m-1]}.$$

(b) Update $\beta^{[m]}$ by weighted least square with weights $w^{[m-1]}$.
The design matrix is $X_{res}^{[m-1]}$ and the response variable is $z_{res}^{[m-1]}$,
$$ \beta^{[m]} \leftarrow \big[(X_{res}^{[m-1]})^T \mathrm{Diag}(w^{[m-1]})\, X_{res}^{[m-1]}\big]^{-1} (X_{res}^{[m-1]})^T \mathrm{Diag}(w^{[m-1]})\, z_{res}^{[m-1]}. $$
(c) Fit a weighted additive model of $z^{[m-1]} - X\beta^{[m]}$ with weights $w^{[m-1]}$ and initial intercept 0. After that, update the smooth terms with the backfitting estimates,
$$ s_j^{[m]} \leftarrow \hat s_j^{[m-1]}, \qquad j = 1, 2, \ldots, k. $$
Update the intercept $\beta_0^{[m]} \leftarrow \beta_0^{[m-1]} + \hat s_0^{[m-1]}[1]$, where $\hat s_0^{[m-1]}[1]$ is the estimated intercept of the backfitting algorithm. Then update the additive predictor $\eta^{[m]} \leftarrow X\beta^{[m]} + \sum_{j=1}^k s_j^{[m]}$ and the mean $\mu^{[m]}$ with the inverse link function $g^{-1}(\cdot)$:
• for all but the Binomial distribution,
$$ \mu^{[m]} \leftarrow \big(g^{-1}(\eta_1^{[m]}), g^{-1}(\eta_2^{[m]}), \ldots, g^{-1}(\eta_n^{[m]})\big); $$
• for the Binomial distribution,
$$ \mu^{[m]} \leftarrow \big(n_1 g^{-1}(\eta_1^{[m]}), n_2 g^{-1}(\eta_2^{[m]}), \ldots, n_n g^{-1}(\eta_n^{[m]})\big). $$
(d) Calculate the adjusted response and weights,
$$ z_i^{[m]} \leftarrow \eta_i^{[m]} + (y_i - \mu_i^{[m]})\,\frac{d\eta}{d\mu}(\mu_i^{[m]}), \qquad w_i^{[m]} \leftarrow \frac{1}{V(\mu_i^{[m]})}\Big[\frac{d\mu}{d\eta}(\mu_i^{[m]})\Big]^2, $$
where for the Gaussian distribution $V(\mu_i^{[m]}) = \frac{1}{n-1}\sum_{j=1}^n (y_j - \mu_i^{[m]})^2$. Set
$$ \delta \leftarrow \frac{\sum_{l=1}^k \|s_l^{[m]} - s_l^{[m-1]}\| + \sum_{j=0}^h \|(\beta_j^{[m]} - \beta_j^{[m-1]})\, x^{(j)}\|}{\sum_{l=1}^k \|s_l^{[m-1]}\| + \sum_{j=0}^h \|\beta_j^{[m-1]}\, x^{(j)}\|} \quad \text{(with } x^{(0)} = \mathbf 1\text{)}, \qquad m \leftarrow m + 1. $$
(e) Repeat step (ii) until $\delta < \epsilon$.

Note that in step (ii)(a), we add the extra terms $\frac{1}{n}\mathbf 1\mathbf 1^T X$ and $\frac{1}{n}\mathbf 1\mathbf 1^T z^{[m]}$ to the partial design matrix and partial response. This may seem different from (2.34), but note that (2.34) does not show how to estimate the intercept $\beta_0$. In our algorithm, we estimate the intercept using iterative weighted least squares and the backfitting algorithm. Moreover, if we did not add $\frac{1}{n}\mathbf 1\mathbf 1^T X$ to the partial design matrix $X - \tilde X_{X,w^{[m]}}$, the first column of $X_{res}^{[m]}$ would become $0$ and $X_{res}^{[m]}$ would not be of full rank. Durban, Hackett and Currie (1999) proposed this adjustment for a backfitting-type fitting procedure for APLMs. Our algorithm is based on their result.
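The core of step (ii) — recomputing the adjusted response and weights, then solving a weighted least squares problem — is the same IRWLS update used for plain GLMs. Below is a minimal sketch for a Poisson GLM with log link (the function name, data and tolerances are illustrative, and the smooth backfitting part of the GAPLM algorithm is omitted):

```python
import numpy as np

def irwls_poisson(X, y, tol=1e-8, max_iter=100):
    """IRWLS for a Poisson GLM with log link, g(mu) = log(mu).

    Illustrates the adjusted response z = eta + (y - mu) * d(eta)/d(mu)
    and weights w = [d(mu)/d(eta)]^2 / V(mu); for the log link
    d(mu)/d(eta) = mu and V(mu) = mu, so w = mu and z = eta + (y - mu)/mu.
    """
    n, p = X.shape
    beta = np.zeros(p)
    mu = np.full(n, y.mean())          # initialize at the mean response
    eta = np.log(mu)
    for _ in range(max_iter):
        z = eta + (y - mu) / mu        # adjusted response
        w = mu                         # working weights
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
        eta = X @ beta
        mu = np.exp(eta)
    return beta

# toy check: recover a known linear predictor from Poisson counts
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))
beta_hat = irwls_poisson(X, y)
```

The generalized Speckman step (ii)(b) replaces the plain design matrix and adjusted response in this solve with their "partialed" versions $X_{res}$ and $z_{res}$.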
There are some studies comparing the backfitting estimator and the Speckman estimator. One important shortcoming of the backfitting procedure is that it may fail with correlated explanatory variables. In regression studies, the definition of "correlated explanatory variables" differs across models. In LMs, this problem is called (multi)collinearity, which indicates that one predictor can be approximated by a linear combination of the other predictors. When the model contains nonparametric additive components, as in AMs, GAMs, etc., this problem is termed concurvity, where a function of a predictor can be approximated by a linear combination of functions of the other predictors. He, Mazumdar and Arena (2006, 2007) illustrated this phenomenon in a series of simulations. They showed that the backfitting estimator has a larger bias and a higher variation than the Speckman estimator for a model with high-concurvity covariates. One reason for the failure of the backfitting estimator is that the matrix $M$ in (2.27) might be singular if there is high concurvity among the covariates.

Hastie and Tibshirani (1990) proposed a modified backfitting algorithm to fit GAMs, based on orthogonal projections. Suppose the model is defined as in (2.23), $S_j$, $j = 1, 2, \ldots, p$, are the univariate smoothing matrices and $V_1(S_j)$ is the linear space spanned by the eigenvectors of $S_j$ corresponding to eigenvalue 1. Then if $A_j$ is the matrix which projects onto $V_1(S_j)$, $I - A_j$ projects onto $V_1(S_j)^\perp$. The modified backfitting algorithm contains two steps. First, it uses an orthogonal projection $A$ (usually the hat matrix $X(X^TX)^{-1}X^T$) to project the response onto the space $V_1(S_1) + V_1(S_2) + \cdots + V_1(S_p)$. Second, it projects the partial residuals onto $V_1(S_j)^\perp$ using $(I - A_j)S_j$. This algorithm can also be applied in fitting GAPLMs.
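To make the role of $V_1(S_j)$ concrete, here is a small numerical illustration; the moving-average smoother is a hypothetical stand-in for the thesis's smoothers. Any row-normalized linear smoother reproduces constant vectors, so $\mathrm{span}\{\mathbf 1\} \subseteq V_1(S)$, and projecting with $I - A$ (where $A$ projects onto $\mathrm{span}\{\mathbf 1\}$) yields mean-zero smooths, as in the modified backfitting step:

```python
import numpy as np

def moving_average_smoother(n, half_window=2):
    """Row-normalized moving-average smoother matrix S (reproduces constants)."""
    S = np.zeros((n, n))
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        S[i, lo:hi] = 1.0 / (hi - lo)
    return S

n = 50
S = moving_average_smoother(n)
ones = np.ones(n)

# S reproduces constants: span{1} lies in the eigenvalue-1 space V1(S)
const_reproduced = np.allclose(S @ ones, ones)

# A projects onto span{1}; (I - A) S then produces centered smooths
A = np.outer(ones, ones) / n
x = np.sin(np.linspace(0, 3, n)) + 1.7
smooth_centered = (np.eye(n) - A) @ (S @ x)
```

Centering each smooth this way is what removes the shared constant direction that would otherwise make the backfitting system singular.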
Hastie and Tibshirani (1990) noted that the modified backfitting makes it easier to find a unique fit and that it also eliminates the dependence of the final results on the starting functions.

Müller (2001) and Härdle et al. (2004) compared the backfitting, Speckman and profile likelihood estimators based on kernel smoothing. They reported the following conclusions.
• When the bandwidth is small, or when the nonlinear function $m$ is relatively constant or small compared with the parametric part, the profile likelihood and Speckman estimators perform similarly.
• The backfitting estimator works better than the Speckman estimator for a GAPLM with an independent design. However, the Speckman estimator improves the fit when the model has dependent explanatory variables.
• For small sample sizes, the Speckman estimator seems to perform best. The resulting estimator can be considered a good compromise between accuracy and computational efficiency. For larger sample sizes, the Speckman estimator is almost identical to the profile likelihood algorithm.

Chapter 3

Robust Estimates and Robust GAPLMs

As shown in Section 2.3, the traditional model fitting algorithms for GAPLMs are mainly based on the likelihood function. For an exponential family dependent variable, least squares (for Normal dependent variables) and the generalized local scoring algorithm are designed to maximize the likelihood function
$$ L(y; \theta, x_1, x_2, \ldots, x_n) = \prod_{i=1}^n f(y_i; x_i, \theta), \qquad (3.1) $$
where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$, $f$ is the pdf or pmf of the assumed distribution $Y \sim F_Y(\theta, X)$ and $\theta = (\theta_1, \theta_2, \ldots, \theta_p)^T$ are the distribution parameters. Equivalently, the MLE can be written in log likelihood form,
$$ \hat\theta_{MLE} = \arg\min_{\theta \in \Omega} \sum_{i=1}^n \rho(y_i; x_i, \theta), \qquad (3.2) $$
where $\rho(y_i; x_i, \theta) = -\log f(y_i; x_i, \theta)$ and $\Omega$ is the parameter space. The well-developed MLE theory shows that under the following regularity conditions,
1.
the true value $\theta_0$ of $\theta$ is interior to the parameter space $\Omega$, which has finite dimension and is compact;
2. the pdfs or pmfs defined by any two different values of $\theta$ are distinct;
3. there is a neighbourhood $N$ of $\theta_0$ within which the first three derivatives of the log likelihood with respect to $\theta$ exist almost surely, and for $r, s, t = 1, 2, \ldots, p$, $n^{-1} E\big|\partial^3 l(\theta; y, X)/\partial\theta_r\partial\theta_s\partial\theta_t\big|$ is uniformly bounded for $\theta \in N$;
4. within $N$, the Fisher information matrix $I(\theta)$ is finite and positive definite, and its elements satisfy
$$ I_{rs}(\theta) = E\Big[\frac{\partial l(\theta; y, X)}{\partial\theta_r}\frac{\partial l(\theta; y, X)}{\partial\theta_s}\Big] = -E\Big[\frac{\partial^2 l(\theta; y, X)}{\partial\theta_r\,\partial\theta_s}\Big], \qquad r, s = 1, 2, \ldots, p, $$
the MLE defined in (3.2) has two desirable characteristics.
• If the likelihood (3.1) is correctly assumed, the estimates are consistent: as $n \to +\infty$, $P(\hat\theta_{MLE} \to \theta_0) = 1$.
• $\hat\theta_{MLE}$ converges in distribution to a multivariate normal distribution,
$$ \sqrt{n}\,(\hat\theta_{MLE} - \theta_0) \xrightarrow{D} N\big(0, I^{-1}(\theta_0)\big). $$
More importantly, $\hat\theta_{MLE}$ is asymptotically the most efficient estimate: compared with any other unbiased estimate $\tilde\theta$ with covariance matrix $\mathrm{cov}(\tilde\theta)$,
$$ \mathrm{cov}(\tilde\theta) - \tfrac{1}{n} I^{-1}(\theta_0) \ \text{is non-negative definite.} $$

The likelihood approach works very well within the assumed distribution families. However, there is one potential problem with the likelihood method in practical data analysis. Sometimes the assumed distribution $F_Y(\theta, X)$ in (3.1) is not true for the observed response due to bad data quality. For example, suppose the measurements are taken by two machines and one machine is off-standard and sometimes gives wrong measurements. In this case, the majority of measurements might follow the assumed distribution $F_Y(\theta, X)$ while a small portion of observations depart from it. This phenomenon can be expressed with a contaminated distribution with pdf/pmf
$$ t(y; x, \theta, \gamma) = (1 - \epsilon)\, f(y; x, \theta) + \epsilon\, h(y; x, \gamma), \qquad \epsilon \in (0, 1), \qquad (3.3) $$
where $\gamma \in \mathbb{R}^q$ and $t(y; x, \theta, \gamma)$, the true pdf/pmf of the sample, is a mixture of $f$ and $h$.
Here $h(y; x, \gamma)$ can be any pdf/pmf other than $f(y; x, \theta)$. If $f(y; x, \theta)$ is a Normal density, (3.3) is called a contaminated Normal distribution. Even though we are more interested in the assumed distribution $f(y; x, \theta)$, we are actually fitting regression models in a "neighborhood" of it. It is necessary to check whether the likelihood method still works well in the case of (3.3). It is also meaningful to develop methods which can help to detect those atypical observations/outliers.

In addition, the objective in (3.2) is the sum of $n$ terms, $\sum_{i=1}^n \rho(y_i; x_i, \theta)$. From this form, we see that the $n$ terms, which represent the $n$ observations, are given the same weight in the optimization (3.2). Consequently, the MLE tends to make every observation's log likelihood large. However, the $\rho(y_i; x_i, \theta)$ of outliers should not be optimized along with those of the "good" observations, since the outliers do not follow the assumed distribution. Thus, outliers might distort the estimate in likelihood approaches.

In this chapter, the effects of outliers on the traditional regression estimates are discussed. Robust regression estimates, which are designed to deal with bad data quality, are also introduced. As in Chapter 2, we start our discussion with LMs and then extend it to more general models.

3.1 Robust Estimates

3.1.1 The Influence of Outliers on LS

In this section, we show and explain that LS can be severely affected by outliers. In a LM defined in (2.1), a good estimate $\hat\beta$ should have two properties,
$$ \hat\beta(X, y + X\gamma) = \hat\beta(X, y) + \gamma, \qquad \forall \gamma \in \mathbb{R}^{p+1}, \qquad (3.4) $$
$$ \hat\beta(X, \lambda y) = \lambda\,\hat\beta(X, y), \qquad (3.5) $$
where $\hat\beta(X, y)$ is the vector of estimated coefficients with design matrix $X$ and response $y$. Properties (3.4) and (3.5) are called regression and scale equivariance, respectively. Regression equivariance ensures that a shift of $X\gamma$ in the response gives a shift of $\gamma$ in the estimates. Scale equivariance enables us to get the same fitted values $\hat y_i = x_i^T\hat\beta$ with different units of measurement.
For example, data measured in centimeters $y_1$ and data measured in millimeters $y_2 = 10 y_1$ should give us coherent fitted values, $10\, X\hat\beta(X, y_1) = X\hat\beta(X, y_2)$. In order to attain (3.4) and (3.5), we optimize the likelihood of the standardized responses $R_i = (Y_i - \mu_i)/\sqrt{\mathrm{var}(Y_i)}$ instead of the likelihood of $Y_i$. For example, if we assume normal errors $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ in (2.1) and the pdf of the standard Normal is $\phi_0$, the pdf of $R_i$ becomes $\phi(r) = (1/\sigma)\,\phi_0(r/\sigma)$, and the likelihood function can be re-parameterized as
$$ \hat\beta = \arg\min_\beta \Big\{ \frac{1}{n}\sum_{i=1}^n \rho_0\Big(\frac{y_i - x_i^T\beta}{\sigma}\Big) + \log(\sigma) \Big\}, \qquad (3.6) $$
where $\rho_0\big(\frac{y_i - x_i^T\beta}{\sigma}\big) = -\log\phi_0\big(\frac{y_i - x_i^T\beta}{\sigma}\big) = \frac{(y_i - x_i^T\beta)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi)$. Note that for the LS fit of normal variables $Y_i$ this re-parameterization is unnecessary, but the form of (3.6) makes it easier to extend LS to the definition of robust regression estimates. It is easy to check that the estimator minimizing (3.6) is regression and scale equivariant,
$$ \sum_{i=1}^n \rho_0\Big(\frac{y_i - x_i^T\hat\beta_{MLE}}{\sigma}\Big) = \sum_{i=1}^n \rho_0\Big(\frac{y_i + x_i^T\gamma - x_i^T(\hat\beta_{MLE} + \gamma)}{\sigma}\Big), $$
$$ \sum_{i=1}^n \rho_0\Big(\frac{y_i - x_i^T\hat\beta_{MLE}}{\sigma}\Big) = \sum_{i=1}^n \rho_0\Big(\frac{\lambda y_i - x_i^T(\lambda\hat\beta_{MLE})}{\lambda\sigma}\Big). $$
If we assume $\sigma$ is known and take the derivative of (3.6) with respect to $\beta$, we get
$$ \sum_{i=1}^n \psi_0\Big(\frac{e_i(\hat\beta_{MLE})}{\sigma}\Big)\, x_i = 0, \qquad (3.7) $$
where $\psi_0(x) = \rho_0'(x) = -\phi_0'(x)/\phi_0(x) = x$ and $e_i(\beta) = y_i - x_i^T\beta$ are the residuals. For a LM, the solution of (3.7) is unique and equal to that of (3.6) if the design matrix has rank $p + 1$.

The formula (3.6) shows how outliers affect the LS estimates. From (3.6), we can see that $\hat\beta_{MLE}$ (or $\hat\beta_{LS}$) is based on minimizing the sum of squared residuals $\sum_i e_i^2(\beta)$. If one response $y_j$ deviates a lot from the assumed model (2.1), the absolute value $|y_j - x_j^T\beta_{true}|$ becomes very large. As a result, in the minimization of (3.6), $|y_j - x_j^T\hat\beta_{LS}|$ becomes smaller than $|y_j - x_j^T\beta_{true}|$ at the cost of bigger absolute residuals $|y_i - x_i^T\hat\beta_{LS}|$ for the "good" observations.
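A small simulation in the spirit of the discussion above illustrates both the scale equivariance (3.5) of LS and its sensitivity to a few y-direction outliers; the sample size, noise level and contamination are chosen arbitrarily for illustration:

```python
import numpy as np

def ls_fit(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 40)
X = np.column_stack([np.ones(40), x])
y = -1 + 3 * x + rng.normal(0, 0.5, 40)   # true model y = -1 + 3x + eps

beta_clean = ls_fit(X, y)

# scale equivariance (3.5): beta_hat(X, 10*y) = 10 * beta_hat(X, y)
scale_ok = np.allclose(ls_fit(X, 10 * y), 10 * beta_clean)

# contaminate the 6 responses with the largest x by a large negative shift
y_cont = y.copy()
y_cont[np.argsort(x)[-6:]] -= 30
beta_cont = ls_fit(X, y_cont)    # slope is dragged far away from 3
```

Six contaminated points out of forty are enough to move the LS slope from roughly 3 to a large negative value, mirroring the distortion seen in Figure 3.1.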
There are mainly two kinds of observations which can be influential to $\hat\beta_{LS}$, in the sense that two samples with and without those observations provide quite different fits.
• Outliers in the y-direction, with an atypical response $y_j$ compared with the responses of observations with similar covariates $x_j$.
• Observations whose explanatory variables $x_j$ are far from the other explanatory variables. We often call these observations high-leverage points. Note that high-leverage points are not outliers if their corresponding response variables follow the assumed distribution.

To illustrate the above concepts, we present an example of each type. In the examples, the true model is $Y = -1 + 3X + \epsilon$ and we pick 6 "good" observations and replace them with outliers.

[Figure 3.1: A simulated example of the effect of outliers. The solid line is the LS fit with outliers, the dotted line is the robust fit with the Huber function, $c = 1.2$, and the dashed line is the LS fit without the 6 outliers. The details of the robust fit are introduced in the following sections.]

The left panel of Figure 3.1 shows the case of y-direction outliers. We see that the estimated slope is distorted towards the 6 outliers in the lower right corner. The right panel of Figure 3.1 shows the effect of high-leverage points. Even though the relative deviation $|y_{outlier} + 1 - 3x|/|-1 + 3x|$ in the right panel is less than that in the left panel, the fitted line is obviously affected by those outliers, making the residuals of some "good" points even bigger than those of the outliers. This phenomenon is called masking, where several outliers do not appear atypical or extreme in the fitted model.
Note that we use the relative deviation because the absolute deviation is not an appropriate measure for comparing the two cases. For example, an absolute deviation of 1 is more severe for a true value of 0.1 than for a true value of 1000. Since MLEs are sensitive to outliers, it is necessary to develop statistical approaches to deal with outliers.

3.1.2 Robust Regression M-estimate

In this section, we introduce the robust regression M-estimate for LMs. Some traditional methods can naively deal with outliers: they detect the outliers or influential points first and then delete or modify them, fitting a least-squares-based LM with the modified data. Another way of dealing with outliers is to use a robust method, also known as a resistant fit. A resistant fit does not require preliminary outlier detection; it fits the majority of the sample well while allowing some observations to have large residuals. We introduce and compare the two approaches in this section.

Traditional approaches

For LMs, there are traditional approaches such as regression diagnostics, which can be used to detect influential observations after an initial LS fit. These methods include the Q-Q plot of residuals, the scatter plot of the residuals against the fitted values $x_i^T\hat\beta$ and the Cook distance plot. The residual plot and the Q-Q plot can be applied to check the model assumptions and atypical observations. The Cook distance plot checks the influence of each observation $(x_i, y_i)$ by comparing the LS estimate of the full data with the LS estimate without $(x_i, y_i)$. Weisberg (1985) and Belsley, Kuh and Welsch (1980) discussed these methods in detail. The three diagnostic plots based on the examples in Figure 3.1 are shown in Figure 3.2. In Figure 3.2, we use the standardized residuals
$$ r_i = \frac{e_i - \bar e}{sd(e)}, \qquad \text{where } e_i = y_i - x_i^T\hat\beta, \quad e = (e_1, e_2, \ldots, e_n), \quad sd(e) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (e_i - \bar e)^2}, $$
to draw the Q-Q plots and residual plots. From the residual plots, which would show randomly scattered residuals if the model assumptions held, we see that the outliers seriously distort the LS fit even though the majority of the data are "good" points. The Q-Q plots for the two cases fail to show any atypical residuals because the points, whether solid or hollow, lie along the reference line. More importantly, the lower Q-Q plot shows that the residuals of those outliers are relatively small compared with many "good" points, illustrating a masking effect. If we set the influential point criterion for the Cook distance as Cook distance > 4/(sample size) = 0.1 to report influential observations, all of the 6 true outliers are reported in the first example; see Bollen and Jackman (1990). As can be seen in the upper Cook distance plot, the 3 points with the biggest Cook distances are true outliers. However, in the second case, only one point, with index 26, has a Cook distance larger than 0.1 and the other outliers do not seem very influential. Moreover, in the lower Cook distance plot, a "good" point with index 7 has the third largest Cook distance. This suggests that the Cook distance plot does not always work well for detecting outliers. In fact, this leave-one-out approach often works well in detecting isolated high-leverage outliers but fails in situations with multiple high-leverage outliers.

[Figure 3.2: The residual plots of the standardized residuals vs. fitted values, the Q-Q plots of the standardized residuals and the Cook distance plots of the two cases in Figure 3.1. The upper 3 plots are based on the left panel of Figure 3.1 while the lower 3 are based on the right panel. In the residual/Q-Q plots, the solid points are the outliers while the hollow ones are "good" data. The indices of the outliers are 16, 39, 36, 26, 28, 40.]

To summarize, the regression diagnostics methods do not always work well. There are improved methods by Peña and Yohai (1999) which provide more reliable detection of influential observations and outliers. However, those improved approaches might still fail under masking, and the corresponding estimators' asymptotic properties are often unknown. There are two more problems with the traditional approaches.
• Detecting outliers with diagnostic plots is sometimes based on the judgment of the user, and can be very subjective. In addition, directly deleting the detected outliers might lead us to detect even more new outliers.
• The detecting-deleting or detecting-modifying process makes statistical inference harder. For example, in a LM, suppose we want to test $\beta_j = 0$, $j = 0, 1, \ldots, p$, using the confidence interval of $\beta_j$ based on the estimated standard error of $\hat\beta_j$. In a LM, the standard errors are calculated with linear operations on the design matrix $X$ and response vector $y$.
However, in a modified sample, $\hat\beta_j$ is not obtained from the original data, and the detecting-deleting or detecting-modifying process is hard to express as a linear operation. It is therefore hard to calculate the confidence intervals, making statistical inference difficult. Thus, to eliminate these shortcomings, we need systematic approaches to deal with outliers and to analyze the asymptotic distributions of the resulting estimators.

Robust Regression M-estimates

In this section, we introduce the robust regression M-estimate for LMs. A robust estimate has the following features.
• It provides a good fit for the majority of the data while allowing outliers to have large residuals. The robust fit based on a contaminated dataset is close to the non-robust fit based on the corresponding uncontaminated data. For example, in Figure 3.1, the dotted line is the robust fit with outliers and it is very close to the non-robust fit without outliers.
• It does not require the use of preliminary outlier detection tools. Since it is resistant to outliers by nature, there is no need to detect and then delete or modify observations.

Since the robust approach allows big residuals for some observations, a robust fit suggests a new way to detect outliers: by looking at the difference between the robust fit and the observations. So the traditional approaches and the robust approach are based on a reverse order of steps: the traditional methods fit preliminary models to detect the outliers first and then fit a LM with the modified data; the robust methods fit the model first and then detect outliers from the fit.

The robust regression M-estimate is an extension of the likelihood methods in (3.6)-(3.7). Instead of minimizing the sum of squared residuals, a regression M-estimate downweights the extreme values in order to allow large residuals for some points.
If we replace the functions $\rho_0$ and $\psi_0$ in (3.6) and (3.7) with a more general class of functions, the regression M-estimates are the solution of
$$ \hat\beta = \arg\min_\beta \sum_{i=1}^n \rho\Big(\frac{y_i - x_i^T\beta}{\hat\sigma}\Big), \qquad (3.8) $$
where $\hat\sigma$ is an error scale estimate and $\rho$ is an even function. By differentiating (3.8) with respect to $\beta$, we get the following (3.9), which has the same form as (3.7),
$$ \sum_{i=1}^n \psi\Big(\frac{e_i(\hat\beta)}{\hat\sigma}\Big)\, x_i = 0, \qquad (3.9) $$
where $\psi = \rho'$. The equation (3.9) does not need to be a likelihood estimating equation. For example, when $\rho(x) = |x|$, the estimates $\hat\beta$ satisfy
$$ \hat\beta = \arg\min_\beta \sum_{i=1}^n |r_i(\beta)|, $$
and these M-estimates are also called $L_1$ estimates. Actually, $L_1$ estimates have a longer history than the LS fit (Boscovich, 1757; Laplace, 1799). When $p = 0$, the $L_1$ estimate is the median of $y$.

In order to downweight big $r_i$, the function $\rho(r_i)$ often increases more slowly than the quadratic function when $r_i$ is big, and the function $\psi$ is usually bounded. In (3.8)-(3.9), the functions $\rho$ and $\psi$ are called the $\rho$-function and the $\psi$-function. The $\rho$-function satisfies:
(i) $\rho(x)$ is a nondecreasing function of $|x|$;
(ii) $\rho(0) = 0$;
(iii) for $0 < x < \infty$, $\rho(x) < \rho(\infty)$, where $\rho(\infty) = \lim_{x\to\infty}\rho(x)$ and the limit exists;
(iv) for a bounded $\rho: \mathbb{R} \to \mathbb{R}^+$ which satisfies (i), (ii) and (iii), we can build another function $\tilde\rho$ that also satisfies (i), (ii) and (iii) and
$$ \hat\beta = \arg\min_\beta \sum_{i=1}^n \rho\Big(\frac{y_i - x_i^T\beta}{\hat\sigma}\Big) = \arg\min_\beta \sum_{i=1}^n \tilde\rho\Big(\frac{y_i - x_i^T\beta}{\hat\sigma}\Big), $$
so without loss of generality we assume that $\rho(\infty) = 1$.
The $\psi$-function is the derivative of a $\rho$-function. It is odd and satisfies $\psi(x) \geq 0$ for $x \geq 0$. A widely used $\rho$-function is the Huber function
$$ \rho_c(x) = \begin{cases} x^2/2 & |x| \leq c \\ c(|x| - c/2) & \text{otherwise,} \end{cases} \qquad (3.10) $$
where $c$ is a constant parameter, and its corresponding $\psi$-function is
$$ \psi_c(x) = \begin{cases} x & |x| \leq c \\ c\,\mathrm{sign}(x) & \text{otherwise.} \end{cases} \qquad (3.11) $$
From (3.10) and (3.11), we see that the Huber $\psi$-function is a bounded monotone function with a very simple form.
In the following sections, we only use the Huber $\psi$-function to derive our fitting algorithm. The robust fit with the Huber $\psi$-function and $c = 1.2$ for the previous example is the dotted line in Figure 3.1; the fitting algorithm is introduced later. Note that if $\hat\sigma$ is regression invariant and scale equivariant, so that
$$ \hat\sigma(X, y + X\gamma) = \hat\sigma(X, y), \qquad \hat\sigma(X, \lambda y) = |\lambda|\,\hat\sigma(X, y), $$
where $\hat\sigma(X, y)$ is the error scale estimated from the covariates $X$ and response $y$, then
$$ \sum_{i=1}^n \rho\Big(\frac{y_i - x_i^T\beta}{\hat\sigma(X, y)}\Big) = \sum_{i=1}^n \rho\Big(\frac{y_i + x_i^T\gamma - x_i^T(\beta + \gamma)}{\hat\sigma(X, y + X\gamma)}\Big), $$
$$ \sum_{i=1}^n \rho\Big(\frac{y_i - x_i^T\beta}{\hat\sigma(X, y)}\Big) = \sum_{i=1}^n \rho\Big(\frac{\lambda y_i - x_i^T(\lambda\beta)}{\hat\sigma(X, \lambda y)}\Big). $$
Thus, the solution of (3.8)-(3.9) is also regression and scale equivariant. When the function $\rho$ is convex, $\psi = \rho'$ is monotone and the solution of (3.9) equals the solution of (3.8); see Maronna, Martin and Yohai (2006). However, for (G)LMs, even when $\rho$ is convex and $\psi$ is monotone, the M-estimator in (3.8) can still be adversely affected by high-leverage outliers. There are some solutions to this problem. The M-estimate defined in (3.8) is called a Huber-type M-estimate. One option for dealing with high-leverage outliers is to use another type of M-estimate, called Mallow-type, with additional weights $W(d(x_i))$ for each observation as penalties for high leverage. It has the form
$$ \sum_{i=1}^n \psi\Big(\frac{e_i(\hat\beta)}{\hat\sigma}\Big)\, W(d(x_i))\, x_i = 0, $$
where $d(x_i)$ is a measure of the "largeness" of $x_i$ and $W$ is a weight function. For example, to fit a straight line $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, the function $d: \mathbb{R} \to \mathbb{R}$ can be chosen as
$$ d(x_i) = \frac{|x_i - \hat\mu_x|}{\hat\sigma_x}, $$
where $\hat\mu_x$ and $\hat\sigma_x$ are robust mean and standard deviation estimates of the covariates $x_i$, $i = 1, 2, \ldots, n$. Usually the function $d$ is chosen such that big values of $d(x_i)$ indicate high leverage. The Mallow-type M-estimates can effectively downweight high-leverage points.
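One standard way to solve (3.9) with the Huber $\psi$-function is iteratively reweighted least squares with weights $w_i = \psi_c(r_i)/r_i$, which equal 1 inside $[-c, c]$ and $c/|r_i|$ outside. The sketch below is illustrative rather than the thesis's exact implementation: the scale is re-estimated each iteration by the MADN of the residuals, and the data recreate the earlier contaminated example.

```python
import numpy as np

def huber_weights(r, c=1.345):
    """Huber weights w(r) = psi_c(r)/r: 1 inside [-c, c], c/|r| outside."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

def huber_irls(X, y, c=1.345, tol=1e-8, max_iter=200):
    """Huber-type regression M-estimate via IRLS.

    Rewrites the estimating equation (3.9) as weighted least squares;
    the error scale is the MADN of the current residuals.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # LS start
    for _ in range(max_iter):
        e = y - X @ beta
        sigma = np.median(np.abs(e - np.median(e))) / 0.6745
        w = huber_weights(e / sigma, c)
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# same contaminated setting as before: 6 y-direction outliers out of 40
rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 40)
X = np.column_stack([np.ones(40), x])
y = -1 + 3 * x + rng.normal(0, 0.5, 40)
y[np.argsort(x)[-6:]] -= 30
beta_rob = huber_irls(X, y)
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```

Because $\psi_c$ is bounded, each outlier's influence on the estimate is capped, so the robust slope stays near the true value 3 while the LS slope does not.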
It can be shown that if $d(x_i)$ is known, the weights $W(d(x_i))$ do not add complexity to our model fitting algorithm, which is introduced later. However, the Mallow-type M-estimates are still not ideally robust, since when the number of linear covariates $p$ is big, even a small proportion of outliers can tremendously distort the estimated coefficients $\hat\beta$. A better remedy is to use a non-convex $\rho$-function and a bounded "redescending" $\psi$-function. The M-estimates with redescending $\psi$-functions are called redescending M-estimates and those with monotone $\psi$-functions are called monotone M-estimates. For redescending M-estimates, the solutions of (3.9) are usually not unique. In this thesis, we only focus on Huber-type M-estimates defined in (3.8)-(3.9). For more details on redescending M-estimates, see Maronna, Martin and Yohai (2006).

3.2 Robust GLM Regression M-estimates

In this section, a consistent robust GLM regression M-estimate is introduced. Like other M-estimates, its estimating equations are extended from the quasi-likelihood or quasi-score functions. As mentioned at the beginning of this chapter, MLEs are consistent. However, the solution of a robust quasi-likelihood equation might be asymptotically biased. This is not hard to imagine: if the MLE based on (3.6) or (3.7) satisfies $\hat\beta_{MLE} \xrightarrow{p} \beta_{true}$ as $n \to \infty$, any other asymptotically unbiased estimate $\hat\beta$ should not deviate from the MLE. But we cannot guarantee that a robust estimate $\hat\beta_{rob}$ obtained by adjusting the formulae of (3.6) or (3.7) is even close to the MLE. Similarly, in the case of GLMs, a robust M-estimate based on optimizing a robust quasi-likelihood similar to (3.8) may also be asymptotically biased. To solve this problem, Bianco and Yohai (1995) proposed a class of M-estimates for logistic regression models which generates consistent robust estimates.
Cantoni and Ronchetti (2001) presented robust estimators for GLMs based on a robust quasi-likelihood. The main idea is to add a Fisher consistency correction term to the quasi-likelihood equations in (3.9),
$$ \sum_{i=1}^n \psi\Big(\frac{e_i(\beta)}{\sigma}\Big)\frac{x_i}{\sigma} - a(\beta) = 0, \qquad (3.12) $$
where $a(\beta) = \sum_{i=1}^n E\big[\psi(e_i(\beta)/\sigma)\big]\frac{x_i}{\sigma}$ and $E[\psi(e_i(\beta)/\sigma)]$ is the conditional expectation with respect to $y|x$. This method can also be applied to GAPLMs; we adopt this estimate and derive the fitting algorithm in the following section. In fact, Künsch, Stefanski and Carroll (1989) showed that, in a more general setting, if the estimator $\hat\theta$ is based on the M-estimator estimating equation(s)
$$ \sum_{i=1}^n H(y_i, x_i, \theta) = 0, \qquad (3.13) $$
then the estimator is consistent, $\hat\theta \xrightarrow{p} \theta_0$, when $E_{\theta_0}[H(y, x, \theta_0)] = 0$. Cantoni and Ronchetti (2001) proposed a class of robust M-estimators for GLMs based on the classical quasi-score functions, which have the form of (2.21), and whose estimating equations satisfy (3.13). The estimating equations are
$$ \sum_{i=1}^n \Psi(y_i, \mu_i) = 0, \qquad (3.14) $$
where
$$ \Psi(y, \mu) = v(y, \mu)\frac{\partial\mu}{\partial\beta} - a(\beta), \qquad a(\beta) = \frac{1}{n}\sum_{i=1}^n E\Big[v(y_i, \mu_i)\frac{\partial\mu_i}{\partial\beta}\Big]. $$
In (3.14), $v(y, \mu_i) = \psi_c\big(e_i/(a(\phi)V(\mu_i))^{1/2}\big)\,\frac{1}{(a(\phi)V(\mu_i))^{1/2}}$. Note that taking $\psi_c$ as the identity gives $v(y, \mu_i) = (y - \mu_i)/(a(\phi)V(\mu_i))$; then $a(\beta) = 0$ and we recover the classical quasi-score functions (2.16). The M-estimator in Cantoni and Ronchetti (2001) is of Mallow type, but we only present the derivation for Huber-type M-estimates in the following sections; it is easy to extend our method to the Mallow-type case. Cantoni and Ronchetti (2001) also gave the derivation of the fitting algorithm for (3.14) using Huber's $\psi$-function $\psi_c$, and they showed that the corresponding estimates are asymptotically Normal. In this thesis, we only discuss model fitting; the asymptotic properties of the estimates are not covered. The details can be found in Cantoni and Ronchetti (2001) and Heritier et al. (2009).
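The correction term $a(\beta)$ requires the conditional expectation $E[\psi_c(r_i)]$ under the assumed model. For the Poisson case this expectation has a closed form (derived in Section 3.2.2); the sketch below, with arbitrary illustrative values of $\mu$ and $c$, checks that closed form against a brute-force summation over the Poisson pmf:

```python
import math

def poisson_pmf(mu, jmax):
    """P(Y = j) for j = 0..jmax, computed iteratively."""
    p = [math.exp(-mu)]
    for j in range(1, jmax + 1):
        p.append(p[-1] * mu / j)
    return p

def e_psi_huber_poisson_direct(mu, c, jmax=200):
    """E[psi_c((Y - mu)/s)] for Y ~ Poisson(mu), by direct summation."""
    s = math.sqrt(mu)
    pmf = poisson_pmf(mu, jmax)
    return sum(max(-c, min(c, (j - mu) / s)) * pmf[j] for j in range(jmax + 1))

def e_psi_huber_poisson_closed(mu, c):
    """Closed form: c[P(Y >= j2+1) - P(Y <= j1)] + (mu/s)[P(Y = j1) - P(Y = j2)],
    with j1 = floor(mu - c*s), j2 = floor(mu + c*s)."""
    s = math.sqrt(mu)
    j1 = math.floor(mu - c * s)
    j2 = math.floor(mu + c * s)
    pmf = poisson_pmf(mu, max(j2, 0))
    cdf = lambda k: sum(pmf[:k + 1]) if k >= 0 else 0.0
    p_j1 = pmf[j1] if j1 >= 0 else 0.0
    return c * ((1 - cdf(j2)) - cdf(j1)) + (mu / s) * (p_j1 - pmf[j2])

correction = e_psi_huber_poisson_closed(3.7, 1.345)  # one term of a(beta)
```

Subtracting this expectation from $\psi_c(r_i)$ centers the robust score at the true parameter, which is exactly the Fisher consistency condition $E_{\theta_0}[H(y, x, \theta_0)] = 0$ of (3.13).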
3.2.1 Fitting the Robust GLMs

In this section, we present the algorithm of Cantoni and Ronchetti (2001) for solving the robust quasi-likelihood estimating equations (3.15). In this algorithm, Huber's $\psi$-function defined in (3.11) is used. As for the MLE in the non-robust counterpart, the derivation of this robust algorithm also uses the Newton-Raphson algorithm, and the result again has an iteratively reweighted least squares (IRWLS) form. So, just like the non-robust algorithm for GLMs, this elegant form can be naturally extended with the generalized local scoring algorithm and the Speckman estimator to fit GAPLMs.

As before, the responses $Y = (Y_1, Y_2, \ldots, Y_n)$ are independent random variables in the exponential family and the mean of $Y$ given the covariates is linked to the linear predictor $\eta = x^T\beta$ through a link function $g$, so that $g(\mu) = \eta$. The robust quasi-score functions are
$$ \sum_{i=1}^n \psi_c\Big(\frac{y_i - \mu_i}{s_i}\Big)\frac{1}{s_i}\frac{\partial\mu_i}{\partial\beta} - a(\beta) = 0, \qquad (3.15) $$
where $s_i = \sqrt{a(\phi)V(\mu_i)}$. Since $\psi_c$ is a bounded function, the estimating equations can effectively downweight the extreme residuals. The following derivation can also be found in Cantoni and Ronchetti (2001) and Alimadad and Salibian-Barrera (2011). Since (3.15) are nonlinear equations, we denote
$$ f(\beta) = \begin{pmatrix} f_1(\beta) \\ f_2(\beta) \\ \vdots \\ f_{p+1}(\beta) \end{pmatrix} = \sum_{i=1}^n \psi_c\Big(\frac{y_i - \mu_i}{s_i}\Big)\frac{1}{s_i}\frac{\partial\mu_i}{\partial\beta} - a(\beta) \qquad (3.16) $$
$$ = \sum_{i=1}^n \big[\psi_c(r_i) - E(\psi_c(r_i))\big]\,\frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\, x_i, $$
and we are going to solve the equations $f(\beta) = 0$ by applying the Newton-Raphson algorithm to numerically calculate the solution. The iterative updating scheme is
$$ \nabla f(\hat\beta^{old})\,\big(\hat\beta^{new} - \hat\beta^{old}\big) = -f(\hat\beta^{old}), \qquad (3.17) $$
where $\nabla f(\hat\beta^{old}) = \big(\frac{\partial f_i}{\partial\beta_j}\big)_{(p+1)\times(p+1)}$ is the gradient of $f$ at $\hat\beta^{old}$ with respect to $\beta$. Following the Fisher scoring algorithm, we substitute the gradient $\nabla f(\hat\beta^{old})$ with its expectation and get the equations
$$ E\big[\nabla f(\hat\beta^{old})\big]\,\big(\hat\beta^{new} - \hat\beta^{old}\big) = -f(\hat\beta^{old}). \qquad (3.18) $$
Next, we derive the expression of $E\big[\nabla f(\hat\beta^{old})\big]$.
If we denote $h_i = \psi_c(r_i) - E(\psi_c(r_i))$, then the $i$-th summand of (3.16) is
$$ \psi_c\Big(\frac{y_i - \mu_i}{s_i}\Big)\frac{1}{s_i}\frac{\partial\mu_i}{\partial\beta} - E\big[\psi_c(r_i)\big]\frac{1}{s_i}\frac{\partial\mu_i}{\partial\beta} = h_i\,\frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\, x_i, $$
and the gradient of $f(\beta)$ can be written as
$$ \nabla f(\beta) = \sum_{i=1}^n \Big[ \frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\, x_i \Big(\frac{\partial h_i}{\partial\beta}\Big)^T + h_i\,\frac{\partial}{\partial\beta^T}\Big(\frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\, x_i\Big) \Big]. $$
Since $E(h_i) = 0$, we get
$$ E\big[\nabla f(\beta)\big] = \sum_{i=1}^n \frac{1}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\, x_i\, E\Big(\frac{\partial h_i}{\partial\beta}\Big)^T. \qquad (3.19) $$
Then, we derive $E\big(\frac{\partial h_i}{\partial\beta}\big)$. The derivative of $h_i$ with respect to $\beta$ is
$$ \frac{\partial h_i}{\partial\beta} = \psi_c'(r_i)\,\frac{\partial r_i}{\partial\beta} - \frac{\partial}{\partial\beta}E\big[\psi_c(r_i)\big] = \psi_c'(r_i)\,\frac{\partial}{\partial\beta}\Big[\frac{y_i - \mu_i}{s_i}\Big] - \frac{\partial}{\partial\beta}E\big[\psi_c(r_i)\big] $$
$$ = -\psi_c'(r_i)\,\frac{1}{s_i^2}\Big[s_i\frac{\partial\mu_i}{\partial\eta_i} + (y_i - \mu_i)\frac{\partial s_i}{\partial\eta_i}\Big] x_i - \frac{\partial}{\partial\eta_i}E\big[\psi_c(r_i)\big]\, x_i $$
$$ = -\Big[\frac{\psi_c'(r_i)}{s_i}\frac{\partial\mu_i}{\partial\eta_i} + \frac{\psi_c'(r_i)(y_i - \mu_i)}{s_i^2}\frac{\partial s_i}{\partial\eta_i} + \frac{\partial}{\partial\eta_i}E\big[\psi_c(r_i)\big]\Big] x_i. $$
Taking expectations,
$$ E\Big(\frac{\partial h_i}{\partial\beta}\Big) = -\Big[\frac{E[\psi_c'(r_i)]}{s_i}\frac{\partial\mu_i}{\partial\eta_i} + \frac{E[\psi_c'(r_i)\,r_i]}{s_i}\frac{\partial s_i}{\partial\eta_i} + \frac{\partial}{\partial\eta_i}E\big[\psi_c(r_i)\big]\Big] x_i = l_i\, x_i, \qquad (3.20) $$
where $l_i = -\frac{E[\psi_c'(r_i)]}{s_i}\frac{\partial\mu_i}{\partial\eta_i} - \frac{E[\psi_c'(r_i)\,r_i]}{s_i}\frac{\partial s_i}{\partial\eta_i} - \frac{\partial}{\partial\eta_i}E[\psi_c(r_i)]$. If we plug (3.20) into (3.19), we get
$$ E\big[\nabla f(\beta)\big] = \sum_{i=1}^n \frac{l_i}{s_i}\frac{\partial\mu_i}{\partial\eta_i}\, x_i x_i^T. \qquad (3.21) $$
Then, plugging (3.21) into (3.18), the iterative update can be expressed as
$$ E\big[\nabla f(\hat\beta^{old})\big]\,\hat\beta^{new} = E\big[\nabla f(\hat\beta^{old})\big]\,\hat\beta^{old} - f(\hat\beta^{old}), \qquad (3.22) $$
which is equivalent to
$$ \sum_{i=1}^n \frac{l_i^{old}}{s_i^{old}}\frac{\partial\mu_i^{old}}{\partial\eta_i^{old}}\, x_i x_i^T\, \hat\beta^{new} = \sum_{i=1}^n \frac{l_i^{old}}{s_i^{old}}\frac{\partial\mu_i^{old}}{\partial\eta_i^{old}}\Big(\eta_i^{old} - \frac{h_i^{old}}{l_i^{old}}\Big)\, x_i. $$
If we denote
$$ w_i^{old} = \frac{l_i^{old}}{s_i^{old}}\frac{\partial\mu_i^{old}}{\partial\eta_i^{old}}, \qquad z_i^{old} = \eta_i^{old} - \frac{h_i^{old}}{l_i^{old}}, \qquad (3.23) $$
the updating procedure can be written as
$$ \sum_{i=1}^n w_i^{old}\, x_i x_i^T\, \beta^{new} = \sum_{i=1}^n w_i^{old}\, x_i\, z_i^{old}, $$
which has the form of a weighted least squares equation. So,
$$ \beta^{new} = (X^T W^{old} X)^{-1} X^T W^{old} z^{old}, \qquad (3.24) $$
where $W^{old} = \mathrm{Diag}(w_1^{old}, w_2^{old}, \ldots, w_n^{old})$ and $z^{old} = (z_1^{old}, z_2^{old}, \ldots, z_n^{old})$. By applying the Newton-Raphson algorithm with Fisher scoring, the robust quasi-likelihood solution is reduced to an IRWLS with adjusted responses $z_i^{old}$ and weights $w_i^{old}$ in each iteration.
By analogy, this form also suggests using the generalized local scoring algorithm, with the same z_i^old and w_i^old, to fit robust GAM M-estimates. So the only task left is to calculate l_i and h_i for each distribution.

3.2.2 Computation of the Robust GLMs

According to the IRWLS form (3.24), the algorithm for fitting robust GLMs is similar to its non-robust counterpart. The only difference is the use of different adjusted responses z_i and weights w_i. In this section, we show the calculation of the weights w_i = (l_i/s_i) ∂µ_i/∂η_i and the adjusted responses z_i = η_i − h_i/l_i, where h_i = ψ_c(r_i) − Eψ_c(r_i) and

    l_i = −[ (E[ψ_c′(r_i)]/s_i)(∂µ_i/∂η_i) + (E[ψ_c′(r_i) r_i]/s_i)(∂s_i/∂η_i) + E(∂/∂η_i Eψ_c(r_i)) ].

In addition, we found a mistake in the code of Alimadad and Salibian-Barrera (2011), where the code miscalculates the values of ∂µ_i/∂η_i and ∂s_i/∂η_i in the Binomial case.

To simplify the task, we only show the computation schemes of Eψ_c(r_i), Eψ_c′(r_i), E[ψ_c′(r_i) r_i] and E[∂/∂η_i Eψ_c(r_i)], where r_i = (Y_i − µ_i)/V(µ_i)^{1/2} = (Y_i − µ_i)/s_i are the Pearson residuals and ψ_c is Huber's ψ-function defined in (3.11), with

    ψ_c′(r) =(a.s.) 1 if |r| ≤ c,  0 if |r| > c.    (3.25)

Note that ψ_c is not differentiable at the points c and −c, but the almost-sure definition (3.25) is more convenient for calculating expectations. This calculation scheme was proposed by Cantoni and Ronchetti (2001). Since the calculation differs for each distribution, the computation recipe for each distribution is listed below.

• Eψ_c(r_i)

  – Poisson distribution. We define j_1 = ⌊µ_i − c s_i⌋ and j_2 = ⌊µ_i + c s_i⌋, where ⌊·⌋ is the floor function. For a Poisson random variable Y_i ~ P(µ_i), the standard deviation is s_i = √µ_i. Then,

        E[ψ_c((Y_i − µ_i)/s_i)] = Σ_{j=0}^{+∞} ψ_c((j − µ_i)/s_i) P(Y_i = j)
            = c[P(Y_i ≥ j_2 + 1) − P(Y_i ≤ j_1)] + (µ_i/s_i)[P(Y_i = j_1) − P(Y_i = j_2)].

  – Binomial distribution. The response is Y_i ~ B(n_i, p_i), so that E(Y_i) = µ_i = n_i p_i and s_i = sqrt(µ_i (n_i − µ_i)/n_i).
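As a numerical check of the Poisson recipe above, the closed form can be compared against a brute-force truncated sum over the Poisson support. This is an illustrative Python sketch with our own helper names, not the thesis's R code:

```python
import math

def pois_pmf(j, mu):
    """Poisson probability mass function via log-space evaluation."""
    return math.exp(-mu + j * math.log(mu) - math.lgamma(j + 1))

def e_psi_huber_poisson(mu, c):
    """E[psi_c((Y - mu)/s)] for Y ~ Poisson(mu), s = sqrt(mu),
    using the closed form of Cantoni and Ronchetti (2001)."""
    s = math.sqrt(mu)
    j1 = math.floor(mu - c * s)
    j2 = math.floor(mu + c * s)
    cdf = lambda k: sum(pois_pmf(j, mu) for j in range(0, k + 1)) if k >= 0 else 0.0
    pmf = lambda k: pois_pmf(k, mu) if k >= 0 else 0.0
    # c [P(Y >= j2+1) - P(Y <= j1)] + (mu/s) [P(Y = j1) - P(Y = j2)]
    return c * ((1.0 - cdf(j2)) - cdf(j1)) + mu / s * (pmf(j1) - pmf(j2))

def e_psi_direct(mu, c, upper=200):
    """Brute-force check: sum psi_c((j - mu)/s) P(Y = j) over j."""
    s = math.sqrt(mu)
    psi = lambda r: max(-c, min(c, r))
    return sum(psi((j - mu) / s) * pois_pmf(j, mu) for j in range(upper))
```

The guards on `cdf` and `pmf` cover the case µ − c s < 0, where j_1 is negative and the corresponding probabilities vanish.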
If we keep the same definitions of j_1 and j_2, then

        E[ψ_c((Y_i − µ_i)/s_i)] = Σ_{j=0}^{+∞} ψ_c((j − µ_i)/s_i) P(Y_i = j)
            = c[P(Y_i ≥ j_2 + 1) − P(Y_i ≤ j_1)] + (µ_i/s_i)[P(j_1 ≤ Ỹ_i ≤ j_2 − 1) − P(j_1 + 1 ≤ Y_i ≤ j_2)],

    where Ỹ_i ~ B(n_i − 1, p_i).

  – Normal distribution. Since (Y_i − µ_i)/s_i follows the standard normal distribution, which is symmetric,

        E[ψ_c((Y_i − µ_i)/s_i)] = 0.

  – Gamma distribution. We assume the response random variable Y_i ~ Gamma(α, β_i), and that we have a preliminary estimate of the nuisance parameter φ = a(φ); then the adjusted response Y_i* = Y_i/(φµ_i) = β_i Y_i ~ Gamma(α, 1). We denote the pdf of Y_i* by f(t; α) and its cdf by F(t; α). Then,

        E[ψ_c((Y_i − µ_i)/s_i)] = E[ψ_c((β_i Y_i − β_i µ_i)/(β_i s_i))]
            = c ∫_{β_i(µ_i + cs_i)}^{+∞} f(t; α) dt − c ∫_{−∞}^{β_i(µ_i − cs_i)} f(t; α) dt
              + ∫_{β_i(µ_i − cs_i)}^{β_i(µ_i + cs_i)} ((t − β_i µ_i)/(β_i s_i)) f(t; α) dt
            = c[1 − F(β_i(µ_i + cs_i); α)] − c F(β_i(µ_i − cs_i); α)
              − (µ_i/s_i)[F(β_i(µ_i + cs_i); α) − F(β_i(µ_i − cs_i); α)]
              + (Γ(α + 1)/(β_i s_i Γ(α)))[F(β_i(µ_i + cs_i); α + 1) − F(β_i(µ_i − cs_i); α + 1)].

    Note that, as in the Normal case, we assume a constant shape parameter α across all observations. The nuisance parameter a(φ) = φ = 1/α can be substituted by a corresponding estimate; in the following chapter, one such estimate is used to fit a Gamma-response GAPLM.

• Eψ_c′(r_i)

  – Poisson and Binomial distributions.

        E[ψ_c′((Y_i − µ_i)/s_i)] = Σ_{j=0}^{+∞} ψ_c′((j − µ_i)/s_i) P(Y_i = j) = P(Y_i ≤ j_2) − P(Y_i < j_1).

  – Normal distribution. We assume that the Y_i, i = 1, 2, ..., n, have homogeneous variance, i.e., s_1 = s_2 = ... = s_n. So, we use the normalized median absolute deviation about the median (MADN), a robust estimator of the sample standard deviation, to estimate s_i from e = y − µ:

        ŝ_i = MADN(e) = Med{|e − Med(e)|}/0.6745,    (3.26)

    and for a Gaussian random variable U ~ N(µ, σ²), MADN(U) = σ. Denoting the cdf of the standard normal by Φ_0(t),

        E[ψ_c′((Y_i − µ_i)/ŝ_i)] = Φ_0(c) − Φ_0(−c).
  – Gamma distribution. The expectation can be computed simply from the cdf of Y_i* = β_i Y_i = Y_i/(φµ_i):

        E[ψ_c′((Y_i − µ_i)/s_i)] = F(β_i(µ_i + cs_i); α) − F(β_i(µ_i − cs_i); α).

    Note that φ can be substituted by a corresponding estimate.

• E[ψ_c′(r_i) r_i]

  – Poisson and Binomial distributions. The expectation is a sum of finitely many terms:

        E[ψ_c′(r_i) r_i] = Σ_{j=j_1}^{j_2} ((j − µ_i)/s_i) P(Y_i = j).

  – Normal distribution. Since the Pearson residual is a symmetric random variable, E[ψ_c′(r_i) r_i] = 0.

  – Gamma distribution. Similarly to the derivation of E(ψ_c(r_i)),

        E[ψ_c′((Y_i − µ_i)/s_i) r_i] = ∫_{β_i(µ_i − cs_i)}^{β_i(µ_i + cs_i)} ((t − β_i µ_i)/(β_i s_i)) f(t; α) dt
            = −(µ_i/s_i)[F(β_i(µ_i + cs_i); α) − F(β_i(µ_i − cs_i); α)]
              + (Γ(α + 1)/(β_i s_i Γ(α)))[F(β_i(µ_i + cs_i); α + 1) − F(β_i(µ_i − cs_i); α + 1)].

• E[∂/∂η_i Eψ_c(r_i)]. We use ΔEψ_c(r_i)/Δη_i to approximate this expectation. We specify a small value L and set µ_{i1} = g^{-1}(η_i − L) and µ_{i2} = g^{-1}(η_i + L), where η_i = g(µ_i). Then, following the computation schemes above, we calculate ΔEψ_c(r_i) = Eψ_c(r_{i2}) − Eψ_c(r_{i1}) and Δη_i = 2L.

3.3 Robust GAPLMs M-estimates

Our robust GAPLM M-estimates fitting algorithm can be obtained from the robust GLM M-estimates fitting algorithm following the same approach as in Chapter 2. The algorithm has the same structure as the non-robust GAPLM algorithm; the only difference is that we use the adjusted response variable z^[m] and the weights w^[m] in the m-th iteration, following the definitions in (3.23) and the computation schemes in Section 3.2.2.

3.3.1 Model Prediction

Once the model is fitted, we can make predictions from the fit, and furthermore, we can detect outliers with the predicted response values ŷ. Also, when choosing the tuning parameter of LOESS by cross validation, we need to predict the response at new covariate values. The prediction of the robust GAPLM M-estimate is therefore of great importance in practice.
Since in our backfitting algorithm we center the nonparametric terms in every iteration, there is a lagging effect in the prediction of the nonparametric terms. Here we explain this lagging effect and describe in detail how to obtain predictions.

Suppose that at the L-th iteration the estimates reach convergence, and the values of w^[L], z^[L], β^[L], and s_j^[L], c_j^[L], j = 1, 2, ..., k, are returned as the final estimates by our function; see the algorithms in 2.2.1 and 2.3.2. Recall that the c_j are the centring constants in the backfitting algorithm. Suppose we are going to predict the parametric as well as the nonparametric terms at a new point (x_1, x_2, ..., x_h, t_1, t_2, ..., t_k). There are two types of additive components: the parametric and the nonparametric terms. As in GLMs, the parametric terms can be easily predicted with the estimated linear coefficients β̂, i.e., the predicted parametric terms are β̂_0 + Σ_{i=1}^h β̂_i X_i. As for the nonparametric terms Σ_{j=1}^k f_j(T_j), it is natural to predict each f_l(t_l) with the partial residuals z^[L] − β̂_0 1 − Σ_{i=1}^h β̂_i x_i − Σ_{j≠l} ŝ_j^[L] using LOESS. However, since the backfitting algorithm centers the nonparametric terms in every iteration, the partial residuals give predictions that are incoherent with our fits. We have to adjust those nonparametric terms.

In general, if the predicted terms are

    (β̂_0, β̂_1 x_1, β̂_2 x_2, ..., β̂_h x_h, f̂_1(t_1), f̂_2(t_2), ..., f̂_k(t_k)),

the predicted parametric terms are

    β̂_0 = β_0^[L],    β̂_j x_j = β_j^[L] x_j,  j = 1, 2, ..., h,

and the nonparametric prediction consists of two steps.

1. Predict f̂_l(t_l) with the partial residuals,

    S(z^[L] − β̂_0 − Σ_{i=1}^h β̂_i x_i − Σ_{j≠l} ŝ_j^[L] ~ t^(l), w^[L])  →(predict)  f̂_l(t_l),

   where t^(l) ∈ R^n is the observed l-th nonparametric covariate.

2. If l ≠ k, adjust f̂_l(t_l) by

    f̂_l(t_l) ← f̂_l(t_l) + Σ_{j=l+1}^k c_j^[L].    (3.27)

   If l = k, do not update f̂_l(t_l).
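The correction step (3.27) can be sketched as follows. This is a schematic illustration, assuming the per-term LOESS predictions `f_hat` and the final centring constants `c_final` are already available:

```python
def adjust_nonparametric_predictions(f_hat, c_final):
    """Apply the lagging correction (3.27): add the trailing centring
    constants c_{l+1}, ..., c_k to the l-th predicted smooth term.
    The last term (l = k) is left unchanged, since its trailing sum is empty."""
    k = len(f_hat)
    return [f_hat[l] + sum(c_final[l + 1:]) for l in range(k)]
```

For example, with three smooth terms, the first prediction is shifted by c_2 + c_3, the second by c_3, and the third not at all.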
In (3.27), the correction term Σ_{j=l+1}^k c_j^[L] shows a lagging effect of the centring constants c_j. This lagging pattern can be explained by the steps of our backfitting algorithm.

When the backfitting algorithm converges, s_j^[L], j = 1, 2, ..., k, are the final centered additive terms, and c_j^[L], j = 1, 2, ..., k, are their mean values before centering. The intercept s_0 before updating is s_0 = v̄1, or it is set to any real number by the user. Now we inspect the l-th term. In each iteration, since we center each nonparametric term right before the next term is calculated, at convergence those terms satisfy

    s_l^[L] (before centering) = s_l^[L] + c_l^[L] 1
        = S(v − s_0 − Σ_{j=1}^{l−1} (s_j^[L] + c_j^[L] 1) − Σ_{j=l+1}^k s_j^[L] ~ t^(l), w^[L])
        = S(v − s_0 − Σ_{j=1}^{l−1} c_j^[L] 1 − Σ_{j=1}^{l−1} s_j^[L] − Σ_{j=l+1}^k s_j^[L] ~ t^(l), w^[L]).

After cycling over every covariate, the intercept has been updated to s_0^[L] = s_0 + (Σ_{j=1}^k c_j^[L]) 1. When we calculate the prediction with the partial residuals,

    S(v − s_0^[L] − Σ_{j=1}^{l−1} s_j^[L] − Σ_{j=l+1}^k s_j^[L] ~ t^(l), w^[L])
        = S(v − s_0 − Σ_{j=1}^{l−1} c_j^[L] 1 − Σ_{j=1}^{l−1} s_j^[L] − Σ_{j=l+1}^k s_j^[L] − c_l^[L] 1 − Σ_{j=l+1}^k c_j^[L] 1 ~ t^(l), w^[L])
        = s_l^[L] + c_l^[L] 1 − c_l^[L] 1 − Σ_{j=l+1}^k c_j^[L] 1
        = s_l^[L] − Σ_{j=l+1}^k c_j^[L] 1.

So by adding Σ_{j=l+1}^k c_j^[L] to the nonparametric terms, we get predictions that are "coherent" with the fits returned in the final iteration of the algorithm.

Chapter 4
Simulation Study

In this chapter, we report the results of a series of simulations designed to compare the performance of the robust fit for GAPLMs introduced in Chapter 3 with the estimation method of Hastie and Tibshirani (1990), which is implemented by the R function gam(). We introduced two non-robust estimating algorithms for GAPLMs in Chapter 2: one using the generalized Speckman estimator, and Hastie and Tibshirani's (1990) estimator discussed in 2.3.2.
Since gam() uses a different way to estimate the linear coefficients, and it also estimates linear coefficients for the nonparametric covariates, it is difficult to directly compare the performance of our robust fit to that of Hastie and Tibshirani's fit. It can be seen from the simulation results in the next section that the fitted intercept β̂_0 of gam() deviates much more from the true value than that of our robust fit, which is not surprising. To present a comprehensive comparison between the robust and non-robust fits, we also provide the simulation results of the non-robust fit based on the generalized Speckman estimator. We can compare the simulation results of the two non-robust fits first, and then compare the performance of the robust and non-robust fits. We call the fit of the function gam() the GAM fit, the robust fit of Chapter 3 the RGAPLM fit, and the non-robust fit using the generalized Speckman estimator of Chapter 2 the GAPLM fit. All of our calculations are performed in R 3.0.0.

In our simulation study, there are two covariates: one parametric covariate X and one nonparametric covariate T,

    g(E(Y|X,T)) = β_0 + β_1 X + f(T).

We simulate datasets with Poisson as well as Binomial responses. In addition, we generate datasets with and without outliers, with different locations, magnitudes and proportions of outliers. The magnitude of an outlier shows how far it deviates from the true distribution; we only consider different magnitudes in the Poisson case. For each simulation scenario, we generate 5000 samples of 200 observations and calculate the mean squared errors

    MSE_ℓ = (1/n) Σ_{i=1}^n (µ_i − µ̂_i)²,  ℓ = 1, 2, ..., 5000,

where

    µ_i = β_0 + β_1 x_i + f(t_i),    µ̂_i = β̂_{0ℓ} + β̂_{1ℓ} x_i + f̂_ℓ(t_i),

and the estimated linear coefficients' absolute proportional biases (APBs)

    APB_{pℓ} = |β̂_{pℓ} − β_p^true| / |β_p^true|,  p = 0, 1.

In this chapter, the means and standard deviations of the 5000 MSEs and APBs are reported.
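The two summary measures can be computed as follows (illustrative helpers with our own names, assuming the true and fitted values are held in plain sequences):

```python
def mse(mu, mu_hat):
    """Mean squared error (1/n) * sum_i (mu_i - muhat_i)^2 for one sample."""
    return sum((a - b) ** 2 for a, b in zip(mu, mu_hat)) / len(mu)

def apb(beta_hat, beta_true):
    """Absolute proportional bias |betahat_p - beta_p| / |beta_p|."""
    return abs(beta_hat - beta_true) / abs(beta_true)
```

In the simulation, each of the 5000 replicates yields one MSE and one APB per coefficient, and their means and standard deviations are then tabulated.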
The function gam() provides two classes of univariate smoothers: LOESS and splines. In this simulation study, we only use LOESS to fit GAPLMs. As introduced in 2.2.1.1, there is a parameter α which determines the bandwidth of LOESS. In the three algorithms, the smoothing is implemented by the R function loess(), and the parameter determining the bandwidth is the span argument; span has the same definition as the α in 2.2.1.1. Generally speaking, span has to be chosen by separate cross validations (CVs) on each sample. However, cross validation is computationally expensive, and it may be unstable and even non-robust on contaminated samples, as discussed in Alimadad and Salibian-Barrera (2011). Thus we do not run CVs on each sample. Since our aim is to compare the performance of the three algorithms, it is enough to choose spans which make the results comparable. So we decide the spans based on 5-fold CVs on 20 uncontaminated samples first, and then keep those values fixed when fitting both the uncontaminated and contaminated samples.

In a 5-fold CV, we randomly divide one sample into 5 equal-sized subgroups and predict the responses of each subgroup with the fit based on the other 4 subgroups. Suppose the subgroup index of the i-th observation is Gr(i); the CV value of a 5-fold CV is

    CV = Σ_{i=1}^n ( (y_i − µ̂_(−Gr(i))) / sqrt(V(µ̂_(−Gr(i)))) )²,    (4.1)

where µ̂_(−Gr(i)) is the predicted response obtained from the fit without subgroup Gr(i). For a given sample with fixed span, we run the 5-fold CV 10 times and use the mean of the 10 CV values defined in (4.1) as the CV criterion.

The tuning parameter c of Huber's ψ-function defined in (3.11) determines the tolerance of our robust fitting algorithm to extreme responses deviating from the fitted model. The smaller this parameter is, the smaller the impact of outlying observations on the fit, and the more robust our fit is.
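The role of c can be seen directly from Huber's ψ-function, which is the identity inside [−c, c] and caps the standardized residual at ±c outside it (a minimal sketch):

```python
def psi_huber(r, c=1.5):
    """Huber's psi: identity on [-c, c], clipped to +/-c outside,
    so an extreme residual contributes at most c to the score."""
    return max(-c, min(c, r))
```

With c = 1.5, a standardized residual of 4 contributes only 1.5 to the estimating equations, whereas the likelihood score would use 4 itself; shrinking c tightens the cap and increases robustness.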
In this simulation study, we set the fixed value c = 1.5 for the RGAPLM fits, the value used in Alimadad and Salibian-Barrera (2011).

In all the simulations, we generate one covariate x = (x_1, x_2, ..., x_200) for the parametric term and one covariate t = (t_1, t_2, ..., t_200) for the smooth term. The covariates are kept fixed throughout all the simulation settings when generating the responses of each sample. The covariates are generated in the following way for both the Poisson and Binomial cases:

    t_i = i,  i = 1, 2, ..., 200,
    X_i ~iid Unif(−20, 20).

We also want to check whether the location of the outliers affects the fits. In the plane of covariates (x_i, t_i), the points in the marginal region are more sparse, making the smoothing less "local" than for points in the central region. So we expect that the location (marginal or central region of the covariate plane) of the outliers might have different effects on the fits. We generate outliers in the marginal and central regions of the covariate plane to inspect this effect. We only add outliers to the response variable, and we locate the outliers in either the center or the margin of the plane of covariates (x_i, t_i), i = 1, 2, ..., 200. To determine the center and the margin, we use the Mahalanobis distances of the covariates in the matrix S = (x, t), with center vector u_0^T = (x̄, t̄) = (1/n) 1^T S and covariance matrix

    cov(S) = (1/(n−1)) (S − (1/n) 1 1^T S)^T (S − (1/n) 1 1^T S).

The Mahalanobis distance of each observation is

    d_i = (u_i − u_0)^T cov(S)^{-1} (u_i − u_0).

Then, we sort the Mahalanobis distances, d_(1), d_(2), ..., d_(200), and mark the covariates (x_i, t_i) corresponding to d_(1), d_(2), ..., d_(100) as central points, and those u_i corresponding to d_(101), ..., d_(200) as marginal points. The covariates, with the central and marginal points, are shown in Figure 4.1.
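The centre/margin split can be sketched as follows. This is a pure-Python illustration of the distance computation for the two-column matrix S = (x, t), not the thesis's R code:

```python
def mahalanobis_split(x, t):
    """Split covariate points (x_i, t_i) into 'central' (smallest n/2
    Mahalanobis distances) and 'marginal' (largest n/2) index sets."""
    n = len(x)
    xbar, tbar = sum(x) / n, sum(t) / n
    # sample covariance matrix of S = (x, t)
    sxx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    stt = sum((ti - tbar) ** 2 for ti in t) / (n - 1)
    sxt = sum((xi - xbar) * (ti - tbar) for xi, ti in zip(x, t)) / (n - 1)
    det = sxx * stt - sxt ** 2
    # inverse of the 2x2 covariance matrix
    ixx, itt, ixt = stt / det, sxx / det, -sxt / det
    d = [ixx * (xi - xbar) ** 2
         + 2 * ixt * (xi - xbar) * (ti - tbar)
         + itt * (ti - tbar) ** 2
         for xi, ti in zip(x, t)]
    order = sorted(range(n), key=lambda i: d[i])
    return set(order[: n // 2]), set(order[n // 2:])
```

The half of the points nearest the centroid in the Mahalanobis metric is labelled central, the rest marginal.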
[Figure 4.1: The plane of covariates: the solid points are central points while the hollow points are marginal points.]

If the generated observations before adding outliers are (ỹ_i, x_i, t_i), i = 1, 2, ..., 200, we then pick some observations in the same region (center or margin) and replace their responses with outliers. The outliers are added following the contamination model defined in (3.3),

    y_i = (1 − z_i) ỹ_i + z_i o_i,    (4.2)

where z_i ~ B(1, ν) and the distribution of o_i differs between the Poisson and Binomial cases. The number of outliers in a sample is random and is determined by the parameter ν ∈ [0, 1]. In this simulation, we generate samples with ν = 0, 0.1, 0.2, 0.3; note that ν = 0 means there are no outliers in the dataset.

In the simulations, we also report outliers by comparing the observations with the fits. Suppose that for the i-th observation y_i, the fitted mean value is µ̂_i and the corresponding distribution is F_Y(y; µ̂_i). If the probability of the lower or the upper tail, F_Y(y_i; µ̂_i) or 1 − F_Y(y_i; µ̂_i), is very small, the observation deviates a lot from the fitted model and we can report it as an outlier. Based on this idea, we report outliers using extreme quantiles of the distribution F_Y(y; µ̂_i). Just like a hypothesis testing procedure, by taking a proportion δ ∈ (0, 1), we can calculate an interval I_i = [L(µ̂_i, δ), U(µ̂_i, δ)], whose lower and upper bounds satisfy

    F_Y(L(µ̂_i, δ); µ̂_i) + 1 − F_Y(U(µ̂_i, δ); µ̂_i) = δ.
    (4.3)

If y_i ∉ I_i, we flag y_i as an outlier. For continuous symmetric random variables, it is reasonable to set L(µ̂_i, δ) and U(µ̂_i, δ) to the δ/2 and 1 − δ/2 quantiles of the distribution F_Y(y; µ̂_i). For discrete random variables, the δ-quantile is calculated by

    q_{µ̂_i, δ} = inf{y : F_Y(y; µ̂_i) ≥ δ, P(Y = y) > 0}.

For asymmetric discrete distributions with much more mass in the lower tail, such as the Poisson and the Binomial with small success probabilities — which are the cases in our simulation — the quantile q_{µ̂_i, δ/2} sometimes makes the probability F_Y(q_{µ̂_i, δ/2}; µ̂_i) > δ. So the quantile q_{µ̂_i, δ/2} is undesirable as the lower bound L(µ̂_i, δ) of the interval. To solve this problem, we first calculate the δ/2 quantile q_{µ̂_i, δ/2} as the value of L(µ̂_i, δ) and the associated probability p_L ← F_Y(L(µ̂_i, δ); µ̂_i). If p_L > δ, there is too much density in the lower tail, and we reset the lower bound L(µ̂_i, δ) ← −∞ and set the upper bound U(µ̂_i, δ) to the 1 − δ quantile; if p_L ≤ δ, we keep the lower bound L(µ̂_i, δ) and set the upper bound U(µ̂_i, δ) to the 1 − δ + p_L quantile. The calculation strategy is as follows.

1. Calculate the δ/2-quantile L(µ̂_i, δ) ← q_{µ̂_i, δ/2} and the probability p_L ← F_Y(L(µ̂_i, δ); µ̂_i).
   • If p_L > δ, set L(µ̂_i, δ) ← −∞ and p_L ← 0.
2. Calculate the upper bound U(µ̂_i, δ) ← q_{µ̂_i, 1−δ+p_L}.

The interval calculated by our method satisfies

    P(Y ∉ I_i; µ̂_i) ≤ δ.

In our simulations, we use δ = 0.05, 0.01, 0.001, 5×10^-4 to report outliers. For each sample and each δ, we calculate the correct reporting rate

    CRR = (number of reported true outliers) / (number of true outliers),

the misreporting rate

    MRR = (number of misreported outliers) / (number of reported outliers),

and the reporting proportion

    RP = (number of reported outliers) / (number of true outliers).

To sum up, in this simulation study, we use the following conditions to control our distribution settings.

• Distribution: Poisson or Binomial.
• Location of outliers (LOC): center or margin.
• Proportion of outliers: ν = 0, 0.1, 0.2, 0.3; when ν = 0, the sample is uncontaminated.
• Magnitude of outliers (MAG): moderate or extreme (this condition only applies in the Poisson case).

To compare the performance of each algorithm, we report the means and standard deviations of the following values.

• Mean squared errors (MSEs).
• Linear coefficients' absolute proportional biases (APBs).
• Correct reporting rates (CRRs).
• Misreporting rates (MRRs).
• Reporting proportions (RPs).

4.1 Poisson Case

In the Poisson case, the model is

    Y | X, T ~ P(λ(X, T)),
    log λ(X, T) = 0.05 + 0.02 X + sin((T − n/2)/20) exp(−|T − n/2|/30) + 0.002 (T − n/2).

It can be shown that in our simulation the smooth term is an odd function of T − n/2, which is consistent with our model assumption E(f(T)) = 0. We only add atypically large response values as outliers. There are two candidate outlier distributions in (4.2) for the Poisson case,

    o_i ~ P(15)  or  o_i ~ P(25).

The two distributions represent two different magnitudes of outliers; in this way, we can see how the magnitude of the outliers affects the fits of the three algorithms. Figure 4.2 and Figure 4.3 show plots of typical datasets with outliers and the fitted surfaces of each fitting algorithm. From the shape of the surfaces and the numbers of gray and black points (gray points are under the surface while black points are above the surface), it is easy to see that the fitted non-robust surfaces are distorted by the outliers, while the robust surface is closer to the true mean surface. When the parameter ν or the magnitude of the outliers is smaller or bigger, the fitted non-robust surfaces are noticeably less or more distorted by the outliers, while the robust surface changes only a little. We only show the cases of ν = 0.2; the summaries of the other scenarios are listed in numeric form in Tables 4.1–4.8.
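Before turning to the per-scenario results, the two-step interval rule (4.3) used to flag outliers can be sketched for the Poisson case. This is illustrative Python with our own helper names; the coverage guarantee P(Y ∉ I_i; µ̂_i) ≤ δ can be checked directly:

```python
import math

def pois_cdf_quantile(mu, prob, upper=10000):
    """Smallest integer q with F(q; mu) >= prob for the Poisson cdf,
    returned together with F(q; mu)."""
    pmf = math.exp(-mu)
    cdf, q = pmf, 0
    while cdf < prob and q < upper:
        q += 1
        pmf *= mu / q
        cdf += pmf
    return q, cdf

def outlier_interval_poisson(mu, delta):
    """Interval I = [L, U] with P(Y not in I; mu) <= delta, following
    the two-step lower/upper bound strategy for skewed discrete laws."""
    L, pL = pois_cdf_quantile(mu, delta / 2)
    if pL > delta:                 # too much mass in the lower tail
        L, pL = -math.inf, 0.0
    U, _ = pois_cdf_quantile(mu, 1 - delta + pL)
    return L, U
```

An observation y_i with fitted mean µ̂_i is then flagged whenever y_i < L or y_i > U.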
Tables 4.1–4.8 list the summaries of the mean squared errors, linear coefficients' absolute proportional biases, correct reporting rates, misreporting rates and reporting proportions of the 3 algorithms. All the numbers in the tables in this chapter have been rounded to two significant digits.

[Figure 4.2: Examples of simulation datasets with Poisson response: MAG=moderate, LOC=center and ν = 0.2. Panels: true mean surface and fitted RGAPLM, GAPLM and GAM surfaces. The solid points are "good" responses while '+' points are outliers. The gray points are below the surface while black points are above the surfaces.]
The bandwidth parameter is decided by 5-fold CV based on uncontaminated datasets.

[Figure 4.3: Examples of simulation datasets with Poisson response: MAG=moderate, LOC=margin and ν = 0.2. Panels: true mean surface and fitted RGAPLM, GAPLM and GAM surfaces. The solid points are "good" responses while '+' points are outliers. The gray points are below the surface while black points are above the surfaces. The bandwidth parameter is decided by 5-fold CV based on uncontaminated datasets.]
                    RGAPLM                  GAPLM                   GAM
                    Center      Margin      Center      Margin      Center      Margin
No outliers
  ν = 0             .072(.036)              .068(.034)              .096(.031)
Moderate
  ν = .1            .11(.06)    .15(.14)    1.1(.6)     1.2(.66)    .90(.49)    1.0(.57)
  ν = .2            .24(.14)    .44(.6)     3.4(1.4)    3.4(1.4)    2.9(1.2)    3.1(1.3)
  ν = .3            .54(.39)    1.4(1.6)    6.9(2.1)    6.8(2.3)    6.1(1.9)    6.2(2.1)
Extreme
  ν = .1            .11(.06)    .15(.22)    3.2(1.7)    3.4(1.9)    2.4(1.4)    2.8(1.6)
  ν = .2            .24(.14)    .68(1.7)    9.9(3.8)    10(4.1)     8.4(3.4)    9.0(3.8)
  ν = .3            .57(.87)    2.8(4.6)    20(6.1)     20(6.5)     18(5.6)     19(6.1)

Table 4.1: The means and standard deviations of MSEs for the Poisson case.

                    RGAPLM                  GAPLM                   GAM
                    Center      Margin      Center      Margin      Center      Margin
No outliers (ν = 0)
  bias(β̂_0)        1.3(.98)                1.2(.94)                9.3(2.5)
  bias(β̂_1)        .24(.19)                .23(.18)                .24(.18)
Moderate, bias(β̂_0)
  ν = .1            2.2(1.4)    2.2(1.4)    9.2(2.6)    9.0(2.7)    15(3.4)     17(5.0)
  ν = .2            4.3(1.7)    4.3(1.7)    15(2.5)     15(2.7)     19(3.6)     22(4.9)
  ν = .3            6.9(1.9)    6.9(2.0)    19(2.2)     20(2.4)     24(3.5)     27(4.5)
Moderate, bias(β̂_1)
  ν = .1            .25(.19)    .29(.22)    .46(.32)    .60(.44)    .46(.32)    .57(.42)
  ν = .2            .28(.21)    .35(.27)    .60(.35)    .67(.47)    .63(.35)    .64(.45)
  ν = .3            .33(.23)    .44(.35)    .69(.33)    .71(.48)    .74(.33)    .67(.46)
Extreme, bias(β̂_0)
  ν = .1            2.2(1.4)    2.2(1.4)    13(3.3)     13(3.6)     18(4.6)     21(6.6)
  ν = .2            4.3(1.7)    4.3(1.8)    20(2.9)     21(3.3)     25(4.7)     29(6.1)
  ν = .3            6.9(2.0)    7.0(2.1)    26(2.5)     27(2.8)     31(4.2)     34(5.2)
Extreme, bias(β̂_1)
  ν = .1            .25(.19)    .29(.22)    .60(.41)    .79(.58)    .62(.42)    .77(.56)
  ν = .2            .28(.21)    .35(.28)    .73(.41)    .81(.57)    .79(.42)    .78(.56)
  ν = .3            .33(.23)    .48(.40)    .81(.38)    .81(.54)    .89(.38)    .77(.53)

Table 4.2: The means and standard deviations of APBs for the Poisson case. The fitted intercept β̂_0 of GAM seems biased while its fitted slope seems accurate. This is not surprising since GAM estimates the coefficients in a different way.
                    RGAPLM                  GAPLM                   GAM
                    Center      Margin      Center      Margin      Center      Margin
ν = .1
  δ = .05           1.0(.011)   .99(.026)   .99(.024)   .98(.045)   1.0(.016)   .99(.035)
  δ = .01           1.0(.014)   .98(.046)   .96(.049)   .92(.074)   .97(.037)   .94(.063)
  δ = .001          .99(.030)   .97(.057)   .96(.061)   .93(.093)   .98(.047)   .94(.077)
  δ = 5×10^-4       .99(.036)   .97(.064)   .95(.071)   .90(.10)    .97(.056)   .93(.087)
ν = .2
  δ = .05           1.0(.0098)  .98(.036)   .98(.034)   .95(.056)   .99(.026)   .96(.048)
  δ = .01           1.0(.014)   .94(.063)   .98(.046)   .97(.037)   .97(.037)   .94(.063)
  δ = .001          .99(.027)   .95(.063)   .89(.077)   .84(.099)   .93(.062)   .87(.089)
  δ = 5×10^-4       .98(.032)   .94(.069)   .87(.086)   .81(.10)    .90(.071)   .85(.095)
ν = .3
  δ = .05           1.0(.013)   .96(.054)   .94(.048)   .90(.069)   .96(.038)   .92(.062)
  δ = .01           .99(.019)   .94(.065)   .89(.065)   .84(.085)   .92(.056)   .87(.077)
  δ = .001          .97(.033)   .91(.081)   .79(.092)   .74(.10)    .84(.079)   .77(.095)
  δ = 5×10^-4       .97(.039)   .89(.086)   .75(.098)   .70(.10)    .79(.088)   .74(.099)

Table 4.3: The means and standard deviations of CRRs for the Poisson case, MAG=moderate.
                    RGAPLM                   GAPLM                   GAM
                    Center       Margin      Center      Margin      Center      Margin
ν = .1
  δ = .05           1.0(0.0)     1.0(.0056)  1.0(.0042)  1.0(.020)   1.0(.0015)  1.0(.012)
  δ = .01           1.0(0.0)     1.0(.0075)  1.0(.0071)  .99(.027)   1.0(.0028)  1.0(.017)
  δ = .001          1.0(.00094)  1.0(.011)   1.0(.015)   .99(.042)   1.0(.0061)  .99(.028)
  δ = 5×10^-4       1.0(.0018)   1.0(.011)   1.0(.017)   .98(.048)   1.0(.0077)  .99(.031)
ν = .2
  δ = .05           1.0(0.0)     1.0(.022)   1.0(.012)   .98(.033)   1.0(.006)   .99(.024)
  δ = .01           1.0(0.0)     .99(.025)   .99(.018)   .97(.045)   1.0(.0097)  .98(.034)
  δ = .001          1.0(0.0)     .99(.031)   .98(.032)   .95(.063)   .99(.018)   .97(.051)
  δ = 5×10^-4       1.0(.0010)   .99(.032)   .98(.036)   .94(.069)   .99(.020)   .96(.056)
ν = .3
  δ = .05           1.0(.0044)   .98(.039)   .99(.022)   .96(.049)   1.0(.013)   .97(.041)
  δ = .01           1.0(.0065)   .98(.045)   .98(.033)   .93(.064)   .99(.021)   .95(.054)
  δ = .001          1.0(.0087)   .97(.053)   .94(.054)   .89(.083)   .97(.038)   .91(.073)
  δ = 5×10^-4       1.0(.010)    .97(.055)   .93(.061)   .87(.088)   .96(.044)   .90(.079)

Table 4.4: The means and standard deviations of CRRs for the Poisson case, MAG=extreme.
                    RGAPLM                     GAPLM                      GAM
                    Center       Margin        Center       Margin        Center        Margin
ν = .1
  δ = .05           .59(.11)     .59(.11)      .54(.11)     .53(.13)      .46(.13)      .49(.14)
  δ = .01           .24(.13)     .24(.13)      .17(.12)     .17(.13)      .13(.11)      .16(.12)
  δ = .001          .038(.063)   .037(.064)    .021(.047)   .019(.050)    .015(.042)    .020(.050)
  δ = 5×10^-4       .021(.048)   .020(.048)    .010(.034)   .010(.036)    .0076(.030)   .010(.037)
ν = .2
  δ = .05           .35(.090)    .36(.091)     .47(.074)    .41(.089)     .41(.10)      .38(.10)
  δ = .01           .10(.069)    .11(.070)     .13(.080)    .12(.080)     .068(.065)    .089(.073)
  δ = .001          .014(.028)   .014(.028)    .012(.026)   .014(.031)    .0051(.017)   .0092(.025)
  δ = 5×10^-4       .0079(.021)  .0072(.020)   .0063(.019)  .0078(.023)   .0027(.013)   .0048(.018)
ν = .3
  δ = .05           .23(.065)    .26(.074)     .48(.057)    .42(.065)     .44(.076)     .41(.069)
  δ = .01           .059(.043)   .070(.050)    .19(.082)    .15(.077)     .14(.094)     .13(.077)
  δ = .001          .0084(.017)  .011(.022)    .019(.032)   .027(.040)    .0045(.014)   .019(.032)
  δ = 5×10^-4       .0049(.013)  .0067(.017)   .0091(.022)  .017(.031)    .002(.0091)   .011(.024)

Table 4.5: The means and standard deviations of MRRs for the Poisson case, MAG=moderate.
                    RGAPLM                     GAPLM                     GAM
                    Center       Margin        Center       Margin       Center       Margin
ν = .1
  δ = .05           .58(.11)     .59(.11)      .65(.078)    .63(.092)    .57(.11)     .57(.12)
  δ = .01           .24(.13)     .24(.13)      .27(.13)     .26(.14)     .15(.12)     .20(.14)
  δ = .001          .037(.062)   .036(.062)    .035(.059)   .044(.070)   .014(.040)   .030(.060)
  δ = 5×10^-4       .021(.047)   .019(.046)    .018(.044)   .027(.055)   .007(.028)   .018(.045)
ν = .2
  δ = .05           .35(.090)    .36(.091)     .64(.044)    .59(.053)    .60(.065)    .57(.061)
  δ = .01           .10(.069)    .11(.07)      .35(.096)    .30(.096)    .28(.13)     .27(.11)
  δ = .001          .014(.028)   .015(.029)    .077(.073)   .085(.072)   .028(.053)   .065(.067)
  δ = 5×10^-4       .0078(.02)   .0084(.022)   .043(.056)   .059(.063)   .012(.033)   .041(.052)
ν = .3
  δ = .05           .23(.066)    .26(.077)     .63(.032)    .59(.036)    .61(.038)    .59(.040)
  δ = .01           .058(.044)   .078(.058)    .43(.064)    .37(.066)    .39(.081)    .35(.068)
  δ = .001          .0084(.018)  .017(.030)    .17(.082)    .15(.076)    .12(.089)    .13(.073)
  δ = 5×10^-4       .0048(.013)  .012(.026)    .11(.075)    .11(.070)    .071(.075)   .094(.067)

Table 4.6: The means and standard deviations of MRRs for the Poisson case, MAG=extreme.
                    RGAPLM                  GAPLM                   GAM
                    Center      Margin      Center      Margin      Center      Margin
ν = .1
  δ = .05           2.6(.85)    2.6(.87)    2.3(.74)    2.3(.79)    2.0(.68)    2.1(.77)
  δ = .01           1.4(.27)    1.3(.28)    1.2(.21)    1.2(.23)    1.2(.19)    1.2(.23)
  δ = .001          1.0(.082)   1.0(.096)   .98(.081)   .95(.11)    .99(.069)   .97(.097)
  δ = 5×10^-4       1.0(.067)   .99(.084)   .96(.08)    .92(.11)    .98(.066)   .94(.096)
ν = .2
  δ = .05           1.6(.24)    1.6(.25)    1.9(.26)    1.7(.26)    1.7(.30)    1.6(.26)
  δ = .01           1.1(.095)   1.1(.099)   1.1(.11)    1.0(.11)    1.0(.080)   1.0(.097)
  δ = .001          1.0(.040)   .97(.068)   .91(.079)   .85(.097)   .93(.065)   .88(.088)
  δ = 5×10^-4       .99(.039)   .95(.071)   .87(.087)   .82(.10)    .90(.072)   .85(.094)
ν = .3
  δ = .05           1.3(.12)    1.3(.13)    1.8(.20)    1.6(.19)    1.7(.22)    1.6(.19)
  δ = .01           1.1(.054)   1.0(.072)   1.1(.12)    1.0(.11)    1.1(.12)    1.0(.098)
  δ = .001          .98(.038)   .92(.078)   .80(.089)   .76(.097)   .84(.079)   .79(.091)
  δ = 5×10^-4       .97(.041)   .90(.083)   .75(.097)   .71(.10)    .80(.088)   .75(.096)

Table 4.7: The means and standard deviations of RPs for the Poisson case, MAG=moderate.

                    RGAPLM                  GAPLM                   GAM
                    Center      Margin      Center      Margin      Center      Margin
ν = .1
  δ = .05           2.6(.85)    2.6(.86)    3.0(.77)    2.9(.83)    2.5(.69)    2.5(.79)
  δ = .01           1.4(.27)    1.4(.27)    1.4(.27)    1.4(.29)    1.2(.20)    1.3(.27)
  δ = .001          1.0(.077)   1.0(.079)   1.0(.072)   1.0(.089)   1.0(.047)   1.0(.078)
  δ = 5×10^-4       1.0(.057)   1.0(.056)   1.0(.052)   1.0(.071)   1.0(.033)   1.0(.058)
ν = .2
  δ = .05           1.6(.24)    1.6(.25)    2.8(.33)    2.4(.32)    2.6(.38)    2.4(.32)
  δ = .01           1.1(.093)   1.1(.092)   1.6(.22)    1.4(.19)    1.4(.25)    1.4(.19)
  δ = .001          1.0(.030)   1.0(.037)   1.1(.087)   1.0(.088)   1.0(.062)   1.0(.078)
  δ = 5×10^-4       1.0(.022)   1.0(.033)   1.0(.063)   1.0(.079)   1.0(.038)   1.0(.064)
ν = .3
  δ = .05           1.3(.12)    1.3(.13)    2.7(.24)    2.3(.24)    2.6(.25)    2.4(.25)
  δ = .01           1.1(.051)   1.1(.057)   1.7(.18)    1.5(.16)    1.6(.21)    1.5(.15)
  δ = .001          1.0(.019)   .99(.042)   1.1(.11)    1.0(.10)    1.1(.11)    1.1(.092)
  δ = 5×10^-4       1.0(.015)   .98(.043)   1.1(.087)   .98(.093)   1.0(.081)   .99(.08)

Table 4.8: The means and standard deviations of RPs for the Poisson case, MAG=extreme.
The plots of the MSEs and APBs of β̂1 with MAG = moderate are shown in Figure 4.4. The plots of the CRRs and MRRs with δ = 5×10⁻⁴ and MAG = moderate are in Figure 4.5; the CRRs and MRRs for the other δ values are not shown. Generally speaking, the tables above show that when there are no outliers the three algorithms perform similarly, but the RGAPLM fit is better than the non-robust fits in the presence of outliers. This suggests that the RGAPLM is able to resist the damage of outliers in model fitting. The summaries of the two non-robust fits are quite similar under the same simulation scenarios. In contaminated samples, the standard deviations of the MSEs and APBs of the robust fit are consistently smaller than those of the other fits. In our simulation example, the marginal outliers have a bigger effect on the MSEs of all three algorithms.

From Table 4.1, we see that the location, proportion and magnitude of the outliers all have a noticeable influence on the non-robust GAPLM and GAM fits. When the outlier proportion or the outlier magnitude increases, the MSEs of the three fits increase remarkably. For example, with the location and magnitude of the outliers fixed, when ν increases from 0.1 to 0.3 the MSEs of the three fits increase 4 to 6 times, but the MSEs of the RGAPLM remain much smaller than those of the other two fits. The outlier magnitude has only a subtle impact on the RGAPLM's MSEs, whereas the larger magnitude doubles the MSEs of the non-robust fits. In contaminated datasets, the MSEs of the non-robust fits are usually more than 10 times those of the robust fit.

Table 4.2 shows that the outliers have a much bigger impact on the estimated intercept β̂0 than on the slope β̂1. The location, proportion and magnitude of the outliers can all influence the estimated coefficients of the three fits. However, the APBs of the GAPLM and GAM fits are usually at least twice those of the RGAPLM fit.
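As a rough illustration of how these table entries are produced (the thesis computations are in R; the exact per-replicate definitions of MSE and APB used here are our assumptions), each summary is computed once per simulated dataset and the tables report its mean and standard deviation over the Monte Carlo replicates:

```python
import numpy as np

def mse(mu_hat, mu_true):
    """Mean squared error of a fitted mean surface over the design points."""
    mu_hat, mu_true = np.asarray(mu_hat, float), np.asarray(mu_true, float)
    return float(np.mean((mu_hat - mu_true) ** 2))

def apb(beta_hat, beta_true):
    """Absolute percentage bias of an estimated coefficient."""
    return abs(beta_hat - beta_true) / abs(beta_true)

# Each table cell reports "mean(sd)" of such a summary over replicates,
# e.g. a cell like "2.6(.85)" in the tables above.
mses = [0.9, 1.1, 1.0]
cell = f"{np.mean(mses):.2g}({np.std(mses):.2g})"
```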
The marginal outliers introduce larger biases into the estimated coefficients, as expected. It can be seen that GAM's fitted slope β̂1 is close to GAPLM's β̂1, while GAM's fitted intercept β̂0 seems biased even in uncontaminated datasets. This might be explained by the fact that gam() uses a different method to estimate the linear coefficients. From Figure 4.4, we can see that the marginal outliers are more damaging to our robust fit than the non-robust fits in terms of MSEs and APBs.

Tables 4.3–4.8 and Figure 4.5 show the performance of the three fits in outlier detection. RGAPLM performs better than the non-robust fits in detecting true outliers. When the outliers are moderate and the outlier proportion is small, RGAPLM is more aggressive: it flags more outliers and is more prone to misreporting. However, when the outliers are more extreme and the outlier proportion is larger, RGAPLM usually makes fewer mistakes than the other two fits. It can be seen that when the detection proportion δ is larger, GAPLM and GAM detect more true outliers, but when δ is smaller their CRRs decrease more noticeably than RGAPLM's. That is because the non-robust fits are more severely distorted by the outliers.

[Figure 4.4: The plots of the MSEs and APBs of β̂1 against ν; panels combine center/margin outlier locations with moderate/extreme magnitudes. Solid line: RGAPLM fit; dashed line: GAM fit; dotted line: GAPLM fit.]

[Figure 4.5: The plots of the CRRs and MRRs with δ = 5×10⁻⁴ against ν; panels combine center/margin outlier locations with moderate/extreme magnitudes. Solid line: RGAPLM fit; dashed line: GAM fit; dotted line: GAPLM fit.]

4.2 Binomial Case

In the binomial case, the simulated model is

Y | X, T ∼ B(p(X, T), 20),
logit(p(X, T)) = 0.05 + 0.02 X + 0.03 cos((T − n/2)/20) exp(−|T − n/2|/30).

As in the Poisson case, the outliers are extremely large values of the response variable. Since the binomial response can be no larger than its size 20, the outliers are the constant oᵢ = 20, so there is only one outlier magnitude. Figures 4.6–4.7 show plots of typical datasets with outliers and the fitted surfaces of each fitting algorithm. Since the outliers are bounded this time, their impact is limited. Looking at the plots, we can see that this time the robust fitted surface is only slightly better than the other two fits. Contrary to what we expected, the marginal outliers seem to affect the three fits less than the central outliers.
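A minimal sketch of this simulation design, in Python rather than the thesis's R, is given below. The sample size n, the uniform distribution of X, and the exact rule placing contaminated points near the center or the margins of the T range are assumptions; the mean function and the outlier value oᵢ = 20 follow the model above.

```python
import numpy as np

def simulate_binomial(n=200, nu=0.2, loc="center", seed=0):
    """Simulate the binomial model above with a proportion nu of outliers.

    Outliers are set to the maximum response o_i = 20 and placed either near
    the center or at the margins of the T range (placement rule is assumed).
    """
    rng = np.random.default_rng(seed)
    T = np.arange(1, n + 1, dtype=float)
    X = rng.uniform(0, 1, n)                       # covariate distribution assumed
    eta = (0.05 + 0.02 * X
           + 0.03 * np.cos((T - n / 2) / 20) * np.exp(-np.abs(T - n / 2) / 30))
    p = 1 / (1 + np.exp(-eta))                     # inverse logit
    Y = rng.binomial(20, p).astype(float)
    m = int(nu * n)
    order = np.argsort(np.abs(T - n / 2))          # sort by distance from center
    idx = order[:m] if loc == "center" else order[n - m:]
    Y[idx] = 20.0                                  # outliers hit the upper bound
    return T, X, Y, idx

T, X, Y, idx = simulate_binomial()
# contaminated responses sit at the binomial upper bound o_i = 20
```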
We are not able to fully explain this phenomenon. Perhaps, because the marginal outliers are much sparser than the central ones and the outlier magnitude is not large, their effect is relatively negligible compared with that of the central outliers. Tables 4.9–4.12 list the summaries of the MSEs, APBs, MRRs and RPs of the three algorithms. This time the CRRs are not reported, because all of them are uniformly 1.0 with zero standard deviation, which means all the fits correctly detect every outlier.

[Figure 4.6: Examples of simulated datasets with binomial response, LOC = center and ν = 0.2; panels: the true mean surface and the fitted RGAPLM, GAPLM and GAM surfaces over (X, T). The solid points are "good" responses while '+' points are outliers. The gray points are below the surface and the black points are above it. The bandwidth parameter is chosen by 5-fold CV based on uncontaminated datasets.]

[Figure 4.7: Examples of simulated datasets with binomial response, LOC = margin and ν = 0.2; panels: the true mean surface and the fitted RGAPLM, GAPLM and GAM surfaces over (X, T).]
The solid points are "good" responses while '+' points are outliers. The gray points are below the surface and the black points are above it. The bandwidth parameter is chosen by 5-fold CV based on uncontaminated datasets.

              RGAPLM                          GAPLM                         GAM
              Center          Margin          Center         Margin         Center         Margin
Outliers
 ν = .1    .00081(.00039)  .00085(.00046)  .0017(.00088)  .0017(.00089)  .0017(.00085)  .0017(.00086)
 ν = .2    .0015(.00077)   .0017(.0011)    .0045(.0018)   .0043(.0018)   .0044(.0018)   .0042(.0018)
 ν = .3    .0030(.0014)    .0040(.0031)    .0090(.0029)   .0082(.0028)   .0088(.0028)   .0082(.0027)
No Outliers
 ν = 0     .00056(.00024)                  .00055(.00024)                .00056(.00023)

Table 4.9: The means and standard deviations of the MSEs for the Binomial case.

                      RGAPLM                GAPLM                 GAM
                      Center     Margin    Center     Margin     Center     Margin
Outliers
 bias(β̂0)
  ν = .1              .94(.63)   .93(.64)   2.0(.89)   1.9(.88)   4.6(1.5)   4.9(1.8)
  ν = .2              1.9(.83)   1.9(.87)   4.1(1.1)   3.9(1.1)   6.3(1.7)   6.9(2.2)
  ν = .3              3.1(.99)   3.3(1.2)   6.2(1.2)   5.9(1.2)   8.0(1.8)   8.9(2.4)
 bias(β̂1)
  ν = .1              .13(.099)  .14(.11)   .15(.11)   .18(.14)   .14(.11)   .18(.14)
  ν = .2              .14(.10)   .17(.13)   .17(.13)   .23(.17)   .16(.13)   .23(.17)
  ν = .3              .15(.11)   .20(.15)   .18(.14)   .26(.20)   .18(.14)   .26(.20)
No Outliers
 bias(β̂0)  ν = 0      .53(.40)              .52(.39)              2.9(1.3)
 bias(β̂1)  ν = 0      .12(.093)             .12(.091)             .12(.091)

Table 4.10: The means and standard deviations of the APBs for the Binomial case.

The fitted intercept β̂0 of GAM seems biased while its fitted slope seems accurate. This is not surprising, since GAM estimates the coefficients in a different way.
                RGAPLM                  GAPLM                   GAM
                Center      Margin      Center      Margin      Center      Margin
ν = .1
  δ = .05       .54(.11)    .54(.11)    .56(.095)   .56(.098)   .56(.096)   .56(.097)
  δ = .01       .22(.12)    .23(.12)    .22(.12)    .22(.12)    .22(.12)    .22(.12)
  δ = .001      .036(.062)  .037(.063)  .031(.057)  .032(.058)  .031(.057)  .032(.058)
  δ = 5×10⁻⁴    .022(.048)  .021(.048)  .019(.045)  .018(.045)  .019(.045)  .018(.045)
ν = .2
  δ = .05       .37(.081)   .37(.081)   .43(.065)   .42(.069)   .43(.066)   .42(.068)
  δ = .01       .12(.068)   .12(.069)   .15(.068)   .15(.068)   .15(.068)   .15(.069)
  δ = .001      .016(.028)  .017(.030)  .021(.031)  .022(.032)  .021(.031)  .021(.031)
  δ = 5×10⁻⁴    .0098(.022) .010(.023)  .011(.023)  .012(.024)  .011(.024)  .012(.023)
ν = .3
  δ = .05       .29(.061)   .30(.062)   .40(.048)   .37(.051)   .39(.049)   .38(.050)
  δ = .01       .088(.048)  .098(.051)  .15(.052)   .14(.053)   .15(.053)   .14(.052)
  δ = .001      .012(.019)  .018(.026)  .025(.027)  .024(.027)  .024(.027)  .024(.027)
  δ = 5×10⁻⁴    .0068(.015) .011(.021)  .014(.021)  .016(.022)  .014(.021)  .015(.022)

Table 4.11: The means and standard deviations of the MRRs for the Binomial case.
                RGAPLM                GAPLM                 GAM
                Center     Margin     Center     Margin     Center     Margin
ν = .1
  δ = .05       2.3(.66)   2.3(.66)   2.4(.63)   2.4(.65)   2.4(.64)   2.4(.65)
  δ = .01       1.3(.25)   1.3(.25)   1.3(.24)   1.3(.23)   1.3(.25)   1.3(.24)
  δ = .001      1.0(.079)  1.0(.080)  1.0(.070)  1.0(.074)  1.0(.073)  1.0(.073)
  δ = 5×10⁻⁴    1.0(.059)  1.0(.058)  1.0(.054)  1.0(.054)  1.0(.054)  1.0(.054)
ν = .2
  δ = .05       1.6(.22)   1.6(.22)   1.8(.21)   1.8(.22)   1.8(.22)   1.8(.22)
  δ = .01       1.1(.094)  1.1(.097)  1.2(.098)  1.2(.097)  1.2(.098)  1.2(.098)
  δ = .001      1.0(.031)  1.0(.032)  1.0(.034)  1.0(.035)  1.0(.034)  1.0(.035)
  δ = 5×10⁻⁴    1.0(.024)  1.0(.025)  1.0(.025)  1.0(.026)  1.0(.025)  1.0(.025)
ν = .3
  δ = .05       1.4(.12)   1.4(.13)   1.7(.13)   1.6(.13)   1.7(.13)   1.6(.13)
  δ = .01       1.1(.059)  1.1(.065)  1.2(.072)  1.2(.072)  1.2(.073)  1.2(.072)
  δ = .001      1.0(.02)   1.0(.028)  1.0(.029)  1.0(.029)  1.0(.029)  1.0(.029)
  δ = 5×10⁻⁴    1.0(.016)  1.0(.022)  1.0(.022)  1.0(.024)  1.0(.022)  1.0(.023)

Table 4.12: The means and standard deviations of the RPs for the Binomial case.

The tables for the Binomial case show many of the same phenomena as those for the Poisson case. For example, the RGAPLM fit is better than the GAPLM and GAM fits in contaminated datasets, while the three perform similarly in uncontaminated ones. Here we mainly point out the results that differ in the Binomial case. This time, the means of the MSEs and APBs of the robust fit are slightly larger than their non-robust counterparts in uncontaminated datasets. This is not surprising, since the standard deviations of those MSEs and APBs are not negligible. All three algorithms perform well in terms of CRRs, and they all detect every outlier across the different simulation settings.

Chapter 5

Example

In this chapter, a real data analysis is presented to show the practical application of our robust fit. In Boyce, Lewis and Worm (2010), the authors studied the long-term trend of the global phytoplankton concentration over the past century.
Based on a non-robust statistical analysis of the phytoplankton biomass at a regional scale, the authors concluded that they observed century-long declines in eight of ten world ocean regions. In this chapter, the dataset of Boyce, Lewis and Worm (2010) is studied using our robust model fitting algorithm.

5.1 Study Background

Phytoplankton plays an important role in the world's ecosystem, generating about half of the primary production on Earth. Analyses of satellite-based phytoplankton concentrations (1979–86 and 1997–present) have revealed decadal-scale fluctuations related to climate effects, but the time span of those analyses was insufficient to reveal longer-term trends; this was the motivation of Boyce, Lewis and Worm (2010).

Phytoplankton biomass is commonly represented by measurements of the total chlorophyll pigment concentration (Chl). Boyce, Lewis and Worm explained that Chl "is considered as a reliable indicator of both phytoplankton production and biomass". In this data analysis, Chl is the response variable of interest. There are several ways to measure Chl. Measurements of upper-ocean transparency using the standardized Secchi disk are available from 1899 to the present, and they can be transformed into depth-averaged Chl concentrations through established models. In situ optical measurements also record Chl. In the study of Boyce, Lewis and Worm (2010), the in situ Chl measurements (in mg/m3) and the ocean transparency data, collected in the upper ocean over the past century, were merged to provide enough information for the statistical analysis. The available upper-ocean transparency and in situ Chl data were extracted from the National Oceanographic Data Center (NODC, http://www.nodc.noaa.gov/), the Worldwide Ocean Optics Database (WOOD, http://wodjhuapl.edu/wood/) and the Marine Information Research Center (MIRC, http://mirc.jha.jp/en/outline.html).
The composited dataset consists of 445,237 world-wide Chl measurements collected between 1899 and 2008. The details of the data combination can be found in the supplementary document of Boyce, Lewis and Worm (2010). Along with the response variable Chl, several relevant variables were collected as candidate covariates for the regression models. All the variables in the dataset are listed in Table 5.1.

Name   Description                                                  Type
Chl    total chlorophyll pigment concentration (mg/m3)              positive continuous
Year   year of the observation                                      positive integer
Month  month of the observation                                     factor
DayM   day of the month of the observation                          positive integer
DayY   day of the year of the observation                           positive integer
Lon    longitude (degrees) of the observation                       positive continuous
Lat    latitude (degrees) of the observation                        positive continuous
Bath   bathymetry (m) of the observation                            negative continuous
Dist   distance (m) from the nearest coast of the observation       positive continuous
ZD     the Secchi depth of the observation                          positive continuous
Field  identifies the observation as transparency or in situ data   factor

Table 5.1: The variables in the dataset.

[Figure 5.1: Figure 3a in Boyce, Lewis and Worm (2010): the ocean regions used to estimate the regional trends of Chl.]

The covariates provide both oceanographic and temporal information about the observations. However, the interest of this study was the marginal relationship between Chl and Year, in order to discover the long-term trend of the Chl concentration. To estimate regional Chl trends, the authors divided the global ocean into ten regions: Arctic, Northern Pacific, Equatorial Pacific, Southern Pacific, Northern Indian, Southern Indian, Northern Atlantic, Equatorial Atlantic, Southern Atlantic and Southern. Observations collected in inland seas were not included in this regional analysis. The ten regions were delineated by the longitudes and latitudes of the oceans and are shown in Figure 5.1.
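To make the variable types in Table 5.1 concrete, here is a hypothetical sketch; the two example rows are invented, while the column names and types follow Table 5.1 and the 3-level depth coding follows the one used in the models of this chapter:

```python
import pandas as pd

# Two invented example rows; column names and types follow Table 5.1.
df = pd.DataFrame({
    "Chl": [0.31, 1.20], "Year": [1985, 1999], "Month": [6, 12],
    "DayM": [15, 3], "DayY": [166, 337], "Lon": [150.2, 98.7],
    "Lat": [35.1, 12.4], "Bath": [-540.0, -2200.0],
    "Dist": [1200.0, 54000.0], "ZD": [8.5, 14.0],
    "Field": ["transparency", "in situ"],
})
df["Month"] = df["Month"].astype("category")   # Month and Field are factors
df["Field"] = df["Field"].astype("category")
# Bath is negative continuous; the models use a 3-level depth variable:
# shallower than 200 m, between 200 m and 1000 m, deeper than 1000 m.
df["Depth3"] = pd.cut(-df["Bath"], bins=[0, 200, 1000, float("inf")])
```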
In our example, we only focus on the Southern Pacific and Southern Indian regions.

5.2 Scientific Problem and Results

In Boyce, Lewis and Worm (2010), GAPLMs were fitted to explore the regional Chl trends. The models' covariates were selected based on background knowledge, in order to explain the major variation in Chl while keeping the models parsimonious. In the supplementary document, the authors mentioned that phytoplankton growth and abundance may be influenced by spatial and temporal factors which can alter photosynthetically active radiation and nutrient availability. They therefore chose covariates for seasonality (DayM or DayY), the water depth of the observations (Bath), the temporal effect (Year) and spatial changes (Lat and Lon). As already mentioned, the authors were mainly concerned with the temporal effect on regional Chl concentrations. They specified three models to explore the additive temporal effect:

(1) Year as a continuous linear term;
(2) Year as a discrete linear term;
(3) Year as a continuous smooth term.

The three models helped to reveal both the magnitude and the shape of the trend in the response. In those GAPLMs, Bath was converted into a 3-level discrete variable: greater than -200 m, between -200 m and -1000 m, and less than -1000 m. Seasonality refers to the variation in sunlight and nutrients throughout each year. In the models, seasonality was modeled as a smooth function of DayY, which helped to visualize the different seasonality patterns in the ten regions. Spatial dependency exists in the dataset because some observations share similar physical oceanographic features. To model the spatial dependency which cannot be explained by Bath and seasonality, a smooth function of Lat and Lon was included in the additive terms.
GAPLMs assuming a Gamma response with log link function were used to model the regional trends,

Chl | Bath, Year, DayY, Lat, Lon ∼ Gamma(µ(Bath, Year, DayY, Lat, Lon), a(φ)).

Note that here we do not use shape and scale parameters to define the Gamma distribution, because we only model the structure of the conditional mean of Chl. Suppose that, for a certain region, the dataset contains ny years of record yr1, yr2, ..., yr_ny, with yri ∈ {1899, 1900, ..., 2008}, and that the first year yr1 is the baseline when Year is a dummy variable. In addition, we denote the three levels of Bath as bh1, bh2, bh3, with bh1 the baseline. The models assuming the different temporal effects were as follows:

log(µ(Bath, Year, DayY, Lat, Lon)) = β0 + β1 Year + Σ_{j=1}^{2} β_{2j} I(Bath = bh_{j+1}) + f1(DayY) + f2(Lat, Lon),

log(µ(Bath, Year, DayY, Lat, Lon)) = β0 + Σ_{i=1}^{ny−1} β_{1i} I(Year = yr_{i+1}) + Σ_{j=1}^{2} β_{2j} I(Bath = bh_{j+1}) + f1(DayY) + f2(Lat, Lon),

log(µ(Bath, Year, DayY, Lat, Lon)) = β0 + Σ_{j=1}^{2} β_{1j} I(Bath = bh_{j+1}) + f1(Year) + f2(DayY) + f3(Lat, Lon).

The authors then showed the fitted curves f̂1(Year) and the fitted coefficients β̂1 or β̂_{1i} of Year to examine the temporal effect on Chl. The models were fitted using the function gam() in the R package mgcv.

Since we want to observe more detail in the Chl trend, we are more interested in the last two models. In our approach, we fit GAPLMs with similar model structures and compare the results of the robust fit and the non-robust fit given by the R function gam() in the gam package. Since this thesis only discusses one-dimensional smoothing, instead of assuming a bivariate function of Lat and Lon as the spatial additive component we model it as a sum of two one-variable functions. Our models
are

log(µ(Bath, Year, DayY, Lat, Lon)) = β0 + Σ_{i=1}^{ny−1} β_{1i} I(Year = yr_{i+1}) + Σ_{j=1}^{2} β_{2j} I(Bath = bh_{j+1}) + f1(DayY) + f2(Lat) + f3(Lon),

log(µ(Bath, Year, DayY, Lat, Lon)) = β0 + Σ_{j=1}^{2} β_{1j} I(Bath = bh_{j+1}) + f1(Year) + f2(DayY) + f3(Lat) + f4(Lon).

In a given region, we assume a constant shape parameter α = 1/φ for all the observations. In this example, the nuisance parameter φ is estimated by the median, over the ny years within the region, of the yearly statistics MAD(Chl_{Year=yri}) / Median(Chl_{Year=yri}):

φ̂ = Median_{i=1,2,...,ny} ( MAD(Chl_{Year=yri}) / Median(Chl_{Year=yri}) ).

We use the MAD and the median because they are robust estimates of the standard deviation and the mean.

We choose the smoothing argument span for the robust and non-robust fits using 2-fold cross-validation. We use 2-fold cross-validation because we have to ensure that the training dataset contains all levels of Year when Year is a dummy variable. To reduce the computational burden, we assume that Lat and Lon have the same spans, as they both represent spatial effects. We choose the spans among 0.4, 0.5 and 0.6 for the robust algorithm and among 0.3, 0.4, 0.5, 0.6 and 0.7 for the function gam(). Since there might be outliers in the dataset, we use the mean of the trimmed sum of squared standardized residuals as the cross-validation criterion. Suppose the trimming rate is p and the number of residuals is n; the cross-validation value is

CV Value = (1 / ⌊n(1 − p)⌋) Σ_{i=1}^{⌊n(1−p)⌋} r²₍ᵢ₎,

where r²₍ᵢ₎ is the i-th smallest of the squared standardized residuals ((Chl_i − Ĉhl_i)/Ĉhl_i)², i = 1, 2, ..., n. We apply trimming rates of 1%, 5% and 10% and choose the optimal spans for the Southern Pacific region. We find that the optimal spans of the robust algorithm in the continuous Year case might depend on the trimming rates.
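The dispersion estimate and the trimmed cross-validation criterion above can be sketched as follows. This is an illustrative Python rendering, not the thesis code; the standardized residual (Chl_i − Ĉhl_i)/Ĉhl_i and the MAD/median ratio follow the formulas above, and keeping the smallest ⌊n(1−p)⌋ squared residuals is our reading of the trimming.

```python
import numpy as np

def mad(x):
    """Median absolute deviation (unscaled), a robust spread estimate."""
    x = np.asarray(x, dtype=float)
    return float(np.median(np.abs(x - np.median(x))))

def phi_hat(chl, year):
    """Median over years of MAD(Chl_year) / Median(Chl_year), as above."""
    chl, year = np.asarray(chl, float), np.asarray(year)
    ratios = [mad(chl[year == y]) / np.median(chl[year == y])
              for y in np.unique(year)]
    return float(np.median(ratios))

def trimmed_cv(y, y_hat, p=0.05):
    """Mean of the floor(n(1-p)) smallest squared standardized residuals."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    r2 = np.sort(((y - y_hat) / y_hat) ** 2)   # ascending order statistics
    k = int(np.floor(len(y) * (1 - p)))
    return float(np.mean(r2[:k]))              # extreme residuals trimmed off
```

Trimming the largest residuals keeps a few gross outliers from dominating the span selection, which is the stated motivation for this criterion.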
However, the CV values under the three trimming rates are very close, so we pick the span selected by both the 1% and 10% trimming rates. Due to our limited computation speed, we apply the spans of the Southern Pacific to the Southern Indian region, but we estimate the nuisance parameter φ of the Southern Indian separately. In addition, we report outliers for each fit using a tail proportion of δ = 5×10⁻⁴.

The fitted marginal additive effects of Year, as both a categorical variable and a continuous variable, for the Southern Pacific region are shown in Figure 5.2. The baseline year's effect is estimated by the transformed mean of the response Chl in that year. The effects of the subsequent years are the sums of the estimated baseline effect and the estimated coefficients β̂_{1i}, i = 2, ..., ny. The plots also contain partial residuals, which are the sums of the estimated additive components, β̂_{11} + β̂_{1i}, i = 2, ..., ny, or f̂(Year), and their corresponding Pearson residuals.

Figure 5.2 illustrates different trends of Chl under the two fits. As in Boyce, Lewis and Worm (2010), the curves all show a mode between 1983 and 2000 with a peak around 1993. However, the non-robust curve shows a continuing decrease after 2000, while the robust curve begins to rise after 2003. Our robust curve is more similar to the authors' result. The positions of the curves and the partial residuals highlight the difference between the robust and non-robust fits. In the discrete Year case, the partial residuals of the non-robust fit are mainly positive, and it detects far fewer outliers than the robust fit. Most partial residuals of the robust fit are around zero, although the robust fit shows more extreme partial residuals than the non-robust fit. This pattern probably indicates that the non-robust fit is more distorted by the large partial residuals.
In the continuous Year case, the robust fitted curve mainly passes through the areas with denser partial residuals, while the non-robust curve usually lies above those dense areas. As in the discrete case, the robust fit has more extreme partial residuals than the non-robust fit. So the non-robust curve might also be distorted by the large partial residuals, while the robust curve fits the majority of the observations well.

We further inspect the outliers reported by the robust fit in the continuous Year case using an exploratory analysis. We plot the Chl in the Southern Pacific region by level of Bath and by season in Figures 5.3–5.4, where the black dots represent the reported outliers. We convert DayY into a categorical variable to determine the season of each observation. Figures 5.3–5.4 both suggest that most of the reported outliers are near shores. Perhaps shore-side areas are more influenced by land temperature or human activities, making Chl more variable. Figure 5.3 shows that there are relatively more reported outliers in the shallow and the deep ocean; more exposure to sunlight or continental motion might explain this phenomenon. Figure 5.4 shows that there are fewer reported outliers on the shore of Chile in winter and on the western shore of New Zealand.

[Figure 5.2: The fitted marginal additive effects of Year, as both a categorical variable and a continuous variable, for the Southern Pacific region; panels: non-robust and robust fits with discrete and continuous Year. In the discrete case, the baseline year's effect was estimated by the transformed mean of that year, β̂_{11} = log(mean(Chl_{baseYear})). The partial residuals are calculated as f̂(Year_i) + (Chl_i − Ĉhl_i)/√Ĉhl_i. The bold points are the discrete Year effects, the small solid points are partial residuals and the hollow round points are the partial residuals of reported outliers.]

[Figure 5.3: The map of observations in the Southern Pacific for each level of Bath ((0,200], (200,1e+03], (1e+03,Inf]). Black points are the outliers reported by the robust fit in the continuous Year case.]

[Figure 5.4: The map of observations in the Southern Pacific for each season (Spring, Summer, Autumn, Winter). Black points are the outliers reported by the robust fit in the continuous Year case.]

The fitted marginal additive effects of Year, as both a categorical variable and a continuous variable, for the Southern Indian region are shown in Figure 5.5. The baseline year's effect and the subsequent years' effects are adjusted with the same approach as in the Southern Pacific region. Again, in Figure 5.5 the non-robust and robust curves differ in shape: the non-robust curve decreases until 1993 and then begins to increase, while the robust curve shows an all-time increasing trend. Neither fit shows the same result as the plots in Boyce, Lewis and Worm (2010) for the Southern Indian. This is not surprising, as Boyce, Lewis and Worm (2010) discussed in their communication page that fitting different models to this composited dataset might give different temporal trend patterns.
Like the Southern Pacific case, the partial residuals mainly lie either below or above the non-robust fitted Year effects, but mostly around the robust fitted Year effects. The robust fits are prone to detect more outliers and have more extreme partial residuals. We perform the same exploratory analysis on the outliers reported by the robust fit in the continuous Year case; the plots are in Figures 5.6–5.7. As in the Southern Pacific, most of the reported outliers are near shores and in the shallow or the deep ocean. In addition, we find fewer reported outliers around the northwestern shore of Australia and the southern shore of Indonesia.

[Figure 5.5: The fitted marginal additive effects of Year, as both a categorical variable and a continuous variable, for the Southern Indian region; panels: non-robust and robust fits with discrete and continuous Year. In the discrete case, the baseline year's effect was estimated by the transformed mean of that year, β̂_{11} = log(mean(Chl_{baseYear})). The partial residuals are calculated as f̂(Year_i) + (Chl_i − Ĉhl_i)/√Ĉhl_i. The bold points are the discrete Year effects, the small solid points are partial residuals and the hollow round points are the partial residuals of reported outliers.]

[Figure 5.6: The map of observations in the Southern Indian for each level of Bath ((0,200], (200,1e+03], (1e+03,Inf]). Black points are the outliers reported by the robust fit in the continuous Year case.]
[Figure 5.7: The map of observations in the Southern Indian for each season (Spring, Summer, Autumn, Winter). Black points are the outliers reported by the robust fit in the continuous Year case.]

To sum up, we find similar patterns in the reported outliers of the two regions. These phenomena could be better explained with relevant background knowledge. In fact, we also plotted the maps by both season and level of Bath to inspect the interaction between the two factors, but we did not see any noticeable patterns. In this data analysis we simply fit our GAPLMs with the covariates selected in Boyce, Lewis and Worm (2010); another choice of covariates might provide more sensible fits and results. Nevertheless, we believe that our results can provide useful information for the researchers in their further studies.

Chapter 6

Conclusions and Future Work

In this thesis, we propose a robust GAPLM fitting algorithm which is resistant to the damage of data points that deviate from the structure of the majority of the data. In addition, the performance of the robust fit is very close to that of the non-robust one when the dataset is clean, for both Poisson and Binomial response variables. Another practical value of our robust fit is that it can effectively detect atypically extreme observations in the dataset by comparing the robustly fitted values with the observed values. We illustrate the use of our method with a real data analysis and interpret the patterns of the reported outliers. We believe our results can be helpful for the researchers in their future studies.

Our simulation and example chapters are limited by our present computation capability. In Chapter 4, we only pre-select the smoothing parameter spans with trial datasets.
In addition, we do not adopt the robust cross-validation criterion discussed in Alimadad and Salibian-Barrera (2011). To provide a more comprehensive comparison, the smoothing parameters could be chosen by both robust and non-robust cross-validation for each sample. In Chapter 5, we select the smoothing parameters from among only three values for our robust fit, and we use the same spans for Latitude and Longitude in both regions (Southern Pacific and Southern Indian). Furthermore, we keep the covariates selected in Boyce, Lewis and Worm (2010) to fit our own GAPLMs. To improve this work, a complete cross-validation should be conducted, more choices of covariates should be tried, and a bivariate smoother for (Lat, Lon) should be used to make our fits more comparable with those in Boyce, Lewis and Worm (2010). In this study, we use a preliminarily estimated dispersion parameter φ̂ to fit the Gamma distribution. It would be of interest to estimate the dispersion parameter robustly and jointly with the additive regression components of the model.

Bibliography

[1] Azadeh Alimadad and Matias Salibian-Barrera. An outlier-robust fit for generalized additive models with applications to disease outbreak detection. Journal of the American Statistical Association, 106(494):719–731, 2011.

[2] David A. Belsley, Edwin Kuh, and Roy E. Welsch. Regression Diagnostics. John Wiley & Sons, 1980.

[3] Ana M. Bianco and Victor J. Yohai. Robust estimation in the logistic regression model. Springer, 1996.

[4] Kenneth A. Bollen and Robert W. Jackman. Regression diagnostics: An expository treatment of outliers and influential cases. Modern Methods of Data Analysis, pages 257–291, 1990.

[5] Daniel G. Boyce, Marlon R. Lewis, and Boris Worm. Global phytoplankton decline over the past century. Nature, 466(7306):591–596, 2010.

[6] Eva Cantoni and Elvezio Ronchetti. Robust inference for generalized linear models. Journal of the American Statistical Association, 96(455):1022–1030, 2001.
[7] Maria Durban, Christine A. Hackett, and Iain D. Currie. Approximate standard errors in semiparametric models. Biometrics, 55(3):699–703, 1999.

[8] Wolfgang Härdle. Nonparametric and Semiparametric Models. Springer-Verlag, 2004.

[9] Trevor Hastie and Robert Tibshirani. Generalized additive models. Statistical Science, 1(3):297–310, 1986.

[10] Trevor Hastie and Robert Tibshirani. Generalized Additive Models. Chapman & Hall, London, 1990.

[11] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2009.

[12] Shui He, Sati Mazumdar, and Vincent C. Arena. A comparative study of the use of GAM and GLM in air pollution research. Environmetrics, 17(1):81–93, 2006.

[13] Shui He, Sati Mazumdar, Vincent C. Arena, and Gong Tang. Partial regression method to fit a generalized additive model. Environmetrics, 18(6):599–606, 2007.

[14] Stephane Heritier, Eva Cantoni, Samuel Copt, and Maria-Pia Victoria-Feser. Robust Methods in Biostatistics. Wiley, 2009.

[15] Hans R. Künsch, Leonard A. Stefanski, and Raymond J. Carroll. Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models. Journal of the American Statistical Association, 84(406):460–466, 1989.

[16] Enno Mammen, Oliver Linton, and J. Nielsen. The existence and asymptotic properties of a backfitting projection algorithm under weak conditions. The Annals of Statistics, 27(5):1443–1490, 1999.

[17] Ricardo A. Maronna, R. Douglas Martin, and Victor J. Yohai. Robust Statistics. J. Wiley, 2006.

[18] Peter McCullagh. Quasi-likelihood functions. The Annals of Statistics, 11(1):59–67, 1983.

[19] Peter McCullagh and James A. Nelder. Generalized Linear Models. Chapman & Hall, London, 1989.

[20] Marlene Müller. Estimation and testing in generalized partial linear models – a comparative study. Statistics and Computing, 11(4):299–309, 2001.
[21] John A. Nelder and Robert W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society, Series A (General), 135(3):370–384, 1972.

[22] Daniel Peña and Victor J. Yohai. A fast procedure for outlier diagnostics in large regression problems. Journal of the American Statistical Association, 94(446):434–445, 1999.

[23] Peter M. Robinson. Root-N-consistent semiparametric regression. Econometrica, 56(4):931–954, 1988.

[24] Paul Speckman. Kernel smoothing in partial linear models. Journal of the Royal Statistical Society, Series B (Methodological), 50(3):413–436, 1988.

[25] Robert W. M. Wedderburn. Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika, 61(3):439–447, 1974.

[26] Sanford Weisberg. Applied Linear Regression. John Wiley & Sons, 1985.