Applications of Penalized Likelihood Methods for Feature Selection in Statistical Modeling by Chen Xu M.A., York University, Canada, 2007 B.Sc., Xi’an Jiaotong University, China, 2006 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Statistics) The University Of British Columbia (Vancouver) September 2012 c Chen Xu, 2012 Abstract Feature selection plays a pivotal role in knowledge discovery and contemporary scientific research. Traditional best subset selection or stepwise regression can be computationally expensive or unstable in the selection process, and so vari- ous penalized likelihood methods (PLMs) have received much attention in recent decades. In this dissertation, we develop approaches based on PLMs to deal with the issues of feature selection arising from several application fields. Motivated by genomic association studies, we first address feature selection in ultra-high-dimensional situations, where the number of candidate features can be huge. Reducing the dimension of the data is essential in such situations. We propose a novel screening approach via the sparsity-restricted maximum likelihood estimator that removes most of the irrelevant features before the formal selection. The model after screening serves as an excellent starting point for the use of PLMs. We establish the screening and selection consistency of the proposed method and develop efficient algorithms for its implementation. We next turn our attention to the analysis of complex survey data, where the identification of influential factors for certain behavioral, social, and economic in- dices forms a variable selection problem. When data are collected though sur- vey sampling from a finite population, they have an intrinsic dependence structure and may provide a biased representation of the target population. To avoid dis- torted conclusions, survey weights are usually adopted in these analyses. We use a pseudo-likelihood to account for the survey weights and propose a penalized pseudo-likelihood method for the variable selection of survey data. The consis- tency of the proposed approach is established for the joint randomization frame- work. ii Lastly, we address order selection for finite mixture models, which provides a flexible tool for modeling data from a heterogeneous population. PLMs are at- tractive for such problems. However, this application requires maximizations over nonsmooth and nonconcave objective functions, which are computationally chal- lenging. We transform the original multivariate objective function into a sum of univariate functions and design an iterative thresholding-based algorithm to effi- ciently solve the sparse maximization without ad hoc steps. We establish the con- vergence of the new algorithm and illustrate its efficiency through both simulations and real-data examples. iii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Motivating examples . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Computational genomics . . . . . . . . . . . . . . . . . . 3 1.2.2 Chronic disease studies . . . . . . . . . . . . . . . . . . . 4 1.2.3 Market segmentation . . . . . . . . . . . . . . . . . . . . 5 1.3 Traditional feature selection methods . . . . . . . . . . . . . . . . 5 1.3.1 Selection criteria: AIC and BIC . . . . . . . . . . . . . . 6 1.3.2 Selection procedures . . . . . . . . . . . . . . . . . . . . 10 1.4 Penalized likelihood methods . . . . . . . . . . . . . . . . . . . . 11 1.4.1 The penalized likelihood and penalty functions . . . . . . 11 1.4.2 Asymptotic properties of PLM . . . . . . . . . . . . . . . 14 1.4.3 Tuning strategies . . . . . . . . . . . . . . . . . . . . . . 17 1.4.4 Computational strategies . . . . . . . . . . . . . . . . . . 19 1.5 Contributions of the dissertation . . . . . . . . . . . . . . . . . . 22 iv 2 The Screening-Based PLM in Ultra-High-Dimensional Feature Spaces 25 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2 Sure screening techniques with ultra-high dimensionality . . . . . 27 2.2.1 Model settings and notation . . . . . . . . . . . . . . . . 27 2.2.2 Sure independence screening . . . . . . . . . . . . . . . . 29 2.3 Variable screening via the sparse-MLE . . . . . . . . . . . . . . . 31 2.3.1 The sparsity-restricted maximum likelihood estimator . . 31 2.3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . 34 2.3.3 Numerical assessment . . . . . . . . . . . . . . . . . . . 37 2.4 Screening-based PLM selection procedure . . . . . . . . . . . . . 38 2.4.1 Consistency of SMLE-PLM . . . . . . . . . . . . . . . . 38 2.4.2 Tuning with EBIC . . . . . . . . . . . . . . . . . . . . . 41 2.5 Numerical studies . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.5.1 General settings . . . . . . . . . . . . . . . . . . . . . . . 42 2.5.2 Linear regression . . . . . . . . . . . . . . . . . . . . . . 45 2.5.3 Logistic regression . . . . . . . . . . . . . . . . . . . . . 48 2.5.4 Poisson regression . . . . . . . . . . . . . . . . . . . . . 51 2.5.5 Real-data example . . . . . . . . . . . . . . . . . . . . . 54 2.6 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . 55 3 The Penalized Pseudo-Likelihood in Analysis of Survey Data . . . . 56 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.2 Joint inference and super-population . . . . . . . . . . . . . . . . 58 3.3 Pseudo-likelihood-based variable selection . . . . . . . . . . . . . 60 3.3.1 The penalized pseudo-likelihood method . . . . . . . . . 60 3.3.2 Asymptotic properties of PPLM . . . . . . . . . . . . . . 61 3.3.3 Tuning strategy via sample-based BIC . . . . . . . . . . . 64 3.4 Numerical studies . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.4.1 SLCDC data . . . . . . . . . . . . . . . . . . . . . . . . 66 3.4.2 Simulation settings . . . . . . . . . . . . . . . . . . . . . 69 3.4.3 Simulation results . . . . . . . . . . . . . . . . . . . . . 71 3.4.4 Analysis of SLCDC data . . . . . . . . . . . . . . . . . . 76 3.5 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . 78 v 4 A Thresholding Algorithm for PLM in Finite Mixture Models . . . 80 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2 Order selection via regularization . . . . . . . . . . . . . . . . . . 82 4.2.1 Mixture model and penalized likelihood . . . . . . . . . . 82 4.2.2 The EM algorithm . . . . . . . . . . . . . . . . . . . . . 84 4.3 Iterative thresholding descent procedure . . . . . . . . . . . . . . 86 4.4 Tuning strategies . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.4.1 Choice of u . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.4.2 Choice of . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.4.3 Choice of CK . . . . . . . . . . . . . . . . . . . . . . . . 93 4.5 Numerical studies . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.5.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.5.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.6 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . 100 5 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . 102 5.1 Summary of the dissertation . . . . . . . . . . . . . . . . . . . . 102 5.2 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.2.1 Variable selection in model-based clustering . . . . . . . . 103 5.2.2 Support vector machine with nonconvex penalty . . . . . 105 5.2.3 Some other issues . . . . . . . . . . . . . . . . . . . . . . 106 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 A Supplementary Information for Chapter 2 . . . . . . . . . . . . . . 118 A.1 Proof of Theorem 2.1 . . . . . . . . . . . . . . . . . . . . . . . . 119 A.2 Proof of Theorem 2.2 . . . . . . . . . . . . . . . . . . . . . . . . 121 A.3 Proof of Theorem 2.3 . . . . . . . . . . . . . . . . . . . . . . . . 124 A.4 Proof of Theorem 2.4 . . . . . . . . . . . . . . . . . . . . . . . . 126 B Supplementary Information for Chapter 3 . . . . . . . . . . . . . . 129 B.1 Proof of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . . . . 129 B.2 Proof of Corollary 3.1 . . . . . . . . . . . . . . . . . . . . . . . . 132 B.3 Derivation of BIC (3.7) . . . . . . . . . . . . . . . . . . . . . . . 133 vi B.4 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . 135 C Supplementary Information for Chapter 4 . . . . . . . . . . . . . . 138 C.1 Proof of Theorem 4.1 . . . . . . . . . . . . . . . . . . . . . . . . 138 C.2 Proof of Corollary 4.1 . . . . . . . . . . . . . . . . . . . . . . . . 139 C.3 Proof of Theorem 4.2 . . . . . . . . . . . . . . . . . . . . . . . . 139 C.4 Proof of Proposition 4.1 . . . . . . . . . . . . . . . . . . . . . . . 141 vii List of Tables Table 2.1 Frequencies of covering the true model s based on different screening methods . . . . . . . . . . . . . . . . . . . . . . . . 37 Table 2.2 Simulation results for linear regression, setup 1 . . . . . . . . . . . 47 Table 2.3 Simulation results for linear regression, setup 2 . . . . . . . . . . . 47 Table 2.4 Simulation results for linear regression, setup 3 . . . . . . . . . . . 48 Table 2.5 Simulation results for logistic regression, setup 1 . . . . . . . . . . 49 Table 2.6 Simulation results for logistic regression, setup 2 . . . . . . . . . . 50 Table 2.7 Simulation results for logistic regression, setup 3 . . . . . . . . . . 50 Table 2.8 Simulation results for Poisson regressions, setup 1 . . . . . . . . . 52 Table 2.9 Simulation results for Poisson regressions, setup 2 . . . . . . . . . 52 Table 2.10 Simulation results for Poisson regressions, setup 3 . . . . . . . . . 53 Table 2.11 Results for prostate data. . . . . . . . . . . . . . . . . . . . . . . 54 Table 3.1 Variables for analysis of SLCDC data with non-response adjustments: A: allocate to other categories; D: delete from the data; M: impute by mean values; NA: no adjustment applied. . . . . . . . . . . . . . . 68 Table 3.2 Selection results for the first sampling plan: Prediction assess- ments for models 1–2 are based on the testing RSS, while for models 3–4 they are based on (PPR, NPR) with a benchmark 0.5. 72 Table 3.3 Selection results for the second sampling plan: Prediction assess- ments for models 1–2 are based on the testing RSS, while for models 3–4 they are based on (PPR, NPR) with a benchmark 0.5. . . . . . . 73 Table 3.4 Selection frequency of influential variables in model mis-specified case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 viii Table 3.5 Prediction accuracy for selected models (WPPR, WNPR) based on different benchmarks. . . . . . . . . . . . . . . . . . . . . 78 Table 3.6 Bootstrap selection results for significant variables: (Estimated coefficient, Standard error, Selection rate). . . . . . . . . . . . 78 Table 4.1 Mixing proportions and component parameters in simulation models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Table 4.2 Parameter estimates for galaxy data. ̂k, ̂k, ̂ denote estimated com- ponent mean, mixing proportion, and common component standard deviation (k = 1; 2; : : : ; 7). . . . . . . . . . . . . . . . . . . . . 100 Table 4.3 Parameter estimates for lamb data. ̂k, ̂k denote estimated compo- nent mean and mixing proportion (k = 1; 2). . . . . . . . . . . . . 100 ix List of Figures Figure 1.1 Some commonly used penalty functions, with = 2 for the l1 penalty; = 5 for the l0:5 penalty; ( = 2, a = 3:7) for the SCAD and MCP penalties. . . . . . . . . . . . . . . . . . . . 13 Figure 3.1 Selection criteria values based on candidate models . . . . . . . . 77 Figure 4.1 EM-paths of MSCAD for normal mixture model 1 with n = 100. . 95 Figure 4.2 Selection results for normal mixture models (1–6), n=100. . . . . . 95 Figure 4.3 Selection results for normal mixture models (7–10), n=300. . . . . 96 Figure 4.4 Selection results for Poisson mixture models (1–3), n=(100, 300). . 97 Figure 4.5 Selection results for Poisson mixture models (4–6), n=(100, 300). . 98 Figure 4.6 Histogram of galaxy data. Solid curve: density of seven component models selected by MSCAD. Dashed curve: density of six compo- nent models selected by AIC/BIC. . . . . . . . . . . . . . . . . 99 Figure C.1 Plot demonstration for Proposition 4.1 with = 3:7, = 1, and = (1; 3; 4:2) for cases 1–3. Solid lines represent g1( ) = p0( ) and dashed lines represent g2( ) = + z for various z values. . 142 x Acknowledgments First, I would like to express my sincere gratitude to my research supervisor Dr. Ji- ahua Chen for his invaluable suggestions, guidance, patience, and continuing sup- port. I have learnt much from him, particularly regarding academic research and scientific writing, which will surely benefit my future career. Secondly, I would like to extend my appreciation to my supervisory commit- tee, Dr. Ruben Zamar and Dr. Lang Wu, for their suggestions and encouragement throughout my Ph.D. Also, I will never forget the enjoyable pingpong time we had together. Finally, I would like to thank Peggy, Viena, Elaine, and Andrea for their warm and thoughtful assistance with administrative issues. Peggy and Andrea in partic- ular helped me very much. Thank you for all the conversations and the personal advice. Chen Xu September 2012 xi Chapter 1 Introduction 1.1 Overview Technological innovations have had a profound impact on the process of knowl- edge discovery. It is now feasible to collect data of unprecedented size and com- plexity in diverse areas of scientific research. For example, in computational ge- nomics, geneticists may measure hundreds or thousands of gene expressions to identify the few that are associated with major diseases (Keller et al. [2009]). In market research, long-term economic data are often used to discover subgroups of customers with different needs and consumption behaviors (Dickson and Ginter [1987]). In internet applications, huge numbers of uniform resource locators (URLs) are often analyzed to learn the rules for detecting web links with pop-up advertise- ments (Kushmerich [1999]). Other examples occur in bioinformatics, geology, neurology, health science, economics, and finance. Although the objectives dif- fer in various disciplines, explaining the variation in the variable of interest is a common need in their research. Statistical modeling is one of the most powerful and widely used mathematical tools for data analysis. It aims to provide insightful summaries of the information available and to formulate learning rules based on the observed data. A statistical model mimics the generation of data through a constructed stochastic procedure and attempts to explain the variable variation via a mathematical formulation. For instance, a regression model can be used to describe the relationship between a 1 disease status and gene expressions, and a finite mixture model helps to explain the heterogeneity of consumer behavior. Other classical applications of statisti- cal modeling include graphical models in independence structure learning, propor- tional hazard models in survival analysis, and autoregressive models in time series forecasting. When no prior knowledge is available, researchers may consider many poten- tial variables at the initial stage of the modeling. Examples include a regression model with a huge number of covariates, an autoregressive model with a high or- der, and a finite mixture model with a large number of mixing components. A sophisticated model with many variables provides a better descriptive value for the data structure, but it often leads to low predictive accuracy and poor model inter- pretability. Hence, we wish to identify important features in massive data and to produce a parsimonious model. The issue of feature (model) selection has been studied for decades. Traditional selection procedures, such as best subset selection and stepwise regression, are typ- ically designed for conventional settings, where the observations are assumed to be independent and the number of candidate features is relatively small. However, contemporary scientific research often encounters datasets with high dimensional- ity (a huge number of variables) and/or a complex nonindependent structure. These attributes challenge traditional methods in terms of both theoretical optimality and practical feasibility. For instance, when best subset selection is used in a regression with thousands of covariates, it can be computationally infeasible and unstable in the selection process. Therefore, we need innovative selection approaches that are suitable for the new environment. The penalized likelihood method (PLM) has been demonstrated to be an attrac- tive technique for feature selection. Examples include the least absolute shrinkage and selection operator (LASSO) proposed by Tibshirani [1996] and the smoothly clipped absolute deviation (SCAD) introduced by Fan and Li [2001]. These ap- proaches exclude variables from the model by estimating their coefficients to be zero and shrink the other coefficients accordingly. Compared with traditional meth- ods, the PLM has a lower computational cost and provides more stable selection results. The PLM has received a great deal of attention and covered a wide range of statistical models. In this dissertation, we address new research problems regarding 2 feature selection arising from different applications of the PLM. In Section 1.2, we outline a few feature selection problems from computational genomics, chronic disease studies, and market research as motivating examples for our research. In Section 1.3, we give a brief review of traditional approaches and discuss their limitations in the context of modern data analysis. We then, in Section 1.4, introduce the PLM framework and discuss its properties for variable selection. We discuss both the theoretical and computational aspects, providing the necessary background for the detailed discussion in later chapters. In Section 1.5 we state the aim of our research and outline the contributions of this dissertation. 1.2 Motivating examples Feature selection is fundamental to contemporary scientific research. We begin with a few motivating examples from various scientific fields to illustrate the chal- lenges for feature selection in the context of modern data analysis. 1.2.1 Computational genomics In contemporary biomedical studies, a common goal is to find the genetic explana- tions (e.g., genes) that are responsible for observable traits such as blood pressure, height, or susceptibility to disease. Understanding the genetic associations of dis- eases helps medical researchers to further investigate the diseases and to develop the corresponding treatment methods. In contrast to simple observable traits such as gender or blood type, some complex diseases (e.g., leukemia or diabetes) are believed to be the result of many genetic and environmental factors. Since the in- fluential genes may spread over the whole DNA sequence (the genome), studies of complex diseases normally require a full genome scan on all possible genes. Geneticists often measure hundreds or thousands of genes for a relatively small number of participants. In an ongoing project conducted by the Faculty of Dentistry, University of British Columbia, about 600 microRNA (miRNA) expressions in serum samples were measured from two groups of participants. One group consisted of 30 oral- cancer patients and the other group consisted of 26 individuals without cancer. The question is whether these miRNA readings can be used to distinguish the cancer 3 patients from the others. If the method is successful, the genetic information might be further used to predict whether an oral-cancer patient will progress from a minor tumor to a serious one. Using all 600 miRNAs for the classification leads to a poor predictive value because of the high level of noise. Consequently, it is important to select those that make the greatest contribution to identifying oral-cancer patients. To this end, a traditional two-sample t-test procedure can be carried out to detect the miRNAs that have significant expression differences between the two groups. However, for the simultaneous testing of 600 genetic readings, classical methods to control the probability of false discoveries are no longer relevant. Advanced ad- justments are often needed to control the false-discovery rates. Another common strategy is to build a logistic regression of the tumor type on the miRNA readings; we can then identify the relevant miRNAs by selecting the most important regres- sion covariates. However, the number of covariates p is 600, and the number of participants n is just 56. This large-p-small-n situation places this problem out- side the domain of classical model selection methods (see Section 1.3.2). We need innovative methods to deal with the high dimensionality. 1.2.2 Chronic disease studies Chronic diseases are the leading cause of death in North America. Chronic condi- tions such as kidney disease, cardiovascular disease, anemia, and dementia result in long-term or permanent disability for millions of people, with serious quality- of-life consequences for them and their families. The Public Health Agency of Canada explores the experiences of Canadians with chronic health conditions by conducting the Survey on Living with Chronic Diseases in Canada (SLCDC) on the targeted population. One of the main objectives of SLCDC is to identify health behaviors that influence disease outcomes, so that the government can better plan and provide health services for people with chronic diseases. Regression models are conventionally used in the analysis of SLCDC data; the goal is to detect the influential factors through a variable selection procedure. However, when variable selection is applied to survey data, many potential compli- cations arise. First, the data collected through survey sampling are usually obtained from a finite population without replacement, and hence they have an intrinsic de- 4 pendence structure (i.e., non-i.i.d.). Second, in complex survey designs such as the SLCDC, the inclusion probabilities of sampling units often vary across the target population. Consequently, the correlation between the response and the covariates seen in the sample can be different from that of the population. Ignoring the survey design in the selection process may result in biased results. These special features of survey sampling reduce the effectiveness of traditional selection methods that are developed for i.i.d. sample situations. We need new methods that take into account the special features. 1.2.3 Market segmentation In 2008 a BC marketing company collected data consisting of the annual dining- out expense for 1679 households randomly sampled in BC, Canada. The goal of their study was to identify potential subgroups of customers with different needs and consumption behaviors. A customer grouping strategy (i.e., market segmen- tation) is important for restaurant managers, because providing food services that are suitable for different customers often leads to higher profits. To identify the proper segmentation, cluster analysis is often used: customers with similar characteristics are assigned to a homogeneous group. Finite mixture models are among the most powerful and widely used clustering tools. They di- vide the overall heterogeneous population into a mixture of several subpopulations (components), where each subpopulation represents a single cluster. A mixture model with an excessive number of components (a high order) usually overfits the data and has poor interpretive value. Determining the appropriate number of com- ponents (order selection) is crucial in applications such as market segmentation. Because of the nonregularity of finite mixture models, the classical selection crite- ria are no longer optimal. We need new selection methods that take into account the special features of finite mixture models. 1.3 Traditional feature selection methods The issue of feature selection has received much attention. Classical selection ap- proaches typically consist of two parts: a selection criterion for the comparison of models with different sets of variables (features) and an associated implementation. 5 Traditional selection methods are typically designed for low-dimensional and inde- pendent settings. However, an understanding of their principles, working schemes, and limitations is of vital importance for the development of new methodologies. In this section, we briefly review two classical selection criteria, the Akaike in- formation criterion (AIC; Akaike [1973]) and the Bayesian information criterion (BIC; Schwarz [1978]), and discuss their corresponding implementations. 1.3.1 Selection criteria: AIC and BIC Suppose the data fdi = (yi;xi); i = 1; : : : ; ng are collected independently, where yi is the ith observation of the response variable and xi = fxi1; : : : ; xipgT is the associated p-dimensional covariate vector. In a typical regression context, (yi;xi) is assumed to be a random sample from the population D = (Y;X), where the conditional mean of Y depends on a linear form X with coefficients = (1; : : : ; p) T . In applications, often the effects of many covariates are unim- portant, so we may assume that the corresponding coefficients are zero. Feature selection aims to identify all the covariates with nonzero coefficients. This proce- dure is also referred to as variable selection. Under a more general framework, suppose the data d = (d1; : : : ; dn) are gen- erated from an unspecified density function f(d;) with a q-dimensional param- eter vector = (1; : : : ; q)T . Usually, we are uncertain about the true density f(d;) and instead assume a larger family of models f(d;), in which is a nonvanishing subvector of the p-dimensional parameter = (1; : : : ; p)T . The goal of feature selection is to estimate the dimension of the true model by com- paring candidate models with different dimensions. In the literature this is often referred to as model selection. For convenience of presentation, we do not distinguish the terms feature selec- tion, variable selection, and model selection in the rest of this chapter. We use a unified notation s to indicate a subset of f1; : : : ; pg, which represents a candidate model with parameters s = fj : j 2 sg, and denote by s the true model with . Also, we use (s) to indicate the dimension of s. Clearly, (s) = q. 6 Under the model settings described above, the log-likelihood function of is l(;d) = nX i=1 log f(di;): (1.1) AIC and BIC are members of a more general family of penalized model-fit statistics (referred to as “GIC”), applicable to a wide range of statistical models fitted by the maximum likelihood method, which takes the form GIC(s) = 2l(̂s;d) + c(s); (1.2) where ̂s is the maximum likelihood estimator (MLE) of (s) based on s and c is a positive constant that differs from one model selection criterion to another. The model with the smallest GIC is selected. The magnitude of the GIC is not generally interpretable, but differences between GIC values for different models are of interest. It can be seen that the GIC is a decreasing function of the maximized log-likelihood and an increasing function of the number of variables included in the model. Hence, a lower GIC implies either a simpler model (fewer variables), a better fit (higher maximized likelihood), or both. A model that balances complexity and goodness of fit is preferred. With c = 2 and c = log n, we obtain AIC and BIC respectively: AIC(s) = 2l(̂s;d) + 2(s) (1.3) BIC(s) = 2l(̂s;d) + log n (s): (1.4) Note that the penalty for BIC grows with the sample size, while that for AIC re- mains constant. When n 8, the penalty for BIC is larger than that for AIC, and therefore BIC tends to select models with fewer variables. Although AIC and BIC have similar forms as shown in (1.2), they are based on different statistical considerations. AIC Suppose as before that we have a set of candidate models S under consideration, with each s 2 S having parameters s to be estimated from the data. As defined 7 in (1.1), the model s infers the probability distribution fs(d) = Qn i=1 f(di;s) for the observations d. This serves as an approximation to the true distribution f(d) = Qn i=1 f(di; ). From this point of view, the “best” model is the one that provides the most accurate approximation of f(d). The Kullback-Leibler information is a measure of the distance between two distributions, representing the information “lost” when the second distribution is used to approximate the first. The AIC approach applies the Kullback-Leibler in- formation to the difference between f(d) and fs(d): I(f; fs) = Z f(d) log f(d) fs(d) d(d) = Z f(d) log f(y)d(d) Z f(d) log fs(d)d(d) = E[log f(d)] E[log fs(d)]: (1.5) The best model is then the model s that minimizes the Kullback-Leibler loss (1.5). Note that E[log f(d)] = is a constant that does not depend on the model and is therefore irrelevant to the model comparison. The second termE[log fs(d)] can be approximated by the maximized log-likelihood ln(d; ̂s) with an asymptotic bias approximately equal to the number of variables in model s. In other words, we have I(f; fs) ln(̂s;d) + (s) = + 1 2 AIC(s); which implies that a comparison of the Kullback-Leibler losses (1.5) is approxi- mately equivalent to a comparison of the corresponding AIC values. It can be seen that the AIC approach aims to minimize the Kullback-Leibler deviation between the true distribution f(d) and the distribution fs(d) under a particular candidate model s. Therefore, in some sense, the model selected by AIC achieves the “best” possible approximation to f(d) over the candidate models, even when the true model s is not included in the model space S. However, it is well known that AIC is not a consistent selection criterion, since it does not correctly select the true model s with probability approaching 1 in large samples when s is included in model space S. For further discussion of the consistency and efficiency of AIC, see Shibata [1983], Shao [1997], and Yang [2005]. 8 BIC BIC has its origin in the Bayesian framework; it compares the degree of support in the data for two models. Suppose as before that we have a set of candidate models S for the observations d, and each model s 2 S has a parameter vector s with (s) elements to be estimated. Under model s, the density of d with a given value of s is fs(d;s) = Qn i=1 f(di;s). Assume that, conditioning on s, the prior density of s is (s). Then, the marginal density of d under model s is P (djs) = Z fs(d;s)(s)ds; and the posterior probability of s given d is P (sjd) = P (djs)P (s)P s2S P (s)P (djs) ; where P (s) denotes the prior probability of model s. The model with the high- est posterior probability P (sjd) is then considered to receive the most support from the data. Since P s2S P (s)P (djs) is a constant for any choice of the model, choosing a model with the highest P (sjd) is equivalent to choosing a model that maximizes P (djs)P (s). Under some regularity conditions on the density fs(d), 2 logfP (djs)g has a Laplace approximation given by BIC (1.4) up to an additive constant. Therefore, with uniform prior settings on s (i.e., P (s) is constant for s 2 S), the BIC approach is approximately equivalent to comparing the posterior probabilities P (sjY ). Under some conventional assumptions such as the independence structure of d and a fixed data dimension p, Rao and Wu [1989] established the consistency of BIC by showing that it is asymptotically minimized by the true model s. How- ever, as discussed in Section 1.2, scientific studies often encounter data with high dimensionality and a dependence structure; these features impact the efficiency and theoretical optimality of BIC. To show that BIC can be unsatisfactory, let us consider the genomic example of Section 1.2, where the number of candidate covariates p is 600. Specifically, let the model space be partitioned into subclasses according to the number of covariates in a model. Then, under the constant prior setting, the probability assigned to a 9 subclass is proportional to its size. In particular, the subclass of models containing a single covariate, S1, has size 600, while the subclass of models containing two covariates, S2, has size 600 599=2. Thus, the prior probability assigned to S2 is 599=2 times that assigned to S1. It can also be seen that the prior assigned to Sj increases almost exponentially as j increases to p=2 = 300. Consequently, the subclasses of large models receive much greater priors than those with small mod- els. This encourages the selection of large models in the large-p situation, which is strongly against the principle of parsimony. Therefore, BIC is often unsatisfactory for high-dimensional data analysis. 1.3.2 Selection procedures Best subset selection In best subset selection, a selection criterion such as AIC or BIC is evaluated for each candidate model, and the model with the best “score” is selected. Best subset selection is effective when there are only a few candidate models; it is impractical for a large number of variables. In fact, even for datasets with a moderate number of variables, the total number of candidate models can be too large to be manageable. For example, suppose the number of covariates p in a regression model is 30. This implies that there are 230 a billion possible candidate models to explore, which is obviously a computationally expensive task. Stepwise selection Stepwise selection is another commonly used procedure for feature selection; it typically includes forward selection and backward elimination. In the regression context, the forward selection starts with a null model with no covariates, and then adds to the model the covariate that is most correlated with the current residual. The procedure builds a sequence of models by successively including one covari- ate at a time up to a prespecified number of steps or until all the covariates are included. In contrast, the backward elimination begins with the full model includ- ing all covariates and then forms a sequence of models by deleting one covariate at each step. The models in the sequence are then assessed according to some se- 10 lection criterion. Compared with best subset selection, stepwise selection avoids exhaustive comparisons of all the candidate models and therefore has a lower com- putational cost. When p = 30, there are only about 12p 2 = 450 models to be considered. However, stepwise methods have been found to be unstable in the se- lection process: a small change in the data could cause a very different selection result (Breiman [1995]). This is partially because once a covariate has been added to (removed from) the model at any step in the stepwise selection procedure, it is never removed from (returned to) the final model. Therefore, for complex data, the stepwise methods can easily lead to a locally optimal model. Consequently, selection results based on stepwise methods can be unreliable in practice. 1.4 Penalized likelihood methods Various penalized likelihood methods (PLMs) have been developed for the pur- pose of feature selection. These methods include the least absolute shrinkage and selection operator (LASSO; Tibshirani [1996]), bridge regression (Fu [1998]), the elastic net (Zou and Hastie [2005]), and the smoothly clipped absolute deviation (SCAD; Fan and Li [2001]). The shrinkage idea of PLM has been demonstrated to cope well with many of the challenging features of contemporary data analysis. Compared with traditional methods, the PLM possesses two major advantages. First, its selection procedure is continuous, and hence it provides more robust se- lection results; second, it is computationally efficient, which is crucial for applica- tions with high-throughput data. In this section, we introduce the PLM framework and discuss the theoretical and computational issues. 1.4.1 The penalized likelihood and penalty functions Given the model settings in Section 1.3.1, the penalized likelihood is defined as Q() = l(d;) n pX j=1 (jj j); (1.6) where (:) is a penalty function indexed by a tuning parameter controlling the amount of regularization in . Maximizing the penalized likelihood (1.6) results in 11 a maximum penalized likelihood estimator (MPLE) ̂ for . The form of (:) determines the general behavior of ̂. The L0 regulariza- tion, i.e., (jj) = I(j 6= 0j), penalizes the number of variables included in a candidate model and produces a sparse estimation for . For models of the same size (number of variables), the one that maximizes the unpenalized likelihood is preferred. However, the L0 penalty is not continuous, and the maximization of the corresponding penalized likelihood coincides with the GIC-based best subset selection procedure, which requires exhaustive search and is therefore computa- tionally demanding. With theL1 penalty (jj) = jj, we obtain the LASSO, which continuously shrinks the model parameters toward zero as the tuning parameter increases. Because of the singularity of the L1 penalty at the origin, some parameters can be shrunk to exact zero when is sufficiently large. Thus, LASSO qualifies as a variable selection operator. Because of the continuity of the shrinkage process, LASSO often leads to a more stable selection result than best subset selection does. In addition, the continuous shrinkage often improves the predictive ability of the model because of the bias-variance trade-off (Tibshirani [1996]). It is well known that the L2 penalty (jj) = jj2 results in a ridge regres- sion, which continuously shrinks the model parameters toward zero but does not set them exactly to zero, and hence is not suitable for variable selection purposes. We also observe that the L penalty (jj) = jj with 0 < < 2 leads to a bridge regression (Frank and Friedman [1993], Fu [1998]). The bridge shrinkage is continuous only when 1, while a sparse solution can be obtained only when 1. Fan and Li [2001] advocate penalty functions that give estimators (MPLEs) with the following three desirable properties: Sparsity: The estimator should automatically set small estimated model pa- rameters to zero to reduce the model complexity. Unbiasedness: The estimator should have low bias, especially when the true value of the model parameter is large. Continuity: The estimator should be continuous to avoid instability in the model selection and prediction. 12 −10 −5 0 5 10 0 5 10 15 20 Theta L1 L.5 SCAD MCP Figure 1.1: Some commonly used penalty functions, with = 2 for the l1 penalty; = 5 for the l0:5 penalty; ( = 2, a = 3:7) for the SCAD and MCP penalties. Of the three requirements, sparsity is the most crucial for feature extraction and variable selection, while continuity improves the robustness of the selection pro- cess. Unbiasedness is mainly required for the prediction issue, where the accuracy of the parameter estimates is the major concern. In general for a penalty function, singularity at the origin is required to generate a sparse MPLE, while concavity is needed to reduce the estimation bias. It is known that the L penalty with > 1 does not have sparsity, L1 does not have unbiasedness, and L with < 1 does not have continuity. Therefore, none of the L penalties possesses all three properties. Fan and Li [2001] suggested the smoothly clipped absolute deviation (SCAD) penalty, which is defined through the following derivative 0 (jj) = I( ) + (a )+ (a 1) I( > ) (1.7) for some a > 2 and (0) = 0. As shown in Figure 1.1, the SCAD penalty takes off from the origin as the L1 penalty and then gradually levels off as the model 13 parameter increases. These features ensure that SCAD maintains the sparsity and continuity of l1, and it is unbiased because it does not apply excessive shrinkage to the model parameters. Another penalty in a similar spirit is the minimax concave penalty (MCP) proposed by Zhang [2010], the derivative of which is given by 0 (jj) = (a )+ a : (1.8) TheMCP differs from SCAD for small values of with a strictly decreasing deriva- tive from the origin, which further discourages the over-regularization of the model parameters. In the literature, many other penalty functions have been proposed. To account for grouping effects, Zou and Hastie [2005] suggested a linear combina- tion of the L1 and L2 penalties and called the associated PLM the elastic net. To improve the estimation accuracy, Zou [2006] investigated the use of a weighted L1 penalty and proposed the adaptive LASSO. Figure 1.1 depicts some of these commonly used penalty functions. The shrinkage idea of PLM can also be realized through a full Bayesian analy- sis. ¿From a Bayesian point of view, the penalized likelihood estimator ̂ can be interpreted as the posterior mode estimate when the model parameters have cor- respondingly informative priors. In particular, it is well known that the L1 penalty in LASSO corresponds to an i.i.d. Laplace (i.e., double-exponential) prior of the coefficients (Park and Casella [2008]). Compared with the PLM, Bayesian shrink- age provides more convenient interval estimates of model parameters (i.e., from the estimated posterior), but usually at a cost in terms of computational efficiency. In this dissertation, we focus on the PLM and leave the potential issues of Bayesian shrinkage methods to future research. 1.4.2 Asymptotic properties of PLM The sampling properties of PLM have been extensively studied. The consistency of the MPLE ̂ for feature selection and parameter estimation is of particular interest. These two modes of consistency are defined by Estimation Consistency: k̂ k2 !p 0, as n!1, Selection Consistency: P (fj : ̂j 6= 0g = s)! 1, as n!1, 14 where k:k2 denotes the Euclidean norm. Estimation consistency implies that as the sample size increases the MPLE ̂ approaches the true value with proba- bility tending to 1, which is desirable for the parameter estimation. On the other hand, selection consistency means that, with probability tending to 1, ̂ eliminates unimportant features by estimating their coefficients at zero, which is essential for a good feature selection method. An estimator that is consistent in terms of the pa- rameter estimation does not necessarily consistently select the true model, and vice versa. A good estimator is consistent in both modes. However, these two modes of consistency serve different purposes, and they are often discussed separately. Penalized least squares and the LASSO In the literature, the L1 regularization (e.g., LASSO) has received a great deal of attention because of its sparsity and convexity. Several remarkable contributions have been made in the context of the L1-penalized least squares problem, which is formulated as a specific form of (1.6): min n1ky Xk22 + kk1 ; (1.9) where y = (y1; : : : ; yn) is the response vector, X = (x1; : : : ;xn)T is the n p design matrix with column entries corresponding to the p covariates, is the p- dimensional regression coefficient, and k:k1 denotes the L1 norm. The solution to (1.9) leads to the (sparse) LASSO estimate of , say ̂, which can be used for both parameter estimation and variable (i.e., the column entries ofX) selection. Under some regularity conditions on the design matrix, Knight and Fu [2000] established the estimation consistency of ̂, provided ! 0 as n ! 1. More- over, they showed that the limiting distribution of ̂ has positive probability masses at zero for the coefficients of irrelevant covariates, which provides insight- ful justification for using LASSO for variable selection. Zhao and Yu [2006] further characterized the selection consistency of LASSO by studying a stronger but technically more convenient property, sign consistency, which is defined as P (sgn(̂) = sgn( )) with denoting the true value of . In particular, they derived an irrepresentable condition required for the sign 15 (selection) consistency of LASSO. Following their notation, let be split into two parts, = (1; 2), with 2 assumed to be zero. Also, let the design matrix X be written as fX(1);X(2)g accordingly. The irrepresentable condition requires that, for some constant positive vector (not depending on n), jfX(1)TX(1)g1X(1)TX(2)j < 1 where 1 is a q 1 vector with all elements equal to 1 and the inequality holds elementwise. The irrepresentable condition can be interpreted as a correlation con- straint between the irrelevant covariatesX(2) and the relevant covariatesX(1). Un- der the irrepresentable condition and additional requirements, Zhao and Yu [2006] showed that the LASSO is sign-consistent for an appropriate choice of , even when the number of covariates p diverges with the sample size n at rate log p = O(n) for some 0 < < 1. However, the irrepresentable condition is hard to verify in practice and can become restrictive in high-dimensional situations. Addressing a slightly different but closely related problem, Meindhausen and Buhlmann [2006] established the selection consistency of LASSO in the context of Gaus- sian graphical models, under conditions on the design matrix similar to those in Zhao and Yu [2006]. Zou [2006] further provided a necessary condition for the selection consistency of LASSO. In particular, he showed that, under some general assumptions on the design matrix, LASSO is not variable-selection consistent. The selection bias of LASSO is mainly due to the intrinsic difficulty in distinguishing highly correlated covariates. To correct the bias, Zou [2006] proposed an adap- tively weighted L1 penalty, which is defined by pX j=1 !j jj j; where ! = f!1; : : : ; !pg denotes a prespecified weight vector. He further sug- gested ! = 1=ĵj, with ̂ being some root-n consistent estimator of , so that the penalty is decided adaptively by the data: plausible covariates receive a lower penalty. This strategy accelerates the shrinkage of the coefficients of irrelevant co- variates, provided the weight ! is appropriately specified. Under conditions sim- ilar to those used in Knight and Fu [2000], the adaptive LASSO has been shown 16 to have both estimation and selection consistency. However, finding such a root-n consistent estimator of might not be straightforward in high-dimensional situa- tions with p n. Penalized likelihood and the nonconvex penalties Nonconvex regularization methods (e.g., bridge regression and SCAD) have also received considerable research attention. In a seminal paper, Fan and Li [2001] built the theoretical foundation of nonconvex PLMs for feature selection. In the framework of generalized linear models (GLMs; McCullagh and Nelder [1989]), Fan and Li [2001] showed that there exists a local maximizer of (1.6) that converges to the true value of the model coefficients at a rateOp(n 1 2+an)with an = maxf0(jj j) : j 6= 0g. This result implies that if we choose an appropriate (:) such that an = O(n 1 2 ), the corresponding MPLE is root-n consistent for the parameter estimation. In particular, this is the case when the SCAD penalty is used in (1.6) with = o(1). Moreover, with additional requirements on (:) such as concavity in jj, Fan and Li [2001] demonstrated that such a root-n estimation- consistent MPLE is also selection consistent. They referred to this feature of some nonconvex PLMs (e.g., SCAD and MCP) as the oracle property; this means that the MPLE consistently identifies influential variables in a model and estimates their coefficients as efficiently as does the MLE based on the true model. For potential applications with high dimensionality, Fan and Peng [2004] ex- tended the results of Fan and Li [2001] to diverging p cases with p = o(n1=3). Recently, Fan and Lv [2011] illustrated the oracle property of nonconvex PLMs even when log p = O(n) for some 2 (0; 1). In addition, they derived sufficient conditions that guarantee asymptotic equivalence between the global maximizer of the SCAD-penalized likelihood and the oracle estimator. Their results also suggest that L1-based PLMs generally can not achieve selection and root-n estimation con- sistency simultaneously, and thus in general they do not have the oracle property. 1.4.3 Tuning strategies As discussed in Sections 1.4.1 and 1.4.2, the attractive features of PLM depend on an appropriate choice of (:). Given a specific form of the penalty function (:), 17 we must select a proper tuning parameter that controls the amount of regulariza- tion. It is well known that excessive regularization may lead to the elimination of important variables, while insufficient shrinkage retains too many irrelevant vari- ables in the model. Hence, the selection of is critical for achieving the advantages of PLM in practice. To address this issue, Tibshirani [1996] used the m-fold cross-validation (CV; Stone [1974]) to select the for LASSO. Given the model settings and the notation in Section 1.3.1, this procedure works as follows: First, we divide the full dataset d into T separate sets, say fdtg for t = 1; : : : ; T , and then find the MPLE ̂(t) of based on d dt . We choose an optimal by minimizing CV() = TX t=1 l(dt; ̂(t)) = TX t=1 X i2dt log f(di; ̂(t)): The CV-based tuning method provides a that yields a model with the “optimal” prediction accuracy. In the same spirit, Fan and Li [2001] suggested a tuning method based on gen- eralized cross-validation (GCV; Craven and Wahba [1979]), which selects the that minimizes GCV() = l(d; ̂) n(1 (s)=n)2 ; where s denotes the model corresponding to ̂. Compared with the CV method, the GCV method is computationally more convenient, and it has been widely used. However, from a variable-selection point of view, the GCV-based method has been shown to be inconsistent (Wang et al. [2007]); it tends to select many irrel- evant variables. Wang et al. [2007] advocated the use of BIC for tuning a PLM. This approach finds by minimizing BIC() = 2l(d; ̂) + (s) log n: In the context of penalized least squares with finite-parameter settings, Wang et al. [2007] showed that the SCAD estimator with chosen by ̂BIC is consistent in variable selection. This result has been extended to the diverging p cases with 18 p = o(n) in Wang et al. [2009]. In recent work, Chen and Chen [2012] proposed a family of extended Bayesian information criteria (EBIC) under the GLM setup, which is designed for high- dimensional model selection with a sound Bayesian motivation (see Section 2.4.2). EBIC has been found to be an effective tool for choosing the tuning parameter for PLM in situations where p n (Wang [2009], She [2011]). Specifically, the EBIC-based tuning strategy finds the that minimizes EBIC() = 2l(d; ̂)) + (s)(logn+ log p) for some 0 1. Clearly, for = 0 EBIC reduces to the BIC-based tuning strategy. By choosing > 0, EBIC places more penalties on the model complexity by linking the regularization to the number of candidate variables p. This modi- fication is particularly helpful for selecting an appropriate parsimonious model in ultra-high-dimensional applications where p n, such as the genomic example introduced in Section 1.2.1. 1.4.4 Computational strategies In the last decade, substantial progress has been made in solving optimization prob- lems related to PLMs. Various numerical methods have been proposed for finding the maximizer (i.e., the MPLE) from the penalized likelihood (1.6). In princi- ple, when convex penalties (e.g., L1 and L2) are used, the objective function (1.6) is concave and convex optimization algorithms can be conveniently applied. For nonconvex penalties (e.g., SCAD andMCP), however, the task becomes more chal- lenging since (1.6) is no longer concave in general, and the numerical procedure may lead to local maxima. In this subsection, we briefly review several algorithmic contributions for convex and nonconvex PLMs respectively. LARS and CD As a fundamental regularization problem, the L1-penalized least squares (1.9) has been studied extensively. Efficient algorithms have been developed to numerically find the LASSO estimate. In a seminal paper, Tibshirani [1996] treated the nu- merical problem related to LASSO as a linearly constrained least squares problem, 19 which can be solved by standard quadratic programming techniques. Efron et al. [2004] developed a fast and efficient least angle regression (LARS) procedure for the variable selection, which can be modified to provide the entire solution path for LASSO (f̂ : > 0g). Following (1.9), the LARS algorithm works roughly as follows. Starting with all the coefficients 1; : : : ; p equal to 0, LARS finds the covariate xl that is most correlated with the response y. Then, the LARS moves l from 0 toward its marginal least squares estimate of y on xl, until some other covariate xh has the same correlation with the current residual as that of xl. Next, the LARS algorithm moves (l; h) in the direction defined by their joint least squares estimate of the current residual on (xl;xh), until a third covariate has the same correlation with the current residual. The procedure continues in this way until all the covariates have been added to the model. LARS can be modified to provide the exact solution path of LASSO: if any nonzero coefficient reaches zero in a LARS procedure, we drop the corresponding covariate from the model and recompute the current residual. After p steps (p < n), we reach the least squares estimate of based on the full model, which corresponds to the LASSO estimate with = 0. The idea of LARS has been extended to efficiently solve the L1-penalized likelihood problem in the context of GLMs (McCullagh and Nelder [1989]). Because of the computational advantage of the LARS algorithm, LASSO quickly became a popular technique for dimensionality reduction and feature extraction. Recently, the coordinate-wise descent algorithm (CD) has received attention for its ability to solve the numerical problem related to convex PLMs (Fu [1998], Friedman et al. [2007]). The idea of CD is quite simple: it iteratively optimizes the objective function (1.6) one variable at a time. To many researchers’ surprise, this seemingly naive strategy works amazingly well for a wide range of convex regu- larization problems, including LASSO, the nonnegative garotte (Breiman [1995]), and the elastic net. Taking LASSO as an example, Friedman et al. [2007] demon- strated that CD is competitive with the well-known LARS, and thus it has great potential for high-dimensional problems. 20 LQA and LLA In nonconvex PLMs, the irregular shape of the objective function poses a chal- lenge. To address this issue, Fan and Li [2001] suggested locally approximating a nonconvex penalty by a quadratic function (LQA), i.e., (jj) (j0j) + 1 2 p 0 (0) j0j ( 2 20) for close to some initial value 0. With the aid of LQA, Newton-type algorithms can be modified to solve the MPLE from (1.6) with a nonconvex penalty. For example, we can use the MLE as the initial value and obtain the MPLE through an updating procedure by (k+1) = argmax 8<:l() n pX j=1 0 (j(k)j j) 2j(k)j j 2j 9=; : However, the sequence (k) obtained from LQA may not be sparse for any fixed k and hence is not directly suitable for feature selection. Fan and Li [2001] fur- ther suggested setting (k)j = 0 if j(k)j j is sufficiently small, say j(k)j j < "0 for some tolerance level "0, and removing the corresponding covariate from the model. Although it is useful in practice, such an ad hoc step may bring instability to the procedure: once a variable has been deleted, it can never be considered again. Fur- thermore, the choice of "0 has a direct impact on the final selection result. Chossing an appropriate "0 might be problematic in applications. To avoid this drawback, Zou and Li [2008] proposed a one-step algorithm based on local linear approximation (LLA) to the penalty function: (jj) p(j0j) + 1 2 0 (j0j)(jj j0j); which leads to a similar iteration procedure: (k+1) = argmax 8<:l() n pX j=1 0 (j(k)j j)jj j 9=; : 21 Because this form of LLA shares the common traits of the L1 penalty, the final esti- mate has a sparse structure and hence avoids the ad hoc step. More importantly, ef- ficient algorithms developed for convex PLMs (e.g., LARS) can be directly adopted in the updating procedure. Zou and Li [2008] established the convergence of the LLA procedure and further showed that, starting with a root-n consistent estimator, the nonconvex MPLE obtained by LLA has the oracle property after a single iter- ation. However, obtaining such a “good” initial value may be an issue, especially for the large-p-small-n cases frequently encountered in genetic applications. Recently, She [2009] proposed an iterative thresholding-based algorithm, which provides a novel approach for nonconcave PLMs. In this strategy, the multivariate nonconcave objective function is transformed into equivalent univariate functions for which simple optimization can be conveniently handled. For the L1 penalized least squares problem (1.9), the iterative thresholding-based algorithm is identical to the CD procedure. Thus, it shares advantages with the efficient algorithms devel- oped for convex PLMs. We provide a more detailed discussion of the thresholding- based algorithm in Section 4.3. 1.5 Contributions of the dissertation The PLM has received much attention, and it is applicable to a wide range of statistical models in a variety of scientific areas. In this dissertation, we address new research problems in feature selection arising from several applications of the PLM. As mentioned in Section 1.2, in computational genomics, the number of candi- date features can be much larger than the sample size (ultra-high dimensionality). In such applications, the number of features can be so large that even the computa- tionally attractive PLM is not adequate. A super-efficient procedure to quickly screen a large number of candidate features is therefore essential. Fan and Lv [2008] proposed sure independence screening (SIS), which screens features based on their marginal correlations with the response. It is natural to conjecture that accounting for joint effects among the candidate features will be beneficial for the screening. In this spirit, in Chapter 2 we propose a novel screening approach via the sparsity-restricted maximum likelihood estimator (SMLE) and investigate its 22 performance. The SMLE estimates the high-dimensional model coefficients in a designated low-dimensional subspace and screens features by setting their coeffi- cients to zero. The features passed by SMLE are then subject to a more elaborate selection via the PLM. SMLE incorporates joint effects among features by jointly estimating their model coefficients, and thus it has the potential to provide more reliable screening results than SIS provides. We show that SMLE enjoys the sure screening property of Fan and Lv [2008] in the ultra-high-dimensional GLM setup, and we develop an efficient algorithm for its implementation. In addition, we es- tablish estimation and selection consistency for the SMLE-based PLM and propose the use of EBIC for the tuning parameter selection. The effectiveness of the new methods has been demonstrated in simulation studies. Regression models are routinely used to analyze survey data; they identify the influential factors for certain social or behavioral indices in a target popula- tion. As discussed in the LSCDC example of Section 1.2, when data are collected through survey sampling from a finite population without replacement, they have an intrinsic dependence structure and may not represent the target population. To avoid distorted conclusions, survey weights are usually adopted in the estimation of parameters in regression models based on survey data. Incorporating the sur- vey weights may also be beneficial for variable selection. To investigate this, in Chapter 3 we explore the use of pseudo-likelihood to take account of the survey weights and study a penalized pseudo-likelihood method (PPLM) for the variable selection of survey data. In a joint randomization framework, we prove that the PPLM consistently identifies the influential variables through BIC-based tuning. The finite-sample performance of the approach is assessed via analysis and com- puter simulations based on data from the hypertension component of the 2009 Sur- vey on Living with Chronic Diseases in Canada. The results show that, compared with the standard PLM, the PPLM helps to avoid biased selection results due to informative sampling and provides protection against model mis-specification. In market research, finite mixture models are frequently used to depict the het- erogeneity of an overall data structure. Selecting the most suitable number of mix- ture components (the order) is fundamental to these applications. In Chapter 4, we address order selection for finite mixture models, which provides a flexible tool for modeling data from a heterogeneous population. The PLM procedure proposed by 23 Chen and Khalili [2008] is attractive for order selection. The method fits a high- order mixture model to the data via the penalized likelihood, and the nonsmooth penalty helps to merge close components to achieve a lower order. However, this method requires maximizations over nonsmooth and nonconcave objective func- tions, which are computationally challenging. The commonly used LQA approach fails to provide a sparse solution, which might lead to unstable selection results. To tackle this problem, we transform the original multivariate objective func- tions into a sum of univariate functions and design an iterative thresholding-based algorithm to efficiently solve the sparse maximization without ad hoc steps. We further show the ascent property of the proposed algorithm, i.e., each update to the parameter estimates increases the value of the objective function. This desirable property not only helps to design an appropriate stopping rule in practice but also leads to the convergence of the algorithm. Our simulation studies show that the new algorithm reduces the computational time by approximately 40% in compari- son with LQA. To further reduce the computational burden, we propose a revised BIC criterion to select the tuning parameter of Chen and Khalili’s method, where the regular BIC penalty on the model complexity is reduced by half. We do this because an extra component in a mixture model usually leads to a single extra de- gree of freedom in the limiting distribution of the likelihood-based test statistic. We demonstrate the efficiency of the proposed method via numerical studies. Lastly, in Chapter 5, we summarize the main findings of this dissertation and present several directions for future research. All the technical derivations and proofs are provided in Appendices A–C. 24 Chapter 2 The Screening-Based PLM in Ultra-High-Dimensional Feature Spaces 2.1 Introduction High-dimensional datasets with many variables are frequently encountered in mod- ern scientific research (Hastie et al. [2009], Donoho [2000], Fan and Lv [2010]). It is often important to identify the features that influence the response (feature ex- traction). Examples include detecting the biomarkers responsible for rare diseases and finding the stocks that generate profits in investment portfolios. Regression models are frequently used, and the feature extraction is typically performed by a variable selection procedure. Traditional selection approaches, such as best subset selection and stepwise re- gression, can be computationally expensive and instable in the selection process. The PLMs, including LASSO (Tibshirani [1996]), SCAD (Fan and Li [2001]), the elastic net (Zou and Hastie [2005]), and MCP (Zhang [2010]), are now being used as computationally feasible alternatives for variable selection. These approaches are practical and can be selection consistent. However, when the number of fea- tures p is much larger than the sample size n (ultra-high-dimensional cases), as is 25 common in genetic studies, the direct use of PLMs is difficult. For PLMs with non- convex penalties (e.g., MCP and SCAD), finding the global maximizer of the cor- responding penalized likelihood is computationally challenging. For PLMs with convex penalties (e.g., LASSO and the elastic net), selection consistency may not hold in general. More importantly, the choice of an appropriate tuning parameter for the PLM in ultra-high-dimensional applications has not yet been determined. To overcome the difficulties of the large-p-small-n situation, Fan and Lv [2008] proposed reducing the number of candidate features to a manageable level before applying a more elaborate selection method. They suggested SIS, which screens the original p features according to their marginal correlations with the response variable. This approach efficiently removes most of the irrelevant features while retaining important features with a high probability (sure screening). Wang [2009] proposed the use of classical stepwise forward regression (FR) for feature screen- ing and established its consistency in the sense of Fan and Lv [2008]. Feature screening techniques simplify the high-dimensional variable selection to a lower- dimensional problem, where the PLM can be conveniently applied. Motivated by the insights of these methods, in this chapter we propose a novel approach for feature screening via the sparsity restricted maximum likelihood esti- mator (SMLE). SMLE estimates the high-dimensional model coefficients in a des- ignated low-dimensional subspace and naturally screens features by setting their coefficients to zero. Unlike SIS, SMLE accounts for joint effects between features by jointly estimating their model coefficients. Thus, it has the potential to provide more reliable screening results in practice. The SMLE approach belongs to a more general class of sparsity-constrained approximation methods that have been widely adopted in wavelet analysis, sig- nal processing, and compressed censoring. Sparsity-constrained methods are fre- quently used to construct a parsimonious representation (approximation) of high- resolution images/signals for fast transmission and recovery (Donoho [2006], Candès et al. [2006], Blumensath and Davies [2009]). We attempt to investigate the potential of sparsity-constrained methods (i.e., SMLE) for feature screening. In particular, we show that the SMLE enjoys the sure screening property in the sense of Fan and Lv [2008] in the context of high-dimensional GLMs (McCullagh and Nelder [1989]), where the number of covariates p can be considerably larger than the sample size 26 n. An iterative hard thresholding-based algorithm (IHT; Blumensath and Davies [2008]) can be used to compute SMLE. We establish the consistency of SMLE- based PLM and demonstrate the promising performance of the proposed procedure via numerical studies. The rest of this chapter is organized as follows. In Section 2.2, we introduce the ultra-high-dimensional GLM and review the SIS-based screening method. In Section 2.3, we investigate the use of SMLE for feature screening and discuss its asymptotic properties; an IHT algorithm is proposed for the implementation. We discuss the SMLE-based PLM in Section 2.4 and assess the finite-sample perfor- mance of the method in Section 2.5. Finally, Section 2.6 presents several remarks. The proofs of the theorems are presented in Appendix A. 2.2 Sure screening techniques with ultra-high dimensionality 2.2.1 Model settings and notation Suppose the data f(yi;xi); i = 1; : : : ; ng are collected independently from (Y;x), where Y is a response variable and x = (x1; : : : ; xp)T is a p-dimensional covariate vector. We postulate a GLM between Y and x as follows. Conditioning on x, the distribution of Y is assumed to belong to an exponential family taking the form f(y; ) = exp(y b() + c(y)) (2.1) with respect to a -finite measure and known functions b(:) and c(:). is usually called the natural parameter, and it is assumed to take values in a compact space R such that Z f(y; )d = 1; for any 2 : (2.2) Given (2.1), it is well known that E(Y jx) = = b0() and Var(Y jx) = 2 = b00(), where the primes denote derivatives with respect to . The covariate x influences the response Y in the form of a linear combination g() = xT; 27 where = (1; : : : ; p)T are the p-dimensional model coefficients and g(:) is a specified link function. For theoretical purposes, it is often convenient to relate the natural parameter directly to the covariate x, i.e., = u(xT) for u(:) = (g )1. Of special importance is the canonical link g = 1, which leads to a linear expression of , i.e., = xT. Several classical GLMs with the canonical link are as follows: Linear regression: Suppose xi, yi has the distribution N(i; 2). In this case, we have i = i and the natural link g(i) = i, which implies i = xTi . Model (2.1) leads to the classical normal linear regression yi = x T i + i (2.3) where i is the normal random error with mean zero and variance 2. Logistic regression: Suppose xi, yi has the distribution Bernoulli(pi). In this case, we have pi = i = exp(i)=(1 + exp(i)) and the natural link g(i) = logi=(1 i), which implies pi = exp(xTi )=(1 + exp(xTi )). Model (2.1) leads to the logistic regression logitfP (yi = 1jxi)g = log( pi 1 pi ) = x T i ; (2.4) which is commonly used for regression analysis with binary dependent vari- ables. Poisson regression: Suppose xi, yi has the distribution Poisson(i). We have i = exp(i) and the natural link g(i) = log(i), which implies i = exp(xTi ). Model (2.1) leads to the Poisson regression log(i) = x T i ; (2.5) which is often used for the analysis of count data and multidimensional con- tingency tables. With the GLM settings, the effect of each covariate on the response Y is char- acterized through the size of the corresponding regression coefficient. In applica- 28 tions, when the dimension p is high, it is often believed that only a small number of the covariates in x contribute to the variations in Y , which leads to an idealistic assumption that is sparse. With this sparsity, we can identify influential features by finding the covariates associated with nonzero coefficients. Specifically, let be the true sparse coefficients with q nonzero elements, and let s be an arbitrary subset of f1; : : : ; pg defining a submodel with covariates xs = fxj ; j 2 sg and associated coefficients s = fj ; j 2 sg. For convenience, we use k:k0 to denote the number of nonzero components of an arbitrary vector (i.e., the l0-norm) and (s) to indicate the size of model s. In particular, we denote the true model by s = fj : j 6= 0g with (s) = kk0 = q. We must estimate s from f1; : : : ; pg by analyzing the data f(yi;xi); i = 1; : : : ; ng. We are interested in solving this problem in ultra-high-dimensional situations where p n > q. 2.2.2 Sure independence screening As mentioned, the high dimensionality of p poses great challenges in searching for s using most existing selection methods. A natural idea is to reduce the dimen- sionality of the feature space p to a manageable level k (say, k < n) by a fast and reliable method, such that existing selection methods (e.g., PLM) can be applied to the reduced feature space. This suggests a two-step procedure for feature selection in ultra-high-dimensional situations: a crude feature screening followed by a more careful selection. To this end, Fan and Lv [2008] developed the SIS framework by ranking the marginal correlations between the covariates and the response. Specifically, let w = (w1; : : : ; wp) T be a p-dimensional vector with component wj denoting the fitted coefficient obtained from the marginal regressions of the response variable on the jth covariate. For a given screening bound k < n, the SIS screens features by selecting s = f1 j p : the value of jwj j is among the k largest valuesg: Clearly, in this strategy, each feature is used independently as a single covariate to deternine its strength of association with the response. SIS has been widely adopted in genetic studies, where two-sample testing methods are frequently used 29 to detect the genes that differ between case and control groups (Efron [2007], Storey and Tibshirani [2003]). In the context of a linear model with Gaussian covariates and response, Fan and Lv [2008] showed that P (s s) ! 1 as n ! 1 even when p increases exponentially with n. They refer to this property of SIS as the sure screening property, meaning that a screening method has the ability to retain all the important features with a high probability. This result has been further extended by Fan and Song [2009] to the high-dimensional GLM case where both the response and the covariates can be discrete. SIS uses only the marginal information of the covariates, and its performance can be unstable in applications. To address this issue, Fan et al. [2009] proposed an iterative SIS (ISIS) procedure, which works as follows. First, ISIS applies SIS to select k1 features, denoted s1. Then, it applies the PLM (e.g., SCAD) on s1 to select a subset M1 s1. Next, it fits a regression of Y on M1 and obtains the corresponding residuals. These residuals are then treated as the new response variable, and SIS is applied to the remaining candidate features fx1; : : : ; xpg= M1 to select k2 features, say s2. Then, the PLM is applied to M1 [ s2, which gives a new subset M2. The procedure continues until there are k features in the current subset. At each step of ISIS, since the residuals based on Mt are uncorrelated with the variables in Mt, the unimportant variables in fx1; : : : ; xpg= Mt that are highly correlated with the response Y through strong associations with the features in Mt are not likely to be selected at the next step. Also, the important features that are marginally weakly correlated with Y because of the presence of the variables in Mt should have a chance to be selected. ISIS has great potential to improve SIS, but it has a higher computational cost. 30 2.3 Variable screening via the sparse-MLE 2.3.1 The sparsity-restricted maximum likelihood estimator The seminal theory of SIS stimulates us to seek a more effective and efficient method for feature screening in ultra-high-dimensional applications. In particu- lar, we conjecture that incorporating joint information between candidate features will be beneficial. Our investigation starts from the classical likelihood-based in- ference. Specifically, with the canonical link, the log-likelihood function of is given by l() = nX i=1 [yi xTi b(xTi )]: (2.6) Maximizing (2.6) leads to the MLE of . Under some regularity conditions, the MLE is a consistent estimate of and thus provides information useful for de- tecting s. However, in ultra-high-dimensional settings with p n, the classical MLE is not uniquely defined and therefore loses its interpretive ability. In the spirit of regularization, we consider estimating in a subspace of Rp with the number of nonzero entries constrained to be less than a given screening bound k. Under the assumption that is sparse (i.e., q < k), this constrained estimation is expected to retain all the important information carried by while setting most of the zero entries of exactly to zero, and it therefore provides feature screening. Motivated by this argument, we now present a new screening method. We propose carrying out the aforementioned estimation via the sparsity-restrictedMLE of (SMLE), which is defined by ̂[k] = argmax l() subject to kk0 k (2.7) for some specified k smaller than n. Clearly, the SMLE ̂[k] is constrained to be sparse in its composition, and its nonzero entries correspond to a submodel ŝ = f1 j p : the jth entry of ̂[k] is nonzerog that yields the highest possible likelihood score within the restricted model sparsity 31 k. Functionally, the ̂[k] screens irrelevant features by setting their coefficients to zero, while retaining the important features in ŝ for further selection. Compared with (I)SIS, SMLE naturally accounts for the joint effects between candidate fea- tures by jointly estimating their coefficients. Thus, it has the potential to provide more reliable screening results. The idea of SMLE has similarities with the use of l0-regularized techniques in image processing, where sparsity-constrained least-squares methods are fre- quently used to construct parsimonious representations for high-resolution images (Donoho [2006], Blumensath and Davies [2009]). To provide insights into the use of SMLE for feature screening, we first focus on the theoretical aspects and ignore the computational issues at this stage. As argued in Fan and Lv [2008], a good screening approach should have the ability to remove most of the irrelevant variables while retaining all the relevant ones with a high probability. They refer to this as the sure screening property of a variable screening method. Accordingly, we define SMLE to have screening consistency if P (s ŝ)! 1; as n!1: (2.8) To investigate whether SMLE has screening consistency, we introduce the fol- lowing notation. For any model s, let S(s) = @l(s) @s = nX i=1 [yi b0(xTiss)]xis; H(s) = @2l(s) @s@ T s = nX i=1 b00(xTiss)xisx T is be the score function and the Hessian matrix of l(:) corresponding to s. For k such that q < k, we define Sk+ = fs : s s; ksk0 kg and Sk = fs : s 6 s; ksk0 kg for collections of overfitted models and underfitted models. We investigate the 32 asymptotic properties of ̂[k] in the scenario where p, q, k, and vary with the sample size n. Also, we assume the following conditions, some of which are purely technical and serve only to provide a theoretical understanding of the new screen- ing method. We do not intend these assumptions to be the weakest possible. T1 log p = O(nm) for somem > 0. T2 There exist nonnegative constants 1, 2 such that min j2s jj j w1n1 and q < k w2n2 for some w1, w2 > 0. T3 There exist a positive constant c1 and a corresponding 1 > 0 such that for sufficiently large n, min[n 1H(s)] c1 for s 2 fs : jjssjj2 1g and s 2 S2k+ , where k:k2 is the Euclidean norm and min[:] denotes the smallest eigenvalue of a matrix. T4 There exist positive constants c2, c3, c4, such that max 16j6p max 16i6n ( x2ijPn i=1 x 2 ij 2 i ) c2 n1; and c3 n1 nX i=1 x2ij 2 i c4 for any j 2 f1; : : : ; pg. By condition T1, we assume that p diverges with n up to an exponential rate, which implies that the number of covariates can be substantially larger than the sample size. Condition T2 assumes that is sparse and its minimal component does not degenerate too quickly. Condition T3 corresponds to assumptions A4–A5 of Chen and Chen [2012]. It basically requires s to stay some distance from incorrect models as n increases. Condition T4 places restrictions on the observed values of 33 xj . For a wide range of models, condition T4 holds naturally for random designs on x or for fixed designs with an appropriate rescaling operation. We now establish the screening consistency (sure screening property) of the SMLE ̂[k] via the following theorem. Theorem 2.1 Under model (2.1) and conditions T1–T4, if 1 + 2 < 12 , we have P (s ŝ)! 1; as n!1: See Appendix A for the proof. By Theorem 2.1, we show that, with probability tending to one, SMLE retains all important features in s by estimating their model coefficients away from zero. This desirable property of SMLE provides a necessary condition for correctly identifying s in the more elaborate selection based on ŝ, as will be illustrated in Section 2.4. 2.3.2 Implementation As discussed in the previous subsection, SMLE has the potential to address ultra- high-dimensional feature screening, but this needs to be demonstrated in appli- cations. In principle, numerically finding ̂[k] from (2.7) corresponds to a l0- regularized problem. Such problems have been extensively studied in the area of signal processing, and a number of computational strategies have been de- veloped. Examples include the matching pursuit algorithms (Mallat and Zhang [1993]) and the FOCUSS-based methods (Murray and Kreutz-Delgado [2001]). We find that the hard-thresholding-based algorithms developed under linear mod- els (Blumensath and Davies [2009] are suitable for our needs. In this subsection, we design a modified hard-thresholding procedure in the context of GLM for com- puting ̂[k] in applications. Specifically, to tackle the problem in (2.7), we first approximate the ln(:) at a generic point by hn( ;) = ln() + ( )T sn() u 2 k k22; (2.9) where u > 0 is a scale parameter. The first two terms in (2.9) match the Taylor’s expansion of ln( ) at = , and u2k k22 is introduced as a regularization 34 term. Clearly, we have ln() = hn(;), and hn( ;) well approximates ln() for close to . A key property of hn( ;) is that it is additive in the components of , so that the maximization of hn( ;) over can be conveniently carried out. With the aid of (2.9), we then propose the following iterative procedure to solve (2.7) (t+1) = argmax hn( ; (t)) subject to k k0 k: (2.10) For each iteration step, the regularization term in hn(:) prevents the maximizer from being far from the current estimate of . This feature further guarantees the convergence of (2.10) to a local maximum of ln(:) (subject to the sparsity con- straint). Let y = (y1; : : : ; yn)T and X = (xi; : : : ;xn)T . Because of the additivity of hn( ;) in , the optimization in (2.10) takes a unified specific form: min 1 2 u1[u +XTy XT b0(X)] 2 2 subject to k k0 k: (2.11) Obviously, if the sparsity constraint is ignored, the solution to (2.11) is the typical least squares estimate ~ = + u1XT [y b0(X)]; which corresponds to a zero loss in the objective function. Thus, the constrained minimum of (2.11) is achieved by choosing the k largest (in absolute value) com- ponents of ~ . Consequently, the solution to (2.11) is given by ̂ =H(~ ; k) = H(~ 1; j~ j[k]); : : : ; H(~ p; j~ j[k]) T ; where j~ j[k] is the kth largest component in j~ j and H( ; r) = 8<: ; if j j > r0; if j j r is the hard thresholding function. 35 Therefore, the iteration (2.10) can be re-written as (t+1) =H((t) + u1XT [y b0(X(t))]; k): (2.12) It can be seen that (2.12) is a simple thresholding-based iterative procedure which does not involve complicated operations (such as matrix inversion). This advantage makes (2.12) suitable for high-dimensional computing. Unlike the typical thresh- olding method, (2.12) adaptively performs hard thresholding on the current update such that each (t) satisfies the sparsity constraint i.e., k(t)k0 k. Following Blumensath and Davies [2009], we refer to (2.12) as the iterative hard-thresholding (IHT) procedure. We now show that the sequence f(t)g based on IHT has a property analogous to that of typical thresholding-based methods: it stepwise increases the value of l(:). This increment property further ensures the convergence of IHT to a local maximum of l(:) within the feasible region. For convenience of illustration, we introduce the following condition T30 min[n1H(s)] > 0 for any s with (s) k. It can be seen that condition T30 is analogous to T3, which requires the strict con- cavity of ln(s) over models of size no larger than k. This condition is purely technical and might be further weakened. We focus on providing a theoretical un- derstanding of the IHT and leave this issue to future research. We now justify the convergence of IHT by the following theorem. Theorem 2.2 Given the settings and notation introduced earlier, assume that b(:) in (2.1) is twice continuously differentiable. Let f(t)g be the sequence defined by (2.12). Denote by 1 the maximum eigenvalue ofXTX and (t) = max i sup 0<<1 b00(xTi (t+1) + (1 )xTi (t)): If u > 1(t), then l((t+1)) l((t)): Moreover, if condition T30 holds, then f(t)g converges to a local maximum of ln() subject to kk0 k. 36 See Appendix A for the proof. Theorem 2.2 implies that, for an appropriate scale parameter u, the IHT necessarily converges. In our simulations, we choose u adap- tively at each step according to the value of (t) to guarantee the monotonicity of l((t)). Since (t) may lead to a local maximum, multiple initial values are often used in the hope that the global maximum is not missed. For the linear model, when the IHT starts with (0) = 0, its first iteration corresponds to the SIS step based on the marginal information of features. This suggests that zero might be a reasonable initial choice for IHT. In practice, the LASSO estimates of are also convenient for setting (0). The efficiency of IHT has been observed in numerical studies. 2.3.3 Numerical assessment Model Type Setup (p; n) SIS ISIS FR LASSO SMLE 1 (10000, 200) .22 .94 1.00 .98 .99 Linear 2 (5000, 120) .58 .63 .34 .89 .78 3 (1000, 100) .01 .73 .88 .28 .99 1 (1000, 400) .94 1.00 .- - .99 .99 Logistic 2 (1000, 400) .11 .89 .- - .85 .97 3 (1000, 400) .02 .61 .- - .17 .77 1 (1000, 200) .07 .94 .- - .67 .97 Poisson 2 (1000, 200) .01 .84 .- - .39 .94 3 (1000, 200) .00 .54 .- - .01 .93 Table 2.1: Frequencies of covering the true model s based on different screening methods To provide a quick assessment of SMLE-based screening (2.7), we briefly sum- marize some simulation results based on our numerical studies, which will be pre- sented in more detail in Section 2.5. In this subsection, we focus on showing the numerical performance of the screening methods; we give detailed descriptions of the simulation setups in Section 2.5. We conduct simulation studies in three different modeling contexts: linear re- gression, logistic regression, and Poisson regression. In each case, we examine SMLE for three different setups with specific correlation structures among the can- 37 didate features. We evaluate the screening method by measuring the frequency with which the resulting model includes all the features in the true model, i.e., we mea- sure the ability to correctly screen the irrelevant features. For the linear model, we compare SMLE with four other screening methods: SIS, ISIS, FR, and LASSO. We do not include FR for the logistic and Poisson models because of its high com- putational expense. For a fair comparison of the methods, the size of the screened model k is the same for each screening procedure. Table 2.1 summarizes the coverage frequencies of the true model s for dif- ferent screening methods based on 1000 repetitions. The simulation results reveal that SMLE is competitive with other popular screening methods. The advantage of SMLE-based screening is especially clear for the third correlation setup in the logistic and Poisson models, where both SIS and LASSO miss important features. Compared with ISIS, SMLE achieves higher coverage frequencies at a lower com- putational cost. A more detailed discussion of the simulation results will be given in Section 2.5. 2.4 Screening-based PLM selection procedure We have proposed SMLE, which efficiently screens the irrelevant features in an ultra-high-dimensional model while retaining the important features with an over- whelming probability. Given such a feature-screening technique, s can be con- veniently estimated through a two-step procedure: First, we shrink the full model f1, . . . , pg to a refined submodel ŝ of size (ŝ) k < n, and then we apply the PLM to ŝ to identify s. We refer to SMLE followed by the PLM as SMLE-PLM. Because many irrelevant covariates are removed from the original feature space, the search for s in the PLM step is dramatically narrowed. Sure screening makes it feasible to do feature selection for ultra-high-dimensional problems and dramati- cally speeds up the selection process. In this section, we present further discussion of SMLE-PLM. 2.4.1 Consistency of SMLE-PLM As discussed in Section 1.4.2, some PLMs have both estimation and selection con- sistency for a wide range of applications (Fan and Li [2001]). We now explore 38 whether these desirable properties hold for SMLE-PLM. Unlike the direct imple- mentation of PLM, SMLE-PLM is a two-stage procedure. Given the settings and notation in Section 2.2, SMLE-PLM estimates by ̂(ŝ), found by maximizing Q(ŝ) = l(ŝ) n X j2ŝ (jj j); (2.13) where ŝ is the k-dimensional submodel obtained from SMLE and (:) is a speci- fied penalty function. As in typical PLMs, an appropriate choice of (:) leads to a sparse ̂(ŝ), which further identifies important covariates based on ŝ. We study the large-sample properties of ̂(ŝ) under conditions T1–T4 given in Section 2.3.1. For the asymptotic analysis, we associate with n and consider a penalty function sequence (:) that satisfies the following properties: P1 For any > 0, (jj) 0 with (0) = 0. P2 For any > 0, 0 (jj) = @(jj)=@jj exists and is continuous for jj 2 (0;+1). P3 There exist positive constants 3 and w3, such that 0 (jj) w3n3 for jj 0:5w1n1 . Properties P1–P2 specify the shape and smoothness of the penalty function, and are generally required for PLMs (e.g., Fan and Lv [2011]). Property P3 further restricts the sequence of penalties by setting an upper bound on the derivatives. For a specific choice of (:) (e.g., L1 or SCAD), P3 corresponds to assuming an appropriate asymptotic order for the tuning parameter . In general, it is difficult to study the global maximizer of (2.13) analytically without the concavity of (2.13). As is common in the PLM literature, we study the behavior of local maximizers. We first establish estimation consistency for SMLE-PLM via the following theorem: Theorem 2.3 Under conditions T1–T4, if 1+2 < 12 and 3 > 1+ 2 2 , then there exists a local maximizer ̂(ŝ) of (2.13) with (:) satisfying P1–P3, such that k̂(ŝ) k2 = Op(n) 39 for some 2 (1;minf12 2; 3 22 g). See Appendix A for the proof. Theorem 2.3 shows that, for an appropriate choice of penalty (:), SMLE-PLM consistently estimates the model parameters in the ultra-high-dimensional GLM setup. For the situation where the number of influ- ential features q does not diverge with n, we have 2 = 0 in condition T2. In this case, SMLE-PLM achieves root-n consistency if 3 > 0:5 in requirement P3. In particular, when the L1 penalty is used (i.e., (jj) = jj), this result implies that root-n consistency is achieved if we set = o(n 1 2 ). Moreover, we establish the selection consistency of SMLE-PLM with the ad- ditional condition: T5 There exist a positive constant c5 and a corresponding 2 > 0 such that for sufficiently large n, 1 n j@ln(s) @j @ln( s) @j j c5ks sk2 for any j 2 s, s 2 Sk+ and s 2 fs : ks sk2 2g. Condition T5 is purely technical and basically requires Lipschitz continuity of the score function in a neighborhood of . Also, we need an extra requirement on the order of the penalty function sequence as follows: P4 There exist positive constants 4 and w4 such that 0 (jj) w4n4 for jj < 0:5w1n1 . Theorem 2.4 Under conditions T1–T5, if 1 + 2 < 0:5 and 3 2 > 0:5 1:52 > 4, then there exists a local maximizer ̂(ŝ) = (̂1(ŝ); : : : ; ̂k(ŝ)) T of (2.13) with (:) satisfying P1–P4, such that Pf̂j(ŝ) = 0; for j 2 ŝ n sg ! 1: See Appendix A for the proof. Theorem 2.4 implies that, for an appropriate choice of (:), SMLE-PLM consistently selects s even when the number of features diverges exponentially with the sample size. Appropriate choices of (:) include 40 popular nonconvex penalties such as SCAD or MCP as special cases. In particular, for the fixed-q situation, the selection consistency of SMLE-SCAD is implied if = o(1) and n 1 2 !1. The results of theorems 2.3 and 2.4 are not restricted to SMLE-PLM but also apply to a general screening-based PLM procedure with the sure screening property (e.g., SIS or FR). 2.4.2 Tuning with EBIC As for standard PLMs, given the form of the penalty function (:), we must specify the tuning parameter for the implementation of SMLE-PLM. The com- monly used GCV or BIC tuning methods (Section 1.4.3) are designed for the sit- uation where p < n and may not be suitable for situations where p n. To address this issue, we choose the that minimizes the following EBIC criterion (Chen and Chen [2008]), EBIC(s) = 2l(̂s) + (s)(log n+ log p); 0 1; (2.14) where s denotes the model corresponding to ̂(ŝ) (2.13) and ̂s is the MLE based on model s. With = 0, EBIC reduces to the standard BIC tuning strategy (Wang et al. [2007]). The EBIC is designed for high-dimensional model selection with a sound Bayesian motivation. Let the model space be partitioned into subclasses according to the number of covariates that a model contains. Let Sj for 0 j p be the sub- class of models containing j covariates and (Sj) be the size of Sj . Let us assign each model in subclass Sj an equal probability, i.e., P (sjSj) = 1=(Sj) for any s 2 Sj . Then, instead of assigning probabilities P (Sj) proportional to (Sj) as in BIC, EBIC sets P (Sj) to be proportional to (Sj)1 for some between 0 and 1. Thus, in EBIC the prior probability P (s) for s 2 Sj is set to be proportional to (Sj) . Compared with the constant prior used in BIC, this assignment sub- stantially reduces the high prior in models with a large number of covariates, and hence may be suitable for large p. Chen and Chen [2012] have shown that EBIC is selection consistent in the GLM context even when log p = O(nm). Theorem 2.3 shows that there exists a proper sequence of (:) such that 41 P (s = s ) ! 1. The EBIC can help to tune (:) for the implementation of screening-based PLM. In our simulation studies, we set = 0:5 for EBIC as sug- gested by Chen and Chen [2012]. 2.5 Numerical studies We assess the finite sample performance of the SMLE-PLM via simulation studies. In particular, we are interested in knowing how SMLE compares with other popular screening methods. We must consider many factors in order to obtain a relatively complete picture of the new method. Recall that our general goal is to use a GLM to explain the variation in the response variable Y through a number of covariates (features) se- lected from a large number of candidates. The correlation structure between these covariates can have a strong effect on the performance of screening-based meth- ods. Also, their performance may vary depending on the model to which they are applied. In addition, there are implementation issues to address. Last but not least, we need measures to evaluate the performance of the different methods. In the next subsections we discuss the simulation settings. Some of the simulation results have already been revealed and additional ones will be presented. 2.5.1 General settings We examine the methods in three different modeling contexts: linear regression, logistic regression, and Poisson regression, as described in Section 2.2.1. For the linear model, we compare screening-based LASSO and SCAD on five screening methods: SIS, ISIS, FR, LASSO, and SMLE. We do not include FR for the logistic and Poisson models because of its computational cost. Correlation structure For each model we consider three different correlation structures of the features x1; : : : ; xp, so there are a total of nine combinations. The correlation structures are as follows: Setup 1: x1; : : : ; xp are independent and identically distributed N(0; 1) random variables. 42 Setup 2: x1; : : : ; xp are joint Gaussian, marginallyN(0; 1), with cov(xj ; xj1) = 2=3, cov(xj ; xj2) = 1=3 for j 3, and cov(xj ; xh) = 0 if jj hj 3. Setup 3: x1; : : : ; xp are joint Gaussian, marginally N(0; 1), with cov(xj ; xh) = 0:15 for j; h 2 s and cov(xj ; xh) = 0:3 for j or h 2 f1; : : : ; pg n s. Case 1 is the ideal independence structure, which is the most straightforward for variable selection. In Case 2, we consider a moving average type correlation struc- ture, where features are strongly correlated for order distances less than three. This type of correlation is commonly used to model features with a natural order. In Case 3, we have a compound correlation structure such that every irrelevant feature has equal correlation with the relevant features. We therefore expect the variable selection in Case 3 to be more challenging. Implementation issues In our simulation studies, the screening methods are implemented as follows. For ISIS, we use the R function GLMvanISISscad in the SIS package, with a maximum of five ISIS loops. The number of relevant features retained in each loop is decided by SCAD with the AIC-based tuning method; see Fan et al. [2009]. For SMLE, we use the proposed IHT algorithm. In particular, we choose ten LASSO estimates with sparsity varying from n 1 to k as different initial val- ues for IHT. The estimate from IHT that maximizes the likelihood is then treated as the SMLE. For comparison purposes, we also report the performance of LASSO when it is applied directly to the full model. In particular, we apply the R function glmnet to identify a sequence of ordered features with a prespecified size k. Vari- ables in the sequence are considered to be important (after screening), and PLMs are used for the further selection. For each model, we use NONE to represent the model without further variable selection. We use the notation LASSO and SCAD to represent, respectively, the models selected by LASSO and SCAD with the EBIC tuning strategy. To facilitate the computing process, we use the built-in functions in the R software packages glmnet and SIS to compute the estimators of LASSO and SCAD respectively. Theorem 2.1 shows that when the model size k has a certain asymptotic order, SMLE consistently includes all the important features. However, in practice, a spe- 43 cific choice of k must be made in the implementation. Intuitively, a larger choice of k may increase the probability that a screening method includes all the relevant features. However, if the screened model has many irrelevant variables the final se- lection may be more difficult. In our simulations, we set k = [n= logn] for the lin- ear model, k = [n=4 log n] for the logistic regression model, and k = [n=2 log n] for the Poisson regression model. These model-based choices of k were recom- mended by Fan et al. [2009] and worked satisfactorily in our simulation examples. In fact, the performance of screening-based PLMs are quite robust to a wide range of sensible values for k. Since our goal is to compare different screening methods, we treat these model-based k values as benchmarks for the comparison. Assessment of performance We assess the performance of the screening methods based on T = 1000 simula- tion replications. Specifically, let ŝt denote the model selected in the jth replica- tion by a particular method (e.g., SMLE-None). The coverage probability of that method is computed by T1 PT t=1 I(s ŝt), which measures its ability to dis- cover all relevant features. We characterize the model selectivity of each method in terms of the positive selection rate (PSR) and the false discovery rate (FDR), which are defined as follows: PSR = PT t=1 (s \ ŝt) T(s) ; FDR = PT t=1 (ŝt=s ) T(ŝt) : The PSR and FDR depict two different aspects of the selection result: a high PSR indicates that most of the important variables have been identified, while a low FDR indicates that only a few irrelevant variables have been selected. We also report the average number of features included in ŝt, as well as the proportion of times that s is perfectly identified with no extraneous variables and no missed variables (the correct selection rate, CSR). In the following subsections, we further specify the model parameter settings and discuss the simulation results. 44 2.5.2 Linear regression Parameter settings In this example, the data (yi;xi) for i = 1; : : : ; n were generated as independent copies from (Y;x) according to the linear model (2.3) Y = xT + ; where is a normal random error with mean zero and variance 2. The model parameters used in each of the three correlation setups are as follows: Setup 1: (n; p; ) = (200; 10000; 3), and s is randomly chosen from f1; : : : ; pg with ksk0 = 8. For j 2 s, the j values are generated independently from U(4 log n p n+jZj), whereU is a binary random variable with P (U = 1) = 0:6 and P (U = 1) = 0:4 and Z is a standard normal random variable. For j 62 s, the j values are set to zero. Setup 2: (n; p; ) = (120; 5000; 5), and s = f1; 3; 5; 7; 9g. (1; 3; 5; 7; 9) = (5; 3:5; 2:8; 2:5; 2:2) and j = 0 for j 62 s. Setup 3: (n; p; ) = (100; 1000; 1), and s = f1; 2; 3; 4g. j = 2:5 for 1 j 4 and j = 0 for j > 4. The settings in setup 1 are borrowed from example 1 of Wang [2009], where both s and are generated randomly. For the settings in setup 2, we choose s to ensure non-negligible correlations among all the relevant features. For Case 3, the most challenging situation, we fix s to be the set of the first four features and set the coefficients for all the relevant feature to 2.5. The values for all three cases were chosen after pilot studies to give an appropriate signal-to-noise ratio. To evaluate the prediction accuracy of each model, we independently generated testing data (~yi; ~xi) with the same sample size as the training data. For each ŝt and its associated estimate of , say ̂t, we calculated the corresponding relative prediction error by 1 T TX t=1 ( 1 P i(~yi ~xTi ̂s)2P i(~yi ~xTi ̂t)2 ) ; 45 where ̂s is the least squares estimate based on the true model s. For convenience of comparison, we set ̂t to the least squares estimate on ŝt for a screening method without further selection, and we set ̂t to the corresponding shrinkage estimate for a screening method followed by a PLM (i.e., LASSO or SCAD). Results The simulation results are summarized in Tables 2.2–2.4. For setup 1, most screen- ing methods performed very well in terms of including all the relevant features in the model. This is indicated by the high coverage probabilities. However, SIS, which screens features based on marginal correlations, performed poorly because of the high spurious correlations caused by the ultra-high dimensionality. The drawback of FR is observed in setup 2, where the relevant features are correlated with each other. If one of two highly correlated features is included in an FR procedure, the likelihood of including the other is very small because of its weak correlation with the current residual. Consequently, FR and its associated PLMs performed poorly in this situation. Meanwhile, SIS improved because of the aggregated marginal correlations, but this structure does not encourage further improvements via ISIS. The performance of LASSO and SMLE remained satisfac- tory. Lastly, in setup 3, we see that the strong collinearity among the features greatly deteriorates the performance of SIS and LASSO. In terms of the coverage proba- bility, ISIS significantly improved on SIS by almost 72%, while FR improves on LASSO by 88%. In comparison, SMLE did amazingly well in this challenging situation by achieving coverage probabilities as high as 99%. From Tables 2.2 to 2.4, the performance of LASSO and SCAD after the screen- ing step was similar for setups 1 and 2. SCAN had the better performance for setup 3 in terms of lower FDR and higher CSR. Also, the final model fitted by SCAD consistently yielded a lower prediction error than that fitted by LASSO. This might be due to the excessive shrinkage of the LASSO estimator for large coefficients (see Fan and Li [2001]). In particular, when LASSO and SCAD were used after SMLE in setup 3, the difference in the relative prediction error of their selected (fitted) models was as large as 91%. 46 Screening Coverage Ave. Relative Method PLM prop. PSR FDR CSR model size pred. error SIS None .22 .85 .82 .00 38.0 .52 LASSO .21 .84 .11 .15 7.7 .62 SCAD .22 .85 .12 .21 7.8 .36 ISIS None .94 .99 .79 .00 38.0 .46 LASSO .84 .98 .10 .40 8.9 .58 SCAD .92 .99 .07 .51 8.6 .25 FR None 1.00 1.00 .79 .00 38.0 .55 LASSO .99 .99 .07 .59 8.7 .44 SCAD .99 .99 .09 .37 8.8 .10 LASSO None .98 .99 .77 .00 35.6 .38 LASSO .85 .98 .11 .39 9.0 .59 SCAD .96 .99 .08 .46 8.8 .31 SMLE None .99 1.00 .79 .00 38.0 .48 LASSO .91 .99 .09 .47 8.8 .56 SCAD .98 .99 .07 .49 8.6 .13 Table 2.2: Simulation results for linear regression, setup 1 Screening Coverage Ave. Relative Method PLM prop. PSR FDR CSR model size pred. error SIS None .58 .91 .82 .00 25.0 .35 LASSO .32 .83 .09 .23 4.6 .42 SCAD .26 .78 .15 .17 4.7 .26 ISIS None .63 .92 .81 .00 25.0 .42 LASSO .38 .85 .10 .24 4.8 .41 SCAD .25 .78 .21 .13 5.0 .30 FR None .34 .78 .84 .00 25.0 .60 LASSO .29 .76 .23 .17 5.1 .33 SCAD .19 .73 .28 .08 5.2 .25 LASSO None .89 .97 .79 .00 23.4 .35 LASSO .44 .86 .09 .26 4.8 .41 SCAD .31 .80 .17 .19 4.9 .31 SMLE None .78 .95 .81 .00 25.0 .44 LASSO .46 .87 .09 .29 4.8 .39 SCAD .25 .79 .19 .11 5.0 .28 Table 2.3: Simulation results for linear regression, setup 2 47 Screening Coverage Ave. Relative Method PLM prop. PSR FDR CSR model size pred. error SIS None .01 .33 .94 .00 22.0 .89 LASSO .01 .23 .91 .00 9.8 .95 SCAD .01 .24 .89 .00 9.7 .94 ISIS None .73 .85 .84 .00 22.0 .58 LASSO .52 .70 .66 .01 8.9 .87 SCAD .73 .79 .29 .52 5.8 .25 FR None .88 .89 .84 .00 22.0 .59 LASSO .78 .86 .53 .02 7.9 .83 SCAD .88 .89 .19 .57 5.1 .11 LASSO None .28 .64 .87 .00 20.7 .69 LASSO .03 .32 .86 .00 9.8 .95 SCAD .28 .49 .63 .27 8.2 .68 SMLE None .99 1.00 .82 .00 22.0 .52 LASSO .52 .82 .61 .01 8.9 .92 SCAD .99 .99 .07 .71 4.4 .01 Table 2.4: Simulation results for linear regression, setup 3 2.5.3 Logistic regression Parameter settings In our second example, the generic response Y follows a Bernoulli distribution with success probability satisfying log( 1 ) = x T: Thus, data generated as independent pairs from (Y;x) satisfy a logistic regression model. The parameters used for the three correlation setups are as follows: Setup 1: s is randomly chosen from f1; : : : ; pg with ksk0 = 8. For j 2 s, the j values are generated independently from U(4 log n= p n+ jZj=4), where U is a binary random variable with P (U = 1) = 0:5 and P (U = 1) = 0:5 and Z is a standard normal random variable. For j 62 s, the j values are set to zero. Setup 2: s = f1; 3; 5; 7; 9g. (1; 3; 5; 7; 9) = (2;1:8; 1:6;1:4; 1:2) and 48 j = 0 for j 62 s. Setup 3: s = f1; 2; 3; 4g. j = 1:5 for 1 j 4 and j = 0 for j > 4. Because of the lack of information in a binary response, we choose n = 400 and p = 1000 for all three cases. The coefficients are a rescaled version of those in the linear example. The size of the screened model was set to k = [n=4 log n] for all the screening methods. Results The results are shown in Tables 2.2–2.4, where the prediction accuracy of each selected model was evaluated by the proportion of correct predictions based on in- dependent testing data. Again, since the features are independent, all the methods performed well for setup 1. SMLE and its associated PLMs have the best perfor- mance in setups 2 and 3, where the correlation structure is more complex. LASSO continues to suffer from the collinearity between the features, particularly in setup 3. Screening Coverage Ave. Prediction Method PLM prop. PSR FDR CSR model size accuracy SIS None .94 .99 .53 .00 17.0 .84 LASSO .94 .99 .02 .84 8.1 .84 SCAD .94 .99 .02 .84 8.1 .86 ISIS None 1.00 1.00 .53 .00 17.0 .83 LASSO .99 1.00 .03 .81 8.2 .84 SCAD .99 1.00 .05 .63 8.5 .86 LASSO None .99 1.00 .49 .00 15.7 .83 LASSO .99 .99 .02 .84 8.2 .84 SCAD .99 1.00 .04 .68 8.4 .86 SMLE None .99 1.00 .53 .00 17.0 .83 LASSO .99 .99 .02 .82 8.2 .84 SCAD .99 1.00 .04 .67 8.4 .86 Table 2.5: Simulation results for logistic regression, setup 1 49 Screening Coverage Ave. Prediction Method PLM prop. PSR FDR CSR model size accuracy SIS None .11 .74 .78 .00 17.0 .71 LASSO .08 .62 .24 .00 4.2 .71 SCAD .10 .62 .27 .03 4.3 .73 ISIS None .89 .97 .71 .00 17.0 .75 LASSO .76 .93 .22 .21 6.3 .74 SCAD .84 .96 .13 .49 5.6 .80 LASSO None .85 .97 .68 .00 15.2 .76 LASSO .58 .88 .29 .05 6.4 .74 SCAD .80 .95 .19 .19 6.0 .79 SMLE None .97 .99 .71 .00 17.0 .76 LASSO .77 .94 .21 .23 6.2 .74 SCAD .88 .97 .13 .45 5.7 .80 Table 2.6: Simulation results for logistic regression, setup 2 Screening Coverage Ave. Prediction Method PLM prop. PSR FDR CSR model size accuracy SIS None .02 .45 .89 .00 17.0 .79 LASSO .01 .32 .78 .00 6.1 .74 SCAD .01 .34 .78 .00 6.3 .75 ISIS None .61 .82 .81 .00 17.0 .79 LASSO .29 .62 .66 .00 7.4 .77 SCAD .56 .76 .54 .06 7.2 .81 LASSO None .17 .60 .84 .00 15.2 .83 LASSO .03 .36 .77 .00 6.4 .75 SCAD .09 .43 .74 .01 6.7 .76 SMLE None .77 .92 .78 .00 17.0 .80 LASSO .50 .81 .53 .01 7.1 .79 SCAD .76 .91 .39 .13 6.5 .83 Table 2.7: Simulation results for logistic regression, setup 3 50 2.5.4 Poisson regression Parameter settings We next consider the situation where the generic response Y follows a Poisson distribution with mean exp(xT). Thus, data generated as independent pairs from (Y;x) satisfy a Poisson regression model. The parameters used for the three cor- relation setups are as follows: Setup 1: s is randomly chosen from f1; : : : ; pg with ksk0 = 8. For j 2 s, the j values are generated independently from U(log n= p n+ jZj=8), where U is a binary random variable with P (U = 1) = 0:8 and P (U = 1) = 0:2 and Z is a standard normal random variable. For j 62 s, the j values are set to zero. Setup 2: s = f1; 3; 5; 7; 9g. (1; 3; 5; 7; 9) = (2;1:8; 1:6;1:4; 1:2) and j = 0 for j 62 s. Setup 3: s = f1; 2; 3; 4g. j = 0:7 for 1 j 4 and j = 0 for j > 4. Similarly to the logistic example, we set n = 200 and p = 1000 for all three cases. Four screening methods (i.e., SIS, ISIS, LASSO, and SMLE) are compared with k = [n=2 log n]. We used a testing likelihood ratio to measure the goodness of fit for each model. Specifically, for a given testing data set and its associated log-likelihood ~l(:), we computed the testing likelihood ratio as 1 T TX t=1 ( 1 ~l(̂t) ~l(̂s) ) ; where ̂s is the MLE of based on s and ̂t denotes the estimate for a par- ticular method (e.g., SMLE-SCAD) on the tth replication. We adopted the same choices for ̂t as in the linear example and set T = 1000. Results The results are summarized in Tables 2.8–2.10. Most of the patterns are consistent with those of the previous example. The benefits of SMLE are clear in setups 2 51 and 3. For setup 1, the performance of ISIS and SMLE is comparable, but SMLE is preferred because of its lower computational cost. Screening Coverage Ave. Testing Method PLM prop. PSR FDR CSR model size lh-ratio SIS None .07 .75 .68 .00 19.0 .34 LASSO .06 .74 .30 .00 8.8 .46 SCAD .07 .75 .24 .02 8.1 .33 ISIS None .94 .99 .58 .00 19.0 .14 LASSO .86 .97 .20 .13 10.1 .29 SCAD .93 .99 .08 .52 8.7 .04 LASSO None .67 .94 .58 .00 18.1 .15 LASSO .52 .90 .32 .03 10.9 .39 SCAD .66 .93 .15 .29 9.0 .12 SMLE None .97 .99 .58 .00 19.0 .18 LASSO .91 .98 .14 .27 9.3 .27 SCAD .97 .99 .06 .63 8.5 .03 Table 2.8: Simulation results for Poisson regressions, setup 1 Screening Coverage Ave. Testing Method PLM prop. PSR FDR CSR model size lh-ratio SIS None .01 .53 .86 .00 19.0 .46 LASSO .00 .47 .64 .00 7.1 .46 SCAD .01 .46 .65 .00 7.1 .42 ISIS None .84 .95 .75 .00 19.0 .17 LASSO .68 .89 .41 .05 7.9 .33 SCAD .83 .94 .22 .29 6.3 .07 LASSO None .39 .82 .76 .00 17.5 .22 LASSO .10 .66 .57 .00 8.1 .43 SCAD .38 .79 .40 .06 7.0 .18 SMLE None .94 .98 .74 .00 19.0 .21 LASSO .78 .94 .33 .09 7.5 .31 SCAD .93 .98 .15 .37 6.0 .03 Table 2.9: Simulation results for Poisson regressions, setup 2 52 Screening Coverage Ave. Testing Method PLM prop. PSR FDR CSR model size lh-ratio SIS None .00 .19 .96 .00 19.0 .61 LASSO .00 .14 .93 .00 9.0 .70 SCAD .00 .16 .92 .00 8.2 .62 ISIS None .54 .69 .85 .00 19.0 .34 LASSO .43 .59 .69 .01 8.8 .53 SCAD .54 .67 .55 .09 7.4 .26 LASSO None .01 .26 .93 .00 17.2 .58 LASSO .00 .19 .91 .00 9.3 .70 SCAD .00 .24 .87 .00 8.1 .60 SMLE None .93 .98 .79 .00 19.0 .25 LASSO .62 .86 .56 .01 8.4 .51 SCAD .91 .96 .33 .14 6.2 .07 Table 2.10: Simulation results for Poisson regressions, setup 3 53 2.5.5 Real-data example We now apply the SMLE-based method to a genetic application. Singh et al. [2002] measured the expression levels of 12600 genes from prostate specimens of 52 prostate cancer patients and 50 healthy controls. One objective was to build a gene-expression-based classification rule to predict the identity of unknown prostate samples. Such a classification tool is helpful in the early detection of prostate cancer, which provides a better opportunity for curative surgery. The iden- tification of genes that influence the disease outcome also provides a greater under- standing of the genetic aspect of prostate tumors. By performing a permutation-based correlation test, Singh et al. [2002] de- tected 456 potential genes that are differently expressed between tumorous and normal samples. Using the 6033 genes in the complete dataset, Efron [2009] found a further 377 genes using an empirical Bayesian approach, while Chen and Chen [2012] spotted 3 more genes using EBIC-based LASSO. In this example, we reanalyze the dataset by building a logistic regression logitfP (Y = 1jx)g = xT; where Y is the binary status of the prostate cancer (with Y = 1 for a tumorous sample, Y = 0 for a normal sample) and x contains the 12600 gene expression levels. Accordingly, we predict Y = 1 when P (Y = 1jx) is estimated to be over 0.75 and predict Y = 0 otherwise. SMLE-SCAD is used for the analysis because of its superior performance in the simulation studies. We randomly select a set of 10 subjects from each of the tumorous and normal sample groups as the testing set, and treat the remainder as the training set. We set the screening bound k = 20 Screening Ave. Overall Method model size Sensitivity Specificity pred. error SIS 2.2 .71 .96 .17 ISIS 2.3 .68 .96 .18 LASSO 2.7 .71 .94 .17 SMLE 2.7 .77 .94 .14 Table 2.11: Results for prostate data. 54 on the training set and assess the prediction accuracy of the selected model on the testing set. We base the assessment on T = 200 replications and summarize the results in terms of sensitivity, specificity, and the overall prediction error. For com- parison purposes, we also include the results of SIS, ISIS, and LASSO followed by SCAD with k = 20. All the tuning parameters are selected by EBIC as in the simulation examples. From Table 2.11, we see that all four methods performed well, choosing a parsimonious model with relatively high prediction accuracy. SMLE shows its superiority by choosing a model with higher sensitivity, which corresponds to a lower chance that a cancer patient is wrongly diagnosed as healthy. The results of SMLE are consistent with those of Chen and Chen [2012] in terms of the number of selected genes but have a lower prediction error. 2.6 Summary and conclusions In this chapter, we have developed a new approach for the variable selection prob- lem when the number of candidate variables is ultra-high. Our approach follows existing methods by first screening out a large number of seemingly nonsignificant covariates computationally efficiently and theoretically consistently. The screening step is followed by a widely accepted regularization approach to further reduce the number of variables in the model to enhance the interpretability. The second step results in a sequence of candidate models with increasing model complexity as the degree of regularization reduces. Specifically, we proposed SMLE for the screening; it naturally incorporates joint effects between features in the screening process. We showed that SMLE has the sure screening property in the ultra-high-dimensional GLM setup, and we developed an iterative hard-thresholding algorithm for its implementation. Our simulation study indicated that the new procedure is computationally efficient and competitive with other screening procedures. At the same time, the new method was observed to have a higher probability of retaining all the significant covariates in the model settings that we considered. 55 Chapter 3 The Penalized Pseudo-Likelihood in Analysis of Survey Data 3.1 Introduction In many areas of scientific research, one common interest is to identify the influen- tial factors associated with certain behavioral, social, or economic indices within a target population. For example, sociologists and economists would like to identify important factors that affect the unemployment rate in a specific region, and epi- demiologists are interested in finding risk behavior for diseases. In such studies, researchers often start with a survey of the target population (Korn and Graubard [1999], Rahiala and Teräsvirta [1993], Wolfson [2004]). A representative sample is then selected and measurements of the variables of interest for the sampled units are collected. A regression model is routinely employed to summarize the informa- tion contained in the data. It explains variations in the response variable through a simple function of explanatory variables (covariates). When they lack prior knowl- edge, researchers may collect information on many potential explanatory variables. In these applications, it is never straightforward to decide in advance which vari- ables should be included in the perceived regression model. Consequently, the goal of identifying influential factors is often achieved through a variable selection procedure. That is, we assume a response variable together with a large number of covariates, based on which a regression model is to be fitted. In this chapter, we de- 56 velop a variable selection strategy for survey data and investigate its large-sample properties. Unlike the problems presented in Chapter 2, the sample size n is often very large, while the number of candidate covariates p is large but not huge. This makes a screening procedure unnecessary. However, the traditional all-subset selection (Section 1.3) is still computationally infeasible although its principles apply. More- over, when we perform variable selection for survey sampling, many potential complications arise. First, the data collected through survey sampling are usu- ally obtained from a finite population without replacement, and hence they have an intrinsic dependence structure (i.e., non-i.i.d.). Second, in complex survey designs, the inclusion probabilities of sampling units often vary across the target population. Consequently, the correlation between the response and the covariates reflected in the sample can be distorted from that of the population. This is potentially the case when some segments of the population are sampled more intensively than others. Ignoring the survey design in the selection process may result in biased selection results for the target population. In the literature, sampling weights are often considered in estimating finite population parameters such as the population mean or population proportions. The weighted estimates help to avoid biased inference from informative sampling (Pfeffermann [1993], Fuller [2009]). However, the role of sampling weights in variable selection is not completely clear. Although the parameter estimation and variable selection serve different purposes, they often have a coherent linkage in the modeling process. It is natural to conjecture that using sampling weights is beneficial for variable selection. In this spirit, we investigate the use of a pseudo-likelihood to take account of the sampling weights and propose a penalized pseudo-likelihood method (PPLM) for variable selection in the analysis of survey data. A sample-based BIC criterion is further derived to tune the implementation of the proposed method. In a joint randomization framework, we prove that the PPLM consistently identifies the in- fluential variables through the BIC-based tuning. The proposed selection method is assessed through simulation studies and using data from the 2009 Survey on Living with Chronic Diseases in Canada. This chapter is organized as follows. In Section 3.2, we introduce the joint 57 randomization mechanism and the super-population model. In Section 3.3, we propose the PPLM and derive the sample-based BIC as a tuning strategy. In Sec- tion 3.4, we investigate the asymptotic behavior of the proposed method in a joint randomization framework. We use numerical studies in Section 3.5 to assess the performance of our approach and provide a concluding summary in Section 3.6. The proofs of the theorems are given in Appendix B, where a heuristic derivation of the proposed BIC can also be found. 3.2 Joint inference and super-population The random behavior of an inference procedure is mostly inherited from the ran- domness in the data. In the context of surveys, the set of sampled units is random because of the probabilistic sampling design. At the same time, the observed value of each sampled unit may also be regarded as a random outcome from some con- ceptual infinite super-population (Royall [1976]). In a design-based analysis, the finite population is regarded as nonrandom and all measurements of the sampled units are constants. The parameters of interest are finite population quantities such as the population total or the population median. The statistical inference is evaluated based on the randomness from the probability design. Nonparametric approaches are usually used for design-based inference. One may also regard the design-induced randomness as an artifact. The mea- surements of the sampled units are independent realizations of a random variable from a probability model for the postulated super-population. The parameters of interest are related to the assumed model and model-based inferences are evaluated solely based on the randomization introduced by the model. A third approach is called model-design-based inference; it incorporates ran- domization from both the design and the model. In such a joint randomization mechanism, the finite population is regarded as a random sample from a super- population. The survey sample is considered as a second-phase sampling from the super-population. The parameters of interest can be either model or finite- population parameters. In this mechanism, inferences on the finite-population pa- rameters are motivated by the super-population model. Model-design-based in- ference can be more efficient than pure design-based approaches when the finite 58 population is well described by the super-population model. Compared with pure model-based approaches, it protects against model violation and is therefore more robust in general (Binder and Roberts [2003], Kalton [1983]). We study the variable selection problem under the joint randomization mecha- nism. Let D = f1; : : : ; Ng be a finite population consisting of N sampling units. The measurements on the ith unit are denoted (yi; xi), where yi is the response of interest and xi = (xi1; : : : ; xip)T is a p-dimensional explanatory vector (co- variate vector). These are regarded as independent realizations of (Y;X) from a super-population. We postulate a generalized linear model (GLM) on the super- population as follows. Conditioning onX, the distribution of Y belongs to a natural exponential family, the density of which takes the form f(y; ) = c(y) expfy b()g: (3.1) is known as the natural parameter of f(y; ) such that b0() = E[Y jX] and b00() = Var[Y jX] 2, and c(y) is a normalization constant. The influence of the explanatory variable X on Y is expressed through g() = XT for some assumed linkage function g(:), where the vector = f1; : : : ; pgT is the p-dimensional regression coefficient. If g(:) is the canonical link, i.e., g() = , then we have = XT. For simplicity, we focus on the canonical link in this chapter. Based on this model, the effect of the explanatory variable is characterized through the size of the corresponding regression coefficient. In applications, a complex model with many variables often leads to overfitting and a poor interpre- tive value. Hence, it is desirable to fit the data with a parsimonious model in which many regression coefficients are estimated to be zero. Explanatory variables with nonzero coefficients are then considered to be influential on the response. To this end, we assume that is ideally sparse, and address the variable selection problem by identifying a sparse model formed by the covariates with nonzero coefficients. 59 3.3 Pseudo-likelihood-based variable selection 3.3.1 The penalized pseudo-likelihood method With the model settings described in Section 3.2, it is clear that, if the measurement (yi; xi) is observed for every unit inD, the randomness in the data introduced by the probability sampling design is completely gone. In this situation, the selection of the influential variables is based on the entire population and the PLMs developed in non-survey settings (purely model-based) remain valid for the model-design- based inference. As in a typical PLM, the model coefficient vector is estimated by through maximizing the penalized likelihood function QN () = lN ()N pX j=1 (jj j); (3.2) where lN () = PN i=1 log f(yi; x T i ) is the census log-likelihood function and (:) is a penalty function indexed by a tuning parameter . With an appropriate choice of (:) (Section 1.4.1), contains zero estimates for some coefficients and thus removes the corresponding covariates from the model. Note that (3.2) is only conceptual, because observing (yi; xi) for all units in D is usually not feasible in applications. Instead, a representative sample d = fi1; : : : ; ing f1; : : : ; Ng with n units is often drawn from D and the measure- ments are observed based on the sampled units. Due to the intrinsic dependence structure among the sampled units, a full likelihood on d is prohibitive to com- pute in general. Alternatively, for the model-design-based inference, a pseudo-log- likelihood function is frequently used, which takes the form ln() = X i2d wi log f(yi;) (3.3) with wi denoting the standardized survey weight for the ith unit ( P i2dwi = n). Typically, wi is chosen proportional to 1=P (i 2 d) such that n1ln() is design- unbiased to N1lN (). Maximizing ln() over leads to a maximum pseudo- 60 likelihood estimator (MPLE) ̂ for , i.e., ̂ = argmax ln(): (3.4) Under the appropriate sampling designs, ̂ is often n1=2 consistent for under the joint randomization framework. The idea of using pseudo-likelihood for inference on model parameters has been widely adopted in the literature (Binder [1983], Godambe and Thompson [1986], Molina and Skinner [1992]). Accordingly, we aim to develop an analog of PLM (3.2) based on the pseudo- likelihood. In particular, we propose the maximum penalized pseudo-likelihood estimator ̂ that maximizes Qn() = ln() n pX j=1 (jj j): (3.5) Compared with QN (:), the first term in Qn(:) is the survey-weighted pseudo- likelihood, which potentially helps to avoid sampling errors that might lead to bi- ased inferences for the target population. Meanwhile, the maximizer ̂ of Qn() inherits the sparsity property from its census-based version , which qualifies it as a variable selection operator. We refer to the selection based on ̂ as the PPLM and further investigate its asymptotic performance in the next section. 3.3.2 Asymptotic properties of PPLM To provide some theoretical insight into ̂, we now establish its asymptotic con- sistency under the joint randomization framework described in Section 3.2. Sup- pose there is a sequence of finite populations, say Dr with r ! 1. Each Dr is an i.i.d. sample of size Nr from a super-population modeled by (3.1) with random variable (Y , X = fX1; : : : ; Xpg). Within each Dr, a sample dr of size nr is drawn according to some sampling scheme. We assume that both Nr and nr increase to infinity as r ! 1, with the sampling fraction nr=Nr bounded by some con- stant C < 1. For simplicity of notation, we will drop the index r in the following discussion. Without loss of generality, we assume that the first q coefficients are nonzero 61 and denote the true value of by = f1;2g with 2 = 0. Also, we use s to denote the true model f1; : : : ; qg to be identified. We establish the consistency of ̂ under the regularity conditions specified as follows, where the first two are on the super-population and the third is on the sampling plan: C1 There exists 1 > 0 such that max 1jp E[jb00(X)X2j j1+] <1; for 2 f : jj jj 1g and for some > 0. C2 Let I() = E @2 log f(y;X) @@T : We assume that I() is continuous at and min[I( )] M1 for some constant M1 > 0, where min[A] denotes the smallest eigenvalue of matrix A. C3 The sampling scheme satisfies k 1 n X i2d wizi 1 N NX i=1 zik = Op(n 12 ) for the sequence fzig such thatN1 PN i=1 jzij2+ = O(1)with some > 0. Condition 2 requires that I() is continuous at , so that when is close enough to the minimal eigenvalue of I() is bounded away from zero. Condition 3 is quoted from Theorem 1 in Carrillo et al. [2010]; it requires that the weighted sample mean ẐHT = n1 P i2dwizi is root-n consistent to the population mean Z = 1N PN i=1 zi. This condition is implied if asymptotic normality holds for ẐHT , i.e., p nẐHT !d N( Z; 2), which has been widely established in the literature. For example, with moment condition N1 PN i=1 jzij2+ = O(1), Hájek [1960] 62 showed the asymptotic normality of ẐHT for simple random sampling without re- placement if n ! 1 and n=N ! 0. Bickel and Freedman [1984] established the asymptotic normality of ẐHT under a stratified sampling design, where samples are collected separately within strata that are prespecified in a finite population. Ohlsson [1989] further studied the behavior of ẐHT under a two-stage sampling framework, where primary sampling units (PSUs) are first selected from the finite population and secondary sampling units (SSUs) are collected based on the se- lected PSUs. Ohlsson (1989) showed that if asymptotic normality holds for ẐHT based on single-stage sampling on the PSUs, the asymptotic normality of ẐHT still holds under a corresponding two-stage sampling. For asymptotic studies of ẐHT under other popular sampling designs, we refer to Hájek [1964], Vı́s̆ek [1979], and Chen and Rao [2007]. For the asymptotic analysis, we associate with n and denote the correspond- ing sequence by n. We require the penalty function and n to have the following properties: D1 For any > 0, (jj) 0 for 6= 0 and (0) = 0. D2 Let 0(jj) = @(jj)=@jj. There exists a constant 2 such that 0(jj) 0 for jj 2 (0; 2) and all > 0. Also, 0(jj) is continuous at j for any j 2 f1; : : : ; qg. D3 Let ' = maxf0(j0j j) for 1 j qg. For any M2 > 0, there exists 3 such that 0n(jj) M2(n1=2 + 'n) for jj 2 (0; 3). With k:k denoting the Euclidean norm, we establish the consistency of ̂n via the following theorem. Theorem 3.1 Under conditions C1–C3, if 'n ! 0 as n!1, then there exists a local maximizer ̂n = (̂1n ; ̂2n) of the penalized pseudo-likelihood function (3.5) with n(jj) satisfying D1–D3 such that k̂n k = Op(n 1 2 + 'n) and Pf̂2n = 0g ! 1: 63 See Appendix B for the proof. As shown in Theorem 3.1, with an appropriate choice of (:), the maximum penalized pseudo-likelihood estimator is consistent in both parameter estimation and variable selection under the joint randomization framework. In particular, when (:) is chosen as the L penalty, i.e., n(jj) = njj with 2 (0; 1), consistency holds if n ! 0; when (:) is chosen as the SCAD penalty, consistency holds if n ! 0 and p nn ! 1. In addition, for a special class of penalty functions, we have the following corollary. Corollary 3.1 Suppose that, for any 6= 0, there existsM > 0 such that 0n(jj) = 0 when n > M . Then, under the conditions of Theorem 3.1, the maximizer ̂n = (̂1n ; ̂2n) satisfies P (̂1 = ̂1)! 1 and P (̂2n = 0)! 1 with ̂1 denoting the maximizer of ln() based on the true model s . See Appendix B for the proof. Corollary 3.1 implies that the PPLM is able to consistently identify the influential variables and estimate their coefficients as effi- ciently as the MPLE (3.4) based on the true model. This result echoes the notion of the “oracle property” in Fan and Li [2001], which is desirable for the PLM in non-survey situations (Section 1.4.2). 3.3.3 Tuning strategy via sample-based BIC As in standard PLMs, by varying the level of penalty (:) in (3.5), the PPLM suggests a series of models with differing sparsity. In applications, one needs to make a choice among these sparse models. Given the specific form of (:), this issue boils down to choosing an appropriate tuning parameter . In the non-survey context, various criteria have been proposed for tuning a PLM (Section 1.4.3). In particular, BIC (Schwarz [1978]) has been shown to be effective (Wang et al. [2007], Zhang et al. [2010]). In the same spirit, we derive a sample-based BIC for PPLM in analysis of survey data. Following the super-population formulation described in Section 3.2, we treat the sparse models (3.1) suggested by the PPLM as candidates for further selection. Specifically, for a specified form of penalty (:), let be the range of under 64 consideration. We denote by s a candidate model corresponding to ̂ for some 2 . Let s be the (s)-dimensional coefficient of model s and let s be the prior density of s . Then a pseudo-marginal density function of the data is given by Pn(yjs) = Z Ln(y;s)s(s)ds : Consequently, we may regard the following expression as the pseudo-posterior probability of the model s: Pn(sjy) = Pn(yjs)P (s)P s2S P (s)Pn(yjs) ; (3.6) where S = fs : 2 g is the collection of candidate models. In the spirit of Bayesian analysis, the model with the highest posterior Pn(sjy) is considered to be the one that receives the most support from the data. Since P s2S P (s)Pn(yjs) does not depend on any specific model, the highest Pn(sjy) is achieved by the model that maximizes the corresponding Pn(yjs)P (s). When the uniform prior P (s) = over S is used and under some regularity conditions, we obtain a Laplace approximation (Tierney et al. [1989], Kass and Raftery [1995]): 2 logfPn(yjs)g = 2ln(̂s) + (s) log n+Op(1) with ̂s denoting the MPLE (3.4) based on s. Hence, we choose such that the corresponding model s minimizes BICn(s) = 2ln(̂s) + (s) log n: (3.7) To provide further theoretical justification for the proposed BIC tuning strategy, we let s be an arbitrary model and define two sets of candidate models as follows: Overfitted models: S+ = fs : s s; s 6= sg; Underfitted models: S = fs : s 6 sg. Then, can be partitioned accordingly into + = f : s 2 S+g; = f : s 2 Sg; = f : s = sg: (3.8) 65 In Theorem 3.1, we have shown that P ( 6= ?) ! 1. Therefore, selection consistency of PPLM with a tuning parameter selected by BIC (3.7) is achieved if BIC is able to identify s from any model s with 2 + [ . We use the following theorem to establish this consistency result. Theorem 3.2 Under conditions C1–C3, Pf min 2 +[ BICn(s) BICn(s) g ! 0: See Appendix B for the proof. 3.4 Numerical studies To evaluate the finite sample performance of PPLM, extensive numerical studies have been conducted using data from the Survey on Living with Chronic Diseases in Canada (SLCDC; Canada [2009]). In particular, we compare the proposed se- lection method with standard non-survey PLMs for a couple of sampling plans. The benefits of using survey weights are further examined in the situation where the presumed model is misspecified from the model that generates the data. We also report the analysis of the original SLCDC 2009 data as an example of using PPLM in real applications. 3.4.1 SLCDC data SLCDC is a cross-sectional study sponsored by the Public Health Agency of Canada that collects information related to the experiences of Canadians with chronic health conditions. One of the main objectives of SLCDC is to identify health behavior that influences disease outcomes, so that the government can better plan and provide health services for people with chronic diseases. SLCDC takes place every two years, with two chronic diseases covered in each survey cycle. The 2009 survey focused on arthritis and hypertension. We restrict our attention to hypertension. The target population for the hypertension survey is Canadians aged twenty years or older from the ten provinces who have been diag- nosed with hypertension and who live in private dwellings. To facilitate the survey process, the sampling units of SLCDC 2009 are people with hypertension who 66 completed the 2008 Canadian community health survey (CCHS). For the purpose of SLCDC, the population is first stratified according to the CCHS respondents based on sex and four age groups: 20–44, 45–64, 65–75, and 75+. Therefore, the finite population formed by the CCHS respondents was divided into 8 categories, age (4 levels) by sex (2 levels). A stratified sampling plan is used for SLCDC with proportional sample size allocation. An overall sample of 9005 was selected from the 17437 CCHS respondents, and 6142 respondents completed the SLCDC survey. We identified 40 variables relevant to hypertension based on the original SLCDC data, of which 7 variables have complete information on all 6142 respondents. The remaining 33 variables have missing values due to non-responses in the original questionnaire (see Table 3.1 for a list of the variables and the corresponding non- response rates). There was no obvious systematic reason for the non-response. The variable with the most severe missingness is INCDRPR (household income) with a 9.6% non-response rate; the amount of missing data is relatively minor for the remaining variables. To facilitate the analysis, we used simple imputation methods for the missing data as follows. For a categorical variable, we imputed the non- response value by a random value from the response set. For a continuous vari- able, we imputed the non-response value by the mean value of the responses. Two exceptions to the above imputation are the variables BMHX 02 and CNHX 05. The former acts as the response variable of the regression model in the later data analysis, while the latter has natural restrictions on its range. We removed the 274 observations with missing values for these two variables, leading to basic working data with 5868 observations. The imputation/removal procedure does not have any effect on the evaluation of the BIC procedure based on the simulated population. It could bias the analysis of the real data. However, given the low rate of missingness and the plausibility of missing-at-random in this specific case, the conclusion is unlikely to be severely affected. Since the SLCDC is a follow-up to the CCHS, the sampling weights for SLCDC were initially obtained from the weights of the CCHS data. The weights were then adjusted to ensure that the SLCDC respondents represent the target population. Consequently, the adjusted weights show considerable variation between sampled units. The standardized values of the adjusted weights vary between 0.01 and 33.62 67 with an inter-quartile range of 0.76. Table 3.1: Variables for analysis of SLCDC data with non-response adjustments: A: allocate to other categories; D: delete from the data; M: impute by mean values; NA: no adjustment applied. Variable Description Levels Missing Adjust 1 BMHX 02 Blood pressure control status 2 1.6% D 2 GEO QB Provinces grouped by region - QC 2 - - NA 3 GEO ON Provinces grouped by region - ON 2 - - NA 4 GEO BC Provinces grouped by region - BC 2 - - NA 5 GEO PR Provinces grouped by region - PR 2 - - NA 6 DHHX AGE Age Cont. - - NA 7 DHHX SEX Sex 2 - - NA 8 GENXDHMH Perceived mental health 2 0.2% A 9 CNHX 05 High blood pressure - age when diagnosed Cont. 2.7% D 10 MEHX 02 No. of medications taken Cont. 0.3% M 11 MEHX 03 No. of times per day medications taken Cont. 0.1% M 12 MEHXGMED No. of medications for high blood pressure Cont. 2.0% M 13 MEHX 06 No. of times per day bp medication taken Cont. 1.0% M 14 MEHXDMCO Medication compliance - overall 2 0.2% A 15 HUHXDHP Consulted family doctor about hbp 2 0.1% A 16 SMHX 11A Smoked at any time since being diagnosed 2 0.1% A 17 SMHX 13A Drank alcohol since being diagnosed 2 0.2% A 18 SMHXDSLT Daily salt intake 2 0.2% A 19 SMHXDFDC Dietary foods 2 0.1% A 20 SMHXDPAC Exercise/physical activity 2 0.1% A 21 SMHXDBW Body weight control 2 0.2% A 22 MOHXDBPM Self-monitoring of blood pressure 2 0.3% A 23 MOHX 02 Correct use of bp measurement device 2 0.5% A 24 INHX 01A Info from family doctor 2 2.4% A 25 INHX 01F Info from family member/friend 2 2.4% A 26 INHX 02A Info from book, pamphlet, brochure 2 1.5% A 27 INHX 02C Info from package insert with medication 2 1.5% A 28 INHX 02G Info from media 2 1.5% A 29 INHX 02H Info from internet 2 1.5% A 30 INHX 04 Info received - emotional impact of hbp 2 0.8% A 31 INHX 06 Info received - correct use of medication 2 0.6% A 32 INHX 07 Info received - additional information 2 0.9% A 33 CPGFGAM Gambling activity 2 0.5% A 34 DHHDECF Household type 2 0.2% A 35 EDUDH04 Highest level of education in household 2 3.4% A 36 FVCGTOT Daily consumption - fruits and vegetables 2 5.2% A 37 GEODUR2 Urban and rural areas 2 - - NA 38 HWTDBMI Body mass index (BMI) self-report Cont. 2.1% M 39 INCDRPR Household income - provincial level 10 9.6% A 40 SACDTOT Total number hours - sedentary activities Cont. 1.5% M 68 3.4.2 Simulation settings We first design simulation studies based on the SLCDC data. Specifically, we treat the 40 identified variables as candidate covariates for some response variable Y , and index them as X1 to X40 for simplicity. We consider both continuous and binary responses in our simulations. For the continuous cases, we generate the values of Y according to Model 1: Y = 0:7X6 + 0:7X10 + 0:6X18 0:6X22 + , Model 2: Y = 0:7X6+0:6X10+0:6X18 0:5X22+0:3X30 0:3X34+ , with N(0; 1). For the binary cases where Y 2 f0; 1g, we generate the values of Y according to the logistic models Model 3: logit(PrfY = 1j Xg) = 0:7X7 0:6X8 + 0:5X26, Model 4: logit(PrfY = 1j Xg) = 0:8X7 0:7X8 + 0:6X26 0:5X28 + 0:4X36. The specified models include one of the strata identifiers in SLCDC, i.e.,X6 (Age) orX7 (Sex). Compared with models 1 & 3, models 2 & 4 have two more influential covariates with small coefficients, so it is more difficult to correctly identify them. The finite population used in the simulation was created as follows. The ba- sic working data of 5868 respondents was duplicated 10 times proportional to the rounded integer values of the SLCDC weights, resulting in a pseudo-finite popula- tion of size 55950 with complete information on X1; : : : ; X40. The values of the response Y were then generated based on models 1–4 respectively. We consider the variable selection problem to be the identification of the postulated model that generates the values of Y . We investigate the performance of the proposed procedure under two stratified sampling plans. Specifically, we create four strata based on variables X6 (age, 55-/55+) and X7 (sex, Male/Female), which leads to the group (Female, 55-) of size 7120, the group (Female, 55+) of size 19199, the group (Male, 55-) of size 6187, and the group (Male, 55+) of size 23458. In the first plan, a simple ran- dom sampling without replacement (SRSWR) with equally allocated sample sizes is drawn from each stratum. The inference is made based on the four SRSWRs 69 pooled together. In the second plan, we further construct three subgroups within each stratum based on the sum of two binary covariates of the postulated mod- els. The subgroups are based on X18 +X22 for the data from models 1–2 and on X8 + X26 for the data from models 3–4. We then make inference based on the SRSWRs drawn from each subgroup of the four strata. The overall sample size is equally allocated at the stratum level with a 2:1:2 proportion for the three sub- groups within the same stratum. A simple Monte Carlo computation reveals that the sample correlation between X18 and X22 (for the data from models 1–2) can be as high as 0.5, whereas their population-based correlation is around 0.02. A similar phenomenon is observed between X8 and X26 (for the data from models 3–4). We therefore expect variable selection under the second sampling plan to be more challenging because of this systematic inflation. In the simulations, we set the overall sample size n = 500 for models 1–2 and n = 1500 for models 3–4. The PPLM was then carried out on probability samples obtained from the finite population. In particular, we chose the SCAD penalty for the penalized pseudo-likelihood function (3.5), as advocated by Fan and Li [2001]. The cor- responding maximization of (3.5) was solved using the thresholding-based iter- ative algorithm (She [2011]) and the tuning parameter was determined by the sample-based BIC (3.7). For comparison purposes, AIC (Akaike [1973]) and GCV (Craven and Wahba [1979]) were used as alternatives to the BIC tuning strategy. Based on the discussion in Section 3.3.3, we define the sample-based AIC and GCV as AICn(s) = 2ln(̂s) + 2(s); GCVn(s) = 1 n ln(̂s) (1 (s)=n)2 ; where was selected similarly by minimizing the corresponding scores. Moreover, for each setup, we repeated the selection procedure with all survey weights ignored (set to unity). The unweighted selection results correspond to pure model-based inferences as discussed in Section 3.2. In particular, the PPLM reduces to the standard PLM (3.2) used for non-survey situations. 70 3.4.3 Simulation results In Tables 3.2–3.3, we summarize the simulation results based on 1000 repetitions in terms of the positive selection rate (PSR), false discovery rate (FDR), correct selection rate (CSR), and averaged model size (AMS). Specifically, let s0 be the true model that generates the finite population and s0j be the selected model based on the jth sample, j = 1; : : : ; 1000. The PSR, FDR, CSR, and AMS are estimated as PSR = P1000 j=1 (s \ s0j) 1000(s) ; FDR = P1000 j=1 (s 0 j=s ) 1000(s0j) ; CSR = P1000 j=1 I(s 0 j = s ) 1000 ; AMS = P1000 j=1 (s 0 j) 1000 ; where (s) denotes the size of model s and I(:) is the indicator function. In ad- dition, we assess the predictive accuracy of the selected model as follows. For each setup, a test sample of size 200 is generated by SRSWR from the same finite population as that for the training sample. For models 1–2, we use the averaged residual sum of squares (RSS) on the test data as a measurement of the predictive ability of the selected model. For models 3–4, we compute both positive and neg- ative prediction rates. To be specific, let be a specified benchmark and ̂i be the estimated success probability of the ith test sample, i = 1; : : : ; 200. We then predict the ith response yi by ŷi = 1 if ̂i > and ŷi = 0 otherwise. The correct prediction rates are estimated by PPR = P i2fi:yi=1g I(ŷi = 1)P200 i=1 I(yi = 1) ; NPR = P i2fi:yi=0g I(ŷi = 0)P200 i=1 I(yi = 0) : The final PPR and NPR are averaged based on 1000 replications. Note that PPR and NPR are similar to sensitivity and specificity in the clinical studies, which in- dicate the ability of a 0-1 prediction approach in terms of correct positive and neg- ative predictions. In general, a larger leads to high NPR but low PPR. The value of should be cautiously specified in applications. In our simulation studies, we set = 0:5 for simplicity. The results are encouraging for the PPLM and the sample-based BIC tuning 71 Table 3.2: Selection results for the first sampling plan: Prediction assess- ments for models 1–2 are based on the testing RSS, while for models 3–4 they are based on (PPR, NPR) with a benchmark 0.5. Weights Criterion PSR FDR CSR AMS Prediction Model 1 Ignored GCV .96 .19 .28 4.9 1.04 AIC .99 .48 .05 8.7 1.08 BIC .96 .19 .28 4.9 1.04 Included GCV .95 .24 .19 5.2 1.05 AIC .99 .61 .01 11.4 1.11 BIC .95 .24 .20 5.3 1.05 Model 2 Ignored GCV .72 .19 .02 5.5 1.07 AIC .89 .44 .01 10.3 1.09 BIC .73 .19 .03 5.6 1.07 Included GCV .74 .24 .02 6.1 1.08 AIC .89 .54 .01 12.5 1.12 BIC .74 .24 .03 6.1 1.08 Model 3 Ignored GCV .99 .59 .00 7.8 (.71, .45) AIC .99 .62 .00 8.4 (.69, .49) BIC .96 .43 .00 5.1 (.72, .44) Included GCV .99 .67 .00 9.9 (.71, .47) AIC .99 .70 .00 10.7 (.68, .48) BIC .94 .45 .00 5.3 (.71, .45) Model 4 Ignored GCV .97 .44 .01 9.4 (.66, .55) AIC .98 .47 .01 9.8 (.65, .56) BIC .87 .26 .07 6.0 (.69, .53) Included GCV .98 .54 .01 11.4 (.66, .54) AIC .98 .56 .00 11.9 (.66, .55) BIC .86 .30 .05 6.2 (.68, .53) method. From Tables 3.2–3.3, we observe that the models selected by AIC have both high PSR and FDR, which indicates an excessive inclusion of the irrelevant variables. In comparison, BIC significantly reduces the FDR of the selected models with a slight sacrifice in the PSR, and selects models with more accurate sizes. Al- though GCV behaves similarly to BIC in linear model settings, it is similar to AIC 72 Table 3.3: Selection results for the second sampling plan: Prediction assessments for models 1–2 are based on the testing RSS, while for models 3–4 they are based on (PPR, NPR) with a benchmark 0.5. Weights Criterion PSR FDR CSR AMS Prediction Model 1 Ignored GCV .83 .23 .17 4.6 1.09 AIC .97 .49 .04 8.6 1.10 BIC .83 .23 .17 4.6 1.09 Included GCV .95 .31 .13 5.9 1.07 AIC .99 .65 .00 12.5 1.12 BIC .95 .30 .14 5.9 1.07 Model 2 Ignored GCV .62 .22 .02 5.0 1.13 AIC .88 .45 .01 10.3 1.14 BIC .62 .22 .02 5.1 1.12 Included GCV .72 .28 .01 6.5 1.10 AIC .89 .59 .00 13.7 1.12 BIC .72 .27 .01 6.5 1.10 Model 3 Ignored GCV .87 .62 .00 7.3 (.66, .44) AIC .88 .63 .00 7.6 (.65, .45) BIC .65 .62 .00 4.5 (.68, .42) Included GCV .97 .74 .00 11.9 (.70, .46) AIC .97 .75 .00 12.4 (.68, .46) BIC .89 .50 .00 5.6 (.70, .44) Model 4 Ignored GCV .94 .48 .00 9.5 (.62, .51) AIC .95 .50 .00 10.0 (.62, .52) BIC .72 .41 .00 6.1 (.64, .49) Included GCV .93 .61 .00 12.5 (.64, .53) AIC .94 .62 .00 12.9 (.64, .53) BIC .82 .34 .01 6.4 (.67, .54) for the logistic models where less information is provided by the binary responses. In the first sampling plan, the inclusion probabilities are related to Y through a single covariate in the model (i.e., X6 or X7). The sample correlation structure between the response and covariates is largely maintained from the finite popu- lation. Consequently, no substantial difference is observed between the weighted 73 Table 3.4: Selection frequency of influential variables in model mis-specified case Weights Criterion X18 X20 X38 AMS Testing RSS n=500 Ignored GCV .78 .95 .56 5.9 1.93 AIC .95 .99 .73 12.5 1.95 BIC .83 .97 .60 6.6 1.93 Included GCV .73 .92 .84 6.3 1.77 AIC .91 .99 .85 12.5 1.79 BIC .78 .94 .83 6.9 1.77 n=1000 Ignored GCV .96 1.00 .79 7.6 1.87 AIC .99 1.00 .87 13.1 1.88 BIC .97 1.00 .80 7.9 1.87 Included GCV .93 1.00 .94 7.6 1.71 AIC .98 1.00 .96 13.0 1.72 BIC .94 1.00 .94 7.7 1.71 and unweighted selection procedures from Table 3.2. The benefits of using sampling weights in variable selection are tentatively revealed by the second sampling plan, where the sample correlation structure is systemically distorted. Clearly, the spurious correlation between covariates in the sampled units deteriorates the efficiency of selection methods. This is reflected by the reduced PSRs and the inflated FDRs for the unweighted procedures. Incorpo- rating sampling weights in the selection process helps to correct the biased result. In particular, noticeable improvements have been observed for the PPLM. In the most impressive case (i.e., model 3 of Table 3.3), the PPLM with BIC substan- tially improves the standard PLM procedure by increasing the PSR from .65 to .89, while reducing the corresponding FDR from .62 to .50. Our observation echoes the rationale for weighting to remove bias due to informative sampling Fuller [2009]. Sampling weights also provide protection against model mis-specification (Pfeffermann and Holmes [1985], Kott [1991]): inferences based on weighted estimates may remain valid for the surveyed population, even when the model fails. To gain further insight into weighting in variable selection, we compare the PPLM with the standard PLM in 74 a simulation where the presumed model is misspecified from the model that gen- erates the data. In this situation, a postulated “true” model does not exist, and the goal of variable selection is to find an optimal model that well describes the finite population. We still make use of the stratified pseudo-finite population in Section 3.4.2 but generate the response variable Y according to the strata. Specifically, the values of Y for units in strata (Male, 55+) and (Female, 55+) were generated by Y = 0:6X6 + 0:4X18 + 0:4X20 + 0:6X38 + ; while the values Y for units in the strata (Male, 55-) and (Female, 55-) were gen- erated by Y = 0:6X6 + 0:4X18 + 0:4X20 + with N(0; 1) denoting a random error. In other words, we assume that variable X38 is influential for people aged 55 and older but not for those younger than 55. In addition, we further violate the presumed model 3.1 by excluding X6 from the set of candidate covariates, which mimics the situation where one important design feature is omitted in the modeling. A stratified SRSWR of size 500 or 1000 is drawn using the first sampling plan in Section 5.2. The weighted and unweighted procedures are then tested for the variable selection based on the sampled units. We summarize the simulation results in Table 3.4 by estimating the selection rates of X18, X20, and X38 based on 1000 replications. Similarly to the previous simulations, the averaged model size (AMS) and the testing RSS of the selected models (i.e., the averaged RSS based on testing data of size 200) are also included in the summary. From Table 3.4, we see that when the model assumption is vi- olated, the PPLM still achieves relatively high prediction accuracy by suggesting relevant variables with high probability. In contrast, ignoring the survey weights leads to a relative loss of nearly 9% on the testing RSS because of the exclusion of X38. Apparently, increasing the sample size helps to improve the goodness of fit for the misspecified models, but the cost is the inclusion of more variables. 75 3.4.4 Analysis of SLCDC data To illustrate the application of PPLM, we use it to identify health behaviors that affect the control of blood pressure using SLCDC 2009. The response variable is BMHX 02 from the working data obtained from SLCDC, which has two levels indicating whether or not the blood pressure of the respondent is under control, based on the latest measurement by a health professional. We treat the remaining 39 variables in the working data as candidate covariates, and our goal is to identify the influential covariates that are associated with blood-pressure control. We build a logistic regression of BMHX 02 on the candidate covariates and use PPLM with the SCAD penalty to select the influential covariates. As a preliminary step, each covariate is standardized such that the corresponding first and second weighted sample moments are zero and unity respectively. For comparison, the BIC (3.7) and AIC/GCV are used for the tuning parameter selection. In Figure 3.1, we plot the criterion scores with respect to the model sparsity. We see that BIC selects a model with 11 covariates, while GCV and AIC select the same model, which has 24 covariates. When survey weights are ignored in the selection procedure, models with 7 or 21 covariates are suggested based on PLM with the standard BIC or GCV/AIC. The difference between the weighted and unweighted selection results reflects the potential distortion in the correlation structure of the sampled units. This difference may also be explained by the model mis-specification for part of the SLCDC population (Lohr and Liu [1994]). Given the potential bias of unweighted methods, the weighted results are more plausible. We further assess the selected models in terms of their predictive accuracy as follows. First, we draw 500 independent sets of 5868 bootstrap samples (with replacement) from the working data of SLCDC. For the tth bootstrap sample dt, t = 1; : : : ; 500, the survey weight wi for the ith unit is adjusted by ~wti = vtiwi with vti denoting the number of times that the ith unit is selected in dt. We then fit the selected models to each bootstrap sample (with the weights accounted for accordingly), and evaluate their weighted positive and negative prediction rates (WPPR, WNPR) by WPPR = P i62dt wiI(ŷi = 1; yi = 1)P i62dt wiI(yi = 1) ; WNPR = P i62dt wiI(ŷi = 0; yi = 0)P i62dt wiI(yi = 0) ; 76 0 10 20 30 40 0. 57 0 0. 57 5 0. 58 0 0. 58 5 0. 59 0 GCV model size 0 10 20 30 40 67 00 68 00 69 00 70 00 AIC model size 0 10 20 30 40 67 00 68 00 69 00 70 00 BIC model size Figure 3.1: Selection criteria values based on candidate models where yi and ŷi denote the ith response in BMHX 02 and its predicted value. We summarize the averaged WPPR and WNPR based on 500 bootstrap samples in Table 3.5 according to three different benchmark values (i.e., 0.25, 0.35, 0.45). From Table 3.5, we observe that the models selected from the unweighted anal- ysis have a lower WPPR in general, which provides additional support for using survey weights in the selection procedure. Compared with GCV/AIC, BIC selects a model with a slightly conservative WPPR but a higher WNPR. However, the dif- ference is not significant. The size of the BIC-selected model is much less than that of the model selected by GCV/AIC, which provides an easier interpretation of the response BMHX 02 and the covariates. To assess the stability of the selection, we repeat the PPLM based on 500 boot- 77 Table 3.5: Prediction accuracy for selected models (WPPR,WNPR) based on different benchmarks. Weights Criteria > :25 > :35 > :45 Ignored AIC/GCV (.646, .525) (.460, .688) (.299, .811) BIC (.649, .513) (.445, .705) (.265, .818) Included AIC/GCV (.645, .523) (.488, .682) (.338, .790) BIC (.654, .532) (.485, .706) (.322, .830) Table 3.6: Bootstrap selection results for significant variables: (Estimated co- efficient, Standard error, Selection rate). Variable GCV AIC BIC GEO ON ( .14, .09, .86) ( .16, .09, .92) (.09, .09, .58) DHHX AGE (-.29, .09, 1.0) (-.32, .09, 1.0) (-.27, .08, 1.0) GENXDHMH (-.15, .05, .99) (-.15, .05, .99) (-.14, .06, .92) SMHXDSLT ( .11, .07, .76) ( .12, .07, .84) ( .08, .09, .47) MOHXDBPM (-.08, .07, .67) (-.09, .06, .81) (-.05, .07, .35) INHX 06 ( .18, .06, .97) ( .18, .06, .99) ( .18, .07, .91) HWTDBMI ( .14, .06, .95) ( .14, .06, .97) ( .13, .06, .91) Ave. Model Size 23.1 27.8 10.3 strap samples. In Table 3.6, we list the bootstrap selection rate for the seven most significant covariates according to their MLEs in the original SLCDC working data. The corresponding coefficient estimates and standard errors are also included based on the bootstrap samples. From Table 3.6, we find that only four significant covariates (i.e., DHHX AGE, GENXDHMH, INHX 06, and HWTDBMI) are con- sistently selected by BIC, while GCV/AIC tends to select more unreliable covari- ates. The selection results based on BIC suggest a strong association of blood pres- sure control with age, weight, mental health, and medication information; this is consistent with the hypertension literature (Gelber et al. [2007], Yan et al. [2003]). 3.5 Summary and conclusions In this chapter, we have addressed variable selection in the analysis of complex sur- veys. When units are selected by disproportionate sampling, the data correlation structure in the sample can be distorted. Incorporating sampling weights in the se- 78 lection process protects against biased selection results. In this spirit, we proposed a survey-weighted regularization method based on the pseudo-likelihood (PPLM) and derived a sample-based BIC for its implementation. Under some regularity conditions, we showed that the PPLM consistently identifies the influential vari- ables under a joint randomization framework. The performance of the proposed method was confirmed by numerical studies. 79 Chapter 4 A Thresholding Algorithm for PLM in Finite Mixture Models 4.1 Introduction In the previous chapters, we have addressed the feature selection problem for ultra- high-dimensional data and complex survey data. New approaches have been de- veloped in the framework of penalized likelihood methods (PLMs), and their ef- fectiveness has been illustrated via both theoretical and numerical studies. In this chapter, we continue our investigation of PLMs to address feature selection for finite mixture models in the analysis of heterogeneous data. Finite mixture models play important roles in statistical learning and data min- ing. They have important applications in scientific disciplines such as genetics, marketing, and medical research (Shoukri and MacLachlan [1994], Böhning [2000], Brijs et al. [2004]). They are routinely used to detect the presence of subpopula- tions in an overall population. Determining the number of components (order se- lection) is a fundamental problem in these applications. For instance, the order of a mixture model in a genetic application is indicative of some fundamental gene structures. In financial applications, the order reflects the complexity needed to appropriately model the real world and thereby guides the relevant practices. A mixture model with an excessive number of components usually overfits the data and leads to poor interpretive value. Many researchers have investigated the strate- 80 gies that determine the appropriate order of a finite mixture model given a random sample from the target population. Order selection is invariably a trade-off between model complexity and goodness- of-fit. The classical Akaike information criterion (AIC; Akaike [1973]) and the Bayesian information criterion (BIC; Schwarz [1978]) are often employed in the context of finite mixture models. These methods discount the likelihood-based measure of the goodness-of-fit by penalties directly proportional to the model order. Although BIC has been shown to be consistent (Leroux [1992], Keribin [2000]), its optimality under regular models does not extend to nonregular finite mixture models for order selection under general conditions. Many new procedures have been discussed in the literature, such as distance-measure-based approaches (Chen and Kalbfleisch [1996], Woo and Sriram [2006]), hypothesis-testing-based approaches (Ghosh and Sen [1985], MacLachlan [1987], Chen and Chen [2001], Li and Chen [2010]) and Bayesian approaches (Richardson and Green [1997], Ishwaran et al. [2001], Berkhof et al. [2003]). In this chapter, we study the penalized likelihood method (PLM) introduced by Chen and Khalili [2008]. Unlike other approaches, this method introduces penalties based on two types of overfitting. In particular, it introduces a nonsmooth penalty term to merge close subpopulations in a finite mixture model. The shrinkage principle of regularized regression, e.g., LASSO (Tibshirani [1996]) and SCAD (Fan and Li [2001]), is seamlessly employed for the order selection. With an appropriate choice of the penalty functions, the method is consistent in both identifying the order of the mixture model and estimating the mixing distribution. It has the advantage that the order selection and parameter estimation are completed in one strike. Similarly to PLMs in the context of regression, the numerical problem of Chen and Khalili [2008] does not have straightforward solutions because the penalty functions can be nonsmooth and nonconvex. In the case of SCAD for regression models, Fan and Li [2001] propose locally approximating the nonconvex penalty by a quadratic function. With the aid of local quadratic approximation (LQA), they solve a series of convex optimization problems over smooth objective functions. Chen and Khalili [2008] adopt this strategy and find that the method is also suit- able for order selection. However, it does not directly give sparse solutions, as re- quired for variable or order selection. Recently, substantial progress has been made 81 on optimization problems related to regularization methods (Efron et al. [2004], Zou and Li [2008], Friedman et al. [2007]). In particular, thresholding-based al- gorithms (Daubechies et al. [2004], She [2009]) provide a superior solution to the order selection problem. In this chapter, we aim to provide an effective computational approach for the regularization-based order selection method in finite mixture models. The im- plementation of such methods can be embedded in a typical EM framework, and the challenge arises within the M-step, where the objective function is multivari- ate, nonsmooth, and nonconcave. To overcome this difficulty, we first transform the multivariate optimization problem into a set of univariate optimizations. A thresholding-based algorithm is then used for the nonsmooth and nonconcave but univariate objective functions. We refer to this new computational strategy as the iterative thresholding-based descent algorithm (ITD). Within the EM-framework, the ITD efficiently leads to a sparse estimate of the mixing distribution. It hence attains the goal of the order selection of Chen and Khalili [2008]. We establish the convergence of ITD and demonstrate its performance through simulations and real-data examples. The rest of this chapter is organized as follows. In Section 4.2, we review the regularization approach for order selection and discuss implementation strategies. We introduce the ITD in Section 4.3 and establish its convergence. We discuss ITD tuning issues in Section 4.4. In Section 4.5, we illustrate the ITD via simulations and examples, and concluding remarks are given in Section 4.6. The proofs of the theorems are given in Appendix C. 4.2 Order selection via regularization 4.2.1 Mixture model and penalized likelihood A finite mixture model of orderK is a concave combination ofK standard proba- bility density functions. We focus on the situation where the component distribu- tion is from an exponential family with a possible dispersion parameter: g(y; ; ) = exp[fy b()g=a() + c(y; )] (4.1) 82 with respect to some -finite measure and specific functions a(), b(), and c(). Let and be the parameter spaces for and . We consider the case R and R+, which includes normal, Poisson, binomial, and many other widely used distribution families. The density function of a finite mixture model with orderK is f(y;;; ) = KX k=1 kg(y; k; ); (4.2) where g(y; k; ) specifies the kth component density function with component parameter k and shared structure parameter . We use = (1; : : : ; K) for the mixing proportions, with PK k=1 k = 1, and for the vector of component parameters. The corresponding mixing distribution on assigns probability k to value k. We assume that k 6= 0 and j 6= k for all j 6= k. For simplicity of notation, we denote = (;; ). Given a set of i.i.d. observations Y = (y1; : : : ; yn) from (4.2), the log-likelihood of the mixing distribution with orderK is given by ln( ) = nX i=1 log f(yi;;; ): (4.3) Maximizing ln( ) over leads to a nonparametric maximum likelihood estima- tor (MLE) with a finite order (Lindsay [1983], Lesperance and Kalbfleisch [1992]). However, such an MLE provides little information on the actual order of the mix- ture model. It may overfit by assigning a negligible proportion to an arbitrary sub- population (Type 1) and/or by including several nearly identical subpopulations in the model (Type 2). Chen and Khalili [2008] introduce penalties to prevent both types of overfit- ting. In particular, they introduce a regularization penalty to merge close subpop- ulations to reduce the model complexity. To be specific, for any prespecified large K, we denote the component parameters 1 2 K . Let k = k+1k for k = 1; : : : ;K 1. Chen and Khalili [2008] propose a regularization method 83 that maximizes the penalized log-likelihood function ~ln( ) = ln( ) + CK KX k=1 log k K1X k=1 p(jkj); (4.4) where CK > 0 is a scale constant and p() is a nonsmooth penalty function with a spike at 0. As in regularized regression analysis, when p() spikes at 0, the penalized log-likelihood ~ln( ) has a positive probability of attaining its maximum with some k = 0. Hence, a built-in order selection procedure is obtained by discouraging Type 2 overfitting. By tuning the level of the penalty in p(:), we arrive at a finite mixture model with a suitable order together with a maximum penalized likelihood estimator (MPLE). The first penalty in (4.4) prevents Type-1 overfitting by forcing the mixing proportions away from zero [Chen and Kalbfleisch, 1996]. Consequently, the re- sulting model has k clustered around the parameters of the true subpopulations. This leads to some small values of k to be squeezed by the second penalty p(). A finite mixture model with a lower order is hence attained. 4.2.2 The EM algorithm The implementation of the maximization of ~ln( ) can be embedded in a typical EM framework (Dempster et al. [1977]). Let zik for i = 1; : : : ; n, k = 1; : : : ;K be a 0-1 variable showing the compo- nent membership of the ith observation. If all the zik are observed together with yi, the log-likelihood of the complete data is given by lcn( ) = nX i=1 KX k=1 zik [log k + logfg(yi; k; )g] : The penalized log-likelihood of the complete data takes the form ~lcn( ) = l c n( ) + CK KX k=1 log k K1X k=1 p(jkj): With an initial value of (0), the EM algorithm maximizes ~ln( ) through the 84 following steps based on ~lcn( ): E-step: Let (t) be the estimate of the parameters after t iterations. Compute Q( ; (t)), the conditional expectation of ~lcn( ) given Y with (t) as the true value of the model parameters: Q( ; (t)) = nX i=1 KX k=1 w (t) ik logfg(yi; k; )g + nX i=1 KX k=1 fw(t)ik + CK n g log k K1X k=1 p(jkj); where w (t) ik = E(zikjY ) = (t) k g(yi; (t) k ; (t))PK l=1 (t) l g(yi; (t) l ; (t)) : M-step: The M-step maximizes Q( ; (t)) over . We find (t+1) k = Pn i=1w (t) ik + CK n+KCK and (t+1) = argmax ( nX i=1 KX k=1 w (t) ik logfg(yi; (t)k ; )g ) ; (4.5) (t+1) = argmax ( nX i=1 KX k=1 w (t) ik logfg(yi; k; (t))g KX k=1 p(jkj) ) :(4.6) The above EM procedure leads to a local maximizer of ~ln( ) by iteratively repeat- ing the E-step and M-step. See Wu [1983] and MacLachlan and Krishnan [2008] for the convergence of EM-based computational procedures. The EM-algorithm appears to have completely addressed the numerical issue, but it contains a weakness (4.6). The optimization problem (4.6) is similar to the numerical problem in PLMs for regression analysis. Naturally, Chen and Khalili [2008] employ the local quadratic approximation (LQA) method, which is de- 85 signed primarily for SCAD. However, LQA does not directly provide a sparse solution for the variable selection or a finite mixture model with an appropriate order. An additional step is thus required to manually set near-zero values to exact zero. The development of an effective and direct numerical algorithm is the topic of this chapter. 4.3 Iterative thresholding descent procedure As reviewed in Section 1.4.4, substantial progress has been made in solving the optimization problems related to PLMs. The stagewise least angular regression (LARS; Efron et al. [2004]) is the first breakthrough for LASSO. Zou and Li [2008] suggest a weighted L1 approximation for SCAD, such that the corresponding non- concave maximization can be carried out by the efficient algorithms designed for LASSO.Most recently, the coordinate-descent-based methods (Friedman et al. [2007]) have dramatically sped up the computation for LASSO, and they work even when the number of variables far exceeds the sample size in the regression analysis. Although these methods are not appropriate for our numerical problem, they are a source of inspiration. In particular, we find that the thresholding-based method (Daubechies et al. [2004], She [2009]) can be engineered to address order selection in finite mixture models. This algorithm has two important steps. The first step is to translate the multivariate objective function into an equivalent aux- iliary function that can be decomposed into a sum of several univariate functions. This step dramatically simplifies the problem and reduces the computation. The second step is a thresholding procedure that iteratively performs simple threshold- ing operations on these univariate functions. This is useful particularly when the objective function is nonsmooth and nonconcave. When the procedure converges, a sparse solution is obtained and the goal of order selection is achieved. We now return to the core issue (4.6), which can be rewritten as min ( Q() = KX k=1 'k(k) + a() K1X k=1 p(jk+1 kj) ) ; (4.7) 86 with 'k(k) = nX i=1 wikfyik b(k)]: It can easily be seen that Q() is a multivariate function containing a nonsmooth and usually nonconvex penalty function p(). Our first step toward an effective algorithm is to create an auxiliary function that is a sum of univariate functions. Let 0 = 1, k = k+1 k, so that k = Pk1 j=0 j for k = 1; : : : ;K. After this parameter transformation, the objective function in (4.7) becomes Q() = KX k=1 'k 0@k1X j=0 j 1A+ a()K1X j=1 p(j): (4.8) Next, setting k = Pk1 j=0 j for k = 1; : : : ;K, we introduce an auxiliary function in = (0; : : : ; K1), G(;) = uQ() + 1 2 K1X j=0 (j j)2 u KX k=1 nX i=1 wikfb(k) b(k) b0(k)(k k)g (4.9) for some positive scale u. The auxiliary function has two crucial properties. First, we notice that Q() = u1G(;). This property links a stationary point defined through G to a local minimum of Q. Second, because of the specific form of Q(), it can be seen that G(;) is additive in the components of . This property makes it simple to maximize G(;) with respect to for any given through a thresholding procedure. We now make the first point. Let (0) be an initial vector of . For m = 1; 2; : : :, define (m) = argmin G(;(m1)): (4.10) The additivity of G(;) with respect to decomposes the above operation into a 87 set of univariate optimization problems:8<:min0 0 h 0 + u PK k=1 Pn i=1wik[yi b0(k)] i2 ; minj 1 2 j h j + u PK k=j+1 Pn i=1wik[yi b0(k)] i2 + ua()p(jj j) (4.11) for j = 1; : : : ;K 1. Clearly, these optimization problems have a unified form min q( ) = ( z)2 + p(j j) (4.12) with = 0 for the case j = 0, and = ua() otherwise. A threshold technique will be used to overcome the difficulty caused by the nonsmooth and nonconvex penalty function p(). Recall that can be linearly transformed to by = , where is a lower triangular matrix with the elements below the diagonal equal to one. The following theorem shows that the iteration defined in (4.10) reduces the value of Q. Theorem 4.1 Following the settings and notation introduced earlier, assume that b(:) in (4.1) is twice continuously differentiable over . Let (m) be the sequence defined by (4.10) and (m) = (m). Denote by 1 the maximum eigenvalue of T, and let (m) 2 = max k sup 0<<1 b00((m+1)k + (1 )(m)k ): If 0 < u < [n1 (m) 2 ] 1, then Q((m+1)) Q((m)); and equality holds only when (m+1) = (m). See Appendix C for the proof. If Q() is bounded from below, then Q((m)) is a monotone decreasing sequence that must therefore converge to a limit. If in addition, f : Q() < Q((0))g is compact, then f(m)g must have a convergent subsequence. Many GLMs lead to a regularized likelihood function Q satisfying these two conditions. Thus, this is likely to provide an optimization solution. We 88 are interested further in whether f(m)g itself converges and whether the limit of Q((m)) is a local minimum of Q(). The following corollary confirms the first point. Corollary 4.1 Assume the conditions of Theorem 4.1 and that f : Q() Q((0))g is compact. Let 2 = supm( (m)2 ). When u 2 (0; [n12 ]1), the se- quence f(m)g is asymptotically regular. That is, asm!1, k(m+1) (m)k ! 0: See Appendix C for the proof. Corollary 4.1 not only serves as a necessary step toward a complete proof of the convergence of f(m)g, but also provides a straight- forward stopping criterion for procedure (4.10). That is, we can terminate if k(m+1) (m)k < " for some prespecified tolerance level ". We set " = 105 in our imple- mentation. Using Corollary 4.1, we now prove the convergence of the ITD procedure via the following theorem. Theorem 4.2 Under the conditions of Corollary 4.1, and if the number of sta- tionary points of the objective function Q() is finite, then the sequence f(m)g converges to a fixed point of (4.10) that is also a stationary point of Q(). See Appendix C for the proof. Theorems 4.1 and 4.2 are more general than the specific iteration scheme (4.10). Two assumptions, that f : Q() Q((0))g is compact and that the number of stationary points of Q() is finite, are trivially satisfied for exponential families. However, unlessQ() is a convex function, there is no guarantee that (m) converges to a global minimum. Multiple initial values are often used in the hope that the global minimum will be found. The above results are useful only if we carry out (4.10) effectively. This task is made simpler because G(;) is additive in the elements of . We need only to solve the standard optimization problem (4.12). There is an intrinsic link between the solution to (4.12) and the thresholding rule in wavelet applications (Antoniadis [2007]). Let z+ be the positive part and sgn(z) be the sign of z for any real number z. With = 1, when the L1 penalty 89 p( ) = j j is used, the solution of (4.12) is given by the soft thresholding rule as = (jzj)+ sgn(z); and when the L0 penalty p( ) = 2=2I( 6= 0) is used, the solution of (4.12) is given by the hard thresholding rule as = z I(jzj > ). Explicit solutions of (4.12) for other commonly used penalty functions are avail- able in the literature (Antoniadis [2007], Antoniadis and Fan [2001]). Since the SCAD penalty was advocated by Chen and Khalili [2008] for their regularization method (namely, MSCAD), we give the corresponding explicit solution as follows. The SCAD penalty is defined as a symmetric function with p(0) = 0 and for > 0, p 0 ( ) = I( ) + ( )+ ( 1) I( > ) (4.13) for some constant > 2 and a tuning parameter . The exact size of does not have a noticeable effect on the performance of the order selection, while the level of needs to be further specified in applications (see Section 4.2 for a discussion). Proposition 4.1 When 0 < 1, the solution to (4.12) is given by = 8>>><>>>: (jzj )+ sgn(z); when jzj < (+ 1) (1)zsgn(z) 1 ; when (+ 1) jzj < z; when jzj : When 1 < , the solution to (4.12) is given by = 8<:(jzj )+ sgn(z); when jzj < ;z; when jzj : When < , the solution to (4.12) is given by = zI(jzj ): See Appendix C for the proof. The expressions for the cases < and 1 < can be combined; we did not do this to simplify the proof. When < 1, the optimal solution of (4.12) is a continuous function of z, but this is not true for > 1. Thus, SCAD may not be the most appropriate choice for the order selection of mixture models. At the same time, because = ua() is usually very small as in (4.11), the adverse effect is likely minor in applications. We focus on 90 the algorithm and leave this issue for future research. Proposition 4.1 leads to a thresholding operation for (4.12). Taking SCAD for example, shrinkage occurs whenever the generic value of jzj is lower than some threshold level, say . This feature leads to the sparse . Suppose the initial vector (0) contains K components. Each incident of ̂j = 0 for some j reduces the order of the candidate mixing distribution by one. Because of this feature, the order of the MPLE is a direct output of the algorithm. No ad hoc steps are required. 4.4 Tuning strategies In this section, we discuss some ITD implementation issues. 4.4.1 Choice of u The tuning parameter u does not have a statistical implication. It controls by how much the iteration is directed by the gradient of Q(). We may therefore try to use a large u value. However, the inequality in the proof of Theorem 1 indicates that the objective function is guaranteed to decrease after each iteration only when u is small enough. Because = ua() and the size of this largely controls the thresholding level, a large u also helps to speed up the algorithm. For normal mixtures, we have b00() = 1 so we may choose u = (n1)1, the largest value allowed to ensure the monotonicity of Q((m)). For Poisson mixtures, we may choose u adaptively according to the value of (m) to guarantee monotonicity. Once the sequence (m) settles down, a stable value of u is used. To accelerate the algorithm, we employ the following tuning strategy. Letw = maxk Pn i=1wik and 0 be the submatrix of with columns corresponding to the nonzero entries of (0). Denote by 1 the maximum eigenvalue of T0 0 and let 2 = maxk b00( (0) k ). At each thresholding iteration, we first initiate an tentative large value for u, i.e., u(m) = u = 6=(w1 2 ), and then check whether or not Q((m+1)) < Q((m)). When the inequality is violated, we reduce u to u(m) = 0:5u, and we repeat this reduction procedure until Q((m+1)) < Q((m)) is satisfied. It is clear that the adaptive procedure does not alter the convergence of the algorithm. 91 4.4.2 Choice of The tuning parameter in p() controls the shrinkage in , and therefore it is crucial for the resulting order of the finite mixture model. When = 0, the MPLE of (4.4) places no penalty and thus results in the largest order. As increases from zero to infinity, the MPLE gradually reduces the order for the mixture model to one. The choice of relates to the trade-off between parsimony and goodness-of-fit. Chen and Khalili [2008] prove that when has a certain asymptotic order, the SCAD regularization method is consistent when the sample size increases to infin- ity. However, the asymptotic result does not provide a specific recommendation for a -value in practice. Instead, a cross-validation (CV) procedure is recommended. Specifically, let the full data set Y = fy1; : : : ; yngT be divided intoR nonover- lapping subsets, say Yr with size nr for r = 1; : : : ; R and PR r=1 nr = n. Let Y Yr be the subset with Yr removed from Y . Let ̂;r be the MPLE of the model parameter based on Y Yr. The CV procedure selects a that minimizes CV() = RX r=1 lnr( ̂;r;Yr); where lnr( ;Yr) is the log-likelihood function based on Yr. The choice of with r = 1 worked well in Chen and Khalili’s simulation studies. At the same time, the CV combined with an exhaustive search of makes the whole procedure compu- tationally intensive. A more efficient tuning strategy is desirable in practice. We implemented the algorithm of Chen and Khalili [2008] with a 20-fold CV (Zhang [1993]) to reduce the computational burden. In the literature, BIC-type criteria are investigated for choosing the tuning pa- rameter in regularization methods for variable selection (Wang et al. [2007]). In such methods, the candidate choices of are evaluated by the likelihood scores of the MPLE with penalties proportional to the model complexity. The yielding the best score is then selected. Compared with the CV, this method is computationally cheaper and often leads to a satisfactory selection result. Therefore, we define a revised version of BIC as a function of by RBIC() = 2ln(Y ; ̂) + ( ̂) log n; 92 where ̂ is the MPLE with regularization parameter value and ( ̂) is the number of parameters in this mixing distribution. With = 1, we obtain the traditional BIC. In the simulation, we used = 0.25 and 0.5. We chose a smaller than usual value partly for its superior performance in our simulation studies and partly because an extra component in mixture models usually leads to a single extra degree of freedom in the limiting distribution of likelihood-based test statistics (Mengersen et al. [2011]). Once the value of is chosen, the RBIC selects a value that minimizes RBIC() and thus suggests a corresponding order for the finite mixture model. In applications, several trial runs are helpful to identify the proper range for . We adopted a data-driven searching strategy in our numerical studies. We begin with a large value = that leads to a homogeneous model with order 1. The search proceeds by bisectionally reducing toward zero, such that the orders of the resulting models increase gradually. Grid searches are used between these values to obtain models with intermediate orders. The efficiency of this strategy has been observed in our simulation studies. 4.4.3 Choice of CK The constant CK in the first penalty of (4.4) controls the penalization for small mixing proportions. Its value influences the precision of the corresponding MPLE of the model parameters. However, the effect of CK has repeatedly been found to be minor for the order selection. Chen et al. [2001] and Chen and Khalili [2008] suggested that if the component parameters k are restricted within [M;M ] or [M1;M ] for some large M , then an appropriate choice of CK is logM . In our numerical studies, we set CK = log log y for Poisson mixture models with y = maxifyi; i = 1; : : : ; ng. For normal mixture models, we set CK to the logarithm of the maximum absolute observation based on the standardized data (see Section 4.5). 4.5 Numerical studies We assess the performance of ITD by Monte Carlo simulation examples. In partic- ular, we implement MSCAD with both LQA and ITD to examine their computa- 93 Table 4.1: Mixing proportions and component parameters in simulation mod- els Normal mixture models Model (1; 1) (2; 2) (3; 3) (4; 4) (5; 5) (6; 6) 1 (0.5, 0) (0.5, 3) 2 (0.3, 0) (0.7, 3) 3 (0.2, 0) (0.4, 4) (0.4, 6) 4 (1/4, 0) (1/4, 3) (1/4, 6) (1/4, 9) 5 (1/4, 0) (1/4, 2) (1/4, 5) (1/4, 8) 6 (0.3, 0) (0.2, 2) (0.3, 4) (0.2, 6) 7 (1/6, 0) (1/6, 3) (1/6, 6) (1/6, 9) (1/6, 12) (1/6, 15) 8 (1/6, 0) (1/6, 2) (1/6, 4) (1/6, 6) (1/6, 9) (1/6, 12) 9 (1/6, 0) (1/6, 2) (1/6, 4) (1/6, 7) (1/6, 9) (1/6, 11) 10 (1/6, 0) (1/6, 2) (1/6, 4) (1/6, 6) (1/6, 8) (1/6, 10) Poisson mixture models Model (1; 1) (2; 2) (3; 3) (4; 4) 1 (1/2, 1) (1/2, 3) 2 (1/5, 1) (4/5, 3) 3 (4/5, 1) (1/5, 3) 4 (1/4, 1) (1/4, 4) (1/4, 12) (1/4, 20) 5 (0.1, 1) (0.2, 4) (0.3, 12) (0.4, 20) 6 (0.4, 1) (0.3, 4) (0.2, 12) (0.1, 20) tional efficiency. The performances of a number of tuning strategies for are also compared. The algorithm has been implemented in the R software. 4.5.1 Simulations We generated data from the normal and Poisson mixture models. The component mean and mixing proportions are given in Table 4.1. For the normal mixture models, we set the common component variance 2 = 1. To make the selection invariant on the data scale, we standardized the observations from normal mix- tures such that the corresponding sample mean was zero and the sample standard deviation was three. For each simulated data set, MSCAD was used for the order selection with a common upper bound K = 12. The EM procedure used by both LQA and ITD was implemented with the same initial settings as in Chen and Khalili [2008]. That 94 0 20 40 60 80 100 120 − 1 0 1 2 3 4 EM−path with ITD Steps m u 0 20 40 60 80 100 120 − 1 0 1 2 3 4 EM−path with LQA Steps m u Figure 4.1: EM-paths of MSCAD for normal mixture model 1 with n = 100. Model 1 N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 1 2 3 4 5 Model 2 N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 1 2 3 4 5 Model 3 N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 1 2 3 4 5 Model 4 N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 1 2 3 4 5 Model 5 N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 1 2 3 4 5 Model 6 N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 1 2 3 4 5 Figure 4.2: Selection results for normal mixture models (1–6), n=100. is, the initial values of the component means were chosen as the 100(k 1=2)K% sample quantiles and (0)k = 1=K for k = 1; : : : ;K. For the normal mixtures, we set the initial value of the component variance to the sample variance based on the observations trimmed at the 25% and 75% sample quantiles. For LQA, we merged two subpopulations if their component means were within 103 at the convergence 95 Model 7 N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 2 3 4 5 6 7 8 Model 8 N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 2 3 4 5 6 7 8 Model 9 N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 2 3 4 5 6 7 8 Model 10 N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 2 3 4 5 6 7 8 Figure 4.3: Selection results for normal mixture models (7–10), n=300. of the algorithm. As a concrete illustration, we plot in Fig. 4.1 the EM-iteration paths of (t) based on ITD and LQA for one data set generated from a normal mixture. Com- pared with EM-LQA, the paths of the EM-ITD procedure are sharper and require significantly fewer iterations to converge. We find that ITD saves about 40% of the computational time compared to LQA over our simulation studies. Although there is no guarantee, ITD and LQA usually converge to similar limiting points in our examples. We summarize the simulation results in terms of how often various orders are selected based on 500 data sets generated from each model with sample sizes n = 100; 300. We use scaled bar plots to denote the frequency of the orders selected by MSCAD against methods and tuning strategies. The length of each 96 Model 1 (n=100) N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 1 2 3 4 Model 2 (n=100) N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 1 2 3 4 Model 3 (n=100) N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 1 2 3 4 Model 1 (n=300) N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 1 2 3 4 Model 2 (n=300) N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 1 2 3 4 Model 3 (n=300) N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 1 2 3 4 Figure 4.4: Selection results for Poisson mixture models (1–3), n=(100, 300). bar is proportional to the selection frequency. As a benchmark, the true orders are marked by a black bar in each plot. We labeled the x-axes as follows: RBIC:25 and RBIC:5 are obtained by ITD with scaled RBIC; LQA-cv and ITD-cv are obtained by 20-fold CV based on LQA/ITD. We have also included the results of the popu- lar AIC and BIC methods as suggested by Leroux [1992]. The simulation results are reported in Figs. 4.2–4.5. Although it is not the goal of our simulation, these figures reaffirm the ef- fectiveness of MSCAD for order selection. The performance of MSCAD is not affected by its implementation. This is reflected by the nearly identical columns for LQA-cv and ITD-cv in all the plots. In terms of the selection frequency for the true mixture order, AIC and BIC do not perform well in most cases. In general, BIC grossly underestimates the true order of the model because of its stringent penalty on the model complexity. AIC is relatively liberal, but its performance is not satisfactory for models with complex structure. The remaining results all relate to MSCAD but are tuned or implemented dif- 97 Model 4 (n=100) N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 2 3 4 5 Model 5 (n=100) N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 2 3 4 5 Model 6 (n=100) N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 2 3 4 5 Model 4 (n=300) N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 2 3 4 5 Model 5 (n=300) N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 2 3 4 5 Model 6 (n=300) N o. o f C om po ne ts Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv 2 3 4 5 Figure 4.5: Selection results for Poisson mixture models (4–6), n=(100, 300). ferently (i.e., RBIC:25, RBIC:5, ITD-cv, LQA-cv). Based on these plots, we see that RBIC:5 tends to underestimate the true order. The other methods are compara- ble. If we factor in the computational savings, RBIC:25 has the best performance. Also, the ITD-based methods perform better than the LQA because of the numeri- cal advantages pointed out earlier. 4.5.2 Examples We now illustrate the use of the ITD algorithm for order selection in real appli- cations. We chose to implement MSCAD as in the simulation studies. Our first example is from a well-known astronomical data set, which consists of the veloc- ities of 82 galaxies moving away from our own galaxy (available in Table 1 of Postman et al. [1986]; see the histogram in Fig. 4.6). The multimodality of the ve- locities may indicate the presence of super-clusters of galaxies surrounded by large voids, with each mode representing a cluster as it moves away at its own speed (see Roeder [1990] for more background). We may hence model the values of observed velocities as a random sample from a finite mixture of normal distributions with a 98 De ns ity 10 15 20 25 30 35 0.0 0 0.0 5 0.1 0 0.1 5 0.2 0 0.2 5 Figure 4.6: Histogram of galaxy data. Solid curve: density of seven component models selected by MSCAD. Dashed curve: density of six component models selected by AIC/BIC. common variance. Using a Bayesian approach, Richardson and Green [1997] con- clude that the number of distributions ranges from five to seven, which provides support for the existence of super-clusters. MSCAD-ITD is used to reanalyze the galaxy data with K = 12. RBIC with = 0:25 was used to tune the parameter . The outcome is a seven-component model; the parameter estimates are given in Table 4.2 and Fig. 4.6 shows the fitted density function. As a side note, AIC/BIC leads to a model with six components; see Table 4.2 and Fig. 4.6. It also provides a good description of the multimodal structure of the galaxy data. Our next example is from Leroux and Puterman [1992], who discussed a study of breathing and body movements in fetal lambs. The number of movements of a fetal lamb was recorded for 240 consecutive 5-second intervals. These counts are overdispersed with a ratio of variance to mean estimated at 1.83. It is well known 99 Table 4.2: Parameter estimates for galaxy data. ̂k, ̂k, ̂ denote estimated com- ponent mean, mixing proportion, and common component standard deviation (k = 1; 2; : : : ; 7). Method (̂1; ̂1) (̂2; ̂2) (̂3; ̂3) (̂4; ̂4) (̂5; ̂5) (̂6; ̂6) (̂7; ̂7) ̂ MSCAD (9.7, .08) (16.1, .04) (19.8, .43) (22.4, .21) (23.9, .14) (26.5, ,05) (33.1, .05) .66 AIC/BIC (9.7, .09) (16.2, .02) (19.9, .45) (23.1, .36) (26.3, .04) (33.0, .04) (- -.-, .- -) .81 that overdispersion can often be explained by population heterogeneity. Hence, a Poisson mixture model was used with the number of components selected by AIC/BIC. Using MSCAD implemented via ITD and an initial value of K = 12, we select a two-component model. Not surprisingly, this order is the same as that of the previous analysis. We summarize the resulting fits in Table 4.3. Because of its built-in measure to prevent type-I overfitting, MSCAD gives more balanced mixing proportions. Clustering more observations into the second component also leads to lower component means. Table 4.3: Parameter estimates for lamb data. ̂k, ̂k denote estimated component mean and mixing proportion (k = 1; 2). Method (̂1; ̂1) (̂2; ̂2) MSCAD (.227, .920) (1.88, .080) AIC/BIC (.230, .939) (2.32, .061) 4.6 Summary and conclusions In this chapter, we have developed an iterative thresholding descent algorithm (ITD) for PLM-based order selection methods in finite mixture models. The new algorithm avoids directly solving the original multivariate optimization problem and efficiently leads to the sparse MPLE of the component parameters. We estab- lished the algorithmic convergence of ITD under mild conditions. The efficiency of ITD is well supported by our numerical studies. In applications, we need to specify an upper bound K for the MSCAD-based 100 analysis. One should always use a sufficiently large K to avoid underestimation. However, too large a K often leads to slow convergence. In the examples we examined, the outcomes are not sensitive to the choice of K within a plausible range. We do not recommend settingK below 5 or above 15 in most applications. One needs strong empirical evidence or a large sample size (say above 300) to go outside this range. 101 Chapter 5 Summary and Future Work 5.1 Summary of the dissertation In this dissertation, we have developed approaches based on PLMs to address the issues of feature selection arising in several application fields. In Chapter 2, we addressed feature selection for ultra-high-dimensional data, where the number of features (covariates) is larger than the sample size. To fa- cilitate the selection process, we proposed a novel screening approach to reduce the dimensionality of the data before the application of the PLM. The new method is motivated by the idea of the sparsity-restricted maximum likelihood estimator and can be efficiently implemented through a thresholding-based algorithm. Com- pared with the existing approaches which screen features based on the marginal correlations between the covariates and response, our method accounts for more joint effects between the covariates and thus can be more reliable in applications. We further established the consistency of our method in an ultra-high-dimensional setup and demonstrated its excellent performance through numerical examples. In Chapter 3, we addressed variable selection for complex survey data, where the observations are intrinsically dependent because of the without-replacement sampling plan. To avoid a distorted conclusion caused by the biased samples, we proposed a penalized pseudo-likelihood approach to incorporate the survey weights into the selection process. A pseudo-likelihood-based BIC was further suggested to select the corresponding tuning parameter. We demonstrated the asymptotic con- 102 sistency of our selection method in a joint randomization framework. The decent performance and potential benefits of our method were demonstrated in simulation studies. In Chapter 4, we addressed order selection in finite mixture models for het- erogeneous data. A thresholding-based algorithm was proposed for the implemen- tation of PLMs in such applications. The new algorithm transforms the original multivariate objective function into a sum of univariate functions and efficiently leads to a sparse solution without ad hoc steps. We established the convergence of the new algorithm and illustrated its efficiency through both simulations and real-data examples. 5.2 Future directions We have demonstrated that the PLM is an attractive technique for statistical learn- ing with a wide range of applications. In the remainder of this chapter, we present several possible directions for future research. 5.2.1 Variable selection in model-based clustering Clustering is a fundamental data-analysis tool which assigns similar objects to groups. When the data dimension is high, there are many noise variables that may mask the underlying clustering structures. Specifically, given the p-dimensional observations xi = (xi1; : : : ; xip) for i = 1; : : : ; n, we aim to group the observa- tions into a few clusters such that observations in the same cluster are more similar to each other than to those in different clusters. When p is large, it is likely that many entries in xi are not relevant to the clustering: including them in the analysis introduces noise which might hide the heterogeneous structure of the data. Re- moving these “noise” variables is essential for the clustering of high-dimensional data. Of the many clustering methods, model-based clustering (McLachlan and Peel [2002], Zhong and Ghosh [2003]) is appropriate for variable selection methods. These approaches assume that the data are generated from a finite mixture distri- bution with each component corresponding to a cluster. In this framework, the removal of “noise” variables can be performed by a model selection procedure, in 103 which the shrinkage idea of PLM is helpful. Specifically, suppose that, prior to the clustering, the data are standardized such that each entryXj = (x1j ; : : : ; xnj)T has sample mean 0 and sample variance 1. In some applications, it might be appropriate to assume that the data are independently generated according to a p-variate normal mixture density f(x; ) = KX k=1 kh(x;k;k); (5.1) where k > 0 is the mixing proportion such that PK k=1 k = 1, h(:) denotes the p-variate normal density with mean vector k = (k1; : : : ; kp) and covariance matrixk, and = f(k;k;k); k = 1; : : : ;Kg denotes the unknown parame- ters. Clearly, if some components of k are small, the corresponding entries of the data are not informative for the clustering, at least in terms of location. Employing the shrinkage idea, we may use the PLM to automatically set small components of k to zero and thus remove the irrelevant entries from the data. We could address this by maximizing Q( ) = l( ) '( ) pX j=1 (k(j)k2); (5.2) where l( ) = Pn i=1 log f(xi; ) is the log-likelihood, '(:) and (:) are two penalty functions, and k(j)k2 = vuut KX k=1 2kj is the Euclidean norm on (j) = (1j ; : : : ; kj) for j = 1; : : : ; p. In the normal mixture cases, the first penalty '(:) is generally required to guarantee the existence of the maximizer of (5.2) (Chen et al. [2008]), while the second penalty (:) can lead to a sparse estimate of the mean parameter. Placing k(j)k2 in (:) helps us to set (j) = 0, so that the corresponding entry can be regarded as irrelevant to the clustering. A similar idea has been adopted by Yuan and Lin [2006] for a group LASSO problem. The optimization problem of (5.2) can be similarly cast into an EM-based procedure as discussed in Chapter 4. 104 As in the standard model-based clustering, one could further fit (5.1) by max- imizing (5.2) with a series of K values and use a selection criterion (e.g., BIC) to determine the optimal number of clusters. Both the theoretical and computational aspects of this procedure require further research. 5.2.2 Support vector machine with nonconvex penalty Classification is a basic task in pattern recognition and data mining. Kernel-based classification technologies have attracted much attention, because of their elegant interpretation and highly competitive performance. The support vector machine (SVM) is among the most popular classification approaches in this category (Cristianini and Shawe-Taylor [2000]); it learns the classification rule from the informative margins of the training data. In a binary classification problem, the response y is either +1 or -1 (the clas- sification labels), and a classification rule T is a mapping from the feature vector x to f+1;1g. Given the training set (yi;xi) for i = 1; : : : ; n, we must find a discriminant function f(:) so that any new input x can be correctly classified with f(x) based on T . The kernel-based methods generally use the rule T = sign[f(x)] and the following f(:): f(x;w) = nX i=1 wik(x;xi) + w0; (5.3) where k(:) is a user-specified kernel function such as the polynomial or Gaussian kernel, and w = (w0; : : : ; wn) is the weight vector to be estimated. Given a loss function L(:), the kernel method (5.3) leads to the following optimization scheme min w ( n1 nX i=1 L(yi; f(xi;w)) + (w) ) ; (5.4) where (:) is a penalty function index with tuning parameter . With the L2 penalty (w) = kwk22 and the hinge loss, i.e., L(y; f) = (1 y f)+ = maxf0; 1 y fg; 105 (5.4) leads to the typical L2-SVM, which has been applied to a number of classifi- cation problems. Solid statistical learning theory for the use of L2-SVM has been developed (Lin [2002]). As in ridge regression, the L2 penalty helps to control the model complexity to prevent overfitting. With the development of sparse regularization methods (e.g., LASSO), the L1 penalty has been used in SVM (Zhu et al. [2003]); this leads to a sparse estimate of w in the discriminant function (5.3). The so-called L1-SVM can result in a more compressive rule and thus provide a clearer interpretation. Friedman et al. [2004] showed that the L1-SVM performs better if the underlying model is sparse, while the L2-SVM performs better if most of the observations contribute to the response. Because of the promising theoretical properties of nonconvex penalties, it is natural to explore whether using them in (5.4) would further benefit the SVM. Such a nonconvex-penalized SVM may help to reduce the biased estimates of L1, while maintaining the desirable feature of sparsity in the classification rule. It would be interesting to continue this investigation. 5.2.3 Some other issues Variable selection via grouping In regression analysis, one difficulty for variable selection is the collinearity be- tween the covariates. In ultra-high-dimensional situations, even when all the co- variates are ideally independent, the maximum sample correlation between covari- ates can still be high (Fan and Lv [2008]), which makes it hard to detect the truly influential covariates. Therefore, one might consider grouping all the candidate covariates according to their sample correlations and then treating each group as a independent predictor. In other words, covariates in the same group are considered to represent some factor that may be associated with the response. We could then use the penalized likelihood idea to choose the relevant groups. One-step thresholding-based procedure The thresholding-based algorithm provides a simple and efficient way to solve the numerical problem of nonconvex PLMs. However, because of the nonconcavity 106 of the objective function, multiple initial values are often needed in its implemen- tation. It would be beneficial to develop a data-driven strategy for initializing the thresholding-based procedure. A good initial value often leads to fewer iteration steps and guarantees the statistical properties of the estimate. In Chapter 2, we have tentatively revealed the efficiency of the LASSO-based initializing strategy. It would be beneficial if this method could be given a theoretical justification. Mo- tivated by Zou and Li [2008], we are particularly interested in finding an initial setting such that a few iterations would suffice for variable selection/screening. Tuning LAD-LASSO with high-dimensionality LASSO is an computationally attractive approach to variable selection in high- dimension regression models. The L1-penalized least squares problem (1.9) has received much attention. However, the least squares method is known to be sen- sitive to outliers. An alternative approach is the least absolute deviation (LAD) method which can be robust to outliers. Wang et al. [2006] proposed the L1- penalized LAD, i.e., LAD-LASSO, which is an efficient and robust selection ap- proach. In particular, the idea of coordinate descent (CD) can be conveniently applied to the LAD-LASSO problem. LAD-LASSO therefore has great potential for high-dimensional noisy data. Despite its promising properties (Gao and Huang [2010]), the tuning parameter in LAD-LASSO needs to be specified in applica- tions. The traditional AIC and BIC are designed for least square settings and may not be suitable for the LAD-LASSO problem. A feasible tuning method is therefore desired to realize the desirable properties of LAD-LASSO, especially for high-dimensional situations. EBIC (Chen and Chen [2008]) might provide an appropriate tuning strategy for LAD-LASSO and its variants. 107 Bibliography H. Akaike. Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, 1:267–281, 1973. ! pages 6, 70, 81 A. Antoniadis. Wavelet methods in statistics: Some recent developments and their applications. Statistics Surveys, 1:16–55, 2007. ! pages 89, 90 A. Antoniadis and J. Fan. Regularization of wavelet approximations. Journal of the American Statistical Association, 96:939–967, 2001. ! pages 90 J. Berkhof, I. Van Mechelen, and A. Gelman. A bayesian approach to the selection and testing of mixture models. Statistica Sinica, 13:423–442, 2003. ! pages 81 P. Bickel and D. Freedman. Asymptotic normality and the bootstrap in stratified sampling. The Annals of Statistics, 12:470–484, 1984. ! pages 63 D. Binder. On the variances of asymptotically normal estimators from complex surveys. International statistical Review, 51:279–292, 1983. ! pages 61 D. Binder and G. Roberts. Chapter: Design-based and model-based methods for estimating model parameters. Wiley Series in Survey Methodology, Chichester, 2003. ! pages 59 T. Blumensath and M. Davies. Iterative thresholding for sparse approximations. The Journal of Fourier Analysis and Applications, 14:629–654, 2008. ! pages 27 T. Blumensath and M. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27:265–274, 2009. ! pages 26, 32, 34, 36 108 D. Böhning. Computer-Assisted Analysis of Mixtures and Applications: Meta Analysis, Disease Mapping, and Others. Chapman & Hall/CRC, 2000. ! pages 80 L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37:373–384, 1995. ! pages 11, 20 T. Brijs, D. Karlis, G. Swinnen, K. Vanhoof, G. Wets, and P. Manchanda. A multivariate poisson mixture model for marketing applications. Statistica Neerlandica, 58:322–348, 2004. ! pages 80 S. Canada. Survey on living with chronic diseases in canada 2009: User guide. Supplementary documentation, 2009. ! pages 66 E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory, 52:489–509, 2006. ! pages 26 I. Carrillo, J. Chen, and C. Wu. The pseudo-gee approach to the analysis of longitudinal surveys. Canadian Journal of Statistics, 38:540–554, 2010. ! pages 62 H. Chen and J. Chen. The likelihood ratio test for homogeneity in finite mixture models. Canadian Journal of Statistics, 29:201–216, 2001. ! pages 81 H. Chen, J. Chen, and J. Kalbfleisch. A modified likelihood ratio test for homogeneity in finite mixture models. Journal of the Royal Statistical Society: Series B, 63:19–29, 2001. ! pages 93 J. Chen and Z. Chen. Extended bayesian information criterion for model selection with large model spaces. Biometrika, 95:759–771, 2008. ! pages 41, 107 J. Chen and Z. Chen. Extended bic for small-n-large-p sparse glm. Statistica Sinica, 22:555–574, 2012. ! pages 19, 33, 41, 42, 54, 55, 118 J. Chen and J. Kalbfleisch. Penalized minimum-distance estimates in finite mixture models. Journal of the American Statistical Association, 103: 1674–1683, 1996. ! pages 81, 84 J. Chen and A. Khalili. Order selection in finite mixture models with a nonsmooth penalty. Journal of the American Statistical Association, 103:1674–1683, 2008. ! pages 24, 81, 82, 83, 85, 90, 92, 93, 94 J. Chen and J. Rao. Asymptotic normality under two-phase sampling designs. Statistica Sinica, 17:1047–1064, 2007. ! pages 63 109 J. Chen, X. Tan, and R. Zhang. Inference for normal mixtures in mean and variance. Statistica Sinica, 18:443–465, 2008. ! pages 104 P. Craven and G. Wahba. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–403, 1979. ! pages 18, 70 N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000. ! pages 105 I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57:1413–1457, 2004. ! pages 82, 86 A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B, 39: 1–38, 1977. ! pages 84 P. Dickson and J. Ginter. Market segmentation, product differentiation, and marketing strategy. Journal of Marketing, 51:1–11, 1987. ! pages 1 D. Donoho. High-dimensional data analysis: the curses and blessings of dimensionality. Aide-Memoire of a Lecture at AMS Conference on Math Challenges of the 21st Century, 2000. ! pages 25 D. Donoho. Compressed sensing. IEEE Transactions on information theory, 52: 1289–1306, 2006. ! pages 26, 32 B. Efron. Correlation and large-scale simulations significance testing. Journal of the American Statistical Association, 102:93–103, 2007. ! pages 30 B. Efron. Empirical bayes estimates for large-scale prediction problem. Journal of the American Statistical Association, 104:1015–1028, 2009. ! pages 54 B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32:407–499, 2004. ! pages 20, 82, 86 J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96: 1348–1360, 2001. ! pages 2, 11, 12, 13, 17, 18, 21, 25, 38, 46, 64, 70, 81, 132 J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70: 849–911, 2008. ! pages 22, 23, 26, 29, 30, 32, 106 110 J. Fan and J. Lv. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20:101–148, 2010. ! pages 25 J. Fan and J. Lv. Nonconcave penalized likelihood with np-dimensionality. IEEE Transactions on Information Theory, 57:5467–5484, 2011. ! pages 17, 39 J. Fan and H. Peng. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32:781–813, 2004. ! pages 17 J. Fan and R. Song. Sure independence screening in generalized linear models with np-dimensionality. The Annals of Statistics, 38:3567–3604, 2009. ! pages 30 J. Fan, R. Samworth, and Y. Wu. Ultrahigh dimensional variable selection: beyond the linear model. Journal of Machine Learning Research, 10: 1829–1853, 2009. ! pages 30, 43, 44 I. Frank and J. Friedman. A statistical view of some chemonetrics regression tools. Technometrics, 35:109–148, 1993. ! pages 12 J. Friedman, T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. Discussion of boosting papers. The Annals of Statistics, 32:102–107, 2004. ! pages 106 J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1:302–332, 2007. ! pages 20, 82, 86 W. J. Fu. Penalized regression: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7:397–416, 1998. ! pages 11, 12, 20 W. Fuller. Sampling Statistics. Hoboken, New Jersey: Wiley, 2009. ! pages 57, 74 X. Gao and J. Huang. Asymptotic analysis of high-dimensional lad regression with lasso. Statistica Sinica, 20:1495–1506, 2010. ! pages 107 R. Gelber, J. Gaziano, J. Manson, J. Buring, and H. Sesso. A prospective study of body mass index and the risk of developing hypertension in men. American Journal of Hypertension, 20:370–377, 2007. ! pages 78 J. Ghosh and P. Sen. On the asymptotic performance of the log-likelihood ratio statistic for the mixture model and related results. In In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, volume 2, pages 789–806, 1985. ! pages 81 111 V. Godambe and M. Thompson. Parameters of superpopulation and survey population: Their relationship and estimation. International Statistical Review, 54:127–138, 1986. ! pages 61 J. Hájek. Limiting distributions in simple random sampling from a finite population. Publications of the Mathematics Institute of Hungarian Academy of Science, 5:361–375, 1960. ! pages 62 J. Hájek. Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35:1491–1523, 1964. ! pages 63 T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining. Springer-Verlag, New York, 2 edition, 2009. ! pages 25 H. Ishwaran, L. James, and J. Sun. Bayesian model selection in finite mixtures by marginal density decompositions. Journal of the American Statistical Association, 96:1316–1332, 2001. ! pages 81 G. Kalton. Models in the practice of survey sampling. International Statistical Review, 51:175–188, 1983. ! pages 59 R. Kass and A. Raftery. Bayes factors. Journal of the American Statistical Association, 90:773–795, 1995. ! pages 65 A. Keller, P. Leidinger, A. Borries, A. Wendschlag, F. Wucherpfennig, M. Scheffler, H. Huwer, H. Lenhof, and E. Meese. mirnas in lung cancer - studying complex fingerprints in patient’s blood cells by microarray experiments. BMC Cancer, 9:353–363, 2009. ! pages 1 C. Keribin. Consistent estimation of the order of mixture models. The Indian Journal of Statistics, 96:1316–1332, 2000. ! pages 81 K. Knight and W. J. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28:1356–1378, 2000. ! pages 15, 16 E. Korn and B. Graubard. Analysis of Health Surveys. Wiley, New York, 1999. ! pages 56 P. S. Kott. A model-based look at linear regression with survey data. The American Statistician, 45:107–112, 1991. ! pages 74 N. Kushmerich. Learning to remove internet advertisements. In Proceeding of the 3rd International Conference on Autonomous Agents, pages 175–181, 1999. ! pages 1 112 B. Leroux. Consistent estimation of a mixing distribution. The Annals of Statistics, 20:1350–1360, 1992. ! pages 81, 97 B. Leroux and M. Puterman. Maximum-penalized-likelihood estimation for independent and markov-dependent mixture models. Biometrics, 48:545–558, 1992. ! pages 99 M. Lesperance and J. Kalbfleisch. An algorithm for computing the nonparametric mle of a mixing distribution. Journal of the American Statistical Association, 87:120–126, 1992. ! pages 83 P. Li and J. Chen. Testing the order of a finite mixture model. Journal of the American Statistical Association, 105:1084–1092, 2010. ! pages 81 Y. Lin. Support vector machine and the bayes rule in classification. Data Mining and Knowledge Discovery, 6:259–275, 2002. ! pages 106 B. G. Lindsay. The geometry of mixture likelihoods: A general theory. The Annals of Statistics, 11:86–94, 1983. ! pages 83 S. L. Lohr and J. Liu. A comparison of weighted and unweighted analyses in the ncvs. Journal of Quantitative Criminology, 10:343–360, 1994. ! pages 76 G. MacLachlan. On bootstrapping the likelihood ratio test statistics for the number of components in a normal mixture. Applied Statistics, 36:318–324, 1987. ! pages 81 G. MacLachlan and T. Krishnan. The EM Algorithm and Extensions. Hoboken, New Jersey: Wiley, 2 edition, 2008. ! pages 85 S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41:3397–3415, 1993. ! pages 34 P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, London, 2 edition, 1989. ! pages 20, 26 J. McLachlan and D. Peel. Finite mixture models. Wiley, New York, 2002. ! pages 103 N. Meindhausen and P. Buhlmann. Consistent neighborhood selection for high-dimensional graphs with the lasso. The Annals of Statistics, 34: 1436–1462, 2006. ! pages 16 K. Mengersen, C. Robert, and D. Titterington. Mixtures: Estimation and Applications. Hoboken, New Jersey: Wiley, 1 edition, 2011. ! pages 93 113 E. Molina and C. Skinner. Pseudo-likelihood and quasi-likelihood estimation for complex sampling schemes. Computational Statistics and Data Analysis, 13: 395–405, 1992. ! pages 61 J. F. Murray and K. Kreutz-Delgado. An improved focuss-based learning algorithm for solving sparse linear inverse problems. Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, pages 347–351, 2001. ! pages 34 E. Ohlsson. Asymptotic normality for two-stage sampling from a finite population. Probability Theory and Related Fields, 81:341–352, 1989. ! pages 63 T. Park and G. Casella. The bayesian lasso. Journal of the American Statistical Association, 103:681–686, 2008. ! pages 14 D. Pfeffermann. The role of sampling weights when modeling survey data. International Statistical Review, 61:317–337, 1993. ! pages 57 D. Pfeffermann and D. J. Holmes. Robustness considerations in the choice of a method of inference for regression analysis of survey data. Journal of the Royal Statistical Society, Series A, 148:268–278, 1985. ! pages 74 M. Postman, J. Huchra, and M. Geller. Probes of large-scale structures in the corona borealis region. The Astronomical Journal, 92:1238–1247, 1986. ! pages 98 M. Rahiala and T. Teräsvirta. Business survey data in forecasting the output of swedish and finnish metal and engineering industries: A kalman filter approach. Journal of Forecasting, 12:255–271, 1993. ! pages 56 C. R. Rao and Y. H. Wu. A strongly consistent procedure for model selection in a regression problem. Biometrika, 76:369–74, 1989. ! pages 9 S. Richardson and P. Green. On bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society: Series B, 59: 731–792, 1997. ! pages 81, 99 K. Roeder. Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association, 85: 617–624, 1990. ! pages 98 M. Royall. The linear least-squares prediction approach to two-stage sampling. Journal of the American Statistical Association, 71:657–664, 1976. ! pages 58 114 G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6: 461–464, 1978. ! pages 6, 64, 81, 133 J. Shao. An asymptotic theory for linear model selection (with discussion). Statistica Sinica, 7:221–242, 1997. ! pages 8 Y. She. Thresholding-based iterative selection procedures for model selection and shrinkage. Electronic Journal of Statistics, 3:384–415, 2009. ! pages 22, 82, 86 Y. She. An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors. Computational Statistics and Data Analysis, In press, 2011. ! pages 19, 70 R. Shibata. Asymptotic mean efficiency of a selection of regression variables. The Annals of Statistics, 35:415–423, 1983. ! pages 8 M. Shoukri and G. MacLachlan. Parametric estimation in a genetic mixture model with application to nuclear family data. Biometrics, 50:128–139, 1994. ! pages 80 D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D’Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kanto, T. R. Golub, and W. R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1:203–209, 2002. ! pages 54 M. Stone. Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society, Series B, 39:111–147, 1974. ! pages 18 J. Storey and R. Tibshirani. Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences, 100:9440–9445, 2003. ! pages 30 R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58:267–288, 1996. ! pages 2, 11, 12, 18, 19, 25, 81 L. Tierney, R. Kass, and J. Kadane. Fully exponential laplace approximation to expectations and variances of nonpositive functions. Journal of the American Statistical Association, 84:710–716, 1989. ! pages 65 J. Vı́s̆ek. Asymptotic distribution of simple estimate for rejective, sampford and successive sampling. Contributions to Statistics, pages 263–275, 1979. ! pages 63 115 H. Wang. Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104:1512–1524, 2009. ! pages 19, 26, 45 H. Wang, G. Li, and G. Jiang. Robust regression shrinkage and consistent variable selection via the lad-lasso. Journal of Business and Economic Statistics, 11: 1–6, 2006. ! pages 107 H. Wang, R. Li, and C. Tsai. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94:553–568, 2007. ! pages 18, 41, 64, 92 H. Wang, B. Li, and C. Leng. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B, 71:671–683, 2009. ! pages 19 W. Wolfson. Analysis of labour force survey data for the information technology occupations 2000-2003. WGW Services Ltd., Ottawa, Ontario, 2004. ! pages 56 M. Woo and T. Sriram. Robust estimation of mixture complexity. Journal of the American Statistical Association, 51:4379–4392, 2006. ! pages 81 J. Wu. On the convergence properties of the em algorithm. The Annals of Statistics, 11:95–103, 1983. ! pages 85 L. Yan, K. Liu, K. Matthews, M. Daviglus, T. Ferguson, and C. Kiefe. Psychosocial factors and risk of hypertension: The coronary artery risk development in young adults (cardia) study. Journal of American Medical Association, 290:2138–2148, 2003. ! pages 78 Y. Yang. Can the strength of aic and bic be shared? a conflict between model identification and regression estimation. Biometrika, 92:937–950, 2005. ! pages 8 M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006. ! pages 104 C. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38:894–942, 2010. ! pages 14, 25 P. Zhang. Model selection via multifold cross-validation. The Annals of Statistics, 21:229–231, 1993. ! pages 92 116 Y. Zhang, R. Li, and C. Tsai. Regularization parameter selection via generalized information criterion. Journal of the American Statistical Association, 105: 312–323, 2010. ! pages 64 P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006. ! pages 15, 16 S. Zhong and J. Ghosh. A unified framework for model-based clustering. Journal of Machine Learning Research, 4:1001–1037, 2003. ! pages 103 J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. l1-norm support vector machines. Neural Information Processing Systems, 16, 2003. ! pages 106 H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006. ! pages 14, 16 H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67:301–320, 2005. ! pages 11, 14, 25 H. Zou and R. Li. One-step sparse estimation in nonconcave penalized likelihood models. The Annals of Statistics, 36:1509–1533, 2008. ! pages 21, 22, 82, 86, 107 117 Appendix A Supplementary Information for Chapter 2 We provide proofs of Theorems 2.1-2.4 in this appendix. Before getting into the detailed illustrations, we first state a technical lemma as follows. Lemma A.1 Let Yi, i = 1; : : : ; n be independent random variables following ex- ponential family distributions of form (2.1) with natural parameters i 2 . Let i and 2i denote the mean and variance of Yi respectively. Let tni, i = 1; : : : ; n be real numbers such that nX i=1 t2ni 2 i = 1; max 1in ft2nig = O(n) for some positive sequence hn = o(n). Then, for a sufficiently large n, P nX i=1 tni(Yi i) > hn ! exp(h 2 n 3 ): Lemma A.1 states a useful property of exponential family for the illustration of theorems in this chapter, the proof of which can be found in Chen and Chen [2012]. 118 A.1 Proof of Theorem 2.1 Theorem 2.1 shows that, under some regularity conditions on model (2.1) and the design matrix, the SMLE-based screening ŝ retains all influential features in the true model s with probability tending to one. We illustrate the theorem by showing that asymptotically ŝ falls into the collection of over-fitted models that contain s as a submodel. This is implied by the fact that the maximum likelihood score based on an over-fitted model is asymptotically greater than that of any under-fitted model with at least one feature in s excluded. Proof: Let ̂s be the (unrestricted) MLE of based on model s. The theorem is implied if Pfŝ 2 Sk+g ! 1. Thus, it suffices to show that Pfmax s2Sk ln(̂s) min s2Sk+ ln(̂s) g ! 0; (A.1) as n!1. For any s 2 Sk, define s0 = s [ s 2 S2k+ . Consider s0 close to s0 such that ks0 s0k = w1n1 for some w1; 1 > 0. Clearly, when n is sufficiently large, s0 falls into a small neighborhood of s0 , so that condition (T3) becomes applicable. Thus, by Taylor’s theory, we have l(s0) l(s0) = [s0 s0 ]TS(s0) 1 2 [s0 s0 ]TH(~s0)[s0 s0 ] [s0 s0 ]TS(s0) c1 2 nks0 s0k2 w1n1kS(s0)k c1 2 w21n 121 ; (A.2) where ~s0 is an intermediate value between s0 and s0 . Thus, for some generic 119 positive constant c, we have Pfl(s0) l(s0) 0g PfkS(s0)k cn11g X j2s0 PfS2j (s0) ck1n221g = X j2s0 PfSj(s0) ck 1 2n11g+ X j2s0 PfSj(s0) ck 1 2n11g: (A.3) Note that Sj( s0) = nX i=1 [yi b0(xTis0s0)]xij = nX i=1 [yi i]xij : Let tni = xij( Pn i=1 x 2 ij 2 i ) 1=2. By condition T4, we have Pn i=1 t 2 ni 2 1 = 1, maxift2nig = O(n1) and n1 Pn i=1 x 2 ij 2 i c4. Also, by condition T2, we have k w2n2 . With these conditions, Lemma 1 gives the following probability inequality PfSj(s0) ck 1 2n11g Pf nX i=1 tni(yi i) > cn0:5(1212)g c exp(n1212): (A.4) By the same arguments, we also have PfSj(s0) ck 1 2n11g c exp(n1212): (A.5) The inequalities (A.4) and (A.5) imply that, for some generic constant c, Pfl(s0) l(s0)g ck exp(n1212): (A.6) 120 Consequently, by Bonferoni inequality and condition 1 21 22 > 0, we have Pfmax s2sk l(s0) l(s0)g X s2sk Pfl(s0) l(s0)g ckpk exp(n1212) c exp(2 log n+mn2 log n n1212) = o(1): Because l(s0) is concave in s0 , above result holds for any s0 such that ks0 s0k w1n1 . For any s 2 Sk, let s0 be ̂s augmented with zeros corresponding to the elements in s0=s. By condition T2, it is seen that ks0 s0k ks=sk w1n1 : Consequently, Pfmax s2Sk l(̂s) min s2Sk+ l(̂s) g Pfmax s2sk l(s0) l(s0)g = o(1) The theorem is proved. A.2 Proof of Theorem 2.2 Theorem 2.2 shows that the IHT procedure (2.12) stepwise increases the likelihood score within the restricted model sparsity. Such an increment property further en- sures the convergence of IHT to a local maximum of (2.7). Proof: We first show the increment of ln((t)). According to (2.9), we have l((t)) = h((t);(t)) h((t+1);(t)) = l((t+1)) u 2 k(t+1) (t)k2 + nX i=1 [b(xTi (t+1)) b(xTi (t)) b0(xTi (t))(xTi (t+1) xTi (t))]: 121 By the Taylor’s expansion, for any and 0, we have b() b(0) b0(0)( 0) = 1 2 b00(~)( 0)2 for some ~ between and 0. Applying this expansion, we find l((t)) l((t+1)) u 2 k(t+1) (t)k2 + 1 2 (t)kX(t+1) X(t)k2 l((t+1)) + 1 2 ((t)1 u)k(m+1) (m)k2 l((t+1)) as we have required u > (t)1. Apparently, the quality holds only if (t+1) = (t). The increment property of AHT is hence proved. Now let us move to the convergence of f(t)g. By condition C30, l(s) is a strict concave function over s for any s such that ksk0 k. Since k(t)k0 k and l((t)) l((t+1)) for t 0, the concavity of l(:) implies that (t) stays in a bounded (compact) region. Thus, = supt((t)) is finite based on the continuity of b00(). By the similar arguments, we obtain l((t+1)) l((t)) 1 2 (u 1)k(t+1) (t)k2: which implies that k(t+1) (t)k ! 0 due to the convergence of fl((t))g. Also, the compactness implies that f(t)g has at least one limit point, say, ~ = f~1; : : : ; ~pgT . Let ftmg be a subsequence such that limm!1 (tm) = ~. Since k(t+1) (t)k ! 0, we must also have limm!1 (tm+1) = ~. We next show that ~ is a local maximum of ln() subject to kk0 k. By (2.9), we have (tm+1) = argmax h( ;(tm)) subject tok k0 k: ¿From the bivariate continuity of hn(;), lettingm!1, we get ~ = argmax h( ; ~) subject tok k0 k: 122 That is, ~ is a maximum of h( ; ~) with respect to such that k k0 k. We now split our discussion into following two cases. Case 1: when k~k0 < k, let ~ = ~ + u1XT [y b0(X ~)]. By (2.12), we have ~ =H(~ ; k); which implies that the kth largest (in absolute value) component of ~ is zero. The definition ofH then tells us ~ = [H(~ 1; 0); : : : ; H(~ p; 0)] T = ~ ; which implies that S(~) = XT [y b0(X ~)] = 0. Therefore, ~ is an uncon- strained maximum of ln(:) which is also a local maximum subject to kk0 k. Case 2: when k~k0 = k, we must have @h( ;0) @ j =0 = 0 (A.7) for any j 2 f1; : : : ; pg where ~j 6= 0. Let us rewrite h( ; ~) as h( ; ~) = l( ) + T ( ; ~) (A.8) so that T ( ; ~) = u 2 k ~k22 + nX i=1 fb(xTi j) b(xTi ~j) b0( ~j)(xTi j xTi ~j)g: It is seen that @T ( ; ~) @ = ~ = 0: Hence, together with (A.7) and (A.8), this fact implies that @l()=@j is zero at = ~ for any j 2 f1; : : : ; pg where ~j 6= 0. Let ~ be the minimum absolute value of non-zero components in ~. Then, for such that jj ~jk 0:5~, ~ must be a local maximum of l() subject to kk0 k. With above arguments, we now justify convergence of f(t)g as follows. Note 123 that, by condition T30, there are finite number of local maximum of l() subject to kk0 k. Suppose (t) does not converge but has two limiting points, say ~1 6= ~2. By what we have just proved, both are also local maxima of l(). Let = k~1 ~2k. Since k(t+1) (t)k ! 0, the distance between successive (t) goes to 0 as t ! 1. Thus, there must be infinite many (t) which are at least =3 distance from these two limiting points. The compactness condition hence implies that (t) has at least another limiting point which is also a local maximum of l(). Let this one be called ~3. Using the same argument, (t) must have additional limiting point ~4 which implies l() has at least 4 stationary points. Repeatedly applying this logic implies l() has infinite many stationary points. This contradicts with the assumption, and therefore implies (t) converges. A.3 Proof of Theorem 2.3 Theorem 2.3 shows that, for appropriate choices of penalty function, the SMLE- PLM procedure consistently estimates the model coefficients in the ultra-high di- mensional GLM setup where p n. Since SMLE-PLM is a two-step procedure, where the PLM is used on a screened model ŝ from the ultra-high dimensional full model space, the randomness from both screening and PLM steps need to be ac- counted in the asymptotic analysis. We illustrate the theorem by showing that the MPLE converges (in probability) to the true model coefficients at a flat rate based on any ŝ that might be obtained from the screening step. Proof: Apparently, the screening consistency P (s ŝ) ! 1 implies that P (ŝ 2 Sk+) ! 1. To show the theorem, we investigate the performance of MPLE on all possible models over Sk+. Specifically, let u = (u1; : : : ; uk) T be an arbitrary k-dimensional vector with kuk2 = 1 and an = n. For any s 2 Sk+, let s = s + anu, and thus, Q(s)Q(s) = l(s + anu) ln(s) n X j2s [( j + anuj) (j )] (A.9) When n is sufficiently large, s falls into a small neighborhood of s, so that C3 124 becomes applicable. Following similar arguments in (A.2), we have the likelihood term in (A.9) bounded by l(s + anu) l(s) ankS(s)k c1 2 na2n: (A.10) At the same time, we have n X j2s [( j + anuj) (j )] n X j2s [( j + anuj) (j )] = n X j2s anuj 0 (j~j j)sign( ~j) (A.11) for some ~j between j and j+anuj . By condition C2, we requireminj2s jj j w1n 1 . Thus, when n is large enough, we have ~j > 0:5w1n1 for j 2 s and property P3 of (:) implies that the penalty term in (A.9) is bounded by n X j2s anuj 0 (j~j j)sign( ~j) w3n13an X j2s juj j w3n13an p k: (A.12) By (A.10)-(A.12), we obtain Q(s)Q(s) ankS(s)k2 c1 2 na2n + w3n 13an p k nkS(s)k c1 2 n12 + w3 p w2n 13+ 12 2 : (A.13) When < 3 122, the second term dominates the third term in (A.13). Thus, for a sufficient large n and some generic constant c, PfQ(s) Q(s)g PfkS(s)k cn1g c exp(n122); where the last in equality is followed by the same arguments in (A.3)-(A.6). Con- 125 sequently, by Bonferoni inequality and the condition < 12 2, we have PfQ(s) Q(s) for some s 2 Sk+g X s2Sk+ PfQ(s) Q(s)g ckpk exp(n122) c exp(2 log n+mn2 log n n122) = o(1); (A.14) which implies that, with probability tending to one, the local maximizer of Q(ŝ) falls into an-neighborhood of ŝ. The theorem is proved. A.4 Proof of Theorem 2.4 Theorem 2.4 shows that, under additional requirements on the penalty function, the SMLE-PLM procedure consistently identifies the true model s in ultra-high dimensional situations. Similarly to the proof of Theorem 2.3, the randomness from both screening and PLM steps in the SMLE-PLM needs to be considered. We illustrate the theorem by showing that the probability of MPLE that identifies s from any possible ŝ goes to one at a common rate. Proof: Clearly, the requirements of Theorem 2.4 imply that 3 > 0:52 > 1+ 22 . Thus, following the arguments in the proof of Theorem 2.3, there exists a local maximizer ̂(ŝ) of (2.13), such that k̂(s) sk an = cn(0:52) (A.15) with probability tending to 1 for some generic constant c, and s 2 Sk+. For the convenience of presentation, we denote s by fs1;s2gwith s2 = 0 for any s 2 Sk+. Then, for s = fs1;s2g such that ks sk2 an, we have Q(s1;s2)Q(s1; 0) = l(s1;s2) l(s1; 0) n X j2sns (jj j) (A.16) For sufficient large n, s falls into small neighborhood of s such that con- 126 ditions C3 and C5 become applicable. Thus, following the similar arguments in (A.3), the likelihood term in (A.16) is bounded by l(s1;s2) l(s1; 0) @l(s1; 0) T @s2 s2 c1 2 nks2k2 X j2sns (jSj(s)j+ c5nks1 s1k)jj j kS(s)k ks2k+ c5n p kks1 s1k ks2k (A.17) Moreover, with 0 < < 0:5 1 2, s2 falls into 0:5n1 neighborhood of zero and property P4 of (:) implies that (jj j) = 0(j~j j)sign( ~j)j w4n4 jj j (A.18) for some ~j 2 (0; j) and j 2 s n s. Consequently, with (A.17) and (A.18), we have (A.16) bounded by Q(s1;s2)Q(s1; 0) kS(s)k ks2k+ c5n p kks1 s1k ks2k w4n14ks2k kS(s)k ks2k+ c5n0:5+1:52+ks2k w4n14ks2k (A.19) By choosing = 0:5minf0:5 1:52 4; 0:5 1 2g, the third term domi- nates the second term in (A.19). Thus, for a sufficiently large n and some generic constant c, we have PfQ(s1;s2) Q(s1; 0)g PfkS(s)k cn14g c exp(n1242) Consequently, following the similar arguments in (A.14) with the condition 4 < 0:5 1:52, we have PfQ(s1;s2) Q(s1; 0) for some s 2 Sk+g ckpk exp(n1242) = o(1); 127 which implies that, with probability tending to one, the local maximizer of Q(), say ̂(ŝ), is located at ̂j(ŝ) = 0 for j 2 ŝ n s. The theorem is proved. 128 Appendix B Supplementary Information for Chapter 3 We provide proofs of Theorems 3.1-3.2, Corollary 3.1 and the technical deviation of sample-based BIC 3.7 in this appendix. For simplicity of presentation, we use instead of n in the proof. B.1 Proof of Theorem 3.1 Theorem 3.1 shows that, with appropriate sampling schemes and choices of penalty functions, the PPLM consistently estimate the model coefficients and identifies the true model under the joint randomization framework. For the sampling plans where the sample mean converges at the same rate as the i.i.d. sampling, the maximizer of the penalized pseudo-likelihood necessarily converges to the census version based on the full likelihood – this further ensures the consistency of PPLM as stated in the theorem. Proof: Let us first work on the estimation consistency of . Let u be an arbitrary p-dimensional vector with jjujj = c and an = n 12 + ' for some c > 0. To obtain the estimation consistency, it suffices to show that as n ! 1, ̂ is in the an-neighborhood of . 129 Let = + anu. We have Qn()Qn() = fln( + anu) ln()g n pX j=1 [ (jj + anuj j)] (jj j)] fln( + anu) ln()g n qX j=1 [ (jj + anuj j) (jj j)]: (B.1) Let us work on the first term in (B.1). Define HN () = NX i=1 xib00[XTi ]x T i ; Hn() = X i2d wixib00[XTi ]x T i : By condition C1, when is close to , each element in matrixHN (), say htj =PN i=1 b 00(xi)xitxir for t; j 2 f1; : : : ; pg, satisfies N1jhtj j1+ = O(1) with probability tending to 1. Thus, by Condition C3, we have 1 n Hn() 1 N HN ()!p 0: (B.2) Clearly, when n is large enough, is in a small neighborhood of and so (B.2) becomes applicable. Therefore, from the continuity of I() at , we have ln() ln() = anl0n()Tu 1 2 a2nu THn(~)u = anl 0 n( )Tu 1 2 na2nu T I()u(1 + op(1)) anl0n()Tu 1 2 nc2a2nM1(1 + op(1)) (B.3) for some ~ between and, where l0n( ) = @ln()=@ = fl0n1(); : : : ; l0np()gT and l0nj( ) = P i2dwi(yi b0(xi))xij for j 2 f1; : : : ; pg. Note that for 130 j 2 f1; : : : ; pg, E[(fY b0(X)gXj)2] = E f E [ ((Y b0(X))Xj)2j X ] g = E fb00(X)X2j g <1: Thus, N1 PN i=1(yi b0(xi))xij = Op(N1=2). This and Condition C3 imply that l0nj( ) = n N NX i=1 (yi b0(xi))xij +Op( p n) = Op( p n): (B.4) Therefore, the first term in (B.3) is Op( p nan) = Op(na 2 n). By taking a suffi- ciently large c for kuk, (B.3) is dominated by its second term, which implies that Pfln() ln()g ! 0. Now let us consider the second term in (B.1). Based on Taylor’s expansion and the continuity of 0(jj) at 0j , we have (jj + anuj j) (jj j) = anuj 0(jj j)sign(j )(1 + o(1)) for j 2 f1; : : : ; qg. Thus, the second term in (B.1) is O(nan') = Op(na2n), which is also dominated by the second term in (B.3) for a sufficiently large c. This implies that PfQn() Qn()g ! 0, and therefore there exists a local maxi- mizer ofQn in the an-neighborhood of . The proof of the estimation consistency is complete. We now examine the selection consistency. Recall that we write = f1;2g with 2 = 0. Hence, for any = f1;2g such that kk can and 2 6= 0, we have Qn(1;2)Qn(1; 0) = fln(1;2) ln(1; 0)g fn pX j=q+1 (jj j)g (B.5) 131 Similarly to the proof of the estimation consistency, we have ln(1;2) ln(1; 0) = @ln(1; 0) @2 T 2 + 1 2 T2 @ln(1; ~2) @2@ T 2 2 @ln(1; 0) @2 T 2 1 2 nM1k2k2(1 + op(1)) k@ln(1; 0) @2 k k2k (B.6) for some ~2 between 2 and 2 = 0. Also, by Taylor’s expansion, it can be shown that @ln(1; 0) @2j = @ln( ) @2j + @2ln( ) @2j@1 T (1 1)(1 + op(1)) (B.7) for j 2 fq + 1; : : : ; pg, which implies that (B.6) is Op(nk1 1kk2k +p nk2k). Note that when n is large enough, 2 is in a small neighborhood of 0 such that property D3 of (jj) becomes applicable. Specifically, if _j = 0:5j , then for some ~j between j and _j , we have (jj j) = (j _j j) + 1 2 0(j~j j)sign( ~j)j 1 2 M2anjj j (B.8) for each j 2 fq + 1; : : : ; pg and an arbitrary positive constant M2. Thus, by choosing a sufficiently large M2, (B.5) is dominated by its second term. This implies that PfQn(1;2) Qn(1; 0)g ! 0, and therefore the local maximizer of Qn is located at some ̂2 = 2 = 0. The theorem is proved. B.2 Proof of Corollary 3.1 Corollary 3.1 shows that, with additional requirements on the penalty function, the PPLM is able to consistently identify the true model s and estimate their coeffi- cients as efficiently as the MLE based on the true model. This result corresponds to the notation of oracle property in Fan and Li [2001] for the non-survey PLMs. 132 Proof: Clearly, the requirement for (jj) implies that ' ! 0. Thus, by Theo- rem 1, we have P (̂2 = 0) ! 1, which implies that, with probability tending to 1, the following relationship holds for ̂1: @ln(̂1; 0) @1 + qX j=1 0(ĵjj) = 0: (B.9) Also, by Theorem 3.1, k̂11k = Op(n1=2+), so with probability tending to 1 ̂1 is in a small neighborhood of 1, which is bounded away from 0. Since for any 6= 0 there existsM > 0 such that 0(jj) = 0 when n > M , we have P (0(ĵjj) = 0)! 1 for j = 1; : : : ; q: This together with (B.9) implies that, with probability tending to 1, ̂1 satisfies @ln(̂1; 0) @1 = 0; which is exactly the same as the normal equation in solving the MPLE of ln() based on the true model s. The proof is complete. B.3 Derivation of BIC (3.7) We use the same principle as in the classical BIC (Schwarz [1978]) to obtain the sample-based BIC (3.7). Simplistically, we show that the rationale of using BIC (3.7) is to approximately maximize the pseudo-posterior (3.6) when the sample size is large. Our derivations are heuristic and we do not spell out the conditions most rigorously. The ultimate justification of (3.7) is the large sample properties of the resulting variable selection procedure, which is discussed in Theorem 3.2. For simplicity of notation, we use s instead of s in the following derivation. Let s(s) be the prior density function of regression coefficient vector s given candidate model s. A pseudo-marginal density function of the data is then given by Pn(yjs) = Z Ln(y;s)s(s)ds: 133 Consequently, we regard the following expression as the pseudo-posterior proba- bility of model s, Pn(sjy) = Pn(yjs)P (s)P s2S P (s)Pn(yjs) ; where P (s) denotes the prior probability of model s and S is the collection of candidate models. In the spirit of Bayesian inference, we select the model that maximizes Pn(sjy). Since P s2S P (s)Pn(yjs) does not depend on any specific s, the highest Pn(sjy) is achieved at the model that maximizes Pn(yjs) when a uniform prior P () over the model space is adopted. To take a close look at Pn(yjs), let ̂s be the maximizer of Ln(y;s) given s. Assume that s(:) is smooth and does not change with the sample size n. Also, when n is large enough, Ln(y;s)=Ln(y; ̂s) = op(1) fors outside of aOp(n 1=2) neighborhood of ̂s. In addition, withinOp(n1=2) neighborhood of ̂s, s() is a constant function in . That is, s(s) s(̂s) for s 2 . Thus, Pn(yjs) Z ̂s Ln(y;s)s(s)ds s(̂s) Z ̂s Ln(y;s)ds: (B.10) Let Hn() = @2ln()=@@T . Assume that Hn() is continuous at ̂s. Then, for s close to ̂s, ln(s) = logLn(y;s) is approximated by ln(s) ln(̂s) (1=2)(s ̂s)THn(̂s)(s ̂s): (B.11) 134 Approximation (B.11) together with (B.10) implies that Pn(yjs) s(̂s) Z ̂s Ln(y; ̂s) exp 1 2 (s ̂s)THn(̂s)(s ̂s) ds s(̂s) Ln(y; ̂s) Z exp 1 2 (s ̂s)THn(̂s)(s ̂s) ds = s(̂s) Ln(y; ̂s) (2) (s) 2 jHn(̂s)j 1 2 = s(̂s) Ln(y; ̂s) (2=n) (s) 2 jn1Hn(̂s)j 1 2 ; (B.12) where j:j denotes the determinant of a matrix and the first step is from the Laplace approximation. Consequently, we have 2 logPn(yjs) approximated by 2ln(̂s) + (s) log n+R; (B.13) where R = (s) log(2) + log[s(̂s)] + log[jn1Hn(̂s)j] = Op(1): The order is justified when n1Hn(̂s)! H(s) in probability for some positive definite matrixH . When theOp(1) term is ignored from (B.13), we obtain a simplified criterion BICn(s) = 2ln(̂s) + (s) log n: B.4 Proof of Theorem 3.2 Theorem 3.2 shows that the sample-based BIC score (3.7) is minimized on the true model with probability tending to one. With appropriate sampling plans, the sample-based BIC essentially reduces to the classic BIC up to a op(1) term due to the unequal weighting, which has no effect on the consistency of the BIC-based selection. 135 Proof: Let ̂s be the maximizer of ln() based on model s. For any 2 +[ , we have BICn(s) BICn(s) = 2fln(̂s) ln(̂s)g f(s) (s)g logn: Thus, Theorem 3.2 is implied if P h f2ln(̂s) ln(̂s)g fs (s)g log n i ! 0: (B.14) We verify (B.14) for underfitting ( 2 ) and overfitting ( 2 +). Case 1: When 2 , we have s 6 s. Let s be ̂s augmented with zeros corresponding to j 2 f1; : : : ; pg and j 62 s. We have ln(̂s) ln(̂s) ln(̂s) ln() = ln(s) ln(): Note that for close to , there exists a ~ such that ln() ln() = ( )T l0n() 1 2 ( )THn(~)( ) k k kl0n()k 1 2 M1k k2nf1 + op(1)g(B.15) with probability tending to 1. As shown in the proof of Theorem 1, we have kl0n()k = Op(n1=2). Thus, for such that k k = n 1 3 , (B.15) is dom- inated by its second term which is (1=2)M1 n 13 (1 + op(1)). Because of the concavity of ln(), Pfln() ln() 1 2 M1 n 13 g ! 1 holds for any such that k k n 13 , which includes s as a special case. We therefore obtain ln(̂s) ln(̂s) 1 2 M1 n 13 (1 + op(1)) 136 with probability tending to 1. This implies that (B.14) holds for any 2 . Case 2: For 2 +, we have ln(̂s) ln(̂s) ln(̂s) ln(): Similarly, by conditions C2 and C3, with probability tending to 1, we have ln(̂s) ln() (̂s )T l0n() 1 2 (̂s )THn()(̂s )f1 + op(1)g (̂s )T l0n() nl0n()T I1()l0n()(1 + op(1)) 1 M1 kl0n()k2n1(1 + op(1)): Since kl0n()k2 = Op(n), we have ln(̂s) ln() = Op(1), which implies that (B.14) holds for any 2 +. The theorem is proved. 137 Appendix C Supplementary Information for Chapter 4 We provide proofs of Theorems 4.1-4.2, Corollary 4.1 and Proposition 4.1 in this appendix. C.1 Proof of Theorem 4.1 Theorem 3.1 shows that, with an appropriate scale parameter, the iteration from the ITD procedure (4.10) reduces the value of objective functionQ in (4.8). The proof is similar in showing Theorem 2.2 under the GLM context. Proof: According to (4.10), we have Q((m)) = u1G((m);(m)) u1G((m+1);(m)) = Q((m+1)) + 1 2 u1k(m+1) (m)k2 KX k=1 nX i=1 wik[b( (m+1) k ) b((m)k ) b0((m)k )(m+1k (m)k )]: 138 By Taylor’s expansion, for any and 0, we have b() b(0) b0(0)( 0) = 1 2 b00(~)( 0)2 for some ~ between and 0. Applying this expansion, noting that wik 2 (0; 1), we find Q((m)) Q((m+1)) + 1 2 u1k(m+1) (m)k2 1 2 n (m) 2 k(m+1) (m)k2 = Q((m+1)) + 1 2 (u1 n (m)2 1)k(m+1) (m)k2 Q((m+1)) since we have required u1 > n (m)2 1. Clearly, equality holds only if (m+1) = (m). The theorem is hence proved. C.2 Proof of Corollary 4.1 Corollary 4.1 shows that the difference between two iterations in the ITD dimin- ishes as the procedure proceeds. It serves as an important prerequisite for the convergence of ITD. Proof: According to Theorem 4.1, we have Q((m+1)) Q((m)). Thus, the compact assumption implies that (m) stays in a bounded region. Thus, 2 = supm( (m) 2 ) is finite based on the continuity of b 00(). By similar arguments to those in Theorem 1, we obtain Q((m))Q((m+1)) 1 2 (u1 n2 1)k(m+1) (m)k2: Consequently, the asymptotically regularity of f(m)g is implied by the conver- gence of Q((m)). C.3 Proof of Theorem 4.2 Theorem 4.2 shows that the ITD procedure converges to a stationary point of the objective function Q. The convergence is established based on Corollary 4.1 and 139 the fact that the sequence of ITD has only one limiting point. Proof: The compact condition implies that f(m)g has at least one limit point, say, . Let fmtg be a subsequence such that limm!1 (mt) = . By Corollary 4.1, we must also have limm!1 (mt+1) = . We now show that is also a stationary point of Q(). By (4.10), we have (mt+1) = argmin G(;(mt)): ¿From the bivariate continuity of G(;), letting t!1, we get = argmin G(;): That is, = f0; : : : ; K1g is a local/global minimum of G(;) with respect to . Therefore, we must have @G(;) @j = = 0 (C.1) for any j 2 f0; : : : ;K 1g where j 6= 0. Denote k = Pk1 j=0 j and k =Pk1 j=0 j for k = 1; : : : ;K 1, and write G(;) = uQ() + T (;) (C.2) so that T (;) = 1 2 K1X j=1 (j j )2 u KX k=1 nX i=1 wikfb(k) b(k) b0(k)(k k)g: It can be seen that @T (;) @ = = 0: Hence, together with (C.1) and (C.2), this fact implies that @Q()=@j is zero at = for any j 2 f0; : : : ;K 1g where j 6= 0. Therefore, is a stationary 140 point of Q. Suppose (m) does not converge but has two limiting points, say 1 6= 2. By what we have just proved, both are also stationary points of Q(). Let = k1 2k. By Corollary 4.1, the distance between successive (m) goes to 0 as m!1. Thus, there must be infinitely many (m) that are at least =3 from these two limiting points. The compact condition hence implies that (m) has at least another limiting point that is also a stationary point of Q(). Let this be 3. Using the same argument, (m) must have an additional limiting point 4, which implies that Q() has at least four stationary points. Repeatedly applying this logic implies that Q() has infinitely many stationary points. This contradicts the assumption, and therefore implies that (m) converges. C.4 Proof of Proposition 4.1 Proposition 4.1 provides the analytic solution to the unified optimization problem (4.11) with the SCAD penalty, which serves as a build-in block for the ITD proce- dure (4.10). The proof follows standard procedures in mathematical analysis. Proof: By a result from classical mathematical analysis, the solution must be a critical point of q( ). To avoid unnecessary complexity, we assume that z 0 so that 0. The result for z < 0 can be obtained by symmetry. When > 0, we have q0( ) = z + p0( ): (C.3) It can be seen that p0( ) is flat on the interval (0; ]; from the point = it decreases as a straight line with slope =( 1), and it remains 0 after = (see Fig. C.1). A nonzero critical point of q(), if any, must be the intersection of g1( ) = p 0 ( ) and the straight line g2( ) = z . Suppose 0 < 1 (case 1). There is at most one such intersection: (a) when z < , there is no intersection so = 0 is the only critical point and the global minimum; (b) when < z ( + 1), the unique intersection is at = z ; (c) when (+ 1) < z , the unique intersection is at = ( 1)z 1 ; 141 0 1 2 3 4 5 0 1 2 3 4 5 Case 1 gamma (a) (b) (c) (d) 0 1 2 3 4 5 0 1 2 3 4 5 Case 2 gamma (a) (b) (c) (c) 0 1 2 3 4 5 0 1 2 3 4 5 Case 3 gamma (a) (b) Figure C.1: Plot demonstration for Proposition 4.1 with = 3:7, = 1, and = (1; 3; 4:2) for cases 1–3. Solid lines represent g1( ) = p0( ) and dashed lines represent g2( ) = + z for various z values. (d) when z > , the unique intersection is at = z. Note that cases (a) and (b) have a common expression = (z )+. In (b), (c), and (d), the nonzero critical point is found to be the global minimum. Thus, we have obtained the result for the case where > 1 when z > 0. Suppose 1 < (case 2). In this case, p0( ) decreases with a slope smaller than 1 from = until it reaches 0. This creates the possibility of up to four intersections between p0( ) and g( ). Yet we find that the global minimum is at the largest critical point: (a) when z < , there is no intersection and we find = 0; (b) when z < , the two curves intersect at = z ; (c) 142 when z , there can be many intersections but the global minimum is given by = z. Suppose < (case 3). The derivation is the simplest in this case: (a) when z < , there is no intersection and we find = 0; (b) when z, the largest intersection is = z, which is the global minimum. We have considered all possible cases. For z < 0, the expression of must be adjusted accordingly. 143
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Applications of penalized likelihood methods for feature...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Applications of penalized likelihood methods for feature selection in statistical modeling Xu, Chen 2012
pdf
Page Metadata
Item Metadata
Title | Applications of penalized likelihood methods for feature selection in statistical modeling |
Creator |
Xu, Chen |
Publisher | University of British Columbia |
Date Issued | 2012 |
Description | Feature selection plays a pivotal role in knowledge discovery and contemporary scientific research. Traditional best subset selection or stepwise regression can be computationally expensive or unstable in the selection process, and so various penalized likelihood methods (PLMs) have received much attention in recent decades. In this dissertation, we develop approaches based on PLMs to deal with the issues of feature selection arising from several application fields. Motivated by genomic association studies, we first address feature selection in ultra-high-dimensional situations, where the number of candidate features can be huge. Reducing the dimension of the data is essential in such situations. We propose a novel screening approach via the sparsity-restricted maximum likelihood estimator that removes most of the irrelevant features before the formal selection. The model after screening serves as an excellent starting point for the use of PLMs. We establish the screening and selection consistency of the proposed method and develop efficient algorithms for its implementation. We next turn our attention to the analysis of complex survey data, where the identification of influential factors for certain behavioral, social, and economic indices forms a variable selection problem. When data are collected through survey sampling from a finite population, they have an intrinsic dependence structure and may provide a biased representation of the target population. To avoid distorted conclusions, survey weights are usually adopted in these analyses. We use a pseudo-likelihood to account for the survey weights and propose a penalized pseudo-likelihood method for the variable selection of survey data. The consistency of the proposed approach is established for the joint randomization framework. Lastly, we address order selection for finite mixture models, which provides a flexible tool for modeling data from a heterogeneous population. PLMs are attractive for such problems. However, this application requires maximizations over nonsmooth and nonconcave objective functions, which are computationally challenging. We transform the original multivariate objective function into a sum of univariate functions and design an iterative thresholding-based algorithm to efficiently solve the sparse maximization without ad hoc steps. We establish the convergence of the new algorithm and illustrate its efficiency through both simulations and real-data examples. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2012-09-21 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0073181 |
URI | http://hdl.handle.net/2429/43254 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2012-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2012_fall_xu_chen.pdf [ 589.18kB ]
- Metadata
- JSON: 24-1.0073181.json
- JSON-LD: 24-1.0073181-ld.json
- RDF/XML (Pretty): 24-1.0073181-rdf.xml
- RDF/JSON: 24-1.0073181-rdf.json
- Turtle: 24-1.0073181-turtle.txt
- N-Triples: 24-1.0073181-rdf-ntriples.txt
- Original Record: 24-1.0073181-source.json
- Full Text
- 24-1.0073181-fulltext.txt
- Citation
- 24-1.0073181.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0073181/manifest