Applications of Penalized Likelihood Methods for Feature Selection in Statistical Modeling by Chen Xu M.A., York University, Canada, 2007 B.Sc., Xi’an Jiaotong University, China, 2006 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Statistics) The University Of British Columbia (Vancouver) September 2012 c Chen Xu, 2012 ⃝ Abstract Feature selection plays a pivotal role in knowledge discovery and contemporary scientific research. Traditional best subset selection or stepwise regression can be computationally expensive or unstable in the selection process, and so various penalized likelihood methods (PLMs) have received much attention in recent decades. In this dissertation, we develop approaches based on PLMs to deal with the issues of feature selection arising from several application fields. Motivated by genomic association studies, we first address feature selection in ultra-high-dimensional situations, where the number of candidate features can be huge. Reducing the dimension of the data is essential in such situations. We propose a novel screening approach via the sparsity-restricted maximum likelihood estimator that removes most of the irrelevant features before the formal selection. The model after screening serves as an excellent starting point for the use of PLMs. We establish the screening and selection consistency of the proposed method and develop efficient algorithms for its implementation. We next turn our attention to the analysis of complex survey data, where the identification of influential factors for certain behavioral, social, and economic indices forms a variable selection problem. When data are collected though survey sampling from a finite population, they have an intrinsic dependence structure and may provide a biased representation of the target population. To avoid distorted conclusions, survey weights are usually adopted in these analyses. We use a pseudo-likelihood to account for the survey weights and propose a penalized pseudo-likelihood method for the variable selection of survey data. The consistency of the proposed approach is established for the joint randomization framework. ii Lastly, we address order selection for finite mixture models, which provides a flexible tool for modeling data from a heterogeneous population. PLMs are attractive for such problems. However, this application requires maximizations over nonsmooth and nonconcave objective functions, which are computationally challenging. We transform the original multivariate objective function into a sum of univariate functions and design an iterative thresholding-based algorithm to efficiently solve the sparse maximization without ad hoc steps. We establish the convergence of the new algorithm and illustrate its efficiency through both simulations and real-data examples. iii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Motivating examples . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Computational genomics . . . . . . . . . . . . . . . . . . 3 1.2.2 Chronic disease studies . . . . . . . . . . . . . . . . . . . 4 1.2.3 Market segmentation . . . . . . . . . . . . . . . . . . . . 5 Traditional feature selection methods . . . . . . . . . . . . . . . . 5 1.3.1 Selection criteria: AIC and BIC . . . . . . . . . . . . . . 6 1.3.2 Selection procedures . . . . . . . . . . . . . . . . . . . . 10 Penalized likelihood methods . . . . . . . . . . . . . . . . . . . . 11 1.4.1 The penalized likelihood and penalty functions . . . . . . 11 1.4.2 Asymptotic properties of PLM . . . . . . . . . . . . . . . 14 1.4.3 Tuning strategies . . . . . . . . . . . . . . . . . . . . . . 17 1.4.4 Computational strategies . . . . . . . . . . . . . . . . . . 19 Contributions of the dissertation . . . . . . . . . . . . . . . . . . 22 1.3 1.4 1.5 iv 2 The Screening-Based PLM in Ultra-High-Dimensional Feature Spaces 25 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2 Sure screening techniques with ultra-high dimensionality . . . . . 27 2.2.1 Model settings and notation . . . . . . . . . . . . . . . . 27 2.2.2 Sure independence screening . . . . . . . . . . . . . . . . 29 Variable screening via the sparse-MLE . . . . . . . . . . . . . . . 31 2.3.1 The sparsity-restricted maximum likelihood estimator . . 31 2.3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . 34 2.3.3 Numerical assessment . . . . . . . . . . . . . . . . . . . 37 Screening-based PLM selection procedure . . . . . . . . . . . . . 38 2.4.1 Consistency of SMLE-PLM . . . . . . . . . . . . . . . . 38 2.4.2 Tuning with EBIC . . . . . . . . . . . . . . . . . . . . . 41 Numerical studies . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.5.1 General settings . . . . . . . . . . . . . . . . . . . . . . . 42 2.5.2 Linear regression . . . . . . . . . . . . . . . . . . . . . . 45 2.5.3 Logistic regression . . . . . . . . . . . . . . . . . . . . . 48 2.5.4 Poisson regression . . . . . . . . . . . . . . . . . . . . . 51 2.5.5 Real-data example . . . . . . . . . . . . . . . . . . . . . 54 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . 55 3 The Penalized Pseudo-Likelihood in Analysis of Survey Data . . . . 56 2.3 2.4 2.5 2.6 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.2 Joint inference and super-population . . . . . . . . . . . . . . . . 58 3.3 Pseudo-likelihood-based variable selection . . . . . . . . . . . . . 60 3.3.1 The penalized pseudo-likelihood method . . . . . . . . . 60 3.3.2 Asymptotic properties of PPLM . . . . . . . . . . . . . . 61 3.3.3 Tuning strategy via sample-based BIC . . . . . . . . . . . 64 Numerical studies . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.4.1 SLCDC data . . . . . . . . . . . . . . . . . . . . . . . . 66 3.4.2 Simulation settings . . . . . . . . . . . . . . . . . . . . . 69 3.4.3 Simulation results . . . . . . . . . . . . . . . . . . . . . 71 3.4.4 Analysis of SLCDC data . . . . . . . . . . . . . . . . . . 76 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . 78 3.4 3.5 v 4 A Thresholding Algorithm for PLM in Finite Mixture Models . . . 80 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2 Order selection via regularization . . . . . . . . . . . . . . . . . . 82 4.2.1 Mixture model and penalized likelihood . . . . . . . . . . 82 4.2.2 The EM algorithm . . . . . . . . . . . . . . . . . . . . . 84 4.3 Iterative thresholding descent procedure . . . . . . . . . . . . . . 86 4.4 Tuning strategies . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.4.1 Choice of u . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.4.2 Choice of λ . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.4.3 Choice of CK . . . . . . . . . . . . . . . . . . . . . . . . 93 Numerical studies . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.5.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.5.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.5 4.6 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . 100 5 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . 102 5.1 Summary of the dissertation . . . . . . . . . . . . . . . . . . . . 102 5.2 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.2.1 Variable selection in model-based clustering . . . . . . . . 103 5.2.2 Support vector machine with nonconvex penalty . . . . . 105 5.2.3 Some other issues . . . . . . . . . . . . . . . . . . . . . . 106 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 A Supplementary Information for Chapter 2 . . . . . . . . . . . . . . 118 A.1 Proof of Theorem 2.1 . . . . . . . . . . . . . . . . . . . . . . . . 119 A.2 Proof of Theorem 2.2 . . . . . . . . . . . . . . . . . . . . . . . . 121 A.3 Proof of Theorem 2.3 . . . . . . . . . . . . . . . . . . . . . . . . 124 A.4 Proof of Theorem 2.4 . . . . . . . . . . . . . . . . . . . . . . . . 126 B Supplementary Information for Chapter 3 . . . . . . . . . . . . . . 129 B.1 Proof of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . . . . 129 B.2 Proof of Corollary 3.1 . . . . . . . . . . . . . . . . . . . . . . . . 132 B.3 Derivation of BIC (3.7) . . . . . . . . . . . . . . . . . . . . . . . 133 vi B.4 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . 135 C Supplementary Information for Chapter 4 . . . . . . . . . . . . . . 138 C.1 Proof of Theorem 4.1 . . . . . . . . . . . . . . . . . . . . . . . . 138 C.2 Proof of Corollary 4.1 . . . . . . . . . . . . . . . . . . . . . . . . 139 C.3 Proof of Theorem 4.2 . . . . . . . . . . . . . . . . . . . . . . . . 139 C.4 Proof of Proposition 4.1 . . . . . . . . . . . . . . . . . . . . . . . 141 vii List of Tables Table 2.1 Frequencies of covering the true model s∗ based on different screening methods . . . . . . . . . . . . . . . . . . . . . . . . 37 Table 2.2 Simulation results for linear regression, setup 1 . . . . . . . . . . . 47 Table 2.3 Simulation results for linear regression, setup 2 . . . . . . . . . . . 47 Table 2.4 Simulation results for linear regression, setup 3 . . . . . . . . . . . 48 Table 2.5 Simulation results for logistic regression, setup 1 . . . . . . . . . . 49 Table 2.6 Simulation results for logistic regression, setup 2 . . . . . . . . . . 50 Table 2.7 Simulation results for logistic regression, setup 3 . . . . . . . . . . 50 Table 2.8 Simulation results for Poisson regressions, setup 1 . . . . . . . . . 52 Table 2.9 Simulation results for Poisson regressions, setup 2 . . . . . . . . . 52 Table 2.10 Simulation results for Poisson regressions, setup 3 . . . . . . . . . 53 Table 2.11 Results for prostate data. . . . . . . . . . . . . . . . . . . . . . . 54 Table 3.1 Variables for analysis of SLCDC data with non-response adjustments: A: allocate to other categories; D: delete from the data; M: impute by mean values; NA: no adjustment applied. . . . . . . . . . . . . . . Table 3.2 68 Selection results for the first sampling plan: Prediction assessments for models 1–2 are based on the testing RSS, while for models 3–4 they are based on (PPR, NPR) with a benchmark 0.5. 72 Table 3.3 Selection results for the second sampling plan: Prediction assessments for models 1–2 are based on the testing RSS, while for models 3–4 they are based on (PPR, NPR) with a benchmark 0.5. . . . . . . Table 3.4 73 Selection frequency of influential variables in model mis-specified case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii 74 Table 3.5 Prediction accuracy for selected models (WPPR, WNPR) based on different benchmarks. . . . . . . . . . . . . . . . . . . . . Table 3.6 Bootstrap selection results for significant variables: (Estimated coefficient, Standard error, Selection rate). . . . . . . . . . . . Table 4.1 78 Mixing proportions and component parameters in simulation models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Table 4.2 78 94 Parameter estimates for galaxy data. µ ˆk , π ˆk , σ ˆ denote estimated component mean, mixing proportion, and common component standard deviation (k = 1, 2, . . . , 7). Table 4.3 . . . . . . . . . . . . . . . . . . . . 100 Parameter estimates for lamb data. µ ˆk , π ˆk denote estimated component mean and mixing proportion (k = 1, 2). . . . . . . . . . . . . 100 ix List of Figures Figure 1.1 Some commonly used penalty functions, with λ = 2 for the l1 penalty; λ = 5 for the l0.5 penalty; (λ = 2, a = 3.7) for the SCAD and MCP penalties. . . . . . . . . . . . . . . . . . . . 13 Figure 3.1 Selection criteria values based on candidate models . . . . . . . . 77 Figure 4.1 EM-paths of MSCAD for normal mixture model 1 with n = 100. . 95 Figure 4.2 Selection results for normal mixture models (1–6), n=100. . . . . . 95 Figure 4.3 Selection results for normal mixture models (7–10), n=300. . . . . 96 Figure 4.4 Selection results for Poisson mixture models (1–3), n=(100, 300). . 97 Figure 4.5 Selection results for Poisson mixture models (4–6), n=(100, 300). . 98 Figure 4.6 Histogram of galaxy data. Solid curve: density of seven component models selected by MSCAD. Dashed curve: density of six component models selected by AIC/BIC. Figure C.1 . . . . . . . . . . . . . . . . 99 Plot demonstration for Proposition 4.1 with ν = 3.7, λ = 1, and κ = (1, 3, 4.2) for cases 1–3. Solid lines represent g1 (γ) = κp′λ (γ) and dashed lines represent g2 (γ) = −γ + z for various z values. x . 142 Acknowledgments First, I would like to express my sincere gratitude to my research supervisor Dr. Jiahua Chen for his invaluable suggestions, guidance, patience, and continuing support. I have learnt much from him, particularly regarding academic research and scientific writing, which will surely benefit my future career. Secondly, I would like to extend my appreciation to my supervisory committee, Dr. Ruben Zamar and Dr. Lang Wu, for their suggestions and encouragement throughout my Ph.D. Also, I will never forget the enjoyable pingpong time we had together. Finally, I would like to thank Peggy, Viena, Elaine, and Andrea for their warm and thoughtful assistance with administrative issues. Peggy and Andrea in particular helped me very much. Thank you for all the conversations and the personal advice. Chen Xu September 2012 xi Chapter 1 Introduction 1.1 Overview Technological innovations have had a profound impact on the process of knowledge discovery. It is now feasible to collect data of unprecedented size and complexity in diverse areas of scientific research. For example, in computational genomics, geneticists may measure hundreds or thousands of gene expressions to identify the few that are associated with major diseases (Keller et al. [2009]). In market research, long-term economic data are often used to discover subgroups of customers with different needs and consumption behaviors (Dickson and Ginter [1987]). In internet applications, huge numbers of uniform resource locators (URLs) are often analyzed to learn the rules for detecting web links with pop-up advertisements (Kushmerich [1999]). Other examples occur in bioinformatics, geology, neurology, health science, economics, and finance. Although the objectives differ in various disciplines, explaining the variation in the variable of interest is a common need in their research. Statistical modeling is one of the most powerful and widely used mathematical tools for data analysis. It aims to provide insightful summaries of the information available and to formulate learning rules based on the observed data. A statistical model mimics the generation of data through a constructed stochastic procedure and attempts to explain the variable variation via a mathematical formulation. For instance, a regression model can be used to describe the relationship between a 1 disease status and gene expressions, and a finite mixture model helps to explain the heterogeneity of consumer behavior. Other classical applications of statistical modeling include graphical models in independence structure learning, proportional hazard models in survival analysis, and autoregressive models in time series forecasting. When no prior knowledge is available, researchers may consider many potential variables at the initial stage of the modeling. Examples include a regression model with a huge number of covariates, an autoregressive model with a high order, and a finite mixture model with a large number of mixing components. A sophisticated model with many variables provides a better descriptive value for the data structure, but it often leads to low predictive accuracy and poor model interpretability. Hence, we wish to identify important features in massive data and to produce a parsimonious model. The issue of feature (model) selection has been studied for decades. Traditional selection procedures, such as best subset selection and stepwise regression, are typically designed for conventional settings, where the observations are assumed to be independent and the number of candidate features is relatively small. However, contemporary scientific research often encounters datasets with high dimensionality (a huge number of variables) and/or a complex nonindependent structure. These attributes challenge traditional methods in terms of both theoretical optimality and practical feasibility. For instance, when best subset selection is used in a regression with thousands of covariates, it can be computationally infeasible and unstable in the selection process. Therefore, we need innovative selection approaches that are suitable for the new environment. The penalized likelihood method (PLM) has been demonstrated to be an attractive technique for feature selection. Examples include the least absolute shrinkage and selection operator (LASSO) proposed by Tibshirani [1996] and the smoothly clipped absolute deviation (SCAD) introduced by Fan and Li [2001]. These approaches exclude variables from the model by estimating their coefficients to be zero and shrink the other coefficients accordingly. Compared with traditional methods, the PLM has a lower computational cost and provides more stable selection results. The PLM has received a great deal of attention and covered a wide range of statistical models. In this dissertation, we address new research problems regarding 2 feature selection arising from different applications of the PLM. In Section 1.2, we outline a few feature selection problems from computational genomics, chronic disease studies, and market research as motivating examples for our research. In Section 1.3, we give a brief review of traditional approaches and discuss their limitations in the context of modern data analysis. We then, in Section 1.4, introduce the PLM framework and discuss its properties for variable selection. We discuss both the theoretical and computational aspects, providing the necessary background for the detailed discussion in later chapters. In Section 1.5 we state the aim of our research and outline the contributions of this dissertation. 1.2 Motivating examples Feature selection is fundamental to contemporary scientific research. We begin with a few motivating examples from various scientific fields to illustrate the challenges for feature selection in the context of modern data analysis. 1.2.1 Computational genomics In contemporary biomedical studies, a common goal is to find the genetic explanations (e.g., genes) that are responsible for observable traits such as blood pressure, height, or susceptibility to disease. Understanding the genetic associations of diseases helps medical researchers to further investigate the diseases and to develop the corresponding treatment methods. In contrast to simple observable traits such as gender or blood type, some complex diseases (e.g., leukemia or diabetes) are believed to be the result of many genetic and environmental factors. Since the influential genes may spread over the whole DNA sequence (the genome), studies of complex diseases normally require a full genome scan on all possible genes. Geneticists often measure hundreds or thousands of genes for a relatively small number of participants. In an ongoing project conducted by the Faculty of Dentistry, University of British Columbia, about 600 microRNA (miRNA) expressions in serum samples were measured from two groups of participants. One group consisted of 30 oralcancer patients and the other group consisted of 26 individuals without cancer. The question is whether these miRNA readings can be used to distinguish the cancer 3 patients from the others. If the method is successful, the genetic information might be further used to predict whether an oral-cancer patient will progress from a minor tumor to a serious one. Using all 600 miRNAs for the classification leads to a poor predictive value because of the high level of noise. Consequently, it is important to select those that make the greatest contribution to identifying oral-cancer patients. To this end, a traditional two-sample t-test procedure can be carried out to detect the miRNAs that have significant expression differences between the two groups. However, for the simultaneous testing of 600 genetic readings, classical methods to control the probability of false discoveries are no longer relevant. Advanced adjustments are often needed to control the false-discovery rates. Another common strategy is to build a logistic regression of the tumor type on the miRNA readings; we can then identify the relevant miRNAs by selecting the most important regression covariates. However, the number of covariates p is 600, and the number of participants n is just 56. This large-p-small-n situation places this problem outside the domain of classical model selection methods (see Section 1.3.2). We need innovative methods to deal with the high dimensionality. 1.2.2 Chronic disease studies Chronic diseases are the leading cause of death in North America. Chronic conditions such as kidney disease, cardiovascular disease, anemia, and dementia result in long-term or permanent disability for millions of people, with serious qualityof-life consequences for them and their families. The Public Health Agency of Canada explores the experiences of Canadians with chronic health conditions by conducting the Survey on Living with Chronic Diseases in Canada (SLCDC) on the targeted population. One of the main objectives of SLCDC is to identify health behaviors that influence disease outcomes, so that the government can better plan and provide health services for people with chronic diseases. Regression models are conventionally used in the analysis of SLCDC data; the goal is to detect the influential factors through a variable selection procedure. However, when variable selection is applied to survey data, many potential complications arise. First, the data collected through survey sampling are usually obtained from a finite population without replacement, and hence they have an intrinsic de- 4 pendence structure (i.e., non-i.i.d.). Second, in complex survey designs such as the SLCDC, the inclusion probabilities of sampling units often vary across the target population. Consequently, the correlation between the response and the covariates seen in the sample can be different from that of the population. Ignoring the survey design in the selection process may result in biased results. These special features of survey sampling reduce the effectiveness of traditional selection methods that are developed for i.i.d. sample situations. We need new methods that take into account the special features. 1.2.3 Market segmentation In 2008 a BC marketing company collected data consisting of the annual diningout expense for 1679 households randomly sampled in BC, Canada. The goal of their study was to identify potential subgroups of customers with different needs and consumption behaviors. A customer grouping strategy (i.e., market segmentation) is important for restaurant managers, because providing food services that are suitable for different customers often leads to higher profits. To identify the proper segmentation, cluster analysis is often used: customers with similar characteristics are assigned to a homogeneous group. Finite mixture models are among the most powerful and widely used clustering tools. They divide the overall heterogeneous population into a mixture of several subpopulations (components), where each subpopulation represents a single cluster. A mixture model with an excessive number of components (a high order) usually overfits the data and has poor interpretive value. Determining the appropriate number of components (order selection) is crucial in applications such as market segmentation. Because of the nonregularity of finite mixture models, the classical selection criteria are no longer optimal. We need new selection methods that take into account the special features of finite mixture models. 1.3 Traditional feature selection methods The issue of feature selection has received much attention. Classical selection approaches typically consist of two parts: a selection criterion for the comparison of models with different sets of variables (features) and an associated implementation. 5 Traditional selection methods are typically designed for low-dimensional and independent settings. However, an understanding of their principles, working schemes, and limitations is of vital importance for the development of new methodologies. In this section, we briefly review two classical selection criteria, the Akaike information criterion (AIC; Akaike [1973]) and the Bayesian information criterion (BIC; Schwarz [1978]), and discuss their corresponding implementations. 1.3.1 Selection criteria: AIC and BIC Suppose the data {di = (yi , xi ), i = 1, . . . , n} are collected independently, where yi is the ith observation of the response variable and xi = {xi1 , . . . , xip }T is the associated p-dimensional covariate vector. In a typical regression context, (yi , xi ) is assumed to be a random sample from the population D = (Y, X), where the conditional mean of Y depends on a linear form Xβ with coefficients β = (β1 , . . . , βp )T . In applications, often the effects of many covariates are unimportant, so we may assume that the corresponding coefficients are zero. Feature selection aims to identify all the covariates with nonzero coefficients. This procedure is also referred to as variable selection. Under a more general framework, suppose the data d = (d1 , . . . , dn ) are generated from an unspecified density function f (d; θ ∗ ) with a q-dimensional parameter vector θ ∗ = (θ1∗ , . . . , θq∗ )T . Usually, we are uncertain about the true density f (d; θ ∗ ) and instead assume a larger family of models f (d; θ), in which θ ∗ is a nonvanishing subvector of the p-dimensional parameter θ = (θ1 , . . . , θp )T . The goal of feature selection is to estimate the dimension of the true model by comparing candidate models with different dimensions. In the literature this is often referred to as model selection. For convenience of presentation, we do not distinguish the terms feature selection, variable selection, and model selection in the rest of this chapter. We use a unified notation s to indicate a subset of {1, . . . , p}, which represents a candidate model with parameters θ s = {θj : j ∈ s}, and denote by s∗ the true model with θ∗ . Also, we use τ (s) to indicate the dimension of θ s . Clearly, τ (s∗ ) = q. 6 Under the model settings described above, the log-likelihood function of θ is l(θ; d) = n ∑ log f (di ; θ). (1.1) i=1 AIC and BIC are members of a more general family of penalized model-fit statistics (referred to as “GIC”), applicable to a wide range of statistical models fitted by the maximum likelihood method, which takes the form ˆ s ; d) + cτ (s), GIC(s) = −2l(θ (1.2) ˆ s is the maximum likelihood estimator (MLE) of θ(s) based on s and c where θ is a positive constant that differs from one model selection criterion to another. The model with the smallest GIC is selected. The magnitude of the GIC is not generally interpretable, but differences between GIC values for different models are of interest. It can be seen that the GIC is a decreasing function of the maximized log-likelihood and an increasing function of the number of variables included in the model. Hence, a lower GIC implies either a simpler model (fewer variables), a better fit (higher maximized likelihood), or both. A model that balances complexity and goodness of fit is preferred. With c = 2 and c = log n, we obtain AIC and BIC respectively: ˆ s ; d) + 2τ (s) AIC(s) = −2l(θ (1.3) ˆ s ; d) + log n · τ (s). BIC(s) = −2l(θ (1.4) Note that the penalty for BIC grows with the sample size, while that for AIC remains constant. When n ≥ 8, the penalty for BIC is larger than that for AIC, and therefore BIC tends to select models with fewer variables. Although AIC and BIC have similar forms as shown in (1.2), they are based on different statistical considerations. AIC Suppose as before that we have a set of candidate models S under consideration, with each s ∈ S having parameters θ s to be estimated from the data. As defined 7 in (1.1), the model s infers the probability distribution fs (d) = ∏n i=1 f (di ; θ s ) for the observations d. This serves as an approximation to the true distribution ∏ f∗ (d) = ni=1 f (di ; θ ∗ ). From this point of view, the “best” model is the one that provides the most accurate approximation of f∗ (d). The Kullback-Leibler information is a measure of the distance between two distributions, representing the information “lost” when the second distribution is used to approximate the first. The AIC approach applies the Kullback-Leibler information to the difference between f∗ (d) and fs (d): ∫ I(f∗ , fs ) = = ∫ f∗ (d) log f∗ (d) d(d) fs (d) f∗ (d) log f∗ (y)d(d) − ∫ f∗ (d) log fs (d)d(d) = E[log f∗ (d)] − E[log fs (d)]. (1.5) The best model is then the model s that minimizes the Kullback-Leibler loss (1.5). Note that E[log f∗ (d)] = ϕ is a constant that does not depend on the model and is therefore irrelevant to the model comparison. The second term E[log fs (d)] can be ˆ s ) with an asymptotic bias approximated by the maximized log-likelihood ln (d; θ approximately equal to the number of variables in model s. In other words, we have ˆ s ; d) + τ (s) = ϕ + 1 AIC(s), I(f∗ , fs ) ≈ ϕ − ln (θ 2 which implies that a comparison of the Kullback-Leibler losses (1.5) is approximately equivalent to a comparison of the corresponding AIC values. It can be seen that the AIC approach aims to minimize the Kullback-Leibler deviation between the true distribution f∗ (d) and the distribution fs (d) under a particular candidate model s. Therefore, in some sense, the model selected by AIC achieves the “best” possible approximation to f∗ (d) over the candidate models, even when the true model s∗ is not included in the model space S. However, it is well known that AIC is not a consistent selection criterion, since it does not correctly select the true model s∗ with probability approaching 1 in large samples when s∗ is included in model space S. For further discussion of the consistency and efficiency of AIC, see Shibata [1983], Shao [1997], and Yang [2005]. 8 BIC BIC has its origin in the Bayesian framework; it compares the degree of support in the data for two models. Suppose as before that we have a set of candidate models S for the observations d, and each model s ∈ S has a parameter vector θ s with τ (s) elements to be estimated. Under model s, the density of d with a given value ∏ of θ s is fs (d; θ s ) = ni=1 f (di ; θ s ). Assume that, conditioning on s, the prior density of θ s is π(θ s ). Then, the marginal density of d under model s is ∫ P (d|s) = fs (d; θ s )π(θ s )dθ s , and the posterior probability of s given d is P (s|d) = ∑ P (d|s)P (s) , s∈S P (s)P (d|s) where P (s) denotes the prior probability of model s. The model with the highest posterior probability P (s|d) is then considered to receive the most support ∑ from the data. Since s∈S P (s)P (d|s) is a constant for any choice of the model, choosing a model with the highest P (s|d) is equivalent to choosing a model that maximizes P (d|s)P (s). Under some regularity conditions on the density fs (d), −2 log{P (d|s)} has a Laplace approximation given by BIC (1.4) up to an additive constant. Therefore, with uniform prior settings on s (i.e., P (s) is constant for s ∈ S), the BIC approach is approximately equivalent to comparing the posterior probabilities P (s|Y ). Under some conventional assumptions such as the independence structure of d and a fixed data dimension p, Rao and Wu [1989] established the consistency of BIC by showing that it is asymptotically minimized by the true model s∗ . However, as discussed in Section 1.2, scientific studies often encounter data with high dimensionality and a dependence structure; these features impact the efficiency and theoretical optimality of BIC. To show that BIC can be unsatisfactory, let us consider the genomic example of Section 1.2, where the number of candidate covariates p is 600. Specifically, let the model space be partitioned into subclasses according to the number of covariates in a model. Then, under the constant prior setting, the probability assigned to a 9 subclass is proportional to its size. In particular, the subclass of models containing a single covariate, S1 , has size 600, while the subclass of models containing two covariates, S2 , has size 600 × 599/2. Thus, the prior probability assigned to S2 is 599/2 times that assigned to S1 . It can also be seen that the prior assigned to Sj increases almost exponentially as j increases to p/2 = 300. Consequently, the subclasses of large models receive much greater priors than those with small models. This encourages the selection of large models in the large-p situation, which is strongly against the principle of parsimony. Therefore, BIC is often unsatisfactory for high-dimensional data analysis. 1.3.2 Selection procedures Best subset selection In best subset selection, a selection criterion such as AIC or BIC is evaluated for each candidate model, and the model with the best “score” is selected. Best subset selection is effective when there are only a few candidate models; it is impractical for a large number of variables. In fact, even for datasets with a moderate number of variables, the total number of candidate models can be too large to be manageable. For example, suppose the number of covariates p in a regression model is 30. This implies that there are 230 ≈ a billion possible candidate models to explore, which is obviously a computationally expensive task. Stepwise selection Stepwise selection is another commonly used procedure for feature selection; it typically includes forward selection and backward elimination. In the regression context, the forward selection starts with a null model with no covariates, and then adds to the model the covariate that is most correlated with the current residual. The procedure builds a sequence of models by successively including one covariate at a time up to a prespecified number of steps or until all the covariates are included. In contrast, the backward elimination begins with the full model including all covariates and then forms a sequence of models by deleting one covariate at each step. The models in the sequence are then assessed according to some se- 10 lection criterion. Compared with best subset selection, stepwise selection avoids exhaustive comparisons of all the candidate models and therefore has a lower computational cost. When p = 30, there are only about 21 p2 = 450 models to be considered. However, stepwise methods have been found to be unstable in the selection process: a small change in the data could cause a very different selection result (Breiman [1995]). This is partially because once a covariate has been added to (removed from) the model at any step in the stepwise selection procedure, it is never removed from (returned to) the final model. Therefore, for complex data, the stepwise methods can easily lead to a locally optimal model. Consequently, selection results based on stepwise methods can be unreliable in practice. 1.4 Penalized likelihood methods Various penalized likelihood methods (PLMs) have been developed for the purpose of feature selection. These methods include the least absolute shrinkage and selection operator (LASSO; Tibshirani [1996]), bridge regression (Fu [1998]), the elastic net (Zou and Hastie [2005]), and the smoothly clipped absolute deviation (SCAD; Fan and Li [2001]). The shrinkage idea of PLM has been demonstrated to cope well with many of the challenging features of contemporary data analysis. Compared with traditional methods, the PLM possesses two major advantages. First, its selection procedure is continuous, and hence it provides more robust selection results; second, it is computationally efficient, which is crucial for applications with high-throughput data. In this section, we introduce the PLM framework and discuss the theoretical and computational issues. 1.4.1 The penalized likelihood and penalty functions Given the model settings in Section 1.3.1, the penalized likelihood is defined as Q(θ) = l(d; θ) − n p ∑ ϕλ (|θj |), (1.6) j=1 where ϕλ (.) is a penalty function indexed by a tuning parameter λ controlling the amount of regularization in θ. Maximizing the penalized likelihood (1.6) results in 11 ˆ λ for θ. a maximum penalized likelihood estimator (MPLE) θ ˆ λ . The L0 regularizaThe form of ϕλ (.) determines the general behavior of θ tion, i.e., ϕλ (|θ|) = λI(|θ ̸= 0|), penalizes the number of variables included in a candidate model and produces a sparse estimation for θ. For models of the same size (number of variables), the one that maximizes the unpenalized likelihood is preferred. However, the L0 penalty is not continuous, and the maximization of the corresponding penalized likelihood coincides with the GIC-based best subset selection procedure, which requires exhaustive search and is therefore computationally demanding. With the L1 penalty ϕλ (|θ|) = λ|θ|, we obtain the LASSO, which continuously shrinks the model parameters toward zero as the tuning parameter λ increases. Because of the singularity of the L1 penalty at the origin, some parameters can be shrunk to exact zero when λ is sufficiently large. Thus, LASSO qualifies as a variable selection operator. Because of the continuity of the shrinkage process, LASSO often leads to a more stable selection result than best subset selection does. In addition, the continuous shrinkage often improves the predictive ability of the model because of the bias-variance trade-off (Tibshirani [1996]). It is well known that the L2 penalty ϕλ (|θ|) = λ|θ|2 results in a ridge regression, which continuously shrinks the model parameters toward zero but does not set them exactly to zero, and hence is not suitable for variable selection purposes. We also observe that the Lγ penalty ϕλ (|θ|) = λ|θ|γ with 0 < γ < 2 leads to a bridge regression (Frank and Friedman [1993], Fu [1998]). The bridge shrinkage is continuous only when γ ≥ 1, while a sparse solution can be obtained only when γ ≤ 1. Fan and Li [2001] advocate penalty functions that give estimators (MPLEs) with the following three desirable properties: • Sparsity: The estimator should automatically set small estimated model parameters to zero to reduce the model complexity. • Unbiasedness: The estimator should have low bias, especially when the true value of the model parameter is large. • Continuity: The estimator should be continuous to avoid instability in the model selection and prediction. 12 20 15 10 5 0 L1 L.5 SCAD MCP −10 −5 0 5 10 Theta Figure 1.1: Some commonly used penalty functions, with λ = 2 for the l1 penalty; λ = 5 for the l0.5 penalty; (λ = 2, a = 3.7) for the SCAD and MCP penalties. Of the three requirements, sparsity is the most crucial for feature extraction and variable selection, while continuity improves the robustness of the selection process. Unbiasedness is mainly required for the prediction issue, where the accuracy of the parameter estimates is the major concern. In general for a penalty function, singularity at the origin is required to generate a sparse MPLE, while concavity is needed to reduce the estimation bias. It is known that the Lγ penalty with γ > 1 does not have sparsity, L1 does not have unbiasedness, and Lγ with γ < 1 does not have continuity. Therefore, none of the Lγ penalties possesses all three properties. Fan and Li [2001] suggested the smoothly clipped absolute deviation (SCAD) penalty, which is defined through the following derivative } { (aλ − θ)+ I(θ > λ) ϕλ (|θ|) = λ I(θ ≤ λ) + (a − 1)λ ′ (1.7) for some a > 2 and ϕλ (0) = 0. As shown in Figure 1.1, the SCAD penalty takes off from the origin as the L1 penalty and then gradually levels off as the model 13 parameter increases. These features ensure that SCAD maintains the sparsity and continuity of l1 , and it is unbiased because it does not apply excessive shrinkage to the model parameters. Another penalty in a similar spirit is the minimax concave penalty (MCP) proposed by Zhang [2010], the derivative of which is given by ′ ϕλ (|θ|) = (aλ − θ)+ . a (1.8) The MCP differs from SCAD for small values of θ with a strictly decreasing derivative from the origin, which further discourages the over-regularization of the model parameters. In the literature, many other penalty functions have been proposed. To account for grouping effects, Zou and Hastie [2005] suggested a linear combination of the L1 and L2 penalties and called the associated PLM the elastic net. To improve the estimation accuracy, Zou [2006] investigated the use of a weighted L1 penalty and proposed the adaptive LASSO. Figure 1.1 depicts some of these commonly used penalty functions. The shrinkage idea of PLM can also be realized through a full Bayesian analyˆ λ can be sis. ¿From a Bayesian point of view, the penalized likelihood estimator θ interpreted as the posterior mode estimate when the model parameters θ have correspondingly informative priors. In particular, it is well known that the L1 penalty in LASSO corresponds to an i.i.d. Laplace (i.e., double-exponential) prior of the coefficients (Park and Casella [2008]). Compared with the PLM, Bayesian shrinkage provides more convenient interval estimates of model parameters (i.e., from the estimated posterior), but usually at a cost in terms of computational efficiency. In this dissertation, we focus on the PLM and leave the potential issues of Bayesian shrinkage methods to future research. 1.4.2 Asymptotic properties of PLM The sampling properties of PLM have been extensively studied. The consistency ˆ λ for feature selection and parameter estimation is of particular of the MPLE θ interest. These two modes of consistency are defined by • Estimation Consistency: • Selection Consistency: ˆ λ − θ ∗ ∥2 →p 0, ∥θ as n → ∞, P ({j : θˆjλ ̸= 0} = s∗ ) → 1, 14 as n → ∞, where ∥.∥2 denotes the Euclidean norm. Estimation consistency implies that as ˆ λ approaches the true value θ ∗ with probathe sample size increases the MPLE θ bility tending to 1, which is desirable for the parameter estimation. On the other ˆ λ eliminates hand, selection consistency means that, with probability tending to 1, θ unimportant features by estimating their coefficients at zero, which is essential for a good feature selection method. An estimator that is consistent in terms of the parameter estimation does not necessarily consistently select the true model, and vice versa. A good estimator is consistent in both modes. However, these two modes of consistency serve different purposes, and they are often discussed separately. Penalized least squares and the LASSO In the literature, the L1 regularization (e.g., LASSO) has received a great deal of attention because of its sparsity and convexity. Several remarkable contributions have been made in the context of the L1 -penalized least squares problem, which is formulated as a specific form of (1.6): } { min n−1 ∥y − Xβ∥22 + λ∥β∥1 , β (1.9) where y = (y1 , . . . , yn ) is the response vector, X = (x1 , . . . , xn )T is the n × p design matrix with column entries corresponding to the p covariates, β is the pdimensional regression coefficient, and ∥.∥1 denotes the L1 norm. The solution to ˆ , which can be used for (1.9) leads to the (sparse) LASSO estimate of β, say β λ both parameter estimation and variable (i.e., the column entries of X) selection. Under some regularity conditions on the design matrix, Knight and Fu [2000] ˆ , provided λ → 0 as n → ∞. Moreestablished the estimation consistency of β λ ˆ has positive probability over, they showed that the limiting distribution of β λ masses at zero for the coefficients of irrelevant covariates, which provides insightful justification for using LASSO for variable selection. Zhao and Yu [2006] further characterized the selection consistency of LASSO by studying a stronger but technically more convenient property, sign consistency, ˆ ) = sgn(β ∗ )) with β ∗ denoting the true value of which is defined as P (sgn(β λ β. In particular, they derived an irrepresentable condition required for the sign 15 (selection) consistency of LASSO. Following their notation, let β ∗ be split into two parts, β ∗ = (β ∗1 , β ∗2 ), with β ∗2 assumed to be zero. Also, let the design matrix X be written as {X(1), X(2)} accordingly. The irrepresentable condition requires that, for some constant positive vector η (not depending on n), |{X(1)T X(1)}−1 X(1)T X(2)| < 1 − η where 1 is a q × 1 vector with all elements equal to 1 and the inequality holds elementwise. The irrepresentable condition can be interpreted as a correlation constraint between the irrelevant covariates X(2) and the relevant covariates X(1). Under the irrepresentable condition and additional requirements, Zhao and Yu [2006] showed that the LASSO is sign-consistent for an appropriate choice of λ, even when the number of covariates p diverges with the sample size n at rate log p = O(nα ) for some 0 < α < 1. However, the irrepresentable condition is hard to verify in practice and can become restrictive in high-dimensional situations. Addressing a slightly different but closely related problem, Meindhausen and Buhlmann [2006] established the selection consistency of LASSO in the context of Gaussian graphical models, under conditions on the design matrix similar to those in Zhao and Yu [2006]. Zou [2006] further provided a necessary condition for the selection consistency of LASSO. In particular, he showed that, under some general assumptions on the design matrix, LASSO is not variable-selection consistent. The selection bias of LASSO is mainly due to the intrinsic difficulty in distinguishing highly correlated covariates. To correct the bias, Zou [2006] proposed an adaptively weighted L1 penalty, which is defined by λ p ∑ ωj |βj |, j=1 where ω = {ω1 , . . . , ωp } denotes a prespecified weight vector. He further sugˆ with β ˆ being some root-n consistent estimator of β, so that gested ω = 1/|β|, the penalty is decided adaptively by the data: plausible covariates receive a lower penalty. This strategy accelerates the shrinkage of the coefficients of irrelevant covariates, provided the weight ω is appropriately specified. Under conditions similar to those used in Knight and Fu [2000], the adaptive LASSO has been shown 16 to have both estimation and selection consistency. However, finding such a root-n consistent estimator of β might not be straightforward in high-dimensional situations with p ≫ n. Penalized likelihood and the nonconvex penalties Nonconvex regularization methods (e.g., bridge regression and SCAD) have also received considerable research attention. In a seminal paper, Fan and Li [2001] built the theoretical foundation of nonconvex PLMs for feature selection. In the framework of generalized linear models (GLMs; McCullagh and Nelder [1989]), Fan and Li [2001] showed that there exists a local maximizer of (1.6) that converges to the true value of the model coefficients θ ∗ at a rate Op (n− 2 +an ) with 1 an = max{ϕ′λ (|θj∗ |) : θj∗ ̸= 0}. This result implies that if we choose an appropriate ϕλ (.) such that an = O(n− 2 ), the corresponding MPLE is root-n consistent for 1 the parameter estimation. In particular, this is the case when the SCAD penalty is used in (1.6) with λ = o(1). Moreover, with additional requirements on ϕλ (.) such as concavity in |θ|, Fan and Li [2001] demonstrated that such a root-n estimationconsistent MPLE is also selection consistent. They referred to this feature of some nonconvex PLMs (e.g., SCAD and MCP) as the oracle property; this means that the MPLE consistently identifies influential variables in a model and estimates their coefficients as efficiently as does the MLE based on the true model. For potential applications with high dimensionality, Fan and Peng [2004] extended the results of Fan and Li [2001] to diverging p cases with p = o(n1/3 ). Recently, Fan and Lv [2011] illustrated the oracle property of nonconvex PLMs even when log p = O(nξ ) for some ξ ∈ (0, 1). In addition, they derived sufficient conditions that guarantee asymptotic equivalence between the global maximizer of the SCAD-penalized likelihood and the oracle estimator. Their results also suggest that L1 -based PLMs generally can not achieve selection and root-n estimation consistency simultaneously, and thus in general they do not have the oracle property. 1.4.3 Tuning strategies As discussed in Sections 1.4.1 and 1.4.2, the attractive features of PLM depend on an appropriate choice of ϕλ (.). Given a specific form of the penalty function ϕλ (.), 17 we must select a proper tuning parameter λ that controls the amount of regularization. It is well known that excessive regularization may lead to the elimination of important variables, while insufficient shrinkage retains too many irrelevant variables in the model. Hence, the selection of λ is critical for achieving the advantages of PLM in practice. To address this issue, Tibshirani [1996] used the m-fold cross-validation (CV; Stone [1974]) to select the λ for LASSO. Given the model settings and the notation in Section 1.3.1, this procedure works as follows: First, we divide the full dataset ˆ λ (−t) d into T separate sets, say {dt } for t = 1, . . . , T , and then find the MPLE θ of θ based on d − dt . We choose an optimal λ by minimizing CV(λ) = − T ∑ ˆ λ (−t)) = − l(dt , θ t=1 T ∑ ∑ ˆ λ (−t)). log f (di , θ t=1 i∈dt The CV-based tuning method provides a λ that yields a model with the “optimal” prediction accuracy. In the same spirit, Fan and Li [2001] suggested a tuning method based on generalized cross-validation (GCV; Craven and Wahba [1979]), which selects the λ that minimizes GCV(λ) = ˆλ) −l(d, θ , n(1 − τ (sλ )/n)2 ˆ λ . Compared with the CV method, where sλ denotes the model corresponding to θ the GCV method is computationally more convenient, and it has been widely used. However, from a variable-selection point of view, the GCV-based method has been shown to be inconsistent (Wang et al. [2007]); it tends to select many irrelevant variables. Wang et al. [2007] advocated the use of BIC for tuning a PLM. This approach finds λ by minimizing BIC(λ) = −2l(d, θˆλ ) + τ (sλ ) log n. In the context of penalized least squares with finite-parameter settings, Wang et al. ˆ BIC is consistent in [2007] showed that the SCAD estimator with λ chosen by λ variable selection. This result has been extended to the diverging p cases with 18 p = o(n) in Wang et al. [2009]. In recent work, Chen and Chen [2012] proposed a family of extended Bayesian information criteria (EBIC) under the GLM setup, which is designed for highdimensional model selection with a sound Bayesian motivation (see Section 2.4.2). EBIC has been found to be an effective tool for choosing the tuning parameter for PLM in situations where p ≫ n (Wang [2009], She [2011]). Specifically, the EBIC-based tuning strategy finds the λ that minimizes ˆ λ )) + τ (sλ )(log n + γ log p) EBIC(λ) = −2l(d, θ for some 0 ≤ γ ≤ 1. Clearly, for γ = 0 EBIC reduces to the BIC-based tuning strategy. By choosing γ > 0, EBIC places more penalties on the model complexity by linking the regularization to the number of candidate variables p. This modification is particularly helpful for selecting an appropriate parsimonious model in ultra-high-dimensional applications where p ≫ n, such as the genomic example introduced in Section 1.2.1. 1.4.4 Computational strategies In the last decade, substantial progress has been made in solving optimization problems related to PLMs. Various numerical methods have been proposed for finding the maximizer (i.e., the MPLE) from the penalized likelihood (1.6). In principle, when convex penalties (e.g., L1 and L2 ) are used, the objective function (1.6) is concave and convex optimization algorithms can be conveniently applied. For nonconvex penalties (e.g., SCAD and MCP), however, the task becomes more challenging since (1.6) is no longer concave in general, and the numerical procedure may lead to local maxima. In this subsection, we briefly review several algorithmic contributions for convex and nonconvex PLMs respectively. LARS and CD As a fundamental regularization problem, the L1 -penalized least squares (1.9) has been studied extensively. Efficient algorithms have been developed to numerically find the LASSO estimate. In a seminal paper, Tibshirani [1996] treated the numerical problem related to LASSO as a linearly constrained least squares problem, 19 which can be solved by standard quadratic programming techniques. Efron et al. [2004] developed a fast and efficient least angle regression (LARS) procedure for the variable selection, which can be modified to provide the entire solution path for ˆ : λ > 0}). LASSO ({β λ Following (1.9), the LARS algorithm works roughly as follows. Starting with all the coefficients β1 , . . . , βp equal to 0, LARS finds the covariate xl that is most correlated with the response y. Then, the LARS moves βl from 0 toward its marginal least squares estimate of y on xl , until some other covariate xh has the same correlation with the current residual as that of xl . Next, the LARS algorithm moves (βl , βh ) in the direction defined by their joint least squares estimate of the current residual on (xl , xh ), until a third covariate has the same correlation with the current residual. The procedure continues in this way until all the covariates have been added to the model. LARS can be modified to provide the exact solution path of LASSO: if any nonzero coefficient reaches zero in a LARS procedure, we drop the corresponding covariate from the model and recompute the current residual. After p steps (p < n), we reach the least squares estimate of β based on the full model, which corresponds to the LASSO estimate with λ = 0. The idea of LARS has been extended to efficiently solve the L1 -penalized likelihood problem in the context of GLMs (McCullagh and Nelder [1989]). Because of the computational advantage of the LARS algorithm, LASSO quickly became a popular technique for dimensionality reduction and feature extraction. Recently, the coordinate-wise descent algorithm (CD) has received attention for its ability to solve the numerical problem related to convex PLMs (Fu [1998], Friedman et al. [2007]). The idea of CD is quite simple: it iteratively optimizes the objective function (1.6) one variable at a time. To many researchers’ surprise, this seemingly naive strategy works amazingly well for a wide range of convex regularization problems, including LASSO, the nonnegative garotte (Breiman [1995]), and the elastic net. Taking LASSO as an example, Friedman et al. [2007] demonstrated that CD is competitive with the well-known LARS, and thus it has great potential for high-dimensional problems. 20 LQA and LLA In nonconvex PLMs, the irregular shape of the objective function poses a challenge. To address this issue, Fan and Li [2001] suggested locally approximating a nonconvex penalty by a quadratic function (LQA), i.e., ′ ϕλ (|θ|) ≈ ϕλ (|θ0 |) + 1 pλ (θ0 ) 2 (θ − θ02 ) 2 |θ0 | for θ close to some initial value θ0 . With the aid of LQA, Newton-type algorithms can be modified to solve the MPLE from (1.6) with a nonconvex penalty. For example, we can use the MLE as the initial value and obtain the MPLE through an updating procedure by θ (k+1) = arg max ′ l(θ) − n (k) p ∑ ϕλ (|θj |) (k) 2|θj | j=1 θj2 . However, the sequence β (k) obtained from LQA may not be sparse for any fixed k and hence is not directly suitable for feature selection. Fan and Li [2001] fur(k) ther suggested setting θj (k) (k) = 0 if |θj | is sufficiently small, say |θj | < ε0 for some tolerance level ε0 , and removing the corresponding covariate from the model. Although it is useful in practice, such an ad hoc step may bring instability to the procedure: once a variable has been deleted, it can never be considered again. Furthermore, the choice of ε0 has a direct impact on the final selection result. Chossing an appropriate ε0 might be problematic in applications. To avoid this drawback, Zou and Li [2008] proposed a one-step algorithm based on local linear approximation (LLA) to the penalty function: 1 ′ ϕλ (|θ|) ≈ pλ (|θ0 |) + ϕλ (|θ0 |)(|θ| − |θ0 |), 2 which leads to a similar iteration procedure: θ (k+1) = arg max l(θ) − n p ∑ j=1 21 ′ (k) ϕλ (|θj |)|θj | . Because this form of LLA shares the common traits of the L1 penalty, the final estimate has a sparse structure and hence avoids the ad hoc step. More importantly, efficient algorithms developed for convex PLMs (e.g., LARS) can be directly adopted in the updating procedure. Zou and Li [2008] established the convergence of the LLA procedure and further showed that, starting with a root-n consistent estimator, the nonconvex MPLE obtained by LLA has the oracle property after a single iteration. However, obtaining such a “good” initial value may be an issue, especially for the large-p-small-n cases frequently encountered in genetic applications. Recently, She [2009] proposed an iterative thresholding-based algorithm, which provides a novel approach for nonconcave PLMs. In this strategy, the multivariate nonconcave objective function is transformed into equivalent univariate functions for which simple optimization can be conveniently handled. For the L1 penalized least squares problem (1.9), the iterative thresholding-based algorithm is identical to the CD procedure. Thus, it shares advantages with the efficient algorithms developed for convex PLMs. We provide a more detailed discussion of the thresholdingbased algorithm in Section 4.3. 1.5 Contributions of the dissertation The PLM has received much attention, and it is applicable to a wide range of statistical models in a variety of scientific areas. In this dissertation, we address new research problems in feature selection arising from several applications of the PLM. As mentioned in Section 1.2, in computational genomics, the number of candidate features can be much larger than the sample size (ultra-high dimensionality). In such applications, the number of features can be so large that even the computationally attractive PLM is not adequate. A super-efficient procedure to quickly screen a large number of candidate features is therefore essential. Fan and Lv [2008] proposed sure independence screening (SIS), which screens features based on their marginal correlations with the response. It is natural to conjecture that accounting for joint effects among the candidate features will be beneficial for the screening. In this spirit, in Chapter 2 we propose a novel screening approach via the sparsity-restricted maximum likelihood estimator (SMLE) and investigate its 22 performance. The SMLE estimates the high-dimensional model coefficients in a designated low-dimensional subspace and screens features by setting their coefficients to zero. The features passed by SMLE are then subject to a more elaborate selection via the PLM. SMLE incorporates joint effects among features by jointly estimating their model coefficients, and thus it has the potential to provide more reliable screening results than SIS provides. We show that SMLE enjoys the sure screening property of Fan and Lv [2008] in the ultra-high-dimensional GLM setup, and we develop an efficient algorithm for its implementation. In addition, we establish estimation and selection consistency for the SMLE-based PLM and propose the use of EBIC for the tuning parameter selection. The effectiveness of the new methods has been demonstrated in simulation studies. Regression models are routinely used to analyze survey data; they identify the influential factors for certain social or behavioral indices in a target population. As discussed in the LSCDC example of Section 1.2, when data are collected through survey sampling from a finite population without replacement, they have an intrinsic dependence structure and may not represent the target population. To avoid distorted conclusions, survey weights are usually adopted in the estimation of parameters in regression models based on survey data. Incorporating the survey weights may also be beneficial for variable selection. To investigate this, in Chapter 3 we explore the use of pseudo-likelihood to take account of the survey weights and study a penalized pseudo-likelihood method (PPLM) for the variable selection of survey data. In a joint randomization framework, we prove that the PPLM consistently identifies the influential variables through BIC-based tuning. The finite-sample performance of the approach is assessed via analysis and computer simulations based on data from the hypertension component of the 2009 Survey on Living with Chronic Diseases in Canada. The results show that, compared with the standard PLM, the PPLM helps to avoid biased selection results due to informative sampling and provides protection against model mis-specification. In market research, finite mixture models are frequently used to depict the heterogeneity of an overall data structure. Selecting the most suitable number of mixture components (the order) is fundamental to these applications. In Chapter 4, we address order selection for finite mixture models, which provides a flexible tool for modeling data from a heterogeneous population. The PLM procedure proposed by 23 Chen and Khalili [2008] is attractive for order selection. The method fits a highorder mixture model to the data via the penalized likelihood, and the nonsmooth penalty helps to merge close components to achieve a lower order. However, this method requires maximizations over nonsmooth and nonconcave objective functions, which are computationally challenging. The commonly used LQA approach fails to provide a sparse solution, which might lead to unstable selection results. To tackle this problem, we transform the original multivariate objective functions into a sum of univariate functions and design an iterative thresholding-based algorithm to efficiently solve the sparse maximization without ad hoc steps. We further show the ascent property of the proposed algorithm, i.e., each update to the parameter estimates increases the value of the objective function. This desirable property not only helps to design an appropriate stopping rule in practice but also leads to the convergence of the algorithm. Our simulation studies show that the new algorithm reduces the computational time by approximately 40% in comparison with LQA. To further reduce the computational burden, we propose a revised BIC criterion to select the tuning parameter of Chen and Khalili’s method, where the regular BIC penalty on the model complexity is reduced by half. We do this because an extra component in a mixture model usually leads to a single extra degree of freedom in the limiting distribution of the likelihood-based test statistic. We demonstrate the efficiency of the proposed method via numerical studies. Lastly, in Chapter 5, we summarize the main findings of this dissertation and present several directions for future research. All the technical derivations and proofs are provided in Appendices A–C. 24 Chapter 2 The Screening-Based PLM in Ultra-High-Dimensional Feature Spaces 2.1 Introduction High-dimensional datasets with many variables are frequently encountered in modern scientific research (Hastie et al. [2009], Donoho [2000], Fan and Lv [2010]). It is often important to identify the features that influence the response (feature extraction). Examples include detecting the biomarkers responsible for rare diseases and finding the stocks that generate profits in investment portfolios. Regression models are frequently used, and the feature extraction is typically performed by a variable selection procedure. Traditional selection approaches, such as best subset selection and stepwise regression, can be computationally expensive and instable in the selection process. The PLMs, including LASSO (Tibshirani [1996]), SCAD (Fan and Li [2001]), the elastic net (Zou and Hastie [2005]), and MCP (Zhang [2010]), are now being used as computationally feasible alternatives for variable selection. These approaches are practical and can be selection consistent. However, when the number of features p is much larger than the sample size n (ultra-high-dimensional cases), as is 25 common in genetic studies, the direct use of PLMs is difficult. For PLMs with nonconvex penalties (e.g., MCP and SCAD), finding the global maximizer of the corresponding penalized likelihood is computationally challenging. For PLMs with convex penalties (e.g., LASSO and the elastic net), selection consistency may not hold in general. More importantly, the choice of an appropriate tuning parameter for the PLM in ultra-high-dimensional applications has not yet been determined. To overcome the difficulties of the large-p-small-n situation, Fan and Lv [2008] proposed reducing the number of candidate features to a manageable level before applying a more elaborate selection method. They suggested SIS, which screens the original p features according to their marginal correlations with the response variable. This approach efficiently removes most of the irrelevant features while retaining important features with a high probability (sure screening). Wang [2009] proposed the use of classical stepwise forward regression (FR) for feature screening and established its consistency in the sense of Fan and Lv [2008]. Feature screening techniques simplify the high-dimensional variable selection to a lowerdimensional problem, where the PLM can be conveniently applied. Motivated by the insights of these methods, in this chapter we propose a novel approach for feature screening via the sparsity restricted maximum likelihood estimator (SMLE). SMLE estimates the high-dimensional model coefficients in a designated low-dimensional subspace and naturally screens features by setting their coefficients to zero. Unlike SIS, SMLE accounts for joint effects between features by jointly estimating their model coefficients. Thus, it has the potential to provide more reliable screening results in practice. The SMLE approach belongs to a more general class of sparsity-constrained approximation methods that have been widely adopted in wavelet analysis, signal processing, and compressed censoring. Sparsity-constrained methods are frequently used to construct a parsimonious representation (approximation) of highresolution images/signals for fast transmission and recovery (Donoho [2006], Cand`es et al. [2006], Blumensath and Davies [2009]). We attempt to investigate the potential of sparsity-constrained methods (i.e., SMLE) for feature screening. In particular, we show that the SMLE enjoys the sure screening property in the sense of Fan and Lv [2008] in the context of high-dimensional GLMs (McCullagh and Nelder [1989]), where the number of covariates p can be considerably larger than the sample size 26 n. An iterative hard thresholding-based algorithm (IHT; Blumensath and Davies [2008]) can be used to compute SMLE. We establish the consistency of SMLEbased PLM and demonstrate the promising performance of the proposed procedure via numerical studies. The rest of this chapter is organized as follows. In Section 2.2, we introduce the ultra-high-dimensional GLM and review the SIS-based screening method. In Section 2.3, we investigate the use of SMLE for feature screening and discuss its asymptotic properties; an IHT algorithm is proposed for the implementation. We discuss the SMLE-based PLM in Section 2.4 and assess the finite-sample performance of the method in Section 2.5. Finally, Section 2.6 presents several remarks. The proofs of the theorems are presented in Appendix A. 2.2 Sure screening techniques with ultra-high dimensionality 2.2.1 Model settings and notation Suppose the data {(yi , xi ), i = 1, . . . , n} are collected independently from (Y, x), where Y is a response variable and x = (x1 , . . . , xp )T is a p-dimensional covariate vector. We postulate a GLM between Y and x as follows. Conditioning on x, the distribution of Y is assumed to belong to an exponential family taking the form f (y; θ) = exp(θy − b(θ) + c(y)) (2.1) with respect to a σ-finite measure ν and known functions b(.) and c(.). θ is usually called the natural parameter, and it is assumed to take values in a compact space Θ ⊂ R such that ∫ f (y; θ)dν = 1, for any θ ∈ Θ. (2.2) Given (2.1), it is well known that E(Y |x) = µ = b′ (θ) and Var(Y |x) = σ 2 = b′′ (θ), where the primes denote derivatives with respect to θ. The covariate x influences the response Y in the form of a linear combination g(µ) = xT β, 27 where β = (β1 , . . . , βp )T are the p-dimensional model coefficients and g(.) is a specified link function. For theoretical purposes, it is often convenient to relate the natural parameter θ directly to the covariate x, i.e., θ = u(xT β) for u(.) = (g · µ)−1 . Of special importance is the canonical link g = µ−1 , which leads to a linear expression of θ, i.e., θ = xT β. Several classical GLMs with the canonical link are as follows: • Linear regression: Suppose xi , yi has the distribution N(µi , σ 2 ). In this case, we have µi = θi and the natural link g(µi ) = µi , which implies µi = xTi β. Model (2.1) leads to the classical normal linear regression yi = xTi β + ϵi (2.3) where ϵi is the normal random error with mean zero and variance σ 2 . • Logistic regression: Suppose xi , yi has the distribution Bernoulli(pi ). In this case, we have pi = µi = exp(θi )/(1 + exp(θi )) and the natural link g(µi ) = log µi /(1 − µi ), which implies pi = exp(xTi β)/(1 + exp(xTi β)). Model (2.1) leads to the logistic regression logit{P (yi = 1|xi )} = log( pi ) = xTi β, 1 − pi (2.4) which is commonly used for regression analysis with binary dependent variables. • Poisson regression: Suppose xi , yi has the distribution Poisson(µi ). We have µi = exp(θi ) and the natural link g(µi ) = log(µi ), which implies µi = exp(xTi β). Model (2.1) leads to the Poisson regression log(µi ) = xTi β, (2.5) which is often used for the analysis of count data and multidimensional contingency tables. With the GLM settings, the effect of each covariate on the response Y is characterized through the size of the corresponding regression coefficient. In applica28 tions, when the dimension p is high, it is often believed that only a small number of the covariates in x contribute to the variations in Y , which leads to an idealistic assumption that β is sparse. With this sparsity, we can identify influential features by finding the covariates associated with nonzero coefficients. Specifically, let β ∗ be the true sparse coefficients with q nonzero elements, and let s be an arbitrary subset of {1, . . . , p} defining a submodel with covariates xs = {xj , j ∈ s} and associated coefficients β s = {βj , j ∈ s}. For convenience, we use ∥.∥0 to denote the number of nonzero components of an arbitrary vector (i.e., the l0 -norm) and τ (s) to indicate the size of model s. In particular, we denote the true model by s∗ = {j : βj ̸= 0} with τ (s∗ ) = ∥β ∗ ∥0 = q. We must estimate s∗ from {1, . . . , p} by analyzing the data {(yi , xi ), i = 1, . . . , n}. We are interested in solving this problem in ultra-high-dimensional situations where p ≫ n > q. 2.2.2 Sure independence screening As mentioned, the high dimensionality of p poses great challenges in searching for s∗ using most existing selection methods. A natural idea is to reduce the dimensionality of the feature space p to a manageable level k (say, k < n) by a fast and reliable method, such that existing selection methods (e.g., PLM) can be applied to the reduced feature space. This suggests a two-step procedure for feature selection in ultra-high-dimensional situations: a crude feature screening followed by a more careful selection. To this end, Fan and Lv [2008] developed the SIS framework by ranking the marginal correlations between the covariates and the response. Specifically, let w = (w1 , . . . , wp )T be a p-dimensional vector with component wj denoting the fitted coefficient obtained from the marginal regressions of the response variable on the jth covariate. For a given screening bound k < n, the SIS screens features by selecting sˇ = {1 ≤ j ≤ p : the value of |wj | is among the k largest values}. Clearly, in this strategy, each feature is used independently as a single covariate to deternine its strength of association with the response. SIS has been widely adopted in genetic studies, where two-sample testing methods are frequently used 29 to detect the genes that differ between case and control groups (Efron [2007], Storey and Tibshirani [2003]). In the context of a linear model with Gaussian covariates and response, Fan and Lv [2008] showed that P (s∗ ⊂ sˇ) → 1 as n → ∞ even when p increases exponentially with n. They refer to this property of SIS as the sure screening property, meaning that a screening method has the ability to retain all the important features with a high probability. This result has been further extended by Fan and Song [2009] to the high-dimensional GLM case where both the response and the covariates can be discrete. SIS uses only the marginal information of the covariates, and its performance can be unstable in applications. To address this issue, Fan et al. [2009] proposed an iterative SIS (ISIS) procedure, which works as follows. First, ISIS applies SIS to select k1 features, denoted sˇ1 . Then, it applies the PLM (e.g., SCAD) on sˇ1 ˇ 1 ⊂ sˇ1 . Next, it fits a regression of Y on M ˇ 1 and obtains to select a subset M the corresponding residuals. These residuals are then treated as the new response ˇ1 variable, and SIS is applied to the remaining candidate features {x1 , . . . , xp }/M ˇ 1 ∪ sˇ2 , which gives a to select k2 features, say sˇ2 . Then, the PLM is applied to M ˇ 2 . The procedure continues until there are k features in the current new subset M subset. ˇ t are uncorrelated with At each step of ISIS, since the residuals based on M ˇ t , the unimportant variables in {x1 , . . . , xp }/M ˇ t that are highly the variables in M ˇt correlated with the response Y through strong associations with the features in M are not likely to be selected at the next step. Also, the important features that are marginally weakly correlated with Y because of the presence of the variables in ˇ t should have a chance to be selected. ISIS has great potential to improve SIS, M but it has a higher computational cost. 30 2.3 Variable screening via the sparse-MLE 2.3.1 The sparsity-restricted maximum likelihood estimator The seminal theory of SIS stimulates us to seek a more effective and efficient method for feature screening in ultra-high-dimensional applications. In particular, we conjecture that incorporating joint information between candidate features will be beneficial. Our investigation starts from the classical likelihood-based inference. Specifically, with the canonical link, the log-likelihood function of β is given by l(β) = n ∑ [yi · xTi β − b(xTi β)]. (2.6) i=1 Maximizing (2.6) leads to the MLE of β. Under some regularity conditions, the MLE is a consistent estimate of β ∗ and thus provides information useful for detecting s∗ . However, in ultra-high-dimensional settings with p ≫ n, the classical MLE is not uniquely defined and therefore loses its interpretive ability. In the spirit of regularization, we consider estimating β ∗ in a subspace of Rp with the number of nonzero entries constrained to be less than a given screening bound k. Under the assumption that β ∗ is sparse (i.e., q < k), this constrained estimation is expected to retain all the important information carried by β ∗ while setting most of the zero entries of β ∗ exactly to zero, and it therefore provides feature screening. Motivated by this argument, we now present a new screening method. We propose carrying out the aforementioned estimation via the sparsity-restricted MLE of β (SMLE), which is defined by ˆ = arg max l(β) β [k] β subject to ∥β∥0 ≤ k (2.7) ˆ is constrained to be for some specified k smaller than n. Clearly, the SMLE β [k] sparse in its composition, and its nonzero entries correspond to a submodel ˆ is nonzero} sˆ = {1 ≤ j ≤ p : the jth entry of β [k] that yields the highest possible likelihood score within the restricted model sparsity 31 ˆ screens irrelevant features by setting their coefficients to k. Functionally, the β [k] zero, while retaining the important features in sˆ for further selection. Compared with (I)SIS, SMLE naturally accounts for the joint effects between candidate features by jointly estimating their coefficients. Thus, it has the potential to provide more reliable screening results. The idea of SMLE has similarities with the use of l0 -regularized techniques in image processing, where sparsity-constrained least-squares methods are frequently used to construct parsimonious representations for high-resolution images (Donoho [2006], Blumensath and Davies [2009]). To provide insights into the use of SMLE for feature screening, we first focus on the theoretical aspects and ignore the computational issues at this stage. As argued in Fan and Lv [2008], a good screening approach should have the ability to remove most of the irrelevant variables while retaining all the relevant ones with a high probability. They refer to this as the sure screening property of a variable screening method. Accordingly, we define SMLE to have screening consistency if P (s∗ ⊂ sˆ) → 1, as n → ∞. (2.8) To investigate whether SMLE has screening consistency, we introduce the following notation. For any model s, let ∂l(β s ) ∑ = [yi − b′ (xTis β s )]xis , ∂β s n S(β s ) = i=1 H(β s ) = ∂ 2 l(β s) − ∂β s ∂β Ts = n ∑ b′′ (xTis β s )xis xTis i=1 be the score function and the Hessian matrix of l(.) corresponding to β s . For k such that q < k, we define S k+ = {s : s∗ ⊂ s; ∥s∥0 ≤ k} and S k− = {s : s∗ ̸⊂ s; ∥s∥0 ≤ k} for collections of overfitted models and underfitted models. We investigate the 32 ˆ in the scenario where p, q, k, and β ∗ vary with the asymptotic properties of β [k] sample size n. Also, we assume the following conditions, some of which are purely technical and serve only to provide a theoretical understanding of the new screening method. We do not intend these assumptions to be the weakest possible. T1 log p = O(nm ) for some m > 0. T2 There exist nonnegative constants τ1 , τ2 such that min∗ |βj∗ | ≥ w1 n−τ1 q < k ≤ w2 nτ2 and j∈s for some w1 , w2 > 0. T3 There exist a positive constant c1 and a corresponding δ1 > 0 such that for sufficiently large n, λmin [n−1 H(β s )] ≥ c1 for β s ∈ {β s : ||β s − β ∗s ||2 ≤ δ1 } and s ∈ S 2k + , where ∥.∥2 is the Euclidean norm and λmin [.] denotes the smallest eigenvalue of a matrix. T4 There exist positive constants c2 , c3 , c4 , such that { max max 1 j p1 i n ∑n c3 ≤ n } x2ij 2 2 i=1 xij σi −1 n ∑ ≤ c2 · n−1 , and x2ij σi2 ≤ c4 i=1 for any j ∈ {1, . . . , p}. By condition T1, we assume that p diverges with n up to an exponential rate, which implies that the number of covariates can be substantially larger than the sample size. Condition T2 assumes that β ∗ is sparse and its minimal component does not degenerate too quickly. Condition T3 corresponds to assumptions A4–A5 of Chen and Chen [2012]. It basically requires s∗ to stay some distance from incorrect models as n increases. Condition T4 places restrictions on the observed values of 33 xj . For a wide range of models, condition T4 holds naturally for random designs on x or for fixed designs with an appropriate rescaling operation. We now establish the screening consistency (sure screening property) of the ˆ via the following theorem. SMLE β [k] Theorem 2.1 Under model (2.1) and conditions T1–T4, if τ1 + τ2 < 12 , we have P (s∗ ⊂ sˆ) → 1, as n → ∞. See Appendix A for the proof. By Theorem 2.1, we show that, with probability tending to one, SMLE retains all important features in s∗ by estimating their model coefficients away from zero. This desirable property of SMLE provides a necessary condition for correctly identifying s∗ in the more elaborate selection based on sˆ, as will be illustrated in Section 2.4. 2.3.2 Implementation As discussed in the previous subsection, SMLE has the potential to address ultrahigh-dimensional feature screening, but this needs to be demonstrated in appliˆ from (2.7) corresponds to a l0 cations. In principle, numerically finding β [k] regularized problem. Such problems have been extensively studied in the area of signal processing, and a number of computational strategies have been developed. Examples include the matching pursuit algorithms (Mallat and Zhang [1993]) and the FOCUSS-based methods (Murray and Kreutz-Delgado [2001]). We find that the hard-thresholding-based algorithms developed under linear models (Blumensath and Davies [2009] are suitable for our needs. In this subsection, we design a modified hard-thresholding procedure in the context of GLM for comˆ in applications. puting β [k] Specifically, to tackle the problem in (2.7), we first approximate the ln (.) at a generic point β by hn (γ; β) = ln (β) + (γ − β)T sn (β) − u ∥γ − β∥22 , 2 (2.9) where u > 0 is a scale parameter. The first two terms in (2.9) match the Taylor’s expansion of ln (γ) at γ = β, and u 2 ∥γ − β∥22 is introduced as a regularization 34 term. Clearly, we have ln (β) = hn (β; β), and hn (γ; β) well approximates ln (β) for γ close to β. A key property of hn (γ; β) is that it is additive in the components of γ, so that the maximization of hn (γ; β) over γ can be conveniently carried out. With the aid of (2.9), we then propose the following iterative procedure to solve (2.7) β (t+1) = arg max hn (γ; β (t) ) γ subject to ∥γ∥0 ≤ k. (2.10) For each iteration step, the regularization term in hn (.) prevents the maximizer from being far from the current estimate of β. This feature further guarantees the convergence of (2.10) to a local maximum of ln (.) (subject to the sparsity constraint). Let y = (y1 , . . . , yn )T and X = (xi , . . . , xn )T . Because of the additivity of hn (γ; β) in γ, the optimization in (2.10) takes a unified specific form: 1 min γ − u−1 [uβ + X T y − X T b′ (Xβ)] γ 2 2 2 subject to ∥γ∥0 ≤ k. (2.11) Obviously, if the sparsity constraint is ignored, the solution to (2.11) is the typical least squares estimate ˜ = β + u−1 X T [y − b′ (Xβ)], γ which corresponds to a zero loss in the objective function. Thus, the constrained minimum of (2.11) is achieved by choosing the k largest (in absolute value) com˜ . Consequently, the solution to (2.11) is given by ponents of γ [ ]T ˆ = H(˜ γ γ ; k) = H(˜ γ1 ; |˜ γ |[k] ), . . . , H(˜ γp ; |˜ γ |[k] ) , where |˜ γ |[k] is the kth largest component in |˜ γ | and γ, if |γ| > r H(γ; r) = 0, if |γ| ≤ r is the hard thresholding function. 35 Therefore, the iteration (2.10) can be re-written as β (t+1) = H(β (t) + u−1 X T [y − b′ (Xβ (t) )]; k). (2.12) It can be seen that (2.12) is a simple thresholding-based iterative procedure which does not involve complicated operations (such as matrix inversion). This advantage makes (2.12) suitable for high-dimensional computing. Unlike the typical thresholding method, (2.12) adaptively performs hard thresholding on the current update such that each β (t) satisfies the sparsity constraint i.e., ∥β (t) ∥0 ≤ k. Following Blumensath and Davies [2009], we refer to (2.12) as the iterative hard-thresholding (IHT) procedure. We now show that the sequence {β (t) } based on IHT has a property analogous to that of typical thresholding-based methods: it stepwise increases the value of l(.). This increment property further ensures the convergence of IHT to a local maximum of l(.) within the feasible region. For convenience of illustration, we introduce the following condition T3′ λmin [n−1 H(β s )] > 0 for any s with τ (s) ≤ k. It can be seen that condition T3′ is analogous to T3, which requires the strict concavity of ln (β s ) over models of size no larger than k. This condition is purely technical and might be further weakened. We focus on providing a theoretical understanding of the IHT and leave this issue to future research. We now justify the convergence of IHT by the following theorem. Theorem 2.2 Given the settings and notation introduced earlier, assume that b(.) in (2.1) is twice continuously differentiable. Let {β (t) } be the sequence defined by (2.12). Denote by ρ1 the maximum eigenvalue of X T X and ρ(t) = max sup b′′ (αxTi β (t+1) + (1 − α)xTi β (t) ). i 0<α<1 If u > ρ1 ρ(t) , then l(β (t+1) ) ≥ l(β (t) ). Moreover, if condition T3′ holds, then {β (t) } converges to a local maximum of ln (β) subject to ∥β∥0 ≤ k. 36 See Appendix A for the proof. Theorem 2.2 implies that, for an appropriate scale parameter u, the IHT necessarily converges. In our simulations, we choose u adaptively at each step according to the value of β (t) to guarantee the monotonicity of l(β (t) ). Since β (t) may lead to a local maximum, multiple initial values are often used in the hope that the global maximum is not missed. For the linear model, when the IHT starts with β (0) = 0, its first iteration corresponds to the SIS step based on the marginal information of features. This suggests that zero might be a reasonable initial choice for IHT. In practice, the LASSO estimates of β are also convenient for setting β (0) . The efficiency of IHT has been observed in numerical studies. 2.3.3 Numerical assessment Setup 1 2 3 (p, n) (10000, 200) (5000, 120) (1000, 100) SIS .22 .58 .01 ISIS .94 .63 .73 FR 1.00 .34 .88 LASSO .98 .89 .28 SMLE .99 .78 .99 Logistic 1 2 3 (1000, 400) (1000, 400) (1000, 400) .94 .11 .02 1.00 .89 .61 .- .- .- - .99 .85 .17 .99 .97 .77 Poisson 1 2 3 (1000, 200) (1000, 200) (1000, 200) .07 .01 .00 .94 .84 .54 .- .- .- - .67 .39 .01 .97 .94 .93 Model Type Linear Table 2.1: Frequencies of covering the true model s∗ based on different screening methods To provide a quick assessment of SMLE-based screening (2.7), we briefly summarize some simulation results based on our numerical studies, which will be presented in more detail in Section 2.5. In this subsection, we focus on showing the numerical performance of the screening methods; we give detailed descriptions of the simulation setups in Section 2.5. We conduct simulation studies in three different modeling contexts: linear regression, logistic regression, and Poisson regression. In each case, we examine SMLE for three different setups with specific correlation structures among the can37 didate features. We evaluate the screening method by measuring the frequency with which the resulting model includes all the features in the true model, i.e., we measure the ability to correctly screen the irrelevant features. For the linear model, we compare SMLE with four other screening methods: SIS, ISIS, FR, and LASSO. We do not include FR for the logistic and Poisson models because of its high computational expense. For a fair comparison of the methods, the size of the screened model k is the same for each screening procedure. Table 2.1 summarizes the coverage frequencies of the true model s∗ for different screening methods based on 1000 repetitions. The simulation results reveal that SMLE is competitive with other popular screening methods. The advantage of SMLE-based screening is especially clear for the third correlation setup in the logistic and Poisson models, where both SIS and LASSO miss important features. Compared with ISIS, SMLE achieves higher coverage frequencies at a lower computational cost. A more detailed discussion of the simulation results will be given in Section 2.5. 2.4 Screening-based PLM selection procedure We have proposed SMLE, which efficiently screens the irrelevant features in an ultra-high-dimensional model while retaining the important features with an overwhelming probability. Given such a feature-screening technique, s∗ can be conveniently estimated through a two-step procedure: First, we shrink the full model {1, . . . , p} to a refined submodel sˆ of size τ (ˆ s) ≤ k < n, and then we apply the PLM to sˆ to identify s∗ . We refer to SMLE followed by the PLM as SMLE-PLM. Because many irrelevant covariates are removed from the original feature space, the search for s∗ in the PLM step is dramatically narrowed. Sure screening makes it feasible to do feature selection for ultra-high-dimensional problems and dramatically speeds up the selection process. In this section, we present further discussion of SMLE-PLM. 2.4.1 Consistency of SMLE-PLM As discussed in Section 1.4.2, some PLMs have both estimation and selection consistency for a wide range of applications (Fan and Li [2001]). We now explore 38 whether these desirable properties hold for SMLE-PLM. Unlike the direct implementation of PLM, SMLE-PLM is a two-stage procedure. Given the settings and ˆ (ˆ notation in Section 2.2, SMLE-PLM estimates β by β s), found by maximizing λ Q(β sˆ) = l(β sˆ) − n ∑ ϕλ (|βj |), (2.13) j∈ˆ s where sˆ is the k-dimensional submodel obtained from SMLE and ϕλ (.) is a specified penalty function. As in typical PLMs, an appropriate choice of ϕλ (.) leads to ˆ (ˆ a sparse β s), which further identifies important covariates based on sˆ. λ ˆ (ˆ We study the large-sample properties of β λ s) under conditions T1–T4 given in Section 2.3.1. For the asymptotic analysis, we associate λ with n and consider a penalty function sequence ϕλ (.) that satisfies the following properties: P1 For any λ > 0, ϕλ (|θ|) ≥ 0 with ϕλ (0) = 0. ′ P2 For any λ > 0, ϕλ (|θ|) = ∂ϕλ (|θ|)/∂|θ| exists and is continuous for |θ| ∈ (0, +∞). ′ P3 There exist positive constants τ3 and w3 , such that ϕλ (|θ|) ≤ w3 n−τ3 for |θ| ≥ 0.5w1 n−τ1 . Properties P1–P2 specify the shape and smoothness of the penalty function, and are generally required for PLMs (e.g., Fan and Lv [2011]). Property P3 further restricts the sequence of penalties by setting an upper bound on the derivatives. For a specific choice of ϕλ (.) (e.g., L1 or SCAD), P3 corresponds to assuming an appropriate asymptotic order for the tuning parameter λ. In general, it is difficult to study the global maximizer of (2.13) analytically without the concavity of (2.13). As is common in the PLM literature, we study the behavior of local maximizers. We first establish estimation consistency for SMLE-PLM via the following theorem: Theorem 2.3 Under conditions T1–T4, if τ1 +τ2 < 12 and τ3 > τ1 + τ22 , then there ˆ (ˆ exists a local maximizer β s) of (2.13) with ϕλ (.) satisfying P1–P3, such that λ ∗ −υ ˆ (ˆ ∥β ) λ s) − β ∥2 = Op (n 39 for some υ ∈ (τ1 , min{ 12 − τ2 , τ3 − τ2 2 }). See Appendix A for the proof. Theorem 2.3 shows that, for an appropriate choice of penalty ϕλ (.), SMLE-PLM consistently estimates the model parameters in the ultra-high-dimensional GLM setup. For the situation where the number of influential features q does not diverge with n, we have τ2 = 0 in condition T2. In this case, SMLE-PLM achieves root-n consistency if τ3 > 0.5 in requirement P3. In particular, when the L1 penalty is used (i.e., ϕλ (|θ|) = λ|θ|), this result implies that root-n consistency is achieved if we set λ = o(n− 2 ). 1 Moreover, we establish the selection consistency of SMLE-PLM with the additional condition: T5 There exist a positive constant c5 and a corresponding δ2 > 0 such that for sufficiently large n, 1 ∂ln (β s ) ∂ln (β ∗s ) | − | ≤ c5 ∥β s − β ∗s ∥2 n ∂βj ∂βj for any j ∈ s, s ∈ S k+ and β s ∈ {β s : ∥β s − β ∗s ∥2 ≤ δ2 }. Condition T5 is purely technical and basically requires Lipschitz continuity of the score function in a neighborhood of β ∗ . Also, we need an extra requirement on the order of the penalty function sequence as follows: ′ P4 There exist positive constants τ4 and w4 such that ϕλ (|θ|) ≥ w4 n−τ4 for |θ| < 0.5w1 n−τ1 . Theorem 2.4 Under conditions T1–T5, if τ1 + τ2 < 0.5 and τ3 − τ2 > 0.5 − ˆ (ˆ 1.5τ2 > τ4 , then there exists a local maximizer β s) = (βˆ1λ (ˆ s), . . . , βˆkλ (ˆ s))T of λ (2.13) with ϕλ (.) satisfying P1–P4, such that P {βˆjλ (ˆ s) = 0, for j ∈ sˆ \ s∗ } → 1. See Appendix A for the proof. Theorem 2.4 implies that, for an appropriate choice of ϕλ (.), SMLE-PLM consistently selects s∗ even when the number of features diverges exponentially with the sample size. Appropriate choices of ϕλ (.) include 40 popular nonconvex penalties such as SCAD or MCP as special cases. In particular, for the fixed-q situation, the selection consistency of SMLE-SCAD is implied if 1 λ = o(1) and λn 2 → ∞. The results of theorems 2.3 and 2.4 are not restricted to SMLE-PLM but also apply to a general screening-based PLM procedure with the sure screening property (e.g., SIS or FR). 2.4.2 Tuning with EBIC As for standard PLMs, given the form of the penalty function ϕλ (.), we must specify the tuning parameter λ for the implementation of SMLE-PLM. The commonly used GCV or BIC tuning methods (Section 1.4.3) are designed for the situation where p < n and may not be suitable for situations where p ≫ n. To address this issue, we choose the λ that minimizes the following EBIC criterion (Chen and Chen [2008]), ˆ ) + τ (sλ )(log n + γ log p), EBIC(sλ ) = −2l(β sλ 0 ≤ γ ≤ 1, (2.14) ˆ (ˆ ˆ where sλ denotes the model corresponding to β λ s) (2.13) and β s is the MLE based on model s. With γ = 0, EBIC reduces to the standard BIC tuning strategy (Wang et al. [2007]). The EBIC is designed for high-dimensional model selection with a sound Bayesian motivation. Let the model space be partitioned into subclasses according to the number of covariates that a model contains. Let Sj for 0 ≤ j ≤ p be the subclass of models containing j covariates and τ (Sj ) be the size of Sj . Let us assign each model in subclass Sj an equal probability, i.e., P (s|Sj ) = 1/τ (Sj ) for any s ∈ Sj . Then, instead of assigning probabilities P (Sj ) proportional to τ (Sj ) as in BIC, EBIC sets P (Sj ) to be proportional to τ (Sj )1−γ for some γ between 0 and 1. Thus, in EBIC the prior probability P (s) for s ∈ Sj is set to be proportional to τ (Sj )−γ . Compared with the constant prior used in BIC, this assignment substantially reduces the high prior in models with a large number of covariates, and hence may be suitable for large p. Chen and Chen [2012] have shown that EBIC is selection consistent in the GLM context even when log p = O(nm ). Theorem 2.3 shows that there exists a proper sequence of ϕλ (.) such that 41 P (sλ = s∗ ) → 1. The EBIC can help to tune ϕλ (.) for the implementation of screening-based PLM. In our simulation studies, we set γ = 0.5 for EBIC as suggested by Chen and Chen [2012]. 2.5 Numerical studies We assess the finite sample performance of the SMLE-PLM via simulation studies. In particular, we are interested in knowing how SMLE compares with other popular screening methods. We must consider many factors in order to obtain a relatively complete picture of the new method. Recall that our general goal is to use a GLM to explain the variation in the response variable Y through a number of covariates (features) selected from a large number of candidates. The correlation structure between these covariates can have a strong effect on the performance of screening-based methods. Also, their performance may vary depending on the model to which they are applied. In addition, there are implementation issues to address. Last but not least, we need measures to evaluate the performance of the different methods. In the next subsections we discuss the simulation settings. Some of the simulation results have already been revealed and additional ones will be presented. 2.5.1 General settings We examine the methods in three different modeling contexts: linear regression, logistic regression, and Poisson regression, as described in Section 2.2.1. For the linear model, we compare screening-based LASSO and SCAD on five screening methods: SIS, ISIS, FR, LASSO, and SMLE. We do not include FR for the logistic and Poisson models because of its computational cost. Correlation structure For each model we consider three different correlation structures of the features x1 , . . . , xp , so there are a total of nine combinations. The correlation structures are as follows: Setup 1: x1 , . . . , xp are independent and identically distributed N (0, 1) random variables. 42 Setup 2: x1 , . . . , xp are joint Gaussian, marginally N (0, 1), with cov(xj , xj−1 ) = 2/3, cov(xj , xj−2 ) = 1/3 for j ≥ 3, and cov(xj , xh ) = 0 if |j − h| ≥ 3. Setup 3: x1 , . . . , xp are joint Gaussian, marginally N (0, 1), with cov(xj , xh ) = 0.15 for j, h ∈ s∗ and cov(xj , xh ) = 0.3 for j or h ∈ {1, . . . , p} \ s∗ . Case 1 is the ideal independence structure, which is the most straightforward for variable selection. In Case 2, we consider a moving average type correlation structure, where features are strongly correlated for order distances less than three. This type of correlation is commonly used to model features with a natural order. In Case 3, we have a compound correlation structure such that every irrelevant feature has equal correlation with the relevant features. We therefore expect the variable selection in Case 3 to be more challenging. Implementation issues In our simulation studies, the screening methods are implemented as follows. For ISIS, we use the R function GLMvanISISscad in the SIS package, with a maximum of five ISIS loops. The number of relevant features retained in each loop is decided by SCAD with the AIC-based tuning method; see Fan et al. [2009]. For SMLE, we use the proposed IHT algorithm. In particular, we choose ten LASSO estimates with sparsity varying from n − 1 to k as different initial values for IHT. The estimate from IHT that maximizes the likelihood is then treated as the SMLE. For comparison purposes, we also report the performance of LASSO when it is applied directly to the full model. In particular, we apply the R function glmnet to identify a sequence of ordered features with a prespecified size k. Variables in the sequence are considered to be important (after screening), and PLMs are used for the further selection. For each model, we use NONE to represent the model without further variable selection. We use the notation LASSO and SCAD to represent, respectively, the models selected by LASSO and SCAD with the EBIC tuning strategy. To facilitate the computing process, we use the built-in functions in the R software packages glmnet and SIS to compute the estimators of LASSO and SCAD respectively. Theorem 2.1 shows that when the model size k has a certain asymptotic order, SMLE consistently includes all the important features. However, in practice, a spe43 cific choice of k must be made in the implementation. Intuitively, a larger choice of k may increase the probability that a screening method includes all the relevant features. However, if the screened model has many irrelevant variables the final selection may be more difficult. In our simulations, we set k = [n/ log n] for the linear model, k = [n/4 log n] for the logistic regression model, and k = [n/2 log n] for the Poisson regression model. These model-based choices of k were recommended by Fan et al. [2009] and worked satisfactorily in our simulation examples. In fact, the performance of screening-based PLMs are quite robust to a wide range of sensible values for k. Since our goal is to compare different screening methods, we treat these model-based k values as benchmarks for the comparison. Assessment of performance We assess the performance of the screening methods based on T = 1000 simulation replications. Specifically, let sˆt denote the model selected in the jth replication by a particular method (e.g., SMLE-None). The coverage probability of that ∑ method is computed by T −1 Tt=1 I(s∗ ⊂ sˆt ), which measures its ability to discover all relevant features. We characterize the model selectivity of each method in terms of the positive selection rate (PSR) and the false discovery rate (FDR), which are defined as follows: ∑T PSR = ∑T ∗ ∩ sˆt ) , ∗ T τ (s ) t=1 τ (s FDR = st /s∗ ) t=1 τ (ˆ T τ (ˆ st ) . The PSR and FDR depict two different aspects of the selection result: a high PSR indicates that most of the important variables have been identified, while a low FDR indicates that only a few irrelevant variables have been selected. We also report the average number of features included in sˆt , as well as the proportion of times that s∗ is perfectly identified with no extraneous variables and no missed variables (the correct selection rate, CSR). In the following subsections, we further specify the model parameter settings and discuss the simulation results. 44 2.5.2 Linear regression Parameter settings In this example, the data (yi , xi ) for i = 1, . . . , n were generated as independent copies from (Y, x) according to the linear model (2.3) Y = xT β + ϵ, where ϵ is a normal random error with mean zero and variance σ 2 . The model parameters used in each of the three correlation setups are as follows: Setup 1: (n, p, σ) = (200, 10000, 3), and s∗ is randomly chosen from {1, . . . , p} with ∥s∗ ∥0 = 8. For j ∈ s∗ , the βj values are generated independently from √ U (4 log n n+|Z|), where U is a binary random variable with P (U = 1) = 0.6 and P (U = −1) = 0.4 and Z is a standard normal random variable. For j ̸∈ s∗ , the βj values are set to zero. Setup 2: (n, p, σ) = (120, 5000, 5), and s∗ = {1, 3, 5, 7, 9}. (β1 , β3 , β5 , β7 , β9 ) = (5, 3.5, 2.8, 2.5, 2.2) and βj = 0 for j ̸∈ s∗ . Setup 3: (n, p, σ) = (100, 1000, 1), and s∗ = {1, 2, 3, 4}. βj = 2.5 for 1 ≤ j ≤ 4 and βj = 0 for j > 4. The settings in setup 1 are borrowed from example 1 of Wang [2009], where both s∗ and β are generated randomly. For the settings in setup 2, we choose s∗ to ensure non-negligible correlations among all the relevant features. For Case 3, the most challenging situation, we fix s∗ to be the set of the first four features and set the coefficients for all the relevant feature to 2.5. The σ values for all three cases were chosen after pilot studies to give an appropriate signal-to-noise ratio. To evaluate the prediction accuracy of each model, we independently generated ˜ i ) with the same sample size as the training data. For each sˆt testing data (˜ yi , x ˆ , we calculated the corresponding relative and its associated estimate of β, say β t prediction error by { } ∑ T Tˆ 2 ˜ (˜ y − x β ) 1∑ ∗ i i s 1 − ∑i , Tˆ 2 T ˜ (˜ y − x i βt ) i i t=1 45 ˆ ∗ is the least squares estimate based on the true model s∗ . For convenience where β s ˆ to the least squares estimate on sˆt for a screening method of comparison, we set β t ˆ to the corresponding shrinkage estimate without further selection, and we set β t for a screening method followed by a PLM (i.e., LASSO or SCAD). Results The simulation results are summarized in Tables 2.2–2.4. For setup 1, most screening methods performed very well in terms of including all the relevant features in the model. This is indicated by the high coverage probabilities. However, SIS, which screens features based on marginal correlations, performed poorly because of the high spurious correlations caused by the ultra-high dimensionality. The drawback of FR is observed in setup 2, where the relevant features are correlated with each other. If one of two highly correlated features is included in an FR procedure, the likelihood of including the other is very small because of its weak correlation with the current residual. Consequently, FR and its associated PLMs performed poorly in this situation. Meanwhile, SIS improved because of the aggregated marginal correlations, but this structure does not encourage further improvements via ISIS. The performance of LASSO and SMLE remained satisfactory. Lastly, in setup 3, we see that the strong collinearity among the features greatly deteriorates the performance of SIS and LASSO. In terms of the coverage probability, ISIS significantly improved on SIS by almost 72%, while FR improves on LASSO by 88%. In comparison, SMLE did amazingly well in this challenging situation by achieving coverage probabilities as high as 99%. From Tables 2.2 to 2.4, the performance of LASSO and SCAD after the screening step was similar for setups 1 and 2. SCAN had the better performance for setup 3 in terms of lower FDR and higher CSR. Also, the final model fitted by SCAD consistently yielded a lower prediction error than that fitted by LASSO. This might be due to the excessive shrinkage of the LASSO estimator for large coefficients (see Fan and Li [2001]). In particular, when LASSO and SCAD were used after SMLE in setup 3, the difference in the relative prediction error of their selected (fitted) models was as large as 91%. 46 Screening Method SIS ISIS FR LASSO SMLE PLM Coverage prop. PSR FDR CSR Ave. model size Relative pred. error None LASSO SCAD None LASSO SCAD None LASSO SCAD None LASSO SCAD None LASSO SCAD .22 .21 .22 .94 .84 .92 1.00 .99 .99 .98 .85 .96 .99 .91 .98 .85 .84 .85 .99 .98 .99 1.00 .99 .99 .99 .98 .99 1.00 .99 .99 .82 .11 .12 .79 .10 .07 .79 .07 .09 .77 .11 .08 .79 .09 .07 .00 .15 .21 .00 .40 .51 .00 .59 .37 .00 .39 .46 .00 .47 .49 38.0 7.7 7.8 38.0 8.9 8.6 38.0 8.7 8.8 35.6 9.0 8.8 38.0 8.8 8.6 .52 .62 .36 .46 .58 .25 .55 .44 .10 .38 .59 .31 .48 .56 .13 Table 2.2: Simulation results for linear regression, setup 1 Screening Method SIS ISIS FR LASSO SMLE PLM Coverage prop. PSR FDR CSR Ave. model size Relative pred. error None LASSO SCAD None LASSO SCAD None LASSO SCAD None LASSO SCAD None LASSO SCAD .58 .32 .26 .63 .38 .25 .34 .29 .19 .89 .44 .31 .78 .46 .25 .91 .83 .78 .92 .85 .78 .78 .76 .73 .97 .86 .80 .95 .87 .79 .82 .09 .15 .81 .10 .21 .84 .23 .28 .79 .09 .17 .81 .09 .19 .00 .23 .17 .00 .24 .13 .00 .17 .08 .00 .26 .19 .00 .29 .11 25.0 4.6 4.7 25.0 4.8 5.0 25.0 5.1 5.2 23.4 4.8 4.9 25.0 4.8 5.0 .35 .42 .26 .42 .41 .30 .60 .33 .25 .35 .41 .31 .44 .39 .28 Table 2.3: Simulation results for linear regression, setup 2 47 Screening Method SIS ISIS FR LASSO SMLE PLM Coverage prop. PSR FDR CSR Ave. model size Relative pred. error None LASSO SCAD None LASSO SCAD None LASSO SCAD None LASSO SCAD None LASSO SCAD .01 .01 .01 .73 .52 .73 .88 .78 .88 .28 .03 .28 .99 .52 .99 .33 .23 .24 .85 .70 .79 .89 .86 .89 .64 .32 .49 1.00 .82 .99 .94 .91 .89 .84 .66 .29 .84 .53 .19 .87 .86 .63 .82 .61 .07 .00 .00 .00 .00 .01 .52 .00 .02 .57 .00 .00 .27 .00 .01 .71 22.0 9.8 9.7 22.0 8.9 5.8 22.0 7.9 5.1 20.7 9.8 8.2 22.0 8.9 4.4 .89 .95 .94 .58 .87 .25 .59 .83 .11 .69 .95 .68 .52 .92 .01 Table 2.4: Simulation results for linear regression, setup 3 2.5.3 Logistic regression Parameter settings In our second example, the generic response Y follows a Bernoulli distribution with success probability π satisfying log( π ) = xT β. 1−π Thus, data generated as independent pairs from (Y, x) satisfy a logistic regression model. The parameters used for the three correlation setups are as follows: Setup 1: s∗ is randomly chosen from {1, . . . , p} with ∥s∗ ∥0 = 8. For j ∈ s∗ , the √ βj values are generated independently from U (4 log n/ n + |Z|/4), where U is a binary random variable with P (U = 1) = 0.5 and P (U = −1) = 0.5 and Z is a standard normal random variable. For j ̸∈ s∗ , the βj values are set to zero. Setup 2: s∗ = {1, 3, 5, 7, 9}. (β1 , β3 , β5 , β7 , β9 ) = (2, −1.8, 1.6, −1.4, 1.2) and 48 βj = 0 for j ̸∈ s∗ . Setup 3: s∗ = {1, 2, 3, 4}. βj = 1.5 for 1 ≤ j ≤ 4 and βj = 0 for j > 4. Because of the lack of information in a binary response, we choose n = 400 and p = 1000 for all three cases. The coefficients are a rescaled version of those in the linear example. The size of the screened model was set to k = [n/4 log n] for all the screening methods. Results The results are shown in Tables 2.2–2.4, where the prediction accuracy of each selected model was evaluated by the proportion of correct predictions based on independent testing data. Again, since the features are independent, all the methods performed well for setup 1. SMLE and its associated PLMs have the best performance in setups 2 and 3, where the correlation structure is more complex. LASSO continues to suffer from the collinearity between the features, particularly in setup 3. Screening Method PLM Coverage prop. PSR FDR CSR Ave. model size Prediction accuracy SIS None LASSO SCAD .94 .94 .94 .99 .99 .99 .53 .02 .02 .00 .84 .84 17.0 8.1 8.1 .84 .84 .86 ISIS None LASSO SCAD 1.00 .99 .99 1.00 1.00 1.00 .53 .03 .05 .00 .81 .63 17.0 8.2 8.5 .83 .84 .86 LASSO None LASSO SCAD .99 .99 .99 1.00 .99 1.00 .49 .02 .04 .00 .84 .68 15.7 8.2 8.4 .83 .84 .86 SMLE None LASSO SCAD .99 .99 .99 1.00 .99 1.00 .53 .02 .04 .00 .82 .67 17.0 8.2 8.4 .83 .84 .86 Table 2.5: Simulation results for logistic regression, setup 1 49 Screening Method PLM Coverage prop. PSR FDR CSR Ave. model size Prediction accuracy SIS None LASSO SCAD .11 .08 .10 .74 .62 .62 .78 .24 .27 .00 .00 .03 17.0 4.2 4.3 .71 .71 .73 ISIS None LASSO SCAD .89 .76 .84 .97 .93 .96 .71 .22 .13 .00 .21 .49 17.0 6.3 5.6 .75 .74 .80 LASSO None LASSO SCAD .85 .58 .80 .97 .88 .95 .68 .29 .19 .00 .05 .19 15.2 6.4 6.0 .76 .74 .79 SMLE None LASSO SCAD .97 .77 .88 .99 .94 .97 .71 .21 .13 .00 .23 .45 17.0 6.2 5.7 .76 .74 .80 Table 2.6: Simulation results for logistic regression, setup 2 Screening Method PLM Coverage prop. PSR FDR CSR Ave. model size Prediction accuracy SIS None LASSO SCAD .02 .01 .01 .45 .32 .34 .89 .78 .78 .00 .00 .00 17.0 6.1 6.3 .79 .74 .75 ISIS None LASSO SCAD .61 .29 .56 .82 .62 .76 .81 .66 .54 .00 .00 .06 17.0 7.4 7.2 .79 .77 .81 LASSO None LASSO SCAD .17 .03 .09 .60 .36 .43 .84 .77 .74 .00 .00 .01 15.2 6.4 6.7 .83 .75 .76 SMLE None LASSO SCAD .77 .50 .76 .92 .81 .91 .78 .53 .39 .00 .01 .13 17.0 7.1 6.5 .80 .79 .83 Table 2.7: Simulation results for logistic regression, setup 3 50 2.5.4 Poisson regression Parameter settings We next consider the situation where the generic response Y follows a Poisson distribution with mean exp(xT β). Thus, data generated as independent pairs from (Y, x) satisfy a Poisson regression model. The parameters used for the three correlation setups are as follows: Setup 1: s∗ is randomly chosen from {1, . . . , p} with ∥s∗ ∥0 = 8. For j ∈ s∗ , the √ βj values are generated independently from U (log n/ n + |Z|/8), where U is a binary random variable with P (U = 1) = 0.8 and P (U = −1) = 0.2 and Z is a standard normal random variable. For j ̸∈ s∗ , the βj values are set to zero. Setup 2: s∗ = {1, 3, 5, 7, 9}. (β1 , β3 , β5 , β7 , β9 ) = (2, −1.8, 1.6, −1.4, 1.2) and βj = 0 for j ̸∈ s∗ . Setup 3: s∗ = {1, 2, 3, 4}. βj = 0.7 for 1 ≤ j ≤ 4 and βj = 0 for j > 4. Similarly to the logistic example, we set n = 200 and p = 1000 for all three cases. Four screening methods (i.e., SIS, ISIS, LASSO, and SMLE) are compared with k = [n/2 log n]. We used a testing likelihood ratio to measure the goodness of fit for each model. Specifically, for a given testing data set and its associated log-likelihood ˜l(.), we computed the testing likelihood ratio as { } T ˜l(β ˆ ) 1∑ t 1− , ˜l(β ˆ ∗) T s t=1 ˆ ∗ is the MLE of β based on s∗ and β ˆ denotes the β estimate for a parwhere β s t ticular method (e.g., SMLE-SCAD) on the tth replication. We adopted the same ˆ as in the linear example and set T = 1000. choices for β t Results The results are summarized in Tables 2.8–2.10. Most of the patterns are consistent with those of the previous example. The benefits of SMLE are clear in setups 2 51 and 3. For setup 1, the performance of ISIS and SMLE is comparable, but SMLE is preferred because of its lower computational cost. Screening Method PLM Coverage prop. PSR FDR CSR Ave. model size Testing lh-ratio SIS None LASSO SCAD .07 .06 .07 .75 .74 .75 .68 .30 .24 .00 .00 .02 19.0 8.8 8.1 .34 .46 .33 ISIS None LASSO SCAD .94 .86 .93 .99 .97 .99 .58 .20 .08 .00 .13 .52 19.0 10.1 8.7 .14 .29 .04 LASSO None LASSO SCAD .67 .52 .66 .94 .90 .93 .58 .32 .15 .00 .03 .29 18.1 10.9 9.0 .15 .39 .12 SMLE None LASSO SCAD .97 .91 .97 .99 .98 .99 .58 .14 .06 .00 .27 .63 19.0 9.3 8.5 .18 .27 .03 Table 2.8: Simulation results for Poisson regressions, setup 1 Screening Method PLM Coverage prop. PSR FDR CSR Ave. model size Testing lh-ratio SIS None LASSO SCAD .01 .00 .01 .53 .47 .46 .86 .64 .65 .00 .00 .00 19.0 7.1 7.1 .46 .46 .42 ISIS None LASSO SCAD .84 .68 .83 .95 .89 .94 .75 .41 .22 .00 .05 .29 19.0 7.9 6.3 .17 .33 .07 LASSO None LASSO SCAD .39 .10 .38 .82 .66 .79 .76 .57 .40 .00 .00 .06 17.5 8.1 7.0 .22 .43 .18 SMLE None LASSO SCAD .94 .78 .93 .98 .94 .98 .74 .33 .15 .00 .09 .37 19.0 7.5 6.0 .21 .31 .03 Table 2.9: Simulation results for Poisson regressions, setup 2 52 Screening Method PLM Coverage prop. PSR FDR CSR Ave. model size Testing lh-ratio SIS None LASSO SCAD .00 .00 .00 .19 .14 .16 .96 .93 .92 .00 .00 .00 19.0 9.0 8.2 .61 .70 .62 ISIS None LASSO SCAD .54 .43 .54 .69 .59 .67 .85 .69 .55 .00 .01 .09 19.0 8.8 7.4 .34 .53 .26 LASSO None LASSO SCAD .01 .00 .00 .26 .19 .24 .93 .91 .87 .00 .00 .00 17.2 9.3 8.1 .58 .70 .60 SMLE None LASSO SCAD .93 .62 .91 .98 .86 .96 .79 .56 .33 .00 .01 .14 19.0 8.4 6.2 .25 .51 .07 Table 2.10: Simulation results for Poisson regressions, setup 3 53 2.5.5 Real-data example We now apply the SMLE-based method to a genetic application. Singh et al. [2002] measured the expression levels of 12600 genes from prostate specimens of 52 prostate cancer patients and 50 healthy controls. One objective was to build a gene-expression-based classification rule to predict the identity of unknown prostate samples. Such a classification tool is helpful in the early detection of prostate cancer, which provides a better opportunity for curative surgery. The identification of genes that influence the disease outcome also provides a greater understanding of the genetic aspect of prostate tumors. By performing a permutation-based correlation test, Singh et al. [2002] detected 456 potential genes that are differently expressed between tumorous and normal samples. Using the 6033 genes in the complete dataset, Efron [2009] found a further 377 genes using an empirical Bayesian approach, while Chen and Chen [2012] spotted 3 more genes using EBIC-based LASSO. In this example, we reanalyze the dataset by building a logistic regression logit{P (Y = 1|x)} = xT β, where Y is the binary status of the prostate cancer (with Y = 1 for a tumorous sample, Y = 0 for a normal sample) and x contains the 12600 gene expression levels. Accordingly, we predict Y = 1 when P (Y = 1|x) is estimated to be over 0.75 and predict Y = 0 otherwise. SMLE-SCAD is used for the analysis because of its superior performance in the simulation studies. We randomly select a set of 10 subjects from each of the tumorous and normal sample groups as the testing set, and treat the remainder as the training set. We set the screening bound k = 20 Screening Method SIS ISIS LASSO SMLE Ave. model size 2.2 2.3 2.7 2.7 Sensitivity .71 .68 .71 .77 Specificity .96 .96 .94 .94 Table 2.11: Results for prostate data. 54 Overall pred. error .17 .18 .17 .14 on the training set and assess the prediction accuracy of the selected model on the testing set. We base the assessment on T = 200 replications and summarize the results in terms of sensitivity, specificity, and the overall prediction error. For comparison purposes, we also include the results of SIS, ISIS, and LASSO followed by SCAD with k = 20. All the tuning parameters are selected by EBIC as in the simulation examples. From Table 2.11, we see that all four methods performed well, choosing a parsimonious model with relatively high prediction accuracy. SMLE shows its superiority by choosing a model with higher sensitivity, which corresponds to a lower chance that a cancer patient is wrongly diagnosed as healthy. The results of SMLE are consistent with those of Chen and Chen [2012] in terms of the number of selected genes but have a lower prediction error. 2.6 Summary and conclusions In this chapter, we have developed a new approach for the variable selection problem when the number of candidate variables is ultra-high. Our approach follows existing methods by first screening out a large number of seemingly nonsignificant covariates computationally efficiently and theoretically consistently. The screening step is followed by a widely accepted regularization approach to further reduce the number of variables in the model to enhance the interpretability. The second step results in a sequence of candidate models with increasing model complexity as the degree of regularization reduces. Specifically, we proposed SMLE for the screening; it naturally incorporates joint effects between features in the screening process. We showed that SMLE has the sure screening property in the ultra-high-dimensional GLM setup, and we developed an iterative hard-thresholding algorithm for its implementation. Our simulation study indicated that the new procedure is computationally efficient and competitive with other screening procedures. At the same time, the new method was observed to have a higher probability of retaining all the significant covariates in the model settings that we considered. 55 Chapter 3 The Penalized Pseudo-Likelihood in Analysis of Survey Data 3.1 Introduction In many areas of scientific research, one common interest is to identify the influential factors associated with certain behavioral, social, or economic indices within a target population. For example, sociologists and economists would like to identify important factors that affect the unemployment rate in a specific region, and epidemiologists are interested in finding risk behavior for diseases. In such studies, researchers often start with a survey of the target population (Korn and Graubard [1999], Rahiala and Ter¨asvirta [1993], Wolfson [2004]). A representative sample is then selected and measurements of the variables of interest for the sampled units are collected. A regression model is routinely employed to summarize the information contained in the data. It explains variations in the response variable through a simple function of explanatory variables (covariates). When they lack prior knowledge, researchers may collect information on many potential explanatory variables. In these applications, it is never straightforward to decide in advance which variables should be included in the perceived regression model. Consequently, the goal of identifying influential factors is often achieved through a variable selection procedure. That is, we assume a response variable together with a large number of covariates, based on which a regression model is to be fitted. In this chapter, we de56 velop a variable selection strategy for survey data and investigate its large-sample properties. Unlike the problems presented in Chapter 2, the sample size n is often very large, while the number of candidate covariates p is large but not huge. This makes a screening procedure unnecessary. However, the traditional all-subset selection (Section 1.3) is still computationally infeasible although its principles apply. Moreover, when we perform variable selection for survey sampling, many potential complications arise. First, the data collected through survey sampling are usually obtained from a finite population without replacement, and hence they have an intrinsic dependence structure (i.e., non-i.i.d.). Second, in complex survey designs, the inclusion probabilities of sampling units often vary across the target population. Consequently, the correlation between the response and the covariates reflected in the sample can be distorted from that of the population. This is potentially the case when some segments of the population are sampled more intensively than others. Ignoring the survey design in the selection process may result in biased selection results for the target population. In the literature, sampling weights are often considered in estimating finite population parameters such as the population mean or population proportions. The weighted estimates help to avoid biased inference from informative sampling (Pfeffermann [1993], Fuller [2009]). However, the role of sampling weights in variable selection is not completely clear. Although the parameter estimation and variable selection serve different purposes, they often have a coherent linkage in the modeling process. It is natural to conjecture that using sampling weights is beneficial for variable selection. In this spirit, we investigate the use of a pseudo-likelihood to take account of the sampling weights and propose a penalized pseudo-likelihood method (PPLM) for variable selection in the analysis of survey data. A sample-based BIC criterion is further derived to tune the implementation of the proposed method. In a joint randomization framework, we prove that the PPLM consistently identifies the influential variables through the BIC-based tuning. The proposed selection method is assessed through simulation studies and using data from the 2009 Survey on Living with Chronic Diseases in Canada. This chapter is organized as follows. In Section 3.2, we introduce the joint 57 randomization mechanism and the super-population model. In Section 3.3, we propose the PPLM and derive the sample-based BIC as a tuning strategy. In Section 3.4, we investigate the asymptotic behavior of the proposed method in a joint randomization framework. We use numerical studies in Section 3.5 to assess the performance of our approach and provide a concluding summary in Section 3.6. The proofs of the theorems are given in Appendix B, where a heuristic derivation of the proposed BIC can also be found. 3.2 Joint inference and super-population The random behavior of an inference procedure is mostly inherited from the randomness in the data. In the context of surveys, the set of sampled units is random because of the probabilistic sampling design. At the same time, the observed value of each sampled unit may also be regarded as a random outcome from some conceptual infinite super-population (Royall [1976]). In a design-based analysis, the finite population is regarded as nonrandom and all measurements of the sampled units are constants. The parameters of interest are finite population quantities such as the population total or the population median. The statistical inference is evaluated based on the randomness from the probability design. Nonparametric approaches are usually used for design-based inference. One may also regard the design-induced randomness as an artifact. The measurements of the sampled units are independent realizations of a random variable from a probability model for the postulated super-population. The parameters of interest are related to the assumed model and model-based inferences are evaluated solely based on the randomization introduced by the model. A third approach is called model-design-based inference; it incorporates randomization from both the design and the model. In such a joint randomization mechanism, the finite population is regarded as a random sample from a superpopulation. The survey sample is considered as a second-phase sampling from the super-population. The parameters of interest can be either model or finitepopulation parameters. In this mechanism, inferences on the finite-population parameters are motivated by the super-population model. Model-design-based inference can be more efficient than pure design-based approaches when the finite 58 population is well described by the super-population model. Compared with pure model-based approaches, it protects against model violation and is therefore more robust in general (Binder and Roberts [2003], Kalton [1983]). We study the variable selection problem under the joint randomization mechanism. Let D = {1, . . . , N } be a finite population consisting of N sampling units. The measurements on the ith unit are denoted (yi , xi ), where yi is the response of interest and xi = (xi1 , . . . , xip )T is a p-dimensional explanatory vector (covariate vector). These are regarded as independent realizations of (Y, X) from a super-population. We postulate a generalized linear model (GLM) on the superpopulation as follows. Conditioning on X, the distribution of Y belongs to a natural exponential family, the density of which takes the form f (y; θ) = c(y) exp{θy − b(θ)}. (3.1) θ is known as the natural parameter of f (y; θ) such that b′ (θ) = E[Y |X] ≡ µ and b′′ (θ) = Var[Y |X] ≡ σ 2 , and c(y) is a normalization constant. The influence of the explanatory variable X on Y is expressed through g(µ) = XT β for some assumed linkage function g(.), where the vector β = {β1 , . . . , βp }T is the p-dimensional regression coefficient. If g(.) is the canonical link, i.e., g(µ) = θ, then we have θ = XT β. For simplicity, we focus on the canonical link in this chapter. Based on this model, the effect of the explanatory variable is characterized through the size of the corresponding regression coefficient. In applications, a complex model with many variables often leads to overfitting and a poor interpretive value. Hence, it is desirable to fit the data with a parsimonious model in which many regression coefficients are estimated to be zero. Explanatory variables with nonzero coefficients are then considered to be influential on the response. To this end, we assume that β is ideally sparse, and address the variable selection problem by identifying a sparse model formed by the covariates with nonzero coefficients. 59 3.3 Pseudo-likelihood-based variable selection 3.3.1 The penalized pseudo-likelihood method With the model settings described in Section 3.2, it is clear that, if the measurement (yi , xi ) is observed for every unit in D, the randomness in the data introduced by the probability sampling design is completely gone. In this situation, the selection of the influential variables is based on the entire population and the PLMs developed in non-survey settings (purely model-based) remain valid for the model-designbased inference. As in a typical PLM, the model coefficient vector β is estimated ˇ through maximizing the penalized likelihood function by β λ QN (β) = lN (β) − N p ∑ ϕλ (|βj |), (3.2) j=1 where lN (β) = ∑N T i=1 log f (yi ; xi β) is the census log-likelihood function and ϕλ (.) is a penalty function indexed by a tuning parameter λ. With an appropriate ˇ contains zero estimates for some coefficients choice of ϕλ (.) (Section 1.4.1), β λ and thus removes the corresponding covariates from the model. Note that (3.2) is only conceptual, because observing (yi , xi ) for all units in D is usually not feasible in applications. Instead, a representative sample d = {i1 , . . . , in } ⊂ {1, . . . , N } with n units is often drawn from D and the measurements are observed based on the sampled units. Due to the intrinsic dependence structure among the sampled units, a full likelihood on d is prohibitive to compute in general. Alternatively, for the model-design-based inference, a pseudo-loglikelihood function is frequently used, which takes the form ln (β) = ∑ wi log f (yi ; β) (3.3) i∈d with wi denoting the standardized survey weight for the ith unit ( Typically, wi is chosen proportional to 1/P (i ∈ d) such that ∑ i∈d wi = n). −1 n ln (β) is design- unbiased to N −1 lN (β). Maximizing ln (β) over β leads to a maximum pseudo- 60 ˆ for β, i.e., likelihood estimator (MPLE) β ˆ = arg max ln (β). β (3.4) β ˆ is often n−1/2 consistent for β under the Under the appropriate sampling designs, β joint randomization framework. The idea of using pseudo-likelihood for inference on model parameters has been widely adopted in the literature (Binder [1983], Godambe and Thompson [1986], Molina and Skinner [1992]). Accordingly, we aim to develop an analog of PLM (3.2) based on the pseudolikelihood. In particular, we propose the maximum penalized pseudo-likelihood ˆ that maximizes estimator β λ Qn (β) = ln (β) − n p ∑ ϕλ (|βj |). (3.5) j=1 Compared with QN (.), the first term in Qn (.) is the survey-weighted pseudolikelihood, which potentially helps to avoid sampling errors that might lead to biˆ of Qn (β) ased inferences for the target population. Meanwhile, the maximizer β λ ˇ , which qualifies it inherits the sparsity property from its census-based version β λ ˆ as the PPLM as a variable selection operator. We refer to the selection based on β λ and further investigate its asymptotic performance in the next section. 3.3.2 Asymptotic properties of PPLM ˆ , we now establish its asymptotic conTo provide some theoretical insight into β λ sistency under the joint randomization framework described in Section 3.2. Suppose there is a sequence of finite populations, say Dr with r → ∞. Each Dr is an i.i.d. sample of size Nr from a super-population modeled by (3.1) with random variable (Y , X = {X1 , . . . , Xp }). Within each Dr , a sample dr of size nr is drawn according to some sampling scheme. We assume that both Nr and nr increase to infinity as r → ∞, with the sampling fraction nr /Nr bounded by some constant C < 1. For simplicity of notation, we will drop the index r in the following discussion. Without loss of generality, we assume that the first q coefficients are nonzero 61 and denote the true value of β by β ∗ = {β ∗1 , β ∗2 } with β ∗2 = 0. Also, we use s∗ to denote the true model {1, . . . , q} to be identified. We establish the consistency of ˆ under the regularity conditions specified as follows, where the first two are on β λ the super-population and the third is on the sampling plan: C1 There exists ξ1 > 0 such that max E[|b′′ (Xβ)Xj2 |1+η ] < ∞, 1≤j≤p for β ∈ {β : ||β − β ∗ || ≤ ξ1 } and for some η > 0. [ ] ∂ 2 log f (y; Xβ) I(β) = −E . ∂β∂β T C2 Let We assume that I(β) is continuous at β ∗ and λmin [I(β ∗ )] ≥ M1 for some constant M1 > 0, where λmin [A] denotes the smallest eigenvalue of matrix A. C3 The sampling scheme satisfies ∥ N 1 1∑ 1 ∑ wi zi − zi ∥ = Op (n− 2 ) n N i=1 i∈d for the sequence {zi } such that N −1 ∑N 2+η i=1 |zi | = O(1) with some η > 0. Condition 2 requires that I(β) is continuous at β ∗ , so that when β is close enough to β ∗ the minimal eigenvalue of I(β) is bounded away from zero. Condition 3 is quoted from Theorem 1 in Carrillo et al. [2010]; it requires that the weighted ∑ sample mean ZˆHT = n−1 i∈d wi zi is root-n consistent to the population mean ∑ Z¯ = N1 N zi . This condition is implied if asymptotic normality holds for ZˆHT , √ ˆ i=1 ¯ ν 2 ), which has been widely established in the literature. i.e., nZHT →d N (Z, ∑ 2+η = O(1), H´ For example, with moment condition N −1 N ajek [1960] i=1 |zi | 62 showed the asymptotic normality of ZˆHT for simple random sampling without replacement if n → ∞ and n/N → 0. Bickel and Freedman [1984] established the asymptotic normality of ZˆHT under a stratified sampling design, where samples are collected separately within strata that are prespecified in a finite population. Ohlsson [1989] further studied the behavior of ZˆHT under a two-stage sampling framework, where primary sampling units (PSUs) are first selected from the finite population and secondary sampling units (SSUs) are collected based on the selected PSUs. Ohlsson (1989) showed that if asymptotic normality holds for ZˆHT based on single-stage sampling on the PSUs, the asymptotic normality of ZˆHT still holds under a corresponding two-stage sampling. For asymptotic studies of ZˆHT under other popular sampling designs, we refer to H´ajek [1964], V´ıs˘ek [1979], and Chen and Rao [2007]. For the asymptotic analysis, we associate λ with n and denote the corresponding sequence by λn . We require the penalty function and λn to have the following properties: D1 For any λ > 0, ϕλ (|β|) ≥ 0 for β ̸= 0 and ϕλ (0) = 0. D2 Let ϕ′λ (|β|) = ∂ϕλ (|β|)/∂|β|. There exists a constant ξ2 such that ϕ′λ (|β|) ≥ 0 for |β| ∈ (0, ξ2 ) and all λ > 0. Also, ϕ′λ (|β|) is continuous at βj∗ for any j ∈ {1, . . . , q}. D3 Let φλ = max{ϕ′λ (|β0j |) for 1 ≤ j ≤ q}. For any M2 > 0, there exists ξ3 such that ϕ′λn (|β|) ≥ M2 (n−1/2 + φλn ) for |β| ∈ (0, ξ3 ). ˆ via the With ∥.∥ denoting the Euclidean norm, we establish the consistency of β λn following theorem. Theorem 3.1 Under conditions C1–C3, if φλn → 0 as n → ∞, then there exists a ˆ = (β ˆ ,β ˆ ) of the penalized pseudo-likelihood function local maximizer β λn 1λn 2λn (3.5) with ϕλn (|β|) satisfying D1–D3 such that ˆ − β ∗ ∥ = Op (n− 2 + φλ ) and ∥β λn n 1 63 ˆ P {β 2λn = 0} → 1. See Appendix B for the proof. As shown in Theorem 3.1, with an appropriate choice of ϕλ (.), the maximum penalized pseudo-likelihood estimator is consistent in both parameter estimation and variable selection under the joint randomization framework. In particular, when ϕλ (.) is chosen as the Lγ penalty, i.e., ϕλn (|β|) = λn |β|γ with γ ∈ (0, 1), consistency holds if λn → 0; when ϕλ (.) is chosen as the √ SCAD penalty, consistency holds if λn → 0 and nλn → ∞. In addition, for a special class of penalty functions, we have the following corollary. Corollary 3.1 Suppose that, for any β ̸= 0, there exists M > 0 such that ϕ′λn (|β|) = 0 when n > M . Then, under the conditions of Theorem 3.1, the maximizer ˆ = (β ˆ ,β ˆ ) satisfies β λn 1λn 2λn ˆ =β ˆ ) → 1 and P (β ˆ P (β 1λ 1 2λn = 0) → 1 ˆ denoting the maximizer of ln (β) based on the true model s∗ . with β 1 See Appendix B for the proof. Corollary 3.1 implies that the PPLM is able to consistently identify the influential variables and estimate their coefficients as efficiently as the MPLE (3.4) based on the true model. This result echoes the notion of the “oracle property” in Fan and Li [2001], which is desirable for the PLM in non-survey situations (Section 1.4.2). 3.3.3 Tuning strategy via sample-based BIC As in standard PLMs, by varying the level of penalty ϕλ (.) in (3.5), the PPLM suggests a series of models with differing sparsity. In applications, one needs to make a choice among these sparse models. Given the specific form of ϕλ (.), this issue boils down to choosing an appropriate tuning parameter λ. In the non-survey context, various criteria have been proposed for tuning a PLM (Section 1.4.3). In particular, BIC (Schwarz [1978]) has been shown to be effective (Wang et al. [2007], Zhang et al. [2010]). In the same spirit, we derive a sample-based BIC for PPLM in analysis of survey data. Following the super-population formulation described in Section 3.2, we treat the sparse models (3.1) suggested by the PPLM as candidates for further selection. Specifically, for a specified form of penalty ϕλ (.), let Ω be the range of λ under 64 ˆ for some consideration. We denote by sλ a candidate model corresponding to β λ λ ∈ Ω. Let β sλ be the τ (sλ )-dimensional coefficient of model sλ and let νsλ be the prior density of β sλ . Then a pseudo-marginal density function of the data is given by ∫ Pn (y|sλ ) = Ln (y; β sλ )νsλ (β sλ )dβ sλ . Consequently, we may regard the following expression as the pseudo-posterior probability of the model sλ : Pn (sλ |y) = ∑ Pn (y|sλ )P (sλ ) , sλ ∈SΩ P (sλ )Pn (y|sλ ) (3.6) where SΩ = {sλ : λ ∈ Ω} is the collection of candidate models. In the spirit of Bayesian analysis, the model with the highest posterior Pn (sλ |y) is considered to ∑ be the one that receives the most support from the data. Since sλ ∈SΩ P (sλ )Pn (y|sλ ) does not depend on any specific model, the highest Pn (sλ |y) is achieved by the model that maximizes the corresponding Pn (y|sλ )P (sλ ). When the uniform prior P (sλ ) = ζ over SΩ is used and under some regularity conditions, we obtain a Laplace approximation (Tierney et al. [1989], Kass and Raftery [1995]): ˆ ) + τ (sλ ) log n + Op (1) −2 log{Pn (y|sλ )} = −2ln (β sλ ˆ denoting the MPLE (3.4) based on sλ . Hence, we choose λ such that the with β sλ corresponding model sλ minimizes ˆ ) + τ (sλ ) log n. BICn (sλ ) = −2ln (β sλ (3.7) To provide further theoretical justification for the proposed BIC tuning strategy, we let s be an arbitrary model and define two sets of candidate models as follows: • Overfitted models: • Underfitted models: S+ = {s : s∗ ⊂ s, s ̸= s∗ }; S− = {s : s∗ ̸⊂ s}. Then, Ω can be partitioned accordingly into Ω+ = {λ : sλ ∈ S+ }, Ω− = {λ : sλ ∈ S− }, 65 Ω∗ = {λ : sλ = s∗ }. (3.8) In Theorem 3.1, we have shown that P (Ω∗ ̸= ∅) → 1. Therefore, selection consistency of PPLM with a tuning parameter selected by BIC (3.7) is achieved if BIC is able to identify s∗ from any model sλ with λ ∈ Ω+ ∪ Ω− . We use the following theorem to establish this consistency result. Theorem 3.2 Under conditions C1–C3, P{ min λ∈Ω+ ∪Ω− BICn (sλ ) ≤ BICn (s∗ ) } → 0. See Appendix B for the proof. 3.4 Numerical studies To evaluate the finite sample performance of PPLM, extensive numerical studies have been conducted using data from the Survey on Living with Chronic Diseases in Canada (SLCDC; Canada [2009]). In particular, we compare the proposed selection method with standard non-survey PLMs for a couple of sampling plans. The benefits of using survey weights are further examined in the situation where the presumed model is misspecified from the model that generates the data. We also report the analysis of the original SLCDC 2009 data as an example of using PPLM in real applications. 3.4.1 SLCDC data SLCDC is a cross-sectional study sponsored by the Public Health Agency of Canada that collects information related to the experiences of Canadians with chronic health conditions. One of the main objectives of SLCDC is to identify health behavior that influences disease outcomes, so that the government can better plan and provide health services for people with chronic diseases. SLCDC takes place every two years, with two chronic diseases covered in each survey cycle. The 2009 survey focused on arthritis and hypertension. We restrict our attention to hypertension. The target population for the hypertension survey is Canadians aged twenty years or older from the ten provinces who have been diagnosed with hypertension and who live in private dwellings. To facilitate the survey process, the sampling units of SLCDC 2009 are people with hypertension who 66 completed the 2008 Canadian community health survey (CCHS). For the purpose of SLCDC, the population is first stratified according to the CCHS respondents based on sex and four age groups: 20–44, 45–64, 65–75, and 75+. Therefore, the finite population formed by the CCHS respondents was divided into 8 categories, age (4 levels) by sex (2 levels). A stratified sampling plan is used for SLCDC with proportional sample size allocation. An overall sample of 9005 was selected from the 17437 CCHS respondents, and 6142 respondents completed the SLCDC survey. We identified 40 variables relevant to hypertension based on the original SLCDC data, of which 7 variables have complete information on all 6142 respondents. The remaining 33 variables have missing values due to non-responses in the original questionnaire (see Table 3.1 for a list of the variables and the corresponding nonresponse rates). There was no obvious systematic reason for the non-response. The variable with the most severe missingness is INCDRPR (household income) with a 9.6% non-response rate; the amount of missing data is relatively minor for the remaining variables. To facilitate the analysis, we used simple imputation methods for the missing data as follows. For a categorical variable, we imputed the nonresponse value by a random value from the response set. For a continuous variable, we imputed the non-response value by the mean value of the responses. Two exceptions to the above imputation are the variables BMHX 02 and CNHX 05. The former acts as the response variable of the regression model in the later data analysis, while the latter has natural restrictions on its range. We removed the 274 observations with missing values for these two variables, leading to basic working data with 5868 observations. The imputation/removal procedure does not have any effect on the evaluation of the BIC procedure based on the simulated population. It could bias the analysis of the real data. However, given the low rate of missingness and the plausibility of missing-at-random in this specific case, the conclusion is unlikely to be severely affected. Since the SLCDC is a follow-up to the CCHS, the sampling weights for SLCDC were initially obtained from the weights of the CCHS data. The weights were then adjusted to ensure that the SLCDC respondents represent the target population. Consequently, the adjusted weights show considerable variation between sampled units. The standardized values of the adjusted weights vary between 0.01 and 33.62 67 with an inter-quartile range of 0.76. Table 3.1: Variables for analysis of SLCDC data with non-response adjustments: A: allocate to other categories; D: delete from the data; M: impute by mean values; NA: no adjustment applied. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 Variable BMHX 02 GEO QB GEO ON GEO BC GEO PR DHHX AGE DHHX SEX GENXDHMH CNHX 05 MEHX 02 MEHX 03 MEHXGMED MEHX 06 MEHXDMCO HUHXDHP SMHX 11A SMHX 13A SMHXDSLT SMHXDFDC SMHXDPAC SMHXDBW MOHXDBPM MOHX 02 INHX 01A INHX 01F INHX 02A INHX 02C INHX 02G INHX 02H INHX 04 INHX 06 INHX 07 CPGFGAM DHHDECF EDUDH04 FVCGTOT GEODUR2 HWTDBMI INCDRPR SACDTOT Description Levels Missing Adjust Blood pressure control status 2 1.6% D Provinces grouped by region - QC 2 -NA Provinces grouped by region - ON 2 -NA Provinces grouped by region - BC 2 -NA Provinces grouped by region - PR 2 -NA Age Cont. -NA Sex 2 -NA Perceived mental health 2 0.2% A High blood pressure - age when diagnosed Cont. 2.7% D No. of medications taken Cont. 0.3% M No. of times per day medications taken Cont. 0.1% M No. of medications for high blood pressure Cont. 2.0% M No. of times per day bp medication taken Cont. 1.0% M Medication compliance - overall 2 0.2% A Consulted family doctor about hbp 2 0.1% A Smoked at any time since being diagnosed 2 0.1% A Drank alcohol since being diagnosed 2 0.2% A Daily salt intake 2 0.2% A Dietary foods 2 0.1% A Exercise/physical activity 2 0.1% A Body weight control 2 0.2% A Self-monitoring of blood pressure 2 0.3% A Correct use of bp measurement device 2 0.5% A Info from family doctor 2 2.4% A Info from family member/friend 2 2.4% A Info from book, pamphlet, brochure 2 1.5% A Info from package insert with medication 2 1.5% A Info from media 2 1.5% A Info from internet 2 1.5% A Info received - emotional impact of hbp 2 0.8% A Info received - correct use of medication 2 0.6% A Info received - additional information 2 0.9% A Gambling activity 2 0.5% A Household type 2 0.2% A Highest level of education in household 2 3.4% A Daily consumption - fruits and vegetables 2 5.2% A Urban and rural areas 2 -NA Body mass index (BMI) self-report Cont. 2.1% M Household income - provincial level 10 9.6% A Total number hours - sedentary activities Cont. 1.5% M 68 3.4.2 Simulation settings We first design simulation studies based on the SLCDC data. Specifically, we treat the 40 identified variables as candidate covariates for some response variable Y , and index them as X1 to X40 for simplicity. We consider both continuous and binary responses in our simulations. For the continuous cases, we generate the values of Y according to • Model 1: Y = 0.7X6 + 0.7X10 + 0.6X18 − 0.6X22 + ϵ, • Model 2: Y = 0.7X6 + 0.6X10 + 0.6X18 − 0.5X22 + 0.3X30 − 0.3X34 + ϵ, with ϵ ∼ N (0, 1). For the binary cases where Y ∈ {0, 1}, we generate the values of Y according to the logistic models • Model 3: logit(Pr{Y = 1| X}) = 0.7X7 − 0.6X8 + 0.5X26 , • Model 4: logit(Pr{Y = 1| X}) = 0.8X7 − 0.7X8 + 0.6X26 − 0.5X28 + 0.4X36 . The specified models include one of the strata identifiers in SLCDC, i.e., X6 (Age) or X7 (Sex). Compared with models 1 & 3, models 2 & 4 have two more influential covariates with small coefficients, so it is more difficult to correctly identify them. The finite population used in the simulation was created as follows. The basic working data of 5868 respondents was duplicated 10 times proportional to the rounded integer values of the SLCDC weights, resulting in a pseudo-finite population of size 55950 with complete information on X1 , . . . , X40 . The values of the response Y were then generated based on models 1–4 respectively. We consider the variable selection problem to be the identification of the postulated model that generates the values of Y . We investigate the performance of the proposed procedure under two stratified sampling plans. Specifically, we create four strata based on variables X6 (age, 55-/55+) and X7 (sex, Male/Female), which leads to the group (Female, 55-) of size 7120, the group (Female, 55+) of size 19199, the group (Male, 55-) of size 6187, and the group (Male, 55+) of size 23458. In the first plan, a simple random sampling without replacement (SRSWR) with equally allocated sample sizes is drawn from each stratum. The inference is made based on the four SRSWRs 69 pooled together. In the second plan, we further construct three subgroups within each stratum based on the sum of two binary covariates of the postulated models. The subgroups are based on X18 + X22 for the data from models 1–2 and on X8 + X26 for the data from models 3–4. We then make inference based on the SRSWRs drawn from each subgroup of the four strata. The overall sample size is equally allocated at the stratum level with a 2:1:2 proportion for the three subgroups within the same stratum. A simple Monte Carlo computation reveals that the sample correlation between X18 and X22 (for the data from models 1–2) can be as high as 0.5, whereas their population-based correlation is around 0.02. A similar phenomenon is observed between X8 and X26 (for the data from models 3–4). We therefore expect variable selection under the second sampling plan to be more challenging because of this systematic inflation. In the simulations, we set the overall sample size n = 500 for models 1–2 and n = 1500 for models 3–4. The PPLM was then carried out on probability samples obtained from the finite population. In particular, we chose the SCAD penalty for the penalized pseudo-likelihood function (3.5), as advocated by Fan and Li [2001]. The corresponding maximization of (3.5) was solved using the thresholding-based iterative algorithm (She [2011]) and the tuning parameter was determined by the sample-based BIC (3.7). For comparison purposes, AIC (Akaike [1973]) and GCV (Craven and Wahba [1979]) were used as alternatives to the BIC tuning strategy. Based on the discussion in Section 3.3.3, we define the sample-based AIC and GCV as ˆ ) + 2τ (sλ ), AICn (sλ ) = −2ln (β sλ ˆ ) ln (β 1 sλ GCVn (sλ ) = − , n (1 − τ (sλ )/n)2 where λ was selected similarly by minimizing the corresponding scores. Moreover, for each setup, we repeated the selection procedure with all survey weights ignored (set to unity). The unweighted selection results correspond to pure model-based inferences as discussed in Section 3.2. In particular, the PPLM reduces to the standard PLM (3.2) used for non-survey situations. 70 3.4.3 Simulation results In Tables 3.2–3.3, we summarize the simulation results based on 1000 repetitions in terms of the positive selection rate (PSR), false discovery rate (FDR), correct selection rate (CSR), and averaged model size (AMS). Specifically, let s0 be the true model that generates the finite population and s′j be the selected model based on the jth sample, j = 1, . . . , 1000. The PSR, FDR, CSR, and AMS are estimated as ∑1000 PSR = CSR = j=1 τ (s∗ ∩ s′j ) ∑1000 , 1000τ (s∗ ) ∑1000 ′ ∗ j=1 I(sj = s ) 1000 j=1 FDR = , τ (s′j /s∗ ) , 1000τ (s′j ) ∑1000 ′ j=1 τ (sj ) AMS = , 1000 where τ (s) denotes the size of model s and I(.) is the indicator function. In addition, we assess the predictive accuracy of the selected model as follows. For each setup, a test sample of size 200 is generated by SRSWR from the same finite population as that for the training sample. For models 1–2, we use the averaged residual sum of squares (RSS) on the test data as a measurement of the predictive ability of the selected model. For models 3–4, we compute both positive and negative prediction rates. To be specific, let π ∗ be a specified benchmark and π ˆi be the estimated success probability of the ith test sample, i = 1, . . . , 200. We then predict the ith response yi by yˆi = 1 if π ˆi > π ∗ and yˆi = 0 otherwise. The correct prediction rates are estimated by ∑ PPR = yi = i∈{i:yi =1} I(ˆ ∑200 i=1 I(yi = 1) 1) ∑ , NPR = yi = i∈{i:yi =0} I(ˆ ∑200 i=1 I(yi = 0) 0) . The final PPR and NPR are averaged based on 1000 replications. Note that PPR and NPR are similar to sensitivity and specificity in the clinical studies, which indicate the ability of a 0-1 prediction approach in terms of correct positive and negative predictions. In general, a larger π ∗ leads to high NPR but low PPR. The value of π ∗ should be cautiously specified in applications. In our simulation studies, we set π ∗ = 0.5 for simplicity. The results are encouraging for the PPLM and the sample-based BIC tuning 71 Table 3.2: Selection results for the first sampling plan: Prediction assessments for models 1–2 are based on the testing RSS, while for models 3–4 they are based on (PPR, NPR) with a benchmark 0.5. Weights Criterion PSR Ignored GCV AIC BIC GCV AIC BIC .96 .99 .96 .95 .99 .95 FDR Model 1 .19 .48 .19 .24 .61 .24 GCV AIC BIC GCV AIC BIC .72 .89 .73 .74 .89 .74 GCV AIC BIC GCV AIC BIC GCV AIC BIC GCV AIC BIC Included Ignored Included Ignored Included Ignored Included CSR AMS Prediction .28 .05 .28 .19 .01 .20 4.9 8.7 4.9 5.2 11.4 5.3 1.04 1.08 1.04 1.05 1.11 1.05 Model 2 .19 .44 .19 .24 .54 .24 .02 .01 .03 .02 .01 .03 5.5 10.3 5.6 6.1 12.5 6.1 1.07 1.09 1.07 1.08 1.12 1.08 .99 .99 .96 .99 .99 .94 Model 3 .59 .62 .43 .67 .70 .45 .00 .00 .00 .00 .00 .00 7.8 8.4 5.1 9.9 10.7 5.3 (.71, .45) (.69, .49) (.72, .44) (.71, .47) (.68, .48) (.71, .45) .97 .98 .87 .98 .98 .86 Model 4 .44 .47 .26 .54 .56 .30 .01 .01 .07 .01 .00 .05 9.4 9.8 6.0 11.4 11.9 6.2 (.66, .55) (.65, .56) (.69, .53) (.66, .54) (.66, .55) (.68, .53) method. From Tables 3.2–3.3, we observe that the models selected by AIC have both high PSR and FDR, which indicates an excessive inclusion of the irrelevant variables. In comparison, BIC significantly reduces the FDR of the selected models with a slight sacrifice in the PSR, and selects models with more accurate sizes. Although GCV behaves similarly to BIC in linear model settings, it is similar to AIC 72 Table 3.3: Selection results for the second sampling plan: Prediction assessments for models 1–2 are based on the testing RSS, while for models 3–4 they are based on (PPR, NPR) with a benchmark 0.5. Weights Criterion PSR Ignored GCV AIC BIC GCV AIC BIC .83 .97 .83 .95 .99 .95 FDR Model 1 .23 .49 .23 .31 .65 .30 GCV AIC BIC GCV AIC BIC .62 .88 .62 .72 .89 .72 GCV AIC BIC GCV AIC BIC GCV AIC BIC GCV AIC BIC Included Ignored Included Ignored Included Ignored Included CSR AMS Prediction .17 .04 .17 .13 .00 .14 4.6 8.6 4.6 5.9 12.5 5.9 1.09 1.10 1.09 1.07 1.12 1.07 Model 2 .22 .45 .22 .28 .59 .27 .02 .01 .02 .01 .00 .01 5.0 10.3 5.1 6.5 13.7 6.5 1.13 1.14 1.12 1.10 1.12 1.10 .87 .88 .65 .97 .97 .89 Model 3 .62 .63 .62 .74 .75 .50 .00 .00 .00 .00 .00 .00 7.3 7.6 4.5 11.9 12.4 5.6 (.66, .44) (.65, .45) (.68, .42) (.70, .46) (.68, .46) (.70, .44) .94 .95 .72 .93 .94 .82 Model 4 .48 .50 .41 .61 .62 .34 .00 .00 .00 .00 .00 .01 9.5 10.0 6.1 12.5 12.9 6.4 (.62, .51) (.62, .52) (.64, .49) (.64, .53) (.64, .53) (.67, .54) for the logistic models where less information is provided by the binary responses. In the first sampling plan, the inclusion probabilities are related to Y through a single covariate in the model (i.e., X6 or X7 ). The sample correlation structure between the response and covariates is largely maintained from the finite population. Consequently, no substantial difference is observed between the weighted 73 Table 3.4: Selection frequency of influential variables in model mis-specified case X38 AMS Testing RSS GCV AIC BIC X20 n=500 .78 .95 .95 .99 .83 .97 .56 .73 .60 5.9 12.5 6.6 1.93 1.95 1.93 GCV AIC BIC .73 .91 .78 .84 .85 .83 6.3 12.5 6.9 1.77 1.79 1.77 Ignored GCV AIC BIC n=1000 .96 1.00 .79 .99 1.00 .87 .97 1.00 .80 7.6 13.1 7.9 1.87 1.88 1.87 Included GCV AIC BIC .93 .98 .94 7.6 13.0 7.7 1.71 1.72 1.71 Weights Criterion Ignored Included X18 .92 .99 .94 1.00 1.00 1.00 .94 .96 .94 and unweighted selection procedures from Table 3.2. The benefits of using sampling weights in variable selection are tentatively revealed by the second sampling plan, where the sample correlation structure is systemically distorted. Clearly, the spurious correlation between covariates in the sampled units deteriorates the efficiency of selection methods. This is reflected by the reduced PSRs and the inflated FDRs for the unweighted procedures. Incorporating sampling weights in the selection process helps to correct the biased result. In particular, noticeable improvements have been observed for the PPLM. In the most impressive case (i.e., model 3 of Table 3.3), the PPLM with BIC substantially improves the standard PLM procedure by increasing the PSR from .65 to .89, while reducing the corresponding FDR from .62 to .50. Our observation echoes the rationale for weighting to remove bias due to informative sampling Fuller [2009]. Sampling weights also provide protection against model mis-specification (Pfeffermann and Holmes [1985], Kott [1991]): inferences based on weighted estimates may remain valid for the surveyed population, even when the model fails. To gain further insight into weighting in variable selection, we compare the PPLM with the standard PLM in 74 a simulation where the presumed model is misspecified from the model that generates the data. In this situation, a postulated “true” model does not exist, and the goal of variable selection is to find an optimal model that well describes the finite population. We still make use of the stratified pseudo-finite population in Section 3.4.2 but generate the response variable Y according to the strata. Specifically, the values of Y for units in strata (Male, 55+) and (Female, 55+) were generated by Y = 0.6X6 + 0.4X18 + 0.4X20 + 0.6X38 + ϵ, while the values Y for units in the strata (Male, 55-) and (Female, 55-) were generated by Y = 0.6X6 + 0.4X18 + 0.4X20 + ϵ with ϵ ∼ N (0, 1) denoting a random error. In other words, we assume that variable X38 is influential for people aged 55 and older but not for those younger than 55. In addition, we further violate the presumed model 3.1 by excluding X6 from the set of candidate covariates, which mimics the situation where one important design feature is omitted in the modeling. A stratified SRSWR of size 500 or 1000 is drawn using the first sampling plan in Section 5.2. The weighted and unweighted procedures are then tested for the variable selection based on the sampled units. We summarize the simulation results in Table 3.4 by estimating the selection rates of X18 , X20 , and X38 based on 1000 replications. Similarly to the previous simulations, the averaged model size (AMS) and the testing RSS of the selected models (i.e., the averaged RSS based on testing data of size 200) are also included in the summary. From Table 3.4, we see that when the model assumption is violated, the PPLM still achieves relatively high prediction accuracy by suggesting relevant variables with high probability. In contrast, ignoring the survey weights leads to a relative loss of nearly 9% on the testing RSS because of the exclusion of X38 . Apparently, increasing the sample size helps to improve the goodness of fit for the misspecified models, but the cost is the inclusion of more variables. 75 3.4.4 Analysis of SLCDC data To illustrate the application of PPLM, we use it to identify health behaviors that affect the control of blood pressure using SLCDC 2009. The response variable is BMHX 02 from the working data obtained from SLCDC, which has two levels indicating whether or not the blood pressure of the respondent is under control, based on the latest measurement by a health professional. We treat the remaining 39 variables in the working data as candidate covariates, and our goal is to identify the influential covariates that are associated with blood-pressure control. We build a logistic regression of BMHX 02 on the candidate covariates and use PPLM with the SCAD penalty to select the influential covariates. As a preliminary step, each covariate is standardized such that the corresponding first and second weighted sample moments are zero and unity respectively. For comparison, the BIC (3.7) and AIC/GCV are used for the tuning parameter selection. In Figure 3.1, we plot the criterion scores with respect to the model sparsity. We see that BIC selects a model with 11 covariates, while GCV and AIC select the same model, which has 24 covariates. When survey weights are ignored in the selection procedure, models with 7 or 21 covariates are suggested based on PLM with the standard BIC or GCV/AIC. The difference between the weighted and unweighted selection results reflects the potential distortion in the correlation structure of the sampled units. This difference may also be explained by the model mis-specification for part of the SLCDC population (Lohr and Liu [1994]). Given the potential bias of unweighted methods, the weighted results are more plausible. We further assess the selected models in terms of their predictive accuracy as follows. First, we draw 500 independent sets of 5868 bootstrap samples (with replacement) from the working data of SLCDC. For the tth bootstrap sample dt , t = 1, . . . , 500, the survey weight wi for the ith unit is adjusted by w ˜ti = vti wi with vti denoting the number of times that the ith unit is selected in dt . We then fit the selected models to each bootstrap sample (with the weights accounted for accordingly), and evaluate their weighted positive and negative prediction rates (WPPR, WNPR) by ∑ WPPR = i̸∈dt ∑ wi I(ˆ yi = 1, yi = 1) i̸∈dt wi I(yi = 1) ∑ , WNPR = i̸∈dt ∑ wi I(ˆ yi = 0, yi = 0) i̸∈dt 76 wi I(yi = 0) , AIC 0.570 6700 0.575 6800 0.580 6900 0.585 7000 0.590 GCV 0 10 20 30 40 0 10 20 model size 30 40 model size 6700 6800 6900 7000 BIC 0 10 20 30 40 model size Figure 3.1: Selection criteria values based on candidate models where yi and yˆi denote the ith response in BMHX 02 and its predicted value. We summarize the averaged WPPR and WNPR based on 500 bootstrap samples in Table 3.5 according to three different benchmark values (i.e., 0.25, 0.35, 0.45). From Table 3.5, we observe that the models selected from the unweighted analysis have a lower WPPR in general, which provides additional support for using survey weights in the selection procedure. Compared with GCV/AIC, BIC selects a model with a slightly conservative WPPR but a higher WNPR. However, the difference is not significant. The size of the BIC-selected model is much less than that of the model selected by GCV/AIC, which provides an easier interpretation of the response BMHX 02 and the covariates. To assess the stability of the selection, we repeat the PPLM based on 500 boot77 Table 3.5: Prediction accuracy for selected models (WPPR, WNPR) based on different benchmarks. Weights Ignored Included Criteria AIC/GCV BIC AIC/GCV BIC .25 (.646, .525) (.649, .513) (.645, .523) (.654, .532) .35 (.460, .688) (.445, .705) (.488, .682) (.485, .706) .45 (.299, .811) (.265, .818) (.338, .790) (.322, .830) Table 3.6: Bootstrap selection results for significant variables: (Estimated coefficient, Standard error, Selection rate). Variable GEO ON DHHX AGE GENXDHMH SMHXDSLT MOHXDBPM INHX 06 HWTDBMI Ave. Model Size GCV ( .14, .09, .86) (-.29, .09, 1.0) (-.15, .05, .99) ( .11, .07, .76) (-.08, .07, .67) ( .18, .06, .97) ( .14, .06, .95) 23.1 AIC ( .16, .09, .92) (-.32, .09, 1.0) (-.15, .05, .99) ( .12, .07, .84) (-.09, .06, .81) ( .18, .06, .99) ( .14, .06, .97) 27.8 BIC (.09, .09, .58) (-.27, .08, 1.0) (-.14, .06, .92) ( .08, .09, .47) (-.05, .07, .35) ( .18, .07, .91) ( .13, .06, .91) 10.3 strap samples. In Table 3.6, we list the bootstrap selection rate for the seven most significant covariates according to their MLEs in the original SLCDC working data. The corresponding coefficient estimates and standard errors are also included based on the bootstrap samples. From Table 3.6, we find that only four significant covariates (i.e., DHHX AGE, GENXDHMH, INHX 06, and HWTDBMI) are consistently selected by BIC, while GCV/AIC tends to select more unreliable covariates. The selection results based on BIC suggest a strong association of blood pressure control with age, weight, mental health, and medication information; this is consistent with the hypertension literature (Gelber et al. [2007], Yan et al. [2003]). 3.5 Summary and conclusions In this chapter, we have addressed variable selection in the analysis of complex surveys. When units are selected by disproportionate sampling, the data correlation structure in the sample can be distorted. Incorporating sampling weights in the se78 lection process protects against biased selection results. In this spirit, we proposed a survey-weighted regularization method based on the pseudo-likelihood (PPLM) and derived a sample-based BIC for its implementation. Under some regularity conditions, we showed that the PPLM consistently identifies the influential variables under a joint randomization framework. The performance of the proposed method was confirmed by numerical studies. 79 Chapter 4 A Thresholding Algorithm for PLM in Finite Mixture Models 4.1 Introduction In the previous chapters, we have addressed the feature selection problem for ultrahigh-dimensional data and complex survey data. New approaches have been developed in the framework of penalized likelihood methods (PLMs), and their effectiveness has been illustrated via both theoretical and numerical studies. In this chapter, we continue our investigation of PLMs to address feature selection for finite mixture models in the analysis of heterogeneous data. Finite mixture models play important roles in statistical learning and data mining. They have important applications in scientific disciplines such as genetics, marketing, and medical research (Shoukri and MacLachlan [1994], B¨ohning [2000], Brijs et al. [2004]). They are routinely used to detect the presence of subpopulations in an overall population. Determining the number of components (order selection) is a fundamental problem in these applications. For instance, the order of a mixture model in a genetic application is indicative of some fundamental gene structures. In financial applications, the order reflects the complexity needed to appropriately model the real world and thereby guides the relevant practices. A mixture model with an excessive number of components usually overfits the data and leads to poor interpretive value. Many researchers have investigated the strate80 gies that determine the appropriate order of a finite mixture model given a random sample from the target population. Order selection is invariably a trade-off between model complexity and goodnessof-fit. The classical Akaike information criterion (AIC; Akaike [1973]) and the Bayesian information criterion (BIC; Schwarz [1978]) are often employed in the context of finite mixture models. These methods discount the likelihood-based measure of the goodness-of-fit by penalties directly proportional to the model order. Although BIC has been shown to be consistent (Leroux [1992], Keribin [2000]), its optimality under regular models does not extend to nonregular finite mixture models for order selection under general conditions. Many new procedures have been discussed in the literature, such as distance-measure-based approaches (Chen and Kalbfleisch [1996], Woo and Sriram [2006]), hypothesis-testing-based approaches (Ghosh and Sen [1985], MacLachlan [1987], Chen and Chen [2001], Li and Chen [2010]) and Bayesian approaches (Richardson and Green [1997], Ishwaran et al. [2001], Berkhof et al. [2003]). In this chapter, we study the penalized likelihood method (PLM) introduced by Chen and Khalili [2008]. Unlike other approaches, this method introduces penalties based on two types of overfitting. In particular, it introduces a nonsmooth penalty term to merge close subpopulations in a finite mixture model. The shrinkage principle of regularized regression, e.g., LASSO (Tibshirani [1996]) and SCAD (Fan and Li [2001]), is seamlessly employed for the order selection. With an appropriate choice of the penalty functions, the method is consistent in both identifying the order of the mixture model and estimating the mixing distribution. It has the advantage that the order selection and parameter estimation are completed in one strike. Similarly to PLMs in the context of regression, the numerical problem of Chen and Khalili [2008] does not have straightforward solutions because the penalty functions can be nonsmooth and nonconvex. In the case of SCAD for regression models, Fan and Li [2001] propose locally approximating the nonconvex penalty by a quadratic function. With the aid of local quadratic approximation (LQA), they solve a series of convex optimization problems over smooth objective functions. Chen and Khalili [2008] adopt this strategy and find that the method is also suitable for order selection. However, it does not directly give sparse solutions, as required for variable or order selection. Recently, substantial progress has been made 81 on optimization problems related to regularization methods (Efron et al. [2004], Zou and Li [2008], Friedman et al. [2007]). In particular, thresholding-based algorithms (Daubechies et al. [2004], She [2009]) provide a superior solution to the order selection problem. In this chapter, we aim to provide an effective computational approach for the regularization-based order selection method in finite mixture models. The implementation of such methods can be embedded in a typical EM framework, and the challenge arises within the M-step, where the objective function is multivariate, nonsmooth, and nonconcave. To overcome this difficulty, we first transform the multivariate optimization problem into a set of univariate optimizations. A thresholding-based algorithm is then used for the nonsmooth and nonconcave but univariate objective functions. We refer to this new computational strategy as the iterative thresholding-based descent algorithm (ITD). Within the EM-framework, the ITD efficiently leads to a sparse estimate of the mixing distribution. It hence attains the goal of the order selection of Chen and Khalili [2008]. We establish the convergence of ITD and demonstrate its performance through simulations and real-data examples. The rest of this chapter is organized as follows. In Section 4.2, we review the regularization approach for order selection and discuss implementation strategies. We introduce the ITD in Section 4.3 and establish its convergence. We discuss ITD tuning issues in Section 4.4. In Section 4.5, we illustrate the ITD via simulations and examples, and concluding remarks are given in Section 4.6. The proofs of the theorems are given in Appendix C. 4.2 Order selection via regularization 4.2.1 Mixture model and penalized likelihood A finite mixture model of order K is a concave combination of K standard probability density functions. We focus on the situation where the component distribution is from an exponential family with a possible dispersion parameter: g(y; θ, ϕ) = exp[{yθ − b(θ)}/a(ϕ) + c(y, ϕ)] 82 (4.1) with respect to some σ-finite measure and specific functions a(·), b(·), and c(·). Let Θ and Φ be the parameter spaces for θ and ϕ. We consider the case Θ ⊂ R and Φ ⊂ R+ , which includes normal, Poisson, binomial, and many other widely used distribution families. The density function of a finite mixture model with order K is f (y; θ, π, ϕ) = K ∑ πk g(y; θk , ϕ), (4.2) k=1 where g(y; θk , ϕ) specifies the kth component density function with component parameter θk and shared structure parameter ϕ. We use π = (π1 , . . . , πK ) for ∑ the mixing proportions, with K k=1 πk = 1, and θ for the vector of component parameters. The corresponding mixing distribution on Θ assigns probability πk to value θk . We assume that πk ̸= 0 and θj ̸= θk for all j ̸= k. For simplicity of notation, we denote Ψ = (θ, π, ϕ). Given a set of i.i.d. observations Y = (y1 , . . . , yn ) from (4.2), the log-likelihood of the mixing distribution with order K is given by ln (Ψ) = n ∑ log f (yi ; θ, π, ϕ). (4.3) i=1 Maximizing ln (Ψ) over Ψ leads to a nonparametric maximum likelihood estimator (MLE) with a finite order (Lindsay [1983], Lesperance and Kalbfleisch [1992]). However, such an MLE provides little information on the actual order of the mixture model. It may overfit by assigning a negligible proportion to an arbitrary subpopulation (Type 1) and/or by including several nearly identical subpopulations in the model (Type 2). Chen and Khalili [2008] introduce penalties to prevent both types of overfitting. In particular, they introduce a regularization penalty to merge close subpopulations to reduce the model complexity. To be specific, for any prespecified large K, we denote the component parameters θ1 ≤ θ2 ≤ · · · ≤ θK . Let ηk = θk+1 − θk for k = 1, . . . , K − 1. Chen and Khalili [2008] propose a regularization method 83 that maximizes the penalized log-likelihood function ˜ln (Ψ) = ln (Ψ) + CK K ∑ log πk − k=1 K−1 ∑ pλ (|ηk |), (4.4) k=1 where CK > 0 is a scale constant and pλ (·) is a nonsmooth penalty function with a spike at 0. As in regularized regression analysis, when pλ (·) spikes at 0, the penalized log-likelihood ˜ln (Ψ) has a positive probability of attaining its maximum with some ηk = 0. Hence, a built-in order selection procedure is obtained by discouraging Type 2 overfitting. By tuning the level of the penalty λ in pλ (.), we arrive at a finite mixture model with a suitable order together with a maximum penalized likelihood estimator (MPLE). The first penalty in (4.4) prevents Type-1 overfitting by forcing the mixing proportions away from zero [Chen and Kalbfleisch, 1996]. Consequently, the resulting model has θk clustered around the parameters of the true subpopulations. This leads to some small values of ηk to be squeezed by the second penalty pλ (·). A finite mixture model with a lower order is hence attained. 4.2.2 The EM algorithm The implementation of the maximization of ˜ln (Ψ) can be embedded in a typical EM framework (Dempster et al. [1977]). Let zik for i = 1, . . . , n, k = 1, . . . , K be a 0-1 variable showing the component membership of the ith observation. If all the zik are observed together with yi , the log-likelihood of the complete data is given by lnc (Ψ) = K n ∑ ∑ zik [log πk + log{g(yi ; θk , ϕ)}] . i=1 k=1 The penalized log-likelihood of the complete data takes the form ˜lc (Ψ) = lc (Ψ) + CK n n K ∑ k=1 log πk − K−1 ∑ pλ (|ηk |). k=1 With an initial value of Ψ(0) , the EM algorithm maximizes ˜ln (Ψ) through the 84 following steps based on ˜lnc (Ψ): • E-step: Let Ψ(t) be the estimate of the parameters after t iterations. Compute Q(Ψ; Ψ(t) ), the conditional expectation of ˜lc (Ψ) given Y with Ψ(t) as the n true value of the model parameters: (t) Q(Ψ; Ψ ) = n ∑ K ∑ (t) wik log{g(yi ; θk , ϕ)} i=1 k=1 n ∑ K ∑ (t) {wik + i=1 k=1 K−1 ∑ CK } log πk − pλ (|ηk |), + n k=1 where (t) (t) π g(yi ; θ , ϕ(t) ) (t) wik = E(zik |Y ) = ∑ k (t) k (t) . K (t) ) π g(y ; θ , ϕ i l=1 l l • M-step: The M-step maximizes Q(Ψ; Ψ(t) ) over Ψ. We find (t+1) πk ∑n (t) i=1 wik + CK n + KCK = and (t+1) ϕ = arg max ϕ θ (t+1) = arg max θ { n K ∑∑ } (t) (t) wik log{g(yi ; θk , ϕ)} , (4.5) i=1 k=1 { n K ∑∑ (t) wik log{g(yi ; θk , ϕ(t) )} i=1 k=1 − K ∑ } pλ (|ηk |)(4.6) . k=1 The above EM procedure leads to a local maximizer of ˜ln (Ψ) by iteratively repeating the E-step and M-step. See Wu [1983] and MacLachlan and Krishnan [2008] for the convergence of EM-based computational procedures. The EM-algorithm appears to have completely addressed the numerical issue, but it contains a weakness (4.6). The optimization problem (4.6) is similar to the numerical problem in PLMs for regression analysis. Naturally, Chen and Khalili [2008] employ the local quadratic approximation (LQA) method, which is de85 signed primarily for SCAD. However, LQA does not directly provide a sparse solution for the variable selection or a finite mixture model with an appropriate order. An additional step is thus required to manually set near-zero values to exact zero. The development of an effective and direct numerical algorithm is the topic of this chapter. 4.3 Iterative thresholding descent procedure As reviewed in Section 1.4.4, substantial progress has been made in solving the optimization problems related to PLMs. The stagewise least angular regression (LARS; Efron et al. [2004]) is the first breakthrough for LASSO. Zou and Li [2008] suggest a weighted L1 approximation for SCAD, such that the corresponding nonconcave maximization can be carried out by the efficient algorithms designed for LASSO. Most recently, the coordinate-descent-based methods (Friedman et al. [2007]) have dramatically sped up the computation for LASSO, and they work even when the number of variables far exceeds the sample size in the regression analysis. Although these methods are not appropriate for our numerical problem, they are a source of inspiration. In particular, we find that the thresholding-based method (Daubechies et al. [2004], She [2009]) can be engineered to address order selection in finite mixture models. This algorithm has two important steps. The first step is to translate the multivariate objective function into an equivalent auxiliary function that can be decomposed into a sum of several univariate functions. This step dramatically simplifies the problem and reduces the computation. The second step is a thresholding procedure that iteratively performs simple thresholding operations on these univariate functions. This is useful particularly when the objective function is nonsmooth and nonconcave. When the procedure converges, a sparse solution is obtained and the goal of order selection is achieved. We now return to the core issue (4.6), which can be rewritten as { K ∑ min Q(θ) = − φk (θk ) + a(ϕ) θ k=1 86 K−1 ∑ k=1 } pλ (|θk+1 − θk |) , (4.7) with φk (θk ) = n ∑ wik {yi θk − b(θk )]. i=1 It can easily be seen that Q(θ) is a multivariate function containing a nonsmooth and usually nonconvex penalty function pλ (·). Our first step toward an effective algorithm is to create an auxiliary function that is a sum of univariate functions. ∑ Let η0 = θ1 , ηk = θk+1 − θk , so that θk = k−1 j=0 ηj for k = 1, . . . , K. After this parameter transformation, the objective function in (4.7) becomes Q(η) = − K ∑ k−1 K−1 ∑ ∑ φk ηj + a(ϕ) pλ (ηj ). k=1 Next, setting ζk = ∑k−1 j=0 ξj j=0 (4.8) j=1 for k = 1, . . . , K, we introduce an auxiliary function in ξ = (ξ0 , . . . , ξK−1 ), K−1 1 ∑ G(ξ; η) = uQ(ξ) + (ξj − ηj )2 2 j=0 −u n K ∑ ∑ wik {b(ζk ) − b(θk ) − b′ (θk )(ζk − θk )} (4.9) k=1 i=1 for some positive scale u. The auxiliary function has two crucial properties. First, we notice that Q(η) = u−1 G(η; η). This property links a stationary point defined through G to a local minimum of Q. Second, because of the specific form of Q(ξ), it can be seen that G(ξ; η) is additive in the components of ξ. This property makes it simple to maximize G(ξ; η) with respect to ξ for any given η through a thresholding procedure. We now make the first point. Let η (0) be an initial vector of η. For m = 1, 2, . . ., define η (m) = arg min G(ξ; η (m−1) ). ξ (4.10) The additivity of G(ξ; η) with respect to ξ decomposes the above operation into a 87 set of univariate optimization problems: ( [ ])2 ∑K ∑n ′ (θ )] minξ ξ − η + u w [y − b , 0 0 i ik k 0 k=1 i=1 ( [ ])2 (4.11) ∑ ∑ n min 1 ξ − η + u K ′ (θ )] w [y − b + ua(ϕ)p (|ξ |) j j j ξj 2 k λ k=j+1 i=1 ik i for j = 1, . . . , K − 1. Clearly, these optimization problems have a unified form { } min q(γ) = (γ − z)2 + κpλ (|γ|) (4.12) γ with κ = 0 for the case j = 0, and κ = ua(ϕ) otherwise. A threshold technique will be used to overcome the difficulty caused by the nonsmooth and nonconvex penalty function pλ (·). Recall that η can be linearly transformed to θ by θ = Γη, where Γ is a lower triangular matrix with the elements below the diagonal equal to one. The following theorem shows that the iteration defined in (4.10) reduces the value of Q. Theorem 4.1 Following the settings and notation introduced earlier, assume that b(.) in (4.1) is twice continuously differentiable over Θ. Let η (m) be the sequence defined by (4.10) and θ (m) = Γη (m) . Denote by τ1 the maximum eigenvalue of ΓT Γ, and let (m) τ2 = max sup b′′ (αθk (m) −1 ] , If 0 < u < [nτ1 τ2 (m+1) k 0<α<1 (m) + (1 − α)θk ). then Q(η (m+1) ) ≤ Q(η (m) ), and equality holds only when η (m+1) = η (m) . See Appendix C for the proof. If Q(η) is bounded from below, then Q(η (m) ) is a monotone decreasing sequence that must therefore converge to a limit. If in addition, {η : Q(η) < Q(η (0) )} is compact, then {η (m) } must have a convergent subsequence. Many GLMs lead to a regularized likelihood function Q satisfying these two conditions. Thus, this is likely to provide an optimization solution. We 88 are interested further in whether {η (m) } itself converges and whether the limit of Q(η (m) ) is a local minimum of Q(·). The following corollary confirms the first point. Corollary 4.1 Assume the conditions of Theorem 4.1 and that {η : Q(η) ≤ Q(η (0) )} is compact. Let τ2∗ = supm (τ2 (m) ). When u ∈ (0, [nτ1 τ2∗ ]−1 ), the se- quence {η (m) } is asymptotically regular. That is, as m → ∞, ∥η (m+1) − η (m) ∥ → 0. See Appendix C for the proof. Corollary 4.1 not only serves as a necessary step toward a complete proof of the convergence of {η (m) }, but also provides a straightforward stopping criterion for procedure (4.10). That is, we can terminate if ∥η (m+1) − η (m) ∥ < ε for some prespecified tolerance level ε. We set ε = 10−5 in our implementation. Using Corollary 4.1, we now prove the convergence of the ITD procedure via the following theorem. Theorem 4.2 Under the conditions of Corollary 4.1, and if the number of stationary points of the objective function Q(η) is finite, then the sequence {η (m) } converges to a fixed point of (4.10) that is also a stationary point of Q(η). See Appendix C for the proof. Theorems 4.1 and 4.2 are more general than the specific iteration scheme (4.10). Two assumptions, that {η : Q(η) ≤ Q(η (0) )} is compact and that the number of stationary points of Q(·) is finite, are trivially satisfied for exponential families. However, unless Q(·) is a convex function, there is no guarantee that η (m) converges to a global minimum. Multiple initial values are often used in the hope that the global minimum will be found. The above results are useful only if we carry out (4.10) effectively. This task is made simpler because G(ξ; η) is additive in the elements of ξ. We need only to solve the standard optimization problem (4.12). There is an intrinsic link between the solution to (4.12) and the thresholding rule in wavelet applications (Antoniadis [2007]). Let z+ be the positive part and sgn(z) be the sign of z for any real number z. With κ = 1, when the L1 penalty 89 pλ (γ) = λ|γ| is used, the solution of (4.12) is given by the soft thresholding rule as γ ∗ = (|z|−λ)+ ·sgn(z); and when the L0 penalty pλ (γ) = λ2 /2·I(γ ̸= 0) is used, the solution of (4.12) is given by the hard thresholding rule as γ ∗ = z · I(|z| > λ). Explicit solutions of (4.12) for other commonly used penalty functions are available in the literature (Antoniadis [2007], Antoniadis and Fan [2001]). Since the SCAD penalty was advocated by Chen and Khalili [2008] for their regularization method (namely, MSCAD), we give the corresponding explicit solution as follows. The SCAD penalty is defined as a symmetric function with pλ (0) = 0 and for γ > 0, ′ pλ (γ) = λI(γ ≤ λ) + (νλ − γ)+ I(γ > λ) (ν − 1) (4.13) for some constant ν > 2 and a tuning parameter λ. The exact size of ν does not have a noticeable effect on the performance of the order selection, while the level of λ needs to be further specified in applications (see Section 4.2 for a discussion). Proposition 4.1 When 0 < κ ≤ ν − 1, the solution to (4.12) is given by (|z| − κλ)+ · sgn(z), when |z| < (κ + 1)λ sgn(z) γ ∗ = (ν−1)z−κνλ , when (κ + 1)λ ≤ |z| < νλ ν−κ−1 z, when |z| ≥ νλ. When ν − 1 < κ ≤ ν, the solution to (4.12) is given by (|z| − κλ) · sgn(z), + γ∗ = z, when |z| ≥ νλ. when |z| < νλ, When ν < κ, the solution to (4.12) is given by γ ∗ = zI(|z| ≥ νλ). See Appendix C for the proof. The expressions for the cases ν < κ and ν − 1 < κ ≤ ν can be combined; we did not do this to simplify the proof. When κ < ν − 1, the optimal solution of (4.12) is a continuous function of z, but this is not true for κ > ν − 1. Thus, SCAD may not be the most appropriate choice for the order selection of mixture models. At the same time, because κ = ua(ϕ) is usually very small as in (4.11), the adverse effect is likely minor in applications. We focus on 90 the algorithm and leave this issue for future research. Proposition 4.1 leads to a thresholding operation for (4.12). Taking SCAD for example, shrinkage occurs whenever the generic value of |z| is lower than some threshold level, say κλ. This feature leads to the sparse η. Suppose the initial vector η (0) contains K components. Each incident of ηˆj = 0 for some j reduces the order of the candidate mixing distribution by one. Because of this feature, the order of the MPLE is a direct output of the algorithm. No ad hoc steps are required. 4.4 Tuning strategies In this section, we discuss some ITD implementation issues. 4.4.1 Choice of u The tuning parameter u does not have a statistical implication. It controls by how much the iteration is directed by the gradient of Q(η). We may therefore try to use a large u value. However, the inequality in the proof of Theorem 1 indicates that the objective function is guaranteed to decrease after each iteration only when u is small enough. Because κ = ua(ϕ) and the size of this largely controls the thresholding level, a large u also helps to speed up the algorithm. For normal mixtures, we have b′′ (θ) = 1 so we may choose u = (nτ1 )−1 , the largest value allowed to ensure the monotonicity of Q(η (m) ). For Poisson mixtures, we may choose u adaptively according to the value of η (m) to guarantee monotonicity. Once the sequence η (m) settles down, a stable value of u is used. To accelerate the algorithm, we employ the following tuning strategy. Let w∗ = ∑ maxk ni=1 wik and Γ0 be the submatrix of Γ with columns corresponding to the nonzero entries of η (0) . Denote by τ1∗ the maximum eigenvalue of ΓT0 Γ0 and let τ2∗ = maxk b′′ (θk ). At each thresholding iteration, we first initiate an tentative (0) large value for u, i.e., u(m) = u∗ = 6/(w∗ τ1∗ τ2∗ ), and then check whether or not Q(η (m+1) ) < Q(η (m) ). When the inequality is violated, we reduce u to u(m) = 0.5u∗ , and we repeat this reduction procedure until Q(η (m+1) ) < Q(η (m) ) is satisfied. It is clear that the adaptive procedure does not alter the convergence of the algorithm. 91 4.4.2 Choice of λ The tuning parameter λ in pλ (·) controls the shrinkage in η, and therefore it is crucial for the resulting order of the finite mixture model. When λ = 0, the MPLE of (4.4) places no penalty and thus results in the largest order. As λ increases from zero to infinity, the MPLE gradually reduces the order for the mixture model to one. The choice of λ relates to the trade-off between parsimony and goodness-of-fit. Chen and Khalili [2008] prove that when λ has a certain asymptotic order, the SCAD regularization method is consistent when the sample size increases to infinity. However, the asymptotic result does not provide a specific recommendation for a λ-value in practice. Instead, a cross-validation (CV) procedure is recommended. Specifically, let the full data set Y = {y1 , . . . , yn }T be divided into R nonover∑ lapping subsets, say Yr with size nr for r = 1, . . . , R and R r=1 nr = n. Let ˆ Y − Yr be the subset with Yr removed from Y . Let Ψλ,−r be the MPLE of the model parameter Ψ based on Y −Yr . The CV procedure selects a λ that minimizes CV(λ) = − R ∑ ˆ λ,−r ; Yr ), lnr (Ψ r=1 where lnr (Ψ; Yr ) is the log-likelihood function based on Yr . The choice of λ with r = 1 worked well in Chen and Khalili’s simulation studies. At the same time, the CV combined with an exhaustive search of λ makes the whole procedure computationally intensive. A more efficient tuning strategy is desirable in practice. We implemented the algorithm of Chen and Khalili [2008] with a 20-fold CV (Zhang [1993]) to reduce the computational burden. In the literature, BIC-type criteria are investigated for choosing the tuning parameter in regularization methods for variable selection (Wang et al. [2007]). In such methods, the candidate choices of λ are evaluated by the likelihood scores of the MPLE with penalties proportional to the model complexity. The λ yielding the best score is then selected. Compared with the CV, this method is computationally cheaper and often leads to a satisfactory selection result. Therefore, we define a revised version of BIC as a function of λ by ˆ λ ) + γτ (Ψ ˆ λ ) log n, RBIC(λ) = −2ln (Y ; Ψ 92 ˆ λ is the MPLE with regularization parameter value λ and τ (Ψ ˆ λ ) is the where Ψ number of parameters in this mixing distribution. With γ = 1, we obtain the traditional BIC. In the simulation, we used γ = 0.25 and 0.5. We chose a smaller than usual value partly for its superior performance in our simulation studies and partly because an extra component in mixture models usually leads to a single extra degree of freedom in the limiting distribution of likelihood-based test statistics (Mengersen et al. [2011]). Once the value of γ is chosen, the RBIC selects a λ value that minimizes RBIC(λ) and thus suggests a corresponding order for the finite mixture model. In applications, several trial runs are helpful to identify the proper range for λ. We adopted a data-driven searching strategy in our numerical studies. We begin with a large value λ = λ∗ that leads to a homogeneous model with order 1. The search proceeds by bisectionally reducing λ∗ toward zero, such that the orders of the resulting models increase gradually. Grid searches are used between these λ values to obtain models with intermediate orders. The efficiency of this strategy has been observed in our simulation studies. 4.4.3 Choice of CK The constant CK in the first penalty of (4.4) controls the penalization for small mixing proportions. Its value influences the precision of the corresponding MPLE of the model parameters. However, the effect of CK has repeatedly been found to be minor for the order selection. Chen et al. [2001] and Chen and Khalili [2008] suggested that if the component parameters θk are restricted within [−M, M ] or [M −1 , M ] for some large M , then an appropriate choice of CK is log M . In our numerical studies, we set CK = log log y ∗ for Poisson mixture models with y ∗ = maxi {yi , i = 1, . . . , n}. For normal mixture models, we set CK to the logarithm of the maximum absolute observation based on the standardized data (see Section 4.5). 4.5 Numerical studies We assess the performance of ITD by Monte Carlo simulation examples. In particular, we implement MSCAD with both LQA and ITD to examine their computa93 Table 4.1: Mixing proportions and component parameters in simulation models Model 1 2 3 4 5 6 7 8 9 10 (π1 , µ1 ) (0.5, 0) (0.3, 0) (0.2, 0) (1/4, 0) (1/4, 0) (0.3, 0) (1/6, 0) (1/6, 0) (1/6, 0) (1/6, 0) Normal mixture models (π2 , µ2 ) (π3 , µ3 ) (π4 , µ4 ) (0.5, 3) (0.7, 3) (0.4, 4) (0.4, 6) (1/4, 3) (1/4, 6) (1/4, 9) (1/4, 2) (1/4, 5) (1/4, 8) (0.2, 2) (0.3, 4) (0.2, 6) (1/6, 3) (1/6, 6) (1/6, 9) (1/6, 2) (1/6, 4) (1/6, 6) (1/6, 2) (1/6, 4) (1/6, 7) (1/6, 2) (1/6, 4) (1/6, 6) Model 1 2 3 4 5 6 Poisson mixture models (π1 , µ1 ) (π2 , µ2 ) (π3 , µ3 ) (1/2, 1) (1/2, 3) (1/5, 1) (4/5, 3) (4/5, 1) (1/5, 3) (1/4, 1) (1/4, 4) (1/4, 12) (0.1, 1) (0.2, 4) (0.3, 12) (0.4, 1) (0.3, 4) (0.2, 12) (π5 , µ5 ) (π6 , µ6 ) (1/6, 12) (1/6, 9) (1/6, 9) (1/6, 8) (1/6, 15) (1/6, 12) (1/6, 11) (1/6, 10) (π4 , µ4 ) (1/4, 20) (0.4, 20) (0.1, 20) tional efficiency. The performances of a number of tuning strategies for λ are also compared. The algorithm has been implemented in the R software. 4.5.1 Simulations We generated data from the normal and Poisson mixture models. The component mean µ and mixing proportions π are given in Table 4.1. For the normal mixture models, we set the common component variance σ 2 = 1. To make the selection invariant on the data scale, we standardized the observations from normal mixtures such that the corresponding sample mean was zero and the sample standard deviation was three. For each simulated data set, MSCAD was used for the order selection with a common upper bound K = 12. The EM procedure used by both LQA and ITD was implemented with the same initial settings as in Chen and Khalili [2008]. That 94 3 2 mu 1 0 −1 −1 0 1 mu 2 3 4 EM−path with LQA 4 EM−path with ITD 0 20 40 60 80 100 120 0 20 40 Steps 60 80 100 120 Steps Figure 4.1: EM-paths of MSCAD for normal mixture model 1 with n = 100. Model 2 5 4 4 4 3 No. of Componets 5 3 3 2 2 2 1 1 1 Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv Oracle AIC BIC Model 4 RBIC.25 RBIC.5 LQA−cv ITD−cv Oracle 5 4 4 3 No. of Componets 5 4 3 2 2 1 1 1 BIC RBIC.25 RBIC.5 LQA−cv ITD−cv Oracle AIC BIC RBIC.25 RBIC.25 RBIC.5 LQA−cv ITD−cv RBIC.5 LQA−cv ITD−cv 3 2 AIC BIC Model 6 5 Oracle AIC Model 5 No. of Componets No. of Componets Model 3 5 No. of Componets No. of Componets Model 1 RBIC.5 LQA−cv ITD−cv Oracle AIC BIC RBIC.25 Figure 4.2: Selection results for normal mixture models (1–6), n=100. is, the initial values of the component means were chosen as the 100(k − 1/2)K% (0) sample quantiles and πk = 1/K for k = 1, . . . , K. For the normal mixtures, we set the initial value of the component variance to the sample variance based on the observations trimmed at the 25% and 75% sample quantiles. For LQA, we merged two subpopulations if their component means were within 10−3 at the convergence 95 Model 8 8 8 7 7 6 6 No. of Componets No. of Componets Model 7 5 4 5 4 3 3 2 2 Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv Oracle AIC BIC 8 8 7 7 6 6 5 4 2 2 BIC RBIC.25 ITD−cv RBIC.5 LQA−cv ITD−cv 4 3 AIC LQA−cv 5 3 Oracle RBIC.5 Model 10 No. of Componets No. of Componets Model 9 RBIC.25 RBIC.5 LQA−cv ITD−cv Oracle AIC BIC RBIC.25 Figure 4.3: Selection results for normal mixture models (7–10), n=300. of the algorithm. As a concrete illustration, we plot in Fig. 4.1 the EM-iteration paths of θ (t) based on ITD and LQA for one data set generated from a normal mixture. Compared with EM-LQA, the paths of the EM-ITD procedure are sharper and require significantly fewer iterations to converge. We find that ITD saves about 40% of the computational time compared to LQA over our simulation studies. Although there is no guarantee, ITD and LQA usually converge to similar limiting points in our examples. We summarize the simulation results in terms of how often various orders are selected based on 500 data sets generated from each model with sample sizes n = 100, 300. We use scaled bar plots to denote the frequency of the orders selected by MSCAD against methods and tuning strategies. The length of each 96 Model 1 (n=100) Model 2 (n=100) 3 2 1 4 No. of Componets 4 No. of Componets No. of Componets 4 3 2 1 Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv Oracle 2 1 3 2 1 RBIC.25 RBIC.5 LQA−cv ITD−cv BIC RBIC.25 RBIC.5 LQA−cv ITD−cv LQA−cv ITD−cv 4 No. of Componets No. of Componets 3 BIC AIC Model 3 (n=300) 4 AIC 2 Model 2 (n=300) 4 Oracle 3 1 Model 1 (n=300) No. of Componets Model 3 (n=100) 3 2 1 Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv Oracle AIC BIC RBIC.25 RBIC.5 Figure 4.4: Selection results for Poisson mixture models (1–3), n=(100, 300). bar is proportional to the selection frequency. As a benchmark, the true orders are marked by a black bar in each plot. We labeled the x-axes as follows: RBIC.25 and RBIC.5 are obtained by ITD with scaled RBIC; LQA-cv and ITD-cv are obtained by 20-fold CV based on LQA/ITD. We have also included the results of the popular AIC and BIC methods as suggested by Leroux [1992]. The simulation results are reported in Figs. 4.2–4.5. Although it is not the goal of our simulation, these figures reaffirm the effectiveness of MSCAD for order selection. The performance of MSCAD is not affected by its implementation. This is reflected by the nearly identical columns for LQA-cv and ITD-cv in all the plots. In terms of the selection frequency for the true mixture order, AIC and BIC do not perform well in most cases. In general, BIC grossly underestimates the true order of the model because of its stringent penalty on the model complexity. AIC is relatively liberal, but its performance is not satisfactory for models with complex structure. The remaining results all relate to MSCAD but are tuned or implemented dif97 Model 4 (n=100) Model 5 (n=100) 4 3 2 5 No. of Componets 5 No. of Componets No. of Componets 5 4 3 2 Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv Oracle 3 2 4 3 2 RBIC.25 RBIC.5 LQA−cv ITD−cv BIC RBIC.25 RBIC.5 LQA−cv ITD−cv LQA−cv ITD−cv 5 No. of Componets No. of Componets 4 BIC AIC Model 6 (n=300) 5 AIC 3 Model 5 (n=300) 5 Oracle 4 2 Model 4 (n=300) No. of Componets Model 6 (n=100) 4 3 2 Oracle AIC BIC RBIC.25 RBIC.5 LQA−cv ITD−cv Oracle AIC BIC RBIC.25 RBIC.5 Figure 4.5: Selection results for Poisson mixture models (4–6), n=(100, 300). ferently (i.e., RBIC.25 , RBIC.5 , ITD-cv, LQA-cv). Based on these plots, we see that RBIC.5 tends to underestimate the true order. The other methods are comparable. If we factor in the computational savings, RBIC.25 has the best performance. Also, the ITD-based methods perform better than the LQA because of the numerical advantages pointed out earlier. 4.5.2 Examples We now illustrate the use of the ITD algorithm for order selection in real applications. We chose to implement MSCAD as in the simulation studies. Our first example is from a well-known astronomical data set, which consists of the velocities of 82 galaxies moving away from our own galaxy (available in Table 1 of Postman et al. [1986]; see the histogram in Fig. 4.6). The multimodality of the velocities may indicate the presence of super-clusters of galaxies surrounded by large voids, with each mode representing a cluster as it moves away at its own speed (see Roeder [1990] for more background). We may hence model the values of observed velocities as a random sample from a finite mixture of normal distributions with a 98 0.25 0.20 0.15 0.00 0.05 0.10 Density 10 15 20 25 30 35 Figure 4.6: Histogram of galaxy data. Solid curve: density of seven component models selected by MSCAD. Dashed curve: density of six component models selected by AIC/BIC. common variance. Using a Bayesian approach, Richardson and Green [1997] conclude that the number of distributions ranges from five to seven, which provides support for the existence of super-clusters. MSCAD-ITD is used to reanalyze the galaxy data with K = 12. RBIC with γ = 0.25 was used to tune the parameter λ. The outcome is a seven-component model; the parameter estimates are given in Table 4.2 and Fig. 4.6 shows the fitted density function. As a side note, AIC/BIC leads to a model with six components; see Table 4.2 and Fig. 4.6. It also provides a good description of the multimodal structure of the galaxy data. Our next example is from Leroux and Puterman [1992], who discussed a study of breathing and body movements in fetal lambs. The number of movements of a fetal lamb was recorded for 240 consecutive 5-second intervals. These counts are overdispersed with a ratio of variance to mean estimated at 1.83. It is well known 99 Table 4.2: Parameter estimates for galaxy data. µˆk , πˆk , σˆ denote estimated component mean, mixing proportion, and common component standard deviation (k = 1, 2, . . . , 7). Method (ˆ µ1 , π ˆ1 ) (ˆ µ2 , π ˆ2 ) (ˆ µ3 , π ˆ3 ) (ˆ µ4 , π ˆ4 ) (ˆ µ5 , π ˆ5 ) (ˆ µ6 , π ˆ6 ) (ˆ µ7 , π ˆ7 ) σ ˆ MSCAD (9.7, .08) (16.1, .04) (19.8, .43) (22.4, .21) (23.9, .14) (26.5, ,05) (33.1, .05) .66 AIC/BIC (9.7, .09) (16.2, .02) (19.9, .45) (23.1, .36) (26.3, .04) (33.0, .04) (- -.-, .- -) .81 that overdispersion can often be explained by population heterogeneity. Hence, a Poisson mixture model was used with the number of components selected by AIC/BIC. Using MSCAD implemented via ITD and an initial value of K = 12, we select a two-component model. Not surprisingly, this order is the same as that of the previous analysis. We summarize the resulting fits in Table 4.3. Because of its built-in measure to prevent type-I overfitting, MSCAD gives more balanced mixing proportions. Clustering more observations into the second component also leads to lower component means. Table 4.3: Parameter estimates for lamb data. µˆk , πˆk denote estimated component mean and mixing proportion (k = 1, 2). 4.6 Method (ˆ µ1 , π ˆ1 ) (ˆ µ2 , π ˆ2 ) MSCAD (.227, .920) (1.88, .080) AIC/BIC (.230, .939) (2.32, .061) Summary and conclusions In this chapter, we have developed an iterative thresholding descent algorithm (ITD) for PLM-based order selection methods in finite mixture models. The new algorithm avoids directly solving the original multivariate optimization problem and efficiently leads to the sparse MPLE of the component parameters. We established the algorithmic convergence of ITD under mild conditions. The efficiency of ITD is well supported by our numerical studies. In applications, we need to specify an upper bound K for the MSCAD-based 100 analysis. One should always use a sufficiently large K to avoid underestimation. However, too large a K often leads to slow convergence. In the examples we examined, the outcomes are not sensitive to the choice of K within a plausible range. We do not recommend setting K below 5 or above 15 in most applications. One needs strong empirical evidence or a large sample size (say above 300) to go outside this range. 101 Chapter 5 Summary and Future Work 5.1 Summary of the dissertation In this dissertation, we have developed approaches based on PLMs to address the issues of feature selection arising in several application fields. In Chapter 2, we addressed feature selection for ultra-high-dimensional data, where the number of features (covariates) is larger than the sample size. To facilitate the selection process, we proposed a novel screening approach to reduce the dimensionality of the data before the application of the PLM. The new method is motivated by the idea of the sparsity-restricted maximum likelihood estimator and can be efficiently implemented through a thresholding-based algorithm. Compared with the existing approaches which screen features based on the marginal correlations between the covariates and response, our method accounts for more joint effects between the covariates and thus can be more reliable in applications. We further established the consistency of our method in an ultra-high-dimensional setup and demonstrated its excellent performance through numerical examples. In Chapter 3, we addressed variable selection for complex survey data, where the observations are intrinsically dependent because of the without-replacement sampling plan. To avoid a distorted conclusion caused by the biased samples, we proposed a penalized pseudo-likelihood approach to incorporate the survey weights into the selection process. A pseudo-likelihood-based BIC was further suggested to select the corresponding tuning parameter. We demonstrated the asymptotic con102 sistency of our selection method in a joint randomization framework. The decent performance and potential benefits of our method were demonstrated in simulation studies. In Chapter 4, we addressed order selection in finite mixture models for heterogeneous data. A thresholding-based algorithm was proposed for the implementation of PLMs in such applications. The new algorithm transforms the original multivariate objective function into a sum of univariate functions and efficiently leads to a sparse solution without ad hoc steps. We established the convergence of the new algorithm and illustrated its efficiency through both simulations and real-data examples. 5.2 Future directions We have demonstrated that the PLM is an attractive technique for statistical learning with a wide range of applications. In the remainder of this chapter, we present several possible directions for future research. 5.2.1 Variable selection in model-based clustering Clustering is a fundamental data-analysis tool which assigns similar objects to groups. When the data dimension is high, there are many noise variables that may mask the underlying clustering structures. Specifically, given the p-dimensional observations xi = (xi1 , . . . , xip ) for i = 1, . . . , n, we aim to group the observations into a few clusters such that observations in the same cluster are more similar to each other than to those in different clusters. When p is large, it is likely that many entries in xi are not relevant to the clustering: including them in the analysis introduces noise which might hide the heterogeneous structure of the data. Removing these “noise” variables is essential for the clustering of high-dimensional data. Of the many clustering methods, model-based clustering (McLachlan and Peel [2002], Zhong and Ghosh [2003]) is appropriate for variable selection methods. These approaches assume that the data are generated from a finite mixture distribution with each component corresponding to a cluster. In this framework, the removal of “noise” variables can be performed by a model selection procedure, in 103 which the shrinkage idea of PLM is helpful. Specifically, suppose that, prior to the clustering, the data are standardized such that each entry Xj = (x1j , . . . , xnj )T has sample mean 0 and sample variance 1. In some applications, it might be appropriate to assume that the data are independently generated according to a p-variate normal mixture density f (x; Ω) = K ∑ πk h(x; µk , Σk ), (5.1) k=1 where πk > 0 is the mixing proportion such that ∑K k=1 πk = 1, h(.) denotes the p-variate normal density with mean vector µk = (µk1 , . . . , µkp ) and covariance matrix Σk , and Ω = {(πk , µk , Σk ), k = 1, . . . , K} denotes the unknown parameters. Clearly, if some components of µk are small, the corresponding entries of the data are not informative for the clustering, at least in terms of location. Employing the shrinkage idea, we may use the PLM to automatically set small components of µk to zero and thus remove the irrelevant entries from the data. We could address this by maximizing Q(Ω) = l(Ω) − φ(Ω) − p ∑ ϕλ (∥µ(j) ∥2 ), (5.2) j=1 where l(Ω) = ∑n i=1 log f (xi ; Ω) is the log-likelihood, φ(.) and ϕλ (.) are two penalty functions, and ∥µ(j) ∥2 = K ∑ µ2kj k=1 is the Euclidean norm on µ(j) = (µ1j , . . . , µkj ) for j = 1, . . . , p. In the normal mixture cases, the first penalty φ(.) is generally required to guarantee the existence of the maximizer of (5.2) (Chen et al. [2008]), while the second penalty ϕλ (.) can lead to a sparse estimate of the mean parameter. Placing ∥µ(j) ∥2 in ϕλ (.) helps us to set µ(j) = 0, so that the corresponding entry can be regarded as irrelevant to the clustering. A similar idea has been adopted by Yuan and Lin [2006] for a group LASSO problem. The optimization problem of (5.2) can be similarly cast into an EM-based procedure as discussed in Chapter 4. 104 As in the standard model-based clustering, one could further fit (5.1) by maximizing (5.2) with a series of K values and use a selection criterion (e.g., BIC) to determine the optimal number of clusters. Both the theoretical and computational aspects of this procedure require further research. 5.2.2 Support vector machine with nonconvex penalty Classification is a basic task in pattern recognition and data mining. Kernel-based classification technologies have attracted much attention, because of their elegant interpretation and highly competitive performance. The support vector machine (SVM) is among the most popular classification approaches in this category (Cristianini and Shawe-Taylor [2000]); it learns the classification rule from the informative margins of the training data. In a binary classification problem, the response y is either +1 or -1 (the classification labels), and a classification rule T is a mapping from the feature vector x to {+1, −1}. Given the training set (yi , xi ) for i = 1, . . . , n, we must find a discriminant function f (.) so that any new input x can be correctly classified with f (x) based on T . The kernel-based methods generally use the rule T = sign[f (x)] and the following f (.): f (x; w) = n ∑ wi k(x, xi ) + w0 , (5.3) i=1 where k(.) is a user-specified kernel function such as the polynomial or Gaussian kernel, and w = (w0 , . . . , wn ) is the weight vector to be estimated. Given a loss function L(.), the kernel method (5.3) leads to the following optimization scheme { −1 min n w n ∑ } L(yi , f (xi ; w)) + ϕλ (w) , i=1 where ϕλ (.) is a penalty function index with tuning parameter λ. With the L2 penalty ϕλ (w) = λ∥w∥22 and the hinge loss, i.e., L(y, f ) = (1 − y · f )+ = max{0, 1 − y · f }, 105 (5.4) (5.4) leads to the typical L2 -SVM, which has been applied to a number of classification problems. Solid statistical learning theory for the use of L2 -SVM has been developed (Lin [2002]). As in ridge regression, the L2 penalty helps to control the model complexity to prevent overfitting. With the development of sparse regularization methods (e.g., LASSO), the L1 penalty has been used in SVM (Zhu et al. [2003]); this leads to a sparse estimate of w in the discriminant function (5.3). The so-called L1 -SVM can result in a more compressive rule and thus provide a clearer interpretation. Friedman et al. [2004] showed that the L1 -SVM performs better if the underlying model is sparse, while the L2 -SVM performs better if most of the observations contribute to the response. Because of the promising theoretical properties of nonconvex penalties, it is natural to explore whether using them in (5.4) would further benefit the SVM. Such a nonconvex-penalized SVM may help to reduce the biased estimates of L1 , while maintaining the desirable feature of sparsity in the classification rule. It would be interesting to continue this investigation. 5.2.3 Some other issues Variable selection via grouping In regression analysis, one difficulty for variable selection is the collinearity between the covariates. In ultra-high-dimensional situations, even when all the covariates are ideally independent, the maximum sample correlation between covariates can still be high (Fan and Lv [2008]), which makes it hard to detect the truly influential covariates. Therefore, one might consider grouping all the candidate covariates according to their sample correlations and then treating each group as a independent predictor. In other words, covariates in the same group are considered to represent some factor that may be associated with the response. We could then use the penalized likelihood idea to choose the relevant groups. One-step thresholding-based procedure The thresholding-based algorithm provides a simple and efficient way to solve the numerical problem of nonconvex PLMs. However, because of the nonconcavity 106 of the objective function, multiple initial values are often needed in its implementation. It would be beneficial to develop a data-driven strategy for initializing the thresholding-based procedure. A good initial value often leads to fewer iteration steps and guarantees the statistical properties of the estimate. In Chapter 2, we have tentatively revealed the efficiency of the LASSO-based initializing strategy. It would be beneficial if this method could be given a theoretical justification. Motivated by Zou and Li [2008], we are particularly interested in finding an initial setting such that a few iterations would suffice for variable selection/screening. Tuning LAD-LASSO with high-dimensionality LASSO is an computationally attractive approach to variable selection in highdimension regression models. The L1-penalized least squares problem (1.9) has received much attention. However, the least squares method is known to be sensitive to outliers. An alternative approach is the least absolute deviation (LAD) method which can be robust to outliers. Wang et al. [2006] proposed the L1 penalized LAD, i.e., LAD-LASSO, which is an efficient and robust selection approach. In particular, the idea of coordinate descent (CD) can be conveniently applied to the LAD-LASSO problem. LAD-LASSO therefore has great potential for high-dimensional noisy data. Despite its promising properties (Gao and Huang [2010]), the tuning parameter in LAD-LASSO needs to be specified in applications. The traditional AIC and BIC are designed for least square settings and may not be suitable for the LAD-LASSO problem. A feasible tuning method is therefore desired to realize the desirable properties of LAD-LASSO, especially for high-dimensional situations. EBIC (Chen and Chen [2008]) might provide an appropriate tuning strategy for LAD-LASSO and its variants. 107 Bibliography H. Akaike. Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, 1:267–281, 1973. → pages 6, 70, 81 A. Antoniadis. Wavelet methods in statistics: Some recent developments and their applications. Statistics Surveys, 1:16–55, 2007. → pages 89, 90 A. Antoniadis and J. Fan. Regularization of wavelet approximations. Journal of the American Statistical Association, 96:939–967, 2001. → pages 90 J. Berkhof, I. Van Mechelen, and A. Gelman. A bayesian approach to the selection and testing of mixture models. Statistica Sinica, 13:423–442, 2003. → pages 81 P. Bickel and D. Freedman. Asymptotic normality and the bootstrap in stratified sampling. The Annals of Statistics, 12:470–484, 1984. → pages 63 D. Binder. On the variances of asymptotically normal estimators from complex surveys. International statistical Review, 51:279–292, 1983. → pages 61 D. Binder and G. Roberts. Chapter: Design-based and model-based methods for estimating model parameters. Wiley Series in Survey Methodology, Chichester, 2003. → pages 59 T. Blumensath and M. Davies. Iterative thresholding for sparse approximations. The Journal of Fourier Analysis and Applications, 14:629–654, 2008. → pages 27 T. Blumensath and M. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27:265–274, 2009. → pages 26, 32, 34, 36 108 D. B¨ohning. Computer-Assisted Analysis of Mixtures and Applications: Meta Analysis, Disease Mapping, and Others. Chapman & Hall/CRC, 2000. → pages 80 L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37:373–384, 1995. → pages 11, 20 T. Brijs, D. Karlis, G. Swinnen, K. Vanhoof, G. Wets, and P. Manchanda. A multivariate poisson mixture model for marketing applications. Statistica Neerlandica, 58:322–348, 2004. → pages 80 S. Canada. Survey on living with chronic diseases in canada 2009: User guide. Supplementary documentation, 2009. → pages 66 E. J. Cand`es, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory, 52:489–509, 2006. → pages 26 I. Carrillo, J. Chen, and C. Wu. The pseudo-gee approach to the analysis of longitudinal surveys. Canadian Journal of Statistics, 38:540–554, 2010. → pages 62 H. Chen and J. Chen. The likelihood ratio test for homogeneity in finite mixture models. Canadian Journal of Statistics, 29:201–216, 2001. → pages 81 H. Chen, J. Chen, and J. Kalbfleisch. A modified likelihood ratio test for homogeneity in finite mixture models. Journal of the Royal Statistical Society: Series B, 63:19–29, 2001. → pages 93 J. Chen and Z. Chen. Extended bayesian information criterion for model selection with large model spaces. Biometrika, 95:759–771, 2008. → pages 41, 107 J. Chen and Z. Chen. Extended bic for small-n-large-p sparse glm. Statistica Sinica, 22:555–574, 2012. → pages 19, 33, 41, 42, 54, 55, 118 J. Chen and J. Kalbfleisch. Penalized minimum-distance estimates in finite mixture models. Journal of the American Statistical Association, 103: 1674–1683, 1996. → pages 81, 84 J. Chen and A. Khalili. Order selection in finite mixture models with a nonsmooth penalty. Journal of the American Statistical Association, 103:1674–1683, 2008. → pages 24, 81, 82, 83, 85, 90, 92, 93, 94 J. Chen and J. Rao. Asymptotic normality under two-phase sampling designs. Statistica Sinica, 17:1047–1064, 2007. → pages 63 109 J. Chen, X. Tan, and R. Zhang. Inference for normal mixtures in mean and variance. Statistica Sinica, 18:443–465, 2008. → pages 104 P. Craven and G. Wahba. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–403, 1979. → pages 18, 70 N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000. → pages 105 I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57:1413–1457, 2004. → pages 82, 86 A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B, 39: 1–38, 1977. → pages 84 P. Dickson and J. Ginter. Market segmentation, product differentiation, and marketing strategy. Journal of Marketing, 51:1–11, 1987. → pages 1 D. Donoho. High-dimensional data analysis: the curses and blessings of dimensionality. Aide-Memoire of a Lecture at AMS Conference on Math Challenges of the 21st Century, 2000. → pages 25 D. Donoho. Compressed sensing. IEEE Transactions on information theory, 52: 1289–1306, 2006. → pages 26, 32 B. Efron. Correlation and large-scale simulations significance testing. Journal of the American Statistical Association, 102:93–103, 2007. → pages 30 B. Efron. Empirical bayes estimates for large-scale prediction problem. Journal of the American Statistical Association, 104:1015–1028, 2009. → pages 54 B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32:407–499, 2004. → pages 20, 82, 86 J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96: 1348–1360, 2001. → pages 2, 11, 12, 13, 17, 18, 21, 25, 38, 46, 64, 70, 81, 132 J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70: 849–911, 2008. → pages 22, 23, 26, 29, 30, 32, 106 110 J. Fan and J. Lv. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20:101–148, 2010. → pages 25 J. Fan and J. Lv. Nonconcave penalized likelihood with np-dimensionality. IEEE Transactions on Information Theory, 57:5467–5484, 2011. → pages 17, 39 J. Fan and H. Peng. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32:781–813, 2004. → pages 17 J. Fan and R. Song. Sure independence screening in generalized linear models with np-dimensionality. The Annals of Statistics, 38:3567–3604, 2009. → pages 30 J. Fan, R. Samworth, and Y. Wu. Ultrahigh dimensional variable selection: beyond the linear model. Journal of Machine Learning Research, 10: 1829–1853, 2009. → pages 30, 43, 44 I. Frank and J. Friedman. A statistical view of some chemonetrics regression tools. Technometrics, 35:109–148, 1993. → pages 12 J. Friedman, T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. Discussion of boosting papers. The Annals of Statistics, 32:102–107, 2004. → pages 106 J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1:302–332, 2007. → pages 20, 82, 86 W. J. Fu. Penalized regression: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7:397–416, 1998. → pages 11, 12, 20 W. Fuller. Sampling Statistics. Hoboken, New Jersey: Wiley, 2009. → pages 57, 74 X. Gao and J. Huang. Asymptotic analysis of high-dimensional lad regression with lasso. Statistica Sinica, 20:1495–1506, 2010. → pages 107 R. Gelber, J. Gaziano, J. Manson, J. Buring, and H. Sesso. A prospective study of body mass index and the risk of developing hypertension in men. American Journal of Hypertension, 20:370–377, 2007. → pages 78 J. Ghosh and P. Sen. On the asymptotic performance of the log-likelihood ratio statistic for the mixture model and related results. In In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, volume 2, pages 789–806, 1985. → pages 81 111 V. Godambe and M. Thompson. Parameters of superpopulation and survey population: Their relationship and estimation. International Statistical Review, 54:127–138, 1986. → pages 61 J. H´ajek. Limiting distributions in simple random sampling from a finite population. Publications of the Mathematics Institute of Hungarian Academy of Science, 5:361–375, 1960. → pages 62 J. H´ajek. Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35:1491–1523, 1964. → pages 63 T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining. Springer-Verlag, New York, 2 edition, 2009. → pages 25 H. Ishwaran, L. James, and J. Sun. Bayesian model selection in finite mixtures by marginal density decompositions. Journal of the American Statistical Association, 96:1316–1332, 2001. → pages 81 G. Kalton. Models in the practice of survey sampling. International Statistical Review, 51:175–188, 1983. → pages 59 R. Kass and A. Raftery. Bayes factors. Journal of the American Statistical Association, 90:773–795, 1995. → pages 65 A. Keller, P. Leidinger, A. Borries, A. Wendschlag, F. Wucherpfennig, M. Scheffler, H. Huwer, H. Lenhof, and E. Meese. mirnas in lung cancer studying complex fingerprints in patient’s blood cells by microarray experiments. BMC Cancer, 9:353–363, 2009. → pages 1 C. Keribin. Consistent estimation of the order of mixture models. The Indian Journal of Statistics, 96:1316–1332, 2000. → pages 81 K. Knight and W. J. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28:1356–1378, 2000. → pages 15, 16 E. Korn and B. Graubard. Analysis of Health Surveys. Wiley, New York, 1999. → pages 56 P. S. Kott. A model-based look at linear regression with survey data. The American Statistician, 45:107–112, 1991. → pages 74 N. Kushmerich. Learning to remove internet advertisements. In Proceeding of the 3rd International Conference on Autonomous Agents, pages 175–181, 1999. → pages 1 112 B. Leroux. Consistent estimation of a mixing distribution. The Annals of Statistics, 20:1350–1360, 1992. → pages 81, 97 B. Leroux and M. Puterman. Maximum-penalized-likelihood estimation for independent and markov-dependent mixture models. Biometrics, 48:545–558, 1992. → pages 99 M. Lesperance and J. Kalbfleisch. An algorithm for computing the nonparametric mle of a mixing distribution. Journal of the American Statistical Association, 87:120–126, 1992. → pages 83 P. Li and J. Chen. Testing the order of a finite mixture model. Journal of the American Statistical Association, 105:1084–1092, 2010. → pages 81 Y. Lin. Support vector machine and the bayes rule in classification. Data Mining and Knowledge Discovery, 6:259–275, 2002. → pages 106 B. G. Lindsay. The geometry of mixture likelihoods: A general theory. The Annals of Statistics, 11:86–94, 1983. → pages 83 S. L. Lohr and J. Liu. A comparison of weighted and unweighted analyses in the ncvs. Journal of Quantitative Criminology, 10:343–360, 1994. → pages 76 G. MacLachlan. On bootstrapping the likelihood ratio test statistics for the number of components in a normal mixture. Applied Statistics, 36:318–324, 1987. → pages 81 G. MacLachlan and T. Krishnan. The EM Algorithm and Extensions. Hoboken, New Jersey: Wiley, 2 edition, 2008. → pages 85 S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41:3397–3415, 1993. → pages 34 P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, London, 2 edition, 1989. → pages 20, 26 J. McLachlan and D. Peel. Finite mixture models. Wiley, New York, 2002. → pages 103 N. Meindhausen and P. Buhlmann. Consistent neighborhood selection for high-dimensional graphs with the lasso. The Annals of Statistics, 34: 1436–1462, 2006. → pages 16 K. Mengersen, C. Robert, and D. Titterington. Mixtures: Estimation and Applications. Hoboken, New Jersey: Wiley, 1 edition, 2011. → pages 93 113 E. Molina and C. Skinner. Pseudo-likelihood and quasi-likelihood estimation for complex sampling schemes. Computational Statistics and Data Analysis, 13: 395–405, 1992. → pages 61 J. F. Murray and K. Kreutz-Delgado. An improved focuss-based learning algorithm for solving sparse linear inverse problems. Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, pages 347–351, 2001. → pages 34 E. Ohlsson. Asymptotic normality for two-stage sampling from a finite population. Probability Theory and Related Fields, 81:341–352, 1989. → pages 63 T. Park and G. Casella. The bayesian lasso. Journal of the American Statistical Association, 103:681–686, 2008. → pages 14 D. Pfeffermann. The role of sampling weights when modeling survey data. International Statistical Review, 61:317–337, 1993. → pages 57 D. Pfeffermann and D. J. Holmes. Robustness considerations in the choice of a method of inference for regression analysis of survey data. Journal of the Royal Statistical Society, Series A, 148:268–278, 1985. → pages 74 M. Postman, J. Huchra, and M. Geller. Probes of large-scale structures in the corona borealis region. The Astronomical Journal, 92:1238–1247, 1986. → pages 98 M. Rahiala and T. Ter¨asvirta. Business survey data in forecasting the output of swedish and finnish metal and engineering industries: A kalman filter approach. Journal of Forecasting, 12:255–271, 1993. → pages 56 C. R. Rao and Y. H. Wu. A strongly consistent procedure for model selection in a regression problem. Biometrika, 76:369–74, 1989. → pages 9 S. Richardson and P. Green. On bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society: Series B, 59: 731–792, 1997. → pages 81, 99 K. Roeder. Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association, 85: 617–624, 1990. → pages 98 M. Royall. The linear least-squares prediction approach to two-stage sampling. Journal of the American Statistical Association, 71:657–664, 1976. → pages 58 114 G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6: 461–464, 1978. → pages 6, 64, 81, 133 J. Shao. An asymptotic theory for linear model selection (with discussion). Statistica Sinica, 7:221–242, 1997. → pages 8 Y. She. Thresholding-based iterative selection procedures for model selection and shrinkage. Electronic Journal of Statistics, 3:384–415, 2009. → pages 22, 82, 86 Y. She. An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors. Computational Statistics and Data Analysis, In press, 2011. → pages 19, 70 R. Shibata. Asymptotic mean efficiency of a selection of regression variables. The Annals of Statistics, 35:415–423, 1983. → pages 8 M. Shoukri and G. MacLachlan. Parametric estimation in a genetic mixture model with application to nuclear family data. Biometrics, 50:128–139, 1994. → pages 80 D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D’Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kanto, T. R. Golub, and W. R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1:203–209, 2002. → pages 54 M. Stone. Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society, Series B, 39:111–147, 1974. → pages 18 J. Storey and R. Tibshirani. Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences, 100:9440–9445, 2003. → pages 30 R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58:267–288, 1996. → pages 2, 11, 12, 18, 19, 25, 81 L. Tierney, R. Kass, and J. Kadane. Fully exponential laplace approximation to expectations and variances of nonpositive functions. Journal of the American Statistical Association, 84:710–716, 1989. → pages 65 J. V´ıs˘ek. Asymptotic distribution of simple estimate for rejective, sampford and successive sampling. Contributions to Statistics, pages 263–275, 1979. → pages 63 115 H. Wang. Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104:1512–1524, 2009. → pages 19, 26, 45 H. Wang, G. Li, and G. Jiang. Robust regression shrinkage and consistent variable selection via the lad-lasso. Journal of Business and Economic Statistics, 11: 1–6, 2006. → pages 107 H. Wang, R. Li, and C. Tsai. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94:553–568, 2007. → pages 18, 41, 64, 92 H. Wang, B. Li, and C. Leng. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B, 71:671–683, 2009. → pages 19 W. Wolfson. Analysis of labour force survey data for the information technology occupations 2000-2003. WGW Services Ltd., Ottawa, Ontario, 2004. → pages 56 M. Woo and T. Sriram. Robust estimation of mixture complexity. Journal of the American Statistical Association, 51:4379–4392, 2006. → pages 81 J. Wu. On the convergence properties of the em algorithm. The Annals of Statistics, 11:95–103, 1983. → pages 85 L. Yan, K. Liu, K. Matthews, M. Daviglus, T. Ferguson, and C. Kiefe. Psychosocial factors and risk of hypertension: The coronary artery risk development in young adults (cardia) study. Journal of American Medical Association, 290:2138–2148, 2003. → pages 78 Y. Yang. Can the strength of aic and bic be shared? a conflict between model identification and regression estimation. Biometrika, 92:937–950, 2005. → pages 8 M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006. → pages 104 C. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38:894–942, 2010. → pages 14, 25 P. Zhang. Model selection via multifold cross-validation. The Annals of Statistics, 21:229–231, 1993. → pages 92 116 Y. Zhang, R. Li, and C. Tsai. Regularization parameter selection via generalized information criterion. Journal of the American Statistical Association, 105: 312–323, 2010. → pages 64 P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006. → pages 15, 16 S. Zhong and J. Ghosh. A unified framework for model-based clustering. Journal of Machine Learning Research, 4:1001–1037, 2003. → pages 103 J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. l1 -norm support vector machines. Neural Information Processing Systems, 16, 2003. → pages 106 H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006. → pages 14, 16 H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67:301–320, 2005. → pages 11, 14, 25 H. Zou and R. Li. One-step sparse estimation in nonconcave penalized likelihood models. The Annals of Statistics, 36:1509–1533, 2008. → pages 21, 22, 82, 86, 107 117 Appendix A Supplementary Information for Chapter 2 We provide proofs of Theorems 2.1-2.4 in this appendix. Before getting into the detailed illustrations, we first state a technical lemma as follows. Lemma A.1 Let Yi , i = 1, . . . , n be independent random variables following exponential family distributions of form (2.1) with natural parameters θi ∈ Θ. Let µi and σi2 denote the mean and variance of Yi respectively. Let tni , i = 1, . . . , n be real numbers such that n ∑ t2ni σi2 = 1, i=1 max {t2ni } = O(n) 1≤i≤n for some positive sequence hn = o(n). Then, for a sufficiently large n, ( P n ∑ ) tni (Yi − µi ) > hn i=1 ≤ exp(− h2n ). 3 Lemma A.1 states a useful property of exponential family for the illustration of theorems in this chapter, the proof of which can be found in Chen and Chen [2012]. 118 A.1 Proof of Theorem 2.1 Theorem 2.1 shows that, under some regularity conditions on model (2.1) and the design matrix, the SMLE-based screening sˆ retains all influential features in the true model s∗ with probability tending to one. We illustrate the theorem by showing that asymptotically sˆ falls into the collection of over-fitted models that contain s∗ as a submodel. This is implied by the fact that the maximum likelihood score based on an over-fitted model is asymptotically greater than that of any under-fitted model with at least one feature in s∗ excluded. ˆ be the (unrestricted) MLE of β based on model s. The theorem is Proof: Let β s implied if P {ˆ s ∈ S k+ } → 1. Thus, it suffices to show that ˆ ) ≥ min ln (β ˆ ) } → 0, P {max ln (β s s s∈S k− s∈S k+ (A.1) as n → ∞. ∗ For any s ∈ S k− , define s′ = s ∪ s∗ ∈ S 2k + . Consider β s′ close to β s′ such that ∥β s′ − β ∗s′ ∥ = w1 n−τ1 for some w1 , τ1 > 0. Clearly, when n is sufficiently large, β s′ falls into a small neighborhood of β ∗s′ , so that condition (T3) becomes applicable. Thus, by Taylor’s theory, we have l(β s′ ) − l(β ∗s′ ) 1 ˜ ′ )[β ′ − β ∗′ ] = [β s′ − β ∗s′ ]T S(β ∗s′ ) − [β s′ − β ∗s′ ]T H(β s s s 2 c 1 ≤ [β s′ − β ∗s′ ]T S(β ∗s′ ) − n∥β s′ − β ∗s′ ∥2 2 c1 ≤ w1 n−τ1 ∥S(β ∗s′ )∥ − w12 n1−2τ1 , 2 (A.2) ˜ ′ is an intermediate value between β ′ and β ∗′ . Thus, for some generic where β s s s 119 positive constant c, we have P {l(β s′ ) − l(β ∗s′ ) ≥ 0} ≤ P {∥S(β ∗s′ )∥ ≥ cn1−τ1 } ∑ ≤ P {Sj2 (β ∗s′ ) ≥ ck −1 n2−2τ1 } j∈s′ = ∑ P {Sj (β ∗s′ ) ≥ ck − 2 n1−τ1 } + 1 j∈s′ ∑ P {−Sj (β ∗s′ ) ≥ ck − 2 n1−τ1 }. 1 j∈s′ (A.3) Note that Sj (β ∗s′ ) = n ∑ [yi − b ′ (xTis′ β ∗s′ )]xij i=1 n ∑ = [yi − µi ]xij . i=1 ∑ ∑ Let tni = xij ( ni=1 x2ij σi2 )−1/2 . By condition T4, we have ni=1 t2ni σ12 = 1, ∑ maxi {t2ni } = O(n−1 ) and n−1 ni=1 x2ij σi2 ≤ c4 . Also, by condition T2, we have k ≤ w2 nτ2 . With these conditions, Lemma 1 gives the following probability inequality P {Sj (β ∗s′ ) ≥ ck − 2 n1−τ1 } n ∑ ≤ P{ tni (yi − µi ) > cn0.5(1−2τ1 −τ2 ) } 1 i=1 ≤ c exp(−n1−2τ1 −τ2 ). (A.4) By the same arguments, we also have P {−Sj (β ∗s′ ) ≥ ck − 2 n1−τ1 } ≤ c exp(−n1−2τ1 −τ2 ). 1 (A.5) The inequalities (A.4) and (A.5) imply that, for some generic constant c, P {l(β s′ ) ≥ l(β ∗s′ )} ≤ ck exp(−n1−2τ1 −τ2 ). 120 (A.6) Consequently, by Bonferoni inequality and condition 1 − 2τ1 − 2τ2 > 0, we have P {max l(β s′ ) ≥ l(β ∗s′ )} s∈sk− ≤ ∑ P {l(β s′ ) ≥ l(β ∗s′ )} s∈sk− ≤ ckpk exp(−n1−2τ1 −τ2 ) ≤ c exp(τ2 log n + mnτ2 log n − n1−2τ1 −τ2 ) = o(1). Because l(β s′ ) is concave in β s′ , above result holds for any β s′ such that ∥β s′ − β ∗s′ ∥ ≥ w1 nτ1 . ˘ ′ be β ˆ augmented with zeros corresponding to the For any s ∈ S k− , let β s s elements in s′ /s∗ . By condition T2, it is seen that ˘ ′ − β ∗′ ∥ ≥ ∥β ∗∗ ∥ ≥ w1 nτ1 . ∥β s s s /s Consequently, ˘ ′ ) ≥ l(β ∗′ )} = o(1) ˆ ) ≥ min l(β ˆ ) } ≤ P {max l(β P {max l(β s s s s s∈S k− s∈sk− s∈S k+ The theorem is proved. A.2 Proof of Theorem 2.2 Theorem 2.2 shows that the IHT procedure (2.12) stepwise increases the likelihood score within the restricted model sparsity. Such an increment property further ensures the convergence of IHT to a local maximum of (2.7). Proof: We first show the increment of ln (β (t) ). According to (2.9), we have l(β (t) ) = h(β (t) , β (t) ) ≤ h(β (t+1) , β (t) ) u = l(β (t+1) ) − ∥β (t+1) − β (t) ∥2 2 n ∑ + [b(xTi β (t+1) ) − b(xTi β (t) ) − b′ (xTi β (t) )(xTi β (t+1) − xTi β (t) )]. i=1 121 By the Taylor’s expansion, for any θ and θ0 , we have 1 ˜ b(θ) − b(θ0 ) − b′ (θ0 )(θ − θ0 ) = b′′ (θ)(θ − θ0 )2 2 for some θ˜ between θ and θ0 . Applying this expansion, we find 1 u (t+1) ∥β − β (t) ∥2 + ρ(t) ∥Xβ (t+1) − Xβ (t) ∥2 2 2 1 (t) (t+1) (m+1) ≤ l(β ) + (ρ ρ1 − u)∥β − β (m) ∥2 2 ≤ l(β (t+1) ) l(β (t) ) ≤ l(β (t+1) ) − as we have required u > ρ(t) ρ1 . Apparently, the quality holds only if β (t+1) = β (t) . The increment property of AHT is hence proved. Now let us move to the convergence of {β (t) }. By condition C3′ , l(β s ) is a strict concave function over β s for any s such that ∥s∥0 ≤ k. Since ∥β (t) ∥0 ≤ k and l(β (t) ) ≤ l(β (t+1) ) for t ≥ 0, the concavity of l(.) implies that β (t) stays in a bounded (compact) region. Thus, τ ∗ = supt (ρ(t) ) is finite based on the continuity of b′′ (·). By the similar arguments, we obtain 1 l(β (t+1) ) − l(β (t) ) ≥ (u − ρ∗ ρ1 )∥β (t+1) − β (t) ∥2 . 2 which implies that ∥β (t+1) − β (t) ∥ → 0 due to the convergence of {l(β (t) )}. Also, the compactness implies that {β (t) } has at least one limit point, say, ˜ = {β˜1 , . . . , β˜p }T . Let {tm } be a subsequence such that limm→∞ β (tm ) = β. ˜ β ˜ Since ∥β (t+1) − β (t) ∥ → 0, we must also have limm→∞ β (tm +1) = β. ˜ is a local maximum of ln (β) subject to ∥β∥0 ≤ k. By We next show that η (2.9), we have β (tm +1) = arg max h(γ; β (tm ) ) γ subject to∥γ∥0 ≤ k. ¿From the bivariate continuity of hn (ξ; η), letting m → ∞, we get ˜ = arg max h(γ; β) ˜ subject to∥γ∥0 ≤ k. β γ 122 ˜ is a maximum of h(γ; β) ˜ with respect to γ such that ∥γ∥0 ≤ k. We now That is, β split our discussion into following two cases. ˜ 0 < k, let γ ˜ + u−1 X T [y − b′ (X β)]. ˜ By (2.12), we ˜ =β Case 1: when ∥β∥ have ˜ = H(˜ β γ ; k), ˜ is zero. The which implies that the kth largest (in absolute value) component of γ definition of H then tells us ˜ = [H(˜ ˜, β γ1 ; 0), . . . , H(˜ γp ; 0)]T = γ ˜ = X T [y − b′ (X β)] ˜ = 0. Therefore, β ˜ is an unconwhich implies that S(β) strained maximum of ln (.) which is also a local maximum subject to ∥β∥0 ≤ k. ˜ 0 = k, we must have Case 2: when ∥β∥ ∂h(γ, β ′ ) =0 ∂γj γ =β ′ (A.7) ˜ as for any j ∈ {1, . . . , p} where β˜j ̸= 0. Let us rewrite h(γ; β) ˜ = l(γ) + T (γ, β) ˜ h(γ, β) (A.8) so that ˜ 2+ ˜ = − u ∥γ − β∥ T (γ, β) 2 2 n ∑ {b(xTi γj ) − b(xTi β˜j ) − b′ (β˜j )(xTi γj − xTi β˜j )}. i=1 It is seen that ˜ ∂T (γ, β) ∂γ = 0. ˜ γ =β Hence, together with (A.7) and (A.8), this fact implies that ∂l(β)/∂β j is zero at ˜ for any j ∈ {1, . . . , p} where β˜j ̸= 0. Let δ˜ be the minimum absolute value β=β ˜ Then, for β such that |β − β ˜ ∥ ≤ 0.5δ, ˜ β ˜ must be of non-zero components in β. j j a local maximum of l(β) subject to ∥β∥0 ≤ k. With above arguments, we now justify convergence of {β (t) } as follows. Note 123 that, by condition T3′ , there are finite number of local maximum of l(β) subject to ∥β∥0 ≤ k. Suppose β (t) does not converge but has two limiting points, say ˜ ̸= β ˜ . By what we have just proved, both are also local maxima of l(·). Let β 1 2 ˜ −β ˜ ∥. Since ∥β (t+1) − β (t) ∥ → 0, the distance between successive β (t) ϵ = ∥β 1 2 goes to 0 as t → ∞. Thus, there must be infinite many β (t) which are at least ϵ/3 distance from these two limiting points. The compactness condition hence implies that β (t) has at least another limiting point which is also a local maximum of l(·). ˜ . Using the same argument, β (t) must have additional Let this one be called β 3 ˜ which implies l(·) has at least 4 stationary points. Repeatedly limiting point β 4 applying this logic implies l(·) has infinite many stationary points. This contradicts with the assumption, and therefore implies β (t) converges. A.3 Proof of Theorem 2.3 Theorem 2.3 shows that, for appropriate choices of penalty function, the SMLEPLM procedure consistently estimates the model coefficients in the ultra-high dimensional GLM setup where p ≫ n. Since SMLE-PLM is a two-step procedure, where the PLM is used on a screened model sˆ from the ultra-high dimensional full model space, the randomness from both screening and PLM steps need to be accounted in the asymptotic analysis. We illustrate the theorem by showing that the MPLE converges (in probability) to the true model coefficients at a flat rate based on any sˆ that might be obtained from the screening step. Proof: Apparently, the screening consistency P (s∗ ⊂ sˆ) → 1 implies that P (ˆ s∈ S k+ ) → 1. To show the theorem, we investigate the performance of MPLE on all possible models over S k+ . Specifically, let u = (u1 , . . . , uk )T be an arbitrary k , let β = k-dimensional vector with ∥u∥2 = 1 and an = n−υ . For any s ∈ S+ s β ∗s + an u, and thus, Q(β s ) − Q(β ∗s ) = l(β ∗s + an u) − ln (β ∗s ) − n ∑ [ϕλ (βj∗ + an uj ) − ϕλ (βj∗ )] (A.9) j∈s When n is sufficiently large, β s falls into a small neighborhood of β ∗s , so that C3 124 becomes applicable. Following similar arguments in (A.2), we have the likelihood term in (A.9) bounded by l(β ∗s + an u) − l(β ∗s ) ≤ an ∥S(β ∗s )∥ − c1 2 na . 2 n (A.10) At the same time, we have n ∑ [ϕλ (βj∗ + an uj ) − ϕλ (βj∗ )] ≥ n ∑ [ϕλ (βj∗ + an uj ) − ϕλ (βj∗ )] j∈s∗ j∈s = n ∑ ′ an uj ϕλ (|β˜j |)sign(β˜j ) (A.11) j∈s∗ for some β˜j between βj∗ and βj∗ +an uj . By condition C2, we require minj∈s∗ |βj∗ | ≥ w1 n−τ1 . Thus, when n is large enough, we have β˜j > 0.5w1 n−τ1 for j ∈ s∗ and property P3 of ϕλ (.) implies that the penalty term in (A.9) is bounded by n ∑ ′ an uj ϕλ (|β˜j |)sign(β˜j ) ≥ −w3 n1−τ3 an j∈s∗ ∑ √ |uj | ≥ −w3 n1−τ3 an k. j∈s∗ (A.12) By (A.10)-(A.12), we obtain √ c1 2 nan + w3 n1−τ3 an k 2 1 √ c1 −υ ∗ ≤ n ∥S(β s )∥ − n1−2υ + w3 w2 n1−τ3 + 2 τ2 . 2 (A.13) Q(β s ) − Q(β ∗s ) ≤ an ∥S(β ∗s )∥2 − When υ < τ3 − 21 τ2 , the second term dominates the third term in (A.13). Thus, for a sufficient large n and some generic constant c, P {Q(β s ) ≥ Q(β ∗s )} ≤ P {∥S(β ∗s )∥ ≥ cn1−υ } ≤ c exp(−n1−2υ−τ2 ), where the last in equality is followed by the same arguments in (A.3)-(A.6). Con- 125 sequently, by Bonferoni inequality and the condition υ < 1 2 − τ2 , we have P {Q(β s ) ≥ Q(β ∗s ) for some s ∈ S k+ } ∑ ≤ P {Q(β s ) ≥ Q(β ∗s )} s∈S k+ ≤ ckpk exp(−n1−2υ−τ2 ) ≤ c exp(τ2 log n + mnτ2 log n − n1−2υ−τ2 ) = o(1), (A.14) which implies that, with probability tending to one, the local maximizer of Q(β sˆ) falls into an -neighborhood of β ∗sˆ. The theorem is proved. A.4 Proof of Theorem 2.4 Theorem 2.4 shows that, under additional requirements on the penalty function, the SMLE-PLM procedure consistently identifies the true model s∗ in ultra-high dimensional situations. Similarly to the proof of Theorem 2.3, the randomness from both screening and PLM steps in the SMLE-PLM needs to be considered. We illustrate the theorem by showing that the probability of MPLE that identifies s∗ from any possible sˆ goes to one at a common rate. Proof: Clearly, the requirements of Theorem 2.4 imply that τ3 > 0.5−τ2 > τ1 + τ22 . Thus, following the arguments in the proof of Theorem 2.3, there exists a local ˆ (ˆ maximizer β s) of (2.13), such that λ ˆ (s) − β ∗ ∥ ≤ an = cn−(0.5−τ2 −δ) ∥β λ s (A.15) with probability tending to 1 for some generic constant c, δ and s ∈ S k+ . For the convenience of presentation, we denote β ∗s by {β ∗s1 , β ∗s2 } with β ∗s2 = 0 for any s ∈ S k+ . Then, for β s = {β s1 , β s2 } such that ∥β s − β ∗s ∥2 ≤ an , we have Q(β s1 , β s2 ) − Q(β ∗s1 , 0) = l(β s1 , β s2 ) − l(β ∗s1 , 0) − n ∑ ϕλ (|βj |) j∈s\s∗ (A.16) For sufficient large n, β s falls into small neighborhood of β ∗s such that con126 ditions C3 and C5 become applicable. Thus, following the similar arguments in (A.3), the likelihood term in (A.16) is bounded by l(β s1 , β s2 ) − l(β ∗s1 , 0) c1 ∂l(β s1 , 0)T β s2 − n∥β s2 ∥2 ≤ ∂β s2 2 ∑ (|Sj (β ∗s )| + c5 n∥β s1 − β ∗s1 ∥)|βj | ≤ j∈s\s∗ √ ≤ ∥S(β ∗s )∥ · ∥β s2 ∥ + c5 n k∥β s1 − β ∗s1 ∥ · ∥β s2 ∥ (A.17) Moreover, with 0 < δ < 0.5 − τ1 − τ2 , β s2 falls into 0.5n−τ1 neighborhood of zero and property P4 of ϕλ (.) implies that ′ ϕ(|βj |) = ϕ (|β˜j |)sign(β˜j )βj ≥ w4 n−τ4 |βj | (A.18) for some β˜j ∈ (0, βj ) and j ∈ s \ s∗ . Consequently, with (A.17) and (A.18), we have (A.16) bounded by Q(β s1 , β s2 ) − Q(β ∗s1 , 0) √ ≤ ∥S(β ∗s )∥ · ∥β s2 ∥ + c5 n k∥β s1 − β ∗s1 ∥ · ∥β s2 ∥ − w4 n1−τ4 ∥β s2 ∥ ≤ ∥S(β ∗s )∥ · ∥β s2 ∥ + c5 n0.5+1.5τ2 +δ ∥β s2 ∥ − w4 n1−τ4 ∥β s2 ∥ (A.19) By choosing δ = 0.5 min{0.5 − 1.5τ2 − τ4 , 0.5 − τ1 − τ2 }, the third term dominates the second term in (A.19). Thus, for a sufficiently large n and some generic constant c, we have P {Q(β s1 , β s2 ) ≥ Q(β ∗s1 , 0)} ≤ P {∥S(β ∗s )∥ ≥ cn1−τ4 } ≤ c exp(−n1−2τ4 −τ2 ) Consequently, following the similar arguments in (A.14) with the condition τ4 < 0.5 − 1.5τ2 , we have k P {Q(β s1 , β s2 ) ≥ Q(β ∗s1 , 0) for some s ∈ S+ } ≤ ckpk exp(−n1−2τ4 −τ2 ) = o(1), 127 which implies that, with probability tending to one, the local maximizer of Q(β), ˆ (ˆ say β s), is located at βˆλj (ˆ s) = 0 for j ∈ sˆ \ s∗ . The theorem is proved. λ 128 Appendix B Supplementary Information for Chapter 3 We provide proofs of Theorems 3.1-3.2, Corollary 3.1 and the technical deviation of sample-based BIC 3.7 in this appendix. For simplicity of presentation, we use λ instead of λn in the proof. B.1 Proof of Theorem 3.1 Theorem 3.1 shows that, with appropriate sampling schemes and choices of penalty functions, the PPLM consistently estimate the model coefficients and identifies the true model under the joint randomization framework. For the sampling plans where the sample mean converges at the same rate as the i.i.d. sampling, the maximizer of the penalized pseudo-likelihood necessarily converges to the census version based on the full likelihood – this further ensures the consistency of PPLM as stated in the theorem. Proof: Let us first work on the estimation consistency of β. Let u be an arbitrary p-dimensional vector with ||u|| = c and an = n− 2 + φλ for some c > 0. To ˆ is in the obtain the estimation consistency, it suffices to show that as n → ∞, β λ 1 an -neighborhood of β ∗ . 129 Let β = β ∗ + an u. We have Qn (β) − Qn (β ∗ ) ∗ ∗ = {ln (β + an u) − ln (β )} − n ≤ {ln (β ∗ + an u) − ln (β ∗ )} − n p ∑ j=1 q ∑ [ψλ (|β ∗j + an uj |)] − ψλ (|β ∗j |)] [ψλ (|β ∗j + an uj |) − ψλ (|β ∗j |)]. j=1 (B.1) Let us work on the first term in (B.1). Define HN (β) = N ∑ xi b′′ [XTi β]xTi , Hn (β) = i=1 ∑ wi xi b′′ [XTi β]xTi . i∈d By condition C1, when β is close to β ∗ , each element in matrix HN (β), say htj = ∑N ′′ −1 |h |1+η = O(1) with tj i=1 b (xi β)xit xir for t, j ∈ {1, . . . , p}, satisfies N probability tending to 1. Thus, by Condition C3, we have 1 1 Hn (β) − HN (β) →p 0. n N (B.2) Clearly, when n is large enough, β is in a small neighborhood of β ∗ and so (B.2) becomes applicable. Therefore, from the continuity of I(β) at β ∗ , we have 1 ˜ ln (β) − ln (β ∗ ) = an ln′ (β ∗ )T u − a2n uT Hn (β)u 2 1 = an ln′ (β ∗ )T u − na2n uT I(β ∗ )u(1 + op (1)) 2 1 2 2 ′ ∗ T ≤ an ln (β ) u − nc an M1 (1 + op (1)) 2 (B.3) ˜ between β and β ∗ , where l′ (β ∗ ) = ∂ln (β ∗ )/∂β = {l′ (β ∗ ), . . . , l′ (β ∗ )}T for some β np n n1 ∑ ∗ ∗ ′ ′ and lnj (β ) = i∈d wi (yi − b (xi β ))xij for j ∈ {1, . . . , p}. Note that for 130 j ∈ {1, . . . , p}, E[({Y − b′ (Xβ ∗ )}Xj )2 ] = E { E [ ((Y − b′ (Xβ ∗ ))Xj )2 | X ] } = E {b′′ (Xβ ∗ )Xj2 } < ∞. Thus, N −1 ∑N i=1 (yi − b′ (xi β ∗ ))xij = Op (N −1/2 ). This and Condition C3 imply that ′ lnj (β ∗ ) N √ √ n ∑ = (yi − b′ (xi β ∗ ))xij + Op ( n) = Op ( n). N (B.4) i=1 √ Therefore, the first term in (B.3) is Op ( nan ) = Op (na2n ). By taking a sufficiently large c for ∥u∥, (B.3) is dominated by its second term, which implies that P {ln (β) ≥ ln (β ∗ )} → 0. Now let us consider the second term in (B.1). Based on Taylor’s expansion and the continuity of ψλ′ (|β|) at β 0j , we have ψλ (|β ∗j + an uj |) − ψλ (|β ∗j |) = an uj ψλ′ (|β ∗j |)sign(βj∗ )(1 + o(1)) for j ∈ {1, . . . , q}. Thus, the second term in (B.1) is O(nan φλ ) = Op (na2n ), which is also dominated by the second term in (B.3) for a sufficiently large c. This implies that P {Qn (β) ≥ Qn (β ∗ )} → 0, and therefore there exists a local maximizer of Qn in the an -neighborhood of β ∗ . The proof of the estimation consistency is complete. We now examine the selection consistency. Recall that we write β ∗ = {β ∗1 , β ∗2 } with β ∗2 = 0. Hence, for any β = {β 1 , β 2 } such that ∥β −β ∗ ∥ ≤ can and β 2 ̸= 0, we have Qn (β 1 , β 2 ) − Qn (β 1 , 0) = {ln (β 1 , β 2 ) − ln (β 1 , 0)} − {n p ∑ ψλ (|βj |)} j=q+1 (B.5) 131 Similarly to the proof of the estimation consistency, we have ln (β 1 , β 2 ) − ln (β 1 , 0) = ˜ ) ∂ln (β 1 , 0) T 1 ∂ln (β 1 , β 2 β 2 + β T2 β2 ∂β 2 2 ∂β 2 ∂β T2 ∂ln (β 1 , 0) T 1 β 2 − nM1 ∥β 2 ∥2 (1 + op (1)) ∂β 2 2 ∂ln (β 1 , 0) ≤ ∥ ∥ · ∥β 2 ∥ (B.6) ∂β 2 ≤ ˜ between β and β ∗ = 0. Also, by Taylor’s expansion, it can be shown for some β 2 2 2 that ∂ln (β 1 , 0) ∂ln (β ∗ ) ∂ 2 ln (β ∗ ) = + (β 1 − β ∗1 )(1 + op (1)) ∂β2j ∂β2j ∂β2j ∂β 1 T (B.7) for j ∈ {q + 1, . . . , p}, which implies that (B.6) is Op (n∥β 1 − β ∗1 ∥∥β 2 ∥ + √ n∥β 2 ∥). Note that when n is large enough, β 2 is in a small neighborhood of 0 such that property D3 of ψ(|β|) becomes applicable. Specifically, if β˙ j = 0.5βj , then for some β˜j between βj and β˙ j , we have 1 ψ(|βj |) = ψ(|β˙ j |) + ψ ′ (|β˜j |)sign(β˜j )βj 2 1 ≥ M2 an |βj | 2 (B.8) for each j ∈ {q + 1, . . . , p} and an arbitrary positive constant M2 . Thus, by choosing a sufficiently large M2 , (B.5) is dominated by its second term. This implies that P {Qn (β 1 , β 2 ) ≥ Qn (β 1 , 0)} → 0, and therefore the local maximizer ˆ = β ∗ = 0. The theorem is proved. of Qn is located at some β λ2 2 B.2 Proof of Corollary 3.1 Corollary 3.1 shows that, with additional requirements on the penalty function, the PPLM is able to consistently identify the true model s and estimate their coefficients as efficiently as the MLE based on the true model. This result corresponds to the notation of oracle property in Fan and Li [2001] for the non-survey PLMs. 132 Proof: Clearly, the requirement for ϕλ (|β|) implies that φλ → 0. Thus, by Theoˆ = 0) → 1, which implies that, with probability tending to rem 1, we have P (β 2λ ˆ : 1, the following relationship holds for β 1λ ˆ , 0) ∑ ∂ln (β 1λ ϕ′ (|βˆjλ |) = 0. + ∂β 1 q (B.9) j=1 ˆ − β ∗ ∥ = Op (n−1/2 + ϕλ ), so with probability tending Also, by Theorem 3.1, ∥β 1λ 1 ˆ is in a small neighborhood of β ∗ , which is bounded away from 0. Since to 1 β 1λ 1 for any β ̸= 0 there exists M > 0 such that ψλ′ (|β|) = 0 when n > M , we have P (ϕ′ (|βˆjλ |) = 0) → 1 for j = 1, . . . , q. ˆ satisfies This together with (B.9) implies that, with probability tending to 1, β λ1 ˆ , 0) ∂ln (β 1λ = 0, ∂β 1 which is exactly the same as the normal equation in solving the MPLE of ln (β) based on the true model s∗ . The proof is complete. B.3 Derivation of BIC (3.7) We use the same principle as in the classical BIC (Schwarz [1978]) to obtain the sample-based BIC (3.7). Simplistically, we show that the rationale of using BIC (3.7) is to approximately maximize the pseudo-posterior (3.6) when the sample size is large. Our derivations are heuristic and we do not spell out the conditions most rigorously. The ultimate justification of (3.7) is the large sample properties of the resulting variable selection procedure, which is discussed in Theorem 3.2. For simplicity of notation, we use s instead of sλ in the following derivation. Let νs (β s ) be the prior density function of regression coefficient vector β s given candidate model s. A pseudo-marginal density function of the data is then given by ∫ Pn (y|s) = Ln (y; β s )νs (β s )dβ s . 133 Consequently, we regard the following expression as the pseudo-posterior probability of model s, Pn (s|y) = ∑ Pn (y|s)P (s) , s∈S P (s)Pn (y|s) where P (s) denotes the prior probability of model s and S is the collection of candidate models. In the spirit of Bayesian inference, we select the model that ∑ maximizes Pn (s|y). Since s∈S P (s)Pn (y|s) does not depend on any specific s, the highest Pn (s|y) is achieved at the model that maximizes Pn (y|s) when a uniform prior P (·) over the model space is adopted. ˆ be the maximizer of Ln (y; β ) given s. To take a close look at Pn (y|s), let β s s Assume that νs (.) is smooth and does not change with the sample size n. Also, when n is large enough, ˆ ) = op (1) Ln (y; β s )/Ln (y; β s ˆ . In addition, within Op (n−1/2 ) for β s outside of a Op (n−1/2 ) neighborhood ∆ of β s ˆ ˆ ) neighborhood of β , νs (·) is a constant function in ∆. That is, νs (β ) ≈ νs (β s s s for β s ∈ ∆. Thus, ∫ Pn (y|s) ≈ Ln (y; β s )νs (β s )dβ s ∆β ˆ ∫ s ˆ )· ≈ νs (β s Ln (y; β s )dβ s . ∆β ˆ (B.10) s ˆ . Then, Let Hn (β) = −∂ 2 ln (β)/∂β∂β T . Assume that Hn (β) is continuous at β s ˆ , ln (β ) = log Ln (y; β ) is approximated by for β close to β s s s s ˆ ) − (1/2)(β − β ˆ )T Hn (β ˆ )(β − β ˆ ). ln (β s ) ≈ ln (β s s s s s s 134 (B.11) Approximation (B.11) together with (B.10) implies that { } 1 T ˆ )· ˆ ) · exp − (β − β ˆ ) Hn (β ˆ )(β − β ˆ ) dβ Pn (y|s) ≈ νs (β Ln (y; β s s s s s s s 2 s ∆β ˆ s { } ∫ 1 T ˆ ˆ ˆ ˆ ˆ ≈ νs (β s ) · Ln (y; β s ) · exp − (β s − β s ) Hn (β s )(β s − β s ) dβ s 2 ∫ ˆ ) · Ln (y; β ˆ ) · (2π) = νs ( β s s τ (s) 2 ˆ ) · Ln (y; β ˆ ) · (2π/n) = νs ( β s s ˆ )|− 12 · |Hn (β s τ (s) 2 ˆ )|− 21 , · |n−1 Hn (β s (B.12) where |.| denotes the determinant of a matrix and the first step is from the Laplace approximation. Consequently, we have −2 log Pn (y|s) approximated by ˆ ) + τ (s) log n + R, − 2ln (β s (B.13) where ˆ )] + log[|n−1 Hn (β ˆ )|] = Op (1). R = −τ (s) log(2π) + log[νs (β s s The order is justified when ˆ ) → H(β ∗ ) n−1 Hn (β s s in probability for some positive definite matrix H. When the Op (1) term is ignored from (B.13), we obtain a simplified criterion ˆ ) + τ (s) log n. BICn (s) = −2ln (β s B.4 Proof of Theorem 3.2 Theorem 3.2 shows that the sample-based BIC score (3.7) is minimized on the true model with probability tending to one. With appropriate sampling plans, the sample-based BIC essentially reduces to the classic BIC up to a op (1) term due to the unequal weighting, which has no effect on the consistency of the BIC-based selection. 135 ˆ be the maximizer of ln (β) based on model s. For any λ ∈ Ω+ ∪ Ω− , Proof: Let β s we have ˆ ∗ ) − ln (β ˆ )} − {τ (s∗ ) − τ (sλ )} log n. BICn (sλ ) − BICn (s∗ ) = 2{ln (β s sλ Thus, Theorem 3.2 is implied if [ ] ˆ ∗ ) − l n (β ˆ )} ≤ {τ s∗ − τ (sλ )} log n → 0. P {2ln (β s sλ (B.14) We verify (B.14) for underfitting (λ ∈ Ω− ) and overfitting (λ ∈ Ω+ ). ˇ be β ˆ augmented with zeros Case 1: When λ ∈ Ω− , we have s∗ ̸⊂ sλ . Let β sλ sλ corresponding to j ∈ {1, . . . , p} and j ̸∈ sλ . We have ˆ ) − ln (β ˆ ∗ ) ≤ ln (β ˆ ) − ln (β ∗ ) ln (β sλ s sλ ˇ ) − ln (β ∗ ). = ln (β sλ ˜ such that Note that for β close to β ∗ , there exists a β 1 ˜ − β∗ ) ln (β) − ln (β ∗ ) = (β − β ∗ )T ln′ (β ∗ ) − (β − β ∗ )T Hn (β)(β 2 1 ≤ ∥β − β ∗ ∥ · ∥ln′ (β ∗ )∥ − M1 ∥β − β ∗ ∥2 n{1 + op (1)} (B.15) 2 with probability tending to 1. As shown in the proof of Theorem 1, we have ∥ln′ (β ∗ )∥ = Op (n1/2 ). Thus, for β such that ∥β − β ∗ ∥ = n− 3 , (B.15) is dom1 1 inated by its second term which is −(1/2)M1 · n 3 (1 + op (1)). Because of the concavity of ln (β), 1 1 P {ln (β) − ln (β ∗ ) ≤ − M1 · n 3 } → 1 2 1 ˇ as a special case. holds for any β such that ∥β − β ∗ ∥ ≥ n− 3 , which includes β sλ We therefore obtain ˆ ∗ ) − ln (β ˆ ) ≥ 1 M1 · n 13 (1 + op (1)) l n (β s sλ 2 136 with probability tending to 1. This implies that (B.14) holds for any λ ∈ Ω− . Case 2: For λ ∈ Ω+ , we have ˆ ) − ln (β ˆ ∗ ) ≤ ln (β ˆ ) − ln (β ∗ ). ln (β sλ s sλ Similarly, by conditions C2 and C3, with probability tending to 1, we have ˆ ) − ln (β ∗ ) ≤ (β ˆ − β ∗ )T l′ (β ∗ ) − 1 (β ˆ − β ∗ )T Hn (β ∗ )(β ˆ − β ∗ ){1 + op (1)} ln (β sλ sλ sλ n 2 sλ ˆ − β ∗ )T l′ (β ∗ ) ≤ (β sλ n ≤ nln′ (β ∗ )T I −1 (β ∗ )ln′ (β ∗ )(1 + op (1)) 1 ′ ∗ 2 −1 ∥l (β )∥ n (1 + op (1)). ≤ M1 n ˆ )− ln (β ∗ ) = Op (1), which implies that Since ∥ln′ (β ∗ )∥2 = Op (n), we have ln (β sλ (B.14) holds for any λ ∈ Ω+ . The theorem is proved. 137 Appendix C Supplementary Information for Chapter 4 We provide proofs of Theorems 4.1-4.2, Corollary 4.1 and Proposition 4.1 in this appendix. C.1 Proof of Theorem 4.1 Theorem 3.1 shows that, with an appropriate scale parameter, the iteration from the ITD procedure (4.10) reduces the value of objective function Q in (4.8). The proof is similar in showing Theorem 2.2 under the GLM context. Proof: According to (4.10), we have Q(η (m) ) = u−1 G(η (m) , η (m) ) ≥ u−1 G(η (m+1) , η (m) ) 1 = Q(η (m+1) ) + u−1 ∥η (m+1) − η (m) ∥2 2 K ∑ n ∑ (m+1) (m) (m) (m) − wik [b(θk ) − b(θk ) − b′ (θk )(θkm+1 − θk )]. k=1 i=1 138 By Taylor’s expansion, for any θ and θ0 , we have 1 ˜ b(θ) − b(θ0 ) − b′ (θ0 )(θ − θ0 ) = b′′ (θ)(θ − θ0 )2 2 for some θ˜ between θ and θ0 . Applying this expansion, noting that wik ∈ (0, 1), we find 1 1 (m) Q(η (m) ) ≥ Q(η (m+1) ) + u−1 ∥η (m+1) − η (m) ∥2 − nτ2 ∥θ (m+1) − θ (m) ∥2 2 2 1 (m) = Q(η (m+1) ) + (u−1 − nτ2 τ1 )∥η (m+1) − η (m) ∥2 2 ≥ Q(η (m+1) ) since we have required u−1 > nτ2 (m) τ1 . Clearly, equality holds only if η (m+1) = η (m) . The theorem is hence proved. C.2 Proof of Corollary 4.1 Corollary 4.1 shows that the difference between two iterations in the ITD diminishes as the procedure proceeds. It serves as an important prerequisite for the convergence of ITD. Proof: According to Theorem 4.1, we have Q(η (m+1) ) ≤ Q(η (m) ). Thus, the compact assumption implies that η (m) stays in a bounded region. Thus, τ2∗ = (m) supm (τ2 ) is finite based on the continuity of b′′ (·). By similar arguments to those in Theorem 1, we obtain 1 Q(η (m) ) − Q(η (m+1) ) ≥ (u−1 − nτ2∗ τ1 )∥η (m+1) − η (m) ∥2 . 2 Consequently, the asymptotically regularity of {η (m) } is implied by the convergence of Q(η (m) ). C.3 Proof of Theorem 4.2 Theorem 4.2 shows that the ITD procedure converges to a stationary point of the objective function Q. The convergence is established based on Corollary 4.1 and 139 the fact that the sequence of ITD has only one limiting point. Proof: The compact condition implies that {η (m) } has at least one limit point, say, η ∗ . Let {mt } be a subsequence such that limm→∞ η (mt ) = η ∗ . By Corollary 4.1, we must also have limm→∞ η (mt +1) = η ∗ . We now show that η ∗ is also a stationary point of Q(η). By (4.10), we have η (mt +1) = arg min G(ξ; η (mt ) ). ξ ¿From the bivariate continuity of G(ξ; η), letting t → ∞, we get η ∗ = arg min G(ξ; η ∗ ). ξ ∗ That is, η ∗ = {η0∗ , . . . , ηK−1 } is a local/global minimum of G(ξ; η ∗ ) with respect to ξ. Therefore, we must have ∂G(ξ, η ∗ ) =0 ∂ξj ξ =η ∗ for any j ∈ {0, . . . , K − 1} where ηj∗ ̸= 0. Denote θk∗ = ∑k−1 j=0 ξj for k = 1, . . . , K − 1, and write G(ξ, η ∗ ) = uQ(ξ) + T (ξ, η ∗ ) (C.1) ∑k−1 j=0 ηj∗ and ζk = (C.2) so that T (ξ, η ∗ ) = K−1 K ∑ n ∑ 1 ∑ (ξj − ηj∗ )2 − u wik {b(ζk ) − b(θk∗ ) − b′ (θk∗ )(ζk − θk∗ )}. 2 j=1 It can be seen that k=1 i=1 ∂T (ξ, η ∗ ) = 0. ∂ξ ξ =η ∗ Hence, together with (C.1) and (C.2), this fact implies that ∂Q(η)/∂ηj is zero at η = η ∗ for any j ∈ {0, . . . , K − 1} where ηj∗ ̸= 0. Therefore, η ∗ is a stationary 140 point of Q. Suppose η (m) does not converge but has two limiting points, say η ∗1 ̸= η ∗2 . By what we have just proved, both are also stationary points of Q(·). Let ϵ = ∥η ∗1 − η ∗2 ∥. By Corollary 4.1, the distance between successive η (m) goes to 0 as m → ∞. Thus, there must be infinitely many η (m) that are at least ϵ/3 from these two limiting points. The compact condition hence implies that η (m) has at least another limiting point that is also a stationary point of Q(·). Let this be η ∗3 . Using the same argument, η (m) must have an additional limiting point η ∗4 , which implies that Q(·) has at least four stationary points. Repeatedly applying this logic implies that Q(·) has infinitely many stationary points. This contradicts the assumption, and therefore implies that η (m) converges. C.4 Proof of Proposition 4.1 Proposition 4.1 provides the analytic solution to the unified optimization problem (4.11) with the SCAD penalty, which serves as a build-in block for the ITD procedure (4.10). The proof follows standard procedures in mathematical analysis. Proof: By a result from classical mathematical analysis, the solution γ ∗ must be a critical point of q(γ). To avoid unnecessary complexity, we assume that z ≥ 0 so that γ ∗ ≥ 0. The result for z < 0 can be obtained by symmetry. When γ > 0, we have q ′ (γ) = γ − z + κp′λ (γ). (C.3) It can be seen that κp′λ (γ) is flat on the interval (0, λ]; from the point γ = λ it decreases as a straight line with slope −κ/(ν − 1), and it remains 0 after γ = νλ (see Fig. C.1). A nonzero critical point of q(·), if any, must be the intersection of g1 (γ) = κp′λ (γ) and the straight line g2 (γ) = z − γ. Suppose 0 < κ ≤ ν − 1 (case 1). There is at most one such intersection: (a) when z < κλ, there is no intersection so γ ∗ = 0 is the only critical point and the global minimum; (b) when κλ < z ≤ (κ + 1)λ, the unique intersection is at γ ∗ = z − κλ; (c) when (κ + 1)λ < z ≤ νλ, the unique intersection is at γ∗ = (ν − 1)z − κνλ ; ν−κ−1 141 5 5 4 Case 2 4 Case 1 (c) (c) 3 3 (d) 2 1 1 2 (c) (b) (a) 0 (b) 0 (a) 0 1 2 3 4 5 0 1 2 gamma 3 4 5 gamma 2 3 4 5 Case 3 (b) 0 1 (a) 0 1 2 3 4 5 gamma Figure C.1: Plot demonstration for Proposition 4.1 with ν = 3.7, λ = 1, and κ = (1, 3, 4.2) for cases 1–3. Solid lines represent g1 (γ) = κp′λ (γ) and dashed lines represent g2 (γ) = −γ + z for various z values. (d) when z > νλ, the unique intersection is at γ ∗ = z. Note that cases (a) and (b) have a common expression γ ∗ = (z − κλ)+ . In (b), (c), and (d), the nonzero critical point is found to be the global minimum. Thus, we have obtained the result for the case where κ > ν − 1 when z > 0. Suppose ν − 1 < κ ≤ ν (case 2). In this case, κp′λ (γ) decreases with a slope smaller than −1 from γ = λ until it reaches 0. This creates the possibility of up to four intersections between κp′λ (γ) and g(γ). Yet we find that the global minimum is at the largest critical point: (a) when z < κλ, there is no intersection and we find γ ∗ = 0; (b) when κλ ≤ z < νλ, the two curves intersect at γ ∗ = z − κλ; (c) 142 when z ≥ νλ, there can be many intersections but the global minimum is given by γ ∗ = z. Suppose ν < κ (case 3). The derivation is the simplest in this case: (a) when z < νλ, there is no intersection and we find γ ∗ = 0; (b) when νλ ≤ z, the largest intersection is γ ∗ = z, which is the global minimum. We have considered all possible cases. For z < 0, the expression of γ ∗ must be adjusted accordingly. 143
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Applications of penalized likelihood methods for feature...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Applications of penalized likelihood methods for feature selection in statistical modeling Xu, Chen 2012
pdf
Page Metadata
Item Metadata
Title | Applications of penalized likelihood methods for feature selection in statistical modeling |
Creator |
Xu, Chen |
Publisher | University of British Columbia |
Date Issued | 2012 |
Description | Feature selection plays a pivotal role in knowledge discovery and contemporary scientific research. Traditional best subset selection or stepwise regression can be computationally expensive or unstable in the selection process, and so various penalized likelihood methods (PLMs) have received much attention in recent decades. In this dissertation, we develop approaches based on PLMs to deal with the issues of feature selection arising from several application fields. Motivated by genomic association studies, we first address feature selection in ultra-high-dimensional situations, where the number of candidate features can be huge. Reducing the dimension of the data is essential in such situations. We propose a novel screening approach via the sparsity-restricted maximum likelihood estimator that removes most of the irrelevant features before the formal selection. The model after screening serves as an excellent starting point for the use of PLMs. We establish the screening and selection consistency of the proposed method and develop efficient algorithms for its implementation. We next turn our attention to the analysis of complex survey data, where the identification of influential factors for certain behavioral, social, and economic indices forms a variable selection problem. When data are collected through survey sampling from a finite population, they have an intrinsic dependence structure and may provide a biased representation of the target population. To avoid distorted conclusions, survey weights are usually adopted in these analyses. We use a pseudo-likelihood to account for the survey weights and propose a penalized pseudo-likelihood method for the variable selection of survey data. The consistency of the proposed approach is established for the joint randomization framework. Lastly, we address order selection for finite mixture models, which provides a flexible tool for modeling data from a heterogeneous population. PLMs are attractive for such problems. However, this application requires maximizations over nonsmooth and nonconcave objective functions, which are computationally challenging. We transform the original multivariate objective function into a sum of univariate functions and design an iterative thresholding-based algorithm to efficiently solve the sparse maximization without ad hoc steps. We establish the convergence of the new algorithm and illustrate its efficiency through both simulations and real-data examples. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2012-09-21 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0073181 |
URI | http://hdl.handle.net/2429/43254 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2012-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2012_fall_xu_chen.pdf [ 589.18kB ]
- Metadata
- JSON: 24-1.0073181.json
- JSON-LD: 24-1.0073181-ld.json
- RDF/XML (Pretty): 24-1.0073181-rdf.xml
- RDF/JSON: 24-1.0073181-rdf.json
- Turtle: 24-1.0073181-turtle.txt
- N-Triples: 24-1.0073181-rdf-ntriples.txt
- Original Record: 24-1.0073181-source.json
- Full Text
- 24-1.0073181-fulltext.txt
- Citation
- 24-1.0073181.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0073181/manifest