Graphical Model Structure Learning with $\ell_1$-Regularization

by

Mark Schmidt

B.Sc., The University of Alberta, 2003
M.Sc., The University of Alberta, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
in
The Faculty of Graduate Studies
(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)
August 2010

© Mark Schmidt 2010

Abstract

This work looks at fitting probabilistic graphical models to data when the structure is not known. The main tool to do this is $\ell_1$-regularization and the more general group $\ell_1$-regularization. We describe limited-memory quasi-Newton methods to solve optimization problems with these types of regularizers, and we examine learning directed acyclic graphical models with $\ell_1$-regularization, learning undirected graphical models with group $\ell_1$-regularization, and learning hierarchical log-linear models with overlapping group $\ell_1$-regularization.

Table of Contents

Abstract
Table of Contents
List of Figures
List of Acronyms
Acknowledgements

1 Introduction
  1.1 Regression and Binary Classification
  1.2 Dependency Networks
  1.3 Directed Acyclic Graphical Models
  1.4 Gaussian and Ising Graphical Models
  1.5 Pairwise Undirected Graphical Models
  1.6 General Log-Linear Models
  1.7 Data Sets
  1.8 Summary of Contributions

2 Optimization with $\ell_1$-Regularization
  2.1 Logistic Regression with Differentiable Regularization
    2.1.1 L-BFGS Approximation
    2.1.2 $\ell_1$-Regularization over an Orthant
  2.2 Logistic Regression with $\ell_1$-Regularization
    2.2.1 Orthant-Wise Learning
    2.2.2 Active-Set Methods
    2.2.3 Two-Metric Projection
  2.3 Projected Scaled Sub-Gradient
    2.3.1 Gafni-Bertsekas Variant
    2.3.2 Sign Constraint Variant
    2.3.3 Active-Set Variant
  2.4 Implementation
  2.5 Regularization Path and Active-Set Optimization
  2.6 Experiments
    2.6.1 Logistic Regression
    2.6.2 Ising Graphical Models
  2.7 Extensions
    2.7.1 Other Objective Functions
    2.7.2 Other Extensions

3 Optimization with Group $\ell_1$-Regularization
  3.1 Barzilai-Borwein Methods
    3.1.1 Spectral Projected Gradient
    3.1.2 Barzilai-Borwein Soft Threshold
  3.2 Quasi-Newton Methods
    3.2.1 Projected Quasi-Newton
    3.2.2 Quasi-Newton Soft Threshold
  3.3 Implementation
  3.4 Regularization Path and Active-Set Optimization
  3.5 Experiments
    3.5.1 Pairwise Log-Linear Models
    3.5.2 Ising Graphical Models
  3.6 Extensions

4 Directed Graphical Model Structure Learning
  4.1 Search and Score Methods
  4.2 Constraint-Based Methods
  4.3 Hybrid Methods
  4.4 A Hybrid Method with $\ell_1$-regularization
  4.5 Causal DAGs
  4.6 Experiments
    4.6.1 Synthetic Data
    4.6.2 Real Data
  4.7 Similar Methods
  4.8 Extensions
    4.8.1 Other CPDs
    4.8.2 Other Extensions

5 Undirected Graphical Model Structure Learning
  5.1 Search-based and Constraint-based Methods
  5.2 $\ell_1$-Regularization
  5.3 Approximate Objectives
  5.4 Group $\ell_1$-Regularization
  5.5 Optimization with General Group Norms
  5.6 Blockwise Sparsity
  5.7 Conditional Random Fields
    5.7.1 Associative Conditional Random Fields
  5.8 Experiments
    5.8.1 Edge Potentials and Regularization Types
    5.8.2 Approximate Objectives
    5.8.3 Larger Real Data
    5.8.4 Blockwise Sparsity
    5.8.5 Conditional Random Fields
  5.9 Similar Methods
  5.10 Extensions

6 Hierarchical Log-Linear Model Structure Learning
  6.1 Optimality Conditions
  6.2 Regularization Path and Active-Set Optimization
  6.3 Constrained Formulation
  6.4 Dykstra's Algorithm
    6.4.1 Soft-Dykstra's Algorithm
  6.5 Experiments
    6.5.1 Smaller Data
    6.5.2 Larger Data
    6.5.3 Structure Estimation
  6.6 Similar Methods
  6.7 Extensions

7 Discussion

Bibliography

Appendices

A Data Structures for Checking Acyclicity
  A.1 Ancestor Matrix
  A.2 Reversal Witness Matrix

B Projection onto Norm Cones
  B.1 Scalar Norm
  B.2 $\ell_2$ Norm
  B.3 $\ell_\infty$ Norm
  B.4 $\ell_1$ Norm
  B.5 Nuclear Norm

List of Figures

2.1 Function evaluations against objective value and number of non-zero coefficients for logistic regression (λ = 1) with $\ell_1$-regularization for different optimization strategies initialized with the zero vector. Top to bottom: sido data, thrombin data, and spam data. This figure is best viewed in color.
2.2 The same experiment as Figure 2.1, but using the optimal solution for λ = 2 as the starting vector.
2.3 The same experiment as Figure 2.1, but focusing on methods that are based on L-BFGS.
2.4 The same experiment as Figure 2.3, but using the optimal solution for λ = 2 as the starting vector.
2.5 Function evaluations against objective value for training IGMs (λ = 50) with $\ell_1$-regularization for different optimization strategies. Top row: cyto data. Bottom row: awma data. Left column: zero vector used for initialization. Right column: solution with λ = 100 used for initialization. This figure is best viewed in color.
2.6 The same experiment as Figure 2.5, but focusing on methods based on an L-BFGS approximation.
3.1 Function evaluations and number of edges against objective value and number of non-zero coefficients for training a log-linear model with full potentials and group $\ell_1$-regularization for different optimization strategies initialized with the zero vector (λ = 50). The top row is for the cyto data and the bottom row is for the awma data. This figure is best viewed in color.
3.2 The same experiment as Figure 3.1, but using the optimal solution for λ = 100 as the starting vector.
3.3 Function evaluations against objective value for training IGMs (λ = 50) with $\ell_1$-regularization for different optimization strategies. Top row: cyto data. Bottom row: awma data. Left column: zero vector used for initialization. Right column: solution with λ = 100 used for initialization. This figure is best viewed in color.
4.1 The percent of edges remaining (top) and number of true edges removed (bottom) for different edge pruning strategies for seven structures from the Bayesian network repository. From left to right, the plots show the results with sample sizes of 1000, 5000, and 20000. We see that the L1MB pruning method leads to a reasonable amount of pruning while tending not to remove true edges.
4.2 The relative BIC after 10000 score evaluations in a DAG-search for different pruning strategies on the seven synthetic data sets from the Bayesian Network Structure Learning Repository. We show the BIC relative to the empty graph (top) and relative to the highest score for each data set (bottom). From left to right, the plots show the results with sample sizes of 1000, 5000, and 20000. We see that the L1MB pruning consistently achieves among the lowest scores.
4.3 The BIC against the number of score evaluations in a DAG-search for different pruning strategies with 1000 (left), 5000 (middle), and 20000 (right) samples from the alarm data set. We see that no pruning eventually leads to a good score, that the pruning strategies allow the method to explore multiple local optima, and that the L1MB algorithm achieves both of these properties.
4.4 Structural errors for the highest scoring structure after 10000 score evaluations in an interventional DAG-search for different pruning strategies on the seven synthetic data sets from the Bayesian Network Structure Learning Repository. From left to right, the plots show the results with sample sizes of 1000, 5000, and 20000. We see that the L1MB pruning leads to the fewest structural errors in almost every case.
4.5 The structural errors against the number of score evaluations in an interventional DAG-search for different pruning strategies with 1000 (left), 5000 (middle), and 20000 (right) samples from the alarm data set.
4.6 Structures estimated on the rain data set under a topological ordering. From left to right: optimal tree-structure consistent with ordering, optimal parents consistent with the ordering and SC(5) pruning, greedy parent selection given the ordering, and the L1MB algorithm constrained to be consistent with the ordering.
4.7 The regression weights for the rain data set using the L1MB algorithm for a topological ordering. We see that weights between adjacent days (first diagonal above the main diagonal) are much larger than the other weights.
4.8 The relative BIC compared to the empty graph (left) and method with highest BIC (right) after 50000 score evaluations in a DAG-search for different pruning strategies on the real data sets. The data are ordered by node size: (1) rain (28 nodes), (2) msweb (57 nodes), (3) news (100 nodes), and (4) usps (256 nodes). Note that the None method has a relative BIC of 0 on the usps data set in the left figure.
4.9 All edges with regression weight above 0.5 in the Markov blankets estimated by L1MB on the news data. Undirected edges represent cases where the directed edge was found in both directions.
4.10 All edges with regression weight above 0.5 in the model found by DAG-search with L1MB pruning on the news data.
4.11 The tree structure that maximizes the BIC on the news data.
4.12 All edges with regression weight above 1 in the Markov blankets estimated by L1MB on the usps data. Undirected edges represent cases where the directed edge was found in both directions.
4.13 All edges with regression weight above 1.5 in the model found by DAG-search with L1MB pruning on the usps data.
4.14 The optimal tree structure on the usps data.
5.1 Test set negative log-likelihood (left) and relative negative log-likelihood (right) on the cyto data using different regularization and edge potential types.
5.2 Test set negative log-likelihood (left) and relative negative log-likelihood (right) on the awma data using different regularization and edge potential types.
5.3 Test set negative log-likelihood on the cyto (left) and awma (right) data sets using different approximate objective functions.
5.4 Test set negative log-pseudo-likelihood (left) and relative negative log-pseudo-likelihood (right) on the awma5 data using different regularization and edge potential types.
5.5 Test set negative log-pseudo-likelihood (left) and relative negative log-pseudo-likelihood (right) on the traffic (top) and temperature (bottom) data using different regularization and edge potential types.
5.6 Test set negative log-pseudo-likelihood (left) and relative negative log-pseudo-likelihood (right) on the usps4 (top) and usps8 (bottom) data using different regularization and edge potential types.
5.7 Structures estimated on the rain data set with group $\ell_1$-regularization for different regularization parameter values. From left to right, λ = 256, 128, 64 (for λ = 512 the graph is disconnected).
5.8 Structure estimated on the news data set with group $\ell_1$-regularization (λ = 512, isolated nodes are not plotted).
5.9 Structure estimated on the news data set with group $\ell_1$-regularization (λ = 256, isolated nodes are not plotted).
5.10 Structure estimated on the usps data set with group $\ell_1$-regularization (λ = 4096).
5.11 Structure estimated on the usps data set with group $\ell_1$-regularization (λ = 2048).
5.12 Structure estimated on the usps data set with group $\ell_1$-regularization (λ = 1024).
5.13 Average cross-validated log-likelihood against regularization strength under different blockwise-sparse regularization schemes applied to the regularized empirical covariance for the genes data [Schmidt et al., 2009b].
5.14 Test set negative log-likelihood (left) and relative negative log-likelihood (right) on the genes data using different regularization methods.
5.15 Interquartile range of relative test-set classification accuracy for different methods of training CRFs on synthetic data using the exact objective (top-left), pseudo-likelihood approximation (top-right), Bethe approximation (bottom-left), and selected methods under different approximations (bottom-right). Note that the empty graph, corresponding to logistic regression, always had a relative accuracy of zero.
5.16 Interquartile range of relative test-set classification accuracy for different methods of training CRFs on the coronary heart disease data at the segment level (left) and heart level (right). Note that the discriminative structure learning method with group $\ell_1$-regularization with the $\ell_\infty$ norm always has a relative accuracy of one on the heart-level classification task (rightmost column).
6.1 Test set negative log-likelihood (left) and relative negative log-likelihood (right) on the cyto data using different regularization types and potential restrictions.
6.2 Test set negative log-likelihood (left) and relative negative log-likelihood (right) on the awma data using different regularization types and potential restrictions.
6.3 Test set negative pseudo-log-likelihood (left) and relative negative log-likelihood (right) on the awma5 data using different regularization types and potential restrictions.
6.4 Test set negative pseudo-log-likelihood (left) and relative negative log-likelihood (right) on the traffic data using different regularization types and potential restrictions.
6.5 Test set negative pseudo-log-likelihood (left) and relative negative log-likelihood (right) on the usps4 data using different regularization types and potential restrictions.
6.6 False positives of different orders against training set size for the first model along the regularization path where the HLLM selects a superset of the true data-generating model [Schmidt and Murphy, 2010].

List of Acronyms

• #P-hard: Non-deterministic counting polynomial-time hard.
• AS: Active set.
• BBSG: Barzilai-Borwein sub-gradient.
• BBST: Barzilai-Borwein soft-threshold.
• BFGS: Broyden-Fletcher-Goldfarb-Shanno.
• BIC: Bayesian information criterion.
• CPD: Conditional probability distribution.
• CRF: Conditional random field.
• DAG: Directed acyclic graph.
• DSST: Diagonally scaled soft-threshold.
• GGM: Gaussian graphical model.
• HLLM: Hierarchical log-linear model.
• IGM: Ising graphical model.
• L-BFGS: Limited-memory Broyden-Fletcher-Goldfarb-Shanno.
• LASSO: Least absolute shrinkage and selection operator.
• L1MB: $\ell_1$-Markov blanket.
• MMHC: Max-min hill-climbing.
• MMPC: Max-min parents and children.
• NP-hard: Non-deterministic polynomial-time hard.
• OPG: Optimal projected gradient.
• OWL: Orthant-wise learning.
• PSS: Projected scaled sub-gradient.
• PSSas: PSS active set.
• PSSgb: PSS Gafni-Bertsekas.
• PSSsp: PSS sign projection.
• PQN: Projected quasi-Newton.
• QNST: Quasi-Newton soft-threshold.
• SC: Sparse candidate.
• SCAD: Smoothly clipped absolute deviation.
• SPG: Spectral projected gradient.
• TMP: Two-metric projection.

Acknowledgements

I would first like to thank my supervisor Kevin Murphy for sharing his knowledge, pushing me to work hard and constantly try to improve my work, and giving me the freedom to explore a diverse set of projects. I'd also like to thank my other supervisory committee members Michael Friedlander and Arnaud Doucet for their help and advice, as well as my other 'unofficial' supervisors Russ Greiner, Albert Murtha, Glenn Fung, and Rómer Rosales. I would like to acknowledge my other co-authors for their help in letting me be more productive than I would have been able to on my own: Ewout van den Berg, Peter Carbonetto, Dana Cobzas, David Duvenaud, Daniel Eaton, Nando de Freitas, Emt Khan, Chi-Hoon Lee, Ilya Levner, Benjamin Marlin, Marianne Morris, Alexandru Niculescu-Mizil, Nic Schraudolph, Ian South-Dickinson, Jörg Sander, Kevin Swersky, Aline Tabet, and SVN Vishwanathan.

The Natural Sciences and Engineering Research Council of Canada and the Li Tze Fong Memorial Fellowship provided funding for part of this work, while WestGrid provided computational resources.

Finally and most importantly, I'd like to dedicate this thesis to my girlfriend Alisha, and my parents Joanne and Ken.

Chapter 1
Introduction

Graphical models [Whittaker, 1990, Lauritzen, 1996, Koller and Friedman, 2009] are used as efficient representations for probability distributions in a wide variety of applications. In many cases, the graphical structure describing the dependencies in the model is known. However, in some applications it is not clear what graphical structure should be used. Alternatively, we may want to model a data set using a graphical model but not assume a particular graphical structure a priori. In this thesis we examine the problem of estimating the parameters of a graphical model given a data set, when the graphical structure is not given.

One approach to this task is to assume a graphical model where all possible interactions are present (a dense model), and estimate the parameters of this model given the data set. An alternative approach is to try to find a sparse set of edges that optimizes a criterion assessing the quality of the structure. There are several reasons why we might prefer the sparse approach:

• Statistical efficiency: Since there are fewer parameters in the sparse model, we may be able to estimate them more effectively. For example, the number of parameters to be estimated in the dense model will grow quadratically or exponentially (depending on the particular model) in the number of variables present in the data. In contrast, the number of parameters needed by a sparse model might be much smaller.
• Computational efficiency: Due to the smaller number of parameters in a sparse structure, typically it will be much less costly to estimate the parameters. Further, performing inference tasks in the graphical model will require quadratic, cubic, or exponential time (depending on the particular model) in the dense model, while it may be possible to do these tasks more efficiently in a sparse model.
• Structural discovery: If we believe the dependencies in our data set can be accurately described within the class of graphical models we are searching over, then we might hope to find the 'true' structure that describes the dependencies in the data set. Even if the dependencies in the data set do not conform precisely to a particular graphical model, the edges discovered by a structure learning method may still be indicative of the dependencies (or independencies) present in the data. There has been substantial recent interest in this task due to applications in systems biology, such as [Sachs et al., 2005].

The disadvantage of taking the sparse approach is simply that there are an enormous number of possible structures.
For example, in the case of the directed acyclic models we describe in Section 1.3, there are a super-exponential number of possible structures, and finding the optimal structure (under various definitions of optimality) is known to be NP-hard [Chickering, 1995]. Further, it can be computationally expensive to search through the space of graph structures. For example, in undirected graphical models we must re-fit all parameters if any edge is added or removed from the graph. Since fitting all parameters is typically computationally expensive, this means that even greedy approaches that attempt to add/remove one edge at a time are extremely expensive. For these reasons, in some scenarios we might want to consider fitting a single dense model, but use regularization to address the issue of statistical efficiency, use approximations to address the issue of computational efficiency, and try to interpret our estimates of the parameters for structural discovery.

In this work, we take an approach that is intermediate between fitting a single regularized dense model, and searching for an optimal sparse model. Specifically, we consider fitting a single dense model with a penalty on the $\ell_1$-norm of the parameters. This $\ell_1$-regularization has a sparsity-inducing property [Tibshirani, 1996, Chen et al., 1998]; if the penalty on the $\ell_1$-norm is strong enough, then many of the parameters in the optimal solution will be zero. Further, we parameterize the dense graphical model such that if the parameters associated with an edge are set to zero, it is equivalent to removing the edge from the model. This allows us to learn a sparse graphical model by fitting a single dense graphical model. In addition to combining regularization and sparsity within a convex optimization framework, in Section 1.1 we discuss other appealing properties that are known about $\ell_1$-regularization.
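
The sparsity-inducing property can be illustrated with a minimal one-dimensional sketch. This example is not taken from the thesis; it assumes the toy objective $\frac{1}{2}(w - a)^2 + \lambda|w|$, whose minimizer is the standard soft-threshold of $a$, and contrasts it with a squared penalty $\lambda w^2$, whose minimizer is $a/(1 + 2\lambda)$. The function names `soft_threshold` and `squared_penalty_shrink` and the input values are hypothetical.

```python
def soft_threshold(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + lam * abs(w) over scalar w."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # any |a| <= lam is mapped exactly to zero


def squared_penalty_shrink(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + lam * w**2 over scalar w."""
    return a / (1.0 + 2.0 * lam)


# Hypothetical unpenalized estimates and penalty strength (illustration only).
unpenalized = [3.0, -0.4, 0.2, -2.5]
lam = 0.5

l1_solution = [soft_threshold(a, lam) for a in unpenalized]
sq_solution = [squared_penalty_shrink(a, lam) for a in unpenalized]

print("l1 penalty:     ", l1_solution)
print("squared penalty:", sq_solution)
```

In this toy setting the absolute-value penalty sets the two small entries exactly to zero, while the squared penalty only shrinks every entry towards zero. The multivariate problems considered in the thesis do not decouple this way in general, so this is only meant to convey the intuition.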

This idea of using $\ell_1$-regularization to learn a sparse graphical model has recently been explored by various authors, and in this chapter we review related work on this topic. However, previous work on $\ell_1$-regularization for structure learning has largely been confined to very restricted scenarios. Specifically, nearly all of the previous work makes the assumptions that:

• The graphical model is undirected.
• There is a one-to-one correspondence between parameters and edges.
• The model only includes pairwise dependencies.

In Chapters 4, 5, and 6, we examine models that do not make these assumptions. Specifically, these chapters outline methods for structure learning using $\ell_1$-regularization for the following scenarios:

• Chapter 4: Directed acyclic graphical models.
• Chapter 5: Undirected models with multi-parameter edges or edge groups.
• Chapter 6: Undirected models with higher-order dependencies.

Interspersed with our discussion of prior work, we discuss the motivations for examining these scenarios throughout the remainder of this chapter. In the latter two cases, we consider generalizations of $\ell_1$-regularization that penalize groups of variables.

In Chapter 2, we describe non-differentiable extensions of limited-memory quasi-Newton methods for solving the $\ell_1$-regularization problems arising in Chapters 4 and 5, while in Chapter 3 we describe constrained and non-differentiable limited-memory quasi-Newton methods for solving the group $\ell_1$-regularization problems arising in Chapters 5 and 6. Chapter 7 discusses some extensions of this work.

Chapters 2-6 are based on (and extend) existing work. In particular, Chapter 2 is based on [Schmidt et al., 2007a], Chapter 3 is based on [Schmidt et al., 2009b], Chapter 4 is based on [Schmidt et al., 2007b], Chapter 5 is based on [Schmidt et al., 2008], and Chapter 6 is based on [Schmidt and Murphy, 2010].

The remainder of this chapter is structured as follows.
First, in the next section we review using $\ell_1$-regularization for variable selection in regression and classification. Next, we move on to using $\ell_1$-regularization to learn dependency networks, a straightforward extension of the regression/classification methodology that allows us to visualize dependencies between variables, but that does not necessarily form a consistent probabilistic model. We then consider linearly-parameterized directed acyclic graphical models, where we learn a dependency network under a variable ordering to yield a consistent probabilistic model. Subsequently we consider $\ell_1$-regularization for structure learning in two special classes of undirected graphical models, namely Gaussian graphical models and pairwise Ising models of binary data. We then consider pairwise models of general discrete data, and higher-order log-linear models of discrete data. We then outline the data sets examined in this work, and finally we conclude the chapter with a summary of contributions.

1.1 Regression and Binary Classification

In regression, we are given a set of $n$ real-valued targets $y^i$ (for $i = 1, 2, \dots, n$), and a corresponding set of $n$ real-valued $p$-vectors that we denote by $x^i$ (for $i = 1, 2, \dots, n$). Our goal is to build a model that predicts $y^i$ given the corresponding $p$-vector $x^i$. Binary classification is similar, except that each $y^i$ can only take values in the discrete set $\{-1, +1\}$.

The most common regression method is the linear least-squares model [see Bishop, 2006, §3.1.1], where we assume that $y^i$ is a linear function of $x^i$ (and a bias term $b$), and we fit the parameters $\{w, b\}$ of the model by minimizing the least-squares objective

\[
\min_{w,b} \; \frac{1}{2} \sum_{i=1}^{n} (y^i - w^T x^i - b)^2 .
\]

The least-squares estimator can also be viewed as a maximum likelihood estimator, under the assumption that each $y^i$ follows a Gaussian distribution with mean $w^T x^i + b$ and a positive variance $\sigma^2$ (the exact value of $\sigma$ does not affect the optimal values of $\{w, b\}$).
Formally,
$$p(y^i \mid x^i, w, b) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y^i - w^T x^i - b)^2}{2\sigma^2}\right).$$
We arrive at the least-squares objective if we consider minimizing the negative logarithm of the likelihood, $-\sum_{i=1}^{n} \log p(y^i \mid x^i, w, b)$, under this model with $\sigma$ set to 1 and ignoring constant terms (the objective has the same minimizers for any other positive $\sigma$).

The most common binary classification method is logistic regression, where we assume that the logarithm of the odds of $y^i$ taking on $+1$ (instead of $-1$) is the linear function $w^T x^i + b$ [see Bishop, 2006, §4.3.2]. This implies that we assume $y^i$ follows a logistic distribution with location $w^T x^i + b$ and scale 1:
$$p(y^i \mid x^i, w, b) = \frac{1}{1 + \exp(-y^i (w^T x^i + b))}.$$
Maximum likelihood estimation in this model is typically carried out by minimizing the negative log-likelihood,
$$\min_{w,b} \; \sum_{i=1}^{n} \log\left(1 + \exp(-y^i (w^T x^i + b))\right).$$
Unlike the least-squares objective, in general there will not be a closed-form solution for the parameters in the logistic regression model. However, we can obtain accurate numerical maximum likelihood estimates by minimizing the (differentiable, unconstrained, and convex) negative log-likelihood.

Nevertheless, for both of these models there are several reasons why we might not want to use a maximum likelihood estimate of the parameters:
• The maximum likelihood estimate tends to have all coefficients $w_i$ non-zero, even though it may be the case that some variables are irrelevant for prediction. If a variable is irrelevant for predicting the target, then its coefficient should be set to zero to nullify its effect on the prediction (and yield a more interpretable model).
• The maximum likelihood estimator may over-fit. That is, the average likelihood of the data used for estimation might be much higher than the average likelihood for data that was not used during estimation.
This can arise if we do not have a sufficiently large sample size $n$ (relative to the number of features $p$ and their complexity), because in this case the maximum likelihood estimate of the parameters may have a high variance (the parameters can change substantially with small changes in the data). In the language of numerical computing, we say that estimating the parameters can be ill-posed.

Subset selection methods are a common strategy for addressing the first issue. That is, we search over the possible non-zero subsets of the coefficients, and choose the subset that optimizes some criterion judging the worthiness of the subset. Several heuristic strategies for doing the search exist, such as forward and backward selection, but the general problem of choosing the best subset under most optimization criteria is known to be NP-hard [Huo and Ni, 2007]. Further, even if we were given the optimal subset, this may not address the second issue with maximum likelihood estimation.

The most common method used to address the second issue is $\ell_2$-regularization of the coefficients, known as Tikhonov regularization or ridge (logistic) regression [see Bishop, 2006, §3.1.4]. In ridge (logistic) regression, we optimize the negative log-likelihood subject to a penalty (with scale $\lambda > 0$) on the (squared) $\ell_2$-norm of the regression coefficients $w$:
$$\min_{w,b} \; \sum_{i=1}^{n} -\log p(y^i \mid x^i, w, b) + \lambda \|w\|_2^2.$$
If we interpret the $\ell_2$-regularization term as the negative logarithm of a prior, then we see that finding the ridge (logistic) regression parameters is equivalent to finding the parameters that maximize the posterior distribution, which is proportional to $\prod_{i=1}^{n} p(y^i \mid x^i, w, b)\, p(w, b)$, with a prior for the parameters $p(w, b)$ that factorizes into an independent zero-mean Gaussian distribution for each element $w_i$, and an (improper) uniform distribution for $b$. The effect of this prior is to decrease the variance of the estimator, by adding a bias towards zero in the estimation of the coefficients.
However, as with the maximum likelihood estimate, the $\ell_2$-regularized estimate tends to have all coefficients $w_i$ non-zero.

In $\ell_1$-regularized least-squares (or logistic regression), we minimize the negative log-likelihood subject to a penalty on the $\ell_1$-norm of the coefficients:
$$\min_{w,b} \; \sum_{i=1}^{n} -\log p(y^i \mid x^i, w, b) + \lambda \|w\|_1. \tag{1.1}$$
This type of regularization has been popularized under the name basis pursuit denoising [Chen et al., 1998] for the least-squares loss, and least absolute shrinkage and selection operator (LASSO) for the least-squares and logistic regression losses [Tibshirani, 1996]. Prior to these works, $\ell_1$-regularization had also been explored for the least-squares loss [Santosa and Symes, 1986] and least absolute error loss [Claerbout and Muir, 1973]. As opposed to problem (1.1), Tibshirani [1996] proposed using an explicit bound $\tau$ on the $\ell_1$-norm of the parameters, leading to the problem
$$\min_{w,b} \; \sum_{i=1}^{n} -\log p(y^i \mid x^i, w, b) \quad \text{s.t.} \quad \|w\|_1 \le \tau. \tag{1.2}$$
Problems (1.1) and (1.2) are very closely related; solving problem (1.1) is equivalent to minimizing the Lagrangian of (1.2) with a fixed Lagrange multiplier $\lambda$, and for any value of $\tau$ we can find a corresponding value of $\lambda$ that gives the same solution. We focus on (1.1) since $\lambda$ has the intuitive interpretation as the strength of a Laplace prior on the parameters.

In contrast to subset selection and $\ell_2$-regularization, $\ell_1$-regularization simultaneously achieves subset selection (by setting parameters $w_i$ to 0 for sufficiently large $\lambda$) and regularization (by adding a bias towards zero in the estimation of the coefficients). Further, under suitable conditions and an appropriate choice of $\lambda$, $\ell_1$-regularization will choose the correct subset of non-zero variables [for example, see Zhao and Yu, 2006].
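To make the sparsity effect of (1.1) concrete, the following is a minimal sketch for the least-squares loss, under two simplifying assumptions that are not part of the text: the bias $b$ is omitted, and the problem is solved with a plain proximal-gradient (iterative soft-thresholding) loop rather than the limited-memory quasi-Newton methods of Chapter 2. The names `soft_threshold` and `lasso_ista` are hypothetical.

```python
def soft_threshold(z, t):
    """Proximal operator of t*|.|: shrink z toward zero by t, clipping at zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_ista(X, y, lam, step, iters=500):
    """Minimize 0.5 * sum_i (y_i - w^T x_i)^2 + lam * ||w||_1 (no bias term).

    X is a list of n rows (each a list of p floats), y a list of n floats.
    `step` should be at most 1 / (largest eigenvalue of X^T X).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iters):
        # Residuals of the current linear predictions.
        resid = [sum(w[j] * X[i][j] for j in range(p)) - y[i] for i in range(n)]
        # Gradient of the smooth least-squares term, X^T (Xw - y).
        grad = [sum(resid[i] * X[i][j] for i in range(n)) for j in range(p)]
        # Gradient step followed by soft-thresholding, which yields exact zeros.
        w = [soft_threshold(w[j] - step * grad[j], step * lam) for j in range(p)]
    return w


# Two orthogonal features; the second has only a weak relationship to y.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
w = lasso_ista(X, y, lam=1.0, step=0.5)
```

On this toy data the unregularized least-squares coefficients are (2, 0.1). The penalty shrinks the first coefficient and sets the second exactly to zero, whereas an $\ell_2$ penalty would shrink both without zeroing either.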
Even if this structural discovery task is not the goal, $\ell_1$-regularization is often still effective at building a regressor (or classifier) that predicts well on new examples, even if irrelevant features are present in the data. For example, Ng [2004] shows that $\ell_1$-regularized logistic regression has an asymptotic sample complexity function that grows with the logarithm of the number of irrelevant features$^{1}$. This means that $\ell_1$-regularization can produce near-optimal models even if there are an exponential number of irrelevant features, in contrast to the linear sample complexity of $\ell_2$-regularized logistic regression, which would require an exponential number of samples to produce near-optimal models if there are an exponential number of irrelevant features. The generalization performance of logistic regression with $\ell_1$-regularization is examined in [Krishnapuram et al., 2005], who prove non-trivial bounds on the generalization performance (i.e. bounds on the error obtained on data not seen during training).

When using $\ell_1$-regularization, it is important to select an appropriate value of the hyper-parameter $\lambda$. There are a wide variety of criteria available to do this, but in this work we focus on two. The first criterion we consider is validation set likelihood, a score that tries to assess how effective the estimator is at modeling new instances. To compute the validation set likelihood for a fixed value of $\lambda$, we
1. randomly choose half of our data set;
2. compute the $\ell_1$-regularized estimator on this half of the data set;
3. compute the likelihood of the other half of the data set with the estimated parameters.
The validation set likelihood gives us a criterion for assessing how well the estimator for a particular value of $\lambda$ models $p(y^i \mid x^i, w, b)$ for new instances $\{x^i, y^i\}$. We can choose a good value of $\lambda$ by searching for a value that maximizes this validation score (where we use the same random half of the training data for each value of $\lambda$).
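The three steps above can be sketched as follows. To keep the example short, it uses a deliberately tiny stand-in model that is an assumption of this sketch and not a model from the text: scalar observations $y \sim N(w, 1)$ with an $\ell_1$ penalty on the single parameter $w$, for which the regularized estimate has a closed form. All function names are hypothetical.

```python
import math
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t, clipping at zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_l1_mean(train, lam):
    """Closed-form minimizer of 0.5 * sum_i (y_i - w)^2 + lam * |w| over scalar w."""
    n = len(train)
    return soft_threshold(sum(train) / n, lam / n)


def avg_log_lik(w, data):
    """Average log-likelihood of the data under N(w, 1)."""
    const = 0.5 * math.log(2.0 * math.pi)
    return sum(-0.5 * (y - w) ** 2 - const for y in data) / len(data)


def select_lambda(data, lambdas, seed=0):
    """Pick the lambda with the highest validation set likelihood."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)  # step 1: one random split, reused for every lambda
    half = len(idx) // 2
    train = [data[i] for i in idx[:half]]
    valid = [data[i] for i in idx[half:]]
    scores = {}
    for lam in lambdas:
        w = fit_l1_mean(train, lam)          # step 2: fit on the first half
        scores[lam] = avg_log_lik(w, valid)  # step 3: score the held-out half
    best = max(scores, key=scores.get)
    return best, scores


best, scores = select_lambda([1.0] * 10, [0.0, 2.5, 100.0])
```

Fixing the seed keeps the split identical across candidate values of $\lambda$, so differences in the score reflect the regularization strength rather than the partition.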
In cases where the number $n$ of training examples is small, a variation on the validation score is the cross-validation score [Bishop, 2006, §1.3], where we train on different subsets of the data. The advantage of this is that it makes greater use of the available data, but the disadvantages are that it is slower and it no longer represents an independent estimate of the generalization performance.

The validation score is used for many of our experiments, as it is typically accurate in assessing prediction error (assuming sufficient data is available to provide reliable estimates using half of the training data). However, as discussed in [Meinshausen and Bühlmann, 2006], the optimal parameters under the prediction-optimal value of $\lambda$ will in general have too many non-zero variables. Because of this, we may want to consider a different criterion when the goal is structural discovery.

$^{1}$ The sample complexity function is a two-parameter function of $(\epsilon, \delta)$, defined as the minimum number of training examples such that we can be within $\epsilon$ of the optimal predictor with probability at least $1 - \delta$.

When trying to do structural discovery, in some cases we will consider the Bayesian information criterion (BIC) [Schwarz, 1978],
$$\text{BIC}(y^i, x^i, \hat{w}, \hat{b}) \triangleq -\sum_{i=1}^{n} \log p(y^i \mid x^i, \hat{w}, \hat{b}) + (d/2) \log n.$$
Here, $d$ is the number of free parameters in the model (i.e. the number of non-zero elements of $w$, plus one for the bias), while $\hat{w}$ and $\hat{b}$ are the maximum likelihood estimates for the set of non-zero coefficients. That is, the BIC simultaneously tries to maximize the model fit of the training data while minimizing the number of free parameters used to do this. Schwarz [1978] derives this criterion as a large-sample approximation to the marginal likelihood of the data$^{2}$. It can also be viewed as a large-sample approximation to a minimum description length criterion [Rissanen, 1978].
In particular, if we wish to compress the data set and model, optimizing the BIC approximates the optimal level of compression (as the size of the data set increases) [see Hastie et al., 2009, §7.8]. For exponential family models the BIC has appealing asymptotic consistency properties in terms of variable selection; if we compute the BIC on a set of models that includes the true model, the true model will achieve the lowest value as the size of the data set increases [Schwarz, 1978]. Further, Haughton [1988] shows that optimizing this criterion will choose the correct set of variables with probability tending to one as the size of the data set increases. The BIC can also be viewed from the perspective of regularization, in that it is equivalent to regularization by the $\ell_0$ pseudo-norm $\|w\|_0$ (the number of non-zero coefficients, which determines $d$), where the regularization strength is chosen according to the size of the data set.

A complicating factor with using the BIC for selecting $\lambda$ for $\ell_1$-regularization is that the criterion is traditionally defined for the maximum likelihood estimate. Therefore, when we use BIC for model selection, we use $\ell_1$-regularization as a filter; the $\ell_1$-regularization is only used to select the set of non-zero variables, and we subsequently compute the maximum likelihood estimate of the coefficients when evaluating the BIC$^{3}$.

1.2 Dependency Networks

Now consider the unsupervised case where we are given a set of $n$ real-valued $p$-vectors $x^i$ (for $i = 1, 2, \dots, n$) and no distinguished response variables $y^i$, and we want to build a graph that visualizes the direct dependencies between the variables. One way to do this is, for each variable $j$, to make variable $j$ the target and compute the optimal parameters in a linear regression model $x^i_j = w_j^T x^i_{-j} + b_j$ (where we use $-j$ to denote all variables except $j$). This linear regression can be fit using the methods we describe in the previous section, and the sets of variables selected are used to draw a graph that visualizes dependencies in the data.
Specifically, we draw these dependencies as a directed graph with $p$ nodes (one for each variable), where the graph contains an edge going into each node from each of the variables that was selected when regressing on the node. The model resulting from doing this conditional regression (or classification) of each variable given all others is known as a dependency network [Heckerman et al., 2001]. Using $\ell_1$-regularized least-squares to learn the structure of a dependency network on continuous variables was examined in [Meinshausen and Bühlmann, 2006]$^{4}$. Meinshausen and Bühlmann [2006] outline conditions under which this procedure is consistent in terms of variable selection in Gaussian graphical models (that we review in Section 1.4), allowing the number of variables and density of the graph to increase as a function of the sample size. Analogously, Wainwright et al. [2006] proposed using $\ell_1$-regularized logistic regression to learn the structure of a dependency network on binary variables, and examine consistency in terms of variable selection for Ising graphical models of binary data (that we also review in Section 1.4).

$^{2}$ In the case of regression with a linear least-squares loss and $\ell_2$-regularization, it is possible to compute the marginal likelihood of the training data in closed form, but this is not possible in most scenarios.
$^{3}$ We discuss work on using $\ell_1$-regularized estimates with the BIC at the end of Chapter 4.

While these approaches lead to a graph structure that may be useful in terms of visualization or structural discovery, they can be problematic as a probabilistic model of the $p$-vectors because, given finite data, the dependency network estimated in this way will typically be inconsistent. For example, Heckerman et al. [2001] give the simple case where the dependency network predicts that $x_1$ depends on $x_2$ in $p(x_1 \mid x_2)$ but that $x_2$ does not depend on $x_1$ in $p(x_2 \mid x_1)$.
These inconsistencies can lead to cases where there may be no joint distribution over the variables that is consistent with the estimated conditional distributions. The set of consistent dependency networks is equivalent to the set of undirected graphical models [Heckerman et al., 2001], and these dependency network methods can be viewed as pseudo-likelihood approximations [Besag, 1975] of the corresponding undirected graphical models. To deal with potential structural asymmetry, Meinshausen and Bühlmann [2006] and Wainwright et al. [2006] consider two heuristics to turn the directed graph into an undirected graph. Their first strategy includes the undirected edge if either corresponding directed edge was found, while the second strategy only includes the undirected edge if both directed edges were found. This still leaves the problem that we have two versions of the parameter associated with each edge, though [Höfling and Tibshirani, 2009] give two related heuristics for obtaining a single parameter.

Given that we can learn dependency networks with existing methods for regression and classification, it is useful at this point to discuss why we might want to use more complicated models that define (consistent) joint distributions over the $p$-vectors. Several of the tasks we can consider doing with a joint distribution (that can't be accomplished in general with inconsistent dependency networks) include:
• Compute joint probabilities: Given a new $p$-vector $x$, we can try to assess its probability $p(x_1, x_2, \dots, x_p)$ under the model. Similarly, we can check which of two $p$-vectors has a higher probability, search for the $p$-vector with highest probability (decoding), or test whether a $p$-vector has a very low probability (i.e. outlier detection).
• Compute marginals and conditionals: Given a distribution over $x$, $p(x_1, x_2, \dots, x_p)$, we can consider calculating marginal probabilities like $p(x_i)$, or conditional probabilities like $p(x_i \mid x_j)$, using the rules of marginalization and conditional probability.
• Generate samples: We can try to generate new $p$-vectors according to our estimate of the joint distribution. This can be useful for model assessment. We can also consider generating conditional samples from the distribution given the values of some of the variables (i.e. filling in missing values).

$^{4}$ Gustafsson et al. [2003] present a closely related approach for estimating time-series dependencies.

While we can also perform these tasks by using heuristic methods to construct a consistent dependency network from the (typically inconsistent) result of learning a dependency network, pseudo-likelihood approximations are known to be inefficient estimators compared to using the likelihood of the probabilistic model explicitly [Besag, 1977, Liang and Jordan, 2008]$^{5}$.

1.3 Directed Acyclic Graphical Models

As we see in the next section, the prior work on structure learning in probabilistic graphical models with $\ell_1$-regularization largely focuses on pairwise undirected models. However, there are many reasons why we might prefer to use directed acyclic graph (DAG) models:
• Efficiency of computing joint probabilities, samples, and (approximate) marginals: As we have just mentioned, these types of operations are the main reasons for building a consistent model of the joint distribution. However, for discrete data these operations are intractable in general for pairwise undirected graphical models. In contrast, some of these operations can be done in polynomial time in analogous DAG models. This includes computing the probability of a vector and generating unbiased samples from the model. The latter can be used in Monte Carlo methods to efficiently approximate marginals in the model.
Provided we condition on the first variables in an ordering, we can also efficiently compute the conditional probability of the remaining variables and generate unbiased conditional samples of the remaining variables (the latter can be used to efficiently approximate the corresponding conditionals).
• Parameter independence: The likelihood in DAG models factorizes into a product of single-variable conditional distributions. Thus, unlike undirected graphical models where estimating single-variable conditional distributions is used as an approximation, we can find the optimal parameters in the joint likelihood of DAG models by fitting the parameters of a set of single-variable conditional distributions. Further, DAGs allow us to mix different types of variables in a straightforward way. For example, we can model the joint likelihood of vectors containing both real-valued and binary-valued variables. Because parameter independence allows parameter estimation to separate into independent sub-problems, it also allows us to independently tune an individual regularization parameter $\lambda_i$ in estimating the conditional of each variable $i$, and allows us to use caching of results to implement efficient local search methods for structure learning.

$^{5}$ Here, the efficiency of a consistent estimator is defined as its asymptotic variance around the true parameter in terms of the number of training samples.

DAG models, also known as Bayesian networks, are one way to model the joint distribution $p(x_1, x_2, \dots, x_p)$ of a set of $p$ random variables. If we repeatedly use the definition of conditional probability, $p(x, y) = p(y \mid x)p(x)$, in the order $p$ down to 1, then we obtain the factorization of the joint distribution
$$p(x_1, \dots, x_p) = \prod_{i=1}^{p} p(x_i \mid x_{1:i-1}).$$
This factorization of the joint distribution is valid for any probability distribution. In DAG models, we make the additional conditional independence assumption that
$$p(x_i \mid x_{1:i-1}) = p(x_i \mid x_{\pi(i)}), \tag{1.3}$$
for some $\pi(i) \subseteq \{1, \dots, i-1\}$. The elements of $\pi(i)$ are called the 'parents' of variable $i$ (the 'child'), while the terms $p(x_i \mid x_{\pi(i)})$ are referred to as the conditional probability distributions (CPDs) of the DAG model. We can visualize the conditional independence properties implied by the variable ordering and the choices of $\pi(i)$ as a directed graph, where we draw a directed edge coming into each node from each of its parents. Since the order $1, \dots, p$ will constitute a topological ordering of the graph (that is, an ordering where parents come before children), the graph is necessarily acyclic. In addition to the conditional independence properties directly encoded in the use of (1.3), the method of d-separation allows us to use the graph structure to test whether any other conditional independence statement is implied by the factorization [see Koller and Friedman, 2009, §3.3]. Note that it is possible for two different graphs to imply the same set of conditional independence statements about the distribution. In this case, we say that the graphs are Markov equivalent.

In this work, we focus on modeling binary data using logistic regression for the CPDs:
$$p(x_i \mid x_{\pi(i)}, w_i, b_i) = \frac{1}{1 + \exp(-x_i (w_i^T x_{\pi(i)} + b_i))}.$$
This is sometimes referred to as a sigmoid belief network [see, for example, Saul et al., 1996], and we note that these are directed analogues to the Ising graphical models we discuss in the next section$^{6}$. While sigmoid belief networks do not have the expressive power of the tabular CPDs traditionally used in DAG models [see Koller and Friedman, 2009, §5.1], their smaller number of parameters allows more efficient estimation of the parameters from data. Specifically, there are a linear number of parameters in the CPDs of a sigmoid belief net (in terms of the number of parents), rather than the exponential number associated with tabular CPDs. This allows us to fit DAG models where some nodes have a potentially large number of parents.
Using these CPDs, the dominant cost of evaluating the joint probability $p(x)$ of a vector $x$ is the calculation of the $p$ inner products $w_i^T x_{\pi(i)}$. In the worst case (a fully connected graph) this will require $O(p^2)$ time, in contrast to the #P-hard problem of evaluating the probability of an observed $p$-vector in a general binary undirected model. Similarly, we can generate an independent sample from the distribution in $O(p^2)$ time (and a set of independent samples can be used to approximate any marginal). Further, if we use $|E|$ to denote the number of edges in the graph, then the cost of each of these operations is $O(p + |E|)$. This means that if there are more than $p$ edges, then the cost of performing these operations grows in direct proportion to the number of edges, so sparser graph structures are proportionally cheaper to use.

$^{6}$ If we instead use Gaussian CPDs, we obtain a directed model that is analogous to the Gaussian graphical models we discuss in the next section.

In DAG models, the negative log-likelihood function for a set of $n$ realizations of $p$-vectors $x^i$ is given by
$$-\sum_{i=1}^{n} \sum_{j=1}^{p} \log p(x^i_j \mid x^i_{\pi(j)}, w_j, b_j).$$
This objective function is separable with respect to the parameters of the different CPDs. If the regularizer separates in the same way then we satisfy the parameter independence condition [Heckerman et al., 1995]. This means that we can optimize the parameters of each CPD independently. Thus, parameter estimation in this model is similar to the parameter estimation procedure used in dependency networks with logistic regression conditionals, except that each regression is done on the subset of nodes earlier in the ordering, and optimizing these independent logistic regressions directly optimizes the joint likelihood of a consistent probabilistic model. That is, by placing a constraint on the variable ordering we guarantee that the parameters yield a consistent probabilistic model.
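As a minimal sketch of the two operations whose cost is discussed above, the following evaluates the joint log-probability and draws an ancestral sample in a sigmoid belief network with variables in $\{-1, +1\}$. The data layout is an assumption made for illustration: variables are indexed in topological order, `parents[i]` lists the parent indices of variable `i`, and `weights[i]` holds the matching coefficients. The function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def log_prob(x, parents, weights, bias):
    """Joint log-probability of x in {-1,+1}^p: a sum of p logistic CPD terms.

    Each node touches only its own parents, so the total cost is O(p + |E|).
    """
    total = 0.0
    for i, xi in enumerate(x):
        z = bias[i] + sum(w * x[j] for j, w in zip(parents[i], weights[i]))
        total += math.log(sigmoid(xi * z))
    return total


def sample(parents, weights, bias, rng):
    """Ancestral sampling: draw each variable given its already-sampled parents."""
    x = []
    for i in range(len(bias)):
        z = bias[i] + sum(w * x[j] for j, w in zip(parents[i], weights[i]))
        x.append(1 if rng.random() < sigmoid(z) else -1)
    return x


# A three-node network in which node 2 has both earlier nodes as parents.
parents = [[], [0], [0, 1]]
weights = [[], [1.5], [-0.7, 0.4]]
bias = [0.2, -0.1, 0.3]
draw = sample(parents, weights, bias, random.Random(0))
```

Because every CPD is locally normalized, the probabilities of all $2^p$ configurations sum to one without any global normalizing constant, which is what makes both operations cheap.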
If we are given the variable ordering, then estimating the structure of a DAG model reduces to the problem of independently performing variable selection to select the parents of each node. Thus, we can learn sigmoid belief networks using $\ell_1$-regularization by solving a series of independent $\ell_1$-regularized logistic regression problems. Most previous work on structure learning in DAG models with $\ell_1$-regularization has considered the case of a known ordering [Li and Yang, 2005, Huang et al., 2006, Levina et al., 2008]$^{7}$. However, in general we do not have a topological ordering available, and sub-optimal orderings may lead to models that are much more dense than the optimal ordering. Thus, in Chapter 4 we consider a method that uses $\ell_1$-regularization for structure learning in sigmoid belief networks that does not assume a known topological ordering of the variables. The challenge associated with this problem is that if we relax the constraints imposed by the ordering (corresponding to fitting a dependency network), this typically leads to a structure violating the acyclicity constraint. Thus, our method uses a two-phase approach: in the first phase the method learns a dependency network using $\ell_1$-regularization to obtain a set of candidate edges, then in the second phase it uses local search in the space of DAGs restricted to these candidates. Although various methods have been proposed for restricting the set of candidate edges in DAGs, an important aspect of the new algorithm is that the same criterion is used for variable selection in both phases. Although we state and evaluate the method for the case of sigmoid belief networks, it can trivially be applied to the case of Gaussian CPDs or other types of linearly-parameterized CPDs.
$^{7}$ These works use Gaussian CPDs instead of sigmoid CPDs.

1.4 Gaussian and Ising Graphical Models

We now turn to the case of fitting two special types of pairwise undirected graphical models, and consider learning a sparse graph structure by using $\ell_1$-regularization of the parameters corresponding to the edges in the graph. The advantage of working with undirected models is that we have no acyclicity constraint and thus we can use $\ell_1$-regularization directly to estimate a sparse structure. However, the disadvantage of undirected models is that the log-likelihood does not separate into a set of independent problems, and this makes parameter estimation much more expensive.

In pairwise undirected graphical models, we model the joint distribution $p(x_1, x_2, \dots, x_p)$ of a set of $p$ random variables $x$ as a globally normalized product of non-negative unary potentials $\phi_i(x_i)$ and non-negative pairwise potentials $\phi_{ij}(x_i, x_j)$:
$$p(x) \triangleq \frac{1}{Z} \prod_{i=1}^{p} \phi_i(x_i) \prod_{(i,j) \in E} \phi_{ij}(x_i, x_j). \tag{1.4}$$
The normalizing constant $Z$ is defined as the constant such that the distribution sums (or integrates) to one over all possible assignments to $x$. The set $E$ contains the pairs of variables for which we include a pairwise potential. Typically, we will only include pairwise potentials for a relevant subset of the possible pairs of variables (but unlike DAG models that must be acyclic, we do not need to enforce any constraints on the set of edges in undirected models). If the potential $\phi_{ij}(x_i, x_j)$ is included in the model, we say that the two nodes $i$ and $j$ are neighbors, and that in this case $j$ is in the Markov blanket of node $i$ (and vice versa). A local Markov property follows from the pairwise factorization (1.4) and the choice of pairwise potentials to include, namely that for node $i$ with Markov blanket $MB(i)$ we have the conditional independence property $p(x_i \mid x_{-i}) = p(x_i \mid x_{MB(i)})$.
Further, we can visualize all of the conditional independence properties implied by the factorization as an undirected graph, where each variable corresponds to a node in the graph and we place an undirected edge between each pair of neighbors. Because of this we refer to unary potentials $\phi_i(x_i)$ as node potentials, and pairwise potentials $\phi_{ij}(x_i, x_j)$ as edge potentials. We can then test whether the factorization implies that two sets of variables are conditionally independent (given a third set of variables) by testing whether the conditioning set separates the two sets in this graph. For more details on the independence properties of undirected graphical models, we refer to [Koller and Friedman, 2009, §4.3].

Structure learning in pairwise undirected graphical models is the task of selecting the pairs of variables to include as neighbors/edges in $E$. However, note that if a potential function $\phi_{ij}(x_i, x_j)$ takes the value 1 for all values of $x_i$ and $x_j$, it is equivalent to removing the potential from the model$^{8}$. Thus, if we parameterize our pairwise potentials such that zeros in the parameterization make $\phi_{ij}(x_i, x_j)$ take the value 1 for all $x_i$ and $x_j$, then we can use $\ell_1$-regularization of the fully-connected model to encourage a sparse graph structure.

This idea was first explored for Gaussian graphical models (GGMs). In GGMs, we model the joint distribution $p(x_1, x_2, \dots, x_p)$ of a set of $p$ continuous random variables as a multivariate Gaussian with mean $b$ and precision (inverse-covariance) matrix $W$:
$$p(x_1, x_2, \dots, x_p) \triangleq \frac{1}{Z} \exp\left(-\frac{1}{2} (x - b)^T W (x - b)\right). \tag{1.5}$$
In this case the normalizing constant $Z$ is
$$Z \triangleq (2\pi)^{p/2} |W^{-1}|^{1/2}.$$
This normalization constant can be computed in $O(p^3)$ time using a Cholesky factorization of $W$. If we expand out the quadratic form in (1.5) we see that GGMs are a special case of (1.4), and hence they are pairwise undirected graphical models.
In particular, an edge is present between two variables $i$ and $j$ in a Gaussian graphical model if and only if the corresponding element $W_{ij}$ of the precision matrix $W$ is non-zero. This type of model was introduced in [Dempster, 1972], where it was referred to as covariance selection. Because setting elements of the precision matrix to zero corresponds to removing edges from the graph, we can consider simultaneously estimating the parameters and a sparse structure in GGMs by minimizing the negative log-likelihood subject to $\ell_1$-regularization of the elements of the precision matrix. This is referred to as the graphical LASSO by [Friedman et al., 2008], and it has been proposed by numerous authors [Dahl et al., 2005, Banerjee et al., 2006, Yuan and Lin, 2007].

$^{8}$ Because of the global normalization, setting the potential to a constant value $c$ for any choice of $c > 0$ is also equivalent to removing the potential from the model.

In the graphical LASSO we set $b$ to the sample mean of the training data, and then compute the precision matrix by optimizing the negative log-likelihood with $\ell_1$-regularization of the precision matrix elements. This latter problem can be written as the convex optimization problem
$$\min_{W \succ 0} \; -\log\det W + \operatorname{tr}(\hat{\Sigma} W) + \lambda \|W\|_1. \tag{1.6}$$
In the above, $\|W\|_1$ refers to the entry-wise $\ell_1$-norm of $W$ and $\hat{\Sigma}$ refers to the sample covariance matrix, $\hat{\Sigma} \triangleq (1/n) \sum_{i=1}^{n} (x^i - b)(x^i - b)^T$. The positive-definite constraint $W \succ 0$ is required to ensure that the solution is a valid distribution. Although this constraint may appear problematic since the positive-definite cone is an open set, the log-determinant term in the objective function acts as a log-barrier that ensures the solution is an interior point of this set$^{9}$. Due to the appealing notion of combining regularization and sparsity within classical covariance selection methods, numerous authors have explored solution methods and applications of this model.
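As an illustration of how the objective in (1.6) can be evaluated for a candidate precision matrix, the sketch below uses a Cholesky factorization both to obtain the log-determinant and to detect a violation of the positive-definite constraint. It only evaluates the objective and does not solve the optimization problem; the function names and the list-of-lists matrix representation are assumptions of this sketch.

```python
import math


def cholesky(A):
    """Lower-triangular L with A = L L^T; raises ValueError if A is not positive-definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive-definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def glasso_objective(W, S, lam):
    """Evaluate -log det W + tr(S W) + lam * ||W||_1 with an entry-wise l1 norm."""
    p = len(W)
    L = cholesky(W)
    # log det W = 2 * sum_i log L_ii, since det W = (prod_i L_ii)^2.
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(p))
    trace = sum(S[i][j] * W[j][i] for i in range(p) for j in range(p))
    l1 = sum(abs(W[i][j]) for i in range(p) for j in range(p))
    return -log_det + trace + lam * l1


identity = [[1.0, 0.0], [0.0, 1.0]]
value = glasso_objective(identity, identity, lam=0.5)
```

Raising an error when the factorization fails is one simple way for an iterative solver to reject a step that leaves the positive-definite cone; the barrier behaviour of the log-determinant term keeps a well-behaved solver away from that boundary.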
A subset of the extensive work on the graphical LASSO is [Dahl et al., 2005, Banerjee et al., 2006, Yuan and Lin, 2007, Friedman et al., 2008, d’Aspremont et al., 2008, Duchi et al., 2008a, Krishnamurthy and d’Aspremont, 2009, Lu, 2009, 2010, Yuan, 2009]. Further, as we discuss in the previous section, the dependency network methods of [Meinshausen and Bühlmann, 2006] represent pseudo-likelihood approximations to the graphical LASSO model.

We can also consider applying $\ell_1$-regularization to the joint distribution in undirected graphical models over discrete variables; the first works to explore this were [Lee et al., 2006b, Wainwright et al., 2006, Dahinden et al., 2007]. Most of the work in this vein has considered the special case of pairwise undirected graphical models of binary data with Ising potentials. We refer to these models as Ising graphical models (IGMs). In IGMs, we write the joint distribution $p(x_1, x_2, \dots, x_p)$ of a set of $p$ binary random variables as

$$p(x_1, x_2, \dots, x_p) \triangleq \frac{1}{Z} \exp\Big( \sum_{i=1}^{p} x_i b_i + \sum_{(i,j) \in E} x_i x_j w_{ij} \Big), \tag{1.7}$$

where in this case the normalizing constant $Z$ is

$$Z \triangleq \sum_{x} \exp\Big( \sum_{i=1}^{p} x_i b_i + \sum_{(i,j) \in E} x_i x_j w_{ij} \Big).$$

Some authors use a $\{0, 1\}$ representation of the binary variables in (1.7) [Wainwright et al., 2006, Höfling and Tibshirani, 2009], while other authors use a $\{-1, 1\}$ representation [Banerjee et al., 2008, Kolar and Xing, 2008]^10. Clearly, (1.7) is a pairwise undirected graphical model, and setting the edge parameter $w_{ij}$ to zero is equivalent to removing the pairwise potential from the model. Thus we can consider minimizing the negative log-likelihood with $\ell_1$-regularization of the edge parameters to learn a regularized sparse structure. Specifically, we solve

$$\min_{W,b} \; \sum_{m=1}^{n} \Big[ \sum_{i=1}^{p} \Big[ -x_i^m b_i - \sum_{j=i+1}^{p} x_i^m x_j^m w_{ij} \Big] \Big] + n \log Z(W, b) + \lambda \sum_{i=1}^{p} \sum_{j=i+1}^{p} |w_{ij}|. \tag{1.8}$$

We note that the first term in this expression is linear while the second is convex [Boyd and Vandenberghe, 2004, §3.1], so this is a convex optimization problem.

^9 That the solution is an interior point of the constraint set is appealing from the perspective of optimization, since it means that we do not necessarily have to use constrained optimization methods to solve (1.6).

^10 Both representations can model arbitrary positive pairwise distributions over binary data, but these two parameterizations do not necessarily lead to equivalent models if we regularize the parameters.

Unfortunately, solving this optimization problem is more complicated than the GGM case, because of the combinatorial nature of the normalizing constant $Z$. The complexity of computing $Z$ and related quantities is discussed in (for example) [Koller and Friedman, 2009, §9-10]. In particular, it is #P-hard to evaluate $Z$. Hardness results also apply to other operations involving discrete undirected models, such as computing the most likely configuration (NP-hard in general) and computing marginal or conditional probabilities (#P-hard in general). In some practical cases of interest, it is possible to efficiently solve these problems. For example, variants of the belief-propagation message-passing algorithm can solve these problems in $O(p)$ when the graph is tree-structured. However, the best known general methods for solving these problems require a runtime that is exponential in the treewidth of the graph. For more information on the complexity of inference, the relation to treewidth, and general message-passing algorithms, see (for example) [Koller and Friedman, 2009, §9-10].

Different authors have proposed different solutions to the computational intractability of evaluating the objective function. As we discuss in the previous section, Wainwright et al. [2006] fit a dependency network with logistic regression conditionals.
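To make the cost of the normalizing constant concrete, the following Python sketch computes $\log Z$ for the model (1.7) by enumerating every joint configuration. The function names and the dictionary representation of the edge weights are illustrative assumptions. The enumeration visits $2^p$ configurations for binary variables, so it is only usable for very small $p$.

```python
import itertools
import math


def ising_log_potential(x, b, w):
    """Unnormalized log-probability: sum_i x_i b_i + sum_{(i,j) in E} x_i x_j w_ij.

    `w` is a dict mapping an edge (i, j) to its weight w_ij.
    """
    unary = sum(xi * bi for xi, bi in zip(x, b))
    pairwise = sum(x[i] * x[j] * wij for (i, j), wij in w.items())
    return unary + pairwise


def ising_log_Z(b, w, states=(0, 1)):
    """Compute log Z by summing over all len(states)**p joint configurations."""
    p = len(b)
    total = sum(
        math.exp(ising_log_potential(x, b, w))
        for x in itertools.product(states, repeat=p)
    )
    return math.log(total)
```

Passing `states=(-1, 1)` switches to the other common representation of the binary variables without changing the rest of the code.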
Fitting a dependency network in this way corresponds to a pseudo-likelihood approximation. For a generalization of IGMs, in [Schmidt et al., 2008] we considered the symmetric pseudo-likelihood approximation^11

$$\min_{W,b} \; -\sum_{m=1}^{n} \Big[ \sum_{i=1}^{p} \log p(x_i^m \mid x_{-i}^m, W, b) \Big] + \lambda \sum_{i=1}^{p} \sum_{j=i+1}^{p} |w_{ij}|. \tag{1.9}$$

We note that each conditional probability is the likelihood in a logistic regression model:

$$p(x_i \mid x_{-i}, W, b) = \frac{1}{Z_i} \exp\Big( x_i b_i + \sum_{j \neq i} x_i x_j w_{ij} \Big),$$

where the local normalizing constant $Z_i$ only sums over possible assignments to $x_i$ (and thus can be tractably evaluated). This is nearly identical to learning a dependency network with logistic regression conditionals, but in this formulation we use the same parameter $w_{ij}$ in both $p(x_i \mid x_{-i})$ and $p(x_j \mid x_{-j})$ rather than using two versions of the parameter. [Höfling and Tibshirani, 2009] show that this approximation gives performance that is similar to or better than the performance of the dependency network approximation. A disadvantage of the symmetric version is that the optimization problem is slightly more difficult, since it is no longer separable in the conditional distributions.

Instead of using a pseudo-likelihood approximation, Lee et al. [2006b] use the non-convex Bethe variational approximation to $\log Z$ (in the special case of tree-structured graphs, this approximation is convex and exact). Banerjee et al. [2008] proposed a convex variational approximation to $\log Z$, while Kolar and Xing [2008] outline a set of additional constraints that can be imposed to improve this approximation. Höfling and Tibshirani [2009] also consider using junction trees to evaluate the likelihood exactly, but the cost of this is exponential in the treewidth of the graph.

Because the complexity of using the model has such a strong dependency on the graph structure, one of the main interests in learning a sparse structure is to learn a model that is easier to use.
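The symmetric pseudo-likelihood objective (1.9) discussed above can be sketched in a few lines of Python. The function names and data layout below are illustrative assumptions, and the sketch assumes the $\{0, 1\}$ representation by default. Each local normalizer sums over the states of a single node, so the cost is polynomial in $p$, in contrast to evaluating $\log Z$.

```python
import math


def neg_log_pseudo_likelihood(X, b, w, states=(0, 1)):
    """Sum over samples and nodes of -log p(x_i | x_{-i}) for the model (1.7).

    X is a list of samples; `w` maps an edge (i, j) with i < j to the single
    weight shared by both conditionals (the symmetric version).
    """
    p = len(b)
    total = 0.0
    for x in X:
        for i in range(p):
            # Field acting on node i: b_i + sum_{j != i} x_j w_ij.
            field = b[i] + sum(
                x[j] * w.get((min(i, j), max(i, j)), 0.0)
                for j in range(p) if j != i
            )
            # Local normalizer Z_i sums only over the states of x_i.
            log_Zi = math.log(sum(math.exp(s * field) for s in states))
            total -= x[i] * field - log_Zi
    return total


def penalized_objective(X, b, w, lam):
    """Pseudo-likelihood objective plus lam times the sum of |w_ij|."""
    penalty = lam * sum(abs(v) for v in w.values())
    return neg_log_pseudo_likelihood(X, b, w) + penalty
```

Using one dictionary entry per edge is what makes this the symmetric variant: a dependency network would instead keep a separate weight for each direction.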
We note, however, that the degree of sparsity of a graph is not a perfect surrogate for the treewidth of a graph^12. In particular, it is possible for a graph with a large number of edges to have a lower treewidth than a graph that is more sparse. For example, a chain-structured graph on 100 nodes will have 99 edges and a treewidth of 1, while if we construct a graph consisting of a three-clique and 97 other nodes with no edges then this graph has only 3 edges but has a treewidth of 2. Nevertheless, there is a simple local property that relates sparsity to treewidth: the treewidth of a graph is never decreased by adding an edge. Thus, although sparsity is not a perfect measure of treewidth, increased sparsity may lead to a decreased cost of using the model. Further, approximate inference methods typically scale linearly in the number of edges in the model. Thus, in cases where only high treewidth graphs provide good models of a data set, sparsity directly decreases the cost of using the model.

^11 This approximation was subsequently used in [Höfling and Tibshirani, 2009].

1.5 Pairwise Undirected Graphical Models

As we discuss in the previous section, there has been substantial interest in using $\ell_1$-regularization for structure learning in GGMs and IGMs. However, GGMs and IGMs can only represent pairwise distributions over Gaussian and binary data, respectively. In this section we review the class of pairwise log-linear models, a generalization of IGMs that can be used to model arbitrary pairwise positive distributions over discrete data. In log-linear models of discrete vectors $x \in \{1, 2, \dots, k\}^p$, the logarithm of each of the potentials is a linear function of the parameters.
For example, if variable $x_j$ can take four states ($k = 4$), we can define the node potential $\phi_j(x_j)$ such that

$$\log \phi_j(x_j) = I(x_j = 1) b_{j,1} + I(x_j = 2) b_{j,2} + I(x_j = 3) b_{j,3},$$

where $b_{i,j}$ is the parameter associated with state $j$ for node $i$, and $I(\cdot)$ denotes an indicator function that returns a value of 1 if its argument is true and 0 otherwise. Note that even though $x_j$ has four possible states in the example above, we only use three parameters. If we used four parameters then one of them would be redundant because the global normalization in (1.4) allows us to rescale each potential (or equivalently add or subtract a constant from the log-potential) without changing the model. Thus, we consider node potentials that have $k - 1$ parameters for nodes that can take $k$ possible states. We obtain node potentials that are equivalent to those used in IGMs in the special case where we have binary states. We find it convenient to use the notation $b_j$ to denote the set of parameters $\{b_{j,1}, b_{j,2}, \dots, b_{j,k-1}\}$ associated with the node potential $\phi_j(x_j)$.

We consider several different parameterizations of the edge potentials. First, we note that in IGMs with binary variables that take the values $\{0, 1\}$ we can write each edge potential $\phi_{ij}(x_i, x_j)$ as

$$\log \phi_{ij}(x_i, x_j) = x_i x_j w_{ij} = I(x_i = 1, x_j = 1) w_{ij}.$$

Note that this parameterization of the potentials treats the two states asymmetrically. That is, if $w_{ij} > 0$ then the edge encourages $x_i$ and $x_j$ to both take the state 1, but if $w_{ij} < 0$ it discourages them from both taking the state 1. In both cases, there is no distinction made between the three configurations $(1, 0)$, $(0, 1)$, and $(0, 0)$ (while as before if $w_{ij} = 0$ then the edge has no effect).
This asymmetry is not present in the edge potentials used by IGMs with binary variables that take the values $\{-1, 1\}$:

$$\begin{aligned}
\log \phi_{ij}(x_i, x_j) &= x_i x_j w_{ij} \\
&= I(x_i = x_j) w_{ij} - I(x_i \neq x_j) w_{ij} \\
&\equiv I(x_i = x_j) w_{ij} - I(x_i \neq x_j) w_{ij} + w_{ij} \\
&= 2 I(x_i = x_j) w_{ij} \\
&= I(x_i = x_j) \tilde{w}_{ij},
\end{aligned} \tag{1.10}$$

where in the third line we add the constant $w_{ij}$ (the global normalization means this does not change the distribution) and in the last line we re-parameterize in terms of $\tilde{w}_{ij} \triangleq 2 w_{ij}$. In this form we see that the simple change in representation leads to the states being treated symmetrically: the edge encourages the nodes to take the same state if $w_{ij} > 0$ and encourages the nodes to have different states if $w_{ij} < 0$ (and the edge has no effect if $w_{ij} = 0$).

^12 We discuss methods that use explicit constraints on the treewidth in Chapter 5.

It is often convenient to represent our edge (log-)potentials as a matrix, where each entry $(i, j)$ contains $\log \phi_{ij}(x_i, x_j)$. Below, we give the matrices corresponding to Ising edge potentials over two binary variables under the $\{0, 1\}$ (left) and $\{-1, 1\}$ (right) representations:

$$\log \phi_{ij}(\cdot, \cdot, w_{ij}) = \begin{pmatrix} w_{ij} & 0 \\ 0 & 0 \end{pmatrix}, \qquad \log \phi_{ij}(\cdot, \cdot, w_{ij}) = \begin{pmatrix} w_{ij} & 0 \\ 0 & w_{ij} \end{pmatrix}.$$

The form of the right matrix as well as (1.10) suggests a generalization to data with more than two states, where we place a parameter on the diagonal elements of this matrix and no parameter on the off-diagonals. As an example, for variables with three states we use the following edge potential matrix:

$$\log \phi_{ij}(\cdot, \cdot, w_{ij}) = \begin{pmatrix} w_{ij} & 0 & 0 \\ 0 & w_{ij} & 0 \\ 0 & 0 & w_{ij} \end{pmatrix}.$$

In the remainder of this work we refer to potentials with this form as Ising potentials, and we call pairwise log-linear models with these types of potentials IGMs. If we have separate node and edge parameters for each node and edge (respectively), then Ising potentials are sufficient to model any pairwise positive distribution over binary data. However, we can only model a set of restricted distributions over general discrete data with Ising potentials.
Thus, to model general distributions over discrete data we must consider other parameterizations of the edge potentials. The next set of potentials we consider are a natural generalization of Ising potentials, where (for nodes taking values in $\{1, 2, \dots, k\}$) we include a weight for each configuration where the nodes take the same state. For example, for an edge between two variables that can take three possible states we would use

$$\log \phi_{ij}(x_i, x_j) = I(x_i = 1, x_j = 1) w_{ij1} + I(x_i = 2, x_j = 2) w_{ij2} + I(x_i = 3, x_j = 3) w_{ij3}.$$

Alternately, we can write the edge (log-)potentials as the matrix

$$\log \phi_{ij}(\cdot, \cdot, w_{ij}) = \begin{pmatrix} w_{ij1} & 0 & 0 \\ 0 & w_{ij2} & 0 \\ 0 & 0 & w_{ij3} \end{pmatrix}.$$

Here, we use the notation $w_{ij}$ to refer to the set of all parameters associated with an edge potential $\phi_{ij}(x_{ij})$. These potentials distinguish between configurations where the variables take the same states, and can be used to model a wider class of distributions than Ising potentials. This form of potential was previously used in, for example, [Taskar et al., 2004] (who contrast it with the classic Potts model). Since we obtain Ising potentials if we set all the diagonal elements to the same value, we refer to potentials of this form as generalized Ising or gIsing potentials. However, note that with these potentials the edge is present in the model unless $w_{ijk}$ is set to zero for all $k$.

We can also consider completely general potentials over discrete variables where we parameterize every element of the edge potential matrix, allowing us to model arbitrary pairwise positive distributions over discrete data. For example, for an edge between two variables that can each take three states we would use

$$\begin{aligned}
\log \phi_{ij}(x_i, x_j) = {} & I(x_i = 1, x_j = 1) w_{ij11} + I(x_i = 1, x_j = 2) w_{ij12} + I(x_i = 1, x_j = 3) w_{ij13} \\
& + I(x_i = 2, x_j = 1) w_{ij21} + I(x_i = 2, x_j = 2) w_{ij22} + I(x_i = 2, x_j = 3) w_{ij23} \\
& + I(x_i = 3, x_j = 1) w_{ij31} + I(x_i = 3, x_j = 2) w_{ij32} + I(x_i = 3, x_j = 3) w_{ij33},
\end{aligned}$$

or in matrix form:

$$\log \phi_{ij}(\cdot, \cdot, w_{ij}) = \begin{pmatrix} w_{ij11} & w_{ij12} & w_{ij13} \\ w_{ij21} & w_{ij22} & w_{ij23} \\ w_{ij31} & w_{ij32} & w_{ij33} \end{pmatrix}.$$

Since we assign a different potential to each configuration of the nodes, we refer to potentials of this form as full potentials. Here, we have $k^2$ parameters, and the edge is included in the model if any of these $k^2$ parameters are non-zero. In [Schmidt et al., 2008] we used these types of potentials but fixed the value of one of the variables to zero (as with the node potentials) to decrease the number of parameters in the model. However, the choice of the particular variable to fix at zero can influence the particular structure learned, so in this work we use the full representation^13. Below are the matrices for the Ising, gIsing, and full potentials:

$$\log \phi_{ij}(\cdot, \cdot, w_{ij}) = \begin{pmatrix} w_{ij} & 0 & 0 \\ 0 & w_{ij} & 0 \\ 0 & 0 & w_{ij} \end{pmatrix} \quad \text{(Ising edge potentials);}$$

$$\log \phi_{ij}(\cdot, \cdot, w_{ij}) = \begin{pmatrix} w_{ij1} & 0 & 0 \\ 0 & w_{ij2} & 0 \\ 0 & 0 & w_{ij3} \end{pmatrix} \quad \text{(gIsing edge potentials);}$$

$$\log \phi_{ij}(\cdot, \cdot, w_{ij}) = \begin{pmatrix} w_{ij11} & w_{ij12} & w_{ij13} \\ w_{ij21} & w_{ij22} & w_{ij23} \\ w_{ij31} & w_{ij32} & w_{ij33} \end{pmatrix} \quad \text{(full edge potentials).}$$

An edge has no effect on the model if all entries of this matrix are set to zero^14. In the Ising case this corresponds to setting $w_{ij}$ to zero, while in the other cases we must set all elements of $w_{ij}$ to zero. We close our discussion on potential parameterizations by noting that we can naturally extend the full potentials to scenarios where the two nodes have a different number of states.

^13 Under this parameterization, the optimal parameters are still identifiable if we use a strictly convex regularizer.

^14 In the case of full potentials, we could also set all the entries of the matrix to a constant.

In pairwise undirected models of discrete data, the negative log-likelihood function for a set of $n$ realizations of $p$-vectors $x^m$ is given by

$$-\sum_{m=1}^{n} \Big[ \sum_{i=1}^{p} \Big[ \log \phi_i(x_i^m, b_i) + \sum_{j=i+1}^{p} \log \phi_{ij}(x_i^m, x_j^m, w_{ij}) \Big] \Big] + n \log Z(w, b),$$

where we use the notation $b$ to denote the set of all node parameters and $w$ to denote the set of all edge parameters. As in the IGM case, this is a convex function.
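The three edge parameterizations summarized above differ only in which entries of the log-potential matrix carry parameters. The following Python sketch builds each matrix and checks whether an edge is effectively absent. The function names are illustrative assumptions, and states are indexed from 0 rather than 1 for convenience.

```python
def ising_matrix(w, k):
    """k-by-k log-potential with one shared parameter w on the diagonal."""
    return [[w if q == r else 0.0 for r in range(k)] for q in range(k)]


def gising_matrix(w_diag):
    """Log-potential with a separate parameter w_diag[q] for each agreeing state."""
    k = len(w_diag)
    return [[w_diag[q] if q == r else 0.0 for r in range(k)] for q in range(k)]


def full_matrix(w_full):
    """Log-potential with one parameter per joint configuration (copied as-is)."""
    return [list(row) for row in w_full]


def edge_is_absent(log_phi):
    """The edge has no effect when every entry of the log-potential is zero."""
    return all(entry == 0.0 for row in log_phi for entry in row)
```

For example, `gising_matrix([0.0, 0.0, 0.3])` still represents a present edge, since only one of its three parameters is zero.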
In log-linear models, the gradient of the average negative log-likelihood has a simple form. For example, the average gradient with respect to a node potential parameter $b_{i,j}$ is

$$\nabla_{b_{i,j}} \Big[ -\frac{1}{n} \sum_{m=1}^{n} \log p(x^m \mid b, w) \Big] = p(x_i = j) - \frac{1}{n} \sum_{m=1}^{n} I(x_i^m = j).$$

Thus we see that at a maximum likelihood solution (where the gradient is zero), the model must have the same unary marginals as the data. Similarly, the average gradient with respect to an edge potential parameter $w_{ijqr}$ (when using full potentials) is

$$\nabla_{w_{ijqr}} \Big[ -\frac{1}{n} \sum_{m=1}^{n} \log p(x^m \mid b, w) \Big] = p(x_i = q, x_j = r) - \frac{1}{n} \sum_{m=1}^{n} I(x_i^m = q, x_j^m = r),$$

and thus at a maximum likelihood solution the model marginals will match the empirical frequencies for all edges that are included in the model.

The models in the previous sections have a one-to-one correspondence between parameters in the model and edges in the graph. However, the log-linear models we discuss in this section may have more than one parameter associated with each edge. Further, the edge is only removed from the model if all of the parameters associated with the edge are set to zero. Thus, if we would like to use regularization to directly encourage graphical sparsity we must consider group $\ell_1$-regularization, a generalization of $\ell_1$-regularization that penalizes groups of variables in order to directly encourage group-wise sparsity. Utilizing group $\ell_1$-regularization to encourage sparsity in terms of groups of variables was proposed by Bakin [1999]^15. In group $\ell_1$-regularization, we penalize the $\ell_1$ norm of the (non-squared) $\ell_2$ norms of the groups. For our problem, we have one group for each edge and the group contains all parameters associated with the corresponding edge.
Thus, we can write the problem of estimating a sparse regularized structure with group $\ell_1$-regularization as

$$\min_{w,b} \; -\sum_{m=1}^{n} \Big[ \sum_{i=1}^{p} \Big[ \log \phi_i(x_i^m, b_i) + \sum_{j=i+1}^{p} \log \phi_{ij}(x_{ij}^m, w_{ij}) \Big] \Big] + n \log Z(w, b) + \lambda \sum_{i=1}^{p} \sum_{j=i+1}^{p} \|w_{ij}\|_2 \tag{1.11}$$

(using an approximate objective function gives an analogous formulation). We obtain $\ell_1$-regularization if each group contains only a single variable. We can interpret this “$\ell_1$ of $\ell_2$ norms” regularizer as an $\ell_1$-regularizer of the lengths of the vectors $w_{ij}$. Consequently, it encourages sparsity in the lengths of the vectors, leading to the entire group being set to zero when the length becomes zero.

^15 Sardy et al. [2000] discuss using $\ell_1$-regularization of the complex modulus, a special case of group $\ell_1$-regularization.

Utilizing (1.11) was mentioned in [Lee et al., 2006b], but this work did not discuss how to solve the resulting optimization problem. Dahinden et al. [2007] use (1.11) to encourage graphical sparsity, but their methodology is restricted to small data sets since they did not consider using approximate inference or efficient large-scale optimization strategies. In Chapter 3 we give large-scale optimization strategies that are especially suited to solving problems like (1.11), where we have a large number of variables, a costly objective, and a regularizer (or constraints) with a simple structure.

In Chapter 5, we consider several variations on (1.11). In particular, we show how different choices of the norm can lead to edge potentials with different properties (and in some cases better performance). We also extend the block-wise sparse strategy proposed in [Duchi et al., 2008a], where edges are placed in groups and we would like to encourage sparsity in terms of groups of edges. Finally, in Chapter 5 we consider extending (1.11) to include covariates for use in structured classification problems, leading to a discriminative structure learning method for structured classification.
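The regularizer in (1.11) is simple to compute once the parameters are grouped by edge. The following Python sketch evaluates the penalty and lists the edges that remain in the graph; the function names and the dictionary-of-lists layout are illustrative assumptions.

```python
import math


def group_l1_penalty(groups, lam):
    """lam times the sum over edges of the (non-squared) l2 norm of that group.

    `groups` maps an edge (i, j) to the list of parameters w_ij on that edge.
    """
    return lam * sum(
        math.sqrt(sum(v * v for v in params)) for params in groups.values()
    )


def active_edges(groups):
    """Edges that remain in the graph: at least one parameter is non-zero."""
    return sorted(e for e, params in groups.items() if any(v != 0.0 for v in params))
```

When every group holds a single parameter, `group_l1_penalty` reduces to the ordinary $\ell_1$ penalty, matching the remark above.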
1.6 General Log-Linear Models

Due to their relatively small number of parameters, pairwise log-linear models have sometimes been advocated in scenarios where limited data is available [Whittaker, 1990, §9.3]. However, pairwise models only focus on the unary and pairwise statistical properties of the data, so the pairwise assumption can be fairly restrictive if higher-order moments of the data are important and we have sufficient training examples available to estimate such higher-order statistics. Despite this fact, with only one exception, all previous work on structure learning with $\ell_1$-regularization has made the pairwise assumption. The one exception is Dahinden et al. [2007], who considered log-linear models of discrete data where all potentials up to a fixed order are considered, and used group $\ell_1$-regularization to learn the structure.

For general log-linear models [Bishop et al., 1975], we can write the probability of a vector $x \in \{1, 2, \dots, k\}^p$ as a globally normalized product of potential functions $\phi_A(x_A)$ defined for each possible subset $A$ of $S \triangleq \{1, 2, \dots, p\}$:

$$p(x) \triangleq \frac{1}{Z} \prod_{A \subseteq S} \phi_A(x_A).$$

As before, the normalizing constant $Z$ enforces that the distribution sums to one, and the logarithm of each potential $\phi_A(x_A)$ is linear in the parameters of the potential. For models including higher-order terms, we use the short-hand $w_A$ to refer to all the parameters associated with the potential $\phi_A(x_A)$ (whether it be unary, pairwise, or higher-order), and we use $w$ to refer to the concatenation of all $w_A$. We define the unary potentials and pairwise potentials as before, and can define the threeway and higher-order potentials by generalizing the Ising, gIsing, and full potentials we discuss for pairwise models. In general, if $A$ contains $c$ elements that can each take $k$ values, $\phi_A(x_A)$ will have $k^c$ parameters $w_A$ when we use full potentials, $k$ parameters when we use gIsing potentials, and one parameter when we use Ising potentials.
In the case of full potentials, general log-linear models can be used to model arbitrary positive distributions over discrete data. In practice, it is typically not feasible to include a potential $\phi_A(x_A)$ for all $2^p$ subsets. As before, removing the potential $\phi_A(x_A)$ from the model is equivalent to setting it to one (or any other constant) for all values of $x_A$, or equivalently setting all elements of $w_A$ to zero (or any other constant). We obtain the class of pairwise models if we enforce $w_A = 0$ for all $A$ with a cardinality greater than two. This effectively nullifies the effects of the higher-order statistics of the data on the model.

The group $\ell_1$-regularization strategy from the previous section can naturally be extended to the case of general log-linear models. This results in the optimization problem

$$\min_{w} \; -\sum_{i=1}^{n} \log p(x^i \mid w) + \sum_{A \subseteq S} \lambda_A \|w_A\|_2. \tag{1.12}$$

Here we include a separate regularization parameter $\lambda_A \geq 0$ for each group since we typically want to use a different degree of penalization for potentials of different orders. This is (essentially) the approach taken in [Dahinden et al., 2007]. Dahinden et al. [2007] also consider a variant where we only consider potentials up to a certain order, and successively increase the order. This latter strategy can be viewed as an $\ell_1$-regularization version of a classic strategy for structure learning in general log-linear models [see Bishop et al., 1975, §4.5.1].

However, a problem with (1.12) is that sparsity in the variable groups $A$ does not directly correspond to conditional independencies in the model (except in the pairwise case). In particular, in a log-linear model variable sets $B$ and $C$ are conditionally independent given all other variables if and only if all elements of $w_A$ are zero for all $A$ that contain at least one element from $B$ and at least one element from $C$ [see Whittaker, 1990, Proposition 7.2.1].
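The conditional independence criterion above can be checked mechanically from the sparsity pattern of the groups. The following Python sketch does so under the assumption that the non-zero groups are given as a collection of variable subsets; the function name and argument layout are hypothetical.

```python
def conditionally_independent(B, C, active_sets):
    """Return True if no active potential touches both B and C.

    `active_sets` is an iterable of variable subsets A whose parameters w_A
    are not all zero; B and C are disjoint sets of variable indices.
    """
    B, C = set(B), set(C)
    return not any(set(A) & B and set(A) & C for A in active_sets)
```

For instance, with active sets {1}, {2}, {3}, {1, 2}, and {2, 3}, variables 1 and 3 are conditionally independent given variable 2, but adding the threeway set {1, 2, 3} removes this independence even though no pairwise potential on {1, 3} is present.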
In principle, we can use the optimization methods of Chapter 3 to solve (1.12) (with an approximate objective function, if necessary). Indeed, in Chapter 6 we consider using this formulation to learn the structure of threeway log-linear models. However, this formulation is only practical when the number of nodes $p$ or the maximum size of the factors $M$ is very small, since if we allow for $M$-way factors there are $\binom{p}{M}$ possible subsets of size $M$ to examine. Further, if we allow factors of arbitrary size then there are $2^p$ factors to consider. For example, if we have 32 variables then we would have $2^{32}$ groups, and (with full potentials) each group would contain up to $k^{32}$ parameters. This exponential number of variables makes the problem very difficult to solve if we don’t enforce a strong cardinality restriction (such as restricting attention to pairwise or threeway models). Dahinden et al. [2007] did not address the problems associated with the exponential number of variables in this formulation, since their application only had five variables.

In Chapter 6, we consider using group $\ell_1$-regularization for convex structure learning in the special case of hierarchical log-linear models, where a potential $\phi_A(x_A)$ can only be included if the potentials on all subsets of $A$ are also included. Although hierarchical models are a subset of the class of general log-linear models, they are a far larger class of models than the set of pairwise (or threeway) models. Further, one of the advantages of hierarchical models is that sparsity in the groups directly corresponds to conditional independencies in the model. Similar to [Bach, 2008b], we develop an active-set method that can incrementally add higher-order factors, and places no restriction on the maximum cardinality of the potentials. This method uses the hierarchical property to potentially rule out an exponential number of higher-order potentials, and converges to a solution satisfying a set of necessary optimality conditions.
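To give a sense of the counts involved, the following Python sketch tallies the number of candidate factors under a cardinality restriction and tests whether a collection of factors satisfies the hierarchical property described above. The function names are illustrative assumptions, and the count excludes the empty subset.

```python
import math


def num_factors(p, M):
    """Number of non-empty variable subsets of size at most M among p variables."""
    return sum(math.comb(p, m) for m in range(1, M + 1))


def is_hierarchical(active_sets):
    """Check that every non-empty subset of each active set is also active.

    It suffices to test the subsets obtained by dropping a single element,
    since the same test is applied to those subsets in turn.
    """
    active = {frozenset(A) for A in active_sets}
    for A in active:
        for v in A:
            sub = A - {v}
            if sub and sub not in active:
                return False
    return True
```

With 32 variables, `num_factors(32, 2)` gives 528 and `num_factors(32, 3)` gives 5488, whereas removing the restriction gives `num_factors(32, 32) == 2**32 - 1`.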
Key to the convex parameterization of the space of hierarchical log-linear models is that we allow the groups to overlap. This results in a more difficult optimization problem, but in Chapter 6 we give a strategy to adapt the methods of Chapter 3 to the case of overlapping groups. Our experiments show that allowing for such higher-order interactions can result in improved prediction accuracy.

1.7 Data Sets

In addition to experiments on synthetic data, in this work we test the performance of various methods on several real data sets. Several of these latter data sets are used in multiple chapters and in multiple contexts. Thus, to avoid repeating information we introduce all of the real data sets in this section.

When we consider testing large-scale methods for $\ell_1$-regularized logistic regression in Section 2.6.1, we consider the following binary classification data sets:

• sido: This data set contains 4932 binary variables describing properties of 12678 molecules that have been tested against the AIDS HIV virus. The target indicates the molecular activity, and among the variables are several artificially generated ‘probe’ variables. This data set is made available as part of the Causality Workbench, http://www.causality.inf.ethz.ch/home.php.

• thrombin: This data set contains 139350 binary variables describing three-dimensional properties of 1909 molecules that have been tested for their ability to bind to thrombin, a key receptor in blood clotting. The target variables indicate whether the molecules are active (bind well). This data set has been made available by DuPont Pharmaceuticals Research Laboratories for the KDD Cup 2001 competition, http://pages.cs.wisc.edu/~dpage/kddcup2001/.

• spam: This data set contains 823470 binary variables describing the presence of word tokens in 92189 e-mail messages involved in the legal investigations of the Enron corporation. The target variable indicates whether the e-mail was spam or not.
This data set was made using the TREC 2005 corpus [Cormack and Lynam, 2005], http://plg.uwaterloo.ca/~gvcormac/treccorpus/. This data set was prepared by Peter Carbonetto, who used the SpamBayes software for feature extraction, http://spambayes.sourceforge.net/.

Since many of the tasks related to learning probabilistic graphical models are NP-hard or #P-hard, in some cases we consider data sets that have a relatively small number of nodes and states. This makes it possible to solve the NP-hard and #P-hard problems exactly. This removes any confounding effects associated with using approximations when comparing optimization strategies in Sections 2.6.2 and 3.5, and comparing different models in Sections 5.8.1 and 6.5. These data sets also allow us to compare the quality of different approximations to the exact case in Section 5.8.2. We examine two small data sets:

• cyto: The data studied in [Sachs et al., 2005]. In this study, intracellular multivariate flow cytometry was used to simultaneously measure the expression levels of 11 phosphorylated proteins and phospholipid components in 5400 individual primary human immune system cells over 9 different stimulatory/inhibitory conditions. We used the targets of intervention and 3-state discretization strategy (into ‘under-expressed’, ‘baseline’, and ‘over-expressed’) of [Sachs et al., 2005]. This data set is available at the Causality Workbench Repository (we ignore the experimental conditions), http://www.causality.inf.ethz.ch/repository.php.

• awma: The coronary heart disease data studied in [Qazi et al., 2007]. In this study, expert cardiologists provided ratings from 1-5 of the motion of 16 segments of the left ventricle of the heart in 2602 patients [Qazi et al., 2007]. Here, a rating of 1 indicates normal, while classes 2-5 represent degrees of abnormality.
Although the segments are rated from 1 to 5, as in [Qazi et al., 2007] we aggregate the four abnormal states (2-5) into a single state (classes 3 to 5 are severely under-represented in the data). This data set was provided by Siemens Medical Solutions, and is not available on-line.

When we examine learning probabilistic graphical models of larger binary data sets in Section 4.6.2, we focus on the following data sets:

• rain: We created a data set consisting of 28-vectors, representing a binarized version of the ‘daily precipitation’ amount for the first 28 days of the month for the weather station in Steveston, British Columbia. We obtained values from 1896-2004, but removed months with missing (or accumulated) values (this left 1059 months). We only used the first 28 days for each month to make all of the samples have the same length. Measurements with zero or trace precipitation values were assigned to one class, while fields with a non-zero value were assigned to the other class (approximately 41 percent of the values are non-zero). This data was extracted from the Canadian Daily Climate Archive from Environment Canada’s National Climate and Data Information Archive, http://climate.weatheroffice.ec.gc.ca/.

• msweb: The Anonymous Microsoft Web Data, a data set measuring whether each of 294 webpages was visited by 32711 anonymous randomly-selected users of microsoft.com. We focused on the 57 websites with greater than 250 visits. This data set is available from the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/index.html.

• news: A data set measuring the occurrence of 100 words in 16242 newsgroup postings from the 20 Newsgroups data. This data set is available from Sam Roweis’ data page, http://www.cs.toronto.edu/~roweis/data.html.

• usps: A set of 11000 binary 16 by 16 images (256 variables), each representing a single digit.
We binarized the pixels by assigning pixels with a value of zero to one state, and pixels with a non-zero value to the other state. This data set is available from Sam Roweis’ data page, http://www.cs.toronto.edu/~roweis/data.html.

When we examine learning probabilistic graphical models of larger non-binary discrete data sets in Sections 5.8.3 and 6.5, we focus on the following data sets:

• awma-5: Rather than aggregating the four abnormal states (2-5) into one single state, we consider the full five-state version of the awma data from [Qazi et al., 2007]. Here, 1 indicates normal, 2 indicates hypokinetic, 3 indicates akinetic, 4 indicates dyskinetic, and 5 indicates aneurysm.

• traffic: The traffic data contains 32 four-state variables measuring the level of traffic flow at different San Francisco locations at 4413 time points [Krause and Guestrin, 2005]. This data set was also previously analyzed in Shahaf et al. [2009], and was sent to us by Dafna Shahaf.

• temperature: The temperature data contains 54 four-state variables measuring temperature levels in the Intel Research, Berkeley lab [Deshpande et al., 2004]. This data set was also previously analyzed in Shahaf et al. [2009], and was also sent to us by Dafna Shahaf.

• usps4: Rather than binarizing the usps data, in this data set we discretize the pixels’ intensity values into four equally-spaced bins (and we concentrated on the 16 pixels in the center of the images).

• usps8: This is similar to the usps4 data, but using a discretization into eight bins.

When we consider learning blockwise-sparse GGMs in Section 5.8.4, we consider the following data set:

• genes: A subset of the data set examined in [Gasch et al., 2000], containing mRNA expression levels of 667 genes in the yeast genome measured under 174 different conditions. This data set was previously analyzed in [Duchi et al., 2008a], and we use the same pre-processing and assignment of the variables to the 86 ‘types’ that they use.
Finally, when we consider structured binary classification in Section 5.8.5 we focus on the following data set:

• awma-c: In this data set we consider the classification problem of labeling 16 segments of the left ventricle as normal or abnormal based on multi-view ultrasound video [Schmidt et al., 2008]. This is similar to the awma data, but consists of 345 cases where we have cardiologist labels for all 16 segments as well as features measured from the associated videos. For this data, we have a total of 34 features for each segment measuring properties of the motion of the segment from the tracked contours of the ventricle. This data set was also provided by Siemens Medical Solutions.

1.8 Summary of Contributions

Below, we briefly summarize the contributions of each chapter:

• Chapter 2: We give extensions of limited-memory quasi-Newton methods for differentiable optimization to the case of optimizing a differentiable function with $\ell_1$-regularization. We argue that these extensions have more appealing properties than previous extensions. Our experiments on $\ell_1$-regularized logistic regression indicate that these extensions perform similarly to or better than other methods for this problem.

• Chapter 3: We give new limited-memory quasi-Newton methods for optimizing differentiable functions subject to simple constraints or simple non-differentiable regularizers. We argue that these extensions are appealing when the differentiable function is high-dimensional and costly to evaluate, while the constraints (or non-differentiable regularizer) have a simple structure. Our experiments on group $\ell_1$-regularized pairwise log-linear models indicate that these methods outperform existing methods for this problem.

• Chapter 4: We give a method that uses $\ell_1$-regularization to learn a sigmoid dependency network to prune the set of edges that are considered in a search over the space of DAG models with sigmoid CPDs.
Unlike previous pruning methods that prune based on different criteria than the subsequent search, our method uses the same score in both the pruning and the search phase. Our experiments indicate that this pruning strategy is advantageous over previous pruning strategies that do not take advantage of the structure of the CPDs or the form of the score. Although we concentrate on the case of sigmoid CPDs, the method applies to a more general class of linearly-parameterized CPDs.
• Chapter 5: We consider methods for learning pairwise undirected graphical models with group ℓ1-regularization to encourage graphical sparsity. Unlike previous work, we consider using group ℓ1-regularization with different choices of the group norm, and argue that the structure offered by different choices can be advantageous. We show how to apply the methods of Chapter 3 to the case of different group norms, and we introduce a group version of the nuclear norm regularizer. Our experiments indicate that different choices of the group norm can lead to improved predictive performance, and that utilizing general pairwise log-linear models of discrete data can lead to better predictive performance than IGMs. We extend previous work on blockwise-sparse GGMs by considering different choices of the group. We also extend previous work on group ℓ1-regularization of log-linear models to the case of conditional log-linear models, representing the first method that simultaneously and discriminatively learns both structure and parameters in a structured classification model.
• Chapter 6: We consider using overlapping group ℓ1-regularization for structure learning in hierarchical log-linear models, with no restriction on the cardinality of the potentials. We give an active-set method for searching the exponential space of possible higher-order potentials, and show how to apply the methods of Chapter 3 to the case of overlapping groups.
Our experiments indicate that removing the cardinality restriction leads to better predictive performance than pairwise (or three-way) models.

Chapter 2
Optimization with ℓ1-Regularization

Many of the models examined in this work require solving an ℓ1-regularized logistic regression problem. Although logistic regression is typically used for binary classification (§1.1), we must also solve a set of ℓ1-regularized logistic regression problems for structure learning in dependency networks (§1.2), and for the DAG structure learning method we describe in Chapter 4. Further, pseudo-likelihood approximations in ℓ1-regularized IGM models (§1.4) and general binary log-linear models (§5.3) take the form of a set of dependent ℓ1-regularized logistic regression problems. If we consider models with more than two states for each node, then these problems are replaced with the analogous multiclass logistic regression [Bishop, 2006, §4.3.4]. Further, the ℓ1-regularized IGM model and general binary log-linear models (§1.5) also have a similar structure. Thus, in this work it is important to be able to efficiently optimize the parameters in ℓ1-regularized logistic regression and related models. Fortunately, this has recently become a well-studied problem.

In this chapter, we describe algorithms for solving the general optimization problem
\[ \min_x f(x) \triangleq L(x) + \sum_i \lambda_i |x_i|, \tag{2.1} \]
where $L(x)$ is assumed to be convex and differentiable with respect to $x \in \mathbb{R}^p$, and we may have a separate regularization parameter $\lambda_i \geq 0$ for each variable $i$. We particularly concentrate on the special case of the ℓ1-regularized logistic regression problem we discuss in Section 1.1. In this case, $L(x)$ is the negative log-likelihood in a logistic regression model, the optimization parameters $x$ are the concatenation of the weights $w$ and bias $b$ in the model, and $\lambda_i$ is the same across all $i$, except for the bias term $b$ where $\lambda_i = 0$.
More precisely, logistic regression is the following special case of (2.1):
\[ \min_{w,b} \sum_{i=1}^{n} \log(1 + \exp(-y^i (w^T x^i + b))) + \lambda \|w\|_1. \]
In this case, the gradients of $L(w, b)$ with respect to $b$ and $w$ are given by
\[ \nabla_b L(w, b) = \sum_{i=1}^{n} -y^i / (1 + \exp(y^i (w^T x^i + b))), \]
\[ \nabla_w L(w, b) = \sum_{i=1}^{n} -y^i x^i / (1 + \exp(y^i (w^T x^i + b))). \]
Although our focus is on logistic regression, we note that the algorithms we describe in this chapter are applicable to any optimization problem of the form (2.1), including (for example) ℓ1-regularized IGMs.

Solving (2.1) is complicated by the non-differentiability of $|x_i|$ at $x_i = 0$. In the next section, we briefly discuss one of the most effective optimization methods for logistic regression when no regularization is used, or when (differentiable) ℓ2-regularization is used. We then proceed to outline several properties that we would like an optimization method for ℓ1-regularized logistic regression to have, followed by a discussion of existing and then new methods for solving the non-differentiable ℓ1-regularized logistic regression problem.

2.1 Logistic Regression with Differentiable Regularization

In the case of unregularized logistic regression or logistic regression with ℓ2-regularization, several comparison studies indicate that quasi-Newton methods are among the most efficient methods available for solving large-scale (generalized) logistic regression problems [Malouf, 2002, Wallach, 2002, Minka, 2003, Sha and Pereira, 2003]. Quasi-Newton optimization algorithms are closely related to optimization methods based on Newton's method, but where the matrix of second partial derivatives of the objective function (the Hessian) is replaced by an approximation. Typically, for large-scale problems the approximation is constructed using limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) updates [Nocedal, 1980]. In this section we review a Newton-like algorithm for unconstrained optimization, and then discuss L-BFGS updates.
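As a concrete illustration of the logistic regression objective and gradients given earlier in this chapter, the following is a minimal NumPy sketch. The function names and the array layout (rows of `X` are the feature vectors $x^i$, labels `y` take values in {-1, +1}) are assumptions made for this example and are not taken from the thesis.

```python
import numpy as np

def logistic_loss_and_grad(w, b, X, y):
    """Negative log-likelihood L(w, b) and its gradients (illustrative sketch).

    X is an (n, p) array whose rows are x^i; y holds labels in {-1, +1}.
    """
    margins = y * (X @ w + b)                  # y^i (w^T x^i + b)
    # sum_i log(1 + exp(-margin_i)), evaluated in a numerically stable way
    loss = np.sum(np.logaddexp(0.0, -margins))
    # -y^i / (1 + exp(margin_i)), using 1/(1+exp(m)) = exp(-log(1+exp(m)))
    coef = -y * np.exp(-np.logaddexp(0.0, margins))
    grad_w = X.T @ coef
    grad_b = np.sum(coef)
    return loss, grad_w, grad_b

def l1_objective(w, b, X, y, lam):
    """f(w, b) = L(w, b) + lam * ||w||_1; the bias b is not regularized."""
    return logistic_loss_and_grad(w, b, X, y)[0] + lam * np.sum(np.abs(w))
```

A finite-difference comparison against `grad_w` and `grad_b` is a cheap way to check an implementation like this before handing it to an optimizer.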
Newton-like algorithms for unconstrained differentiable optimization are iterative methods, where at each iteration we form a quadratic model
\[ q_k(x) \triangleq f(x_k) + (x - x_k)^T \nabla f(x_k) + \frac{1}{2}(x - x_k)^T H_k (x - x_k), \]
around the current iterate $x_k$. Here, $H_k$ is a positive-definite matrix (the Hessian or an approximation of it). To compute the new iterate $x_{k+1}$ we move to the minimum of this quadratic. This minimum is given by
\[ x_{k+1} \leftarrow x_k - H_k^{-1} \nabla f(x_k). \]
Unfortunately, this new iterate may not necessarily decrease the objective function. However, the direction of search, $d \triangleq -H_k^{-1} \nabla f(x_k)$, is a descent direction at $x_k$. That is, for sufficiently small $\alpha > 0$ we have $f(x_k + \alpha d) < f(x_k)$ provided that $x_k$ is not an optimal solution (and that $H_k$ is positive-definite). Therefore, we can decrease the objective function by moving in the direction $d$ using iterates of the form
\[ x_{k+1} \leftarrow x_k - \alpha H_k^{-1} \nabla f(x_k), \]
for some sufficiently small $\alpha > 0$. To choose the step length $\alpha$, we can use a line search. Specifically, to guarantee a sufficient decrease in the objective value, we start with $\alpha = 1$ and decrease $\alpha$ until we satisfy the Armijo condition
\[ f(x_{k+1}) \leq f(x_k) + \nu \nabla f(x_k)^T (x_{k+1} - x_k), \quad \text{with } \nu \in (0, 1). \tag{2.2} \]
A typical value of the sufficient decrease parameter $\nu$ is $10^{-4}$. In the case of logistic regression and if $H_k$ is a reasonable approximation to the Hessian, the value $\alpha = 1$ will typically be accepted. If this value is not accepted, we can generate a new step length $\tilde{\alpha}$ in the interval $(0, \alpha)$. A common approach to do this is to set $\tilde{\alpha}$ to the minimum of the cubic polynomial that interpolates $f(x_k)$, $f(x_{k+1})$, and the directional derivatives $\nabla f(x_k)^T (x_{k+1} - x_k)$ and $\nabla f(x_{k+1})^T (x_{k+1} - x_k)$. Typically, we use some safeguards to ensure that $\tilde{\alpha}$ is in the interval $(0, \alpha)$ and that it is not too close to either end point. For example, we can project the minimum of the cubic interpolant into the interval $[\xi_1 \alpha, \xi_2 \alpha]$, for $0 < \xi_1 \leq \xi_2 < 1$.
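The Armijo test and the safeguarded cubic interpolation just described can be sketched as follows. This is only an illustration: the function names, the default safeguard values `xi1 = 0.1` and `xi2 = 0.9`, and the bisection fallback are assumptions, and the cubic minimizer uses the standard textbook interpolation formula written in terms of $\phi(t) = f(x_k + t d)$, so `g0` and `g1` are derivatives with respect to $t$ rather than with respect to the full step.

```python
import math

def armijo_ok(f0, f1, g0, alpha, nu=1e-4):
    """Armijo condition (2.2) along d: f1 = f(x_k + alpha d), g0 = grad f(x_k)^T d."""
    return f1 <= f0 + nu * alpha * g0

def cubic_step(alpha, f0, g0, f1, g1, xi1=0.1, xi2=0.9):
    """Safeguarded minimizer of the cubic interpolating
    phi(0) = f0, phi'(0) = g0, phi(alpha) = f1, phi'(alpha) = g1."""
    d1 = g0 + g1 + 3.0 * (f0 - f1) / alpha
    radicand = d1 * d1 - g0 * g1
    new_alpha = 0.5 * alpha                    # fallback (assumption): bisection
    if radicand >= 0.0:
        d2 = math.sqrt(radicand)
        denom = g1 - g0 + 2.0 * d2
        if denom != 0.0:
            new_alpha = alpha - alpha * (g1 + d2 - d1) / denom
    # project into [xi1 * alpha, xi2 * alpha] so the step shrinks but not too fast
    return min(max(new_alpha, xi1 * alpha), xi2 * alpha)
```

For a function that is exactly quadratic along the search direction, the interpolant recovers the exact minimizer, which makes a convenient sanity check.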
For the logistic regression objective function, the minimum of the cubic polynomial, $\tilde{\alpha}$, will typically be accepted, but if it is not we can repeat the cubic interpolation focusing on the interval $(0, \tilde{\alpha})$. Successively refining the interval generates a decreasing sequence of step lengths, and we must eventually find a value of $\alpha$ that satisfies the Armijo condition. Once a suitable value of $\alpha$ is found, we set $x_{k+1} \leftarrow x_k + \alpha d$ and start a new iteration at $x_{k+1}$. We typically continue this process until the iterations no longer make substantial progress (the relative change in parameter values or function values is too small), or until we satisfy a criterion measuring the first-order optimality of the current iterate (i.e. that $\|\nabla f(x_k)\|$ is sufficiently small). For more information on Newton's method for optimization and the other issues we discuss in this section, see [Gill et al., 1981, Nocedal and Wright, 1999] (among many other standard references).

2.1.1 L-BFGS Approximation

Rather than explicitly using the exact Hessian $H_k = \nabla^2 f(x_k)$, quasi-Newton methods allow us to build an approximation to the Hessian (or its inverse) using successive differences in the parameter vector, $s_k \triangleq x_{k+1} - x_k$, and the gradient, $y_k \triangleq \nabla f(x_{k+1}) - \nabla f(x_k)$. Quasi-Newton methods typically begin with a scaled identity matrix approximation $B_0 \triangleq \sigma I$ to the Hessian (for some positive $\sigma$), and after each iteration $B_{k+1}$ is updated so that the changes in parameters and gradients satisfy the secant equation
\[ B_{k+1} s_k = y_k. \tag{2.3} \]
The solution $B_{k+1}$ of (2.3) is not unique, and the most common way to choose a unique matrix $B_{k+1}$ is with the BFGS formula [see Gill et al., 1981, §4.5.2]. In the limited-memory version of the BFGS update, L-BFGS, we do not explicitly store $B_k$ but rather store a set of $m$ differences $s_k$ and $y_k$. To pre-multiply a vector with the inverse of a matrix $B_0 = \sigma_k I$ updated $m$ times with these stored vectors, we can use a simple algorithm that runs in $O(mp)$ [Nocedal, 1980].
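The $O(mp)$ algorithm referred to above is commonly implemented as the "two-loop recursion". The sketch below is one standard way to write it and is not taken from the thesis; the function name and the list-based storage of the pairs (oldest first) are assumptions. It initializes the inverse approximation as $\sigma^{-1} I$ with $\sigma = y^T y / y^T s$ computed from the most recent pair.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Return d = -B^{-1} grad, where B is sigma*I updated by the stored
    (s, y) pairs via BFGS (two-loop recursion). Pairs are ordered oldest first
    and are assumed to satisfy y^T s > 0."""
    q = np.array(grad, dtype=float)            # working copy of the gradient
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # first loop: newest pair to oldest pair
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    # apply the initial inverse approximation (1/sigma) * I
    if s_list:
        s, y = s_list[-1], y_list[-1]
        sigma = (y @ y) / (y @ s)
        r = q / sigma
    else:
        r = q                                   # no pairs yet: plain gradient
    # second loop: oldest pair to newest pair
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * (y @ r)
        r += (a - beta) * s
    return -r
```

Each loop touches every stored pair once with $O(p)$ work, which gives the $O(mp)$ cost; no $p \times p$ matrix is ever formed.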
The choice of the scaling coefficient $\sigma_k$ can have a significant impact on the performance of the method. A widely-used and typically very effective choice of this scaling is [Shanno and Phua, 1978]
\[ \sigma_k \triangleq \frac{y_k^T y_k}{y_k^T s_k}. \tag{2.4} \]
In later sections we consider utilizing a Hessian approximation that simply takes the form $\sigma_k I$ (equivalently, an inverse Hessian approximation $\sigma_k^{-1} I$), with $\sigma_k$ given by (2.4) and without any quasi-Newton updates applied to it. This was proposed by Barzilai and Borwein [1988], and represents a $\sigma_k$ that minimizes the squared error in (2.3) under this simple approximation [footnote 16].

Footnote 16: An important property of the Barzilai-Borwein approximation, as opposed to the L-BFGS approximation, is that it only changes the magnitude of the negative gradient, and not its direction.

2.1.2 ℓ1-Regularization over an Orthant

We define an orthant for some sign pattern $\{\zeta_1, \zeta_2, \ldots, \zeta_p\}$ to be the closed subset of $\mathbb{R}^p$ satisfying
\[ \zeta_1 x_1 \geq 0, \quad \zeta_2 x_2 \geq 0, \quad \ldots, \quad \zeta_p x_p \geq 0, \]
where each $\zeta_i$ can take values in the set $\{-1, 1\}$. An important property that is used extensively in this chapter is that the ℓ1-regularization problem (2.1) is differentiable over any given orthant. In particular, over an orthant with sign pattern element $\zeta_i$ the regularizer for variable $i$ is the linear function $\lambda_i \zeta_i x_i$, whose derivative with respect to $x_i$ is the constant $\lambda_i \zeta_i$. Thus, if we are given an orthant containing the optimal solution, solving the ℓ1-regularization problem (2.1) reduces to the problem of minimizing a convex differentiable objective function with bound constraints on the variables (the bound constraints ensure that we do not leave the orthant). Minimizing differentiable functions subject to simple bound constraints can be done with straightforward modifications of methods for minimizing unconstrained differentiable functions (like quasi-Newton methods with an L-BFGS Hessian approximation); we describe one such method in Section 2.2.3.
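To make the orthant property concrete, the following small sketch evaluates the regularizer and the gradient of (2.1) restricted to a fixed orthant. The helper names are hypothetical and `grad_L` stands for $\nabla L(x)$ computed elsewhere.

```python
import numpy as np

def in_orthant(x, zeta):
    """True if zeta_i * x_i >= 0 for every i (x lies in the closed orthant)."""
    return bool(np.all(zeta * x >= 0))

def orthant_regularizer(x, lam, zeta):
    """sum_i lam_i * zeta_i * x_i, which equals sum_i lam_i |x_i| on the orthant."""
    return float(np.sum(lam * zeta * x))

def orthant_gradient(grad_L, lam, zeta):
    """Gradient of L(x) + sum_i lam_i * zeta_i * x_i with respect to x."""
    return grad_L + lam * zeta
```

Note that a zero-valued coordinate belongs to both the orthant with $\zeta_i = 1$ and the one with $\zeta_i = -1$, which is why the choice of orthant matters only for the variables currently at zero.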
2.2 Logistic Regression with ℓ1-Regularization

Typically, we do not know the orthant of the optimal solution, so we must consider methods that deal with the non-differentiability of the regularizer. We would like to have an algorithm that allows us to efficiently solve the optimization problem even when the number of variables or the number of training examples is large. Toward this end, we can identify several properties that we would like an optimization algorithm for solving the ℓ1-regularized logistic regression problem to have:
1. Block updates: the algorithm is able to move more than one variable at a time to improve the objective function.
2. Linear time/space: the algorithm requires O(p) space, and O(p) time per iteration.
3. Warm start: if we initialize the algorithm close to the optimal solution, it will require fewer iterations to converge.
4. Sparse iterates: if the final solution is sparse, the algorithm does not necessarily need to evaluate the objective function with a dense parameter vector.

These four properties eliminate several of the available strategies. For example, the block-updates requirement eliminates coordinate descent methods such as [Fu, 1998] (such methods are extremely effective if $L(x)$ is close to separable, but their performance degrades sharply as the dependency between variables increases). The linear time/space requirement eliminates projected Newton methods like the constrained iteratively-reweighted least squares method [Lee et al., 2006c]. The warm-start requirement eliminates the possibility of applying interior point methods to a constrained re-formulation of the problem [Koh et al., 2007]. Finally, the sparse-iterates requirement also eliminates such interior point methods (since variables only become zero in the limit), and also eliminates approaches based on the expectation maximization bound optimization algorithm [Figueiredo, 2003] (since variables cannot move away from zero once they are set to zero in this framework).
Despite eliminating some alternatives, these requirements still leave a variety of methods available. Ideally, we would also like a method for ℓ1-regularized logistic regression that satisfies the following three properties:
5. Reduction to Newton's method: If at some iteration the algorithm identifies the orthant and set of non-zero variables in the optimal solution, and if it stays in this orthant with this set of non-zero variables on subsequent iterations, then the algorithm will take the same steps that an unconstrained (quasi-)Newton method applied to the non-zero variables would use to solve (2.1) over the orthant.
6. Fast modification of non-zero set: At each iteration, the algorithm is able to make many zero-valued variables non-zero. Similarly, it is able to make many non-zero variables take the value zero.
7. No increase in problem size: The algorithm is able to solve the problem in terms of the original variables, rather than solving an equivalent problem with a larger number of variables.

In [Schmidt et al., 2007a, 2009a], we review a large variety of the available methods for solving ℓ1-regularized logistic regression problems, and experimentally compare 14 of the available methods. Unfortunately, none of the algorithms discussed in these reviews satisfies all 7 of the above properties. Rather than reviewing all of the methods discussed in this prior work, in this section we only review three of the most effective methods, namely the orthant-wise learning algorithm, active-set methods, and applying the two-metric projection method to a bound-constrained re-formulation of the problem. Though very effective, each one of these strategies is deficient in one of the last three properties. After reviewing these three methods, in the next section we present extensions of these three methods that (in two of the three cases) allow the method to satisfy all 7 properties and that (in all three cases) lead to better practical performance.
2.2.1 Orthant-Wise Learning

Andrew and Gao [2007] present one of the most effective methods currently available for solving large-scale ℓ1-regularized logistic regression problems. It is based on choosing an appropriately defined steepest descent direction on (2.1) and taking a step resembling a Newton iteration in this direction (with an L-BFGS Hessian approximation). To define the steepest descent direction we note that even though $f(x)$ in (2.1) is not differentiable in general, its directional derivatives always exist (by convexity of $f(x)$). Thus, analogous to the differentiable case, we can define the steepest descent direction for $f(x)$ at a point $x$ as the direction that minimizes the directional derivative (i.e. the direction that locally decreases the objective most quickly). Closely related to this concept is what Andrew and Gao [2007] refer to as the pseudo-gradient of $f(x)$, defined as the element of the sub-differential of $f(x)$ at $x$ with minimum norm. Following an argument in [Bertsekas et al., 2003, §8.4] (replacing maximization with minimization and concavity with convexity), it follows that the steepest descent direction (in the Euclidean norm) for a convex function is the negation of this pseudo-gradient.

The sub-differential of $f(x)$ in (2.1) (denoted $\partial f(x)$) with respect to a variable $i$ is given by
\[ \partial_i f(x) = \nabla_i L(x) + \lambda_i\, \mathrm{sgn}(x_i), \tag{2.5} \]
where the set-valued function $\mathrm{sgn}(x_i)$ is defined by [see Bertsekas, 1999, Figure B.12]
\[ \mathrm{sgn}(x_i) \triangleq \begin{cases} \mathrm{sign}(x_i), & x_i \neq 0 \\ [-1, 1], & x_i = 0. \end{cases} \]
Since the sub-differential is separable in the variables, the problem of computing the minimum-norm element of the sub-differential is also separable in the variables.
Hence, we can solve the minimum-norm problem coordinate-wise to yield that the pseudo-gradient with respect to a variable $i$ is:
\[ \tilde{\nabla}_i f(x) \triangleq \begin{cases} \nabla_i L(x), & \lambda_i = 0 \\ \nabla_i L(x) + \lambda_i\, \mathrm{sign}(x_i), & \lambda_i > 0,\ |x_i| > 0 \\ \nabla_i L(x) + \lambda_i, & \lambda_i > 0,\ x_i = 0,\ \nabla_i L(x) < -\lambda_i \\ \nabla_i L(x) - \lambda_i, & \lambda_i > 0,\ x_i = 0,\ \nabla_i L(x) > \lambda_i \\ 0, & \lambda_i > 0,\ x_i = 0,\ |\nabla_i L(x)| \leq \lambda_i \end{cases} \tag{2.6} \]
In the first two cases the function is differentiable with respect to $i$, so the pseudo-gradient is simply the gradient with respect to $i$ (the only element of the sub-differential). In the last case of (2.6) the pseudo-gradient is zero, since we can set $\mathrm{sgn}(x_i)$ to $-\nabla_i L(x)/\lambda_i$ to achieve a norm of zero. In the remaining cases we obtain the minimum-norm solution by setting $\mathrm{sgn}(x_i)$ to have the opposite sign of $\nabla_i L(x)$.

The pseudo-gradient has several properties that are analogous to the gradient vector in unconstrained optimization. First, as we discuss above, the negative pseudo-gradient $-\tilde{\nabla} f(x_k)$ is in the direction that minimizes the directional derivative at $x_k$. Second, it follows from the first property that $\tilde{\nabla} f(x_k) = 0$ is a necessary and sufficient condition for an iterate $x_k$ to be a global minimum provided that $L(x)$ is a differentiable convex function.

Since the regularizer is piecewise-linear and is a simple linear function over a given orthant, we might be tempted to use the pseudo-gradient in a quadratic approximation of $f(x)$ at $x_k$, yielding the approximation
\[ q_k(x) \triangleq f(x_k) + (x - x_k)^T \tilde{\nabla} f(x_k) + \frac{1}{2}(x - x_k)^T H_k (x - x_k), \tag{2.7} \]
where $H_k$ is a positive-definite approximation of $\nabla^2 L(x)$. We could then consider minimizing this quadratic approximation to generate a search direction $d \triangleq -H_k^{-1} \tilde{\nabla} f(x_k)$, leading to iterations of the form $x_{k+1} \leftarrow x_k + \alpha d$ (where $\alpha$ is selected by a backtracking line search to satisfy the Armijo condition, with the gradient replaced by the pseudo-gradient).
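The five cases of the pseudo-gradient (2.6) collapse into two vectorized expressions, as the following sketch shows. The function name and argument layout are assumptions made for illustration; `grad_L` stands for $\nabla L(x)$ and `lam` may be a scalar or a per-variable array.

```python
import numpy as np

def pseudo_gradient(x, grad_L, lam):
    """Minimum-norm element of the sub-differential of L(x) + sum_i lam_i |x_i|."""
    x = np.asarray(x, dtype=float)
    g = np.asarray(grad_L, dtype=float)
    lam = np.broadcast_to(np.asarray(lam, dtype=float), x.shape)
    # cases lam_i = 0 and x_i != 0: ordinary partial derivative
    smooth = g + lam * np.sign(x)
    # cases with x_i = 0 and lam_i > 0: shrink grad_L toward zero by lam_i
    at_zero = np.sign(g) * np.maximum(np.abs(g) - lam, 0.0)
    return np.where((x == 0) & (lam > 0), at_zero, smooth)
```

The pseudo-gradient is the zero vector exactly at a global minimizer, so its norm can serve as a first-order optimality measure in the same way the gradient norm does for smooth problems.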
Unfortunately, there are two major problems with this approach: (i) the line search has no mechanism to set variables to exactly zero, and (ii) in general $d$ will not be a descent direction.

To address the problem that the line search does not set variables to exactly zero, Andrew and Gao [2007] set variables in $x_{k+1}$ to zero if they differ in sign from the corresponding variables in $x_k$. We use $P_O$ to denote this orthant projection applied to a parameter vector $x$ for some arbitrary direction $d$:
\[ P_O(x + d)_i \triangleq \begin{cases} 0, & \text{if } x_i (x_i + d_i) < 0 \\ x_i + d_i, & \text{otherwise.} \end{cases} \]
Applying this projection to $x_{k+1}$ is effective at sparsifying the parameter vector since it sets variables to exactly zero, and it also ensures that the line search does not cross points of non-differentiability (since the line search is truncated along some dimensions so that $x_{k+1}$ is in an orthant containing $x_k$).

There are many ways to address the problem that the method may not generate a descent direction, and the methods we discuss in this chapter will differ mainly in how they address this problem. However, the main insight used by all the methods is that $f(x)$ is differentiable over any single orthant, and that if we restrict attention to variables with non-zero pseudo-gradient then the quadratic approximation (2.7) is the truncated Taylor series expansion of the function restricted to a particular orthant. In particular, it is the truncated Taylor series expansion for the orthant containing $x_k - \epsilon \tilde{\nabla} f(x_k)$ for some extremely small $\epsilon > 0$. Thus, provided that $x_k$ does not minimize $f(x)$ over this orthant (which would imply that $x_k$ is a global optimum), using this quadratic approximation is guaranteed to yield a descent direction at $x_k$ if $x_{k+1}$ happens to lie in this 'right' orthant for sufficiently small positive $\alpha$. Since in general this will not be the case, the methods considered in this chapter will give different strategies to ensure that $x_{k+1}$ lies in the 'right' orthant for sufficiently small $\alpha$.
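The orthant projection $P_O$ defined earlier in this section amounts to a single masked assignment. The sketch below uses a hypothetical function name and assumes NumPy arrays as inputs.

```python
import numpy as np

def project_orthant(x, step):
    """P_O(x + step): coordinates whose sign would flip relative to x are set to 0."""
    x_new = x + step
    # x_i * (x_i + step_i) < 0 means the update crossed zero in coordinate i
    x_new[x * x_new < 0] = 0.0
    return x_new
```

Coordinates that start at exactly zero are never truncated by this rule (the product is zero rather than negative), so they remain free to move in either direction.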
To ensure that $x_{k+1}$ lies in the correct orthant, the sufficient condition that is used by the various methods is that the search direction $d$ agrees with the steepest descent direction $-\tilde{\nabla} f(x_k)$ for all variables that are zero [footnote 17].

The orthant-wise learning algorithm of Andrew and Gao [2007] uses a direct approach to enforce this sufficient condition. In particular, they compute $d$ and set any value $d_i$ in $d$ to zero if its sign does not agree with $-\tilde{\nabla} f(x_k)$. We use $P_S$ to denote this sign projection:
\[ P_S(d)_i \triangleq \begin{cases} d_i, & \text{if } d_i (-\tilde{\nabla}_i f(x_k)) > 0 \\ 0, & \text{otherwise.} \end{cases} \tag{2.8} \]
Andrew and Gao [2007] show that the positive-definiteness of $H_k$ implies that $P_S(d)$ will have at least one non-zero element and will represent a descent direction at $x_k$ (for sub-optimal $x_k$). Thus to generate the new iterate, the orthant-wise learning method uses steps of the form
\[ x_{k+1} = P_O[x_k + \alpha P_S[-H_k^{-1} \tilde{\nabla} f(x_k)]]. \]
As with all the algorithms we discuss in this chapter, the line search parameter $\alpha$ is chosen to satisfy the Armijo sufficient decrease condition (2.2) (but using the pseudo-gradient in place of the gradient).

The orthant-wise learning method was the most effective method in experiments on ℓ1-regularization problems by the authors [Andrew and Gao, 2007] and us [Schmidt et al., 2009b], while it was among the most effective methods in our comparison of quasi-Newton methods for ℓ1-regularized logistic regression [Schmidt et al., 2009a]. Nevertheless, we might still hope to develop a better method since the orthant-wise learning method only satisfies six of our seven criteria; it does not reduce to Newton's method on the non-zero variables. This is because of the sign projection $P_S$, which may always restrict the method to only move along a subset of the non-zero variables.

2.2.2 Active-Set Methods

Active-set methods are widely used for solving ℓ1-regularization problems. Osborne et al.
[2000] outline an active-set method that is one of the first methods proposed for optimizing a linear least-squares objective subject to a constraint on the ℓ1-norm of the parameter vector, as in (1.2). This algorithm was extended to the case of logistic regression in [Roth, 2004], while active-set methods have also been proposed for the λ formulation where we put a penalty on the ℓ1-norm of the parameter vector [Perkins et al., 2003, Lee et al., 2006a]. Methods that seek to trace the regularization path of optimal coefficients as λ is varied for the least-squares loss [Osborne et al., 2000, Efron et al., 2004] or logistic regression [Rosset, 2004, Park and Hastie, 2007] are also closely related. In this section, we discuss an active-set method that is most closely related to the method proposed in [Perkins et al., 2003].

Footnote 17: We note that this condition is trivially satisfied if the positive-definite matrix $H_k$ is diagonal, and hence a diagonal scaling or the Barzilai-Borwein approximation yields a descent direction.

In the approach of [Perkins et al., 2003], we divide the variables into two sets: the working set $W$ containing the non-zero variables, and the active set $A$ containing the zero-valued variables (here, we use the terminology active set because of the analogy with active-set methods for constrained optimization). On each iteration we only update the working set, and on a typical iteration we generate the next iterate by projecting the Newton step:
\[ x_W \leftarrow P_O[x_W - \alpha H_W^{-1} \nabla_W f(x)]. \tag{2.9} \]
Here, we have used $\nabla_W f(x)$ to denote the sub-vector of $\nabla f(x)$ corresponding to elements of $W$, and $H_W$ to denote the sub-matrix of $H_k$ with all rows and columns of $W$. Since this step is restricted to the non-zero variables, the gradient is defined and this step has the descent property (it decreases the objective function for sufficiently small $\alpha$, provided that the non-zero variables are not at their optimal value given the fixed values of the remaining variables).
Although it was not present in [Perkins et al., 2003], we have added the projection operator $P_O$ to allow the method to set variables to exactly zero and to prevent the possibility that a very small absolute value in $x_W$ will require that the line search chooses a very small value of $\alpha$.

Once we have found the optimal values for the non-zero variables given the zero-valued variables, one of two things happens. If $|\tilde{\nabla}_i f(x)| = 0$ for all $i \in A$, then we terminate with the optimal solution. Otherwise, we move a single variable from the active set to the working set. In particular, we choose the variable $i$ in $A$ with the largest pseudo-gradient magnitude $|\tilde{\nabla}_i f(x)|$ (i.e. the zero-valued variable that gives the steepest decrease in the objective), and we move this variable from $A$ to $W$. With the new working set $W$, we apply the Newton step (2.9) where we replace the gradient with the pseudo-gradient. Interestingly, this yields a descent direction. This is because the inverse scaling matrix $H_W^{-1}$ has positive diagonal entries (because $H_k$, and hence $H_W$, is positive-definite) and the pseudo-gradient with respect to the non-zero variables is zero (because they are at their conditionally optimal values), so the sign of $(H_W^{-1} \tilde{\nabla}_W f(x))_i$ will match the sign of $\tilde{\nabla}_i f(x)$ for the single zero-valued variable $i$ just added to $W$. This means that the active-set method does not need to use the (potentially harmful) $P_S$ sign projection.

It is quite clear that this active-set method satisfies our definition of reducing to Newton's method. Once we have identified the correct working set (and its sign), and if the iterates maintain this working set (and its sign), then the active-set method will essentially be applying Newton iterations to optimize the working set. Unfortunately, we achieved this property at the cost of another; the active-set method no longer allows fast changes to the set of non-zero variables.
In particular, if $k$ variables have the value zero that must be non-zero in the optimal solution, then we must perform at least $k$ iterations of the method. This can be impractical for large-scale problems with many non-zero variables. We can imagine several solutions to this problem. First, we could consider initializing all variables non-zero, but this would lead to the loss of the sparse iterates property. We could also imagine trying to move more than one variable away from 0 on each iteration. Unfortunately, as soon as we consider moving two variables away from zero in a single iteration, we can no longer guarantee that the Newton-like direction is a descent direction. We return to this latter idea in Section 2.3.

2.2.3 Two-Metric Projection

A classic strategy in linear programming for addressing ℓ1-norm minimization problems is to transform the elements of the norm into positive and negative parts, then minimize a linear function of these parts [see Bertsimas and Tsitsiklis, 1997, §1.3]. Applied to (2.1), we write each $x_i$ using new non-negative variables $x_i^+$ and $x_i^-$ as $x_i = x_i^+ - x_i^-$, leading to the problem
\[ \min_{x^+, x^-} L(x^+ - x^-) + \sum_i \lambda_i (x_i^+ + x_i^-), \quad \text{subject to } x_i^+ \geq 0,\ x_i^- \geq 0,\ \forall i. \tag{2.10} \]
This is now a smooth optimization problem with simple non-negativity constraints on the variables. This problem shares the same minimizers as (2.1). To see this, note that the range of $L(x)$ is unchanged under the $\{x^+, x^-\}$ representation. Further, the objective function is an upper bound on (2.1), since by the non-negativity constraints and the triangle inequality we have $x_i^+ + x_i^- = |x_i^+| + |-x_i^-| \geq |x_i^+ - x_i^-| = |x_i|$.
Finally, we note that the upper bound is tight at a minimizer; it must be the case that at least one of $x_i^+$ or $x_i^-$ is 0, so $|x_i| = x_i^+ + x_i^-$ (otherwise, it violates that we are at a minimizer since we could decrease the regularization term without changing the value of $L(x)$ by decreasing $x_i^+$ and $x_i^-$ by some small positive constant). In the discussion below, we use $y$ to denote the concatenation of the positive and negative parts.

There are many efficient methods available for solving smooth optimization problems with bound constraints. We outline one, the two-metric projection algorithm discussed in Gafni and Bertsekas [1982]. In the two-metric projection algorithm, we divide the variables into two sets. By analogy with the previous section, we call them the active set $A$ and the working set $W$. In this algorithm, the active set is defined as the set of variables that are sufficiently close to zero and have a positive partial derivative. In other words, the active set is defined by
\[ A \triangleq \{ i \mid y_i < \epsilon,\ \nabla_i f(y) > 0 \}. \]
The working set is the complement of this set, consisting of variables that are sufficiently non-zero or that have a negative partial derivative. In the two-metric projection method, we simultaneously take the projection of a Newton step for the working set and the projection of a diagonally-scaled gradient step for the active set. Specifically, we take the simultaneous iteration
\[ y_W \leftarrow [y_W - \alpha H_W^{-1} \nabla_W f(y)]^+, \]
\[ y_A \leftarrow [y_A - \alpha D \nabla_A f(y)]^+. \]
The diagonal matrix $D$ must be positive-definite and the function $[x]^+ \triangleq \max\{x, 0\}$ projects the variables onto the non-negative orthant. A typical choice for $D$ is the identity matrix, making the step for the active-set variables a projected-gradient step [footnote 18]. This simultaneous iteration is guaranteed to provide (feasible) descent on the objective function.
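The simultaneous iteration above can be sketched for a generic smooth objective with non-negativity constraints as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, `H` is taken to be a dense positive-definite matrix on the working set (a practical implementation would use a limited-memory approximation rather than a dense solve), and `D` is a positive scalar multiple of the identity.

```python
import numpy as np

def two_metric_step(y, grad, H, alpha=1.0, eps=1e-8, D=1.0):
    """One two-metric projection step for min f(y) subject to y >= 0.

    grad is the gradient of f at y; H approximates the Hessian of f.
    """
    active = (y < eps) & (grad > 0)            # near the bound and pushed toward it
    work = ~active
    y_new = y.copy()
    # diagonally-scaled gradient step on the active set
    y_new[active] = y[active] - alpha * D * grad[active]
    # Newton-like step on the working set, using the sub-matrix H_W
    if work.any():
        H_W = H[np.ix_(work, work)]
        y_new[work] = y[work] - alpha * np.linalg.solve(H_W, grad[work])
    # [.]^+ : project onto the non-negative orthant
    return np.maximum(y_new, 0.0)
```

On the toy problem of minimizing $\frac{1}{2}\|y - c\|^2$ over $y \geq 0$ with $H = I$, a single unit step from the origin lands on the constrained minimizer $\max\{c, 0\}$.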
Further, this iteration will reduce to Newton's method on the non-zero variables if the correct active set has been identified, and it allows us to move many variables between the zero and non-zero sets at each iteration.

Footnote 18: It is referred to as a two-metric projection method because we can write it as a scaled projected gradient step, but where we project using the Euclidean norm rather than a quadratic norm defined by the Hessian approximation.

Unfortunately, this method has the obvious drawback that we have increased the problem size. A more subtle potential disadvantage of this method lies in the re-formulation of the problem. In particular, the Hessian of the re-formulation is necessarily singular (it contains columns that differ only in sign), even if the Hessian of the original problem is positive-definite. This might indicate that the re-formulation may be more difficult to solve than the original problem.

2.3 Projected Scaled Sub-Gradient

We now consider extensions of the three methods above that alleviate each of their disadvantages, and in two of the cases give us methods that satisfy all 7 properties that we would like of an optimizer. We call these methods projected scaled sub-gradient (PSS) methods, because the iterations can be written as the projection of a scaling of a sub-gradient of the objective. In particular, the PSS methods use specific choices of the sub-gradient, scaling, and projection:
• For the projection, we use the (Euclidean) orthant projection $P_O$.
• For the sub-gradient, we use the pseudo-gradient $\tilde{\nabla} f(x)$.
• For the scaling, we require that the scaling leads to the correct orthant (i.e. among variables currently set to zero, no element of the scaled direction has opposite sign from the negative pseudo-gradient).
• For the scaling, we also require that it is positive-definite with respect to some subset of the variables with non-zero pseudo-gradient (and zero with respect to the remaining rows).
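The simplest scaling that meets the two requirements in the list above is a positive multiple of the identity, since a diagonal scaling never changes the sign of any element of the pseudo-gradient. The sketch below shows one such PSS step; it is an illustration rather than one of the specific variants developed in this chapter, and the function names and the scalar `sigma` (for example, a value computed from (2.4)) are assumptions.

```python
import numpy as np

def pseudo_gradient(x, grad_L, lam):
    """Minimum-norm sub-gradient of L(x) + lam * ||x||_1 (lam > 0 scalar)."""
    smooth = grad_L + lam * np.sign(x)
    at_zero = np.sign(grad_L) * np.maximum(np.abs(grad_L) - lam, 0.0)
    return np.where(x == 0, at_zero, smooth)

def pss_diagonal_step(x, grad_L, lam, sigma, alpha=1.0):
    """P_O[x - (alpha / sigma) * pseudo-gradient]: projection, scaling, sub-gradient."""
    d = -pseudo_gradient(x, grad_L, lam) / sigma   # scaled sub-gradient direction
    x_new = x + alpha * d
    x_new[x * x_new < 0] = 0.0                     # orthant projection P_O
    return x_new
```

As a sanity check, for $L(x) = \frac{1}{2}\|x - c\|^2$ (whose Hessian is the identity, so `sigma = 1`) a unit step from the origin returns the soft-thresholded vector, which is the exact minimizer of the regularized problem.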
Because we use a scaling matrix but project under the Euclidean norm, these methods could alternately be called two-metric sub-gradient projection methods. Both the orthant-wise learning and active-set methods of the previous section satisfy the four properties above. These properties ensure that the method generates descent directions and, although we omit detailed convergence proofs, ensure convergence under fairly weak conditions. This can be shown by suitable modifications of the arguments made for gradient-related methods [Bertsekas, 1999, Proposition 1.2.1], as in [Andrew and Gao, 2007].

2.3.1 Gafni-Bertsekas Variant

The first variant we consider is analogous to the two-metric projection idea, but applied to $\ell_1$-regularization instead of bound constraints. We refer to this as the PSS Gafni-Bertsekas variant (PSSgb). In this method, we define the working set as those variables that are sufficiently non-zero:

$$W \triangleq \{i \mid |x_i| > \epsilon\}.$$

As usual, the active set is defined as the complement of this set. Similar to the two-metric projection method, we perform a simultaneous iteration where we take a projection of the Newton step along the working set and a diagonally-scaled projected pseudo-gradient step for the active-set variables:

$$x_W \leftarrow P_O[x_W - \alpha H_W^{-1} \nabla_W f(x)],$$
$$x_A \leftarrow P_O[x_A - \alpha D \tilde{\nabla}_A f(x)].$$

Provided we use a positive diagonal scaling $D$, this combined direction is guaranteed to be a descent direction (i.e., it has a negative directional derivative) unless $x$ is optimal. To see this, note that a sub-optimal $x$ must have at least one non-zero element in $\nabla_W f(x)$ or $\tilde{\nabla}_A f(x)$. If we have a non-zero element in $\nabla_W f(x)$, then by positive-definiteness of $H_W$ it follows that the contribution to the directional derivative of moving the variables in $W$ in the direction $-H_W^{-1} \nabla_W f(x)$ is negative (and similarly, if $\nabla_W f(x) = 0$ then the contribution to the directional derivative is zero).
If we have a non-zero element in $\tilde{\nabla}_A f(x)$, it similarly follows that the component of the directional derivative with respect to the zero-valued variables is negative (while if $\tilde{\nabla}_A f(x) = 0$ then this contribution is zero). This is because $-\tilde{\nabla}_i f(x)$ is in the direction that coordinate-wise minimizes the directional derivative, so all non-zero elements in $\tilde{\nabla}_A f(x)$ correspond to variables that have a negative contribution to the directional derivative (while variables with a pseudo-gradient of 0 contribute a value of zero to the directional derivative). Consequently, in either case the overall directional derivative is negative and the combined direction is a descent direction.

Note that after identifying the correct set of non-zero variables, these iterations will perform Newton steps on the non-zero variables. Further, many variables can be made zero/non-zero at each iteration, and we have not increased the size of the problem. Thus, we have a simple algorithm that achieves all 7 properties.

In the two-metric projection algorithm, the choice of the diagonal scaling matrix $D$ does not have a significant effect on the performance of the algorithm (it simply controls the rate at which very small variables move towards zero). However, the choice of $D$ in the PSSgb algorithm can have a significant effect on the performance of the method: if $D$ is too large we may need to perform several backtracking steps before the step length is accepted, while too small a value will require many iterations to move variables from the active set to the working set. In our implementation of the method, we compute the Shanno-Phua/Barzilai-Borwein scaling $\sigma_k$ of the variables given by (2.4), and set $D$ to $\sigma_k^{-1} I$.

2.3.2 Sign Constraint Variant

The orthant-wise learning algorithm does not satisfy the property of reducing to Newton's method because of the $P_S$ sign projection.
The problem with this projection is that it may set elements of the Newton direction to zero for a large portion of the zero and non-zero variables. However, note that to lie in the correct orthant (to guarantee descent) we only require that the zero-valued variables in the search direction agree with the negative pseudo-gradient sign. Thus, in the PSS sign projection (PSSsp) variant we apply the orthant-wise learning iteration but use a less constrained version of the $P_S$ sign projection. Specifically, we use

$$P_S^*(d)_i \triangleq \begin{cases} d_i, & \text{if } (x_k)_i \neq 0 \text{ or } d_i(-\tilde{\nabla}_i f(x_k)) > 0, \\ 0, & \text{otherwise.} \end{cases}$$

The only difference between the $P_S^*$ projection and the $P_S$ projection (2.8) is the presence of the condition "if $(x_k)_i \neq 0$", which stops us from unnecessarily applying the sign projection to non-zero variables. The method still does not reduce to Newton's method because the zero-valued variables still affect the search direction for the non-zero variables. However, this simple modification allows greater use of the Hessian approximation for the non-zero variables and leads to improved practical performance.

2.3.3 Active-Set Variant

In the final PSS variant, the PSSas method, we extend the active-set method of Section 2.2.2 above so that it can add more than one variable to the working set at each iteration. In this method, we augment the simple working set $W$ above with the $k$ zero-valued variables with the largest (non-zero) pseudo-gradient magnitudes to obtain a new working set $W^k$. We then compare the signs of the elements of the pseudo-gradient to the signs in the product $H_{W^k}^{-1} \tilde{\nabla}_{W^k} f(x)$. If the signs for all $k$ zero-valued variables agree, then the working set is valid and we simply take the active-set step given by (2.9), but using $W^k$ and the pseudo-gradient. If any of the signs disagree, then we must find a different value for $k$ that satisfies this property ($k$ can range from 0 up to the number of zero-valued variables that have non-zero pseudo-gradient).
We would like $k$ to be as large as possible, but cannot test too many values because testing each value of $k$ costs $O(mp)$. For example, a naive approach is to start with $k = 0$ (which is always valid), and test each $k$ in increasing order until we find an invalid value (then accept $k - 1$). This would lead to a worst-case iteration cost of $O(mp^2)$, making it unsuitable for large-scale problems. In the PSSas method, we do a binary search for a $k$ such that $k$ is valid and $k + 1$ is invalid. This finds a value of $k$ that is at least as large as the one found by the naive method, but has a worst-case iteration cost of $O(mp \log p)$, only slightly higher than the iteration cost of the other methods we discuss in this section (see footnote 19).

Footnote 19: We could re-gain the linear-time iteration cost by using a constant upper bound on $k$.

2.4 Implementation

To make the PSS algorithms concrete, we now give pseudo-code for the PSS methods. Algorithm 1 outlines a general framework that can be used as a basis for implementing a PSS method.
Algorithm 1: PSS framework for $\ell_1$-regularized optimization.

Input: function $L(x)$, regularization parameters $\lambda_i$, initial parameter vector $x_0$, optimality tolerance $\epsilon$, number of corrections $m$, sufficient decrease parameter $\eta$, line search safeguard parameters $\xi_1$ and $\xi_2$, direction calculation function dir$(k, g, \sigma, S, Y)$

    $k \leftarrow 0$; $S \leftarrow [\,]$; $Y \leftarrow [\,]$    // initialize collection of quasi-Newton vectors
    $f_k \leftarrow L(x_0) + \|\lambda \bullet x_0\|_1$    // evaluate initial parameter vector
    $g_k \leftarrow \tilde{\nabla} f(x_0)$    // compute pseudo-gradient, see (2.6)
    while $\|g_k\|_\infty > \epsilon$ do
        $d_k \leftarrow$ dir$(k, g, \sigma, S, Y)$    // compute algorithm-specific descent direction
        $\alpha \leftarrow 1$    // initial step length
        $x_{k+1} \leftarrow P_O(x_k + \alpha d_k)$    // initial trial value
        $f_{k+1} \leftarrow L(x_{k+1}) + \|\lambda \bullet x_{k+1}\|_1$    // evaluate new parameter vector
        $g_{k+1} \leftarrow \tilde{\nabla} f(x_{k+1})$    // compute pseudo-gradient, see (2.6)
        while $f_{k+1} > f_k + \eta g_k^T (x_{k+1} - x_k)$ do
            select $\alpha \in (\xi_1 \alpha, \xi_2 \alpha)$    // safeguarded cubic interpolation
            $x_{k+1} \leftarrow P_O(x_k + \alpha d_k)$    // next trial value
            $f_{k+1} \leftarrow L(x_{k+1}) + \|\lambda \bullet x_{k+1}\|_1$
            $g_{k+1} \leftarrow \tilde{\nabla} f(x_{k+1})$
        $s_k \leftarrow x_{k+1} - x_k$; $y_k \leftarrow g_{k+1} - g_k$    // compute quasi-Newton differences
        if $k > m$ then remove oldest vector from $S$ and $Y$
        $S \leftarrow [S \; s_k]$; $Y \leftarrow [Y \; y_k]$    // update quasi-Newton difference matrices
        $\sigma \leftarrow (y_k^T y_k)/(y_k^T s_k)$    // update diagonal Hessian scaling
        $k \leftarrow k + 1$

A practical implementation would be slightly more complicated than Algorithm 1, because typically we want to impose iteration or function evaluation limits, and we implement checks that assess whether sufficient progress continues to be made. Note that we compute the quasi-Newton vectors based on the pseudo-gradient rather than the differences in $\nabla L(x)$ as in [Andrew and Gao, 2007], since we found this gave better performance. Although it may seem counterintuitive to include the non-smooth component as part of the Hessian approximation, there has been some empirical work showing that the BFGS approximation may be effective for certain types of non-smooth problems [Lewis and Overton, 2008].
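As a rough illustration of the outer loop in Algorithm 1, the following sketch keeps only the projected step and the sufficient-decrease test. It is deliberately simplified and makes several assumptions that are not part of the algorithm as stated: the quasi-Newton bookkeeping ($S$, $Y$, $\sigma$) is omitted, step halving replaces the safeguarded cubic interpolation, the pseudo-gradient and $P_O$ take the assumed forms described in the docstrings, and the `direction(k, g)` callback is a hypothetical stand-in for dir$(k, g, \sigma, S, Y)$.

```python
def sign(v):
    return (v > 0) - (v < 0)


def pseudo_gradient(x, grad_L, lam):
    """Assumed form of (2.6): minimum-norm sub-gradient of L(x) + sum lam|x|."""
    g = []
    for xi, gi, li in zip(x, grad_L, lam):
        if xi != 0:
            g.append(gi + li * sign(xi))
        elif gi < -li:
            g.append(gi + li)
        elif gi > li:
            g.append(gi - li)
        else:
            g.append(0.0)
    return g


def pss_minimize(L, grad_L, lam, x0, direction, tol=1e-6, eta=1e-4, max_iter=100):
    """Simplified PSS loop: projected step with Armijo backtracking by halving."""
    def f(x):
        return L(x) + sum(li * abs(xi) for li, xi in zip(lam, x))

    x = list(x0)
    fx = f(x)
    g = pseudo_gradient(x, grad_L(x), lam)
    for k in range(max_iter):
        if max(abs(gi) for gi in g) <= tol:  # optimality test on the pseudo-gradient
            break
        d = direction(k, g)
        alpha = 1.0
        while True:
            trial = [xi + alpha * di for xi, di in zip(x, d)]
            # Orthant projection: zero coordinates that changed sign relative to x.
            x_new = [0.0 if xi * ti < 0 else ti for xi, ti in zip(x, trial)]
            f_new = f(x_new)
            slope = sum(gi * (ni - xi) for gi, ni, xi in zip(g, x_new, x))
            if f_new <= fx + eta * slope or alpha < 1e-12:
                break
            alpha *= 0.5  # halving stands in for safeguarded cubic interpolation
        x, fx = x_new, f_new
        g = pseudo_gradient(x, grad_L(x), lam)
    return x, fx


# Toy problem (hypothetical data): L(x) = 0.5 * ||x - b||^2, whose l1-regularized
# minimizer is the soft-thresholding of b.
b = [3.0, 0.5, -2.0]
L = lambda x: 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
grad_L = lambda x: [xi - bi for xi, bi in zip(x, b)]
steepest = lambda k, g: [-gi for gi in g]  # negative pseudo-gradient direction

x_opt, f_opt = pss_minimize(L, grad_L, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], steepest)
print(x_opt, f_opt)
```

Swapping `steepest` for a callback that applies an L-BFGS solve and a sign projection would move this sketch closer to the methods compared in this chapter; the loop itself would not need to change.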
We have left the calculation of the descent direction ($d_k$) unspecified in this pseudo-code, because this is the primary difference between the PSS methods. In particular, the orthant-wise learning method uses Algorithm 2 to calculate the descent direction (the PSSsp direction calculation is identical, but using the $P_S^*$ sign projection).

Algorithm 2: Direction calculation in orthant-wise learning.

Input: iteration number $k$, pseudo-gradient $g$, scaling $\sigma$, quasi-Newton matrices $S$ and $Y$
Output: descent direction $d$

    if $k = 0$ then
        $d \leftarrow -\min\{1, 1/\|g\|_1\}\, g$
    else
        $d \leftarrow -H^{-1} g$    // apply L-BFGS algorithm using $g$, $S$, $Y$, and $\sigma$
        $d \leftarrow P_S(d)$    // sign projection, see (2.8)

In this algorithm, we use $\min\{1, 1/\|g\|_1\}$ as the scaling of the gradient step on the first iteration. For logistic regression, this heuristic tends to make the step small enough that it is typically accepted without the need to backtrack. Algorithm 3 outlines the descent direction calculation used in the active-set method.

Algorithm 3: Direction calculation in active-set method.

Input: iteration number $k$, pseudo-gradient $g$, scaling $\sigma$, quasi-Newton matrices $S$ and $Y$
Output: descent direction $d$

    $d \leftarrow 0$
    $W \leftarrow \{i \mid \lambda_i = 0\} \cup \{i \mid x_i \neq 0\}$    // default working set
    if $\|g_W\|_\infty < \epsilon$ then
        $W \leftarrow W \cup \{i \mid i = \arg\max_j |g_j|\}$    // add variable to working set
    if $k = 0$ then
        $d_W \leftarrow -\min\{1, 1/\|g_W\|_1\}\, g_W$
    else
        $d_W \leftarrow -H_W^{-1} g_W$    // apply L-BFGS algorithm using $g$, $S$, $Y$, and $\sigma$

Algorithm 4 outlines the descent direction calculation in the PSSgb method. We note that the descent direction calculation for the two-metric projection method is similar.
Algorithm 4: Direction calculation in PSSgb.

Input: iteration number $k$, pseudo-gradient $g$, scaling $\sigma$, quasi-Newton matrices $S$ and $Y$
Output: descent direction $d$

    if $k = 0$ then
        $d \leftarrow -\min\{1, 1/\|g\|_1\}\, g$
    else
        $W \leftarrow \{i \mid \lambda_i = 0\} \cup \{i \mid x_i \neq 0\}$
        $A \leftarrow W^c$    // active set is complement of working set
        $d_A \leftarrow -\sigma^{-1} g_A$    // take steepest descent direction on active set
        $d_W \leftarrow -H_W^{-1} g_W$    // apply L-BFGS algorithm using $g$, $S$, $Y$, and $\sigma$

Finally, Algorithm 5 outlines the descent direction calculation for the PSSas method. Within the binary search, $k$ denotes the number of zero-valued variables added to the working set rather than the iteration number.

Algorithm 5: Direction calculation in PSSas.

Input: iteration number $k$, pseudo-gradient $g$, scaling $\sigma$, quasi-Newton matrices $S$ and $Y$
Output: descent direction $d$

    $d \leftarrow 0$
    $W \leftarrow \{i \mid \lambda_i = 0\} \cup \{i \mid x_i \neq 0\}$    // default working set
    if $k = 0$ then
        $d_W \leftarrow -\min\{1, 1/\|g_W\|_1\}\, g_W$
    else
        $d_W \leftarrow -H_W^{-1} g_W$    // default direction ($k = 0$)
        $LB \leftarrow 0$    // $k = 0$ is always legal
        $UB \leftarrow 1 + |\{i \mid x_i = 0 \text{ and } g_i \neq 0\}|$    // $k$ cannot be greater than the number of zero-valued variables with non-zero pseudo-gradient
        while $UB - LB \neq 1$ do
            $k \leftarrow \lfloor (UB + LB)/2 \rfloor$    // new value for $k$
            $W^k \leftarrow W \cup \{i \mid i \text{ among largest } k \text{ values of } |g_i| \text{ for } x_i = 0\}$
            $d^k \leftarrow 0$
            $d^k_{W^k} \leftarrow -H_{W^k}^{-1} g_{W^k}$
            if $\mathrm{sgn}(d^k)_i = \mathrm{sgn}(g)_i$ for some $i$ with $x_i = 0$ then
                $UB \leftarrow k$    // this is not a valid value of $k$
            else
                $LB \leftarrow k$; $d \leftarrow d^k$    // largest valid value of $k$ found so far

We close this section by noting a final subtle but important implementation detail. On iterations where the working set changes, it is not immediately obvious how the L-BFGS update should be defined. For example, one possible strategy would be to reset the L-BFGS approximation every time the working set changes. Unfortunately, this excludes the possibility that curvature information gathered with a related working set might still be useful. In our implementation, we store the vector and gradient differences from the previous $m$ iterations for all variables, and define the L-BFGS update based on the differences in the working set variables that satisfy

$$(s_k)_W^T (y_k)_W > \epsilon. \qquad (2.11)$$

This curvature condition is sufficient to guarantee that the quasi-Newton matrix is positive-definite [see Nocedal and Wright, 1999, §8.1]. In our implementation, we compute $\sigma_k$ based on the differences in all variables.

2.5 Regularization Path and Active-Set Optimization

Hoerl and Kennard [1970] introduced the concept of a ridge trace, a plot of the optimal coefficients in an $\ell_2$-regularized least-squares model as the regularization parameter $\lambda$ is varied. More recently, there has been substantial interest in similar plots for $\ell_1$-regularized coefficients as $\lambda$ is varied, for both least-squares [Osborne et al., 2000, Efron et al., 2004] and logistic regression [Rosset, 2004, Park and Hastie, 2007]. In this section we consider the calculation of multiple points along this regularization path. That is, we would like to solve (2.1) for a set of values of $\lambda$ rather than a fixed value.

When solving (2.1) for multiple values of $\lambda$, as with many other algorithms we can improve the performance of the PSS algorithms by using a warm-start strategy: we reduce the number of PSS iterates needed by initializing the iterates using the solution for a closely related value of $\lambda$. This allows us to solve the optimization problem for a set of values of $\lambda$ more efficiently than if we ran the optimizer independently for each value of $\lambda$.

In addition to taking advantage of warm-starting (which is also possible with $\ell_2$-regularization), for $\ell_1$-regularization we can take advantage of the sparsity of the coefficients along the regularization path to solve the $\ell_1$-regularized problem for a sequence of values of $\lambda$ at a much smaller cost than if we were using $\ell_2$-regularization. This idea is very important in Chapters 5 and 6, where it leads to an exponential speed-up for larger values of $\lambda$, but we introduce it here since it still yields modest computational gains in the case of logistic regression.
Consider the following set of necessary and sufficient conditions for a vector $x$ to be a minimizer of $f(x)$ for given values of $\lambda_i$:

$$\nabla_i L(x) + \lambda_i\, \mathrm{sign}(x_i) = 0, \quad |x_i| > 0,$$
$$|\nabla_i L(x)| \le \lambda_i, \quad x_i = 0. \qquad (2.12)$$

These conditions are equivalent to the necessary and sufficient optimality condition of requiring the zero-vector to be an element of the sub-differential [Bertsekas, 1999, §B.5], or equivalently that the pseudo-gradient is the zero vector. Rather than applying a PSS method to all the variables, we could apply it to the set of non-zero variables combined with the variables satisfying $|\nabla_i L(x)| > \lambda_i$. Once we have computed the optimal solution restricted to this set, we update the set of variables and repeat the optimization. That is, we alternate between two steps:

• Find variables $i$ such that $x_i \neq 0$, or $x_i = 0$ and $|\nabla_i L(x)| > \lambda_i$.
• Solve the problem with respect to these variables.

If the set of variables to optimize does not change between iterations of this procedure, then the parameter vector satisfies (2.12) and hence is globally optimal (if it does change, then we are at a sub-optimal solution and we must continue to loop between these two steps). This is essentially the same active-set procedure discussed in Hofling and Tibshirani [2009]; note that we can use an approximate solution in the second step provided we eventually solve the problem to optimality.

If we consider beginning with a sufficiently large value of $\lambda$ and running this procedure with a decreasing sequence of $\lambda$ values (as is done in [Park and Hastie, 2007]), then most of the iterations for large values of $\lambda$ will only be run on a small subset of the variables. This is in contrast to the case of $\ell_2$-regularization, where all variables are non-zero for all values of $\lambda$ and we would need to solve the problem explicitly with respect to all variables for all values of the regularization parameter.
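The two-step procedure and its use along a decreasing sequence of $\lambda$ values can be sketched as follows. This is only an illustration under stated assumptions: `solve_restricted` is a hypothetical user-supplied routine that minimizes over the selected variables while holding the others fixed (in practice a PSS method would play this role), and the toy loss $L(x) = \frac{1}{2}\|x - b\|^2$ is chosen because its restricted minimizer has a closed form.

```python
def free_set(x, grad_L, lam):
    """Step 1: variables that are non-zero, or zero with |grad_i L(x)| > lam_i."""
    return {i for i, (xi, gi, li) in enumerate(zip(x, grad_L, lam))
            if xi != 0 or abs(gi) > li}


def active_set_solve(grad_L, solve_restricted, lam, x0, max_outer=50):
    """Alternate between selecting variables and solving on them.

    Stops when the selected set no longer changes, which is the condition
    under which the text states that (2.12) holds.
    """
    x = list(x0)
    previous = None
    for _ in range(max_outer):
        free = free_set(x, grad_L(x), lam)
        if free == previous:
            break
        x = solve_restricted(x, free, lam)  # step 2 (hypothetical solver)
        previous = free
    return x


def regularization_path(grad_L, solve_restricted, lambdas, n):
    """Solve for decreasing lambda, warm-starting from the previous solution."""
    x = [0.0] * n
    path = []
    for lam_value in sorted(lambdas, reverse=True):
        x = active_set_solve(grad_L, solve_restricted, [lam_value] * n, x)
        path.append((lam_value, list(x)))
    return path


# Toy problem (hypothetical data): L(x) = 0.5 * ||x - b||^2.
b = [3.0, 0.5, -2.0]
grad_L = lambda x: [xi - bi for xi, bi in zip(x, b)]


def solve_restricted(x, free, lam):
    """Closed-form restricted minimizer for the toy loss (soft-thresholding)."""
    out = list(x)
    for i in free:
        magnitude = max(abs(b[i]) - lam[i], 0.0)
        out[i] = magnitude if b[i] > 0 else -magnitude
    return out


for lam_value, x in regularization_path(grad_L, solve_restricted, [1.0, 2.0, 4.0], 3):
    print(lam_value, x)
```

For the largest value of $\lambda$ in this toy run no variable is ever selected, so the restricted solve does no work; variables enter the selected set only as $\lambda$ decreases.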
The optimality conditions also allow us to determine the values of the $\lambda_i$ variables where all of the variables that are subject to regularization are set to zero. For the $\ell_1$-regularized logistic regression problem, it follows from (2.12) that the value of $\lambda$ that sets all regression weights to zero is

$$\lambda_{\max} \triangleq \max_i |\nabla_{w_i} L(0, \tilde{b})|,$$

where $\tilde{b}$ is the optimal value of the bias parameter subject to the constraint that $w = 0$. In general, we can compute $\lambda_{\max}$ by optimizing with respect to the unregularized variables, and then finding the maximum gradient magnitude.

2.6 Experiments

We compared the performance of several large-scale optimization methods for $\ell_1$-regularized logistic regression. In particular, we compared the following methods:

• OWL: the orthant-wise learning method we discuss in 2.2.1.
• AS: the active-set method we discuss in 2.2.2.
• TMP: the two-metric projection method we discuss in 2.2.3.
• PSSgb: the projected scaled sub-gradient method (Gafni-Bertsekas variant) proposed in 2.3.1.
• PSSsp: the projected scaled sub-gradient method (sign projection variant) proposed in 2.3.2.
• PSSas: the projected scaled sub-gradient method (active-set variant) proposed in 2.3.3.
• BBSG: a Barzilai-Borwein sub-gradient method where we move along the negative pseudo-gradient with the step length given by (2.4), and project the iterates using the $P_O$ operator.
• SPG: applying a spectral projected gradient method to the bound-constrained formulation (2.10), similar to [Figueiredo et al., 2007].
• BBST: applying the iterative soft-thresholding algorithm with the step length given by (2.4), similar to [Wright et al., 2009].
• DSST: applying a diagonally scaled soft-thresholding algorithm, similar to [Hofling and Tibshirani, 2009].
• OPG: applying Nesterov's optimal projected gradient method to the bound-constrained formulation, using the adaptive line-search suggested in [Liu et al., 2009].
For a comparison to other methods on some small-scale problems, see [Schmidt et al., 2007a, 2009a]. The list above contains three of the most effective methods in this previous work, the new PSS methods, as well as five very effective newer methods that were not included in the previous comparisons. The first six methods use L-BFGS updates and a backtracking line search using the Armijo condition. The BBSG/SPG/BBST methods use the Barzilai-Borwein step length [Barzilai and Borwein, 1988] with a backtracking line search using the non-monotone Armijo condition [Grippo et al., 1986, Raydan, 1997]. The DSST method uses a diagonal scaling, using the inverse of the diagonals of the Hessian (this is the only method that explicitly computes second-order information). Finally, the OPG method uses Nesterov's optimal worst-case gradient method for optimizing differentiable objectives over simple convex sets [Nesterov, 2004, §2.2.4], augmented with the adaptive line search and Lipschitz estimation procedure discussed in [Liu et al., 2009].

2.6.1 Logistic Regression

We tested the methods on the three binary classification data sets from Section 1.7. We measured the performance of the methods in terms of the objective value achieved against the number of function evaluations used by the methods. The termination criterion for all methods was that the infinity norm of the pseudo-gradient was less than $10^{-5}$, or that the change in objective value between successive iterations, the change in parameter values between successive iterations, or the directional derivative of the descent direction was below $10^{-9}$. We set the initial step size of all the methods to $1/\min\{1, 1/\|\tilde{\nabla} f(w)\|_1\}$ (except for the OPG method where we used $1/n$ as used in the code of [Liu et al., 2009], which we found gave slightly better performance). We set the sufficient decrease parameter $\eta$ in the line search to $10^{-4}$, and the safeguard parameters for projecting the cubic interpolation $\{\xi_1, \xi_2\}$ to 0.001 and 0.6.
For the methods based on an L-BFGS approximation of the Hessian, we stored 10 previous parameter and gradient vectors. For the methods that use the non-monotonic Armijo condition, we set the number of previous function values to store at 10. For the OWL method, we used a quadratic initialization of the line search [Nocedal and Wright, 1999, §3.4] since we found this gave better performance than initializing the line search with α = 1. In our experiments, we set λi to 1 for all xi (except the bias, where λi was set to zero). Optimizing with this relatively small value of λ still leads to sparse solutions, and it makes the optimization difficult enough that we can see a difference in performance between the methods. For larger values of λ, the performance of the methods becomes more similar, while the relative performance of the methods for smaller values of λi is similar to the performance with each λi set to 1. We tested two choices for the initial parameter vector: (i) we initialized with the zero vector (cold-start), and (ii) we initialized with the solution for λ = 2 (warm-start). We estimated the optimal value of the objective function, f ∗ , in our experiments by taking the lowest objective value found across the methods. In Figures 2.1 (cold-start) and 2.2 (warm-start), we plot the logarithm of the objective function minus f ∗ and the number of non-zero variables against the number of function evaluations for the PSSgb method and the five methods that are not based on an L-BFGS approximation. In these plots, we plot the minimum objective value found by each method rather than the function value at each evaluation (the methods may explore higher values if backtracking is required, while the non-monotonic SPG/BBST/BBSG/OPG methods may spend multiple iterations exploring higher values). In these figures, we see that the PSSgb method outperforms the methods that are not based on L-BFGS. 
In particular, it obtains a lower objective value (for the same number of function evaluations), it more quickly identifies the correct set of non-zero variables, and it terminates earlier. Comparing the two figures, we see that the performance of the methods is closer if we use warm-starting, but that the PSSgb method still outperforms the other methods based on this measure. Among the three methods based on Barzilai-Borwein steps, the BBSG method was the most effective, the SPG method was the least effective, while the BBST method tended to be intermediary. There was no clearly superior method between the OPG method, the DSST method, and the methods based on the Barzilai-Borwein steps. In Figures 2.3 (cold-start) and 2.4 (warm-start), we focus on the six methods that are based on an L-BFGS approximation. In these figures, we see that the three new PSS methods were the three most effective strategies across all experiments. The TMP method also does reasonably well on two of the data sets, but does poorly on the thrombin data. The OWL method was effective at initially driving down the objective function, but its progress slowed down on later iterations (presumably because the PS operator slowed down the local convergence rate). The AS method tended to perform very poorly except on the thrombin data set, presumably because the final solution was more sparse on this data set. Finally, we note that the PSSgb method seemed to be the least effective among the three new PSS methods, while the PSSas method was the most effective method in most scenarios. 
[Figure 2.1: Function evaluations against objective value and number of non-zero coefficients for logistic regression ($\lambda = 1$) with $\ell_1$-regularization for different optimization strategies (PSSgb, BBSG, BBST, OPG, SPG, DSST) initialized with the zero vector. Top to bottom: sido data, thrombin data, and spam data. Panels plot "Objective Value minus Optimal" (log scale) and "Number of Non-Zero Coefficients" against "Function Evaluations". This figure is best viewed in color.]

[Figure 2.2: The same experiment as Figure 2.1, but using the optimal solution for $\lambda = 2$ as the starting vector.]

[Figure 2.3: The same experiment as Figure 2.1, but focusing on methods that are based on L-BFGS (PSSas, PSSgb, PSSsp, OWL, TMP, AS).]
[Figure 2.4: The same experiment as Figure 2.3, but using the optimal solution for $\lambda = 2$ as the starting vector.]

2.6.2 Ising Graphical Models

We next tested the various optimization methods on the task of estimating $\ell_1$-regularized IGMs. We did this to see if the trends observed in the logistic regression experiments carry over to related loss functions, and in particular if they carry over to using $\ell_1$-regularization for structure learning in log-linear models for the special case where there is a direct correspondence between edges and parameters. Thus, we examined fitting $\ell_1$-regularized IGMs to the cyto and awma data sets (where it is possible to compute the exact IGM objective) from Section 1.7. In these experiments we set the value of $\lambda$ to 50, yielding a sufficiently difficult optimization problem that differences between the methods become apparent (for larger values of $\lambda$ the methods are all very effective).
We used essentially the same experimental set-up as in the case of logistic regression, with the following modifications: (i) we did not test the DSST method since even computing the diagonals of the Hessian is prohibitively expensive, and (ii) for the OPG method we used the same initial step length that the other methods used (we found that using $1/n$ gave poor performance for estimating IGMs).

In Figure 2.5, we plot the performance of the PSSgb method against the methods that are not based on L-BFGS, where we initialize with the zero vector and with the solution for $\lambda = 100$. As in the logistic regression experiments, the PSSgb method dominates the methods that are not based on L-BFGS. Further, we again see that BBSG is the best and SPG is the worst among the three methods based on the Barzilai-Borwein step size (SPG, BBST, and BBSG). However, unlike the logistic regression experiments, we see in these experiments that the methods based on the Barzilai-Borwein step outperform the OPG method. Although it is possible that better performance could be obtained with the OPG method with a different choice of initial step size, we note that the performance of the other methods does not have this strong dependence on the initial step size.

In Figure 2.6, we plot the performance of the methods based on L-BFGS. In this plot, we again see that the new PSS methods typically outperform the other methods. The one exception to this was on the awma data with the warm-start, where the AS method proved very effective (since the set of non-zero variables did not change much between $\lambda = 100$ and $\lambda = 50$). However, in the other scenarios the AS method is dominated by the new PSS methods.

2.7 Extensions

In this section, we consider several straightforward extensions of the work we describe in this chapter.

2.7.1 Other Objective Functions

We have presented an efficient large-scale optimization method for $\ell_1$-regularized logistic regression.
However, the only assumption needed in order to use this method is that the function we want to optimize with $\ell_1$-regularization is differentiable and convex. We can further relax the assumption of convexity if we concede that the algorithm may find a local minimum that is not also a global minimum. Thus, we can apply this optimization algorithm in a wide variety of other scenarios.

[Figure 2.5: Function evaluations against objective value for training IGMs ($\lambda = 50$) with $\ell_1$-regularization for different optimization strategies (PSSgb, BBSG, BBST, OPG, SPG). Top row: cyto data. Bottom row: awma data. Left column: zero vector used for initialization. Right column: solution with $\lambda = 100$ used for initialization. Panels plot "Objective Value minus Optimal" (log scale) against "Function Evaluations". This figure is best viewed in color.]

Besides the obvious problem of learning dependency networks with logistic regression conditionals (or other CPDs we discuss in Chapter 4), below we list several applications to structure learning:

• Solving the graphical LASSO in the primal: Most current methods for solving the graphical LASSO optimization problem (1.6) solve a Lagrangian dual of the optimization problem [Banerjee et al., 2006, Friedman et al., 2008, Duchi et al., 2008a]. A potential disadvantage of working with the dual formulation is that the dual parameters are not sparse. In our experiments in [Marlin et al., 2009], we used the PSSgb algorithm to directly solve the graphical LASSO problem in the primal.
Since the PSS iterations tend to be sparse, this lets us take advantage of techniques for sparse Cholesky factorizations [Rue and Held, 2005, §2.4] to efficiently evaluate the objective function in (1.6).

• Sparse Conditional Random Fields: Conditional random fields are a class of log-linear models augmented with covariates, and they represent a natural generalization of logistic regression to the case where we have multiple target variables [Lafferty et al., 2001] (we discuss this type of model in more detail in Section 5.7). Goodman [2004] shows that training conditional random fields with ℓ1-regularization offers improved performance in several natural language processing applications. The PSS algorithms can easily be applied in the case of conditional random fields (indeed, we do this in Chapter 5) to learn a sparse set of node and edge features.

[Figure 2.6: The same experiment as Figure 2.5, but focusing on methods based on an L-BFGS approximation (PSSas, PSSgb, PSSsp, OWL, TMP, AS).]

• Sparse neural networks: Neural networks are a type of model that is widely used for non-linear regression and classification [see Bishop, 2006, §5]. In these models, the outputs are modeled through a sequence of non-linear transformations of the inputs. Typically, these non-linear transformations are the cumulative distribution function values for a set of linearly-parameterized logistic distributions with different parameters.
Typically, each linearly-parameterized logistic distribution depends on the values of all variables in the previous layer, making the model very complex and difficult to interpret. To avoid over-fitting in these complex models, we typically use ℓ2-regularization of the parameters. However, we can learn a sparse neural network model if we replace this ℓ2-regularization with ℓ1-regularization [Williams, 1995]. This can lead to much more parsimonious and interpretable models, since the elements of each layer will only depend on a subset of the variables in the previous layer. Although the objective function in this problem is non-convex, we can find a local minimum of the objective function with the PSS methods.

2.7.2 Other Extensions

We conclude this chapter with several other possible extensions:

• Hessian-Free Newton Methods: Lin et al. [2007] recently showed that Hessian-free Newton methods can be competitive with L-BFGS for ℓ2-regularized logistic regression. Rather than building a Hessian approximation, these methods seek to compute the Newton direction $-\nabla^2 f(x_k)^{-1} \nabla f(x_k)$ up to a specific error tolerance by using Hessian-vector products within a linear conjugate gradient algorithm. The method is known as a Hessian-free Newton method because the Hessian-vector products can be computed without explicitly forming the Hessian. These methods are also known as truncated or inexact Newton methods because the Newton direction is only computed up to a specific error tolerance. It is straightforward to implement a Hessian-free version of the PSSgb or PSSsp methods where the linear conjugate gradient algorithm is used to solve the linear system involving the working set.

• Improved Line Search: It may be possible to improve the line search routine in various ways. Our line search uses a simple backtracking line search along the projection arc, with safeguarded cubic interpolation to generate trial values.
This cubic interpolation ignores that the function is not smooth at locations where variables become exactly zero. Although a step size of 1 is typically accepted and the backtracking is typically not invoked (and only very rarely does the method backtrack more than once), it might be possible to get better performance by using a line search that takes advantage of the known locations of the non-differentiable points along the search direction. It might also be possible to consider non-smooth generalizations of the strong Wolfe conditions [Nocedal and Wright, 1999, §3.1] as a stronger measure of sufficient decrease than the Armijo condition.

• Other Regularizers and Bound Constraints: In principle, we could extend the PSS methods to find local optima in general problems of the form

$$\min_x L(x) + R(x),$$

where $L(x)$ is differentiable and $R(x)$ is continuous and separable into a set of functions that are each differentiable everywhere except at a countable number of (known) locations. This includes ℓ1-regularization of differentiable objective functions as a special case, but also includes other regularizers such as the smoothly clipped absolute deviation (SCAD) penalty [Fan and Li, 2002]. To consider this case, we would need to re-define the $P_O$ projection operator so that it projects element-wise into the relevant interval where the function is differentiable, and re-define the pseudo-gradient so that its negative minimizes the directional derivative of the objective function. Further, by using ideas from the two-metric projection algorithm it would be straightforward to modify the PSS methods to incorporate lower and/or upper bounds on the variables.

Chapter 3

Optimization with Group ℓ1-Regularization

In this chapter, we consider large-scale methods for solving optimization problems of the form

$$\min_x f(x) \triangleq L(x) + \sum_A \lambda_A \|x_A\|_2, \tag{3.1}$$

where $L(x)$ is assumed to be convex and differentiable with respect to $x$, and we may have a separate regularization parameter $\lambda_A \geq 0$ for each group $A$. In this chapter we assume that the groups $A$ are disjoint. The algorithms we describe in this chapter are applicable to any optimization problem of this form, but our focus is on the case where $L(x)$ is the negative log-likelihood in a (possibly conditional) undirected graphical model. In this case, the optimization parameters $x$ are the concatenation of the weights $w$ and biases $b$ (as well as feature weights $v$ in the conditional case we discuss in Section 5.7), and each disjoint subset $A$ contains all parameters associated with an individual edge in the model (though our discussion applies to other group structures, such as the blockwise-sparse models we discuss in Section 5.6). While this chapter focuses on the case of disjoint groups and penalizing the ℓ2-norm of the groups, in Chapter 5 we extend the methods considered here to penalize other norms of the groups, and in Chapter 6 we extend them to the case of overlapping groups.

Problem (3.1) is a generalization of the problem addressed in Chapter 2, in that we now penalize groups of variables instead of individual elements (we obtain problem (2.1) in the special case that each group $A$ contains only one element). As before, this optimization is complicated by the non-differentiability of the regularizer term. In particular, the function is non-differentiable if an entire group of variables is exactly zero.

Since (conditional) undirected graphical models generalize logistic regression while group ℓ1-regularization generalizes the previously examined ℓ1-regularization, we might naturally consider extending the very efficient methods of Chapter 2 to solve this more general problem (see footnote 20).
In Section 3.1 we discuss applying methods based on the Barzilai-Borwein approximation to the group case, including the SPG and BBST methods of the previous chapter. However, these methods do not take into account that the objective function is very costly to evaluate, while the methods from Chapter 2 based on L-BFGS that require fewer evaluations (PSS, TMP, OWL) cannot be extended in a straightforward way to the group case. Thus, in Section 3.2 we give new methods based on L-BFGS designed to reduce the number of objective evaluations (at the expense of a higher iteration cost).

Footnote 20: In the case of (unconditional) Gaussian graphical models or (unconditional) pairwise log-linear models with Ising potentials, each edge only has one parameter and the methods we discuss in Chapter 2 can be applied directly.

3.1 Barzilai-Borwein Methods

In this section we discuss applying methods based on non-monotonic Barzilai-Borwein iterations, focusing on two variants. In the first variant, we formulate (3.1) as a differentiable optimization over a convex set and apply non-monotonic Barzilai-Borwein steps within a projected gradient iteration. This is referred to as a spectral projected gradient (SPG) algorithm. In the second variant, we apply non-monotonic Barzilai-Borwein steps within a soft-thresholding iteration that directly seeks to optimize (3.1). Due to the similarity to iterative soft-thresholding methods, we refer to this as a Barzilai-Borwein soft-threshold (BBST) algorithm.

3.1.1 Spectral Projected Gradient

In Chapter 2, we considered formulating the non-differentiable ℓ1-regularized optimization problem as a smooth optimization problem with bound constraints. Then, we considered solving the bound-constrained problem with the two-metric projection (TMP), optimal projected gradient (OPG), or SPG algorithm. Unfortunately, this problem transformation is no longer possible in the group case.
However, it is still possible to transform (3.1) into a smooth optimization problem over a convex set. To do this, we introduce an additional variable $g_A$ for each group $A$. We then replace each norm $\|x_A\|_2$ with the variable $g_A$, and optimize subject to the constraint that $g_A \geq \|x_A\|_2$. That is, we solve the problem

$$\min_{x,g} L(x) + \sum_A \lambda_A g_A, \quad \text{subject to } g_A \geq \|x_A\|_2, \ \forall A. \tag{3.2}$$

This formulation replaces the non-linear, non-differentiable regularizer with a simple linear function. We note that these constraints are a special case of second-order cone constraints [Boyd and Vandenberghe, 2004, §4.4.2], and that each constraint defines a convex set called an ℓ2-norm cone. For any feasible pair $\{x, g\}$, the objective function in (3.2) gives an upper bound on the objective (3.1), while at a minimizer it must be the case that $g_A = \|x_A\|_2$ for all groups (otherwise, we could decrease the objective by decreasing $g_A$ to $\|x_A\|_2$). It follows that, because the range of $L(x)$ is unchanged, a minimizer of (3.1) must correspond to a minimizer of (3.2). Although we cannot apply the TMP algorithm for bound-constrained optimization to problem (3.2) because it is not a bound-constrained problem, we can still apply the SPG and OPG algorithms.

The projected-gradient method [Goldstein, 1964, Levitin and Poliak, 1966] is a constrained optimization algorithm for solving

$$\min_{x \in C} f(x),$$

where $f(x)$ is a differentiable function and $C$ is a closed convex set. We consider a variant of the method that uses iterations of the form

$$x_{k+1} \leftarrow P_C(x_k - \alpha \nabla f(x_k)).$$

Here, $\alpha$ is selected to satisfy the Armijo condition by a backtracking line search and $P_C$ is defined by

$$P_C(x) \triangleq \arg\min_{y \in C} \|x - y\|_2, \tag{3.3}$$

the Euclidean projection onto $C$. This simple general-purpose method has two drawbacks: (i) in general solving (3.3) may itself be a computationally challenging problem, and (ii) the use of the steepest descent step results in slow convergence.
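As a rough illustration of the basic projected-gradient iteration with Armijo backtracking along the projection arc, the following Python sketch minimizes a simple quadratic over the unit ℓ2 ball. The function names (`project_ball`, `projected_gradient`), the step-halving rule, and the test problem are illustrative assumptions and are not taken from the implementation used in this thesis.

```python
import math


def project_ball(x, radius=1.0):
    """Euclidean projection onto {y : ||y||_2 <= radius} (an illustrative choice of C)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [radius * v / norm for v in x]


def projected_gradient(f, grad, project, x0, eta=1e-4, tol=1e-8, max_iter=500):
    """Iterate x <- P_C(x - alpha * grad f(x)), shrinking alpha until Armijo holds."""
    x = project(x0)
    for _ in range(max_iter):
        g = grad(x)
        # x is optimal when it is a fixed point of the unit-step projected gradient map.
        unit = project([xi - gi for xi, gi in zip(x, g)])
        if max(abs(a - b) for a, b in zip(x, unit)) <= tol:
            break
        alpha, fx = 1.0, f(x)
        while alpha > 1e-12:
            x_new = project([xi - alpha * gi for xi, gi in zip(x, g)])
            # Armijo condition evaluated along the projection arc.
            slope = sum(gi * (a - b) for gi, a, b in zip(g, x_new, x))
            if f(x_new) <= fx + eta * slope:
                break
            alpha *= 0.5  # simple halving (assumption), not safeguarded interpolation
        x = x_new
    return x


# Hypothetical test problem: minimize 0.5 * ||x - b||^2 over the unit ball.
b = [3.0, 4.0]
f = lambda x: 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
grad = lambda x: [xi - bi for xi, bi in zip(x, b)]
x_star = projected_gradient(f, grad, project_ball, [0.0, 0.0])
```

For this test problem the minimizer is $b/\|b\|_2 = (0.6, 0.8)$, so the result is easy to check by hand. On harder problems the steepest-descent step makes this plain iteration slow, which is drawback (ii) above.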
The SPG method [Birgin et al., 2000] uses two simple modifications of the projected gradient method to enhance its convergence rate. First, it initializes the line search with the step size

$$\alpha_{bb} = \frac{y_k^T s_k}{y_k^T y_k},$$

proposed by Barzilai and Borwein [1988]. Second, it uses a non-monotonic version of the Armijo condition [Grippo et al., 1986]:

$$f(x_{k+1}) \leq \max_{i=k-m:k} f(x_i) + \nu \nabla f(x_k)^T (x_{k+1} - x_k), \quad \text{with } \nu \in (0, 1). \tag{3.4}$$

This non-monotonic Armijo condition typically accepts $\alpha_{bb}$ (even if it increases the objective function), but still ensures global convergence of the method. A typical value for the number $m$ of previous function values to consider is 10. These two simple modifications have been shown experimentally to lead to a very large improvement in the convergence rate of the method, and due to its strong empirical performance SPG has recently been explored in several applications [Dai and Fletcher, 2005, Figueiredo et al., 2007, van den Berg and Friedlander, 2008].

Although the SPG strategy reduces the number of iterations of the method that we must perform, for the method to be efficient we must still be able to efficiently compute the projection onto the constraint set. Fortunately, in problem (3.2) each constraint only affects variables associated with the corresponding group. Thus, we can compute the projection across the groups by independently solving the projection problem for each group. For each group, the corresponding problem takes the form

$$P_{C_2}(x_A, g_A) = \arg\min_{y,z} \left\| \begin{bmatrix} x_A \\ g_A \end{bmatrix} - \begin{bmatrix} y \\ z \end{bmatrix} \right\|_2, \quad \text{subject to } z \geq \|y\|_2.$$

The solution to this problem is [Boyd and Vandenberghe, 2004, Exercise 8.3(c)]

$$P_{C_2}(x, g) = \begin{cases} (x, g), & \text{if } \|x\|_2 \leq g, \\ \left( \dfrac{x}{\|x\|_2} \cdot \dfrac{\|x\|_2 + g}{2}, \ \dfrac{\|x\|_2 + g}{2} \right), & \text{if } \|x\|_2 > g, \ \|x\|_2 + g > 0, \\ (0, 0), & \text{if } \|x\|_2 > g, \ \|x\|_2 + g \leq 0. \end{cases}$$

We give an explicit derivation of this result in Appendix B. Thus, we can solve this sub-projection in $O(|A|)$ and we can solve the full projection in $O(p)$ for a problem with $p$ variables (see footnote 21).
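The three cases of the norm-cone projection translate directly into code. The sketch below is a minimal Python version under our own naming (`project_norm_cone` and `project_groups` are hypothetical helpers); it represents each group as a list of indices into `x` and assumes the groups are disjoint.

```python
import math


def project_norm_cone(x, g):
    """Project the pair (x, g) onto the l2-norm cone {(y, z) : z >= ||y||_2}."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= g:
        return list(x), g                      # already feasible
    if norm + g <= 0:
        return [0.0] * len(x), 0.0             # projects onto the apex of the cone
    scale = (norm + g) / (2.0 * norm)          # rescale x so that ||y||_2 = (norm + g) / 2
    return [scale * v for v in x], (norm + g) / 2.0


def project_groups(x, g, groups):
    """Full projection for (3.2): apply the cone projection to each disjoint group."""
    x_out, g_out = list(x), list(g)
    for a, idx in enumerate(groups):
        y, z = project_norm_cone([x[i] for i in idx], g[a])
        for i, yi in zip(idx, y):
            x_out[i] = yi
        g_out[a] = z
    return x_out, g_out


# Example: one infeasible group (||x_A||_2 = 5 > g_A = 1) and one feasible group.
x_proj, g_proj = project_groups([3.0, 4.0, 0.5], [1.0, 2.0], [[0, 1], [2]])
```

Each group costs time linear in its size, so a pass over all groups is linear in the number of variables, in line with the $O(p)$ cost stated above.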
We close this section by noting that we could alternatively use this constrained formulation and the above projection operator within an OPG method [Nesterov, 2004, §2.2.4].

Footnote 21: We show how to solve the related problem of projecting onto the norm ball defined by the ℓ1 of ℓ2 norms in [van den Berg et al., 2008].

3.1.2 Barzilai-Borwein Soft Threshold

A wide variety of authors have recently considered using a class of algorithms known as iterative soft-thresholding (or forward-backward splitting) for optimization with sparse regularizers, including [Daubechies et al., 2004, Combettes and Wajs, 2005, Elad et al., 2006, Hale et al., 2007, Nesterov, 2007, Duchi and Singer, 2009]. These methods address problems of the form

$$\min_x f(x) \triangleq L(x) + R(x). \tag{3.5}$$

Here, $R(x)$ is convex and possibly non-differentiable, while $L(x)$ is assumed to be differentiable and convex with a Lipschitz-continuous gradient. Rather than converting this problem to a constrained optimization problem, these algorithms solve the non-smooth optimization problem directly with a projection-like operator. In particular, these methods take steps of the form

$$x_{k+1} \leftarrow S_R(x_k - \alpha \nabla L(x_k), \alpha). \tag{3.6}$$

Here, we have used $S_R(x, \alpha)$ to denote the solution of a 'soft-threshold' problem at $x$ with step size $\alpha$ and regularizer $R(x)$. Specifically, the soft-threshold operator is given by the solution to the soft-threshold problem

$$S_R(x, \alpha) \triangleq \arg\min_y \frac{1}{2} \|y - x\|_2^2 + \alpha R(y). \tag{3.7}$$

In our case, $R(x) \triangleq \sum_A \lambda_A \|x_A\|_2$, so the soft-threshold step for problem (3.1) would be

$$\arg\min_y \frac{1}{2} \|y - (x_k - \alpha \nabla L(x_k))\|_2^2 + \alpha \sum_A \lambda_A \|y_A\|_2.$$

Thus, we first take a step along the negative gradient of the loss function, and then compute this projection-like soft-threshold operator to take into account the effect of the regularizer. The latter step effectively sparsifies the result of the (generally dense) gradient step.
As discussed by [Combettes and Wajs, 2005], the soft-threshold operator is a generalization of the projection operator, and we recognize the iterative soft-thresholding algorithm as the classic gradient-projection algorithm but with projection replaced by soft-thresholding. Similar to the classic gradient-projection algorithm, this algorithm may converge very slowly. However, analogous to the SPG algorithm, Wright et al. [2009] propose to use Barzilai-Borwein steps and a non-monotonic line search to speed the convergence of the method. We refer to this method as the Barzilai-Borwein soft-threshold (BBST) method (see footnote 22).

Footnote 22: Soft-threshold variants of the OPG method are discussed in [Nesterov, 2007].

Wright et al. [2009] discuss computing the soft-thresholding operator in the case of group ℓ1-regularization. As before, the operator separates into solving a simple problem for each group. The solution for an individual group is

$$S_{R_2}(x_A, \alpha) = \operatorname{sgn}(x_A) \max\{0, \|x_A\|_2 - \alpha \lambda_A\},$$

where we use $\operatorname{sgn}(y)$ to denote a set-valued function that returns $y / \|y\|_2$ if $y \neq 0$, and returns all values such that $\|y\|_2 \leq 1$ if $y = 0$.

In the next section we give an L-BFGS extension of the BBST algorithm, but first we establish some useful properties of the method (that will also apply in the new method). First, we note that computing $x_{k+1}$ in (3.6) is equivalent to solving the optimization problem

$$\arg\min_y L(x_k) + (y - x_k)^T \nabla L(x_k) + \frac{1}{2\alpha} \|y - x_k\|_2^2 + R(y). \tag{3.8}$$

Thus, we can view the soft-threshold step as the solution of a first-order approximation of $L(x)$ at $x_k$ that is regularized by $R(x)$ as well as the distance to $x_k$. Nesterov [2007] refers to (3.8) as the composite gradient mapping, while Wright et al. [2009] refer to it as a separable approximation. Using this equivalent formulation, we can establish that an iterate $x_k$ is an optimal solution to the
To see this, first note that the sub-differential of our original optimization problem (3.5) is ∂f (x) = ∇L(x) + ∂R(x). A vector x∗ is a minimizer of a convex function if and only if 0 ∈ ∂f (x∗ ) [Bertsekas, 1999, §B.5]. The sub-differential of the objective function in (3.8) (that we denote by q k (y)) is ∂q k (y) = ∇L(xk ) + 1 (y − xk ) + ∂R(y). α Thus, if y = xk then the optimality conditions for (3.8) reduce to 0 ∈ ∇L(xk ) + ∂R(xk ) and this is equivalent to xk being an optimal solution [see also Combettes and Wajs, 2005, Proposition 3.1]. By re-writing the soft-threshold operator in the form (3.8), we can use an argument similar to [Bertsekas, 1999, Exercise 6.3.11] to establish the useful property that if the solution x∗k to (3.8) is not a minimizer of f (x), then f (x∗k ) < f (xk ) for sufficiently small α. To do this, first note that xk achieves an objective value of L(xk ) + R(xk ) in (3.8), thus if xk is not a minimizer of f (x) then x∗k achieves a lower objective value in (3.8) and we have 1 ||x∗ − xk ||22 + R(x∗k ) 2α k ≥ L(x∗k ) + R(x∗k ) (for 0 < α ≤ 1/L). L(xk ) + R(xk ) > L(xk ) + (x∗k − xk )T ∇L(xk ) + (3.9) The last line follows from [Bertsekas, 1999, Proposition A.24], where L is the Lipschitz constant of the gradient of L(x). This result is also given by [Nesterov, 2007, Theorem 1 and Remark 1], and a related result that backtracking along α satisfies a modified Armijo condition is given by [Wright et al., 2009, Lemma 3]. We note that the gradient of the negative log-likelihood in an undirected model is Lipschitz continuous because the gradient is continuously differentiable and the spectral norm of the Hessian is bounded. We also note that the descent property still holds if L(x) is only locally Lipschitz continuous. Finally, an important property that is relevant to the next section is that (3.9) holds not only for the result of the soft-threshold operator, but for any x∗k that achieves a lower objective value than xk in (3.8). 
3.2 Quasi-Newton Methods

The Barzilai-Borwein methods discussed in the previous section represent some of the most efficient methods currently available for solving problem (3.1). However, compared to simple objectives like logistic regression, a complicating factor in optimizing the parameters of undirected graphical models is that it is very expensive to evaluate the objective function. Further, in our experiments in Chapter 2 we saw that the SPG, OPG, and BBST methods typically require many more function evaluations than methods that are based on an L-BFGS Hessian approximation (such as the PSS, TMP, and OWL methods). Unfortunately, the methods based on L-BFGS updates from Chapter 2 do not admit a straightforward extension to the group case. This is because we do not have an operator that is analogous to the $P_O$ orthant-projection from Chapter 2 (that sparsifies the solution and truncates the line search to a region where the Taylor expansion is valid). However, in the previous section we showed that we can convert problem (3.1) to a differentiable constrained optimization where it is straightforward to compute the projection onto the feasible set. Motivated by problems with this structure, in [Schmidt et al., 2009b] we gave a limited-memory projected quasi-Newton (PQN) algorithm that uses an L-BFGS Hessian approximation to solve high-dimensional constrained optimization problems where it is substantially more expensive to evaluate the objective function than it is to project onto the feasible set. We review this method next. Subsequently, we consider a variant of this method that incorporates an L-BFGS Hessian approximation into a soft-thresholding algorithm.

3.2.1 Projected Quasi-Newton

As with the gradient-projection method, projected Newton methods address the problem of minimizing a function $f(x)$ over a convex set $C$.
Similar to unconstrained Newton-like methods, at each iteration projected Newton methods consider a quadratic approximation of the objective function around the current iterate $x_k$:

$$q_k(x) \triangleq f(x_k) + (x - x_k)^T \nabla f(x_k) + \frac{1}{2} (x - x_k)^T B_k (x - x_k). \tag{3.10}$$

Here, $B_k$ is a positive-definite approximation to the Hessian. In order to generate a direction of search that is both a descent direction and feasible, projected Newton methods find the minimizer $x_k^*$ of this quadratic approximation over the set $C$. That is, they solve

$$x_k^* \triangleq \arg\min_{x \in C} q_k(x). \tag{3.11}$$

This generates a descent direction $d_k \triangleq x_k^* - x_k$, where $x_k + \alpha d_k$ is feasible for $\alpha \in [0, 1]$. As before, we can use this direction as part of a backtracking line search until we have a new iterate satisfying the Armijo condition. If $B_k$ is the exact Hessian and we always test $\alpha = 1$ first, this method has a quadratic rate of convergence in the neighborhood of a minimizer satisfying second-order sufficiency conditions [Bertsekas, 1999, Proposition 2.3.5]. The drawbacks of this method in its unmodified form are that: (i) it requires computing/storing a dense $p$ by $p$ Hessian approximation, and (ii) finding the constrained minimizer of the quadratic model may be very expensive.

We use the L-BFGS Hessian approximation to address the first issue. As mentioned in Chapter 2, there is an efficient recursive formula that pre-multiplies a vector by the inverse of a matrix $B_0 = \sigma_k I$ updated $m$ times with the BFGS formula. However, in order to evaluate the objective function in (3.11) we need to be able to multiply by $B_k$, not $B_k^{-1}$. This can be done using the compact representation of Byrd et al. [1994], that represents the updates $B_k$ as a low-rank matrix

$$B_k = \sigma_k I - N M^{-1} N^T, \tag{3.12}$$

where $N$ is $p$-by-$2m$, and $M$ is $2m$-by-$2m$. With this representation, we can compute $q_k(x)$ and $\nabla q_k(x)$ in $O(mp)$ (both values can be obtained with one multiplication by $B_k$).
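To make the compact representation (3.12) concrete, the following Python sketch multiplies a vector by $B_k$ without forming the $p$-by-$p$ matrix. It uses the standard form from Byrd et al. [1994], $N = [\sigma S, \ Y]$ and $M = \begin{bmatrix} \sigma S^T S & L \\ L^T & -D \end{bmatrix}$, where $L_{ij} = s_i^T y_j$ for $i > j$ and $D = \operatorname{diag}(s_i^T y_i)$. The helper names, the dense Gaussian elimination used for the small $2m$-by-$2m$ system, and the example vectors are assumptions made for illustration; `sigma` denotes the scaling of $B_0 = \sigma I$.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting (small systems only)."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def compact_lbfgs_multiply(S, Y, sigma, v):
    """Return B v with B = sigma * I - N M^{-1} N^T.

    S, Y: lists of the m stored difference vectors s_i, y_i (oldest first),
    assumed to satisfy the curvature condition s_i^T y_i > 0.
    """
    m = len(S)
    n_cols = [[sigma * x for x in s] for s in S] + [list(y) for y in Y]  # columns of N
    M = [[0.0] * (2 * m) for _ in range(2 * m)]
    for i in range(m):
        for j in range(m):
            M[i][j] = sigma * dot(S[i], S[j])        # block sigma * S^T S
            if i > j:
                M[i][m + j] = dot(S[i], Y[j])        # block L (strictly lower triangular)
                M[m + j][i] = M[i][m + j]            # block L^T
        M[m + i][m + i] = -dot(S[i], Y[i])           # block -D
    z = solve(M, [dot(col, v) for col in n_cols])    # z = M^{-1} N^T v
    return [sigma * vi - sum(col[k] * z[j] for j, col in enumerate(n_cols))
            for k, vi in enumerate(v)]


# Hypothetical stored pairs with s_i^T y_i > 0.
S = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
Y = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0]]
Bs = compact_lbfgs_multiply(S, Y, 1.0, S[-1])
```

A useful sanity check is the secant equation: the BFGS matrix maps the most recent $s$ to the most recent $y$, so `Bs` should match `Y[-1]` up to rounding. The products with $N$ cost $O(mp)$; the dense solve with $M$ adds a cost that depends only on $m$, which is small in the limited-memory setting.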
Given the L-BFGS representation of $B_k$, we minimize (3.11) by using the SPG algorithm discussed in the previous section. In addition to evaluating $q_k(x)$ (and its gradient), the cost of running SPG is dominated by computing the projection $P_C$. However, note that we do not need to evaluate the objective function in the SPG sub-routine. Hence, the proposed method is most effective on problems where computing the projection is much less expensive than evaluating the objective function (see footnote 23). In the case of group ℓ1-regularized undirected graphical models, we can compute the projection in linear time while evaluating the objective function is #P-hard in general (even evaluating the approximate objective functions from Section 1.4 will typically be much more costly than computing the projection). Thus, the conditions needed for the PQN method to be efficient are clearly satisfied.

Footnote 23: This is different than many classical optimization problems like quadratic programming, where evaluating the objective function is relatively inexpensive and computing the projection may be as difficult as solving the original problem.

In general, running the SPG sub-routine to obtain a high-accuracy solution may be computationally expensive. However, we must be careful about terminating the SPG sub-routine early because an approximate solution to (3.11) will not in general be a descent direction. Fortunately, we can guarantee that the SPG sub-routine yields a descent direction even under early termination if we initialize it with $x_k$ and we run the method for at least one iteration (so that we obtain a vector $y$ satisfying the Armijo condition on the quadratic approximation). To see this, first note that positive-definiteness of $B_k$ implies that a sufficient condition for $y - x_k$ to be a descent direction for some vector $y$ is that $q_k(y) < f(x_k)$ (since this implies that $(y - x_k)^T \nabla f(x_k) < 0$).
Subsequently, using $q_k(x_k) = f(x_k)$ we have that $q_k(y) < f(x_k)$, where $y$ is the first point satisfying the Armijo condition on $q_k(x)$ if we initialize SPG with $x_k$. Thus, if we initialize the SPG sub-routine with $x_k$ then after the first iteration (and every subsequent iteration) the SPG solution gives a descent direction and it can safely be terminated early. Further, provided that the eigenvalues of the Hessian approximation $B_k$ are bounded, the search directions generated by the SPG sub-routine are gradient related [see Bertsekas, 1999, §1.2] after the first iteration. Convergence of the PQN method thus follows from [Bertsekas, 1999, Proposition 2.2.1]. In our implementation we include an explicit maximum $c$ on the number of iterations of the SPG sub-routine (see footnote 24). For problems where computing the projection onto the constraint set can be done in $O(p)$, the iteration cost of the PQN method is therefore $O(pmc)$.

3.2.2 Quasi-Newton Soft Threshold

The PQN method is a general technique for constrained optimization, and we can apply it in the special case of group ℓ1-regularization problems after a suitable problem transformation. However, we saw in the last section that the soft-threshold operator provides a direct way to apply Barzilai-Borwein steps to solve group ℓ1-regularization problems (in that we do not have to introduce auxiliary variables). In this section, we consider a method that is analogous to the PQN algorithm, but that is suitable for optimizing the sum of a costly objective function $L(x)$ (with Lipschitz-continuous gradient) and a convex regularizer $R(x)$ where we can efficiently compute the soft-threshold operator for the regularizer. We call this the quasi-Newton soft-threshold (QNST) algorithm.

At each iteration of the QNST algorithm, we form a regularized quadratic approximation to the function

$$q_k^\alpha(x) \triangleq L(x_k) + (x - x_k)^T \nabla L(x_k) + \frac{1}{2\alpha} (x - x_k)^T B_k (x - x_k) + R(x), \tag{3.13}$$

where $B_k$ is an L-BFGS approximation of $\nabla^2 L(x_k)$.
That is, we use a quadratic approximation to the smooth function $L(x)$ but include the regularizer explicitly in the sub-problem. To find an (approximate) minimizer $x_{k+1}$ of this sub-problem, we use $c$ iterations of the BBST method. To set the step length $\alpha$, we use a backtracking line search.

Footnote 24: An alternative strategy would be to run the method until the sub-problem is solved up to a certain optimality tolerance. This tolerance could then be set using a forcing sequence [Nocedal and Wright, 1999, §6.1].

Besides the use of a soft-thresholding method to solve (3.13), there is a close connection between the QNST method and soft-thresholding algorithms. We can see this by re-writing the optimization over (3.13) as follows:

$$\begin{aligned} &\arg\min_y \ L(x_k) + (y - x_k)^T \nabla L(x_k) + \frac{1}{2\alpha} (y - x_k)^T B_k (y - x_k) + R(y) \\ &= \arg\min_y \ (y - x_k)^T \nabla L(x_k) + \frac{1}{2\alpha} (y - x_k)^T B_k (y - x_k) + R(y) \\ &= \arg\min_y \ \alpha (y - x_k)^T \nabla L(x_k) + \frac{1}{2} (y - x_k)^T B_k (y - x_k) + \alpha R(y) \\ &= \arg\min_y \ \frac{1}{2} \alpha^2 \nabla L(x_k)^T B_k^{-1} \nabla L(x_k) + \alpha (y - x_k)^T \nabla L(x_k) + \frac{1}{2} (y - x_k)^T B_k (y - x_k) + \alpha R(y) \\ &= \arg\min_y \ \frac{1}{2} \left( (y - x_k) + \alpha B_k^{-1} \nabla L(x_k) \right)^T B_k \left( (y - x_k) + \alpha B_k^{-1} \nabla L(x_k) \right) + \alpha R(y) \\ &= \arg\min_y \ \frac{1}{2} \|(y - x_k) + \alpha B_k^{-1} \nabla L(x_k)\|_{B_k}^2 + \alpha R(y) \\ &= \arg\min_y \ \frac{1}{2} \|y - (x_k - \alpha B_k^{-1} \nabla L(x_k))\|_{B_k}^2 + \alpha R(y). \end{aligned}$$

Here, we use $\|\cdot\|_H$ to denote the quadratic norm $\|x\|_H = (x^T H x)^{1/2}$. In the last line, we see that the solution $x_k^*$ of (3.13) is the result of a generalized soft-thresholding step

$$x_k^* \leftarrow S_R(x_k - \alpha B_k^{-1} \nabla L(x_k), \alpha, B_k),$$

where we define the generalized soft-threshold operator $S_R(x, \alpha, H)$ as

$$S_R(x, \alpha, H) \triangleq \arg\min_y \frac{1}{2} \|y - x\|_H^2 + \alpha R(y).$$

Thus, we see that the QNST method can be viewed as taking a standard unconstrained L-BFGS step on $L(x)$, followed by applying a soft-threshold operation with regularizer $R(x)$ where we measure distance based on the quasi-Newton approximation (see footnote 25). We obtain the standard soft-thresholding algorithm if we fix $B_k$ to $I$.
We note that this is analogous to the relationship between projected gradient and projected (quasi-)Newton methods [Bertsekas, 1999, §2.3]. Indeed, we obtain a standard unconstrained quasi-Newton method for differentiable optimization (as we describe in Section 2.1) if $R(x)$ is a constant function and we solve the sub-problem exactly. Further, the QNST method can be viewed as a generalization of the PQN method, since we obtain a version of the PQN method if $R(x)$ is an extended real-valued function that returns 0 if $x \in C$ and returns $\infty$ otherwise. This suggests that we could also use the QNST method to minimize differentiable functions with simple non-differentiable regularizers over simple convex sets (provided that the soft-threshold operation can still be computed efficiently).

Footnote 25: Convergence rates under different choices of norm for soft-thresholding algorithms are discussed in [Chen and Rockafellar, 1997].

The steps of the QNST algorithm can be viewed as steps of a standard soft-threshold algorithm for minimizing $L(B_k^{-1/2} \tilde{x}) + R(B_k^{-1/2} \tilde{x})$ in terms of $\tilde{x}$, which is equivalent to (3.5) with the transformation $\tilde{x} = B_k^{1/2} x$. It follows from our argument of the previous section that the QNST algorithm has the descent property that if $x_k$ is not an optimal solution, then $f(x_{k+1}) < f(x_k)$ for sufficiently small $\alpha$. Finally, it follows from a similar argument to the one made in the PQN section, combined with (3.9), that we can terminate the BBST sub-routine early provided that we initialize it with $x_k$ and find a solution with a lower objective value in the regularized quadratic approximation.

3.3 Implementation

In Algorithm 6 we give pseudo-code for the SPG method.

Input: Objective function f(x), projection function P_C(x), initial parameter vector x_0, optimality tolerance ε, number of previous function values to store m, sufficient decrease parameter η, line search safeguard parameters ξ_1 and ξ_2, step length upper and lower limits α_max and α_min.
k ← 0
x_0 ← P_C(x_0)                                   // project initial parameter vector
f_k ← f(x_0)                                     // evaluate objective function
g_k ← ∇f(x_0)                                    // compute gradient
while ||x_k − P_C(x_k − g_k)||_∞ > ε do
    if k = 0 then
        α ← min(1, 1/||g_k||_1)                  // initial step size
    else
        α ← (y_{k−1}^T s_{k−1}) / (y_{k−1}^T y_{k−1})   // Barzilai-Borwein step size
        α ← max(α_min, min(α_max, α))            // safeguarded BB step
    x_{k+1} ← P_C(x_k − α g_k)                   // initial trial value
    f_{k+1} ← f(x_{k+1})                         // evaluate new parameter vector
    g_{k+1} ← ∇f(x_{k+1})                        // compute new gradient
    while f_{k+1} > max_{i=k−m:k} f_i + η g_k^T (x_{k+1} − x_k) do
        Select α ∈ (ξ_1 α, ξ_2 α)                // safeguarded cubic interpolation
        x_{k+1} ← P_C(x_k − α g_k)               // new trial value
        f_{k+1} ← f(x_{k+1})                     // evaluate new parameter vector
        g_{k+1} ← ∇f(x_{k+1})                    // compute new gradient
    s_k ← x_{k+1} − x_k; y_k ← g_{k+1} − g_k     // compute quasi-Newton differences
    k ← k + 1

Algorithm 6: Spectral projected gradient algorithm for minimizing a function f(x) over a convex set C.

Note that the above algorithm uses one of the two step sizes proposed by Barzilai and Borwein [1988]; we can use the alternate step size by simply replacing the appropriate line in the code above. Also, in the above code we are backtracking along the projection arc [see Bertsekas, 1999, §2.3]. Birgin et al. [2000] also considered a variant where we backtrack along a feasible direction. The latter strategy is more appealing in cases where the projection is expensive to compute.

The BBST algorithm is identical to SPG, with the following modifications: (i) we do not project the initial vector, (ii) we define f_k as L(x_k) + R(x_k) but g_k as ∇L(x_k), (iii) in the optimality condition we replace P_C(x_k − g_k) with S_R(x_k − g_k, 1), (iv) in the iterate update we replace P_C(x_k − α g_k) with S_R(x_k − α g_k, α), and (v) in the non-monotonic Armijo condition we replace g_k^T (x_{k+1} − x_k) with α multiplied by the directional derivative of the objective at x_k in the direction (x_{k+1} − x_k).
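Because SPG and BBST differ only in the operator applied after the gradient step and in the sufficient-decrease test, both can be sketched with one Python routine that takes the operator as an argument. The sketch below is a simplified illustration and not the implementation used in our experiments: it assumes a feasible starting point, replaces safeguarded cubic interpolation with step halving, and replaces the Armijo terms of both algorithms with a single quadratic decrease test of the form η/(2α) ||x_{k+1} − x_k||². All names (`bb_iteration`, `op`, and the test problems) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bb_iteration(f, grad, op, x0, eta=1e-4, m=10, tol=1e-6,
                 alpha_min=1e-10, alpha_max=1e10, max_iter=1000):
    """Non-monotone Barzilai-Borwein iteration with a pluggable operator.

    f(x)         : full objective (L(x) + R(x) in the soft-threshold case)
    grad(x)      : gradient of the smooth part only
    op(z, alpha) : P_C(z) for SPG (alpha ignored) or S_R(z, alpha) for BBST
    x0           : starting point, assumed feasible when op is a projection
    """
    x = list(x0)
    fx, g = f(x), grad(x)
    history = [fx]
    s = y = None
    for _ in range(max_iter):
        unit = op([xi - gi for xi, gi in zip(x, g)], 1.0)
        if max(abs(a - b) for a, b in zip(x, unit)) <= tol:
            break
        if s is None:
            norm1 = sum(abs(gi) for gi in g)
            alpha = min(1.0, 1.0 / norm1) if norm1 > 0 else 1.0
        else:
            yy = dot(y, y)
            alpha = dot(y, s) / yy if yy > 0 else 1.0
            alpha = max(alpha_min, min(alpha_max, alpha))  # safeguarded BB step
        f_ref = max(history[-(m + 1):])  # non-monotone reference value
        while True:
            x_new = op([xi - alpha * gi for xi, gi in zip(x, g)], alpha)
            d = [a - b for a, b in zip(x_new, x)]
            f_new = f(x_new)
            # Simplified sufficient-decrease test (assumption, see lead-in).
            if f_new <= f_ref - eta / (2.0 * alpha) * dot(d, d) or alpha < 1e-16:
                break
            alpha *= 0.5
        g_new = grad(x_new)
        s, y = d, [a - b for a, b in zip(g_new, g)]
        x, fx, g = x_new, f_new, g_new
        history.append(fx)
    return x


def soft_threshold_l1(z, alpha, lam=1.0):
    """S_R for R(x) = lam * ||x||_1 (groups of size one)."""
    return [max(0.0, abs(v) - alpha * lam) * (1.0 if v > 0 else -1.0) for v in z]


def project_ball(z, alpha=None):
    """P_C for the unit l2 ball; alpha is accepted but unused."""
    norm = math.sqrt(sum(v * v for v in z))
    return list(z) if norm <= 1.0 else [v / norm for v in z]


# BBST-style use: 0.5 * sum_i c_i (x_i - b_i)^2 + ||x||_1.
c, b = [1.0, 4.0], [3.0, 0.1]
smooth = lambda x: 0.5 * sum(ci * (xi - bi) ** 2 for ci, xi, bi in zip(c, x, b))
smooth_grad = lambda x: [ci * (xi - bi) for ci, xi, bi in zip(c, x, b)]
x_bbst = bb_iteration(lambda x: smooth(x) + sum(abs(v) for v in x),
                      smooth_grad, soft_threshold_l1, [0.0, 0.0])

# SPG-style use: 0.5 * ||x - t||^2 over the unit ball.
t = [3.0, 4.0]
x_spg = bb_iteration(lambda x: 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, t)),
                     lambda x: [xi - ti for xi, ti in zip(x, t)],
                     project_ball, [0.0, 0.0])
```

For the separable ℓ1 example the minimizer is available in closed form, $x_i = \operatorname{sgn}(b_i) \max\{0, |b_i| - 1/c_i\}$, which gives $(2, 0)$; for the ball example it is $t / \|t\|_2 = (0.6, 0.8)$. Passing the operator as a parameter keeps modifications (iii) and (iv) in one place, at the price of using a decrease test that matches neither algorithm exactly.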
We give pseudo-code for the BBST method below, where we use R'(x; y) to denote the directional derivative of R(x) evaluated at x in the direction of y.²⁶

Input: Differentiable convex function f(x), regularization function R(x), soft-threshold function S_R(x), initial parameter vector x_0, optimality tolerance ε, number of previous function values to store m, sufficient decrease parameter η, line search safeguard parameters ξ1 and ξ2, step length upper and lower limits α_max and α_min.

k ← 0
f_k ← f(x_0) + R(x_0)    // evaluate objective function
g_k ← ∇f(x_0)    // compute gradient
while ||x_k − S_R(x_k − g_k, 1)||_∞ > ε do
    if k = 0 then
        α ← min(1, 1/||g_k||_1)    // initial step size
    else
        α ← y_k^T s_k / y_k^T y_k    // Barzilai-Borwein step size
        α ← max(α_min, min(α_max, α))    // safeguarded BB step
    x_{k+1} ← S_R(x_k − αg_k, α)    // initial trial value
    f_{k+1} ← f(x_{k+1}) + R(x_{k+1})    // evaluate new parameter vector
    g_{k+1} ← ∇f(x_{k+1})    // compute new gradient
    while f_{k+1} > max_{i=k−m:k} f_i + ηα(g_k + R'(x_k; x_{k+1} − x_k))^T (x_{k+1} − x_k) do
        Select α ∈ (ξ1 α, ξ2 α)    // safeguarded cubic interpolation
        x_{k+1} ← S_R(x_k − αg_k, α)    // new trial value
        f_{k+1} ← f(x_{k+1}) + R(x_{k+1})    // evaluate new parameter vector
        g_{k+1} ← ∇f(x_{k+1})    // compute new gradient
    s_k ← x_{k+1} − x_k; y_k ← g_{k+1} − g_k; k ← k + 1    // compute quasi-Newton differences

Algorithm 7: Barzilai-Borwein soft-threshold algorithm for minimizing the sum of a differentiable convex function f(x) and a convex regularizer R(x).

²⁶ If this directional derivative is difficult to compute, we could alternately use α||x_{k+1} − x_k||_2^2 in the Armijo condition as in [Wright et al., 2009].

In Algorithm 8 we give pseudo-code for the PQN method. In this pseudo-code, we find it convenient to use SPG(x_k, c, g_k, σ, S, Y) to denote applying c iterations of SPG starting from x_k to approximately solve problem (3.11) with the gradient set to g_k and with the L-BFGS approximation (3.12) constructed using σ, S, and Y.
Input: Objective function f, projection function P_C, initial parameter vector x_0, optimality tolerance ε, number of corrections m, sufficient decrease parameter η, line search safeguard parameters ξ1 and ξ2, maximum number of SPG iterations c.

k ← 0; x_0 ← P_C(x_0)    // project initial parameter vector
f_k ← f(x_0)    // evaluate objective function
g_k ← ∇f(x_0)    // compute gradient
while ||x_k − P_C(x_k − g_k)||_∞ > ε do
    α ← 1
    if k = 0 then
        d_k ← −g_k min(1, 1/||g_k||_1)    // use steepest descent
    else
        x*_k ← SPG(x_k, c, g_k, σ, S, Y)    // approximately minimize quadratic approximation
        d_k ← x*_k − x_k    // feasible descent direction
    x_{k+1} ← x_k + αd_k    // initial trial value
    f_{k+1} ← f(x_{k+1})    // evaluate new parameter vector
    g_{k+1} ← ∇f(x_{k+1})
    while f_{k+1} > f_k + η g_k^T (x_{k+1} − x_k) do
        Select α ∈ (ξ1 α, ξ2 α)    // safeguarded cubic interpolation
        x_{k+1} ← x_k + αd_k    // new trial value
        f_{k+1} ← f(x_{k+1})    // evaluate new parameter vector
        g_{k+1} ← ∇f(x_{k+1})
    s_k ← x_{k+1} − x_k; y_k ← g_{k+1} − g_k    // compute quasi-Newton differences
    if k > m then
        Remove oldest vector from S and Y
    S ← [S s_k]; Y ← [Y y_k]    // update quasi-Newton difference matrices
    σ ← (y_k^T s_k)/(y_k^T y_k)    // update diagonal Hessian scaling
    k ← k + 1

Algorithm 8: Limited-memory projected quasi-Newton algorithm for minimizing a function f(x) over a convex set C.

In this code, we have used backtracking along the feasible direction, but we could also consider a variant of the method where we backtrack along the projection arc [see Bertsekas, 1999, §2.3]. Here, during the iterations of the line search we would incorporate the step size α into the quadratic approximation (3.11) (similar to the QNST method discussed next), and use SPG to directly solve for x_{k+1} (increasing the cost of backtracking, but possibly generating better trial values).
We obtain the QNST algorithm by using the same replacements we used to obtain the BBST algorithm from the SPG algorithm, in addition to: (i) replacing SPG by BBST, (ii) replacing (3.11) by (3.13), and (iii) directly solving for x_{k+1} for the trial value of α instead of computing d_k and then setting x_{k+1} to x_k + αd_k (both before and during the line search). We give pseudo-code for the QNST method below, where BBST(x_k, c, g_k, σ, S, Y, α) is defined analogously to the SPG function in the PQN pseudo-code (but augmented to include the step size α).

Input: Differentiable convex function f(x), regularization function R(x), soft-threshold function S_R(x), initial parameter vector x_0, optimality tolerance ε, number of corrections m, sufficient decrease parameter η, line search safeguard parameters ξ1 and ξ2, maximum number of BBST iterations c.

k ← 0
f_k ← f(x_0) + R(x_0)    // evaluate objective function
g_k ← ∇f(x_0)    // compute gradient
while ||x_k − S_R(x_k − g_k, 1)||_∞ > ε do
    if k = 0 then
        α ← min(1, 1/||g_k||_1)    // initial step size
        x_{k+1} ← S_R(x_k − αg_k, α)    // use basic soft-threshold step
    else
        α ← 1
        x_{k+1} ← BBST(x_k, c, g_k, σ, S, Y, α)    // approximately minimize approximation
    f_{k+1} ← f(x_{k+1}) + R(x_{k+1})    // evaluate new parameter vector
    g_{k+1} ← ∇f(x_{k+1})
    while f_{k+1} > f_k + ηα(g_k + R'(x_k; x_{k+1} − x_k))^T (x_{k+1} − x_k) do
        Select α ∈ (ξ1 α, ξ2 α)    // safeguarded cubic interpolation
        x_{k+1} ← BBST(x_k, c, g_k, σ, S, Y, α)    // new trial value
        f_{k+1} ← f(x_{k+1}) + R(x_{k+1})    // evaluate new parameter vector
        g_{k+1} ← ∇f(x_{k+1})
    s_k ← x_{k+1} − x_k; y_k ← g_{k+1} − g_k    // compute quasi-Newton differences
    if k > m then
        Remove oldest vector from S and Y
    S ← [S s_k]; Y ← [Y y_k]    // update quasi-Newton difference matrices
    σ ← (y_k^T s_k)/(y_k^T y_k)    // update diagonal Hessian scaling
    k ← k + 1

Algorithm 9: Limited-memory quasi-Newton soft-threshold algorithm for minimizing the sum of a differentiable convex function f(x) and a convex regularizer R(x).
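Both BBST and QNST rely on the soft-threshold operation S_R(x, α). As a hedged illustration, the sketch below assumes that R is the group ℓ1 regularizer with the ℓ2 group norm, R(x) = Σ_A λ_A ||x_A||_2, and that S_R(x, α) minimizes ½||y − x||² + αR(y) over y; under those assumptions the operation rescales each group independently. The function name and the representation of groups as lists of indices are hypothetical.

```python
import numpy as np

def group_soft_threshold(x, alpha, groups, lam):
    """Sketch of S_R(x, alpha) for R(x) = sum_A lam_A * ||x_A||_2.

    groups: list of index lists (assumed non-overlapping);
    variables that appear in no group are left unregularized.
    """
    out = np.array(x, dtype=float)
    for idx, lam_a in zip(groups, lam):
        norm = np.linalg.norm(out[idx])
        # shrink the whole group towards zero; zero it if its norm is small
        scale = max(0.0, 1.0 - alpha * lam_a / norm) if norm > 0 else 0.0
        out[idx] = scale * out[idx]
    return out

# Usage: the last variable belongs to no group and is left untouched.
x = np.array([3.0, 4.0, 0.1, 0.1, 7.0])
shrunk = group_soft_threshold(x, alpha=1.0, groups=[[0, 1], [2, 3]], lam=[1.0, 1.0])
```

The basic soft-threshold step used at k = 0 in Algorithm 9 then corresponds to calling this function on x_k − αg_k with the same α.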
3.4 Regularization Path and Active-Set Optimization

As with the methods from Chapter 2, the methods we discuss in this chapter can make use of good starting parameter values when we want to solve for multiple values of λ. In this section, we consider a method for solving for a sequence of values of λ that is analogous to the one we discuss in Section 2.5. Consider the following set of necessary and sufficient conditions for a vector x to be a minimizer of f(x) for given values of λ_A:

$$\nabla_A L(x) + \lambda_A \,\mathrm{sgn}(x_A) = 0, \quad x_A \neq 0,$$
$$\|\nabla_A L(x)\|_2 \leq \lambda_A, \quad x_A = 0.$$

These conditions are equivalent to the necessary and sufficient optimality condition that the zero-vector is an element of the sub-differential of (3.1). Similar to (2.5), these conditions allow us to determine the value of λ_max that sets all (regularized) groups to zero (after we have optimized with respect to the unregularized variables). In particular, if we denote the unregularized variables by b and the regularized variables by w, then we have

$$\lambda_{\max} = \max_A \|\nabla_{w_A} L(0, \tilde{b})\|_2,$$

where $\tilde{b}$ optimizes L(w, b) with respect to b (with w fixed at 0).

Analogous to the method in Section 2.5, we could consider the following active-set method:
• Find groups A such that x_A ≠ 0, or x_A = 0 and ||∇_A L(x)||_2 > λ_A.
• Solve the problem with respect to these groups.

We can again consider applying this procedure for a decreasing sequence of values of the regularization parameter. The only difference between this procedure and the procedure of Section 2.5 is that the selection of variables to include in the optimization is done at the group level rather than the individual variable level. However, the computational gains achieved by applying this strategy to undirected graphical models can be much more dramatic than the gains achieved for logistic regression.
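The group-level selection rule and the value of λ_max can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, groups are given as lists of indices into a flat parameter vector, and the gradient passed in is assumed to have been evaluated at the current iterate (for `lambda_max`, at w = 0 after optimizing the unregularized variables).

```python
import numpy as np

def lambda_max(grad_w, groups):
    """Largest group-wise l2 norm of the gradient of L at w = 0."""
    return max(np.linalg.norm(grad_w[idx]) for idx in groups)

def active_groups(x, grad, groups, lam):
    """Indices of groups that are non-zero or violate ||grad_A||_2 <= lam_A."""
    active = []
    for a, (idx, lam_a) in enumerate(zip(groups, lam)):
        if np.any(x[idx] != 0) or np.linalg.norm(grad[idx]) > lam_a:
            active.append(a)
    return active

# Usage on a toy vector with three groups of two variables each.
groups = [[0, 1], [2, 3], [4, 5]]
x = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
grad = np.array([3.0, 4.0, 0.0, 0.0, 0.1, 0.1])
selected = active_groups(x, grad, groups, lam=[1.0, 1.0, 1.0])
largest = lambda_max(grad, groups)
```

Here the first group is selected because its gradient norm exceeds λ_A, the second because it is already non-zero, and the third is left out of the optimization.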
In particular, for large values of λ the graph defined on the subset of groups that we optimize over will have low treewidth and thus we can evaluate the objective function efficiently. Thus, for sufficiently large values of λ we can evaluate the objective function exactly in polynomial time, while the objective function associated with the corresponding ℓ2-regularization problem (where the graph is dense) will require exponential time even for large values of λ.

3.5 Experiments

We compared the performance of several large-scale optimization methods for group ℓ1-regularized (unconditional) log-linear models. In particular, we compared the following methods:
• SPG: The spectral projected gradient method we discuss in Section 3.1.1.
• OPG: The optimal projected gradient using the line search suggested in [Liu et al., 2009], applied to the constrained formulation we discuss in Section 3.1.1.
• BBST: The Barzilai-Borwein soft-threshold method we discuss in Section 3.1.2.
• PQN10: The projected quasi-Newton method we discuss in Section 3.2.1, where we run the SPG sub-routine for 10 iterations.
• PQN100: The projected quasi-Newton method we discuss in Section 3.2.1, where we run the SPG sub-routine for 100 iterations.
• QNST10: The quasi-Newton soft-threshold method we discuss in Section 3.2.2, where we run the BBST sub-routine for 10 iterations.
• QNST100: The quasi-Newton soft-threshold method we discuss in Section 3.2.2, where we run the BBST sub-routine for 100 iterations.

Although other methods exist, our experiments in [van den Berg et al., 2008] indicated that the SPG algorithm outperformed several competing methods for estimating conditional log-linear models, while in [Schmidt et al., 2009a] our experiments indicated that both SPG and PQN outperformed competing methods for estimating log-linear models and (blockwise-sparse) Gaussian graphical models.
We tested the methods on the two data sets from Section 1.7 where we can evaluate the objective function exactly, namely the cyto and awma data sets. We used the same experimental setup and optimization parameters as in Chapter 2. We set the optimality tolerance for the SPG and BBST sub-routines to 10^−6, and the tolerance for lack of progress in these sub-routines to 10^−10. We set the value of λ to 50, yielding a sufficiently difficult problem that differences between the methods become apparent (for larger values of λ, the methods perform similarly).

3.5.1 Pairwise Log-Linear Models

In our first experiment we used full potentials and we initialized the methods with all elements of b and w set to zero. Figure 3.1 plots the logarithm of objective function value minus f* and number of non-zero edges against the number of function evaluations (in this case, the extreme cost of function evaluations makes this a very good surrogate for the runtimes of the various methods). As in the case of ℓ1-regularized logistic regression, in this experiment the methods based on L-BFGS (PQN and QNST) outperformed the other methods (SPG, OPG, and BBST). This was true even for the PQN10 and QNST10 methods, which only make limited use of the second-order approximation. We also see that the PQN100 and QNST100 methods, which solve the direction-finding sub-problem more accurately, tend to give better performance than the PQN10 and QNST10 methods. In Figure 3.2, we repeat the experiment but initialize the methods with the solution for λ = 100. We see that the methods have better performance with this initialization, but we see the same trends across the methods.

3.5.2 Ising Graphical Models

Our second experiment sought to test whether the PQN and QNST methods are competitive with the most effective method from Chapter 2 (the PSSas method), in the special case of IGMs where each group has only one variable and the methods from either this chapter or Chapter 2 can be applied.
We thus applied the group ℓ1-regularization methods in the experimental set-up from Section 2.6.2. We compare the group ℓ1-regularization methods to the PSSas method in Figure 3.3. Here, we see that the QNST10, PQN100, and QNST100 methods have similar performance to the PSSas method even though they use an approximate solution of the sub-problem (though the lower iteration cost makes the PSSas method more appealing for regular ℓ1-regularization problems), while the PQN10 method performed similarly to the PSSas method on the cyto data but slightly worse on the awma data.

[Figure 3.1 panels: Objective Value minus Optimal (log scale) and Number of Edges, each plotted against Function Evaluations, for PQN10, PQN100, QNST10, QNST100, BBST, SPG, and OPG.]
Figure 3.1: Function evaluations and number of edges against objective value and number of nonzero coefficients for training a log-linear model with full potentials and group ℓ1-regularization for different optimization strategies initialized with the zero vector (λ = 50). The top row is for the cyto data and the bottom row is for the awma data. This figure is best viewed in color.

3.6 Extensions

The PQN and QNST methods represent general optimization strategies for optimizing high-dimensional costly objective functions subject to simple constraints or regularizers, respectively. Hence, they may also be useful for other optimization problems. We encounter several examples in Chapters 5 and 6.
Below, we give several examples:
• Blockwise-sparse graphical models: In [Schmidt et al., 2009b] we use PQN to solve the Lagrangian dual of the blockwise-sparse GGM model examined in Duchi et al. [2008a], which we discuss further in Chapter 5. We could alternately consider applying PQN with the constrained formulation to solve the primal problem, or applying QNST directly to the primal problem (the advantage of solving in the primal is that the primal variables are sparse).

[Figure 3.2 panels: Objective Value minus Optimal (log scale) and Number of Edges, each plotted against Function Evaluations, for PQN10, PQN100, QNST10, QNST100, BBST, SPG, and OPG.]
Figure 3.2: The same experiment as Figure 3.1, but using the optimal solution for λ = 100 as the starting vector.

• Feature selection in conditional random fields: In many applications of conditional random fields we have either non-binary discrete target variables, or categorical features that are represented as a set of binary indicator variables. In both cases, there is more than one variable associated with each feature and we must consider group ℓ1-regularization to encourage sparsity in terms of the features. Since the objective function in these scenarios is costly to evaluate, the PQN and QNST methods are well-suited to solving the resulting optimization problems. Further, in Chapter 5 we discuss performing structure learning in conditional random fields with group ℓ1-regularization.
In this scenario the objective function is even more costly to evaluate than in log-linear models, so the advantages of the PQN and QNST methods are more pronounced.

[Figure 3.3 panels: Objective Value minus Optimal (log scale) plotted against Function Evaluations, for PQN10, PQN100, QNST10, QNST100, PSSas, and BBST.]
Figure 3.3: Function evaluations against objective value for training IGMs (λ = 50) with ℓ1-regularization for different optimization strategies. Top row: cyto data. Bottom row: awma data. Left column: zero vector used for initialization. Right column: solution with λ = 100 used for initialization. This figure is best viewed in color.

• Different choices of group norm: In Chapter 5, we discuss computing the projection and soft-threshold operations for different choices of the group norm. This allows us to apply the methods in this chapter to different choices of the group norm. This includes the ℓ∞ norm of the groups, and the nuclear norm of the groups in cases where the groups form matrices.
• Overlapping groups: In Chapter 6 we discuss computing the projection and soft-threshold operations when the groups overlap. That is, cases where each variable belongs in multiple groups. This allows us to apply the methods in this chapter to the case of general groups. Further, Jacob et al. [2009] describe an alternative generalization to the case of overlapping groups, and the methods in this chapter can be directly applied in this formulation.
Chapter 4
Directed Graphical Model Structure Learning

As we discuss in Chapter 1, the prior work on structure learning in probabilistic graphical models with ℓ1-regularization largely focuses on pairwise undirected models. However, given the estimated model, performing standard operations (i.e., computing the probability of a vector, computing marginals, generating unbiased samples) with undirected models is computationally intractable in general. In contrast, as we discuss in Section 1.3, it is possible to perform many operations exactly or approximately in DAG models in polynomial time. In scenarios where the estimated model will ultimately be used to perform these types of operations, DAG models are an appealing alternative to undirected models. Further, parameter estimation in DAG models is separable in the CPDs and there is no need to compute an intractable normalizing constant. Since parameter estimation is separable, we can independently tune an individual regularization parameter for each CPD (unlike undirected models, where the same regularization parameter is typically used for all edges), mix different types of data (i.e., we can have both Gaussian and binary variables in the same data set), and more efficiently search the space of graphs, since single edge modifications only change the parameters of a subset of the CPDs.

Given n realizations of p-vectors x^i, the goal of structure learning in DAG models is to find a graph structure G (and corresponding parameters {w_j, b_j} for each node j) that optimizes some criterion measuring the quality of the DAG model. In the special case where we are given a topological ordering of the nodes, this problem reduces to performing variable selection (among variables earlier in the ordering) independently for each of the CPDs [Buntine, 1991, Cooper and Herskovits, 1992].²⁷
For sigmoid belief networks, this corresponds to performing variable selection in a set of independent logistic regression models. Thus, we can learn DAG models with a known topological ordering using a straightforward extension of the methods we discuss in Chapter 1; we perform structure learning by using ℓ1-regularization to solve each of these variable selection tasks. Other works using ℓ1-regularization for structure learning in DAG models have focused on this relatively simple case [Li and Yang, 2005, Huang et al., 2006, Levina et al., 2008]. Even if we are not given a topological ordering, if we enforce that each node can have at most one parent then finding the optimal graph can be formulated and solved as a minimum spanning tree problem [Chow and Liu, 1968].²⁸ If we are not given a topological ordering and allow each node to have more than one parent, then finding the optimal DAG is NP-hard in most reasonable scenarios [Chickering, 1995, Dasgupta, 1999, Chickering et al., 2004]. Indeed, even if the graph structure is restricted to be a tree but each node is allowed to have at most k ≥ 2 parents (also known as a poly-tree), it is NP-hard to even approximate the best graph structure to within a constant factor [Dasgupta, 1999]. Nevertheless, we can typically obtain a better model by not assuming a fixed ordering, and this general case is the focus of this chapter.

²⁷ Assuming that the parameters of each CPD are independent.
²⁸ This relies on the scoring criterion satisfying the property of pairwise score equivalence, namely that the score of having x_i as the only parent of x_j is the same as the reverse. If the scoring criterion does not have this property, the optimal tree can be found in polynomial time by solving an optimal branching problem [Heckerman et al., 1995].

The main challenge arising in the general case is the acyclicity constraint. Because the graph must be acyclic, we cannot simply regress each node on all other nodes.
Consequently, we need to consider searching through the space of topological orderings, or directly searching through the space of directed acyclic graphs.

4.1 Search and Score Methods

Traditionally, there have been two different approaches to structure learning in general DAG models. In search and score methods, we use some criterion to assess the quality of a particular structure (such as the BIC or validation set likelihood), and we optimize this criterion by using a local search method to search through the space of DAGs [see Lam and Bacchus, 1993, Heckerman et al., 1995]. The BIC is widely used for evaluating the quality of a candidate structure. Early work that used the BIC includes [Lam and Bacchus, 1993, Bouckaert, 1993, Suzuki, 1999]. Under certain assumptions, the BIC gives the same score to Markov equivalent graphs [Bouckaert, 1993].²⁹ In [Friedman and Yakhini, 1996], asymptotic properties of the BIC for evaluating DAG structures are examined. They derive an asymptotic bound on the sample complexity of structure learning by optimizing the BIC score (in terms of Kullback-Leibler divergence), and show that, in addition to asymptotic consistency, the BIC score also leads to asymptotic minimality (that is, it will choose the most sparse structure that describes the distribution). The most widely used alternatives to the BIC for measuring structural quality are methods that compute the marginal likelihood of the CPDs (i.e., the likelihood after integrating over all possible parameters) given the graph structure under a suitable prior [Cooper and Herskovits, 1992, Heckerman et al., 1995]. This work will focus on the BIC, since except in special cases (such as Gaussian or tabular CPDs with conjugate priors), it is not possible to compute the marginal likelihood in closed form.
The prototypical search and score procedure is a greedy local search through the space of DAGs where at each iteration we perform the edge addition/deletion/reversal that improves the score by the largest amount, subject to satisfying the acyclicity constraint [Heckerman et al., 1995]. If no legal addition/deletion/reversal improves the score, the method can be reset to a different randomly generated DAG. We call this procedure DAG-search. The efficiency of this procedure is substantially improved if we (as in almost all related work on this subject) make the assumption of parameter modularity [Heckerman et al., 1995], meaning that if the same CPD appears in two graph structures, then the parameters of the CPD are the same in both structures (this assumption follows as a consequence of assuming that the parameters of different CPDs are independent). Parameter modularity allows us to efficiently evaluate the effect of single edge additions/deletions/reversals. Further, we can use a hash data structure to prevent re-evaluating the same CPDs, and we note that the scores for most of the candidate additions/deletions/reversals will not change after a single addition/deletion/reversal.

²⁹ These assumptions are not satisfied for sigmoid belief networks where nodes have more than one parent, since using sigmoid CPDs imposes additional structure on the model in these scenarios.

There have been a variety of approaches proposed to enhance the basic DAG-search, and we briefly review several of these modifications here. Some authors have considered using different local search procedures, such as genetic algorithms [Larrañaga et al., 1996], ant colony optimization [de Campos et al., 2002a], and the greedy randomized adaptive search procedure [de Campos et al., 2002b].
Instead of searching through the space of DAGs, some authors have proposed searching through the space of topological orderings [Larrañaga et al., 1996, de Campos et al., 2002a, Teyssier and Koller, 2005] or Markov equivalent graphs [Spirtes and Meek, 1995, Madigan et al., 1996, Munteanu and Bendou, 2001, Chickering, 2003, Nielsen et al., 2003]. Steck [2000] proposes a local search move where all directions are removed and then re-oriented. Elidan et al. [2002] consider data re-weighting schemes that may allow DAG-search to escape local minima. Hulten and Domingos [2002] use Hoeffding's inequality for structure learning with tabular CPDs when the number of training examples is enormous. Moore and Wong [2003] propose a local search move where all edges connected to a node are severed, and the node is then optimally reinserted into the graph. Nachman et al. [2004] propose a method for efficiently finding the best node to add for each variable with regression-based CPDs. Some authors have also considered exact methods that find the highest scoring structure, but that may require an exponential amount of time [Suzuki, 1999, Koivisto and Sood, 2004]. Despite the large number of more complicated methods that have been proposed in the literature, it has proven surprisingly difficult to devise a method that consistently outperforms an efficient DAG-search implementation, an issue discussed in [Teyssier and Koller, 2005]. Although search and score methods are often surprisingly effective, the main drawback of the search and score methodology is simply that the search space is very large; the space of DAGs is super-exponential in the number of nodes [Robinson, 1976].

4.2 Constraint-Based Methods

In contrast to search and score methods that try to directly optimize a criterion measuring the quality of the model, constraint-based methods seek to prune the set of possible edges.
The original methods of this type are described in [Verma and Pearl, 1990, Spirtes and Glymour, 1991]. For each pair of variables, these methods search for a conditioning set that makes the pair satisfy a conditional independence hypothesis test (or makes their conditional dependence fall below a threshold [Cheng et al., 2002]). If we assume that the data is generated according to a DAG model and if a conditioning set is found that makes the pair of variables independent, then there can not exist an edge between the pair and the edge can be removed from consideration. After this edge pruning phase, further constraints may be used to determine the directionality of a subset of the remaining edges. Of particular interest to the present work is the observation of Verma and Pearl [1990] that the search space can be reduced to the Markov blanket of each node; the set of nodes that are conditionally dependent on the node given all other nodes, consisting graphically of the node’s parents, children, and co-parents (other nodes that are parents of one of the node’s children). Unfortunately, the constraint-based approaches have several disadvantages. First, there are an exponential number of possible conditioning sets. Although implementations of constraint-based methods typically use heuristics that only consider a limited number of conditioning sets [Spirtes and Glymour, 1991], these methods must still perform a very large number of hypothesis tests. If corrections for multiple tests were incorporated into these methods, their statistical power would be very low. Indeed, it is not clear how to set the threshold value(s) in these tests such that the correct structure is identified asymptotically, as briefly discussed in [Heckerman et al., 1999]. Further, with finite data it is possible that an error in an independence test early in the procedure may lead to a propagation of errors. 
While constraint-based approaches typically output a valid equivalence class, it is possible that they will output a cyclic graph, as illustrated in the pin-wheel example of [Dash and Druzdzel, 1999]. Some constraint-based methods have sought to address some of the common criticisms of constraint-based methods [Margaritis and Thrun, 1999], but a final noteworthy criticism is that it isn't clear how the results of these hypothesis tests relate to the quality of the model.

4.3 Hybrid Methods

The disadvantages of the constraint-based methods and the search and score methods have led to the development of hybrid methods. In hybrid methods, constraint-based reasoning is used to prune the set of edges to consider within a search and score method. This can lead to an enormous reduction in the number of possible graphs to search over. Much of the early work on methods of this type focused on forming constraints by eliciting domain knowledge from human experts. One example we have already seen is the case where the expert is asked to provide a topological ordering [Cooper and Herskovits, 1992]. Other examples include methods that attempt to construct an ordering given statements of domain knowledge [Srinivas et al., 1990], and methods that use partial orderings or require that some known edges are included in the model [Lam and Bacchus, 1993]. Unfortunately, these strategies crucially rely on the existence of a domain expert to provide the constraints. One of the most popular methods to incorporate automatic pruning is the sparse candidate (SC) algorithm [Friedman et al., 1999]. In the SC algorithm, we compute a measure of dependence between each pair of variables (such as mutual information), and for a fixed k we only consider those k variables with the highest dependence as potential parents.
Although this approach makes it feasible to learn DAG models with thousands of variables, it is somewhat problematic for the following reason: we can construct DAG distributions where no value of k less than (p − 1) will include all true parents among the k most dependent variables. For example, consider a chain-structured graph where x_1 is a parent of x_2, x_2 is a parent of x_3, x_3 is a parent of x_4, and so on up to x_n. If this structure is parameterized so that the variables have a very high mutual information, and we add an extra node x_{n+1} that is a parent of x_n but with a low mutual information, then it might be the case that x_1 through x_{n−2} all have a higher mutual information with x_n than its true parent x_{n+1}. To address this problem, after using a search and score method in the reduced space, [Friedman et al., 1999] suggest re-computing the set of candidate parents (this time using a conditional measure of dependence) when a local minimum is reached.

We might hope to avoid the unsound pruning caused by the sparse candidate algorithm by accepting all parents whose pairwise mutual information is above a threshold. Unfortunately, this is not a particularly effective pruning strategy, since even if the underlying graph is sparse the variables may not be marginally independent. As an example, consider a simple chain-structured model with Gaussian CPDs. The precision matrix in this model is tri-diagonal, corresponding to a very sparse graph. However, the inverse of a tri-diagonal matrix will (in general) be completely dense; all variables are marginally dependent so no interactions are pruned.
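The tri-diagonal example can be checked numerically. The short sketch below builds an illustrative tri-diagonal precision matrix (the size p = 6 and the entries 2.0 and −0.9 are arbitrary choices made only for this example), inverts it, and counts the zero entries in each matrix.

```python
import numpy as np

p = 6
# Tri-diagonal precision matrix of a chain; diagonal dominance makes it positive definite.
precision = 2.0 * np.eye(p) - 0.9 * (np.eye(p, k=1) + np.eye(p, k=-1))
covariance = np.linalg.inv(precision)

# Count (numerically) zero entries in each matrix.
n_zero_precision = int(np.sum(np.isclose(precision, 0.0)))
n_zero_covariance = int(np.sum(np.isclose(covariance, 0.0)))
print(n_zero_precision, n_zero_covariance)
```

Every pair of variables has a non-zero covariance here, so thresholding a marginal dependence measure at zero would not remove any candidate edge even though the chain has only p − 1 edges.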
Because tests of marginal (in)dependence are not particularly effective at pruning the set of possible edges, more recent hybrid approaches have considered directly applying constraint-based structure learning methods to prune the set of edges [Dash and Druzdzel, 1999, Li and Yang, 2004, Tsamardinos et al., 2006]; these algorithms rely on conditional independence tests rather than marginal independence tests. These methods typically lead to a substantial reduction in the search space, and one of the empirically most effective current methods for learning DAG structures, the max-min hill-climbing (MMHC) algorithm, is of this type [Tsamardinos et al., 2006]. Further, if we assume that a perfect conditional independence oracle is available, and that all conditional independencies in the distribution follow from the graph structure, then reducing the search space by pruning edges between conditionally independent nodes is a sound pruning strategy (it will never remove a true dependency from the model). An alternative strategy to pruning the space of DAGs is to use conditional independence tests to obtain a variable ordering, and then apply a variable selection method assuming that the ordering represents a topological ordering [Singh and Valtorta, 1993, Acid et al., 2001, Dobra et al., 2004]. The disadvantage of this type of approach is simply that it may be very difficult to find a correct topological ordering. Of particular note is the method of [Dobra et al., 2004], where the authors initially learn a dependency network on the variables (to approximate each node's Markov blanket), and use this to construct an ordering.

4.4 A Hybrid Method with ℓ1-regularization

The recent hybrid methods are appealing compared to strict constraint-based methods, because the second phase (search and score) of the methods attempts to optimize a score measuring the quality of the structure.
Further, they can be advantageous over strict score-based methods, due to the much smaller search space. However, the hypothesis tests (or pairwise dependency measures) used by existing hybrid methods ignore the score during the first phase (edge pruning). As an example where this might be problematic, consider the case where two variables are weakly dependent and we want to find a structure optimizing the BIC score. Here, the edge may pass the independence test and not be pruned during the first phase, even though including this edge is unlikely to improve the BIC score. Similarly, an independence test might prune an edge between variables that appear to be almost independent (recall that these hypothesis tests are not corrected for multiple testing), even though including the edge would later lead to an improved validation score. Towards developing a hybrid method that takes into account our scoring criterion during both phases, we propose the following two-phase hybrid method for a given scoring criterion:

1. Edge pruning: We use ℓ1-regularization to learn a dependency network with logistic regression conditionals that optimizes the proposed scoring criterion. We refer to this as the ℓ1-Markov blanket (L1MB) algorithm, and we note that this problem can be solved even with a very large number of nodes using the methods of Chapter 2.

2. Search: We run a DAG-search algorithm to search through the space of possible DAG structures, restricted to the edges found by the L1MB algorithm.

Below we give pseudo-code for the L1MB algorithm. In our implementation, for a network with $p$ nodes we compute the ℓ1-regularized solution (and corresponding score) for $(p - 1)$ equally spaced values along the regularization path between $\lambda$ set to zero and $\lambda_{\max}$, where $\lambda_{\max}$ is the value where all (non-bias) variables become zero (see Section 2.5).

Input: Data $x_j^i$ for $i = 1, \dots, n$ and $j = 1, \dots, p$.
Output: Markov blanket $MB_j$ for each node $j$.
for $j = 1$ to $p$ do
    $MB_j \leftarrow \emptyset$ ; // initially try using empty Markov blanket
    $s \leftarrow \mathrm{score}(x_j, x_\emptyset)$ ; // compute score with empty Markov blanket
    $b \leftarrow \arg\min_b \sum_{i=1}^n \log(1 + \exp(-x_j^i b))$ ; // optimize for bias variable
    $g \leftarrow \sum_{i=1}^n x_j^i x_{-j}^i / (1 + \exp(x_j^i b))$ ; // gradient of regression weights at zero
    $\lambda_{\max} \leftarrow \max_i \{|g_i|\}$ ; // maximum value of regularization parameter
    for $\lambda = ((p - 1)/p)\lambda_{\max}$ down to $0$ in increments of $\lambda_{\max}/p$ do
        $\{w, b\} \leftarrow \arg\min_{w,b} \sum_{i=1}^n \log(1 + \exp(-x_j^i (w^T x_{-j}^i + b))) + \lambda \|w\|_1$ ; // Chapter 2
        $nz \leftarrow \{v \mid w_v \neq 0\}$ ; // find non-zero variables
        $s_{sub} \leftarrow \mathrm{score}(x_j, x_{nz})$ ; // compute score with selected Markov blanket
        if $s_{sub} > s$ then
            $s \leftarrow s_{sub}$ ; // new maximum value found
            $MB_j \leftarrow nz$ ; // record higher-scoring Markov blanket
Algorithm 10: L1MB Algorithm.

If we assume that the true structure is a DAG and that ℓ1-regularization is able to perfectly select the relevant variables, then the L1MB algorithm will identify each variable's Markov blanket.

For the second phase, we use an implementation of the DAG-search method, where at each iteration we choose the variable addition/deletion/reversal (among edges found by the L1MB algorithm) that improves the score by the largest amount. To address the criticism that DAG-search requires costly acyclicity checks [Teyssier and Koller, 2005], we used the ancestor matrix data structure described in [Giudici and Castelo, 2003] for improving the speed of Markov chain Monte Carlo methods that explore the space of DAGs. With this data structure, it is possible to check whether an addition will cause a cycle in $O(1)$, while testing whether a reversal of an existing edge leads to a cycle can be done in $O(p)$. In Appendix A, we review this data structure and present several enhancements to it, including a reversal witness matrix data structure that allows us to test whether reversing an edge will cause a cycle in $O(1)$.
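The constant-time addition test can be illustrated with a small sketch. It assumes a dense boolean matrix and supports edge additions only; the class and method names are hypothetical, and this is not the implementation reviewed in Appendix A or the one in [Giudici and Castelo, 2003]. Deletions and reversals need extra bookkeeping that is omitted here.

```python
class AncestorMatrix:
    """Tracks anc[a][d] == True when there is a directed path from a to d."""

    def __init__(self, num_nodes):
        self.n = num_nodes
        self.anc = [[False] * num_nodes for _ in range(num_nodes)]
        self.edges = set()

    def addition_creates_cycle(self, parent, child):
        # Adding parent -> child closes a cycle exactly when child already
        # reaches parent (or the edge is a self-loop): a single table lookup.
        return parent == child or self.anc[child][parent]

    def add_edge(self, parent, child):
        if self.addition_creates_cycle(parent, child):
            raise ValueError("edge would create a cycle")
        self.edges.add((parent, child))
        sources = [a for a in range(self.n) if a == parent or self.anc[a][parent]]
        targets = [d for d in range(self.n) if d == child or self.anc[child][d]]
        # The parent and all of its ancestors now reach the child
        # and all of the child's descendants.
        for a in sources:
            for d in targets:
                self.anc[a][d] = True


dag = AncestorMatrix(3)
dag.add_edge(0, 1)
dag.add_edge(1, 2)
print(dag.addition_creates_cycle(2, 0))  # True: 0 already reaches 2
print(dag.addition_creates_cycle(0, 2))  # False: 2 does not reach 0
```

The design trades memory ($p^2$ booleans) and a more expensive update after each accepted move for a cheap test on each candidate move, which suits a search that evaluates many candidates per accepted move.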
In [Schmidt et al., 2007b], we also examined a variant of the method where we used the known-ordering ℓ1-regularization method to find the optimal structure given an ordering, and we used the local swap moves described in [Teyssier and Koller, 2005] for searching the space of orderings. Although this gives a somewhat more elegant procedure, we found that this was not as effective as searching through the space of DAGs when the ancestor matrix data structure is used for testing acyclicity^30.

The hybrid ℓ1-regularization method we discuss here is closely related to the work described in [Li and Yang, 2004]^31. Li and Yang [2004] also first learn a dependency network using ℓ1-regularization. This is followed by running a constraint-based method to further prune the edges and fix the directionality of some edges, and the final step runs a DAG-search to optimize a scoring criterion. Besides removing the (potentially error-prone) second phase (and the focus on sigmoid CPDs instead of Gaussian CPDs), the crucial difference between our method and this previous work is that we use the scoring criterion when constructing the dependency network, while [Li and Yang, 2004] use hypothesis testing. As we discuss above, it is not necessarily clear how the results of the hypothesis tests relate to the score that is optimized in the final stage.

30 Note that we are using sigmoid CPDs with no bound on the in-degree of nodes in the graph, while Teyssier and Koller [2005] used tabular CPDs with a bounded in-degree. This allowed them to pre-compute all possible scores, while in our work computing all possible scores is intractable so we compute the scores as needed. This requires solving a logistic regression problem to test any changed edges.

31 At the time that [Schmidt et al., 2007b] was published, we were not aware of this work (nor were our reviewers).
4.5 Causal DAGs

Unfortunately, without being given a topological ordering it will only be possible to identify the optimal DAG structure up to Markov equivalence. That is, it may not be possible to identify the directionality of some of the edges. This may not be a bad thing if our goal is to build a density model, since it might imply that multiple DAGs will achieve the globally optimal score (and we only need to find one of them). However, this property is less appealing from the perspective of structural discovery, since if we believe our data is generated from a DAG model, it means that we may not be able to distinguish the 'true' structure from other candidates.

A notable special case where we can hope to identify the true structure without a topological ordering is the case of causal DAGs, where the data includes interventions. A causal DAG model [Pearl, 2000] is a DAG model where we assume that the directions of the edges represent causal influences (the causal Markov assumption). Under this assumption, we distinguish between conditioning by observation and conditioning by intervention. When conditioning on a variable by observation, we use the standard rules of conditional probability to answer conditional queries. When conditioning on a variable by intervention, we create a modified DAG model, and then use the standard rules of conditional probability to answer conditional queries using the modified model. Specifically, when variable $j$ is set by intervention (denoted $do(j)$), we use a modified DAG where the CPD for variable $j$ has been removed. In other words, the interventional distribution uses

$$p(x_1, \dots, x_p \mid do(j)) = \prod_{i \neq j} p(x_i \mid x_{\pi(i)}).$$

Graphically, the effect of removing this CPD is to remove all incoming edges into $j$, while preserving outgoing edges. Thus setting $j$ by intervention makes it independent of its causes, but preserves the dependency on its effects.
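The effect of dropping the CPD of an intervened node can be sketched in a few lines. The example assumes sigmoid CPDs over variables encoded as $\{-1, 1\}$, so that $p(x_j \mid x_{\pi(j)}) = \sigma(x_j (b_j + \sum_k w_{kj} x_k))$; the function names, the dictionary layout, and the two-node graph with weight 2 are hypothetical choices made for illustration.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def interventional_log_prob(x, parents, weights, biases, intervened=frozenset()):
    """Log of the product of CPDs, skipping every node that was set by intervention."""
    total = 0.0
    for j, pa in parents.items():
        if j in intervened:
            continue  # do(j): the CPD of node j is removed from the product
        activation = biases[j] + sum(weights[(k, j)] * x[k] for k in pa)
        total += log_sigmoid(x[j] * activation)
    return total


# Hypothetical two-node causal DAG: node 0 -> node 1.
parents = {0: [], 1: [0]}
weights = {(0, 1): 2.0}
biases = {0: 0.0, 1: 0.0}
x = {0: 1, 1: 1}

observational = interventional_log_prob(x, parents, weights, biases)
do_child = interventional_log_prob(x, parents, weights, biases, intervened={1})
do_parent = interventional_log_prob(x, parents, weights, biases, intervened={0})
print(observational, do_child, do_parent)
```

With `do(1)` only the marginal of node 0 remains, so the value no longer depends on the weight of the edge into node 1. With `do(0)` the CPD of node 1 is still evaluated at the value that node 0 was set to, which is the cause/effect asymmetry described above.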
Because of this asymmetry between cause and effect, it is possible to distinguish between Markov equivalent graphs in causal DAGs, given data that includes interventions. Utilizing interventional data within structure learning was first explored in [Cooper and Yoo, 1999], and extending the hybrid method based on ℓ1-regularization above to model interventional data with causal DAGs is straightforward. When estimating the conditional of node $j$ during parameter estimation (during either the L1MB or DAG-search phase), we use the modified distribution in cases where $j$ was set by intervention. Similarly, we remove the CPD for node $j$ when evaluating the BIC or validation likelihood. All other aspects of the method remain the same.

4.6 Experiments

We now experimentally examine the performance of various methods for learning sparse DAG models. We first did a series of experiments on synthetic data where the structure was known, detailed in the next section. After these experiments, we apply the methods to learn sparse DAG models of real data in §4.6.2.

4.6.1 Synthetic Data

We first considered a set of synthetic data sets, where we generated samples from a known structure and then tried to recover the structure from the samples. In particular, we obtained seven graph structures from the Bayesian Network Repository, http://compbio.cs.huji.ac.il/Repository/. In the table below, we give the names, number of nodes, number of edges, and maximum number of parents for each of the seven networks we considered.

Name         Nodes   Edges   Max Parents
insurance    27      52      3
water        32      66      5
mildew       32      46      3
alarm        37      46      3
barley       48      84      4
hailfinder   56      66      4
carpo        61      74      5

To parameterize the networks as a sigmoid belief network with strong edge weights, we set the bias for each node to zero and each edge weight was set according to the formula

$$w_{ij} \leftarrow \mathrm{sign}(N(0, 1)) + N(0, 1)/4,$$

where $N(0, 1)$ is a sample from a standard normal distribution. In our experiments, we used the BIC as the scoring criterion.
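The parameterization and sampling step can be sketched as follows. This is a minimal illustration assuming variables encoded as $\{-1, 1\}$, zero biases, and ancestral sampling along a topological ordering; the function names, the seed, and the three-node chain are hypothetical and are not taken from the experiments.

```python
import math
import random


def sample_edge_weight(rng):
    """Draw w = sign(N(0,1)) + N(0,1)/4 using two independent normal samples."""
    sign = 1.0 if rng.gauss(0.0, 1.0) >= 0 else -1.0
    return sign + rng.gauss(0.0, 1.0) / 4.0


def sample_network(order, parents, weights, rng):
    """Ancestral sampling from a sigmoid belief network with zero biases."""
    x = {}
    for j in order:  # order must be a topological ordering of the DAG
        activation = sum(weights[(k, j)] * x[k] for k in parents[j])
        p_plus = 1.0 / (1.0 + math.exp(-activation))  # p(x_j = +1 | parents)
        x[j] = 1 if rng.random() < p_plus else -1
    return x


rng = random.Random(0)

# Hypothetical chain 0 -> 1 -> 2 with randomly drawn strong edge weights.
parents = {0: [], 1: [0], 2: [1]}
weights = {(k, j): sample_edge_weight(rng) for j, pa in parents.items() for k in pa}
samples = [sample_network([0, 1, 2], parents, weights, rng) for _ in range(5)]

draws = [sample_edge_weight(rng) for _ in range(10000)]
mean_abs = sum(abs(w) for w in draws) / len(draws)
print(weights, round(mean_abs, 2))
```

The weights concentrate around $+1$ and $-1$, so every edge has a clearly non-zero effect, which is what "strong edge weights" means in this construction.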
In our first experiment, we compared the performance of several different possible strategies for pruning the set of edges:

• SC: The set of parents selected on the first iteration of the sparse candidate method [Friedman et al., 1999], where we used pairwise mutual information to rank the candidates. We tested the method with two parameters, the true maximum in-degree across the networks (5) and double this amount (10).

• MMPC: The set of parents remaining after the max-min parents and children pruning procedure [Tsamardinos et al., 2006], where conditional hypothesis tests are used to prune the set of parents. The experiments in Tsamardinos et al. [2006] indicate that this constraint-based procedure leads to state-of-the-art results against a wide variety of alternative methods. We used the implementation in the authors' Causal Explorer software [Aliferis et al., 2003]. We tested the method with two values of the hypothesis test threshold, the software default of 0.05 and a more conservative value of 0.10.

• L1MB: The proposed procedure for finding the Markov blanket of each node using ℓ1-regularization, where the hyper-parameter is selected to optimize the scoring criterion. Other than the selection of points to evaluate along the regularization path, this algorithm has no parameters.

An ideal pruning procedure would remove as many edges as possible, while minimizing the number of true edges that are removed.
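The two quantities that such a comparison trades off can be computed as in the sketch below. It assumes that candidate and true edges are compared as unordered node pairs and that the fraction remaining is taken over all $p(p-1)/2$ pairs; the text does not spell out these conventions, so both are assumptions, as are the function name and the toy edge sets.

```python
def pruning_metrics(candidate_edges, true_edges, num_nodes):
    """Return (fraction of possible edges kept, number of true edges pruned away)."""
    candidates = {frozenset(edge) for edge in candidate_edges}
    truth = {frozenset(edge) for edge in true_edges}
    possible = num_nodes * (num_nodes - 1) // 2  # unordered pairs of distinct nodes
    fraction_remaining = len(candidates) / possible
    true_removed = len(truth - candidates)
    return fraction_remaining, true_removed


# Hypothetical 4-node example: the pruning keeps three pairs, one of them spurious.
true_edges = [(0, 1), (1, 2), (2, 3)]
candidate_edges = [(0, 1), (2, 1), (0, 3)]
print(pruning_metrics(candidate_edges, true_edges, num_nodes=4))  # (0.5, 1)
```

Here half of the possible pairs survive the pruning and one true edge, between nodes 2 and 3, is lost, so a search restricted to these candidates could not recover the generating structure.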
In Figure 4.1, we plot the percent of edges remaining (top) and the number of true edges removed (bottom) for all of the methods on the seven data sets for three different sample sizes (1000, 5000, and 20000).

[Figure 4.1: The percent of edges remaining (top) and number of true edges removed (bottom) for different edge pruning strategies for seven structures from the Bayesian network repository. From left to right, the plots show the results with sample sizes of 1000, 5000, and 20000. Horizontal axis: data set (1 to 7); methods shown: SC(5), SC(10), MMPC(.05), MMPC(.1), L1MB. We see that the L1MB pruning method leads to a reasonable amount of pruning while tending not to remove true edges.]

In this figure, we see that the SC method has a fairly sharp trade-off between the two objectives: SC(5) removes a large number of edges but removes many true edges, while SC(10) removes fewer true edges but does not prune much of the search space. The MMPC method is more effective; it reduces the search space substantially and does not remove many true edges, and the number of true edges that it removes decreases as the sample size increases. The L1MB method has similar behaviour; L1MB does not prune quite as much as the MMPC method but removes fewer true edges.
Indeed, the L1MB method removed no true edges in any data set for any of the experiments with 5000 or 20000 samples (and it never removed more than one true edge), while the other methods removed multiple true edges in almost every case.

In our next experiment, we sought to assess the effectiveness of a DAG-search routine under these different pruning strategies. We compared the five methods examined in the previous experiment, as well as applying the DAG-search with no pruning. To test the different pruning strategies, we started the DAG-search from the empty graph and ran it until it had made 10000 score evaluations. If a local minimum was found before this limit, the methods were restarted from a randomly generated DAG (we generated the DAGs by generating a random topological ordering, and adding each edge consistent with the pruning and the ordering with probability 0.5). The same random DAGs were used across the methods. We reset the hash table of score values after each local minimum was found, but better performance would be achieved by keeping the same hash table between runs. We plot the BIC after 10000 evaluations against the data sets for the different methods in Figure 4.2. Since the absolute BIC varies across data sets and sample sizes, these figures plot a relative BIC. In the top of Figure 4.2, we computed the relative BIC by scaling the scores to values between the lowest BIC found across the methods, and the BIC of the empty graph. Under this criterion, the empty graph would have a relative BIC of 1, and the best graph found across the methods gets a value of 0. In the bottom of Figure 4.2, we plot the score relative to the pruning strategy that had the highest BIC over each data set. In this figure, we see that the L1MB pruning consistently led to low BIC across the sample sizes, while the SC methods were less effective and the MMPC methods were in between. Interestingly, not using any pruning seemed to be more effective as the sample size increased. This might be because the BIC favours the true model more heavily as the sample size increases.

[Figure 4.2: The relative BIC after 10000 score evaluations in a DAG-search for different pruning strategies on the seven synthetic data sets from the Bayesian Network Structure Learning Repository. We plot the BIC relative to the empty graph (top) and relative to the highest score for each data set (bottom). From left to right, the plots show the results with sample sizes of 1000, 5000, and 20000. Horizontal axis: data set (1 to 7); methods shown: SC(5), SC(10), MMPC(.05), MMPC(.1), L1MB, None. We see that the L1MB pruning consistently achieves among the lowest scores.]

To gain more insight into the performance disparities between different pruning methods, in Figure 4.3 we plot the BIC of the current structure against the score evaluation for the different pruning methods for the three sample sizes on the alarm data set (we omit the MMPC methods for clarity, but note that these methods resemble the SC and L1MB methods). Here, we see that the None method takes substantially longer to reach a local minimum than the other methods, but eventually reaches a good local minimum. In contrast, the pruning methods reach local minima very quickly, and this allows them to explore multiple modes. However, because the SC and MMPC methods tend to remove true edges from the model, the minima they found tend to be poorer than those found by the None and L1MB methods.

[Figure 4.3: The BIC against the number of score evaluations in a DAG-search for different pruning strategies with 1000 (left), 5000 (middle), and 20000 (right) samples from the alarm data set. Horizontal axis: score evaluation (0 to 4000); vertical axis: BIC; methods shown: SC(5), SC(10), L1MB, None. We see that no pruning eventually leads to a good score, that the pruning strategies allow the method to explore multiple local optima, and that the L1MB algorithm achieves both of these properties.]

[Figure 4.4: Structural errors for the highest scoring structure after 10000 score evaluations in an interventional DAG-search for different pruning strategies on the seven synthetic data sets from the Bayesian Network Structure Learning Repository. From left to right, the plots show the results with sample sizes of 1000, 5000, and 20000. Horizontal axis: data set (1 to 7); vertical axis: structural errors; methods shown: SC(5), SC(10), L1MB, None. We see that the L1MB pruning leads to the fewest structural errors in almost every case.]

In general, without a topological ordering we can only expect to recover the true structure up to its Markov equivalence class. This is the reason we used the BIC as a measure of performance in the previous experiment.
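The rescaling used for the top row of Figure 4.2 can be written out explicitly. The sketch below assumes the linear rescaling $(\mathrm{BIC} - \mathrm{BIC}_{best}) / (\mathrm{BIC}_{empty} - \mathrm{BIC}_{best})$, which is one natural reading of the description (best graph at 0, empty graph at 1); the function name and the BIC values are hypothetical and are not results from the experiments.

```python
def relative_bic(bic_by_method, bic_empty):
    """Rescale BIC values so the best method maps to 0 and the empty graph to 1."""
    best = min(bic_by_method.values())
    span = bic_empty - best
    if span <= 0:
        raise ValueError("the empty graph must score worse than the best method")
    return {method: (bic - best) / span for method, bic in bic_by_method.items()}


# Hypothetical BIC values for one data set (lower is better).
scores = {"SC(5)": 120.0, "L1MB": 100.0, "None": 105.0}
rescaled = relative_bic(scores, bic_empty=200.0)
print(rescaled)  # L1MB maps to 0.0; the empty graph itself would map to 1.0
```

Rescaling per data set makes the curves comparable across networks and sample sizes, at the cost of hiding the absolute size of the gaps between methods.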
In our final experiment on synthetic data, we generated interventional data to test the ability of the different pruning strategies to recover the true structure. To generate interventional data, for each sample we generated a random integer between 0 and $p$, and intervened on the corresponding node by setting it to 1 with probability 0.5 (when 0 was drawn, we did not intervene on any nodes and generated a purely observational sample). We plot the number of structural errors in the structure with the lowest BIC after 10000 evaluations for the different pruning strategies on the different data sets and sample sizes in Figure 4.4. We did not run the MMPC method on this data set, since that software does not support interventions. In this figure, we see that the L1MB method consistently achieves among the lowest structural errors, making fewer errors as the sample size increases.

[Figure 4.5: The structural errors against the number of score evaluations in an interventional DAG-search for different pruning strategies with 1000 (left), 5000 (middle), and 20000 (right) samples from the alarm data set. Horizontal axis: score evaluation (0 to 4000); vertical axis: structural errors; methods shown: SC(5), SC(10), L1MB, None.]

In Figure 4.5, we plot the number of structural errors achieved by the current structure against the number of score evaluations on the alarm data set for the different sample sizes. An interesting aspect of these plots is that for small sample sizes the number of structural errors does not decrease monotonically with the BIC. As a consequence, we see that with 5000 samples the None method finds the true structure, but it does not choose this structure since it finds a different structure with a lower score.
With 5000 samples the L1MB method also finds the true structure (three times) during its search. The L1MB method also finds nine local optima with a single structural error. That is, these structures are one edge away from the true structure, but the modification cannot be made without violating acyclicity. With 20000 samples both the None and L1MB pruning methods find the true structure, but the L1MB method finds it seven times before the None method finds it.

4.6.2 Real Data

Because the true structure is generally unknown in real data, it is generally not possible to evaluate a structure learning method in terms of structural errors. However, we might still be interested in testing whether a method recovers a plausible structure. Thus, we first sought to test the method on a real data set where we had a reasonable guess of both a topological ordering of the variables and the structure of the model. Towards this end, we focused on the rain data we describe in Section 1.7. For this data, we assumed that using the days of the month in order would represent a reasonable topological ordering. Further, we might expect to learn a structure that connects adjacent days, under the intuition that if it rains on one day it is likely to also rain the next day. This would lead to a 28-node Markov chain structure. We might also expect to see connections between non-adjacent but close days, although connections between distant days seem less likely.

In Figure 4.6, we plot the structure given by four different structure learning methods: (i) finding the optimal tree structure subject to the topological ordering, (ii) exhaustive enumeration to find the structure with highest BIC that is consistent with the ordering and the SC(5) pruning, (iii) greedily selecting parents starting from the empty graph (this is similar to the K2 algorithm [Cooper and Herskovits, 1992]), and (iv) using the L1MB method constrained to be consistent with the ordering (in this case no search is necessary).
In this plot we see that the optimal tree for this data set is a 28-node Markov chain, as expected. In contrast, the structures learned by the other methods are much less interpretable, including not only edges between adjacent days but also edges between more distant days. The structures learned by exhaustive enumeration after using the other pruning strategies from Figure 4.1 (namely, the SC(10), MMPC(0.05), and MMPC(0.1) methods) were qualitatively similar to these latter structures, in that they included all edges between adjacent days but also included edges between temporally close nodes as well as nodes that are not temporally close. This would seem to indicate that these methods are not ideal for structural discovery (or that the BIC is not appropriate for judging the quality of the structure).

[Figure 4.6: Structures estimated on the rain data set under a topological ordering. From left to right: optimal tree-structure consistent with ordering, optimal parents consistent with the ordering and SC(5) pruning, greedy parent selection given the ordering, and the L1MB algorithm constrained to be consistent with the ordering. Nodes are the days 1 to 28.]

[Figure 4.7: The regression weights for the rain data set using the L1MB algorithm for a topological ordering. We see that weights between adjacent days (first diagonal above the main diagonal) are much larger than the other weights.]

However, we gain additional insight if we look at the regression weights. In Figure 4.7 we plot as a matrix the absolute value of the (non-bias) regression weights of the L1MB method.
In this plot, we see that the first diagonal above the main diagonal contains substantially larger weights than the rest of the matrix. The elements on this diagonal represent the effect of the previous day on each day. We also see some much weaker weights on the next two upper diagonals and spread out throughout the rest of the upper triangle of the matrix (the main diagonal is zero since it does not correspond to a parameter, while the lower triangular part of the matrix is zero because we assumed that the order represents a topological ordering). With logistic regression CPDs over binary parents encoded as $\{-1, 1\}$ binary variables, we can interpret the regression weights in terms of the odds of the child taking the same value as its parent. For example, a regression weight of 0.5 means that the logarithm of the odds of a child taking the same value as its parent is increased by 0.5 (over its bias value). By looking at the regression weights, we see that the 28-node Markov chain has the strongest influence on the model and that it is recovered if we only concentrate on the largest regression weights (this is not unique to the L1MB method; the same is true of the other pruning methods). Thus, the unexpected extra edges present in the L1MB graph structure represent weaker statistical dependencies. These might be spurious correlations detected by the method that happen to improve the BIC, or they might reflect that the data is not perfectly modeled by a sigmoid belief network.

In general, we may not have a topological ordering, so our next experiment compared the various DAG-search pruning strategies on the four larger binary data sets we discuss in Section 1.7. For most of these larger data sets the MMPC pruning did not finish after a week of computation, so we decreased the maximum cardinality of the conditioning sets to 5 for the MMPC pruning methods. We plot the relative BIC for the different data sets after 50000 score evaluations in Figure 4.8. In these plots, we see that the basic DAG-search (None) is effective on the two smaller data sets but its performance decreases substantially on the two data sets with the larger number of nodes (on the usps data set, the evaluation limit is exceeded before all neighboring graphs can be considered). In contrast, the L1MB method was more effective than the other methods on the two higher-dimensional data sets.

[Figure 4.8: The relative BIC compared to the empty graph (left) and method with highest BIC (right) after 50000 score evaluations in a DAG-search for different pruning strategies on the real data sets. The data are ordered by number of nodes: (1) rain (28 nodes), (2) msweb (57 nodes), (3) news (100 nodes), and (4) usps (256 nodes). Methods shown: SC(5), SC(10), MMPC(.05,5), MMPC(.1,5), L1MB, None. Note that the None method has a relative BIC of 0 on the usps data set in the left figure.]

We now look at one of the learned structures in more detail, focusing on the news data.
The 100 words measured in this data set are aids, baseball, bible, bmw, cancer, car, card, case, children, christian, computer, course, data, dealer, disease, disk, display, doctor, dos, drive, driver, earth, email, engine, evidence, fact, fans, files, food, format, ftp, games, god, government, graphics, gun, health, help, hit, hockey, honda, human, image, insurance, israel, jesus, jews, launch, law, league, lunar, mac, mars, medicine, memory, mission, moon, msg, nasa, nhl, number, oil, orbit, patients, pc, phone, players, power, president, problem, program, puck, question, religion, research, rights, satellite, science, scsi, season, server, shuttle, software, solar, space, state, studies, system, team, technology, university, version, video, vitamin, war, water, win, windows, won, and world. The Markov blankets estimated by L1MB for the first five words are

• aids: children, disease, evidence, fact, food, health, president, program, research

• baseball: case, christian, computer, drive, email, fact, fans, games, god, government, help, hit, league, memory, nhl, players, power, puck, question, season, software, state, system, team, win, windows

• bible: car, card, christian, course, earth, fact, god, jesus, orbit, program, question, religion, version, windows, world

• bmw: car, christian, engine, god, government, help, university, windows

• cancer: disease, health, medicine, patients, research, studies

Many of the words present in these estimated Markov blankets represent fairly natural associations (aids:disease, baseball:fans, bible:god, bmw:car, cancer:patients, etc.). However, some of the estimated statistical dependencies seem less intuitive, such as baseball:windows and bmw:christian. As before, we gain more insight if we look at not only the sparsity pattern but also the regression weights.
Below we repeat the list along with the values of the corresponding regression weights:

• aids: children (0.53), disease (0.84), fact (0.47), health (0.77), president (0.50), research (0.53)

• baseball: christian (-0.98), drive (-0.49), games (0.81), god (-0.46), government (-0.69), hit (0.62), memory (-1.29), players (1.16), season (0.31), software (-0.68), windows (-1.45)

• bible: car (-0.72), card (-0.88), christian (0.49), fact (0.21), god (1.01), jesus (0.68), orbit (0.83), program (-0.56), religion (0.24), version (0.49)

• bmw: car (0.60), christian (-11.54), engine (0.69), god (-0.74), government (-1.01), help (0.50), windows (-1.43)

• cancer: disease (0.62), medicine (0.58), patients (0.90), research (0.49), studies (0.70)

Here, we see that some of the less intuitive statistical dependencies have negative regression weights, indicating that they represent a dissociative relationship (i.e., the model reflects that baseball:windows is an unlikely combination). Closer investigation reveals that these dissociative relationships heavily influence the model. For example, if we examine all the regression weights, the strongest dissociative relationship is government:nhl (with a weight of −13.31), while the strongest associative relationship is food:msg (with a weight of 2.52). Further, there are 1173 negative regression weights, and only 286 positive regression weights (while 8541 are zero).
[Figure 4.9: All edges with regression weight above 0.5 in the Markov blankets estimated by L1MB on the news data. Undirected edges represent cases where the directed edge was found in both directions.]

[Figure 4.10: All edges with regression weight above 0.5 in the model found by DAG-search with L1MB pruning on the news data.]
Figure 4.11: The tree structure that maximizes the BIC on the news data.

Given this large number of non-zero weights, it is difficult to appropriately visualize the many relationships present in the model. Visualization is further complicated by the number of weak relationships (that might represent false positives). Thus, to visualize the strongest associative effects in the estimated Markov blankets, we plot in Figure 4.9 all edges where the regression weight is above 0.5. In this figure, undirected edges represent cases where the edge was selected in both directions, while directed edges represent edges that were selected asymmetrically. In Figure 4.10, we plot the first local minimum found by the DAG-search with L1MB pruning (again restricted to edges where the weight is above 0.5). In both of these graphs, we can clearly see trends in different regions of the graph, including areas of words related to sports, cars, politics, religion, computers, and outer space.

Unlike the dependency network estimated by L1MB, the DAG structure is a consistently parameterized density model. Thus, we can use it to measure likelihoods (this could be used to test whether a newsgroup post is spam, for example) or to generate independent samples from the distribution.
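As a minimal sketch of the sampling operation mentioned above, the following code performs ancestral sampling in a small sigmoid DAG. It assumes logistic-regression CPDs, variables coded as −1/+1, and a known topological ordering; the three-node chain, its parameters, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def ancestral_sample(order, parents, weights, bias, rng):
    """Draw one joint sample from a sigmoid DAG.

    order   : nodes listed in a topological order (parents before children)
    parents : dict node -> list of parent nodes
    weights : dict node -> list of weights aligned with parents[node]
    bias    : dict node -> bias term
    Values are in {-1, +1} (an assumed encoding).
    """
    x = {}
    for node in order:
        # Linear function of the already-sampled parents.
        z = bias[node] + sum(w * x[j] for w, j in zip(weights[node], parents[node]))
        # P(x_node = +1 | parents) = sigmoid(z).
        x[node] = 1 if rng.random() < sigmoid(z) else -1
    return x


# Hypothetical chain a -> b -> c with made-up parameters.
order = ["a", "b", "c"]
parents = {"a": [], "b": ["a"], "c": ["b"]}
weights = {"a": [], "b": [2.0], "c": [-1.5]}
bias = {"a": 0.0, "b": 0.1, "c": -0.2}

rng = random.Random(0)
samples = [ancestral_sample(order, parents, weights, bias, rng) for _ in range(2000)]
frac_a_on = sum(s["a"] == 1 for s in samples) / len(samples)
```

Each sample costs one pass over the nodes, and the samples are independent because every draw restarts from the root nodes.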
In Figure 4.11 we plot the tree structure that optimizes the BIC, calculated using the generalization of the Chow-Liu algorithm discussed in [Heckerman et al., 1995, §7.1] (note that the edge directions in this plot are meaningless as long as they do not create a v-structure, hence there is no special significance to e-mail being the root of the graph). The tree structure is much less dense (containing only 99 edges) and hence much more interpretable than the L1MB or DAG structures. Further, we see a similar grouping of topics. However, because each node can have at most one parent, this model does not place direct edges between several highly related concepts. For example, the tree model assumes that the words ‘hockey’ and ‘puck’ are independent given the value of the ‘team’ variable. As a more extreme example, we must traverse six nodes to reach the word ‘mac’ from ‘pc’ (both the hockey:puck and mac:pc interactions are direct edges in the L1MB and DAG structures).

A further potential advantage of using general DAG models instead of restricting to trees is the ability of DAG models to ‘explain away’ different competing hypotheses. For example, in the DAG structure the word ‘program’ has both ‘space’ and ‘disk’ as parents. This reflects that we are more likely to see the word ‘program’ if we see the word ‘space’ or if we see the word ‘disk’. Further, this also means that if we see the words ‘program’ and ‘disk’ then we are less likely to see the word ‘space’ (observing ‘disk’ explains why ‘program’ was observed, making it less likely that the word ‘space’ is present). This explaining away between parent variables does not happen in trees, since each node can have at most one parent. The phenomenon of explaining away also does not happen in pairwise undirected graphical models, although it is possible that explaining away can be modeled in undirected graphical models with higher-order potentials.
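The explaining-away effect can be checked numerically on a toy version of the space → program ← disk structure. The sketch below is only illustrative: the 0/1 encoding, the marginal probabilities, and the CPD weights are made-up assumptions, not values estimated from the news data.

```python
import math
from itertools import product


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Made-up marginals for the two parent words (0 = absent, 1 = present).
PRIOR = {"space": 0.2, "disk": 0.2}


def p_program(space, disk):
    """Assumed logistic CPD: 'program' is likely if either parent is present."""
    return sigmoid(-3.0 + 3.0 * space + 3.0 * disk)


def joint(space, disk, program):
    """Joint probability under the v-structure space -> program <- disk."""
    p = PRIOR["space"] if space else 1.0 - PRIOR["space"]
    p *= PRIOR["disk"] if disk else 1.0 - PRIOR["disk"]
    q = p_program(space, disk)
    return p * (q if program else 1.0 - q)


def prob_space(evidence):
    """P(space = 1 | evidence) by brute-force enumeration of all 8 states."""
    num = den = 0.0
    for s, d, g in product([0, 1], repeat=3):
        state = {"space": s, "disk": d, "program": g}
        if any(state[k] != v for k, v in evidence.items()):
            continue
        pr = joint(s, d, g)
        den += pr
        if s == 1:
            num += pr
    return num / den


p_prior = prob_space({})
p_given_program = prob_space({"program": 1})
p_given_program_and_disk = prob_space({"program": 1, "disk": 1})
```

With these assumed parameters, observing ‘program’ raises the probability of ‘space’ above its prior, and additionally observing ‘disk’ lowers it again, which is the pattern described in the text.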
Figure 4.12: All edges with regression weight above 1 in the Markov blankets estimated by L1MB on the usps data. Undirected edges represent cases where the directed edge was found in both directions.

Figure 4.13: All edges with regression weight above 1.5 in the model found by DAG-search with L1MB pruning on the usps data.
Figure 4.14: The optimal tree structure on the usps data.

We also examined the graph structures learned for the usps data. Since the 256 variables in this data set are the intensity values at individual pixels in a 16 by 16 image, we might expect the method to learn a structure resembling a two-dimensional grid structure where each node is connected to its horizontal and vertical neighbors. As in the news data set, the graphs learned on the usps data are relatively dense. For example, the dependency network estimated by the L1MB algorithm contained 4678 edges. Though this is a small subset of the 65280 possible edges, it makes visualization of the graph difficult.
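To put these counts in context, the short sketch below enumerates the edges of the hypothesized 16 by 16 grid and compares them with the number of ordered pairs of distinct pixels (256 × 255 = 65280). The helper name `grid_edges` and the (row, column) node labels are illustrative assumptions.

```python
def grid_edges(rows, cols):
    """Undirected edges of a grid in which each pixel is linked to its
    horizontal and vertical neighbors; nodes are 1-indexed (row, col) pairs."""
    edges = set()
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            if c < cols:
                edges.add(((r, c), (r, c + 1)))  # horizontal neighbor
            if r < rows:
                edges.add(((r, c), (r + 1, c)))  # vertical neighbor
    return edges


p = 16 * 16                        # number of pixel variables
possible_directed = p * (p - 1)    # ordered pairs of distinct pixels: 65280
grid = grid_edges(16, 16)          # 2 * 16 * 15 = 480 undirected grid edges
learned = 4678                     # edge count reported for the L1MB network
fraction = learned / possible_directed
```

Under these assumptions the learned dependency network uses roughly 7% of the possible directed edges, yet it is still far denser than a pure nearest-neighbor grid.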
To visualize the strongest interactions in the model, in Figure 4.12 we show all edges with a regression weight above 1 in the Markov blankets estimated by the L1MB method. Here, we see a structure roughly resembling what we might expect. Some nodes in the graph, such as (12, 5), are connected to their horizontal and vertical neighbors in the image. There are also a large number of nodes that are connected to not only their horizontal and vertical neighbors, but also their diagonal neighbors. Some areas of the graph bear less of a resemblance to a grid model, including some areas that are more dense and some areas that are more sparse.

In Figure 4.13, we plot the edges with strongest regression weights found by the DAG-search procedure with L1MB pruning (the full structure has 1813 edges). In this figure, we see that most of the strongest edges represent interactions between pixels that are adjacent in the image. However, there are a large number of nodes in this graph that are not connected to all of their horizontal and vertical neighbors. The likely cause for the model not including these obvious interactions is the acyclicity constraint, and the difficulty of searching through the space of graphs without violating this constraint.

In Figure 4.14, we plot the tree structure that optimizes the BIC. In this structure we see that all edges are between adjacent pixels. However, we also see that (because the structure is constrained to be a tree) the structure is missing most of the dependencies between adjacent pixels, and the tree reveals very little about the true nature of the data. In Chapter 5, we see that using $\ell_1$-regularization for structure learning in undirected graphical models (that do not have an acyclicity constraint) can yield a much more intuitive structure on the usps data set.
However, we note that while this undirected structure is more intuitive, we must make approximations when using the structure since it is computationally intractable to perform many operations with the structure. In contrast, the advantage of the DAG model in Figure 4.13 is that we can perform many operations with the model (such as computing probabilities of fully observed vectors and generating unbiased samples) in polynomial time.

4.7 Similar Methods

In this chapter, we discuss a method that uses $\ell_1$-regularization for structure learning in DAG models. Using $\ell_1$-regularization in DAG models was also explored in [Li and Yang, 2004, 2005, Huang et al., 2006, Levina et al., 2008]. In this section, we highlight the differences between the method outlined in this chapter and this prior work³².

³² There has also been subsequent work done after the publication of [Schmidt et al., 2007b] that extends our work; we discuss this in the next section.

The first notable distinction between the present work and this prior work is that we focus on binary data, while these prior works all focused on Gaussian data. This may seem like a small difference, but the computational difference between using a DAG model and using the analogous undirected model can be substantial. For example, in the worst case computing the probability of a continuous vector in GGMs costs $O(p^3)$, while in Gaussian DAGs it costs $O(p^2)$. In contrast, in the worst case computing the probability of a binary vector in IGMs is #P-hard, while in sigmoid DAGs it only costs $O(p^2)$. Thus, the computational savings achieved by considering DAGs in the binary case are much more substantial than in the Gaussian case. Further, the work we describe in this chapter can also be applied to the Gaussian case by simply using Gaussian CPDs (we give further details in [Schmidt et al., 2007b]).
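The $O(p^2)$ cost in the sigmoid DAG case comes from evaluating one logistic CPD per node, each of which sums over at most $p - 1$ parents. A minimal sketch follows, assuming variables coded as −1/+1 so that each CPD can be written as a logistic function of $x_i$ times the linear term; the four-node structure, its parameters, and the function names are hypothetical.

```python
import math
from itertools import product


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def sigmoid_dag_loglik(x, parents, weights, bias):
    """Log-probability of a fully observed vector x with entries in {-1, +1}.

    Node i contributes log sigmoid(x[i] * (w_i^T x_parents + b_i)); the double
    loop over nodes and their parents gives the O(p^2) worst-case cost.
    """
    total = 0.0
    for i in range(len(x)):
        z = bias[i] + sum(w * x[j] for w, j in zip(weights[i], parents[i]))
        total += log_sigmoid(x[i] * z)
    return total


# Hypothetical four-node DAG with made-up parameters.
parents = [[], [0], [0, 1], [2]]
weights = [[], [1.2], [-0.7, 0.4], [2.0]]
bias = [0.3, -0.1, 0.0, 0.5]

# Because each CPD is normalized locally, the joint sums to one with no
# global normalizing constant to compute.
total_prob = sum(
    math.exp(sigmoid_dag_loglik(list(x), parents, weights, bias))
    for x in product([-1, 1], repeat=4)
)
```

The brute-force sum over all $2^4$ vectors is only a sanity check on the toy model; evaluating a single vector never requires it.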
The method we discuss in this chapter also extends easily to other data types, including general discrete data or continuous data modeled using a robust distribution like Student’s t-distribution (we give details in the next section). Further, we can combine these different data types to model vectors with mixed types.

Another notable difference with the methods of [Li and Yang, 2005, Huang et al., 2006, Levina et al., 2008] is that we do not assume a known ordering of the variables. These prior works can be interpreted as a variant on the graphical LASSO where we parameterize the precision matrix in terms of an $LDL^T$ factorization (where $D$ is diagonal with positive entries and $L$ is lower triangular with unit diagonal). Li and Yang [2005] seek to optimize $L$ and $D$ under a prior similar to the graphical LASSO, Huang et al. [2006] use direct $\ell_1$-regularization of the coefficients in $L$, while Levina et al. [2008] use a variant of $\ell_1$-regularization on the elements of $L$ that encourages $L$ to have low bandwidth. If we consider the Gaussian case, the method we discuss in this chapter is most closely related to [Huang et al., 2006], since there is a direct correspondence between sparsity in linearly-parameterized Gaussian CPDs and sparsity in the Cholesky factor $L$. However, since we do not assume a fixed ordering, in the Gaussian case our method can be interpreted as using a $PLDL^TP^T$ factorization of the precision matrix instead of an $LDL^T$ factorization, where $P$ is a permutation matrix. That is, we additionally consider searching for the best permutation of the variables (since some permutations will yield higher degrees of sparsity than others).

Besides the extension beyond the Gaussian case, the distinction between this work and [Li and Yang, 2004] is more subtle. This method can also be interpreted as being parameterized in terms of a $PLDL^TP^T$ factorization of the precision matrix.
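The correspondence between sparse Gaussian CPDs and a sparse triangular factor of the precision matrix can be illustrated on a three-variable example. The sketch below assumes the linear model $x = Wx + \epsilon$ with independent noise of variance $d_i$, and variables already listed in a topological order so that $W$ is strictly lower triangular; the precision matrix is then $(I - W)^T D^{-1} (I - W)$. Whether the unit-triangular factor appears as lower or upper triangular depends on the ordering convention, and the numbers and helper names here are made up.

```python
def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def unit_lower_inverse(B):
    """Inverse of a unit lower-triangular matrix by forward substitution."""
    n = len(B)
    A = [[0.0] * n for _ in range(n)]
    for c in range(n):
        for r in range(n):
            rhs = 1.0 if r == c else 0.0
            A[r][c] = rhs - sum(B[r][k] * A[k][c] for k in range(r))
    return A


n = 3
# v-structure 0 -> 2 <- 1: only the last row of W is non-zero.
W = [[0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0],
     [0.8, -0.5, 0.0]]
d = [1.0, 2.0, 0.5]  # conditional noise variances (made up)

B = [[(1.0 if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]
D = [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
Dinv = [[1.0 / d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

K = matmul(matmul(transpose(B), Dinv), B)   # precision matrix
A = unit_lower_inverse(B)
Sigma = matmul(matmul(A, D), transpose(A))  # implied covariance matrix
KS = matmul(K, Sigma)                       # should be the identity
```

In this example the factor `B` has a zero for the absent edge between variables 0 and 1, while the precision matrix `K` has a non-zero entry there because the two variables are co-parents. This is one way to see why a sparse DAG and a sparse precision matrix are different notions of sparsity.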
However, the optimization objective function is not clear in the three-stage procedure of [Li and Yang, 2004]. The first stage of this procedure computes an $\ell_1$-regularized Gaussian dependency network, where hypothesis testing is used to find the value of $\lambda$. The second stage uses the sparsity pattern obtained by the first stage, along with a series of independence tests, to increase the sparsity of the model and direct some of the edges. Finally, the third stage uses a search algorithm that explicitly optimizes a scoring function to determine the directionality of the remaining edges. It is only in this third stage that the method seeks to find a high-scoring structure, and it is not clear how the first two stages relate to the score of the structure. In contrast, our method only has two phases; in the first phase we estimate a dependency network using $\ell_1$-regularization, and in the second phase we search for a high-scoring network restricted to the edges present in the dependency network. Moreover, unlike this previous work we use the same score in both phases. This leads to a simpler and more elegant framework, and avoids the need to rely on the indirect performance measures provided by the results of hypothesis tests.

4.8 Extensions

This chapter has considered using $\ell_1$-regularization for learning sparse DAG models of binary data with logistic regression CPDs. However, the method we discuss in this chapter can be extended in a straightforward way to a variety of other scenarios. This section gives an overview of several of these extensions.

4.8.1 Other CPDs

Instead of using logistic regression to represent CPDs over binary data, we could consider a wide variety of alternatives. In the following list, we discuss several examples:
• Generalized linear models of binary data: Logistic regression is a particular instance of a generalized linear model for binary classification. We can apply the ideas presented in this chapter to other models in this family.
For example, one of the simplest modifications we could consider is to use probit regression instead of logistic regression. In probit regression, we model the conditional probability of a child given a linear function of its parents using the cumulative distribution function of the standard normal,
\[ p(x_i | x_{\pi(i)}, w_i, b_i) = \Phi(x_i (w_i^T x_{\pi(i)} + b_i)). \]
The hybrid $\ell_1$-regularization method can be easily modified to use probit regression for one or more of the nodes’ CPDs, while the methods of Chapter 2 can also be applied to optimize the probit regression negative log-likelihood with $\ell_1$-regularization. As another example of a generalized linear model of binary data, we could use complementary log-log regression,
\[ p(x_i = 1 | x_{\pi(i)}, w_i, b_i) = 1 - \exp(-\exp(w_i^T x_{\pi(i)} + b_i)). \]
As opposed to the logistic and probit regression models that are derived from cumulative distribution functions that are symmetric around zero, the complementary log-log regression is derived from the cumulative distribution function of the extreme-value distribution, and is asymmetric around zero. This property may be useful in cases where one of the binary states is much more likely than the other, such as in the news data where most words appear infrequently. For additional details about the probit, logistic, and complementary log-log models, see [Johnson and Albert, 1999, §3.1].
• Gaussian models of real-valued data: We could consider modeling real-valued data using Gaussian CPDs,
\[ p(x_i | x_{\pi(i)}, w_i, b_i, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x_i - w_i^T x_{\pi(i)} - b_i)^2}{2\sigma^2} \right). \]
In this case, optimizing the CPDs in terms of $\{w_i, b_i\}$ involves solving an $\ell_1$-regularized least-squares problem (for a fixed set of regression weights we can solve for $\sigma$ analytically). The algorithms of Chapter 2 can also be applied to this case, although there exist many other solvers for this particular problem. If all CPDs are Gaussian then the joint distribution is also Gaussian.
This makes it a possible alternative to the graphical LASSO as an efficient representation for multivariate Gaussian distributions. Indeed, while the graphical LASSO estimates a multivariate Gaussian with zeros in the precision matrix, a sparse Gaussian DAG will have zeros in the corresponding Cholesky factor $L$ of a $PLDL^TP^T$ factorization of the precision matrix (the topological ordering determines the permutation matrix $P$). We discuss the case of Gaussian CPDs further in [Schmidt et al., 2007b], and in the appendix of that paper we give an extension of the LARS algorithm that allows efficient calculation of the BIC of all subsets of variables along the regularization path.
• Robust distributions of real-valued data: The Gaussian distribution has very thin tails, making it sensitive to outliers. A more robust alternative for real-valued data is to use a linearly parameterized Laplace distribution,
\[ p(x_i | x_{\pi(i)}, w_i, b_i, \sigma) = \frac{1}{2\sigma} \exp\left( -\frac{|x_i - w_i^T x_{\pi(i)} - b_i|}{\sigma} \right). \]
Here, computing the optimal $\ell_1$-regularized $\{w_i, b_i\}$ can be formulated as a linear program while the optimal $\sigma$ for fixed regression weights can be computed analytically. An alternative to a linearly-parameterized Laplace distribution is a linearly-parameterized version of Student’s t-distribution,
\[ p(x_i | x_{\pi(i)}, w_i, b_i, \sigma, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)\sqrt{\sigma\pi\nu}} \left( 1 + \frac{(x_i - w_i^T x_{\pi(i)} - b_i)^2}{\sigma\nu} \right)^{-\nu/2 - 1/2}. \]
In this case, one way to optimize the parameters $\{w, \sigma, \nu\}$ is to alternate between the three types of parameters, optimizing each in turn. There is no simple correspondence between the parameters of a multivariate Laplace (or t) distribution and conditional independencies in the model, so fitting a DAG with Laplace (or t) distributions for CPDs might be a reasonable alternative for obtaining a robust multivariate distribution with conditional independence properties.
More details about the univariate and multivariate Laplace and t distributions (as well as other multivariate distributions) can be found in [Lindsey and Lindsey, 2006].
• Generalized linear models for discrete data: Rather than being binary, it might be the case that a variable comes from a discrete set $\{1, 2, \dots, k\}$. In this case, it is possible to use multinomial logistic regression for the CPDs [see Bishop, 2006, §4.3.4],
\[ p(x_i = c | x_{\pi(i)}, w_{i\cdot}, b_i) = \frac{\exp(w_{ic}^T x_{\pi(i)})}{\sum_{c'=1}^{k} \exp(w_{ic'}^T x_{\pi(i)})}. \]
Here, we have a vector $w_{ic}$ associated with each state $c$ for each variable $i$ (typically, the vector for one of the states $c$ is set to the zero vector if we are doing maximum likelihood estimation). With discrete variables, we might also want to consider a different representation of the parent variables $x_{\pi(i)}$. For example, we might encode a parent as a set of binary indicator variables (one for each state of the parent). Because of this, there is no longer a one-to-one correspondence between parameters of the model and edges of the graph. For cases like this, it might be more appropriate to use the group $\ell_1$-regularization methods that are the focus of subsequent chapters. The multinomial logistic regression model assumes that the set of states $\{1, 2, \dots, k\}$ is unordered. However, in many cases there may be a natural ordering among the states (e.g., 2 is closer to 3 than to 4). In this case, ordinal logistic regression CPDs might be more appropriate [see Johnson and Albert, 1999, §4.1],
\[ p(x_i = c | x_{\pi(i)}, w_i, b_i, \gamma_{i,\cdot}) = \frac{1}{1 + \exp(-(\gamma_{i,c} - w_i^T x_{\pi(i)} + b_i))} - \frac{1}{1 + \exp(-(\gamma_{i,c-1} - w_i^T x_{\pi(i)} + b_i))}. \]
Included in this model is a set of adaptive thresholds $\{\gamma_{i,0}, \gamma_{i,1}, \dots, \gamma_{i,k}\}$ on the cumulative distribution function, where $\gamma_{i,j} \leq \gamma_{i,j+1}$. Here, $\gamma_{i,0}$ is taken to be $-\infty$, while $\gamma_{i,k}$ is taken to be $\infty$, and one of the remaining values $\gamma_{i,c}$ is typically fixed at zero for identifiability.
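The two discrete CPDs in this item can be sketched as follows. The code assumes the cumulative-logit form of ordinal regression, in which the probability of state $c$ is the difference of the logistic cumulative distribution function evaluated at consecutive thresholds; the parameter values and function names are made up for illustration.

```python
import math


def multinomial_cpd(x_parents, W):
    """Multinomial logistic CPD: W[c] is the weight vector for state c.
    Returns the list of probabilities over the k states."""
    scores = [sum(w * x for w, x in zip(W[c], x_parents)) for c in range(len(W))]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def logistic_cdf(z):
    """Logistic CDF, stable for large |z| and for z = +/- infinity."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ordinal_cpd(x_parents, w, b, gamma):
    """Ordinal logistic CPD with thresholds gamma = [-inf, g_1, ..., g_{k-1}, inf]
    (non-decreasing). State c gets CDF(gamma_c - w^T x + b) minus the same
    quantity at gamma_{c-1}."""
    eta = sum(wj * xj for wj, xj in zip(w, x_parents))
    cdf = [logistic_cdf(g - eta + b) for g in gamma]
    return [cdf[c] - cdf[c - 1] for c in range(1, len(gamma))]


x_parents = [1.0, -1.0]

# Three unordered states; the last state's vector is fixed at zero.
W = [[0.5, -1.0], [1.5, 0.2], [0.0, 0.0]]
p_multi = multinomial_cpd(x_parents, W)

# Four ordered states; one interior threshold is fixed at zero.
gamma = [-math.inf, -1.0, 0.0, 2.0, math.inf]
p_ord = ordinal_cpd(x_parents, [0.7, 0.3], 0.1, gamma)
```

Because the thresholds are non-decreasing and the outer two are infinite, the ordinal probabilities are non-negative and telescope to one.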
Optimizing the parameters of an ordinal regression CPD with $\ell_1$-regularization can be formulated as a problem with bound constraints. This type of problem can be handled with a straightforward extension of the algorithms we describe in Chapter 2.

Beyond the above extensions to different types of data, one of the main advantages of DAG models is that they allow us to specify a multivariate distribution over vectors that contain multiple types of data. However, in these cases it is important to consider the representation of the parent variables $x_{\pi(i)}$, and it will often be the case that the group $\ell_1$-regularization methods of subsequent chapters are needed.

4.8.2 Other Extensions

We conclude this chapter by noting several other possible extensions:
• Conditional DAGs: In some cases, we may not want to model the distribution of some variables in the model. That is, we might want to model the conditional distribution of some subset of variables given another subset. We can learn a conditional DAG model using the techniques we describe in this chapter, by simply adding the constraint that nodes that are being conditioned on can have no parents (these constraints are simply added to the set of excluded edges generated by the L1MB algorithm). With these additional constraints, learning a sparse conditional DAG model proceeds as in the unconditional case we describe in this chapter.
• Dynamic Bayesian Networks: Dynamic Bayesian networks are a type of structured DAG model that generalizes hidden Markov models and Kalman filters for modeling multivariate time series data [see Murphy, 2002]. If we are given $p$-vectors at consecutive time points, we can consider trying to learn the structure of a dynamic Bayesian network that describes the dependency between time points. In these models, we learn an initial graph of the variables, as well as an inter-slice graph that models the variables conditioned on the variables at the previous time point.
As discussed in [Friedman et al., 1998], we can extend structure learning methods for DAGs to the case of dynamic Bayesian networks. In particular, we can apply the methods presented in this chapter to learn the structure of the initial graph, while learning the structure of the transition graph takes the form of learning a conditional DAG model (where we condition on the previous time point and tie the transition CPDs across time)³³.

³³ [Gustafsson et al., 2003] consider using $\ell_1$-regularization to estimate the inter-slice graph under the constraint that parents must come from the previous time point. This constraint substantially simplifies structure learning since it excludes the possibility of creating cycles.

• Linear Non-Gaussian data: Under some choices of the CPDs it is possible to distinguish between Markov-equivalent DAGs from observational data alone. In [Shimizu et al., 2005], the authors consider the case of a DAG model where each child is a linear function of its parents, with additive but non-Gaussian noise. They show that the optimal DAG can be recovered by post-processing the results of running an independent component analysis of the data (zeros in the independent components correspond to missing edges in the DAG). However, the independent component analysis does not yield entries that are exactly zero, and hence the method uses a set of hypothesis tests to determine whether an effect is significant or not. If we used a differentiable measure of independence, we could apply $\ell_1$-regularization to the factor loading matrix using the methods of Chapter 2 to learn a set of independent components with elements that are exactly zero.
• More general models of intervention: In our discussion of interventional data, we considered the case of perfect interventions. That is, we assumed that the effect of an intervention is to perfectly control the value of a single variable.
However, in general we might want to consider more general interventions that affect multiple variables, or cases where we do not know the effect of the intervention. Eaton and Murphy [2007] consider the more general case of uncertain interventions. Here, the DAG model is augmented with a binary node for each intervention. By convention these intervention (or action) nodes are not allowed to have any parents in the DAG model, and the effect of an intervention is simply to set the value of the corresponding intervention node. This model is a special case of a conditional DAG model, so the methods we discuss in this section can be applied.
• Removing false positives from L1MB: In our experiments the L1MB algorithm typically did not exclude true edges, but included many false positive edges. Three potential sources of these false positives are (i) errors associated with estimating the Markov blanket, (ii) errors associated with using $\ell_1$-regularization for variable selection, and (iii) errors associated with using the BIC.

Estimating the edge set based on estimating the Markov blanket leads to false positives because the Markov blanket includes co-parents. We could consider several heuristics to try to remove these co-parents, such as testing whether variables in the Markov blanket are marginally independent (this is a straightforward calculation under the BIC or validation score). Such a procedure could remove false positives associated with co-parents that do not share a common ancestor. Alternatively, we could consider more elaborate schemes where we condition on different subsets of the variables in order to remove co-parents from consideration. We discuss a related approach in the appendix of [Schmidt et al., 2007b], in the context of applying the L1MB algorithm to graphs with a very large number of nodes.

Using $\ell_1$-regularization for variable selection is another potential source of false positives.
As discussed in [Bach, 2008a], $\ell_1$-regularization chooses all relevant variables with a probability tending to one exponentially fast (as the number of samples increases), but also chooses irrelevant variables with non-zero probability. This leads to false positives. To alleviate this problem, Bach [2008a] suggests applying $\ell_1$-regularization to a set of bootstrap samples of the data set, and taking the intersection of the variables selected in the samples. We could consider applying this strategy in order to reduce the number of false positives. Alternatively, we could consider several alternatives to $\ell_1$-regularization. For example, the adaptive LASSO [Zou, 2006] and SCAD penalties [Fan and Li, 2002] are two regularizers that have been proposed to give better properties than $\ell_1$-regularization. The adaptive LASSO has been used for learning the structure in Gaussian dependency networks [Shimamura et al., 2007], while both the adaptive LASSO and SCAD penalties were examined for learning Gaussian graphical models in [Fan et al., 2009]. The methods of Chapter 2 can be used directly for regularization with the adaptive LASSO (it simply consists of a suitable setting of the individual regularization weights $\lambda_i$). Recent approaches for the (non-convex) SCAD regularizer use weighted $\ell_1$-regularization as a sub-routine within a bound optimization scheme [Zou and Li, 2008], so the methods of Chapter 2 could also be used for this regularizer. Alternatively, as we discuss in the extensions section of Chapter 2, it is possible to extend the methods of Chapter 2 to be used directly for optimization with SCAD regularization.

Another approach that might remove false positives (from both phases) is to define a suitable prior and compute a Bayesian score [Cooper and Herskovits, 1992, Heckerman et al., 1995], instead of the simple BIC approximation. In general the Bayesian score cannot be computed in closed form, so approximations to these integrals would be needed. Moghaddam et al. [2009] show that even simple approximations can lead to performance improvements over using the BIC. Further, as long as the score is separable across nodes, it is trivial to replace the BIC in our method with another score assessing the quality of graph structures. Closely related to the Bayesian score is the sparse Bayesian learning regularizer discussed in [Wipf and Nagarajan, 2009], a generalization of automatic relevance determination methods. This prior is motivated by the form of the marginal likelihood in the Gaussian case, and the authors of this work show that this non-separable, non-convex regularizer has several appealing advantages over $\ell_1$-regularization. As with the SCAD regularizer, current methods for using this regularizer use a bound optimization strategy where weighted $\ell_1$-regularization is used as a sub-routine, so the methods of Chapter 2 could be used to solve this sub-routine.
• Other structure search methods: We have considered a simple DAG-search method to perform the search over possible DAG structures. However, we could augment or replace this search procedure with any of the methods we discuss in Section 4.1. Indeed, Vidaurre et al. [2010] have applied our hybrid $\ell_1$-regularization method (with Gaussian CPDs) where the DAG-search has been replaced with a search through the space of Markov-equivalent structures. Another possible search strategy is the constrained optimal search of [Perrier et al., 2008]. Given a structure constraining the set of possible edges, Perrier et al. [2008] describe a method for finding the optimal DAG that is exponential in the degree of this structure. Hence, if the L1MB algorithm returns a structure with a sufficiently low degree, it is possible to find the optimal DAG structure even if the number of nodes in the original graph is very large.
• Regularization of the CPD parameters: The experiments in this chapter have used maximum likelihood estimates of the parameters.
In many cases, we are able to obtain a better model in terms of validation score by using a regularized estimate, such as an 2 regularized or 1 -regularized estimate. As long as the regularizer does not violate parameter independence or parameter modularity, searching for an optimal regularizer within each CPD only adds a small computational overhead to the L1MB and DAG-search procedure. There has also been work on estimating the number of degrees of freedom of 1 -regularized estimates. For example, [Zou et al., 2007] shows that the number of non-zero coefficients is an unbiased estimate of the number of degrees of freedom when using 1 -regularized parameter estimates within the BIC. • Non-linear CPDs: We have concentrated on CPDs that are linear in the values of the parent variables. This is similar to the pairwise assumption in undirected graphical models, and it similarly may be restrictive in some scenarios. We can relax this assumption if we consider using non-linear transformations of the parent variables. For example, to gain the representational power of tabular CPDs we could use CPDs that are linear in the set of indicator functions over possible parent configurations. Alternately, we could add products 96 of the parent variables (or other such transformations) as additional terms in the CPDs. Under many choices of non-linear CPDs, it will typically be more appropriate to use (disjoint or overlapping) group 1 -regularization of the CPD parameters, similar to the methods we discuss in the next two Chapters. • Convex approaches to DAG learning: In this chapter, we resorted to a search-based method because of the acyclity constraint on the graph structure. However, there has been a limited amount of work on convex formulations of DAG learning. Guo and Schuurmans [2006] consider a convex relaxation involving semi-definite programming to approximate a node ordering, while Jaakkola et al. 
[2010] formulate DAG learning as a binary linear program with linear constraints that enforce acyclicity. It may be possible to apply ℓ1-regularization with one of these characterizations of the acyclicity constraint.

Chapter 5
Undirected Graphical Model Structure Learning

We now turn to the task of structure learning in pairwise undirected graphical models. In some sense, structure learning is easier in the undirected case because we do not have a global acyclicity constraint; given some candidate undirected structure, we still obtain a legal undirected structure after any edge addition. However, in another sense structure learning in undirected models is much harder because of the global normalization. Unlike in the DAG case where we used separability of the log-likelihood to efficiently evaluate single-edge modifications, the lack of separability of the log-likelihood in the undirected case means that we must re-fit all parameters after any edge addition or deletion (while even evaluating the score given a fixed graph structure can be computationally expensive or intractable). This makes methods based on local search extremely expensive. In the next section, we briefly review several of the search-based and constraint-based strategies that have been proposed for structure learning in undirected graphical models. After this, the remainder of the chapter focuses on the (potentially much faster) methods based on ℓ1-regularization.

5.1 Search-based and Constraint-based Methods

Whittaker [1990, §8.2] contains a list of references from the statistics literature from the 1970s and 1980s on structure learning in (Gaussian and log-linear) undirected graphical models. Typically, these methods start with the empty structure, and search for the best possible edge addition.
A deviance score (based on a maximum likelihood estimate) is used in measuring the quality of a structure, and a hypothesis test is used to determine whether each new edge improves the score by a sufficient margin. The algorithm terminates once one of these hypothesis tests fails. Because a likelihood-based criterion is used, this termination criterion is needed to avoid adding all possible edges. Alternative methods also exist that start with the dense model and successively remove edges. Classic examples of these types of methods include [Dempster, 1972] for the Gaussian case, and [Goodman, 1971] for the log-linear case. In general, this procedure is extremely expensive if the number of variables is non-trivial; these procedures must fit O(p^2) undirected models at each of the O(p^2) steps. These types of greedy methods seem to have fallen out of favor with the introduction of methods that use a score that encourages sparsity (such as the BIC or marginal likelihood criteria), and methods that directly seek to optimize these types of scores. These and other classical methods, as well as methods based on the BIC, are discussed further in Edwards [2000, §6].

In order to avoid the expensive computations associated with general undirected graphical models, many authors have considered restricting the search to the set of decomposable graphical models. A subset of the extensive work on this topic includes [Wermuth, 1976, Malvestuto, 1991, Dawid and Lauritzen, 1993, Madigan and Raftery, 1994, Xiang et al., 1997, Giudici and Green, 1999, Deshpande et al., 2001]. Decomposable models correspond to the subset of undirected graphical models where the graph structure is chordal [Whittaker, 1990, §12]. The set of conditional independencies in decomposable models can be represented as both an undirected and a directed acyclic graphical model (in particular, a chordal undirected graph encodes the same set of conditional independencies as a DAG with no v-structures).
Consequently, in chordal undirected models it is possible to take advantage of many of the convenient properties of directed models. Particularly relevant for structure learning is that the likelihood can be evaluated efficiently, and that parameter estimation can be done locally. Another noteworthy property of decomposable models is that the marginal likelihood given a fixed graph structure can be evaluated in closed form (with a suitably chosen conjugate prior). This is in contrast to general undirected graphical models, where even evaluating the BIC may be intractable since it requires computing the rank of an exponential-sized matrix [Koller and Friedman, 2009, §20.7.3]. However, like acyclicity, the constraint that a graph must be chordal is a global (non-convex) constraint. This negates the (potential) advantage of structure learning in undirected models, since we would have local optima (that are not global optima) even if we use ℓ1-regularization of the parameters for structure learning. Thus, we would need to consider techniques similar to those used in Chapter 4 (i.e. greedy search) to learn chordal graphs. Further, if we did this we would still be restricted to a strict subset of the set of distributions whose independence properties can be represented by DAGs.

An alternative to using chordal graphs is to place an explicit restriction on the treewidth of the graph. By placing a restriction on the treewidth we guarantee that inference in the model can be performed in polynomial time. Indeed, bounded treewidth networks allow polynomial-time computation of quantities that are difficult to compute even in directed and chordal models (such as computing conditional probabilities). If the treewidth is restricted to be 1 (corresponding to tree-structured graphs) the optimal maximum likelihood structure can be found in polynomial time [Chow and Liu, 1968]. Heckerman et al. [1995] discuss extending this methodology to other scores.
For any bound greater than 1, finding the optimal bounded treewidth structure (under various scoring criteria) is NP-hard [Srebro, 2003] (even determining the treewidth of a graph is NP-hard in general). Nevertheless, several recent works have examined this case. For example, Karger and Srebro [2001] give a polynomial-time approximation scheme for this problem, while Bach and Jordan [2001] consider searching in the space of graphs with bounded treewidth. Evaluating the score achieved by edge modifications is still relatively expensive in these graphs, since low treewidth graphs will not in general be chordal. Thus, Bach and Jordan [2001] consider heuristics for evaluating the scores of neighboring graphs. Narasimhan and Bilmes [2004] have considered constraint-based polynomial-time strategies for learning bounded-treewidth networks in the probably approximately correct (PAC) learning framework for the consistent case (when the data is generated according to an undirected graphical model), by solving a series of submodular optimization problems to discover conditional independencies. Chechetka and Guestrin [2007] give a related constraint-based method that is polynomial-time for a more general class of data-generating distributions (though the algorithm remains exponential in the bound on the treewidth). Shahaf et al. [2009] consider a graph cut procedure for recursively partitioning the nodes to learn bounded-treewidth networks that has certain theoretical guarantees and has shown good empirical performance. The disadvantage of considering only networks with bounded treewidth is simply that many distributions cannot be represented as a network with bounded treewidth, so a non-trivial treewidth might be necessary to build a good model of a particular data set.

There has also been work on constraint-based methods for learning the structure with a constraint on the number of neighbors.
For example, Koller and Friedman [2009, §20.7.1] give a polynomial-time constraint-based method for learning bounded-degree networks (in the consistent case). Such constraint-based methods are appealing because they do not require parameter estimation. However, we note that these methods must rely on the same assumptions and be subject to the same criticisms as the constraint-based methods for learning DAGs we discuss in Section 4.2. Further, as discussed in [Koller and Friedman, 2009, §20], constraint-based methods do not distinguish between Markov-equivalent graph structures that use different factorizations (this is applicable when we remove the pairwise assumption). Recently, Abbeel et al. [2006] gave an exponential-time algorithm for learning bounded-degree networks in the PAC learning framework (for the consistent case) that also learns the factorization.

5.2 ℓ1-Regularization

Let us temporarily assume that each edge has only a single parameter associated with it. Then, as we discuss in Chapter 1, we can formulate the problem of learning a sparse pairwise structure with ℓ1-regularization as

$$\min_{w,b}\; -\sum_{m=1}^{n}\Big[\sum_{i=1}^{p}\Big(\log\phi_i(x_i^m, b_i) + \sum_{j=i+1}^{p}\log\phi_{ij}(x_i^m, x_j^m, w_{ij})\Big)\Big] + n\log Z(w,b) + \lambda\|w\|_1. \tag{5.1}$$

During the past five years, there has been intense interest from various communities in using this formulation for structure learning in undirected graphical models. The reasons for this are simple. First, it is an appealing notion to formulate the problem of fitting a sparse regularized model to data with unknown structure as a convex optimization problem. Second, this formulation does not impose any constraints (such as decomposability, bounded treewidth, or bounded degree) on the structures that can be learned. Third, we might hope to inherit the appealing properties of ℓ1-regularization that are known for regression and classification that we discuss in Chapter 1.
Finally, unlike search-based methods where we must solve a non-separable convex optimization problem for every possible edge addition/deletion that we consider, when we use ℓ1-regularization we only need to solve one convex optimization problem that (arguably) has a comparable difficulty level. This means that using ℓ1-regularization can be much faster than using a search-based method.

5.3 Approximate Objectives

As we discuss in Section 1.4, most of the work that examines ℓ1-regularization for structure learning in undirected graphical models focuses on GGMs and IGMs. In the case of GGMs, the normalizing constant can be computed in polynomial time and hence the problem can tractably be solved even with a non-trivial number of nodes (for example, using the methods of Chapter 2). For IGMs and the more general pairwise log-linear models we describe in Section 1.5, it may be intractable to evaluate the objective function in (5.1) since the graph structure resulting from the sparsity pattern may not have a low treewidth. Thus, for discrete data we must consider approximate objective functions.

A classic technique for addressing the intractability of evaluating the likelihood in undirected models is to replace the likelihood with the product of univariate conditionals. This is known as a pseudo-likelihood [Besag, 1975]. This approximation is consistent, in the sense that if the data is generated from an undirected graphical model then as the number of training examples grows the maximum pseudo-likelihood estimate converges to the maximum likelihood estimate [see Koller and Friedman, 2009, Theorem 20.3].
In [Schmidt et al., 2008], we considered using a pseudo-likelihood in (5.1), giving the problem

$$\min_{w,b}\; -\sum_{m=1}^{n}\sum_{i=1}^{p}\log p(x_i^m \mid x_{-i}^m, w, b) + \lambda\|w\|_1. \tag{5.2}$$

We note that the conditional probability of a node given all other nodes takes the form

$$p(x_i \mid x_{-i}, w, b) = \frac{1}{Z_i}\,\phi_i(x_i, b_i)\prod_{\{j \mid j \neq i\}}\phi_{ij}(x_i, x_j, w_{ij}),$$

where the local normalizing constant Z_i only sums over possible assignments to x_i (and thus can be tractably evaluated). As we discuss in Section 1.4, in the IGM case these conditionals take the form of logistic regression likelihoods. In the general pairwise discrete case, these conditionals take the form of a multiclass logistic regression likelihood [Bishop, 2006, §4.3.4], where the features are defined in terms of the values assigned to neighboring nodes. Thus, the ℓ1-regularized pseudo-likelihood takes the form of a set of dependent (multiclass) logistic regression problems, where the dependency arises because each set of edge parameters w_ij is present in both of the conditionals p(x_i | x_-i) and p(x_j | x_-j). However, the joint optimization of these dependent (multiclass) logistic regression problems is easily handled by the methods of Chapter 2.

As we discuss in Section 1.2, an alternative pseudo-likelihood approximation is to learn a dependency network. This is identical to (5.2), but where we make a separate copy of each edge parameter set w_ij for each conditional. Although this problem can be solved slightly more efficiently, we must heuristically construct each edge parameter out of its two copies. Hofling and Tibshirani [2009] compared the symmetric pseudo-likelihood (5.2) to two ways of obtaining a single estimate out of this asymmetric dependency network pseudo-likelihood. They found that while all three estimates were good approximations, the symmetric approximation had an advantage over the two asymmetric versions.
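As a rough illustration of (5.2) in the IGM case, the following Python sketch evaluates the ℓ1-regularized negative log pseudo-likelihood. It assumes a ±1 state encoding with log φ_i(x_i) = b_i x_i and log φ_ij(x_i, x_j) = w_ij x_i x_j, which may differ from the parameterization of Section 1.4. The function name and the use of NumPy are likewise assumptions made for illustration, not the implementation used in the experiments.

```python
import numpy as np

def neg_log_pseudo_likelihood(X, W, b, lam):
    """l1-regularized negative log pseudo-likelihood for a +/-1 Ising model.

    X   : (n, p) array with entries in {-1, +1}
    W   : (p, p) symmetric edge weights with zero diagonal (assumed encoding)
    b   : (p,) node biases
    lam : regularization strength applied to the edge weights only
    """
    # eta[m, i] = b_i + sum_j w_ij * x_j^m (zero diagonal excludes x_i itself)
    eta = X @ W + b
    # Under the assumed encoding, log p(x_i | x_-i) = -log(1 + exp(-2 x_i eta_i)),
    # i.e. each conditional is a logistic regression likelihood.
    nll = np.sum(np.logaddexp(0.0, -2.0 * X * eta))
    # Each edge parameter is penalized once (upper triangle of W).
    penalty = lam * np.sum(np.abs(np.triu(W, k=1)))
    return nll + penalty
```

Because W is symmetric, each w_ij enters both p(x_i | x_-i) and p(x_j | x_-j), which is the dependency between the logistic regression problems noted above.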
An alternative to using a pseudo-likelihood approximation of the likelihood is to use a variational approximation of the logarithm of the normalizing constant (known as the log-partition function) [see Wainwright and Jordan, 2008]. In general, such approximations are not consistent. However, theoretical arguments by Wainwright [2006] suggest that it can be beneficial to use such an approximation in certain scenarios, if the same approximation will subsequently be adopted when using the model. Lee et al. [2006b] considered the Bethe free energy approximation to the log-partition function, implemented using the loopy belief-propagation message-passing algorithm. This approximation is appealing because it is exact for tree-structured graphs. Thus, as we move along the regularization path this approximation is exact until the graph has loops (in contrast, pseudo-likelihood approximations are only exact if we have no edges). However, for graphs with loops the Bethe approximation will not generally be convex, nor does it give an upper bound on the log-partition function. Further, the use of loopy belief propagation might lead to discontinuities in the objective function because of non-convergence of the algorithm or because it converges to different local optima.

As an alternative to the (non-convex) Bethe approximation, in this work we also consider using the (non-convex) mean-field variational approximation (with a fully factorized approximating distribution), and consider using a “convexified” (tree-reweighted) Bethe approximation [see Wainwright and Jordan, 2008, §5 and §7]. The latter approximation uses a convex combination of tree-structured approximations to give a convex upper bound on the log-partition function, but uses a clever re-parameterization that allows the number of trees to potentially be very large without an increase in computation.
In particular, the method uses a minor variant on the loopy belief-propagation message-passing algorithm that utilizes a set of edge appearance probabilities (the mean-field approximation is also computed by a message-passing algorithm). For each edge, the edge appearance probability is the (weighted) proportion of the tree-structured approximations in which the edge appears. We obtain the regular Bethe approximation if these are all set to 1, but this leads to non-convexity as it is not a valid distribution over tree-structured graphs (unless the graph is actually a tree). In our experiments, we considered using all possible spanning trees of the dense graph (with equal weight) in the approximation. The probability of an edge appearing in a random spanning tree of a fully connected graph on p nodes is 2/p for p ≥ 2 (each spanning tree consists of (p − 1) edges selected in an exchangeable way from the p(p − 1)/2 edges). Note that we use these edge appearance probabilities even if some of the edges have all parameters set to zero.

5.4 Group ℓ1-Regularization

In the case of IGMs, sparsity in the parameters directly corresponds to sparsity in the graph structure. However, this is no longer the case if we consider more general potentials like the gIsing or full edge potentials from Section 1.5 where each edge has multiple parameters. In these models we must set all parameters associated with an edge to zero in order to remove the edge from the model, and thus ℓ1-regularization does not directly encourage graphical sparsity. Indeed, ℓ1-regularization completely ignores that graphical sparsity might lead to a more parsimonious graph structure or greater computational savings than sparsity of individual edge weights. In order to encourage sparsity in terms of edges instead of individual edge parameters, we can use group ℓ1-regularization.
Utilizing group ℓ1-regularization to encourage sparsity in terms of groups of variables was proposed by Bakin [1999] in the context of regression. In this work Bakin considered penalizing the ℓ1 norm of the ℓ2 norms of the groups in order to encourage sparsity at the group level. For our problem, we have one group for each edge and the group contains all parameters associated with the corresponding edge. Thus, we can write the problem of estimating a sparse regularized structure with group ℓ1-regularization as

$$\min_{w,b}\; -\sum_{m=1}^{n}\Big[\sum_{i=1}^{p}\Big(\log\phi_i(x_i^m, b_i) + \sum_{j=i+1}^{p}\log\phi_{ij}(x_i^m, x_j^m, w_{ij})\Big)\Big] + n\log Z(w,b) + \lambda\sum_{i=1}^{p}\sum_{j=i+1}^{p}\|w_{ij}\|_p, \tag{5.3}$$

for some norm ||·||_p (using an approximate objective gives an analogous formulation). While Bakin [1999] considered penalizing the ℓ2 norms of the groups (corresponding to ℓ1-regularization of the lengths of the vectors w_ij), other authors have subsequently considered using other norms that also achieve group sparsity [34]. For example, Turlach et al. [2005] use the ℓ∞ norm of the groups in the context of multiple linear regressions, corresponding to ℓ1-regularization of the maximum absolute values within the groups (but not penalizing elements of the groups that do not achieve the maximum value). We considered using (5.3) with the ℓ∞ norm of the groups in [Schmidt et al., 2008].

[34] We obtain ℓ1-regularization as in (5.1) if each group contains only a single variable (under any choice of norm), or if we use the ℓ1 norm of the individual groups.

Since the ℓ2 norm places no bias on the direction, in some sense it is the only norm that does not encourage additional structure in the edge potentials. This is as opposed to the degenerate case of using the ℓ1 norm that prefers sparsity within the groups, and it also differs from the ℓ∞ norm that encourages elements within the same group to have exactly the same magnitude. However, this latter property produces some interesting biases when using the ℓ∞ norm.
For example, with the gIsing potentials it encourages all edge weights (associated with the same edge) to have the same magnitude. If these weights also have the same sign, it encourages the gIsing weights to take the exact same value and to subsequently become Ising potentials. With full potentials, using the ℓ∞ norm also encourages patterns of tied weights within the potentials, but places no restriction on what elements of the individual edge potential matrices are tied. Thus, it might lead to some edges using Ising potentials, some edges using gIsing potentials, some edges taking other patterns, and some edges having no pattern (in general there will be no pattern when the ℓ2 norms of the groups are used).

While previous work on group ℓ1-regularization has only considered the ℓ2 or ℓ∞ norms of the groups, these are not the only possible choices of the group norm. For the case of full potentials, in this work we also consider using the nuclear norm of the edge weight matrix. This can be viewed as an extension of the nuclear norm regularizer described in [Fazel et al., 2001] to the case of groups. The nuclear norm penalizes the sum of the singular values of the matrix, and using it within a group ℓ1-regularization framework encourages not only group sparsity [35] but also encourages the edge weight matrices to be low rank. The advantage of this is that, for k > 2, this may lead to a more parsimonious representation of the full edge weight matrix. For models with many states this might lead to a substantial reduction in the number of parameters (and degrees of freedom) in the final model, and because of this it represents an alternative to the weight-tying used in the Ising or gIsing potentials.

In cases where groups have a single element, the methods of Chapter 2 can be used to solve the optimization problem, while the methods of Chapter 3 can be used to solve the general case when we use the ℓ2 norms of the groups.
In the next section, we discuss simple extensions to the methods of Chapter 3 that allow us to handle the ℓ∞ and nuclear norms of the groups.

5.5 Optimization with General Group Norms

In this section we consider the generalization of (3.1) where we penalize some norm ||·||_p of the groups:

$$\min_{x}\; f(x) \triangleq L(x) + \sum_{A}\lambda_A\|x_A\|_p. \tag{5.4}$$

Note that in this expression, we do not necessarily have to use the same norm for each group. By the positive homogeneity property of norms, for any choice of norm this function is non-differentiable if an entire group of variables is exactly zero. Depending on the particular norm, there may be other non-differentiabilities. For example, with the ℓ∞ norm the objective is also non-differentiable whenever more than one variable in a group achieves the largest magnitude within the group.

[35] All elements of the matrix are necessarily zero if all singular values are zero.

To apply the SPG or PQN method to solve (5.4), we must convert it into a differentiable optimization problem over a convex set. As before, we do this by introducing an additional variable g_A for each group A and optimize subject to the constraint that g_A ≥ ||x_A||_p:

$$\min_{x,g}\; L(x) + \sum_{A}\lambda_A g_A, \quad \text{subject to } g_A \geq \|x_A\|_p,\ \forall A. \tag{5.5}$$

By convexity of norms, the constraints in this problem define a convex set for any choice of norm. In an addendum to [Schmidt et al., 2008], we show how to compute the projection for this problem when we penalize the ℓ∞ norm of the groups [36]. The cost of solving the sub-problem in this case is O(|A| log |A|), since in the worst case we may need to sort the elements of x_A. In the degenerate case where we use the ℓ1 norm of the groups, this problem can be solved in O(|A|) expected time using a simple extension of the randomized algorithm outlined in [Duchi et al., 2008b].
Although penalizing the ℓ1 norm of the groups is not interesting on its own since this choice of group norm reduces to regular ℓ1-regularization, we can use this as a sub-routine for computing the projection when we penalize the nuclear norm of the groups. In particular, similar to [Cai et al., 2010], the projection for an individual group can be computed in O(|A|^{3/2}) by computing the singular value decomposition of the group [Golub and Van Loan, 1996, §2], applying the ℓ1 norm method to the singular values, then reforming the matrix with the modified singular values to form the projected matrix. We give more details regarding these projections in Appendix B, but for now we simply note that for all the norms we consider it is possible to compute the projection efficiently for reasonably-sized groups.

Wright et al. [2009] discuss computing the soft-threshold operator for different choices of the norm in group ℓ1-regularization. In the case of the ℓ∞ norm, the solution for an individual group is given by an explicit element-wise threshold operator

$$S_R(x_A, \alpha)_i = \operatorname{sgn}(x_i)\min\{|x_i|, \theta_A\},$$

where the threshold θ_A used by the group is given by the maximum over i of the absolute difference between x_A and the result of projecting x_A onto the ℓ1-ball of radius αλ_A [Duchi and Singer, 2009]. Duchi et al. [2008b] give a randomized algorithm with an expected O(|A|) runtime for computing this projection. In the case of the nuclear norm, the soft-threshold is given by applying the soft-threshold rule σ_i ← max{0, σ_i − αλ_A} to the singular values σ_i of the matrix groups [Cai et al., 2010].

We can also generalize the active-set method from Section 3.4 to the case of an arbitrary group norm ||·||_p. To test whether groups with all elements zero can locally improve the objective function by moving them away from zero, we need to characterize the sub-differential of the regularizer in (5.4).
To do this, we use a non-standard (but equivalent) definition of the sub-differential of a convex function R(x) [Combettes and Wajs, 2005, §2.1]:

$$\partial R(x) \triangleq \{\, g \mid R(x) + R^*(g) = x^T g \,\},$$

where R^* is the convex conjugate of R [see Boyd and Vandenberghe, 2004, §3.3]. If R(x) is a norm, R(x) ≜ ||x||_p, the convex conjugate is given by a (∞, 0) indicator function on the dual norm unit ball [Boyd and Vandenberghe, 2004, Exercise 3.26]:

$$R^*(g) \triangleq \begin{cases} 0 & \text{if } \|g\|_q \leq 1, \\ \infty & \text{otherwise.} \end{cases}$$

Here, we use ||·||_q to denote the dual norm of ||·||_p [Boyd and Vandenberghe, 2004, §A.1.6]. It follows that the sub-differential of the regularizer for a group with x_A = 0 is all vectors with dual norm less than or equal to λ_A. Thus the optimality condition that 0 ∈ ∂f(x) in (5.4) for a group with x_A = 0 is ||∇_A L(x)||_q ≤ λ_A. Using this, an active-set method generalizing the one in Section 3.4 to the case of an arbitrary group norm ||·||_p is
• Find groups A such that x_A ≠ 0, or x_A = 0 and ||∇_A L(x)||_q > λ_A.
• Solve the problem with respect to these groups.

[36] Quattoni et al. [2009] show how to solve the related problem of projecting onto the norm ball defined by the ℓ1 norm of the ℓ∞ norms.

The dual norms for all norms considered in this work are given in [Boyd and Vandenberghe, 2004, §A.1.6]. The ℓ2 norm is its own dual, giving the algorithm of Section 3.4. The dual norm of the ℓ1 norm is the ℓ∞ norm, giving the algorithm of Section 2.5. Since the dual norm of the ℓ∞ norm is the ℓ1 norm, if we penalize the ℓ∞ norm of the groups we add a group if the sum of the absolute values of the gradient elements of the group is above λ_A. Finally, the dual of the nuclear norm is the ℓ2 operator norm, so we add matrix groups if the largest singular value of the matrix containing the values of the gradient elements exceeds λ_A.

To end this section we note that when we repeated the experiments of Chapter 3 with other choices of the group norm, the relative performance of the different optimization methods was very similar.
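The soft-threshold operators and the active-set test described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in our experiments: the function names are hypothetical, the threshold t plays the role of αλ_A, and the ℓ1-ball projection uses the simple O(|A| log |A|) sort-based algorithm rather than the randomized expected-linear-time variant.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of a 1-D float array v onto {u : ||u||_1 <= radius}."""
    if radius <= 0:
        return np.zeros_like(v)
    a = np.abs(v)
    if a.sum() <= radius:
        return v.copy()
    u = np.sort(a)[::-1]                      # magnitudes in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - (css - radius) / k > 0)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)   # common shrinkage amount
    return np.sign(v) * np.maximum(a - theta, 0.0)

def soft_threshold(x, t, norm):
    """Soft-threshold of one group x with threshold t for the given group norm."""
    if norm == "l2":
        nrm = np.linalg.norm(x)
        return np.zeros_like(x) if nrm <= t else (1.0 - t / nrm) * x
    if norm == "linf":
        # Subtracting the l1-ball projection clips every |x_i| at a common threshold.
        return x - project_l1_ball(x, t)
    if norm == "nuclear":
        U, s, Vt = np.linalg.svd(x, full_matrices=False)
        return (U * np.maximum(s - t, 0.0)) @ Vt   # shrink the singular values
    raise ValueError(f"unknown norm: {norm}")

def dual_norm(g, norm):
    """Dual norm of the gradient block g for the given group norm."""
    if norm == "l2":
        return np.linalg.norm(g)          # l2 is its own dual
    if norm == "linf":
        return np.abs(g).sum()            # dual of l-infinity is l1
    if norm == "nuclear":
        return np.linalg.norm(g, 2)       # largest singular value
    raise ValueError(f"unknown norm: {norm}")

def groups_to_activate(x_groups, grad_groups, lams, norm):
    """Indices of groups that are non-zero or violate ||grad_A||_q <= lambda_A."""
    return [A for A, (x, g, lam) in enumerate(zip(x_groups, grad_groups, lams))
            if np.any(x != 0) or dual_norm(g, norm) > lam]
```

Note that the same gradient block can activate a group under one norm but not under another, since the test depends on the dual norm.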
5.6 Blockwise Sparsity

Duchi et al. [2008a] consider an alternate use of group ℓ1-regularization within the context of GGMs. Their model assigns each node in the graph a type. They consequently use ℓ1-regularization of the edges between variables of the same type, but group ℓ1-regularization of the set of edges between different types. That is, they encourage sparsity in terms of the blocks of the precision matrix that represent interactions between variables of different types. We refer to this as blockwise-sparsity, since it encourages sparsity in terms of pre-defined blocks of the precision matrix. Duchi et al. [2008a] penalize the ℓ∞ norm of the blocks, and give a projected gradient method for solving a Lagrangian dual problem in the case of GGMs. In [Schmidt et al., 2009a], we showed that the PQN method of Chapter 3 outperforms this projected gradient method at solving the Lagrangian dual, and we considered using the PQN method to solve the Lagrangian dual that arises when we penalize the ℓ2 norm of the groups. The methods in Chapter 3 can also be used to encourage blockwise-sparsity in IGMs, for the ℓ2 or ℓ∞ norm of the blocks. Further, we can also use them to encourage blockwise-sparsity in general pairwise log-linear models. In this case, we simply define each group to be all edge parameters associated with all edges in the block.

5.7 Conditional Random Fields

Thus far, we have considered building a probabilistic model of all variables present in a data set. However, in many cases we might be interested in predicting the values of some variables (the targets) given the others (the features). This is similar to the regression and classification tasks we discuss in Section 1.1, but here we consider the generalization where we have more than one target variable. Further, the target variables may be dependent, even after conditioning on the features.
Analogous to the regression case, we use x to denote the features and we use y to denote the target variables. One way to address this problem is to model p(y, x) with an undirected model, and then use the conditional distribution p(y^m | x^m) to answer conditional queries about instance m. However, in cases where the features are very complicated, it may be very difficult to build a good model of p(y, x).

For this multiple-target scenario, Lafferty et al. [2001] introduced conditional random fields (CRFs). In CRFs, we fit an undirected graphical model by optimizing the conditional likelihood p(y^m | x^m) over all m training examples (this is typically referred to as a discriminative model). In the case of log-linear models, this is a natural generalization of logistic regression to the multi-target scenario (while for GGMs it is a natural generalization of least-squares). The advantage of optimizing the conditional likelihood instead of the likelihood is that we treat the variables x as fixed, instead of addressing the potentially difficult task of building a model of them. Liang and Jordan [2008] show that, if the model is misspecified (as is typically the case when dealing with real data), optimizing p(y|x) is asymptotically more efficient both in terms of parameter estimation and generalization error than optimizing p(y, x).

In this work, we consider CRFs with a log-linear parameterization. For example, for a three-state node i we use node potentials of the form

$$\log\phi_i(\cdot, x_i^m, b_i, v_i) = \begin{bmatrix} b_{i,1} + v_{i,1}^T x_i^m \\ b_{i,2} + v_{i,2}^T x_i^m \\ b_{i,3} + v_{i,3}^T x_i^m \end{bmatrix},$$

where each node has its own set of bias parameters b_i, its own set of (vector-valued) feature weights v_i, and its own set of features x_i^m for instance m (some of these features may be shared between nodes). Typically, we fix the value of b_{i,j} to zero for one of the states j, and we may also fix the vector v_{i,j} to the zero vector for one of the states.
For full edge potentials on the edge between two three-state nodes i and j, we use

$$\log\phi_{ij}(\cdot, \cdot, x_{ij}^m, w_{ij}, v_{ij}) = \begin{bmatrix} w_{ij11} + v_{ij11}^T x_{ij}^m & w_{ij12} + v_{ij12}^T x_{ij}^m & w_{ij13} + v_{ij13}^T x_{ij}^m \\ w_{ij21} + v_{ij21}^T x_{ij}^m & w_{ij22} + v_{ij22}^T x_{ij}^m & w_{ij23} + v_{ij23}^T x_{ij}^m \\ w_{ij31} + v_{ij31}^T x_{ij}^m & w_{ij32} + v_{ij32}^T x_{ij}^m & w_{ij33} + v_{ij33}^T x_{ij}^m \end{bmatrix},$$

where we note that each edge has its own set of weights w_ij, its own set of (vector-valued) feature weights v_ij, and its own set of edge features x_ij^m for instance m. We can fix the values of some of these weights to zero if we want a restricted class of potentials like the Ising or gIsing potentials. However, note that even in the Ising case, each edge will have multiple parameters. We can write the negative log-likelihood function with these potentials as

$$-\sum_{m=1}^{n}\Big[\sum_{i=1}^{p}\Big(\log\phi_i(y_i^m, x_i^m, b_i, v_i) + \sum_{j=i+1}^{p}\log\phi_{ij}(y_i^m, y_j^m, x_{ij}^m, w_{ij}, v_{ij})\Big) - \log Z(w, b, v, x^m)\Big],$$

where we have used v to refer to all node and edge feature weights. As in the unconditional case, this function is jointly convex in all of its parameters. However, since the normalizing constant is a function of x^m, we now have a separate normalizing constant for each training instance m. This makes parameter estimation in the conditional case much more expensive.

While there has been some work towards discriminative structure learning in the context of Bayesian network classifiers [see Schmidt et al., 2008, Table 1], all previous work on CRFs has assumed that the graphical structure is known. Further, the high cost of evaluating the likelihood even for a fixed structure makes search-based methods unappealing. Thus, to apply a CRF model to a data set with unknown structure, in [Schmidt et al., 2008] we considered using group ℓ1-regularization.
More precisely, we used ℓ2-regularization of the node feature weights and group ℓ1-regularization of all edge weights corresponding to the same edge to learn a sparse regularized CRF by solving

$$\min_{w,b,v} \; -\sum_{m=1}^{n} \log p(y^m \mid x^m, b, w, v) + \lambda_1 \sum_{i=1}^{p} \|v_i\|_2^2 + \lambda_2 \sum_{i=1}^{p} \sum_{j=i+1}^{p} \|[w_{ij} \; v_{ij}]^T\|_p. \tag{5.6}$$

Note that solving this problem is not the same as the computationally more efficient approach of first learning the structure of y as an unconditional log-linear model, and subsequently using this as the structure of the CRF. Our experiments indicate that this latter strategy under-performs using (5.6) to simultaneously and conditionally learn both structure and parameters.

5.7.1 Associative Conditional Random Fields

In all models up to this point we have considered sparsity as a rough approximation of the treewidth of the graph. This is because adding an edge will never decrease the treewidth of a graph, so models with fewer edges may have lower treewidths and thus allow efficient calculations with the model. In many applications of CRFs, the calculation that we are most often interested in is finding the optimal conditional decoding. That is, given the covariates x we would like to find the assignment of labels $y^*$ with highest probability under the model:

$$y^* = \arg\max_{y} \; p(y \mid x). \tag{5.7}$$

Although this problem is NP-hard in general, for the special case of binary variables with sub-modular edge potentials it is possible to solve this problem in polynomial time [Kolmogorov and Zabih, 2002]. The sub-modularity condition is equivalent to the requirement that, for each edge, the sum of the log-potentials for assignments where the two variables take the same state is at least as large as the sum of the log-potentials for assignments where the variables take different states:

$$\log \phi_{ij}(1, 1) + \log \phi_{ij}(2, 2) \geq \log \phi_{ij}(1, 2) + \log \phi_{ij}(2, 1), \quad \forall ij. \tag{5.8}$$

We call a CRF satisfying this condition an associative CRF, analogous to the associative max-margin Markov networks examined in [Taskar et al., 2004].
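Condition (5.8) is simple to check for a given binary edge. The snippet below is a minimal sketch, assuming the edge log-potentials are stored as a 2x2 array whose rows and columns index the two states; the function name and the example values are hypothetical.

```python
import numpy as np

def is_associative(log_phi_edge):
    """Check condition (5.8) for one binary edge.

    log_phi_edge: 2x2 array, entry [a, b] is the log-potential of the
    assignment (state a+1, state b+1) in the notation of (5.8).
    """
    same = log_phi_edge[0, 0] + log_phi_edge[1, 1]
    different = log_phi_edge[0, 1] + log_phi_edge[1, 0]
    return bool(same >= different)

# An edge that favours agreement satisfies the condition ...
agree = np.array([[1.0, 0.2],
                  [0.1, 0.5]])
# ... while an edge that favours disagreement does not.
disagree = np.array([[0.0, 1.0],
                     [1.0, 0.0]])

print(is_associative(agree), is_associative(disagree))
```

For a CRF the check has to hold for every edge and for every feature value that can occur, since the log-potentials depend on x.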
Note that satisfying this condition allows us to perform optimal decoding in polynomial time as a minimum graph-cut problem, independent of the treewidth of the graph. Thus, enforcing that (5.8) is true for all edges and all possible values of the features x ensures that we can efficiently solve (5.7). In [Cobzas and Schmidt, 2009] we consider two simple conditions that are sufficient to ensure that the estimated parameters in a binary CRF with Ising edge potentials (and a fixed structure) satisfy (5.8). First, we require that all features x are non-negative. Second, during parameter estimation we constrain all edge parameters to also be non-negative. These conditions ensure that $\log \phi_{ij}(1, 1) \geq 0$ and $\log \phi_{ij}(2, 2) \geq 0$, while since we use Ising edge potentials we have that $\log \phi_{ij}(1, 2) = 0$ and $\log \phi_{ij}(2, 1) = 0$. Adding these constraints yields a bound-constrained optimization problem that was solved with the two-metric projection algorithm discussed in Section 2.2.3.

We can use ℓ1-regularization to extend this prior work to learn the graph structure while still constraining the model to be associative. If we use ℓ1-regularization of the edge weights, then the problem reduces to optimizing a differentiable function with ℓ1-regularization over the non-negative orthant. As discussed in Section 2.1.2, applying ℓ1-regularization with an orthant constraint can be written as a bound-constrained smooth optimization problem. Thus, we can estimate the parameters of an associative CRF with ℓ1-regularization using the two-metric projection algorithm discussed in Section 2.2.3. If we use group ℓ1-regularization of the edge parameters to encourage a sparse graph structure, then computing the projection (or soft-threshold) subject to non-negativity constraints is straightforward; we set to zero all negative elements before computing the projection (or soft-threshold) for the remaining elements [van den Berg, 2010].
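The non-negative soft-threshold step described above can be sketched in a few lines. This is an illustration for a single group under the ℓ2 group norm, not the implementation used in the experiments; the function name and the threshold argument t (the step size times the regularization strength) are assumptions.

```python
import numpy as np

def nonneg_group_soft_threshold(w, t):
    """Group soft-threshold of w (l2 group norm) subject to w >= 0.

    Step 1: set all negative elements to zero.
    Step 2: apply the usual group soft-threshold to what remains,
            shrinking the group norm by t (and to zero if the norm is <= t).
    """
    z = np.maximum(w, 0.0)
    norm = np.linalg.norm(z)
    if norm <= t:
        return np.zeros_like(z)
    return (1.0 - t / norm) * z

w = np.array([3.0, -1.0, 4.0])
print(nonneg_group_soft_threshold(w, 1.0))   # negative entry removed, rest shrunk
print(nonneg_group_soft_threshold(np.array([0.3, -2.0, 0.4]), 1.0))  # whole group zeroed
```

In the second call the positive part of the group has norm 0.5, which is below the threshold, so the entire group (and hence the corresponding edge) is removed.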
This projection step allows us to apply the methods of Chapter 3 to solve the bound-constrained problem.

5.8 Experiments

We first examined two small real data sets where we could compare the effects of different regularization and edge potential types with the exact objective (§5.8.1), and then with approximate objectives (§5.8.2). We then compared the methods on some larger data sets using the pseudo-likelihood approximation (§5.8.3). We then looked at blockwise sparse models of a real data set (§5.8.4), and finally compared different ways to train CRF models on synthetic and real data (§5.8.5).

5.8.1 Edge Potentials and Regularization Types

We first sought to assess the effects on prediction performance of different choices of regularization and edge potential type. To do this we used the two small data sets (cyto and awma) from Section 1.7, where the number of nodes (and states) is sufficiently small that we can evaluate the objective function exactly even with a densely connected graph (thus removing the use of an approximate objective function as a potential confounding factor). On these data sets, we tested the three edge potentials from Section 1.5:

• Ising: Here we have one parameter on each edge, giving the potential of the two nodes taking the same state. In the binary case this yields the IGM model of Section 1.4.
• gIsing: Here we have k parameters on each edge, giving the potential of the two nodes taking the same state for each state.
• full: Here we have a matrix of $k^2$ parameters on each edge, giving the potential for all $k^2$ combinations of the states.

We compared the following regularization strategies:

• Tree: We compute the maximum likelihood tree structure, then fit its parameters using ℓ2-regularization.
• L2: We fit the fully connected structure with ℓ2-regularization. This does not yield a sparse structure, but may still perform well at prediction.
• L1: We fit the fully connected structure with ℓ1-regularization. This encourages sparsity in the edge parameters but does not directly encourage graphical sparsity.
• L12: We fit the fully connected structure with group ℓ1-regularization of the ℓ2 norms of the edge parameters. This encourages graphical sparsity.
• L1∞: We fit the fully connected structure with group ℓ1-regularization of the ℓ∞ norms of the edge parameters. This encourages graphical sparsity and also encourages elements of the same edge potential to have the same magnitude.
• L1σ: We fit the fully connected structure with group ℓ1-regularization of the nuclear norms of the edge parameters. This encourages graphical sparsity and also encourages the edge potential matrices to have low rank.

In our experiments, we trained on one third of the data, evaluated the likelihood of a separate third of the data to select λ, and used the final third of the data to evaluate the model with the selected value of λ. We repeated this set-up with 10 different partitions of the data to estimate the variability of the results. We note that the particular split of the data into training/validation/testing is a confounding factor that affects the performance of the methods. Thus, in addition to computing the test set negative log-likelihood of the methods on each trial, we also computed a relative test set negative log-likelihood where we scaled the values to lie in the range [0, 1] for each split (the best method on each data split is assigned a value of 0, and the worst method is assigned a value of 1). This latter score removes the particular data split as a potential confounding factor, giving a measure of relative performance across the different splits. For each model we tested $\lambda = 2^r$, where r was decreased from 10 down to −7 in increments of 0.25 (and we used warm-starting to solve these related optimization problems in order from the largest to the smallest value of λ). For these experiments, we added a weak ℓ2-regularizer (with $\lambda = 10^{-4}$) to all models.
This only has a small effect on the estimated parameters, but makes the objective strictly convex and removes the possibility that an observed difference between methods is due to the particular global optima found by the methods.

In Figure 5.1 we compare the various methods on the cyto data in terms of the absolute score (left), and we compare the most effective methods in terms of relative score (right; see footnote 37). Several trends are obvious. First, the Ising potentials do substantially worse than the gIsing and full potentials. With Ising potentials even the L2 and L1 methods that consider all possible edges do substantially worse than using the simple tree model with the slightly more general gIsing potentials. The full potentials also do better than the gIsing potentials for a given regularization type, but the difference in this case is not as dramatic. Another trend we see is that the regularization methods dominate the tree method, for a given edge potential type. However, we see no differences between different regularization types (for a given edge potential type) in terms of the absolute score. In terms of the relative score, we see that the sparse regularization methods tended to outperform ℓ2-regularization, but the different sparse regularization methods had similar performance.

Footnote 37: Not all regularization types are included for all edge types in this plot. This is because the group ℓ1-regularization methods are equivalent to ℓ1-regularization for Ising potentials, while group ℓ1-regularization with the nuclear norm is equivalent to ℓ1-regularization for gIsing potentials.

Figure 5.1: Test set negative log-likelihood (left) and relative negative log-likelihood (right) on the cyto data using different regularization and edge potential types.

Figure 5.2: Test set negative log-likelihood (left) and relative negative log-likelihood (right) on the awma data using different regularization and edge potential types.

In Figure 5.2 we compare the various methods on the awma data in terms of the absolute score (left), and we compare the non-tree methods in terms of relative score (right). On this data set, we again see that dense regularization methods dominate the tree methods. However, on this binary data set we see that the choice of edge potentials makes no difference. This makes sense because using multi-parameter edge potentials does not increase the expressive power of the model for binary data. Further, we see no difference in absolute score between the various regularization methods, though we again see that the different sparse regularization methods tended to outperform ℓ2-regularization in terms of the relative score.

5.8.2 Approximate Objectives

We next sought to compare the performance of different approximations to the objective function on these two small data sets.
We compared the following objective functions from Section 5.3:

• exact: The exact (convex) objective.
• pseudo: The (convex) pseudo-likelihood approximation.
• mean: The (non-convex) mean-field variational approximation.
• Bethe: The (non-convex) Bethe variational approximation.
• c-Bethe: The convexified Bethe variational approximation.

Our experimental set-up was identical to the previous sub-section, except that we trained the different regularization methods under these different approximations. Note that we still compute the exact validation and test score.

Figure 5.3: Test set negative log-likelihood on the cyto (left) and awma (right) data sets using different approximate objective functions.

In Figure 5.3 we plot the performance of the different regularization methods under different approximate objective functions (using full potentials). In this plot, we see that the objective function that leads to the best performance is (unsurprisingly) the exact objective function. Among the approximate objectives, the pseudo-likelihood approximation gave results that are much closer to the exact objective function than the variational approximations. Indeed, it was surprising that in almost every case the variational approximations gave worse parameters than the optimal tree (the only exception to this was using group ℓ1-regularization with the ℓ∞ norm of the groups under the Bethe approximation).
In this figure, we also see that the performance of the three convex objective functions changed little across the regularization types, but that the two non-convex approximations were more erratic. It is somewhat surprising that the performance of the Bethe approximation changes substantially under different choices of the regularization norm, but that the performance for a given norm was consistent across trials (this is especially surprising on the binary awma data set). In contrast to the Bethe approximation, for most choices of regularization the mean field method was erratic, except when using ℓ2-regularization, and when using ℓ1-regularization in the cyto data (where it had consistent but poor performance).

Figure 5.4: Test set negative log-pseudo-likelihood (left) and relative negative log-pseudo-likelihood (right) on the awma5 data using different regularization and edge potential types.

5.8.3 Larger Real Data

We next sought to compare the different regularization strategies and edge potential types on larger data sets (where evaluating the exact objective function is intractable in general). Specifically, we considered the four larger (non-binary) discrete data sets from Section 1.7. On these data sets we concentrated on the pseudo-likelihood objective function for training and testing, but otherwise we used the same experimental set-up.

In Figure 5.4 we plot the absolute and relative results of different regularization strategies and edge potential types on the awma5 data set.
Unlike the binary version of the data set, for the full five-state version of the data set we see differences between the different regularization and edge potential types. In particular, we see that all models achieve the best performance with full potentials, while all models achieve the worst performance with Ising potentials. Further, we see that for a given edge potential type the sparse regularization methods outperform ℓ2-regularization (for gIsing and full potentials), while for a fixed edge potential type ℓ2-regularization outperforms the tree model. In this experiment, we see significant differences in the relative scores between the different sparse regularization methods when using full potentials. In particular, group ℓ1-regularization with the ℓ2 norm achieved the best value across all training set splits, followed by the other group ℓ1-regularization strategies, and regular ℓ1-regularization gave the worst performance among the sparse regularizers.

We plot the results on the four-state traffic and temperature data sets in Figure 5.5. Similar to the three-state cyto and five-state awma5 data sets, we again observe that utilizing more expressive potentials leads to better performance. Further, similar to the awma5 data set, using group ℓ1-regularization with the ℓ2 norm (and full potentials) achieved the best performance across all 10 trials for both of these data sets.

Finally, we plot the results on the four-state usps4 and eight-state usps8 data sets in Figure 5.6. On these data sets, the best performance across all trials was achieved by group ℓ1-regularization with the nuclear norm. Further, on the usps8 data set we even see a significant advantage in the absolute score over all other methods when using group ℓ1-regularization with the nuclear norm.
This makes intuitive sense, since we would expect the edge weight matrices resulting from the discretized states to be highly structured and well-approximated by a low rank matrix. In contrast, group ℓ1-regularization with the ℓ2 norm outperforms all methods except the nuclear norm method on the usps4 data, but does worse than using regular ℓ1-regularization on the usps8 data. We believe this is because no structure is assumed by using ℓ2-regularization of the edge weight matrices, making it more difficult to estimate the 64 parameters associated with each edge.

Figure 5.5: Test set negative log-pseudo-likelihood (left) and relative negative log-pseudo-likelihood (right) on the traffic (top) and temperature (bottom) data using different regularization and edge potential types.
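To make the difference between the sparse regularizers compared in this section concrete, the sketch below evaluates the four penalties on the matrix of edge parameters of a single edge. It is only an illustration: the function names, the dictionary of edge matrices, and the example matrices are hypothetical, and the ℓ∞ group norm is taken here to be the largest absolute entry of the edge matrix.

```python
import numpy as np

def group_norm(W, kind):
    """Norm applied to the k-by-k matrix of edge parameters W of one edge."""
    if kind == "L1":       # plain l1: sum of absolute values
        return np.abs(W).sum()
    if kind == "L12":      # l2 norm of the group (Frobenius norm of the matrix)
        return np.linalg.norm(W, "fro")
    if kind == "L1inf":    # l-infinity norm of the group (largest absolute entry)
        return np.abs(W).max()
    if kind == "L1nuc":    # nuclear norm (sum of singular values)
        return np.linalg.norm(W, "nuc")
    raise ValueError(f"unknown regularizer: {kind}")

def edge_penalty(edge_weights, lam, kind):
    """lam times the sum of the chosen group norm over all edges."""
    return lam * sum(group_norm(W, kind) for W in edge_weights.values())

low_rank = np.outer([1.0, 2.0], [3.0, 4.0])   # rank-one edge matrix
full_rank = np.eye(2)                         # full-rank edge matrix

for name, W in [("rank one", low_rank), ("identity", full_rank)]:
    print(name, {k: float(group_norm(W, k)) for k in ["L1", "L12", "L1inf", "L1nuc"]})
```

For the rank-one matrix the nuclear norm equals the Frobenius norm, while for the identity it is larger by a factor of the square root of 2, which illustrates why the nuclear norm favours low-rank edge potentials relative to the ℓ2 group norm.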
Figure 5.6: Test set negative log-pseudo-likelihood (left) and relative negative log-pseudo-likelihood (right) on the usps4 (top) and usps8 (bottom) data using different regularization and edge potential types.

Figure 5.7: Structures estimated on the rain data set with group ℓ1-regularization for different regularization parameter values. From left to right, λ = 256, 128, 64 (for λ = 512 the graph is disconnected).

Figure 5.8: Structure estimated on the news data set with group ℓ1-regularization (λ = 512, isolated nodes are not plotted).

Figure 5.9: Structure estimated on the news data set with group ℓ1-regularization (λ = 256, isolated nodes are not plotted).

Figure 5.10: Structure estimated on the usps data set with group ℓ1-regularization (λ = 4096).

Figure 5.11: Structure estimated on the usps data set with group ℓ1-regularization (λ = 2048).
Figure 5.12: Structure estimated on the usps data set with group ℓ1-regularization (λ = 1024).

We finally sought to assess whether the group ℓ1-regularization method learns a reasonable graph structure. We fit a group ℓ1-regularized undirected graphical model (with full potentials, the ℓ2 group norm, and the pseudo-likelihood approximation) on the full rain, news, and usps data sets examined in the previous chapter. We tested integer powers of 2 for the regularization parameter, and examined the largest such values that produced non-empty graphs. In Figure 5.7, we plot the structure estimated on the rain data with λ = 256, 128, and 64.
With λ = 256 the model estimates a 28-node Markov chain, which (as discussed in the previous chapter) is a reasonable structure for this data set and is the optimal tree structure. As λ is decreased more edges are added, between temporally close nodes for λ = 128 and between more distant nodes for λ = 64. With λ = 512, the graph is disconnected.

In Figures 5.8 and 5.9, we plot the structure estimated on the news data set with λ set to 512 and 256, respectively. The graph with λ = 512 is very interpretable and intuitive, even though it is not a tree structure. The graph with λ = 256 is more dense and less interpretable, but the edges still tend to represent intuitive associations. With λ = 128, the graph was very dense and not particularly interpretable, while with λ = 1024 the graph only contained four edges: bible:god, christian:god, dos:windows, and god:jesus.

The most common application of pairwise undirected models (ignoring time-series data where there is no distinction in the graphical properties of directed and undirected models) is image processing, where a two-dimensional grid graph structure is typically assumed. Thus, for the usps data set we might expect the method to estimate a two-dimensional grid graph structure, where each node/pixel is connected to its four horizontal and vertical neighbors. We plot the structure estimated with λ = 4096, 2048, and 1024 for the usps data in Figures 5.10-5.12. Here, we see that (for large values of λ) the model learns structures that are close to two-dimensional grid models. Indeed, these structures are much closer to a grid structure than the three graph structures we examined for the usps data set in Chapter 4 (Figures 4.12-4.14). However, there are still some discrepancies between the structures in Figures 5.10-5.12 and a two-dimensional grid structure.
The first discrepancy is that extra edges are present near the boundaries (and two of the corners in particular), with the number of extra edges increasing as λ decreases. This might be because pixels at the boundaries have fewer neighboring pixels, so it becomes important to look beyond the values of the immediately neighboring pixels. The second discrepancy is that while in some parts of the image the graph forms a perfect grid structure where each pixel is connected to its four horizontal and vertical neighbors (around (7,9), for example), throughout most of the image the model also connects each pixel to its diagonal neighbors (i.e., the median number of neighbors for each node is 6). These extra edges are intuitive, since diagonal neighbors may contain additional information that is not present in the horizontal and vertical neighbors. With lower values of λ, edges between more distant nodes are added and the graphs become less interpretable.

5.8.4 Blockwise Sparsity

In [Schmidt et al., 2009b], we sought to test the performance of fitting blockwise-sparse GGMs to the genes data. In this experiment we sought to reproduce Figure 4 of [Duchi et al., 2008a], and to test the effect of using the ℓ2 norm of the blocks instead of the ℓ∞ norm used in this previous work. We first followed the same experimental set-up as [Duchi et al., 2008a], and performed 50 random train/test splits. In Figure 5.13 we give our version of Figure 4 from [Duchi et al., 2008a], augmented with the blockwise sparse model that penalizes the ℓ2 norm of the blocks.
This figure suggests that using the ℓ2 norm of the blocks gives a further improvement over the existing blockwise sparse model.

Figure 5.13: Average cross-validated log-likelihood against regularization strength under different blockwise-sparse regularization schemes applied to the regularized empirical covariance for the genes data [Schmidt et al., 2009b].

Figure 5.14: Test set negative log-likelihood (left) and relative negative log-likelihood (right) on the genes data using different regularization methods.

To assess the usefulness of the various models for prediction, we divided the data into equal-sized training/validation/testing sets, and measured the test set absolute and relative negative log-likelihoods. The level of Tikhonov regularization discussed in [Duchi et al., 2008a] was selected using the validation set likelihood on each training split. We plot the distribution of these values over 50 trials in Figure 5.14. Here we see that there is no difference in the performance of the methods in absolute score, but that (though very noisy due to the small number of training examples) the blockwise-sparse model that penalizes the ℓ2 norms of the blocks may have an advantage over the other methods.

5.8.5 Conditional Random Fields

In [Schmidt et al., 2008], we experimentally compared an extensive variety of approaches to learning CRF models. Below we divide up these approaches into several groups:

• Fixed Structure: We learn the parameters of a CRF with a fixed structure.
We considered an Empty structure (corresponding to an independent logistic regression model for each target), a Chain structure (the structure most commonly used in CRFs), a Full structure (assuming all edges are present), and the True structure. For the synthetic experiments, the True structure was set to the actual generating structure, while for the real data we used a structure constructed from expert knowledge.
• Generative Acyclic: We first learn the graph structure based on the labels alone, and then learn the parameters of a CRF with this fixed structure. We considered the generative models from [Qazi et al., 2007], namely finding the maximum likelihood Tree and using DAG-Search.
• Generative ℓ1: Here we use (group) ℓ1-regularization to learn a fixed structure based on the labels alone, and then learn the parameters of a CRF with this fixed structure. We considered using ℓ1-regularization and group ℓ1-regularization with the ℓ2 and ℓ∞ norms.
• Discriminative ℓ1: Here we used simultaneous conditional estimation of the structure and parameters using (group) ℓ1-regularization, where again we considered ℓ1-regularization and group ℓ1-regularization with the ℓ2 and ℓ∞ norms.

To compare methods and test the effects of both discriminative structure learning and approximate inference for training, we created a synthetic data set from a small (10-node) binary CRF. We used 10 local features for each node (sampled from a standard Normal) plus a bias term. We chose the graph structure by including each possible edge with probability 0.5. Similarly, we sampled random node weights $v_i \sim N(0, \sqrt{2})$, and edge weights $w_{ij} \sim U(-b, b)$, where $b \sim N(0, \sqrt{2})$ for each edge (the results were similar under different sampling schemes). We drew 500 training samples and 1000 test samples from the exact distribution p(y|x).
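The sampling procedure just described can be sketched as follows. This is a hedged illustration rather than the code used for the experiments, and several details are assumptions that the text does not specify: $\sqrt{2}$ is read as a standard deviation, the absolute value of b is used so that U(−b, b) is a valid interval, each edge has a single Ising-style weight with no edge features, and a node's score is added when its label is 1. With 10 binary nodes there are only 1024 joint labelings, so p(y|x) can be normalized and sampled exactly by enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_feats = 10, 10
std = np.sqrt(2.0)  # ASSUMPTION: sqrt(2) is interpreted as a standard deviation

# Include each possible edge with probability 0.5.
edges = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)
         if rng.random() < 0.5]

# Node weights: 10 feature weights plus one bias weight per node.
V = rng.normal(0.0, std, size=(n_nodes, n_feats + 1))

# Edge weights w_ij ~ U(-b, b), with b drawn separately for each edge.
# ASSUMPTION: |b| is used so that the interval is well defined.
W = {}
for e in edges:
    b = abs(rng.normal(0.0, std))
    W[e] = rng.uniform(-b, b)

# All 2^10 joint labelings, used to normalize p(y | x) exactly.
STATES = np.array(list(itertools.product([0, 1], repeat=n_nodes)))

def conditional_distribution(x):
    """Exact p(y | x) over all labelings; x has shape (n_nodes, n_feats)."""
    xb = np.hstack([x, np.ones((n_nodes, 1))])   # append the bias feature
    node_score = np.einsum("ij,ij->i", V, xb)    # v_i^T x_i for each node
    log_p = STATES @ node_score                  # score counted when y_i = 1
    for (i, j), w_ij in W.items():               # Ising-style: reward agreement
        log_p = log_p + w_ij * (STATES[:, i] == STATES[:, j])
    log_p -= log_p.max()                         # numerical stability
    prob = np.exp(log_p)
    return prob / prob.sum()

def sample_dataset(n):
    """Draw n (features, labels) pairs from the exact conditional model."""
    X = rng.normal(size=(n, n_nodes, n_feats))   # standard Normal features
    Y = np.empty((n, n_nodes), dtype=int)
    for m in range(n):
        idx = rng.choice(len(STATES), p=conditional_distribution(X[m]))
        Y[m] = STATES[idx]
    return X, Y

X_train, Y_train = sample_dataset(500)
X_test, Y_test = sample_dataset(1000)
```

Exact enumeration is only feasible because the model is small; it would not scale to the larger data sets considered earlier in this section.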
In all models, we impose an ℓ2 penalty on the node weights, and we also impose an ℓ2 penalty on the edge weights for all models that do not use ℓ1-regularization of the edge weights. For each of the models compared, the scale of these two regularization parameters is selected by cross-validation on the training set. In our experiments, we explored 10 different permutations of training and testing instances in order to quantify variation in the performance of the methods. For testing the quality of the models, we computed the classification error associated with the exact conditionals $p(y_i \mid x)$.

Figure 5.15: Interquartile range of relative test-set classification accuracy for different methods of training CRFs on synthetic data using the exact objective (top-left), pseudo-likelihood approximation (top-right), Bethe approximation (bottom-left), and selected methods under different approximations (bottom-right). Note that the empty graph, corresponding to logistic regression, always had a relative accuracy of zero.
[Figure 5.16 (plot omitted): two panels; vertical axis: relative classification accuracy (0 to 1); horizontal axis: Empty, Chain, Full, True (Fixed Structure), Tree, DAG (Generative Acyclic), and L1, L1-L2, L1-Linf (Generative L1 and Discriminative L1).]

Figure 5.16: Interquartile range of relative test-set classification accuracy for different methods of training CRFs on the coronary heart disease data at the segment level (left) and heart level (right). Note that the discriminative structure learning method with group ℓ1-regularization with the ℓ∞ norm always has a relative accuracy of one on the heart-level classification task (rightmost column).

We compared learning with the exact objective, the (conditional) pseudo-likelihood objective, and the Bethe variational approximation. In Figure 5.15, we show the relative classification accuracy of different methods on the test set for different objective functions (the best possible score is 1, and the worst is 0). Although not necessary for the synthetic data, we use this measure since the real data examined next is relatively small with a class imbalance, and even though the ranking of the methods is consistent across trials, the particular data split on a given trial represents a confounding factor that obscures the relative performance of the methods. We summarize the distribution of this measure across trials in terms of its interquartile range (a measure of the width of the central 50% interval of the distribution); this is a more robust summary than the standard mean and standard deviation.
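The relative accuracy measure and its interquartile summary can be illustrated with a small sketch. The exact normalization is not spelled out in the text; the code below assumes a per-trial min-max rescaling across methods (the best method on a trial scores 1 and the worst scores 0). The function names and the toy accuracy table are hypothetical and are not taken from the experiments.

```python
import numpy as np

def relative_accuracy(acc):
    """Rescale each trial (row) so the worst method scores 0 and the best scores 1.

    Assumption: per-trial min-max normalization across methods (columns).
    """
    acc = np.asarray(acc, dtype=float)
    lo = acc.min(axis=1, keepdims=True)
    hi = acc.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero when all methods tie
    return (acc - lo) / span

def interquartile_range(rel):
    """Return the 25th and 75th percentiles of each method (column) across trials."""
    q25, q75 = np.percentile(rel, [25, 75], axis=0)
    return q25, q75

# Hypothetical accuracies: 2 trials (rows) x 3 methods (columns).
toy_acc = [[0.70, 0.80, 0.90],
           [0.60, 0.65, 0.80]]
toy_rel = relative_accuracy(toy_acc)
toy_q25, toy_q75 = interquartile_range(toy_rel)
```

Rescaling within each trial removes the effect of an easy or hard data split, so that only the ordering and spacing of the methods on that split remain.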
The results show several broad trends: (a) pseudo-likelihood and the Bethe approximation are almost as good as the exact likelihood (and the Bethe approximation is slightly better than pseudo-likelihood), (b) discriminatively learned structures outperform generatively learned structures, (c) any kind of structure is better than no structure at all, (d) in the generative case, group ℓ1-regularization (under both norms) and regular ℓ1-regularization are very similar (consistent with our earlier experiments on binary data), and (e) both group ℓ1-regularization methods outperform ℓ1-regularization in the discriminative case. Results on other synthetic data sets yield qualitatively similar conclusions, with two exceptions: (i) as we decrease the number of features the performance of group ℓ1-regularization becomes more similar to regular ℓ1-regularization, and (ii) on some data sets the Bethe approximation produced results that were much worse than the pseudo-likelihood approximation.

We next examined the awma-c classification problem. In this data set, we have 19 local image features for each node calculated from the tracked contours of the ventricle. Among these features we include local ejection fraction ratio, radial displacement, circumferential strain, velocity, thickness, thickening, timing, eigenmotion, curvature, and bending energy. We also have 15 global image features (that are the same across nodes). For the node features we used the concatenation of the 15 global features and the 19 local features for the node. For the edge features we used the 15 global features and the 38 features consisting of the concatenation of the local features for each node. We used 2/3 of the data for training and selecting the two regularization parameters, and 1/3 of the data for testing (across 10 different splits). We generated the True structure by adding edges between all nodes sharing a face in the heart diagram, constructed by expert cardiologists, from Qazi et al.
[2007]. We trained various models using pseudo-likelihood and tested them using exact inference [footnote 38]. In Figure 5.16, we show the relative classification accuracy on the test set at the segment level and the heart level (the heart-level decision is made by cardiologists by testing whether two or more segments are abnormal). We see that the discriminative model with group ℓ1-regularization with the ℓ∞ norm of the groups performs among the best at the segment level (achieving a median absolute classification accuracy of 0.92), and is typically the best method at the important heart-level prediction task (achieving a median absolute accuracy of 0.86 and the lowest error rate at this task in 9 out of the 10 trials). These encouraging results can also help less-experienced cardiologists improve their diagnostic accuracy; the agreement between less-experienced cardiologists and experts is often below 50% [Schmidt et al., 2008].

5.9 Similar Methods

In this chapter, we discuss using group ℓ1-regularization for structure learning in pairwise undirected graphical models of discrete data that do not make the Ising assumption. However, it should be noted that similar extensions have been proposed prior to and concurrently with this work. In this section, we highlight the differences between the prior (and concurrent) work and the work outlined in this chapter.

Models that do not make the Ising assumption were also explored in [Lee et al., 2006b, Dahinden et al., 2007]. In [Lee et al., 2006b], they use the non-convex Bethe approximation, and their experiments indicate that the method reaches different local optima with different optimization strategies. Further, they ignore that each edge can have multiple parameters and simply use the standard ℓ1-regularization. They do note that this does not directly encourage graphical sparsity and that graphical sparsity could be achieved with group ℓ1-regularization, but they do not provide a method to solve the resulting problem.
In contrast, [Dahinden et al., 2007] ignore the computational infeasibility of evaluating the likelihood function but use group ℓ1-regularization to directly encourage graphical sparsity. However, they use an optimization algorithm that may require many evaluations of the (generally intractable) likelihood, and do not present results on data with more than 5 variables. In both cases, they only consider using the ℓ2 norm of the groups.

Our work is distinct from this prior work in several ways. First, we consider the convex pseudo-likelihood and convexified Bethe approximations to the likelihood. Using convex approximations means that the estimated parameters are not sensitive to initialization of the optimization procedure, or to the particular optimization strategy used. Second, we consider choices of the group norm other than the ℓ2 norm. This includes the proposed group extension of the nuclear norm, which is novel (as far as we are aware). Our experiments indicate that in some cases other choices of the group norm give better results than the ℓ2 norm. Third, we give a method for adding covariates to the model to yield the more powerful CRF models, while this is the first work to consider structure learning in CRFs. Finally, in Chapter 3 we outline a new optimization algorithm that is especially suited to solving the resulting convex optimization problems, taking into account the very large number of optimization variables, the high cost of evaluating the objective function, and the relatively simple form of the regularizer.

[Footnote 38: We also tested using the Bethe approximation for this task, but learning with this approximation typically led to parameters where the message-passing algorithm would not converge, and led to poor results.]

Using group ℓ1-regularization to learn blockwise-sparse models was originally proposed in [Duchi et al., 2008a], where they use group ℓ1-regularization in GGMs with the ℓ∞ norm for blockwise sparsity.
The work presented here considers other choices of group norms, as well as blockwise-sparse discrete models. Further, the optimization algorithms outlined in Chapter 3 are also well-suited to solving this type of optimization problem, although the improvement in the group-sparse GGM case is not as dramatic as in the group-sparse discrete case.

5.10 Extensions

To conclude this chapter, below we list some extensions of the work presented here:
• Composite likelihoods: The maximum pseudo-likelihood approximation is asymptotically less efficient than the maximum likelihood estimator [Besag, 1977, Liang and Jordan, 2008]. A generalization of pseudo-likelihood approximations is the class of composite likelihoods [Lindsay, 1988], where we can consider using the conditional (or marginal) distributions of groups of variables rather than individual variables. We would likely obtain better results by using a more powerful composite likelihood. For example, we could consider using a composite likelihood where we optimize the conditionals of all pairs of variables. This would only lead to a constant increase in computational complexity but might result in a much better approximation.
• Other variational inference methods: In this work we have considered some of the most common methods for variational inference, but it is straightforward to use other variational inference methods. For example, Banerjee et al. [2008] use ℓ1-regularization to learn the structure of an (unconditional) undirected graphical model with binary states and Ising potentials that uses a log-determinant approximation. Subsequently, [Kolar and Xing, 2008] proposed a cutting plane strategy that iteratively refines this approximation. We might also consider using convergent message-passing algorithms to improve the stability of the optimization [Kolmogorov, 2006], or trying to optimize the edge appearance probabilities in the convexified Bethe free energy [Wainwright et al., 2002].
An extensive survey of variational inference methods is [Wainwright and Jordan, 2008].
• Other group structures: The optimization methods we discuss in Chapter 3 make no assumptions about the group structure except that the groups are disjoint (we consider removing this assumption in Chapter 6), so it is possible to use them for a wide variety of group structures. Besides the cases we discuss here where we use groups to encourage graphical or blockwise sparsity, another interesting possible grouping of the variables occurs for CRFs where the features are binary {0, 1} variables. Instead of assigning all edge weights associated with a single edge to the same group, we could consider using groups that only contain the edge weights associated with a single feature for a single edge. In the case of bias variable edge groups, these would correspond to unconditional interactions (interactions between the target variables that exist regardless of the features). In contrast, the {0, 1} feature-edge groups would correspond to context-specific dependencies. That is, these would indicate dependencies that only exist between the target variables when the corresponding feature value is set to 1.
• Learning the variable types in blockwise-sparse models: Instead of assuming that the variable types are given, in [Marlin et al., 2009] we consider the problem of learning blockwise-sparse models while estimating the variable types. This work also considers a variation on the blockwise-sparse model where we use ℓ1-regularization of the blocks but estimate the appropriate scale of the regularization parameter for each block (leading to a form of soft blockwise-sparsity). The models in this work rely on a variational Bayesian procedure, where the methods of Chapters 2 and 3 are used as sub-routines in the variational parameter update.
• Interventional Potentials: In Section 4.5, we discuss modeling interventions in DAG models using Pearl's do-calculus.
However, for many data sets the assumption of acyclicity is inappropriate; many models of biological networks contain feedback cycles (for example, see Sachs et al. [2005]). In contrast, undirected graphical models allow cycles. However, under most interpretations of the data generating processes associated with undirected graphs there is no difference between conditioning by observation and conditioning by intervention [Lauritzen and Richardson, 2002]; undirected models do not distinguish between observing a variable ('seeing') and setting it by intervention ('doing'). Motivated by the problem of using cyclic models for interventional data, in [Schmidt and Murphy, 2009] we defined the notion of an interventional potential. These are undirected potential functions that are augmented with interventional semantics. In [Schmidt and Murphy, 2009], we consider structure learning using group ℓ1-regularization with interventional potentials on the cyto data, and show that this leads to a better model of this data set than causal DAGs or undirected models that ignore the effects of interventions (as in this chapter).
• Uncertain Interventions: In [Duvenaud et al., 2010], we consider using general conditional density estimators for making causal predictions. As in the DAG-based uncertain intervention framework of Eaton and Murphy [2007], this model includes explicit binary intervention variables in the model and considers modeling the variables conditional on these intervention variables (alternatively, we can consider feature variables that measure properties of the interventions). If we use a CRF as the conditional density estimator, then we have a CRF with binary {0, 1} variables. In this case, we could consider using the feature-edge groups above and learning context-specific interactions (i.e., interactions present under different interventions).
In the case where we use feature variables that characterize properties of different interventions, this framework would allow the model to make predictions about previously unseen interventions.
• Optimization-based search: In this chapter we assumed that the scale of the regularization parameter $\lambda$ is the same across the groups, but the optimization algorithms in Chapter 3 allow a separate $\lambda_A$ for each group $A$. Given that the cost of evaluating a single edge addition/deletion in a search-based structure learning strategy may be similar to solving the convex optimization problem in an ℓ1-regularization approach to structure learning, we might consider using a search-based method where we simply solve the convex optimization problem for different assignments to the different $\lambda_A$ variables. We might be able to use this to propose more global moves than single edge additions/deletions. Such moves would not cost more to evaluate, since the non-separability of the log-likelihood means that the cost of evaluating a single edge addition/deletion may be similar to the cost of evaluating a completely new graph. An approach closely related to this was examined in [Moghaddam et al., 2009].

Chapter 6
Hierarchical Log-Linear Model Structure Learning

In Chapter 5, we considered using group ℓ1-regularization for structure learning in pairwise log-linear models. However, on many real data sets it may be important to model higher-order interactions. Thus we would like to relax the pairwise assumption, but as we discuss in Section 1.6 it is challenging to consider general log-linear models without including an explicit cardinality restriction due to the exponential number of possible higher-order potentials. As an alternative to using an explicit cardinality restriction, we consider fitting general log-linear models (as we describe in Section 1.6) subject to the following constraint:
• Hierarchical Inclusion Restriction: If $w_A = 0$ and $A \subset B$, then $w_B = 0$.
This is the class of hierarchical log-linear models [Bishop et al., 1975, Whittaker, 1990, §7]. While a subset of the space of general log-linear models, the set of hierarchical log-linear models is much larger than the set of pairwise models, and can include interactions of any order. Further, group-sparsity in hierarchical models directly corresponds to conditional independence.

The hierarchical inclusion restriction imposes constraints on the possible sparsity pattern of $w$, beyond that obtained using (disjoint) group ℓ1-regularization. In the context of linear regression and multiple kernel learning, several authors have recently shown that group ℓ1-regularization with overlapping groups can be used to enforce hierarchical inclusion restrictions [Zhao et al., 2009, Bach, 2008b]. As an example, if we would like to enforce the restriction that B must be zero when A is zero, we can do this using two groups: the first group simply includes the variables in B, while the second group includes the variables in both A and B. Regularization using these groups encourages A to be non-zero whenever B is non-zero, since when B is non-zero A is not penalized for moving away from zero [see Zhao et al., 2009, Theorem 1].

To make this concrete, consider the simple case where we have a differentiable loss function $L(x)$ where $x$ has two variables $x_1$ and $x_2$, and we want to enforce that variable $x_2$ is allowed to be non-zero only when $x_1$ is non-zero. To do this, we use the regularizer $\lambda_{12} \|x_{12}\|_2 + \lambda_2 |x_2|$, where $x_{12}$ denotes the vector $(x_1, x_2)$. Now, consider a point $\tilde{x}$ where $x_1$ is zero but $x_2$ is non-zero. At this point the regularizer is differentiable with respect to $x_1$ with derivative zero, so unless it happens by chance that $\nabla_{x_1} L(\tilde{x}) = 0$ we can improve the objective function by moving $x_1$ away from zero.
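The two-variable argument can be checked numerically. The sketch below is only an illustration: the quadratic loss, the values $\lambda_{12} = \lambda_2 = 0.5$, and all function names are assumptions chosen for the example and are not taken from the thesis.

```python
import math

def loss(x1, x2):
    # Hypothetical smooth loss, chosen only for illustration.
    return 0.5 * ((x1 - 1.0) ** 2 + (x2 - 1.0) ** 2)

def regularizer(x1, x2, lam12=0.5, lam2=0.5):
    # Overlapping-group penalty: lam12 * ||(x1, x2)||_2 + lam2 * |x2|.
    return lam12 * math.hypot(x1, x2) + lam2 * abs(x2)

def objective(x1, x2):
    return loss(x1, x2) + regularizer(x1, x2)

def regularizer_grad_x1(x1, x2, lam12=0.5):
    # Partial derivative of the penalty with respect to x1;
    # it exists whenever (x1, x2) is not the origin.
    return lam12 * x1 / math.hypot(x1, x2)

# A point violating the desired pattern: x1 is zero but x2 is non-zero.
x_tilde = (0.0, 1.0)
penalty_slope = regularizer_grad_x1(*x_tilde)  # zero: the penalty is flat in x1 here
loss_slope = x_tilde[0] - 1.0                  # d loss / d x1 at x_tilde, non-zero
f_before = objective(*x_tilde)
f_after = objective(0.1, 1.0)                  # small step of x1 away from zero
```

Because the penalty contributes no slope in $x_1$ at this point, the non-zero slope of the loss alone determines the descent direction, and the small step lowers the objective.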
Generalizing this basic idea, to enforce that the solution of our regularized optimization problem satisfies the hierarchical inclusion restriction we can solve the convex optimization problem [footnote 39]

$$\min_{w} \; -\sum_{i=1}^{n} \log p(x_i|w) + \sum_{A \subseteq S} \lambda_A \Big( \sum_{\{B | A \subseteq B\}} \|w_B\|_2^2 \Big)^{1/2}.$$

[Footnote 39: Although we will focus on using the ℓ2 norm of the groups in this chapter, it is possible to use analogous methods where we penalize other norms of the groups.]

If we define the set of parameters $w_A^*$ as the concatenation of the parameters $w_A$ with all parameters $w_B$ such that $A \subset B$, we can write this as

$$\min_{w} \; -\sum_{i=1}^{n} \log p(x_i|w) + \sum_{A \subseteq S} \lambda_A \|w_A^*\|_2. \tag{6.1}$$

This is very similar to applying group ℓ1-regularization to learn the structure of general log-linear models as in (1.12), except that the parameters of higher-order terms are added to the corresponding lower-order groups. Similar to Theorem 1 of Zhao et al. [2009], we can show that under reasonable assumptions a minimizer of (6.1) will satisfy hierarchical inclusion. We give details about this in the next section after discussing optimality conditions for this problem.

6.1 Optimality Conditions

Using $f(w)$ to denote the objective in (6.1), the sub-differential of $f(w)$ is

$$\partial f(w) = -\nabla \sum_{i=1}^{n} \log p(x_i|w) + \sum_{A \subseteq S} \lambda_A \, \mathrm{sgn}(w_A^*),$$

where $\mathrm{sgn}(y)$ is defined as in Section 3.1.2 (we pad the output of this signum function with zeros so that it has the right dimension). Recall that a vector $\tilde{w}$ is a minimizer of a convex function if and only if $0 \in \partial f(\tilde{w})$ [Bertsekas, 1999, §B.5].

We call $A$ an active group if $w_B \neq 0$ for some $B$ such that $A \subseteq B$. If $A$ is not an active group and $w_B = 0$ for some $B \subset A$, we call $A$ an inactive group. We refer to the remaining groups as boundary groups; a boundary group $A$ satisfies $w_B \neq 0$ for all $B \subset A$ and $w_C = 0$ for all $A \subseteq C$. In other words, the boundary groups are the groups that can be made non-zero without violating hierarchical inclusion.

The optimality conditions with respect to an active group $A$ reduce to

$$\nabla_{w_A} \sum_{i=1}^{n} \log p(x_i|w) = \sum_{B \subseteq A} \lambda_B \, w_A / \|w_B^*\|_2. \tag{6.2}$$

If we treat all inactive groups as fixed, the optimality conditions with respect to a boundary group $A$ become

$$\Big\| \nabla_{w_A} \sum_{i=1}^{n} \log p(x_i|w) \Big\|_2 \leq \lambda_A. \tag{6.3}$$

Together, (6.2) and (6.3) constitute necessary and sufficient conditions for a minimizer of (6.1) under the constraint that inactive groups are fixed at zero. These also comprise necessary (but not necessarily sufficient) conditions for global optimality of (6.1).

We can now show that under reasonable assumptions minimizers of (6.1) satisfy hierarchical inclusion. Assume we have a minimizer $\tilde{w}$ of (6.1) that does not. Then there exists some $A$ such that $\tilde{w}_A = 0$ and some $B$ such that $A \subset B$ and $\tilde{w}_B \neq 0$. This implies group $A$ is active and must satisfy (6.2). Using $\tilde{w}_A = 0$, the right-hand side of (6.2) vanishes, so $\nabla_{w_A} \sum_{i=1}^{n} \log p(x_i|\tilde{w})$ is exactly 0; assuming the set where this happens has zero probability, this contradicts that $\tilde{w}$ is a minimizer.

Unfortunately, there are several complicating factors in solving (6.1). In particular, (i) there remains an exponential number of groups to consider and (ii) we can no longer compute the projection (or soft-threshold) operator used by the optimization algorithms in Chapter 3. We address the former issue first.

6.2 Regularization Path and Active-Set Optimization

We would like to avoid having to consider the exponential number of groups present in (6.1). Since we know that the solution is a hierarchical model, we propose to use an active-set method that incrementally adds variables to the problem until (6.2) and (6.3) are satisfied, and that uses hierarchical inclusion to exclude the possibility of adding most variables. The method alternates between two phases:
• Find the set of active groups, and the boundary groups violating (6.3).
• Solve the problem with respect to these variables.
We repeat this until no new groups are found in the first step, and at this point we have (by construction) found a point satisfying (6.2) and (6.3).
This is analogous to the active-set methods of Chapters 2 and 3, but note that here we only consider adding groups that satisfy hierarchical inclusion. In this algorithm, the addition of boundary groups has an intuitive interpretation; we only add the zero-valued group $A$ if it satisfies hierarchical inclusion and the difference between the model marginals and the empirical frequencies exceeds $\lambda_A$. Such an addition rule is very reminiscent of the method of [Gevarter, 1987]. This method greedily adds constraints on higher-order marginals if the observed higher-order marginals differ significantly from the model's higher-order marginals after fitting lower-order marginals, for the closely related problem of computing a maximum entropy distribution subject to given marginal constraints [Cheeseman, 1983]. However, the proposed method differs from the prior work in that the active-set method can add or remove variables, and it makes progress towards solving a convex optimization problem.

Consider a simple 6-node hierarchical log-linear model containing non-zero potentials on (1)(2)(3)(4)(5)(6)(1,2)(1,3)(1,4)(4,5)(4,6)(5,6)(4,5,6). Though there are 20 possible threeway interactions in a 6-node model, only one satisfies hierarchical inclusion, so our method would not consider the other 19. Further, we do not need to consider any fourway, fiveway, or sixway interactions since none of these satisfy hierarchical inclusion. In general, we might need to consider more higher-order interactions, but we will never need to consider more than a polynomial number of groups more than the number present in the final model. That is, hierarchical inclusion and the active-set method can save us from looking at an exponential number of irrelevant higher-order factors.
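The counts in the 6-node example can be reproduced with a short enumeration. This is a minimal sketch rather than the implementation used in the experiments; the helper names `satisfies_hierarchical_inclusion` and `candidates` are hypothetical, and a group is treated as admissible when every non-empty proper subset of it carries a non-zero potential.

```python
from itertools import combinations

NODES = range(1, 7)

# Non-zero potentials of the example model.
NONZERO = {frozenset(g) for g in [
    (1,), (2,), (3,), (4,), (5,), (6,),
    (1, 2), (1, 3), (1, 4), (4, 5), (4, 6), (5, 6),
    (4, 5, 6),
]}

def satisfies_hierarchical_inclusion(group, nonzero):
    """True if every non-empty proper subset of `group` is a non-zero group."""
    members = sorted(group)
    return all(
        frozenset(subset) in nonzero
        for size in range(1, len(members))
        for subset in combinations(members, size)
    )

def candidates(order):
    """All groups of the given order that hierarchical inclusion allows."""
    return [g for g in combinations(NODES, order)
            if satisfies_hierarchical_inclusion(g, NONZERO)]

all_threeway = list(combinations(NODES, 3))
threeway = candidates(3)
higher_order = [g for order in (4, 5, 6) for g in candidates(order)]
```

Checking subsets of a candidate is cheap compared with fitting the model, which is why the restriction can prune the search before any gradient with respect to a new group is computed.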
To stop us from considering overly complicated models that do not generalize well, to set the regularization parameter(s) we can start with the unary model and incrementally decrease the regularization until a measure of generalization error starts to increase. This is analogous to the regularization path methods mentioned in Chapters 2 and 3, but augmented with a termination criterion.

Before moving on to how we can solve the problem with respect to a subset of the groups, we summarize the computational gains that can be achieved for computing the ℓ1-regularization path compared to the ℓ2-regularization path for the optimization problems we discuss in Chapters 2, 3, and 6:
• Chapter 2: For logistic regression with ℓ1-regularization, for large values of $\lambda$ we may reduce the cost of evaluating the objective function by a polynomial factor, and reduce the number of variables by a polynomial factor.
• Chapter 3: For pairwise log-linear models with group ℓ1-regularization, for large values of $\lambda$ we may reduce the cost of evaluating the objective function by an exponential factor, and reduce the number of variables by a polynomial factor.
• Chapter 6: For hierarchical log-linear models with overlapping group ℓ1-regularization, for large values of $\lambda$ we may reduce the cost of evaluating the objective function by an exponential factor, and reduce the number of variables by an exponential factor.

6.3 Constrained Formulation

In the second phase of the active-set method we must solve (6.1) with respect to a subset of the groups. This comprises a group ℓ1-regularization problem with overlapping groups. Besides a special case discussed in [Zhao et al., 2009] where the solution can be computed directly, previous approaches to solving group ℓ1-regularization problems with overlapping groups include a boosted LASSO method [Zhao et al., 2009] and a re-formulation of the problem as a smooth objective with a simplex constraint [Bach, 2008b].
Unfortunately, applying these methods to graphical models would be relatively inefficient since they might require a very large number of function evaluations. As before, we can solve the problem by writing it as an equivalent differentiable but constrained problem. In particular, we again introduce a scalar auxiliary variable $g_A$ to bound the norm of each group $w_A^*$, leading to a smooth objective with second-order cone constraints:

$$\min_{w,g} \; -\sum_{i=1}^{n} \log p(x_i|w) + \sum_{A \subseteq S} \lambda_A g_A, \quad \text{s.t. } g_A \geq \|w_A^*\|_2, \; \forall A. \tag{6.4}$$

As we saw in Chapter 3, the projection for each group has a simple closed-form solution. Thus, we might consider solving this problem with the SPG or PQN method. However, because the groups now overlap, we can no longer compute the projection onto each group independently.

6.4 Dykstra's Algorithm

We would like to solve the problem of computing the projection onto a convex set defined by the intersection of sets, where we can efficiently project onto each individual set. One of the earliest results on this problem is due to von Neumann [1950, §13], who proved that the limit of cyclically projecting a point onto two closed linear sets is the projection onto the intersection of the sets. Bregman [1965] proposed to cyclically project onto a series of general convex sets in order to find a point in their intersection, but this method will not generally converge to the projection. The contribution of Dykstra [1983] was to show that by taking the current iterate and removing the difference calculated from the previous cycle, then subsequently projecting this value, the cyclic projection method converges to the optimal solution for general (closed) convex sets. Deutsch and Hundal [1994] have shown that Dykstra's algorithm converges at a geometric rate for polyhedral sets (the set defined with the ℓ2 group norm is not polyhedral, but the set defined with the ℓ∞
group norm is polyhedral). Algorithm 11 gives pseudo-code for an implementation of Dykstra's algorithm (we obtain Bregman's method if we fix $I_i$ at 0).

Input: Point $w_0$, convex sets $C_1, C_2, \dots, C_q$, tolerance $\epsilon$
Output: $P_C(w_0)$, the projection of $w_0$ onto $C \triangleq \cap_{i=1}^{q} C_i$.
$\forall i$, $I_i \leftarrow 0$;
$j \leftarrow 0$;
while $w_j$ is changing by more than $\epsilon$ do
    for $i = 1$ to $q$ do
        $j \leftarrow j + 1$;
        $w_j \leftarrow P_{C_i}(w_{j-1} - I_i)$;
        $I_i \leftarrow w_j - (w_{j-1} - I_i)$;
Algorithm 11: Dykstra's cyclic projection algorithm for finding the projection of a point onto an intersection of convex sets.

Despite its simplicity, Dykstra's algorithm is not widely used because of its high storage requirements. In its unmodified form, applying Dykstra's algorithm to compute the projection in (6.4) would be impractical, since for each group we would need to store a copy of the entire parameter vector. Fortunately, in (6.4) each constraint only affects a small subset of the variables. By taking advantage of this it is straightforward to derive a sparse variant of Dykstra's algorithm that only needs to store a copy of each variable for each group that it is associated with (rather than one copy of the entire parameter vector for each group). This leads to an enormous reduction in the memory requirements. Further, although using Dykstra's algorithm rather than an analytic update leads to a higher iteration cost, the cost of running the cyclic projection algorithm will typically be much smaller than the cost of evaluating the objective function.

6.4.1 Soft-Dykstra's Algorithm

Allowing the groups to overlap also means that the soft-threshold operator cannot be applied independently to the different groups. Given the similarity between the projection and the soft-threshold operator, we might expect to be able to derive a variant on Dykstra's algorithm that is able to solve the soft-threshold problem with overlapping groups. Bauschke and Combettes [2008] present a generalization of Dykstra's algorithm that can be used to solve this problem, outlined in Algorithm 12.
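The cyclic projection scheme of Algorithm 11 can be sketched in a few lines. This is a dense, unoptimized illustration (not the sparse variant described above), and the two half-spaces, the starting point, and all function names are assumptions chosen so that the answer can be verified by hand: the projection of (1, 1) onto the intersection of {x : x2 ≤ 0} and {x : x1 + x2 ≤ 0} is the origin, while plain cyclic projection (fixing every $I_i$ at 0) stops at (0.5, -0.5).

```python
import numpy as np

def dykstra(w0, projections, tol=1e-10, max_cycles=10000, use_corrections=True):
    """Project w0 onto the intersection of convex sets by cyclic projection.

    `projections` holds one projection function per set. With
    use_corrections=False every increment I_i stays at zero, which gives
    plain cyclic (Bregman-style) projection.
    """
    w = np.asarray(w0, dtype=float).copy()
    increments = [np.zeros_like(w) for _ in projections]
    for _ in range(max_cycles):
        w_start = w.copy()
        for i, project in enumerate(projections):
            z = w - increments[i]      # remove the increment stored for set i
            w = project(z)
            if use_corrections:
                increments[i] = w - z  # store the new increment for set i
        if np.linalg.norm(w - w_start) <= tol:
            break
    return w

def project_x2_nonpositive(z):
    # Projection onto the half-space {x : x2 <= 0}.
    out = z.copy()
    out[1] = min(out[1], 0.0)
    return out

def project_sum_nonpositive(z):
    # Projection onto the half-space {x : x1 + x2 <= 0}.
    excess = max(z[0] + z[1], 0.0)
    return z - 0.5 * excess * np.ones(2)

sets = [project_x2_nonpositive, project_sum_nonpositive]
w_dykstra = dykstra([1.0, 1.0], sets)
w_cyclic = dykstra([1.0, 1.0], sets, use_corrections=False)
```

Both outputs lie in the intersection, but only the corrected iteration returns the nearest point, which is the property the projection-based optimizers rely on.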
Input: Point $w_0$, convex regularizers $R_1(w), R_2(w), \dots, R_q(w)$, tolerance $\epsilon$, step size $\alpha$
Output: $S_R(w_0, \alpha)$, the soft-threshold operator with input $w_0$, step size $\alpha$, and regularizer $R(w) \triangleq \sum_{i=1}^{q} R_i(w)$.
$\forall i$, $I_i \leftarrow 0$;
$j \leftarrow 0$;
while $w_j$ is changing by more than $\epsilon$ do
    for $i = 1$ to $q$ do
        $j \leftarrow j + 1$;
        $w_j \leftarrow S_{R_i}(w_{j-1} - I_i, \alpha)$;
        $I_i \leftarrow w_j - (w_{j-1} - I_i)$;
Algorithm 12: Variant of Dykstra's algorithm for computing soft-threshold operators for a regularizer consisting of the sum of convex regularizers.

[Figure 6.1 (plot omitted): two panels; vertical axes: test set negative log-likelihood (×10^4) and test set relative negative log-likelihood (0 to 1); horizontal axis: Pairwise, Threeway, and HLLM models with gIsing and Full potentials under L2 and L1 regularization.]

Figure 6.1: Test set negative log-likelihood (left) and relative negative log-likelihood (right) on the cyto data using different regularization types and potential restrictions.

The appeal of using Algorithm 12 extends beyond the fact that the soft-threshold algorithm is a simpler, more direct, and potentially more efficient strategy than projection methods. This is because Dykstra's projection method typically approaches the optimal projection through a sequence of infeasible iterates. Thus, in the projection framework we must solve it sufficiently accurately to guarantee that we are (numerically close enough to) feasible. In contrast, since there is no notion of feasibility (in the BBST and QNST algorithms) we might be able to terminate the soft-threshold variant of the algorithm early.

6.5 Experiments

In this section we re-visit building generative models of the data sets examined in the last chapter, but consider fitting models that relax the pairwise assumption.
In the next section we re-visit the two data sets where exact likelihood calculation was possible, and then we turn to several of the larger data sets.

6.5.1 Smaller Data

We first re-visit building generative models of the cyto and awma small data sets, where we use exact likelihood calculation and consider both the full and gIsing parameterizations. On each data set we compared our hierarchical log-linear model with overlapping group ℓ1-regularization (labeled HLLM in the figures) to fitting log-linear models restricted to both pairwise and threeway potentials with both ℓ2-regularization and group ℓ1-regularization with the group ℓ2 norm. Note that unlike the pairwise and threeway models, an ℓ2-regularized version of the hierarchical log-linear model is infeasible. We trained on a random half of the data set, and tested on the remaining half as the regularization parameter $\lambda$ was varied. For the pairwise and threeway models, we set $\lambda_A$ to the constant $\lambda$. For the hierarchical model, we set $\lambda_A$ to $\lambda 2^{|A|-2}$, where $|A|$ is the cardinality of $A$ and we placed no explicit restriction on the cardinality of $A$. For all the models, we did not regularize the unary weights.

[Figure 6.2 (plot omitted): two panels; vertical axes: test set negative log-likelihood and test set relative negative log-likelihood (0 to 1); horizontal axis: Pairwise L2, Pairwise L1, Threeway L2, Threeway L1, and HLLM L1, each with gIsing and full potentials.]

Figure 6.2: Test set negative log-likelihood (left) and relative negative log-likelihood (right) on the awma data using different regularization types and potential restrictions.

We plot the results obtained on the cyto data in Figure 6.1.
On this data set, we see that allowing for threeway interactions leads to better performance than using pairwise interactions (for both types of potentials), and further that the hierarchical model that allows higher-order interactions leads to a further improvement. The HLLM with full potentials included up to fourway potentials, while with Ising potentials fiveway potentials were also included.

We plot the results obtained on the awma data in Figure 6.2. On this data set the threeway models do no better than the pairwise models, and the $\ell_2$-regularized threeway model seems to do worse than the pairwise models. In contrast, the hierarchical model seems to have an advantage over the pairwise models. On this data set the HLLMs included fourway interactions on nine of the ten trials when using full potentials, and additionally included fiveway potentials on two of the ten trials when using Ising potentials.

6.5.2 Larger Data

We next tested the various methods on several larger data sets, concentrating on the case of gIsing potentials and the pseudo-likelihood approximation. We plot the test-set pseudo-log-likelihood for the awma5, traffic, and usps4 data sets in Figures 6.3-6.5. In these figures we see that modeling threeway interactions gives improved results for two of these three data sets. However, the hierarchical model dominated the threeway models, and did substantially better than the pairwise models except on the awma5 data, where the pairwise model with $\ell_1$-regularization performs similarly. On the awma5 data set the HLLM only included pairwise and threeway factors, while it included fourway factors on the traffic data set and fiveway factors on the usps4 data set.

Although we concentrated on these relatively small data sets in our experiments, these data sets are still larger than previous data where higher-order models have been applied. For example, Dahinden et al.
[2007] use (disjoint) group $\ell_1$-regularization and only considered up to 5 binary variables, while Dobra and Massam [2010] considered log-linear models over 16 binary variables and used stochastic local search to identify the structure (this search-based method requires fitting each model during the search, which is very expensive). In contrast, the traffic data examined in this work contains 32 four-state variables. Our method can in principle be used to learn models with higher-order interactions on even larger data sets, provided that the solution of the optimization problem is sufficiently sparse.

Figure 6.3: Test set negative pseudo-log-likelihood (left) and relative negative log-likelihood (right) on the awma5 data using different regularization types and potential restrictions. [Plot omitted: bars for the Pairwise, Threeway, and HLLM models under L2 and L1 regularization.]

Figure 6.4: Test set negative pseudo-log-likelihood (left) and relative negative log-likelihood (right) on the traffic data using different regularization types and potential restrictions. [Plot omitted: bars for the Pairwise, Threeway, and HLLM models under L2 and L1 regularization.]

6.5.3 Structure Estimation

We next sought to assess the performance of the HLLM for structure estimation. We created a 10-node synthetic data set that includes all unary factors as well as the factors (2,3)(4,5,6)(7,8,9,10) (a non-hierarchical model), where the model weights were generated from a $\mathcal{N}(0, 1)$ distribution.
In Figure 6.6, we plot the number of false positives of different orders present in the first model along the regularization path that includes all three factors in the true structure against the number of training examples (we define a false positive as a factor where none of its supersets are present in the true model). For example, with 20000 samples the order of edge additions was (with false positives in square brackets)

(8,10)(7,9)(9,10)(7,10)(4,5)(8,9)(2,3)(4,6)(8,9,10)(7,8)(7,8,9)(7,8,10)(5,6)[1,8][5,9][3,8][3,7](4,5,6)[1,7](7,9,10)(7,8,9,10)

(at this point it includes all three factors in the true structure, with 5 pairwise false positives and no higher-order false positives). In the figure, we see that the model tends to include false positives before it adds all true factors, but the number decreases as the sample size increases. Further, there tend to be few higher-order false positives; although it includes spurious pairwise factors even with 150000 samples, the model includes no spurious threeway factors beyond 30000 samples, no spurious fourway factors beyond 10000 samples, and no spurious fiveway factors for any sample size (the plot begins at 5000).

Figure 6.5: Test set negative pseudo-log-likelihood (left) and relative negative log-likelihood (right) on the usps4 data using different regularization types and potential restrictions. [Plot omitted: bars for the Pairwise, Threeway, and HLLM models under L2 and L1 regularization.]

Figure 6.6: False positives of different orders against training set size for the first model along the regularization path where the HLLM selects a superset of the true data-generating model [Schmidt and Murphy, 2010]. [Plot omitted: Pairwise, Threeway, Fourway, and Fiveway false positives against training examples (thousands).]
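The false-positive count in the 20000-sample example above can be reproduced with a short Python sketch. This is only an illustration under stated assumptions: the helper names are hypothetical, factors are represented as tuples of variable indices, and a factor is treated as a superset of itself, so a factor counts as a false positive exactly when it is not contained in any factor of the true model.

```python
def is_false_positive(factor, true_factors):
    """True if neither the factor nor any superset of it is in the true model."""
    candidate = frozenset(factor)
    return not any(candidate <= frozenset(t) for t in true_factors)


def count_false_positives_by_order(path, true_factors):
    """Map factor order (number of variables) to the number of false positives."""
    counts = {}
    for factor in path:
        if is_false_positive(factor, true_factors):
            counts[len(factor)] = counts.get(len(factor), 0) + 1
    return counts


# Non-unary factors of the synthetic model described in the text.
TRUE_FACTORS = [(2, 3), (4, 5, 6), (7, 8, 9, 10)]

# Order of additions reported for 20000 samples.
PATH = [
    (8, 10), (7, 9), (9, 10), (7, 10), (4, 5), (8, 9), (2, 3), (4, 6),
    (8, 9, 10), (7, 8), (7, 8, 9), (7, 8, 10), (5, 6), (1, 8), (5, 9),
    (3, 8), (3, 7), (4, 5, 6), (1, 7), (7, 9, 10), (7, 8, 9, 10),
]

print(count_false_positives_by_order(PATH, TRUE_FACTORS))  # {2: 5}
```

The result, five pairwise false positives and none of higher order, matches the count stated for this example.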
We next examined the coronary heart disease data set analyzed in [Edwards and Havránek, 1985]. The first fifteen factors added along the HLLM-L1 regularization path on this data set are

(B,C)(A,C)(B,E)(A,E)(C,E)(D,E)(A,D)(B,F)(E,F)[C,D][A,F](A,D,E)(D,F)[D,E,F][A,B].

We have used square brackets to denote factors that are not recognized in the prior work, and may represent false positives due to the use of a point estimate with this small sample size. The first seven factors are the union of the minimally sufficient hierarchical models from the analysis by Edwards and Havránek. These are also the factors with posterior mode greater than 0.5 for a prior strength of 2 and 3 in the hierarchical models of [Dobra and Massam, 2010], while the first eight are the factors selected with a prior strength of 32 and 64. With a prior strength of 128, Dobra and Massam [2010] find the ninth factor introduced by our model, as well as the factor (D,F) introduced later. The remaining factor with this prior strength is the factor (B,C,F), which is not found until much later in the regularization path in our model. In contrast, the first threeway factor introduced by our model is (A,D,E). This factor is present in both of the accepted graphical models in [Edwards and Havránek, 1985], and is the only threeway factor with a posterior greater than 0.5 (under a Laplace approximation) in the graphical models of [Dobra and Massam, 2010] for a prior strength of 1, 2, 3, 32, and 64.

6.6 Similar Methods

In this chapter we considered using group $\ell_1$-regularization for structure learning in discrete undirected graphical models where the pairwise assumption is relaxed. The only prior work we are aware of that has considered this case is [Dahinden et al., 2007]. However, Dahinden et al. [2007] use disjoint group $\ell_1$-regularization, and thus in general group sparsity in their model does not correspond to conditional independence. Further, Dahinden et al.
[2007] ignore the challenges associated with considering higher-order factors when the number of variables is non-trivial. In contrast, this chapter has provided methods for addressing the problems associated with the intractable objective function and the exponential number of higher-order terms. These considerations allow the method we discuss in this chapter to be applied to much larger data sets, without any explicit restriction on the cardinality of the model.

6.7 Extensions

Below we discuss several extensions of the work we discuss in this chapter:

• DAGs: We have considered using hierarchical inclusion in order to provide a tractable way to relax the pairwise assumption in undirected graphical models. The use of sigmoid (or Gaussian) CPDs in Chapter 4 is very similar to the pairwise assumption, and hierarchical inclusion would also be useful in DAG models. For example, if we are given a variable ordering and use Gaussian CPDs then hierarchical inclusion can be used to encourage the Cholesky matrix to have a low bandwidth. Alternatively, we could consider sigmoid (or Gaussian) CPDs where we use a non-linear basis expansion of the parent variables. Hierarchical inclusion could then be used to tractably search through the exponential space of possible terms to include, as in [Bach, 2008b].

• Conditional and interventional models: We have considered unconditional models in this chapter, but as in Chapter 5 we could also consider conditional models and models with interventional potentials.

• Other group structures: Rather than using the $\ell_2$ group norm, we could apply Dykstra’s algorithm with the $\ell_\infty$ group norm (or any norm where we can efficiently compute the projection). Further, we can still apply Dykstra’s algorithm under different assignments of variables to overlapping groups. It is also possible that better performance would be achieved by a different selection of the regularization weights $\lambda_A$.

• Other overlapping group schemes: Jacob et al.
[2009] consider a different notion of overlapping groups to encourage a sparsity pattern that is a union of groups. They represent each variable as a combination of auxiliary variables and penalize these (disjoint) variables. We could enforce hierarchical inclusion in this framework by adding to each group all subsets of the group, as opposed to all supersets in (6.1). An advantage of this is that the projection (or soft-threshold) could easily be computed using the methods of Chapter 3, but a disadvantage is that it would be grossly over-parameterized (we would have an auxiliary variable for every subset of each non-zero factor). Further, although the result would still be hierarchical, it might be the case that the auxiliary variables associated with lower-order groups would be zero (since the parameters of the lower-order group would be represented by the version associated with a higher-order group), so it seems less likely that an efficient active-set method that finds the globally optimal solution could be developed.

• Sufficient conditions: The active-set method converges to a point satisfying necessary optimality conditions for (6.1), and conditions that are sufficient under the constraint that inactive groups are fixed at zero. However, it may terminate at a point that is not a global optimum of the full problem (6.1), since making an inactive group non-zero could still improve the objective at that point. Thus, an outstanding issue is deriving an efficient way to test (or bound) sufficient optimality conditions over all variables in order to ensure global optimality, as in [Bach, 2008b]. Since the gradient of the objective function is bounded in magnitude, it is likely that such tests are possible. Given such a test, a related issue is developing an efficient search procedure for finding sub-optimal inactive groups.
Even if such a test is intractable in general, several heuristic strategies are possible that would improve the likelihood that we find the global optimum. For example, we could test not only all boundary groups, but all groups that would become boundary groups if the current boundary groups were non-zero. This test would still only need to consider polynomially many groups beyond those that are non-zero at the current iterate.

Chapter 7

Discussion

In this chapter, we discuss several issues that have been ignored in this work, as well as several interesting extensions of this work and possible directions of future work.

• Missing data and hidden variables: In this work, we have assumed that all variables are observed in all training samples. We can also consider scenarios where the values of some variables are missing or hidden, by marginalizing over the missing values. In the case of undirected models, if we use $O$ to denote the observed variables and $H$ to denote the hidden variables, we could write the probability of the observed variables in this scenario as

$$p(x_O) \triangleq \sum_{x_H} p(x_O, x_H) = \frac{1}{Z} \sum_{x_H} \tilde{p}(x_O, x_H),$$

where we have used $\tilde{p}(x_O, x_H)$ to denote the unnormalized product of the potential functions. Although in principle this is a straightforward extension, the sum over values of the hidden variables complicates the optimization. In particular, computing this sum might require approximate inference. Further, this sum leads to a (non-linear) concave term in the negative log-likelihood, so the objective function is no longer convex. We could consider directly finding a local minimum of the resulting non-convex optimization problem with one of the methods we describe in Chapter 3. Alternately, we could find a local minimum by using the expectation maximization (EM) algorithm [Dempster et al., 1977] to yield a sequence of convex optimization problems of the form addressed in Chapter 3 that would upper bound the objective function.

• Mixture models: In Chapter 6 we considered using higher-order potentials to model complicated distributions that are not fully characterized by pairwise statistical dependencies. An alternative and complementary strategy to increase the representational power of models is to use mixtures. Here, we would represent the probability of an observed vector as a convex combination of independent graphical models

$$p(x) \triangleq \sum_{c=1}^{C} \pi_c \, p(x \mid w_c),$$

where $\sum_{c=1}^{C} \pi_c = 1$. Even though the individual graphical models are independent, the use of a convex combination introduces dependencies between all variables (assuming we have at least $C = 2$ mixture components). For example, even if we use a completely disconnected graph, all variables are dependent in the joint distribution [Bishop, 2006, §9.3.3]. Previous work has examined mixtures of tree-structured graphical models [Meila and Jordan, 2000], but we could consider mixtures of general graphical models. As with the case of missing variables, this formulation is not convex but we could use the EM algorithm to find a local minimum by solving a sequence of convex problems of the form addressed in Chapter 3.

• Stochastic inference and online estimation: In this work we have focused on the case of deterministic approximations to the marginals in undirected graphical models. An alternative class of methods exists that generates stochastic samples from the distribution in order to approximate the marginals [Koller and Friedman, 2009, §12]. The advantage of these methods is that, as the sample size increases, they converge to the true marginals. However, with finite sample sizes the approximation will not be exact and there may be discontinuities in the associated objective function. One way to optimize under this sort of approximation is with stochastic approximation methods where we alternate between generating samples and updating the parameters [Younes, 1989].
It is well known that projections can be used within stochastic approximation methods [Kushner and Yin, 2003, §5], while more recent work has examined stochastic approximation methods that use the soft-threshold operator [Duchi and Singer, 2009]. We could also use the stochastic approximation framework to apply the techniques we describe in an online setting, where rather than a fixed training set we receive training samples one at a time.

• Other types of structure learning: This work has concentrated on the cases of linearly-parameterized DAG models, pairwise log-linear models, and hierarchical log-linear models. However, it is possible to extend the ideas we discuss here to other types of models. For example, we could consider learning the structure of chain-graph models [Lauritzen, 1996, §3.2.3] (models that combine directed and undirected edges), by using a search-based method to search through the space of (block-)DAG models and using group $\ell_1$-regularization to learn the undirected structure within the blocks. Similarly, the methods of Chapter 4 may be useful for structure learning in ancestral graph Markov models, a generalization of DAG models that is closed under marginalization and conditioning [Richardson and Spirtes, 2002]. It might also be possible to use the ideas of Chapter 6 to learn probabilistic context free grammars or first-order probabilistic models [Russell and Norvig, 2003, §14.6 and §23.1], or Markov logic networks [Richardson and Domingos, 2006].

• Bayesian methods: The regularized parameter estimates we use in this work can be interpreted as the maximizers of a posterior distribution under an appropriate prior. If we have a small sample size and are interested in the task of structural estimation, it may prove more useful to find the posterior probability of an edge parameter taking a value of zero in this posterior distribution.
In this case, it does not make sense to use an $\ell_1$-regularizer because the edge parameter is exactly zero with probability zero under the corresponding posterior. Thus, in this case we would need to consider a prior/regularizer that places an atom at zero in the prior distribution. Although there has been some work on exact computation of edge posteriors in models with a small number of variables [Koivisto and Sood, 2004], a variational or stochastic approximation would likely be needed to approximate the edge posteriors. In the variational case, the methods of Chapter 2 or 3 may be useful in implementing the variational update, similar to our work in [Marlin et al., 2009].

• Max-margin training: For CRFs, an alternative to optimizing the conditional likelihood is to use a maximum-margin training objective [Taskar et al., 2003]. The maximum-margin objective is non-differentiable (when formulated as an unconstrained optimization), but does not depend on the normalizing constant in the model. Instead, it depends on the most likely configuration under the model (or the second most likely, in some variations). While computing the most likely configuration is still NP-hard (as opposed to the #P-hard task of evaluating the normalizing constant), there exist several special cases where we can compute the most likely configuration even though we cannot compute the normalizing constant. For example, [Taskar et al., 2004] consider using maximum-margin training in binary models with constraints that enforce sub-modular potentials (assuming a fixed structure). We could consider a variant on this method (or other cases where we can efficiently compute the most likely configuration) where we use (group) $\ell_1$-regularization of the edge parameters to learn a sparse structure.
Unlike training with the conditional likelihood, if we enforce sub-modularity of the edge potentials it would be possible to evaluate the objective function in this problem (without using approximations) even for a non-trivial number of nodes.

We conclude by noting that implementations of the methods discussed in this thesis will be made available on the author’s homepage.

Bibliography

P. Abbeel, D. Koller, and A. Y. Ng. Learning factor graphs in polynomial time and sample complexity. Journal of Machine Learning Research, 7:1743–1788, 2006.
S. Acid, L. de Campos, and J. Huete. The search of causal orderings: A short cut for learning belief networks. European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, 2001.
C. Aliferis, I. Tsamardinos, A. Statnikov, and L. Brown. Causal explorer: A causal probabilistic network learning toolkit for biomedical discovery. International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, 2003.
G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. International Conference on Machine Learning, 2007.
F. Bach. Bolasso: model consistent Lasso estimation through the bootstrap. International Conference on Machine Learning, 2008a.
F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. Conference on Neural Information Processing Systems, 2008b.
F. Bach and M. Jordan. Thin junction trees. Conference on Neural Information Processing Systems, 2001.
S. Bakin. Adaptive regression and model selection in data mining problems. PhD thesis, Australian National University, Canberra, 1999.
O. Banerjee, L. El Ghaoui, A. d’Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse gaussian graphical models. International Conference on Machine Learning, 2006.
O. Banerjee, L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data.
Journal of Machine Learning Research, 9:485–516, 2008.
J. Barzilai and J. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.
H. Bauschke and P. Combettes. A Dykstra-like algorithm for two monotone operators. Pacific Journal of Optimization, 4(3):383–391, 2008.
D. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.
D. Bertsekas, A. Nedic, and A. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.
D. Bertsimas and J. Tsitsiklis. Introduction to linear optimization. Athena Scientific, 1997.
J. Besag. Statistical analysis of non-lattice data. The Statistician, 24(3):179–195, 1975.
J. Besag. Efficiency of pseudolikelihood estimators for simple gaussian fields. Biometrika, 64(3):616–618, 1977.
E. Birgin, J. Martínez, and M. Raydan. Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization, 10(4):1196–1211, 2000.
C. Bishop. Pattern recognition and machine learning. Springer, 2006.
Y. Bishop, S. Fienberg, and P. Holland. Discrete multivariate analysis: Theory and practice. MIT Press, 1975.
R. Bouckaert. Probabilistic network construction using the minimum description length principle. European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, 1993.
S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
L. Bregman. The method of successive projection for finding a common point of convex sets. Doklady Akademii Nauk, 162(3):487–490, 1965. An English translation appears in Soviet Mathematics Doklady, 6:688–692, 1965.
W. Buntine. Theory refinement on Bayesian networks. Conference Uncertainty in Artificial Intelligence, 1991.
R. Byrd, J. Nocedal, and R. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63(2):129–156, 1994.
J. Cai, E. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion.
SIAM Journal on Optimization, 20(4):1956–1982, 2010.
A. Chechetka and C. Guestrin. Efficient principled learning of thin junction trees. Conference on Neural Information Processing Systems, 2007.
P. Cheeseman. A method of computing generalized bayesian probability values for expert systems. International Joint Conference on Artificial Intelligence, 1983.
G. Chen and R. Rockafellar. Convergence rates in forward-backward splitting. SIAM Journal on Optimization, 7(2):421–444, 1997.
S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. Liu. Learning Bayesian networks from data: an information-theory based approach. Artificial Intelligence, 137(1-2):43–90, 2002.
D. Chickering. Learning Bayesian networks is NP-complete. Conference on Artificial Intelligence and Statistics, 1995.
D. Chickering. Optimal structure identification with greedy search. The Journal of Machine Learning Research, 3:507–554, 2003.
D. Chickering, D. Heckerman, and C. Meek. Large-sample learning of Bayesian networks is NP-hard. The Journal of Machine Learning Research, 5:1287–1330, 2004.
C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory, 14(3):462–467, 1968.
J. Claerbout and F. Muir. Robust modeling with erratic data. Geophysics, 38:826–844, 1973.
D. Cobzas and M. Schmidt. Increase Discrimination in Level Set Methods with Embedded Conditional Random Fields. Conference on Computer Vision and Pattern Recognition, 2009.
P. Combettes and V. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168–1200, 2005.
G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine learning, 9(4):309–347, 1992.
G. Cooper and C. Yoo. Causal discovery from a mixture of experimental and observational data.
Conference Uncertainty in Artificial Intelligence, 1999.
G. Cormack and T. Lynam. Spam corpus creation for TREC. Conference on Email and Anti-Spam, 2005.
T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to algorithms. MIT press, 2nd edition, 2001.
C. Dahinden, G. Parmigiani, M. Emerick, and P. Bühlmann. Penalized likelihood for sparse contingency tables with an application to full-length cDNA libraries. BMC Bioinformatics, 8:476, 2007.
J. Dahl, V. Roychowdhury, and L. Vandenberghe. Maximum likelihood estimation of gaussian graphical models: numerical implementation and topology selection. Technical report, UCLA, 2005.
Y. Dai and R. Fletcher. Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numerische Mathematik, 100(1):21–47, 2005.
S. Dasgupta. Learning polytrees. Proc. UAI, pages 134–141, 1999.
D. Dash and M. Druzdzel. A hybrid anytime algorithm for the construction of causal models from sparse data. Conference Uncertainty in Artificial Intelligence, 1999.
A. d’Aspremont, O. Banerjee, and L. El Ghaoui. First-Order Methods for Sparse Covariance Selection. SIAM Journal on Matrix Analysis and Applications, 30(1):56–66, 2008.
I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.
A. Dawid and S. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. Annals of Statistics, 21(3):1272–1317, 1993.
L. de Campos, J. Fernández-Luna, J. Gámez, and J. Puerta. Ant colony optimization for learning Bayesian networks. International Journal of Approximate Reasoning, 31(3):291–311, 2002a.
L. de Campos, J. Fernández-Luna, and J. Puerta. Local search methods for learning Bayesian networks using a modified neighborhood in the space of dags. Ibero-American Conference on Artificial Intelligence, 2002b.
A. Dempster. Covariance selection.
Biometrics, 28(1):157–175, 1972.
A. Dempster, N. Laird, D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1–38, 1977.
A. Deshpande, M. Garofalakis, and M. Jordan. Efficient stepwise selection in decomposable models. Conference Uncertainty in Artificial Intelligence, 2001.
A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. Conference on Very Large Data Bases, 2004.
F. Deutsch and H. Hundal. The rate of convergence of Dykstra’s cyclic projections algorithm: The polyhedral case. Numerical Functional Analysis and Optimization, 15(5-6):537–565, 1994.
A. Dobra and H. Massam. The mode oriented stochastic search (MOSS) algorithm for log-linear models with conjugate priors. Statistical Methodology, 2010. In Press.
A. Dobra, C. Hans, B. Jones, J. Nevins, G. Yao, and M. West. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 90(1):196–212, 2004.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse gaussians. Conference Uncertainty in Artificial Intelligence, 2008a.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the $\ell_1$-ball for learning in high dimensions. International Conference on Machine Learning, 2008b.
D. Duvenaud, D. Eaton, K. Murphy, and M. Schmidt. Causal learning without DAGs. Journal of Machine Learning Research Workshop and Conference Proceedings, 6:177–190, 2010.
R. Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, 1983.
D. Eaton and K. Murphy. Exact Bayesian structure learning from uncertain interventions. Conference on Artificial Intelligence and Statistics, 2007.
D. Edwards.
Introduction to graphical modelling. Springer, 2000.
D. Edwards and T. Havránek. A fast procedure for model search in multidimensional contingency tables. Biometrika, 72(2):339–351, 1985.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
M. Elad, B. Matalon, and M. Zibulevsky. Image denoising with shrinkage and redundant representations. Conference on Computer Vision and Pattern Recognition, 2006.
G. Elidan, M. Ninio, N. Friedman, and D. Schuurmans. Data perturbation for escaping local maxima in learning. National Conference on Artificial Intelligence, 2002.
J. Fan and R. Li. Variable selection for Cox’s proportional hazards model and frailty model. Annals of Statistics, 30(1):74–99, 2002.
J. Fan, Y. Feng, and Y. Wu. Network exploration via the adaptive LASSO and SCAD penalties. Annals of Applied Statistics, 3(2):521–541, 2009.
M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. American Control Conference, 2001.
M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1050–1159, 2003.
M. Figueiredo, R. Nowak, and S. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, 2007.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
N. Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. Conference Uncertainty in Artificial Intelligence, 1996.
N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. Conference Uncertainty in Artificial Intelligence, 1998.
N. Friedman, D. Pe’er, and I. Nachman.
Learning Bayesian network structure from massive datasets: The “sparse candidate” algorithm. Conference Uncertainty in Artificial Intelligence, 1999.
W. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.
E. Gafni and D. Bertsekas. Two-metric projection methods for constrained optimization. SIAM Journal on Control and Optimization, 22(6):936–964, 1982.
A. Gasch, P. Spellman, C. Kao, O. Carmel-Harel, M. Eisen, G. Storz, D. Botstein, and P. Brown. Genomic expression programs in the response of yeast cells to environmental changes. Molecular biology of the cell, 11(12):4241–4257, 2000.
W. B. Gevarter. Automatic probabilistic knowledge acquisition from data. International Conference on Data Engineering, 1987.
P. Gill, W. Murray, and M. Wright. Practical optimization. Academic Press, 1981.
P. Giudici and R. Castelo. Improving Markov chain Monte Carlo model search for data mining. Machine Learning, 50(1-2):127–158, 2003.
P. Giudici and P. Green. Decomposable graphical Gaussian model determination. Biometrika, 86(4):785–801, 1999.
A. Goldstein. Convex programming in Hilbert space. Bulletin of the American Mathematical Society, 70(5):709–710, 1964.
G. Golub and C. Van Loan. Matrix computations. Johns Hopkins University Press, 3rd edition, 1996.
J. Goodman. Exponential Priors for Maximum Entropy Models. Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 2004.
L. Goodman. The analysis of multidimensional contingency tables: Stepwise procedures and direct estimation methods for building models for multiple classifications. Technometrics, 13(1):33–61, 1971.
L. Grippo, F. Lampariello, and S. Lucidi. A nonmonotone line search technique for Newton’s method. SIAM Journal on Numerical Analysis, 23(4):707–716, 1986.
Y. Guo and D. Schuurmans.
Convex structure learning for Bayesian networks: Polynomial feature selection and approximate ordering. Conference Uncertainty in Artificial Intelligence, 2006. M. Gustafsson, M. Hornquist, and A. Lombardi. Large-scale reverse engineering by the lasso. International Conference on Systems Biology, 2003. E. Hale, W. Yin, and Y. Zhang. A fixed-point continuation method for ℓ1-regularized minimization with applications to compressed sensing. Technical report, Rice University, 2007. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009. D. Haughton. On the choice of a model to fit data from an exponential family. Annals of Statistics, 16(1):342–355, 1988. D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995. D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery. Computation, causation, and discovery, pages 141–165, 1999. D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1:49–75, 2001. A. Hoerl and R. Kennard. Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1):69–82, 1970. H. Hofling and R. Tibshirani. Estimation of Sparse Binary Pairwise Markov Networks using Pseudolikelihoods. Journal of Machine Learning Research, 10:883–906, 2009. J. Huang, N. Liu, M. Pourahmadi, and L. Liu. Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93:85–98, 2006. G. Hulten and P. Domingos. Mining complex models from arbitrarily large databases in constant time. International Conference on Knowledge Discovery and Data Mining, 2002. X. Huo and X. Ni. When do stepwise algorithms meet subset selection criteria? Annals of Statistics, 35(2):870–887, 2007. T.
Jaakkola, D. Sontag, A. Globerson, and M. Meila. Learning Bayesian network structure using LP relaxations. Conference on Artificial Intelligence and Statistics, 2010. L. Jacob, G. Obozinski, and J. Vert. Group Lasso with overlap and graph Lasso. ICML, 2009. V. Johnson and J. Albert. Ordinal data modeling. Springer, 1999. A. Kahn. Topological sorting of large networks. Communications of the ACM, 5(11):558–562, 1962. D. Karger and N. Srebro. Learning Markov networks: Maximum bounded tree-width graphs. ACM-SIAM Symposium on Discrete Algorithms, 2001. K. Koh, S. Kim, and S. Boyd. An interior-point method for large-scale l 1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007. M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573, 2004. M. Kolar and E. Xing. Improved Estimation of High-dimensional Ising Models. Technical report, Carnegie Mellon University, 2008. D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006. V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? European Conference on Computer Vision, 2002. A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. Conference Uncertainty in Artificial Intelligence, 2005. V. Krishnamurthy and A. d’Aspremont. A Pathwise Algorithm for Covariance Selection. NIPS Workshop on Optimization for Machine Learning, 2009. B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005. H. Kushner and G. Yin. Stochastic approximation and recursive algorithms and applications. 
Springer, 2003. J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning, 2001. W. Lam and F. Bacchus. Using causal information and local measures to learn Bayesian networks. Conference Uncertainty in Artificial Intelligence, 1993. P. Larrañaga, M. Poza, Y. Yurramendi, R. Murga, and C. Kuijpers. Structure learning of Bayesian networks by genetic algorithms: performance analysis of control parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9):912–926, 1996. S. Lauritzen. Graphical models. Oxford University Press, USA, 1996. S. Lauritzen and T. Richardson. Chain graph models and their causal interpretations. Journal of the Royal Statistical Society: Series B, 64(3):321–361, 2002. H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. Conference on Neural Information Processing Systems, 2006a. S. Lee, V. Ganapathi, and D. Koller. Efficient Structure Learning of Markov Networks using L1 Regularization. Conference on Neural Information Processing Systems, 2006b. S. Lee, H. Lee, P. Abbeel, and A. Ng. Efficient L1 Regularized Logistic Regression. National Conference on Artificial Intelligence, 2006c. E. Levina, A. Rothman, and J. Zhu. Sparse estimation of large covariance matrices via a nested Lasso penalty. Annals of Applied Statistics, 2(1):245–263, 2008. E. Levitin and B. Poliak. Constrained minimization methods. USSR Computational mathematics and mathematical physics, 6:1–50, 1966. English translation of a paper in Zh. Vychisl. Mat. i Mat. Fiz., vol. 6, pp. 787-823, 1965. A. Lewis and M. Overton. Nonsmooth optimization via BFGS. Optimization Online, 2008. F. Li and Y. Yang. Recovering genetic regulatory networks from micro-array data and location analysis data. International Conference on Genome Informatics, 2004. F. Li and Y. Yang.
Using modified lasso regression to learn large undirected graphs in a probabilistic framework. National Conference on Artificial Intelligence, 2005. P. Liang and M. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. International Conference on Machine Learning, 2008. C. Lin, R. Weng, and S. Keerthi. Trust region Newton method for logistic regression. International Conference on Machine Learning, 2007. B. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80(1):221–39, 1988. J. Lindsey and P. Lindsey. Multivariate distributions with correlation matrices for nonlinear repeated measurements. Computational Statistics and Data Analysis, 50(3):720–732, 2006. J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. International Conference on Knowledge Discovery and Data Mining, 2009. Z. Lu. Smooth optimization approach for sparse covariance selection. SIAM Journal on Optimization, 19(4):1807–1827, 2009. Z. Lu. Adaptive first-order methods for general sparse inverse covariance selection. SIAM Journal on Matrix Analysis and Applications (accepted), 2010. D. Madigan and A. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam’s window. Journal of the American Statistical Association, 89(428):1535–1546, 1994. D. Madigan, S. Andersson, M. Perlman, and C. Volinsky. Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs. Communications in Statistics - Theory and Methods, 25(11):2493–2519, 1996. R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. International Conference On Computational Linguistics, 2002. F. Malvestuto. Approximating discrete probability distributions with decomposable models. IEEE Transactions on Systems, Man and Cybernetics, 21(5):1287–1294, 1991. D. Margaritis and S. Thrun. Bayesian network induction via local neighborhoods. Conference on Neural Information Processing Systems, 1999.
B. Marlin, M. Schmidt, and K. Murphy. Group Sparse Priors for Covariance Estimation. Conference Uncertainty in Artificial Intelligence, 2009. M. Meila and M. Jordan. Learning with mixtures of trees. Journal of Machine Learning Research, 1:1–48, 2000. N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436–1462, 2006. T. Minka. Algorithms for maximum-likelihood logistic regression. Technical report, CMU, 2003. B. Moghaddam, B. Marlin, M. Khan, and K. Murphy. Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models. Conference on Neural Information Processing Systems, 2009. A. Moore and W. Wong. Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. International Conference on Machine Learning, 2003. P. Munteanu and M. Bendou. The EQ framework for learning equivalence classes of Bayesian networks. IEEE International Conference on Data Mining, 2001. K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, UC Berkeley, 2002. I. Nachman, G. Elidan, and N. Friedman. “Ideal Parent” structure learning for continuous variable networks. Conference Uncertainty in Artificial Intelligence, 2004. M. Narasimhan and J. Bilmes. PAC-learning bounded tree-width graphical models. Conference Uncertainty in Artificial Intelligence, 2004. Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004. Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Université Catholique de Louvain, 2007. A. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. International Conference on Machine Learning, 2004. J. Nielsen, T. Kocka, and J. Pena. On local optima in learning Bayesian networks. Conference Uncertainty in Artificial Intelligence, 2003. J. Nocedal. Updating quasi-Newton matrices with limited storage.
Mathematics of Computation, 35(151): 773–782, 1980. J. Nocedal and S. Wright. Numerical optimization. Springer, 1999. M. Osborne, B. Presnell, and B. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000. M. Park and T. Hastie. L1 Regularization Path Algorithm for Generalized Linear Models. Journal of the Royal Statistical Society: Series B, 69(4):659–677, 2007. J. Pearl. Causality: Models, reasoning, and inference. Cambridge Univ Press, 2000. S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003. E. Perrier, S. Imoto, and S. Miyano. Finding optimal Bayesian network given a super-structure. Journal of Machine Learning Research, 9:2251–2286, 2008. M. Qazi, G. Fung, S. Krishnan, R. Rosales, H. Steck, R. Rao, D. Poldermans, and D. Chandrasekaran. Automated heart wall motion abnormality detection from ultrasound images using Bayesian networks. IJCAI, 2007. A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for l1,∞ regularization. International Conference on Machine Learning, 2009. M. Raydan. The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7(1):26–33, 1997. M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2):107–136, 2006. T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30(4):962–1030, 2002. J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978. R. Robinson. Counting unlabeled acyclic digraphs. Australian Conference on Combinatorial Mathematics, 1976. S. Rosset. Tracking curved regularized optimization solution paths. Conference on Neural Information Processing Systems, 2004. V. Roth. The generalized LASSO. IEEE Transactions on Neural Networks, 15(1):16–28, 2004. H. Rue and L. 
Held. Gaussian Markov random fields: theory and applications. Chapman & Hall, 2005. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2003. K. Sachs, O. Perez, D. Pe’er, D. Lauffenburger, and G. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005. F. Santosa and W. Symes. Linear inversion of band-limited reflection seismograms. SIAM Journal on Scientific and Statistical Computing, 7(4):1307–1330, 1986. S. Sardy, A. Bruce, and P. Tseng. Block coordinate relaxation methods for nonparametric wavelet denoising. Journal of Computational and Graphical Statistics, 9(2):361–379, 2000. L. Saul, T. Jaakkola, and M. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76, 1996. M. Schmidt and K. Murphy. Modeling Discrete Interventional Data using Directed Cyclic Graphical Models. Conference Uncertainty in Artificial Intelligence, 2009. M. Schmidt and K. Murphy. Convex Structure Learning in Log-Linear Models: Beyond Pairwise Potentials. Conference on Artificial Intelligence and Statistics, 2010. M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for L1 regularization: A comparative study and two new approaches. European Conference on Machine Learning, 2007a. M. Schmidt, A. Niculescu-Mizil, and K. Murphy. Learning graphical model structure using l1-regularization paths. National Conference on Artificial Intelligence, 2007b. M. Schmidt, K. Murphy, G. Fung, and R. Rosales. Structure learning in random fields for heart motion abnormality detection. Conference on Computer Vision and Pattern Recognition, 2008. M. Schmidt, G. Fung, and R. Rosales. Optimization methods for ℓ1-regularization. Technical report, University of British Columbia, 2009a. M. Schmidt, E. van den Berg, M. Friedlander, and K. Murphy. Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm.
Conference on Artificial Intelligence and Statistics, 2009b. G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978. F. Sha and F. Pereira. Shallow parsing with conditional random fields. Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 2003. D. Shahaf, A. Chechetka, and C. Guestrin. Learning thin junction trees via graph cuts. Conference on Artificial Intelligence and Statistics, 2009. D. Shanno and K. Phua. Matrix conditioning and nonlinear optimization. Mathematical Programming, 14 (1):149–160, 1978. T. Shimamura, S. Imoto, R. Yamaguchi, and S. Miyano. Weighted lasso in graphical Gaussian modeling for large gene network estimation based on microarray data. International Conference on Genome informatics, 2007. S. Shimizu, A. Hyvarinen, Y. Kano, and P. Hoyer. Discovery of non-gaussian linear causal models using ICA. Conference Uncertainty in Artificial Intelligence, 2005. M. Singh and M. Valtorta. An algorithm for the construction of Bayesian network structures from data. Conference Uncertainty in Artificial Intelligence, 1993. P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1):62–72, 1991. P. Spirtes and C. Meek. Learning Bayesian networks with discrete variables from data. International Conference on Knowledge Discovery and Data Mining, 1995. N. Srebro. Maximum likelihood bounded tree-width Markov networks. Artificial intelligence, 143(1):123–138, 2003. S. Srinivas, S. Russell, and A. Agogino. Automated construction of sparse Bayesian networks from unstructured probabilistic models and domain information. Conference Uncertainty in Artificial Intelligence, 1990. H. Steck. On the use of skeletons when learning in bayesian networks. Conference Uncertainty in Artificial Intelligence, 2000. J. Suzuki. 
Learning Bayesian belief networks based on the MDL principle: An efficient algorithm using the branch and bound technique. IEICE Transactions on Information and Systems, 82:356–367, 1999. B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. Conference on Neural Information Processing Systems, 2003. B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. International Conference on Machine Learning, 2004. M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. Conference Uncertainty in Artificial Intelligence, 2005. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996. I. Tsamardinos, L. Brown, and C. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine learning, 65(1):31–78, 2006. B. Turlach, W. Venables, and S. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005. E. van den Berg. Convex optimization for generalized sparse recovery. PhD thesis, UBC, 2010. E. van den Berg and M. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2008. E. van den Berg, M. Schmidt, M. Friedlander, and K. Murphy. Group sparsity via linear-time projection. Technical report, University of British Columbia, 2008. T. Verma and J. Pearl. Equivalence and synthesis of causal models. Conference Uncertainty in Artificial Intelligence, 1990. D. Vidaurre, C. Bielza, and P. Larranaga. Learning a L1-regularized Gaussian Bayesian network in the space of equivalence classes. IEEE Transactions on Systems, Man and Cybernetics: Part B, 2010. J. von Neumann. Functional Operators, vol. II, The Geometry of Orthogonal Spaces, volume 22 of Annals of Mathematical Studies. Princeton University Press, 1950. This is a reprint of notes first distributed in 1933-34. M. Wainwright.
Estimating the “Wrong” Graphical Model: Benefits in the Computation-Limited Setting. Journal of Machine Learning Research, 7:1829–1859, 2006. M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008. M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. Conference Uncertainty in Artificial Intelligence, 2002. M. Wainwright, P. Ravikumar, and J. Lafferty. High-Dimensional Graphical Model Selection Using ℓ1-Regularized Logistic Regression. Conference on Neural Information Processing Systems, 2006. H. Wallach. Efficient training of conditional random fields. Master’s thesis, University of Edinburgh, 2002. N. Wermuth. Model search among multiplicative models. Biometrics, 32(2):253–263, 1976. J. Whittaker. Graphical models in applied multivariate analysis. John Wiley and Sons, 1990. P. Williams. Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1):117–143, 1995. D. Wipf and S. Nagarajan. Sparse Estimation Using General Likelihoods and Non-Factorial Priors. Conference on Neural Information Processing Systems, 2009. S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009. Y. Xiang, S. Wong, and N. Cercone. A “microscopic” study of minimum entropy search in learning decomposable Markov networks. Machine Learning, 26(1):65–92, 1997. L. Younes. Parameter estimation for imperfectly observed Gibbsian fields. Probability Theory and Related Fields, 82:625–645, 1989. M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007. X. Yuan. Alternating direction methods for sparse covariance selection. Technical report, Hong Kong Baptist University, 2009. P. Zhao and B. Yu. On model selection consistency of Lasso.
Journal of Machine Learning Research, 7:2541–2563, 2006. P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468–3497, 2009. H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006. H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509–1533, 2008. H. Zou, T. Hastie, and R. Tibshirani. On the “degrees of freedom” of the lasso. Annals of Statistics, 35(5):2173–2192, 2007.

Appendix A Data Structures for Checking Acyclicity

The following material is needed for the fast implementation of acyclicity checks in the DAG-search method used in Chapter 4. Giudici and Castelo [2003] propose using an ancestor matrix data structure to efficiently test whether local moves preserve acyclicity, and they give procedures for updating the ancestor matrix. However, the authors do not give a procedure for constructing the ancestor matrix from a given graph, and their analysis of the runtimes of the updates is not correct. In this appendix, we give an efficient procedure for building an ancestor matrix given a DAG, review the ancestor matrix update rules and their runtimes, present several special cases that lead to faster updates of the ancestor matrix, and present the reversal witness matrix data structure that allows us to quickly check whether reversals will introduce a cycle.

A.1 Ancestor Matrix

The ancestor matrix for a DAG with $n$ nodes is an $n$ by $n$ binary matrix, which we denote by $A$. We set element $A_{ij}$ of the matrix to 1 if $i$ is an ancestor of $j$ in the graph, meaning that there is a directed path from $i$ to $j$. Otherwise, we set $A_{ij}$ to 0 (by convention, we set $A_{ii} = 0$).
In this section we express the runtimes of all operations in terms of $n$, since we may need to use the data structure on a maximally connected directed acyclic graph.

Given the ancestor matrix for a directed acyclic graph, it is trivial to test whether adding a new edge will introduce a cycle. To see this, note that the graph is acyclic before introducing the new edge, so if a cycle is introduced the new edge must be part of the cycle. We can thus test whether a new edge from $i$ to $j$ introduces a cycle by simply testing whether $j$ is an ancestor of $i$:

Input: Ancestor matrix $A$, edge $(i, j)$ to test.
Output: Returns 1 if adding $(i, j)$ will cause a cycle.
  return $A_{ji}$;
Algorithm 13: Using an ancestor matrix to test whether an addition preserves acyclicity.

If we want to test whether each of the $O(n^2)$ possible edges will introduce a cycle, with the ancestor matrix we can do this in $O(n^2)$. This is substantially more efficient than the $O(n^4)$ cost of naively checking each single-edge-augmented graph independently with an $O(n^2)$ search.

Above, we assume that the ancestor matrix is given. However, it will only lead to a net computational gain if we can efficiently construct it from a given graph, and efficiently update it after single edge changes. Below, we make use of topological sorting to give an efficient procedure for constructing an ancestor matrix.

Input: Graph $G$.
Output: A valid ancestor matrix $A$ for $G$.
  initialize all elements of $A$ to zero;
  find a topological ordering of $G$;
  foreach node $c$ in order do
    foreach parent $p$ of $c$ do
      $A_{pc} \leftarrow 1$;
      foreach ancestor $a$ of $p$ do
        $A_{ac} \leftarrow 1$;
Algorithm 14: Constructing an ancestor matrix.

The topological sort can be done in $O(n^2)$ [Kahn, 1962, Cormen et al., 2001, §22.4], while constructing the ancestor matrix by computing the ancestors of each node in topological order requires at most $O(n^3)$ (it is possible that the total cost could be reduced to $O(n^2)$, the size of the structure).
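As a minimal sketch of Algorithms 13 and 14, the following Python code builds an ancestor matrix from a DAG and uses it to test an edge addition. The representation (a list `parents` where `parents[c]` is the set of parents of node `c`) and the function names are assumptions made for illustration; they are not taken from the thesis.

```python
from collections import deque


def topological_order(parents):
    """Kahn-style topological sort; parents[c] is the set of parents of node c."""
    n = len(parents)
    children = [[] for _ in range(n)]
    indegree = [len(parents[c]) for c in range(n)]
    for c in range(n):
        for p in parents[c]:
            children[p].append(c)
    queue = deque(v for v in range(n) if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    if len(order) != n:
        raise ValueError("graph contains a cycle")
    return order


def build_ancestor_matrix(parents):
    """Algorithm 14: A[i][j] == 1 iff there is a directed path from i to j."""
    n = len(parents)
    A = [[0] * n for _ in range(n)]
    for c in topological_order(parents):
        for p in parents[c]:
            A[p][c] = 1
            # Column p is already complete because p precedes c in the order.
            for a in range(n):
                if A[a][p]:
                    A[a][c] = 1
    return A


def addition_creates_cycle(A, i, j):
    """Algorithm 13: adding (i, j) creates a cycle iff j is an ancestor of i.

    Assumes i != j, since A[i][i] is 0 by convention.
    """
    return A[j][i] == 1
```

For example, with edges 0→1, 1→2 and 0→3, adding the edge (2, 0) would close a cycle while adding (3, 2) would not.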
Below, we give the procedure for updating the ancestor matrix after a single edge addition.

Input: Ancestor matrix $A$ and edge $(p, c)$ to add.
Output: A valid ancestor matrix $A$ with $(p, c)$ added.
  if $A_{pc} = 1$ then return; // fast update, p was already an ancestor of c
  $A_{pc} \leftarrow 1$; // p is now an ancestor of c
  foreach descendant $d$ of $c$ do
    $A_{pd} \leftarrow 1$; // p is now an ancestor of all descendants of c
  foreach ancestor $a$ of $p$ do
    $A_{ac} \leftarrow 1$; // ancestors of p are now ancestors of c
    foreach descendant $d$ of $c$ do
      $A_{ad} \leftarrow 1$; // ancestors of p are now ancestors of descendants of c
Algorithm 15: Updating an ancestor matrix after an addition.

Just as we find all ancestors of a node by looking at the corresponding column of the ancestor matrix, we can find all descendants of a node by looking at the corresponding row. The above procedure requires $O(n^2)$ in the worst case, due to the need to update the up to $O(n^2)$ members of the product of ancestors and descendants of the two nodes (the runtime was incorrectly stated as $O(n)$ by Giudici and Castelo [2003]). Note the line marked as fast update; if $p$ is already an ancestor of $c$ when we add the edge $(p, c)$, then no new ancestor relations can arise and the update only costs $O(1)$. Also note that if we constructed the ancestor matrix for a given graph by repeatedly applying the addition algorithm (starting from an empty graph), this would require $O(n^4)$. Thus, our topological ordering construction algorithm is more efficient than this naive method.

Next, we give a procedure for updating the ancestor matrix after deleting an edge. The method below is called after an edge $(p, c)$ has just been removed from the graph $G$:

Input: Graph $G$, ancestor matrix $A$, and edge $(p, c)$ to delete.
Output: A valid ancestor matrix $A$ with $(p, c)$ deleted.
  foreach parent $p^*$ of $c$ do
    if $A_{pp^*} = 1$ then return; // fast update, c is still a descendant of p
  find a topological ordering of $G$;
  foreach node $j$ in order starting from $c$ do
    $A_{kj} \leftarrow 0, \forall k$; // clear ancestors of j
  foreach node $j$ in order starting from $c$ do
    foreach parent $i$ of $j$ do
      $A_{ij} \leftarrow 1$;
      foreach ancestor $a$ of $i$ do
        $A_{aj} \leftarrow 1$;
Algorithm 16: Updating an ancestor matrix after a deletion.

In the worst case, the above procedure will cost $O(n^3)$ since we may need to rebuild the ancestor matrix for most of the graph. However, the update will only require up to $O(n)$ in the case marked fast update. In this case, $p$ remains an ancestor of $c$ after deleting $(p, c)$ so the ancestor relationships do not change. To update the ancestor matrix after a reversal, we call the above deletion procedure followed by the addition procedure.

A.2 Reversal Witness Matrix

By similar reasoning to the addition case, reversing an existing edge $(i, j)$ in a directed acyclic graph will introduce a cycle if and only if $i$ is an ancestor of some ancestor of $j$. In other words, if some descendant of $i$ is an ancestor of $j$, then reversing the edge from $i$ to $j$ will introduce a path from $j$ to itself via the newly reversed edge. We make this more formal below.

Input: Ancestor matrix $A$, edge $(i, j)$ to test.
Output: Returns 1 if reversing $(i, j)$ will cause a cycle.
  foreach ancestor $a$ of $j$ do
    if $A_{ia} = 1$ then return 1;
  return 0;
Algorithm 17: Using an ancestor matrix to test whether a reversal preserves acyclicity.

Since we have the current ancestor matrix available, we can easily find the up to $O(n)$ ancestors of $j$. Since we do a constant amount of work for each ancestor, the procedure checks whether reversing $(i, j)$ will introduce a cycle in at most $O(n)$. Consequently, we can check all $O(n^2)$ edges in $O(n^3)$.
We now outline a data structure that we refer to as a reversal witness matrix. In this context, we say that a witness exists for an edge $(i, j)$ if some descendant of $i$ is an ancestor of $j$. Thus, reversing an edge will cause a cycle if and only if a witness exists. The reversal witness matrix $R$ is simply an $n$ by $n$ sparse binary matrix whose element $R_{ij}$ is set to 1 if the edge from $i$ to $j$ exists and has a witness. We consider the following simple procedure for constructing the reversal witness matrix given a graph. It assumes that the above procedure for testing a reversal given the ancestor matrix is available, and can be called to test an edge $(i, j)$ with the interface testReversal($A$, $i$, $j$).

Input: Graph $G$ and ancestor matrix $A$.
Output: A valid reversal witness matrix $R$ for $G$.
  initialize all elements of $R$ to zero;
  foreach edge $(i, j)$ in $G$ do
    $R_{ij} \leftarrow$ testReversal($A$, $i$, $j$);
Algorithm 18: Constructing a reversal witness matrix.

Above, we can construct the reversal witness matrix in $O(n^3)$ by simply using the ancestor matrix to check whether each of the $O(n^2)$ edges can be reversed. We now give a procedure for updating the reversal witness matrix for a single edge $(p, c)$ after the addition of the edge $(i, j)$.

Input: Reversal witness matrix $R$, ancestor matrix $A$, an edge $(p, c)$ to update after adding edge $(i, j)$.
Output: A reversal witness matrix $R$ whose entry for $(p, c)$ is valid after the addition.
  if $R_{pc} = 1$ then return; // fast update, this edge already has a witness
  if ($p$ is not $i$ or an ancestor of $i$) and ($c$ is not $j$ or a descendant of $j$) then return; // fast update, p has no new descendants, c has no new ancestors
  $R_{pc} \leftarrow$ testReversal($A$, $p$, $c$);
Algorithm 19: Updating a reversal witness matrix after an addition.

In the above, the fast updates require $O(1)$ while the slow updates require $O(n)$. Thus, this data structure is more efficient than using the ancestor matrix alone whenever a fast update is performed. We update the reversal witness matrix after a deletion by rebuilding it.
Appendix B Projection onto Norm Cones

When applied to group $\ell_1$-regularization problems, the SPG and PQN methods discussed in Chapter 3 employ the operation of projecting onto a norm cone. That is, for a given $x_0$ and $g_0$ they compute the projection

$$P_{C_p}(x_0, g_0) = \arg\min_{\|x\|_p \le g} \left\| \begin{bmatrix} x \\ g \end{bmatrix} - \begin{bmatrix} x_0 \\ g_0 \end{bmatrix} \right\|_2$$

for the given norm $\|\cdot\|_p$. By non-negativity of norms, we can equivalently solve

$$\arg\min_{x,g} \; \frac{1}{2}\|x - x_0\|_2^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } \|x\|_p \le g. \tag{B.1}$$

In this appendix, we give simple algorithms for solving (B.1) for different norms examined in this work.

B.1 Scalar Norm

We first consider the one-dimensional case where we simply have a scalar $x$. In this case, problem (B.1) can be written as

$$\arg\min_{x,g} \; \frac{1}{2}(x - x_0)^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } |x| \le g. \tag{B.2}$$

In this case, the projection onto the scalar norm cone $C_a$ is given by

$$P_{C_a}(x_0, g_0) = \begin{cases} (x_0, g_0), & \text{if } |x_0| \le g_0, \\ \left(\mathrm{sign}(x_0)\frac{|x_0| + g_0}{2}, \frac{|x_0| + g_0}{2}\right), & \text{if } |x_0| > g_0, \; |x_0| + g_0 > 0, \\ (0, 0), & \text{if } |x_0| > g_0, \; |x_0| + g_0 \le 0. \end{cases} \tag{B.3}$$

Proof. If $|x_0| \le g_0$, then the first case follows because $x = x_0$ and $g = g_0$ satisfy the constraints and achieve the minimum possible objective value of zero in (B.2). Thus, it remains to show the other two cases and below we assume that $|x_0| > g_0$.

First, note that $|x| \ge 0$ implies that in a solution $(x^*, g^*)$ we must have that $g^* \ge 0$. Further, $x^*$ cannot have the opposite sign to $x_0$: $x^* x_0 \ge 0$. To show this, assume that $x^* x_0 < 0$. Then

$$\begin{aligned} \frac{1}{2}(x^* - x_0)^2 + \frac{1}{2}(g^* - g_0)^2 &= \frac{1}{2}(x^*)^2 - x^* x_0 + \frac{1}{2}x_0^2 + \frac{1}{2}(g^* - g_0)^2 \\ &> \frac{1}{2}(x^*)^2 + \frac{1}{2}x_0^2 + \frac{1}{2}(g^* - g_0)^2 \\ &> \frac{1}{2}x_0^2 + \frac{1}{2}(g^* - g_0)^2 \\ &= \frac{1}{2}(0 - x_0)^2 + \frac{1}{2}(g^* - g_0)^2. \end{aligned}$$

This would imply that $(0, g^*)$ achieves a lower objective value than $(x^*, g^*)$, and since $g^* \ge 0$ makes $(0, g^*)$ feasible we obtain a contradiction. We can similarly show that $|x^*| \le |x_0|$, since if $|x^*| > |x_0|$ and $g^* \ge |x^*|$ then $(x_0, g^*)$ would achieve a lower objective value while remaining feasible.
Further, using that $|x^*| \le |x_0|$ and $g_0 < |x_0|$ we can similarly show that $g^* \le |x_0|$, since if $g^* > |x_0|$ then $(x^*, |x_0|)$ would achieve a lower objective (while remaining feasible).

We now establish the second two cases of (B.3) under the assumption that $x_0 \ge 0$. Since we know $x^* x_0 \ge 0$ this implies $x^* \ge 0$ and we can re-write (B.2) as

$$\arg\min_{x,g} \; \frac{1}{2}(x - x_0)^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } 0 \le x \le g.$$

Ignoring the trivial case where $x_0 \le g_0$, in a solution of this problem it must be the case that $x^* = g^*$. To see this, assume that $g^* > x^*$. Then we can increase $x^*$ to improve the objective function since we know that $g^* \le x_0$. We use that $x^* = g^*$ to eliminate $x$ and obtain the simple problem

$$\arg\min_{g} \; \frac{1}{2}(g - x_0)^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } g \ge 0.$$

Introducing a Lagrange multiplier $\mu \ge 0$ for the inequality constraint, the Lagrangian of this problem is

$$\frac{1}{2}(g - x_0)^2 + \frac{1}{2}(g - g_0)^2 - \mu g.$$

Setting the derivative of the Lagrangian to zero we have that

$$0 = g - x_0 + g - g_0 - \mu.$$

From this we obtain that

$$g^* = \frac{x_0 + g_0 + \mu}{2},$$

for some $\mu \ge 0$. By complementary slackness we must have $g^* \mu = 0$. If $g^* = 0$, this implies that $\mu = -x_0 - g_0$, which can only be non-negative if $x_0 + g_0 \le 0$. This establishes the third case of (B.3) (when $x_0 \ge 0$). Otherwise we have $\mu = 0$ and

$$g^* = \frac{x_0 + g_0}{2}.$$

We can use a similar argument to show that we obtain the same result but with $x_0$ replaced by $|x_0|$ and $x^* = -g^*$ when we have $x_0 < 0$.

B.2 $\ell_2$ Norm

We next consider projecting onto the Euclidean norm cone. In this case, we can write problem (B.1) as

$$\arg\min_{x,g} \; \frac{1}{2}\|x - x_0\|_2^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } \|x\|_2 \le g. \tag{B.4}$$

The projection onto the Euclidean norm cone $C_2$ is given by [Boyd and Vandenberghe, 2004, Exercise 8.3(c)]

$$P_{C_2}(x_0, g_0) = \begin{cases} (x_0, g_0), & \text{if } \|x_0\|_2 \le g_0, \\ \left(\frac{x_0}{\|x_0\|_2}\frac{\|x_0\|_2 + g_0}{2}, \frac{\|x_0\|_2 + g_0}{2}\right), & \text{if } \|x_0\|_2 > g_0, \; \|x_0\|_2 + g_0 > 0, \\ (0, 0), & \text{if } \|x_0\|_2 > g_0, \; \|x_0\|_2 + g_0 \le 0. \end{cases} \tag{B.5}$$

Proof.
We first establish that in an optimal solution $(x^*, g^*)$ of (B.4), $x^*$ is in the same direction as $x_0$. To do this, assume we can write $x^* = \alpha x_0 + y$, where $\alpha$ is a scalar and $y$ is a non-zero vector containing the part of $x^*$ that is orthogonal to $x_0$. Then we have that

\begin{align*}
\frac{1}{2}\|x^* - x_0\|_2^2 + \frac{1}{2}(g^* - g_0)^2 &= \frac{1}{2}\|(\alpha x_0 + y) - x_0\|_2^2 + \frac{1}{2}(g^* - g_0)^2 \\
&= \frac{1}{2}\|(\alpha - 1)x_0 + y\|_2^2 + \frac{1}{2}(g^* - g_0)^2 \\
&= \frac{1}{2}|\alpha - 1|^2\|x_0\|_2^2 + (\alpha - 1)x_0^T y + \frac{1}{2}\|y\|_2^2 + \frac{1}{2}(g^* - g_0)^2 \\
&= \frac{1}{2}|\alpha - 1|^2\|x_0\|_2^2 + \frac{1}{2}\|y\|_2^2 + \frac{1}{2}(g^* - g_0)^2 \\
&> \frac{1}{2}|\alpha - 1|^2\|x_0\|_2^2 + \frac{1}{2}(g^* - g_0)^2 \\
&= \frac{1}{2}\|(\alpha - 1)x_0\|_2^2 + \frac{1}{2}(g^* - g_0)^2 \\
&= \frac{1}{2}\|\alpha x_0 - x_0\|_2^2 + \frac{1}{2}(g^* - g_0)^2.
\end{align*}

Since $y \ne 0$, this implies that $(\alpha x_0, g^*)$ achieves a lower objective value than $(x^*, g^*)$, while it is feasible due to the feasibility of $(\alpha x_0 + y, g^*)$ and the orthogonality of $x_0$ and $y$ (which gives $\|\alpha x_0\|_2 \le \|\alpha x_0 + y\|_2$). This establishes that $y$ must be zero and that $x^* = \alpha x_0$ for some $\alpha$. By a similar argument to the scalar case we can show that $x_i^* (x_0)_i \ge 0$ for all $i$, so we have that $\alpha \ge 0$.

We next review a basic identity: if $x = \alpha x_0$ for some scalar $\alpha \ge 0$, then

\begin{align*}
\|x - x_0\|_2^2 &= \|\alpha x_0 - x_0\|_2^2 \\
&= \alpha^2 x_0^T x_0 - 2\alpha x_0^T x_0 + x_0^T x_0 \\
&= (\alpha\|x_0\|_2 - \|x_0\|_2)^2 \\
&= (\|x\|_2 - \|x_0\|_2)^2.
\end{align*}

We can use this identity to re-write (B.4) as

\[
\arg\min_{x,g} \frac{1}{2}(\|x\|_2 - \|x_0\|_2)^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } \|x\|_2 \le g.
\]

Except in the trivial first case of (B.5), by similar reasoning to the scalar case we will have that $g^* = \|x^*\|_2$ in the solution of this problem. Thus, we can eliminate $\|x\|_2$ to give the much simpler problem

\[
\arg\min_{g} \frac{1}{2}(g - \|x_0\|_2)^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } g \ge 0.
\]

This is identical to the scalar case for $x_0 \ge 0$ but with $x_0$ replaced by the non-negative scalar $\|x_0\|_2$. We can thus derive the optimal $g^*$ in (B.5) from the same argument used in the previous proof. In the case where $g^* > 0$, the constraint that $\|x^*\|_2 = g^*$ along with knowing that $x^*$ is in the direction of $x_0$ imply that $x^* = (x_0/\|x_0\|_2)g^*$.
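The closed forms (B.3) and (B.5) translate directly into code. The following is a minimal Python sketch, assuming vectors are plain lists of floats; the function names `project_l2_cone` and `project_scalar_cone` are illustrative and do not come from the text. The scalar cone is treated as the one-dimensional instance of (B.5), where $x_0/\|x_0\|_2$ reduces to $\mathrm{sign}(x_0)$.

```python
import math


def project_l2_cone(x0, g0):
    """Project (x0, g0) onto {(x, g): ||x||_2 <= g}, following the cases of (B.5).

    x0 is a list of floats and g0 is a float (illustrative interface).
    """
    norm = math.sqrt(sum(v * v for v in x0))
    if norm <= g0:
        # First case: the input already satisfies the constraint.
        return list(x0), g0
    if norm + g0 <= 0:
        # Third case: the projection is the origin.
        return [0.0] * len(x0), 0.0
    # Second case: g* = (||x0||_2 + g0) / 2 and x* = (x0 / ||x0||_2) g*.
    # Here norm > 0, because norm == 0 with g0 < 0 falls into the third case.
    g = (norm + g0) / 2.0
    scale = g / norm
    return [scale * v for v in x0], g


def project_scalar_cone(x0, g0):
    """Scalar case (B.3), computed as the one-dimensional instance of (B.5)."""
    x, g = project_l2_cone([x0], g0)
    return x[0], g
```

For example, projecting `([3.0, 4.0], 1.0)` falls into the second case with $\|x_0\|_2 = 5$, giving $g^* = 3$ and $x^* \approx (1.8, 2.4)$.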
B.3 $\ell_\infty$ Norm

We next consider projecting onto the $\ell_\infty$ norm cone. We first concentrate on the case where $x_0$ is a 2-vector with $(x_0)_1 \ge (x_0)_2 \ge 0$. In this case, we can write problem (B.1) as

\[
\arg\min_{x_1,x_2,g} \frac{1}{2}(x_1 - (x_0)_1)^2 + \frac{1}{2}(x_2 - (x_0)_2)^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } g \ge x_1 \ge 0,\ g \ge x_2 \ge 0. \tag{B.6}
\]

The solution of this special case of projecting onto the $\ell_\infty$ norm cone is given by

\[
P_{C_\infty}((x_0)_1, (x_0)_2, g_0) =
\begin{cases}
((x_0)_1, (x_0)_2, g_0), & \text{if } \|x_0\|_\infty \le g_0, \\
\left(\frac{(x_0)_1 + g_0}{2}, (x_0)_2, \frac{(x_0)_1 + g_0}{2}\right), & \text{if } \|x_0\|_\infty > g_0,\ \frac{(x_0)_1 + g_0}{2} > (x_0)_2, \\
\left(\frac{(x_0)_1 + (x_0)_2 + g_0}{3}, \frac{(x_0)_1 + (x_0)_2 + g_0}{3}, \frac{(x_0)_1 + (x_0)_2 + g_0}{3}\right), & \text{if } \|x_0\|_\infty > g_0,\ \frac{(x_0)_1 + g_0}{2} \le (x_0)_2,\ \frac{(x_0)_1 + (x_0)_2 + g_0}{3} > 0, \\
(0, 0, 0), & \text{if } \|x_0\|_\infty > g_0,\ \frac{(x_0)_1 + g_0}{2} \le (x_0)_2,\ \frac{(x_0)_1 + (x_0)_2 + g_0}{3} \le 0.
\end{cases}
\tag{B.8}
\]

Proof. We start by noting that if the inputs satisfy the constraints then we once again simply return the inputs in the first case. If this is not the case, then a similar argument to the scalar case shows that in an optimal solution $(x_1^*, x_2^*, g^*)$ it must be the case that $x_1^* = g^*$. Subsequently, we can (as before) eliminate $x_1$ from (B.6) to give the problem

\[
\arg\min_{x_2,g} \frac{1}{2}(g - (x_0)_1)^2 + \frac{1}{2}(x_2 - (x_0)_2)^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } g \ge 0,\ g \ge x_2 \ge 0.
\]

Since $(x_0)_2 \ge 0$ and we require $g \ge x_2$, the constraint $x_2 \ge 0$ will be satisfied at a solution even if it is not included explicitly, so we remove it and write the Lagrangian for this problem as

\[
\frac{1}{2}(g - (x_0)_1)^2 + \frac{1}{2}(x_2 - (x_0)_2)^2 + \frac{1}{2}(g - g_0)^2 - \mu_1 g + \mu_2(x_2 - g).
\]

At a solution we require that the gradient of the Lagrangian with respect to both $g$ and $x_2$ is equal to zero:

\begin{align*}
0 &= (g - (x_0)_1 + g - g_0 - \mu_1) - \mu_2, \\
0 &= x_2 - (x_0)_2 + \mu_2.
\end{align*}

Note that the first term in the first equation is the gradient of the Lagrangian for the problem of projecting onto the scalar norm cone. If we use $(\tilde{x}_1, \tilde{g})$ to denote the result of projecting $((x_0)_1, g_0)$ onto the scalar norm cone, then the first term in the first equation is zero at $(\tilde{x}_1, x_2, \tilde{g})$ for any $x_2$.
Thus, if it happens to be the case that $\tilde{g} > (x_0)_2$, then all constraints are satisfied and complementary slackness implies that $\mu_2 = 0$, so the solution to the problem is $(\tilde{x}_1, (x_0)_2, \tilde{g})$. This establishes the second case of (B.8). If $\tilde{g} \le (x_0)_2$, then we can show that $x_1^* = x_2^* = g^*$. Thus, we can eliminate both $x_1$ and $x_2$ and write the optimization as a bound-constrained optimization in $g$. Solving this problem as we did in the scalar case yields the third and fourth cases of (B.8).

Although the result above only applies to a very restricted scenario, we can generalize it to compute the general $\ell_\infty$ norm cone projection. In particular, we can clearly remove the restriction $(x_0)_1 \ge (x_0)_2$ by sorting the elements of $x_0$ before projecting. We can further remove the constraint that $(x_0)_1$ and $(x_0)_2$ are non-negative by projecting their absolute values and then assigning the appropriate signs to the results. Finally, we can use an inductive argument to generalize the result to arbitrary $p$-vectors. Below, we give pseudo-code for a general method that requires $O(p \log p)$ time (due to the need to sort $x_0$).

    Input: Scalar g and p-vector x
    if g ≥ ||x||∞ then
        return;  // input value satisfies constraints
    sorted ← {sort(|x|), 0};  // sort absolute values in descending order, append zero
    s ← 0;
    for k ← 1 to p do
        s ← s + sorted(k);
        α ← (s + g)/(k + 1);  // trial value for g
        if α > 0 and α ≥ sorted(k + 1) then
            for i ← 1 to p do
                x_i ← sign(x_i) min{|x_i|, α};  // threshold values
            g ← α;
            return;
    x ← 0; g ← 0;

Algorithm 20: Projection onto $\ell_\infty$ norm cone.

B.4 $\ell_1$ Norm

We now turn to the task of projecting onto the $\ell_1$ norm cone. We first concentrate on the case where $x_0$ only contains non-negative elements. In this case, we can write problem (B.1) as

\[
\arg\min_{x,g} \frac{1}{2}\|x - x_0\|_2^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } \sum_{i=1}^{p} x_i \le g,\ x \ge 0. \tag{B.9}
\]

The solution of this special case of projecting onto the $\ell_1$ norm cone is given by

\[
P_{C_1}(x_0, g_0) = (\max\{0, x_0 - \theta\},\ g_0 + \theta), \tag{B.10}
\]

where the max operation is done element-wise and where $\theta \ge 0$ is the minimum (scalar) value such that the constraints are satisfied.

Proof. The Lagrangian for problem (B.9) is

\[
\frac{1}{2}\|x - x_0\|_2^2 + \frac{1}{2}(g - g_0)^2 + \theta\left(\sum_{i=1}^{p} x_i - g\right) - y^T x,
\]

with a scalar Lagrange multiplier $\theta$ for the sum constraint and a vector of Lagrange multipliers $y$ for the non-negativity constraints. Setting the gradient of the Lagrangian with respect to $g$ to zero we obtain

\[
0 = g - g_0 - \theta.
\]

From this we obtain that the optimal $g^*$ has the form

\[
g^* = g_0 + \theta. \tag{B.11}
\]

Setting the gradient of the Lagrangian with respect to $x$ to zero we obtain

\[
0 = x - x_0 + \theta - y.
\]

From this we obtain that the optimal $x^*$ has the form

\[
x^* = x_0 - \theta + y.
\]

By complementary slackness, we have that $x_i^* y_i = 0$ for all $i$. If $y_i = 0$ then we have $x_i^* = (x_0)_i - \theta$. Similarly, if $x_i^* = 0$ we have that $y_i = -(x_0)_i + \theta$. Since we require $y_i \ge 0$, we see that $x_i^*$ can be zero only if $(x_0)_i - \theta \le 0$. Combining both cases, we have that $x_i^*$ has the form

\[
x_i^* = \max\{0, (x_0)_i - \theta\}. \tag{B.12}
\]

We have now established the form of (B.10), and it remains to show that we must find the minimum $\theta \ge 0$. Using (B.11) and (B.12) to eliminate $x$ and $g$ in (B.9), we obtain

\[
\arg\min_{\theta} \frac{1}{2}\|\max\{0, x_0 - \theta\} - x_0\|_2^2 + \frac{1}{2}(g_0 + \theta - g_0)^2, \quad \text{s.t. } \sum_{i=1}^{p} \max\{0, (x_0)_i - \theta\} \le g_0 + \theta,\ \theta \ge 0.
\]

We can simplify the objective function in this expression to give

\[
\arg\min_{\theta} \frac{1}{2}\|\min\{x_0, \theta\}\|_2^2 + \frac{1}{2}\theta^2, \quad \text{s.t. } \sum_{i=1}^{p} \max\{0, (x_0)_i - \theta\} \le g_0 + \theta,\ \theta \ge 0,
\]

where the min operation is done element-wise. We see that for $\theta \ge 0$ the first term in this objective function is monotonically non-decreasing in $\theta$, while the second term is strictly monotonically increasing in $\theta$. Thus, the constrained minimizer of this objective function is the minimum $\theta \ge 0$ satisfying the constraints.
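The characterization of $\theta$ in (B.10) can be checked numerically. Because the left side of the sum constraint is non-increasing in $\theta$ and the right side is increasing, the feasible values of $\theta$ form an interval $[\theta^*, \infty)$, so a bisection search recovers the minimum. The sketch below is only an illustration under the assumptions of (B.9), namely a non-empty, element-wise non-negative $x_0$; the function name and the iteration count are hypothetical choices, and this is not the sort-based method of the text.

```python
def project_l1_cone_nonneg(x0, g0, iters=100):
    """Sketch of (B.10) for a non-empty list x0 >= 0: bisection on theta.

    Returns (max{0, x0 - theta}, g0 + theta) for (approximately) the minimum
    theta >= 0 satisfying sum_i max{0, (x0)_i - theta} <= g0 + theta.
    """
    def feasible(theta):
        return sum(max(0.0, v - theta) for v in x0) <= g0 + theta

    if feasible(0.0):
        # The input already satisfies the constraints, so theta = 0.
        theta = 0.0
    else:
        # hi is always feasible: every max{0, .} term vanishes and g0 + hi >= 0.
        lo, hi = 0.0, max(max(x0), -g0)
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if feasible(mid):
                hi = mid
            else:
                lo = mid
        theta = hi
    return [max(0.0, v - theta) for v in x0], g0 + theta
```

For instance, with $x_0 = (3, 1)$ and $g_0 = 0$ the constraint on $[1, 3]$ reads $3 - \theta \le \theta$, so the search settles at $\theta = 1.5$ and returns $x^* = (1.5, 0)$, $g^* = 1.5$.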
As before, we can extend (B.10) to allow negative elements of $x$ by multiplying the result of projecting the absolute values by the signs of the corresponding input elements. Below, we give pseudo-code for a general method that computes the projection onto the $\ell_1$ norm cone.

    Input: Scalar g and p-vector x
    if g ≥ ||x||1 then
        return;  // input value satisfies constraints
    if g ≤ −||x||∞ then
        x ← 0; g ← 0; return;  // projection is the origin
    sorted ← {0, sort(|x|)};  // sort absolute values in ascending order, prepend zero
    for k ← 1 to p do
        θ ← sorted(k + 1);
        if g + θ ≥ Σ_{i=1..p} max{0, |x_i| − θ} then
            break;
    θ ← sorted(k) + (Σ_{i=1..p} max{0, |x_i| − sorted(k)} − (g + sorted(k)))/(p − k + 2);
    g ← g + θ;
    x ← sign(x) max{0, |x| − θ};

Algorithm 21: Projection onto $\ell_1$ norm cone.

By replacing the for loop with a binary search for $k$, the implementation above can be modified to run in $O(p \log p)$ time. Similar to [Duchi et al., 2008b], this can be further reduced to $O(p)$ by using a linear-time median-finding algorithm rather than sorting.

B.5 Nuclear Norm

We finally consider the case of projecting onto the nuclear norm cone. In this case, we can write problem (B.1) for an input matrix $X_0$ as

\[
\arg\min_{X,g} \frac{1}{2}\|X - X_0\|_F^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } \|X\|_\sigma \le g. \tag{B.13}
\]

Here, we use $\|X\|_\sigma$ to denote the nuclear norm of $X$, the sum of the singular values of the matrix $X$. We denote the singular value decomposition of $X_0$ by $X_0 = U_0 \Sigma_0 V_0^T$, where $\Sigma_0$ is a diagonal matrix containing the singular values $\sigma_0$. Using this notation, the solution of (B.13) is given by

\[
P_{C_\sigma}(X_0, g_0) = (U_0 \tilde{\Sigma} V_0^T,\ \tilde{g}), \tag{B.14}
\]

where $\tilde{\Sigma}$ is a diagonal matrix with elements $\tilde{\sigma}$, and $(\tilde{\sigma}, \tilde{g})$ is the result of projecting $(\sigma_0, g_0)$ onto the $\ell_1$ norm cone.

Proof. We first establish that in an optimal solution $(X^*, g^*)$, $X^*$ must have the same singular vectors as $X_0$ (for all non-zero singular values).
To do this we note that solving (B.13) is equivalent to minimizing the Lagrangian for some $\mu \ge 0$:

\[
\min_{X,g} \frac{1}{2}\|X - X_0\|_F^2 + \frac{1}{2}(g - g_0)^2 + \mu\|X\|_\sigma - \mu g.
\]

We can re-write this problem as

\[
\min_{g} \left[\frac{1}{2}(g - g_0)^2 - \mu g\right] + \min_{X} \left[\frac{1}{2}\|X - X_0\|_F^2 + \mu\|X\|_\sigma\right].
\]

Focusing on the inner minimization over $X$, Cai et al. [2010, Theorem 2.1] implies that the optimal solution $X^*$ has the same singular vectors as $X_0$.

Using this property we can re-write (B.13) as

\[
\arg\min_{\sigma,g} \frac{1}{2}\|U_0\,\mathrm{diag}(\sigma)V_0^T - U_0\,\mathrm{diag}(\sigma_0)V_0^T\|_F^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } \|U_0\,\mathrm{diag}(\sigma)V_0^T\|_\sigma \le g.
\]

Since $U_0$ and $V_0$ are orthogonal, we can re-write this as

\[
\arg\min_{\sigma,g} \frac{1}{2}\|\sigma - \sigma_0\|_2^2 + \frac{1}{2}(g - g_0)^2, \quad \text{s.t. } \sum_{i=1}^{p} \sigma_i \le g,\ \sigma \ge 0,
\]

the projection of $(\sigma_0, g_0)$ onto the $\ell_1$ norm cone.
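As a small worked instance of (B.14), consider a diagonal input, so that $U_0 = V_0 = I$ and the singular values can be read off directly. The numbers below are chosen purely for illustration and do not appear in the original text; the value of $\theta$ is the minimum one satisfying the constraint in (B.10).

```latex
% Illustrative instance of (B.14); the matrix and g_0 are assumptions made for this example.
\[
X_0 = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}, \qquad g_0 = 0, \qquad
U_0 = V_0 = I, \qquad \sigma_0 = (3, 1).
\]
% Find the minimum theta >= 0 with sum_i max{0, (sigma_0)_i - theta} <= g_0 + theta.
% On [0, 1] the constraint reads 4 - 2 theta <= theta, which needs theta >= 4/3 (outside [0, 1]).
% On [1, 3] the constraint reads 3 - theta <= theta, so the minimum is theta = 3/2.
\[
\theta = \tfrac{3}{2}, \qquad
\tilde{\sigma} = \max\{0, \sigma_0 - \theta\} = \left(\tfrac{3}{2}, 0\right), \qquad
\tilde{g} = g_0 + \theta = \tfrac{3}{2},
\]
\[
P_{C_\sigma}(X_0, g_0) =
\left( \begin{bmatrix} \tfrac{3}{2} & 0 \\ 0 & 0 \end{bmatrix},\ \tfrac{3}{2} \right),
\qquad \|U_0 \tilde{\Sigma} V_0^T\|_\sigma = \tfrac{3}{2} = \tilde{g}.
\]
```

In this instance the smaller singular value is thresholded to zero, so the projected matrix has lower rank than $X_0$, and the constraint $\|X\|_\sigma \le g$ holds with equality.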
Item Metadata
Title | Graphical model structure learning using L₁-regularization |
Creator |
Schmidt, Mark |
Publisher | University of British Columbia |
Date Issued | 2010 |
Description | This work looks at fitting probabilistic graphical models to data when the structure is not known. The main tool to do this is L₁-regularization and the more general group L₁-regularization. We describe limited-memory quasi-Newton methods to solve optimization problems with these types of regularizers, and we examine learning directed acyclic graphical models with L₁-regularization, learning undirected graphical models with group L₁-regularization, and learning hierarchical log-linear models with overlapping group L₁-regularization. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-08-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0051935 |
URI | http://hdl.handle.net/2429/27277 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2010-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |