Greed is Good
Greedy Optimization Methods for Large-Scale Structured Problems

by

Julie Nutini

B.Sc., The University of British Columbia (Okanagan), 2010
M.Sc., The University of British Columbia (Okanagan), 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies
(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

May 2018

© Julie Nutini 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Greed is Good: Greedy Optimization Methods for Large-Scale Structured Problems

submitted by Julie Nutini in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Examining Committee:

Mark Schmidt, Computer Science
Supervisor

Chen Greif, Computer Science
Supervisory Committee Member

Will Evans, Computer Science
Supervisory Committee Member

Bruce Shepherd, Computer Science
University Examiner

Ozgur Yilmaz, Mathematics
University Examiner

Abstract

This work looks at large-scale machine learning, with a particular focus on greedy methods. A recent trend caused by big datasets is to use optimization methods that have a cheap iteration cost. In this category are (block) coordinate descent and Kaczmarz methods, as the updates of these methods only rely on a reduced subspace of the problem at each iteration. Prior to our work, the literature cast greedy variations of these methods as computationally expensive with comparable convergence rates to randomized versions. In this dissertation, we show that greed is good. Specifically, we show that greedy coordinate descent and Kaczmarz methods have efficient implementations and can be faster than their randomized counterparts for certain common problem structures in machine learning.
We show linear convergence for greedy (block) coordinate descent methods under a revived relaxation of strong convexity from 1963, which we call the Polyak-Łojasiewicz (PL) inequality. Of the proposed relaxations of strong convexity in the recent literature, we show that the PL inequality is the weakest condition that still ensures a global minimum. Further, we highlight the exploitable flexibility in block coordinate descent methods, not only in the different types of selection rules possible, but also in the types of updates we can use. We show that using second-order or exact updates with greedy block coordinate descent methods can lead to superlinear or finite convergence (respectively) for popular machine learning problems. Finally, we introduce the notion of “active-set complexity”, which we define as the number of iterations required before an algorithm is guaranteed to reach the optimal active manifold, and show explicit bounds for two common problem instances when using the proximal gradient or the proximal coordinate descent method.

Lay Summary

A recent trend caused by big datasets is to use methods that are computationally inexpensive to solve large-scale problems in machine learning. This work looks at several of these methods, with a particular focus on greedy variations, that is, methods that try to make the most possible progress at each step. Prior to our work, these greedy methods were regarded as computationally expensive with similar performance to cheaper versions that make a random amount of progress at each step. In this dissertation, we show that greed is good. Specifically, we show that these greedy methods can be very efficient and can be faster relative to their randomized counterparts for solving machine learning problems with certain structure. We exploit the flexibility of these methods and show various ways (both theoretically and empirically) to speed them up.

Preface

The body of research in this dissertation (Chapters 2-6) is based on several collaborative papers that have either been previously published or are currently under review.

• The work in Chapter 2 was published in the Proceedings of the 32nd International Conference on Machine Learning (ICML) [Nutini et al., 2015]:

    J. Nutini, Mark Schmidt, Issam H. Laradji, Michael Friedlander and Hoyt Koepke. Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection, ICML 2015 [arXiv].

  The majority of this chapter’s manuscript was written by J. Nutini and Mark Schmidt. The new convergence rate analysis for the greedy coordinate descent method (Section 2.2.3) was done by Michael Friedlander and Mark Schmidt. The analysis of strong-convexity constants (Section 2.3) was contributed to by Michael Friedlander. The Gauss-Southwell-Lipschitz rule and nearest neighbour analysis in Sections 2.5.2 and 2.5.3 were a joint effort by J. Nutini, Issam Laradji and Mark Schmidt. The majority of the work in Section 2.8 (numerical experiments) was done by Issam Laradji, with help from J. Nutini. Appendix A.1, showing how to calculate the greedy Gauss-Southwell rule efficiently for sparse problems, was primarily researched by Issam Laradji and Mark Schmidt. All other sections were primarily researched by J. Nutini and Mark Schmidt. The final co-author on this paper, Hoyt Koepke, was the primary researcher on extending the greedy coordinate descent analysis to the case of using exact coordinate optimization (excluded from this dissertation).

• A version of Chapter 3 was published in the Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI) [Nutini et al., 2016]:

    J. Nutini, Behrooz Sepehry, Issam H. Laradji, Mark Schmidt, Hoyt Koepke and Alim Virani.
    Convergence Rates for Greedy Kaczmarz Algorithms, and Faster Randomized Kaczmarz Rules Using the Orthogonality Graph, UAI 2016 [arXiv].

  The majority of this chapter’s manuscript was researched and written by J. Nutini and Mark Schmidt. The majority of the work in Section 3.9 and Appendix B.12 (numerical experiments) was done by Issam Laradji, with help from J. Nutini and Alim Virani. A section of the corresponding paper on extending the convergence rate analysis of Kaczmarz methods to consider more than one step (i.e., a sequence of steps) was written by my co-authors Behrooz Sepehry, Hoyt Koepke and Mark Schmidt, and is therefore excluded from this dissertation.

• The material in Chapter 4 was published in the Proceedings of the 27th European Conference on Machine Learning (ECML) [Karimi et al., 2016]:

    Hamed Karimi, J. Nutini and Mark Schmidt. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition, ECML 2016 [arXiv].

  Hamed Karimi, Mark Schmidt and I all made contributions to Theorem 2 and Section C.5, while the work in Section C.4 was done by Hamed Karimi. Several results presented in the corresponding paper are excluded from this dissertation as they were written by my co-authors. These include using the Polyak-Łojasiewicz (PL) inequality to give new convergence rate analyses for stochastic gradient descent methods and stochastic variance reduced gradient methods, as well as a result proving the equivalence of our proposed proximal-PL condition to two other previously proposed conditions.

• A version of Chapter 5 has been submitted for publication [Nutini et al., 2017a]:

    J. Nutini, Issam H. Laradji and Mark Schmidt. Let’s Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence (2017) [arXiv].

  This chapter’s manuscript was written by J. Nutini and Mark Schmidt. The majority of the research was joint work between J. Nutini, Issam Laradji and Mark Schmidt.
  Section 5.5 and Appendix D.5 (numerical experiments) were primarily done by Issam Laradji, with help from J. Nutini and Mark Schmidt.

• The material in Chapter 6 is a compilation of material from the reference for Chapter 5 [Nutini et al., 2017a] and a manuscript that has been submitted for publication [Nutini et al., 2017b]:

    J. Nutini, Mark Schmidt and Warren Hare. “Active-set complexity” of proximal gradient: How long does it take to find the sparsity pattern? (2017) [arXiv].

  The majority of this chapter’s manuscript was written by J. Nutini and Mark Schmidt. The proof of Lemmas 2 and 4 was joint work by Warren Hare, Mark Schmidt and J. Nutini. The majority of the work in Section 6.5 (numerical experiments) was done by Issam Laradji, with help from J. Nutini and Mark Schmidt.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
    1.1 Big-Data: A Barrier to Learning?
    1.2 The Learning Problem/Algorithm
        1.2.1 Loss Functions
    1.3 First-Order Methods
    1.4 Gradient Descent
    1.5 Stochastic Gradient Descent
    1.6 Coordinate Descent Methods
    1.7 Linear Systems and Kaczmarz Methods
    1.8 Relaxing Strong Convexity
    1.9 Proximal First-Order Methods
    1.10 Summary of Contributions
2 Greedy Coordinate Descent
    2.1 Problems of Interest
    2.2 Analysis of Convergence Rates
        2.2.1 Randomized Coordinate Descent
        2.2.2 Gauss-Southwell
        2.2.3 Refined Gauss-Southwell Analysis
    2.3 Comparison for Separable Quadratic
        2.3.1 ‘Working Together’ Interpretation
        2.3.2 Fast Convergence with Bias Term
    2.4 Rates with Different Lipschitz Constants
    2.5 Rules Depending on Lipschitz Constants
        2.5.1 Lipschitz Sampling
        2.5.2 Gauss-Southwell-Lipschitz Rule
        2.5.3 Connection between GSL Rule and Normalized Nearest Neighbour Search
    2.6 Approximate Gauss-Southwell
        2.6.1 Multiplicative Errors
        2.6.2 Additive Errors
    2.7 Proximal Gradient Gauss-Southwell
    2.8 Experiments
    2.9 Discussion
3 Greedy Kaczmarz
    3.1 Problems of Interest
    3.2 Kaczmarz Algorithm and Greedy Selection Rules
        3.2.1 Efficient Calculations for Sparse A
        3.2.2 Approximate Calculation
    3.3 Analyzing Selection Rules
        3.3.1 Randomized and Maximum Residual
        3.3.2 Tighter Uniform and MR Analysis
        3.3.3 Maximum Distance Rule
    3.4 Kaczmarz and Coordinate Descent
    3.5 Example: Diagonal A
    3.6 Approximate Greedy Rules
        3.6.1 Multiplicative Error
        3.6.2 Additive Error
    3.7 Systems of Linear Inequalities
    3.8 Faster Randomized Kaczmarz Methods
    3.9 Experiments
    3.10 Discussion
4 Relaxing Strong Convexity
    4.1 Polyak-Łojasiewicz Inequality
        4.1.1 Relationships Between Conditions
        4.1.2 Invex and Non-Convex Functions
        4.1.3 Relevant Problems
    4.2 Convergence of Huge-Scale Methods
        4.2.1 Randomized Coordinate Descent
        4.2.2 Greedy Coordinate Descent
        4.2.3 Sign-Based Gradient Methods
    4.3 Proximal Gradient Generalization
        4.3.1 Relevant Problems
        4.3.2 Least Squares with ℓ1-Regularization
        4.3.3 Proximal Coordinate Descent
        4.3.4 Support Vector Machines
    4.4 Discussion
5 Greedy Block Coordinate Descent
    5.1 Block Coordinate Descent Algorithms
        5.1.1 Block Selection Rules
        5.1.2 Fixed vs. Variable Blocks
        5.1.3 Block Update Rules
        5.1.4 Problems of Interest
    5.2 Improved Greedy Rules
        5.2.1 Block Gauss-Southwell
        5.2.2 Block Gauss-Southwell-Lipschitz
        5.2.3 Block Gauss-Southwell-Quadratic
        5.2.4 Block Gauss-Southwell-Diagonal
        5.2.5 Convergence Rate under Polyak-Łojasiewicz
        5.2.6 Convergence Rate with General Functions
    5.3 Practical Issues
        5.3.1 Tractable GSD for Variable Blocks
        5.3.2 Tractable GSQ for Variable Blocks
        5.3.3 Lipschitz Estimates for Fixed Blocks
        5.3.4 Efficient Line Searches
        5.3.5 Block Partitioning with Fixed Blocks
        5.3.6 Newton Updates
    5.4 Message-Passing for Huge-Block Updates
        5.4.1 Partitioning into Forest-Structured Blocks
        5.4.2 Approximate Greedy Rules with Forest-Structured Blocks
    5.5 Numerical Experiments
        5.5.1 Greedy Rules with Gradient Updates
        5.5.2 Greedy Rules with Matrix Updates
        5.5.3 Message-Passing Updates
    5.6 Discussion
6 Active-Set Identification and Complexity
    6.1 Notation and Assumptions
    6.2 Manifold Identification for Separable g
        6.2.1 Proximal Gradient Method
        6.2.2 Proximal Coordinate Descent Method
    6.3 Active-Set Complexity
    6.4 Superlinear and Finite Convergence of Proximal BCD
        6.4.1 Proximal-Newton Updates and Superlinear Convergence
        6.4.2 Practical Proximal-Newton Methods
        6.4.3 Optimal Updates for Quadratic f and Piecewise-Linear g
    6.5 Numerical Experiments
    6.6 Discussion
7 Discussion
Bibliography
Appendices
A Chapter 2 Supplementary Material
    A.1 Efficient Calculation of GS Rules for Sparse Problems
        A.1.1 Problem h2
        A.1.2 Problem h1
    A.2 Relationship Between µ1 and µ
    A.3 Analysis for Separable Quadratic Case
        A.3.1 Equivalent Definition of Strong Convexity
        A.3.2 Strong Convexity Constant µ1 for Separable Quadratic Functions
    A.4 Gauss-Southwell-Lipschitz Rule: Convergence Rate
    A.5 Comparing µL to µ1 and µ
        A.5.1 Relationship Between µL and µ1
        A.5.2 Relationship Between µL and µ
    A.6 Approximate Gauss-Southwell with Additive Error
        A.6.1 Gradient Bound in Terms of L1
        A.6.2 Additive Error Bound in Terms of L1
        A.6.3 Additive Error Bound in Terms of L
    A.7 Convergence Analysis of GS-s, GS-r, and GS-q Rules
        A.7.1 Notation and Basic Inequality
        A.7.2 Convergence Bound for GS-q Rule
        A.7.3 GS-q is at Least as Fast as Random
        A.7.4 GS-q is at Least as Fast as GS-r
        A.7.5 Lack of Progress of the GS-s Rule
        A.7.6 Lack of Progress of the GS-r Rule
    A.8 Proximal Gradient in the ℓ1-Norm
B Chapter 3 Supplementary Material
    B.1 Efficient Calculations for Sparse A
    B.2 Randomized and Maximum Residual
    B.3 Tighter Uniform and MR Analysis
    B.4 Maximum Distance Rule
    B.5 Kaczmarz and Coordinate Descent
    B.6 Example: Diagonal A
    B.7 Multiplicative Error
    B.8 Additive Error
    B.9 MD Rule and Randomized Kaczmarz via Johnson-Lindenstrauss
    B.10 Systems of Linear Inequalities
    B.11 Faster Randomized Kaczmarz Using the Orthogonality Graph of A
    B.12 Additional Experiments
C Chapter 4 Supplementary Material
    C.1 Relationships Between Conditions
    C.2 Relevant Problems
    C.3 Sign-Based Gradient Methods
    C.4 Proximal-PL Lemma
    C.5 Relevant Problems
    C.6 Proximal Coordinate Descent
D Chapter 5 Supplementary Material
    D.1 Cost of Multi-Class Logistic Regression
        D.1.1 Cost of Gradient Descent
        D.1.2 Cost of Randomized Coordinate Descent
        D.1.3 Cost of Greedy Coordinate Descent (Arbitrary Blocks)
        D.1.4 Cost of Greedy Coordinate Descent (Fixed Blocks)
    D.2 Blockwise Lipschitz Constants
        D.2.1 Quadratic Functions
        D.2.2 Least Squares
        D.2.3 Logistic Regression
        D.2.4 Multi-Class Logistic Regression
    D.3 Derivation of GSD Rule
    D.4 Efficiently Testing the Forest Property
    D.5 Full Experimental Results
        D.5.1 Datasets
        D.5.2 Greedy Rules with Gradients Updates
        D.5.3 Greedy Rules with Matrix and Newton Updates

List of Tables

Table 3.1 Comparison of Convergence Rates
Table 3.2 Convergence Rate Constants for Diagonal A
Table B.1 Convergence Rate Constants for Diagonal A

List of Figures

Figure 1.1 Visualization of several iterations of cyclic coordinate descent on the level curves of a quadratic function. The steps alternate between updating x1 and x2, and converge towards the minimum value.
Figure 2.1 Visualization of the Gauss-Southwell selection rule. Shown here are three different projections of a function onto individual coordinates given the corresponding values of x = [x1, x2, x3]. The dotted green lines are the individual gradient values (tangent lines) at x. We see that the Gauss-Southwell rule selects the coordinate corresponding to the largest (steepest) individual gradient value (in magnitude).
Figure 2.2 Visualization of the Gauss-Southwell-Lipschitz selection rule compared to the Gauss-Southwell selection rule.
When the slopes of the tangent lines (gradient values) are similar, the GSL rule will make more progress by selecting the coordinate with the slower changing derivative (smaller Li).
Figure 2.3 Visualization of the GS rule as a nearest neighbour problem. The “nearest neighbour” corresponds to the vector that is the closest (in distance) to r(x), i.e., we want to minimize the distance between two vectors. Alternatively, the GS rule is evaluated as a maximization of an inner product. This makes sense as the smaller the angle between two vectors, the larger the cosine of that angle and in turn, the larger the inner product.
Figure 2.4 Comparison of coordinate selection rules for 4 instances of problem h1.
Figure 2.5 Comparison of coordinate selection rules for graph-based semi-supervised learning.
Figure 3.1 Example of the updating procedure for a max-heap structure on a 5×5 sparse matrix: (a) select the node with highest d value; (b) update selected sample and neighbours; (c) reorder max-heap structure.
Figure 3.2 Visualizing the orthogonality of vectors x^{k+1} − x^k and x^{k+1} − x^*.
Figure 3.3 Comparison of Kaczmarz selection rules for squared error (left) and distance to solution (right).
Figure 4.1 Visual of the implications shown in Theorem 2 between the various relaxations of strong convexity.
Figure 4.2 Example: f(x) = x^2 + 3 sin^2(x) is an invex but non-convex function that satisfies the PL inequality.
Figure 5.1 Process of partitioning nodes into level sets. For the above graph we have the following sets: L{1} = {8}, L{2} = {6, 7}, L{3} = {3, 4, 5} and L{4} = {1, 2}.
Figure 5.2 Illustration of Step 2 (row-reduction process) of Algorithm 1 for the tree in Figure 5.4. The matrix represents [Ã | c̃]. The black squares represent unchanged non-zero values of Ã and the grey squares represent non-zero values that are updated at some iteration in Step 2. In the final matrix (far right), the values in the last column are the values assigned to the vector C in Steps 1 and 2 above, while the remaining columns that form an upper triangular matrix are the values corresponding to the constructed P matrix. The backward solve of Step 3 solves the linear system.
Figure 5.3 Partitioning strategies for defining forest-structured blocks.
Figure 5.4 Comparison of different random and greedy block selection rules on three different problems when using gradient updates.
Figure 5.5 Comparison of different greedy block selection rules on three different problems when using matrix updates.
Figure 5.6 Comparison of different greedy block selection rules on two quadratic graph-structured problems when using optimal updates.
Figure 6.1 Visualization of (a) the proximal gradient update for a non-negatively constrained optimization problem (6.3); and (b) the proximal operator (soft-threshold) used in the proximal gradient update for an ℓ1-regularized optimization problem (6.4).
Figure 6.2 Comparison of different updates when using greedy fixed and variable blocks of different sizes.
Figure 6.3 Comparison of different updates when using random fixed and variable blocks of different sizes.
Figure B.1 Comparison of Kaczmarz and Coordinate Descent.
Figure B.2 Comparison of MR, MD and Hybrid Method for Very Sparse Dataset.
Figure D.1 Comparison of different random and greedy block selection rules on five different problems (rows) with three different blocks (columns) when using gradient updates.
Figure D.2 Comparison of different random and greedy block selection rules with gradient updates and fixed blocks, using two different strategies to estimate Lb.
Figure D.3 Comparison of different random and greedy block selection rules with gradient updates and fixed blocks, using three different ways to partition the variables into blocks.
Figure D.4 Comparison of different greedy block selection rules when using matrix updates.
Figure D.5 Comparison of different greedy block selection rules when using Newton updates and a line search.

Acknowledgements

I would first and foremost like to thank my supervisor, Mark Schmidt – thank you for helping me battle the imposter within. Your support, encouragement and mentorship are at the root of this work, and it has been an absolute pleasure learning from you. To my supervisory committee, Chen Greif and Will Evans, my external examiner, Stephen Wright, my university examiners, Bruce Shepherd and Ozgur Yilmaz, and my defence chair, Maurice Queyranne – thank you for your time, support and helpful feedback. Thank you to Michael Friedlander for supervising me in the early years of this degree. To my Master’s supervisor, Warren Hare – thank you for encouraging me to pursue my studies further and for sticking by my side long enough to get another publication (no fairy dust required).
Thank you to all of my co-authors and lab mates, especially Issam Laradji – I have thoroughly enjoyed learning alongside you. To all of my colleagues here at UBC and at UrtheCast – thank you for crossing your fingers with me every time my answer was, “I should be done by the end of [insert month here]”. To all of the support staff at UBC, especially Joyce Poon and Kath Imhiran – thank you for always making the paperwork an easy process. To my family and friends – thank you for your unwavering support and for providing me with a distraction when needed. And finally, to my parents, for whom there are no words – this accomplishment is as much yours as it is mine.

This work was partially funded by the Natural Sciences and Engineering Research Council of Canada, the UBC Faculty of Science, and a UBC Four Year Fellowship.

To my grandpa – an educator, a family man and a lover of life...

Chapter 1
Introduction

Machine learning is remodelling the world we live in. Coined by computer gaming and artificial intelligence pioneer Arthur Samuel in 1959, the term machine learning (ML) has been popularly defined as the “field of study that gives computers the ability to learn without being explicitly programmed” (credited to Samuel [1959]). Elaborating on this elegant definition, ML is the study of using computers to automatically detect patterns in data and make predictions or decisions [Murphy, 2012]. This automation process is useful in the absence of a human expert skilled at analyzing the data, when a human cannot detect or explain patterns in the data, or when the complexity of a classification or prediction problem is beyond the capability of a human.
By designing computer systems and optimization algorithms capable of parsing this data, ML has been integral in the success of several high-impact applications recently, such as epileptic seizure prediction [Mirowski et al., 2008], speech recognition [Mohamed et al., 2009], music generation [Boulanger-Lewandowski et al., 2012], large-scale video classification [Karpathy et al., 2014] and autonomous vehicles [Teichmann et al., 2016]. It is evident that ML is at the forefront of technological change in our world.

1.1 Big-Data: A Barrier to Learning?

There is no shortage of data in today’s data-driven world. In nearly every daily task we generate and record data on the order of terabytes to exabytes. Online news articles, blog posts, Facebook likes, credit card transactions, online purchases, gene expression data, maps, satellite imagery and user interactions are a fractional sample of the day-to-day activities for which we record data. Fitting most ML models involves solving an optimization problem, and without methods capable of parsing and analyzing these huge datasets, the learning process becomes wildly intractable.

This presents us with the crucial task of developing ways to efficiently deal with these massive datasets. One approach is to create hardware that can handle the computational requirements of existing algorithms. For example, parallel computing using multi-core machines, outsourcing computational work to cloud computing platforms and faster GPUs have had a huge impact on the ability of the ML field to keep up with the ever-growing amount of data. However, the development of hardware is restricted by what is known as Moore’s Law, the observation that the number of transistors on a chip, and with it roughly the processing power of computers, only doubles about every two years. With the rate at which we are collecting data, we cannot rely on hardware advances to keep up.

A different approach to dealing with these large datasets is to design methods that are scalable with problem size.
That is, for large-scale datasets we want methods whose cost scales at most linearly (or almost linearly) with the data size. The best method for a given problem is often dictated by the size of the problem; the larger the problem, the greater the restrictions on what mathematical operations are feasible.

In general, optimization algorithms can be classified according to two properties:
1. Convergence rate: the number of iterations required to reach a solution of accuracy $\epsilon$.
2. Iteration cost: the amount of computation each iteration requires.

Often these two properties work in opposition to each other. For example, a second-order method like Newton’s method achieves a superlinear convergence rate but requires a quadratic-or-worse iteration cost. Alternatively, a first-order method like gradient descent has a cheaper iteration cost but only achieves a linear convergence rate. If the data size is large enough, it is possible that gradient descent will find an $\epsilon$-optimal solution in a shorter amount of total computation time compared to Newton’s method (even though the number of iterations may be much higher).

For some big datasets the cost of first-order gradient methods is still infeasible. As a result, there has been a recent trend of proposing/reviving methods with even cheaper iteration costs. Generally speaking, this indicates a shift in focus from developing a robust algorithm that is scalable for all types of problems to developing algorithms that exploit problem structure to deal with scalability. By exploiting problem structure we are able to develop faster optimization methods that still maintain cheap iteration costs.

Before we discuss the details of these methods, we first need to describe the problem structure that these large-scale methods are designed to exploit. In the next section, we define the supervised learning problem that is solved in many machine learning tasks.
Formally this problem is known as expected risk minimization.

1.2 The Learning Problem/Algorithm

Optimization is used to formalize the learning process, where the goal is to determine a mapping such that for any observed input feature vector $a_i \in \mathbb{R}^n$ and corresponding output classification or prediction value $b_i \in \mathbb{R}$, the mapping yields the true output $b_i$. We use optimization to learn the parameterization of this mapping such that some difference measure between the output of the learned mapping and the true output is minimized.¹

¹ The description of the ERM problem in this section largely follows the presentation of Bottou et al. [2016].

The mapping is known formally as a prediction function. There are several families of prediction functions and an optimization process can be done over entire families to select the optimal form the prediction function should take. However, for the purpose of this work we assume that we have a fixed prediction function with an unknown parameterization. That is, we are given a prediction function $h(\cdot\,; x) : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ parameterized by an unknown vector $x \in \mathbb{R}^n$. For example, a general linear prediction function is defined by
$$h(a_i; x) = x^T a_i, \tag{1.1}$$
where $i$ is the index of a data sample and the $a_i$ could potentially be non-linear transformations of some original measurements. The goal is to learn a parameter vector $x$ such that for any feature vector $a_i \in \mathbb{R}^n$, the output of $h(a_i; x)$, say $\hat{b}_i$, matches the known output $b_i$.

To carry out this learning process, we require a measure of difference between $\hat{b}_i$ and $b_i$. We define a loss function as some function $f_i : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ that computes a measure of difference between these two values. This loss function is usually a continuous (and often convex) approximation to the “true” loss function (see Section 1.2.1).
Indeed, this loss function is often strongly convex, an important property that will be discussed in this work and that we define in Section 1.3.

Given a prediction function $h$ and a loss function $f_i$, the objective function of the ML optimization problem is the expected risk function, which is defined by
$$\hat{f}(x) = \int_{\mathbb{R}^n \times \mathbb{R}} f_i(b_i, h(a_i; x)) \, dP(a_i, b_i) = \mathbb{E}\left[f_i(b_i, h(a_i; x))\right], \tag{1.2}$$
where $P(a_i, b_i)$ is a joint probability distribution from which the observation pair $(a_i, b_i)$ was sampled. Clearly, the function in (1.2) is impractical to evaluate as it is an expectation over the entire (infinite) distribution of possible data examples. Thus, in practice we approximate it using an empirical risk function. Given $m$ independently and identically distributed random observations $(a_i, b_i) \in \mathbb{R}^n \times \mathbb{R}$ for $i = 1, 2, \dots, m$, we evaluate the empirical risk function, which is defined by
$$f(x) = \frac{1}{m} \sum_{i=1}^m f_i(b_i, h(a_i; x)),$$
where $f_i(x)$ is commonly used as short-form notation for $f_i(b_i, h(a_i; x))$. Thus, the ML optimization problem is to find $x$ such that the empirical risk is minimized,
$$\min_{x \in \mathbb{R}^n} f(x) \equiv \min_{x \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m f_i(x).$$
In general, this problem is known as empirical risk minimization (ERM). Often we use regularization to decrease the variance in the learned estimator so that it is less sensitive to the data it is trained on. This typically improves test error but assumes some prior belief over the smoothness of the desired model. We define the regularized empirical risk minimization problem by
$$\min_{x \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m f_i(x) + \lambda g(x),$$
for some regularization function $g : \mathbb{R}^n \to \mathbb{R}$ and regularization parameter $\lambda > 0$. The most common regularizers used in ML are:
• $\|\cdot\|_2^2$: The squared $\ell_2$-norm is differentiable and acts as a smoothing regularizer that, when added to a convex loss function, results in a strongly convex objective. This type of regularization is a special case of Tikhonov regularization, where the Tikhonov matrix is the identity matrix.
The influence of the $\ell_2$-regularization encourages the entries of the parameter vector $x$ to be small in magnitude.
• $\|\cdot\|_1$: The $\ell_1$-norm promotes sparsity in the parameter vector $x$. The resulting non-zero entries in $x$ correspond to the most important features needed for classification and it has been shown that this type of regularization can help with the curse of dimensionality [Ng, 2004]. Otherwise, when there is enough data the $\ell_1$-norm has a denoising effect on the solution (e.g., the basis pursuit denoising problem in signal reconstruction applications [Chen and Donoho, 1994, Chen et al., 2001]).

Using either the regularized or non-regularized ERM objective, the learning process then uses data to train, validate and test the model. Usually the dataset is separated into a training set and a testing set. Although not the topic of this work, the variability of a training dataset has a direct influence on the accuracy of the model learned. Validation includes choosing between several prediction/loss functions to determine the one that yields the lowest percentage of error in predictions. The resulting model is then evaluated on the testing set. In summary, the choice of loss function is made via empirical observation and experimentation, using training, validation and testing datasets. The selected loss function usually has the best performance on the validation set and we explore several common loss functions used in ML in the next section. (For more details on the use of different regularization/penalty functions used in statistical learning, see [Hastie et al., 2001, Wainwright, 2014].)

1.2.1 Loss Functions

In this work we focus on methods that can be used to solve ERM and regularized ERM with convex loss functions. Convex loss functions are commonly used in ML because they are well-behaved and, when strongly convex, have a well-defined unique solution.
We define the convex loss functions that are most regularly used in ML problems next (see [Rosasco et al., 2004] for a thorough comparison of the most common convex loss functions used in machine learning).

Least-Squares Loss

The least-squares loss is defined as the squared error between the linear prediction function defined in (1.1) evaluated at a sample $a_i \in \mathbb{R}^n$ and the corresponding true output $b_i \in \mathbb{R}$,
$$f_i(x) = \frac{1}{2}(b_i - x^T a_i)^2.$$
As an optimization problem this is known as least-squares linear regression and is the minimization of the squared error over a set of samples $a_i \in \mathbb{R}^n$, $i = 1, \dots, m$,
$$\min_{x \in \mathbb{R}^n} \frac{1}{2} \sum_{i=1}^m (b_i - x^T a_i)^2.$$
Least-squares loss is twice-differentiable and convex, but not strongly convex. In order to make the objective strongly convex to ensure a unique solution, we can add a strongly convex regularizer. For example, using a set of samples $A = [a_1, a_2, \dots, a_m]^T \in \mathbb{R}^{m \times n}$ and a set of outputs $b \in \mathbb{R}^m$, the ridge regression problem uses $\ell_2$-regularization,
$$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_2^2.$$
Alternatively, the LASSO problem uses $\ell_1$-regularization,
$$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1.$$
We note, however, that this choice of regularization does not ensure a unique solution.

Logistic Loss

For problems of binary classification with $b_i \in \{-1, 1\}$, we consider the function
$$\max(0, -b_i \,\mathrm{sign}(x^T a_i)), \tag{1.3}$$
where $\mathrm{sign}(\cdot)$ is equal to $-1$ if the argument is negative and $+1$ if the argument is positive. Using this as our loss function, if $b_i \,\mathrm{sign}(x^T a_i) > 0$, then our model predicted the correct classification for example $i$ and (1.3) would be equal to 0. That is, in the standard binary case $\hat{b}_i = \mathrm{sign}(x^T a_i)$ and (1.3) is a “measure of difference” between $\hat{b}_i$ and $b_i$. However, (1.3) is a nonsmooth function and suffers from a degenerate solution when $x = 0$.
Thus, we use the logistic loss function, which is a smooth approximation of (1.3),
$$\max(0, -b_i \,\mathrm{sign}(x^T a_i)) \approx \log(\exp(0) + \exp(-b_i x^T a_i)).$$
As an optimization problem, this translates to minimizing the penalization of the predictions made by the model over some training sample set,
$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \log(1 + \exp(-b_i x^T a_i)).$$
The logistic loss function is smooth and convex, and we can add an $\ell_2$-regularizer to make the objective strongly convex. For a regularized logistic loss function, Ng [2004] shows that when learning in the presence of many irrelevant features, while the worst-case sample complexity for an $\ell_2$-regularized logistic loss problem grows linearly in the number of irrelevant features, the number of training examples required for learning with $\ell_1$-regularization grows only logarithmically in the number of irrelevant features. Thus, $\ell_1$-regularization can be used to counteract the curse of dimensionality.

Hinge Loss

For problems of classification, where $b_i$ takes on a fixed integer value from a finite set, e.g., $b_i \in \{1, -1\}$, we have the hinge loss function,
$$f_i(x) := \max\{1 - b_i x^T a_i, 0\}.$$
This loss function is the tightest convex upper bound (on $[-1, 1]$) of the 0-1 indicator function, or the true loss function
$$\mathbb{1}_{b_i x^T a_i < 0}.$$
The regularized empirical risk minimization problem using this loss is better known as the classic linear Support Vector Machine (SVM) primal problem [Cortes and Vapnik, 1995], and can equivalently be written as the hinge-loss function plus $\ell_2$-regularization,
$$\min_{x \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m \max\{1 - b_i x^T a_i, 0\} + \lambda \|x\|_2^2.$$
Points that are correctly classified with sufficient margin are not penalized, while points that are misclassified (on the wrong side of the separating hyperplane) are penalized linearly with respect to the distance to the correct boundary. This problem is convex but non-smooth, and unlike $\ell_1$-regularization, this non-smoothness is not separable.
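As a rough illustration of the three losses defined in this section, the following sketch evaluates each of them for a single sample $(a_i, b_i)$ under the linear prediction function (1.1). The function names and the toy sample are assumptions made for illustration only; they do not come from this work.

```python
import numpy as np

def least_squares_loss(x, a, b):
    """0.5 * (b - x^T a)^2 for one sample."""
    return 0.5 * (b - a @ x) ** 2

def logistic_loss(x, a, b):
    """log(1 + exp(-b * x^T a)), evaluated stably with logaddexp."""
    return np.logaddexp(0.0, -b * (a @ x))

def hinge_loss(x, a, b):
    """max{1 - b * x^T a, 0} for a label b in {-1, +1}."""
    return max(1.0 - b * (a @ x), 0.0)

# Hypothetical sample: the prediction x^T a equals 2.
x = np.array([1.0, 0.0])
a = np.array([2.0, 0.0])
for b in (1.0, -1.0):
    print(b, least_squares_loss(x, a, b), logistic_loss(x, a, b), hinge_loss(x, a, b))
```

For the correctly classified label ($b_i = 1$) the hinge loss is exactly zero while the logistic loss is small but positive, which reflects the smooth versus non-smooth behaviour described above.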
As a result, we often consider solving the Fenchel dual of the primal problem, which reduces to a linearly constrained quadratic problem,
$$\min_{z \in [0, U]} \frac{1}{2} z^T M z - \sum_i z_i,$$
where $U \in \mathbb{R}_+$ is a constant and $M$ is a particular positive semi-definite matrix. In [Rosasco et al., 2004], the authors use a probabilistic bound on the estimation error for the classification problem and show the convergence rate of the hinge loss is almost identical to the logistic loss rate, and far superior to the quadratic loss rate.

1.3 First-Order Methods

The popularity of convex optimization in ML has grown significantly in recent years. There are numerous efficient convex optimization algorithms that have been proven to find globally optimal solutions and use convex geometry to prove rates of convergence for large-scale problems (see [Bertsekas, 2015, Cevher et al., 2014, Nesterov, 2004]).

In this section we consider several commonly used first-order methods for solving the unconstrained convex optimization problem,
$$\min_{x \in \mathbb{R}^n} f(x). \tag{1.4}$$
We assume that $f$ is a convex and differentiable function, and that it satisfies a smoothness assumption. That is, we assume the gradient of $f$ is $L$-Lipschitz continuous such that for all $x, y \in \mathbb{R}^n$ we have
$$\|\nabla f(y) - \nabla f(x)\| \le L \|y - x\|. \tag{1.5}$$
This smoothness assumption is standard in convex optimization [Nesterov, 2004, §2.1.1]. For twice differentiable functions, this condition implies that the maximum eigenvalue of the Hessian of $f$, $\nabla^2 f(x)$, is bounded above by $L$ [Nesterov, 2004, Lem. 1.2.2].

In some cases we also assume that $f$ is $\mu$-strongly convex [Nesterov, 2004, §2.1.3], that is, for all $x, y \in \mathbb{R}^n$ we have
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2,$$
where $\mu > 0$ is the strong convexity constant. If $\mu = 0$, then this is equivalent to assuming $f$ is simply convex. For twice differentiable functions, strong convexity implies that the minimum eigenvalue of the Hessian is bounded below by $\mu$ [Nesterov, 2004, Thm. 2.1.11].

Under these assumptions (Lipschitz continuity of the gradient and strong convexity), we can obtain a q-linear convergence rate for first-order methods,
$$\lim_{k \to \infty} \frac{|f(x^{k+1}) - f(x^*)|}{|f(x^k) - f(x^*)|} = \rho, \tag{1.6}$$
where $1 > \rho > 0$ is known as the rate of convergence. This means that the function values at the iterates $x^k$ converge asymptotically to the optimal value $f(x^*)$ at a constant rate of $\rho$. This is the classic version of linear convergence used in optimization. However, in this work we say a method achieves a linear convergence rate if
$$f(x^k) - f^* \le \gamma \rho^k, \tag{1.7}$$
where we explicitly know the constants $\rho$ and $\gamma \in \mathbb{R}$. We can express both these conditions as $O(\rho^k)$, but the condition (1.7) is a stronger condition than (1.6) and more relevant for machine learning, since it focuses on performance in the finite time (non-asymptotic) case. In Section 1.8 we discuss how several conditions have been proposed in the literature to relax the strong convexity assumption while still maintaining a linear convergence rate for first-order methods.

In this work we focus on upper bounding the convergence rates of certain first-order methods. We note that lower bounds on the convergence rate of any first-order method under the assumptions of Lipschitz continuity and strong convexity have been analyzed, showing that the best lower bound we can obtain is linear [Nesterov, 2004, §2.1.4]. This shows that, in the worst case, no first-order method can converge faster than a linear rate in this setting.

1.4 Gradient Descent

The most classic first-order optimization algorithm is the gradient descent (GD) method. At each iteration a step is taken in the direction of the negative gradient evaluated at the current iterate, yielding the update
$$x^{k+1} = x^k - \alpha_k \nabla f(x^k),$$
where $k$ is the iteration counter, $x^k$ is the current iterate, $\nabla f$ is the gradient of $f$ and $\alpha_k > 0$ is a per-iteration step-size.
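A minimal sketch of this update is given below, assuming a fixed step-size of $1/L$ and an illustrative least-squares objective $f(x) = \frac{1}{2}\|Ax - b\|^2$. The random data, the helper name gradient_descent and the iteration count are assumptions made for the example, not part of this work.

```python
import numpy as np

def gradient_descent(grad, x0, step, iters):
    """Iterate x^{k+1} = x^k - step * grad(x^k) with a fixed step-size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Assumed toy problem: f(x) = 0.5 * ||Ax - b||^2 with random data.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad(x):
    return A.T @ (A @ x - b)

# For this f, the gradient is Lipschitz with L = largest eigenvalue of A^T A.
L = np.linalg.eigvalsh(A.T @ A).max()

x_gd = gradient_descent(grad, np.zeros(5), 1.0 / L, 2000)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]   # reference least-squares solution
print(np.linalg.norm(x_gd - x_ls))
```

Each iteration here costs two matrix-vector products with $A$, which is the full-gradient iteration cost that the cheaper methods in the following sections aim to avoid.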
The step-size $\alpha_k$ can either be a fixed value defined by some continuity constant of the function $f$ or can be determined using a line search technique at each iteration.

If we assume the function $f$ is convex and has a Lipschitz continuous gradient, then using a fixed step-size of $\alpha_k = 1/L$ we can achieve an $O(1/k)$ sublinear convergence rate [Nesterov, 2004, Cor. 2.1.2],
$$f(x^k) - f^* \le \frac{2L}{k + 4} \|x^0 - x^*\|^2.$$
Thus, it requires $O(1/\epsilon)$ iterations to achieve an $\epsilon$-accurate solution. However, under these assumptions there is no guaranteed convergence of the iterates.

If we also assume $\mu$-strong convexity of $f$, then the derived linear rate for GD is given by [Nesterov, 2004, Thm. 2.2.8],
$$f(x^k) - f(x^*) \le \left(1 - \frac{\mu}{L}\right)^k \left[f(x^0) - f(x^*)\right],$$
so that the rate of convergence is $O\left((1 - \frac{\mu}{L})^k\right)$, where $\mu \le L$ follows directly from Lipschitz continuity and strong convexity using the results in Nesterov [2004, Thm. 2.1.5 and Thm. 2.1.10]. Under these assumptions, we guarantee convergence of both the iterates and the function values.

1.5 Stochastic Gradient Descent

As mentioned at the end of Section 1.1, there has been a recent trend towards using methods with very cheap iteration costs. One method that reduces the iteration cost of classic GD methods for large-scale problems with specific structures and that has become an important tool for modern high-dimensional ML problems is the stochastic gradient descent (SGD) method [Robbins and Monro, 1951].

Consider problems that have the following finite sum form,
$$f(x) = \frac{1}{m} \sum_{i=1}^m f_i(x),$$
where $m$ is very large. At each iteration of the SGD method we randomly select an index $i \in \{1, 2, \dots, m\}$, and update according to
$$x^{k+1} = x^k - \alpha_k f_i'(x^k).$$
In this update, if $f_i$ is differentiable (smooth) then $f_i'$ is the gradient, $\nabla f_i$, and if $f_i$ is non-differentiable (non-smooth), then $f_i'$ is a subgradient, i.e., an element of the subdifferential of $f_i$, $\partial f_i$, where
$$\partial f_i(x) = \{v : f_i(y) \ge f_i(x) + \langle v, y - x \rangle \text{ for all } y \in \mathbb{R}^n\}.$$
This update gives an unbiased estimate of the true gradient,
$$\mathbb{E}[f_i'(x)] = \frac{1}{m} \sum_{i=1}^m \nabla f_i(x) = \nabla f(x).$$
In SGD we require that the step-size $\alpha_k$ converges asymptotically to 0 due to the variance of the gradients,
$$\frac{1}{m} \sum_{i=1}^m \|\nabla f_i(x^k) - \nabla f(x^k)\|^2.$$
If the variance is zero, then we have an exact gradient and every step is a descent step. If the variance is large, then many of the steps will be in the wrong direction. Therefore, the effects of variance in SGD methods can be reduced by using an appropriate step-size that “scales” this variance. The classic choice of step-size is $\alpha_k = O(1/k)$, and Bach and Moulines [2011] showed that using $\alpha_k = O(1/k^\alpha)$ for $\alpha \in (0, 1/2)$ is more robust (it encourages larger step-sizes).

Deterministic gradient methods [Cauchy, 1847] for this problem have an update cost linear in $m$, whereas the cost of a stochastic iteration is independent of $m$, i.e., $m$ times faster than a deterministic iteration. Furthermore, they have the same rates for non-smooth problems, meaning we can solve non-smooth problems $m$ times faster using stochastic methods. The achievable rates are sublinear: $O(1/\sqrt{k})$ for convex and $O(1/k)$ for strongly convex problems [Nemirovski et al., 2009]. This is an example of sacrificing convergence rate for cheaper iteration costs, as these rates are clearly slower than the rates obtained by the GD method for smooth problems.

A popular method that improves convergence and also battles the variance introduced by randomness in SGD methods is the stochastic average gradient (SAG) method proposed by Le Roux et al. [2012]. The SAG method still only requires one gradient evaluation at each iteration, but unlike SGD it achieves a linear convergence rate.
The iteration update is given by
$$x^{k+1} = x^k - \frac{\alpha_k}{m} \sum_{i=1}^m y_i^k,$$
where $y_i^k = \nabla f_i(x^k)$ is kept in memory from the last iteration $k$ at which $i$ was selected. For $L$-smooth, convex functions $f_i$, Schmidt et al. [2017] showed that with a constant step-size of $\alpha_k = 1/(16L)$, the SAG iterations achieve a rate of
$$\mathbb{E}\left[f(\bar{x}^k) - f(x^*)\right] \le \frac{32m}{k} C_0,$$
where $\bar{x}^k = \frac{1}{k} \sum_{i=0}^{k-1} x^i$ is the average iterate and $C_0$ is a constant dependent on the initialization of the method. Schmidt et al. [2017] also show a linear rate of convergence when $f$ is $\mu$-strongly convex,
$$\mathbb{E}\left[f(x^k) - f(x^*)\right] \le \left(1 - \min\left\{\frac{\mu}{16L}, \frac{1}{8m}\right\}\right)^k C_0.$$
These are similar rates compared to GD, but each iteration is $m$ times cheaper.

Alternative methods, such as the Stochastic Variance Reduced Gradient (SVRG) method [Johnson and Zhang, 2013], have also been proposed. Although SVRG is not faster, it does not have the memory requirements of SAG. However, the major challenge of the classic and variance-reduced stochastic gradient methods is still choosing the step-size, as a line search is not practical given that there is no guarantee of function value decrease.

1.6 Coordinate Descent Methods

An alternative way to deal with the size of large-scale optimization problems is, instead of updating all $n$ variables at each iteration, to select a single variable (or “block” of variables) to update. We call these methods (block) coordinate descent methods. Each iteration of a coordinate descent (CD) method carries out an approximate update along a single coordinate direction or coordinate hyperplane. In the single coordinate case, the update is given by
$$x^{k+1} = x^k - \alpha_k \nabla_{i_k} f(x^k) e_{i_k},$$
where $e_{i_k}$ is a vector with a one in position $i_k$ and zeros in all other positions.

Since CD methods only update one coordinate at each iteration, or in the case of block CD methods, a subset of coordinates at each iteration, this yields a low iteration cost for problems with certain structure. Nesterov [2010, 2012] brought clarity to the true power of CD methods in his seminal research on CD methods.
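The single-coordinate update above can be sketched as follows for a strongly convex quadratic $f(x) = \frac{1}{2}x^T H x - c^T x$. This is only an illustrative sketch: the quadratic objective, the function name coordinate_descent, the fixed step-size $1/L$ with $L = \max_i H_{ii}$, and the two selection rules shown (uniform random, and a greedy rule that picks the coordinate with the largest gradient magnitude) are assumptions chosen for the example.

```python
import numpy as np

def coordinate_descent(H, c, iters, rule="random", seed=0):
    """Minimize 0.5 x^T H x - c^T x with single-coordinate gradient steps."""
    rng = np.random.default_rng(seed)
    n = len(c)
    x = np.zeros(n)
    g = H @ x - c                 # full gradient, maintained incrementally
    L = np.max(np.diag(H))        # coordinate-wise Lipschitz constant
    for _ in range(iters):
        if rule == "greedy":
            i = int(np.argmax(np.abs(g)))   # largest gradient magnitude
        else:
            i = int(rng.integers(n))        # uniform random coordinate
        delta = -g[i] / L         # step of size 1/L along coordinate i
        x[i] += delta
        g += delta * H[:, i]      # only column i of H is needed for the update
    return x

# Assumed toy problem with a positive definite H.
rng = np.random.default_rng(1)
M = rng.standard_normal((30, 10))
H = M.T @ M + np.eye(10)
c = rng.standard_normal(10)
x_star = np.linalg.solve(H, c)
for rule in ("random", "greedy"):
    x = coordinate_descent(H, c, 2000, rule=rule)
    print(rule, np.linalg.norm(x - x_star))
```

In this sketch the gradient is tracked incrementally, so an iteration touches a single column of $H$ rather than the whole matrix; this is the kind of structure that makes a coordinate update much cheaper than a full gradient step.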
Nesterov showed that coordinate descent can be faster than gradient descent in cases where, if we are optimizing $n$ variables, the cost of performing one full gradient iteration is similar to the cost of performing $n$ single coordinate updates. Essentially, this says that CD methods are highly efficient for numerous popular ML problems like least-squares, logistic regression, LASSO, SVMs, quadratics, graph-based label propagation algorithms for semi-supervised learning and other sparse graph problems.

Figure 1.1: Visualization of several iterations of cyclic coordinate descent on the level curves of a quadratic function. The steps alternate between updating $x_1$ and $x_2$, and converge towards the minimum value.

Like SGD, these methods have become an important tool for modern high-dimensional ML problems. However, CD methods are appealing over SGD methods because each iteration updates a single variable for all $f_i$ whereas SGD updates all variables but only observes one $f_i$. Thus, we can do a line search to determine $\alpha_k$ at each iteration with CD methods. Further, with CD methods we can improve performance by adapting the algorithmic building blocks to exploit problem structure, which is the main focus of this work.

At each iteration a coordinate $i_k$ (or block of coordinates $b_k$) is chosen to be updated using a certain selection rule. Most commonly this selection is done in a cyclic (see Figure 1.1) or random fashion. Numerous works have recently been published on randomized coordinate descent methods; for a comprehensive overview of CD methods and their variations/extensions, see the summary papers [Shi et al., 2016, Wright, 2015].

Assuming smoothness of $f$ and constant step-size $1/L$, randomized CD methods can be shown to achieve a convergence rate of $O(1/k)$.
If we further assume strong convexity of $f$, it has been shown that randomized CD methods achieve a linear convergence rate in expectation,
$$\mathbb{E}[f(x^{k+1})] - f(x^*) \le \left(1 - \frac{\mu}{Ln}\right)\left[f(x^k) - f(x^*)\right].$$
This is a special case of Nesterov [2012, Theorem 2] with $\alpha = 0$ in his notation. Nesterov [2012] also showed that greedy CD methods (selection of $i$ so as to maximize progress at each iteration) achieve this same bound (not in expectation).

We note that in the above rate, $L$ is the coordinate-wise Lipschitz constant, that is, for each $i = 1, \dots, n$,
$$|\nabla_i f(x + \alpha e_i) - \nabla_i f(x)| \le L|\alpha|, \quad \forall x \in \mathbb{R}^n \text{ and } \alpha \in \mathbb{R},$$
where $e_i$ is a vector with a one in position $i$ and zero in all other positions. For gradient methods, the Lipschitz condition in (1.5) is usually assumed to hold for some Lipschitz constant $L_f$, where the relationship between these constants is $L/n \le L_f$. Thus, the above rate is faster.

1.7 Linear Systems and Kaczmarz Methods

Closely related to CD and SGD methods is the Kaczmarz method [Kaczmarz, 1937], which is designed to solve large-scale consistent (a solution exists) systems of linear equations,
$$Ax = b,$$
where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. This is a fundamental problem in machine learning. At each iteration of the Kaczmarz algorithm, a row $i_k$ is selected and the current iterate $x^k$ is projected onto the hyperplane defined by $a_{i_k}^T x = b_{i_k}$. This gives the iteration
$$x^{k+1} = x^k + \frac{b_{i_k} - a_{i_k}^T x^k}{\|a_{i_k}\|^2} a_{i_k},$$
and it has been proven that this algorithm converges linearly to a solution $x^*$ under weak conditions (e.g., each $i$ is visited infinitely often) using cyclic [Deutsch, 1985, Deutsch and Hundal, 1997, Galántai, 2005] or random [Strohmer and Vershynin, 2009] selection.

The Kaczmarz method projects onto a single hyperplane at each iteration, and thus has a low iteration cost like SGD and CD. In fact, the Kaczmarz method can be expressed as an instance of weighted SGD [Needell et al., 2013] when solving the least-squares problem.
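The projection step above can be sketched in a few lines. The sketch below assumes uniform random row selection (one simple choice among the random selection rules mentioned above) and a small synthetic consistent system; the function name kaczmarz and the test data are hypothetical.

```python
import numpy as np

def kaczmarz(A, b, iters, seed=0):
    """Kaczmarz iterations for a consistent system Ax = b, uniform row sampling."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    row_norms_sq = np.sum(A * A, axis=1)   # ||a_i||^2 for every row
    for _ in range(iters):
        i = int(rng.integers(m))
        # Project the current iterate onto the hyperplane a_i^T x = b_i.
        x = x + (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    return x

# Assumed consistent system built from a known solution.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
print(np.linalg.norm(kaczmarz(A, b, 5000) - x_true))
```

Each iteration reads a single row of $A$ and contains no tunable step-size parameter.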
However, a benefit of using Kaczmarz methods over SGD methods for these types of problems is that Kaczmarz methods use a step-size of $\alpha_k = 1$ for all iterations, thus avoiding the step-size selection issue of SGD methods.

Further, as discussed by Wright [2015], Kaczmarz methods applied to a linear system can also be interpreted as CD methods on the dual problem,
$$\min_y \frac{1}{2}\|A^T y\|^2 - b^T y,$$
where $x = A^T y^*$, so that $Ax = AA^T y^* = b$. As discussed by Ma et al. [2015a], there are several connections/differences between cyclic CD (also known as the Gauss-Seidel method [Seidel, 1874]) and Kaczmarz methods.

1.8 Relaxing Strong Convexity

To prove linear convergence rates for GD and CD, we assume strong convexity of $f$. This is a reasonable assumption as an $\ell_2$-norm regularizer can be added to make any convex objective strongly convex. Nevertheless, there have been various conditions proposed over the years, all with the goal of replacing or weakening the assumption of strong convexity while still guaranteeing linear convergence for problems like least-squares and logistic regression [Anitescu, 2000, Liu and Wright, 2015, Liu et al., 2014, Łojasiewicz, 1963, Luo and Tseng, 1993, Ma et al., 2015b, Necoara et al., 2015, Polyak, 1963, Zhang and Yin, 2013].

The oldest of these conditions is the Polyak-Łojasiewicz (PL) inequality [Łojasiewicz, 1963, Polyak, 1963], which requires that for all $x$, we have
$$\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu \left(f(x) - f^*\right).$$
The PL inequality was proposed in 1963 by Polyak [1963] and is a special case of the Łojasiewicz [1963] inequality proposed in the same year. This condition implies that the gradient grows faster than a quadratic function as we move away from the optimal function value. The PL inequality is sufficient to show a global linear convergence rate for gradient descent without requiring strong convexity (or even convexity). Further, this inequality implies that every stationary point is a global minimum.
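To indicate why the PL inequality is enough for a linear rate, the standard argument can be sketched as follows, assuming gradient descent with the fixed step-size $1/L$ and the Lipschitz gradient condition (1.5). This is a short sketch of a well-known derivation, not a reproduction of the analysis in Chapter 4.

```latex
\begin{align*}
f(x^{k+1}) &\le f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle
              + \frac{L}{2}\|x^{k+1} - x^k\|^2
  && \text{(by (1.5))} \\
           &= f(x^k) - \frac{1}{2L}\|\nabla f(x^k)\|^2
  && \text{(using } x^{k+1} = x^k - \tfrac{1}{L}\nabla f(x^k)\text{)} \\
           &\le f(x^k) - \frac{\mu}{L}\left(f(x^k) - f^*\right)
  && \text{(by the PL inequality).}
\end{align*}
% Subtracting f^* from both sides and applying the bound recursively:
\[
f(x^k) - f^* \le \left(1 - \frac{\mu}{L}\right)^k \left(f(x^0) - f^*\right).
\]
```

The resulting bound has the form (1.7) with $\rho = 1 - \mu/L$ and $\gamma = f(x^0) - f^*$, and convexity of $f$ is not used at any step.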
However, unlike the guarantees of strong convexity, the global minimum need not be unique.

Despite the PL inequality being the oldest of the existing conditions for relaxing strong convexity, it remained relatively unknown in the literature until our recent work [Karimi et al., 2016].

1.9 Proximal First-Order Methods

Several of the first-order methods mentioned in the previous sections have variants that can be used to solve the nonsmooth problem,
$$\min_{x \in \mathbb{R}^n} f(x) + g(x), \tag{1.8}$$
where $f$ is smooth and convex, and $g$ is convex but not necessarily smooth. A classic example of this problem is optimization subject to non-negative constraints,
$$\operatorname*{argmin}_{x \ge 0} f(x),$$
where in this case $g_i$ is the indicator function on the non-negative orthant,
$$g_i(x_i) = \begin{cases} 0 & \text{if } x_i \ge 0, \\ \infty & \text{if } x_i < 0. \end{cases}$$
Another example that has received significant recent attention is the case of an $\ell_1$-regularizer,
$$\operatorname*{argmin}_{x \in \mathbb{R}^n} f(x) + \lambda \|x\|_1,$$
where in this case $g_i = \lambda |x_i|$. Here, the $\ell_1$-norm regularizer is used to encourage sparsity in the solution. It is possible to generalize GD methods to the nonsmooth problem (1.8) by simply considering subgradients instead of gradients. However, these methods only achieve sublinear rates.

One of the most widely used methods for minimizing functions of the form (1.8) is the proximal gradient (PG) method [Beck and Teboulle, 2009, Bertsekas, 2015, Levitin and Polyak, 1966, Nesterov, 2013], which uses an iteration update given by applying the proximal operator to a standard GD update,
$$x^{k+1} = \mathrm{prox}_{\alpha_k g}\left[x^k - \alpha_k \nabla f(x^k)\right],$$
where the proximal operator is given by
$$\mathrm{prox}_{\alpha g}[y] = \operatorname*{argmin}_{x \in \mathbb{R}^n} \frac{1}{2}\|x - y\|^2 + \alpha g(x).$$
The proximal versions of the gradient-based methods presented in the previous sections all achieve the same worst-case convergence rate bounds as the regular versions when $f$ is assumed to be smooth.
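For the $\ell_1$-regularized case, the proximal operator has the well-known closed form of soft-thresholding, so the PG update can be sketched as below for the LASSO objective $\frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$. The function names, the fixed step-size $1/L$ and the synthetic data are assumptions made for this illustration.

```python
import numpy as np

def soft_threshold(y, t):
    """Proximal operator of t * ||.||_1: shrink each entry of y toward zero by t."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def proximal_gradient_lasso(A, b, lam, iters):
    """Minimize 0.5 * ||Ax - b||^2 + lam * ||x||_1 with fixed step-size 1/L."""
    L = np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)            # gradient of the smooth term only
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Assumed toy data; a larger lam typically gives a sparser solution.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8))
b = rng.standard_normal(20)
for lam in (0.1, 5.0):
    x = proximal_gradient_lasso(A, b, lam, 500)
    print(lam, np.count_nonzero(x))
```

For the non-negativity example, the same loop applies with soft_threshold replaced by a projection onto the non-negative orthant, for instance np.maximum(y, 0.0).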
In this work we consider both the smooth problem (1.4) and the nonsmooth problem (1.8).

We also note here that accelerated variants of the methods presented in this chapter exist. For example, it is well-known that the accelerated gradient method achieves the optimal rate of convergence when $f$ is only smooth and convex (not strongly convex) [Nesterov, 1983] (as does proximal gradient descent when $f$ is non-smooth and convex [Nesterov, 2013]). Also, recently an accelerated SGD method was proposed that is robust to noise and variance [Jain et al., 2017]. These methods are not the focus of this work but we expect that all contributions presented in this work should carry over to the accelerated setting.

1.10 Summary of Contributions

In this work we focus on greedy (block) coordinate descent methods and greedy Kaczmarz methods, where for greedy (block) coordinate descent methods, we consider relaxing the strong convexity assumption normally used to show linear convergence. Specifically, we focus on exploiting the cheap iteration costs of these methods, showing faster rates are achievable when we exploit problem structure. Our list of contributions is as follows:

• Chapter 2: We show that greedy CD methods are faster than random CD methods for problems with certain structure. Previous bounds show that random and greedy obtain the same convergence rate bounds, but our analysis gives tighter bounds on greedy methods showing that they can be faster. Our work includes a summary of two general problem classes for which one coordinate descent iteration is $n$ times cheaper than a full-gradient update, conditions that ensure efficient implementation of greedy coordinate selection and an exploration of when greedy selection beats random selection using a simple separable quadratic problem. We present a new greedy selection rule that uses Lipschitz gradient information and has a relationship to the nearest neighbour problem.
We also present results for approximate and proximal variants of greedy selection rules. Finally, we present numerical results to emphasize the efficacy of greedy selection rules for coordinate descent methods.

• Chapter 3: We show that greedy Kaczmarz methods are faster than random Kaczmarz methods for problems where $A$ is sparse. Previous bounds show that random and greedy obtain the same convergence rate bounds, but our analysis gives tighter bounds on greedy methods showing that they can be faster. Our work includes efficient ways to calculate the greedy selection rules when the matrix $A$ is sparse, simpler/tighter convergence rate analysis for randomized selection rules and analysis for greedy selection rules. We present a comparison of general convergence rates for randomized and greedy selection rules, and a comparison of rates for the specific example of a diagonal $A$. We also present analysis for approximate greedy selection rules and a faster randomized method using adaptive selection rules. Finally, we present numerical results to emphasize the efficiency of greedy selection rules for Kaczmarz methods.

• Chapter 4: We show that of the conditions proposed to relax strong convexity, the PL inequality is the weakest condition that still ensures a global minimum despite it being much older and less popular than other existing conditions. Our work includes presenting a formal relationship between several of these existing bounds, showing that the PL inequality can be used to establish the first linear convergence rate analysis for sign-based gradient descent methods and establishing different problem classes for which a proximal extension to the PL inequality holds.

• Chapter 5: We show that by adjusting the algorithmic components of block coordinate descent methods such that they exploit problem structure, we are able to obtain significantly faster methods.
Our work includes proposing new greedy block-selection strategies that guarantee more progress per iteration than the classic greedy rule, exploring previously proposed block update strategies that exploit higher-order information and proving faster local convergence rates, and exploring the use of message-passing to efficiently compute optimal block updates for problems with a sparse dependency between variables. We present numerical results to support all of our findings and establish the efficiency of our greedy block coordinate descent approaches.

• Chapter 6: We show that greedy BCD methods have a finite-time manifold identification property for problems with separable non-smooth structures. Our analysis notably leads to bounds on the number of iterations required to reach the optimal manifold (“active-set complexity”). We show this leads to superlinear convergence when using greedy rules with variable blocks and updates with second-order information for problems with sufficiently-sparse solutions. In the special case of LASSO and SVM problems, we further show that optimal updates are possible. This leads to finite convergence for SVM and LASSO problems with sufficiently-sparse solutions when using greedy selection and sufficiently-large variable blocks. We also use this analysis to show active-set identification and active-set complexity results for the full proximal gradient method.

We note that Chapters 2 to 5 include appendices, which contain extra theoretical and experimental results.
The details in these appendices are for the interested reader and are not required to understand the main ideas in this dissertation.

In Chapter 7 we discuss the impact some of these contributions have had since publication, as well as future extensions.

Chapter 2
Greedy Coordinate Descent

There has been substantial recent interest in applying coordinate descent methods to solve large-scale optimization problems because of their cheap iteration costs, low memory requirements and amenability to parallelization. The seminal work of Nesterov [2012] gave the first global rate of convergence analysis for coordinate-descent methods for minimizing convex functions. The analysis in Nesterov’s work suggests that choosing a random coordinate to update gives the same performance as choosing the “best” coordinate to update via the more expensive Gauss-Southwell (GS) rule. This result gives a compelling argument to use randomized coordinate descent in contexts where the GS rule is too expensive. It also suggests that there is no benefit to using the GS rule in contexts where it is relatively cheap. However, in these contexts, the GS rule often substantially outperforms randomized coordinate selection in practice. This suggests that either the analysis of GS in [Nesterov, 2012] is not tight, or that there exists a class of functions for which the GS rule is as slow as randomized coordinate descent.

In this chapter, we present our work on greedy coordinate descent methods. We first discuss contexts in which it makes sense to use coordinate descent and the GS rule (Section 2.1). In Section 2.2 we give the existing analysis for random and greedy coordinate descent methods presented by Nesterov [2012], and then we give a tighter convergence rate analysis of the GS rule (under strong convexity and standard smoothness assumptions) that yields the same rate as the randomized method for a restricted class of functions, but is otherwise faster (and in some cases substantially faster).
We further show that, compared to the usual constant step-size update of the coordinate, the GS method with varying step-sizes has a provably faster rate (Section 2.4). Furthermore, in Section 2.5, we propose a variant of the GS rule that, similar to Nesterov’s more clever randomized sampling scheme proposed in [Nesterov, 2012], uses knowledge of the Lipschitz constants of the coordinate-wise gradients to obtain a faster rate. We also analyze approximate GS rules (Section 2.6), which provide an intermediate strategy between randomized methods and the exact GS rule. Finally, we analyze proximal gradient variants of the GS rule (Section 2.7) for optimizing problems that include a separable non-smooth term. All our findings are supported by empirical results on some classic machine learning problems (Section 2.8).

2.1 Problems of Interest

The rates of Nesterov show that coordinate descent can be faster than gradient descent in cases where, if we are optimizing $n$ variables, the cost of performing $n$ coordinate updates is similar to the cost of performing one full gradient iteration. Two common problem structures that satisfy this characterization and therefore are amenable to coordinate descent are:

$$
h_1(x) := \sum_{i=1}^{n} g_i(x_i) + f(Ax), \qquad h_2(x) := \sum_{i \in V} g_i(x_i) + \sum_{(i,j) \in E} f_{ij}(x_i, x_j),
$$

where $x_i$ is element $i$ of $x$, $f$ is smooth and cheap, the $f_{ij}$ are smooth, $G = \{V, E\}$ is a graph, and $A$ is a matrix. (It is assumed that all functions are convex.)^2 The family of functions $h_1$ includes core machine-learning problems such as least squares, logistic regression, LASSO, and SVMs (when solved in dual form) [Hsieh et al., 2008]. Family $h_2$ includes quadratic functions, graph-based label propagation algorithms for semi-supervised learning [Bengio et al., 2006], and finding the most likely assignments in continuous pairwise graphical models [Rue and Held, 2005].

In general, the GS rule for problem $h_2$ is as expensive as a full gradient evaluation.
However, the structure of $G$ often allows efficient implementation of the GS rule. For example, if each node has at most $d$ neighbours, we can track the gradients of all the variables and use a max-heap structure to implement the GS rule in $O(d \log n)$ time [Meshi et al., 2012]. This is similar to the cost of the randomized algorithm if $d \approx |E|/n$ (since the average cost of the randomized method depends on the average degree). This condition is true in a variety of applications. For example, in spatial statistics we often use two-dimensional grid-structured graphs, where the maximum degree is four and the average degree is slightly less than four. As another example, for applying graph-based label propagation on the Facebook graph (to detect the spread of diseases, for example), the average number of friends is around 200 but no user has more than seven thousand friends.^3 The maximum number of friends would be even smaller if we removed edges based on proximity. A non-sparse example where GS is efficient is complete graphs, since here the average degree and maximum degree are both $(n - 1)$. Thus, the GS rule is efficient for optimizing dense quadratic functions. On the other hand, GS could be very inefficient for star graphs.

If each column of $A$ has at most $c$ non-zeroes and each row has at most $r$ non-zeroes, then for many notable instances of problem $h_1$ we can implement the GS rule in $O(cr \log n)$ time by maintaining $Ax$ as well as the gradient and again using a max-heap (see Appendix A.1). Thus, GS will be efficient if $cr$ is similar to the number of non-zeroes in $A$ divided by $n$. Otherwise, Dhillon et al.
[2011] show that we can approximate the GS rule for problem $h_1$ with no $g_i$ functions by solving a nearest-neighbour problem. Their analysis of the GS rule in the convex case, however, gives the same convergence rate that is obtained by random selection (although the constant factor can be smaller by a factor of up to $n$). More recently, Shrivastava and Li [2014] give a general method for approximating the GS rule for problem $h_1$ with no $g_i$ functions by writing it as a maximum inner-product search problem.

Footnote 2: We could also consider slightly more general cases like functions that are defined on hyper-edges [Richtárik and Takáč, 2016], provided that we can still perform $n$ coordinate updates for a similar cost to one gradient evaluation.

Footnote 3: https://recordsetter.com/world-record/facebook-friends

2.2 Analysis of Convergence Rates

We are interested in solving the convex optimization problem

$$
\min_{x \in \mathbb{R}^n} f(x), \tag{2.1}
$$

where $\nabla f$ is coordinate-wise $L$-Lipschitz continuous, i.e., for $i = 1, \dots, n$,

$$
|\nabla_i f(x + \alpha e_i) - \nabla_i f(x)| \le L|\alpha|, \quad \forall x \in \mathbb{R}^n \text{ and } \alpha \in \mathbb{R},
$$

where $e_i$ is a vector with a one in position $i$ and zero in all other positions. For twice-differentiable functions, this is equivalent to the assumption that the diagonal elements of the Hessian are bounded in magnitude by $L$. In contrast, the typical assumption used for gradient methods is that $\nabla f$ is $L_f$-Lipschitz continuous (note that $L \le L_f \le Ln$). The coordinate-descent method with constant step-size is based on the iteration

$$
x^{k+1} = x^k - \frac{1}{L} \nabla_{i_k} f(x^k)\, e_{i_k}.
$$

The randomized coordinate-selection rule chooses $i_k$ uniformly from the set $\{1, 2, \dots, n\}$. Alternatively, the GS rule

$$
i_k \in \operatorname*{argmax}_i |\nabla_i f(x^k)|,
$$

chooses the coordinate with the largest directional derivative (see Figure 2.1).
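As a rough illustration of the two selection rules just defined, the following Python sketch runs constant step-size coordinate descent on a quadratic $f(x) = \frac{1}{2}x^T Q x - b^T x$. The quadratic objective, the dense list-of-lists storage, the example data, and the names `objective` and `coordinate_descent` are assumptions made for this example rather than anything taken from the thesis, and the plain $O(n)$ scan used for the GS rule ignores the max-heap implementations discussed in Section 2.1.

```python
import random


def objective(Q, b, x):
    """f(x) = 0.5 * x^T Q x - b^T x for a symmetric matrix Q (list of rows)."""
    n = len(x)
    quad = sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(n))


def coordinate_descent(Q, b, x0, rule, iters, seed=0):
    """Coordinate descent with constant step-size 1/L.

    rule == "gs" picks argmax_i |grad_i f(x)| (Gauss-Southwell);
    any other value picks i uniformly at random.
    """
    n = len(x0)
    x = list(x0)
    # For a quadratic, the coordinate-wise Lipschitz constant is max_i Q_ii.
    L = max(Q[i][i] for i in range(n))
    # Track the full gradient g = Qx - b so that each update costs O(n).
    g = [sum(Q[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    rng = random.Random(seed)
    for _ in range(iters):
        if rule == "gs":
            i = max(range(n), key=lambda j: abs(g[j]))
        else:
            i = rng.randrange(n)
        step = g[i] / L
        x[i] -= step
        for j in range(n):
            g[j] -= Q[j][i] * step  # gradient changes by -step * Q[:, i]
    return x


# Hypothetical separable example: Hessian diag(1, 10, 10, 10).
Q = [[1.0, 0, 0, 0], [0, 10.0, 0, 0], [0, 0, 10.0, 0], [0, 0, 0, 10.0]]
b = [1.0, 1.0, 1.0, 1.0]
x_gs = coordinate_descent(Q, b, [0.0] * 4, "gs", iters=50)
x_rand = coordinate_descent(Q, b, [0.0] * 4, "random", iters=50)
print(objective(Q, b, x_gs), objective(Q, b, x_rand))
```

Tracking the gradient incrementally, rather than recomputing it, is what keeps a single coordinate update cheaper than a full gradient evaluation in this sketch.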
Under either rule, because $\nabla f$ is coordinate-wise Lipschitz continuous, we obtain the following bound on the progress made by each iteration:

$$
\begin{aligned}
f(x^{k+1}) &\le f(x^k) + \nabla_{i_k} f(x^k)\,(x^{k+1} - x^k)_{i_k} + \frac{L}{2}(x^{k+1} - x^k)_{i_k}^2 \\
&= f(x^k) - \frac{1}{L}\left(\nabla_{i_k} f(x^k)\right)^2 + \frac{L}{2}\left[\frac{1}{L}\nabla_{i_k} f(x^k)\right]^2 \\
&= f(x^k) - \frac{1}{2L}\left[\nabla_{i_k} f(x^k)\right]^2.
\end{aligned} \tag{2.2}
$$

We focus on the case where $f$ is $\mu$-strongly convex, meaning that, for some positive $\mu$,

$$
f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^n, \tag{2.3}
$$

which implies that

$$
f(x^*) \ge f(x^k) - \frac{1}{2\mu}\|\nabla f(x^k)\|^2, \tag{2.4}
$$

where $x^*$ is the optimal solution of (2.1). This bound is obtained by minimizing both sides of (2.3) with respect to $y$.

[Figure 2.1 (image not reproduced): three panels, one per coordinate $x_1$, $x_2$, $x_3$, with the Gauss-Southwell choice marked.]
Figure 2.1: Visualization of the Gauss-Southwell selection rule. Shown here are three different projections of a function onto individual coordinates given the corresponding values of $x = [x_1, x_2, x_3]$. The dotted green lines are the individual gradient values (tangent lines) at $x$. We see that the Gauss-Southwell rule selects the coordinate corresponding to the largest (steepest) individual gradient value (in magnitude).

2.2.1 Randomized Coordinate Descent

Conditioning on the $\sigma$-field $\mathcal{F}_{k-1}$ generated by the sequence $\{x^0, x^1, \dots, x^{k-1}\}$, and taking expectations of both sides of (2.2), when $i_k$ is chosen with uniform sampling we obtain

$$
\begin{aligned}
\mathbb{E}[f(x^{k+1})] &\le \mathbb{E}\left[f(x^k) - \frac{1}{2L}\left(\nabla_{i_k} f(x^k)\right)^2\right] \\
&= f(x^k) - \frac{1}{2L}\sum_{i=1}^{n} \frac{1}{n}\left(\nabla_i f(x^k)\right)^2 \\
&= f(x^k) - \frac{1}{2Ln}\|\nabla f(x^k)\|^2.
\end{aligned}
$$

Using (2.4) and subtracting $f(x^*)$ from both sides, we get

$$
\mathbb{E}[f(x^{k+1})] - f(x^*) \le \left(1 - \frac{\mu}{Ln}\right)[f(x^k) - f(x^*)]. \tag{2.5}
$$

This is a special case of Nesterov [2012, Theorem 2] with $\alpha = 0$ in his notation.

2.2.2 Gauss-Southwell

We now consider the progress implied by the GS rule. By the definition of $i_k$,

$$
\left(\nabla_{i_k} f(x^k)\right)^2 = \|\nabla f(x^k)\|_\infty^2 \ge (1/n)\|\nabla f(x^k)\|^2. \tag{2.6}
$$

Applying this inequality to (2.2), we obtain

$$
f(x^{k+1}) \le f(x^k) - \frac{1}{2Ln}\|\nabla f(x^k)\|^2,
$$

which together with (2.4), implies that

$$
f(x^{k+1}) - f(x^*) \le \left(1 - \frac{\mu}{Ln}\right)[f(x^k) - f(x^*)]. \tag{2.7}
$$

This is a special case of Boyd and Vandenberghe [2004, §9.4.3], viewing the GS rule as performing steepest descent in the 1-norm.
While this is faster than known rates for cyclic coordinate selection [Beck and Tetruashvili, 2013] and holds deterministically rather than in expectation, this rate is the same as the randomized rate given in (2.5).

2.2.3 Refined Gauss-Southwell Analysis

The deficiency of the existing GS analysis is that too much is lost when we use the inequality in (2.6). To avoid the need to use this inequality we propose measuring strong convexity in the 1-norm, i.e.,

$$
f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu_1}{2}\|y - x\|_1^2,
$$

which is the analogue of (2.3). Minimizing both sides with respect to $y$, we obtain

$$
\begin{aligned}
f(x^*) &\ge f(x) - \sup_y \left\{ \langle -\nabla f(x), y - x \rangle - \frac{\mu_1}{2}\|y - x\|_1^2 \right\} \\
&= f(x) - \left(\frac{\mu_1}{2}\|\cdot\|_1^2\right)^*(-\nabla f(x)) \\
&= f(x) - \frac{1}{2\mu_1}\|\nabla f(x)\|_\infty^2,
\end{aligned} \tag{2.8}
$$

which makes use of the convex conjugate $\left(\frac{\mu_1}{2}\|\cdot\|_1^2\right)^* = \frac{1}{2\mu_1}\|\cdot\|_\infty^2$ [Boyd and Vandenberghe, 2004, §3.3]. Using (2.8) in (2.2), and the fact that $\left(\nabla_{i_k} f(x^k)\right)^2 = \|\nabla f(x^k)\|_\infty^2$ for the GS rule, we obtain

$$
f(x^{k+1}) - f(x^*) \le \left(1 - \frac{\mu_1}{L}\right)[f(x^k) - f(x^*)]. \tag{2.9}
$$

It is evident that if $\mu_1 = \mu/n$, then the rates implied by (2.5) and (2.9) are identical, but (2.9) is faster if $\mu_1 > \mu/n$. In Appendix A.2, we show that the relationship between $\mu$ and $\mu_1$ can be obtained through the relationship between the squared norms $\|\cdot\|^2$ and $\|\cdot\|_1^2$. In particular, we have

$$
\frac{\mu}{n} \le \mu_1 \le \mu.
$$

Thus, at one extreme the GS rule obtains the same rate as uniform selection ($\mu_1 \approx \mu/n$). However, at the other extreme, it could be faster than uniform selection by a factor of $n$ ($\mu_1 \approx \mu$). This analysis, that the GS rule only obtains the same bound as random selection in an extreme case, supports the better practical behaviour of GS.

2.3 Comparison for Separable Quadratic

We illustrate these two extremes with the simple example of a quadratic function with a diagonal Hessian $\nabla^2 f(x) = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. In this case,

$$
\mu = \min_i \lambda_i, \quad \text{and} \quad \mu_1 = \left(\sum_{i=1}^{n} \frac{1}{\lambda_i}\right)^{-1}.
$$

We prove the correctness of this formula for $\mu_1$ in Appendix A.3.
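To make the bounds $\mu/n \le \mu_1 \le \mu$ concrete, the short Python sketch below evaluates the two formulas above for a diagonal Hessian. The function name `strong_convexity_constants` and the example eigenvalues are illustrative assumptions, not values used in the thesis.

```python
def strong_convexity_constants(lams):
    """Return (mu, mu_1) for a quadratic with Hessian diag(lams), all lams > 0."""
    mu = min(lams)                              # strong convexity in the 2-norm
    mu1 = 1.0 / sum(1.0 / lam for lam in lams)  # strong convexity in the 1-norm
    return mu, mu1


n = 4
# All eigenvalues equal: mu_1 sits at its lower bound mu / n.
mu_eq, mu1_eq = strong_convexity_constants([2.0] * n)
# One small eigenvalue and the rest very large: mu_1 approaches mu.
mu_sk, mu1_sk = strong_convexity_constants([1.0, 1e6, 1e6, 1e6])
print(mu_eq, mu1_eq, mu_sk, mu1_sk)
```

In the first case the GS bound (2.9) coincides with the randomized bound (2.5), while in the second it is better by a factor close to $n$.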
The parameter $\mu_1$ achieves its lower bound when all $\lambda_i$ are equal, $\lambda_1 = \cdots = \lambda_n = \alpha > 0$, in which case

$$
\mu = \alpha \quad \text{and} \quad \mu_1 = \alpha/n.
$$

Thus, uniform selection does as well as the GS rule if all elements of the gradient change at exactly the same rate. This is reasonable: under this condition, there is no apparent advantage in selecting the coordinate to update in a clever way. Intuitively, one might expect that the favourable case for the Gauss-Southwell rule would be where one $\lambda_i$ is much larger than the others. However, in this case, $\mu_1$ is again similar to $\mu/n$. To achieve the other extreme, suppose that $\lambda_1 = \beta$ and $\lambda_2 = \lambda_3 = \cdots = \lambda_n = \alpha$ with $\alpha \ge \beta$. In this case, we have $\mu = \beta$ and

$$
\mu_1 = \frac{\beta\alpha^{n-1}}{\alpha^{n-1} + (n-1)\beta\alpha^{n-2}} = \frac{\beta\alpha}{\alpha + (n-1)\beta}.
$$

If we take $\alpha \to \infty$, then we have $\mu_1 \to \beta$, so $\mu_1 \to \mu$. This case is much less intuitive; GS is $n$ times faster than random coordinate selection if one element of the gradient changes much more slowly than the others.

2.3.1 ‘Working Together’ Interpretation

In the separable quadratic case above, $\mu_1$ is given by the harmonic mean of the eigenvalues of the Hessian divided by $n$. The harmonic mean is dominated by its smallest values, and this is why having one small value is a notable case. Furthermore, the harmonic mean divided by $n$ has an interpretation in terms of processes ‘working together’ [Ferger, 1931]. If each $\lambda_i$ represents the time taken by each process to finish a task (e.g., large values of $\lambda_i$ correspond to slow workers), then $\mu$ is the time needed by the fastest worker to complete the task, and $\mu_1$ is the time needed to complete the task if all processes work together (and have independent effects). Using this interpretation, the GS rule provides the most benefit over random selection when working together is not efficient, meaning that if the $n$ processes work together, then the task is not solved much faster than if the fastest worker performed the task alone.
This gives an interpretation of the non-intuitive scenario where GS provides the most benefit: if all workers have the same efficiency, then working together solves the problem $n$ times faster. Similarly, if there is one slow worker (large $\lambda_i$), then the problem is solved roughly $n$ times faster by working together. On the other hand, if most workers are slow (many large $\lambda_i$), then working together has little benefit and we should be greedy in our selection of the workers.

2.3.2 Fast Convergence with Bias Term

Consider the standard linear-prediction framework,

$$
\operatorname*{argmin}_{x, \beta} \sum_{i=1}^{m} f(a_i^T x + \beta) + \frac{\lambda}{2}\|x\|^2 + \frac{\sigma}{2}\beta^2,
$$

where we have included a bias variable $\beta$ (an example of problem $h_1$). Typically, the regularization parameter $\sigma$ of the bias variable is set to be much smaller than the regularization parameter $\lambda$ of the other covariates, to avoid biasing against a global shift in the predictor. Assuming that there is no hidden strong convexity in the sum, this problem has the structure described in the previous section ($\mu_1 \approx \mu$) where GS has the most benefit over random selection.

2.4 Rates with Different Lipschitz Constants

Consider the more general scenario where we have a Lipschitz constant $L_i$ for the partial derivative of $f$ with respect to each coordinate $i$,

$$
|\nabla_i f(x + \alpha e_i) - \nabla_i f(x)| \le L_i|\alpha|, \quad \forall x \in \mathbb{R}^n \text{ and } \alpha \in \mathbb{R},
$$

and we use a coordinate-dependent step-size at each iteration:

$$
x^{k+1} = x^k - \frac{1}{L_{i_k}} \nabla_{i_k} f(x^k)\, e_{i_k}. \tag{2.10}
$$

By the logic of (2.2), in this setting we have

$$
f(x^{k+1}) \le f(x^k) - \frac{1}{2L_{i_k}}\left[\nabla_{i_k} f(x^k)\right]^2, \tag{2.11}
$$

and thus a convergence rate of

$$
f(x^k) - f(x^*) \le \left[\prod_{j=1}^{k}\left(1 - \frac{\mu_1}{L_{i_j}}\right)\right][f(x^0) - f(x^*)]. \tag{2.12}
$$

Noting that $L = \max_i \{L_i\}$, we have

$$
\prod_{j=1}^{k}\left(1 - \frac{\mu_1}{L_{i_j}}\right) \le \left(1 - \frac{\mu_1}{L}\right)^k. \tag{2.13}
$$

Thus, the convergence rate based on the $L_i$ will be faster, provided that at least one iteration chooses an $i_k$ with $L_{i_k} < L$. In the worst case, however, (2.13) holds with equality even if the $L_i$ are distinct, as we might need to update a coordinate with $L_i = L$ on every iteration.
(For example, consider a separable function where all but one coordinate is initialized at its optimal value, and the remaining coordinate has $L_i = L$.) In Section 2.5, we discuss selection rules that incorporate the $L_i$ to achieve faster rates whenever the $L_i$ are distinct.

2.5 Rules Depending on Lipschitz Constants

If the $L_i$ are known, Nesterov [2012] showed that we can obtain a faster convergence rate by sampling proportional to the $L_i$. We review this result below and compare it to the GS rule, and then propose an improved GS rule for this scenario. Although in this section we will assume that the $L_i$ are known, this assumption can be relaxed using a backtracking procedure [Nesterov, 2012, §6.1].

2.5.1 Lipschitz Sampling

Taking the expectation of (2.11) under the distribution $p_i = L_i / \sum_{j=1}^{n} L_j$ and proceeding as before, we obtain

$$
\mathbb{E}[f(x^{k+1})] - f(x^*) \le \left(1 - \frac{\mu}{n\bar{L}}\right)[f(x^k) - f(x^*)],
$$

where $\bar{L} = \frac{1}{n}\sum_{j=1}^{n} L_j$ is the average of the Lipschitz constants. This was shown by Leventhal and Lewis [2010] and is a special case of Nesterov [2012, Theorem 2] with $\alpha = 1$ in his notation. This rate is faster than (2.5) for uniform sampling if any $L_i$ differ.

Under our analysis, this rate may or may not be faster than (2.9) for the GS rule. On the one extreme, if $\mu_1 = \mu/n$ and any $L_i$ differ, then this Lipschitz sampling scheme is faster than our rate for GS. Indeed, in the context of the problem from Section 2.3, we can make Lipschitz sampling faster than GS by a factor of nearly $n$ by making one $\lambda_i$ much larger than all the others (recall that our analysis shows no benefit to the GS rule over randomized selection when only one $\lambda_i$ is much larger than the others). At the other extreme, in our example from Section 2.3 with many large $\alpha$ and one small $\beta$, the GS and Lipschitz sampling rates are the same when $n = 2$, with a rate of $(1 - \beta/(\alpha + \beta))$.
However, the GS rate will be faster than the Lipschitz sampling rate for any $\alpha > \beta$ when $n > 2$, as the Lipschitz sampling rate is $(1 - \beta/((n-1)\alpha + \beta))$, which is slower than the GS rate of $(1 - \beta/(\alpha + (n-1)\beta))$.

2.5.2 Gauss-Southwell-Lipschitz Rule

Since neither Lipschitz sampling nor GS dominates the other in general, we are motivated to consider if faster rules are possible by combining the two approaches. Indeed, we obtain a faster rate by choosing the $i_k$ that minimizes the bound in (2.11), leading to the rule

$$
i_k \in \operatorname*{argmax}_i \frac{|\nabla_i f(x^k)|}{\sqrt{L_i}},
$$

which we call the Gauss-Southwell-Lipschitz (GSL) rule. We see in Figure 2.2 that if the directional derivative between two coordinates is equal, then compared to the GS rule, the GSL rule will select the coordinate that leads to more function value progress (smaller $L_i$).

[Figure 2.2 (image not reproduced): two panels for coordinates $x_1$ and $x_2$, marking the Gauss-Southwell and Gauss-Southwell-Lipschitz choices.]
Figure 2.2: Visualization of the Gauss-Southwell-Lipschitz selection rule compared to the Gauss-Southwell selection rule. When the slopes of the tangent lines (gradient values) are similar, the GSL will make more progress by selecting the coordinate with the slower changing derivative (smaller $L_i$).

Following a similar argument to Section 2.2.3, but using (2.11) in place of (2.2), the GSL rule obtains a convergence rate of

$$
f(x^{k+1}) - f(x^*) \le (1 - \mu_L)[f(x^k) - f(x^*)],
$$

where $\mu_L$ is the strong convexity constant with respect to the norm $\|x\|_L = \sum_{i=1}^{n} \sqrt{L_i}\,|x_i|$. This is shown in Appendix A.4, and in Appendix A.5 we show that

$$
\max\left\{\frac{\mu}{n\bar{L}}, \frac{\mu_1}{L}\right\} \le \mu_L \le \frac{\mu_1}{\min_i \{L_i\}}.
$$

Thus, the GSL rule is always at least as fast as the fastest of the GS rule and Lipschitz sampling. Indeed, it can be more than a factor of $n$ faster than using Lipschitz sampling, while it can obtain a rate closer to the minimum $L_i$, instead of the maximum $L_i$ that the classic GS rule depends on.

An interesting property of the GSL rule for quadratic functions is that it is the optimal myopic coordinate update.
That is, if we have an oracle that can choose the coordinate and the step-size that decreases $f$ by the largest amount, i.e.,

$$
f(x^{k+1}) = \min_{i, \alpha} \{f(x^k + \alpha e_i)\}, \tag{2.14}
$$

this is equivalent to using the GSL rule and the update in (2.10). This follows because (2.11) holds with equality in the quadratic case, and the choice $\alpha_k = 1/L_{i_k}$ yields the optimal step-size. Thus, although faster schemes could be possible with non-myopic strategies that cleverly choose the sequence of coordinates or step-sizes, if we can only perform one iteration, then the GSL rule cannot be improved.

For general $f$, (2.14) is known as the maximum improvement (MI) rule. This rule has been used in the context of boosting [Rätsch et al., 2001], graphical models [Della Pietra et al., 1997, Lee et al., 2006, Scheinberg and Rish, 2009], Gaussian processes [Bo and Sminchisescu, 2012], and low-rank tensor approximations [Li et al., 2015]. By the argument

$$
\begin{aligned}
f(x^{k+1}) &= \min_\alpha \{f(x^k + \alpha e_{i_k})\} \\
&\le f\left(x^k - \frac{1}{L_{i_k}} \nabla_{i_k} f(x^k)\, e_{i_k}\right) \\
&\le f(x^k) - \frac{1}{2}\frac{\left[\nabla_{i_k} f(x^k)\right]^2}{L_{i_k}},
\end{aligned} \tag{2.15}
$$

our GSL rate also applies to the MI rule, improving existing bounds on this strategy. However, the GSL rule is much cheaper and does not require any special structure (recall that we can estimate $L_i$ as we go).

2.5.3 Connection between GSL Rule and Normalized Nearest Neighbour Search

Dhillon et al. [2011] discuss an interesting connection between the GS rule and the nearest-neighbour-search (NNS) problem for objectives of the form

$$
\min_{x \in \mathbb{R}^n} F(x) = f(Ax). \tag{2.16}
$$

This is a special case of $h_1$ with no $g_i$ functions, and its gradient has the special form

$$
\nabla F(x) = A^T r(x),
$$

where $r(x) = \nabla f(Ax)$. We use the symbol $r$ because it is the residual vector $(Ax - b)$ in the special case of least squares. For this problem structure the GS rule has the form

$$
i_k \in \operatorname*{argmax}_i |\nabla_i f(x^k)| \equiv \operatorname*{argmax}_i |r(x^k)^T a_i|,
$$

where $a_i$ denotes column $i$ of $A$ for $i = 1, \dots, n$. Dhillon et al. [2011] propose to approximate the above argmax by solving the following NNS problem (see Figure 2.3)

$$
i_k \in \operatorname*{argmin}_{i \in [2n]} \|r(x^k) - a_i\|,
$$

where $i$ in the range $(n + 1)$ through $2n$ refers to the negation $-(a_{i-n})$ of column $(i - n)$ and if the selected $i_k$ is greater than $n$ we return $(i_k - n)$.

Figure 2.3: Visualization of the GS rule as a nearest neighbour problem. The “nearest neighbour” corresponds to the vector that is the closest (in distance) to $r(x)$, i.e., we want to minimize the distance between two vectors. Alternatively, the GS rule is evaluated as a maximization of an inner product. This makes sense as the smaller the angle between two vectors, the larger the cosine of that angle and in turn, the larger the inner product.

We can justify this approximation using the logic

$$
\begin{aligned}
i_k &\in \operatorname*{argmin}_{i \in [2n]} \|r(x^k) - a_i\| \\
&\equiv \operatorname*{argmin}_{i \in [2n]} \frac{1}{2}\|r(x^k) - a_i\|^2 \\
&\equiv \operatorname*{argmin}_{i \in [2n]} \underbrace{\frac{1}{2}\|r(x^k)\|^2}_{\text{constant}} - r(x^k)^T a_i + \frac{1}{2}\|a_i\|^2 \\
&\equiv \operatorname*{argmax}_{i \in [n]} |r(x^k)^T a_i| - \frac{1}{2}\|a_i\|^2 \\
&\equiv \operatorname*{argmax}_{i \in [n]} |\nabla_i f(x^k)| - \frac{1}{2}\|a_i\|^2.
\end{aligned}
$$

Thus, the NNS computes an approximation to the GS rule that is biased towards coordinates where $\|a_i\|$ is small. Note that this formulation is equivalent to the GS rule in the special case that $\|a_i\| = 1$ (or any other constant) for all $i$. Shrivastava and Li [2014] have more recently considered the case where $\|a_i\| \le 1$ and incorporate powers of $\|a_i\|$ in the NNS to yield a better approximation.

A further interesting property of the GSL rule is that we can often formulate the exact GSL rule as a normalized NNS problem. In particular, for problem (2.16) the Lipschitz constants will often have the form $L_i = \gamma\|a_i\|^2$ for some positive scalar $\gamma$. For example, least squares has $\gamma = 1$ and logistic regression has $\gamma = 0.25$. When the Lipschitz constants have this form, we can compute the exact GSL rule by solving a normalized NNS problem,

$$
i_k \in \operatorname*{argmin}_{i \in [2n]} \left\| r(x^k) - \frac{a_i}{\|a_i\|} \right\|. \tag{2.17}
$$

The exactness of this formula follows because

$$
\begin{aligned}
i_k &\in \operatorname*{argmin}_{i \in [2n]} \left\| r(x^k) - \frac{a_i}{\|a_i\|} \right\| \\
&\equiv \operatorname*{argmin}_{i \in [2n]} \frac{1}{2}\left\| r(x^k) - a_i/\|a_i\| \right\|^2 \\
&\equiv \operatorname*{argmin}_{i \in [2n]} \underbrace{\frac{1}{2}\|r(x^k)\|^2}_{\text{constant}} - \frac{r(x^k)^T a_i}{\|a_i\|} + \underbrace{\frac{1}{2}\frac{\|a_i\|^2}{\|a_i\|^2}}_{\text{constant}} \\
&\equiv \operatorname*{argmax}_{i \in [n]} \frac{|r(x^k)^T a_i|}{\|a_i\|} \\
&\equiv \operatorname*{argmax}_{i \in [n]} \frac{|r(x^k)^T a_i|}{\sqrt{\gamma}\,\|a_i\|} \\
&\equiv \operatorname*{argmax}_{i \in [n]} \frac{|\nabla_i f(x^k)|}{\sqrt{L_i}}.
\end{aligned}
$$

Thus, the form of the Lipschitz constant conveniently removes the bias towards smaller values of $\|a_i\|$ that gets introduced when we try to formulate the classic GS rule as a NNS problem. Interestingly, in this setting we do not need to know $\gamma$ to implement the GSL rule as a NNS problem.

2.6 Approximate Gauss-Southwell

In many applications, computing the exact GS rule is too inefficient to be of any practical use. However, a computationally cheaper approximate GS rule might be available. Approximate GS rules under multiplicative and additive errors were considered by Dhillon et al. [2011] in the convex case, but in this setting the convergence rate is similar to the rate achieved by random selection. In this section, we give rates depending on $\mu_1$ for approximate GS rules.

2.6.1 Multiplicative Errors

In the multiplicative error regime, the approximate GS rule chooses an $i_k$ satisfying

$$
|\nabla_{i_k} f(x^k)| \ge \|\nabla f(x^k)\|_\infty (1 - \epsilon_k),
$$

for some $\epsilon_k \in [0, 1)$. In this regime, our basic bound on the progress (2.2) still holds, as it was defined for any $i_k$. We can incorporate this type of error into our lower bound (2.8) to obtain

$$
\begin{aligned}
f(x^*) &\ge f(x^k) - \frac{1}{2\mu_1}\|\nabla f(x^k)\|_\infty^2 \\
&\ge f(x^k) - \frac{1}{2\mu_1(1 - \epsilon_k)^2}|\nabla_{i_k} f(x^k)|^2.
\end{aligned}
$$

This implies a convergence rate of

$$
f(x^{k+1}) - f(x^*) \le \left(1 - \frac{\mu_1(1 - \epsilon_k)^2}{L}\right)[f(x^k) - f(x^*)].
$$

Thus, the convergence rate of the method is nearly identical to using the exact GS rule for small $\epsilon_k$ (and it degrades gracefully with $\epsilon_k$). This is in contrast to having an error in the gradient [Friedlander and Schmidt, 2012], where the error must decrease to zero over time.

2.6.2 Additive Errors

In the additive error regime, the approximate GS rule chooses an $i_k$ satisfying

$$
|\nabla_{i_k} f(x^k)| \ge \|\nabla f(x^k)\|_\infty - \epsilon_k,
$$

for some $\epsilon_k \ge 0$.
In Appendix A.6, we show that under this rule, we have

$$
f(x^{k+1}) - f(x^*) \le \left(1 - \frac{\mu_1}{L}\right)^k \left[f(x^0) - f(x^*) + A_k\right],
$$

where

$$
A_k \le \min\left\{ \sum_{i=1}^{k}\left(1 - \frac{\mu_1}{L}\right)^{-i} \epsilon_i \frac{\sqrt{2L_1}}{L}\sqrt{f(x^0) - f(x^*)},\; \sum_{i=1}^{k}\left(1 - \frac{\mu_1}{L}\right)^{-i}\left(\epsilon_i\sqrt{\frac{2}{L}}\sqrt{f(x^0) - f(x^*)} + \frac{\epsilon_i^2}{2L}\right) \right\},
$$

where $L_1$ is the Lipschitz constant of $\nabla f$ with respect to the 1-norm. Note that $L_1$ could be substantially larger than $L$, so the second part of the minimum in $A_k$ is likely to be the smaller part unless the $\epsilon_i$ are large. This regime is closer to the case of having an error in the gradient, as to obtain convergence the $\epsilon_k$ must decrease to zero. This result implies that a sufficient condition for the algorithm to obtain a linear convergence rate is that the errors $\epsilon_k$ converge to zero at a linear rate. Further, if the errors satisfy $\epsilon_k = O(\rho^k)$ for some $\rho < (1 - \mu_1/L)$, then the convergence rate of the method is the same as if we used an exact GS rule. On the other hand, if $\epsilon_k$ does not decrease to zero, we may end up repeatedly updating the same wrong coordinate and the algorithm will not converge (though we could switch to the randomized method if this is detected).

2.7 Proximal Gradient Gauss-Southwell

One of the key motivations for the resurgence of interest in coordinate descent methods is their performance on problems of the form

$$
\min_{x \in \mathbb{R}^n} F(x) \equiv f(x) + \sum_{i=1}^{n} g_i(x_i),
$$

where $f$ is smooth and convex and the $g_i$ are convex, but possibly non-smooth. This includes problems with $\ell_1$-regularization, and optimization with lower and/or upper bounds on the variables. Similar to proximal gradient methods, we can apply the proximal operator to the coordinate update,

$$
x^{k+1} = \operatorname{prox}_{\frac{1}{L} g_{i_k}}\left[x^k - \frac{1}{L}\nabla_{i_k} f(x^k)\, e_{i_k}\right],
$$

where

$$
\operatorname{prox}_{\alpha g_i}[y] = \operatorname*{argmin}_{x \in \mathbb{R}^n} \frac{1}{2}\|x - y\|^2 + \alpha g_i(x_i).
$$

We note that all variables other than the selected $x_i$ stay at their existing values in the optimal solution to this problem.
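For the common choice $g_i(x_i) = \lambda|x_i|$ ($\ell_1$-regularization), the proximal operator has the well-known closed form of soft-thresholding, so the coordinate update above can be sketched in a few lines of Python. The specific choice of $g_i$, the argument `grad_i` (the caller is assumed to supply $\nabla_i f(x^k)$), the example numbers, and the function names are assumptions made for illustration.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.| evaluated at z (t >= 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def prox_coordinate_step(x, i, grad_i, L, lam):
    """Update coordinate i only: x_i <- prox_{(lam/L)|.|}(x_i - grad_i / L)."""
    x_new = list(x)  # all other coordinates keep their current values
    x_new[i] = soft_threshold(x[i] - grad_i / L, lam / L)
    return x_new


x_next = prox_coordinate_step([1.0, -2.0], i=0, grad_i=4.0, L=2.0, lam=1.0)
print(x_next)  # [-0.5, -2.0]
```

Here the gradient step moves $x_1$ from $1$ to $-1$, and the threshold $\lambda/L = 0.5$ then shrinks it towards zero to $-0.5$.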
With random coordinate selection, Richtárik and Takáč [2014] show that this method has a convergence rate of

$$
\mathbb{E}[F(x^{k+1}) - F(x^*)] \le \left(1 - \frac{\mu}{nL}\right)[F(x^k) - F(x^*)],
$$

similar to the unconstrained/smooth case.

There are several generalizations of the GS rule to this scenario. Here we consider three possibilities, all of which are equivalent to the GS rule if the $g_i$ are not present. First, the GS-s rule chooses the coordinate with the most negative directional derivative. This strategy is popular for $\ell_1$-regularization [Li and Osher, 2009, Shevade and Keerthi, 2003, Wu and Lange, 2008] and in general is given by [see Bertsekas, 2016, §8.4]

$$
i_k \in \operatorname*{argmax}_i \left\{ \min_{s \in \partial g_i} |\nabla_i f(x^k) + s| \right\}.
$$

However, the length of the step ($\|x^{k+1} - x^k\|$) could be arbitrarily small under this choice. In contrast, the GS-r rule chooses the coordinate that maximizes the length of the step [Dhillon et al., 2011, Tseng and Yun, 2009b],

$$
i_k \in \operatorname*{argmax}_i \left\{ \left| x_i^k - \operatorname{prox}_{\frac{1}{L} g_i}\left[x_i^k - \frac{1}{L}\nabla_i f(x^k)\right] \right| \right\}.
$$

This rule is effective for bound-constrained problems, but it ignores the change in the non-smooth term ($g_i(x_i^{k+1}) - g_i(x_i^k)$). Finally, the GS-q rule maximizes progress assuming a quadratic upper bound on $f$ [Tseng and Yun, 2009b],

$$
i_k \in \operatorname*{argmin}_i \left\{ \min_d \left\{ f(x^k) + \nabla_i f(x^k)\, d + \frac{L}{2} d^2 + g_i(x_i^k + d) - g_i(x_i^k) \right\} \right\}.
$$

While the least intuitive rule, the GS-q rule seems to have the best theoretical properties. Further, if we use $L_i$ in place of $L$ in the GS-q rule (which we call the GSL-q strategy), then we obtain the GSL rule if the $g_i$ are not present. In contrast, using $L_i$ in place of $L$ in the GS-r rule (which we call the GSL-r strategy) does not yield the GSL rule as a special case.

In Appendix A.7, we show that using the GS-q rule yields a convergence rate of

$$
F(x^{k+1}) - F(x^*) \le \min\left\{ \left(1 - \frac{\mu}{Ln}\right)[F(x^k) - F(x^*)],\; \left(1 - \frac{\mu_1}{L}\right)[F(x^k) - F(x^*)] + \epsilon_k \right\}, \tag{2.18}
$$

where $\epsilon_k$ is bounded above by a measure of the non-linearity of the $g_i$ along the possible coordinate updates times the inverse condition number $\mu_1/L$. Note that $\epsilon_k$ goes to zero as $k$ increases.
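For $g_i(x_i) = \lambda|x_i|$ the three rules above have simple closed forms, which the Python sketch below evaluates for a given gradient. This is an illustration under that assumed choice of $g_i$ only: the function names and the example numbers are hypothetical, the inner minimization of GS-q is solved with the soft-thresholding step, and the constant $f(x^k)$ is dropped from the GS-q score because it does not change the argmin.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.| evaluated at z (t >= 0)."""
    return max(z - t, 0.0) - max(-z - t, 0.0)


def gs_scores(x, grad, L, lam):
    """Per-coordinate scores of the GS-s, GS-r and GS-q rules for g_i = lam * |x_i|."""
    s, r, q = [], [], []
    for xi, gi in zip(x, grad):
        # GS-s: smallest |grad_i + s| over the subdifferential of lam * |.| at x_i.
        if xi != 0.0:
            s.append(abs(gi + lam * (1.0 if xi > 0 else -1.0)))
        else:
            s.append(max(abs(gi) - lam, 0.0))
        # GS-r: length of the proximal coordinate step d.
        d = soft_threshold(xi - gi / L, lam / L) - xi
        r.append(abs(d))
        # GS-q: quadratic model value at its minimizer d (more negative is better).
        q.append(gi * d + 0.5 * L * d * d + lam * (abs(xi + d) - abs(xi)))
    return s, r, q


def select(x, grad, L, lam):
    """Return the coordinate chosen by each rule."""
    s, r, q = gs_scores(x, grad, L, lam)
    n = len(x)
    return {
        "GS-s": max(range(n), key=lambda i: s[i]),
        "GS-r": max(range(n), key=lambda i: r[i]),
        "GS-q": min(range(n), key=lambda i: q[i]),
    }


choice = select([0.0, 1.0], [3.0, -0.5], L=1.0, lam=1.0)
print(choice)
```

In this small example all three rules pick the first coordinate; they can disagree on other inputs, which is what the differing theoretical properties reflect.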
In contrast, in Appendix A.7 we also give counter-examples showing that the rate in (2.18) does not hold with $\epsilon_k = 0$ for the GS-s or GS-r rule, even if the minimum is replaced by a maximum. Thus, any bound for the GS-s or GS-r rule would be slower than the expected rate under random selection, while the GS-q rule leads to a better bound.

Recently, Song et al. [2017] proposed an alternative GS-q rule for $\ell_1$-regularized problems ($g(x) := \|x\|_1$) that uses an $\ell_1$-norm square approximation. A generalized version of this update rule is given by

$$
\begin{aligned}
d^k &\in \operatorname*{argmin}_{d \in \mathbb{R}^n} \left\{ \langle \nabla f(x^k), d \rangle + \frac{L}{2}\|d\|_1^2 + g(x^k + d) \right\}, \\
x^{k+1} &= x^k + d^k,
\end{aligned}
$$

which is equivalent to the GS rule in the smooth case [Boyd and Vandenberghe, 2004, §9.4.2]. Song et al. [2017] show that this version of the GS-q rule improves the convergence rate by a constant factor over random in the convex, $\ell_1$-regularized ERM setting. In Appendix A.8 we show that the $\epsilon_k$ term in (2.18) is zero when using this update and assuming $L_1$-Lipschitz continuity. However, unlike the other non-smooth generalizations of the GS rule, this generalization may select more than one variable to update at each iteration (update all coordinates corresponding to non-zero entries in $d^k$) making this method more like block coordinate descent.

[Figure 2.4 (image not reproduced): four panels plotting objective against epochs (0 to 100). Panel "$\ell_2$-regularized sparse least squares": Cyclic, Random, Lipschitz, GS, GSL. Panel "$\ell_2$-regularized sparse logistic regression": Cyclic, Lipschitz, Random, GS and GSL, each with constant and exact step-sizes. Panel "Over-determined dense least squares": Lipschitz, Cyclic, Random, GS, Approximated-GS, Approximated-GSL. Panel "$\ell_1$-regularized underdetermined sparse least squares": Random, Cyclic, Lipschitz, GS-q, GS-r, GS-s, GSL-q, GSL-r.]
Figure 2.4: Comparison of coordinate selection rules for 4 instances of problem $h_1$.

2.8 Experiments

We first compare the efficacy of different coordinate selection rules on the following simple instances of $h_1$.

$\ell_2$-regularized sparse least squares: Here we consider the problem

$$
\min_x \frac{1}{2m}\|Ax - b\|^2 + \frac{\lambda}{2}\|x\|^2,
$$

an instance of problem $h_1$. We set $A$ to be an $m$ by $n$ matrix with entries sampled from a $\mathcal{N}(0, 1)$ distribution (with $m = 1000$ and $n = 1000$). We then added 1 to each entry (to induce a dependency between columns), multiplied each column by a sample from $\mathcal{N}(0, 1)$ multiplied by ten (to induce different Lipschitz constants across the coordinates), and only kept each entry of $A$ non-zero with probability $10 \log(n)/n$ (a sparsity level that allows the Gauss-Southwell rule to be applied with cost $O(\log^3(n))$). We set $\lambda = 1$ and $b = Ax + e$, where the entries of $x$ and $e$ were drawn from a $\mathcal{N}(0, 1)$ distribution. In this setting, we used a step-size of $1/L_i$ for each coordinate $i$, which corresponds to exact coordinate optimization.

$\ell_2$-regularized sparse logistic regression: Here we consider the problem

$$
\min_x \frac{1}{m}\sum_{i=1}^{m} \log(1 + \exp(-b_i a_i^T x)) + \frac{\lambda}{2}\|x\|^2.
$$

We set the $a_i^T$ to be the rows of $A$ from the previous problem, and set $b = \operatorname{sign}(Ax)$, but randomly flipping each $b_i$ with probability 0.1. In this setting, we compared using a step-size of $1/L_i$ to using exact coordinate optimization.

Over-determined dense least squares: Here we consider the problem

$$
\min_x \frac{1}{2m}\|Ax - b\|^2,
$$

but, unlike the previous case, we do not set elements of $A$ to zero and we make $A$ have dimension 1000 by 100. Because the system is over-determined, it does not need an explicit strongly convex regularizer to induce global strong convexity. In this case, the density level means that the exact GS rule is not efficient. Hence, we use a balltree structure [Omohundro, 1989] to implement an efficient approximate GS rule based on the connection to the NNS problem discovered by Dhillon et al. [2011].
On the other hand, we can compute the exact GSL rule for this problem as a NNS problem as discussed in Section 2.5.3.

$\ell_1$-regularized underdetermined sparse least squares: Here we consider the non-smooth problem
\[
\min_x \frac{1}{2m}\|Ax - b\|^2 + \lambda\|x\|_1.
\]
We generate $A$ as we did for the $\ell_2$-regularized sparse least squares problem, except with the dimension 1000 by 10000. This problem is not globally strongly convex, but will be strongly convex along the dimensions that are non-zero in the optimal solution.

We plot the objective function (divided by its initial value) of coordinate descent under different selection rules in Figure 2.4. Even on these simple datasets, we see dramatic differences in performance between the different strategies. In particular, the GS rule outperforms random coordinate selection (as well as cyclic selection) by a substantial margin in all cases. The Lipschitz sampling strategy can narrow this gap, but it remains large (even when an approximate GS rule is used). The difference between GS and randomized selection seems to be most dramatic for the $\ell_1$-regularized problem; the GS rules tend to focus on the non-zero variables while most randomized/cyclic updates focus on the zero variables, which tend not to move away from zero.⁴ Exact coordinate optimization and using the GSL rule seem to give modest but consistent improvements. The three non-smooth GS-∗ rules had nearly identical performance despite their different theoretical properties. The GSL-q rule gave better performance than the GS-∗ rules, while the GSL-r variant performed worse than even cyclic and random strategies. We found it was also possible to make the GS-s rule perform poorly by perturbing the initialization away from zero.

While the results in this section are in terms of epochs, we direct the reader to Nutini et al. [2015, Appendix I] for runtime experiments for the $\ell_2$-regularized sparse least squares problem defined above.
The authors use the efficient max-heap implementation in scikit-learn [Pedregosa et al., 2011] and show that the GS and GSL rules also offer benefits in terms of runtime over cyclic, random and Lipschitz sampling.

We next consider an instance of problem $h_2$, that is, performing label propagation for semi-supervised learning in the 'two moons' dataset [Zhou et al., 2003]. We generate 500 samples from this dataset, randomly label five points in the data, and connect each node to its five nearest neighbours. This high level of sparsity is typical of graph-based methods for semi-supervised learning, and allows the exact Gauss-Southwell rule to be implemented efficiently. We use the quadratic labeling criterion of Bengio et al. [2006], which allows exact coordinate optimization and is normally optimized with cyclic coordinate descent. We plot the performance under different selection rules in Figure 2.5. Here, we see that even cyclic coordinate descent outperforms randomized coordinate descent, but that the GS and GSL rules give even better performance. We note that the GS and GSL rules perform similarly on this problem since the Lipschitz constants do not vary much.

⁴ To reduce the cost of the GS-s method in this context, Shevade and Keerthi [2003] consider a variant that first computes the GS-s rule for the non-zero variables and, if an element is sufficiently large, does not consider the zero variables.

[Figure 2.5: Comparison of coordinate selection rules for graph-based semi-supervised learning. Panel title: Graph-based label propagation. Axes: Epochs (horizontal) against Objective (vertical). Curves: Cyclic, Random, Lipschitz, GS, GSL.]

2.9 Discussion

It is clear that the GS rule is not practical for every problem where randomized methods are applicable. Nevertheless, we have shown that even approximate GS rules can obtain better convergence rate bounds than fully-randomized methods.
We have given a similar justification for the use of exact coordinate optimization, and we note that our argument could also be used to justify the use of exact coordinate optimization within randomized coordinate descent methods (as used in our experiments). We have also proposed the improved GSL rule, and considered approximate/proximal variants. In Chapter 4 we show that our analysis can be used for scenarios without strong convexity and in Chapter 5 we show it also applies to block updates. We expect it could also be used for accelerated/parallel methods [Fercoq and Richtárik, 2015], for primal-dual rates of dual coordinate ascent [Shalev-Shwartz and Zhang, 2013], for successive projection methods [Leventhal and Lewis, 2010] and for boosting algorithms [Rätsch et al., 2001].

Chapter 3
Greedy Kaczmarz

Solving large linear systems is a fundamental problem in machine learning. Applications range from least-squares problems to Gaussian processes to graph-based semi-supervised learning. All of these applications (and many others) benefit from advances in solving large-scale linear systems. The Kaczmarz method is a particular iterative algorithm suited for solving consistent linear systems of the form $Ax = b$. This method as we know it today was originally proposed by Polish mathematician Stefan Kaczmarz [1937] and later re-invented by Gordon et al. [1970] under the name algebraic reconstruction technique (ART). However, this method is closely related to the method of alternating projections, which was proposed by von Neumann in 1933 for the case of 2 subspaces (published in 1950, von Neumann [1950]).
It has been used in numerous applications including image reconstruction and digital signal processing, and belongs to several general categories of methods including row-action, component-solution, cyclic projection, successive projection methods [Censor, 1981] and stochastic gradient descent (when applied to a least-squares problem [Needell et al., 2013]).

At each iteration $k$, the Kaczmarz method uses a selection rule to choose some row $i_k$ of $A$ and then projects the current iterate $x^k$ onto the corresponding hyperplane $a_{i_k}^T x = b_{i_k}$. Classically, the two categories of selection rules are cyclic and random. Cyclic selection repeatedly cycles through the coordinates in sequential order, making it simple to implement and computationally inexpensive. There are various linear convergence rates for cyclic selection [see Deutsch, 1985, Deutsch and Hundal, 1997, Galántai, 2005], but these rates are in terms of cycles through the entire dataset and involve constants that are not easily interpreted. Further, the performance of cyclic selection worsens if we have an undesirable ordering of the rows of $A$.

Randomized selection has recently become the default selection rule in the literature on Kaczmarz-type methods. Empirically, selecting $i_k$ randomly often performs substantially better in practice than cyclic selection [Feichtinger et al., 1992, Herman and Meyer, 1993]. Although a number of asymptotic convergence rates for randomized selection have been presented [Censor et al., 1983, Hanke and Niethammer, 1990, Tanabe, 1971, Whitney and Meany, 1967], the pivotal theoretical result supporting the use of randomized selection for the Kaczmarz method was given by Strohmer and Vershynin [2009]. They proved a simple non-asymptotic linear convergence rate (in expectation) in terms of the number of iterations, when rows are selected proportional to their squared norms.
This work spurred numerous extensions and generalizations of the randomized Kaczmarz method [Lee and Sidford, 2013, Leventhal and Lewis, 2010, Liu and Wright, 2014, Ma et al., 2015a, Needell, 2010, Zouzias and Freris, 2013], including similar rates when we replace the equality constraints with inequality constraints.

Rather than cyclic or randomized, in this chapter we consider greedy selection rules. There are very few results in the literature that explore the use of greedy selection rules for Kaczmarz-type methods. Griebel and Oswald [2012] present the maximum residual rule for multiplicative Schwarz methods, for which the randomized Kaczmarz iteration is a special case. Their theoretical results show similar convergence rate estimates for both greedy and random methods, suggesting there is no advantage of greedy selection over randomized selection (since greedy selection has additional computational costs). Eldar and Needell [2011] propose a greedy maximum distance rule, which they approximate using the Johnson-Lindenstrauss [1984] transform to reduce the computation cost. They show that this leads to a faster algorithm in practice, and show that this rule may achieve more progress than random selection on certain iterations.

In the next section, we define several relevant problems of interest in machine learning that can be solved via Kaczmarz methods. Subsequently, we define the greedy selection rules and discuss cases where they can be computed efficiently. In Section 3.3 we give faster convergence rate analyses for both the maximum residual rule and the maximum distance rule, which clarify the relationship of these rules to random selection and show that greedy methods will typically have better convergence rates than randomized selection.
Section 3.4 contrasts Kaczmarz methods with coordinate descent methods, Section 3.5 considers a simplified setting where we explicitly compute the constants in the convergence rates, Section 3.6 considers how these convergence rates are changed under approximations to the greedy rules, and Section 3.7 discusses the case of inequality constraints. We also propose provably-faster randomized selection rules for matrices $A$ with pairwise-orthogonal rows by using the so-called "orthogonality graph" (Section 3.8). Finally, in Section 3.9 we present numerical experiments evaluating greedy Kaczmarz methods.

3.1 Problems of Interest

We first consider systems of linear equations,
\[
Ax = b, \tag{3.1}
\]
where $A$ is an $m \times n$ matrix and $b \in \mathbb{R}^m$. We assume the system is consistent, meaning a solution $x^*$ exists. We denote the rows of $A$ by $a_1^T, \dots, a_m^T$, where each $a_i \in \mathbb{R}^n$ and all rows have at least one non-zero entry, and use $b = (b_1, \dots, b_m)^T$, where each $b_i \in \mathbb{R}$. One of the most important examples of a consistent linear system, and a fundamental model in machine learning, is the least squares problem,
\[
\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|^2.
\]
An appealing way to write a least squares problem as a linear system is to solve the $(n + m)$-variable consistent system [see also Zouzias and Freris, 2013]
\[
\begin{pmatrix} A & -I \\ 0 & A^T \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix}.
\]
Other applications in machine learning that involve solving consistent linear systems include: least-squares support vector machines, Gaussian processes, fitting the final layer of a neural network (using squared-error), graph-based semi-supervised learning or other graph-Laplacian problems [Bengio et al., 2006], and finding the optimal configuration in Gaussian Markov random fields [Rue and Held, 2005].

Kaczmarz methods can also be applied to solve consistent systems of linear inequalities,
\[
Ax \leq b,
\]
or combinations of linear equalities and inequalities. We believe there is a lot of potential to use this application of Kaczmarz methods in machine learning.
Indeed, a classic example of solving linear inequalities is finding a linear separator for a binary classification problem. The classic perceptron algorithm is a generalization of the Kaczmarz method, but unlike the classic sublinear rates of perceptron methods [Novikoff, 1962] we can show a linear rate for the Kaczmarz method.

Kaczmarz methods could also be used to solve the $\ell_1$-regularized robust regression problem,
\[
\min_x f(x) := \|Ax - b\|_1 + \lambda\|x\|_1,
\]
for $\lambda \geq 0$. We can formulate finding an $x$ with $f(x) \leq \tau$ for some constant $\tau$ as a set of linear inequalities. By doing a binary search for $\tau$ and using warm-starting, this can be substantially faster than existing approaches like stochastic subgradient methods (which have a sublinear convergence rate) or formulating as a linear program (which is not scalable due to the super-linear cost). The above logic applies to many piecewise-linear problems in machine learning like variants of support vector machines/regression with the $\ell_1$-norm, regression under the $\ell_\infty$-norm, and linear programming relaxations for decoding in graphical models.

3.2 Kaczmarz Algorithm and Greedy Selection Rules

The Kaczmarz algorithm for solving linear systems begins from an initial guess $x^0$, and each iteration $k$ chooses a row $i_k$ and projects the current iterate $x^k$ onto the hyperplane defined by $a_{i_k}^T x = b_{i_k}$. This gives the iteration
\[
x^{k+1} = x^k + \frac{b_{i_k} - a_{i_k}^T x^k}{\|a_{i_k}\|^2}\, a_{i_k}, \tag{3.2}
\]
and the algorithm converges to a solution $x^*$ under weak conditions (e.g., each $i$ is visited infinitely often). We consider two greedy selection rules: the maximum residual rule and the maximum distance rule. The maximum residual (MR) rule selects $i_k$ according to
\[
i_k \in \operatorname*{arg\,max}_i |a_i^T x^k - b_i|, \tag{3.3}
\]
which is the equation $i_k$ that is 'furthest' from being satisfied.
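As an illustration of update (3.2) combined with the MR rule (3.3), the following is a minimal pure-Python sketch. The function names, the dense list-of-lists storage, and the small 3 by 2 test system are assumptions made for this example only; the rule is evaluated by brute force over all rows, so the sketch does not reflect the efficient max-heap implementation discussed for sparse $A$.

```python
def kaczmarz_step(A, b, x, i):
    """Project x onto the hyperplane a_i^T x = b_i, as in update (3.2)."""
    a = A[i]
    residual = b[i] - sum(aj * xj for aj, xj in zip(a, x))
    scale = residual / sum(aj * aj for aj in a)
    return [xj + scale * aj for xj, aj in zip(x, a)]


def max_residual_row(A, b, x):
    """MR rule (3.3): index of the row with the largest |a_i^T x - b_i|."""
    residuals = [
        abs(sum(aj * xj for aj, xj in zip(a, x)) - bi) for a, bi in zip(A, b)
    ]
    return max(range(len(A)), key=residuals.__getitem__)


def greedy_kaczmarz(A, b, x0, iters):
    """Run `iters` Kaczmarz iterations with brute-force MR selection."""
    x = list(x0)
    for _ in range(iters):
        x = kaczmarz_step(A, b, x, max_residual_row(A, b, x))
    return x


# Hypothetical consistent system: b is built from a known solution x_star.
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_star = [1.0, 2.0]
b = [sum(aj * xj for aj, xj in zip(a, x_star)) for a in A]
x = greedy_kaczmarz(A, b, [0.0, 0.0], 500)
```

Because the system has full column rank, the solution is unique and the iterates approach `x_star`; for rank-deficient systems the method would instead converge to some solution that depends on the starting point.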
The maximum distance (MD) rule selects $i_k$ according to
\[
i_k \in \operatorname*{arg\,max}_i \left| \frac{a_i^T x^k - b_i}{\|a_i\|} \right|, \tag{3.4}
\]
which is the rule that maximizes the distance between iterations, $\|x^{k+1} - x^k\|$.

3.2.1 Efficient Calculations for Sparse A

In general, computing these greedy selection rules exactly is too computationally expensive, but in some applications we can compute them efficiently. For example, consider a sparse $A$ with at most $c$ non-zeros per column and at most $r$ non-zeros per row. In this setting, we show in Appendix B.1 that using a max-heap structure both rules can be computed exactly in $O(cr \log m)$ time. We show a simple example of this process in Figure 3.1 for a matrix with the following sparsity pattern,
\[
A = \begin{pmatrix}
\ast & \ast & 0 & 0 & 0 \\
\ast & \ast & \ast & 0 & 0 \\
0 & \ast & \ast & 0 & 0 \\
0 & 0 & 0 & \ast & 0 \\
0 & 0 & 0 & 0 & \ast
\end{pmatrix}.
\]
We exploit the fact that projecting onto row $i$ does not change the residual of row $j$ if $a_i$ and $a_j$ do not share a non-zero index.

The above sparsity condition guarantees that row $i$ is orthogonal to row $j$, and indeed projecting onto row $i$ will not change the residual of row $j$ under the more general condition that $a_i$ and $a_j$ are orthogonal. Consider what we call the orthogonality graph: an undirected graph on $m$ nodes where we place an edge between nodes $i$ and $j$ if $a_i$ is not orthogonal to $a_j$. Given this graph, to update all residuals after we update a row $i$ we only need to update the neighbours of node $i$ in this graph. Even if $A$ is dense ($r = n$ and $c = m$), if the maximum number of neighbours is $g$, then tracking the maximum residual costs $O(gr + g \log(m))$. If $g$ is small, this could still be comparable to the $O(r + \log(m))$ cost of using existing randomized selection strategies.

3.2.2 Approximate Calculation

Many applications, particularly those arising from graphical models with a simple structure, will allow efficient calculation of the greedy rules using the method of the previous section.
However, in other applications it will be too inefficient to calculate the greedy rules. Nevertheless, Eldar and Needell [2011] show that it is possible to efficiently select an $i_k$ that approximates the greedy rules by making use of the dimensionality reduction technique of Johnson and Lindenstrauss [1984]. Their experiments show that approximate greedy rules can be sufficiently accurate and that they still outperform random selection. After first analyzing exact greedy rules in the next section, we analyze the effect of using approximate rules in Section 3.6.

[Figure 3.1: Example of the updating procedure for a max-heap structure on a 5 × 5 sparse matrix: (a) select the node with highest d value; (b) update selected sample and neighbours; (c) reorder max-heap structure. Node values in panel (a): i = 2, d = 0.7; i = 3, d = 0.3; i = 4, d = 0.4; i = 1, d = 0.2; i = 5, d = 0.1. Panel (b): i = 2, d′ = 0; i = 3, d′ = 0.6; i = 4, d = 0.4; i = 1, d′ = 0.8; i = 5, d = 0.1. Panel (c): i = 1, d′ = 0.8; i = 2, d′ = 0; i = 3, d′ = 0.6; i = 4, d = 0.4; i = 5, d = 0.1.]

3.3 Analyzing Selection Rules

All the convergence rates we discuss use a relationship between the terms $\|x^{k+1} - x^*\|$ and $\|x^k - x^*\|$. To derive this relationship, we consider the following expansion of $\|x^k - x^*\|^2$:
\[
\begin{aligned}
\|x^k - x^*\|^2 &= \|x^k - x^{k+1} + x^{k+1} - x^*\|^2 \\
&= \|x^k - x^{k+1}\|^2 + \|x^{k+1} - x^*\|^2 - 2\langle x^{k+1} - x^k, x^{k+1} - x^* \rangle.
\end{aligned}
\]
Rearranging, we obtain
\[
\|x^{k+1} - x^*\|^2 = \|x^k - x^*\|^2 - \|x^{k+1} - x^k\|^2 + 2\langle x^{k+1} - x^k, x^{k+1} - x^* \rangle. \tag{3.5}
\]
Given the Kaczmarz update (3.2), we can say that $x^{k+1} - x^k = \gamma_k a_{i_k}$ for some scalar $\gamma_k$ and selected $i_k$. Then the last term in (3.5) equals 0, as shown by the following argument,
\[
\begin{aligned}
(x^{k+1} - x^k)^T (x^{k+1} - x^*) &= \gamma_k a_{i_k}^T (x^{k+1} - x^*) \\
&= \gamma_k (a_{i_k}^T x^{k+1} - a_{i_k}^T x^*) \\
&= \gamma_k (b_{i_k} - b_{i_k}) \\
&= 0,
\end{aligned}
\]
where the last equality follows from the fact that both $x^{k+1}$ and $x^*$ solve the equality $a_{i_k}^T x = b_{i_k}$. This proves that $(x^{k+1} - x^k)$ is orthogonal to $(x^{k+1} - x^*)$ (see Figure 3.2).
Then using the definition of $x^{k+1}$ from (3.2) and simplifying (3.5), we obtain
\[
\|x^{k+1} - x^*\|^2 = \|x^k - x^*\|^2 - \frac{(a_{i_k}^T x^k - b_{i_k})^2}{\|a_{i_k}\|^2}. \tag{3.6}
\]

[Figure 3.2: Visualizing the orthogonality of vectors $x^{k+1} - x^k$ and $x^{k+1} - x^*$.]

3.3.1 Randomized and Maximum Residual

We first give an analysis of the Kaczmarz method with uniform random selection of the row to update $i$ (which we abbreviate as 'U'). Conditioning on the σ-field $\mathcal{F}_k$ generated by the sequence $\{x^0, x^1, \dots, x^k\}$, and taking expectations of both sides of (3.6), when $i_k$ is selected using U we obtain
\[
\begin{aligned}
\mathbb{E}[\|x^{k+1} - x^*\|^2] &= \|x^k - x^*\|^2 - \mathbb{E}\left[\frac{(a_i^T x^k - b_i)^2}{\|a_i\|^2}\right] \\
&= \|x^k - x^*\|^2 - \sum_{i=1}^m \frac{1}{m} \frac{(a_i^T (x^k - x^*))^2}{\|a_i\|^2} \\
&\leq \|x^k - x^*\|^2 - \frac{1}{m\|A\|_{\infty,2}^2} \sum_{i=1}^m (a_i^T (x^k - x^*))^2 \\
&= \|x^k - x^*\|^2 - \frac{1}{m\|A\|_{\infty,2}^2} \|A(x^k - x^*)\|^2 \\
&\leq \left(1 - \frac{\sigma(A, 2)^2}{m\|A\|_{\infty,2}^2}\right) \|x^k - x^*\|^2,
\end{aligned} \tag{3.7}
\]
where $\|A\|_{\infty,2}^2 := \max_i \{\|a_i\|^2\}$ and $\sigma(A, 2)$ is the Hoffman [1952] constant.⁵ We have assumed that $x^k$ is not a solution, allowing us to use Hoffman's bound (the inequality is trivially satisfied if $x^k$ is a solution to $Ax = b$). When $A$ has independent columns, $\sigma(A, 2)$ is the $n$th singular value of $A$ and in general it is the smallest non-zero singular value.

The argument above is related to the analysis of Vishnoi [2013] but is simpler due to the use of the Hoffman bound. Further, this simple argument makes it straightforward to derive bounds on other rules. For example, we can derive the convergence rate bound of Strohmer and Vershynin [2009] by following the above steps but selecting $i$ non-uniformly with probability $\|a_i\|^2/\|A\|_F^2$ (where $\|A\|_F$ is the Frobenius norm of $A$). We review these steps in Appendix B.2, showing that this non-uniform (NU) selection strategy has
\[
\mathbb{E}[\|x^{k+1} - x^*\|^2] \leq \left(1 - \frac{\sigma(A, 2)^2}{\|A\|_F^2}\right) \|x^k - x^*\|^2. \tag{3.8}
\]
This strategy requires prior knowledge of the row norms of $A$, but this is a one-time computation and can be reused for any linear system involving $A$.
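As a quick numerical sanity check of identity (3.6) and of the NU sampling probabilities $\|a_i\|^2/\|A\|_F^2$, the following pure-Python sketch builds a small random consistent system and compares both sides of (3.6) for every possible choice of row. The dimensions, seed, and helper names are assumptions chosen for illustration.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def project(a, bi, x):
    """Kaczmarz projection of x onto the hyperplane a^T x = bi."""
    scale = (bi - dot(a, x)) / dot(a, a)
    return [xj + scale * aj for xj, aj in zip(x, a)]


def sq_dist(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


random.seed(0)
m, n = 6, 4
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
x_star = [random.gauss(0, 1) for _ in range(n)]
b = [dot(a, x_star) for a in A]  # consistent by construction
x = [random.gauss(0, 1) for _ in range(n)]

# Gap between the two sides of (3.6), for each row that could be selected.
gaps = []
for a, bi in zip(A, b):
    lhs = sq_dist(project(a, bi, x), x_star)
    rhs = sq_dist(x, x_star) - (dot(a, x) - bi) ** 2 / dot(a, a)
    gaps.append(abs(lhs - rhs))

# One-time computation of the NU probabilities from the squared row norms.
frob_sq = sum(dot(a, a) for a in A)
nu_probs = [dot(a, a) / frob_sq for a in A]
i_nu = random.choices(range(m), weights=nu_probs)[0]  # one NU draw
```

The gaps are zero up to floating-point rounding, and `nu_probs` only needs to be recomputed if $A$ changes, which matches the remark that the row norms are a one-time cost.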
Because $\|A\|_F^2 \leq m\|A\|_{\infty,2}^2$, the NU rate (3.8) is at least as fast as the uniform rate (3.7).

While a trivial analysis shows that the MR rule also satisfies (3.7) in a deterministic sense, in Appendix B.2 we give a tighter analysis of the MR rule showing it has the convergence rate
\[
\|x^{k+1} - x^*\|^2 \leq \left(1 - \frac{\sigma(A, \infty)^2}{\|A\|_{\infty,2}^2}\right) \|x^k - x^*\|^2, \tag{3.9}
\]
where the Hoffman-like constant $\sigma(A, \infty)$ satisfies the relationship
\[
\frac{\sigma(A, 2)}{\sqrt{m}} \leq \sigma(A, \infty) \leq \sigma(A, 2).
\]
Thus, at one extreme the maximum residual rule obtains the same rate as (3.7) for uniform selection when $\sigma(A, 2)^2/m \approx \sigma(A, \infty)^2$. However, at the other extreme the maximum residual rule could be faster than uniform selection by a factor of $m$ ($\sigma(A, \infty)^2 \approx \sigma(A, 2)^2$). Thus, although the uniform and MR bounds are the same in the worst case, the MR rule can be superior by a large margin.

⁵ In this work, any reference to the Hoffman constant is in fact the inverse of the Hoffman constant defined by Hoffman [1952].

In contrast to comparing U and MR, the MR rate may be faster or slower than the NU rate. This is because
\[
\|A\|_{\infty,2} \leq \|A\|_F \leq \sqrt{m}\,\|A\|_{\infty,2},
\]
so these quantities and the relationship between $\sigma(A, 2)$ and $\sigma(A, \infty)$ influence which bound is tighter.

3.3.2 Tighter Uniform and MR Analysis

In our derivations of rates (3.7) and (3.9), we use the inequality
\[
\|a_i\|^2 \leq \|A\|_{\infty,2}^2 \quad \forall\, i, \tag{3.10}
\]
which leads to a simple result but could be very loose if the range of the row norms is large. In this section, we give tighter analyses of the U and MR rules that are less interpretable but are tighter because they avoid this inequality.

In order to avoid using this inequality for our analysis of U, we can absorb the row norms of $A$ into a row weighting matrix $D$, where $D = \operatorname{diag}(\|a_1\|, \|a_2\|, \dots, \|a_m\|)$. Defining $\bar{A} := D^{-1}A$, we show in Appendix B.3 that this results in the following upper bound on the convergence rate for uniform random selection,
\[
\mathbb{E}[\|x^{k+1} - x^*\|^2] \leq \left(1 - \frac{\sigma(\bar{A}, 2)^2}{m}\right) \|x^k - x^*\|^2. \tag{3.11}
\]
A similar result is given by Needell et al. [2013] under the stronger assumption that $A$ has independent columns.
The rate in (3.11) is tighter than (3.7), since $\sigma(A, 2)/\|A\|_{\infty,2} \leq \sigma(\bar{A}, 2)$ [van der Sluis, 1969]. Further, this rate can be faster than the non-uniform sampling method of Strohmer and Vershynin [2009]. For example, suppose row $i$ is orthogonal to all other rows but has a significantly larger row norm than all other row norms. In other words, $\|a_i\| \gg \|a_j\|$ for all $j \neq i$. In this case, NU selection will repeatedly select row $i$ (even though it only needs to be selected once), whereas U will only select it on each iteration with probability $1/m$. It has been previously pointed out that Strohmer and Vershynin's method can perform poorly if the problem has one row norm that is significantly larger than the other row norms [Censor et al., 2009]. This result theoretically shows that U can have a tighter bound than the NU method of Strohmer and Vershynin.

In Appendix B.3, we also give a simple modification of our analysis of the MR rule, which leads to the rate
\[
\|x^{k+1} - x^*\|^2 \leq \left(1 - \frac{\sigma(A, \infty)^2}{\|a_{i_k}\|^2}\right) \|x^k - x^*\|^2. \tag{3.12}
\]
This bound depends on the specific $\|a_{i_k}\|$ corresponding to the $i_k$ selected at each iteration $k$. This convergence rate will be faster whenever we select an $i_k$ with $\|a_{i_k}\| < \|A\|_{\infty,2}$. However, in the worst case we repeatedly select $i_k$ values with $\|a_{i_k}\| = \|A\|_{\infty,2}$ so there is no improvement. This issue is considered in [Sepehry, 2016], where the authors give tighter bounds on the sequence of $\|a_{i_k}\|$ values for problems with sparse orthogonality graphs.

3.3.3 Maximum Distance Rule

If we can only perform one iteration of the Kaczmarz method, the optimal rule (with respect to iteration progress) is in fact the MD rule. In Appendix B.4, we show that this strategy achieves a rate of
\[
\|x^{k+1} - x^*\|^2 \leq \left(1 - \sigma(\bar{A}, \infty)^2\right) \|x^k - x^*\|^2, \tag{3.13}
\]
where $\sigma(\bar{A}, \infty)$ satisfies
\[
\max\left\{ \frac{\sigma(\bar{A}, 2)}{\sqrt{m}},\ \frac{\sigma(A, 2)}{\|A\|_F},\ \frac{\sigma(A, \infty)}{\|A\|_{\infty,2}} \right\} \leq \sigma(\bar{A}, \infty) \leq \sigma(\bar{A}, 2).
\]
Thus, the maximum distance rule is at least as fast as the fastest among the U/NU/MR∞ rules, where MR∞ refers to rate (3.9).
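A tiny example may help show why the MD rule is the best single step: by (3.6), the decrease in squared distance to $x^*$ is exactly $(a_i^T x^k - b_i)^2/\|a_i\|^2$, which is the quantity MD maximizes, whereas MR ignores the row norm. The sketch below uses a hypothetical 2 by 2 system with very different row norms; all names and numbers are assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def project(a, bi, x):
    """Kaczmarz projection of x onto the hyperplane a^T x = bi."""
    scale = (bi - dot(a, x)) / dot(a, a)
    return [xj + scale * aj for xj, aj in zip(x, a)]


def mr_row(A, b, x):
    """Maximum residual rule: largest |a_i^T x - b_i|."""
    return max(range(len(A)), key=lambda i: abs(dot(A[i], x) - b[i]))


def md_row(A, b, x):
    """Maximum distance rule: largest |a_i^T x - b_i| / ||a_i||."""
    return max(
        range(len(A)),
        key=lambda i: abs(dot(A[i], x) - b[i]) / math.sqrt(dot(A[i], A[i])),
    )


# Row 0 has a large norm, so its residual is large even though the
# iterate is already close to its hyperplane.
A = [[10.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
x_star = [0.0, 0.0]
x = [1.0, 5.0]

i_mr, i_md = mr_row(A, b, x), md_row(A, b, x)
after_mr = project(A[i_mr], b[i_mr], x)
after_md = project(A[i_md], b[i_md], x)
sq_err_mr = sum((u - v) ** 2 for u, v in zip(after_mr, x_star))
sq_err_md = sum((u - v) ** 2 for u, v in zip(after_md, x_star))
```

Here MR selects row 0 and MD selects row 1, and the MD step leaves a smaller squared error. On systems whose rows all have the same norm the two rules coincide.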
Further, in Appendix B.9 we show that this new rate is not only simpler but is strictly tighter than the rate reported by Eldar and Needell [2011] for the exact MD rule. In Table 3.1, we summarize the relationships we have discussed in this section among the different selection rules. We use the following abbreviations: U∞ - uniform (3.7), U - tight uniform (3.11), NU - non-uniform (3.8), MR∞ - maximum residual (3.9), MR - tight maximum residual (3.12) and MD - maximum distance (3.13). The inequality sign (≤) indicates that the bound for the selection rule listed in the row is slower or equal to the rule listed in the column, while we have written 'P' to indicate that the faster method is problem-dependent.

Table 3.1: Comparison of Convergence Rates

      | U∞ | U  | NU | MR∞ | MR | MD
U∞    | =  | ≤  | ≤  | ≤   | ≤  | ≤
U     |    | =  | P  | P   | P  | ≤
NU    |    |    | =  | P   | P  | ≤
MR∞   |    |    |    | =   | ≤  | ≤
MR    |    |    |    |     | =  | ≤
MD    |    |    |    |     |    | =

3.4 Kaczmarz and Coordinate Descent

With the exception of the tighter U and MR rates, the results of the previous section are analogous to the results in Chapter 2 for coordinate descent methods. Indeed, if we apply coordinate descent methods to minimize the squared error between $Ax$ and $b$ then we obtain similar-looking rates and analogous conclusions. With cyclic selection this is called the Gauss-Seidel method [Seidel, 1874], and as discussed by Ma et al. [2015a] there are several connections/differences between this method and Kaczmarz methods. In this section we highlight some key differences.

First, the previous work required strong convexity which would require that $A$ has independent columns. This is often unrealistic, and our results from the previous section hold for any $A$.⁶ Second, here our results are in terms of the iterates $\|x^k - x^*\|$, which is the natural measure for linear systems. The coordinate descent results are in terms of $f(x^k) - f(x^*)$ and although it is possible to use strong convexity to turn this into a rate on $\|x^k - x^*\|$, this would result in a looser bound and would again require strong convexity to hold (see Ma et al. [2015a]).
On the other hand, coordinate descent gives the least squares solution for inconsistent systems. However, this is also true of the Kaczmarz method using the formulation in Section 3.1. Another subtle issue is that the Kaczmarz rates depend on the row norms of $A$ while the coordinate descent rates depend on the column norms. Thus, there are scenarios where we expect Kaczmarz methods to be much faster and vice versa. Finally, we note that Kaczmarz methods can be extended to allow inequality constraints (see Section 3.7).

⁶ In Chapter 4 we show that the results of Chapter 2 apply for general least squares problems.

As discussed by Wright [2015], Kaczmarz methods can also be interpreted as coordinate descent methods on the dual problem
\[
\min_y \frac{1}{2}\|A^T y\|^2 - b^T y, \tag{3.14}
\]
where $x = A^T y^*$ so that $Ax = AA^T y^* = b$. Applying the Gauss-Southwell rule in this setting yields the MR rule while applying the Gauss-Southwell-Lipschitz rule yields the MD rule (see Appendix B.5 for details and numerical comparisons, indicating that in some cases Kaczmarz substantially outperforms CD). However, applying the analysis of Chapter 2 to this dual problem would require that $A$ has independent rows and would only yield a rate on the dual objective, unlike the convergence rates in terms of $\|x^k - x^*\|$ that hold for general $A$ from the previous section. In Chapter 4 we revisit the strong-convexity assumption on CD methods and show that, in fact, it is not a necessary assumption to guarantee a linear convergence rate.

3.5 Example: Diagonal A

To give a concrete example of these rates, we consider the simple case of a diagonal $A$. While such problems are not particularly interesting, this case provides a simple setting to understand these different rates without referring to Hoffman bounds.

Consider a square diagonal matrix $A$ with $a_{ii} > 0$ for all $i$. In this case, the diagonal entries are the eigenvalues $\lambda_i$ of the linear system. The convergence rate constants for this scenario are given in Table 3.2.
Table 3.2: Convergence Rate Constants for Diagonal A

Rule  | Rate constant
U∞    | $1 - \frac{\lambda_m^2}{m\lambda_1^2}$
U     | $1 - \frac{1}{m}$
NU    | $1 - \frac{\lambda_m^2}{\sum_i \lambda_i^2}$
MR∞   | $1 - \frac{1}{\lambda_1^2}\left[\sum_i \frac{1}{\lambda_i^2}\right]^{-1}$
MR    | $1 - \frac{1}{\lambda_{i_k}^2}\left[\sum_i \frac{1}{\lambda_i^2}\right]^{-1}$
MD    | $1 - \frac{1}{m}$

We provide the details in Appendix B.6 of the derivations for $\sigma(A, \infty)$ and $\sigma(\bar{A}, \infty)$, as well as substitutions for the uniform, non-uniform, and uniform tight rates to yield the above table. We note that the uniform tight rate follows from $\lambda_m^2(\bar{A})$ being equivalent to the minimum eigenvalue of the identity matrix.

If we consider the most basic case when all the eigenvalues of $A$ are equal, then all the selection rules yield the same rate of $(1 - 1/m)$ and the method converges in at most $m$ steps for greedy selection rules and in at most $O(m \log m)$ steps (in expectation) for the random rules (due to the 'coupon collector' problem). Further, this is the worst situation for the greedy MR and MD rules since they satisfy their lower bounds on $\sigma(A, \infty)$ and $\sigma(\bar{A}, \infty)$.

Now consider the extreme case when all the eigenvalues are equal except for one. For example, consider when $\lambda_1 = \lambda_2 = \cdots = \lambda_{m-1} > \lambda_m$ with $m > 2$. Letting $\alpha = \lambda_i^2(A)$ for any $i = 1, \dots, m-1$ and $\beta = \lambda_m^2(A)$, we have
\[
\underbrace{\frac{\beta}{m\alpha}}_{\text{U}_\infty} < \underbrace{\frac{\beta}{\alpha(m-1) + \beta}}_{\text{NU}} < \underbrace{\frac{\beta}{\alpha + \beta(m-1)}}_{\text{MR}_\infty} \leq \underbrace{\frac{1}{\lambda_{i_k}^2} \frac{\alpha\beta}{\alpha + \beta(m-1)}}_{\text{MR}} < \underbrace{\frac{1}{m}}_{\text{U, MD}}.
\]
Thus, Strohmer and Vershynin's NU rule would actually be the worst rule to use, whereas U and MD are the best. In this case $\sigma(A, \infty)^2$ is closer to its upper bound ($\approx \beta$) so we would expect greedy rules to perform well.

3.6 Approximate Greedy Rules

In many applications computing the exact MR or MD rule will be too inefficient, but we can always approximate it using a cheaper approximate greedy rule, as in the method of Eldar and Needell [2011]. In this section we consider methods that compute the greedy rules up to multiplicative or additive errors.

3.6.1 Multiplicative Error

Suppose we have approximated the MR rule such that there is a multiplicative error in our selection of $i_k$,
\[
|a_{i_k}^T x^k - b_{i_k}| \geq \max_i |a_i^T x^k - b_i| (1 - \epsilon_k),
\]
for some $\epsilon_k \in [0, 1)$.
In this scenario, using the tight analysis for the MR rule, we show in Appendix B.7 that
\[
\|x^{k+1} - x^*\|^2 \leq \left(1 - \frac{(1 - \epsilon_k)^2 \sigma(A, \infty)^2}{\|a_{i_k}\|^2}\right) \|x^k - x^*\|^2.
\]
Similarly, if we approximate the MD rule up to a multiplicative error,
\[
\left| \frac{a_{i_k}^T x^k - b_{i_k}}{\|a_{i_k}\|} \right| \geq \max_i \left| \frac{a_i^T x^k - b_i}{\|a_i\|} \right| (1 - \bar{\epsilon}_k),
\]
for some $\bar{\epsilon}_k \in [0, 1)$, then we show in Appendix B.7 that the following rate holds,
\[
\|x^{k+1} - x^*\|^2 \leq \left(1 - (1 - \bar{\epsilon}_k)^2 \sigma(\bar{A}, \infty)^2\right) \|x^k - x^*\|^2.
\]
These scenarios do not require the error to converge to 0. However, if $\epsilon_k$ or $\bar{\epsilon}_k$ is large, then the convergence rate will be slow.

3.6.2 Additive Error

Suppose we select $i_k$ using the MR rule up to additive error,
\[
|a_{i_k}^T x^k - b_{i_k}|^2 \geq \max_i |a_i^T x^k - b_i|^2 - \epsilon_k,
\]
or similarly for the MD rule,
\[
\left| \frac{a_{i_k}^T x^k - b_{i_k}}{\|a_{i_k}\|} \right|^2 \geq \max_i \left| \frac{a_i^T x^k - b_i}{\|a_i\|} \right|^2 - \bar{\epsilon}_k,
\]
for some $\epsilon_k \geq 0$ or $\bar{\epsilon}_k \geq 0$, respectively. We show in Appendix B.8 that this results in the following convergence rates for the MR and MD rules with additive error (respectively),
\[
\|x^{k+1} - x^*\|^2 \leq \left(1 - \frac{\sigma(A, \infty)^2}{\|a_{i_k}\|^2}\right) \|x^k - x^*\|^2 + \frac{\epsilon_k}{\|a_{i_k}\|^2},
\]
and
\[
\|x^{k+1} - x^*\|^2 \leq \left(1 - \sigma(\bar{A}, \infty)^2\right) \|x^k - x^*\|^2 + \bar{\epsilon}_k.
\]
With an additive error, we need the errors to go to 0 in order for the algorithm to converge; if they go to 0 fast enough, we obtain the same rate as if we were calculating the exact greedy rule. In the approximate greedy rule used by Eldar and Needell [2011], there is unfortunately a constant additive error. To address this, they compare the approximate greedy selection to a randomly selected $i_k$ and take the one with the largest distance. This approach can be substantially faster when far from the solution, but may eventually revert to random selection. We give details comparing Eldar and Needell's rate to our above rate in Appendix B.9, but here we note that the above bounds will typically be much stronger.

3.7 Systems of Linear Inequalities

Kaczmarz methods have been extended to systems of linear inequalities,
\[
\begin{cases}
a_i^T x \leq b_i & (i \in I_\leq) \\
a_i^T x = b_i & (i \in I_=),
\end{cases} \tag{3.15}
\]
where the disjoint index sets $I_\leq$ and $I_=$ partition the set $\{1, 2, \dots, m\}$ [Leventhal and Lewis, 2010].
In this setting the method takes the form
\[
x^{k+1} = x^k - \frac{\beta_k}{\|a_i\|^2}\, a_i, \quad \text{with } \beta_k = \begin{cases} (a_i^T x^k - b_i)^+ & (i \in I_\leq) \\ a_i^T x^k - b_i & (i \in I_=), \end{cases}
\]
where $(\gamma)^+ = \max\{\gamma, 0\}$. In Appendix B.10 we derive analogous greedy rules and convergence results for this case. The main difference in this setting is that the rates are in terms of the distance of $x^k$ to the feasible set $S$ of (3.15),
\[
d(x^k, S) = \min_{z \in S} \|x^k - z\|_2 = \|x^k - P_S(x^k)\|_2,
\]
where $P_S(x)$ is the projection of $x$ onto $S$. This generalization is needed because with inequality constraints the different iterates $x^k$ may have different projections onto $S$.

3.8 Faster Randomized Kaczmarz Methods

In this section we use the orthogonality graph presented in Section 3.2.1 to design new selection rules and derive faster convergence rates for randomized Kaczmarz methods. Similar to the classic convergence rate analyses of cyclic Kaczmarz algorithms, these new rates/rules depend in some sense on the 'angle' between rows, which is a property that is not captured by existing randomized/greedy analyses (which only depend on the row norms).

If two rows $a_i$ and $a_j$ are orthogonal, then if the equality $a_i^T x^k = b_i$ holds at iteration $k$ and we select $i_k = j$, then we know that $a_i^T x^{k+1} = b_i$. More generally, updating $i_k$ makes equality $i_k$ satisfied but could make any equality $j$ unsatisfied where $a_j$ is not orthogonal to $a_{i_k}$. Thus, after we have selected row $i_k$, equation $i_k$ will remain satisfied for all subsequent iterations until one of its neighbours is selected in the orthogonality graph.⁷

Based on this, we call a row $i$ 'selectable' if $i$ has never been selected or if a neighbour of $i$ in the orthogonality graph has been selected since the last time $i$ was selected.⁸ We use the notation $s_i^k = 1$ to denote that row $i$ is 'selectable' on iteration $k$, and otherwise we use $s_i^k = 0$ and say that $i$ is 'not selectable' at iteration $k$. There is no reason to ever update a 'not selectable' row, because by definition the equality is already satisfied. Based on this, we propose two simple randomized schemes:

1. Adaptive Uniform: select $i_k$ uniformly from the selectable rows.
2. Adaptive Non-Uniform: select $i_k$ proportional to $\|a_i\|^2$ among the selectable rows.

Let $A_k$/$\bar{A}_k$ denote the sub-matrix of $A$/$\bar{A}$ formed by concatenating the selectable rows on iteration $k$, and let $m_k$ denote the number of selectable rows. If we are given the set of selectable nodes at iteration $k$, then for adaptive uniform we obtain the bound
\[
\mathbb{E}[\|x^{k+1} - x^*\|^2] \leq \left(1 - \frac{\sigma(\bar{A}_k, 2)^2}{m_k}\right) \|x^k - x^*\|^2,
\]
while for adaptive non-uniform we obtain the bound
\[
\mathbb{E}[\|x^{k+1} - x^*\|^2] \leq \left(1 - \frac{\sigma(A_k, 2)^2}{\|A_k\|_F^2}\right) \|x^k - x^*\|^2.
\]
If we are not on the first iteration, then at least one node is not selectable and these are strictly faster than the previous bounds. The gain will be small if most nodes are selectable (which would be typical of dense orthogonality graphs), but the gain can be very large if only a few nodes are selectable (which would be typical of sparse orthogonality graphs).

⁷ Although we only consider randomized Kaczmarz methods in this section, [Sepehry, 2016] uses the orthogonality graph to derive a tighter multi-step analysis for the MR rule.

⁸ If we initialize with $x^0 = 0$, then instead of considering all nodes as initially selectable we can restrict to the nodes $i$ with $b_i \neq 0$ since otherwise we have $a_i^T x^0 = b_i$ already.

Practical Issues: In order for the adaptive methods to be efficient, we must be able to efficiently form the orthogonality graph and update the set of selectable nodes. If each node has at most $g$ neighbours in the orthogonality graph, then the cost of updating the set of selectable nodes and then sampling from the set of selectable nodes is $O(g \log(m))$ (we give details in Appendix B.11).
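The adaptive uniform scheme can be sketched as follows. This is a simplified illustration under several assumptions: the orthogonality graph is built by brute-force pairwise dot products, the selectable set is a plain Python set that is sorted before each draw (which costs more than the $O(g \log(m))$ bookkeeping referred to above), and all function and variable names are hypothetical.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def orthogonality_graph(A, tol=1e-12):
    """Neighbour sets: i and j are adjacent when a_i is not orthogonal to a_j."""
    m = len(A)
    nbrs = [set() for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            if abs(dot(A[i], A[j])) > tol:
                nbrs[i].add(j)
                nbrs[j].add(i)
    return nbrs


def adaptive_uniform_kaczmarz(A, b, x0, max_iters, rng):
    """Randomized Kaczmarz that samples uniformly among selectable rows."""
    nbrs = orthogonality_graph(A)
    selectable = set(range(len(A)))  # every row starts out selectable
    x = list(x0)
    steps = 0
    while selectable and steps < max_iters:
        i = rng.choice(sorted(selectable))  # uniform over selectable rows
        scale = (b[i] - dot(A[i], x)) / dot(A[i], A[i])
        x = [xj + scale * aj for xj, aj in zip(x, A[i])]
        selectable.discard(i)       # equation i is now satisfied
        selectable.update(nbrs[i])  # its neighbours may have been disturbed
        steps += 1
    return x, steps


# Diagonal A: all rows are mutually orthogonal, so each row is used once.
A_diag = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]]
b_diag = [1.0, 2.0, 4.0]
x_diag, steps_diag = adaptive_uniform_kaczmarz(
    A_diag, b_diag, [0.0] * 3, 100, random.Random(0)
)
```

On the diagonal example the selectable set empties after three steps, at which point every equation holds and the loop stops; with a dense orthogonality graph almost every row stays selectable and the scheme behaves much like plain uniform selection.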
For this $O(g \log(m))$ cost not to increase the iteration cost compared to the NU method, we only require the very reasonable assumption that $g \log(m) = O(n + \log(m))$. In many applications where orthogonality is present, $g$ will be far smaller than this.

However, forming the orthogonality graph at the start may be prohibitive: it would cost $O(m^2 n)$ in the worst case to test pairwise orthogonality of all nodes. In the sparse case where each column has at most $c$ non-zeros, we can find an approximation to the orthogonality graph in $O(c^2 n)$ by assuming that all rows which share a non-zero are non-orthogonal. Alternately, in many applications the orthogonality graph is easily derived from the structure of the problem. For example, in graph-based semi-supervised learning where the graph is constructed based on the $k$-nearest neighbours, the orthogonality graph will simply be the given $k$-nearest neighbour graph as these correspond to the columns that will be mutually non-zero in $A$.

3.9 Experiments

Eldar and Needell [2011] have already shown that approximate greedy rules can outperform randomized rules for dense problems. Thus, in our experiments we focus on comparing the effectiveness of different rules on very sparse problems where our max-heap strategy allows us to efficiently compute the exact greedy rules. The first problem we consider is generating a dataset $A$ with a 50 by 50 lattice-structured dependency (giving $n = 2500$). The corresponding $A$ has the following non-zero elements: the diagonal elements $A_{i,i}$, the upper/lower diagonal elements $A_{i,i+1}$ and $A_{i+1,i}$ when $i$ is not a multiple of 50 (horizontal edges), and the diagonal bands $A_{i,i+50}$ and $A_{i+50,i}$ (vertical edges). We generate these non-zero elements from a $\mathcal{N}(0, 1)$ distribution and generate the target vector $b = Az$ using $z \sim \mathcal{N}(0, I)$.
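The sparsity pattern just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `lattice_system`, the dictionary-per-row sparse format, and the choice to draw $A_{i,j}$ and $A_{j,i}$ independently (rather than symmetrically) are not taken from the text, and the code uses 0-based indices, so the condition "$i$ is not a multiple of 50" becomes a test on `i + 1`.

```python
import random


def lattice_system(side=50, seed=0):
    """Return (A, b, z) for a side-by-side lattice; A is a list of {col: value} rows."""
    rng = random.Random(seed)
    n = side * side
    A = [dict() for _ in range(n)]
    for i in range(n):
        A[i][i] = rng.gauss(0.0, 1.0)      # diagonal element
        if (i + 1) % side != 0:            # horizontal edge, no wrap-around
            A[i][i + 1] = rng.gauss(0.0, 1.0)
            A[i + 1][i] = rng.gauss(0.0, 1.0)
        if i + side < n:                   # vertical edge
            A[i][i + side] = rng.gauss(0.0, 1.0)
            A[i + side][i] = rng.gauss(0.0, 1.0)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Target vector b = A z, so the system is consistent by construction.
    b = [sum(v * z[j] for j, v in row.items()) for row in A]
    return A, b, z
```

Calling `lattice_system()` with the default `side=50` gives the $n = 2500$ setting; smaller values of `side` are convenient for inspecting the pattern by hand.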
Each row in the lattice problem has at most four neighbours, and this type of sparsity structure is typical of spatial Gaussian graphical models and linear systems that arise from discretizing two-dimensional partial differential equations.

The second problem we consider is solving an overdetermined consistent linear system with a very sparse $A$ of size $2500 \times 1000$. We generate each row of $A$ independently such that there are $\log(m)/2m$ non-zero entries per row drawn from a uniform distribution between 0 and 1. To explore how having different row norms affects the performance of the selection rules, we randomly multiply one out of every 11 rows by a factor of 10,000.

For the third problem, we solve a label propagation problem for semi-supervised learning in the ‘two moons’ dataset [Zhou et al., 2003]. From this dataset, we generate 2000 samples and randomly label 100 points in the data. We then connect each node to its 5 nearest neighbours. Constructing a data set with such a high sparsity level is typical of graph-based methods for semi-supervised learning.
We use a variant of the quadratic labelling criterion of Bengio et al. [2006],
\[ \min_{y_i \,|\, i \notin S} \; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (y_i - y_j)^2, \]
where $y$ is our label vector (each $y_i$ can take one of 2 values), $S$ is the set of labels that we do know and $w_{ij} \geq 0$ are the weights assigned to each $y_i$ describing how strongly we want the labels $y_i$ and $y_j$ to be similar. We can express this quadratic problem as a linear system that is consistent by construction (see Appendix B.12), and hence apply Kaczmarz methods. As we labelled 100 points in our data, the resulting linear system has a matrix of size $1900 \times 1900$ while the number of neighbours $g$ in the orthogonality graph is at most 5.

[Figure 3.3: Comparison of Kaczmarz selection rules for squared error (left) and distance to solution (right). Panels: Ising model, Very Sparse Overdetermined Linear-System, Label Propagation. Horizontal axis: Iteration. Vertical axes: Log Squared Error, Log Distance. Rules plotted: U, MD, MR, C, RP, NU, A(u), A(Nu).]

In Figure 3.3 we compare the normalized squared error and distance against the iteration number for 8 different selection rules: cyclic (C), random permutation (RP, where we change the cycle order after each pass through the rows), uniform random (U), adaptive uniform random (A(u)), non-uniform random (NU), adaptive non-uniform random (A(Nu)), maximum residual (MR), and maximum distance (MD).

In experiments 1 and 3, MR performs similarly to MD and both outperform all other selection rules.
For experiment 2, the MD rule outperforms all other selection rules in terms of distance to the solution although MR performs better on the early iterations in terms of squared error. In Appendix B.12 we explore a ‘hybrid’ method on the overdetermined linear system problem that does well on both measures. For runtime results on these experiments, see Nutini et al. [2016, Appendix M].

The new randomized A(u) method did not give significantly better performance than the existing U method on any dataset. This agrees with our bounds which show that the gain of this strategy is modest. In contrast, the new randomized A(Nu) method performed much better than the existing NU method on the overdetermined linear system in terms of squared error. This again agrees with our bounds which suggest that the A(Nu) method has the most to gain when the row norms are very different. Interestingly, in most experiments we found that cyclic selection worked better than any of the randomized algorithms. However, cyclic methods were clearly beaten by greedy methods.

3.10 Discussion

In this chapter, we proved faster convergence rate bounds for a variety of row-selection rules in the context of Kaczmarz methods for solving linear systems. We have also provided new randomized selection rules that make use of orthogonality in the data in order to achieve better theoretical and practical performance.
While we have focused on the case of non-accelerated and single-variable variants of the Kaczmarz algorithm, we expect that all of our conclusions also hold for accelerated Kaczmarz and block Kaczmarz methods [Gower and Richtárik, 2015, Lee and Sidford, 2013, Liu and Wright, 2014, Needell and Tropp, 2014, Oswald and Zhou, 2015].

Chapter 4
Relaxing Strong Convexity

Fitting most machine learning models involves solving some sort of optimization problem. Gradient descent, and variants of it like coordinate descent and stochastic gradient, are the workhorse tools used by the field to solve very large instances of these problems. In this chapter we consider the basic problem of minimizing a smooth function and the convergence rate of gradient descent methods. It is well-known that if $f$ is strongly convex, then gradient descent achieves a global linear convergence rate for this problem [Nesterov, 2004]. However, many of the fundamental models in machine learning like least squares and logistic regression yield objective functions that are convex but not strongly convex. Further, if $f$ is only convex, then gradient descent only achieves a sub-linear rate.

This situation has motivated a variety of alternatives to strong convexity (SC) in the literature, in order to show that we can obtain linear convergence rates for problems like least squares and logistic regression. One of the oldest of these conditions is the error bounds (EB) of Luo and Tseng [1993], but four other recently-considered conditions are essential strong convexity (ESC) [Liu et al., 2014], weak strong convexity (WSC) [Necoara et al., 2015], the restricted secant inequality (RSI) [Zhang and Yin, 2013], and the quadratic growth (QG) condition [Anitescu, 2000]. Some of these conditions have different names in the special case of convex functions. For example, a convex function satisfying RSI is said to satisfy restricted strong convexity (RSC) [Zhang and Yin, 2013].
Names describing convex functions satisfying QG include optimal strong convexity (OSC) [Liu and Wright, 2015], semi-strong convexity (SSC) [Gong and Ye, 2014], and (confusingly) WSC [Ma et al., 2015b]. The proofs of linear convergence under all of these relaxations are typically not straightforward, and it is rarely discussed how these conditions relate to each other.

In this work, we consider a much older condition that we refer to as the Polyak-Łojasiewicz (PL) inequality. This inequality was originally introduced by Polyak [1963], who showed that it is a sufficient condition for gradient descent to achieve a linear convergence rate. We describe it as the PL inequality because it is also a special case of the inequality introduced in the same year by Łojasiewicz [1963]. We review the PL inequality in the next section and how it leads to a trivial proof of the linear convergence rate of gradient descent. Next, in terms of showing a global linear convergence rate to the optimal solution, we show that the PL inequality is weaker than all of the more recent conditions discussed in the previous paragraph. This suggests that we can replace the long and complicated proofs under any of the conditions above with simpler proofs based on the PL inequality. Subsequently, we show how this result implies gradient descent achieves linear rates for standard problems in machine learning like least squares and logistic regression that are not necessarily strongly convex, and even for some non-convex problems (Section 4.1.3). In Section 4.2 we use the PL inequality to give new convergence rates for randomized and greedy coordinate descent (implying a new convergence rate for certain variants of boosting) and sign-based gradient descent methods. Next we turn to the problem of minimizing the sum of a smooth function and a simple non-smooth function. We propose a generalization of the PL inequality that allows us to show linear convergence rates for proximal gradient methods without strong convexity.
This leads to a simple analysis showing linear convergence of methods for training support vector machines. It also implies that we obtain a linear convergence rate for $\ell_1$-regularized least squares problems, showing that the extra conditions previously assumed to derive linear convergence rates in this setting are in fact not needed.

4.1 Polyak-Łojasiewicz Inequality

We first focus on the basic unconstrained optimization problem
\[ \operatorname*{argmin}_{x \in \mathbb{R}^n} f(x), \tag{4.1} \]
and we assume that the first derivative of $f$ is $L$-Lipschitz continuous. This means that
\[ f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2, \tag{4.2} \]
for all $x, y \in \mathbb{R}^n$. For twice-differentiable objectives this assumption means that the eigenvalues of $\nabla^2 f(x)$ are bounded above by some $L$, which is typically a reasonable assumption. We also assume that the optimization problem has a non-empty solution set $\mathcal{X}^*$, and we use $f^*$ to denote the corresponding optimal function value. We will say that a function satisfies the PL inequality if the following holds for some $\mu > 0$,
\[ \frac{1}{2} \|\nabla f(x)\|^2 \geq \mu (f(x) - f^*), \quad \forall x. \tag{4.3} \]
This inequality requires that the gradient grows faster than a quadratic function as we move away from the optimal function value. Note that this inequality implies that every stationary point is a global minimum. But unlike strong convexity, it does not imply that there is a unique solution. Linear convergence of gradient descent under these assumptions was first proved by Polyak [1963]. Below we give a simple proof of this result when using a step-size of $1/L$.

Theorem 1. Consider problem (4.1), where $f$ has an $L$-Lipschitz continuous gradient (4.2), a non-empty solution set $\mathcal{X}^*$, and satisfies the PL inequality (4.3). Then the gradient method with a step-size of $1/L$,
\[ x^{k+1} = x^k - \frac{1}{L} \nabla f(x^k), \tag{4.4} \]
has a global linear convergence rate,
\[ f(x^k) - f^* \leq \left(1 - \frac{\mu}{L}\right)^k (f(x^0) - f^*). \]

Proof.
By using update rule (4.4) in the Lipschitz inequality condition (4.2) we have
\[ f(x^{k+1}) - f(x^k) \leq -\frac{1}{2L} \|\nabla f(x^k)\|^2. \]
Now by using the PL inequality (4.3) we get
\[ f(x^{k+1}) - f(x^k) \leq -\frac{\mu}{L} (f(x^k) - f^*). \]
Re-arranging and subtracting $f^*$ from both sides gives us $f(x^{k+1}) - f^* \leq \left(1 - \frac{\mu}{L}\right)(f(x^k) - f^*)$, where using the same argument as in [Csiba and Richtárik, 2017, Lem 1] we can see that $\left(1 - \frac{\mu}{L}\right) < 1$ since $f(x^k) - f^* \geq 0$ for any $k = 0, 1, \ldots$, and thus it must hold that $\mu \leq L$. Applying the inequality recursively gives the result.

Note that the above result also holds if we use the optimal step-size at each iteration, because under this choice we have
\[ f(x^{k+1}) = \min_{\alpha} \{ f(x^k - \alpha \nabla f(x^k)) \} \leq f\left(x^k - \frac{1}{L} \nabla f(x^k)\right). \]
A beautiful aspect of this proof is its simplicity; in fact it is simpler than the proof of the same fact under the usual strong convexity assumption. It is certainly simpler than typical proofs which rely on the other conditions mentioned at the beginning of this chapter. Further, it is worth noting that the proof does not assume convexity of $f$. Thus, this is one of the few general results we have for global linear convergence on non-convex problems.

4.1.1 Relationships Between Conditions

As mentioned at the beginning of this chapter, several other assumptions have been explored over the last 25 years in order to show that gradient descent achieves a linear convergence rate. We state these conditions next, noting that all of these definitions involve some constant $\mu > 0$ (which may not be the same across conditions). We use the convention that $x_p$ is the projection of $x$ onto the solution set $\mathcal{X}^*$.

1. Strong Convexity (SC): For all $x$ and $y$ we have
\[ f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2. \]
2. Essential Strong Convexity (ESC): For all $x$ and $y$ such that $x_p = y_p$ we have
\[ f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2. \]
3. Weak Strong Convexity (WSC): For all $x$ we have
\[ f^* \geq f(x) + \langle \nabla f(x), x_p - x \rangle + \frac{\mu}{2} \|x_p - x\|^2. \]
4.
Restricted Secant Inequality (RSI): For all $x$ we have
\[ \langle \nabla f(x), x - x_p \rangle \geq \mu \|x_p - x\|^2. \]
If the function $f$ is also convex it is called restricted strong convexity (RSC).
5. Error Bound (EB): For all $x$ we have
\[ \|\nabla f(x)\| \geq \mu \|x_p - x\|. \]
6. Polyak-Łojasiewicz (PL): For all $x$ we have
\[ \frac{1}{2} \|\nabla f(x)\|^2 \geq \mu (f(x) - f^*). \]
7. Quadratic Growth (QG): For all $x$ we have
\[ f(x) - f^* \geq \frac{\mu}{2} \|x_p - x\|^2. \]
If the function $f$ is also convex it is called optimal strong convexity (OSC) or semi-strong convexity or sometimes WSC (but we will reserve the expression WSC for the definition above).

These conditions typically assume that $f$ is convex, and lead to more complicated proofs than the one of Theorem 1. However, it is rarely discussed how the conditions relate to each other. Indeed, all of the relationships that have been explored have only been in the context of convex functions [Liu and Wright, 2015, Necoara et al., 2015, Zhang, 2015]. In Figure 4.1 we show how these conditions relate to each other with and without the assumption of convexity, and we formalize these relationships in the following theorem.

Theorem 2. For a function $f$ with a Lipschitz-continuous gradient, the following implications hold:
\[ (SC) \subset (ESC) \subseteq (WSC) \subseteq (RSI) \subseteq (EB) \equiv (PL) \subseteq (QG). \]
If we further assume that $f$ is convex then we have
\[ (RSI) \equiv (EB) \equiv (PL) \equiv (QG). \]

Proof. See Appendix C.1.

[Figure 4.1: Visual of the implications shown in Theorem 2 between the various relaxations of strong convexity.]

Note that Zhang [2016] independently also recently gave the relationships between RSI, EB, PL, and QG.[9] This result shows that QG is the weakest assumption among those considered. However, QG allows non-global local minima so it is not enough to guarantee that gradient descent finds a global minimizer. This means that, among those considered above, PL and the equivalent EB are the most general conditions that allow linear convergence to a global minimizer.
Note that in the convex case QG is called OSC or SSC, but the result above shows that in the convex case it is also equivalent to EB and PL (as well as RSI which is known as RSC in this case).

[9] Drusvyatskiy and Lewis [2016] is a recent work discussing the relationships among many of these conditions for non-smooth functions.

4.1.2 Invex and Non-Convex Functions

While the PL inequality does not imply convexity of $f$, it does imply the weaker condition of invexity. A function is invex if it is differentiable and there exists a vector valued function $\eta$ such that for any $x, y \in \mathbb{R}^n$, the following inequality holds
\[ f(x) \geq f(y) + \eta(x, y)^T \nabla f(y). \]
It is clear that if $\eta(x, y) = x - y$, then $f$ is convex.

Invexity was first introduced by Hanson [1981], and has been used in the context of learning output kernels [Dinuzzo et al., 2011]. Craven and Glover [1985] show that a smooth $f$ is invex if and only if every stationary point of $f$ is a global minimum. Since the PL inequality implies that all stationary points are global minimizers, functions satisfying the PL inequality must be invex. It is easy to see this by noting that at any stationary point $\bar{x}$, we have $\nabla f(\bar{x}) = 0$, so for $\mu > 0$ by the PL inequality, we have
\[ 0 = \frac{1}{2} \|\nabla f(\bar{x})\|^2 \geq \mu (f(\bar{x}) - f^*) \geq 0, \]
which means $f(\bar{x}) = f^*$ and we must be at the global minimum.

[Figure 4.2: Example: $f(x) = x^2 + 3\sin^2(x)$ is an invex but non-convex function that satisfies the PL inequality.]

Indeed, Theorem 2 shows that all of the previous conditions (except QG) imply invexity. The function $f(x) = x^2 + 3\sin^2(x)$ shown in Figure 4.2 is an example of an invex but non-convex function satisfying the PL inequality (with $\mu = 1/32$). Thus, Theorem 1 implies gradient descent obtains a global linear convergence rate on this function.

Unfortunately, many complicated models have non-optimal stationary points. For example, typical deep feed-forward neural networks have sub-optimal stationary points and are thus not invex.
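Returning briefly to the invex example $f(x) = x^2 + 3\sin^2(x)$, the rate of Theorem 1 can be checked numerically with a short sketch. The value $L = 8$ is an assumption derived here from $f''(x) = 2 + 6\cos(2x) \leq 8$, the constant $\mu = 1/32$ is the one quoted in the text, and the function names and starting point are hypothetical choices for the illustration.

```python
import math


def f(x):
    return x * x + 3.0 * math.sin(x) ** 2


def grad_f(x):
    # d/dx [x^2 + 3 sin^2(x)] = 2x + 3 sin(2x)
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def gradient_descent_values(x0, iters, L=8.0):
    """Run x <- x - (1/L) f'(x) and return the list of f(x_k), k = 0..iters."""
    x, values = x0, [f(x0)]
    for _ in range(iters):
        x -= grad_f(x) / L
        values.append(f(x))
    return values


values = gradient_descent_values(x0=3.0, iters=100)
rate = 1.0 - (1.0 / 32.0) / 8.0  # 1 - mu / L
# Since f* = 0 here, Theorem 1 predicts f(x_k) <= rate**k * f(x_0).
bound_holds = all(values[k] <= rate ** k * values[0] + 1e-15
                  for k in range(len(values)))
```

In this run the observed decrease is much faster than the worst-case factor $1 - \mu/L$, which is expected since the bound only has to hold for the least favourable points.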
A classic way to analyze functions with sub-optimal stationary points is to consider a global convergence phase and a local convergence phase. The global convergence phase is the time spent to get “close” to a local minimum, and then once we are “close” to a local minimum the local convergence phase characterizes the convergence rate of the method. Usually, the local convergence phase starts to apply once we are locally strongly convex around the minimizer. But this means that the local convergence phase may be arbitrarily small: for example, for $f(x) = x^2 + 3\sin^2(x)$ the local convergence rate would not even apply over the interval $x \in [-1, 1]$. If we instead defined the local convergence phase in terms of locally satisfying the PL inequality, then we see that it can be much larger ($x \in \mathbb{R}$ for this example).

4.1.3 Relevant Problems

If $f$ is $\mu$-strongly convex, then it also satisfies the PL inequality with the same $\mu$ (see Appendix C.2). Further, by Theorem 2, $f$ satisfies the PL inequality if it satisfies any of ESC, WSC, RSI, or EB (while for convex $f$, QG is also sufficient). Although it is hard to precisely characterize the general class of functions for which the PL inequality is satisfied, we note one important special case below.

Strongly convex composed with linear: This is the case where $f$ has the form $f(x) = g(Ax)$ for some $\sigma$-strongly convex function $g$ and some matrix $A$. In Appendix C.2, we use the Hoffman bound (previously used in Chapter 3) to show that this class of functions satisfies the PL inequality, and we note that this form frequently arises in machine learning. For example, least squares problems have the form
\[ f(x) = \|Ax - b\|^2, \]
and by noting that $g(z) \triangleq \|z - b\|^2$ is strongly convex we see that least squares falls into this category. Indeed, this class includes all convex quadratic functions.

In the case of logistic regression we have
\[ f(x) = \sum_{i=1}^{m} \log(1 + \exp(b_i a_i^T x)). \]
This can be written in the form $g(Ax)$, where $g$ is strictly convex but not strongly convex.
In cases like this where $g$ is only strictly convex, the PL inequality will still be satisfied over any compact set. Thus, if the iterations of gradient descent remain bounded, the linear convergence result still applies. It is reasonable to assume that the iterates remain bounded when the set of solutions is finite, since each step must decrease the objective function. Thus, for practical purposes, we can relax the above condition to “strictly-convex composed with linear” and the PL inequality implies a linear convergence rate for logistic regression.

4.2 Convergence of Huge-Scale Methods

In this section, we use the PL inequality to analyze variants of one of the most widely-used techniques for handling large-scale machine learning problems: coordinate descent methods. In particular, the PL inequality yields very simple analyses of this method that apply to more general classes of functions than previously analyzed. We also note that the PL inequality has recently been used by Garber and Hazan [2015] to analyze the Frank-Wolfe algorithm and by Karimi et al. [2016] to analyze stochastic gradient and stochastic variance reduced gradient methods. Further, inspired by the resilient backpropagation (RPROP) algorithm of Riedmiller and Braun [1992], we give a convergence rate analysis for a sign-based gradient descent method.

4.2.1 Randomized Coordinate Descent

Nesterov [2012] shows that randomized coordinate descent achieves a faster convergence rate than gradient descent for problems where we have $n$ variables and it is $n$ times cheaper to update one coordinate than it is to compute the entire gradient.
The expected linear convergence rates in this previous work rely on strong convexity, but in this section we show that randomized coordinate descent achieves an expected linear convergence rate if we only assume that the PL inequality holds.

To analyze coordinate descent methods, we assume that the gradient is coordinate-wise Lipschitz continuous, meaning that we have
\[ f(x + \alpha e_i) \leq f(x) + \alpha \nabla_i f(x) + \frac{L}{2} \alpha^2, \quad \forall \alpha \in \mathbb{R}, \; \forall x \in \mathbb{R}^n, \tag{4.5} \]
for any coordinate $i$, and where $e_i$ is the $i$th unit vector.

Theorem 3. Consider problem (4.1), where $f$ has a coordinate-wise $L$-Lipschitz continuous gradient (4.5), a non-empty solution set $\mathcal{X}^*$, and satisfies the PL inequality (4.3). Consider the coordinate descent method with a step-size of $1/L$,
\[ x^{k+1} = x^k - \frac{1}{L} \nabla_{i_k} f(x^k) e_{i_k}. \tag{4.6} \]
If we choose the variable to update $i_k$ uniformly at random, then the algorithm has an expected linear convergence rate of
\[ \mathbb{E}[f(x^k) - f^*] \leq \left(1 - \frac{\mu}{nL}\right)^k [f(x^0) - f^*]. \]

Proof. By using the update rule (4.6) in the Lipschitz condition (4.5) we have
\[ f(x^{k+1}) \leq f(x^k) - \frac{1}{2L} |\nabla_{i_k} f(x^k)|^2. \]
By taking the expectation of both sides with respect to $i_k$ we have
\[ \begin{aligned} \mathbb{E}[f(x^{k+1})] &\leq f(x^k) - \frac{1}{2L} \mathbb{E}\left[|\nabla_{i_k} f(x^k)|^2\right] \\ &= f(x^k) - \frac{1}{2L} \sum_i \frac{1}{n} |\nabla_i f(x^k)|^2 \\ &= f(x^k) - \frac{1}{2nL} \|\nabla f(x^k)\|^2. \end{aligned} \]
By using the PL inequality (4.3) and subtracting $f^*$ from both sides, we get
\[ \mathbb{E}[f(x^{k+1}) - f^*] \leq \left(1 - \frac{\mu}{nL}\right) [f(x^k) - f^*]. \]
Applying this recursively and using iterated expectations yields the result.

As before, instead of using $1/L$ we could perform exact coordinate optimization and the result would still hold. If we have a Lipschitz constant $L_i$ for each coordinate and sample proportional to the $L_i$ as suggested by Nesterov [2012], then the above argument (using a step-size of $1/L_{i_k}$) can be used to show that we obtain a faster rate of
\[ \mathbb{E}[f(x^k) - f^*] \leq \left(1 - \frac{\mu}{n\bar{L}}\right)^k [f(x^0) - f^*], \]
where $\bar{L} = \frac{1}{n} \sum_{j=1}^{n} L_j$.

4.2.2 Greedy Coordinate Descent

In Chapter 2 we analyzed coordinate descent under the greedy Gauss-Southwell (GS) rule, and argued that this rule may be suitable for problems with a large degree of sparsity.
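To make the uniform random rule of Theorem 3 and the GS rule concrete, the following is a minimal sketch of the update (4.6) on a small least squares objective. Several choices here are assumptions for illustration: the objective is scaled as $f(x) = \frac{1}{2}\|Ax - b\|^2$, $L$ is taken as the largest squared column norm of $A$, and the full gradient is recomputed at every step for simplicity, which ignores the cheap-iteration structure that makes coordinate descent attractive in practice.

```python
import random


def residual(A, x, b):
    return [sum(aij * xj for aij, xj in zip(row, x)) - bi
            for row, bi in zip(A, b)]


def objective(A, x, b):
    """f(x) = 0.5 * ||Ax - b||^2."""
    return 0.5 * sum(r * r for r in residual(A, x, b))


def gradient(A, x, b):
    r = residual(A, x, b)
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]


def coordinate_descent(A, b, iters, rule="random", seed=0):
    """Coordinate descent with step 1/L; returns the objective after every step."""
    rng = random.Random(seed)
    n = len(A[0])
    # Coordinate-wise Lipschitz constant: the largest squared column norm.
    L = max(sum(row[j] ** 2 for row in A) for j in range(n))
    x = [0.0] * n
    history = [objective(A, x, b)]
    for _ in range(iters):
        g = gradient(A, x, b)
        if rule == "greedy":  # Gauss-Southwell: largest gradient magnitude
            i = max(range(n), key=lambda j: abs(g[j]))
        else:                 # uniform random selection
            i = rng.randrange(n)
        x[i] -= g[i] / L
        history.append(objective(A, x, b))
    return history
```

Under (4.5) each step cannot increase the objective for either rule, so both histories are non-increasing; which rule makes faster progress on a given instance depends on the problem.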
We showed that a faster convergence rate can be obtained for the GS rule by measuring strong convexity in the $\ell_1$-norm. Since the PL inequality is defined on the dual (gradient) space, in order to derive an analogous result we could measure the PL inequality in the $\infty$-norm,
\[ \frac{1}{2} \|\nabla f(x)\|_\infty^2 \geq \mu_1 (f(x) - f^*). \]
Because of the equivalence between norms, this is not introducing any additional assumptions beyond that the PL inequality is satisfied. Further, if $f$ is $\mu_1$-strongly convex in the $\ell_1$-norm, then it satisfies the PL inequality in the $\infty$-norm with the same constant $\mu_1$. By using that $|\nabla_{i_k} f(x^k)| = \|\nabla f(x^k)\|_\infty$ when the GS rule is used, the above argument can be used to show that coordinate descent with the GS rule achieves a convergence rate of
\[ f(x^k) - f^* \leq \left(1 - \frac{\mu_1}{L}\right)^k [f(x^0) - f^*], \]
when the function satisfies the PL inequality in the $\infty$-norm with a constant of $\mu_1$. By the equivalence between norms we have that $\mu/n \leq \mu_1$, so this is faster than the rate with random selection.

Meir and Rätsch [2003] show that we can view some variants of boosting algorithms as implementations of coordinate descent with the GS rule. They use the error bound property to argue that these methods achieve a linear convergence rate, but this property does not lead to an explicit rate. Our simple result above thus provides the first explicit convergence rate for these variants of boosting.

4.2.3 Sign-Based Gradient Methods

The learning heuristic RPROP (Resilient backPROPagation) is a classic iterative method used for supervised learning problems in feedforward neural networks [Riedmiller and Braun, 1992]. The general update for some vector of step-sizes $\alpha^k \in \mathbb{R}^n$ is given by
\[ x^{k+1} = x^k - \alpha^k \circ \operatorname{sign} \nabla f(x^k), \]
where the $\circ$ operator indicates coordinate-wise multiplication. Although this method has been used for many years in the machine learning community, we are not aware of any previous convergence rate analysis of such a method.
We assume the individual step-sizes $\alpha_i^k$ are chosen proportional to $1/\sqrt{L_i}$, where the $L_i$ are constants such that the gradient is 1-Lipschitz continuous in the norm defined by
\[ \|z\|_{L^{-1}[1]} \triangleq \sum_i \frac{1}{\sqrt{L_i}} |z_i|. \]
Formally, we assume that the $L_i$ are set so that for all $x$ and $y$ we have
\[ \|\nabla f(y) - \nabla f(x)\|_{L^{-1}[1]} \leq \|y - x\|_{L[\infty]}, \tag{4.7} \]
and where the dual norm of the $\|\cdot\|_{L^{-1}[1]}$ norm above is given by the $\|\cdot\|_{L[\infty]}$ norm,
\[ \|z\|_{L[\infty]} \triangleq \max_i \sqrt{L_i} |z_i|. \]
We note that such $L_i$ always exist if the gradient is Lipschitz continuous, so this is not adding any assumptions on the function $f$. The particular choice of the step-sizes $\alpha_i^k$ that we will analyze is
\[ \alpha_i^k = \frac{\|\nabla f(x^k)\|_{L^{-1}[1]}}{\sqrt{L_i}}, \]
which yields a linear convergence rate for problems where the PL inequality is satisfied in the $L^{-1}[1]$-norm,
\[ \frac{1}{2} \|\nabla f(x^k)\|_{L^{-1}[1]}^2 \geq \mu_{L[\infty]} (f(x^k) - f^*). \tag{4.8} \]
This choice of $\alpha^k$ yields steepest descent in the L∞-norm. The coordinate-wise iteration update under this choice of $\alpha_i^k$ is given by
\[ x_i^{k+1} = x_i^k - \frac{\|\nabla f(x^k)\|_{L^{-1}[1]}}{\sqrt{L_i}} \operatorname{sign} \nabla_i f(x^k). \]
Defining a diagonal matrix $\Lambda$ with $1/\sqrt{L_i}$ along the diagonal, the update can be written as
\[ x^{k+1} = x^k - \|\nabla f(x^k)\|_{L^{-1}[1]} \Lambda \circ \operatorname{sign} \nabla f(x^k). \tag{4.9} \]

Theorem 4. Consider problem (4.1), where $f$ has a Lipschitz continuous gradient (4.7), a non-empty solution set $\mathcal{X}^*$, and satisfies the PL inequality (4.8). Consider the sign-based gradient update defined in (4.9). Then the algorithm has a linear convergence rate of
\[ f(x^{k+1}) - f(x^*) \leq (1 - \mu_{L[\infty]}) (f(x^k) - f(x^*)). \]

Proof. See Appendix C.3.

4.3 Proximal Gradient Generalization

Attouch and Bolte [2009] consider a generalization of the PL inequality due to Kurdyka (called the KL inequality) to give conditions under which the classic proximal-point algorithm achieves a linear convergence rate for non-smooth problems. However, in practice proximal-gradient methods are more relevant to many machine learning problems.
The KL inequality has been used to show local linear convergence of proximal gradient methods [Li and Pong, 2016], and it has been used to show global linear convergence of proximal gradient methods under the assumption that the objective function is convex [Bolte et al., 2015]. In this section we propose a different generalization of the PL inequality that yields a simple global linear convergence analysis without assuming convexity of the objective function.

Proximal gradient methods apply to problems of the form
\[ \operatorname*{argmin}_{x \in \mathbb{R}^n} F(x) = f(x) + g(x), \tag{4.10} \]
where $f$ is a differentiable function with an $L$-Lipschitz continuous gradient and $g$ is a simple but potentially non-smooth convex function. Typical examples of simple functions $g$ include a scaled $\ell_1$-norm of the parameter vectors, $g(x) = \lambda \|x\|_1$, and indicator functions that are zero if $x$ lies in a simple convex set and are infinity otherwise.

In order to analyze proximal gradient algorithms, a natural (though not particularly intuitive) generalization of the PL inequality is that there exists a $\mu > 0$ satisfying
\[ \frac{1}{2} D_g(x, L) \geq \mu (F(x) - F^*), \tag{4.11} \]
where
\[ D_g(x, \alpha) \equiv -2\alpha \min_y \left[ \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|^2 + g(y) - g(x) \right]. \tag{4.12} \]
We call this the proximal-PL inequality, and we note that if $g$ is constant (or linear) then it reduces to the standard PL inequality. Below we show that this inequality is sufficient for the proximal gradient method to achieve a global linear convergence rate.

Theorem 5. Consider problem (4.10), where $f$ has an $L$-Lipschitz continuous gradient (4.2), $F$ has a non-empty solution set $\mathcal{X}^*$, $g$ is convex, and $F$ satisfies the proximal-PL inequality (4.11). Then the proximal gradient method with a step-size of $1/L$,
\[ x^{k+1} = \operatorname*{argmin}_{y \in \mathbb{R}^n} \left[ \langle \nabla f(x^k), y - x^k \rangle + \frac{L}{2} \|y - x^k\|^2 + g(y) - g(x^k) \right] \tag{4.13} \]
converges linearly to the optimal value $F^*$,
\[ F(x^k) - F^* \leq \left(1 - \frac{\mu}{L}\right)^k [F(x^0) - F^*]. \]

Proof.
By using Lipschitz continuity of the gradient of $f$ we have
\[ \begin{aligned} F(x^{k+1}) &= f(x^{k+1}) + g(x^k) + g(x^{k+1}) - g(x^k) \\ &\leq F(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{L}{2} \|x^{k+1} - x^k\|^2 + g(x^{k+1}) - g(x^k) \\ &\leq F(x^k) - \frac{1}{2L} D_g(x^k, L) \\ &\leq F(x^k) - \frac{\mu}{L} [F(x^k) - F^*], \end{aligned} \]
which uses the definition of $x^{k+1}$ and $D_g$ followed by the proximal-PL inequality (4.11). This subsequently implies that
\[ F(x^{k+1}) - F^* \leq \left(1 - \frac{\mu}{L}\right) [F(x^k) - F^*], \tag{4.14} \]
which applied recursively gives the result.

While other conditions have been proposed to show linear convergence rates of proximal gradient methods without strong convexity [Kadkhodaie et al., 2014, Zhang, 2015], their analyses tend to be much more complicated than the above while, as we discuss in the next section, the proximal-PL inequality includes the standard scenarios where these apply.

4.3.1 Relevant Problems

As with the PL inequality, we now list several important function classes that satisfy the proximal-PL inequality (4.11). We give proofs that these classes satisfy the inequality in Appendix C.5.

1. The inequality is satisfied if $f$ satisfies the PL inequality and $g$ is constant. Thus, the above result generalizes Theorem 1.
2. The inequality is satisfied if $f$ is strongly convex. This is the usual assumption used to show a linear convergence rate for the proximal gradient algorithm [Schmidt et al., 2011], although we note that the above analysis is much simpler than standard arguments.
3. The inequality is satisfied if $f$ has the form $f(x) = h(Ax)$ for a strongly convex function $h$ and a matrix $A$, while $g$ is an indicator function for a polyhedral set.
4. The inequality is satisfied if $F$ is convex and satisfies the QG property.

It has also been shown that the inequality is satisfied if $F$ satisfies the proximal-EB condition or the KL inequality [Karimi et al., 2016].
By this equivalence, the proximal-PL inequality also holds for other problems where a linear convergence rate has been shown like group $\ell_1$-regularization [Tseng, 2010], sparse group $\ell_1$-regularization [Zhang et al., 2013], nuclear-norm regularization [Hou et al., 2013], and other classes of functions [Drusvyatskiy and Lewis, 2016, Zhou and So, 2015].

4.3.2 Least Squares with $\ell_1$-Regularization

Perhaps the most interesting example of problem (4.10) is the $\ell_1$-regularized least squares problem,
\[ \operatorname*{argmin}_{x \in \mathbb{R}^n} \frac{1}{2} \|Ax - b\|^2 + \lambda \|x\|_1, \]
where $\lambda > 0$ is the regularization parameter. This problem has been studied extensively in machine learning, signal processing, and statistics. This problem structure seems well-suited to using proximal gradient methods, but the first works analyzing proximal gradient methods for this problem only showed sub-linear convergence rates [Beck and Teboulle, 2009]. Subsequent works show that linear convergence rates can be achieved under additional assumptions. For example, Gu et al. [2013] prove that their algorithm achieves a linear convergence rate if $A$ satisfies a restricted isometry property (RIP) and the solution is sufficiently sparse. Xiao and Zhang [2013] also assume the RIP property and show linear convergence using a homotopy method that slowly decreases the value of $\lambda$. Agarwal et al. [2012] give a linear convergence rate under a modified restricted strong convexity and modified restricted smoothness assumption. But these problems have also been shown to satisfy proximal variants of the EB condition [Necoara and Clipici, 2016, Tseng, 2010], and thus by the equivalence result in [Karimi et al., 2016] we have that any $\ell_1$-regularized least squares problem satisfies the proximal-PL inequality.
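As an illustration of the proximal gradient update (4.13) on this problem, the sketch below uses the fact that for $g(x) = \lambda\|x\|_1$ the minimizer in (4.13) is a gradient step followed by coordinate-wise soft-thresholding. The function names are hypothetical, and using $\|A\|_F^2$ as the constant $L$ is an assumption made to keep the example dependency-free; it is a valid but generally loose upper bound on the largest eigenvalue of $A^T A$.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to the scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_objective(A, x, b, lam):
    r = [sum(aij * xj for aij, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    return 0.5 * sum(ri * ri for ri in r) + lam * sum(abs(xj) for xj in x)


def proximal_gradient(A, b, lam, iters):
    """Proximal gradient with step 1/L; returns (x, objective history)."""
    m, n = len(A), len(A[0])
    # ||A||_F^2 upper-bounds the largest eigenvalue of A^T A, so it is a valid L.
    L = sum(aij * aij for row in A for aij in row)
    x = [0.0] * n
    history = [lasso_objective(A, x, b, lam)]
    for _ in range(iters):
        r = [sum(aij * xj for aij, xj in zip(row, x)) - bi
             for row, bi in zip(A, b)]
        grad = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        x = [soft_threshold(xj - gj / L, lam / L) for xj, gj in zip(x, grad)]
        history.append(lasso_objective(A, x, b, lam))
    return x, history
```

No restricted isometry or sparsity condition is checked anywhere in this sketch; the update is the unmodified proximal gradient step.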
Thus, Theorem 5 gives a simple proof of global linear convergence for these problems without making additional assumptions or making any modifications to the algorithm.

4.3.3 Proximal Coordinate Descent

It is also possible to adapt our results on coordinate descent and proximal gradient methods in order to give a linear convergence rate for coordinate-wise proximal gradient methods for problem (4.10). To do this, we require the extra assumption that $g$ is a separable function. This means that $g(x) = \sum_i g_i(x_i)$ for a set of univariate functions $g_i$. The update rule for the coordinate-wise proximal gradient method is

$$
x^{k+1} = \operatorname*{argmin}_{\alpha \in \mathbb{R}} \left[\alpha \nabla_{i_k} f(x^k) + \frac{L}{2}\alpha^2 + g_{i_k}(x_{i_k} + \alpha) - g_{i_k}(x_{i_k})\right]. \tag{4.15}
$$

We state the convergence rate result below.

Theorem 6. Assume the setup of Theorem 5 and that $g$ is a separable function $g(x) = \sum_i g_i(x_i)$, where each $g_i$ is convex. Then the coordinate-wise proximal gradient update rule (4.15) achieves a convergence rate

$$
\mathbb{E}[F(x^k) - F^*] \le \left(1 - \frac{\mu}{nL}\right)^k \left[F(x^0) - F^*\right], \tag{4.16}
$$

when $i_k$ is selected uniformly at random.

The proof is given in Appendix C.6 and although it is more complicated than the proof of Theorem 5, it is still simpler than existing proofs for proximal coordinate descent under strong convexity [Richtárik and Takáč, 2014]. It is also possible to analyze stochastic proximal gradient algorithms, and indeed Reddi et al. [2016a] use the proximal-PL inequality to analyze finite-sum methods in the proximal stochastic case. We also note that Zhang [2016] has recently analyzed cyclic coordinate descent for convex functions satisfying QG.

4.3.4 Support Vector Machines

Another important model problem that arises in machine learning is support vector machines,

$$
\operatorname*{argmin}_{x \in \mathbb{R}^n} \sum_{i=1}^{m} \max(0, 1 - b_i x^T a_i) + \frac{\lambda}{2}\|x\|^2, \tag{4.17}
$$

where the pairs $(a_i, b_i)$ form the labelled training set with $a_i \in \mathbb{R}^n$ and $b_i \in \{-1, 1\}$.
We often solve this problem by performing coordinate optimization on its Fenchel dual, which has the form

$$
\min_{\bar{w}} f(\bar{w}) = \frac{1}{2}\bar{w}^T M \bar{w} - \sum_i \bar{w}_i, \quad \bar{w}_i \in [0, U], \tag{4.18}
$$

for a particular positive semi-definite matrix $M$ and constant $U$. This convex function satisfies the QG property and thus Theorem 6 implies that coordinate optimization achieves a linear convergence rate in terms of optimizing the dual objective. Further, note that Hush et al. [2006] show that we can obtain an $\epsilon$-accurate solution to the primal problem with an $O(\epsilon^2)$-accurate solution to the dual problem. Thus this result also implies we can obtain a linear convergence rate on the primal problem by showing that stochastic dual coordinate ascent has a linear convergence rate on the dual problem. Global linear convergence rates for SVMs have also been shown by others [Ma et al., 2015b, Tseng and Yun, 2009a, Wang and Lin, 2014], but again we note that these works lead to much more complicated analyses. Although the constants in these convergence rates may be quite bad (depending on the smallest non-zero singular value of the Gram matrix), we note that the existing sublinear rates still apply in the early iterations while, as the algorithm begins to identify support vectors, the constants improve (depending on the smallest non-zero singular value of the block of the Gram matrix corresponding to the support vectors).

The result of the previous section is not restricted to SVMs. Indeed, it implies a linear convergence rate for many $\ell_2$-regularized linear prediction problems, the framework considered in the stochastic dual coordinate ascent (SDCA) work of Shalev-Shwartz and Zhang [2013].
While Shalev-Shwartz and Zhang [2013] show that this is true when the primal is smooth, our result gives linear rates in many cases where the primal is non-smooth.

4.4 Discussion

We believe that the work in this chapter provides a unifying and simplifying view of a variety of optimization and convergence rate issues in machine learning. Indeed, we have shown that many of the assumptions used to achieve linear convergence rates can be replaced by the PL inequality and its proximal generalization. While we have focused on sufficient conditions for linear convergence, another recent work has turned to the question of necessary conditions for convergence [Zhang, 2016]. Further, while we have focused on non-accelerated methods, Zhang [2016] has recently analyzed Nesterov's accelerated gradient method without strong convexity. We also note that, while we have focused on first-order methods, Nesterov and Polyak [2006] have used the PL inequality to analyze a second-order Newton-style method with cubic regularization. They also consider a generalization of the inequality under the name gradient-dominated functions.

Throughout this chapter, we pointed out how our analyses imply convergence rates for a variety of machine learning models and algorithms. Some of these were previously known, typically under stronger assumptions or with more complicated proofs, but many of these are novel. We expect that going forward efficiency will no longer be decided by the issue of whether functions are strongly convex, but rather by whether they satisfy a variant of the PL inequality.

Chapter 5

Greedy Block Coordinate Descent

Block coordinate descent (BCD) methods have become one of our key tools for solving some of the most important large-scale optimization problems. This is due to their typical ease of implementation, low memory requirements, cheap iteration costs, adaptability to distributed settings, ability to use problem structure, and numerical performance.
Notably, they have been used for almost two decades in the context of $\ell_1$-regularized least squares (LASSO) [Fu, 1998, Sardy et al., 2000] and support vector machines (SVMs) [Chang and Lin, 2011, Joachims, 1999]. Indeed, randomized BCD methods have recently been used to solve instances of these widely-used models with a billion variables [Richtárik and Takáč, 2014], while for "kernelized" SVMs greedy BCD methods remain among the state-of-the-art methods [You et al., 2016]. Due to the wide use of these models, any improvements on the convergence of BCD methods will affect a myriad of applications.

While there are a variety of ways to implement a BCD method, the three main building blocks that affect its performance are:

1. Blocking strategy. We need to define a set of possible "blocks" of problem variables that we might choose to update at a particular iteration. Two common strategies are to form a partition of the coordinates into disjoint sets (we call this fixed blocks) or to consider any possible subset of coordinates as a "block" (we call this variable blocks). Typical choices include using a set of fixed blocks related to the problem structure, or using variable blocks by randomly sampling a fixed number of coordinates.

2. Block selection rule. Given a set of possible blocks, we need a rule to select a block of corresponding variables to update. Classic choices include cyclically going through a fixed ordering of blocks, choosing random blocks, choosing the block with the largest gradient (the Gauss-Southwell rule), or choosing the block that leads to the largest improvement.

3. Block update rule. Given the block we have selected, we need to decide how to update the block of corresponding variables.
Typical choices include performing a gradient descent iteration, computing the Newton direction and performing a line search, or computing the optimal update of the block by subspace minimization.

In the next section we introduce our notation, review the standard choices behind BCD algorithms, and discuss problem structures where BCD is suitable. Subsequently, the following sections explore a wide variety of ways to speed up BCD by modifying the three building blocks above.

1. In Section 5.2 we propose block selection rules that are variants of the Gauss-Southwell rule, but that incorporate knowledge of Lipschitz constants in order to give better bounds on the progress made at each iteration. We also give a general result characterizing the convergence rate obtained using the Gauss-Southwell rule as well as the new greedy rules, under both the Polyak-Łojasiewicz inequality and for general (potentially non-convex) functions.

2. In Section 5.3 we discuss practical implementation issues. This includes how to approximate the new rules in the variable-block setting, how to estimate the Lipschitz constants, how to efficiently implement line searches, how the blocking strategy interacts with greedy rules, and why we should prefer Newton updates over the "matrix updates" of recent works.

3. In Section 5.4 we show how second-order updates, or the exact update for quadratic functions, can be computed in linear time for problems with sparse dependencies when using "forest-structured" blocks. This allows us to use huge block sizes for problems with sparse dependencies, and uses a connection between sparse quadratic functions and Gaussian Markov random fields (GMRFs) by exploiting the "message-passing" algorithms developed for GMRFs.

We note that many related ideas have been explored by others in the context of BCD methods and we will go into detail about these related works in subsequent sections.
In Section 5.5 we use a variety of problems arising in machine learning to evaluate the effects of these choices on BCD implementations. These experiments indicate that in some cases the performance improvement obtained by using these enhanced methods can be dramatic. The source code and data files required to reproduce the experimental results of this paper can be downloaded from: https://github.com/IssamLaradji/BlockCoordinateDescent.

5.1 Block Coordinate Descent Algorithms

We first consider the problem of minimizing a differentiable multivariate function,

$$
\operatorname*{argmin}_{x \in \mathbb{R}^n} f(x). \tag{5.1}
$$

At iteration $k$ of a BCD algorithm, we first select a block $b_k \subseteq \{1, 2, \dots, n\}$ and then update the subvector $x_{b_k} \in \mathbb{R}^{|b_k|}$ corresponding to these variables,

$$
x^{k+1}_{b_k} = x^k_{b_k} + d^k.
$$

Coordinates of $x^{k+1}$ that are not included in $b_k$ are simply kept at their previous value. The vector $d^k \in \mathbb{R}^{|b_k|}$ is typically selected to provide descent in the direction of the reduced dimensional subproblem,

$$
d^k \in \operatorname*{argmin}_{d \in \mathbb{R}^{|b_k|}} f(x^k + U_{b_k} d), \tag{5.2}
$$

where we construct $U_{b_k} \in \{0, 1\}^{n \times |b_k|}$ from the columns of the identity matrix corresponding to the coordinates in $b_k$. Using this notation, we have

$$
x_{b_k} = U_{b_k}^T x,
$$

which allows us to write the BCD update of all $n$ variables in the form

$$
x^{k+1} = x^k + U_{b_k} d^k.
$$

There are many possible ways to define the block $b_k$ as well as the direction $d^k$. Typically we have a maximum block size $\tau$, which is chosen to be the largest number of variables that we can efficiently update at once. Given $\tau$ that divides evenly into $n$, consider a simple ordered fixed partition of the coordinates into a set $\mathcal{B}$ of $n/\tau$ blocks,

$$
\mathcal{B} = \{\{1, 2, \dots, \tau\}, \{\tau + 1, \tau + 2, \dots, 2\tau\}, \dots, \{(n - \tau) + 1, (n - \tau) + 2, \dots, n\}\}.
$$

To select the block in $\mathcal{B}$ to update at each iteration we could simply repeatedly cycle through this set in order. A simple choice of $d^k$ is the negative gradient corresponding to the coordinates in $b_k$, multiplied by a scalar step-size $\alpha_k$ that is sufficiently small to ensure that we decrease the objective function.
This leads to a gradient update of the form

$$
x^{k+1} = x^k - \alpha_k U_{b_k} \nabla_{b_k} f(x^k), \tag{5.3}
$$

where $\nabla_{b_k} f(x^k)$ are the elements of the gradient $\nabla f(x^k)$ corresponding to the coordinates in $b_k$. While this gradient update and cyclic selection among an ordered fixed partition is simple, we can often drastically improve the performance using more clever choices. We highlight some common alternative choices in the next three subsections.

5.1.1 Block Selection Rules

Repeatedly going through a fixed partition of the coordinates is known as cyclic selection [Bertsekas, 2016], and this is referred to as Gauss-Seidel when solving linear systems [Seidel, 1874]. The performance of cyclic selection may suffer if the order the blocks are cycled through is chosen poorly, but it has been shown that random permutations can fix a worst case for cyclic CD [Lee and Wright, 2016]. A variation on cyclic selection is "essentially" cyclic selection where each block must be selected at least every $m$ iterations for some fixed $m$ that is larger than the number of blocks [Bertsekas, 2016]. Alternately, several authors have explored the advantages of randomized block selection [Nesterov, 2010, Richtárik and Takáč, 2014]. The simplest randomized selection rule is to select one of the blocks uniformly at random. However, several recent works show dramatic performance improvements over this naive random sampling by incorporating knowledge of the Lipschitz continuity properties of the gradients of the blocks [Nesterov, 2010, Qu and Richtárik, 2016, Richtárik and Takáč, 2016] or more recently by trying to estimate the optimal sampling distribution online [Namkoong et al., 2017].

An alternative to cyclic and random block selection is greedy selection. Greedy methods solve an optimization problem to select the "best" block at each iteration.
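
To make the gradient update (5.3) concrete, the following is a minimal sketch of BCD on an ordered fixed partition, using either cyclic or uniformly random block selection. The quadratic test function, the helper names (`fixed_partition`, `bcd_gradient`), the constant step-size and the iteration count are illustrative assumptions and are not taken from the experiments in this work; coordinates are indexed from 0 rather than 1.

```python
import random


def fixed_partition(n, tau):
    """Ordered fixed partition of coordinates 0..n-1 into n/tau blocks."""
    if n % tau != 0:
        raise ValueError("this sketch assumes tau divides n")
    return [list(range(start, start + tau)) for start in range(0, n, tau)]


def gradient(A, c, x):
    """Gradient A x - c of the assumed test function f(x) = 0.5 x^T A x - c^T x."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) - c[i] for i in range(n)]


def objective(A, c, x):
    """Value of the assumed test function f(x) = 0.5 x^T A x - c^T x."""
    n = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(c[i] * x[i] for i in range(n))


def bcd_gradient(A, c, x0, blocks, step, iters, rule="cyclic", seed=0):
    """Apply update (5.3): x <- x - step * U_b grad_b f(x) for the selected block b."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(iters):
        if rule == "cyclic":
            block = blocks[k % len(blocks)]
        elif rule == "random":
            block = rng.choice(blocks)
        else:
            raise ValueError("rule must be 'cyclic' or 'random'")
        # The full gradient is computed for readability; only entries in `block` are used.
        g = gradient(A, c, x)
        for i in block:
            x[i] -= step * g[i]
    return x


# Small positive-definite example whose minimizer is x = (1, 1, 1, 1).
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
c = [1.0, 0.0, 0.0, 1.0]
blocks = fixed_partition(4, 2)  # [[0, 1], [2, 3]]
step = 1.0 / 3.0  # each 2x2 diagonal block of A has largest eigenvalue 3
x_cyclic = bcd_gradient(A, c, [0.0] * 4, blocks, step, iters=5000, rule="cyclic")
x_random = bcd_gradient(A, c, [0.0] * 4, blocks, step, iters=5000, rule="random")
```

With this step-size every iteration decreases the objective, so both selection rules approach the minimizer on this example. An efficient implementation would compute only the gradient entries of the selected block instead of the full gradient.
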
A classic example of greedy selection is the block Gauss-Southwell (GS) rule, which chooses the block whose gradient has the largest Euclidean norm,

$$
b_k \in \operatorname*{argmax}_{b \in \mathcal{B}} \|\nabla_b f(x^k)\|, \tag{5.4}
$$

where we use $\mathcal{B}$ as the set of possible blocks. As we saw in Chapter 2, this rule tends to make more progress per iteration in theory and practice than randomized selection. Unfortunately, for many problems it is more expensive than cyclic or randomized selection. However, several recent works show that certain problem structures allow efficient calculation of GS-style rules (see Chapter 2 as well as [Fountoulakis et al., 2016, Lei et al., 2016, Meshi et al., 2012]), allow efficient approximation of GS-style rules [Dhillon et al., 2011, Stich et al., 2017, Thoppe et al., 2014], or allow other rules that try to improve on the progress made at each iteration [Csiba et al., 2015, Glasmachers and Dogan, 2013].

The ideal selection rule is the maximum improvement (MI) rule, which chooses the block that decreases the function value by the largest amount. Notable recent applications of this rule include leading eigenvector computation [Li et al., 2015], polynomial optimization [Chen et al., 2012], and fitting Gaussian processes [Bo and Sminchisescu, 2012]. While recent works explore computing or approximating the MI rule for quadratic functions [Bo and Sminchisescu, 2012, Thoppe et al., 2014], in general the MI rule is much more expensive than the GS rule.

5.1.2 Fixed vs. Variable Blocks

While the choice of the block to update has a significant effect on performance, how we define the set of possible blocks also has a major impact. Although other variations are possible, we highlight below the two most common blocking strategies:

1. Fixed blocks. This method uses a partition of the coordinates into disjoint blocks, as in our simple example above. This partition is typically chosen prior to the first iteration of the BCD algorithm, and this set of blocks is then held fixed for all iterations of the algorithm.
We often use blocks of roughly equal size, so if we use blocks of size $\tau$ this method might partition the $n$ coordinates into $n/\tau$ blocks. Generic ways to partition the coordinates include "in order" as we did above [Bertsekas, 2016], or using a random partition [Nesterov, 2010]. Alternatively, the partition may exploit some inherent substructure of the objective function such as block separability [Meier et al., 2008], the ability to efficiently solve the corresponding sub-problem (5.2) with respect to the blocks [Sardy et al., 2000], or based on the Lipschitz constants of the resulting blocks [Csiba and Richtárik, 2016, Thoppe et al., 2014].

2. Variable blocks. Instead of restricting our blocks to a pre-defined partition of the coordinates, we could instead consider choosing any of the $2^n - 1$ possible sets of coordinates as our block. In the randomized setting, this is referred to as "arbitrary" sampling [Qu and Richtárik, 2016, Richtárik and Takáč, 2016]. We say that such strategies use variable blocks because we are not choosing from a partition of the coordinates that is fixed across the iterations. Due to computational considerations, when using variable blocks we typically want to impose a restriction on the size of the blocks. For example, we could construct a block of size $\tau$ by randomly sampling $\tau$ coordinates without replacement, which is known as $\tau$-nice sampling [Qu and Richtárik, 2016, Richtárik and Takáč, 2016]. Alternately, we could include each coordinate in the block $b_k$ with some probability like $\tau/n$ (so the block size may change across iterations but we control its expected size). A version of the greedy GS rule (5.4) with variable blocks would select the $\tau$ coordinates corresponding to the elements of the gradient with largest magnitudes [Tseng and Yun, 2009b]. This can be viewed as a greedy variant of $\tau$-nice sampling. While we can find these $\tau$ coordinates easily,¹⁰ computing the MI rule with variable blocks is much more difficult.
Indeed, while methods exist to compute the MI rule for quadratics with fixed blocks [Thoppe et al., 2014], with variable blocks it is NP-hard to compute the MI rule and existing works resort to approximations [Bo and Sminchisescu, 2012].

5.1.3 Block Update Rules

The selection of the update vector $d^k$ can significantly affect performance of the BCD method. For example, in the gradient update (5.3) the method can be sensitive to the choice of the step-size $\alpha_k$. Classic ways to set $\alpha_k$ include using a fixed step-size (with each block possibly having its own fixed step-size), using an approximate line search, or using the optimal step-size (which has a closed-form solution for quadratic objectives) [Bertsekas, 2016].

The most common alternative to the gradient update above is a Newton update,

$$
d^k = -\alpha_k \left(\nabla^2_{b_k b_k} f(x^k)\right)^{-1} \nabla_{b_k} f(x^k), \tag{5.5}
$$

where we might replace the instantaneous Hessian $\nabla^2_{b_k b_k} f(x^k)$ by a positive-definite approximation to it. In this context $\alpha_k$ is again a step-size that can be chosen using similar strategies to those mentioned above. Several recent works analyze such updates and show that they can substantially improve the convergence rate [Fountoulakis and Tappenden, 2015, Qu et al., 2016, Tappenden et al., 2016]. For special problem classes, another possible type of update is what we will call the optimal update. This update chooses $d^k$ to solve (5.2). In other words, it updates the block $b_k$ to maximally decrease the objective function.

¹⁰ Once the max-heap (as defined in Chapter 2) has been updated with the current gradient values, we can find the maximal $\tau$ elements by repeating the process of removing the maximal element and re-sorting the remaining elements $\tau$ times. This yields the $\tau$ maximal elements at a cost of $O(\tau \log n)$.

5.1.4 Problems of Interest

BCD methods tend to be good choices for problems where we can update all variables for roughly the same cost as computing the gradient.
As presented in Section 2.1 for the single-coordinate case this includes the following two common problem structures,

$$
h_1(x) := \sum_{i=1}^{n} g_i(x_i) + f(Ax), \quad \text{or} \quad h_2(x) := \sum_{i \in V} g_i(x_i) + \sum_{(i,j) \in E} f_{ij}(x_i, x_j),
$$

where $f$ is smooth and cheap, the $f_{ij}$ are smooth, $G = \{V, E\}$ is a graph, and $A$ is a matrix. Examples of problems leading to functions of the form $h_1$ include least squares, logistic regression, LASSO, and SVMs.¹¹ The most important example of problem $h_2$ is quadratic functions, which are crucial to many aspects of scientific computing.¹²

Problems $h_1$ and $h_2$ are also suitable for BCD methods, as they tend to admit efficient block update strategies. In general, if single-coordinate descent is efficient for a problem, then BCD methods are also efficient for that problem and this applies whether we use fixed blocks or variable blocks. Other scenarios where coordinate descent and BCD methods have proven useful include matrix and tensor factorization methods [Xu and Yin, 2013, Yu et al., 2012], problems involving log-determinants [Hsieh et al., 2013, Scheinberg and Rish, 2009], and problems involving convex extensions of sub-modular functions [Ene and Nguyen, 2015, Jegelka et al., 2013].

An important point to note is that there are special problem classes where BCD with fixed blocks is reasonable even though using variable blocks (or single-coordinate updates) would not be suitable. For example, consider a variant of problem $h_1$ where we use group $\ell_1$-regularization [Bakin, 1999],

$$
h_3(x) := \sum_{b \in \mathcal{B}} \|x_b\| + f(Ax), \tag{5.6}
$$

where $\mathcal{B}$ is a partition of the coordinates. We cannot apply single-coordinate updates to this problem due to the non-smooth norms, but we can take advantage of the group-separable structure in the sum of norms and apply BCD using the blocks in $\mathcal{B}$ [Meier et al., 2008, Qin et al., 2013]. Sardy et al. [2000] in their early work on solving LASSO problems consider problem $h_1$ where the columns of $A$ are the union of a set of orthogonal matrices.
By choosing the fixed blocks to correspond to the orthogonal matrices, it is very efficient to apply BCD. In Appendix D.1, we outline how fixed blocks lead to an efficient greedy BCD method for the widely-used multi-class logistic regression problem when the data has a certain sparsity level.

¹¹ Coordinate descent remains suitable for multi-linear generalizations of problem $h_1$ like functions of the form $f(XY)$ where $X$ and $Y$ are both matrix variables.

¹² Problem $h_2$ can be generalized to allow functions between more than 2 variables, and coordinate descent remains suitable as long as the expected number of functions in which each variable appears is $n$-times smaller than the total number of functions (assuming each function has a constant cost).

5.2 Improved Greedy Rules

Previous works have identified that the greedy GS rule can lead to suboptimal progress, and have proposed rules that are closer to the MI rule for the special case of quadratic functions [Bo and Sminchisescu, 2012, Thoppe et al., 2014]. However, for non-quadratic functions it is not obvious how we should approximate the MI rule. As an intermediate between the GS rule and the MI rule for general functions, in Section 2.5.2 we presented the Gauss-Southwell-Lipschitz (GSL) rule in the case of single-coordinate updates. The GSL rule is equivalent to the MI rule in the special case of quadratic functions, so either rule can be used in that setting. However, the MI rule involves optimizing over a subspace which will typically be expensive for non-quadratic functions. After reviewing the classic block GS rule, in this section we consider several possible block extensions of the GSL rule that give a better approximation to the MI rule without requiring subspace optimization.

5.2.1 Block Gauss-Southwell

When analyzing BCD methods we typically assume that the gradient of each block $b$ is $L_b$-Lipschitz continuous, meaning that for all $x \in \mathbb{R}^n$ and $d \in \mathbb{R}^{|b|}$

$$
\|\nabla_b f(x + U_b d) - \nabla_b f(x)\| \le L_b \|d\|, \tag{5.7}
$$

for some constant $L_b > 0$.
This is a standard assumption, and in Appendix D.2 we give bounds on $L_b$ for the common data-fitting models of least squares and logistic regression. If we apply the descent lemma [Bertsekas, 2016] to the reduced sub-problem (5.2) associated with some block $b_k$ selected at iteration $k$, then we obtain the following upper bound on the function value progress,

$$
\begin{aligned}
f(x^{k+1}) &\le f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{L_{b_k}}{2}\|x^{k+1} - x^k\|^2 \\
&= f(x^k) + \langle \nabla_{b_k} f(x^k), d^k \rangle + \frac{L_{b_k}}{2}\|d^k\|^2.
\end{aligned} \tag{5.8}
$$

The right side is minimized in terms of $d^k$ under the choice

$$
d^k = -\frac{1}{L_{b_k}} \nabla_{b_k} f(x^k), \tag{5.9}
$$

which is simply a gradient update with a step-size of $\alpha_k = 1/L_{b_k}$. Substituting this into our upper bound, we obtain

$$
f(x^{k+1}) \le f(x^k) - \frac{1}{2L_{b_k}} \|\nabla_{b_k} f(x^k)\|^2. \tag{5.10}
$$

A reasonable way to choose a block $b_k$ at each iteration is by minimizing the upper bound in (5.10) with respect to $b_k$. For example, if we assume that $L_b$ is the same for all blocks $b$ then we derive the GS rule (5.4) of choosing the $b_k$ that maximizes the gradient norm.

We can use the bound (5.10) to compare the progress made by different selection rules. For example, this bound indicates that the GS rule can make more progress with variable blocks than with fixed blocks (under the usual setting where the fixed blocks are a subset of the possible variable blocks). In particular, consider the case where we have partitioned the coordinates into blocks of size $\tau$ and we are comparing this to using variable blocks of size $\tau$. The case where there is no advantage for variable blocks is when the indices corresponding to the $\tau$-largest $|\nabla_i f(x^k)|$ values are in one of the fixed partitions; in this (unlikely) case the GS rule with fixed blocks and variable blocks will choose the same variables to update.
The case where we see the largest advantage of using variable blocks is when each of the indices corresponding to the $\tau$-largest $|\nabla_i f(x^k)|$ values are in different blocks of the fixed partition; in this case the last term in (5.10) can be improved by a factor as large as $\tau^2$ when using variable blocks instead of fixed blocks. Thus, with larger blocks there is more of an advantage to using variable blocks over fixed blocks.

5.2.2 Block Gauss-Southwell-Lipschitz

The GS rule is not the optimal block selection rule in terms of the bound (5.10) if we know the block-Lipschitz constants $L_b$. Instead of choosing the block with largest norm, consider minimizing (5.10) in terms of $b_k$,

$$
b_k \in \operatorname*{argmax}_{b \in \mathcal{B}} \left\{ \frac{\|\nabla_b f(x^k)\|^2}{L_b} \right\}. \tag{5.11}
$$

We call this the block Gauss-Southwell-Lipschitz (GSL) rule. If all $L_b$ are the same, then the GSL rule is equivalent to the classic GS rule. But in the typical case where the $L_b$ differ, the GSL rule guarantees more progress than the GS rule since it incorporates the gradient information as well as the Lipschitz constants $L_b$. For example, it reflects that if the gradients of two blocks are similar but their Lipschitz constants are very different, then we can guarantee more progress by updating the block with the smaller Lipschitz constant. In the extreme case, for both fixed and variable blocks the GSL rule improves the bound (5.10) over the GS rule by a factor as large as $(\max_{b \in \mathcal{B}} L_b)/(\min_{b \in \mathcal{B}} L_b)$.

The block GSL rule in (5.11) is a simple generalization of the single-coordinate GSL rule to blocks of any size. However, it loses a key feature of the single-coordinate GSL rule: the block GSL rule is not equivalent to the MI rule for quadratic functions.
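
To illustrate how the block GSL rule (5.11) can select a different block than the GS rule (5.4), the following sketch scores two fixed blocks under both rules and reports the decrease guaranteed by the bound (5.10). The gradient values and block-Lipschitz constants are made-up numbers chosen only for illustration, and the function names are hypothetical.

```python
def block_norm_sq(grad, block):
    """Squared Euclidean norm of the gradient restricted to one block."""
    return sum(grad[i] ** 2 for i in block)


def gs_rule(grad, blocks):
    """Block GS rule (5.4): index of the block with the largest gradient norm."""
    return max(range(len(blocks)), key=lambda j: block_norm_sq(grad, blocks[j]))


def gsl_rule(grad, blocks, lipschitz):
    """Block GSL rule (5.11): largest squared gradient norm divided by L_b."""
    return max(
        range(len(blocks)),
        key=lambda j: block_norm_sq(grad, blocks[j]) / lipschitz[j],
    )


def guaranteed_decrease(grad, block, L_b):
    """Decrease guaranteed by (5.10): ||grad_b||^2 / (2 L_b)."""
    return block_norm_sq(grad, block) / (2.0 * L_b)


# Assumed toy data: similar gradient norms, very different Lipschitz constants.
grad = [3.0, 0.0, 2.0, 2.0]
blocks = [[0, 1], [2, 3]]
lipschitz = [10.0, 1.0]

gs_choice = gs_rule(grad, blocks)               # picks block 0 (norm^2 = 9 vs 8)
gsl_choice = gsl_rule(grad, blocks, lipschitz)  # picks block 1 (0.9 vs 8.0)
gs_bound = guaranteed_decrease(grad, blocks[gs_choice], lipschitz[gs_choice])
gsl_bound = guaranteed_decrease(grad, blocks[gsl_choice], lipschitz[gsl_choice])
```

On this toy data the GS rule picks the block with the slightly larger gradient norm, while the GSL rule picks the block with the much smaller Lipschitz constant and obtains a larger guaranteed decrease.
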
Unlike the single-coordinate case, where $\nabla^2_{ii} f(x^k) = L_i$ so that (5.8) holds with equality, for the block case we only have $\nabla^2_{bb} f(x^k) \preceq L_b I$ so (5.8) may underestimate the progress that is possible in certain directions. In the next section we give a second generalization of the GSL rule that is equivalent to the MI rule for quadratics.

5.2.3 Block Gauss-Southwell-Quadratic

For single-coordinate updates, the bound in (5.10) is the tightest quadratic bound on progress we can expect given only the assumption of block Lipschitz-continuity (it holds with equality for quadratic functions). However, for block updates of more than one variable we can obtain a tighter quadratic bound using general quadratic norms of the form $\|\cdot\|_H = \sqrt{\langle H \cdot, \cdot \rangle}$ for some positive-definite matrix $H$. In particular, assume that each block has a Lipschitz-continuous gradient with $L_b = 1$ for a particular positive-definite matrix $H_b \in \mathbb{R}^{|b| \times |b|}$, meaning that

$$
\|\nabla_b f(x + U_b d) - \nabla_b f(x)\|_{H_b^{-1}} \le \|d\|_{H_b},
$$

for all $x \in \mathbb{R}^n$ and $d \in \mathbb{R}^{|b|}$. Due to the equivalence between norms, this merely changes how we measure the continuity of the gradient and is not imposing any new assumptions. Indeed, the block Lipschitz-continuity assumption in (5.7) is just a special case of the above with $H_b = L_b I$, where $I$ is the $|b| \times |b|$ identity matrix. Although this characterization of Lipschitz continuity appears more complex, for some functions it is actually computationally cheaper to construct matrices $H_b$ than to find valid bounds $L_b$. We show this in Appendix D.2 for the cases of least squares and logistic regression.

Under this alternative characterization of the Lipschitz assumption, at each iteration $k$ we have

$$
f(x^{k+1}) \le f(x^k) + \langle \nabla_{b_k} f(x^k), d^k \rangle + \frac{1}{2}\|d^k\|^2_{H_{b_k}}. \tag{5.12}
$$

The right-hand side of (5.12) is minimized when

$$
d^k = -(H_{b_k})^{-1} \nabla_{b_k} f(x^k), \tag{5.13}
$$

which we will call the matrix update of a block.
Although this is equivalent to Newton's method for quadratic functions, we use the name "matrix update" rather than "Newton's method" here to distinguish two types of updates: Newton's method is based on the instantaneous Hessian $\nabla^2_{bb} f(x^k)$, while the matrix update is based on a matrix $H_b$ that upper bounds the Hessian for all $x$.¹³ We will discuss Newton updates in subsequent sections, but substituting the matrix update into the upper bound yields

$$
f(x^{k+1}) \le f(x^k) - \frac{1}{2}\|\nabla_{b_k} f(x^k)\|^2_{H_{b_k}^{-1}}. \tag{5.14}
$$

¹³ We say that a matrix $A$ "upper bounds" a matrix $B$, written $A \succeq B$, if for all $x$ we have $x^T A x \ge x^T B x$.

Consider a simple quadratic function $f(x) = x^T A x$ for a positive-definite matrix $A$. In this case we can take $H_b$ to be the sub-matrix $A_{bb}$ while in our previous bound we would require $L_b$ to be the maximum eigenvalue of this sub-matrix. Thus, in the worst case (where $\nabla_{b_k} f(x^k)$ is in the span of the principal eigenvectors of $A_{bb}$) the new bound is at least as good as (5.10). But if the eigenvalues of $A_{bb}$ are spread out then this bound shows that the matrix update will typically guarantee substantially more progress; in this case the quadratic bound (5.14) can improve on the bound in (5.10) by a factor as large as the condition number of $A_{bb}$ when updating block $b$. The update (5.13) was analyzed for BCD methods in several recent works [Qu et al., 2016, Tappenden et al., 2016], which considered random selection of the blocks. They show that this update provably reduces the number of iterations required, and in some cases dramatically. For the special case of quadratic functions where (5.14) holds with equality, greedy rules based on minimizing this bound have been explored for both fixed [Thoppe et al., 2014] and variable [Bo and Sminchisescu, 2012] blocks.

Rather than focusing on the special case of quadratic functions, we want to define a better greedy rule than (5.11) for functions with Lipschitz-continuous gradients.
By optimizing (5.14) in terms of $b_k$ we obtain a second generalization of the GSL rule,

$$
b_k \in \operatorname*{argmax}_{b \in \mathcal{B}} \left\{ \|\nabla_b f(x^k)\|_{H_b^{-1}} \right\} \equiv \operatorname*{argmax}_{b \in \mathcal{B}} \left\{ \nabla_b f(x^k)^T H_b^{-1} \nabla_b f(x^k) \right\}, \tag{5.15}
$$

which we call the block Gauss-Southwell quadratic (GSQ) rule.¹⁴ Since (5.14) holds with equality for quadratics this new rule is equivalent to the MI rule in that case. But this rule also applies to non-quadratic functions where it guarantees a better bound on progress than the GS rule (and the GSL rule).

5.2.4 Block Gauss-Southwell-Diagonal

While the GSQ rule has appealing theoretical properties, for many problems it may be difficult to find full matrices $H_b$ and their storage may also be an issue. Previous related works [Csiba and Richtárik, 2017, Qu et al., 2016, Tseng and Yun, 2009b] address this issue by restricting the matrices $H_b$ to be diagonal matrices $D_b$. Under this choice we obtain a rule of the form

$$
b_k \in \operatorname*{argmax}_{b \in \mathcal{B}} \left\{ \|\nabla_b f(x^k)\|_{D_b^{-1}} \right\} \equiv \operatorname*{argmax}_{b \in \mathcal{B}} \left\{ \sum_{i \in b} \frac{|\nabla_i f(x^k)|^2}{D_{b,i}} \right\}, \tag{5.16}
$$

where we are using $D_{b,i}$ to refer to the diagonal element corresponding to coordinate $i$ in block $b$. We call this the block Gauss-Southwell diagonal (GSD) rule. This bound arises if we consider a gradient update, where coordinate $i$ has a constant step-size of $D_{b,i}^{-1}$ when updated as part of block $b$. This rule gives an intermediate approach that can guarantee more progress per iteration than the GSL rule, but that may be easier to implement than the GSQ rule.

¹⁴ While preparing this work for submission, we were made aware of a work that independently proposed this rule under the name "greedy mini-batch" rule [Csiba and Richtárik, 2017]. However, our focus on addressing the computational issues associated with the rule is quite different from that work, which focuses on tight convergence analyses.

5.2.5 Convergence Rate under Polyak-Łojasiewicz

Our discussion above focuses on the progress we can guarantee at each iteration, assuming only that the function has a Lipschitz-continuous gradient.
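
As an illustration of the GSQ rule (5.15) and the GSD rule (5.16), the sketch below evaluates both scores on a small quadratic $f(x) = \frac{1}{2}x^T A x - c^T x$, whose Hessian is $A$, so that the sub-matrix $A_{bb}$ is a valid choice of $H_b$. Several things here are assumptions of the sketch rather than choices made in this chapter: the matrix and vector values, the diagonal bound $D_{b,i} = \sum_{j \in b} |A_{ij}|$ (which upper bounds $A_{bb}$ because $D_b - A_{bb}$ is diagonally dominant), and the small Gaussian-elimination helper used in place of a linear algebra library.

```python
def solve(H, g):
    """Solve H z = g by Gaussian elimination with partial pivoting."""
    m = len(g)
    M = [list(H[i]) + [g[i]] for i in range(m)]  # augmented copy of [H | g]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for k in range(col, m + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, m))
        z[i] = (M[i][m] - tail) / M[i][i]
    return z


def submatrix(A, b):
    """Rows and columns of A indexed by block b."""
    return [[A[i][j] for j in b] for i in b]


def gsq_score(grad, A, b):
    """g_b^T H_b^{-1} g_b with H_b = A_bb, the quantity maximized in (5.15)."""
    g = [grad[i] for i in b]
    z = solve(submatrix(A, b), g)
    return sum(gi * zi for gi, zi in zip(g, z))


def gsd_score(grad, A, b):
    """sum_i g_i^2 / D_{b,i} as in (5.16), with the assumed row-sum bound D_{b,i}."""
    return sum(grad[i] ** 2 / sum(abs(A[i][j]) for j in b) for i in b)


def objective(A, c, x):
    """Assumed test function f(x) = 0.5 x^T A x - c^T x."""
    n = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(c[i] * x[i] for i in range(n))


A = [[4.0, 1.0, 0.5, 0.0],
     [1.0, 3.0, 0.0, 0.5],
     [0.5, 0.0, 2.0, -1.0],
     [0.0, 0.5, -1.0, 2.0]]
c = [1.0, 1.0, 1.0, 1.0]
x = [0.0, 0.0, 0.0, 0.0]
grad = [sum(A[i][j] * x[j] for j in range(4)) - c[i] for i in range(4)]
blocks = [[0, 1], [2, 3]]

scores_gsq = [gsq_score(grad, A, b) for b in blocks]
scores_gsd = [gsd_score(grad, A, b) for b in blocks]
best = max(range(len(blocks)), key=lambda j: scores_gsq[j])

# Matrix update (5.13) on the block selected by the GSQ rule.
b = blocks[best]
d = solve(submatrix(A, b), [-grad[i] for i in b])
x_new = list(x)
for i, di in zip(b, d):
    x_new[i] += di
decrease = objective(A, c, x) - objective(A, c, x_new)
```

For this quadratic the observed decrease after the matrix update equals half of the GSQ score of the selected block, which matches (5.14) holding with equality, and each GSD score is no larger than the corresponding GSQ score because the assumed $D_b$ upper bounds $A_{bb}$.
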
Under additional assumptions, it is possible to use these progress bounds to derive convergence rates on the overall BCD method. For example, assume $f$ satisfies the PL inequality presented in Chapter 4, that is, for all $x$ we have for some $\mu > 0$ that

$$
\frac{1}{2}\left(\|\nabla f(x)\|_*\right)^2 \ge \mu \left(f(x) - f^*\right), \tag{5.17}
$$

where $\|\cdot\|_*$ can be any norm and $f^*$ is the optimal function value. The function class satisfying this inequality includes all strongly convex functions but also includes a variety of other important problems like least squares (see Section 4.1.3). As seen in Section 4.2, this inequality leads to a simple proof of the linear convergence of any algorithm which has a progress bound of the form

$$
f(x^{k+1}) \le f(x^k) - \frac{1}{2}\|\nabla f(x^k)\|^2_*, \tag{5.18}
$$

such as gradient descent, coordinate descent with the GS rule and sign-based updates.

Theorem 7. Assume $f$ satisfies the PL inequality (5.17) for some $\mu > 0$ and norm $\|\cdot\|_*$. Any algorithm that satisfies a progress bound of the form (5.18) with respect to the same norm $\|\cdot\|_*$ obtains the following linear convergence rate,

$$
f(x^{k+1}) - f^* \le (1 - \mu)^k \left[f(x^0) - f^*\right]. \tag{5.19}
$$

Proof. By subtracting $f^*$ from both sides of (5.18) and applying (5.17) directly, we obtain our result by recursion.

Thus, if we can describe the progress obtained using a particular block selection rule and block update rule in the form of (5.18), then we have a linear rate of convergence for BCD on this class of functions. It is straightforward to do this using an appropriately defined norm, as shown in the following corollary.

Corollary 1. Assume $\nabla f$ is Lipschitz continuous (5.12) and that $f$ satisfies the PL inequality (5.17) in the norm defined by

$$
\|v\|_{\mathcal{B}} = \max_{b \in \mathcal{B}} \|v_b\|_{H_b^{-1}}, \tag{5.20}
$$

for some $\mu > 0$ and matrix $H_b \in \mathbb{R}^{|b| \times |b|}$. Then the BCD method using either the GSQ, GSL, GSD or GS selection rule achieves a linear convergence rate of the form (5.19).

Proof.
Using the definition of the GSQ rule (5.15) in the progress bound resulting from the Lipschitz continuity of $\nabla f$ and the matrix update (5.14), we have

$$f(x^{k+1}) \leq f(x^k) - \frac{1}{2} \max_{b \in \mathcal{B}} \left\{ \|\nabla_b f(x^k)\|^2_{H_b^{-1}} \right\} = f(x^k) - \frac{1}{2} \|\nabla f(x^k)\|^2_{\mathcal{B}}. \tag{5.21}$$

By Theorem 7 and the observation that the GSL, GSD and GS rules are all special cases of the GSQ rule corresponding to specific choices of $H_b$, we have our result.

We refer the reader to the work of Csiba and Richtárik for tools that allow alternative analyses of BCD methods [Csiba and Richtárik, 2017].

5.2.6 Convergence Rate with General Functions

The PL inequality is satisfied for many problems of practical interest, and is even satisfied for some non-convex functions. However, general non-convex functions do not satisfy the PL inequality and thus the analysis of the previous section does not apply. Without a condition like the PL inequality, it is difficult to show convergence to the optimal function value $f^*$ (since we should not expect a local optimization method to be able to find the global solution of functions that may be NP-hard to minimize). However, the bound (5.21) still implies a weaker type of convergence rate even for general non-convex problems. The following result is a generalization of a standard argument for the convergence rate of gradient descent [Nesterov, 2004], and gives us an idea of how fast the BCD method is able to find a point resembling a stationary point even in the general non-convex setting.

Theorem 8. Assume $\nabla f$ is Lipschitz continuous (5.12) and that $f$ is bounded below by some $f^*$. Then the BCD method using either the GSQ, GSL, GSD or GS selection rule achieves the following convergence rate of the minimum gradient norm,

$$\min_{t = 0, 1, \ldots, k-1} \|\nabla f(x^t)\|^2_{\mathcal{B}} \leq \frac{2 \left( f(x^0) - f^* \right)}{k}.$$

Proof. 
By rearranging (5.21), we have

$$\frac{1}{2} \|\nabla f(x^k)\|^2_{\mathcal{B}} \leq f(x^k) - f(x^{k+1}).$$

Summing this inequality over iterations $t = 0$ up to $(k - 1)$, the right-hand side telescopes and we obtain

$$\frac{1}{2} \sum_{t=0}^{k-1} \|\nabla f(x^t)\|^2_{\mathcal{B}} \leq f(x^0) - f(x^{k}).$$

Using that all $k$ elements in the sum are lower bounded by their minimum and that $f(x^{k}) \geq f^*$, we get

$$\frac{k}{2} \left( \min_{t = 0, 1, \ldots, k-1} \|\nabla f(x^t)\|^2_{\mathcal{B}} \right) \leq f(x^0) - f^*.$$

Due to the potential non-convexity we cannot say anything about the gradient norm of the final iteration, but this shows that the minimum gradient norm converges to zero with an error at iteration $k$ of $O(1/k)$. This is a global sublinear result, but note that if the algorithm eventually arrives and stays in a region satisfying the PL inequality around a set of local optima, then the local convergence rate to this set of optima will increase to be linear.

5.3 Practical Issues

The previous section defines straightforward new rules that yield a simple analysis. In practice there are several issues that remain to be addressed. For example, it seems intractable to compute any of the new rules in the case of variable blocks. Furthermore, we may not know the Lipschitz constants for our problem. For fixed blocks we also need to consider how to partition the coordinates into blocks. Another issue is that the $d^k$ choices used above do not incorporate the local Hessian information. Although how we address these issues will depend on our particular application, in this section we discuss several issues associated with these practical considerations.

5.3.1 Tractable GSD for Variable Blocks

The problem with using any of the new selection rules above in the case of variable blocks is that they seem intractable to compute for any non-trivial block size. In particular, to compute the GSL rule using variable blocks requires the calculation of $L_b$ for each possible block, which seems intractable for any problem of reasonable size. Since the GSL rule is a special case of the GSD and GSQ rules, these rules also seem intractable in general. 
In this section we show how to restrict the GSD matrices so that this rule has the same complexity as the classic GS rule.

Consider a variant of the GSD rule where each $D_{b,i}$ can depend on $i$ but does not depend on $b$, so we have $D_{b,i} = d_i$ for some value $d_i \in \mathbb{R}_+$ for all blocks $b$. This gives a rule of the form

$$b_k \in \operatorname*{argmax}_{b \in \mathcal{B}} \left\{ \sum_{i \in b} \frac{|\nabla_i f(x^k)|^2}{d_i} \right\}. \tag{5.22}$$

Unlike the general GSD rule, this rule has essentially the same complexity as the classic GS rule since it simply involves finding the largest values of the ratio $|\nabla_i f(x^k)|^2 / d_i$.

A natural choice of the $d_i$ values would seem to be $d_i = L_i$, since in this case we recover the GSL rule if the blocks have a size of 1 (here we are using $L_i$ to refer to the coordinate-wise Lipschitz constant of coordinate $i$). Unfortunately, this does not lead to a bound of the form (5.18) as needed in Theorems 7 and 8 because coordinate-wise $L_i$-Lipschitz continuity with respect to the Euclidean norm does not imply 1-Lipschitz continuity with respect to the norm $\|\cdot\|_{D_b^{-1}}$ when the block size is larger than 1. Consequently, the steps under this choice may increase the objective function. A similar restriction on the $D_b$ matrices in (5.22) is used in the implementation of Tseng and Yun based on the Hessian diagonals [Tseng and Yun, 2009b], but their approach similarly does not give an upper bound and thus they employ a line search in their block update.

It is possible to avoid needing a line search by setting $D_{b,i} = L_i \tau$, where $\tau$ is the maximum block size in $\mathcal{B}$. This still generalizes the single-coordinate GSL rule, and in Appendix D.3 we show that this leads to a bound of the form (5.18) for twice-differentiable convex functions (thus Theorems 7 and 8 hold). If all blocks have the same size then this approach selects the same block as using $D_{b,i} = L_i$, but the matching block update uses a much-smaller step-size that guarantees descent. 
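As a rough illustration of rule (5.22) with variable blocks, the sketch below picks the $\tau$ coordinates with the largest ratios $|\nabla_i f(x^k)|^2 / d_i$ under the choice $d_i = L_i \tau$, and forms the matching gradient step with per-coordinate step-size $1/d_i$. The function name `gsd_select`, the list-based inputs and the use of `heapq.nlargest` are illustrative assumptions and are not taken from the implementation used in this work.

```python
import heapq


def gsd_select(grad, lipschitz, tau):
    """Select a block of size tau with rule (5.22), assuming d_i = L_i * tau.

    Hypothetical helper: `grad` holds the partial derivatives of f at x^k and
    `lipschitz` the coordinate-wise constants L_i (equal-length lists).
    """
    d = [L * tau for L in lipschitz]
    # Separable score of each coordinate: |grad_i|^2 / d_i.
    scores = [(g * g) / di for g, di in zip(grad, d)]
    # The maximizing block consists of the tau largest scores.
    block = heapq.nlargest(tau, range(len(grad)), key=scores.__getitem__)
    # Matching gradient update: coordinate i moves by -grad_i / d_i.
    step = {i: -grad[i] / d[i] for i in block}
    return sorted(block), step


block, step = gsd_select([1.0, -4.0, 2.0, 0.5], [1.0, 8.0, 1.0, 1.0], tau=2)
```

Because the score is separable across coordinates, the selection reduces to a top-$\tau$ search, which is why the cost stays close to that of the classic GS rule.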
We do not advocate using this smaller step, but note that the bound we derive also holds for alternate updates like taking a gradient update with $\alpha_k = 1/L_{b_k}$ or using a matrix update based on $H_{b_k}$.

The choice of $D_{b,i} = L_i \tau$ leads to a fairly pessimistic bound, but it is not obvious even for simple problems how to choose an optimal set of $D_i$ values. Choosing these values is related to the problem of finding an expected separable over-approximation (ESO), which arises in the context of randomized coordinate descent methods [Richtárik and Takáč, 2016]. Qu and Richtárik give an extensive discussion of how we might bound such quantities for certain problem structures [Qu and Richtárik, 2016]. In our experiments we also explored another simple choice that is inspired by the “simultaneous iterative reconstruction technique” (SIRT) from tomographic image reconstruction [Gregor and Fessler, 2015]. In this approach, we use a matrix upper bound $M$ on the full Hessian $\nabla^2 f(x)$ (for all $x$) and set$^{15}$

$$D_{b,i} = \sum_{j=1}^{n} |M_{i,j}|. \tag{5.23}$$

We found that this choice worked better when using gradient updates, although using the simpler $L_i \tau$ is less expensive and was more effective when doing matrix updates.

By using the relationship $L_b \leq \sum_{i \in b} L_i \leq |b| \max_{i \in b} L_i$, two other ways we might consider defining a more-tractable rule could be

$$b_k \in \operatorname*{argmax}_{b \in \mathcal{B}} \left\{ \frac{\sum_{i \in b} |\nabla_i f(x^k)|^2}{|b| \max_{i \in b} L_i} \right\}, \quad \text{or} \quad b_k \in \operatorname*{argmax}_{b \in \mathcal{B}} \left\{ \frac{\sum_{i \in b} |\nabla_i f(x^k)|^2}{\sum_{i \in b} L_i} \right\}.$$

The rule on the left can be computed using dynamic programming while the rule on the right can be computed using an algorithm of Megiddo [1979]. 
However, when using a step-size of $1/L_b$ we found that both rules performed similarly to or worse than using the GSD rule with $D_{b,i} = L_i$ (when paired with gradient or matrix updates).$^{16}$

$^{15}$ It follows that $D - M \succeq 0$ because it is symmetric and diagonally-dominant with non-negative diagonals.

5.3.2 Tractable GSQ for Variable Blocks

In order to make the GSQ rule tractable with variable blocks, we could similarly require that the entries of $H_b$ depend solely on the coordinates $i \in b$, so that $H_b = M_{b,b}$ where $M$ is a fixed matrix (as above) and $M_{b,b}$ refers to the sub-matrix corresponding to the coordinates in $b$. Our restriction on the GSD rule in the previous section corresponds to the case where $M$ is diagonal. In the full-matrix case, the block selected according to this rule is given by the coordinates corresponding to the non-zero variables of an $L_0$-constrained quadratic minimization,

$$\operatorname*{argmin}_{\|d\|_0 \leq \tau} \left\{ f(x^k) + \langle \nabla f(x^k), d \rangle + \frac{1}{2} d^T M d \right\}, \tag{5.24}$$

where $\|\cdot\|_0$ is the number of non-zeroes. This selection rule is discussed in Tseng and Yun [2009b], but in their implementation they use a diagonal $M$. Although this problem is NP-hard with a non-diagonal $M$, there is a recent wealth of literature on algorithms for finding approximate solutions. For example, one of the simplest local optimization methods for this problem is the iterative hard-thresholding (IHT) method [Blumensath and Davies, 2009]. Another popular method for approximately solving this problem is the orthogonal matching pursuit (OMP) method from signal processing, which is also known as forward selection in statistics [Hocking, 1976, Pati et al., 1993]. 
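To make the IHT idea concrete for problem (5.24), the following is a minimal sketch under stated assumptions: $M$ is a dense symmetric matrix stored as a list of lists, the step-size is $1/L$ with $L$ taken as the largest absolute row sum of $M$ (a Gershgorin-type upper bound on its largest eigenvalue), and a fixed number of iterations is used. The name `iht_block` and these choices are hypothetical; as a local method, IHT is not guaranteed to return the exact GSQ block.

```python
def iht_block(grad, M, tau, iters=50):
    """Approximately solve (5.24) by iterative hard-thresholding.

    Hypothetical helper: minimizes <grad, d> + 0.5 * d^T M d subject to
    ||d||_0 <= tau, and returns the support (the block) together with d.
    """
    n = len(grad)
    # Largest absolute row sum upper-bounds the largest eigenvalue of M.
    L = max(sum(abs(v) for v in row) for row in M)
    d = [0.0] * n
    for _ in range(iters):
        # Gradient of the quadratic model at d: grad + M d.
        q_grad = [grad[i] + sum(M[i][j] * d[j] for j in range(n))
                  for i in range(n)]
        # Gradient step followed by keeping the tau largest magnitudes.
        z = [d[i] - q_grad[i] / L for i in range(n)]
        keep = sorted(range(n), key=lambda i: abs(z[i]), reverse=True)[:tau]
        d = [z[i] if i in keep else 0.0 for i in range(n)]
    block = sorted(i for i in range(n) if d[i] != 0.0)
    return block, d


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
block, d = iht_block([-3.0, 1.0, -2.0], identity, tau=2)
```

With a diagonal $M$ the problem is separable, and in the small example above the sketch recovers the two coordinates with the largest values of $|\nabla_i f(x^k)|^2 / M_{i,i}$, matching the diagonal restriction of the previous section.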
Computing $d$ via (5.24) is also equivalent to computing the MI rule for a quadratic function, and thus we could alternately use the approximation of Bo and Sminchisescu [2012] for this problem.

Although it appears quite general, note that the exact GSQ rule under this restriction on $H_b$ does not guarantee as much progress as the more-general GSQ rule (if computed exactly) that we proposed in the previous section. For some problems we can obtain tighter matrix bounds over blocks of variables than are obtained by taking the sub-matrix of a fixed matrix-bound over all variables. We show this for the case of multi-class logistic regression in Appendix D.2. As a consequence of this result we conclude that there does not appear to be a reason to use this restriction in the case of fixed blocks.

Although using the GSQ rule with variable blocks forces us to use an approximation, these approximations might still select a block that makes more progress than methods based on diagonal approximations (which ignore the strengths of dependencies between variables). It is possible that approximating the GSQ rule does not necessarily lead to a bound of the form (5.18), as there may be no fixed norm for which this inequality holds. However, in this case we can initialize the algorithm with an update rule that does achieve such a bound in order to guarantee that Theorems 7 and 8 hold, since this initialization ensures that we do at least as well as this reference block selection rule.

$^{16}$ On the other hand, the rule on the right worked better if we forced the algorithms to use a step-size of $1/(\sum_{i \in b} L_i)$, but this led to worse performance overall than using the larger $1/L_b$ step-size.

The main disadvantage of this approach for large-scale problems is the need to deal with the full matrix $M$ (which does not arise when using a diagonal approximation or using fixed blocks). 
In large-scale settings we would need to consider matrices $M$ with special structures like the sum of a diagonal matrix with a sparse and/or a low-rank matrix.

5.3.3 Lipschitz Estimates for Fixed Blocks

Using the GSD rule with the choice of $D_{b,i} = L_i$ may also be useful in the case of fixed blocks. In particular, if it is easier to compute the single-coordinate $L_i$ values than the block $L_b$ values then we might prefer to use the GSD rule with this choice. On the other hand, an appealing alternative in the case of fixed blocks is to use an estimate of $L_b$ for each block as in Nesterov’s work [Nesterov, 2010]. In particular, for each $L_b$ we could start with some small estimate (like $L_b = 1$) and then double it whenever the inequality (5.10) is not satisfied (since this indicates $L_b$ is too small). Given some $b$, the bound obtained under this strategy is at most a factor of 2 slower than using the optimal values of $L_b$. Further, if our estimate of $L_b$ is much smaller than the global value, then this strategy can actually guarantee much more progress than using the “correct” $L_b$ value.$^{17}$

In the case of matrix updates, we can use (5.14) to verify that an $H_b$ matrix is valid [Fountoulakis and Tappenden, 2015]. Recall that (5.14) is derived by plugging the update (5.13) into the Lipschitz progress bound (5.12). Unfortunately, it is not obvious how to update a matrix $H_b$ if we find that it is not a valid upper bound. One simple possibility is to multiply the elements of our estimate $H_b$ by 2. This is equivalent to using a matrix update but with a scalar step-size $\alpha_k$,

$$d^k = -\alpha_k (H_b)^{-1} \nabla_{b_k} f(x^k), \tag{5.25}$$

similar to the step-size in the Newton update (5.5).

5.3.4 Efficient Line Searches

The Lipschitz approximation procedures of the previous section do not seem practical when using variable blocks, since there are an exponential number of possible blocks. To use variable blocks for problems where we do not know $L_b$ or $H_b$, a reasonable approach is to use a line search. 
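A minimal sketch of such a line search is given below, assuming a simple step-halving backtracking scheme with an Armijo sufficient-decrease test. The function name `armijo_backtracking`, its parameters and the default constants are illustrative assumptions; this is not the interpolation-based search used in our implementation. The direction `d` is assumed to be zero outside the selected block.

```python
def armijo_backtracking(f, x, grad, d, alpha0=1.0, shrink=0.5, c1=1e-4,
                        max_iter=50):
    """Return a step-size alpha with f(x + alpha d) sufficiently below f(x).

    Hypothetical helper: `f` maps a list of floats to a float, `grad` is the
    gradient of f at x, and `d` is a descent direction (zero off the block).
    """
    fx = f(x)
    # Directional derivative of f along d.
    slope = sum(g * di for g, di in zip(grad, d))
    alpha = alpha0
    for _ in range(max_iter):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        # Armijo condition: decrease proportional to alpha * slope.
        if f(trial) <= fx + c1 * alpha * slope:
            return alpha
        alpha *= shrink
    return alpha


# One-dimensional example: f(x) = x^2 at x = 1 with the direction -grad.
alpha = armijo_backtracking(lambda v: v[0] ** 2, [1.0], [2.0], [-2.0])
```

In the example the initial step $\alpha = 1$ overshoots to $x = -1$ without decreasing $f$, so the search halves the step once and accepts $\alpha = 0.5$.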
For example, we can choose $\alpha_k$ in (5.25) using a standard line search like those that use the Armijo condition or Wolfe conditions [Nocedal and Wright, 1999]. When using large block sizes with gradient updates, line searches based on the Wolfe conditions are likely to make more progress than using the true $L_b$ values (since for large block sizes the line search would tend to choose values of $\alpha_k$ that are much larger than $\alpha_k = 1/L_{b_k}$).

Further, the problem structures that lend themselves to efficient coordinate descent algorithms tend to lend themselves to efficient line search algorithms. For example, if our objective has the form $f(Ax)$ then a line search would try to minimize $f(Ax^k + \alpha_k A U_{b_k} d^k)$ in terms of $\alpha_k$. Notice that the algorithm would already have access to $Ax^k$ and that we can efficiently compute $A U_{b_k} d^k$ since it only depends on the columns of $A$ that are in $b_k$. Thus, after (efficiently) computing $A U_{b_k} d^k$ once, the line search simply involves trying to minimize $f(v_1 + \alpha_k v_2)$ in terms of $\alpha_k$ (for particular vectors $v_1$ and $v_2$). The cost of this approximate minimization will typically not add a large cost to the overall algorithm.

$^{17}$ While it might be tempting to also apply such estimates in the case of variable blocks, a practical issue is that we would need a step-size for all of the exponential number of possible variable blocks.

5.3.5 Block Partitioning with Fixed Blocks

Several prior works note that for fixed blocks the partitioning of coordinates into blocks can play a significant role in the performance of BCD methods. Thoppe et al. [2014] suggest trying to find a block-diagonally dominant partition of the coordinates, and experimented with a heuristic for quadratic functions where the coordinates corresponding to the rows with the largest values were placed in the same block. In the context of parallel BCD, Scherrer et al. 
[2012] consider a feature clustering approach in the context of problem $h_1$ that tries to minimize the spectral norm between columns of $A$ from different blocks. Csiba and Richtárik [2016] discuss strategies for partitioning the coordinates when using randomized selection.

Based on the discussion in the previous sections, for greedy BCD methods it is clear that we guarantee the most progress if we can make the mixed norm $\|\nabla f(x^k)\|_{\mathcal{B}}$ as large as possible across iterations (assuming that the $H_b$ give a valid bound). This supports strategies where we try to minimize the maximum Lipschitz constant across iterations. One way to do this is to try to ensure that the average Lipschitz constant across the blocks is small. For example, we could place the largest $L_i$ value with the smallest $L_i$ value, the second-largest $L_i$ value with the second-smallest $L_i$ value, and so on. While intuitive, this may be sub-optimal; it ignores that if we cleverly partition the coordinates we may force the algorithm to often choose blocks with very-small Lipschitz constants (which lead to much more progress in decreasing the objective function). In our experiments, similar to the method of Thoppe et al. for quadratics, we explore the simple strategy of sorting the $L_i$ values and partitioning this list into equal-sized blocks. Although in the worst case this leads to iterations that are not very productive since they update all of the largest $L_i$ values, it also guarantees some very productive iterations that update none of the largest $L_i$ values and leads to better overall performance in our experiments.

5.3.6 Newton Updates

Choosing the vector $d^k$ that we use to update the block $x_{b_k}$ would seem to be straightforward since in the previous section we derived the block selection rules in the context of specific block updates; the GSL rule is derived assuming a gradient update (5.9), the GSQ rule is derived assuming a matrix update (5.13), and so on. 
However, using the update $d^k$ that leads to the selection rule can be highly sub-optimal. For example, we might make substantially more progress using the matrix update (5.13) even if we choose the block $b_k$ based on the GSL rule. Indeed, given $b_k$ the matrix update makes the optimal amount of progress for quadratic functions, so in this case we should prefer the matrix update for all selection rules (including random and cyclic rules).

However, the matrix update in (5.13) can itself be highly sub-optimal for non-quadratic functions as it employs an upper-bound $H_{b_k}$ on the sub-Hessian $\nabla^2_{b_k b_k} f(x)$ that must hold for all parameters $x$. For twice-differentiable non-quadratic functions, we could potentially make more progress by using classic Newton updates where we use the instantaneous Hessian $\nabla^2_{b_k b_k} f(x^k)$ with respect to the block. Indeed, considering the extreme case where we have one block containing all the coordinates, Newton updates can lead to superlinear convergence [Dennis and Moré, 1974] while matrix updates destroy this property. That being said, we should not expect superlinear convergence of BCD methods with Newton or even optimal updates.$^{18}$ Nevertheless, in Chapter 6 we show that for certain common problem structures it is possible to achieve superlinear convergence with Newton-style updates.

Fountoulakis and Tappenden recently highlight this difference between using matrix updates and using Newton updates [Fountoulakis and Tappenden, 2015], and propose a BCD method based on Newton updates. To guarantee progress when far from the solution, classic Newton updates require safeguards like a line search or trust-region [Fountoulakis and Tappenden, 2015, Tseng and Yun, 2009b], but as we discussed in this section line searches tend not to add a significant cost to BCD methods. Thus, if we want to maximize the progress we make at each iteration we recommend using one of the greedy rules to select the block to update, and then updating the block using the Newton direction and a line search. 
In our implementation, we used a backtracking line search starting with $\alpha_k = 1$ and backtracking for the first time using quadratic Hermite polynomial interpolation and using cubic Hermite polynomial interpolation if we backtracked more than once (which rarely happened since $\alpha_k = 1$ or the first backtrack were typically accepted) [Nocedal and Wright, 1999].$^{19}$

$^{18}$ Consider a 2-variable quadratic objective where we use single-coordinate updates. The optimal update (which is equivalent to the matrix/Newton update) is easy to compute, but if the quadratic is non-separable then the convergence rate of this approach is only linear.

$^{19}$ We also explored a variant based on cubic regularization of Newton’s method [Nesterov and Polyak, 2006], but were not able to obtain a significant performance gain with this approach.

5.4 Message-Passing for Huge-Block Updates

Qu et al. [2016] discuss how in some settings increasing the block size with matrix updates does not necessarily lead to a performance gain due to the higher iteration cost. In the case of Newton updates the additional cost of computing the sub-Hessian $\nabla^2_{bb} f(x^k)$ may also be non-trivial. Thus, whether matrix and Newton updates will be beneficial over gradient updates will depend on the particular problem and the chosen block size. However, in this section we argue that in some cases matrix updates and Newton updates can be computed efficiently using huge blocks. In particular, we focus on the case where the dependencies between variables are sparse, and we will choose the structure of the blocks in order to guarantee that the matrix/Newton update can be computed efficiently.

The cost of using Newton updates with the BCD method depends on two factors: (i) the cost of calculating the sub-Hessian $\nabla^2_{b_k b_k} f(x^k)$ and (ii) the cost of solving the corresponding linear system. The cost of computing the sub-Hessian depends on the particular objective function we are minimizing. 
But, for the problems where coordinate descent is efficient (see Section 5.1.4), it is typically substantially cheaper to compute the sub-Hessian for a block than to compute the full Hessian. Indeed, for many cases where we apply BCD, computing the sub-Hessian for a block is cheap due to the sparsity of the Hessian. For example, in the graph-structured problems $h_2$ the edges in the graph correspond to the non-zeroes in the Hessian.

Although this sparsity and reduced problem size would seem to make BCD methods with exact Newton updates ideal, in the worst case the iteration cost would still be $O(|b_k|^3)$ using standard matrix factorization methods. A similar cost is needed using the matrix updates with fixed Hessian upper-bounds $H_b$ and for performing an optimal update in the special case of quadratic functions. In some settings we can reduce this to $O(|b_k|^2)$ by storing matrix factorizations, but this cost is still prohibitive if we want to use large blocks (we can use $|b_k|$ in the thousands but not the millions).

An alternative to computing the exact Newton update is to use an approximation to the Newton update that has a runtime dominated by the sparsity level of the sub-Hessian. For example, we could use conjugate gradient methods or use randomized Hessian approximations [Dembo et al., 1982, Pilanci and Wainwright, 2017]. However, these approximations require setting an approximation accuracy and may be inaccurate if the sub-Hessian is not well-conditioned. In this section we consider an alternative approach: choosing blocks with a sparsity pattern that guarantees we can solve the resulting linear systems involving the sub-Hessian (or its approximation) in $O(|b_k|)$ using a “message-passing” algorithm. 
If the sparsity pattern is favourable, this allows us to update huge blocks at each iteration using exact matrix updates or Newton updates (which are the optimal updates for quadratic problems).

To illustrate the message-passing algorithm, we first consider the basic quadratic minimization problem

$$\operatorname*{argmin}_{x \in \mathbb{R}^n} \frac{1}{2} x^T A x - c^T x,$$

where we assume the matrix $A \in \mathbb{R}^{n \times n}$ is positive-definite and sparse. By excluding terms not depending on the coordinates in the block, the optimal update for block $b$ is given by the solution to the linear system

$$A_{bb} x_b = \tilde{c}, \tag{5.26}$$

where $A_{bb} \in \mathbb{R}^{|b| \times |b|}$ is the submatrix of $A$ corresponding to block $b$, and $\tilde{c} = c_b - A_{b\bar{b}} x_{\bar{b}}$ is a vector with $\bar{b}$ defined as the complement of $b$ and $A_{b\bar{b}} \in \mathbb{R}^{|b| \times |\bar{b}|}$ is the submatrix of $A$ with rows from $b$ and columns from $\bar{b}$. We note that in practice efficient BCD methods already need to track $Ax$, so computing $\tilde{c}$ is efficient. Although we focus on solving (5.26) for simplicity, the message-passing solution we discuss here will also apply to matrix updates (which leads to a linear system involving $H_b$) and Newton updates (which leads to a linear system involving the sub-Hessian).

Consider a pairwise undirected graph $G = (V, E)$, where the vertices $V$ are the coordinates of our problem and the edges $E$ are the non-zero off-diagonal elements of $A$. Thus, if $A$ is diagonal then $G$ has no edges, if $A$ is dense then there are edges between all nodes ($G$ is fully-connected), if $A$ is tridiagonal then edges connect adjacent nodes ($G$ is a chain-structured graph where $(1) - (2) - (3) - (4) - \ldots$), and so on.

For BCD methods, unless we have a block size $|b| = n$, we only work with a subset of nodes $b$ at each iteration. The graph obtained from the sub-matrix $A_{bb}$ is called the induced subgraph $G_b$. Specifically, the nodes $V_b \in G_b$ are the coordinates in the set $b$, while the edges $E_b \in G_b$ are all edges $(i, j) \in E$ where $i, j \in V_b$ (edges between nodes in $b$). 
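The two objects defined above, the right-hand side $\tilde{c}$ of (5.26) and the edge set of the induced subgraph $G_b$, can be sketched as follows. This is only an illustration: the sparse matrix is assumed to be stored as a dictionary of dictionaries holding the non-zeroes of a symmetric $A$, and the function name `block_subproblem` is hypothetical. For simplicity the sketch computes $\tilde{c}$ directly from the sparse rows rather than from a tracked product $Ax$.

```python
def block_subproblem(A, c, x, block):
    """Build the induced-subgraph edges and the vector c_tilde of (5.26).

    Hypothetical helper: A[i][j] holds the non-zero entry A_ij of a
    symmetric matrix, `c` and `x` are lists, and `block` lists the
    coordinates in b.
    """
    in_block = set(block)
    # Edges of G_b: off-diagonal non-zeroes with both endpoints in b.
    edges = sorted((i, j) for i in block for j in A[i]
                   if j in in_block and i < j)
    # c_tilde = c_b - A_{b, b_bar} x_{b_bar}: only neighbours outside b count.
    c_tilde = [c[i] - sum(a * x[j] for j, a in A[i].items()
                          if j not in in_block)
               for i in block]
    return edges, c_tilde


# Tridiagonal (chain-structured) example with 4 coordinates.
A = {0: {0: 2.0, 1: -1.0},
     1: {0: -1.0, 1: 2.0, 2: -1.0},
     2: {1: -1.0, 2: 2.0, 3: -1.0},
     3: {2: -1.0, 3: 2.0}}
edges, c_tilde = block_subproblem(A, [1.0, 1.0, 1.0, 1.0],
                                  [0.0, 0.0, 0.0, 5.0], [0, 1, 2])
```

In the chain example the block $\{0, 1, 2\}$ induces the edges $(0,1)$ and $(1,2)$, and only coordinate 2 has a neighbour outside the block, so only its entry of $\tilde{c}$ differs from $c_b$.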
We are interested in the special case where the induced sub-graph $G_b$ forms a forest, meaning that it has no cycles.$^{20}$ The idea of exploiting tree structures within BCD updates has previously been explored by Sontag and Jaakkola [2009]. However, unlike this work, where we use tree structured blocks to compute general Newton updates, Sontag and Jaakkola propose this idea in the context of linear programming.

In the special case of forest-structured induced subgraphs, we can compute the optimal update (5.26) in linear time using message passing [Shental et al., 2008] instead of the cubic worst-case time required by typical matrix factorization implementations. Indeed, in this case the message passing algorithm is equivalent to Gaussian elimination [Bickson, 2009, Prop. 3.4.1] where the amount of “fill-in” is guaranteed to be linear. This idea of exploiting tree structures within Gaussian elimination dates back over 50 years [Parter, 1961], and similar ideas have recently been explored by Srinivasan and Todorov [2015] for Newton methods. Their graphical Newton algorithm can solve the Newton system in $O(t^3)$ times the size of $G$, where $t$ is the “treewidth” of the graph ($t = 1$ for forests). However, the tree-width of $G$ is usually large while it is more reasonable to assume that we can find low-treewidth induced subgraphs $G_b$.

To illustrate the message-passing algorithm in the terminology of Gaussian elimination, we first need to divide the nodes $\{1, 2, \ldots, |b|\}$ in the forest into sets $L\{1\}, L\{2\}, \ldots, L\{T\}$, where $L\{1\}$ is an arbitrary node in graph $G_b$ selected to be the root node, $L\{2\}$ is the set of all neighbours of the “root” node, $L\{3\}$ is the set of all neighbours of the nodes in $L\{2\}$ excluding parent nodes (nodes in $L\{1:2\}$), and so on until all nodes are assigned to a set (if the forest is made of disconnected trees, we need to do this for each tree). An example of this process is depicted in Figure 5.1. 
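The level-set construction and the resulting linear-time solve can be sketched in a few lines. The code below is a minimal illustration under stated assumptions rather than the implementation used in this work: the forest-structured system is passed as a list of diagonal entries plus adjacency dictionaries of off-diagonal entries, the levels are built by breadth-first search with one root per tree, and the names `level_sets` and `solve_forest_system` are hypothetical.

```python
def level_sets(adj):
    """Breadth-first levels and parent pointers for a forest.

    adj[i] maps each neighbour j of node i to the off-diagonal entry A_ij.
    """
    n = len(adj)
    parent = [None] * n
    seen = [False] * n
    levels = []
    for root in range(n):          # one root per disconnected tree
        if seen[root]:
            continue
        seen[root] = True
        frontier, depth = [root], 0
        while frontier:
            if depth == len(levels):
                levels.append([])
            levels[depth].extend(frontier)
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if not seen[v]:
                        seen[v] = True
                        parent[v] = u
                        nxt.append(v)
            frontier, depth = nxt, depth + 1
    return levels, parent


def solve_forest_system(diag, adj, rhs):
    """Solve A x = rhs when the sparsity graph of symmetric A is a forest."""
    levels, parent = level_sets(adj)
    P, C = list(diag), list(rhs)   # P, C track the row operations
    # Elimination: start furthest from the roots, fold children into parents.
    for level in reversed(levels):
        for i in level:
            for j, a_ij in adj[i].items():
                if parent[j] == i:             # j is a child of i
                    P[i] -= a_ij * a_ij / P[j]
                    C[i] -= a_ij / P[j] * C[j]
    # Backward solve: roots first, then substitute parent values downwards.
    x = [0.0] * len(diag)
    for level in levels:
        for i in level:
            p = parent[i]
            coupling = adj[i][p] * x[p] if p is not None else 0.0
            x[i] = (C[i] - coupling) / P[i]
    return x


# Chain 0 - 1 - 2 with A = tridiag(-1, 2, -1) and rhs = (1, 0, 1).
x = solve_forest_system([2.0, 2.0, 2.0],
                        [{1: -1.0}, {0: -1.0, 2: -1.0}, {1: -1.0}],
                        [1.0, 0.0, 1.0])
```

Each node and each edge is visited a constant number of times, which is the source of the linear cost; for the chain example the solution is the all-ones vector.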
Once these sets are initialized, we start with the nodes furthest from the root node $L\{T\}$, and carry out the row operations of Gaussian elimination moving towards the root. Then we use backward substitution to solve the system $\tilde{A} x = \tilde{c}$. We outline the full procedure in Algorithm 1.

$^{20}$ An undirected cycle is a sequence of adjacent nodes in $V$ starting and ending at the same node, where there are no repetitions of nodes or edges other than the final node.

Algorithm 1 Message Passing for a Tree Graph
1. Initialize:
   Input: vector $\tilde{c}$, forest-structured matrix $\tilde{A}$, and levels $L\{1\}, L\{2\}, \ldots, L\{T\}$.
   for $i = 1, 2, \ldots, |b|$
      Set $P_{ii} \leftarrow \tilde{A}_{ii}$, $C_i \leftarrow \tilde{c}_i$.   # P, C track row operations
2. Gaussian Elimination:
   for $t = T, T-1, \ldots, 1$   # start furthest from root
      for $i \in L\{t\}$
         if $t > 1$
            $J \leftarrow N\{i\} \setminus L\{1:t-1\}$   # neighbours that are not parent node
            if $J = \emptyset$   # i corresponds to a leaf node
               continue   # no updates
         else
            $J \leftarrow N\{i\}$   # root node has no parent node
         $P_{Ji} \leftarrow \tilde{A}_{Ji}$   # initialize off-diagonal elements
         $P_{ii} \leftarrow P_{ii} - \sum_{j \in J} P_{ji}^2 / P_{jj}$   # update diagonal elements of P in L{t}
         $C_i \leftarrow C_i - \sum_{j \in J} (P_{ji} / P_{jj}) \cdot C_j$
3. Backward Solve:
   for $t = 1, 2, \ldots, T$   # start with root node
      for $i \in L\{t\}$
         if $t < T$
            $p \leftarrow N\{i\} \setminus L\{t+1:T\}$   # parent node of i (empty for t = 1)
         else
            $p \leftarrow N\{i\}$   # only neighbour of leaf node is parent
         $x_i \leftarrow (C_i - \tilde{A}_{ip} \cdot x_p) / P_{ii}$   # solution to $\tilde{A} x = \tilde{c}$

Figure 5.1: Process of partitioning nodes into level sets. For the above graph we have the following sets: $L\{1\} = \{8\}$, $L\{2\} = \{6, 7\}$, $L\{3\} = \{3, 4, 5\}$ and $L\{4\} = \{1, 2\}$.

Whether or not message-passing is useful will depend on the sparsity pattern of $A$. Diagonal matrices correspond to disconnected graphs, which are clearly forests (they have no edges), and thus we can use an optimal BCD update with a block size of $n$. Now consider a quadratic function with a lattice-structured non-zero pattern as in Figure 5.3. This graph is a bipartite

Figure 5.2: Illustration of Step 2 (row-reduction process) of Algorithm 1 for the tree in Figure 5.1. The matrix represents $[\tilde{A} \mid \tilde{c}]$. 
The black squares represent unchanged non-zero values of $\tilde{A}$ and the grey squares represent non-zero values that are updated at some iteration in Step 2. In the final matrix (far right), the values in the last column are the values assigned to the vector $C$ in Steps 1 and 2 above, while the remaining columns that form an upper triangular matrix are the values corresponding to the constructed $P$ matrix. The backward solve of Step 3 solves the linear system.

Figure 5.3: Partitioning strategies for defining forest-structured blocks. (a) Red-black, $|b| = n/2$. (b) Fixed block, $|b| = n/2$. (c) Variable block, $|b| \approx 2n/3$.

graph, or a two-colourable graph, and a classic fixed partitioning strategy for problems with this common structure is to use a “red-black ordering” (see Figure 5.3a). Choosing this colouring makes the matrix $A_{bb}$ diagonal when we update the red nodes (and similarly for the black nodes), allowing us to solve (5.26) in linear time. So this colouring scheme allows an optimal BCD update with a block size of $n/2$. The adjacency matrix of a graph like this is called consistently ordered. In general, an adjacency matrix is consistently ordered if the nodes of the corresponding graph can be partitioned into sets such that any two adjacent nodes belong to different, consecutive sets [Young, 1971, Def. 5.3.2]. In the case of red-black ordering, all odd numbered sets would make up one block and all even numbered sets would make up another block.

Message passing allows us to go beyond the red-black ordering, and update any forest-structured block of size $\tau$ in linear time. For example, the partition given in Figure 5.3b also has blocks of size $n/2$ but these blocks include dependencies. Our experiments indicate that blocks that maintain dependencies, such as Figure 5.3b, make substantially more progress than using the red-black blocks or using smaller non-forest structured blocks. 
However, red-black blocks, or more generally, consistently ordered matrices, are very well-suited for parallelization.

Unlike in Figure 5.3a, the colouring of a general graph or partitioning of a general adjacency matrix may lead to blocks of different sizes. Multi-colouring techniques are used to find graph colourings for general graphs, where colours are assigned to nodes such that no neighbouring nodes share the same colour [Saad, 2003, §12.4]. Finding the minimum number of “colours” (or the maximum size of blocks) for a given graph is exactly the NP-hard graph colouring problem. However, there are various heuristics that quickly give a non-minimal valid colouring of the nodes (for a survey of heuristic, meta-heuristic and hybrid methods for graph colouring, see Baghel et al. [2013]). If a graph is $\nu$-colourable, then we can arrange the adjacency matrix such that it has $\nu$ diagonal blocks along its diagonal (the off diagonal blocks may be dense). In relation to BCD methods, this means that $A_{bb}$ is diagonal for each block and we can update each of the $\nu$ blocks in linear time in the size of the block.

Alternatively, as we did for blocks of equal size in Figure 5.3b, we can consider all blocks with a forest-structured induced subgraph, where the block size may vary at each iteration but restricting to forests still leads to a linear-time update. As seen in Figure 5.3c, by allowing variable block sizes we can select a forest-structured block of size $|b| \approx 2n/3$ in one iteration (black nodes) while still maintaining a linear-time update. If we further sample different random forests or blocks at each iteration, then the convergence rate under this strategy is covered by the arbitrary sampling theory [Qu et al., 2014]. 
Also note that the maximum of the gradient norms over all forests defines a valid norm, so our analysis of Gauss-Southwell can be applied to this case.

5.4.1 Partitioning into Forest-Structured Blocks

We can generalize the red-black approach to arbitrary graphs by defining our blocks such that no two neighbours are in the same block. While for lattice-structured graphs we only need two blocks to do this, for general graphs we may need a larger number of blocks. Finding the minimum number of blocks we need for a given graph is exactly the NP-hard graph colouring problem. Fortunately, there are various heuristics that quickly give a non-minimal valid colouring of the nodes. For example, in our experiments we used the following classic greedy algorithm [Welsh and Powell, 1967]:

1. Proceed through the vertices of the graph in some order $i = 1, 2, \ldots, n$.
2. For each vertex $i$, assign it the smallest positive integer (“colour”) such that it does not have the same colour as any of its neighbours among the vertices $\{1, 2, \ldots, i-1\}$.

We can use all vertices assigned to the same integer as our blocks in the algorithm, and if we apply this algorithm to a lattice-structured graph (using row- or column-ordering of the nodes) then we obtain the classic red-black colouring of the graph.

Instead of disconnected blocks, in this work we instead consider forest-structured blocks. The size of the largest possible forest is related to the graph colouring problem [Esperet et al., 2015], but we can consider a slight variation on the second step of the greedy colouring algorithm to find a set of forest-structured blocks:

1. Proceed through the vertices of the graph in some order $i = 1, 2, \ldots, n$.
2. For each vertex i, assign it the smallest positive integer (“forest”) such that the nodes assigned to that integer among the set {1, 2, . . . 
, i} form a forest.If we apply this to a lattice structured graph (in column-ordering), this generates a partitioninto two forest-structured graphs similar to the one in Figure 5.3b (only the bottom row isdifferent). This procedure requires us to be able to test whether adding a node to a forestmaintains the forest structure, and we show how to do this efficiently in Appendix D.4.In the case of lattice-structured graph there is a natural ordering of the vertices, but formany graphs there is no natural ordering. In such we might simply consider a random ordering.Alternately, if we know the individual Lipschitz constants Li, we could order by these values(with the largest Li going first so that they are likely assigned to the same block if possible). Inour experiments we found that this ordering improved performance for an unstructured dataset,and performed similarly to using the natural ordering in a lattice-structured dataset.5.4.2 Approximate Greedy Rules with Forest-Structured BlocksSimilar to the problems of the previous section, computing the Gauss-Southwell rule over forest-structured variable blocks is NP-hard, as we can reduce the 3-satisfiability problem to theproblem of finding a maximum-weight forest [Garey and Johnson, 1979]. However, we use asimilar greedy method to approximate the greedy Gauss-Southwell rule over the set of trees:1. Initialize bk with the node i corresponding to the largest gradient, |∇if(xk)|.2. Search for the node i with the largest gradient that is not part of bk and that maintainsthat bk is a forest.3. If such a node is found, add it to bk and go back to step 2. 
Otherwise, stop.Although this procedure does not yield the exact solution in general, it is appealing since (i) theprocedure is efficient as it is easy to test whether adding a node maintains the forest property(see Appendix D.4), (ii) it outputs a forest so that the subsequent update is linear-time, (iii) weare guaranteed that the coordinate corresponding to the variable with the largest gradient isincluded in bk, and (iv) we cannot add any additional node to the final forest and still maintainthe forest property. A similar heuristic can be used to approximate the GSD rule under therestriction from Section 5.3.1 or to generate a forest randomly.5.5 Numerical ExperimentsWe performed an extensive variety of experiments to evaluate the effects of the contributionslisted in the previous sections. In this section we include several of these results that highlight91some key trends we observed, and in each subsection below we explicitly list the insights weobtained from the experiment. We considered five datasets that evaluate the methods in avariety of scenarios:A Least-squares with a sparse data matrix.B Binary logistic regression with a sparse data matrix.C 50-class logistic regression problem with a dense data matrix.D Lattice-structured quadratic objective as in Section 5.4.E Binary label propagation problem (sparse but unstructured quadratic).For interested readers, we give the full details of these datasets in Appendix D.5 where we havealso included our full set of experiment results.In our experiments we use the number of iterations as our measure of performance. Thismeasure is far from perfect, especially when considering greedy methods, since it ignoresthe computational cost of each iteration. However, this measure of performance providesan problem- and implementation-independent measure of performance. 
We seek a problem-independent measure of performance since runtimes are highly dependent on the block size and the problem structure; as the block size grows, the extra computational cost of greedy methods may eventually be outweighed by the cost of computing the block update. We seek an implementation-independent measure of performance since the actual runtimes of different methods will vary wildly across applications. However, it is typically easy to estimate the per-iteration runtime when considering a new problem. Thus, we hope that our quantification of what can be gained from more-expensive methods gives guidance to readers about whether the more-expensive methods will lead to a performance gain on their applications. In any case, we are careful to qualify all of our claims with warnings in cases where the iteration costs differ.

5.5.1 Greedy Rules with Gradient Updates

Our first experiment considers gradient updates with a step-size of $1/L_b$, and seeks to quantify the effect of using fixed blocks compared to variable blocks (Section 5.2.1) as well as the effect of using the new GSL rule (Section 5.2.2). In particular, we compare selecting the block using Cyclic, Random, Lipschitz (sampling the elements of the block proportional to $L_i$), GS, and GSL rules. For each of these rules we implemented a fixed block (FB) and variable block (VB) variant. For VB using Cyclic selection, we split a random permutation of the coordinates into equal-sized blocks and updated these blocks in order (followed by using another random permutation). To approximate the seemingly-intractable GSL rule with VB, we used the GSD rule (Section 5.2.4) using the SIRT-style approximation (5.23) from Section 5.3.1. We used the bounds in Appendix D.2 to set the $L_b$ values.
To construct the partition of the coordinates needed in the FB method, we sorted the coordinates according to their $L_i$ values then placed the largest $L_i$ values into the first block, the next set of largest in the second block, and so on.

[Figure 5.4: Comparison of different random and greedy block selection rules on three different problems when using gradient updates. Panels: $f(x) - f^*$ for Least Squares on Dataset A, Softmax on Dataset C, and Quadratic on Dataset E, each against iterations with 5-sized blocks, for the Cyclic, Lipschitz, Random, GS, and GSL rules in their FB and VB variants.]

We plot selected results in Figure 5.4, while experiments on all datasets and with other block sizes are given in Appendix D.5.2. Overall, we observed the following trends:

• Greedy rules tend to outperform random and cyclic rules, particularly with small block sizes. This difference is sometimes enormous, and this suggests we should prefer greedy rules when the greedy rules can be implemented with a similar cost to cyclic or random selection.
• The variance of the performance between the rules becomes smaller as the block size increases. This suggests that if we use very-large block sizes with gradient updates that we should prefer simple Cyclic or Random updates.
• VB can substantially outperform FB when using GS for certain problems. This is because FB are a subset of the VB, so we can make the progress bound better. Thus, we should prefer GS-VB for problems where this has a similar cost to GS-FB.
We found this trend was reversed for random rules, where fixed blocks tended to perform better. We suspect this trend is due to the coupon collector problem: it takes FB fewer iterations than VB to select all variables at least once.
• GSL consistently improved on the classic GS rule, and in some cases the new rule with FB even outperformed the GS rule with VB. Interestingly, the performance gain was larger in the block case than in the single-coordinate case (see Section 2.8).

[Figure 5.5: Comparison of different greedy block selection rules on three different problems when using matrix updates. Panels: $f(x) - f^*$ for Least Squares on Dataset A, Softmax on Dataset C, and Quadratic on Dataset E, each against iterations with 5-sized blocks, for the GS, GSL, GSD, and GSQ rules in their FB and VB variants.]

In Appendix D.5.2 we repeat this experiment for the FB methods but using the approximation to $L_b$ discussed in Section 5.3.3. This sought to test whether this procedure, which may underestimate the true $L_b$ and thus use larger step-sizes, would improve performance. This experiment led to some additional insights:

• Approximating $L_b$ was more effective as the block size increases. This makes sense, since with large block sizes there are more possible directions and we are unlikely to ever need to use a step-size as small as $1/L_b$ for the global $L_b$.
• Approximating $L_b$ is far more effective than using a loose bound. We have relatively-good bounds for all problems except Problem C.
On this problem the Lipschitz approximation procedure was much more effective even for small block sizes.

This experiment suggests that we should prefer to use an approximation to $L_b$ (or an explicit line search) when using gradient updates unless we have a tight approximation to the true $L_b$ and we are using a small block size. We also performed experiments with different block partitioning strategies for FB (see Appendix D.5.2). Although these experiments had some variability, we found that the block partitioning strategy did not make a large difference for cyclic and random rules. In contrast, when using greedy rules our sorting approach tended to outperform using random blocks or choosing the blocks to have similar average Lipschitz constants.

5.5.2 Greedy Rules with Matrix Updates

Our next experiment considers using matrix updates based on the matrices $H_b$ from Appendix D.2, and quantifies the effects of the GSQ and GSD rules introduced in Sections 5.2.3-5.2.4 as well as the approximations to these introduced in Sections 5.3.1-5.3.2. In particular, for FB we consider the GS rule and the GSL rule (from the previous experiment), the GSD rule (using the diagonal matrices from Section 5.3.1 with $D_{b,i} = L_i$), and the GSQ rule (which is optimal for the three quadratic objectives). For VB we consider the GS rule from the previous experiment as well as the GSD rule (using $D_{b,i} = L_i$), and the GSQ rule using the approximation from Section 5.3.2 and 10 iterations of iterative hard thresholding. Other than switching to matrix updates and focusing on these greedy rules, we keep all other experimental factors the same.

We plot selected results of doing this experiment in Figure 5.5. These experiments showed the following interesting trends:

• There is a larger advantage to VB with matrix updates. When using matrix updates, the basic GS-VB method outperformed even the most effective GSQ-FB rule for smaller block sizes.
• There is little advantage to GSD/GSQ with FB.
Although the GSL rule consistently improved over the classic GS rule, we did not see any advantage to using the more-advanced GSD or GSQ rules when using FB.
• GSD outperformed GS with VB. Despite the use of a crude approximation to the GSD rule, the GSD rule consistently outperformed the classic GS rule.
• GSQ slightly outperformed GSD with VB and large blocks. Although the GSQ-VB rule performed the best across all experiments, the difference was more noticeable for large block sizes. However, this did not offset its high cost in any experiment. We also experimented with OMP instead of IHT, and found it gave a small improvement but the iterations were substantially more expensive.

Putting the above together, with matrix updates our experiments indicate that the GSL or GSD seem to both provide good performance for FB, while for VB the GSD rule should be preferred. We would only recommend using the GSQ rule in settings where we can use VB and where operations involving the objective $f$ are much more expensive than running an IHT or OMP method. We performed experiments with different block partition strategies for FB, but found that when using matrix updates the partitioning strategy did not make a big difference for cyclic, random, or greedy rules.

In Appendix D.5.3 we repeat this experiment for the non-quadratic objectives using the Newton direction and a backtracking line search to set the step-size, as discussed in Sections 5.3.4 and 5.3.6. For both datasets, the Newton updates resulted in a significant performance improvement over the matrix updates. This indicates that we should prefer classic Newton updates over the more recent matrix updates for non-quadratic objectives where computing the sub-block of the Hessian is tractable.

5.5.3 Message-Passing Updates

We next seek to quantify the effect of using message-passing to efficiently implement exact updates for quadratic functions, as discussed in Section 5.4.
For this experiment, we focused on the lattice-structured dataset D and the unstructured but sparse dataset E. These are both quadratic objectives with high treewidth, but that allow us to find large forest-structured induced subgraphs. We compared the following strategies to choose the block: greedily choosing the best general unstructured blocks using GS (General), cycling between blocks generated by the greedy graph colouring algorithm of Section 5.4.1 (Red Black), cycling between blocks generated by the greedy forest-partitioning algorithm of Section 5.4.1 (Tree Partitions), greedily choosing a tree using the algorithm of Section 5.4.2 (Greedy Tree), and growing a tree randomly using the same algorithm (Random Tree). For the lattice-structured Dataset D, the greedy partitioning algorithms proceed through the variables in order, which generates partitions similar to those shown in Figure 5.3b. For the unstructured Dataset E, we apply the greedy partitioning strategies of Section 5.4.1 using both a random ordering and by sorting the Lipschitz constants $L_i$. Since the cost of the exact update for tree-structured methods is $O(n)$, for the unstructured blocks we chose a block size of $|b_k| = n^{1/3}$ to make the costs comparable (since the exact solve is cubic in the block size for unstructured blocks).

[Figure 5.6: Comparison of different greedy block selection rules on two quadratic graph-structured problems when using optimal updates. Panels: $f(x) - f^*$ against iterations for Quadratic on Dataset D (Random Tree, Greedy Tree, General, Red Black, Tree Partitions) and Quadratic on Dataset E (Random Tree, Greedy Tree, General, Red Black Lipschitz, Tree Partitions Lipschitz, Tree Partitions Order, Red Black Order).]

We plot selected results of doing this experiment in Figure 5.6.
Here, we see that even the classic red-black ordering outperforms using general unstructured blocks (since we must use such small block sizes). The tree-structured blocks perform even better, and in the unstructured setting our greedy approximation of the GS rule under variable blocks outperforms the other strategies. However, our greedy tree partitioning method also performs well. For the lattice-structured data (left) it performed similarly to the greedy approximation, while for the unstructured data (right) it outperformed all methods except greedy (and performed better when sorting by the Lipschitz constants than using a random order).

5.6 Discussion

In this chapter we focused on non-accelerated BCD methods. However, we expect that our conclusions are likely to also apply for accelerated BCD methods [Fercoq and Richtárik, 2015]. Similarly, while we focused on the setting of serial computation, we expect that our conclusions will give insight into developing more efficient parallel and distributed BCD methods [Richtárik and Takáč, 2016].

Although our experiments indicate that our choice of the diagonal matrices $D$ within the GSD rule provides a consistent improvement, this choice is clearly sub-optimal. A future direction is to find a generic strategy to construct better diagonal matrices, and work on ESO methods could potentially be adapted for doing this [Qu and Richtárik, 2016]. This could be in the setting where we are given knowledge of the Lipschitz constants, but a more-interesting idea is to construct these matrices online as the algorithm runs.

The GSQ rule can be viewed as a greedy rule that incorporates more sophisticated second-order information than the simpler GS and GSL rules. In preliminary experiments, we also considered selection rules based on the cubic regularization bound.
However, these did not seem to converge more quickly than the existing rules in our experiments, and it is not obvious how one could efficiently implement such second-order rules.

We focused on BCD methods that approximate the objective function at each step by globally bounding higher-order terms in a Taylor expansion. However, we would expect more progress if we could bound these locally in a suitably-larger neighbourhood of the current iteration. Alternately, note that bounding the Taylor expansion is not the only way to upper bound a function. For example, Khan [2012] discusses a variety of strategies for bounding the binary logistic regression loss and indeed proves that other bounds are tighter than the Taylor expansion (“Böhning”) bound that we use. It would be interesting to explore the convergence properties of BCD methods whose bounds do not come from a Taylor expansion.

While we focused on the case of trees, there are message-passing algorithms that allow graphs with cycles [Rose, 1970, Srinivasan and Todorov, 2015]. The efficiency of these methods depends on the “treewidth” of the induced subgraph, where if the treewidth is small (as in trees) then the updates are efficient, and if the treewidth is large (as in fully-connected graphs) then these do not provide an advantage. Treewidth is related to the notion of “chordal” graphs (trees are special cases of chordal graphs) and chordal embeddings which have been exploited for matrix problems like covariance estimation [Dahl et al., 2008] and semidefinite programming [Sun et al., 2014, Vandenberghe and Andersen, 2015].
Considering “treewidth 2” or “treewidth 3” blocks would give more progress than our tree-based updates, although it is NP-hard to compute the treewidth of a general graph (but it is easy to upper-bound this quantity by simply choosing a random elimination order).

As opposed to structural constraints like requiring the graph to be a tree, it is now known that message-passing algorithms can solve linear systems with other properties like diagonal dominance or “attractive” coefficients [Malioutov et al., 2006]. There also exist specialized linear-time solvers for Laplacian matrices [Kyng and Sachdeva, 2016], and it would be interesting to explore BCD methods based on these structures. It would also be interesting to explore whether approximate message-passing algorithms which allow general graphs [Malioutov et al., 2006] can be used to improve optimization algorithms.

Chapter 6
Active-Set Identification and Complexity

In this section we consider optimization problems of the form
$$\operatorname*{argmin}_{x \in \mathbb{R}^n} \; f(x) + \sum_{i=1}^{n} g_i(x_i), \tag{6.1}$$
where $\nabla f$ is Lipschitz-continuous, that is for all $x, y \in \mathbb{R}^n$, we have
$$\|\nabla f(y) - \nabla f(x)\| \le L\|y - x\|, \tag{6.2}$$
and each $g_i$ only needs to be convex and lower semi-continuous (it may be non-smooth or infinite at some $x_i$). A classic example of a problem in this framework is optimization subject to non-negative constraints,
$$\operatorname*{argmin}_{x \ge 0} \; f(x), \tag{6.3}$$
where in this case $g_i$ is the indicator function on the non-negative orthant,
$$g_i(x_i) = \begin{cases} 0 & \text{if } x_i \ge 0, \\ \infty & \text{if } x_i < 0. \end{cases}$$
Another example that has received significant recent attention is the case of an $\ell_1$-regularizer,
$$\operatorname*{argmin}_{x \in \mathbb{R}^n} \; f(x) + \lambda\|x\|_1, \tag{6.4}$$
where in this case $g_i(x_i) = \lambda|x_i|$. Here, the $\ell_1$-norm regularizer is used to encourage sparsity in the solution. A related problem is the group $\ell_1$-regularization problem (5.6), where instead of being separable, $g$ is block-separable.

Proximal gradient methods have become one of the default strategies for solving problem (6.1). Given the separability assumption we make on $g$ in (6.1), the proximal gradient update is separable.
The coordinate-wise proximal gradient update (using a step-size of $1/L$) is given by
$$x_i^{k+1} = \operatorname{prox}_{\frac{1}{L} g_i}\!\left(x_i^k - \frac{1}{L}\nabla_i f(x^k)\right), \tag{6.5}$$
where the coordinate-wise proximal operator is defined as
$$\operatorname{prox}_{\frac{1}{L} g_i}(y) = \operatorname*{argmin}_{x_i \in \mathbb{R}} \; \frac{1}{2}|x_i - y|^2 + \frac{1}{L} g_i(x_i).$$
Whether we are using the coordinate-wise or full proximal gradient method, this update form holds for the individual coordinates.

In the special case of non-negative constraints like (6.3) the update in (6.5) is given by
$$x_i^{k+\frac{1}{2}} = x_i^k - \frac{1}{L}\nabla_i f(x^k), \qquad x_i^{k+1} = \left[x_i^{k+\frac{1}{2}}\right]^+,$$
which we have written as a gradient update followed by the projection $[\beta]^+ = \max\{0, \beta\}$ onto the non-negative orthant (see Figure 6.1a). For $\ell_1$-regularization problems (6.4) the update reduces to an element-wise soft-thresholding step,
$$x_i^{k+\frac{1}{2}} = x_i^k - \frac{1}{L}\nabla_i f(x^k), \qquad x_i^{k+1} = \frac{x_i^{k+\frac{1}{2}}}{\left|x_i^{k+\frac{1}{2}}\right|}\left[\left|x_i^{k+\frac{1}{2}}\right| - \frac{\lambda}{L}\right]^+, \tag{6.6}$$
which we have written as a gradient update followed by the soft-threshold operator (shown in Figure 6.1b).

It has been established that (block) coordinate descent methods based on the update (6.5) for problem (6.1) obtain similar convergence rates to the case where we do not have a non-smooth term $g$ (see Section 2.7 and [Nesterov, 2012, Richtárik and Takáč, 2014]). The main focus of this chapter is to show that the non-smoothness of $g$ can actually lead to a faster convergence rate.

This idea dates back at least 40 years to the work of Bertsekas [1976].²¹ For the case of non-negative constraints, he shows that the sparsity pattern of $x^k$ generated by the projected-gradient method matches the sparsity pattern of the solution $x^*$ for all sufficiently large $k$. Thus, after a finite number of iterations the projected-gradient method will “identify” the final set of non-zero variables. Once these values are identified, Bertsekas suggests that we can fix the zero-valued variables and apply an unconstrained Newton update to the set of non-zero variables to obtain superlinear convergence.
Even without switching to a superlinearly-convergent method, the convergence rate of the projected-gradient method can be faster once the set of non-zeroes is identified since it is effectively optimizing in the lower-dimensional space corresponding to the non-zero variables.

Footnote 21: A similar property was shown for proximal point methods in a more general setting around the same time [Rockafellar, 1976].

[Figure 6.1: Visualization of (a) the proximal gradient update for a non-negatively constrained optimization problem (6.3); and (b) the proximal operator (soft-threshold) used in the proximal gradient update for an $\ell_1$-regularized optimization problem (6.4). Panel (a): for problem (6.3) each iteration of the proximal gradient method takes a gradient descent step, $x^{k+\frac{1}{2}}$, and then projects this point onto the non-negative orthant. Panel (b): for problem (6.4), the proximal operator or “soft-threshold” operator shrinks the value of $u$ by the regularization constant, $\lambda$. If $|u| > \lambda$, then the resulting value is $u - \operatorname{sign}(u) \cdot \lambda$. Otherwise, if $|u| \le \lambda$, then the proximal operator sets $u$ to 0.]

This idea of identifying a smooth “manifold” containing the solution $x^*$ has been generalized to allow polyhedral constraints [Burke and Moré, 1988], general convex constraints [Wright, 1993], and even non-convex constraints [Hare and Lewis, 2004]. Similar results exist in the proximal gradient setting. For example, it has been shown that the proximal gradient method identifies the sparsity pattern in the solution of $\ell_1$-regularized problems after a finite number of iterations [Hare, 2011]. The active-set identification property has also been shown for other algorithms like certain coordinate descent and stochastic gradient methods [Lee and Wright, 2012, Mifflin and Sagastizábal, 2002, Wright, 2012].
Specifically, Wright shows that BCD also has this manifold identification property for separable $g$ [Wright, 2012], provided that the coordinates are chosen in an essentially-cyclic way (or provided that we can simultaneously choose to update all variables that do not lie on the manifold). Wright also shows that superlinear convergence is possible if we use a Newton update on the manifold, assuming the Newton update does not leave the manifold.

In this chapter, we show active-set identification for the full proximal gradient method. This is a well-known characteristic of the full proximal gradient method, but given our assumption on the separability of $g$ our analysis is much simpler than these existing analyses. We then extend this result to the proximal coordinate descent case for general separable $g$. We follow a similar argument to Bertsekas [1976], which yields a simple proof that holds for many possible selection rules including greedy rules (which may not be essentially-cyclic). When using greedy BCD methods with variable blocks we show this leads to superlinear convergence for problems with sufficiently-sparse solutions (when we use updates incorporating second-order information). In the special case of LASSO and SVM problems, we further show that optimal updates are possible. This leads to finite convergence for SVM and LASSO problems with sufficiently-sparse solutions when using greedy selection and sufficiently-large variable blocks.

Most prior works show the active-set identification happens asymptotically. In Section 6.3, we introduce the notion of the “active-set complexity” of an algorithm, which we define as the number of iterations required before an algorithm is guaranteed to have reached the active-set. Our active-set identification arguments lead to bounds on the active-set complexity of the full and BCD variants of the proximal gradient method. We are only aware of one previous work giving such bounds, the work of Liang et al.
who included a bound on the active-set complexity of the proximal gradient method [Liang et al., 2017, Proposition 3.6]. Unlike this work, their result does not invoke strong-convexity. Instead, their work applies an inclusion condition on the local subdifferential of the regularization term that ours does not require. By focusing on the strongly-convex case in Section 6.3 (which is common in machine learning due to the use of regularization), we obtain a simpler analysis and a much tighter bound than in this previous work. Specifically, both rates depend on the “distance to the subdifferential boundary”, but in our analysis this term only appears inside of a logarithm rather than outside of it. As examples, we consider problems (6.3) and (6.4), and show explicit bounds for the active-set complexity in both the full and BCD proximal gradient methods.

6.1 Notation and Assumptions

By our separability assumption on $g$, the subdifferential of $g$ can be expressed as the concatenation of the individual subdifferentials of each $g_i$, where the subdifferential of $g_i$ at any $x_i \in \mathbb{R}$ is defined by
$$\partial g_i(x_i) = \{v \in \mathbb{R} : g_i(y) \ge g_i(x_i) + v \cdot (y - x_i), \text{ for all } y \in \operatorname{dom} g_i\}.$$
This implies that the subdifferential of each $g_i$ is just an interval on the real line. In particular, the interior of the subdifferential of each $g_i$ at a non-differentiable point $x_i$ can be written as an open interval,
$$\operatorname{int} \partial g_i(x_i) \equiv (l_i, u_i), \tag{6.7}$$
where $l_i \in \mathbb{R} \cup \{-\infty\}$ and $u_i \in \mathbb{R} \cup \{\infty\}$ (the $\infty$ values occur if $x_i$ is at its lower or upper bound, respectively). The active-set at a solution $x^*$ for a separable $g$ is then defined by
$$Z = \{i : \partial g_i(x_i^*) \text{ is not a singleton}\}.$$
By (6.7), the set $Z$ includes indices $i$ where $x_i^*$ is equal to the lower bound on $x_i$, is equal to the upper bound on $x_i$, or occurs at a non-smooth value of $g_i$. In our examples of non-negative constraints or $\ell_1$-regularization, $Z$ is the set of coordinates that are zero at the solution $x^*$. With this definition, we can formally define the manifold identification property.

Definition 1.
The manifold identification property for problem (6.1) is satisfied if for all sufficiently large $k$, we have that $x_i^k = x_i^*$ for some solution $x^*$ for all $i \in Z$.

In order to prove the manifold identification property, in addition to assuming that $\nabla f$ is $L$-Lipschitz continuous (6.2), we require two assumptions. Our first assumption is that the iterates of the algorithm converge to a solution $x^*$.

Assumption 1. The iterates converge to an optimal solution $x^*$ of problem (6.1), that is $x^k \to x^*$ as $k \to \infty$.

Our second assumption is a nondegeneracy condition on the solution $x^*$ that the algorithm converges to. Below we write the standard nondegeneracy condition from the literature for our special case of (6.1).

Assumption 2. We say that $x^*$ is a nondegenerate solution for problem (6.1) if it holds that
$$\begin{cases} -\nabla_i f(x^*) = \nabla g_i(x_i^*) & \text{if } \partial g_i(x_i^*) \text{ is a singleton ($g_i$ is smooth at $x_i^*$),} \\ -\nabla_i f(x^*) \in \operatorname{int} \partial g_i(x_i^*) & \text{if } \partial g_i(x_i^*) \text{ is not a singleton ($g_i$ is non-smooth at $x_i^*$).} \end{cases}$$

This condition states that $-\nabla f(x^*)$ must be in the “relative interior” (see [Boyd and Vandenberghe, 2004, Section 2.1.3]) of the subdifferential of $g$ at the solution $x^*$. In the case of the non-negative bound constrained problem (6.3), this requires that $\nabla_i f(x^*) > 0$ for all variables $i$ that are zero at the solution ($x_i^* = 0$). For the $\ell_1$-regularization problem (6.4), this requires that $|\nabla_i f(x^*)| < \lambda$ for all variables $i$ that are zero at the solution.²²

6.2 Manifold Identification for Separable g

In this section we show that the full and BCD variants of the proximal gradient method identify the active-set of (6.1) in a finite number of iterations.
Although this result follows from the more general results in the literature for the full proximal gradient method, by focusing on (6.1) we give a substantially simpler proof that will allow us to bound the active-set iteration complexity of the method.

6.2.1 Proximal Gradient Method

We first note that if we assume $f$ is strongly convex, then the iterates converge to a (unique) solution $x^*$ with a linear rate [Schmidt et al., 2011],
$$\|x^k - x^*\| \le \left(1 - \frac{1}{\kappa}\right)^k \|x^0 - x^*\|, \tag{6.8}$$
where $\kappa$ is the condition number of $f$. This implies that Assumption 1 holds. We give a simple result that follows directly from Assumption 1 and establishes that for any $\beta > 0$ there exists a finite iteration $\bar{k}$ such that the distance from the iterate $x^k$ to the solution $x^*$ for all iterations $k \ge \bar{k}$ is bounded above by $\beta$.

Footnote 22: Note that $|\nabla_i f(x^*)| \le \lambda$ for all $i$ with $x_i^* = 0$ follows from the optimality conditions, so this assumption simply rules out the case where $|\nabla_i f(x^*)| = \lambda$. We note that in this case the nondegeneracy condition is a strict complementarity condition [De Santis et al., 2016].

Lemma 1. Let Assumption 1 hold. For any $\beta > 0$, there exists some minimum finite $\bar{k}$ such that $\|x^k - x^*\| \le \beta$ for all $k \ge \bar{k}$.

An important quantity in our analysis is the minimum distance to the nearest boundary of the subdifferential (6.7) among indices $i \in Z$. This quantity is given by
$$\delta = \min_{i \in Z} \left\{ \min\{ -\nabla_i f(x^*) - l_i, \; u_i + \nabla_i f(x^*) \} \right\}. \tag{6.9}$$
Our argument essentially states that once Lemma 1 is satisfied for some finite $\bar{k}$ and a particular $\beta > 0$, then at this point the algorithm always sets $x_i^k$ to $x_i^*$ for all $i \in Z$. In the next result, we prove that this happens for a value $\beta$ depending on $\delta$ as defined in (6.9).

Lemma 2. Consider problem (6.1), where $f$ is convex with $L$-Lipschitz continuous gradient and the $g_i$ are proper convex functions (not necessarily smooth). Let Assumption 1 be satisfied and Assumption 2 be satisfied for the particular $x^*$ that the algorithm converges to.
Then for the proximal gradient method with a step-size of $1/L$ there exists a $\bar{k}$ such that for all $k > \bar{k}$ we have $x_i^k = x_i^*$ for all $i \in Z$.

Proof. By the definition of the proximal gradient step and the separability of $g$, for all $i$ we have
$$x_i^{k+1} \in \operatorname*{argmin}_{y} \left\{ \frac{1}{2}\left| y - \left(x_i^k - \frac{1}{L}\nabla_i f(x^k)\right) \right|^2 + \frac{1}{L} g_i(y) \right\}.$$
This problem is strongly-convex, and its unique solution satisfies
$$0 \in y - x_i^k + \frac{1}{L}\nabla_i f(x^k) + \frac{1}{L}\partial g_i(y),$$
or equivalently that
$$L(x_i^k - y) - \nabla_i f(x^k) \in \partial g_i(y). \tag{6.10}$$
By Lemma 1, there exists a minimum finite iterate $\bar{k}$ such that $\|x^k - x^*\| \le \delta/(2L)$ for all $k \ge \bar{k}$. Since $|x_i^k - x_i^*| \le \|x^k - x^*\|$, this implies that for all $k \ge \bar{k}$ we have
$$-\delta/(2L) \le x_i^k - x_i^* \le \delta/(2L), \quad \text{for all } i. \tag{6.11}$$
Further, the Lipschitz continuity of $\nabla f$ in (6.2) implies that we also have
$$|\nabla_i f(x^k) - \nabla_i f(x^*)| \le \|\nabla f(x^k) - \nabla f(x^*)\| \le L\|x^k - x^*\| \le \delta/2,$$
which implies that
$$-\delta/2 - \nabla_i f(x^*) \le -\nabla_i f(x^k) \le \delta/2 - \nabla_i f(x^*). \tag{6.12}$$
To complete the proof it is sufficient to show that for any $k \ge \bar{k}$ and $i \in Z$ that $y = x_i^*$ satisfies (6.10). Since the solution to (6.10) is unique, this will imply the desired result. We first show that the left-side is less than the upper limit $u_i$ of the interval $\partial g_i(x_i^*)$,
$$\begin{aligned} L(x_i^k - x_i^*) - \nabla_i f(x^k) &\le \delta/2 - \nabla_i f(x^k) && \text{(right-side of (6.11))} \\ &\le \delta - \nabla_i f(x^*) && \text{(right-side of (6.12))} \\ &\le (u_i + \nabla_i f(x^*)) - \nabla_i f(x^*) && \text{(definition of $\delta$, (6.9))} \\ &\le u_i. \end{aligned}$$
We can use the left-sides of (6.11) and (6.12) and an analogous sequence of inequalities to show that $L(x_i^k - x_i^*) - \nabla_i f(x^k) \ge l_i$, implying that $x_i^*$ solves (6.10).

Both problems (6.3) and (6.4) satisfy the manifold identification result. By the definition of $\delta$ in (6.9), we have that $\delta = \min_{i \in Z}\{\nabla_i f(x^*)\}$ for problem (6.3). We note that if $\delta = 0$, then we may approach the manifold through the interior of the domain and the manifold may never be identified (this is the purpose of the nondegeneracy condition). For problem (6.4), we have that $\delta = \lambda - \max_{i \in Z}\{|\nabla_i f(x^*)|\}$.
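To see where these two values of $\delta$ come from, the following worked derivation instantiates (6.7) and (6.9) for each example. It is only a sketch: it assumes $x_i^* = 0$ for every $i \in Z$ (as in both examples) and uses the standard subdifferentials of the indicator function and of the absolute value at zero.

$$\begin{aligned}
\text{Problem (6.3):}\quad & \partial g_i(0) = (-\infty, 0] \;\Rightarrow\; l_i = -\infty,\; u_i = 0, \\
& \min\{-\nabla_i f(x^*) - l_i,\; u_i + \nabla_i f(x^*)\} = \min\{\infty,\; \nabla_i f(x^*)\} = \nabla_i f(x^*), \\
& \delta = \min_{i \in Z} \nabla_i f(x^*). \\[4pt]
\text{Problem (6.4):}\quad & \partial g_i(0) = [-\lambda, \lambda] \;\Rightarrow\; l_i = -\lambda,\; u_i = \lambda, \\
& \min\{\lambda - \nabla_i f(x^*),\; \lambda + \nabla_i f(x^*)\} = \lambda - |\nabla_i f(x^*)|, \\
& \delta = \lambda - \max_{i \in Z} |\nabla_i f(x^*)|.
\end{aligned}$$

As a hypothetical numerical illustration, if $\lambda = 1$ and the only index in $Z$ has $\nabla_i f(x^*) = -0.5$, then $\delta = 1 - 0.5 = 0.5$ for problem (6.4).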
From these results, we are able to define explicit bounds on the number of iterations required to reach the manifold, a new result that we explore in Section 6.3.

6.2.2 Proximal Coordinate Descent Method

We note that Assumption 1 holds for the proximal coordinate descent method if we assume that $f$ is strongly convex and that we use cyclic or greedy selection. Specifically, by existing works on cyclic [Beck and Tetruashvili, 2013] and greedy selection (Section 2.7) of $i_k$ within proximal coordinate descent methods, we have that

$$F(x^k) - F(x^*) \le \rho^k [F(x^0) - F(x^*)], \tag{6.13}$$

for some $\rho < 1$ when $f$ is strongly convex. Note that strong convexity of $f$ implies the strong convexity of $F$, so we have

$$F(y) \ge F(x) + \langle s, y - x\rangle + \frac{\mu}{2}\|y - x\|^2,$$

where $\mu$ is the strong convexity constant of $f$ and $s$ is any subgradient of $F$ at $x$. Taking $y = x^k$ and $x = x^*$ we obtain that

$$F(x^k) \ge F(x^*) + \frac{\mu}{2}\|x^k - x^*\|^2, \tag{6.14}$$

which uses that $0 \in \partial F(x^*)$. Thus we have that

$$\|x^k - x^*\|^2 \le \frac{2}{\mu}[F(x^k) - F(x^*)] \le \frac{2}{\mu}\rho^k[F(x^0) - F(x^*)], \tag{6.15}$$

which implies Assumption 1. However, Assumption 1 will also hold under a variety of other scenarios.

There are three results that we require in order to prove the manifold identification property for proximal coordinate descent methods. The first result is Lemma 1 and follows directly from Assumption 1. The second result we require is that for any $i \in Z$ such that $x_i^k \ne x_i^*$, eventually coordinate $i$ is selected at some finite iteration.

Lemma 3. Let Assumption 1 hold. If $x_i^k \ne x_i^*$ for some $i \in Z$, then coordinate $i$ will be selected by the proximal coordinate descent method after a finite number of iterations.

Proof. For eventual contradiction, suppose we did not select such an $i$ after iteration $k'$. Then for all $k \ge k'$ we have that

$$|x_i^{k'} - x_i^*| = |x_i^k - x_i^*| \le \|x^k - x^*\|. \tag{6.16}$$

By Assumption 1 the right-hand side is converging to 0, so it will eventually be less than $|x_i^{k'} - x_i^*|$ for some $k \ge k'$, contradicting the inequality.
Thus after a finite number of iterations we must have that $x_i^k \ne x_i^{k'}$, which can only be achieved by selecting $i$.

The third result we require is an adaptation of Lemma 2 to the proximal coordinate descent setting. It states that once Lemma 1 is satisfied for some finite $\bar{k}$ and a particular $\beta > 0$ (depending on $\delta$), then for the coordinate $i \in Z$ selected at some iteration $k' \ge \bar{k}$ by the proximal coordinate descent method, we have $x_i^{k'+1} = x_i^*$.

Lemma 4. Consider problem (6.1), where $f$ is convex with $L$-Lipschitz continuous gradient and the $g_i$ are proper convex functions (not necessarily smooth). Let Assumption 1 be satisfied and Assumption 2 be satisfied for the particular $x^*$ that the algorithm converges to. Then for the proximal coordinate descent method with a step-size of $1/L$, if $\|x^k - x^*\| \le \delta/(2L)$ holds and $i \in Z$ is selected at iteration $k$, then $x_i^{k+1} = x_i^*$.

Proof. The proof is identical to Lemma 2, but restricting to the update of the single coordinate.

With the above results we next have the manifold identification property for the proximal coordinate descent method.

Theorem 9. Consider problem (6.1), where $f$ is convex with $L$-Lipschitz continuous gradient and the $g_i$ are proper convex functions. Let Assumption 1 be satisfied and Assumption 2 be satisfied for the particular $x^*$ that the algorithm converges to. Then for the proximal coordinate descent method with a step-size of $1/L$ there exists a finite $k$ such that $x_i^k = x_i^*$ for all $i \in Z$.

Proof. Lemma 1 implies that the assumptions of Lemma 4 are eventually satisfied, and combining this with Lemma 3 we have our result.

While the above result considers single-coordinate updates, it can trivially be modified to show that the proximal BCD method has the manifold identification property. The only change is that once $\|x^k - x^*\| \le \delta/(2L)$, we have that $x_i^{k+1} = x_i^*$ for all $i \in b_k \cap Z$. Thus, BCD methods can simultaneously move many variables onto the optimal manifold.
In Section 6.4 we show how the active-set identification results presented in this section can lead to superlinear or finite convergence of proximal BCD methods.

Instead of using a step-size of $1/L$, it is more common to use a bigger step-size of $1/L_i$ within coordinate descent methods, where $L_i$ is the coordinate-wise Lipschitz constant. In this case, the results of Lemma 4 hold for $\beta = \delta/(L + L_i)$. This is a larger region since $L_i \le L$, so with this standard step-size the iterates can move onto the manifold from further away and we expect to identify the manifold earlier. The argument can also be modified to use other step-size selection methods, provided that we can write the algorithm in terms of a step-size $\alpha_k$ that is guaranteed to be bounded from below.

6.3 Active-Set Complexity

The manifold identification property presented in the previous section can be shown using the more sophisticated tools of related works [Burke and Moré, 1988, Hare and Lewis, 2004]. However, an appealing aspect of the simple argument in Section 6.2 is that it can be combined with non-asymptotic convergence rates of the iterates to bound the number of iterations required to reach the manifold. We call this the “active-set complexity” of the method. Given any method with an iterate bound of the form,

$$\|x^k - x^*\| \le \gamma\left(1 - \frac{1}{\kappa}\right)^k, \tag{6.17}$$

for some $\kappa \ge 1$, the next result uses that $(1 - 1/\kappa)^k \le \exp(-k/\kappa)$ to bound the number of iterations it will take to identify the active-set, and thus reach the manifold.

Theorem 10. Consider any method that achieves an iterate bound (6.17). For $\delta$ as defined in (6.9), we have $\|x^{\bar{k}} - x^*\| \le \delta/(2L)$ after at most $\kappa \log(2L\gamma/\delta)$ iterations. Further, we will identify the active-set after an additional $t$ iterations, where $t$ is the number of additional iterations required to select all suboptimal $x_i$ with $i \in Z$.

For the full proximal gradient method, if we assume $f$ is strongly convex, then by the rate in (6.8) we have that $\gamma = \|x^0 - x^*\|$ and $\kappa$ is the condition number of $f$.
Further, all variables are updated at each iteration so it will only take a single additional iteration ($t = 1$) to ensure all suboptimal $x_i$ with $i \in Z$ are optimal. Therefore, we will identify the active-set after at most $\kappa \log(2L\|x^0 - x^*\|/\delta)$ iterations. Nutini et al. [2017b] extend these results to show active-set identification and active-set complexity for the full proximal gradient method when using a general constant step-size. Their results include proving a generalized convergence rate bound for the proximal gradient method.

For the proximal BCD case, if we assume that $f$ is strongly convex and that we are using cyclic or greedy selection, then we are guaranteed the linear convergence rate in (6.15) for $\gamma = \frac{2}{\mu}[F(x^0) - F(x^*)]$ and some $\kappa \ge 1$. (We note that this type of rate also holds for a variety of other types of selection rules.) Unlike the full proximal gradient case, the active-set complexity is complicated by the fact that not all coordinates are updated on each iteration for proximal BCD methods; the value of $t$ depends on the selection rule we use. If we use cyclic selection we will require at most $t = n$ additional iterations to select all suboptimal coordinates $i \in Z$ and thus, to reach the optimal manifold. For more general rules such as greedy rules, bounding the active-set complexity is harder because we cannot guarantee that all coordinates will be selected within $n$ iterations once we are close to the solution. In the case of non-negative constraints (6.3), the number of additional iterations depends on a quantity we will call $\epsilon$, which is the smallest non-zero variable $x_i^{\bar{k}}$ for $i \in Z$ and $\bar{k}$ satisfying the first part of Theorem 10. It follows from (6.15) that we require at most $\kappa \log(\gamma/\epsilon)$ iterations beyond $\bar{k}$ to select all non-zero $i \in Z$. Thus, the active-set complexity for greedy rules for problem (6.3) is bounded above by $\kappa(\log(2L\gamma/\delta) + \log(\gamma/\epsilon))$. Based on this bound, greedy rules (which yield a smaller $\kappa$) may identify the manifold more quickly than cyclic rules in cases where $\epsilon$ is large.
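The bounds quoted above are simple to evaluate once the constants are known. The sketch below is only illustrative: the function names are hypothetical, the constants κ, L, γ and δ are assumed to be supplied by the user, and `eps` stands for the smallest non-zero variable among i ∈ Z at iteration k̄ as described in the text. It computes the number of iterations after which ‖xk − x∗‖ ≤ δ/(2L) is guaranteed, and from it the active-set complexity bounds for the full proximal gradient method and for greedy selection on problem (6.3).

```python
import math


def neighbourhood_iterations(kappa, L, gamma, delta):
    """Smallest integer k with k >= kappa * log(2 * L * gamma / delta).

    Since (1 - 1/kappa)**k <= exp(-k/kappa), this k guarantees
    gamma * (1 - 1/kappa)**k <= delta / (2 * L).
    """
    if delta <= 0:
        # delta = 0 is the degenerate case: the manifold may never be identified.
        raise ValueError("delta must be positive (nondegeneracy condition)")
    return max(0, math.ceil(kappa * math.log(2.0 * L * gamma / delta)))


def prox_gradient_active_set_bound(kappa, L, gamma, delta):
    """Full proximal gradient: one extra iteration (t = 1) updates every variable."""
    return neighbourhood_iterations(kappa, L, gamma, delta) + 1


def greedy_bcd_active_set_bound(kappa, L, gamma, delta, eps):
    """Greedy selection on the non-negative problem, as stated in the text:
    kappa * (log(2 * L * gamma / delta) + log(gamma / eps))."""
    if delta <= 0 or eps <= 0:
        raise ValueError("delta and eps must be positive")
    total = math.log(2.0 * L * gamma / delta) + math.log(gamma / eps)
    return max(0, math.ceil(kappa * total))


# Illustrative constants only (not taken from the experiments in this chapter).
print(prox_gradient_active_set_bound(kappa=3.0, L=3.0, gamma=math.sqrt(26.0), delta=1.0))
print(greedy_bcd_active_set_bound(kappa=3.0, L=3.0, gamma=math.sqrt(26.0), delta=1.0, eps=1e-3))
```

Both bounds grow only logarithmically as δ or `eps` shrinks, so halving δ adds roughly κ log 2 iterations rather than doubling the count.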
However, if $\epsilon$ is very small then greedy rules may take a larger number of iterations to reach the manifold (see footnote 23).

Finally, it is interesting to note that the bound we prove in Theorem 10 only depends logarithmically on $1/\delta$, and that if $\delta$ (as defined in (6.9)) is quite large then we can expect to identify the active-set very quickly. This $O(\log(1/\delta))$ dependence is in contrast to the previous result of Liang et al., who give a bound of the form $O(1/\sum_{i=1}^n \delta_i^2)$ where $\delta_i$ is the distance of $\nabla_i f$ to the boundary of the subdifferential $\partial g_i$ at $x^*$ [Liang et al., 2017, Proposition 3.6]. Thus, our bound is much tighter as it only depends logarithmically on the single smallest $\delta_i$, that is, on $\delta$ in (6.9) (though we make the extra assumption of strong-convexity).

Footnote 23: If this is a concern, the implementer could consider a safeguard ensuring that the method is essentially-cyclic. Alternately, we could consider rules that prefer to include variables that are near the manifold and have the appropriate gradient sign.

6.4 Superlinear and Finite Convergence of Proximal BCD

Most of the issues discussed in Chapter 5 for smooth BCD methods carry over in a straightforward way to the proximal setting; we can still consider fixed or variable blocks, there exist matrix and Newton updates, and we can still consider cyclic, random, or greedy selection rules. One subtle issue is that, as presented in Section 2.7, there are many generalizations of the GS rule to the proximal setting. However, the GS-q rule defined by Tseng and Yun [2009a] seems to be the generalization of GS with the best theoretical properties. A GSL variant of this rule in the notation of Chapter 5 would take the form

$$b_k \in \operatorname*{argmin}_{b \in B}\left\{\min_{d}\left\{\langle \nabla_b f(x^k), d\rangle + \frac{L_b}{2}\|d\|^2 + \sum_{i \in b} g_i(x_i + d_i) - \sum_{i \in b} g_i(x_i)\right\}\right\}, \tag{6.18}$$

where we assume that the gradient of $f$ is $L_b$-Lipschitz continuous with respect to block $b$.
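For single-coordinate blocks the inner minimization in (6.18) has a closed form, which makes the rule easy to sketch. The code below is an illustration under stated assumptions rather than the rule as implemented for this work: it assumes gi(xi) = λ|xi|, blocks of size one, and hypothetical function names. The minimizing step di is obtained by soft-thresholding, and the selected coordinate is the one with the smallest (most negative) optimal value.

```python
import numpy as np


def gsl_q_scores(x, grad, lipschitz, lam):
    """Optimal value of the inner problem in (6.18) for each coordinate,
    assuming g_i(x_i) = lam * |x_i| and single-coordinate blocks."""
    x = np.asarray(x, dtype=float)
    grad = np.asarray(grad, dtype=float)
    lipschitz = np.asarray(lipschitz, dtype=float)
    z = x - grad / lipschitz                       # gradient step per coordinate
    x_new = np.sign(z) * np.maximum(np.abs(z) - lam / lipschitz, 0.0)
    d = x_new - x                                  # minimizing step d_i
    return grad * d + 0.5 * lipschitz * d ** 2 + lam * (np.abs(x + d) - np.abs(x))


def gsl_q_select(x, grad, lipschitz, lam):
    """Index of the coordinate attaining the minimum in (6.18)."""
    return int(np.argmin(gsl_q_scores(x, grad, lipschitz, lam)))


# Illustrative values: coordinate 0 sits at zero with |grad_0| < lam, so its
# optimal step is d_0 = 0 and its score is 0; coordinate 1 is preferred.
print(gsl_q_select(x=[0.0, 0.0], grad=[-0.5, -3.0], lipschitz=[1.0, 1.0], lam=1.0))
```

Passing the same constant for every entry of `lipschitz` corresponds to the GS-q rule. A coordinate that is at zero with a gradient strictly inside [−λ, λ] always scores zero, so it is never preferred over a coordinate that can still make progress.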
A generalization of the GS rule is obtained if we assume that the $L_b$ are equal across all blocks.

In the next three subsections, we discuss the consequences of our active-set identification results when using the proximal BCD method. Specifically, once we have identified the active-set, we can achieve superlinear convergence for certain problems with sparse solutions (and in some cases finite termination at an optimal solution).

6.4.1 Proximal-Newton Updates and Superlinear Convergence

Once we have identified the optimal manifold, we can think about switching from using the proximal BCD method to using an unconstrained optimizer on the coordinates $i \notin Z$. The unconstrained optimizer can be a Newton update, and thus under the appropriate conditions can achieve superlinear convergence. However, a problem with such “2-phase” approaches is that we do not know the exact time at which we reach the optimal manifold. This can make the approach inefficient: if we start the second phase too early, then we sub-optimize over the wrong manifold, while if we start the second phase too late, then we waste iterations performing first-order updates when we should be using second-order updates. Wright proposes an interesting alternative where at each iteration we consider replacing the proximal gradient block update with a Newton block update on the current manifold [Wright, 2012]. This has the advantage that the manifold can continue to be updated, and that Newton updates are possible as soon as the optimal manifold has been identified. However, note that the dimension of the current manifold might be larger than the block size and the dimension of the optimal manifold, so this approach can significantly increase the iteration cost for some problems.

Rather than “switching” to an unconstrained Newton update, we can alternately take advantage of the superlinear convergence of proximal-Newton updates [Lee et al., 2012].
For example, in this section we consider Newton proximal-BCD updates as in several recent works [Fountoulakis and Tappenden, 2015, Qu et al., 2016, Tappenden et al., 2016]. For a block $b$ these updates have the form

$$x_b^{k+1} \in \operatorname*{argmin}_{y \in \mathbb{R}^{|b|}}\left\{\langle \nabla_b f(x^k), y - x_b^k\rangle + \frac{1}{2\alpha_k}\|y - x_b^k\|_{H_b^k}^2 + \sum_{i \in b} g_i(y_i)\right\}, \tag{6.19}$$

where $H_b^k$ is the matrix corresponding to block $b$ at iteration $k$ (which can be the sub-Hessian $\nabla^2_{bb} f(x^k)$) and $\alpha_k$ is the step-size. As before, if we set $H_b^k = H_b$ for some fixed matrix $H_b$, then we can take $\alpha_k = 1$ if the gradient of $f$ with respect to block $b$ is 1-Lipschitz continuous in the $H_b$-norm.

In the next section, we give a practical variant on proximal-Newton updates that also has the manifold identification property under standard assumptions (see footnote 24). An advantage of this approach is that the block size typically restricts the computational complexity of the Newton update (which we discuss further in the next sections). Further, superlinear convergence is possible in the scenario where the coordinates $i \notin Z$ are chosen as part of the block $b_k$ for all sufficiently large $k$. However, note that this superlinear scenario only occurs in the special case where we use a greedy rule with variable blocks and where the size of the blocks is at least as large as the dimension of the optimal manifold. With variable blocks, the GS-q and GSL-q rules (6.18) will no longer select coordinates $i \in Z$ since their optimal $d_i$ value is zero when close to the solution and on the manifold. Thus, these rules will only select $i \notin Z$ once the manifold has been identified (see footnote 25). In contrast, we would not expect superlinear convergence for fixed blocks unless all $i \notin Z$ happen to be in the same partition.
While we could show superlinear convergence of subsequences for random selection with variable blocks, the number of iterations between elements of the subsequence may be prohibitively large.

Footnote 24: A common variation of the proximal-Newton method solves (6.19) with $\alpha_k = 1$ and then sets $x^{k+1}$ based on a search along the line segment between $x^k$ and this solution [Fountoulakis and Tappenden, 2015, Schmidt, 2010]. This variation does not have the manifold identification property; only when the line search is on $\alpha_k$ do we have this property.

Footnote 25: A subtle issue is the case where $d_i = 0$ in (6.18) but $i \notin Z$. In such cases we can break ties by preferring coordinates $i$ where $g_i$ is differentiable, so that the $i \notin Z$ are included.

6.4.2 Practical Proximal-Newton Methods

A challenge with using the update (6.19) in general is that the optimization is non-quadratic (due to the $g_i$ terms) and non-separable (due to the $H_b^k$-norm). If we make the $H_b^k$ diagonal, then the objective is separable but this destroys the potential for superlinear convergence. Fortunately, a variety of strategies exist in the literature to allow non-diagonal $H_b^k$.

For example, for bound constrained problems we can apply two-metric projection (TMP) methods, which use a modified $H_b^k$ and allow the computation of a (cheap) projection under the Euclidean norm [Gafni and Bertsekas, 1984]. This method splits the coordinates into an “active” set and a “working” set, where the active-set $A$ for non-negative constraints would be

$$A = \{i \mid x_i < \epsilon,\; \nabla_i f(x) > 0\},$$

for some small $\epsilon$, while the working-set $W$ is the complement of this set. So the active-set contains the coordinates corresponding to the variables that we expect to be zero while the working-set contains the coordinates corresponding to the variables that we expect to be unconstrained. The TMP method can subsequently use the update

$$\begin{aligned}
x_W &\leftarrow \operatorname{proj}_C\left(x_W - \alpha H_W^{-1} \nabla_W f(x)\right),\\
x_A &\leftarrow \operatorname{proj}_C\left(x_A - \alpha \nabla_A f(x)\right).
\end{aligned}$$

This method performs a gradient update on the active-set and a Newton update on the working-set.
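The TMP update described above can be sketched in a few lines for the non-negative case. This is a simplified illustration, not the solver used in the experiments: it assumes C is the non-negative orthant, takes the full Hessian `H` and gradient `grad` as inputs, uses a fixed step-size `alpha` with no line search, and the function name `tmp_step` is hypothetical.

```python
import numpy as np


def tmp_step(x, grad, H, alpha=1.0, eps=1e-8):
    """One two-metric projection step for the constraint x >= 0.

    Active set A = {i : x_i < eps and grad_i > 0} takes a projected gradient
    step; the working set W (the complement) takes a projected Newton step
    that uses only the W-by-W sub-block of H.
    """
    x = np.asarray(x, dtype=float)
    grad = np.asarray(grad, dtype=float)
    active = (x < eps) & (grad > 0)
    work_idx = np.flatnonzero(~active)
    x_new = x.copy()
    # Gradient update on the active set, projected onto x >= 0.
    x_new[active] = np.maximum(x[active] - alpha * grad[active], 0.0)
    # Newton update on the working set, projected onto x >= 0.
    if work_idx.size > 0:
        H_w = H[np.ix_(work_idx, work_idx)]
        direction = np.linalg.solve(H_w, grad[work_idx])
        x_new[work_idx] = np.maximum(x[work_idx] - alpha * direction, 0.0)
    return x_new


# Illustrative quadratic f(x) = 0.5*x'Qx - c'x with x >= 0 (made-up values),
# whose constrained minimizer is (1, 0).
Q = np.array([[2.0, 1.0], [1.0, 2.0]])
c = np.array([2.0, -1.0])
x = np.array([0.5, 0.0])
x = tmp_step(x, Q @ x - c, Q)
print(x)
```

In this toy instance the second variable is already on the boundary with a positive gradient, so it is placed in the active set, and a single Newton step on the remaining variable reaches the constrained minimizer. This mirrors the finite termination behaviour discussed for quadratic objectives once the correct set of non-zero variables is known.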
Gafni and Bertsekas [1984] show that this preserves many of the essential properties of projected-Newton methods like giving a descent direction, converging to a stationary point, and superlinear convergence if we identify the correct set of non-zero variables. Also note that for indices $i \in Z$, this eventually only takes gradient steps so our analysis of the previous section applies (it identifies the manifold in a finite number of iterations). As opposed to solving the block-wise proximal-Newton update in (6.19), in our experiments we explore simply using the TMP update applied to the block and found that it gave nearly identical performance for a much lower cost.

TMP methods have also been generalized to settings like $\ell_1$-regularization [Schmidt, 2010] and they can essentially be used for any separable $g$ function. Another widely-used strategy is to inexactly solve (6.19) [Fountoulakis and Tappenden, 2015, Lee et al., 2012, Schmidt, 2010]. This has the advantage that it can still be used in the group $\ell_1$-regularization setting or other group-separable settings.

6.4.3 Optimal Updates for Quadratic f and Piecewise-Linear g

Two of the most well-studied optimization problems in machine learning are the SVM and LASSO problems. The LASSO problem is given by an $\ell_1$-regularized quadratic objective

$$\operatorname*{argmin}_{x} \; \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1,$$

while the dual of the (non-smooth) SVM problem has the form of a bound-constrained quadratic objective

$$\operatorname*{argmin}_{x \in [0, U]} \; \frac{1}{2} x^T M x - \sum_i x_i, \tag{6.20}$$

for a particular positive semi-definite matrix $M$ and constant $U$. In both cases we typically expect the solution to be sparse, and identifying the optimal manifold has been shown to improve practical performance of BCD methods [De Santis et al., 2016, Joachims, 1999].

Both problems have a set of $g_i$ that are piecewise-linear over their domain, implying that they can be written as a maximum over univariate linear functions on the domain of each variable.
Although we can still consider TMP or inexact proximal-Newton updates for these problems, this special structure actually allows us to compute the exact minimum with respect to a block (which is efficient when considering medium-sized blocks). Indeed, for SVM problems the idea of using exact updates in BCD methods dates back to the sequential minimal optimization (SMO) method [Platt, 1998], which uses exact updates for blocks of size 2. In this section we consider methods that work for blocks of arbitrary size (see footnote 26).

While we could write the optimal update as a quadratic program, the special structure of the LASSO and SVM problems lends itself well to exact homotopy methods. These methods date back to Osborne and Turlach [2011] and Osborne et al. [2000], who proposed an exact homotopy method that solves the LASSO problem for all values of $\lambda$. This type of approach was later popularized under the name “least angle regression” (LARS) [Efron et al., 2004]. Since the solution path is piecewise-linear, given the output of a homotopy algorithm we can extract the exact solution for our given value of $\lambda$. Hastie et al. [2004] derive an analogous homotopy method for SVMs, while Rosset and Zhu [2007] derive a generic homotopy method for the case of piecewise-linear $g_i$ functions.

The cost of each iteration of a homotopy method on a problem with $|b|$ variables is $O(|b|^2)$. It is known that the worst-case runtime of these homotopy methods can be exponential [Mairal and Yu, 2012]. However, the problems where this arises are somewhat unstable, and in practice the solution is typically obtained after a linear number of iterations. This gives a runtime in practice of $O(|b|^3)$, which does not allow enormous blocks but does allow us to efficiently use block sizes in the hundreds or thousands. That being said, since these methods compute the exact block update, in the scenario where we previously had superlinear convergence, we now obtain finite convergence.
That is, the algorithm will stop in a finite number of iterations with the exact solution provided that it has identified the optimal manifold, uses a greedy rule with variable blocks, and the block size is larger than the dimension of the manifold. This finite termination is also guaranteed under similar assumptions for TMP methods, and although TMP methods may make less progress per-iteration than exact updates, they may be a cheaper alternative to homotopy methods as the cost is explicitly restricted to $O(|b|^3)$.

Footnote 26: The methods discussed in this section can also be used to compute exact Newton-like updates in the case of a non-quadratic $f$, but where the $g_i$ are still piecewise-linear.

6.5 Numerical Experiments

[Figure 6.2 (three panels: iterations with 5-sized, 50-sized and 100-sized blocks on the horizontal axis, f(x) − f∗ for non-negative least squares on Dataset A on the vertical axis; curves PN-VB, TMP-VB, PG-VB, PN-FB, TMP-FB, PG-FB): Comparison of different updates when using greedy fixed and variable blocks of different sizes.]

In this section we demonstrate the manifold identification and superlinear/finite convergence properties of the greedy BCD method as discussed in Section 6.4 for a sparse non-negative constrained $\ell_1$-regularized least-squares problem using Dataset A (see Appendix D.5.1). In particular, we compare the performance of a projected gradient update with $L_b$ step-size, a projected Newton (PN) solver with line search as discussed in Section 6.4.1 and the two-metric projection (TMP) update as discussed in Section 6.4.2 when using fixed (FB) and variable (VB) blocks of different sizes ($|b_k| \in \{5, 50, 100\}$).
We use a regularization constant of $\lambda = 50{,}000$ to encourage a high level of sparsity, resulting in an optimal solution $x^*$ with 51 non-zero variables. In Figure 6.2 we indicate active-set identification with a star and show that all approaches eventually identify the active-set. We see that TMP does as well as projected Newton for all block sizes, while both do better than gradient updates. For a block size of 100, we get finite convergence using projected Newton and TMP updates. We repeat this experiment for random block selection in Figure 6.3 and show that for such a sparse problem multiple iterations are often required before progress is made due to the repetitive selection of variables that are already zero/active.

[Figure 6.3 (three panels: iterations with 5-sized, 50-sized and 100-sized blocks on the horizontal axis, f(x) − f∗ for non-negative least squares on Dataset A on the vertical axis; curves PN-VB, TMP-VB, PG-VB, PN-FB, TMP-FB, PG-FB): Comparison of different updates when using random fixed and variable blocks of different sizes.]

6.6 Discussion

In this chapter we showed that greedy BCD methods have a finite-time manifold identification property for problems with separable non-smooth structures like bound constraints or $\ell_1$-regularization. Our analysis notably leads to bounds on the number of iterations required to reach the optimal manifold, or the “active-set complexity”, for the full proximal gradient method as well as BCD variants. Further, when using greedy rules with variable blocks this leads to superlinear or finite convergence for problems with sufficiently-sparse solutions.

While we made the assumption of strong convexity (or similar), it would be useful to relax this condition while still establishing active-set identification results.
Further directions include allowing linear equalities between blocks or extending the results to accelerated proximal methods. Finally, although it is easy to extend the results in this section to the case of a group-separable $g$, a very useful extension would be to consider the non-separable case.

Chapter 7

Discussion

In this chapter we discuss interesting extensions of the work presented in this dissertation, as well as issues that we did not consider and several future directions.

• Revitalization of greedy coordinate descent methods. Since the publication of our work in Chapter 2, there has been a resurgence of greedy coordinate descent methods in the literature for various applications. We list several below that cite our work:

  – Wang [2017] considers model-based iterative reconstruction methods for image reconstruction applications. They propose using a combination of greedy and random coordinate descent methods, and show that the best performance is obtained using a hybrid of 20% random updates and 80% greedy updates.

  – Gsponer et al. [2017] present a new learning algorithm for learning a sequence regression function, which uses a greedy coordinate descent method and exact optimization to find $\alpha_k$. They exploit the sparsity and nested structure of the feature space to ensure an efficient calculation of the greedy Gauss-Southwell rule. They improve the efficiency further by proving an upper bound on the best coordinate and then using a branch-and-bound algorithm to calculate the best coordinate. Their method compares to the state-of-the-art, while requiring little to no pre-processing or domain knowledge.

  – Massias et al. [2017] consider working-set methods, where only a subset of constraints are considered at each iteration leading to simpler problems of reduced size.
The authors propose a new batch version of the GS-r rule and show that their new greedy active-set method achieves state-of-the-art performance on sparse learning problems with respect to floating point calculations (and time) on LASSO and multi-task LASSO estimators.

  – Stich et al. [2017] propose an approximate greedy coordinate descent method, where a gradient oracle is used to approximate the gradient. They show that the approximate gradient can be updated cheaply at no extra cost, making their method efficient for a more general set of problems with less structure than we assume in this work.

• Accelerated methods: We focused on non-accelerated greedy (block) coordinate descent methods in Chapters 2 and 5. However, we expect our conclusions are likely to hold in the case of accelerated methods. In fact, since the publication of our work in Chapter 2, Lau and Yao [2017] proposed an accelerated greedy block coordinate proximal gradient method using the GS-r selection rule. They assume the Kurdyka-Łojasiewicz inequality (see Chapter 4) and exploit the results of recent works to show convergence of their method for non-convex problems. Their method is shown to beat state-of-the-art solvers on sparse linear regression problems with separable or block-separable regularizers.

• Parallel methods: Another extension that we did not consider in this work is parallelization. For iterative methods like coordinate descent, parallel implementations are suitable when the dependency graph is sparse (see Bertsekas and Tsitsiklis [1989, §1.2.4]). For example, if the objective function is separable, then coordinate descent methods are “embarrassingly parallel”, meaning the speedup achieved is directly proportional to the number of processors used. When we talk about coordinate descent for truly huge-scale problems, it is impractical to consider a serial implementation. Lots of work has been done on parallel randomized coordinate descent methods (see Richtárik and Takáč [2016] for a summary).
Since the publication of our work the following parallel greedy coordinate descent methods have been proposed:

  – You et al. [2016] considered smooth functions with bound constraints and showed linear convergence of their proposed asynchronous parallel greedy coordinate descent method when using the GS-r selection rule.

  – Moreau and Oudre [2017] considered the specific problem of convolutional sparse coding, which is designed to build sparse linear representations of datasets. They exploited the specific structure of the convolutional problem and showed that their proposed asynchronous parallel version of greedy coordinate descent using the GS-r rule scales superlinearly with the number of cores (up to a point), making it an efficient option compared to other state-of-the-art methods. Unlike You et al. [2016], their results do not require centralized communication and a finely tuned step size.

• Non-convex problems: While we focused on convex problems, since our work in Chapter 2 was published other authors have shown that our methods are useful for non-convex problems. For example, although it is well-known that the Gauss-Southwell rule works well for the non-convex PageRank problem [Berkhin, 2006, Bonchi et al., 2012, Jeh and Widom, 2003, Lei et al., 2016, McSherry, 2005, Nassar et al., 2015], following the publication of our work, Wang et al. [2017] showed that the Gauss-Southwell-Lipschitz rule is also very useful for this problem.

We have also seen extensions of the PL work that was presented in Chapter 4.

  – Zhang et al. [2016] show that principal component analysis satisfies the PL inequality on a Riemannian manifold.

  – Reddi et al. [2016b] show linear convergence of proximal versions of both the stochastic variance reduced gradient and SAGA algorithms when assuming our proximal PL-inequality.

  – Joulani et al.
[2017] show that if a function is star-strongly-convex, then the function satisfies the PL inequality.

  – Roulet and d’Aspremont [2017] use the PL inequality to ensure linear convergence of an accelerated method with restart.

  – Yin et al. [2017] analyze the convergence of mini-batch stochastic gradient descent methods under the PL inequality, specifically when the batch-size is proportional to a measure of the “gradient diversity”.

  – The PL inequality under the name of “gradient dominance condition” has also been shown to hold for phase retrieval problems [Zhou et al., 2016], blind deconvolution [Li et al., 2016], and linear residual neural networks [Hardt and Ma, 2016, Zhou and Liang, 2017], while Zhou and Liang [2017] also show it holds for the square loss function of linear and one-hidden-layer nonlinear neural networks.

  – Csiba and Richtárik [2017] introduce the “Weak Polyak-Łojasiewicz” condition. Similar to how we introduced the PL inequality as a generalization of strong-convexity to a class of non-convex problems in Chapter 4, Csiba and Richtárik generalize the weakly convex case, providing convergence theory for a new class of non-convex problems.

An elegant consequence of the work in Chapter 4 is that we are starting to see more connections drawn between existing works when it comes to relaxing strong-convexity. This allows authors to exploit the “best” or weakest assumption in the easiest form for their given problem setting.

• Other methods: Greedy variations of several other methods have also been proposed since our work, including a greedy primal-dual method [Lei et al., 2017] and a greedy direction method of multipliers (which uses our analysis from Chapter 2) [Huang et al., 2017]. We mentioned in Chapter 3 that the Kaczmarz method could be used for piecewise-linear objectives.
Since the publication of our work, Yang and Lin [2015] developed an SGD method that has a linear convergence rate (like stochastic average gradient) for piecewise-linear objectives.

Future extensions that were not considered in this dissertation are:

• Successive over-relaxation methods. Successive over-relaxation (SOR) [Frankel, 1950, Young, 1950] is an extrapolation of the cyclic Gauss-Seidel method. It is defined, depending on the extrapolation factor $\omega > 0$, by replacing the iterate $x^{k+1}$ following a full sweep through the coordinates by the following modification,

$$x^{k+1} = \omega x^{k+1} + (1 - \omega) x^k.$$

This scheme can significantly accelerate the convergence rate obtained by the Gauss-Seidel (cyclic CD) method for an optimal value of $\omega$ where $1 < \omega < 2$. The optimal value of $\omega$ is known in some cases but it is not known for general matrices.

When we apply SOR to the coordinate-wise update in coordinate descent methods (or Gauss-Seidel assuming cyclic selection), we obtain the following

$$\begin{aligned}
x^{k+1} &= \omega x^{k+1} + (1 - \omega) x^k\\
&= \omega\left(x^k - \alpha \nabla_{i_k} f(x^k) e_{i_k}\right) + (1 - \omega) x^k\\
&= x^k - \omega\alpha \nabla_{i_k} f(x^k) e_{i_k},
\end{aligned}$$

which translates to the adjustment of the step size by a constant factor. It would be interesting to see if a similar speed-up as is seen for the Gauss-Seidel method can be obtained for coordinate descent or Kaczmarz methods when using greedy selection rules. It is possible that in this setting, there is a connection between SOR and a simpler version of some well-known first-order acceleration technique like Nesterov’s accelerated gradient descent method or the heavy-ball method. We may be able to exploit or generalize the cases where $\omega$ is known to see if these connections exist.

• Successive Projection Methods.
We did not talk about the successive projection method of Censor [1981] in this work, which is a generalization of the Kaczmarz method. Recently, Tibshirani [2017] analyzed the similarities between Dykstra’s algorithm (equivalently, the successive projection method), the alternating direction method of multipliers and coordinate descent, showing connections and equivalences between these methods when applied to the primal and dual regularized regression problem (under some assumptions). These connections could lead to various new analyses and extensions for coordinate descent methods, including new parallel methods and extensions to infinite-dimensional function spaces.

• BCD methods with constraints between blocks. This is an important setting for variational inference in graphical models, which is a very important topic in machine learning [Bishop, 2006]. A recent result gives a convergence rate for a variational inference method [Khan et al., 2016] but this result is not for the standard coordinate descent method. The difficulty with using BCD for problems with constraints between blocks is that related subproblems (blocks with constraints between them) are solved independently. Zhang et al. [2014] show that it is possible to strategically partition blocks and then use a message passing algorithm (which allows you to account for the constraints) so as to significantly improve the empirical efficiency of alternating minimization techniques. More recently, She and Schmidt [2017] show linear convergence of the 2-coordinate sequential minimal optimization methods for application to SVMs with unregularized bias. This is a case where the constraints are not completely separable. Earlier works on this topic include Tseng and Yun [2009a] who considered linearly constrained nonsmooth separable functions, Necoara et al.
[2011] who considered random block coordinate descent for general large-scale convex objectives with linearly coupled constraints, and Necoara and Patrascu [2014] who propose a variant of random block coordinate descent for composite objective functions with linearly coupled constraints. It would be interesting to see if there is potential to exploit simpler analysis/better convergence rates for randomized methods, analyze these methods for a more general class of constrained problems and possibly extend the analysis to greedy BCD methods.

Bibliography

A. Agarwal, S. N. Negahban, and M. J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. Ann. Statist., pages 2452–2482, 2012.
M. Anitescu. Degenerate nonlinear programming with a quadratic growth condition. SIAM J. Optim., pages 1116–1135, 2000.
H. Attouch and J. Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program., Ser. B, pages 5–16, 2009.
F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459, 2011.
M. Baghel, S. Agrawal, and S. Silakari. Recent trends and developments in graph coloring. In Proceedings of the International Conference on Frontiers of Intelligent Computing: Theory and Applications, pages 431–439. Springer Berlin Heidelberg, 2013.
S. Bakin. Adaptive regression and model selection in data mining problems. PhD thesis, Australian National University, Canberra, Australia, 1999.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods. SIAM J. Optim., 23(4):2037–2060, 2013.
Y. Bengio, O. Delalleau, and N. Le Roux. Label propagation and quadratic criterion. In O. Chapelle, B. Schölkopf, and A.
Zien, editors, Semi-Supervised Learning, chapter 11, pages 193–216. MIT Press, 2006.
P. Berkhin. Bookmark-coloring algorithm for personalized PageRank computing. Internet Mathematics, 3(1):41–62, 2006.
D. P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transactions on Automatic Control, 21(2):174–184, 1976.
D. P. Bertsekas. Convex Optimization Algorithms. Athena Scientific Belmont, 2015.
D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 3rd edition, 2016.
D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Englewood Cliffs: Prentice Hall, NJ, 1989.
D. Bickson. Gaussian Belief Propagation: Theory and Application. PhD thesis, The Hebrew University of Jerusalem, Jerusalem, Israel, 2009.
C. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.
T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal., 27(3):265–274, 2009.
L. Bo and C. Sminchisescu. Greedy block coordinate descent for large scale Gaussian process regression. arXiv:1206.3238, 2012.
D. Böhning. Multinomial logistic regression algorithm. Ann. Inst. Stat. Math., 44(1):197–200, 1992.
J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter. From error bounds to the complexity of first-order descent methods for convex functions. arXiv:1510.08234, 2015.
F. Bonchi, P. Esfandiar, D. F. Gleich, C. Greif, and L. V. S. Lakshmanan. Fast matrix computations for pairwise and columnwise commute times and Katz scores. Internet Mathematics, 8(1-2):73–112, 2012.
L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv:1606.04838, 2016.
N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv:1206.6392, 2012.
S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
J. V.
Burke and J. J. Moré. On the identification of active constraints. SIAM J. Numer. Anal., 25(5):1197–1211, 1988.
A. Cauchy. Méthode générale pour la résolution des systèmes d’équations simultanées. Comptes Rendus Hebd. Séances Acad. Sci., 25:536–538, 1847.
Y. Censor. Row-action methods for huge and sparse systems and their applications. SIAM Rev., 23(4):444–466, 1981.
Y. Censor, P. B. Eggermont, and D. Gordon. Strong underrelaxation in Kaczmarz’s method for inconsistent systems. Numer. Math., 41:83–92, 1983.
Y. Censor, G. T. Herman, and M. Jiang. A note on the behaviour of the randomized Kaczmarz algorithm of Strohmer and Vershynin. J. Fourier Anal. Appl., 15:431–436, 2009.
V. Cevher, S. Becker, and M. Schmidt. Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Processing Magazine, 31:32–43, 2014.
C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
B. Chen, S. He, Z. Li, and S. Zhang. Maximum block improvement and polynomial optimization. SIAM J. Optim., 22(1):87–107, 2012.
S. Chen and D. Donoho. Basis pursuit. In 28th Asilomar Conf. Signals, Systems Computers. Asilomar, 1994.
S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Rev., 43(1):129–159, 2001.
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press Cambridge, second edition, 2001.
C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
B. D. Craven and B. M. Glover. Invex functions and duality. J. Austral. Math. Soc. (Series A), pages 1–20, 1985.
D. Csiba and P. Richtárik. Importance sampling for minibatches. arXiv:1602.02283, 2016.
D. Csiba and P. Richtárik. Global convergence of arbitrary-block gradient methods for generalized Polyak-Łojasiewicz functions. arXiv:1709.03014, 2017.
D. Csiba, Z.
Qu, and P. Richtárik. Stochastic dual coordinate ascent with adaptive probabilities. In Proceedings of the 32nd International Conference on Machine Learning, pages 674–683, 2015.
J. Dahl, L. Vandenberghe, and V. Roychowdhury. Covariance selection for nonchordal graphs via chordal embedding. Optim. Methods Softw., 23(4):501–520, 2008.
M. De Santis, S. Lucidi, and F. Rinaldi. A fast active set block coordinate descent algorithm for ℓ1-regularized least squares. SIAM J. Optim., 26(1):781–809, 2016.
S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.
R. S. Dembo, S. C. Eisenstat, and T. Steihaug. Inexact Newton methods. SIAM J. Numer. Anal., 19(2):400–408, 1982.
J. E. Dennis and J. J. Moré. A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput., 28(126):549–560, 1974.
F. Deutsch. Rate of convergence of the method of alternating projections. Internat. Schriftenreihe Numer. Math., 72:96–107, 1985.
F. Deutsch and H. Hundal. The rate of convergence for the method of alternating projections, II. J. Math. Anal. Appl., 205:381–405, 1997.
I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Nearest neighbor based greedy coordinate descent. In Advances in Neural Information Processing Systems 24, pages 2160–2168, 2011.
F. Dinuzzo, C. S. Ong, P. Gehler, and G. Pillonetto. Learning output kernels with block coordinate descent. In Proceedings of the 28th International Conference on Machine Learning, pages 49–56, 2011.
D. Drusvyatskiy and A. S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. arXiv:1602.06661, 2016.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Stat., 32(2):407–451, 2004.
Y. C. Eldar and D. Needell. Acceleration of randomized Kaczmarz methods via the Johnson-Lindenstrauss Lemma. Numer. Algor., 58:163–177, 2011.
A. Ene and H. L. Nguyen.
Random coordinate descent methods for minimizing decomposable submodular functions. In Proceedings of the 32nd International Conference on Machine Learning, pages 787–795, 2015.
L. Esperet, L. Lemoine, and F. Maffray. Equitable partition of graphs into induced forests. Discrete Math., 338:1481–1483, 2015.
H. G. Feichtinger, C. Cenker, M. Mayer, H. Steier, and T. Strohmer. New variants of the POCS method using affine subspaces of finite codimension with applications to irregular sampling. SPIE: VCIP, pages 299–310, 1992.
O. Fercoq and P. Richtárik. Accelerated, parallel and proximal coordinate descent. SIAM J. Optim., 25(4):1997–2023, 2015.
W. F. Ferger. The nature and use of the harmonic mean. Journal of the American Statistical Association, 26(173):36–40, 1931.
K. Fountoulakis and R. Tappenden. A flexible coordinate descent method. arXiv:1507.03713, 2015.
K. Fountoulakis, F. Roosta-Khorasani, J. Shun, X. Cheng, and M. W. Mahoney. Exploiting optimization for local graph clustering. arXiv:1602.01886, 2016.
S. P. Frankel. Convergence rates of iterative treatments of partial differential equations. Math. Tables Aids Comput., 4(30):65–75, 1950.
M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.
W. J. Fu. Penalized regressions: the bridge versus the lasso. J. Comput. Graph. Stat., 7(3):397–416, 1998.
E. M. Gafni and D. P. Bertsekas. Two-metric projection methods for constrained optimization. SIAM J. Control Optim., 22(6):936–964, 1984.
A. Galántai. On the rate of convergence of the alternating projection method in finite dimensional spaces. J. Math. Anal. Appl., 310:30–44, 2005.
D. Garber and E. Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning, pages 541–549, 2015.
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H.
Freeman & Co., New York, NY, USA, 1979.
T. Glasmachers and U. Dogan. Accelerated coordinate descent with adaptive coordinate frequencies. In Proceedings of the 5th Asian Conference on Machine Learning, pages 72–86, 2013.
P. Gong and J. Ye. Linear convergence of variance-reduced stochastic gradient without strong convexity. arXiv:1406.1102, 2014.
R. Gordon, R. Bender, and G. T. Herman. Algebraic Reconstruction Techniques (ART) for three-dimensional electron microscopy and x-ray photography. J. Theor. Biol., 29(3):471–481, 1970.
R. M. Gower and P. Richtárik. Randomized iterative methods for linear systems. SIAM J. Matrix Anal. Appl., 36(4):1660–1690, 2015.
J. Gregor and J. A. Fessler. Comparison of SIRT and SQS for regularized weighted least squares image reconstruction. IEEE Trans. Comput. Imaging, 1(1):44–55, 2015.
M. Griebel and P. Oswald. Greedy and randomized versions of the multiplicative Schwarz method. Lin. Alg. Appl., 437:1596–1610, 2012.
S. Gsponer, B. Smyth, and G. Ifrim. Efficient sequence regression by learning linear models in all-subsequence space. In Machine Learning and Knowledge Discovery in Databases, pages 37–52, 2017.
M. Gu, L.-H. Lim, and C. J. Wu. ParNes: A rapidly convergent algorithm for accurate recovery of sparse and approximately sparse signals. Numer. Algor., pages 321–347, 2013.
M. Hanke and W. Niethammer. On the acceleration of Kaczmarz’s method for inconsistent linear systems. Lin. Alg. Appl., 130:83–98, 1990.
M. A. Hanson. On sufficiency of the Kuhn-Tucker conditions. J. Math. Anal. Appl., pages 545–550, 1981.
M. Hardt and T. Ma. Identity matters in deep learning. arXiv:1611.04231, 2016.
W. L. Hare. Identifying active manifolds in regularization problems. In H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 261–271. Springer New York, New York, NY, 2011.
W. L. Hare and A. S. Lewis.
Identifying active constraints via partial smoothness and prox-regularity. J. Convex Analysis, 11(2):251–266, 2004.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2nd edition, 2001.
T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. J. Mach. Learn. Res., 5:1391–1415, 2004.
G. T. Herman and L. B. Meyer. Algebraic reconstruction techniques can be made computationally efficient. IEEE Trans. Medical Imaging, 12(3):600–609, 1993.
R. R. Hocking. A biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics, 32(1):1–49, 1976.
A. J. Hoffman. On approximate solutions of systems of linear inequalities. J. Res. Nat. Bur. Stand., 49(4):263–265, 1952.
K. Hou, Z. Zhou, A. M.-C. So, and Z.-Q. Luo. On the linear convergence of the proximal gradient method for trace norm regularization. In Advances in Neural Information Processing Systems 26, pages 710–718, 2013.
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pages 408–415, 2008.
C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems 26, pages 3165–3173, 2013.
X. Huang, I. E. H. Yen, R. Zhang, Q. Huang, P. Ravikumar, and I. S. Dhillon. Greedy direction method of multiplier for MAP inference of large output domain. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pages 1550–1559, 2017.
D. Hush, P. Kelly, C. Scovel, and I. Steinwart. QP algorithms with guaranteed accuracy and run time for support vector machines. J. Mach. Learn. Res., pages 733–769, 2006.
P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A.
Sidford. Accelerating stochastic gradient descent. arXiv:1704.08227, 2017.
S. Jegelka, F. Bach, and S. Sra. Reflection methods for user-friendly submodular optimization. In Advances in Neural Information Processing Systems 26, pages 1313–1321, 2013.
G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the 12th International Conference on World Wide Web, pages 271–279. ACM, 2003.
T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169–184. MIT Press, 1999.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.
W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math., 26:189–206, 1984.
P. Joulani, A. György, and C. Szepesvári. A modular analysis of adaptive (non-) convex optimization: Optimism, composite objectives, and variational bounds. Proceedings of Machine Learning Research, 1:1–40, 2017.
S. Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l’Académie Polonaise des Sciences et des Lettres. Classe des Sciences Mathématiques et Naturelles. Série A, Sciences Mathématiques, 35:355–357, 1937.
M. Kadkhodaie, M. Sanjabi, and Z.-Q. Luo. On the linear convergence of the approximate proximal splitting method for non-smooth convex optimization. arXiv:1404.5350v1, 2014.
H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Proceedings, Part I, pages 795–811, 2016.
A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
M. E. Khan. Variational Learning for Latent Gaussian Model of Discrete Data. PhD thesis, The University of British Columbia, Vancouver, Canada, 2012.
M. E. Khan, R. Babanezhad, W. Lin, M. Schmidt, and M. Sugiyama. Faster stochastic variational inference using proximal gradient methods with general divergence functions. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, pages 319–328, 2016.
R. Kyng and S. Sachdeva. Approximate Gaussian elimination for Laplacians - fast, sparse, and simple. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 573–582. IEEE, 2016.
T. K. Lau and Y. Yao. Accelerated block coordinate proximal gradients with applications in high dimensional statistics. arXiv:1710.05338, 2017.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2663–2671, 2012.
C.-P. Lee and S. J. Wright. Random permutations fix a worst case for cyclic coordinate descent. arXiv:1607.08320, 2016.
J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for convex optimization. In Advances in Neural Information Processing Systems 25, pages 827–835, 2012.
S. Lee and S. J. Wright. Manifold identification in dual averaging for regularized stochastic online learning. J. Mach. Learn. Res., 13(1):1705–1744, 2012.
S.-i. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of Markov networks using ℓ1-regularization. In Advances in Neural Information Processing Systems 19, pages 817–824, 2006.
Y. T. Lee and A. Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. arXiv:1305.1922v1, 2013.
Q. Lei, K. Zhong, and I. S. Dhillon. Coordinate-wise power method.
In Advances in Neural Information Processing Systems 29, pages 2064–2072, 2016.
Q. Lei, I. E.-H. Yen, C.-y. Wu, I. S. Dhillon, and P. Ravikumar. Doubly greedy primal-dual coordinate descent for sparse empirical risk minimization. In Proceedings of the 34th International Conference on Machine Learning, pages 2034–2042, 2017.
L. Leventhal and A. S. Lewis. Randomized methods for linear constraints: Convergence rates and conditioning. Math. Oper. Res., 35(3):641–654, 2010.
E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Comput. Math. Math. Phys., 6:1–50, 1966.
G. Li and T. K. Pong. Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. arXiv:1602.02915v1, 2016.
X. Li, S. Ling, T. Strohmer, and K. Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. arXiv:1606.04933, 2016.
Y. Li and S. Osher. Coordinate descent optimization for ℓ1 minimization with application to compressed sensing; a greedy algorithm. Inverse Problems and Imaging, 3(3):487–503, 2009.
Z. Li, A. Uschmajew, and S. Zhang. On convergence of the maximum block improvement method. SIAM J. Optim., 25(1):210–233, 2015.
J. Liang, J. Fadili, and G. Peyré. Activity identification and local linear convergence of forward–backward-type methods. SIAM J. Optim., 27(1):408–437, 2017.
J. Liu and S. J. Wright. An accelerated randomized Kaczmarz method. arXiv:1310.2887v2, 2014.
J. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM J. Optim., pages 351–376, 2015.
J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. arXiv:1311.1873v3, 2014.
S. Łojasiewicz. A topological property of real analytic subsets (in French). Coll. du CNRS, Les équations aux dérivées partielles, pages 87–89, 1963.
Z.-Q. Luo and P. Tseng.
Error bounds and convergence analysis of feasible descent methods: A general approach. Ann. Oper. Res., 46(1):157–178, 1993.
A. Ma, D. Needell, and A. Ramdas. Convergence properties of the randomized extended Gauss-Seidel and Kaczmarz methods. arXiv:1503.08235v2, 2015a.
C. Ma, T. Tappenden, and M. Takáč. Linear convergence of the randomized feasible descent method under the weak strong convexity assumption. arXiv:1506.02530, 2015b.
J. Mairal and B. Yu. Complexity analysis of the lasso regularization path. In Proceedings of the 29th International Conference on Machine Learning, pages 353–360, 2012.
D. M. Malioutov, J. K. Johnson, and A. S. Willsky. Walk-sums and belief propagation in Gaussian graphical models. J. Mach. Learn. Res., 7(Oct):2031–2064, 2006.
M. Massias, A. Gramfort, and J. Salmon. From safe screening rules to working sets for faster Lasso-type solvers. arXiv:1703.07285, 2017.
F. McSherry. A uniform approach to accelerated PageRank computation. In Proceedings of the 14th International Conference on World Wide Web, pages 575–582. ACM, 2005.
N. Megiddo. Combinatorial optimization with rational objective functions. Math. Oper. Res., 4(4):414–424, 1979.
L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. J. R. Stat. Soc. Series B Stat. Methodol., 70(1):53–71, 2008.
R. Meir and G. Rätsch. An Introduction to Boosting and Leveraging, pages 118–183. Springer, Heidelberg, 2003.
O. Meshi, T. Jaakkola, and A. Globerson. Convergence rate analysis of MAP coordinate minimization algorithms. In Advances in Neural Information Processing Systems 25, pages 3014–3022, 2012.
R. Mifflin and C. Sagastizábal. Proximal points are on the fast track. J. Convex Anal., 9(2):563–579, 2002.
P. W. Mirowski, Y. LeCun, D. Madhavan, and R. Kuzniecky. Comparing SVM and convolutional networks for epileptic seizure prediction from intracranial EEG. In IEEE Workshop on Machine Learning for Signal Processing, pages 244–249. IEEE, 2008.
A.-r. Mohamed, G. Dahl, and G. Hinton.
Deep belief networks for phone recognition. In NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, page 39, 2009.
T. Moreau, L. Oudre, and N. Vayatis. Distributed convolutional sparse coding. arXiv:1705.10087, 2017.
K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
H. Namkoong, A. Sinha, S. Yadlowsky, and J. C. Duchi. Adaptive sampling probabilities for non-smooth optimization. In Proceedings of the 34th International Conference on Machine Learning, pages 2574–2583, 2017.
H. Nassar, K. Kloster, and D. F. Gleich. Strong localization in personalized PageRank vectors. In International Workshop on Algorithms and Models for the Web-Graph, pages 190–202. Springer, 2015.
I. Necoara and D. Clipici. Parallel random coordinate descent method for composite minimization: Convergence analysis and error bounds. SIAM J. Optim., pages 197–226, 2016.
I. Necoara and A. Patrascu. A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints. Comput. Optim. Appl., pages 307–337, 2014.
I. Necoara, Y. Nesterov, and F. Glineur. A random coordinate descent method on large optimization problems with linear constraints. Technical Report, 2011.
I. Necoara, Y. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization. arXiv:1504.06298v3, 2015.
D. Needell. Randomized Kaczmarz solver for noisy linear systems. BIT Numer. Math., 50:395–403, 2010.
D. Needell and J. A. Tropp. Paved with good intentions: analysis of a randomized block Kaczmarz method. Linear Algebra Appl., 441:199–221, 2014.
D. Needell, N. Srebro, and R. Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. arXiv:1310.5715v5, 2013.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.
Y. Nesterov.
A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2004.
Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. CORE Discussion Paper, 2010.
Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim., 22(2):341–362, 2012.
Y. Nesterov. Gradient methods for minimizing composite functions. Math. Program., 140:125–161, 2013.
Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Math. Program., pages 177–205, 2006.
A. Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In Proceedings of the 21st International Conference on Machine Learning, page 78. ACM, 2004.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.
A. B. J. Novikoff. On convergence proofs for perceptrons. Symp. Math. Theory Automata, 12:615–622, 1962.
J. Nutini, M. Schmidt, I. H. Laradji, M. Friedlander, and H. Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In Proceedings of the 32nd International Conference on Machine Learning, pages 1632–1641, 2015.
J. Nutini, B. Sepehry, I. Laradji, M. Schmidt, H. Koepke, and A. Virani. Convergence rates for greedy Kaczmarz algorithms, and faster randomized Kaczmarz rules using the orthogonality graph. arXiv:1612.07838, 2016.
J. Nutini, I. Laradji, M. Schmidt, and W. Hare. Let’s make block coordinate descent go fast: Faster greedy rules, message-passing, active-set complexity, and superlinear convergence. Submitted for publication, 2017a.
J. Nutini, M. Schmidt, and W. Hare. “Active-set complexity” of proximal gradient: How long does it take to find the sparsity pattern? Submitted for publication, 2017b.
S. M. Omohundro.
Five balltree construction algorithms. Technical report, International Computer Science Institute, Berkeley, 1989.
M. R. Osborne and B. A. Turlach. A homotopy algorithm for the quantile regression lasso and related piecewise linear problems. J. Comput. Graph. Stat., 20(4):972–987, 2011.
M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA J. Numer. Anal., 20(3):389–403, 2000.
P. Oswald and W. Zhou. Convergence analysis for Kaczmarz-type methods in a Hilbert space framework. Lin. Alg. Appl., 478:131–161, 2015.
S. Parter. The use of linear graphs in Gauss elimination. SIAM Rev., 3(2):119–130, 1961.
Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annu. Asilomar Conf. Signals, Systems and Computers, pages 40–44. IEEE, 1993.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
M. Pilanci and M. J. Wainwright. Newton sketch: A linear-time optimization algorithm with linear-quadratic convergence. SIAM J. Optim., 27(1):205–245, 2017.
J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical report, Microsoft Research, 1998.
B. T. Polyak. Gradient methods for minimizing functionals (in Russian). Zh. Vychisl. Mat. Mat. Fiz., pages 643–653, 1963.
Z. Qin, K. Scheinberg, and D. Goldfarb. Efficient block-coordinate descent algorithms for Group Lasso. Mathematical Programming Computation, 5:143–169, 2013.
Z. Qu and P. Richtárik. Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optim. Methods Softw., 31(5):829–857, 2016.
Z. Qu and P. Richtárik.
Coordinate descent with arbitrary sampling II: Expected separable overapproximation. Optim. Methods Softw., 31(5):858–884, 2016.
Z. Qu, P. Richtárik, and T. Zhang. Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873, 2014.
Z. Qu, P. Richtárik, M. Takáč, and O. Fercoq. SDNA: Stochastic dual Newton ascent for empirical risk minimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 1823–1832, 2016.
G. Rätsch, S. Mika, and M. K. Warmuth. On the convergence of leveraging. In Advances in Neural Information Processing Systems 14, pages 487–494, 2001.
S. J. Reddi, S. Sra, B. Póczos, and A. Smola. Fast stochastic methods for nonsmooth nonconvex optimization. arXiv:1605.06900, 2016a.
S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems 29, pages 1145–1153, 2016b.
P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. Math. Prog., 156(1-2):433–484, 2016.
P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program., 144:1–38, 2014.
P. Richtárik and M. Takáč. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233–1243, 2016.
M. Riedmiller and H. Braun. RPROP - A fast adaptive learning algorithm. In Proc. of ISCIS VII, 1992.
H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22(3):400–407, 1951.
R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. Control Optim., 14(5):877–898, 1976.
L. Rosasco, E. De Vito, A. Caponnetto, M. Piana, and A. Verri. Are loss functions all the same? Neural Computation, 16(5):1063–1076, 2004.
D. J. Rose. Triangulated graphs and the elimination process. J. Math. Anal. Appl., 32(3):597–609, 1970.
S. Rosset and J. Zhu.
Piecewise linear regularized solution paths. Ann. Stat., 35(3):1012–1030, 2007.
V. Roulet and A. d’Aspremont. Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems 30, pages 1119–1129, 2017.
H. Rue and L. Held. Gaussian Markov Random Fields: Theory and Applications. CRC Press, 2005.
Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2003.
A. L. Samuel. Some studies in machine learning using the game of checkers. IBM J. Res. Dev., 3(3):210–229, 1959.
S. Sardy, A. G. Bruce, and P. Tseng. Block coordinate relaxation methods for nonparametric wavelet denoising. J. Comput. Graph. Stat., 9(2):361–379, 2000.
K. Scheinberg and I. Rish. SINCO - a greedy coordinate ascent method for sparse inverse covariance selection problem. Optimization Online, 2009.
C. Scherrer, A. Tewari, M. Halappanavar, and D. J. Haglin. Feature clustering for accelerating parallel coordinate descent. In Advances in Neural Information Processing Systems 25, pages 28–36, 2012.
M. Schmidt. Graphical Model Structure Learning with ℓ1-Regularization. PhD thesis, The University of British Columbia, Vancouver, BC, Canada, 2010.
M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems 24, pages 1458–1466, 2011.
M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., 162(1-2):83–112, 2017.
L. Seidel. Über ein Verfahren die Gleichungen, auf welche die Methode der kleinsten Quadrate führt, sowie lineäre Gleichungen überhaupt durch successive Annäherung aufzulösen. Verlag der k. Akademie, 1874.
B. Sepehry. Finding a maximum weight sequence with dependency constraints. Master’s thesis, University of British Columbia, Vancouver, BC, Canada, 2016.
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res., pages 567–599, 2013.
J. She and M. Schmidt.
Linear convergence and support vector identification of sequential minimal optimization. In 10th NIPS Workshop on Optimization for Machine Learning, page 5, 2017.
O. Shental, P. H. Siegel, J. K. Wolf, D. Bickson, and D. Dolev. Gaussian belief propagation solver for systems of linear equations. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pages 1863–1867. IEEE, 2008.
S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
H.-J. M. Shi, S. Tu, Y. Xu, and W. Yin. A primer on coordinate descent algorithms. arXiv:1610.00040, 2016.
A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems 27, pages 2321–2329, 2014.
C. Song, S. Cui, Y. Jiang, and S.-T. Xia. Accelerated stochastic greedy coordinate descent by soft thresholding projection onto simplex. In Advances in Neural Information Processing Systems 30, pages 4841–4850, 2017.
D. Sontag and T. Jaakkola. Tree block coordinate descent for MAP in graphical models. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pages 544–551, 2009.
A. Srinivasan and E. Todorov. Graphical Newton. arXiv:1508.00952, 2015.
S. U. Stich, A. Raj, and M. Jaggi. Approximate steepest coordinate descent. arXiv:1706.08427, 2017.
T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15:262–278, 2009.
Y. Sun, M. S. Andersen, and L. Vandenberghe. Decomposition in conic optimization with partially separable structure. SIAM J. Optim., 24(2):873–897, 2014.
K. Tanabe. Projection method for solving a singular system of linear equations and its applications. Numer. Math., 17:203–214, 1971.
R. Tappenden, P. Richtárik, and J. Gondzio. Inexact coordinate descent: Complexity and preconditioning. J. Optim. Theory Appl., pages 144–176, 2016.
M.
Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun. MultiNet: Real-time joint semantic reasoning for autonomous driving. arXiv:1612.07695, 2016.

G. Thoppe, V. S. Borkar, and D. Garg. Greedy block coordinate descent (GBCD) method for high dimensional quadratic programs. arXiv:1404.6635, 2014.

R. J. Tibshirani. Dykstra's algorithm, ADMM, and coordinate descent: Connections, insights, and extensions. In Advances in Neural Information Processing Systems 30, pages 517–528, 2017.

P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program., Ser. B, pages 263–295, 2010.

P. Tseng and S. Yun. Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. J. Optim. Theory Appl., pages 513–535, 2009a.

P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Math. Program., 117:387–423, 2009b.

A. van der Sluis. Condition numbers and equilibrium of matrices. Numer. Math., 14:14–23, 1969.

L. Vandenberghe and M. S. Andersen. Chordal graphs and semidefinite optimization. Found. Trends Optim., 1(4):241–433, 2015.

N. K. Vishnoi. Lx = b Laplacian solvers and their algorithmic applications. Found. Trends Theoretical Computer Science, 8(1-2):1–141, 2013.

J. von Neumann. Functional Operators (AM-22), Volume 2: The Geometry of Orthogonal Spaces. (AM-22). Princeton University Press, 1950.

M. J. Wainwright. Structured regularizers for high-dimensional problems: Statistical and computational issues. Ann. Rev. Stat. Appl., 1(1):233–253, Jan 2014.

J. Wang, W. Wang, D. Garber, and N. Srebro. Efficient coordinate-wise leading eigenvector computation. arXiv:1702.07834, 2017.

P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. J. Mach. Learn. Res., pages 1523–1548, 2014.

X. Wang. High Performance Tomography. PhD thesis, Purdue University, West Lafayette, IN, USA, 2017.

D. J. A. Welsh and M. B. Powell.
An upper bound for the chromatic number of a graph and its application to timetabling problems. The Computer Journal, 10(1):85–86, 1967.

T. Whitney and R. Meany. Two algorithms related to the method of steepest descent. SIAM J. Numer. Anal., 4(1):109–118, 1967.

S. J. Wright. Identifiable surfaces in constrained optimization. SIAM J. Control Optim., 31(4):1063–1079, 1993.

S. J. Wright. Accelerated block-coordinate relaxation for regularized optimization. SIAM J. Optim., 22(1):159–186, 2012.

S. J. Wright. Coordinate descent algorithms. arXiv:1502.04759v1, 2015.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1):224–244, 2008.

L. Xiao and T. Zhang. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM J. Optim., 23(2):1062–1091, 2013.

Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci, 6(3):1758–1789, 2013.

T. Yang and Q. Lin. RSG: Beating subgradient method without smoothness and strong convexity. arXiv:1512.03107, 2015.

D. Yin, A. Pananjady, M. Lam, D. Papailiopoulos, K. Ramchandran, and P. Bartlett. Gradient diversity empowers distributed learning. arXiv:1706.05699, 2017.

Y. You, X. Lian, J. Liu, H.-F. Yu, I. S. Dhillon, J. Demmel, and C.-J. Hsieh. Asynchronous parallel greedy coordinate descent. In Advances in Neural Information Processing Systems 29, pages 4682–4690, 2016.

D. M. Young. Iterative Methods for Solving Partial Difference Equations of Elliptic Type. PhD thesis, Harvard University, Cambridge, MA, USA, 1950.

D. M. Young. Iterative Solution of Large Linear Systems. New York: Academic Press, 1971.

H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 765–774. IEEE, 2012.

H. Zhang.
The restricted strong convexity revisited: Analysis of equivalence to error bound and quadratic growth. arXiv:1511.01635, 2015.

H. Zhang. New analysis of linear convergence of gradient-type methods via unifying error bound conditions. arXiv:1606.00269, 2016.

H. Zhang and W. Yin. Gradient methods for convex minimization: Better rates under weaker conditions. arXiv:1303.4645, 2013.

H. Zhang, J. Jiang, and Z.-Q. Luo. On the linear convergence of a proximal gradient method for a class of nonsmooth convex minimization problems. J. Oper. Res. Soc. China, 1(2):163–186, 2013.

H. Zhang, S. J. Reddi, and S. Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems 29, pages 4592–4600, 2016.

J. Zhang, A. G. Schwing, and R. Urtasun. Message passing inference for large scale graphical models with high order potentials. In Advances in Neural Information Processing Systems 27, pages 1134–1142, 2014.

D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, pages 321–328, 2003.

Y. Zhou and Y. Liang. Characterization of gradient dominance and regularity conditions for neural networks. arXiv:1710.06910, 2017.

Y. Zhou, H. Zhang, and Y. Liang. Geometrical properties and accelerated gradient solvers of non-convex phase retrieval. In 54th Annual Allerton Conference on Communication, Control, and Computing, pages 331–335. IEEE, 2016.

Z. Zhou and A. M.-C. So. A unified approach to error bounds for structured convex optimization problems. arXiv:1512.03518, 2015.

A. Zouzias and N. M. Freris. Randomized extended Kaczmarz for solving least-squares. arXiv:1205.5770v3, 2013.

Appendix A

Chapter 2 Supplementary Material

A.1 Efficient Calculation of GS Rules for Sparse Problems

We first give additional details on how to calculate the GS rule efficiently for sparse instances of problems $h_1$ and $h_2$.
We will consider the case where each $g_i$ is smooth, but the ideas can be extended to allow a non-smooth $g_i$. Further, note that the efficient calculation does not rely on convexity, so these strategies can also be used for non-convex problems.

A.1.1 Problem h2

Problem $h_2$ has the form
\[
h_2(x) := \sum_{i \in V} g_i(x_i) + \sum_{(i,j) \in E} f_{ij}(x_i, x_j),
\]
where each $g_i$ and $f_{ij}$ are differentiable and $G = \{V, E\}$ is a graph where the number of vertices $|V|$ is the same as the number of variables $n$. If all nodes in the graph have a degree (number of neighbours) bounded above by some constant $d$, we can implement the GS rule in $O(d \log n)$ after an $O(n + |E|)$ time initialization by maintaining the following information about $x^k$:

1. A vector containing the values $\nabla_i g_i(x_i^k)$.
2. A matrix containing the values $\nabla_i f_{ij}(x_i^k, x_j^k)$ in the first column and $\nabla_j f_{ij}(x_i^k, x_j^k)$ in the second column.
3. The elements of the gradient vector $\nabla h_2(x^k)$ stored in a binary max heap data structure [see Cormen et al., 2001, Chapter 6].

Given the heap structure, we can compute the GS rule in $O(1)$ by simply reading the index value of the root node in the max heap. The costs for initializing these structures are:

1. $O(n)$ to compute $\nabla_i g_i(x_i^0)$ for all $n$ nodes.
2. $O(|E|)$ to compute $\nabla_{ij} f_{ij}(x_i^0, x_j^0)$ for all $|E|$ edges.
3. $O(n + |E|)$ to sum the values in the above structures to compute $\nabla h_2(x^0)$, and $O(n)$ to construct the initial max heap.

Thus, the one-time initialization cost is $O(n + |E|)$. The costs of updating the data structures after we update $x_{i_k}^k$ to $x_{i_k}^{k+1}$ for the selected coordinate $i_k$ are:

1. $O(1)$ to compute $\nabla_{i_k} g_{i_k}(x_{i_k}^{k+1})$.
2. $O(d)$ to compute $\nabla_{ij} f_{ij}(x_i^{k+1}, x_j^{k+1})$ for $(i,j) \in E$ and $i = i_k$ or $j = i_k$ (only $d$ such values exist by assumption, and all other $\nabla_{ij} f_{ij}(x_i, x_j)$ are unchanged).
3. $O(d)$ to update up to $d$ elements of $\nabla h_2(x^{k+1})$ that differ from $\nabla h_2(x^k)$ by using the differences in changed values of $\nabla g_i$ and $\nabla f_{ij}$, followed by $O(d \log n)$ to perform $d$ updates of the heap at a cost of $O(\log n)$ for each update.

The most expensive part of the update is modifying the heap, and thus the total cost is $O(d \log n)$.[27]

[27] For less-sparse problems where $n < d \log n$, using a heap is actually inefficient and we should simply store $\nabla h(x^k)$ as a vector. The initialization cost is the same, but we can then perform the GS rule in $O(n)$ by simply searching through the vector for the maximum element.

A.1.2 Problem h1

Problem $h_1$ has the form
\[
h_1(x) := \sum_{i=1}^n g_i(x_i) + f(Ax),
\]
where $g_i$ and $f$ are differentiable, and $A$ is an $m$ by $n$ matrix where we denote column $i$ by $a_i$ and row $j$ by $a_j^T$. Note that $f$ is a function from $\mathbb{R}^m$ to $\mathbb{R}$, and we assume $\nabla_j f$ only depends on $a_j^T x$. While this is a strong assumption (e.g., it rules out $f$ being the product function), this class includes a variety of notable problems like the least squares and logistic regression models from our experiments. If $A$ has $z$ non-zero elements, with a maximum of $c$ non-zero elements in each column and $r$ non-zero elements in each row, then with a pre-processing cost of $O(z)$ we can implement the GS rule in this setting in $O(cr \log n)$ by maintaining the following information about $x^k$:

1. A vector containing the values $\nabla_i g_i(x_i^k)$.
2. A vector containing the product $Ax^k$.
3. A vector containing the values $\nabla f(Ax^k)$.
4. A vector containing the product $A^T \nabla f(Ax^k)$.
5. The elements of the gradient vector $\nabla h_1(x^k)$ stored in a binary max heap data structure.

The heap structure again allows us to compute the GS rule in $O(1)$, and the costs of initializing these structures are:

1. $O(n)$ to compute $\nabla_i g_i(x_i^0)$ for all $n$ variables.
2. $O(z)$ to compute the product $Ax^0$.
3. $O(m)$ to compute $\nabla f(Ax^0)$ (using that $\nabla_j f$ only depends on $a_j^T x^0$).
4. $O(z)$ to compute $A^T \nabla f(Ax^0)$.
5. $O(n)$ to add the $\nabla_i g_i(x_i^0)$ to the above product to obtain $\nabla h_1(x^0)$ and construct the initial max heap.

As it is reasonable to assume that $z \geq m$ and $z \geq n$ (e.g., we have at least one non-zero in each row and column), the cost of the initialization is thus $O(z)$. The costs of updating the data structures after we update $x_{i_k}^k$ to $x_{i_k}^{k+1}$ for the selected coordinate $i_k$ are:

1. $O(1)$ to compute $\nabla_{i_k} g_{i_k}(x_{i_k}^{k+1})$.
2. $O(c)$ to update the product using $Ax^{k+1} = Ax^k + (x_{i_k}^{k+1} - x_{i_k}^k)\, a_{i_k}$, since the column $a_{i_k}$ has at most $c$ non-zero values.
3. $O(c)$ to update up to $c$ elements of $\nabla f(Ax^{k+1})$ that have changed (again using that $\nabla_j f$ only depends on $a_j^T x^{k+1}$).
4. $O(cr)$ to perform up to $c$ updates of the form
\[
A^T \nabla f(Ax^{k+1}) = A^T \nabla f(Ax^k) + \left(\nabla_j f(Ax^{k+1}) - \nabla_j f(Ax^k)\right)(a_j^T)^T,
\]
where each update costs $O(r)$ since each row $a_j^T$ has at most $r$ non-zero values.
5. $O(cr \log n)$ to update the gradients in the heap.

The most expensive part is again the heap update, and thus the total cost is $O(cr \log n)$.

A.2 Relationship Between µ1 and µ

We can establish the relationship between $\mu$ and $\mu_1$ by using the known relationship between the 2-norm and the 1-norm,
\[
\|x\|_1 \geq \|x\| \geq \frac{1}{\sqrt{n}} \|x\|_1.
\]
In particular, if we assume that $f$ is $\mu$-strongly convex in the 2-norm, then for all $x$ and $y$ we have
\[
\begin{aligned}
f(y) &\geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2 \\
&\geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2n} \|y - x\|_1^2,
\end{aligned}
\]
implying that $f$ is at least $\frac{\mu}{n}$-strongly convex in the 1-norm. Similarly, if we assume that a given $f$ is $\mu_1$-strongly convex in the 1-norm then for all $x$ and $y$ we have
\[
\begin{aligned}
f(y) &\geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu_1}{2} \|y - x\|_1^2 \\
&\geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu_1}{2} \|y - x\|^2,
\end{aligned}
\]
implying that $f$ is at least $\mu_1$-strongly convex in the 2-norm. Summarizing these two relationships, we have
\[
\frac{\mu}{n} \leq \mu_1 \leq \mu.
\]

A.3 Analysis for Separable Quadratic Case

We first establish an equivalent definition of strong convexity in the 1-norm, along the lines of Nesterov [2004, Theorem 2.1.9].
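Subsequently, we use this equivalent definition to derive $\mu_1$ for a separable quadratic function.

A.3.1 Equivalent Definition of Strong Convexity

Assume that $f$ is $\mu_1$-strongly convex in the 1-norm, so that for any $x, y \in \mathbb{R}^n$ we have
\[
f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu_1}{2} \|y - x\|_1^2.
\]
Reversing $x$ and $y$ in the above gives
\[
f(x) \geq f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu_1}{2} \|x - y\|_1^2,
\]
and adding these two together yields
\[
\langle \nabla f(y) - \nabla f(x), y - x \rangle \geq \mu_1 \|y - x\|_1^2. \tag{A.1}
\]
Conversely, assume that for all $x$ and $y$ we have
\[
\langle \nabla f(y) - \nabla f(x), y - x \rangle \geq \mu_1 \|y - x\|_1^2,
\]
and consider the function $g(\tau) = f(x + \tau(y - x))$ for $\tau \in \mathbb{R}$. Then
\[
\begin{aligned}
f(y) - f(x) - \langle \nabla f(x), y - x \rangle
&= g(1) - g(0) - \langle \nabla f(x), y - x \rangle \\
&= \int_0^1 \left[ \frac{dg}{d\tau}(\tau) - \langle \nabla f(x), y - x \rangle \right] d\tau \\
&= \int_0^1 \left[ \langle \nabla f(x + \tau(y - x)), y - x \rangle - \langle \nabla f(x), y - x \rangle \right] d\tau \\
&= \int_0^1 \langle \nabla f(x + \tau(y - x)) - \nabla f(x), y - x \rangle \, d\tau \\
&\geq \int_0^1 \frac{\mu_1}{\tau} \|\tau(y - x)\|_1^2 \, d\tau \\
&= \int_0^1 \mu_1 \tau \|y - x\|_1^2 \, d\tau \\
&= \left. \frac{\mu_1}{2} \tau^2 \|y - x\|_1^2 \right|_0^1 \\
&= \frac{\mu_1}{2} \|y - x\|_1^2.
\end{aligned}
\]
Thus, $\mu_1$-strong convexity in the 1-norm is equivalent to having
\[
\langle \nabla f(y) - \nabla f(x), y - x \rangle \geq \mu_1 \|y - x\|_1^2 \quad \forall\, x, y. \tag{A.2}
\]

A.3.2 Strong Convexity Constant µ1 for Separable Quadratic Functions

Consider a strongly convex quadratic function $f$ with a diagonal Hessian $H = \nabla^2 f(x) = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$, where $\lambda_i > 0$ for all $i = 1, \dots, n$. We show that in this case
\[
\mu_1 = \left( \sum_{i=1}^n \frac{1}{\lambda_i} \right)^{-1}.
\]
From the previous section, $\mu_1$ is the largest value such that (A.2) holds,
\[
\mu_1 = \inf_{x \neq y} \frac{\langle \nabla f(y) - \nabla f(x), y - x \rangle}{\|y - x\|_1^2}.
\]
Using $\nabla f(x) = Hx + b$ for some $b$ and letting $z = y - x$, we get
\[
\begin{aligned}
\mu_1 &= \inf_{x \neq y} \frac{\langle (Hy + b) - (Hx + b), y - x \rangle}{\|y - x\|_1^2} \\
&= \inf_{x \neq y} \frac{\langle H(y - x), y - x \rangle}{\|y - x\|_1^2} \\
&= \inf_{z \neq 0} \frac{z^T H z}{\|z\|_1^2} \\
&= \min_{\|z\|_1 = 1} z^T H z \\
&= \min_{e^T z = 1} \sum_{i=1}^n \lambda_i z_i^2,
\end{aligned}
\]
where the last two lines use that the objective is invariant to scaling of $z$ and to the sign of $z$ (respectively), and where $e$ is a vector containing a one in every position. This is an equality-constrained strictly-convex quadratic program, so its solution is given as a stationary point $(z^*, \eta^*)$ of the Lagrangian,
\[
\Lambda(z, \eta) = \sum_{i=1}^n \lambda_i z_i^2 + \eta(1 - e^T z).
\]
Differentiating with respect to each $z_i$ for $i = 1, \dots, n$ and equating to zero, we have for all $i$ that $2\lambda_i z_i^* - \eta^* = 0$, or
\[
z_i^* = \frac{\eta^*}{2\lambda_i}. \tag{A.3}
\]
Differentiating the Lagrangian with respect to $\eta$ and equating to zero we obtain $1 - e^T z^* = 0$, or equivalently
\[
1 = e^T z^* = \frac{\eta^*}{2} \sum_j \frac{1}{\lambda_j},
\]
which yields
\[
\eta^* = 2 \left( \sum_j \frac{1}{\lambda_j} \right)^{-1}.
\]
Combining this result for $\eta^*$ with equation (A.3), we have
\[
z_i^* = \frac{1}{\lambda_i} \left( \sum_j \frac{1}{\lambda_j} \right)^{-1}.
\]
This gives the minimizer, so we evaluate the objective at this point to obtain $\mu_1$,
\[
\begin{aligned}
\mu_1 &= \sum_{i=1}^n \lambda_i (z_i^*)^2 \\
&= \sum_{i=1}^n \lambda_i \left( \frac{1}{\lambda_i} \left( \sum_{j=1}^n \frac{1}{\lambda_j} \right)^{-1} \right)^2 \\
&= \sum_{i=1}^n \frac{1}{\lambda_i} \left( \sum_{j=1}^n \frac{1}{\lambda_j} \right)^{-2} \\
&= \left( \sum_{j=1}^n \frac{1}{\lambda_j} \right)^{-2} \left( \sum_{i=1}^n \frac{1}{\lambda_i} \right) \\
&= \left( \sum_{j=1}^n \frac{1}{\lambda_j} \right)^{-1}.
\end{aligned}
\]

A.4 Gauss-Southwell-Lipschitz Rule: Convergence Rate

The coordinate-descent method with a constant step-size of $1/L_{i_k}$ uses the iteration
\[
x^{k+1} = x^k - \frac{1}{L_{i_k}} \nabla_{i_k} f(x^k)\, e_{i_k}.
\]
Because $\nabla f$ is coordinate-wise $L_{i_k}$-Lipschitz continuous, we obtain the following bound on the progress made by each iteration:
\[
\begin{aligned}
f(x^{k+1}) &\leq f(x^k) + \nabla_{i_k} f(x^k)(x^{k+1} - x^k)_{i_k} + \frac{L_{i_k}}{2} (x^{k+1} - x^k)_{i_k}^2 \\
&= f(x^k) - \frac{1}{L_{i_k}} (\nabla_{i_k} f(x^k))^2 + \frac{L_{i_k}}{2} \left[ \frac{1}{L_{i_k}} \nabla_{i_k} f(x^k) \right]^2 \\
&= f(x^k) - \frac{1}{2L_{i_k}} [\nabla_{i_k} f(x^k)]^2 \\
&= f(x^k) - \frac{1}{2} \left[ \frac{\nabla_{i_k} f(x^k)}{\sqrt{L_{i_k}}} \right]^2.
\end{aligned} \tag{A.4}
\]
By choosing the coordinate to update according to the Gauss-Southwell-Lipschitz (GSL) rule,
\[
i_k \in \operatorname*{argmax}_i \frac{|\nabla_i f(x^k)|}{\sqrt{L_i}},
\]
we obtain the tightest possible bound on (A.4). We define the following norm,
\[
\|x\|_L = \sum_{i=1}^n \sqrt{L_i}\, |x_i|, \tag{A.5}
\]
which has a dual norm of
\[
\|x\|_L^* = \max_i \frac{1}{\sqrt{L_i}} |x_i|.
\]
Under this notation, and using the GSL rule, (A.4) becomes
\[
f(x^{k+1}) \leq f(x^k) - \frac{1}{2} \left( \|\nabla f(x^k)\|_L^* \right)^2.
\]
Measuring strong convexity in the norm $\|\cdot\|_L$ we get
\[
f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu_L}{2} \|y - x\|_L^2.
\]
Minimizing both sides with respect to $y$ we get
\[
\begin{aligned}
f(x^*) &\geq f(x) - \sup_y \left\{ \langle -\nabla f(x), y - x \rangle - \frac{\mu_L}{2} \|y - x\|_L^2 \right\} \\
&= f(x) - \left( \frac{\mu_L}{2} \|\cdot\|_L^2 \right)^* (-\nabla f(x)) \\
&= f(x) - \frac{1}{2\mu_L} \left( \|\nabla f(x)\|_L^* \right)^2.
\end{aligned}
\]
Putting these together yields
\[
f(x^{k+1}) - f(x^*) \leq (1 - \mu_L)[f(x^k) - f(x^*)]. \tag{A.6}
\]

A.5 Comparing µL to µ1 and µ

By the logic of Appendix A.2, to establish a relationship between different strong convexity constants under different norms, it is sufficient to establish the relationships between the squared norms.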
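In this section, we use this to establish the relationship between $\mu_L$, the strong convexity constant in the norm defined in (A.5), and both $\mu_1$ and $\mu$.

A.5.1 Relationship Between µL and µ1

We have
\[
c\|x\|_1 - \|x\|_L = c \sum_i |x_i| - \sum_i \sqrt{L_i}\, |x_i| = \sum_i (c - \sqrt{L_i}) |x_i|.
\]
Assuming $c \geq \sqrt{L}$, where $L = \max_i \{L_i\}$, the expression is non-negative and we get
\[
\|x\|_L \leq \sqrt{L}\, \|x\|_1.
\]
By using
\[
c\|x\|_L - \|x\|_1 = \sum_i (c\sqrt{L_i} - 1) |x_i|,
\]
and assuming $c \geq \frac{1}{\sqrt{L_{\min}}}$, where $L_{\min} = \min_i \{L_i\}$, this expression is non-negative and we get
\[
\|x\|_1 \leq \frac{1}{\sqrt{L_{\min}}} \|x\|_L.
\]
The relationship between $\mu_L$ and $\mu_1$ is based on the squared norm, so in summary we have
\[
\frac{\mu_1}{L} \leq \mu_L \leq \frac{\mu_1}{L_{\min}}.
\]

A.5.2 Relationship Between µL and µ

Let $\vec{L}$ denote a vector with elements $\sqrt{L_i}$, and we note that
\[
\|\vec{L}\| = \left( \sum_i (\sqrt{L_i})^2 \right)^{1/2} = \left( \sum_i L_i \right)^{1/2} = \sqrt{n\bar{L}}, \quad \text{where } \bar{L} = \frac{1}{n} \sum_i L_i.
\]
Using this, we have
\[
\|x\|_L = x^T (\mathrm{sign}(x) \circ \vec{L}) \leq \|x\| \, \|\mathrm{sign}(x) \circ \vec{L}\| \leq \sqrt{n\bar{L}}\, \|x\|.
\]
This implies that
\[
\frac{\mu}{n\bar{L}} \leq \mu_L.
\]
Note that we can also show that $\mu_L \leq \frac{\mu}{L_{\min}}$, but this is less tight than the upper bound from the previous section because $\mu_1 \leq \mu$.

A.6 Approximate Gauss-Southwell with Additive Error

In the additive error regime, the approximate Gauss-Southwell rule chooses an $i_k$ satisfying
\[
|\nabla_{i_k} f(x^k)| \geq \|\nabla f(x^k)\|_\infty - \epsilon_k, \quad \text{where } \epsilon_k \geq 0 \;\; \forall k,
\]
and we note that we can assume $\epsilon_k \leq \|\nabla f(x^k)\|_\infty$ without loss of generality because we must always choose an $i$ with $|\nabla_{i_k} f(x^k)| \geq 0$. Applying this to our bound on the iteration progress, we get
\[
\begin{aligned}
f(x^{k+1}) &\leq f(x^k) - \frac{1}{2L} [\nabla_{i_k} f(x^k)]^2 \\
&\leq f(x^k) - \frac{1}{2L} \left( \|\nabla f(x^k)\|_\infty - \epsilon_k \right)^2 \\
&= f(x^k) - \frac{1}{2L} \left( \|\nabla f(x^k)\|_\infty^2 - 2\epsilon_k \|\nabla f(x^k)\|_\infty + \epsilon_k^2 \right) \\
&= f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|_\infty^2 + \frac{\epsilon_k}{L} \|\nabla f(x^k)\|_\infty - \frac{\epsilon_k^2}{2L}.
\end{aligned} \tag{A.7}
\]
We first give a result that assumes $\nabla f$ is $L_1$-Lipschitz continuous in the $\ell_1$-norm. This implies an inequality that we prove next, followed by a convergence rate that depends on $L_1$. However, note that $L \leq L_1 \leq Ln$, so this potentially introduces a dependency on $n$.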
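We subsequently give a slightly less concise result that has a worse dependency on $\epsilon_k$ but does not rely on $L_1$.

A.6.1 Gradient Bound in Terms of L1

We say that $\nabla f$ is $L_1$-Lipschitz continuous in the $\ell_1$-norm if we have for all $x$ and $y$ that
\[
\|\nabla f(x) - \nabla f(y)\|_\infty \leq L_1 \|x - y\|_1.
\]
Similar to Nesterov [2004, Theorem 2.1.5], we now show that this implies
\[
f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2L_1} \|\nabla f(y) - \nabla f(x)\|_\infty^2, \tag{A.8}
\]
and subsequently that
\[
\begin{aligned}
\|\nabla f(x^k)\|_\infty &= \|\nabla f(x^k) - \nabla f(x^*)\|_\infty \\
&\leq \sqrt{2L_1 (f(x^k) - f(x^*))} \\
&\leq \sqrt{2L_1 (f(x^0) - f(x^*))},
\end{aligned} \tag{A.9}
\]
where we have used that $f(x^k) \leq f(x^{k-1})$ for all $k$ and any choice of $i_{k-1}$ (this follows from the basic bound on the progress of coordinate descent methods).

We first show that $\nabla f$ being $L_1$-Lipschitz continuous in the 1-norm implies that
\[
f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L_1}{2} \|y - x\|_1^2,
\]
for all $x$ and $y$. Consider the function $g(\tau) = f(x + \tau(y - x))$ with $\tau \in \mathbb{R}$. Then
\[
\begin{aligned}
f(y) - f(x) - \langle \nabla f(x), y - x \rangle
&= g(1) - g(0) - \langle \nabla f(x), y - x \rangle \\
&= \int_0^1 \left[ \frac{dg}{d\tau}(\tau) - \langle \nabla f(x), y - x \rangle \right] d\tau \\
&= \int_0^1 \left[ \langle \nabla f(x + \tau(y - x)), y - x \rangle - \langle \nabla f(x), y - x \rangle \right] d\tau \\
&= \int_0^1 \langle \nabla f(x + \tau(y - x)) - \nabla f(x), y - x \rangle \, d\tau \\
&\leq \int_0^1 \|\nabla f(x + \tau(y - x)) - \nabla f(x)\|_\infty \|y - x\|_1 \, d\tau \\
&\leq \int_0^1 L_1 \tau \|y - x\|_1^2 \, d\tau \\
&= \left. \frac{L_1}{2} \tau^2 \|y - x\|_1^2 \right|_0^1 \\
&= \frac{L_1}{2} \|y - x\|_1^2.
\end{aligned}
\]
To subsequently show (A.8), fix $x \in \mathbb{R}^n$ and consider the function
\[
\phi(y) = f(y) - \langle \nabla f(x), y \rangle,
\]
which is convex on $\mathbb{R}^n$ and also has an $L_1$-Lipschitz continuous gradient in the 1-norm, as
\[
\begin{aligned}
\|\phi'(y) - \phi'(x)\|_\infty &= \|(\nabla f(y) - \nabla f(x)) - (\nabla f(x) - \nabla f(x))\|_\infty \\
&= \|\nabla f(y) - \nabla f(x)\|_\infty \\
&\leq L_1 \|y - x\|_1.
\end{aligned}
\]
As the minimizer of $\phi$ is $x$ (i.e., $\phi'(x) = 0$), for any $y \in \mathbb{R}^n$ we have
\[
\begin{aligned}
\phi(x) = \min_v \phi(v) &\leq \min_v \left\{ \phi(y) + \langle \phi'(y), v - y \rangle + \frac{L_1}{2} \|v - y\|_1^2 \right\} \\
&= \phi(y) - \sup_v \left\{ \langle -\phi'(y), v - y \rangle - \frac{L_1}{2} \|v - y\|_1^2 \right\} \\
&= \phi(y) - \frac{1}{2L_1} \|\phi'(y)\|_\infty^2.
\end{aligned}
\]
Substituting in the definition of $\phi$, we have
\[
\begin{aligned}
& f(x) - \langle \nabla f(x), x \rangle \leq f(y) - \langle \nabla f(x), y \rangle - \frac{1}{2L_1} \|\nabla f(y) - \nabla f(x)\|_\infty^2 \\
\iff\; & f(x) \leq f(y) + \langle \nabla f(x), x - y \rangle - \frac{1}{2L_1} \|\nabla f(y) - \nabla f(x)\|_\infty^2 \\
\iff\; & f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2L_1} \|\nabla f(y) - \nabla f(x)\|_\infty^2.
\end{aligned}
\]

A.6.2 Additive Error Bound in Terms of L1

Using (A.9) in (A.7) and noting that $\epsilon_k \geq 0$, we obtain
\[
\begin{aligned}
f(x^{k+1}) &\leq f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|_\infty^2 + \frac{\epsilon_k}{L} \|\nabla f(x^k)\|_\infty - \frac{\epsilon_k^2}{2L} \\
&\leq f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|_\infty^2 + \frac{\epsilon_k}{L} \sqrt{2L_1 (f(x^0) - f(x^*))} - \frac{\epsilon_k^2}{2L} \\
&\leq f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|_\infty^2 + \epsilon_k \frac{\sqrt{2L_1}}{L} \sqrt{f(x^0) - f(x^*)}.
\end{aligned}
\]
Applying strong convexity (taken with respect to the 1-norm), we get
\[
f(x^{k+1}) - f(x^*) \leq \left( 1 - \frac{\mu_1}{L} \right) [f(x^k) - f(x^*)] + \epsilon_k \frac{\sqrt{2L_1}}{L} \sqrt{f(x^0) - f(x^*)},
\]
which implies
\[
\begin{aligned}
f(x^{k+1}) - f(x^*) &\leq \left( 1 - \frac{\mu_1}{L} \right)^k [f(x^0) - f(x^*)] + \sum_{i=1}^k \left( 1 - \frac{\mu_1}{L} \right)^{k-i} \epsilon_i \frac{\sqrt{2L_1}}{L} \sqrt{f(x^0) - f(x^*)} \\
&= \left( 1 - \frac{\mu_1}{L} \right)^k \left[ f(x^0) - f(x^*) + \sqrt{f(x^0) - f(x^*)}\, A_k \right],
\end{aligned}
\]
where
\[
A_k = \frac{\sqrt{2L_1}}{L} \sum_{i=1}^k \left( 1 - \frac{\mu_1}{L} \right)^{-i} \epsilon_i.
\]

A.6.3 Additive Error Bound in Terms of L

By our additive error inequality, we have
\[
|\nabla_{i_k} f(x^k)| + \epsilon_k \geq \|\nabla f(x^k)\|_\infty.
\]
Using this again in (A.7) we get
\[
\begin{aligned}
f(x^{k+1}) &\leq f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|_\infty^2 + \frac{\epsilon_k}{L} \|\nabla f(x^k)\|_\infty - \frac{\epsilon_k^2}{2L} \\
&\leq f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|_\infty^2 + \frac{\epsilon_k}{L} \left( |\nabla_{i_k} f(x^k)| + \epsilon_k \right) - \frac{\epsilon_k^2}{2L} \\
&= f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|_\infty^2 + \frac{\epsilon_k}{L} |\nabla_{i_k} f(x^k)| + \frac{\epsilon_k^2}{2L}.
\end{aligned}
\]
Further, from our basic progress bound that holds for any $i_k$ we have
\[
f(x^*) \leq f(x^{k+1}) \leq f(x^k) - \frac{1}{2L} [\nabla_{i_k} f(x^k)]^2 \leq f(x^0) - \frac{1}{2L} [\nabla_{i_k} f(x^k)]^2,
\]
which implies
\[
|\nabla_{i_k} f(x^k)| \leq \sqrt{2L (f(x^0) - f(x^*))},
\]
and thus that
\[
\begin{aligned}
f(x^{k+1}) &\leq f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|_\infty^2 + \frac{\epsilon_k}{L} \sqrt{2L (f(x^0) - f(x^*))} + \frac{\epsilon_k^2}{2L} \\
&= f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|_\infty^2 + \epsilon_k \frac{\sqrt{2}}{\sqrt{L}} \sqrt{f(x^0) - f(x^*)} + \frac{\epsilon_k^2}{2L}.
\end{aligned}
\]
Applying strong convexity and applying the inequality recursively we obtain
\[
\begin{aligned}
f(x^{k+1}) - f(x^*) &\leq \left( 1 - \frac{\mu_1}{L} \right)^k [f(x^0) - f(x^*)] + \sum_{i=1}^k \left( 1 - \frac{\mu_1}{L} \right)^{k-i} \left( \epsilon_i \frac{\sqrt{2}}{\sqrt{L}} \sqrt{f(x^0) - f(x^*)} + \frac{\epsilon_i^2}{2L} \right) \\
&= \left( 1 - \frac{\mu_1}{L} \right)^k \left[ f(x^0) - f(x^*) + A_k \right],
\end{aligned}
\]
where
\[
A_k = \sum_{i=1}^k \left( 1 - \frac{\mu_1}{L} \right)^{-i} \left( \frac{\sqrt{2}}{\sqrt{L}} \epsilon_i \sqrt{f(x^0) - f(x^*)} + \frac{\epsilon_i^2}{2L} \right).
\]
Although uglier than the expression depending on $L_1$, this expression will tend to be smaller unless $\epsilon_k$ is not small.

A.7 Convergence Analysis of GS-s, GS-r, and GS-q Rules

In this section, we consider problems of the form
\[
\min_{x \in \mathbb{R}^n} F(x) = f(x) + g(x) = f(x) + \sum_{i=1}^n g_i(x_i),
\]
where $f$ satisfies our usual assumptions, but the $g_i$ can be non-smooth. We first introduce some notation that will be needed to state our result for the GS-q rule, followed by stating the result and then showing that it holds in two parts.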
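We then turn to showing that the rate cannot hold in general for the GS-s and GS-r rules.

A.7.1 Notation and Basic Inequality

To analyze this case, an important inequality we will use is that the $L$-Lipschitz-continuity of $\nabla_i f$ implies that for all $x$, $i$, and $d$ that
\[
\begin{aligned}
F(x + d e_i) &= f(x + d e_i) + g(x + d e_i) \\
&\leq f(x) + \langle \nabla f(x), d e_i \rangle + \frac{L}{2} d^2 + g(x + d e_i) \\
&= f(x) + g(x) + \langle \nabla f(x), d e_i \rangle + \frac{L}{2} d^2 + g_i(x_i + d) - g_i(x_i) \\
&= F(x) + V_i(x, d),
\end{aligned} \tag{A.10}
\]
where
\[
V_i(x, d) \equiv \langle \nabla f(x), d e_i \rangle + \frac{L}{2} d^2 + g_i(x_i + d) - g_i(x_i).
\]
Notice that the GS-q rule is defined by
\[
i_k \in \operatorname*{argmin}_i \left\{ \min_d V_i(x^k, d) \right\}.
\]
We use the notation $d_i^k \in \operatorname*{argmin}_d V_i(x^k, d)$ and we will use $d^k$ to denote the vector containing these values for all $i$. When using the GS-q rule, the iteration is defined by
\[
\begin{aligned}
x^{k+1} &= x^k + d_{i_k}^k e_{i_k} \\
&= x^k + \operatorname*{argmin}_d \{ V_{i_k}(x^k, d) \}\, e_{i_k}.
\end{aligned} \tag{A.11}
\]
In this notation the GS-r rule is given by
\[
j_k \in \operatorname*{argmax}_i |d_i^k|.
\]
We will use the notation $x_+^k$ to be the step that would be taken at $x^k$ if we update coordinate $j_k$ according to the GS-r rule,
\[
x_+^k = x^k + d_{j_k}^k e_{j_k}.
\]
From the optimality of $d_i^k$, we have for any $i$ that
\[
L \left[ \left( x_i^k - \frac{1}{L} \nabla_i f(x^k) \right) - (x_i^k + d_i^k) \right] \in \partial g_i(x_i^k + d_i^k), \tag{A.12}
\]
that is, $-\nabla_i f(x^k) - L d_i^k \in \partial g_i(x_i^k + d_i^k)$, and we will use the notation $s_j^k$ for the unique element of $\partial g_j(x_j^k + d_j^k)$ satisfying this relationship. We use $s^k$ to denote the vector containing these values.

A.7.2 Convergence Bound for GS-q Rule

Under this notation, we can show that coordinate descent with the GS-q rule satisfies the bound
\[
F(x^{k+1}) - F(x^*) \leq \min \left\{ \left( 1 - \frac{\mu}{Ln} \right) [F(x^k) - F(x^*)], \; \left( 1 - \frac{\mu_1}{L} \right) [F(x^k) - F(x^*)] + \epsilon_k \right\}, \tag{A.13}
\]
where
\[
\epsilon_k \leq \frac{\mu_1}{L} \left( g(x_+^k) - g(x^k + d^k) + \langle s^k, (x^k + d^k) - x_+^k \rangle \right).
\]
We note that if $g$ is linear then $\epsilon_k = 0$ and this convergence rate reduces to
\[
F(x^{k+1}) - F(x^*) \leq \left( 1 - \frac{\mu_1}{L} \right) [F(x^k) - F(x^*)].
\]
Otherwise, $\epsilon_k$ depends on how far $g(x_+^k)$ lies above a particular linear underestimate extending from $(x^k + d^k)$, as well as the conditioning of $f$.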
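For intuition, the quantities $d_i^k$, $V_i$, and the GS-q and GS-r selections of Section A.7.1 can be computed in closed form when $g_i = \lambda|\cdot|$. The following is only a minimal sketch under that assumption; the function names (`soft_threshold`, `steps`, `V`, `gs_q`, `gs_r`) are hypothetical and not part of the thesis, indices are zero-based, and the data are the diagonal $\ell_1$-regularized instance of Section A.7.6.

```python
def soft_threshold(v, t):
    """Proximal operator of t*|.| evaluated at the scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def steps(grad, x, L, lam):
    """d_i in argmin_d V_i(x, d) for every i, assuming g_i = lam*|.|."""
    return [soft_threshold(xi - gi / L, lam / L) - xi for gi, xi in zip(grad, x)]


def V(grad_i, x_i, d, L, lam):
    """Model V_i(x, d) from (A.10), specialised to g_i = lam*|.|."""
    return grad_i * d + 0.5 * L * d * d + lam * abs(x_i + d) - lam * abs(x_i)


def gs_q(grad, x, L, lam):
    """GS-q: index whose optimal step gives the smallest value of V_i."""
    d = steps(grad, x, L, lam)
    return min(range(len(x)), key=lambda i: V(grad[i], x[i], d[i], L, lam))


def gs_r(grad, x, L, lam):
    """GS-r: index whose optimal step has the largest magnitude."""
    d = steps(grad, x, L, lam)
    return max(range(len(x)), key=lambda i: abs(d[i]))


# Instance of Section A.7.6: A = diag(1, 0.7), b = (2, -1), lam = 1, L = 1.
a, b, x0 = [1.0, 0.7], [2.0, -1.0], [0.4, 0.5]
grad0 = [ai * (ai * xi - bi) for ai, xi, bi in zip(a, x0, b)]
d0 = steps(grad0, x0, 1.0, 1.0)
print(d0, gs_r(grad0, x0, 1.0, 1.0), gs_q(grad0, x0, 1.0, 1.0))
```

On this instance the two rules disagree: GS-r selects the first coordinate (index 0) while GS-q selects the second (index 1), which is the situation exploited in Section A.7.6.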
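We show this result by first showing that the GS-q rule makes at least as much progress as randomized selection (first part of the min), and then showing that the GS-q rule also makes at least as much progress as the GS-r rule (second part of the min).

A.7.3 GS-q is at Least as Fast as Random

Our argument in this section follows a similar approach to Richtárik and Takáč [2014]. In particular, combining (A.10) and (A.11) we have the following upper bound on the iteration progress
\[
\begin{aligned}
F(x^{k+1}) &\leq F(x^k) + \min_{i \in \{1,2,\dots,n\}} \left\{ \min_{d \in \mathbb{R}} V_i(x^k, d) \right\} \\
&= F(x^k) + \min_{i \in \{1,2,\dots,n\}} \left\{ \min_{y \in \mathbb{R}^n} V_i(x^k, y_i - x_i^k) \right\} \\
&= F(x^k) + \min_{y \in \mathbb{R}^n} \left\{ \min_{i \in \{1,2,\dots,n\}} V_i(x^k, y_i - x_i^k) \right\} \\
&\leq F(x^k) + \min_{y \in \mathbb{R}^n} \left\{ \frac{1}{n} \sum_{i=1}^n V_i(x^k, y_i - x_i^k) \right\} \\
&= F(x^k) + \frac{1}{n} \min_{y \in \mathbb{R}^n} \left\{ \langle \nabla f(x^k), y - x^k \rangle + \frac{L}{2} \|y - x^k\|^2 + g(y) - g(x^k) \right\} \\
&= \left( 1 - \frac{1}{n} \right) F(x^k) + \frac{1}{n} \min_{y \in \mathbb{R}^n} \left\{ f(x^k) + \langle \nabla f(x^k), y - x^k \rangle + \frac{L}{2} \|y - x^k\|^2 + g(y) \right\}.
\end{aligned}
\]
From strong convexity of $f$, we have that $F$ is also $\mu$-strongly convex and that
\[
\begin{aligned}
f(x^k) &\leq f(y) - \langle \nabla f(x^k), y - x^k \rangle - \frac{\mu}{2} \|y - x^k\|^2, \\
F(\alpha x^* + (1 - \alpha) x^k) &\leq \alpha F(x^*) + (1 - \alpha) F(x^k) - \frac{\alpha(1 - \alpha)\mu}{2} \|x^k - x^*\|^2,
\end{aligned}
\]
for any $y \in \mathbb{R}^n$ and any $\alpha \in [0, 1]$ [see Nesterov, 2004, Theorem 2.1.9].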
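Using these gives us
\[
\begin{aligned}
F(x^{k+1}) &\leq \left( 1 - \frac{1}{n} \right) F(x^k) + \frac{1}{n} \min_{y \in \mathbb{R}^n} \left\{ f(y) - \frac{\mu}{2} \|y - x^k\|^2 + \frac{L}{2} \|y - x^k\|^2 + g(y) \right\} \\
&= \left( 1 - \frac{1}{n} \right) F(x^k) + \frac{1}{n} \min_{y \in \mathbb{R}^n} \left\{ F(y) + \frac{L - \mu}{2} \|y - x^k\|^2 \right\} \\
&\leq \left( 1 - \frac{1}{n} \right) F(x^k) + \frac{1}{n} \min_{\alpha \in [0,1]} \left\{ F(\alpha x^* + (1 - \alpha) x^k) + \frac{\alpha^2 (L - \mu)}{2} \|x^k - x^*\|^2 \right\} \\
&\leq \left( 1 - \frac{1}{n} \right) F(x^k) + \frac{1}{n} \min_{\alpha \in [0,1]} \left\{ \alpha F(x^*) + (1 - \alpha) F(x^k) + \frac{\alpha^2 (L - \mu) - \alpha(1 - \alpha)\mu}{2} \|x^k - x^*\|^2 \right\} \\
&\leq \left( 1 - \frac{1}{n} \right) F(x^k) + \frac{1}{n} \left[ \alpha^* F(x^*) + (1 - \alpha^*) F(x^k) \right] \quad \left( \text{with } \alpha^* = \frac{\mu}{L} \in (0, 1] \right) \\
&= \left( 1 - \frac{1}{n} \right) F(x^k) + \frac{\alpha^*}{n} F(x^*) + \frac{(1 - \alpha^*)}{n} F(x^k) \\
&= F(x^k) - \frac{\alpha^*}{n} [F(x^k) - F(x^*)].
\end{aligned}
\]
Subtracting $F(x^*)$ from both sides of this inequality gives us
\[
F(x^{k+1}) - F(x^*) \leq \left( 1 - \frac{\mu}{nL} \right) [F(x^k) - F(x^*)].
\]

A.7.4 GS-q is at Least as Fast as GS-r

In this section we derive the right side of the bound (A.13) for the GS-r rule, but note it also applies to the GS-q rule because from (A.10) and (A.11) we have
\[
\begin{aligned}
F(x^{k+1}) &\leq F(x^k) + \min_i V_i(x^k, d_i^k) && \text{(GS-q rule)} \\
&\leq F(x^k) + V_{j_k}(x^k, d_{j_k}^k) && \text{($j_k$ selected by the GS-r rule).}
\end{aligned}
\]
Note that we lose progress by considering a bound based on the GS-r rule, but its connection to the $\infty$-norm will make it easier to derive an upper bound.

By the convexity of $g_{j_k}$ we have
\[
\begin{aligned}
g_{j_k}(x_{j_k}^k) &\geq g_{j_k}(x_{j_k}^k + d_{j_k}^k) + s_{j_k}^k \left( x_{j_k}^k - (x_{j_k}^k + d_{j_k}^k) \right) \\
&= g_{j_k}(x_{j_k}^k + d_{j_k}^k) - \left( -L d_{j_k}^k - \nabla_{j_k} f(x^k) \right) (d_{j_k}^k) \\
&= g_{j_k}(x_{j_k}^k + d_{j_k}^k) + \nabla_{j_k} f(x^k)\, d_{j_k}^k + L (d_{j_k}^k)^2,
\end{aligned}
\]
where $s_{j_k}^k$ is defined by (A.12). Using this we have that
\[
\begin{aligned}
F(x^{k+1}) &\leq F(x^k) + V_{j_k}(x^k, d_{j_k}^k) \\
&= F(x^k) + \nabla_{j_k} f(x^k)(d_{j_k}^k) + \frac{L}{2} (d_{j_k}^k)^2 + g_{j_k}(x_{j_k}^k + d_{j_k}^k) - g_{j_k}(x_{j_k}^k) \\
&\leq F(x^k) + \nabla_{j_k} f(x^k)(d_{j_k}^k) + \frac{L}{2} (d_{j_k}^k)^2 - \nabla_{j_k} f(x^k)\, d_{j_k}^k - L (d_{j_k}^k)^2 \\
&= F(x^k) - \frac{L}{2} (d_{j_k}^k)^2.
\end{aligned}
\]
Adding and subtracting $F(x^*)$ and noting that $j_k$ is selected using the GS-r rule, we obtain the upper bound
\[
F(x^{k+1}) - F(x^*) \leq F(x^k) - F(x^*) - \frac{L}{2} \|d^k\|_\infty^2. \tag{A.14}
\]
Recall that we use $x_+^k$ to denote the iteration that would result if we chose $j_k$ and actually performed the GS-r update.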
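Using the Lipschitz continuity of the gradient and definition of the GS-q rule again, we have
\[
\begin{aligned}
F(x^{k+1}) &\leq F(x^k) + \nabla f(x^k)^T (x^{k+1} - x^k) + \frac{L}{2} \|x^{k+1} - x^k\|^2 + g(x^{k+1}) - g(x^k) \\
&\leq F(x^k) + \nabla f(x^k)^T (x_+^k - x^k) + \frac{L}{2} \|x_+^k - x^k\|^2 + g(x_+^k) - g(x^k) \\
&= f(x^k) + \nabla f(x^k)^T (x_+^k - x^k) + \frac{L}{2} \|d^k\|_\infty^2 + g(x_+^k).
\end{aligned}
\]
By the strong convexity of $f$, for any $y \in \mathbb{R}^n$ we have
\[
f(x^k) \leq f(y) - \nabla f(x^k)^T (y - x^k) - \frac{\mu_1}{2} \|y - x^k\|_1^2,
\]
and using this we obtain
\[
F(x^{k+1}) \leq f(y) + \nabla f(x^k)^T (x_+^k - y) - \frac{\mu_1}{2} \|y - x^k\|_1^2 + \frac{L}{2} \|d^k\|_\infty^2 + g(x_+^k). \tag{A.15}
\]
By the convexity of $g$ and $s^k \in \partial g(x^k + d^k)$, we have
\[
g(y) \geq g(x^k + d^k) + \langle s^k, y - (x^k + d^k) \rangle.
\]
Combining (A.15) with the above inequality, we have
\[
\begin{aligned}
F(x^{k+1}) - F(y) \leq{}& \langle \nabla f(x^k), x_+^k - y \rangle - \frac{\mu_1}{2} \|y - x^k\|_1^2 + \frac{L}{2} \|d^k\|_\infty^2 \\
&+ g(x_+^k) - g(x^k + d^k) + \langle s^k, (x^k + d^k) - y \rangle.
\end{aligned}
\]
We add and subtract $\langle s^k, x_+^k \rangle$ on the right-hand side to get
\[
\begin{aligned}
F(x^{k+1}) - F(y) \leq{}& \langle \nabla f(x^k) + s^k, x_+^k - y \rangle - \frac{\mu_1}{2} \|y - x^k\|_1^2 + \frac{L}{2} \|d^k\|_\infty^2 \\
&+ g(x_+^k) - g(x^k + d^k) + \langle s^k, (x^k + d^k) - x_+^k \rangle.
\end{aligned}
\]
Let $c_k = g(x_+^k) - g(x^k + d^k) + \langle s^k, (x^k + d^k) - x_+^k \rangle$, which is non-negative by the convexity of $g$. Making this substitution and using (A.12), which gives $\nabla f(x^k) + s^k = -L d^k$, we have
\[
F(y) \geq F(x^{k+1}) + \langle -L d^k, y - x_+^k \rangle + \frac{\mu_1}{2} \|y - x^k\|_1^2 - \frac{L}{2} \|d^k\|_\infty^2 - c_k.
\]
Now add and subtract $\langle -L d^k, x^k \rangle$ on the right-hand side to get
\[
F(y) \geq F(x^{k+1}) + \langle -L d^k, y - x^k \rangle + \frac{\mu_1}{2} \|y - x^k\|_1^2 - \frac{L}{2} \|d^k\|_\infty^2 - L \langle d^k, x^k - x_+^k \rangle - c_k.
\]
Minimizing both sides with respect to $y$ results in
\[
\begin{aligned}
F(x^*) &\geq F(x^{k+1}) - \frac{L^2}{2\mu_1} \|d^k\|_\infty^2 - \frac{L}{2} \|d^k\|_\infty^2 - L \langle d^k, x^k - x_+^k \rangle - c_k \\
&\geq F(x^{k+1}) - \frac{L^2}{2\mu_1} \|d^k\|_\infty^2 - \frac{L}{2} \|d^k\|_\infty^2 + L \|d^k\|_\infty^2 - c_k \\
&= F(x^{k+1}) - \frac{L(L - \mu_1)}{2\mu_1} \|d^k\|_\infty^2 - c_k,
\end{aligned}
\]
where we have used that $x_+^k = x^k + d_{j_k}^k e_{j_k}$ and $|d_{j_k}^k| = \|d^k\|_\infty$. Combining this with equation (A.14), we get
\[
\begin{aligned}
F(x^{k+1}) - F(x^*) &\leq F(x^k) - F(x^*) - \frac{L}{2} \|d^k\|_\infty^2 \\
&\leq F(x^k) - F(x^*) - \frac{\mu_1}{(L - \mu_1)} \left[ F(x^{k+1}) - F(x^*) - c_k \right],
\end{aligned}
\]
and with some manipulation and simplification, we have
\[
\begin{aligned}
\left( 1 + \frac{\mu_1}{(L - \mu_1)} \right) [F(x^{k+1}) - F(x^*)] &\leq F(x^k) - F(x^*) + c_k \frac{\mu_1}{(L - \mu_1)} \\
F(x^{k+1}) - F(x^*) &\leq \frac{(L - \mu_1)}{L} [F(x^k) - F(x^*)] + c_k \frac{\mu_1}{L} \\
F(x^{k+1}) - F(x^*) &\leq \left( 1 - \frac{\mu_1}{L} \right) [F(x^k) - F(x^*)] + c_k \frac{\mu_1}{L}.
\end{aligned}
\]

A.7.5 Lack of Progress of the GS-s Rule

We now show that the rate $(1 - \mu_1/L)$, and even the slower rate $(1 - \mu/(Ln))$, cannot hold for the GS-s rule.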
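We do this by constructing a problem where an iteration of the GS-s method does not make sufficient progress. In particular, consider the bound-constrained problem
\[
\min_{x \in C} f(x) = \frac{1}{2} \|Ax - b\|_2^2,
\]
where $C = \{x : x \geq 0\}$, and
\[
A = \begin{bmatrix} 1 & 0 \\ 0 & 0.7 \end{bmatrix}, \quad
b = \begin{bmatrix} -1 \\ -3 \end{bmatrix}, \quad
x^0 = \begin{bmatrix} 1 \\ 0.1 \end{bmatrix}, \quad
x^* = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.
\]
We thus have that
\[
\begin{aligned}
f(x^0) &= \frac{1}{2} \left( (1 + 1)^2 + (0.07 + 3)^2 \right) \approx 6.7, \\
f(x^*) &= \frac{1}{2} \left( (-1)^2 + (-3)^2 \right) = 5, \\
\nabla f(x^0) &= A^T (Ax^0 - b) \approx \begin{bmatrix} 2.0 \\ 2.1 \end{bmatrix}, \\
\nabla^2 f(x) &= A^T A = \begin{bmatrix} 1 & 0 \\ 0 & 0.49 \end{bmatrix}.
\end{aligned}
\]
The parameter values for this problem are
\[
\begin{aligned}
n &= 2, \\
\mu &= \lambda_{\min} = 0.49, \\
L &= \lambda_{\max} = 1, \\
\mu_1 &= \left( \frac{1}{\lambda_1} + \frac{1}{\lambda_2} \right)^{-1} = \left( 1 + \frac{1}{0.49} \right)^{-1} \approx 0.33,
\end{aligned}
\]
where the $\lambda_i$ are the eigenvalues of $A^T A$, and $\mu$ and $\mu_1$ are the corresponding strong convexity constants for the 2-norm and 1-norm, respectively.

The proximal operator of the indicator function is the projection onto the set $C$, which involves setting negative elements to zero. Thus, our iteration update is given by
\[
x^{k+1} = \mathrm{prox}_{\delta_C} \left[ x^k - \frac{1}{L} \nabla_{i_k} f(x^k) e_{i_k} \right] = \max \left( x^k - \frac{1}{L} \nabla_{i_k} f(x^k) e_{i_k}, \, 0 \right).
\]
For this problem, the GS-s rule is given by
\[
i_k \in \operatorname*{argmax}_i |\eta_i^k|,
\]
where
\[
\eta_i^k = \begin{cases} \nabla_i f(x^k), & \text{if } x_i^k \neq 0 \text{ or } \nabla_i f(x^k) < 0, \\ 0, & \text{otherwise.} \end{cases}
\]
Based on the value of $\nabla f(x^0)$, the GS-s rule thus chooses to update coordinate 2, setting it to zero and obtaining
\[
f(x^1) = \frac{1}{2} \left( (1 + 1)^2 + (-3)^2 \right) = 6.5.
\]
Thus we have
\[
\frac{f(x^1) - f(x^*)}{f(x^0) - f(x^*)} \approx \frac{6.5 - 5}{6.7 - 5} \approx 0.88,
\]
even though the bounds suggest the faster rates of
\[
\left( 1 - \frac{\mu}{Ln} \right) = \left( 1 - \frac{0.49}{2} \right) \approx 0.76, \qquad
\left( 1 - \frac{\mu_1}{L} \right) \approx (1 - 0.33) = 0.67.
\]
Thus, the GS-s rule does not satisfy either bound. On the other hand, the GS-r and GS-q rules are given in this context by
\[
i_k \in \operatorname*{argmax}_i \left| \max \left( x^k - \frac{1}{L} \nabla_i f(x^k) e_i, \, 0 \right) - x^k \right|,
\]
and thus both these rules choose to update coordinate 1, setting it to zero to obtain $f(x^1) \approx 5.2$ and a progress ratio of
\[
\frac{f(x^1) - f(x^*)}{f(x^0) - f(x^*)} \approx \frac{5.2 - 5}{6.7 - 5} \approx 0.12,
\]
which clearly satisfies both bounds.

A.7.6 Lack of Progress of the GS-r Rule

We now turn to showing that the GS-r rule does not satisfy these bounds in general. It will not be possible to show this for a simple bound-constrained problem since the GS-r and GS-q rules are equivalent for these problems.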
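The progress ratios quoted for the bound-constrained example of Section A.7.5 can be reproduced with a few lines of code. The following is a minimal sketch, not the code used in the thesis: all names (`A_DIAG`, `projected_update`, `progress_ratio`, and so on) are hypothetical, indices are zero-based, and the implementation is specialised to a diagonal $A$.

```python
A_DIAG = (1.0, 0.7)   # diagonal entries of A
B = (-1.0, -3.0)      # right-hand side b
L = 1.0               # largest eigenvalue of A^T A


def f(x):
    """Least-squares objective 0.5*||Ax - b||^2 for diagonal A."""
    return 0.5 * sum((a * xi - bi) ** 2 for a, xi, bi in zip(A_DIAG, x, B))


def grad(x):
    """Gradient A^T (Ax - b) for diagonal A."""
    return [a * (a * xi - bi) for a, xi, bi in zip(A_DIAG, x, B)]


def projected_update(x, i):
    """Coordinate gradient step on i followed by projection onto x >= 0."""
    y = list(x)
    y[i] = max(x[i] - grad(x)[i] / L, 0.0)
    return y


def gs_s(x):
    """GS-s: largest |eta_i|, where eta_i is zeroed at an active bound."""
    g = grad(x)
    eta = [gi if (xi != 0 or gi < 0) else 0.0 for gi, xi in zip(g, x)]
    return max(range(len(x)), key=lambda i: abs(eta[i]))


def gs_r(x):
    """GS-r: largest change in the variable after the projected step."""
    return max(range(len(x)), key=lambda i: abs(projected_update(x, i)[i] - x[i]))


def progress_ratio(x, i, f_star):
    """(f(x^1) - f*) / (f(x^0) - f*) when coordinate i is updated."""
    return (f(projected_update(x, i)) - f_star) / (f(x) - f_star)


x0 = [1.0, 0.1]
f_star = f([0.0, 0.0])
ratio_s = progress_ratio(x0, gs_s(x0), f_star)
ratio_r = progress_ratio(x0, gs_r(x0), f_star)
print(gs_s(x0), round(ratio_s, 2), gs_r(x0), round(ratio_r, 2))
```

Under these assumptions the GS-s ratio exceeds both $1 - \mu/(Ln) \approx 0.76$ and $1 - \mu_1/L \approx 0.67$, while the GS-r ratio is well below them, in line with the values reported in Section A.7.5.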
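Thus, we consider the following $\ell_1$-regularized problem
\[
\min_{x \in \mathbb{R}^2} \frac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_1 \equiv F(x).
\]
We use the same $A$ as the previous section, so that $n$, $\mu$, $L$, and $\mu_1$ are the same. However, we now take
\[
b = \begin{bmatrix} 2 \\ -1 \end{bmatrix}, \quad
x^0 = \begin{bmatrix} 0.4 \\ 0.5 \end{bmatrix}, \quad
x^* = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad
\lambda = 1,
\]
so we have
\[
F(x^0) \approx 3.1, \quad F(x^*) = 2.
\]
The proximal operator of the absolute value function is given by the soft-threshold function, and our coordinate update of variable $i_k$ is given by
\[
x_{i_k}^{k+1} = \mathrm{prox}_{\lambda|\cdot|} \left[ x_{i_k}^{k+\frac{1}{2}} \right] = \mathrm{sgn}\left( x_{i_k}^{k+\frac{1}{2}} \right) \cdot \max \left( \left| x_{i_k}^{k+\frac{1}{2}} \right| - \lambda/L, \, 0 \right),
\]
where we have used the notation
\[
x_i^{k+\frac{1}{2}} = x_i^k - \frac{1}{L} \nabla_i f(x^k).
\]
The GS-r rule is defined by
\[
i_k \in \operatorname*{argmax}_i |d_i^k|,
\]
where $d_i^k = \mathrm{prox}_{\lambda|\cdot|} [x_i^{k+\frac{1}{2}}] - x_i^k$ and in this case
\[
d^0 = \begin{bmatrix} 0.6 \\ -0.5 \end{bmatrix}.
\]
Thus, the GS-r rule chooses to update coordinate 1. After this update the function value is
\[
F(x^1) \approx 2.9,
\]
so the progress ratio is
\[
\frac{F(x^1) - F(x^*)}{F(x^0) - F(x^*)} \approx \frac{2.9 - 2}{3.1 - 2} \approx 0.84.
\]
However, the bounds suggest faster progress ratios of
\[
\left( 1 - \frac{\mu}{Ln} \right) \approx 0.76, \qquad \left( 1 - \frac{\mu_1}{L} \right) \approx 0.67,
\]
so the GS-r rule does not satisfy either bound. In contrast, in this setting the GS-q rule chooses to update coordinate 2 and obtains $F(x^1) \approx 2.2$, obtaining a progress ratio of
\[
\frac{F(x^1) - F(x^*)}{F(x^0) - F(x^*)} \approx \frac{2.2 - 2}{3.1 - 2} \approx 0.16,
\]
which satisfies both bounds by a substantial margin. Indeed, we used a genetic algorithm to search for a setting of the parameters of this problem (values of $x^0$, $\lambda$, $b$, and the diagonals of $A$) that would make the GS-q rule not satisfy the bound depending on $\mu_1$, and it easily found counter-examples for the GS-s and GS-r rules but was not able to produce a counter-example for the GS-q rule.

A.8 Proximal Gradient in the ℓ1-Norm

Our argument in this section follows a similar approach to Richtárik and Takáč [2014]. Assuming $L_1$-Lipschitz continuity of $\nabla f$, we have
\[
\begin{aligned}
F(x + d) &= f(x + d) + g(x + d) \\
&\leq f(x) + \langle \nabla f(x), d \rangle + \frac{L_1}{2} \|d\|_1^2 + g(x + d) \\
&= F(x) + \langle \nabla f(x), d \rangle + \frac{L_1}{2} \|d\|_1^2 + g(x + d) - g(x) \\
&= F(x) + V(x, d),
\end{aligned}
\]
where
\[
V(x, d) \equiv \langle \nabla f(x), d \rangle + \frac{L_1}{2} \|d\|_1^2 + g(x + d) - g(x).
\]
Generalizing the GS-q rule defined by Song et al. [2017], we have
\[
\begin{aligned}
d^k &\in \operatorname*{argmin}_{d \in \mathbb{R}^n} V(x^k, d), \\
x^{k+1} &= x^k + \operatorname*{argmin}_{d \in \mathbb{R}^n} V(x^k, d).
\end{aligned}
\]
Plugging this update into the above inequality with a change of variable, we have the following upper bound on the iteration progress
\[
\begin{aligned}
F(x^{k+1}) &\leq F(x^k) + \left\{ \min_{d \in \mathbb{R}^n} V(x^k, d) \right\} \\
&= F(x^k) + \left\{ \min_{y \in \mathbb{R}^n} V(x^k, y - x^k) \right\} \\
&= F(x^k) + \min_{y \in \mathbb{R}^n} \left\{ \langle \nabla f(x^k), y - x^k \rangle + \frac{L_1}{2} \|y - x^k\|_1^2 + g(y) - g(x^k) \right\} \\
&= \min_{y \in \mathbb{R}^n} \left\{ f(x^k) + \langle \nabla f(x^k), y - x^k \rangle + \frac{L_1}{2} \|y - x^k\|_1^2 + g(y) \right\}.
\end{aligned}
\]
By the strong convexity of $f$, we have that $F$ is also $\mu_1$-strongly convex,
\[
\begin{aligned}
f(x^k) &\leq f(y) - \langle \nabla f(x^k), y - x^k \rangle - \frac{\mu_1}{2} \|y - x^k\|_1^2, \\
F(\alpha x^* + (1 - \alpha) x^k) &\leq \alpha F(x^*) + (1 - \alpha) F(x^k) - \frac{\alpha(1 - \alpha)\mu_1}{2} \|x^k - x^*\|_1^2,
\end{aligned}
\]
for any $y \in \mathbb{R}^n$ and $\alpha \in [0, 1]$ [see Nesterov, 2004, Theorem 2.1.9]. Therefore,
\[
\begin{aligned}
F(x^{k+1}) &\leq \min_{y \in \mathbb{R}^n} \left\{ f(y) - \frac{\mu_1}{2} \|y - x^k\|_1^2 + \frac{L_1}{2} \|y - x^k\|_1^2 + g(y) \right\} \\
&= \min_{y \in \mathbb{R}^n} \left\{ F(y) + \frac{(L_1 - \mu_1)}{2} \|y - x^k\|_1^2 \right\} \\
&\leq \min_{\alpha \in [0,1]} \left\{ F(\alpha x^* + (1 - \alpha) x^k) + \frac{\alpha^2 (L_1 - \mu_1)}{2} \|x^k - x^*\|_1^2 \right\} \\
&\leq \min_{\alpha \in [0,1]} \left\{ \alpha F(x^*) + (1 - \alpha) F(x^k) + \frac{\alpha^2 (L_1 - \mu_1) - \alpha(1 - \alpha)\mu_1}{2} \|x^k - x^*\|_1^2 \right\} \\
&\leq \left[ \alpha^* F(x^*) + (1 - \alpha^*) F(x^k) \right] \quad \left( \text{choosing } \alpha^* = \frac{\mu_1}{L_1} \in (0, 1] \right) \\
&= F(x^k) - \alpha^* [F(x^k) - F(x^*)].
\end{aligned}
\]
Subtracting $F(x^*)$ from both sides of this inequality gives us
\[
F(x^{k+1}) - F(x^*) \leq \left( 1 - \frac{\mu_1}{L_1} \right) [F(x^k) - F(x^*)].
\]

Appendix B

Chapter 3 Supplementary Material

B.1 Efficient Calculations for Sparse A

To compute the MR rule efficiently for sparse $A \in \mathbb{R}^{m \times n}$, we need to store and update the residuals $r_i = (a_i^T x^k - b_i)$ for all $i$. If we initialize with $x^0 = 0$, then the initial values of the residuals are simply the corresponding $-b_i$ values. Given the initial residuals, we can construct a max-heap structure on the absolute values of these residuals in $O(m)$ time. The max-heap structure lets us compute the MR rule in $O(1)$ time. After an iteration of the Kaczmarz method, we can update the max-heap efficiently as follows:

For each $j$ where $x_j^{k+1} \neq x_j^k$:
• For each $i$ with $a_{ij} \neq 0$:
– Update $r_i$ using $r_i \leftarrow r_i - a_{ij} x_j^k + a_{ij} x_j^{k+1}$.
– Update max-heap using the new value of $|r_i|$.

The cost of each update to an $r_i$ is $O(1)$ and the cost of each heap update is $O(\log m)$.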
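The bookkeeping above can be sketched in code as follows. This is only an illustration under stated assumptions, not the implementation used for the experiments: the class name `MaxResidualTracker` and its methods are hypothetical, rows of $A$ are assumed to be given as dictionaries mapping column index to value, and $x^0 = 0$. Python's `heapq` is a min-heap without an in-place update, so the sketch stores negated keys and uses lazy deletion (stale entries are discarded when they reach the root) in place of the in-place heap update described in the text; this keeps pushes at logarithmic cost but lets the heap grow beyond $m$ entries.

```python
import heapq


class MaxResidualTracker:
    """Tracks r_i = a_i^T x - b_i and the row with the largest |r_i|."""

    def __init__(self, rows, b):
        # rows[i] is a dict {j: a_ij} holding the non-zeros of row i.
        self.cols = {}
        for i, row in enumerate(rows):
            for j, a_ij in row.items():
                self.cols.setdefault(j, []).append((i, a_ij))
        # With x^0 = 0 the residuals are simply -b_i.
        self.r = [-b_i for b_i in b]
        # heapq is a min-heap, so keys are negated absolute residuals.
        self.heap = [(-abs(r_i), i) for i, r_i in enumerate(self.r)]
        heapq.heapify(self.heap)

    def argmax(self):
        """Return the row selected by the MR rule, skipping stale entries."""
        while True:
            neg_abs, i = self.heap[0]
            if -neg_abs == abs(self.r[i]):
                return i
            heapq.heappop(self.heap)

    def update_coordinate(self, j, old_value, new_value):
        """Apply r_i <- r_i - a_ij*old + a_ij*new for every row touching column j."""
        for i, a_ij in self.cols.get(j, []):
            self.r[i] += a_ij * (new_value - old_value)
            heapq.heappush(self.heap, (-abs(self.r[i]), i))
```

After a Kaczmarz step on the selected row, one would call `update_coordinate` once for each coordinate of $x$ that changed, then read the next row from `argmax`.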
If each row of $A$ has at most $r$ non-zeroes and each column has at most $c$ non-zeroes, then the outer loop is run at most $r$ times while the inner loop is run at most $c$ times for each outer loop iteration. Thus, in the worst case the total cost is $O(cr \log m)$, although it might be much faster if we have particularly sparse rows or columns. Hence, if $c$ and $r$ are sufficiently small, the MR rule is not much more expensive than non-uniform random selection, which costs $O(\log m)$. For the MD rule, the cost is the same except there is an extra one-time cost to pre-compute the row norms $\|a_i\|$.

Now consider the case where $A$ may be dense but each row is orthogonal to all but at most $g$ other rows. In this setting it would be too slow to implement the above update of the residuals, since the cost would be $O(mn \log m)$. Instead, it makes more sense to use the following alternative approach to update the max-heap after we have updated row $i_k$:

For each $i$ that is a neighbour of $i_k$ in the orthogonality graph:
• Compute the residual $r_i = a_i^T x^k - b_i$.
• Update max-heap using the new value of $|r_i|$.

We can find the set of neighbours for each node in constant time by keeping a list of the neighbours of each node. This loop would run at most $g$ times and the cost of each iteration would be $O(n)$ to update the residual and $O(\log m)$ to update the heap. Thus, the cost to track the residuals using this alternative approach would be $O(gn + g \log m)$, or the faster $O(gr + g \log m)$ if each row has at most $r$ non-zeros.

B.2 Randomized and Maximum Residual

In this section, we provide details of the convergence rate derivations for the non-uniform and maximum residual (MR) selection rules.
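All the convergence rates we discuss use the following relationship,
\[
\begin{aligned}
\|x^{k+1} - x^*\|^2 &= \|x^k - x^*\|^2 - \|x^{k+1} - x^k\|^2 \\
&= \|x^k - x^*\|^2 - \left\| \frac{(b_i - a_i^T x^k)}{\|a_i\|^2} \cdot a_i \right\|^2 \\
&= \|x^k - x^*\|^2 - \frac{(a_i^T x^k - b_i)^2}{\|a_i\|^2},
\end{aligned} \tag{B.1}
\]
which is equation (3.6).

Non-Uniform

We first review how the steps discussed by Vishnoi [2013] can be used to derive the convergence rate bound of Strohmer and Vershynin [2009] for non-uniform random selection, when row $i$ is chosen with probability $\|a_i\|^2/\|A\|_F^2$. Taking the expectation of (B.1) with respect to $i$, we have
\[
\begin{aligned}
\mathbb{E}[\|x^{k+1} - x^*\|^2] &= \|x^k - x^*\|^2 - \mathbb{E}\left[ \frac{(a_i^T x^k - b_i)^2}{\|a_i\|^2} \right] \\
&= \|x^k - x^*\|^2 - \sum_{i=1}^m \frac{\|a_i\|^2}{\|A\|_F^2} \frac{(a_i^T(x^k - x^*))^2}{\|a_i\|^2} \\
&= \|x^k - x^*\|^2 - \frac{\|A(x^k - x^*)\|^2}{\|A\|_F^2} \\
&\le \left(1 - \frac{\sigma(A, 2)^2}{\|A\|_F^2}\right)\|x^k - x^*\|^2,
\end{aligned} \tag{B.2}
\]
where $\sigma(A, 2)$ is the Hoffman [1952] constant, which can be defined as the largest value such that for any $x$ that is not a solution to the linear system we have
\[
\sigma(A, 2)\|x - x^*\| \le \|A(x - x^*)\|, \tag{B.3}
\]
where $x^*$ is the projection of $x$ onto the set of solutions $S$. In other words, we can write it as
\[
\sigma(A, 2) := \inf_{x \notin S} \frac{\|A(x - x^*)\|}{\|x - x^*\|}.
\]
Strohmer and Vershynin [2009] consider the special case where $A$ has independent columns, and this result yields their rate in this special case since under this assumption $\sigma(A, 2)$ is given by the $n$th singular value of $A$. For general matrices, $\sigma(A, 2)$ is given by the smallest non-zero singular value of $A$.

Maximum Residual

We use a similar analysis to prove a convergence rate bound for the MR rule,
\[
i_k \in \operatorname*{argmax}_i |a_i^T x^k - b_i|. \tag{B.4}
\]
Assuming that $i$ is selected according to (B.4), then starting from (B.1) we have
\[
\begin{aligned}
\|x^{k+1} - x^*\|^2 &= \|x^k - x^*\|^2 - \max_i \frac{(a_i^T x^k - b_i)^2}{\|a_i\|^2} \\
&\le \|x^k - x^*\|^2 - \frac{1}{\|A\|_{\infty,2}^2} \max_i (a_i^T(x^k - x^*))^2 \\
&= \|x^k - x^*\|^2 - \frac{\|A(x^k - x^*)\|_\infty^2}{\|A\|_{\infty,2}^2} \\
&\le \left(1 - \frac{\sigma(A, \infty)^2}{\|A\|_{\infty,2}^2}\right)\|x^k - x^*\|^2,
\end{aligned} \tag{B.5}
\]
where $\|A\|_{\infty,2}^2 := \max_i \{\|a_i\|^2\}$ and $\sigma(A, \infty)$ is the largest value such that
\[
\sigma(A, \infty)\|x - x^*\| \le \|A(x - x^*)\|_\infty, \tag{B.6}
\]
or equivalently
\[
\sigma(A, \infty) := \inf_{x \notin S} \frac{\|A(x - x^*)\|_\infty}{\|x - x^*\|}.
\]
The existence of such a Hoffman-like constant follows from the existence of the Hoffman constant and the equivalence between norms. Applying the norm equivalence $\|\cdot\|_\infty \ge \frac{1}{\sqrt{m}}\|\cdot\|$ to equation (B.3) we have
\[
\sigma(A, 2)\|x - x^*\| \le \|A(x - x^*)\| \le \sqrt{m}\,\|A(x - x^*)\|_\infty,
\]
which implies that $\sigma(A, 2)/\sqrt{m} \le \sigma(A, \infty)$. Similarly, applying $\|\cdot\|_\infty \le \|\cdot\|$ to (B.6) we have
\[
\sigma(A, \infty)\|x - x^*\| \le \|A(x - x^*)\|_\infty \le \|A(x - x^*)\|,
\]
which implies that $\sigma(A, \infty)$ cannot be larger than $\sigma(A, 2)$. Thus, $\sigma(A, \infty)$ satisfies the relationship
\[
\frac{\sigma(A, 2)}{\sqrt{m}} \le \sigma(A, \infty) \le \sigma(A, 2). \tag{B.7}
\]

B.3 Tighter Uniform and MR Analysis

To avoid using the inequality $\|a_i\| \le \|A\|_{\infty,2}$ for all $i$, we want to 'absorb' the individual row norms into the bound. We start with uniform selection.

Uniform

Consider the diagonal matrix $D = \mathrm{diag}(\|a_1\|_2, \|a_2\|_2, \ldots, \|a_m\|_2)$. By taking the expectation of (B.1), we have
\[
\begin{aligned}
\mathbb{E}[\|x^{k+1} - x^*\|^2] &= \|x^k - x^*\|^2 - \mathbb{E}\left[ \frac{(a_i^T x^k - b_i)^2}{\|a_i\|^2} \right] \\
&= \|x^k - x^*\|^2 - \sum_{i=1}^m \frac{1}{m} \frac{(a_i^T x^k - b_i)^2}{\|a_i\|^2} \\
&= \|x^k - x^*\|^2 - \frac{1}{m} \sum_{i=1}^m \left( \left[ \frac{a_i}{\|a_i\|} \right]^T (x^k - x^*) \right)^2 \\
&= \|x^k - x^*\|^2 - \frac{\|D^{-1}A(x^k - x^*)\|^2}{m} \\
&\le \left(1 - \frac{\sigma(\bar{A}, 2)^2}{m}\right)\|x^k - x^*\|^2,
\end{aligned} \tag{B.8}
\]
where recall that $\bar{A} = D^{-1}A$, and we used that $Ax = b$ and $D^{-1}Ax = D^{-1}b$ have the same solution set.

Maximum Residual

For the tighter analysis of the MR rule we do not want to alter the selection rule. Thus, we first evaluate the MR rule and then divide by the corresponding $\|a_{i_k}\|^2$ for the selected $i_k$ at iteration $k$. Starting from (B.1), this gives us
\[
\begin{aligned}
\|x^{k+1} - x^*\|^2 &= \|x^k - x^*\|^2 - \frac{1}{\|a_{i_k}\|^2} \max_i (a_i^T(x^k - x^*))^2 \\
&= \|x^k - x^*\|^2 - \frac{\|A(x^k - x^*)\|_\infty^2}{\|a_{i_k}\|^2} \\
&\le \left(1 - \frac{\sigma(A, \infty)^2}{\|a_{i_k}\|^2}\right)\|x^k - x^*\|^2.
\end{aligned} \tag{B.9}
\]
Applying this recursively over all $k$ iterations yields the rate
\[
\|x^k - x^*\|^2 \le \prod_{j=1}^k \left(1 - \frac{\sigma(A, \infty)^2}{\|a_{i_j}\|^2}\right)\|x^0 - x^*\|^2. \tag{B.10}
\]

B.4 Maximum Distance Rule

If we can only perform one iteration of the Kaczmarz method, the optimal rule with respect to iterate progress is the maximum distance (MD) rule,
\[
i_k \in \operatorname*{argmax}_i \left| \frac{a_i^T x^k - b_i}{\|a_i\|} \right|. \tag{B.11}
\]
Starting again from (B.1) and using $D$ as defined in the tight analysis for the U rule, we have
\[
\begin{aligned}
\|x^{k+1} - x^*\|^2 &= \|x^k - x^*\|^2 - \max_i \left( \frac{a_i^T x^k - b_i}{\|a_i\|} \right)^2 \\
&= \|x^k - x^*\|^2 - \max_i \left( \left[ \frac{a_i}{\|a_i\|} \right]^T (x^k - x^*) \right)^2 \\
&= \|x^k - x^*\|^2 - \|D^{-1}A(x^k - x^*)\|_\infty^2 \\
&\le \left(1 - \sigma(\bar{A}, \infty)^2\right)\|x^k - x^*\|^2.
\end{aligned} \tag{B.12}
\]
We now show that
\[
\max\left\{ \frac{\sigma(\bar{A}, 2)}{\sqrt{m}},\ \frac{\sigma(A, 2)}{\|A\|_F},\ \frac{\sigma(A, \infty)}{\|A\|_{\infty,2}} \right\} \le \sigma(\bar{A}, \infty) \le \sigma(\bar{A}, 2). \tag{B.13}
\]
To derive the upper bound on $\sigma(\bar{A}, \infty)$, and to derive the lower bound in terms of $\sigma(\bar{A}, 2)$, we can use norm equivalence arguments as we did for $\sigma(A, \infty)$. This yields
\[
\frac{\sigma(\bar{A}, 2)}{\sqrt{m}} \le \sigma(\bar{A}, \infty) \le \sigma(\bar{A}, 2).
\]
The last argument in the maximum in (B.13), corresponding to the MR$_\infty$ rate, holds because $\|A\|_{\infty,2} \ge \|a_i\|$ for all $i$, so we have
\[
\begin{aligned}
\frac{\sigma(A, \infty)}{\|A\|_{\infty,2}}\|x - x^*\| &\le \frac{\|A(x - x^*)\|_\infty}{\|A\|_{\infty,2}} \\
&= \max_i \left\{ \frac{|a_i^T(x - x^*)|}{\|A\|_{\infty,2}} \right\} \\
&\le \max_i \left\{ \frac{|a_i^T(x - x^*)|}{\|a_i\|} \right\} \\
&= \|\bar{A}(x - x^*)\|_\infty.
\end{aligned}
\]
For the second argument in the maximum in (B.13), the NU rate, we have
\[
\begin{aligned}
\frac{\sigma(A, 2)^2}{\|A\|_F^2}\|x - x^*\|^2 &\le \frac{\|A(x - x^*)\|^2}{\|A\|_F^2} \\
&= \frac{\sum_i (a_i^T(x - x^*))^2}{\sum_i \|a_i\|^2} \\
&\le \max_i \left\{ \frac{(a_i^T(x - x^*))^2}{\|a_i\|^2} \right\} \\
&= \|\bar{A}(x - x^*)\|_\infty^2.
\end{aligned}
\]
The second inequality is true by noting that it is equivalent to the inequality
\[
1 \le \max_i \left\{ \frac{(a_i^T(x - x^*))^2 / \sum_j (a_j^T(x - x^*))^2}{\|a_i\|^2 / \sum_j \|a_j\|^2} \right\},
\]
and this is true because the maximum ratio between two probability mass functions must be at least 1,
\[
1 \le \max_i \frac{p_i / \sum_j p_j}{q_i / \sum_j q_j}, \quad \text{with all } p_i \ge 0,\ q_i \ge 0.
\]
Finally, we note that the MD rule obtains the tightest bound in terms of performing one step.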
This follows from (B.1),
\[
\|x^{k+1} - x^*\|^2 = \|x^k - x^*\|^2 - \|x^{k+1} - x^k\|^2 = \|x^k - x^*\|^2 - \frac{(a_i^T x^k - b_i)^2}{\|a_i\|^2},
\]
and noting that the MD rule maximizes $\|x^{k+1} - x^k\|$ and thus it maximizes how much smaller $\|x^{k+1} - x^*\|$ is than $\|x^k - x^*\|$.

B.5 Kaczmarz and Coordinate Descent

Consider the Kaczmarz update:
\[
x^{k+1} = x^k - \frac{(a_i^T x^k - b_i)}{\|a_i\|^2} a_i.
\]
This update is equivalent to one step of coordinate descent (CD) with step length $1/\|a_i\|^2$ applied to the dual problem,
\[
\min_y \frac{1}{2}\|A^T y\|^2 - b^T y, \tag{B.14}
\]
see Wright [2015]. Using the primal-dual relationship $A^T y = x$, we can show the relationship between the greedy Kaczmarz selection rules and applying greedy coordinate descent rules to this dual problem. Consider the gradient of the dual problem,
\[
\nabla f(y) = AA^T y - b.
\]
The Gauss-Southwell (GS) rule for CD on the dual problem is equivalent to the MR rule for Kaczmarz on the primal problem since
\[
i_k \in \underbrace{\operatorname*{argmax}_i |\nabla_i f(y^k)|}_{\text{Gauss-Southwell rule}} \equiv \operatorname*{argmax}_i |a_i^T(A^T y^k) - b_i| \equiv \underbrace{\operatorname*{argmax}_i |a_i^T x^k - b_i|}_{\text{Maximum residual rule}},
\]
where $a_i^T$ is the $i$th row of $A$. Similarly, the Gauss-Southwell-Lipschitz (GSL) rule applied to the dual is equivalent to applying a Kaczmarz iteration with the MD rule,
\[
i_k \in \underbrace{\operatorname*{argmax}_i \frac{|\nabla_i f(y^k)|}{\sqrt{L_i}}}_{\text{Gauss-Southwell-Lipschitz rule}} \equiv \operatorname*{argmax}_i \frac{|a_i^T(A^T y^k) - b_i|}{\|a_i\|} \equiv \underbrace{\operatorname*{argmax}_i \left| \frac{a_i^T x^k - b_i}{\|a_i\|} \right|}_{\text{Maximum distance rule}},
\]
as the Lipschitz constants for the dual problem are $L_i = \|a_i\|^2$.

Figure B.1 shows the results of running Kaczmarz compared to using CD (on the least-squares primal problem) for our 3 datasets from Section 3.9. In this figure we measure the performance in terms of the number of "effective passes" through the data (one "effective" pass would be the number of iterations needed for the cyclic variant of the algorithm to visit the entire dataset). In the first experiment Kaczmarz and CD methods perform similarly, while Kaczmarz methods work better in the second experiment and CD methods work better in the third experiment.

B.6 Example: Diagonal A

Consider a square diagonal matrix $A$ with $a_{ii} > 0$ for all $i$.
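In this case, the diagonal entries are the eigenvalues $\lambda_i$ of $A$ and $\sigma(A, 2) = \lambda_{\min}$. We give the convergence rate constants for such a diagonal $A$ in Table B.1, and in this section we show how to arrive at these rates. We use U$_\infty$ for the slower uniform rate to differentiate from U (tight uniform) for rate (B.8), and we use MR$_\infty$ for rate (B.5) to differentiate it from MR (tight) rate (B.9).

[Figure B.1: Comparison of Kaczmarz and Coordinate Descent. Three panels (Ising model; Very Sparse Overdetermined Linear-System; Label Propagation), each plotting Log Squared Error against Passes for CD U, CD NU, CD GSL, CD GS, U, NU, MR and MD.]

Table B.1: Convergence Rate Constants for Diagonal A
\[
\begin{array}{lll}
\text{Rule} & \text{Rate} & \text{Diagonal } A \\
\hline
\text{U}_\infty & \left(1 - \dfrac{\sigma(A, 2)^2}{m\|A\|_{\infty,2}^2}\right) & \left(1 - \dfrac{\lambda_{\min}^2}{m\lambda_{\max}^2}\right) \\
\text{U} & \left(1 - \dfrac{\sigma(\bar{A}, 2)^2}{m}\right) & \left(1 - \dfrac{1}{m}\right) \\
\text{NU} & \left(1 - \dfrac{\sigma(A, 2)^2}{\|A\|_F^2}\right) & \left(1 - \dfrac{\lambda_{\min}^2}{\sum_i \lambda_i^2}\right) \\
\text{MR}_\infty & \left(1 - \dfrac{\sigma(A, \infty)^2}{\|A\|_{\infty,2}^2}\right) & 1 - \dfrac{1}{\lambda_1^2}\left[\sum_i \dfrac{1}{\lambda_i^2}\right]^{-1} \\
\text{MR} & \left(1 - \dfrac{\sigma(A, \infty)^2}{\|a_{i_k}\|^2}\right) & 1 - \dfrac{1}{\lambda_{i_k}^2}\left[\sum_i \dfrac{1}{\lambda_i^2}\right]^{-1} \\
\text{MD} & \left(1 - \sigma(\bar{A}, \infty)^2\right) & \left(1 - \dfrac{1}{m}\right)
\end{array}
\]

For U$_\infty$, the rate follows straight from $\|A\|_{\infty,2} = \max_i \|a_i\| = \max_i \lambda_i = \lambda_{\max}$. For U, we note that the weighted matrix $\bar{A} := D^{-1}A$ is simply the identity matrix. The NU rate uses that $\|A\|_F^2 = \sum_i \lambda_i^2$. For both MR$_\infty$ and MR, we have
\[
\sigma(A, \infty)^2 := \inf_{y \ne z} \frac{\|A(y - z)\|_\infty^2}{\|y - z\|^2} = \inf_{\|w\| = 1} \|Aw\|_\infty^2.
\]
Consider the equivalent problem
\[
\begin{aligned}
\min_{w \in \mathbb{R}^m,\ y \in \mathbb{R}} \quad & y \\
\text{s.t.} \quad & -y \le \lambda_i^2 w_i^2 \le y \quad \text{for all } i, \\
& \|w\| = 1.
\end{aligned}
\]
From the first inequality, we get
\[
-\frac{y}{\lambda_i^2} \le w_i^2 \le \frac{y}{\lambda_i^2} \quad \forall i \quad \Rightarrow \quad (w_i)^2 \le \frac{y}{\lambda_i^2} \quad \forall i.
\]
It follows that
\[
\|w\|^2 = \sum_{i=1}^m w_i^2 \le \sum_{i=1}^m \frac{y}{\lambda_i^2},
\]
which is equivalent to
\[
y \ge \frac{\|w\|^2}{\sum_{i=1}^m \frac{1}{\lambda_i^2}}.
\]
Because we are minimizing $y$ this must hold with equality at a solution, and because of the constraint $\|w\| = 1$ we have
\[
\sigma(A, \infty)^2 = \left( \sum_i \frac{1}{\lambda_i^2} \right)^{-1}.
\]
For the MR$_\infty$ rate, we divide $\sigma(A, \infty)^2$ by the maximum eigenvalue squared (written $\lambda_1^2$ in Table B.1).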
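For the MR rate, we divide by the specific $\lambda_{i_k}^2$ corresponding to the row $i_k$ selected at iteration $k$.

For the MD rule, following the argument we used to derive $\sigma(A, \infty)^2$ and using that $\bar{A} = I$ gives us
\[
\sigma(\bar{A}, \infty)^2 = \frac{1}{m}.
\]

B.7 Multiplicative Error

Suppose we have approximated the MR selection rule such that there is a multiplicative error in our selection of $i_k$,
\[
|a_{i_k}^T x^k - b_{i_k}| \ge \max_i |a_i^T x^k - b_i|\,(1 - \epsilon_k),
\]
for some $\epsilon_k \in [0, 1)$. In this scenario, we have
\[
\begin{aligned}
\|x^{k+1} - x^*\|^2 &= \|x^k - x^*\|^2 - \frac{1}{\|a_{i_k}\|^2}\left| a_{i_k}^T x^k - b_{i_k} \right|^2 \\
&\le \|x^k - x^*\|^2 - \frac{1}{\|a_{i_k}\|^2}\left( \max_i \left| a_i^T x^k - b_i \right| (1 - \epsilon_k) \right)^2 \\
&= \|x^k - x^*\|^2 - \frac{(1 - \epsilon_k)^2}{\|a_{i_k}\|^2}\|A(x^k - x^*)\|_\infty^2 \\
&\le \left(1 - \frac{(1 - \epsilon_k)^2 \sigma(A, \infty)^2}{\|a_{i_k}\|^2}\right)\|x^k - x^*\|^2.
\end{aligned}
\]
We define a multiplicative approximation to the MD rule as an $i_k$ satisfying
\[
\left| \frac{a_{i_k}^T x^k - b_{i_k}}{\|a_{i_k}\|} \right| \ge \max_i \left| \frac{a_i^T x^k - b_i}{\|a_i\|} \right| (1 - \bar{\epsilon}_k),
\]
for some $\bar{\epsilon}_k \in [0, 1)$. With such a rule we have
\[
\begin{aligned}
\|x^{k+1} - x^*\|^2 &= \|x^k - x^*\|^2 - \left| \frac{a_{i_k}^T x^k - b_{i_k}}{\|a_{i_k}\|} \right|^2 \\
&\le \|x^k - x^*\|^2 - \left( \max_i \left| \frac{a_i^T x^k - b_i}{\|a_i\|} \right| (1 - \bar{\epsilon}_k) \right)^2 \\
&= \|x^k - x^*\|^2 - (1 - \bar{\epsilon}_k)^2 \max_i \left| \frac{a_i^T(x^k - x^*)}{\|a_i\|} \right|^2 \\
&= \|x^k - x^*\|^2 - (1 - \bar{\epsilon}_k)^2 \|D^{-1}A(x^k - x^*)\|_\infty^2 \\
&\le \left(1 - (1 - \bar{\epsilon}_k)^2 \sigma(\bar{A}, \infty)^2\right)\|x^k - x^*\|^2.
\end{aligned}
\]

B.8 Additive Error

Suppose we select $i_k$ using an approximate MR rule where
\[
|a_{i_k}^T x^k - b_{i_k}|^2 \ge \max_i |a_i^T x^k - b_i|^2 - \epsilon_k,
\]
for some $\epsilon_k \ge 0$. Then we have the following convergence rate,
\[
\begin{aligned}
\|x^{k+1} - x^*\|^2 &= \|x^k - x^*\|^2 - \frac{1}{\|a_{i_k}\|^2}\left| a_{i_k}^T x^k - b_{i_k} \right|^2 \\
&\le \|x^k - x^*\|^2 - \frac{1}{\|a_{i_k}\|^2}\left( \max_i \left| a_i^T x^k - b_i \right|^2 - \epsilon_k \right) \\
&= \|x^k - x^*\|^2 - \frac{\|A(x^k - x^*)\|_\infty^2}{\|a_{i_k}\|^2} + \frac{\epsilon_k}{\|a_{i_k}\|^2} \\
&\le \left(1 - \frac{\sigma(A, \infty)^2}{\|a_{i_k}\|^2}\right)\|x^k - x^*\|^2 + \frac{\epsilon_k}{\|a_{i_k}\|^2}.
\end{aligned}
\]
For the MD rule with additive error, $i_k$ is selected such that
\[
\left| \frac{a_{i_k}^T x^k - b_{i_k}}{\|a_{i_k}\|} \right|^2 \ge \max_i \left| \frac{a_i^T x^k - b_i}{\|a_i\|} \right|^2 - \bar{\epsilon}_k,
\]
for some $\bar{\epsilon}_k \ge 0$.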
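Then we have
\[
\begin{aligned}
\|x^{k+1} - x^*\|^2 &= \|x^k - x^*\|^2 - \left| \frac{a_{i_k}^T x^k - b_{i_k}}{\|a_{i_k}\|} \right|^2 \\
&\le \|x^k - x^*\|^2 - \left( \max_i \left| \frac{a_i^T x^k - b_i}{\|a_i\|} \right|^2 - \bar{\epsilon}_k \right) \\
&= \|x^k - x^*\|^2 - \|D^{-1}A(x^k - x^*)\|_\infty^2 + \bar{\epsilon}_k \\
&\le \left(1 - \sigma(\bar{A}, \infty)^2\right)\|x^k - x^*\|^2 + \bar{\epsilon}_k.
\end{aligned}
\]

B.9 Comparison of Rates for the Maximum Distance Rule and the Randomized Kaczmarz via Johnson-Lindenstrauss Method

In Eldar and Needell [2011], the authors assume that the rows of $A$ are normalized and that we are dealing with a homogeneous system ($Ax = 0$), which is not particularly interesting since we can solve it in $O(1)$ by setting $x = 0$. Their main convergence result is stated in Theorem 1. Note that RKJL stands for Randomized Kaczmarz via Johnson-Lindenstrauss, which is a hybrid technique using both random selection and an approximate MD rule using the dimensionality reduction technique of Johnson and Lindenstrauss [1984]. In their work they give the result below.

Theorem 1. Fix an estimation $x^k$ and denote by $x^{k+1}$ and $x_{RK}^{k+1}$ the next estimations using the RKJL and the standard RK method, respectively. Define $\gamma_j = |\langle a_j, x^k \rangle|^2$ and order these so that $\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_m$. Then, with $\delta$ being a constant affecting the error due to the JL approximation, we have
\[
\mathbb{E}\|x^{k+1} - x^*\|^2 \le \min\left\{ \mathbb{E}\|x_{RK}^{k+1} - x^*\|^2 - \sum_{j=1}^m \left(p_j - \frac{1}{m}\right)\gamma_j + 2\delta,\ \ \mathbb{E}\|x_{RK}^{k+1} - x^*\|^2 \right\},
\]
where
\[
p_j = \begin{cases} \dfrac{\binom{m-j}{n-1}}{\binom{m}{n}}, & j \le m - n + 1, \\[2ex] 0, & j > m - n + 1, \end{cases}
\]
are non-negative values satisfying $\sum_{j=1}^m p_j = 1$ and $p_1 \ge p_2 \ge \cdots \ge p_m = 0$.

First, we simplify this bound. Applying the non-uniform random rate of Strohmer and Vershynin [2009] to the result of Theorem 1, we get
\[
\begin{aligned}
\mathbb{E}\left[\|x^{k+1} - x^*\|^2\right] &\le \min\left\{ \mathbb{E}\left[\|x_{RK}^{k+1} - x^*\|^2\right] - \sum_{j=1}^m \left(p_j - \frac{1}{m}\right)\gamma_j + 2\delta,\ \ \mathbb{E}\left[\|x_{RK}^{k+1} - x^*\|^2\right] \right\} \\
&= \min\left\{ \|x^k - x^*\|^2 - \frac{1}{\|A\|_F^2}\sum_{j=1}^m \gamma_j - \sum_{j=1}^m p_j\gamma_j + \sum_{j=1}^m \frac{1}{m}\gamma_j + 2\delta,\ \ \|x^k - x^*\|^2 - \frac{1}{\|A\|_F^2}\sum_{j=1}^m \gamma_j \right\} \\
&= \min\left\{ \|x^k - x^*\|^2 - \sum_{j=1}^m p_j\gamma_j + 2\delta,\ \ \|x^k - x^*\|^2 - \frac{1}{m}\sum_{j=1}^m \gamma_j \right\},
\end{aligned} \tag{B.15}
\]
where in the last line we use $\|A\|_F^2 = m$ for a matrix $A$ with normalized rows (in this case of normalized rows, non-uniform selection is simply uniform random selection).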
To compare this to our rate in the setting of an additive error, suppose we define $\bar{\epsilon}_k$ such that the $i_k$ selected satisfies
\[
\gamma_{i_k} \ge \max_i \gamma_i - \bar{\epsilon}_k.
\]
Then, noting that $\|a_i\| = 1$ for all $i$, our convergence rate with additive error is based on the bound
\[
\begin{aligned}
\|x^{k+1} - x^*\|^2 &= \|x^k - x^*\|^2 - \gamma_{i_k} \\
&\le \|x^k - x^*\|^2 - \max_i \gamma_i + \bar{\epsilon}_k.
\end{aligned} \tag{B.16}
\]
Comparing the bounds (B.15) and (B.16), we see that our MD bound is always faster in the case of exact optimization ($\bar{\epsilon}_k = \delta = 0$), as the average and the weighted sum of the absolute inner products squared are both at most the maximum inner product squared,
\[
\max\left\{ \frac{1}{m}\sum_{j=1}^m \gamma_j,\ \sum_{j=1}^m p_j\gamma_j \right\} \le \max_i \gamma_i.
\]
If there is error present, then our rate is faster when
\[
\max_i \gamma_i - \bar{\epsilon}_k \ge \max\left\{ \frac{1}{m}\sum_{j=1}^m \gamma_j,\ \sum_{j=1}^m p_j\gamma_j - 2\delta \right\}.
\]
We note that even if our approximation is worse than the error resulting from the RKJL method, $\bar{\epsilon}_k \ge 2\delta$, it is possible that $\max_i \gamma_i$ is significantly larger than $\frac{1}{m}\sum_{j=1}^m \gamma_j$ and $\sum_{j=1}^m p_j\gamma_j$, and in this case our rate would be tighter. Further, our rate is more general as it does not specifically assume the Johnson-Lindenstrauss dimensionality reduction technique, that the rows of $A$ are normalized, or that the linear system is homogeneous.

B.10 Systems of Linear Inequalities

Consider the system of linear equalities and inequalities,
\[
\begin{cases} a_i^T x \le b_i & (i \in I_\le), \\ a_i^T x = b_i & (i \in I_=), \end{cases} \tag{B.17}
\]
where the disjoint index sets $I_\le$ and $I_=$ partition the set $\{1, 2, \ldots, m\}$. As presented by Leventhal and Lewis [2010], a generalization of the Kaczmarz algorithm that accommodates linear inequalities is given by
\[
\beta_{i_k}^k = \begin{cases} (a_{i_k}^T x^k - b_{i_k})^+ & (i_k \in I_\le), \\ a_{i_k}^T x^k - b_{i_k} & (i_k \in I_=), \end{cases}
\qquad
x^{k+1} = x^k - \frac{\beta_{i_k}^k}{\|a_{i_k}\|^2} a_{i_k},
\]
where for $x \in \mathbb{R}^n$ we define $x^+$ element-wise by
\[
(x^+)_i = \max\{x_i, 0\}.
\]
This leads to the following generalization of the MR and MD rules, respectively,
\[
i_k \in \operatorname*{argmax}_i \left| \beta_i^k \right| \quad \text{and} \quad i_k \in \operatorname*{argmax}_i \left| \frac{\beta_i^k}{\|a_i\|} \right|, \tag{B.18}
\]
for which the selected values equal $\|\beta^k\|_\infty$ and $\|D^{-1}\beta^k\|_\infty$, respectively.

Unlike for equalities, where the Kaczmarz method converges to the projection of the initial iterate $x^0$ onto the intersection of the constraints, for inequalities we can only guarantee that the Kaczmarz method converges to a point in the feasible set. Thus, in convergence rates involving inequalities it is standard to use a bound for the distance from the current iterate $x^k$ to the feasible region,
\[
d(x, S) = \min_{z \in S} \|x - z\|_2 = \|x - P_S(x)\|_2,
\]
where $P_S(x)$ is the projection of $x$ onto the feasible set $S$.

Following closely the arguments of Leventhal and Lewis [2010] for systems of inequalities, we next give the following result, which they credit to Hoffman [1952].

Theorem 11. Let (B.17) be a consistent system of linear equalities and inequalities. Then there exists a constant $\sigma(A, \infty)$ such that
\[
x \in \mathbb{R}^n \text{ and } S \ne \emptyset \ \Rightarrow\ d(x, S) \le \frac{1}{\sigma(A, \infty)}\|e(Ax - b)\|_\infty,
\]
where $S$ is the set of feasible solutions and where the function $e : \mathbb{R}^m \to \mathbb{R}^m$ is defined by
\[
e(y)_i = \begin{cases} y_i^+ & (i \in I_\le), \\ y_i & (i \in I_=). \end{cases}
\]
From Leventhal and Lewis [2010], combining both cases ($i_k \in I_\le$ or $i_k \in I_=$), the following relationship holds with respect to the distance measure $d(x, S)$,
\[
d(x^{k+1}, S)^2 \le d(x^k, S)^2 - \frac{e(Ax^k - b)_{i_k}^2}{\|a_{i_k}\|^2}. \tag{B.19}
\]
Following from this bound and Theorem 11, it is straightforward to derive analogous results for all greedy selection rates derived in this chapter. For example, if we select $i_k$ according to the generalized MR rule (B.18) then the analogous tight rate for the MR rule is given by
\[
\begin{aligned}
d(x^{k+1}, S)^2 &\le d(x^k, S)^2 - \frac{e(Ax^k - b)_{i_k}^2}{\|a_{i_k}\|^2} \\
&= d(x^k, S)^2 - \frac{\|\beta^k\|_\infty^2}{\|a_{i_k}\|^2} \\
&\le \left(1 - \frac{\sigma(A, \infty)^2}{\|a_{i_k}\|^2}\right) d(x^k, S)^2.
\end{aligned}
\]

B.11 Faster Randomized Kaczmarz Using the Orthogonality Graph of A

In order for the adaptive methods to be efficient, we must be able to efficiently update the set of selectable nodes at each iteration. To do this we use a tree structure that keeps track of the number of selectable children in the tree (for uniform random selection) or the cumulative sums of the selectable row norms of $A$ (for non-uniform random selection).
A similar structure is used in the non-uniform sampling code of Schmidt et al. [2017].

Recall the standard inverse-transform approach to sampling from a non-uniform discrete probability distribution over $m$ variables:

1. Compute the cumulative probabilities, $c_i = \sum_{j=1}^i p_j$ for each $i$ from 1 to $m$.
2. Generate a random number $u$ uniformly distributed over $[0, 1]$.
3. Return the smallest $i$ such that $c_i \ge u$.

We can compute all $m$ values of $c_i$ in Step 1 at a cost of $O(m)$ by maintaining the running sum. We assume that Step 2 costs $O(1)$ and we can implement Step 3 in $O(\log m)$ using a binary search. If we are sampling from a fixed distribution, then we only need to perform Step 1 once and from that point we can generate samples from the distribution at a cost of $O(\log m)$.

In the adaptive randomized selection rules, the probabilities $p_j$ change at each iteration and hence the $c_i$ values also change. This means we cannot skip Step 1 as we can for fixed probabilities. However, if the orthogonality graph is sparse then it is still possible to efficiently implement these strategies. To do this, we consider a binary tree structure that has the probabilities $p_j$ as leaf nodes while each internal node is the sum of its two descendants (and thus the root node has a value of 1). Given this structure, we can find the smallest $i$ with $c_i \ge u$ in $O(\log m)$ by traversing the tree. Further, if we update one of the $p_j$ values then we can update this data structure in $O(\log m)$ time since this only requires changing one node at each depth of the tree. If each node has at most $g$ neighbours in the orthogonality graph, then we need to update $g$ probabilities in the binary tree, leading to a cost of $O(g \log m)$ to update the tree structure on each iteration.

Note that the above structure can be modified to work with unnormalized probabilities at the leaf nodes, since the root node will contain the normalizing constant required to make these unnormalized probabilities into a valid probability mass function.
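The binary tree described above can be sketched as an array-backed sum tree. This is an illustrative sketch, not the code of Schmidt et al. [2017]: the class name `SumTree` and its methods are hypothetical, the leaves hold unnormalized weights, the leaf array is padded with zeros up to a power of two, and $u$ is drawn from $(0, \text{total}]$ rather than $[0, 1]$ so that zero-weight leaves are never returned.

```python
import random


class SumTree:
    """Binary tree whose leaves are weights and whose internal nodes are sums."""

    def __init__(self, weights):
        self.size = 1
        while self.size < len(weights):   # pad the leaf level to a power of two
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)
        for i, w in enumerate(weights):
            self.tree[self.size + i] = float(w)
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def total(self):
        """Normalizing constant, stored at the root."""
        return self.tree[1]

    def update(self, i, w):
        """Set leaf i to weight w, fixing one node per level: O(log m)."""
        node = self.size + i
        self.tree[node] = float(w)
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def find(self, u):
        """Smallest i whose cumulative weight c_i is >= u, for 0 < u <= total."""
        node = 1
        while node < self.size:
            left = 2 * node
            if u <= self.tree[left]:
                node = left               # target lies in the left subtree
            else:
                u -= self.tree[left]      # skip the mass of the left subtree
                node = left + 1
        return node - self.size

    def sample(self, rng=random):
        """Draw an index with probability proportional to its weight."""
        return self.find((1.0 - rng.random()) * self.total())
```

With weights `[1.0, 0.0, 2.0, 1.0]`, the cumulative sums are 1, 1, 3, 4, so `find(1.5)` lands on index 2. Marking a row as non-selectable is then a single `update(i, 0.0)` call.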
Using this, we can implement the adaptive uniform method by setting the leaf nodes to 1 for selectable nodes and 0 for non-selectable nodes. To implement the adaptive non-uniform method, we set the leaf nodes to 0 for non-selectable nodes and $\|a_i\|^2$ for selectable nodes.

B.12 Additional Experiments

Formulating the Semi-Supervised Label Propagation Problem as a Linear System

Our third experiment solves a label propagation problem for semi-supervised learning in the 'two moons' dataset [Zhou et al., 2003]. We use a variant of the quadratic labelling criterion of Bengio et al. [2006],
\[
\min_{y_i \in S'} f(y) \equiv \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n w_{ij}(y_i - y_j)^2,
\]
where $y$ is our label vector (each $y_i$ can take one of 2 values), $S$ is the set of labels that we do know, $S'$ is the set of labels that we do not know and $w_{ij} \ge 0$ are the weights describing how strongly we want the labels $y_i$ and $y_j$ to be similar. We assume without loss of generality that $w_{ii} = 0$ (since it does not affect the objective) and that $w_{ij} = w_{ji}$ for all $i, j$, because by the symmetry in the objective the model only depends on these terms through $(w_{ij} + w_{ji})$. We can express this quadratic problem as a linear system that is consistent by construction.
In other words, we can define $A$ and $b$ such that
\[
\nabla f(y) = 0 \iff Ay = b, \quad \text{with } y \in S'.
\]
Differentiating $f$ with respect to some $y_k \in S'$, we have
\[
\begin{aligned}
\nabla_k f(y) &= \underbrace{\sum_{j \ne k} w_{kj}(y_k - y_j)}_{i = k,\ j \ne k} - \underbrace{\sum_{i \ne k} w_{ik}(y_i - y_k)}_{i \ne k,\ j = k} + \underbrace{\sum_{i = k} w_{kk}(y_k - y_k)}_{i = k,\ j = k} \\
&= \sum_{i=1}^n w_{ki}(y_k - y_i) - \sum_{i=1}^n w_{ik}(y_i - y_k) \\
&= 2\sum_{i=1}^n w_{ki}y_k - 2\sum_{i=1}^n w_{ki}y_i.
\end{aligned}
\]
Setting this equal to zero and splitting the summation over $S$ and $S'$ separately, we have
\[
\sum_{i=1}^n w_{ki}y_k - \sum_{i \in S'} w_{ki}y_i = \sum_{i \in S} w_{ki}y_i.
\]
Assuming the elements of $S'$ form the first $|S'|$ indices, the above formulation yields the $|S'| \times |S'|$ matrix with entries
\[
A_{k,i} = \begin{cases} \sum_{j=1}^n w_{kj} & \text{if } i = k, \\ -w_{ki} & \text{if } i \ne k, \end{cases}
\]
where $k$ and $i \in S'$, and
\[
b_k = \sum_{i \in S} w_{ki}y_i.
\]

Hybrid Methods

For the very sparse overdetermined dataset, we see very different performances between the MR and MD rules with respect to squared error and distance. We see that the MR rule outperforms the MD rule in the beginning with respect to squared error and the MD rule outperforms the MR rule significantly with respect to distance. These observations align with the respective definitions of each greedy rule. However, if we want a method that converges well with respect to both of these objectives, then we could consider a 'hybrid' greedy rule. For example, we could simply alternate between using the MR rule and the MD rule. As we see in Figure B.2, this approach simultaneously exploits the convergence of the MR rule in terms of squared error and the MD rule in terms of distance to the solution. However, computationally this approach requires the maintenance of two max-heap structures.

[Figure B.2: Comparison of MR, MD and Hybrid Method for Very Sparse Dataset. Two panels for the Very Sparse Overdetermined Linear-System, plotting Log Squared Error and Log Distance against Iteration for MR, MD, hybrid-switch, U and NU.]

Appendix C
Chapter 4 Supplementary Material

C.1 Relationships Between Conditions

Below we prove a subset of the implications in Theorem 2.
The remaining relationships in Theorem 2 follow from these results and transitivity.

• SC → ESC: The SC assumption implies that the ESC inequality is satisfied for all $x$ and $y$, so it is also satisfied under the constraint $x_p = y_p$.

• ESC → WSC: Take $y = x_p$ in the ESC inequality (which clearly has the same projection as $x$) to get WSC with the same $\mu$ as a special case.

• WSC → RSI: Re-arrange the WSC inequality to
\[
\langle \nabla f(x), x - x_p \rangle \ge f(x) - f^* + \frac{\mu}{2}\|x_p - x\|^2.
\]
Since $f(x) - f^* \ge 0$, we have RSI with $\frac{\mu}{2}$.

• RSI → EB: Using Cauchy-Schwarz on the RSI we have
\[
\|\nabla f(x)\|\,\|x - x_p\| \ge \langle \nabla f(x), x - x_p \rangle \ge \mu\|x_p - x\|^2,
\]
and dividing both sides by $\|x - x_p\|$ (assuming $x \ne x_p$) gives EB with the same $\mu$ (while EB clearly holds if $x = x_p$).

• EB → PL: By Lipschitz continuity we have
\[
f(x) \le f(x_p) + \langle \nabla f(x_p), x - x_p \rangle + \frac{L}{2}\|x_p - x\|^2,
\]
and using EB along with $f(x_p) = f^*$ and $\nabla f(x_p) = 0$ we have
\[
f(x) - f^* \le \frac{L}{2}\|x_p - x\|^2 \le \frac{L}{2\mu^2}\|\nabla f(x)\|^2,
\]
which is the PL inequality with constant $\frac{\mu^2}{L}$.

• PL → EB: Below we show that PL implies QG. Using this result, while denoting the PL constant with $\mu_p$ and the QG constant with $\mu_q$, we get
\[
\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu_p(f(x) - f^*) \ge \frac{\mu_p\mu_q}{2}\|x - x_p\|^2,
\]
which implies that EB holds with constant $\sqrt{\mu_p\mu_q}$.

• QG + Convex → RSI: By convexity we have
\[
f(x_p) \ge f(x) + \langle \nabla f(x), x_p - x \rangle.
\]
Re-arranging and using QG we get
\[
\langle \nabla f(x), x - x_p \rangle \ge f(x) - f^* \ge \frac{\mu}{2}\|x_p - x\|^2,
\]
which is RSI with constant $\frac{\mu}{2}$.

• PL → QG: Define the function
\[
g(x) = \sqrt{f(x) - f^*}.
\]
If we assume that $f$ satisfies the PL inequality then, since $\nabla g(x) = \nabla f(x)/(2\sqrt{f(x) - f^*})$, for any $x \notin \mathcal{X}^*$ we have
\[
\|\nabla g(x)\|^2 = \frac{\|\nabla f(x)\|^2}{4(f(x) - f^*)} \ge \frac{\mu}{2},
\]
or that
\[
\|\nabla g(x)\| \ge \sqrt{\mu/2}. \tag{C.1}
\]
By the definition of $g$, to show QG it is sufficient to show that
\[
g(x) \ge \sqrt{\mu/2}\,\|x - x_p\|. \tag{C.2}
\]
As $f$ is assumed to satisfy the PL inequality we have that $f$ is an invex function and thus by definition $g$ is a positive invex function ($g(x) \ge 0$) with a closed optimal solution set $\mathcal{X}^*$ such that for all $y \in \mathcal{X}^*$, $g(y) = 0$. For any point $x_0 \notin \mathcal{X}^*$, consider solving the following differential equation:
\[
\frac{dx(t)}{dt} = -\nabla g(x(t)), \qquad x(t = 0) = x_0, \tag{C.3}
\]
for $x(t) \notin \mathcal{X}^*$. (This is a flow orbit starting at $x_0$ and flowing along the negative gradient of $g$.) By (C.1), $\|\nabla g\|$ is bounded from below, and as $g$ is a positive invex function $g$ is also bounded from below. Thus, by moving along the path defined by (C.3) we are sufficiently reducing the function and will eventually reach the optimal set. Thus there exists a $T$ such that $x(T) \in \mathcal{X}^*$ (and at this point the differential equation ceases to be defined). We can show this by starting from the gradient theorem for line integrals, writing $x_T = x(T)$,
\[
\begin{aligned}
g(x_0) - g(x_T) &= \int_{x_T}^{x_0} \langle \nabla g(x), dx \rangle \\
&= -\int_{x_0}^{x_T} \langle \nabla g(x), dx \rangle && \text{(flipping integral bounds)} \\
&= -\int_0^T \left\langle \nabla g(x(t)), \frac{dx(t)}{dt} \right\rangle dt && \text{(reparameterization)} \\
(\ast)\quad &= \int_0^T \|\nabla g(x(t))\|^2\, dt && \text{(from (C.3))} \\
&\ge \int_0^T \frac{\mu}{2}\, dt && \text{(from (C.1))} \\
&= \frac{\mu T}{2}.
\end{aligned}
\]
As $g(x_T) \ge 0$, this shows we need to have $T \le 2g(x_0)/\mu$, so there must be a $T$ with $x(T) \in \mathcal{X}^*$.

The length of the orbit $x(t)$ starting at $x_0$, which we denote by $\mathcal{L}(x_0)$, is given by
\[
\mathcal{L}(x_0) = \int_0^T \|dx(t)/dt\|\, dt = \int_0^T \|\nabla g(x(t))\|\, dt \ge \|x_0 - x_p\|, \tag{C.4}
\]
where $x_p$ is the projection of $x_0$ onto $\mathcal{X}^*$ and the inequality follows because the orbit is a path from $x_0$ to a point in $\mathcal{X}^*$ (and thus it must be at least as long as the projection distance).

Starting from the line marked $(\ast)$ above we have
\[
\begin{aligned}
g(x_0) - g(x_T) &= \int_0^T \|\nabla g(x(t))\|^2\, dt \\
&\ge \sqrt{\mu/2} \int_0^T \|\nabla g(x(t))\|\, dt && \text{(by (C.1))} \\
&\ge \sqrt{\mu/2}\,\|x_0 - x_p\|. && \text{(by (C.4))}
\end{aligned}
\]
As $g(x_T) = 0$, this yields our result (C.2), or equivalently
\[
f(x) - f^* \ge \frac{\mu}{2}\|x - x_p\|^2,
\]
which is QG with constant $\mu$.

C.2 Relevant Problems

Strongly convex:
By minimizing both sides of the strong convexity inequality with respect to $y$ we get
\[
f(x^*) \ge f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2,
\]
which implies the PL inequality holds with the same value $\mu$. Thus, Theorem 1 exactly matches the known rate for gradient descent with a step-size of $1/L$ for a $\mu$-strongly convex function.

Strongly convex composed with linear:
To show that this class of functions satisfies the PL inequality, we first define $f(x) := g(Ax)$ for a $\sigma$-strongly convex function $g$. For arbitrary $x$ and $y$, we define $u := Ax$ and $v := Ay$.
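By the strong convexity of $g$, we have
\[
g(v) \ge g(u) + \nabla g(u)^T(v - u) + \frac{\sigma}{2}\|v - u\|^2.
\]
By our definitions of $u$ and $v$, we get
\[
g(Ay) \ge g(Ax) + \nabla g(Ax)^T(Ay - Ax) + \frac{\sigma}{2}\|Ay - Ax\|^2,
\]
where we can write the middle term as $(A^T\nabla g(Ax))^T(y - x)$. By the definition of $f$ and its gradient being $\nabla f(x) = A^T\nabla g(Ax)$ by the multivariate chain rule, we obtain
\[
f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma}{2}\|A(y - x)\|^2.
\]
Using $x_p$ to denote the projection of $x$ onto the optimal solution set $\mathcal{X}^*$, we have
\[
\begin{aligned}
f(x_p) &\ge f(x) + \langle \nabla f(x), x_p - x \rangle + \frac{\sigma}{2}\|A(x_p - x)\|^2 \\
&\ge f(x) + \langle \nabla f(x), x_p - x \rangle + \frac{\sigma\theta(A)}{2}\|x_p - x\|^2 \\
&\ge f(x) + \min_y \left[ \langle \nabla f(x), y - x \rangle + \frac{\sigma\theta(A)}{2}\|y - x\|^2 \right] \\
&= f(x) - \frac{1}{2\theta(A)\sigma}\|\nabla f(x)\|^2.
\end{aligned}
\]
In the second line we use that $\mathcal{X}^*$ is polyhedral, and use the theorem of Hoffman [1952] to obtain a bound in terms of $\theta(A)$ (the smallest non-zero singular value of $A$). This derivation implies that the PL inequality is satisfied with $\mu = \sigma\theta(A)$.

C.3 Sign-Based Gradient Methods

Defining a diagonal matrix $\Lambda$ with $1/\sqrt{L_i}$ along the diagonal, the update can be written as
\[
x^{k+1} = x^k - \|\nabla f(x^k)\|_{L^{-1}[1]}\, \Lambda \circ \mathrm{sign}\,\nabla f(x^k).
\]
Consider the function $g(\tau) = f(x + \tau(y - x))$ with $\tau \in \mathbb{R}$. Then
\[
\begin{aligned}
f(y) - f(x) - \langle \nabla f(x), y - x \rangle &= g(1) - g(0) - \langle \nabla f(x), y - x \rangle \\
&= \int_0^1 \left[ \frac{dg}{d\tau}(\tau) - \langle \nabla f(x), y - x \rangle \right] d\tau \\
&= \int_0^1 \left[ \langle \nabla f(x + \tau(y - x)), y - x \rangle - \langle \nabla f(x), y - x \rangle \right] d\tau \\
&= \int_0^1 \langle \nabla f(x + \tau(y - x)) - \nabla f(x), y - x \rangle\, d\tau \\
&\le \int_0^1 \|\nabla f(x + \tau(y - x)) - \nabla f(x)\|_{L^{-1}[1]}\, \|y - x\|_{L[\infty]}\, d\tau \\
&\le \int_0^1 \tau\|y - x\|_{L[\infty]}^2\, d\tau \\
&= \left. \frac{\tau^2}{2}\|y - x\|_{L[\infty]}^2 \right|_0^1 \\
&= \frac{1}{2}\|y - x\|_{L[\infty]}^2,
\end{aligned}
\]
where the second inequality uses the Lipschitz assumption, and in the first inequality we have used the Cauchy-Schwarz inequality and that the dual norm of the $L^{-1}[1]$ norm is the $L[\infty]$ norm.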
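The above gives an upper bound on the function in terms of this $L[\infty]$-norm,
\[
f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2}\|y - x\|_{L[\infty]}^2.
\]
Plugging in our iteration update we have
\[
\begin{aligned}
f(x^{k+1}) &\le f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{1}{2}\|x^{k+1} - x^k\|_{L[\infty]}^2 \\
&= f(x^k) - \|\nabla f(x^k)\|_{L^{-1}[1]} \langle \nabla f(x^k), \Lambda \circ \mathrm{sign}\,\nabla f(x^k) \rangle + \frac{\|\nabla f(x^k)\|_{L^{-1}[1]}^2}{2}\|\Lambda \circ \mathrm{sign}\,\nabla f(x^k)\|_{L[\infty]}^2 \\
&= f(x^k) - \|\nabla f(x^k)\|_{L^{-1}[1]}^2 + \frac{\|\nabla f(x^k)\|_{L^{-1}[1]}^2}{2}\left( \max_i \frac{1}{\sqrt{L_i}}\sqrt{L_i}\,|\mathrm{sign}\,\nabla_i f(x^k)| \right)^2 \\
&= f(x^k) - \frac{1}{2}\|\nabla f(x^k)\|_{L^{-1}[1]}^2.
\end{aligned}
\]
Subtracting $f^*$ from both sides yields
\[
f(x^{k+1}) - f(x^*) \le f(x^k) - f(x^*) - \frac{1}{2}\|\nabla f(x^k)\|_{L^{-1}[1]}^2.
\]
Applying the PL inequality with respect to the $L^{-1}[1]$-norm (which, if the PL inequality is satisfied, holds for some $\mu_{L[\infty]}$ by the equivalence between norms),
\[
\frac{1}{2}\|\nabla f(x^k)\|_{L^{-1}[1]}^2 \ge \mu_{L[\infty]}(f(x^k) - f^*),
\]
we have
\[
f(x^{k+1}) - f(x^*) \le (1 - \mu_{L[\infty]})\,(f(x^k) - f(x^*)).
\]

C.4 Proximal-PL Lemma

In this section we give a useful property of the function $D_g$.

Lemma 5. For any differentiable function $f$ and any convex function $g$, given $\mu_2 \ge \mu_1 > 0$ we have
\[
D_g(x, \mu_2) \ge D_g(x, \mu_1).
\]
We will prove Lemma 5 as a corollary of a related result. We first restate the definition
\[
D_g(x, \lambda) = -2\lambda \min_y \left[ \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2}\|y - x\|^2 + g(y) - g(x) \right], \tag{C.5}
\]
and we note that we require $\lambda > 0$. By completing the square, we have
\[
\begin{aligned}
D_g(x, \lambda) &= -\min_y \left[ -\|\nabla f(x)\|^2 + \|\nabla f(x)\|^2 + 2\lambda\langle \nabla f(x), y - x \rangle + \lambda^2\|y - x\|^2 + 2\lambda(g(y) - g(x)) \right] \\
&= \|\nabla f(x)\|^2 - \min_y \left[ \|\lambda(y - x) + \nabla f(x)\|^2 + 2\lambda(g(y) - g(x)) \right].
\end{aligned}
\]
Notice that if $g = 0$, then $D_g(x, \lambda) = \|\nabla f(x)\|^2$ and the proximal-PL inequality reduces to the PL inequality. We will define the proximal residual function as the second part of the above equality,
\[
R_g(\lambda, x, a) \triangleq \min_y \left[ \|\lambda(y - x) + a\|^2 + 2\lambda(g(y) - g(x)) \right]. \tag{C.6}
\]
Lemma 6. If $g$ is convex then for any $x$ and $a$, and for $0 < \lambda_1 \le \lambda_2$ we have
\[
R_g(\lambda_1, x, a) \ge R_g(\lambda_2, x, a). \tag{C.7}
\]
Proof. Without loss of generality, assume $x = 0$. Then we have
\[
\begin{aligned}
R_g(\lambda, a) &= \min_y \left[ \|\lambda y + a\|^2 + 2\lambda(g(y) - g(0)) \right] \\
&= \min_{\bar{y}} \left[ \|\bar{y} + a\|^2 + 2\lambda(g(\bar{y}/\lambda) - g(0)) \right],
\end{aligned} \tag{C.8}
\]
where in the second line we used a change of variables $\bar{y} = \lambda y$ (note that we are minimizing over the whole space of $\mathbb{R}^n$).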
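By the convexity of $g$, for any $\alpha \in [0, 1]$ and $z \in \mathbb{R}^n$ we have
\[
g(\alpha z) \le \alpha g(z) + (1 - \alpha)g(0) \iff g(\alpha z) - g(0) \le \alpha(g(z) - g(0)). \tag{C.9}
\]
By using $0 < \lambda_1/\lambda_2 \le 1$ and using the choices $\alpha = \frac{\lambda_1}{\lambda_2}$ and $z = \bar{y}/\lambda_1$ we have
\[
g(\bar{y}/\lambda_2) - g(0) \le \frac{\lambda_1}{\lambda_2}(g(\bar{y}/\lambda_1) - g(0)) \iff \lambda_2(g(\bar{y}/\lambda_2) - g(0)) \le \lambda_1(g(\bar{y}/\lambda_1) - g(0)). \tag{C.10}
\]
Multiplying both sides by 2 and adding $\|\bar{y} + a\|^2$ to both sides, we get
\[
\|\bar{y} + a\|^2 + 2\lambda_2(g(\bar{y}/\lambda_2) - g(0)) \le \|\bar{y} + a\|^2 + 2\lambda_1(g(\bar{y}/\lambda_1) - g(0)). \tag{C.11}
\]
Taking the minimum over both sides with respect to $\bar{y}$ yields Lemma 6 due to (C.8).

Corollary 2. For any differentiable function $f$ and convex function $g$, given $\lambda_1 \le \lambda_2$, we have
\[
D_g(x, \lambda_2) \ge D_g(x, \lambda_1). \tag{C.12}
\]
By using $D_g(x, \lambda) = \|\nabla f(x)\|^2 - R_g(\lambda, x, \nabla f(x))$, Corollary 2 is exactly Lemma 5.

C.5 Relevant Problems

In this section we prove that the three classes of functions listed in Section 4.3.1 satisfy the proximal-PL inequality condition. Note that while we prove these hold for $D_g(x, \lambda)$ for $\lambda \le L$, by Lemma 5 above they also hold for $D_g(x, L)$.

1. $f(x)$, where $f$ satisfies the PL inequality ($g$ is constant):
As $g$ is assumed to be constant, we have $g(y) - g(x) = 0$ and the left-hand side of the proximal-PL inequality simplifies to
\[
\begin{aligned}
D_g(x, \mu) &= -2\mu \min_y \left\{ \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2 \right\} \\
&= -2\mu\left( -\frac{1}{2\mu}\|\nabla f(x)\|^2 \right) \\
&= \|\nabla f(x)\|^2.
\end{aligned}
\]
Thus, the proximal-PL inequality simplifies to $f$ satisfying the PL inequality,
\[
\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu(f(x) - f^*),
\]
as we assumed.

2. $F(x) = f(x) + g(x)$ and $f$ is strongly convex:
By the strong convexity of $f$ we have
\[
f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2, \tag{C.13}
\]
which leads to
\[
F(y) \ge F(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2 + g(y) - g(x). \tag{C.14}
\]
Minimizing both sides with respect to $y$,
\[
\begin{aligned}
F^* &\ge F(x) + \min_y \left\{ \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2 + g(y) - g(x) \right\} \\
&= F(x) - \frac{1}{2\mu}D_g(x, \mu).
\end{aligned} \tag{C.15}
\]
Rearranging, we have our result.

3. $F(x) = f(Ax) + g(x)$ and $f$ is strongly convex, $g$ is the indicator function for a polyhedral set $\mathcal{X}$, and $A$ is a linear transformation:
By defining $\tilde{f}(x) = f(Ax)$ and using strong convexity of $f$, we have
\[
\tilde{f}(y) \ge \tilde{f}(x) + \langle \nabla\tilde{f}(x), y - x \rangle + \frac{\mu}{2}\|A(y - x)\|^2, \tag{C.16}
\]
which leads to
\[
F(y) \ge F(x) + \langle \nabla\tilde{f}(x), y - x \rangle + \frac{\mu}{2}\|A(y - x)\|^2 + g(y) - g(x). \tag{C.17}
\]
Since $\mathcal{X}$ is polyhedral, it can be written as a set $\{x : Bx \le c\}$ for a matrix $B$ and a vector $c$. As before, assume that $x_p$ is the projection of $x$ onto the optimal solution set $\mathcal{X}^*$, which in this case is $\{x : Bx \le c,\ Ax = z\}$ for some $z$. Then
\[
\begin{aligned}
F^* - F(x) &= F(x_p) - F(x) \\
&\ge \langle \nabla\tilde{f}(x), x_p - x \rangle + \frac{\mu}{2}\|A(x - x_p)\|^2 + g(x_p) - g(x) \\
&= \langle \nabla\tilde{f}(x), x_p - x \rangle + \frac{\mu}{2}\|Ax - z\|^2 + g(x_p) - g(x) \\
&= \langle \nabla\tilde{f}(x), x_p - x \rangle + \frac{\mu}{2}\left\| \{Ax - z\}^+ + \{-Ax + z\}^+ \right\|^2 + g(x_p) - g(x) \\
&= \langle \nabla\tilde{f}(x), x_p - x \rangle + \frac{\mu}{2}\left\| \left\{ \begin{pmatrix} A \\ -A \\ B \end{pmatrix} x - \begin{pmatrix} z \\ -z \\ c \end{pmatrix} \right\}^+ \right\|^2 + g(x_p) - g(x) \\
&\ge \langle \nabla\tilde{f}(x), x_p - x \rangle + \frac{\mu\theta(A, B)}{2}\|x - x_p\|^2 + g(x_p) - g(x) \\
&\ge \min_y \left[ \langle \nabla\tilde{f}(x), y - x \rangle + \frac{\mu\theta(A, B)}{2}\|y - x\|^2 + g(y) - g(x) \right] \\
&= -\frac{1}{2\mu\theta(A, B)}D_g(x, \mu\theta(A, B)),
\end{aligned}
\]
where we have used the notation that $\{\cdot\}^+ = \max\{0, \cdot\}$, the fourth equality follows because $x$ was projected onto $\mathcal{X}$ in the previous iteration (so $Bx - c \le 0$), and the line after that uses Hoffman's bound [Hoffman, 1952].

4. $F(x) = f(x) + g(x)$, $f$ is convex, and $F$ satisfies the quadratic growth (QG) condition:
A function $F$ satisfies the QG condition if
\[
F(x) - F^* \ge \frac{\mu}{2}\|x - x_p\|^2. \tag{C.18}
\]
For any $\lambda > 0$ we have
\[
\begin{aligned}
\min_y \left[ \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2}\|y - x\|^2 + g(y) - g(x) \right] &\le \langle \nabla f(x), x_p - x \rangle + \frac{\lambda}{2}\|x_p - x\|^2 + g(x_p) - g(x) \\
&\le f(x_p) - f(x) + \frac{\lambda}{2}\|x_p - x\|^2 + g(x_p) - g(x) \\
&= \frac{\lambda}{2}\|x_p - x\|^2 + F^* - F(x) \\
&\le \left(1 - \frac{\lambda}{\mu}\right)(F^* - F(x)).
\end{aligned} \tag{C.19}
\]
The third line follows from the convexity of $f$, and the last inequality uses the QG condition of $F$. Multiplying both sides by $-2\lambda$, we have
\[
\begin{aligned}
D_g(x, \lambda) &= -2\lambda \min_y \left[ \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2}\|y - x\|^2 + g(y) - g(x) \right] \\
&\ge 2\lambda\left(1 - \frac{\lambda}{\mu}\right)(F(x) - F^*).
\end{aligned} \tag{C.20}
\]
This is true for any $\lambda > 0$, and by choosing $\lambda = \mu/2$ we have
\[
D_g(x, \mu/2) \ge \frac{\mu}{2}(F(x) - F^*). \tag{C.21}
\]

C.6 Proximal Coordinate Descent

In this section, we show linear convergence of randomized coordinate descent for $F(x) = f(x) + g(x)$ assuming that $F$ satisfies the proximal-PL inequality, $\nabla f$ is coordinate-wise Lipschitz continuous, and $g$ is a separable convex function ($g(x) = \sum_i g_i(x_i)$).

From coordinate-wise Lipschitz continuity of $\nabla f$ and separability of $g$, we have
\[
F(x + y_i e_i) - F(x) \le y_i\nabla_i f(x) + \frac{L}{2}y_i^2 + g_i(x_i + y_i) - g_i(x_i). \tag{C.22}
\]
Given a coordinate $i$, the coordinate descent step chooses $y_i$ to minimize this upper bound on the improvement in $F$,
\[
y_i = \operatorname*{argmin}_{t_i \in \mathbb{R}} \left\{ t_i\nabla_i f(x) + \frac{L}{2}t_i^2 + g_i(x_i + t_i) - g_i(x_i) \right\}.
\]
We next use an argument similar to Richtárik and Takáč [2014] to relate the expected improvement (with random selection of the coordinates) to the function $D_g$,
\[
\begin{aligned}
\mathbb{E}\left\{ \min_{t_i} \left[ t_i\nabla_i f(x) + \frac{L}{2}t_i^2 + g_i(x_i + t_i) - g_i(x_i) \right] \right\} &= \frac{1}{n}\sum_i \min_{t_i} \left[ t_i\nabla_i f(x) + \frac{L}{2}t_i^2 + g_i(x_i + t_i) - g_i(x_i) \right] \\
&= \frac{1}{n}\min_{t_1, \ldots, t_n} \sum_i \left[ t_i\nabla_i f(x) + \frac{L}{2}t_i^2 + g_i(x_i + t_i) - g_i(x_i) \right] \\
&= \frac{1}{n}\min_{y \equiv x + (t_1, \ldots, t_n)} \left[ \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2 + g(y) - g(x) \right] \\
&= -\frac{1}{2Ln}D_g(x, L).
\end{aligned}
\]
(Note that separability allows us to exchange the summation and minimization operators.) By using this and taking the expectation of (C.22) we get
\[
\mathbb{E}\left[F(x^{k+1})\right] \le F(x^k) - \frac{1}{2Ln}D_g(x^k, L). \tag{C.23}
\]
Subtracting $F^*$ from both sides and applying the proximal-PL inequality yields a linear convergence rate of $\left(1 - \frac{\mu}{nL}\right)$.

Appendix D
Chapter 5 Supplementary Material

D.1 Cost of Multi-Class Logistic Regression

The typical setting where we expect coordinate descent to outperform gradient descent is when the cost of one gradient descent iteration is similar to the cost of updating all variables via coordinate descent. It is well known that for the binary logistic regression objective, one of the most ubiquitous models in machine learning, coordinate descent with uniform random selection satisfies this property.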
As seen in Appendix A.1.2 this property is also satisfied for the GS rule in the case of logistic regression, provided that the data is sufficiently sparse.

In this section we consider multi-class logistic regression. We first analyze the cost of gradient descent on this objective and how randomized coordinate descent is efficient for any sparsity level. Then we show that a high sparsity level is not sufficient for the GS rule to be efficient for this problem, but that it is efficient if we use a particular set of fixed blocks.

D.1.1 Cost of Gradient Descent

The likelihood for a single training example $i$ with features $a_i \in \mathbb{R}^d$ and a label $b_i \in \{1, 2, \dots, k\}$ is given by

$$p(b_i \mid a_i, X) = \frac{\exp(x_{b_i}^T a_i)}{\sum_{c=1}^k \exp(x_c^T a_i)},$$

where $x_c$ is column $c$ of our matrix of parameters $X \in \mathbb{R}^{d \times k}$ (so the number of parameters $n$ is $dk$). To maximize the likelihood over $m$ independent and identically-distributed training examples we minimize the negative log-likelihood,

$$f(X) = \sum_{i=1}^m \left[ -x_{b_i}^T a_i + \log\left( \sum_{c=1}^k \exp(x_c^T a_i) \right) \right], \tag{D.1}$$

which is a convex function. The partial derivative of this objective with respect to a particular $X_{jc}$ is given by

$$\frac{\partial}{\partial X_{jc}} f(X) = -\sum_{i=1}^m a_{ij} \left[ I(b_i = c) - \frac{\exp(x_c^T a_i)}{\sum_{c'=1}^k \exp(x_{c'}^T a_i)} \right], \tag{D.2}$$

where $I$ is a 0/1 indicator variable and $a_{ij}$ is feature $j$ for training example $i$. We use $A$ to denote a matrix where row $i$ is given by $a_i^T$. To compute the full gradient, the operations which depend on the size of the problem are:

1. Computing $x_c^T a_i$ for all values of $i$ and $c$.
2. Computing the sums $\sum_{c=1}^k \exp(x_c^T a_i)$ for all values of $i$.
3. Computing the partial derivative sums (D.2) for all values of $j$ and $c$.

The first step is the result of the matrix multiplication $AX$, so if $A$ has $z$ non-zeroes then this has a cost of $O(zk)$ if we compute it using $k$ matrix-vector multiplications. The second step costs $O(mk)$, which under the reasonable assumption that $m \le z$ (since each row usually has at least one non-zero) is also in $O(zk)$. The third step is the result of a matrix multiplication of the form $A^T R$ for a (dense) $m \times k$ matrix $R$ (whose elements have a constant-time cost to compute given the results of the first two steps), which also costs $O(zk)$, giving a final cost of $O(zk)$.

D.1.2 Cost of Randomized Coordinate Descent

Since there are $n = dk$ variables, we want our coordinate descent iterations to be $dk$-times faster than the gradient descent cost of $O(zk)$. Thus, we want to be able to implement coordinate descent iterations for a cost of $O(z/d)$ (noting that we always expect $z \ge d$ since otherwise we could remove some columns of $A$ that only have zeroes). The key to doing this for randomized coordinate descent is to track two quantities:

1. The values $x_c^T a_i$ for all $i$ and $c$.
2. The values $\sum_{c'=1}^k \exp(x_{c'}^T a_i)$ for all $i$.

Given these values we can compute the partial derivative in $O(z/d)$ in expectation, because this is the expected number of non-zero values of $a_{ij}$ in the partial derivative sum (D.2) ($A$ has $z$ total non-zeroes and we are randomly choosing one of the $d$ columns). Further, after updating a particular $X_{jc}$ we can update the above quantities for the same cost:

1. We need to update $x_c^T a_i$ for the particular $c$ we chose for the examples $i$ where $a_{ij}$ is non-zero for the chosen value of $j$. This requires an $O(1)$ operation (subtract the old $x_{jc} a_{ij}$ and add the new value) for each non-zero element of column $j$ of $A$. Since $A$ has $z$ non-zeroes and $d$ columns, the expected number of non-zeroes is $z/d$ so this has a cost of $O(z/d)$.
2. We need to update $\sum_{c'=1}^k \exp(x_{c'}^T a_i)$ for all $i$ where $a_{ij}$ is non-zero for our chosen $j$. Since we expect $z/d$ non-zero values of $a_{ij}$, the cost of this step is also $O(z/d)$.

Note that BCD is also efficient since if we update $\tau$ elements, the cost is $O(z\tau/d)$ by just applying the above logic $\tau$ times.
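The bookkeeping described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the column-wise sparse storage `cols`, the helper names (`init_cache`, `partial_derivative`, `update_coordinate`, `direct_partial`), the zero-based class labels, the tiny random problem, and the fixed step size `step` are all hypothetical choices made for the example and are not taken from the thesis.

```python
import math
import random

def init_cache(m, d, k):
    """Start from X = 0, where every score is 0 and every row sum equals k."""
    X = [[0.0] * k for _ in range(d)]        # parameters, X[j][c]
    scores = [[0.0] * k for _ in range(m)]   # scores[i][c] = x_c^T a_i
    sums = [float(k)] * m                    # sums[i] = sum_c exp(scores[i][c])
    return X, scores, sums

def partial_derivative(cols, labels, scores, sums, j, c):
    """Evaluate (D.2) for X_jc using only the non-zeroes of column j."""
    g = 0.0
    for i, a_ij in cols[j]:
        indicator = 1.0 if labels[i] == c else 0.0
        g -= a_ij * (indicator - math.exp(scores[i][c]) / sums[i])
    return g

def update_coordinate(cols, X, scores, sums, j, c, delta):
    """Add delta to X_jc and refresh both caches for the rows touched by column j."""
    X[j][c] += delta
    for i, a_ij in cols[j]:
        old_exp = math.exp(scores[i][c])
        scores[i][c] += delta * a_ij
        sums[i] += math.exp(scores[i][c]) - old_exp

def direct_partial(cols, labels, X, m, k, j, c):
    """Recompute (D.2) from X alone, ignoring the caches (used only as a check)."""
    scores = [[0.0] * k for _ in range(m)]
    for jj, col in enumerate(cols):
        for i, a in col:
            for cc in range(k):
                scores[i][cc] += a * X[jj][cc]
    g = 0.0
    for i, a_ij in cols[j]:
        total = sum(math.exp(s) for s in scores[i])
        indicator = 1.0 if labels[i] == c else 0.0
        g -= a_ij * (indicator - math.exp(scores[i][c]) / total)
    return g

# Tiny hypothetical problem: cols[j] lists the (row, value) non-zeroes of column j.
random.seed(0)
m, d, k = 6, 4, 3
cols = [[(i, random.uniform(-1.0, 1.0)) for i in range(m) if random.random() < 0.5]
        for _ in range(d)]
labels = [random.randrange(k) for _ in range(m)]   # zero-based class labels
X, scores, sums = init_cache(m, d, k)

step = 0.1  # arbitrary illustrative step size
for _ in range(50):
    j, c = random.randrange(d), random.randrange(k)
    g = partial_derivative(cols, labels, scores, sums, j, c)
    update_coordinate(cols, X, scores, sums, j, c, -step * g)

# The cached partial derivatives should agree with a from-scratch recomputation.
max_gap = max(abs(partial_derivative(cols, labels, scores, sums, j, c)
                  - direct_partial(cols, labels, X, m, k, j, c))
              for j in range(d) for c in range(k))
```

Both `partial_derivative` and `update_coordinate` loop only over the non-zeroes of the chosen column, which is where the expected $O(z/d)$ cost per iteration comes from; `direct_partial` is deliberately expensive and exists only to confirm that the caches stay consistent.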
In fact, step 2 and the computation of the final partial derivative have some redundant computation if we update multiple $X_{jc}$ with the same $c$, so we might have a small performance gain in the block case.

D.1.3 Cost of Greedy Coordinate Descent (Arbitrary Blocks)

The cost of greedy coordinate descent is typically higher than randomized coordinate descent since we need to track all partial derivatives. However, consider the case where each row has at most $z_r$ non-zeroes and each column has at most $z_c$ non-zeroes. In this setting we previously showed that for binary logistic regression it is possible to track all partial derivatives for a cost of $O(z_r z_c)$, and that we can track the maximum gradient value at the cost of an additional logarithmic factor (see Appendix A.1).[28] Thus, greedy coordinate selection has a similar cost to uniform selection when the sparsity pattern makes $z_r z_c$ similar to $z/d$ (as in the case of a grid-structured dependency graph like Figure 5.3).

Unfortunately, having $z_r z_c$ similar to $z/d$ is not sufficient in the multi-class case. In particular, the cost of tracking all the partial derivatives after updating an $X_{jc}$ in the multi-class case can be broken down as follows:

1. We need to update $x_c^T a_i$ for the examples $i$ where $a_{ij}$ is non-zero. Since there are at most $z_c$ non-zero values of $a_{ij}$ over all $i$, the cost of this is $O(z_c)$.
2. We need to update $\sum_{c=1}^k \exp(x_c^T a_i)$ for all $i$ where $a_{ij}$ is non-zero. Since there are at most $z_c$ non-zero values of $a_{ij}$, the cost of this is $O(z_c)$.
3. We need to update the partial derivatives $\partial f / \partial X_{jc}$ for all $j$ and $c$. Observe that each time we have $a_{ij}$ non-zero, we change the partial derivative with respect to all features $j'$ that are non-zero in the example $i$ and we must update all classes $c'$ for these examples. Thus, for the $O(z_c)$ examples with a non-zero feature $j$ we need to update up to $O(z_r)$ other features for that example and for each of these we need to update all $k$ classes. This gives a cost of $O(z_r z_c k)$.

So while in the binary case we needed $O(z_r z_c)$ to be comparable to $O(z/d)$ for greedy coordinate descent to be efficient, in the multi-class case we now need $O(z_r z_c k)$ to be comparable to $O(z/d)$. This means that not only do we need a high degree of sparsity but we also need the number of classes $k$ to be small for greedy coordinate descent to be efficient.

[28] Note that the purpose of the quantity $z_r z_c$ is to serve as a potentially-crude upper bound on the maximum degree in the dependency graph we describe in Section 5.4. Any tighter bound on this degree would yield a tighter upper bound on the runtime.

D.1.4 Cost of Greedy Coordinate Descent (Fixed Blocks)

Greedy rules are more expensive in the multi-class case because whenever we change an individual variable $X_{jc}$, it changes the partial derivative with respect to $X_{j'c'}$ for a set of $j'$ values and for all $c'$. But we can improve the efficiency of greedy rules by using a special choice of fixed blocks that reduces the number of $j'$ values. In particular, BCD is more efficient for the multi-class case if we put $X_{jc'}$ for all $c'$ into the same block. In other words, we ensure that each row of $X$ is part of the same block so that we apply BCD to rows rather than in an unstructured way. Below we consider the cost of updating the needed quantities after changing an entire row of $X_{jc}$ values:

1. Since we are updating $k$ elements, the cost of updating the $x_c^T a_i$ is $k$-times larger, giving $O(z_c k)$ when we update a row.
2. Similarly, the cost of updating the sums $\sum_{c=1}^k \exp(x_c^T a_i)$ is $k$-times larger, also giving $O(z_c k)$.
3. Where we gain in computation is the cost of computing the changed values of the partial derivatives $\partial f / \partial X_{jc}$. As before, each time we have $a_{ij}$ non-zero for our particular row $j$, we change the partial derivative with respect to all other $j'$ for this example and with respect to each class $c'$ for these $j'$. Thus, for the $O(z_c)$ examples with a non-zero feature $j$ we need to update up to $O(z_r)$ other features for that example and for each of these we need to update all $k$ classes. But since $j$ is the same for each variable we update, we only have to do this once, which gives us a cost of $O(z_r z_c k)$.

So the cost to update a row of the matrix $X$ is $O(z_r z_c k)$, which is the same cost as only updating a single element. Considering the case of updating individual rows, this gives us $d$ blocks, so in order for BCD to be efficient it must be $d$-times faster than the gradient descent cost of $O(zk)$. Thus, we need a cost of $O(zk/d)$ per iteration. This is achieved if $O(z_r z_c)$ is similar to $O(z/d)$, which is the same condition we needed in the binary case.

D.2 Blockwise Lipschitz Constants

In this section we show how to derive upper bounds on the block-Lipschitz constants of the gradient and Hessian for several common problem settings. We will use that a twice-differentiable function has an $L$-Lipschitz continuous gradient if and only if the absolute eigenvalues of its Hessian are upper-bounded by $L$,

$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\| \iff \nabla^2 f(x) \preceq LI.$$

This implies that when considering blockwise constants we have

$$\|\nabla_b f(x + U_b d) - \nabla_b f(x)\| \le L_b \|d\| \iff \nabla^2_{bb} f(x) \preceq L_b I.$$

Thus, bounding the blockwise eigenvalues of the Hessian bounds the blockwise Lipschitz constants of the gradient. We also use that this equivalence extends to the case of general quadratic norms,

$$\|\nabla_b f(x + U_b d) - \nabla_b f(x)\|_{H_b^{-1}} \le \|d\|_{H_b} \iff \nabla^2_{bb} f(x) \preceq H_b.$$

D.2.1 Quadratic Functions

Quadratic functions have the form

$$f(x) = \frac{1}{2} x^T A x + c^T x,$$

for a positive semi-definite matrix $A$ and vector $c$. For all $x$ the Hessian with respect to block $b$ is given by the sub-matrix of $A$,

$$\nabla^2_{bb} f(x) = A_{bb}.$$

Thus, we have that $L_b$ is given by the maximum eigenvalue of the submatrix, $L_b = \|A_{bb}\|$ (the operator norm of the submatrix). In the special case where $b$ only contains a single element $i$, we have that $L_i$ is given by the absolute value of the diagonal element, $L_i = |A_{ii}|$.
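As a small worked illustration of these constants for a quadratic, the sketch below extracts $A_{bb}$ and estimates its largest eigenvalue by power iteration. The helper names (`submatrix`, `max_eigenvalue`), the choice of power iteration, and the 3-by-3 example matrix are assumptions made for this example only; power iteration can stall if its starting vector happens to be orthogonal to the leading eigenvector, so a library eigenvalue routine would be preferable in practice.

```python
import math

def submatrix(A, block):
    """Return A_bb, the rows and columns of A indexed by the block."""
    return [[A[i][j] for j in block] for i in block]

def max_eigenvalue(M, iters=200):
    """Power iteration estimate of the largest eigenvalue of a symmetric PSD matrix."""
    n = len(M)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        # Rayleigh quotient v^T M v for the current unit vector.
        lam = sum(v[i] * sum(M[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam

# Hypothetical positive semi-definite matrix for f(x) = 0.5 x^T A x + c^T x.
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 5.0]]

block = [0, 1]
H_block = submatrix(A, block)        # usable directly as H_b
L_block = max_eigenvalue(H_block)    # L_b, the largest eigenvalue of A_bb
L_single = abs(A[2][2])              # L_i = |A_ii| for the single-coordinate block {2}
```

Here $A_{bb}$ has eigenvalues 1 and 3, so `L_block` comes out at 3 up to rounding, while forming `H_block` needs no eigenvalue computation at all.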
If we want to use a general quadratic norm we can simply take $H_b = A_{bb}$, which is cheaper to compute than $L_b$ (since it does not require an eigenvalue calculation).

D.2.2 Least Squares

The least squares objective has the form

$$f(x) = \frac{1}{2}\|Ax - c\|^2,$$

for a matrix $A$ and vector $c$. This is a special case of a quadratic function, where the Hessian is given by

$$\nabla^2 f(x) = A^T A.$$

This gives us that $L_b = \|A_b\|^2$ (where $A_b$ is the matrix containing the columns $b$ of $A$). In the special case where the block has a single element $j$, observe that $L_j = \sum_{i=1}^m a_{ij}^2$ (sum of the squared values in column $j$) so we do not need to solve an eigenvalue problem. When using a quadratic norm we can take $H_b = A_b^T A_b$, which similarly does not require solving an eigenvalue problem.

D.2.3 Logistic Regression

The likelihood of a single example in a logistic regression model is given by

$$p(b_i \mid a_i, x) = \frac{1}{1 + \exp(-b_i x^T a_i)},$$

where each $a_i \in \mathbb{R}^d$ and $b_i \in \{-1, 1\}$. To maximize the likelihood over $m$ examples (sampled independently) we minimize the negative log-likelihood,

$$f(x) = \sum_{i=1}^m \log(1 + \exp(-b_i x^T a_i)).$$

Using $A$ as a matrix where row $i$ is given by $a_i^T$ and defining $h_i(x) = p(b_i \mid a_i, x)$, we have that

$$\nabla^2 f(x) = \sum_{i=1}^m h_i(x)(1 - h_i(x))\, a_i a_i^T \preceq 0.25 \sum_{i=1}^m a_i a_i^T = 0.25\, A^T A.$$

The generalized inequality above is the binary version of the Bohning bound [Böhning, 1992]. This bound can be derived by observing that $h_i(x)$ is in the range $(0, 1)$, so the quantity $h_i(x)(1 - h_i(x))$ has an upper bound of 0.25. This result means that we can use $L_b = 0.25\|A_b\|^2$ for block $b$, $L_j = 0.25 \sum_{i=1}^m a_{ij}^2$ for single-coordinate blocks, and $H_b = 0.25\, A_b^T A_b$ if we are using a general quadratic norm (notice that computing $H_b$ is again cheaper than computing $L_b$).

D.2.4 Multi-Class Logistic Regression

The Hessian of the multi-class logistic regression objective (D.1) with respect to parameter vectors $x_c$ and $x_{c'}$ can be written as

$$\frac{\partial^2}{\partial x_c\, \partial x_{c'}} f(X) = \sum_{i=1}^m h_{i,c}(X)\,\big(I(c = c') - h_{i,c'}(X)\big)\, a_i a_i^T,$$

where similar to the binary logistic regression case we have defined $h_{i,c} = p(c \mid a_i, X)$. This gives the full Hessian the form

$$\nabla^2 f(X) = \sum_{i=1}^m H_i(X) \otimes a_i a_i^T,$$

where we used $\otimes$ to denote the Kronecker product and where element $(c, c')$ of the $k \times k$ matrix $H_i(X)$ is given by $h_{i,c}(X)\,(I(c = c') - h_{i,c'}(X))$. Bohning's bound [Böhning, 1992] on this matrix is that

$$H_i(X) \preceq \frac{1}{2}\left(I - \frac{1}{k}\mathbf{1}\mathbf{1}^T\right),$$

where $\mathbf{1}$ is a vector of ones and, as before, $k$ is the number of classes. Using this we have

$$\begin{aligned}
\nabla^2 f(X) &\preceq \sum_{i=1}^m \frac{1}{2}\left(I - \frac{1}{k}\mathbf{1}\mathbf{1}^T\right) \otimes a_i a_i^T \\
&= \frac{1}{2}\left(I - \frac{1}{k}\mathbf{1}\mathbf{1}^T\right) \otimes \sum_{i=1}^m a_i a_i^T \\
&= \frac{1}{2}\left(I - \frac{1}{k}\mathbf{1}\mathbf{1}^T\right) \otimes A^T A.
\end{aligned}$$

As before we can take submatrices of this expression as our $H_b$, and we can take eigenvalues of the submatrices as our $L_b$. However, due to the $1/k$ factor we can actually obtain tighter bounds for sub-matrices of the Hessian that do not involve at least two of the classes. In particular, consider a sub-Hessian involving the variables only associated with $k'$ classes for $k' < k$. In this case we can replace the $k \times k$ matrix $(I - (1/k)\mathbf{1}\mathbf{1}^T)$ with the $k' \times k'$ matrix $(I - (1/(k'+1))\mathbf{1}\mathbf{1}^T)$. The "+1" added to $k'$ in the second term effectively groups all the other classes (whose variables are fixed) into a single class (the "+1" is included in Bohning's original paper as he fixes $x_k = 0$ and defines $k$ to be one smaller than the number of classes). This means (for example) that we can take $L_j = 0.25 \sum_{i=1}^m a_{ij}^2$ as in the binary case rather than the slightly-larger diagonal element $0.5(1 - 1/k) \sum_{i=1}^m a_{ij}^2$ in the matrix above.[29]

D.3 Derivation of GSD Rule

In this section we derive a progress bound for twice-differentiable convex functions when we choose and update the block $b_k$ according to the GSD rule with $D_{b,i} = L_i \tau$ (where $\tau$ is the maximum block size).
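As a concrete illustration of this rule, the following sketch takes one GSD step on a small convex quadratic and measures the resulting decrease. Several things here are assumptions made for the example rather than details from the thesis: the block is chosen among a given list of fixed blocks by maximizing $\sum_{i \in b} |\nabla_i f(x)|^2 / (L_i \tau)$, the coordinate constants are taken as $L_i = A_{ii}$ (valid for a quadratic), and the matrix, vector, blocks, and helper names (`gsd_score`, `gsd_step`) are all hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def f(A, c, x):
    """Quadratic objective f(x) = 0.5 x^T A x + c^T x."""
    Ax = matvec(A, x)
    return 0.5 * sum(xi * yi for xi, yi in zip(x, Ax)) + sum(ci * xi for ci, xi in zip(c, x))

def grad(A, c, x):
    """Gradient of the quadratic, A x + c."""
    return [y + ci for y, ci in zip(matvec(A, x), c)]

def gsd_score(g, L, block, tau):
    """Selection score of a block: sum over i in the block of g_i^2 / (L_i * tau)."""
    return sum(g[i] ** 2 / (L[i] * tau) for i in block)

def gsd_step(A, c, x, blocks, tau):
    """Pick the block with the largest score, then update it with step 1 / (L_i * tau)."""
    g = grad(A, c, x)
    L = [A[i][i] for i in range(len(x))]   # coordinate-wise constants for a quadratic
    best = max(blocks, key=lambda b: gsd_score(g, L, b, tau))
    x_new = list(x)
    for i in best:
        x_new[i] -= g[i] / (L[i] * tau)
    return x_new, best, gsd_score(g, L, best, tau)

# Hypothetical problem: A = M^T M is positive semi-definite with a positive diagonal.
M = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 0.0, 3.0]]
A = [[sum(M[r][i] * M[r][j] for r in range(4)) for j in range(4)] for i in range(4)]
c = [1.0, -2.0, 0.5, 3.0]
blocks = [[0, 1], [2, 3]]
tau = 2

x0 = [0.0, 0.0, 0.0, 0.0]
x1, chosen, score = gsd_step(A, c, x0, blocks, tau)
decrease = f(A, c, x0) - f(A, c, x1)
```

On this example the measured `decrease` is at least half of `score`, that is, at least $\frac{1}{2}\sum_{i \in b_k} |\nabla_i f(x^k)|^2 / (L_i \tau)$ for the selected block.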
We start by using the Taylor series representation of $f(x^{k+1})$ in terms of $f(x^k)$ and some $z$ between $x^{k+1}$ and $x^k$ (keeping in mind that these only differ along coordinates in $b_k$),

$$\begin{aligned}
f(x^{k+1}) &= f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{1}{2}(x^{k+1} - x^k)^T \nabla^2_{b_k b_k} f(z)\,(x^{k+1} - x^k) \\
&\le f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{|b_k|}{2} \sum_{i \in b_k} \nabla^2_{ii} f(z)\,(x_i^{k+1} - x_i^k)^2 \\
&\le f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{\tau}{2} \sum_{i \in b_k} \nabla^2_{ii} f(z)\,(x_i^{k+1} - x_i^k)^2 \\
&\le f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{\tau}{2} \sum_{i \in b_k} L_i\,(x_i^{k+1} - x_i^k)^2,
\end{aligned}$$

where the first inequality follows from convexity of $f$, which implies that $\nabla^2_{b_k b_k} f(z)$ is positive semi-definite, and from Lemma 1 of Nesterov's coordinate descent paper [Nesterov, 2010]. The second inequality follows from the definition of $\tau$ and the third follows from the definition of $L_i$. Now using our choice of $D_{b,i} = L_i \tau$ in the update we have for $i \in b_k$ that

$$x_i^{k+1} = x_i^k - \frac{1}{L_i \tau} \nabla_i f(x^k),$$

which yields

$$\begin{aligned}
f(x^{k+1}) &\le f(x^k) - \frac{1}{2\tau} \sum_{i \in b_k} \frac{|\nabla_i f(x^k)|^2}{L_i} \\
&= f(x^k) - \frac{1}{2} \max_b \sum_{i \in b} \frac{|\nabla_i f(x^k)|^2}{L_i \tau} \\
&= f(x^k) - \frac{1}{2}\|\nabla f(x^k)\|_{\mathcal{B}}^2.
\end{aligned}$$

The first equality uses that we are selecting $b_k$ using the GSD rule with $D_{b,i} = L_i \tau$, and the second equality follows from the definition of the mixed norm $\|\cdot\|_{\mathcal{B}}$ from Section 5.2.5 with $H_b = D_b$. This progress bound implies that the convergence rate results in that section also hold.

[29] The binary logistic regression case can conceptually be viewed as a variation on the softmax loss where we fix $x_c = 0$ for one of the classes and thus are always only updating variables from one class. This gives the special case of $0.5\,(I - (1/(k+1))\mathbf{1}\mathbf{1}^T)A^T A = 0.5\,(1 - 0.5)A^T A = 0.25\,A^T A$, the binary logistic regression bound from the previous section.

D.4 Efficiently Testing the Forest Property

In this section we give a method to test whether adding a node to an existing forest maintains the forest property. In this setting our input is an undirected graph $G$ and a set of nodes $b$ whose induced subgraph $G_b$ forms a forest (has no cycles). Given a node $i$, we want to test whether adding $i$ to $b$ will maintain that the induced subgraph is acyclic.
In this section we show how to do this in $O(p)$, where $p$ is the degree of the node $i$.

The method is based on the following simple observations:

• If the new node $i$ introduces a cycle, then it must be part of the cycle. This follows because $G_b$ is assumed to be acyclic, so no cycles can exist that do not involve $i$.

• If $i$ introduces a cycle, we can arbitrarily choose $i$ to be the start and end point of the cycle.

• If the new node $i$ has 1 or fewer neighbours in $b$, then it does not introduce a cycle. With no neighbours it clearly cannot be part of a cycle. With one neighbour, we would have to traverse its one edge more than once to have it start and end a path.

• If the new node $i$ has at least 2 neighbours in $b$ that are part of the same tree, then $i$ introduces a cycle. Specifically, we can construct a cycle as follows: we start at node $i$, go to one of its neighbours, follow a path through the tree to another one of its neighbours in the same tree (such a path exists because trees are connected by definition), and then return to node $i$.

• If the new node $i$ has at least 2 neighbours in $b$ but they are all in different trees, then $i$ does not introduce a cycle. This is similar to the case where $i$ has only one edge: any path that starts and ends at node $i$ would have to traverse one of its edges more than once (because the disjoint trees are not connected to each other).

The above cases suggest that to determine whether adding node $i$ to the forest $b$ maintains the forest property, we only need to test whether node $i$ is connected to two nodes that are part of the same tree in the existing forest. We can do this in $O(p)$ using the following data structures:

1. For each of the $n$ nodes, a list of the adjacent nodes in $G$.
2. A set of $n$ labels in $\{0, 1, 2, \dots, t\}$, where $t$ is the number of trees in the existing forest. This number is set to 0 for nodes that are not in $b$, is set to 1 for nodes in the first tree, is set to 2 for nodes in the second tree, and so on.

Note that there is no ordering to the labels $\{1, 2, \dots, t\}$; each tree is just assigned an arbitrary number that we will use to determine if nodes are in the same tree. We can find all neighbours of node $i$ in $O(p)$ using the adjacency list, and we can count the number of neighbours in each tree in $O(p)$ using the tree numbers. If this count is at least 2 for any tree then the node introduces a cycle, and otherwise it does not.

In the algorithm of Section 5.4.2, we also need to update the data structures after adding a node $i$ to $b$ that maintains the forest property. For this update we need to consider three scenarios:

• If the node $i$ has one neighbour in $b$, we assign it the label of its neighbour.

• If the node $i$ has no neighbours in $b$, we assign it the label $(t + 1)$ since it forms a new tree.

• If the node $i$ has multiple neighbours in $b$, we need to merge all the trees it is connected to.

The first two steps cost $O(1)$, but a naive implementation of the third step would cost $O(n)$ since we could need to re-label almost all of the nodes. Fortunately, we can reduce the cost of this merge step to $O(p)$. This requires a relaxation of the condition that the labels represent disjoint trees. Instead, we only require that nodes with the same label are part of the same tree. This allows multiple labels to be associated with each tree, but using an extra data structure we can still determine if two labels are part of the same tree:

3. A list of $t$ numbers, where element $j$ gives the minimum node number in the tree that $j$ is part of.

Thus, given the labels of two nodes we can determine whether they are part of the same tree in $O(1)$ by checking whether their minimum node numbers agree.
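The cycle test described above can be sketched as follows. Note the substitution: instead of the label table and list of minimum node numbers, this sketch tracks tree membership with a standard union-find (disjoint-set) structure, which plays the same role of answering "are these two nodes in the same tree?" after repeated merges. The class name `ForestTracker`, its methods, and the example graph are hypothetical, and the graph is assumed to be simple (no self-loops or repeated edges). With this simple union-find variant the membership lookups are cheap but amortized rather than strictly $O(1)$.

```python
class ForestTracker:
    """Tracks a node set b whose induced subgraph is a forest (union-find based sketch)."""

    def __init__(self, adjacency):
        self.adj = adjacency   # adjacency[v] = list of neighbours of v in G
        self.parent = {}       # union-find parent pointers, only for nodes already in b

    def _find(self, v):
        """Return the representative of the tree containing v (with path halving)."""
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def keeps_forest(self, i):
        """True if adding node i leaves the induced subgraph acyclic."""
        seen_trees = set()
        for u in self.adj[i]:
            if u in self.parent:            # neighbour already in b
                root = self._find(u)
                if root in seen_trees:      # two neighbours in the same tree: a cycle
                    return False
                seen_trees.add(root)
        return True

    def add(self, i):
        """Add node i if it keeps the forest property; return whether it was added."""
        if i in self.parent or not self.keeps_forest(i):
            return False
        self.parent[i] = i
        for u in self.adj[i]:
            if u != i and u in self.parent:
                self.parent[self._find(u)] = i   # merge the neighbour's tree into i's tree
        return True

# Hypothetical graph: a 4-cycle 0-1-2-3-0 with an extra node 4 attached to node 0.
adjacency = {0: [1, 3, 4], 1: [0, 2], 2: [1, 3], 3: [2, 0], 4: [0]}
tracker = ForestTracker(adjacency)
results = [tracker.add(v) for v in [0, 1, 2, 3, 4]]
```

In this example node 3 is rejected because its two neighbours 0 and 2 already lie in the same tree, while node 4 is accepted since it touches that tree only once.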
Given the list of minimum node numbers, the merge step is simple: we arbitrarily assign the new node $i$ to the tree of one of its neighbours, we find the minimum node number among the $p$ trees that need to be merged, and then we use this as the minimum node number for all $p$ trees. This reduces the cost to $O(p)$.

Given that we can efficiently test the forest property in $O(p)$ for a node with $p$ neighbours, it follows that the total cost of the greedy algorithm from Section 5.4.2 is $O(n \log n + |E|)$ given the gradient vector and adjacency lists. The $O(n \log n)$ factor comes from sorting the gradient values, and the $O(|E|)$ term comes from the $O(p)$ tests, since the degrees $p$ sum to $2|E|$ over all nodes. If this cost is prohibitive, one could simply restrict the number of nodes that we consider adding to the forest to reduce this time.

D.5 Full Experimental Results

In this section we first provide details on the datasets, and then we present our complete set of experimental results.

D.5.1 Datasets

We considered these five datasets:

A: A least squares problem with a data matrix $A \in \mathbb{R}^{m \times n}$ and target $b \in \mathbb{R}^m$,

$$\mathop{\mathrm{argmin}}_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|^2.$$

We set $A$ to be an $m$ by $n$ matrix with entries sampled from a $\mathcal{N}(0, 1)$ distribution (with $m = 1000$ and $n = 10000$). We then added 1 to each entry (to induce a dependency between columns), multiplied each column by a sample from $\mathcal{N}(0, 1)$ multiplied by ten (to induce different Lipschitz constants across the coordinates), and only kept each entry of $A$ non-zero with probability $10 \log(m)/m$. We set $b = Ax + e$, where the entries of $e$ were drawn from a $\mathcal{N}(0, 1)$ distribution while we set 90% of $x$ to zero and drew the remaining values from a $\mathcal{N}(0, 1)$ distribution.

B: A binary logistic regression problem of the form

$$\mathop{\mathrm{argmin}}_{x \in \mathbb{R}^n} \sum_{i=1}^m \log(1 + \exp(-b_i x^T a_i)).$$

We use the data matrix $A$ from the previous dataset (setting row $i$ of $A$ to $a_i^T$), and set $b_i$ to be the sign of $x^T a_i$ using the $x$ used in generating the previous dataset. We then flip the sign of each entry in $b$ with probability 0.1 to make the dataset non-separable.

C: A multi-class logistic regression problem of the form

$$\mathop{\mathrm{argmin}}_{X \in \mathbb{R}^{d \times k}} \sum_{i=1}^m \left[ -x_{b_i}^T a_i + \log\left( \sum_{c=1}^k \exp(x_c^T a_i) \right) \right],$$

see (D.1). We generate a 1000 by 1000 matrix $A$ as in the previous two cases. To generate the $b_i \in \{1, 2, \dots, k\}$ (with $k = 50$), we compute $AX + E$ where the elements of the matrices $X \in \mathbb{R}^{d \times k}$ and $E \in \mathbb{R}^{m \times k}$ are sampled from a standard normal distribution. We then take the index of the maximum entry in each row of that matrix as the class label.

D: A label propagation problem of the form

$$\min_{x_i \in S'} \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n w_{ij}(x_i - x_j)^2,$$

where $x$ is our label vector, $S$ is the set of labels that we do know (these $x_i$ are set to a sample from a normal distribution with a variance of 100), $S'$ is the set of labels that we do not know, and $w_{ij} \ge 0$ are the weights describing how strongly we want the labels $x_i$ and $x_j$ to be similar. We set the non-zero pattern of the $w_{ij}$ so that the graph forms a 50 by 50 lattice structure (setting the non-zero values to 10000). We labeled 100 points, leading to a problem with 2400 variables but where each variable has at most 4 neighbours in the graph.

E: Another label propagation problem for semi-supervised learning in the 'two moons' dataset [Zhou et al., 2003]. We generate 2000 samples from this dataset, randomly label 100 points in the data, and connect each node to its five nearest neighbours (using $w_{ij} = 1$). This results in a very sparse but unstructured graph.

D.5.2 Greedy Rules with Gradient Updates

In Figure D.1 we show the performance of the different methods from Section 5.5.1 on all five datasets with three different block sizes. In Figure D.2 we repeat the experiment but focus only on the FB methods. For each FB method, we plot the performance using our upper bounds on $L_b$ as the step-size (Lb) and using the Lipschitz approximation procedure from Section 5.3.3 (LA). Here we see that the LA methods improve performance when using large block sizes and in cases where the global $L_b$ bound is not tight.

Our third experiment also focused on the FB methods, but considered different ways to partition the variables into fixed blocks. We considered three approaches:

1. Order: just using the variables in their numerical order (which is similar to using a random order for every dataset except Dataset D, where this method groups variables that are adjacent in the lattice).
2. Avg: we compute the coordinate-wise Lipschitz constants $L_i$, and place the largest $L_i$ with the smallest $L_i$ values so that the average $L_i$ values are similar across the blocks.
3. Sort: we sort the $L_i$ values and place the largest values together (and the smallest values together).

We compared many variations on cyclic/random/greedy rules with gradient or matrix updates. In the case of greedy rules with gradient updates, we found that the Sort method tended to perform the best while the Order method tended to perform the worst (see Figure D.3). When using matrix updates or when using cyclic/randomized rules, we found that no partitioning strategy dominated other strategies.

D.5.3 Greedy Rules with Matrix and Newton Updates

In Figure D.4 we show the performance of the different methods from Section 5.5.2 on all five datasets with three different block sizes. In Figure D.5 we repeat this experiment on the two non-quadratic problems, using the Newton direction and a line search rather than matrix updates.
We see that using Newton's method significantly improves performance over matrix updates.

[Plots omitted: each panel shows $f(x) - f^*$ against iterations (0 to 500). Rows: Least Squares on Dataset A, Logistic on Dataset B, Softmax on Dataset C, Quadratic on Dataset D, Quadratic on Dataset E. Columns: blocks of size 5, 50, and 100. Methods: Cyclic-FB, Lipschitz-FB, Random-FB, GS-FB, GSL-FB, Lipschitz-VB, Cyclic-VB, Random-VB, GS-VB, GSL-VB.]

Figure D.1: Comparison of different random and greedy block selection rules on five different problems (rows) with three different blocks (columns) when using gradient updates.

[Plots omitted: same five problems (rows), with blocks of size 5, 20, and 50 (columns). Methods: LA-GS, Lb-GS, LA-GSL, Lb-GSL, LA-Lipschitz, Lb-Lipschitz, LA-Random, Lb-Random, LA-Cyclic, Lb-Cyclic.]

Figure D.2: Comparison of different random and greedy block selection rules with gradient updates and fixed blocks, using two different strategies to estimate $L_b$.

[Plots omitted: same five problems (rows), with blocks of size 5, 20, and 50 (columns). Methods: GS-Sort, GSD-Sort, GSL-Sort, GS-Order, GSD-Order, GSL-Order, GS-Avg, GSD-Avg, GSL-Avg.]

Figure D.3: Comparison of different random and greedy block selection rules with gradient updates and fixed blocks, using three different ways to partition the variables into blocks.

[Plots omitted: same five problems (rows), with blocks of size 5, 50, and 100 (columns). Methods: GSQ-FB, GS-FB, GSL-FB, GSD-FB, GS-VB, GSD-VB, GSQ-VB, GSL-VB.]

Figure D.4: Comparison of different greedy block selection rules when using matrix updates.

[Plots omitted: $f(x) - f^*$ against iterations for Logistic on Dataset B and Softmax on Dataset C, with blocks of size 5, 50, and 100. Methods: GSQ-FB, GS-FB, GSL-FB, GSD-FB, GS-VB, GSD-VB, GSQ-VB, GSL-VB.]
with 100-sized blocks1.0× 10−80.8× 10−50.6× 10−24.9× 1003.9× 103GSQ-FBGS-FBGSL-FBGSD-FBGS-VBGSD-VBGSQ-VBGSL-VBFigure D.5: Comparison of different greedy block selection rules when using Newton updatesand a line search.203