Convex optimization for generalized sparse recovery

by

Ewout van den Berg

M.Sc., Delft University of Technology, The Netherlands, 2003

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Computer Science)

The University of British Columbia (Vancouver)

December 2009

© Ewout van den Berg 2009

Abstract

The past decade has witnessed the emergence of compressed sensing as a way of acquiring sparsely representable signals in a compressed form. These developments have greatly motivated research in sparse signal recovery, which lies at the heart of compressed sensing, and which has recently found its use in altogether new applications.

In the first part of this thesis we study the theoretical aspects of joint-sparse recovery by means of sum-of-norms minimization, and the ReMBo-$\ell_1$ algorithm, which combines boosting techniques with $\ell_1$-minimization. For the sum-of-norms approach we derive necessary and sufficient conditions for recovery, by extending existing results to the joint-sparse setting. We focus in particular on minimization of the sum of $\ell_1$ and $\ell_2$ norms, and give concrete examples where recovery succeeds with one formulation but not with the other. We base our analysis of ReMBo-$\ell_1$ on its geometrical interpretation, which leads to a study of orthant intersections with randomly oriented subspaces. This work establishes a clear picture of the mechanics behind the method, and explains the different aspects of its performance.

The second part and main contribution of this thesis is the development of a framework for solving a wide class of convex optimization problems for sparse recovery. We provide a detailed account of the application of the framework on several problems, but also consider its limitations. The framework has been implemented in the spgl1 algorithm, which is already well established as an effective solver.
Numerical results show that our algorithm is state-of-the-art, and compares favorably even with solvers for the easier, but less natural, Lagrangian formulations.

The last part of this thesis discusses two supporting software packages: sparco, which provides a suite of test problems for sparse recovery, and spot, a Matlab toolbox for the creation and manipulation of linear operators. spot greatly facilitates rapid prototyping in sparse recovery and compressed sensing, where linear operators form the elementary building blocks.

Following the practice of reproducible research, all code used for the experiments and generation of figures is available online at http://www.cs.ubc.ca/labs/scl/thesis/09vandenBerg/

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
    1.1 Signal compression and sparsity
    1.2 Basis pursuit
    1.3 Compressed sensing
        1.3.1 Practical issues
        1.3.2 Examples
    1.4 Generalized sparse recovery
        1.4.1 Nonnegative basis pursuit
        1.4.2 Group and joint-sparse recovery
        1.4.3 Low-rank matrix reconstruction
    1.5 Thesis overview and contributions
2 Theory behind sparse recovery
    2.1 Mutual coherence
    2.2 Null-space properties
    2.3 Optimality conditions
    2.4 Restricted isometry
    2.5 Geometry
        2.5.1 Facial structure and recovery
        2.5.2 Connection to optimality conditions
        2.5.3 Recovery bounds
3 Joint-sparse recovery
    3.1 Uniqueness of sparsest MMV solution
    3.2 Recovery of block-sparse signals
    3.3 MMV recovery using row-norm sums
    3.4 MMV recovery using $\ell_{1,2}$
        3.4.1 Sufficient conditions for recovery via $\ell_{1,2}$
        3.4.2 Counter examples
        3.4.3 Experiments
    3.5 Bridging the gap from $\ell_{1,1}$ to ReMBo
    3.6 Recovery using ReMBo
        3.6.1 Maximum orthant intersections with subspace
        3.6.2 Practical considerations
        3.6.3 Experiments
4 Solvers
    4.1 Basis pursuit (BP)
        4.1.1 Simplex method
        4.1.2 Interior-point algorithms
        4.1.3 Null-space approach
    4.2 Basis pursuit denoise (BP$_\sigma$)
        4.2.1 Second-order cone programming
        4.2.2 Homotopy
        4.2.3 NESTA
    4.3 Regularized basis pursuit (QP$_\lambda$)
        4.3.1 Quadratic programming
        4.3.2 Fixed-point iterations
5 Solver framework
    5.1 Assumptions
    5.2 Approach
        5.2.1 Differentiability of the Pareto curve
        5.2.2 Rationale for the gauge restriction
    5.3 Practical aspects of the framework
        5.3.1 Primal-dual gap
        5.3.2 Accuracy of the gradient
        5.3.3 Local convergence rate
        5.3.4 First root-finding step
    5.4 Solving the subproblems
        5.4.1 Spectral projected gradients
        5.4.2 Projected quasi-Newton
6 Application of the solver framework
    6.1 Basis pursuit denoise
        6.1.1 Polar function
        6.1.2 Projection
    6.2 Joint and group sparsity
        6.2.1 Application – Source localization
        6.2.2 Polar functions of mixed norms
        6.2.3 Projection
    6.3 Sign-restricted formulations
        6.3.1 Application – Mass spectrometry
        6.3.2 Polar function
        6.3.3 Projection
    6.4 Low-rank matrix recovery and completion
        6.4.1 Application – Distance matrix completion
        6.4.2 Polar function
        6.4.3 Projection
7 Experiments
    7.1 Basis pursuit
    7.2 Basis pursuit denoise
    7.3 Joint sparsity
    7.4 Nonnegative basis pursuit
    7.5 Nuclear-norm minimization
8 Sparco and Spot
    8.1 Sparco
        8.1.1 Design objectives
        8.1.2 Test problems
        8.1.3 Implementation
    8.2 Spot
        8.2.1 Multiplication
        8.2.2 Transposition and conjugation
        8.2.3 Addition and subtraction
        8.2.4 Dictionaries and arrays
        8.2.5 Block diagonal operators
        8.2.6 Kronecker products
        8.2.7 Subset assignment and reference
        8.2.8 Elementwise operations
        8.2.9 Solving systems of linear equations
        8.2.10 Application to non-linear operators
9 Conclusions
Bibliography

List of Tables

7.1 Solver performance on a selection of sparco test problems
7.2 Solver performance on mixture of Gaussian and all-one matrix
7.3 Performance of basis pursuit solvers on restricted DCT matrices
7.4 Solutions of a subset of sparco test problems
7.5 Basis pursuit solver performance on subset of sparco problems
7.6 Basis pursuit denoise solvers on subset of sparco problems
7.7 Performance on compressible basis pursuit denoise problems
7.8 Test problem settings for mmv experiments
7.9 Solver performance on approximately sparse mmv problems
7.10 Test problem settings for nonnegative bpdn experiments
7.11 Solver performance on nonnegative bpdn test problems
7.12 Test problem settings for matrix completion experiments
7.13 Solver performance on a set of matrix completion problems
8.1 List of sources for sparco test problems
8.2 Selection of sparco test problems
8.3 Fields in the data structure representing sparco problems
8.4 List of several elementary operators provided by spot

List of Figures

1.1 Cumulative energy of image representation in different bases
1.2 Image reconstruction from truncated coefficients
1.3 Sparse representations in the Dirac and DCT bases
1.4 Example of compressed sensing on a sinusoidal signal
1.5 Illustration of the different aspects of compressed sensing
1.6 Basis pursuit reconstruction of the Shepp-Logan phantom
1.7 Reconstruction of angiogram from compressed measurements
1.8 Reconstruction of missing data using sparse recovery
1.9 Main components of sparse recovery
2.1 Tools and techniques for basis pursuit recovery theory
2.2 Mutual coherence and its recovery guarantees
2.3 The exact recovery condition and its recovery guarantees
2.4 Distribution of singular values and restricted isometry bounds
2.5 Empirical recovery breakdown curve
2.6 The cross-polytope and its image under a linear transformation
2.7 Geometrical interpretation of basis pursuit denoise
3.1 Empirical recovery rates for sum-of-norm minimization
3.2 Joint-sparse recovery distribution for $\ell_{1,2}$ minimization
3.3 Theoretical and experimental recovery rates for boosted $\ell_1$
3.4 Performance model for the ReMBo algorithm
3.5 Intersection of orthants by different subspaces
3.6 Sign pattern generation from a random matrix
3.7 Sign pattern generation from a biased matrix
3.8 Unique orthant counts after a limited number of trials
3.9 Performance of ReMBo after a limited number of iterations
4.1 Classification of solvers for sparse recovery
4.2 Homotopy trajectory and median number of steps required
4.3 Projection arcs of the negative gradient on $\mathbb{R}^2_+$
4.4 Plot of $f(x) = x + \lambda \cdot \partial_x |x|$, and its inverse
5.1 One and two-dimensional gauge functions
5.2 Pareto curve for basis pursuit denoise
5.3 Root-finding on the Pareto curve
5.4 Illustration for proof of Lemma 5.3
5.5 Illustration of conjugates of gauge functions
5.6 Illustration for the proof of Lemma 5.5
5.7 Projection arcs on two- and three-dimensional cross-polytopes
6.1 Soft-thresholding parameter for projection onto the $\ell_1$-ball
6.2 Focusing of planar waves using two types of telescopes
6.3 Angular spectra obtained by different algorithms
6.4 Sensor configuration and arrival directions on the half-sphere
6.5 Iterative refinement of arrival directions
6.6 Empirical breakdown curve for nonnegative signal recovery
6.7 Mass spectrum of propane using electron ionization
6.8 Mass spectra of individual components and their mixture
6.9 Euclidean distance matrix completion
8.1 Overlapping block-diagonal operators

Acknowledgments

I have been very fortunate to have spent the past four years working with my supervisor Michael Friedlander. His unwavering support, enthusiasm, and encouragement to work on topics that are interesting to me, have made working on my Ph.D. a thoroughly enjoyable experience; I couldn’t have asked for a better supervisor.

There has also been plenty of fruitful interdisciplinary collaboration with Özgür Yılmaz, Felix Herrmann, and fellow students Rayan Saab, Gilles Hennenfent, and Mark Schmidt. The numerous discussions, with reasoning from different perspectives, have certainly broadened my view. It has been a privilege to work with such a fine group of people.
I would further like to thank Will Evans and Eldad Haber for being my university examiners, Richard Baraniuk for agreeing to be my external examiner, and Lara Hall for taking care of all administrative matters.

The internet has made academic life more convenient in terms of easy access to information, but has also hastened its pace due to the continuous and relentless flow of new results. Thanks to Igor Carron for writing his blog and to the digital signal processing group at Rice for maintaining the compressive sensing repository. Both resources have been very helpful at centralizing information and creating some order in the chaos.

Lab life was enlivened by the many labmates over the past several years. Many thanks to Hui, Dan, Tim, Tracy, Jelena, Liz, Kathrin, Maria, Shi-dong, Mitch, Essex, and Landon for their humor and for reminding me to get out of the lab from time to time.

A special thanks also to all my friends here at UBC, who have made it a memorable stay. In particular I would like to express my gratitude to Xin-Yi, Hong-Rae, Andy, Suh, Apple, Hamed, Abhijit, Jie, and Rong-Rong for their warm friendship.

Finally, I am indebted to my parents, brother, and sister for their unconditional love and support throughout my entire study. I would like to dedicate this thesis to my late aunt Dorret Vermeij who has always been very supportive and who would have been so proud of my thesis completion.

Ewout van den Berg
Vancouver
December 12, 2009

Chapter 1

Introduction

The main topic of this thesis is sparse recovery. Because this research area is still relatively young, we use this introductory chapter to provide a brief overview of its origins, along with recent developments. We start by noting the importance of sparse signal representations in transform-based coding. From there we move to the basis pursuit method, which aims to find sparse representations for classes of signals that cannot be represented as such in an orthonormal basis.
This naturally leads to sparse recovery and ultimately to the newly established field of compressed sensing in which sparse recovery is firmly ingrained.

1.1 Signal compression and sparsity

In their uncompressed form, digital signals such as audio, video, and still images often require vast amounts of data to be stored. Take for example the $256 \times 256$ grayscale image shown in Figure 1.1(a). Without compression, such an image would require all 65,536 individual grayscale values to be stored. Far more compact (though approximate) representations of the image can be obtained by using techniques such as transform coding, which, for example, forms the basis of the ubiquitous JPEG compression scheme [138]. The storage reduction achieved by these methods is based on the fact that much of the data can be discarded without greatly degrading the appearance of the image or signal. This type of compression, which typically provides an inexact signal representation, is called lossy compression, as opposed to lossless compression.

So, how exactly does it work? Let $s \in \mathbb{R}^m$ be a vectorized representation of our signal; in this case the Euler image. If $s$ were sparse it would suffice to store the location and value of the nonzero entries, giving a lossless compression. Because most practical signals are not exactly sparse we can choose a small positive threshold and treat all entries in $s$ with magnitude below the threshold as zero. Alternatively we can retain a fixed number of entries that are largest in magnitude, thus giving a sparse approximate representation of the signal. The quality of the resulting signal approximation depends on the number of entries stored, and on how fast the magnitude of the signal entries decreases; the faster the decay, the smaller the loss of signal energy, and the more accurate the representation.

In transform-based coding we do not work with the signal $s$ directly, but rather with a representation in terms of an orthonormal transformation.
That is, given an orthonormal basis $\Phi$, we represent $s$ as $\Phi^T x$, where $x := \Phi s$. The reason for doing so is that by judicious choice of $\Phi$, we can affect the decay rate or compressibility of the signal. More formally, denote by $v_{(k)}$ the best $k$-term approximation of a vector $v$, obtained by zeroing out all but the $k$ largest entries in magnitude, with ties resolved based on a lexicographical ordering. We can then represent the signal energy retained by keeping the largest $k$ entries of a vector $v$ as $\gamma_k(v) := \|v_{(k)}\|_2 / \|v\|_2$. Recall that the $\ell_p$-norm $\|\cdot\|_p$ for $p \ge 1$ is given by

$$\|v\|_p = \Big(\sum_i |v_i|^p\Big)^{1/p}. \tag{1.1}$$

Now, based on the orthogonality of $\Phi$ we can express the $k$-term approximation error in $s_{(k)} := \Phi^T x_{(k)}$ as

$$\|s_{(k)} - s\|_2 = \|\Phi^T (x_{(k)} - x)\|_2 = \|x_{(k)} - x\|_2 = \sqrt{1 - \gamma_k(x)^2} \cdot \|s\|_2,$$

where we use the fact that $\|x\|_2 = \|\Phi^T x\|_2 = \|s\|_2$, and that $x_{(k)}$ coincides with $x$ on its support. An appropriate choice of $\Phi$ causes $\gamma_k(\Phi s)$ to increase faster, leading to a smaller approximation error for a given $k$, or, more interestingly, allowing a sparser representation for an equivalent approximation.

To illustrate this, we consider the compression of the Euler image using three different bases $\Phi$: the two-dimensional Daubechies wavelet [49], the two-dimensional discrete cosine transform (DCT), and the Dirac basis (i.e., the identity basis). Figure 1.1(b) plots the value of $\gamma_k(\Phi s)$ as a function of $k$, with the horizontal axis representing $k$ as a percentage of the total number of entries in $s$. It can be seen that the signal representation in both the wavelet and DCT transform has most of the energy concentrated in a small number of coefficients. This results in much more accurate signal representations $\Phi^T x_{(k)}$ compared to $s_{(k)} = I^T x_{(k)}$ with the same number of coefficients, as illustrated in Figure 1.2.

1.2 Basis pursuit

Consider a class of signals $S$, such as audio signals or natural images. For compression we would like to find a basis $\Phi$ such that $\gamma_k(\Phi s)$ is as large as possible for typical signals $s \in S$.
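The retained-energy fraction $\gamma_k$ and the error identity of Section 1.1 can be checked numerically. The following Python sketch is illustrative only: the test signal, the helper names (`dct_matrix`, `best_k_term`, `gamma`), the signal length, and the use of the orthonormal DCT-II as $\Phi$ are assumptions made for this example and are not taken from the experiments in this thesis.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix Phi (one basis vector per row): x = Phi s, s = Phi^T x."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (j + 0.5) * k / n) for j in range(n)])
    return rows

def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def norm2(v):
    return math.sqrt(sum(vi * vi for vi in v))

def best_k_term(v, k):
    """Keep the k largest-magnitude entries of v (ties go to the lowest index)."""
    keep = sorted(range(len(v)), key=lambda i: (-abs(v[i]), i))[:k]
    out = [0.0] * len(v)
    for i in keep:
        out[i] = v[i]
    return out

def gamma(v, k):
    """Fraction of the 2-norm retained by the best k-term approximation."""
    return norm2(best_k_term(v, k)) / norm2(v)

# Hypothetical smooth test signal (a bump plus a cosine).
n, k = 64, 8
s = [math.exp(-((j - 20) / 6.0) ** 2) + 0.5 * math.cos(2 * math.pi * 3 * j / n)
     for j in range(n)]

phi = dct_matrix(n)
x = matvec(phi, s)                               # x = Phi s
s_k = matvec(transpose(phi), best_k_term(x, k))  # s_(k) = Phi^T x_(k)

err = norm2([a - b for a, b in zip(s_k, s)])
predicted = math.sqrt(max(0.0, 1.0 - gamma(x, k) ** 2)) * norm2(s)
print(err, predicted)  # the two values agree up to rounding
```

Calling `gamma(s, k)` on the raw samples gives the corresponding quantity for the Dirac basis, so the same helpers can be used to compare bases for a given signal.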
Now, suppose that the class of signals is simply too large to be accurately represented in any basis. As an example, let $S_1$ and $S_2$ be all signals that have a $k$-sparse representation in respectively the Dirac basis $\Phi_1 \equiv I$ and DCT basis $\Phi_2$. These bases are such that signals with a sparse representation in the one are necessarily dense in the other (see for example [62]). This is illustrated in Figure 1.3, which shows on the left a signal, and on the right the corresponding DCT coefficients. The left side also shows the Dirac coefficients themselves since $\Phi_1 = I$. It is clear that the one-sparse Dirac signal in plot (a) has a fully dense representation in the DCT basis, shown in plot (b). On the other hand, the sparse DCT representation in plot (d) is fully dense in the Dirac basis (c). The combined signal, shown in plot (e) clearly has no sparse representation in either basis, even though it only differs from (c) at the one location indicated by the arrow. This lack of sparse representation is not so much due to the choice of basis, but a consequence of the class of signals: there simply does not exist any basis that can sparsely represent all signals in the set $S := S_1 + S_2$.

[Figure 1.1: Picture of Euler (a), and (b) the cumulative energy contributed by sorted coefficients in different bases. Panel (b) plots the fraction of energy against the percentage of coefficients for the wavelet, DCT, and Dirac bases.]

This inherent limitation of bases led people in the signal processing community to consider alternative ways of representing signals. A natural extension of bases are overcomplete dictionaries. These can be motivated by considering the representation of a signal $s$ in basis $A$:

$$s = Ax = \sum_{i=1}^{n} x_i A_{\downarrow i}, \tag{1.2}$$

where $A_{\downarrow i}$ denotes the $i$th column of $A$. The freedom of choosing $x_i$ is clearly limited by the number $n$ of columns or atoms in $A$; increasing $n$ gives much more flexibility.
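The Dirac and DCT discussion and the atom-wise expansion (1.2) can be reproduced on a small scale. The sketch below is a toy illustration under stated assumptions: the length $n = 16$, the chosen atom indices and weights, and the helper names `dct_matrix`, `atom`, and `count_nonzeros` are hypothetical, and the orthonormal DCT-II stands in for $\Phi_2$. It builds a signal from one Dirac atom and one DCT atom and counts the nonzero coefficients in each basis.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix Phi (one basis vector per row)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (j + 0.5) * k / n) for j in range(n)])
    return rows

def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def count_nonzeros(v, tol=1e-12):
    return sum(1 for vi in v if abs(vi) > tol)

n = 16
phi = dct_matrix(n)

# A one-sparse Dirac signal and its DCT coefficients.
spike = [0.0] * n
spike[5] = 1.0
spike_dct_nnz = count_nonzeros(matvec(phi, spike))

def atom(i):
    """Column i of the n-by-2n matrix whose first n columns are Dirac atoms
    and whose last n columns are DCT atoms (rows of phi)."""
    if i < n:
        return [1.0 if j == i else 0.0 for j in range(n)]
    return list(phi[i - n])

# Equation (1.2): s is a weighted sum of atoms; here only two weights are nonzero.
x_sparse = {5: 1.0, n + 3: 2.0}   # atom index -> coefficient
s = [0.0] * n
for i, xi in x_sparse.items():
    s = [sj + xi * aj for sj, aj in zip(s, atom(i))]

dirac_nnz = count_nonzeros(s)            # coefficients in the Dirac basis
dct_nnz = count_nonzeros(matvec(phi, s)) # coefficients in the DCT basis
print(spike_dct_nnz, dirac_nnz, dct_nnz, len(x_sparse))
```

In this toy setting the mixed signal needs every coefficient in either basis alone, while two coefficients suffice once atoms from both bases are available.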
This leads to the notion of overcomplete dictionaries, represented by $m \times n$ matrices $A$, with $m \le n$. When expressing a signal $s$ in terms of an overcomplete dictionary, its coefficients $x$ are no longer unique. In fact, provided that $s$ is in the range of $A$, there will be infinitely many representations. This is unlike the case of orthogonal bases, or invertible matrices in general, where there is exactly one such representation. Given that we are trying to obtain a compressed representation of $s$, it makes sense to find the sparsest $x$ that satisfies $Ax = s$. This can be done by solving the following problem with $b = s$:

$$\underset{x}{\text{minimize}} \quad \|x\|_0 \quad \text{subject to} \quad Ax = b, \tag{1.3}$$

where $\|x\|_0$ denotes the number of nonzero entries in $x$. The solution to this problem may still contain $m$ nonzero entries, which is equal to the number of entries in $s$. The same typically holds true in transform-based coding, where $x := \Phi s$ is compressible, but not sparse. In the latter setting we sparsify $x$ by discarding the smaller entries as a post-processing step. In sparsity-minimizing formulations the same can be achieved directly by either fixing the maximum sparsity level and finding $x$ that minimizes the misfit, or by finding the sparsest $x$ satisfying a given misfit. The second approach yields the following generalization of (1.3):

$$\underset{x}{\text{minimize}} \quad \|x\|_0 \quad \text{subject to} \quad \|Ax - b\|_2 \le \sigma, \tag{1.4}$$

where $\sigma$ is the maximum permissible misfit.

[Figure 1.2: Reconstruction of the Euler image based on (from top to bottom) the largest 20, 5, and 2 percent of the coefficients of (from left to right) the Daubechies-8 wavelet, discrete cosine transform, and Dirac basis. The label below each figure gives the value of $\sqrt{1 - \gamma_k(\Phi s)^2}$. Labels for 20, 5, and 2 percent respectively: wavelet 0.0202, 0.0645, 0.1030; DCT 0.0428, 0.0912, 0.1240; Dirac 0.5941, 0.8652, 0.9435.]

Unfortunately, both (1.3) and (1.4) are combinatorial and were shown to be NP-hard by Natarajan [115]. In their seminal paper, Chen et al. [39] propose a convex relaxation of (1.3), called basis pursuit (BP):

$$\underset{x}{\text{minimize}} \quad \|x\|_1 \quad \text{subject to} \quad Ax = b. \tag{BP}$$

In contrast to (1.3), this formulation can be practically solved using techniques from convex optimization (see Chapter 4 for more details), and is therefore tractable. Another important property is that the non-differentiability of the $\ell_1$-norm along the coordinate axes (where one or more entries of $x$ are zero) induces sparsity in the solution. This property has long been known empirically in geophysics, and was already used in the 1970s by Claerbout and Muir [40], and Taylor et al. [139]. The analogous $\ell_1$ relaxation of (1.4) is given by

$$\underset{x}{\text{minimize}} \quad \|x\|_1 \quad \text{subject to} \quad \|Ax - b\|_2 \le \sigma, \tag{BP$_\sigma$}$$

which we, somewhat against convention, refer to as the basis pursuit denoise (BPDN) problem. Commonly this label is applied to the regularized basis pursuit problem introduced originally by Chen et al. [39]:

$$\underset{x}{\text{minimize}} \quad \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1, \quad \text{with } \lambda > 0. \tag{QP$_\lambda$}$$

This latter formulation can be interpreted as the Lagrangian form of (BP$_\sigma$), and the two formulations are equivalent in the sense that for any $\sigma > 0$ there exists a $\lambda$ such that (BP$_\sigma$) and (QP$_\lambda$) have the same solution (or set of solutions if the columns in $A$ are not in general position).

Going back to our earlier DCT-Dirac example, we construct a dictionary $A = [\Phi_1, \Phi_2]$, consisting of the atoms of both the Dirac basis $\Phi_1$ and DCT basis $\Phi_2$. We then solve (BP), with $A$ and $b$ as above, to obtain the least $\ell_1$-norm coefficient vector $x$ satisfying $Ax = b$. The resulting solution $x$, plotted in Figure 1.3(g), is exactly what we hoped it would be: it has a single nonzero Dirac coefficient in the first half and a single DCT coefficient in the second half. In this case we conclude that $\ell_1$ minimization has in fact found the sparsest possible solution, and therefore the solution to the $\ell_0$ minimization problem (1.3). A natural question then, is to ask whether this was simply a fortuitous example or whether there are conditions under which the two formulations have an identical unique solution. It turns out that, under certain conditions, this indeed holds and we discuss this topic at length in Chapter 2. As a comparison, we also computed the least $\ell_2$-norm solution satisfying $Ax = b$. This solution is entirely nonzero and given by the vector $[b;\ \Phi_2^T b]/2$.

1.3 Compressed sensing

From the illustration on the Euler image we see that a mere 5 to 20 percent of the wavelet coefficients suffices to accurately represent this signal, and the same applies to a large class of natural images. The field of compressed sensing (CS) [25, 58] stems from the realization that the conventional process of acquiring signals such as images is inherently wasteful. That is, for a $p \times q$ image we first record all $p \cdot q$ pixels, and then discard roughly 90 percent of the underlying coefficients during compression. With $\Phi$ an $n \times n$ basis in which the signals of interest have a sparse or compressible representation, the full encoding-decoding cycle is summarized as follows:

1. Measure $s$
2. Compute $x = \Phi s$
3. Set approximate coefficient vector $x_{(k)}$
4. Construct signal $\hat{s} = \Phi^T x_{(k)}$.

The intention behind compressed sensing is to sample the data in such a way that we only record the information necessary to reconstruct the signal [58]. Instead of measuring signal $s$ directly, and processing $x = \Phi s$, compressed sensing proceeds by recording $b = Ms$, where $M$ is a suitable $m \times n$ measurement matrix. By choosing $m < n$, this immediately gives a compressed measurement vector $b$ of length $m$ instead of $n$. Assuming that $\Phi$ permits a sparse (approximate) representation of $s$, we can decode $b$ by finding a sparse representation $x$ by solving (1.4) or (BP$_\sigma$), with $A = M\Phi^T$. From the resulting coefficients we can obtain the decoded signal by computing $\hat{s} := \Phi^T x$. Summarizing the above, we have the following procedure:

1. Measure $b = Ms$
2. Determine $x$ by solving (BP$_\sigma$) with $A := M\Phi^T$
3. Construct signal $\hat{s} := \Phi^T x$.

In the compressed sensing approach, in contrast to the conventional scheme, most of the effort is spent in the decoding phase, where we need to find a sparse solution to the system $Ax = b$. On the other hand, the encoding phase is nonadaptive and does not need to analyze the signal in order to determine the final encoding.

[Figure 1.3: Plot of (a) a signal that is sparse in the Dirac basis, along with (b) its coefficients in the DCT domain. Likewise, (c) a signal that (d) has sparse coefficients in the DCT domain, followed by (e) the summation of the signals and (f) its coefficients. Finally, (g) the minimum $\ell_1$-norm representation in a dictionary containing the atoms of both bases. The left column shows the Dirac domain and the right column the DCT domain.]

[Figure 1.4: Compressed sensing applied to a sinusoidal signal $s$ using a $48 \times 256$ measurement matrix $M$, with (a) measurement $b = Ms$, (b) coefficients $x$ obtained by solving (BP), and (c) recovered signal $Bx$. Likewise (d–f) for the same signal with a single impulse, using an overcomplete sparsity basis $B$ and a $60 \times 256$ measurement matrix $M$.]

Interestingly, the absence of the sparsity basis $\Phi$ in the encoding phase of compressed sensing allows us to acquire a signal and choose the most suitable basis $\Phi$, or even overdetermined dictionary $B$, during the decoding phase. The measurement matrix, on the other hand, is used during encoding and therefore needs to be fixed.
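The encoding side of the compressed sensing procedure can be sketched in a few lines. The Python example below is a minimal illustration under stated assumptions: the sizes $n = 64$ and $m = 16$, the Gaussian measurement matrix, the orthonormal DCT-II used as $\Phi$, and the hand-picked sparse coefficient vector `x0` are all hypothetical. The decoding step, which requires a solver for (BP) or (BP$_\sigma$), is deliberately left out; the sketch only confirms that the sparse coefficient vector is feasible for the decoder's system $Ax = b$ with $A = M\Phi^T$.

```python
import math
import random

def dct_matrix(n):
    """Orthonormal DCT-II matrix Phi (one basis vector per row)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (j + 0.5) * k / n) for j in range(n)])
    return rows

n, m = 64, 16                 # signal length and number of measurements (m < n)
rng = random.Random(0)        # fixed seed so the sketch is repeatable
phi = dct_matrix(n)

# Hypothetical signal with a two-sparse representation: s = Phi^T x0.
x0 = [0.0] * n
x0[3], x0[10] = 1.5, -0.7
s = [sum(phi[k][j] * x0[k] for k in range(n)) for j in range(n)]

# Encoding: nonadaptive, uses only the measurement matrix M, never Phi.
M = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
b = [sum(M[i][j] * s[j] for j in range(n)) for i in range(m)]

# Decoding side: the sparsity basis enters only here, through A = M Phi^T.
A = [[sum(M[i][j] * phi[k][j] for j in range(n)) for k in range(n)] for i in range(m)]

# The sparse x0 satisfies the underdetermined system A x = b.
Ax0 = [sum(A[i][k] * x0[k] for k in range(n)) for i in range(m)]
residual = math.sqrt(sum((p - q) ** 2 for p, q in zip(Ax0, b)))
print(len(b), residual)
```

Because $\Phi$ appears only in the construction of `A`, a different sparsity basis could be substituted at decoding time without changing the recorded vector `b`.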
To illustrate the compressed sensing methodology, we apply it to the sparse DCT signal of Figure 1.3(c). This is done by first generating a $48 \times 256$ measurement matrix $M$ with its entries randomly drawn from the normal distribution. We then compute the observation vector $b = Ms$, which is plotted in Figure 1.4(a). Because we know that the original signal $s$ has an exact sparse representation, we then solve the basis pursuit formulation with $A := MB$, where $B^T = \Phi_2$ is the DCT basis. This gives the coefficient vector $x$ shown in Figure 1.4(b). The decoded signal obtained by computing $Bx$ is shown in plot (c), and in this case corresponds exactly to the original signal.

We now consider taking compressed measurements of the mixed DCT-Dirac signal of Figure 1.3(e). We choose $B = [\Phi_1^T, \Phi_2^T]$ as the sparsity basis (see Footnote 1), and draw a random $64 \times 256$ measurement matrix $M$. Proceeding as above, this gives the measurement, coefficient, and decoded signal vectors given in Figure 1.4(d–f). Due to the existence of an exact sparse coefficient vector and the fact that $m$ is sufficiently large, we again recover the original signal $s$.

Footnote 1: In this context we prefer to use the term ‘basis’ rather than ‘dictionary’, despite the fact that the underlying matrix is nonsquare.

[Figure 1.5: Graphical representation of the different aspects of compressed sensing: sparse recovery, analog measurement, quantization, and matrix design. In this thesis we will predominantly be concerned with the highlighted (shaded) topic of sparse recovery.]

1.3.1 Practical issues

Besides improved theoretical recovery conditions, and faster sparsity-promoting solvers, there are a number of important practical issues in compressed sensing that we have so far ignored. These are summarized in Figure 1.5 and discussed below.

Taking measurements

Recall that the idea behind compressed sensing is to sample only a limited number of indirect measurements, rather than the entire signal $s$.
This is done by taking inner products of $s$ with a set of vectors or functions $\{f_1, \ldots, f_m\}$, i.e.,

$$b_i = \langle f_i, s \rangle. \tag{1.5}$$

For decoding we construct a measurement matrix $M$, whose rows represent discretized versions of $f_i$. That is, $M^{i\rightarrow} := \hat{f}_i$, where $M^{i\rightarrow}$ denotes the (column) vector corresponding to the $i$th row of $M$. The inner product in (1.5) cannot be digitally evaluated as $\sum_j M_{i,j} s_j$, because this requires access to all individually sampled entries $s_j$, which defeats the purpose of CS. Therefore, for compressed sensing to work in practice, these inner products must somehow be evaluated by means of an analog process. At present, not much work has been done on how to solve this issue. There are, however, a number of specific settings in which this can be done efficiently. We will discuss some of these in Section 1.3.2.

Quantization

Once analog measurements have been made, they need to be quantized before they are suitable for digital processing. This can be done, for example, by taking a codebook $C$ of discrete values and mapping each analog value $v_i$ to the nearest element in $C$:

$$b_i \in \operatorname*{argmin}_{\kappa \in C} \; |v_i - \kappa|,$$

breaking ties in some predetermined manner. This step in the encoding of the compressed signal obviously affects the decoding process and reduces the accuracy with which the original signal can be reconstructed. Recent work by Dai et al. [44], Goyal et al. [82], and Boufounos and Baraniuk [19] considers the theoretical implications of quantization in compressed sensing. Meanwhile, Zymnis et al. [156] and Jacques et al. [97] have proposed ways of explicitly incorporating quantization into numerical solvers. One alternative way of dealing with quantization in solvers is to model the observation as $b = v + e$, where $e$ represents a vector of additive noise due to quantization. However, this excludes prior information about the measurement process and is unlikely to achieve the best possible results.
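The nearest-codebook mapping described above can be sketched in a few lines of Python. This is only an illustration: the uniform codebook, the sample values, and the tie rule (prefer the smaller codebook value) are assumptions made here, not choices taken from the text.

```python
def quantize(values, codebook):
    """Map each analog value to the nearest codebook element.

    Ties are broken in favor of the smaller codebook value. This is one
    possible 'predetermined manner' and is an assumption of this sketch.
    """
    levels = sorted(codebook)
    # min() returns the first minimizer, so scanning the sorted levels
    # selects the smaller level whenever two levels are equally close.
    return [min(levels, key=lambda kappa: abs(v - kappa)) for v in values]


# Hypothetical uniform codebook with spacing 0.5 on [-1, 1].
codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]
v = [0.26, -0.74, 0.9, 0.25]            # illustrative analog measurements
b = quantize(v, codebook)               # quantized observations
e = [bi - vi for bi, vi in zip(b, v)]   # quantization error in b = v + e
```

For a uniform codebook with spacing 0.5, every entry of `e` is bounded in magnitude by 0.25, which is the kind of prior information that the additive-noise model $b = v + e$ does not exploit.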
Measurement matrix

Because of limitations on analog hardware [20], only certain measurement ‘matrices’ can be used. In the larger context of sparse recovery using basis pursuit or equivalent formulations, however, there is much more freedom. Properties of $A$ that are important include the amount of data required to store the matrix, whether or not it facilitates an implicit representation through a fast routine that evaluates its matrix-vector products, and most importantly, whether or not it satisfies conditions that guarantee near or exact recovery for certain types of signals. In the course of Chapter 2, which discusses theoretical results in sparse recovery, we mention known matrix classes that satisfy given conditions. Typically these matrices have probabilistic constructions and satisfy conditions only with high probability. Recently a number of explicit matrix constructions that are guaranteed to satisfy certain equivalence conditions have been proposed by DeVore [53]. For a survey of known constructions and theoretical guarantees, see Berinde et al. [14].

Sparse recovery

A crucial component of compressed sensing is an algorithm for the recovery, or decoding, of coefficient vector $x$ from the compressed measurement $b$. When $x$ is assumed to be sparse, this can be done using sparse recovery techniques, such as basis pursuit denoise. In this thesis we are predominantly concerned with sparse recovery using convex optimization. Various approaches are reviewed in Chapter 4. In Chapter 5 we develop a new convex optimization framework for sparse recovery. Besides these methods, a number of sparse recovery algorithms have been proposed that work only in combination with particular measurement matrices. We forego a discussion of these types of algorithms and refer instead to Indyk [96], Jafarpour et al. [98], and Howard et al. [95].

1.3.2 Examples

There are a number of settings where the compressed acquisition of signals corresponding to (1.5) can be done in analog.
We discuss some of these applications in the following three sections.

MRI and Angiography

The most natural setting for compressed sensing to date is in magnetic resonance imaging (MRI). This application was first suggested by Candès et al. [32] and was further developed by Lustig et al. [105]. The latter article also gives an excellent introduction to the topic. The key property of MRI that makes it fit the CS framework is that the signal of interest is measured directly in the frequency domain, i.e., the k-space. This means that rather than sampling the image directly, we sample points in k-space. As an illustration of this we consider the modified Shepp-Logan phantom [141, p.199], depicted in Figure 1.6(a). The k-space representation of this image is shown in plot (b) and it is from this space that we obtain our measurements. Time restrictions on the scans prevent us from sampling the entire space and we can only sample along trajectories through k-space. These trajectories are limited by physiological and hardware constraints [105], and typical trajectories have the form of straight lines, radial lines, or spirals. For our example we ignore such restrictions and use the random pattern given in Figure 1.6(c). This pattern takes into account the fact that most of the image energy in k-space is concentrated close to the center.

Mathematically, let $s$ represent the image, $F$ the two-dimensional Fourier transform, and $R$ a restriction matrix corresponding to the frequencies measured. Then the observed signal is given by $b = RFs$. Setting $M = RF$ gives us our measurement matrix, which represents the actual sampling process used in practice; conversion to the digital domain does not happen until after inner products with the rows of the matrix have been evaluated. Given the partial-Fourier coefficients, the easiest way to obtain an image is using filtered backprojection [32].
In this approach we simply compute $\hat{s} = F^* R^T b$, and essentially assume that all missing coefficients are zero. The result of this operation is shown in Figure 1.6(d). While fast, this method often introduces severe undersampling artifacts, as shown by plot (e), which gives the error in the reconstruction.

Before we can apply sparse recovery for the reconstruction of the signal, we first need a sparsity basis $B$. Images obtained through magnetic resonance (MR) often have sparse representations in a wavelet basis. For our example we use a sparsifying transform similar to that used in the JPEG-2000 standard [138]: we divide the image into 32 × 32 blocks and apply a two-dimensional Haar wavelet transformation to each block. Letting $B$ represent this operator, we can apply basis pursuit with $A := MB = RFB$ to find coefficients $x$ and signal $\hat{s} = Bx$. When applied to the MRI example this gives an exact recovery of the phantom image, as shown in Figure 1.6(f). An alternative way of recovering MR images is to use total-variation minimization [30, 104], which takes advantage of the fact that the gradient of these images is often sparse.

[Figure 1.6: panels (a)-(f).]
Figure 1.6: Plots of (a) 512 × 512 modified Shepp-Logan phantom, (b) logarithmic k-space representation, (c) sampling mask covering 20% of all coefficients, (d) reconstruction using filtered backprojection, (e) error in reconstruction, and (f) exact reconstruction using basis pursuit.

Another example of the use of CS in medical imaging is the visualization of blood vessels and organs using angiograms. The use of contrast agents injected into the body prior to the imaging process causes angiograms to be inherently sparse in the pixel representation [105]. Figures 1.7(a,b) respectively show a synthetic angiogram and its representation in k-space. Sampling along the set of radial lines in plot (c) followed by backprojection gives the result shown in (d).
Using compressed sensing, in this case with the sparsity basis set to the identity matrix, we recover the sparse image shown in plot (f).

[Figure 1.7: panels (a)-(f).]
Figure 1.7: Plot of (a) 256 × 256 synthetic angiogram in reverse color, (b) k-space representation, (c) sampling pattern along 32 radial lines giving 12.1% of all coefficients, (d) reconstruction using filtered backprojection, (e) error in reconstruction, and (f) sparse reconstruction using basis pursuit.

Single-pixel camera

A second example of the application of compressed sensing is the single-pixel camera architecture suggested by Takhar et al. [137]. Before looking at this new architecture, let us look at the imaging steps of conventional digital cameras. The key component of these cameras is a sensor array that converts photons at visual wavelengths into electrons [66]. An image can be formed by opening the shutter for a brief period of time and measuring the amount of light that hits each sensor. For more advanced imaging systems, requiring observations beyond the visual wavelengths, sensors become more complicated and much larger. This typically means that fewer sensors are available, thus leading to a reduced imaging resolution. The single-pixel architecture overcomes this problem by using a digital micromirror device to reflect part of the incident light beams onto a single sensor, which records the total reflected light intensity. A number of such measurements are taken with different mirror orientations.

The micromirror device consists of an array of tiny mirrors that can each be oriented at two different angles. At one angle a mirror reflects light away from the sensor and at the other angle it reflects light onto the sensor. Let $s$ be the light intensities impinging upon the mirrors and $m$ be a vector that contains 0 or 1 depending on whether the corresponding mirror is positioned away from or towards the sensor, respectively.
Then the amount of light recorded by the sensor is given by $m^T s$. This is exactly of the form (1.5), and by changing the mirror orientations, we effectively create the rows of a measurement matrix $M$. Note that by changing the position of the mirror during a measurement we can effectively create arbitrary weights between 0 and 1. Negative values can be obtained by subtracting one measurement vector from another.

DNA microarrays

DNA microarrays are used for microbe detection and classification, and are made up of a number of genetic sensors or spots containing short DNA oligonucleotides called probes [132]. These probes are chosen to complement sequences in the target DNA, which is tagged with fluorescent labels. The relative abundance of the sequences in the target is determined by flushing the target sample against the array and letting the complementary strands bind or hybridize. The remaining solvent is then washed away and the array is read out by measuring the illumination pattern at each spot [45].

In conventional DNA microarrays it is important that each sensor is carefully designed to respond to a single target and to avoid cross-hybridization with similar but non-complementary sequences. For the compressed sensing DNA microarray introduced by Sheikh et al. [132], this requirement is relaxed and each spot is designed to respond to a larger group of targets. The measurement matrix $M$ consists of entries $m_{ij}$, which represent the probability that target $j$ hybridizes with spot $i$. The inner product of each row with the sample corresponds to the illumination intensity measured at the associated spot. Because we can expect each sample to contain only a limited number of target agents, no sparsity basis is required.

1.4 Generalized sparse recovery

The use of sparse recovery is not limited to compressed sensing. Instead there are a number of applications that can be formulated as (1.3), or, if some level of misfit between $Ax$ and $b$ is allowed, as (1.4).
By choosing an appropriate $A$ we can, for example, apply the framework to data interpolation or inpainting [71]. In these problems we are given an incomplete data set, and the goal is to complete the data set in such a way that it is compatible with the known data. That is, letting $s$ be the original complete signal, and $R$ be a restriction matrix, we observe $b = Rs$. By choosing a sparsity basis $B$ we can set $A = RB$, and use basis pursuit to recover a sparse $x$ satisfying $b = Ax$. This may be seen as another application of compressed sensing by interpreting $R$ as a measurement matrix, but it seems to have a wider applicability than compressed signal acquisition.

For our first example of interpolation we go back to the sinusoidal signal of Figure 1.3(c). For this example we create a 48 × 256 restriction matrix consisting of rows with a single nonzero value at the sampled location. The resulting measurements, plotted at their original location, are shown in Figure 1.8(b). Basis-pursuit recovery of the entire signal using $A = R\Phi_2^T$, with $\Phi_2$ the DCT transform, gives an exact reconstruction of the original signal, shown in (c).

Interpolation of seismic data with missing traces using sparse recovery was proposed by Hennenfent and Herrmann [91, 92]. Such missing data can be due to the breakdown of geophones (sensors), or the physical difficulty of placing them at desired locations. A full synthetic data set, courtesy of Eric Verschuur, is shown in Figure 1.8(d). The restriction operator in this example is such that it retains only a small number of columns of the vectorized image, as illustrated in Figure 1.8(e). Accurate recovery is possible due to the fact that seismic images are highly compressible in the curvelet domain [27, 92]. The signal recovered using basis pursuit is shown in Figure 1.8(f).

Applications that are less related to compressed sensing often require different formulations.
Below we will introduce convex relaxations of three different sparse-recovery formulations that are discussed later in this thesis.

1.4.1 Nonnegative basis pursuit

The success of basis pursuit in recovering sparse vectors is due to the fact that $\ell_1$-norm minimization quite capably captures the prior information that the solution be sparse. By adding more prior information on the coefficients we can reduce the feasible set of solutions, thus making recovery more likely. One such prior is to require $x$ to be nonnegative; this is the case, for example, in angiogram images, where it is physically impossible to have negative values. Modifying the basis pursuit formulation accordingly gives

$$\underset{x}{\text{minimize}} \quad e^T x \quad \text{subject to} \quad Ax = b,\ x \ge 0, \tag{1.6}$$

and likewise for basis pursuit denoise:

$$\underset{x}{\text{minimize}} \quad e^T x \quad \text{subject to} \quad \|Ax - b\|_2 \le \sigma,\ x \ge 0.$$

The same formulation can be used to incorporate specific sign restrictions on each coefficient, by appropriately redefining the columns of $A$.

[Figure 1.8: panels (a)-(f).]
Figure 1.8: Reconstruction of missing data, showing (a) the original signal, which has a 10-sparse DCT representation, (b) observed data, obtained by applying a 48 × 256 restriction matrix (81.25% missing), and (c) exact recovery. For (d) a seismic data set with (e) 32% of traces (columns) missing, along with (f) the interpolated result. Seismic data set courtesy of Eric Verschuur.

1.4.2 Group and joint-sparse recovery

In areas such as neuroimaging using functional MRI or magnetoencephalography [80], sparsity arises predominantly in fixed patterns: certain parts of the brain that are associated with different functions are mostly active simultaneously [42]. In this type of sparsity we are not so much interested in sparsity within functional groups but more in sparsity in terms of entire groups.
This can be achieved by grouping the variables into disjoint sets $\Lambda_i$, and solving

$$\underset{x,\,s}{\text{minimize}} \quad \|s\|_0 \quad \text{subject to} \quad \|Ax - b\|_2 \le \sigma,\ s_i = \|x_{\Lambda_i}\|,$$

for any norm on $x_{\Lambda_i}$. This can be relaxed to a convex formulation by writing

$$\underset{x}{\text{minimize}} \quad \sum_i \|x_{\Lambda_i}\|_2 \quad \text{subject to} \quad \|Ax - b\|_2 \le \sigma. \tag{1.7}$$

A special case of group sparsity arises in the situation where we are given a number of measurement vectors $B^{\downarrow i}$, $i = 1, \ldots, k$, such that the coefficients $X^{\downarrow i}$ are known to be jointly sparse, i.e., have nonzero entries clustered in a small number of rows. This joint-sparse problem can be rewritten as a group-sparse problem by vectorizing the coefficients, grouping the entries in each row into the same set, and setting $A$ to $I \otimes A$, where $\otimes$ denotes the Kronecker product. A more convenient formulation for the joint-sparse problem, however, is given by

$$\underset{X \in \mathbb{R}^{n \times k}}{\text{minimize}} \quad \|X\|_{1,2} \quad \text{subject to} \quad \|AX - B\|_F \le \sigma, \tag{1.8}$$

where $\|X\|_{1,2}$ is a specific instance of the $\ell_{p,q}$-norm, defined by

$$\|X\|_{p,q} = \Big( \sum_i \|X^{i\rightarrow}\|_q^p \Big)^{1/p}. \tag{1.9}$$

Formulations other than (1.8) are possible: Tropp [145] suggests the use of $\ell_{1,\infty}$, whereas Cotter et al. [42] adopt $\ell_{p,2}$, with $p \le 1$. In Section 6.2 we propose an algorithm for solving the $\ell_{1,2}$ formulation, and in Section 3.4 study the conditions for sparse recovery using $\ell_{1,2}$-minimization:

$$\underset{X}{\text{minimize}} \quad \|X\|_{1,2} \quad \text{subject to} \quad AX = B, \tag{1.10}$$

which is a special case of the $\ell_{p,q}$-minimization problem

$$\underset{X}{\text{minimize}} \quad \|X\|_{p,q} \quad \text{subject to} \quad AX = B. \tag{1.11}$$

1.4.3 Low-rank matrix reconstruction

An altogether different area in which convex relaxation can sometimes be used to solve otherwise intractable problems is low-rank matrix completion. In this problem we are given an incomplete $m \times n$ matrix $B$, of which only the entries $(i, j) \in \Omega$ are known. Based on this information we want to reconstruct the missing entries. Assuming that $B$ is low-rank, the ideal, but unfortunately NP-hard [29], formulation for this is

$$\underset{X \in \mathbb{R}^{m \times n}}{\text{minimize}} \quad \operatorname{rank}(X) \quad \text{subject to} \quad X_\Omega = B_\Omega. \tag{1.12}$$

Fazel [75] proves that the nearest convex relaxation of the rank is given by the nuclear norm $\|X\|_*$ of $X$. This norm is a specific instance of the Schatten $p$-norm, and is defined as the sum of the singular values of $X$. Applying this convex relaxation to (1.12) leads to the nuclear norm minimization problem [29]

$$\underset{X}{\text{minimize}} \quad \|X\|_* \quad \text{subject to} \quad X_\Omega = B_\Omega. \tag{1.13}$$

An interesting analogy between nuclear-norm minimization and basis pursuit, pointed out by Recht et al. [126], is that both formulations replace minimization of cardinality by minimization based on the $\ell_1$-norm. That is, basis pursuit relaxes $\|x\|_0$ to $\|x\|_1$, whereas nuclear-norm minimization relaxes $\|\sigma\|_0$ to $\|\sigma\|_1$, where $\sigma$ denotes the vector of singular values of $X$. We further discuss nuclear norm minimization in Section 6.4.

1.5 Thesis overview and contributions

[Figure 1.9: diagram with root "Sparse recovery" and branches "Recovery theory (Chap. 2, 3)" and "Solvers (Chap. 4-6)".]
Figure 1.9: Graphical representation of the two main components of sparse recovery: recovery theory and numerical solvers. The dotted line indicates the dependency between theory and solvers in the case of, for example, greedy recovery algorithms.

As illustrated in Figure 1.9, there are two major aspects to sparse recovery. On the one hand, there is the theoretical aspect, which studies the precise conditions under which recovery is successful, and the inherent limitations of different approaches. On the other hand, there is the practical aspect, in which algorithms are developed to solve sparse recovery problems or their convex relaxation.

The first half of this thesis is concerned with the theoretical aspects of sparse recovery. Chapter 2 provides a survey of current results and places a particular emphasis on the different techniques used to establish these results. We illustrate the meaning of various theoretical results with empirical simulations, and connect different results where possible.
Chapter 3, whose contents have been submitted for publication [8], studies joint-sparse recovery using two classes of algorithms. For this we extend the results in Chapter 2, and leverage the geometrical interpretation of basis pursuit.

In the second half of this thesis we focus on sparse recovery algorithms. We give a comprehensive overview of existing approaches in Chapter 4. The main contribution of this thesis is contained in Chapters 5 and 6, which provide a new algorithmic framework for solving a wide class of convex formulations for sparse recovery. Parts of these chapters have been published in [10, 131], and the widely-used software implementation, spgl1, is freely available online [9]. The development of spgl1 was motivated by the lack of feasible alternatives; it was the first algorithm specifically designed to solve the basis pursuit denoise formulation (BP$_\sigma$). This is in contrast to most other algorithms, which apply only to the simpler penalized formulation (QP$_\lambda$). spgl1 can now solve many more problems, and we compare its performance against other state-of-the-art algorithms in Chapter 7. We conclude with Chapter 8, in which we present a test problem suite for sparse recovery, published in [12], as well as the linear-operator toolbox, spot [11], that was used for its implementation.

Chapter 2
Theory behind sparse recovery

In this chapter we review theoretical results on the sparse recovery properties of basis pursuit and basis pursuit denoise: the conditions under which the solution of (BP) coincides with the sparsest solution obtained using (1.3); necessary and sufficient conditions for (BP) to recover $x_0$ from $b = Ax_0$; probabilities with which arbitrary vectors $x_0$ are recovered; and bounds on the approximation error when applying basis pursuit or basis pursuit denoise on noisy observations.
Results regarding the exact recovery of vectors $x_0$ using basis pursuit can be classified according to the class of vectors they apply to: individual vectors, vectors with nonzero entries at fixed locations, or uniform recovery, concerning the recovery of all vectors $x_0$ with a given maximum number of nonzero entries.

Since the publication of the seminal paper by Donoho and Huo [62] there has been a deluge of theoretical results and it would be far beyond the scope of this thesis to discuss them all. So instead, we focus here on the tools and techniques used to establish recovery conditions and, at the risk of being incomplete, only mention some typical results as an illustration. Figure 2.1 summarizes the techniques discussed in this chapter and highlights the techniques (see shaded boxes) that will be used in Chapter 3 to derive equivalence conditions for joint-sparse recovery using (1.10).

It is worth emphasizing that the strength of the results surveyed in this chapter lies in the fact that they hold for specific problem formulations (predominantly basis pursuit). As such, they continue to hold irrespective of the algorithm used to solve the problem. This is in stark contrast to greedy approaches and other heuristics, where explicit formulations are lacking and results are intricately connected to the algorithm.

[Figure 2.1: diagram of recovery-theory tools (mutual coherence, null-space properties, optimality conditions, restricted isometry, geometrical properties), each annotated with the sections in Chapters 2 and 3 where it is used.]
Figure 2.1: Overview of the most common tools and techniques used to derive and express recovery results for basis pursuit.

2.1 Mutual coherence

The very first equivalence results derived by Donoho and Huo [62] were based on the mutual coherence, or similarity of the atoms in dictionary $A$. Assuming $A$ has unit-norm columns, the mutual coherence is defined as

$$\mu(A) = \max_{i \ne j} \; |\langle A^{\downarrow i}, A^{\downarrow j} \rangle|,$$

and is one of the few sparse-recovery metrics that can be computed for a given matrix $A$ in reasonable time. For the special case where $A$ consists of a pair of orthonormal bases, Elad and Bruckstein [69] show that a sufficient condition for $\ell_1$ minimization to uniquely recover all $r$-sparse vectors $x$ from $b = Ax$ is that

$$r < \frac{\sqrt{2} - 1/2}{\mu(A)}.$$

This was later shown to be both necessary and sufficient by Feuer and Nemirovski [76]. For general dictionaries, Donoho and Elad [60], and Malioutov et al. [108] derive the sufficient condition

$$r < \frac{1 + 1/\mu(A)}{2}. \tag{2.1}$$

Unfortunately, the use of mutual coherence typically gives very pessimistic recovery guarantees. As an illustration, we determine the average and minimal mutual coherence based on ten thousand $m \times 256$ matrices with random Gaussian entries and columns normalized to one. The resulting values, as a function of $m$, are plotted in Figure 2.2(a). By applying (2.1) to these mutual coherence values we obtain the recovery guarantees shown in Figure 2.2(b). Note that, even based on the lower mutual coherence of near-square matrices, recovery is guaranteed only for vectors that are at most 2-sparse.

Observe that the mutual coherence of $A$ is equal to the largest absolute off-diagonal entry in the Gram matrix $A^T A$. This allows us to directly apply the results by Rosenfeld [128] as a lower bound for the mutual coherence of $m \times n$ matrices $A$ with unit-norm columns (see also Strohmer and Heath [136]):

$$\mu(A) \ge \sqrt{\frac{n - m}{m(n - 1)}}.$$

Substituting this lower bound into (2.1) gives the best possible uniform recovery guarantee based on mutual coherence, as shown in Figure 2.2(b). However, even under these ideal conditions, the sparsity level for which recovery is guaranteed is still very low, at least for $m \ll n$. To weaken the conditions, Tropp [144, 146] and Donoho and Elad [60] have proposed extensions and generalizations of mutual coherence, but they are seldom used.
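The quantities in this section are simple to evaluate numerically. The following sketch, written in plain Python with hypothetical helper names and a small 16 × 32 matrix chosen only for illustration, computes the mutual coherence of a random Gaussian matrix with normalized columns, the sparsity level guaranteed by (2.1), and the lower bound on the mutual coherence quoted above.

```python
import math
import random


def normalize_columns(A):
    """Scale every column of A (given as a list of rows) to unit norm."""
    m, n = len(A), len(A[0])
    norms = [math.sqrt(sum(A[i][j] ** 2 for i in range(m))) for j in range(n)]
    return [[A[i][j] / norms[j] for j in range(n)] for i in range(m)]


def mutual_coherence(A):
    """Largest absolute off-diagonal entry of the Gram matrix A^T A."""
    m, n = len(A), len(A[0])
    mu = 0.0
    for j in range(n):
        for k in range(j + 1, n):
            inner = sum(A[i][j] * A[i][k] for i in range(m))
            mu = max(mu, abs(inner))
    return mu


def coherence_guarantee(mu):
    """Largest integer r with r < (1 + 1/mu) / 2, cf. (2.1); needs mu > 0."""
    bound = (1.0 + 1.0 / mu) / 2.0
    return math.ceil(bound) - 1


def coherence_lower_bound(m, n):
    """Lower bound sqrt((n - m) / (m (n - 1))) for unit-norm columns."""
    return math.sqrt((n - m) / (m * (n - 1)))


random.seed(0)
m, n = 16, 32   # illustrative sizes, much smaller than those in Figure 2.2
A = normalize_columns(
    [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
)
mu = mutual_coherence(A)
r_max = coherence_guarantee(mu)
```

The double loop costs on the order of $m n^2$ operations, which is what makes mutual coherence practical to compute, in contrast to most of the other recovery metrics discussed in this chapter.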
[Figure 2.2: panels (a) mutual coherence versus m and (b) recovery guarantee (sparsity) versus m; curves: lower bound, minimum, mean.]
Figure 2.2: (a) Mutual coherence for random Gaussian $m \times 256$ matrices with columns normalized to unit norm, and (b) the resulting recovery guarantees.

The use of mutual coherence is not limited to the analysis of basis pursuit. For example, Tropp [146] uses mutual coherence to characterize the recovery error obtained using (QP$_\lambda$) for a given $\lambda$. Donoho et al. [61] consider the recovery of $r$-sparse $x_0$ from $b = Ax_0 + v$ using (BP$_\sigma$). Assuming $\|v\|_2 \le \epsilon \le \sigma$, they show that the solution $x^*$ of (BP$_\sigma$) satisfies

$$\|x^* - x_0\|_2^2 \le \frac{(\epsilon + \sigma)^2}{1 - \mu(A) \cdot (4r - 1)},$$

provided that $r < (1/\mu(A) + 1)/4$. Mutual coherence is also useful in deriving bounds on other metrics that are used to characterize sparse recovery.

2.2 Null-space properties

Recall that the kernel, or null-space, of a matrix $A$ is defined as

$$\operatorname{Ker}(A) = \{z \mid Az = 0\}.$$

Theoretical results based on null-space properties generally characterize recovery for the set of all vectors $x_0$ with a fixed support, which is defined as

$$\operatorname{Supp}(x) = \{j \mid x_j \ne 0\}.$$

We say that $x_0$ can be uniformly recovered on $I \subseteq \{1, \ldots, n\}$ if all $x_0$ with $\operatorname{Supp}(x_0) \subseteq I$ can be recovered. The following theorem illustrates typical null-space conditions for uniform recovery via $\ell_1$ on an index set.

Theorem 2.1 (Donoho and Elad [60], Gribonval and Nielsen [85]). Let $A$ be an $m \times n$ matrix and $I \subseteq \{1, \ldots, n\}$ be a fixed index set. Then all $x_0 \in \mathbb{R}^n$ with $\operatorname{Supp}(x_0) \subseteq I$ can be uniquely recovered from $b = Ax_0$ using basis pursuit (BP) if and only if for all $z \in \operatorname{Ker}(A) \setminus \{0\}$,

$$\sum_{j \in I} |z_j| < \sum_{j \notin I} |z_j|. \tag{2.2}$$

That is, the $\ell_1$-norm of $z$ on $I$ is strictly less than the $\ell_1$-norm of $z$ on the complement $I^c$.

The essential observation behind this result is that any feasible $x$ satisfying $Ax = b$ can be written as $x = x_0 + z$, where $z$ is a vector in the null-space of $A$. For equivalence and uniqueness we then require that $x_0$ has an $\ell_1$-norm that is strictly less than that of all other feasible vectors $x$. This then leads to the sufficiency of the condition. The necessity follows from the construction of a counterexample $x_0$ from any $z \in \operatorname{Ker}(A)$ not satisfying (2.2).

Gribonval and Nielsen [85, 86] show a more general relation for recovery using $\ell_p$ minimization

$$\underset{x}{\text{minimize}} \quad \|x\|_p \quad \text{subject to} \quad Ax = b, \tag{2.3}$$

with $0 \le p \le 1$. For this they define the $\ell_p$-concentration of all non-trivial vectors in the null-space of $A$ on a given index set $I$ as

$$C_p(A, I) := \underset{z \in \operatorname{Ker}(A) \setminus \{0\}}{\text{maximize}} \quad \frac{\sum_{i \in I} |z_i|^p}{\sum_{i} |z_i|^p}.$$

They then use this quantity to characterize recovery of $x_0$, with $\operatorname{Supp}(x_0) \subseteq I$, from $b = Ax_0$ by means of (2.3) as follows:

- $C_p(A, I) < 1/2$: all $x_0$ can be uniquely recovered;
- $C_p(A, I) = 1/2$: all $x_0$ are solutions of (2.3) but may not be unique;
- $C_p(A, I) > 1/2$: there are vectors $x_0$ that cannot be recovered.

Uniform sparse recovery. Donoho and Elad [60] derive conditions for uniform recovery using $\ell_0$ minimization (1.3) based on the spark of $A$. This quantity is defined as the smallest number of columns in $A$ that are linearly dependent. When a set of columns $I$ is linearly dependent, there exists a vector $z$ such that $A_I z = 0$, and therefore the spark coincides with the number of nonzero entries in the sparsest possible nonzero vector in the null-space of $A$:

$$\operatorname{Spark}(A) = \min \{ \|z\|_0 : z \in \operatorname{Ker}(A) \setminus \{0\} \}.$$

It then follows from the $\ell_0$-concentration that a necessary and sufficient condition for the uniform recovery of all $r$-sparse vectors is that

$$r < \operatorname{Spark}(A)/2. \tag{2.4}$$

There are currently no algorithms that can evaluate the spark of a matrix in a tractable way; checking if $\operatorname{Spark}(A) > s$ would require taking, for example, the QR-factorization of $A_I$ for all possible subsets $I \subseteq \{1, \ldots, n\}$ of cardinality $s$.
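To make the combinatorial cost of this check concrete, the sketch below enumerates column subsets of increasing size and tests each for linear dependence. It is an assumption-laden illustration rather than a practical method: the rank test uses Gaussian elimination with a fixed tolerance in place of the QR-factorization mentioned above, the function names are hypothetical, and the return value $n + 1$ for matrices with independent columns is a convention adopted here.

```python
from itertools import combinations


def numerical_rank(rows, tol=1e-10):
    """Rank of a small dense matrix via elimination with partial pivoting."""
    M = [list(r) for r in rows]
    n_rows, n_cols = len(M), len(M[0])
    rk = 0
    for col in range(n_cols):
        if rk == n_rows:
            break
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rk, n_rows), key=lambda i: abs(M[i][col]))
        if abs(M[pivot][col]) <= tol:
            continue
        M[rk], M[pivot] = M[pivot], M[rk]
        for i in range(rk + 1, n_rows):
            f = M[i][col] / M[rk][col]
            for j in range(col, n_cols):
                M[i][j] -= f * M[rk][j]
        rk += 1
    return rk


def spark(A, tol=1e-10):
    """Smallest number of linearly dependent columns of A, by brute force.

    Returns n + 1 when all n columns are independent (assumed convention).
    The number of subsets examined grows exponentially with n.
    """
    m, n = len(A), len(A[0])
    for s in range(1, n + 1):
        for I in combinations(range(n), s):
            sub = [[A[i][j] for j in I] for i in range(m)]
            if numerical_rank(sub, tol) < s:
                return s
    return n + 1


# Three columns in general position in R^2: any two are independent.
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
s_A = spark(A)
```

For this small matrix the spark equals $m + 1 = 3$, so condition (2.4) guarantees uniform recovery only for $r = 1$.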
A lower bound on the spark, in terms of mutual coherence, albeit very pessimistic, is given by $\operatorname{Spark}(A) > 1/\mu(A)$. This follows from the diagonal dominance, and therefore invertibility, of the Gram matrix $A_I^T A_I$ [60]. On the other hand, note that an $m \times n$ matrix $A$ with columns drawn independently and uniformly at random from the Euclidean sphere, due to their expected general position, gives $\operatorname{Spark}(A) = \min\{m, n\} + 1$ with probability one. (A set of vectors in $\mathbb{R}^m$ is said to be in general position if no $k + 1$ of them support a $(k-1)$-dimensional hyperplane, for any $k \in [0, m]$.)

2.3 Optimality conditions

For convex optimization problems, the Karush-Kuhn-Tucker (KKT) optimality conditions give both necessary and sufficient conditions for a point to be optimal (see, e.g., [121, Ch.12]). From these conditions, some of the most precise recovery results can be derived. Applied to the basis-pursuit formulation, the KKT conditions state that $x$ is a solution of (BP) if and only if $x$ is feasible and there exists a Lagrange multiplier $y$ satisfying

$$(A^T y)_i \in \begin{cases} \operatorname{sign}(x_i) & \text{if } i \in \operatorname{Supp}(x); \\ [-1, 1] & \text{otherwise.} \end{cases} \tag{2.5}$$

A necessary and sufficient condition for $x$ to be the unique minimizer is that there exists a $y$ that, in addition, satisfies

$$|(A^T y)_i| < 1, \quad \text{for all } i \notin \operatorname{Supp}(x). \tag{2.6}$$

Interestingly, this condition reveals a very important property of $\ell_1$ minimization, namely that recoverability depends only on the support of $x$ and the signs of its nonzero values; if the above relation holds for one such $x$ it will hold for all others as well. This granularity is much finer than the $r$-sparse uniform recovery results obtained using mutual coherence, or the uniform recovery on a support by null-space properties of $A$.

For sufficiently sparse $x$ we can use the Moore-Penrose pseudoinverse to construct vectors $y$ that, by construction, satisfy the first condition in (2.5).
Recalling that the pseudoinverse for full-rank matrices $M$ is defined as

$$M^\dagger = \begin{cases} (M^T M)^{-1} M^T & \text{if } M \text{ has independent columns;} \\ M^T (M M^T)^{-1} & \text{if } M \text{ has independent rows,} \end{cases} \tag{2.7}$$

we proceed as follows. Denote by $I$ the support of $x$, let $s = \operatorname{sign}(x_I)$ be the sign of the on-support entries, and define $M = A_I^T$. Then the choice $y = M^\dagger s$ satisfies $A_I^T y = M M^\dagger s = s$. A sufficient condition for recovery is then that (2.6) holds for all $i$ not in the support of $x$. This is formalized in the following theorem.

Theorem 2.2. Let $I \subseteq \{1, \ldots, n\}$ be an index subset with $|I| \le \operatorname{rank}(A)$ and denote $M = A_I^T$. Then all $x$ supported on $I$ with sign pattern $\operatorname{sign}(x_I) = s$ can be recovered from $b = Ax$ using (BP) if (2.6) holds for $y = M^\dagger s$.

The pseudoinverse is also used by Tropp [144] in the definition of the exact recovery constant (ERC). In a slightly modified form, this constant is defined in terms of $A$ and support $I$ as

$$\kappa(A, I) := \max_{i \notin I} \; \|A_I^\dagger A^{\downarrow i}\|_1. \tag{2.8}$$

The following theorem, due to Tropp [144], uses the ERC to guarantee the uniform recovery of all vectors on the given support. We provide a simplified proof.

Theorem 2.3 (Tropp [144]). A sufficient condition for (BP) to recover all $x_0$ supported on $I$ with $|I| \le \operatorname{rank}(A)$ is that

$$\kappa(A, I) < 1. \tag{2.9}$$

Proof. Choose any $x \ne x_0$ satisfying $Ax = b$, and denote its support by $J$. By the assumption on $I$ there is at least one $j \in J$ such that $j \notin I$. Further noting that $A_I^\dagger A_I = I$, and $\|A_I^\dagger A^{\downarrow i}\|_1 = 1$ for $i \in I$, we have

$$\|x_0\|_1 = \|A_I^\dagger A_I x_0\|_1 = \|A_I^\dagger b\|_1 = \|A_I^\dagger A x\|_1 = \Big\| \sum_{j \in J} x_j (A_I^\dagger A^{\downarrow j}) \Big\|_1 \le \sum_{j \in J} |x_j| \cdot \|A_I^\dagger A^{\downarrow j}\|_1 < \sum_{j \in J} |x_j| = \|x\|_1.$$

We connect the two theorems using the following result.

Theorem 2.4. Theorem 2.3 implies Theorem 2.2 for all sign patterns $s$ on $I$ and vice versa.

Proof. Let $s$ be an arbitrary sign pattern and set $y = M^\dagger s$ with $M = A_I^T$. We need to show that $|(A^{\downarrow j})^T y| < 1$ for all $j \notin I$.
Denoting $a_j = A^{\downarrow j}$ for convenience, we have
$$|(A^{\downarrow j})^T y| = |a_j^T y| = |a_j^T M^\dagger s| = |(A_I^\dagger a_j)^T s| \le \sum_i |s_i| \cdot |(A_I^\dagger a_j)_i| = \sum_i |(A_I^\dagger a_j)_i| = \|A_I^\dagger a_j\|_1 < 1,$$
where we used assumption (2.9) and the fact that $(M^\dagger)^T = A_I^\dagger$. For the reverse result note that, by assumption, (2.6) holds with $y = M^\dagger s$ for any sign pattern $s$. In particular, choosing $s = \operatorname{sign}(A_I^\dagger a_j)$ gives
$$1 > |(A^T y)_j| = |a_j^T y| = |a_j^T M^\dagger s| = |s^T (A_I^\dagger a_j)| = \|A_I^\dagger a_j\|_1.$$
Because $j \notin I$ was arbitrary, this implies (2.9).

[Figure 2.3, panel (a): maximum $\kappa$ value versus index size, for $m = 64, 128, 192, 256$; panel (b): recovery guarantee versus $m$.]
Figure 2.3: (a) Maximum value of $\kappa(A, I)$, sampled over 1,000 random index sets $I$ for normalized randomly drawn Gaussian $m \times 256$ matrices $A$; (b) upper bound on uniform recovery guarantees for given matrices $A$.

Note that when $A$ has unit-norm columns and the cardinality of $I$ is one, condition (2.9) reduces to $\mu(A) < 1$. This in turn is equivalent to (2.1) for $r = 1$. Finally, as an example of the guarantees given based on the ERC, we create random $m \times 256$ matrices with unit-norm columns for different values of $m$. For each matrix we generate 1,000 random index sets for each cardinality up to 30 and compute the value of $\kappa$ in (2.8). The maximum quantities are plotted in Figure 2.3(a) for a number of different values of $m$. Based on these values we determine the largest support size for which (2.9) holds. This gives an upper bound on the uniform recovery guarantees based on the ERC, which is shown in Figure 2.3(b). Like the mutual coherence, the ERC for a fixed support $I$ can be computed in practice.

2.4 Restricted isometry

The restricted isometry condition introduced by Candès and Tao [33] is one of the most widely used conditions for guaranteeing exact recovery, or establishing bounds on the recovery error.
These sufficient conditions for recovery, as well as the error bounds, are expressed in terms of the $r$-restricted isometry constant $\delta_r(A)$, which they define as the smallest $\delta$ satisfying
$$(1 - \delta)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta)\|x\|_2^2 \qquad (2.10)$$
for every $r$-sparse vector $x$. This condition equivalently requires that for all sets $\Lambda$ of cardinality $r$, the squared singular values of $A_\Lambda$ are bounded by $1 \pm \delta_r$. Another way to see this is that the eigenvalues of all $r \times r$ principal submatrices of the Gram matrix $A^T A$ are bounded between $(1 - \delta_r)$ and $(1 + \delta_r)$.

It is beyond the scope of this thesis to try and give an exhaustive summary of all results that have been expressed in terms of the restricted isometry constant. We therefore limit ourselves to illustrating the flavor of these results by citing two recent theorems by Candès.

Theorem 2.5 (Noiseless recovery, Candès [26]). Let $b = Ax$ with any $x \in \mathbb{R}^n$, and denote by $x^{(r)}$ a vector whose nonzero entries correspond to the $r$ entries of $x$ largest in magnitude. Assume that $\delta_{2r} < \sqrt{2} - 1$. Then the solution $x^*$ to (BP) obeys
$$\|x^* - x\|_1 \le c_1 \|x - x^{(r)}\|_1, \quad \text{and} \quad \|x^* - x\|_2 \le c_1 r^{-1/2} \|x - x^{(r)}\|_1,$$
with constant $c_1$ depending only on $\delta_{2r}$. Recovery is exact if $x$ is $r$-sparse.

With slight modifications to the proof given in [26], the sufficient condition for exact uniform $r$-sparse recovery can be generalized to
$$\delta_{r+s}(A) < \frac{1}{1 + \sqrt{2r/s}}, \qquad (2.11)$$
with $s \in [1, r]$. Choosing $s = r$ reduces to $\delta_{2r} < \sqrt{2} - 1$, as above. For the noisy case where $b = Ax + z$, the following result holds.

Theorem 2.6 (Noisy recovery, Candès [26]). Assume that $\delta_{2r} < \sqrt{2} - 1$ and $\|Ax - b\|_2 \le \epsilon$. Then the solution of (BP$_\sigma$) obeys
$$\|x^* - x\|_2 \le c_1 r^{-1/2} \|x - x^{(r)}\|_1 + c_2 \epsilon,$$
with constants $c_1$ and $c_2$ depending only on $\delta_{2r}$.

In a recent paper, Cai et al. [24] sharpen these results and require $\delta_{1.75r}(A) < \sqrt{2} - 1$. Interestingly, they also show that earlier conditions by Candès and Tao [35] are actually implied by those in Theorem 2.5.
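As a small illustration of definition (2.10), the following sketch computes $\delta_1$ and $\delta_2$ exactly for a tiny matrix by examining the eigenvalues of every $1 \times 1$ and $2 \times 2$ principal submatrix of the Gram matrix. The function names `gram_entry`, `delta_1`, and `delta_2` are hypothetical and do not refer to any package mentioned in this thesis; the closed-form eigenvalues of a symmetric $2 \times 2$ matrix are used so that only the Python standard library is needed. This is a sketch under those assumptions, and the exhaustive enumeration does not extend to larger $r$ in any practical way.

```python
import itertools
import math


def gram_entry(A, i, j):
    """Inner product of columns i and j of A (A is stored as a list of rows)."""
    return sum(row[i] * row[j] for row in A)


def delta_1(A):
    """Restricted isometry constant for 1-sparse vectors: max_j | ||a_j||^2 - 1 |."""
    n = len(A[0])
    return max(abs(gram_entry(A, j, j) - 1.0) for j in range(n))


def delta_2(A):
    """Smallest delta such that every 2x2 principal submatrix of A^T A
    has its eigenvalues in [1 - delta, 1 + delta]."""
    n = len(A[0])
    worst = delta_1(A)  # 1-sparse vectors are also 2-sparse
    for i, j in itertools.combinations(range(n), 2):
        a = gram_entry(A, i, i)
        b = gram_entry(A, j, j)
        g = gram_entry(A, i, j)
        # Eigenvalues of [[a, g], [g, b]] are mid +/- rad.
        mid = 0.5 * (a + b)
        rad = math.hypot(0.5 * (a - b), g)
        worst = max(worst, (mid + rad) - 1.0, 1.0 - (mid - rad))
    return worst


# Two unit-norm columns with inner product 0.6.
A_example = [[1.0, 0.6],
             [0.0, 0.8]]
d1 = delta_1(A_example)
d2 = delta_2(A_example)
```

For unit-norm columns each $2 \times 2$ Gram block has eigenvalues $1 \pm |a_i^T a_j|$, so in that special case $\delta_2(A)$ coincides with the mutual coherence of $A$.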
Restricted isometry constants for matrices. The restricted isometry constant of a matrix $A$, like its spark, is exceedingly hard to evaluate and essentially requires the computation of the extreme singular values of all $A_\Lambda$, with $|\Lambda| = r$. Given that we may never be able to verify whether condition (2.11) is satisfied, there are three ways to proceed: derive bounds on $\delta_r$, design matrices with inherent restricted isometry properties, or consider families of random matrices. Recent work by d’Aspremont [48] uses semidefinite relaxation to obtain upper bounds on $(1 + \delta_r)$. By randomly sampling column supports $\Lambda$ we can derive lower bounds, but these can only be used to show that (2.11) is not satisfied. The second approach is to design matrices in such a way that the restricted isometry constant is known by construction. Unfortunately, the only known results in this direction, established by DeVore [53], satisfy the restricted isometry conditions only for relatively small values of $r$. The most fruitful approach by far has been to consider families of random $m \times n$ matrices $A$ and determine bounds on $r$ such that, for a matrix randomly drawn from the family, conditions like those in Theorem 2.5 are satisfied with high probability. The first results in this direction were obtained by Candès and Tao [34]. They showed that an $m \times n$ matrix $A$ consisting of $m$ randomly selected rows of an $n \times n$ Fourier matrix satisfies $\delta_{2r} + \delta_{3r} < 1$ with very high probability, provided that $r \le C \cdot m/(\log n)^6$. These results were improved by Rudelson and Vershynin [129], who lowered the exponent from six to four. Another important class is the Gaussian ensemble, where $m \times n$ matrices $A$ are randomly generated by independently sampling their entries from the normal distribution with zero mean and variance $1/m$. Work by Candès and Tao [34], and Baraniuk et al.
[4] shows that such matrices satisfy certain restricted isometry conditions with probability $1 - O(e^{-\gamma n})$, for some $\gamma > 0$, provided that $m \ge C \cdot r \log(n/r)$. Other examples include: Bernoulli ensembles (Baraniuk et al. [4]), randomly restricted Hadamard matrices (Rudelson and Vershynin [129]), and random Toeplitz matrices (Bajwa et al. [2]).

Experiments. For a more concrete illustration of the restricted isometry property, we generate a random $128 \times 256$ matrix $A$ with normally distributed entries and scale its columns to have unit norm. We then sample the extreme singular values for 50,000 random submatrices $A_\Lambda$ with $|\Lambda| = r$, for each $r = 1, \ldots, 128$, as shown in Figure 2.4(a). Using these singular values we can then determine a lower bound on $\delta_r$ by noting that $\delta_r(A) \ge \max\{[\sigma_1(A_\Lambda)]^2 - 1,\ 1 - [\sigma_r(A_\Lambda)]^2\}$. To ensure $\delta_r$ is non-decreasing in $r$, we enforce the fact that $\delta_r(A) \ge \delta_{r-1}(A)$, where needed, and obtain the lower bound on $\delta_r(A)$ shown in Figure 2.4(b). From these values it follows that condition (2.11) is satisfied only for $s = r = 1$, thus guaranteeing uniform recovery of at most one-sparse signals. This is much lower than the empirical result in which we considered the recovery of 5,000 random $x_0$ with different sparsity levels. In these experiments, as shown in Figure 2.5, all vectors with a sparsity of up to 32 were successfully recovered.

[Figure 2.4, panel (a): extreme singular values versus $r$; panel (b): lower bound on $\delta_r(A)$ versus $r$.]
Figure 2.4: Experiments with restricted isometry for a $128 \times 256$ random matrix $A$, showing (a) the extreme singular values attained by random submatrices of $A$ with $s$ columns, over 50,000 trials, and (b) the resulting lower bound on $\delta_s(A)$.

2.5 Geometry

We now look at the geometrical interpretation of basis pursuit recovery, which was first described by Donoho [57]. For basis pursuit, and to a lesser extent basis pursuit denoise, this interpretation provides an intuitive way of thinking about $\ell_1$-based sparse recovery and has led to asymptotically sharp recovery results. We discuss this interpretation in detail here because of its relevance to the results we derive in Chapter 3.

Starting with the geometry of the $\ell_1$-norm, note that the set of all points of the unit $\ell_1$-ball, $\{x \in \mathbb{R}^n \mid \|x\|_1 \le 1\}$, can be formed by taking convex combinations of $\pm e_j$, the signed columns of the identity matrix. Geometrically this is equivalent to taking the convex hull of these vectors, giving the cross-polytope, or $n$-octahedron, $C^n = \operatorname{conv}\{\pm e_1, \pm e_2, \ldots, \pm e_n\}$. As an example, we illustrate $C^2$ and $C^3$ in Figures 2.6(a,b) respectively; more information about these polytopes can be found in Grünbaum [89] and Ziegler [155].

For the satisfaction of the basis pursuit constraint $Ax = b$ we need to consider the image of $x$ under $A$. By applying the linear mapping $x \mapsto Ax$ to all points $x \in C$ we obtain another polytope $P = \{Ax \mid x \in C\} = AC$. This is illustrated in Figure 2.6(c), from which it can be seen that $P$ coincides with $\operatorname{conv}\{\pm A^{\downarrow j}\}_j$, the convex hull of all points in $\mathbb{R}^m$ corresponding to the columns of $A$ and their negation. Now, recall that we want to minimize $\tau = \|x\|_1$ such that $Ax = b$ is satisfied. In other words, we want to find the smallest value of $\tau$ such that there exists an $x \in \tau C$ satisfying $Ax = b$, that is, $b \in A(\tau C) = \tau P$. When $\tau$ is too small, it is apparent from Figure 2.7(a) that $b \notin \tau P$. On the other hand, when $b$ is in the relative interior of $\tau P$ we can reduce the size of the polytope and still satisfy $b = Ax$. We therefore conclude that basis pursuit finds the value of $\tau$ such that $\tau P$ just meets $b$, as shown in Figure 2.7(b).

The same principle applies to the basis pursuit denoise formulation (BP$_\sigma$). Defining the unit $\ell_2$-norm ball as $B = \{x \in \mathbb{R}^m \mid \|x\|_2 \le 1\}$, this formulation looks for the smallest $\tau$ for which $\tau P \cap (b + \sigma B) \ne \emptyset$, as illustrated in Figure 2.7(c).
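The statement that basis pursuit finds the smallest $\tau$ with $b \in \tau P$ can be checked numerically on toy problems in the plane. The sketch below assumes a $2 \times n$ matrix $A$ of rank two and, instead of calling a linear-programming solver, enumerates all basic solutions (pairs of linearly independent columns), relying on the standard fact that a linear program with an optimal solution attains its optimum at a basic solution. The name `basis_pursuit_2d` is hypothetical, and the brute-force enumeration is only intended for very small $n$.

```python
import itertools


def basis_pursuit_2d(A, b, tol=1e-12):
    """Minimize ||x||_1 subject to Ax = b for a 2 x n matrix A (list of two rows).

    Every pair of linearly independent columns gives one basic solution,
    obtained here with Cramer's rule; the one with the smallest l1-norm is kept.
    Returns (tau, x) with tau = ||x||_1, or (None, None) if rank(A) < 2.
    """
    n = len(A[0])
    best_tau, best_x = None, None
    for i, j in itertools.combinations(range(n), 2):
        det = A[0][i] * A[1][j] - A[0][j] * A[1][i]
        if abs(det) < tol:
            continue  # columns i and j are (numerically) dependent
        xi = (b[0] * A[1][j] - A[0][j] * b[1]) / det
        xj = (A[0][i] * b[1] - b[0] * A[1][i]) / det
        tau = abs(xi) + abs(xj)
        if best_tau is None or tau < best_tau - tol:
            x = [0.0] * n
            x[i], x[j] = xi, xj
            best_tau, best_x = tau, x
    return best_tau, best_x


# b coincides with the third column, so tau = 1 and P itself just meets b.
A_example = [[1.0, 0.0, 1.0],
             [0.0, 1.0, 1.0]]
tau, x = basis_pursuit_2d(A_example, [1.0, 1.0])
```

In this example the alternative representation $x = (1, 1, 0)$ has $\ell_1$-norm two, which corresponds to a polytope $\tau P$ that is larger than needed to reach $b$.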
[Figure 2.5 plot: recoverability (%) versus sparsity, with markers for the restricted isometry guarantee (1) and the empirical result (32).]
Figure 2.5: Percentage of $r$-sparse vectors $x_0$ recovered from $b = Ax_0$ for random $128 \times 256$ matrix $A$, based on 5,000 experiments for each $r$.

Figure 2.6: Illustration of (a) cross-polytope $C^2$, (b) cross-polytope $C^3$, and (c) the image of $C^3$ under $A$. For convenience we here denote $A^{\downarrow j}$ by $a_j$.

Figure 2.7: (a) Initial situation with polytope $AC$ and point $b$, and the geometry for (b) basis pursuit, and (c) basis pursuit denoise.

2.5.1 Facial structure and recovery

So far, we have only looked at the $\ell_1$-norm of the basis pursuit (denoise) solution $x^*$. Fix $\tau$, and let $y$ denote the point where $b + \sigma B$, with $\sigma = \|Ax - b\|$, meets $P$. By definition, $y$ must lie in the relative interior of exactly one face $F$ of $P := \tau AC$. The location and sign of the nonzero entries of $x^*$ are determined by the points $\pm\tau A^{\downarrow j}$ that lie in $F$. In particular, $x^*_j > 0$ only if $\tau A^{\downarrow j} \in F$, and $x^*_j < 0$ only if $-\tau A^{\downarrow j} \in F$. By the definition of the $\ell_1$-norm, the sum of the coefficient magnitudes is given by $\tau$. The relative magnitudes correspond to the convex weights on each of the $\pm\tau A^{\downarrow j} \in F$ that give $y$. As an example, consider the situation in Figure 2.7(b). Here, $y = b$ lies in the relative interior of the face given by the convex hull of $\tau A^{\downarrow 2}$ and $\tau A^{\downarrow 3}$, thus giving $x^*_2, x^*_3 > 0$, and $x^*_2 + x^*_3 = \tau$.

Based on this view it is immediate that $x^*$ is unique if and only if $F$ is a non-degenerate face of $P$, that is, whenever the number of points $\pm\tau A^{\downarrow j}$ in $F$ exceeds the dimension of the affine hull of $F$ by exactly one. When this is not the case we can write $y$ as infinitely many different convex combinations of the vertices of $F$, thus implying non-uniqueness of $x^*$.
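The correspondence between coefficient magnitudes and convex weights described above can be made concrete with a short sketch. Given any nonzero coefficient vector $x$, it sets $\tau = \|x\|_1$, forms the points $\operatorname{sign}(x_j)\,\tau A^{\downarrow j}$ for the nonzero entries, and uses the weights $|x_j|/\tau$, which sum to one and reproduce $Ax$. The function name `face_decomposition` is hypothetical; the sketch only verifies this identity and does not test whether the points actually span a face of $\tau AC$.

```python
def face_decomposition(A, x):
    """Write Ax as a convex combination of the points sign(x_j) * tau * A[:, j].

    A is a list of rows, x a nonzero coefficient vector. Returns
    (tau, weights, y), where tau = ||x||_1, weights are |x_j| / tau for the
    nonzero entries of x, and y is the resulting convex combination.
    """
    m, n = len(A), len(x)
    tau = sum(abs(v) for v in x)
    vertices, weights = [], []
    for j in range(n):
        if x[j] != 0:
            sign = 1.0 if x[j] > 0 else -1.0
            vertices.append([sign * tau * A[i][j] for i in range(m)])
            weights.append(abs(x[j]) / tau)
    y = [sum(w * v[i] for w, v in zip(weights, vertices)) for i in range(m)]
    return tau, weights, y


A_example = [[1.0, 0.0, 1.0],
             [0.0, 1.0, 1.0]]
x_example = [0.5, 0.0, 1.5]
tau, weights, y = face_decomposition(A_example, x_example)
```

Here $\tau = 2$, the weights are $0.25$ and $0.75$, and $y$ equals $Ax = (2, 1.5)$.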
Going back to the cross-polytope $C$, we note that each face $F$ of $C$ can be expressed as the convex hull of a subset of vertices $\{\pm e_i\}_{i \in I}$ not including any pair of vertices that are reflections with respect to the origin. Each face in $C$ thus corresponds to a support and sign pattern of $x$ with $\|x\|_1 = 1$, and $x_i \ne 0$ only if $i \in I$, with the sign corresponding to the sign of $\pm e_i$. Now, note that $A$ is a linear mapping from $\mathbb{R}^n$ to $\mathbb{R}^m$. It is well known (see for example Ziegler [155, Lemma 7.10]) that the image of a polytope $Q$ in $\mathbb{R}^n$ under the linear map $A$, $R = AQ$, is also a polytope, and that the preimage of each face of $R$ is a face of $Q$. Applying this result to $P = AC$ we see that each face in $P$ is the image of some face in $C$, and therefore corresponds to a certain support and sign pattern of coefficient vectors $x \in \mathbb{R}^n$. The reverse is not always true, and some of the faces of $C$ may not map to a face of $P$. Whenever some $x_0$ is on a face $F$ of $C$ that does not map to a face on $P$, it means that $b = Ax_0$ lies in the interior of $P$. This in turn means that there is another $x$ with smaller $\ell_1$-norm (i.e., we can reduce the size of $P$ while maintaining $b \in P$) that also satisfies $Ax = b$. In other words, $x_0$ is not a solution of the basis-pursuit problem and can therefore not be recovered. We conclude that a vector $x_0$ with $\|x_0\|_1 = \tau$ can be uniquely recovered if and only if the smallest face $F$ in $\tau C$ that contains $x_0$ maps to a nondegenerate face in $P = \tau AC$. From this we also conclude, as before, that recovery of $x_0$ depends only on its support and sign pattern.

2.5.2 Connection to optimality conditions

It is interesting to point out the close connection between the geometrical interpretation of basis pursuit and the optimality conditions in Section 2.3. Let $x$ be a vector supported on $I$, and assume, without loss of generality, that $\|x\|_1 = 1$, and $x_i > 0$ for all $i \in I$. In order for $x$ to be the unique solution of (BP), $S = \operatorname{conv}\{A^{\downarrow i}\}_{i \in I}$ must be a nondegenerate face of $P = AC$.
By definition, each face of $P$ can be supported by a hyperplane in $\mathbb{R}^m$. For $S$ to be a face, there must exist a normal vector $z$ such that for each $j \in I$,
$$\begin{aligned} (A^{\downarrow i} - A^{\downarrow j})^T z &= 0 && \forall i \in I, \\ (-A^{\downarrow i} - A^{\downarrow j})^T z &< 0 && \forall i \in I, \\ (\pm A^{\downarrow i} - A^{\downarrow j})^T z &< 0 && \forall i \notin I. \end{aligned}$$
By scaling $z$ such that $(A^{\downarrow j})^T z = 1$, this reduces to
$$\begin{aligned} (A^T z)_i = (A^T z)_j &= 1 && \forall i \in I, \\ |(A^T z)_i| &< 1 && \forall i \notin I, \end{aligned}$$
for each $j \in I$, which exactly gives the conditions in (2.5) and (2.6). In fact, all vectors $z$, or more generally all vectors in the normal cone of $P$ at $b$, can be seen to be Lagrange multipliers for (BP). This also explains why the strict inequality in (2.6) is a necessary and sufficient condition for $x$ to be the unique solution. It also shows that whenever there exists a $y$ that satisfies (2.6), then there also exists a $y$ that satisfies only (2.5). For $z$, this corresponds to the relative interior and the boundary of the normal cone, respectively. However, if there only exists a $z$ that satisfies $(A^T z)_i = 1$ for $i \in I$, and $|(A^T z)_i| \le 1$ for $i \notin I$, with at least one equality, then $S$ is (part of) a degenerate face, and $x$ is not the unique solution.

2.5.3 Recovery bounds

We now consider the recovery bounds for basis pursuit as derived from the geometrical perspective. For simplicity of exposition, we assume that the faces of $P$ are all non-degenerate, or equivalently, that $P$ has $2n$ vertices. A first observation is that the probability of recovering a random vector with $s$ nonzero entries whose signs are $\pm 1$ with equal probability is equal to the ratio of the number of $(s-1)$-faces in $P$ to the number of $(s-1)$-faces in $C$. That is, letting $F_d(P)$ denote the collection of all $d$-faces [89] in $P$, the probability of recovering an arbitrary exactly $s$-sparse $x_0$ using basis pursuit is given by
$$P_{\ell_1}(A, s) = \frac{|F_{s-1}(AC)|}{|F_{s-1}(C)|} = \frac{|F_{s-1}(AC)|}{2^s \binom{n}{s}}.$$
This coincides with the probability that is estimated (as a percentage) by the curve in Figure 2.5.
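The denominator of this ratio, the number of $(s-1)$-faces of the cross-polytope, can be checked by direct enumeration: each such face is the convex hull of $s$ vertices $\pm e_i$ with distinct indices, so it is determined by a support of size $s$ and one sign per index, giving $2^s \binom{n}{s}$ faces in total. The sketch below is a minimal illustration of that count; the name `cross_polytope_faces` is hypothetical, and the enumeration is only meant for small $n$.

```python
import itertools
from math import comb


def cross_polytope_faces(n, s):
    """List the (s-1)-faces of the n-dimensional cross-polytope, 1 <= s <= n.

    Each face is represented by a tuple of (index, sign) pairs, one per
    vertex sign * e_index, with no index appearing twice.
    """
    faces = []
    for support in itertools.combinations(range(n), s):
        for signs in itertools.product((1, -1), repeat=s):
            faces.append(tuple(zip(support, signs)))
    return faces


# The octahedron C^3 has 6 vertices, 12 edges, and 8 triangular facets.
octahedron_counts = [len(cross_polytope_faces(3, s)) for s in (1, 2, 3)]
```

For a fixed support $I$ only the sign patterns vary, which leaves $2^{|I|}$ faces.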
For the probability of recovering vectors with a given support $I$ we can write
$$P_{\ell_1}(A, I) = \frac{|F_I(AC)|}{|F_I(C)|} = \frac{|F_I(AC)|}{2^{|I|}}, \qquad (2.12)$$
where $F_I(C)$ denotes the collection of faces in $C$ formed by the convex hulls of $\{\pm e_i\}_{i \in I}$, and $F_I(AC) = F_{|I|}(P) \cap A F_I(C)$ is the collection of faces on $AC$ generated by $\{\pm A^{\downarrow j}\}_{j \in I}$.

Uniform recovery. Donoho [57] showed that the uniform recovery of all $r$-sparse vectors using basis pursuit corresponds to the notion of $r$-neighborliness of $P$. The neighborliness of a polytope is defined as the largest number $r$ such that the convex hull of any $r$ of its vertices corresponds to a face of that polytope [89]. In the case of centrally symmetric polytopes (i.e., $P = -P$) this definition gives a neighborliness of one, because the convex hull of any two vertices that are reflected through the origin will not be a face of $P$. Such pairs are therefore excluded in the definition of neighborliness for centrally symmetric polytopes.

Bounds on central $r$-neighborliness were known as early as 1968, when McMullen and Shephard [113] showed that $r \le \lfloor (m+1)/3 \rfloor$ whenever $2 < m \le n - 2$. More recently, Donoho [59] studied central neighborliness for random $m \times n$ orthogonal projectors $A$ with $m \sim \delta n$. He derives, amongst other things, a function $\rho_N(\delta) > 0$ such that $P = AC$ is centrally $\rho d$-neighborly with high probability for large $d$, whenever $\rho < \rho_N(\delta)$. In other work, Linial and Novik [101] show the existence of $m$-dimensional centrally symmetric $r$-neighborly polytopes with $2(n + m)$ vertices, with
$$r(m, n) = \Theta\!\left(\frac{m}{1 + \log((m + n)/m)}\right),$$
and show that this bound on $r$ is tight.

Nonnegative basis pursuit and neighborliness. At first sight, formulation (1.6) seems very similar to basis pursuit with only an additional constraint requiring nonnegativity of $x$.
This seemingly small change has major implications for the bounds on uniform recovery, essentially because the cross-polytope $C$ is replaced by a simplex, thus giving rise to non-centrally symmetric polytope images under projection. Donoho and Tanner [55, 54] study recovery using nonnegative basis pursuit and mention the family of cyclic polytopes. These $m$-dimensional polytopes exist with any number $n > m$ of vertices and are $\lfloor m/2 \rfloor$-neighborly (recall that this guarantees the uniform recovery with (1.6) of all nonnegative vectors $x_0$ with at most $\lfloor m/2 \rfloor$ nonzero entries). Moreover, simple explicit constructions for these polytopes are available [79].

Chapter 3

Joint-sparse recovery

In this chapter we consider sparse recovery in the case where, instead of a single measurement vector (SMV) $b = Ax$, we have multiple measurement vectors (MMV) $B = AX$ (see Section 1.4.2 for notation). In the MMV setting it is typically assumed that the $n \times k$ matrix $X$ is jointly sparse, meaning that the nonzero entries in each column are located at identical positions. As mentioned in Section 1.4.2, the MMV problem is a special case of group-sparse SMV where nonzero entries in $x$ are clustered in predefined non-overlapping groups of indices $\Lambda_i$. (Some authors use the term ‘blocks’ instead of ‘groups’. In the text, we use whichever term is used in the original context.) We can write MMV as a group-sparse problem by vectorizing $B$ and $X$, setting $\Lambda_i = \{i, i + n, \ldots, i + (k - 1)n\}$ for $i = 1, \ldots, n$, and redefining $A := I \otimes A$. This transformation to a group-sparse SMV is characterized by the special block-diagonal structure of $A$ with identical blocks.

There is considerable freedom in choosing a convex formulation for the MMV problem. In this chapter we will concentrate on the $\ell_{1,2}$ relaxation (1.10) given in Section 1.4.2, and the ReMBo algorithm proposed by Mishali and Eldar [114].
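The reformulation of MMV as a group-sparse SMV problem described above can be sketched in a few lines. The code below, written with plain Python lists and hypothetical helper names (`vec`, `kron_identity`, `row_groups`), stacks the columns of $X$, builds the block-diagonal matrix $I \otimes A$, and lists the index groups $\Lambda_i$ using the one-based indexing of the text. It is a minimal illustration on a tiny example rather than an efficient implementation.

```python
def matmul(A, X):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][p] * X[p][j] for p in range(len(X)))
             for j in range(len(X[0]))] for i in range(len(A))]


def matvec(K, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in K]


def vec(X):
    """Stack the columns of X on top of each other."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]


def kron_identity(A, k):
    """Block-diagonal matrix I_k (x) A with k identical copies of A."""
    m, n = len(A), len(A[0])
    K = [[0] * (k * n) for _ in range(k * m)]
    for p in range(k):
        for i in range(m):
            for j in range(n):
                K[p * m + i][p * n + j] = A[i][j]
    return K


def row_groups(n, k):
    """Groups Lambda_i = {i, i + n, ..., i + (k - 1) n}, one-based as in the text."""
    return [[i + l * n for l in range(k)] for i in range(1, n + 1)]


A_small = [[1, 2, 0],
           [0, 1, 3]]
X_small = [[1, 4],
           [0, 0],   # zero row: group Lambda_2 of vec(X) vanishes
           [2, 5]]
lhs = matvec(kron_identity(A_small, 2), vec(X_small))
rhs = vec(matmul(A_small, X_small))
groups = row_groups(3, 2)
```

Each group $\Lambda_i$ collects the positions of row $i$ of $X$ inside the stacked vector, so a row-sparse $X$ becomes a group-sparse vector.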
Before doing so, we briefly survey some results on $\ell_0$-based minimization and conditions for group-sparse recovery and their connection to the MMV problem. For related results on greedy approaches for the MMV problem, we refer to Chen and Huo [38], Cotter et al. [42], Eldar and Rauhut [74], Gribonval et al. [87], Leviatan and Temlyakov [100], and Tropp et al. [147, 148].

Throughout this chapter we make the following assumptions. The matrix $A \in \mathbb{R}^{m \times n}$ is full-rank. The unknown matrix to be recovered, $X_0 \in \mathbb{R}^{n \times k}$, is $r$-row-sparse, and, except for our study of ReMBo, the columns of $X_0$ have identical support with exactly $r$ nonzero entries.

3.1 Uniqueness of sparsest MMV solution

We define the sparsest solution to the MMV problem as the matrix $X$ with the least number of rows containing nonzero entries that satisfies $AX = B$. This can be formulated as the non-convex optimization problem
$$\underset{X,\, v}{\text{minimize}} \quad \|v\|_0 \quad \text{subject to} \quad AX = B, \quad \|X^{i\rightarrow}\| \le v_i, \qquad (3.1)$$
for any norm $\|\cdot\|$, or even any other nonnegative function $f(x)$ that is 0 if and only if $x = 0$. Denoting for convenience $\|X\|_{0,*} := \|[f(X^{i\rightarrow})]_i\|_0$, the following result gives conditions under which $X$ is the unique solution of (3.1).

Theorem 3.1 (Cotter et al. [42], Chen and Huo [38]). If $B = AX$ and
$$\|X\|_{0,*} < \frac{\operatorname{Spark}(A) + \operatorname{rank}(B) - 1}{2},$$
then $X$ is the unique solution to (3.1).

This result generalizes (2.4) for the SMV problem, where $\operatorname{rank}(B) = 1$. Formulation (3.1), just like (1.3), is intractable and we therefore proceed by studying recovery using convex relaxations.

3.2 Recovery of block-sparse signals

As noted earlier, MMV problems can be reformulated as block-sparse SMV problems, which have recently been studied by Eldar et al. [72] and Eldar and Mishali [73], who provide recovery conditions in terms of generalizations of the mutual coherence and the restricted isometry property (RIP). In particular, Eldar et al.
define the block-coherence of a matrix $A$ with blocks $\Lambda_i$ of size $k$ as
$$\mu_\Lambda(A) = \max_{i \ne j} \frac{1}{k}\, \sigma_1(A_{\Lambda_i}^T A_{\Lambda_j}),$$
where $\sigma_1(M)$ denotes the largest singular value of $M$. The coherence within blocks is described by the sub-coherence
$$\nu_\Lambda(A) = \max_{l}\ \max_{i, j \in \Lambda_l,\ i \ne j} |(A^{\downarrow i})^T A^{\downarrow j}|.$$
Using these two quantities they derive the following result.

Theorem 3.2 (Eldar et al. [72]). Let $x$ be an $r$-block-sparse vector with index sets $\Lambda_i$ of cardinality $k$, and let $A$ have unit-norm columns. Then $x$ is the unique solution of (1.7) with $b = Ax$, if
$$kr < \frac{1}{2}\left(\frac{1}{\mu_\Lambda(A)} + k - (k - 1)\frac{\nu_\Lambda(A)}{\mu_\Lambda(A)}\right). \qquad (3.2)$$

With $A = I \otimes \bar{A}$ (as in MMV), it is easy to see that $\mu_\Lambda(A) = \mu(\bar{A})/k$ and $\nu_\Lambda(A) = 0$. Substituting these into (3.2) then gives $r < (1 + 1/\mu(\bar{A}))/2$, which is exactly (2.1). In other words, based on these mutual-coherence conditions, the recovery guarantees for block-sparse signals using (1.7) are identical to those for independent recovery of each $X^{\downarrow j}$ from $B^{\downarrow j}$ using (BP).

As an extension of the restricted isometry, Eldar and Mishali [73] define the block-RIP constant $\delta_k(A, \Lambda)$ as the smallest $\delta$ satisfying
$$(1 - \delta)\|c\|_2^2 \le \|Ac\|_2^2 \le (1 + \delta)\|c\|_2^2 \qquad (3.3)$$
for all $c$ supported on $\cup_{i \in I} \Lambda_i$, with $|I| \le k$. They then show that any $r$-block-sparse vector $x$ can be uniquely recovered using (1.7) if $\delta_{2r}(A, \Lambda) < \sqrt{2} - 1$. Note that $\delta_{2r}(A, \Lambda) \le \delta_{2kr}(A)$ because the vectors $c$ considered for $\delta_{2r}(A, \Lambda)$ are a subset of those considered for $\delta_{2kr}(A)$. Indeed, Eldar and Mishali show that the block-restricted isometry condition (3.3) has a larger probability of being satisfied for random Gaussian ensembles than the standard non-block version. However, as with the block-coherence, this result does not give any additional guarantees for MMV since $\delta_{2r}(A, \Lambda) = \delta_{2r}(\bar{A})$. In both cases, the reason for this is the special block-diagonal structure of $A$ that follows from the reformulation of the MMV problem as an SMV problem.
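For completeness, the substitution of the MMV structure into the block-coherence condition (3.2) can be written out in two steps. The derivation below assumes that (3.2) has the form stated in the first line, and uses only the two identities quoted above for $A = I \otimes \bar{A}$, namely a block-coherence of $\mu(\bar{A})/k$ and a sub-coherence of zero.

```latex
\begin{align*}
kr &< \frac{1}{2}\left(\frac{1}{\mu_\Lambda(A)} + k
      - (k-1)\,\frac{\nu_\Lambda(A)}{\mu_\Lambda(A)}\right)
    = \frac{1}{2}\left(\frac{k}{\mu(\bar{A})} + k\right)
   && \text{using } \mu_\Lambda(A) = \mu(\bar{A})/k,\ \nu_\Lambda(A) = 0, \\
r  &< \frac{1}{2}\left(1 + \frac{1}{\mu(\bar{A})}\right)
   && \text{after dividing both sides by } k.
\end{align*}
```

The block size $k$ cancels completely, which is why the block-coherence condition yields no gain over column-by-column recovery in the MMV setting.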
3.3 MMV recovery using row-norm sums

Our analysis of sparse recovery for the MMV problem of recovering $X_0$ from $B = AX_0$ begins with an extension of Theorem 2.1 to recovery using the convex relaxation
$$\underset{X}{\text{minimize}} \quad \sum_{i=1}^{n} \|X^{i\rightarrow}\| \quad \text{subject to} \quad AX = B; \qquad (3.4)$$
note that the norm within the summation is arbitrary. Define the row support of a matrix as
$$\operatorname{Supp}_{\text{row}}(X) = \{i \mid \|X^{i\rightarrow}\| \ne 0\}.$$
With these definitions we have the following result. (A related result is given by Stojnic et al. [135].)

Theorem 3.3. Let $A$ be an $m \times n$ matrix, $I \subseteq \{1, \ldots, n\}$ be a fixed index set, and let $\|\cdot\|$ denote any vector norm. Then all $X_0 \in \mathbb{R}^{n \times k}$ with $\operatorname{Supp}_{\text{row}}(X_0) \subseteq I$ can be uniquely recovered from $B = AX_0$ using (3.4) if and only if for all $Z \ne 0$ with columns $Z^{\downarrow j} \in \operatorname{Ker}(A)$,
$$\sum_{i \in I} \|Z^{i\rightarrow}\| < \sum_{i \notin I} \|Z^{i\rightarrow}\|. \qquad (3.5)$$

Proof. For the “only if” part, suppose that there is a $Z \ne 0$ with columns $Z^{\downarrow j} \in \operatorname{Ker}(A)$ such that (3.5) does not hold. Now, choose $X^{i\rightarrow} = Z^{i\rightarrow}$ for all $i \in I$, with all remaining rows zero. Set $B = AX$. Next, define $V = X - Z$, and note that $AV = AX - AZ = AX = B$. The construction of $V$ implies that $\sum_i \|X^{i\rightarrow}\| \ge \sum_i \|V^{i\rightarrow}\|$, and consequently $X$ cannot be the unique solution of (3.4).

Conversely, let $X$ be an arbitrary matrix with $\operatorname{Supp}_{\text{row}}(X) \subseteq I$, and let $B = AX$. To show that $X$ is the unique solution of (3.4) it suffices to show that for any $Z \ne 0$ with columns $Z^{\downarrow j} \in \operatorname{Ker}(A)$,
$$\sum_i \|(X + Z)^{i\rightarrow}\| > \sum_i \|X^{i\rightarrow}\|.$$
This is equivalent to
$$\sum_{i \notin I} \|Z^{i\rightarrow}\| + \sum_{i \in I} \|(X + Z)^{i\rightarrow}\| - \sum_{i \in I} \|X^{i\rightarrow}\| > 0.$$
Applying the reverse triangle inequality, $\|a + b\| - \|b\| \ge -\|a\|$, to the summation over $i \in I$ and reordering exactly gives condition (3.5).

In the special case of the sum of $\ell_1$-norms, i.e., the $\ell_{1,1}$ norm defined in (1.9), summing the norms of the columns is equivalent to summing the norms of the rows. As a result, (3.4) can be written as
$$\underset{X}{\text{minimize}} \quad \sum_{j=1}^{k} \|X^{\downarrow j}\|_1 \quad \text{subject to} \quad AX^{\downarrow j} = B^{\downarrow j}, \quad j = 1, \ldots, k. \qquad (3.6)$$
Because this objective is separable, the problem can be decoupled and solved as a series of independent basis pursuit problems, giving one $X^{\downarrow j}$ for each column $B^{\downarrow j}$ of $B$. The following result relates recovery using the sum-of-norms formulation (3.4) to recovery using $\ell_{1,1}$-minimization (see (1.11)).

Theorem 3.4. Let $A$ be an $m \times n$ matrix, $I \subseteq \{1, \ldots, n\}$ be a fixed index set, and $\|\cdot\|$ denote any vector norm. Then uniform recovery of all $X \in \mathbb{R}^{n \times k}$ with $\operatorname{Supp}_{\text{row}}(X) \subseteq I$ using sums of norms (3.4) implies uniform recovery on $I$ using $\ell_{1,1}$.

Proof. For uniform recovery on support $I$ to hold it follows from Theorem 3.3 that for any matrix $Z \ne 0$ with columns $Z^{\downarrow j} \in \operatorname{Ker}(A)$, property (3.5) holds. In particular it holds for $Z$ with $Z^{\downarrow j} = \bar{z}$ for all $j$, with $\bar{z} \in \operatorname{Ker}(A) \setminus \{0\}$. Note that for these matrices there exists a norm-dependent constant $\gamma$ such that $|\bar{z}_i| = \gamma \|Z^{i\rightarrow}\|$. Since the choice of $\bar{z}$ was arbitrary, it follows from (3.5) that the NS-condition (2.2) for independent recovery of vectors $B^{\downarrow j}$ using $\ell_1$ in Theorem 2.1 is satisfied. Moreover, because $\ell_{1,1}$ is equivalent to independent recovery, we also have uniform recovery on $I$ using $\ell_{1,1}$.

An implication of Theorem 3.4 is that the use of restricted isometry conditions (or any technique for that matter) to analyze uniform joint $r$-sparse recovery conditions for the sum-of-norms approach necessarily leads to results that are no stronger than uniform $\ell_1$ recovery for each vector independently. Eldar and Rauhut [74, Prop. 4.1] make a similar observation with regard to recovery using the $\ell_{1,2}$ norm. Their result can easily be extended to the general sum-of-norms formulation.

[Figure 3.1 plot: recovery rate (%) versus $r$, with curves for $\ell_{1,2}$ and $\ell_{1,1}$ at $k = 2, 3, 5$.]
Figure 3.1: Recovery rates for fixed, randomly drawn $20 \times 60$ matrices $A$, averaged over 1,000 trials at each row-sparsity level $r$. The nonzero entries in the $60 \times k$ matrix $X_0$ are sampled i.i.d.
from the normal distribution. The solid and dashed lines represent $\ell_{1,2}$ and $\ell_{1,1}$ recovery, respectively.

3.4 MMV recovery using $\ell_{1,2}$

In this section we take a closer look at the $\ell_{1,2}$-minimization problem (1.10), which is a special case of the sum-of-norms problem. Although Theorem 3.4 establishes that uniform recovery via $\ell_{1,2}$ is no better than uniform recovery via $\ell_{1,1}$, there are many situations in which it recovers signals that $\ell_{1,1}$ cannot. Indeed, it is evident from Figure 3.1 that the probability of recovering individual signals with random signs and support is much higher for $\ell_{1,2}$. It is also clear from this figure that the probability of recovery via $\ell_{1,1}$ even decreases with increasing $k$. (We explain this phenomenon in Section 3.5.) The limitation of analysis for uniform recovery was also observed by Eldar and Rauhut [74], who proceed by replacing this worst-case analysis with an appropriate average-case analysis. Based on this model, they show that joint-sparse MMV recovery using $\ell_{1,2}$ is, on average, superior to independent basis pursuit SMV recovery.

We next derive the optimality conditions for $\ell_{1,2}$, similar to those for $\ell_1$ given in Section 2.3. Building on these conditions we then construct examples for which $\ell_{1,2}$ recovery works and $\ell_{1,1}$ fails, and vice versa. We conclude the section with a set of experiments that illustrate the aforementioned construction and show the difference in recovery rates.

3.4.1 Sufficient conditions for recovery via $\ell_{1,2}$

The optimality conditions of the $\ell_{1,2}$ problem (1.10) play a vital role in deriving a set of sufficient conditions for joint-sparse recovery. In this section we derive the dual of (1.10) and the corresponding necessary and sufficient optimality conditions. These allow us to derive sufficient conditions for recovery via $\ell_{1,2}$. We start with the Lagrangian for (1.10), which is defined as
$$L(X, Y) = \|X\|_{1,2} - \langle Y, AX - B \rangle, \qquad (3.7)$$
with $\langle V, W \rangle := \operatorname{trace}(V^T W)$ an inner product defined over real matrices.
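The quantities entering the Lagrangian can be sketched directly from their definitions. The code below is a minimal illustration using plain Python lists: `norm_l12` sums the Euclidean norms of the rows, `norm_l11` sums all absolute values, `inner` evaluates $\operatorname{trace}(V^T W)$ entrywise, and `lagrangian` combines them. All function names are hypothetical and chosen for this example only.

```python
import math


def matmul(A, X):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][p] * X[p][j] for p in range(len(X)))
             for j in range(len(X[0]))] for i in range(len(A))]


def norm_l12(X):
    """Sum over rows of the Euclidean norm of each row."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in X)


def norm_l11(X):
    """Sum of the absolute values of all entries."""
    return sum(abs(v) for row in X for v in row)


def inner(V, W):
    """Matrix inner product trace(V^T W), i.e. the sum of entrywise products."""
    return sum(v * w for rv, rw in zip(V, W) for v, w in zip(rv, rw))


def lagrangian(A, B, X, Y):
    """||X||_{1,2} - <Y, AX - B> for given primal X and multiplier Y."""
    AX = matmul(A, X)
    R = [[ax - b for ax, b in zip(row_ax, row_b)] for row_ax, row_b in zip(AX, B)]
    return norm_l12(X) - inner(Y, R)


A_small = [[1.0, 0.0, 1.0],
           [0.0, 1.0, 0.0]]
X_small = [[3.0, 4.0],
           [0.0, 0.0],
           [0.0, -2.0]]
Y_small = [[1.0, -1.0],
           [2.0, 5.0]]
B_small = matmul(A_small, X_small)
value = lagrangian(A_small, B_small, X_small, Y_small)
```

When $X$ is feasible the residual $AX - B$ vanishes, so the Lagrangian reduces to $\|X\|_{1,2}$ regardless of the multiplier $Y$; in this example both equal 7.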
The dual is then given by maximizing
$$\inf_X L(X, Y) = \inf_X \{\|X\|_{1,2} - \langle Y, AX - B \rangle\} = \langle B, Y \rangle - \sup_X \{\langle A^T Y, X \rangle - \|X\|_{1,2}\} \qquad (3.8)$$
over $Y$. (Because the primal problem has only linear constraints, there necessarily exists a dual solution $Y^*$ that maximizes this expression [127, Theorem 28.2].) To simplify the supremum term, we note that for any convex, positively homogeneous function $f$,
$$\sup_v \{\langle w, v \rangle - f(v)\} = \begin{cases} 0 & \text{if } w \in \partial f(0), \\ \infty & \text{otherwise.} \end{cases}$$
To derive these conditions, note that positive homogeneity of $f$ implies that $f(0) = 0$, and thus $w \in \partial f(0)$ implies that $\langle w, v \rangle \le f(v)$ for all $v$. Hence, the supremum is achieved with $v = 0$. If on the other hand $w \notin \partial f(0)$, then there exists some $v$ such that $\langle w, v \rangle > f(v)$, and by the positive homogeneity of $f$, $\langle w, \alpha v \rangle - f(\alpha v) \to \infty$ as $\alpha \to \infty$. Applying this expression for the supremum to (3.8), we arrive at the necessary condition
$$A^T Y \in \partial \|0\|_{1,2}, \qquad (3.9)$$
which is required for dual feasibility.

We now derive an expression for the subdifferential $\partial \|X\|_{1,2}$. For rows $i$ where $\|X^{i\rightarrow}\|_2 > 0$, the gradient is given by $\nabla \|X^{i\rightarrow}\|_2 = X^{i\rightarrow} / \|X^{i\rightarrow}\|_2$. For the remaining rows, the gradient is not defined, but $\partial \|X^{i\rightarrow}\|_2$ coincides with the unit $\ell_2$-norm ball $B_2^k = \{v \in \mathbb{R}^k \mid \|v\|_2 \le 1\}$. Thus, for each $i = 1, \ldots, n$,
$$\partial_{X^{i\rightarrow}} \|X\|_{1,2} \in \begin{cases} X^{i\rightarrow} / \|X^{i\rightarrow}\|_2 & \text{if } \|X^{i\rightarrow}\|_2 > 0, \\ B_2^k & \text{otherwise.} \end{cases} \qquad (3.10)$$
Combining this expression with (3.9), we arrive at the dual of (1.10):
$$\underset{Y}{\text{maximize}} \quad \operatorname{trace}(B^T Y) \quad \text{subject to} \quad \|A^T Y\|_{\infty,2} \le 1. \qquad (3.11)$$
Note that this dual formulation could have been obtained directly by noting that the $\ell_{\infty,2}$-norm is the dual of the $\ell_{1,2}$-norm. However, the above derivation does give additional information. Indeed, it shows that the following conditions are necessary and sufficient for a primal-dual pair $(X^*, Y^*)$ to be optimal for (1.10) and its dual (3.11):
$$AX^* = B \quad \text{(primal feasibility);} \qquad (3.12a)$$
$$\|A^T Y^*\|_{\infty,2} \le 1 \quad \text{(dual feasibility);} \qquad (3.12b)$$
$$\|X^*\|_{1,2} = \operatorname{trace}(B^T Y^*) \quad \text{(zero duality gap).} \qquad (3.12c)$$
The existence of a matrix $Y^*$ that satisfies (3.12) provides a certificate that the feasible matrix $X^*$ is an optimal solution of (1.10). However, it does not guarantee that $X^*$ is also the unique solution. The following theorem gives sufficient conditions, similar to those in Section 2.3, that also guarantee uniqueness of the solution.

Theorem 3.5. Let $A$ be an $m \times n$ matrix, and $B$ be an $m \times k$ matrix. Then a set of sufficient conditions for $X$ to be the unique minimizer of (1.10) with Lagrange multiplier $Y \in \mathbb{R}^{m \times k}$ and row support $I = \operatorname{Supp}_{\text{row}}(X)$, is that
$$AX = B, \qquad (3.13a)$$
$$(A^T Y)^{\downarrow i} = (X^*)^{i\rightarrow} / \|(X^*)^{i\rightarrow}\|_2, \quad i \in I, \qquad (3.13b)$$
$$\|(A^T Y)^{\downarrow i}\|_2 < 1, \quad i \notin I, \qquad (3.13c)$$
$$\operatorname{rank}(A_I) = |I|. \qquad (3.13d)$$

Proof. The first three conditions clearly imply that $(X, Y)$ are primal and dual feasible, and thus satisfy (3.12a) and (3.12b). Conditions (3.13b) and (3.13c) together imply that
$$\operatorname{trace}(B^T Y) \equiv \sum_{i=1}^{n} [(A^T Y)^{\downarrow i}]^T X^{i\rightarrow} = \sum_{i=1}^{n} \|X^{i\rightarrow}\|_2 \equiv \|X\|_{1,2}.$$
The first and last identities above follow directly from the definitions of the matrix trace and of the norm $\|\cdot\|_{1,2}$, respectively; the middle equality follows from the standard Cauchy inequality. Thus, the zero-gap requirement (3.12c) is satisfied. The conditions (3.13a)–(3.13c) are therefore sufficient for $(X, Y)$ to be an optimal primal-dual solution of (1.10). Because $Y$ determines the support and is a Lagrange multiplier for every solution $X$, this support must be unique. It then follows from condition (3.13d) that $X$ must be unique.

3.4.2 Counter examples

Using the sufficient and necessary conditions developed in the previous section we now construct examples of problems for which $\ell_{1,2}$ succeeds while $\ell_{1,1}$ fails, and vice versa. Because of its simplicity, we begin with the latter.

Recovery using $\ell_{1,1}$ where $\ell_{1,2}$ fails. A= 1 0 0.5 0.5 1 , 0.8 Consider the 2 and X0 = 2 0 matrices 1 10 .
0 By drawing AC, the convex hull of the columns of ±A, it is easily seen that convex combinations of the first two columns give points on a face of the polytope. Because the weights in the columns of X0 are scalar multiples of such points they can be uniquely recovered using 1 minimization, and consequently X0 itself can be recovery using 1,1 . On the other hand, for 1,2 minimization to recover X0 , there must exist a Y ∈ R2×2 satisfying both (3.12b) and (3.13b). However, the unique Y satisfying the latter condition does not satisfy the former, thereby showing that 1,2 fails to recover X0 . Recovery using 1,2 where 1,1 fails. For the construction of a problem where 1,2 succeeds and 1,1 fails, we consider two vectors, f and s, with the same support I, in such a way that individual 1 recovery fails for f , while it succeeds for s. In addition we assume that there exists a vector y that satisfies y TA↓j = sign(sj ) for all j ∈ I, and |y TA↓j | < 1 for all j ∈ I; i.e., y satisfies conditions (3.13b) and (3.13c). Using the vectors f and s, we construct the 2-column matrix X0 = [(1−γ)s, γf ], and claim that for sufficiently small γ > 0, this gives the desired reconstruction problem. Clearly, for any γ = 0, 1,1 recovery fails because the second column can never be recovered, and we only need to show that 1,2 does succeed. For γ = 0, the matrix Y = [y, 0] satisfies conditions (3.13b) and (3.13c) and, assuming (3.13d) is also satisfied, X0 is the unique solution of 1,2 with B = AX0 . For sufficiently small γ > 0, the conditions that Y need to satisfy change slightly due to the division by X0i→ 2 for those rows in Supprow (X). By adding corrections to the columns of Y those new conditions can be satisfied. In particular, these corrections can be done by adding weighted combinations of the columns in Y¯ , which are constructed in such a way that it satisfies ATI Y¯ = I, and minimizes ATI c Y¯ ∞,∞ on the complement I c of I. 
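The certificate conditions of Theorem 3.5 lend themselves to a direct numerical check for a candidate pair $(X, Y)$. The following is a minimal sketch under stated assumptions, not code from the thesis: the function name `check_l12_certificate` and the tolerance `tol` are hypothetical, NumPy is assumed to be available, and the rows of $A^T Y$ are compared against the normalized rows of $X$ as in the subdifferential (3.10).

```python
import numpy as np

def check_l12_certificate(A, B, X, Y, tol=1e-8):
    """Check the sufficient conditions (3.13a)-(3.13d) for a candidate (X, Y).

    Hypothetical helper for illustration; assumes X has at least one nonzero row.
    """
    row_norms = np.linalg.norm(X, axis=1)
    on = row_norms > tol                      # row support I of X
    G = A.T @ Y                               # n-by-k matrix A^T Y

    primal = np.allclose(A @ X, B, atol=tol)                              # (3.13a)
    aligned = np.allclose(G[on], X[on] / row_norms[on, None], atol=tol)   # (3.13b)
    strict = bool(np.all(np.linalg.norm(G[~on], axis=1) < 1 - tol))       # (3.13c)
    full_rank = bool(np.linalg.matrix_rank(A[:, on]) == on.sum())         # (3.13d)
    return {"3.13a": primal, "3.13b": aligned, "3.13c": strict, "3.13d": full_rank}

# Toy instance (assumed data): with A = I, choosing Y as the normalized rows of X
# on the support and zero elsewhere satisfies all four conditions.
A = np.eye(3)
X = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.6, 0.8], [0.0, 0.0], [1.0, 0.0]])
result = check_l12_certificate(A, A @ X, X, Y)
```

A check of this kind only verifies a given $Y$; finding a suitable $Y$, or showing that none exists as in the first counter example, requires solving the dual problem.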
Note that the above argument can also be used to show that $\ell_{1,2}$ fails for $\gamma$ sufficiently close to one. Because the support and signs of $X$ remain the same for all $0 < \gamma < 1$, we can conclude the following: recovery using $\ell_{1,2}$ can be influenced by the magnitude of the nonzero entries of $X_0$. This is unlike $\ell_{1,1}$, where recovery depends only on the support and sign pattern of the nonzero entries. A consequence of this conclusion is that the notion of faces used in the geometrical interpretation of $\ell_1$ is not applicable to the $\ell_{1,2}$ problem.

3.4.3 Experiments

To get an idea of just how much more $\ell_{1,2}$ can recover in the above case where $\ell_{1,1}$ fails, we generate a $20 \times 60$ matrix $A$ with entries i.i.d. normally distributed, and determine a set of vectors $s_i$ and $f_i$ with identical support for which $\ell_1$ recovery succeeds and fails, respectively. Using triples of vectors $s_i$ and $f_j$ we construct row-sparse matrices such as $X_0 = [s_1, f_1, f_2]$ or $X_0 = [s_1, s_2, f_2]$, and attempt to recover from $B = AX_0W$, where $W = \operatorname{diag}(\omega_1, \omega_2, \omega_3)$ is a diagonal weighting matrix with nonnegative entries and unit trace, by solving (1.10). For problems of this size, interior-point methods are very efficient and we use SDPT3 [142, 150] through the CVX toolbox [84, 83]. We consider $X_0$ to be recovered when the maximum absolute difference between $X_0$ and the $\ell_{1,2}$ solution $X^*$ is less than $10^{-5}$. The results of the experiment are shown in Figure 3.2. In addition to the expected regions of recovery around individual columns $s_i$ and failure around $f_i$, we see that certain combinations of vectors $s_i$ still fail, while other combinations of vectors $f_i$ may be recoverable. By contrast, when using $\ell_{1,1}$ to solve the problem, any combination of $s_i$ vectors can be recovered while no combination including an $f_i$ can be recovered.

[Figure 3.2: eight triangular panels, labeled $|I| = 5$ (three panels), $|I| = 7$ (one panel), and $|I| = 10$ (four panels).]

Figure 3.2: Generation of problems where $\ell_{1,2}$ succeeds, while $\ell_{1,1}$ fails.
For a $20 \times 60$ matrix $A$ and fixed support of size $|I| = 5, 7, 10$, we create vectors $f_i$ that cannot be recovered using $\ell_1$, and vectors $s_i$ that can be recovered. Each triangle represents an $X_0$ constructed from the vectors denoted in the corners. The location in the triangle determines the weight on each vector, ranging from zero to one, and summing up to one. The dark areas indicate the weights for which $\ell_{1,2}$ successfully recovered $X_0$.

3.5 Bridging the gap from $\ell_{1,1}$ to ReMBo

We begin this section with a discussion showing that the performance of $\ell_{1,1}$ can only get worse with increasing number of observations, thus explaining empirical observations made earlier by Chen and Huo [38] and Mishali and Eldar [114]. We then show how the recovery rate can be improved by using the boosting technique introduced by Mishali and Eldar [114]. The resulting boosted-$\ell_1$ approach is a simplified version of the ReMBo-$\ell_1$ algorithm, which we discuss in Section 3.6. Although boosted-$\ell_1$ has a lower performance, we include it because its simplicity makes it easy to analyze and allows us to show more intuitively what it is that makes ReMBo-$\ell_1$ work so well. Also, the recovery rate of boosted-$\ell_1$ motivates a performance model for ReMBo-$\ell_1$ recovery. Thus, while boosted-$\ell_1$ may not be a viable algorithm in practice, it does help bridge the gap between $\ell_{1,1}$ and ReMBo-$\ell_1$.

As described in Section 3.3, recovery using $\ell_{1,1}$ is equivalent to individual $\ell_1$ recovery of each column $x^{(j)} := X_0^{\downarrow j}$ based on solving (BP) with $b := B^{\downarrow j}$, for $j = 1, \ldots, k$. Assume that the signs of nonzero entries in the support of each $x^{(j)}$ are uniformly distributed. Then we can express the probability of recovering $X_0$ with row support $I$ using $\ell_{1,1}$ in terms of the probability of recovering individual vectors on that support using $\ell_1$. By the separability of (3.6), $\ell_{1,1}$ recovers $X_0$ if and only if (BP) successfully recovers each $x^{(j)}$.
Recalling (2.12), denote the recovery rate of an $x^{(j)}$ supported on $I$ by $P_{\ell_1}(A, I)$. Then the expected $\ell_{1,1}$ recovery rate is
\[
P_{\ell_{1,1}}(A, I, k) = [P_{\ell_1}(A, I)]^k.
\]
This expression shows that the probability of recovery using $\ell_{1,1}$ can only decrease as $k$ increases, which clearly defeats the purpose of gathering multiple observations; see Figure 3.1.

There are many problem instances where $\ell_{1,1}$ fails to recover $X_0$ as a whole but does correctly recover a subset of columns $x^{(j)}$. The following boosting procedure [114] exploits this fact and uses it to help generate the entire solution. Given such a vector $x^{(j)}$ with support $J$ of sufficiently small cardinality (e.g., less than $m/2$), solve the following system for $\bar{X}$:
\[
\underset{\bar{X}}{\text{minimize}} \quad \|A_J \bar{X} - B\|_F. \tag{3.14}
\]
If the residual in (3.14) is zero, conclude that the support $J$ coincides with $I$ and assume that the nonzero entries of $X_0$ are given by $\bar{X}$. If the residual is nonzero, the support $J$ is necessarily incorrect, and the next sufficiently-sparse vector is checked. This approach is outlined in Algorithm 3.1.

Algorithm 3.1: The boosted $\ell_1$ algorithm
  given $A$, $B$
  for $j = 1, \ldots, k$ do
    solve (BP) with $b := B^{\downarrow j}$ to get $x$
    $J \leftarrow \mathrm{Supp}(x)$
    if $|J| < m/2$ then
      solve (3.14) to get $\bar{X}$
      if $A_J \bar{X} = B$ then
        $X^* = 0$
        $[(X^*)^{i\to}]_{i \in J} \leftarrow \bar{X}$
        return solution $X^*$
  return failure

The recovery properties of the boosted $\ell_1$ approach are opposite from those of $\ell_{1,1}$: it fails only if all individual columns with support $I$ fail to be recovered using $\ell_1$. Hence, given an unknown $n \times k$ matrix $X_0$ supported on $I$ with its sign pattern uniformly random, the boosted $\ell_1$ algorithm enjoys an expected recovery rate of
\[
P(A, I, k) = 1 - [1 - P_{\ell_1}(A, I)]^k. \tag{3.15}
\]
Note that this result hinges on our blanket assumption that the columns of $X_0$ have identical support.

To experimentally verify the recovery rate, we generate a $20 \times 80$ matrix $A$ with entries independently sampled from the normal distribution and fix a randomly chosen support set $I_r$ for three levels of sparsity, $r = 8, 9, 10$. On each of these three supports we generate vectors with all possible sign patterns and solve (BP) to verify if they can be recovered (see Section 3.4.3). This gives exactly the face counts required to compute the $\ell_1$ recovery probability in (2.12), and the expected boosted $\ell_1$ recovery rate in (3.15). For the empirical success rate we take the average over 1,000 trials with random matrices $X_0$ supported on $I_r$, with nonzero entries independently drawn from the normal distribution. Because recovery of individual vectors using $\ell_1$ minimization depends only on their sign pattern, we reduce the computational time by comparing the sign patterns against precomputed recovery tables (this is possible because both $A$ and $I_r$ remain fixed), rather than invoking an $\ell_1$ solver for each vector. The theoretical and empirical recovery rates using boosted $\ell_1$ are plotted in Figure 3.3.

3.6 Recovery using ReMBo

The ReMBo algorithm by Mishali and Eldar [114] proceeds by taking a random $k$-vector $w$ and combining the individual observations in $B$ into a single weighted observation $b := Bw$. It then solves an SMV problem $Ax = b$ for this $b$ and checks if the computed solution $x^*$ is sufficiently sparse. If not, the above steps are repeated with different weight vectors $w$ until a given maximum number of trials is reached. If the support $J$ of $x^*$ is small, we form $A_J = [A^{\downarrow j}]_{j \in J}$, and check if (3.14) has a solution $\bar{X}$ with zero residual. If this is the case we have the nonzero rows of the solution $X^*$ in $\bar{X}$ and are done. Otherwise, we simply proceed with the next weight vector $w$.

The ReMBo algorithm does not prescribe a particular SMV solver. However, throughout this section we use ReMBo in conjunction with $\ell_1$ minimization, thus giving the ReMBo-$\ell_1$ algorithm summarized in Algorithm 3.2.
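The ReMBo-$\ell_1$ loop described above can be sketched in a few lines. The following is an illustrative sketch only, not the implementation used for the experiments (which rely on CVX with SDPT3): it assumes NumPy and SciPy are available, solves (BP) as a linear program through SciPy's `linprog` with the HiGHS backend, and uses hypothetical names (`basis_pursuit`, `rembo_l1`, `weights`, `tol`) and a hypothetical numerical tolerance in place of exact tests for zero.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """Solve min ||x||_1 s.t. Ax = b as an LP, splitting x into xp - xn with xp, xn >= 0."""
    n = A.shape[1]
    res = linprog(c=np.ones(2 * n), A_eq=np.hstack([A, -A]), b_eq=b,
                  bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:] if res.success else None

def rembo_l1(A, B, max_iter=50, weights=None, tol=1e-6, seed=None):
    """Sketch of ReMBo-l1; returns an n-by-k matrix X with AX = B, or None on failure.

    `weights` (assumed parameter) optionally supplies the weight vector for each
    iteration; by default the weights are drawn from the normal distribution.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = B.shape[1]
    for it in range(max_iter):
        w = rng.standard_normal(k) if weights is None else weights[it]
        x = basis_pursuit(A, B @ w)              # SMV problem with b = Bw
        if x is None:
            continue
        J = np.flatnonzero(np.abs(x) > tol)      # numerical support of x
        if 0 < J.size < m / 2:                   # empty supports are skipped here
            Xbar, *_ = np.linalg.lstsq(A[:, J], B, rcond=None)   # least squares (3.14)
            residual = np.linalg.norm(A[:, J] @ Xbar - B)
            if residual <= tol * max(1.0, np.linalg.norm(B)):
                X = np.zeros((n, k))
                X[J] = Xbar                      # fill in the rows on the support
                return X
    return None
```

Passing `weights=np.eye(k)` with `max_iter=k` makes the loop try each column of $B$ in turn, which corresponds to the boosted $\ell_1$ procedure of Algorithm 3.1.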
It can be seen that the ReMBo-$\ell_1$ algorithm reduces to boosted $\ell_1$ when setting the maximum number of iterations to $k$ and choosing $w := e_i$ in the $i$th iteration. The formulation given in [114] requires a user-defined threshold on the cardinality of the support $J$ instead of the fixed threshold $m/2$ in Algorithm 3.2. Ideally, based on (2.4), this threshold should be half the spark of $A$, but this is too expensive to compute. If, however, we assume that $A$ is in general position, we can take $\mathrm{Spark}(A) = m + 1$. Choosing an even larger threshold can help recover signals with row sparsity exceeding $m/2$, although, in this case, the solution can no longer be guaranteed to be the sparsest possible solution.

[Figure 3.3: plot of recovery rate (%) versus $k$, for $r = 8$, $r = 9$, and $r = 10$.]

Figure 3.3: Theoretical (dashed) and experimental (solid) performance of boosted $\ell_1$ on three problem instances with different row-support sizes $r$.

In our study of ReMBo-$\ell_1$, we fix an unknown matrix $X_0$ with row support $I$ of cardinality $r$. We deviate from the blanket assumption made in the introduction of this chapter and allow for individual columns to be supported on a subset of $I$. Each time we multiply $B$ by a random weight vector $w^{(k)}$, we in fact create a new problem which, with probability one, has an exact $r$-sparse solution $x_0 := X_0 w^{(k)}$. Recall that the recovery of sparse vectors from $b = Ax_0$ using $\ell_1$ depends only on the support and sign pattern of $x_0$. Clearly, the probability of recovery improves as the number of distinct sign patterns encountered by ReMBo-$\ell_1$ increases. The maximum number of sign patterns encountered with boosted $\ell_1$ is the number of observations $k$. The question thus becomes, how many different sign patterns can ReMBo-$\ell_1$ encounter by taking linear combinations of the columns in $X_0$? (We disregard the situation where elimination occurs and $|\mathrm{Supp}(X_0 w)| < r$.)
Equivalently, we can ask how many orthants in $\mathbb{R}^r$ (each corresponding to a different sign pattern) can be properly intersected by the subspace given by the range of the submatrix $\bar{X}$ consisting of the nonzero rows of $X_0$ (by proper we mean intersection of the interior). In Section 3.6.1 we derive an exact expression for the maximum number of proper orthant intersections in $\mathbb{R}^n$ by a subspace generated by $d$ vectors, denoted by $C(n, d)$.

Algorithm 3.2: The ReMBo-$\ell_1$ algorithm
  given $A$, $B$. Set Iteration $\leftarrow 0$
  while Iteration < MaxIteration do
    $w \leftarrow \mathrm{Random}(k, 1)$
    solve (BP) with $b = Bw$ to get $x$
    $J \leftarrow \mathrm{Supp}(x)$
    if $|J| < m/2$ then
      solve (3.14) to get $\bar{X}$
      if $A_J \bar{X} = B$ then
        $X^* = 0$
        $(X^*)^{i\to} \leftarrow \bar{X}^{i\to}$ for $i \in J$
        return solution $X^*$
    Iteration $\leftarrow$ Iteration $+ 1$
  return failure

Based on the above reasoning, a good model for a bound on the recovery rate for $n \times k$ matrices $X_0$ with $|\mathrm{Supp}_{\mathrm{row}}(X_0)| = |I| < m/2$ using ReMBo-$\ell_1$ is given by
\[
P_R(A, I, k) = 1 - \prod_{i=1}^{S} \left[ 1 - \frac{F_I(A\mathcal{C})}{F_I(\mathcal{C}) - 2(i-1)} \right], \tag{3.16}
\]
where $S$ denotes the number of unique sign patterns tried. The maximum possible value of $S$ is $C(|I|, k)/2$. This maximum number of orthant intersections is attained with probability one for subspaces $\mathrm{Range}(X_0)$, where the entries in the support of $X_0$ are normally distributed (cf. Corollary 3.8). The term within brackets denotes the probability of failure and the fraction represents the success rate, which is given by the ratio of the number of faces $F_I(A\mathcal{C})$ that survived the mapping to the total number of faces to consider. The total number reduces by two at each trial because we can exclude the face $f$ we just tried, as well as $-f$. The factor of two in $C(|I|, k)/2$ is also due to this symmetry. (Henceforth we use the convention that the uniqueness of a sign pattern is invariant under negation.)

This model would be a bound for the average performance of ReMBo-$\ell_1$ if the sign patterns generated were randomly sampled from the space of all sign patterns on the given support.
However, because they are generated from the orthant intersections with a subspace, the actual set of patterns is highly structured. Indeed, it is possible to imagine a situation where the $(r-1)$-faces in $\mathcal{C}$ that perish in the mapping to $A\mathcal{C}$ all have sign patterns that are contained in the set generated by a single subspace. Any other set of sign patterns would then necessarily include some faces that survive the mapping and by trying all patterns in that set we would recover $X_0$. In this case, the average recovery over all $X_0$ on that support could be much higher than that given by (3.16).

An interesting question is how the surviving faces of $\mathcal{C}$ are distributed. Due to the simplicial structure of the facets of $\mathcal{C}$, we can expect the faces that perish to be partially clustered and partially unclustered. (If a $(d-2)$-face perishes, then so will the two $(d-1)$-faces whose intersection gives this face, thus giving a cluster of faces that are lost. On the other hand, there will be faces that perish, while all their sub-faces survive.) Note that, regardless of these patterns, recovery is guaranteed in the limit whenever the number of sign patterns tried exceeds $|F_I(\mathcal{C})| - |F_I(A\mathcal{C})|$, the number of faces lost.

[Figure 3.4: plot of recovery rate (%) versus $k$, for $r = 8$, $r = 9$, and $r = 10$.]

Figure 3.4: Theoretical performance model for ReMBo on three problem instances with different sparsity levels $r$.

Figure 3.4 illustrates the theoretical performance model based on $C(n, d)$, for which we derive the exact expression in Section 3.6.1. In Section 3.6.2 we discuss practical limitations, and in Section 3.6.3 we empirically look at how the number of sign patterns generated grows with the number of normally distributed vectors $w$, and how this affects the recovery rates. To allow comparison between ReMBo and boosted $\ell_1$, we used the same matrix $A$ and support $I_r$ used to generate Figure 3.3.
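To make the performance model concrete, the sketch below evaluates (3.16) with $S$ set to its maximum $C(|I|, k)/2$, using the closed form for $C(n, d)$ stated in Theorem 3.6, and counts the sign-pattern pairs reached by sampling weight vectors. It is a minimal illustration under stated assumptions: the function names are hypothetical, the face counts passed to `rembo_model` are illustrative inputs rather than measured values, and the total face count on a support of size $r$ is assumed to be $2^r$ (one face per sign pattern).

```python
import math
import random

def max_orthants(n, d):
    """C(n, d) = 2 * sum_{i=0}^{d-1} binom(n-1, i), which equals 2^n when d >= n."""
    return 2 * sum(math.comb(n - 1, i) for i in range(d))

def rembo_model(faces_surviving, faces_total, S):
    """Model (3.16): 1 - prod_{i=1}^{S} [1 - F_I(AC) / (F_I(C) - 2(i-1))]."""
    fail = 1.0
    for i in range(1, S + 1):
        remaining = faces_total - 2 * (i - 1)   # faces not yet excluded (f and -f per trial)
        if remaining <= 0:
            break
        fail *= 1.0 - min(1.0, faces_surviving / remaining)
    return 1.0 - fail

def count_sign_pattern_pairs(Xbar, trials, rng):
    """Count sign patterns of Xbar @ w, identified up to negation, over random normal w."""
    k = len(Xbar[0])
    seen = set()
    for _ in range(trials):
        w = [rng.gauss(0.0, 1.0) for _ in range(k)]
        signs = tuple(sum(a * b for a, b in zip(row, w)) > 0 for row in Xbar)
        if not signs[0]:
            signs = tuple(not s for s in signs)  # map each pattern and its negation together
        seen.add(signs)
    return len(seen)

# Illustrative setting: r = 10 rows on the support, k = 5 observations.
r, k = 10, 5
S_max = max_orthants(r, k) // 2                  # maximum number of unique pattern pairs
rate = rembo_model(faces_surviving=12, faces_total=2 ** r, S=S_max)  # 12 is an assumed count

rng = random.Random(0)
Xbar = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(r)]
reached = count_sign_pattern_pairs(Xbar, trials=2000, rng=rng)
```

The sampled count can never exceed `S_max`, and in practice it approaches that bound only slowly because some orthants are rarely reached with normally distributed weights.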
3.6.1 Maximum orthant intersections with subspace

In this section we give an exact characterization of the number of orthants intersected by a subspace. The maximum number of intersections is given by the following theorem.

Theorem 3.6. Let $C(n, d)$ denote the maximum attainable number of orthant interiors intersected by a subspace in $\mathbb{R}^n$ generated by $d$ vectors. Then $C(n, 1) = 2$, $C(n, d) = 2^n$ for $d \ge n$. In general, $C(n, d)$ is given by
\[
C(n, d) = C(n-1, d-1) + C(n-1, d) = 2 \sum_{i=0}^{d-1} \binom{n-1}{i}. \tag{3.17}
\]

Proof. The number of intersected orthants is exactly equal to the number of proper sign patterns (excluding zero values) that can be generated by linear combinations of those $d$ vectors. When $d = 1$, there can only be two such sign patterns corresponding to positive and negative multiples of that vector, thus giving $C(n, 1) = 2$. Whenever $d \ge n$, we can choose a basis for $\mathbb{R}^n$ and add additional vectors as needed, and we can reach all points, and therefore all $2^n = C(n, d)$ sign patterns.

For the general case (3.17), let $v_1, \ldots, v_d$ be vectors in $\mathbb{R}^n$ such that the affine hull with the origin, $S = \operatorname{aff}\{0, v_1, \ldots, v_d\}$, gives a subspace in $\mathbb{R}^n$ that properly intersects the maximum number of orthants, $C(n, d)$. Without loss of generality, assume that vectors $v_i$, $i = 1, \ldots, d-1$, all have their $n$th component equal to zero. Now, let $T = \operatorname{aff}\{0, v_1, \ldots, v_{d-1}\} \subseteq \mathbb{R}^{n-1}$ be the intersection of $S$ with the $(n-1)$-dimensional subspace of all points $\mathcal{X} = \{x \in \mathbb{R}^n \mid x_n = 0\}$, and let $C_T$ denote the number of $(n-1)$-orthants intersected by $T$. Note that $T$ itself, as embedded in $\mathbb{R}^n$, does not properly intersect any orthant. However, by adding or subtracting an arbitrarily small amount of $v_d$, we intersect $2C_T$ orthants; taking $v_d$ to be the $n$th column of the identity matrix would suffice for that matter; see Figure 3.5(a,b).
Any other orthants that are added have either $x_n > 0$ or $x_n < 0$, and their number does not depend on the magnitude of the $n$th entry of $v_d$, provided it remains nonzero. Because only the first $n-1$ entries of $v_d$ determine the maximum number of additional orthants, the problem reduces to $\mathbb{R}^{n-1}$. In fact, we ask how many new orthants can be added to $C_T$ by taking the affine hull of $T$ with $v$, the orthogonal projection of $v_d$ onto $\mathcal{X}$. Since the maximum number of orthants for this $d$-dimensional subspace in $\mathbb{R}^{n-1}$ is given by $C(n-1, d)$, this number is clearly bounded by $C(n-1, d) - C_T$. Adding this to $2C_T$, we have
\[
\begin{aligned}
C(n, d) &\le 2C_T + [C(n-1, d) - C_T] = C_T + C(n-1, d) \\
&\le C(n-1, d-1) + C(n-1, d) \\
&\le 2 \sum_{i=0}^{d-1} \binom{n-1}{i}.
\end{aligned} \tag{3.18}
\]
The final expression follows by expanding the recurrence relations, which generates (a part of) Pascal's triangle, and combining this with $C(1, j) = 2$ for $j \ge 1$. In the above, whenever there are free orthants in $\mathbb{R}^{n-1}$, that is, when $d < n$, we can always choose the corresponding part of $v_d$ in that orthant. As a consequence, no subspace generated by a set of vectors can intersect the maximum number of orthants when the range of those vectors includes some $e_i$.

[Figure 3.5: three panels (a), (b), (c), with axes x, y, z.]

Figure 3.5: Illustration of orthant intersection, (a) one-dimensional subspace in the two-dimensional xy-plane, intersecting two orthants, (b) trivial extension to three dimensions, doubling the number of orthants intersected, (c) optimal number of six intersections in three dimensions.

We now show that this expression holds with equality. Let $U$ denote an $(n-d)$-subspace in $\mathbb{R}^n$ that intersects the maximum $C(n, n-d)$ orthants. We claim that in the interior of each orthant not intersected by $U$ there exists a vector that is orthogonal to $U$. If this were not the case then T must be aligned with some $e_i$ and can therefore not be optimal.
The span of these orthogonal vectors generates a $d$-subspace $V$ that intersects $C_V = 2^n - C(n, n-d)$ orthants, and it follows that
\[
\begin{aligned}
C(n, d) \ge C_V &= 2^n - C(n, n-d) \\
&\ge 2^n - 2 \sum_{i=0}^{n-d-1} \binom{n-1}{i} \\
&= 2 \sum_{i=0}^{n-1} \binom{n-1}{i} - 2 \sum_{i=0}^{n-d-1} \binom{n-1}{i} \\
&= 2 \sum_{i=n-d}^{n-1} \binom{n-1}{i} = 2 \sum_{i=0}^{d-1} \binom{n-1}{i} \ge C(n, d),
\end{aligned}
\]
where the last inequality follows from (3.18). Consequently, all inequalities hold with equality.

Corollary 3.7. Given $d \le n$, then $C(n, d) = 2^n - C(n, n-d)$, and $C(2d, d) = 2^{2d-1}$.

The exact characterization of the number of orthant intersections follows directly from the proof of Theorem 3.6 and is summarized by the following corollary.

Corollary 3.8. A subspace $H$ in $\mathbb{R}^n$, defined as the range of $V = [v_1, v_2, \ldots, v_d]$, intersects the maximum number of orthants $C(n, d)$ whenever $\operatorname{rank}(V) = n$, or when $e_i \notin \mathrm{Range}(V)$ for $i = 1, \ldots, n$.

3.6.2 Practical considerations

In practice, it is generally not computationally feasible to generate all of the $C(|I|, k)/2$ unique sign patterns. This means that we would have to set $S$ in (3.16) to the number of unique patterns actually tried. For a given $X_0$ the actual probability of recovery is determined by a number of factors. First of all, the linear combinations of the columns of the nonzero part $\bar{X}$ prescribe a subspace and therefore a set of possible sign patterns. With each sign pattern is associated a face in $\mathcal{C}$ that may or may not map to a face in $A\mathcal{C}$. Second, depending on the probability distribution from which the weight vectors $w$ are drawn, there is a certain probability for reaching each sign pattern. Summing the probability of reaching those patterns that can be recovered gives the probability $P(A, I, X_0)$ of recovering with an individual random sample $w$. The probability of recovery after $t$ trials is then of the form
\[
1 - [1 - P(A, I, X_0)]^t.
\]
To attain a certain sign pattern $\bar{e}$, we need to draw a $k$-vector $w$ such that $\operatorname{sign}(\bar{X}w) = \bar{e}$.
For a positive sign on the $i$th position of the support, we can take any vector $w$ in the open halfspace $\{w \mid \bar{X}^{i\to} w > 0\}$, and likewise for negative signs. The region of vectors $w$ in $\mathbb{R}^k$ that generates a desired sign pattern thus corresponds to the intersection of $|I|$ open halfspaces. The measure of this intersection as a fraction of $\mathbb{R}^k$ determines the probability of sampling such a $w$. To formalize, define $\mathcal{K}$ as the cone generated by the rows of $-\operatorname{diag}(\bar{e})\bar{X}$, and the unit Euclidean $(k-1)$-sphere $S^{k-1} = \{x \in \mathbb{R}^k \mid \|x\|_2 = 1\}$. The intersection of halfspaces then corresponds to the interior of the polar cone of $\mathcal{K}$: $\mathcal{K}^\circ = \{x \in \mathbb{R}^k \mid x^T y \le 0, \ \forall y \in \mathcal{K}\}$. The fraction of $\mathbb{R}^k$ taken up by $\mathcal{K}^\circ$ is given by the ratio of the $(k-1)$-content of $S^{k-1} \cap \mathcal{K}^\circ$ to the $(k-1)$-content of $S^{k-1}$ [89]. This quantity coincides precisely with the definition of the external angle of $\mathcal{K}$ at the origin.

3.6.3 Experiments

In this section we illustrate the results from Section 3.6 and examine some practical considerations that affect the performance of ReMBo. For all experiments that require the matrix $A$, we use the same $20 \times 80$ matrix that was used in Section 3.5, and likewise for the supports $I_r$. To solve (BP), we again use CVX in conjunction with SDPT3. We consider $x_0$ to be recovered from $b = Ax_0 = AX_0w$ if $\|x^* - x_0\|_\infty \le 10^{-5}$, where $x^*$ is the computed solution.

The experiments that are concerned with the number of unique sign patterns generated depend only on the $r \times k$ matrix $\bar{X}$ representing the nonzero entries of $X_0$. Because an initial reordering of the rows does not affect the number of patterns, those experiments depend only on $\bar{X}$, $r = |I|$, and the number of observations $k$; the exact indices in the support set $I$ are irrelevant for those tests.

Generation of unique sign patterns. The practical performance of ReMBo-$\ell_1$ depends on its ability to generate as many different sign patterns as possible using the columns in $X_0$.
A natural question to ask then is how the number of such patterns grows with the number of randomly drawn samples $w$. Although this ultimately depends on the distribution used for generating the entries in $w$, we shall, for sake of simplicity, consider only samples drawn from the normal distribution. As an experiment, we take a $10 \times 5$ matrix $\bar{X}$ with normally distributed entries, and over $10^8$ trials record how often each sign pattern (or its negation) was reached, and in which trial it was first encountered. The results of this experiment are summarized in Figure 3.6. From the distribution in Figure 3.6(b) it is clear that the occurrence levels of different orthants exhibit a strong bias. The most frequently visited orthant pairs were reached up to $7.3 \times 10^6$ times, while others, those hard to reach using weights from the normal distribution, were observed only four times over all trials. Likewise, we can look at the rate of encountering new sign patterns. This is done in Figure 3.6(c), which shows how the average rate changes over the number of trials. The curves in Figure 3.6(d) illustrate the probability given in (3.16), with $S$ set to the number of orthant pairs at a given iteration, and with face counts determined as in Section 3.5, for three instances with support cardinality $r = 10$, and observations $k = 5$.

Influence of $\bar{X}$ on the performance of ReMBo. The number of orthants that a subspace can intersect does not depend on the basis with which it was generated. However, the basis does greatly influence the ability to sample those orthants. Figure 3.7 shows two ways in which this can happen. In part (a) we sample the number of unique sign patterns for two different $9 \times 5$ matrices $\bar{X}$, each with columns scaled to unit $\ell_2$-norm. The entries of the first matrix are independently drawn from the normal distribution, while those in the second are generated by repeating a single column drawn likewise and adding small random perturbations to each entry.
This difference causes the average angle between any pair of columns to decrease from 65 degrees in the random matrix to a mere 8 degrees in the perturbed matrix, and greatly reduces the probability of reaching certain orthants. The same idea applies to the case where $d \ge n$, as shown in part (b) of the same figure. Although choosing $d$ greater than $n$ does not increase the number of orthants that can be reached, it does make reaching them easier, thus allowing ReMBo-$\ell_1$ to work more efficiently. Hence, we can expect ReMBo-$\ell_1$ to have higher recovery on average when the number of columns in $X_0$ increases and when they have a lower mutual coherence $\mu(X) = \max_{i \ne j} |x_i^T x_j| / (\|x_i\|_2 \cdot \|x_j\|_2)$.

Limiting the number of iterations. The number of iterations used in the previous experiments greatly exceeds what is practically feasible: we cannot afford to run ReMBo-$\ell_1$ until all possible sign patterns have been tried, even if there were a way to detect that the limit had been reached. Realistically, we should set the number of iterations to a fixed maximum that depends on the computational resources available, as well as on the problem setting itself.

[Figure 3.6: four panels. (a) Unique sign pattern pairs versus iterations; (b) instances (% of trials) versus sign pattern index; (c) unique sign patterns per iteration versus iterations; (d) recovery rate (%) versus iterations, with legend p = 1.17%, p = 0.20%, p = 0.78%.]

Figure 3.6: Sampling the sign patterns for a $10 \times 5$ matrix $\bar{X}$, with (a) number of unique sign patterns versus number of trials, (b) relative frequency with which each orthant is sampled, (c) average number of new sign patterns per iteration as a function of iterations, and (d) modeled probability of recovery using ReMBo-$\ell_1$ for three instances of $X_0$ with row sparsity $r = 10$, and $k = 5$ observations.

[Figure 3.7: two panels of unique sign patterns versus iterations. (a) Gaussian and perturbed matrices; (b) $k = 10$, $k = 12$, $k = 15$.]

Figure 3.7: Number of unique sign patterns for (a) two $9 \times 5$ matrices $\bar{X}$ with columns scaled to unit $\ell_2$-norm; one with entries drawn independently from the normal distribution, and one with a single random column repeated and random perturbations added, and (b) $10 \times k$ matrices with $k = 10, 12, 15$.

Figure 3.6(a) shows the empirical number of unique orthants $S$ as a function of iterations for a fixed $\bar{X}$. For the performance analysis it would be useful to know what the distribution of unique orthant counts looks like on the average $\bar{X}$ for a fixed number of trials. To find out, we draw 1,000 random $r \times k$ matrices $\bar{X}$ with $r = 10$ nonzero rows fixed and the number of columns ranging from $k = 1, \ldots, 20$. For each $\bar{X}$ we count the number of unique sign patterns attained after 1,000 and 10,000 iterations. The resulting minimum, maximum, and median values are plotted in Figure 3.8, along with the theoretical maximum.

More interesting, of course, is the average recovery rate of ReMBo-$\ell_1$ based on those numbers of iterations. For this test we again use the $20 \times 80$ matrix $A$ with fixed support $I$. For each value of $k = 1, \ldots, 20$, we generate random matrices $X$ on $I$ and run ReMBo-$\ell_1$ with the maximum number of iterations set to 1,000 and 10,000, respectively. To save on computation time, we again compare the on-support sign pattern of each coefficient vector $Xw$ to the precomputed results instead of solving $\ell_1$. The average recovery rate thus obtained is plotted in Figures 3.9(a,b), along with the average of the modeled performance using (3.16) with $S$ set to the orthant counts found in the previous experiment.

[Figure 3.8: plot of unique orthant pairs versus $k$, for Trials = 1,000, Trials = 10,000, and Trials = Inf.]

Figure 3.8: Effect of limiting the number of weight vectors $w$ on the distribution of unique orthant counts for $10 \times k$ random matrices $\bar{X}$; solid lines give the median number and the dashed lines indicate the minimum and maximum values; the top solid line is the theoretical maximum.

[Figure 3.9: two panels of recovery rate (%) versus $k$, for $r = 8$, $r = 9$, and $r = 10$. (a) 1,000 trials; (b) 10,000 trials.]

Figure 3.9: Effect of limiting the number of weight vectors $w$ on the average performance of the ReMBo-$\ell_1$ algorithm. The recovery with (a) 1,000, and (b) 10,000 trials is plotted (solid line), along with the average predicted performance (dashed line). These results are based on a fixed $20 \times 80$ matrix $A$ and three different support sizes $r = 8, 9, 10$. The support patterns used are the same as those used for Figure 3.3.

Chapter 4

Solvers

Along with the establishment of the theoretical foundations of sparse recovery, there has been a natural interest in algorithms that can efficiently solve the underlying optimization problems. So far, we have predominantly considered convex relaxations of ideal, but impractical, formulations for sparse recovery. Besides these relaxations, there is a large class of greedy approaches that do not solve any fixed problem formulation, but instead prescribe an algorithm for obtaining sparse (and possibly approximate) solutions. Theoretical recovery results have also been developed for these greedy approaches, but because they do not solve an optimization problem, the results only apply to a particular algorithm. By contrast, the theoretical results for the convex formulations studied in Chapters 2 and 3 apply to any algorithm for solving the given problem, regardless of the implementation.
The ReMBo-$\ell_1$ results in Chapter 3 are algorithm-specific as a whole, but are nevertheless independent of the algorithm used to solve the basis-pursuit subproblems. More information on greedy algorithms can be found in [43] and [117].

A third class of algorithms uses mostly heuristic approaches to solve nonconvex formulations. Despite the fact that these algorithms are typically guaranteed only to converge to a local minimizer, they do seem to work quite well in practice.

Figure 4.1 summarizes these different approaches and highlights our primary focus on solvers for convex formulations. In this classification we can further distinguish between solvers that fit the data exactly, and those that allow some misfit. In the first three sections of this chapter we discuss solvers for basis pursuit, basis pursuit denoise, and regularized basis pursuit. We do not go into the details of nonconvex heuristics. Although we do not extensively cover solvers for other problems, such as nonnegative basis pursuit or MMV, we mention extensions of techniques to these problems whenever possible.

[Figure 4.1: diagram with node labels: Solvers for sparse recovery; Combinatorial; Convex optimization; Greedy algorithms; Nonconvex heuristics; Chap. 4, 5, 6.]

Figure 4.1: Classification of solvers for sparse recovery.

4.1 Basis pursuit (BP)

When Chen et al. [39] proposed basis pursuit, they emphasized that it can be conveniently reformulated as a standard linear program:
\[
\underset{x}{\text{minimize}} \quad c^T x \quad \text{subject to} \quad Ax = b, \ x \ge 0. \tag{4.1}
\]
Given the basis pursuit formulation
\[
\underset{\bar{x}}{\text{minimize}} \quad \|\bar{x}\|_1 \quad \text{subject to} \quad \bar{A}\bar{x} = b, \tag{4.2}
\]
this can be done using the well-established technique of variable splitting, in which we write $\bar{x}$ as $x^+ - x^-$ with $x^+, x^- \ge 0$. This allows us to rewrite the $\ell_1$-norm of $\bar{x}$ as the sum $e^T(x^+ + x^-)$ of entries in its positive and negative parts, with $e$ denoting a vector of all ones. The term $\bar{A}\bar{x}$ is conveniently rewritten as $\bar{A}x^+ - \bar{A}x^-$.
With $A = [\bar{A}, -\bar{A}]$, $x = [x^+; x^-]$, and $c = e$, it follows that (4.2) can equivalently be expressed as (4.1). Once the latter formulation is solved, we can recover $\bar{x}$ simply by subtracting the second half of $x$ (corresponding to $x^-$) from the first. It is easily seen that, for each index $j$, at most one of $x_j^+$ and $x_j^-$ is nonzero at a solution, for otherwise we could lower the objective by subtracting the smaller of the two from both entries without changing $Ax$. In the special case of nonnegative basis pursuit (1.6), we can forgo splitting and immediately replace $\|x\|_1$ by $e^T x$.

4.1.1 Simplex method

The advantage of rewriting the basis pursuit formulation as a linear program is that solvers for the latter are at a very mature stage. Perhaps the most famous amongst these, and certainly the most established, is the simplex method [47]. Let the polyhedron $S = \{x \ge 0 \mid Ax = b\}$ denote the feasible set of (4.1). Assuming $S$ is nonempty, the simplex method systematically traverses neighboring vertices of $S$ until either an optimal solution or a direction of unbounded descent has been found. Associated with each vertex is a support $B$ of $m$ indices such that $x_i \ge 0$ for $i \in B$ and $x_i = 0$ for $i \in N$, with $N = \{1, \ldots, n\} \setminus B$ the complement of $B$. The index sets $B$ of neighboring vertices are such that all but one of the entries are the same. The key part of each iteration of the simplex algorithm is to determine which entry from $B$ to remove and which entry from $N$ to add. This choice ideally lowers the objective value at the new vertex, but this is sometimes impossible and we may have to settle for one at which the objective remains the same. In this latter situation, which arises when (4.1) is degenerate, special care has to be taken not to traverse a cyclic path of vertices, which would cause the algorithm to loop forever. The major cost of the simplex method consists of solving a linear system of equations $A_B x_B = b$ or $A_B^T y = c_B$.
Efficient implementations of the simplex method maintain a factorization of $A_B$, which is updated after each iteration, and occasionally regenerated for numerical stability. This implies that, when applied to linear programs where $A$ is dense, the simplex method does not scale very well with the problem size. In addition, it cannot take advantage of fast operators for products with $A$.

4.1.2 Interior-point algorithms

Feasible interior-point methods for linear programming approach the optimal solution not by tracing a path on the boundary of $S$, but instead by generating iterates in the interior of the cone $x \ge 0$. We here briefly discuss the primal-dual interior-point method described by Nocedal and Wright [122]. Let $y$ and $s$ denote the Lagrange multipliers associated respectively with the constraints $Ax = b$ and $x \ge 0$. The KKT conditions for (4.1) are then given by
\[ F(x, y, s) = \begin{bmatrix} A^T y + s - c \\ Ax - b \\ XSe \end{bmatrix} = 0, \qquad (x, s) \ge 0, \tag{4.3} \]
where $X = \mathrm{diag}(x)$ and $S = \mathrm{diag}(s)$. This system can be solved by applying Newton's method, requiring at each iteration the solution of the linear system $J(x, y, s) \cdot d = -F(x, y, s)$, where $J(x, y, s)$ denotes the Jacobian of $F$ evaluated at $(x, y, s)$. The interior-point method modifies the right-hand side of this system. In particular, at each iteration it requires that $x_i s_i \ge \tau$ for an iteration-specific value of $\tau$ that is positive and gradually decreases towards zero, based on a duality measure and a centering parameter. This leads to the following system of equations that needs to be solved at each iteration, and which constitutes the main computational part of the algorithm:
\[ \begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta y \\ \Delta s \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ -XSe + \tau e \end{bmatrix}. \tag{4.4} \]
Once the search direction is found, a line search ensures that $x$ and $s$ remain strictly within the feasible region. Needless to say, the above algorithm description is rather elementary and more advanced versions exist.
An important modification is the infeasible approach, in which the first two criteria in (4.3) need not be satisfied at intermediate iterates. For a more in-depth discussion, see Nesterov and Nemirovskii [120], and the very accessible book by Wright [152].

All of the interior-point methods mentioned above require the solution of a system of equations to determine a search direction. Rather than solving the equivalent of (4.4) directly, they typically work with the so-called normal equations, which follow by eliminating $\Delta s$ and $\Delta x$, giving $A D^2 A^T \Delta y = v$ for $D = S^{-1/2} X^{1/2}$ and an appropriate vector $v$. Solving this system becomes a real bottleneck with increasing problem size. When $A$ is known implicitly through routines for fast multiplication it is necessary to solve either the normal equations, or the augmented system (4.4), using an iterative linear solver. However, these linear systems necessarily become arbitrarily ill-conditioned near the solution, and iterative solvers typically struggle to obtain the required accuracies.

4.1.3 Null-space approach

In compressed sensing applications, $A$ is often constructed by choosing arbitrary rows from an orthonormal basis $B$. We can write this as $A = RB$, where $R$ is a restriction operator, i.e., a matrix composed of a subset of rows from an identity matrix. It is easily seen that, due to orthonormality of the rows of $B$, an orthonormal basis for the null-space of $A$ is given by $N = B^T (R^c)^T$, where $R^c$ is composed of the rows of the identity matrix that do not appear in $R$. Any feasible point $x \in F = \{x \mid Ax = b\}$ can then be written as $x_b + Nc$, where $x_b$ satisfies $Ax_b = b$ and $c$ is any vector in $\mathbb{R}^{n-m}$. We can find $x_b$ by applying the pseudo-inverse of $A$ to $b$, giving $x_b = A^\dagger b = A^T (A A^T)^{-1} b = A^T b$, since $A A^T = I$. This allows us to reformulate (4.1) as
\[ \underset{c}{\text{minimize}} \quad \|x_b + Nc\|_1. \]
While convex and unconstrained, this problem is still hard to solve because it is non-differentiable.
We do not further elaborate on this approach.

4.2 Basis pursuit denoise (BPσ)

The basis pursuit denoise formulation (BPσ) minimizes the $\ell_1$-norm of the coefficients subject to a bound on the $\ell_2$-norm of the misfit $b - Ax$. This formulation is very natural for practical signal recovery where noise is ubiquitous, but often well understood. Despite this appealing property, very few efficient solvers for this problem exist. The remainder of this section reviews various approaches.

4.2.1 Second-order cone programming

Just like (BP) can be reformulated as a standard linear program, we can reformulate (BPσ) as a standard second-order cone program
\[ \underset{x}{\text{minimize}} \quad c^T x \quad \text{subject to} \quad Mx = y, \quad \|A_{(i)} x + b_{(i)}\|_2 \le c_{(i)}^T x + d_{(i)}, \quad i = 1, \ldots, k. \tag{4.5} \]
The inequality constraint in (BPσ) requires no conversion and fits directly into the above formulation. The objective $\|x\|_1$ can be rewritten as the minimum sum of $u_i$ such that $-u_i \le x_i \le u_i$. This gives
\[ \underset{x,u}{\text{minimize}} \quad e^T u \quad \text{subject to} \quad |x_i| \le u_i, \ i = 1, \ldots, n, \quad \|Ax - b\|_2 \le \sigma, \]
which can be rewritten as (4.5) by combining $x$ and $u$ into a single vector and appropriately defining the matrices $A_{(i)}$ and the remaining vectors.

Second-order cone programs can be solved using primal-dual interior-point methods (see Nesterov and Nemirovskii [120] and Alizadeh and Goldfarb [1]). An alternative approach, used by Candès and Romberg in their pioneering sparse recovery toolbox $\ell_1$-magic [31], is to use a primal log-barrier method. Although these methods are very robust and capable of generating highly accurate solutions, they both suffer from the same problem as the interior-point method described in Section 4.1.2, namely, they scale extremely poorly with increasing problem size.

Second-order cone programs can also be used to solve other problems such as basis pursuit with complex variables, as well as the group-sparse (1.7) and $\ell_{1,2}$ minimization (1.10) problems. See Malioutov et al.
[110], Eldar and Mishali [73], and Stojnic et al. [135] for the latter two formulations.

4.2.2 Homotopy

The homotopy method was introduced by Osborne et al. [123] as an approach for solving variable-selection problems. The method takes advantage of the fact that the entries of the optimal solution $x_\lambda^*$ of (QPλ) are piecewise linear in $\lambda$. This allows the algorithm to trace the objective
\[ f_\lambda(x) = \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1, \tag{4.6} \]
as a function of $\lambda$, starting with $\lambda = \|A^T b\|_\infty$, and reducing it in discrete steps until no more progress can be made. With some minor modifications, the algorithm can find the solution to (BPσ), provided, of course, that $\|Ax - b\| \le \sigma$ can be satisfied.

As an illustration, we compute the homotopy trajectory for a random $20 \times 40$ matrix $A$ and random $b$. The resulting curve, plotted in terms of the $\ell_2$ residual norm against $\lambda$, is shown in Figure 4.2(a). The dots on the curve indicate the values of $\lambda$ at which the active set changes.

While the homotopy algorithm can be extremely fast, it also has a number of problems. First, the number of steps required to reach the solution can become quite large, especially if the final solution is not very sparse. Second, the subproblems, which need to be solved at each iteration, can become expensive to solve whenever the number of active variables is large. This again happens when the basis pursuit denoise solution is not sparse. Finally, we cannot warm-start the algorithm and need to start from $x = 0$ even when the problem changes only slightly.

Figure 4.2: (a) Complete homotopy trajectory of a random $20 \times 40$ problem, and (b) median number of steps required to solve 50 random $m \times n$ problems with $m = n/2$ (solid) and $m = 30$ (dashed), for different values of $n$. [Panel (a) plots the residual norm against $\lambda$; panel (b) plots the median number of steps against $n$.]
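The starting value $\lambda = \|A^T b\|_\infty$ used by the homotopy method has a simple interpretation: for this and any larger value of $\lambda$, the point $x = 0$ minimizes (4.6), which is why the method can start from $x = 0$. The sketch below, in plain Python with hypothetical helper names and made-up data, computes this value for a tiny problem and compares the objective at zero with the objective at a nearby point, once at the starting value and once at a smaller $\lambda$.

```python
# Illustrative sketch; the function names and the data are hypothetical.

def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T r for a matrix stored as a list of rows."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def objective(A, b, x, lam):
    """Evaluate f_lambda(x) = 0.5 * ||A x - b||_2^2 + lambda * ||x||_1, as in (4.6)."""
    residual = [ax - bi for ax, bi in zip(matvec(A, x), b)]
    return 0.5 * sum(v * v for v in residual) + lam * sum(abs(v) for v in x)


A = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
b = [1.0, 2.0]

# Starting value of the homotopy path: the infinity norm of A^T b.
lam_max = max(abs(v) for v in rmatvec(A, b))

zero = [0.0, 0.0, 0.0]
nearby = [0.0, 0.1, 0.0]  # small step along the coordinate with the largest |(A^T b)_j|

# At lam_max, moving away from zero does not lower the objective ...
at_max = (objective(A, b, zero, lam_max), objective(A, b, nearby, lam_max))
# ... whereas for a smaller lambda it does, so zero is no longer optimal.
below_max = (objective(A, b, zero, 1.0), objective(A, b, nearby, 1.0))

print("lambda_max =", lam_max)
print("f at zero vs nearby, lambda = lambda_max:", at_max)
print("f at zero vs nearby, lambda = 1.0:", below_max)
```

Comparing against a single nearby point only illustrates the claim; the full statement follows from the optimality condition $0 \in A^T(Ax - b) + \lambda\,\partial\|x\|_1$ evaluated at $x = 0$.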
From a geometrical perspective (see Section 2.5) it is intuitive that the number of steps required to solve a basis pursuit problem by homotopy should predominantly depend on the dimensionality of the problem and the distribution of the vectors in $A$. To test the former idea, we apply the homotopy algorithm to two sets of random $m \times n$ matrices $A$ for 50 random vectors $b$ and determine the median number of steps taken until completion. For the first set we fixed $m = 30$ and varied $n$, while in the second we chose $m = n/2$. The results are plotted in Figure 4.2(b) and confirm (for such matrices) the observation by Malioutov et al. [111] that the number of steps grows proportionally to $m$.

4.2.3 NESTA

The nesta algorithm by Becker et al. [7] is based on a framework developed by Nesterov [118, 119] for minimization of convex nonsmooth functions $f(x)$ over a convex set $Q_p$. We assume that $f(x)$ can be written as
\[ f(x) = \max_{u \in Q_d} \ \langle u, Wx \rangle, \tag{4.7} \]
where $W \in \mathbb{R}^{p \times n}$ and $Q_d$ is a closed and bounded convex set. This definition covers all induced norms, and holds in particular for $f(x) = \|x\|_1$ by setting $W = I$ and $Q_d = \{u \mid \|u\|_\infty \le 1\}$. Because functions of the form (4.7) are convex but not generally smooth, Nesterov [118] proposed to augment $f(x)$ with a smoothing prox-function $p_d(u)$ that is continuous and strongly convex on $Q_d$. For some smoothing parameter $\mu > 0$ this gives a continuously differentiable function
\[ f_\mu(x) = \max_{u \in Q_d} \ \langle u, Wx \rangle - \mu p_d(u). \tag{4.8} \]
This function can then be minimized over $x \in Q_p$ to give an approximate solution to the original problem. The algorithm proposed by Nesterov solves this type of problem by generating a sequence of iterates $x^{(i)}$ based on two other sequences $y^{(i)}$ and $z^{(i)}$ and a primal prox-function $p_p(x)$ (see Becker et al. [7] for more details). By assuming that the rows in $A$ are orthonormal, and choosing a convenient prox-function, Becker et al.
find closed-form solutions to the subproblems that determine the iterates $y^{(i)}$ and $z^{(i)}$. Consequently, they obtain an efficient algorithm for minimizing $f_\mu(x)$ over the convex set $Q_p$. Applied to the basis pursuit denoise formulation this gives
\[ \underset{x}{\text{minimize}} \quad f_\mu(x) \quad \text{subject to} \quad \|Ax - b\|_2 \le \sigma. \tag{4.9} \]
They also show that the accuracy of this approximate solution to (BPσ) is proportional to the smoothing parameter $\mu$. It thus follows that for an accurate approximation we have to solve (4.9) for a value of $\mu$ close to zero. Because doing so directly is inefficient, they propose a continuation method that gradually reduces $\mu$ to the desired level and warm-starts the solver with the previous solution $x$ of (4.9) as an initial point. The major advantage of nesta over other approaches is its versatility, making it applicable to many different problems that few other solvers can handle.

4.3 Regularized basis pursuit (QPλ)

The regularized basis pursuit formulation (QPλ), introduced by Chen et al. [39] under the name basis pursuit denoise, is the Lagrange formulation of (BPσ). In this formulation, the value of $\lambda$, which is related to the Lagrange multiplier corresponding to the constraint $\|Ax - b\|_2 \le \sigma$, is assumed to be known. This valuable information allows us to use the unconstrained formulation (QPλ), which is much easier to solve than (BPσ). This is perhaps the reason why most solvers proposed so far work with this regularized formulation.

4.3.1 Quadratic programming

In Section 4.1 we showed how the basis pursuit problem can be rewritten as a linear program by splitting $x$ into positive and negative parts. This same technique can be applied to convert (QPλ) into a convex quadratic program (QP).
Using the same notation, expanding $\|z\|_2^2$ as $z^T z$, and dropping the constant term $\tfrac{1}{2} b^T b$, this gives
\[ \underset{x}{\text{minimize}} \quad \tfrac{1}{2} x^T A^T A x + (\lambda e - A^T b)^T x \quad \text{subject to} \quad x \ge 0. \tag{4.10} \]
There are several ways to solve quadratic programs, including active-set methods, interior-point methods, and gradient projection; see, e.g., Nocedal and Wright [122]. We concentrate on the latter two, which have been successfully applied to sparse recovery.

Interior-point methods

Kim et al. [99] propose a primal log-barrier interior-point method for the quadratic programming formulation of regularized basis pursuit. Instead of using (4.10), they reformulate (QPλ) as
\[ \underset{x,u}{\text{minimize}} \quad \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda e^T u \quad \text{subject to} \quad -u \le x \le u. \]
The constraints are then incorporated into the objective using a log-barrier function:
\[ b(u, x) = -\sum_{i=1}^{n} \log(u_i + x_i) - \sum_{i=1}^{n} \log(u_i - x_i) = -\sum_{i=1}^{n} \log(u_i^2 - x_i^2), \]
giving a new objective $f(x, u) = \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda e^T u + \mu \cdot b(u, x)$, which is optimized over the interior of the constraint $-u \le x \le u$. At each iteration, a search direction $d$ is computed based on the Newton system $Hd = -g$, where $H$ denotes the symmetric positive definite Hessian $\nabla^2 f$ and $g$ denotes the gradient $\nabla f$. For large-scale problems, it is prohibitively expensive to compute the exact solution to this system, and Kim et al. therefore suggest a truncated Newton method based on preconditioned conjugate gradients. This requires a truncation rule determining the accuracy of the solves, and a preconditioner. Because a dual-feasible point is cheaply computed, the truncation rule is based on the duality gap, and the Newton system is solved increasingly accurately as the iterates approach the solution. Kim et al. also describe a preconditioner consisting of a 2-by-2 block matrix with diagonal blocks. In the actual implementation of the algorithm, a slightly different preconditioner is used to avoid the potentially expensive computation of $\mathrm{diag}(A^T A)$, which is needed by their preconditioner.
Once the search direction has been determined, a backtracking line search is done in combination with an Armijo condition for sufficient descent. At the end of each iteration, the parameter $\mu$ is updated to reduce the influence of the barrier term. Besides being crucial for obtaining search directions for large systems, the conjugate gradient method has the additional advantage of not requiring an explicit representation of $A$. This means that the algorithm can make use of fast routines for computing matrix-vector products, if these are available, thus facilitating the use of large dense implicit matrices. As an alternative, Chen et al. [39] (see also Sardy et al. [130]) derive a primal-dual log-barrier formulation for the QP formulation given in (4.10).

Gradient projection

Figure 4.3: Contour plot of objective value in two variables in the nonnegative orthant, along with negative gradient (dashed) at current iterate (o). Plot (a) shows the projection arc (solid), while plot (b) shows the backtracking path from a single point on the projection arc, indicated by (*).

The gpsr solver by Figueiredo et al. [78] uses gradient projection to solve (4.10). Gradient projection methods proceed by projecting a step from the current iterate $x^{(k)}$ along the negative gradient $-\nabla f(x^{(k)}) = -A^T(Ax^{(k)} - b) - \lambda e$ onto the feasible set, giving a new point
\[ x(\alpha) = \left( x^{(k)} - \alpha\beta \nabla f(x^{(k)}) \right)_+, \]
where $(x)_+ := \max\{0, x\}$, and $\beta$ is an initial scaling parameter. The way this new point is used gives rise to two variants of the algorithm. In the first variant, a backtracking line search along the projection arc $x(\alpha)$ is done with $\alpha \in (0, 1]$. The second variant takes a fixed $\alpha$ and determines a search direction $d^{(k)} = x(\alpha) - x^{(k)}$, and then performs a line search along $x = x^{(k)} + \gamma d^{(k)}$, for $\gamma \in [0, 1]$.
Due to the quadratic nature of the objective function in (4.10), this second line search can be done exactly, with a closed-form optimal step length $\gamma^*$. We illustrate the two line-search methods in Figure 4.3. Both of the above approaches are implemented in gpsr. To speed up the convergence for the latter method, gpsr uses a nonmonotonic approach where $\beta$ is determined using a method by Barzilai and Borwein [5]. We discuss this method in more detail in Section 5.4.1.

4.3.2 Fixed-point iterations

Iterative soft-thresholding

In iterative soft-thresholding (ist) the fixed-point iterations are given by
\[ x^{(k+1)} = S_{\gamma\lambda}\left( x^{(k)} - \gamma A^T(Ax^{(k)} - b) \right), \quad \gamma > 0, \tag{4.11} \]
with the soft-thresholding function $S_\lambda(x)$ defined as
\[ [S_\lambda(x)]_i = S_\lambda(x_i) := \mathrm{sgn}(x_i) \cdot \max\{0, |x_i| - \lambda\}, \quad i = 1, \ldots, n. \tag{4.12} \]
This iteration has been derived from many different perspectives, including expectation maximization by Figueiredo and Nowak [77], and through the use of surrogate functionals by Daubechies et al. [50]; see Elad et al. [70] for an overview. Perhaps the easiest way to obtain this iteration, however, is by considering the optimality conditions for (QPλ). The following derivation is based on work by Hale et al. [90]. Each minimizer $x$ of (QPλ) is characterized by
\[ 0 \in \nabla_x \left( \tfrac{1}{2}\|Ax - b\|_2^2 \right) + \partial_x \lambda\|x\|_1 = A^T(Ax - b) + \lambda \partial_x \|x\|_1. \]
Moving the first term to the left-hand side, multiplying by $\gamma > 0$, and adding $x$ to either side gives
\[ x - \gamma A^T(Ax - b) \in x + \gamma\lambda \cdot \partial_x \|x\|_1 = F_{\gamma\lambda}(x), \tag{4.13} \]
where $F_\alpha$ denotes the operator $(I(\cdot) + \alpha \partial_x \|\cdot\|_1)$. This operator is separable, meaning that $[F_\alpha(x)]_i = f_\alpha(x_i)$, where $f_\alpha(x_i) = x_i + \alpha \cdot \partial|x_i|$. By inspection, and as illustrated in Figure 4.4, it is easily shown that the inverse $f_\alpha^{-1}(x_i)$ is given by the soft-thresholding operator $S_\alpha(x_i)$, and it follows from separability that the inverse of $F_\alpha$ is given by $[F_\alpha^{-1}(x)]_i = f_\alpha^{-1}(x_i)$.
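The soft-thresholding operator (4.12) and the fixed-point iteration (4.11) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions and not the fpc or gpsr implementation: the function names are hypothetical, the step length is fixed at $\gamma = 1$, the matrix is a small made-up example whose spectral norm is below one, and a fixed iteration count replaces a proper stopping test.

```python
# Illustrative sketch of iterative soft-thresholding; names and data are hypothetical.

def soft_threshold(v, lam):
    """Scalar soft-thresholding, equal to sgn(v) * max(0, |v| - lam) as in (4.12)."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T r for a matrix stored as a list of rows."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def ist(A, b, lam, gamma=1.0, iterations=1000):
    """Run the fixed-point iteration (4.11) from x = 0 for a fixed number of steps."""
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        residual = [ax - bi for ax, bi in zip(matvec(A, x), b)]
        gradient = rmatvec(A, residual)  # A^T (A x - b)
        x = [soft_threshold(xi - gamma * gi, gamma * lam)
             for xi, gi in zip(x, gradient)]
    return x


A = [[0.5, 0.0, 0.3],
     [0.0, 0.5, -0.3]]
b = [1.0, 0.5]

x_ist = ist(A, b, lam=0.1)
print("approximate minimizer:", x_ist)
```

For this data the optimality condition can be solved by hand, giving the minimizer $(1.6, 0.6, 0)$; the third entry is set exactly to zero by the thresholding step, which is how the iteration produces sparse solutions.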
Applying this inverse to both sides of (4.13) yields
\[ x = F_{\gamma\lambda}^{-1}\left( x - \gamma A^T(Ax - b) \right) = S_{\gamma\lambda}\left( x - \gamma A^T(Ax - b) \right), \]
which forms the basis for the fixed-point iteration (4.11).

This iteration can also be formulated as a proximal forward-backward splitting process (see Combettes and Wajs [41]) for the minimization of $f_1(x) + f_2(x)$ with $f_1(x) = \tfrac{1}{2}\|Ax - b\|_2^2$ and $f_2(x) = \lambda\|x\|_1$. The forward step consists of setting $t_{i+1} = x_i - \gamma \nabla_x f_1(x_i) = x_i - \gamma A^T(Ax_i - b)$, and can be seen as a gradient descent step with step length $\gamma$. The backward step applies the proximity operator to the intermediate result, giving
\[ x_{i+1} = \mathrm{prox}_{\gamma f_2}\, t_{i+1} = S_{\gamma\lambda}(t_{i+1}), \]
where the last equality follows from [41, Example 2.20].

The soft-thresholding iteration (4.11) is implemented in the fpc code by Hale et al. [90]. They note that ist can be very slow when applied directly to (QPλ) with small regularization parameter $\lambda$. To overcome this, they propose a continuation approach in which a series of problems with decreasing $\lambda$ is solved, each warm-started from the previous solution. This leads to a considerable speed-up of the algorithm.

Importantly, Daubechies et al. [50] prove convergence of (4.11) with $\gamma = 1$ to the minimizer of (QPλ), under the condition that the norm of $A$ is strictly less than one. More general results can be found in Combettes and Wajs [41]. Convergence rates for (4.11) were derived by Hale et al. [90] and Bredies and Lorenz [22]. The latter also discuss an extension of the soft-thresholding iteration for group sparse minimization (1.7).

Figure 4.4: Plots of (a) $f(x) = x + \lambda \cdot \partial_x |x|$, and (b) its inverse $f^{-1}(x) = S_\lambda(x)$.

Chapter 5

Solver framework

The common thread tying together many of the sparse-recovery formulations discussed so far is that they minimize a convex objective function, subject to bounds on some measure of misfit.
In this chapter we present a framework for solving an optimization problem that encompasses all of the formulations that we discuss in this thesis, and thus consider the general problem
\[ \underset{x}{\text{minimize}} \quad \kappa(x) \quad \text{subject to} \quad \rho(Ax - b) \le \sigma, \tag{5.1} \]
where $\kappa$ and $\rho$ are convex functions that are general enough to cover a wide range of problems in sparse recovery. This framework was first described in van den Berg and Friedlander [10], and we further extend and analyze it here. Chapter 6 discusses the application of the framework for several specific instances of the functions $\kappa$ and $\rho$. An implementation of the framework, called spgl1, is available on the web [9].

5.1 Assumptions

For the derivation of the solver framework for (5.1) we assume that $\kappa$ and $\rho$ are gauge functions. A function $f$ is a gauge function [127] if it satisfies the following properties: (1) $f$ is nonnegative, $f(x) \ge 0$; (2) $f$ is positive homogeneous, $f(\alpha x) = \alpha f(x)$ for all $\alpha \ge 0$; (3) $f$ vanishes at the origin, $f(0) = 0$; (4) $f$ is convex. To give an idea of what such functions look like, we plot a number of one- and two-dimensional gauge functions in Figure 5.1. Norms are special cases of gauge functions that additionally satisfy $f(x) = f(-x)$. Examples of this special class of gauge functions are depicted in Figure 5.1(d–f) and correspond to the $\ell_2$, $\ell_1$, and $\ell_\infty$ norms respectively.

As a second criterion, we require $\rho$ to be continuously differentiable away from the origin. As an example, this is satisfied by the gauge function in Figure 5.1(d), but not by the ones in plots (e) and (f). Alternative criteria are possible, for example, that $\kappa$ is continuously differentiable away from the origin and $A$ has full row rank.

Figure 5.1: One- and two-dimensional gauge functions, shown in panels (a) to (f).

Finally, we assume that $b$ is in the range of $A$, and without loss of generality that $b \ne 0$; otherwise the unique optimal solution to (5.1) for all $\sigma \ge 0$ would be given by $x^* = 0$.
In Section 5.2.2 we explain the rationale behind these conditions and describe the implications of not meeting them.

5.2 Approach

We begin the derivation of the solver framework by making the following observation. Let $x_\sigma$ be any solution to (5.1). Then, by choosing $\tau = \tau_\sigma := \kappa(x_\sigma)$, it is also a solution of
\[ \phi(\tau) := \underset{x}{\text{minimize}} \quad \rho(Ax - b) \quad \text{subject to} \quad \kappa(x) \le \tau. \tag{5.2} \]
This is easily shown by assuming to the contrary that $x_\sigma$ is not an optimal point for (5.2). Since $x_\sigma$ is feasible by the choice of $\tau$, there must be some point, $x_\tau$, such that $\kappa(x_\tau) \le \tau$ and $\rho(Ax_\tau - b) < \rho(Ax_\sigma - b)$. But this means that $x_\tau$ is strictly feasible for (5.1) and implies the existence of a positive scalar $\gamma < 1$ such that $\rho(\gamma Ax_\tau - b) \le \sigma$ and $\kappa(\gamma x_\tau) = \gamma\kappa(x_\tau) < \tau$. In other words, if $x_\sigma$ is not a solution of (5.2) then it could not have been an optimal solution for (5.1) either.

With the equivalence between these formulations established, the question becomes: how to find the value of $\tau$ corresponding to a given $\sigma$? For this we turn to the Pareto curve, which, when applied to (5.2), gives the optimal trade-off between $\tau$ and the resulting misfit $\rho(Ax^* - b)$. We express this curve as the function $\phi(\tau)$ for $\tau \in [0, \tau_0]$, where $\tau_0$ is the smallest $\tau$ for which $\phi(\tau) = 0$.

Figure 5.2: Pareto curve $\phi(\tau)$ with desired level of misfit $\sigma$ and corresponding value of $\tau$.

For the equivalence between (5.1) and (5.2), we need to find the smallest $\tau$ for which $\phi(\tau) \le \sigma$, which amounts to solving the nonlinear equation $\phi(\tau) = \sigma$, as illustrated in Figure 5.2.

One way of solving the nonlinear equation $\phi(\tau) = \sigma$ is by applying Newton's method for nonlinear equations. This iterative method works by generating a linear approximation to the function at the current iterate and using it to find the next iterate. That is, given the current iterate $\tau_k$, we create the model $f(\Delta\tau_k) = \phi(\tau_k) + \Delta\tau_k \phi'(\tau_k)$ and solve $f(\Delta\tau_k) = \sigma$ for step size $\Delta\tau_k$.
With this step size, we have the following update:
\[ \tau_{k+1} = \tau_k + \Delta\tau_k, \quad \text{with} \quad \Delta\tau_k = (\sigma - \phi(\tau_k)) / \phi'(\tau_k). \tag{5.3} \]
We continue this process, illustrated in Figure 5.3, until some stopping criterion is satisfied. Summarizing the above gives the following algorithm.

Algorithm 5.1: Newton root-finding framework
    Given $A$, $b$, $\sigma$, and tolerance $\epsilon \ge 0$
    Set $k = 0$, $\tau_1 = 0$
    repeat
        Increment iterate $k \leftarrow k + 1$
        Evaluate $\phi(\tau_k)$ by solving (5.2) and obtain $x_k^*$
        Set $r_k = b - Ax_k^*$
        Compute $\phi'(\tau_k)$
        Update $\tau_{k+1} = \tau_k + (\sigma - \phi(\tau_k)) / \phi'(\tau_k)$
    until $|\rho(r_k) - \sigma| \le \epsilon$
    Return $x^* = x_k^*$

Figure 5.3: Three iterations of Newton's root-finding procedure for solving $\phi(\tau) = \sigma$. [Panels (a) to (c) plot $\phi(\tau)$ against $\tau$ and mark the iterates $\tau_2$, $\tau_3$, and $\tau_4$.]

For this approach to work the Pareto curve $\phi(\tau)$ must be differentiable, and both $\phi(\tau)$ and $\phi'(\tau)$ must be practically computable. In addition, we must have some guarantee that the iterates $\tau_k$ converge. We proceed by demonstrating differentiability, deriving the gradient, and analyzing the local convergence of Newton's method under inexact evaluations of $\phi(\tau)$ and $\phi'(\tau)$. We discuss solving the subproblems (5.2) in Section 5.4.

5.2.1 Differentiability of the Pareto curve

We prove differentiability of the Pareto curve using two results. The first result [127, Theorem 25.1] states that a convex function is differentiable at a point $x$ if and only if the subgradient at that point is unique. Naturally, the gradient at $x$ is then given by the unique subgradient. The second result [16, Propositions 6.1.2b, 6.5.8a] establishes that for convex problems of the form
\[ p(c) := \underset{x}{\text{minimize}} \quad f(x) \quad \text{subject to} \quad g(x) \le c, \]
a vector $y$ is a Lagrange multiplier if and only if $-y \in \partial p(c)$, provided that $p(c)$ is finite. By combining the two results we conclude that $p(c)$ is differentiable if and only if the Lagrange multiplier $y$ is unique. To apply this result to $\phi(\tau)$ we first need to show that it is convex and bounded.
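Before turning to the proofs, a small numerical sketch may help fix ideas. It assumes the simplest possible instance, which is our own choice for illustration and not a case singled out in the text: $A = I$, $\kappa = \|\cdot\|_1$, and $\rho = \|\cdot\|_2$. In that case (5.2) is a Euclidean projection onto the $\ell_1$-ball, and the derivative can be evaluated with the expression $-\|r\|_\infty / \|r\|_2$, which is the special case of formula (5.11) derived later in this chapter. The function names are hypothetical. The code checks on a grid that $\phi$ is nonincreasing and convex, and then runs the Newton update (5.3) of Algorithm 5.1.

```python
import math

# Illustrative sketch for A = I, kappa = l1-norm, rho = l2-norm; names are hypothetical.

def project_l1_ball(b, tau):
    """Euclidean projection of b onto the set {x : ||x||_1 <= tau}."""
    if sum(abs(v) for v in b) <= tau:
        return list(b)
    if tau <= 0.0:
        return [0.0] * len(b)
    magnitudes = sorted((abs(v) for v in b), reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u in enumerate(magnitudes, start=1):
        cumulative += u
        candidate = (cumulative - tau) / j
        if u - candidate > 0.0:
            theta = candidate  # the last index that passes gives the threshold
    return [math.copysign(max(abs(v) - theta, 0.0), v) for v in b]


def pareto(b, tau):
    """Return phi(tau), phi'(tau), and the minimizer x; requires phi(tau) > 0."""
    x = project_l1_ball(b, tau)
    r = [bi - xi for bi, xi in zip(b, x)]
    phi = math.sqrt(sum(v * v for v in r))
    dphi = -max(abs(v) for v in r) / phi  # polar of the l1-norm is the l-infinity norm
    return phi, dphi, x


def newton_root(b, sigma, tol=1e-10, max_iter=50):
    """Solve phi(tau) = sigma by Newton's method, starting from tau = 0."""
    tau = 0.0
    phi, dphi, x = pareto(b, tau)
    for _ in range(max_iter):
        if abs(phi - sigma) <= tol:
            break
        tau += (sigma - phi) / dphi
        phi, dphi, x = pareto(b, tau)
    return tau, x


b = [3.0, -1.0, 0.5]

# Sample the Pareto curve on a grid that stays below ||b||_1 = 4.5.
phis = [pareto(b, 0.25 * i)[0] for i in range(17)]
nonincreasing = all(phis[i + 1] <= phis[i] + 1e-12 for i in range(16))
midpoint_convex = all(phis[i] <= 0.5 * (phis[i - 1] + phis[i + 1]) + 1e-12
                      for i in range(1, 16))

tau_sigma, x_sigma = newton_root(b, sigma=1.0)
misfit = math.sqrt(sum((bi - xi) ** 2 for bi, xi in zip(b, x_sigma)))
print("nonincreasing:", nonincreasing, "midpoint convex:", midpoint_convex)
print("tau_sigma =", tau_sigma, "misfit =", misfit)
```

Because $\phi$ is convex and decreasing, Newton's method started at $\tau = 0$ approaches the root from the left, so in exact arithmetic the derivative is never evaluated at a point where the residual vanishes. The sketch relies on $0 < \sigma < \|b\|_2$.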
Boundedness follows from the fact that $x = 0$ is feasible for all $\tau \ge 0$, ensuring that the objective $\rho(Ax - b)$ is bounded above by $\rho(-b)$. For convexity we have the following result.

Lemma 5.1. Let $\kappa$ and $\rho$ be gauge functions. Then the function $\phi(\tau)$ defined in (5.2) is convex and nonincreasing for all $\tau \ge 0$.

Proof. The fact that $\phi(\tau)$ is nonincreasing follows directly from the observation that the feasible set enlarges as $\tau$ increases. Next, consider any nonnegative scalars $\tau_1$ and $\tau_2$, and let $x_1$ and $x_2$ be the corresponding minimizers of (5.2). For any $\beta \in [0, 1]$ define $x_\beta = \beta x_1 + (1 - \beta) x_2$, and note that by convexity of $\kappa$,
\[ \kappa(x_\beta) = \kappa(\beta x_1 + (1 - \beta) x_2) \le \beta\kappa(x_1) + (1 - \beta)\kappa(x_2) = \beta\tau_1 + (1 - \beta)\tau_2, \]
where the last equality follows from the positive homogeneity of $\kappa$ and $\rho$. With $\tau_\beta := \beta\tau_1 + (1 - \beta)\tau_2$ this gives $\kappa(x_\beta) \le \tau_\beta$, thus showing that $x_\beta$ is a feasible point for (5.2) with $\tau = \tau_\beta$. For the objective we then have
\[ \phi(\tau_\beta) \le \rho(Ax_\beta - b) = \rho(\beta Ax_1 - \beta b + (1 - \beta) Ax_2 - (1 - \beta) b) \le \beta\rho(Ax_1 - b) + (1 - \beta)\rho(Ax_2 - b) = \beta\phi(\tau_1) + (1 - \beta)\phi(\tau_2), \]
as required for convexity of $\phi$.

It then remains to show that the Lagrange multiplier for (5.2) is unique. For this we shift our attention to the dual problem.

Derivation of the dual

As a first step in the derivation of the dual, we rewrite (5.2) in terms of $x$ and an explicit residual term $r$:
\[ \underset{x,r}{\text{minimize}} \quad \rho(r) \quad \text{subject to} \quad Ax + r = b, \quad \kappa(x) \le \tau. \tag{5.4} \]
The dual to this equivalent problem is given by
\[ \underset{y,\lambda}{\text{maximize}} \quad L(y, \lambda) \quad \text{subject to} \quad \lambda \ge 0, \tag{5.5} \]
where $y \in \mathbb{R}^m$ and $\lambda \in \mathbb{R}$ are dual variables, and $L$ is the Lagrange dual function, given by
\[ L(y, \lambda) := \inf_{x,r} \ \{ \rho(r) - y^T(Ax + r - b) + \lambda(\kappa(x) - \tau) \}. \]
By separability of the infimum over $x$ and $r$ we can rewrite $L$ in terms of two separate suprema, giving
\[ L(y, \lambda) = b^T y - \tau\lambda - \sup_{r} \ \{ y^T r - \rho(r) \} - \sup_{x} \ \{ y^T Ax - \lambda\kappa(x) \}. \tag{5.6} \]
We recognize the first supremum as the conjugate function of $\rho$, and the second supremum as the conjugate function of $\lambda\kappa(x)$.
For a gauge function $f$, the conjugate function $f^*$ can be conveniently expressed as
\[ f^*(u) := \sup_{w} \ \{ w^T u - f(w) \} = \begin{cases} 0 & \text{if } f^\circ(u) \le 1, \\ \infty & \text{otherwise,} \end{cases} \tag{5.7} \]
where the polar of $f$ is defined by
\[ f^\circ(u) = \sup_{w} \ \{ w^T u \mid f(w) \le 1 \}. \tag{5.8} \]
In case $f$ is a norm, the polar reduces to the dual norm (for more details see Rockafellar [127] and Boyd and Vandenberghe [21, Section 3.3.1]). It follows from substitution of (5.7) in (5.6) that the dual of (5.2) is
\[ \underset{y,\lambda}{\text{maximize}} \quad b^T y - \tau\lambda \quad \text{subject to} \quad \rho^\circ(y) \le 1, \quad \kappa^\circ(A^T y) \le \lambda. \tag{5.9} \]
Note that the constraint $\lambda \ge 0$ in (5.5) is implied by $\kappa^\circ(A^T y) \le \lambda$, because $\kappa^\circ$ is nonnegative. Importantly, in the case $\rho(r) = \|r\|_2$, we have $\rho^\circ(r) = \|r\|_2$, and the dual variables $y$ and $\lambda$ can easily be computed from the optimal primal solutions. To derive $y$, first note from (5.7) that
\[ \sup_{r} \ \{ y^T r - \|r\|_2 \} = 0 \quad \text{if } \|y\|_2 \le 1. \]
Therefore, $y = r / \|r\|_2$, and we can without loss of generality take $\|y\|_2 = 1$ in (5.9). To derive the optimal $\lambda$, note that as long as $\tau > 0$, $\lambda$ must be at its lower bound $\kappa^\circ(A^T y)$, for otherwise we can increase the objective. Consequently, we take $\lambda = \kappa^\circ(A^T y)$. The dual variable $y$ can be eliminated, and we arrive at the following necessary and sufficient optimality conditions for the primal-dual solution $(r_\tau, x_\tau, \lambda_\tau)$ of (5.4) with $\rho(r) = \|r\|_2$:
\[ Ax_\tau + r_\tau = b, \quad \kappa(x_\tau) \le \tau \qquad \text{(primal feasibility);} \tag{5.10a} \]
\[ \kappa^\circ(A^T r_\tau) \le \lambda_\tau \|r_\tau\|_2 \qquad \text{(dual feasibility);} \tag{5.10b} \]
\[ \lambda_\tau (\kappa(x_\tau) - \tau) = 0 \qquad \text{(complementarity).} \tag{5.10c} \]

Uniqueness of the multiplier

Differentiability of the Pareto curve ultimately depends on the uniqueness of the Lagrange multiplier $\lambda$. This leads us to the following theorem.

Theorem 5.2. Let $\rho$ and $\kappa$ be gauge functions, satisfying
1. $\rho$ is differentiable away from the origin, or
2. $\kappa$ is differentiable away from the origin and $A$ has full row rank.
Then, on the open interval $\tau \in (0, \tau_0)$, the Pareto curve
\[ \phi(\tau) := \underset{x}{\text{minimize}} \quad \rho(Ax - b) \quad \text{subject to} \quad \kappa(x) \le \tau \]
is strictly decreasing, and continuously differentiable with $\phi'(\tau) = -\lambda_\tau$.

Proof.
For the first condition, note that differentiability of $\rho$ implies strict convexity of the level sets of $\rho^\circ$ (see Lemma 5.3, below). For $\phi(\tau)$ to be differentiable it suffices to show that $y^*$, and hence $\lambda$, in (5.9) is unique. Suppose that this is not so, and that $(y_1, \lambda_1)$ and $(y_2, \lambda_2)$ are two distinct solutions to the dual formulation with $\lambda_i = \kappa^\circ(A^T y_i)$. Over the given range of $\tau$, the primal objective is strictly greater than zero. By the Slater constraint qualification (which is trivially satisfied for $\tau > 0$) the primal-dual gap is zero and therefore the dual objective is strictly greater than zero. Moreover, we must have $\rho^\circ(y_1) = \rho^\circ(y_2) = 1$, for otherwise we could multiply each $y$ by a scalar $\beta > 1$ which, given positivity of the dual objective and positive homogeneity of $\kappa^\circ$, increases the objective by the same factor. Taking any convex combination $y = \alpha y_1 + (1 - \alpha) y_2$ with $0 < \alpha < 1$ gives $\rho^\circ(y) < 1$, by strict convexity. In addition, $\kappa^\circ(A^T y) \le \lambda$, with $\lambda = \alpha\lambda_1 + (1 - \alpha)\lambda_2$. It can then be seen that the dual objective at $(y, \lambda)$ either remains the same or increases (depending on $\kappa$). However, by the argument made above we can always multiply $y$ by a scalar strictly larger than one, thus giving the desired result.

For the second condition we rewrite (5.9) as
\[ \underset{y}{\text{maximize}} \quad b^T y - \tau\kappa^\circ(A^T y) \quad \text{subject to} \quad \rho^\circ(y) \le 1. \]
From differentiability of $\kappa$ and the assumption on $A$, $-\kappa^\circ(A^T y)$ is strictly concave except along rays emanating from the origin. This guarantees a unique solution unless there exists an $\alpha > 0$ such that $\alpha y^*$ is both feasible and optimal. However, this is possible only when the objective is zero, which we exclude by assumption.

Lemma 5.3. Let $f$ be a gauge function that is differentiable away from the origin. Then the level sets of $f^\circ$ are strictly convex.

Proof. We start with the geometrical determination of the argument $w$ for which (5.8) attains the supremum. From the differentiability of $f$ it follows that $f(u) < \infty$ for all $u \in \mathbb{R}^n$.
Consequently, there always exists a $w$ for which $u^T w > 0$, and, by positive homogeneity, the supremum of (5.8) is always attained for a point $w$ satisfying $f(w) = 1$. Now, let $S := \{x \mid f(x) = 1\}$ denote the unit sphere induced by $f$. For a fixed $u_1$, we can write any $w_1 \in S$ as $\alpha u_1 + v$, such that $u_1^T v = 0$. In order to determine $f^\circ(u_1)$ we therefore need to maximize $\alpha$. As illustrated in Figure 5.4(a), this amounts to finding a supporting hyperplane of $S$ with normal $u_1/\|u_1\|_2$. The point where the hyperplane touches $S$ then gives the desired $w_1$, which is unique due to differentiability of $f$.

To establish strict convexity of the level sets of $f^\circ$, we need to consider only the unit level set $S^\circ := \{x \mid f^\circ(x) = 1\}$. Assume, by contradiction, that there exist two distinct points $u_1, u_2 \in S^\circ$ for which strict convexity is violated. That is, $f^\circ(u_\alpha) = 1$ for all $\alpha \in [0, 1]$, with $u_\alpha := \alpha u_1 + (1 - \alpha) u_2$. Let $w_1 \in W_1$ and $w_2 \in W_2$ represent vectors in $S$ such that $u_1^T w_1 = 1$ and $u_2^T w_2 = 1$. For $w \in S$ to satisfy $w^T u_\alpha = 1$, we must have $w^T u_1 = w^T u_2$, which shows that $W_1 = W_2$, and that $w \in W_1$. However, we now claim that $w_2$ cannot be equal to $w_1$, unless $u_1 = u_2$. For $w_1 = w_2$ to hold, the supporting hyperplanes at that point need to be the same, due to differentiability of $f$. This also implies that the normals $u_1/\|u_1\|_2$ and $u_2/\|u_2\|_2$ of these hyperplanes are identical, but that means that $u_1 = u_2$, for otherwise $f^\circ(u_1) \ne f^\circ(u_2)$. We thus conclude that $w_1$ must be different from $w_2$ (see Figure 5.4(c)), and therefore that $S^\circ$ is strictly convex.

For an efficient implementation of the framework it is crucial that the Lagrange multiplier $\lambda$ can be easily obtained. Indeed, in many situations solving the dual formulation (5.9) may be as hard as solving the original problem (5.1). The following theorem is therefore extremely convenient.
Figure 5.4: Illustration for proof of Lemma 5.3, with (a) unit level set $S := \{x \mid f(x) = 1\}$, vector $u_1$, and $w_1 \in S$ maximizing $u_1^T w_1$, (b) another point $u_2$ satisfying $u_2^T w_1 = u_1^T w_1$, and (c) distinct $w_2 \in S$ maximizing $u_2^T w_2$.

Theorem 5.4. Let $\kappa$ be a gauge function and let $\rho(r) = \|r\|_2$. Then, on the interval $\tau \in (0, \tau_0)$, the Pareto curve is continuously differentiable with

$$\phi'(\tau) = -\kappa^\circ(A^T r^*)/\|r^*\|_2, \tag{5.11}$$

where $r^* = b - Ax^*$ with primal solution $x^*$.

Proof. By strict convexity of the level sets of $\rho$ there must be a unique solution $r^*$ to problem (5.4). Given this $r^*$, the second supremum in (5.6) must evaluate to zero with $r = r^*$ for any dual optimal $y^*$. With $\rho(r) = \|r\|_2$, it can be verified that the only $y^*$ satisfying this criterion is given by $y^* = r^*/\|r^*\|_2$. Combining Theorem 5.2 with the fact that $\lambda = \kappa^\circ(A^T y^*)$ and $\kappa^\circ$ is positive homogeneous gives (5.11).

5.2.2 Rationale for the gauge restriction

In Section 5.2.1 and Theorem 5.2 we limited $\kappa$ and $\rho$ to be gauge functions. We next consider what happens if this restriction is lifted. Recall that in the derivation of the gradient of the Pareto curve, the dual formulation plays a key role. For the derivation of this dual we conveniently used the conjugate function given in (5.7). For more general nonnegative convex functions $f(x)$ that vanish at the origin, the supremum of $v^T x - f(x)$ over $x$ has a similar characterization but with an important difference. Before discussing this, let us take a look at the idea behind the conjugate function for gauge functions.

In the top half of Figure 5.5 we plot the function $f(x) = |x|$. The dotted line indicates the inner-product between $v$ and $x$ for different values of $v$. In plot (a) this product lies below $f(x)$ and the supremum of $v^T x - f(x)$ is easily seen to be zero.
For the value of $v$ in plot (b), the inner-product $v^T x$ exceeds $f(x)$, and by positive homogeneity of $f(x)$, we can make the gap arbitrarily large by increasing $x$. Plot (c) shows the critical value of $v$ where $f(x)$ coincides with $v^T x$.

Figure 5.5: Graphical illustration of the conjugate function for (a–c) a gauge function and (d–f) a more general function. The dotted line in (f) represents the recession function.

From the above we can see that the supremum is zero as long as $v^T x$ does not exceed $f(x)$ for any $x$. Because of positive homogeneity it suffices to check this for $f(x) = 1$, and therefore, without loss of generality, for $f(x) \le 1$. Doing so gives the requirement

$$\sup_x \,\{ v^T x \mid f(x) \le 1 \} \le 1,$$

which by (5.8) coincides with $f^\circ(v) \le 1$. Combining both cases yields (5.7).

Omitting the condition of positive homogeneity changes the above characterization of the conjugate. As an example, consider the function $f(x)$ (solid line) in Figure 5.5(d), along with $v^T x$ (dotted line). Here, the supremum of $v^T x - f(x)$ is strictly positive without being unbounded. Function $f(x)$ is chosen such that it is quadratic on the interval $[0, 1]$, and linear on the intervals $(-\infty, -1]$, $[-1, 0]$, and $[1, \infty)$. It can be shown that the dotted line $v^T x$ in plot (e) corresponds to the critical value of $v$; increasing $v$ by any amount gives an unbounded conjugate. This critical value is easily seen to coincide with the slope of $f(x)$ on the rightmost linear segment. In higher dimensions a similar relation holds and instead of looking directly at the function $f(x)$ to determine boundedness of the conjugate we need to look at the smallest gauge function $\gamma(x)$ such that $f(x) \le \gamma(x)$ for all $x$.
This function turns out to be the recession function $f0^+(x)$ of $f(x)$, which is defined as the function whose epigraph gives the largest cone in $\mathbb{R}^{n+1}$ that fits into the epigraph of $f(x)$ [127]. Figure 5.5(f) shows this recession function as a dotted line. Analogously we can define a function $f_{0+}(x)$ whose epigraph is the smallest cone containing the epigraph of $f(x)$, indicated by the dark solid line. With these two gauge functions we can characterize the conjugate of nonnegative convex functions that vanish at the origin as

$$f^*(u) := \sup_w \,\{ w^T u - f(w) \} \begin{cases} = 0 & \text{if } [f_{0+}]^\circ(u) \le 1 \\ \in (0, \infty) & \text{if } [f_{0+}]^\circ(u) > 1,\ [f0^+]^\circ(u) \le 1 \\ = \infty & \text{otherwise.} \end{cases}$$

Unlike conjugate functions of gauges, this is no longer an indicator function. As a consequence, the expression for the dual problem is no longer as convenient as that for gauge functions. Moreover, obtaining the dual solution may require solving the dual problem, which of course greatly reduces the efficiency of our proposed method.

5.3 Practical aspects of the framework

The Newton root-finding framework in Algorithm 5.1 requires the computation of $\phi(\tau_k)$ and $\phi'(\tau_k)$. In practice these quantities will never be exact and it may even be too expensive to compute them to a high level of accuracy. In this section we derive a bound on the primal-dual gap to evaluate the optimality of a feasible point and study local convergence of Newton's method based on inaccurate function and gradient values. Throughout this section we assume that $\rho(r) := \|r\|_2$.

5.3.1 Primal-dual gap

The algorithms for solving (5.2), which we outline in Section 5.4, maintain feasibility of the iterates at all iterations. As a result, an approximate solution $\bar{x}_\tau$ and its corresponding residual $\bar{r}_\tau := b - A\bar{x}_\tau$ satisfy

$$\kappa(\bar{x}_\tau) \le \tau, \quad \text{and} \quad \|\bar{r}_\tau\|_2 \ge \|r_\tau\|_2 > 0, \tag{5.12}$$

where the second set of inequalities holds because $\bar{x}_\tau$ is suboptimal and $\tau < \tau_0$.
We can use these to construct dual variables

$$\bar{y}_\tau := \bar{r}_\tau / \|\bar{r}_\tau\|_2 \quad \text{and} \quad \bar{\lambda}_\tau := \kappa^\circ(A^T \bar{y}_\tau),$$

which are dual feasible, i.e., they satisfy (5.10b). The objective of the dual problem (5.9), evaluated at any feasible point, gives a lower bound on the optimal value $\|r_\tau\|_2$. Therefore,

$$b^T \bar{y}_\tau - \tau\bar{\lambda}_\tau \le \|r_\tau\|_2 \le \|\bar{r}_\tau\|_2.$$

With the duality gap defined as

$$\delta_\tau := \|\bar{r}_\tau\|_2 - (b^T \bar{y}_\tau - \tau\bar{\lambda}_\tau),$$

we can bound the difference in objective as

$$0 \le \|\bar{r}_\tau\|_2 - \|r_\tau\|_2 \le \delta_\tau. \tag{5.13}$$

Figure 5.6: Illustration for proof of Lemma 5.5.

5.3.2 Accuracy of the gradient

Let $\bar\phi(\tau) := \|\bar{r}_\tau\|_2$ be the objective value of (5.2) at the approximate solution $\bar{x}_\tau$. The duality gap $\delta_\tau$ at $\bar{x}_\tau$ provides a bound on the difference between $\phi(\tau)$ and $\bar\phi(\tau)$. Assuming that $A$ is full rank, we can use the relative duality gap $\eta_\tau := \delta_\tau / \|\bar{r}_\tau\|_2$ to obtain a bound on the difference between the derivatives $\phi'(\tau)$ and $\bar\phi'(\tau)$. In order to do so, we first bound the quantity $\|\bar{r}_\tau - r_\tau\|$.

Lemma 5.5. Let $(x^*, r^*)$ be the optimal solution for (5.4) with $\rho(r) := \|r\|_2$, and let $\delta$ be the duality gap for a feasible point $(\bar{r}, \bar{x})$. Then

$$\|\bar{r} - r^*\|_2 \le \sqrt{\delta^2 + 2\delta\|r^*\|_2}.$$

Proof. Formulation (5.4) with the given $\rho$ is equivalent to

$$\mathop{\text{minimize}}_{x} \quad \|Ax - b\|_2 \quad \text{subject to} \quad \kappa(x) \le \tau.$$

With the feasible set denoted by $B := \{x \mid \kappa(x) \le \tau\}$, the range of $Ax$ over all feasible $x$ is given by $R := AB$. Because convexity of sets is preserved under linear maps, $R$ is convex. It follows from the use of the two-norm misfit that a unique separating hyperplane $H$ exists between $R$ and the Euclidean ball of radius $\|r^*\|_2$ around $b$ (see Figure 5.6). Let $S$ denote the $(n - 1)$-sphere around $b$ of radius $\|\bar{r}\|_2 \le \|r^*\|_2 + \delta$. The distance between $\bar{r}$ and $r^*$ can be seen to be bounded by the distance between $Ax^*$ and the points in $S \cap R$. This distance itself is bounded by the distance from $Ax^*$ to any point on the intersection between $S$ and the separating hyperplane $H$.
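For the latter distance $d$ we have $(\|r^*\|_2 + \delta)^2 = \|r^*\|_2^2 + d^2$ and the stated result follows.

We next derive a bound on the difference of the gradients based on $\|\bar{r} - r^*\|$.

Lemma 5.6. Let $\|\bar{r} - r^*\|_2 \le \gamma$, then

$$|\bar\phi'(\tau) - \phi'(\tau)| \le c \cdot \gamma / \|\bar{r}\|_2,$$

where $c$ is a positive constant independent of $\tau$.

Proof. Define $v := \bar{r} - r^*$. We consider two cases: first assume that $\bar\phi'(\tau) \le \phi'(\tau)$. With the definition of $-\phi'(\tau)$ this gives the following

$$\begin{aligned}
\phi'(\tau) - \bar\phi'(\tau) &= \frac{\kappa^\circ(A^T\bar{r})}{\|\bar{r}\|_2} - \frac{\kappa^\circ(A^T r^*)}{\|r^*\|_2} \le \frac{\kappa^\circ(A^T r^*)}{\|\bar{r}\|_2} + \frac{\kappa^\circ(A^T v)}{\|\bar{r}\|_2} - \frac{\kappa^\circ(A^T r^*)}{\|r^*\|_2} \\
&\le \frac{\kappa^\circ(A^T v)}{\|\bar{r}\|_2} \le c \cdot \|v\|_2 / \|\bar{r}\|_2 \le c \cdot \gamma / \|\bar{r}\|_2.
\end{aligned}$$

The first inequality on the second line follows from the fact that $1/\|\bar{r}\|_2 \le 1/\|r^*\|_2$. For the second case we consider $\phi'(\tau) \le \bar\phi'(\tau)$. This gives

$$\begin{aligned}
\bar\phi'(\tau) - \phi'(\tau) &= \frac{\kappa^\circ(A^T r^*)}{\|r^*\|_2} - \frac{\kappa^\circ(A^T\bar{r})}{\|\bar{r}\|_2} \le \frac{\kappa^\circ(A^T r^*)}{\|r^*\|_2} - \frac{\kappa^\circ(A^T r^*)}{\|\bar{r}\|_2} + \frac{\kappa^\circ(A^T v)}{\|\bar{r}\|_2} \\
&\le \frac{(\|\bar{r}\|_2 - \|r^*\|_2) \cdot \kappa^\circ(A^T r^*)}{\|r^*\|_2 \cdot \|\bar{r}\|_2} + \frac{\kappa^\circ(A^T v)}{\|\bar{r}\|_2} \\
&\le \frac{\gamma}{\|\bar{r}\|_2} \cdot \frac{\kappa^\circ(A^T r^*)}{\|r^*\|_2} + \frac{\kappa^\circ(A^T v)}{\|\bar{r}\|_2} \le c_1 \cdot \frac{\gamma}{\|\bar{r}\|_2} + c_2 \cdot \frac{\|v\|_2}{\|\bar{r}\|_2} \\
&\le c \cdot \gamma / \|\bar{r}\|_2.
\end{aligned}$$

For the first inequality we used the reverse triangle inequality. We then used the above inverse norm inequality, along with bounds on $\kappa^\circ(A^T r^*)/\|r^*\|_2$, and likewise for $A^T v$.

Combining these two results, we arrive at a bound on $|\bar\phi'(\tau) - \phi'(\tau)|$ in terms of the relative duality gap $\eta$.

Corollary 5.7. Let $\eta$ be the relative duality gap $\delta/\|\bar{r}\|_2$, then

$$|\bar\phi'(\tau) - \phi'(\tau)| \le c \cdot \sqrt{\eta^2 + 2\eta}.$$

Proof. From Lemma 5.5 and the fact that $\|r^*\|_2 \le \|\bar{r}\|_2$, we have

$$\|\bar{r} - r^*\|_2 \le \sqrt{\delta^2 + 2\delta\|\bar{r}\|_2} =: \gamma.$$

Applying Lemma 5.6 then gives the result.

5.3.3 Local convergence rate

We start our analysis of the local convergence properties by defining

$$\gamma_\tau := \max\{\bar\phi(\tau) - \phi(\tau),\ |\bar\phi'(\tau) - \phi'(\tau)|\}. \tag{5.14}$$

Based on this quantity we bound the difference between the exact and inexact root-finding step size.

Lemma 5.8. Let $\gamma_\tau$ be as defined in (5.14). Then

$$\left| \frac{\phi(\tau) - \sigma}{\phi'(\tau)} - \frac{\bar\phi(\tau) - \sigma}{\bar\phi'(\tau)} \right| \le c\gamma_\tau,$$

where $c$ is a constant independent of $\tau$.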
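Proof. Let $\alpha$ and $\beta$ be such that $\bar\phi'(\tau) = \phi'(\tau) + \alpha$, and $\bar\phi(\tau) = \phi(\tau) + \beta$. It follows from the assumptions that $|\alpha| \le \gamma_\tau$ and $|\beta| \le \gamma_\tau$. This allows us to write

$$\begin{aligned}
\left| \frac{\phi(\tau) - \sigma}{\phi'(\tau)} - \frac{\bar\phi(\tau) - \sigma}{\bar\phi'(\tau)} \right|
&= \left| \frac{\bar\phi'(\tau)\phi(\tau)}{\bar\phi'(\tau)\phi'(\tau)} - \frac{\phi'(\tau)\bar\phi(\tau)}{\bar\phi'(\tau)\phi'(\tau)} - \sigma\frac{(\bar\phi'(\tau) - \phi'(\tau))}{\bar\phi'(\tau)\phi'(\tau)} \right| \\
&\le \sigma\left| \frac{(\bar\phi'(\tau) - \phi'(\tau))}{\bar\phi'(\tau)\phi'(\tau)} \right| + \left| \frac{(\phi'(\tau) + \alpha)\phi(\tau) - \phi'(\tau)(\phi(\tau) + \beta)}{\bar\phi'(\tau)\phi'(\tau)} \right| \\
&\le \frac{\sigma\gamma_\tau}{|\bar\phi'(\tau)\phi'(\tau)|} + \left| \frac{\alpha\phi(\tau)}{\bar\phi'(\tau)\phi'(\tau)} - \frac{\beta}{\bar\phi'(\tau)} \right| \\
&\le \frac{\sigma\gamma_\tau}{|\bar\phi'(\tau)\phi'(\tau)|} + \frac{\gamma_\tau\phi(\tau)}{|\bar\phi'(\tau)\phi'(\tau)|} + \frac{\gamma_\tau}{|\bar\phi'(\tau)|} \le c\gamma_\tau.
\end{aligned}$$

The last step follows from the fact that there exists a constant $c_1 > 0$ such that $c_1 \le |\phi'(\tau)|$ and $c_1 \le |\bar\phi'(\tau)|$, and that $\phi(\tau) \le \|b\|_2$.

The following theorem establishes the local convergence rate of an inexact Newton method for $\phi(\tau) = \sigma$, where $\phi$ and $\phi'$ are known only approximately.

Theorem 5.9. Suppose that $A$ has full rank, $\sigma \in (0, \|b\|_2)$, and $\gamma_k := \gamma_{\tau_k} \to 0$. Then, if $\tau_0$ is close enough to $\tau_\sigma$, and $\tau_k$ remains in the interval $(0, \tau_0)$, the iteration (5.3), with $\phi$ and $\phi'$ replaced by $\bar\phi$ and $\bar\phi'$, generates a sequence $\tau_k \to \tau_\sigma$ that satisfies

$$|\tau_{k+1} - \tau_\sigma| = c\gamma_k + \nu_k|\tau_k - \tau_\sigma|, \tag{5.15}$$

where $\nu_k \to 0$ and $c$ is a positive constant.

Proof. Because $\phi(\tau_\sigma) = \sigma \in (0, \|b\|_2)$, it follows from the definition of $\tau_0$ that $\tau_\sigma \in (0, \tau_0)$. By Theorem 5.2 we have that $\phi(\tau)$ is continuously differentiable for all $\tau$ close enough to $\tau_\sigma$, and so by Taylor's theorem,

$$\begin{aligned}
\phi(\tau_k) - \sigma &= \int_0^1 \phi'(\tau_\sigma + \alpha[\tau_k - \tau_\sigma])\, d\alpha \cdot (\tau_k - \tau_\sigma) \\
&= \phi'(\tau_k)(\tau_k - \tau_\sigma) + \int_0^1 \left[ \phi'(\tau_\sigma + \alpha[\tau_k - \tau_\sigma]) - \phi'(\tau_k) \right] d\alpha \cdot (\tau_k - \tau_\sigma) \\
&= \phi'(\tau_k)(\tau_k - \tau_\sigma) + \omega(\tau_k, \tau_\sigma),
\end{aligned}$$

where the remainder $\omega$ satisfies

$$\omega(\tau_k, \tau_\sigma)/|\tau_k - \tau_\sigma| \to 0 \quad \text{as} \quad |\tau_k - \tau_\sigma| \to 0. \tag{5.16}$$

It follows from the definition of $\phi'(\tau)$ and from standard properties of matrix norms that $c_l\sigma_m(A) \le -\phi'(\tau) \le c_u\sigma_1(A)$.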
We use this property, along with ¯ k ) /φ¯ (τk ), to establish that Lemma 5.8, and the definition ∆τk = σ − φ(τ |τk+1 − τσ | = |τk − τσ + ∆τk | ¯ k) − σ φ(τ 1 φ(τk ) − σ − ω(τk , τσ ) + = − ¯ φ (τk ) φ (τk ) ¯ k) − σ φ(τk ) − σ φ(τ ω(τk , τσ ) ≤ − ¯ + φ (τk ) φ (τk ) φ (τk ) = c1 γk + c2 |ω(τk , τσ )| = c1 γk + νk |τk − τσ |, with positive constants c1 and c2 , and νk := c2 |ω(τk , τσ )|/|τk − τσ |. When τk is sufficiently close to τσ , (5.16) implies that νk < 1. Apply the above inequality recursively ≥ 1 times to obtain |τk+ − τσ | ≤ c1 (νk ) −i γk+i−1 + |τk − τσ | · i=1 νk+1−i , i=1 and because γk → 0 and νk < 1, it follows that τk+ → τσ as → ∞. Thus τk → τσ , as required. By again applying (5.16), we have that νk → 0. Note that if (5.2) is solved exactly at each iteration, such that γk = 0, then Theorem 5.9 shows that the convergence rate is superlinear, as we expect of a standard Newton iteration. In effect, the convergence rate of the algorithm depends on the rate at which δk → 0. If A is rank deficient, then the constant c in (5.15) is infinite; we thus expect that ill-conditioning in A leads to slow convergence unless γk = 0, i.e., φ is evaluated accurately at every iteration. For related results see also Dembo et al. [52, Section 4]. 5.3.4 First root-finding step A natural choice for the initial value of τ is zero. Besides guaranteeing that τ is not too large, it also has the advantage that the solution of φ(0) and φ (0) are often available in closed form. That is, whenever κ(x) = 0 implies x = 0, the only feasible point, and therefore the solution to (5.2) at τ = 0 is x∗ = 0. It follows from the definition of φ(τ ) that φ(0) = Ax∗ − b 2 = b 2 . Moreover, by application of (5.11) in Theorem 5.4 we have φ (0) = −κ◦ (AT b)/ b 2 . This means that under the above condition on κ the first root-finding step is essentially free, requiring only a single matrix-vector product with AT and one evaluation of κ◦ . 5.4. 
5.4 Solving the subproblems

Ultimately, the root-finding framework depends on an efficient method to solve subproblems of the form (5.2). In this section we discuss two such methods. Due to the need to obtain efficient gradients of the Pareto curve, we restrict ourselves to the case where $\rho(r) = \|r\|_2$, giving the generalized Lasso-formulation

$$\mathop{\text{minimize}}_{x} \quad \tfrac{1}{2}\|Ax - b\|_2^2 \quad \text{subject to} \quad \kappa(x) \le \tau. \tag{5.17}$$

As we will see in Chapter 6, many problems of interest have a Euclidean or Frobenius norm misfit. This means that the restriction on $\rho$ does not greatly limit the practical application of the framework. Note however that both subproblem solvers are easily adapted to deal with more general $\rho$ (or $\rho^2$), provided that the gradient exists everywhere on the feasible set and is available to the solver. Indeed, both methods are capable of solving

$$\mathop{\text{minimize}}_{x} \quad f(x) \quad \text{subject to} \quad x \in \Omega, \tag{5.18}$$

where $\Omega$ is some non-empty convex set. For this they require the evaluation of the objective $f(x)$, the gradient $g(x) := \nabla f(x)$, and the orthogonal projection of arbitrary points $v$ onto the feasible set

$$P(v) = P_\Omega(v) := \mathop{\text{argmin}}_{u \in \Omega} \ \|u - v\|_2. \tag{5.19}$$

For (5.17) we have $\Omega := \{x \mid \kappa(x) \le \tau\}$, and we discuss efficient projection algorithms for various functions $\kappa(x)$ in Chapter 6. For (5.17), the evaluation of both the objective and the gradient essentially reduces to matrix-vector products with $A$ and $A^T$. This allows matrix $A$ to be an implicit operator, provided that routines for matrix-vector products are available. In Chapter 8 we present a Matlab toolbox aimed at the construction of large implicit matrices with associated operations, including matrix-vector multiplication.

As an aside, we remark that (5.17) reduces to the well-known Lasso problem [140] when $\kappa(x) = \|x\|_1$. The Lasso problem can be solved using a variety of methods, including homotopy [123], least angle regression (LARS) [68], and the active-set method by [124, 149].
In our context, the use of the homotopy and LARS algorithms makes no sense because, as discussed in Section 4.2, they can be more efficiently used to solve the basis pursuit denoise problem directly.

5.4.1 Spectral projected gradients

The first method for solving (5.17) is the spectral projected-gradient (SPG) algorithm introduced by Birgin et al. [17]. Their method generates a sequence of feasible iterates $\{x_k\}$, which is shown to converge to a constrained stationary point, which corresponds to the global minimizer for (5.17).

Given the current iterate $x_k$, the algorithm proceeds as follows. First we compute search direction $d_k = -g(x_k)$ and determine an initial step length $\beta$. This gives the first point $x_k + \beta d_k$, which is then projected onto the feasible region to obtain the first feasible trial point $P(x_k + \beta d_k)$. To ensure sufficient descent we then perform a backtracking line-search either directly from the first trial point or by reducing $\beta$. Once an acceptable trial point is found we take it as $x_{k+1}$ and proceed with the next iteration until a stopping criterion is met. A detailed version of this algorithm is given in Algorithm 5.2, and we now proceed with a more elaborate discussion of each of the main steps.

Algorithm 5.2: Scaled projected gradient algorithm with (left) backtracking line search, and (right) curvilinear line search.

(left) Backtracking line search:
    Given A, b, x0 ∈ Ω, τ
    while not converged do
        g = AT(Ax − b)
        β = initial steplength
        d = P(x − βg) − x
        α = 1
        repeat
            x̄ = x + αd
            Reduce α
        until sufficient descent
        x = x̄

(right) Curvilinear line search:
    Given A, b, x0 ∈ Ω, τ
    while not converged do
        g = AT(Ax − b)
        β = initial steplength
        α = 1
        repeat
            x̄ = P(x − αβg)
            Reduce α
        until sufficient descent
        x = x̄

Initial step length. In the spectral projected gradient method, the initial step length is determined by

$$\beta_k := \frac{\langle \Delta x, \Delta x \rangle}{\langle \Delta x, \Delta g \rangle}, \tag{5.20}$$

where $\Delta x := x_k - x_{k-1}$ is the difference between the current and previous iterate and $\Delta g := g(x_k) - g(x_{k-1})$ the corresponding difference in gradients.
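The left-hand variant of Algorithm 5.2, combined with the step length (5.20), can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the spgl1 implementation: the acceptance test is a plain monotone Armijo condition with constant $10^{-4}$, the step length is clipped to a fixed range, $\beta = 1$ is used whenever (5.20) is not positive, and the example feasible set is a Euclidean ball (that is, $\kappa(x) = \|x\|_2$) because its projection is a one-liner. All function names are hypothetical.

```python
import numpy as np

def spg(A, b, project, x0, max_iter=500, tol=1e-8):
    """Projected gradient with Barzilai-Borwein steps for min 0.5*||Ax-b||^2, x in Omega."""
    x = project(np.asarray(x0, dtype=float))
    r = A @ x - b
    f, g = 0.5 * (r @ r), A.T @ r
    beta = 1.0                                   # initial step length (assumed)
    for _ in range(max_iter):
        # Stop when x is (nearly) a fixed point of the projected gradient map.
        if np.linalg.norm(project(x - g) - x) <= tol:
            break
        d = project(x - beta * g) - x            # single projection, then backtrack
        slope = g @ d
        alpha = 1.0
        while True:
            x_new = x + alpha * d
            r_new = A @ x_new - b
            f_new = 0.5 * (r_new @ r_new)
            if f_new <= f + 1e-4 * alpha * slope or alpha < 1e-12:
                break
            alpha *= 0.5                         # reduce alpha
        g_new = A.T @ r_new
        dx, dg = x_new - x, g_new - g
        curvature = dx @ dg
        # Step length (5.20), with a safeguard for nonpositive curvature.
        beta = (dx @ dx) / curvature if curvature > 0 else 1.0
        beta = min(max(beta, 1e-10), 1e10)
        x, f, g = x_new, f_new, g_new
    return x

def project_l2_ball(v, tau):
    """Euclidean projection onto {x : ||x||_2 <= tau}."""
    nrm = np.linalg.norm(v)
    return v if nrm <= tau else (tau / nrm) * v

A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 1.0])
x_free = spg(A, b, lambda v: project_l2_ball(v, 10.0), np.zeros(2))  # constraint inactive
x_ball = spg(A, b, lambda v: project_l2_ball(v, 1.0), np.zeros(2))   # constraint active
```

With the large radius the constraint is inactive and the iterates approach the least-squares solution; with radius one the returned point lies on or inside the ball. Swapping `project_l2_ball` for another projection changes the feasible set without touching the solver.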
This step length was originally introduced by Barzilai and Borwein [5] for unconstrained minimization and motivated as follows. Given the current iterate we would like to take a step $x_{k+1} = x_k - S_k g(x_k)$, where $S_k = \beta_k I$. The value of $\beta_k$ is chosen such that it minimizes $\|\Delta x - \beta_k\Delta g\|$ or $\|\Delta x/\beta_k - \Delta g\|$. Solving the latter problem gives (5.20). Birgin et al. [17] introduced the Barzilai-Borwein step length to constrained optimization and relate the step length to the quadratic approximation $q(x)$ of $f(x)$, with approximate Hessian $B_k = \beta_k^{-1} I$. Alternative choices for step length have been studied as well, for example by Dai and Fletcher [46].

Backtracking line search. The two components that characterize a line search are the way in which trial points are generated, and the conditions under which a trial point is deemed acceptable. For the scaled projected gradient method we adopt two different ways of generating trial points. The first method generates a sequence of points along the projection arc

$$x(\alpha) = P(x_k + \alpha\beta_k d_k),$$

with $\alpha \ge 0$, and $d_k$ the (scaled) search direction. Examples of projection arcs for the nonnegative orthant and the $\ell_1$-ball are illustrated in Figures 4.3(a) and 5.7 respectively. In practice we typically sample the projection arc along the set $\alpha = \mu^i$, for trial points $i \ge 0$ and with $\mu \in (0, 1)$.

Figure 5.7: Illustration of projection arcs on the two- and three-dimensional cross-polytope.

As noted by Birgin et al. [17], the distance between two consecutively projected points can be small or even zero for corner points. In such situations, which can arise when $x_k + \beta_k d_k$ is far from the feasible region, it may be necessary to use extremely small values of $\alpha$ in order to reach new points and make progress with the line search. To safeguard against this we follow Birgin et al. [17] and use backtracking from a single projection when projection along the arc fails.
In this approach we generate a new search direction $\hat{d}_k := P(x_k + \beta_k d_k) - x_k$, and perform regular backtracking line search along this direction, as shown in Figure 4.3(b). This approach requires only a single projection onto the feasible set and guarantees that all points $x_k + \alpha\hat{d}_k$ are feasible for $\alpha \in [0, 1]$.

For the acceptance of trial points we use the non-monotonic generalization of the Armijo condition by Grippo et al. [88] for projected line search by Bertsekas [15]. Under the typical Armijo condition sufficient descent is required relative to the most recent iterate, leading to monotonically decreasing objective values. While this property is convenient from a theoretical perspective, it was recognized that enforcing monotonicity can considerably slow the rate of convergence [88]. In order to overcome this problem, while maintaining the theoretical properties of Armijo, Grippo et al. [88] suggest defining sufficient descent relative to the maximum objective value over a fixed number $M$ of the most recent iterates. With sufficient-descent parameter $\gamma$, this amounts to the condition

$$f(x(\alpha)) \le \max_{i \in [\max\{1, k-M\},\, k]} f(x_i) + \gamma g(x_k)^T (x(\alpha) - x_k).$$

Stopping criteria. The most natural stopping criterion for any optimization algorithm is sufficient optimality of the current iterate. In SPG this can be done based on the projected gradient or on the primal-dual gap derived in Section 5.3.1. Regarding the gradient, a sufficient condition for a point $x_k$ to be optimal for (5.17) is that $P(x_k - g(x_k)) = x_k$. This is satisfied either when $g(x_k)$ is zero, or when the negative gradient is outward orthogonal to the boundary of the feasible set at $x_k$. In our spgl1 implementation [9] it was found that the number of iterations required to reach a sufficiently small duality gap can be prohibitively large.
At the same time it was observed that the primal objective may be near optimal, meaning that the large gap was predominantly due to the sensitivity of the dual feasible point based on $r_k$. Therefore, although not ideal, the stopping criterion also takes into account the progress of the objective. Together with the duality gap, this approach seems to work well in practice, although it does sometimes lead to undesirable behavior in the root-finding process when the gradient is insufficiently accurate. Finally, to ensure finite termination of the algorithm, we include additional constraints on the number of iterations and matrix-vector products taken.

5.4.2 Projected quasi-Newton

As an alternative to SPG as a solver for subproblem (5.17), we developed the projected quasi-Newton (PQN) approach described in Schmidt et al. [131]. The algorithm is aimed at solving the more general problem (5.18), in particular when the objective $f(x)$ and the corresponding gradient $g(x)$ are expensive to compute while projection onto the feasible set $\Omega$ is relatively cheap.

The PQN algorithm consists of an outer and an inner level. In the outer level, information from subsequent iterates is used to construct a quadratic model of the objective function around the current $x_k$:

$$q_k(x) := f_k + (x - x_k)^T g_k + \tfrac{1}{2}(x - x_k)^T B_k (x - x_k).$$

Here, the positive definite matrix $B_k$ is an approximation to the Hessian $\nabla^2 f(x)$ and is constructed implicitly based on L-BFGS updates [102]. The inner level uses the quadratic model to find a search direction by approximately solving

$$\mathop{\text{minimize}}_{x} \quad q_k(x) \quad \text{subject to} \quad x \in \Omega. \tag{5.21}$$

Once the search direction $d_k$ is determined, the outer level does a line search along $x_k + \alpha d_k$ with $\alpha \in [0, 1]$ to determine the next feasible iterate $x_{k+1}$. A summary of the algorithm is given in Algorithm 5.3.

Solving the PQN subproblem. Problem (5.21) is a special case of (5.18) and can be solved using the SPG algorithm described in the previous section.
Because the objective function depends only on $B_k$ and the fixed vector $g(x_k)$, we can solve the subproblem without evaluating $f(x)$ or $g(x)$. Another important property, following from positive definiteness of $B_k$, is that any $x$ satisfying $q(x) < q(0)$ can be used to construct a valid search direction $d_k := x - x_k$ that is guaranteed to be a descent direction for the original problem. An illustration of the quadratic model on the two-dimensional cross-polytope is given in Figure 5.7(b).

Algorithm 5.3: Projected quasi-Newton algorithm.

    Given A, b, x0 ∈ Ω, τ
    Initialize Hessian approximation Ĥ
    while not converged do
        Solve (5.21) for search direction d
        Initialize α = 1
        repeat
            x̄ = x + αd
            Update α
        until sufficient descent
        x = x̄
        Update Hessian approximation Ĥ
    return x

Chapter 6
Application of the solver framework

In Chapter 5 we derived a framework for problems of the form

$$\mathop{\text{minimize}}_{x} \quad \kappa(x) \quad \text{subject to} \quad \|Ax - b\|_2 \le \sigma, \tag{6.1}$$

and proposed two algorithms for solving the subproblems arising in the framework. Both algorithms require the evaluation of $\kappa(x)$ and $\kappa^\circ(x)$, and projection onto the feasible set. In this chapter we establish these components for a number of problem formulations that commonly arise in sparse recovery and satisfy the conditions required by the framework.

For each problem formulation we give a motivation for its importance and provide potential applications. We then describe the two essential components needed by the solvers: orthogonal projection onto the feasible set, and the evaluation of the polar of $\kappa(x)$. For most problems considered, $\kappa(x)$ is a norm, in which case the polar is usually referred to as the dual norm. For the sake of uniformity in section names, etc., we often use the term 'polar', even in the context of norms.

6.1 Basis pursuit denoise

The most widely used and studied convex sparse recovery formulation to date is the basis pursuit denoise problem (BP$_\sigma$), described and motivated in Chapter 1.
In this section we consider a generalization of this problem, which allows positive weights to be associated to each coefficient, giving the weighted basis pursuit denoise formulation:

$$\mathop{\text{minimize}}_{x} \quad \|Wx\|_1 \quad \text{subject to} \quad \|Ax - b\|_2 \le \sigma, \tag{6.2}$$

where $W$ is a diagonal weighting matrix. The most prominent use of this formulation (with $\sigma = 0$) is the reweighted $\ell_1$ algorithm by Candès et al. [37], which is somewhat akin to iteratively reweighted least-squares. Empirically it is shown [37] that the probability of recovering a vector $x_0$ from observation $b = Ax_0$ benefits from this reweighting, compared to a single iteration of unweighted basis pursuit.

The weighted formulation fits (6.1) with $\kappa(x) := \|Wx\|_1$ and we next discuss the derivation of $\kappa^\circ(x)$, and the Euclidean projection onto the (weighted) one-norm ball.

6.1.1 Polar function

For the polar of the weighted one-norm $\|Wx\|_1$ we use the following, more general, result.

Theorem 6.1. Let $\kappa$ be a gauge function, and let $\Phi$ be an invertible matrix. Given $\kappa_\Phi(x) := \kappa(\Phi x)$, then $\kappa_\Phi^\circ(x) = \kappa^\circ(\Phi^{-T} x)$.

Proof. Applying the definition of gauge polars (5.8) to $\kappa_\Phi$ gives

$$\kappa_\Phi^\circ(x) = \sup_w \,\{ \langle w, x \rangle \mid \kappa_\Phi(w) \le 1 \} = \sup_w \,\{ \langle w, x \rangle \mid \kappa(\Phi w) \le 1 \}.$$

Now define $u = \Phi w$. Invertibility of $\Phi$ then allows us to write

$$\kappa_\Phi^\circ(x) = \sup_u \,\{ \langle \Phi^{-1} u, x \rangle \mid \kappa(u) \le 1 \} = \sup_u \,\{ \langle u, \Phi^{-T} x \rangle \mid \kappa(u) \le 1 \} = \kappa^\circ(\Phi^{-T} x),$$

as desired.

Applied to the weighted one-norm, we obtain the following result.

Corollary 6.2. Let $W$ be any non-singular diagonal matrix. Then the dual norm of $\|Wx\|_1$ is given by $\|W^{-1}x\|_\infty$.

6.1.2 Projection

Good performance of the spgl1 and pqn algorithms depends crucially on being able to efficiently compute projections onto the feasible set. We now show that this is indeed possible for projection onto the one-norm ball. We also discuss the modifications necessary to incorporate weights.
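Before turning to the projection itself, Corollary 6.2 is easy to check numerically. The sketch below, with hypothetical helper names, evaluates the polar (5.8) of $\kappa(x) = \|Wx\|_1$ in two ways: by the closed form $\|W^{-1}x\|_\infty$, and by maximizing $\langle w, x \rangle$ over the vertices $\pm e_i / W_{ii}$ of the unit ball $\{w \mid \|Ww\|_1 \le 1\}$, which is where a linear function over this polytope attains its maximum. The diagonal of $W$ is passed as a vector and is assumed nonzero.

```python
import numpy as np

def polar_weighted_l1(x, w):
    """Closed form from Corollary 6.2: ||W^{-1} x||_inf, with w = diag(W) nonzero."""
    return np.max(np.abs(x) / np.abs(w))

def polar_by_vertices(x, w):
    """Evaluate sup{<v, x> : ||W v||_1 <= 1} by enumerating the 2n vertices +/- e_i / w_i."""
    vertices = np.vstack([np.diag(1.0 / w), -np.diag(1.0 / w)])
    return np.max(vertices @ x)

x = np.array([1.0, -4.0, 2.0])
w = np.array([2.0, 1.0, 0.5])
closed_form = polar_weighted_l1(x, w)
brute_force = polar_by_vertices(x, w)
print(closed_form, brute_force)
```

The vertex enumeration costs $O(n^2)$ memory and is only meant as a check; in a solver one would use the closed form, which needs a single pass over the vector.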
Unweighted one-norm projection

The Euclidean projection of a point $c \in \mathbb{R}^n$ onto the one-norm ball of radius $\tau$ is given by the solution of

$$\mathop{\text{minimize}}_{x} \quad \tfrac{1}{2}\|c - x\|_2^2 \quad \text{subject to} \quad \|x\|_1 \le \tau. \tag{6.3}$$

This problem can be solved with a specialized $O(n \log n)$ algorithm, which we derive next. The algorithm was independently developed by Candès and Romberg [30], Daubechies et al. [51], and van den Berg and Friedlander [10]. Despite similarities in the approach, our implementation is nevertheless quite different from that of the others.

In case $\|c\|_1 \le \tau$, the projection is trivially found by setting $x = c$. For all other cases there exists for each $\tau$ a scalar $\lambda$ such that

$$\mathop{\text{minimize}}_{x} \quad \tfrac{1}{2}\|c - x\|_2^2 + \lambda\|x\|_1 \tag{6.4}$$

has the same solution as (6.3). The solution of this penalized formulation is obtained by applying (componentwise) the soft-thresholding operator (4.12). The problems (6.3) and (6.4) are equivalent when we choose $\lambda$ such that $\|S_\lambda(c)\|_1 = \tau$. Once this value is known we can solve (6.3) by applying $S_\lambda$ to $c$.

To simplify the discussion, let $a_i$, $i = 1, \ldots, n$, be the absolute values of $c$ in decreasing order. It is convenient to add $a_{n+1} := 0$ and to define the function

$$\theta(\lambda) := \|S_\lambda(c)\|_1. \tag{6.5}$$

It can be verified from (4.12) that $\theta(\lambda)$ is strictly decreasing in $\lambda$ from $\theta(a_{n+1}) = \theta(0) = \|c\|_1$ to $\theta(a_1) = 0$. Therefore, there exists an integer $k$ such that

$$\theta(a_k) \le \tau < \theta(a_{k+1}). \tag{6.6}$$

Suppose that $k$ is given. Then it remains to find a correction $\delta \ge 0$ such that $\theta(a_k - \delta) = \tau$. It follows from the definition of $\theta$ that

$$\theta(a_i) = \sum_{j=1}^{n} \max\{0, a_j - a_i\} = \sum_{j=1}^{i} (a_j - a_i) = \sum_{j=1}^{i} a_j - i \cdot a_i. \tag{6.7}$$

For a $\delta$ such that $0 \le \delta \le a_i - a_{i+1}$, it similarly holds that

$$\theta(a_i - \delta) = \sum_{j=1}^{n} \max\{0, a_j - (a_i - \delta)\} = \sum_{j=1}^{i} (a_j - a_i + \delta) = \delta i + \theta(a_i), \tag{6.8}$$

which is linear in $\delta$ on the specified interval, as illustrated in Figure 6.1. These results suggest the following projection algorithm:
1. sort the absolute values of $c$ to get vector $a$;
2. find $k$ satisfying (6.6);
3.
solve $\theta(a_k - \delta) = \tau$ for $\delta$; based on (6.8), this gives $\delta = (\tau - \theta(a_k))/k$;
4. compute $x := S_\lambda(c)$ with $\lambda = a_k - \delta$.

Because of the sorting step, the overall time complexity of this algorithm is $O(n \log n)$. The third step takes constant time, and soft-thresholding in step four requires $O(n)$ time. For the second step we need to be a little careful because direct evaluation of $\theta(a_k)$ using (6.7) for all $k$ could take $O(n^2)$ operations. Fortunately, it is not hard to see that $\theta(a_i) = \theta(a_{i+1}) - i(a_i - a_{i+1})$, thus yielding an $O(1)$ update to compute $\theta(a_k)$ for successive $k$. By replacing the first two steps of the algorithm with more advanced techniques it is possible to bring the complexity down to an expected run time linear in $n$; see Duchi et al. [67] and van den Berg et al. [13].

Figure 6.1: Plots of $\theta(\lambda)$ for a random $c$ of length eight, showing the relationship between $\tau$, $\lambda$, and $\delta$. The points on the curve indicate the value of $\theta(\lambda)$ for values of $\lambda$ coinciding with some $a_i$.

Weighted one-norm projection

The Euclidean projection of a point $z$ onto a weighted one-norm ball is given by

$$P_\tau(z) = \mathop{\text{argmin}}_{x} \quad \tfrac{1}{2}\|z - x\|_2^2 \quad \text{subject to} \quad \|Wx\|_1 \le \tau. \tag{6.9}$$

It is easily seen that for all $i$ with $w_i = 0$, the solution is given by $P_\tau(z)_i = z_i$. Therefore, without loss of generality, we can assume that $w_i \ne 0$. In addition, since $P_0(z) = 0$, we only need to consider the case where $\tau > 0$. An efficient projection algorithm can again be derived starting with the Lagrangian formulation of (6.9),

$$\mathop{\text{minimize}}_{x} \quad \tfrac{1}{2}\|x - z\|_2^2 + \lambda\|Wx\|_1.$$

For $x$ to be a solution of this problem it suffices that the subdifferential of the objective, evaluated at $x$, contains zero. Since the objective is separable, this reduces to

$$0 \in x_i - z_i + \lambda|w_i| \cdot \operatorname{sgn}(x_i).$$

This condition is satisfied by setting $x_i$ to the soft-thresholded value of $z_i$ with threshold $\lambda|w_i|$, i.e.,

$$x^*(\lambda)_i := \operatorname{sgn}(z_i) \cdot \max\{0, |z_i| - \lambda|w_i|\}. \tag{6.10}$$
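The sort-based projection can be sketched directly from the soft-thresholding rule (6.10). The NumPy function below is an illustration, not the spgl1 routine, and its name is hypothetical. It assumes all weights are nonzero, sorts the ratios $|z_i|/|w_i|$ in decreasing order, and uses cumulative sums so that every candidate threshold is obtained in $O(1)$ after the $O(n \log n)$ sort. With unit weights it reduces to the four-step unweighted algorithm, with $\lambda = a_k - \delta$.

```python
import numpy as np

def project_weighted_l1(z, w, tau):
    """Euclidean projection of z onto {x : ||W x||_1 <= tau}, W = diag(w), w_i != 0."""
    z = np.asarray(z, dtype=float)
    w = np.abs(np.asarray(w, dtype=float))
    if tau <= 0:
        return np.zeros_like(z)
    if np.sum(w * np.abs(z)) <= tau:
        return z.copy()                          # already feasible
    ratio = np.abs(z) / w                        # values of lambda at which entries vanish
    order = np.argsort(-ratio)                   # decreasing order
    cum_wz = np.cumsum((w * np.abs(z))[order])
    cum_w2 = np.cumsum((w ** 2)[order])
    # Candidate threshold if exactly the k+1 largest ratios stay nonzero.
    lam_candidates = (cum_wz - tau) / cum_w2
    # The correct support is the largest prefix whose smallest ratio exceeds its candidate.
    k = np.nonzero(ratio[order] > lam_candidates)[0][-1]
    lam = lam_candidates[k]
    return np.sign(z) * np.maximum(0.0, np.abs(z) - lam * w)   # rule (6.10)

z = np.array([3.0, -1.0, 0.5])
x_unit = project_weighted_l1(z, np.ones(3), tau=2.0)
w = np.array([1.0, 2.0, 4.0])
x_wtd = project_weighted_l1(z, w, tau=2.0)
print(x_unit, x_wtd)
```

In both calls the threshold works out to $\lambda = 1$, so only the first entry survives. Entries with zero weight would have to be copied through unchanged before calling this sketch, as noted in the text.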
The function $\theta_w(\lambda) := \|W x^*(\lambda)\|_1$ is again non-increasing and piecewise linear, with breaks occurring at $\lambda = |z_i / w_i|$. With only minor changes, this allows us to use the same algorithm as for the regular one-norm projection. With careful implementation it is also possible to apply the linear-time algorithm in [67, 13].

6.2 Joint and group sparsity

In this section we develop the theory and methods required to apply the root-finding framework to the group sparse recovery problem (1.7). We also consider the mmv problem (1.8) as a special case of group recovery. To start, let us look at a practical application of the latter problem in source localization.

6.2.1 Application – Source localization

In astronomy, radio telescopes are used to image distant objects that emit radio waves. Due to the distance of the source, these waves arrive at the telescope as plane waves, and the design of the parabolic dish (see illustration in Figure 6.2(a)) is such that all reflected signals travel the same distance to the central receiver. This causes the signals to arrive at the receiver in phase, thus leading to an amplified signal. By changing the direction of the telescope, it can focus on different parts of the sky.

An alternative approach is to use an array of stationary omnidirectional sensors, as shown in Figure 6.2(b). Focusing the array at a particular direction can then be achieved by applying an appropriate delay or phase shift to the signal of each sensor prior to summation.

Let $S_{i,j}$ represent a set of narrowband signals arriving from angles $\theta_i$, $i = 1, \ldots, n$, at time $t_j$, $j = 1, \ldots, k$. Under the narrowband assumption, and provided the spacing between sensors is sufficiently small, the sensor output can be formulated as

$$B = A(\theta) S + N,$$

where $N$ is a matrix of random additive noise, and $A(\theta)$ represents the phase shift and gain matrix. Each column in $B$ contains the measurement values taken by the sensors at a given point in time.
Each column of the matrix $A$ corresponds to a single arrival angle $\theta_i$, and each row corresponds to one of $m$ sensors. In the two-dimensional case, with sensor positions given by $(p_i, 0)$, $i = 1, \ldots, m$, the complex phase shift for sensor $i$, relative to the origin at $(0, 0)$, for angle $\theta_j$, is given by

$$A_{i,j} = \exp\{2\imath\pi \cos(\theta_j)\, p_i / \lambda\},$$

with $\imath = \sqrt{-1}$ and wavelength $\lambda$.

In source localization the angles $\theta$, and consequently $A(\theta)$, are unknown. When the number of sources is small, we can discretize the space into a set of discrete angles $\psi$ and find a sparse approximate solution to $A(\psi) X = B$.

Figure 6.2: Focusing planar waves from a given direction using (a) a radio telescope, (b) an array of omnidirectional sensors.

Assuming sources are stationary or moving slowly with respect to the observation time, we would like the nonzero entries in $X$ (corresponding to different angles of arrival) to be restricted to a small number of rows. This motivates the approach taken by Malioutov et al. [109], which amounts exactly to the mmv problem (1.8). Note that the misfit between $A(\psi) X$ and $B$ in this case is due not only to the signal noise $N$, but also to the angular discretization $\psi$.

As a two-dimensional example, consider an array of twenty omnidirectional sensors spaced at half the wavelength of interest. Impinging on this array are five far-field sources, located at angles 60°, 65°, 80°, 100.5°, and 160° relative to the horizon. We are given twenty observations measured by the array at a signal-to-noise ratio of 10 dB. To recover the direction of arrival we discretize the (two-dimensional) space at an angular resolution of 1° and compare mmv and bpdn to beamforming, Capon, and music (see [107] for more information). The resulting powers from each direction of arrival are shown in Figure 6.3. Both mmv and bpdn can be seen to give very good results.
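To make the construction of the phase-shift matrix concrete, the following minimal sketch builds $A(\psi)$ for a setup like the two-dimensional example above: twenty sensors on a line at half-wavelength spacing and a 1° angular grid. The function name `steering_matrix`, the unit wavelength, and the choice of a grid running from 0° to 180° inclusive are assumptions made for illustration; they are not taken from the experiments.

```python
import cmath
import math


def steering_matrix(positions, angles_deg, wavelength):
    """Return A with A[i][j] = exp(2*pi*1j*cos(theta_j)*p_i/wavelength)."""
    return [
        [
            cmath.exp(2j * math.pi * math.cos(math.radians(theta)) * p / wavelength)
            for theta in angles_deg
        ]
        for p in positions
    ]


wavelength = 1.0                                       # assumed unit wavelength
positions = [i * wavelength / 2 for i in range(20)]    # half-wavelength spacing
grid = list(range(0, 181))                             # assumed grid: 0..180 degrees
A = steering_matrix(positions, grid, wavelength)       # 20 rows, 181 columns
```

Every entry has unit modulus, so the columns differ only in phase. The column for 90° (broadside) is constant across sensors, which is why summing the sensor outputs without any delay focuses the array in that direction.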
For a more realistic example we consider a three-dimensional source localization problem. Because of the added dimension, discretizing the space of all possible directions and positioning of sensors becomes somewhat harder. For the placement of the sensors we choose a near-uniform distribution within a unit norm circle, which is conveniently done using existing circle-packing results [134]. For the discretization of arrival directions we choose a set of points $P := \{p_i\}$ on the unit sphere, with their potential energy, $\sum_{i \ne j} 1/\|p_i - p_j\|_2$, approximately minimized. The final result is then obtained by discarding the points in the halfspace below the surface. The discretizations obtained for 80 sensors and 100 directions are shown in Figure 6.4, along with a signal coming from eight directions.

Given a set of such signals we run mmv and obtain an approximate solution on the coarse grid; see Figure 6.5(a,e). Because the actual directions of arrival are unlikely to be captured by the grid we can expect there to be some misfit. This misfit can be reduced by locally refining the grid based on the approximate signal directions by adding new directions of arrival and repeating reconstruction until the desired result is reached. This is illustrated in Figure 6.5(b–d,f–h).

Figure 6.3: Angular spectra (power in dB against angle) obtained using beamforming, Capon, MUSIC, BPDN, and MMV for five uncorrelated far-field sources at angles 60°, 65°, 80°, 100.5°, and 160°. The directions of arrival are discretized at 1° resolution.

Figure 6.4: Configuration of (a) 80 sensors on the plane, (b) coarse grid of 100 arrival directions on the half-sphere, and (c) top view of actual arrival directions on a fine grid.
Figure 6.5: Grid of (a) initial arrival directions, and (b–d) after up to three refinement steps, along with corresponding solutions (e–h).

6.2.2 Polar functions of mixed norms

The $\|\cdot\|_{p,q}$ norm defined in (1.9), and the sum of norms in (1.7), are special cases of more general mixed norms. The dual or polar function for such norms can be obtained using the following result.

Theorem 6.3. Let $\sigma_i$, $i = 1, \ldots, k$, represent disjoint index sets such that $\bigcup_i \sigma_i = \{1, \ldots, n\}$. Associate with each group $i$ a primal norm $\|\cdot\|_{p_i}$ with dual norm $\|\cdot\|_{d_i}$. Denote $v_i(x) = \|x_{\sigma_i}\|_{p_i}$ and $w_i(x) = \|x_{\sigma_i}\|_{d_i}$. Let $\|\cdot\|_p$ be a norm such that for all vectors $q, r \ge 0$, $\|q\|_p \le \|q + r\|_p$, and let $\|\cdot\|_d$ denote its dual norm. Then the dual norm of $|||\cdot|||_p := \|v(\cdot)\|_p$ is given by $|||\cdot|||_d := \|w(\cdot)\|_d$.

Proof. First we need to show that $|||x|||_p$ is indeed a norm. It is easily seen that the requirements $|||x|||_p \ge 0$, $|||\alpha x|||_p = |\alpha| \cdot |||x|||_p$, and $|||x|||_p = 0$ if and only if $x = 0$ hold. For the triangle inequality we need to show that $|||q + r|||_p \le |||q|||_p + |||r|||_p$. Using the triangle inequality of the group norms, we have that $0 \le v(q + r) \le v(q) + v(r)$, componentwise. The assumption on the outer norm $\|\cdot\|_p$ then allows us to write

$$|||q + r|||_p = \|v(q + r)\|_p \le \|v(q) + v(r)\|_p \le \|v(q)\|_p + \|v(r)\|_p = |||q|||_p + |||r|||_p,$$

as desired. Next, to derive the dual norm we note that the dual of any norm is defined implicitly by the following equation:

$$\|x\|_d := \sup_z\, \{x^T z \mid \|z\|_p \le 1\}.$$

For each given subvector $x_{\sigma_i}$, the supremum of $x_{\sigma_i}^T z_{\sigma_i}$ with $\|z_{\sigma_i}\|_{p_i} \le 1$ is given by $w_i(x) := \|x_{\sigma_i}\|_{d_i}$. This quantity scales linearly with the bound we impose on the primal norm, i.e., under the condition $\|z_{\sigma_i}\|_{p_i} \le t_i$, the supremum becomes $t_i w_i(x)$. Writing $w = \{w_i(x)\}_i$, we can write the supremum over the entire vectors as

$$\sup_z\, \{x^T z \mid |||z|||_p \le 1\} = \sup_{t,z}\, \Big\{\sum_i t_i\, x_{\sigma_i}^T z_{\sigma_i} \;\Big|\; \|t\|_p \le 1,\ \|z_{\sigma_i}\|_{p_i} \le 1\Big\} = \sup_t\, \{t^T w \mid \|t\|_p \le 1\}.$$

But this is exactly the definition of $\|w\|_d$.

Note that the requirement on the outer primal norm $\|\cdot\|_p$ is essential in deriving the triangle inequality, but does not hold for all norms. For example, the norm $\|x\| := \|\Phi x\|_2$, with any non-diagonal, invertible matrix $\Phi$, does not satisfy the requirement. Importantly though, the requirement is satisfied for the common norms $\|x\|_\gamma$, $1 \le \gamma \le \infty$.

By repeated application of the above theorem we can derive the dual of arbitrarily nested norms. For example, the function $\|x_{(1)}\|_2 + \max\{\|x_{(2)}\|_1, \|x_{(3)}\|_2\}$ applied to a vector consisting of $x_{(1)}$, $x_{(2)}$, $x_{(3)}$ is a norm whose dual is given by $\max\{\|x_{(1)}\|_2, \|x_{(2)}\|_\infty + \|x_{(3)}\|_2\}$. Likewise, by vectorizing $X$ and imposing appropriate groups, we can use Theorem 6.3 to obtain the dual of $\|X\|_{p,q}$:

Corollary 6.4. Given $p, q \ge 1$. Let $p'$ and $q'$ be such that $1/p + 1/p' = 1$ and $1/q + 1/q' = 1$. Then $(\|X\|_{p,q})^\circ = \|X\|_{p',q'}$.

6.2.3 Projection

Euclidean projection onto the norm-balls induced by $\|X\|_{1,2}$ and $\sum_i \|x_{\sigma_i}\|_2$ can be reduced to projection onto the one-norm ball, using the following theorem.

Theorem 6.5. Let $c_{(i)}$, $i = 1, \ldots, n$, be a set of vectors, possibly of different length. Then the solution $x^* = (x^*_{(1)}, \ldots, x^*_{(n)})$ of

$$\mathop{\mathrm{minimize}}_{x} \quad \tfrac{1}{2}\sum_i \|c_{(i)} - x_{(i)}\|_2^2 \quad \text{subject to} \quad \sum_i \|x_{(i)}\|_2 \le \tau, \tag{6.11}$$

can be obtained by solving the one-norm projection problem

$$\mathop{\mathrm{minimize}}_{u} \quad \tfrac{1}{2}\|v - u\|_2^2 \quad \text{subject to} \quad \|u\|_1 \le \tau, \tag{6.12}$$

with $v_i = \|c_{(i)}\|_2$, and setting

$$x^*_{(i)} = \begin{cases} (u^*_i / v_i) \cdot c_{(i)} & \text{if } v_i \ne 0, \\ 0 & \text{otherwise.} \end{cases}$$

Proof. We first treat the special case where $v_i = 0$. The projection for these groups is trivially given by $x_{(i)} = c_{(i)} = 0$, thus allowing us to exclude these groups. Next, rewrite (6.11) as

$$\mathop{\mathrm{minimize}}_{x,u} \quad \tfrac{1}{2}\sum_i \|c_{(i)} - x_{(i)}\|_2^2 \quad \text{subject to} \quad \|x_{(i)}\|_2 \le u_i,\ \ \|u\|_1 \le \tau. \tag{6.13}$$

Fixing $u = u^*$ makes the problem separable, reducing the problem for each $i$ to

$$\mathop{\mathrm{minimize}}_{x_{(i)}} \quad \tfrac{1}{2}\|c_{(i)} - x_{(i)}\|_2^2 \quad \text{subject to} \quad \|x_{(i)}\|_2^2 \le u_i^2.$$

For $u_i = 0$ this immediately gives $x_{(i)} = 0$.
Otherwise the first-order optimality conditions on $x$ require that the gradient of the Lagrangian,

$$\mathcal{L}(x_{(i)}, \lambda_i) = \tfrac{1}{2}\|c_{(i)} - x_{(i)}\|_2^2 + \lambda_i\big(\|x_{(i)}\|_2^2 - u_i^2\big),$$

with $\lambda_i \ge 0$, be equal to zero; that is, $\nabla \mathcal{L}(x_{(i)}) = x_{(i)} - c_{(i)} + 2\lambda_i x_{(i)} = 0$. It follows that $x_{(i)} = c_{(i)} / (1 + 2\lambda_i) = \gamma_i c_{(i)}$, such that $\|x_{(i)}\|_2 = \gamma_i \|c_{(i)}\|_2 = u_i$ (which also holds for $u_i = 0$). Using the definition $v_i = \|c_{(i)}\|_2$, and the fact that $x_{(i)} = \gamma_i c_{(i)}$, we can rewrite each term of the objective of (6.11) as

$$\begin{aligned}
\|c_{(i)} - x_{(i)}\|_2^2 &= c_{(i)}^T c_{(i)} - 2\gamma_i c_{(i)}^T c_{(i)} + \gamma_i^2 c_{(i)}^T c_{(i)} \\
&= \|c_{(i)}\|_2^2 - 2\gamma_i \|c_{(i)}\|_2^2 + \gamma_i^2 \|c_{(i)}\|_2^2 \\
&= v_i^2 - 2\gamma_i v_i^2 + \gamma_i^2 v_i^2 = (v_i - \gamma_i v_i)^2 = (v_i - u_i)^2.
\end{aligned}$$

Finally, substituting this expression into (6.13) yields (6.12), because the constraint $\|x_{(i)}\|_2 = u_i$ is automatically satisfied by setting $x_{(i)} = \gamma_i c_{(i)} = (u_i / v_i) \cdot c_{(i)}$.

This proof can be extended to deal with weighted group projection. In that case the problem reduces to projection onto the weighted one-norm ball.

6.3 Sign-restricted formulations

In this section we consider the generalized sign-restricted basis pursuit denoise formulation

$$\mathop{\mathrm{minimize}}_{x} \quad \kappa(x) \quad \text{subject to} \quad \|Ax - b\|_2 \le \sigma, \quad x_i \ge 0 \ \text{for } i \in I_p, \quad x_i \le 0 \ \text{for } i \in I_n,$$

where $I_p, I_n \subseteq \{1, \ldots, n\}$ are two (possibly empty) disjoint sets of indices, and $\kappa$ is a norm. The use of sign information as a prior can greatly help with the recovery of sparse signals. For nonnegative basis pursuit (nnbp) this advantage was theoretically shown by Donoho and Tanner [55, 54], as pointed out in Section 2.5.3. Indeed, Figure 6.6 shows that nnbp clearly outperforms general basis pursuit in the fraction of randomly chosen sparse $x_0$ that can be recovered from $b = Ax_0$. We next describe a problem in analytical chemistry that can conveniently be expressed as an nnbp.

Figure 6.6: Equivalence breakdown curve (recovery percentage against number of nonzeros, for BP and nonnegative BP) for 40×80 random Gaussian matrix averaged over 300 random nonnegative $x_0$.
Figure 6.7: Mass spectrum of propane (relative intensity against m/z) using electron ionization [116].

6.3.1 Application – Mass spectrometry

Mass spectrometry is a powerful method used to identify the chemical composition of a sample. There exist several different approaches but we restrict ourselves here to electron ionization (EI) mass spectrometry in which analyte molecules are ionized using high-energy electrons. Such ionization is often followed by fragmentation of the molecule with existing bonds breaking and possibly new bonds forming in a manner that is characteristic of the original molecule. Mass spectrometers register the relative abundance of ions for a range of mass-to-charge (m/z) ratios, which can be used to deduce the chemical make-up of the compound [112, 133]. Once analyzed, the mass spectrum can subsequently be used as a signature to recognize the corresponding compound.

As an example, consider the mass spectrum of propane (C$_3$H$_8$), illustrated in Figure 6.7. The molecular ion generally has a mass of 44 u (unified atomic mass units) consisting of three $^{12}$C atoms and eight $^{1}$H atoms (the small peak at 45 m/z is due to the presence of $^{13}$C isotopes). The peaks directly preceding 44 m/z are ions with increasingly many hydrogen atoms missing. The most intense peak at 29 m/z corresponds to ethyl (C$_2$H$_5^+$) ions, which is again preceded by ions with fewer hydrogen atoms. Finally, there are the peaks around the methyl (CH$_3^+$) ion at 15 m/z.

When analyzing mixtures,² the components contribute independently to the measured spectrum. In case of electron ionization, this superposition is linear [112] and the spectrum can thus be written as a nonnegative combination of the individual spectra. Thus, given a mixed mass spectrum $b$ we can identify the components by forming a dictionary of spectra $A$ from possible substances and finding a sparse nonnegative solution $x$ satisfying $Ax \approx b$.
This can be formulated as

$$\mathop{\mathrm{minimize}}_{x} \quad \|x\|_1 \quad \text{subject to} \quad \|Ax - b\|_2 \le \sigma, \quad x \ge 0. \tag{6.14}$$

A similar formulation was recently proposed by Du and Angeletti [65].

² In practice mixtures are generally separated using one of several types of chromatograph before introduction into the mass spectrometer.

To evaluate the approach we created a dictionary $A$ containing the mass spectra of 438 compounds obtained from the NIST Chemistry WebBook [116]. Each spectrum was normalized and expanded to contain the intensities for 82 m/z values. The spectrum $b$ of a synthetic mixture was created by adding the spectra of twelve compounds with randomly selected ratios; see Figure 6.8. We then solved (6.14) with appropriate $\sigma$ for $b$, and likewise for a measurement contaminated with additive noise. The results of this simulation are shown in Figure 6.8(f).

6.3.2 Polar function

There are two ways to restrict $x$ to have a certain sign pattern: by adding explicit constraints, or by extending $\kappa$ to be an extended real function that is infinity at all $x$ violating the desired sign pattern. In this section we consider the second approach. The following result gives the polar of the desired function.

Theorem 6.6. Let $C$ be the intersection of (half)spaces $S_i$ with $S_i = \{x \in \mathbb{R}^n \mid x_i \ge 0\}$, or $S_i = \{x \in \mathbb{R}^n \mid x_i \le 0\}$, or $S_i = \mathbb{R}^n$. Further, let $\|x\|$ be any norm that is invariant under sign changes in $x$, and let $\kappa(x)$ be given by

$$\kappa(x) = \begin{cases} \|x\| & \text{if } x \in C, \\ \infty & \text{otherwise.} \end{cases}$$

Then, with $\|\cdot\|_*$ the dual norm and $P(x)$ the Euclidean projection onto $C$, $\kappa^\circ(x) = \|P(x)\|_*$.

Proof. We consider three different cases: (i) $x \in C$, (ii) $x \in C^\circ$, and (iii) $x \notin C \cup C^\circ$, which cover all $x \in \mathbb{R}^n$. In the first case, we have $P(x) = x$ and therefore only need to show that the polar $\kappa^\circ(x)$ in (5.8) is attained by some $w \in C$. This implies that for those $x$ it does not matter whether we use $\kappa(x)$ or $\|x\|$, hence giving $\kappa^\circ(x) = \|x\|_*$. It suffices to show that $w$ lies in the same orthant as $x$.
Assuming the contrary, it is easily seen that $x^T w$ is increased by flipping the sign of the violating component while $\kappa(w)$ remains the same, giving a contradiction.

For the second case, $x \in C^\circ$, it can be seen that $w^T x \le 0$ for all $w \in C$, and we therefore have $\kappa^\circ(x) = 0$. The result then follows from the fact that $P(x) = 0$, and therefore that $\kappa^\circ(x) = \|0\|_* = 0$.

For the third case we define $u = P(x)$, and $v = x - u$, where $u^T v = 0$ due to projection. Now let $w$ give the supremum in $\kappa^\circ(u)$. We know from the first case that $w$ must be in the same orthant as $u$, i.e., share the same sign pattern, possibly with more zeroes. As a consequence we then have $w^T v = 0$, and

$$\kappa^\circ(x) = x^T w / \kappa(w) = (u^T w + v^T w)/\kappa(w) = u^T w / \kappa(w) = \kappa^\circ(u) = \|u\|_*.$$

To see that the first equality holds, note that for all feasible $w$ such that $w^T v \ne 0$, we have $w^T v < 0$. This can only lower the value of the polar and such $w$ can therefore not be optimal.

Figure 6.8: Mass spectra (relative intensity against m/z) of (a)–(d) four of the twelve components in the mixture (e), along with (f) the actual and recovered relative abundance (percentage per component; actual, recovered, and recovered from noisy data). The dictionary contains the mass spectra of 438 different molecules.

The sign invariance of the norm is clearly satisfied for all $\ell_p$ norms. Interestingly, it also applies to all nested norms using such norms. In this case the outer norms do not need to satisfy the condition of invariance under sign changes.
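As a small illustration of Theorem 6.6, consider the sign-restricted one-norm, whose dual norm is the infinity norm. The polar is then the largest magnitude that remains after projecting onto $C$, which for sign constraints amounts to zeroing the components that violate them. The sketch below is illustrative only; the function names and the use of index collections `nonneg` and `nonpos` (standing in for $I_p$ and $I_n$) are assumptions.

```python
def project_onto_signs(x, nonneg=(), nonpos=()):
    """Euclidean projection onto C: zero out components with the wrong sign."""
    p = list(x)
    for i in nonneg:          # indices in I_p, restricted to x_i >= 0
        p[i] = max(p[i], 0.0)
    for i in nonpos:          # indices in I_n, restricted to x_i <= 0
        p[i] = min(p[i], 0.0)
    return p


def polar_sign_restricted_l1(x, nonneg=(), nonpos=()):
    """Polar of the sign-restricted one-norm: ||P(x)||_inf."""
    projected = project_onto_signs(x, nonneg, nonpos)
    return max((abs(v) for v in projected), default=0.0)
```

For nonnegative basis pursuit, with every index in `nonneg`, this reduces to $\max\{0, \max_i x_i\}$: large negative entries of $x$ do not contribute to the polar, in line with case (ii) of the proof.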
Since there are no restrictions on index sets $I_n$ and $I_p$, aside from their disjointness, it is possible, for example, to impose independent sign restrictions on the real or imaginary parts of complex numbers. With only minor modifications to the proof it can be shown that Theorem 6.6 holds for all closed convex cones $C$ when $\kappa(x) = \|x\|_2$ for all $x \in C$.

6.3.3 Projection

The projection for the sign-restricted formulation consists of setting to zero all components of $x$ that violate the restriction and projecting the remainder onto the ball induced by the underlying norm.

6.4 Low-rank matrix recovery and completion

The matrix completion problem described in Section 1.4.3 is a special case of the low-rank matrix recovery problem:

$$\mathop{\mathrm{minimize}}_{X \in \mathbb{R}^{m \times n}} \quad \mathrm{rank}(X) \quad \text{subject to} \quad \|\mathcal{A}(X) - b\|_2 \le \sigma,$$

where $\mathcal{A}(X)$ is a linear operator mapping $X$ to a vector. This problem is again intractable, and a convex relaxation of this formulation, based on the nuclear norm, was suggested by Fazel [75] and Recht et al. [126]:

$$\mathop{\mathrm{minimize}}_{X \in \mathbb{R}^{m \times n}} \quad \|X\|_* \quad \text{subject to} \quad \|\mathcal{A}(X) - b\|_2 \le \sigma. \tag{6.15}$$

Conditions for exact recovery of $X_0$ from $b := \mathcal{A}(X_0)$ using (6.15) with $\sigma = 0$ were recently studied by Recht et al. [126], who leveraged the restricted isometry technique developed for exact vector recovery using $\ell_1$. They derived necessary and sufficient conditions for recovery, and subsequently used them to compute recovery bounds for linear operators $\mathcal{A}$ whose matrix representation has independent random Gaussian entries.

Choosing $\mathcal{A}$ in (6.15) to be an operator that restricts elements of a matrix to the set $\Omega$ gives the nuclear-norm formulation for noisy matrix completion:

$$\mathop{\mathrm{minimize}}_{X} \quad \|X\|_* \quad \text{subject to} \quad \|X_\Omega - B_\Omega\|_2 \le \sigma.$$

With $\sigma = 0$ this reduces to the exact matrix completion formulation (1.13). Conditions for exact recovery using the latter formulation were studied by Candès and Recht [29] and Candès and Tao [36]. An extension of this work to the noisy case was given by Candès and Plan [28].
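The restriction operator used for matrix completion is simple enough to write out. The sketch below is a minimal illustration, assuming matrices stored as lists of rows and $\Omega$ given as a list of index pairs; the names `restrict`, `restrict_adjoint`, and `completion_misfit` are hypothetical. It shows the operator, its adjoint (which scatters a vector back into an otherwise zero matrix), and the misfit $\|X_\Omega - B_\Omega\|_2$ from the formulation above.

```python
import math


def restrict(X, omega):
    """A(X): the entries of X at the index pairs in omega, as a vector."""
    return [X[i][j] for (i, j) in omega]


def restrict_adjoint(y, omega, shape):
    """A^*(y): an all-zero matrix of the given shape with y scattered into omega."""
    rows, cols = shape
    Z = [[0.0] * cols for _ in range(rows)]
    for value, (i, j) in zip(y, omega):
        Z[i][j] += value
    return Z


def completion_misfit(X, B, omega):
    """Two-norm of X_Omega - B_Omega."""
    return math.sqrt(sum((X[i][j] - B[i][j]) ** 2 for (i, j) in omega))
```

A gradient-based solver only needs these two maps: the forward operator to evaluate the residual and the adjoint to map the residual back to the matrix space.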
6.4.1 Application – Distance matrix completion

As an illustration of nuclear norm minimization for matrix completion consider the following scenario (see also [29, 126]). Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ denote the coordinates of $n$ sensors in $\mathbb{R}^d$. Given the squared pairwise distance $D_{i,j} := (x_i - x_j)^T (x_i - x_j)$ for a limited number of pairs $(i, j) \in \Omega$, we want to determine the distance between any pair of sensors. That is, we want to find the Euclidean (squared) distance matrix $D$ given by

$$D = e s^T + s e^T - 2 X^T X, \tag{6.16}$$

where $e$ denotes the vector of all ones, and each $s_i := \|x_i\|_2^2$. Because $D$ is a low-rank matrix (it can be seen from (6.16) that $D$ has rank at most $d + 2$), we can try to apply (1.13) to recover $D$ from $D_\Omega$.

We consider three ways of choosing $\Omega$. In the first one we restrict the known distances to those that are below a certain threshold: $\Omega_1 := \{(i, j) \mid D_{i,j} \le r^2\}$ (see Figure 6.9(a)). In the second method we fix the cardinality of $\Omega_2$ and choose its elements uniformly at random. Finally, for $\Omega_3$ we also fix the cardinality, but make sure that all diagonal entries are in the set, and that all other elements occur in symmetric pairs. These assumptions reflect the fact that the distance from a sensor to itself is zero and that the distance between two sensors is independent of their ordering ($D_{i,j} = D_{j,i}$).

To determine the empirical recovery rate of (1.13) as a function of the cardinality of the three different $\Omega$ sets, we proceed as follows. First we generate 100 random 3 × 50 instances of $X$ with entries uniformly sampled from $[0, 1]$. For each of these matrices we compute $D$ using (6.16) and generate index sets $\Omega_1$, $\Omega_2$, and $\Omega_3$ of varying cardinality. We then solve (1.13) for each of these combinations and record the number of exact reconstructions for different cardinalities and index set type. The resulting recovery rates are plotted in Figure 6.9(b).
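The data generation for this experiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce Figure 6.9: the function names, the seeded generator, and the particular way of drawing symmetric pairs are all hypothetical choices. It forms $D$ from (6.16) and builds the three kinds of index sets.

```python
import random


def squared_distance_matrix(X):
    """D = e s^T + s e^T - 2 X^T X for a d-by-n coordinate matrix X (list of rows)."""
    d, n = len(X), len(X[0])
    s = [sum(X[r][i] ** 2 for r in range(d)) for i in range(n)]
    gram = [[sum(X[r][i] * X[r][j] for r in range(d)) for j in range(n)]
            for i in range(n)]
    return [[s[i] + s[j] - 2.0 * gram[i][j] for j in range(n)] for i in range(n)]


def omega_radius(D, r):
    """Omega_1: all pairs whose squared distance is at most r^2."""
    n = len(D)
    return [(i, j) for i in range(n) for j in range(n) if D[i][j] <= r * r]


def omega_uniform(n, cardinality, rng):
    """Omega_2: a fixed number of entries chosen uniformly at random."""
    return rng.sample([(i, j) for i in range(n) for j in range(n)], cardinality)


def omega_symmetric(n, cardinality, rng):
    """Omega_3: the full diagonal plus randomly chosen symmetric pairs."""
    upper = [(i, j) for i in range(n) for j in range(i + 1, n)]
    pairs = rng.sample(upper, (cardinality - n) // 2)
    return [(i, i) for i in range(n)] + pairs + [(j, i) for (i, j) in pairs]


rng = random.Random(0)                                   # assumed seed
X = [[rng.random() for _ in range(50)] for _ in range(3)]  # 3 x 50, uniform on [0, 1]
D = squared_distance_matrix(X)
```

Note that `omega_symmetric` rounds the requested cardinality down when the number of off-diagonal entries would otherwise be odd.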
The recovery curves reveal that recovery from random samples is far superior to recovery from distance-based samples; the latter essentially requires most distance pairs to be known in order to successfully complete the matrix. This is very likely due to the fact that the latter is much more structured and may provide little distance information for isolated sensors such as sensors located in a corner. In Figure 6.9(c–e) we plot instances of the sampling pattern obtained from the three methods. For the distance-based pattern in plot (c) it can be seen that little information is known for sensors 21, 43, and 46. Random sampling alleviates this problem and consequently gives much better recovery results. In a sense, the non-symmetric random sampling gives the most information because we could infer the distance for any pair of sensors whenever information is known for its symmetric counterpart. However, we did not perform any such preprocessing step, nor did we alter formulation (1.13) to take into account this symmetry. Nevertheless, for random sampling it can be seen that the use of (1.13) does successfully solve the matrix completion problem in many instances.

Figure 6.9: Euclidean distance matrix completion: (a) example of nodes observed from central node in circle; (b) recovery probabilities (recovery rate against percentage of samples) for random and distance-based sampling; and entries observed with (c) distance-based sampling, (d) symmetric random sampling, and (e) uniformly random sampling.

6.4.2 Polar function

The polar or dual of the nuclear norm $\|X\|_*$ is well-known to be given by the operator norm $\|X\|_2$, which corresponds to the largest singular value of $X$.
6.4.3 Projection

From existing results it follows that projection onto the nuclear norm ball reduces to computing the singular value decomposition (SVD) followed by projection of the singular values onto the $\ell_1$-norm ball, which we discussed in Section 6.1.2:

Theorem 6.7. Let $C$ be an $m \times n$ matrix with singular value decomposition $C = U_c D_c V_c^T$, with $U_c$ and $V_c$ orthonormal and $D_c$ an $m \times n$ matrix with diagonal $d_c$. Then the solution $X^*$ to the nuclear-norm restricted problem

$$\mathop{\mathrm{minimize}}_{X} \quad \|C - X\|_F^2 \quad \text{subject to} \quad \|X\|_* \le \tau, \tag{6.17}$$

is given by $U_c D_x V_c^T$, where $D_x$ is an $m \times n$ matrix with diagonal entries $d_x$:

$$d_x := \mathop{\mathrm{arg\,min}}_{d} \quad \|d_c - d\|_2 \quad \text{subject to} \quad \|d\|_1 \le \tau. \tag{6.18}$$

Proof. This follows directly from [23, Theorem 2.1] or [106, Theorem 3].

Chapter 7
Experiments

Having established both the theoretical foundation of the solver framework, and the necessary ingredients for its application to a number of specific formulations, we can now consider its performance. For our empirical experiments we consider three implementations of the framework: the main spectrally projected-gradient implementation spgl1, as well as two implementations with tighter tolerances for each subproblem solve, based on the spectral projected gradient (spg-tight) and projected quasi-Newton (pqn) methods respectively. Each solver accepts functions for the evaluation of $\kappa(x)$, $\kappa^\circ(x)$ and the projection onto the feasible set, thus simplifying the addition of new formulations that fit the framework.

The majority of solvers available for the problems under consideration work with a penalized version of the formulation. To include them in our comparison, we follow the approach suggested by Becker et al. [7]. For brevity we only discuss their approach in the context of basis pursuit denoise. Nevertheless, with only minor changes it applies equally to all other formulations discussed in this chapter. Becker et al.
generate equivalent formulations in $\sigma$ and $\lambda$ (for (BP$_\sigma$) and (QP$_\lambda$) respectively) by first solving (BP$_\sigma$) approximately, for some $\bar{\sigma}$. Based on this they define $\lambda := \|A^T r\|_\infty$, with misfit $r := b - Ax$, and use fista [6] to solve (QP$_\lambda$) to high accuracy. This gives a corresponding $\sigma := \|b - Ax^*\|_2$, as well as an accurate solution $x^*$ to both formulations.

Because at the time of writing there was no publicly available code for fista, we implemented our own version. This is relatively straightforward and we here discuss only the stopping criterion, which again applies to all other formulations with minor changes. Let $r$ denote the misfit $b - Ax$ for the current iterate $x$. We define the relative accuracy as

$$\mu = \max\left\{ \frac{\big|\lambda - \|A^T r\|_\infty\big|}{\max\{1, \lambda\}},\ \frac{\big|\,\|x\|_1 + (r^T r - r^T b)/\|A^T r\|_\infty\big|}{\max\{1, \|x\|_1\}} \right\}.$$

The first term is the relative difference between $\lambda$ and the estimate $\|A^T r\|_\infty$ obtained for (BP$_\sigma$) with the current misfit $\hat{\sigma} := \|r\|_2$. The second term is the relative primal-dual gap of the (BP$_\sigma$) formulation, again with misfit $\hat{\sigma}$. This gap can be derived using the techniques described in Section 5.2.1. Throughout this chapter, we pre-solve the test problems to relative accuracy $\mu \le 10^{-10}$, unless otherwise stated.

All run times reported in this chapter are as obtained on a single-core 2 GHz AMD machine with 2 GB of RAM, running Matlab 7.8.

7.1 Basis pursuit

In Section 6.1 we derived the polar function for both the weighted and unweighted $\ell_1$-norm, and presented an efficient routine for projection onto the feasible set. For our experiments we focus on the unweighted formulations, due to the lack of solvers for the weighted case.

We start with the application of the framework to the basis pursuit formulation ($\sigma = 0$), which precludes the use of solvers for the regularized version (QP$_\lambda$). Solvers that do apply are homotopy, implementation courtesy of Michael Friedlander and Michael Saunders, and the Bregman algorithm [154], as implemented by Wotao Yin.
nesta [7] can be used whenever the rows in matrix $A$ are orthonormal. All solvers are called with their default parameters, or, when missing, with the parameters suggested in the accompanying paper. We slightly modified the Bregman algorithm to allow up to five updates of its penalty parameter $\mu$ whenever the subproblem solver failed.

For the first set of experiments we apply the different solvers to a selection of sparco test problems (for details, see Table 8.2). The results of the experiments are summarized in Table 7.1. We evaluate the accuracy in terms of the $\ell_1$-norm of the solution, the $\ell_2$-norm of the residual (which should be zero), and the $\ell_\infty$-norm of the difference between the solution and the original $x_0$ used in generating the problem. In general, comparing against $x_0$ may be meaningless, as it need not be the solution to the basis pursuit problem. However, based on the $\ell_1$-norm of the solution and $x_0$, we conclude that they do coincide for this set of problems.

Starting with the sgnspike problem we see that all solvers successfully recover the original $x_0$. The fastest solvers are spgl1 and homotopy, closely followed by spg-tight and pqn. The most accurate, but also the slowest, is cvx/sdpt3. This is even more pronounced on zsgnspike, the complex version of the problem, where the root-finding approach is substantially faster.

The dcthdr test problem seems to be solved inaccurately, but this is mostly due to the fact that we report absolute deviations from $x_0$, which for dcthdr has an $\ell_1$-norm of the order 4.3e+4. nesta has a very small residual but solves the problem somewhat less accurately than the other methods. Homotopy and Bregman are the most accurate here, but at the expense of run time. The root-finding algorithms do very well in terms of time. Finally, we omit cvx/sdpt3, because it took too long to solve.
There is little we can say about the large complex angiogram problem because the root-finding implementations are the only ones that can be practically applied. However, the number of matrix-vector products and run time for spgl1 are quite low, considering the number of variables. spgl1 is also the fastest solver for jitter, but trails in accuracy to cvx/sdpt3 and Bregman.

The relative inaccuracy with which spgl1 solves its subproblems takes its toll for the difficult blkheavi problem. All solvers but cvx/sdpt3 have considerable trouble solving this problem, despite the fact that its solution is unique (in fact, the matrix $A$ used in this problem has a simple closed-form inverse). spgl1 is seen to slightly overshoot the target $\tau^*$, leading to an inaccurate solution. This holds even more so for Heaviside with normalized columns (not shown).

Problem    Solver     Time (s)  Aprod  ||x||_1    ||r||_2  ||x − x0||_∞
sgnspike   cvx/sdpt3  4.1e+2    n.a.   2.0000e+1  7.5e−10  1.5e−9
sgnspike   spgl1      5.7e−1    61     2.0000e+1  7.3e−5   7.2e−5
sgnspike   spg-tight  9.6e−1    114    2.0000e+1  5.4e−6   4.4e−6
sgnspike   pqn        1.1e+0    82     2.0000e+1  1.3e−5   2.1e−5
sgnspike   Homotopy   4.6e−1    49     2.0000e+1  9.4e−6   5.6e−6
sgnspike   Bregman    3.9e+0    301    2.0000e+1  6.6e−6   6.9e−6
zsgnspike  cvx/sdpt3  1.2e+3    n.a.   2.8284e+1  1.9e−10  1.3e−9
zsgnspike  spgl1      1.0e+0    59     2.8284e+1  9.5e−5   1.1e−4
zsgnspike  spg-tight  1.4e+0    110    2.8284e+1  1.1e−5   8.4e−6
zsgnspike  pqn        1.9e+1    324    2.8284e+1  4.7e−5   5.8e−5
dcthdr     nesta      9.6e+0    824    4.3765e+4  8.5e−13  2.6e−1
dcthdr     spgl1      4.1e+0    304    4.3744e+4  9.8e−5   1.0e−4
dcthdr     spg-tight  3.4e+0    276    4.3749e+4  7.4e−13  3.9e−2
dcthdr     pqn        6.2e+0    188    4.3748e+4  3.3e−2   7.9e−2
dcthdr     Homotopy   1.9e+1    735    4.3744e+4  3.7e−5   9.2e−6
dcthdr     Bregman    2.9e+1    1334   4.3744e+4  2.2e−6   1.0e−6
angiogram  spgl1      1.6e+0    101    3.8433e+2  6.1e−5   2.0e−5
angiogram  spg-tight  2.4e+0    202    3.8433e+2  7.5e−5   9.9e−6
angiogram  pqn        5.8e+1    328    3.8437e+2  3.5e−15  1.6e−4
jitter     cvx/sdpt3  1.7e+1    n.a.   1.7412e+0  1.6e−12  2.6e−10
jitter     nesta      2.5e+0    620    1.7413e+0  1.3e−16  5.0e−6
jitter     spgl1      2.0e−1    39     1.7409e+0  8.0e−5   1.6e−4
jitter     spg-tight  3.5e−1    74     1.7411e+0  3.7e−5   6.1e−5
jitter     pqn        4.5e−1    64     1.7412e+0  3.1e−5   4.7e−5
jitter     Bregman    1.4e+0    163    1.7412e+0  2.2e−8   3.6e−8
blkheavi   cvx/sdpt3  4.7e−1    n.a.   4.1000e+1  1.6e−9   1.0e−9
blkheavi   spgl1      1.3e+1    3789   4.2102e+1  2.9e−1   1.5e−1
blkheavi   spg-tight  1.4e+2    41037  4.1000e+1  3.0e−5   1.9e−5
blkheavi   pqn        1.2e+2    10161  4.1000e+1  5.1e−5   3.3e−5
blkheavi   Bregman    1.6e+1    2054   1.7691e+1  8.2e+0   3.7e+0

Table 7.1: Solver performance on a selection of sparco test problems.

The poor performance on the blkheavi problem led us to consider problems where the columns of $A$ are increasingly similar. This was achieved by taking convex combinations $A := (1 - \alpha) G + \alpha E$, where $G$ is a fixed 600 × 1800 Gaussian matrix, and $E$ is a matrix of all ones. The 80-sparse coefficient vector $x_0$ was randomly chosen and fixed for all $A$. The results obtained for different convex combinations of $G$ and $E$ are shown in Table 7.2. Despite the increasing similarity of the columns, basis pursuit can recover $x_0$ in all cases considered, as indicated by the $\ell_1$-norm of the solutions. Among the different methods, homotopy is by far the best, except when $A = G$, in which case it matches spgl1. Both spgl1 and the Bregman algorithm fail on the problems in which the $E$ component in $A$ dominates; the former by overestimating $\tau$, the latter by underestimating it. The tighter versions spg-tight and pqn do not have this problem, but do require many more matrix-vector products to reach a solution.
Even the run time of cvx/sdpt3, commonly less sensitive to A, goes up as the presence of E in A becomes more prominent.

To conclude the basis pursuit experiments, we look at the scalability of the different methods, and their sensitivity to the sparsity level of x0. For this we let A be an implicit representation of a randomly restricted DCT matrix of size 2^(n−2) × 2^n, and choose x0 to be 2^(n−5)-sparse and 2^(n−4)-sparse, respectively, for different scaling parameters n. We summarize the number of matrix-vector products and the run time in Table 7.3. Because cvx does not work with implicit matrices, we form an explicit representation prior to calling the solver. In order to save time, we limit the problem size solved with cvx/sdpt3, and likewise for the homotopy method. The remaining methods can be seen to scale very well to large problem sizes. For the sparsity level of 2^(n−5), the number of matrix-vector products remains nearly constant throughout for nesta and spgl1. For the Bregman algorithm this number fluctuates heavily, while it steadily increases for homotopy. The accuracy of the different methods is comparable throughout, while in terms of run time, spgl1 and spg-tight clearly excel.

Decreasing the sparsity to 2^(n−4) nonzero entries affects the performance of the solvers, as shown by the increased number of matrix-vector products required. For nesta the number of matrix-vector products remains nearly the same. Closer inspection of the data shows that the solutions obtained by nesta, while closely fitting Ax = b, typically have an overly large ℓ₁-norm. The Bregman algorithm, by contrast, tends to give solutions with an ℓ₁-norm that is too small, which necessarily leads to larger misfits. Interestingly, the more accurate subproblem solves in spg-tight benefit its overall performance, making it even faster than spgl1. Finally, note that the sparsity level strongly affects the performance of homotopy, which now requires substantially more time to reach the solution.
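To illustrate why the convex-combination matrices used for Table 7.2 have increasingly similar columns, the following sketch builds A = (1 − α)G + αE for the five weights and reports the largest absolute correlation between distinct normalized columns. It is not the code used for the experiments: the variable names and the use of this correlation as a similarity measure are assumptions made here for illustration, and only the dimensions and weights are taken from the text.

```matlab
% Sketch: columns of A = (1-alpha)*G + alpha*E become increasingly similar.
% Dimensions and weights follow the text; everything else is illustrative.
m = 600;  n = 1800;
G = randn(m, n);                 % fixed Gaussian matrix
E = ones(m, n);                  % matrix of all ones
alphas = [0 0.2 0.4 0.6 0.8];    % relative weight of E, as in Table 7.2
mu = zeros(size(alphas));
for i = 1:numel(alphas)
    A  = (1 - alphas(i)) * G + alphas(i) * E;
    An = bsxfun(@rdivide, A, sqrt(sum(A.^2, 1)));  % normalize the columns
    C  = abs(An' * An);                            % pairwise column correlations
    C(1:n+1:end) = 0;                              % ignore the unit diagonal
    mu(i) = max(C(:));                             % largest off-diagonal entry
    fprintf('alpha = %.1f: max column correlation = %.4f\n', alphas(i), mu(i));
end
```

For the largest weight the correlation should be close to one, which is consistent with the difficulty the less accurate solvers have on these problems.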
Weight of E  Solver     Time (s)  Aprod   ‖x‖₁       ‖r‖₂     ‖x − x0‖∞
0%           cvx/sdpt3  3.8e+2    n.a.    6.417e+1   1.5e−5   3.7e−7
             spgl1      1.6e+0    307     6.417e+1   9.4e−5   2.4e−6
             spg-tight  3.0e+0    680     6.417e+1   2.7e−6   9.0e−8
             pqn        4.4e+0    304     6.417e+1   3.6e−6   6.8e−8
             Homotopy   1.5e+0    224     6.417e+1   3.9e−7   3.6e−9
             Bregman    7.1e+0    1267    6.417e+1   2.0e−4   2.5e−6
20%          cvx/sdpt3  3.8e+2    n.a.    6.417e+1   1.8e−5   2.8e−7
             spgl1      7.0e+0    1535    6.417e+1   7.5e−5   1.6e−6
             spg-tight  1.2e+1    2896    6.417e+1   3.4e−6   1.4e−7
             pqn        2.9e+1    2408    6.417e+1   5.7e−6   1.3e−7
             Homotopy   1.3e+0    229     6.417e+1   4.9e−7   5.5e−9
             Bregman    8.8e+0    1545    6.417e+1   3.9e−4   7.3e−6
40%          cvx/sdpt3  4.7e+2    n.a.    6.417e+1   3.0e−6   5.9e−8
             spgl1      2.9e+1    6385    6.439e+1   9.7e−5   7.8e−3
             spg-tight  4.6e+1    10529   6.417e+1   4.9e−6   2.6e−7
             pqn        1.6e+2    11224   6.417e+1   1.8e−5   8.0e−7
             Homotopy   1.5e+0    253     6.417e+1   6.5e−7   1.0e−8
             Bregman    2.3e+1    3949    6.400e+1   2.6e−1   4.1e−3
60%          cvx/sdpt3  4.9e+2    n.a.    6.417e+1   2.4e−6   4.9e−8
             spgl1      3.0e+1    6810    6.762e+1   9.6e−5   1.1e−1
             spg-tight  1.9e+2    44236   6.417e+1   1.3e−6   3.0e−8
             pqn        4.8e+2    20162   6.417e+1   4.1e−5   2.8e−6
             Homotopy   1.5e+0    269     6.417e+1   9.7e−7   2.3e−8
             Bregman    1.2e+1    2052    5.388e+1   1.2e+1   3.8e−1
80%          cvx/sdpt3  5.5e+2    n.a.    6.417e+1   1.4e−6   8.1e−8
             spgl1      7.1e+0    1597    1.184e+2   6.3e−5   1.2e+0
             spg-tight  2.8e+2    64201   6.417e+1   1.5e−3   6.9e−5
             pqn        1.5e+3    46065   6.417e+1   1.5e−3   7.1e−5
             Homotopy   2.0e+0    333     6.417e+1   1.9e−6   9.1e−8
             Bregman    1.2e+1    2052    1.025e+1   3.9e+1   2.3e+0

Table 7.2: Convex combination of random 600 × 1800 Gaussian matrix G, and a matrix E = ee^T of all ones. Percentage indicates relative weight of E.

Sparsity = 2^(n−5)
Solver             n = 10   n = 11   n = 12   n = 13   n = 16   n = 19
cvx/sdpt3  Time    2.7e+1   4.0e+2   --       --       --       --
           Aprod   n.a.     n.a.     --       --       --       --
nesta      Time    1.8e+0   2.2e+0   3.1e+0   4.3e+0   3.1e+1   2.7e+2
           Aprod   777      813      837      801      825      855
spgl1      Time    4.7e−1   6.8e−1   7.8e−1   1.3e+0   9.3e+0   9.6e+1
           Aprod   923      997      997      977      994      1069
spg-tight  Time    7.8e−1   1.1e+0   1.2e+0   1.9e+0   1.4e+1   1.2e+2
           Aprod   1350     1515     1460     1453     1455     1529
pqn        Time    1.2e+0   1.9e+0   2.7e+0   4.7e+0   4.1e+1   3.0e+2
           Aprod   1087     1187     1179     1155     1162     1219
Homotopy   Time    2.5e−1   6.7e−1   2.1e+0   9.4e+0   --       --
           Aprod   1425     1672     1755     2037     --       --
Bregman    Time    2.6e+0   7.1e+0   5.1e+0   7.1e+0   9.3e+1   2.7e+2
           Aprod   529      1287     773      825      1844     702

Sparsity = 2^(n−4)
Solver             n = 10   n = 11   n = 12   n = 13   n = 16   n = 19
cvx/sdpt3  Time    5.5e+1   4.7e+2   --       --       --       --
           Aprod   n.a.     n.a.     --       --       --       --
nesta      Time    1.6e+0   2.1e+0   3.0e+0   4.0e+0   3.1e+1   2.6e+2
           Aprod   681      753      825      741      801      843
spgl1      Time    8.2e+0   8.2e+0   9.0e+0   1.8e+1   1.1e+2   1.5e+3
           Aprod   3469     3096     2729     3248     2961     4092
spg-tight  Time    5.9e+0   6.0e+0   8.3e+0   1.1e+1   9.6e+1   7.7e+2
           Aprod   6096     5425     5110     5460     5295     6251
pqn        Time    6.2e+0   8.3e+0   1.2e+1   2.4e+1   2.0e+2   1.5e+3
           Aprod   3935     3556     3149     3658     3299     4374
Homotopy   Time    1.7e+0   6.0e+0   3.6e+1   2.4e+2   --       --
           Aprod   6471     6188     6721     8461     --       --
Bregman    Time    9.5e+0   4.9e+1   2.6e+1   3.6e+1   1.9e+2   1.7e+3
           Aprod   1899     8979     3253     3523     2750     9212

Table 7.3: Basis pursuit solver performance on randomly restricted 2^(n−2) × 2^n DCT matrices. The two rows for each solver respectively denote run time in seconds, and number of matrix-vector products.

7.2 Basis pursuit denoise

We start the performance evaluation for basis pursuit denoise on four of the most challenging problems in the sparco test set: srcsep1, seismic, phantom2, and finger. These problems are generated based on signals, rather than on sparse coefficients, and no strictly sparse representation may exist. We therefore allow a solution to satisfy ‖Ax − b‖₂ ≤ σ, with the misfit σ set to a fixed fraction of the norm of b. Table 7.4 summarizes the pre-solve accuracy µ, along with the ℓ₁-norm of the solution for different values of σ.

           µ (‖x*‖₁)
Problem    σ = 0.1‖b‖₂          σ = 0.01‖b‖₂         σ = 0.001‖b‖₂
srcsep1    1e−10 (8.3347e+2)    1e−10 (1.0535e+3)    1e−10 (1.0873e+3)
seismic    1e−10 (1.5789e+4)    1e−10 (2.0164e+4)    4.8e−4 (2.0771e+4)
phantom2   1e−10 (7.0455e+3)    1e−10 (8.0027e+3)    1e−10 (8.1185e+3)
finger     1e−10 (4.3956e+3)    1e−10 (5.3703e+3)    1e−10 (5.4726e+3)

Table 7.4: Relative accuracy of fista solutions for a subset of the sparco test problems. The number in parentheses gives the ℓ₁-norm of the solution.
For the seismic problem we could not obtain a very accurate solution within a reasonable amount of time, and had to settle for µ = 4.8e−4.

We compare our implementation of the root-finding algorithm against nesta and homotopy. For the penalized formulation we use gpsr [78], fpc [90], and our implementation of fista. All solvers were called with default parameters, except for fista, which was called with µ = 1e−6. The large size of the problems rendered the use of cvx/sdpt3 impracticable.

The first thing to note about the test results with σ = 0.1‖b‖₂, given in Table 7.5, is that all solvers had considerable difficulty in reaching even modestly accurate solutions. fista did very well, obtaining higher accuracy with competitive run time. Recall, however, that fista solves the penalized version of basis pursuit denoise, and therefore requires the Lagrange multiplier λ to be specified. spgl1 is the fastest solver in all cases, but tends to overshoot the target τ value on these problems (compare with Table 7.4). This affects the accuracy of the solution, which, nevertheless, is comparable to the solution obtained by fpc. The quality of the solutions improves with increasing accuracy of the subproblem solves, as is evident from the results obtained using spg-tight and pqn. The latter two solvers do require substantially more time than spgl1, but still compare favorably against gpsr.

Decreasing the value of σ generally increases the amount of time spent on solving the problem. This is due not so much to the number of root-finding iterations as to the subproblems becoming increasingly hard to solve accurately; the same holds true for the penalized formulation with decreasing λ. The results obtained for σ = 0.01‖b‖₂ give a picture similar to those for σ = 0.1‖b‖₂, except that the run time is considerably higher. As is evident from Table 7.4, the seismic problem could not be solved to the desired level of µ = 1e−6, even after running fista for over a day.
Therefore, instead of reporting exceedingly large run times, with otherwise similar solver behavior, we decided to limit the run time of each solver to half an hour, and see how much could be done within this time. The results for this experiment, with the smallest of the three misfits (σ = 0.001‖b‖₂), are shown in Table 7.6. Here, the tighter tolerance on subproblem solves of spg-tight and pqn indirectly limits the number of root-finding steps that can be taken. This is apparent, for example, in srcsep1 and seismic, where the target ℓ₁-norm is not quite reached. The less stringent stopping criteria for spgl1, on the other hand, still cause the target to be exceeded. It is clear that for limited-time runs some tuning is needed to balance between these two options.

Problem    Solver     Time (s)  Aprod    ‖x‖₁        ‖x − x*‖∞
srcsep1    spgl1      5.2e+1    152      8.3415e+2   4.8e−1
           spg-tight  1.2e+3    5341     8.3346e+2   1.9e−3
           pqn        7.4e+2    1158     8.3346e+2   5.1e−3
           gpsr       1.3e+3    9705     8.3347e+2   1.2e−2
           fpc        1.2e+2    10154    8.3335e+2   3.4e−1
           fista      3.7e+3    20916    8.3347e+2   9.7e−6
seismic    spgl1      1.8e+2    325      1.5831e+4   3.0e+1
           spg-tight  8.4e+3    23381    1.5789e+4   2.1e+1
           pqn        1.8e+4    6401     1.5789e+4   2.2e+1
           gpsr       5.5e+3    34307    1.5789e+4   1.7e+1
           fpc        4.1e+2    35216    1.5783e+4   1.9e+1
           fista      9.3e+3    55230    1.5789e+4   4.7e−2
phantom2   spgl1      5.3e+0    87233    7.0482e+3   2.4e−1
           spg-tight  2.4e+2    94345    7.0455e+3   3.8e−3
           pqn        5.5e+3    90241    7.0456e+3   7.8e−2
           fista      4.4e+2    102834   7.0455e+3   8.2e−5
finger     spgl1      2.6e+1    191      4.4114e+3   4.3e+0
           spg-tight  2.1e+3    26794    4.3956e+3   3.7e−1
           pqn        4.4e+3    9375     4.3956e+3   6.3e−1
           gpsr       2.5e+3    46810    4.3957e+3   1.9e+0
           fpc        8.5e+1    47823    4.3953e+3   4.1e+0
           fista      2.2e+3    67837    4.3956e+3   4.2e−3

Table 7.5: Solver performance for the basis pursuit denoise formulation on a selection of sparco test problems, with σ set to 0.1‖b‖₂.

Problem    Solver     Time (s)  Aprod    ‖x‖₁        ‖x − x*‖∞
srcsep1    spgl1      4.8e+2    1501     1.0891e+3   5.7e−1
           spg-tight  1.8e+3    9660     1.0560e+3   2.0e−1
           pqn        1.8e+3    3369     1.0838e+3   1.4e−1
           gpsr       1.8e+3    15838    1.0895e+3   5.5e−1
           fpc        2.7e+2    16777    1.0899e+3   5.7e−1
           fista      1.8e+3    22087    1.0873e+3   6.7e−2
seismic    spgl1      5.6e+2    1016     2.0877e+4   2.8e+1
           spg-tight  1.8e+3    5171     1.3890e+4   3.0e+1
           pqn        1.8e+3    1540     5.2461e+3   3.8e+1
           gpsr       5.6e+2    6341     2.0912e+4   2.8e+1
           fpc        2.7e+2    6942     2.0970e+4   2.8e+1
           fista      1.8e+3    10808    2.0799e+4   2.8e+1
phantom2   spgl1      2.1e+1    59683    8.1232e+3   2.3e−1
           spg-tight  8.4e+2    76207    8.1186e+3   7.4e−2
           pqn        1.8e+3    60823    7.9624e+3   2.3e−1
           fista      9.9e+2    96222    8.1185e+3   2.9e−3
finger     spgl1      1.0e+2    793      5.4877e+3   3.8e+0
           spg-tight  1.8e+3    16955    5.0560e+3   1.9e+0
           pqn        1.8e+3    1997     2.7825e+3   3.3e+0
           gpsr       1.3e+2    18047    5.5010e+3   3.9e+0
           fpc        4.1e+1    18528    5.5212e+3   3.9e+0
           fista      1.8e+3    34836    5.4726e+3   2.1e−1

Table 7.6: Solver performance for basis pursuit denoise on a selection of sparco test problems, with σ set to 0.001‖b‖₂, and run time limited to 30 minutes.

As a final set of test problems, we apply basis pursuit denoise to the compressed sensing scenario where a 1400 × 4096 partial DCT matrix is used to sample a compressible signal. We created four different problems, all using the same A, but with different compressibility rates and misfit values.
For compress1a we set x0 to a random permutation of a vector with entries decaying exponentially from 10 to 10e−3, and chose σ = 0.01‖x‖₁. For the remaining three problems, compress1b–compress1d, we fixed x0 with entries ranging from 10 to 10e−4, and chose σ as 0.01‖x0‖₁, 0.001‖x0‖₁, and 0.0001‖x0‖₁, respectively. In this setting, in addition to the other solvers, both homotopy and nesta can be applied; the results are shown in Table 7.7. The first thing to notice for this problem is the rapid deterioration of the performance of homotopy for smaller values of σ. Unlike in previous experiments with exact sparse solutions, we here encounter a problem that requires a large number of homotopy steps. In all fairness, however, it should be noted that homotopy does give extremely accurate solutions. The other solvers behave mostly as before, with spgl1 giving a fast approximation, and spg-tight giving accurate solutions in very competitive time.

7.3 Joint sparsity

As mentioned in the introduction, we use fista to obtain for each test problem an accurate pair of (σ, λ) values, as well as an accurate solution for both formulations. In order for fista to work, we need to provide it with the proximal function for g(x) := Σ_i ‖x_{σ_i}‖₂:

    \mathrm{prox}_{\gamma}(g)(x) := \operatorname*{argmin}_{u} \; g(u) + \frac{1}{2\gamma}\|u - x\|_2^2.    (7.1)

We can conveniently express this in terms of a general signum function on vectors, defined as

    \mathrm{gsgn}(x) := \begin{cases} x/\|x\| & \|x\| > 0, \\ \{z \mid \|z\| \le 1\} & \text{otherwise.} \end{cases}

With this definition it can be shown that the proxy of g(x), with x_{σ_i} ∈ ℂ^{|σ_i|}, is given by

    [\mathrm{prox}_{\gamma}(x)]_{\sigma_i} = \mathrm{gsgn}(x_{\sigma_i}) \cdot \max\{\|x_{\sigma_i}\|_2 - \gamma,\, 0\}.

When the cardinality of each group is one, i.e., |σ_i| = 1, this expression reduces to the proxy function of g(x) = ‖x‖₁, given by the standard soft-thresholding operator.

The pre-solve procedure was applied to the five test problems given in Table 7.8.
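The group soft-thresholding formula given above for the proxy of g can be sketched in a few lines of Matlab. This is only an illustration under stated assumptions and not the routine used with fista in the experiments: the cell array `groups`, holding the index sets σ_i, and the small test vector are hypothetical.

```matlab
% Sketch of group soft-thresholding:
%   [prox_gamma(x)]_{sigma_i} = gsgn(x_{sigma_i}) * max(||x_{sigma_i}||_2 - gamma, 0).
% 'groups' is a hypothetical cell array with the index sets sigma_i.
x      = [3; 4; 0.1; -0.2; 2];
groups = {[1 2], [3 4], 5};
gamma  = 1;

u = zeros(size(x));
for i = 1:numel(groups)
    idx = groups{i};
    nrm = norm(x(idx), 2);
    if nrm > gamma
        % gsgn(x) = x / ||x||, and the group norm shrinks by gamma
        u(idx) = (x(idx) / nrm) * (nrm - gamma);
    end
    % otherwise the entire group is set to zero (u(idx) stays 0)
end
disp(u');
```

In this example the first group has norm 5 and is scaled by 4/5, the second group has norm below γ and vanishes, and the singleton group is soft-thresholded as in the ℓ₁ case, so that u = (2.4, 3.2, 0, 0, 1).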
For the first four problems we drew A from the Gaussian ensemble, and created an X0 with random support containing normally distributed random entries. The observed matrix B := AX0 + R was formed by adding a normalized noise term R, with ‖R‖_F = ν‖AX0‖_F, to the ideal observation AX0. Because the norm of the noise term is typically not known exactly, we used an overestimate and an underestimate, respectively, for two of the problems. The fifth problem is the first iteration of the radio telescope example shown in Figure 6.5(a,e).

Problem     Solver     Time (s)  Aprod   ‖x‖₁        |‖r‖₂ − σ|/σ   ‖x − x0‖∞
compress1a  nesta      2.5e+0    681     9.6343e+2   3.3e−13        1.3e+0
            spgl1      1.1e−1    20      9.5683e+2   9.2e−6         4.5e−1
            spg-tight  7.4e−1    341     9.5652e+2   1.5e−9         1.0e−4
            pqn        2.1e+0    166     9.5652e+2   4.2e−9         4.5e−4
            Homotopy   1.7e+1    1380    9.5652e+2   3.6e−13        1.2e−9
            gpsr       3.0e−1    1450    9.5653e+2   1.5e−6         8.4e−4
            fpc        2.0e−1    1539    9.5652e+2   5.6e−7         7.3e−4
            fista      1.4e+0    1865    9.5652e+2   1.8e−8         1.5e−5
compress1b  nesta      2.5e+0    657     1.0334e+3   4.8e−13        1.0e+0
            spgl1      1.1e−1    20      1.0271e+3   8.1e−5         4.5e−1
            spg-tight  9.4e−1    387     1.0265e+3   4.3e−9         5.0e−5
            pqn        2.4e+0    176     1.0265e+3   1.2e−8         4.4e−4
            Homotopy   2.2e+1    1544    1.0265e+3   5.9e−13        1.3e−9
            gpsr       3.8e−1    1632    1.0265e+3   6.2e−6         7.8e−4
            fpc        3.2e−1    1749    1.0265e+3   6.4e−6         7.1e−4
            fista      1.7e+0    2153    1.0265e+3   1.4e−8         1.4e−5
compress1c  nesta      2.7e+0    753     2.3953e+3   3.2e−11        2.4e+0
            spgl1      6.4e−1    879     2.3674e+3   6.6e−5         7.6e−1
            spg-tight  1.0e+1    3779    2.3662e+3   8.1e−7         1.6e−2
            pqn        2.4e+1    1463    2.3662e+3   1.1e−5         4.6e−2
            Homotopy   1.8e+2    6781    2.3662e+3   5.1e−11        2.0e−8
            gpsr       7.5e+0    8448    2.3661e+3   2.2e−4         1.2e−1
            fpc        1.2e+0    8765    2.3660e+3   1.5e−3         3.4e−1
            fista      3.7e+1    17459   2.3662e+3   3.4e−7         1.2e−4
compress1d  nesta      2.7e+0    729     2.5495e+3   2.3e−9         2.9e+0
            spgl1      2.5e+0    1229    2.5161e+3   1.8e−4         1.1e+0
            spg-tight  4.1e+1    11946   2.5147e+3   7.4e−5         1.9e−1
            pqn        6.0e+1    2293    2.5148e+3   1.1e−3         3.7e−1
            Homotopy   2.3e+2    15337   2.5147e+3   1.4e−10        4.5e−8
            gpsr       2.2e+1    20196   2.5150e+3   3.0e−3         8.3e−1
            fpc        1.4e+0    20545   2.5164e+3   2.7e−2         1.3e+0
            fista      8.5e+1    40559   2.5147e+3   3.1e−7         1.1e−3

Table 7.7: Solver performance for basis pursuit denoise on a set of problems with compressible x0 and different levels of misfit.

Problem    A          Observations  Sparsity  ν      σ
mmv1a      60 × 100   5             12        0.01   1.0‖R‖_F
mmv1d      200 × 400  5             20        0.01   1.2‖R‖_F
mmv1e      200 × 400  10            20        0.02   1.0‖R‖_F
mmv1f      300 × 800  3             50        0.01   0.9‖R‖_F
telescope  80 × 109   2             8         --     3

Table 7.8: Test problem settings for mmv experiments.

We compare our solver against fista, cvx/sdpt3, as well as sparsa, developed by Wright et al. [153]. The performance of spgl1, as seen from the results in Table 7.9, is excellent on all mmv1 test problems. In particular, even with the more relaxed subproblem stopping criteria, it is able to pinpoint the right value of τ, and obtain an accurate final solution. It does have some trouble with the telescope problem, however, which is likely due to the stronger coherence between the columns in A. cvx does very well on this problem, but does not overall scale very well with the number of measurement vectors (see for example mmv1f).
sparsa has a run time similar to spg-tight, but yields solutions that are not nearly as accurate. The most accurate results for these tests are obtained using fista. In terms of run time, fista is about three to four times slower than spg-tight, except on telescope, where the two are similar in both run time and accuracy.

7.4 Nonnegative basis pursuit

We apply the nonnegative basis pursuit denoise framework to a set of randomly restricted DCT operators with noise scaled to ν‖Ax0‖₂, and σ set close to ‖r‖₂. In most cases we underestimate σ, which makes the problem harder to solve for spgl1 and its variants. The parameters for the different problems in the test set, including the noisy mass spectrometry setting described in Section 6.3.1, are given in Table 7.10. For the experiments we again follow the pre-solve procedure explained in the introduction. The proximal function (7.1) required by fista, corresponding to nonnegative bpdn, is a one-sided soft-thresholding operator, given by prox_γ(x) = [x − γ]₊, where [v]₊ := max{v, 0} denotes projection onto the nonnegative orthant.

Unfortunately, there are not many solvers that are specific to the nonnegative basis pursuit problem. Although it would have been possible to modify gpsr to forego variable splitting and work directly with the nonnegative part only, no such attempt was made. As a result, this leaves us with cvx for smaller problems, and fista for the penalized formulation.

From the run time and number of matrix-vector products reported in Table 7.11, it is apparent that both spgl1 and fista require more effort to solve problems where the number of nonzero entries in x0 is large compared to the length of b. The hardest amongst these problems is nonnegn02, which has very few measurements per nonzero entry in x0. Comparing nonnegn04 to nonnegn06 shows, as expected, that an increased σ makes the problem easier to solve.
Likewise, nonnegn03 is solved much faster and more accurately than nonnegn02 due to the larger number of measurements. The behavior of the different solvers is similar to that in earlier experiments, so we will not discuss it further. Finally, we note that cvx/sdpt3 did very well on the mass spectrometry problem.

Problem    Solver     Time (s)  Aprod    ‖X‖₁,₂      |‖r‖₂ − σ|/σ   ‖x − x0‖∞
mmv1a      cvx/sdpt3  2.9e+0    n.a.     1.4653e+1   3.9e−9         1.4e−6
           spgl1      2.4e−1    345      1.4653e+1   3.3e−6         5.9e−5
           spg-tight  6.1e−1    1055     1.4653e+1   1.5e−5         5.2e−7
           pqn        1.3e+0    900      1.4653e+1   1.5e−5         4.8e−7
           sparsa     5.5e−1    580      1.4693e+1   4.6e−2         1.2e−2
           fista      1.8e+0    3221     1.4653e+1   6.9e−8         4.0e−8
mmv1d      cvx/sdpt3  8.8e+1    n.a.     4.8226e+1   1.6e−9         8.9e−8
           spgl1      3.1e−1    305      4.8226e+1   1.5e−6         7.6e−5
           spg-tight  1.0e+0    1185     4.8226e+1   2.3e−6         1.7e−7
           pqn        2.5e+0    995      4.8226e+1   2.3e−6         1.8e−7
           sparsa     3.6e+0    950      4.8241e+1   9.8e−3         3.9e−3
           fista      3.2e+0    3891     4.8226e+1   3.7e−8         7.1e−8
mmv1e      cvx/sdpt3  3.5e+2    n.a.     5.8488e+1   8.6e−10        9.2e−6
           spgl1      4.3e−1    530      5.8488e+1   2.4e−5         4.7e−4
           spg-tight  1.8e+0    2670     5.8488e+1   6.7e−5         3.5e−6
           pqn        4.6e+0    2270     5.8488e+1   6.7e−5         3.5e−6
           sparsa     5.1e+0    1580     5.8480e+1   3.8e−3         4.1e−3
           fista      6.0e+0    8372     5.8488e+1   6.8e−9         4.7e−8
mmv1f      cvx/sdpt3  1.7e+2    n.a.     8.1691e+1   8.6e−10        1.6e−6
           spgl1      6.0e−1    312      8.1691e+1   7.6e−5         7.9e−4
           spg-tight  2.4e+0    1392     8.1690e+1   7.2e−5         1.3e−6
           pqn        6.4e+0    1002     8.1690e+1   7.3e−5         1.3e−6
           sparsa     1.2e+1    990      8.1690e+1   7.2e−5         2.7e−4
           fista      8.9e+0    5515     8.1691e+1   4.9e−8         5.0e−8
telescope  cvx/sdpt3  4.1e+0    n.a.     1.6909e+1   3.9e−10        2.8e−5
           spgl1      7.1e−1    604      1.6917e+1   5.8e−5         2.4e−1
           spg-tight  1.6e+1    16418    1.6909e+1   3.5e−6         2.4e−6
           pqn        2.4e+2    106842   1.6909e+1   3.5e−6         8.1e−6
           sparsa     8.7e−1    340      1.7146e+1   1.7e−1         6.8e−1
           fista      1.7e+1    19740    1.6909e+1   6.9e−7         2.3e−5

Table 7.9: Solver performance on a set of noisy or approximately sparse mmv problems.

Problem    A            Sparsity  ν      σ/‖r‖₂  ‖x‖₁
nonnegn01  100 × 256    10        0.01   1.02    6.1484e+0
nonnegn02  500 × 8192   200       0.001  0.9     1.5039e+2
nonnegn03  1500 × 8192  200       0.001  0.9     1.7499e+2
nonnegn04  500 × 8192   100       0.001  0.9     8.6661e+1
nonnegn05  500 × 8192   10        0.001  0.9     8.6205e+0
nonnegn06  500 × 8192   100       0.05   0.9     8.2346e+1
massspec   82 × 438     12        0.012  1.0     9.9505e−1

Table 7.10: Test problem settings for nonnegative basis pursuit denoise experiments.

Problem    Solver     Time (s)  Aprod   ‖x‖₁        |‖r‖₂ − σ|/σ   ‖x − x0‖∞
nonnegn01  cvx/sdpt3  1.3e+0    n.a.    6.1484e+0   2.1e−10        5.6e−8
           spgl1      3.4e−1    71      6.1476e+0   5.6e−3         1.2e−4
           spg-tight  6.4e−1    153     6.1483e+0   4.2e−4         7.1e−6
           pqn        7.8e−1    120     6.1484e+0   1.3e−4         4.5e−6
           fista      1.4e+0    335     6.1484e+0   9.1e−9         3.8e−9
nonnegn02  spgl1      8.9e+0    1052    1.5044e+2   2.9e−2         1.3e−1
           spg-tight  3.1e+1    4158    1.5041e+2   9.9e−3         1.0e−1
           pqn        1.1e+2    1350    1.5044e+2   7.8e−4         1.1e−1
           fista      1.5e+2    20011   1.5039e+2   9.0e−5         1.4e−3
nonnegn03  spgl1      1.7e+0    190     1.7499e+2   1.0e−2         2.6e−5
           spg-tight  1.9e+0    246     1.7499e+2   1.2e−3         2.1e−5
           pqn        5.5e+0    190     1.7499e+2   1.1e−2         2.9e−4
           fista      1.7e+1    2283    1.7499e+2   2.9e−9         3.4e−9
nonnegn04  spgl1      1.1e+1    1273    8.6662e+1   4.1e−2         9.7e−3
           spg-tight  1.7e+1    2287    8.6665e+1   1.7e−3         7.9e−3
           pqn        7.0e+1    900     8.6674e+1   5.2e−3         1.2e−2
           fista      1.5e+2    20011   8.6661e+1   5.2e−5         1.2e−5
nonnegn05  spgl1      4.6e−1    51      8.6202e+0   1.7e−2         1.2e−4
           spg-tight  8.0e−1    101     8.6163e+0   2.2e−1         5.1e−4
           pqn        6.7e+0    470     8.6205e+0   1.3e−3         4.0e−5
           fista      1.3e+1    1789    8.6205e+0   1.0e−7         1.3e−9
nonnegn06  spgl1      2.1e+0    241     8.2343e+1   7.9e−4         8.4e−3
           spg-tight  4.7e+0    624     8.2346e+1   7.0e−5         6.0e−4
           pqn        2.9e+1    676     8.2345e+1   1.9e−4         2.5e−3
           fista      8.3e+1    10963   8.2346e+1   1.8e−7         1.8e−6
massspec   cvx/sdpt3  1.1e+0    n.a.    9.9505e−1   2.2e−9         1.1e−8
           spgl1      3.2e+0    2172    9.7853e−1   4.4e+0         3.8e−2
           spg-tight  3.8e+1    30070   9.9504e−1   2.4e−3         1.2e−5
           pqn        2.3e+2    30975   9.9581e−1   2.4e−4         2.4e−3
           fista      2.5e+1    20023   9.9506e−1   2.9e−3         4.2e−5

Table 7.11: Solver performance on the set of nonnegative basis pursuit denoise test problems.

7.5 Nuclear-norm minimization

A number of solvers for matrix completion based on nuclear norm minimization have recently been introduced: Ma et al. [106] propose fpca, which combines approximate SVD with fixed-point iterations to solve the penalized formulation of (6.15), while Cai et al. [23] derive a singular value soft-thresholding algorithm, called svt, for solving a slight relaxation to the exact matrix completion problem (1.13). Two more solvers were proposed by Toh and Yun [143], and Liu et al. [103], but their implementation is not yet publicly available.

Problem       M          Rank  Observed entries  (γ)
nucnorm(n)01  10 × 10    2     80%               0.05
nucnorm(n)02  50 × 50    4     60%               0.01
nucnorm(n)03  100 × 100  7     50%               0.02
nucnorm(n)04  100 × 100  7     30%               0.03
nucnorm(n)05  200 × 200  12    20%               0.01
nucnorm(n)06  200 × 200  2     20%               0.01

Table 7.12: Test problem settings for matrix completion experiments.

We start our experiments with the exact matrix completion problem (1.13). Both fpca and svt approximate the solution to this problem by choosing a sufficiently small penalty parameter in their respective formulations. The same, in a way, holds true for spgl1, which finds a τ that is sufficiently close to the ℓ₁ norm of the true solution. From the results in Table 7.13 it can be seen that svt reaches its default maximum number of 500 iterations on problems nucnorm02, nucnorm04, and nucnorm05.
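For reference, the singular value soft-thresholding step on which svt is based, and which is also the proximal operator of the nuclear norm, can be sketched as follows. This is a minimal illustration with a hypothetical 3 × 2 example and threshold; it is not the svt or fista code used in the experiments, and it relies on a full economy-size SVD rather than a partial one.

```matlab
% Sketch: singular value soft-thresholding of a matrix X with threshold gamma.
% A full (economy-size) SVD is used here purely for simplicity.
X     = [3 0; 0 1; 0 0];
gamma = 2;
[U, S, V] = svd(X, 'econ');
s   = max(diag(S) - gamma, 0);   % soft-threshold the singular values
Xt  = U * diag(s) * V';          % thresholded matrix, typically of lower rank
nuc = sum(s);                    % nuclear norm of Xt
fprintf('rank = %d, nuclear norm = %g\n', nnz(s), nuc);
```

Here the singular values 3 and 1 are mapped to 1 and 0, so the result has rank one. Each such step requires an SVD, which is why the number of decompositions is reported alongside the run time.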
Despite the fact that svt takes fewer singular value decompositions, it is still slower and less accurate than fpca. The reason for this difference seems to lie predominantly in the way the SVDs are computed. We should note here that svt is designed for large-scale problems, and that the results reported here may not show its true potential. fpca does very well on the first two problems, but less so on nucnorm05. This is most likely due to the fixed value of the penalty parameter, which also leads to a higher misfit and objective. cvx/sdpt3 does not scale very well with the problem size and, as noted in [103], can hardly be used for matrices with more than 100 rows and columns. Having said this, we acknowledge that, as a consequence of the way projection is currently implemented, spg-tight and pqn do not scale very well either. This can be improved by replacing the full singular value decomposition by a partial one, preferably of minimal size. Interestingly, spg-tight is more accurate, and nearly twice as fast as spgl1 on nucnorm02 and nucnorm05. For the noisy problem instances, spg-tight takes about twice as long as spgl1, but does obtain substantially more accurate solutions. fista also performs well, but does not yet scale well because of the way we implemented the singular value thresholding proxy function. Finally, pqn is not very well suited to the nuclear norm minimization problem, as a consequence of its strong dependence on projection, which is exactly the bottleneck.

Problem     Solver     Time (s)  #SVD    ‖X‖*        |‖r‖₂ − σ|/max(1,σ)   |X − X*|
nucnorm02   cvx/sdpt3  1.0e+1    n.a.    2.0179e+2   0.0e+0                --
            spgl1      9.1e+0    540     2.0180e+2   9.7e−5                3.5e−1
            spg-tight  4.4e+0    524     2.0179e+2   5.9e−4                3.2e−1
            pqn        1.5e+1    2554    2.0180e+2   9.5e−4                3.3e−1
            fpca       5.9e+0    4517    2.0179e+2   5.8e−5                2.5e−4
            svt        3.2e+0    500     2.0184e+2   4.6e−2                9.4e−1
nucnorm04   cvx/sdpt3  6.6e+1    n.a.    7.1411e+2   0.0e+0                --
            spgl1      7.8e+1    1074    7.1470e+2   8.3e−5                1.8e+1
            spg-tight  9.7e+1    2778    7.1415e+2   8.4e−4                2.8e+0
            pqn        4.8e+2    20033   7.1435e+2   2.5e−13               7.9e+0
            fpca       1.1e+1    4546    7.1433e+2   2.4e−4                8.6e+0
            svt        1.6e+2    500     8.7067e+2   4.4e+1                1.1e+2
nucnorm05   cvx/sdpt3  1.0e+3    n.a.    2.2578e+3   0.0e+0                --
            spgl1      5.2e+2    1259    2.2583e+3   1.9e−5                1.9e+1
            spg-tight  3.9e+2    2106    2.2579e+3   4.1e−3                5.5e+0
            pqn        1.8e+3    13132   2.2590e+3   8.0e−3                8.8e+0
            fpca       3.3e+1    4666    2.3208e+3   7.8e−4                2.8e+2
            svt        6.4e+2    500     3.6454e+3   2.1e+2                5.0e+2
nucnormn02  cvx/sdpt3  1.3e+1    n.a.    2.0061e+2   3.1e−9                4.3e−5
            spgl1      2.1e+0    110     2.0063e+2   3.8e−5                3.0e−1
            spg-tight  4.6e+0    519     2.0061e+2   5.2e−7                5.5e−3
            pqn        1.2e+1    2058    2.0061e+2   1.3e−6                1.1e−2
            fpca       3.4e+0    2500    2.0045e+2   7.7e−2                5.0e−1
            fista      6.3e+0    538     2.0061e+2   3.7e−1                8.9e−6
nucnormn04  cvx/sdpt3  8.9e+1    n.a.    6.9528e+2   2.1e−11               4.2e−4
            spgl1      1.2e+1    138     6.9531e+2   9.2e−5                3.8e+0
            spg-tight  2.6e+1    679     6.9528e+2   3.7e−5                6.1e−2
            pqn        1.5e+2    5490    6.9528e+2   6.5e−5                3.2e−1
            fpca       5.2e+0    2000    6.9383e+2   3.2e−1                1.7e+1
            fista      7.4e+1    1319    6.9528e+2   8.5e−1                1.8e−4
nucnormn06  cvx/sdpt3  9.8e+2    n.a.    3.9998e+2   1.9e−8                1.3e−4
            spgl1      2.7e+1    51      3.9999e+2   6.9e−5                1.3e+0
            spg-tight  3.0e+1    134     3.9998e+2   3.1e−7                4.7e−3
            pqn        1.4e+2    833     3.9998e+2   4.1e−7                6.1e−2
            fpca       1.2e+1    2000    3.9974e+2   5.0e−2                2.9e+0
            fista      7.6e+1    235     3.9998e+2   7.3e−1                1.9e−3

Table 7.13: Solver performance on a set of matrix completion problems.

Chapter 8

Sparco and Spot

In this chapter we describe two Matlab software packages that were originally intended to support the development of spgl1 but have evolved to become significant codes in their own right.
The first of the two packages, called sparco, provides a suite of problems for testing and benchmarking algorithms for sparse recovery. The second package, called spot, is a Matlab toolbox for the construction, application, and manipulation of linear operators [11].

8.1 Sparco

Collections of test problem sets, such as those provided by Matrix Market [18], COPS [56], and CUTEr [81], are routinely used in the numerical linear algebra and optimization communities as a means of testing and benchmarking algorithms. These collections are invaluable because they provide a common and easily identifiable reference point, which is especially important when objectively comparing the performance of different algorithms.

When we developed spgl1 [10], there did not exist any coherent set of test problems for sparse recovery. At that time, all published codes were evaluated on either random or paper-specific test problems. Moreover, with the notable exceptions of SparseLab [64], ℓ₁-magic [31], and gpsr [78], no code was generally available for reproducing the given problems. This situation motivated us to compile a standard set of test problems and make them available through a consistent interface, which was realized with sparco [12]. We next discuss the design criteria of sparco, and then move on to discuss the particulars of the test problems.

8.1.1 Design objectives

The design of sparco was guided by a number of criteria that help promote the wide use of the ensuing test suite. We designed sparco to be

Open, freely available, and extensible. A test suite should be considered the collective property of the research community as a whole, rather than of a small group of individuals. Making the software open, freely available, and extensible creates a situation in which research groups can share or contribute to a common set of test problems.
This is conducive to a collaborative and transparent research process, as it establishes a widely accepted benchmark free of potential bias.

General and coherent. When providing a set of test problems it is important that the problems are sufficiently general; the choice of problem formulation should not severely restrict the class of applicable algorithms. On the other hand, the test set should provide a coherent class of problems; it would not make sense to mix MMV problems with matrix completion problems.

Comprehensive. The variety of problems in a test suite is extremely important for a number of reasons. First, given an established set of test problems, it is natural for researchers to try to make their code perform as well as possible on these problems. This may inadvertently lead to codes that are finely tuned for these problems, but show a moderate or even poor performance on other important problems. Second, the inclusion of problems with different characteristic properties may expose situations not anticipated by the solvers, thus helping improve the robustness of codes. Finally, partially related to the first point, it is important to provide problems with different gradations of size and difficulty. In particular, the problem set should include challenging test problems that motivate the development of more advanced approaches.

8.1.2 Test problems

The current version of sparco focuses on sparse recovery and compressed sensing problems that can be expressed in the form b = Ax0 + r, where A := MB is the combination of a measurement matrix M and a sparsity basis B, and r is an additive noise vector. For sparse recovery problems we typically have no additive noise (r = 0), and use the identity matrix for the sparsity basis (B = I). For compressed sensing, as well as inpainting, b is typically obtained by directly applying the measurement matrix to the original signal s, giving b = Ms + r.
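The relation between M, B, A, s, x0, b, and r in the compressed sensing setting can be made concrete with a small synthetic instance. The sketch below is illustrative only: the sizes, the Gaussian measurement matrix, the noise level, and the explicitly formed DCT sparsity basis are assumptions made here, and do not correspond to any particular sparco problem.

```matlab
% Sketch of the model b = A*x0 + r with A = M*B (illustrative sizes only).
n = 64;  m = 24;  k = 4;

% Sparsity basis B: transpose (inverse) of the orthonormal DCT-II matrix D.
rowIdx = (0:n-1)';  colIdx = 0:n-1;
D = sqrt(2/n) * cos(pi * rowIdx * (2*colIdx + 1) / (2*n));
D(1,:) = D(1,:) / sqrt(2);
B = D';

M = randn(m, n) / sqrt(m);     % Gaussian measurement matrix
A = M * B;                     % combined operator

p  = randperm(n);
x0 = zeros(n, 1);
x0(p(1:k)) = randn(k, 1);      % k-sparse coefficient vector
s  = B * x0;                   % signal with a sparse representation in B
r  = 1e-3 * randn(m, 1);       % additive noise vector
b  = M * s + r;                % observations; identical to A*x0 + r

fprintf('||b - (A*x0 + r)|| = %.2e\n', norm(b - (A*x0 + r)));
```

A solver only sees A (or M and B) and b; whether it should be judged on recovering x0 or on reconstructing s = Bx0 depends on the application.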
In these applications we typically do not know the value of x0 and have to make the assumption that we can write s as Bx0 with a sparse or compressible x0 . Although this may appear to be a disadvantage, it does reflect practical problems. Moreover, unlike with sparse recovery, these two applications are not so much concerned with finding a sparse representation x0 , but rather with recovering the original signal s from partial or compressed data. This brings us to the aspect of qualitative versus quantitative test problems. In qualitative experiments we are especially interested in algorithms that solve the intended problem, i.e., recover x0 or reconstruct s. The focus here lies on the quality of the solution. In quantitative tests we are merely interested in solving a problem (formulation) as fast and accurate as possible, and are not so much interested in the solution itself. As an example, consider sparse recovery from b = Ax0 . In qualitative experiments the main focus would be on the recovery rate of x0 with different algorithms. For quantitative experiments we fix the formulation, such as basis pursuit, and compare algorithms that solve 8.1. Sparco Seismic imaging Blind source separation MRI imaging Basis pursuit Compressed sensing Image processing 119 Hennenfent and Herrmann [94, 93] Wang et al. [151] Lustig et al. [104], Cand`es et al. [32] Chen et al. [39], Donoho and Johnstone [63] Cand`es and Romberg [31], Takhar et al. [137] Takhar et al. [137], Figueiredo et al. [78] Table 8.1: List of sources for sparco test problems. 
Problem angiogram cosspike dcthdr finger gcosspike jitter p3poly phantom1 phantom2 seismic sgnspike soccer1 spiketrn srcsep1 srcsep2 yinyang zsgnspike ID 502 3 12 703 5 902 6 501 503 901 7 601 903 401 402 603 8 m 3321 1024 2000 11013 300 200 600 629 21679 41472 600 3200 1024 29166 29166 1024 600 n 10000 2048 8192 125385 2048 1000 2048 4096 65536 480617 2560 4096 1024 57344 86016 4096 2560 b 2 1.6e+1 1.0e+2 2.3e+3 5.5e+1 8.1e+1 4.7e−1 5.4e+3 5.1e+0 6.0e+1 1.1e+2 2.2e+0 5.5e+4 5.7e+1 2.2e+1 2.3e+1 2.5e+1 3.1e+0 operator 2D FFT DCT restricted DCT 2D curvelet Gaussian ensemble, DCT DCT Gaussian ensemble, wavelet 2D FFT, wavelet 2D FFT 2D curvelet Gaussian ensemble binary ensemble, wavelet 1D convolution windowed DCT windowed DCT wavelet Gaussian ensemble Table 8.2: Selection of sparco test problems. that problem; in this case we only criterion is that x∗ solves (BP), regardless of whether it coincides with x0 or not. The limited availability of well-established test problems meant that we had to construct new problems. In order to do so, we performed a comprehensive literature survey and implemented some 25 problems arising in a variety of contexts (see Table 8.1). While the original intention behind sparco was to provide test problems for quantitative benchmarking, most of the test scenarios found in the literature were motivated by practical applications, and are therefore mostly qualitative in nature. Nevertheless, they are still suitable for quantitative testing, and in fact provide some of the more challenging problems when compared, for example, to randomly generated test problems. A selection of the problems, along with their name, identifier, and other important properties is listed in Table 8.2. A complete list of sparco problems and their implementation is available online at http://www.cs.ubc.ca/labs/scl/sparco/index.php. 8.1. 
8.1.3 Implementation

sparco is implemented in Matlab and is set up in such a way that it facilitates the addition of test problems contributed by the community. This is achieved by implementing each test problem as a stand-alone Matlab script and storing these scripts in a dedicated directory where sparco can automatically find them. A new problem is thus added to the suite simply by copying the script.

Problems can be instantiated through the user interface. This interface provides consistent access to the problems by means of their name or numerical identifier. As an example, consider the seismic interpolation problem discussed in Section 1.4. This problem is included in sparco as problem 901 (seismic). To instantiate it, a call is made to the generateProblem function with the desired problem identifier:

    P = generateProblem(901); % or
    P = generateProblem('seismic');

The result of this call is a structure that contains relevant problem information, which includes the linear operators M, B, and A, as well as vectors b, r, s, and, when available, x0. A complete list of fields is given in Table 8.3. For other details concerning the implementation of sparco please see [12].

    Field         Description
    M             Measurement matrix
    B             Sparsity basis
    A             Operator (A := MB)
    b             Observed vector (b := Ms, or b := Ax0)
    r             Additive noise vector
    x0            Sparse coefficient vector (optional)
    signal        Measured signal s (s := Bx0, when x0 is available)
    reconstruct   Function handle to reconstruct signal s from coefficients x
    info          Problem information structure (name, title, sparcoID, etc.)

Table 8.3: Fields in the data structure representing sparco problems.

Problem scripts. The information for each test problem is generated by a corresponding script located in the problem directory. Each script implements a predefined interface to enable incorporation into the sparco framework. Most of the code in these problem scripts is concerned with setting up the problem structure.
This involves loading or generating the sparse coefficient vector or signal data, as well as constructing the measurement matrix M and the sparsity basis B. Both M and B are represented implicitly using the linear operators provided by spot. We will discuss this software package in Section 8.2. For self-documentation, each problem script can include code to generate figures illustrating the problem, as well as a problem description and citations regarding its origin.

8.2 Spot

Spot [11], as mentioned in the introduction, is a Matlab toolbox for the construction, application, and manipulation of linear operators. One of the main goals of the package is to present operators in such a way that they are as easy to work with as explicit matrices, with the difference that they are represented implicitly. The most elementary operation involving operators is the equivalent of matrix-vector multiplication. As an example, consider the application of a one-dimensional discrete cosine transform to a vector x of length 16, using one of three approaches:

                       D = opDCT(16);     D = dct(eye(16));
    y = dct(x);        y = D * x;         y = D * x;
    z = idct(y);       z = D' * y;        z = D' * y;

In the left-most column we directly use the dct and idct functions. This approach takes advantage of fast routines available for evaluating the discrete cosine transform and its inverse, but lacks flexibility. At the other extreme, in the right-most column, we explicitly instantiate the corresponding matrix representation by applying dct to the identity matrix. This simplifies subsequent application and manipulation of the transformation, but at the expense of scalability and computational efficiency. spot, in the middle column, provides a balance between these two approaches, combining the scalability and efficiency of fast multiplication routines with the ease of representation of a matrix. Note that, just like the matrix version, we need to specify the size of the operator when instantiating it.
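The trade-off between the implicit and explicit representations can be illustrated without spot. The following is only a rough sketch in base Matlab: it swaps the DCT for the unitary discrete Fourier transform so that no toolbox is needed, and the handle names applyF and applyFt are hypothetical stand-ins for the two multiplication modes that an operator object would wrap; they are not part of spot.

```matlab
% Sketch only: base Matlab, no spot required. The names applyF and
% applyFt are illustrative and are not part of spot.
n = 16;
x = randn(n,1);

% Implicit representation: store only how to apply the operator and
% its adjoint. F is the unitary DFT, so F' is the inverse transform.
applyF  = @(v) fft(v)  / sqrt(n);    % y = F  * v
applyFt = @(v) ifft(v) * sqrt(n);    % z = F' * v

% Explicit representation: apply the transform to the identity matrix.
F = fft(eye(n)) / sqrt(n);           % dense n-by-n matrix

y1 = applyF(x);   z1 = applyFt(y1);  % fast transforms, O(n log n)
y2 = F * x;       z2 = F' * y2;      % dense products, O(n^2)

errForward = norm(y1 - y2);          % the two approaches agree
errRound   = norm(z1 - x);           % F' * (F * x) recovers x
```

Both errors should be at the level of rounding error. The explicit matrix needs n^2 storage, whereas the pair of handles needs none beyond the input and output vectors, which is the scalability argument made above.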
spot arose from the need to construct overcomplete dictionaries consisting of the union of bases. This arises, for example, in the DCT-Dirac dictionary A = [Φ1, Φ2] discussed in Section 1.2. While it is not hard to write a function that multiplies a vector by A or A^T, one easily grows weary of writing code for yet another combination of bases. With the help of spot we can simply write what we want to do:

    I = opEye(256);
    D = opDCT(256);
    A = [I, D];

While the above code, as intended, makes A look like an explicit matrix, it is stored fully implicitly, and multiplication still takes advantage of the dct and idct routines.

The precursor to spot, implemented as part of sparco, was written based on function handles and anonymous functions. For easier maintenance and increased flexibility, spot was written based on the object-oriented and operator-overloading features provided in the newer versions of Matlab. In the following sections we describe how the basic operators, such as multiplication and addition, are overloaded in spot to combine and manipulate more elementary operators. For a list of some of the most important elementary operators, see Table 8.4.

    Operator        Description
    opEye           Identity matrix
    opDiag          Diagonal matrix
    opOnes          Matrix with all ones
    opZero          Matrix with all zeros
    opRestriction   Restriction operator
    opBlockDiag     Block-diagonal operator
    opMatrix        Wrapper for explicit matrices
    opDCT           One-dimensional fast discrete cosine transform
    opDCT2          Two-dimensional fast discrete cosine transform
    opDFT           One-dimensional fast Fourier transform
    opDFT2          Two-dimensional fast Fourier transform
    opHadamard      Hadamard basis
    opHeaviside     Heaviside matrix
    opToeplitz      Toeplitz or circular matrices
    opWavelet       Discrete wavelet transformation based on the Rice wavelet toolbox [3]
    opCurvelet      Discrete curvelet transform based on CurveLab 2.0 [27]
    opGaussian      Explicit or implicit Gaussian ensemble; i.e., matrices with i.i.d. normally distributed entries
    opConvolve      One- or two-dimensional convolution
Table 8.4: List of several elementary operators provided by spot.

As a final remark, we note that spot can be applied to images by working with the vectorized form; spot operators for operations on two-dimensional data internally reshape vectors to matrices, when needed.

8.2.1 Multiplication

Multiplication by operators is the most elementary operation in spot. In its simplest form we have operator-vector products such as the following. (Here, and throughout this section, we use A–D to denote arbitrary spot operators, lowercase variables to denote vectors, and M to denote an explicit matrix.)

    y = A * x;
    z = B * y;
    n = m' * C;    % Evaluated as n = (C^H m)^H

This code evaluates the application of A to x, and then of B to y. In both cases the result will be an explicit vector of real or complex numbers. An equivalent approach is to first create a new operator C representing the product B · A, and then apply this to x to directly get vector z:

    C = B * A;     % Construct compound operator
    z = C * x;     % Evaluate z = BAx

Although spot operators are expected to support only multiplication with vectors, it is possible to write operator-matrix multiplications. Internally, this is implemented as a series of operator-vector multiplications.

    N = C * M;
    N = M * C;     % Evaluated as N = (C^H M^H)^H

A special case of multiplication is multiplication by scalars. Generally, this will give a new operator scaled by that quantity:

    C = 3 * A;     % Construct compound operator C = 3A
    C = A * 3;

One notable exception is when the corresponding dimension of the operator is one. In that case we have a valid matrix-vector product, which results in a numeric result rather than a new operator. In other words, matrix-vector products take precedence over scalar multiplication.

8.2.2 Transposition and conjugation

As mentioned in the introduction, each spot operator implements multiplication by itself and by its conjugate transpose. Using Matlab's transpose operator ' returns a new operator in which these two modes are switched.
    B = A';
    x = A' * y;
    x = B * y;     % Same result as previous line

When using transposes within calculations these new operators are discarded as soon as the multiplication has been done. The transpose of a complex operator, rather than its conjugate transpose, can be formed using the .' operator. This is implemented as the elementwise conjugate of the conjugate transpose:

    A = C.';       % Transpose of complex operator, C^T = conj(C^H)

8.2.3 Addition and subtraction

The next elementary operation is the addition and subtraction of operators. In its simplest form we add two or more operators:

    x = (B + C + D) * y;
    A = B + C + D;
    x = A * y;     % Equivalent to first statement

When spot encounters the sum of an operator with a matrix or some class object that implements the multiplication and size operators, it will first wrap that entity in a spot operator. This allows us to write

    C = A + M;     % Addition of operator and matrix
    C = M + A;     % Addition of matrix and operator

Addition of scalars to an operator is interpreted as an elementwise addition, just as it is for matrices. In order to make this work we first create a new operator of appropriate size, consisting of only ones, and premultiply that with the scalar. The following two statements are therefore equivalent:

    A = B + 3;
    A = B + 3*opOnes(size(B));

Subtraction is implemented by scalar multiplication combined with addition.

    A = B - C;
    A = B + (-C);

8.2.4 Dictionaries and arrays

The original motivation for developing spot was the concatenation of operators to form a dictionary. This can now be achieved simply by writing

    A = [B,C,M];   % or A = [B C M];

Note that explicit matrices and classes can be mixed with spot operators, provided they are all compatible in size. As with addition, these are automatically converted to spot operators.
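To make concrete what a horizontally concatenated dictionary has to compute, the following base-Matlab sketch builds A = [B, C] from two implicitly defined bases using anonymous functions. It is an illustration under stated assumptions rather than a description of how spot implements concatenation: the unitary DFT stands in for the DCT so that no toolbox is required, and all handle names are hypothetical.

```matlab
% Sketch only: base Matlab, no spot required. All names are illustrative.
n = 256;

% Two bases given implicitly: the identity (Dirac) and the unitary DFT.
B  = @(v) v;                     Bt = @(v) v;
C  = @(v) fft(v)  / sqrt(n);     Ct = @(v) ifft(v) * sqrt(n);

% The dictionary A = [B, C] is n-by-2n.
A  = @(v) B(v(1:n)) + C(v(n+1:2*n));   % A  * v, with v of length 2n
At = @(w) [Bt(w); Ct(w)];              % A' * w, with w of length n

% Adjoint (dot-product) test: <A*v, w> should equal <v, A'*w>.
v = randn(2*n,1) + 1i*randn(2*n,1);
w = randn(n,1)   + 1i*randn(n,1);
lhs = w' * A(v);
rhs = At(w)' * v;
relErr = abs(lhs - rhs) / max(1, abs(lhs));
```

The forward product splits the input into one block per basis and sums the results, while the adjoint applies each basis adjoint to the same vector and stacks the outputs. Writing this pair by hand for every new combination of bases is the repetitive work that the bracket notation above removes.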
Vertical concatenation is done by typing

    A = [B;C;M];   % or
    A = [B
         C
         M];

With these two operations specified, Matlab automatically converts arrays of operators into a vertical concatenation of dictionaries:

    A = [B C; C' M];   % or
    A = [B C
         C' M];        % both represent A = [[B,C];[C',M]];

8.2.5 Block diagonal operators

Analogous to matrices of operators, it is possible to construct block diagonal operators with operator blocks. This is done using the blkdiag command, which takes a list of operators and matrices to create the desired operator:

    D = blkdiag(A,B,C,M);

There is no restriction on the dimension of each operator, and the resulting operator need not be square. By calling the underlying operator opBlockDiag directly, it is possible to specify horizontal or vertical overlap; see Figure 8.1.

[Figure 8.1: Overlapping block-diagonal operators. Panels show blocks A, B, C, M with no overlap, vertical overlap, and horizontal overlap.]

8.2.6 Kronecker products

The kron operator allows the creation of Kronecker products between operators. Unlike Matlab's built-in kron function, the kron product in spot can take an arbitrary number of arguments. Hence to construct D = A ⊗ B ⊗ C we can type

    D = kron(A,B,C);

Needless to say, products with Kronecker operators can quickly increase in computational complexity. Given a set of operators of size mi × ni, i = 1, ..., k, let m = ∏ mi and n = ∏ ni. Then the application of the Kronecker product will require n/ni multiplications with each operator i, and likewise m/mi products in transpose mode.

Kronecker products can be useful when operating on high-dimensional data represented in vectorized form. For example, let x represent a data volume of dimensions l × m × n.
To create an operator that applies a one-dimensional Fourier transformation to vectors along the second dimension, we can simply write

    op = kron(opEye(n),opDFT(m),opEye(l));

8.2.7 Subset assignment and reference

For certain applications, such as the restricted Fourier observations in the MRI example in Section 1.3.2, we may be interested in applying only a part of an operator. This can be done by creating new operators that are restrictions of existing operators. Restriction can be done in terms of rows, columns, a combination of rows and columns, and in terms of individual elements of the underlying matrix. As with matrices, this is done by indexing:

    A = B(3:5,:);      % Extract rows 3-5
    A = C(:,4:6);      % Extract columns 4-6
    A = D(3:5,4:6);    % Extract rows 3-5 of columns 4-6

The single colon ':' indicates that all elements in the given dimension are selected, and is equivalent to writing the range 1:end.

    A = B(3:5,1:end);  % Extract rows 3-5

There is no restriction on the ordering of the selected columns and rows, and it is possible to repeat entries.

    A = B([1,2,2,5,3],end:-1:1);  % Rows 1,2,2,5,3 with columns in reverse order

Because row and column indexing is implemented by pre- and postmultiplication by selection matrices, the underlying operator is only applied once for each use of the restricted operator.

As an illustration of the use of subset selection, we create an operator that consists of randomly selected rows of the discrete Fourier transform. The first step is to construct a Fourier transform of the desired dimension using the opDFT command. After that we choose which rows we want to keep and use subset selection to construct the restricted operator:

    F = opDFT(128);
    R = F(rand(128,1)<=0.8,:);

This generates a restricted Fourier operator with approximately 80% of all rows selected.

Assignment. Existing operators can be modified by assigning new values to a subset of the entries of the underlying matrix representation.
For example, we may want to zero out part of some operator A:

    A(2,:) = 0;        % Zero out row two
    A(:,[3,5]) = 0;    % Zero out columns three and five

These operations are not restricted to zero values; other constants can be used as well. In fact, with the same ease we can assign matrices and other operators to rectangular subsets of existing operators.

A special situation arises when assigning the empty set [] to a subset of rows or columns. Doing so causes the given rows or columns to be deleted from the operator. This is implemented by pre- or postmultiplying by a restriction matrix, respectively. This means that multiplication by the resulting operator still requires application of the full original operator, whether or not the relevant rows or columns are needed.

8.2.8 Elementwise operations

There are a number of operations that, just like multiplication by scalars, affect all individual entries of the matrix underlying the operator. For complex operators spot provides the commands real, imag, and conj. The first command discards the imaginary parts of the operator and results in a real operator, while the second command keeps only the imaginary part of the operator (this also results in a real-valued operator). The conj command replaces each element by its complex conjugate. When applied to a real operator, real and conj do nothing, while imag returns a new zero operator (all underlying operators are discarded).

    F = opMatrix(2 + 3i);
    a = real(F);   % a = opMatrix(2)
    b = imag(F);   % b = opMatrix(3)
    c = conj(F);   % c = opMatrix(2 - 3i)

Application of real, imag, or conj leads to the creation of a new operator. Provided the new operator is not the zero operator, multiplication is implemented by distinguishing between real, imaginary, and complex vectors. In the first two cases a single application of the underlying operator suffices.
For complex vectors we need to apply the underlying operator to both the real and imaginary parts of the vector and combine the results according to the desired operation, thus requiring two operations per multiplication.

8.2.9 Solving systems of linear equations

The backslash operator \ can be used to solve linear least-squares problems involving spot operators.

    x = A \ b;

This problem is solved by calling the lsqr code by Paige and Saunders [125]. The same applies to the pseudo-inverse, which solves a least-squares problem whenever it is applied to a vector:

    P = pinv(A);
    x = P * b;     % Equivalent to x = A \ b;

Matlab provides a number of routines for solving systems of linear equations, such as pcg and gmres, that take either a matrix or a function handle. To make them compatible with operators, spot provides a set of thin function wrappers.

8.2.10 Application to nonlinear operators

In some situations it may be desirable to create nonlinear operators. For example, when the output of a certain series of operations on complex input is known to be real, we may want to have an operator that discards the imaginary part of its input. While spot permits the construction of such operators, utmost care has to be taken when using them in combination with other operations such as restriction. Because there are no sanity checks in the implementation of the meta operators, application to nonlinear operators is not guaranteed to be meaningful; this is something the user should decide.

Chapter 9
Conclusions

Sparse recovery is concerned with the reconstruction of vectors or matrices, which are sparse in appropriate sparsity measures, from compressed or incomplete, and possibly noisy, representations. In particular, it is concerned with various reconstruction algorithms, and the conditions under which these algorithms recover the desired quantity either exactly or approximately. A large class of reconstruction methods is based on the solution of convex optimization problems.
Besides the theoretical properties associated with each formulation, there is the practical aspect of solving the underlying optimization problem. In this thesis we present an algorithm, or framework, which can solve a variety of constrained convex formulations that frequently arise in sparse recovery. In addition, we study the theoretical properties of two methods for joint-sparse recovery. The introduction of this thesis describes the developments in signal processing that led to the emergence of sparse recovery, and gives an overview of several applications, as well as current topics in sparse recovery.

Theoretical results. In the first half of this thesis we obtain theoretical results for recovery from a matrix B containing multiple measurement vectors (MMV) whose underlying coefficients X0 have identical support. A popular convex formulation for solving the MMV problem is minimization of the sum of row-norms of X subject to AX = B. We show that for uniform recovery of signals on a fixed support, sum-of-norm minimization cannot outperform ℓ1,1, regardless of the norm used. This is despite the fact that ℓ1,1 reduces to a series of basis pursuit problems, one for each column separately, and does not take advantage of the prior information that the columns in X0 have the same support (or have supports that are subsets of the joint support). However, empirical results show that ℓ1,2 easily outperforms ℓ1,1 on average, i.e., when applied to randomly generated problems. We give concrete examples where ℓ1,2 succeeds and ℓ1,1 fails. This result also serves to highlight that the magnitude of the entries of X0 can affect recovery. This is unlike recovery using ℓ1,1, in which only the sign and support matter. We observe that, beyond the uniform-recovery regime, the recovery rate of ℓ1,1 degrades rapidly as the number of observations increases. This phenomenon is explained by the subsequent analysis of ℓ1,1 recovery, which also explains the poor recovery rates reported by others.
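For reference, the sum-of-norms formulation mentioned above, and the reason the ℓ1,1 case separates by column, can be written out as follows. The notation is an assumption made for this illustration: X^{i→} denotes the i-th row of X, X_{↓j} and B_{↓j} denote the j-th columns of X and B, and r is the number of measurement vectors.

```latex
% Sum-of-norms formulation for the MMV problem (any vector norm on the rows):
\min_{X \in \mathbb{R}^{n \times r}} \ \sum_{i=1}^{n} \| X^{i\to} \|
\quad \text{subject to} \quad AX = B .

% With the 1-norm on each row, the objective is the sum of all |X_{ij}|,
% which can equally be grouped by column:
\sum_{i=1}^{n} \| X^{i\to} \|_1
  \;=\; \sum_{i=1}^{n} \sum_{j=1}^{r} | X_{ij} |
  \;=\; \sum_{j=1}^{r} \| X_{\downarrow j} \|_1 .

% The constraint AX = B is equivalent to A X_{\downarrow j} = B_{\downarrow j}
% for j = 1, ..., r, so the problem splits into r independent basis pursuit
% problems:
\min_{x \in \mathbb{R}^{n}} \ \| x \|_1
\quad \text{subject to} \quad A x = B_{\downarrow j},
\qquad j = 1, \dots, r .
```

With the 2-norm on each row no such regrouping is possible, which is why ℓ1,2 couples the columns while ℓ1,1 does not.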
We use the geometrical interpretation of basis pursuit to analyze recovery using ReMBo-ℓ1. We show how recovery is intricately related to the orientation of the subspace spanned by the columns of X0, and the number of orthants this subspace intersects. We derive an exact expression for the maximum number of orthants any d-dimensional subspace in R^n can intersect. The dimension d of the subspace depends on the rank of X0, and it follows from our results that adding more observations without increasing d does not increase the number of intersected orthants. Empirical results show, however, that additional observations do increase the probability of reaching certain orthants, which, in turn, increases the efficiency of ReMBo-ℓ1.

This work has led to a number of questions that are interesting as future work. First, given a set {v1, ..., vn} of randomly distributed vectors, what is the distribution of the external angle of the conic hull of {±v1, ±v2, ..., ±vn}? Knowing this distribution would help establish the expected number of unique sign-patterns reached by random combinations of vectors vi after a fixed number of trials. Second, it would be interesting to study the distribution of faces of the cross-polytope C that vanish under linear mappings, especially the distribution over the faces of C that are intersected by random d-dimensional subspaces.

Solver for generalized sparse recovery. The main contribution of this thesis is a solver framework for sparse recovery problem formulations that are subject to constraints on a measure of misfit. Our implementation of this solver, spgl1, is available online, and has been successfully used by many practitioners.
spgl1 is still one of the very few solvers that can deal with explicit misfit constraints, as opposed to penalized formulations, which are easier to solve but less natural in their use, because they require the user to provide the Lagrange multiplier corresponding to the constraint.

Our numerical experiments show that spgl1 does quite well on a range of problems. The main problem with the current implementation is the tendency to slightly overshoot the target objective τ_σ, due to the inaccuracy of the subproblem solves. This is easily avoided by tightening the tolerance values, but this comes at the cost of an increase in run time. In future work we aim to provide the user with a number of options ranging from fast and dangerous to slow and accurate. The emergence of new test problems helps tune the solver, which is a continuing process until the solver reaches a certain level of maturity.

We have considered two types of algorithms for solving the generalized Lasso subproblems: the spectral projected gradient (SPG) method, and a projected quasi-Newton (PQN) method. The latter method is designed for problems where evaluating the function and gradient values of the subproblem is expensive. To reduce the cost, it forms quadratic models of the objective function and optimizes this model on the feasible set to obtain a search direction for the original problem. Applied to our test problems, PQN does indeed reduce the number of function and gradient evaluations substantially. However, due to the relatively low cost of these evaluations, it does not outperform SPG in terms of run time.

As future work, we will study the use of alternative solvers for the generalized Lasso problem. In particular, we plan to incorporate the optimal first-order method by Nesterov [118], and use his complexity results to derive a corresponding complexity of the entire root-finding approach.

Finally, two major modifications to spgl1 are needed for the matrix-completion problem. First of all, we need to avoid the evaluation of the full singular value decomposition used in projection, and reuse as much information as possible. This means that projection can no longer be treated as a separate black-box routine, but needs to be integrated, so that information gathered at previous iterations can be used to ensure that the size of the partial SVD is no larger than needed. Second, for scalability of the algorithm, we need to modify the code to work with the partial SVD representation of X, instead of the full matrix.

Sparco and Spot. sparco provides a set of test problems for sparse recovery based on single measurement vectors. As such, it is mostly suited for the basis pursuit and basis pursuit denoise formulations. The numerical experiments done in this thesis suggest that sparco needs a number of modifications. First, it would be beneficial to include more synthetic and noisy problems. This requires some careful choice of problem instances to avoid cluttering the suite (once a problem is added and used in benchmarks, it may be hard to remove later). Second, given the fact that the majority of the existing solvers apply to regularized basis pursuit, it may be desirable to include with each problem the corresponding (σ, λ) pairs, as well as an accurate solution. Third, following developments in sparse recovery, it would be useful to include problems for other formulations as separate test sets.

The spot linear operator toolbox is extremely convenient for the construction of linear operators. With the foundation of the package rigorously set up, it only remains to keep adding useful operators as they arise. Hence, the current version should be regarded as the first version of a continuously evolving toolbox.

Bibliography

[1] Farid Alizadeh and Donald Goldfarb. Second-order cone programming. Mathematical Programming, Series B, 95(1):3–51, January 2003.
[2] Waheed U. Bajwa, Jarvis D. Haupt, Gil M.
Raz, Stephen J. Wright, and Robert D. Nowak. Toeplitz-structured compressed sensing matrices. In IEEE/SP 14th Workshop on Statistical Signal Processing, 2007, pages 294–298, August 2007.
[3] Richard Baraniuk, Hyeokho Choi, Felix Fernandes, Brent Hendricks, Ramesh Neelamani, Vinay Ribeiro, Justin Romberg, Ramesh Gopinath, Haitao Guo, Markus Lang, Jan Erik Odegard, and Dong Wei. Rice Wavelet Toolbox, 1993. http://www.dsp.rice.edu/software/rwt.shtml.
[4] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, December 2008.
[5] Jonathan Barzilai and Jonathan M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.
[6] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[7] Stephen Becker, Jérôme Bobin, and Emmanuel J. Candès. NESTA: A fast and accurate first-order method for sparse recovery, 2009. Submitted.
[8] Ewout van den Berg and Michael Friedlander. Joint-sparse recovery from multiple measurements. arXiv:CS/0904.2051, April 2009.
[9] Ewout van den Berg and Michael P. Friedlander. SPGL1, 2007. http://www.cs.ubc.ca/labs/scl/index.php/Main/Software.
[10] Ewout van den Berg and Michael P. Friedlander. Probing the Pareto frontier for basis-pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2008.
[11] Ewout van den Berg and Michael P. Friedlander. Spot, 2009. http://www.cs.ubc.ca/labs/scl/spot/.
[12] Ewout van den Berg, Michael P. Friedlander, Gilles Hennenfent, Felix J. Herrmann, Rayan Saab, and Özgür Yılmaz. Algorithm 890: Sparco: A testing framework for sparse reconstruction. ACM Transactions on Mathematical Software, 35(4):1–16, February 2009.
[13] Ewout van den Berg, Mark Schmidt, Michael P. Friedlander, and Kevin Murphy.
Group sparsity via linear-time projection. Technical Report TR-2008-09, Dept. of Computer Science, UBC, June 2008.
[14] Radu Berinde, Anna C. Gilbert, Piotr Indyk, Howard J. Karloff, and Martin J. Strauss. Combining geometry and combinatorics: A unified approach to sparse signal recovery. In Forty-sixth annual Allerton conference on Communication, Control, and Computing, September 2008.
[15] Dimitri P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transactions on Automatic Control, AC-21(2):174–184, April 1976.
[16] Dimitri P. Bertsekas, Angelia Nedić, and Asuman E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.
[17] Ernesto G. Birgin, José Mario Martínez, and Marcos Raydan. Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization, 10(4):1196–1211, 2000.
[18] Ronald F. Boisvert, Roldan Pozo, Karin Remington, Richard Barrett, and Jack J. Dongarra. Matrix Market: A web resource for test matrix collections. In Ronald F. Boisvert, editor, The quality of numerical software: assessment and enhancement, pages 125–137. Chapman & Hall, London, 1997.
[19] Petros T. Boufounos and Richard G. Baraniuk. Quantization of sparse representations. Technical Report 0701, ECE Department, Rice University, March 2007.
[20] Petros T. Boufounos and Richard G. Baraniuk. Sigma delta quantization for compressive sensing. In Proceedings of SPIE, volume 6701, pages 1–13, August 2007.
[21] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[22] Kristian Bredies and Dirk A. Lorenz. Linear convergence of iterative soft-thresholding. Journal of Fourier Analysis and Applications, 14:813–837, 2008.
[23] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. arXiv:0810.3286v1, October 2008.
[24] T. Tony Cai, Guangwu Xu, and Jun Zhang. On recovery of sparse signals via ℓ1 minimization.
IEEE Transactions on Information Theory, 55(7):3388–3397, July 2009.
[25] Emmanuel J. Candès. Compressive sampling. In Proceedings of the International Congress of Mathematicians, Madrid, Spain, 2006.
[26] Emmanuel J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus de l'Académie des Sciences Paris, ser. I, 346:589–592, 2008.
[27] Emmanuel J. Candès, Laurent Demanet, David Donoho, and Lexing Ying. Fast discrete curvelet transforms. Multiscale Modeling and Simulation, 5(3):861–899, January 2006.
[28] Emmanuel J. Candès and Yaniv Plan. Matrix completion with noise. arXiv:0903.3131v1, March 2009.
[29] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):1–49, December 2009.
[30] Emmanuel J. Candès and Justin Romberg. Practical signal recovery from random projections. In Proceedings of SPIE – Volume 5674, March 2005.
[31] Emmanuel J. Candès and Justin Romberg. ℓ1-magic, 2007. http://www.l1-magic.org.
[32] Emmanuel J. Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, February 2006.
[33] Emmanuel J. Candès and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, December 2005.
[34] Emmanuel J. Candès and Terence Tao. Near-optimal signal recovery from random projections – universal encoding strategies. IEEE Transactions on Information Theory, 52(2), 2006.
[35] Emmanuel J. Candès and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.
[36] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. arXiv:0903.1476v1, March 2009.
[37] Emmanuel J. Candès, Michael B. Wakin, and Stephen P. Boyd.
Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5–6):877–905, December 2008.
[38] Jie Chen and Xiaoming Huo. Theoretical results on sparse representations of multiple-measurement vectors. IEEE Transactions on Signal Processing, 54:4634–4643, December 2006.
[39] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
[40] Jon F. Claerbout and Francis Muir. Robust modeling with erratic data. Geophysics, 38(5):826–844, October 1973.
[41] Patrick L. Combettes and Valérie R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168–1200, 2005.
[42] Shane F. Cotter, Bhaskar D. Rao, Kjersti Engan, and Kenneth Kreutz-Delgado. Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Transactions on Signal Processing, 53:2477–2488, July 2005.
[43] Wei Dai and Olgica Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230–2249, May 2009.
[44] Wei Dai, Hoa Vinh Pham, and Olgica Milenkovic. Quantized compressive sensing. arXiv:0901.0749v2, March 2009.
[45] Wei Dai, Mona A. Sheikh, Olgica Milenkovic, and Richard G. Baraniuk. Compressive sensing DNA microarrays. EURASIP Journal on Bioinformatics and Systems Biology, 2009:1–12, 2009.
[46] Yu-Hong Dai and Roger Fletcher. Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numerische Mathematik, 100:21–47, 2005.
[47] George Bernard Dantzig. Linear Programming and Extensions. Princeton University Press, Princeton, NJ, 1963.
[48] Alexandre d'Aspremont, Francis R. Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269–1294, July 2008.
[49] Ingrid Daubechies. Ten Lectures on Wavelets. SIAM, 1992.
[50] Ingrid Daubechies, Michel Defrise, and Christine de Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57:1413– 1457, November 2004. Bibliography 135 [51] Ingrid Daubechies, Massimo Fornasier, and Ignace Loris. Accelerated projected gradient method for linear inverse problems with sparsity constraints. Journal of Fourier Analysis and Applications, 14(5–6), December 2008. [52] Ron S. Dembo, Stanley C. Eisenstat, and Trond Steihaug. Inexact Newton methods. SIAM Journal on Numerical Analysis, 19(2):400–408, April 1982. [53] Ronald A. DeVore. Deterministic constructions of compressed sensing matrices. Journal of Complexity, 23(4–6):918–925, August–December 2007. [54] David L. Dohono and Jared Tanner. Neighborliness of randomly-projected simplices in high dimensions. Proceedings of the National Academy of Sciences, 102(27):9452–9457, 2005. [55] David L. Dohono and Jared Tanner. Sparse nonnegative solution of underdetermined linear equations by linear programming. Proceedings of the National Academy of Sciences, 102(27):9446–9451, 2005. [56] Elizabeth D. Dolan and Jorge J. Mor´e. Benchmarking optimization software with COPS. Technical Report ANL/MCS-246, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, 2000. Revised January 2001. [57] David L. Donoho. Neighborly polytopes and sparse solution of underdetermined linear equations. Technical Report 2005-4, Department of Statistics, Stanford University, Stanford, CA, 2005. [58] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, April 2006. [59] David L. Donoho. High-dimensional centrosymmetric polytopes with neighborliness proportional to dimension. Discrete and Computational Geometry, 35(4):617–652, May 2006. [60] David L. Donoho and Michael Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via 1 minimization. 
Proceedings of the National Academy of Sciences, 100(5):2197–2202, March 2003. [61] David L. Donoho, Michael Elad, and Vladimir N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on information theory, 52(1):6–18, January 2006. [62] David L. Donoho and Xiaoming Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862, November 2001. [63] David L. Donoho and Iain M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994. Bibliography 136 [64] David L. Donoho, Victoria C. Stodden, and Yaakov Tsaig. Sparselab, 2007. http://sparselab.stanford.edu/. [65] Pei-Cheng Du and Ruth H. Angeletti. Automatic deconvolution of isotoperesolved mass spectra using variable selection and quantized peptide mass distribution. Analytical Chemistry, 78(10):3385–3392, May 2006. [66] Marco F. Duarte, Mark A. Davenport, Dharmpal Takhar, Jason N. Laska, Ting Sun, Kevin F. Kelly, and Richard G. Baraniuk. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83–91, 2008. [67] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the 1 -ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008. [68] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004. [69] Michael Elad and Alfred M. Bruckstein. A generalized uncertainty principle and sparse representation in pairs of bases. IEEE Transactions on Information Theory, 48(9):2558–2567, September 2002. [70] Michael Elad, Boaz Matalon, Joseph Shtok, and Michael Zibulevsky. A wide-angle view at iterated shrinkage algorithms. In Proceedings of SPIE, volume 6701, pages 670102:1–670102:19, September 2007. [71] Michael Elad, Jean-Luc Starck, Philippe Querre, and David L. Donoho. 
Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA). Applied and Computational Harmonic Analysis, 19(3):340–358, November 2005. [72] Yonina C. Eldar, Patrick Kuppinger, and Helmut B¨olcskei. Compressed sensing of block-sparse signals: Uncertainty relations and efficient recovery. arXiv:0906.3173v2, June 2009. [73] Yonina C. Eldar and Moshe Mishali. Robust recovery of signals from a union of subspaces. arXiv 0807.4581, July 2008. [74] Yonina C. Eldar and Holger Rauhut. Average case analysis of multichannel sparse recovery using convex relaxation. arXiv:0904.0494, April 2009. [75] Maryam Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002. [76] Arie Feuer and Arkadi Nemirovski. On sparse representation in pairs of bases. IEEE Transactions on Information Theory, 49(6):1579–1581, June 2003. Bibliography 137 [77] M´ ario A. T. Figueiredo and Robert D. Nowak. An EM algorithm for wavelet-based image restoration. IEEE Transactions on Image Processing, 12(8):906–916, August 2003. [78] M´ ario A. T. Figueiredo, Robert D. Nowak, and Stephen J. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, December 2007. [79] David Gale. Neighborly and cyclic polytopes. Proceedings of Symposia in Pure Mathematics, VII:225–232, 1963. [80] Irina F. Gorodnitsky, John S. George, and Bhaskar D. Rao. Neuromagnetic source imaging with FOCUSS: A recursive weighted minimum norm algorithm. Electroencephalography and Clinical Neurophysiology, 95(4):231–251, October 1995. [81] Nicholas I. M. Gould, Dominique Orban, and Philippe L. Toint. CUTEr and SifDec: A constrained and unconstrained testing environment, revisited. ACM Transactions on Mathematical Software, 29(4):373–394, December 2003. [82] Vivek K. Goyal, Alyson K. Fletcher, and Sundeep Rangan. 
Compressive sampling and lossy compression. IEEE Signal Processing Magazine, 25(2), March 2008. [83] Michael Grant and Stephen Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Lecture Notes in Control and Information Sciences, pages 95–110. Springer, 2008. [84] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming (web page and software), February 2009. http://stanford.edu/∼boyd/cvx. [85] R´emi Gribonval and Morten Nielsen. Sparse representations in unions of bases. IEEE Transactions on Information Theory, 49(12):3320–3325, December 2003. [86] R´emi Gribonval and Morten Nielsen. Highly sparse representations from dictionaries are unique and independents of the sparseness measure. Applied and Computational Harmonic Analysis, 22(3):335–355, May 2007. [87] R´emi Gribonval, Holger Rauhut, Karin Schnass, and Pierre Vandergheynst. Atoms of all channels, unite! Average case analysis of multichannel sparse recovery using greedy algorithms. Journal of Fourier Analysis and Applications, 14(5–6):655–687, December 2008. [88] Luigi Grippo, Francesco Lampariello, and Stephano Lucidi. A nonmonotone line search technique for Newton’s method. SIAM Journal on Numerical Analysis, 23(4):707–716, August 1986. Bibliography 138 [89] Branko Gr¨ unbaum. Convex Polytopes, volume 221 of Graduate Texts in Mathematics. Springer-Verlag, second edition, 2003. [90] Elaine Hale, Wotao Yin, and Yin Zhang. Fixed-point continuation for 1 -minimization: Methodology and convergence. SIAM Journal on Optimization, 19(3):1107–1130, 2008. [91] Gilles Hennenfent and Felix J. Herrmann. Sparseness-constrained data continuation with frames: Applications to missing traces and aliased signals in 2/3-D. In SEG International Exposition and 75th Annual Meeting, 2005. [92] Gilles Hennenfent and Felix J. Herrmann. Application of stable signal recovery to seismic interpolation. 
In SEG International Exposition and 76th Annual Meeting, 2006. [93] Gilles Hennenfent and Felix J. Herrmann. Random sampling: New insights into the reconstruction of coarsely-sampled wavefields. In SEG International Exposition and 77th Annual Meeting, 2007. [94] Gilles Hennenfent and Felix J. Herrmann. Simply denoise: Wavefield reconstruction via jittered undersampling. Geophysics, 73(3):V19–V28, May–June 2008. [95] Stephen D. Howard, A. Robert Calderbank, and Stephen J. Searle. A fast reconstruction algorithm for deterministic compressive sensing using second order reed-muller codes. In Proc. Conference on Information Sciences and Systems (CISS2008), pages 11–15, March 2008. [96] Piotr Indyk. Explicit constructions for compressed sensing of sparse signals. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, January 2008. [97] Laurent Jacques, David K. Hammond, and Mohamed Jalal Fadili. Dequantizing compressed sensing: When oversampling and non-Gaussian constraints combine. arXiv:0902.2367v2, February 2009. [98] Sina Jafarpour, Weiyu Xu, Babak Hassibi, and A. Robert Calderbank. Efficient and robust compressed sensing using high-quality expander graphs. arXiv:0806.3802, June 2008. [99] Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry Gorinevsky. An interior-point method for large-scale 1 regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4):606–617, December 2007. [100] Dany Leviatan and Vladimir N. Temlyakov. Simultaneous greedy approximation in Banach spaces. Journal of Complexity, 21(3), June 2005. Bibliography 139 [101] Nathan Linial and Isabella Novik. How neighborly can a centrally symmetric polytope be? Discrete and Computational Geometry, 36(2):273–281, 2006. [102] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, Serie B, 45(1– 3):503–528, August 1989. [103] Yong-Jin Liu, Defeng Sun, and Kim-Chuan Toh. 
An implementable proximal point algorithmic framework for nuclear norm minimization. Preprint, July 2009. [104] Michael Lustig, David L. Donoho, and John M. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182–1195, December 2007. [105] Michael Lustig, David L. Donoho, Juan M. Santos, and John M. Pauly. Compressed sensing MRI. IEEE Signal Processing Magazine, 25(2):72–82, 2008. [106] Shiqian Ma, Donald Goldfarb, and Lifeng Chen. Fixed point and Bregman iterative methods for matrix rank minimization. arXiv:0905.1643v2, May 2009. [107] Dmitry M. Malioutov. A sparse signal reconstruction perspective for source localization with sensor arrays. Master’s thesis, Dept. Electrical Engineering, Massachusetts Institute of Technology, Cambridge, MA, July 2003. [108] Dmitry M. Malioutov, M¨ ujad C ¸ etin, and Alan S. Willsky. Optimal sparse representations in general overcomplete bases. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 793– 796, May 2004. [109] Dmitry M. Malioutov, M¨ ujad C ¸ etin, and Alan S. Willsky. A sparse signal reconstruction perspective for source localization with sensor arrays. IEEE Transactions on Signal Processing, 53(8):3010–3022, August 2005. [110] Dmitry M. Malioutov, M¨ ujdat C ¸ etin, and Alan S. Willsky. Source localization by enforcing sparsity through a Laplacian prior: An SVD-based approach. In IEEE Workshop on Statistical Signal Processing, 2003. [111] Dmitry M. Malioutov, M¨ ujdat C ¸ etin, and Alan S. Willsky. Homotopy continuation for sparse signal representation. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 733– 736, March 2005. [112] Fred W. McLafferty and Franti˘sek Ture˘cek. Interpretation of Mass Spectra. University Science Books, Mill Valley, CA, fourth edition, 1993. Bibliography 140 [113] Peter McMullen and Geoffrey C. Shephard. 
Diagrams for centrally symmetric polytopes. Mathematika, 15:123–138, 1968. [114] Moshe Mishali and Yonina C. Eldar. Reduce and boost: Recovering arbitrary sets of jointly sparse vectors. IEEE Transactions on Signal Processing, 56(10):4692–4702, October 2008. [115] Balas K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, April 1995. [116] National Institute of Standards and Technology. NIST Chemistry WebBook, 2009. http://webbook.nist.gov/chemistry/. [117] Deanna Needell and Joel A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 29(3):301–321, May 2009. [118] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, Serie A, 103(1):127–152, May 2005. [119] Yurii Nesterov. Gradient methods for minimizing composite objective function. Technical Report 2007/76, Center for Operations Research and Econometrics, Catholic University of Louvain, 2007. [120] Yurii Nesterov and Arkadii Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994. [121] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 1999. [122] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, second edition, 2006. [123] Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000. [124] Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. On the LASSO and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000. [125] Christopher C. Paige and Michael A. Saunders. LSQR an algorithm for sparse linear equations and sparse least squares. ACM Transactions om Mathematical Software, 8(1):43–71, 1982. 
[126] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. arXiv:0706.4138v1, June 2007. Bibliography 141 [127] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970. [128] Moshe Rosenfeld. In Praise of the Gram Matrix, pages 318–323. The Mathematics of Paul Erd¨os, II. Springer, Berlin, 1997. [129] Mark Rudelson and Roman Vershynin. Sparse reconstruction by convex relaxation: Fourier and Gaussian measurements. In 40th Annual Conference on Information Sciences and Systems, pages 207–212, Princeton, March 2006. [130] Sylvain Sardy, Andrew G. Bruce, and Paul Tseng. Block coordinate relaxation methods for nonparametric wavelet denoising. Journal of Computational and Graphical Statistics, 9(2):361–379, 2000. [131] Mark Schmidt, Ewout van den Berg, Michael P. Friedlander, and Kevin Murphy. Optimizing costly functions with simple constraints: a limitedmemory projected quasi-Newton algorithm. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, pages 448–455, April 16–18 2009. [132] Mona A. Sheikh, Olgica Milenkovic, Shri Sarvotham, and Richard G. Baraniuk. Compressed sensing DNA microarrays. Technical Report TREE0706, ECE, Rice University, May 2007. [133] R. Martin Smith. Understanding mass spectra: A basic approach. John Wiley and Sons, Hoboken, NJ, second edition, 2004. [134] Eckard Specht. Packing of circles in the unit circle, http://hydra.nat.uni-magdeburg.de/packing/cci/cci.html. 2009. [135] Mihailo Stojnic, Farzad Parvaresh, and Babak Hassibi. On the reconstruction of block-sparse signals with an optimal number of measurements. IEEE Transactions on Signal Processing, 57(8):3075–3085, August 2009. [136] Thomas Strohmer and Robert W. Heath Jr. Grassmannian frames with applications to coding and communication. Applied and Computational Harmonic Analysis, 14(3):257–275, 2003. 
[137] Dharmpal Takhar, Jason N. Laska, Michael B. Wakin, Marco F. Duarte, Dror Baron, Shriram Sarvotham, Kevin F. Kelly, and Richard G. Baraniuk. A new camera architecture based on optical-domain compression. In Proceedings of the IS&T/SPIE Symposium on Electronic Imaging: Computational Imaging, volume 6065, January 2006. [138] David S. Taubman and Michael W. Marcellin. JPEG2000: Image Compression Fundamentals, Standards and Practice. Kluwer Academic Publishers, Boston, 2001. Bibliography 142 [139] Howard L. Taylor, Stephen C. Banks, and John F. McCoy. Deconvolution with the 1 norm. Geophysics, 44(1):39–52, January 1979. [140] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267– 288, 1996. [141] Peter A. Toft. The Radon Transform — Theory and Implementation. PhD thesis, Department of Mathematical Modelling, Technical University of Denmark, June 1996. [142] Kim-Chuan Toh, Michael J. Todd, and Reha H. T¨ ut¨ unc¨ u. SDPT3 – a Matlab software package for semidefinite programming. Optimization Methods and Software, 11(1–4):545–581, 1999. [143] Kim-Chuan Toh and Sangwoon Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Preprint, April 2009. [144] Joel A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10), October 2004. [145] Joel A. Tropp. Algorithms for simultaneous sparse approximation: Part II: Convex relaxation. Signal Processing, 86:589–602, 2006. [146] Joel A. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, March 2006. [147] Joel A. Tropp, Anna C. Gilbert, and Martin J. Strauss. Simultaneous sparse approximation via greedy pursuit. In Proc. 2005 IEEE Int. Conf. 
Acoustics, Speech, and Signal Processing (ICASSP), volume 5, pages 721– 724, Philadelphia, March 2005. [148] Joel A. Tropp, Anna C. Gilbert, and Martin J. Strauss. Algorithms for simultaneous sparse approximation: Part I: Greedy pursuit. Signal Processing, 86:572–588, 2006. [149] Berwin A. Turlach. On algorithms for solving least squares problems under an l1 penalty or an l1 constraint. In Proceedings of the American Statistical Association, pages 2572–2577. American Statistical Association, 2004. [150] Reha H. T¨ ut¨ unc¨ u, Kim-Chuan Toh, and Michael J. Todd. Solving semidefinite-quadratic-linear programs using SDPT3. Mathematical Programming, Serie B, 95:189–217, 2003. ¨ ur Yılmaz, and Felix J. Herrmann. Bayesian [151] Deli Wang, Rayan Saab, Ozg¨ wavefield separation by transform-domain sparsity promotion. Geophysics, 73:A33–A38, 2008. Bibliography 143 [152] Stephen J. Wright. Primal-Dual Interior-Point Methods. SIAM, 1997. [153] Stephen J. Wright, Robert D. Nowak, and M´ario A. T. Figueiredo. Sparse reconstruction by separable approximation. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3373–3376, March 2008. [154] Wotao Yin, Stanley Osher, Donald Goldfarb, and Jerome Darbon. Bregman iterative algorithms for 1 -minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences, 1(1):143–168, 2008. [155] G¨ unter M. Ziegler. Lectures on Polytopes, volume 152 of Graduate Texts in Mathematics. Springer-Verlag, first edition, 2006. [156] Argyrios Zymnis, Stephen Boyd, and Emmanuel J. Cand`es. Compressed sensing with quantized measurements. Submitted for publication, April 2009.
Convex optimization for generalized sparse recovery van den Berg, Ewout 2009
Item Metadata
Title | Convex optimization for generalized sparse recovery |
Creator | van den Berg, Ewout |
Publisher | University of British Columbia |
Date Issued | 2009 |
Description | The past decade has witnessed the emergence of compressed sensing as a way of acquiring sparsely representable signals in a compressed form. These developments have greatly motivated research in sparse signal recovery, which lies at the heart of compressed sensing, and which has recently found its use in altogether new applications. In the first part of this thesis we study the theoretical aspects of joint-sparse recovery by means of sum-of-norms minimization, and the ReMBo-l₁ algorithm, which combines boosting techniques with l₁-minimization. For the sum-of-norms approach we derive necessary and sufficient conditions for recovery, by extending existing results to the joint-sparse setting. We focus in particular on minimization of the sum of l₁, and l₂ norms, and give concrete examples where recovery succeeds with one formulation but not with the other. We base our analysis of ReMBo-l₁ on its geometrical interpretation, which leads to a study of orthant intersections with randomly oriented subspaces. This work establishes a clear picture of the mechanics behind the method, and explains the different aspects of its performance. The second part and main contribution of this thesis is the development of a framework for solving a wide class of convex optimization problems for sparse recovery. We provide a detailed account of the application of the framework on several problems, but also consider its limitations. The framework has been implemented in the SPGL1 algorithm, which is already well established as an effective solver. Numerical results show that our algorithm is state-of-the-art, and compares favorably even with solvers for the easier---but less natural---Lagrangian formulations. The last part of this thesis discusses two supporting software packages: Sparco, which provides a suite of test problems for sparse recovery, and Spot, a Matlab toolbox for the creation and manipulation of linear operators. 
Spot greatly facilitates rapid prototyping in sparse recovery and compressed sensing, where linear operators form the elementary building blocks. Following the practice of reproducible research, all code used for the experiments and generation of figures is available online at http://www.cs.ubc.ca/labs/scl/thesis/09vandenBerg/. |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2009-12-14 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-ShareAlike 3.0 Unported |
DOI | 10.14288/1.0051332 |
URI | http://hdl.handle.net/2429/16646 |
Degree | Doctor of Philosophy - PhD |
Program | Computer Science |
Affiliation | Science, Faculty of; Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2010-05 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-sa/3.0/ |
Aggregated Source Repository | DSpace |