Compressed Sensing: Decoding and Quantization by Rayan Saab B.E., American University of Beirut, 2003 M.A.Sc., The University of British Columbia, 2005 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Electrical and Computer Engineering) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) May, 2010 c© Rayan Saab 2010 Abstract Recently, great strides in sparse approximation theory and its application have been made. Many of these developments were spurred by the emerging area of compressed sensing. Compressed sensing is a signal acquisition paradigm that entails recovering estimates of sparse and compressible signals from n linear measurements, many fewer than the signal ambient dimension N . In general, these measurements are assumed to be imperfect. For example, they may be noisy or quantized (or both). Thus, the associated sparse recovery problem requires the existence of stable and robust “decoders” to reconstruct the signal. In this thesis, we first address the theoretical properties of ∆p, a class of compressed sensing decoders that rely on `p minimization with p ∈ (0, 1). In particular, we extend the known results regarding the decoder ∆1, based on `1 minimization, to ∆p with p ∈ (0, 1). Our results are two-fold. We show that under sufficient conditions that are weaker than the analogous sufficient conditions for ∆1, the decoders ∆p are robust to noise and stable. In particular, they are (2, p) instance optimal. We also show that, like ∆1, the decoders ∆p are (2, 2) instance optimal in probability provided the measurement matrix is drawn from an appropriate distribution. Second, we address quantization of compressed sensing measurements. Typically, the most commonly assumed approach (called PCM) entails quantizing each measurement independently, using a quantization alphabet with step-size d. Subsequently, by using a stable and robust decoder one can guarantee that the estimate produced by the decoder is within O(d) of the signal. As a more effective alternative, we propose using sigma-delta schemes to quantize compressed sensing measurements of a k-sparse signal. We show that there is an associated two-stage recovery method which guarantees a reduction of the approximation error by a factor of ( n k )α(r−1/2) , for any α < 1 if n &r k(logN) 1/(1−α) (with high probability on the initial draw of the measurement matrix). In particular, the first recovery stage employs a stable decoder to estimate the support of the signal, while the second stage employs Sobolev dual frames to approximate the signal. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Co-authorship Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x 1 Introduction and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 The sparse approximation problem . . . . . . . . . . . . . . . . . . . . . 2 1.3 Decoders and recovery guarantees . . . . . . . . . . . . . . . . . . . . . . 3 1.3.1 Convex relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3.2 Greedy algorithms . . . . . . . . . . . . 
. . . . . . . . . . . . . . 5 1.3.3 Bounds on the sparsity level k . . . . . . . . . . . . . . . . . . . . 7 1.4 Compressed sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4.1 Compressed sensing as a sparse approximation problem . . . . . . 10 1.4.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.5 Compressed sensing decoders and recovery guarantees . . . . . . . . . . . 11 1.5.1 Convex relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.5.2 Greedy algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.5.3 Uniform recovery guarantees with random matrices . . . . . . . . 16 1.6 Compressed sensing applications . . . . . . . . . . . . . . . . . . . . . . . 17 1.7 Quantization and compressed sensing . . . . . . . . . . . . . . . . . . . . 20 1.7.1 Pulse code modulation for compressed sensing . . . . . . . . . . . 20 1.7.2 Sigma-Delta quantization of finite frame expansions . . . . . . . . 22 1.8 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 1.8.1 Decoding by non-convex optimization . . . . . . . . . . . . . . . . 25 1.8.2 Σ∆ quantization for compressed sensing . . . . . . . . . . . . . . 25 1.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 iii Table of Contents 2 Sparse recovery by non-convex optimization – instance optimality . . 32 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.1.1 Decoding by `1 minimization . . . . . . . . . . . . . . . . . . . . . 33 2.1.2 Decoding by `p minimization . . . . . . . . . . . . . . . . . . . . . 35 2.1.3 Paper Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.2.1 Sparse recovery with ∆p: stability and robustness . . . . . . . . . 38 2.2.2 The relationship between k1 and kp . . . . . . . . . . . . . . . . . 40 2.2.3 Instance optimality in probability and ∆p . . . . . . . . . . . . . 41 2.3 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.4 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 2.4.1 Proof of Proposition 2.2.4 . . . . . . . . . . . . . . . . . . . . . . 49 2.4.2 Proof of Theorem 2.2.1 . . . . . . . . . . . . . . . . . . . . . . . . 50 2.4.3 Proof of Lemma 2.2.5. . . . . . . . . . . . . . . . . . . . . . . . . 52 2.4.4 Proof of Theorem 2.2.6. . . . . . . . . . . . . . . . . . . . . . . . 54 2.4.5 Proof of Theorem 2.2.7. . . . . . . . . . . . . . . . . . . . . . . . 55 2.5 Proof of Lemma 2.4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 2.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3 Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.2 Background on Σ∆ quantization of frame expansions . . . . . . . . . . . 72 3.3 Reconstruction error bound for random frames . . . . . . . . . . . . . . . 75 3.3.1 Lower bound for σmin(D −rE) . . . . . . . . . . . . . . . . . . . . 75 3.3.2 Implication for compressed sensing matrices . . . . . . . . . . . . 80 3.4 Σ∆ quantization of compressed sensing measurements . . . . . . . . . . . 81 3.5 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 3.6 Remarks on extensions . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . 90 3.6.1 Other noise shaping matrices . . . . . . . . . . . . . . . . . . . . 90 3.6.2 Measurement noise and compressible signals . . . . . . . . . . . . 95 3.7 Singular values of D−r . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 3.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 4.1 Summary of contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 105 4.2 Σ∆ quantization with decoding by `p minimization . . . . . . . . . . . . 106 4.3 Open questions and future work . . . . . . . . . . . . . . . . . . . . . . . 107 4.3.1 Decoding by non-convex optimization . . . . . . . . . . . . . . . . 109 4.3.2 Σ∆ quantization for compressed sensing measurements . . . . . . 109 4.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 iv Table of Contents A Other contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 v List of Figures 1.1 A simplified diagram showing a “traditional” data acquisition paradigm. An analog signal, say f(t), is first sampled at an appropriate rate to yield the samples z(`) which are then quantized, giving zq ∈ AN (where A is the quantization alphabet). Together these two stages comprise analog to dig- ital conversion. Once the signal is digitized, it is usually compressed. For example the compression stage may begin by transforming the quantized signal into an appropriate domain to obtain the coefficients x = B−1zq. The largest coefficients are then retained to yield xk ∈ ΣNk . . . . . . . . . 9 1.2 A simplified diagram of a compressed sensing data acquisition and storage paradigm. An analog signal f(t) is measured with a compressed sensing system. This operation can be conceptualized as an implicit sampling stage yielding the vector z ∈ RN . This is followed by a generalized measurement stage giving a vector of measurements b ∈ Rn, n  N , where b = Φz. b is then quantized to bq. For examples of how sampling and measurement using Φ can be done without “actually sampling” the signal, see Section 1.6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.3 A comparison of the average and worst case errors when recovering 50 signals x, with ∆1 and ∆CoSaMP as the number of non-zero coefficients in x increases. The non-zeros in x are chosen to be of unit magnitude (and random signs). Here, we use a fixed 100× 300 matrix A whose entries are drawn i.i.d. from N (0, 1/n). . . . . . . . . . . . . . . . . . . . . . . . . 16 1.4 The average performance of PCM quantization and reconstruction with the decoder ∆1. For this experiment the non-zero entries of x are constant and d = 0.01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 1.5 The worst case performance of PCM quantization and reconstruction with the decoder ∆1. For this experiment the non-zero entries of x are constant and d = 0.01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.1 A comparison of the sufficient conditions on δmk in (2.17) and (2.16) as a function of m, for p = 0.1 (top), p = 0.5 (center) and p = 0.9 (bottom). . 40 2.2 For a Gaussian matrix A ∈ R100×300, whose δk values are estimated via MC simulations, we generate the theoretical (top) and practical (bottom) phase-diagrams for reconstruction via `p minimization. . . . . . . . . . . 
45 vi List of Figures 2.3 Reconstruction error with compressible signals (top), noisy observations (bottom). Observe the almost linear growth of the error in compressible signals and for different values of p, highlighting the instance optimality of the decoders. The plots were generated by averaging the results of 10 experiments with the same matrix A and randomized locations of the coefficients of x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.4 Reconstruction signal to noise ratios (in dB) obtained by using ∆p to recover signals whose sorted coefficients decay according to a power law (x(j) = cj−1/q, ‖x‖2 = 1) as a function of q (top) and as a function of p (bottom). The presented results are averages of 50 experiments performed with different matrices in R100×200. Observe that for highly compressible signals, e.g., for q = 0.4, there is a 5 dB gain in using p < 0.6 as compared to p = 1. The performance advantage is about 2 dB for q = 0.6. As the signals become much less compressible, i.e., as we increase q to 0.9 the performances are almost identical. . . . . . . . . . . . . . . . . . . . . . 48 3.1 Numerical behavior (in log-log scale) of 1/σmin(D −rE) as a function of λ = n/k, for r = 0, 1, 2, 3, 4. In this figure, k = 50 and 1 ≤ λ ≤ 25. For each problem size, the largest value of 1/σmin(D −rE) among 1000 realizations of a random n× k matrix E sampled from the Gaussian ensemble N (0, 1 n In) was recorded. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.2 The average performance of the proposed Σ∆ quantization and reconstruc- tion schemes for various values of k. For this experiment the non-zero entries of x are constant and d = 0.01. . . . . . . . . . . . . . . . . . . . 91 3.3 The worst case performance of the proposed Σ∆ quantization and recon- struction schemes for various values of k. For this experiment the non-zero entries of x are constant and d = 0.01. . . . . . . . . . . . . . . . . . . . 92 3.4 The average performance of the proposed Σ∆ quantization and reconstruc- tion schemes for various values of k. For this experiment the non-zero entries of x are i.i.d. N (0, 1) and d = 10−4. . . . . . . . . . . . . . . . . . 93 3.5 The worst case performance of the proposed Σ∆ quantization and recon- struction schemes for various values of k. For this experiment the non-zero entries of x are i.i.d. N (0, 1) and d = 10−4. . . . . . . . . . . . . . . . . . 94 3.6 The average case performance of the proposed Σ∆ quantization and re- construction schemes (with general duals) for various values of k. For this experiment the non-zero entries of x are constant and d = 0.01. . . . . . 97 3.7 The worst case performance of the proposed Σ∆ quantization and recon- struction schemes (with general duals) for various values of k. For this experiment the non-zero entries of x are constant and d = 0.01. . . . . . 98 vii List of Figures 4.1 The average performance of the proposed Σ∆ quantization and reconstruc- tion schemes for various values of k. Specifically, in the coarse recovery stage we use ∆1 and ∆p, p = .5. For this experiment the non-zero entries of x are i.i.d. N (0, 1) and d = 10−3. . . . . . . . . . . . . . . . . . . . . . 108 viii Acknowledgments I have had the great fortune of having a wonderful group of people around me during my Ph.D. studies. Each has contributed to my growth in their own way; for that I am grateful. I could not have wished for better thesis supervisors than Özgür Yılmaz and Rabab Ward. 
Throughout these last few years, Özgür has been both unwaveringly supportive and immeasurably generous with his time and insights. His enthusiasm, mentorship, and friendship have contributed much to making my Ph.D. experience as rich and enjoyable as it has been. I will certainly miss all our discussions - academic and otherwise - and all the hours we spent mulling over research problems, solutions, and proofs. Rabab Ward has been unconditionally supportive for the entire duration of my grad- uate experience at UBC. I count myself lucky to have been a recipient of her valuable guidance, infectious warmth, and genuine encouragement. My educational experience has been also enriched by various collaborations over the last few years. In particular, it has been a privilege to have worked with Sinan Güntürk. The discussions and collaborations I have had with the members of “DNOISE” have been thought provoking and enriching - I am fortunate to have been a part of that group. I have thoroughly enjoyed numerous conversations with Felix Herrmann, Michael Friedlander, Gilles Hennenfent and Ewout van den Berg. I have spent much time at the Signal and Image Processing Lab where I have enjoyed the friendship and company of Rupin Dalvi, Ilker Hacihaliloglu and the rest of the group. Throughout my years in Vancouver, Fadi Asfour has been a wonderful friend, always supportive - and available for coffee. I want to thank Hassan Mansour for our many years of friendship. Had you gone to New Mexico, Vancouver would not have been this much fun. I would also have had to find someone else to bounce ideas off, which would have been a hassle. I owe a great debt of thanks to my parents, Nawal and Taan, and to my brother, Roham. Their love and support has never wavered. For all my good fortune, for the last eight years I have been luckiest to share every step of my journey with Lina. I look forward to sharing every step with you for eighty more. ix Co-authorship Statement Note, this is a Manuscript based thesis constructed around the manuscripts listed below: Manuscript 1: Sparse recovery by non-convex optimization – instance optimality. The research topics addressed in this manuscript were jointly selected by Rayan Saab and Dr. Özgur Yılmaz. The proofs of the main results were done by Rayan Saab and refined by Dr. Özgür Yılmaz. The manuscript was prepared jointly. Manuscript 2: Sobolev Duals for Random Frames and Σ∆ Quantization of Com- pressed Sensing Measurements. The topic of this manuscript was identified by Dr. Özgür Yılmaz and Dr. C. Sinan Güntürk. The main proof strategy was devised jointly by Rayan Saab, Dr. C. Sinan Güntürk and Dr. Özgür Yılmaz (during a BIRS workshop). Initial results and accompanying proofs were prepared by Rayan Saab and refined by Dr. C. Sinan Güntürk and Dr. Özgür Yılmaz. The manuscript was prepared jointly between Rayan Saab, Dr. Özgür Yılmaz, Dr. C. Sinan Güntürk and Dr. Alexander Powell. The order of authorship for this manuscript is alphabetical. x Chapter 1 Introduction and Overview 1.1 Introduction In this thesis, we address problems in the recently emerging field of “sparse signal pro- cessing”. Sparse signal processing can be considered as a sub-field of signal processing dealing with applications where the signals are sparse or approximately sparse. To be more precise, we say that a signal is sparse if it admits a transform domain representation where most of its coefficients are zero. 
Similarly, we say that a signal is compressible when it can be closely approximated by a sparse signal. Thus, when one knows that a signal to be acquired or processed is sparse or compressible it is natural to incorporate such information in the design and analysis of the associated algorithms. In fact, sparse and compressible signals are rather ubiquitous and this allows the design of successful algorithms in such far reaching applications as compression, denoising, source separation, interpolation, and data acquisition. For example, natural images are nearly sparse in the discrete cosine transform (DCT) domain and in the wavelet domain. This fact is crucial to the success of the JPEG (e.g., [55]) and JPEG2000 (e.g., [48]) compression schemes and is used heavily in various image denoising algorithms (e.g., [20]). Similarly, audio signals tend to be sparse in the short time Fourier transform domain and in the short time DCT domain. This in turn allows for applications such as MP3 compression [7] and audio source separation (e.g., [46, 59]). To give one more example, seismic signals are compressible in the curvelet domain. This information also enables signal processing algorithms for denoising [30], data interpolation [29], and source separation [56] among many other applications. One can count several such classes of signals that are sparse in appropriate domains and examples of applications where this information is used to reconstruct the signals, e.g., from noisy measurements; however, all such applications share a common thread. Specifically, one is given a set of possibly corrupted observations of some signal, and the foreknowledge that the signal at hand is sparse in some known domain. One then proceeds to find a sparse signal that approximates the observations well, while attempting to avoid the adversarial effects of the corruptions. We will occasionally refer to this type of problems as sparse approximation (reconstruction) problems. The sparse approximation problems that have recently garnered the most interest are related to compressed sensing [8, 11, 12, 21, 53] (see Section 1.4). This thesis is mainly concerned with two important aspects of compressed sensing, namely decoder design 1 Chapter 1. Introduction and Overview and quantization. However, before we discuss compressed sensing and summarize our contributions, we first provide the necessary context. In Section 1.2 we formulate a (generic) sparse approximation problem precisely and we discuss the desirable properties that algorithms proposed to solve it must possess. In Section 1.3, we survey the sparse approximation results that predate compressed sensing. In Section 1.4, we formally introduce compressed sensing and cast the associated reconstruction problem as a sparse approximation problem. Section 1.5 presents a brief literature review on some important aspects of compressed sensing. Specifically, we focus on the recovery guarantees obtained using various decoding algorithms and on the encoders (matrices) that enable such results. Having provided the necessary introduction to compressed sensing, in Section 1.6 we give some examples of practical applications of the compressed sensing theory. In Section 1.7 we discuss quantization in the context of compressed sensing, survey some of the results obtained in this direction and provide connections to frame quantization. Finally, in Section 1.8 we present an overview of the contributions of this thesis, which are two- fold. 
First, we show that a class of algorithms based on `p optimization with p < 1 has favorable compressed sensing reconstruction guarantees. Second, we show that a certain class of quantization algorithms, Σ∆ quantization, with an associated two-stage reconstruction algorithm (where `p optimization may be used in the first stage) provides superior performance to scalar quantization - again in the context of compressed sensing. 1.2 The sparse approximation problem Define ΣNk to be the set of all k-sparse vectors, i.e., ΣNk := {x ∈ RN : |supp(x)| ≤ k}, where supp(x) := {i ∈ {1, ..., N} : x(i) 6= 0}, is the support of x. Define compressible vectors as vectors that can be well approximated in ΣNk . Let σk(x)`p denote the best k-term approximation error of x in `p (quasi-)norm where p > 0, i.e., σk(x)`p := min v∈ΣNk ‖x− v‖p. Throughout the text, A denotes an n×N real matrix where n < N . Let the associated encoder, also denoted by A, be the map x 7→ Ax and denote the (in general noisy) encoding of x by b, i.e., b = Ax+ e. (1.1) The sparse approximation problem mentioned in Section 1.1 requires the existence of decoders, say ∆ : Rn 7→ RN , with roughly the following properties (C1) Exact Recovery: ∆(Ax) = x whenever x ∈ ΣNk with sufficiently small k. 2 Chapter 1. Introduction and Overview (C2) Stability and Robustness: ‖x − ∆(Ax + e)‖ . ‖e‖ + σk(x)`p, where1 the norms are appropriately chosen. Here e denotes measurement error, e.g., thermal and computational noise. (C3) Numerical Tractability: ∆(Ax+ e) can be computed efficiently (in some sense). In general, the problem of constructing decoders with properties (C1)-(C3) is non- trivial as A is overcomplete, i.e., the linear system of n equations in (1.1) is underdeter- mined, and thus, if consistent, it admits infinitely many solutions. In order for a decoder to satisfy (C1)-(C3), it must choose the “correct solution” among infinitely many so- lutions. Under the assumption that the original signal x is sparse, one can phrase the problem of finding the desired solution as an optimization problem where the objective is to maximize an appropriate “measure of sparsity” while simultaneously satisfying the constraints defined by (1.1). In the noise-free case, i.e., when e = 0 in (1.1), under certain conditions on the n×N matrix A, e.g., if A is in general position, there is a decoder ∆0 which satisfies ∆0(Ax) = x for all x ∈ ΣNk whenever k < n/2 (e.g., see [22]). This ∆0 can be explicitly computed via the optimization problem ∆0(b) := argmin y ‖y‖0 subject to Ay = b. (1.2) Here ‖y‖0 denotes the number of non-zero entries of the vector y, equivalently its so-called `0-norm. Clearly, the sparsity of y is reflected by its `0-norm - and ideally to recover a sparse signal x from noisy observations b, one would turn to the decoder ∆0. However, it is easy to see that this decoder entails solving a combinatorial problem and that if one hopes to solve sparse approximation problems efficiently, then alternative decoders must be considered. In Section 1.3 we give an overview of some important decoders that have been suggested and analyzed. Specifically, we examine the results that predate compressed sensing. 1.3 Decoders and recovery guarantees The optimization problem (1.2) that is involved in calculating ∆0(b) is combinatorial. In other words, its complexity grows extremely quickly as the ambient dimension N grows larger than n and this renders the decoder ∆0 impractical. 
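To make this combinatorial cost concrete, the following is a minimal Python sketch of a brute-force implementation of ∆0 for the noise-free case of (1.2). The function name, the tolerance, and the least-squares fit on each candidate support are illustrative choices, and the search is only feasible for very small n and N.

    from itertools import combinations
    import numpy as np

    def delta_0_bruteforce(A, b, tol=1e-10):
        # Search supports of increasing size until some A_T y_T = b is consistent.
        # The cost grows like sum_k C(N, k), which is why Delta_0 is impractical.
        n, N = A.shape
        if np.linalg.norm(b) <= tol:
            return np.zeros(N)
        for k in range(1, N + 1):
            for T in combinations(range(N), k):
                cols = list(T)
                y_T, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
                if np.linalg.norm(A[:, cols] @ y_T - b) <= tol:
                    x = np.zeros(N)
                    x[cols] = y_T
                    return x
        return None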
Consequently, one then seeks alternative decoders ∆̂ where ∆̂(b) can be computed more efficiently. In this section, we review various such decoders and focus on the conditions under which they are guaranteed to recover k-sparse signals. Here, we focus on two classes of decoders that rely on convex optimization and greedy approaches, respectively. In the latter case, we include the quintessential greedy algorithms matching pursuit (cf., [39]) and orthogonal 1Here and throughout a & b (resp. a . b) should be read as a ≥ Cb (resp. a ≤ Cb) for some constant C, independent of a and b. 3 Chapter 1. Introduction and Overview matching pursuit (cf., [43]) since they provide inspiration for more sophisticated greedy approaches with better guarantees in the context of compressed sensing. 1.3.1 Convex relaxation As mentioned before, the ∆0 decoder entails solving (1.2) which is not only non-convex, but also combinatorial in nature. To alleviate these problems, one may wish to replace the `0 norm in (1.2) with the convex `1 norm. This yields the decoder ∆1 and its associated optimization problem (famously called the basis pursuit problem [14]): ∆1(b) := argmin y ‖y‖1 subject to Ay = b. (1.3) Here, it is natural to ask under what conditions the decoders ∆0 and ∆1 yield identical vectors. When such conditions are satisfied, the sparsest vector x that satisfies b = Ax can be found by solving (1.3), i.e., by constrained `1 minimization. Such an optimization problem can be posed as a linear program [14] if the variables and matrix are real valued and as a second order cone program (SOCP) [37] in the complex-valued setting. Moreover, spurred by the advent of compressed sensing, the last few years have witnessed the introduction of specialized algorithms designed for solving (1.3) efficiently in the sparse recovery setting, e.g., [54], [25], [19]. In other words, ∆1 is numerically tractable in the sense of (C3). To return to the question at hand, we are interested in determining conditions under which the sparsest solution can be found using the numerically tractable ∆1 decoder. Donoho and Huo proved that ∆0(Ax) and ∆1(Ax) yield x if x is sufficiently sparse and A is composed of two orthogonal bases [23]. Donoho and Elad later extended this result to the case of general non-orthogonal dictionaries [22]. They derived sufficient conditions on how sparse the underlying vector x should be so that ∆0(Ax) is: (i) unique, and (ii) attainable via `1 minimization. To that end, they introduced the notion of spark, the smallest number of linearly de- pendent columns of a matrix, and they showed [22] (cf. [23]) that ∆0(Ax) is unique if ‖x‖0 < spark(A) 2 . (1.4) Donoho and Elad [22] also provided sufficient conditions for the equivalence of ∆0 and ∆1. These results depend on µ(A), the mutual coherence of A (assumed to have unit-norm columns), where µ(A) := max i6=j |〈ai, aj〉|. (1.5) 4 Chapter 1. Introduction and Overview Here ai is the ith column of A. Specifically, they showed that if ‖x‖0 < 1 + 1/µ(A) 2 , (1.6) then ∆0(Ax) = ∆1(Ax). It is worth mentioning here that for any A, where (N > n), the spark and mutual coherence satisfy the relationship spark(A) > 1/µ(A) [22]. Consequently, a quick calculation shows that if inequality (1.6) is satisfied, then ∆1(b) is necessarily unique and is the sparsest vector satisfying Ax = b. In short, if x is sparse enough x = ∆1(Ax). 
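As an illustration of the coherence-based condition (1.6), the following minimal Python sketch computes µ(A) after normalizing the columns and reports the resulting sufficient sparsity level; the random Gaussian matrix used here is an arbitrary stand-in, not one of the dictionaries studied in [22, 23].

    import numpy as np

    def mutual_coherence(A):
        # mu(A): largest absolute inner product between distinct unit-norm columns, as in (1.5).
        A_n = A / np.linalg.norm(A, axis=0, keepdims=True)
        G = np.abs(A_n.T @ A_n)
        np.fill_diagonal(G, 0.0)
        return G.max()

    A = np.random.randn(100, 300) / np.sqrt(100)   # illustrative random dictionary
    mu = mutual_coherence(A)
    print("mu(A) =", mu, "; recovery guaranteed for ||x||_0 <", (1 + 1.0 / mu) / 2)

The resulting guarantee is much smaller than n/2; Section 1.3.3 discusses this gap.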
Note here that the above sufficient conditions on the equivalence of ∆0 and ∆1 were independently derived by Donoho and Elad [22], and Gribonval and Nielsen [28]. These results apply to vectors in both the real and the complex spaces as generalized by Tropp [51]. The above results depict stringent sufficient conditions for identifying the exact unique solution of (1.2) via solving (1.3). The later compressed sensing results (cf. [8,11,12,21, 53]) demonstrate that these conditions are pessimistic in the sense that one can obtain very good reconstruction that is also stable and robust under much weaker conditions with high probability for certain types of matrices A (see, e.g., [12, 53]). We will survey these results in Section 1.4, but for now we turn back to decoders based on greedy algorithms. 1.3.2 Greedy algorithms As another alternative to the decoder ∆0, which involves solving (1.2), one could use decoders based on greedy algorithms such as orthogonal matching pursuit (OMP) [43]. Such decoders are algorithmic in nature, and in general do not explicitly minimize a cost function. However, in the context of sparse recovery, it has been shown that under certain conditions that are remarkably similar to those related to ∆1 (see Section 1.3.1), OMP can recover a sparse signal exactly [50]. Below we briefly describe matching pursuit [39], the predecessor of OMP. We then de- scribe OMP, and present the results in [50] pertaining to the conditions under which OMP can solve (1.2). Later, we describe compressive sampling matching pursuit (CoSaMP) [41], a more sophisticated greedy algorithm, which provides performance guarantees in the spirit of (C2) in the compressed sensing context. We will also see that these guaran- tees are similar to (but weaker than) the guarantees provided by the ∆1 decoder based on `1 minimization. We now introduce some notation used here and throughout. Given a vector v ∈ RN , and an index set T ⊂ {1, 2, ..., N}, we denote the complement of T by T c := {1, 2, ..., N} \ T . Moreover, vT denotes the vector with entries vT (j) = v(j) for all j ∈ T and vT (j) = 0 for j ∈ T c. Finally, vk ∈ RN denotes the vector formed by preserving the k largest coefficients of v and setting the remaining ones to zero. 5 Chapter 1. Introduction and Overview Matching Pursuit Matching pursuit (MP) [39] is a greedy algorithm that iteratively updates its estimate x (i) mp of the signal and a corresponding residual r(i) = b − Ax(i)mp. In particular, at every iteration i, it selects a column of A whose inner product with the residual is maximal. It then updates its estimate of the signal using x (i−1) mp , r(i−1), and the newly selected column of A. The corresponding residual r(i) (whose norm converges asymptotically to zero if A is in general position [39]) is then computed and the process is repeated. The MP algorithm with stopping criterion ‖Axmp − b‖2 <  is described in Algorithm 1. Algorithm 1 Matching pursuit 1: Initialize r(0) = b and x (0) mp = 0, i = 0. 2: while ‖Ax(i)mp − b‖2 ≥  do 3: Increment the iteration count i← i+ 1. 4: Select q(i) ∈ arg max j=1,...,n |a∗jr(i)|. 5: Update x (i) mp|q(i) = x(i−1)mp |q(i) + a∗q(i)r(i−1) and r(i) = b− Ax(i)mp 6: end while Orthogonal matching pursuit Orthogonal Matching Pursuit [43] proceeds in a manner similar to MP, but recomputes the estimates of the signal at each iteration using all the columns of A selected thus far. This guarantees convergence in n steps, highlighting one main reason why OMP is typically preferred to MP. 
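Before continuing with OMP, a minimal Python sketch of matching pursuit as in Algorithm 1, assuming unit-norm columns of A, is given below; the stopping parameters are illustrative.

    import numpy as np

    def matching_pursuit(A, b, eps=1e-6, max_iter=1000):
        # Algorithm 1: update one coefficient per iteration using the column
        # most correlated with the current residual.
        n, N = A.shape
        x = np.zeros(N)
        r = b.copy()
        it = 0
        while np.linalg.norm(r) >= eps and it < max_iter:
            it += 1
            corr = A.T @ r                    # inner products a_j^* r
            q = int(np.argmax(np.abs(corr)))  # most correlated column
            x[q] += corr[q]                   # update only the selected coefficient
            r = b - A @ x                     # recompute the residual
        return x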
Moreover, note that if the signal is k-sparse and OMP recovers it, then it would do so in k iterations. This renders OMP computationally efficient, in that it requires only few iterations to recover a sparse signal, where each iteration is relatively cheap to compute. Defining Ω to be a set that contains the indices of the columns of A selected up until the ith iteration, OMP (with stopping criterion ‖Axomp − b‖2 < ) is summarized in Algorithm 2 (cf. [50]). Conforming with the decoder notation used above, we denote the final output of OMP (with input b and A) by ∆omp(b). Tropp [50] showed that the sufficient condition (1.6) for the equivalence of the decoders ∆0(Ax) and ∆1(Ax) (e.g., [22]) holds for ∆omp(Ax) as well. Moreover, he generalized this result using the so called cumulative coherence function [49, 50] of a dictionary A, which he defined as µ1(s) := max|Λ|=s max i/∈Λ ∑ j∈Λ |a∗jai|. (1.7) 6 Chapter 1. Introduction and Overview Algorithm 2 Orthogonal Matching Pursuit 1: Input: b ∈ Rn, A ∈ Rn×N 2: Initialize r(0) = b, Ω = {} and x(0)omp = 0, i = 0. 3: while ‖Ax(i)omp − b‖2 ≥  do 4: Increment the iteration count i← i+ 1. 5: Select q ∈ arg max j=1,...,n |a∗jr(i)|, 6: Set Ω← Ω⋃{q} 7: Update x (i) omp|Ω = argmin y ‖b− AΩy‖2. (Equivalently, x(i)omp|Ω = A†Ωb.) x (i) omp|Ωc = 0. r(i) = b− Ax(i)omp 8: end while 9: Output x (i) omp In words, µ1(s) is the largest sum of inner products between one column of A and s other columns. This puts us in a position to summarize the early results on both OMP and `1 relaxation with a theorem. Theorem 1.3.1. (adapted from [50] and [51]) Given b ∈ Rn and A ∈ Rn×N , n ≤ N , with b = Ax, x ∈ RN . If ‖x‖0 = k < 1 + 1/µ(A) 2 , or more generally if µ1(k) + µ1(k − 1) < 1, then x is the sparsest vector such that Ax = b, and x is unique. Moreover, ∆1(Ax) = ∆omp(Ax) = x. In words, x can be recovered perfectly by OMP and `1 minimization. 1.3.3 Bounds on the sparsity level k It is known (e.g., [57], [47]) that a lower bound on the coherence of a dictionary is µ ≥ √ N − n n(N − 1) . (1.8) When N is much larger than n, we can see that this bound reduces to µ & n−1/2. The consequences on Theorem 1.3.1 are clear; in the best case (i.e., when µ is smallest) ∆1 and ∆omp can only guarantee recovery of k-sparse signals with k < 1+1/µ(A) 2 . √ n. Compare this to the fact that given a matrix A ∈ Rn×N in general position, ∆0 guarantees the 7 Chapter 1. Introduction and Overview recovery of k-sparse vectors with k < n/2. Clearly, there is a large gap between the dimensions required for recovery by the ideal decoder ∆0 and those that are sufficient for recovery with ∆omp and ∆1. We will next see that for certain types of matrices, the compressed sensing literature shows that the above result is pessimistic in that one can guarantee recovery for signals with sparsity k . n/ (log(N/k))a, for some positive a that depends on the matrix. Close examination of this bound on k reveals that it is significantly better than √ n. However, before we present these results, we must now introduce compressed sensing. 1.4 Compressed sensing In the last few years, the advent of compressed sensing has spurred a resurgence in the field of sparse signal processing with contributions from the applied mathematics, electri- cal engineering, geometric functional analysis, and theoretical computer science commu- nities. Compressed sensing was initiated with a series of papers by Candès, Romberg, Tao (e.g., [11, 12]) and Donoho [21, 53]. 
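For concreteness, a minimal Python sketch of OMP as in Algorithm 2 follows; it uses a dense least-squares solve on the selected columns at each iteration rather than an incremental QR update, a simplification made purely for readability.

    import numpy as np

    def omp(A, b, eps=1e-6):
        # Algorithm 2: re-fit on all columns selected so far at every iteration.
        n, N = A.shape
        x = np.zeros(N)
        r = b.copy()
        support = []
        while np.linalg.norm(r) >= eps and len(support) < n:
            q = int(np.argmax(np.abs(A.T @ r)))   # most correlated column
            if q in support:                      # no new column selected: stop
                break
            support.append(q)
            coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)  # x|_Omega = A_Omega^+ b
            x = np.zeros(N)
            x[support] = coef
            r = b - A @ x
        return x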
Compressed sensing has succeeded in capturing the attention of such diverse communities partly because it provides a new paradigm for data acquisition which stands in contrast with the “traditional” approach. Currently, most data acquisition technologies operate by “oversampling” the signals that one wishes to ac- quire (here, we use the term oversampling in a specific sense that we will clarify shortly). Later, the signal is processed for compression, for example, by applying transform coding. This entails representing the acquired signal in an appropriate transform domain where it is compressible, i.e., its transform domain representation is approximately sparse. The coefficients in the transform domain are sorted and only the largest ones are retained while the rest are set to zero. If the transform domain representation is a decomposition in an orthonormal basis, then the distortion is equal to the energy of the zeroed coeffi- cients. Consequently, the process is lossy but the loss is small because the signal at hand is compressible. Moreover, the process is adaptive in that the indices of the retained coefficients differ from signal to signal, depending on which basis elements represent each signal best. To summarize, in the above paradigm one acquires many more samples than the final number of coefficients that are retained for storage (or transmission) and the choice of coefficients to keep is not fixed; in that sense, this approach is wasteful and may be computationally intensive. It places a significant computational burden at the sensor device which must perform the compression stage prior to data transmission or storage. Otherwise if one chooses not to compress the data, the sensor must store or transmit large amounts of data which may incur prohibitive costs in terms of memory or power consumption. In essence, compressed sensing is motivated by the following questions, posed by Donoho [21]: “Why go to so much effort to acquire all the data when most of what we get will be thrown away? Can’t we just directly measure the part that won’t end up being 8 Chapter 1. Introduction and Overview thrown away?” Thus, compressed sensing is a data acquisition paradigm that attempts to perform both the sampling and dimensionality reduction simultaneously under the assumption of sparsity [8]. It is noteworthy that while sparsity has played a fundamental role in modern signal processing, as evidenced by the applications briefly discussed in Section 1.1, compressed sensing is novel in that it utilizes the sparsity of signals to offer alternative data acquisition techniques [8]. For ease of comparison, Figures 1.1 and 1.2 depict a simplified conventional data acquisition paradigm and a possible compressed sensing paradigm, respectively. Examining Figure 1.2, we see that compressed sensing entails taking n “generalized” linear measurements of a signal z ∈ RN , with n  N . These measurements take the form of inner products of z with the (appropriately chosen) vectors φi ∈ RN , i ∈ {1, ..., n}, i.e., we measure 〈φi, z〉. Note that these measurements are non- adaptive. In other words, the choice of the vectors φi does not depend on the particular signal z (though it may depend on particular knowledge of the signal class that z belongs to). In short, one may say that compressed sensing places a smaller computational load on the measurement side of an acquisition/reconstruction system by collecting fewer measurements than the signal ambient dimension and then bypassing a transform coding stage. 
On the other hand, while transform coding allows for linear reconstruction of the signal, in compressed sensing one must use a non-linear decoder (such as ∆1) to recover the signal. Thus, compressed sensing transfers the computational burden from the encoder (measurement, coding) side to the decoder (reconstruction) side, where one may typically have more computational resources available. In fact, in Section 1.6 we will see examples of compressed sensing applications that specifically benefit from this. Figure 1.1: A simplified diagram showing a “traditional” data acquisition paradigm. An analog signal, say f(t), is first sampled at an appropriate rate to yield the samples z(`) which are then quantized, giving zq ∈ AN (where A is the quantization alphabet). Together these two stages comprise analog to digital conversion. Once the signal is digitized, it is usually compressed. For example the compression stage may begin by transforming the quantized signal into an appropriate domain to obtain the coefficients x = B−1zq. The largest coefficients are then retained to yield xk ∈ ΣNk . 9 Chapter 1. Introduction and Overview Figure 1.2: A simplified diagram of a compressed sensing data acquisition and storage paradigm. An analog signal f(t) is measured with a compressed sensing system. This op- eration can be conceptualized as an implicit sampling stage yielding the vector z ∈ RN . This is followed by a generalized measurement stage giving a vector of measurements b ∈ Rn, n  N , where b = Φz. b is then quantized to bq. For examples of how sampling and measurement using Φ can be done without “actually sampling” the signal, see Section 1.6. In what follows, we briefly review the essential elements to compressed sensing and the conditions they must satisfy to allow for successful recovery. We focus on the so- called “encoder” matrix, the signal to be acquired, and the decoders used to recover the signal from its compressed sensing measurements. We defer a discussion of quantization until Section 1.7. 1.4.1 Compressed sensing as a sparse approximation problem Let z ∈ RN be the signal to be acquired and suppose that it admits a sparse (or com- pressible) representation in a known transform domain represented by the orthonormal matrix B ∈ RN×N . In other words, we have z = Bx where x is sparse (or compress- ible). Next, suppose that our acquisition paradigm consists of taking inner products of z with measurement vectors {φi}ni=1, where n ≤ N . Thus, we acquire n measurements bi = 〈φi, z〉, i ∈ {1, ..., n}. In matrix notation, this can be represented succinctly as b = Φz, where the measurement matrix Φ is composed of the rows φi and where b ∈ Rn is simply a vector of the measurements bi, i.e., b = [b1, b2, ...bn] T . We can now write b = Φz = ΦBx = Ax. (1.9) Above, we define the matrix A := ΦB, A ∈ Rn×N and we occasionally refer to it as the encoder. Clearly, one can model the addition of errors, e.g., quantization error and measurement noise, to the idealized equation above. In this case, the measurements are given by b = Ax+ e, (1.10) 10 Chapter 1. Introduction and Overview where e denotes the combined error. Examining (1.10) above, one quickly observes that the problem at hand is a sparse approximation problem as discussed in Section 1.2. As mentioned before, the compressed sensing acquisition process is non-adaptive. No matter what sparse vector x is, one measures and stores Ax + e, with a given encoder A. 
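The measurement model (1.10) can be simulated directly; in the following minimal sketch the dimensions, the random orthonormal basis B, and the noise level are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    N, n, k = 256, 80, 5                                  # illustrative dimensions
    B = np.linalg.qr(rng.standard_normal((N, N)))[0]      # stand-in orthonormal sparsity basis
    Phi = rng.standard_normal((n, N)) / np.sqrt(n)        # measurement matrix
    x = np.zeros(N)
    x[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)  # k-sparse coefficients
    z = B @ x                                             # signal in the ambient domain, z = Bx
    e = 1e-3 * rng.standard_normal(n)                     # measurement error
    b = Phi @ z + e                                       # b = Ax + e with A = Phi B, as in (1.10)
    A = Phi @ B                                           # the encoder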
Thus, the choice of A is critical since it must allow for the recovery of x from Ax+ e via some robust, stable and computationally tractable decoder. In the next section, we review some of the results pertaining to the decoders ∆1 and ∆CoSaMP in the compressed sensing setting. We focus on the sufficient conditions on A that guarantee stability and robustness of the above decoders, as identified in property (C2) of Section 1.2. We also survey results on classes of random matrices that satisfy these conditions, albeit with high probability on an appropriate probability space. 1.4.2 Preliminaries The k-restricted isometry2 constants δk of a matrix A, introduced by Candès, Romberg and Tao (see, e.g., [10]) are defined as the smallest constants satisfying (1− δk)‖v‖22 ≤ ‖Av‖22 ≤ (1 + δk)‖v‖22 (1.11) for every v ∈ ΣNk (recall that ΣNk is the set of all k-sparse vectors in RN). Using the notation of [58], we say that a matrix satisfies RIP(k, δ) if δk < δ. The restricted isometry constants play an important role in the theory of compressed sensing. We introduce them here since many of the conditions for perfect recovery of k-sparse vectors (e.g., by CoSaMP or `1 minimization) are formulated in terms of δ(a+1)k , for some a > 1. Moreover, the bounds on the behaviour of the error when decoding signals in non-ideal situations are formulated in terms of these constants. Remark. It is easy to see that the above constants are called restricted isometry constants because δk provides bounds on how well A acting on a k-sparse vector preserves distances. For example, the smaller δk is, the closer the action of A on k-sparse vectors is to that of an orthonormal matrix. 1.5 Compressed sensing decoders and recovery guarantees The compressed sensing literature has grown tremendously in the last few years, and an exhaustive survey is all but impossible. On the other hand, one can easily trace the origins of this field to a series of papers by Donoho and by Candès, Romberg, and Tao 2Given two normed vector spaces X and Y , a linear isometry is a linear map A : X 7→ Y such that ‖Ax‖ = ‖x‖ ∀x ∈ X . Thus, isometries preserve distances. 11 Chapter 1. Introduction and Overview (cf. [9–12, 21, 53]). In this section, we sample some of these results and also focus on stability and robustness of the decoders ∆1 and ∆CoSaMP . 1.5.1 Convex relaxation Here, we begin to distinguish between two types of reconstruction guarantees, non- uniform and uniform. In the former case, the signal to be measured is fixed, and the results hold with high probability on the draw of the encoder matrix from an appropriate distribution. On the other hand, uniform recovery results hold for any signal x (in a given class Λ ⊂ RN) and a fixed encoder A. Non-uniform recovery As an example of non-uniform guarantees, we present a result from [9]. There, the authors tackle the problem of sparse recovery from incomplete frequency information. Thus, the matrix A consists of n rows (chosen at random) of the N ×N discrete Fourier transform matrix. Such a problem, for example arises in Magnetic Resonance Imaging and astrophysics [8] where an object is sensed (or observed) by measuring selected fre- quency coefficients, i.e., n rows of the Fourier matrix. In that case, Candès, Romberg and Tao [9] proved the following theorem regarding the recovery of a k-sparse signal x. Theorem 1.5.1. ( [9]) Let x be a k-sparse signal supported on an unknown set T and suppose that we are given n of its Fourier coefficients with frequencies selected uniformly at random. 
If n ≥ Cak logN , then with probability at least 1−O(N−a), ∆1(Ax) = x. Note that the above result requires careful interpretation. Up until now, we have generally been interested in fixing a matrix A, and examining the sparsity level of sig- nals that we can reconstruct via certain decoders. Here, for each individual k-sparse signal, n Fourier coefficients are selected at random. In other words, the matrix A is no longer fixed but is drawn randomly every time we wish to “measure” a new signal. For that reason, guarantees in the sense of Theorem 1.5.1 are called non-uniform, while the results that will follow next, based on the restricted isometry constants of a fixed matrix A, will be uniform. This type of nuance is important in interpreting compressed sensing results and specifically in interpreting some of the contributions of this thesis, especially those in Chapter 2, where we extend both uniform and non-uniform known stability and robustness results on ∆1 to the class of decoders ∆p, p < 1. Remark. Note that in Theorem 1.5.1 we only require k . n/ logN , and compare that to the results of Theorem 1.3.1 where we required that k . √ n (see the discussion in Section 1.3.3). This large improvement in the allowable number of non-zeros of x is a hallmark of compressed sensing results. 12 Chapter 1. Introduction and Overview Uniform recovery, stability and robustness As mentioned in Section 1.4.2, Candes and Tao introduce and use the notion of the restricted isometry constants [11, 12] to prove the following theorem. Theorem 1.5.2. ( [11]). Suppose that x is k-sparse. If δak + aδ(a+1)k < a− 1 (for some positive a such that ak is an integer), then ∆1(Ax) = x. Earlier, we differentiated between signals that are strictly sparse, i.e., have many zero coefficients, and those that are compressible. The results presented thus far focused on sparse signal recovery. However, it is empirically observed that many important classes of signals such as natural images, speech, and music are highly compressible, but not exactly sparse. This is evidenced by such algorithms as JPEG, JPEG2000 and MP3 which only code the highest transform coefficients and discard the rest, thereby incurring a small distortion. In light of this, an important question that arises [8] is how well one can recover a signal that is just nearly sparse. In this context, let x be an arbitrary vector and let xk be a vector obtained by retaining only the k coefficients of x with the highest magnitude and setting the rest to zero (as would be done in a compression algorithm). It turns out that the following theorem holds [10]. Theorem 1.5.3. ( [10]) Let x be arbitrary and suppose that δak+aδ(a+1)k < a−1. Then ‖∆1(Ax)− x‖2 ≤ C ‖x− xk‖1√ k . (1.12) For reasonable values of δ(a+1)k, the constant C is well behaved; e.g. C = 8.77 when δ4k = 1/5. Stated in words, this means that ∆1 stably recovers the k-largest entries of an N - dimensional unknown vector x from n measurements when δak + aδ(a+1)k < a− 1. Next, we consider the general model where we also account for a noise vector e, i.e., we have b = Ax+ e, (1.13) where x is compressible and ‖e‖2 < . In such a case, it is natural to modify ∆1 in a way that favors approximate reconstruction. Specifically, we use the decoder ∆1 defined via ∆1(b) = argmin y ‖y‖1 subject to ‖Ay − b‖2 ≤ . (1.14) The following theorem holds. Theorem 1.5.4. ( [10]) Suppose that x is arbitrary and let δak + aδ(a+1)k < a− 1. Then ‖∆1(Ax+ e)− x‖2 = C1+ C2 ‖x− xk‖1√ k . (1.15) 13 Chapter 1. 
Introduction and Overview For reasonable values of δak, the constants are well behaved; e.g. C1 = 12.04 and C2 = 8.77 when δ4k = 1/5. 1.5.2 Greedy algorithms Non-uniform recovery and OMP As discussed before, OMP is a simple and effective greedy algorithm that can be employed for solving the sparse recovery problem. In the compressed sensing setting, Tropp and Gilbert [52] prove the following theorem showing that OMP can non-uniformly recover k-sparse signals from n linear measurements when n & k log(N). Theorem 1.5.5 ( [52]). Fix  ∈ (0, 1) and choose n > Ck ln(N/). Suppose that x ∈ RN is k-sparse and let A be such that its entries are drawn i.i.d. from the standard Gaussian distribution. Let b = Ax, then OMP can reconstruct x with probability exceeding 1− . Note that this is a non-uniform result because it requires that the measurement matrix A be drawn randomly every time a new measurement is to be made. Nonetheless, there is a large improvement in the allowable number of non-zeros of x over the older mutual coherence based results discussed in Section 1.3.3 - albeit for Gaussian matrices. On the other hand, it has been shown that in the setting above, OMP must fail for at least one k-sparse signal [44]. In other words, the non-uniformity cannot be miti- gated. Moreover, there are no known stability and robustness guarantees on OMP to accommodate compressible signals and noisy observations. Uniform recovery, stability and robustness To circumvent the issues discussed above, [42] proposed a greedy algorithm with uniform guarantees (slightly weaker than those required by (C2)) which was called Regularized Orthogonal Matching Pursuit (ROMP). Later, this algorithm was improved by Needell and Tropp [41] who introduced the CoSaMP algorithm, our next subject. In particular, CoSaMP not only offers uniform recovery guarantees but is also stable and robust in the sense of (C2). Remark. Note that CoSaMP is remarkably similar to another greedy algorithm, Subspace Pursuit [16] which was developed by Dai and Milenkovic around the same time. The two algorithms share almost identical recovery guarantees. For ease of exposition, we choose to refer to CoSaMP, but stress that this choice is arbitrary. CoSaMP [41] is a greedy algorithm which is based on OMP (and ROMP) but differs in some critical aspects that allow for stronger guarantees that OMP does not provide. One of the major differences with OMP is that at each iteration CoSaMP initially adds the indices of up to 2k columns of A to the estimated support of the signal. It then “prunes” down the estimated support size back to k. In this sense, it differs by allowing 14 Chapter 1. Introduction and Overview the removal of columns that were previously chosen. Algorithm 3 describes the flow of this decoding scheme. As before, we denote the final output of CoSAMP with input b (and A) with ∆CoSaMP (b) to conform with the decoder notation used throughout. Algorithm 3 Compressive Sampling Matching Pursuit 1: Initialize r(0) = b, Ω = {} and x(0)cosamp = 0, i = 0. 2: while ‖Ax(i)cosamp − b‖2 ≥  do 3: i← i+ 1 4: y = A∗r(i) 5: Select Ω = supp(y2k) (choose the largest 2k coefficients of y) T = Ω ∪ supp(x(i−1)cosamp) 6: Update uT = A † T b uT c = 0 x (i) cosamp=uk. r(i) = b− Ax(i)cosamp 7: end while Needell and Tropp [41] prove the following theorem regarding the performance of CoSaMP. Theorem 1.5.6. Suppose that A ∈ Rn×N satisfies RIP(4k, c) for an appropriately small constant c (c ≤ 0.025 works). 
Let x be arbitrary and let b = Ax + e be a vector of measurements contaminated with arbitrary noise e. For a given precision parameter  ‖∆CoSaMP (Ax+ e)− x‖2 ≤ Cmax { , ‖x− xk‖1√ k + ‖e‖2 } . (1.16) Note that the above result is very similar to Theorem 1.5.4. The only important dis- crepancy is that there is a gap between the conditions under which the two theorems hold. Viewed in a different light, their stability constants differ. In particular, ∆1 offers better reconstruction under weaker conditions, at the expense of being more computationally intensive. On the other hand, the theorems presented in this section offer only sufficient conditions and only upper bounds on the errors. One must thus be careful about draw- ing sweeping conclusions about the performance of algorithms from these conditions and bounds. However, numerical experiments bear out the conclusion that ∆1 is indeed more powerful. To highlight this point, we conduct an experiment where we fix a 100 × 300 matrix A, whose entries are drawn i.i.d. from N (0, 1/n). Then for each k ∈ {1, 2, ..., 35}, we generate a k-sparse signal x ∈ RN with its non-zero entries set randomly to ±1 and we compute the errors ‖∆1(Ax) − x‖2 and ‖∆CoSaMP (Ax) − x‖2. For each such k, the experiment is repeated 50 times. The results are reported in Figure 1.3. Figure 1.3 indicates that both approaches recover the signal perfectly when k is small; however, as we start increasing k, x becomes more and more like a compressible signal and 15 Chapter 1. Introduction and Overview clearly ∆1 provides better stability constants. This is in agreement with the theoretical results presented in this section. 5 10 15 20 25 30 35 10 20 30 40 50 number of nonzero coefficients of x e rr o r (in  l 2 n o rm ) mean error using ∆1 mean error using ∆CoSaMP maximum error using ∆1 maximum error using ∆CoSaMP Figure 1.3: A comparison of the average and worst case errors when recovering 50 signals x, with ∆1 and ∆CoSaMP as the number of non-zero coefficients in x increases. The non-zeros in x are chosen to be of unit magnitude (and random signs). Here, we use a fixed 100 × 300 matrix A whose entries are drawn i.i.d. from N (0, 1/n). 1.5.3 Uniform recovery guarantees with random matrices The results presented in the previous sections show that, for example, the numerically tractable ∆1 and ∆  1 decoders satisfy (C1)-(C3) as long as the matrix A satisfies an appropriate restricted isometry condition. Thus, it is important to ask what classes of matrices satisfy conditions of the form (A1) δak + aδ(a+1)k < a− 1, (1.17) and for what range of k. Since verifying whether a particular matrix satisfies RIP(s, δ) for some s ∈ N is combinatorially difficult3, it would seem that these conditions are not useful. However, that is certainly not the case. In fact, it has been shown that several 3One would have to check the singular values of each of the ( N s ) sub-matrices of A of size n× s. 16 Chapter 1. Introduction and Overview important classes of random matrices satisfy the condition (A1) for relatively large values of k, allowing for recovery via ∆1 and ∆  1. Remark. The discussion above, with minor modification, applies to ∆CoSaMP . The major classes of matrices known to satisfy the restricted isometry condition (A1) include random matrices with sub-Gaussian entries [40] and partial bounded orthogonal matrices (cf. [45]). We call a random variable X sub-Gaussian if it satisfies P (|X| > t) ≤ Ce−ct2 for all t > 0. 
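Returning to Algorithm 3, a minimal Python sketch of CoSaMP is given below; the dense least-squares solve at each iteration and the simple halting rule are simplifications made for readability.

    import numpy as np

    def cosamp(A, b, k, eps=1e-6, max_iter=50):
        # Algorithm 3: correlate, merge 2k candidate indices with the current
        # support, least-squares fit on the merged set, then prune back to k terms.
        n, N = A.shape
        x = np.zeros(N)
        r = b.copy()
        for _ in range(max_iter):
            if np.linalg.norm(r) < eps:
                break
            y = A.T @ r
            omega = np.argsort(np.abs(y))[-2 * k:]                # supp(y_2k)
            T = np.union1d(omega, np.flatnonzero(x)).astype(int)  # merged support
            u = np.zeros(N)
            sol, *_ = np.linalg.lstsq(A[:, T], b, rcond=None)     # u_T = A_T^+ b
            u[T] = sol
            keep = np.argsort(np.abs(u))[-k:]                     # k largest entries of u
            x = np.zeros(N)
            x[keep] = u[keep]
            r = b - A @ x
        return x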
This class includes, for example, matrices whose entries are drawn i.i.d. from either the Gaussian or Bernoulli distribution. We call an n × N matrix a partial bounded orthogonal matrix if it is obtained by randomly selecting n rows of an N × N orthonormal matrix. This important class includes, for example, partial discrete Fourier transform matrices. Below, we briefly review what is known about these matrices and the probabilities with which they satisfy RIP(k, δ) for a given k and δ. Theorem 1.5.7. (adapted from [42]) Consider an n×N measurement matrix A and let k ≥ 1, δ ∈ (0, 1/2) and γ ∈ (0, 1). (i) If A is a sub-Gaussian matrix, then with probability 1− γ the matrix 1√ n A satisfies RIP(k, δ) provided that n ≥ C k δ2 log ( N k 1 δ2 ) . (ii) If A is a partial bounded orthogonal matrix, then with probability 1− γ, the matrix√ N n A satisfies RIP(k, δ) provided that n ≥ C(k log(N) δ2 ) log ( k log(N) δ2 ) log2(N). The constants above, denoted by C, depend only on the class of matrices at hand and the probability of failure γ. The proofs of the two parts of the above theorem and the dependence of C on γ can be found in [40] and [45], respectively. 1.6 Compressed sensing applications Compressed sensing is a field that has been largely driven by theoretical breakthroughs that have provided both motivation and direction to applications. In fact, the field is currently witnessing an ever-growing effort in putting compressed sensing ideas to diverse applications. In this section, we give a few examples involving the use of compressed sensing for signal acquisition and recovery. The single pixel camera In the previous section, we saw that compressed sensing involves acquiring measurements using inner products between the signal z ∈ RN and a collection of (random) vectors 17 Chapter 1. Introduction and Overview {φi}ni=1. One may thus wonder how in practice this could be applied without acquiring the signal z itself. The single-pixel camera (e.g., [24]) achieves this feat by making use of an “optical computer” of sorts. It is comprised of a digital micromirror device (DMD), two lenses, a single photon detector and an analog to digital converter [24]. The DMD consists of an array of N tiny mirrors, each of which corresponds to a “pixel” in z. The mirror configuration can be controlled and each tiny mirror can be made to reflect light either to or away from one of the lenses. In turn, the lens focuses the “aggregate” light on to the detector, which now reads the inner product (as a voltage) between the current mirror configuration (a vector of ones and zeros) and the pixels of z. This voltage is then digitized and the process is repeated for n different pseudo-random calibrations of the mirrors. System calibration allows for ±1 values in the entries of each φi and one thus obtains the compressed sensing measurement vector b = Φz. Recalling that images are compressible in an appropriate basis, one can then obtain an approximation of the original image using an appropriate compressed sensing decoder. In addition to the usual benefits of compressed sensing, this architecture allows the use of various types of detectors [24] including spectrometers and photodiodes that are, for example, sensitive to different wavelengths. Since the cost of an array of N such detectors may be high, this camera enables applications that would otherwise be prohibitively expensive. 
For all these reasons and because the single-pixel camera was one of the earlier hardware implementations of compressed sensing ideas, it has become one of the sensational stories in the field. Sparse MRI Another area where advances in compressed sensing have important practical applica- tions is medical imaging, particularly magnetic resonance imaging (MRI), in which one measures (or samples) the Fourier coefficients of images. Lustig et al., e.g., [38], identify the speed at which data can be collected in MRI as fundamentally limited by physical constraints. This implies that collecting more data requires a longer acquisition time. Lustig et al. [38] thus exploit the sparsity in magnetic resonance (MR) images (either in the pixel domain or other transform domains) coupled with compressed sensing results on partial Fourier matrices as motivation for systems that collect significantly fewer sam- ples than traditional MRI. This reduces acquisition time considerably. As before, with the measurements at hand, one can use compressed sensing decoders to stably recover estimates of the original image. 18 Chapter 1. Introduction and Overview Compressed sensing in seismology4 Seismic data is generally collected on multidimensional grids and involves Tera bytes of data [32]. A simplified version of how such acquisition may be performed follows. In brief, the acquisition process may use several sources placed at the surface of the Earth. These sources transmit energy through the subsurface, and the resulting wavefields are reflected, e.g., across interfaces of layers of rock. Subsequently, the response of each receiver from an array of receivers placed on the surface is collected and stored (e.g., as a function of time). The large data sizes (not to mention monetary costs) are thus partly due to the use of large arrays of such sources and receivers and partly due to recording the response of each receiver to each source. Motivated by compressed sensing ideas, Hennenfent and Herrmann [31] propose an alternate acquisition paradigm. They propose the use of fewer sources and receivers, which are placed on a randomly perturbed sub-sampled grid. Observing that seismic data is sparse, e.g., in the curvelet domain, they subsequently utilize stable compressed sensing decoders, such as ∆1, to reconstruct the full data volume. Also using ideas from compressed sensing, Herrmann and Lin [36] formulate a memory intensive problem in seismology, called the wavefield extrapolation problem, on small subsets of the data volume. This reduces the size of the associated operators and the problem can be solved using compressed sensing decoders with significant savings in computational effort. As before, this is made possible by the sparsity of the solution in an appropriate domain. Compressed sensing in astronomy Technology based on compressed sensing is in fact now in orbit around the Earth5. The Herschel satellite, recently launched by the European Space Agency6 is faced with the following problem. According to Bobin et al. [4], large amounts of data that Herschel col- lects must be transmitted to Earth but the transmission cost is too high. Moreover, since a small on-board CPU load is required [4], traditional compression techniques cannot be used. Since most of the computational load in compressed sensing is at the decoder side, by collecting compressed sensing measurements the data volume to be transmitted is reduced, and the above problems are expected to be mitigated. 
4One early manuscript proposing the use of constrained `1 minimization in the context of sparse approximation was in the area of reflection seismology, namely a 1973 paper of Claerbout and Muir [15]. 5See http://nuit-blanche.blogspot.com/2009/12/cs-hype-or-hope-herschel-news-sparsity.html. 6See http ://www.esa.int/science/herschel. 19 Chapter 1. Introduction and Overview 1.7 Quantization and compressed sensing The process of acquiring signals and transforming them into the digital domain involves two stages, sampling and quantization. In general, after sampling a signal, the samples must be quantized to a finite set of output values [1]. Moreover, this process is non- linear and non-invertible since an infinite set of input values is mapped into a finite set. Currently, analog to digital converters follow one of two broad strategies; either sampling slightly above the Nyquist rate and then finely quantizing the samples, or coarsely quantizing samples obtained at a rate much higher than Nyquist. Either way, there is a tradeoff involved between bandwidth and resolution (cf. [1]). Compressed sensing offers an alternative way of acquiring a signal. Specifically, if an analog signal can be approximated as a sum of a finite number of elements of a basis, then by (randomly) sampling the signal in a domain that is incoherent with the sparsity domain, one can recover the signal robustly by `1-minimization, for example, as we have seen before. However, that leaves the issue of quantization not addressed explicitly in a satisfactory manner. 1.7.1 Pulse code modulation for compressed sensing Quantization involves employing a discrete “alphabet” A to replace each measurement bj with a quantized measurement qj := b̂j ∈ A. A typical choice for such an alphabet is A = dZ, i.e., A = {. . . ,−2d,−d, d, 2d, . . .} for some step size d > 0. One common approach to quantization is known as Pulse Code Modulation (PCM). Its objective is to minimize ‖e‖2 = ‖b− q‖2 over q ∈ An. This directly implies minimizing |bj − qj | for each j, i.e., PCM entails quantizing each measurement separately to the nearest element of A. As a direct consequence, we obtain ‖b − q‖2 ≤ 12d √ n. Denoting the vector of PCM quantized compressed sensing measurements by qPCM, using any robust decoder ∆ then guarantees that ‖∆(qPCM)− x‖2 ≤ Cd √ n. (1.18) Examining (1.18) we observe a somewhat unexpected result. If one uses PCM quan- tization, the reconstruction error bound deteriorates with increasing the number of mea- surements n. It turns out (see Chapter 3) that this is an artifact of the normalization of the measurement matrix A (so that its columns have unit-norm in expectation). On the other hand, by choosing the entries of A to be standard i.i.d. random variables, e.g., according to N (0, 1), a decoder with robust recovery yields ‖∆(qPCM)− x‖2 ≤ Cd. (1.19) The above bound conforms with the intuitive expectation that recovery results should not deteriorate with more measurements. However, we still observe that the guarantee on the behavior of the error does not improve as n increases. One can make the argument 20 Chapter 1. Introduction and Overview that the compressed sensing recovery guarantees are pessimistic and that in practice the error may decrease with more measurements. Unfortunately, that is not the case as we now demonstrate numerically. To that end, we generate 100 k-sparse signals with k = 5. Each signal x has its non-zero entries supported on a random set T , but with magnitude 1/ √ k. 
Next, for n ∈ {100, 200, ..., 1000} we generate the measurements b = A(n)x, where A(n) is comprised of the first n rows of A. We then quantize b using PCM to obtain qPCM. For each of these quantized measurements q, we compute ∆1(qPCM) with  = √ nd/2. The average and maximum of the approximation error ‖∆1(qPCM)− x‖2 over the 100 signals are displayed, as functions of n/k, in Figures 1.4 and 1.5. As the theory suggests, the approximation error does not improve with increasing the number of measurements. Figure 1.4: The average performance of PCM quantization and reconstruction with the decoder ∆1. For this experiment the non-zero entries of x are constant and d = 0.01. In an attempt to improve the reconstruction error from quantized compressed sens- ing measurements, the existing literature has focused mainly on alternative methods (decoders) for reconstruction from PCM-quantized measurements and variants thereof, e.g., [5, 17, 26, 33, 35, 60]. However, results from frame theory show that the limitation lies not in the reconstruction technique but in the use of PCM quantization itself. In particular, let E be an n× k real matrix, and let K be a bounded set in Rk. For x ∈ K, suppose we obtain qPCM by quantizing the entries of b = Ex using PCM with alphabet 21 Chapter 1. Introduction and Overview Figure 1.5: The worst case performance of PCM quantization and reconstruction with the decoder ∆1. For this experiment the non-zero entries of x are constant and d = 0.01. A = dZ. Let ∆opt be an optimal decoder. Then, Goyal et al. show in [27] that[ E ‖x−∆opt(qPCM(x))‖22 ]1/2 & k n d (1.20) where the expectation is with respect to a probability measure on x that is, for example, absolutely continuous. This shows that reconstruction by means of alternative recon- struction algorithms from PCM-quantized compressed sensing measurements is limited in the improvement that it can offer. To see this, suppose that one is given the support T of a bounded k-sparse signal x, and its PCM-quantized measurement vector qPCM (which is a quantized version of b = Ax = ATxT ). Now, AT is an n × k real matrix and the above result can be applied. 1.7.2 Sigma-Delta quantization of finite frame expansions Motivated by the discussion above, in this thesis we propose a quantization scheme for compressed sensing measurements that does not rely on PCM. Instead, our proposed scheme is based on sigma-delta (Σ∆) quantization which, in the context of traditional data acquisition, is a well-established scheme used to quantize oversampled bandlimited functions. Moreover, Σ∆ schemes have been recently shown to be effective in quantizing 22 Chapter 1. Introduction and Overview overcomplete frame expansions in finite high-dimensional spaces (e.g., [2], [3]). Since our proposed quantization scheme relies on ideas from Σ∆ quantization for finite frame expansions, it is fitting to briefly introduce the subject here. Suppose that E ∈ Rn×k is of rank k, and let F be any left inverse of E. Following the nomenclature used in frame theory (e.g., [13]), we refer to the collection of the rows of E as the analysis frame and the columns of F as the synthesis (dual) frame. For any x ∈ Rk, let b = Ex be its frame coefficient vector, q ∈ An be its quantization, and let x̂ := Fq be its reconstruction using the dual frame. As mentioned before, PCM consists of encoding x by replacing each of its frame coefficients with the nearest element in the quantization alphabet A. 
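For concreteness, the following Python sketch illustrates this setup: it quantizes the frame coefficients b = Ex entrywise with PCM and reconstructs linearly with the canonical dual F = E†. The sketch is an illustration only; the dimensions, the step size d, and the Gaussian frame are arbitrary choices and are not taken from this thesis.

import numpy as np

def pcm_quantize(b, d):
    """Round each entry of b to the nearest element of the alphabet dZ."""
    return d * np.round(b / d)

# Illustrative sizes and step size (arbitrary choices).
rng = np.random.default_rng(0)
n, k, d = 200, 5, 0.01

E = rng.standard_normal((n, k))        # analysis frame (its rows)
x = rng.standard_normal(k)
b = E @ x                              # frame coefficient vector b = Ex
q = pcm_quantize(b, d)                 # PCM: entrywise nearest element of dZ

F = np.linalg.pinv(E)                  # canonical dual frame F = E† (a left inverse)
x_hat = F @ q                          # linear reconstruction x̂ = Fq

print(np.max(np.abs(b - q)) <= d / 2)  # entrywise quantization error is at most d/2
print(np.linalg.norm(x - x_hat))       # reconstruction error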
Reconstruction from PCM quantized measurements employs a dual frame F (typically the "canonical" dual F = E† is used) to obtain x̂ = Fq. However, PCM is a suboptimal scheme when E is redundant, i.e., when n > k, since it does not use the dependencies among the frame vectors [3].

On the other hand, Σ∆ schemes (see, e.g., [18]) are a class of recursive algorithms that explicitly make use of the dependencies between the redundant vectors of the reconstruction frame F to achieve robust, high precision quantization [3]. Here, we adopt the notation and description of [3] to introduce Σ∆ schemes. Given a finite set A ⊂ R, define the associated scalar quantizer by

Q(u) := argmin_{q ∈ A} |u − q|,   (1.21)

and let ρ : R^{r+1} → R be a fixed function known as the quantization rule. A general rth order Σ∆ scheme with alphabet A runs the following iteration [3] for m = 1, 2, . . . , n:

q_m = Q( ρ(u^1_{m−1}, u^2_{m−1}, . . . , u^r_{m−1}, b_m) ),
u^1_m = u^1_{m−1} + b_m − q_m,
u^2_m = u^2_{m−1} + u^1_m,
...
u^r_m = u^r_{m−1} + u^{r−1}_m.   (1.22)

Above, one usually chooses u^1_0 = u^2_0 = · · · = u^r_0 = 0. For example, a standard first order (r = 1) scheme uses the greedy rule ρ(u, b) = u + b, so that q_m = Q(u_{m−1} + b_m) and u_m = u_{m−1} + b_m − q_m (see the code sketch below). Now, let D ∈ R^{n×n} be the first order difference matrix defined by

D :=
[  1   0   0  ...   0   0 ]
[ -1   1   0  ...   0   0 ]
[  0  -1   1  ...   0   0 ]
[  .   .   .  ...   .   . ]
[  0   0   0  ...   1   0 ]
[  0   0   0  ...  -1   1 ]   (1.23)

(ones on the diagonal, −1 on the first subdiagonal, zeros elsewhere), and note that D is invertible. Importantly, it was shown in [34] that the linear reconstruction error of an rth order Σ∆ scheme can be bounded by

‖x − x̂‖2 = ‖F D^r u‖2 ≤ ‖F D^r‖2 ‖u‖2,   (1.24)

where u = [u_1, . . . , u_n]^T with u_m := u^r_m. Working with stable Σ∆ schemes, i.e., schemes with bounded ‖u‖2, but in the absence of further information about u (which is typically the case), a reasonable approach to controlling the error ‖x − x̂‖2 is to choose the dual frame F such that ‖F D^r‖2 is minimized. This is one underlying idea behind the contribution of [3], which additionally imposes a "smoothness" condition on E to obtain an upper bound on the error ‖Fq − x‖2 that decays as n^{−r}. In other words, using Σ∆ schemes in the context of quantization of frame expansions, one can obtain a decrease in the reconstruction error as n increases.

1.8 Contributions

Having introduced compressed sensing and discussed many of the most pertinent results in the literature, we now turn to the contributions of this thesis. In Figure 1.2 a compressed sensing acquisition paradigm is depicted. In particular, we see that it is comprised of a measurement stage in which one obtains the vector b = Ax (or b = Ax + e), and a quantization stage which yields a quantized version of b, say q. Subsequently, to obtain an estimate of x, one typically uses a non-linear decoder. In Section 1.5 we saw that the choice of the matrix A influences the existence and quality of recovery guarantees via the decoders ∆1 and ∆CoSaMP. Thus, the choice of encoder and decoder are intimately related, and successful approximation of x from Ax (or q) entails the use of robust and stable decoders ∆, so that the error ‖∆(Ax + e) − x‖2 is well behaved. By the same token, one would expect that a valid approach for dealing with quantized compressed sensing measurements would be to tie the decoder design to the specifics of the measurement and quantization processes. However, until recently, the compressed sensing literature has mostly focused on robustness to general errors and ignored the specifics of the errors introduced by quantization.
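Before detailing the contributions, the following minimal sketch makes the Σ∆ iteration (1.22) of Section 1.7.2 concrete for the first order greedy scheme mentioned there; the choice ρ(u, b) = u + b, the finite alphabet, and all dimensions are standard illustrative choices, not parameters prescribed in this thesis.

import numpy as np

def sigma_delta_1(b, alphabet):
    """First order greedy Sigma-Delta: q_m = Q(u_{m-1} + b_m),
    u_m = u_{m-1} + b_m - q_m, with u_0 = 0."""
    q = np.zeros_like(b)
    u = np.zeros(len(b) + 1)
    for m in range(len(b)):
        v = u[m] + b[m]
        q[m] = alphabet[np.argmin(np.abs(alphabet - v))]  # scalar quantizer Q of (1.21)
        u[m + 1] = v - q[m]
    return q, u[1:]

# Illustrative example (sizes and step size are arbitrary choices).
rng = np.random.default_rng(1)
n, k, d = 256, 4, 0.05
alphabet = d * np.arange(-50, 51)          # finite alphabet contained in dZ
E = rng.standard_normal((n, k)) / np.sqrt(n)
x = rng.standard_normal(k) / 10.0
b = E @ x

q, u = sigma_delta_1(b, alphabet)
x_hat = np.linalg.pinv(E) @ q              # canonical dual reconstruction x̂ = Fq
print(np.linalg.norm(x - x_hat))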
The few proposed approaches pertaining to quantization focus on reconstruction from PCM quantized measurements, which is limited in the quality of reconstruction error it can offer (see Section 1.7). In fact, [6] demonstrates that “compressed sensing of sparse signals followed by scalar quan- tization is inefficient both in terms of the rate utilization and the error performance.” A similar conclusion is reported, for example, in [26]. In light of the above discussion, the contributions of this thesis are two-fold, focus- ing on two aspects of the compressed sensing data acquisition/reconstruction paradigm: decoder design and quantization. 24 Chapter 1. Introduction and Overview 1.8.1 Decoding by non-convex optimization Since the underlying assumption in compressed sensing is that of a sparse signal, the ∆0 decoder is the intuitive choice; however its computational cost is prohibitive and alternative decoders are chosen. The ∆1 decoder is a favored choice and it offers robust and stable recovery, however the conditions under which it is guaranteed to recover x are stronger than those required by ∆0. To see this, note that any k-sparse signals x can be reconstructed from its measure- ment vector b = Ax using ∆0 if A satisfies δ2k < 1. Using ∆1, one would require a stronger condition, for example, of the type δ(a+1)k < a−1 a+1 for some a > 1 (this is a slightly stronger form of the condition 1.17). On the other hand, the ∆1 decoder is numerically tractable, stable, and robust. Other numerically tractable decoders such as ∆CoSaMP offer similar robustness and stability guarantees, but those guarantees are associated with even more stringent conditions than those of ∆1 (see Section 1.5). In this thesis, we analyze the performance of a class of compressed sensing decoders that rely on non-convex optimization. In particular, we focus on the decoders ∆p and ∆p defined by ∆p(b) := argmin y ‖y‖p subject to Ay = b, (1.25) ∆p(b) := argmin y ‖y‖p subject to ‖Ay − b‖2 ≤ , and (1.26) with 0 < p < 1. Like ∆1, these decoders are approximations of ∆0. However, unlike ∆1, the actions of these decoders are computed via non-convex optimization problems. Nevertheless, these can be solved, at least locally, still much faster than (1.2). Our results contribute to bridging the gap between the recovery guarantees of the idealized ∆0 decoder and those of the practical ∆1 decoder. We prove that by using the decoders ∆p and ∆  p one can guarantee robust and stable recovery in the sense of property (C2). We also prove that the recovery guarantees can be better than those for ∆1 and hold under weaker conditions; in other words, with these decoders we can recover vectors that are less sparse than possible with ∆1. We illustrate these conclusions with numerical experiments. Our contributions in this direction are presented in Chapter 2. 1.8.2 Σ∆ quantization for compressed sensing Second, we address the issue of quantization of compressed sensing measurements of k- sparse signals. In this setting, we propose the use of an rth order Σ∆ quantizer and an associated two-stage reconstruction algorithm. Specifically, suppose that x is sparse and that we quantize its compressed sensing measurement vector b = Ax to obtain qΣ∆. Our two-stage reconstruction algorithm consists of (i) Coarse recovery: any robust decoder applied to qΣ∆ yields an initial, “coarse” approximation x# of x, and in particular, the exact (or approximate) support T of 25 Chapter 1. Introduction and Overview x. 
(ii) Fine recovery: The rth order Sobolev dual of the frame AT applied to qΣ∆ yields a finer approximation x̂Σ∆ of x. We prove that the approximation error ‖x̂Σ∆ − x‖2 decays as the “redundancy” λ = nk increases. In fact, by using an arbitrarily high order Σ∆ scheme, we can make this decay faster than any power law (albeit with higher constants). For example, for an rth order Σ∆ scheme with quantization step size d, we have ‖x̂Σ∆ − x‖2 . ( k n )0.9(r−1/2) d, with high probability if A is drawn from an appropriate distribution and x is sparse enough. Comparing this result with the lower bound (1.20) on the error when using PCM, [ E ‖x−∆opt(qPCM(x))‖22 ]1/2 & k n d, one immediately sees the benefit of using the proposed rth order Σ∆ quantization scheme. As before, we illustrate our results with numerical experiments. The contributions in this direction are presented in Chapter 3. They include extending the results of [3] to random frames, which may be of independent interest. 26 Chapter 1. Introduction and Overview 1.9 References [1] P. Aziz, H. Sorensen, and J. Van Der Spiegel. Overview of sigma-delta converters. IEEE Signal Processing Magazine, 13(1):61–84, 1996. [2] J. Benedetto, A. Powell, and Ö. Yılmaz. Sigma-delta (Σ∆) quantization and finite frames. IEEE Transactions on Information Theory, 52(5):1990–2005, May 2006. [3] J. Blum, M. Lammers, A. Powell, and Ö. Yılmaz. Sobolev duals in frame theory and Sigma-Delta quantization. Journal of Fourier Analysis and Applications. Accepted. [4] J. Bobin, J. Starck, and R. Ottensamer. Compressed sensing in astronomy. IEEE J. Selected Topics in Signal Process, 2(5):718–726, 2008. [5] P. Boufounos and R. Baraniuk. 1-bit compressive sensing. In 42nd annual Conference on Information Sciences and Systems (CISS), pages 19–21. [6] P. Boufounos and R. G. Baraniuk. Quantization of sparse representations. Rice ECE Department Technical Report TREE 0701, 2007. [7] K. Brandenburg. MP3 and AAC explained. In Audio Engineering Society. Interna- tional conference, pages 99–110, 1999. [8] E. J. Candès. Compressive sampling. Proceedings of the International Congress of Mathematicians, Madrid, Spain, 2006. [9] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52:489–509, 2006. [10] E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59:1207–1223, 2006. [11] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):489–509, 2005. [12] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: uni- versal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406– 5425, 2006. [13] P. Casazza. The art of frame theory. Taiwanese Journal of Mathematics, 4(2):129– 201, 2000. [14] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1999. 27 Chapter 1. Introduction and Overview [15] J. Claerbout and F. Muir. Robust modeling with erratic data. Geophysics, 38:826, 1973. [16] W. Dai and O. Milenkovic. Subspace pursuit for compressive sensing signal recon- struction. IEEE Trans. on Info. Theory, 55(5):2230–2249, 2009. [17] W. Dai, H. Pham, and O. Milenkovic. Quantized compressive sensing. arXiv. [18] I. Daubechies and R. DeVore. 
Approximating a bandlimited function using very coarsely quantized data: A family of stable sigma-delta modulators of arbitrary order. Ann. of Math., 158(2):679–710, September 2003. [19] I. Daubechies, R. DeVore, M. Fornasier, and C. Güntürk. Iteratively re-weighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, 63(1):1–38, 2009. [20] D. Donoho. De-noising by soft-thresholding. IEEE transactions on information theory, 41(3):613–627, 1995. [21] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006. [22] D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via `1 minimization. Proceedings of the National Academy of Sciences of the United States of America, 100(5):2197–2202, 2003. [23] D. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47:2845–2862, 2001. [24] M. Duarte, M. Davenport, D. Takhar, J. Laska, T. Sun, K. Kelly, and R. Baraniuk. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83–91, 2008. [25] M. Figueiredo, R. Nowak, and S. Wright. Gradient Projection for Sparse Recon- struction: Application to Compressed Sensing and Other Inverse Problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, 2007. [26] V. Goyal, A. Fletcher, and S. Rangan. Compressive sampling and lossy compression. IEEE Signal Processing Magazine, 25(2):48–56, 2008. [27] V. Goyal, M. Vetterli, and N. Thao. Quantized overcomplete expansions in RN : analysis, synthesis, and algorithms. IEEE Trans. Inform. Theory, 44(1):16–31, Jan 1998. 28 Chapter 1. Introduction and Overview [28] R. Gribonval and M. Nielsen. On sparse representations in arbitrary redundant bases. IEEE Transactions on Information Theory, 49(12):3320–3325, December 2003. [29] G. Hennenfent and F. Herrmann. Application of stable signal recovery to seismic interpolation. In SEG Extended Abstracts. Society of Exploration Geophysicists, 2006. [30] G. Hennenfent and F. Herrmann. Seismic denoising with nonuniformly sampled curvelets. Computing in Science & Engineering, 8(3):16–25, 2006. [31] G. Hennenfent and F. Herrmann. Simply denoise: wavefield reconstruction via jittered undersampling. Geophysics, 73:V19, 2008. [32] F. Herrmann and G. Hennenfent. Non-parametric seismic data recovery with curvelet frames. Geophysical Journal International, 173(1):233–248, 2008. [33] L. Jacques, D. Hammond, and M. Fadili. Dequantizing compressed sensing: When oversampling and non-gaussian constraints combine. Arxiv preprint: http://arxiv. org/abs/0902.2367, 2009. [34] M. Lammers, A. Powell, and Ö. Yılmaz. Alternative dual frames for digital-to-analog conversion in sigma–delta quantization. Advances in Computational Mathematics, pages 1–30, 2008. [35] J. Laska, P. Boufounos, M. Davenport, and R. Baraniuk. Democracy in action: Quantization, saturation, and compressive sensing. Preprint, 2009. [36] T. Lin and F. Herrmann. Compressed wavefield extrapolation with curvelets. In SEG International Exposition and 77th Annual Meeting, 2007. [37] M. Lobo, L. Vanderberghe, S. Boyd, and H. Lebret. Applications of second-order cone programming. Linear Algebra and its Applications, 284:193–228, 1998. [38] M. Lustig, D. Donoho, and J. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182–1195, 2007. [39] S. Mallat and Z. Zhang. 
Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993. [40] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constructive Approximation, 28(3):277– 289, 2008. 29 Chapter 1. Introduction and Overview [41] D. Needell and J. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009. [42] D. Needell and R. Vershynin. Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit. Foundations of Computational Mathemat- ics, 9(3):317–334, 2009. [43] Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: Recur- sive function approximation with applications to wavelet decomposition, 1993. [44] H. Rauhut. On the impossibility of uniform sparse reconstruction using greedy methods. Sampl. Theory Signal Image Process, 7(2):197–215, 2008. [45] M. Rudelson and R.Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61(8):1025– 1045, 2008. [46] R. Saab, Ö. Yılmaz, M. J. McKeown, and R. Abugharbieh. Underdetermined ane- choic blind source separation via `q-basis-pursuit with q < 1. IEEE Transactions on Signal Processing, 55(8):4004–4017, 2007. [47] T. Strohmer and R. Heath. Grassmannian frames with applications to coding and communication. Applied and Computational Harmonic Analysis, 14(3):257–275, 2003. [48] D. Taubman, M. Marcellin, and M. Rabbani. JPEG2000: Image compression fun- damentals, standards and practice. Journal of Electronic Imaging, 11:286, 2002. [49] J. Tropp. Greed is good: Algorithmic results for sparse approximation. ICES Report 0304, The University of Texas at Austin, 2003. [50] J. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Trans- actions on Information Theory, 50(10):2231–2242, 2004. [51] J. Tropp. Recovery of short, complex linear combinations via l1 minimization. IEEE Transactions on Information Theory, 51(4):1568–1570, April 2005. [52] J. Tropp and A. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655, 2007. [53] Y. Tsaig and D. Donoho. Extensions of compressed sensing. Signal Process., 86(3):549–571, 2006. [54] E. van den Berg and M. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2008. 30 Chapter 1. Introduction and Overview [55] G. Wallace. The JPEG still picture compression standard. Commun. ACM, 34(4):30–44, 1991. [56] D. Wang, R. Saab, Ö. Yılmaz, and F. Herrmann. Bayesian wavefield separation by transform-domain sparsity promotion. Geophysics, 73:A33, 2008. [57] L. Welch. Lower bounds on the maximum cross correlation of signals (Corresp.). IEEE Transactions on Information Theory, 20(3):397–399, 1974. [58] P. Wojtaszczyk. Stability and instance optimality for gaussian measurements in compressed sensing. Foundations of Computational Mathematics (to appear). [59] Ö. Yılmaz and S. Rickard. Blind source separation of speech mixtures via time- frequency masking. IEEE Transactions on Signal Processing, 52(7):1830–1847, July 2004. [60] A. Zymnis, S. Boyd, and E. Candès. Compressed sensing with quantized measure- ments. 2009. Submitted. 
31 Chapter 2 Sparse recovery by non-convex optimization – instance optimality 1 2.1 Introduction The sparse recovery problem received a lot of attention lately, both because of its role in transform coding with redundant dictionaries (e.g., [9, 28, 29]), and perhaps more importantly because it inspired compressed sensing [3,4,13], a novel method of acquiring signals with certain properties more efficiently compared to the classical approach based on Nyquist-Shannon sampling theory. Define ΣNk to be the set of all k-sparse vectors, i.e., ΣNk := {x ∈ RN : |supp(x)| ≤ k}, and define compressible vectors as vectors that can be well approximated in ΣNk . Let σk(x)`p denote the best k-term approximation error of x in `p (quasi-)norm where p > 0, i.e., σk(x)`p := min v∈ΣNk ‖x− v‖p. Throughout the text, A denotes an n×N real matrix where n < N . Let the associated encoder be the map x 7→ Ax (also denoted by A). The transform coding and compressed sensing problems mentioned above require the existence of decoders, say ∆ : Rn 7→ RN , with roughly the following properties: (C1) ∆(Ax) = x whenever x ∈ ΣNk with sufficiently small k. (C2) ‖x−∆(Ax+ e)‖ . ‖e‖+ σk(x)`p , where the norms are appropriately chosen. Here e denotes measurement error, e.g., thermal and computational noise. (C3) ∆(Ax) can be computed efficiently (in some sense). Below, we denote the (in general noisy) encoding of x by b, i.e., b = Ax+ e. (2.1) 1A version of this chapter has been accepted for publication. R. Saab, Ö. Yılmaz, “Sparse recovery by non-convex optimization – instance optimality,” Applied and Computational Harmonic Analysis. 32 Chapter 2. Sparse recovery by non-convex optimization – instance optimality In general, the problem of constructing decoders with properties (C1)-(C3) is non-trivial (even in the noise-free case) as A is overcomplete, i.e., the linear system of n equations in (2.1) is underdetermined, and thus, if consistent, it admits infinitely many solutions. In order for a decoder to satisfy (C1)-(C3), it must choose the “correct solution” among these infinitely many solutions. Under the assumption that the original signal x is sparse, one can phrase the problem of finding the desired solution as an optimization problem where the objective is to maximize an appropriate “measure of sparsity” while simultaneously satisfying the constraints defined by (2.1). In the noise-free case, i.e., when e = 0 in (2.1), under certain conditions on the n × N matrix A, i.e., if A is in general position, there is a decoder ∆0 which satisfies ∆0(Ax) = x for all x ∈ ΣNk whenever k < n/2, e.g., see [14]. This ∆0 can be explicitly computed via the optimization problem ∆0(b) := argmin y ‖y‖0 subject to b = Ay. (2.2) Here ‖y‖0 denotes the number of non-zero entries of the vector y, equivalently its so-called `0-norm. Clearly, the sparsity of y is reflected by its `0-norm. 2.1.1 Decoding by `1 minimization As mentioned above, ∆0(Ax) = x exactly if x is sufficiently sparse depending on the matrix A. However, the associated optimization problem is combinatorial in nature, thus its complexity grows quickly as N becomes much larger than n. Naturally, one then seeks to modify the optimization problem so that it lends itself to solution methods that are more tractable than combinatorial search. 
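As an aside, the combinatorial nature of (2.2) is easy to see from a direct implementation. The toy sketch below (an illustration only, not an algorithm advocated in this thesis) searches over all supports of size at most s_max and therefore solves on the order of (N choose s) least-squares problems, which quickly becomes intractable as N grows.

import numpy as np
from itertools import combinations

def delta_0(A, b, s_max, tol=1e-10):
    """Brute-force l0 decoder: return the sparsest y with Ay = b (up to tol),
    searching supports of size 1, ..., s_max."""
    N = A.shape[1]
    if np.linalg.norm(b) <= tol:
        return np.zeros(N)
    for s in range(1, s_max + 1):
        for T in combinations(range(N), s):
            cols = list(T)
            yT, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
            if np.linalg.norm(A[:, cols] @ yT - b) <= tol:
                y = np.zeros(N)
                y[cols] = yT
                return y
    return None  # no s_max-sparse solution found

# Tiny example: exact recovery of a 2-sparse vector from 6 measurements.
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 12))
x = np.zeros(12); x[[3, 7]] = [1.0, -2.0]
print(delta_0(A, A @ x, s_max=3))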
In fact, in the noise-free setting, the decoder defined by `1 minimization, given by ∆1(b) := argmin y ‖y‖1 subject to Ay = b, (2.3) recovers x exactly if x is sufficiently sparse and the matrix A has certain properties (e.g., [4, 6, 9, 14, 15, 26]). In particular, it has been shown in [4] that if x ∈ ΣNk and A satisfies a certain restricted isometry property, e.g., δ3k < 1/3 or more generally δ(a+1)k < a−1 a+1 for some a > 1 such that a ∈ 1 k N, then ∆1(Ax) = x (in what follows, N denotes the set of positive integers, i.e., 0 /∈ N). Here δk are the k-restricted isometry constants of A, as introduced by Candès, Romberg and Tao (see, e.g., [4]), defined as the smallest constants satisfying (1− δk)‖c‖22 ≤ ‖Ac‖22 ≤ (1 + δk)‖c‖22 (2.4) for every c ∈ ΣNk . Throughout the paper, using the notation of [30], we say that a matrix satisfies RIP(k, δ) if δk < δ. Checking whether a given matrix satisfies a certain RIP is computationally intensive, and becomes rapidly intractable as the size of the matrix increases. On the other hand, there are certain classes of random matrices which have favorable RIP. In fact, let A be 33 Chapter 2. Sparse recovery by non-convex optimization – instance optimality an n × N matrix the columns of which are independent, identically distributed (i.i.d.) random vectors with any sub-Gaussian distribution. It has been shown that A satisfies RIP (k, δ) with any 0 < δ < 1 when k ≤ c1n/log(N/n), (2.5) with probability greater than 1−2e−c2n (see, e.g., [1], [5], [6]), where c1 and c2 are positive constants that only depend on δ and on the actual distribution from which A is drawn. In addition to recovering sparse vectors from error-free observations, it is important that the decoder be robust to noise and stable with regards to the “compressibility” of x. In other words, we require that the reconstruction error scale well with the measurement error and with the “non-sparsity” of the signal (i.e., (C2) above). For matrices that satisfy RIP((a + 1)k, δ), with δ < a−1 a+1 for some a > 1 such that a ∈ 1 k N, it has been shown in [4] that there exists a feasible decoder ∆1 for which the approximation error ‖∆1(b)−x‖2 scales linearly with the measurement error ‖e‖2 ≤  and with σk(x)`1 . More specifically, define the decoder ∆1(b) = argmin y ‖y‖1 subject to ‖Ay − b‖2 ≤ . (2.6) The following theorem of Candès et al. in [4] provides error guarantees when x is not sparse and when the observation is noisy. Theorem 2.1.1. [4] Fix  ≥ 0, suppose that x is arbitrary, and let b = Ax + e where ‖e‖2 ≤ . If A satisfies δ3k + 3δ4k < 2, then ‖∆1(b)− x‖2 ≤ C1,k+ C2,k σk(x)`1√ k . (2.7) For reasonable values of δ4k, the constants are well behaved; e.g., C1,k = 12.04 and C2,k = 8.77 for δ4k = 1/5. Remark. This means that given b = Ax + e, and x is sufficiently sparse, ∆1(b) recovers the underlying sparse signal within the noise level. Consequently the recovery is perfect if  = 0. Remark. By explicitly assuming x to be sparse, Candès et al.. [4] proved a version of the above result with smaller constants, i.e., for b = Ax+ e with x ∈ ΣNk and ‖e‖2 ≤ , ‖∆1(b)− x‖2 ≤ Ck, (2.8) where Ck < C1,k. Remark. Recently, Candès [2] showed that δ2k < √ 2− 1 is sufficient to guarantee robust and stable recovery in the sense of (2.7) with slightly better constants. 34 Chapter 2. Sparse recovery by non-convex optimization – instance optimality In the noise free case, i.e., when  = 0, the reconstruction error in Theorem 2.1.1 is bounded above by σk(x)`1/ √ k, see (2.7). 
This upper bound would sharpen if one could replace σk(x)`1/ √ k with σk(x)`2 on the right hand side of (2.7) (note that σk(x)`1 can be large even if all the entries of the reconstruction error are small but nonzero; this follows from the fact that for any vector y ∈ RN , ‖y‖2 ≤ ‖y‖1 ≤ √ N‖y‖2, and consequently there are vectors x ∈ RN for which σk(x)`1/ √ k  σk(x)`2 , especially when N is large). In [10] it was shown that the term C2,kσk(x)`1/ √ k on the right hand side of (2.7) cannot be replaced with Cσk(x)`2 if one seeks the inequality to hold for all x ∈ RN with a fixed matrix A, unless n > cN for some constant c. This is unsatisfactory since the paradigm of compressed sensing relies on the ability of recovering sparse or compressible vectors x from significantly fewer measurements than the ambient dimension N . Even though one cannot obtain bounds on the approximation error in terms of σk(x)`2 with constants that are uniform on x (with a fixed matrix A), the situation is significantly better if we relax the uniformity requirement and seek for a version of (2.7) that holds “with high probability”. Indeed, it has been recently shown by Wojtaszczyk that for any specific x, σk(x)`2 can be placed in (2.7) in lieu of σk(x)`1/ √ k (with different constants that are still independent of x) with high probability on the draw of A if (i) n > ck logN and (ii) the entries A is drawn i.i.d. from a Gaussian distribution or the columns of A are drawn i.i.d. from the uniform distribution on the unit sphere in Rn [30]. In other words, the encoder ∆1 = ∆ 0 1 is “(2,2) instance optimal in probability” for encoders associated with such A, a property which was discussed in [10]. Following the notation of [30], we say that an encoder-decoder pair (A,∆) is (q, p) instance optimal of order S with constant C if ‖∆(Ax)− x‖q ≤ C σk(x)`p k1/p−1/q (2.9) holds for all x ∈ RN . Moreover, for random matrices Aω, (Aω,∆) is said to be (q, p) instance optimal in probability if for any x (2.9) holds with high probability on the draw of Aω. Note that with this notation Theorem 2.1.1 implies that (A,∆1) is (2,1) instance optimal (set  = 0), provided A satisfies the conditions of the theorem. The preceding discussion makes it clear that ∆1 satisfies conditions (C1) and (C2), at least when A is a sub-Gaussian random matrix and k is sufficiently small. It only remains to note that decoding by ∆1 amounts to solving an `1 minimization problem, and is thus tractable, i.e., we also have (C3). In fact, `1 minimization problems as described above can be solved efficiently with solvers specifically designed for the sparse recovery scenarios (e.g. [27], [16], [11]). 2.1.2 Decoding by `p minimization We have so far seen that with appropriate encoders, the decoders ∆1 provide robust and stable recovery for compressible signals even when the measurements are noisy [4], and 35 Chapter 2. Sparse recovery by non-convex optimization – instance optimality that (Aω,∆1) is (2,2) instance optimal in probability [30] when Aω is an appropriate random matrix. In particular, stability and robustness properties are conditioned on an appropriate RIP while the instance optimality property is dependent on the draw of the encoder matrix (which is typically called the measurement matrix) from an appropriate distribution, in addition to RIP. Recall that the decoders ∆1 and ∆  1 were devised because their action can be computed by solving convex approximations to the combinatorial optimization problem (2.2) that is required to compute ∆0. 
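For completeness, one standard way to compute ∆1(b) in practice is to recast (2.3) as a linear program via the splitting y = u − v with u, v ≥ 0. The sketch below uses a generic solver (scipy.optimize.linprog) purely for illustration; the specialized solvers cited above are far more efficient, and the dimensions chosen here are arbitrary.

import numpy as np
from scipy.optimize import linprog

def delta_1(A, b):
    """Basis pursuit: min ||y||_1 subject to Ay = b, as an LP with y = u - v."""
    n, N = A.shape
    c = np.ones(2 * N)                 # objective: sum(u) + sum(v) = ||y||_1
    A_eq = np.hstack([A, -A])          # equality constraint A(u - v) = b
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
    u, v = res.x[:N], res.x[N:]
    return u - v

# Small sanity check: recover a 3-sparse vector from 40 Gaussian measurements.
rng = np.random.default_rng(3)
A = rng.standard_normal((40, 120)) / np.sqrt(40)
x = np.zeros(120); x[[5, 50, 100]] = [1.0, -0.5, 2.0]
x_hat = delta_1(A, A @ x)
print(np.linalg.norm(x_hat - x))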
The decoders defined by ∆p(b) := argmin y ‖y‖p s.t. ‖Ay − b‖2 ≤ , and (2.10) ∆p(b) := argmin y ‖y‖p s.t. Ay = b, (2.11) with 0 < p < 1 are also approximations of ∆0, the actions of which are computed via non- convex optimization problems that can be solved, at least locally, still much faster than (2.2). It is natural to ask whether the decoders ∆p and ∆  p possess robustness, stability, and instance optimality properties similar to those of ∆1 and ∆  1, and whether these are obtained under weaker conditions on the measurement matrices than the analogous ones with p = 1. Early work by Gribonval and co-authors [19–22] take some initial steps in answering these questions. In particular, they devise metrics that lead to sufficient conditions for uniqueness of ∆1(b) to imply uniqueness of ∆p(b) and specifically for having ∆p(b) = ∆1(b) = x. The authors also present stability conditions in terms of various norms that bound the error, and they conclude that the smaller the value of p is, the more non-zero entries can be recovered by (2.11). These conditions, however, are hard to check explicitly and no class of deterministic or random matrices was shown to satisfy them at least with high probability. On the other hand, the authors provide lower bounds for their metrics in terms of generalized mutual coherence. Still, these conditions are pessimistic in the sense that they generally guarantee recovery of only very sparse vectors. Recently, Chartrand showed that in the noise-free setting, a sufficiently sparse signal can be recovered perfectly with ∆p, where 0 < p < 1, under less restrictive RIP require- ments than those needed to guarantee perfect recovery with ∆1. The following theorem was proved in [7]. Theorem 2.1.2. [7] Let 0 < p ≤ 1, and let k ∈ N. Suppose that x is k-sparse, and set b = Ax. If A satisfies δak + a 2 p −1δ(a+1)k < a 2 p −1 − 1 for some a > 1 such that a ∈ 1 k N, then ∆p(b) = x. Note that, for example, when p = 0.5 and a = 3, the above theorem only requires δ3k+27δ4k < 26 to guarantee perfect recovery with ∆0.5, a less restrictive condition than the analogous one needed to guarantee perfect reconstruction with ∆1, i.e., δ3k + 3δ4k < 2. Moreover, in [8], Staneva and Chartrand study a modified RIP that is defined by replacing ‖Ac‖2 in (2.4) with ‖Ac‖p. They show that under this new definition of δk, the 36 Chapter 2. Sparse recovery by non-convex optimization – instance optimality same sufficient condition as in Theorem 2.1.2 guarantees perfect recovery. Steneva and Chartrand also show that if A is an n × N Gaussian matrix, their sufficient condition is satisfied provided n > C1(p)k + pC2(p)k log(N/k), where C1(p) and C2(p) are given explicitly in [8]. It is important to note that pC2(p) goes to zero as p goes to zero. In other words, the dependence on N of the required number of measurements n (that guarantees perfect recovery for all x ∈ ΣNk ) disappears as p approaches 0. This result motivates a more detailed study to understand the properties of the decoders ∆p in terms of stability and robustness, which is the objective of this paper. Algorithmic Issues Clearly, recovery by `p minimization poses a non-convex optimization problem with many local minimizers. 
It is encouraging that simulation results from recent papers, e.g., [7, 25], strongly indicate that simple modifications to known approaches like iterated reweighted least squares algorithms and projected gradient algorithms yield x∗ that are the global minimizers of the associated `p minimization problem (or approximate the global optimizers very well). It is also encouraging to note that even though the results presented in this work and in others [7, 19–22,25] assume that the global minimizer has been found, a significant set of these results, including all results in this paper, continue to hold if we could obtain a feasible point x̃∗ which satisfies ‖x̃∗‖p ≤ ‖x‖p (where x is the vector to be recovered). Nevertheless, it should be stated that to our knowledge, the modified algorithms mentioned above have only been shown to converge to local minima. 2.1.3 Paper Outline In what follows, we present generalizations of the above results, giving stability and robustness guarantees for `p minimization. In Section 2.2.1 we show that the decoders ∆p and ∆  p are robust to noise and (2,p) instance optimal in the case of appropriate measurement matrices. For this section we rely and expand on our note [25]. In Section 2.2.3 we extend [30] and show that for the same range of dimensions as for decoding by `1 minimization, i.e., when Aω ∈ Rn×N with n > ck log(N), (Aω,∆p) is also (2,2) instance optimal in probability for 0 < p < 1, provided the measurement matrix Aω is drawn from an appropriate distribution. The generalization follows the proof of Wojtaszczyk in [30]; however it is non-trivial and requires a variant of a result by Gordon and Kalton [18] on the Banach-Mazur distance between a p-convex body and its convex hull. In Section 2.3 we present some numerical results, further illustrating the possible benefits of using `p minimization and highlighting the behavior of the ∆p decoder in terms of stability and robustness. Finally, in Section 2.4 we present the proofs of the main theorems and corollaries. While writing this paper, we became aware of the work of Foucart and Lai [17] which also shows similar (2, p) instance optimality results for 0 < p < 1 under different sufficient conditions. In essence, one could use the (2, p)-results of Foucart and Lai to obtain (2, 2) 37 Chapter 2. Sparse recovery by non-convex optimization – instance optimality instance optimality in probability results similar to the ones we present in this paper, albeit with different constants. Since neither the sufficient conditions for (2, p) instance optimality presented in [17] nor the ones in this paper are uniformly weaker, and since neither provide uniformly better constants, we simply use our estimates throughout. 2.2 Main Results In this section, we present our theoretical results on the ability of `p minimization to recover sparse and compressible signals in the presence of noise. 2.2.1 Sparse recovery with ∆p: stability and robustness We begin with a deterministic stability and robustness theorem for decoders ∆p and ∆  p when 0 < p < 1 that generalizes Theorem 2.1.1 of Candès et al. Note the associated sufficient conditions on the measurement matrix, given in (2.12) below, are weaker for smaller values of p than those that correspond to p = 1. The results in this subsection were initially reported, in part, in [25]. In what follows, we say that a matrix A satisfies the property P (a, k, p) if it satisfies δak + a 2 p −1δ(a+1)k < a 2 p −1 − 1, (2.12) for k ∈ N and a > 1 such that a ∈ 1 k N. 
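Regarding the algorithmic issues discussed above, one simple heuristic for (2.11) is an iteratively reweighted least squares iteration. The sketch below is one such implementation, given only for illustration: the weight update w_i = (y_i^2 + eps^2)^{p/2−1} and the geometric eps schedule are common choices assumed here, not the procedure used in this thesis, and the output is merely a feasible point with small ℓp quasinorm, with no guarantee of being the global minimizer.

import numpy as np

def irls_lp(A, b, p, n_iter=100, eps0=1.0, eps_min=1e-8):
    """Iteratively reweighted least squares heuristic for
    min ||y||_p^p subject to Ay = b, with 0 < p <= 1."""
    y = np.linalg.pinv(A) @ b                  # feasible minimum-l2 starting point
    eps = eps0
    for _ in range(n_iter):
        # weights w_i = (y_i^2 + eps^2)^(p/2 - 1); store their reciprocals
        w_inv = (y ** 2 + eps ** 2) ** (1 - p / 2)
        # minimizer of sum_i w_i y_i^2 subject to Ay = b:
        #   y = W^{-1} A^T (A W^{-1} A^T)^{-1} b
        AW = A * w_inv                         # equals A @ diag(w_inv)
        y = w_inv * (A.T @ np.linalg.solve(AW @ A.T, b))
        eps = max(eps * 0.9, eps_min)          # gradually sharpen the smoothing
    return y

# Example: a 10-sparse vector in R^300 from 100 measurements, with p = 0.5.
rng = np.random.default_rng(4)
A = rng.standard_normal((100, 300)) / 10.0
x = np.zeros(300); x[rng.choice(300, 10, replace=False)] = rng.standard_normal(10)
y = irls_lp(A, A @ x, p=0.5)
print(np.linalg.norm(y - x))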
Theorem 2.2.1 (General Case). Let 0 < p ≤ 1. Suppose that x is arbitrary and b = Ax+ e where ‖e‖2 ≤ . If A satisfies P (a, k, p), then ‖∆p(b)− x‖p2 ≤ C1p + C2 σk(x) p `p k1−p/2 , (2.13) where C1 = 2 p 1 + a p/2−1(2/p− 1)−p/2 (1− δ(a+1)k)p/2 − (1 + δak)p/2ap/2−1 , and (2.14) C2 = 2( p 2−p) p/2 a1−p/2 ( 1 + ((2/p− 1) p2 + ap/2−1)(1 + δak)p/2 (1− δ(a+1)k)p/2 − (1+δak)p/2a1−p/2 ) . (2.15) Remark. By setting p = 1 and a = 3 in Theorem 2.2.1, we obtain Theorem 2.1.1, with precisely the same constants. Remark. The constants in Theorem 2.2.1 are generally well behaved; e.g., C1 = 5.31 and C2 = 4.31 for δ4k = 0.5 and p = 0.5. Note for δ4k = 0.5 the sufficient condition (2.12) is not satisfied when p = 1, and thus Theorem 2.2.1 does not yield any upper bounds on ‖∆1(b)− x‖2 in terms of σk(x)`1 . 38 Chapter 2. Sparse recovery by non-convex optimization – instance optimality Corollary 2.2.2 ((2, p) instance optimality). Let 0 < p ≤ 1. Suppose that A satisfies P (a, k, p). Then (A,∆p) is (2, p) instance optimal of order k with constant C 1/p 2 where C2 is as in (2.15). Corollary 2.2.3 (sparse case). Let 0 < p ≤ 1. Suppose x ∈ ΣNk and b = Ax + e where ‖e‖2 ≤ . If A satisfies P (a, k, p), then ‖∆p(b)− x‖2 ≤ (C1)1/p , where C1 is as in (2.14). Remark. Corollaries 2.2.2 and 2.2.3 follow from Theorem 2.2.1 by setting  = 0 and σk(x)`p = 0, respectively. Furthermore, Corollary 2.2.3 can be proved independently of Theorem 2.2.1 leading to smaller constants. See [25] for the explicit values of these im- proved constants. Finally, note that setting  = 0 in Corollary 2.2.3, we obtain Theorem 2.1.2 as a corollary. Remark. In [17], Foucart and Lai give different sufficient conditions for exact recovery than those we present. In particular, they show that if δmk < g(m) := 4( √ 2− 1)(m/2)1/p−1/2 4( √ 2− 1)(m/2)1/p−1/2 + 2 (2.16) holds for some m ≥ 2, m ∈ 1 k N, then ∆p will recover signals in Σ N k exactly. Note that the sufficient condition in this paper, i.e., (2.12), holds when δmk < f(m) := (m− 1)2/p−1 − 1 (m− 1)2/p−1 + 1 (2.17) for some m ≥ 2, m ∈ 1 k N. In Figure 2.1, we compare these different sufficient conditions as a function of m for p = 0.1, 0.5, and 0.9 respectively. Figure 2.1 indicates that neither sufficient condition is weaker than the other for all values of m. In fact, we can deduce that (2.16) is weaker when m is close to 2, while (2.17) is weaker when m starts to grow larger. Since both conditions are only sufficient, if either one of them holds for an appropriate m, then ∆p recovers all signals in Σ N k . Remark. In [12], Davies and Gribonval showed that if one chooses δ2k > δ(p) (where δ(p) can be computed implicitly for 0 < p ≤ 1), then there exist matrices (matrices in R (N−1)×N that correspond to tight Parseval frames in RN−1) with the prescribed δ2k for which ∆p fails to recover signals in Σ N k . Note that this result does not contradict with the results that we present in this paper: we provide sufficient conditions (e.g., (2.12)) in terms of δ(a+1)k , where a > 1 and ak ∈ N, that guarantee recovery by ∆p. These conditions are weaker than the corresponding conditions ensuring recovery by ∆1, which suggests that using ∆p can be beneficial. Moreover, the numerical examples we provide 39 Chapter 2. 
in Section 2.3 indicate that by using ∆p, 0 < p < 1, one can indeed recover signals in Σ^N_k even when ∆1 fails to recover them (see Figure 2.2).

Figure 2.1: A comparison of the sufficient conditions on δmk in (2.17) and (2.16) as a function of m, for p = 0.1 (top), p = 0.5 (center) and p = 0.9 (bottom).

Remark. In summary, Theorem 2.2.1 states that if (2.12) is satisfied then we can recover signals in Σ^N_k stably by decoding with ∆ε_p. It is worth mentioning that the sufficient conditions presented here reduce the gap between the conditions for exact recovery with ∆0 (i.e., δ2k < 1) and with ∆1, e.g., δ3k < 1/3. For example, for a = 2 and p = 0.5, δ3k < 7/9 is sufficient. In the next subsection, we quantify this improvement.

2.2.2 The relationship between k1 and kp

Let A be an n × N matrix and suppose δm, m ∈ {1, . . . , ⌊n/2⌋}, are its m-restricted isometry constants. Define kp for A with 0 < p ≤ 1 as the largest value of k ∈ N for which the slightly stronger version of (2.12) given by

δ(a+1)k < (a^{2/p−1} − 1) / (a^{2/p−1} + 1)   (2.18)

holds for some a > 1, a ∈ (1/k)N. Consequently, by Theorem 2.2.1, ∆p(Ax) = x for all x ∈ Σ^N_{kp}. We now establish a relationship between k1 and kp.

Proposition 2.2.4. Suppose, in the above described setting, there exist k1 ∈ N and a > 1, a ∈ (1/k1)N, such that

δ(a+1)k1 < (a − 1)/(a + 1).   (2.19)

Then ∆1 recovers all k1-sparse vectors, and ∆p recovers all kp-sparse vectors with

kp = ⌊ (a + 1) / (a^{p/(2−p)} + 1) · k1 ⌋.

Remark. For example, if δ5k1 < 3/5 then, using ∆2/3, we can recover all k2/3-sparse vectors with k2/3 = ⌊(5/3) k1⌋.

2.2.3 Instance optimality in probability and ∆p

In this section, we show that (Aω, ∆p) is (2, 2) instance optimal in probability when Aω is an appropriate random matrix. Our approach is based on that of [30], which we summarize now. A matrix A is said to possess the LQ1(α) property if and only if A(B^N_1) ⊃ α B^n_2, where B^m_q denotes the ℓq unit ball in R^m. In [30], Wojtaszczyk shows that random Gaussian matrices of size n × N, as well as matrices whose columns are drawn uniformly from the sphere, possess, with high probability, the LQ1(α) property with

α = µ √( log(N/n) / n ).

Noting that such matrices also satisfy RIP((a+1)k, δ) with k < c n / log(N/n), again with high probability, Wojtaszczyk proves that ∆1, for these matrices, is (2,2) instance optimal in probability of order k. Our strategy for generalizing this result to ∆p with 0 < p < 1 relies on a generalization of the LQ1 property to an LQp property. Specifically, we say that a matrix A satisfies LQp(α) if and only if A(B^N_p) ⊃ α B^n_2. We first show that a random matrix Aω, either Gaussian or uniform as mentioned above, satisfies the LQp(α) property with

α = (1/C(p)) ( µ^2 log(N/n) / n )^{1/p − 1/2}.

Once we establish this property, the proof of instance optimality in probability for ∆p proceeds largely unchanged from Wojtaszczyk's proof, with modifications to account only for the non-convexity of the ℓp quasinorm with 0 < p < 1. Next, we present our results on instance optimality of the ∆p decoder, while deferring the proofs to Section 2.4.
Throughout the rest of the paper, we focus on two classes of random matrices: Aω denotes n × N matrices, the entries of which are drawn from a zero mean, normalized column-variance Gaussian distribution, i.e., Aω = (ai,j) where ai,j ∼ N (0, 1/ √ n); in this case, we say that Aω is an n×N Gaussian random matrix. Ãω, on the other hand, denotes n × N matrices, the columns of which are drawn uniformly from the sphere; in this case we say that Ãω is an n × N uniform random matrix. In each case, (Ω, P ) denotes the associated probability space. We start with a lemma (which generalizes an analogous result of [30]) that shows that the matrices Aω and Ãω satisfy the LQp property with high probability. Lemma 2.2.5. Let 0 < p ≤ 1, and let Aω be an n × N Gaussian random matrix. For 0 < µ < 1/ √ 2, suppose that K1n(logn) ξ ≤ N ≤ eK2n for some ξ > (1−2µ2)−1 and some constants K1, K2 > 0. Then, there exists a constant c = c(µ, ξ,K1, K2) > 0, independent of p, n, and N , and a set Ωµ = { ω ∈ Ω : Aω(BNp ) ⊃ 1 C(p) ( µ2 logN/n n )1/p−1/2 Bn2 } such that P (Ωµ) ≥ 1− e−cn. In other words, Aω satisfies the LQp(α), α = 1/C(p) ( µ2 log (N/n) n )1/p−1/2 , with proba- bility ≥ 1−e−cn on the draw of the matrix. Here C(p) is a positive constant that depends only on p. (In particular, C(1) = 1 and see (2.50) for the explicit value of C(p) when 0 < p < 1). This statement is true also for Ãω. The above lemma for p = 1 can be found in [30]. As we will see in Section 2.4, the generalization of this result to 0 < p < 1 is non-trivial and requires a result from [18], cf. [23], relating certain “distances” of p-convex bodies to their convex hulls. It is important to note that this lemma provides the machinery needed to prove the following theorem, which extends to ∆p, 0 < p < 1, the analogous result of Wojtaszczyk [30] for ∆1. In what follows, for a set T ⊆ {1, . . . , N}, T c := {1, . . . , N} \ T ; for y ∈ RN , yT denotes the vector with entries yT (j) = y(j) for all j ∈ T , and yT (j) = 0 for j ∈ T c. Theorem 2.2.6. Let 0 < p < 1. Suppose that A ∈ Rn×N satisfies RIP(k, δ) and LQp ( 1 C(p) (µ2/k)1/p−1/2 ) for some µ > 0 and C(p) as in (2.50). Let ∆ be an arbitrary decoder. If (A,∆) is (2,p) instance optimal of order k with constant C2,p, then for any x ∈ RN and e ∈ Rn, all of the following hold. (i) ‖∆(Ax+ e)− x‖2 ≤ C(‖e‖2 + σk(x)`pk1/p−1/2 ) 42 Chapter 2. Sparse recovery by non-convex optimization – instance optimality (ii) ‖∆(Ax)− x‖2 ≤ C(‖AxT c0 ‖2 + σk(x)`2) (iii) ‖∆(Ax+ e)− x‖2 ≤ C(‖e‖2 + σk(x)`2 + ‖AxT c0 ‖2) Above, T0 denotes the set of indices of the largest (in magnitude) k coefficients of x; the constants (all denoted by C) depend on δ, µ, p, and C2,p but not on n and N . For the explicit values of these constants see (2.38) and (2.39). Finally, our main theorem on the instance optimality in probability of the ∆p decoder follows. Theorem 2.2.7. Let 0 < p < 1, and let Aω be an n × N Gaussian random matrix. Suppose that N ≥ n[log(n)]2. There exists constants c1, c2, c3 > 0 such that for all k ∈ N with k ≤ c1n/ log (N/n), the following are true. (i) There exists Ω1 with P (Ω1) ≥ 1− 3e−c2n such that for all ω ∈ Ω1 ‖∆p(Aω(x) + e)− x‖2 ≤ C(‖e‖2 + σk(x)`p k1/p−1/2 ), (2.20) for any x ∈ RN and for any e ∈ Rn. (ii) For any x ∈ RN , there exists Ωx with P (Ωx) ≥ 1− 4e−c3n such that for all ω ∈ Ωx ‖∆p(Aω(x) + e)− x‖2 ≤ C (‖e‖2 + σk(x)`2) , (2.21) for any e ∈ Rn. The statement also holds for Ãω, i.e., for random matrices the columns of which are drawn independently from a uniform distribution on the sphere. Remark. 
The constants above (both denoted by C) depend on the parameters of the particular LQp and RIP properties that the matrix satisfies, and are given explicitly in Section 2.4, see (2.38) and (2.41). The constants c1, c2, and c3 depend only on p and the distribution of the underlying random matrix (see the proof in Section 2.4.5) and are independent of n and N . Remark. Clearly, the statements do not make sense if the hypothesis of the theorem forces k to be 0. In turn, for a given (n,N) pair, it is possible that there is no positive integer k for which the conclusions of Theorem 2.2.7 hold. In particular, to get a non-trivial statement, one needs n > 1 c1 log(N/n). Remark. Note the difference in the order of the quantifiers between conclusions (i) and (ii) of Theorem 2.2.7. Specifically, with statement (i), once the matrix is drawn from the “good” set Ω1, we obtain the error guarantee (2.20) for every x and e. In other words, after the initial draw of a good matrix A, stability and robustness in the sense of (2.20) 43 Chapter 2. Sparse recovery by non-convex optimization – instance optimality are ensured. On the other hand, statement (ii) concludes that associated with every x is a “good” set Ωx (possibly different for different x) such that if the matrix is drawn from Ωx, then stability and robustness in the sense of (2.21) are guaranteed. Thus, in (ii), for every x, a different matrix is drawn, and with high probability on that draw (2.21) holds. Remark. The above theorem pertains to the decoders ∆p which, like the analogous theo- rem for ∆1 presented in [30], requires no knowledge of the noise level. In other words, ∆p provides estimates of sparse and compressible signals from limited and noisy observations without having to explicitly account for the noise in the decoding. This provides an im- provement on Theorem 2.2.1 and a practical advantage when estimates of measurement noise levels are absent. 2.3 Numerical Experiments In this section, we present some numerical experiments to highlight important aspects of sparse reconstruction by decoding using ∆p, 0 < p ≤ 1. First, we compare the sufficient conditions under which decoding with ∆p guarantees perfect recovery of signals in Σ N k for different values of p and k. Next, we present numerical results illustrating the robustness and instance optimality of the ∆p decoder. Here, we wish to observe the linear growth of the `2 reconstruction error ‖∆p(Ax+ e)− x‖2, as a function of σk(x)`2 and of ‖e‖2. To that end, we generate a 100 × 300 matrix A whose columns are drawn from a Gaussian distribution and we estimate its RIP constants δk via Monte Carlo (MC) simulations. Under the assumption that the estimated constants are the correct ones (while in fact they are only lower bounds), Figure 2.2 (top) shows the regions where (2.12) guarantees recovery for different (k, p)-pairs. On the other hand, Figure 2.2 (bottom) shows the empirical recovery rates via `p quasinorm minimization: To obtain this figure, for every k = 1, . . . , 49, we chose 50 different instances of x ∈ Σ300k where non-zero coefficients of each were drawn i.i.d. from the standard Gaussian distribution. These vectors were encoded using the same measurement matrix A as above. 
Since there is no known algorithm that will yield the global minimizer of the optimization problem (2.11), we approximated the action of ∆p by using a projected gradient algorithm on a sequence of smoothed versions of the ℓp minimization problem: in (2.11), instead of minimizing ‖y‖p, we minimized (∑_i (y_i^2 + ε^2)^{p/2})^{1/p}, initially with a large ε. We then used the corresponding solution as the starting point of the next subproblem, obtained by decreasing the value of ε according to the rule ε_i = (0.99) ε_{i−1}. We continued reducing the value of ε and solving the corresponding subproblem until ε became very small. Note that this approach is similar to the one described in [7]. The empirical results show that ∆p (in fact, the approximation of ∆p as described above) is successful in a wider range of scenarios than those predicted by Theorem 2.2.1. This can be attributed to the fact that the conditions presented in this paper are only sufficient, or to the fact that what is observed in practice is not necessarily a manifestation of uniform recovery. Rather, the practical results could be interpreted as success of ∆p with high probability on either x or A.

[Figure 2.2 here; panels: "Region where recovery with ∆p is guaranteed for p and k (light shading = recoverable)" and "Empirical recovery rates with ∆p", both over k = 1, . . . , 50 and p ∈ (0, 1].]

Figure 2.2: For a Gaussian matrix A ∈ R^{100×300}, whose δ_k values are estimated via MC simulations, we generate the theoretical (top) and practical (bottom) phase-diagrams for reconstruction via ℓp minimization.

[Figure 2.3 here; panels: "Performance of ∆p for compressible signals vs. compressibility" (k = 40, ‖x_T‖_2 = 1, ‖z_{T^c}‖_2 = 1, x = x_T + λ z_{T^c}) and "Performance of ∆p for noisy observations vs. noise" (k = 40, ‖x_T‖_2 = 1, ‖e‖_2 = 1), each showing the reconstruction error versus λ for p = 0.4, 0.6, 0.8, 1.]

Figure 2.3: Reconstruction error with compressible signals (top) and noisy observations (bottom). Observe the almost linear growth of the error for compressible signals and for different values of p, highlighting the instance optimality of the decoders. The plots were generated by averaging the results of 10 experiments with the same matrix A and randomized locations of the coefficients of x.

Next, we generate scenarios that allude to the conclusions of Theorem 2.2.7. To that end, we generate a signal composed of x_T ∈ Σ^300_40, supported on an index set T, and a signal z_{T^c} supported on T^c, where all the coefficients are drawn from the standard Gaussian distribution. We then normalize x_T and z_{T^c} so that ‖x_T‖_2 = ‖z_{T^c}‖_2 = 1 and generate x = x_T + λ z_{T^c} with increasing values of λ (starting from 0), thereby increasing σ_40(x)_{ℓ2} ≈ λ. For this experiment, we choose our measurement matrix A ∈ R^{100×300} by drawing its columns uniformly from the sphere. For each value of λ we measure the reconstruction error ‖∆p(Ax) − x‖_2, and we repeat the process 10 times while randomizing the index set T but preserving the coefficient values. We report the averaged results in Figure 2.3 (top) for different values of p.
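To make the decoding step concrete, the following is a minimal sketch of the smoothed ℓp scheme described at the beginning of this section, specialized to the noise-free case in which the constraint of (2.11) reduces to Ay = b: a gradient step on the smoothed objective is followed by projection back onto the affine set {y : Ay = b}, and ε is decreased geometrically between subproblems. The step-size rule, iteration counts, and tolerances below are our own illustrative choices and not the exact settings used to produce the figures.

import numpy as np

def lp_decode(A, b, p=0.5, eps0=1.0, eps_min=1e-4, inner_iters=30, tau=0.1):
    # Approximate Delta_p(b): minimize sum_i (y_i^2 + eps^2)^(p/2) subject to
    # A y = b, driving eps to zero.  (Minimizing this smoothed sum is
    # equivalent to minimizing its 1/p-th power, so the outer power is dropped.)
    A_pinv = np.linalg.pinv(A)
    y = A_pinv @ b                              # feasible starting point
    eps = eps0
    while eps > eps_min:
        for _ in range(inner_iters):
            grad = p * y * (y**2 + eps**2) ** (p / 2 - 1.0)
            y = y - tau * eps**(2 - p) * grad   # step scaled to keep updates O(eps)
            y = y - A_pinv @ (A @ y - b)        # project back onto {y : A y = b}
        eps *= 0.99                             # continuation rule eps_i = 0.99 eps_{i-1}
    return y

# example: decode a 20-sparse vector from 100 noiseless Gaussian measurements
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 300)) / np.sqrt(100)
x = np.zeros(300)
x[rng.choice(300, size=20, replace=False)] = rng.standard_normal(20)
x_hat = lp_decode(A, A @ x, p=0.5)
print("relative error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))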
Similarly, we generate noisy observations Ax_T + λe of a sparse signal x_T ∈ Σ^300_40, where ‖x_T‖_2 = ‖e‖_2 = 1, and we increase the noise level starting from λ = 0. Here, again, the non-zero entries of x_T and all entries of e were chosen i.i.d. from the standard Gaussian distribution, and the vectors were then properly normalized. Next, we measure ‖∆p(Ax_T + λe) − x_T‖_2 (for 10 realizations where we randomize T) and report the averaged results in Figure 2.3 (bottom) for different values of p. In both of these experiments, we observe that the error increases roughly linearly as we increase λ, i.e., as we increase σ_40(x)_{ℓ2} and the noise power, respectively. Moreover, when the signal is highly compressible or when the noise level is low, we observe that reconstruction using ∆p with 0 < p < 1 yields a lower approximation error than that with p = 1. It is also worth noting that for values of p close to one, even in the case of sparse signals with no noise, the average reconstruction error is non-zero. This may be due to the fact that for such large p the number of measurements is not sufficient for the recovery of signals with k = 40, further highlighting the benefits of using the decoder ∆p with smaller values of p.

Finally, in Figure 2.4, we plot the results of an experiment in which we generate signals x ∈ R^200 with sorted coefficients x(j) that decay according to a power law. In particular, for various values of 0 < q < 1, we set x(j) = c j^{−1/q} with c chosen such that ‖x‖_2 = 1. We then encode x with 50 different 100 × 200 measurement matrices whose columns were drawn from the uniform distribution on the sphere, and examine the approximations obtained by decoding with ∆p for different values of 0 < p < 1. The results indicate that values of p ≈ q provide the lowest reconstruction errors. Note that in Figure 2.4, we report the results in the form of signal-to-noise ratios defined as

SNR = 20 log_10 ( ‖x‖_2 / ‖∆(Ax) − x‖_2 ).

[Figure 2.4 here; both panels show the average reconstruction signal-to-noise ratio (in dB) of ∆p on compressible signals, as a function of q for p = 0.4, 0.6, 0.8, 1 (top) and as a function of p for q = 0.4, 0.6, 0.8 (bottom).]

Figure 2.4: Reconstruction signal-to-noise ratios (in dB) obtained by using ∆p to recover signals whose sorted coefficients decay according to a power law (x(j) = c j^{−1/q}, ‖x‖_2 = 1), as a function of q (top) and as a function of p (bottom). The presented results are averages of 50 experiments performed with different matrices in R^{100×200}. Observe that for highly compressible signals, e.g., for q = 0.4, there is a 5 dB gain in using p < 0.6 as compared to p = 1. The performance advantage is about 2 dB for q = 0.6. As the signals become much less compressible, i.e., as we increase q to 0.9, the performances are almost identical.

2.4 Proofs

2.4.1 Proof of Proposition 2.2.4

First, note that for any A ∈ R^{n×N}, δ_m is non-decreasing in m. Also, the map a ↦ (a−1)/(a+1) is increasing in a for a ≥ 0. Set L := (a+1)k_1, ℓ̃ := a^{p/(2−p)}, and k̃_p := L/(ℓ̃+1). Then

δ_{(ℓ̃+1)k̃_p} = δ_{(a+1)k_1} < (a−1)/(a+1) = (ℓ̃^{(2−p)/p} − 1)/(ℓ̃^{(2−p)/p} + 1).
We now describe how to choose ` and kp such that ` ≥ ˜̀, kp ∈ N, and (`+ 1)kp = L (this will be sufficient to complete the proof using the monotonicity observations above). First, note that this last equality is satisfied only if (`, kp) is in the set {( n L− n, L− n) : n = 1, . . . , L− 1}. Let n∗ be such that n∗ − 1 L− n∗ + 1 < ˜̀≤ n∗ L− n∗ . (2.22) To see that such an n∗ exists, recall that ˜̀= a p2−p where 0 < p < 1. Also, (a+ 1)k1 = L with k1 ∈ N, and a > 1. Consequently, 1 < ˜̀ < a ≤ L − 1, and a ∈ { nL−n : n = dL 2 e, . . . , L − 1}. Thus, we know that we can find n∗ as above. Furthermore, n∗ L−n∗ > 1. It follows from (2.22) that L− n∗ ≤ k̃p < L− n∗ + 1. We now choose ` = n∗ L− n∗ , and kp = bk̃pc = L− n ∗. Then (`+ 1)kp = L, and ` ≥ ˜̀. So, we conclude that for ` as above and kp = bk̃pc = ⌊ a+ 1 a p 2−p + 1 k1 ⌋ , we have δ(`+1)kp < ` 2−p p − 1 ` 2−p p + 1 . Consequently, the condition of Corollary 2.2.3 is satisfied and we have the desired con- clusion. 49 Chapter 2. Sparse recovery by non-convex optimization – instance optimality 2.4.2 Proof of Theorem 2.2.1 Wemodify the proof of Candès et al.. of the analogous result for the encoder ∆1 (Theorem 2 in [4]) to account for the non-convexity of the `p quasinorm. We give the full proof for completeness. We stick to the notation of [4] whenever possible. Let 0 < p < 1, x ∈ RN be arbitrary, and define x∗ := ∆p(b) and h := x∗ − x. Our goal is to obtain an upper bound on ‖h‖2 given that ‖Ah‖2 ≤ 2 (by definition of ∆p). Below, for a set T ⊆ {1, . . . , N}, T c := {1, . . . , N} \ T ; for y ∈ RN , yT denotes the vector with entries yT (j) = y(j) for all j ∈ T , and yT (j) = 0 for j ∈ T c. ( I ) We start by decomposing h as a sum of sparse vectors with disjoint support. In particular, denote by T0 the set of indices of the largest (in magnitude) k coefficients of x (here k is to be determined later). Next, partition T co into sets T1, T2, . . . , |Tj| = L for j ≥ 1 where L ∈ N (also to be determined later), such that T1 is the set of indices of the L largest (in magnitude) coefficients of hT c0 , T2 is the set of indices of the second L largest coefficients of hT c0 , and so on. Finally let T01 := T0 ∪ T1. We now obtain a lower bound for ‖Ah‖p2 using the RIP constants of the matrix A. In particular, we have ‖Ah‖p2 = ‖AhT01 + ∑ j≥2 AhTj‖p2 ≥ ‖AhT01‖p2 − ∑ j≥2 ‖AhTj‖p2 ≥ (1− δL+|T0|)p/2‖hT01‖p2 − (1 + δL)p/2 ∑ j≥2 ‖hTj‖p2. (2.23) Above, together with RIP, we used the fact that ‖ · ‖p2 satisfies the triangle inequality for any 0 < p < 1. What now remains is to relate ‖hT01‖p2 and ∑ j≥2 ‖hTj‖p2 to ‖h‖2. ( II ) Next, we aim to bound ∑ j≥2 ‖hTj‖p2 from above in terms of ‖h‖2. To that end, we proceed as in [4]. First, note that |hTj+1(`)|p ≤ |hTj (`′)|p for all ` ∈ Tj+1, `′ ∈ Tj , and thus |hTj+1(`)|p ≤ ‖hTj‖pp/L. It follows that ‖hTj+1‖22 ≤ L1− 2 p‖hTj‖2p, and consequently∑ j≥2 ‖hTj‖p2 ≤ L p 2 −1∑ j≥1 ‖hTj‖pp = L p 2 −1‖hT co ‖pp. (2.24) Next, note that, similar to the case when p = 1 as shown in [4], the “error” h is concen- trated on the “essential support” of x (in our case T0). To quantify this claim, we repeat the analogous calculation in [4]: Note, first, that by definition of x∗, ‖x∗‖pp = ‖x+ h‖pp = ‖xT0 + hT0‖pp + ‖xT c0 + hT c0 ‖pp ≤ ‖x‖pp. As ‖ · ‖pp satisfies the triangle inequality, we then have ‖xT0‖pp − ‖hT0‖pp + ‖hT c0 ‖pp − ‖xT c0 ‖pp ≤ ‖x‖pp. 50 Chapter 2. 
Sparse recovery by non-convex optimization – instance optimality Consequently, ‖hT co ‖pp ≤ ‖hT0‖pp + 2‖xT c0 ‖pp, (2.25) which, together with (2.24), implies∑ j≥2 ‖hTj‖p2 ≤ L p 2 −1(‖hT0‖pp + 2‖xT c0 ‖pp) ≤ ρ1− p 2 (‖hT01‖p2 + 2|T0| p 2 −1‖xT c0 ‖pp), (2.26) where ρ := |T0| L , and we used the fact that ‖hT0‖pp ≤ |T0|1− p 2‖hT0‖p2 (which follows as |supp(hT0)| = |T0|). Using (2.26) and (2.23), we obtain ‖Ah‖p2 ≥ Cp,L,|T0|‖hT01‖p2 − 2ρ1− p 2 |T0| p 2 −1(1 + δL) p 2‖xT c0 ‖pp, (2.27) where Cp,L,|T0| := (1− δL+|T0|) p 2 − (1 + δL) p 2ρ1− p 2 . (2.28) At this point, using ‖Ah‖2 ≤ 2, we obtain an upper bound on ‖hT01‖2 given by ‖hT01‖p2 ≤ 1 Cp,L,|T0| ( (2)p + 2ρ1− p 2 (1 + δL) p 2 ‖xT c0 ‖pp |T0|1− p2 ) , (2.29) provided Cp,L,|T0| > 0 (this will impose the condition given in (2.12) on the RIP constants of the underlying matrix A). ( III ) To complete the proof, we will show that the error vector h is concentrated on T01. Denote by hT c0 [m] the mth largest (in magnitude) coefficient of hT c0 and observe that|hT c0 [m]|p ≤ ‖hT c0 ‖pp/m. As hT c01 [m] = hT c0 [L+m], we then have ‖hT c01‖22 = ∑ m≥L+1 |hT c0 [m]|2 ≤ ∑ m≥L+1 (‖hT c0 ‖pp m ) 2 p ≤ ‖hT c0 ‖ 2 p L 2 p −1(2/p− 1) . (2.30) Here, the last inequality follows because for 0 < p < 1∑ m≥L+1 m− 2 p ≤ ∫ ∞ L t− 2 pdt = 1 L 2 p −1(2/p− 1) . Finally, we use (2.25) and (2.30) to conclude ‖h‖22 = ‖hT01‖22 + ‖hT c01‖22 ≤ ‖hT01‖22 + [‖hT0‖pp + 2‖xT c0 ‖pp L1− p 2 (2/p− 1) p2 ] 2 p ≤ [( 1 + ρ1− p 2 (2/p− 1)− p2 )‖hT01‖p2 + 2ρ1− p2 (2/p− 1)− p2 ‖xT c0 ‖pp|T0|1− p2 ] 2 p . (2.31) 51 Chapter 2. Sparse recovery by non-convex optimization – instance optimality Above, we used the fact that ‖hT0‖pp ≤ |T0|1− p 2 ‖hT0‖p2, and that for any b, c ≥ 0, and α ≥ 1, bα + cα ≤ (b+ c)α. ( IV) We now set |T0| = k, L = ak where k and a are chosen such that Cp,ak,k > 0 which is equivalent to having a, k, and p satisfy (2.12). In this case, ‖xT c0 ‖p = σk(x)`p , ρ = 1/a, and combining (2.29) and (2.31) yields ‖h‖p2 ≤ C1p + C2 σk(x) p `p k1− p 2 (2.32) where C1 and C2 are as in (2.14) and (2.15), respectively. 2.4.3 Proof of Lemma 2.2.5. ( I ) The following result of Wojtaszczyk [30, Proposition 2.2] will be useful. Proposition 2.4.1 ( [30]). Let Aω be an n × N Gaussian random matrix, let 0 < µ < 1/ √ 2, and suppose that K1n(log n) ξ ≤ N ≤ eCn for some ξ > (1 − 2µ2)−1 and some constants K1, K2 > 0. Then, there exists a constant c = c(µ, ξ,K1, K2) > 0, independent of n and N , and a set Ωµ = { ω : Aω(B N 1 ) ⊃ µ √ logN/n n Bn2 } such that P (Ωµ) ≥ 1− e−cn. The above statement is true also for Ãω. We will also use the following adaptation of [18, Lemma 2] for which we will first introduce some notation. Define a body to be a compact set containing the origin as an interior point and star shaped with respect to the origin 7 [23]. Below, we use conv(K) to denote the convex-hull of a body K. For K ⊆ B, we denote by d1(K,B) the “distance” between K and B given by d1(K,B) := inf{λ > 0 : K ⊂ B ⊂ λK} = inf{λ > 0 : 1 λ B ⊂ K ⊂ B}. Finally, we call a body K p-convex if for any x, y ∈ K, λx+µy ∈ K whenever λ, µ ∈ [0, 1] such that λp + µp = 1. Lemma 2.4.2. Let 0 < p < 1, and let K be a p-convex body in Rn. If conv(K) ⊂ Bn2 , then d1(K,B n 2 ) ≤ C(p)d1(conv(K), Bn2 )(2/p−1), 7We say that a set is star-shaped (with respect to the origin) if whenever a point x is a member of the set, then so is any point on the line segment connecting the origin to x. 52 Chapter 2. 
Sparse recovery by non-convex optimization – instance optimality where C(p) = ( 21−p + (1− p)21−p/2 p ) 2−p p2 ( 1 (1− p) log 2 ) 2−2p p2 . We defer the proof of this lemma to Section 2.5 ( II ) Note that Ãω(B N 1 ) ⊂ Bn2 . This follows because ‖Ãω‖1→2, which is equal to the largest column norm of Ãω, is 1 by construction. Thus, for x ∈ BN1 , ‖Ãω(x)‖2 ≤ ‖Ãω‖1→2‖x‖1 ≤ 1, that is, Ãω(B N 1 ) ⊂ Bn2 , and so d1(Ãω(BN1 ), Bn2 ) is well-defined. Next, by Proposition 2.4.1, we know that there exists Ωµ with P (Ωµ) ≥ 1− e−cn such that for all ω ∈ Ωµ, Ãω(B N 1 ) ⊃ µ √ logN/n n Bn2 (2.33) From this point on, let ω ∈ Ωµ. Then Bn2 ⊃ Ãω(BN1 ) ⊃ µ √ logN/n n Bn2 , and consequently d1(Ãω(B N 1 ), B n 2 ) ≤ ( µ √ logN/n n )−1 . (2.34) The next step is to note that conv(BNp ) = B N 1 and consequently conv ( Ãω(B N p ) ) = Ãω ( conv(BNp ) ) = Ãω(B N 1 ). We can now invoke Lemma 2.4.2 to conclude that d1(Ãω(B N p ), B n 2 ) ≤ C(p)d1(conv(Ãω(BNp )), Bn2 ) 2−p p = C(p)d1(Ãω(B N 1 ), B n 2 ) 2−p p . (2.35) Finally, by using (2.34), we find that d1(Ãω(B N p ), B n 2 ) ≤ C(p) ( µ2 logN/n n )1/2−1/p , (2.36) and consequently Ãω(B N p ) ⊃ 1 C(p) ( µ2 logN/n n )(1/p−1/2) Bn2 . (2.37) 53 Chapter 2. Sparse recovery by non-convex optimization – instance optimality In other words, the matrix Ãω has the LQp(α) property with the desired value of α for every ω ∈ Ωµ with P (Ωµ) ≥ 1− e−cn. Here c is as specified in Proposition 2.4.1. To see that the same is true for Aω, note that there exists a set Ω0 with P (Ω0) > 1 − e−cn such that for all ω ∈ Ω0, ‖Aj(ω)‖2 < 2, for every column Aj of Aω (this follows from RIP). Using this observation one can trace the above proof with minor modifications. 2.4.4 Proof of Theorem 2.2.6. We start with the following lemma, the proof of which for p < 1 follows with very little modification from the analogous proof of Lemma 3.1 in [30] and shall be omitted. Lemma 2.4.3. Let 0 < p < 1 and suppose that A satisfies RIP(k, δ) and LQp ( γp/k 1/p−1/2) with γp := µ 2/p−1/C(p). Then for every x ∈ RN , there exists x̃ ∈ RN such that Ax = Ax̃, ‖x̃‖p ≤ k 1/p−1/2 γp ‖Ax‖2, and ‖x̃‖2 ≤ C3‖Ax‖2. Here, C3 = 1 γp + γp(1−δ)+1 (1−δ2)γp . Note that C3 depends only on µ, δ and p. We now proceed to prove Theorem 2.2.6. Our proof follows the steps of [30] and differs in the handling of the non-convexity of the `p quasinorms for 0 < p < 1. First, recall that A satisfies RIP(k, δ) and LQp(γp/k 1/p−1/2), so by Lemma 2.4.3, there exists z ∈ RN such that Az = e, ‖z‖p ≤ k1/p−1/2γp ‖e‖2, and ‖z‖2 ≤ C3‖e‖2. Now, A(x+ z) = Ax+ e, and ∆ is (2, p) instance optimal with constant C2,p. Thus, ‖∆(A(x) + e)− (x+ z)‖2 ≤ C2,pσk(x+ z)`p k1/p−1/2 , and consequently ‖∆(A(x) + e)− x‖2 ≤ ‖z‖2 + C2,p σk(x+ z)`p k1/p−1/2 ≤ C3‖e‖2 + C2,p σk(x+ z)`p k1/p−1/2 ≤ C3‖e‖2 + 21/p−1C2,p σk(x)`p + ‖z‖p k1/p−1/2 ≤ C3‖e‖2 + 21/p−1C2,p σk(x)`p k1/p−1/2 + 21/p−1C2,p ‖e‖2 γp , where in the third inequality we used the fact in any that `p quasinorm satisfies the 54 Chapter 2. Sparse recovery by non-convex optimization – instance optimality inequality ‖a+ b‖p ≤ 2 1 p −1(‖a‖p + ‖b‖p) for all a, b ∈ RN . So, we conclude ‖∆(A(x) + e)− x‖2 ≤ ( C3 + 2 1/p−1C2,p/γp ) ‖e‖2 + 21/p−1C2,p σk(x)`p k1/p−1/2 . (2.38) That is (i) holds with C = C3 + 2 1/p−1C2,p(1/γp + 1). Next, we prove parts (ii) and (iii) of Theorem 2.2.6. As in the analogous proof of [30], Theorem 2.2.6 (ii) can be seen as a special case of Theorem 2.2.6 (iii), with e = 0. We therefore turn to proving (iii). Once again, by Lemma 2.4.3, there exists v and z in RN such that the following hold. 
Av = e; ‖v‖p ≤ k1/p−1/2γp ‖e‖2, ‖v‖2 ≤ C3‖e‖2, and Az = AxT c0 ; ‖z‖p ≤ k 1/p−1/2 γp ‖AxT c0 ‖2, ‖z‖2 ≤ C3‖AxT c0 ‖2. Here T0 is the set of indices of the largest (in magnitude) k coefficients of x, and T c 0 and xT co are as in the proof of Theorem 2.2.1. Similar to the previous part we can see that A(xT0 + z + v) = Ax+ e and by the hypothesis of (2, p) instance optimality of ∆, we have ‖∆(Ax+ e)− (xT0 + z + v)‖2 ≤ C2,p σk(xT0 + z + v)`p k1/p−1/2 . Consequently observing that xT0 = x− xT c0 and using the triangle inequality, ‖∆(A(x) + e)− x‖2 ≤ ‖xT c0 − z − v‖2 + C2,p σk(xT0 + z + v)`p k1/p−1/2 ≤ ‖xT c0 − z − v‖2 + 21/p−1(C2,p) (‖z‖p + ‖v‖p k1/p−1/2 ) ≤ σk(x)`2 + ‖z‖2 + ‖v‖2 + 21/p−1C2,p (‖AxT c0 ‖2 γp + ‖e‖2 γp ) ≤ σk(x)`2 + ( C3 + 2 1/p−1C2,p γp ) (‖e‖2 + ‖AxT c0 ‖2). (2.39) That is (iii) holds with C = 1 + C3 + 2 1/p−1C2,p γp . By setting e = 0, one can see that this is the same constant associated with (ii). This concludes the proof of this theorem. 2.4.5 Proof of Theorem 2.2.7. First, we show that (Aω,∆p) is (2, p) instance optimal of order k for an appropriate range of k with high probability. One of the fundamental results in compressed sensing theory states that for any δ ∈ (0, 1), there exists c̃1, c̃2 > 0 and ΩRIP with P (ΩRIP) ≥ 1− 2e−c̃2n, all depending only on δ, such that Aω, ω ∈ ΩRIP, satisfies RIP(`, δ) for any ` ≤ c̃1 nlog(N/n) . 55 Chapter 2. Sparse recovery by non-convex optimization – instance optimality See, e.g., [6], [1], for the proof of this statement as well as for the explicit values of the constants. Now, choose δ ∈ (0, 1) such that δ < 22/p−1−1 22/p−1+1 . Then, with c̃1, c̃2, and ΩRIP as above, for every ω ∈ ΩRIP and for every k < c̃13 nlog(N/n) , the RIP constants of Aω satisfy (2.18) (and hence (2.12)), with a = 2. Thus, by Corollary 2.2.2 (Aω,∆p) is instance optimal of order k with constant C 1/p 2 as in (2.15). Now, set k1 = c1 n log(N/n) with c1 ≤ c̃1/3 such that k1 ∈ N (note that such a c1 exists if n and N are sufficiently large). By the hypothesis of the theorem, n and N satisfy the hypothesis of the Lemma 2.2.5 with ξ = 2, K1 = 1, some 0 < µ < 1/2, and an appropriate K2 (determined by c̃1 above). Because( µ2 log(N/n) n )1/p−1/2 = ( µ2 c1 k1 )1/p−1/2 by Lemma 2.2.5, there exists Ωµ, P (Ωµ) ≥ 1 − e−cn such that for every ω ∈ Ωµ, Aω satisfies LQp ( γp(µ) k1 1/p−1/2 ) where γp(µ) := c 1/p−1/2 1 µ 2/p−1 C(p) . Consequently, set Ω1 := ΩRIP ∩Ωµ. Then, P (Ω1) ≥ 1− 2e−c̃2n− e−cn ≥ 1− 3e−c2n, for c2 = min{c̃2, c}. Note that c2 depends on c, which is now a universal constant, and c̃2, which depends only on the distribution of Aω (and in particular its concentration of measure properties, see [1]). Now, if ω ∈ Ω1, Aω satisfies RIP(3k1, δ), thus RIP(k1, δ), as well as LQp ( γp k1 1/p−1/2 ) . Therefore we can apply part (i) of Theorem 2.2.6 to get the first part of this theorem, i.e., ‖∆(Aω(x) + e)− x‖2 ≤ C ( ‖e‖2 + σk1(x)`p k1 1/p−1/2 ) . (2.40) Here C is as in (2.38) with C2,p = C 1/p 2 . To finish the proof of part (i), note that for k ≤ k1, σk1(x)`p ≤ σk(x)`p and k1/p−1/2 ≤ k1/p−1/21 . To prove part (ii), first define T0 as the support of the k1 largest coefficients (in magnitude) of x and T c0 = {1, ..., N} \ T0. Now, note that for any x there exists a set Ω̃x with P (Ω̃x) ≥ 1 − e−c̃n for some universal constant c̃ > 0, such that for all ω ∈ Ω̃x, ‖AωxT c0 ‖2 ≤ 2‖xT c0 ‖2 = 2σk1(x)`2 (this follows from the concentration of measure property of Gaussian matrices, see, e.g., [1]). Define Ωx := Ω̃x ∩ Ω1. 
Thus, P (Ωx) ≥ 1− 3e−c2n − e−c̃n ≥ 1− 4e−c3n where c3 = min{c2, c̃}. Note that the dependencies of c3 are identical to those of c2 discussed above. Recall that for ω ∈ Ω1, Aω satisfies both RIP(k1, δ) and LQp ( γp (k1) 1/p−1/2 ) . We can now apply part (iii) of Theorem 2.2.6 to obtain for ω ∈ Ωx ‖∆(Aω(x) + e)− x‖2 ≤ C ( 3σk1(x)`2 + ‖e‖2 ) . (2.41) Above, the constant C is as in (2.39). Once again, note that for k ≤ k1, σk1(x)`2 ≤ σk(x)`2 to finish the proof for any k ≤ k1. 56 Chapter 2. Sparse recovery by non-convex optimization – instance optimality 2.5 Proof of Lemma 2.4.2 In this section we provide the proof of Lemma 2.4.2 for the sake of completeness and also because we explicitly calculate the optimal constants involved. Let us first introduce some notation used in [18] and [23]. For a body K ⊂ Rn, define its gauge functional by ‖x‖K := inf{t > 0 : x ∈ tK}, and let Tq(K), q ∈ (1, 2], be the smallest constant C such that ∀m ∈ N, x1, ..., xm ∈ K inf i=±1 { ‖ m∑ i=1 ixi‖K } ≤ Cm1/q. Given a p-convex body K and a positive integer r, define αr = αr(K) := sup{‖ ∑r i=1 xi‖K r : xi ∈ K, i ≤ r}. Note that αr ≤ r−1+1/p. Finally, conforming with the notation used in [18] and [23], we define δK := d1(K, conv(K)). Note that this should not cause confusion as we do not refer to the RIP constants through- out the rest of the paper. It can be shown by a result of [24] that δK = supr αr(K), cf. [18, Lemma 1] for a proof. We will need the following propositions. Proposition 2.5.1 (sub-additivity of ‖ · ‖pK). For the gauge functional ‖ · ‖K associated with a p-convex body K ∈ Rn, the following inequality holds for any x, y ∈ Rn. ‖x+ y‖pK ≤ ‖x‖pK + ‖y‖pK . (2.42) Proof. Let r = ‖x‖K and u = ‖y‖K . If at least one of r and u is zero, then (2.42) holds trivially. (Note that, as K is a body, ‖x‖K = 0 if and only if x = 0.) So, we may assume that both r and u are strictly positive. Since K is compact, it follows that x/r ∈ K and y/u ∈ K. Furthermore, K is p-convex, i.e., for all α, β ∈ [0, 1] with α + β = 1, we have α1/px/r + β1/py/u ∈ K. In particular, choose α = rp rp+up and β = u p rp+up . This gives x+y (rp+up)1/p ∈ K. Consequently, by the definition of the gauge functional ‖ x+y (rp+up)1/p ‖K ≤ 1. Finally, ‖ x+y (rp+up)1/p ‖pK = ‖x+y‖ p K (rp+up) ≤ 1 and ‖x+ y‖pK ≤ rp + up = ‖x‖pK + ‖y‖pK. Proposition 2.5.2. T2(B n 2 ) = 1. Proof. Note that ‖ · ‖Bn2 = ‖ · ‖2, and thus, by definition, T2(Bn2 ) is the smallest constant C such that for every positive integer n and for every choice of points x1, ..., xm ∈ B2, inf i=±1 { ‖ m∑ i=1 ixi‖2 } ≤ C√m. (2.43) 57 Chapter 2. Sparse recovery by non-convex optimization – instance optimality For m ≤ n, we can choose {x1, . . . , xm} to be orthonormal. Consequently, ‖ m∑ i=1 ixi‖22 = m∑ i=1 2i = m, and thus, T2 = T2(B n 2 ) ≥ 1. On the other hand, let n be an arbitrary positive integer, and suppose that {x1, . . . , xm} ⊂ Bn2 . Then, it is easy to show that there exists a choice of signs i, i = 1, . . . , m such that inf i=±1 { ‖ m∑ i=1 ixi‖2 } ≤ √m. Indeed, we will show this by induction. First, note that ‖1x1‖2 = ‖x1‖2 ≤ √ 1. Next, assume that there exists 1, . . . , `−1 such that ‖ `−1∑ i=1 ixi‖2 ≤ √ `− 1. Then (using parallelogram law), min{‖ `−1∑ i=1 ixi + x`‖22, ‖ `−1∑ i=1 ixi − x`‖22} ≤ ‖ `−1∑ i=1 ixi‖22 + ‖x`‖22 ≤ `. Choosing ` accordingly, we get ‖ ∑̀ i=1 ixi‖22 ≤ `, which implies that T2 ≤ 1. Using the fact that T2 ≥ 1 which we showed above, we conclude that T2 = 1. 
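The induction above is constructive: scanning the points once and choosing, via the parallelogram law, whichever sign keeps the running sum shorter already yields ‖∑ ε_i x_i‖_2 ≤ √m. The small numerical check below is our own illustration of this greedy sign selection and is not part of the thesis.

import numpy as np

def greedy_signs(X):
    # X: m x n array of points in the unit ball of R^n.  Choose signs
    # eps_i in {+1, -1} one at a time; by the parallelogram law the running
    # sum satisfies ||s||_2^2 <= sum_i ||x_i||_2^2 <= m at every step.
    s = np.zeros(X.shape[1])
    signs = []
    for x in X:
        eps = 1.0 if np.linalg.norm(s + x) <= np.linalg.norm(s - x) else -1.0
        s = s + eps * x
        signs.append(eps)
    return np.array(signs), s

rng = np.random.default_rng(0)
m, n = 500, 20
X = rng.standard_normal((m, n))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # push rows into the unit ball
signs, s = greedy_signs(X)
print(np.linalg.norm(s), np.sqrt(m))   # the first value should not exceed the second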
Proof of Lemma 2.4.2 We now present a proof of the more general form of Lemma 2.4.2 as stated in [18] and [23] (albeit for the Banach-Mazur distance in place of d1). The proof is essentially as in [18], cf. [23], which in fact also works with the distance d1 to establish an upper bound on the Banach-Mazur distance between a p-convex body and a symmetric body. Lemma 2.5.3. Let 0 < p < 1, q ∈ (1, 2], and let K be a p-convex body. Suppose that B is a symmetric body with respect to the origin such that conv(K) ⊂ B. Then d1(K,B) ≤ Cp,q[Tq(B)]φ−1[d1(conv(K), B)]φ, 58 Chapter 2. Sparse recovery by non-convex optimization – instance optimality where φ = 1/p−1/q 1−1/q . Proof. Note that K ⊂ conv(K) ⊂ B, and therefore d1(K,B) is well-defined. Let d = d1(K,B) and T = Tq(B). Thus, (1/d)B ⊂ K ⊂ B. Let n be a positive integer and let xi, i ∈ 1, 2, ..., 2m be a collection of points in K. Then, xi ∈ B and by the definition of T , there is a choice of signs i so that ‖ ∑2m i=1 ixi‖B ≤ T2m/q. Since B is symmetric, we can assume that D = {i : i = 1} has |D| > 2m−1. Now we can write ‖ 2m∑ i=1 xi‖pK = ‖ 2m∑ i=1 ixi + 2 ∑ i/∈D xi‖pK ≤ dp‖ 2m∑ i=1 ixi‖pB + 2p‖ ∑ i/∈D xi‖pK ≤ dpT p2mp/q + 2mpαp2m−1 , (2.44) where the first inequality uses the sub-additivity of ‖ · ‖K and the fact that (1/d)B ⊂ K. Thus by taking the supremum in (2.44) over all possible xi’s and dividing by 2 mp, we obtain, for any n, αp2m ≤ dpT p2mp/q−mp + αp2m−1 . By applying this inequality for m− 1, m− 2, ..., k, we obtain the following inequality for any k ≤ m αp2m ≤ dpT p ∞∑ i=k+1 2−ip(1−1/q) + αp 2k ≤ dpT p 2 −kp(1−1/q) p(1− 1/q) log 2 + 2 k(1−p). (2.45) Since δK = supr αr, we now want to minimize the right hand side in (2.45) by choosing k appropriately. To that end, define f(k) := 2k(1−p) + (dT )p 2−k(1−1/q)p p(1− 1/q) log 2 and A := (dT )p p(1− 1/q) log 2 . Since αp2m ≤ f(k) for any k ∈ {1, ..., m − 1}, the best bound on αp2m is essentially given by f(k∗), where f ′(k∗) = 0. However, since k∗ is not necessarily an integer (which we require), we will instead use f(k∗ + 1) ≥ f(dk∗e) ≥ f(k∗) as a bound. Thus, we solve f ′(k∗) = 0 to obtain k∗ = 1 1−p/q log2 ( Ap(1−1/q) 1−p ) . By evaluating f(k) at k∗+ 1, we obtain α2m ≤ (f(k∗ + 1))1/p for every m ≥ k∗ + 1. In other words, for every m ≥ k∗ + 1 , we have α2m ≤ (dT ) 1−p 1−p/q ( 21−p + 2−p(1−1/q) 1− p p− p/q )1/p( 1 (1− p) log 2 ) 1/p−1 p(1−p/q) . (2.46) 59 Chapter 2. Sparse recovery by non-convex optimization – instance optimality On the other hand, if m ≤ k∗, then αp2m ≤ 2m(1−p) ≤ 2(k∗+1)(1−p). However, this last bound is one of the summands in the right hand side of (2.45) with k = k∗ + 1 (which we provide a bound for in (2.46)). Consequently (2.46) holds for all n. In particular, it holds for the value of n which achieves the supremum of α2m . Since δK = supr αr, we obtain δK ≤ (dT ) (1−p) (1−p/q) ( 21−p + 2−p(1−1/q) 1− p p(1− 1/q) )1/p( 1 (1− p) log 2 ) 1/p−1 p(1−p/q) . (2.47) Remark. In the previous step we utilize the fact that in the derivations above we can replace every 2m and 2k with n and k respectively, thus every n and k with log2m and log2k without changing (2.46). This allows us to pass from the bound on α2m to δK = suprαr without any problems. Recalling the definitions of d1(conv(K), B) and δK , note the following inclusions: 1 δKd1(conv(K,B)) B ⊂ 1 δK conv(K) ⊂ K ⊂ conv(K) ⊂ B. (2.48) Consequently 1 δKd1(conv(K,B)) B ⊂ K ⊂ B and the inequality d1(K,B) = d ≤ δKd1(conv(K), B) (2.49) follows from the definition of d1(K,B). 
Combining (2.49) and (2.47) we complete the proof with Cp,q = ( 21−p + 2−p(1−1/q) 1− p p(1− 1/q) ) 1−p/q p2(1−1/q) ( 1 (1− p) log 2 ) 1/p−1 p(1−1/q) . Finally, we choose above B = Bn2 and q = 2, recall that T = T2(B n 2 ) = 1 (see Proposition 2.5.2), and obtain Lemma 2.4.2 as a corollary with C(p) = ( 21−p + (1− p)21−p/2 p ) 2−p p2 ( 1 (1− p) log 2 ) 2−2p p2 . (2.50) Acknowledgment The authors would like to thank Michael Friedlander, Gilles Hennenfent, Felix Herrmann, and Ewout Van Den Berg for many fruitful discussions. This work was finalized during an AIM workshop. We thank the American Institute of Mathematics for its hospitality. Moreover, R. Saab thanks Rabab Ward for her immense support. The authors also 60 Chapter 2. Sparse recovery by non-convex optimization – instance optimality thank the anonymous reviewers for their constructive comments which improved the paper significantly. 61 Chapter 2. Sparse recovery by non-convex optimization – instance optimality 2.6 References [1] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008. [2] E. J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes rendus-Mathématique, 346(9-10):589–592, 2008. [3] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52:489–509, 2006. [4] E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59:1207–1223, 2006. [5] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):489–509, 2005. [6] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: uni- versal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406– 5425, 2006. [7] R. Chartrand. Exact reconstructions of sparse signals via nonconvex minimization. IEEE Signal Processing Letters, 14(10):707–710, 2007. [8] R. Chartrand and V. Staneva. Restricted isometry properties and nonconvex com- pressive sensing. Inverse Problems, 24(035020), 2008. [9] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1999. [10] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best k-term ap- proximation. Journal of the American Mathematical Society, 22(1):211–231, 2009. [11] I. Daubechies, R. DeVore, M. Fornasier, and C. Güntürk. Iteratively re-weighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, 63(1):1–38, 2009. [12] M. Davies and R. Gribonval. Restricted Isometry Constants where `p sparse recovery can fail for 0 < p ≤ 1. IEEE Transactions on Information Theory, 55(5):2203–2214, May 2009. [13] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006. 62 Chapter 2. Sparse recovery by non-convex optimization – instance optimality [14] D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via `1 minimization. Proceedings of the National Academy of Sciences of the United States of America, 100(5):2197–2202, 2003. [15] D. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47:2845–2862, 2001. [16] M. Figueiredo, R. Nowak, and S. 
Wright. Gradient Projection for Sparse Recon- struction: Application to Compressed Sensing and Other Inverse Problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, 2007. [17] S. Foucart and M. Lai. Sparsest solutions of underdetermined linear systems via `q-minimization for 0 < q ≤ 1. Applied and Computational Harmonic Analysis, 26(3):395–407, 2009. [18] Y. Gordon and N. Kalton. Local structure theory for quasi-normed spaces. Bulletin des sciences mathématiques, 118:441–453, 1994. [19] R. Gribonval, R. M. Figueras i Ventura, and P. Vandergheynst. A simple test to check the optimality of sparse signal approximations. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’05), volume 5, pages V/717 – V/720, Mar. 2005. [20] R. Gribonval, R. M. Figueras i Ventura, and P. Vandergheynst. A simple test to check the optimality of sparse signal approximations. EURASIP Signal Processing, special issue on Sparse Approximations in Signal and Image Processing, 86(3):496– 510, Mar. 2006. [21] R. Gribonval and M. Nielsen. On the strong uniqueness of highly sparse expansions from redundant dictionaries. In International Conference on Independent Compo- nent Analysis (ICA’04), LNCS, Granada, Spain, Sept. 2004. Springer-Verlag. [22] R. Gribonval and M. Nielsen. Highly sparse representations from dictionaries are unique and independent of the sparseness measure. Applied and Computational Harmonic Analysis, 22(3):335–355, May 2007. [23] O. Guedon and A. Litvak. Euclidean projections of p-convex body. GAFA, Lecture Notes in Math, 1745:95–108, 2000. [24] N. Peck. Banach-mazur distances and projections on p-convex spaces. Mathematis- che Zeitschrift, 177(1):131–142, 1981. [25] R. Saab, R. Chartrand, and Ö. Yılmaz. Stable sparse approximations via nonconvex optimization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3885–3888, 2008. 63 Chapter 2. Sparse recovery by non-convex optimization – instance optimality [26] J. Tropp. Recovery of short, complex linear combinations via l1 minimization. IEEE Transactions on Information Theory, 51(4):1568–1570, April 2005. [27] E. van den Berg and M. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2008. [28] P. Vandergheynst and P. Frossard. Efficient image representation by anisotropic refinement in matching pursuit. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 3, 2001. [29] R. Ventura, P. Vandergheynst, and P. Frossard. Low rate and scalable image coding with redundant representations. IEEE Transactions on Image Processing, 15(3):726– 739, 2006. [30] P. Wojtaszczyk. Stability and instance optimality for gaussian measurements in compressed sensing. Foundations of Computational Mathematics (to appear). 64 Chapter 3 Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements 1 3.1 Introduction Compressed sensing is concerned with when and how sparse signals can be recovered exactly or approximately from few linear measurements [8, 10, 15]. Let A be an n × N matrix providing the measurements where n N , and ΣNk denote the space of k-sparse signals in RN , k < n. A standard objective, after a suitable change of basis, is that the mapping x 7→ b = Ax be injective on ΣNk . Minimal conditions on A that offer such a guarantee are well-known (see, e.g. [11]) and require at least that n ≥ 2k. 
On the other hand, under stricter conditions on A, such as the restricted isometry property (RIP), one can recover sparse vectors from their measurements by numerically efficient methods, such as `1-minimization. Moreover, the recovery will also be robust when the measurements are corrupted [9], cf. [14]; if b̂ = Ax + e where e is any vector such that ‖e‖2 ≤ , then the solution x# of the optimization problem min ‖y‖1 subject to ‖Ay − b̂‖2 ≤  (3.1) will satisfy ‖x− x#‖2 . . The price paid for these stronger recovery guarantees is the somewhat smaller range of values available for the dimensional parameters n, k, and N . While there are some explicit (deterministic) constructions of measurement matrices with stable recovery guar- antees, best results (widest range of values) have been found via random families of ma- trices. For example, if the entries of A are independently sampled from the Gaussian distribution N (0, 1 n ), then with high probability, A will satisfy the RIP (with a suitable set of parameters) if n ∼ k log(N k ). Significant effort has been put on understanding the 1A version of this chapter will be submitted for publication. C.S. Güntürk, A. Powell, R. Saab, and Ö. Yılmaz, “Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measure- ments”. 65 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements phase transition behavior of the RIP parameters for other random families, e.g., Bernoulli matrices and random Fourier samplers. Quantization for compressed sensing measurements The robust recovery result mentioned above is essential to the practicality of compressed sensing, especially from an analog-to-digital conversion point of view. If a discrete al- phabet A, such as A = dZ for some step size d > 0, is to be employed to replace each measurement bj with a quantized measurement qj := b̂j ∈ A, then the temptation, in light of this result, would be to minimize ‖e‖2 = ‖b− q‖2 over q ∈ An. This immediately reduces to minimizing |bj − qj | for each j, i.e., quantizing each measurement separately to the nearest element of A, which is usually called Pulse Code Modulation (PCM). Since ‖b− q‖2 ≤ 12d √ n, the robust recovery result guarantees that ‖x− x#PCM‖2 . d √ n. (3.2) Note that (3.2) is somewhat surprising as the reconstruction error bound does not improve by increasing the number of (quantized) measurements; on the contrary, it deteriorates. However, the √ n term is an artifact of our choice of normalization for the measurement matrix A. In the compressed sensing literature, it is conventional to normalize a (random) measurement matrix A so that it has unit-norm columns (in expectation). This is the necessary scaling to achieve isometry, and for random matrices it ensures that E‖Ax‖2 = ‖x‖2 for any x, which then leads to the RIP through concentration of measure and finally to the robust recovery result stated in (3.1). On the other hand, this normalization imposes an n-dependent dynamic range for the measurements which scales as 1/ √ n, hence it is not fair to use the same value d for the quantizer resolution as n increases. In this paper, we investigate the dependence of the recovery error on the number of quantized measurements where d is independent of n. A fair assessment of this dependence can be made only if the dynamic range of each measurement is kept constant while increasing the number of measurements. 
This suggests that the natural normalization in our setting should ensure that the entries of the measurement matrix A are independent of n. In the specific case of random matrices, we can achieve this by choosing the entries of A standard i.i.d. random variables, e.g. according to N (0, 1). With this normalization of A, the robust recovery result of [9], given above, can be modified as ‖b̂− b‖2 ≤  =⇒ ‖x− x#‖2 . 1√ n , (3.3) which also replaces (3.2) with ‖x− x#PCM‖2 . d. (3.4) As expected, this error bound does not deteriorate with n anymore. In this paper, we will adopt this normalization convention and work with the standard Gaussian distribu- 66 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements tion N (0, 1) when quantization is involved, but also use the more typical normalization N (0, 1/n) for certain concentration estimates that will be derived in Section 3.3. The transition between these two conventions is of course trivial. The above analysis of quantization error is based on PCM, which involves separate (independent) quantization of each measurement. The vast logarithmic reduction of the ambient dimension N would seem to suggest that this strategy is essentially optimal since information appears to be squeezed (compressed) into few uncorrelated measure- ments. Perhaps for this reason, the existing literature on quantization of compressed sensing measurements focused mainly on alternative reconstruction methods from PCM- quantized measurements and variants thereof, e.g., [6, 12, 17, 21, 23, 29, 33]. The only exception we are aware of is [7], which uses Σ∆ modulation to quantize x before the random measurements are made. On the other hand, it is clear that if (once) the support of the signal is known (re- covered), then the n measurements that have been taken are highly redundant compared to the maximum k degrees of freedom that the signal has on its support. At this point, the signal may be considered oversampled. However, the error bound (3.4) does not offer an improvement of reconstruction accuracy, even if additional samples become available. (The RIP parameters of A are likely to improve as n increases, but this does not seem to reflect on the implicit constant factor in (3.4) satisfactorily.) This is contrary to the conventional wisdom in the theory and practice of oversampled quantization in A/D conversion where reconstruction error decreases as the sampling rate increases, especially with the use of quantization algorithms specially geared for the reconstruction procedure. The main goal of this paper is to show how this can be done in the compressed sensing setting as well. Quantization for oversampled data Methods of quantization have long been studied for oversampled data conversion. Sigma- delta (Σ∆) quantization (modulation), for instance, is the dominant method of A/D conversion for audio signals and relies heavily on oversampling, see [13, 19, 25]. In this setting, oversampling is typically exploited to employ very coarse quantization (e.g., 1 bit/sample), however, the working principle of Σ∆ quantization is applicable to any quantization alphabet. In fact, it is more natural to consider Σ∆ quantization as a “noise8 shaping” method, for it seeks a quantized signal (qj) by a recursive procedure to push the quantization error signal b − q towards an unoccupied portion of the signal spectrum. In the case of bandlimited signals, this would correspond to high frequency bands. 
As the canonical example, the standard first-order Σ∆ quantizer computes a bounded 8The quantization error is often modeled as white noise in signal processing, hence the terminology. However our treatment of quantization error in this paper is entirely deterministic. 67 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements solution (uj) to the difference equation (∆u)j := uj − uj−1 = bj − qj . (3.5) This can be achieved recursively by choosing, for example, qj = argmin p∈A |uj−1 + bj − p|. (3.6) Since the reconstruction of oversampled bandlimited signals can be achieved with a low- pass filter ϕ that can also be arranged to be well-localized in time, the reconstruction error ϕ ∗ (b− q) = ∆ϕ ∗ u becomes small due to the smoothness of ϕ. It turns out that, with this procedure, the reconstruction error is reduced by a factor of the oversampling ratio λ, defined to be the ratio of the actual sampling rate to the bandwidth of ϕ. This principle can be iterated to set up higher-order Σ∆ quantization schemes. It is well-known that a reconstruction accuracy of order O(λ−r) can be achieved (in the supremum norm) if a bounded solution to the equation ∆ru = b − q can be found [13] (here, r ∈ N is the order of the associated Σ∆ scheme). The boundedness of u is im- portant for practical implementation, but it is also important for the error bound. The implicit constant in this bound depends on r as well as ‖u‖∞. Fine analyses of carefully designed schemes have shown that optimizing the order can even yield exponential ac- curacy O(e−cλ) for fixed sized finite alphabets A (see [19]), which is optimal apart from the value of the constant c. For infinite alphabets, there is no theoretical lower bound for the quantization error as λ increases. (However almost all practical coding schemes use some form of finite alphabet.) The above formulation of noise-shaping for oversampled data conversion generalizes naturally to the problem of quantization of arbitrary frame expansions, e.g., [3]. Specifi- cally, we will consider finite frames in Rk. Let E be a full-rank n × k matrix and F be any left inverse of E. In frame theory, one refers to the collection of the rows of E as the analysis frame and the columns of F as the synthesis (dual) frame. For any x ∈ Rk, let b = Ex be its frame coefficient vector, q ∈ An be its quantization, and let x̂ := Fq be its reconstruction using the dual frame. Typically An ∩ b+ Ker(F ) = ∅, so we have x̂ 6= x. The reconstruction error is given by x− x̂ = F (b− q), (3.7) and the goal of noise shaping amounts to arranging q in such a way that b− q is close to Ker(F ). If the sequence (fj) n 1 of dual frame vectors were known to vary smoothly in j (including smooth termination into null vector), then Σ∆ quantization could be employed without much alteration, e.g., [5, 22]. However, this need not be the case for many examples of frames (together with their canonical duals) that are used in practice. For this reason, it has recently been proposed in [4] to use special alternative dual frames, called Sobolev 68 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements dual frames, that are naturally adapted to Σ∆ quantization. 
It is shown in [4] (see also Section 3.2) that for any frame E, if a standard rth order Σ∆ quantization algorithm with alphabet A = dZ is used to compute q := qΣ∆, then with an rth order Sobolev dual frame F := FSob,r and x̂Σ∆ := FSob,rqΣ∆, the reconstruction error obeys the bound 9 ‖x− x̂Σ∆‖2 .r d √ n σmin(D−rE) , (3.8) where D is the n× n difference matrix defined by Dij :=  1, if i = j, −1, if i = j + 1, 0, otherwise, (3.9) and σmin(D −rE) stands for the smallest singular value of D−rE. Contributions For the compressed sensing application that is the subject of this paper, E will simply be a sub-matrix of the measurement matrix A, hence it may have been found by sampling an i.i.d. random variable. Minimum singular values of random matrices with i.i.d. entries have been studied extensively in the mathematical literature. For an n×k random matrix E with i.i.d. entries sampled from a sub-Gaussian distribution with zero mean and unit variance,10 one has σmin(E) ≥ √ n− √ k (3.10) with high probability [27]. Note that in general D−rE would not have i.i.d. entries. A naive lower bound for σmin(D −rE) would be σmin(D−r)σmin(E). However (see Proposition 3.3.1), σmin(D −r) satisfies σmin(D −r) r 1, (3.11) and therefore this naive product bound yields no improvement on the reconstruction error for Σ∆-quantized measurements over the bound (3.4) for PCM-quantized ones. In fact, the true behavior of σmin(D −rE) turns out to be drastically different and is described in Theorem A, one of our main results (see also Theorem 3.3.7). For simplicity, we shall work with standard i.i.d. Gaussian variables for the entries of E. In analogy with our earlier notation, we define the “oversampling ratio” λ of the frame E by λ := n k . (3.12) 9Here and throughout a &r b (resp. a .r b) should be read as a ≥ Crb (resp. a ≤ Crb) for some constant Cr which depends on r but is independent of a and b. 10 As mentioned earlier, we do not normalize the measurement matrix A in the quantization setting. 69 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Theorem A. Let E be an n × k random matrix whose entries are i.i.d. N (0, 1). For any α ∈ (0, 1), if λ ≥ c(logn)1/(1−α), then with probability at least 1− exp(−c′nλ−α), σmin(D −rE) &r λα(r− 1 2 ) √ n, (3.13) which yields the reconstruction error bound ‖x− x̂Σ∆‖2 .r λ−α(r− 12 )d. (3.14) While the kind of decay in this error bound is familiar to Σ∆ modulation, the domain of applicability of this result is rather surprising. Previously, the only setting in which this type of approximation accuracy could be achieved (with or without Sobolev duals) was the case of highly structured frames (e.g. when the frame vectors are found by sampling along a piecewise smooth frame path). Theorem A shows that such an accuracy is obtained even when the analysis frame is a random Gaussian matrix, provided the reconstruction is done via Sobolev duals. In the compressed sensing setting, one needs (3.13) to be uniform for all the frames E that are found by selecting k columns of A at a time. The proof of Theorem A extends in a straightforward manner using a standard “union bound” argument, provided λ is known to be slightly larger. More precisely, if A is an n × N matrix whose entries are i.i.d. according to N (0, 1), and if λ := n/k ≥ c(logN)1/(1−α), then (3.13) holds for all E = AT with #T ≤ k with the same type of probability bound (with new constants). 
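The quantity σ_min(D^{−r}E) appearing in Theorem A is easy to probe numerically. The sketch below is our own experiment (the dimensions, the order r, and the number of draws are arbitrary): it builds the difference matrix D of (3.9), draws Gaussian frames E with i.i.d. N(0, 1) entries, and tracks σ_min(D^{−r}E)/√n as the oversampling ratio λ = n/k grows; Theorem A predicts growth roughly of order λ^{r−1/2} (for α close to 1), in contrast with the naive product bound discussed above.

import numpy as np

def difference_matrix(n):
    # D as in (3.9): ones on the diagonal, -1 on the first subdiagonal.
    return np.eye(n) - np.eye(n, k=-1)

rng = np.random.default_rng(0)
k, r = 5, 2
for lam in [10, 20, 40, 80]:
    n = lam * k
    Dinv_r = np.linalg.matrix_power(np.linalg.inv(difference_matrix(n)), r)
    smin = []
    for _ in range(20):                       # average over a few random draws
        E = rng.standard_normal((n, k))       # entries i.i.d. N(0, 1)
        smin.append(np.linalg.svd(Dinv_r @ E, compute_uv=False)[-1])
    print(lam, np.mean(smin) / np.sqrt(n))    # expect growth roughly like lam**(r - 1/2)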
This result can be utilized to improve the reconstruction accuracy of a sparse signal x from its Σ∆-quantized compressed sensing measurements if the support T of x is known. This is because if T is known, AT is known, and its Sobolev dual can be found and used in the reconstruction. On the other hand, for most signals, recovering the exact or approximate support is already nearly guaranteed by the robust recovery result shown in (3.3) together with the stability of the associated Σ∆ quantizer. For example, a simple sufficient condition for full recovery of the support is that all the |xj | for j ∈ T be larger than C‖y− qΣ∆‖2 for a suitable constant C. A precise version of this condition is stated in Theorem B. In light of all these results, we propose Σ∆ quantization as a more effective alternative of PCM (independent quantization) for compressed sensing. With high probability on the measurement matrix, a significant improvement of the reconstruction accuracy of sparse signals can be achieved through a two-stage recovery procedure: (i) Coarse recovery: `1-minimization (or any other robust recovery procedure) ap- plied to qΣ∆ yields an initial, “coarse” approximation x # of x, and in particular, the exact (or approximate) support T of x. (ii) Fine recovery: Sobolev dual of the frame AT applied to qΣ∆ yields a finer ap- proximation x̂Σ∆ of x. 70 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Combining all these, our second main theorem follows (also see Theorem 3.4.2): Theorem B. Let A be an n × N matrix whose entries are i.i.d. according to N (0, 1). Suppose α ∈ (0, 1) and λ := n/k ≥ c(logN)1/(1−α) where c = c(r, α). Then there are two constants c′ and C that depend only on r such that with probability at least 1 − exp(−c′nλ−α) on the draw of A, the following holds: For every x ∈ ΣNk such that minj∈supp(x) |xj| ≥ Cd, the reconstruction x̂Σ∆ satisfies ‖x− x̂Σ∆‖2 .r λ−α(r− 12 )d. (3.15) To put this result in perspective, note that the approximation error given in (3.15) decays as the “redundancy” λ = n k increases. In fact, by using an arbitrarily high order Σ∆ scheme, we can make the error decay faster than any power law (albeit with higher constants). Note that such a decay is not observed in the reconstruction error bound for PCM given in (3.4). Of course, one could argue that these upper bounds may not reflect the actual behavior of the error. However, in the setting of frame quantization the performance of PCM is well investigated. In particular, let E be an n × k real matrix, and let K be a bounded set in Rk. For x ∈ K, suppose we obtain qPCM(x) by quantizing the entries of b = Ex using PCM with alphabet A = dZ. Let ∆opt be an optimal decoder. Then, Goyal et al. show in [18] that[ E ‖x−∆opt(qPCM(x))‖22 ]1/2 & λ−1d where λ = n/k and the expectation is with respect a probability measure on x that is, for example, absolutely continuous. This lower bound limits the extent to which one can improve the reconstruction by means of alternative reconstruction algorithms from PCM- quantized compressed sensing measurements. On the other hand, setting, for example, α = 3/4 in Theorem B we observe that if we use a second-order Σ∆ scheme to quantize the measurements, and if we adopt the two-stage recovery procedure proposed above, the resulting approximation will be superior to that produced optimally from PCM-quantized measurements, provided n/k is sufficiently large. 
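The proposed two-stage recovery can be summarized in a short schematic, given below; it is our own illustration rather than the authors' code. The coarse stage is represented by a placeholder coarse_decoder (standing in for ℓ1 minimization or any other robust decoder), whose k largest output entries serve as the support estimate T; the fine stage then applies the r-th order Sobolev dual of A_T, computed from the explicit formula F_sob,r D^r = (D^{−r}A_T)† recalled in Section 3.2, to the quantized measurements q_Σ∆.

import numpy as np

def difference_matrix(n):
    # D as in (3.9).
    return np.eye(n) - np.eye(n, k=-1)

def sobolev_dual(E, r):
    # r-th order Sobolev dual of the frame E, via F_sob D^r = (D^{-r} E)^dagger,
    # i.e. F_sob = (D^{-r} E)^dagger D^{-r}; note F_sob E = I.
    n = E.shape[0]
    Dinv_r = np.linalg.matrix_power(np.linalg.inv(difference_matrix(n)), r)
    return np.linalg.pinv(Dinv_r @ E) @ Dinv_r

def two_stage_recover(A, q, k, r, coarse_decoder):
    # Stage 1 (coarse): coarse_decoder is a placeholder for any robust decoder
    # (e.g. l1 minimization) applied to the quantized measurements q; the k
    # largest entries of its output give the (approximate) support T.
    x_coarse = coarse_decoder(A, q)
    T = np.argsort(np.abs(x_coarse))[-k:]
    # Stage 2 (fine): reconstruct on T with the Sobolev dual of A_T.
    x_hat = np.zeros(A.shape[1])
    x_hat[T] = sobolev_dual(A[:, T], r) @ q
    return x_hat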
It is possible to imagine more sophisticated and more effective quantization and re- covery algorithms for compressed sensing. However using Σ∆ quantization has a number of appealing features: • It produces more accurate approximations than any known quantization scheme in this setting (even when sophisticated recovery algorithms are employed). • It ismodular in the sense that if the fine recovery stage is not available or practical to implement, then the standard (coarse) recovery procedure can still be applied as is. 71 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements • It is progressive in the sense that if new measurements arrive (in any given order), noise shaping can be continued on these measurements as long as the state of the system (r real values for an rth order scheme) has been stored. • It is universal in the sense that it uses no information about the measurement matrix or the signal. The paper is organized as follows. We review the basics of Σ∆ quantization and Sobolev duals in frame theory in Section 3.2, followed by the reconstruction error bounds for random Gaussian frames in Section 3.3. We then present the specifics of our proposed quantization and recovery algorithm for compressed sensing in Section 3.4. We present our numerical experiments in Section 3.5 and conclude with extensions to more general settings in Section 3.6. 3.2 Background on Σ∆ quantization of frame expansions Σ∆ quantization The governing equation of a standard rth order Σ∆ quantization scheme with input b = (bj) and output q = (qj) is (∆ru)j = bj − qj , j = 1, 2, . . . , (3.16) where the qj ∈ A are chosen according to some quantization rule given by qj = Q(uj−1, . . . , uj−T , bj , . . . , bj−S). (3.17) Not all Σ∆ quantization schemes are presented (or implemented) in this canonical form, but they all can be rewritten as such for an appropriate choice of r and u. We shall not be concerned with the specifics of the mapping Q, except that we need u to be bounded. The smaller the size of the alphabet A gets relative to r, the harder it is to guarantee this property. The extreme case is 1-bit quantization, i.e., |A| = 2, which is typically the most challenging setting. We will not be working in this case. In fact, for our purposes, A will in general have to be sufficiently fine to allow for the recovery of the support of sparse signals. In order to avoid technical difficulties, we shall work with the infinite alphabet A = dZ, but also note that only a finite portion of this alphabet will be used for bounded signals. A standard quantization rule that has this “boundedness” property is given by the greedy rule which minimizes |uj| given uj−1, . . . , uj−r and bj , i.e., qj = argmin r∈A ∣∣∣ r∑ i=1 (−1)i−1 ( r i ) uj−i + bj − r ∣∣∣. (3.18) 72 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements It is well-known that with this rule, one has |uj| ≤ 2−1d and |bj − qj | ≤ 2r−1d. In turn, if ‖b‖∞ < C, then one needs only L := 2dCd e+ 2r + 1 levels. In this case, the associated quantizer is said to be log2 L-bit, and we have ‖u‖∞ . d and ‖b− q‖∞ .r d. (3.19) With more stringent quantization rules, the first inequality would also have an r-dependent constant. In fact, it is known that for quantization rules with a 1-bit alphabet, this con- stant will be as large as O(rr), e.g., see [13, 19]. In this paper, unless otherwise stated, we shall be working with the greedy quantization rule of (3.18). 
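A direct implementation of the r-th order recursion (3.16) with the greedy rule (3.18) and alphabet A = dZ takes only a few lines. The sketch below is illustrative (zero initial conditions, arbitrary test parameters); for this infinite alphabet the greedy rule reduces to rounding a running quantity to the nearest multiple of d, which enforces |u_j| ≤ d/2 as noted above.

import numpy as np
from math import comb

def sigma_delta(b, r, d):
    # r-th order greedy Sigma-Delta quantization with alphabet d*Z and zero
    # initial conditions: (Delta^r u)_j = b_j - q_j as in (3.16), with q_j
    # chosen by the greedy rule (3.18), i.e. rounding to the nearest multiple
    # of d, which keeps |u_j| <= d/2.
    n = len(b)
    u = np.zeros(n + r)                        # u[0..r-1] hold the zero initial conditions
    q = np.zeros(n)
    for j in range(n):
        w = b[j] + sum((-1) ** (i - 1) * comb(r, i) * u[r + j - i] for i in range(1, r + 1))
        q[j] = d * np.round(w / d)
        u[r + j] = w - q[j]
    return q, u[r:]

rng = np.random.default_rng(0)
b = rng.standard_normal(512)
q, u = sigma_delta(b, r=2, d=0.25)
print(np.max(np.abs(u)) <= 0.25 / 2)           # the state sequence stays within d/2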
The initial condition of the recursion in (3.16) can be set arbitrarily, but it will be convenient for us to set them equal to zero for finite frames. With u−r+1 = · · · = u0 = 0, and j = 1, . . . , n, the difference equation (3.16) can be rewritten as a matrix equation Dru = b− q, (3.20) where D is as in (3.9). As before, we assume E is an n × k matrix whose rows form the analysis frame and F is a k × n left inverse of E whose columns form the dual (synthesis) frame. Given any x ∈ Rk, we set b = Ex, and define its rth order Σ∆ quantization qΣ∆ and its reconstruction x̂Σ∆ := FqΣ∆. Substituting (3.20) into (3.7), we obtain the error expression x− x̂ = FDru. (3.21) With this expression, ‖x− x̂‖ can be bounded for any norm ‖ · ‖ simply as ‖x− x̂‖ ≤ ‖u‖∞ n∑ j=1 ‖(FDr)j‖. (3.22) Here (FDr)j is the jth column of FD r. This bound is also valid in infinite dimensions, and in fact has been used extensively in the mathematical treatment of oversampled A/D conversion of bandlimited functions. For r = 1, and the `2 norm, the sum term on the right hand side motivated the study of the so-called frame variation defined by V (F ) := n∑ j=1 ‖fj − fj+1‖2, (3.23) where (fj) are the columns of F , and one defines fn+1 = 0. Higher-order frame variations to be used with higher-order Σ∆ schemes are defined similarly, see [2,3]. Frames (analysis as well as synthesis) that are obtained via uniform sampling a smooth curve in Rk (so- called frame path) are typical in many settings. However, the “frame variation bound” is 73 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements useful in finite dimensions only when the frame path terminates smoothly. Otherwise, it does not provide higher-order reconstruction accuracy. Designing smoothly terminating frames can be technically challenging, e.g., [5]. Sobolev duals Recently, a more straightforward approach was proposed in [22] for the design of (alter- nate) duals of finite frames for Σ∆ quantization. Here, one instead considers the operator norm of FDr on `2 and the corresponding bound ‖x− x̂‖2 ≤ ‖FDr‖op‖u‖2. (3.24) Note that this bound is not available in the infinite dimensional setting of bandlimited functions due to the fact that u is typically not in `2. It is now natural to minimize ‖FDr‖op over all dual frames of a given analysis frame E. These frames, introduced in [4], have been called Sobolev duals, in analogy with `2-type Sobolev (semi)norms. Σ∆ quantization algorithms are normally designed for analog circuit operation, so they control ‖u‖∞, which would control ‖u‖2 only in a suboptimal way. However, it turns out that there are important advantages in working with the `2 norm in the analysis. The first advantage is that Sobolev duals are readily available by an explicit formula. The solution Fsob,r of the optimization problem min F ‖FDr‖op subject to FE = I (3.25) is given by the matrix equation Fsob,rD r = (D−rE)†, (3.26) where † stands for the Moore-Penrose inversion operator, which, in our case, is given by E† := (E∗E)−1E∗. Note that for r = 0 (i.e., no noise-shaping, or PCM), one simply obtains F = E†, the canonical dual frame of E. The second advantage of this approach is that highly developed methods are present for spectral norms of matrices, especially in the random setting. Plugging (3.26) into (3.24), it immediately follows that ‖x− x̂‖2 ≤ ‖(D−rE)†‖op‖u‖2 = 1 σmin(D−rE) ‖u‖2, (3.27) where σmin(D −rE) stands for the smallest singular value of D−rE. 74 Chapter 3. 
Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements 3.3 Reconstruction error bound for random frames In what follows, σj(A) will denote the jth largest singular value of the matrixA. Similarly, λj(B) will denote the jth largest eigenvalue of the Hermitian matrix B. Hence, we have σj(A) = √ λj(A∗A). We will also use the notation Σ(A) for the diagonal matrix of singular values of A, with the convention (Σ(A))jj = σj(A). All matrices in our discussion will be real valued and the Hermitian conjugate reduces to the transpose. We have seen that the main object of interest for the reconstruction error bound is σmin(D −rE) for a random frame E. Let H be a square matrix. The first observation we make is that when E is i.i.d. Gaussian, the distribution of Σ(HE) is the same as the distribution of Σ(Σ(H)E). To see this, let UΣ(H)V ∗ be the singular value decomposition of H where U and V are unitary matrices. Then HE = UΣ(H)V ∗E. Since the unitary transformation U does not alter singular values, we have Σ(HE) = Σ(Σ(H)V ∗E), and because of the unitary invariance of the i.i.d. Gaussian measure, the matrix Ẽ := V ∗E has the same distribution as E, hence the claim. Therefore it suffices to study the singular values of Σ(H)E. In our case, H = D−r and we first need information on the deterministic object Σ(D−r). The following result will be sufficient for our purposes: Proposition 3.3.1. Let r be any positive integer and D be as in (3.9). There are positive numerical constants c1(r) and c2(r), independent of n, such that c1(r) (n j )r ≤ σj(D−r) ≤ c2(r)nr, j = 1, . . . , n. (3.28) The proof of this result is rather standard in the study of Toeplitz matrices, and is given in Section 3.7. 3.3.1 Lower bound for σmin(D −rE) In light of the above discussion, the distribution of σmin(D −rE) is the same as that of inf ‖x‖2=1 ‖Σ(D−r)Ex‖2. (3.29) We replace Σ(D−r) with an arbitrary diagonal matrix S with Sjj =: sj > 0. The first two results will concern upper bounds for the norm of independent but non-identically distributed Gaussian vectors. They are rather standard, but we include them for the definiteness of our discussion when they will be used later. Proposition 3.3.2. Let ξ ∼ N (0, 1 n In). For any Θ > 1, P ( n∑ j=1 s2jξ 2 j > Θ‖s‖2∞ ) ≤ Θn/2e−(Θ−1)n/2. (3.30) 75 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Proof. Since sj ≤ ‖s‖∞ for all j, we have P ( n∑ j=1 s2jξ 2 j > Θ‖s‖2∞ ) ≤ P ( n∑ j=1 ξ2j > Θ ) . (3.31) This bound is the (standard) Gaussian measure of the complement of a sphere of radius√ nΘ and can be estimated very accurately. We use a simple approach via P ( n∑ j=1 ξ2j > Θ ) ≤ min λ≥0 ∫ Rn e−(Θ− ∑n j=1 x 2 j)λ/2 n∏ j=1 e−nx 2 j/2 dxj√ 2pi/n = min λ≥0 e−λΘ/2(1− λ/n)−n/2 = Θn/2e−(Θ−1)n/2, (3.32) where in the last step we set λ = n(1−Θ−1). Lemma 3.3.3. Let E be an n× k random matrix whose entries are i.i.d. N (0, 1 n ). For any Θ > 1, consider the event E := { ‖SE‖`k2→`n2 ≤ 2 √ Θ‖s‖∞ } . Then P (E c) ≤ 5kΘn/2e−(Θ−1)n/2. Proof. We follow the same approach as in [1]. The maximum number of ρ-distinguishable points on the unit sphere in Rk is at most (2 ρ +1)k. (This follows by a volume argument11 as in e.g., [24, p.487].) Fix a maximal set Q of 1 2 -distinguishable points of the unit sphere in Rk with #Q ≤ 5k. Since Q is maximal, it is a 1 2 -net for the unit sphere. For each q ∈ Q, consider ξj = (Eq)j, j = 1, . . . , n. Then ξ ∼ N (0, 1nIn). As before, we have ‖SEq‖22 = n∑ j=1 s2jξ 2 j . 
Let E(Q) be the event { ‖SEq‖2 ≤ √ Θ‖s‖∞, ∀q ∈ Q } . Then, by Proposition 3.3.2, we have the union bound P (E(Q)c) ≤ 5kΘn/2e−(Θ−1)n/2. (3.33) Assume the event E(Q), and let M = ‖SE‖`k2→`n2 . For each ‖x‖2 = 1, there is q ∈ Q with 11Balls with radii ρ/2 and centers at a ρ-distinguishable set of points on the unit sphere are mutually disjoint and are all contained in the ball of radius 1 + ρ/2 centered at the origin. Hence there can be at most (1 + ρ/2)k/(ρ/2)k of them. 76 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements ‖q − x‖2 ≤ 1/2, hence ‖SEx‖2 ≤ ‖SEq‖2 + ‖SE(x− q)‖2 ≤ √ Θ‖s‖∞ + M 2 . Taking the supremum over all x on the unit sphere, we obtain M ≤ √ Θ‖s‖∞ + M 2 , i.e., ‖SE‖`k2→`n2 ≤ 2 √ Θ‖s‖∞. Therefore E(Q) ⊂ E , and the result follows. The following estimate concerns a lower bound for the Euclidean norm of (s1ξ1, . . . , snξn). It is not sharp when the sj are identical, but it will be useful for our problem where sj = σj(D −r) obey a power law (see Corollary 3.3.5). Proposition 3.3.4. Let ξ ∼ N (0, 1 n In). For any γ > 0, P ( n∑ j=1 s2jξ 2 j < γ ) ≤ min 1≤L≤n (eγn L )L/2 (s1s2 · · · sL)−1. (3.34) Proof. For any t ≥ 0 and any integer L ∈ {1, . . . , n}, we have P ( n∑ j=1 s2jξ 2 j < γ ) ≤ ∫ Rn e(γ− ∑n j=1 s 2 jx 2 j)t/2 n∏ j=1 e−nx 2 j/2 dxj√ 2pi/n = etγ/2 n∏ j=1 ∫ R e−x 2 j (n+ts 2 j )/2 dxj√ 2pi/n = etγ/2 n∏ j=1 (1 + ts2j/n) −1/2 ≤ etγ/2 L∏ j=1 (ts2j/n) −1/2 ≤ etγ/2(n/t)L/2(s1s2 · · · sL)−1. (3.35) For any L, we can set t = L/γ, which is the critical point of the function t 7→ etγt−L. Since L is arbitrary, the result follows. Corollary 3.3.5. Let ξ ∼ N (0, 1 n In), r be a positive integer, and c1 > 0 be such that sj ≥ c1 ( n j )r , j = 1, . . . , n. (3.36) 77 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Then for any Λ ≥ 1 and n ≥ Λ, P ( n∑ j=1 s2jξ 2 j < c 2 1Λ 2r−1 ) < (60n/Λ)r/2e−n(r−1/2)/Λ. (3.37) Proof. By rescaling sj , we can assume c1 = 1. For any L ∈ {1, . . . , n}, we have (s1s2 · · · sL)−1 ≤ (L!) r nrL < (8L)r/2 ( Lr ernr )L , where we have used the coarse estimate L! < e1/12L(2piL)1/2(L/e)L < (8L)1/2(L/e)L. Setting γ = Λ2r−1 in Proposition 3.3.4, we obtain P ( n∑ j=1 s2jξ 2 j < Λ 2r−1 ) < (8L)r/2 [( ΛL en )L]r−1/2 . (3.38) We set L = bn Λ c. Since 1 ≤ Λ ≤ n, it is guaranteed that 1 ≤ L ≤ n. Since ΛL ≤ n, we get ( ΛL en )L ≤ e−L < e1− nΛ Plugging this in (3.38) and using 8e2 < 60, we find P ( n∑ j=1 s2jξ 2 j < Λ 2r−1 ) < (60n/Λ)r/2e−n(r−1/2)/Λ. (3.39) Theorem 3.3.6. Let E be an n × k random matrix whose entries are i.i.d. N (0, 1 n ), r be a positive integer, and assume that the entries sj of the diagonal matrix S satisfy c1 ( n j )r ≤ sj ≤ c2nr, j = 1, . . . , n. (3.40) Let Λ ≥ 1 be any number and assume n ≥ Λ. Consider the event F := { ‖SEx‖2 ≥ 1 2 c1Λ r−1/2‖x‖2, ∀x ∈ Rk } . Then P (F c) ≤ 5ke−n/2 + 8r (17c2/c1)k Λk/2 (n Λ )r(k+1/2) e−n(r−1/2)Λ. Proof. Consider a ρ-net Q̃ of the unit sphere of Rk with #Q̃ ≤ (2 ρ + 1 )k where the value 78 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements of ρ < 1 will be chosen later. Let Ẽ(Q̃) be the event { ‖SEq‖2 ≥ c1Λr−1/2, ∀q ∈ Q̃ } . By Corollary 3.3.5, we know that P ( Ẽ(Q̃)c ) ≤ ( 2 ρ + 1 )k ( 60n Λ )r/2 e−n(r−1/2)/Λ. (3.41) Let E be the event in Lemma 3.3.3 with Θ = 4. Let E be any given matrix in the event E ∩ Ẽ(Q̃). For each ‖x‖2 = 1, there is q ∈ Q̃ with ‖q − x‖2 ≤ ρ, hence by Lemma 3.3.3, we have ‖SE(x− q)‖2 ≤ 4‖s‖∞‖x− q‖2 ≤ 4c2nrρ. 
Choose ρ = c1Λ r−1/2 8c2nr = c1 8c2 √ Λ (Λ n )r . Hence ‖SEx‖2 ≥ ‖SEq‖2 − ‖SE(x− q)‖2 ≥ c1Λr−1/2 − 4c2nrρ = 1 2 c1Λ r−1/2. This shows that E ∩ Ẽ(Q̃) ⊂ F . Clearly, ρ ≤ 1/8 by our choice of parameters and hence 2 ρ + 1 ≤ 17 8ρ . Using the probability bounds of Lemma 3.3.3 and (3.41), we have P (F c) ≤ 5k4n/2e−3n/2 + ( 17 8ρ )k ( 60n Λ )r/2 e−n(r−1/2)/Λ ≤ 5ke−n/2 + 8r(17c2/c1)kΛk/2 (n Λ )r(k+1/2) e−n(r−1/2)/Λ, (3.42) where we have used 2 < e and √ 60 < 8 for simplification. The following theorem is now a direct corollary of the above estimate. Theorem 3.3.7. Let E be an n×k random matrix whose entries are i.i.d. N (0, 1 n ), r be a positive integer, D be the difference matrix defined in (3.9), and the constant c1 = c1(r) be as in Proposition 3.3.1. Let 0 < α < 1 be any number. Assume that λ := n k ≥ c3(log n)1/(1−α), (3.43) where c3 = c3(r) is an appropriate constant. Then P ( σmin(D −rE) ≥ c1λα(r−1/2) ) ≥ 1− 2e−c4n1−αkα (3.44) for some constant c4 = c4(r) > 0. 79 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Proof. Set Λ = λα in Lemma 3.3.6. We only need to show that max [ 5ke−n/2, 8r(17c2/c1)kΛk/2 (n Λ )r(k+1/2) e−n(r−1/2)/Λ ] ≤ e−c4n1−αkα. It suffices to show that k log 5− n/2 ≤ −c4n1−αkα and r log 8 + k log(17c2/c1) + 1 2 k log Λ + r(k + 1 2 ) log(n/Λ)− (r−1 2 ) n Λ ≤ −c4n1−αkα. The first inequality is easily seen to hold if λ ≥ log 51 2 −c4 . For the second inequality, first notice that n/Λ = n1−αkα. Since k + 1/2  k, and r − 1/2  r, it is easily seen that we only need to check that k log n ≤ c5 n Λ for a sufficiently small c5. This follows from our assumption on λ by setting c5 = 1/c 1−α 3 . Remark. By replacing E in Theorem 3.3.7 with √ nE, we obtain Theorem A. 3.3.2 Implication for compressed sensing matrices Theorem 3.3.8. Let r, D, c1(r) be as in Theorem 3.3.7 and A be an n × N random matrix whose entries are i.i.d. N (0, 1 n ). Let 0 < α < 1 be any number and assume that λ := n k ≥ c6(logN)1/(1−α), (3.45) where c6 = c6(r) is an appropriate constant. Then with probability at least 1− 2e−c7nλ−α for some c7 = c7(r) > 0, every n× k submatrix E of A satisfies σmin(D −rE) ≥ c1λα(r−1/2). (3.46) Proof. We will choose c7 = c4/2, where c4 is as in Theorem 3.3.7. The proof will follow immediately by a union bound once we show that( N k ) ≤ ec4n1−αkα. 80 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Since ( N k ) ≤ Nk, it suffices to show that k logN ≤ c4 2 n1−αkα. Both this condition and the hypothesis of Theorem 3.3.7 will be satisfied if we choose c6 = max(c3, (2/c4) 1/(1−α)). Remark. If A is a Gaussian matrix with entries i.i.d. N (0, 1) rather than N (0, 1 n ), The- orem 3.3.8 applied to 1√ n A implies that every n× k submatrix E of A satisfies σmin(D −rE) ≥ c1λα(r−1/2) √ n, (3.47) with high probability. 3.4 Σ∆ quantization of compressed sensing measurements In this section we will assume that the conditions of Theorem 3.3.8 are satisfied for some 0 < α < 1 and r, and the measurement matrix A that is drawn from N (0, 1) yields (3.47). For definiteness, we also assume that A admits the robust recovery constant C1 = 10, i.e., the solution x# of the program (3.1) satisfies ‖b̂− b‖2 ≤  =⇒ ‖x− x#‖2 ≤ 10 1√ n . Note again that our choice of normalization for the measurement matrix A is different from the compressed sensing convention. 
As mentioned in the Introduction, it is more appropriate to work with a measurement matrix A ∼ N (0, 1) in order to be able to use a quantizer alphabet that does not depend on n. For this reason, in the remainder of the paper, A shall denote an n×N matrix whose entries are i.i.d. from N (0, 1). Let q := qΣ∆ be output of the standard greedy rth order Σ∆ quantizer with the alphabet A = dZ and input b. As stated in Section 3.2, we know that ‖b− q‖∞ ≤ 2r−1d and therefore ‖b− q‖2 ≤ 2r−1d √ n. 81 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Coarse recovery and recovery of support Our first goal is to recover the support T of x. For this purpose we shall use a coarse approximation of x. Let x′ := argmin ‖y‖1 subject to ‖Ay − q‖2 ≤  := 2r−1d √ n. (3.48) By the robust recovery result (for our choice of normalization for A), we know that ‖x− x′‖2 ≤ η := 5 · 2rd. The simplest attempt to recover T from x′ is to pick the positions of its k largest entries. This attempt can fail if some entry of xj on T is smaller than η for then it is possible that x′j = 0 and therefore j is not picked. On the other hand, it is easy to see that if the smallest nonzero entry of x is strictly bigger than 2η in magnitude, then this method always succeeds. (Since ‖x − x′‖∞ ≤ η, the entries of x′ are bigger than η on T and less than η on T c.) The constant 2 can be replaced with √ 2 by a more careful analysis, and can be pushed arbitrarily close to 1 by picking more than k positions. The proposition below gives a precise condition on how well this can be done. We also provide a bound on how much of x can potentially be missed if no lower bound on |xj| is available for j ∈ T . Proposition 3.4.1. Let ‖x − x′‖`N2 ≤ η, T = supp x and k = |T |. For any k′ ∈{k, . . . , N−1}, let T ′ be the support of (any of) the k′ largest entries of x′. (i) ‖xT\T ′‖2 ≤ βη where β ≤ ( 1 + k k′ )1/2 . (ii) If |xj | > γη for all j ∈ T , where γ := ( 1 + 1 k′−k+1 )1/2 , then T ′ ⊃ T . Proof. (i) We have ∑ j∈T |xj − x′j |2 + ∑ j∈T c |x′j |2 = ‖x− x′‖22 ≤ η2. (3.49) In particular, this implies ∑ j∈T\T ′ |xj − x′j |2 + ∑ j∈T ′\T |x′j|2 ≤ η2. (3.50) Suppose T \ T ′ 6= ∅. Then T ′ \ T is also nonempty. In fact, we have |T ′ \ T | = |T \ T ′|+ k′ − k. 82 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Now, observe that 1 |T \ T ′| ∑ j∈T\T ′ |x′j|2 ≤ max j∈T\T ′ |x′j |2 ≤ min j∈T ′\T |x′j|2 ≤ 1 |T ′ \ T | ∑ j∈T ′\T |x′j|2, which, together with (3.50) implies ‖xT\T ′‖2 ≤ ‖x′T\T ′‖2 + ‖(x− x′)T\T ′‖2 ≤ ‖x′T\T ′‖2 + √ η2 − |T ′ \ T | |T \ T ′|‖x ′ T\T ′‖22. It is easy to check that for any A > 0, and any 0 ≤ t ≤ η/√A, t+ √ η2 − At2 ≤ ( 1 + 1 A )1/2 η. (3.51) The result follows by setting A = |T ′ \ T |/|T \ T ′| and noticing that A ≥ k′/k. (ii) Let z1 ≥ · · · ≥ zN be the decreasing rearrangement of |x′1|, . . . , |x′N |. We have ∑ j∈T |x′j|2 ≤ k∑ i=1 z2i so ∑ j∈T c |x′j|2 ≥ N∑ i=k+1 z2i ≥ k′+1∑ i=k+1 z2i ≥ (k′ − k + 1)z2k′+1. Hence by (3.49) we have max j∈T |xj − x′j |2 + (k′ − k + 1)z2k′+1 ≤ η2. Since |x′j| ≥ |xj| − |xj − x′j |, the above inequality now implies min j∈T |x′j | ≥ min j∈T |xj| −max j∈T |xj − x′j | ≥ min j∈T |xj| − √ η2 − (k′ − k + 1)z2k′+1. Now, another application of (3.51) with A = k′ − k + 1 yields − √ η2 − (k′ − k + 1)z2k′+1 ≥ zk′+1 − γη and therefore min j∈T |x′j | ≥ min j∈T |xj|+ zk′+1 − γη > zk′+1 = max j∈T ′c |x′j |. 83 Chapter 3. 
Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements It is then clear that T ⊂ T ′ because if T ′c ∩ T 6= ∅, the inequality max j∈T ′c |x′j | ≥ max j∈T ′c∩T |x′j | ≥ min j∈T |x′j| would give us a contradiction. Note that if the k′ largest entries of x′ are picked with k′ > k, then one would need to work with T ′ for the fine recovery stage, and therefore the starting assumptions on A have to be modified for k′. For simplicity we shall stick to k′ = k and consequently γ = √ 2. Remark. Note that support recovery in various compressed sensing contexts has received some attention lately (e.g., [16], [31], [32]). However, for the purposes of this paper the above proposition is appropriate (given the choice of decoder) and sufficient. Fine recovery Once T is found, the rth order Sobolev dual frame F := FSob,r of E = AT is computed and we set x̂Σ∆ = Fq. We now restate and prove Theorem B. Theorem 3.4.2. Let A be an n×N matrix whose entries are i.i.d. according to N (0, 1). Suppose α ∈ (0, 1) and λ := n/k ≥ c(logN)1/(1−α) where c = c(r, α). Then there are two constants c′ and C that depend only on r such that with probability at least 1 − exp(−c′nλ−α) on the draw of A, the following holds: For every x ∈ ΣNk such that minj∈supp(x) |xj| ≥ Cd, the reconstruction x̂Σ∆ satisfies ‖x− x̂Σ∆‖2 .r λ−α(r− 12 )d. (3.52) Proof. Suppose that λ ≥ c(logN)1/(1−α) with c = c6 as in the proof of Theorem 3.3.8. Let qΣ∆ be obtained by quantizing b := Ax via an rth order Σ∆ scheme with alphabet A = dZ and with the quantization rule as in (3.18), and let u be the associated state sequence as in (3.16). Define x# as the solution of the program min ‖y‖1 subject to ‖Ay − qΣ∆‖2 ≤ . Suppose that A admits the robust recovery constant C1, i.e., the solution x # of the program (3.3) satisfies ‖x−x#‖2 ≤ C1/ √ n for every x in ΣNk provided that ‖y−qΣ∆‖ ≤ . Note that C1, as given for example in [9], only depends on the RIP constants of A and is well-behaved if n andN satisfy the hypothesis of the theorem. As discussed in Section 3.2, in this case we have ‖b− qΣ∆‖2 ≤ 2r−1d √ n which implies ‖x− x#‖2 ≤ C12r−1d. 84 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Assume that min j∈T |xj | ≥ C1 · 2r−1/2d =: Cd. (3.53) Then, Proposition 3.4.1 (with γ = √ 2 and η = C12 r−1) shows that T ′, the support of the k largest entries of x#, is identical to the support T of x. Finally, set x̂Σ∆ = Fsob,rqΣ∆ where Fsob,r is the rth order Sobolev dual of AT . Using the fact that ‖u‖2 ≤ 2−1d √ n (see Section 3.2) together with the conclusion of Theorem 3.3.8 and the error bound (3.27), we conclude that ‖x− x̂Σ∆‖2 ≤ ‖u‖2√ nσmin(D−rE) ≤ λ −α(r−1/2) 2c1 d. (3.54) Note that the RIP and therefore the robust recovery will hold with probability 1 − exp(c′′n), and our Sobolev dual reconstruction error bound will hold with probability 1− exp(−c7nλ−α). Here c1 and c7 are as in the proof of Theorem 3.3.8. Remark. To interpret the size condition in a concrete case, assume that A admits the robust recovery constant C1 = 10, and that we have min j∈T |xj | ≥ √ 2η = 5 · 2r+1/2d. (3.55) If PCM is used as the quantization method, then the best error guarantee we have that holds uniformly on T would be ‖x− x#PCM‖∞ ≤ ‖x− x#PCM‖2 ≤ 5d. It can be argued that the approximately recovered entries of x#PCM are meaningful only when the minimum nonzero entry of x is at least as large as the maximum uncertainty in x#PCM, which is only known to be bounded by 5d. 
Hence, in some sense the size condition (3.55) is natural (modulo the factor 2r+1/2). Quantizer choice and rate-distortion issues So far we have not made any assumptions on the step size d of the uniform infinite quantizer A = dZ. An important question concerns how large d should be for the most effective use of resources. This question is motivated by the fact that infinite quantizers are not practical and have to be replaced by finite ones. Similarly, it is important to determine the minimum number of bits to be used in the quantizer and the associated 85 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements approximation error. First, let us assume that a ≤ |xj | ≤ ρ for all j ∈ T, (3.56) and suppose that ρ = 2Ba where B represents the number of dyadic scales over which the input is allowed to range. For usefulness of our results, one would be interested in the regime a  ρ. Clearly, dr, the quantization step size used by an rth order Σ∆ scheme for our support recovery results to hold must satisfy dr ≤ a/52r+1/2 (as before, we assume C1 = 10). Let us for the moment use the largest allowable step-size, i.e., set dr := a/5 2r+1/2 . (3.57) Next, let us assume that a Br-bit uniform quantizer of step size dr is to replace A = dZ. We know that ‖q‖∞ could be as large as 2r−1dr + ‖b‖∞, therefore we need to bound ‖b‖∞ efficiently. If we use the RIP, then A does not expand the `2-norm of k-sparse vectors by more than a factor of 2 √ n (note our choice of normalization for A), and therefore it follows that ‖b‖∞ ≤ ‖b‖2 ≤ 2 √ n‖x‖2 ≤ 2ρ √ nk, which is a restatement of the inequality ‖E‖`k ∞ →`n ∞ ≤ √ k‖E‖`k2→`n2 that holds for any n×k matrix E. However, it can be argued that the (∞,∞)-norm of a random matrix should typically be smaller. In fact, if E were drawn from the Bernoulli model, i.e., Eij ∼ ±1, then we would have ‖E‖`k ∞ →`n ∞ = k = λ−1/2 √ nk, as can easily be seen from the general formula ‖E‖`k ∞ →`n ∞ = max 1≤i≤n k∑ j=1 |Eij|. (3.58) Using simple concentration inequalities for Gaussian random variables, it turns out that for the range of aspect ratio λ = n/k and probability of encountering a matrix A that we are interested in, we have ‖E‖`k ∞ →`n ∞ ≤ λ−α/2√nk for every n× k submatrix E of A. We start with the following estimate: Proposition 3.4.3. Let ξ1, . . . , ξk i.i.d. standard Gaussian variables. Then, for any 86 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Θ > 1, P ( k∑ j=1 |ξj| > Θ ) ≤ 2ke−Θ2/(2k). (3.59) Proof. P ( k∑ j=1 |ξj| > Θ ) ≤ min t≥0 ∫ Rk e−(Θ− ∑k j=1 |xj|)t k∏ j=1 e−x 2 j/2 dxj√ 2pi = min t≥0 e−Θt ( et 2/2 ∫ R e− 1 2 (|x|−t)2 dx√ 2pi )k = min t≥0 e−Θt ( 2et 2/2 ∫ ∞ 0 e− 1 2 (x−t)2 dx√ 2pi )k ≤ 2kmin t≥0 e−Θt+kt 2/2 = 2ke−Θ 2/(2k). (3.60) where in the last step we set t = Θ/k. Proposition 3.4.4. Let A be an n×N random matrix whose entries are i.i.d. N (0, 1). Let 0 < α < 1 be any number and assume that λ := n k ≥ c1(logN)1/(1−α), (3.61) where c1 is an appropriate constant. Then with probability at least 1−e−c2n1−αkα for some c2 > 0, every n× k submatrix E of A satisfies ‖E‖`k ∞ →`n ∞ ≤ λ−α/2 √ nk. (3.62) Proof. Proposition 3.4.3 straightforwardly implies that P ({∃T such that |T | = k and ‖AT‖`k ∞ →`n ∞ > Θ}) ≤ (N k ) n2ke−Θ 2/(2k). (3.63) Let Θ = λ−α/2 √ nk. It remains to show that k logN + k log 2 + logn + c2n 1−αkα ≤ Θ 2 2k . If c1 in (3.61) is sufficiently large and c2 is sufficiently small, then the expression on the left hand side is bounded by kλ1−α/2 = Θ2/(2k). 
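The estimate just proved is easy to probe numerically via the explicit formula (3.58), which identifies the (∞,∞) operator norm of E with its largest absolute row sum. The following Monte Carlo sketch is purely illustrative: the parameters, the number of sampled supports, and the choice of α are ours and are far from the asymptotic regime assumed in Proposition 3.4.4.

```python
import numpy as np

# Monte Carlo illustration of Proposition 3.4.4 using formula (3.58): the
# (inf,inf) operator norm of an n-by-k submatrix E of A is its largest
# absolute row sum.  Parameters and alpha are illustrative only.
rng = np.random.default_rng(2)
n, N, k, alpha = 800, 2000, 40, 0.5
A = rng.standard_normal((n, N))
lam = n / k

worst = 0.0
for _ in range(200):
    T = rng.choice(N, size=k, replace=False)
    worst = max(worst, np.abs(A[:, T]).sum(axis=1).max())   # formula (3.58)

print("max ||E||_{inf->inf} over sampled supports:", worst)
print("lambda^(-alpha/2) * sqrt(n k):", lam ** (-alpha / 2) * np.sqrt(n * k))
print("sqrt(n k) (for comparison):   ", np.sqrt(n * k))
```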
Without loss of generality, we may now assume that A also satisfies the conclusion of 87 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Proposition 3.4.4. Hence we have an improved bound on the range of y given by ‖b‖∞ ≤ ρλ−α/2 √ nk = ρλ(1−α)/2k. (3.64) We assume Br is chosen to satisfy 2Br−1dr = 2r−1dr + ρλ(1−α)/2k, (3.65) so that the quantizer is not overloaded. Since ρ/dr ≈ 2r+1/2+B by (3.56) and (3.57), we see that the second term on the right hand side of (3.65) is significantly larger than the first, which implies 2Br−1dr ≈ 2BAλ(1−α)/2k. (3.66) Hence, using (3.57) again, Br must satisfy 2Br−1 ≈ 5 2B+r+1/2λ(1−α)/2k. (3.67) Based on Theorem 3.4.2, the approximation error (the distortion) DΣ∆ incurred after the fine recovery stage via Sobolev duals satisfies the bound DΣ∆ .r λ −α(r−1/2)dr ≈ λ −α(r−1/2)a 2r+1/2 . (3.68) A similar calculation for the PCM encoder with the same step size dr and the standard `1 decoder results in the necessity for roughly the same number of bits Br as the Σ∆ encoder (because of the approximation (3.66)), but provides only the distortion bound DPCM . dr ≈ a 2r+1/2 . (3.69) Note that the analysis above requires that both PCM and Σ∆ encoders utilize high- resolution quantizers, however the benefit of using Σ∆ encoders is obvious upon compar- ing (3.68) and (3.69). 3.5 Numerical experiments In order to test the accuracy of Theorem 3.3.7, our first numerical experiment concerns the minimum singular value of D−rE as a function of λ = n/k. In Figure 3.1, we plot the worst case (the largest) value, among 1000 realizations, of 1/σmin(D −rE) for the range 1 ≤ λ ≤ 25, where we have kept k = 50. As predicted by this theorem, we find that the negative slope in the log-log scale is roughly equal to r − 1/2, albeit slightly less, which seems in agreement with the presence of our control parameter α. As for the size of the r-dependent constants, the function 5rλ−r+1/2 seems to be a reasonably close numerical fit, which also explains why we observe the separation of the individual curves 88 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements after λ > 5. Our next experiment involves the full quantization algorithm for compressed sensing including the “recovery of support” and “fine recovery” stages. To that end, we first generate a 1000 × 2000 matrix A, where the entries of A are drawn i.i.d. according to N (0, 1). To examine the performance of the proposed scheme as the redundancy λ increases in comparison to the performance of the standard PCM quantization, we run a set of experiments: In each experiment we fix the sparsity k ∈ {5, 10, 20, 40}, and we generate k-sparse signals x with the non-zero entries of each signal supported on a random set T , but with magnitude 1/ √ k. This ensures that ‖x‖2 = 1. Next, for n ∈ {100, 200, ..., 1000} we generate the measurements b = A(n)x, where A(n) is comprised of the first n rows of A. We then quantize b using PCM, as well as the 1st and 2nd order Σ∆ quantizers, defined via (3.16) and (3.18) (in all cases the quantizer step size is d = 10−2). For each of these quantized measurements q, we perform the coarse recovery stage, i.e., we solve the associated `1 minimization problem to recover a coarse estimate of x as well as an estimate T̃ of the support T . The approximation error obtained using the coarse estimate (with PCM quantization) is displayed in Figures 3.2 and 3.3 (see the dotted curve). 
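As an illustration of the coarse recovery step just described, the constrained ℓ1 program (3.48) can be set up as follows. The use of cvxpy and the function name are our implementation choices, not those of the original experiments; any solver for the ℓ1 program would do.

```python
import numpy as np
import cvxpy as cp

def coarse_recover(A, q, eps, k):
    """Coarse recovery stage: solve min ||y||_1 subject to ||A y - q||_2 <= eps,
    then return the solution and the support estimate given by its k largest entries.
    (cvxpy is our implementation choice here.)"""
    N = A.shape[1]
    y = cp.Variable(N)
    prob = cp.Problem(cp.Minimize(cp.norm1(y)), [cp.norm(A @ y - q, 2) <= eps])
    prob.solve()
    x_coarse = np.asarray(y.value).ravel()
    T_hat = np.sort(np.argsort(np.abs(x_coarse))[-k:])      # estimated support
    return x_coarse, T_hat

# Example usage for an r-th order scheme with step size d, as in (3.48):
#   eps = 2 ** (r - 1) * d * np.sqrt(n)
#   x_coarse, T_hat = coarse_recover(A, q, eps, k)
```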
Next, we implement the fine recovery stage of our algorithm. In particular, we use the estimated support set T̃ and generate the associated dual Fsob,r. Defining Fsob,0 := (A (n) T̃ )†, in each case, our final estimate of the signal is obtained via the fine recovery stage as x̂T̃ = Fsob,rq, x̂T̃ c = 0. Note that this way, we obtain an alternative reconstruction also in the case of PCM. We repeat this experiment 100 times for each (k, n) pair and plot the average of the resulting errors ‖x − x̃‖2 as a function of λ in Figure 3.2 as well as the maximum of ‖x − x̂‖2 in Figure 3.3. For our final experiment, we choose the entries of xT i.i.d. from N (0, 1), and use a quantizer step size d = 10−4. Otherwise, the experimental setup is identical to the previous one. The average of the resulting errors ‖x − x̃‖2 as a function of λ is reported in Figure 3.4 and the maximum of ‖x− x̂‖2 in Figure 3.5. The main observations that we obtain from these experiments are as follows: • Σ∆ schemes outperform the coarse reconstruction obtained from PCM quantized measurements significantly even when r = 1 and even for small values of λ. • For the Σ∆ reconstruction error, the negative slope in the log-log scale is roughly equal to r. This outperforms the (best case) predictions of Theorem B which are obtained through the operator norm bound and suggests the presence of further cancellation due to the statistical nature of the Σ∆ state variable u, similar to the white noise hypothesis. • When a fine recovery stage is employed in the case of PCM (using the Moore- Penrose pseudoinverse of the submatrix of A that corresponds to the estimated support of x), the approximation is consistently improved (when compared to the coarse recovery). Moreover, the associated approximation error is observed to be of 89 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements order O(λ−1/2), in contrast with the error corresponding to the coarse recovery from PCM quantized measurements (with the `1 decoder only) where the approximation error does not seem to depend on λ. A rigorous analysis of this behaviour will be given in a separate manuscript. Figure 3.1: Numerical behavior (in log-log scale) of 1/σmin(D −rE) as a function of λ = n/k, for r = 0, 1, 2, 3, 4. In this figure, k = 50 and 1 ≤ λ ≤ 25. For each problem size, the largest value of 1/σmin(D −rE) among 1000 realizations of a random n× k matrix E sampled from the Gaussian ensemble N (0, 1nIn) was recorded. 3.6 Remarks on extensions 3.6.1 Other noise shaping matrices In the above approach, the particular quantization scheme that we use can be identified with its “noise-shaping matrix”, which is Dr in the case of an rth order Σ∆ scheme and the identity matrix in the case of PCM. The results we obtained above are valid for the aforementioned noise-shaping matri- ces. However, our techniques are fairly general and our estimates can be modified to 90 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements (a) (b) (c) (d) Figure 3.2: The average performance of the proposed Σ∆ quantization and reconstruction schemes for various values of k. For this experiment the non-zero entries of x are constant and d = 0.01. 91 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements (a) (b) (c) (d) Figure 3.3: The worst case performance of the proposed Σ∆ quantization and reconstruction schemes for various values of k. 
For this experiment the non-zero entries of x are constant and d = 0.01. 92 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements (a) (b) (c) (d) Figure 3.4: The average performance of the proposed Σ∆ quantization and reconstruction schemes for various values of k. For this experiment the non-zero entries of x are i.i.d. N (0, 1) and d = 10−4. 93 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements (a) (b) (c) (d) Figure 3.5: The worst case performance of the proposed Σ∆ quantization and reconstruction schemes for various values of k. For this experiment the non-zero entries of x are i.i.d. N (0, 1) and d = 10−4. 94 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements investigate the accuracy obtained using an arbitrary quantization scheme with the asso- ciated invertible noise-shaping matrix H . In particular, the estimates depend solely on the distribution of the singular values of H . Of course, in this case, we also need change our “fine recovery” stage and use the “H-dual” of the corresponding frame E, which we define via FHH = (HE) †. (3.70) As an example, consider an rth order high-pass Σ∆ scheme whose noise shaping matrix is Hr where H is defined via Hij := { 1, if i = j or if i = j + 1, 0, otherwise. (3.71) It is easy to check that the singular values of H are identical to those of D. It follows that all the results presented in this paper are valid also if the compressed measurements are quantized via an an rth order high-pass Σ∆ scheme, provided the reconstruction is done using the Hr-duals instead of the rth order Sobolev duals. Note that such a result for high-pass Σ∆ schemes is not known to hold in the case of structured frames. 3.6.2 Measurement noise and compressible signals One of the natural questions is whether the quantization methods developed in this paper are effective in the presence of measurement noise in addition to the error introduced during the quantization process. Another natural question is how to extend this theory to include the case when the underlying signals are not necessarily strictly sparse, but nevertheless still “compressible”. Suppose x ∈ RN is not sparse, but compressible in the usual sense (e.g. as in [9]), and let b = Ax+e, where e stands for additive measurement noise. The coarse recovery stage inherits the stability and robustness properties of `1 decoding for compressed sensing, therefore the accuracy of this first reconstruction depends on the best k-term approxima- tion error for x, and the deviation of Ax from the quantized signal q (which comprises of the measurement noise e and the quantization error b − q). Up to constant factors, the quantization error for any (stable) Σ∆ quantizer is comparable to that of PCM, hence the reconstruction error at the coarse recovery stage would also be comparable. In the fine recovery stage, however, the difference between σmax(FHH) and σmax(FH) plays a critical role. In the particular case of H = Dr and FH = Fsob,r, the Sobolev duals we use in the reconstruction are tailored to reduce the effect of the quantization error in- troduced by an rth order Σ∆ quantizer. This is reflected in the fact that as λ increases, the kernel of the reconstruction operator Fsob,r contains a larger portion of high-pass se- quences (like the quantization error of Σ∆ modulation), and is quantified by the bound σmax(Fsob,rD r) . λ−(r−1/2)n−1/2 (see Theorem A, (3.26) and (3.27)). 
Consequently, ob- taining more measurements increases λ, and even though ‖b− q‖2 increases as well, the 95 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements reconstruction error due to quantization decreases. At the same time, obtaining more measurements would also increase the size of the external noise e, as well as the “alias- ing error” that is the result of the “off-support” entries of x. However, this noise+error term is not counteracted by the action of Fsob,r. In fact, for any dual F , the relation FE = I implies σmax(F ) ≥ 1/σmax(E) & n−1/2 already and in the case of measurement noise, it is not possible to do better than the canonical dual E† on average. In this case, depending on the size of the noise term, the fine recovery stage may not improve the total reconstruction error even though the “quantizer error” is still reduced. One possible remedy for this problem is to construct alternative quantization schemes with associated noise-shaping matrices that balance the above discussed trade-off between the quantization error and the error that is introduced by other factors. This is a delicate procedure, and it will be investigated thoroughly in future work. However, a first such construction can be made by using “leaky” Σ∆ schemes with H given by Hij :=  1, if i = j, −µ if i = j + 1, 0, otherwise, (3.72) where µ ∈ (0, 1). Our preliminary numerical experiments (see Figures 3.6 and 3.7) suggest that this approach can be used to improve the accuracy of the approximation further in the fine recovery stage in this more general setting. We note that the parameter µ above can be adjusted based on how compressible the signals of interest are and what the expected noise level is. 3.7 Singular values of D−r It will be more convenient to work with the singular values of Dr. Note that because of our convention of descending ordering of singular values, we have σj(D −r) = 1 σn+1−j(Dr) , j = 1, . . . , n. (3.73) For r = 1, an explicit formula is available [28, 30]. Indeed, we have σj(D) = 2 cos ( pij 2n+ 1 ) , j = 1, . . . , n, (3.74) which implies σj(D −1) = 1 2 sin ( pi(j−1/2) 2(n+1/2) ) , j = 1, . . . , n. (3.75) The first observation is that σj(D r) and (σj(D)) r are different, because D and D∗ do 96 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements (a) (b) (c) Figure 3.6: The average case performance of the proposed Σ∆ quantization and reconstruc- tion schemes (with general duals) for various values of k. For this experiment the non-zero entries of x are constant and d = 0.01. not commute. However, this becomes insignificant as n → ∞. In fact, the asymptotic distribution of (σj(D r))nj=1 as n→∞ is rather easy to find using standard results in the theory of Toeplitz matrices: D is a banded Toeplitz matrix whose symbol is f(θ) = 1−eiθ, hence the symbol of Dr is (1 − eiθ)r. It then follows by Parter’s extension of Szegö’s theorem [26] that for any continuous function ψ, we have lim n→∞ 1 n n∑ j=1 ψ(σj(D r)) = 1 2pi ∫ pi −pi ψ(|f(θ)|r) dθ. (3.76) We have |f(θ)| = 2 sin |θ|/2 for |θ| ≤ pi, hence the distribution of (σj(Dr))nj=1 is asymp- totically the same as that of 2r sinr(pij/2n), and consequently, we can think of σj(D −r) roughly as ( 2r sinr(pij/2n) )−1 . Moreover, we know that ‖Dr‖op ≤ ‖D‖rop ≤ 2r, hence 97 Chapter 3. 
Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements (a) (b) (c) Figure 3.7: The worst case performance of the proposed Σ∆ quantization and reconstruction schemes (with general duals) for various values of k. For this experiment the non-zero entries of x are constant and d = 0.01. σmin(D −r) ≥ 2−r. When combined with known results on the rate of convergence to the limiting dis- tribution in Szegö’s theorem, the above asymptotics could be turned into an estimate of the kind given in Proposition 3.3.1, perhaps with some loss of precision. Here we shall provide a more direct approach which is not asymptotic, and works for all n ≥ 4r. The underlying observation is that D and D∗ almost commute: D∗D − DD∗ has only two nonzero entries, at (1, 1) and (n, n). Based on this observation, we show below that D∗rDr is then a perturbation of (D∗D)r of rank at most 2r. Proposition 3.7.1. Let C(r) = D∗rDr − (D∗D)r where we assume n ≥ 2r. Define Ir := {1, . . . , r} × {1, . . . , r} ∪ {n− r + 1, . . . , n} × {n− r + 1, . . . , n}. 98 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Then C (r) i,j = 0 for all (i, j) ∈ Icr . Therefore, rank(C(r)) ≤ 2r. Proof. Define the set Cr of all “r-cornered” matrices as Cr = {M :Mi,j = 0 if (i, j) ∈ Icr}, and the set Br of all “r-banded” matrices as Br = {M :Mi,j = 0 if |i− j| > r}. Both sets are closed under matrix addition. It is also easy to check the following facts (for the admissible range of values for r and s): (i) If B ∈ Br and C ∈ Cs, then BC ∈ Cr+s and CB ∈ Cr+s. (ii) If B ∈ Br and B̃ ∈ Bs, then BB̃ ∈ Br+s. (iii) If C ∈ Cr and C̃ ∈ Cs, then CC̃ ∈ Cmax(r,s). (iv) If C ∈ Cr, then D∗CD ∈ Cr+1. Note that DD∗, D∗D ∈ B1 and the commutator [D∗, D] =: Γ1 ∈ C1. Define Γr := (D ∗D)r − (DD∗)r = (DD∗ + Γ1)r − (DD∗)r. We expand out the first term (noting the non-commutativity), cancel (DD∗)r and see that every term that remains is a product of r terms (counting each DD∗ as one term) each of which is either in B1 or in C1. Repeated applications of (i), (ii), and (iii) yield Γr ∈ Cr. We will now show by induction on r that C(r) ∈ Cr for all r such that 2r ≤ n. The cases r = 0 and r = 1 hold trivially. Assume the statement holds for a given value of r. Since C(r+1) = D∗(C(r) + Γr)D and Γr ∈ Cr, property (iv) above now shows that C(r+1) ∈ Cr+1. The next result, originally due to Weyl (see, e.g., [20, Thm 4.3.6]), will now allow us to estimate the eigenvalues of D∗rDr using the eigenvalues of (D∗D)r: Theorem 3.7.2 (Weyl). Let B and C be n×n Hermitian matrices where C has rank at most p. Then λj+p(B) ≤ λj(B + C) ≤ λj−p(B), j = p+ 1, . . . , m− p, (3.77) where we assume eigenvalues are in descending order. We are now fully equipped to prove Proposition 3.3.1. 99 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Proof of Proposition 3.3.1. We set p = 2r, B = (D∗D)r, and C = C(r) = D∗rDr − (D∗D)r in Weyl’s theorem. By Proposition 3.7.1, C has rank at most 2r. Hence, we have the relation λj+2r((D ∗D)r) ≤ λj(D∗rDr) ≤ λj−2r((D∗D)r), j = 2r + 1, . . . , n− 2r. (3.78) Since λj((D ∗D)r) = λj(D∗D)r, this corresponds to σj+2r(D) r ≤ σj(Dr) ≤ σj−2r(D)r, j = 2r + 1, . . . , n− 2r. (3.79) For the remaining values of j, we will simply use the largest and smallest singular values of Dr as upper and lower bounds. However, note that σ1(D r) = ‖Dr‖op ≤ ‖D‖rop = (σ1(D))r and similarly σm(D r) = ‖D−r‖−1op ≥ ‖D−1‖−rop = (σn(D))r. 
Hence (3.79) can be rewritten as σmin(j+2r,n)(D) r ≤ σj(Dr) ≤ σmax(j−2r,1)(D)r, j = 1, . . . , n. (3.80) Inverting these relations via (3.73), we obtain σmin(j+2r,n)(D −1)r ≤ σj(D−r) ≤ σmax(j−2r,1)(D−1)r, j = 1, . . . , n. (3.81) Finally, to demonstrate the desired bounds of Proposition 3.3.1, we rewrite (3.75) via the inequality 2x/pi ≤ sin x ≤ x for 0 ≤ x ≤ pi/2 as n+ 1/2 pi(j − 1/2) ≤ σj(D −1) ≤ n + 1/2 2(j − 1/2) , (3.82) and observe that min(j + 2r, n) r j and max(j − 2r, 1) r j for j = 1, . . . , n. Remark. The constants c1(r) and c2(r) that one obtains from the above argument would be significantly exaggerated. This is primarily due to the fact that Proposition 3.3.1 is not stated in the tightest possible form. The advantage of this form is the simplicity of the subsequent analysis in Section 3.3.1. Our estimates of σmin(D −rE) would become significantly more accurate if the asymptotic distribution of σj(D −r) is incorporated into our proofs in Section 3.3.1. However, the main disadvantage would be that the estimates would then hold only for all sufficiently large n. 100 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements Acknowledgments The authors would like to thank Ronald DeVore for valuable discussions. This work was initiated during an AIM workshop and matured during a BIRS workshop. We thank the American Institute of Mathematics and Banff International Research Station for their hospitality. 101 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements 3.8 References [1] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constr. Approx., 28(3):253–263, 2008. [2] J. Benedetto, A. Powell, and Ö. Yılmaz. Second order sigma–delta (Σ∆) quantiza- tion of finite frame expansions. Appl. Comput. Harmon. Anal, 20:126–148, 2006. [3] J. Benedetto, A. Powell, and Ö. Yılmaz. Sigma-delta (Σ∆) quantization and finite frames. IEEE Transactions on Information Theory, 52(5):1990–2005, May 2006. [4] J. Blum, M. Lammers, A. Powell, and Ö. Yılmaz. Sobolev duals in frame theory and Sigma-Delta quantization. Journal of Fourier Analysis and Applications. Accepted. [5] B. Bodmann, V. Paulsen, and S. Abdulbaki. Smooth Frame-Path Termination for Higher Order Sigma-Delta Quantization. J. Fourier Anal. Appl., 13(3):285–307, 2007. [6] P. Boufounos and R. Baraniuk. 1-bit compressive sensing. In 42nd annual Conference on Information Sciences and Systems (CISS), pages 19–21. [7] P. Boufounos and R. Baraniuk. Sigma delta quantization for compressive sensing. In Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 6701, page 4. Citeseer, 2007. [8] E. J. Candès. Compressive sampling. In International Congress of Mathematicians. Vol. III, pages 1433–1452. Eur. Math. Soc., Zürich, 2006. [9] E. J. Candès, J. Romberg, and T. Tao. Signal recovery from incomplete and inac- curate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, 2005. [10] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006. [11] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best k-term ap- proximation. J. Amer. Math. Soc., 22(1):211–231, 2009. [12] W. Dai, H. Pham, and O. Milenkovic. Quantized compressive sensing. arXiv. [13] I. Daubechies and R. DeVore. 
Approximating a bandlimited function using very coarsely quantized data: A family of stable sigma-delta modulators of arbitrary order. Ann. of Math., 158(2):679–710, September 2003. 102 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements [14] D. Donoho. For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Communications on Pure and Applied Mathematics, 59(7):907–934, 2006. [15] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006. [16] A. Fletcher, S. Rangan, and V. Goyal. Necessary and Sufficient Conditions for Spar- sity Pattern Recovery. Information Theory, IEEE Transactions on, 55(2009):5758– 5772, 2009. [17] V. Goyal, A. Fletcher, and S. Rangan. Compressive sampling and lossy compression. IEEE Signal Processing Magazine, 25(2):48–56, 2008. [18] V. Goyal, M. Vetterli, and N. Thao. Quantized overcomplete expansions in RN : analysis, synthesis, and algorithms. IEEE Trans. Inform. Theory, 44(1):16–31, Jan 1998. [19] C. Güntürk. One-bit sigma-delta quantization with exponential accuracy. Comm. Pure Appl. Math., 56(11):1608–1630, 2003. [20] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge University Press, Cambridge, 1990. Corrected reprint of the 1985 original. [21] L. Jacques, D. Hammond, and M. Fadili. Dequantizing compressed sensing: When oversampling and non-gaussian constraints combine. Arxiv preprint: http://arxiv. org/abs/0902.2367, 2009. [22] M. Lammers, A. Powell, and Ö. Yılmaz. Alternative dual frames for digital-to-analog conversion in Sigma-Delta quantization. Advances in Computational Mathematics, 2008. To appear. [23] J. Laska, P. Boufounos, M. Davenport, and R. Baraniuk. Democracy in action: Quantization, saturation, and compressive sensing. Preprint, 2009. [24] G. G. Lorentz, M. v. Golitschek, and Y. Makovoz. Constructive approximation, vol- ume 304 of Grundlehren der Mathematischen Wissenschaften [Fundamental Princi- ples of Mathematical Sciences]. Springer-Verlag, Berlin, 1996. Advanced problems. [25] S. Norsworthy, R.Schreier, and G. Temes, editors. Delta-Sigma Data Converters. IEEE Press, 1997. [26] S. V. Parter. On the distribution of the singular values of Toeplitz matrices. Linear Algebra Appl., 80:115–130, 1986. 103 Chapter 3. Sobolev Duals for Random Frames and Σ∆ Quantization of Compressed Sensing Measurements [27] M. Rudelson and R. Vershynin. Smallest singular value of a random rectangular matrix. Comm. Pure Appl. Math., 62(12):1595–1739, 2009. [28] G. Strang. The discrete cosine transform. SIAM review, pages 135–147, 1999. [29] J. Sun and V. Goyal. Optimal quantization of random measurements in compressed sensing. In IEEE International Symposium on Information Theory, pages 6–10, 2009. [30] J. Von Neumann. Distribution of the ratio of the mean square successive difference to the variance. The Annals of Mathematical Statistics, 12(4):367–395, 1941. [31] M. Wainwright. Information-Theoretic Limits on Sparsity Recovery in the High-Dimensional and Noisy Setting. IEEE transactions on information theory, 55(12):5728–5741, 2009. [32] M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recov- ery using l 1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009. [33] A. Zymnis, S. Boyd, and E. Candès. Compressed sensing with quantized measure- ments. 2009. Submitted. 
Chapter 4

Conclusion

This chapter provides a brief summary of the contributions of this thesis. In addition, we provide the details of a numerical experiment that shows how one can combine both contributions of this thesis, on quantization and decoding, to obtain favorable reconstruction results. Finally, we discuss some important open problems and topics for future work.

4.1 Summary of contributions

This thesis is concerned with both the encoding (measurement and quantization) and decoding (recovery) stages of compressed sensing. Below, we summarize our results.

Decoding by ℓp minimization

With respect to decoding, we showed that the decoders ∆p and ∆^ε_p, with p < 1, have many favorable properties compared to other decoders such as ∆1 and ∆^ε_1 (as well as ∆CoSaMP). Specifically, our results were for matrices A with entries drawn i.i.d. from N(0, 1/√n) or with columns drawn uniformly from the sphere. We now restate these results:

• When one obtains noisy measurements b = Ax + e of an arbitrary signal x with ‖e‖2 ≤ ε, the ∆^ε_p decoder is stable and robust under weaker conditions than the analogous decoders based on ℓ1 minimization and greedy algorithms. More precisely, we showed that

‖∆^ε_p(Ax + e) − x‖2 ≤ C1 ε + C2 σk(x)ℓp / k^{1/p − 1/2},

with p < 1, can be true even when an analogous statement cannot be made about ∆^ε_1 (see, e.g., Chapter 2, Section 2.2.1, Theorem 2.2.1).

• Using a given encoder A, ∆^ε_p can robustly recover signals that are less sparse than those that can be recovered with ∆^ε_1 (see Chapter 2, Section 2.2.2).

• The decoder ∆p is (2,2) instance optimal in probability. Specifically,

‖∆p(Ax + e) − x‖2 ≤ C (‖e‖2 + σk(x)ℓ2)

with high probability on the draw of A. This result, for example, is useful when the noise level is not known. Moreover, it has the added benefit of quantifying the reconstruction error and the best k-term approximation error using the same norm (see Chapter 2, Section 2.2.3).

Σ∆ quantization for compressed sensing measurements

With respect to quantization of compressed sensing measurements of a sparse signal x, we proposed using an rth order Σ∆ scheme and an associated two-stage reconstruction scheme. In the first stage a robust decoder (such as ∆p, p ≤ 1) recovers the support T of x. In the second stage, the rth order Sobolev dual F of AT is used to approximate x from the quantized measurements q via Fq. Our results in this direction were derived for matrices A with entries drawn i.i.d. from N(0, 1) and, as discussed in Chapter 3, the proposed approach offers several advantages in the compressed sensing setting:

• It produces more accurate approximations than any known quantization scheme in this setting, even when sophisticated recovery algorithms are employed (see Chapter 3, Theorem 3.4.2).

• It is modular in the sense that if the (second) fine recovery stage is not available or practical to implement, then the standard (coarse) recovery procedure can still be applied as is. This stage inherits the error guarantees of the robust decoder.

• It is progressive in the sense that if new measurements arrive (in any given order), noise shaping can be continued on these measurements as long as the state of the system (r real values for an rth order scheme) has been stored.

• It is universal in the sense that it uses no information about the measurement matrix or the signal.
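In practice, the ℓp minimizers behind the decoders summarized above are typically approximated by iterative schemes such as iteratively reweighted least squares (IRLS). The sketch below is a generic illustration for the noise-free program min ‖y‖_p subject to Ay = b; the function name, smoothing schedule, and iteration counts are our assumptions, not the algorithm used for the experiments in this thesis.

```python
import numpy as np

def irls_lp(A, b, p=0.5, inner_iters=20):
    """Generic IRLS sketch for min ||y||_p (0 < p <= 1) subject to A y = b.

    Each inner step solves a weighted least-squares problem in closed form,
        y = W^{-1} A^T (A W^{-1} A^T)^{-1} b,   W = diag((y_i^2 + eps)^{p/2 - 1}),
    and the smoothing parameter eps is decreased along an illustrative schedule.
    """
    y = np.linalg.lstsq(A, b, rcond=None)[0]          # least-squares initialization
    for eps in [1.0, 1e-2, 1e-4, 1e-6, 1e-8]:
        for _ in range(inner_iters):
            winv = (y ** 2 + eps) ** (1 - p / 2)      # entries of W^{-1}
            AWinv = A * winv                          # scales the columns of A
            y = winv * (A.T @ np.linalg.solve(AWinv @ A.T, b))
    return y
```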
Next, we will see how using a Σ∆ scheme for encoding compressed sensing measurements and subsequently employing ∆p as the stable decoder in the associated coarse recovery stage combines the advantages of the approaches discussed in Chapters 2 and 3.

4.2 Σ∆ quantization with decoding by ℓp minimization

In this section we provide the details of a numerical experiment by which we hope to tie together the various contributions in this thesis. We employ the Σ∆ quantization scheme discussed in Chapter 3 to quantize compressed sensing measurements of k-sparse signals. In the coarse recovery stage associated with Σ∆ quantization, we employ the ∆p decoder of Chapter 2 to recover the support of the signal, after which we proceed with the fine recovery stage as discussed in Chapter 3. Note here that we use the ∆p decoder (not ∆^ε_p). This highlights, for example, the practical nature of the (2, 2)-instance optimality in probability of ∆p discussed in Chapter 2.

For the numerical experiment, we generate a 1000 × 2000 matrix A whose entries are drawn i.i.d. from N(0, 1). We fix k = 40, and we generate k-sparse signals x with the non-zero entries of each signal supported on a random set T and drawn i.i.d. from N(0, 1). As in the experiments of Chapter 3, for n ∈ {100, 200, ..., 1000} we generate the measurements b = A(n)x, where A(n) is comprised of the first n rows of A. We then quantize b using PCM, as well as the 1st and 2nd order Σ∆ quantizers. For each of these quantized measurements q, we perform the coarse recovery stage, i.e., we compute each of ∆1(q) and ∆p(q) with p = 0.5, to recover a coarse estimate of x as well as an estimate T̃ of the support. We then implement the fine recovery stage of our algorithm, i.e., we generate the associated duals Fsob,r using T̃. Finally, our estimate of the signal is obtained via the fine recovery stage as x̃T̃ = Fsob,r q. This experiment is repeated 50 times for each value of n and the average of the resulting errors ‖x − x̃‖2 as a function of λ is displayed in Figure 4.1. We note the following observations.

• As the contributions of Chapter 2 indicate, the ∆p decoder, with p < 1, can recover signals that the ∆1 decoder cannot. Specifically, when the oversampling ratio λ is small, ∆p still offers stable reconstruction even though ∆1 fails.

• By using ∆p as opposed to ∆^ε_p in the coarse recovery stage, we rely on the (2,2)-instance optimality in probability property discussed in Chapter 2. Despite not accounting for the quantization error, we were able to recover the support of the signal reliably.

• As expected, the results in Chapter 3 extend to the stable decoder ∆p.

In short, by combining the contributions of this thesis, we (i) increase the range of k for which stable reconstruction (hence support recovery) is possible, compared to ℓ1 minimization, and (ii) significantly decrease the reconstruction error from quantized compressed sensing measurements of sparse signals over that range of k. (A code sketch of this end-to-end pipeline is given below.)

4.3 Open questions and future work

In this thesis, we have shown several results that are pertinent to compressed sensing quantization and decoding. In these contexts, we identify some important avenues that remain to be explored.
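To make the experiment of Section 4.2 concrete, the two-stage pipeline can be wired together as follows, reusing the helper functions greedy_sigma_delta, irls_lp, and sobolev_dual from the earlier sketches. All parameter values are illustrative and smaller than those of the actual experiment; this is a sketch of the procedure, not the code used to produce Figure 4.1.

```python
import numpy as np

# End-to-end sketch of the two-stage scheme of Section 4.2, reusing the helper
# functions sketched earlier in this document (greedy_sigma_delta, irls_lp,
# sobolev_dual).  Parameter values are illustrative only.
rng = np.random.default_rng(3)
N, n, k, r, d = 2000, 500, 40, 2, 1e-3

A = rng.standard_normal((n, N))                  # measurement matrix, entries i.i.d. N(0,1)
x = np.zeros(N)
T = rng.choice(N, size=k, replace=False)
x[T] = rng.standard_normal(k)                    # k-sparse signal

b = A @ x
q, _ = greedy_sigma_delta(b, r, d)               # r-th order Sigma-Delta quantization

# Coarse stage: noise-blind ell_p decoding (p = 0.5) and support estimation.
x_coarse = irls_lp(A, q, p=0.5)
T_hat = np.sort(np.argsort(np.abs(x_coarse))[-k:])

# Fine stage: r-th order Sobolev dual of the selected columns.
x_fine = np.zeros(N)
x_fine[T_hat] = sobolev_dual(A[:, T_hat], r) @ q

print("coarse error:", np.linalg.norm(x - x_coarse))
print("fine error:  ", np.linalg.norm(x - x_fine))
```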
[Figure 4.1 appears here: average ℓ2-norm of the error versus λ for the schemes PCM → ℓ1, PCM → ℓ1 → Fcan, Σ∆ (r=1) → ℓ1 → Fsob,1, Σ∆ (r=2) → ℓ1 → Fsob,2, Σ∆ (r=1) → ℓp → Fsob,1, Σ∆ (r=2) → ℓp → Fsob,2, together with reference curves cλ^{−r}, r = 0, 1, 2, with k = 40.]

Figure 4.1: The average performance of the proposed Σ∆ quantization and reconstruction schemes for various values of k. Specifically, in the coarse recovery stage we use ∆1 and ∆p, p = 0.5. For this experiment the non-zero entries of x are i.i.d. N(0, 1) and d = 10^{−3}.

4.3.1 Decoding by non-convex optimization

Convergent numerical algorithms for computing ∆p(Ax)

There is compelling numerical evidence (e.g., [1,2,4]) that simple modifications of existing optimization algorithms yield global minimizers of the ℓp minimization problem associated with ∆p. On the other hand, there is currently no known algorithm of complexity provably comparable to that of ℓ1 minimization that also provably converges to the desired global minimizer. For example, an acceptable result in this direction would be to show that for a sparse x, with high probability on the draw of A, a given numerical algorithm converges to ∆p(Ax) (with acceptable computational complexity).

(2,2) instance optimality in probability: other matrix classes

The results of Chapter 2, pertaining to the (2, p) instance optimality of ∆p, hold for any matrix whose restricted isometry constants satisfy certain inequalities (see, e.g., Chapter 2, Section 2.2.1, Theorem 2.2.1). In that sense, these results hold for all the classes of matrices for which the analogous ∆1 results hold, and under weaker conditions. Namely, matrices whose entries are drawn i.i.d. from a sub-Gaussian distribution and partial bounded orthogonal matrices will satisfy these inequalities with high probability.

On the other hand, the results pertaining to the (2, 2) instance optimality in probability of ∆p rely on the relevant matrices satisfying an LQp property, in addition to a restricted isometry property (see Chapter 2, Section 2.2.3). We showed that the relevant LQp property used in the proofs is satisfied by random Gaussian matrices and matrices whose columns are drawn uniformly from the sphere. An extension of our results to sub-Gaussian matrices may be possible by modifying the LQp property. In fact, this technique was followed successfully in [3] to extend the analogous results of Wojtaszczyk [5] on ∆1 to sub-Gaussian random matrices. However, preliminary work in this direction shows that the extension to the case p < 1 may not be straightforward. Nevertheless, this remains an important topic for future work.

In a similar direction, one could investigate whether the above decoders (including ∆1) are (2,2) instance optimal in probability when the matrices at hand are partial orthogonal matrices (such as randomly sampled Fourier matrices).

4.3.2 Σ∆ quantization for compressed sensing measurements

Other matrix classes

Our results in Chapter 3 pertain to matrices whose entries are drawn i.i.d. from N(0, 1). It is a natural next step to investigate whether the results extend more generally to matrices whose entries are drawn from sub-Gaussian distributions (such as the Bernoulli distribution). In fact, preliminary work on this topic shows that such an extension may be possible with minor modifications to the proof techniques of Chapter 3.
On the other hand, a more difficult question which requires different techniques is whether the results extend to partial bounded orthogonal matrices, such as matrices formed by randomly selecting rows of the Fourier matrix. This remains a topic for future work.

Explicitly handling noisy and compressible signals

In Chapter 3 we saw that, given compressed sensing measurements of sparse signals, Σ∆ quantization offers significant gains over PCM in terms of error guarantees. However, despite encouraging numerical results (see Chapter 3, Section 3.6), a more in-depth treatment regarding compressible signals and noisy measurements is still needed. For a more detailed discussion of this topic see Chapter 3, Section 3.6.

A one-stage decoding algorithm

The approach proposed in Chapter 3 requires a two-stage decoder, one to recover the support of the sparse signal and the other to apply the Sobolev dual of the appropriate frame. It would certainly be interesting to design a one-stage decoder that would still allow for results of the form presented in, e.g., Chapter 3, Theorem B. To that end, such a decoder could entail the minimization of an appropriate optimization problem, or it could entail computations similar to those used in greedy algorithms, such as CoSaMP.

4.4 References

[1] R. Chartrand. Exact reconstructions of sparse signals via nonconvex minimization. IEEE Signal Processing Letters, 14(10):707–710, 2007.

[2] R. Chartrand and W. Yin. Iteratively reweighted algorithms for compressive sensing. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3869–3872, 2008.

[3] R. DeVore, G. Petrova, and P. Wojtaszczyk. Instance-optimality in probability with an l1-minimization decoder. Applied and Computational Harmonic Analysis, 27(3):275–288, 2009.

[4] R. Saab, R. Chartrand, and Ö. Yılmaz. Stable sparse approximations via nonconvex optimization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3885–3888, 2008.

[5] P. Wojtaszczyk. Stability and instance optimality for Gaussian measurements in compressed sensing. Foundations of Computational Mathematics (to appear).

Appendix A

Other contributions

During the course of my Ph.D. work I have co-authored manuscripts that deal with various aspects of the sparse approximation problem, including the theoretical, computational, and applied. While this thesis focuses on compressed sensing, below I list all the manuscripts.

Journal papers

• R. Saab and Ö. Yılmaz. Sparse recovery by non-convex optimization – instance optimality. Applied and Computational Harmonic Analysis (in press).

• C.S. Güntürk, A. Powell, R. Saab, and Ö. Yılmaz. Sobolev duals for random frames and Σ∆ quantization of compressed sensing measurements. Arxiv preprint arXiv:1002.0182.

• E. Van de Berg, M.P. Friedlander, G. Hennenfent, R. Saab, Ö. Yılmaz, and F.J. Herrmann. Sparco: A testing framework for sparse reconstruction. ACM Transactions on Mathematical Software, 24:4, 2009.

• D. Wang, R. Saab, Ö. Yılmaz, and F.J. Herrmann. Bayesian wavefield separation by transform-domain sparsity promotion. Geophysics, 73:A33, 2008.

Conference proceedings

• H. Mansour, R. Saab, P. Nasiopoulos, and R. Ward. Color image desaturation using sparse reconstruction. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, 2010.

• R. Saab and Ö. Yılmaz. A short note on non-convex compressed sensing.
In Sampling Theory and Applications (SAMPTA09), Marseille, France, 2009.

• R. Saab, R. Chartrand, and Ö. Yılmaz. Stable sparse approximation via non-convex optimization. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, Nevada, 2008.

• D. Wang, R. Saab, Ö. Yılmaz, and F.J. Herrmann. Recent results in curvelet-based primary-multiple separation: application to real data. In Annual Meeting, International Society Exploratory Geophysics (SEG), San Antonio, Texas, 2007.

• R. Saab, D. Wang, Ö. Yılmaz, and F.J. Herrmann. Curvelet-based primary-multiple separation from a Bayesian perspective. In Annual Meeting, International Society Exploratory Geophysics (SEG), San Antonio, Texas, 2007.
