UBC Theses and Dissertations

Theory and algorithms for compressive data acquisition under practical constraints Melnykova, Kateryna 2021


Full Text

Theory and algorithms for compressive data acquisition under practical constraints

by

Kateryna Melnykova

B.Sc., Kyiv Taras Shevchenko University, 2010
M.Sc., The University of Manitoba, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate and Postdoctoral Studies (Mathematics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

February 2021

© Kateryna Melnykova 2021

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Theory and algorithms for compressive data acquisition under practical constraints

submitted by Kateryna Melnykova in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Mathematics.

Examining Committee:
Ozgur Yilmaz, Professor, Mathematics, UBC (Supervisor)
Yaniv Plan, Professor, Mathematics, UBC (Supervisory Committee Member)
Brian Wetton, Professor, Mathematics, UBC (University Examiner)
Mark Schmidt, Professor, Computer Science, UBC (University Examiner)

Additional Supervisory Committee Members:
Brian Marcus, Mathematics, UBC (Supervisory Committee Member)
Malabika Pramanik, Mathematics, UBC (Supervisory Committee Member)

Abstract

Signal reconstruction from linear measurements is a well-established problem in applied and computational mathematics, data science, and signal processing. In a nutshell, a signal lives in a vector space, and one aims to reconstruct the signal from its linear measurements. This thesis investigates the following real-world scenarios under this setup.

Scenario 1. Consider the over-sampled setting and assume that the signals to be recovered are known to be sparse. In this case, the randomized Kaczmarz and sparse randomized Kaczmarz algorithms are two well-known algorithms that can handle large data. This thesis investigates the reconstruction of sparse signals and the support detection for both algorithms and provides rigorous results for this setting.

Scenario 2.
As in Scenario 1, suppose that the signals of interest are known to be sparse and belong to R^N, where N is large, but this time we are in the under-sampled setting. The thesis proposes a novel algorithm called Iterative Reweighted Kaczmarz (IRWK) for the recovery of N-dimensional s-sparse signals from their linear measurements in the underdetermined setting. The IRWK algorithm uses a single measurement per iteration and requires only O(N) working memory.

Scenario 3. Suppose that the linear measurements of a signal are quantized using Memoryless Scalar Quantization (MSQ), which essentially rounds off each entry to the nearest point in δZ. The measurements perturbed by MSQ lead to an inaccurate reconstruction of the signal even in the over-sampled setting (frame quantization). In practice, when the number of measurements m increases, the reconstruction error is observed to decay at a rate of O(m^(-1/2)). However, the earlier theory guarantees only O(1) error. This thesis bridges theory with the observed results by rigorously proving that the error is of the form c1 + c2·O(m^(-1/2)), where c1 and c2 are independent of m. We prove that c1 is extremely small in the case of Gaussian measurements and provide sharp bounds on its value. We also mathematically justify why we expect c1 to be small but non-zero for a broad class of random matrices. We also extend the result to the underdetermined setting (compressed sensing quantization).

Lay summary

I investigate the following scenarios of reconstruction of signals from their measurements.

Sparse recovery. In signal processing, signals of interest are often sparse, i.e., they can be approximated by a sequence of numbers of which most entries are zero. In machine learning, a few features can often explain a phenomenon, and the goal is to find out which ones. Solving these problems for big data requires algorithms to be both memory-efficient and time-efficient.
I improve the convergence guarantees for some of these algorithms and propose a family of novel algorithms that outperform established algorithms in specific settings.

Quantization. Signals are stored in digital computers, so they need to be quantized. This causes errors in reconstructed signals. I prove that the reconstruction error is reduced to near-zero even when the most straightforward quantization method is employed, in settings where this was not known before.

Preface

This thesis consists of my original research, conducted at the Department of Mathematics at the University of British Columbia, Vancouver, Canada, under the supervision of Dr. Özgür Yılmaz. The results in Chapters 2 and 3 were obtained in collaboration with Dr. Özgür Yılmaz and Dr. Halyun Jeong and are to be submitted to a journal soon. Chapter 4 is joint work with Dr. Özgür Yılmaz, has been posted on arXiv (1804.02839), and is submitted for publication.

Table of contents

Abstract
Lay summary
Preface
Table of contents
List of figures
List of algorithms
List of notations
Acknowledgements

1 Introduction
  1.1 Finite frames
    1.1.1 Overview
    1.1.2 The canonical dual frame and the Moore-Penrose pseudoinverse
    1.1.3 Alternative dual frames
    1.1.4 Noise-suppressing effect
  1.2 Compressed sensing
    1.2.1 Theoretical reconstruction guarantees
    1.2.2 Reconstruction algorithms: overview
  1.3 Probability in high dimensions
    1.3.1 Sub-exponential and sub-Gaussian random variables
    1.3.2 Concentration of the sum of independent random variables
    1.3.3 Random matrices in high dimensions
  1.4 Notations
  1.5 Organization of this thesis

2 Kaczmarz algorithm for sparse recovery
  2.1 Overview of the Kaczmarz algorithms
    2.1.1 The (deterministic) Kaczmarz algorithms
    2.1.2 The randomized Kaczmarz (RK) algorithm
  2.2 Kaczmarz-based algorithms for sparse recovery
    2.2.1 The sparse randomized Kaczmarz (SRK) algorithm
    2.2.2 The randomized sparse Kaczmarz (RASK) algorithm
    2.2.3 The Kaczmarz algorithms meet the IHT algorithm
  2.3 Main results. The RK algorithm for support detection
  2.4 Main results. The sparse randomized Kaczmarz algorithm
    2.4.1 Local linear convergence
    2.4.2 Boundedness of the reconstruction sequence
    2.4.3 On convergence of the SRK algorithm
    2.4.4 Proofs
  2.5 Numerical experiments
    2.5.1 Experiment 1. The RK algorithm vs the gradient descent for sparse recovery and support detection
    2.5.2 Experiment 2. Local convergence of the SRK algorithm
    2.5.3 Experiment 3. On the max-norm convergence and element-wise dynamics of the SRK algorithm

3 The iterative reweighted Kaczmarz algorithm (IRWK)
  3.1 The IRLS using randomized reweighted Kaczmarz
  3.2 The Iteratively Reweighted Kaczmarz algorithm (IRWK)
    3.2.1 Specific choice of non-unit weight
    3.2.2 Running time of the randomized Kaczmarz algorithm for each iteration of IRWK
    3.2.3 On condition number of the weighted matrix and on the number of iterations of the randomized Kaczmarz algorithm
    3.2.4 Local linear convergence
    3.2.5 Connection with the IHT and compressibility of solutions
  3.3 Numerical performance
    3.3.1 Experiment 1. IRLS using randomized Kaczmarz algorithm
    3.3.2 Experiment 2. The IRWK and the IRLS algorithms
    3.3.3 Experiment 3. The performance of IRWK algorithm
    3.3.4 Experiment 4. Compressibility of the IRWK approximation sequence

4 MSQ for frames and compressed sensing
  4.1 Introduction to quantization
    4.1.1 Quantization for frames
    4.1.2 Quantization for compressed sensing
  4.2 MSQ for random frames
    4.2.1 Error estimates without WNH – main results
    4.2.2 Gaussian random matrices
    4.2.3 On the value of µ
    4.2.4 Extension to noisy and dithered quantization
    4.2.5 Proofs
  4.3 MSQ for compressed sensing
    4.3.1 Two-phase reconstruction scheme
    4.3.2 Projected back projection
    4.3.3 Proofs
  4.4 Numerical experiments
    4.4.1 Experiment 1. MSQ for frame theory
    4.4.2 Experiment 2. Lower bound for the decay rate
    4.4.3 Experiment 3. Compressed sensing setting

5 Conclusions

Bibliography

List of figures

2.1 Support detection for consistent overdetermined linear systems using the RK algorithm and the gradient descent. Here the matrix is 2000 × 600 Gaussian, and the signal is 30-sparse. Both algorithms are initialized with the zero vector. The gradient descent is run using the kernel trick.

2.2 Convergence rate in the local regime for (a) overdetermined and (b) underdetermined consistent linear systems. The estimated sparsity is 80. The dashed lines represent linear convergence for comparison. The y-axis is in the logarithmic scale.
2.3 On element-wise convergence.

3.1 The IRLS iterations for a 350 × 1000 Gaussian matrix and a 40-sparse unit-norm signal.

3.2 Number of iterations of the IRLS and IRWK algorithms for a 350 × 1000 Gaussian matrix and 40-sparse signals.

3.3 Reconstruction errors for various sparse recovery algorithms for a 350 × 1000 random matrix. We consider sparsity 2 and 40, and estimated sparsity 3 and 48, respectively. Then, we draw a unit-norm s-sparse signal x∗, compute the corresponding measurements, and run the sparse recovery algorithms. We plot the reconstruction error vs. CPU time, where the reconstruction error is in the logarithmic scale.

3.4 The CPU time required to reach reconstruction error 10^-5 for various settings. Unless specified above, the number of measurements is m = 350, the dimension of the signal is N = 1000, the sparsity is s = 40, and the estimated sparsity is 1.2s. We draw an m × N ±1 Bernoulli random matrix A and a unit-norm s-sparse signal x∗ and run each algorithm above for Ax = b := Ax∗. We measure the CPU time required to reach accuracy 10^-5. If the CPU time exceeds 20, we assume that sparse recovery is unsuccessful. All plots are in the log-log scale.

3.5 Compressibility of the reconstruction sequence {x_k} produced by the IRWK algorithm. Here the measurement matrix is a 350 × 1000 matrix with i.i.d. standard Gaussian entries, and x∗ ∈ Σ^1000_40 with ‖x∗‖ = 1 is drawn uniformly. The estimated sparsity is 48.

4.1 The numerical behavior (in log-log scale) of the mean of the reconstruction error, as a function of λ = m/n, for a fixed unit-norm signal x and an m × n random matrix E with i.i.d.
standard Gaussian entries (A), with i.i.d. Bernoulli entries (B), and with independent rows uniformly distributed on the sphere of radius √n (C). For each δ = 0.1, 0.05, 0.01, we draw 1000 realizations of the random matrix E, compute the quantized frame coefficients of x, and reconstruct using E†. We plot the average reconstruction error for each δ.

4.2 The numerical behaviour of the reconstruction error in the setting identical to that described in Figure 4.1; this time δ = 4 and λ ∈ [10, 1000]. The results shown are the outcomes for Gaussian matrices (A), Bernoulli matrices (B), and matrices whose rows are randomly drawn from the uniform distribution on the sphere (C).

4.3 The numerical behaviour of the reconstruction error in the setting identical to that described in Figure 4.1; this time E is an m × n random Fourier matrix obtained by restricting the N × N DFT matrix (with N = 100000) to its first n = 20 columns and then selecting m random rows. Here δ = 0.01 and λ ∈ [10, 1000].

4.4 The numerical behaviour of the reconstruction error in the CS setting. The results shown are the outcomes for Gaussian measurement matrices (a), Bernoulli measurement matrices (b), and matrices whose rows are randomly drawn from the uniform distribution on the sphere (c). Here x ∈ Σ^N_s, where N = 1000 and s = 20, and the measurement matrix A is m × N, where m varies between 100 and 500, corresponding to λ varying between 5 and 25.

List of algorithms

1.1 ℓ0 minimization
1.2 The projected back projection (PBP)
1.3 Iterative hard thresholding (IHT)
1.4 Basis Pursuit Denoising (BPDN)
1.5 Iteratively reweighted least squares (IRLS)
2.1 The Kaczmarz algorithm
2.2 The randomized Kaczmarz algorithm
2.3 Sparse randomized Kaczmarz algorithm (SRK)
2.4 The randomized sparse Kaczmarz (RASK) algorithm
2.5 The KZIMT algorithm
2.6 The SRK-IHT algorithm
3.1 The iterative reweighted Kaczmarz (IRWK)

List of notations

⊙  entry-wise product of two vectors
[n]  the set of the first n integers {1, 2, ..., n}
‖·‖_p  the p-norm, defined by ‖x‖_p := (∑_i |x_i|^p)^(1/p); for p ≥ 1, ‖·‖_p is a norm, and for 0 ≤ p < 1 it is a quasi-norm
‖·‖  the 2-norm
‖·‖_ψ2  the sub-Gaussian norm
‖·‖_F  the Frobenius norm
A^T  transpose of A
A†  Moore-Penrose pseudoinverse of A
diag(v)  diagonal matrix whose diagonal entries are given by v
EX  expectation of the random variable X
H_t  hard thresholding operator that keeps the top t entries of its input and sets the other entries to zero
I  identity matrix
J  matrix whose entries are all ones
ℓ2  Euclidean norm space
ℓp  Banach space equipped with the norm ‖·‖_p
N  set of all natural numbers
N(A)  null space of the matrix A
N(µ, σ^2)  the Gaussian distribution with mean µ and variance σ^2
P(X)  probability of the event X
R  set of all real numbers
R(A)  range of the matrix A
S_λ  soft thresholding operator, S_λ(x) := max{|x| − λ, 0} · sign(x)
Σ^N_s  set of all s-sparse signals in R^N

Acknowledgements

Throughout the research and writing of this dissertation, I have received a great deal of support and assistance.

I would first like to thank my supervisor, Dr. Ozgur Yilmaz, whose expertise and support were invaluable. In particular, he taught me how to use a “big picture” to get insights into a specific research question.
His support helped me become a better mathematician, and this thesis would not be possible without him. I am also grateful for his detailed feedback on this thesis, and I am sorry about the hundreds of articles (a/an/the) you added and removed from my writings.

I would like to thank my collaborator, Dr. Halyun Jeong, for always being available to help me structure my ideas, and for his patience, kindness, and support. I enormously appreciate all the hours we spent trying to solve the research problem. I want to extend my gratitude to all the professors at UBC who guided me in class or during meetings.

Last but not least, I would like to thank my family and friends, who encouraged me and boosted my morale during my PhD.

Chapter 1
Introduction

In this thesis, we investigate several problems in the context of reconstructing (approximations of) signals in finite, but high, ambient dimension, say n, from m possibly corrupted linear measurements. This problem falls into one of two categories:

1. Frame Theory: When m > n, i.e., the number of measurements exceeds the ambient dimension, we are in the realm of frame theory (see, e.g., [34, 35, 43]). In this setting, the structured redundancy of frame expansions can be exploited for denoising (see, e.g., Section 1.1.4 and Chapter 11 in [111] and references therein), for quantization (see, e.g., Section 4.1.1 and Chapter 8 in [34] and references therein), and for error correction (see, e.g., Chapter 7 in [34] and references therein), among others. This framework is a central part of audio and image acquisition and processing [5, 104, 111, 139, 150], medical imaging [103, 130], and seismic signal analysis [74, 79]. In this thesis, we consider the problem of quantization and obtain new results on controlling the error behavior as a function of the number of measurements and the ambient dimension. These results are presented in Chapter 4.

2.
Compressed Sensing: When the number of measurements is smaller than the ambient dimension, i.e., when m < n, it is in general not possible to reconstruct the signal. However, if the signals of interest are known to admit a “sparse representation”, then it is now well established that they can be recovered using sparse recovery methods. This “sampling paradigm” is called compressed sensing [28, 30, 33, 53]. The task of recovering a sparse signal from underdetermined measurements arises in the image acquisition literature, e.g., the 1-pixel digital camera [55, 97], in medical imaging with an emphasis on magnetic resonance imaging (MRI) [61, 85, 108, 109], in seismic imaging [140, 161], in feature selection in machine learning [63, 98, 118], and in other areas [99]. In this thesis, we focus on a class of computationally feasible sparse recovery algorithms that are also memory-efficient. In particular, we focus on algorithms that require access to a single measurement at each iteration. Chapter 2 investigates sparse recovery algorithms in both the over- and under-determined settings, and Chapter 3 introduces a family of novel algorithms that are both computationally and memory-efficient.

We note that while our results on quantization, which we present in Chapter 4, are mainly focused on frame theory, they do extend, as discussed in Section 4.3, to the compressed sensing setting. Similarly, the sparse recovery algorithms in Chapters 2 and 3 can be used in the compressed sensing as well as the frame theory settings.

The rest of this chapter is structured as follows. Sections 1.1 and 1.2 briefly review related results in frame theory and compressed sensing, respectively. Random variables and concentration bounds, random vectors, and random matrices are overviewed in Section 1.3.
This chapter concludes by introducing the global notations used throughout the thesis (Section 1.4) and the road map for the remaining part of the thesis (Section 1.5).

1.1 Finite frames

1.1.1 Overview

In various applications, signals can be modeled as elements of R^n. Accordingly, any signal x ∈ R^n can be represented discretely, for example, by choosing an orthonormal basis {v_i} for R^n and computing (or measuring) x_i = ⟨x, v_i⟩. Alternatively, one can use frames, which generalize the notion of a basis by allowing “controlled” redundancy. Specifically, {e_1, e_2, ...} ⊂ R^n is a frame for R^n if there exist positive constants a ≤ b, called the frame bounds, such that for all x ∈ R^n,

    a‖x‖^2 ≤ ∑_i |⟨x, e_i⟩|^2 ≤ b‖x‖^2.

If a = b, then the frame is tight. If ‖e_i‖ = 1 for all i, then the frame is unit-norm.

While the sequence {e_i} may be infinite, in this thesis we consider only finite frames, i.e., frames with finitely many frame vectors e_i. Suppose that {e_1, e_2, ..., e_m} is a frame in R^n. Then {e_i} spans R^n, which, in turn, implies that for every x ∈ R^n there exist scalars {y_i} such that x = ∑ y_i e_i. We also conclude that m ≥ n. If {e_i} is overcomplete, i.e., if m > n, then the choice of {y_i} in this decomposition is not unique. This can be exploited in applications to reduce the error introduced by lossy operations like transmission over a noisy channel or quantization.

The decomposition above raises the following two tasks.

Task 1. Reconstruct a signal from its frame coefficients. Reconstruct x from its computed or measured frame coefficients b_i := ⟨x, e_i⟩, i = 1, 2, ..., m. Let T : R^n → R^m be the analysis operator, defined by Tx := {⟨x, e_i⟩} = {b_i}. Then, given {b_i}, we want to find x such that Tx = {b_i}. Note that the range of T is a subspace of R^m. If m > n, there are infinitely many operators T̃ : R^m → R^n such that T̃T is the identity.

Task 2. Decompose a signal in the given frame. For a signal x and a frame {e_i}, compute coefficients y_i such that x = ∑_{i=1}^m y_i e_i.
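The definitions above lend themselves to a quick numerical check. The following sketch is illustrative only (it is not from the thesis); it draws a random finite frame, verifies the frame inequality via the extreme squared singular values of E, and resolves Task 1 with NumPy's pseudoinverse as one convenient left inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 9
# Rows of E form a frame for R^n whenever E has full column rank
# (which holds with probability 1 for a Gaussian matrix).
E = rng.standard_normal((m, n))

# The optimal frame bounds are the extreme squared singular values of E.
sv = np.linalg.svd(E, compute_uv=False)
a, b = sv.min() ** 2, sv.max() ** 2

x = rng.standard_normal(n)
coeffs = E @ x                      # frame coefficients b_i = <x, e_i>

# Frame inequality: a*||x||^2 <= sum_i |<x, e_i>|^2 <= b*||x||^2.
energy = np.sum(coeffs ** 2)
assert a * (x @ x) - 1e-9 <= energy <= b * (x @ x) + 1e-9

# Task 1: recover x from its frame coefficients using a left inverse of E.
x_rec = np.linalg.pinv(E) @ coeffs
assert np.allclose(x_rec, x)
```

Since m > n here, `np.linalg.pinv(E)` is only one of infinitely many left inverses; any of them recovers x exactly from unperturbed coefficients.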
Formally, let T* : R^m → R^n be the synthesis operator, defined by T*{y_i} := ∑_{i=1}^m y_i e_i. Note that T* is adjoint to T [35, p. 17]. Computing the coefficients {y_i} is equivalent to finding a solution of the equation T*y = x, which, in turn, corresponds to finding a right inverse of T*. Note that if m > n, i.e., if the frame {e_i} is redundant, then there are infinitely many right inverses of T*.

A frame {f_1, f_2, ..., f_m} is said to be dual to the frame {e_i} if, for all x,

    x = ∑ ⟨x, e_i⟩ f_i = ∑ ⟨x, f_i⟩ e_i.

Therefore, if a frame {f_i} is dual to a frame {e_i}, then it can be used to resolve both tasks listed above. If the frame {e_i} is tight with frame bound a, then {(1/a) e_i} is a dual frame to {e_i}.

Note that the analysis and synthesis operators are linear, and therefore, in finite dimensions, they can be represented by matrices. Suppose that {e_1, e_2, ..., e_m} is a frame for R^n, and let E be the matrix whose rows are e_i^T, i = 1, 2, ..., m. Then E ∈ R^{m×n}, where m ≥ n. We call E a frame if its rows form a frame. If E is a frame, then the vector b = Ex represents the sequence of frame coefficients {⟨x, e_i⟩}; thus E is the analysis operator of this frame. Furthermore, the frame bounds a and b satisfy

    a‖x‖^2 ≤ ‖Ex‖^2 ≤ b‖x‖^2.

In other words, the singular values of E determine the frame bounds a and b. The singular values of any matrix A are defined as the square roots of the eigenvalues of A^T A. If σ_min and σ_max are the smallest and largest singular values of E, respectively, then

    a ≤ σ_min^2 ≤ σ_max^2 ≤ b.

In general, for any matrix E ∈ R^{m×n} with m ≥ n, if E has full rank, then its singular values are positive, and therefore its rows form a frame. The analysis operator of this frame is E, and the synthesis operator is given by E^T.

In this chapter, we want to distinguish statements for frames from statements for arbitrary matrices (not necessarily “tall” or full rank), and therefore we use the following notations.
A matrix is denoted by E if its rows form a frame and by A if it is an arbitrary matrix (not necessarily full rank or with m ≥ n).

1.1.2 The canonical dual frame and the Moore-Penrose pseudoinverse

In this section, we provide a solution to the tasks stated above: find a left inverse of T and a right inverse of T*. Recall that dual frames may be used to resolve these tasks. In this section, we introduce the canonical dual frame and the Moore-Penrose pseudoinverse and state their basic properties.

We start by observing that the analysis operator T is a bijection between R^n and the range of T, R(T). Equivalently, E : x ↦ Ex is a bijection between R^n and R(E). More generally, every matrix A establishes a bijection between R(A^T) and R(A). First, let us show that A : R(A^T) → R(A) is injective. Suppose that two signals x_1 and x_2 in R(A^T) are mapped to the same vector, i.e., Ax_1 = Ax_2. Then x_1 − x_2 belongs to N(A) and R(A^T) at the same time. Since these spaces are orthogonal, their intersection equals {0}. Therefore, x_1 = x_2, which, in turn, implies that A : R(A^T) → R(A) is injective. Note that the dimensions of R(A^T) and R(A) are equal (and equal to the rank of A). Therefore, A : R(A^T) → R(A) is bijective. In particular, if R(A^T) = R^n, i.e., A is a full column rank matrix that represents a frame, then A : R^n → R(A) is bijective. If R(A) = R^m, i.e., A is a full row rank matrix, then A : R(A^T) → R^m is bijective.

Since T is a bijection between R^n and R(T), a left inverse T̃ satisfying T̃T = I exists. The composition of the analysis operator T and the synthesis operator T* is called the frame operator: S = T*T : x ↦ ∑ ⟨x, e_i⟩ e_i is bounded and invertible if {e_i} is a frame; see, e.g., [43, p. 6]. Note that S^{-1}T*T = T*TS^{-1} = I, and therefore S^{-1}T* is a left inverse of T, and TS^{-1} is a right inverse of T*. Indeed, for all x ∈ R^n (see, e.g., [35, p.
27]),

    x = ∑ ⟨x, e_i⟩ S^{-1}e_i = ∑ ⟨x, S^{-1}e_i⟩ e_i.

Therefore, {S^{-1}e_i} is a dual frame, which is called the canonical dual frame.

In finite dimensions, the canonical dual frame may be represented by the matrix S^{-1}E^T = (E^T E)^{-1}E^T, which is known as the Moore-Penrose pseudoinverse of E. In general, the Moore-Penrose pseudoinverse of a (potentially non-square, non-full-rank) matrix A is given by

    A† = lim_{ε→0+} (A^T A + εI)^{-1} A^T = lim_{ε→0+} A^T (AA^T + εI)^{-1}.

Note that both limits are well-defined for all matrices A and are equal to each other [10]. If A is a full rank matrix, then at least one of the matrices A^T A and AA^T is invertible. We conclude that

• if m > n, then A† = (A^T A)^{-1} A^T;
• if m = n, then A† = A^{-1};
• if m < n, then A† = A^T (AA^T)^{-1}.

In this thesis, we usually consider full rank matrices, and we use the formulas above without referencing them. Please refer to [10] for further properties of the general Moore-Penrose pseudoinverse along with proofs.

The singular values of the Moore-Penrose pseudoinverse of a given matrix A can be computed as follows: the non-zero singular values of A† are the reciprocals of the non-zero singular values of A. Therefore, if all singular values of A are between σ_min > 0 and σ_max, then the non-zero singular values of A† are between σ_max^{-1} and σ_min^{-1}. In particular, the largest singular value of A† equals the reciprocal of the smallest non-zero singular value of A.

If Ax = b has a solution, then A†b is the minimum 2-norm solution of Ax = b. If A is a full rank overdetermined matrix, then A†b is the least-squares solution of Ax = b [10].

The Moore-Penrose pseudoinverse can be used to obtain a projection onto the range of a frame, as described below. Suppose that E is a frame. Then the null space of EE† = E(E^T E)^{-1}E^T coincides with the null space of E^T. Since E† is a left inverse of E, EE†Ex = Ex for all x. Therefore, EE† sets all vectors in the null space of E^T to zero and acts as the identity on the range of E.
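These pseudoinverse identities can be verified numerically. A minimal sketch (illustrative only; `np.linalg.pinv` serves as the reference implementation, and the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3
E = rng.standard_normal((m, n))     # tall, full column rank w.p. 1

# m > n case: E^dagger = (E^T E)^{-1} E^T.
E_pinv = np.linalg.inv(E.T @ E) @ E.T
assert np.allclose(E_pinv, np.linalg.pinv(E))

# E^dagger is a left inverse of E.
assert np.allclose(E_pinv @ E, np.eye(n))

# Non-zero singular values of E^dagger are reciprocals of those of E.
sv = np.linalg.svd(E, compute_uv=False)
sv_pinv = np.linalg.svd(E_pinv, compute_uv=False)
assert np.allclose(np.sort(sv_pinv), np.sort(1.0 / sv))

# P = E E^dagger fixes the range of E ...
P = E @ E_pinv
x = rng.standard_normal(n)
assert np.allclose(P @ (E @ x), E @ x)

# ... and sends every vector into R(E): the residual b - Pb lies in N(E^T).
b_vec = rng.standard_normal(m)
assert np.allclose(E.T @ (b_vec - P @ b_vec), 0)
```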
Since R^m can be orthogonally decomposed as the sum of the null space of E^T and the range of E, EE† is the orthogonal projection onto R(E).

1.1.3 Alternative dual frames

Recall that there are infinitely many dual frames for any frame E ∈ R^{m×n} with m > n. Section 1.1.2 focuses on a specific dual frame called the canonical dual and its associated matrix. In this section, we provide a brief review of alternative dual frames.

First, we want to find a representation of all left inverses of a frame E, i.e., a representation of all E˜ ∈ R^{n×m} such that E˜E = I. This condition implies that for every b ∈ R(E), E˜b = E†b. Therefore, it suffices to characterize the action of E˜ on the orthogonal complement of R(E), which equals N(E^T). For every b ∈ R^m,

    b = proj_{R(E)} b + proj_{N(E^T)} b = EE†b + (I − EE†)b,

and therefore

    E˜b = E˜EE†b + E˜(I − EE†)b = E†b + E˜(I − EE†)b.

Hence, all dual frames may be represented as E† + D proj_{R(E)^⊥} = E† + D(I − EE†), where D ∈ R^{n×m} is an arbitrary matrix [101].

1.1.4 Noise-suppressing effect

In this section, we investigate the ability of frames to alleviate the effect of noise in signal acquisition. Suppose that x is a signal and that its frame coefficients Ex are perturbed by random noise ε = (ε_1, ε_2, ..., ε_m). We assume that the ε_i are i.i.d. with mean zero and variance σ^2. We want to reconstruct x from its noisy measurements b = Ex + ε.

A common metric for accuracy is the mean squared error (MSE). Suppose that f_E : R^m → R^n is used to reconstruct the signal x from its measurements b. Then the MSE is defined as

    MSE = E‖x − f_E(b)‖^2 = E‖x − f_E(Ex + ε)‖^2.

Here the expectation is taken with respect to the noise only. If x or E is random (which is the case in the following chapters), then the expectation is taken over them as well; in that case, we suppose that ε is independent of x and E. We want the reconstruction scheme f_E to minimize the MSE.
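To make this setup concrete, the following simulation (illustrative only; the dimensions, noise level, and trial count are arbitrary choices, not from the thesis) estimates the MSE when x is reconstructed from noisy frame coefficients with the pseudoinverse E†:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, sigma, trials = 5, 50, 0.1, 2000
E = rng.standard_normal((m, n))
E_pinv = np.linalg.pinv(E)          # reconstruct with the canonical dual
x = rng.standard_normal(n)          # fixed signal

err = 0.0
for _ in range(trials):
    eps = sigma * rng.standard_normal(m)   # i.i.d. noise, mean 0, variance sigma^2
    x_hat = E_pinv @ (E @ x + eps)         # reconstruct from b = Ex + eps
    err += np.sum((x - x_hat) ** 2)
mse_empirical = err / trials

# For the pseudoinverse, the MSE equals sigma^2 times the sum of the
# reciprocal squared singular values of E; the estimate above should be
# close to this value for a large number of trials.
sv = np.linalg.svd(E, compute_uv=False)
mse_theory = sigma ** 2 * np.sum(sv ** -2.0)
print(mse_empirical, mse_theory)
```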
This problem is known as denoising: reconstruct the vector x from its (potentially perturbed) overcomplete linear measurements Ex in a way that alleviates the effect of the perturbation.

Suppose that the signal is reconstructed using a left inverse E˜, i.e., f_E(b) = E˜b. Denote the columns of E˜ by ẽ_1, ẽ_2, ..., ẽ_m. Then, using that Eε_iε_j = Eε_i · Eε_j = 0 for i ≠ j,

    MSE = E‖x − E˜b‖^2 = E‖x − E˜(Ex + ε)‖^2 = E‖E˜ε‖^2 = E‖∑_{i=1}^m ẽ_i ε_i‖^2 = ∑_{i=1}^m ‖ẽ_i‖^2 Eε_i^2 = σ^2 ‖E˜‖_F^2.

Here ‖E˜‖_F^2 stands for the squared Frobenius norm of E˜, i.e., the sum of the squared entries of E˜. Note that the squared Frobenius norm is also the sum of the squares of the singular values of the matrix, and therefore

    MSE = σ^2 ∑_{i=1}^n σ̃_i^2,

where σ̃_i, i = 1, 2, ..., n, are the singular values of E˜. Note that the n singular values of E˜ are at least the reciprocals of the corresponding singular values of E. Therefore, the MSE for linear reconstruction is at least σ^2 ∑_{i=1}^n σ_i^{-2}, where σ_i, i = 1, 2, ..., n, are the singular values of E. Since all singular values of E are between a and b, we conclude that

    σ^2 n / b^2 ≤ σ^2 ∑_{i=1}^n σ_i^{-2} ≤ MSE.

The minimum of the MSE equals σ^2 ∑_{i=1}^n σ_i^{-2}, which is attained when E˜ annihilates the orthogonal complement of R(E). This case corresponds to using the Moore-Penrose pseudoinverse.

If E is a tight frame with frame bounds a = b = m, then all singular values of E equal √m. In this case,

    MSE = σ^2 n / m.

In particular, if the frame size m is large, we expect the MSE to be small. Therefore, the more redundant a tight frame is, the larger the portion of the noise that can be alleviated by reconstructing from the frame coefficients. Recall that this statement holds for i.i.d. noise with mean zero and variance σ^2.

1.2 Compressed sensing

Consider a linear system b = Ax*, where A ∈ R^{m×N}, and consider the following problem: reconstruct x* from its measurements b in a computationally feasible way. In general, the least number of measurements that can provide a unique solution of Ax = b is N.
Indeed, if m < N, then the linear system has either zero or infinitely many solutions, which, in general, makes the reconstruction of x* impossible. If m ≥ N and A has full column rank, then there is a unique solution to Ax = b, and therefore this solution equals x*.

Note that additional information about the desired vector can potentially reduce the number of measurements m required to guarantee accurate reconstruction. In the two following real-world settings, such information is the sparsity of the desired solution.

Setting 1: Signal processing. Suppose that we want to reconstruct a signal from its measurements. Important classes of real-world signals are known to possess a certain structure: they are sparse or almost sparse in a particular basis or frame representation, e.g., in a wavelet basis, wavelet frames, curvelet frames, or Gabor frames. Therefore, we seek a sparse solution of the linear system Ax = b, which is the problem investigated by compressed sensing [28, 30, 33, 53]. Specifically, given A and b, we want to ensure that there is a unique sparse solution of Ax = b and that there is a computationally tractable method to reconstruct it. See, e.g., [4, 100, 109, 111, 160] for a non-exhaustive list of applications of compressed sensing in signal processing.

Setting 2: Machine learning. Suppose that we want to analyze the dependency between a list of features x ∈ R^N and the collected data b that they predict (see, e.g., [3]). The features may be dependent on each other, and one expects a linear dependency between features and data, i.e., there is a matrix A such that Ax = b. Suppose that the goal is to select as few features as possible to explain the data fully. In machine learning, this setup is known as feature selection in a linear model (see, e.g., [63, 118, 134, 145, 158, 166] and many others). Mathematically, the goal is as follows: among all solutions of Ax = b, find the solution x* that has as many zero entries as possible, or at most s non-zero entries if the sparsity s is given.
If the feature selection and the training of the linear model happen simultaneously, we end up with the same setup as above.

Both settings above aim to recover a sparse vector from its linear measurements. If x has at most s non-zero entries, it is called s-sparse. The set of s-sparse signals in R^N is denoted by Σ^N_s. The support of x is defined as the index set of all its non-zero entries. In this thesis, unless otherwise specified, we assume that s ≪ N.

The rest of the chapter is organized as follows. Section 1.2.1 provides and discusses two conditions that guarantee the accurate recovery of sparse signals from linear systems. Section 1.2.2 discusses five compressed sensing algorithms that are revisited later in this thesis.

1.2.1 Theoretical reconstruction guarantees

Consider the system of linear equations Ax = b, where A ∈ R^{m×N}, m < N. Recall that if the system is consistent, then it has infinitely many solutions. In this section, we consider two conditions, each of which implies that there is at most one sparse solution of Ax = b. Note that this list of conditions implying the uniqueness of sparse recovery is not exhaustive; see, e.g., coherence [65, chapter 5].

Null space property (NSP)

We start with a straightforward argument (as in [65, chapter 2]) that guarantees that the linear system does not have two s-sparse solutions: if Ax = b has two s-sparse solutions, then there is a non-zero 2s-sparse vector in the null space of A (denoted by N(A)). Note that the converse is also valid: if there is a non-zero 2s-sparse signal in the null space of A, then there exist s-sparse signals x₁ ≠ x₂ such that Ax₁ = Ax₂. Therefore, there is at most one s-sparse solution of Ax = b if and only if the null space of A and the set of 2s-sparse signals intersect at the origin only (see, e.g., [65, p. 49]).

Definition 1.2.1 (The null space property (NSP)).
A matrix A satisfies the null space property of order s with constant γ if for every vector x ∈ N(A), x ≠ 0, and for every support set S with #S ≤ s,

    ‖x_S‖₁ < γ‖x_{S^c}‖₁.

The NSP of order s implies that there is no non-zero s-sparse vector in the null space of A. Specifically, if A satisfies the NSP of order s with constant γ ≤ 1, then Ax = b has a unique s-sparse solution (see, e.g., [65, p. 79]). Moreover, if all s-sparse signals can be recovered exactly from their linear measurements, then A satisfies the NSP of order s (see, e.g., [2]). Therefore, the NSP is a necessary and sufficient condition for accurate sparse recovery.

As a downside, certifying that a given matrix A satisfies the NSP is an NP-hard problem [147]. The next subsection reviews another approach to guaranteeing the successful reconstruction of sparse signals from their linear measurements.

Restricted Isometry Property (RIP)

Recall that accurate recovery of s-sparse vectors from their underdetermined measurements requires N(A) ∩ Σ^N_{2s} = {0}. Equivalently, for all support sets S with #S = 2s and all vectors x supported on S, we require ‖Ax‖ = ‖A_S x_S‖ > 0. Note that A_S ∈ R^{m×2s} is a "tall" matrix, and many tall matrices approximately preserve the norms of vectors in high dimensions (see a rigorous statement below). Therefore, one may consider matrices A such that, for each support set S, A_S is an approximate isometry. Such a condition on A is called the restricted isometry property (RIP) (introduced in [32]) and is formally defined as follows. A matrix A ∈ R^{m×N} is said to satisfy the RIP of order s with constant δ = δ_s < 1 if for every s-sparse vector x,

    (1 − δ)‖x‖² ≤ ‖Ax‖² ≤ (1 + δ)‖x‖².    (1.1)

In particular, if A satisfies the RIP of order s, then there are no non-zero s-sparse vectors in the null space of A.
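For tiny dimensions, the RIP constant of order s can be computed by brute force via the Gram matrices of all column submatrices (the sketch below, including all sizes and the Gaussian model, is my own toy example; it is infeasible beyond very small N):

```python
import itertools
import numpy as np

def rip_constant(A, s):
    """Exact RIP constant of order s by exhaustive search over supports:
    delta_s = max over |S| = s of the spectral norm of I - A_S^T A_S.
    Only feasible for small N, since there are C(N, s) supports."""
    N = A.shape[1]
    return max(
        np.linalg.norm(np.eye(s) - A[:, list(S)].T @ A[:, list(S)], 2)
        for S in itertools.combinations(range(N), s)
    )

rng = np.random.default_rng(1)
m, N, s = 60, 12, 3
A = rng.standard_normal((m, N)) / np.sqrt(m)  # columns have unit norm in expectation
delta = rip_constant(A, s)
print(delta)  # strictly below 1 for this oversampled Gaussian example
```

The brute-force search over all C(N, s) supports is exactly why verifying the RIP is computationally intractable for realistic sizes.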
Note that for every S ⊂ [N] with #S = s, the RIP constant δ = δ_s satisfies

    ‖I − A_S^T A_S‖ ≤ δ_s,    (1.2)

where A_S is the submatrix of A obtained by keeping only the columns of A whose indices are in S.

The application of the RIP to guarantee the success of sparse recovery problems was first introduced by E. Candes, D. Donoho, J. Romberg, and T. Tao [28, 30, 33, 53] with huge success. In what follows, we review only a tiny portion of the extremely rich literature on compressed sensing problems. For a more detailed review of theoretical findings, please refer to the books and surveys [18, 56, 65, 102, 148], among many others.

Note that the RIP implies the NSP [44]. Suppose that A satisfies the RIP of order 2s with constant δ_{2s}. Then A satisfies the NSP of order s with any constant γ > 2δ_{2s}/((1 − δ_{2s})√s) [65, p. 142-143]. Therefore, if δ_{2s} < √s/(2 + √s), then there is at most one s-sparse solution of Ax = b.

Even though the RIP is a deterministic condition, verifying the RIP becomes computationally intractable as the size of the matrix grows [9, 147]. Fortunately, broad classes of random matrices are known to satisfy the RIP with overwhelming probability (see Theorem 1.3.6).

On the separation of the set of sparse signals from the null space

Note that the RIP separates the null space of the measurement matrix from the set of sparse signals (except the origin). In this section, we prove that Σ^N_s is still rather close to N(A) under the following assumptions: (1) A ∈ R^{m×N} satisfies the RIP, and (2) m ≪ N. Note that both the null space of a matrix and the set of sparse signals are cones with vertex at the origin, and the angle between the cones may be used as a measure of distance.

Consider an s-sparse signal x and a matrix A that satisfies the RIP of order s with constant δ. Denote the projection of x onto the null space of A by x_N and the projection of x onto the range of A^T (the row span of A) by x_R.

Theorem 1.2.2. Let A ∈ R^{m×N} be a matrix that satisfies the RIP of order s with constant δ.
Denote the smallest and the largest singular values of A^T by σ_min and σ_max, respectively. Then, for every x ∈ Σ^N_s, the sine of the angle between x and the null space of A is between √(1 − δ)/σ_max and √(1 + δ)/σ_min.

Remark 1.2.3. Theorem 1.3.4 provides a bound on the singular values if the rows of A are i.i.d. sub-Gaussian. If A satisfies the RIP, the typical size of the smallest and the largest singular values of A is Ω(√(N/m)). Since δ ∈ (0, 1), Theorem 1.2.2 implies that the sine of the angle between N(A) and Σ^N_s is close to zero but bounded away from zero. Using that sin(x) ≈ x for small angles, we conclude that the angle is approximately between

    c₁√(1 − δ)√(m/N)  and  c₂√(1 + δ)√(m/N),

where c₁ and c₂ are constants.

Proof. Let x ∈ Σ^N_s. The cosine of the angle between x and the null space equals ‖x_N‖/‖x‖, where x_N is the projection of x onto N(A). Note that N(A) and R(A^T) are orthogonal, and their sum equals R^N. Therefore, the sine of the angle between x and N(A) is ‖x_R‖/‖x‖. Also, note that

    x = x_N + x_R.    (1.3)

Multiplying both sides of (1.3) by A implies that Ax = Ax_N + Ax_R = Ax_R. Therefore, the norms of both sides of the equality are equal. Using the definition of the singular values and the RIP,

    (1 − δ)‖x‖² ≤ ‖Ax‖² = ‖Ax_R‖² ≤ σ²_max ‖x_R‖².

Similarly, one can establish the other bound between ‖x‖² and ‖x_R‖². Here we use that A†A is the projection onto R(A^T), that x_R ∈ R(A^T), and that the largest singular value of A† is σ_min^{−1}:

    (1 + δ)‖x‖² ≥ ‖Ax‖² = ‖Ax_R‖² ≥ σ²_min ‖A†Ax_R‖² = σ²_min ‖x_R‖².

Therefore,

    ((1 − δ)/σ²_max)‖x‖² ≤ ‖x_R‖² ≤ ((1 + δ)/σ²_min)‖x‖².    (1.4)

We conclude that for each sparse signal x, the sine of the angle between x and the null space of A is between √(1 − δ)/σ_max and √(1 + δ)/σ_min. Therefore, the angle between the cone of sparse signals and the null space is also between these two quantities.

Remark 1.2.4. The last inequality proves that for every sparse signal x, there is a vector x_N in the null space that is rather close to x.
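This closeness can be checked numerically. The sketch below (a Gaussian matrix scaled by 1/√m; all sizes are my own choices) computes the sine of the angle between a random sparse vector and N(A) via the orthogonal decomposition x = x_N + x_R:

```python
import numpy as np

rng = np.random.default_rng(2)
m, N, s = 50, 200, 5

A = rng.standard_normal((m, N)) / np.sqrt(m)
x = np.zeros(N)
x[rng.choice(N, size=s, replace=False)] = rng.standard_normal(s)

# A^dagger A is the orthogonal projector onto R(A^T), so x splits into
# x = x_null + x_row with x_null in N(A) and x_row in R(A^T).
x_row = np.linalg.pinv(A) @ (A @ x)
x_null = x - x_row

sin_angle = np.linalg.norm(x_row) / np.linalg.norm(x)
print(sin_angle)  # small (x is close to the null space) yet non-zero
```

For m ≪ N the printed sine is of order √(m/N), in line with Remark 1.2.3.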
Therefore, the RIP implies the following: all sparse vectors are separated from the null space, but the angle between each sparse vector and the null space is small.

1.2.2 Reconstruction algorithms: overview

This section presents a selection of reconstruction algorithms in compressed sensing. Note that this survey is far from exhaustive: many methods such as orthogonal matching pursuit (OMP), CoSaMP, FISTA, HTP, and many others are left out to keep the discussion concise. Please refer to [114] for a review and classification of sparse recovery methods and [7] for a performance comparison. Kaczmarz-based algorithms for sparse recovery are reviewed and discussed in Section 2.2 and Chapter 3.

Let A be an m × N measurement matrix where m ≪ N, let x* be an s-sparse signal in R^N, and let b = Ax* be the measurements. We aim to recover x* from its measurements b when s is sufficiently small.

Straightforward approach: the sparsest solution

Consider the most intuitive way of finding the sparsest solution: among all solutions of Ax = b, select the sparsest one. Here ‖x‖₀ denotes the number of non-zero entries of x.

Algorithm 1.1 ℓ₀ minimization
1: Inputs: A, b
2: Outputs: x_{ℓ₀} := argmin{‖x‖₀ subject to Ax = b}, solved by brute force.

Note that this notation is similar to the ℓ_p-norm notation, but ‖·‖₀ is not a norm. Moreover, ‖·‖₀ is not a convex function, and, therefore, convex optimization is not applicable to this problem. Here one can use the following brute-force method. Consider s = 1, 2, ..., and consider all potential supports S such that #S = s. When the support is fixed, solving Ax = b reduces to A_S x_S = b, where A_S is an m × s matrix. Since it suffices to consider #S ≤ m, inverting A_S depends on m and #S only, not on N. If there is an s-sparse solution, the algorithm may be stopped.

The underlying problem is NP-hard (see, e.g., p. 55 in [65] for the proof).
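Before quantifying that cost, here is the brute-force search written out (the problem instance, including all sizes, is my own toy example):

```python
import itertools
import numpy as np

def l0_min(A, b, max_s, tol=1e-10):
    """Among all supports of size 1, 2, ..., max_s, return the first x with
    A x = b; this is the brute-force search described in the text."""
    m, N = A.shape
    for s in range(1, max_s + 1):
        for S in itertools.combinations(range(N), s):
            cols = list(S)
            xS, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
            if np.linalg.norm(A[:, cols] @ xS - b) < tol:
                x = np.zeros(N)
                x[cols] = xS
                return x
    return None

rng = np.random.default_rng(3)
A = rng.standard_normal((10, 20))
x_true = np.zeros(20)
x_true[[3, 11]] = [1.5, -2.0]
x_hat = l0_min(A, A @ x_true, max_s=3)
print(np.allclose(x_hat, x_true))  # True: generically, no sparser solution exists
```

Even on this toy instance the search touches every support of each size, which is what makes the approach collapse as N grows.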
Indeed, even if the true sparsity s is known or estimated before the reconstruction process, there are (N choose s) ~ N^s potential supports. Therefore, the straightforward algorithm above is slow in general and computationally infeasible.

Projected back projection (PBP)

Let H_t be the "hard-thresholding operator" acting on real-valued vectors that keeps the t largest (in magnitude) entries of its argument unchanged and sets the other entries to zero. Note that H_t is the projection onto the set of t-sparse signals.

Algorithm 1.2 The projected back projection (PBP)
1: Inputs: A, b, ρs (estimated sparsity)
2: Outputs: x_PBP = H_{ρs}(A^T b)

Consider the reconstruction x_PBP obtained by the projected back projection method shown in Algorithm 1.2. If A satisfies the RIP of order 3s with constant δ_{3s}, then (see, e.g., [65, p. 149])

    ‖x_PBP − x*‖ ≤ 2δ_{3s}‖x*‖.

The PBP algorithm requires δ_{3s} to be very small to ensure a good reconstruction error, and, therefore, it does not provide good recovery guarantees in many settings. The signal reconstructed via the PBP may be used as a warm start for a more sophisticated iterative algorithm. Note that Algorithm 1.3 introduced below uses the PBP as its first iteration.

Iterative Hard Thresholding (IHT)

As before, we want to reconstruct the sparse solution of Ax = b from A and b. If the sparsity constraint is disregarded, one may consider the functional f(x) = (1/2)‖Ax − b‖² and run the gradient descent algorithm given by

    x^{k+1} = x^k − λA^T(Ax^k − b).

Note that gradient descent does not promote sparsity directly; sparsity can be promoted by projecting x^k onto the set of sparse signals.
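In code, the hard-thresholding projection, the one-shot PBP estimate, and one such projected gradient iteration look as follows (the instance, sizes, and iteration count are my own toy choices):

```python
import numpy as np

def hard_threshold(x, t):
    """H_t: keep the t largest-magnitude entries of x, zero out the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-t:]
    out[keep] = x[keep]
    return out

rng = np.random.default_rng(4)
m, N, s = 40, 100, 4
A = rng.standard_normal((m, N)) / np.sqrt(m)
x_true = np.zeros(N)
x_true[rng.choice(N, size=s, replace=False)] = rng.standard_normal(s)
b = A @ x_true

# One-shot projected back projection (Algorithm 1.2) ...
x_pbp = hard_threshold(A.T @ b, s)

# ... and the same projection applied after every gradient step.
x = np.zeros(N)
for _ in range(100):
    x = hard_threshold(x - A.T @ (A @ x - b), s)

print(np.linalg.norm(x_pbp - x_true), np.linalg.norm(x - x_true))
```

Iterating typically improves markedly on the one-shot estimate, although, as discussed next, convergence is only guaranteed under RIP-type conditions.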
This leads us to the algorithm known as iterative hard thresholding (IHT), summarized below as Algorithm 1.3.

Algorithm 1.3 Iterative hard thresholding (IHT)
1: Inputs: A, b, ρs (estimated sparsity), and stepsize λ > 0.
2: Initialize: x⁰ = (0, 0, 0, ..., 0)^T.
3: For k = 0, 1, 2, ...: x^{k+1} = H_{ρs}(x^k − λA^T(Ax^k − b)).

Algorithm 1.3 was introduced by Blumensath and Davies [15] and gained popularity for its simplicity and effectiveness. Note that among all modifications and extensions of the IHT algorithm, we limit this section to Algorithm 1.3 (see, e.g., [17, 66, 115, 162, 165] for a non-exhaustive list of recent modifications of the IHT algorithm).

[15] showed that for a full row-rank matrix A such that ‖√λ Ax‖ < ‖x‖ for all x ∈ Σ^N_{2s}, IHT converges to a local minimum of f(x) = ‖Ax − b‖; this local minimum is sparse. If A satisfies the RIP of order 2s, then one may take λ = (1 + δ_{2s})^{−1} to ensure convergence. Moreover, in this case, the convergence is linear [70].

Note that in the compressed sensing community, A is usually assumed to satisfy the RIP, and λ = 1. We provide a brief overview of convergence results below. Let s be the sparsity of x* and ρs the accepted or estimated sparsity of the reconstructed signal. Then Algorithm 1.3 is run with ρs as the sparsity level, ρ ≥ 1. Let b = Ax* + ε and suppose that N > s + 2ρs. If A is assumed to satisfy the RIP of order (2ρ + 1)s with constant δ_{(2ρ+1)s} ≤ 1/√(2ν), then [136]

    ‖x^k − x*‖ ≤ √(2ν) δ_{(2ρ+1)s} ‖x^{k−1} − x*‖ + (1 + δ_{(2ρ+1)s}) √ν ‖ε‖.

Here ν = 1 + (ρ^{−1} + √((4 + ρ^{−1})ρ^{−1}))/2. Note that if ρ = 1, then √ν ≈ 1.618 and, therefore, we require δ_{3s} ≤ 1/√(2ν) ≈ 0.437. If ρ = 2, then ν = 2 and we require δ_{5s} ≤ 1/2. The proof in [136] tightens the well-known result of [16], which considers only ρ = 1 and δ_{3s} < 1/√8.

The convergence may also be stated in terms of the RIP constant δ_{2ρs}. Suppose that A satisfies the RIP of order 2ρs with constant δ_{2ρs}. If λ = 1/(1 + δ_{2ρs}) and δ_{2ρs} < 1/3, then Algorithm 1.3 converges linearly with the rate 1/(2δ_{2ρs}) − 1/2 [70].
If λ = 1, then it suffices to require δ_{2ρs} < 1/4 [64] to guarantee linear convergence.

Note that the iterative hard thresholding algorithm is suitable for big data because, at each iteration, the reconstructed signal is ρs-sparse, which is easy to store and process. As a downside, it is a somewhat restrictive method: at each iteration, the IHT algorithm wipes out even rather large entries if they are not among the top ρs entries. Note that for many sparse recovery problems, the conditions for convergence are not met, and, in this case, Algorithm 1.3 may also diverge (see, e.g., [7] for numerical experiments).

Basis Pursuit Denoising (BPDN)

Recall the ℓ₀-minimization algorithm (Algorithm 1.1). Smoothing the ℓ₀-"norm" leads to the following approach:

    argmin ‖x‖_p subject to Ax = b.

Note that for p < 1, the ℓ_p "norm" is not convex (and ‖·‖_p is not a norm but a quasi-norm), while p > 1 does not promote sparsity even for 1-sparse vectors. Therefore, consider the following algorithm with p = 1.

Algorithm 1.4 Basis Pursuit Denoising (BPDN)
1: Inputs: A, b, noise level/relaxation parameter γ ≥ 0.
2: Outputs: x_BP = argmin ‖x‖₁ subject to ‖Ax − b‖ ≤ γ.

Algorithm 1.4 was originally proposed by Chen, Donoho, and Saunders [37] and gained incredible popularity (see, e.g., [13, 28, 31, 33, 54, 67, 71]). As mentioned before, the new function to be minimized is convex; moreover, it is subdifferentiable. When A satisfies the NSP of order s, Algorithm 1.4 is stable and robust (see, e.g., [2]). Note that Algorithm 1.4 is a convex program because it minimizes a convex function over a convex set.

Adding an ℓ₂ regularization term to the BPDN objective yields the regularized basis pursuit, or elastic net; see, e.g., [137]. The regularized basis pursuit considers the following minimization problem:

    x_{L1,2} = argmin λ‖x‖₁ + (1/2)‖x‖₂² subject to Ax = b.    (1.5)

Here λ is a hyperparameter of the algorithm. Note that for large enough λ, the solution of the regularized basis pursuit coincides with the solution of the BPDN [68].
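For γ = 0, Algorithm 1.4 reduces to basis pursuit, which can be recast as a linear program by splitting x = u − v with u, v ≥ 0. A sketch using scipy (the problem instance is my own toy example):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """min ||x||_1 subject to Ax = b, as the LP: min sum(u + v)
    subject to A(u - v) = b, u >= 0, v >= 0, with x = u - v."""
    m, N = A.shape
    c = np.ones(2 * N)
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=b,
                  bounds=[(0, None)] * (2 * N), method="highs")
    u, v = res.x[:N], res.x[N:]
    return u - v

rng = np.random.default_rng(5)
A = rng.standard_normal((15, 30))
x_true = np.zeros(30)
x_true[[2, 17, 24]] = [1.0, -0.5, 2.0]
x_hat = basis_pursuit(A, A @ x_true)
print(np.linalg.norm(x_hat - x_true))  # typically near zero at these sizes
```

The LP returns a feasible point with minimal ℓ₁-norm, so ‖x̂‖₁ never exceeds ‖x*‖₁; exact recovery of x* itself holds with high probability in this regime but is not guaranteed for every draw.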
In particular, if λ ≥ 10‖x*‖_∞ and the measurement matrix satisfies the RIP of order 2s with constant δ_{2s} < 0.4404, then the solution of (1.5) matches the underlying sparse solution [92]. Moreover, the algorithm is stable and robust [137].

Iteratively Reweighted Least Squares (IRLS) for sparse recovery

The BPDN algorithm, discussed in the previous section, requires solving the ℓ₁-norm minimization problem to obtain the sparse solution of the linear system Ax = b. The ℓ₁-norm is not a differentiable function. Thus, derivative-based optimization methods may not be applicable to this problem, or at least require modifications using the subgradient, which, in turn, also decreases the convergence rate. Therefore, replacing the ℓ₁-norm in the objective function with the ℓ₂-norm could increase the convergence speed of the algorithm. However, a direct replacement of the ℓ₁-norm by the ℓ₂-norm does not provide a sparse solution: as discussed in Section 1.2.2, the ℓ₁-norm promotes sparsity while the ℓ₂-norm does not.

In this subsection, we consider weighted ℓ₂-minimization for sparse recovery, where weights are used to promote sparsity of the output of the minimization problem. Consider a weight vector w^k based on the current iterate x^k such that each weight roughly depends on the likelihood that the given entry is part of the true support. Suppose that instead of minimizing ‖x‖₁, one minimizes ‖w^k ⊙ x‖ subject to some conditions, where ⊙ denotes the element-wise product. In this case, entries of x with large weights are promoted to be small in the solution of the minimization problem. Therefore, if an entry is believed to be outside the true support, a large weight should be assigned to it.
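The core step, minimizing ‖w ⊙ x‖₂ subject to Ax = b, has a closed form via Lagrange multipliers: with D = diag(w^{−2}), the minimizer is x = DA^T(ADA^T)^{−1}b. A sketch of this step (the weights and the problem instance, including the hypothetical "entries 0-3 are the support" belief, are my own toy choices):

```python
import numpy as np

def weighted_l2_min(A, b, w):
    """argmin ||w * x||_2 subject to A x = b, via the Lagrange closed form
    x = D A^T (A D A^T)^{-1} b with D = diag(w^{-2})."""
    D = np.diag(w ** -2.0)
    return D @ A.T @ np.linalg.solve(A @ D @ A.T, b)

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 12))
x_true = np.zeros(12)
x_true[:4] = rng.standard_normal(4)  # true support: entries 0-3
b = A @ x_true

# Large weights on entries believed to be off-support push those entries
# toward zero in the constrained minimizer.
w = np.ones(12)
w[4:] = 1000.0
x_weighted = weighted_l2_min(A, b, w)
print(np.abs(x_weighted[4:]).max())  # tiny: large weights suppress these entries
```

With uniform weights the same formula returns the minimum-norm (pseudoinverse) solution, which is generally dense; the weighting is what steers the ℓ₂ minimizer toward sparsity.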
The iteratively reweighted least squares (IRLS) algorithm (Algorithm 1.5) implements this idea.

Algorithm 1.5 Iteratively reweighted least squares (IRLS)
1: Inputs: A, b, ρs (estimated sparsity).
2: Initialize: w⁰ = (1, 1, 1, ..., 1)^T, x⁰ = (0, 0, 0, ..., 0)^T, k = 1.
3: For k = 0, 1, 2, ...: Given w^k, find the minimizer of ‖x‖_{ℓ₂,w^k} subject to Ax = b, i.e., solve

    x^{k+1} = argmin_{x : Ax=b} ‖x‖_{ℓ₂,w^k} = argmin_{x : Ax=b} ‖w^k ⊙ x‖₂.    (1.6)

Let

    ε_{k+1} = min{ε_k, |x^{k+1}_{(ρs+1)}| / N},

where x^{k+1}_{(i)} denotes the ith largest (in magnitude) entry of x^{k+1}. Then, update the weight vector as follows:

    w^{k+1}_i = ((x^{k+1}_i)² + ε²_{k+1})^{−1/4}.

4: Stop: when ε_k = 0.

Note that IRLS-type algorithms were originally developed to solve ℓ_p-minimization problems, 0 < p ≤ ∞ (see, e.g., [73, 93, 126]); therefore, several other algorithms share the name IRLS. In this section, we focus on the IRLS algorithm for sparse recovery as given in Algorithm 1.5 and its modifications.

Suppose that A ∈ R^{m×N} is a matrix with full row rank, let x* be an s-sparse signal, and let b = Ax* be the measurements. If A satisfies the NSP of order ρs with NSP constant γ < 1 and s < ρs − 2γ/(1 − γ), Algorithm 1.5 recovers x* from b exactly [48]. Moreover, if γ is small enough, for example, if γ(1 + γ) < 2/3, then the convergence is asymptotically linear (with respect to the iteration number k) [48], and globally linear under some stronger conditions [8]. For all γ ∈ (0, 1), local linear convergence holds. The proof considers the functional J(x, w, ε) = ‖w ⊙ x‖² + ε²‖w‖² + ‖w^{−1}‖² and performs alternating minimization with respect to x and w. Here w^{−1} is the element-wise reciprocal of w.

The IRLS iteration may also be considered [8] as an iterative solution (with respect to x and ε) of

    min{ ‖x‖₁ − ε ∑_{i=1}^N ln(ε + |x_i|) subject to ‖b − Ax‖ ≤ η },

where η corresponds to the noise level. In the noiseless scenario above, we let η = 0. [8] extended Algorithm 1.5 to ℓ_p-minimization with 0 < p ≤ 1 by considering the weight updating scheme w^{k+1}_i = ((x^{k+1}_i)² + ε²_{k+1})^{−1/2 + p/4}.
If, in addition, ε_k is a constant independent of k and x^k, then the IRLS converges linearly for 0 < p ≤ 1 in the noisy setting [8].

[6] proves that Algorithm 1.5 may fail for ρs − 2γ/(1 − γ) ≤ s ≤ ρs and establishes the convergence of a modified algorithm with ε_{k+1} = min{ε_k, t|x^{k+1}_{(ρs+1)}|/N} for a fixed 0 < t ≤ 1/4.

The approach of Algorithm 1.5 may be extended to the noisy setting and to ℓ_p-minimization by modifying the loss function to f(x) = ‖Ax − b‖² + ∑_{i=1}^N λ_i|x_i|^p, where 1 ≤ p ≤ 2. [152] and [47, p. 391-412] proposed a surrogate functional that replaces J in the alternating minimization; this scheme is guaranteed to converge. Algorithm 1.5 can also be modified to recover the sparse solution under constraints A_i x + b_i ∈ C_i, where each C_i is convex [26]. Note that this case includes the noisy setting.

Other notable modifications of the algorithm are based on slime mold dynamics [142], [58]. These modified algorithms converge linearly; the algorithm in [58] also provides a certificate of infeasibility if no sparse solution exists.

1.3 Probability in high dimensions

Recently, more and more data are being acquired, analyzed, and processed. Big data require algorithms that are as computation- and memory-efficient as possible. For example, if we consider a frame E ∈ R^{m×n} where both dimensions m and n are large, then estimating the frame bounds or singular values using classical methods like the SVD may be time-consuming. Also, the RIP constant of a "wide" matrix A ∈ R^{m×N} may be computed using the singular values of all m × s submatrices of A. As we discuss below, the singular values of a broad class of random matrices can be bounded with small probabilistic error, and, therefore, these matrices are of interest in the contexts of both frame theory and compressed sensing. This section defines and discusses basic properties of sub-Gaussian and sub-exponential random variables and their sums, and of sub-Gaussian and sub-exponential random vectors and matrices.

The rest of the section is structured as follows.
Section 1.3.1 introduces sub-exponential and sub-Gaussian random variables along with their basic properties. Section 1.3.2 investigates non-asymptotic versions of the Central Limit Theorem and focuses on the concentration properties of sums of independent sub-exponential and sub-Gaussian random variables. Section 1.3.3 deals with the multi-dimensional setting: it provides instrumental definitions for random vectors, which are then used for bounding singular values and for estimating the RIP constants of a broad class of "wide" random matrices.

1.3.1 Sub-exponential and sub-Gaussian random variables

Most of the results below hold with "overwhelming" or "high" probability, and, therefore, we start this section with definitions. We say that an event X = X(n) holds with overwhelming probability (with respect to n) if there are absolute constants c₁ > 0 and c₂ > 0 such that

    P(X) ≥ 1 − c₁ exp(−c₂ n).

In other words, X occurs with overwhelming probability if the probability of failure is exponentially small. We say that an event X = X(n) holds with high probability (with respect to n) if there are absolute constants c₁ > 0 and c₂ > 0 such that

    P(X) ≥ 1 − c₁ exp(−c₂ n²).

Note that if an event holds with high probability, then it holds with overwhelming probability.

We call a random variable X sub-Gaussian if there exist constants c₁ > 0 and c₂ > 0 such that for every t > 0,

    P(|X| ≥ t) ≤ c₁ exp(−c₂ t²).

Note that all bounded random variables are sub-Gaussian, and so are Gaussian random variables. The following statements are equivalent [149, p. 22].

1. X is a sub-Gaussian random variable.
2. There exists c₃ > 0 such that for all p ≥ 1, (E|X|^p)^{1/p} ≤ c₃√p.
3. There exists c₄ > 0 such that for every |t| ≤ 1/c₄, E exp(t²X²) ≤ exp(c₄²t²).
4.
There exists c₅ > 0 such that the moment generating function E exp(X²/t²) is finite at t = c₅.

The sub-Gaussian random variables form a Banach space [25] with the norm

    ‖z‖_{ψ₂} = sup_{p≥1} p^{−1/2} (E|z|^p)^{1/p}.

This norm is called the sub-Gaussian norm or ψ₂-norm.

Another class of random variables considered here is the sub-exponential random variables. For example, if X is a sub-Gaussian random variable, then X² is not necessarily sub-Gaussian, but it is sub-exponential. X is said to be sub-exponential if there exist constants c₁ > 0 and c₂ > 0 such that for every t > 0,

    P(|X| ≥ t) ≤ c₁ exp(−c₂ t).

The following statements are equivalent [149, p. 29].

1. X is a sub-exponential random variable.
2. There exists c₃ > 0 such that for all p ≥ 1, (E|X|^p)^{1/p} ≤ c₃ p.
3. There exists c₄ > 0 such that for every 0 ≤ t ≤ 1/c₄, E exp(t|X|) ≤ exp(c₄ t).
4. There exists c₅ > 0 such that the moment generating function E exp(|X|/t) is finite at t = c₅.

The space of all sub-exponential random variables is a Banach space equipped with the norm [149, p. 33]

    ‖X‖_{ψ₁} = sup_{p≥1} p^{−1} (E|X|^p)^{1/p}.

Similarly to the sub-Gaussian norm, ‖·‖_{ψ₁} is called the sub-exponential norm or ψ₁-norm.

Note that sub-Gaussian and sub-exponential random variables are well-concentrated. We exploit this property in the following sections.

1.3.2 Concentration of the sum of independent random variables

Recall the Lindeberg-Levy Central Limit Theorem (CLT): for centered i.i.d. random variables X₁, X₂, ..., X_N with finite variance σ², the following sum converges in distribution:

    (X₁ + X₂ + ... + X_N) / (√N σ) → N(0, 1).

The CLT does not specify how quickly the convergence occurs; the theorem provides the asymptotic distribution only. This section reviews some of the literature on the non-asymptotic convergence rate of a sum of independent mean-zero random variables. The following theorem establishes the non-asymptotic convergence rate for a sum of random variables to a standard Gaussian random variable.

Theorem 1.3.1 (Berry-Esseen theorem, see, e.g., [138]).
Let X_i, i = 1, 2, ..., N, be independent random variables such that EX_i = 0, EX_i² = σ_i² > 0, and E|X_i|³ ≤ ρ_i < ∞. Consider the distribution function of

    (X₁ + X₂ + ... + X_N) / √(σ₁² + σ₂² + ... + σ_N²),

denoted by Φ_N, and the distribution function of the standard Gaussian random variable, denoted by Ψ. Then, there exists a constant c₀ such that

    sup_{t∈R} |Φ_N(t) − Ψ(t)| ≤ c₀ (∑_{i=1}^N ρ_i) / (∑_{i=1}^N σ_i²)^{3/2}.

Note that the standard Gaussian distribution is well-concentrated around the origin. Indeed, the Mills inequality (see, e.g., [72, 116, 159]) for a standard Gaussian random variable X states that for all t > 0,

    P(|X| ≥ t) ≤ (2 / (t√(2π))) e^{−t²/2}.

Note that this bound on the probability decays fast as t grows. For example, if t = 2, the bound is 0.05399, and if t = 3, the bound is 0.00295. Therefore, the CLT implies that, under broad conditions, the limit of (X₁ + X₂ + ... + X_N)/(√N σ) is well-concentrated. The concentration can be quantified non-asymptotically using the following theorems for sub-Gaussian and sub-exponential random variables, respectively.

Theorem 1.3.2 (Hoeffding inequality, see, e.g., [149, p. 27]). Let X₁, X₂, ..., X_N be independent centered sub-Gaussian random variables, and let K = max_i ‖X_i‖_{ψ₂}. Then for every a = (a₁, a₂, ..., a_N) and every t ≥ 0, we have

    P(|∑_{i=1}^N a_i X_i| ≥ t) ≤ 2 exp(−ct² / (K²‖a‖²)),

where c > 0 is an absolute constant. In particular, when a_i = 1/N for all i = 1, 2, ..., N,

    P(|(1/N) ∑_{i=1}^N X_i| ≥ t) ≤ 2 exp(−ct²N / K²).

Theorem 1.3.3 (Bernstein's inequality, see, e.g., [149, p. 33]). Let X₁, X₂, ..., X_N be independent centered sub-exponential random variables such that max_i ‖X_i‖_{ψ₁} ≤ K. Fix a = (a₁, a₂, ..., a_N). Then, there exists an absolute constant c > 0 such that for every t > 0,

    P(|∑_{i=1}^N a_i X_i| > t) ≤ 2 exp(−c min{t² / (K²‖a‖²), t / (K‖a‖_∞)}).

In particular, when a_i = 1/N for all i = 1, 2, ..., N,

    P(|(1/N) ∑_{i=1}^N X_i| > t) ≤ 2 exp(−c min{t²/K², t/K} N).

Note that if t is fixed in the theorems above and a_i = 1/N, then both probabilistic bounds decay exponentially fast as N increases.
Therefore, the average of sub-Gaussian random variables and the average of sub-exponential random variables are well-concentrated.

1.3.3 Random matrices in high dimensions

Random matrices play a crucial role in frame theory and compressed sensing by providing well-balanced frames and matrices that satisfy the RIP with overwhelming probability. This section reviews such results: it first defines isotropic and sub-Gaussian random vectors, then provides bounds on the singular values of matrices with independent isotropic sub-Gaussian random columns or rows, and finally shows that a broad class of random matrices satisfies the RIP with overwhelming probability.

Random vectors

This section provides instrumental definitions for random vectors that are used in the subsequent sections.

A random vector Y ∈ R^n is sub-Gaussian if the one-dimensional marginals ⟨Y, z⟩ are sub-Gaussian for all z ∈ R^n. The sub-Gaussian norm of such a vector Y is defined as

    ‖Y‖_{ψ₂} = sup_{z∈S^{n−1}} ‖⟨Y, z⟩‖_{ψ₂}.

For any sub-Gaussian vector Y = (y₁, y₂, ..., y_n), ‖Y‖_{ψ₂} ≥ max_i ‖y_i‖_{ψ₂} and ‖Y_T‖_{ψ₂} ≤ ‖Y‖_{ψ₂} for any T ⊆ {1, ..., n}. Also, note that if Y has i.i.d. sub-Gaussian entries with sub-Gaussian norm K, then ‖Y‖_{ψ₂} ≤ CK, where C is an absolute constant independent of K and n [149, p. 51].

A vector Y is isotropic if for every y ∈ R^n, E⟨Y, y⟩² = ‖y‖². Equivalently, Y is isotropic if and only if EYY^T = I_n. Therefore, if Y has i.i.d. entries with mean zero and variance one, then Y is isotropic. A standard Gaussian random vector and a random vector whose entries are i.i.d. ±1 Bernoulli are examples of such vectors. Note that there are many isotropic vectors whose entries are dependent. For example, a random row of the n × n Fourier matrix is isotropic, and so is a vector in R^n drawn uniformly from the sphere of radius √n centered at the origin.
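The isotropy of the sphere example can be checked by simulation. The sketch below (dimensions and sample count are my own choices) estimates E[YY^T] for Y uniform on the sphere of radius √n:

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials = 5, 200000

# Uniform on the sphere of radius sqrt(n): normalize a Gaussian, rescale.
g = rng.standard_normal((trials, n))
Y = np.sqrt(n) * g / np.linalg.norm(g, axis=1, keepdims=True)

cov = Y.T @ Y / trials          # Monte Carlo estimate of E[Y Y^T]
print(np.max(np.abs(cov - np.eye(n))))  # close to zero: Y is isotropic
```

The entries of Y are clearly dependent (they satisfy ‖Y‖² = n exactly), yet the estimated second-moment matrix is the identity up to Monte Carlo error.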
Note that if Y is isotropic, then E‖Y‖² = n.

Singular values

In this section, we consider bounds on the singular values of overdetermined random matrices whose rows or columns are independent. Note that for every matrix A, the non-zero singular values of A are identical to the non-zero singular values of A^T. Therefore, the theorems below bound the singular values of both overdetermined and underdetermined matrices.

Theorem 1.3.4 (see, e.g., [149, p. 91]). Let A be an m × n matrix, m ≥ n, whose rows a_i^T are independent centered isotropic sub-Gaussian random vectors. Let K = max_i ‖a_i‖_{ψ₂}. Then, for any t > 0, with probability at least 1 − 2 exp(−t²),

    √m − CK²(√n + t) ≤ σ_min(A) ≤ σ_max(A) ≤ √m + CK²(√n + t),

where σ_min and σ_max are the smallest and the largest singular values of A, respectively.

Theorem 1.3.5 (see, e.g., [148, p. 245]). Let A be an m × n matrix, m > n, whose columns a_i are independent centered isotropic sub-Gaussian random vectors such that ‖a_i‖² = m almost surely. Let K = max_i ‖a_i‖_{ψ₂}. Then, for any t > 0, with probability at least 1 − 2 exp(−t²),

    √m − C_K(√n + t) ≤ σ_min(A) ≤ σ_max(A) ≤ √m + C_K(√n + t),

where σ_min and σ_max are the smallest and the largest singular values of A, respectively, and C_K is a constant that depends on K only.

If m ≫ n and A ∈ R^{m×n} is as in Theorem 1.3.4 or 1.3.5, then all singular values of A are approximately √m, and, therefore, (1/√m)A is an approximate isometry with overwhelming probability. Also note that if E ∈ R^{m×n}, m ≫ n, is a matrix whose rows are independent isotropic mean-zero random vectors, then E is a frame with frame bounds of order m with overwhelming probability. Therefore, random matrices can be used to generate frames in high dimensions, and, as discussed in the following section, a broad class of random matrices satisfies the RIP.

Random matrices and the RIP

Section 1.2.1 investigates matrices that satisfy the RIP and proves their efficiency for compressed sensing problems. However, verifying that a matrix satisfies the RIP is an NP-hard problem [9].
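Theorem 1.3.4 can be illustrated numerically (Gaussian rows; the sizes are my own choices): for m ≫ n the extreme singular values concentrate around √m.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 2000, 50

A = rng.standard_normal((m, n))  # rows: independent isotropic sub-Gaussian vectors
sv = np.linalg.svd(A, compute_uv=False)
print(sv.min() / np.sqrt(m), sv.max() / np.sqrt(m))  # both close to 1
```

The deviations here are of order √(n/m) ≈ 0.16, consistent with the √m ± CK²(√n + t) bounds; after rescaling by 1/√m, the matrix is an approximate isometry, which is the property the RIP demands of every column submatrix simultaneously.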
In this section, we consider the RIP for random matrices.

Suppose that A ∈ Rm×N is a random matrix. If A has independent isotropic mean-zero sub-Gaussian random rows and m ≫ s, then Theorem 1.3.4 implies that any given m × s submatrix of A is an approximate isometry with overwhelming probability. The RIP, however, requires every m × s submatrix of A to be an approximate isometry.

Theorem 1.3.6 ([148, pp. 254-255]). Suppose that A ∈ Rm×N satisfies one of the two models:

• The rows of A are independent isotropic mean-zero sub-Gaussian random vectors whose sub-Gaussian norm does not exceed K, or
• The columns of A are independent isotropic mean-zero sub-Gaussian random vectors whose norms equal one almost surely and whose sub-Gaussian norms do not exceed K.

Then, there exist constants C and c that depend on K only with the following property: for every s > 0 and δ ∈ (0, 1) such that m ≥ Cδ⁻²s log(eN/s), the matrix (1/√m)A satisfies the RIP of order s with constant δ with probability at least 1 − 2 exp(−cδ²m).

1.4 Notations

This section introduces general notations used throughout the thesis, complementing the brief table in the List of Notations chapter.

Throughout the thesis, signals are usually denoted by x, possibly with a superscript, and other vectors are denoted by lowercase letters. The ith entry of x is denoted by xi. Both n and N denote the dimension of the signal; the latter notation is used to emphasize that the ambient dimension of the signal is large. Besides, we often assume that signals in RN are sparse. s denotes the sparsity of a signal, and ρs denotes the estimated sparsity, where ρ ≥ 1. ‖·‖ refers to the ℓ2-norm, and ‖·‖p refers to the ℓp-norm. The inner product is denoted by 〈·, ·〉, and the entry-wise product of two vectors, say, x and y, is denoted by x ⊙ y.

In this thesis, matrices are denoted by E or A. E is used to emphasize that the matrix is a frame, and A is used for arbitrary matrices and compressed sensing matrices. eᵢᵀ and aᵢᵀ denote the ith rows of E and A, respectively.
eij and aij denote the entry at the intersection of the ith row and the jth column of E and A, respectively. m refers to the total number of measurements or, in matrix notation, the number of rows. Therefore, E is usually an m × n matrix with m ≥ n. A is either a "wide" m × N matrix with m ≪ N or an m × n matrix of arbitrary size. σmin(·) and σmax(·) stand for the smallest and largest singular values of their argument, respectively. To simplify notation, where applicable, the singular values of E and A may be denoted by σmin and σmax.

The thresholding operator Ht keeps the largest (in magnitude) t entries of its input and sets the others to zero. Therefore, for every x, Htx is always t-sparse. If T ⊂ [N], then HT represents the operator that keeps the entries whose indices are in T and zeroes out the other entries. For every T, HTx is #T-sparse. As a complementary notation, xT ∈ Rt is the restriction of x to T when T ⊂ [N] with #T = t. For a matrix A ∈ Rm×N, AT denotes the m × t submatrix of A obtained by restricting A to the columns indexed by T, where #T = t.

Throughout, C, c1, c2, ... indicate constants. k stands for the iteration number, and K is reserved for a bound on the sub-Gaussian norm of a signal.

1.5 Organization of this thesis

In this thesis, we consider the following aspects of the reconstruction of signals from their linear measurements. First, we focus on algorithms for sparse recovery that are efficient in both running time and working memory. Specifically, we consider various modifications of the randomized Kaczmarz (RK) algorithm.

In Chapter 2, we refine the bound on the support detection time of the RK algorithm in an online sampling scenario where the measurement vectors are i.i.d. Bernoulli. Also, we investigate the sparse randomized Kaczmarz (SRK) algorithm for both over-determined and under-determined consistent linear systems.

Chapter 3 starts by considering a memory-efficient implementation of the IRLS algorithm.
In this case, at every iteration, the IRLS minimization problem is solved using the RK algorithm. Then, we propose a family of novel algorithms for compressed sensing, called the iteratively reweighted Kaczmarz (IRWK) algorithm. The algorithm is efficient in both running time and working memory and outperforms many other Kaczmarz-based algorithms for compressed sensing in CPU time.

In Chapter 4, we bound the reconstruction error in both the frame and compressed sensing settings when the measurements are quantized according to the memoryless scalar quantization (MSQ).

Chapter 2

Kaczmarz algorithm for sparse recovery

In recent years, the amount of acquired data has become unprecedentedly large. Many computational tasks require developing special tools for analyzing and processing large datasets; new algorithms must be computationally tractable in both time and space. In this section, we consider the system of linear equations Ax = b, where A ∈ Rm×n is a very large matrix in both m and n. We want to recover x from b under an additional constraint: the reconstruction process should be memory-efficient and time-efficient. In this chapter, we assume that the working memory is only O(n) and consider computationally feasible algorithms for sparse recovery from over- and under-constrained linear measurements. Specifically, we focus on Kaczmarz-like iterative methods for sparse recovery.

The rest of the chapter is organized as follows. Section 2.1 discusses the deterministic and randomized Kaczmarz (RK) algorithms (Sections 2.1.1 and 2.1.2, respectively). The modifications of the RK algorithm for sparse recovery are reviewed in Section 2.2. Section 2.3 focuses on how quickly the RK algorithm recovers the support of the sparse signal when the underlying solution of the overdetermined linear system is sparse. Section 2.4 investigates the sparse randomized Kaczmarz algorithm.
We end this chapter with a numerical analysis of the Kaczmarz-based algorithms (Section 2.5).

2.1 Overview of the Kaczmarz algorithms

Next, we provide a brief overview of the (deterministic) Kaczmarz and randomized Kaczmarz algorithms in Sections 2.1.1 and 2.1.2, respectively. Note that in this section, we focus only on the above-mentioned algorithms and skip their extensions and modifications, such as the extended randomized Kaczmarz algorithm (see, e.g., [110, 167]), the accelerated randomized Kaczmarz algorithm (see, e.g., [57, 94, 125]), the randomized Kaczmarz algorithm for phase retrieval (see, e.g., [86, 144, 155]), the block randomized Kaczmarz algorithm (see, e.g., [120, 122, 124]), and many others. Modifications of the Kaczmarz algorithm for sparse recovery problems are reviewed in Section 2.2.

2.1.1 The (deterministic) Kaczmarz algorithm

One of the popular memory-efficient solvers of linear systems is the iterative solver named the Kaczmarz algorithm, stated below. At each iteration, this algorithm needs to access only one row of the measurement matrix A and the corresponding entry of b = Ax*. The required working memory for the Kaczmarz algorithm is O(n), where n is the ambient dimension of the signal. Specifically, at each iteration, the algorithm projects the reconstructed signal onto the hyperplane 〈a, x〉 = b, where aᵀ denotes the row of the matrix A picked for the current iteration. In the original Kaczmarz algorithm, rows are selected in a cyclic manner.

Algorithm 2.1 The Kaczmarz algorithm
1: Inputs: A = [(a1)ᵀ, (a2)ᵀ, ..., (am)ᵀ]ᵀ ∈ Rm×n, b. For consistency, let a0 = am.
2: Initialize: x0 ∈ Rn.
3: For k = 0, 1, 2, ...: let ik = k mod m. We select the ikth row of the matrix A and compute xk+1:

xk+1 = xk + ((bik − 〈aik, xk〉)/‖aik‖²) aik.

4: Outputs: Solution of Ax = b if it exists.

The Kaczmarz algorithm (Algorithm 2.1), introduced in [88], is an alternating projection method.
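Algorithm 2.1 is short enough to state directly in code. The following minimal sketch runs full cyclic sweeps over the rows; the test system is an arbitrary consistent overdetermined Gaussian system chosen for illustration:

```python
import numpy as np

def kaczmarz(A, b, n_cycles=100, x0=None):
    """Cyclic Kaczmarz (Algorithm 2.1): project the current iterate onto the
    hyperplane <a_i, x> = b_i, sweeping the rows of A in order."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    row_norms_sq = np.einsum('ij,ij->i', A, A)  # ||a_i||^2, computed once
    for _ in range(n_cycles):
        for i in range(m):                      # rows selected cyclically
            x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((60, 10))
x_star = rng.standard_normal(10)
x_hat = kaczmarz(A, A @ x_star)
print(np.linalg.norm(x_hat - x_star))           # reconstruction error after 100 cycles
```

Each inner step touches a single row of A, so the working memory beyond the input is O(n), as discussed above.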
This algorithm is also known under the name Algebraic Reconstruction Technique (ART) in computed tomography [80, 119].

Denote the row selected at the kth iteration by aik. Then, the update rule becomes

xk+1 = argmin_{x : 〈aik, x〉 = bik} ‖xk − x‖ = proj_{x : 〈aik, x〉 = bik} xk.

Note that if we assume that the solution of Ax = b exists and denote it by x*, then the reconstruction error ‖xk − x*‖ is non-increasing even if there are multiple solutions to Ax = b. Indeed, note that

xk+1 − x* = proj_{x : 〈aik, x〉 = 0} (xk − x*).

Since the projection onto a hyperplane that contains the origin does not increase the norm, ‖xk+1 − x*‖ ≤ ‖xk − x*‖.

The Kaczmarz algorithm (Algorithm 2.1) converges to a solution of Ax = b if the linear system is consistent [88]. The Kaczmarz algorithm converges linearly in terms of cycles (see, e.g., [49, 69] for analysis in the frame setting), and, in general, the convergence rate is hard to estimate (see, e.g., [51]). Note that the convergence rate depends dramatically on the order of the rows of the matrix A. If, for example, A has two identical consecutive rows, say, the ith and (i+1)st rows, then, for all t ∈ N, x^{i+1+tm} = x^{i+tm}. To avoid such issues, at each iteration, we can pick a random row of A, which is the focus of the next section.

2.1.2 The randomized Kaczmarz (RK) algorithm

This section reviews the so-called randomized Kaczmarz algorithm (Algorithm 2.2), introduced by Strohmer and Vershynin [143]. As mentioned before, selecting a row for the Kaczmarz iteration at random improves the convergence speed of the algorithm, both numerically and theoretically.

Algorithm 2.2 The randomized Kaczmarz algorithm
1: Inputs: A = [(a1)ᵀ, (a2)ᵀ, ..., (am)ᵀ]ᵀ ∈ Rm×n, b ∈ Rm.
2: Initialize: x0 ∈ Rn.
3: For k = 0, 1, 2, ...: Pick an index ik ∈ [m] such that P(ik = j) = ‖aj‖²/‖A‖F².
Then,

xk+1 = xk + ((bik − 〈aik, xk〉)/‖aik‖²) aik.

4: Outputs: A solution of Ax = b if it exists.

Note that Algorithm 2.2 is also an alternating projection method, and, therefore, the reconstruction error is non-increasing. Let us further investigate the properties of the RK algorithm updates as projections.

1. xk − x0 belongs to R(Aᵀ) for all k. Denote the projection onto the row span of A by P. Then, xk − x0 = P(xk − x0), which, in turn, implies that xk − P(xk) = x0 − P(x0).

2. If there exists a solution of Ax = b, then A†b is a solution of Ax = b. In Section 1.1.2, we argued that A establishes a bijection between R(Aᵀ) and R(A), and, therefore, A†b ∈ R(Aᵀ) is the only solution of Ax = b that belongs to the row span of A.

Combining these two observations, we conclude that if Algorithm 2.2 converges, then it converges to

x̃ = (x̃ − P(x̃)) + P(x̃) = (x0 − P(x0)) + A†b.

Note that the convergence point remains the same regardless of the row selection strategy. The following theorem specifies the convergence rate of the RK algorithm.

Theorem 2.1.1 (see, e.g., [110, 143]). Let b ∈ Rm and let A ∈ Rm×n be a non-zero matrix such that the linear system Ax = b has a (possibly non-unique) solution. Denote the projection operator onto the row span of A by P and let σmin > 0 be such that for all z ∈ R(Aᵀ), ‖Az‖ ≥ σmin‖z‖. Suppose that we run the RK algorithm (Algorithm 2.2) for Ax = b with initialization x0, and denote the obtained reconstruction sequence by {xk}. Then,

E‖P(xk+1) − A†b‖² ≤ (1 − σmin²/‖A‖F²) ‖P(xk) − A†b‖²
Then, The-orem 2.1.1 implies that the reconstruction sequence {xk}, generated by the RK algorithm,converges to the solution of Ax = b at a rate 1− σ2min‖A‖2F in expectation.The generalization of the proof in [143] to all consistent linear systems with underde-termined full-rank matrices when the RK algorithm is initialized with ~0 can be found, e.g.,in [110]. In this case, σmin is the smallest singular value of AT . Under these assumptions,Theorem 2.1.1 implies that the reconstruction sequence {xk} converges to the min 2-normsolution of the linear system at a rate 1− σ2min‖A‖2F in expectation.Note that Theorem 2.1.1 considers a broader case than both settings above: we considerall non-zero, not necessary full-rank matrices A and all potential initializations of the RKalgorithm. The proof of this statement is a straight-forward modification of the proofprovided in [110] using two observations above, and we skip it here.Remark 2.1.3. Note that Theorem 2.1.1 does not assume that A is a full rank matrix,and, therefore, even if A has repetitive rows or columns, the RK algorithm still convergeswith the rate stated above. Moreover, since A : R(AT ) 7→ R(A) is a bijection (see Section1.1.2 for the proof), σmin in the statement of the theorem can be chosen to be the smallestpositive singular value of A.Another approach to convergence analysis considers projection P˜ onto the space ofsolutions of Ax = b (assuming that it is non-empty). Then, the reconstruction sequence{xk} generated by the RK algorithm for Ax = b satisfies [96, 156]E‖xk+1 − P˜(xk+1)‖2 ≤(1− σ2min(AT )‖A‖2F)‖xk − x∗‖2,where σmin is the smallest positive singular value of A.Note that for inconsistent linear systems, the RK algorithm does not converge. For anoverdetermined full rank matrix A, if b = Ax∗ + ǫ, then [121, 167]E‖xk − x∗‖ ≤(1− σ2min‖A‖2F)k/2‖x0 − x∗‖+ ‖A‖Fσminmaxi∈[m]|ǫi|‖ai‖ .26Therefore, the algorithm is robust. [121] also showed that the equality above can beattained. 
Note that for an inconsistent overdetermined linear system, introducing a relaxation parameter ηk > 0 in the update rule, i.e.,

xk+1 = xk + ηk ((bik − 〈aik, xk〉)/‖aik‖²) aik,

makes the reconstruction sequence {xk} converge to the least-squares solution of Ax = b when ηk → 0 (see [167] and references therein). Such an algorithm is called the relaxed randomized Kaczmarz algorithm.

Chen and Powell established that the RK algorithm converges to the sparse solution a.s. if the measurement vector is drawn i.i.d. from a certain sub-Gaussian distribution at each iteration [38]. The setting in which a new measurement vector is used for each iteration and is not re-used is called online sampling. In this setting, it is common to assume that once an iteration is completed, the measurement vector is not stored, and, therefore, only O(n) memory is required overall. It is usually assumed that the measurement vectors are drawn i.i.d. from a certain distribution. Note that the relaxed randomized Kaczmarz algorithm performs well in the online sampling scenario: if the measurement vectors are drawn i.i.d. uniformly from a sphere, then the relaxed randomized Kaczmarz algorithm converges to the sparse solution if and only if lim_{k→∞} ηk = 0 and Σ ηk = ∞ [105].

On the row selection strategy

Algorithm 2.2 suggests a specific weight choice depending on the norms of the individual rows. On the other hand, the algorithm is row-agnostic in the following sense: if we multiply a row and the corresponding measurement by a non-zero number, the update stays the same as before, but the probability of picking each row changes. Therefore, the norm of an individual row does not modify the update but changes the probability of the row being selected. This observation leads to the question: what scaling of the rows of A is optimal?
That is, what scaling of the rows of A maximizes σmin²/‖A‖F², so that the established convergence rate is optimized?

The proof of Theorem 2.1.1 [143] can be extended to any predetermined weight choice. Denote the probability that row i is selected by μi, with Σ μi = 1. Then,

E‖xk+1 − x*‖² ≤ (1 − σmin²(Aᵀ)/maxᵢ{‖ai‖²/μi}) ‖xk − x*‖². (2.1)

In particular, if the rows of A are drawn uniformly and independently, then the convergence rate is bounded above by 1 − σmin²/(m‖A‖∞,2²). Here ‖A‖∞,2 denotes the maximum norm of the rows of A. Therefore, if the row norms of A are of the same order, the bound on the convergence rate for the uniform row selection scheme is close to the bound in Theorem 2.1.1.

The optimal choice of the probabilities {μi} in (2.1) matches the selection scheme introduced in Algorithm 2.2. Note that optimizing the bound does not mean optimizing the convergence rate of the algorithm. Note that the scaling of the rows of A changes only the probability of picking the specific row [36]. At the same time, the optimal convergence properties must be defined by geometric properties (specifically, the pairwise angles) of the hyperplanes 〈ai, x〉 = bi, which are scale-invariant. This implies that the row selection strategy in Algorithm 2.2 is not optimal in general.

Probabilistic bound on the reconstruction error of the RK algorithm

Theorem 2.1.1 provides convergence in expectation, but a probabilistic statement may be more applicable in some circumstances; we rely on such a bound, in particular, in Section 2.3. Using the Markov inequality and the fact that the reconstruction error norm is non-increasing, we conclude that for every t > 0,

P(‖P(xk) − A†b‖² ≥ t) ≤ t⁻¹ E‖P(xk) − A†b‖² ≤ t⁻¹ (1 − σmin²/‖A‖F²)^k ‖P(x0) − A†b‖².

Recall that x0 − P(x0) = xk − P(xk) ∈ (R(Aᵀ))⊥ is orthogonal to P(xk) − A†b ∈ R(Aᵀ). Therefore,

P(‖xk − (x0 − P(x0)) − A†b‖² ≥ t) ≤ (1 − σmin²/‖A‖F²)^k ‖P(x0) − A†b‖²/t.
(2.2)

Therefore, for every consistent linear system Ax = b, every t > 0, and every ε > 0, there is k0 that depends on x0, A, and b such that for every k ≥ k0,

P(‖xk − (x0 − P(x0)) − A†b‖² ≥ t) ≤ ε.

2.2 Kaczmarz-based algorithms for sparse recovery

As mentioned before, the recent increase in the size of acquired data created a demand for algorithms that are efficient both in running time and in working memory. Note that the RK algorithm (Algorithm 2.2) fits these requirements because it converges linearly in expectation and requires O(n) working memory, where n is the ambient dimension of the signal. In this section, we consider iterative sparse recovery algorithms that have Kaczmarz-like update rules and, therefore, require O(n) working memory. Recall that the RK algorithm (Algorithm 2.2) does not take advantage of prior knowledge of the sparsity of the underlying signal. Thus, one may consider algorithms that utilize the updates of the Kaczmarz algorithms and promote sparsity when the signal is known to be sparse.

Broadly speaking, there are two approaches to applying Kaczmarz-like methods to sparse recovery problems. One can modify the updates of the RK algorithm in a way that promotes sparsity; such approaches are the main focus of this section. Alternatively, one may consider sparse recovery algorithms that require finding the minimum 2-norm solution of a linear system at each iteration and use the RK algorithm as a solver. For example, the BPDN algorithm (Algorithm 1.4) may be reformulated as a quadratic program that may be solved using the RK algorithm (see [141] for details).
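Either way, the RK building block itself is only a few lines. The following sketch implements Algorithm 2.2 with the squared-row-norm sampling; the test system and the iteration budget are arbitrary illustrative choices:

```python
import numpy as np

def randomized_kaczmarz(A, b, n_iter=5000, seed=0):
    """RK (Algorithm 2.2): sample row j with probability ||a_j||^2 / ||A||_F^2
    and project onto the corresponding hyperplane <a_j, x> = b_j."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    row_norms_sq = np.einsum('ij,ij->i', A, A)
    probs = row_norms_sq / row_norms_sq.sum()   # P(i_k = j) = ||a_j||^2 / ||A||_F^2
    for i in rng.choice(m, size=n_iter, p=probs):
        x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 20))
x_star = rng.standard_normal(20)
x_hat = randomized_kaczmarz(A, A @ x_star)
print(np.linalg.norm(x_hat - x_star))  # error decays linearly in the iteration count
```

As Theorem 2.1.1 predicts, for this consistent overdetermined system the iterates converge to the solution, and only one row of A is touched per iteration.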
Similarly, one may use the RK algorithm for the IRLS updates, as we discuss in Section 3.1.

In this section, we review the following Kaczmarz-based algorithms for sparse recovery: (1) the sparse randomized Kaczmarz (SRK) algorithm (Algorithm 2.3, [113]) in Section 2.2.1, (2) the randomized sparse Kaczmarz (RASK) algorithm (Algorithm 2.4, [106, 107, 135]) in Section 2.2.2, and (3) the KZIMT algorithm (Algorithm 2.5, [163]), which may be viewed as a stochastic version of the IHT (Algorithm 1.3), in Section 2.2.3. Finally, the SRK-IHT algorithm (Algorithm 2.6, [154]) combines ideas from the SRK and the IHT literature into an algorithm that further improves the convergence.

2.2.1 The sparse randomized Kaczmarz (SRK) algorithm

Recall that we want to run Kaczmarz-like iterations but in a way that promotes sparsity of the reconstructed signal. The SRK algorithm does so by scaling the measurement vectors, as described in Algorithm 2.3.

Algorithm 2.3 The sparse randomized Kaczmarz algorithm (SRK)
1: Inputs: A = [(a1)ᵀ|(a2)ᵀ|...|(am)ᵀ]ᵀ ∈ Rm×n, b, ρs (estimated sparsity).
2: Initialize: x0 ∈ Rn. Usually, x0 = ~0.
3: For k = 0, 1, 2, ...: Draw a row of A, say, ik, with probability proportional to the norm of the row. Denote the index set of the top ρs entries of xk by Sk. Let wk ∈ Rn be such that

wki = 1 if i ∈ Sk, and wki = 1/√k otherwise.

Compute xk+1 as follows:

xk+1 = xk + ((bik − 〈aik, wk ⊙ xk〉)/‖wk ⊙ aik‖²) (wk ⊙ aik), (2.3)

where ⊙ stands for the element-wise product.

Algorithm 2.3 was proposed by Mansour and Yılmaz in [113] and was extended to low-rank matrix recovery [112].

The idea of scaling the measurement vectors can be heuristically justified as follows. Note that when the number of iterations becomes large and the support of the underlying sparse signal is acquired correctly, i.e., when 1/√k ≈ 0, the measurement vector is concentrated only on a small subset of entries, say, S. Therefore, the SRK algorithm solves AS xS ≈ b.
Note that AS ∈ Rm×s, and, therefore, one expects the convergence rate to be faster than for the RK algorithm in the local regime; see Experiment 2 (Section 2.5.2) for numerical performance.

Assigning small but non-zero weights to the entries outside of the active set is important if we have not acquired the support of the underlying sparse signal correctly. Small weights provide an opportunity for entries outside of the current active set Sk to enter the active set in subsequent iterations.

Note that each iteration of the SRK algorithm requires O(n) working memory and O(n log(ρs)) time. The memory requirement is the same as for the RK algorithm, and the time requirement is only slightly worse: for any sparse signal, ρs ≪ n and ρs < m.

Numerically, this algorithm converges linearly [113]. The properties of the SRK algorithm are further investigated in Sections 2.4 and 2.5.

2.2.2 The randomized sparse Kaczmarz (RASK) algorithm

Recall that the RK algorithm does not promote sparsity directly, and, therefore, one may incorporate an additional step to make the reconstructed signal xk "sparser". In the following algorithm, we apply the soft thresholding operator Sλ, where Sλ(x) := max{|x| − λ, 0} sign(x). This way, all small entries of xk are set to zero, and the relative order of the other entries is not changed.

Algorithm 2.4 The randomized sparse Kaczmarz (RASK) algorithm
1: Inputs: A = [(a1)ᵀ|(a2)ᵀ|...|(am)ᵀ]ᵀ ∈ Rm×n, b, ρs (estimated sparsity), λ > 0.
2: Initialize: x0 = ~0 and x−1/2 = ~0.
3: Outputs: The (approximate) solution of minx λ‖x‖1 + (1/2)‖x‖² subject to Ax = b.
4: For k = 0, 1, 2, ...: Draw a row of A, say, ik, with probability proportional to the norm of the row. Compute

xk+1/2 = xk−1/2 + ((bik − 〈aik, xk〉)/‖aik‖²) aik

and

xk+1 = Sλ(xk+1/2).

Remark 2.2.1. There are two distinct algorithms whose names are similar: the sparse randomized Kaczmarz algorithm and the randomized sparse Kaczmarz algorithm, which are introduced
in Sections 2.2.1 and 2.2.2, respectively. To avoid confusion, Algorithm 2.3 is called the SRK algorithm, and Algorithm 2.4 is called the RASK algorithm.

The RASK algorithm is a randomized version of the sparse Kaczmarz algorithm [106, 107] and can be extended to online sampling [95].

The total required memory is O(n), and the running time of a single iteration is O(n), which is identical to the RK algorithm. Note that the RASK algorithm maintains two variables, xk and xk+1/2, and, therefore, needs twice as much memory as other Kaczmarz-based algorithms.

Theorem 2.2.2 ([135]). Let A ∈ Rm×n and let b ≠ ~0 belong to the range of A. Denote the solution of the regularized basis pursuit by

x̃ = argminx λ‖x‖1 + (1/2)‖x‖² subject to Ax = b.

Denote the smallest (in magnitude) non-zero entry of x̃ by x̃min. Then, for all k ≥ 0, the reconstructed signal xk produced by the kth iteration of the RASK algorithm with initialization ~0 satisfies

E‖xk − x̃‖ ≤ q^(k/2) √(2λ‖x̃‖1 + ‖x̃‖²),

where

q = 1 − (σmin²(A)/‖A‖F²) · (1/2) · |x̃min|/(|x̃min| + 2λ)

and

σmin = min{σmin(AS) > 0 | S ⊆ [N]}.

The authors also establish the convergence rate for the noisy setting. Note that, when λ ≥ 10‖x*‖∞ and the matrix A satisfies the RIP of order 2s with constant δ2s ≤ 0.4404, the RASK algorithm recovers the sparse solution exactly.

2.2.3 The Kaczmarz algorithms meet the IHT algorithm

Note that the RK algorithm may be considered as stochastic gradient descent (see, e.g., [123] for further details). The IHT update is strongly related to gradient descent. Combining these two observations, Zhang et al. proposed the following algorithm (Algorithm 2.5). Note that the computational complexity of this algorithm is similar to that of the IHT, but this algorithm is memory-efficient because it requires only O(n) working memory.
The authors of the algorithm show that, numerically, the KZIMT algorithm outperforms the IHT in CPU time and leads to better reconstruction accuracy in the noisy setting.

The idea of using a Kaczmarz-like update for the IHT is further developed in [154]. Instead of just taking the RK updates, we can promote sparsity by running the SRK updates. This idea results in Algorithm 2.6. The authors of [154] compare the algorithm with the IHT and the KZIMT and observe that in certain settings, the SRK-IHT outperforms them. To the best of our knowledge, there are no established convergence guarantees for the KZIMT and the SRK-IHT algorithms.

Algorithm 2.5 The KZIMT algorithm
1: Inputs: A = [(a1)ᵀ|(a2)ᵀ|...|(am)ᵀ]ᵀ ∈ Rm×n, b, ρs (estimated sparsity).
2: Initialize: x0 ∈ Rn. Usually, x0 = ~0.
3: For k = 0, 1, 2, ...: Let xk+1,0 = xk. For t = 1, 2, ..., m,

xk+1,t = xk+1,t−1 + ((bi − 〈ai, xk+1,t−1〉)/‖ai‖²) ai,

where i = imk+t is drawn from [m] with probability proportional to the norm of the corresponding row of A. Let xk+1 = Hρs(xk+1,m).

Algorithm 2.6 The SRK-IHT algorithm
1: Inputs: A = [(a1)ᵀ|(a2)ᵀ|...|(am)ᵀ]ᵀ ∈ Rm×n, b, ρs (estimated sparsity).
2: Initialize: x0 ∈ Rn. Usually, x0 = ~0.
3: For k = 0, 1, 2, ...: Let xk+1,0 = xk. For t = 1, 2, ..., m, find the index set of the largest max{ρs, n − mk − t + 1} entries of xk+1,t−1 and denote it by Smk+t. Let w = wmk+t ∈ Rn be such that

wi = 1 if i ∈ Smk+t, and wi = 1/√(mk + t) otherwise.

Let

xk+1,t = xk+1,t−1 + ((bi − 〈w ⊙ ai, xk+1,t−1〉)/‖w ⊙ ai‖²) (w ⊙ ai),

where i = imk+t is drawn from [m] with probability proportional to the norm of the corresponding row of A. Let xk+1 = Hρs(xk+1,m).

2.3 Main results. The RK algorithm for support detection

Consider the following scenario: suppose that a consistent (full-rank) overdetermined linear system has a sparse solution, and we run the RK algorithm to reconstruct it. Theorem 2.1.1 guarantees that the sparse signal is successfully reconstructed and that the convergence rate is linear.
In this section, we investigate the support detection of the RK algorithm in such a setting. Specifically, we focus on the number of iterations that guarantees support detection with an arbitrarily large probability.

First, we bound the number of iterations such that the reconstruction error is of given accuracy with an arbitrarily large probability.

Theorem 2.3.1. Let A ∈ Rm×n, m > n, be a full column rank matrix. Let x* ∈ Rn and let b = Ax*. Then, for every ν > 0 and ε ∈ (0, 1), the output of the randomized Kaczmarz algorithm for Ax = b after

k ≥ (−1/log(1 − σmin²/‖A‖F²)) log(‖x0 − x*‖²/(εν²))

iterations satisfies ‖xk − x*‖ ≤ ν with probability at least 1 − ε.

The proof of the theorem follows from straightforward algebraic manipulations of (2.2).

Corollary 2.3.2 (Support detection). In addition to the conditions of Theorem 2.3.1, assume that x* is s-sparse and denote its smallest (in magnitude) non-zero entry by x*min. Then, for every ε > 0, it suffices to run

k > (−1/log(1 − σmin²/‖A‖F²)) log(2‖x0 − x*‖²/(ε|x*min|²))

iterations such that the index set of the top s entries of xk contains the support of x* with probability at least 1 − ε.

Proof. Successful support detection means that for every index i1 in the support index set and every index i2 outside of the support set, |xk_i1| > |xk_i2|. Note that using the inverse triangle inequality and the Cauchy-Schwarz inequality, we get

|xk_i1| − |xk_i2| ≥ |x*_i1| − |xk_i1 − x*_i1| − |xk_i2 − x*_i2|
≥ |x*min| − √2 √((xk_i1 − x*_i1)² + (xk_i2 − x*_i2)²)
≥ |x*min| − √2 ‖xk − x*‖.

Therefore, the index set of the top s entries of the reconstructed signal equals S* if ‖xk − x*‖ < |x*min|/√2. Applying Theorem 2.3.1 finishes the proof.

Before proceeding, let us address the typical number of iterations. For simplicity, assume that A ∈ Rm×n, m ≫ n, has i.i.d. sub-Gaussian entries. Then, with overwhelming probability, there is c > 0 such that

σmin²/‖A‖F² ≥ cm/(nm) = c(1/n).

Note that this bound does not depend on the number of measurements m.
Using this approximation, we conclude that there is c > 0 such that

k = c n log(2‖x0 − x*‖²/(ε|x*min|²))

iterations suffice to guarantee that the support is detected with probability 1 − ε. If the non-zero entries of x* are roughly of the same order and x0 = 0, i.e., |x*min| ≈ (1/√s)‖x*‖, the bound reduces to

k ≥ c n log(2s/ε), (2.4)

where s is the sparsity of the signal.

In the support detection literature, one usually considers overestimating the true sparsity. In this case, the index set of the top ρs entries is considered successful support detection if it contains the support index set of the s-sparse signal x*; here ρ ≥ 1. Once the support is detected, the full reconstruction of x* happens in the second stage, which solves AS z = b, where AS ∈ Rm×ρs keeps only those columns of A whose indices are in the active set determined in the previous stage.

The methodology above does not allow one to take advantage of considering more entries in the active set than the size of the support set, and a new bound is provided for this case.

Theorem 2.3.3. Consider the online sampling scenario: for each iteration k, we draw a random vector ak with i.i.d. ±1 Bernoulli entries and run the RK algorithm (Algorithm 2.2) using this row. This is equivalent to having the full 2^n × n Bernoulli matrix and drawing rows in the same way as in the randomized Kaczmarz algorithm. Fix ρ ≥ 1 such that ρs ≤ n, and denote the reconstructed signal of the RK algorithm by xk. Denote the index set of the top ρs entries of xk by Sk, and the true support of x* by S*. Then, the probability that xk recovers the support of x*, i.e., the probability that S* ⊂ Sk, satisfies

P(S* ⊂ Sk) ≥ 1 − (4‖x* − x0‖²/|x*min|²) ( ((n−2)/n)^k + (1/n)(s + (n−s)/(ρs−s+1)) (((n−1)/n)^k − ((n−2)/n)^k) ).

If, in addition, ρs > s, then

P(S* ⊂ Sk) ≥ 1 − (4‖x* − x0‖²/|x*min|²) ( e^(−2k/n) + (s/n + ((n−s)/n)(1/(ρs−s+1))) (e^(−k/n) − e^(−2k/n)) ). (2.5)

Remark 2.3.4. Fix ε > 0. Increasing the value of ρ decreases the number of iterations k for which the bound (2.5) guarantees that the index set of the top ρs entries contains the true support.
However, a too-large value of ρ does not serve support detection well: we spend computational and time resources to reduce inverting an m × n matrix to inverting an m × ρs matrix. The optimal choice of ρ depends on the specific application.

Remark 2.3.5. In this remark, we derive a sufficient number of iterations such that the support is detected with probability at least 1 − ε. Note that one may solve (2.5) as a quadratic inequality with respect to e^(−k/n) and get the most accurate bound on k based on this approach; we provide an upper bound that is easy to analyze. For simplicity, we assume that s²(ρ − 1) ≤ n. Then,

s/n + ((n−s)/n)(1/(ρs−s+1)) < 1/(s(ρ−1)) + ((n−s)/n)(1/(ρs−s)) < 2/(s(ρ−1)).

Therefore, (2.5) implies that

P(S* ⊂ Sk) ≥ 1 − (4‖x* − x0‖²/|x*min|²) ( (1 − 2/(s(ρ−1))) e^(−2k/n) + (2/(s(ρ−1))) e^(−k/n) )
≥ 1 − (8‖x* − x0‖²/|x*min|²) max{ (1 − 2/(s(ρ−1))) e^(−2k/n), (2/(s(ρ−1))) e^(−k/n) }.

Then, to get a probability of failure of at most ε, it suffices to run

k = n max{ (1/2) log( (8‖x* − x0‖²/(ε|x*min|²)) (1 − 2/(s(ρ−1))) ), log( (8‖x* − x0‖²/(ε|x*min|²)) (2/(s(ρ−1))) ) } (2.6)

iterations to guarantee that the true support is recovered.

Remark 2.3.6. Let us compare (2.6) with the bound in Corollary 2.3.2, which is cn log(2‖x* − x0‖²/(ε|x*min|²)). If the first argument of the maximum in (2.6) dominates, then the sufficient number of iterations is of the same order for both methods of estimation. Also, one expects c ≈ 1 for large enough n, so the bound provided in this remark is smaller by a constant factor (at least 2) compared to Corollary 2.3.2.

Suppose that the second argument of the maximum in (2.6) dominates. Then, (2.6) is smaller than the bound in Corollary 2.3.2 by the factor s(ρ − 1) in the logarithm. We conclude that (2.6) provides a refined bound on the sufficient number of iterations.

Suppose that all non-zero entries of x* are of the same order, i.e., there is a constant α > 0 independent of s and n such that

‖x*‖²/(s|x*min|²) ≈ α.

Suppose that x0 = ~0.
Then, (2.6) is equivalent to

k = n max{ (1/2) log( (8αs/ε)(1 − 2/(s(ρ−1))) ), log( (8α/ε)(2/(ρ−1)) ) }.

Note that for a sufficiently small value of ε, the sufficient number of iterations does not depend on the underlying sparsity s of the signal, and, also, a large value of ρ (that satisfies s²(ρ−1) ≤ n) decreases the required number of iterations to guarantee the support detection. Note that the bound in Corollary 2.3.2 increases as s increases and also does not utilize the value of ρ.

Remark 2.3.7. When ρ = 1, the event S* ⊄ S is the event that the support set is not recovered from the top s entries of xk. This setting is considered in Corollary 2.3.2, and, in this remark, we compare the two bounds. For ρ = 1, Theorem 2.3.3 implies that the probability of not recovering the support is at most

(4‖x*‖²/|x*min|²) ((n−1)/n)^k.

Therefore, if

k ≥ (−1/log(1 − 1/n)) log(4‖x*‖²/(ε|x*min|²)),

then the support is recovered with probability at least 1 − ε. The bound in Corollary 2.3.2 uses σmin²/‖A‖F², which for the full ±1 Bernoulli matrix can be bounded as follows:

σmin²/‖A‖F² ≥ cm/(mn) = c/n,

where c ≈ 1 is a constant independent of n and m and the inequality holds with overwhelming probability. Plugging this bound into Corollary 2.3.2 yields that

k ≥ (−1/log(1 − c/n)) log(2‖x*‖²/(ε|x*min|²)),

which is of the same order as the bound provided by Theorem 2.3.3.

To prove this theorem, we need the following auxiliary lemma from probability theory.

Lemma 2.3.8. Let A1, A2, ..., An be potentially dependent random events. Then, the probability that at least t (t ∈ [n]) of them hold does not exceed

(P(A1) + P(A2) + ... + P(An))/t.

Proof. First, assume that the probability space is discrete. Suppose that {ω1, ω2, ..., ωN} are all the elementary events. Then, the probability of each ωi is added exactly as many times as the number of Aj's that contain ωi. Therefore,

P(A1) + P(A2) + ... + P(An) = Σ_{i=1}^{N} P(ωi) #{j : ωi ∈ Aj}
≥ t Σ_{i=1}^{N} P(ωi) 1_{ {ω : #{j : ω ∈ Aj} ≥ t} }(ωi)
= t P(at least t of the Ai's hold).

Next, we extend the proof to arbitrary probability spaces. Let A_i^{zi} equal Ai if zi = 1 and Ai^c if zi = −1.
Consider all events ∩Ni=1Azii where {zi} runs over all potential combinationsof 1 and −1. Using these disjoint events as discrete events in the first part of the proofallows to derive the same bound for arbitrary probability space.Proof of Theorem 2.3.3. Recall the update formulaxk − x∗ = xk−1 − x∗ − 〈aik−1 , xk − x∗〉‖aik−1‖2 aik−1 .Therefore, for each j ∈ [n],(xkj − x∗j )2= (xk−1j − x∗j )2 − 2(xk−1j − x∗j )〈aik−1 , xk − x∗〉‖aik−1‖2 aik−1j +(〈aik−1 , xk − x∗〉‖aik−1‖2)2(aik−1j )2.(2.7)Note that aik−1 is a Bernoulli ±1 vector, so ‖aik‖2 = n and (aik−1j )2 = 1. Taking expectationof both sides, and using that entries of aik−1 are independent, mean zero, and variance one,we conclude thatE(xkj − x∗j )2 = (xk−1j − x∗j )2 − 2(xk−1j − x∗j)2n+‖xk−1 − x∗‖2n2.Using matrix notation, this equality can be rewritten asE(xk − x∗)2 =[(1− 2n)I +1n2J](xk−1 − x∗)2,37where the square is taken entry-wise and J stands for n×n matrix whose all entries equal1. Since each iteration draws the corresponding measurement vector aik−1 independentlyfrom the previous draws,E(xk − x∗)2 =[(1− 2n)I +1n2J]k(x0 − x∗)2. (2.8)We want to raise the matrix to kth power. Note that I · I = I andI · 1nJ =1nJ · I = 1nJ · 1nJ =1nJ.By mathematical induction, we prove that for k ≥ 1,[(1− 2n)I +1n2J]k=(1− 2n)kI +((n− 1n)k−(n− 2n)k) 1nJ.Plugging this result in (2.8), we getE(xk − x∗)2 =(1− 2n)k(x0 − x∗)2 +((n− 1n)k−(n− 2n)k) ‖x0 − x∗‖2n. (2.9)Denote the support index set of x∗ by S∗. The active set S contains S∗ if both thefollowing conditions hold:1. For every i ∈ S∗, (xki − x∗i )2 ≤ (x∗min)24 .2. There are at most ρs− s violations of (xki )2 = (xki − x∗i )2 ≤ (x∗min)24 for i ∈ Sc∗.Therefore, the active set does not contain S∗ if one of the conditions fail. 
Denoting theevent (xki − x∗i )2 ≤ (x∗min)24 by Ai, we getP(S∗ 6⊂ S) ≤ P (∪i∈S∗Aci ) + P (at least (ρs − s+ 1)-manyAci ’s hold, i 6∈ S∗) .The first probability on the right hand side can be bounded by a union bound, and thesecond probability is bounded using Lemma 2.3.8. This yieldsP(S∗ 6⊂ S) ≤∑i∈S∗P (Aci ) +∑i∈Sc∗ P(Aci )ρs− s+ 1 .Recall that Ai stands for the event (xki − x∗i )2 ≤ (x∗min)24 , and, thus, Aci is the event (xki −x∗i )2 >(x∗min)24 . Using the Markov inequality,P(S∗ 6⊂ S) ≤∑i∈S∗ E(xki − x∗i )2 +∑i∈Sc∗E(xki −x∗i )2ρs−s+1|x∗min|2/4.38Plugging values from (2.9) and using that x0 = 0, we conclude that∑i∈S∗E(xki − x∗i )2 =(n− 2n)k‖x∗‖2 + sn((n− 1n)k−(n− 2n)k)‖x∗‖2and ∑i∈Sc∗E(xki − x∗i )2 =n− sn((n− 1n)k−(n− 2n)k)‖x∗‖2.Combining these two equalities implies thatP(S∗ 6⊂ S) ≤ 4‖x∗‖2|x∗min|2((n− 2n)k+1n(s+n− sρs− s+ 1)((n− 1n)k−(n− 2n)k))Remark 2.3.9. The probability above can be computed explicitly for given values of m, s,ρ, and k.2.4 Main results. The sparse randomized Kaczmarzalgorithm2.4.1 Local linear convergenceIn this section, we investigate the local convergence of the SRK algorithm for both over-and under-determined linear systems. We start from a general statement that summarizesboth cases and then derive the convergence rate for over- and under-determined matrices,respectively. Also, we conclude the convergence rate for a broad class of sub-Gaussianrandom matrices.Lemma 2.4.1. Suppose that x∗ ∈ Σns and denote its smallest (in magnitude) non-zeroentry by x∗min. Let A ∈ Rm×n and b := Ax∗. Fix ρ such that ρs > s. Suppose that we runthe SRK algorithm (Algorithm 2.3) to solve b = Ax with estimated sparsity ρs. If for somek0, ‖xk0 − x∗‖ ≤√ρs−s+1√ρs−s+1+1 |x∗min|, then, for all k ≥ k0,E‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2 − 1mmaxi ‖wk ⊙ ai‖2 ‖A(wk ⊙ (xk − x∗))‖2. (2.10)This lemma implies that in the local regime, for all matrices A, the reconstruction errordecreases in expectation. 
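For concreteness, a single iteration of the kind analyzed in Lemma 2.4.1 can be sketched as follows. This is a minimal illustrative sketch, not the reference implementation of Algorithm 2.3: the weighting scheme (weight 1 on the active set of the ρs largest-magnitude entries, 1/√k elsewhere) is taken from the description of the SRK weights later in this chapter, and all names (`srk_step`, `rho_s`) are ours.

```python
import numpy as np

def srk_step(x, a, b_i, rho_s, k):
    """One sparse randomized Kaczmarz (SRK) update.

    Weight 1 on the active set S_k (indices of the rho_s largest-magnitude
    entries of x) and 1/sqrt(k) off it; the iterate is then projected onto
    the hyperplane defined by the reweighted row.
    """
    w = np.full(x.size, 1.0 / np.sqrt(k))
    w[np.argsort(np.abs(x))[-rho_s:]] = 1.0   # active set S_k gets weight 1
    wa = w * a
    # residual b_i - <a, w * x> = b_i - <w * a, x>
    return x + (b_i - wa @ x) / (wa @ wa) * wa
```

On a consistent system, iterating this step with uniformly drawn rows reproduces the kind of expected error decrease that Lemma 2.4.1 quantifies.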
We consider the following specific settings: when a measurementmatrix is overdetermined and when a measurement matrix satisfies the RIP.39Local convergence of the SRK algorithm in the overdetermined settingSuppose that A as in Lemma 2.4.1 and, also, A ∈ Rm×n is an overdetermined full rankmatrix. Then, the smallest singular value of A, denoted by σmin, is positive. The followingtheorem establishes the convergence rate of the SRK under such constraints.Theorem 2.4.2. Assume that the conditions of Lemma 2.4.1 hold. If A ∈ Rm×n is a fullrank overdetermined matrix, then, for all k ≥ k0,E‖xk+1 − x∗‖2 ≤(1− σ2minmmaxi ‖wk ⊙ ai‖2ρs− sn− s)‖xk − x∗‖2.Remark 2.4.3. Ifσ2minmmaxi ‖wk ⊙ ai‖2ρs− sn− s =σ2minmρs− smaxi ‖wk ⊙ ai‖21n− s < 1,then the SRK algorithm converges linearly in expectation in the local regime.Remark 2.4.4. Note that the theorem above is valid for any full-rank overdetermined matrixA, i.e., for any frame. For consistent overdetermined linear equations, one may run theRK algorithm with a uniform row selection strategy. The convergence rate (in mean) ofthe RK algorithm is1− σ2minmmaxi ‖ai‖2 .If magnitudes of all entries of A are approximately equal, say, to τ , then, for large k, theconvergence rate of the SRK algorithm is approximately1− σ2minmτ2‖wk‖2ρs− sn− s ≈ 1−σ2minmτ2ρsρs− sn− s = 1−σ2minn− sρ− 1ρ,and for the RK algorithm, the convergence rate is approximately1− σ2minmτ2n.We conclude that the bounds on the RK and SRK algorithms convergence rates are ofthe same order. Algebraic manipulations imply that the guaranteed convergence rate forthe SRK algorithm is weaker than the guaranteed convergence rate for the RK algorithm.Note that smaller upper bounds do not imply smaller value, and in Experiment 2 (Section2.5.2), we show that in the local regime, the SRK algorithm outperforms the RK algorithm.For a broad class of random overdetermined matrices, we can estimate the singularvalue of the matrix and bound the maximum of weighted rows in the denominator. 
First,we consider a random matrix whose entries are i.i.d. ±1 Bernoulli. In this case, maxi ‖wk⊙ai‖ = maxi ‖wk‖ = ‖wk‖. A similar result may be obtained for the random Fourier matrix.40Corollary 2.4.5. Suppose that the conditions of Theorem 2.4.2 are met. Suppose thatentries of A are i.i.d. ±1 Bernoulli random variables. Using Theorem 1.3.4, we concludethat with probability at least 0.99, we draw A such thatE‖xk+1 − x∗‖2 ≤(1− 1ρs+ n−ρsk(1−C√nm)2ρs− sn− s)‖xk − x∗‖2.Here the expectation is taken over the choice of a row for the kth iteration.Corollary 2.4.6. Suppose that entries of A ∈ Rm×n, m > n, are independent realizationsof mean zero variance one sub-Gaussian random variable whose sub-Gaussian norm doesnot exceed K. Assume that all conditions of Theorem 2.4.2 hold. Then, with probability atleast 0.99, we draw A such that, for all k ≥ k0,E‖xk+1 − x∗‖2 ≤(1−(1− C1K√nm)2C2K ln(mn)(ρs +n−ρsk )ρs− sn− s)‖xk − x∗‖2.Here the expectation is taken over the choice of a row for the kth iteration. In particular,if m≫ n and k is large, the SRK converges linearly on average at a rate 1− c ρ−1ρ ln(m)(n−s) .Local convergence of the SRK in the underdetermined settingWe proceed to the underdetermined case. Recall that every consistent linear system b = Axhas infinitely many solutions. To guarantee that there is a unique s-sparse solution, weassume that A ∈ Rm×N satisfies the RIP (see Section 1.2.1 for details).Theorem 2.4.7. Suppose that x∗ ∈ ΣNs and denote its smallest (in magnitude) non-zeroentry by x∗min. Suppose that A ∈ Rm×N satisfies the RIP of order ρs with constant δ.Let b = Ax∗. We run the SRK algorithm (Algorithm 2.3) for Ax = b with estimatedsparsity ρs > s and inherit the notations from the algorithm. If for some k0, ‖xk0 − x∗‖ ≤√ρs−s+1√ρs−s+1+1 |x∗min|, thenE‖xk+1 − x∗‖2 ≤1−((1− 1√k)√1− δ√ρs−sN−s − 1√k‖A‖)2mmaxi ‖wk ⊙ ai‖2 ‖xk − x∗‖2.Remark 2.4.8. Suppose that k is very large, and, therefore, all terms involving 1/√k and1/k are neglectfully small. 
Then, the convergence rate is approximately

1 − ((1 − δ)/(m maxi ‖wk ⊙ ai‖²)) · ((ρs − s)/(N − s)).

Similarly to the overdetermined case, let us consider two specific types of measurement matrix A: matrices whose entries all have the same magnitude (e.g., the ±1 Bernoulli matrix or the random Fourier matrix), and matrices whose entries are i.i.d. sub-Gaussian random variables.

Corollary 2.4.9. Suppose that the entries of A = (aij) ∈ Rm×N are i.i.d. ±1/√m Bernoulli random variables. Then, under the conditions of Theorem 2.4.7, with probability at least 0.99, we draw a matrix A such that the contraction factor in Theorem 2.4.7 is at most

1 − ( (1 − 1/√k)√(1 − δ) · √((ρs − s)/(N − s)) − (1/√k)(√(N/m) + C) )² / (ρs + (N − ρs)/k).

Here C > 0 is an absolute constant. For large enough k, the convergence rate is approximately

1 − ((ρ − 1)/ρ) · (1 − δ)/(N − s).

Corollary 2.4.10. Suppose that x∗ ∈ ΣNs and denote its smallest (in magnitude) non-zero entry by x∗min. Suppose that the entries of A ∈ Rm×N, m ≪ N, are independent realizations of mean-zero, variance-1/m sub-Gaussian random variables whose sub-Gaussian norm does not exceed K/√m. Let b = Ax∗. We run the SRK algorithm (Algorithm 2.3) for Ax = b and inherit the notation from the algorithm. If for some k0, ‖xk0 − x∗‖ ≤ (√(ρs − s + 1)/(√(ρs − s + 1) + 1))|x∗min|, then with probability at least 0.99, we draw a matrix A such that A satisfies the RIP of order ρs with constant

δ = C√((ρs/m) log(eN/s))

and, for all k ≥ k0,

E‖xk+1 − x∗‖² ≤ ( 1 − ( (1 − 1/√k)√(1 − δ) · √((ρs − s)/(N − s)) − (1/√k)(√(N/m) + C) )² / (ln(mN)(ρs + (N − ρs)/k)) ) ‖xk − x∗‖².

Here the expectation is taken over the choice of measurement vectors. In particular, for m ≪ N and large k, the SRK algorithm converges linearly on average at a rate 1 − c(ρs − s)/(ln(mN)(N − s)).

2.4.2 Boundedness of the reconstruction sequence

Note that the reconstruction sequence {xk} generated by the SRK algorithm (Algorithm 2.3) is a random sequence of vectors, where the randomization comes from the row selection at each iteration. Let A = (aij) ∈ Rm×n be a matrix such that for some c > 0, |aij| ≥ c. In this section, we show that supk E‖xk‖² < ∞.
Note that, using the Markov inequality, it would imply that for every ǫ ∈ (0, 1) there is t = t(ǫ) such that for every k ≥ 0, P(‖xk‖ ≥ t) ≤ ǫ. This statement holds both for overdetermined matrices and for matrices that satisfy the RIP.

As before, we consider two different cases: when the matrix A is overdetermined and when it is underdetermined.

Theorem 2.4.11. Suppose that A = (aij) ∈ Rm×n is a full-rank overdetermined matrix such that aij ≠ 0 for all i and j. Denote the rows of A by (ai)T, i = 1, 2, ..., m. Let x∗ ∈ Σns and let b := Ax∗. Suppose that we run the SRK algorithm (Algorithm 2.3) with estimated sparsity ρs ≥ s, and denote the reconstructed signal after the kth iteration by xk. Then, for every k ≥ 0,

E‖xk+1‖² ≤ max{ ‖x0‖², (‖b‖²/σ²min) · (maxij a²ij / minij a²ij) · (n/ρs) },

where σmin is the smallest singular value of A.

Now suppose that A satisfies the RIP.

Theorem 2.4.12. Suppose that A = (aij) ∈ Rm×N satisfies the RIP of order ρs with constant δ, and that for all i ∈ [m] and j ∈ [N], 0 < c1 ≤ |aij| ≤ c2. Then, for all large enough k,

E‖xk+1‖² ≤ max{ ‖x0‖², (2‖b‖²N/((1 − δ)ρs)) · (c2²/c1²) }.

Corollary 2.4.13. If the matrix A contains no zero entries, then E‖xk+1‖² is bounded.

Remark 2.4.14. Since x∗ is an s-sparse signal, ‖b‖² = ‖Ax∗‖² ≤ (1 + δ)‖x∗‖². Therefore, the bound above can be rewritten as

E‖xk+1‖² ≤ max{ ‖x0‖², (2(1 + δ)N/((1 − δ)ρs)) · (c2²/c1²) ‖x∗‖² }.

Note that the ±1/√m Bernoulli random matrix satisfies the RIP of order ρs, and its entries all have the same magnitude. Recall that Theorem 1.3.6 guarantees that the Bernoulli matrix satisfies the RIP. We conclude the following statement.

Corollary 2.4.15. Suppose that m ∈ N, N ∈ N, s ∈ N, ρ with ρs ∈ N, and δ > 0 are such that m ≥ c1δ⁻²ρs log(eN/(ρs)). Suppose that A ∈ Rm×N is a ±1/√m Bernoulli matrix.
Then,with probability at least 1 − 2 exp(−cδ2m) with respect to the draw of A, for large enoughk,E‖xk+1‖2 ≤ max{‖x0‖2, 2‖b‖2N(1− δ)ρs}.We conclude that for a broad class of measurement matrices, the reconstruction se-quence produced by the SRK algorithm is bounded in expectation.432.4.3 On convergence of the SRK algorithmIn this section, we provide an intuition for the convergence of the SRK algorithm. Weconsider the measurement matrix to be the full ±1 Bernoulli matrix or, equivalently, if wedraw a random ±1 Bernoulli vector at each iteration.Theorem 2.4.16. Let x∗ ∈ ΣNs and let ρ be such that ρs > s. Suppose that one of thefollowing equivalent scenarios hold.• Measurement matrix A ∈ R2N×N is the full ±1 Bernoulli matrix, i.e., its rows are all2N possible Bernoulli vectors. Let b = Ax∗. We run the SRK algorithm for Ax = b.• At each iteration, we draw a random vector ak ∈ RN with i.i.d. ±1 Bernoulli entriesand retrieve the measurement bk = 〈ak , x∗〉. We use this measurement to run theSRK update.Denote the reconstruction sequence by {xk}. Let jl = l if l ∈ Sk ∩ Sc∗ or if |xkl | ≤ |x∗l | or ifsign(xkl ) = sign(x∗l ), and let jl be any index in Sk ∩ Sc∗ otherwise. Then, for all l ∈ [N ],E(xk+1l − x∗l )2 ≤(1− 2(wkl )2‖wk‖2)(xkjl − x∗jl)2 +(wkl )2‖wk‖4(1 +1√ρ− 1)‖(x∗ − xk)Sk∪S∗‖2.Corollary 2.4.17. Note that ‖(x∗ − xk)Sk∪S∗‖2 ≤ (#Sk ∪ S∗)‖(x∗ − xk)Sk∪S∗‖2∞ ≤ (ρs+s)‖x∗ − xk‖2∞. Plugging this bound into Theorem 2.4.16, we getE(xk+1l − x∗l )2 ≤(1− 2(wkl )2‖wk‖2)(xkjl − x∗jl)2 +(wkl )2(ρs+ s)‖wk‖4(1 +1√ρ− 1)‖(x∗ − xk)‖2∞.Note that for every ρ > 4,(ρs+ s)‖wk‖2(1 +1√ρ− 1)< 2,and, therefore, there is c = c(ρ) ∈ (0, 2) such thatE(xk+1l − x∗l )2 ≤(1− c(wkl )2‖wk‖2)‖x∗ − xk‖2∞.We conclude that‖E(xk+1 − x∗)2‖∞ ≤(1− c/k‖wk‖2)‖x∗ − xk‖2∞, (2.11)where square is taken element-wise.44Unfortunately, the expected value and the maximum are not interchangeable, and, there-fore, such an approach does not guarantee the decay of the reconstruction error. 
On the other hand, note that

∏_{k=k0}^{k1} (1 − (c/k)/‖wk‖²) ≤ ∏_{k=k0}^{k1} (1 − (c/k)/(ρs + (N − ρs)/k0)) ≤ ∏_{k=k0}^{k1} exp( −(c/k)/(ρs + (N − ρs)/k0) )
= exp( −(c/(ρs + (N − ρs)/k0)) · Σ_{k=k0}^{k1} 1/k ) ≤ exp( −(c/(ρs + (N − ρs)/k0)) · (ln(k1 − 1) − ln(k0)) )
= ((k1 − 1)/k0)^{−c/(ρs + (N − ρs)/k0)}.

Therefore, if k1 = 2k0 + 1, the product is of constant order. If there were no expectation in (2.11), then we would observe the reconstruction error diminish to zero at the rate above. Therefore, one may expect that the SRK algorithm converges in the ℓ∞ norm. We verify this hypothesis in Experiment 3 (Section 2.5.3).

2.4.4 Proofs

Proof of Lemma 2.4.1

First, we show that when the conditions of Lemma 2.4.1 are met, the active set contains the support of the underlying sparse solution for all subsequent iterations.

Lemma 2.4.18. Let x∗ ∈ ΣNs and denote its smallest (in magnitude) non-zero entry by x∗min. Let x ∈ RN be such that

‖x∗ − x‖ ≤ (√(ρs − s + 1)/(√(ρs − s + 1) + 1)) |x∗min|

for some ρ ≥ 1. Then the index set of the largest (in magnitude) ρs entries of x contains the support of x∗.

Proof. The proof is by contradiction. Denote the support of x∗ by S∗ and the index set of the largest (in magnitude) ρs entries of x by S, and suppose that S does not contain S∗. Using that #S = ρs and #S∗ = s, we get #(S ∩ Sc∗) ≥ ρs − s + 1; in particular, S ∩ Sc∗ is non-empty. Since x∗ vanishes on S ∩ Sc∗,

‖xS∩Sc∗‖ = ‖(x∗ − x)S∩Sc∗‖ ≤ ‖x∗ − x‖ ≤ (√(ρs − s + 1)/(√(ρs − s + 1) + 1)) |x∗min|.

Therefore, the smallest (in magnitude) entry of x whose index belongs to S ∩ Sc∗ is at most (√(ρs − s + 1) + 1)⁻¹ |x∗min|. Since S is the index set of the largest ρs entries, the magnitude of any entry of x whose index does not belong to S is also at most (√(ρs − s + 1) + 1)⁻¹ |x∗min|.

Finally, for every l ∈ S∗,

|xl| ≥ |x∗l| − |x∗l − xl| ≥ |x∗min| − ‖x∗ − x‖ ≥ (√(ρs − s + 1) + 1)⁻¹ |x∗min|.

Therefore, every l ∈ S∗ belongs to S. This contradicts the assumption that S∗ ⊄ S and finishes the proof.

We proceed to the proof of Lemma 2.4.1. Suppose that for some k0, ‖xk0 − x∗‖ ≤ (√(ρs − s + 1)/(√(ρs − s + 1) + 1)) |x∗min|. Lemma 2.4.18 implies that the active set Sk contains the true support S∗.
Then, wk⊙x∗ =x∗ and the update rule may be rewritten as follows.xk+1 − x∗ = xk − x∗ + 〈aik , wk ⊙ x∗ − wk ⊙ xk〉‖wk ⊙ aik‖2 wk ⊙ aik= xk − x∗ + 〈wk ⊙ aik , x∗ − xk〉‖wk ⊙ aik‖2 wk ⊙ aik .Note that the last line is a projection of xk − x∗ onto (wk ⊙ aik)⊥. Therefore, using theprojection properties,‖xk+1 − x∗‖2 = ‖xk − x∗‖2 −∥∥∥∥〈wk ⊙ aik , x∗ − xk〉‖wk ⊙ aik‖2 wk ⊙ aik∥∥∥∥2= ‖xk − x∗‖2 −(〈wk ⊙ aik , x∗ − xk〉)2‖wk ⊙ aik‖2 .We conclude that ‖xk+1− x∗‖2 ≤ ‖xk − x∗‖2 ≤√ρs−s+1√ρs−s+1+1 |x∗min|. Using Lemma 2.4.18, weconclude that for all subsequent iterations, the active set contains the support of x∗.Now we establish the bound on E‖xk+1 − x∗‖2. Recall that the expectation here istaken over the choice of rows of A for each iteration.E‖xk+1 − x∗‖2 = ‖xk − x∗‖2 − E(〈wk ⊙ aik , x∗ − xk〉)2‖wk ⊙ aik‖2≤ ‖xk − x∗‖2 − 1maxt ‖wk ⊙ at‖2E(〈aj , wk ⊙ (x∗ − xk)〉)2= ‖xk − x∗‖2 − 1mmaxt ‖wk ⊙ at‖2∥∥∥A(wk ⊙ (x∗ − xk))∥∥∥246Proof of Theorem 2.4.2Using Lemma 2.4.1, we conclude thatE‖xk+1 − x∗‖2 = ‖xk − x∗‖2 − 1mmaxt ‖wk ⊙ at‖2∥∥∥A(wk ⊙ (x∗ − xk))∥∥∥2≤ ‖xk − x∗‖2 − σ2min(A)mmaxt ‖wk ⊙ at‖2∥∥∥diag(wk)(x∗ − xk)∥∥∥2 .≤ ‖xk − x∗‖2 − σ2min(A)mmaxt ‖wk ⊙ at‖2∥∥∥(x∗ − xk)Sk∥∥∥2 .Note that since A is a full rank m × n matrix, m ≥ n, σmin(A) > 0. We need to bound∥∥(x∗ − xk)Sk∥∥2 by ‖xk − x∗‖2, and the following lemma finishes the proof.Lemma 2.4.19. Suppose that x∗ ∈ Σns and x ∈ Rn. Let ρ > 1 be such that ρs > s. Denotethe support of x∗ by S∗ and the index set of the ρs largest (in magnitude) entries of x byS. Suppose that S∗ ⊂ S. Then,‖x− x∗‖2 ≤ n− sρs− s‖(x− x∗)S‖2.Proof. Using the sparsity of x∗, the reconstruction error may be decomposed as‖x− x∗‖2 = ‖(x− x∗)S‖2 + ‖xSc‖2.We need to bound ‖xSc‖2 by a multiple of ‖(x−x∗)S‖2. Note that ‖xSc‖2 ≤ (n−ρs)‖xSc‖2∞.Recall that the indexes of the ρs largest entries of x are in S, and therefore, any entry ofx whose index is not in S is smaller than the entry of x whose index is in S. 
Since setS ∩ Sc∗ contains at least ρs − s > 0 elements, then ‖xSc‖∞ ≤ ‖xS∩Sc∗‖2/(ρs − s). Finally,we conclude that‖x− x∗‖2 ≤ ‖(x− x∗)S‖2 + n− ρsρs− s ‖xS∩Sc∗‖2 ≤ n− sρs− s‖(x− x∗)S‖2.Proof of Theorem 2.4.7Recall the statement of Lemma 2.4.1:E‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2 − 1mmaxi ‖wk ⊙ ai‖2 ‖A(wk ⊙ (xk − x∗))‖2.47Recall that A satisfies the RIP of order ρs with constant δ. Then, using the inversetriangle inequality,‖A(wk ⊙ (xk − x∗))‖ = ‖(1− 1√k)A((xk − x∗)Sk)+1√kA(xk − x∗)‖≥(1− 1√k)‖A((xk − x∗)Sk)‖ − 1√k‖A(xk − x∗)‖≥(1− 1√k)√1− δ‖(xk − x∗)Sk‖ −σmax√k‖xk − x∗‖.Using Lemma 2.4.19, we conclude that‖A(wk ⊙ (xk − x∗))‖ ≥(1− 1√k)√1− δ√ρs− sN − s‖xk − x∗‖ − σmax√k‖xk − x∗‖.Plugging this bound into (2.10) finishes the proof.Proof of Corollaries 2.4.6 and 2.4.10Both corollaries rely on the bound on singular values of A from Theorem 1.3.4 and thebound on maxi ‖wk ⊙ ai‖ established below. For Corollary 2.4.10, we also need to provethat A satisfies the RIP of order ρs.Lemma 2.4.20. Suppose that X1, X2, ..., Xd are i.i.d. sub-Gaussian random variableswhose sub-Gaussian norm is at most K. Then, for every t > 0, with probability at least1− c1 exp(−c2t2)P(maxiXi ≤ tK√ln d)≥ 1− c1 exp(−c2t2).Here c1 and c2 are absolute constants.Proof. Since Xi’s are identically distributed and sub-Gaussian, then, there exist c1 > 0and c2 > 0 such thatP(|Xi| ≥ t) ≤ 1− c1 exp(−c2t2/K2)For every t > 0,P(maxi|Xi| ≥ tK√ln d) = P(∃i s.t. |Xi| ≥ tK√ln d) ≤d∑i=1P(|Xi| ≥ tK√ln d)≤d∑i=1c1 exp(−c2t2 ln d) = dc1 exp(−c2t2 ln d) = c1 exp(−c2t2).48Corollary 2.4.21. Suppose that all entries of A = (aij) ∈ Rm×n are i.i.d. sub-Gaussianrandom variables whose sub-Gaussian norm does not exceed K. Then, for every t > 0,maxi,j |aij | ≤ tK√ln(mn) with high probability (with respect to t). In particular, there isa constant C > 0 such that with probability at least 0.99,maxi,j|aij | ≤ CK√ln(mn).Finally, we need to prove that matrix A, as in Theorem 2.4.10, satisfies the RIP. 
Notethat the conditions of Theorem 1.3.6 are met if we show that rows of√mA = (aij) areisotropic. Indeed, for every x ∈ RN and every row (ai)T of A,E(〈√mai , x〉)2 = mE N∑j=1N∑l=1aijxjailxl = EN∑j=1a2ijx2j = ‖x‖2.Plugging the bounds above into Theorem 2.4.2 implies Corollary 2.4.10. Plugging thebounds above into Theorem 2.4.7 implies Corollary 2.4.10.Proof of Theorems 2.4.11 and 2.4.12The key idea of the proof is the following. We will show that there exist α1 ∈ (0, 1)and α2 > 0 such that α2/α1 is independent of xk and k and that for all k, the followinginequality holds.E‖xk+1‖2 ≤ (1− α1)‖xk‖2 + α2. (2.12)For now, assume that we showed the inequality above. Then, rearranging the terms yieldsE(‖xk+1‖2 − α2α1)≤ (1− α1)(‖xk‖2 − α2α1).This inequality implies that for all k ≥ 0 and deterministic x0,E(‖xk‖2 − α2α1)≤ max{‖x0‖2 − α2α1, 0}.Therefore,E‖xk‖2 ≤ max{‖x0‖2, α2α1}.In the rest of the proof, we specify α1 and α2 for over- and under-determined settings.49We start the proof for generalized linear measurements and then derive the results forthe over- and under-determined settings. Recall the update rule of the SRK algorithm.xk+1 = xk +〈aik , x∗ − wk ⊙ xk〉‖wk ⊙ ak‖2 wk ⊙ aik= xk − 〈wk ⊙ aik , xk〉‖wk ⊙ aik‖2 wk ⊙ aik + 〈aik , x∗〉‖wk ⊙ aik‖2wk ⊙ aik= proj(wk⊙aik )⊥xk +bik‖wk ⊙ aik‖2wk ⊙ aik .Note that the first term is perpendicular to wk ⊙ aik while the second term is parallelto this vector. Then,‖xk+1‖2 = ‖proj(wk⊙aik )⊥xk‖2 + ‖bik‖wk ⊙ aik‖2wk ⊙ aik‖2= ‖xk‖2 − ‖proj(wk⊙aik )xk‖2 +(bik)2‖wk ⊙ aik‖2= ‖xk‖2 − 〈wk ⊙ aik , xk〉‖wk ⊙ aik‖2 +(bik)2‖wk ⊙ aik‖2= ‖xk‖2 − (〈aik , wk ⊙ xk〉)2‖wk ⊙ aik‖2 +(bik)2‖wk ⊙ aik‖2 .Recall that for every iteration, we select rows of A uniformly at random. 
Therefore,

E‖xk+1‖² = ‖xk‖² − (1/m) Σᵢ (⟨ai, wk ⊙ xk⟩)²/‖wk ⊙ ai‖² + (1/m) Σᵢ (bi)²/‖wk ⊙ ai‖²
≤ ‖xk‖² − (1/(m maxi ‖wk ⊙ ai‖²)) Σᵢ (⟨ai, wk ⊙ xk⟩)² + (1/(m mini ‖wk ⊙ ai‖²)) Σᵢ (bi)²
= ‖xk‖² − ‖A(wk ⊙ xk)‖²/(m maxi ‖wk ⊙ ai‖²) + ‖b‖²/(m mini ‖wk ⊙ ai‖²).

We consider two cases of measurement matrices: (1) A is an overdetermined full-rank matrix, and (2) A satisfies the RIP of order ρs with constant δ. First, suppose that A ∈ Rm×n is a frame, i.e., an overdetermined full-rank matrix. Then,

‖A(wk ⊙ xk)‖² ≥ σ²min ‖wk ⊙ xk‖² ≥ σ²min ‖xkSk‖².

Recall that Sk is the index set of the largest ρs entries of xk, and therefore

‖xk‖² = ‖xkSk‖² + ‖xkSck‖² ≤ ‖xkSk‖² + (n − ρs)‖xkSck‖²∞ ≤ ‖xkSk‖² + ((n − ρs)/ρs)‖xkSk‖² = (n/ρs)‖xkSk‖².     (2.13)

We conclude that if A is a frame, then

E‖xk+1‖² ≤ ‖xk‖² − (σ²min/(m maxi ‖wk ⊙ ai‖²)) · (ρs/n) ‖xk‖² + ‖b‖²/(m mini ‖wk ⊙ ai‖²)
≤ ‖xk‖² − (σ²min/(m‖wk‖² maxij |aij|²)) · (ρs/n) ‖xk‖² + ‖b‖²/(m‖wk‖² minij |aij|²),

so (2.12) holds with α1 = (σ²min/(m‖wk‖² maxij |aij|²)) · (ρs/n) and α2 = ‖b‖²/(m‖wk‖² minij |aij|²).

Now suppose that A ∈ Rm×N is a matrix that satisfies the RIP of order ρs with constant δ. Then,

‖A(wk ⊙ xk)‖ ≥ (1 − 1/√k)‖A xkSk‖ − (1/√k)‖A xk‖ ≥ (1 − 1/√k)√(1 − δ)‖xkSk‖ − (σmax/√k)‖xk‖ ≥ (1 − 1/√k)√(1 − δ)√(ρs/N)‖xk‖ − (σmax/√k)‖xk‖.

We conclude that for large enough k,

‖A(wk ⊙ xk)‖² ≥ ( (1 − 1/√k)√(1 − δ)√(ρs/N) − σmax/√k )² ‖xk‖².

Therefore, (2.12) holds with

α1 := ( (1 − 1/√k)√(1 − δ)√(ρs/N) − σmax/√k )² / (m‖wk‖² maxij |aij|²)

and

α2 := ‖b‖²/(m‖wk‖² minij |aij|²).

Proof of Theorem 2.4.16

Recall that the entries of ak are ±1 Bernoulli random variables, and, therefore, ‖ak ⊙ wk‖² = ‖wk‖².
Then, for every l ∈ [N ], (2.3) can be rewritten as follows.xk+1l = xkl +〈ak , x− wk ⊙ xk〉‖wk‖2 akl wkl .Expanding (xk+1l − x∗l )2, we get(xk+1l − x∗l )2 = (xkl − x∗l )2 +(〈ak , x∗ − wk ⊙ xk〉‖wk‖2 akl wkl)2+ 2(xkl − x∗l )〈ak , x− wk ⊙ xk〉‖wk‖2 akl wkl= (xkl − x∗l )2 + (wkl )2(〈ak , x∗ − wk ⊙ xk〉)2‖wk‖4+ 2(xkl − x∗l )akl wkl〈ak , x∗ − wk ⊙ xk〉‖wk‖2 .Taking the expectation from both sides, we getE(xk+1l − x∗l )2 = (xkl − x∗l )2 + (wkl )2‖x∗ − wk ⊙ xk‖2‖wk‖4 + 2(xkl − x∗l )wklx∗l − wkl xkl‖wk‖2 . (2.14)By rearranging terms, we getE(xk+1l −x∗l )2 = (1−2(wkl )2‖wk‖2 )(xkl −x∗l )2+(wkl )2‖wk‖4 ‖x∗−wk⊙xk‖2+2wkl (1− wkl )‖wk‖2 (xkl −x∗l )x∗l .(2.15)Bounds on the second and the third terms above in the following lemmas finish theproof.Lemma 2.4.22. Suppose that xk ∈ RN and wk is as in Algorithm 2.3. If ρ > 1 such thatρs > s, then for k ≥ N−ρss ,‖x∗ − wk ⊙ xk‖2 ≤(1 +1√ρ− 1)‖(x∗ − xk)Sk∪S∗‖2.This Lemma is proved in Section 2.4.4.Lemma 2.4.23. If Xkl := 2wkl (1−wkl )‖wk‖2 (xkl − x∗l )x∗l > 0, then for every j ∈ Sk ∩ Sc∗,(1− 2(wkl )2‖wk‖2)(xkl − x∗l )2 +Xkl ≤(1− 2(wkl )2‖wk‖2)(xkj − x∗j )2.The proof of this lemma can be found in Section of Lemma 2.4.22In this section, we bound ‖x∗ − wk ⊙ xk‖2. The proof relies on the following lemmas.Lemma 2.4.24. Let u, v ∈ RN . Then, for every t > 0,‖u+ v‖2 ≤ (1 + t)‖u‖2 + (1 + t−1)‖v‖2.Proof. Note that for every t > 0,0 ≤ ‖√tu− 1√tv‖2 = t‖u‖2 − 2〈u , v〉+ 1t‖v‖2.Therefore,2〈u , v〉 ≤ t‖u‖2 + 1t‖v‖2.Expanding ‖u+ v‖2 and the inequality above finishes the proof.‖u+ v‖2 = ‖u‖2 + ‖v‖2 + 2〈u , v〉 ≤ ‖u‖2 + ‖v‖2 + t‖u‖2 + 1t‖v‖2.Lemma 2.4.25. Suppose that xk, x∗, Sk and S are as above. Assume that Sck ∩ S∗ isnon-empty. Then,‖xkSck‖∞ ≤ 1√ρs− s‖(xk − x∗)Sk∩Sc∗‖.This inequality implies that‖xkSck∩S∗‖ ≤√s‖xkSck∩S∗‖∞ ≤√sρs− s‖(xk − x∗)Sk∩Sc∗‖and, similarly,‖xkSck∩Sc∗‖ ≤√N − ρsρs− s ‖(xk − x∗)Sk∩Sc∗‖.Proof. 
Recall that Sk contains indexes of the largest ρs entries of xk, and, therefore, themagnitude of each entry of xk whose index is in Sck is less or equal to the magnitude of anyentry whose index is in Sk ∩ Sc∗. Note that #(Sk ∩ Sc∗) ≥ #Sk −#S∗ = ρs− s. Therefore,‖xkSck‖2∞ ≤1ρs− s‖xkSk∩Sc∗‖2.53To bound ‖x∗ − wk ⊙ xk‖2, we expand the norm and use Lemma 2.4.24 as follows.‖x∗ − wk ⊙ xk‖2 = ‖(x∗ − xk)Sk‖2 + ‖(x∗ −1√kxk)Sck∩S∗‖2 +1k‖(x∗ − xk)Sck∩Sc∗‖2≤ ‖(x∗ − xk)Sk‖2 +((1 + t−1)‖(x∗ − xk)Sck∩S∗‖2 + (1 + t)(1−1√k)2‖xkSck∩S∗‖2)+1k‖(xk)Sck∩Sc∗‖2≤ ‖(x∗ − xk)Sk‖2 +((1 + t−1)‖(x∗ − xk)Sck∩S∗‖2 + (1 + t)‖xkSck∩S∗‖2)+1k‖(xk)Sck∩Sc∗‖2.Note that #Sck ∩ S∗ ≤ #S∗ = s. Using Lemma 2.4.25, we conclude that‖x∗ − wk ⊙ xk‖2 ≤ ‖(x∗ − xk)Sk‖2+((1 + t−1)‖(x∗ − xk)Sck∩S∗‖2 +1 + tρ− 1‖(x∗ − xk)Sk∩Sc∗‖2)+1kN − ρsρs− s ‖(xk)Sk∩Sc∗‖2= ‖(x∗ − xk)Sk∩S∗‖2+(1 +1 + tρ− 1)‖(x∗ − xk)Sk∩Sc∗‖2 + (1 + t−1)‖(x∗ − xk)Sck∩S∗‖2+1kN − ρsρs− s ‖(xk)Sk∩Sc∗‖2≤ max{1, 1 + 1 + tρ− 1 +1kN − ρsρs− s , 1 + t−1}‖(x∗ − xk)Sk∪S∗‖2.Using ρ > 1 and k ≥ N−ρss , then for t =√ρ− 1 > 0, we get‖x∗ − wk ⊙ xk‖2 ≤ max{1 +√ρρ− 1 +1ρ− 1 , 1 +1√ρ− 1}‖(x∗ − xk)Sk∩S∗‖2=(1 +1√ρ− 1)‖(x∗ − xk)Sk∩S∗‖2.Proof of Lemma 2.4.23Recall that Lemma 2.4.23 bounds Xkl := 2wkl (1−wkl )‖wk‖2 (xkl − x∗l )x∗l . Note that when wkl = 1or x∗l = 0, then this term is zero. Therefore, the only case we need to bound is whenwkl = 1/√k and x∗l 6= 0, i.e., when l ∈ Sck ∩ S∗.Suppose that l ∈ Sck ∩ S∗ and consider the case when Xkl > 0. In this case, all of belowhold simultaneously:541. x∗l 6= 0, and2. wkk =1√k, and3. sign(xkl − x∗l ) = sign(x∗l ).In particular, it implies that sign(xkl ) = sign(x∗l ). Therefore, |xkl −x∗l | ≤ |xkl | and sign(xkl −x∗l ) = sign(xkl ).Without loss of generality, we assume that x∗l > 0. 
Then, xkl > x∗l > 0 andXkl = 2wkl (1− wkl )‖wk‖2 (xkl − x∗l )(x∗l − xkl + xkl )= −2wkl (1− wkl )‖wk‖2 (xkl − x∗l )2 + 2wkl (1− wkl )‖wk‖2 xkl − x∗l )xkl= −2wkl (1− wkl )‖wk‖2 (xkl − x∗l )2 + 2wkl (1− wkl )‖wk‖2 |xkl − x∗l | · |xkl |.In the last equality we used that sign(xkl − x∗l ) = sign(xkl ). Therefore,(1− 2(wkl )2‖wk‖2)(xkl − x∗l )2 +Xkl≤(1− 2(wkl )2‖wk‖2)(xkl − x∗l )2 − 2wkl (1− wkl )‖wk‖2 (xkl − x∗l )2 + 2wkl (1− wkl )‖wk‖2 (xkl )2=(1− 2wkl‖wk‖2)(xkl − x∗l )2 + 2wkl (1− wkl )‖wk‖2 (xkl )2≤(1− 2(wkl )2‖wk‖2)(xkl )2,where in the last line we used that xkl > x∗ > 0, and, therefore, xkl − x∗l < xkl . Finally,note that l ∈ Sck, and, therefore, |xkl | is smaller than the magnitude of any |xkj | = |xkj − x∗|where j ∈ Sk ∩ Sc∗.2.5 Numerical experimentsIn this section, we support the theoretical findings of this chapter through numerical exper-iments. Experiment 1 (Section 2.5.1) compares the reconstruction error and the supportdetection time for the RK algorithm and gradient descent for overdetermined consistentlinear systems. Experiment 2 (Section 2.5.2) investigates the local convergence rate for55the SRK algorithm and compares it with other algorithms. Experiment 3 (Section 2.5.3)explores the convergence of the SRK algorithm in the L∞ norm and the element-wisedynamics of the algorithm.2.5.1 Experiment 1. The RK algorithm vs the gradient descent forsparse recovery and support detectionTheorem 2.3.3 proves the efficiency of using the RK algorithm for the support detectionof sparse signals from their consistent overdetermined measurements. In this experiment,we confirm this statement numerically by comparing the performance of the RK algorithmand the gradient descent. We chose the gradient descent for comparison because it is anefficient solver that does not utilize the sparsity of the underlying signal directly.We implement the following procedure. We draw a random matrix A ∈ R2000×600 withi.i.d. standard Gaussian entries and a unit-norm 30-sparse signal x∗. 
The RK algorithmis initialized with x0 = 0 and runs, as described in Algorithm 2.2. We run the gradientdescent for minimizing the function f(x) = 12‖Ax − b‖2 with initialization ~0 using thekernel trick: before doing any iterations, we compute the norm of the matrix A, the kernelATA/‖A‖2, and the vector AT b/‖A‖2. Then, the gradient descent update rule becomes asfollows.xk = xk−1 +AT b‖A‖2 −ATA‖A‖2 xk−1.As a downside, pre-computing the quantities delay the first iteration, and, in Figure 2.1,we include this delay in the total CPU time for the algorithm.In Figure 2.1a, we run the procedure above ten times and compare the reconstructionerrors of the RK and gradient descent algorithms as functions of the CPU time. In Figure2.1b, we run the procedure above 50 times and plot the CPU time required for the supportdetection (by considering the index set of top 60 entries of the current reconstructed signal)versus the smallest non-zero entry of the underlying sparse signal.Figure 2.1a confirms the theoretical findings: the gradient descent converges signifi-cantly faster than the RK algorithm but requires some amount of time to start the firstiteration. During that time, the RK algorithm can achieve significant progress in thereconstruction.Next, we address what algorithm performs better for support detection. Recall thatLemma 2.4.1 implies that the support detection is successful if the reconstruction error isless than (1−1/(√ρs− s+ 1+1))|x∗min|, where x∗min is the smallest in magnitude non-zeroentry of x∗. For the setting above, we conclude that if the reconstruction error is less than0.848|x∗min|, then the support is reconstructed correctly. Figure 2.1a implies that if |x∗min|is large enough, then the RK algorithm outperforms the gradient descent in the supportdetection. 
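The kernel-trick gradient descent used in this experiment can be sketched as follows. This is a minimal sketch rather than the experiment's exact code: the function name and the problem sizes are illustrative, and the fixed step size 1/‖A‖² is one natural reading of the update rule above.

```python
import numpy as np

def gd_kernel_trick(A, b, iters):
    """Gradient descent on f(x) = 0.5 * ||Ax - b||^2 with step size 1/||A||^2.

    The kernel A^T A / ||A||^2 and the vector A^T b / ||A||^2 are computed
    once up front; this delays the first iteration but makes every later
    update independent of the number of rows of A.
    """
    nrm2 = np.linalg.norm(A, 2) ** 2   # squared spectral norm of A
    K = A.T @ A / nrm2                 # precomputed kernel
    v = A.T @ b / nrm2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        # x_k = x_{k-1} + A^T b / ||A||^2 - (A^T A / ||A||^2) x_{k-1}
        x = x + v - K @ x
    return x
```

For a consistent system, the iteration x ↦ (I − K)x + v contracts toward the least-squares solution, which is exactly the underlying signal in this experiment.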
Figure 2.1b confirms this heuristic idea and also shows that when |x∗min| is small, the gradient descent may outperform the RK algorithm.

(a) Relative errors (b) CPU time for the support detection

Figure 2.1: Support detection for consistent overdetermined linear systems using the RK algorithm and the gradient descent. Here the matrix is 2000 × 600 Gaussian, and the signal is 30-sparse. Both algorithms are initialized with ~0. The gradient descent is run using the kernel trick.

We conclude that the early progress of the RK algorithm allows it to outperform the gradient descent in the support detection if the smallest element of the underlying sparse signal x∗ is rather large. Otherwise, the gradient descent outperforms the RK algorithm.

2.5.2 Experiment 2. Local convergence of the SRK algorithm

In this experiment, we investigate the tightness of the local convergence results provided by Corollary 2.4.6 and Theorem 2.4.7 and compare the performance of the SRK algorithm with other memory-efficient solvers appropriate for the given scenario. Recall that the local convergence occurs for consistent overdetermined linear systems, and, in this case, we compare the SRK algorithm with the RK, KZIHT, and SRK-IHT algorithms. If the measurement matrix satisfies the RIP, we compare the SRK algorithm with all of the algorithms above except the RK algorithm, which does not guarantee sparse recovery.

We run two different experiments: for over- and under-determined linear systems. We draw an m × n matrix and a unit-norm signal x∗ ∈ Σn20. For the overdetermined case (Figure 2.2a), m = 1000 and n = 350, and for the underdetermined setting (Figure 2.2b), m = 350 and n = 1000. For both cases, we use x∗ + η as a starting point, where η is random noise such that ‖η‖ = (1/4)|x∗min|. Here |x∗min| stands for the smallest in magnitude non-zero entry of x∗.
Then, in terms of Theorem 2.4.7, the SRK algorithm is in the local regime. We run the SRK, KZIHT, and SRK-IHT algorithms with the given starting point and estimated sparsity 80, and additionally the RK algorithm for the overdetermined case. We plot the reconstruction error versus the CPU time of the algorithms in Figure 2.2. Dashed lines show the linear convergence decay. The y-axis is in the logarithmic scale.

(a) For a 1000 × 350 Gaussian matrix and a unit-norm signal in Σ^350_20 (b) For a 350 × 1000 Gaussian matrix and a unit-norm signal in Σ^1000_20

Figure 2.2: Convergence rate in the local regime for (a) overdetermined, (b) underdetermined consistent linear systems. The estimated sparsity is 80. The dashed lines represent linear convergence for comparison. The y-axis is in the logarithmic scale.

For both over- and under-determined settings, Figure 2.2 implies that the SRK algorithm converges slower than the KZIHT algorithm, and at about the same rate as the SRK-IHT algorithm. For the overdetermined setting, note that the SRK converges significantly faster than the RK algorithm. Note that Remark 2.4.4 claims that the upper bounds on the convergence rates of the RK and SRK algorithms in the local regime are of the same order. Since the bound for the RK algorithm is relatively tight, we conclude that the bound in Theorem 2.4.2 is not tight in general.

2.5.3 Experiment 3. On the max-norm convergence and element-wise dynamics of the SRK algorithm

In Section 2.4.3, we discussed the element-wise dynamics of (xk − x∗)², which represents the reconstruction error. We conjectured that ‖xk − x∗‖∞ decays in expectation. We also used ‖(xk − x∗)Sk∪S∗‖2/√#(Sk ∩ S∗) to examine the element-wise dynamics.
In this experiment, we numerically investigate the values of ‖x^k − x^*‖_∞ and ‖(x^k − x^*)_{S_k ∪ S_*}‖_2 / √(#(S_k ∩ S_*)), as well as the element-wise dynamics of (x^k − x^*)^2, to support the conjecture.

We draw a 350 × 1000 matrix with i.i.d. standard Gaussian entries and a unit-norm 10-sparse signal x^*. Suppose that the estimated sparsity is 50; therefore, the overestimation factor ρ = 50/10 = 5 satisfies the assumption ρ > 4. We run the SRK algorithm with initialization ~0. We plot ‖x^k − x^*‖_∞ and ‖(x^k − x^*)_{S_k ∪ S_*}‖_2 / √(#(S_k ∩ S_*)) versus the iteration number in Figure 2.3. Also, we draw a random entry from the support set and a random entry outside the support set and plot (x^k_l − x^*_l)^2 for these entries. The y-axis is in the logarithmic scale.

Figure 2.3 shows that the max-norm of the reconstruction error decays. Also, ‖(x^k − x^*)_{S_k ∪ S_*}‖_2 / √(#(S_k ∩ S_*)) decays linearly in expectation. However, for both support and non-support elements, we observe large fluctuations of the values.

Figure 2.3: On element-wise convergence.

Chapter 3
The iterative reweighted Kaczmarz algorithm (IRWK)

This chapter is dedicated to a novel memory-efficient sparse recovery algorithm inspired by the IRLS. Section 3.1 proposes using the randomized Kaczmarz algorithm as a solver for the minimization problem in the IRLS. Section 3.2 motivates and states the IRWK algorithm. Sections 3.2.2, 3.2.3, and 3.2.4 establish the properties of the algorithm. Section 3.2.5 highlights the connection of the IRWK to other sparse recovery algorithms: the IHT, the IST, and the sparse Kaczmarz. The numerical performance of the algorithm is investigated in Section 3.3.

3.1 The IRLS using randomized reweighted Kaczmarz

Section 2.3 discusses applying the randomized Kaczmarz (RK) algorithm (Algorithm 2.2) for sparse recovery. Note that the RK algorithm does not directly utilize prior information on the sparsity of the signal to be recovered.
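For concreteness, the RK update can be sketched in a few lines. The following is an illustrative NumPy implementation (not the one used in the experiments); each step projects the iterate onto the hyperplane defined by a randomly drawn row, with rows sampled proportionally to their squared norms as in Algorithm 2.2:

```python
import numpy as np

def randomized_kaczmarz(A, b, x0, n_iter, rng=None):
    """Run n_iter randomized Kaczmarz steps for the system Ax = b."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    row_norms2 = (A ** 2).sum(axis=1)
    probs = row_norms2 / row_norms2.sum()
    for i in rng.choice(A.shape[0], size=n_iter, p=probs):
        # Project x onto the hyperplane {z : <A[i], z> = b[i]}.
        x += (b[i] - A[i] @ x) / row_norms2[i] * A[i]
    return x
```

Initialized with x0 = 0 on a consistent system, the iterates converge to the minimum 2-norm solution; note that no sparsity prior enters the update, which is the point made above.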
In this section, we consider using the RK algorithm in the implementation of the well-known sparse recovery algorithm IRLS to increase its space efficiency. More generally, in any algorithm that involves finding the minimum 2-norm solution of a consistent system of linear equations, the RK algorithm can be applied.

Recall that the IRLS (Algorithm 1.5) requires solving

x^{k+1} = arg min_{z : Az = b} ‖w^k ⊙ z‖   (3.1)

at each iteration, where x^k ∈ R^N is the current estimate of the true solution, and w^k ∈ R^N is the current weight vector determined by x^k and ε_k. We propose using the RK algorithm for this task. Note that Az = b is a consistent underdetermined linear system, but a direct application of the RK algorithm with x^0 = ~0 returns the minimum 2-norm solution instead of minimizing ‖w^k ⊙ z‖. Therefore, let W_k = diag(w^k)^{-1} and consider the substitution z = W_k z′ in (3.1), which gives

x^{k+1} = W_k arg min_{z′ : A W_k z′ = b} ‖z′‖_2.

The RK algorithm may be applied as a solver for this minimization problem, the solution of which can be expressed as x^{k+1} = W_k (A W_k)† b. Using the RK solver, the required working memory is O(N). Indeed, at each iteration, the algorithm requires access to a single row of the measurement matrix, to the signal and its weights, and to a few constants. Therefore, we conclude that the IRLS using the RK solver is a memory-efficient algorithm for sparse recovery problems.

Let us address the issue of the convergence rate of the RK iterates in this minimization problem. Recall that w^k_i = ((x^k_i)^2 + ε_k^2)^{-1/4}, where ε_k = min{ε_{k-1}, |x^{k+1}_{(ρs+1)}|/N}. The values of the entries of w^k vary significantly, so the columns of A W_k have very different norms, which, in turn, can cause the matrix A W_k to be nearly singular. Solving a linear system with such a matrix is a computationally hard problem.
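The effect is easy to reproduce numerically. The sketch below (hypothetical sizes and weights, chosen only for illustration) mimics a late IRLS iteration in which a few weighted columns are of order one and the rest are tiny; both the condition number and the quantity ‖AW‖_F^2/σ_min^2, which controls the RK convergence rate, blow up:

```python
import numpy as np

rng = np.random.default_rng(1)
m, N = 50, 200
A = rng.standard_normal((m, N))

# Late-stage-IRLS-like column weights: a few of order one, the rest tiny.
w = np.full(N, 1e-4)
w[:10] = 1.0
AW = A * w  # A @ diag(w), computed without forming the diagonal matrix

sv = np.linalg.svd(AW, compute_uv=False)
cond_A, cond_AW = np.linalg.cond(A), sv[0] / sv[-1]
rk_quantity = (AW ** 2).sum() / sv[-1] ** 2  # ‖AW‖_F^2 / σ_min^2(AW)
print(cond_A, cond_AW, rk_quantity)
```

On a typical draw, cond(AW) exceeds cond(A) by several orders of magnitude, so the expected number of RK iterations per digit of accuracy grows accordingly.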
In practice, if the largest entry of w^k is large enough, then ε_k is small, and we expect the support to be recovered at this iteration. However, for medium-size ε_k, we can still have an imbalanced A W_k (see Section 3.3.1).

3.2 The iteratively reweighted Kaczmarz algorithm (IRWK)

In this section, we propose a novel algorithm that has the memory efficiency of the RK algorithm and utilizes the sparsity of the signal. As before, the underdetermined measurement matrix is denoted by A ∈ R^{m×N}; let x^* ∈ Σ^N_s be the signal with support index set S, and let b = Ax^* be the measurements. In the compressed sensing scenario, we want to recover the sparse signal x^* from b in a computationally tractable way.

The IRLS using the RK algorithm is both time and memory efficient. In this subsection, we add further requirements to the algorithm and propose a novel algorithm for sparse recovery. Note that each time we start the RK algorithm for the inner iteration of the IRLS, we initialize it with ~0. However, a good initialization (warm start) of the algorithm dramatically improves the running time, both numerically and theoretically. Note that after the kth iteration of the IRLS, we know at least one solution of Ax = b, namely x^k, and we expect (at least, eventually) x^k_{S_k} ≈ x^*; i.e., we would like to use information regarding x^k when we choose how to initialize the RK algorithm for the (k + 1)th iteration.

In addition, we want the algorithm to be less sensitive to the values of x^k in the following sense. Note that in the IRLS, the weights at each iteration are determined by the value of each entry of x^k at that iteration, giving the weight w^k_i ≈ 1/√|x^k_i|. Suppose that we want to optimize the running time by early stopping of the RK algorithm, which, in turn, introduces some distortion into the true values of x^k.
Since the weights are inversely proportional to √|x^k_i|, such a distortion may perturb the weights significantly.

We propose the algorithm given in Algorithm 3.1, where the weight updating scheme {w_k}_{k≥0} is given in Section 3.2.1.

Algorithm 3.1 The iterative reweighted Kaczmarz algorithm (IRWK)
1: Inputs: A ∈ R^{m×N}, b, ρs (estimated sparsity such that ρs > s, ρ > 1).
2: Outputs: a solution to Ax = b whose sparsity does not exceed ρs.
3: Initialize: x^0 = A†b (any solution of Ax = b would also do); S_0 = supp(H_{ρs}(x^0)). Compute the diagonal weight matrix W_0 ∈ R^{N×N} whose entries are defined as follows: (W_0)_{ii} = 1 if i ∈ S_0, and (W_0)_{ii} = w_0 if i ∉ S_0.
4: For k = 0, 1, 2, ...:
   Run the RK algorithm for A W_k z = b with initialization x^k_{S_k} to get z^{k+1}, i.e.,
   z^{k+1} = x^k_{S_k} + (A W_k)†(b − A x^k_{S_k}).
   Compute
   x^{k+1} = W_k z^{k+1} = x^k_{S_k} + W_k (A W_k)†(b − A x^k_{S_k}).
   Note that A x^{k+1} = b. Let S_{k+1} = supp(H_{ρs}(x^{k+1})). Compute the values of the diagonal matrix W_{k+1} by the following formula: (W_{k+1})_{ii} = 1 if i ∈ S_{k+1}, and (W_{k+1})_{ii} = w_{k+1} if i ∉ S_{k+1}.

Note that the IRWK enjoys all the benefits of the RK algorithm: at each iteration, we need to keep only x^k and one row of the measurement matrix in the working memory. The algorithm does not require knowing the total number of measurements, which is beneficial for large-scale systems. Also, it is applicable to dynamic linear systems (where new measurements can be appended during the execution of the algorithm) and to online learning.

One may draw similarities between the IRLS and the IRWK: both algorithms use weighting that prioritizes the entries of x^k whose magnitude is large; both algorithms solve similar minimization problems at each iteration, albeit with different values of W_k and different initializations.
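One outer iteration of Algorithm 3.1 can be sketched in NumPy as follows. This is an illustrative version (function names are ours) in which the inner weighted least-norm problem is solved exactly by a pseudoinverse; the algorithm proper replaces that step with RK iterations on A W_k z = b:

```python
import numpy as np

def top_support(x, k):
    """Indices of the k largest-magnitude entries, i.e. supp(H_k(x))."""
    return np.argsort(np.abs(x))[-k:]

def irwk_step(A, b, x, rho_s, w):
    """One outer IRWK iteration; the inner minimization is done via pinv
    here for illustration instead of randomized Kaczmarz iterations."""
    N = A.shape[1]
    S = top_support(x, rho_s)
    W = np.full(N, w)
    W[S] = 1.0                   # diagonal of W_k
    xS = np.zeros(N)
    xS[S] = x[S]                 # warm start x^k restricted to S_k
    AW = A * W                   # A @ diag(W_k)
    return xS + W * (np.linalg.pinv(AW) @ (b - A @ xS))
```

Every iterate produced this way satisfies Ax = b exactly, and with the constant weight w = √(m/N) the off-support energy ‖x^k_{S_k^c}‖ is non-increasing, as established below.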
The IRWK weights are less sensitive to the specific values of the entries of x^k than the IRLS weights, which allows for early stopping, and the IRWK's initialization of each RK run with z^0 = x^k_{S_k} provides a warm start, especially if x^* ≈ x^k.

3.2.1 Specific choice of non-unit weight

We propose two different schemes for updating the weights w_k:

1. Constant (non-unit) weights w_k = √(m/N) for all k. This scheme guarantees efficient convergence of the RK algorithm (see Theorem 3.2.11 and the discussion in Section 3.2.3) and is motivated by keeping the matrix A W_k well-balanced (see Section 3.2.3). Note that this weight scheme can be used only if all singular values of A are of order √(N/m).

2. Dynamic (non-unit) weights w_k = √(‖x^k_{S_k^c}‖/‖x^*‖). This scheme performs well numerically, and the weights reflect how sparse the current signal x^k is. In particular, if all entries of x^k are of the same order of magnitude and ‖x^*‖ = 1, then w_k ≈ (m/N)^{1/4}. If x^k is sparse, then w_k = 0. This scheme requires an estimate of the norm of the sparse solution. If A satisfies the RIP, then the norm of x^* approximately equals the norm of b.

Note that such w_k are larger than the ε_k in the IRLS weights for nearly sparse signals. Indeed, if x^k is nearly sparse, the "small" IRLS weights would be of order ε_k ≈ ‖x^k_{S_k^c}‖_∞/(N‖x^k‖) (after weight normalization), while the weighting scheme above suggests w_k = √(‖x^k_{S_k^c}‖/‖x^*‖). The latter quantity is expected to be larger.

We investigate the performance of the algorithm for both schemes in the following sections.

3.2.2 Running time of the randomized Kaczmarz algorithm for each iteration of the IRWK

At each iteration k of the IRWK algorithm, described in Algorithm 3.1, we use the RK algorithm for solving A W_k z = b with initialization x^k_{S_k}. The algorithm relies on the fact that the RK algorithm inside the IRWK converges to

z^{k+1} = x^k_{S_k} + (A W_k)†(b − A x^k_{S_k}) = proj_{N(A W_k)}(x^k_{S_k}) + (A W_k)† b,

which is based on Theorem 2.1.1. The intuitive proof of this fact is the following. Let z^l denote the output of the lth Kaczmarz iteration.
Then z^l − z^{l−1} ∈ R(W_k A^T) for all l. Recall that z^0 = x^k_{S_k}; thus z^l − x^k_{S_k} ∈ R(W_k A^T). In other words, the output of the RK algorithm belongs to the range of W_k A^T (the row space of the weighted measurement matrix) shifted by the initial signal. Since every full row rank "wide" matrix Φ is a bijection between R(Φ^T) and R(Φ), there is a unique signal z^{k+1} of this form that satisfies A W_k z = b. Straightforward algebraic manipulations imply the formula for z^{k+1}.

Note that the projection onto the row range of a matrix Φ can be written as Φ†Φ. Therefore, if z satisfies A W_k z = b, then

z^{k+1} = proj_{N(A W_k)}(x^k_{S_k}) + proj_{R(W_k A^T)} z.   (3.2)

This approach heavily relies on the orthogonality of R(W_k A^T) and N(A W_k) and on the fact that their sum spans the whole R^N. In the IRWK, we are interested in x^k more than in z^k because x^k satisfies A x^k = b. Note that

x^{k+1} = W_k proj_{N(A W_k)}(x^k_{S_k}) + W_k (A W_k)† b.

This section investigates each iteration as a projection. Note that

W_k proj_{N(A W_k)}(x^k_{S_k}) ∈ N(A)   and   W_k (A W_k)† b ∈ R(W_k^2 A^T).

Unless W_k = I, N(A) and R(W_k^2 A^T) are not orthogonal.

Recall that, by construction, A x^k = b, so (3.2) implies that for all x such that Ax = b,

x^{k+1} = W_k proj_{N(A W_k)}(x^k_{S_k}) + W_k proj_{R(W_k A^T)} W_k^{-1} x.   (3.3)

Lemma 3.2.1. For every sequence of weights {w_k}, the sequence {x^k} generated by the IRWK (Algorithm 3.1) satisfies

w_k^2 ‖x^{k+1}_{S_k} − x^k_{S_k}‖^2 + ‖x^{k+1}_{S_k^c}‖^2 ≤ ‖x^k_{S_k^c}‖^2.

Proof. Note that x^k_{S_k} = W_k x^k_{S_k}. Subtracting x^k_{S_k} from both sides of (3.3), we conclude that

x^{k+1} − x^k_{S_k} = −W_k (I − proj_{N(A W_k)})(x^k_{S_k}) + W_k proj_{R(W_k A^T)} W_k^{-1} x.

Note that I − proj_{N(A W_k)} = proj_{R(W_k A^T)}. Let x = x^k. Using algebraic manipulations, we get

W_k^{-1}(x^{k+1} − x^k_{S_k}) = proj_{R(W_k A^T)}(W_k^{-1} x^k_{S_k^c}).

Since a projection does not increase the norm,

‖W_k^{-1}(x^{k+1} − x^k_{S_k})‖^2 ≤ ‖W_k^{-1} x^k_{S_k^c}‖^2.

Using that W_k^{-1} x^k_{S_k^c} = (1/w_k) x^k_{S_k^c} and multiplying both sides by w_k^2, we get

w_k^2 ‖x^{k+1}_{S_k} − x^k_{S_k}‖^2 + ‖x^{k+1}_{S_k^c}‖^2 ≤ ‖x^k_{S_k^c}‖^2.
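The limit formula z^{k+1} = proj_{N(AW_k)}(x^k_{S_k}) + (AW_k)† b that this argument establishes is easy to check numerically. In the hypothetical sketch below, Phi plays the role of the weighted matrix A W_k and z0 the role of the warm start:

```python
import numpy as np

rng = np.random.default_rng(3)
m, N = 30, 90
Phi = rng.standard_normal((m, N))   # full row rank "wide" matrix
b = Phi @ rng.standard_normal(N)    # consistent underdetermined system
z0 = rng.standard_normal(N)         # warm start

# Predicted limit: proj_{N(Phi)}(z0) + Phi^† b.
Pinv = np.linalg.pinv(Phi)
z_limit = (z0 - Pinv @ (Phi @ z0)) + Pinv @ b

# Plain randomized Kaczmarz started from z0.
z = z0.copy()
rn2 = (Phi ** 2).sum(axis=1)
for i in rng.choice(m, size=20000, p=rn2 / rn2.sum()):
    z += (b[i] - Phi[i] @ z) / rn2[i] * Phi[i]
print(np.linalg.norm(z - z_limit))  # close to zero
```

The iterates never leave the affine space z0 + R(Phi^T), and z_limit is the unique solution of Phi z = b in that space, which is exactly the bijection argument above.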
This lemma implies the following crucial property of the sequence {x^k}.

Theorem 3.2.2. For every choice of the sequence {w_k}, the sequence {x^k} generated by the IRWK algorithm (Algorithm 3.1) satisfies

‖x^{k+1}_{S_{k+1}^c}‖ ≤ ‖x^{k+1}_{S_k^c}‖ ≤ ‖x^k_{S_k^c}‖.

Remark 3.2.3. The signal x^k can be considered close to sparse if the norm of x^k excluding its largest entries is small, i.e., when ‖x^k_{S_k^c}‖ is small. Note that ‖x^k‖ is not close to zero because A x^k = b. The theorem implies that ‖x^k_{S_k^c}‖ is non-increasing; thus, as every non-increasing non-negative sequence converges, so does ‖x^k_{S_k^c}‖.

Remark 3.2.4. If w_k does not converge to zero, Lemma 3.2.1 and Theorem 3.2.2 imply that ‖x^{k+1}_{S_k} − x^k_{S_k}‖ → 0 as k → ∞.

Corollary 3.2.5. For all k, √(‖x^k_{S_k^c}‖/‖x^*‖) ≤ 1. Therefore, the non-unit weights in the dynamic weight scheme do not exceed 1. If the measurement matrix satisfies the RIP and its smallest non-zero singular value is at least c√(N/m), then

√(‖x^k_{S_k^c}‖/‖x^*‖) ≤ √((1 + δ)m/(c^2 N)).

Proof. Remark 3.2.3 implies that ‖x^k_{S_k^c}‖ ≤ ‖x^0_{S_0^c}‖. Clearly, ‖x^0_{S_0^c}‖ ≤ ‖x^0‖. Recall that x^0 = A†b is the minimum 2-norm solution of Ax = b. Therefore, the norm of x^0 is smaller than the norm of any other solution of Ax = b, in particular x = x^*. Thus,

‖x^k_{S_k^c}‖ ≤ ‖x^*‖.

The first claim of the corollary is proved. If, in addition, A satisfies the RIP and σ_min(A) ≥ c√(N/m), then

(1 + δ)‖x^*‖^2 ≥ ‖Ax^*‖^2 = ‖b‖^2 = ‖Ax^0‖^2 ≥ c^2 (N/m)‖x^0‖^2 ≥ c^2 (N/m)‖x^k_{S_k^c}‖^2.

Proof of Theorem 3.2.2. The first inequality follows from the fact that x^{k+1}_{S_{k+1}} is the best ρs-term approximation to x^{k+1}, so the approximation error ‖x^{k+1} − x^{k+1}_{S_{k+1}}‖ is smaller than ‖x^{k+1} − x^{k+1}_{S_k}‖. The second inequality follows from Lemma 3.2.1.

We finish this subsection by restating the IRWK reconstruction in terms of alternating projections. Note that (3.3) implies that for any solution x^* of Ax = b,

x^{k+1} − x^* = W_k proj_{N(A W_k)}(x^k_{S_k}) − W_k (I − proj_{R(W_k A^T)}) W_k^{-1} x^*
             = W_k proj_{N(A W_k)}(x^k_{S_k}) − W_k proj_{N(A W_k)} W_k^{-1} x^*
             = W_k proj_{N(A W_k)} W_k^{-1} (x^k_{S_k} − x^*).   (3.4)

Note that the projection onto the range of a matrix Φ equals Φ†Φ and that proj_{N(Φ)} = I − proj_{R(Φ^T)}. Therefore,

x^{k+1} − x^* = W_k (I − (A W_k)† A W_k) W_k^{-1} (x^k_{S_k} − x^*) = (I − W_k (A W_k)† A)(x^k_{S_k} − x^*).

The matrix I − W_k (A W_k)† A has two eigenspaces: N(A) with the corresponding eigenvalue 1 and R(W_k^2 A^T) with the corresponding eigenvalue 0.
Therefore, this matrix is the oblique projection onto N(A):

(I − W_k (A W_k)† A) x = arg min_{z ∈ N(A)} ‖x − z‖_{2,W_k^{-1}}

for all x, where ‖z‖_{2,W_k^{-1}} := ‖W_k^{-1} z‖. Note that this oblique projection can increase the Euclidean norm of its argument but does not increase the induced norm ‖·‖_{2,W_k^{-1}}. Therefore, the IRWK may be viewed as an algorithm that takes a solution x^k of Ax = b and projects it onto the set of sparse signals (computing x^k_{S_k}). Then, the algorithm enforces the next iterate to be a solution of Ax = b by (obliquely) projecting x^k_{S_k} − x^* onto N(A). Such a perspective is similar to the iterative hard thresholding algorithm (IHT), and we investigate this connection further in Section 3.2.5.

3.2.3 On the condition number of the weighted matrix and on the number of iterations of the randomized Kaczmarz algorithm

Algorithm 3.1 applies the RK algorithm to the minimization problem

arg min ‖z‖ subject to A W_k z = b.   (3.5)

The algorithm is initialized with x^k_{S_k}. In this section, we bound the number of iterations required by Algorithm 3.1 to acquire the solution of the minimization problem (3.5). Recall that Theorem 2.1.1 uses the quantity ‖A W_k‖_F^2 / σ_min^2(W_k A^T) to establish the convergence rate of the RK algorithm: a smaller condition number implies faster convergence.

The following lemma investigates the singular values of W_k A^T.

Lemma 3.2.6. Let A, W_k, and ρ be as in Algorithm 3.1. Suppose that the matrix A satisfies the RIP of order ρs with constant δ, and assume that the singular values of A^T are between σ_min and σ_max. Then, the singular values of W_k A^T are between w_k σ_min and √(1 + δ + w_k^2 σ_max^2).

Proof. The proof is based on the properties of singular values. For all x ∈ R^m,

‖W_k A^T x‖ ≥ w_k ‖A^T x‖ ≥ w_k σ_min ‖x‖;

this inequality yields the lower bound on the singular values of W_k A^T.

Next, we obtain the bound on the largest singular value of W_k A^T. Note that the largest singular value of W_k A^T equals the largest singular value of A W_k.
Using the triangle inequality, we conclude that for all x ∈ R^N,

‖A W_k x‖ = ‖A x_{S_k} + w_k A x_{S_k^c}‖ ≤ ‖A x_{S_k}‖ + w_k ‖A x_{S_k^c}‖.

The first term on the right-hand side can be bounded using the RIP, and the second term can be bounded using the largest singular value of A, yielding

‖A W_k x‖ ≤ √(1 + δ) ‖x_{S_k}‖ + w_k σ_max ‖x_{S_k^c}‖ ≤ √((1 + δ) + w_k^2 σ_max^2) ‖x‖,

where the last inequality is obtained using the Cauchy–Schwarz inequality. We conclude that all singular values of W_k A^T are at most √((1 + δ) + w_k^2 σ_max^2).

Remark 3.2.7. For a matrix A that satisfies the RIP, the typical size of the smallest and the largest singular values is O(√(N/m)). In this case, if w_k ≥ √(m/N), then W_k A^T is a well-balanced matrix whose condition number is bounded by a constant independent of m and N. Therefore, for the constant non-unit weight scheme in the IRWK, the matrix A W_k is well-balanced. This observation motivates the choice of the non-unit weight in the constant weight scheme.

Theorem 2.3.1 uses the quantity σ_min^2(W_k A^T)/‖A W_k‖_F^2, and we bound it in the following lemma.

Lemma 3.2.8. Let A ∈ R^{m×N}, m ≤ N, be a measurement matrix whose entries a_{ij} are i.i.d. sub-Gaussian random variables with sub-Gaussian norm K, mean zero, and variance one. Then,

σ_min^2(W_k A^T) / ‖A W_k‖_F^2 ≥ c^2 w_k^2 N / (ρs·m + w_k^2 (N − ρs)m)

with overwhelming probability with respect to m.

Remark 3.2.9. Note that the lower bound for σ_min^2(W_k A^T)/‖A W_k‖_F^2 given above is increasing in w_k. If w_k = O(1) and m ≪ N, the bound becomes O(1/m). If w_k ≥ √(m/N), then

σ_min^2(W_k A^T) / ‖A W_k‖_F^2 ≥ c / (ρs + (1 − ρs/N)m).

Remark 3.2.10. Suppose that A is as in Lemma 3.2.8 and K does not depend on m. Note that A itself does not satisfy the RIP with overwhelming probability, but the scaled matrix (1/√m)A satisfies the property. The scaling of A does not change the bound in Lemma 3.2.8, so the same bound is applicable if the variance of the entries is not one (note that K must be adjusted accordingly).

Proof. Lemma 3.2.6 provides a bound on the smallest singular value of W_k A^T in terms of the smallest singular value of A^T (denoted by σ_min). Since the entries of A are i.i.d.
sub-Gaussian random variables and m ≪ N, Theorem 1.3.4 proves that the smallest singular value of A^T is at least c_1 √N with overwhelming probability. Therefore,

σ_min^2(W_k A^T) ≥ c_1^2 w_k^2 N

with overwhelming probability.

To bound the Frobenius norm, we expand the expression based on W_k:

‖A W_k‖_F^2 = Σ_{i=1}^m Σ_{j ∈ S_k} a_{ij}^2 + w_k^2 Σ_{i=1}^m Σ_{j ∉ S_k} a_{ij}^2.

Since the a_{ij} are sub-Gaussian random variables, the a_{ij}^2 are sub-exponential random variables whose sum may be bounded using Theorem 1.3.3. Here, E a_{ij}^2 = 1 and

‖a_{ij}^2 − 1‖_{ψ1} ≤ ‖a_{ij}^2‖_{ψ1} + 1 = √(‖a_{ij}‖_{ψ2}) + 1 = √K + 1.

Theorem 1.3.3 implies that

P( |Σ_{i=1}^m Σ_{j ∈ S_k} a_{ij}^2 − ρs·m| > ρs·m/2 ) ≤ 2 exp(−c min{ ρ^2 s^2 m^2 / (4(√K + 1)^2), ρ^2 s^2 m^2 / (2(√K + 1)) }).

Since 2(√K + 1) > 1, the bound above may be simplified to

P( |Σ_{i=1}^m Σ_{j ∈ S_k} a_{ij}^2 − ρs·m| > ρs·m/2 ) ≤ 2 exp(−c ρ^2 s^2 m^2 / (4(√K + 1)^2)).

Similarly, Theorem 1.3.3 implies that

P( |Σ_{i=1}^m Σ_{j ∈ S_k^c} a_{ij}^2 − (N − ρs)m| > (N − ρs)m/2 ) ≤ 2 exp(−c (N − ρs)^2 m^2 / (4(√K + 1)^2)).

Therefore, with overwhelming probability (with respect to m),

‖A W_k‖_F^2 ≤ (3/2)(ρs·m + w_k^2 (N − ρs)m).

The bounds on the smallest singular value and the Frobenius norm imply that, with overwhelming probability with respect to m, the following bound holds:

σ_min^2(W_k A^T) / ‖A W_k‖_F^2 ≥ c̃ w_k^2 N / (ρs·m + w_k^2 (N − ρs)m).

Theorem 3.2.11. Suppose that the entries of the matrix A ∈ R^{m×N} are i.i.d. mean zero, variance 1/m, sub-Gaussian random variables with sub-Gaussian norm at most K/√m. Suppose that A satisfies the RIP of order ρs with constant δ, and that the smallest singular value of A^T is at least c√(N/m). Suppose that we run the kth iteration of Algorithm 3.1 and, specifically, solve the minimization problem (3.5) using the RK algorithm (Algorithm 2.2) with initialization x^k_{S_k}. Then, it suffices to run

− 1/log(1 − c/(ρs·m/(w_k^2 N) + (1 − ρs/N)m)) · log(‖x^k_{S_k^c}‖^2 / (w_k^2 ε ν^2))

iterations to guarantee that, with probability at least 1 − ε − 2 exp(−c′m), the output z^{k+1} of the minimization problem is within distance ν from proj_{N(A W_k)}(x^k_{S_k}) + (A W_k)† b. Here c′ = c′(ρ, s, K) > 0.

Proof.
Theorem 2.3.1 and Lemma 3.2.8 imply that

− 1/log(1 − c/(ρs·m/(w_k^2 N) + (1 − ρs/N)m)) · log(‖x^k_{S_k} − z^{k+1}‖^2 / (ε ν^2))   (3.6)

iterations suffice to guarantee the convergence of the kth iteration of the IRWK with probability at least 1 − ε − c_1 exp(−c_2 m). Here z^{k+1} is the signal to which the RK algorithm converges:

z^{k+1} = x^k_{S_k} + (A W_k)†(b − A W_k x^k_{S_k}).

Recall that A x^k = b for all k and, by the definition of W_k, W_k x^k_{S_k} = x^k_{S_k} and W_k x^k_{S_k^c} = w_k x^k_{S_k^c}. Therefore,

z^{k+1} − x^k_{S_k} = (A W_k)†(A x^k − A x^k_{S_k}) = (A W_k)† A x^k_{S_k^c} = (1/w_k)(A W_k)† A W_k x^k_{S_k^c} = (1/w_k) proj_{R(W_k A^T)}(x^k_{S_k^c}).

Note that a projection does not increase the ℓ_2-norm; thus ‖proj_{R(W_k A^T)}(x^k_{S_k^c})‖ ≤ ‖x^k_{S_k^c}‖. Therefore,

‖z^{k+1} − x^k_{S_k}‖ ≤ (1/w_k) ‖x^k_{S_k^c}‖.

Plugging this inequality into (3.6) implies the statement of the theorem.

Corollary 3.2.12 (Number of Kaczmarz iterations for the constant weight scheme). If w_k = √(m/N) for a given k, then the sufficient number of iterations to reach accuracy ν with probability at least 1 − ε − c_1 exp(−c_2 m) provided by Theorem 3.2.11 is

− 1/log(1 − c/(ρs + (1 − ρs/N)m)) · log(N ‖x^k_{S_k^c}‖^2 / (m ε ν^2)).

Note that the argument of the logarithm in the denominator is of order 1 − c′/m provided that ρs < m < N. This quantity determines the convergence rate of the RK algorithm and is of the same scale as for the RK algorithm without weights. This performance is achieved because this weight selection scheme is tuned to keep A W_k well-balanced.

Corollary 3.2.13 (Number of Kaczmarz iterations for the dynamic non-unit weight scheme). If w_k = √(‖x^k_{S_k^c}‖) > 0 for a given k (i.e., the dynamic scheme with ‖x^*‖ = 1), then the sufficient number of iterations of the RK algorithm provided by Theorem 3.2.11 is

− 1/log(1 − c/(ρs·m/(‖x^k_{S_k^c}‖N) + (1 − ρs/N)m)) · log(‖x^k_{S_k^c}‖ / (ε ν^2)).

Remark 3.2.14. Note that as ‖x^k_{S_k^c}‖ → 0, the bound above becomes infinitely large. However, when ‖x^k_{S_k^c}‖ is very small, x^k is an approximately sparse solution of Ax = b, and therefore x^k ≈ x^*. The rigorous statement follows.

Note that for all k, A x^k = b = A x^*. Therefore, A(x^k_{S_k} − x^*) = −A x^k_{S_k^c}. Suppose that A satisfies the RIP of order ρs + s with constant δ.
Then,

(1 − δ)‖x^k_{S_k} − x^*‖^2 ≤ ‖A(x^k_{S_k} − x^*)‖^2 = ‖A x^k_{S_k^c}‖^2 ≤ σ_max^2 ‖x^k_{S_k^c}‖^2.

Here σ_max stands for the largest singular value of A. The inequality above implies that

‖x^k_{S_k^c}‖ ≥ (√(1 − δ)/σ_max) ‖x^k_{S_k} − x^*‖.

We conclude that a small ‖x^k_{S_k^c}‖ implies a small reconstruction error, and, in this case, one may stop running the IRWK.

In particular, suppose that accuracy ν suffices for the sparse recovery, i.e., we can stop the reconstruction once ‖x^k_{S_k} − x^*‖ ≤ ν. Then, once ‖x^k_{S_k^c}‖ ≤ (√(1 − δ)/σ_max) ν, we can stop the algorithm and use x^k_{S_k} as the reconstructed signal. In this case, for each k, the number of Kaczmarz iterations will not exceed

− 1/log(1 − c/(ρs·m/(‖x^k_{S_k^c}‖N) + (1 − ρs/N)m)) · log(√(1 − δ) / (σ_max ε ν)).

3.2.4 Local linear convergence

In this section, we focus on the convergence rate of the IRWK algorithm when the reconstructed signal after k iterations is close to the underlying sparse solution. Note that the IRWK algorithm produces two different approximations: x^k and x^k_{S_k}. The iterate x^k satisfies A x^k = b = A x^*, and therefore, even if we stop the algorithm early, the current value of x^k is a solution of the linear system. If x^k is considered to be the output of the algorithm, then ‖x^k − x^*‖ is the reconstruction error; this quantity is traditional for the error analysis. On the other hand, x^k_{S_k} is, by construction, sparse. If the IRWK is stopped early, x^k_{S_k} may be chosen over x^k when one prioritizes sparsity over satisfying Ax = b. The corresponding reconstruction error, ‖x^k_{S_k} − x^*‖, may be estimated using the RIP even without knowing the true value of x^*:

(1 + δ)^{-1} ‖A x^k_{S_k} − b‖^2 ≤ ‖x^k_{S_k} − x^*‖^2 ≤ (1 − δ)^{-1} ‖A x^k_{S_k} − b‖^2.

Therefore, the following theorem considers the case when at least one of the reconstruction errors is small enough. In this case, the theorem implies linear convergence.

Theorem 3.2.15. Suppose that A satisfies the RIP of order ρs with constant δ, ρ > 1. Let x^* be an s-sparse signal and let b = A x^*. Let x^*_min be the least (in magnitude) non-zero entry of x^*. Suppose that for some k ≥ 0, at least one of the two following statements holds:
1. ‖x^k − x^*‖ ≤ |x^*_min|/2, or
2. ‖x^k_{S_k} − x^*‖ ≤ |x^*_min|/2.

Then,

‖x^{k+1}_{S_{k+1}} − x^*‖ ≤ (1 + (1 − δ)/(w_k^2 σ_max^2))^{-1/2} ‖x^k_{S_k} − x^*‖.

Since w_k < 1 for both non-unit weight updating schemes, the convergence rate is linear.

Remark 3.2.16. For the constant weight scheme, w_k = √(m/N). Using that σ_max ≈ 1/w_k, we conclude that the convergence factor is approximately (2 − δ)^{-1/2}.

For the dynamic weight scheme,

w_k^2 = ‖x^k_{S_k^c}‖/‖x^*‖ = ‖x^k − x^k_{S_k}‖/‖x^*‖ ≤ ‖x^k − x^*‖/‖x^*‖,

where we used that x^k_{S_k} is the best (ρs)-term approximation to x^k, and, therefore, x^k_{S_k} is closer to x^k than x^* is. If ‖x^k − x^*‖ < |x^*_min|/2, then

(1 + (1 − δ)/(w_k^2 σ_max^2))^{-1/2} ≤ (1 + 2(1 − δ)‖x^*‖/(|x^*_min| σ_max^2))^{-1/2}.

Therefore, the convergence rate is linear.

The proof of Theorem 3.2.15 is supported by the following lemmas.

Lemma 3.2.17. Let x^* be an s-sparse signal and let S_k be the index set of the top ρs entries of x^k, where ρ > 1 so that ρs > s. Denote the least (in magnitude) non-zero entry of x^* by x^*_min. If ‖x^k − x^*‖ ≤ |x^*_min|/2 or ‖x^k_{S_k} − x^*‖ ≤ |x^*_min|/2, then S_k contains the support of x^*. Moreover, ‖x^k − x^*‖ ≤ |x^*_min|/2 implies ‖x^k_{S_k} − x^*‖ ≤ |x^*_min|/2.

Proof. Suppose that ‖x^k − x^*‖ ≤ |x^*_min|/2. Then, for every i ∈ [N], |x^k_i − x^*_i| ≤ |x^*_min|/2. Consider i ∈ S_*. Since x^*_min corresponds to the smallest (in magnitude) non-zero entry of x^* and x^*_i ≠ 0, |x^*_min| ≤ |x^*_i|. Therefore, |x^k_i − x^*_i| ≤ |x^*_i|/2. The inverse triangle inequality and the choice of x^*_min imply that for all i ∈ S_*,

|x^k_i| ≥ |x^*_i|/2 ≥ |x^*_min|/2.

If i ∉ S_*, then |x^k_i| = |x^k_i − x^*_i| ≤ ‖x^k − x^*‖ ≤ |x^*_min|/2. Therefore, the index set of the ρs largest entries contains the support index set of x^*, i.e., S_* ⊆ S_k. Thus, x^* = x^*_{S_k}, and

‖x^k_{S_k} − x^*‖ = ‖x^k_{S_k} − x^*_{S_k}‖ ≤ ‖x^k − x^*‖ ≤ |x^*_min|/2.

Now we show that if ‖x^k_{S_k} − x^*‖ ≤ |x^*_min|/2, then the active set S_k contains the true support S_*. Since ρs > s, there is an i_0 such that i_0 ∈ S_k but i_0 ∉ S_*. Then, ‖x^k_{S_k} − x^*‖ ≤ |x^*_min|/2 implies that |x^k_{i_0}| ≤ |x^*_min|/2. Recall that S_k contains the indices of the top entries of x^k and i_0 ∈ S_k.
Then, for all i ∈ S_k^c, |x^k_i| ≤ |x^k_{i_0}| ≤ |x^*_min|/2. For every i ∈ S_*,

|x^k_i| ≥ |x^*_i| − |x^k_i − x^*_i| ≥ |x^*_min| − |x^*_min|/2 = |x^*_min|/2.

Therefore, S_* ⊂ S_k.

Lemma 3.2.18. Suppose that for some x^k, x^{k+1}, and x^*, the corresponding index sets satisfy S_* ⊆ S_k and S_* ⊆ S_{k+1}, with s = #S_* ≤ #S_k = #S_{k+1} = ρs. Then, for any w_k ≤ 1,

‖x^{k+1}_{S_{k+1}} − x^*‖^2 + w_k^{-2} ‖x^{k+1}_{S_{k+1}^c}‖^2 ≤ ‖x^{k+1}_{S_k} − x^*‖^2 + w_k^{-2} ‖x^{k+1}_{S_k^c}‖^2.

Proof. Note that S_* ⊆ S_k and S_* ⊆ S_{k+1}, and therefore ‖x^{k+1}_{S_*} − x^*‖^2 appears on both sides of the statement of the lemma. Hence, the statement of the lemma is equivalent to

‖x^{k+1}_{S_{k+1} ∩ S_*^c}‖^2 + w_k^{-2} ‖x^{k+1}_{S_{k+1}^c ∩ S_*^c}‖^2 ≤ ‖x^{k+1}_{S_k ∩ S_*^c}‖^2 + w_k^{-2} ‖x^{k+1}_{S_k^c ∩ S_*^c}‖^2.   (3.7)

Also, note that the conditions of the lemma imply that #(S_k ∩ S_*^c) = #(S_{k+1} ∩ S_*^c). Then, the left-hand side can be viewed as a squared weighted norm of x^{k+1}_{S_*^c} in which the largest #(S_{k+1} ∩ S_*^c) entries of the vector have weight 1 and the other entries have weight w_k^{-1}. Since w_k^{-1} ≥ 1, reassigning the weights to different entries can only increase the norm, and therefore the right-hand side of (3.7) is at least as large as the left-hand side.

Proof of Theorem 3.2.15. Lemma 3.2.17 shows that the second inequality in the condition of the theorem holds whenever the first one does. Therefore, WLOG, assume that

‖x^k_{S_k} − x^*‖ ≤ |x^*_min|/2.

Then, Lemma 3.2.17 implies that S_* ⊂ S_k. Rearranging the terms in (3.4) yields

W_k^{-1}(x^{k+1} − x^*) = proj_{N(A W_k)} W_k^{-1}(x^k_{S_k} − x^*) = proj_{N(A W_k)}(x^k_{S_k} − x^*).

Since a projection does not increase the ℓ_2 norm,

‖x^{k+1}_{S_k} − x^*‖^2 + w_k^{-2} ‖x^{k+1}_{S_k^c}‖^2 = ‖W_k^{-1}(x^{k+1} − x^*)‖^2 ≤ ‖x^k_{S_k} − x^*‖^2.   (3.8)

Therefore, using that w_k^{-1} ≥ 1, we conclude that ‖x^{k+1} − x^*‖^2 ≤ ‖x^k_{S_k} − x^*‖^2 ≤ |x^*_min|^2/4. By Lemma 3.2.17, S_* ⊂ S_{k+1}. Then, using Lemma 3.2.18, (3.8) implies that

‖x^{k+1}_{S_{k+1}} − x^*‖^2 + w_k^{-2} ‖x^{k+1}_{S_{k+1}^c}‖^2 ≤ ‖x^k_{S_k} − x^*‖^2.   (3.9)

One may conclude that {‖x^t_{S_t} − x^*‖^2}_{t≥k} is a non-increasing sequence and that S_* ⊂ S_t for all t ≥ k.

Now we establish the convergence rate.
Denote the largest singular value of A by σ_max. Then, by the definition of singular values and the RIP,

‖x^{k+1}_{S_{k+1}^c}‖^2 ≥ (1/σ_max^2) ‖A x^{k+1}_{S_{k+1}^c}‖^2 = (1/σ_max^2) ‖A(x^{k+1}_{S_{k+1}} − x^*)‖^2 ≥ ((1 − δ)/σ_max^2) ‖x^{k+1}_{S_{k+1}} − x^*‖^2.   (3.10)

Combining with (3.9), we get

(1 + (1 − δ)/(w_k^2 σ_max^2)) ‖x^{k+1}_{S_{k+1}} − x^*‖^2 ≤ ‖x^k_{S_k} − x^*‖^2.

Algebraic manipulations finish the proof.

3.2.5 Connection with the IHT and compressibility of solutions

In this section, we investigate the connection of the IRWK algorithm (Algorithm 3.1) with another algorithm in compressed sensing, specifically the IHT (Algorithm 1.3).

Throughout this section, we assume that A satisfies the RIP of order ρs + s with constant δ, and that there is an s-sparse solution x^* satisfying A x^* = b. Recall that uniqueness of an s-sparse solution can be guaranteed using the RIP of the matrix A (see Section 1.2.1). Recall the IRWK update rule:

x^{k+1} = x^k_{S_k} − W_k^2 A^T (A W_k^2 A^T)^{-1} A (x^k_{S_k} − x^*).

Note that A W_k^2 A^T is a square, symmetric, and positive semi-definite matrix, and, therefore, (A W_k^2 A^T)^{1/2} is well-defined. Let

A_k := (A W_k^2 A^T)^{-1/2} A.

Then, the update rule may be rewritten as follows:

x^{k+1} = x^k_{S_k} − W_k^2 A_k^T A_k (x^k_{S_k} − x^*).   (3.11)

And, finally,

x^{k+1}_{S_{k+1}} = H_{ρs}( x^k_{S_k} − W_k^2 A_k^T A_k (x^k_{S_k} − x^*) ).

This formula strongly resembles the update rule of the IHT (Algorithm 1.3). Recall that for the IHT algorithm, we usually assume that the measurement matrix satisfies the RIP. Therefore, let us bound the RIP constant of A_k.

Proposition 3.2.19. Suppose that A ∈ R^{m×N}, m ≤ N, is a full row rank matrix that satisfies the RIP of order t with constant δ. Denote the smallest and the largest singular values of A by σ_min and σ_max, respectively. Let S ⊂ [N] be such that 0 < #S = ρs ≤ t, and let W be a diagonal matrix such that W_{ii} equals 1 if i ∈ S and equals w ∈ (0, 1) otherwise. Then, the two following statements hold.

1. The singular values of Ã := (A W^2 A^T)^{-1/2} are between (1 + δ + w^2 σ_max^2)^{-1/2} and (w σ_min)^{-1}.

2.
(A W^2 A^T)^{-1/2} A = Ã A satisfies the RIP of order t with constant

δ̃ = max{ 1 − (1 − δ)/(1 + δ + w^2 σ_max^2), (1 + δ)/(w^2 σ_min^2) − 1 }.

Proof. Using the singular value decomposition, we conclude that the singular values of Ã = ((A W)(A W)^T)^{-1/2} are the reciprocals of the singular values of W A^T. Lemma 3.2.6 proves that the singular values of W A^T are between w σ_min and √(1 + δ + w^2 σ_max^2). Therefore, the singular values of Ã = (A W^2 A^T)^{-1/2} satisfy

(1 + δ + w^2 σ_max^2)^{-1/2} ≤ σ_min(Ã) ≤ σ_max(Ã) ≤ (w σ_min)^{-1}.

The first statement of the proposition is proved.

Using the singular values of Ã and the RIP of A, we conclude that for every x ∈ Σ^N_t,

‖Ã A x‖^2 ≤ σ_max^2(Ã) ‖Ax‖^2 ≤ σ_max^2(Ã)(1 + δ)‖x‖^2

and

‖Ã A x‖^2 ≥ σ_min^2(Ã) ‖Ax‖^2 ≥ σ_min^2(Ã)(1 − δ)‖x‖^2.

Then, using the first part of the proposition, we conclude the second statement of the proposition.

Corollary 3.2.20. Suppose that all singular values of A^T are between c_1 √(N/m) and c_2 √(N/m); this assumption holds for a broad class of sub-Gaussian random matrices, see Theorem 1.3.4. Suppose that w = √(m/N), which corresponds to the constant weight scheme in the IRWK. Then we conclude that (1) the singular values of A_k are between (1 + δ + c_2^2)^{-1/2} and c_1^{-1}, and (2) A_k satisfies the RIP of order ρs with constant

δ̃ = max{ 1 − (1 − δ)/(1 + δ + c_2^2), (1 + δ)/c_1^2 − 1 }.

Note that for a broad class of sub-Gaussian random matrices, Theorem 1.3.4 implies that in the case N ≫ m, c_1 ≈ 1^− and c_2 ≈ 1^+. In this case,

δ̃ ≈ max{ 2 − 3/(2 + δ), δ }.

Note that for all δ < 1, δ̃ < 1.

While the IRWK update rule is similar to the IHT update rule, these algorithms are dramatically different. First, note that A_k changes at every iteration, while the IHT is based on a matrix that does not depend on the iteration number. Secondly, the IHT requires A to satisfy the RIP with a small constant δ, while the RIP constant of A_k is typically δ̃ ≥ 2 − 3/(2 + δ) > 0.5. Thirdly, note that W_k^2 has two distinct weights, 1 and w_k^2 < 1, while the IHT stepsize λ scales all entries uniformly.
Finally, for the constant weight scheme in Algorithm 3.1, the majority of the diagonal entries of W_k^2 are m/N ≪ 1, while the typical stepsize of the IHT is close to one. Therefore, the IRWK and the IHT algorithms have similar update rules but are very different.

Suppose that A_k satisfies the RIP of order ρs + s with constant δ̃ and x^* ∈ Σ^N_s. We consider certain index sets on both sides of (3.11). First, consider the indices in S_k:

x^{k+1}_{S_k} − x^*_{S_k} = H_{S_k}(I − W_k^2 A_k^T A_k)(x^k_{S_k} − x^*).

Using (1.2), we conclude that

‖x^{k+1}_{S_k} − x^*_{S_k}‖ ≤ δ̃ ‖x^k_{S_k} − x^*‖.

Now consider the indices in S_k^c ∩ S_* if this set is non-empty. Using (1.2), we conclude that

‖x^{k+1}_{S_k^c ∩ S_*} − x^*_{S_k^c ∩ S_*}‖ ≤ (1 − w_k^2)‖x^k_{S_k} − x^*‖ + w_k^2 δ̃ ‖x^k_{S_k} − x^*‖.

Finally, consider an index set S ⊂ [N] such that #S = s and S ⊂ S_k^c ∩ S_*^c. Then,

‖x^{k+1}_S‖ = ‖x^{k+1}_S − x^*_S‖ = ‖H_S(I − W_k^2 A_k^T A_k)(x^k_{S_k} − x^*)‖
            = ‖H_S((x^k_{S_k} − x^*) − W_k^2 A_k^T A_k (x^k_{S_k} − x^*))‖
            = ‖H_S W_k^2 A_k^T A_k (x^k_{S_k} − x^*)‖ = w_k^2 ‖H_S A_k^T A_k (x^k_{S_k} − x^*)‖
            ≤ w_k^2 δ̃ ‖x^k_{S_k} − x^*‖.

The last inequality uses (1.2).

Note that, of the three upper bounds above on the entries of x^{k+1} − x^*, the third one is the smallest: it concerns the indices outside both the active set and the support of x^*. Of course, a smaller upper bound does not imply a smaller value in general. Nevertheless, we conjecture that the sequence {x^k − x^*} becomes more "compressible" as k grows. Here, by compressibility, we mean either of the ratios ‖(x^k − x^*)_{S_k^c ∩ S_*^c}‖/‖(x^k − x^*)_{S_k}‖ and ‖(x^k − x^*)_{S_k^c ∩ S_*^c}‖_∞/‖(x^k − x^*)_{S_k}‖. We support this conjecture with Experiment 3.3.4.

If the conjecture holds, then we gain insight into the action of the threshold operator. Recall that at each k, we select the top ρs entries of x^k.
The conjecture suggests that the entries of $x^k$ whose indices are outside both the active set and the support of $x^*$ are very small, and, therefore, are unlikely to enter the active set on the subsequent iteration.
3.3 Numerical performance
This section investigates the numerical performance of the IRLS using the RK algorithm and the IRWK algorithm introduced above. Experiment 1 (Section 3.3.1) considers the minimization problem in the IRLS and compares the performance of different solvers: the RK algorithm, the QR factorization method, and the Moore-Penrose pseudoinverse matrix. Experiment 2 (Section 3.3.2) compares the IRLS and the IRWK algorithms in the number of iterations and the number of RK iterations. Experiment 3 (Section 3.3.3) numerically establishes the convergence rate of the IRWK for various classes of random matrices, numbers of measurements, signal dimensions, and sparsity levels. Experiment 4 (Section 3.3.4) numerically confirms the theory and the hypothesis about the compressibility of the iterates.
3.3.1 Experiment 1. IRLS using randomized Kaczmarz algorithm.
In this experiment, we investigate the running time of individual iterations of the IRLS algorithm. Specifically, the $k$th iteration of the IRLS algorithm requires solving
$$\arg\min \|w_k \odot x\| \quad \text{subject to} \quad Ax = b,$$
or, equivalently,
$$\arg\min \|z\| \quad \text{subject to} \quad A(w_k \odot z) = b.$$
Therefore, the IRLS iterations can be implemented using the QR decomposition, the SVD, or the RK algorithm, among other methods. In this series of experiments, we illustrate that regardless of the specific implementation method, some iterations of the IRLS algorithm may consume a lot of the running time. This is because $A\,\mathrm{diag}(w_k)$ becomes nearly singular. These findings are illustrated in Figure 3.1.
Note that the smallest and the largest entries of $w_k$ are approximately $1/\sqrt{|x^k_{\max}|}$ and $\epsilon_k^{-1/2} \ge \sqrt{N/\|x^k_{S_k^c}\|_\infty}$, respectively, where $S_k$ is the index set of the largest (in magnitude) entries of $x^k$, and $x^k_{\max}$ is the largest entry of $x^k$.
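To see the scale of this weight spread, consider a hypothetical IRLS weight rule of the form $w_i = ((x_i)^2 + \epsilon^2)^{-1/4}$ (an assumption for illustration; the exact rule of Algorithm 1.5 may differ), whose smallest and largest entries behave exactly as described above:

```python
import numpy as np

x = np.zeros(1000)
x[:4] = [1.0, -0.8, 0.5, 0.3]           # a nearly sparse IRLS iterate

for eps in (1e-2, 1e-4, 1e-6):
    w = (x**2 + eps**2) ** -0.25         # hypothetical weight rule (assumption)
    # smallest weight ~ 1/sqrt(|x_max|), largest weight ~ eps^(-1/2)
    print(f"eps={eps:.0e}  min(w)={w.min():.3f}  max(w)={w.max():.1f}  "
          f"spread={w.max() / w.min():.1e}")
```

As $\epsilon_k$ shrinks, the spread $\max(w)/\min(w) \approx \epsilon_k^{-1/2}\sqrt{|x^k_{\max}|}$ blows up.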
Therefore, the matrix $A\,\mathrm{diag}(w_k)$ may be close to singular even if $x^k$ is only moderately compressible.
To illustrate this numerically, we generate a random unit-norm signal $x^* \in \Sigma^{1000}_{40}$ and draw a random matrix $A \in \mathbb{R}^{350\times 1000}$ whose entries are i.i.d. standard Gaussian random variables. We run the IRLS algorithm (Algorithm 1.5). For each iteration, we measure the smallest and the largest singular values of $\mathrm{diag}(w_k)A^T$ and its Frobenius norm. In Figure 3.1a, we plot the condition number of $\mathrm{diag}(w_k)A^T$ and the ratio between the Frobenius norm of $\mathrm{diag}(w_k)A^T$ and its smallest singular value against the reconstruction error in the log-log scale. Note that the condition number determines how well-conditioned the matrix is, and the ratio between the Frobenius norm of $\mathrm{diag}(w_k)A^T$ and its smallest singular value predicts the convergence rate of the RK algorithm. We run the procedure above ten times, and in Figure 3.1b we compare the relative errors of the IRLS algorithm versus CPU time when we use the QR decomposition, the SVD, and the RK algorithm as solvers for the minimization problem.
Figure 3.1a implies that $A\,\mathrm{diag}(w_k)$ eventually becomes nearly singular, and therefore, we expect many iterations to require significant running time. We confirm this in Figure 3.1b, where we observe that the decay of the reconstruction error eventually slows down, and the performance of the RK algorithm is affected more than that of the QR decomposition and the SVD.
Figure 3.1: The IRLS iterations for a 350 × 1000 Gaussian matrix and a 40-sparse unit-norm signal. (a) Condition numbers of $\mathrm{diag}(w_k)A^T$. (b) Convergence rate.
3.3.2 Experiment 2. The IRWK and the IRLS algorithms
The IRWK algorithm aims to resolve two issues of the IRLS algorithm: ensure that the matrix $AW_k$ is not nearly singular and provide a warm start for the minimization problem, and, thereby, improve the performance.
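Both the IRWK and the IRLS "inner" solves build on the basic randomized Kaczmarz row update, which can be sketched as follows (illustrative sizes and a seeded generator; a sketch, not the thesis's implementation):

```python
import numpy as np

def randomized_kaczmarz(A, b, x0, n_iter, rng):
    """Basic RK: sample row i with probability ||a_i||^2 / ||A||_F^2 and
    project the iterate onto the hyperplane {x : <a_i, x> = b_i}."""
    x = x0.copy()
    row_norms2 = np.einsum('ij,ij->i', A, A)
    idx = rng.choice(A.shape[0], size=n_iter, p=row_norms2 / row_norms2.sum())
    for i in idx:
        x += (b[i] - A[i] @ x) / row_norms2[i] * A[i]
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((300, 50))       # overdetermined consistent system
x_true = rng.standard_normal(50)
x_rk = randomized_kaczmarz(A, A @ x_true, np.zeros(50), 5000, rng)
print(np.linalg.norm(x_rk - x_true))     # small: RK converges linearly in expectation
```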
In this experiment, we numerically confirm that the IRWK algorithm outperforms the IRLS algorithm with the RK updates.
We run the following scheme 20 times, and for each run, we plot the result. We draw a $350 \times 1000$ matrix $A$ whose entries are i.i.d. standard Gaussian and a unit-norm signal $x^* \in \Sigma^{1000}_{40}$. Let $\rho = 2$. Then, we run the IRWK algorithm with constant and dynamic weights and the IRLS algorithm for $Ax = b = Ax^*$. Note that in all of the runs, we verify that the reconstructed signal coincides with $x^*$. Figure 3.2 compares the number of iterations made by each algorithm for each run. However, a straightforward comparison of the number of iterations does not imply that one algorithm is faster than another, as the running time of the iterations should be taken into account. Both the IRWK and the IRLS rely on the RK algorithm iterations, which we use to represent the running time. In Figure 3.2a, we compare the number of "outer" iterations, i.e., the total number of $k$ iterations, and in Figure 3.2b, we compare the total number of the RK iterations.
Based on Figure 3.2, we conclude that the number of "outer" iterations for the IRLS and the IRWK algorithms is comparable. However, the IRWK algorithm requires a significantly smaller number of RK iterations than the IRLS algorithm, and, therefore, is more computationally efficient.
Figure 3.2: Number of iterations of the IRLS and IRWK algorithms for a 350 × 1000 Gaussian matrix and 40-sparse signals. (a) The number of "outer" iterations. (b) The total number of the RK iterations.
3.3.3 Experiment 3. The performance of the IRWK algorithm
In this experiment, we investigate the reconstruction error of the IRWK algorithm as a function of the CPU time for several classes of random matrices and a broad range of hyperparameters.
Experiment 3a focuses on the dependence of the reconstruction error on the CPU time for three classes of random matrices, and Experiment 3b investigates the CPU time required by the IRWK to reach a specific level of accuracy. All experiments compare the performance of the IRWK algorithm with other memory-efficient algorithms (the IRLS algorithm using the RK algorithm, the SRK, the RASK, the KZIHT, and the SRK-IHT algorithms) to provide a fair comparison.
Experiment 3a. The IRWK algorithm for various classes of measurement matrices
In this experiment, we investigate the decay of the reconstruction error as the number of iterations increases.
Specifically, we consider three different classes of matrices: Gaussian, Bernoulli, and partial Fourier. Drawing a random Gaussian matrix means drawing a matrix whose entries are i.i.d. standard Gaussian random variables. Bernoulli random matrices are matrices whose entries are i.i.d. $\pm 1$. An $m \times N$ partial Fourier matrix, where $m < N$, is obtained by randomly picking $m$ rows of an $N \times N$ Fourier matrix. Let the number of measurements $m = 350$ and the dimension of the signal $N = 1000$; let the true sparsity be 2 and 40, with estimated sparsity 3 and 48, respectively. For these settings, we generate a unit-norm $s$-sparse signal $x$ and a random matrix of the corresponding type. We run the IRWK algorithm with constant and dynamic weights, the IRLS algorithm using the RK updates, the SRK, KZIHT, and SRK-IHT algorithms, and the RASK with $\lambda = 10$. We plot the reconstruction error versus the CPU time, with the reconstruction error in the logarithmic scale.
Note that the IRWK algorithm with both constant and dynamic weights successfully recovers the sparse signal.
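The three random matrix ensembles can be drawn as follows (a sketch at scaled-down sizes; normalization conventions may differ from the thesis, and the partial Fourier matrix is complex-valued):

```python
import numpy as np

rng = np.random.default_rng(2)
m, N = 35, 100                                       # scaled-down sizes

gaussian  = rng.standard_normal((m, N))              # i.i.d. N(0, 1) entries
bernoulli = rng.choice([-1.0, 1.0], size=(m, N))     # i.i.d. +-1 entries

# Partial Fourier: m rows picked at random from the N x N DFT matrix.
F = np.fft.fft(np.eye(N))
partial_fourier = F[rng.choice(N, size=m, replace=False)]

# Sanity check: the full DFT matrix is unitary up to the factor N.
print(np.allclose(F @ F.conj().T, N * np.eye(N)))
```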
The IRWK with dynamic weights outperforms the IRWK with constant weights. Both algorithms outperform the IRLS using RK updates and the RASK with $\lambda = 10$, and, for large enough CPU time, they surpass the SRK algorithm. Also, note that the observed convergence rate of the IRWK algorithm (with respect to CPU time) is faster than linear. We showed that the IRWK algorithm converges linearly in the local regime (Theorem 3.2.15), but the result of the theorem is not tight.
Figure 3.3: Reconstruction errors for various sparse recovery algorithms for a 350 × 1000 random matrix. We consider sparsity 2 and 40, and estimated sparsity 3 and 48, respectively. We draw a unit-norm $s$-sparse signal $x^*$, compute the corresponding measurements, and run the sparse recovery algorithms (IRWK with constant weights, IRWK with dynamic weights, SRK, RASK, IRLS using Kaczmarz, KZIHT, SRK-IHT). We plot the reconstruction error vs. CPU time with the reconstruction error in the logarithmic scale. Panels: (a) Gaussian matrix, $s=2$; (b) Bernoulli matrix, $s=2$; (c) Fourier matrix, $s=2$; (d) Gaussian matrix, $s=40$; (e) Bernoulli matrix, $s=40$; (f) Fourier matrix, $s=40$.
Figure 3.4: The CPU time required to reach the reconstruction error $10^{-5}$ for various settings. Unless specified otherwise, the number of measurements $m = 350$, the dimension of the signal $N = 1000$, the sparsity $s = 40$, and the estimated sparsity is $1.2s$. We draw an $m \times N$ $\pm 1$ Bernoulli random matrix $A$ and a unit-norm $s$-sparse signal $x^*$ and run each algorithm above for $Ax = b := Ax^*$. We measure the CPU time required to reach accuracy $10^{-5}$. If the CPU time exceeds 20 seconds, we assume that sparse recovery is unsuccessful. All plots are in the log-log scale. Panels: (a) dependence on $m$; (b) dependence on $N$; (c) dependence on $s$.
Experiment 3b. The convergence rate of the IRWK algorithm
In this experiment, we compare the numerical performance of various memory-efficient algorithms with the IRWK algorithm over a broad set of parameters.
This experiment may be split into three parts: varying $m$, $N$, and $s$. First, we fix $N = 1000$, $s = 40$, and the estimated sparsity $\rho s = 48$, and consider $m = 200, 220, \ldots, 500$. Secondly, we fix $m = 350$, $s = 40$, and the estimated sparsity $\rho s = 48$, and consider $N = 800, 900, \ldots, 2000$. Thirdly, we set $m = 350$, $N = 1000$, $\rho = 1.2$ and consider $s = 10, 12, \ldots, 80$, with estimated sparsity $\rho s$. For each of the settings above, we draw a $\pm 1$ Bernoulli matrix $A$ and a unit-norm $s$-sparse signal $x^*$. Let $b = Ax^*$. We run the IRWK, the SRK, the RASK with $\lambda = 10$, the IRLS with the RK algorithm, the KZIHT, and the SRK-IHT algorithms. We plot the CPU time required to reach a reconstruction error of at most $10^{-5}$ versus (a) the number of measurements in Figure 3.4a, (b) the dimension of the signal in Figure 3.4b, and (c) the sparsity of the signal in Figure 3.4c. All plots are in the log-log scale.
If the reconstruction error does not reach this threshold within 20 seconds, we do not plot it. For example, the CPU time of the IRLS algorithm using RK updates always exceeds 20 seconds.
We analyze the dependence of the CPU time on each hyperparameter ($m$, $N$, and $s$) separately. Figure 3.4a shows that the CPU time increases as $m$ grows. At first, this may seem counter-intuitive: more measurements should make the recovery process simpler and faster. The reason is that the running time of the RK algorithm used for the "inner" iterations, given in Theorem 3.2.11, increases as $m$ increases.
Figure 3.4b shows that the CPU time increases as $N$ increases. Note that the IRWK with constant weights and the KZIHT algorithms fail to converge for large enough $N$, but the IRWK with dynamic weights, the SRK, and the SRK-IHT converge. Among these three algorithms, the CPU time of the IRWK algorithm with dynamic weights grows the slowest, and, therefore, it is beneficial when the ambient dimension of the signal is very large.
When the sparsity increases, the IRWK algorithm performs well. Note that it successfully recovers the underlying signal in less time than the SRK and RASK algorithms (which also converge when the sparsity $s$ is large). The KZIHT algorithm outperforms the IRWK algorithm in running time. Still, the rate of change as $s$ grows allows us to conjecture that for extremely large $s$, the IRWK algorithm would outperform the KZIHT.
We conclude that the IRWK algorithm converges over a broad range of parameters, and in some circumstances, outperforms many memory-efficient sparse recovery algorithms.
3.3.4 Experiment 4. Compressibility of the IRWK approximation sequence
Theorem 3.2.2 states that $\{\|x^k_{S_k^c}\| = \|x^k - H_{\rho s}(x^k)\|\}$ is a non-increasing sequence. In Section 3.2.5, we suggest that the reconstruction sequence becomes more compressible, i.e., that the ratios $\{\|x^k_{S_k^c}\|/\|x^k_{S_k}\|\}$ and $\{\|x^k_{S_k^c}\|_\infty/\|x^k_{S_k}\|_\infty\}$ are decreasing.
This scenario is of interest when there is not enough time (or computational resources) to run the algorithm until convergence or support detection. Indeed, each iterate $x^k$ of the IRWK satisfies $Ax = b$, and we claim that, heuristically, each iteration brings the iterate closer to the sparse signals. In this experiment, we verify these statements numerically.
We draw a $350 \times 1000$ matrix with i.i.d. standard Gaussian entries and a 40-sparse signal $x^*$ in $\mathbb{R}^{1000}$. We run the IRWK with constant and dynamic weights, and plot $\|x^k_{S_k^c}\| = \|x^k - H_{\rho s}(x^k)\|$ for each iteration in Figure 3.5a, where the norm is the 2-norm and the max-norm. We also plot $\|x^k_{S_k^c}\|_\infty/\|x^k_{S_k}\|_\infty$ and $\|x^k_{S_k^c}\|_2/\|x^k_{S_k}\|_2$ in Figure 3.5b.
We confirm our theoretical finding that $\|x^k_{S_k^c}\|$ is non-increasing; in fact, it is strictly decreasing. Moreover, $\|x^k_{S_k^c}\|_\infty$ is also decreasing. As suggested in Section 3.2.5, the compressibility ratios of the iterates decrease as the iteration number $k$ increases.
Figure 3.5: Compressibility of the reconstruction sequence $\{x^k\}$ produced by the IRWK algorithm. Here the measurement matrix is a $350 \times 1000$ matrix with i.i.d. standard Gaussian entries and $x^* \in \Sigma^{1000}_{40}$ with $\|x\| = 1$ is drawn uniformly. The estimated sparsity is 48. (a) The max-norm and two-norm of $x^k - H_{\rho s}x^k$. (b) $\{\|x^k_{S_k^c}\|/\|x^k_{S_k}\|\}$, where the norm is either $L_\infty$ or $L_2$.
Chapter 4
Memoryless scalar quantization (MSQ) for frames and compressed sensing
Modern technology is overwhelmingly digital, which, in turn, requires all signals to be stored or transmitted as a finite sequence of digits.
In this chapter, we address the issue of quantizing or digitizing linear measurements of finite-dimensional signals, both in the overdetermined (frame theory) setting and in the underdetermined (compressed sensing) setting.
Consider generalized linear measurements $\{\langle a_j, x^* \rangle\}$, $j \in [m]$, which can be represented as $b = Ax^*$. We aim to map $b \in \mathbb{R}^m$ to a discrete set $\mathcal{A}^m$; we call such a mapping a quantizer, and $\mathcal{A}$ is called an alphabet. $\mathcal{A}$ is often assumed to be $\{-1, 1\}$ (1-bit quantization) or $\delta\mathbb{Z}$ (multi-bit quantization), where $\delta > 0$ is called the stepsize. In this chapter, we consider multi-bit quantization only. A natural requirement for quantizers is to allow computationally feasible, accurate reconstruction of the underlying signal. Therefore, we equip the quantizer $Q$ with a reconstruction scheme $\Delta$ and consider the reconstruction error
$$\|x^* - \Delta(Q(Ax^*))\|,$$
where $\Delta(Q(Ax^*))$ is the reconstructed signal. Note that due to the discreteness of the quantized measurements, exact reconstruction is not achievable in general, but we aim to make the error as small as possible.
The main focus of this chapter is the decay rate of the reconstruction error as the number of measurements increases. In Section 4.1, we review the literature on two quantization schemes, namely, memoryless scalar quantization (MSQ) and $\Sigma\Delta$ quantization, in the frame theory setting and in the compressed sensing setting (Sections 4.1.1 and 4.1.2, respectively). Results for MSQ for frames are stated and discussed in Section 4.2. Their extension to compressed sensing is presented in Section 4.3. The chapter concludes with numerical experiments in Section 4.4.
4.1 Introduction to quantization
Broadly speaking, there are two approaches to quantization. One approach introduces structure into the quantization error $Ax - Q(Ax)$ with the goal of improving the reconstruction error. As a downside, this approach requires analogue memory, which, in certain applications, is not desired or not available.
In such cases, one has to use memoryless quantizers such as the memoryless scalar quantization (MSQ), which is arguably the most intuitive quantization strategy and can be easily implemented in practice. To be more specific, let $q_\delta : \mathbb{R} \to \delta\mathbb{Z}$ be the uniform scalar quantizer with stepsize $\delta > 0$, where $q_\delta(y) := j\delta$ if $y \in (j\delta - \delta/2, j\delta + \delta/2]$, $j \in \mathbb{Z}$. The associated uniform MSQ with stepsize $\delta$ is $Q^{MSQ}_\delta : \mathbb{R}^m \to (\delta\mathbb{Z})^m$ such that the $j$th component of $Q^{MSQ}_\delta(u)$ satisfies $(Q^{MSQ}_\delta(u))_j = q_\delta(u_j)$. When there is no ambiguity, we shall use $Q$ to denote $Q^{MSQ}_\delta$.
Reconstruction of the signal from its quantized measurements can be roughly split into two cases: over- and under-determined measurements, which are reviewed in Sections 4.1.1 and 4.1.2, respectively.
4.1.1 Quantization for frames
Suppose that our signals of interest belong to $\mathbb{R}^n$, and let $E \in \mathbb{R}^{m\times n}$, $m \ge n$, be the analysis matrix of a frame for $\mathbb{R}^n$. The vector of frame coefficients of a signal $x$ is given by $b = Ex$, which we interpret as measurements of $x$. Clearly, $x$ can be recovered from $b$ exactly by means of linear reconstruction methods, i.e., $x = \tilde{E}b$ where $\tilde{E}$ is any left-inverse of $E$. Then, the reconstruction error becomes
$$\|x^* - \tilde{E}Q(Ex)\|.$$
In this section, we consider $\Sigma\Delta$ quantization and the MSQ, in turn.
$\Sigma\Delta$ quantization for frames
Suppose that the sensor device can store $r$ analogue values. Then, one popular quantization scheme is the $r$th order (greedy) $\Sigma\Delta$ quantization, as stated below. Let $b = Ex^* = (b_1, b_2, \ldots, b_m)^T$ and denote the quantized coefficients by $q = (q_1, q_2, \ldots, q_m)^T$:
$$q_i = Q\Big(\sum_{j=1}^{r}(-1)^{j-1}\binom{r}{j}u_{i-j} + b_i\Big),$$
$$u_i = \sum_{j=1}^{r}(-1)^{j-1}\binom{r}{j}u_{i-j} + b_i - q_i.$$
Note that by construction, $u_i \in (-\delta/2, \delta/2]$ for every $i$, so $\|u\|_\infty \le \delta/2$. Define $D = (D_{ij})$ by $D_{ij} = 1$ if $i = j$, $D_{ij} = -1$ if $i = j+1$, and $D_{ij} = 0$ otherwise. Then, $b - q = D^r u = Ex^* - D^r Q(D^{-r}Ex^*)$, and, therefore, $D^{-r}q = D^{-r}b - u$.
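The scalar quantizer $q_\delta$ and the greedy recursion above can be sketched directly (a minimal illustration with made-up coefficients; because $q_\delta$ maps onto all of $\delta\mathbb{Z}$, the state satisfies $u_i \in (-\delta/2, \delta/2]$ by construction for any $r$):

```python
import math

def q_uniform(y, delta):
    """q_delta: maps y to j*delta where y lies in (j*delta - delta/2, j*delta + delta/2]."""
    return delta * math.ceil(y / delta - 0.5)

def greedy_sigma_delta(b, delta, r=1):
    """Greedy r-th order Sigma-Delta quantization of the coefficient sequence b."""
    u = [0.0] * r                       # state (u_{i-1}, ..., u_{i-r}), zero-initialized
    q, states = [], []
    for bi in b:
        s = sum((-1) ** (j - 1) * math.comb(r, j) * u[j - 1]
                for j in range(1, r + 1)) + bi
        qi = q_uniform(s, delta)        # q_i
        u = [s - qi] + u[:-1]           # u_i = s - q_i, then shift the state
        q.append(qi)
        states.append(u[0])
    return q, states

delta = 0.25
q, states = greedy_sigma_delta([0.31, -0.12, 0.27, 0.05, -0.4], delta, r=2)
print(q, max(abs(ui) for ui in states))   # outputs lie in delta*Z; states bounded by delta/2
```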
Then, the reconstruction error is
$$\big\|\tilde{E}\big(Ex^* - D^r Q(D^{-r}Ex^*)\big)\big\|.$$
If we reconstruct the signal using the Moore-Penrose pseudoinverse, then the reconstruction error is of order $\Omega(2^{-c\lambda})$, where $c > 0$ is a constant and $\lambda := m/n$ is the "oversampling" rate [50, 77, 90]. In other words, when we increase the redundancy of the frame by taking more measurements, we expect the reconstruction error to decay exponentially.
Note that the reconstruction error is structured (see, e.g., [41] for details), and, as discussed in Section 1.1.4, alternative left inverses may outperform the Moore-Penrose pseudoinverse. Blum et al. [14] proposed to use the so-called Sobolev dual frames $(D^{-r}E)^\dagger D^{-r}$ as the decoder. In this case, the reconstruction error becomes
$$\|x^* - (D^{-r}E)^\dagger D^{-r} \cdot D^r Q(D^{-r}Ex^*)\| = \|x^* - (D^{-r}E)^\dagger Q(D^{-r}Ex^*)\|.$$
Yilmaz et al. [78] proved that for $E = (e_{ij}) \in \mathbb{R}^{m\times n}$ whose entries are i.i.d. standard Gaussian, the reconstruction error is of order $O\big((m/n)^{-\alpha(r-1/2)}\big)\delta$ with overwhelming probability for every $\alpha \in (0,1)$, provided that the oversampling rate $\lambda := m/n$ is sufficiently large. This result was extended to matrices with i.i.d. sub-Gaussian entries [89].
If the alphabet $\mathcal{A}$, $\#\mathcal{A} = L \ge 2$, can be selected as one wishes, then the smallest possible reconstruction error for Gaussian random frames is at most $\sqrt{n}\,L^{-(1-4/n)m/n}$ with probability at least $1 - \exp(-cm/n^2)$, provided that $m/n$ is large enough [39]. A similar bound, $c\sqrt{m}\,L^{-m/n}$, holds for Fourier frames [40].
MSQ for frames and the WNH
Note that the MSQ quantization error does not exhibit any obvious structure, and, therefore, a common approach in the literature is to use the Moore-Penrose pseudoinverse for the reconstruction. Recall that among all left-inverses of $E$, $E^\dagger$ has the smallest operator norm (see Section 1.1.2). As an alternative to linear reconstruction methods, one may consider consistent reconstruction methods, i.e., methods for which the reconstructed signal $x_Q$ satisfies $Q(Ex_Q) = Q(Ex)$.
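As a quick numerical check of the Sobolev dual construction above, one can verify that $(D^{-r}E)^\dagger D^{-r}$ is a genuine left inverse of $E$ (a sketch with hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 40, 5, 2                          # illustrative sizes

E = rng.standard_normal((m, n))             # Gaussian frame, full column rank a.s.
D = np.eye(m) - np.eye(m, k=-1)             # D_ij = 1 if i = j, -1 if i = j + 1
Dr_inv = np.linalg.matrix_power(np.linalg.inv(D), r)

sobolev_dual = np.linalg.pinv(Dr_inv @ E) @ Dr_inv
print(np.allclose(sobolev_dual @ E, np.eye(n)))   # a bona fide left inverse of E
```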
These schemes for the MSQ are nearly optimal in terms of accuracy, but computationally expensive [45, 46, 75, 129]. For frames consisting of (nearly) equal-norm vectors, the lower bound on the reconstruction error (for the MSQ quantization over all reconstruction methods) is $\Omega(\lambda^{-1})$ [75], where $\lambda := m/n$. Note that this bound is significantly worse than any bound provided above for the $\Sigma\Delta$ quantization scheme.
Let $\tilde{x} = E^\dagger Q(b) = E^\dagger Q(Ex)$. The corresponding reconstruction error is
$$\mathcal{E}(x) := \|x - \tilde{x}\| = \|x - E^\dagger Q(Ex)\| = \|E^\dagger (Ex - Q(Ex))\|. \tag{4.1}$$
The aim is to identify the dependence of $\mathcal{E}(x)$ on $\delta$, $n$, and $m$ for a given infinite family of frames parametrized by $m$ and $n$. It is often of interest to understand the worst-case error over a compact set $K$ of signals, i.e., $\sup_{x\in K}\mathcal{E}(x)$, the average $L_2$ error over such a set $K$, i.e., $\|\mathcal{E}\|_{L_2(K)}$, or the expected error when the signal is drawn randomly from $K$, i.e., $\mathbb{E}\,\mathcal{E}^2(x)$. The analysis of these error terms is not straightforward, especially when $m > n$. Perhaps the most common simplifying approach is to use the WNH: if $E$ is fixed and $x$ is random, the entries of the vector $Ex - Q(Ex)$ are assumed to be i.i.d. random variables, uniformly distributed in $(-\delta/2, \delta/2]$ (so, with mean 0 and variance $\delta^2/12$) [12], cf. [75, 76, 87]. Indeed, if we assume that the WNH holds, then the quantization error may be treated as random noise, and, using the calculations in Section 1.1.4, we conclude that
$$\mathbb{E}\,\mathcal{E}^2(x) = \|E^\dagger\|_F^2\,\delta^2/12, \tag{4.2}$$
where $\|\cdot\|_F$ denotes the Frobenius norm. Finally, recall that $\|E^\dagger\|_F \le \sqrt{n}\,\|E^\dagger\| = \sqrt{n}\,(\sigma_{\min}(E))^{-1}$, and for a vast class of matrices (e.g., matrices with independent isotropic sub-Gaussian random rows), $\sigma_{\min}(E) \ge C\sqrt{m}$ with high probability. For such matrices, the WNH allows us to conclude
$$\mathbb{E}\,\mathcal{E}^2(x) \le C\,\frac{n}{m}\,\delta^2, \tag{4.3}$$
where $C$ is independent of $m$ and $n$. The class of matrices for which (4.3) holds includes random partial Fourier matrices [131] and all matrices with independent isotropic sub-Gaussian random rows, e.g., [148, p. 232], including random matrices with i.i.d.
standard Gaussian or Bernoulli entries. Note that (4.3) becomes an equality if $E$ is a unit-norm tight frame; therefore, at least for such frames, the error bound is sharp under the WNH.
The WNH is rather successful for predicting the reconstruction error associated with hard-to-analyze quantization schemes [76, 87, 128, 153] and can be (approximately) justified in special cases [20, 76, 87]. Yet, the WNH is not rigorous and, at least in certain cases, not valid; see, e.g., [11, 87, 91, 153]. To our knowledge, there are only a few results in the literature that provide a rigorous analysis of $\mathcal{E}(x)$ that is not based on the WNH. In [153] and [164], the authors consider MSQ with a fixed $\delta > 0$ for quantizing expansions with respect to asymptotically equidistributed unit-norm tight frames for $\mathbb{R}^n$. Specifically, they investigate the decay of the reconstruction error $\mathcal{E}(x)$ as more measurements are taken, i.e., as $m$ increases. An analysis based on the WNH would suggest that the error should vanish as $m$ goes to infinity. However, [164] shows that for any given signal $x \in \mathbb{R}^n$, if the quantizer step size $\delta$ is sufficiently small (depending on $\|x\|$ and $n$, but independent of $m$),
$$\lim_{m\to\infty} \mathcal{E}(x) \ge C\,\frac{\delta^{\frac{n+1}{2}}}{\|x\|^{\frac{n-1}{2}}},$$
where $C = C(n) > 0$, implying that the error does not always tend to zero as $m \to \infty$. This shows that the WNH does not hold, at least in this particular setting.
While the WNH is not rigorous and not valid in all cases, its statement can be heuristically justified based on the asymptotic behavior of $Ex - Q(Ex)$ as $\delta$ approaches 0. For example, consider a deterministic full-rank matrix $E \in \mathbb{R}^{m\times n}$ and a random signal $x$ whose distribution is absolutely continuous with respect to the $n$-dimensional Lebesgue measure. Then, $\frac{1}{\delta}(Ex - Q(Ex))$ converges in distribution to the uniform distribution on $[-1/2, 1/2)^m$ as $\delta \to 0^+$ [20, 87]. Furthermore, the entries of $\frac{1}{\delta}(Ex - Q(Ex))$ are asymptotically uncorrelated [151].
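This small-$\delta$ behavior is easy to observe numerically (a sketch; `msq` is a helper implementing $Q^{MSQ}_\delta$ componentwise, and the sizes are illustrative):

```python
import numpy as np

def msq(y, delta):
    return delta * np.ceil(y / delta - 0.5)   # uniform MSQ, half-open cells

rng = np.random.default_rng(4)
m, n = 20000, 3
E = rng.standard_normal((m, n))               # fixed full-rank frame
x = rng.standard_normal(n)                    # random signal

for delta in (1.0, 1e-2):
    v = (E @ x - msq(E @ x, delta)) / delta   # normalized quantization error
    # as delta -> 0 the entries look uniform on [-1/2, 1/2): mean ~ 0, var ~ 1/12
    print(f"delta={delta:g}  mean={v.mean():+.4f}  var={v.var():.4f}")
```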
Note that the results above are asymptotic as $\delta \to 0$, and therefore, they apply to high-resolution settings where $\delta > 0$ is extremely small. However, in practice, $\delta$ is often not very small – sometimes as big as the infinity norm of the signal $x$ (corresponding to one-bit quantization). Thus, a non-asymptotic reconstruction error analysis must be performed to understand the actual behavior of the reconstruction error.
To enforce the desired statistical properties of the quantization error, one may implement dithered quantization: before quantization is applied to the measurements, random dither is added. The dither $\tau = (\tau_1, \tau_2, \ldots, \tau_m)$ is drawn i.i.d. and uniform over $[-\delta/2, \delta/2)$. Then, the measurements become $Q(Ax^* + \tau)$. In this case, an analog of the WNH holds, and the reconstruction error is of order $O((n/m)^{1/2})\delta$. The dither is beneficial both in the frame theory and in the compressed sensing settings [19, 22, 52, 81, 146].
4.1.2 Quantization for compressed sensing
As in the case of frame expansions, it is crucial that the compressed measurements be quantized, given that our technology is almost exclusively digital. While the early literature on CS virtually neglected the issue of quantization (and compression in the sense of source coding), recent work has focused on two alternative quantization approaches. In the first approach, one considers quantization methods that "shape" the approximation error so that it can be "filtered out" during reconstruction. Such methods are called noise-shaping quantizers and have recently been shown to be effective in CS quantization [39, 60, 78, 89, 132, 133], cf. [42]. In fact, one can achieve exponentially accurate approximations (with respect to the total bit budget) by incorporating a post-$\Sigma\Delta$-quantization coding stage, i.e., one achieves nearly optimal encoding – see [132] for details.
While noise-shaping quantizers provide superior approximations, their implementation requires "memory".
In applications where the measurements are obtained sequentially or all at once, e.g., when acquiring an audio signal or an image, this is not an important issue. However, if the measurements are obtained, say, by a distributed sensor network, implementing noise-shaping quantizers may be challenging if the sensors cannot communicate with each other efficiently. On the other hand, MSQ is easy to implement in any setting, and it remains the most popular approach in CS quantization; see, e.g., [24, 62, 102]. Specifically, various reconstruction methods have been proposed that aim to improve the approximation obtained from MSQ-quantized compressive measurements [82, 83, 117, 168]. In this chapter, we also focus on the use of multi-bit MSQ in the CS framework.¹ That is, we will quantize the compressed measurements $b$, as defined above, by $Q(b)$, where $Q = Q^{MSQ}_\delta$. One may interpret the associated quantization error $b - Q(b)$ as "noise" that is bounded by $\delta/2$ in the $\ell_\infty$-norm (and consequently, by $\sqrt{m}\,\delta/2$ in the $\ell_2$-norm). Such an approach allows us to apply classical robust recovery results in CS. Specifically, $\hat{x}$ given by
$$\hat{x} := \arg\min \|z\|_1 \quad \text{subject to} \quad \|Az - Q(b)\| \le \sqrt{m}\,\delta/2 \tag{4.4}$$
satisfies
$$\|x - \hat{x}\| \le C\delta \tag{4.5}$$
with high probability when $\lambda = m/n$ is sufficiently large and $A$ is appropriately normalized as $m$ increases (which is crucial to ensure that the constant $C$ in (4.5) does not depend on $m$ – see [31] and [78]). Note that while the approximation error in (4.5) scales linearly with the step size $\delta$ (as expected), it does not tend to zero as the "redundancy" $\lambda$ increases; this is also observed in the numerical experiments in [78].
While the approach above follows naturally from the basic robust recovery results in CS, the reconstruction can be improved by employing alternative techniques that provide more accurate estimates; see, e.g., [24, 78, 82] and also [23, 84, 127] for the 1-bit case.
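The bounded-noise interpretation behind (4.4) is a deterministic fact and easy to verify (a sketch with hypothetical sizes; the $1/\sqrt{m}$ normalization of $A$ is an assumed, common CS convention):

```python
import numpy as np

def msq(y, delta):
    return delta * np.ceil(y / delta - 0.5)      # uniform MSQ, half-open cells

rng = np.random.default_rng(5)
m, N, delta = 50, 200, 0.1
A = rng.standard_normal((m, N)) / np.sqrt(m)     # normalized measurement matrix
x = np.zeros(N)
x[rng.choice(N, size=5, replace=False)] = rng.standard_normal(5)   # 5-sparse signal

b = A @ x
e = b - msq(b, delta)                            # quantization "noise"
print(np.abs(e).max(), np.linalg.norm(e))        # <= delta/2 and <= sqrt(m)*delta/2
```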
One such alternative is the use of the "two-stage framework" devised in [78], which can be used to adapt any given frame quantization technique to the CS setting to improve the reconstruction. Next, we describe this framework.
Let $A$ be an $m \times N$ CS measurement matrix, and let $b = Ax$ be the compressive samples of $x \in \Sigma^N_s$. Suppose that $T$ is the support of $x$, i.e., $T := \{j : x(j) \neq 0\}$ with $\#T = s$. Then, $A_T$ is the analysis matrix of a frame for $\mathbb{R}^s$ that consists of $m$ vectors, and the compressive samples satisfy $b = A_T x_T$, i.e., the entries of $b$ are the frame coefficients of $x_T$ with respect to $A_T =: E$. Note that if, for example, $A$ is a random matrix with i.i.d. sub-Gaussian entries, then $A_T$ is a random frame with i.i.d. sub-Gaussian entries. This observation motivates the use of any frame quantization method, in our case MSQ, along with a two-stage reconstruction scheme, as proposed in [78].
Two-stage reconstruction scheme for MSQ in CS
Let $q = Q(b)$ with $Q = Q^{MSQ}_\delta$. In Stage 1, we recover a coarse approximation $x^\#_{MSQ}$ by solving (4.4), from which the support $T$ of $x^*$ can be extracted with high probability under some additional conditions on $x^*$ – see [78]. In Stage 2, we obtain a refined estimate $x_{MSQ}$ of $x_T$ by using a reconstruction method tailored to the underlying frame quantization problem. Our focus in Stage 2 will be on linear reconstruction methods. Specifically, the refined estimate will be obtained via $x_{MSQ} = A_T^\dagger q$. Our goal is to analyze the approximation error $\|x - x_{MSQ}\|_2$ as $\lambda$ increases.
¹Note that one-bit MSQ provides a yet simpler quantization method. The analysis of the one-bit MSQ turns out to be significantly different from the multi-bit MSQ and is not within the scope of this chapter; e.g., see [1, 21, 127].
The error analysis for this two-stage scheme raises the following question in random frame theory: for an $m \times s$ random matrix $E$ with $m > s$ and a deterministic signal $x \in \mathbb{R}^s$, how does $\|x - E^\dagger Q(Ex)\|$ decay as $\lambda$ grows?
We focus on random matrices with independent, isotropic, sub-Gaussian rows. While the technique of [78] can be generalized to this case, it yields only a (non-zero) constant error bound independent of $\lambda$. On the other hand, the WNH "predicts" an approximation with $O(\lambda^{-1/2})\delta$ error, which appears to agree with numerical experiments as seen, for example, in Figure 4.4a. Here, we tighten both the result in [78] and the prediction based on the WNH. Specifically, we prove that the reconstruction error behaves like $v + O(\lambda^{-1/2})\delta$, where $v > 0$ is a small number that is barely noticeable in applications; this constant term, however, cannot be removed from the error bound. We also extend this result to the CS setting when the recovery is performed using the two-stage algorithm stated above.
4.2 MSQ for random frames
Let $m \ge n$ and let $E \in \mathbb{R}^{m\times n}$ be a random matrix with independent isotropic sub-Gaussian rows. Suppose that $x^* \in \mathbb{R}^n$ is deterministic and fixed, and $Q = Q^{MSQ}_\delta$. We wish to control the reconstruction error $\mathcal{E}(x^*) = \|x^* - E^\dagger Q(Ex^*)\|$ by establishing upper and lower bounds. First, since $\|Ex^* - Q(Ex^*)\|_\infty \le \delta/2$ and $\sigma_{\min}(E) \ge c\sqrt{m}$ with overwhelming probability [148, p. 23 and p. 36], a rough product bound yields
$$\|x^* - E^\dagger Q(Ex^*)\| \le (\sigma_{\min}(E))^{-1}\sqrt{m}\,\delta/2 \le C\delta \tag{4.6}$$
with overwhelming probability.
Note that this bound captures the dependence of the error on the quantizer resolution $\delta$, but it does not depend on $m$, the number of measurements. Intuitively, we expect the error to decrease as we obtain more measurements, i.e., more information about $x^*$. This motivates further analysis of the approximation error $\mathcal{E}(x)$.
A heuristic bound. Before we refine this bound rigorously in Section 4.2.1, we present a heuristic estimate, based on a modified version of the WNH, which shows that the approximation error decays as $m$ increases. As we showed in Section 4.1.1, such a decay can be "justified" using the WNH when the frame $E$ is deterministic while the signal is random.
In our current setting the frame is random, so the WNH is not directly applicable. On one hand, since x* is deterministic and the random matrix E has independent rows, the entries of u := Ex* − Q(Ex*) are independent random variables; if we additionally assume that the rows of E are identically distributed, the entries of u are identically distributed as well. This observation coincides in part with the WNH. On the other hand, (4.2) does not hold anymore since E† is random in our setting. One way to get around this is to introduce a modified WNH as follows.

Modified WNH (m-WNH): Let E be a random matrix. Assume that the signal x is also random and independent of E. The m-WNH assumes that the conditional random variable (Ex − Q(Ex)) | E has i.i.d. entries that are uniformly distributed in (−δ/2, δ/2].

Implications of the m-WNH: Set E† = (e†_{ij}), u := Ex − Q(Ex), and suppose that the m-WNH holds. Then

E{‖x − E†Q(Ex)‖² | E} = E{‖E†(Ex − Q(Ex))‖² | E}
 = Σ_{i=1}^n Σ_{j=1}^m Σ_{k=1}^m e†_{ij} e†_{ik} E{u_j u_k | E}
 = Σ_{i=1}^n Σ_{j=1}^m Σ_{k=1}^m e†_{ij} e†_{ik} E{u_j² | E} 1_{[j=k]}
 = Σ_{i=1}^n Σ_{j=1}^m (e†_{ij})² δ²/12
 = (δ²/12) ‖E†‖²_F
 ≤ (nδ²/12) ‖E†‖² = (nδ²/12) (σ_min(E))^{-2}.

Finally, using the law of total expectation,

E‖x − E†Q(Ex)‖² = E(E{‖x − E†Q(Ex)‖² | E}) ≤ E (nδ²/12)(σ_min(E))^{-2} ≤ (nδ²/12) · C m^{-1}.    (4.7)

In the last inequality, we used that σ_min(E) ≥ c√m with overwhelming probability. In this case, (E{E²(x)})^{1/2} ≤ Cδ√(n/m), provided that both x and E are random and the m-WNH holds.

Numerical experiments in Section 4.4 appear to agree with the heuristic calculation above: the empirical reconstruction error is O(λ^{-1/2})δ, as predicted in (4.7), when E is drawn from various random matrix ensembles. On the other hand, the WNH is a special case of the m-WNH, so the m-WNH is likewise not fully rigorous and fails in at least certain cases. For example, in Corollary 4.2.9 we show that when the matrix E has i.i.d. standard Gaussian entries (we often say such an E is a Gaussian random matrix), the m-WNH does not hold.
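Both halves of this picture can be checked numerically. The sketch below (illustrative parameters) first confirms that for a step δ much smaller than the signal scale, the scalar quantization error is very nearly uniform on (−δ/2, δ/2] with variance δ²/12, as the m-WNH posits; it then evaluates the alternating series of Corollary 4.2.9 for a Gaussian frame and checks the two-sided bound (4.12), showing µ > 0, which is exactly what the m-WNH cannot account for. The value of δ in the second part is taken large on purpose so that µ is not astronomically small.

```python
import numpy as np

rng = np.random.default_rng(2)
delta = 0.1

# 1) Distributional part of the m-WNH: for Gaussian samples z with std >> delta,
#    the error z - Q(z) is almost exactly uniform on (-delta/2, delta/2].
z = rng.standard_normal(100_000)
u = z - delta * np.round(z / delta)
assert np.all(np.abs(u) <= delta / 2 + 1e-12)
assert abs(np.var(u) - delta**2 / 12) < 5e-5      # variance close to delta^2/12

# 2) Nevertheless, for a Gaussian frame the constant
#    mu = 2*||x|| * sum_{p>=1} (-1)^{p+1} exp(-2 pi^2 ||x||^2 p^2 / delta^2)
#    is strictly positive; check it against the alternating-series bounds (4.12).
norm_x, big_delta = 1.0, 5.0
t = 2 * np.pi**2 * norm_x**2 / big_delta**2
mu = 2 * norm_x * sum((-1) ** (p + 1) * np.exp(-t * p**2) for p in range(1, 50))
assert 2 * norm_x * (np.exp(-t) - np.exp(-4 * t)) < mu < 2 * norm_x * np.exp(-t)
```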
This motivates our error analysis in the rest of the chapter, which does not rely on the m-WNH.

4.2.1 Error estimates without WNH – main results

From here on, let E ∈ R^{m×n} with m > n, δ > 0, Q := Q^MSQ_δ, and λ := m/n. We seek to estimate the reconstruction error E(x) = ‖x − E†Q(Ex)‖ for x ∈ R^n.

Theorem 4.2.1. Let n ≥ 3 and let x ∈ R^n be fixed. Suppose that λ > 1 and E ∈ R^{m×n} is a random matrix with independent isotropic sub-Gaussian rows whose ψ₂-norm does not exceed K. Then there is an absolute constant C > 0 such that for every c₂ ∈ (0, 1), setting c₁ = CK√ln(e²c₂^{-1}) > 0, we have

E(x) < M(µ + c₁√(log n) λ^{-1/2} δ)    (4.8)

and

E(x) > M′(µ − c₁√(log n) λ^{-1/2} δ),    (4.9)

with probability at least 1 − c₂ − 2 exp(−c₃m). Here µ = (1/m)‖E[E^T(Ex − Q(Ex))]‖, M = (1/2 − c_K λ^{-1/2})^{-2} > 0, M′ = (3/2 + c_K λ^{-1/2})^{-2} > 0, c_K > 0 depends only on K, and c₃ > 0 is an absolute constant.

Remark 4.2.2. The probability of failure contains a constant term c₂. This constant is unavoidable, but it may be chosen to depend on m and/or λ. Note that such a change will affect c₁.

The error bounds in (4.8) and (4.9) are composed of (constant multiples of) two summands: µ = (1/m)‖E[E^T(Ex − Q(Ex))]‖ and √(log n) λ^{-1/2} δ. As the decay rate of the latter agrees with that predicted by the m-WNH, we will focus on the first term, i.e., µ, for random matrices E with identically distributed rows.

Proposition 4.2.3. Assume that the conditions of Theorem 4.2.1 hold and, in addition, that the rows e_i^T of E are identically distributed. Then

E[E^T(Ex − Q(Ex))] = m E[e₁(e₁^T x − Q(e₁^T x))],

which implies µ = ‖E[e₁(e₁^T x − Q(e₁^T x))]‖.

Proof. Fix x ∈ R^n. Then

E[E^T(Ex − Q(Ex))] = Σ_{i=1}^m E[e_i(e_i^T x − Q(e_i^T x))],

where the summands on the right-hand side are identical vectors, all equal to, say, E[e₁(e₁^T x − Q(e₁^T x))]. It follows that

µ = (1/m)‖E[E^T(Ex − Q(Ex))]‖ = ‖E[e₁(e₁^T x − Q(e₁^T x))]‖,

which completes the proof.

Remark 4.2.4. Proposition 4.2.3 shows that if the matrix E has i.i.d. isotropic sub-Gaussian rows, then µ in Theorem 4.2.1 is a constant that does not depend on m.
Thus, in cases when µ ≠ 0, e.g., when E is a Gaussian matrix (see Corollary 4.2.9), E(x) = O(1) rather than decaying like O(m^{-1/2}) (equivalently, E²(x) = O(m^{-1})), which is what the m-WNH predicts.

Next, we establish an upper bound for µ.

Proposition 4.2.5. In the setting of Theorem 4.2.1, µ ≤ (δ/2)(1 + c_K√(n/m)).

Proof.

µ = (1/m)‖E[E^T(Ex − Q(Ex))]‖ ≤ (1/m) E(σ_max(E^T)‖Ex − Q(Ex)‖) ≤ (1/m) E(σ_max(E^T) (δ/2)√m) ≤ (δ/2)(1 + c_K√(n/m)).

For the bound on E σ_max(E^T), we used Theorem 1.3.4.

The estimate above implies that, in the worst-case scenario, µ is bounded by δ/2 + ε (for any fixed ε > 0 and large enough m). Indeed, an error of order δ/2 is observed numerically when the random frame E is a submatrix of the first k columns of a sufficiently large Fourier matrix with m randomly selected rows; see Figure 4.3.

The following artificial scenarios illustrate (provably) that the reconstruction error indeed does not always decay to zero but may tend to a non-zero constant, possibly (nearly) as big as the upper bound in Proposition 4.2.5.

(A) [Bernoulli frame, one-dimensional signal] Let n = 1, i.e., x ∈ R, and let E ∈ R^{m×1} be a ±1 Bernoulli frame. Then each individual sample is either x or −x, i.e., more samples do not bring any additional information. Thus, the optimal reconstruction from the quantized measurements Q(Ex) is x̃ = Q(x). Accordingly, if Q(x) ≠ x, which is the case for almost all x, the reconstruction error |x − x̃| is non-zero, and in fact it can be as big as δ/2. Note that the same discussion applies if we consider a Bernoulli random frame with n > 1 and set x = c(1, 0, 0, ..., 0) ∈ R^n.

(B) [Bernoulli frame, a class of high-dimensional signals] Let the frame E ∈ R^{m×n} be a ±1 Bernoulli frame. Consider all x = (x₁, x₂, ..., x_n) such that

|x_i − Q(x_i)| < δ/(2n).    (4.10)

Using that all entries of E = (e_{ij}) are ±1, we get that for all i ∈ {1, 2, ..., m},

−δ/2 + (EQ(x))_i < (Ex)_i < δ/2 + (EQ(x))_i.

This, together with the fact that (EQ(x))_i ∈ δZ, implies that Q(Ex) = EQ(x) for all x satisfying (4.10).
Accordingly, for two signals x¹ and x² such that Q(x¹) = Q(x²) and (4.10) holds, Q(Ex¹) equals Q(Ex²), and thus the reconstructions will be identical, yielding an ℓ₂ reconstruction error as big as δ/(2√n).

(C) [Bernoulli random frame, m > 2^n] Consider a ±1 Bernoulli matrix E ∈ R^{m×n} and any signal x ∈ R^n, where m > 2^n. Note that such a Bernoulli matrix can have at most 2^n distinct rows, i.e., if we exclude repeated rows, we get at most 2^n different measurements. Once the frame E contains all 2^n distinct rows, more measurements do not bring any additional information. Therefore, unless Ex = Q(Ex), the reconstruction error will saturate at a non-zero constant.

(D) [Fourier random frame, a class of high-dimensional signals] Consider the discrete Fourier transform (DFT) matrix F ∈ R^{N×N}. We select the first d columns of F and draw rows uniformly at random. Suppose that the signal is x = (c, 0, 0, ..., 0). Then all samples of the signal are identical, and, similarly to scenario (A), the reconstruction error for some such signals is very close to δ/2.

While some of these scenarios attain (nearly) the bound of Proposition 4.2.5, this bound is not tight, at least for some "nice" ensembles. In Corollary 4.2.9, we show that µ is not zero when E is Gaussian, but it is very small in any practically relevant setting, which we prove in the next section.

4.2.2 Gaussian random matrices

Next, we obtain sharp bounds for ‖E[E^T(Ex − Q(Ex))]‖ when E is a Gaussian random matrix. These, in turn, yield bounds for the reconstruction error E(x) = ‖x − E†Q(Ex)‖.

Theorem 4.2.6. Let x ∈ R^n be fixed and let E ∈ R^{m×n} be a random matrix with isotropic sub-Gaussian rows. Furthermore, assume that the entries of E are i.i.d. with a density function φ that is a Schwartz function.
Then the ith entry of E[E^T(Ex − Q(Ex))] satisfies

(E[E^T(Ex − Q(Ex))])_i = m( x_i + x_i Σ_{p∈Z} (−1)^p ĝ(|x_i|p/δ) Π_{r≠i} φ̂(x_r sign(x_i) p/δ) ),    (4.11)

where g(z) := E[e₁₁ 1{e₁₁ ≤ z}] = ∫_{−∞}^z t φ(t) dt.

In particular, if E is a Gaussian matrix,

(E[E^T(Ex − Q(Ex))])_i = −2m x_i Σ_{p=1}^∞ (−1)^p exp(−2π²‖x‖²p²/δ²),

which implies

2‖x‖(exp(−2π²‖x‖²/δ²) − exp(−8π²‖x‖²/δ²)) < µ < 2‖x‖ exp(−2π²‖x‖²/δ²),    (4.12)

where µ = (1/m)‖E[E^T(Ex − Q(Ex))]‖ is as in Theorem 4.2.1.

Remark 4.2.7. The right-hand side of (4.11) may be hard to calculate exactly. On the other hand, since φ is a Schwartz function (and so are φ̂ and ĝ), it may be approximated accurately by truncating the series to its first few terms.

Remark 4.2.8. For Gaussian random matrices, the bound µ ≤ 2‖x‖ exp(−2π²‖x‖²/δ²) is typically very small. For example, if ‖x‖ = 1 and δ = 0.5, it is approximately 2.05 × 10^{−34}.

We end this section by combining Theorem 4.2.1 and Theorem 4.2.6 to state lower and upper bounds on the approximation error for Gaussian frames.

Corollary 4.2.9 (Reconstruction error for Gaussian frames). Let n ≥ 3, x ∈ R^n, and suppose that E is an m × n Gaussian matrix (i.e., its entries are i.i.d. standard Gaussian). Then there is an absolute constant C > 0 such that for every c₂ ∈ (0, 1), setting c₁ = C√ln(e²c₂^{-1}) > 0, we have

E(x) < α(2‖x‖e^{−2π²‖x‖²/δ²} + c₁√(log n) λ^{-1/2} δ)    (4.13)

and

E(x) > β(2‖x‖e^{−2π²‖x‖²/δ²} − 2‖x‖e^{−8π²‖x‖²/δ²} − c₁√(log n) λ^{-1/2} δ),    (4.14)

with probability at least 1 − 2e^{−m/8} − c₂ for every m ∈ N such that λ > 4. Here α = (1/2 − λ^{-1/2})^{-2} and β = (3/2 + λ^{-1/2})^{-2}.

4.2.3 On the value of µ

In Section 4.2.1, we established that the reconstruction error is of order Ω(µ + λ^{-1/2}δ), where µ does not depend on m. Therefore, the non-asymptotic behaviour of the reconstruction error is dramatically different when µ = 0 and when µ ≠ 0. Recall that Theorem 4.2.6 provides a sharp estimate of the value of µ for Gaussian random matrices and, in particular, shows µ ≠ 0 in that case. In this section, we show that for large n such that m/n > 3 and for E ∈ R^{m×n} with centered i.i.d.
sub-Gaussian random variables, µ ≠ 0 under weak assumptions on x.

Let us start with a heuristic explanation of why one would expect µ ≠ 0 when n is large. Assume that the entries of E are centered i.i.d. sub-Gaussian random variables whose variance is one and whose sub-Gaussian norm does not exceed K. Using algebraic manipulations and the Cauchy–Schwarz inequality,

µ = (1/m)‖E[E^T F(Ex)]‖ = ‖E[e₁F(e₁^T x)]‖ ≥ |⟨E[e₁F(e₁^T x)], x⟩| / ‖x‖ = |E[e₁^T x F(e₁^T x)]| / ‖x‖.

Therefore, if we show that E[e₁^T x F(e₁^T x)] ≠ 0, it follows that µ ≠ 0. Let z = e₁^T x and consider E[zF(z)]. Observe that for x = (1/√n)(1, 1, ..., 1)^T,

z = e₁^T x = Σ_{i=1}^n e_{1i} x_i = (Σ_{i=1}^n e_{1i}) / √n.

When n → ∞, by the central limit theorem, z converges in distribution to N(0, 1) for many distributions of e₁. Moreover, Theorem 4.2.6 applied with x = (1, 0, 0, ..., 0) implies that for a standard Gaussian random variable ξ,

E[ξF(ξ)] = −2 Σ_{p=1}^∞ (−1)^p e^{−2π²p²/δ²},

which is not zero. Combining the above, we conclude that for x = (1/√n)(1, 1, ..., 1)^T,

µ ≥ |E[e₁^T x F(e₁^T x)]| → |E[ξF(ξ)]| ≠ 0.

The heuristics above lead to the following non-asymptotic bound, whose rigorous proof we provide in Section 4.2.5.

Theorem 4.2.10. Let m and n be such that m/n > 3. Let x ∈ R^n be fixed and unit-norm. Suppose that E ∈ R^{m×n} is a matrix whose entries are centered i.i.d. sub-Gaussian random variables whose variance is one and whose sub-Gaussian norm does not exceed K. Then there exist constants c and c₀ (defined in the Hoeffding and Berry–Esseen inequalities, respectively), independent of m, n, and δ, such that for every C ∈ N,

µ ≥ 2(e^{−2π²/δ²} − e^{−8π²/δ²}) − (1/3)c₀‖x‖₃³ C(2C + 1)(2C + 5)δ² − (eK²/c) exp(−cC²δ²/K²) − 4 exp(−C²δ²/2).    (4.15)

Remark 4.2.11. It is ambiguous how x changes as n → ∞. If we simply pad x with zero entries as n grows, we expect the same decay rate as for the lower-dimensional vectors. In this section, we assume that the entries of x change as n grows but that ‖x‖ stays fixed. As an extreme case, if x = (1/√n)(1, 1, ..., 1), then ‖x‖₃³ = 1/√n.
The norm inequality ‖x‖₂ ≤ n^{1/2 − 1/3}‖x‖₃ implies that O(1/√n) is the fastest possible decay of ‖x‖₃³ when ‖x‖₂ is fixed. Also, note that

‖x‖₃³ = Σ_{i=1}^n |x_i|³ ≤ Σ_{i=1}^n (|x_i|² max_{1≤i≤n}|x_i|) ≤ ‖x‖_∞‖x‖₂²

implies that for unit-norm x ∈ R^n, if lim_{n→∞} ‖x‖_∞ = 0, then ‖x‖₃³ → 0 as n → ∞.

Remark 4.2.12. Note that m does not appear in the bound on µ in the theorem; suppose that ‖x‖₃³ → 0 as n → ∞. Then, for large enough n, µ > 0 regardless of the number of measurements. This means that for large enough n and any number of measurements, the reconstruction error is bounded from below by a constant term and, therefore, does not diminish to zero.

Remark 4.2.13. Suppose that E is as in Theorem 4.2.10 and that m → ∞ in such a way that λ = m/n → ∞ and n → ∞. Then, for large enough m (and, correspondingly, n), Theorem 4.2.10 guarantees that if ‖x‖₃ → 0 as n → ∞, then the reconstruction error does not diminish to zero. In other words, even if the "oversampling ratio" λ is extremely large, the reconstruction error does not diminish to zero for large n and small ‖x‖₃.

4.2.4 Extension to noisy and dithered quantization

Dithered quantization

Let τ = (τ₁, τ₂, ..., τ_m) be a random vector, independent of E, whose entries are i.i.d. random variables distributed uniformly over (−δ/2, δ/2]. We call this vector a dither, and we say that the quantization is dithered if the dither is added just before quantization. In other words, the quantized measurements become

q = Q(Ex + τ).

One of the benefits of applying dither is that E and Ex + τ − Q(Ex + τ) are independent, so an analog of the WNH-based error analysis applies with theoretical guarantees.

Suppose that the reconstruction from the quantized dithered measurements is implemented via the Moore–Penrose pseudoinverse. Then the reconstruction error becomes

‖x − x_dither‖ = ‖x − E†Q(Ex + τ)‖ = ‖(E^T E)^{-1}E^T(Ex − Q(Ex + τ))‖.

Theorem 4.2.14. Let x ∈ R^n be fixed.
Suppose that λ = m/n > 3 and E ∈ R^{m×n} is a random matrix with independent isotropic sub-Gaussian rows whose ψ₂-norm does not exceed K. Then there is an absolute constant C > 0 such that for every c₂ ∈ (0, 1), setting c₁ = CK√ln(e²c₂^{-1}) > 0, we have

‖x − x_dither‖ < M(2c₁√(log n) λ^{-1/2} δ)    (4.16)

with probability at least 1 − c₂ − 2 exp(−c₃m), where M is as in Theorem 4.2.1.

Remark 4.2.15. This theorem is well known in the dithering literature (see, e.g., [19, 22, 146]). We present it here because it is a natural extension of the methodology of this chapter.

Noisy measurements

In practice, the quantized measurements Q(Ex) can be distorted by noise before being stored or processed. Assume that the noise vector ε = (ε₁, ε₂, ..., ε_m)^T is added after the quantization process, and we now want to recover x ∈ R^n from

q̃ = Q(Ex) + ε.

If the signal is reconstructed using the Moore–Penrose pseudoinverse, the reconstruction error becomes

x − x̃ = x − E†q̃ = (E^T E)^{-1}E^T(Ex − Q(Ex) − ε).

Therefore,

‖x − x̃‖ ≤ ‖(E^T E)^{-1}E^T(Ex − Q(Ex))‖ + ‖E†ε‖.

The first term was estimated earlier in the chapter (see Theorem 4.2.1). For the second term, we provide two error bounds. Proceeding in a straightforward way,

‖E†ε‖ ≤ (σ_min(E))^{-1}‖ε‖ ≤ (c/√m)‖ε‖ with overwhelming probability.

Here the second inequality holds with overwhelming probability because σ_min(E) ≥ c′√m. Note that the above estimate is valid for any noise vector, including adversarial (i.e., worst-case) noise and noise that depends on the measurement matrix.

If we assume that the noise is independent of E, symmetric (and therefore mean zero), and that its entries are i.i.d. sub-Gaussian random variables with ψ₂-norm at most L, the error bound can be improved. As before,

‖E†ε‖ ≤ (σ_min(E))^{-2}‖E^T ε‖ = (σ_min(E))^{-2}‖Σ_{i=1}^m e_i ε_i‖.

Again, the pairs (e_i, ε_i) are i.i.d., so we consider the norm of a sum of i.i.d. random vectors e_i ε_i. The mean of each such vector is zero because e_i and ε_i are independent and Eε_i = 0:

E[e_i ε_i] = E[e_i] E[ε_i] = 0.

In order to apply the Hoeffding inequality (Theorem 1.3.2), we show that the sub-Gaussian norm of e₁ε₁ is at most KL.
Indeed, for every x ∈ R^n and every |t| ≤ 1/(KL),

E exp(t⟨e₁ε₁, x⟩) = E exp(t⟨e₁, x⟩ε₁) = E_{e₁} E_{ε₁} exp(t⟨e₁, x⟩ε₁) ≤ E_{e₁} exp(t²⟨e₁, x⟩²L²) ≤ exp(t²L²K²).

Applying the Hoeffding inequality (Theorem 1.3.2), we conclude the following theorem.

Theorem 4.2.16. Let x ∈ R^n be fixed. Suppose that λ > 3 and E ∈ R^{m×n} is a random matrix with independent isotropic sub-Gaussian rows whose ψ₂-norm does not exceed K. Assume that the additive noise ε = (ε₁, ε₂, ..., ε_m) is independent of E and has i.i.d. mean-zero sub-Gaussian entries with ψ₂-norm at most L. Then there is an absolute constant C > 0 such that for every c₂ ∈ (0, 1), setting c₁ = CKL√ln(e²c₂^{-1}) > 0, we have

E(x) < M(µ + c₁√(log n) λ^{-1/2} δ)    (4.17)

and

E(x) > M′(µ − c₁√(log n) λ^{-1/2} δ),    (4.18)

with probability at least 1 − c₂ − 2 exp(−c₃m). Here µ = (1/m)‖E[E^T(Ex − Q(Ex))]‖, M = (1/2 − c_K λ^{-1/2})^{-2} > 0, M′ = (3/2 + c_K λ^{-1/2})^{-2} > 0, c_K > 0 depends only on K, and c₃ > 0 is an absolute constant.

4.2.5 Proofs

First, we provide a road map for obtaining the upper bound. Note that the reconstruction error satisfies

E(x) = ‖x − E†Q(Ex)‖ = ‖m(E^T E)^{-1} · (1/m)E^T(Ex − Q(Ex))‖
 ≤ (m σ_max((E^T E)^{-1}))(µ + (1/m)‖E^T(Ex − Q(Ex)) − E[E^T(Ex − Q(Ex))]‖),    (4.19)

where µ = (1/m)‖E[E^T(Ex − Q(Ex))]‖. Here, the term m σ_max((E^T E)^{-1}) is controlled using the following corollary of Theorem 1.3.4.

Corollary 4.2.17. For a matrix E that satisfies the conditions of Theorem 1.3.4, with probability at least 1 − 2 exp(−c₃m),

M′/m ≤ σ_min((E^T E)^{-1}) ≤ σ_max((E^T E)^{-1}) ≤ M/m,

where M = (1/2 − c_K λ^{-1/2})^{-2} > 0 and M′ = (3/2 + c_K λ^{-1/2})^{-2} > 0. If E is a Gaussian matrix, c_K = 1.

In addition, we need to estimate the deviation of E^T(Ex − Q(Ex)) from its mean E[E^T(Ex − Q(Ex))]. This can be done entrywise using the following observations. Consider the ith entry of (1/m)E^T(Ex − Q(Ex)), say α_i, given by

α_i = (1/m) Σ_{j=1}^m e_{ji} F(Σ_{k=1}^n e_{jk}x_k),    (4.20)

where F(z) := z − Q(z) for all z ∈ R. Each summand in (4.20) depends only on the jth row of E.
Thus, since E has independent rows, the summands in (4.20) are independent. Furthermore, they are sub-Gaussian random variables whose sub-Gaussian norms do not exceed Kδ/2. So we can control the deviation of this sum from its mean:

|(1/m) Σ_{j=1}^m (e_{ji}F(Σ_{k=1}^n e_{jk}x_k) − E[e_{ji}F(Σ_{k=1}^n e_{jk}x_k)])| ≤ C₂√(1/m)

with probability at least 1 − e exp(−C₃C₂²). Finally, (1/m)‖E^T(Ex − Q(Ex)) − E[E^T(Ex − Q(Ex))]‖ can be bounded using these entrywise estimates. Next we provide the full proof.

Proof of Theorem 4.2.1.

Observe that m > n implies that E is a full-rank matrix with overwhelming probability; thus (E^T E)^{-1}E^T = E† is well-defined. Also, by Corollary 4.2.17,

M′ ≤ m σ_min((E^T E)^{-1}) ≤ m σ_max((E^T E)^{-1}) ≤ M

with probability at least 1 − 2 exp(−c₃m), where M, M′, and c₃ are as in Corollary 4.2.17. This, together with (4.19), gives

E(x) ≤ M(µ + (1/m)‖E^T(Ex − Q(Ex)) − E[E^T(Ex − Q(Ex))]‖).

A similar calculation (using the reverse triangle inequality this time) yields the lower bound

E(x) ≥ M′(µ − (1/m)‖E^T(Ex − Q(Ex)) − E[E^T(Ex − Q(Ex))]‖).

To finish the proof, we need to bound (1/m)‖E^T(Ex − Q(Ex)) − E[E^T(Ex − Q(Ex))]‖ = (Σ_{i=1}^n (α_i − Eα_i)²)^{1/2}, where α_i is as in (4.20), from above and below. As outlined before, we achieve this by controlling α_i − E(α_i) for each i.

Observe that α_i is an average of m terms e_{ji}F(Σ_{k=1}^n e_{jk}x_k) that are independent, because each term depends only on the jth row of the matrix E and, according to the conditions of the theorem, the rows of E are independent. Therefore, to bound α_i we use the Hoeffding inequality (see Lemma 4.2.18 below). Combining the observations above with Lemma 4.2.18 finishes the proof.

Lemma 4.2.18. Let x ∈ R^n and let E be an m × n matrix with independent rows whose entries e_{ij} are sub-Gaussian with ψ₂-norms not exceeding K. Suppose that α_i is as in (4.20). Then:

(i) Fix 1 ≤ i ≤ n.
With any precision c2 ∈ (0, 1), with probability at least 1− 1nc2,|αi − Eαi| ≤ c1√log n√1mKδ.(ii) With any precision c2 ∈ (0, 1), with probability at least 1− c2, we have1m‖ET (Ex−Q(Ex)) − EET (Ex−Q(Ex))‖ ≤ c1√log n√nmKδ.Here c1 = C√log(e2c−12 ) and C is an absolute constant.Proof of Lemma 4.2.18. Note that1m‖ET (Ex−Q(Ex)) − EET (Ex−Q(Ex))‖ =( n∑i=1(αi − Eαi)2)1/2and, therefore, the second statement of Lemma 4.2.18 follows from the first one usingstraight-forward union bound on the first inequality. Therefore, we focus on the proof ofthe first statement of the lemma.Fix 1 ≤ i ≤ n, and considerαi − Eαi = 1mm∑j=1(ejiF (n∑s=1ejsxs)− EejiF (n∑s=1ejsxs)).Note that we aim to bound the sum of m random variables Zj := ejiF (n∑k=1ejkxk) −EejiF (n∑k=1ejkxk).Claim: Given that E has independent, isotropic, sub-Gaussian random rows,(a) Zj are independent.(b) Zj are centered, sub-Gaussian random variables whose ψ2-norm does not exceed Kδ.Suppose this claim holds (it is proved below). Then, the Hoeffding inequality, as statedin Theorem 1.3.2, with t := c1√log n√mKδ, where c1 > 0 is a constant to be determinedlater, implies thatP(|αi − Eαi| ≤ c1√log n√1mKδ)≥ 1− e exp (−cc21 · log(n)m) , (4.21)101where c > 0 is an absolute constant. Simplifying the probability of failure, we haveP(|αi − Eαi| ≤ c1√log n√1mKδ)≥ 1− en−cc21 .One can simplify the expression for probability of failure further to obtain the statementin the lemma. Specifically, if cc21 > 1 (which can be guaranteed by choosing c2 ∈ (0, 1) andthen picking c1 such that c2 = e2−cc21), then, for n ≥ 3, the probability of failure may bebounded from above by en−cc21 ≤ 1nc2.The proof of the lemma will be complete once we prove the claim above.Proof of the claim. Since the rows of E are independent, so are {ejiF (∑nk=1 ejkxk)},j = 1, 2, ..,m, which, in turn, implies (a). Next note that Zj are centered by definition,so to prove (b), it suffices to show that ejiF (∑nk=1 ejkxk) and EejiF (∑nk=1 ejkxk) are sub-Gaussian. 
The latter term is a constant, and‖Zj‖ψ2 ≤ ‖ejiF (n∑k=1ejkxk)‖ψ2 +∣∣E[ejiF ( n∑k=1ejkxk)]∣∣≤ ‖ejiF (n∑k=1ejkxk)‖ψ2 + E∣∣ejiF ( n∑k=1ejkxk)∣∣ ≤ 2‖ejiF ( n∑k=1ejkxk)‖ψ2≤ 2‖F (n∑k=1ejkxk)‖∞‖eji‖ψ2 ≤ 2 · 0.5δ‖eji‖ψ2 ≤ δKHere we used basic properties of ψ2-norm together with the fact that the range of F is(−δ/2, δ/2] and the sub-Gaussian norm of eji does not exceed the sub-Gaussian norm ofthe jth row of E, which is at most K.This finishes the proof of the claim, consequently the proof of Lemma 4.2.18, and hencethe proof of Theorem 4.2.1.Proof of Theorem 4.2.6Before we start the proof, let us state the Poisson summation formula which will be usedfurther in this section.Theorem 4.2.19 ( [59, p. 287]). Let f : R 7→ R be a Schwartz function. For any a > 0and b ∈ R, ∑n∈Zf(an+ b) =∑p∈Z1af̂(pa)exp(2πipab), (4.22)where f̂(ω) =∫exp(−2πitω)f(t)dt is a Fourier transform of f .1021mE m∑j=1ejiF (n∑k=1ejkxk) = E(e1iF ( n∑k=1e1kxk))= E(e1i(n∑k=1e1kxk))− E(e1iQ(n∑k=1e1kxk))= Exie21i + E∑k 6=ixke1ie1k − E(e1iQ(n∑k=1e1kxk))= xi − E(e1iQ(n∑k=1e1kxk))In the last equality we used the facts that E has isotropic rows, so each entry has mean 0and variance 1, and its entries are independent.If xi = 0, then, using the fact that random variables e1i and∑k 6=i e1kxk are indepen-dent,E(e1iQ(n∑k=1e1kxk))= Ee1iEQ(n∑k=1e1kxk) = 0.Then, Ee1iF (∑nk=1 e1kxk) = xi = 0, and the statement of the theorem holds.Now assume that xi 6= 0. There are two possibilities, namely, xi > 0 and xi < 0. Wejoin them into the one case for a potentially different set of xi’s using the oddity of Q. Letxk := xksign(xi), k = 1, 2, ..., n. Then,E(e1iF (k∑k=1e1kxk))= xi − sign(xi)E(e1iQ(n∑k=1e1kxk)). (4.23)Note that xi = xisign(xi) = |xi| which is positive because xi 6= 0. Therefore, it suffices toestimate E(e1iQ(∑nk=1 e1kxk)).Let 1 be an indicator function of a Boolean variable, taking the value 1 if its argumentis true and 0 otherwise. 
Using this notation and the definition of Q, we haveE(e1iQ(n∑k=1e1kxk))=∑j∈ZEe1i · jδ · 1{n∑k=1e1kxk ∈ (jδ − δ2, jδ +δ2]}. (4.24)Let ηi :=∑k 6=i e1kxk. Note that1{n∑k=1e1kxk ∈ (jδ − δ2, jδ +δ2]} ={1, if e1i ∈ ( jδ−δ/2−ηixi ,jδ+δ/2−ηixi]0, otherwise.103Therefore, one may rewrite the summands on the right hand side of (4.24) asEe1ijδ1{n∑k=1e1kxk ∈ (jδ − δ2, jδ +δ2]} (4.25)= Ee1ijδ1{e1i ≤ jδ + δ/2 − ηixi} − Ee1ijδ1{e1i ≤ jδ − δ/2 − ηixi}.Next, let g(z) := Ee1i1{e1i ≤ z} for z ∈ R. Since all entries of E are distributedidentically, g does not depend on i. We claim that g is a Schwartz function. Indeed,g(z) =∫ z−∞tφ(t)dt,where φ is the density function of e11. Note that by our hypothesis φ is a Schwartzfunction. So, clearly, g ∈ C∞. In addition, since φ(t) = O(|t|−n−2) for all n as t → −∞,g(z) = O(|t|−n) for all n as t → −∞. Now recall that the rows of E are isotropic, so themean value of each entry is 0. Then,g(z) = Ee11 −∫ ∞ztφ(t)dt = −∫ ∞ztφ(t)dt.By similar arguments, g(z) = O(|t|−n) as z →∞. Therefore, g is a Schwartz function.In what follows, we will have random arguments for g. Specifically, for a randomvariable η that is independent of e1i, letg(η) := E [e1i1{e1i ≤ η} | η] .Note that this definition of g agrees with the previous one when one takes η = z a.e. (whichis independent of e1i). 
Using the law of total expectation for each term in (4.25) and thedefinition of g, we getEe1ijδ1{e1i ≤ jδ + δ/2 − ηixi} = E[E[e1ijδ1{e1i ≤ jδ + δ/2 − ηixi} | ηi]]= jδEg(jδ + δ/2 − ηixi)and, similarly,Ee1ijδ1{e1i ≤ jδ − δ/2 − ηixi} = jδEg(jδ − δ/2 − ηixi).Therefore, (4.25) may be rewritten as follows,Ee1ijδ1{n∑k=1e1kxk ∈ (jδ − δ/2, nδ + δ/2]} = jδEg(jδ + δ/2 − ηixi)− jδEg(jδ − δ/2 − ηixi),104which, in turn, allows us to rewrite the original expression E (e1iQ(∑nk=1 e1kxk)) asE(e1iQ(n∑k=1e1kxk))= E∑j∈Z(jδ(g(jδ + δ/2 − ηixi)− g(jδ − δ/2 − ηixi)))= E∑j∈Z(jδg(jδ + δ/2 − ηixi)− ((j − 1) + 1)δg((j − 1)δ + δ/2 − ηixi)))Splitting the series and shifting the index of summation by 1 leads toE(e1iQ(n∑k=1e1kxk))= E−δ∑j∈Zg(jδ + δ/2− ηixi) . (4.26)Here splitting the series can be justified by using Fubini-Tonelli theorem twice togetherwith the fact that g is a Schwartz function, which implies g(z) = O(|z|−3) as |z| → ∞.Therefore,∑j∈Z jδg(jδ ± δ/2 − ηi) is convergent for any value of ηi in R.Next, we will estimate the mean value of the series, where we use the Poisson summationformula (4.22), which may be applied as g is a Schwartz function. Denote the Fouriertransform of g by ĝ(ω) :=∫exp(−i2πtω)g(t)dt. Then,∑j∈Zg(jδxi+δ/2 − ηixi) =∑p∈Zxiδĝ(xiδp) exp(i2πxipδδ/2 − ηixi)=xiδ∑p∈Z(−1)pĝ(xiδp) exp(−i2πpδηi)Plugging the last formula into (4.26) gives us the following equation:E(e1iQ(n∑k=1e1kxk))= E−δxiδ∑p∈Z(−1)pĝ(xiδp) exp(−i2πpδηi) .We want to interchange the expectation and infinite sum. To do this, we need to verifythe conditions of dominated convergence theorem. Note that|(−1)pĝ(xiδp) exp(−i2πpδηi)| = |ĝ(xiδp)|.Since g is a Schwartz function, so is ĝ, and therefore,E∑p∈Z|(−1)pĝ(xiδp) exp(−i2πpδηi)| = E∑p∈Z|ĝ(xiδp)| =∑p∈Z|ĝ(xiδp)|105is convergent. Note that the last equality holds because we consider the expected valueof a deterministic expression, so the expected value is redundant. 
Using Fubini-Tonellitheorem, we interchange the expected value and the infinite sum, and obtainE(e1iQ(n∑k=1e1kxk))=−δxiδ∑p∈Z((−1)pĝ(xiδp)E exp(−i2πpδηi)) . (4.27)Moreover, since ηi =∑k 6=i e1kxk and {e1k} is a collection of independent random variables,E exp(−i2πpδηi) = E∏k 6=iexp(−i2πpδxke1k)=∏k 6=iE exp(−i2πpδxke1k)=∏k 6=iφ̂(pδxk)where φ is the density function of e1k (recall that all entries of matrix E are identicallydistributed). Plugging the result above into (4.27) and then plugging into (4.23), we getE(e1iF (n∑k=1e1kxk))= xi − sign(xi)−xi∑p∈Z(−1)pĝ(xiδp)∏k 6=iφ̂(pδxk) ,= xi + xi∑p∈Z(−1)pĝ( |xi|δp)∏k 6=iφ̂(xksign(xi)δp)where φ is a density function of e11 and g(z) = Ee111{e11 < z} =∫ z−∞ tφ(t)dt. To finishthe proof of the first statement of the Theorem,E m∑j=1ejiF (n∑k=1ejkxk) = mE(e1iF ( n∑k=1e1kxk))= mxi + xi∑p∈Z(−1)pĝ( |xi|δp)∏k 6=iφ̂(xksign(xi)δp)which coincides with (4.11).106Next, assume that entries of E are i.i.d. standard Gaussian random variables. Then,the density function φ and its Fourier transform φ̂ are given byφ(z) =√12πexp(−z22), φ̂(ω) = exp(−2π2ω2).Consequently,g(z) :=∫ z−∞tφ(t)dt =∫ z−∞t ·√12πexp(−t2/2)dt = −√12πe−z22 .and its Fourier transform is ĝ(ω) := − exp(−2π2ω2). Plugging the exact values of φ̂ and ĝinto (4.11),E m∑j=1ejiF (n∑k=1ejkxk) = mxi + xi∑p∈Z(−1)p (−e−2π2 x2i p2δ2 )∏k 6=ie−2π2x2kp2δ2= mxi + xi∑p∈Z(−(−1)pe−2π2∑nk=1x2kδ2p2)= mxi − xi∑p∈Z(−1)pe−2π2 ‖x‖2δ2p2 .Note that terms of the series are even with respect to p, i.e., the value of the terms for pand −p are identical. When p = 0, the term is 1. Therefore, one may simplify the seriesas follows.E m∑j=1ejiF (n∑k=1ejkxk) = m(xi − xi − 2xi ∞∑p=1(−1)pe−2π2 ‖x‖2δ2p2)= −2mxi∞∑p=1(−1)pe−2π2 ‖x‖2δ2p2 .This is an alternating series, so one may bound the sum using the first two terms ofthe series. 
To be more precise, the exact value of the series E(∑mj=1 ejiF (∑nk=1 ejkxk))lies between 2mxi exp(−2π2 ‖x‖2δ2) and 2mxi exp(−2π2 ‖x‖2δ2) − 2mxi exp(−8π2 ‖x‖2δ2), whichimplies (4.12) and finishes the proof.107Proof of Theorem 4.2.10The rigorous proof is based on the heuristics outlined in Section 4.2.3. We introduce thefollowing notations for simplicity: for unit-norm x ∈ Rn, let z = eT1 x and ξ ∼ N (0, 1).Note that Ez2 = 1. Then, using the Cauchy-Schwarz inequality and the inverse triangleinequality,µ ≥∣∣EeT1 xF (eT1 x)∣∣‖x‖ = |EzF (z)| = | − EξF (ξ) + EξF (ξ)− EzF (z)|≥ |EξF (ξ)| − |EξF (ξ)− EzF (z)| = |EξF (ξ)| − |Eξ2 − EξQ(ξ)− Ez2 + EzQ(z)|= |EξF (ξ)| − |EzQ(z)− EξQ(ξ)|.The first term is computed in Theorem 4.2.6:2(e−2pi2δ2 − e− 8pi2δ2)≤ |EξF (ξ)| = 2∞∑p=1(−1)p+1e− 2pi2p2δ2 ≤ 2e− 2pi2δ2 .We conclude thatµ ≥ 2(e−2pi2δ2 − e− 8pi2δ2)− |EzQ(z)− EξQ(ξ)|. (4.28)We proceed to proving that for the large class of signals x ∈ Rn, the distribution ofz =∑nk=1 e1kxk will become close to N (0, 1), and the term |EzQ(z)−EξQ(ξ)| goes to zeroas n→∞ (while keeping m/n > 3).Fix C ∈ N. Note that zQ(z)1z∈(−δ/2,δ/2) = EξQ(ξ)1ξ∈(−δ/2,δ/2] = 0. Then, the triangleinequality implies the following bound:|EzQ(z)− EξQ(ξ)| ≤ ∣∣EzQ(z)1z∈(δ/2,Cδ+δ/2] − EξQ(ξ)1ξ∈(δ/2,Cδ/2+δ/2]∣∣+∣∣EzQ(z)1z∈(−Cδ/2−δ/2,−δ/2] − EξQ(ξ)1ξ∈(−Cδ/2−δ/2,−δ/2]∣∣+∣∣EzQ(z)1z≤−Cδ/2−δ/2 − EξQ(ξ)1ξ≤−Cδ/2−δ/2+EzQ(z)1z>Cδ/2+δ/2 − EξQ(ξ)1ξ>Cδ/2+δ/2∣∣Recall that for every non-negative random variable, Eγ =∫∞0 P(γ > t)dt. Using this108formula and the fact that γQ(γ) ≥ 0 for all γ, we get:|EzQ(z)− EξQ(ξ)|≤∣∣∣∣∫ ∞0(P(zQ(z)1z∈(δ/2,Cδ+δ/2] > t)− P (ξQ(ξ)1ξ∈(δ/2,Cδ+δ/2] > t)) dt∣∣∣∣+∣∣∣∣∫ ∞0(P(zQ(z)1z∈(−Cδ−δ/2,−δ/2] > t)− P (ξQ(ξ)1ξ∈(−Cδ/2−δ/2,−δ/2] > t)) dt∣∣∣∣+∣∣∣∣∫ ∞0[P(zQ(z)1|z|≥Cδ+δ/2 > t)− P (ξQ(ξ)1|ξ|≥Cδ+δ/2 > t)] dt∣∣∣∣Similar to the previous proof, we observe that Q is a piece-wise constant function. Indeed,for each i ∈ N, Q is a constant on each interval (iδ − δ/2, iδ + δ/2]. 
Therefore, the lastinequality can be rewritten as|EzQ(z)− EξQ(ξ)|≤∣∣∣∣∣∫ ∞0C∑i=1(P(ziδ1z∈(iδ−δ/2,iδ+δ/2] > t)− P (ξiδ)1ξ∈(iδ−δ/2,iδ+δ/2] > t)) dt∣∣∣∣∣+∣∣∣∣∣∫ ∞0C∑i=1(P(z(−iδ)1z∈(−iδ−δ/2,−iδ+δ/2] > t)− P (ξ(−iδ)1ξ∈(−iδ−δ/2,−iδ+δ/2] > t)) dt∣∣∣∣∣+∣∣∣∣∫ ∞0[P (|z| ≥ Cδ + δ/2 and zQ(z) > t)− P (|ξ| ≥ Cδ + δ/2 and ξQ(ξ) > t)] dt∣∣∣∣ .Note that if |z| > δ/2, then zQ(z) > (|z| − δ/2)2. The same observation applies to ξQ(ξ)109for |ξ| > δ/2. Then,|EzQ(z)− EξQ(ξ)| ≤∣∣∣∣∣∫ ∞0C∑i=1(P(max{ tiδ, iδ − δ2} < z ≤ iδ + δ2)−P(max{ tiδ, iδ − δ2} < ξ ≤ iδ + δ2))dt∣∣∣∣+∣∣∣∣∣∫ ∞0C∑i=1(P(−iδ − δ2< z ≤ min{− tiδ, −iδ + δ2})−P(−iδ − δ2< ξ ≤ min{− tiδ, −iδ + δ/2}))dt∣∣∣∣+∣∣∣∣∫ ∞0[P(|z| ≥ Cδ + δ/2 and (|z| − δ2)2 > t)−P(|ξ| ≥ Cδ + δ/2 and (|ξ| − δ2)2 > t)]dt∣∣∣∣ .Note that for large values of t, specifically, for t ≥ i2δ2+ iδ2/2, tiδ ≥ iδ+ δ2 and −iδ−δ/2 ≥− tiδ . In this case, the expression under the corresponding integral becomes zero. Usingthis observation and the triangle inequality, we get|EzQ(z)− EξQ(ξ)| ≤C∑i=1∫ i(i+ 12)δ20∣∣∣∣P(max{ tiδ , iδ − δ2} < z ≤ iδ + δ2)−P(max{ tiδ, iδ − δ2} < ξ ≤ iδ + δ2)∣∣∣∣ dt+C∑i=1∫ i(i+ 12)δ20∣∣∣∣P(−iδ − δ2 < z ≤ min{− tiδ , −iδ + δ2})−P(−iδ − δ2< ξ ≤ min{− tiδ, −iδ + δ/2})∣∣∣∣ dt+∫ C2δ20|P (|z| ≥ Cδ + δ/2) − P (|ξ| ≥ Cδ + δ/2)| dt+∫ ∞C2δ2∣∣∣P(|z| ≥ √t+ δ/2) − P(|ξ| ≥ √t+ δ/2)∣∣∣ dt.We tailor the Berry-Esseen Theorem (Theorem 1.3.1) to our setting and get the bound onthe difference of probabilities.110Corollary 4.2.20. Suppose that x ∈ Rn is unit-norm and let Xi = e1ixi. 
Then, Theorem1.3.1 implies that for each t ∈ R,∣∣∣∣∣P(n∑i=1e1ixi < t)− P (ξ < t)∣∣∣∣∣ ≤ c0n∑i=1|xi|3 = c0‖x‖33.Using Corollary 4.2.20, we get|EzQ(z)− EξQ(ξ)| ≤C∑i=1∫ i(i+ 12)δ202c0‖x‖33dt+C∑i=1∫ i(i+ 12)δ202c0‖x‖33dt+∫ C2δ20c0‖x‖33dt+∫ ∞C2δ2∣∣∣P(|z| ≥ √t+ δ/2) − P(|ξ| ≥ √t+ δ/2)∣∣∣ dt.= 4c0‖x‖33C∑i=1i(i+12)δ2 + c0‖x‖33C2δ2+∫ ∞C2δ2∣∣∣P(|z| ≥ √t+ δ/2) − P(|ξ| ≥ √t+ δ/2)∣∣∣ dt.= 4c0‖x‖33(C(C + 1)(2C + 1)6+C(C + 1)4)δ2 + c0‖x‖33C2δ2+∫ ∞C2δ2∣∣∣P(|z| ≥ √t+ δ/2) − P(|ξ| ≥ √t+ δ/2)∣∣∣ dt.=13c0‖x‖33C(2C + 1)(2C + 5)δ2+∫ ∞C2δ2∣∣∣P(|z| ≥ √t+ δ/2) − P(|ξ| ≥ √t+ δ/2)∣∣∣ dt.We proceed to bounding the integral. Using the triangle inequality and the fact that√t+ δ/2 >√t, we get∫ ∞C2δ2∣∣∣P(|z| ≥ √t+ δ/2) − P (|ξ| ≥ √t+ δ/2)∣∣∣ dt≤∫ ∞C2δ2P(|z| ≥ √t)dt+∫ ∞C2δ2P(|ξ| ≥ √t)dt.Recall that z =∑ni=1 e1ixi and the assumption ‖x‖ = 1. To bound the integral, we apply111the Hoeffding inequality (Theorem 1.3.2) and the Chernoff inequality:∫ ∞C2δ2∣∣∣P(|z| ≥ √t+ δ/2) − P (|ξ| ≥ √t+ δ/2)∣∣∣ dt≤∫ ∞C2δ2e exp(− ctK2)dt+∫ ∞C2δ22 exp(− t2)dt= eK2 exp(−cC2δ2K2)+ 4exp(−C2δ22).Proof of Theorem 4.2.14The proof closely resembles the proof of Theorem 4.2.1, and here we provide the outlineof the proof. Using singular values of E and the definition of F : z 7→ z − Q(z), thereconstruction error can be bounded as‖x− xdither‖ ≤ 1σ2min(E)‖ET (F (Ex+ τ)− τ) ‖.Corollary 4.2.17 provides the upper bound for the smallest singular value of E. Nowconsider the term ET (F (Ex+ τ)− τ). Recall that rows of E are denoted by eT1 , eT2 , ..., eTm.Then,ET (F (Ex+ τ)− τ) =m∑i=1ei(F (eT1 x+ τi)− τi).Note that for i 6= j, (ei, τi) is independent of (ej , τj), which, in turn, implies that all termsin sum are i.i.d. random vectors.We claim that e1(F (eT1 x + τ1) − τ1) is a sub-Gaussian random variable whose sub-Gaussian norm does not exceed δK provided that sub-Gaussian norm of e1 does not exceedK. Indeed, observe that the range of F is bounded by δ/2 in magnitude and τ is boundedby δ/2, so |F (eT1 x + τ1) − τ1| ≤ δ. 
Therefore, the product of the sub-Gaussian vector $e_1$ and the bounded random variable $F(e_1^Tx+\tau_1)-\tau_1$ is a sub-Gaussian random vector whose $\psi_2$-norm does not exceed $\delta K$.

By the Hoeffding inequality (Theorem 1.3.2), the sum of i.i.d. random vectors is concentrated around its mean, which we compute below:
\[
\mathbb{E}\,E^T\big(F(Ex+\tau)-\tau\big)=m\,\mathbb{E}\,e_1\big(F(e_1^Tx+\tau_1)-\tau_1\big)
=m\,\mathbb{E}\,e_1F(e_1^Tx+\tau_1)-m\,\mathbb{E}\,e_1\tau_1
=m\,\mathbb{E}_{e_1}e_1\,\mathbb{E}_{\tau_1}F(e_1^Tx+\tau_1)-m\,\mathbb{E}_{e_1}e_1\,\mathbb{E}_{\tau_1}\tau_1=0-0=0.
\]
Here, in the last step, we took the expectation with respect to the two independent random variables $e_1$ and $\tau_1$, and used that $\mathbb{E}\tau_1=0$ and that, for every $z$, $\mathbb{E}F(z+\tau_1)=0$. Therefore, the Hoeffding inequality implies
\[
P\Big(\frac1m\big\|E^T\big(F(Ex+\tau)-\tau\big)\big\|>t\Big)\le e\exp\Big(-\frac{cmt^2}{\delta^2K^2}\Big).
\]
Combining all the bounds finishes the proof.

4.3 MSQ for compressed sensing

4.3.1 Two-phase reconstruction scheme

To generalize Theorem 4.2.1 to the CS setting, we follow the procedure that is described in detail in [78]: suppose $x\in\Sigma_s^N$ is the signal to be acquired and $A\in\mathbb{R}^{m\times N}$, $m\ll N$, is a CS measurement matrix whose rows are independent isotropic sub-Gaussian random vectors with sub-Gaussian norm at most $K$. The goal is to recover the $s$-sparse signal $x$ from its quantized compressive measurements $Q(y)=Q(Ax)$.

We adopt the two-stage approach summarized in Section 4.1.2. Recall that in Stage 1 we recover the support of $x$ by using (4.29). The coarse approximation $\widehat{x}_{\mathrm{MSQ}}$ obtained as in (4.4) satisfies $\|x-\widehat{x}_{\mathrm{MSQ}}\|\le C\delta$, where $C$ is independent of $m$. Thus, setting $s'=s$, $\eta=C\delta$, and $x'=\widehat{x}_{\mathrm{MSQ}}$ in [78, Proposition 4.1], we observe that the support of $x$ is fully recovered from the indices of the largest (in magnitude) entries of $\widehat{x}_{\mathrm{MSQ}}$ provided that $|x_j|>\sqrt2\,C\delta$ for all $j$ in the support of $x$.

Theorem 4.3.1. Let $\alpha\in[0,1/2)$ be arbitrary and let $K>0$. Fix $x\in\Sigma_s^N$ such that $\min_{j\in\mathrm{supp}(x)}|x_j|\ge C\delta$, where $C$ depends only on $s$ and $K$. Suppose that $A\in\mathbb{R}^{m\times N}$ has i.i.d. sub-Gaussian entries with sub-Gaussian norm not exceeding $K$, $s\ge3$, and $\lambda=m/s\ge c_4\log N$, where $c_4=c_4(s,\alpha,K)$ is a constant.
Then, with $x_{\mathrm{MSQ}}$ obtained using the two-stage method, there are constants $c_5,c_6,c_7>0$ that depend only on $s$, $K$, and $x$ such that, with probability at least $1-c_5\exp(-c_6\lambda^{\alpha})$ on the draw of $A$, the reconstruction error satisfies
\[
\|x_{\mathrm{MSQ}}-x\|\le A\Big(\frac1m\big\|\mathbb{E}\,E^T\big(Ex-Q(Ex)\big)\big\|+c_7\sqrt{\log s}\,\lambda^{-(1-\alpha)/2}\delta\Big),
\]
where $E=A_T$, $T=\mathrm{supp}(x)$, has i.i.d. sub-Gaussian entries, which are copies of the entries of $A$.

As before, we can calculate $\|\mathbb{E}\,E^T(Ex-Q(Ex))\|$ if $A$, and thus $E$, is a Gaussian random matrix.

Corollary 4.3.2. In the setting of Theorem 4.3.1, suppose that $A\in\mathbb{R}^{m\times N}$ is a Gaussian matrix. Then, with probability at least $1-c_5\exp(-c_6\lambda^{\alpha})$ on the draw of $A$, the reconstruction error satisfies
\[
\|x_{\mathrm{MSQ}}-x\|\le A\Big(2\|x\|\exp\Big(-\frac{2\pi^2\|x\|^2}{\delta^2}\Big)+c_7\sqrt{\log s}\,\lambda^{-(1-\alpha)/2}\delta\Big).
\]
The constants above are as in Theorem 4.3.1.

4.3.2 Projected back projection

We carry over the same notation as in the previous sections. The measurement matrix is denoted by $A\in\mathbb{R}^{m\times N}$, and we assume that $A$ is a sub-Gaussian random matrix with independent centered isotropic random rows. Consider the following reconstruction scheme for the distorted measurements $q=Q(Ax)$:
\[
x_{\mathrm{PBP}}=H_s\Big(\frac1m A^Tq\Big),
\]
where $H_s$ is the projection onto the set of $s$-sparse signals; in other words, $H_s$ keeps the largest (in magnitude) $s$ entries of its argument and sets the other entries to zero. We provide a bound for the reconstruction error
\[
\|x_{\mathrm{PBP}}-x\|=\Big\|H_s\Big(\frac1m A^TQ(Ax)\Big)-x\Big\|.
\]

Theorem 4.3.3. Fix $x\in\Sigma_s^N$ with support index set $T$. Suppose that $A$ is a sub-Gaussian matrix with independent isotropic random rows whose sub-Gaussian norm does not exceed $K$, and that $\frac{1}{\sqrt m}A$ satisfies the RIP of order $2s$ with constant $\delta_{2s}$. Then, the reconstruction error satisfies
\[
\|x_{\mathrm{PBP}}-x\|\le M\mu+C_{\mathrm{PBP}}\sqrt{\log s}\,\lambda^{-1/2}\delta+4\delta_{2s}\|x\|
\]
with probability at least $1-c_{\mathrm{PBP}}-e\exp(-c_3m)$.
Here $\mu=\frac1m\big\|\mathbb{E}\,A_T^T\big(A_Tx-Q(A_Tx)\big)\big\|$, $M\ge1$ is a constant that depends only on $\lambda=m/s$, and $C_{\mathrm{PBP}}=C\big(K\sqrt{\ln(e^2(c_{\mathrm{PBP}})^{-1})}+1\big)$, where $C$ is an absolute constant.

If, in addition, $A$ is a Gaussian matrix, the bound becomes
\[
\|x_{\mathrm{PBP}}-x\|\le 2M\|x\|\exp\Big(-\frac{2\pi^2\|x\|^2}{\delta^2}\Big)+C_{\mathrm{PBP}}\sqrt{\log s}\,\lambda^{-1/2}\delta+4\delta_{2s}\|x\|
\]
with the same probability.

Let us compare this result with two related ones. First, note that if quantization is not applied, i.e., if $Q=I$, then $\|x_{\mathrm{PBP}}-x\|\le2\delta_{3s}\|x\|$ (see Section 1.2.2). If the quantization is dithered, then the result of Xu et al. [157] implies that
\[
\|x_{\mathrm{dither\,PBP}}-x\|\le2\delta_{2s}(3+\delta)\|x\|.
\]
Note that all three bounds contain a term of order $\delta_{2s}\|x\|$. Inverting the constants in Theorem 1.3.6, we get that for fixed $s$ and $N$, $\delta_{2s}=O(m^{-1/2})$; therefore, in the case without quantization and for dithered quantization, the reconstruction error decays like $O(m^{-1/2})$. For MSQ, the proven decay rate may be rewritten as
\[
\|x_{\mathrm{PBP}}-x\|\le M\,\frac1m\big\|\mathbb{E}\,A_T^T\big(A_Tx-Q(A_Tx)\big)\big\|+O(m^{-1/2}),
\]
where the constant factor hidden in the $O$-notation depends on $s$, $N$, and $K$. Note that the first term is a constant if the rows of $A$ are i.i.d., but it is a tiny number, at least for Gaussian matrices and for the case when $s$ is large.

4.3.3 Proofs

Proof of Theorem 4.3.1

The proof is similar to the proof of Theorem B in [78]; in this section, we only reproduce its main ideas.

Let $x\in\Sigma_s^N$ and let $T$ be the support of $x$. Without loss of generality, we assume that $\#T=s$. Denote the measurement matrix by $A\in\mathbb{R}^{m\times N}$, whose rows are independent isotropic sub-Gaussian vectors with sub-Gaussian norm not exceeding $K$.

We use the two-stage reconstruction method. Accordingly, in Stage 1, we recover a coarse estimate $x^{\#}_{\mathrm{MSQ}}$ of $x$ via
\[
x^{\#}_{\mathrm{MSQ}}=\operatorname{argmin}\|z\|_1 \quad\text{subject to}\quad \|Az-b\|\le 0.5\,\delta\sqrt m
\]
(4.29)

If $x$ is sufficiently sparse, $x^{\#}_{\mathrm{MSQ}}$ is a good approximation of the original signal $x$. Indeed, if $\frac{1}{\sqrt m}A$ satisfies the RIP of order $2s$ with $\delta_{2s}<\sqrt2/2$, then [27]
\[
\|x^{\#}_{\mathrm{MSQ}}-x\|\le C\delta.
\]
Recall that this is the case with overwhelming probability, for example, if $A$ is an $m\times N$ sub-Gaussian matrix with independent isotropic rows whose $\psi_2$-norm does not exceed $K$ and $m\gtrsim_K s\log(N/s)$; see Theorem 1.3.4.

The coarse estimate $x^{\#}_{\mathrm{MSQ}}$ is improved in Stage 2 of the reconstruction. The key idea is that $T$, the support of $x$, can be recovered from $x^{\#}_{\mathrm{MSQ}}$ with high probability provided that all non-zero entries of $x$ exceed $\sqrt2\,C\delta$; see [78, Proposition 4.1].

The next step is to recover the non-zero entries of $x$ using the support $T$, or, equivalently, to recover $x_T$. Note that $Q(y)=Q(Ax)=Q(A_Tx_T)$. Recall that the rows of $A$ are independent isotropic sub-Gaussian random vectors with sub-Gaussian norm at most $K$, and so are the rows of $A_T$. Therefore, the conditions of Theorem 4.2.1 are satisfied for the frame $A_T$.

In the setting of Theorem 4.2.1, let $\alpha\in[0,1/2)$, $\lambda=m/s$, and let $c_2:=e^2\exp(-\lambda^{\alpha})$. Then, $c_1=C\sqrt{\ln(e^2c_2^{-1})}=C\lambda^{\alpha/2}$ and, by Theorem 4.2.1, for a given $x_T\in\mathbb{R}^s$, with probability exceeding $1-2e^{-c_3m}-e^2\exp(-\lambda^{\alpha})$, we have
\[
\|A_T^{\dagger}Q(A_Tx_T)-x_T\|<M\Big(2\|x\|\exp\Big(-\frac{2\pi^2\|x\|^2}{\delta^2}\Big)+c_7\sqrt{\log s}\,\lambda^{-1/2+\alpha/2}\delta\Big),\qquad(4.30)
\]
where $M=(1/2-cK\lambda^{-1/2})^{-2}$.

To guarantee the fine recovery in Stage 2 of the reconstruction, each $m\times s$ submatrix of $A$ must satisfy the conditions of Theorem 4.2.1, which would imply the validity of (4.30). In other words, we need to show that the rows of $A_T$ are independent isotropic sub-Gaussian random vectors for all $\#T\le s$. Recall that the rows of the $m\times N$ matrix $A$ are independent sub-Gaussian isotropic random vectors. Although the $s$ columns forming $A_T$ are chosen based on the support of the signal $x$, the rows of $A_T$ are restrictions of the rows of $A$ and hence remain independent; moreover, they remain isotropic and sub-Gaussian. Therefore, for each fixed choice of $s$ columns, (4.30) holds with probability at least $1-2\exp(-c_3m)-e^2e^{-\lambda^{\alpha}}$.
Finally, applying the result of Theorem 4.2.1 finishes the proof.

Proof of Theorem 4.3.3

Before we start, let us recall Lemma 2.1 in [29]:

Theorem 4.3.4. Suppose that $\frac{1}{\sqrt m}A$ satisfies the RIP of order $2s$ with constant $\delta_{2s}$. Then, for every $x\in\Sigma_s^N$ and every $z\in\Sigma_{2s}^N$ whose support set includes the support set of $x$,
\[
\Big|\frac1m\langle Ax,Az\rangle-\langle x,z\rangle\Big|\le2\delta_{2s}\|x\|\|z\|.
\]
Equivalently, if for any set $S$ such that $|S|=2s$ and $\mathrm{supp}(x)\subset S$ we restrict $\frac1mA^TAx-x$ to the index set $S$, we get
\[
\Big\|\Big(\frac1mA^TAx-x\Big)_S\Big\|\le2\delta_{2s}\|x\|.
\]

The proof loosely follows the spirit of [157]. Let $S$ be the support of $x_{\mathrm{PBP}}$ and let $T$ be the support of $x$. Applying the triangle inequality, we get
\[
\begin{aligned}
\|x_{\mathrm{PBP}}-x\|&=\Big\|\Big(\frac1mA^TQ(Ax)\Big)_S-x\Big\|\\
&\le\Big\|\Big(\frac1mA^TQ(Ax)\Big)_S-\Big(\frac1mA^TAx\Big)_S\Big\|+\Big\|\Big(\frac1mA^TAx\Big)_S-\Big(\frac1mA^TAx\Big)_{S\cup T}\Big\|+\Big\|\Big(\frac1mA^TAx\Big)_{S\cup T}-x\Big\|\\
&=\Big\|\Big(\frac1mA^TQ(Ax)\Big)_S-\Big(\frac1mA^TAx\Big)_S\Big\|+\Big\|\Big(\frac1mA^TAx\Big)_{S^c\cap T}\Big\|+\Big\|\Big(\frac1mA^TAx-x\Big)_{S\cup T}\Big\|.
\end{aligned}
\]
Recall that $|S|=|T|=s$. Since $S$ is the index set of the $s$ largest (in magnitude) entries of $\frac1mA^TQ(Ax)$ and $|S^c\cap T|=|S\cap T^c|$, we get the following bound:
\[
\begin{aligned}
\|x_{\mathrm{PBP}}-x\|&\le\Big\|\Big(\frac1mA^TQ(Ax)\Big)_S-\Big(\frac1mA^TAx\Big)_S\Big\|+\Big\|\Big(\frac1mA^TAx\Big)_{S\cap T^c}\Big\|+\Big\|\Big(\frac1mA^TAx-x\Big)_{S\cup T}\Big\|\\
&\le\Big\|\Big(\frac1mA^TF(Ax)\Big)_S\Big\|+2\Big\|\Big(\frac1mA^TAx-x\Big)_{S\cup T}\Big\|.
\end{aligned}
\]
Let $x_T\in\mathbb{R}^s$ be the vector that consists of the non-zero entries of $x$. Note that $Ax$ can be considered as a linear combination of the columns of $A$, and only the columns whose indices are in $T$ are multiplied by a non-zero number. Therefore, $Ax=A_Tx_T$, where $A_T$ is the submatrix of $A$ that keeps only the columns whose indices are in $T$.

Applying the triangle inequality, we get
\[
\begin{aligned}
\|x_{\mathrm{PBP}}-x\|&\le\frac1m\big\|\big(A^TF(Ax)\big)_S\big\|+2\Big\|\Big(\frac1mA^TAx-x\Big)_{S\cup T}\Big\|\\
&\le\frac1m\big\|\big(A^TF(A_Tx_T)\big)_{S\cap T}\big\|+\frac1m\big\|\big(A^TF(A_Tx_T)\big)_{S\cap T^c}\big\|+2\Big\|\Big(\frac1mA^TAx-x\Big)_{S\cup T}\Big\|\\
&\le\frac1m\big\|A_T^TF(A_Tx_T)\big\|+\frac1m\big\|A_{S\cap T^c}^TF(A_Tx_T)\big\|+2\Big\|\Big(\frac1mA^TAx-x\Big)_{S\cup T}\Big\|.
\end{aligned}
\]
In the last line, we used that increasing the number of entries from $|S\cap T|$ to $|T|$ can only increase the norm.

Let us bound each term individually.
Note that the rows of $A_T$ keep all the properties of the rows of $A$: they are independent, isotropic, and sub-Gaussian with the same sub-Gaussian norm. Therefore, we can use Theorem 4.2.1 or Theorem 4.2.9 to estimate the first term.

We proceed to bounding the term $\frac1m\|A_{S\cap T^c}^TF(A_Tx_T)\|$. Let $A=(a_{ij})$. For each $i\in S\cap T^c$, consider the $i$th entry of the vector $A_{S\cap T^c}^TF(A_Tx_T)$:
\[
\frac1m\big(A_{S\cap T^c}^TF(A_Tx_T)\big)_i=\frac1m\sum_{j=1}^m a_{ji}F\Big(\sum_{k\in T}a_{jk}x_k\Big).
\]
Since $A$ has independent rows, the expression above is a sum of independent random variables. Let us investigate each of these random variables. Since $S\cap T^c$ and $T$ do not intersect, $A_{S\cap T^c}$ and $A_T$ are independent; therefore, $a_{ji}$ and $F(\sum_{k\in T}a_{jk}x_k)$ are independent. We conclude that $\mathbb{E}\,a_{ji}F(\sum_{k\in T}a_{jk}x_k)=0$. Since $F(A_Tx_T)$ is bounded by $\delta/2$, $a_{ji}F(\sum_{k\in T}a_{jk}x_k)$ is a sub-Gaussian random variable with mean 0 and sub-Gaussian norm at most $K\delta/2$. We can apply the Hoeffding inequality (Theorem 1.3.2) to bound the sum of these random variables. Taking the union bound over the entries and using $|S\cap T^c|\le s$, we conclude that
\[
\frac1m\big\|A_{S\cap T^c}^TF(A_Tx_T)\big\|\le C\sqrt{\log s}\,\lambda^{-1/2}\delta
\]
with probability at least $1-e\exp(-C')$.

The last term is bounded in Theorem 4.3.4. Combining all three bounds yields the statement of the theorem.

4.4 Numerical experiments

4.4.1 Experiment 1. MSQ for frame theory.

This first set of experiments illustrates the error decay for the MSQ quantizer (with various quantizer step sizes $\delta$) for various classes of measurement matrices. To that end, we fix a unit-norm signal $x\in\mathbb{R}^n$, where $n=20$, and consider $m\times n$ random frames $E$, where $m$ varies between 20 and 2000 (i.e., $\lambda\in[1,100]$). For each $m$, we pick 1000 realizations of $E$ and calculate the reconstruction error $\mathcal{E}(x)=\|x-E^{\dagger}(Q^{\mathrm{MSQ}}_{\delta}(Ex))\|$. We perform this for three distinct values of the quantizer resolution $\delta$, specifically $\delta\in\{0.01,0.05,0.1\}$. For each $\delta$, we report the average value of $\mathcal{E}(x)$ over the 1000 realizations of $E$.
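This experiment is straightforward to reproduce. The sketch below is a scaled-down illustration (Gaussian frames only, far fewer trials than the 1000 used above), not the code used to generate the figures:

```python
import numpy as np

def msq(y, delta):
    """Memoryless scalar quantization: round each entry to the nearest point of delta*Z."""
    return delta * np.round(y / delta)

def avg_error(m, n=20, delta=0.1, trials=50, rng=None):
    """Average reconstruction error ||x - E^+ Q(Ex)|| over random Gaussian frames E."""
    rng = np.random.default_rng(rng)
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                   # fixed unit-norm signal
    errs = []
    for _ in range(trials):
        E = rng.standard_normal((m, n))      # m x n Gaussian frame
        q = msq(E @ x, delta)                # quantized frame coefficients
        x_hat = np.linalg.pinv(E) @ q        # linear (pseudoinverse) reconstruction
        errs.append(np.linalg.norm(x - x_hat))
    return np.mean(errs)

# For small delta, the average error should shrink roughly like lambda^{-1/2} as m grows.
e_small = avg_error(40, delta=0.1, rng=0)
e_large = avg_error(640, delta=0.1, rng=0)
print(e_small, e_large)
```

For small $\delta$ the constant term is far below the $\lambda^{-1/2}$ term over this range of $m$, so the printed errors decrease with $m$, consistent with Figure 4.1.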
Figure 4.1 shows the outcomes in three different scenarios: when $E$ is a Gaussian matrix (A), a Bernoulli matrix (B), and when the rows of $E$ are drawn independently from the uniform distribution on the sphere with radius $n^{1/2}$ (C). The observed error decay rates are consistent with the m-WNH, which predicts a decay like $\delta\lambda^{-1/2}$, as well as with Theorem 4.2.1, which shows that the error behaves like $C_1+C_2\delta\lambda^{-1/2}$, where $C_1$ is a generally non-zero, but possibly very small, constant, at least in the case of Gaussian random matrices.

Figure 4.1: The numerical behavior (in log-log scale) of the mean of the reconstruction error, as a function of $\lambda=m/n$, for a fixed unit-norm signal $x$ and an $m\times n$ random matrix $E$ with i.i.d. standard Gaussian entries (A), with i.i.d. Bernoulli entries (B), and with independent rows, uniformly distributed on the sphere of radius $\sqrt n$ (C). For each $\delta=0.1,0.05,0.01$, we draw 1000 realizations of the random matrix $E$, compute the quantized frame coefficients of $x$, and reconstruct using $E^{\dagger}$. We plot the average reconstruction error for each $\delta$.

So, assuming that $C_1$ is small for the other random ensembles as well, the error in these experiments is far from saturating at the level of $C_1$. Next, we design a numerical experiment that allows us to observe $C_1$.

4.4.2 Experiment 2. Lower bound for the decay rate.

This experiment aims to provide a numerical illustration of the result of Theorem 4.2.1. Consider a deterministic unit-norm signal $x$ and a random frame $E$ that satisfy the conditions of Theorem 4.2.1.
Then, Theorem 4.2.1 states that the reconstruction error $\mathcal{E}(x)$ satisfies
\[
A'\big(\mu-c_1K\sqrt n\,\lambda^{-1/2}\delta\big)\le\mathcal{E}(x)\le A\big(\mu+c_1K\sqrt n\,\lambda^{-1/2}\delta\big)
\]
with probability at least $1-c_2-2\exp(-c_3m)$, where $\mu=\frac1m\|\mathbb{E}\,E^T(Ex-Q(Ex))\|$, $A$ and $A'$ are some positive constants, $c_2\in(0,1)$, and $c_1=c_1(c_2)$ is as in Theorem 4.2.1. According to Proposition 4.2.3, we expect the term $\mu$ to be of order $O(1)$ as $m\to\infty$. In particular, if $\mu$ is not zero, the reconstruction error should tend to a value between $A'\mu$ and $A\mu$ as $\lambda=m/n\to\infty$. Note that this does not contradict the outcomes of Experiment 1. For example, in the Gaussian case, Theorem 4.2.6 provides a rigorous estimate of the value of $\mu$, which is extremely small when $\delta<1$. Therefore, in the experiments with Gaussian frames, we can only observe the term $Ac_1K\sqrt n\,\lambda^{-1/2}\delta$ for the given range of $\lambda$.

In order to actually observe the influence of the constant term, we now set $\delta:=4$ (thus creating an artificial set-up) and repeat Experiment 1 for Gaussian matrices, Bernoulli matrices, and matrices whose rows are randomly drawn from the uniform distribution on the sphere, with $\lambda\in[10,1000]$ this time. The outcomes are shown in Figure 4.2.

Figure 4.2: The numerical behaviour of the reconstruction error in the setting identical to that described in Figure 4.1; this time $\delta=4$ and $\lambda\in[10,1000]$. The results shown are the outcomes for Gaussian matrices (a), Bernoulli matrices (b), and matrices whose rows are randomly drawn from the uniform distribution on the sphere (c).

Specifically, in Figure 4.2a, we observe that when $E$ is Gaussian, the error settles down to a value between $2\|x\|\big(\exp(-2\pi^2\|x\|^2/\delta^2)-\exp(-8\pi^2\|x\|^2/\delta^2)\big)$ and $2\|x\|\exp(-2\pi^2\|x\|^2/\delta^2)$, as predicted.
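The two predicted saturation levels for $\delta=4$ and $\|x\|=1$ can be evaluated directly; a quick numerical check of the two expressions above:

```python
import math

delta, norm_x = 4.0, 1.0
# Upper and lower saturation levels predicted for the Gaussian case.
upper = 2 * norm_x * math.exp(-2 * math.pi**2 * norm_x**2 / delta**2)
lower = 2 * norm_x * (math.exp(-2 * math.pi**2 * norm_x**2 / delta**2)
                      - math.exp(-8 * math.pi**2 * norm_x**2 / delta**2))
print(lower, upper)   # approximately 0.568 and 0.582
```

Both values fall inside the narrow band where the Gaussian curve in Figure 4.2a flattens out.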
In Figure 4.2b, we observe that the behaviour of the error in the Bernoulli case is similar, in that it settles to a value comparable with the Gaussian case. While we do not have a sharp estimate for the bounds in this case, the experiment suggests that, as in the Gaussian case, $\mu$ is a non-zero but small constant when $E$ is a Bernoulli matrix. Finally, Figure 4.2c shows that a similar conclusion can be drawn for the case when $E$ has random rows, drawn independently from the uniform distribution on the sphere with radius $n^{1/2}$.

Finally, we consider random submatrices of the Discrete Fourier Transform (DFT) matrix whose rows and columns are drawn in the following way: we pick the first $s$ columns of the $N\times N$ Fourier matrix, with $N=100000$. Note that this results in a harmonic frame for $\mathbb{C}^s$. Next, we fix $m<N$ and randomly draw $m$ rows of this harmonic frame, resulting in an $m\times s$ random frame $E$. As above, for each $m$, we pick 1000 realizations of $E$ and calculate the average of the reconstruction error $\mathcal{E}(x)=\|x-E^{\dagger}(Q^{\mathrm{MSQ}}_{\delta}(Ex))\|$ with $\delta=0.01$ over these realizations. The results are reported in Figure 4.3 and suggest that the average reconstruction error approaches a constant approximately equal to $\delta/2$. Note that this is qualitatively different from the other random matrix ensembles we considered in Experiment 1, in that we observe the constant term in the error bounds in a realistic setting (with $\delta=0.01$) rather than the artificial setting with $\delta=4$ above.

Figure 4.3: The numerical behaviour of the reconstruction error in the setting identical to that described in Figure 4.1; this time $E$ is an $m\times n$ random Fourier matrix obtained by restricting the $N\times N$ DFT matrix (with $N=100000$) to its first $n=20$ columns and then selecting $m$ random rows. Here $\delta=0.01$ and $\lambda\in[10,1000]$.

4.4.3 Experiment 3. Compressed sensing setting.

This experiment focuses on CS and is designed to observe the error decay shown in Theorem 4.3.1 when the two-stage reconstruction scheme described in Section 4.3 is used to reconstruct a sparse $x$ from its quantized (by MSQ) compressed measurements. Let the signal $x\in\Sigma_s^N$, where $N=1000$ and $s=20$, and let $A$ be $m\times N$. We will quantize compressed measurements of $x$, given by $Ax$, using MSQ. Here is a list of the various parameters in our experiments: $A$ will be Gaussian (Figure 4.4a), Bernoulli (Figure 4.4b), or a random matrix whose rows are drawn i.i.d. from the sphere of radius $\sqrt N$ (Figure 4.4c). We vary $m$ between 100 and 500, corresponding to $\lambda\in[5,25]$. The quantization step size is $\delta\in\{0.01,0.05,0.1\}$. Finally, to ensure that the condition for successful support recovery in Theorem 4.3.1 is satisfied, we choose the non-zero entries of $x$ randomly from $\{\pm1/\sqrt s\}$. It turns out that this condition is not satisfied when $A$ is Bernoulli and $\delta=0.1$, so in the Bernoulli case we restrict our experiments to $\delta=0.01$ and $\delta=0.05$. The outcomes are given in Figure 4.4, where we observe that, in each case, the decay of the reconstruction error agrees with the result of Theorem 4.3.1.

Figure 4.4: The numerical behaviour of the reconstruction error in the CS setting. The results shown are the outcomes for Gaussian measurement matrices (a), Bernoulli measurement matrices (b), and matrices whose rows are randomly drawn from the uniform distribution on the sphere (c). Here $x\in\Sigma_s^N$, where $N=1000$ and $s=20$; the measurement matrix $A$ is $m\times N$, where $m$ varies between 100 and 500, corresponding to $\lambda$ varying between 5 and 25.

Chapter 5

Conclusions

This thesis focuses on the following aspects of the acquisition of signals from their linear measurements.

Scenario 1.
Time-efficient and memory-feasible algorithms are crucial in all applications. Due to the recent rapid growth of the amount of acquired data and the relatively slow increase of computer working memory capacity, numerous applications require algorithms that are efficient in both working memory and running time. One such algorithm is the randomized Kaczmarz (RK) algorithm, a solver for consistent overdetermined linear systems. This solver uses a single measurement per iteration, which makes its working memory complexity O(n), where n is the dimension of the signal. For comparison, loading a full measurement matrix requires O(mn) working memory, where m is the number of measurements. In Chapters 2 and 3, we consider the RK algorithm and its modifications for sparse recovery problems under the constraints above.

First, we consider the online sampling scenario, in which the measurement vector is a random ±1 Bernoulli vector in R^n drawn independently at each iteration. Denote the sparsity of the underlying signal by s and the estimated sparsity by ρs. For the case when ρs > s and s²(ρ − 1) < n, we refine the previously known support detection time for the RK algorithm from O(n² log s) to O(n²). We leave the question of the convergence rate of the RK algorithm for other online sampling schemes and for measurements using overdetermined sub-Gaussian matrices for future studies. The new bound suggests that the RK algorithm is efficient for support acquisition, at least in certain cases, which we verify numerically as follows: we compare the support detection time of the RK algorithm and of gradient descent, and conclude that the RK algorithm may outperform gradient descent, at least in particular settings.

Then, we focus on modifications of the RK algorithm that promote sparsity. In this case, we investigate both over- and under-determined consistent linear systems.
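For reference, the basic RK iteration projects the current iterate onto the solution hyperplane of one randomly chosen equation. The sketch below is a minimal illustration (row sampling proportional to squared row norms, as in the variant of Strohmer and Vershynin), not the thesis implementation:

```python
import numpy as np

def randomized_kaczmarz(A, b, iters=3000, rng=None):
    """Solve a consistent system Ax = b touching one row of A per iteration.

    Only a single row is used per step, so the working state is O(n).
    Rows are sampled with probability proportional to their squared norms.
    """
    rng = np.random.default_rng(rng)
    m, n = A.shape
    row_norms2 = np.einsum("ij,ij->i", A, A)
    probs = row_norms2 / row_norms2.sum()
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.choice(m, p=probs)
        a_i = A[i]
        # Project x onto the hyperplane <a_i, x> = b_i.
        x += (b[i] - a_i @ x) / row_norms2[i] * a_i
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 20))   # overdetermined system
x_true = rng.standard_normal(20)
b = A @ x_true                        # consistent right-hand side
x_hat = randomized_kaczmarz(A, b, iters=3000, rng=1)
print(np.linalg.norm(x_hat - x_true))
```

For a well-conditioned overdetermined Gaussian system such as this one, the iterates converge linearly in expectation, and a few thousand single-row updates already recover x to high accuracy.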
The sparse randomized Kaczmarz (SRK) algorithm is an established, effective solver for sparse recovery problems in both the over- and under-determined cases. Numerically, the algorithm converges linearly in the 2-norm; previously, no theoretical convergence guarantees were known. We prove that the SRK algorithm converges linearly in the local regime if the measurement matrix A is overdetermined or if A satisfies the RIP. These findings are supported numerically. We also verify numerically that the SRK algorithm outperforms many other Kaczmarz-based algorithms. Next, we prove that the SRK iterates are bounded in expectation under the same assumptions on the measurement matrix. Finally, we provide intuition on why the SRK algorithm converges by analyzing the element-wise dynamics. Rigorous convergence guarantees for the general case are a subject for further research.

Then, we combine the RK algorithm and the iteratively reweighted least squares (IRLS) algorithm in the following way: at each iteration of the IRLS, we solve the corresponding minimization problem using the RK algorithm. Such an implementation requires only O(n) working memory, and thus the algorithm is memory-efficient. Next, we investigate the running-time performance of the algorithm in comparison with using the SVD and the QR decomposition as solvers. We find that, regardless of the solver, each subsequent iteration of the IRLS requires more CPU time, and the implementation using the RK algorithm is affected the most. We identify two potential reasons for this phenomenon. To alleviate these flaws, we propose a family of novel compressed sensing algorithms named IRWK, which are both time- and memory-efficient. Numerically, the total number of iterations is of the same order for the IRLS and IRWK algorithms; still, the IRWK algorithm is faster than the IRLS because each IRWK iteration requires significantly fewer RK updates.
We rigorously prove that the IRWK algorithms converge linearly in the local regime, and we numerically observe linear convergence in CPU time. Also, we prove that the reconstructed sequence x^k becomes more compressible in the sense that ‖x^k_{S^c_k}‖ = ‖x^k − H_{ρs}(x^k)‖ is non-increasing, which is of use if the algorithm is terminated before convergence. We support this statement numerically and also observe that ‖x^k_{S^c_k}‖/‖x^k_{S_k}‖ and ‖x^k_{S^c_k}‖_∞/‖x^k_{S_k}‖_∞ are non-increasing.

Scenario 2. Digital technology requires mapping the acquired measurements to a predefined discrete alphabet for their storage and processing, i.e., quantizing the measurements. In this case, exact reconstruction of the underlying signal is, in general, impossible. In Chapter 4, we focus on the dependence of the reconstruction error on the number of measurements m. Heuristically, additional measurements should improve accuracy because they provide more information. Our research focuses on MSQ quantization, which essentially rounds off each individual measurement to the closest point in δZ, δ > 0. In the frame theory setting, numerical experiments suggest that the reconstruction error decays like O(m^{−1/2})δ, whereas the existing theory only gave a bound of order O(1)δ; we investigated this discrepancy. Note that the previous theoretical bound does not guarantee that additional measurements improve the reconstruction error, which is a significant disadvantage for the quantization method. We rigorously show that the reconstruction error is of order vδ + O(m^{−1/2})δ, where v is a constant independent of m. Therefore, if, for the given distribution of the random matrix and the signal, v = 0, then the reconstruction error decays to zero. If v ≠ 0, then the reconstruction error is not guaranteed to diminish to zero. Accordingly, it is crucial to distinguish these two cases. For matrices with i.i.d.
entries whose distribution is a Schwartz function, e.g., Gaussian, we show that v is non-zero but so small as to be unnoticeable in practice. This result therefore explains the discrepancy between theory and practice and refines the accuracy of the previously known bound. We provide rigorous and sharp upper and lower bounds on the value of v and confirm them numerically. We also provide a rigorous lower bound on v for a broad class of signals in large ambient dimensions and for many matrices with i.i.d. entries; in this case, v ≠ 0. We discuss that such behavior may be observed in some cases in smaller dimensions as well. We extend the methodology above to dithered quantization and confirm the previously known theoretical result that, in this case, the reconstruction error is of order O(m^{−1/2})δ. We also extend the results to noisy measurements. As potential future work, one may obtain a refined upper bound on the value of v for general sub-Gaussian random matrices.

Quantization is also of interest in compressed sensing problems. In this thesis, we consider reconstruction from measurements quantized according to MSQ using the following algorithms: the PBP and two-stage algorithms. For both algorithms, we rigorously prove that the reconstruction error is of order v + O(m^{−(1−α)/2})δ with probability 0.99 for any α ∈ (0, 1). Here v ≠ 0 in the same cases as in the setting above. Further studies may establish corresponding results for other compressed sensing algorithms.

Bibliography

[1] Albert Ai, Alex Lapanowski, Yaniv Plan, and Roman Vershynin. One-bit compressed sensing with non-Gaussian measurements. Linear Algebra and its Applications, 441:222–239, 2014.

[2] Akram Aldroubi, Xuemei Chen, and Alex Powell. Stability and robustness of ℓq minimization using null space property. Proceedings of SampTA 2011, 2011.

[3] Ethem Alpaydin. Introduction to machine learning.
MIT Press, 2020.

[4] A Morelli Andrés, Sebastian Padovani, Mariano Tepper, and Julio Jacobo-Berlles. Face recognition on partially occluded images using compressed sensing. Pattern Recognition Letters, 36:235–242, 2014.

[5] Andreas Antoniou. Digital signal processing. McGraw-Hill, 2016.

[6] Aleksandr Y Aravkin, James V Burke, and Daiwei He. IRLS for sparse recovery revisited: Examples of failure and a remedy. arXiv preprint arXiv:1910.07095, 2019.

[7] Youness Arjoune, Naima Kaabouch, Hassan El Ghazi, and Ahmed Tamtaoui. Compressive sensing: Performance comparison of sparse recovery algorithms. In 2017 IEEE 7th Annual Computing and Communication Workshop and Conference (CCWC), pages 1–7. IEEE, 2017.

[8] Demba Ba, Behtash Babadi, Patrick L Purdon, and Emery N Brown. Convergence and stability of iteratively re-weighted least squares algorithms. IEEE Transactions on Signal Processing, 62(1):183–195, 2013.

[9] Afonso S Bandeira, Edgar Dobriban, Dustin G Mixon, and William F Sawin. Certifying the restricted isometry property is hard. IEEE Transactions on Information Theory, 59(6):3448–3450, 2013.

[10] João Carlos Alves Barata and Mahir Saleh Hussein. The Moore–Penrose pseudoinverse: A tutorial review of the theory. Brazilian Journal of Physics, 42(1-2):146–165, 2012.

[11] John J Benedetto, Alexander M Powell, and Özgür Yilmaz. Σ∆ quantization and finite frames. IEEE Transactions on Information Theory, 52(5):1990–2005, 2006.

[12] William Ralph Bennett. Spectra of quantized signals. Bell System Tech. J., 27:446–472, 1948.

[13] Aaron Berk, Yaniv Plan, and Özgür Yilmaz. Parameter instability regimes in sparse proximal denoising programs. In 2019 13th International Conference on Sampling Theory and Applications (SampTA), pages 1–5. IEEE, 2019.

[14] James Blum, Mark Lammers, Alexander M Powell, and Özgür Yılmaz. Sobolev duals in frame theory and Σ∆ quantization. Journal of Fourier Analysis and Applications, 16(3):365–381, 2010.

[15] Thomas Blumensath and Mike E Davies.
Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications, 14(5-6):629–654, 2008.

[16] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.

[17] Thomas Blumensath and Mike E Davies. Normalized iterative hard thresholding: Guaranteed stability and performance. IEEE Journal of Selected Topics in Signal Processing, 4(2):298–309, 2010.

[18] Holger Boche, Robert Calderbank, Gitta Kutyniok, and Jan Vybíral. A survey of compressed sensing. In Compressed Sensing and its Applications, pages 1–39. Springer, 2015.

[19] Bernhard G Bodmann and Stanley P Lipshitz. Randomly dithered quantization and Σ∆ noise shaping for finite frames. Applied and Computational Harmonic Analysis, 25(3):367–380, 2008.

[20] Sergiy Borodachov and Yang Wang. Lattice quantization error for redundant representations. Applied and Computational Harmonic Analysis, 27(3):334–341, 2009.

[21] Petros T Boufounos. Reconstruction of sparse signals from distorted randomized measurements. In Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 3998–4001. IEEE, 2010.

[22] Petros T Boufounos. Universal rate-efficient scalar quantization. IEEE Transactions on Information Theory, 58(3):1861–1872, 2011.

[23] Petros T Boufounos and Richard G Baraniuk. 1-bit compressive sensing. In Information Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference on, pages 16–21. IEEE, 2008.

[24] Petros T Boufounos, Laurent Jacques, Felix Krahmer, and Rayan Saab. Quantization and compressive sensing. In Compressed Sensing and its Applications, pages 193–237. Springer, 2015.

[25] Valerii V Buldygin and Yu V Kozachenko. Sub-Gaussian random variables. Ukrainian Mathematical Journal, 32(6):483–489, 1980.

[26] James V Burke, Frank E Curtis, Hao Wang, and Jiashan Wang. Iterative reweighted linear least squares for exact penalty subproblems on product sets.
SIAM Journal on Optimization, 25(1):261–294, 2015.

[27] T Tony Cai and Anru Zhang. Sparse representation of a polytope and recovery of sparse signals and low-rank matrices. IEEE Transactions on Information Theory, 60(1):122–132, Jan 2014.

[28] Emmanuel J Candès. Compressive sampling. In Proceedings of the International Congress of Mathematicians, volume 3, pages 1433–1452. Madrid, Spain, 2006.

[29] Emmanuel J Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):589–592, 2008.

[30] Emmanuel J Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52:489–509, 2006.

[31] Emmanuel J Candès, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.

[32] Emmanuel J Candès and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[33] Emmanuel J Candès and Terence Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theory, 52:5406–5425, 2006.

[34] Peter G Casazza and Gitta Kutyniok. Finite Frames: Theory and Applications. Springer, 2012.

[35] Peter G Casazza, Gitta Kutyniok, and Friedrich Philipp. Introduction to finite frame theory. In Finite Frames, pages 1–53. Springer, 2013.

[36] Yair Censor, Gabor T Herman, and Ming Jiang. A note on the behavior of the randomized Kaczmarz algorithm of Strohmer and Vershynin. Journal of Fourier Analysis and Applications, 15(4):431–436, 2009.

[37] Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.

[38] Xuemei Chen and Alexander M Powell. Almost sure convergence of the Kaczmarz algorithm with random measurements.
Journal of Fourier Analysis and Applications, 18(6):1195–1214, 2012.
[39] Evan Chou and C Sinan Güntürk. Distributed noise-shaping quantization: I. Beta duals of finite frames and near-optimal quantization of random measurements. Constructive Approximation, 44(1):1–22, 2016.
[40] Evan Chou and C Sinan Güntürk. Distributed noise-shaping quantization: II. Classical frames. In Excursions in Harmonic Analysis, Volume 5, pages 179–198. Springer, 2017.
[41] Evan Chou, C Sinan Güntürk, Felix Krahmer, Rayan Saab, and Özgür Yılmaz. Noise-shaping quantization methods for frame-based and compressive sampling systems. In Sampling Theory, a Renaissance, pages 157–184. Springer, 2015.
[42] Evan Chou, C Sinan Güntürk, Felix Krahmer, Rayan Saab, and Özgür Yılmaz. Noise-Shaping Quantization Methods for Frame-Based and Compressive Sampling Systems, pages 157–184. Springer, 2015.
[43] Ole Christensen. An Introduction to Frames and Riesz Bases. Springer, 2016.
[44] Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Compressed sensing and best k-term approximation. Journal of the American Mathematical Society, 22(1):211–231, 2009.
[45] Zoran Cvetkovic. Resilience properties of redundant expansions under additive noise and quantization. IEEE Transactions on Information Theory, 49(3):644–656, 2003.
[46] Zoran Cvetkovic and Martin Vetterli. On simple oversampled A/D conversion in L2(R). IEEE Transactions on Information Theory, 47(1):146–154, 2001.
[47] Michael Cwikel and Mario Milman. Functional Analysis, Harmonic Analysis, and Image Processing: A Collection of Papers in Honor of Björn Jawerth, volume 693. American Mathematical Society, 2017.
[48] Ingrid Daubechies, Ronald DeVore, Massimo Fornasier, and C Sinan Güntürk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, 63(1):1–38, 2010.
[49] De Huang, Jonathan Niles-Weed, Joel A Tropp, and Rachel Ward.
Matrix concentration for products. arXiv preprint arXiv:2003.05437, 2020.
[50] Percy Deift, Felix Krahmer, and C Sinan Güntürk. An optimal family of exponentially accurate one-bit ΣΔ quantization schemes. Communications on Pure and Applied Mathematics, 64(7):883–919, 2011.
[51] Frank Deutsch and Hein Hundal. The rate of convergence for the method of alternating projections, II. Journal of Mathematical Analysis and Applications, 205(2):381–405, 1997.
[52] Sjoerd Dirksen and Shahar Mendelson. Non-Gaussian hyperplane tessellations and robust one-bit compressed sensing. arXiv preprint arXiv:1805.09409, 2018.
[53] David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52:1289–1306, 2006.
[54] David L Donoho. For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.
[55] Marco F Duarte, Mark A Davenport, Dharmpal Takhar, Jason N Laska, Ting Sun, Kevin F Kelly, and Richard G Baraniuk. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83–91, 2008.
[56] Yonina C Eldar and Gitta Kutyniok. Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.
[57] Yonina C Eldar and Deanna Needell. Acceleration of randomized Kaczmarz method via the Johnson–Lindenstrauss lemma. Numerical Algorithms, 58(2):163–177, 2011.
[58] Alina Ene and Adrian Vladu. Improved convergence for ℓ1 and ℓ∞ regression via iteratively reweighted least squares. In International Conference on Machine Learning, pages 1794–1801, 2019.
[59] Charles L Epstein. Introduction to the Mathematics of Medical Imaging. SIAM, 2007.
[60] Joe-Mei Feng and Felix Krahmer. An RIP-based approach to ΣΔ quantization for compressed sensing.
IEEE Signal Processing Letters, 21(11):1351–1355, 2014.
[61] Li Feng, Thomas Benkert, Kai Tobias Block, Daniel K Sodickson, Ricardo Otazo, and Hersh Chandarana. Compressed sensing for body MRI. Journal of Magnetic Resonance Imaging, 45(4):966–987, 2017.
[62] Alyson K Fletcher, Sundeep Rangan, and Vivek K Goyal. On the rate-distortion performance of compressed sensing. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 3, pages III–885. IEEE, 2007.
[63] Valeria Fonti and Eduard Belitser. Feature selection using LASSO. VU Amsterdam Research Paper in Business Analytics, 30:1–25, 2017.
[64] Simon Foucart. Sparse recovery algorithms: sufficient conditions in terms of restricted isometry constants. In Approximation Theory XIII: San Antonio 2010, pages 65–77. Springer, 2012.
[65] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Bulletin of the American Mathematical Society, 54:151–165, 2017.
[66] Simon Foucart and Srinivas Subramanian. Iterative hard thresholding for low-rank recovery from rank-one projections. Linear Algebra and its Applications, 572:117–134, 2019.
[67] Michael P Friedlander, Ives Macedo, and Ting Kei Pong. Gauge optimization and duality. SIAM Journal on Optimization, 24(4):1999–2022, 2014.
[68] Michael P Friedlander and Paul Tseng. Exact regularization of convex programs. SIAM Journal on Optimization, 18(4):1326–1350, 2008.
[69] A Galántai. On the rate of convergence of the alternating projection method in finite dimensional spaces. Journal of Mathematical Analysis and Applications, 310(1):30–44, 2005.
[70] Rahul Garg and Rohit Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 337–344, 2009.
[71] Tom Goldstein and Christoph Studer. PhaseMax: Convex phase retrieval via basis pursuit.
IEEE Transactions on Information Theory, 64(4):2675–2689, 2018.
[72] Robert D Gordon. Values of Mills' ratio of area to bounding ordinate and of the normal probability integral for large values of the argument. The Annals of Mathematical Statistics, 12(3):364–366, 1941.
[73] Irina F Gorodnitsky and Bhaskar D Rao. Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm. IEEE Transactions on Signal Processing, 45(3):600–616, 1997.
[74] Pierre Goupillaud, Alex Grossmann, and Jean Morlet. Cycle-octave and related transforms in seismic signal analysis. Geoexploration, 23(1):85–102, 1984.
[75] Vivek K Goyal, Martin Vetterli, and Nguyen T Thao. Quantized overcomplete expansions in R^n: analysis, synthesis, and algorithms. IEEE Transactions on Information Theory, 44(1):16–31, 1998.
[76] Robert M Gray. Quantization noise spectra. IEEE Transactions on Information Theory, 36(6):1220–1244, 1990.
[77] C Sinan Güntürk. One-bit ΣΔ quantization with exponential accuracy. Communications on Pure and Applied Mathematics, 56(11):1608–1630, 2003.
[78] C Sinan Güntürk, Mark Lammers, Alexander M Powell, Rayan Saab, and Özgür Yılmaz. Sobolev duals for random frames and ΣΔ quantization of compressed sensing measurements. Foundations of Computational Mathematics, 13(1):1–36, 2013.
[79] Gilles Hennenfent and Felix J Herrmann. Seismic denoising with nonuniformly sampled curvelets. Computing in Science & Engineering, 8(3):16–25, 2006.
[80] Gabor T Herman. Fundamentals of Computerized Tomography: Image Reconstruction from Projections. Springer Science & Business Media, 2009.
[81] Laurent Jacques and Valerio Cambareri. Time for dithering: fast and quantized random embeddings via the restricted isometry property. Information and Inference: A Journal of the IMA, 6(4):441–476, 2017.
[82] Laurent Jacques, David K Hammond, and M Jalal Fadili.
Dequantizing compressed sensing: When oversampling and non-Gaussian constraints combine. IEEE Transactions on Information Theory, 57(1):559–571, 2011.
[83] Laurent Jacques, David K Hammond, and M Jalal Fadili. Stabilizing nonuniformly quantized compressed sensing with scalar companders. IEEE Transactions on Information Theory, 59(12):7969–7984, 2013.
[84] Laurent Jacques, Jason N Laska, Petros T Boufounos, and Richard G Baraniuk. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Transactions on Information Theory, 59(4):2082–2102, 2013.
[85] Oren N Jaspan, Roman Fleysher, and Michael L Lipton. Compressed sensing MRI: a review of the clinical literature. The British Journal of Radiology, 88(1056):20150487, 2015.
[86] Halyun Jeong and C Sinan Güntürk. Convergence of the randomized Kaczmarz method for phase retrieval. arXiv preprint arXiv:1706.10291, 2017.
[87] David Jimenez, Long Wang, and Yang Wang. The white noise hypothesis for uniform quantization errors. SIAM Journal on Mathematical Analysis, 38:2042–2056, 2007.
[88] Stefan Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres, 35:355–357, 1937.
[89] Felix Krahmer, Rayan Saab, and Özgür Yılmaz. ΣΔ quantization of sub-Gaussian frame expansions and its application to compressed sensing. Information and Inference, 3:40–58, 2014.
[90] Felix Krahmer and Rachel Ward. Lower bounds for the error decay incurred by coarse quantization schemes. Applied and Computational Harmonic Analysis, 32(1):131–138, 2012.
[91] HB Kushner, M Meisner, and AV Levy. Almost uniformity of quantization errors. IEEE Transactions on Instrumentation and Measurement, 40:682–687, 1991.
[92] Ming-Jun Lai and Wotao Yin. Augmented ℓ1 and nuclear-norm models with a globally linearly convergent algorithm. SIAM Journal on Imaging Sciences, 6(2):1059–1091, 2013.
[93] Charles Lawrence Lawson.
Contribution to the theory of linear least maximum approximation. PhD thesis, Univ. Calif., 1961.
[94] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 147–156. IEEE, 2013.
[95] Yunwen Lei and Ding-Xuan Zhou. Learning theory of randomized sparse Kaczmarz method. SIAM Journal on Imaging Sciences, 11(1):547–574, 2018.
[96] Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: convergence rates and conditioning. Mathematics of Operations Research, 35(3):641–654, 2010.
[97] Chengbo Li. An efficient algorithm for total variation regularization with applications to the single pixel camera and compressive sensing. PhD thesis, 2010.
[98] Fan Li, Yiming Yang, and Eric P Xing. From LASSO regression to feature vector machine. In Advances in Neural Information Processing Systems, pages 779–786, 2006.
[99] Lixiang Li, Guoqian Wen, Zeming Wang, and Yixian Yang. Efficient and secure image communication system based on compressed sensing for IoT monitoring applications. IEEE Transactions on Multimedia, 22(1):82–95, 2019.
[100] Shancang Li, Li Da Xu, and Xinheng Wang. Compressed sensing signal and data acquisition in wireless sensor networks and internet of things. IEEE Transactions on Industrial Informatics, 9(4):2177–2186, 2012.
[101] Shidong Li. On general frame decompositions. Numerical Functional Analysis and Optimization, 16(9-10):1181–1191, 1995.
[102] Zhilin Li, Wenbo Xu, Xiaobo Zhang, and Jiaru Lin. A survey on one-bit compressed sensing: Theory and applications. Frontiers of Computer Science, 12(2):217–230, 2018.
[103] Zhi-Pei Liang and Paul C Lauterbur. Principles of Magnetic Resonance Imaging: A Signal Processing Perspective. SPIE Optical Engineering Press, 2000.
[104] Jae S Lim. Two-Dimensional Signal and Image Processing. Englewood Cliffs, 1990.
[105] Junhong Lin and Ding-Xuan Zhou.
Learning theory of randomized Kaczmarz algorithm. The Journal of Machine Learning Research, 16(1):3341–3365, 2015.
[106] Dirk A Lorenz, Frank Schöpfer, and Stephan Wenger. The linearized Bregman method via split feasibility problems: Analysis and generalizations. SIAM Journal on Imaging Sciences, 7(2):1237–1262, 2014.
[107] Dirk A Lorenz, Stephan Wenger, Frank Schöpfer, and Marcus Magnor. A sparse Kaczmarz solver and a linearized Bregman method for online compressed sensing. In 2014 IEEE International Conference on Image Processing (ICIP), pages 1347–1351. IEEE, 2014.
[108] Michael Lustig, David Donoho, and John M Pauly. Sparse MRI: the application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182–1195, 2007.
[109] Michael Lustig, David L Donoho, Juan M Santos, and John M Pauly. Compressed sensing MRI. IEEE Signal Processing Magazine, 25(2):72–82, 2008.
[110] Anna Ma, Deanna Needell, and Aaditya Ramdas. Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods. SIAM Journal on Matrix Analysis and Applications, 36(4):1590–1604, 2015.
[111] Stephane Mallat. A Wavelet Tour of Signal Processing: The Sparse Way. 2009.
[112] Hassan Mansour, U Kamilov, and Özgür Yılmaz. A Kaczmarz method for low-rank matrix recovery. Proceedings of Signal Processing with Adaptive Sparse Structural Representation (SPARS) 2017, 2017.
[113] Hassan Mansour and Özgür Yılmaz. A sparse randomized Kaczmarz algorithm. In 2013 IEEE Global Conference on Signal and Information Processing, pages 621–621, 2013.
[114] Elaine Crespo Marques, Nilson Maciel, Lirida Naviner, Hao Cai, and Jun Yang. A review of sparse recovery algorithms. IEEE Access, 7:1300–1322, 2018.
[115] Nan Meng and Yun-Bin Zhao. Newton-step-based hard thresholding algorithms for sparse signal recovery. arXiv preprint arXiv:2001.07181, 2020.
[116] John P Mills.
Table of the ratio: area to bounding ordinate, for any portion of normal curve. Biometrika, pages 395–400, 1926.
[117] Amirafshar Moshtaghpour, Laurent Jacques, Valerio Cambareri, Kévin Degraux, and Christophe De Vleeschouwer. Consistent basis pursuit for signal and matrix estimates in quantized compressed sensing. IEEE Signal Processing Letters, 23(1):25–29, 2016.
[118] R Muthukrishnan and R Rohini. LASSO: A feature selection technique in predictive modeling for machine learning. In 2016 IEEE International Conference on Advances in Computer Applications (ICACA), pages 18–20. IEEE, 2016.
[119] Frank Natterer. The Mathematics of Computerized Tomography. SIAM, 2001.
[120] Ion Necoara. Faster randomized block Kaczmarz algorithms. SIAM Journal on Matrix Analysis and Applications, 40(4):1425–1452, 2019.
[121] Deanna Needell. Randomized Kaczmarz solver for noisy linear systems. BIT Numerical Mathematics, 50(2):395–403, 2010.
[122] Deanna Needell and Joel A Tropp. Paved with good intentions: analysis of a randomized block Kaczmarz method. Linear Algebra and its Applications, 441:199–221, 2014.
[123] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.
[124] Deanna Needell, Ran Zhao, and Anastasios Zouzias. Randomized block Kaczmarz method with projection for solving least squares. Linear Algebra and its Applications, 484:322–343, 2015.
[125] Julie Nutini, Behrooz Sepehry, Alim Virani, Issam Laradji, Mark Schmidt, and Hoyt Koepke. Convergence rates for greedy Kaczmarz algorithms. In Conference on Uncertainty in Artificial Intelligence, 2016.
[126] Dianne P O'Leary. Robust regression computation using iteratively reweighted least squares. SIAM Journal on Matrix Analysis and Applications, 11(3):466–480, 1990.
[127] Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linear programming.
Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013.
[128] Alexander M Powell, Rayan Saab, and Özgür Yılmaz. Quantization and finite frames. In Finite Frames, pages 267–302. Springer, 2013.
[129] Alexander M Powell and J Tyler Whitehouse. Error bounds for consistent reconstruction: Random polytopes and coverage processes. Foundations of Computational Mathematics, 16(2):395–423, 2016.
[130] Jerry L Prince and Jonathan M Links. Medical Imaging Signals and Systems. Pearson Prentice Hall, Upper Saddle River, 2006.
[131] Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians, volume 3, pages 1576–1602, New Delhi, 2010. Hindustan Book Agency.
[132] Rayan Saab, Rongrong Wang, and Özgür Yılmaz. From compressed sensing to compressed bit-streams: practical encoders, tractable decoders. IEEE Transactions on Information Theory, 2017.
[133] Rayan Saab, Rongrong Wang, and Özgür Yılmaz. Quantization of compressive samples with stable and robust recovery. Applied and Computational Harmonic Analysis, 44(1):123–143, 2018.
[134] Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.
[135] Frank Schöpfer and Dirk A Lorenz. Linear convergence of the randomized sparse Kaczmarz method. Mathematical Programming, 173(1-2):509–536, 2019.
[136] Jie Shen and Ping Li. A tight bound of hard thresholding. The Journal of Machine Learning Research, 18(1):7650–7691, 2017.
[137] Yi Shen, Bin Han, and Elena Braverman. Stability of the elastic net estimator. Journal of Complexity, 32(1):20–39, 2016.
[138] IG Shevtsova. An improvement of convergence rate estimates in the Lyapunov theorem. In Doklady Mathematics, volume 82, pages 862–864. Springer, 2010.
[139] Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi. The JPEG 2000 still image compression standard.
IEEE Signal Processing Magazine, 18(5):36–58, 2001.
[140] RV Soares, Xiaodong Luo, Geir Evensen, and Tuhin Bhakta. 4D seismic history matching: Assessing the use of a dictionary learning based sparse representation method. Journal of Petroleum Science and Engineering, 195:107763, 2020.
[141] Suvrit Sra and Joel A Tropp. Row-action methods for compressed sensing. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, volume 3, pages III–III. IEEE, 2006.
[142] Damian Straszak and Nisheeth K Vishnoi. IRLS and slime mold: Equivalence and convergence. arXiv preprint arXiv:1601.02712, 2016.
[143] Thomas Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2009.
[144] Yan Shuo Tan and Roman Vershynin. Phase retrieval via randomized Kaczmarz: Theoretical guarantees. Information and Inference: A Journal of the IMA, 8(1):97–123, 2019.
[145] Jiliang Tang, Salem Alelyani, and Huan Liu. Feature selection for classification: A review. Data Classification: Algorithms and Applications, page 37, 2014.
[146] Christos Thrampoulidis and Ankit Singh Rawat. The generalized LASSO for sub-Gaussian measurements with dithered quantization. IEEE Transactions on Information Theory, 66(4):2487–2500, 2020.
[147] Andreas M Tillmann and Marc E Pfetsch. The computational complexity of the restricted isometry property, the nullspace property, and related concepts in compressed sensing. IEEE Transactions on Information Theory, 60(2):1248–1259, 2014.
[148] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices, pages 210–268. Cambridge University Press, 2012.
[149] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
[150] Martin Vetterli, Jelena Kovačević, and Vivek K Goyal. Foundations of Signal Processing.
Cambridge University Press, 2014.
[151] Harish Viswanathan and Ram Zamir. On the whiteness of high-resolution quantization errors. IEEE Transactions on Information Theory, 47(5):2029–2038, 2001.
[152] Sergey Voronin. Regularization of linear systems with sparsity constraints with applications to large scale inverse problems. PhD thesis, Princeton University, 2012.
[153] Yang Wang and Zhiqiang Xu. The performance of PCM quantization under tight frame representations. SIAM Journal on Mathematical Analysis, 44:2802–2823, 2011.
[154] Ying Wang and Guorui Li. An iterative hard thresholding algorithm based on sparse randomized Kaczmarz method for compressed sensing. International Journal of Computational Intelligence and Applications, 17(03):1850015, 2018.
[155] Ke Wei. Solving systems of phaseless equations via Kaczmarz methods: A proof of concept study. Inverse Problems, 31(12):125008, 2015.
[156] Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
[157] Chunlei Xu and Laurent Jacques. Quantized compressive sensing with RIP matrices: the benefit of dithering. Information and Inference: A Journal of the IMA, 2019.
[158] Makoto Yamada, Wittawat Jitkrittum, Leonid Sigal, Eric P Xing, and Masashi Sugiyama. High-dimensional feature selection by feature-wise kernelized LASSO. Neural Computation, 26(1):185–207, 2014.
[159] Zhen-Hang Yang and Yu-Ming Chu. On approximating Mills ratio. Journal of Inequalities and Applications, 2015(1):273, 2015.
[160] Juhwan Yoo, Stephen Becker, Manuel Monge, Matthew Loh, Emmanuel Candès, and Azita Emami-Neyestanak. Design and implementation of a fully integrated compressed-sensing signal acquisition system. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5328. IEEE, 2012.
[161] Liang Zhang, LG Han, ZG Liu, and Yu Li. Seismic data reconstruction based on compressed sensing and Contourlet transform.
Geophysical Prospecting for Petroleum, 56(6):804–811, 2017.
[162] Xiaobo Zhang, Wenbo Xu, Jiaru Lin, and Yifei Dang. Block normalised iterative hard thresholding algorithm for compressed sensing. Electronics Letters, 55(17):957–959, 2019.
[163] Zhuosheng Zhang, Yongchao Yu, and Shumin Zhao. Iterative hard thresholding based on randomized Kaczmarz method. Circuits, Systems, and Signal Processing, 34(6):2065–2075, 2015.
[164] Heng Zhou and Zhiqiang Xu. The lower bound of the PCM quantization error in high dimension. Applied and Computational Harmonic Analysis, 2015.
[165] Pan Zhou, Xiaotong Yuan, and Jiashi Feng. Efficient stochastic gradient hard thresholding. In Advances in Neural Information Processing Systems, pages 1984–1993, 2018.
[166] Yang Zhou, Rong Jin, and Steven Chu-Hong Hoi. Exclusive LASSO for multi-task feature selection. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 988–995, 2010.
[167] Anastasios Zouzias and Nikolaos M Freris. Randomized extended Kaczmarz for solving least squares. SIAM Journal on Matrix Analysis and Applications, 34(2):773–793, 2013.
[168] Argyrios Zymnis, Stephen Boyd, and Emmanuel Candès. Compressed sensing with quantized measurements. IEEE Signal Processing Letters, 17(2):149–152, 2010.

