Sparsity-Based Methods for Image Reconstruction and Processing in Cone-Beam Computed Tomography

by

Davood Karimi

M.Sc., The University of Manitoba, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

July 2016

© Davood Karimi 2016

Abstract

X-ray computed tomography (CT) is an essential tool in modern medicine. As the scale and diversity of the medical applications of CT continue to increase, the quest for reducing the radiation dose becomes of extreme importance. However, producing high-quality images from low-dose scans has proven to be a serious challenge. Therefore, further research in developing more effective image reconstruction and processing algorithms for CT is necessary.

This dissertation explores the potential of patch-based image models and total variation (TV) regularization for improving the quality of low-dose CT images. It proposes novel algorithms for 1) denoising and interpolation of CT projection measurements (known as the sinogram), 2) denoising and restoration of reconstructed CT images, and 3) iterative CT image reconstruction.

For sinogram denoising, patch-based and TV-based algorithms are proposed. For interpolation of undersampled projections, an algorithm based on both patch-based and TV-based image models is proposed. Experiments show that the proposed algorithms substantially improve the quality of CT images reconstructed from low-dose scans and achieve state-of-the-art results in sinogram denoising and interpolation.

To suppress streak artifacts in CT images reconstructed from low-dose scans, an algorithm based on sparse representation in coupled learned dictionaries is proposed. Moreover, a structured dictionary is proposed for denoising and restoration of reconstructed CT images. These algorithms significantly improve the image quality and prove that highly effective CT post-processing algorithms can be devised with the help of learned overcomplete dictionaries.

This dissertation also proposes two iterative reconstruction algorithms that are based on variance-reduced stochastic gradient descent. One algorithm employs TV regularization only and proposes a stochastic-deterministic approach for image recovery. The other obtains better results by using both TV and patch-based regularizations. Both algorithms achieve convergence behavior and reconstruction results that are superior to those of the widely used iterative reconstruction algorithms they are compared with. Our results show that variance-reduced stochastic gradient descent algorithms can form the basis of very efficient iterative CT reconstruction algorithms.

This dissertation shows that sparsity-based methods, especially patch-based methods, have a great potential in improving the image quality in low-dose CT. Therefore, these methods can play a key role in the future success of CT.

Preface

This dissertation presents the research conducted by Davood Karimi, with the help and supervision of Prof. Rabab K. Ward. Below is a list of the scientific articles written by Davood Karimi during the course of his doctoral studies at the University of British Columbia.

Part of Chapter 2 was published in paper J7. The work presented in Chapter 3 has been published in paper J6. The contents of Chapter 4 appear in papers J1 and J2. Chapter 6 was published as J5. Chapter 7 was published as C2.
Finally, the contents of Chapter 8 appear in J3, J4, and C1.

J1: Davood Karimi, Pierre Deman, Rabab Ward, and Nancy Ford. A sinogram denoising algorithm for low-dose computed tomography. BMC Medical Imaging, 16(11):1-14, 2016.

J2: Davood Karimi and Rabab Ward. A denoising algorithm for projection measurements in cone-beam computed tomography. Computers in Biology and Medicine, 69:71-82, 2016.

J3: Davood Karimi and Rabab Ward. On the computational implementation of forward and back-projection operations for cone-beam computed tomography. Medical & Biological Engineering & Computing, pages 1-12, 2015.

J4: Davood Karimi and Rabab Ward. A hybrid stochastic-deterministic gradient descent algorithm for image reconstruction in cone-beam computed tomography. Biomedical Physics and Engineering Express, 2(1):015008, 2016.

J5: Davood Karimi and Rabab Ward. Reducing streak artifacts in computed tomography via sparse representation in coupled dictionaries. Medical Physics, 43(3):1473-1486, 2016.

J6: Davood Karimi and Rabab Ward. Sinogram denoising via simultaneous sparse representation in learned dictionaries. Physics in Medicine and Biology, 61(9):3536-53, 2016.

J7: Davood Karimi and Rabab Ward. Patch-based models and algorithms for image processing - a review of the basic principles and methods, and their application in computed tomography. pages 113, 2016.

C1: Davood Karimi, Rabab Ward, and Nancy Ford. A weighted stochastic gradient descent algorithm for image reconstruction in 3D computed tomography. In World Congress on Medical Physics and Biomedical Engineering, June 7-12, 2015, Toronto, Canada, pages 70-73. Springer, 2015.

C2: Davood Karimi and Rabab Ward. A novel structured dictionary for fast processing of 3D medical images, with application to computed tomography restoration and denoising. In SPIE Medical Imaging, 2016.

J1: Davood Karimi is the primary author and the main contributor. Pierre Deman contributed to the design of the experiments. Pierre Deman, Dr. Rabab Ward, and Dr. Nancy Ford provided technical feedback and helped with the writing of the manuscript.

The rest of the papers: Davood Karimi is the primary author and the main contributor. Dr. Rabab Ward provided technical feedback and helped with the writing of the papers.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgements
Dedication

1 Introduction
1.1 A brief history of x-ray computed tomography
1.2 Imaging model
1.3 Image reconstruction algorithms in CT
1.4 The goals of this dissertation

2 Literature Review
2.1 Image processing with learned overcomplete dictionaries
2.1.1 Sparse representation in analytical dictionaries
2.1.2 Learned overcomplete dictionaries
2.1.3 Applications of learned dictionaries
2.2 Non-local patch-based image processing
2.3 Other patch-based methods
2.4 Patch-based methods for Poisson noise
2.5 Total variation (TV)
2.6 Published research on sparsity-based methods in CT
2.6.1 Pre-processing methods
2.6.2 Iterative reconstruction methods
2.6.3 Post-processing methods
2.7 Final remarks

3 Sinogram Denoising with Learned Dictionaries
3.1 Introduction
3.2 The proposed algorithm
3.2.1 Clustering
3.2.2 Dictionary learning
3.2.3 Denoising
3.3 Evaluation
3.3.1 Simulation experiment
3.3.2 Experiment with micro-CT scan of a rat
3.3.3 Experiment with micro-CT scan of a phantom
3.4 Discussion

4 Sinogram Denoising using Total Variation
4.1 Introduction
4.2 Approach 1 - Employing higher-order derivatives
4.2.1 The proposed algorithm
4.2.2 Simulation experiment
4.2.3 Experiments with real micro-CT data
4.2.4 Discussion
4.3 Approach 2 - Locally adaptive regularization
4.3.1 The proposed algorithm
4.3.2 Simulation experiment
4.3.3 Experiment with real micro-CT data
4.3.4 Discussion

5 Sinogram Interpolation
5.1 Introduction
5.2 The proposed algorithm
5.2.1 Regularization in terms of sinogram self-similarity
5.2.2 Regularization in terms of sinogram smoothness
5.2.3 Optimization algorithm
5.3 Results and discussion
5.3.1 Experiment with simulated data
5.3.2 Experiment with real CT data

6 Reducing Streak Artifacts using Coupled Dictionaries
6.1 Introduction
6.2 Methods
6.2.1 The proposed approach
6.2.2 The dictionary learning algorithm
6.3 Evaluation
6.4 Discussion

7 Two-Level Dictionary for Fast CT Image Denoising ...
7.1 Introduction
7.2 The proposed algorithm
7.3 Results and discussion
7.3.1 Denoising
7.3.2 Restoration

8 TV-Regularized Iterative Reconstruction
8.1 Introduction
8.1.1 Motivation and background
8.1.2 Formulation of the problem
8.1.3 Stochastic gradient descent method
8.1.4 Variance-reduced stochastic gradient descent
8.2 Methods
8.2.1 The proposed algorithm
8.2.2 Implementation details
8.2.3 Evaluation
8.3 Results
8.3.1 Simulated data
8.3.2 Micro-CT scan of the physical phantom
8.3.3 Micro-CT scan of a rat
8.4 Discussion

9 Iterative Reconstruction with Nonlocal Regularization
9.1 Introduction
9.2 Methods
9.2.1 Problem formulation
9.2.2 Optimization algorithm
9.3 Results and Discussion
9.3.1 Simulated data
9.3.2 Real data

10 Conclusions
10.1 Contributions of this dissertation
10.1.1 Pre-processing algorithms
10.1.2 Post-processing algorithms
10.1.3 Iterative reconstruction algorithms
10.2 Future work
10.2.1 Pre-processing algorithms
10.2.2 Post-processing algorithms
10.2.3 Iterative reconstruction algorithms

Bibliography

List of Tables

3.1 Comparison of sinogram denoising algorithms in the projection domain.
3.2 Comparison of sinogram denoising and image denoising algorithms on simulated data.
3.3 Comparison of sinogram denoising and image denoising algorithms in terms of spatial resolution.
3.4 Comparison of sinogram denoising and image denoising algorithms on the micro-CT scan of a rat.
3.5 Comparison of sinogram denoising and image denoising algorithms on the micro-CT scan of a physical phantom.
4.1 Evaluation of second-order TV sinogram denoising with simulated data in sinogram domain.
4.2 Evaluation of second-order TV sinogram denoising with simulated data in image domain.
4.3 Evaluation of second-order TV sinogram denoising with micro-CT scan of a phantom.
4.4 Evaluation of second-order TV sinogram denoising with micro-CT scan of a rat.
5.1 Objective evaluation of image quality improvement with sinogram interpolation on simulated data.
5.2 Objective evaluation of image quality improvement with sinogram interpolation on real micro-CT data.
6.1 Evaluation of dictionary-based streak artifact suppression on micro-CT scan of a rat.
6.2 Evaluation of dictionary-based streak artifact suppression on micro-CT scan of a phantom.
6.3 Evaluation of the generalizability of learned parameters in dictionary-based streak artifact suppression.
7.1 Evaluation of the two-level learned dictionary with simulation data.
7.2 Evaluation of the two-level learned dictionary with micro-CT data.
8.1 Objective evaluation of the hybrid stochastic-deterministic reconstruction algorithm with simulation data.
8.2 Evaluation of the hybrid stochastic-deterministic reconstruction algorithm with micro-CT data of a phantom.
8.3 Evaluation of the hybrid stochastic-deterministic reconstruction algorithm with micro-CT data of a rat.
9.1 Image quality metrics for the experiment with simulated data.
9.2 Image quality metrics for the experiment with real data.

List of Figures

1.1 A schematic representation of cone-beam CT geometry.
3.1 Stacking of cone-beam projections to exploit intra-projection and inter-projection correlations.
3.2 Demonstration of the effectiveness of the proposed dimensionality reduction mapping for projection signals.
3.3 Visual comparison of sinogram denoising and image denoising on simulated data.
3.4 Comparison of sinogram denoising and image denoising algorithms in terms of spatial resolution on simulated data.
3.5 Performance comparison of sinogram denoising and image denoising on micro-CT scan of a rat.
3.6 Visual comparison of sinogram denoising and image denoising on micro-CT data.
3.7 Comparison of sinogram denoising and image denoising in terms of spatial resolution on micro-CT data.
3.8 Generalizability of the learned dictionary in dictionary-based sinogram denoising.
3.9 Effect of the dictionary size in dictionary-based sinogram denoising.
3.10 Visual comparison of dictionary atoms learned from CT projections and CT images.
4.1 A visual depiction of the piecewise-smooth nature of CT projections.
4.2 Evaluation of second-order TV sinogram denoising with simulated data.
4.3 Evaluation of second-order TV sinogram denoising with micro-CT scan of a phantom.
4.4 Profiles of the images reconstructed after second-order TV sinogram denoising.
4.5 Parameter tuning for second-order TV sinogram denoising.
4.6 Image-domain spatial resolution as influenced by TV-based sinogram denoising.
4.7 Evaluation of second-order TV sinogram denoising with micro-CT scan of a rat.
4.8 Evaluation of second-order TV sinogram denoising with low-dose micro-CT scan of a rat.
4.9 Parameter tuning for first and second-order TV sinogram denoising.
4.10 Visual evaluation of locally-adaptive TV sinogram denoising on simulated data.
4.11 Objective evaluation of locally-adaptive TV sinogram denoising on simulated data.
4.12 Objective evaluation of locally-adaptive TV sinogram denoising on micro-CT scan of a phantom.
4.13 Visual evaluation of locally-adaptive TV sinogram denoising on micro-CT scan of a phantom.
4.14 Profiles of the images reconstructed after locally-adaptive TV sinogram denoising.
4.15 Trade-off between noise and spatial resolution for locally-adaptive TV sinogram denoising.
4.16 Objective evaluation of locally-adaptive TV sinogram denoising on micro-CT scan of a rat.
4.17 Visual evaluation of locally-adaptive TV sinogram denoising on micro-CT scan of a rat.
4.18 Trade-off between noise and spatial resolution for locally-adaptive TV sinogram denoising evaluated on micro-CT scan of a rat.
5.1 A schematic representation of the sinogram interpolation problem.
5.2 Block matching between the undersampled scan to be interpolated and a high-dose reference scan.
5.3 Effect of sinogram interpolation on the quality of brain phantom images.
5.4 Effect of sinogram interpolation on the quality of rat images reconstructed from micro-CT data.
6.1 Schematic representation of dictionary-based algorithm for streak artifact suppression.
6.2 Evaluation of dictionary-based algorithm for streak artifact suppression on training data.
6.3 Evaluation of dictionary-based algorithm for streak artifact suppression on test data.
6.4 Visual evaluation of dictionary-based algorithm for streak artifact suppression on micro-CT scan of a phantom.
6.5 Assessment of generalizability of the parameters of dictionary-based algorithm for streak artifact suppression.
6.6 Visual comparison of two models with different complexities for dictionary-based streak artifact suppression.
7.1 Schematic representation of two-level dictionary.
7.2 Visual inspection of the denoising performance of two-level dictionary on simulation data.
7.3 Visual inspection of the denoising performance of two-level dictionary on micro-CT data.
7.4 Visual inspection of the performance of two-level dictionary for suppressing ring artifacts.
7.5 Effect of dictionary size and sparsity level on the performance of the two-level dictionary for suppressing ring artifacts.
8.1 Convergence behavior of the stochastic-deterministic iterative reconstruction on simulation data.
8.2 Visual inspection of the reconstruction results of the stochastic-deterministic iterative reconstruction on simulation data.
8.3 Convergence of the stochastic-deterministic iterative reconstruction on micro-CT scan of a phantom.
8.4 Reconstruction results of the stochastic-deterministic iterative reconstruction on micro-CT scan of a phantom.
8.5 Convergence of the stochastic-deterministic iterative reconstruction on micro-CT scan of a rat.
8.6 Reconstruction results of the stochastic-deterministic iterative reconstruction on micro-CT scan of a rat.
8.7 Effects of batch size and regularization parameter on the convergence of stochastic-deterministic iterative reconstruction.
8.8 Effects of batch size and regularization parameter on the reconstruction results of stochastic-deterministic iterative reconstruction.
9.1 The proposed initialization for the Generalized PatchMatch algorithm.
9.2 Reconstruction results for the experiment with the brain phantom.
9.3 RMSE plots for reconstruction of the brain phantom.
9.4 Reconstruction results of the experiment with the rat scan.

Glossary

ANLM Adaptive non-local means
CBCT Cone-beam computed tomography
CNR Contrast to noise ratio
CT Computed tomography
DCT Discrete cosine transform
FBP Filtered back-projection
FDK Feldkamp-Davis-Kress
FGD Full gradient descent
GPU Graphical processing unit
MAP Maximum a posteriori
MI Mutual information
MRI Magnetic resonance imaging
MTF Modulation transfer function
NLM Non-local means
NS Noise strength
OMP Orthogonal matching pursuit
PCA Principal component analysis
PET Positron emission tomography
PSNR Peak signal to noise ratio
RMSE Root mean square error
ROF Rudin-Osher-Fatemi
ROI Region of interest
SAG Stochastic average gradient
SGD Stochastic gradient descent
SNR Signal to noise ratio
SR Spatial resolution
SSIM Structural similarity index
SVD Singular value decomposition
SVRG Stochastic variance-reduced gradient
TV Total variation
VR-SGD Variance-reduced stochastic gradient descent

Acknowledgements

I would like to sincerely thank my thesis advisor, Prof. Rabab Ward, for all her support, guidance, advice, and encouragement during the course of my Ph.D. studies.

All of the micro-CT data that were used in this dissertation were provided by the Centre for High-Throughput Phenogenomics at the University of British Columbia. I am very grateful to Dr. Nancy Ford, the Director of the Centre, for her constant support. I am also very thankful to Mr. John Schipilow and Dr. Pierre Deman, who carried out most of the micro-CT scanning.
Sincere thanks to the members of my thesis committee: Dr. Nancy Ford, Dr. Vikram Krishnamurthy, and Dr. Jane Wang; the University Examiners: Dr. Shuo Tang and Dr. Ozgur Yilmaz; the Chair of the Final Oral Defence: Dr. Anna Celler; and the External Examiner: Dr. Charles Bouman of Purdue University, for their valuable feedback that helped improve the research presented in this dissertation.

I was supported by the University of British Columbia through a Four-Year Fellowship (4YF) and by the Natural Sciences and Engineering Research Council of Canada (NSERC) through an Alexander Graham Bell Canada Graduate Scholarship (CGS-D). Their support is greatly appreciated.

Greatest thanks to my family for what words cannot describe.

Dedication

To my parents, for their dedication.

Chapter 1
Introduction

1.1 A brief history of x-ray computed tomography

Computed tomography (CT) refers to creating images of the cross sections of an object using transmission or reflection data. These data are usually referred to as the projections of the object. For the projection data to be sufficient for reconstructing the object's image, the object needs to be illuminated from many different directions. The problem of reconstructing the image of an object from its projections has various applications, from reconstructing the structure of molecules from data collected with electron microscopes to reconstructing maps of radio emissions of celestial objects from data collected with radio telescopes [129]. However, the most important applications of CT have been in the field of medicine, where the impact of CT has been nothing short of revolutionary. Today, physicians and surgeons are able to view the internal organs of their patients with a precision and safety that was impossible to imagine before the advent of CT.

The fundamental difference between different medical imaging modalities is the property of the material (i.e., tissue) that they image. X-ray CT, which is the focus of this dissertation, is based on the tissue's ability to attenuate x-ray photons. X-rays had been discovered by the German physicist Wilhelm Röntgen in 1895. Röntgen, who won the first Nobel Prize in Physics for this discovery, realized that x-rays could reveal the skeletal structure of body parts because bones and soft tissue had different x-ray attenuation properties. However, the first commercial CT scanners appeared only in the early 1970s; Allan Cormack and Godfrey Hounsfield were eventually awarded the 1979 Nobel Prize in Medicine for independently inventing CT.

Today, x-ray CT is an indispensable tool in medicine. In fact, the words CT and computed tomography are used to refer to x-ray CT with no confusion. In the rest of this dissertation, too, the qualifier "x-ray" is dropped, assuming the implicit knowledge that the whole dissertation is devoted to x-ray computed tomography. Since its commercial introduction more than 40 years ago, diagnostic and therapeutic applications of CT have continued to grow. In the past two decades especially, great advancements have been made in CT scanner technology and the available computational resources. Moreover, new scanning methods such as dual-source and dual-energy CT have become commercially available. Today, very fast scanning of large volumes has become possible. This has led to a dramatic increase in CT usage in clinical settings. It is estimated that globally more than 50,000 dual-energy x-ray CT scanners are in operation [290]. In the USA alone, the number of CT scans made annually increased from 19 million to 62 million between 1993 and 2006 [226].
1.2 Imaging model

Although the algorithms proposed in this dissertation apply to most or all CT geometries, the focus of this dissertation is on cone-beam computed tomography (CBCT). CBCT is a relatively new scan geometry that has found applications as diverse as image-guided radiation therapy, dentistry, breast CT, and microtomography [52, 108, 137, 283]. Figure 1.1 shows a schematic representation of CBCT. Divergent x-rays penetrate the object and become attenuated before being detected by an array of detectors. The equation relating the detected photon number to the line integral of the attenuation coefficient is [269]:

$$\frac{N_d^i}{N_0^i} = \exp\left(-\int_i \mu \, ds\right) \qquad (1.1)$$

where $N_0^i$ and $N_d^i$ denote, respectively, the emitted and detected photon numbers for the ray from the x-ray source to the detector bin $i$, and $\int_i \mu \, ds$ is the line integral of the attenuation coefficient ($\mu$) along that ray. By discretizing the imaged object, the following approximation to (1.1) can be made:

$$\log\left(\frac{N_0^i}{N_d^i}\right) = \sum_{k=1}^{K} a_{i,k} \, x_k \qquad (1.2)$$

where $x_k$ is the value of the unknown image at voxel $k$ and $a_{i,k}$ is the length of intersection of ray $i$ with this voxel. The equations for all measurements can be combined and conveniently written in matrix form as:

$$y = Ax + w \qquad (1.3)$$

where $y$ represents the vector of measurements (also known as the sinogram), $x$ is the unknown image, $A$ represents the projection matrix, and $w$ is the measurement noise.

Figure 1.1: A schematic representation of cone-beam CT geometry.

The discretization approach mentioned above has several shortcomings. For example, it does not consider the finite size of the x-ray source and the detector area. Furthermore, exact computation of the intersection lengths of rays with voxels is computationally very costly for large-scale 3D CT. Therefore, several efficient implementations of the system matrix $A$ have been proposed [78, 185, 193, 221]. For large-scale 3D CT, matrix $A$ is too large to be saved in computer memory. Instead, these algorithms implement multiplication with matrix $A$ and its transpose by computing the matrix elements on the fly.

Even though in theory $N_d$ follows a Poisson distribution, due to many complicating factors, including the polychromatic nature of the x-ray source and the electronic noise, an accurate model of the raw data takes the form of a compound Poisson, shifted Poisson, or Poisson+Gaussian distribution [174]. For many practical applications, an adequate noise model is obtained by adding Gaussian noise (to simulate the electronic noise) to the theoretical values of $N_d$. More realistic modeling, especially in low-dose CT, is much more complex and needs to take into account very subtle phenomena, which are the subject of much research [244, 312, 351]. An alternative approach is to consider the ratio of the photon counts after the logarithm transformation. Even though $N_0^i$ and $N_d^i$ are Poisson distributed, the noise in the sinogram (i.e., after the logarithm transformation) can be modeled as a Gaussian-distributed random variable with zero mean and a variance that follows [201, 204, 326]:

$$\sigma_i^2 = \frac{\exp(\bar{y}_i)}{N_0^i} \qquad (1.4)$$

In this equation, $\bar{y}_i$ is the expected value of the sinogram datum at detector $i$. In general, a system-specific constant $\eta$ is needed to fit the measurements [326]:

$$\sigma_i^2 = f_i \exp\left(\frac{\bar{y}_i}{\eta}\right) \qquad (1.5)$$

where $f_i$, similar to $1/N_0^i$ in (1.4), mainly accounts for the effect of bowtie filtration.
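To make the sinogram noise model concrete, the following Python/NumPy sketch simulates Poisson photon counts for a single ray and compares the empirical variance of the log-transformed measurement with the approximation of Equation (1.4). It is an illustrative simulation only, not code from this dissertation: the emitted photon count and the attenuation line integral are arbitrary assumptions, and electronic noise and beam polychromaticity are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

y_true = 2.0        # assumed line integral of attenuation along one ray
N0 = 1e4            # assumed number of emitted photons for this detector bin
n_repeats = 200000  # Monte Carlo repetitions

# Mean detected photon count according to the Beer-Lambert law, Eq. (1.1)
mean_Nd = N0 * np.exp(-y_true)

# Simulate Poisson-distributed detected counts
Nd = rng.poisson(mean_Nd, size=n_repeats)
Nd = np.maximum(Nd, 1)          # guard against log(0) in rare low-count draws

# Log transform gives the noisy sinogram value, Eq. (1.2)
y_noisy = np.log(N0 / Nd)

# Compare the empirical variance with the approximation of Eq. (1.4)
empirical_var = y_noisy.var()
predicted_var = np.exp(y_true) / N0
print(f"empirical variance:  {empirical_var:.2e}")
print(f"exp(y)/N0 (Eq. 1.4): {predicted_var:.2e}")
```

For these parameters the two numbers agree closely, which illustrates why the Gaussian approximation of Equation (1.4) is commonly adopted for processing in the sinogram domain.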
1.3 Image reconstruction algorithms in CT

A central component in every CT system is the suite of image reconstruction and processing algorithms, whose task is to reconstruct the image of the object from its projection measurements. These algorithms have also continually evolved over time. The first CT scanners relied on simple iterative algorithms that aimed at recovering the unknown image as the solution of a system of linear equations. Many of these basic iterative methods had been developed by mathematicians like Kaczmarz well before the advent of CT. As the size of CT images grew, analytical filtered back-projection (FBP) methods became more common, and they are still widely used in practice [247]. For CBCT, the well-known Feldkamp-Davis-Kress (FDK) filtered back-projection algorithm is still widely used [107, 307, 310]. These methods, which are based on the Fourier slice theorem, require a large number of projections to produce a high-quality image, but they are much faster than iterative methods.

The speed advantage of FBP methods has become less significant in recent years as the power of personal computers has increased and new hardware options such as graphical processing units (GPUs) have become available. On the other hand, with a consistent growth in medical CT usage, many studies have shown that the radiation dose levels used in CT may be harmful to the patients [24, 299]. Reducing the radiation dose can be accomplished by reducing the number of projection measurements and/or by reducing the radiation dose for each projection. However, the images reconstructed from such under-sampled or noisy measurements with FBP methods will have a poor diagnostic quality. As a result of these developments, there has been a renewal of interest in statistical and iterative image reconstruction methods because they have the potential to produce high-quality images from low-dose scans [19, 149]. Furthermore, even though in the beginning most of the algorithms used in CT were image reconstruction algorithms, gradually image processing algorithms were adopted for denoising, restoring, or otherwise improving the projection measurements and the reconstructed images. Many of these algorithms are borrowed from the research on image processing for natural images. Even today, algorithms that have been developed for processing of natural images are often applied in CT with little or no modification.

Turning to more effective image reconstruction and processing algorithms is not the only approach to radiation dose reduction. There are indeed other approaches, such as improving the system hardware and imaging protocols [222, 223]. However, there are very strong additional reasons that encourage research on better algorithms for CT. To begin with, the advantage of iterative/statistical reconstruction algorithms is not limited to low-dose CT. There are other situations where FBP methods fail and one has to resort to more sophisticated iterative/statistical reconstruction methods. Examples include non-standard scanning geometries such as those with irregular or limited angular sampling (e.g., in tomosynthesis) or when some of the measurements are missing or corrupted. Another important factor, as we mentioned above, is the increased availability of high-performance computational hardware. The potential of these advancements in computer hardware for CT has begun to be understood, and there is a growing body of research investigating the significance of this increased computational power for CT [144, 224, 261, 272].
Lastly, a very important factor, which is more closely related to the subject of this dissertation, is the introduction of new theories and methods in signal and image processing and applied mathematics that can lead to more powerful and more flexible algorithms for CT. For example, new optimization algorithms that have much faster theoretical convergence rates have led to state-of-the-art image reconstruction algorithms in recent years [160, 263]. Another example of recent advancements in signal and image processing that has already had a great impact on CT is the new developments in sparsity-based models. In these models, the image is transformed from its native representation in terms of pixel/voxel values into a different space where it has a more concise and more meaningful representation. It is hoped that such a representation provides a more effective description of the relevant image features, thereby improving the achievable results in various image processing tasks. Even though this is an old idea in image processing, recent decades and years have witnessed the emergence of new models and algorithms that have called for a reassessment of the potential of sparsity-based methods in CT.

1.4 The goals of this dissertation

The goal of this dissertation is to advance the state of the art in the application of sparsity-based methods in low-dose CT. Even though sparsity-based methods have been widely used for image reconstruction and processing in CT, more recent sparsity-based models and optimization algorithms have the potential to substantially improve the current state of the art. The goal of this dissertation is to make significant contributions in this direction. Some of the novelties of the research that is reported in this dissertation are summarized below.

• This dissertation relies heavily on learned overcomplete dictionaries. Compared with analytical dictionaries such as wavelets, learned dictionaries have a much higher representational power and flexibility. In recent years, these dictionaries have been shown to achieve state-of-the-art results in many image processing tasks. However, the potential of learned dictionaries for CT has not been fully appreciated. This dissertation tries to explore this potential. This dissertation also draws heavily upon other patch-based models and algorithms, especially those that exploit nonlocal patch similarities. Patch-based models have emerged as one of the most powerful models in image processing in the past decade. However, little research has been reported on the application of these models in CT. This dissertation tries to investigate the potential of patch-based methods for image reconstruction and processing in CT.

• The great majority of the published research has focused on iterative reconstruction algorithms and image-domain post-processing algorithms. Comparatively, many fewer studies have been reported on denoising, restoration, or otherwise improving the projection measurements. In this dissertation, we pay particular attention to this gap in research.

• For iterative image reconstruction, this dissertation makes use of new stochastic optimization algorithms. Stochastic/incremental optimization methods have been used to accelerate various CT reconstruction algorithms over the past two decades. This dissertation shows that the new class of variance-reduced stochastic gradient descent algorithms is superior to the traditional stochastic optimization methods for CT reconstruction.
Chapter 2
Literature Review

This chapter starts by reviewing the basic principles of the three main image models that are used in this dissertation. These models are based on sparse representation in learned dictionaries, nonlocal patch-based models, and total variation (TV). Then, example studies that have used these models for image reconstruction and processing in CT are reviewed.

The first two of the three image models mentioned above belong to patch-based models. In patch-based image processing, the units upon which operations are carried out are small image patches, which in the case of 3D images are also referred to as blocks. In the great majority of applications, square patches or cubic blocks are used, even though other patch shapes can also be employed. For simplicity of presentation, we will use the term "patch" except when talking explicitly about 3D images. The number of pixels/voxels in a patch in patch-based image processing methods is usually on the order of tens or a few hundreds. A typical patch size would be 8×8 pixels for 2D images or 8×8×8 voxels for 3D images.

Broadly speaking, in patch-based methods the image is first divided into small patches. Then, each patch is processed either separately on its own or jointly with patches that are very similar to it. The final output image is then formed by assembling the processed patches. In patch-based denoising, for instance, one can divide the image into small overlapping patches, denoise each patch independently, and then build the final denoised image by averaging the denoised patches (a minimal code sketch of this generic pipeline is given below). There are many reasons for focusing on small patches rather than on the whole image. First, because of the curse of dimensionality, it is much easier and more reliable to learn a model for small image patches than for very large patches or for the whole image. Secondly, for many models, computations are significantly reduced if they are applied on small patches rather than on the whole image. In addition, research in the past decade has shown that working with small patches can result in very effective algorithms that outperform competing methods in a wide range of image processing tasks. For example, as we will explain later in this chapter, patch-based denoising methods are currently considered to be the state of the art, achieving close-to-optimal denoising performance.

Patch-based methods have been among the most heavily researched methods in the field of image processing in recent years, and they have produced state-of-the-art results in many tasks including denoising, restoration, super-resolution, inpainting, and reconstruction. However, these methods have received very little attention in CT. Although only limited effort has been devoted to using patch-based methods in CT, the results of the published studies have been very promising. Given the great success of patch-based methods in various image-processing applications, they seem to have the potential to substantially improve the current state-of-the-art algorithms in CT.
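The following Python/NumPy sketch makes the generic extract-process-aggregate pipeline described above concrete. It is an illustration only, not code from this dissertation: the image is random, the 8×8 patch size is an arbitrary choice, and the per-patch processing step is left as a placeholder identity so that the reassembled image equals the input exactly.

```python
import numpy as np

def extract_patches(img, p):
    """Extract all overlapping p-by-p patches (stride 1) as rows of a matrix."""
    H, W = img.shape
    patches = [img[i:i + p, j:j + p].ravel()
               for i in range(H - p + 1) for j in range(W - p + 1)]
    return np.array(patches)

def aggregate_patches(patches, shape, p):
    """Return processed patches to their locations and average the overlaps."""
    H, W = shape
    acc = np.zeros(shape)
    weight = np.zeros(shape)
    k = 0
    for i in range(H - p + 1):
        for j in range(W - p + 1):
            acc[i:i + p, j:j + p] += patches[k].reshape(p, p)
            weight[i:i + p, j:j + p] += 1.0
            k += 1
    return acc / weight

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 32))          # toy image, stands in for a CT slice
    P = extract_patches(image, p=8)       # one row per 8x8 patch
    P_processed = P.copy()                # placeholder: per-patch processing goes here
    recon = aggregate_patches(P_processed, image.shape, p=8)
    print(np.allclose(recon, image))      # identity processing -> exact image
```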
The word "patch-based" may be ambiguous because it can potentially refer to any image model or algorithm that works with small patches. For example, image compression algorithms such as JPEG work on small image patches. However, the word patch-based has recently been used to refer to certain classes of methods. In order to explain the central concepts of these methods, we will first describe the two main frameworks in patch-based image processing: (1) sparse representation of image patches in learned overcomplete dictionaries, and (2) models based on nonlocal patch similarities. These two frameworks do not cover all patch-based image processing methods. However, most of these methods have their roots in one or both of these two frameworks.

2.1 Image processing with learned overcomplete dictionaries

2.1.1 Sparse representation in analytical dictionaries

A signal $x \in \mathbb{R}^m$ is said to have a sparse representation in a dictionary $D \in \mathbb{R}^{m \times n}$ if it can be accurately approximated by a linear combination of a small number of its columns. Mathematically, this means that there exists a vector $\gamma$ such that $x \cong D\gamma$ and $\|\gamma\|_0 \ll n$. Here, $\|\gamma\|_0$ denotes the number of nonzero entries of $\gamma$ and is usually referred to as the $\ell_0$-norm of $\gamma$, although it is not a true norm. This means that only a small number of columns of $D$ are sufficient for an accurate representation of the signal $x$. The ability to represent a high-dimensional signal as a linear combination of a small number of building blocks is a very powerful concept, and it is at the center of many of the most widely used algorithms in signal and image processing. Columns of the dictionary $D$ are commonly referred to as atoms.
Therefore, denoising using Fourierfiltering leads to blurred images. An efficient representation of localizedfeatures needs bases that include elements with concentrated support. Thisgave rise to the Short-Time Fourier Transform (STFT) [5, 16] and, more im-portantly, the wavelet transform [74, 217]. The wavelet transform was themajor manifestation of a revolution in signal processing that is referred toas multi-scale or multi-resolution signal processing. The main idea in thisparadigm is that many signals, and in particular natural images, containrelevant features on many different scales. Both the Fourier transform andthe wavelet transform can be interpreted as the representation of a signal ina dictionary. For the Fourier transform, for example, the dictionary atomsinclude sinusoids of different frequencies.Despite its tremendous success, the wavelet transform suffers from im-102.1. Image processing with learned overcomplete dictionariesportant shortcomings for analyzing higher-dimensional signals such as nat-ural images. Even though the wavelet transform possesses important op-timality properties for one-dimensional signals, it is much less effective forhigher-dimensional signals. This is because in higher dimensions, the wavelettransform is a separable extension of the one-dimensional transform alongdifferent dimensions. As a result, for example the 2D wavelet transformis suitable for representing points but it is not effective for representingedges. This is a major shortcoming because the main features in naturalimages are composed of edges. Therefore, there was a need for sparsity-inducing transforms or dictionaries that could efficiently represent thesetypes of features. Consequently, great research effort was devoted to de-signing transforms/dictionaries especially suitable for natural images. Someof the proposed transforms that have been more successful for image process-ing applications include the complex wavelet transform [164], the curvelettransform [41, 42], the contourlet transform [86] and its extension to 3Dimages known as surfacelet [198], the shearlet transform [96, 171], and thebandlet transform [176].The transforms mentioned above have had a great impact on the fieldof image processing and are still used in practice. They have also been usedin CT [e.g., 41, 110, 266, 321]. However, learned overcomplete dictionariesachieve much better results in practice by breaking some of the restrictionsthat are naturally imposed by these analytical dictionaries. The restrictionof orthogonality, for instance, requires the number of atoms in the dictionaryto be no more than the dimensionality of the signal. The consequencesof these limitations had already been realized by researchers working onwavelets. This realization led to developments such as stationary wavelettransform, steerable wavelet transform, and wavelet packets, which greatlyimproved upon the orthogonal wavelet transform [64, 230, 295]. However,these transforms are still based on fixed constructions and do not have thefreedom and adaptability of learned dictionaries that we will explain below.2.1.2 Learned overcomplete dictionariesThe basic idea of adapting the dictionary to the signal is not completelynew. 
2.1.2 Learned overcomplete dictionaries

The basic idea of adapting the dictionary to the signal is not completely new. One can argue that the Principal Component Analysis (PCA) method [131], which is also known as the Karhunen-Loève Transform (KLT) in signal processing, is an example of learning a dictionary from training data. However, this transform too is limited in terms of the dictionary structure and the number of atoms in the dictionary. Specifically, the atoms in a PCA dictionary are necessarily orthogonal and their number is at most equal to the signal dimensionality.

The modern story of dictionary learning begins with a paper by Olshausen and Field [245]. The question posed in that paper was: if we assume that small patches of natural images have a sparse representation in a dictionary $D$ and try to learn this dictionary from a set of training patches, what would the learned dictionary atoms look like? They found that the learned dictionary consisted of atoms that were spatially localized, oriented, and bandpass. This was a remarkable discovery because these are exactly the characteristics of simple-cell receptive fields in the mammalian visual cortex. Although similar patterns existed in Gabor filters [75, 76], Olshausen and Field were able to show that these structures can be explained using only one assumption: sparsity.

Suppose that we are given a set of training signals and would like to learn a dictionary for sparse representation of these signals. We stack these training signals as columns of a matrix, which we denote with $X$. Each column of $X$ is referred to as a training signal. In image processing applications, each training signal is a patch (for 2D images) or block (in the case of 3D images) that is vectorized to form a column of $X$. Using the matrix of training signals, a dictionary can be learned through the following optimization problem:

$$\underset{D \in \mathcal{D},\, \Gamma}{\text{minimize}} \;\; \|X - D\Gamma\|_F^2 + \lambda \|\Gamma\|_1 \qquad (2.1)$$

In the above equation, $X$ denotes the matrix of training data, $\Gamma$ is the matrix of representation coefficients of the training signals in $D$, and $\mathcal{D}$ is the set of matrices whose columns have unit Euclidean norm. The $i$th column of $\Gamma$ is the vector of representation coefficients of the $i$th column of $X$ (i.e., the $i$th training signal) in $D$. The notations $\|\cdot\|_F$ and $\|\cdot\|_1$ denote, respectively, the Frobenius norm and the $\ell_1$ norm. The constraint $D \in \mathcal{D}$ is necessary to avoid scale ambiguity because, without this constraint, the objective function can be made smaller by decreasing $\Gamma$ by an arbitrary factor and increasing $D$ by the same factor. The first term in the objective function requires that the training signals be accurately represented by the columns of $D$, and the second term promotes sparsity, encouraging that a small number of columns of $D$ be used in the representation of each training signal.

There are many possible variations of the optimization problem presented in Equation (2.1), some of which will be explained in this chapter. For example, the $\ell_1$ penalty on $\Gamma$ is sometimes replaced with an $\ell_0$ penalty. In fact, it can be shown that variations of this problem include problems as diverse as PCA, clustering or vector quantization, independent component analysis, archetypal analysis, and non-negative matrix factorization (see for example [9, 211]).

The most important fact about the optimization problem in (2.1) is that it is not jointly convex with respect to $D$ and $\Gamma$. Therefore, only a stationary point can be hoped for and the global optimum is not guaranteed. However, this problem is convex with respect to $D$ and $\Gamma$ individually.
Therefore, many dictionary learning algorithms adopt an alternating minimization approach. In other words, the objective function is minimized with respect to one of the two variables while keeping the other fixed. The first such method was the method of optimal directions (MOD) [103]. In each iteration of MOD, the objective function is first minimized with respect to $\Gamma$ by solving a separate sparse coding problem for each training signal:

$$\Gamma_i^{k+1} = \underset{\gamma}{\operatorname{argmin}} \; \|X_i - D^k \gamma\|_2^2 \quad \text{subject to: } \|\gamma\|_0 \leq K \qquad (2.2)$$

In the above equation, and in the rest of this chapter, we use subscripts on matrices to index their columns. Therefore, $X_i$ indicates the $i$th column of $X$, which is the $i$th training signal, and $\Gamma_i$ is the $i$th column of $\Gamma$, which is the vector of representation coefficients of $X_i$ in $D$. We will use superscripts to indicate the iteration number. Once all columns of $\Gamma$ are updated, $\Gamma$ is kept fixed and the dictionary is updated. This update is in the form of a least-squares problem that has a closed-form solution:

$$D^{k+1} = X \left(\Gamma^{k+1}\right)^{\dagger} \qquad (2.3)$$

where $\dagger$ denotes the Moore-Penrose pseudo-inverse.

Before moving on, we need to say a few brief words about the optimization problem in (2.2). This optimization problem is one formulation of the sparse coding problem, which is a central part of any image processing method that makes use of learned overcomplete dictionaries. Because of their ubiquity, there has been a very large body of research on the properties of these problems and their solution methods. Often the $\ell_0$ norm in this equation is replaced with an $\ell_1$ norm. This problem is known as the Least Absolute Shrinkage and Selection Operator (LASSO) in statistics [315] and Basis Pursuit Denoising (BPDN) in signal processing [54]. A review of these methods is beyond the scope of this dissertation. Therefore, we will only mention or describe the relevant algorithms where necessary. A recent review of these methods can be found in [9]. In MOD, this step is solved using orthogonal matching pursuit (OMP) [252] or the focal underdetermined system solver (FOCUSS) [119].

Another dictionary-learning algorithm, which has been shown to be more efficient than MOD, is the K-SVD algorithm [2]. K-SVD is arguably the most widely used dictionary learning algorithm today. Similar to MOD, each iteration of the K-SVD algorithm updates each column of $\Gamma$ by solving a sparse coding problem similar to (2.2). However, unlike MOD, which updates all dictionary atoms at once, K-SVD updates each dictionary atom (i.e., each column of $D$) sequentially. Assuming all dictionary atoms are fixed except for the $i$th atom, the cost function in (2.1) can be written as:

$$\|X - D\Gamma\|_F^2 = \Big\|X - \sum_{j=1}^{N} D_j \Gamma_j^T\Big\|_F^2 = \Big\|X - \sum_{j=1,\, j\neq i}^{N} D_j \Gamma_j^T - D_i \Gamma_i^T\Big\|_F^2 = \big\|E_i - D_i \Gamma_i^T\big\|_F^2 \qquad (2.4)$$

where $\Gamma_j^T$ denotes the $j$th row of $\Gamma$. In the K-SVD algorithm this is minimized using an SVD decomposition of the matrix $E_i$ after restricting it to the training signals that are using $D_i$ in their representation. The reason behind this restriction is that it preserves the sparsity of the representation coefficients. Let us denote the restricted version of $E_i$ with $E_i^R$ and assume that the SVD decomposition of $E_i^R$ is $E_i^R = U \Delta V^T$. Then, $U_1$ and $\Delta(1,1) V_1$ provide the updates of $D_i$ and $\Gamma_i^T$, where $U_1$ and $V_1$ denote the first columns of $U$ and $V$, respectively, and $\Delta(1,1)$ is the largest singular value of $E_i^R$.
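A minimal Python/NumPy sketch of this alternating scheme is given below. A simple greedy OMP routine stands in for the sparse coding step of Equation (2.2), and the dictionary update is the closed-form MOD solution of Equation (2.3) followed by renormalization of the atoms. This is a toy illustration on random data with arbitrary choices for the dictionary size, sparsity level, and number of iterations; it is not the implementation used in this dissertation, and practical MOD/K-SVD codes are considerably more refined.

```python
import numpy as np

def omp(D, x, K):
    """Greedy orthogonal matching pursuit: at most K nonzero coefficients."""
    residual, support = x.copy(), []
    gamma = np.zeros(D.shape[1])
    for _ in range(K):
        support.append(int(np.argmax(np.abs(D.T @ residual))))  # most correlated atom
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    gamma[support] = coef
    return gamma

def mod_dictionary_learning(X, n_atoms, K, n_iter=20):
    """Alternate OMP sparse coding (Eq. 2.2) and the MOD update (Eq. 2.3)."""
    m, N = X.shape
    rng = np.random.default_rng(0)
    D = X[:, rng.choice(N, n_atoms, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0)                      # unit-norm atoms
    for _ in range(n_iter):
        Gamma = np.column_stack([omp(D, X[:, i], K) for i in range(N)])
        D = X @ np.linalg.pinv(Gamma)                   # closed-form MOD update
        D /= np.linalg.norm(D, axis=0) + 1e-12
    Gamma = np.column_stack([omp(D, X[:, i], K) for i in range(N)])  # final codes
    return D, Gamma

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((64, 500))   # 500 toy training signals of dimension 64
    D, Gamma = mod_dictionary_learning(X, n_atoms=128, K=5)
    err = np.linalg.norm(X - D @ Gamma) / np.linalg.norm(X)
    print(f"relative representation error: {err:.3f}")
```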
A major problem with methods like MOD and K-SVD is that they are computationally intensive. Even though efficient implementations of these algorithms have been developed [275], the amount of computation becomes excessive as the number of training signals and the signal dimensionality grow. Therefore, a number of studies have proposed algorithms that are particularly designed for learning dictionaries from huge datasets in reasonable time [213, 313]. The algorithm proposed in [212, 213], for example, is based on stochastic optimization algorithms that are particularly suitable for large-scale problems. Instead of solving the optimization problem by considering the whole training data, it randomly picks one training signal (i.e., one column of $X$) and approximately minimizes the objective function using that one training signal. Convincing theoretical and empirical evidence regarding the convergence of this dictionary learning approach has been presented in [213].

Another important class of dictionary learning algorithms consists of maximum-likelihood algorithms, which are in fact among the first methods suggested for learning dictionaries from data [167, 183, 184]. These methods assume that each training signal is produced by a model of the form:

$$X_i = D \Gamma_i + w_i \qquad (2.5)$$

where $w_i$ is Gaussian white noise. To encourage sparsity of the representation coefficients ($\Gamma_i$), these methods assume a sparsity-promoting prior, such as a Cauchy or Laplace distribution, for the entries of $\Gamma$. Additionally, these approaches assume that the entries of $\Gamma$ are independent and identically distributed and that each signal $X_i$ is drawn independently. A dictionary can then be learned by maximizing the data likelihood $p(X|D)$ or the posterior $p(D|X)$. Quite often, the resulting likelihood function is very difficult to maximize, and it is further simplified before applying the optimization algorithm.

It can be argued that the maximum-likelihood methods explained above are not truly Bayesian methods because they yield a point estimate rather than the full posterior distribution [211]. As a result, in recent years several fully-Bayesian dictionary learning methods have been proposed [135, 358, 359]. In these algorithms, priors are placed on all model parameters, i.e., not only on the dictionary atoms $D_i$ and sparse representation vectors $\Gamma_i$, but also on all other model parameters such as the number of dictionary atoms and the noise variance for each training signal. The most important priors assumed in these models are usually Gaussian priors with Gamma hyper-priors for the dictionary atoms ($D_i$) and representation coefficients ($\Gamma_i$), and a Beta-Bernoulli process for the support of $\Gamma_i$ [135, 359]. The full posterior density of the model parameters and hyper-parameters is iteratively estimated via Gibbs sampling. Compared to all the other dictionary learning methods described above, these fully-Bayesian methods are significantly more computationally demanding. On the other hand, their robustness with respect to poor initialization and their ability to learn some important parameters, such as the noise variance, make them potentially very useful for certain applications [304, 337].
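For the large-scale setting mentioned above, stochastic or mini-batch learning of the kind proposed in [212, 213] is often the only practical option. As a hedged illustration only (not the exact algorithm of [212, 213], and not code from this dissertation), the snippet below uses the MiniBatchDictionaryLearning class from scikit-learn, which implements an online dictionary learning method in this spirit; the random patch matrix, the number of atoms, and the sparsity penalty are placeholder assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 64))     # stand-in for 10,000 vectorized 8x8 patches

# Online learning: the dictionary is updated from small random batches of patches,
# so the full training set never has to be processed at once.
learner = MiniBatchDictionaryLearning(n_components=128,   # number of atoms
                                      alpha=1.0,          # sparsity penalty
                                      batch_size=256,
                                      random_state=0)
codes = learner.fit_transform(X)         # sparse codes, one row per patch
D = learner.components_                  # learned dictionary, one atom per row
print(codes.shape, D.shape)
```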
For example, a common structure assumed between the atoms is a tree structure, where each atom is the descendant/parent of some other atoms [141, 142]. During dictionary usage, an atom then participates in the sparse code of a signal if and only if its parent atom does so. Obviously, the basic \ell_0 and \ell_1 norms are not capable of modeling these interactions between dictionary atoms; the success of structured dictionary learning has therefore been made possible by algorithms for structured sparse coding [134, 140]. Another common structure is the grid structure, which enforces a neighborhood relation between atoms [153, 313]. The second variation of great importance is multi-scale dictionary learning. Extending the basic dictionary learning scheme to consider different patch sizes has been shown to significantly improve the performance of dictionary-based image processing [216, 246]. Moreover, this extension to multiple scales has been suggested as an approach to addressing some of the theoretical flaws in dictionary-based image processing [250]. The third important variation that we mention here includes dictionaries that are fast to apply. As we mentioned above, learned dictionaries do not possess such desirable structural properties as orthogonality. As a result, they are much more costly to apply than analytical dictionaries. Therefore, several dictionary structures have been proposed with the goal of reducing the computational cost during dictionary usage [1, 274]. These dictionaries can be particularly useful for processing large 3D images.

As final remarks on dictionary learning, we should first mention that there is no strong theoretical justification behind most dictionary learning algorithms. In particular, there is no theoretical guarantee that these algorithms are robust or that the learned dictionary will work well in practical applications. In practice, learning a good dictionary certainly requires a sufficient amount of training data, and the minimum amount of data needed grows at least linearly with the number of dictionary atoms [274, 276]. Uniqueness of the learned dictionary, however, is only guaranteed for an exponential number of training signals [3]. In fact, the theory of dictionary learning is considered to be one of the major open problems in the field of sparse representation [98]. Secondly, pre-processing of the training image patches has proved to significantly influence the types of structures that emerge in the learned dictionary and the performance of the learned dictionary in practice. Three of the most commonly used pre-processing operations are: (i) removal of the patch mean, also known as centering [100], (ii) variance normalization, which is preceded by centering [139, 258], and (iii) de-correlation of the pixel values within a patch, referred to as whitening [20, 136]. The overall effect of these three operations is to amplify high-frequency structure such as edges, resulting in more high-frequency patterns in the learned dictionary [211].
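As an illustration of these pre-processing operations, the following sketch (assuming NumPy; the small constants guarding against division by zero and the function name are illustrative) applies centering, variance normalization, and PCA-based whitening to a matrix of vectorized training patches.

    import numpy as np

    def preprocess_patches(P, whiten=False, eps=1e-8):
        # P: (n_patches, patch_dim) matrix of vectorized training patches.
        P = P.astype(float)
        P -= P.mean(axis=1, keepdims=True)            # (i) centering: remove each patch's mean
        P /= (P.std(axis=1, keepdims=True) + eps)     # (ii) variance normalization
        if whiten:                                    # (iii) PCA whitening: de-correlate pixel values
            C = np.cov(P, rowvar=False)
            eigval, eigvec = np.linalg.eigh(C)
            W = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
            P = P @ W
        return P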
2.1.3 Applications of learned dictionaries

Learned overcomplete dictionaries have been employed in various image processing and computer vision applications in the past ten years. There are monographs that review and explain these applications in detail [99, 211]. Because of space limitations, we describe only the basic formulations for image denoising, image inpainting, and image scale-up. Not only are these three tasks among the most successful applications of learned dictionaries in image processing, they are also very instructive in terms of how these dictionaries can be used to accomplish various image processing tasks.

Image denoising

Suppose that we have measured a noisy image x = x_0 + w, where x_0 is the true underlying image and w is additive noise that is assumed to be white Gaussian. The prior assumption in denoising using a dictionary D is that every patch in the image has a sparse representation in D. If we denote a typical patch by p, this means that there exists a sparse vector \gamma such that \|p - D\gamma\|_2^2 < \epsilon, where \epsilon is proportional to the noise variance [100, 215]. Using this prior on every patch in the image, the maximum a posteriori (MAP) estimate of the true image can be found as the solution of the following problem [99]:

\{\hat{x}_0, \{\hat{\gamma}_i\}_{i=1}^{N}\} = \arg\min_{z, \{\gamma_i\}_{i=1}^{N}} \lambda \|z - x\|_2^2 + \sum_{i=1}^{N} \big( \|R_i z - D\gamma_i\|_2^2 + \|\gamma_i\|_0 \big)    (2.6)

where R_i represents a binary matrix that extracts and vectorizes the ith patch from the image. This is a very common notation that simplifies the presentation of this type of equation, and we will use it in the rest of this dissertation. N is the total number of extracted patches. It is common to use overlapping patches to avoid discontinuity artifacts at the patch boundaries. In fact, unless the computational time is a concern, it is recommended to use maximum overlap, such that adjacent extracted patches are shifted by only one pixel in each direction; this means extracting all possible patches from the image.

The objective function in Equation (2.6) is easy to understand. The first term requires the denoised image to be close to the measurement, x, and the second term requires every patch extracted from this image to have a sparse representation in the dictionary D. The common approach to solving this optimization problem is an approximate block-coordinate minimization. First, we initialize z to the noisy measurement (z = x). Keeping z fixed, the objective function is minimized with respect to \{\gamma_i\}_{i=1}^{N}. This step is simplified because it is equivalent to N independent problems, one for each patch, which can be solved using sparse coding algorithms. Then \{\gamma_i\}_{i=1}^{N} are kept fixed and the objective function is minimized with respect to z. This minimization has a closed-form solution:

\hat{x}_0 = \Big( \lambda I + \sum_{i=1}^{N} R_i^T R_i \Big)^{-1} \Big( \lambda x + \sum_{i=1}^{N} R_i^T D \hat{\gamma}_i \Big)    (2.7)

There is no need to form and invert a matrix to solve this equation. It basically amounts to returning the denoised patches to their right place on the image canvas and performing a weighted averaging that accounts for the overlap between patches and includes the noisy image x with weight \lambda.

The minimization with respect to \{\gamma_i\}_{i=1}^{N} and z can be performed iteratively by using the \hat{x}_0 obtained from (2.7) as the new estimate of the image. However, this runs into difficulties because the noise distribution in \hat{x}_0 is unknown and is certainly not white Gaussian. Therefore, the \hat{x}_0 obtained from (2.7) is usually used as the final estimate of the underlying image x_0.
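A minimal sketch of this denoising scheme is given below, assuming NumPy, scikit-learn's OMP solver, and a dictionary D that has already been learned (atoms stored as columns). The removal of each patch's mean before coding is a common practical detail rather than part of (2.6), and the patch size, sparsity level, and weight lambda are illustrative.

    import numpy as np
    from sklearn.linear_model import orthogonal_mp

    def dictionary_denoise(x, D, patch_size=8, K=5, lam=0.5, stride=1):
        # x: noisy 2D image; D: (patch_size**2, n_atoms) learned dictionary.
        num = lam * x.astype(float)          # accumulates lambda*x + sum_i R_i^T D gamma_i
        den = lam * np.ones_like(num)        # accumulates the diagonal of lambda*I + sum_i R_i^T R_i
        p = patch_size
        for r in range(0, x.shape[0] - p + 1, stride):
            for c in range(0, x.shape[1] - p + 1, stride):
                patch = x[r:r+p, c:c+p].astype(float).reshape(-1)
                mean = patch.mean()                                   # code the patch minus its DC component
                gamma = orthogonal_mp(D, patch - mean, n_nonzero_coefs=K)
                est = (D @ gamma + mean).reshape(p, p)
                num[r:r+p, c:c+p] += est                              # put the denoised patch back on the canvas
                den[r:r+p, c:c+p] += 1.0
        return num / den                     # the weighted averaging of (2.7)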
Image inpainting

Let us denote the true underlying image by x_0 and assume that the observed image x not only contains noise, but also that some pixels are not observed or are corrupted to the extent that their measurements should be ignored. The model used for this scenario is x = M x_0 + w, where w is the additive noise and M is a mask matrix, i.e., a binary matrix that removes the unobserved/corrupted pixels. The goal is to recover x_0 from x. Similar to the denoising problem above, one can use the prior assumption that patches of x_0 have a sparse representation in a dictionary D. The MAP estimate of x_0 can be found as a solution of the following problem [99]:

\{\hat{x}_0, \{\hat{\gamma}_i\}_{i=1}^{N}\} = \arg\min_{z, \{\gamma_i\}_{i=1}^{N}} \lambda \|Mz - x\|_2^2 + \sum_{i=1}^{N} \big( \|R_i z - D\gamma_i\|_2^2 + \|\gamma_i\|_0 \big)    (2.8)

An approximate solution can be found using an approach rather similar to that described above for the denoising problem. Specifically, we start with the initialization z = M^T x. Then, assuming that z is fixed, we solve N independent sparse coding problems to find estimates of \{\gamma_i\}_{i=1}^{N}. The only issue here is that this initial z is corrupted at the locations of the unobserved pixels. Therefore, the estimation of \{\gamma_i\}_{i=1}^{N} needs to take this into account by introducing a local mask matrix for each patch:

\hat{\gamma}_i = \arg\min_{\gamma} \|M_i (R_i z - D\gamma)\|_2^2 \quad \text{subject to: } \|\gamma\|_0 \le K    (2.9)

Once \{\gamma_i\}_{i=1}^{N} are estimated, an approximation of the underlying full image is found as:

\hat{x}_0 = \Big( \lambda M^T M + \sum_{i=1}^{N} R_i^T R_i \Big)^{-1} \Big( \lambda M^T x + \sum_{i=1}^{N} R_i^T D \hat{\gamma}_i \Big)    (2.10)

which has a simple interpretation similar to (2.7). The method described above has been shown to be very effective in many studies [215, 216, 254].
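The per-patch masking of (2.9) can be sketched as follows, under the same assumptions as the denoising sketch above. Here the masked sparse coding is implemented simply by deleting the rows of the patch and of D that correspond to missing pixels before running OMP, and the aggregation corresponds to (2.10) with lambda set to zero; both are illustrative simplifications rather than the exact solvers used in the cited works.

    import numpy as np
    from sklearn.linear_model import orthogonal_mp

    def dictionary_inpaint(x, mask, D, patch_size=8, K=5):
        # x: observed image (values at missing pixels are arbitrary); mask: 1 = observed, 0 = missing.
        p = patch_size
        num = np.zeros(x.shape, dtype=float)
        den = np.zeros(x.shape, dtype=float)
        for r in range(0, x.shape[0] - p + 1):
            for c in range(0, x.shape[1] - p + 1):
                patch = x[r:r+p, c:c+p].astype(float).reshape(-1)
                m = mask[r:r+p, c:c+p].reshape(-1).astype(bool)
                if m.sum() < K:                    # too few observed pixels to code this patch reliably
                    continue
                # Masked sparse coding (2.9): keep only the observed rows of the patch and of D.
                gamma = orthogonal_mp(D[m, :], patch[m], n_nonzero_coefs=K)
                est = (D @ gamma).reshape(p, p)    # synthesize the full patch, including missing pixels
                num[r:r+p, c:c+p] += est
                den[r:r+p, c:c+p] += 1.0
        return num / np.maximum(den, 1e-12)        # aggregation analogous to (2.10)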
Image scale-up (super-resolution)

As we saw above, the application of learned dictionaries to image denoising and inpainting can be quite straightforward. Nevertheless, the application of learned dictionaries to image processing may involve much more elaborate approaches, even for simple tasks such as denoising. As an example of a slightly more complex task, in this section we explain image scale-up. It serves as a good example of more elaborate uses of learned dictionaries in image processing and, moreover, has been one of their most successful applications to date [345].

Suppose x^h is a high-resolution image. A blurred, low-resolution version of this image can be modeled as x^l = S H x^h, where H and S denote the blur and down-sampling operators. Given the measured low-resolution image, which may also include additive noise (i.e., x^l = S H x^h + w), the goal is to recover the high-resolution image. This problem is usually called the image scale-up problem, and it is also referred to as image super-resolution.

The first image scale-up algorithm that used learned dictionaries was suggested in [345]. This algorithm is based on learning two dictionaries, one for sparse representation of the patches of the high-resolution image and one for sparse representation of the patches of the low-resolution image. Let us denote these dictionaries by D^h and D^l, respectively. The basic assumption in this algorithm is that the sparse representation of a low-resolution patch in D^l is identical to the sparse representation of its corresponding high-resolution patch in D^h. Therefore, given a low-resolution image x^l, one can divide it into patches and use each low-resolution patch to estimate its corresponding high-resolution patch. Let us denote the ith patch extracted from x^l by X_i^l and its corresponding high-resolution patch by X_i^h. One first finds the sparse representation of X_i^l in D^l using any sparse coding algorithm, such that X_i^l \cong D^l \gamma_i. Then, by assumption, \gamma_i is also the sparse representation of X_i^h in D^h, and the estimate of X_i^h will be \hat{X}_i^h \cong D^h \gamma_i. These estimated high-resolution patches are then placed on the canvas of the high-resolution image, and the high-resolution image is formed via a weighted averaging similar to that in the denoising application above. The procedure explained here for estimating the high-resolution patches from their low-resolution counterparts is the simplest approach; in practice, it is applied with slight modifications that significantly improve the results [99, 345, 352].

The main assumption in the above algorithm is that the sparse codes of the low-resolution and high-resolution patches are identical. This assumption has to be enforced during dictionary learning; in other words, the dictionaries D^h and D^l are learned such that this condition is satisfied. The dictionary learning approach suggested in [345] is:

\min_{D^h, D^l, \Gamma} \frac{1}{m_h} \|X^h - D^h \Gamma\|_F^2 + \frac{1}{m_l} \|X^l - D^l \Gamma\|_F^2 + \lambda \|\Gamma\|_1    (2.11)

where X^h and X^l represent the matrices of training signals. The ith column of X^h is the vectorized version of a patch extracted from a high-resolution image, and the ith column of X^l is the vectorized version of the corresponding low-resolution patch. m_h and m_l are the lengths of the high-resolution and low-resolution training signals and are included in the objective function to properly balance the two terms. The important choice in (2.11) is the use of the same \Gamma in the first and second terms. It is easy to see how this choice forces the learned dictionaries D^h and D^l to be such that corresponding high-resolution and low-resolution patches have the same sparse representation.

The above algorithm achieved surprisingly good results [345]. However, it was soon realized that its assumption on the sparse representations was too restrictive and that better results could be obtained by relaxing it. For instance, one study suggested a linear relation between the sparse representations of low-resolution and high-resolution patches and obtained better results [327]. The dictionary learning formulation for this algorithm has the following form:

\min_{D^h, \Gamma^h, D^l, \Gamma^l, W} \|X^h - D^h \Gamma^h\|_F^2 + \|X^l - D^l \Gamma^l\|_F^2 + \lambda_h \|\Gamma^h\|_1 + \lambda_l \|\Gamma^l\|_1 + \lambda_W \|\Gamma^l - W \Gamma^h\|_F^2 + \alpha \|W\|_F^2    (2.12)

Here the assumption is not that the sparse representation of the high-resolution patches (\Gamma^h) is the same as that of the low-resolution patches (\Gamma^l), but that there is a linear relationship between them, represented by the matrix W. This results in a much more general and more flexible model. On the other hand, it is also a more difficult model to learn because it requires learning the matrix W in addition to the two dictionaries. In [327], a block-coordinate optimization algorithm was suggested for solving this problem and was shown to produce very good results.
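The inference step of the basic coupled-dictionary scheme can be sketched very compactly, assuming two pre-learned dictionaries with column-wise corresponding atoms (here called D_l and D_h); the feature extraction, overlap constraints, and back-projection refinements used in the practical implementations of [345] are omitted.

    import numpy as np
    from sklearn.linear_model import orthogonal_mp

    def scale_up_patches(low_patches, D_l, D_h, K=3):
        # low_patches: (m_l, n_patches) matrix of vectorized low-resolution patches.
        # D_l: (m_l, n_atoms) and D_h: (m_h, n_atoms) are the coupled dictionaries.
        Gamma = orthogonal_mp(D_l, low_patches, n_nonzero_coefs=K)  # sparse codes in D_l
        return D_h @ Gamma                                          # same codes synthesize the HR patches

The returned high-resolution patches would then be placed on the high-resolution canvas and averaged, exactly as in the denoising application above.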
There have also been other approaches to relaxing the relationship between the sparse codes of high-resolution and low-resolution patches. For instance, one study suggested a bilinear relation involving two matrices [132]. Another study suggested a statistical inference technique to predict the sparse code of the high-resolution patches from the low-resolution ones [253]. Both of these approaches reported very good results. In general, image scale-up with the help of learned dictionaries has been shown to outperform other competing methods, and it is a good example of the power of learned dictionaries in modeling natural images.

Other applications

Above, we explained three applications of learned dictionaries. However, these dictionaries have proved highly effective in many other applications as well. Some of these other applications include image demosaicking [215, 225], deblurring [69, 88], compressed sensing [55, 93, 267], morphological component analysis [99], compression [36, 298], classification [147, 265], cross-domain image synthesis [362], and removal of various types of artifacts from the image [151, 350]. For many image processing tasks, such as denoising and compression, application of the dictionary is relatively straightforward. However, there are also more complex tasks for which the learning and application of overcomplete dictionaries are much more involved. It has been suggested in [210, 211] that many of these applications can be considered as instances of classification or regression problems. The authors of [210] coin the term "task-driven dictionary learning" to describe these applications and suggest that a general optimization formulation for these problems is of the form:

\min_{D \in \mathcal{D}, W \in \mathcal{W}} L(Y, W, \hat{\Gamma}) + \lambda \|W\|_F^2    (2.13)

In the above formulation, \hat{\Gamma} is the matrix of representation coefficients of the training signals, obtained by solving a problem such as (2.2). The cost function L quantifies the error in the prediction of the target variables Y from the sparse codes \hat{\Gamma}, and W denotes the model parameters. For a classification problem, Y represents the labels of the training signals, whereas in a regression setting Y represents real-valued vectors. For example, the image scale-up problem presented above is an instance of the regression setting, where Y represents the vectors of pixel values of the high-resolution patches. The second term in the above objective function is a regularization on the model parameters that is meant to avoid overfitting and numerical instability.

Therefore, in task-driven dictionary learning the goal is to learn the dictionary not only for sparse representation of the signal, but also so that it can be employed for accurate prediction of the target variables, Y. The general optimization problem in (2.13) is very difficult to solve. In addition to the fact that the objective function is non-convex, the dependence of L on D is through \hat{\Gamma}, which is in turn obtained by solving (2.2). In the asymptotic case where the amount of training data is very large, it has been shown that this general optimization problem is differentiable and can be effectively solved using stochastic gradient descent [210].
It has been shown that this approach can lead to very good results in a range of classification and regression tasks such as compressed sensing, handwritten digit classification, and inverse halftoning [210].

2.2 Non-local patch-based image processing

Natural images contain abundant self-similarities. In terms of image patches, this means that for every patch in a natural image we can probably find many similar patches in the same image. The main idea in non-local patch-based image processing is to exploit this self-similarity by finding/collecting similar patches and processing them jointly. The idea of exploiting patch similarities and the notion of nonlocal filtering are not very new [70, 97, 316, 330, 349]. However, it was the non-local means (NLM) denoising algorithm proposed in [37] that started the new wave of research in this field. Even though the basic idea behind NLM denoising is very simple and intuitive, it achieves remarkable denoising results, and it has created a great deal of interest in the image processing community.

Let us denote the noisy image by x = x_0 + w, where, as before, x_0 denotes the true image. We also denote the ith pixel of x by x(i) and a patch/block centered on x(i) by x[i]. We will use similar notation in the rest of this dissertation. The NLM algorithm considers overlapping patches, each patch centered on one pixel. The value of the ith pixel in the underlying image, x_0(i), is estimated as a weighted average of the center pixels of all the patches as follows:

\hat{x}_0(i) = \sum_{j=1}^{N} \frac{G_a(\|x[j] - x[i]\|_F^2)}{\sum_{j=1}^{N} G_a(\|x[j] - x[i]\|_F^2)} \, x(j)    (2.14)

where G_a denotes a Gaussian kernel with bandwidth a and N is the total number of patches. The intuition behind this algorithm is very simple: similar patches are likely to have similar pixels at their centers. Therefore, in order to estimate the true value of the ith pixel, the algorithm performs a weighted averaging of the values of all pixels, with the weight related to the similarity of each patch to the patch centered on the ith pixel.

Although in theory all patches can be included in the denoising of the ith pixel, as shown in (2.14), in practice only patches from a small neighborhood around this pixel are included. In fact, many of the methods based on NLM denoising first find several patches that are similar to x[i], and only those patches that are similar enough to x[i] are used in computing \hat{x}_0(i). Therefore, a practical implementation of NLM denoising is:

\hat{x}_0(i) = \sum_{j \in S_i} \frac{G_a(\|x[j] - x[i]\|_F^2)}{\sum_{j \in S_i} G_a(\|x[j] - x[i]\|_F^2)} \, x(j), \quad \text{where } S_i = \{ j \mid j \in N_i \ \text{and} \ \|x[j] - x[i]\|_F \le \epsilon \}    (2.15)

where N_i is a small neighborhood around the ith pixel and \epsilon is a noise-dependent threshold.

The idea behind NLM has proved to be an extremely powerful model for natural images. For the denoising task, NLM filtering and its extensions have led to the best denoising results [228, 289]. Some studies have shown that the current state-of-the-art algorithms are approaching the theoretical performance limits of denoising [51, 182]. Some of the recent extensions of the basic NLM denoising include Bayesian/probabilistic extensions of the method [178, 333], spatially adaptive selection of the algorithm parameters [91, 95], combining NLM denoising with TV denoising [306], and the use of non-square patches, which has been shown to improve the results around edges and high-contrast features [83]. Some of the most productive extensions of the NLM scheme involve exploiting the power of learned dictionaries. We will discuss these methods in the next section.
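A direct, unoptimized implementation of (2.15) is sketched below, assuming NumPy; the patch size, search window, and kernel bandwidth are illustrative. In practice one would use an optimized implementation (for example, the non-local means routine available in scikit-image) rather than this quadruple loop.

    import numpy as np

    def nlm_denoise(x, patch=3, window=10, a=10.0):
        # x: noisy 2D image. Returns the NLM estimate of (2.15) with a Gaussian kernel of bandwidth a.
        h, w = x.shape
        r = patch // 2
        pad = np.pad(x, r, mode='reflect')
        out = np.zeros_like(x, dtype=float)
        for i in range(h):
            for j in range(w):
                ref = pad[i:i+patch, j:j+patch]            # patch centered on pixel (i, j)
                num, den = 0.0, 0.0
                for m in range(max(0, i-window), min(h, i+window+1)):   # restrict to neighborhood N_i
                    for n in range(max(0, j-window), min(w, j+window+1)):
                        cand = pad[m:m+patch, n:n+patch]
                        d2 = np.sum((ref - cand) ** 2)     # squared Frobenius distance between patches
                        wgt = np.exp(-d2 / (2.0 * a ** 2))
                        num += wgt * x[m, n]
                        den += wgt
                out[i, j] = num / den
        return out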
Nonlocal patch-based methods are very computationally demanding. Therefore, a large number of research papers have focused on speedup strategies. A very effective strategy was proposed in [73]. It is based on building a temporary image that holds the discrete integral of the squared differences of the noisy image for all patch translations. This integral image is used for fast computation of the patch differences (x[j] - x[i]), which are the main computational burden in nonlocal patch-based methods. A large number of papers have focused on reducing the computational cost of NLM denoising by classifying/clustering the image patches before starting the denoising process [27, 77, 207]. The justification behind this approach is that the computational bottleneck of NLM denoising is the search for similar patches. Therefore, these methods aim at clustering the patches so that the search for similar patches becomes less computationally demanding. Most of these methods compute a few features from each patch to obtain a concise representation of the patches; typical features include the average gray value and the gradient orientation. During denoising, a set of similar patches is found for each patch using the clustered patches. One study compared various tree structures for fast retrieval of similar patches in an image and found that vantage point trees are superior to other tree structures [170]. Another class of highly efficient algorithms for finding similar patches is stochastic in nature. These methods can be much faster than the deterministic techniques mentioned above, but they are less accurate. Perhaps the most widely used algorithm in this category is the PatchMatch algorithm and its extensions [12, 13].

The NLM algorithm and the extensions that we will explain in the next section have been recognized as the state-of-the-art methods for image denoising. However, the idea of exploiting patch similarities has been used for many other image processing tasks. For instance, it has been shown that nonlocal patch similarities can be used to develop highly effective regularizations for inverse problems and iterative image reconstruction algorithms [114, 194, 229, 255, 354]. Below, we briefly explain two of these algorithms.

Let us consider the inverse problem of estimating an unknown image x from the measurements y = Ax + w, where w is the additive noise. The matrix A represents the known forward model, which can be, for example, a blur matrix (in image deblurring) or the projection matrix (in tomography). In [255], it is suggested to recover x by solving the optimization problem:

\hat{x} = \arg\min_x \|y - Ax\|_2^2 + \lambda \sum_i \sum_j \sqrt{w_{i,j}} \, |x(i) - x(j)|    (2.16)

where w_{i,j} are the nonlocal patch-based weights, computed in a fashion similar to NLM denoising:

w_{i,j} = \frac{1}{Z} \exp\left( -\frac{\|x[i] - x[j]\|^2}{\sigma^2} \right)    (2.17)

where Z is a normalizing factor. Therefore, the regularization term in (2.16) is a non-local total variation on a graph whose weights are based on nonlocal patch similarities. The difficulty with solving this optimization problem is that the weights themselves depend on the unknown image, x. The algorithm suggested in [255] iteratively estimates the weights from the latest image estimate and then updates the image based on the new weights using a proximal gradient method. In summary, given the image estimate at the kth iteration, \hat{x}^k, the weights are estimated from this image.
Then, the image is updated using a proximal gradient iteration [46, 66]:

\hat{x}^{k+1} = \mathrm{Prox}_{\mu\lambda J}\big( \hat{x}^k + \mu A^T (y - A\hat{x}^k) \big), \quad \text{where } \mathrm{Prox}_J(x) = \arg\min_z \tfrac{1}{2}\|x - z\|_2^2 + J(z)    (2.18)

where J is the regularization term in (2.16) and \mu is the step size. Having computed the new estimate \hat{x}^{k+1}, the patch-based weights are re-computed and the algorithm continues. This algorithm showed very good results on three types of inverse problems: compressed sensing, inpainting, and image scale-up [254].

In [348], the following optimization problem was suggested for recovering the unknown image x:

\hat{x} = \arg\min_x \|y - Ax\|_2^2 + \lambda \sum_i \sum_{j \in N_i} \|x[i] - x[j]\|^p    (2.19)

where p \le 1 and N_i is a neighborhood around the ith pixel. An iterative majorization-minimization algorithm is suggested for solving (2.19). Majorization of the regularization term leads to the following quadratic surrogate problem:

\hat{x} = \arg\min_x \|y - Ax\|_2^2 + \lambda x^T S x    (2.20)

where S is a sparse matrix representing the patch similarities. The algorithm alternates between minimizing (2.20) using a conjugate gradient method and updating the matrix S from the new image estimate.
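For the quadratic surrogate (2.20), each image update amounts to solving the normal equations (A^T A + \lambda S) x = A^T y. The following sketch does this with SciPy's conjugate gradient solver; A and S are assumed to be supplied as (sparse) matrices or linear operators, the iteration limit is illustrative, and the re-estimation of S from the new image estimate is left to the caller.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def quadratic_surrogate_step(A, S, y, lam, x0=None):
        # One image update of the majorization-minimization scheme:
        # minimize ||y - Ax||^2 + lam * x^T S x  <=>  solve (A^T A + lam*S) x = A^T y.
        n = A.shape[1]
        def matvec(v):
            return A.T @ (A @ v) + lam * (S @ v)
        H = LinearOperator((n, n), matvec=matvec)
        x, info = cg(H, A.T @ y, x0=x0, maxiter=200)
        return x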
As we mentioned above, nonlocal patch similarities have been shown to be very useful for many image processing tasks. Because of space limitations, in this section we focused on image denoising and inverse problems, which are most relevant to CT. However, we should mention that in recent years the idea of exploiting nonlocal patch similarities has been applied to many other image processing tasks, and this is currently a very active area of research. Some examples of these applications include image enhancement [38], deblurring [163], inpainting [124], and super-resolution [262].

2.3 Other patch-based methods

The large number and diversity of patch-based image processing algorithms developed in the past ten years make it impossible to review them all here. Nonetheless, most of these algorithms are based on sparse representation of patches in learned dictionaries (Section 2.1) and/or on exploiting nonlocal patch similarities (Section 2.2). In this section, we try to provide a broad overview of some of the extensions of these ideas and of other patch-based methods.

To begin with, it is natural to combine the two ideas of learned dictionaries and non-local filtering to enjoy the benefits of both. Research in this direction has proven to be very fruitful. The first algorithm to explicitly follow this approach was "the non-local sparse model" proposed in [214]. This method collects similar patches of the image, as in NLM denoising. However, unlike NLM, which performs a weighted averaging, the non-local sparse model uses sparse coding of similar patches in a learned dictionary. The basic assumption in the non-local sparse model is that similar patches should use similar dictionary atoms in their representations. Therefore, simultaneous sparse coding techniques (e.g., [318, 319]) are applied on groups of similar patches.

The idea of combining the benefits of non-local patch similarities and of learned dictionaries has been explored by many studies in recent years [50, 84, 87, 88, 311, 347]. Most of these methods have reported state-of-the-art results. Although the details of these algorithms differ, the main ideas can be explained simply in terms of non-local patch similarities and sparse representation in learned dictionaries. The K-LLD algorithm [50], for example, uses the steering kernel regression method to find structurally similar patches and then uses PCA to learn a suitable dictionary for each set of similar patches. The Adaptive Sparse Domain Selection (ASDS) algorithm [88], on the other hand, clusters the training patches and learns a sub-dictionary for each cluster using PCA. For a new patch, ASDS then selects the most relevant sub-dictionary for sparse coding of that patch. The idea of using PCA to build the dictionaries in these methods has received great attention because the learned dictionaries are orthogonal. In [84], global, local, and hierarchical implementations of PCA dictionaries were studied; it was found that local PCA (i.e., PCA applied on patches selected from a sliding window) led to the best results.

A very successful patch-based image denoising algorithm that has similarities with the non-local sparse model is the BM3D algorithm [71]. Even though BM3D was proposed in 2007, it is still regarded as the state-of-the-art image denoising algorithm. Similar to the non-local sparse model, BM3D collects similar patches and filters them jointly. However, unlike the non-local sparse model, it uses orthogonal DCT dictionaries instead of learned overcomplete dictionaries. Moreover, BM3D works in two steps. First, patch-matching and filtering are performed on the original noisy image to obtain an intermediate denoised image. Then, a new round of denoising is performed, this time using the intermediate image to find similar patches. The algorithm includes other components such as Wiener filtering and weighted averaging [71]. Further improvements to the original BM3D algorithm and an extension to 3D images (the BM4D algorithm) have also been proposed [72, 205].

2.4 Patch-based methods for Poisson noise

In this section, we focus on patch-based methods for the case when the noise follows a Poisson distribution. The reason for devoting a section to this topic is that, as we explained in Section 1.2, the noise in CT projection measurements has a complex distribution that is best approximated as Poisson noise or, after log-transformation, as Gaussian noise with signal-dependent variance [204, 326]. In any case, application of patch-based image processing methods to the projection measurements in CT requires careful consideration of this complex noise distribution. Unfortunately, most patch-based image processing methods, including all the algorithms described so far in this chapter, have been proposed for Gaussian noise. Moreover, most of these algorithms (with the exception of the fully-Bayesian methods described in Section 2.1.2) assume that the Gaussian noise has uniform variance. By comparison, research on patch-based methods for the case of Poisson noise has been very limited, and most of this limited work has been published very recently.

An important first obstacle facing the application of patch-based methods to the case of Poisson noise is the choice of an appropriate patch similarity measure. Methods that depend on nonlocal patch similarities need a patch similarity measure to find similar patches. Likewise, when we use sparse representation of the patches in a learned dictionary, we often need a patch similarity measure, for example when finding the sparse representation of a patch in the dictionary using greedy methods.
When the noise has a Gaussian distribution, the standard choice is the Euclidean distance, which has a sound theoretical justification and is easy to use. For non-Gaussian noise distributions, one straightforward approach is to apply a so-called variance-stabilization transform so that the noise becomes close to Gaussian, and then use the Euclidean distance. For Poisson noise, the commonly used transforms include the Anscombe transform [7] and the Haar-Fisz transform [111]. If one wants to avoid these transforms and work with the original patches that are contaminated with Poisson noise, the proper choice of patch similarity measure is less obvious. Over the years, many criteria have been suggested for measuring the similarity between patches contaminated with Poisson noise [6]. For the case of low-count Poisson measurements, one study has suggested that the earth mover's distance (EMD) is a good measure of distance between patches [115]. It has been suggested that EMD can be approximated by passing the patches with Poisson noise through a Gaussian filter and then applying the Euclidean distance [115]. One study compared several different patch distance measures for Poisson noise through extensive numerical experiments [85]. It was found that the generalized likelihood ratio (GLR) was the best similarity criterion in terms of the trade-off between the probability of detection and false alarm [85]. GLR has many desirable theoretical properties that make it very appealing as a patch distance measure [82]. For Poisson noise, this ratio for a pair of pixel values x_1 and x_2 is:

L_G(x_1, x_2) = \frac{(x_1 + x_2)^{x_1 + x_2}}{2^{x_1 + x_2} \, x_1^{x_1} \, x_2^{x_2}}    (2.21)

Given two noisy patches x_1[i] and x_2[i], where i \in \omega, and assuming that the noise in different pixels is independent, this gives the following similarity measure between the two patches:

S(x_1, x_2) = \sum_{i \in \omega} \Big[ (x_1[i] + x_2[i]) \log(x_1[i] + x_2[i]) - x_1[i] \log(x_1[i]) - x_2[i] \log(x_2[i]) - (x_1[i] + x_2[i]) \log 2 \Big]    (2.22)

In [82], the GLR-based patch similarity criterion was also compared with six other criteria for non-local patch-based denoising of images with Poisson noise. It was found that using GLR led to the best denoising results when the noise is strong [82]. When the noise was not strong, the results showed that it was better to use a variance-stabilization transform to convert the Poisson noise into Gaussian noise and then use the Euclidean distance. The algorithm used in [82] for non-local filtering is as follows:

\hat{x}_0(i) = \sum_{j=1}^{N} \frac{L_G(x[j], x[i])^{1/h}}{\sum_{j=1}^{N} L_G(x[j], x[i])^{1/h}} \, x(j)    (2.23)

This algorithm includes the parameter h instead of the kernel bandwidth a in Equation (2.14).

Another nonlocal patch-based denoising algorithm for Poisson noise was suggested in [81]. A main feature of this algorithm is that the patch similarity weights are computed from the original noisy image as well as from a pre-filtered image:

\hat{x}_0(i) = \sum_{j=1}^{N} \frac{w_{i,j}}{\sum_{j=1}^{N} w_{i,j}} \, x(j), \quad \text{where } w_{i,j} = \exp\left( -\frac{u_{i,j}}{\alpha} - \frac{v_{i,j}}{\beta} \right)    (2.24)

where u_{i,j} are computed from the noisy image using a likelihood ratio principle and v_{i,j} are computed from a pre-estimate of the true image using the symmetric Kullback-Leibler divergence. It is shown that the optimal values of the parameters \alpha and \beta can be computed and that this algorithm can achieve state-of-the-art denoising results.
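The log-domain form of (2.22) is straightforward to evaluate. The sketch below assumes NumPy and photon-count patches, and uses the convention 0 log 0 = 0; note that the resulting score is non-positive and equals zero for identical patches, so larger (less negative) values indicate greater similarity.

    import numpy as np

    def glr_similarity(p1, p2):
        # Patch similarity under Poisson noise, Equation (2.22).
        p1 = np.asarray(p1, dtype=float).ravel()
        p2 = np.asarray(p2, dtype=float).ravel()
        def xlogx(t):
            # t * log(t) with the convention 0 * log(0) = 0.
            return np.where(t > 0, t * np.log(np.maximum(t, 1e-300)), 0.0)
        s = p1 + p2
        return np.sum(xlogx(s) - xlogx(p1) - xlogx(p2) - s * np.log(2.0))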
The patch-similarity measure in (2.22) was used to develop a k-medoids denoising algorithm in [44]. The k-medoids algorithm is similar to the k-means algorithm. They differ in that k-means uses the centroid of each cluster as the representative of that cluster, whereas k-medoids uses data points (i.e., examples) as the cluster representatives. Moreover, k-medoids can work with any distance measure, not necessarily the Euclidean distance. It was shown in [44] that the k-medoids algorithm achieved very good Poisson denoising results, outperforming the nonlocal Poisson denoising method of [82] in some tests. The k-medoids algorithm is in fact a special case of the dictionary learning approach; the difference is that in the k-medoids algorithm only one atom participates in the representation of each patch.

The reason why the study in [44] limited itself to using only one atom for the representation of each patch was the difficulty of sparse coding under Poisson noise. Suppose that x_0[i] is the ith patch of the true underlying image and x[i] is the measured patch under Poisson noise. If we wish to recover x_0[i] from x[i] via sparse representation in a dictionary D, we need to solve a problem of the following form [94]:

\hat{\gamma}_i = \arg\min_{\gamma \,:\, \|\gamma\|_0 \le T} \ \mathbf{1}^T D\gamma - x[i]^T \log(D\gamma) \quad \text{subject to: } D\gamma > 0    (2.25)

Having found \hat{\gamma}_i, we will have \hat{x}_0[i] = D\hat{\gamma}_i. The difficulties of solving this problem have been discussed in [94, 115], and greedy sparse coding algorithms have been proposed for it. The authors of [94] then apply their proposed algorithm to denoising of images with Poisson noise. Even though they use a wavelet basis for D, they achieve impressive results.

A true dictionary-learning-based denoising algorithm for images with Poisson noise was suggested in [115]. In that study, a global dictionary is learned from a set of training data. Then, for a given noisy image to be denoised, the algorithm first clusters similar patches. All patches in a cluster are denoised together via simultaneous sparse representation in D. This means that patches that are clustered together are forced to share similar dictionary atoms in their representation. Experiments showed that this method was comparable with or better than competing methods. A somewhat similar approach that also combines the ideas of learned dictionaries and non-local filtering is proposed in [280, 281]. In this approach, k-means clustering is used to group similar image patches. A dictionary is learned for each cluster of similar patches using the Poisson-PCA algorithm [65, 297]. For solving the Poisson-PCA problem, which is also known as exponential PCA, the authors use Newton's method. This algorithm showed good performance under low-count Poisson noise.

2.5 Total variation (TV)

Total variation (TV), which was first proposed in [277] for image denoising and reconstruction, has become one of the most widely used regularization functions in image processing. For a function x(t) defined on the interval [0, 1], it is defined as [278]:

\mathrm{TV}(x) = \sup \sum_i |x(t_i) - x(t_{i-1})|    (2.26)

where the supremum is computed over all possible partitions of the interval [0, 1]. For a piecewise-constant signal, TV(x) is simply the sum of the magnitudes of the signal jumps. If x(t) is smooth, the following equivalent definition exists:

\mathrm{TV}(x) = \int_0^1 \left| \frac{dx}{dt} \right| dt    (2.27)

For a function x(s, t) of two variables defined on the unit square, the above definition can be extended as:

\mathrm{TV}(x) = \int_0^1 \int_0^1 \left\| \left( \frac{\partial x}{\partial s}, \frac{\partial x}{\partial t} \right) \right\| ds \, dt    (2.28)

Different discretizations have been proposed. Suppose x \in \mathbb{R}^{N \times N} is a 2D image.
A common discretization is [46]:

\mathrm{TV}(x) = \sum_{1 \le i, j \le N} |(\nabla x)_{i,j}|    (2.29)

where

(\nabla x)_{i,j} = \big( (\nabla x)^1_{i,j}, (\nabla x)^2_{i,j} \big), \quad (\nabla x)^1_{i,j} = \begin{cases} x_{i+1,j} - x_{i,j} & i < N \\ 0 & i = N \end{cases}, \quad (\nabla x)^2_{i,j} = \begin{cases} x_{i,j+1} - x_{i,j} & j < N \\ 0 & j = N \end{cases}    (2.30)

and for z = (z_1, z_2) \in \mathbb{R}^2, |z| = \sqrt{z_1^2 + z_2^2}.

Suppose that we obtain measurements y = Ax + w, where, as before, A is some operation or transformation such as blurring, sampling, or forward projection in CT, and w is additive Gaussian noise with uniform variance. The maximum a posteriori estimate of x with a total variation prior P(x) \sim e^{-J(x)} is obtained as:

x_{\mathrm{MAP}} = \arg\min_x \|Ax - y\|_2^2 + \mathrm{TV}(x)    (2.31)

A special case of this problem is the denoising problem shown below, which corresponds to the case where A is the identity matrix:

x_{\mathrm{MAP}} = \arg\min_x \|x - y\|_2^2 + \int_\Omega |\nabla x| \, du    (2.32)

which is usually referred to as the Rudin-Osher-Fatemi (ROF) model for image denoising.

The main properties of TV include convexity, lower semi-continuity, and homogeneity [45]. Many different algorithms have been suggested for solving this problem. Examples of the optimization approaches used to solve it include primal-dual methods [168, 278], second-order cone programming [116], dual formulations [46, 361], split Bregman methods [112, 117], and accelerated proximal gradient methods [17, 237].

In general, TV is a good model for recovering blocky images, i.e., images that consist of piecewise-constant features with sharp edges [278]. Many studies have used TV to successfully accomplish various image processing tasks, including denoising [46], deblurring [332], inpainting [249], restoration [29], and reconstruction [303]. However, on images with fine texture and ramp-like features, this model usually performs poorly [113]. Therefore, many studies have tried to improve or modify the model so that it can be useful for more complicated images. Some of the research directions include employing higher-order differentials [21, 23, 34], locally adaptive formulations that try to identify the type of local image features and adjust the action of the algorithm accordingly [53, 89, 120, 125], and combining TV with other regularizations in order to improve its performance [121, 199].
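The discrete TV of (2.29)-(2.30) and the ROF problem (2.32) can be illustrated as follows, assuming NumPy and scikit-image; the Chambolle-type solver in scikit-image is used here purely as a convenient reference implementation of ROF denoising, and the regularization weight shown is illustrative.

    import numpy as np
    from skimage.restoration import denoise_tv_chambolle

    def discrete_tv(x):
        # Isotropic discrete TV of a 2D image, Equations (2.29)-(2.30).
        dx = np.zeros_like(x, dtype=float)
        dy = np.zeros_like(x, dtype=float)
        dx[:-1, :] = x[1:, :] - x[:-1, :]   # forward differences; last row/column left at zero
        dy[:, :-1] = x[:, 1:] - x[:, :-1]
        return np.sum(np.sqrt(dx ** 2 + dy ** 2))

    def rof_denoise(y, weight=0.1):
        # Approximate minimizer of the ROF model (2.32).
        return denoise_tv_chambolle(y, weight=weight)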
2.6 Published research on sparsity-based methods in CT

This section reviews some of the published research on the application to CT of the sparsity-based models and algorithms described so far in this chapter. We divide these applications into three categories: 1) pre-processing methods, which aim at restoring or denoising the projection measurements, 2) iterative reconstruction methods, and 3) post-processing methods, whose goal is to enhance, restore, denoise, or otherwise improve the quality of the reconstructed image.

2.6.1 Pre-processing methods

Compared with iterative reconstruction methods and post-processing methods, pre-processing methods account for a much smaller share of the published studies on sparsity-based algorithms for CT. There are two main reasons for this. The first is that pre-processing methods for CT face certain inherent difficulties. For example, it is well known that sharp image features are smoothed in the projection domain; therefore, preserving sharp image features and fine details is more challenging when working in the projection domain. Moreover, many commercial scanners do not allow access to the raw projection data, which makes it more difficult to validate pre-processing algorithms and apply them in clinical settings. The second reason is that a great majority of the sparsity-based image processing algorithms have been proposed under the assumption of additive Gaussian noise with uniform variance. As we described in Section 2.4, research on patch-based methods for Poisson noise has been much more limited in extent, and the algorithms that have been proposed for Poisson noise are very recent and have not yet been absorbed by researchers working on CT.

A patch-based sinogram denoising algorithm was proposed in [292]. A fixed DCT dictionary was used for representation of the sinogram patches, but the shrinkage rule used for denoising was learned from the training data. The denoised projections were then used to reconstruct the image with an FBP method. A patch-based processing using learned shrinkage functions was then applied to the reconstructed image. The results of the study showed that this rather simple algorithm outperformed some of the well-known iterative CT reconstruction algorithms.

The use of learned dictionaries for inpainting (i.e., upsampling) of the CT projection measurements has also been proposed [188]. The goal of sinogram upsampling is to reduce the x-ray dose used for imaging by acquiring only a fraction of the projections directly and estimating the unobserved projections through upsampling. The assumption used in this algorithm was that patches extracted from the projections admit a sparse representation in a dictionary that can be learned from a set of training sinograms. The approach followed by this study was very similar to the general inpainting approach explained in Section 2.1.3. The results showed that dictionary-based upsampling of the projections substantially improved the quality of the images reconstructed with FBP, outperforming more traditional sinogram interpolation methods based on splines.

As we mentioned above, a challenge for all sinogram denoising/restoration algorithms is the preservation of fine image detail. The algorithm presented in [291] proposes an interesting idea to address this issue; in fact, this study contains several interesting ideas. One of them is that, in learning a dictionary for sparse representation of sinogram patches, not only the sinogram-domain error but also the error in the image domain is considered. Specifically, a first dictionary (D_1) is learned considering only the error in the sinogram domain. Let us denote the CT image by x and its sinogram by y. Then D_1 is found by solving:

\{D_1, \hat{\Gamma}\} = \arg\min_{D, \Gamma} \|\Gamma\|_0 \quad \text{subject to: } \|D\Gamma_i - R_i y\|_2^2 \le C\sigma_i \ \ \forall i    (2.33)

This optimization to find D_1 is carried out using the K-SVD algorithm described in Section 2.1. The only difference here is that the signal-dependent nature of the noise, \sigma_i, should be taken into account in the sparse coding step (C is a tuning parameter). This dictionary is then further optimized by minimizing the reconstruction error in the image domain:

D_2 = \arg\min_D \Big\| \mathrm{FBP}\Big( \big( \textstyle\sum_i R_i^T R_i \big)^{-1} \textstyle\sum_i R_i^T D \hat{\Gamma}_i \Big) - x \Big\|_{Q,2}^2    (2.34)

where we have used FBP to denote the CT reconstruction algorithm (here, filtered back-projection). Note that the \hat{\Gamma} in the optimization problem (2.34) is the one found by solving (2.33). In other words, to find D_2 we keep the sparse representations fixed and find a dictionary that leads to a better reconstruction of the image, x. The notation \|\cdot\|_{Q,2} denotes a weighted \ell_2 norm.
It is suggested that the weights Q be chosen such that more weight is given to low-contrast features [291].

The x and y in the above equations denote the "training data", which includes a set of high-quality images and their projections. In fact, instead of only one image, a large number of images can be used for better training. Now, suppose that we are given noisy projections of a new object/patient, which we denote by y_{\mathrm{noisy}}. It is suggested to denoise y_{\mathrm{noisy}} in two steps. First, the sparse representations of the patches of y_{\mathrm{noisy}} in D_1 are obtained. Denoting these by \hat{\Gamma}, the final denoised sinogram is obtained as the solution of the following problem, which uses D_2:

y_{\mathrm{denoised}} = \arg\min_y \lambda \|y - y_{\mathrm{noisy}}\|_W^2 + \sum_i \|D_2 \hat{\Gamma}_i - R_i y\|_2^2    (2.35)

where W are weights that account for the signal-dependent nature of the noise. This problem has a simple solution similar to Equation (2.7). Experiments for 2D CT have shown promising results [291].

2.6.2 Iterative reconstruction methods

In recent years, several iterative image reconstruction algorithms involving regularizations in terms of image patches have been proposed for CT. In general, these algorithms have reported very promising results. However, a convincing comparison of these algorithms with other classes of iterative reconstruction algorithms, such as those based on TV or other edge-preserving regularizations, is still lacking. In this section, we review some of the iterative CT reconstruction algorithms that use patch-based or TV regularization.

A typical example of dictionary-based CT reconstruction algorithms is the algorithm proposed in [339]. That paper suggested recovering the image as the solution of the following optimization problem:

\min_{x, D, \Gamma} \sum_i w_i \big( [Ax]_i - y_i \big)^2 + \lambda \Big( \sum_k \|R_k x - D\Gamma_k\|_2^2 + \nu_k \|\Gamma_k\|_0 \Big)    (2.36)

In the above problem, A is the projection matrix and the w_i are noise-dependent weights. The first term in the objective function encourages measurement consistency. The remaining terms constitute the regularization and are very similar to the terms in the formulation of the basic dictionary learning problem in (2.1). In (2.36), the dictionary is learned from the image itself. The authors of [339] solved this problem by alternating minimization with respect to the three variables. Minimization with respect to x is carried out using the separable paraboloid surrogate method suggested in [101]. The problem with this approach, however, is that it requires access to the individual elements of the projection matrix. Although this is a simple requirement for 2D CT, it can be a major problem for large 3D CT because, with efficient implementations of the forward- and back-projection operations, it is not convenient to access individual matrix elements [152, 193]. Minimization with respect to D and \Gamma is performed using the K-SVD and OMP algorithms, respectively. Alternatively, the dictionary can be learned in advance from a set of training images. This removes D from the list of optimization variables in (2.36), substantially simplifying the problem. Both approaches are presented in [339]. Experiments showed that both approaches led to very good reconstructions, outperforming a TV-based algorithm.

Other formulations very similar to the one described above have been shown to be superior to TV-based reconstruction and other standard iterative reconstruction algorithms in electron tomography [4, 192].
Another study first learned a dictionary from training images, but for the image reconstruction step it did not include the sparsity term in the objective function [105]. In other words, only the first two terms of the objective function in Equation (2.36) were considered, and a gradient descent approach was used to solve the problem. That study obtained superior reconstructions with learned dictionaries compared with a DCT basis.

One study used an optimization approach similar to the one described above, but used box splines for image representation [279]. In other words, instead of the native pixel representation of the image, box splines were used as the basis functions in the image domain. The unknown image x then has a representation of the form x = \sum_i c_i \phi_i, where \phi_i is the box spline centered on the ith pixel and c_i is the value of the attenuation coefficient for that pixel. The resulting optimization problem is of the following form:

\min_{c, \Gamma} \|Hc - y\|_W^2 + \lambda \Big( \sum_k \|R_k c - D\Gamma_k\|_2^2 + \nu_k \|\Gamma_k\|_0 \Big)    (2.37)

In the above problem, H is the forward model relating the image representation coefficients to the sinogram measurements, y; in other words, H is simply the equivalent of the projection matrix A. The rest of the objective function is the same as in Equation (2.36). Once the representation coefficients c are found by solving (2.37), the image is reconstructed simply as x = \sum_i c_i \phi_i. The results of this study showed that this dictionary-based algorithm achieved much better reconstructions than a wavelet-based reconstruction algorithm.

The dual-dictionary methods proposed in [197, 355] rely on two dictionaries. One of the dictionaries (D_l) is composed of patches from CT images reconstructed from a small number of projection views, while the second dictionary (D_h) contains the corresponding patches from a high-quality image. The atoms of the two dictionaries have a one-to-one correspondence. The strategy is to first find the sparse code of the patches of the image to be reconstructed in D_l and then recover a good estimate of each patch by multiplying this sparse code with D_h. The dictionaries are not learned here; rather, they are built by sampling a large number of patches from few-view and high-quality training images. This approach has been reported to achieve better results than TV-based reconstruction algorithms [197].

A different dictionary-based reconstruction algorithm was suggested in [301]. In this algorithm, a dictionary (D) is first learned by solving a problem of the following form:

\min_{D, \Gamma} \|X - D\Gamma\|_F^2 + \lambda \|\Gamma\|_1 \quad \text{subject to: } D \in \mathcal{D} \ \text{and} \ \Gamma \in \mathbb{R}^{+}    (2.38)

where \mathcal{D} can be an \ell_2 ball and \mathbb{R}^{+} is the non-negative orthant of the proper size. The above problem is solved using the Alternating Direction Method of Multipliers (ADMM) to find the dictionary. It is reported that learning the dictionary with ADMM is computationally very efficient and largely independent of the initialization. The learned dictionary is then used to regularize the reconstruction by requiring that the patches of the reconstructed image have a sparse representation in the dictionary. However, unlike most other dictionary-based algorithms, overlapping patches are not used. Instead, a novel regularization term is introduced to avoid blocking artifacts at the patch borders.
Specifically, the optimization problem to recover the image from the projection measurements y has this form:

\min_{x_\Gamma} \|A x_\Gamma - y\|_2^2 + \lambda \|\Gamma\|_1 + \mu \|L x_\Gamma\|_2^2    (2.39)

where, to simplify the notation, we have used x_\Gamma to emphasize that the reconstructed image depends on the sparse representation matrix, \Gamma. The matrix L computes the directional derivatives across the patch boundaries. Therefore, the role of the last term in the objective function is to penalize large jumps at the patch boundaries, thereby suppressing the blocking artifacts that arise when non-overlapping patches are used. Comparison with TV-based reconstruction showed that this dictionary-based reconstruction algorithm resulted in much better images, preserving fine textural details that are smeared by TV-based reconstruction. Overall, the algorithm proposed in that paper contains several interesting ideas that can be useful for designing dictionary-based reconstruction algorithms for CT. A later paper studied the sensitivity of this algorithm to factors such as the scale and rotation of features in the training data [300].

An iterative reconstruction algorithm that combines sparse representation of image patches with sinogram smoothing was proposed in [305]. The image is reconstructed as the solution of the following optimization problem:

\min_{x, y, \Gamma} \|y - \bar{y}\| + \alpha \, y^T W y + \beta \|Ax - y\|_2^2 + \lambda \Big( \sum_k \|R_k x - D\Gamma_k\|_2^2 + \nu_k \|\Gamma_k\|_0 \Big)    (2.40)

The first two terms, where \bar{y} is the measured noisy sinogram, represent the sinogram Markov random field model [189, 324]. The remaining terms are similar to those we encountered above. As usual, the authors suggest solving the above problem using a block-coordinate minimization, where the minimization with respect to the image x is carried out using a conjugate gradient method. That study also suggests interesting variations of the objective function in (2.40), but the experimental evaluations presented are very limited.

As the last example of dictionary-based iterative reconstruction algorithms, we mention the method based on sparsifying transforms proposed in [256, 257]. Sparsifying transforms are variations of the analysis model for sparsity [268, 331]. In the analysis model, instead of the relation x = D\gamma that we have discussed so far in this chapter, the relation Dx = \gamma is used. In other words, D here acts as an operator on the signal (e.g., the image patch) to produce the representation coefficients, \gamma. In [256, 257], it is suggested that the unknown CT image be recovered as the solution of the following optimization problem:

\min_{x, D, \Gamma} \sum_i \|D R_i x - \Gamma_i\|_2^2 + \lambda \|\Gamma\|_1 + \alpha H(D) \quad \text{subject to: } \|Ax - y\|_W^2 \le \epsilon    (2.41)

where H(D) is a regularization on the dictionary D, and W represents weights introduced to account for the signal-dependent noise variance. The results of that study showed that this approach led to results comparable with iterative reconstruction using the synthesis formulation and TV-based regularization, while also being slightly faster.

In recent years, there has also been growing attention to the potential of regularization in terms of non-local patch priors for iterative CT reconstruction. In [194], it was suggested to recover the CT image as the solution of the following optimization problem:

\hat{x} = \arg\min_x \|y - Ax\|_2^2 + \lambda J_{NL}(x)    (2.42)

where J_{NL}(x) is the regularization in terms of patch similarities.
Two different forms were suggested for J_{NL}(x):

J_{NL/TV}(x) = \sum_i \sum_{j \in N_i} \big\| \sqrt{w_{i,j}} \, (x(i) - x(j)) \big\|_2, \qquad J_{NL/H1}(x) = \sum_i \sum_{j \in N_i} w_{i,j} \, \|x(i) - x(j)\|_2^2    (2.43)

where w_{i,j} are the patch-based similarity weights. For the ith pixel, they are computed from all pixels j in a window around i using:

w_{i,j} = \exp\left( -\frac{\|x[i] - x[j]\|}{h^2} \right)    (2.44)

It is suggested that these weights be computed from an FBP-reconstructed image and that the filter parameter h be chosen based on a local estimate of the noise variance. The local noise variance is estimated from the wavelet coefficients of the finest wavelet subband (v_i) according to [90]:

h = \frac{\mathrm{median}(|v_i|)}{0.6745}    (2.45)

The authors of [194] solved the problem (2.42), with either of the regularization functions in (2.43), using a simple gradient descent and found that the recovered CT image had better visual and objective quality than a standard TV-based iterative reconstruction algorithm.
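The estimate in (2.45) is the familiar median-absolute-deviation rule applied to the finest-scale wavelet detail coefficients. A sketch assuming the PyWavelets package and a single-level Haar transform (the choice of wavelet and of the diagonal detail subband is illustrative):

    import numpy as np
    import pywt

    def estimate_h(image):
        # Equation (2.45): h = median(|v_i|) / 0.6745, with v_i the finest-scale detail coefficients.
        _, (_, _, cD) = pywt.dwt2(image, 'haar')
        return np.median(np.abs(cD)) / 0.6745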
A simple iterative algorithm that alternates between projections onto convex sets (POCS), to improve measurement consistency, and an NLM-type restoration has been proposed in [133]. That algorithm was shown to perform better than a TV-based algorithm, but no comparison with state-of-the-art methods was performed. Another study developed an NLM-type regularization for perfusion CT that relies on a high-quality prior image [202]. The proposed regularization function, shown in the following equation, is in terms of the similarity between the patches of the unknown image to be reconstructed from a low-dose scan (x) and the patches of the prior image (x_p):

J(x) = \|x - \bar{x}\|_q^q \quad \text{where} \quad \bar{x}(i) = \sum_{j \in N_i} w_{i,j}\, x_p(j) \qquad (2.46)

The authors suggest q = 1.2. A steepest-descent approach is used to approximately solve this problem. A similar, but more general, algorithm that does not require a prior image was proposed in [353]. The formulation is the same as the above, the main difference being that the weights in the NLM formulation are computed from the image itself. A Gauss-Seidel approach is used to solve the resulting problem. Both of the above NLM-type regularization methods are reported to result in better reconstructions than more conventional regularizations such as Gaussian Markov random fields.

Non-local patch-based regularization was also used for the new technique of equally-sloped tomography (EST, [227]) and was shown to improve the quality of the reconstructed image from both small and large numbers of projections [106]. Nonlocal patch-based regularization substantially improved the CNR, SNR, and spatial resolution of the images reconstructed from 60, 90, and 360 projections in that study.

Patch-based iterative reconstruction algorithms have also been proposed for dynamic CT. In dynamic CT, several successive images of the same patient are reconstructed. Therefore, there is abundant temporal correlation (i.e., correlation between successive images) in addition to the spatial correlation within each of the images in the sequence. There have been several studies in recent years aimed at exploiting these correlations in terms of patch/block similarities. In general, these studies have reported very promising results.

A reconstruction algorithm with nonlocal patch-based regularization was proposed for dynamic CT in [155]. The proposed regularizer for the kth frame of the image is as follows:

J(x_k) = \sum_i \sum_{j \in N_i} G_a(x_k[i] - x_k[j])\, |x_k(i) - x_k(j)|^2 + \sum_i \sum_{l \in \{1,2,\ldots,K\} \setminus k} \sum_{j \in \Delta_i} G_a(x_l[i] - x_l[j])\, |x_l(i) - x_l(j)|^2 \qquad (2.47)

where, as before, x(i) and x[i] denote the ith image pixel and the patch centered on that pixel, respectively. G_a(·) is a Gaussian kernel as in the standard NLM denoising. The first term is a spatial regularization in terms of the patches of the current image frame, x_k. In this term, N_i is a simple rectangular neighborhood around the ith pixel. The second term (where K is the total number of frames) is a temporal patch-based regularization that involves patches from all other frames in the image sequence. In this term, Δ_i is a neighborhood whose spatial size is pre-fixed but whose temporal extension is found for each pixel such that the probability of finding patches with similar structural features (e.g., edges) is increased. This is done by dividing the temporal neighborhood into blocks and estimating the structural similarity of these blocks with the patch centered on the ith pixel. Only a small fraction of the blocks that are most similar to x[i] are included in Δ_i. A similar approach was proposed in [154] for the case when a high-quality prior image is available. This high-quality prior image does not have to be a CT image and can be acquired with other imaging modalities. The results of experiments with simulated and real data show that this algorithm achieves very good reconstructions.

Temporal non-local-means (TNLM) algorithms were proposed in [146, 314]. These algorithms recover a set of successive CT images {x_k | k ∈ 1 : K} by minimizing an optimization problem that includes (in addition to the measurement fidelity term) the following regularization:

J(\{x_k\}) = \sum_{k=1}^{K} \sum_i \sum_j w_{i,j}\, (x_k(i) - x_{k+1}(j))^2 \qquad (2.48)

where, as usual, the weights are computed based on patch similarities:

w_{i,j} = \frac{1}{Z} \exp\!\left( -\frac{\|x[i] - x[j]\|_2^2}{h^2} \right) \qquad (2.49)

An important choice in this algorithm is that only inter-image patch similarities are taken into account, and not the intra-image patch similarities. The justification is that the proposed algorithm is intended for the case where each image in the sequence is reconstructed from a small number of projections and, hence, contains severe streak artifacts. Therefore, using patches from the same image would amplify the streak artifacts, whereas using patches from neighboring images suppresses them. In addition to the iterative reconstruction algorithm, another very similar algorithm is suggested in [146] that can also be classified as a post-processing algorithm. In this alternative scheme, each image in the sequence is first reconstructed from its corresponding projections, and the images are then post-processed using an optimization algorithm that includes the very same regularization function in (2.48).

A tensor-based iterative reconstruction algorithm was proposed for dynamic CT in [308]. Tensor-based dictionaries are a relatively new type of dictionary that is gaining popularity. As we have mentioned above, in image processing applications, image patches/blocks are vectorized and used as training/test signals. Tensor-based methods treat the image patches or blocks in their original form, i.e., without vectorizing them [40, 62]. Therefore, they are expected to better exploit the correlation between adjacent pixels.
In [308], a tensor-based algorithm was compared with a standard dictionary for dynamic CT reconstruction, and it was found that the tensor-based dictionaries resulted in a slight improvement in the quality of the reconstructed image.

Compared with the reconstruction algorithms that are based on learned dictionaries or nonlocal patch similarities, many more algorithms have used TV regularization terms. This is partly because TV-regularized cost functions are easier to handle using standard optimization algorithms, especially for large-scale 3D image reconstruction. Moreover, the CT community is more familiar with TV-based regularization because it has been used for CT reconstruction for a longer time. Many studies have formulated the reconstruction problem as a regularized least-squares minimization similar to (2.31). Some of the optimization techniques that have been suggested for solving this problem include accelerated first-order methods [61, 143, 251], the alternating direction method of multipliers [264], and the forward-backward splitting algorithm [145]. Another very commonly used formulation for CT reconstruction is the constrained optimization formulation, where the image TV is minimized under measurement consistency constraints [126, 242, 270]. Most published studies use an alternating algorithm for solving this problem, whereby at each iteration the image TV is first reduced, and this is then followed by a step that enforces the measurement consistency constraint. A simple (and probably inefficient [45]) method that has been adopted in many studies uses steepest descent for TV minimization followed by projection onto convex sets for measurement consistency [293].

Several studies have combined the TV regularization with regularization in terms of a prior high-quality image in applications such as dynamic CT [22, 123], perfusion imaging [239], and respiratory-gated CT [180, 302]. In general, the existence of a high-quality prior image reduces the number of projection measurements required for reconstructing high-quality images from subsequent scans. Other variations of the standard TV regularization that have been successfully applied for CT reconstruction include non-convex TV [49, 294] and higher-order TV [346].

In general, TV-based reconstruction methods have proven to be much better than traditional CT reconstruction algorithms, particularly in reconstruction from few-view and noisy projection data. Therefore, many studies have concluded that TV-based reconstruction methods have a great potential for dose reduction in a wide range of CT applications [28, 162, 309, 342]. However, there has been no satisfactory comparison between TV and other edge-preserving or smoothness-promoting regularization functions that are very widely used in CT [32, 80, 160, 325].

2.6.3 Post-processing methods

Many of the sparsity-based algorithms that have been proposed for CT fall into the category of post-processing methods. This is partly because most of these algorithms are directly based on the sparsity-based methods that have been proposed for natural images. Because general sparsity-based image processing algorithms mostly include denoising and restoration algorithms, they are more easily extended as post-processing methods for CT. Moreover, some of the sparsity-based methods, particularly patch-based image processing methods, are very computationally expensive.
Therefore, especially for large-scale 3D CT, it is easier to deploy them as one-shot post-processing algorithms than as part of an iterative reconstruction algorithm.

A large number of dictionary-based algorithms have been proposed for CT denoising. The basic denoising algorithm that we described in Section 2.1.3 was used for denoising of abdomen CT images in [58, 60] and of head CT images in [59], and showed promising results in all these studies. Straightforward representation of image patches in a learned dictionary followed by weighted averaging resulted in effective suppression of noise and artifacts and a marked improvement in the visual and objective image quality.

Non-local means methods have also been applied for CT image denoising. An early example is [156]. In that study, the authors investigated the effect of different parameters such as the patch size, the smoothing strength, and the size of the search window around the current pixel used to find similar patches. Among the findings of that study with lung and abdomen CT images was that one can choose the size of the search window for finding similar patches to be as small as 25 × 25 pixels and still achieve very impressive denoising results. However, this required careful tuning of the denoising parameter (a in Equation (2.15)). Moreover, choosing a small search window also required reducing the patch size to ensure that for every pixel a sufficient number of similar patches is found in the search window. Otherwise, in certain image areas such as around the edges, very little denoising is accomplished. Another study found that with basic NLM denoising, the x-ray tube current setting can be reduced to one fifth of that used in routine abdominal CT imaging without jeopardizing the image quality [56].

An algorithm specially tailored to image-guided radiotherapy was proposed in [343]. Since in this scenario a patient is scanned multiple times, it was suggested that the first scan be performed with standard dose and later scans with much reduced dose. An NLM-type algorithm was suggested to reduce the noise in the low-dose images. The proposed algorithm denoised the low-dose images by finding similar patches in the image reconstructed from the standard-dose scan. Similarly, in CT perfusion imaging and angiography the same patient is scanned multiple times. A modified NLM algorithm was suggested for these imaging scenarios in [200]. The algorithm proposed in that study registered a standard-dose prior image to the low-dose image at hand. The low-dose image is then denoised using an NLM algorithm where patches are extracted from the registered standard-dose image.

One study suggested adapting the strength of the NLM denoising based on the estimated local noise level [191]. That paper proposed a fast method for approximating the noise level in the reconstructed image and suggested choosing the bandwidth of the Gaussian kernel in the NLM denoising to be proportional to the estimated standard deviation of the noise. Evaluations showed that this algorithm effectively suppressed the noise without degrading the spatial resolution. Using speed-up techniques such as those in [73], this algorithm was able to process large 3D images in a few minutes when implemented on a GPU.

Applying the nonlocal patch-based denoising methods in a spatially adaptive fashion has been proposed by many studies on natural images [157, 158]. For CT images, it is well known that the noise variance in the reconstructed image can vary significantly across the image.
Therefore, estimating the local noise variance may improve the performance of the patch-based denoising methods. Another approach for estimating the local noise variance in the CT image was proposed in [15]. In this approach, which is much simpler than the method proposed in [191], even- and odd-numbered projections are used to reconstruct two images. Then, assuming the noise in the projections is uncorrelated, the local noise variance is estimated from the difference of the two images.

So far in this section, we have discussed algorithms that have been suggested primarily for removing the noise. However, CT images can also be marred by various types of artifacts that can significantly reduce their diagnostic value [14]. Recently, a few patch-based algorithms have been proposed specifically for suppressing these artifacts. A dictionary-based algorithm for suppressing streak artifacts in CT images is proposed in [57]. The artifact-full image is first decomposed into its high-frequency bands in the horizontal, vertical, and diagonal directions. Sparse representations of the patches of each of these bands are computed in three “discriminative” dictionaries that include atoms specifically learned to represent artifacts and genuine image features. Artifacts are suppressed by simply setting to zero the large coefficients that correspond to the artifact atoms. The results of this study on artifact-full CT images are impressive.

A nonlocal patch-based artifact reduction method was suggested in [341]. This method is tailored for suppressing the streak artifacts that arise when the number of projections used for image reconstruction is small, and it relies on the existence of a high-quality prior image. The few-view image that is marred by artifacts is first registered to the high-quality reference image using a registration algorithm that uses SIFT features [196]. The registered reference image is then used to simulate an artifact-full few-view image. To remove the streak artifacts from the current image, its patches are matched with the simulated artifact-full image, but the corresponding high-quality patches from the reference image are then used to build the target image. This algorithm is further extended in [340] to the case when a prior scan from the same patient is not available but a rich database of scans from a large number of patients exists. The results of both these studies on real CT images of the human head and lung are very good. Both methods substantially reduced the streaking artifacts in images reconstructed from fewer than 100 projections.

A major challenge facing the application of patch-based algorithms to large 3D CT images is the computational time. Although we discuss this challenge here under the post-processing methods, it applies equally to pre-processing methods and is indeed even more relevant to iterative reconstruction algorithms. Of course, one obvious approach to reducing the computational load is to work with 2D patches instead of 3D blocks. However, this will likely hurt the algorithm performance because the voxel correlations in the third dimension are not exploited. Three studies have reported that, compared with 2D denoising, 3D denoising of CT images leads to an improvement in PSNR of approximately 1 to 4 dB [186, 187, 274]. Another study used 2D patches to denoise the slices in 3D CT images, but patches were used from neighboring slices in addition to patches from the same slice [156].
They found that this approach increased the PSNR by more than 4 dB. Another obvious solution is to use faster hardware such as GPUs. This option has been explored in many studies. For instance, implementation of an NLM-type algorithm on a GPU reduced the computational time by a factor of 35 in one study [191]. Iterative reconstruction algorithms with non-local patch-based regularization terms have also been implemented on GPUs [146, 314]. Another remarkable example was shown in [15], where the authors implemented the K-SVD algorithm for CT denoising on the Cell Broadband Engine Architecture and achieved speedup factors between 16 and 225 compared with its implementation on a CPU.

There have also been many algorithmic approaches to reducing the computational time. An ingenious and highly efficient method to address this challenge was proposed in [274]. This method, which is named “double sparsity”, is based on the observation that the learned dictionary atoms themselves have a sparse representation in a standard basis, such as the DCT. The authors suggest a dictionary structure of the form D = ΦA, where Φ is a basis with a fast implicit implementation and A is a sparse matrix. They show that this dictionary can be efficiently learned using an algorithm similar to the K-SVD algorithm. Denoising of 3D CT images with this dictionary structure leads to speed-up factors of around 30, while also improving the denoising performance. A relatively similar idea is the separable dictionary proposed in [127], where the dictionary to be learned from data is assumed to be the Kronecker product of two smaller dictionaries. By reducing the complexity of sparse coding from O(n) to O(√n), this dictionary model allows much larger patch/block sizes to be used, or alternatively, it results in significant speedups for equal patch size. A two-level dictionary structure was proposed in [186]. In this method, the learned dictionary atoms are clustered using a k-means algorithm that employs the coherence as the distance measure. For sparse coding of a test patch, a greedy algorithm is used to select the most likely atoms, which are then used to obtain the sparse representation of the patch. Another study used the coherence of the dictionary atoms to learn a dictionary on a graph and reported very good results in 3D CT denoising [190].

For dictionary-based methods, the most computationally demanding part of the algorithm, during both dictionary learning and usage, is the sparse coding step. As we mentioned above, the image is usually divided into overlapping patches/blocks, and the sparse representation of each patch/block in the dictionary has to be computed at least once (more than once if the algorithm is iterative). If the dictionary has no structure, which is the general case for overcomplete learned dictionaries, the sparse coding of each patch will require solving a small optimization problem. This is computationally demanding, especially when the number and size of these patches/blocks are large, such as in 3D CT. In recent years, many algorithms have been suggested for sparse coding of large signals in unstructured dictionaries. Some of these algorithms are basically faster implementations of traditional sparse coding algorithms [169, 275], while others are based on more novel ideas [35, 122, 179, 334]. Some of these methods have achieved several orders of magnitude speedups [35, 122].
A description of these algorithms is beyond the scope of this manuscript, but the computational edge that they offer makes patch-based methods more appealing for large-scale CT imaging.

For the NLM algorithms, the major computational bottleneck is the search for similar patches. We described some of the state-of-the-art methods for reducing the computational load of the patch search in Section 2.2. There has been little published research on how these techniques may work on CT images. One study has applied the method of integral images [73] to CT images. The same study reported that if the smoothing strength is properly adjusted, a very small search window and a very small patch size can be used to obtain significant savings in computation.

2.7 Final remarks

Sparsity-based models have long been used in digital image processing. Recently, learned overcomplete dictionaries have been shown to lead to better results than analytical dictionaries such as wavelets in almost all image processing tasks. Nonlocal patch similarities have also proven to be extremely useful in many image processing applications. Algorithms based on nonlocal patch similarities are considered to be the state of the art in important applications such as denoising. The practical utility of patch-based models has been demonstrated by hundreds of studies in recent years, many of which have been conducted on medical images. The use of learned overcomplete dictionaries for sparse representation of image patches and the use of nonlocal patch similarities are at the core of much of the ongoing research in the field of image processing.

The published studies on the application of these methods for reconstruction and processing of CT images have reported very good results. However, the amount of research on the application of these methods in CT has been far less than that on natural images. Any reader who is familiar with the challenges of reconstruction and processing of CT images will acknowledge that there is an immense potential for these methods to improve the current state-of-the-art algorithms in CT.

In terms of pre-processing algorithms, there have been only a couple of published papers on patch-based algorithms. This is partly due to the fact that most patch-based models and algorithms were originally proposed for the case of uniform Gaussian noise. For example, greedy sparse coding algorithms, which form a central component of methods that use learned overcomplete dictionaries, have been proposed for the case of Gaussian noise. As we mentioned in Section 2.4, only recently have similar methods for the case of Poisson noise started to appear. Nonetheless, even with the current tools, patch-based models can serve as useful tools for developing powerful pre-processing algorithms for CT. Some of the patch-based methods that we reviewed in Section 2.4 have been applied to very noisy images (i.e., very low-count Poisson noise) and have achieved impressive results. This can be extremely useful for low-dose CT, which is of special importance in clinical settings.

Iterative CT reconstruction algorithms that have used TV or patch-based regularization terms have reported very promising results. One can say that the published works have already demonstrated the usefulness of patch-based methods for CT reconstruction. However, many of the proposed algorithms have been applied only on 2D images.
In some cases it is not clear whether a proposed algorithm can be applied to large 3D reconstruction, where the efficient implementations of the forward- and back-projection operations limit the type of iterative algorithm that can be employed. Moreover, little is known about the robustness of these algorithms with respect to the trained dictionary. As we mentioned in Section 2.1.2, the dictionary learning problem is non-convex and, hence, dictionary learning algorithms are not supported by strong theoretical guarantees.

Post-processing accounts for the largest share of the published papers on the application of patch-based methods in CT. Both denoising and restoration (e.g., artifact removal) algorithms have been proposed. Most of these papers have reported good results, even though many of them have used algorithms that were originally proposed for natural images with little modification. Therefore, it is likely that much better results could be achieved by designing dedicated algorithms for CT. In fact, CT images, especially those reconstructed from low-dose scans, present unique challenges. Specifically, these images are contaminated by very strong noise with a non-uniform and unknown distribution. Moreover, they are also marred by various types of artifacts. This situation calls for carefully devised algorithms that are tailored for CT images. Although this can be challenging, the success of patch-based methods on natural images can be taken as a strong indication of their potential to tackle these challenges. Patch-based methods have led to the best available denoising algorithms. Moreover, they have been successfully used for suppressing various types of artifacts and anomalies in natural images and videos. Therefore, they are likely to achieve state-of-the-art denoising and restoration results in CT.

In conclusion, this review of the literature shows that sparsity-based and patch-based methods have a great potential to improve the current image reconstruction and image processing algorithms in CT. With an ever increasing usage of CT in clinical applications, it is necessary to reduce the radiation dose used for imaging so that CT can be used to its full potential. Meanwhile, the increased computational power of modern computers makes it possible to use more sophisticated algorithms for image reconstruction and processing. Therefore, the methods reviewed in this chapter can play a key role in solving some of the major challenges facing CT.

Chapter 3
Sinogram Denoising via Simultaneous Sparse Representation in Learned Dictionaries

3.1 Introduction

In this chapter, we propose a novel algorithm for denoising the projection measurements in 3D CBCT. The noise model that we use in this chapter is the Gaussian model described in Section 1.2. Specifically, if we denote the line integral of the linear attenuation coefficient by y_i = \int_i \mu\, ds, then:

y_i \sim \mathcal{N}(\bar{y}_i, \sigma_i^2) \quad \text{where} \quad \sigma_i^2 = f_i \exp(\bar{y}_i) \qquad (3.1)

where f_i is a factor that depends mainly on the effect of bowtie filtration. The value of f_i does not depend on the object being imaged and can easily be estimated from data [326].

Because of the nature of the projection measurements in CBCT, there is a strong correlation between neighboring pixels within a projection. There is also a strong correlation between the values of a pixel in neighboring projections. In order to exploit both of these correlations, we stack the projections together to form a large 3D image, as shown in Figure 3.1.
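As a simple illustration of the noise model in (3.1) and of this stacking, the sketch below generates noisy line integrals from noise-free ones and arranges the projections into a single 3D array. The array sizes and the constant value of f_i are placeholder assumptions chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dimensions and noise parameter (illustrative only).
n_u, n_v, n_theta = 64, 48, 90        # detector size and number of projections
f_i = 1e-5                             # factor f_i in (3.1), assumed constant here

# Noise-free line integrals y_bar for each projection (synthetic data for the example).
y_bar = 2.0 * rng.random((n_theta, n_u, n_v))

# Noisy measurements according to (3.1): y_i ~ N(y_bar_i, f_i * exp(y_bar_i)).
sigma = np.sqrt(f_i * np.exp(y_bar))
y_noisy = y_bar + sigma * rng.standard_normal(y_bar.shape)

# Stack the projections into one n_u x n_v x n_theta volume, as in Figure 3.1.
y = np.transpose(y_noisy, (1, 2, 0))
print(y.shape)   # (64, 48, 90)
```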
We assume that each projection is of size n_u × n_v pixels and that n_θ equally spaced projections are acquired. The algorithm that we propose extracts small blocks of m^3 pixels for processing. Throughout this chapter we use m = 8, which is a common patch/block size for dictionary-based image processing. We denote the whole 3D image of size n_u × n_v × n_θ by y. Blocks of size m^3 extracted from y are vectorized and stacked in a matrix denoted by Y ∈ R^{m^3 × n}, where n is the number of blocks. We will use Y_i to denote the ith vectorized block, which is the ith column of Y.

Figure 3.1: A schematic representation of the cone beam CT (left) and the stacked projections (right).

We described the method of image denoising with learned dictionaries in Section 2.1.3 and the method of NLM denoising in Section 2.2. We mentioned that both of these methods have been shown to be effective denoising methods. Therefore, the algorithm proposed in this chapter combines the advantages of the two. As in the NLM algorithm, we find similar blocks and process them together in order to exploit the abundant self-similarities that exist in CT projections. However, unlike NLM, which uses weighted averaging, and unlike the BM4D algorithm ([71, 205], briefly described in Section 2.3), which uses thresholding in standard bases, we perform simultaneous sparse coding of the similar blocks in a learned dictionary. In this sense, the proposed algorithm is similar to the non-local sparse model proposed in [214].

In brief, using training data we first learn a dictionary for simultaneous sparse coding of similar blocks. Given a new set of noisy projections, we group the similar blocks and denoise them via simultaneous sparse coding in the learned dictionary. The main intuition behind our proposed algorithm is that similar blocks must have a jointly sparse representation in the learned dictionary. In other words, the sparse representations of similar blocks must use similar atoms of the learned dictionary.
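Before describing the algorithm, a minimal sketch of forming the block matrix Y defined above is given below. Non-overlapping blocks are obtained with step = m; the algorithm itself uses overlapping blocks, which corresponds to choosing a smaller step.

```python
import numpy as np

def blocks_to_matrix(y, m=8, step=8):
    """Extract m x m x m blocks from the stacked projections y (n_u x n_v x n_theta)
    and return them, vectorized, as the columns of Y in R^{m^3 x n}."""
    n_u, n_v, n_t = y.shape
    cols = []
    for a in range(0, n_u - m + 1, step):
        for b in range(0, n_v - m + 1, step):
            for c in range(0, n_t - m + 1, step):
                cols.append(y[a:a + m, b:b + m, c:c + m].reshape(-1))
    return np.stack(cols, axis=1)     # shape (m**3, n)
```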
3.2 The proposed algorithm

3.2.1 Clustering

As mentioned above, the algorithm proposed in this chapter first learns a dictionary for simultaneous sparse representation of similar blocks. Then, this dictionary is used for denoising. Since both of these steps involve clustering of similar blocks, we start by explaining our clustering algorithm.

In the NLM algorithm and similar algorithms such as BM4D, denoising of the ith pixel requires searching the entire image or a sizable neighborhood around this pixel to find blocks that resemble the block centered on this pixel. Since this search has to be performed for every block, it is computationally very intensive. In the algorithm proposed in this chapter, we follow an approach that is much less computationally demanding. Specifically, we divide the stacked projections into overlapping blocks and cluster all these blocks only once.

Consider a particular block, Y_i, and suppose that we would like to decide whether block Y_j is similar to Y_i so that they can be clustered together. We make two simplifying assumptions: (1) the value of the true projection does not change drastically between neighboring pixels, and (2) the noise in adjacent pixels is independent. These are mild assumptions and are satisfied to a high degree in practice. From these assumptions, we can compute the average noise variance in block Y_i, which we denote by σ_i, using Equation (3.1). Then, if the values of Y_i and Y_j are close, the variable \|Y_i - Y_j\|_2^2 / (2\sigma_i^2) will follow a central chi-squared distribution with m^3 degrees of freedom, χ^2_{m^3}. Denoting its cumulative distribution function by F_{m^3}(t), a good choice for the similarity threshold will be α = (2\sigma_i^2) F^{-1}_{m^3}(t_0) for some parameter t_0 [214, 215]. Here, F^{-1}_{m^3} is the inverse of F_{m^3}. In other words, we decide that Y_i and Y_j are similar if their squared Euclidean distance \|Y_i - Y_j\|_2^2 is less than or equal to α. We have empirically found that a value of t_0 ≈ 0.90 leads to good clustering results for all noise levels. The same value has been suggested in [214].

A straightforward clustering would require computing the distance \|Y_i - Y_j\|_2^2 for each pair of blocks that we wish to compare. Because the dimension of the blocks and their number are large, we suggest a two-step approach to reduce the computation. Our approach is similar in concept to the locality-sensitive hashing methods that are common in the field of data mining for finding nearest neighbors in high-dimensional data [181]. Specifically, we map each block Y_i, which resides in R^{512}, onto R^4 using the following computations:

h_i(1) = \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} Y_i(i, j, k)
h_i(2) = \sum_{i=1}^{m/2} \sum_{j=1}^{m} \sum_{k=1}^{m} Y_i(i, j, k) - \sum_{i=m/2+1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} Y_i(i, j, k)
h_i(3) = \sum_{i=1}^{m} \sum_{j=1}^{m/2} \sum_{k=1}^{m} Y_i(i, j, k) - \sum_{i=1}^{m} \sum_{j=m/2+1}^{m} \sum_{k=1}^{m} Y_i(i, j, k)
h_i(4) = \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m/2} Y_i(i, j, k) - \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=m/2+1}^{m} Y_i(i, j, k)
\qquad (3.2)

It is easy to see that h_i(1) is a measure of the average projection amplitude in the ith block, whereas h_i(2), h_i(3), and h_i(4) estimate its slope along the three dimensions. We can interpret these computations as a “projection” of Y_i from R^{512} onto R^4. In locality-sensitive hashing, high-dimensional signals are projected onto a low-dimensional space where they can be clustered at a much lower computational cost. The choice of projections is highly critical because similar signals in the original high-dimensional space must remain close to each other in the low-dimensional space defined by the projections. Conversely, signals that are far apart in the original space should remain far apart in the low-dimensional space. Our choice of the projections in (3.2) is motivated by the nature of CT projections as smooth and slowly varying signals. Therefore, we expect that, in general, Y_i and Y_j are close together if and only if h_i and h_j are so. In Figure 3.2 we show the effectiveness of the proposed mapping on projections simulated from a Shepp-Logan phantom. We selected 10^5 random pairs of blocks from simulated projections of this phantom and computed the Euclidean distances in the Y and h spaces for each pair of blocks. In part (a) of this figure we show the result for noise-free projections. In part (b) of the same figure we show the result for very noisy projections that were simulated by assuming the number of incident photons to be 100. It is clear that the proposed mapping is highly effective. What is particularly important to us is that blocks that are very close to each other in the Y space are also very close in the h space.

Figure 3.2: The Euclidean distance between blocks extracted from projections of the Shepp-Logan phantom in the h space versus that in the Y space. (a) noise-free projections, (b) very noisy projections.

These projections are computed for all Y_i prior to clustering, and they are normalized so that each of the four projections has a mean of zero and a standard deviation equal to one.
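The four features in (3.2) reduce to simple sums over halves of the block. The sketch below computes them for a single vectorized block; the subsequent normalization to zero mean and unit standard deviation over all blocks is not shown.

```python
import numpy as np

def block_features(Yi, m=8):
    """Feature vector h of (3.2) for one vectorized m x m x m block (a column of Y)."""
    B = Yi.reshape(m, m, m)
    h = np.empty(4)
    h[0] = B.sum()                                             # average amplitude (times m^3)
    h[1] = B[:m // 2].sum() - B[m // 2:].sum()                 # slope along dimension 1
    h[2] = B[:, :m // 2].sum() - B[:, m // 2:].sum()           # slope along dimension 2
    h[3] = B[:, :, :m // 2].sum() - B[:, :, m // 2:].sum()     # slope along dimension 3
    return h
```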
During clustering, every time two signals Y_i and Y_j are to be compared, we first compare their projections, h_i and h_j. Only if h_i and h_j are sufficiently close do we proceed to compare Y_i and Y_j. As we mentioned above, we use the ℓ_2 norm to compare Y_i and Y_j because our noise model allows us to find a reasonable threshold for \|Y_i - Y_j\|_2^2. For comparing h_i and h_j, however, we use the ℓ_∞ norm. This is because we expect two blocks Y_i and Y_j to be similar only if their mean amplitudes and their slopes in all three directions are close to each other. Therefore, when we would like to decide whether Y_i and Y_j are close enough to be clustered, we first test whether \|h_i - h_j\|_∞ ≤ ε_h for some threshold ε_h, and only if this is the case do we proceed to test \|Y_i - Y_j\|_2^2 ≤ α. If this latter test is also satisfied, we decide that Y_j can be clustered with Y_i. This two-step scheme drastically reduces the number of comparisons between 512-dimensional blocks needed for clustering.

The choice of the threshold ε_h is very important. If ε_h is too small, many blocks that are truly close will not be clustered together, reducing the performance of the algorithm. On the other hand, if ε_h is too large, the number of comparisons of high-dimensional signals that are not truly similar will increase, increasing the computational time with no gain in denoising performance. We have empirically found that a value of ε_h ≈ 0.5 works well for all noise levels (as we mentioned above, the computed values of h are normalized to have a standard deviation of 1). The reason why this value works well for different noise levels is that the features that constitute our proposed mapping, h, are not much affected by the noise. As can be seen in Equation (3.2), all four features that constitute h involve a large amount of summing (or, equivalently, averaging) of the elements of Y. As a result, the values of h are very robust to the noise level. Consequently, the proper value of the threshold ε_h is also largely unaffected by the noise level.

The proposed clustering algorithm is presented in Algorithm 1. The algorithm sweeps through the columns of Y (we remind the reader that the columns of Y are blocks extracted from the stacked projections that have been vectorized). If column Y_i has not been clustered yet, a new cluster is defined with Y_i as its representative element. The algorithm then checks all unclustered columns and adds them to the newly formed cluster if they are sufficiently close to Y_i based on the criteria described above. As explained above, all unclustered columns are first compared with Y_i in terms of their projections (h). Those that are close to Y_i in terms of their projections, denoted by I_2 in Algorithm 1, are then tested to identify those that are truly close to Y_i in the original space.

Algorithm 1: Clustering of the blocks extracted from the stacked projections.
  Input:  matrix Y ∈ R^{m^3 × n} containing the vectorized blocks as its columns
  Output: sets of indices of similar blocks, S
  n_cluster = 0;  S = {}
  for i ← 1 to n do
    if i ∉ S{k} for all k = 1 : n_cluster then
      n_cluster = n_cluster + 1
      σ^2 = f_i exp(Ȳ_i)
      α = (2σ^2) F^{-1}_{m^3}(0.90)
      I_1 = {j ∈ [i+1, n] : j ∉ S{k} for all k = 1 : n_cluster}
      I_2 = {j ∈ I_1 : \|h_j - h_i\|_∞ ≤ ε_h}
      I_3 = {j ∈ I_2 : \|Y_j - Y_i\|_2^2 ≤ α}
      S{n_cluster} = {i} ∪ I_3
    end
  end
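A compact sketch of Algorithm 1 is given below; it follows the two-stage test described above. The constant value of f and the use of scipy's chi-squared quantile are assumptions made to keep the example self-contained.

```python
import numpy as np
from scipy.stats import chi2

def cluster_blocks(Y, H, f=1e-5, t0=0.90, eps_h=0.5, m=8):
    """Greedy clustering of the columns of Y (Algorithm 1).
    Y: (m^3, n) vectorized blocks; H: (4, n) normalized features of (3.2)."""
    n = Y.shape[1]
    unclustered = np.ones(n, dtype=bool)
    clusters = []
    for i in range(n):
        if not unclustered[i]:
            continue
        unclustered[i] = False
        sigma2 = f * np.exp(Y[:, i].mean())              # noise variance estimate via (3.1)
        alpha = 2.0 * sigma2 * chi2.ppf(t0, df=m ** 3)   # distance threshold
        # Stage 1: cheap l_inf comparison in the 4-D feature space.
        cand = np.where(unclustered)[0]
        cand = cand[np.max(np.abs(H[:, cand] - H[:, i:i + 1]), axis=0) <= eps_h]
        # Stage 2: exact squared Euclidean distance between the full blocks.
        if cand.size:
            d2 = np.sum((Y[:, cand] - Y[:, i:i + 1]) ** 2, axis=0)
            cand = cand[d2 <= alpha]
        clusters.append(np.concatenate(([i], cand)))
        unclustered[cand] = False
    return clusters
```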
3.2.2 Dictionary learning

As mentioned above, the core idea in the proposed algorithm is to exploit both the self-similarity in the stacked projections and the power of learned dictionaries by collecting similar blocks in the stacked projections and computing their simultaneous sparse representation in a properly designed dictionary. Therefore, the central role of the dictionary in this algorithm is obvious. In this section, we describe how we learn the dictionary.

Given a set of training projections, we first select 3D blocks from different locations in the stacked projections and vectorize them as the columns of a matrix Y. In this section, we refer to the columns of Y as “training signals”. We cluster these training signals using Algorithm 1 and obtain a set S = {s_1, s_2, ..., s_{n_cluster}}, where each s_i contains the indices of similar training signals. A good dictionary can then be learned by solving the following optimization problem:

\min_{D, \Gamma}\; \|Y - D\Gamma\|_F^2 + \lambda \sum_{i=1}^{n_{cluster}} \|\Gamma_{s_i}\|_{\text{row-}0} \qquad (3.3)

In the above equation, Γ_{s_i} denotes the matrix Γ restricted to the columns indexed by s_i. In other words, Γ_{s_i} is the matrix of representation coefficients of all signals in cluster s_i. \|\cdot\|_{\text{row-}0} is a pseudo-norm that counts the number of non-zero rows; that is, \|\Gamma_{s_i}\|_{\text{row-}0} counts the number of rows of Γ_{s_i} that have at least one non-zero element. In other words, \|\Gamma_{s_i}\|_{\text{row-}0} counts the number of atoms of the dictionary D that participate in the representation of at least one of the signals in cluster s_i. Therefore, using \|\Gamma_{s_i}\|_{\text{row-}0} as a penalty encourages the signals in the set s_i to share the same atoms and to use a small number of atoms in their representation. In summary, the form of the cost function in (3.3) reflects what we expect from the trained dictionary: the first term requires that the dictionary accurately model the training signals, while the second term requires that the training signals have a sparse representation in D and that similar training signals share the same atoms from the dictionary, i.e., that they have a joint-sparse representation in D. This is precisely in accordance with the intuition behind our proposed algorithm.

As we mentioned above, (3.3) is a non-convex minimization problem and is commonly solved by alternately minimizing with respect to Γ and D. Below, we explain how we perform these minimizations. In our presentation, we assume that Y ∈ R^{m^3 × N} and D ∈ R^{m^3 × K}, i.e., the number of training blocks is N and the trained dictionary has K atoms. We also assume that the columns of D are normalized to have unit ℓ_2 norm.

Minimization with respect to D: Our approach here is similar to that followed in the K-SVD algorithm [2]. The K-SVD algorithm sweeps through the columns of D (i.e., the dictionary atoms) and updates each atom along with the coefficients of the signals that use it. Details were given in Section 2.1.2.
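As a reminder of how such an atom update typically looks, the following sketch performs a generic K-SVD-style update of a single atom and the corresponding coefficient row via a rank-one SVD of the restricted residual. It is an illustration of the general technique, not the exact implementation used in this work.

```python
import numpy as np

def update_atom(Y, D, Gamma, k):
    """K-SVD-style update of atom D[:, k] and row Gamma[k, :] (in place)."""
    users = np.nonzero(Gamma[k, :])[0]          # training signals that use atom k
    if users.size == 0:
        return
    # Residual of those signals with the contribution of atom k removed.
    E = Y[:, users] - D @ Gamma[:, users] + np.outer(D[:, k], Gamma[k, users])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                           # new unit-norm atom
    Gamma[k, users] = s[0] * Vt[0, :]           # matching representation coefficients
```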
Minimization with respect to Γ: With D being fixed, the minimization problem in this step can be written as:

\min_{\Gamma}\; \sum_{i=1}^{n_{cluster}} \left( \|Y_{s_i} - D\Gamma_{s_i}\|_F^2 + \lambda \|\Gamma_{s_i}\|_{\text{row-}0} \right) \qquad (3.4)

where, as before, Y_{s_i} and Γ_{s_i} denote the restriction of these matrices to the columns indexed by s_i. From (3.4) we see that the optimization problem in this step consists of n_cluster separate subproblems, one for each cluster of signals. Each of these subproblems is a simultaneous sparse coding problem. Since we can estimate the noise variance, we prefer to rewrite (3.4) in a constrained form:

\min_{\Gamma_{s_i}}\; \|\Gamma_{s_i}\|_{\text{row-}0} \quad \text{subject to: } \|Y_{s_i} - D\Gamma_{s_i}\|_F^2 \leq \epsilon_i, \qquad i = 1 : n_{cluster}

The advantage of this constrained minimization over the unconstrained formulation in (3.4) is that λ is unknown and has no physical meaning, whereas ε_i is directly related to the noise variance, which we can easily estimate. We solve each of these subproblems using Algorithm 2, which is an extension of the well-known OMP algorithm [319]. In this algorithm, |s_i| denotes the cardinality of the set s_i. At each iteration of this algorithm, we select the atom that has the largest cumulative correlation with the signals in Y_{s_i} and then project Y_{s_i} onto the subspace spanned by the set of atoms selected so far. We then update the residual and repeat.

Algorithm 2: Simultaneous greedy sparse coding [319].
  Input:  dictionary D and the set of similar signals Y_{s_i}
  Output: Γ_{s_i}, the sparse representation coefficients of Y_{s_i} in D
  r = Y_{s_i};  I = {}
  σ^2 = f_i exp(Ȳ_{s_i})
  α = σ^2 F^{-1}_{|s_i| m^3}(0.90)
  while \|r\|_F^2 > α do
    k̂ = argmax_k Σ_{j=1}^{|s_i|} |⟨(Y_{s_i})_j, D_k⟩|
    I = I ∪ {k̂}
    Γ_{s_i} = (D_I^T D_I)^{-1} D_I^T Y_{s_i}
    r = Y_{s_i} - D_I Γ_{s_i}
  end

An important choice is the dictionary size, i.e., the number of atoms in the dictionary. We used a dictionary size of 1024 in all our experiments. We will discuss the effect of the dictionary size on the performance of the proposed algorithm later in this chapter. Another important decision is the choice of the initial dictionary. It is common to use an overcomplete DCT or wavelet dictionary as the initial dictionary. Our experience shows that the algorithm converges much faster if we build the initial dictionary from the training signals. For example, 1024 training signals can be randomly selected. Alternatively, after clustering the training signals, 1024 clusters can be randomly selected and one random signal from each cluster, or the average of all signals in each cluster, can be used in the initial dictionary. With this initialization, at most 50 iterations of the proposed algorithm were enough to converge to a good dictionary.

Moreover, at the end of each iteration of the dictionary learning algorithm we removed from the dictionary the least-used atom (the atom used by the smallest number of signal groups) and instead added a new dictionary atom formed as the average of the most difficult group of signals (the group that used the largest number of dictionary atoms in its representation). Similar strategies are commonly used in dictionary learning algorithms.
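A short sketch of Algorithm 2 follows. The least-squares projection is computed with a numerically stable solver rather than the explicit matrix inverse, D is assumed to have unit-norm columns, and f is again taken to be a constant for brevity.

```python
import numpy as np
from scipy.stats import chi2

def simultaneous_omp(D, Ys, f=1e-5, t0=0.90):
    """Greedy simultaneous sparse coding of one cluster Ys (m^3 x |s_i|) in D (Algorithm 2)."""
    m3, n_sig = Ys.shape
    sigma2 = f * np.exp(Ys.mean())
    alpha = sigma2 * chi2.ppf(t0, df=n_sig * m3)     # residual-energy threshold
    R = Ys.copy()
    I = []
    Gamma = np.zeros((D.shape[1], n_sig))
    while np.sum(R ** 2) > alpha and len(I) < m3:
        corr = np.sum(np.abs(D.T @ R), axis=1)        # cumulative correlation of each atom
        corr[I] = -np.inf                             # do not reselect atoms
        I.append(int(np.argmax(corr)))
        DI = D[:, I]
        coef, *_ = np.linalg.lstsq(DI, Ys, rcond=None)  # projection onto selected atoms
        Gamma[I, :] = coef
        R = Ys - DI @ coef
    return Gamma
```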
3.2.3 Denoising

Once a dictionary has been learned for simultaneous sparse representation of similar blocks using the method described above, it can be used for denoising of a newly acquired projection set. Of course, one can learn the dictionary from the noisy image itself. This is indeed a common approach in dictionary-based image denoising and may lead to slightly better results [100, 215]. However, dictionary learning is very computationally expensive, and it would be highly desirable if a pre-trained dictionary could be used instead of learning a new dictionary for newly acquired projections. As we will describe later in this chapter, our results show that for CT projections, as long as the scan geometry does not change, a well-trained dictionary can be used for denoising the projections of different objects without a loss of performance. Denoising of a new set of projections is carried out in the three simple steps described below.

• Partition the stacked projections into overlapping blocks and cluster them using Algorithm 1.

• For each cluster of blocks Y_{s_i}, use the simultaneous sparse coding algorithm outlined in Algorithm 2 to find the representation coefficients of Y_{s_i} in D. Denoting these representation coefficients by Γ̂_{s_i}, the denoised estimate of the blocks in this cluster is: Ŷ_{s_i} = D Γ̂_{s_i}.

• The final (denoised) estimate of the projections is computed using a simple averaging:

\hat{Y} = \left( \sum_{i=1}^{n_{cluster}} \sum_{j \in s_i} R_j (D\hat{\Gamma}_{s_i})_j \right) \oslash \left( \sum_{i=1}^{n_{cluster}} \sum_{j \in s_i} R_j \mathbf{1} \right) \qquad (3.5)

where R_j is a binary matrix that places the jth block in its location in the stacked projections, ⊘ indicates element-wise division, and 1 ∈ R^{m^3} is a vector of all ones. The above equation has a simple meaning similar to Equation (2.7).

3.3 Evaluation

We apply the proposed denoising algorithm on simulated and real cone-beam CT projections and compare its performance with the following algorithms:

• Bilateral filtering, which has been suggested for sinogram denoising in [219]. Since this is not a patch-based algorithm, we expect it to be faster than patch-based algorithms but less effective in terms of denoising performance. We include it here to contrast the patch-based denoising algorithms with the more traditional sinogram denoising methods. This algorithm estimates the denoised value of the image at pixel k by minimizing a cost function of the form E(u(k)) = \sum_{k' \in \Omega_k} P_1(k, k') P_2(k, k'), where Ω_k is a neighborhood around this pixel, and P_1 and P_2 are two cost functions in terms of the spatial distance and the difference in pixel values, respectively. In [219] both P_1 and P_2 are suggested to be Gaussians. The bandwidth of P_1 is suggested to be fixed at w/6, where w is the neighborhood width, and the bandwidth of P_2 is suggested to be chosen in the range [0.7, 2.8].

• We will apply the dictionary-based approach proposed in [100] to denoise the projection measurements. We will refer to this algorithm as K-SVD.

• We will apply the BM4D denoising algorithm [205] on the projection measurements. There are many parameters that influence the performance of this algorithm. We use the set of parameter values that has been named the “normal profile” in [205]. These parameter values provide a good balance between speed and performance for this algorithm. For brevity, we will refer to the BM4D algorithm applied on the projections (sinogram) as BM4D-s.

• We will apply the BM4D algorithm also in the image domain. In other words, we reconstruct the image from the noisy projections and then denoise the reconstructed image. Although this is the same BM4D algorithm, we will refer to it as BM4D-i to distinguish it from BM4D-s, which is applied on the sinogram.

• An adaptive NLM algorithm was proposed for CT image denoising in [191]. We will apply this algorithm on the images reconstructed from noisy projections. We will refer to this algorithm as ANLM. An important difference between ANLM and the basic NLM algorithm is that ANLM relies on a relatively fast approximation of the noise level in the reconstructed image and adjusts the denoising strength based on the estimated noise level.

To have a fair comparison between the different algorithms, we use the same block size (i.e., 8^3) and the same level of overlapping by using a 3-pixel shift, so that adjacent overlapping blocks share 5 pixels in each direction.
Therefore, the methods that we compare are of two types: 1) the sinogram denoising (pre-processing) algorithms, which include bilateral filtering, K-SVD, BM4D-s, and the proposed algorithm, and 2) the post-processing algorithms that work on the reconstructed image, which include BM4D-i and ANLM. As we will explain below, we apply these algorithms on several sets of simulated and real cone-beam projections. Since the ultimate goal is to achieve a high-quality image, most of our evaluations will be done in the image domain. Therefore, for each of the experiments described below, we apply these algorithms as follows: 1) for the sinogram-denoising algorithms (i.e., bilateral filtering, BM4D-s, K-SVD, and the proposed algorithm), after denoising the projections with each algorithm we use the FDK algorithm to reconstruct the image; 2) for the post-processing algorithms (i.e., BM4D-i and ANLM), we reconstruct the image from the noisy projections using the FDK algorithm and then apply these algorithms to denoise the reconstructed image. Therefore, all algorithms include an FDK reconstruction. We use a Hamming window with the FDK algorithm.

3.3.1 Simulation experiment

We simulated 1440 equally spaced noisy projections from a 3D digital brain phantom. We generated this phantom using one of the phantoms in the BrainWeb database [63]. The phantom size was 362 × 434 × 362 voxels and the projections were each 650 × 450 pixels in size. We assumed the number of incident photons N_0^i = 10^5 to be constant for all i by ignoring the bowtie filtration. To simulate the electronic noise, which becomes more important in low-dose CT, we added Gaussian noise with a standard deviation of σ = 100 to the detected photon counts.

For this simulation study, since we know the true phantom projections, we can quantitatively compare the sinogram denoising algorithms in the projection domain. For this purpose, we computed the root mean square of the error (RMSE), where the error is defined as the difference between the denoised and the true (i.e., noise-free) projections. Table 3.1 shows this comparison. Note that this comparison only includes the sinogram denoising algorithms, i.e., bilateral filtering, K-SVD, BM4D-s, and the proposed algorithm. It can be seen from this table that the proposed algorithm has achieved a lower RMSE than the other sinogram-denoising algorithms. As one might expect, bilateral filtering is not as effective as the patch-based denoising algorithms.

  Bilateral filtering   K-SVD   BM4D-s   Proposed algorithm
  0.060                 0.055   0.051    0.048

Table 3.1: Root-mean-square of the difference between the denoised projections and the true projections on the data simulated from the digital brain phantom.

A more useful comparison is in terms of the quality of the reconstructed image, where we can compare all three types of algorithms. To assess the quality of the reconstructed images, we compute the RMSE, where the error is the difference between the reconstructed image and the true phantom image. We also compute the structural similarity index (SSIM) between the reconstructed image, x̂, and the true image, x_0, as follows [329]:

\mathrm{SSIM}(\hat{x}, x_0) = \frac{(2\mu_{\hat{x}}\mu_{x_0} + C_1)(2\sigma_{\hat{x}x_0} + C_2)}{(\mu_{\hat{x}}^2 + \mu_{x_0}^2 + C_1)(\sigma_{\hat{x}}^2 + \sigma_{x_0}^2 + C_2)} \qquad (3.6)

where μ_x and σ_x represent the mean and standard deviation of x, σ_{x̂x_0} is the covariance, and C_1 and C_2 are constants.
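The global form of (3.6) can be computed directly, as in the sketch below; the particular values of C_1 and C_2 are assumptions (the commonly used defaults), not necessarily the constants used in our experiments.

```python
import numpy as np

def ssim_global(x_hat, x0, data_range=1.0):
    """Global SSIM of (3.6) between two images (no local windowing)."""
    C1 = (0.01 * data_range) ** 2      # assumed constants
    C2 = (0.03 * data_range) ** 2
    mu1, mu2 = x_hat.mean(), x0.mean()
    var1, var2 = x_hat.var(), x0.var()
    cov = np.mean((x_hat - mu1) * (x0 - mu2))
    return ((2 * mu1 * mu2 + C1) * (2 * cov + C2)) / \
           ((mu1 ** 2 + mu2 ** 2 + C1) * (var1 + var2 + C2))
```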
The results of the quantitative comparison are presented in Table 3.2. In addition to RMSE and SSIM, we also present the computational times.

                        RMSE    SSIM    time (h)
  Bilateral filtering   0.066   0.735   0.40
  K-SVD                 0.065   0.738   1.84
  BM4D-s                0.059   0.765   4.35
  Proposed algorithm    0.057   0.774   0.44
  BM4D-i                0.055   0.770   0.82
  ANLM                  0.062   0.758   1.70

Table 3.2: Comparison of different algorithms in terms of the quality of the reconstructed image and the computational time on the noisy scan simulated from the digital brain phantom.

From this table, the proposed algorithm has performed better than the other sinogram denoising algorithms and ANLM. The image produced by BM4D-i is close to the image produced by the proposed algorithm. Another very important observation is that the computational time of the proposed algorithm is much shorter than that of K-SVD, BM4D-s, BM4D-i, and ANLM. In fact, the computational time of the proposed algorithm is not much longer than that of bilateral filtering, whereas the computational times of the other algorithms are approximately 2 to 11 times longer than that of bilateral filtering. All algorithms were run in Matlab version R2012b on a Windows 7 PC with 32 GB of memory and a 3.4 GHz Intel Core i7 CPU. We should point out that the computation times reported for K-SVD and the proposed algorithm in Table 3.2 do not include the time for the dictionary learning stage. As we will describe later in this chapter, it is not always necessary to learn a new dictionary for every new set of projections and, in many cases, a dictionary learned on the projections of one object can be effectively used for denoising the projections of a different object.

Figure 3.3 shows a slice of the digital brain phantom in the images reconstructed with the different algorithms. Compared with the other sinogram denoising algorithms (i.e., K-SVD, BM4D-s, and bilateral filtering) and ANLM, the proposed algorithm has resulted in a better image quality. The visual quality is not very different between the proposed algorithm and BM4D-i.

Figure 3.3: A slice of the digital brain phantom. (a) reference image, (b) FDK-reconstructed from noisy projections, (c) bilateral filtering, (d) K-SVD, (e) BM4D-s, (f) the proposed algorithm, (g) BM4D-i, (h) ANLM.

In order to further evaluate the proposed algorithm in terms of spatial resolution, we performed a second simulation experiment. The goal of this experiment was to determine how the spatial resolution of the different algorithms is affected by the object contrast. We rely on the estimation of the modulation transfer function (MTF) following an approach similar to that in [209]. In this experiment, we simulated noisy projections from the MTF bead phantom, which we generated using the CONRAD software [208]. As shown in Figure 3.4(a), this digital phantom includes three small beads and two large high-attenuation inserts. We generated two versions of this phantom.

Figure 3.4: (a) The central slice of the MTF bead phantom; the square C shows the location of the cube used to compute the noise in the reconstructed images to ensure equal noise for all algorithms, (b) estimated MTF for the high-contrast phantom, (c) estimated MTF for the low-contrast phantom.
In both versions, the phantom disk was assumed to have a linear attenuation coefficient of 1. However, in one of the phantoms the beads had a linear attenuation coefficient of 3, whereas in the other they had a linear attenuation coefficient of 1.4. We will refer to these two phantoms as the high-contrast and low-contrast MTF bead phantoms, respectively. If we regard the phantom disk as the background, the contrast of the beads in the low-contrast phantom was 1/5 of that in the high-contrast phantom. The phantom size was 512^3 voxels and the projections were each 600 × 600 pixels in size. We simulated 720 equally spaced noisy projections from each phantom by assuming the number of incident photons to be N_0^i = 2 × 10^3 and the standard deviation of the additive Gaussian noise to be σ = 50. We estimated the MTF from the center bead in the reconstructed images.

In addition to contrast, another factor that affects the spatial resolution is the amount of smoothing (i.e., the denoising strength). In fact, noise, contrast, and spatial resolution are three inter-dependent criteria. In order to also account for the noise level, we adjusted the denoising strengths of the sinogram-denoising and image-domain denoising methods such that the noise in the images produced by the different algorithms was approximately equal. Each of these algorithms has tuning parameters that allow for adjustment of the denoising strength. In this simulation experiment we considered a cube, the cross-section of which is shown in Figure 3.4(a), and adjusted the denoising strengths of the sinogram-denoising and image-domain denoising methods such that the variance of the voxel values in this cube was approximately equal in the images produced by the different algorithms.

The estimated MTFs from the high-contrast and low-contrast phantom images produced by the different algorithms are shown in Figure 3.4(b)-(c). In Table 3.3 we show the spatial frequency at which the normalized MTF for each algorithm reached a value of 0.10. These results show that the images produced by the proposed algorithm have a higher spatial resolution than the images produced by the other three sinogram denoising algorithms. The two image-domain denoising algorithms also achieve a high spatial resolution and outperform the sinogram-denoising algorithms, including our proposed algorithm. From the MTF plots for the low-contrast phantom shown in Figure 3.4(c), there is also a general difference between the sinogram-denoising and image-domain denoising methods. In general, the sinogram denoising methods have resulted in a higher MTF at low spatial frequencies, but the image-domain denoising methods have resulted in a higher MTF at higher spatial frequencies.

                        Spatial resolution    Spatial resolution
                        (high-contrast)       (low-contrast)
  Bilateral filtering   0.86                  0.75
  K-SVD                 0.85                  0.74
  BM4D-s                0.88                  0.79
  Proposed algorithm    0.88                  0.80
  BM4D-i                0.90                  0.82
  ANLM                  0.90                  0.81

Table 3.3: Spatial resolution (defined as the spatial frequency at which the normalized MTF reaches 0.10) in units of mm^{-1} in the images of the MTF bead phantom produced by different algorithms.

3.3.2 Experiment with micro-CT scan of a rat

All of the real CT data used in this dissertation were collected with a Gamma Medica eXplore CT 120 micro-CT scanner. Therefore, here we provide some technical details about this scanner. This scanner has a flat panel detector located 449 mm from the source and 397 mm from the axis of rotation. The detector panel includes 3500 × 2272 detector elements.
Using three different binning options of 1 × 1, 2 × 2, and 4 × 4, projections with three different sizes of 3500 × 2272, 1750 × 1136, and 875 × 568 pixels, respectively, can be obtained. The distance between the centers of adjacent detector elements is 0.02831 mm. Unless otherwise stated, the size of the reconstructed images is 880 × 880 × 650 voxels, with each voxel having a size of 0.1 × 0.1 × 0.1 mm^3. Other scan settings differ between experiments and will therefore be stated for each experiment separately. Some further information about this scanner can be found in [203].

For the rat scan used in this section, the tube voltage, tube current, and exposure time were equal to 70 kV, 32 mA, and 16 ms, respectively. This was the lowest possible setting in terms of mAs because the scanner did not operate below 0.5 mAs. The scan consisted of 720 projections between 0° and 360° at 0.5° intervals. The size of each projection was 875 × 568 pixels.

Because we do not have access to the true (i.e., noise-free) projections, we evaluate the performance of the different algorithms in terms of the quality of the reconstructed images. For this purpose, we reconstructed a high-quality image of the rat using the full set of 720 projections. To create this image, we reconstructed an initial image using the FDK algorithm, followed by 50 iterations of the MFISTA algorithm [17] to improve the quality of the FDK-reconstructed image. The resulting image had a very high quality, and we will refer to it as “the reference image”.

To compare the different algorithms, we applied them on the same scan. As we did in the simulation studies described above, in our experiments with real data we applied the sinogram denoising algorithms (i.e., bilateral filtering, K-SVD, BM4D-s, and the proposed algorithm) on the noisy projections and used the denoised projections to reconstruct an image with the FDK algorithm. We applied the post-processing algorithms (i.e., BM4D-i and ANLM) on the FDK-reconstructed image from the noisy projections. The quality of the reconstructed images was assessed by computing the RMSE, where we define the error as the difference between the reconstructed image and the reference image, and the SSIM between the two images. In Table 3.4, we summarize the results of the quantitative comparison between the different algorithms. In addition to RMSE, SSIM, and the computational time, we include two additional numbers that are indicators of spatial resolution and noise strength, which we denote by SR and NS and compute as follows:

• As an indicator of spatial resolution (SR), we computed the maximum absolute value of the gradient (i.e., slope) along the line L marked in Figure 3.5(a). A larger gradient, corresponding to a sharper slope, indicates a higher spatial resolution.

• As a measure of noise strength (NS), we computed the standard deviation of the voxel values in a cube whose cross-section has been marked in Figure 3.5(a) with the rectangle C. From the reference image, we identified this cube to be highly uniform.

Noise suppression and spatial resolution are usually two opposing objectives in denoising. A stronger denoising, in general, leads to a loss of spatial resolution, and a successful denoising algorithm can simply be defined as one that reduces the noise with little degradation of spatial resolution. Therefore, the above two numbers are very useful indicators for comparing the different algorithms.
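As a small illustration, the two indicators could be computed as in the sketch below; the line and cube coordinates are hypothetical placeholders standing in for the locations L and C marked in Figure 3.5(a).

```python
import numpy as np

def spatial_resolution_indicator(img, row, col_range, slc):
    """SR: maximum absolute gradient of the intensity profile along a line."""
    profile = img[row, col_range[0]:col_range[1], slc]
    return float(np.max(np.abs(np.diff(profile))))

def noise_strength_indicator(img, corner, size=10):
    """NS: standard deviation of the voxel values inside a uniform cube."""
    a, b, c = corner
    cube = img[a:a + size, b:b + size, c:c + size]
    return float(cube.std())
```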
A very similar approach was suggested for examining the trade-off between noise removal and spatial resolution in CT images in [219].

                      RMSE     SSIM    time (h)   SR      NS
Bilateral filtering   0.0123   0.783   0.42       0.160   0.138
KSVD                  0.0121   0.792   2.2        0.156   0.124
BM4D-s                0.0116   0.808   2.8        0.165   0.119
Proposed algorithm    0.0106   0.814   1.0        0.165   0.116
BM4D-i                0.0110   0.810   3.9        0.160   0.112
ANLM                  0.0118   0.803   4.6        0.162   0.123

Table 3.4: Quality criteria for the images produced by different algorithms from the noisy rat scan.

The numbers in Table 3.4 show that the proposed algorithm has outperformed the other sinogram denoising algorithms and the image-domain denoising algorithms, while having a much shorter computational time. The proposed algorithm is, in most cases, better than the other methods in terms of spatial resolution and noise strength. Bilateral filtering has a shorter computational time, but it is less effective than the other algorithms. In terms of computational time, the proposed algorithm requires a much shorter time than the other methods, except for bilateral filtering. BM4D-s and BM4D-i, which are close to the proposed algorithm in terms of the image quality criteria, take 3 to 4 times longer to complete.

Figure 3.5 shows a slice in the image of the rat produced using different algorithms. The entire slice has been displayed with a window of linear attenuation coefficient of [0, 0.45]. In order to better demonstrate the difference between the images, we have selected two ROIs and displayed them with a magnification of 150% and with much narrower windows of linear attenuation coefficients. The locations of these ROIs have been marked on the slice of the reference image in Figure 3.5(a). The ROI shown on the top-left of each slice contains fat surrounded by soft tissue and is shown with a window of linear attenuation coefficient of [0.14, 0.22]. The ROI shown on the top-right of each slice contains bone surrounded by soft tissue and is displayed with a window of linear attenuation coefficient of [0.18, 0.30]. From this figure, the proposed algorithm seems to have resulted in a better image than the other sinogram denoising algorithms and the post-processing algorithms. This is more visible from the ROI displayed on the top left of each slice.

3.3.3 Experiment with micro-CT scan of a phantom

A physical phantom was scanned using the same micro-CT scanner described above. This is a quality assurance phantom that has been designed for comprehensive evaluation of the performance of micro-CT systems. It has various modules that allow for a complete evaluation of the quality of the reconstructed images in terms of spatial resolution, noise, geometric accuracy, linearity, etc. Therefore, this phantom is used very frequently in this dissertation and is referred to as "the physical phantom". A detailed description of this phantom can be found in [92].

For evaluation of the algorithm proposed in this chapter, the physical phantom was scanned twice:

1. Low-noise scan. This scan consisted of 720 projections at 0.5◦ intervals between 0◦ and 360◦. The tube voltage, tube current, and exposure time were 70 kV, 40 mA, and 25 ms, respectively.

2. High-noise scan. This scan consisted of 720 projections between 0◦ and 360◦ at 0.5◦ intervals. The tube voltage, tube current, and exposure time were equal to 50 kV, 32 mA, and 16 ms, respectively. Moreover, a 0.2 mm copper filter was used for this scan.

The phantom had no movement between the two scans. Therefore, the two scans are from the same object at exactly the same location.
Note that it was not possible to perform two identical scans in the rat experiment described above because the internal organs of the rat moved ever so slightly during the experiments.

Figure 3.5: A slice of the image of the rat. (a) the reference image, (b) FDK-reconstructed from noisy projections, (c) bilateral filtering, (d) K-SVD, (e) BM4D-s, (f) the proposed algorithm, (g) BM4D-i, (h) ANLM.

From the low-noise scan, we reconstructed a high-quality reference image in a way similar to our rat experiment above. Different algorithms were used to reconstruct the image of the phantom from the high-noise scan, and the reconstructed image was compared with the reference image. In addition to SSIM and RMSE, we computed two numbers as indicators of the noise strength (NS) and the spatial resolution (SR), as described below:

• The phantom has a uniform polycarbonate disk that has been included in the phantom for the purpose of estimating the noise level. We selected five 10³-voxel cubes at different locations within the disk and computed the standard deviation of the voxel values in each cube. We use the average of these five standard deviations as an indicator of noise strength (NS).

• The phantom included a slanted edge, consisting of a plastic-air boundary, that is specially designed for accurate estimation of the modulation transfer function (MTF). Although the MTF can also be estimated from a slit or a wire, this phantom provides an edge for MTF estimation because it is easier to fabricate a very smooth edge in a physical phantom. The slanted edge in this phantom had an angle of 5◦ relative to the image matrix, which allowed for accurate estimation of the MTF using methods such as that proposed in [39]. We estimate the MTF for the range of spatial frequencies between 0 and 5 mm⁻¹. As is commonly done, we report the spatial frequency at which the normalized MTF reaches a value of 0.10 as a measure of spatial resolution (SR). We will also show the full MTF curves.

Figure 3.6 shows two fine coils inside the phantom in the images reconstructed by different algorithms. These coils are very useful for visual inspection of the spatial resolution and noise in the images. Compared with the other sinogram denoising algorithms (i.e., bilateral filtering, K-SVD, and BM4D-s) and ANLM, the proposed algorithm seems to have resulted in a better image.

A more objective comparison of different algorithms can be performed using the quantitative criteria that we described above, which we have summarized in Table 3.5. These numbers clearly show that, compared with the other sinogram denoising algorithms and ANLM, the proposed algorithm has resulted in a better image. BM4D-i has also performed well and is slightly better than the proposed algorithm, but it takes 4 times longer to complete. In terms of computational time, the proposed algorithm is again much faster than the other algorithms, except bilateral filtering.

                      RMSE     SSIM    time (h)   SR     NS
Bilateral filtering   0.0145   0.786   0.42       3.81   0.0148
KSVD                  0.0142   0.789   2.5        3.78   0.0144
BM4D-s                0.0129   0.800   3.0        3.83   0.0139
Proposed algorithm    0.0127   0.820   0.9        3.83   0.0135
BM4D-i                0.0124   0.822   4.0        3.83   0.0132
ANLM                  0.0132   0.797   5.4        3.81   0.0143

Table 3.5: Comparison between different algorithms on the scan of the physical phantom.

The values of spatial resolution (SR) shown in Table 3.5 represent the spatial frequency at which the normalized MTF reached a value of 0.10. However, this is not a complete characterization of the spatial resolution. A more detailed comparison of different algorithms can be done by examining the full MTF curves.
Moreover, as can also be seen in Table 3.5, the noise strengths are different for different algorithms. As we mentioned above, spatial resolution is influenced by the denoising strength. Therefore, we also estimated the MTF for the different algorithms with noise matching. Specifically, we adjusted the denoising strengths of the different algorithms such that the noise strength (NS, computed as described above) was approximately equal to 0.0130 for all algorithms. We then estimated the MTF for the images produced by the different algorithms.

The estimated MTFs are shown in part (a) of Figure 3.7. Because the number of algorithms compared is large and some of the MTF curves are very close, in part (b) of this figure we show the difference between the MTF of the image produced by each algorithm and the MTF of the image reconstructed from the noisy projections with the FDK algorithm; this is done only to make the differences between the MTFs of the different algorithms easier to see. Overall, the MTF plots in this figure show that the images produced by BM4D-s and the proposed algorithm have higher spatial resolution than those of the other algorithms, especially in the frequency range [1 mm⁻¹, 2 mm⁻¹].

Figure 3.6: Images of two fine coils inside the physical phantom in the reference image and the images produced from the high-noise scan using different algorithms.

Figure 3.7: (a) The estimated noise-matched MTF for the images of the physical phantom produced by different algorithms, and (b) the difference between the MTF for different algorithms and the MTF for the FDK-reconstructed image without denoising.

3.4 Discussion

In summary, our results with simulated and real CT projections show that the proposed algorithm achieves state-of-the-art results. It outperforms or at least matches some of the best projection-domain and image-domain denoising algorithms in terms of the visual and objective quality of the produced image and the computational time. Our simulation study with the digital brain phantom showed that the projections denoised with the proposed algorithm were closer to the true projections than the projections denoised with the other sinogram-denoising algorithms. Our simulation experiments, as well as our experiments with real data, showed that in terms of the visual and objective quality of the produced image, the proposed algorithm achieved better results than the other methods. Denoising in the image domain with the BM4D algorithm often produced results that were close to the images produced by our proposed sinogram-denoising algorithm, but BM4D is much more computationally intensive than our proposed algorithm.

The proposed algorithm exploits the abundant self-similarity in the projections by grouping similar blocks extracted from the stacked projections. To denoise groups of similar blocks, the proposed algorithm makes the assumption that the blocks in each such group should have a joint-sparse representation in a well-designed dictionary, which can be learned from training data.
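The joint-sparsity idea can be illustrated with a simultaneous orthogonal matching pursuit, which forces all blocks of a group to share one small set of dictionary atoms. The sketch below is a generic illustration of this idea under the assumption of a dictionary D with approximately unit-norm atoms; it is not the exact implementation used in this chapter, and the stopping criterion (a fixed number of nonzeros) is an arbitrary choice.

```python
import numpy as np

def simultaneous_omp(D, Y, n_nonzero):
    """Jointly sparse-code the columns of Y (the vectorized blocks of one
    group) in dictionary D: all columns share the same support."""
    n_atoms = D.shape[1]
    residual = Y.copy()
    support = []
    coeffs = np.zeros((0, Y.shape[1]))
    for _ in range(n_nonzero):
        # pick the atom most correlated with the residuals of the whole group
        scores = np.sum(np.abs(D.T @ residual), axis=1)
        scores[support] = -np.inf                  # do not reselect atoms
        support.append(int(np.argmax(scores)))
        # least-squares fit of all blocks on the current joint support
        Ds = D[:, support]
        coeffs, *_ = np.linalg.lstsq(Ds, Y, rcond=None)
        residual = Y - Ds @ coeffs
    X = np.zeros((n_atoms, Y.shape[1]))
    X[support, :] = coeffs
    return X

# Denoised blocks of one group of similar blocks (columns of Y):
# Y_hat = D @ simultaneous_omp(D, Y, n_nonzero=8)
```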
It is important to note that, at the level of the small blocks of size 8³ considered by the proposed algorithm, the type of patterns that appear in the stacked projections does not depend strongly on the object being imaged. This is important because it means that a dictionary learned from the projections of a certain object can be used for denoising the projections of a different object. This can lead to large savings in computation because learning a dictionary is far more computationally demanding than applying the dictionary for denoising. For natural images, in general, the dominant types of features in different images can be quite different; for example, a dictionary learned on a cartoon-like or smooth image may not be optimal for denoising an image that contains fine textures. Therefore, for dictionary-based denoising of natural images, it is usually better to learn the dictionary from a set of similar training images or even from the noisy image itself. For CT projections, however, because the local nature of CT projections is not dependent on the scanned object, we expect that a dictionary learned from the projections of one object could be used for denoising the projections of another object.

In order to determine if this is in fact the case, we used the dictionary learned from the projections of the rat (from Section 3.3.2) to denoise the projections of the physical phantom (from Section 3.3.3), and, vice versa, we used the dictionary learned on the projections of the physical phantom to denoise the projections of the rat. In both cases, the results obtained were almost exactly the same as those presented in the Results section above. In particular, none of the values in Tables 3.4 and 3.5 changed by more than 2%. In our opinion, this indicates that the dictionary learned from the projections of one object can be used for denoising the projections of an entirely different object. Of course, if the scan geometry changes, for example if the cone angle is increased or decreased significantly, the learned dictionary may no longer be effective and a new dictionary must be learned. But these settings do not usually change on a commercial scanner.

As a further test of the generalizability of the learned dictionaries in the proposed algorithm, we performed another experiment. In this experiment, the abdominal part of another rat was scanned. The rat in Section 3.3.2 was scanned in the chest region (as can be seen in Figure 3.5), and the tube voltage, tube current, and exposure time were equal to 70 kV, 32 mA, and 16 ms, respectively. In this new scan, the new rat was scanned in the abdominal region using a tube voltage of 50 kV. Furthermore, a 0.2 mm copper filter was used to create a noisier scan than the scan in Section 3.3.2. We applied the proposed algorithm with the dictionary learned from the scan of the rat in Section 3.3.2 on this new scan. We also applied the other algorithms on this scan. A slice of the images produced by the different algorithms from this scan is shown in Figure 3.8. In the same figure, we have also shown a segment of a profile in this slice. The proposed algorithm has resulted in very effective denoising that is better than or comparable with the other methods. Our experience shows that the dictionary learned on the scan of one object can be used for effective denoising of the scan of another object, unless the scan settings, such as the angular spacing between successive projections, change drastically. This can mean large savings in computational time.
As an example, denoising of the scan of the rat in Section 3.3.2 required approximately 1 h, as shown in Table 3.4, while learning the dictionary for the same scan takes longer than this. The dictionary learning time depends on the number of training signals, the initial dictionary used, and the dictionary size. With a good initialization, as we explained in Section 3.2.2, a dictionary can be trained in approximately 3-4 h.

The size of the dictionary has a direct effect on the computational demand and the denoising performance. A larger dictionary is usually more expressive and, in general, can lead to a sparser representation and higher performance. On the other hand, the computational cost of applying the dictionary is directly related to its size. As we mentioned above, in all of the experiments that we reported in this chapter we used a dictionary of size 1024, which means that the number of atoms was twice the signal dimensionality. A natural question is what the effect of the dictionary size is in this application. In other words, one would like to know whether the performance of the proposed algorithm can be improved by increasing the dictionary size, or whether the size of the dictionary may be reduced without a loss in performance.

Figure 3.8: (Top) A slice in the image of the second rat produced by different algorithms, and (Bottom) a small profile segment through this slice. The location of the profile segment has been marked on the slice of the reference image with the line segment "L". (a) reference image, (b) FDK-reconstructed from noisy projections, (c) bilateral filtering, (d) KSVD, (e) BM4D-s, (f) the proposed algorithm, (g) BM4D-i, (h) ANLM.

In our experience, larger dictionaries do not lead to a significant improvement in the denoising performance, while significantly increasing the computational load. Reducing the dictionary size, on the other hand, leads to a slight performance loss. As an example, in Figure 3.9 we show the effect of the size of the dictionary in our experiment with the rat data (Section 3.3.2). In this figure, we plot the RMSE, SSIM, and computational time for several different dictionary sizes. The actual values reported in Table 3.4 were for a dictionary of size 1024. In Figure 3.9 we plot the normalized values of RMSE, SSIM, and computational time, obtained by dividing the values for the various dictionary sizes by the values for a dictionary size of 1024. Note that a low RMSE and computational time and a high SSIM are desirable. From this figure, for dictionary sizes of 768 and 512 the quality of the reconstructed image is only slightly lower than for a dictionary size of 1024. Increasing the dictionary size to 1536 and 2048 does not lead to a noticeable improvement in RMSE and SSIM, while significantly increasing the computational time.

Figure 3.9: Effect of the dictionary size on the denoising performance and computational time for denoising of the projections of the rat (Section 3.3.2).

The observation that the dictionary size can be reduced to the signal dimensionality (i.e., 512) without a substantial loss in performance is surprising. In almost all image-processing applications of learned dictionaries, the dictionary size is at least twice the signal dimensionality. Figure 3.10(a) shows some typical atoms of the dictionary learned from the projections of the rat described in Section 3.3.2.
Because the learned atoms are actually 8 × 8 × 8 cubes, we have randomly selected 50 of these atoms and have shown all of their 8 slices in Figure 3.10(a). In Figure 3.10(b) we have randomly selected 50 atoms learned on the reconstructed image of the rat. A visual comparison of the atoms from the two dictionaries quickly reveals their fundamental difference. Even though there exist similar-looking atoms in the two dictionaries, one can easily see at a glance that the atoms learned on the projections are simpler and smoother. This stems from the fact that the types of patterns that occur in CT projections are much more limited. This is also why we could limit the number of dictionary atoms to the signal dimensionality (512), while for natural images, including medical images, the number of atoms in the dictionary is usually chosen to be at least twice the signal dimensionality. Further reducing the dictionary size below 512 leads to a sharp deterioration of the performance of the proposed algorithm.

Figure 3.10: (a) A full depiction of 50 randomly selected atoms from the dictionary learned from the projections of the rat; each column shows the 8 slices of a single atom. (b) The same, for the dictionary learned from the reconstructed image of the rat.

Chapter 4
Sinogram Denoising using Total Variation

4.1 Introduction

In this chapter, we use a Poisson noise model for the projection measurements and propose denoising algorithms based on total variation (TV) regularization. Unlike our approach in Chapter 3, in this chapter we denoise each projection view separately.

We denote the true and noisy projections with u and v, respectively, and assume that they are of size m × n. Individual pixels of a projection, for example of u, are denoted as u(i, j), i = 1 to m, j = 1 to n. For an arbitrary pixel location, from the probability mass function of a Poisson-distributed variable:

P(v(i, j) | u(i, j)) = e^(−u(i,j)) u(i, j)^v(i,j) / v(i, j)!    (4.1)

Assuming the pixel values are independent, for the entire image we have:

P(v | u) = ∏_{i,j} e^(−u(i,j)) u(i, j)^v(i,j) / v(i, j)!    (4.2)

We ignore the denominator, which is independent of u. Since we want to find a functional to minimize, we consider the negative logarithm of the numerator:

−log P(v | u) ∝ ∑_{i,j} u(i, j) − v(i, j) log(u(i, j))    (4.3)

With this measurement consistency term, total variation denoising can be performed by minimizing the following cost function:

E_λ(u) = ∫_Ω (u − v log u) + λ ∫_Ω |∇u|    (4.4)

where ∇u is the gradient of u. As we mentioned in Section 2.5, because this regularizer is based on the ℓ1-norm of the gradient, it is very effective in preserving image edges while suppressing the noise. Therefore, it has been tremendously successful in reconstruction, deconvolution, and denoising of piecewise-constant images. In recent years, there has been a growing body of research on improving the capabilities of this model [48]. Perhaps the most significant enhancements have been achieved by including higher-order differentials in the model [47, 286].

The use of higher-order derivatives leads to superior results on images that contain piecewise-smooth features. On piecewise-smooth images, the basic TV formulation leads to artificial blocky features known as staircase artifacts. This is because with the basic TV model, piecewise-constant solutions are preferred. Including higher-order differentials, on the other hand, encourages piecewise-smooth solutions. This can be very important for projection measurements in CT.
Even if the imaged object (e.g., the human body) may be modeled as piecewise-constant, its projections will not be piecewise-constant. This is easy to visualize, and we show a simple example in Figure 4.1. This figure shows a 2D slice and a 1D profile from the low-contrast 3D Shepp-Logan phantom alongside a typical cone-beam projection of it. Even though the phantom itself is strictly piecewise-constant, this is not the case for its projection. It is well documented that the basic TV model does not achieve optimal performance on this type of image [47, 199].

Figure 4.1: The central slice of the low-contrast 3D Shepp-Logan phantom (a), and a representative one-dimensional profile of it (b); a cone-beam projection of the same phantom (c), and a one-dimensional profile of the projection (d).

In this chapter, we propose two approaches for tackling this problem. The first approach is to use higher-order differentials. The second approach is to apply the basic TV denoising in a locally adaptive fashion.

4.2 Approach 1 - Employing higher-order derivatives in TV regularization

4.2.1 The proposed algorithm

Following the above discussion, we suggest a regularization function that includes the ℓ1-norm of both the gradient and the Hessian of the image, leading to a cost function of the form:

E(u) = ∫_Ω (u − v log u) + λ_1 ∫_Ω |∇u| + λ_2 ∫_Ω |∇²u|    (4.5)

or, in the discrete image domain:

E(u) = ∑_{i,j} (u(i, j) − v(i, j) log u(i, j)) + λ_1 ∑_{i,j} |∇u(i, j)| + λ_2 ∑_{i,j} |∇²u(i, j)|    (4.6)

The norms of the gradient and the Hessian in the discrete image domain are defined as follows:

|∇u(i, j)| = (D_x u(i, j)² + D_y u(i, j)²)^(1/2)    (4.7)

|∇²u(i, j)| = (D_xx u(i, j)² + 2 D_xy u(i, j)² + D_yy u(i, j)²)^(1/2)    (4.8)

where D_x, D_y, D_xx, D_xy, and D_yy are the first- and second-order difference operators defined as follows:

D_x u(i, j) = u(i+1, j) − u(i, j)
D_y u(i, j) = u(i, j+1) − u(i, j)
D_xx u(i, j) = u(i+1, j) − 2u(i, j) + u(i−1, j)
D_yy u(i, j) = u(i, j+1) − 2u(i, j) + u(i, j−1)
D_xy u(i, j) = u(i+1, j+1) − u(i+1, j) − u(i, j+1) + u(i, j)    (4.9)

Here, we have provided the definitions for the interior pixels. For the boundary pixels we assume periodic boundary conditions, as in [248, 332]. Henceforth, we work only in the discrete domain, but to simplify the expressions we drop the pixel indices and show them only when necessary.

To derive a minimization algorithm for the functional E(u), we follow the split Bregman iterative framework [117], which is a very efficient algorithm for ℓ1-regularized problems. The split Bregman method can be considered a member of larger families of algorithms such as the alternating direction method of multipliers [33] or proximal methods [66]. In the split Bregman method, the unconstrained optimization problem is first converted into a constrained problem by introducing new variables. For E(u) in (4.6), we write the corresponding constrained problem by introducing three new variables (f, g, and h):

minimize  ∑ (f − v log f) + λ_1 ∑ |g| + λ_2 ∑ |h|
subject to  f = u,  g = ∇u,  h = ∇²u    (4.10)

This constrained problem can now be solved through the following Bregman iteration:

Initialize: u⁰ = v, f⁰ = v, g⁰ = ∇v, h⁰ = ∇²v, b_1⁰ = b_2⁰ = b_3⁰ = 0
while ‖u^k − u^(k−1)‖_2 > ε
    [u^(k+1), f^(k+1), g^(k+1), h^(k+1)] = argmin_{u,f,g,h} ∑ (f − v log f) + λ_1 ∑ |g| + λ_2 ∑ |h|
        + (µ_1/2) ∑ (f − u − b_1^k)² + (µ_2/2) ∑ (g − ∇u − b_2^k)² + (µ_3/2) ∑ (h − ∇²u − b_3^k)²    (4.11)
    b_1^(k+1) = b_1^k + u^(k+1) − f^(k+1)
    b_2^(k+1) = b_2^k + ∇u^(k+1) − g^(k+1)
    b_3^(k+1) = b_3^k + ∇²u^(k+1) − h^(k+1)
end

where the µ_i are the algorithm parameters and the b_i are auxiliary variables.
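For concreteness, the discrete operators in (4.7)-(4.9) and the two regularization terms of (4.6) can be evaluated in a few lines. The sketch below uses periodic boundary conditions via np.roll, as assumed above; it is only an illustration, and the image u and weights lam1, lam2 are placeholders.

```python
import numpy as np

def Dx(u):  return np.roll(u, -1, axis=0) - u            # u(i+1, j) - u(i, j)
def Dy(u):  return np.roll(u, -1, axis=1) - u            # u(i, j+1) - u(i, j)
def Dxx(u): return np.roll(u, -1, axis=0) - 2*u + np.roll(u, 1, axis=0)
def Dyy(u): return np.roll(u, -1, axis=1) - 2*u + np.roll(u, 1, axis=1)
def Dxy(u): return (np.roll(np.roll(u, -1, axis=0), -1, axis=1)
                    - np.roll(u, -1, axis=0) - np.roll(u, -1, axis=1) + u)

def tv_terms(u):
    """Return sum |grad u| and sum |Hessian u| as defined in (4.7)-(4.8)."""
    grad_mag = np.sqrt(Dx(u)**2 + Dy(u)**2)
    hess_mag = np.sqrt(Dxx(u)**2 + 2*Dxy(u)**2 + Dyy(u)**2)
    return grad_mag.sum(), hess_mag.sum()

# Regularization part of the cost (4.6) for a projection u:
# tv1, tv2 = tv_terms(u); reg = lam1*tv1 + lam2*tv2
```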
In the Bregman iterative approach, the updates of the auxiliary variables b_i replace the updates of the subgradients of the objective function. We should also note that if u is an m-by-n image, i.e., u ∈ R^(m×n), then f, b_1 ∈ R^(m×n), ∇u, g, b_2 ∈ (R^(m×n))², and ∇²u, h, b_3 ∈ (R^(m×n))⁴. We denote the components of g and h as g = [g_x, g_y] and h = [h_xx, h_xy, h_yx, h_yy], and similarly for ∇u, ∇²u, b_2, and b_3.

Although it may seem that the above modification has made the problem harder, the efficiency of the split Bregman scheme lies in the fact that the minimization problem can now be split into smaller problems that can be solved much more easily. Therefore, the large minimization problem in the above algorithm is solved by iteratively minimizing with respect to each of the four variables (u, f, g, and h), resulting in the following algorithm:

Initialize: u⁰ = v, f⁰ = v, g⁰ = ∇v, h⁰ = ∇²v, b_1⁰ = b_2⁰ = b_3⁰ = 0
while ‖u^k − u^(k−1)‖_2 > ε
    For i = 1 : N
        f^(k+1) = argmin_f ∑ (f − v log f) + (µ_1/2) ∑ (f − u^k − b_1^k)²
        u^(k+1) = argmin_u (µ_1/2) ∑ (f^(k+1) − u − b_1^k)² + (µ_2/2) ∑ (g^k − ∇u − b_2^k)² + (µ_3/2) ∑ (h^k − ∇²u − b_3^k)²
        g^(k+1) = argmin_g λ_1 ∑ |g| + (µ_2/2) ∑ (g − ∇u^(k+1) − b_2^k)²
        h^(k+1) = argmin_h λ_2 ∑ |h| + (µ_3/2) ∑ (h − ∇²u^(k+1) − b_3^k)²
    end
    b_1^(k+1) = b_1^k + u^(k+1) − f^(k+1)
    b_2^(k+1) = b_2^k + ∇u^(k+1) − g^(k+1)
    b_3^(k+1) = b_3^k + ∇²u^(k+1) − h^(k+1)
end    (4.12)

To avoid complicating the notation, we have not introduced additional indices for the variable updates in the For loop. In fact, for many problems only one iteration of this loop is sufficient for fast convergence of the overall algorithm [117]. The efficiency of the split Bregman scheme depends entirely on how fast the sub-problems can be solved. In the following, we show that for our problem the four sub-problems can be solved very efficiently.

Minimization with respect to f: Returning to the notation with pixel indices, this sub-problem is:

f^(k+1) = argmin_f ∑_{i,j} (f(i, j) − v(i, j) log f(i, j)) + (µ_1/2) ∑_{i,j} (f(i, j) − u^k(i, j) − b_1^k(i, j))²    (4.13)

This expression can be written as a sum of scalar minimization problems in terms of individual pixel values. Since the function to be minimized is convex in f(i, j), the solution can be found by setting the derivative to zero. Using basic calculus and the knowledge that f ≥ 0, we can show that the following formula gives the exact solution to this problem:

f^(k+1)(i, j) = B/2 + sqrt( (B/2)² + v(i, j)/µ_1 ),  where  B = u^k(i, j) + b_1^k(i, j) − 1/µ_1    (4.14)

Minimization with respect to u: This sub-problem can be solved through its optimality condition, which can be written as [117, 248]:

[ µ_1 I − µ_2 (D_x^T D_x + D_y^T D_y) + µ_3 (D_xx^T D_xx + 2 D_xy^T D_xy + D_yy^T D_yy) ] u^(k+1) =
    µ_1 (f^(k+1) − b_1^k) − µ_2 ( D_x^T (g_x^k − b_2x^k) + D_y^T (g_y^k − b_2y^k) )
    + µ_3 ( D_xx^T (h_xx^k − b_3xx^k) + 2 D_xy^T (h_xy^k − b_3xy^k) + D_yy^T (h_yy^k − b_3yy^k) )

where D_x^T, D_y^T, D_xx^T, D_xy^T, and D_yy^T are the backward-difference operators corresponding to the forward-difference operators defined in (4.9). For the pixels in the interior of the image, these operators are defined as:

D_x^T u(i, j) = u(i, j) − u(i−1, j)
D_y^T u(i, j) = u(i, j) − u(i, j−1)
D_xx^T u(i, j) = D_xx u(i, j) = u(i+1, j) − 2u(i, j) + u(i−1, j)
D_yy^T u(i, j) = D_yy u(i, j) = u(i, j+1) − 2u(i, j) + u(i, j−1)
D_xy^T u(i, j) = u(i, j) − u(i−1, j) − u(i, j−1) + u(i−1, j−1)    (4.15)

Despite its long expression, the above equation is a system of linear equations in u^(k+1). Due to the structure of the forward and backward difference matrices, the system matrix is diagonally dominant. When the size of the image is not small, an efficient algorithm for finding a good approximate solution is the Gauss-Seidel method [118, 248].
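The closed-form update in (4.14) is a single vectorized expression. The following minimal sketch applies it to whole arrays at once (v is the noisy projection, u and b1 the current iterates, and mu1 the corresponding Bregman parameter); it is an illustration of the formula rather than the exact code used in this work.

```python
import numpy as np

def update_f(v, u, b1, mu1):
    """Exact minimizer of sum(f - v*log f) + (mu1/2)*sum((f - u - b1)**2),
    applied pixel-wise as in Eq. (4.14); the result is nonnegative."""
    B = u + b1 - 1.0 / mu1
    return B / 2.0 + np.sqrt((B / 2.0) ** 2 + v / mu1)
```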
Minimization with respect to g: This problem reads:

g^(k+1) = argmin_g λ_1 ∑_{i,j} |g(i, j)| + (µ_2/2) ∑_{i,j} (g(i, j) − ∇u^(k+1)(i, j) − b_2^k(i, j))²

This problem is also equivalent to a set of scalar problems in terms of individual pixel values. Its solution is a simple extension of the soft-thresholding operation [249, 328]:

g_x^(k+1)(i, j) = max( |B(i, j)| − λ_1/µ_2, 0 ) B_x(i, j) / |B(i, j)|
g_y^(k+1)(i, j) = max( |B(i, j)| − λ_1/µ_2, 0 ) B_y(i, j) / |B(i, j)|
where B(i, j) = [B_x(i, j), B_y(i, j)] = [ b_2x^k(i, j) + D_x u^(k+1)(i, j), b_2y^k(i, j) + D_y u^(k+1)(i, j) ]    (4.16)

Minimization with respect to h: This problem is very similar to the minimization with respect to g above. Its solution is similarly a generalization of the soft-thresholding operation [249]:

h_xx^(k+1)(i, j) = max( |C(i, j)| − λ_2/µ_3, 0 ) C_xx(i, j) / |C(i, j)|
h_xy^(k+1)(i, j) = max( |C(i, j)| − λ_2/µ_3, 0 ) C_xy(i, j) / |C(i, j)|
h_yx^(k+1)(i, j) = max( |C(i, j)| − λ_2/µ_3, 0 ) C_yx(i, j) / |C(i, j)|
h_yy^(k+1)(i, j) = max( |C(i, j)| − λ_2/µ_3, 0 ) C_yy(i, j) / |C(i, j)|
where C(i, j) = [C_xx(i, j), C_xy(i, j), C_yx(i, j), C_yy(i, j)]
    = [ b_3xx^k(i, j) + D_xx u^(k+1)(i, j), b_3xy^k(i, j) + D_xy u^(k+1)(i, j), b_3yx^k(i, j) + D_yx u^(k+1)(i, j), b_3yy^k(i, j) + D_yy u^(k+1)(i, j) ]    (4.17)

If we set λ_2 = 0 in Equations (4.5) or (4.6), we obtain a simplified model with standard (i.e., first-order) TV regularization. This greatly simplifies the algorithm because the variables h and b_3 are also removed. We will refer to this simplified model as "TV sinogram denoising" and to the full model described above as "(TV + TV2) sinogram denoising", and we will present the results for both models. This allows us to see whether, and to what extent, the more complex model with two regularizers improves the results.

To evaluate the proposed denoising algorithm, we applied it on sets of simulated and real cone-beam sinograms. We compare our proposed algorithm with the bilateral filtering algorithm that we described in Section 3.3.

4.2.2 Simulation experiment

Noisy cone-beam projections were simulated from a 3D low-contrast Shepp-Logan phantom according to the model in Equation (1.1). We used two different values of N_0^i = 1000 and N_0^i = 100 to simulate two sets of projections with different levels of noise. For each ray, the expected number of detected photons (N_d^i) is given by Equation (1.1). The actual detected photon count was simulated as a Poisson random variable with mean equal to N_d^i. We will refer to the scans with N_0^i = 1000 and N_0^i = 100 as low-noise and high-noise, respectively. The phantom size was 256³ voxels and the projections were each 300 × 300 pixels in size. Each simulated scan consisted of 720 projections between 0◦ and 360◦.

Because for this simulated experiment we have access to the true projections, we can quantitatively compare the denoised projections with the true projections. To this end, we computed the value of two criteria: (1) the root-mean-square of the error (RMSE), where the error is defined as the difference between the denoised projection and the true projection (i.e., without the Poisson noise), and (2) the mutual information (MI) between the denoised projections (û) and the true projections (u*), computed as [259]:

MI(u*, û) = ∑_{i=1}^{h} ∑_{j=1}^{h} q(u*_i, û_j) log( q(u*_i, û_j) / ( p(u*_i) p(û_j) ) )    (4.18)

Here, p and q represent the marginal and joint probability distribution functions, respectively. We used histograms of u* and û for estimating these probability densities, and h is the number of bins in the histograms. We normalized the computed MI(u*, û) by dividing it by MI(u*, u*).
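The MI criterion in (4.18), including the normalization by MI(u*, u*), can be estimated from a joint histogram of the two images. The sketch below is one straightforward way to do this; the number of bins is an arbitrary choice.

```python
import numpy as np

def mutual_information(a, b, bins=64):
    """Estimate MI between two images from their joint histogram, Eq. (4.18)."""
    q, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    q = q / q.sum()                      # joint probability q(a_i, b_j)
    pa = q.sum(axis=1, keepdims=True)    # marginal of a
    pb = q.sum(axis=0, keepdims=True)    # marginal of b
    nz = q > 0                           # empty bins contribute nothing
    return float(np.sum(q[nz] * np.log(q[nz] / (pa @ pb)[nz])))

def normalized_mi(u_true, u_denoised, bins=64):
    return (mutual_information(u_true, u_denoised, bins)
            / mutual_information(u_true, u_true, bins))
```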
The results of this comparison are presented in Table 4.1. As we mentioned above, "TV sinogram denoising" means setting λ_2 = 0 and adjusting λ_1 in our model. We will discuss the role of the parameter values and some possible approaches to selecting proper values later in this chapter. The results shown for TV sinogram denoising in Table 4.1 were obtained with λ_1 = 2, µ_1 = 10, and µ_2 = 0.1. The results shown for (TV + TV2) sinogram denoising in Table 4.1 were obtained with λ_1 = 2, λ_2 = 0.1, µ_1 = 10, µ_2 = 0.5, and µ_3 = 0.01. For bilateral filtering, the results are shown for the bandwidth of P2 equal to 2.2, which gave us the lowest RMSE. The numbers in this table indicate that the proposed TV sinogram denoising methods, especially the (TV + TV2) sinogram denoising algorithm, outperform the method based on bilateral filtering.

                          Bilateral filtering   TV denoising   (TV + TV2) denoising
N_0 = 100     RMSE        0.0420                0.0364         0.0308
              MI          0.277                 0.351          0.375
N_0 = 1000    RMSE        0.0192                0.0174         0.0163
              MI          0.408                 0.425          0.442

Table 4.1: Comparison of different denoising algorithms in terms of the RMSE and MI of the denoised projections on the data simulated from the low-contrast Shepp-Logan phantom.

In order to determine the effect of sinogram denoising on the quality of the reconstructed image, we used the FDK algorithm to reconstruct the image of the phantom from the noisy and denoised projections. In Figure 4.2 we show the central slices of the reconstructed images. This figure clearly shows that the image reconstructed from the projections denoised with the (TV + TV2)-model has a better quality than those obtained with bilateral filtering and TV denoising, especially for the high-noise case.

For a quantitative comparison, we computed the RMSE, where the error is defined as the difference between the reconstructed image and the true phantom image, and the structural similarity index (SSIM) between the two images as given in Equation (3.6). We have summarized the results of this quantitative comparison in Table 4.2. The numbers in this table indicate that denoising of the projections using the (TV + TV2)-model results in a better image. The difference is more significant for the high-noise case.

                          No denoising   Bilateral filtering   TV denoising   (TV + TV2) denoising
N_0 = 100     RMSE        0.360          0.148                 0.124          0.111
              SSIM        0.220          0.396                 0.451          0.480
N_0 = 1000    RMSE        0.210          0.091                 0.083          0.077
              SSIM        0.404          0.634                 0.672          0.690

Table 4.2: RMSE and SSIM for the images of the low-contrast Shepp-Logan phantom reconstructed from noisy projections and from projections denoised with different denoising algorithms.

Figure 4.2: The central slice of the low-contrast Shepp-Logan phantom reconstructed from (a) noisy projections, and from projections denoised using (b) bilateral filtering, (c) TV sinogram denoising, and (d) (TV + TV2) sinogram denoising. The top row is for reconstruction from the low-noise projections and the bottom row is for reconstruction from the high-noise projections. The location of the ROI that has been displayed on the bottom left of each slice has been marked by a rectangle in part (d) of this figure.

4.2.3 Experiments with real micro-CT data

The scanner described in Section 3.3.2 was used to scan the physical phantom. The phantom was scanned twice. The first scan consisted of 720
projections between 0◦ and 360◦ at 0.5◦ intervals. For this scan, the tube voltage, tube current, and exposure time were chosen to be, respectively, 70 kV, 40 mA, and 25 ms. We used the full set of 720 projections from this scan to reconstruct a high-quality image. This image was created by first using the FDK algorithm to reconstruct an initial image and then applying 50 iterations of the MFISTA algorithm [17] to further improve its quality. We will refer to this image as "the reference image" and will use it as the ground-truth for evaluating the denoising algorithms. The second scan of the phantom consisted of 360 projections between 0◦ and 360◦ at 1◦ intervals. For this scan, the tube voltage, tube current, and exposure time were chosen to be 50 kV, 32 mA, and 16 ms, respectively. This was the lowest possible setting in terms of mAs, as the scanner did not operate under 0.5 mAs. Moreover, a 0.1-mm copper filter was used to further reduce the radiation dose, further increasing the noise level. The resulting scan was very noisy. Since we do not have the true projections, in this experiment we evaluate the denoising algorithms in the image domain.

In order to evaluate the quality of the reconstructed images, we compared them with the reference image by computing the RMSE and SSIM. The results are summarized in Table 4.3. In addition, we used two of the modules in the phantom to evaluate the spatial resolution and the noise level in the reconstructed images. We used the plastic-air edge for estimating the modulation transfer function (MTF). We used the method proposed in [39] to estimate the MTF over the range of spatial frequencies [0, 5 mm⁻¹]. We report the spatial frequency at which the normalized MTF reached a value of 0.1 as the indicator of the spatial resolution. We use the standard deviation of the voxel values in a uniform polycarbonate disk in the phantom as an indicator of the noise level. The values in Table 4.3 clearly show that the proposed TV denoising model is better than bilateral filtering. Moreover, the full (TV + TV2) model has led to a higher image quality than the simpler TV-denoising model.

                       No denoising   Bilateral filtering   TV denoising   (TV + TV2) denoising
RMSE                   0.0393         0.0255                0.0228         0.0210
SSIM                   0.460          0.646                 0.696          0.707
Spatial resolution     3.55           3.58                  3.61           3.61
Noise level            0.0383         0.0191                0.0172         0.0160

Table 4.3: Comparison of different sinogram denoising algorithms in terms of the quality of the reconstructed image of the physical phantom.

For visual comparison, in Figure 4.3 we show a slice of the phantom reconstructed from noisy and denoised projections. This figure agrees with the quantitative comparison in Table 4.3. Our proposed TV-based denoising seems to have resulted in a higher-quality image than bilateral filtering. Moreover, the (TV + TV2)-model has produced a slightly better image than the TV-model. To show the difference between the images reconstructed from projections denoised with the different algorithms more clearly, in Figure 4.4 we show a profile in the reconstructed images.
The location of this profile has been marked with a white vertical line on the slice of the reference image in Figure 4.3(a). From the profiles in Figure 4.4, the image reconstructed from the projections denoised using the (TV+TV2)-model is much closer to the reference image, which agrees with the quantitative evaluations presented in Table 4.3.

Figure 4.3: A slice of the image of the physical phantom reconstructed from noisy and denoised projections. (a) the reference image, (b) reconstructed from noisy projections, and reconstructed from projections denoised using (c) bilateral filtering, (d) the TV-model, and (e) the (TV+TV2)-model.

The values of the image quality metrics presented in Table 4.3 and the images shown in Figures 4.3 and 4.4 were obtained using one particular set of parameter values for each algorithm. However, they do not present a complete comparison of the performance of the different denoising algorithms. Comparing different denoising algorithms requires a more detailed look at the trade-off between noise and spatial resolution. In Figure 4.5 we show plots of the noise level versus spatial resolution for a range of parameter values for the different algorithms. Noise level and spatial resolution were computed as described above using the slanted-edge module and the uniform disk module in the phantom. For bilateral filtering, we show the plot for the bandwidth of P2 in the range [0.7, 2.8]. For the proposed TV-based algorithm we present three curves, each for a different value of the regularization parameter λ_2 ∈ {0, 0.2, 1}. The curve for λ_2 = 0 corresponds to "TV-denoising", i.e., regularization only in terms of the gradient. For each value of λ_2, we applied the proposed algorithm for ten values of λ_1 in the range [0.5, 10]. Note that a high spatial resolution and a low noise level are desirable.

From this figure, it is clear that the proposed TV-based denoising algorithm outperforms bilateral filtering. For the range of values of λ_1 and λ_2 that we tried in this experiment, the value of λ_1 seems to affect the behavior of the proposed algorithm more strongly. The Hessian regularization has a very positive effect. When λ_2 > 0, i.e., when the Hessian regularization term exists, the proposed algorithm can achieve better results in terms of noise level and spatial resolution, i.e., a lower noise level and a higher spatial resolution. Moreover, when λ_2 > 0, the performance is more stable with regard to changes in λ_1. This can be seen by comparing the three curves that have been shown for the TV-based algorithm. The curve corresponding to λ_2 = 0 is influenced more strongly by the change in λ_1, whereas when λ_2 > 0 the algorithm is less sensitive to changes in λ_1.

Figure 4.4: A profile in the reconstructed images of the physical phantom reconstructed from (a) noisy projections, and from projections denoised with (b) bilateral filtering, (c) the TV-model, and (d) the (TV+TV2)-model. The blue line in each figure shows the profile in the reference image. The location of this profile has been marked in Figure 4.3(a).

For a complete characterization of the spatial resolution of the images reconstructed from the projections denoised by the different algorithms, we plot the estimated MTFs in Figure 4.6. Our approach to estimating the MTFs shown in this figure differs from the spatial resolutions shown in Table 4.3 in an important way.
As can be seen in Table 4.3, the three algorithms are different in terms of both spatial resolution and noise level. However, as we mentioned above, noise and spatial resolution are inter-dependent. Therefore, estimation of the MTF curves in Figure 4.6 was done while matching the noise level of the reconstructed image for the different algorithms. We adjusted the tuning parameters of the different algorithms (i.e., σ_i for bilateral filtering and λ_1 and λ_2 for the proposed algorithm) such that the noise level was the same in the reconstructed images for all three algorithms. We then estimated the MTF for the different algorithms. From the estimated MTFs in Figure 4.6 it can be seen that the TV-based denoising leads to a higher MTF at all spatial frequencies, especially for spatial frequencies above 2 mm⁻¹. All three algorithms are very close up to a spatial frequency of approximately 1 mm⁻¹. The MTFs for the two TV-based algorithms are close, but the (TV + TV2)-model results in a slightly higher MTF at higher spatial frequencies.

Figure 4.5: Plots of noise level versus spatial resolution for a range of parameter values for bilateral filtering and the proposed TV-based denoising algorithm.

Figure 4.6: The estimated noise-matched MTF for different sinogram denoising algorithms.

The same scanner was used to scan two dead rats. The first scan had relatively less noise, whereas the second scan was much noisier. We will refer to these two scans as the low-noise and high-noise rat scans. Each of the two scans consisted of 720 projections between 0◦ and 360◦ at 0.5◦ intervals. For both scans, the tube voltage, tube current, and exposure time were set to 50 kV, 32 mA, and 16 ms, respectively. However, for the high-noise scan we used a 0.2-mm copper filter. For both the low-noise and the high-noise rat scans, we used all 720 projections to reconstruct a high-quality reference image using the same procedure as that described for the physical phantom above. We will use this image as the ground-truth for evaluating the denoising algorithms. For both the low-noise and the high-noise scans, we applied the denoising algorithms on a subset of 360 projections of the same scan and reconstructed the image of the rat using the FDK algorithm.

Table 4.4 shows a summary of the quantitative comparison between the different sinogram denoising algorithms in terms of the RMSE and SSIM of the reconstructed images of the low-noise rat scan. Furthermore, for a visual comparison, we show a typical slice of the reconstructed images in Figure 4.7. Similar to the above experiments, compared with bilateral filtering, the TV-based algorithms have produced better results.

             No denoising   Bilateral filtering   TV denoising   (TV + TV2) denoising
RMSE         0.0220         0.0173                0.0140         0.0126
SSIM         0.622          0.685                 0.711          0.740

Table 4.4: Comparison of different sinogram denoising algorithms in terms of the quality of the reconstructed image of the low-noise rat scan.

In Figure 4.8, we show a slice from the reconstructed images of the high-noise rat scan. All three sinogram denoising algorithms have resulted in a significant improvement in the visual quality of the image reconstructed from this high-noise scan. Similar to the above experiments, the proposed TV-based algorithm appears to have worked better than bilateral filtering.
Moreover, the (TV + TV2) model seems to have more effectively suppressed the noise without blurring the image features.

However, the images shown in Figure 4.8 have been reconstructed using a particular set of parameters for each algorithm. For a better comparison between the different denoising algorithms, we examine the trade-off between noise suppression and image sharpness. For this purpose, as a measure of the noise level we computed the contrast-to-noise ratio (CNR). We selected the cubes b and s as shown on the slice of the reference image in Figure 4.8(a). We estimated the CNR using the following equation:

CNR = |µ_s − µ_b| / ( (σ_s + σ_b)/2 )    (4.19)

where µ and σ denote, respectively, the mean and the standard deviation of the voxels in each cube. As a measure of image sharpness, we computed the maximum slope along the line "L" shown in Figure 4.8(a). This line lies on an edge between soft tissue and fat. If the spatial resolution is high, the slope of this edge will also be high. On the other hand, if the image is over-smoothed this slope becomes small. We denote the computed slope along the line "L" with SL and use it as a measure of image sharpness. A similar approach was used to quantify the spatial resolution in [219]. Therefore, a small SL indicates that the image is over-smoothed and the spatial resolution is low. On the other hand, a large value of SL indicates that the image is sharp and the spatial resolution is high.

Figure 4.7: A slice of the images reconstructed from the low-noise rat scan; (a) the reference image, (b) reconstructed from noisy projections, and reconstructed from projections denoised with (c) bilateral filtering, (d) the TV model, and (e) the (TV + TV2)-model.

Figure 4.8: A slice of the images reconstructed from the high-noise rat scan; (a) the reference image, (b) reconstructed from noisy projections, and reconstructed from projections denoised with (c) bilateral filtering, (d) the TV model, (e) the (TV + TV2)-model.

Figure 4.9 shows the plots of SL versus CNR for different parameter values for the bilateral filtering and TV-based denoising algorithms. Note that a high CNR and a high SL are desirable. For bilateral filtering, we changed the bandwidth of P2 in the range [0.7, 2.8]. For the proposed TV-based algorithm, we found that the parameter values λ_1 = 4 and λ_2 = 0.4 lead to good results. Therefore, in order to investigate the role of these two parameters, we first kept λ_2 = 0.4 constant and changed λ_1 in the range [0, 10]. Then, we kept λ_1 = 4 constant and changed λ_2 in the range [0, 2.0]. Therefore, for the proposed TV-based algorithm we have two curves that show the effect of tuning λ_1 and λ_2. It is clear from the proposed cost function in Equation (4.5) that λ_1 and λ_2 determine the strength of the regularization in terms of the gradient and the Hessian, respectively. The plots in Figure 4.9 show that the proposed TV-based denoising algorithm can achieve higher SL and CNR than bilateral filtering. It is also clear that both regularization terms play an important role in the performance of the proposed algorithm, because when either λ_1 or λ_2 decreases to zero, one or both of the criteria decrease. In particular, when λ_2 → 0, both SL and CNR decrease. This clearly indicates the importance of the regularization in terms of the Hessian.

Figure 4.9: The computed SL (a measure of image sharpness or spatial resolution) versus CNR for a range of parameter values for different sinogram denoising algorithms.
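Both criteria are easy to compute once the ROIs have been chosen on the reference image. The sketch below is illustrative only; the coordinates of the cubes s and b and of the line L are hypothetical placeholders.

```python
import numpy as np

def cnr(volume, cube_s, cube_b):
    """Contrast-to-noise ratio of Eq. (4.19) between the ROIs s and b."""
    s, b = volume[cube_s], volume[cube_b]
    return abs(s.mean() - b.mean()) / ((s.std() + b.std()) / 2.0)

def sl(volume, line_voxels):
    """SL: maximum slope of the intensity profile along the line L."""
    profile = np.array([volume[idx] for idx in line_voxels])
    return float(np.max(np.abs(np.diff(profile))))

# Hypothetical ROIs and line (coordinates are placeholders):
# cube_s = (slice(200, 210), slice(150, 160), slice(300, 310))
# cube_b = (slice(240, 250), slice(150, 160), slice(300, 310))
# line_L = [(225, 155, x) for x in range(280, 320)]
# print(cnr(recon, cube_s, cube_b), sl(recon, line_L))
```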
4.2.4 Discussion

Overall, the results of our experiments with simulated and real data show that the proposed algorithm is highly effective in suppressing the noise in the projection measurements in cone-beam CT. This is evident from the quantitative evaluation presented in Table 4.1. The effect of the noise suppression on the quality of the FDK-reconstructed images is substantial, as can be seen from Figures 4.2-4.9 and from the objective image quality metrics in Tables 4.2-4.4. Our results suggest that the proposed TV-based algorithm is superior to the method based on bilateral filtering both in terms of spatial resolution and noise suppression. The difference between the proposed TV sinogram denoising and (TV + TV2) sinogram denoising models was also significant, on both the simulated data and the real data. This shows a clear gain in the denoising performance from including the regularization in terms of the Hessian. We should note, however, that this gain in performance comes at the cost of more computation. For example, with our Matlab implementation on a Windows 7 PC with a 3.4 GHz Intel Core i7 CPU, denoising of one projection from the rat scan takes approximately 33 seconds with the TV model and 46 seconds with the (TV + TV2) model.

The choice of the algorithm parameters can significantly influence its performance, as shown by our results in Figures 4.5 and 4.9. There are two sets of parameters in the proposed algorithm. The regularization parameters, λ_1 and λ_2, control the degree of regularization. They must be selected according to the desired quality of the final image. The Bregman parameters, µ_1, µ_2, and µ_3, mostly influence the convergence of the algorithm [117, 248]. The simplified model with only first-order TV regularization has three parameters, λ_1, µ_1, and µ_2, which makes it easier to find good parameter values. For the full model, too, one can first ignore the second-order differential terms and find proper values for λ_1, µ_1, and µ_2, and then proceed to find good values for λ_2 and µ_3.

A well-known method for selecting λ_1 was proposed by Chambolle [46]. This method was proposed for the case of additive Gaussian noise; in our experience it works for the case of Poisson noise as well. For this method to work, one needs to have a prior estimate of the noise level in the image. If the average noise variance in the noisy image v is σ², one starts with an arbitrary value of the regularization parameter λ = λ⁰ and updates λ as follows:

Repeat until convergence
    u^k = argmin_u Ψ(u, v) + λ^k ∫_Ω |∇u|
    λ^(k+1) = λ^k (Nσ) / Ψ(u^k, v)
end

The same strategy has been suggested for finding λ_1 for Poisson denoising in [112]. That paper also developed an empirical equation relating the regularization parameter λ_1 and the Poisson noise variance, which has the form λ_1 = 1/(72.4/σ + 97.7/σ²), where σ² is the noise variance. This equation can be very useful for choosing a good initial value for λ_1, which can then be improved using Chambolle's method mentioned above. For the simplified model, we found that choosing µ_1 ≈ 10λ_1 and µ_2 ≈ 0.1λ_1 led to good results. For the full model, we had good results with λ_2 < λ_1, usually λ_2 ≈ λ_1/10, and µ_3 ≈ λ_2/10, which are not very different from the values suggested for the inpainting problem in [249].
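A minimal sketch of this selection rule is given below. The functions tv_denoise (any solver of the TV-regularized problem) and psi (the data-fidelity term Ψ) are assumed to be supplied by the caller; neither is a fixed API of the method described above, and the stopping tolerance is an arbitrary choice.

```python
import numpy as np

def select_lambda(v, sigma, tv_denoise, psi, lam0=1.0, n_iter=20, tol=1e-3):
    """Sketch of the lambda-update rule quoted above: repeatedly denoise with
    the current lambda and rescale it by (N*sigma) / Psi(u_k, v)."""
    N = v.size
    lam = lam0
    for _ in range(n_iter):
        u = tv_denoise(v, lam)                 # u^k = argmin Psi(u, v) + lam*TV(u)
        lam_new = lam * (N * sigma) / psi(u, v)
        if abs(lam_new - lam) <= tol * lam:
            break
        lam = lam_new
    return lam

# Example with a Gaussian data-fidelity term (an assumption):
# psi = lambda u, v: 0.5 * np.sum((u - v) ** 2)
```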
4.3 Approach 2 - Locally adaptive regularization

4.3.1 The proposed algorithm

This approach is based on a comparison between the optimality conditions for the TV denoising models for Gaussian and Poisson noise [175]:

Gaussian:  E_λ(u) = (1/2)‖u − v‖²_2 + λ ∫_Ω |∇u|,    optimality condition:  (u − v) + λ p = 0
Poisson:   E_λ(u) = ∫_Ω (u − v log u) + λ ∫_Ω |∇u|,   optimality condition:  (u − v) + (λu) p = 0

where p is a sub-gradient of ∫_Ω |∇u|. The only difference between the two equations is the dependence of the regularization parameter on u in the Poisson case. This suggests that a stronger smoothing must be applied where the signal has larger values. This outcome agrees with what we expect, since under the Poisson distribution the noise variance is proportional to the signal intensity.

Minimization of the cost function in (4.4) is not a straightforward problem. One approach is to first replace |∇u| with sqrt(|∇u|² + ε) for a small ε > 0 and then to apply a gradient descent iteration [175]. Another approach, suggested in [282], is to use a Taylor expansion of the data fidelity term and to minimize this approximate model. Here, we use an algorithm that was developed to solve the original ROF denoising problem for Gaussian noise. However, we denoise each sinogram pixel separately by minimizing E_λ(u) in a small neighborhood around that pixel, and with a regularization parameter inspired by the optimality condition described above.

As we mentioned in the previous section, a heavily researched approach to reducing the staircase effect is to replace the ℓ1 norm of the gradient with the ℓ1 norm of higher-order differential operators. A less sophisticated approach, but one that has a trivial implementation, is to perform the total variation minimization locally. This approach has also been shown to alleviate the staircase effect [195]. Moreover, with a local minimization strategy, if the size of the neighborhood considered in the minimization is small enough, one can safely assume that the sinogram intensity and noise level are approximately constant. Therefore, a solution based on the ROF's original model will be a good approximation to the solution of the model based on Poisson noise. This way, we can utilize efficient existing algorithms for the ROF model while avoiding the staircase artifacts.

Since our approach is based on Chambolle's famous algorithm [46], we briefly describe this algorithm here. This algorithm minimizes the following cost function, which is the same as the TV denoising model for Gaussian noise that we described in Section 2.5:

E_λ(u) = (1/2)‖u − v‖²_2 + λ ∫_Ω |∇u|    (4.20)

If we denote by X and Y the spaces of the image u and its gradient ∇u, respectively, then an alternative definition of the total variation of u is:

∑_{i,j} |∇u|_{i,j} = sup { ⟨p, ∇u⟩_Y : p ∈ Y, |p_{i,j}| ≤ 1 }    (4.21)

Chambolle introduced the discrete divergence operator as the dual of the gradient operator, i.e., ⟨p, ∇u⟩_Y = ⟨−div p, u⟩_X. In the discrete image domain:

(div p)_{i,j} = (p¹_{i,j} − p¹_{i−1,j}) + (p²_{i,j} − p²_{i,j−1})    (4.22)
Because of the duality of the gradient and divergence operators, the total variation can also be written as:

∑_{i,j} |∇u|_{i,j} = sup_{z∈K} ⟨z, u⟩_X,    K = { div p : p ∈ Y, |p_{i,j}| ≤ 1 }    (4.23)

The minimizer of the cost function in (4.20) is then obtained by projecting v onto the set λK:

u = v − π_{λK}(v)    (4.24)

which is equivalent to minimizing the Euclidean distance between v and λ div p, and this can be achieved via the following iteration for computing p:

p⁰ = 0;    p^(n+1)_{i,j} = ( p^n_{i,j} + τ (∇(div p^n − v/λ))_{i,j} ) / ( 1 + τ |(∇(div p^n − v/λ))_{i,j}| )    (4.25)

where τ > 0 is the step size. For a small enough step size, τ ≤ 1/8, the algorithm is guaranteed to converge [46].

Instead of a global solution, we minimize the cost function (4.4) in a small neighborhood of each pixel. To this end, let us denote by ω the set of indices that define the desired neighborhood around the current pixel. For example, for a square neighborhood of size (2m + 1) × (2m + 1) pixels: ω = {(i, j) : i, j = −m : m}. We also consider a normalized Gaussian weighting function on this neighborhood:

W(i, j) = exp( −(i² + j²) / h² )    (4.26)

The local problem then becomes that of minimizing the following cost function:

E_{λ,W}(u′) = (1/2)‖u′ − v_ω‖²_W + λ′ ∫_ω |∇u′|    (4.27)

where ‖·‖²_W denotes the weighted norm with weights W, and v_ω and u′ are images restricted to the window ω around the current pixel. The solution of this local optimization problem is similar to Chambolle's algorithm described above [195]. The only difference is in the update formula for p:

p⁰ = 0;    p^(n+1)_{i,j} = ( p^n_{i,j} + τ (∇(D⁻¹ div p^n − v/λ′))_{i,j} ) / ( 1 + τ |(∇(D⁻¹ div p^n − v/λ′))_{i,j}| )    (4.28)

where D is a diagonal matrix whose diagonal elements are the values of W.

The regularization parameter, λ′, must be chosen according to (4.20). The simplest approach is to set λ′ = λ v(i, j), where λ is a global regularization parameter and v(i, j) is the value of the current pixel in the noisy image. Since v(i, j) is noisy, a better choice is to use a weighted local average as the estimate of the intensity of the true image at the current pixel (note that the maximum-likelihood estimate of the mean of a Poisson process from a set of observations is the arithmetic mean of the observations). Therefore, we suggest the following choice for the local regularization parameter:

λ′ = λ ( ∑_{−a≤i′,j′≤a} W′(i′, j′) v(i − i′, j − j′) ) / ( ∑_{−a≤i′,j′≤a} W′(i′, j′) ),    where W′(i, j) = exp( −(i² + j²) / h′² )    (4.29)

There are several parameters in the proposed algorithm. The global regularization parameter λ controls the strength of the denoising. It should be set based on the desired level of smoothing. The parameter m sets the size of the neighborhood considered around each pixel, which in this study was chosen to be a square window of size (2m + 1) × (2m + 1). Numerical experiments in [195] have shown that in total variation denoising, the influence map of a pixel is usually limited to a radius of approximately 10 pixels for typical values of the regularization parameter. Therefore, a good value for m would be around 10, which is the value we used for all experiments reported in this chapter. The width of the Gaussian weighting function W is adjusted through h. We used h = 2m, which we found empirically to work well. Similarly, a and h′ in (4.29) determine the size of the window and the weights used for determining the local regularization parameter. These have to be set based on the noise level in the image; larger values should be chosen when the noise is stronger. We used a = 4 and h′ = 2a.
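Since the weights W′ do not depend on the pixel location, the whole map of local regularization parameters in (4.29) can be computed at once with a small 2-D convolution. The sketch below is an illustration under the assumption that scipy is available; it uses the values a = 4 and h′ = 2a mentioned above.

```python
import numpy as np
from scipy.ndimage import convolve

def local_lambda_map(v, lam, a=4, h_prime=8.0):
    """Eq. (4.29): lambda'(i, j) = lam * Gaussian-weighted local average of v
    around (i, j), computed on a (2a+1) x (2a+1) window."""
    idx = np.arange(-a, a + 1)
    ii, jj = np.meshgrid(idx, idx, indexing="ij")
    W = np.exp(-(ii ** 2 + jj ** 2) / h_prime ** 2)
    W /= W.sum()                               # normalized weights
    return lam * convolve(v, W, mode="nearest")
```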
A simple implementation of the proposed algorithm can be computationally intensive because it involves solving a minimization problem, albeit a very small one, for every individual pixel in the sinogram. This would be a major drawback, because a big advantage of sinogram denoising methods, compared to iterative image reconstruction methods, is their shorter computational time. To reduce the computational time, after minimizing the local cost function (4.27) around the current pixel, we replace the values of all pixels in the window of size (2a+1) × (2a+1) around the current pixel, instead of just the center pixel, and then shift the window by (2a+1). Our extensive numerical experiments with simulated and real projections showed that with this approach the results are almost identical to the case where only one pixel is denoised at a time. This is the approach that we followed in all experiments reported in this section.

We evaluated the proposed denoising algorithm on simulated projections and on two sets of real low-dose projections of the physical phantom and a rat obtained using the micro-CT scanner. We compared the performance of our proposed algorithm with two other methods:

1. The bilateral filtering algorithm described in Section 3.3. We applied bilateral filtering for several values of the bandwidth of P2, which we denote by σ, in the range [0.7, 2.8], and applied the proposed algorithm for several values of the regularization parameter λ.

2. A nonlocal principal component analysis (NL-PCA) algorithm proposed in [281]. In this method, patches of the image are first clustered using the K-means algorithm. For all patches in a cluster, a Poisson PCA (also known as exponential PCA) is performed to denoise them. The PCA problem is solved using Newton's method. The denoised patches are returned to their original locations and averaged (to account for the patch overlaps) in order to form the denoised image. Patch-based methods are computationally very intensive. Therefore, with this algorithm we used parameter settings that resulted in a reasonable computational time.

4.3.2 Simulation experiment

We simulated 360 noisy cone-beam projections, from 0° to 359°, from a 3D Shepp-Logan phantom according to the model in (1.1). We used two values of N_0^i = 500 and 2000 to simulate two sets of projections, which we will call high-noise and low-noise, respectively. The phantom size was 512×512×512 voxels and the projections were each 700×700 pixels in size.

Figure 4.10 shows one-dimensional profiles of the noisy and denoised projections. The plots in this figure show that the proposed TV-based denoising removes the noise effectively and appears to be superior to bilateral filtering and NL-PCA.

For a quantitative comparison, we computed the root mean square of the error (RMSE), where the error is defined as the difference between the denoised and the true (i.e., noise-free) projections, and the mutual information (MI). Figure 4.11 shows the plots of RMSE and MI. For the proposed algorithm, we have plotted these values for 10 logarithmically-spaced values of λ in the range [0.01, 1], which we found to give the best denoising results.

Figure 4.10: Two typical one-dimensional profiles of the noisy and denoised projections simulated from the Shepp-Logan phantom. The thin blue curve in each plot shows the corresponding noise-free projection. The left column is for the high-noise case and the right column is for the low-noise case. (a) the noisy sinogram, and denoised using (b) bilateral filtering, (c) NL-PCA, and (d) the proposed TV-based algorithm.
For bilateral filtering, following [219], we have plotted these values for 10 linearly-spaced values of σ in the range [0.5, 3.2]. From these plots it is clear that the proposed algorithm has achieved significantly better denoising results than bilateral filtering and NL-PCA. The best results with the proposed algorithm are achieved with λ values around 0.1, and the denoising is too strong for λ > 1. For bilateral filtering, we found that the best denoising results were usually obtained for values of σ close to 3.0, and the performance did not improve, or slightly deteriorated, when σ was increased beyond 3.2. The solid squares on these plots show the optimum value of the corresponding parameter (i.e., lowest RMSE or highest MI). The phantom profiles shown in Figure 4.10 for the proposed algorithm and bilateral filtering were obtained with the parameter values that resulted in the lowest RMSE.

Figure 4.11: Comparison between different denoising algorithms in terms of RMSE and MI for the high-noise projections (top row) and low-noise projections (bottom row) simulated from the Shepp-Logan phantom. Values for the bilateral filtering algorithm are plotted as a function of σ (the bottom horizontal axis), whereas the values for the proposed algorithm are plotted as a function of the regularization parameter λ (the top horizontal axis). The solid squares indicate the points of optimum.

4.3.3 Experiment with real micro-CT data

Cone-beam projections were acquired from the physical phantom using the Gamma Medica micro-CT scanner. Two scans of the phantom were generated:

1. Low-noise scan. Consisting of 720 projections of size 875×568 pixels between 0° and 360° at 0.5° intervals. The tube voltage, tube current, and exposure time were 70 kV, 40 mA, and 25 ms, respectively.

2. High-noise scan. Consisting of 240 projections of size 875×568 pixels between 0° and 360° at 1.5° intervals. The tube voltage, tube current, and exposure time were 50 kV, 32 mA, and 16 mAs, respectively.

We used the low-noise scan to reconstruct a high-quality reference image of the phantom using the FDK algorithm. To evaluate the denoising algorithms, we applied them to the high-noise projections, reconstructed the image of the phantom from the denoised projections using the FDK algorithm, and compared the reconstructed image with the reference image. Similar to the experiment with the simulated projections, for bilateral filtering we performed the denoising for 10 linearly-spaced values of σ in the range [0.5, 3.2]. Similarly, we ran the proposed algorithm with 10 logarithmically-spaced values of λ in the range [0.001, 0.1]. In order to assess the overall quality of the reconstructed images, we computed the RMSE and SSIM. The plots of RMSE and SSIM are shown in Figure 4.12. Compared with both bilateral filtering and NL-PCA, the image reconstructed from projections denoised using the proposed algorithm has a significantly lower RMSE and higher SSIM. The best results in terms of SSIM with the proposed algorithm are obtained with λ = 0.0129, and for the bilateral filtering algorithm with σ = 2.6.
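For reference, here is a small sketch (ours, not from the thesis) of how such figures of merit can be computed between a denoised projection and its noise-free counterpart; the thesis does not specify which mutual-information estimator was used, so a simple joint-histogram estimate is shown.

```python
import numpy as np

def rmse(denoised, reference):
    """Root mean square of the error between denoised and reference projections."""
    return float(np.sqrt(np.mean((denoised - reference) ** 2)))

def mutual_information(a, b, bins=64):
    """Histogram-based estimate (in nats) of the mutual information of two images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```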
Figure 4.13 shows two of the fine coils in the images of the phantom reconstructed from noisy and denoised projections. These coils have thicknesses of 500 µm and 200 µm, corresponding to spatial resolutions of 1 and 2.5 line pairs per mm, respectively. The image shown for the proposed algorithm corresponds to λ = 0.0129 and the image shown for bilateral filtering corresponds to σ = 2.6; as mentioned above, these parameter values led to the highest SSIM. The images show a marked improvement in image quality achieved by sinogram denoising. The proposed algorithm also appears to produce a smoother image without affecting the spatial resolution. In Figure 4.14 we show a profile through the center of the 500-µm coil for the images reconstructed from noisy and denoised projections, as well as the difference between these profiles and that of the reference image for a closer comparison. It is clear from these profiles that the image reconstructed from the projections denoised using the proposed algorithm is closer to the reference image.

Figure 4.12: Performance comparison between different sinogram denoising algorithms in terms of RMSE and SSIM on the scan of the physical phantom. Values for the bilateral filtering algorithm are plotted as a function of σ (the bottom horizontal axis), whereas the values for the proposed algorithm are plotted as a function of the regularization parameter λ (the top horizontal axis). The solid squares indicate the points of optimum.

In order to compare the denoising algorithms in terms of the trade-off between noise and spatial resolution, we followed an approach similar to that in Section 4.2. Specifically, we computed the following two numbers as measures of spatial resolution and noise level in the reconstructed image of the phantom:

Measure of spatial resolution. We estimated the MTF as described in Section 4.2.3 and used the spatial frequency at which the normalized MTF reached a value of 0.10 as a measure of spatial resolution.

Measure of noise level. We selected five cubes in the uniform polycarbonate disk in the phantom, each 10×10×10 voxels, at different locations within this disk, and computed the standard deviation of the voxel values in each cube. We use the average standard deviation of the voxel values in these cubes as a measure of noise level.

In Figure 4.15, we show plots of these two values for the three denoising algorithms. Note that a high spatial resolution and a low noise level are desirable. All three denoising algorithms have improved the quality of the reconstructed image for the range of parameter values used (except for λ = 0.1 with the proposed algorithm). Moreover, the proposed algorithm has achieved better results than bilateral filtering and NL-PCA. Specifically, for λ ∈ [0.0077, 0.0359] the proposed algorithm has achieved both higher spatial resolution and lower noise than bilateral filtering (for any parameter value) and NL-PCA.

Figure 4.13: The 200-µm (top row) and 500-µm (bottom row) coils in the images reconstructed from noisy and denoised projections of the physical phantom; (a) the reference image, (b) without denoising, (c) bilateral filtering, (d) NL-PCA, and (e) the proposed algorithm.

In Figure 4.15, we have also shown plots of the MTF obtained with the three denoising algorithms. All three sinogram denoising algorithms have led to an improvement in the spatial resolution of the reconstructed image. The proposed algorithm has resulted in a higher MTF than bilateral filtering and NL-PCA for all spatial frequencies.
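The two figures of merit above are straightforward to compute; the sketch below (ours) assumes the reconstructed volume is a NumPy array, that the corner coordinates of the uniform cubes are known, and that the normalized MTF has already been estimated on a grid of spatial frequencies.

```python
import numpy as np

def noise_level(volume, cube_corners, size=10):
    """Average standard deviation of voxel values over uniform cubes
    (10x10x10 voxels here) placed inside a homogeneous region."""
    stds = [volume[i:i + size, j:j + size, k:k + size].std()
            for (i, j, k) in cube_corners]
    return float(np.mean(stds))

def resolution_from_mtf(freqs, mtf, level=0.10):
    """Spatial frequency at which the normalized MTF first drops to `level`
    (linear interpolation between the two bracketing samples)."""
    mtf = np.asarray(mtf, dtype=float) / np.max(mtf)
    below = np.where(mtf <= level)[0]
    if below.size == 0:
        return float(freqs[-1])           # MTF never drops below the level
    idx = int(below[0])
    if idx == 0:
        return float(freqs[0])
    f0, f1, m0, m1 = freqs[idx - 1], freqs[idx], mtf[idx - 1], mtf[idx]
    return float(f0 + (m0 - level) * (f1 - f0) / (m0 - m1))
```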
A rat was scanned using the micro-CT scanner. Because the internal organs of the rat constantly moved, it was not possible to create two identical scans with different noise levels as we did for the phantom. Therefore, the rat was scanned only once. The scan consisted of 720 projections of size 875×568 pixels between 0° and 360° at 0.5° intervals, with the tube voltage, tube current, and exposure time equal to 70 kV, 32 mA, and 16 ms, respectively. To create a high-quality reference image from the full set of 720 projections, we first reconstructed an initial image using the FDK algorithm. Then, we used 50 iterations of the MFISTA algorithm [17] to improve the quality of the FDK-reconstructed image. We applied the denoising algorithms on a subset of 240 projections of the same scan (projections at 1.5° intervals) and reconstructed the image of the rat using the FDK algorithm.

Similar to the physical phantom experiment, we use the RMSE and SSIM as measures of the overall closeness of the reconstructed images to the reference image. Figure 4.16 shows these criteria for the three sinogram denoising algorithms. From this figure, denoising of the projections with the proposed algorithm has led to superior results in terms of RMSE and SSIM compared to bilateral filtering and NL-PCA.

Figure 4.14: The left column shows a profile through the 500-µm coil in the images of the physical phantom reconstructed from noisy and denoised projections: (a) without denoising, (b) bilateral filtering, (c) NL-PCA, and (d) the proposed algorithm. In these plots, the blue curve is the profile of the reference image. The right column shows the difference between the profiles shown in the left column and the profile of the reference image.

Figure 4.15: Left: plots of the normalized MTF obtained by different sinogram denoising algorithms. Right: plots of noise level versus spatial resolution for different denoising algorithms. The dashed lines in this plot show the corresponding values for the image reconstructed without sinogram denoising.

For a visual comparison, Figure 4.17 shows a 2D slice of the reconstructed images. For the proposed algorithm and bilateral filtering, the images shown in this figure were obtained using the parameter values that resulted in the highest SSIM, i.e., λ = 0.0129 and σ = 2.6 (see Figure 4.16).

The window of the linear attenuation coefficient, µ, used to display the whole slices is [0, 0.55]. To allow a better visual comparison, we have selected two regions of interest (ROI) within this slice and have shown them in zoomed-in views and with narrower µ-windows. The ROI shown on the top left of each slice contains fat surrounded by soft tissue; this ROI is shown with a magnification factor of 1.5 and with a µ-window of [0.15, 0.20]. The ROI shown on the top right of each slice contains bone surrounded by soft tissue; this ROI is shown with a magnification factor of 2.0 and with a µ-window of [0.18, 0.50]. These images show a strong positive effect of sinogram denoising on the visual quality of the reconstructed image. Moreover, denoising with the proposed algorithm appears to have resulted in a higher-quality image, especially in the soft-tissue ROI.
Figure 4.16: Comparison of different sinogram denoising algorithms on the rat scan.

Figure 4.17: A slice of the image of the rat reconstructed from noisy and denoised projections: (a) the reference image, (b) without denoising, (c) bilateral filtering, (d) NL-PCA, and (e) the proposed algorithm. The locations of the selected ROIs have been marked on the reference image (a).

Figure 4.18: Left: the ROI used to compute the noise level and spatial resolution in the reconstructed images of the rat; the noise level was computed as the standard deviation of voxel values in the cube C, and the spatial resolution was computed as the maximum gradient along the line L. Right: plots of noise level versus spatial resolution for the three denoising algorithms. The dashed horizontal and vertical lines in this plot show the corresponding values for the image reconstructed without sinogram denoising.

In order to compare the denoising algorithms in terms of the trade-off between noise suppression and spatial resolution, we selected the ROI shown in Figure 4.18 and computed the following measures of noise level and spatial resolution:

Measure of spatial resolution. We compute the maximum absolute value of the gradient (i.e., slope) along the line L marked in the ROI shown in Figure 4.18 as a measure of spatial resolution.

Measure of noise level. We consider a cube of size 50×50×50 voxels, the cross-section of which is shown in the displayed ROI. From the reference image, we identified this cube as being highly uniform. Therefore, we computed the standard deviation of the voxel values in this cube as a measure of noise level.

The results are plotted in Figure 4.18. This plot is very similar to the plot shown for the physical phantom experiment in Figure 4.15. The main observations are that all three sinogram denoising algorithms have improved the quality of the reconstructed image in terms of spatial resolution and noise level, and that the proposed algorithm can outperform the bilateral filtering algorithm and NL-PCA with the right selection of the regularization parameter. Specifically, with λ ∈ [0.0129, 0.0359], the proposed algorithm has resulted in lower noise and better spatial resolution than bilateral filtering (with any choice of σ) and NL-PCA.

4.3.4 Discussion

Overall, the results of our experiments show that the proposed algorithm performs better than bilateral filtering. We should emphasize that it is likely that NL-PCA could outperform our proposed TV-based denoising algorithm, albeit at a much higher computational cost. In this study, for NL-PCA we did not use the parameter values that the authors of [281] had suggested. Instead, we chose parameter values that resulted in a relatively short computational time. For example, the authors of [281] suggest patch sizes of 20×20 pixels, but we used patches of size 8×8 pixels.

In order to compare the computational time of the proposed algorithm with that of bilateral filtering and NL-PCA, we considered the denoising of 240 projections of the rat scan. As we mentioned above, each projection in this scan was 875×568 pixels. The proposed TV-based algorithm, implemented in Matlab version R2012b and executed on a Windows 7 PC with 16 GB of memory and a 3.4 GHz Intel Core i7 CPU, needed approximately 6 minutes to denoise all 240 projections.
In comparison, bilateral filtering and NL-PCA needed 8.5 minutes and 42 minutes, respectively, for the same denoising task.

In general, our experience shows that the patch-based algorithm proposed in Chapter 3 can achieve better results than the two TV-based algorithms proposed in this chapter. Another advantage of the patch-based denoising algorithm proposed in Chapter 3 is that it is less sensitive to the choice of its parameters. Both algorithms proposed in this chapter will perform poorly if their parameters are not selected properly. Compared with the algorithm proposed in this section, which has only one parameter, choosing the parameter values is much harder for the algorithm proposed in Section 4.2 because the number of parameters is larger and their interactions are complex. The advantage of the TV-based denoising algorithms, compared with the patch-based algorithms, is that they are usually much faster. Most of the computational time required by the dictionary-based methods is the time needed for learning the dictionary. The results of Chapter 3 show that if the scan geometry does not change, there is no need to learn a new dictionary. However, if the scan geometry or the angular spacing between successive projections changes, a new dictionary will have to be learned. Some other patch-based methods (such as the NL-PCA algorithm) learn the dictionary from the noisy sinogram itself. Therefore, such methods are expected to be very computationally intensive.

Chapter 5

Sinogram Interpolation

5.1 Introduction

As we mentioned previously in this dissertation, in practice there are two basic ways to reduce the radiation dose used in CT imaging: (1) lowering the x-ray photon current, and/or (2) reducing the number of projection measurements taken. However, if filtered-backprojection methods are used to reconstruct an image from such noisy and/or undersampled measurements, the quality of the produced image will be very low. In this chapter, we propose a sinogram interpolation algorithm for cone-beam CT. Our algorithm exploits both the smoothness and the self-similarity of the sinogram. We apply the proposed algorithm to simulated and real cone-beam CT projections and compare it with an algorithm that is based on learned dictionaries.

A schematic of cone-beam CT and a sample sinogram of a brain phantom are shown in Figure 5.1(a). The variation of the photon flux incident on the detectors follows a Poisson distribution. As we mentioned in Section 1.2, after the logarithm transformation the noise in the projection measurements follows approximately a Gaussian distribution with signal-dependent variance. Specifically, let us denote the true line integral of the attenuation coefficient along the line from the x-ray source to the detector indexed by (i, j) in the kth projection by y_t(i, j, k). In other words, i and j indicate the detector location on the detector plane and k indicates the rotation angle, θ. Then the noisy measurement is

y_n(i,j,k) \sim \mathcal{N}\big(y_t(i,j,k), \sigma^2_{ijk}\big), \qquad \text{where } \sigma^2_{ijk} \propto \exp\big(y_t(i,j,k)\big).

In this chapter we use this Gaussian noise model because our experience shows that our proposed interpolation algorithm works well with this noise model, even on very low-dose scans.
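To illustrate this noise model, here is a short sketch (ours, not from the thesis) that adds signal-dependent Gaussian noise to a noise-free log-transformed sinogram; the proportionality constant c is an assumed free parameter that, in practice, would be tied to the incident photon count.

```python
import numpy as np

def add_sinogram_noise(y_true, c=1e-4, seed=0):
    """Signal-dependent Gaussian noise on the log-transformed projections:
    y_n ~ N(y_t, sigma^2) with sigma^2 proportional to exp(y_t)."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(c * np.exp(y_true))
    return y_true + sigma * rng.standard_normal(y_true.shape)
```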
We assume that only a portion of the desired projections have been directly measured and that the rest are to be estimated (i.e., interpolated). Although the algorithm that we propose can be applied in very general settings, in order to simplify the presentation of the algorithm, here we consider the case depicted in Figure 5.1(b). Specifically, we assume that only half of the n_θ desired projection views have been measured and the remaining half are to be estimated. In other words, the measured noisy sinogram is y_n ∈ R^{n_u × n_v × n_θ/2} and we would like to estimate the interpolated sinogram y ∈ R^{n_u × n_v × n_θ}. As we will explain later, even though our main goal is sinogram interpolation to estimate the missing projection views, the proposed algorithm also has an excellent denoising effect. Therefore, we estimate the full set of n_θ projection views, not only the missing n_θ/2 views.

Figure 5.1: (a) A schematic of cone-beam CT and the sinogram of a brain phantom, (b) A schematic representation of the sinogram interpolation problem.

Given a noisy sinogram y_n, we propose to estimate the true sinogram y by minimizing the following cost function:

J(y) = \|My - y_n\|_W^2 + \lambda_s R_s(y) + \lambda_h R_h(y)    (5.1)

The first term in J is obtained simply by maximizing the log-likelihood of the measurements. In this term, M is a binary mask matrix that removes from y those projection views that have not been measured, and W is a diagonal weight matrix whose diagonal elements are inversely proportional to the measurement variance. The regularization functions R_s and R_h will be explained in Subsections 5.2.1 and 5.2.2, respectively. R_s is a regularization function in terms of nonlocal similarities, R_h is a regularization function in terms of smoothness, and λ_s and λ_h are regularization parameters.

5.2 The proposed algorithm

5.2.1 Regularization in terms of sinogram self-similarity

In Section 2.2, we reviewed some of the applications of image processing algorithms that are based on nonlocal patch similarities. Inspired by the great success of these algorithms, and because this type of self-similarity is very abundant in the sinogram (even more so than in natural images, as can be seen in the sample sinogram in Figure 5.1(a)), we suggest a similar form for R_s:

R_s(y) = \|y - y^*\|_2^2, \qquad \text{where:} \quad y^*(i,j,k) = \sum_{(i',j',k')} \frac{G_a\big(z[i',j',k'] - y[i,j,k]\big)}{\sum_{(i',j',k')} G_a\big(z[i',j',k'] - y[i,j,k]\big)} \, z(i',j',k')    (5.2)

The computation of y^* has a few differences from the basic NLM in Equation (2.14). Firstly, we work with 3D blocks instead of 2D patches. We stack the 2D projections to form a 3D image, as shown in Figure 5.1(b), and work with small blocks of this image. This allows us to exploit both the correlation between adjacent pixels within a projection and the correlation between pixels in adjacent projection views. Therefore, in the above equation, y[i, j, k] is a small block centered on pixel y(i, j, k). Secondly, unlike Equation (2.14), where patch similarities within the same image are exploited, in Equation (5.2) we use a second image, as shown in Figure 5.2. This second image, denoted by z in Equation (5.2) and Figure 5.2, is built by stacking the projections of the scan of a similar object. For instance, this can be a previous scan of the same patient or a scan of a different patient. We will explain our justification for this choice below. Thirdly, for computing y^*(i, j, k), we first find a small number of blocks in z that are very similar to y[i, j, k] and use only those blocks, instead of all blocks in the image. In other words, in Equation (5.2), both summations are over (i', j', k') ∈ Ω_{i,j,k}, where Ω_{i,j,k} denotes the indices of these blocks. This is a necessary compromise to keep the computational time reasonable, and it is followed by all practical implementations of nonlocal patch-based methods.
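To make the block-based estimate concrete, here is a small sketch (ours, and greatly simplified relative to the matching scheme described next): for one pixel it finds a few similar blocks of the reference stack z near the same spatial location and then forms the weighted average of (5.2). G_a is taken to be a Gaussian of the block difference, the bandwidth a and all function names are our own, and all indices are assumed to stay inside the volumes.

```python
import numpy as np

def block(vol, i, j, k, r):
    """(2r+1)^3 block centred on voxel (i, j, k)."""
    return vol[i - r:i + r + 1, j - r:j + r + 1, k - r:k + r + 1]

def find_similar_blocks(y, z, i, j, k, r=2, search=5, n_keep=8):
    """Indices Omega_{i,j,k}: the n_keep blocks of z most similar to y[i,j,k],
    searched in a small window around the same location in z."""
    ref = block(y, i, j, k, r)
    scored = []
    for di in range(-search, search + 1):
        for dj in range(-search, search + 1):
            for dk in range(-search, search + 1):
                ii, jj, kk = i + di, j + dj, k + dk
                d2 = float(np.sum((block(z, ii, jj, kk, r) - ref) ** 2))
                scored.append((d2, (ii, jj, kk)))
    scored.sort(key=lambda t: t[0])
    return [idx for _, idx in scored[:n_keep]]

def nlm_estimate(y, z, i, j, k, omega, r=2, a=10.0):
    """Weighted estimate y*(i, j, k) of Eq. (5.2) over the matched blocks omega."""
    ref = block(y, i, j, k, r)
    w = np.array([np.exp(-np.sum((block(z, ii, jj, kk, r) - ref) ** 2) / a ** 2)
                  for (ii, jj, kk) in omega])
    vals = np.array([z[ii, jj, kk] for (ii, jj, kk) in omega])
    return float(np.sum(w * vals) / np.sum(w))
```

In the actual algorithm the matching is performed with Generalized PatchMatch, described next, and only directly measured pixels enter the block distances.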
Figure 5.2: Block matching between the noisy scan to be restored, y, and the high-dose reference scan, z.

Computation of R_s requires that for each pixel y(i, j, k) we consider a small block around it, y[i, j, k], and find a set of blocks sufficiently similar to y[i, j, k]. Even for medium-size 2D images, this can be computationally very costly, and many algorithms have been proposed for reducing the computational load. Here, we use the Generalized PatchMatch algorithm [12, 13] for this purpose.

The Generalized PatchMatch algorithm is an iterative stochastic algorithm for finding a set of k similar patches in x_ref for every patch in x. Therefore, the goal of this algorithm is not to find the set of the k most similar patches, but only to find a set of k very similar patches. Let us denote the set of indices of the k patches similar to x[i] in x_ref by S_i. The algorithm starts with a random assignment; i.e., for each patch x[i], S_i is chosen to be k random patches in x_ref. This random assignment is then iteratively improved via a set of very effective heuristics, called propagation, random search, and enrichment. We describe these steps very briefly here; implementation details can be found in [12, 13]. In propagation, similar patches are shared among neighboring pixels. In other words, the algorithm seeks to improve S_i by examining S_{N_i}, where N_i is the set of immediate neighbors of x(i). In random search, the algorithm seeks to improve S_i by examining a window around the patches that have been identified as good matches. In other words, the algorithm examines {j + M, ∀ j ∈ S_i}. Here, M is a search window whose size is exponentially reduced with more iterations. In enrichment, good matches are propagated in the "patch space". In other words, the algorithm seeks to improve S_i by examining {S_j, ∀ j ∈ S_i} or {S_j, ∀ i ∈ S_j}, called forward enrichment and inverse enrichment, respectively. In forward enrichment, the idea is that a patch that has been identified as similar to x[i] is likely to have, in its own set of similar patches, more patches similar to x[i]. The intuition behind inverse enrichment is similar.

We follow this algorithm exactly as in [13], except that in the first iteration we do not use a random initial assignment. Instead, to find blocks similar to y[i, j, k], we search a neighborhood around z(i, j, k), because if y and z are similar scans (e.g., scans of the same body part of the same patient or of different patients), similar blocks are likely to exist at similar spatial locations in the two scans. Our experience shows that, with a proper choice of z, this approach works much better than finding similar blocks within the same image. We should also note that, in finding similar blocks and in computing the block differences in (5.2), we only include the pixels from projections that have been directly measured.

5.2.2 Regularization in terms of sinogram smoothness

An important characteristic of the sinogram is its smoothness. Inspired by the success of the denoising algorithm that we proposed in Section 4.2, we model this smoothness via the ℓ1 norm of the Hessian of the sinogram. We suggest the following form for R_h:

R_h(y) = \sum_{i,j,k} \Big( |\nabla^2_{uv} y(i,j,k)| + |\nabla^2_{v\theta} y(i,j,k)| + |\nabla^2_{\theta u} y(i,j,k)| \Big)    (5.3)

In other words, we compute the 2D Hessians in the three orthogonal planes and add their ℓ1 norms. We compute |\nabla^2_{uv} y(i,j,k)| as:

|\nabla^2_{uv} y(i,j,k)| = \Big( D_{uu} y(i,j,k)^2 + 2\, D_{uv} y(i,j,k)^2 + D_{vv} y(i,j,k)^2 \Big)^{1/2}    (5.4)

and similarly for |\nabla^2_{v\theta} y(i,j,k)| and |\nabla^2_{\theta u} y(i,j,k)|. The forward differences are defined as in Equation (4.9). For pixels at the boundaries, we use periodic boundary conditions, as suggested in [248].
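As an aside, the per-pixel Hessian magnitude of (5.4) is simple to compute; the sketch below (ours) shows one possible realization for a single 2D plane using forward differences, with periodic boundaries obtained via np.roll. The exact difference stencils of the thesis (Equation (4.9)) may differ in detail.

```python
import numpy as np

def hessian_magnitude_2d(y):
    """Per-pixel Hessian magnitude of Eq. (5.4) from forward differences
    along the two in-plane axes (periodic boundaries via np.roll)."""
    du  = np.roll(y, -1, axis=0) - y        # D_u y
    dv  = np.roll(y, -1, axis=1) - y        # D_v y
    duu = np.roll(du, -1, axis=0) - du      # D_uu y
    dvv = np.roll(dv, -1, axis=1) - dv      # D_vv y
    duv = np.roll(du, -1, axis=1) - du      # D_uv y
    return np.sqrt(duu ** 2 + 2.0 * duv ** 2 + dvv ** 2)

# R_h in (5.3) is then the sum of such maps over the uv, v-theta, and theta-u
# planes of the stacked projections.
```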
We compute |∇2uvy(i, j, k)| as:|∇2uvy(i, j, k)| =(Duuy(i, j, k)2 + 2Duvy(i, j, k)2 +Dvvy(i, j, k)2)1/2(5.4)and similarly for |∇2vθy(i, j, k)| and |∇2θuy(i, j, k)|. The forward difference aredefined as in Equation (4.9). For pixels at the boundaries, we use periodicboundaries as suggested in [248].Therefore, the proposed cost function has the form below, where we havedropped the pixel indices to simplify the expressions.J(y) =∑i,j,k(‖My − yn‖2W + λs‖y − y∗‖22+ λh(|∇2uvy|+ |∇2vθy|+ |∇2θuy|)) (5.5)1195.2. The proposed algorithmIn summary, the first term encourages consistency with the portion of thesinogram that has been measured. The second term (Rs(y)) is the term thatactually performs the interpolation. It does so by matching blocks from asimilar scan. The last term (Rh(y)) promotes smoothness by penalizing the`1 norm of the Hessian.5.2.3 Optimization algorithmAn estimate of the interpolated (and denoised) projections is obtained asa minimizer of J(y). To perform this minimization, we use the split Breg-man iterative algorithm, which has much similarities with the algorithmused for sinogram denoising in Section 4.2. The split Bregman method firstconverts the unconstrained optimization problem of minimizing J(y) into aconstrained problem:minimize∑(‖Mf − yn‖2W + λs‖f − y∗‖22)+ λh∑|g1|+ λh∑|g2|+ λh∑|g3|subject to: f = y, g1 = ∇2uvy, g2 = ∇2vθy, g3 = ∇2θuy(5.6)This constrained optimization problem is solved via Bregman iteration:Initialize: y0 = yn, f0 = yn, g01 = ∇2uvyn, g02 = ∇2vθyn, g03 = ∇2θuyn,b01 = 0, b02 = 0, b03 = 0, b04 = 0while ||yk − yk−1||22 > [yk+1, fk+1, gk+11 , gk+12 , gk+13 ] =arg minu,f,g1,g2,g3∑(‖Mf − yn‖2W + λs‖f − y∗‖22)+ λh∑|g1|+ λh∑|g2|+ λh∑|g3|+µ12∑(f − y − bk1)2 +µ22∑(g1 −∇2uvy − bk2)2+µ22∑(g2 −∇2vθy − bk3)2 +µ22∑(g3 −∇2θuy − bk4)2bk+11 = bk1 + yk+1 − fk+1bk+12 = bk2 +∇2uvyk+1 − gk+11bk+13 = bk3 +∇2vθyk+1 − gk+12bk+14 = bk4 +∇2θuyk+1 − gk+13end1205.3. Results and discussionThe advantage of this reformulation is that the above minimization prob-lem can be split into five smaller problems, one for each of the five variablesthat can be solved more easily. Minimization with respect to f has a sim-ple closed-form solution. Minimizations with respect to g1, g2, and g3 havesimple soft-thresholding solutions similar to that shown in Equation (4.17).Minimization with respect to y is more difficult. We solve this optimiza-tion approximately by considering each of the three Hessian terms in turn.Each of these three sub-problems will involve a linear system which can beapproximately solved using the Gauss-Seidel method.The regularization parameters λs and λh influence the recovered solution,whereas µ1 and µ2 influence the convergence speed. We do not discuss theeffects of these parameters as their effects are very similar to the denoisingproblem discussed in Section 4.2 and also in [248, 249]. In our experimentswe used λh = 0.0001, µ1 = µ2 = 0.001, and λs = 1.5.3 Results and discussionWe applied the proposed algorithm on simulated and real cone-beam CTprojections. As mentioned previously in this chapter, unlike the iterative re-construction methods that aim at reconstructing a high-quality image fromundersampled projections, our goal is to estimate the “missing” projections.Therefore, in all of the experiments reported in this chapter, for image recon-struction we used the FDK algorithm. 
5.3 Results and discussion

We applied the proposed algorithm to simulated and real cone-beam CT projections. As mentioned previously in this chapter, unlike iterative reconstruction methods that aim at reconstructing a high-quality image from undersampled projections, our goal is to estimate the "missing" projections. Therefore, in all of the experiments reported in this chapter, we used the FDK algorithm for image reconstruction. We compare our algorithm with the dictionary-based sinogram interpolation method proposed in [188], which has been shown to be better than spline interpolation.

5.3.1 Experiment with simulated data

We first applied our algorithm to scans simulated from a brain phantom, which we obtained from the BrainWeb database [63]. We simulated n_θ projections from this phantom, for two values of n_θ = 1440 and 960. For each n_θ, we first reconstructed the image of the phantom from the full set of n_θ projections and from n_θ/2 projections; we denote these images by x_{n_θ} and x_{n_θ/2}, respectively. We then applied the proposed algorithm and the dictionary-based interpolation algorithm to interpolate the subset of n_θ/2 projections to generate n_θ projections, and reconstructed the image of the phantom from the interpolated projections. We denote these images by x^{proposed}_{n_θ/2} and x^{dict.}_{n_θ/2}. We simulated two levels of noise in the projections with different numbers of incident photons: N_0 = 10^6 and N_0 = 5×10^4. We will refer to these simulations as low-noise and high-noise, respectively. For both simulations, we assumed the detector electronic noise to be additive Gaussian with a standard deviation of 40. As we mentioned in Section 5.2.1, for the block matching required for the computation of R_s, we use another scan (denoted by z in Equation (5.2) and Figure 5.2), which can be the scan of another object or a previous scan of the same object. In this experiment, we used the simulated scan of a different brain phantom from the same database.

We compared the reconstructed images with the true phantom image by computing the root-mean-square error (RMSE) and the structural similarity (SSIM) index. The results of this comparison are presented in Table 5.1. Sinogram interpolation with the proposed algorithm has resulted in a large improvement in the objective quality of the reconstructed image. The improvement is more substantial in the case of high-noise projections. This is because, as we mentioned above, both regularization terms R_s and R_h have excellent denoising effects. Therefore, the proposed algorithm results in an automatic denoising. The proposed algorithm has also outperformed the interpolation algorithm based on learned dictionaries. The objective quality of x^{proposed}_{n_θ/2} is very close to that of x_{n_θ} in the low-noise case and better than x_{n_θ} in the high-noise case.

                                  x_{n_θ}   x_{n_θ/2}   x^{proposed}_{n_θ/2}   x^{dict.}_{n_θ/2}
n_θ = 1440   Low-noise    RMSE    0.104     0.140       0.111                  0.128
                          SSIM    0.745     0.683       0.726                  0.705
             High-noise   RMSE    0.124     0.157       0.121                  0.136
                          SSIM    0.710     0.642       0.715                  0.686
n_θ = 960    Low-noise    RMSE    0.127     0.154       0.130                  0.138
                          SSIM    0.708     0.640       0.691                  0.671
             High-noise   RMSE    0.143     0.164       0.136                  0.144
                          SSIM    0.688     0.625       0.690                  0.662

Table 5.1: Objective quality of the reconstructed images of the brain phantom.

Figure 5.3 shows a slice of the reconstructed images of the phantom. Interpolation of the sinogram with the proposed algorithm has resulted in a substantial improvement in the visual quality of the reconstructed image. Not only have the artifacts been significantly reduced, the noise has also been decreased substantially.

Figure 5.3: A slice of the images of the brain phantom reconstructed from the high-noise projections with n_θ = 1440. (a) the true phantom, (b) x_{n_θ}, (c) x_{n_θ/2}, (d) x^{proposed}_{n_θ/2}, (e) x^{dict.}_{n_θ/2}.

5.3.2 Experiment with real CT data

We used micro-CT scans of a rat for this experiment. The scan consisted of 720 projections.
We reconstructed a high-quality "reference" image of the rat from all 720 projections using 25 iterations of the MFISTA algorithm. Then, we selected subsets of n_θ projections from this scan for two values of n_θ = 360 and 180. For each n_θ, we reconstructed the image of the rat from n_θ and from n_θ/2 projections. We then interpolated the subset of n_θ/2 projections using the proposed algorithm and the dictionary-based interpolation algorithm to obtain n_θ projections, and reconstructed the image of the rat from the interpolated projections.

We will refer to the scan described above as the normal-dose scan because it was obtained at the normal dose used in routine imaging. The same rat was scanned at a much reduced dose (by reducing the mAs setting to half of that in routine imaging and using additional copper filtration) and the same analysis as above was performed. We will refer to this scan as the low-dose scan. For the block matching in the computation of R_s we used the scan of a different rat.

To assess the quality of the reconstructed images, we computed the RMSE and SSIM with respect to the reference image, as well as the contrast-to-noise ratio (CNR). The results of this evaluation are summarized in Table 5.2. There is a substantial improvement in the image quality metrics as a result of sinogram interpolation with the proposed algorithm. The proposed algorithm has led to much better image quality than the dictionary-based sinogram interpolation. The gain in image quality is more pronounced in the low-dose case. In fact, the objective quality of x^{proposed}_{n_θ/2} is even better than that of x_{n_θ} on the low-dose scan.

                                  x_{n_θ}   x_{n_θ/2}   x^{proposed}_{n_θ/2}   x^{dict.}_{n_θ/2}
n_θ = 720   Normal-dose   RMSE    0.0173    0.0222      0.0185                 0.0191
                          SSIM    0.654     0.611       0.640                  0.636
                          CNR     13.8      11.2        13.6                   12.3
            Low-dose      RMSE    0.0194    0.0236      0.0189                 0.0199
                          SSIM    0.630     0.597       0.637                  0.623
                          CNR     12.3      10.5        13.0                   12.1
n_θ = 360   Normal-dose   RMSE    0.0200    0.0238      0.0208                 0.0216
                          SSIM    0.632     0.582       0.624                  0.617
                          CNR     13.0      10.9        13.0                   12.1
            Low-dose      RMSE    0.0223    0.0250      0.0215                 0.0229
                          SSIM    0.612     0.556       0.620                  0.586
                          CNR     11.7      10.2        12.2                   11.5

Table 5.2: Objective quality of the reconstructed images of the rat.

For a visual comparison, Figure 5.4 shows a slice from the images of the rat reconstructed from the low-dose scan with n_θ = 240. There is a remarkable improvement in image quality as a result of sinogram interpolation with the proposed algorithm. The visual quality of x^{proposed}_{n_θ/2} appears to be better than that of both x^{dict.}_{n_θ/2} and x_{n_θ}.

Overall, our experiments show that the proposed sinogram interpolation algorithm can lead to a large improvement in the quality of the reconstructed image. Due to the additional denoising effect of both regularization functions used by the proposed algorithm, this improvement is more significant when the algorithm is applied to low-dose scans. This means that the proposed algorithm is especially well-suited for sinogram restoration in low-dose CT. Our other experiments, not reported here because of space limitations, also show that the proposed algorithm can be used to effectively interpolate the sinogram in more general cases, for example when the angular spacing of the missing projection views is non-uniform or when some detector measurements are corrupted.
Figure 5.4: A slice of the images reconstructed from the low-dose scan of the rat with n_θ = 240. (a) the reference image, (b) x_{n_θ}, (c) x_{n_θ/2}, (d) x^{proposed}_{n_θ/2}, (e) x^{dict.}_{n_θ/2}.

Chapter 6

Reducing Streak Artifacts using Coupled Dictionaries

6.1 Introduction

As mentioned previously in this dissertation, a simple approach to reducing the radiation dose is to reduce the number of projections. Unfortunately, the images reconstructed by FBP-based methods from such few-view scans will contain large amounts of streak artifacts. This chapter proposes a novel technique for suppressing these artifacts. Artifacts in CT may arise from different causes and would therefore have different shapes and structures [14, 231]. This chapter focuses on the streak artifacts that occur in images reconstructed by FBP-based methods from a small number of projections. In particular, the FDK algorithm that is commonly used for image reconstruction in cone-beam CT requires several hundred projections in order to produce artifact-free images. Reducing the number of projections can result in severe streak artifacts. This chapter proposes an algorithm that suppresses these artifacts without blurring or distorting the genuine image features.

The proposed method is based on learning two dictionaries for sparse representation of small blocks extracted from 3D CT images; one dictionary is for artifact-full images and the other is for high-quality artifact-free images. The proposed method employs a linear map that relates the representation coefficients of the artifact-full and the artifact-free blocks in these two dictionaries. This linear map is also learned from the training data. The central idea is to use the representation coefficients of small blocks in an artifact-full image to find the representation coefficients of the corresponding artifact-free blocks, thereby recovering the artifact-free image.

As mentioned in Section 2.6.3, there are studies that have used patch-based methods to suppress artifacts in CT. Some of these studies depend on the existence of a high-quality prior image of the same patient or a rich database of scans from a large number of patients [340, 341]. In many situations, however, one does not have access to a previous scan of the same patient or to a rich database of images. The method proposed in this chapter does not rely on any such prior image or database. In order to compare our algorithm with previously proposed algorithms, we consider the Artifact Suppressed Dictionary Learning (ASDL) method proposed in [57], which also does not require a prior image. We have briefly described ASDL in Section 2.6.3.

6.2 Methods

6.2.1 The proposed approach

The proposed algorithm learns two separate dictionaries, one for artifact-full images and another for high-quality images devoid of artifacts. These two dictionaries are denoted, respectively, by D^a (for artifact) and D^c (for clean). We also denote an artifact-full image (reconstructed from a small number of projection views) by x^a and its corresponding high-quality image (reconstructed from a very large number of projections) by x^c. In the training stage of the proposed algorithm, we extract small blocks from each of these images and stack the vectorized versions of these blocks to create two matrices, which we denote by X^a and X^c. The ith columns of these matrices are denoted by X^a_i and X^c_i. Given an artifact-full block, X^a_i, we would like to recover its artifact-free version, X^c_i.
Our approach here is to use the sparse representation of X^a_i in D^a to recover the sparse representation of X^c_i in D^c. In other words, given an artifact-full block X^a_i, we first find the sparse representation vector Γ^a_i such that X^a_i ≅ D^a Γ^a_i. From Γ^a_i, we estimate Γ^c_i, the sparse code of the corresponding artifact-free block in its dictionary. Finally, the estimate of the artifact-free block is X̂^c_i = D^c Γ^c_i.

A very important decision is the choice of the relation between the sparse representation vectors Γ^a_i and Γ^c_i. The simplest choice is an equality relation, i.e., Γ^a_i = Γ^c_i. This relation was suggested for image super-resolution in [344, 345]. However, this relation is very restrictive, and a more relaxed relation can provide greater flexibility, hopefully leading to better results. For super-resolution, for instance, as we explained in Section 2.1.3, improved results have been reported by using more relaxed relations. Therefore, in this study we use a linear relation between Γ^a_i and Γ^c_i, because such a model has yielded very good results in image super-resolution, multi-modal retrieval, and cross-domain image synthesis and recognition [132, 327, 362]. In other words, we assume that the vector of representation coefficients of an artifact-free block can be obtained from the vector of representation coefficients of the corresponding artifact-full block using the equation Γ^c_i = P Γ^a_i, where P is a linear map (i.e., a matrix) that is also estimated from the training data.

Throughout this chapter, N and n denote the numbers of projections that we use to reconstruct the artifact-free and the artifact-full images, respectively. In the experimental evaluations, N = 720 and n ≈ 100. For image reconstruction we use the FDK algorithm. The goal of the proposed algorithm is to suppress the artifacts in an image reconstructed from n projections so that it looks similar to the reference image reconstructed from N projections.

Assuming for a moment that we have learned the dictionaries (D^a and D^c) and the linear map (P), we now explain how an artifact-suppressed image is produced by the proposed algorithm. The proposed algorithm is shown as a schematic in Figure 6.1. As can be seen from this figure, the algorithm starts by dividing the given set of n projections into two subsets of odd and even projections, each containing n/2 projections, and reconstructs two images using these two subsets of projections. The rationale behind this approach is to better exploit the correlation between the projections. The images reconstructed from each of these two subsets of n/2 projections will contain more artifacts than an image reconstructed from n projections. However, while the artifacts in these two images will be quite different, the genuine image features will be shared between the two images. Therefore, we expect that reconstructing two separate images using the odd and even projections should lead to better results in terms of the quality of the final recovered image. Our experience shows that this is indeed the case. It should be noted that with the FDK algorithm the computational cost of reconstructing two images, each from n/2 projections, is the same as that of reconstructing one image from n projections. We should also note that (as is commonly done in this field) we work with mean-subtracted images. Later, the mean image is added back to the final reconstructed image.
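The per-block recovery just described is summarized in the sketch below (ours, with a small hand-rolled orthogonal matching pursuit standing in for the solver of [252]); D_a, D_c, and P are assumed to have been learned already, and block extraction and overlap averaging are omitted here.

```python
import numpy as np

def omp(D, x, sparsity):
    """Greedy orthogonal matching pursuit: sparse code of x in dictionary D
    (columns of D assumed normalized; sparsity >= 1)."""
    residual, support = x.copy(), []
    coef = np.zeros(D.shape[1])
    for _ in range(sparsity):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        sub = D[:, support]
        sol, *_ = np.linalg.lstsq(sub, x, rcond=None)
        residual = x - sub @ sol
    coef[support] = sol
    return coef

def suppress_block(x_a, D_a, D_c, P, T_a=20):
    """Recover an artifact-suppressed block from a stacked artifact-full block:
    code in D_a, map the code with P, synthesize with D_c."""
    gamma_a = omp(D_a, x_a, T_a)          # X^a_i ~= D^a Gamma^a_i
    gamma_c = P @ gamma_a                 # Gamma^c_i = P Gamma^a_i
    return D_c @ gamma_c                  # X-hat^c_i = D^c Gamma^c_i
```

In the full algorithm, each recovered block is returned to its location and overlapping estimates are averaged, as formalized in Equation (6.1) below.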
From each of the two artifact-full images, we extract small overlapping blocks. In all of our experiments we used 8×8×8-voxel blocks with an overlap of 5 voxels in each direction between neighboring blocks. The ith pair of blocks extracted from the two artifact-full images is vectorized and stacked in tandem to form one vector, X^a_i. The sparse representation of this vector in D^a is computed such that X^a_i ≅ D^a Γ^a_i. Γ^a_i is then multiplied by the matrix P to find the sparse code, Γ^c_i, of the corresponding artifact-suppressed block. Finally, the artifact-suppressed block is estimated as X̂^c_i = D^c Γ^c_i. This process is repeated for all pairs of overlapping blocks extracted from the artifact-full images. The artifact-suppressed blocks are placed in their correct locations (the same locations from which the artifact-full blocks were extracted) in the destination image and averaged to obtain the final estimate of the artifact-suppressed image, x̂^c. The averaging takes into account the fact that each voxel, except for the ones at the corners of the image, participates in more than one block because we use overlapping blocks. Mathematically, if R_i is the binary matrix that places X̂^c_i in its proper location, then x̂^c is computed as:

\hat{x}^c = \left( \sum_{i=1}^{K} R_i \hat{X}^c_i \right) \oslash \left( \sum_{i=1}^{K} R_i \mathbf{1} \right)    (6.1)

where K is the total number of blocks, ⊘ indicates element-wise division, and 1 ∈ R^{512} is a vector of ones.

Figure 6.1: A schematic representation of the proposed algorithm. The steps shown in the dashed box are performed for all extracted overlapping blocks.

6.2.2 The dictionary learning algorithm

In the above, we assumed that the two dictionaries, D^c and D^a, and the linear map, P, were known, and we described the steps taken to remove artifacts from an artifact-full image. In this section, we present algorithms for learning D^c, D^a, and P. The training data needed by the proposed algorithm includes a set of artifact-full blocks and their corresponding artifact-free blocks. To generate this data, we scan an appropriate object with a high angular sampling rate so that the scan contains a very large number (N) of projections. We reconstruct our artifact-free image using all N projections.
We will provide more detail on ourtraining and test data in Section 6.3.Having generated our training data, Xa and Xc, we suggest to learnthe dictionaries Dc and Da and the linear map P by solving the followingoptimization problem, which is very similar to the formulations suggestedfor super-resolution and photo-sketch synthesis in [327]:minimize{Da,Γa,Dc,Γc,P}(‖Xa −DaΓa‖2F + ‖Xc −DcΓc‖2F+ λa‖Γa‖1 + λc‖Γc‖1+ α‖Γc − PΓa‖2F + β‖P‖2F)subject to: ‖Dai ‖2 ≤ 1 & ‖Dci‖2 ≤ 1 ∀i(6.2)The first two terms in the objective function force the dictionariesDa andDc to accurately model the training signals in Xa and Xc, respectively. Thethird and fourth terms encourage sparsity of these representations. Thesefour terms are reminiscent of the terms of the objective function in the basicdictionary learning algorithm. The fifth term enforces the linear relationbetween the vectors of sparse representation in the two dictionaries. Thelast term penalizes the norm of P in order to avoid overfitting and numericalinstability.The proposed objective function is not convex with respect to its fivevariables simultaneously. However, it is convex with respect to each of thevariables if we fix the rest. Therefore, as it is common in dictionary learning,we follow an alternating minimization scheme to find a stationary point ofthis problem.1306.2. MethodsUpdating the dictionaries Da and Dc. With other variables beingfixed, minimization with respect to the dictionary atoms can be written asthe following two identical optimization problems.minimizeDa‖Xa −DaΓa‖2F subject to: ‖Dai ‖2 ≤ 1 ∀iminimizeDc‖Xc −DcΓc‖2F subject to: ‖Dci‖2 ≤ 1 ∀i(6.3)We use the efficient implementation of the K-SVD algorithm proposed in[275] to solve these problems.Updating the sparse representation matrices Γa and Γc. With allother variables fixed, minimization with respect to Γa can be simplified asfollows:minimizeΓa‖Xa −DaΓa‖2F + α‖Γc − PΓa‖2F + λa‖Γa‖1≡minimizeΓa∥∥∥∥[ Xa√αΓc]−[Da√αP]Γa∥∥∥∥2F+ λa‖Γa‖1(6.4)Similarly, minimization with respect to Γc can be written as:minimizeΓc‖Xc −DcΓc‖2F + α‖Γc − PΓa‖2F + λc‖Γc‖1≡minimizeΓc∥∥∥∥[ Xc√αPΓa]−[Dc√αI]Γc∥∥∥∥2F+ λc‖Γc‖1(6.5)where I is the identity matrix of the right size.The optimization problems in (6.4) and (6.5) are simple sparse cod-ing problems. In our experience, greedy algorithms such as the orthog-onal matching pursuit (OMP) [252] with a sparsity constraint work verywell on these problems. Therefore, we used OMP with sparsity constraints‖Γc‖0 ≤ T c and ‖Γa‖0 ≤ T a to solve these problems. In other words, wereplace the `1-norm penalty with `0-norm constraints. We choose Tc = 16for artifact-free blocks as suggested in the context of denoising in [274]. Forartifact-full blocks we choose a higher sparsity level of T a = 20 because thesignals are longer and include artifacts too. This approach will also eliminatethe need to tune the parameters λa and λc.1316.2. MethodsUpdating the linear map P . Minimization with respect to P involvessolving the following problem:minimizePα||Γc − PΓa||2F + β||P ||2F (6.6)which has the following closed-form solution:P ∗ = Γc(Γa)T(Γa(Γa)T +βαI)−1(6.7)Initialization and regularization parameter selection. We initializeDa and Dc to the dictionaries learned with the basic dictionary-learningscheme, Equation (2.1), from artifact-full and artifact-free images, respec-tively. We used the K-SVD algorithm [2] to learn these initial dictionaries.Of course, for Da we need to concatenate two such dictionaries to obtainthe right size. 
Initialization and regularization parameter selection. We initialize D^a and D^c to the dictionaries learned with the basic dictionary-learning scheme, Equation (2.1), from artifact-full and artifact-free images, respectively. We used the K-SVD algorithm [2] to learn these initial dictionaries. Of course, for D^a we need to concatenate two such dictionaries to obtain the right size. This initialization helps our learning algorithm converge much faster than if we used an overcomplete DCT or wavelet basis as the initial dictionary. The number of atoms in each dictionary was chosen to be 1024. We initialize P to the identity matrix. Using OMP, we compute the sparse representations of the training signals X^a and X^c in the initial dictionaries D^a and D^c, respectively, and use them as the initial values of Γ^a and Γ^c. As mentioned above, our training strategy does not require knowing the values of λ_a and λ_c. We used α = β = 0.1 for all experiments reported in this chapter; we found empirically that these values work very well in all our experiments. As mentioned above, the regularization term \|P\|_F^2 is meant to avoid numerical instability. As suggested in [69], if the amount of training data is not too small, numerical instabilities are unlikely to occur and β can be set to a very small number.

After the above initializations, we alternately minimize the objective function with respect to the five variables. For the stopping criterion, one can adopt a cross-validation approach. Specifically, in this approach part of the training data (X^a and X^c) is set aside as validation data. At the end of each iteration of the learning algorithm, the learned parameters (D^a, D^c, and P) are applied to reconstruct the artifact-free blocks in the validation data set from their corresponding artifact-full blocks, using our algorithm as shown in Figure 6.1. The learning is stopped when an acceptable level of accuracy in reconstructing the artifact-free blocks in the validation data set is achieved. We should also mention that, instead of a straightforward application of the algorithm for reconstruction as shown in Figure 6.1, one can optimize the coefficients Γ^a and Γ^c simultaneously. The optimization steps will be similar to those presented for training above, except that the parameters are fixed and only Γ^a, Γ^c, and X^c are optimized. Alternatively, one can stop the learning algorithm after a pre-specified number of iterations, which is also very common in dictionary learning [2, 212]. In the experiments reported in this chapter we stopped the learning algorithm after 50 iterations, which we found empirically to be sufficient.

6.3 Evaluation

The first scan consisted of N = 720 projections of a rat between 0° and 360° at 0.5° intervals. We used the full set of 720 projections to reconstruct a very high-quality image of the rat. For this purpose, we first reconstructed an image using the FDK algorithm, followed by 50 iterations of the MFISTA algorithm. The resulting image, which we refer to as the reference image, had a very high quality with no visible artifacts of any kind. To evaluate our proposed algorithm, we used a subset of n = 120 projections from this scan, i.e., projections at 3° intervals. As we mentioned above and showed in Figure 6.1, the proposed algorithm divides the projections into two halves, each containing n/2 = 60 projections (odd and even projections), and reconstructs a separate image from each set of 60 projections using the FDK algorithm.

In this first experiment, we generated the training and test data from the same images. Each reconstructed image was 880×880×650 voxels. We divided each of the images into two halves (i.e., into two 880×880×325-voxel images). We used one half for training, i.e., for learning the dictionaries D^a and D^c and the matrix P.
Then, we applied our method as shown in Figure 6.1 to suppress the artifacts in the other half of the image.

Figure 6.2 shows the effect of applying the proposed algorithm for artifact suppression. In this figure, we show a typical slice of the reference image (reconstructed from 720 projections) and of the artifact-full image (reconstructed from 120 projections), alongside the results of applying our algorithm and ASDL with 120 projections. For comparison, we have also included the same slice in the image reconstructed with the FDK algorithm from 240 projections. We have also shown, in part (f) of this figure, the image obtained with total variation (TV) denoising; we used Chambolle's algorithm for the TV denoising.

Figure 6.2: A typical slice of the images reconstructed from the first rat scan. (a) the reference image reconstructed from 720 projections, (b) FDK-reconstructed from 120 projections, (c) FDK-reconstructed from 240 projections, (d) the image produced from 120 projections using our proposed algorithm, (e) the image reconstructed from 120 projections using ASDL, (f) FDK-reconstructed from 120 projections followed by TV denoising. Two ROIs are shown magnified on the top-left and top-right of the images. The locations of these ROIs have been marked on the slice of the reference image.

The whole slice in Figure 6.2 is shown using a window of attenuation coefficient (µ-window) of [0, 0.40]. For a better visual comparison, we have selected two regions of interest (ROIs) and have displayed them with narrower µ-windows. One of these ROIs (shown on the top right of each image) contains bone surrounded by soft tissue; we display this ROI with a µ-window of [0.2, 0.5]. The second ROI contains fat surrounded by soft tissue, which we display with a µ-window of [0.14, 0.23]. Both ROIs are displayed with a magnification of 150%. This figure shows a marked improvement in the quality of the reconstructed image as a result of applying the proposed algorithm. The image reconstructed with the proposed algorithm (from 120 projections) has a much higher quality than the FDK-reconstructed image from 120 projections and appears to be better than the FDK-reconstructed image from 240 projections as well. ASDL seems to have removed most of the noise in the image without suppressing much of the artifact. The image obtained via TV denoising has a low quality, and all of the streak artifacts remain, even though much of the noise has been removed. This is an expected result, because denoising algorithms cannot distinguish between true image features and streak artifacts. The TV denoising method, for instance, is based on the prior assumption that the image has a sparse gradient, encouraging piecewise-constant solutions. Although this approach is effective for removing random noise, it cannot distinguish strong artifacts from genuine image features.

For a more meaningful evaluation, we applied the dictionary and linear map learned in the experiment described above to suppress the artifacts in the image of a different rat. Similar to the above experiment, in this new experiment the scan consisted of 720 projections. A reference artifact-free image was reconstructed using all 720 projections. The proposed algorithm and ASDL were applied to produce artifact-suppressed images from 120 projections.
Figure 6.3 shows a typical slice in the images from this experiment. We have included, in parts (g) and (h) of this figure, the two intermediate FDK-reconstructed images (from the 60 even and the 60 odd projections) that were used in the proposed algorithm. However, the image produced by the proposed algorithm should be compared with the FDK-reconstructed image from 120 projections, shown in part (b) of this figure. This figure shows that the proposed algorithm has successfully reduced the artifacts, significantly improving the image quality. This is more visible in the two ROIs that are re-displayed with narrow µ-windows. The µ-window for the whole slice is [0, 0.40] and the µ-windows for the ROIs shown on the top-left and bottom-right are [0.16, 0.21] and [0.2, 0.45], respectively. The proposed algorithm has also produced an image that is markedly better than the image produced by ASDL and slightly better than the FDK-reconstructed image from 240 projections. ASDL has not been very successful in reducing the artifacts in this experiment.

Figure 6.3: A slice from the reconstructed images of the second rat. (a) the reference image, (b) FDK-reconstructed from 120 projections, (c) FDK-reconstructed from 240 projections, (d) produced by the proposed algorithm from 120 projections, (e) produced by ASDL from 120 projections, (f) FDK-reconstructed image from 120 projections followed by K-SVD denoising, (g)-(h) the intermediate images in the proposed algorithm (each reconstructed with the FDK algorithm from 60 projections). The locations of the two ROIs have been marked on the slice of the reference image.

One of the reasons for the improved quality of the images produced by the proposed algorithm is noise reduction. Even though we designed our algorithm for artifact reduction, it also leads to an automatic noise reduction, which is an added benefit. The reason for this automatic denoising is that the output image of the proposed algorithm is created as a sparse representation of blocks in the dictionary Dc, which is a successful denoising strategy, as we explained in Section 2.1.3. Therefore, one may question whether the improvement in the image quality by the proposed algorithm is mostly due to its denoising effect, and not to artifact suppression as we have claimed. To show that this is not the case, i.e., that our proposed algorithm is indeed an artifact-suppression algorithm, we applied the K-SVD denoising algorithm [100] to remove the noise in the image reconstructed with the FDK algorithm from 120 projections. The resulting image is shown in Figure 6.3(f). It is quite clear that dictionary-based denoising only removes the noise and leaves the artifacts untouched. Therefore, our proposed algorithm accomplishes much more than a dictionary-based denoising algorithm and is a true artifact-suppression algorithm.

For a more objective evaluation of the proposed algorithm, we compute the root-mean-square of the error (RMSE), where we define the error as the difference between the reconstructed image and the reference image, and the structural similarity index (SSIM) between the reconstructed image and the reference image. The results are presented in Table 6.1.
The proposed algorithm has clearly outperformed ASDL in terms of RMSE and SSIM. Moreover, compared with the FDK-reconstructed image from 240 projections, the artifact-suppressed image produced by our proposed algorithm from 120 projections is closer to the reference image in terms of both RMSE and SSIM.

        FDK-120   FDK-240   FDK-120 with K-SVD denoising   Proposed algorithm - 120   ASDL - 120
RMSE    0.0206    0.0119    0.0165                         0.0104                     0.0147
SSIM    0.653     0.856     0.702                          0.860                      0.764

Table 6.1: RMSE and SSIM for the FDK-reconstructed images of the second rat from 120 and 240 projections, FDK-reconstructed from 120 projections followed by denoising using the K-SVD algorithm, and the images produced from 120 projections by the proposed algorithm and ASDL.

In order to further evaluate the performance of the proposed algorithm, we applied it on a scan of the physical phantom. In order to test the generalizability of the learned parameters (Da, Dc, and P), we used the parameters that were learned from the rat scan described above to suppress the artifacts in this experiment with the phantom. Similar to the scans of the rat, the phantom scan consisted of 720 projections. All of the 720 projections were used to reconstruct a high-quality reference image. The proposed algorithm was then used to produce an artifact-suppressed image from 120 projections.

In Figure 6.4 we show parts of two selected slices in the image of the phantom. The µ-window used for displaying these figures is [0, 0.50]. It is clear from this figure that the proposed algorithm has significantly reduced the artifacts without degrading genuine image features. The right column in this figure shows two of the resolution coils that have been included in this phantom for the purpose of visual inspection of the spatial resolution. In the FDK-reconstructed image from 120 projections, and to a lower degree also in the FDK-reconstructed image from 240 projections, these coils have given rise to ring-shaped artifacts. While the proposed algorithm has substantially reduced these and other artifacts, it has not blurred or degraded the fine image features and even seems to have improved them. Compared with our proposed algorithm, ASDL has been much less effective in removing these artifacts. On the other hand, TV-based denoising seems to have had no effect on the artifacts, although it has managed to reduce the noise.

Table 6.2 presents a quantitative evaluation of the quality of the images reconstructed by different algorithms in this experiment. In addition to RMSE and SSIM, we estimated the modulation transfer function (MTF) using an approach similar to that in Chapters 3 and 4. The values reported in Table 6.2 as spatial resolution (S.R.) are the spatial frequencies at which the normalized MTF reached 0.1 in the different images.

             Reference image   FDK-120   FDK-240   Proposed algorithm - 120   ASDL - 120
RMSE         0.0               0.022     0.010     0.0094                     0.0137
SSIM         1.0               0.623     0.892     0.900                      0.812
S.R. (mm⁻¹)  4.39              3.66      3.98      3.98                       3.76

Table 6.2: Quantitative comparison of the quality of the images of the physical phantom reconstructed with the FDK algorithm from 120 and 240 projections and the artifact-suppressed images produced using the proposed algorithm and ASDL from 120 projections.
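The metrics reported in Tables 6.1 and 6.2 can be computed along the following lines. This is an illustrative sketch rather than the exact code used to produce the tables: the SSIM call relies on scikit-image, and the MTF is assumed to have been estimated beforehand as a sampled curve.

```python
import numpy as np
from skimage.metrics import structural_similarity

def rmse(recon, reference):
    """Root-mean-square of the difference between a reconstruction and the reference."""
    return np.sqrt(np.mean((recon - reference) ** 2))

def ssim(recon, reference):
    """Structural similarity index against the reference image."""
    data_range = float(reference.max() - reference.min())
    return structural_similarity(recon, reference, data_range=data_range)

def resolution_at_mtf(freqs, mtf, threshold=0.1):
    """Spatial frequency at which the normalized MTF first drops to `threshold`
    (linear interpolation between the two bracketing samples)."""
    mtf = np.asarray(mtf, dtype=float)
    mtf = mtf / mtf.max()
    below = np.where(mtf <= threshold)[0]
    if below.size == 0:
        return freqs[-1]          # limited by the sampled frequency range
    i = int(below[0])
    if i == 0:
        return freqs[0]
    t = (mtf[i - 1] - threshold) / (mtf[i - 1] - mtf[i])
    return freqs[i - 1] + t * (freqs[i] - freqs[i - 1])
```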
Figure 6.4: Parts of two typical slices in the image of the phantom; (a) the reference image, (b) FDK-reconstructed from 120 projections, (c) FDK-reconstructed from 240 projections, (d) produced by the proposed algorithm from 120 projections, (e) produced by ASDL from 120 projections, (f) FDK-reconstructed from 120 projections followed by TV denoising.

As we mentioned above, the results shown in Figure 6.4 and Table 6.2 were obtained by using the parameters (i.e., Da, Dc, and P) from the rat experiment. Even though our results show that these parameters work very well on the phantom image, one may wonder if we could have obtained better results by learning a new set of parameters from a similar image. In order to determine whether learning a new set of parameters from a similar image can improve the results, we performed a new experiment. In this experiment, the physical phantom was scanned one more time and new dictionaries (Da and Dc) and a new matrix (P) were learned. We then applied our algorithm with these parameters on the scan of the phantom described above. The results of this experiment were very close to the results shown in Figure 6.4 and Table 6.2. In particular, the values of SSIM and spatial resolution shown for the proposed algorithm in Table 6.2 increased slightly to 0.905 and 4.00, while the value of RMSE remained almost the same. In our opinion, this means that the parameters learned from one image can be used when applying the algorithm on a very different image, even though slightly better results may be obtained if the parameters are learned from a similar image. This is because our proposed algorithm is based on relating the sparse representations of artifact-full blocks with their corresponding artifact-free blocks. Because of the small size of the blocks, the learned parameters of the algorithm depend on the local shapes of these artifacts, which do not depend much on the shape of large-scale features of the image.

The above results indicate that the dictionaries and the linear map learned from one image are applicable to another image. However, in the above experiments we used the same number of projections (i.e., 120) for both the rat and the phantom. It is well known that artifacts become stronger and their angles change as the number of projections is reduced. Therefore, parameters learned from one image may not work well when used for artifact suppression in an image reconstructed from a different number of projections. Moreover, as the number of projections decreases, it will be more difficult to suppress the artifacts because they become much stronger. Even though our algorithm was able to effectively suppress the artifacts in the images reconstructed from 120 projections, it is expected that its performance should decrease as the number of projections is reduced further. Therefore, we face two important questions: 1) What is the limit of performance of the proposed algorithm in terms of the number of projections required? and 2) Is it possible to use the parameters learned from an experiment with n1 projections for artifact suppression with a different number of projections, n2 ≠ n1?

In order to answer these questions, we used the parameters learned from the scan of the first rat to suppress the artifacts in the image reconstructed from the scan of the second rat, but this time we used n = 90 projections.
In other words, Da, Dc, and P are learned on artifact-full images reconstructed from 120 projections, and then they are applied to suppress the artifacts in images reconstructed from 90 projections. The results of this experiment are shown in Figure 6.5 and Table 6.3 and denoted as "Proposed Algorithm - train:120 - test:90". The µ-windows used for Figure 6.5 are the same as those for Figure 6.3. For comparison, we also learned new dictionaries and a linear map, this time from images reconstructed from 90 projections of the scan of the first rat, and used them to suppress the artifacts in the images reconstructed from 90 projections of the scan of the second rat. These results are also shown in Figure 6.5 and Table 6.3, denoted as "Proposed Algorithm - train:90 - test:90".

Figure 6.5: A slice from the reconstructed image of the second rat; (a) FDK-reconstructed from 90 projections, (b) FDK-reconstructed from 180 projections, (c) produced by the proposed algorithm from 90 projections using the parameters learned on the image of the first rat with 90 projections, (d) produced by the proposed algorithm from 90 projections using the parameters learned on the image of the first rat with 120 projections. (The reference image for this figure is the same as Figure 6.3(a).)

There are two main conclusions that can be drawn as answers to the two questions that we posed above. First, Figure 6.5 and Table 6.3 show a marked improvement in "train:90 - test:90" compared with "train:120 - test:90". Therefore, parameters learned for artifact reduction by the proposed algorithm are no longer optimal if the number of projections changes significantly. We think this is because the shape and the strength of the artifacts change significantly when the number of projections changes. Secondly, the performance of our proposed artifact-suppression algorithm was reduced compared to our experiments with 120 projections reported earlier in this chapter. Figure 6.5(c), which corresponds to "train:90 - test:90", is still better than the FDK-reconstructed image with twice the number of projections shown in Figure 6.5(b). However, by comparing these results with those shown in Figures 6.3 and 6.4 and Tables 6.1 and 6.2, we see that the gain in image quality is reduced. This is what we should expect: as the number of projections is reduced, the artifacts dominate the genuine image details and it thus becomes more difficult for the algorithm to remove them.

        FDK-90   FDK-180   Proposed Algorithm - train:120 - test:90   Proposed Algorithm - train:90 - test:90
RMSE    0.0259   0.0167    0.0172                                     0.0160
SSIM    0.531    0.730     0.705                                      0.741

Table 6.3: Comparison between the quality of the images of the second rat reconstructed using the FDK algorithm with 90 and 180 projections and the artifact-suppressed images produced by the proposed algorithm with 90 projections.

As we mentioned in Section 2.1.3, for image super-resolution and other applications, various models have been suggested for relating the sparse representation coefficients of the source and the target images. Similarly, many different models can be used for relating the sparse representations of artifact-full and artifact-free blocks. An investigation of all possible models is beyond the scope of this dissertation. However, one natural question is whether a simpler model than the linear model used in the proposed algorithm would work equally well.
To answer this question, we applied our algorithm on the rat data by assuming that the sparse representation coefficients are identical. This is the same assumption used for image super-resolution in [345]. In terms of the notation used in this chapter, this approach means setting P = I, where I is the identity matrix, so that Γc = Γa. The result of this experiment is presented in Figure 6.6. For comparison, we have also shown the slice of the image reconstructed by the proposed algorithm (with the linear mapping). It is clear that this simplified model is much less effective than the proposed model with the linear mapping. In fact, in terms of RMSE and SSIM, the image produced by this simplified model is slightly worse than the image produced by ASDL.

6.4 Discussion

Our results show that the proposed algorithm is very effective in suppressing the streak artifacts that appear in CT images reconstructed from approximately 100 projections. In all of our experiments, the image reconstructed by the proposed algorithm had a better quality than the image reconstructed with the FDK algorithm from twice as many projections. The streak artifacts were largely removed in the images produced from 120 projections by the proposed algorithm. There was also no visible blurring or distortion of the true image features by the proposed algorithm. In addition to suppressing the artifacts, the proposed algorithm also effectively reduces the noise.

Figure 6.6: (a) Image reconstructed by assuming an identity relation between representation coefficients of artifact-full and artifact-free blocks. (b) Image reconstructed by assuming a linear relation; this image is the same as that in Figure 6.3(d).

An important and promising behavior of the proposed algorithm is that its parameters, i.e., the two dictionaries and the linear map, do not have to be learned for every image, as long as the numbers of projections used in the training and reconstruction stages are close. This is a very valuable property because it means a substantial saving in the time and effort needed for training. This behavior is due to the design of the algorithm. Specifically, the algorithm is based on relating the sparse representations of small blocks of the artifact-full and artifact-free images in the learned dictionaries. The local structure of these artifacts does depend on the number of projections used to reconstruct the artifact-full images, but it is, to a large degree, independent of the shape of large image features. Therefore, the parameters learned from one set of training images work well when the algorithm is applied on a very different image, as long as the number of projections does not change much.

The theoretical underpinnings of dictionary learning algorithms have not quite matured yet. In fact, a complete theoretical analysis of dictionary learning is still considered an open problem [98]. Nonetheless, in recent years these algorithms have been used in hundreds of studies and have proven to be robust and reliable. The performance of learned dictionaries for different image processing tasks may depend on factors such as the amount of training data and the scale and structure of features in the image. These factors may also influence the performance of the algorithm proposed in this study, but a detailed investigation of these factors is beyond the scope of this dissertation. Nonetheless, such studies can be very helpful for the proper application of learned dictionaries in CT.
We are aware of only one such study [300], in which the effect of the scale and orientation of features in the training data on the performance of dictionary-based iterative CT reconstruction was analyzed. Similar studies can be very helpful for better implementation of artifact-removal algorithms such as the algorithm proposed in this chapter.

Our experience shows that the performance of the proposed algorithm deteriorates when the number of projections is much less than 100. For example, we have tried applying our algorithm with 70 projections with very little success. Even though the proposed algorithm still suppresses some of the artifacts when the number of projections is around 70, the genuine image details are distorted. This is simply because the artifacts are very strong and overshadow the image features. Overall, our experience shows that the gain in image quality obtained by our proposed algorithm is most significant when the number of projections used is approximately between 100 and 200.

There is also a slight blurring of a few of the image features in some of the images reconstructed by the proposed algorithm. An example can be seen in the bone structure shown in the top-right ROI in Figure 6.2. This blurring can be reduced by increasing the number of atoms of the artifact-free dictionary that are used in building each of the blocks of the reconstructed image. In the proposed algorithm, this is controlled by the parameter T^c. As we mentioned earlier in this chapter, we used T^c = 16; we chose this value because it was found to be a very good choice in the denoising of 3D CT images in [274]. One can choose a larger T^c to reduce the blurring. Increasing the number of dictionary atoms that participate in the representation of the image blocks will improve the reconstruction of image features, thereby reducing the blurring. However, increasing the number of atoms may also reduce the denoising effect of the proposed algorithm because some of the added atoms will model the noise. The right value of T^c will ultimately depend on the desired trade-off between noise and spatial resolution. Another approach for reducing the blurring is to increase the block overlap. In almost all applications of dictionary-based image processing, better results are obtained by increasing the overlap between adjacent extracted blocks. The downside of this approach is increased computation.

As we mentioned in the Introduction section, an important consideration in the learning and usage of overcomplete dictionaries is the computational time. In most applications, the learning of the dictionary is more computationally intensive than its later usage. This is also the case in the proposed algorithm. In our proposed learning algorithm, the most computationally expensive part is the sparse coding steps required for updating Γa and Γc (Equations (6.4) and (6.5)). As we mentioned, we used the OMP algorithm for solving these equations. Given a dictionary D and a signal z, each iteration of OMP selects a new atom from D and projects z onto the space of the columns selected so far. If we denote the sub-dictionary that contains the set of columns that have been selected up to the current iteration by D_S, the bottleneck of the OMP algorithm is the computation of the pseudo-inverse of D_S. In a straightforward implementation, this pseudo-inverse must be computed at every step of OMP, i.e., every time a new dictionary atom is selected.
This computation can be avoided by using a progressive Cholesky or QR update strategy instead of explicit matrix inversion [30, 68]. An even faster implementation is possible by avoiding explicit computation of the signal residual at each iteration. This approach, which has been named Batch-OMP in [275], can result in very large speed-ups when the number of training signals is large. Of course, OMP is not the only tool for solving Equations (6.4) and (6.5), and alternative algorithms have been suggested for fast sparse coding of large numbers of signals [122]. The second most computationally expensive step is the dictionary update in Equation (6.3). In a straightforward implementation of the K-SVD algorithm, the main computational burden is associated with the SVD decomposition of the error matrix. The implementation suggested in [275] avoids an explicit SVD computation, making the update substantially faster. As suggested in [274], even the computation of the error matrix is not necessary, making the algorithm faster still. The speedup strategies mentioned above have been implemented in publicly available software and they have been applied for CT denoising in [15]. The last step in the proposed training algorithm (the update of P through Equation (6.7)) is computationally negligible. The required number of iterations of the dictionary learning algorithm strongly depends on the initialization. If some parameters (Da, Dc, and P) have already been learned on a different dataset, they can be used to substantially reduce the required number of iterations of the learning algorithm. The time required for the reconstruction of the final image from the artifact-full images (once the algorithm parameters have been learned) is proportional to the image size. For example, with our Matlab implementation, an image of size 200 × 200 × 100 voxels can be reconstructed in less than 15 minutes.

There are ways in which the performance of the proposed algorithm may be improved. One of these, as we have mentioned previously in this chapter, is to learn the algorithm parameters (i.e., the dictionaries and the linear map) from a training dataset that is similar to the image that we want to reconstruct. We have also observed that increasing the dictionary size slightly improves the quality of the reconstructed images. Increasing the number of atoms to 2048 (from the 1024 used in the experiments reported here) slightly improved the results. However, increasing the dictionary size also increases the computational load. Another simple approach to improving the algorithm performance is to increase the amount of overlap between neighboring blocks. As we mentioned above, in all of the experiments reported in this chapter we used an overlap of 5 voxels in each direction. Increasing the block overlap always improves the quality of the reconstructed image at the cost of increased computational load. In patch-based processing of 2D images it is common to use the maximum overlap (so that neighboring patches are shifted by only one pixel). However, for large 3D images this may result in an excessive computational load. For blocks of size 8³ voxels used in this study, increasing the voxel overlap from 5 to the maximum possible overlap of 7 will increase the number of blocks by a factor of 27. However, in our experience this also improves the performance of the proposed algorithm in terms of both artifact suppression and, especially, denoising.
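To make the progressive Cholesky update concrete, the sketch below grows the Cholesky factor of D_Sᵀ D_S by one row each time an atom is added, so that the coefficients are obtained with two triangular solves and the pseudo-inverse of D_S is never formed explicitly. It is a single-signal illustration for a fixed sparsity level and equal-norm atoms, not the Batch-OMP implementation of [275].

```python
import numpy as np
from scipy.linalg import solve_triangular

def omp_cholesky(D, x, sparsity):
    """OMP with a progressively updated Cholesky factor L of D_S^T D_S."""
    support = []
    L = np.zeros((sparsity, sparsity))
    residual = x.copy()
    coef = np.zeros(D.shape[1])
    for m in range(sparsity):
        k = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        d_new = D[:, k]
        if m == 0:
            L[0, 0] = np.linalg.norm(d_new)
        else:
            # grow L by one row: L_new = [[L, 0], [w^T, sqrt(d.d - w.w)]]
            w = solve_triangular(L[:m, :m], D[:, support].T @ d_new, lower=True)
            L[m, :m] = w
            L[m, m] = np.sqrt(max(d_new @ d_new - w @ w, 1e-12))
        support.append(k)
        # solve (D_S^T D_S) c = D_S^T x via two triangular solves with L
        b = D[:, support].T @ x
        z = solve_triangular(L[:m + 1, :m + 1], b, lower=True)
        c = solve_triangular(L[:m + 1, :m + 1].T, z, lower=False)
        residual = x - D[:, support] @ c
    coef[support] = c
    return coef
```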
Finally, it may also be possible to achieve improved image quality by using a more flexible model and more training data. For instance, a more general model was proposed for image super-resolution, cross-view action recognition, and sketch-to-photo face recognition in [132]. This model includes two linear maps and has been shown to be successful in various tasks. Training of more general models requires larger training data and longer training times. More general models may also include additional parameters that need tuning. Nevertheless, when properly trained, they may lead to improved results.

In recent years, several "iterative" dictionary-based CT reconstruction algorithms have been proposed [11, 197, 301, 305, 339]. In general, these algorithms have shown promising results. Therefore, an interesting question is whether the algorithm proposed in this chapter can be turned into an iterative reconstruction algorithm. A first step towards this transformation is to introduce a measurement misfit term into our objective function. Assuming that the algorithm parameters have been learned, a potential approach to recovering a high-quality image is by solving the following optimization problem:

    minimize over {x, Γa, Γc}:
        λ_data ‖Ax − y‖²_W + Σ_i { ‖R_i^c x − Dc Γ_i^c‖²₂ + λc ‖Γ_i^c‖₁ + ‖R_i^a x_a − Da Γ_i^a‖²_F + λa ‖Γ_i^a‖₁ } + α ‖Γc − P Γa‖²_F        (6.8)

In this equation, A is the projection matrix, y represents the sinogram measurements, λ_data is the regularization parameter for the measurement misfit term, and ‖·‖²_W is the weighted ℓ2-norm that is preferred in CT reconstruction because the noise in y is signal-dependent. R_i^c is a binary matrix that extracts the ith block from the image x, so that R_i^c x is the vectorized block. The same comment applies to R_i^a x_a, except that, as we explained earlier, two artifact-full images are reconstructed from the odd and even projections; here we have used the simple notation R_i^a x_a to avoid a cluttered equation. The artifact-full image x_a is reconstructed using the FDK algorithm and the high-quality image is obtained as the solution of the above optimization problem.

The optimization problem in (6.8) can be solved using alternating minimization. Minimization with respect to Γa and Γc will be similar to the steps shown in Equations (6.4) and (6.5). Minimization with respect to x can be performed, for example, using the separable paraboloid surrogate method [101]; this is the method used in the iterative dictionary-based reconstruction method proposed in [339]. One problem with this approach is that it requires access to individual elements of the projection matrix A. For large 3D images this is impractical because the matrix A is too large to be saved in the computer memory, and with standard implementations of this matrix it is not easy to access individual elements [152]. Another approach to solving the minimization with respect to x is to use the conjugate gradient method, as suggested in [305]. Another potential challenge in using the iterative reconstruction method proposed in Equation (6.8) is the choice of the regularization parameter. In general, the choice of the regularization parameter is one of the critical aspects of regularized inverse problems [8].
In most dictionary-based iterative reconstruction algorithms that have been proposed for CT, this issue has not been properly addressed and the regularization parameter has been selected empirically [305, 308, 339].

Chapter 7
Two-Level Dictionary for Fast CT Image Denoising and Restoration

7.1 Introduction

As explained in Section 2.1.3, one of the major disadvantages of learned overcomplete dictionaries is that they are much more computationally costly than analytical dictionaries. Specifically, obtaining the sparse representation of a signal in these overcomplete and unstructured dictionaries requires solving an optimization problem. Denoting the dictionary with D, finding the sparse representation γ of a signal x in D requires solving:

    minimize ‖γ‖₀   subject to   ‖x − Dγ‖²₂ ≤ ε        (7.1)

where ε depends on the noise variance. This problem is typically solved either by using a greedy method such as orthogonal matching pursuit (OMP) or by a convex relaxation of the ℓ0 norm to the ℓ1 norm and using methods such as basis pursuit [317–319].

The processing of large 3D images, in particular, is computationally highly intensive because the number and dimensionality of the blocks are very high. Greedy methods, which are the focus of this chapter, may yield very sub-optimal results because they choose the dictionary atoms one at a time.

In this chapter, we propose a structured dictionary for sparse representation of large signals. The proposed dictionary structure speeds up the sparse coding by allowing multiple atoms to be selected in each iteration. Moreover, by structuring the dictionary atoms into clusters, the proposed dictionary structure enables us to learn and effectively use a larger number of atoms, increasing the expressive power of the dictionary. In summary, the proposed dictionary has two levels. The first level consists of an off-the-shelf orthonormal basis, while the second level consists of learned atoms that are adapted to the signal class of interest. The signal is first decomposed in the first-level dictionary. Since the first-level dictionary is orthonormal, this decomposition can be computed very efficiently. The decomposition in the first-level dictionary is then used to find the sparse representation of the signal in the second-level dictionary, which consists of learned atoms. Therefore, the proposed dictionary structure aims at combining the speed of analytical dictionaries with the flexibility and representational power of overcomplete dictionaries.

We will apply the proposed dictionary structure to removing noise and ring artifacts from CT images. Unlike the streak artifacts that we considered in Chapter 6, the noise and ring artifacts considered in this chapter have a very different shape than the genuine image features. Therefore, our approach to removing noise and ring artifacts in this chapter follows the basic dictionary-based denoising method that we described in Section 2.1.3. In other words, we assume that the true image features will have a sparse representation in dictionaries trained on clean, artifact-free images, whereas noise and ring artifacts will not have a sparse representation in such dictionaries. Therefore, sparse estimation of image blocks in a dictionary learned from clean images should lead to the suppression of noise and ring artifacts.

7.2 The proposed algorithm

We denote the image with x. We extract blocks of size 8³ from this image for processing.
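The block extraction and vectorization used in this chapter can be sketched as follows; the stride (which sets the block overlap) is an illustrative choice, and the blocks are gathered into the matrix X introduced in the next paragraph.

```python
import numpy as np

def extract_blocks(volume, block=8, step=3):
    """Extract overlapping block x block x block sub-volumes from a 3D image
    and return them as columns of a (block**3) x N matrix."""
    nx, ny, nz = volume.shape
    cols = []
    for i in range(0, nx - block + 1, step):
        for j in range(0, ny - block + 1, step):
            for k in range(0, nz - block + 1, step):
                cols.append(volume[i:i + block, j:j + block, k:k + block].ravel())
    return np.stack(cols, axis=1)   # X in R^{512 x N} for block = 8
```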
As is commonly done in dictionary-based image processing, we vectorize each extracted block and refer to each vectorized block as a "signal". We also denote by X ∈ R^{512×N} the matrix that contains the vectorized blocks as its columns, with N being the total number of blocks. Our choice of a block size of 8³ is to a large degree arbitrary, and we made this choice only to simplify the presentation of the proposed methods.

We propose a structured dictionary that consists of two levels. The top level consists of a fixed orthonormal basis. To build this basis, we begin with a 3D DCT basis of size 4 × 4 × 4 and upsample it by a factor of 2 in each direction. The resulting dictionary will still be orthonormal, although it will not be a basis because there will be only 64 basis vectors, which cannot span the space R^{512}. The second level consists of atoms that are learned from training data, as will be explained below. Figure 7.1 shows a schematic of the proposed structured dictionary. We denote the top-level dictionary by Du and the second-level dictionary by Dl. Each atom in Dl is grouped under the atom in Du that has the smallest Euclidean distance to it. We write Dl(i) ∈ Du(j) to indicate that atom i in Dl is grouped under atom j in Du.

Figure 7.1: A schematic representation of the proposed dictionary structure. The top-level dictionary, Du, is an orthonormal basis such that the decomposition of a signal in Du can be computed very efficiently. The second-level dictionary, Dl, contains atoms that are learned from training data. The goal is to find the sparse representation of a test signal in terms of atoms from Dl. The first-level dictionary Du is used as a guide for selecting the most informative atoms from Dl.

To explain the rationale behind the proposed dictionary structure, we must note that the most computationally demanding step in greedy sparse coding algorithms is the identification of the most informative dictionary atom. In each iteration of the standard greedy algorithms, such as OMP presented in Algorithm 3, the inner product of the residual and every atom in the dictionary is computed, and the atom that has the largest inner product is identified and added to the support (we assume that all dictionary atoms have equal norms). This has to be performed in a straightforward fashion because the learned dictionary has no structure and there is no fast algorithm for computing the coefficients. Moreover, the number of atoms in learned overcomplete dictionaries is usually very large (usually at least twice the length of the signal). Therefore, this step can be very computationally demanding. This is especially the case when the length of the signal is large, such as in 3D image processing.

As mentioned above, in the proposed two-level dictionary the learned atoms are in Dl, which is the second level. Therefore, the goal is to find the sparse representation of the signal in Dl. The first-level dictionary, Du, is used as an aid in identifying those atoms in Dl that can potentially be useful in sparse coding of the signal. Algorithm 4 shows how the sparse representation of a signal x in Dl is computed.

In each iteration of the algorithm, we first apply Du to the residual and identify the s largest coefficients. These correspond to the s atoms in the orthonormal dictionary Du that are most correlated with the signal residual.
For each of these s atoms in Du, we separately search the atoms in Dl that are grouped under them and find the atom most correlated with the residual. These atoms are then added to the set of atoms found in the previous iterations to update the support of the signal. The signal is then projected onto the space spanned by these atoms. Finally, the representation is pruned to the s largest coefficients (in magnitude). This process can be repeated a predefined number of times or until the norm of the residual falls below a threshold. If the sparsity level s is known, the loop may even be performed only once, because s atoms will already have been selected in the first iteration.

    Input:  dictionary D, signal x, error tolerance ε
    Output: c, the sparse representation coefficients of x in D
    r = x;  c = 0;  I = {}
    while ‖r‖²₂ > ε do
        k̂ = argmax_k |⟨r, D_k⟩|
        I = I ∪ {k̂}
        c = (D_Iᵀ D_I)⁻¹ D_Iᵀ x
        r = x − D_I c
    end

Algorithm 3: Greedy sparse coding of a signal x in a dictionary D using the orthogonal matching pursuit (OMP) algorithm.

    Input:  dictionaries Du and Dl, signal x, sparsity level s, error tolerance ε
    Output: γ, the vector of sparse representation coefficients of x in Dl
    r = x;  I = {};  γ = 0
    while ‖r‖²₂ > ε do
        w = Duᵀ r
        J = supp(w_s)
        K = {}
        for j ∈ J do
            k̂ = argmax_k |⟨Dl(k), r⟩|  such that  Dl(k) ∈ Du(j)
            K = K ∪ {k̂}
        end
        I = K ∪ supp(γ)
        β|_I = ([Dl]_Iᵀ [Dl]_I)⁻¹ [Dl]_Iᵀ x
        β|_{I^c} = 0
        γ = β_s
        r = x − Dl γ
    end

Algorithm 4: Algorithm for greedy sparse coding in the two-level dictionary. We use u_s to denote a vector u restricted to its s largest (in magnitude) elements; in other words, u_s is equal to u at the locations of the s largest components of u and zero elsewhere. We write u|_I to denote the vector u restricted to the indices in the set I, i.e., u|_I(i) = u(i) for all i ∈ I and u|_I(i) = 0 for all i ∈ I^c. For a matrix D we write [D]_K to denote this matrix restricted to its columns indexed by K. Also, supp(u) denotes the support of u, which is the set of indices of its non-zero elements. We borrow these notations from [319].

So far we have assumed that we know the dictionary Dl. In practice, this dictionary must be learned from training data. To learn Dl, we use the K-SVD algorithm [2] that we described in Section 2.1.2, with slight modifications. The difference between our approach and K-SVD is that for the sparse coding step we use Algorithm 4. In addition, after updating the dictionary atoms in each iteration, we cluster them under Du by assigning each learned atom to the closest atom in Du (in terms of the Euclidean distance). Note that the atoms with the smallest Euclidean distance are also the atoms with the largest inner product, because the atoms in Du and Dl have unit norms. We then perform a pruning step on the dictionary atoms. We prune each group of atoms (i.e., all atoms grouped under one of the atoms in Du) by computing the inner product between all pairs of atoms in that group and eliminating one of the atoms in each pair whose inner product is more than 0.95. We also remove atoms that are used by less than a small fraction (p) of the training signals, where we usually choose p to be between
For both the proposed algorithm andK-SVD denoising, we use the simple one-step denoising method that wedescribed in Section 2.1.3.7.3.1 DenoisingFigure 7.2 shows the performance of our algorithm and K-SVD denoisingon noisy images of a brain phantom. This phantom was obtained from theBrainWeb database [63]. The noisy image was reconstructed from projec-tions simulated with incident photon number of 105 and assuming additiveGaussian noise with a standard deviation of 100. From Figure 7.2, the pro-posed algorithm seems to have resulted in better denoising than K-SVD.For an objective comparison, in Table 7.1 we have shown the RMSE, SSIM,CNR, and computational time for two different dictionary sizes.Dictionary size= 1024 Dictionary size= 4096ProposedalgorithmK-SVDdenoisingProposedalgorithmK-SVDdenoisingRMSE 0.067 0.067 0.061 0.066SSIM 0.763 0.762 0.781 0.760CNR 16.1 16.0 16.5 15.9time (h) 0.11 0.20 0.19 0.66Table 7.1: Denoising of the FDK-reconstructed image of a brain phantomwith the proposed two-level dictionary and K-SVD denoising.The proposed algorithm has achieved comparable or better results thanK-SVD, while having a shorter computational time as well. Increasing thedictionary size from 1024 to 4096 (that is, increasing the degree of over-completeness from 2 to 8) has improved the performance of the proposedalgorithm, but it has had little influence on the performance of the K-SVDdenoising. We think that this is because the clustering of the atoms in the1537.3. Results and discussionFigure 7.2: Visual comparison of the proposed algorithm and K-SVD de-noising on denoising of brain phantom images. (a) the true phantom, (b)the noisy image, (c) denoised with K-SVD denoising, (d) denoised with theproposed algorithm.proposed dictionary structure allows much larger number of atoms to belearned and effectively used during dictionary deployment.We also applied the proposed algorithm for removing noise from a seriesof 3D micro-CT images. Figure 7.3 shows a slice from the noisy image ofa rat and the images denoised with the proposed algorithm and K-SVDdenoising. For a quantitative comparison of the proposed algorithm andK-SVD denoising, in Table 7.2 we have shown the RMSE, SSIM, CNR, andthe computation time for denoising of a rat image. The values of RMSE andSSIM were computed by comparing the images with a reference image, alsoshown in Figure 7.3, that was reconstructed with 25 iterations of MFISTA.The proposed algorithm has achieved a slightly better image quality whilereducing the computational time by approximately a factor of 3.7.3.2 RestorationWe applied the proposed algorithm for removing ring artifacts from a seriesof 3D micro-CT images. The images used in this section contain substantial1547.3. Results and discussionFigure 7.3: Visual comparison of the proposed algorithm and K-SVD de-noising on noisy rat images. (a) the reference image, (b) the noisy image, (c)denoised with K-SVD denoising, (d) denoised with the proposed algorithm.amounts of ring artifacts that are caused by detector saturation. We showsome of our results as figures. Because we are unable to reconstruct high-quality reference images in this experiment, we cannot provide a quantitativeevaluation.Figure 7.4 shows some of our results from this experiment. The dic-tionary is learned from a set of artifact-free training images. This learneddictionary is then used for processing of images with ring artifacts. 
The rationale is simply that the learned dictionary will be adapted to representing the image features and not the ring artifacts. Therefore, sparse representation of the image blocks in the dictionary will lead to the suppression of the artifacts. As can be seen in Figure 7.4, the proposed algorithm has resulted in a substantial reduction of the artifacts. In general, we observed that the performance of the proposed algorithm in reducing these artifacts is slightly better than K-SVD, while being at least twice as fast.

           Proposed algorithm   K-SVD denoising
RMSE       0.0140               0.0144
SSIM       0.770                0.770
CNR        22.6                 22.4
time (h)   1.3                  3.4

Table 7.2: Denoising of the FDK-reconstructed image of a rat from a real micro-CT scan with the proposed two-level dictionary and K-SVD denoising.

We should emphasize that the observed success of this simple strategy in suppressing the ring artifacts is because of the shape of these artifacts. Specifically, these artifacts have different shapes than the genuine image features. Because we learn the dictionaries from clean (artifact-free) training images, the dictionary atoms are adapted to representing the genuine image features but cannot represent the ring artifacts as well. Therefore, sparse representation of image blocks in these dictionaries (which is equivalent to shrinkage/thresholding of the coefficients) leads to suppression of the artifacts. If the artifacts were similar in shape to the genuine image features, this simple strategy would probably not be as effective.

Figure 7.4: (a) A rat image with strong ring artifacts, (b) the same image after being processed with a standard dictionary, (c) the same image after being processed with the proposed algorithm.

There are two parameters that could impact the performance of the proposed algorithm in the suppression of ring artifacts. One of these parameters is the dictionary size. As we saw from Table 7.1, compared with an unstructured dictionary, the proposed structured dictionary has the potential to effectively learn and use a larger number of atoms for the denoising task. We would like to know how the dictionary size may influence the performance of the proposed algorithm for artifact suppression. Another very important setting is the sparsity level, i.e., the number of dictionary atoms used to represent each of the image blocks. The sparsity level is denoted by s in Algorithm 4.

In Figure 7.5 we show the performance of the proposed algorithm for two different dictionary sizes (1024 and 4096) and three different sparsity levels (2, 4, and 8). The results show a very slight improvement in the quality of the resulting images with a dictionary of 4096 atoms, compared with a dictionary of 1024 atoms. The sparsity level has a more significant effect on the resulting image quality. As expected, smaller sparsity levels have led to a stronger artifact suppression because a smaller number of atoms is used in representing each image block and, hence, the artifacts have a smaller chance of being represented. On the other hand, this also results in a blurring of true image features.

Figure 7.5: Effect of dictionary size and sparsity level on the performance of the two-level dictionary for suppressing ring artifacts. (a) the original artifact-full image. The second row shows the processed images with a dictionary of 1024 atoms. The third row shows the processed images with a dictionary of 4096 atoms.
The sparsity levels are shown on each image.

Chapter 8
TV-Regularized Iterative Reconstruction

8.1 Introduction

8.1.1 Motivation and background

When the number of CT projection measurements is small and/or the measurements are very noisy, statistical and iterative reconstruction methods can lead to a much higher image quality than analytical reconstruction methods. Therefore, statistical and iterative methods can reduce the amount of radiation used for imaging. Despite their high computational requirements, iterative reconstruction methods have received increasing attention in recent years. This revival of interest is due to several factors, including increased awareness of the health risks associated with exposure to radiation, the availability of faster computers, and algorithmic advancements. In recent years, significant progress has been made in the development of CT reconstruction algorithms. State-of-the-art CT reconstruction methods rely on effective models of the image and use efficient optimization methods such as accelerated first-order methods and variable-splitting algorithms. These methods have led to very promising results in reconstructing high-quality images from undersampled and noisy measurements [61, 160, 240, 251, 263].

Even though there has been significant progress in reducing the number of measurements required for high-quality image reconstruction, the long computational times can still be a major limiting factor in the adoption of iterative algorithms in practice. Although new hardware options such as GPUs offer significant speedups, the size and resolution of the reconstructed images also continue to grow, and many clinical applications demand reconstruction of high-quality images in very short times. Therefore, there is a great need for algorithms that can converge to a high-quality image in a small number of iterations. In this chapter, we propose a reconstruction algorithm based on a new class of stochastic gradient descent methods. The basic stochastic gradient descent method has been widely used in various applications in signal processing and machine learning for decades. However, it has a poor theoretical convergence rate and in practice it usually fails to converge to an accurate solution. In recent years, new stochastic gradient descent algorithms that overcome these shortcomings have been proposed. We will review some of these algorithmic developments and propose an algorithm for image reconstruction in cone-beam CT. We will apply our algorithm on simulated and real cone-beam projection data and compare it with some state-of-the-art reconstruction algorithms.

8.1.2 Formulation of the problem

We consider a linear model for the CT projection measurements:

    ŷ = Âx + v        (8.1)

where Â represents the projection matrix, ŷ denotes the projection measurements, x is the unknown image to be estimated, and v is the measurement noise.

As mentioned in Section 1.2, the noise, v, in the sinogram (i.e., after the log transformation) is very close to a Gaussian with zero mean and a signal-dependent variance, σ²_i ∝ exp(ȳ_i), where ȳ_i is the expected value of the sinogram at detector i. Therefore, following the maximum-likelihood principle, it is natural to use a weighted least-squares cost function of the following form:

    F(x) = (1/2)‖Âx − ŷ‖²_W = (1/2)(Âx − ŷ)ᵀ W (Âx − ŷ)        (8.2)

where W is a diagonal matrix whose diagonal elements are proportional to the inverse of the measurement variances, σ²_i.
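A minimal sketch of forming these statistical weights is given below. As discussed after Equation (8.3), the mean sinogram ȳ is not available, so it is approximated here by a smoothed copy of the measured sinogram; the Gaussian filter and its width are illustrative choices, and the overall scale of the weights is irrelevant.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def statistical_weights(sinogram, smoothing_sigma=2.0):
    """Diagonal of W in (8.2): the variance of the log-transformed data is
    taken proportional to exp(y_bar), so the weights are proportional to
    exp(-y_bar), with y_bar approximated by a smoothed sinogram."""
    y_bar = gaussian_filter(sinogram, sigma=smoothing_sigma)
    w = np.exp(-y_bar)
    return w / w.max()   # normalization is arbitrary up to a constant factor

# Row-scaling by sqrt(w) gives the standard least-squares form of (8.3):
# y = np.sqrt(w) * y_hat, and each row of A_hat is scaled by the same factor.
```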
To simplify the notation, we define A = W^{1/2}Â and y = W^{1/2}ŷ so as to transform the cost function (8.2) into a standard least-squares cost function:

    F(x) = (1/2)‖Ax − y‖²₂        (8.3)

It should be noted that the weight matrix W depends on the mean sinogram data, ȳ, as shown in Equation (1.4), but ȳ is not available in practice because only one sinogram is measured. However, we can use the knowledge that the sinogram is always smooth and slowly varying. Therefore, we smooth the measured sinogram and use it for computing the weights W.

Because the inverse problem of estimating x from the measurements y is ill-posed, it is common to add a regularization term to F(x). Here we use a total variation regularizer and also write F(x) as an average over the projection views. The resulting composite cost function is as follows:

    Φ(x) = (1/n) Σ_{i=1}^{n} (1/2)‖A_i x − y_i‖²₂ + λ TV(x)        (8.4)

where n represents the number of projection views and λ is the regularization parameter. Clearly, y_i denotes the vector of measurements of the ith projection view and A_i denotes the sub-matrix of A formed by keeping only those rows that correspond to the measurements in y_i. We will denote the measurement inconsistency term for the ith projection view by f_i, i.e., f_i(x) = (1/2)‖A_i x − y_i‖²₂.

The algorithm proposed in this chapter makes use of the gradient of F and the gradients of the component functions, f_i. The gradient of the complete measurement inconsistency term in Equation (8.4) is:

    ∇F(x) = (1/n) Aᵀ(Ax − y)        (8.5)

Similarly, the gradient of a component function, f_i, is:

    ∇f_i(x) = A_iᵀ(A_i x − y_i)        (8.6)

The gradient descent method for minimizing F(x) iteratively performs the following update:

    x_{k+1} = x_k − α_k ∇F(x_k)        (8.7)

where α_k is the step size.

To deal with the non-smooth regularization term, we use the proximal-gradient method [66, 67], which is an extension of the gradient descent method in the following sense. It is easy to show that the update in (8.7) is equivalent to the following problem:

    x_{k+1} = argmin_x { F(x_k) + ∇F(x_k)ᵀ(x − x_k) + (1/(2α_k))‖x − x_k‖²₂ }        (8.8)

where the expression being minimized is a quadratic approximation to F(x) in the neighborhood of x_k [43]. The proximal gradient methods account for the non-smooth regularization term (TV(x) in our case) by simply adding it to this approximation:

    x_{k+1} = argmin_x { F(x_k) + ∇F(x_k)ᵀ(x − x_k) + (1/(2α_k))‖x − x_k‖²₂ + TV(x) }

The corresponding update rule for the above problem is:

    x_{k+1} = prox_{α_k TV}( x_k − α_k ∇F(x_k) )        (8.9)

where the proximal operator, or proximal map, is defined as:

    prox_TV(x) = argmin_u { (1/2)‖u − x‖²₂ + TV(u) }        (8.10)

In this chapter, we use the algorithm proposed in [46] for solving the proximal operation in (8.10). This step does not have to be solved to great accuracy; in our experience, one to three iterations of the algorithm in [46] are sufficient for fast convergence of the algorithm proposed in this chapter. Theoretically, it has been proven that if the error in the computation of the proximal mapping decreases gradually, both the basic proximal gradient method and the accelerated proximal gradient method achieve the same convergence rate as in the error-free case [284].

Two properties of the objective function that are particularly important to the performance of first-order methods are Lipschitz continuity of the gradient and strong convexity. A function F(x) is said to have a Lipschitz-continuous gradient if

    ‖∇F(x) − ∇F(y)‖₂ ≤ L‖x − y‖₂   for all x, y ∈ dom(F)        (8.11)

for some constant L.
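For illustration, the sketch below runs the proximal-gradient iteration (8.9) on a small dense test problem. The TV proximal map is approximated with scikit-image's Chambolle-type TV denoiser rather than the algorithm of [46] used in this chapter (the correspondence between its weight argument and the exact prox in (8.10) is only approximate), and for real cone-beam data A and Aᵀ would be matrix-free projection/backprojection operators with the Lipschitz constant estimated by a power method.

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def prox_gradient_tv(A, y, lam, shape, n_iters=100):
    """Proximal-gradient iteration (8.9) for TV-regularized least squares
    on a small dense problem; x is a 2D image flattened to a vector."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad F
    alpha = 1.0 / L                        # constant step size
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)           # gradient of (8.3)
        z = x - alpha * grad
        # approximate prox_{alpha*lam*TV}; 'weight' plays the role of the
        # TV regularization strength
        x = denoise_tv_chambolle(z.reshape(shape), weight=alpha * lam).ravel()
    return x.reshape(shape)
```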
It is said to be strongly convex with parameter γ if

    F(y) ≥ F(x) + ∇F(x)ᵀ(y − x) + (γ/2)‖x − y‖²₂   for all x, y ∈ dom(F)        (8.12)

For F(x) defined as in (8.3), L and γ are equal, respectively, to the largest and smallest eigenvalues of AᵀA, which can be found easily using power methods [118]. Therefore, ∇F(x) and ∇f_i(x) are Lipschitz continuous, and we will denote the Lipschitz constant of ∇f_i(x) by L_i. However, they are not strongly convex because A (and therefore all A_i) have more columns than rows; therefore γ = 0.

8.1.3 Stochastic gradient descent method

In many signal processing and machine learning problems, we are interested in finding a minimizer of the sum (or the average) of a large number of functions, or the sum (or the average) of a function over a large number of training examples. An example is empirical risk minimization, which is widely encountered in machine learning problems. In these applications, the objective function F(x) can be written in terms of component functions f_i(x):

    F(x) = (1/n) Σ_{i=1}^{n} f_i(x)        (8.13)

The standard (full) gradient descent (FGD) method for minimizing F(x) suggests an iteration of the form:

    x_{k+1} = x_k − α_k ∇F(x_k) = x_k − (α_k/n) Σ_{i=1}^{n} ∇f_i(x_k)        (8.14)

where α_k is the step size. The problem is that computing the full gradient is usually very costly. On the other hand, there is usually a large amount of correlation between the measurements. This implies that computation of the full gradient is not necessary to make adequate progress. Instead, the descent direction suggested by a small subset of the component functions, {f_i, i = 1 : n}, can lead to good progress. Because computing ∇f_i is n times faster than computing ∇F(x), this can be of significant practical value when n is large, which is the case in many problems in machine learning and signal processing. Image reconstruction in CT also fits this model very well. Computing the gradients as shown in Equation (8.5) involves forward- and back-projection operations, which are the most expensive operations in CT reconstruction. Also, the number of projections used for reconstruction is usually very large (several tens or hundreds) and there is a large amount of correlation between the measurements in different projection views. This has long been recognized by researchers working on CT reconstruction. The method of ordered subsets, which has been used to accelerate many of the standard CT reconstruction algorithms, is based on this idea [104, 338, 360].

Among the standard methods for minimizing a function like (8.13) is the stochastic gradient descent (SGD) method. SGD computes an update direction based on the gradient of one of the component functions, f_i. Therefore, each iteration of SGD has the following form:

    x_{k+1} = x_k − α_k ∇f_{i_k}(x_k)        (8.15)

where the index i_k is chosen from the set {1, ..., n} based on some probability distribution. The computational cost of each iteration of SGD is 1/n that of FGD. However, although the expected value of the stochastic gradient directions, ∇f_i(x), is equal to the full gradient, ∇F(x), their variance is very high. As a result, the convergence rate of SGD is much worse than that of FGD. Specifically, for smooth functions the convergence rate of FGD is O(1/k) [236], which can be improved to O(1/k²) using acceleration methods [18, 235].
This O(1/k²) convergence rate is known to be optimal, meaning that no first-order method can achieve a faster convergence [236]. On the other hand, the optimal convergence rate of SGD is O(1/√k) [233]. Furthermore, unlike FGD, the convergence rate of SGD does not improve when the objective function has a Lipschitz-continuous gradient [173]. This means that SGD does not exploit this important property, which is satisfied by the measurement misfit term in CT (8.3). In practice, the basic SGD algorithm has a very fast initial convergence speed and is a good method for obtaining rough solutions to large-scale problems. However, obtaining an accurate solution with SGD requires a very large number of iterations and demands that the step size be gradually reduced.

As mentioned above, the difference in the convergence rates of FGD and SGD is due to the fact that the variance of the SGD directions does not diminish as the signal estimate gets closer to a solution. There has been much research on improving the convergence behavior of the basic SGD algorithm, and a complete review of this immense literature is beyond the scope of this dissertation. Here, we mention some of the central ideas and main approaches. A well-known approach to reducing the effect of the large variance of SGD directions is to use diminishing step sizes. There are various guidelines on how to gradually reduce the step size; common choices include exponential decay (α_k = α_0 a^k) and inverse decay (α_k = α_0/(1 + a α_0 k)) [10, 31]. The problem with this approach is that it requires careful tuning of the hyper-parameters (α_0 and a) and it can only achieve a sublinear convergence rate [26, 296]. Another natural approach to reducing the variance is averaging. Different studies have used averaging of the gradient directions (∇f_{i_k}(x_k)), averaging of the signal estimates (x_k), or a combination of the two [128, 238, 260, 288, 335]. In general, averaging will lead to faster convergence rates and increased robustness of the convergence rate to the selection of the step size. However, the improvement in the convergence speed obtained by averaging is usually small. Some algorithms start by performing simple SGD updates and gradually increase the number of functions involved in the computation of the gradient directions, so that the update directions gradually approach the full gradient descent directions. Examples of these types of methods, which are called hybrid methods, can be found in [25, 109]. With careful selection of the step size and batch size, these algorithms can significantly improve on the basic SGD method. However, these methods do not achieve the convergence rate of the variance-reduced SGD algorithms that we will describe below. Some studies have suggested SGD methods with momentum [320]. In these algorithms, the SGD update in (8.15) is augmented with a multiple of the previous update direction(s). However, these momentum methods still require diminishing step sizes and achieve only a small improvement over the basic SGD algorithm.

8.1.4 Variance-reduced stochastic gradient descent

Although the modified SGD algorithms mentioned above can improve the basic SGD method, our focus here is on a new class of SGD algorithms that have been proposed very recently. The core idea in these methods is to keep a copy of the full gradient direction, or copies of the stochastic gradient directions, and use them in building the update directions. With this trick, these methods achieve a linear convergence rate on strongly convex functions.
On smooth but not strongly convex functions, they achieve O(1/k) convergence, which is a dramatic improvement over the O(1/√k) convergence rate of the basic SGD. To the best of our knowledge, the first such algorithm was the stochastic average gradient (SAG) algorithm proposed in [177]. In our notation, the update suggested by SAG is as follows:

\[ x_{k+1} = x_k - \alpha_k \left( \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \nabla F(\tilde{x}) \right) \tag{8.16} \]

where x̃ is an old signal estimate for which we have stored the stochastic gradient directions {∇f_i(x̃), i = 1 : n}. Algorithms such as SAG have been shown to be very successful on a variety of machine learning problems in recent years [79, 165, 206, 241]. The reason for their success is that as the signal estimate approaches the solution, the variance of the update direction approaches zero. This is because as x_k and x̃ approach the optimal point x*, ∇F(x̃) → 0 by the definition of optimality. Now, if ∇f_i(x_k) → ∇f_i(x*) and ∇f_i(x̃) → ∇f_i(x*) (which is the case if the f_i have Lipschitz-continuous gradients), then

\[ \nabla f_i(x_k) - \nabla f_i(\tilde{x}) + \nabla F(\tilde{x}) \rightarrow 0 \tag{8.17} \]

Rigorous convergence proofs can be found in the original papers. For 3D CT, the problem with SAG and many similar recently proposed algorithms (e.g., [79, 206]) is that they require the most recent copies of all stochastic gradients, ∇f_i(x), to be saved. For reconstruction of a 500 × 500 × 500 image from 100 projections, this requires at least 500³ × 100 × 4 bytes = 50 GB of memory. Therefore, we suggest an algorithm that only requires saving the full gradient direction, ∇F(x̃). The algorithm that we will propose is similar to the stochastic variance-reduced gradient (SVRG) algorithm [148], which we present below in Algorithm 5.

    Data: x_0
    Result: x_J
    for j ← 1 to J do
        x̃ = x_{j−1}
        µ̃ = ∇F(x̃)
        x_j^0 = x̃
        for k ← 1 to K do
            select an index i_k from among the set {1, ..., n}
            x_j^{k+1} = x_j^k − α ( ∇f_{i_k}(x_j^k) − ∇f_{i_k}(x̃) + µ̃ )
        end
        x_j = x_j^K
    end

Algorithm 5: SVRG algorithm [148].

The SVRG update is similar in form to the SAG update. The difference is that, unlike SAG, SVRG does not store copies of every ∇f_i(x); instead, only the full gradient, ∇F(x̃), is computed and stored. The price one pays, on a general problem, is that every iteration requires the evaluation of two SGD directions, ∇f_{i_k}(x_j^k) and ∇f_{i_k}(x̃). However, as we will see below, this is not the case in our problem because the gradient is a linear function. Computation of ∇F(x̃) is followed by a large number (K in Algorithm 5) of variance-reduced SGD updates, after which the algorithm recomputes ∇F(x̃) and the process repeats. As for SAG, the reason for SVRG's effectiveness is that the variance of the update direction, ∇f_{i_k}(x_j^k) − ∇f_{i_k}(x̃) + µ̃, is very low compared with that of the basic SGD method, especially close to the solution. As a result, SVRG can work with a constant, large step size α that does not need tuning, and it can achieve significantly faster convergence rates [148, 336].

SVRG and its proximal version [336] are among the state-of-the-art variance-reduced SGD algorithms. One aspect of SVRG that later algorithms have tried to improve is the frequency of computation of the full gradient, ∇F(x̃). In the original SVRG algorithm presented in Algorithm 5, the full gradient is computed after a fixed number of SGD iterations; it is suggested that ∇F(x̃) be recomputed after two SGD passes through the entire data, i.e., K = 2n in Algorithm 5 [148]. Therefore, the computation of the full gradient can account for up to one third of the total computational cost of the SVRG algorithm.
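To make the variance-reduced update of Algorithm 5 concrete, the following is a minimal NumPy sketch of SVRG applied to a generic least-squares objective of the form (8.13) with f_i(x) = ½‖A_i x − y_i‖²; the function name, the uniform sampling, and the synthetic data are illustrative assumptions, not the implementation used in this chapter.

```python
import numpy as np

def svrg_least_squares(A_blocks, y_blocks, x0, alpha, J=10, K=None):
    """Minimal SVRG sketch for F(x) = (1/n) * sum_i 0.5*||A_i x - y_i||^2."""
    n = len(A_blocks)
    K = K or 2 * n                      # two passes per full-gradient refresh, as in [148]
    x = x0.copy()
    for _ in range(J):
        x_tilde = x.copy()
        # Full gradient at the snapshot: (1/n) * sum_i A_i^T (A_i x_tilde - y_i)
        mu = sum(A.T @ (A @ x_tilde - y) for A, y in zip(A_blocks, y_blocks)) / n
        for _ in range(K):
            i = np.random.randint(n)    # uniform sampling of one component function
            g_k = A_blocks[i].T @ (A_blocks[i] @ x - y_blocks[i])
            g_t = A_blocks[i].T @ (A_blocks[i] @ x_tilde - y_blocks[i])
            x -= alpha * (g_k - g_t + mu)   # variance-reduced update of Algorithm 5
    return x

# Tiny synthetic usage example
rng = np.random.default_rng(0)
A_blocks = [rng.standard_normal((20, 50)) for _ in range(8)]
x_true = rng.standard_normal(50)
y_blocks = [A @ x_true for A in A_blocks]
x_hat = svrg_least_squares(A_blocks, y_blocks, np.zeros(50), alpha=1e-3)
```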
A few studies have tried to reduce the frequency of computation of the full gradient. The MixedGrad algorithm suggested in [206] reduces the number of computations of ∇F(x̃) to O(log(K)), where K is the number of SGD updates. MixedGrad is based on the assumption that as the signal estimate becomes closer to the solution, ∇F(x̃) changes less significantly and needs to be updated less frequently. Theoretical analysis shows that MixedGrad has a convergence rate similar to that of SVRG, while reducing the number of ∇F(x̃) updates. Another strategy was suggested by the S2GD algorithm [166]. In S2GD, the number of SGD updates after each computation of ∇F(x̃) is a random variable following a specially devised probability distribution, and it gradually increases as we approach a solution.

Below, we will propose an algorithm for image reconstruction in CBCT. The core iteration in our algorithm consists of a variance-reduced SGD update similar to SVRG. The main differences between our proposed algorithm and those described above are: (1) we suggest a heuristic strategy for deciding when to update the full gradient, ∇F(x̃), and (2) our algorithm gradually increases the batch size and transforms into a limited-memory quasi-Newton method as it approaches a solution. In this sense, the proposed algorithm is similar to the hybrid methods that we briefly described above.

8.2 Methods

8.2.1 The proposed algorithm

Our proposed algorithm is presented below (Algorithm 6). In this algorithm, we use subscripts for the indices of the main loop (which updates µ̃ = ∇F(x̃)) and superscripts for the indices of the inner loop (involving SGD-type updates). To make the algorithm easy to follow, we have omitted some of the details, which we explain here and in the next subsection.

    Data: x_0, {L_i, i = 1, ..., n}, M = 1
    x_1^0 = x_0
    for k ← 1 to n do
        select i_k ∈ {1, ..., n}
        δ = A_{i_k}ᵀ (A_{i_k} x_1^{k−1} − y_{i_k})
        x_1^k = prox_{λTV} ( x_1^{k−1} − (α_k / L_{i_k}) δ )
    end
    x̃ = (2 / ((T+1)(T+2))) Σ_{k=n−T}^{n} (k − n + T + 1) x_1^k
    µ̃ = (1/n) Aᵀ (A x̃ − y)
    for j ← 2 to J do
        x_j^0 = x̃
        for k ← 1 to n_max do
            δ_old = δ
            select a set S_k of size M from {1, ..., n}
            δ = (1/M) Σ_{i ∈ S_k} A_iᵀ A_i (x_j^{k−1} − x̃) + µ̃
            if M ≥ 5 then d = H_k δ else d = δ end
            x_j^k = prox_{λTV} ( x_j^{k−1} − (α_k / L) d )
            if ⟨δ, δ_old⟩ < 0 and (∆Φ)_k < ε₁ then break end
        end
        x̃ = (2 / ((T+1)(T+2))) Σ_{k=n−T}^{n} (k − n + T + 1) x_j^k
        µ̃ = (1/n) Aᵀ (A x̃ − y)
        if ∆Φ < ε₂ then M = 2M end
    end

Algorithm 6: The proposed algorithm. (L_i: Lipschitz constants, n: the number of projection views, M: the batch size, λ: the regularization parameter, α: the step size, T: the number of latest signal estimates averaged to form the new signal estimate at the end of each round of stochastic minimization loops, H: inverse Hessian estimate.)

Our algorithm starts by running through all the f_i and performing a simple proximal SGD update for each f_i. It has been shown that performing basic SGD updates during the first pass through the data leads to much faster convergence than SVRG-type updates [285, 287, 336]. Also, at the end of each round of SGD-type updates, we return a weighted average of the T + 1 latest signal estimates, rather than only the last one. The weighting scheme that we use is similar to those used in [172]; in our experiments we set T ≈ n/10, where n is the number of projection views. Another necessary modification is that, because our cost function includes the non-smooth TV regularization term, we use proximal stochastic gradient steps instead of plain stochastic gradient steps.
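As a small illustration of the weighted averaging that forms x̃ at the end of each round, the sketch below computes the linearly weighted mean of the T + 1 most recent iterates; the helper name and the toy data are assumptions for illustration only.

```python
import numpy as np

def weighted_snapshot(latest_iterates):
    """Linearly weighted average of the T+1 most recent iterates:
    x_tilde = 2/((T+1)(T+2)) * sum_{t=0}^{T} (t+1) * x^(t),
    where the newest iterate receives the largest weight (as in Algorithm 6)."""
    T = len(latest_iterates) - 1
    weights = np.arange(1, T + 2)          # 1, 2, ..., T+1; weights sum to (T+1)(T+2)/2
    x_tilde = sum(w * x for w, x in zip(weights, latest_iterates))
    return 2.0 * x_tilde / ((T + 1) * (T + 2))

# e.g., average the last four iterates of a toy sequence
iters = [np.full(3, v) for v in (1.0, 2.0, 3.0, 4.0)]
print(weighted_snapshot(iters))            # each entry equals 3.0
```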
Unlike SVRG (and more recent algorithms such as MixedGrad), which use a pre-set, fixed update frequency for ∇F(x̃), our proposed algorithm determines on the fly when ∇F(x̃) needs to be updated. Specifically, we recompute µ̃ = ∇F(x̃) only if both of the following conditions are satisfied: (1) the inner product of two successive descent directions (δ) is negative (i.e., the two successive stochastic gradient directions make an angle larger than 90°), which is commonly taken as a sign that the quality of the update directions is poor, and (2) the decrease in the objective function Φ is below a certain threshold, ε₁. Although Algorithm 6 shows these conditions being checked after every iteration of the inner loop, this is not necessary; one can instead check them after, e.g., every 10 iterations of the inner loop. Moreover, the change in the objective function, (∆Φ)_k, used in this step does not have to include the complete measurement misfit term, but only those f_i involved in the current iteration, i.e., {f_i : i ∈ S_k}, making the computation of (∆Φ)_k very cheap.

The other main feature of the proposed algorithm is that it gradually increases the batch size (the number of f_i used to calculate the update direction). This modification puts our algorithm in the class of hybrid methods mentioned above. As we explained, hybrid methods gradually increase the batch size in order to reduce the variance of the update directions as the algorithm approaches a solution. However, this is not our goal, since our algorithm already uses variance-reduced update directions. Our goal is to exploit curvature information to make faster progress towards a solution. When the batch size (M) is larger than 4, we use a quasi-Newton method to compute the update direction. This allows the algorithm to use curvature information as it gets closer to a solution. We have shown this procedure with the simple notation d = H_k δ in the algorithm, where H_k denotes the current inverse Hessian estimate. In fact, we use the limited-memory BFGS algorithm [243], which computes d with an iterative procedure that only involves vector multiplications. When M is too small, the direction d generated by this procedure is poor.

The algorithm starts with a batch size of M = 1 and doubles the batch size every time the reduction in the objective function, ∆Φ, falls below a threshold, ε₂. Unlike the (∆Φ)_k used for deciding whether to update µ̃, the ∆Φ used here includes the full measurement misfit term; this does not require much extra computation because the projection Ax̃ has already been computed in the previous step. A hybrid deterministic-stochastic algorithm is proposed in [109], in which the authors show that exponentially increasing the batch size after every pass through the data leads to good theoretical and practical convergence rates. However, in our experience with CBCT projection data, it is usually much better to increase the batch size very slowly. In fact, in all our experiments the batch size remained M = 1 for at least the first three passes through the data. This is perhaps because in our application n is much smaller than in most machine learning applications.
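The two decision rules above amount to a few cheap comparisons per check. The following is a minimal sketch of that control logic, assuming the caller supplies the successive stochastic directions and the (partial or full) objective decrease; the function and variable names are illustrative, not the notation of Algorithm 6.

```python
import numpy as np

def should_refresh_full_gradient(delta, delta_old, obj_decrease_k, eps1):
    """Recompute mu_tilde = grad F(x_tilde) only when BOTH hold:
    (1) successive directions point more than 90 degrees apart, and
    (2) the cheap, partial objective decrease has stalled below eps1."""
    return (np.dot(delta, delta_old) < 0.0) and (obj_decrease_k < eps1)

def maybe_double_batch(M, obj_decrease_outer, eps2):
    """Double the batch size when the full-objective decrease over the last
    outer iteration falls below eps2 (checked once per outer loop)."""
    return 2 * M if obj_decrease_outer < eps2 else M

# Illustrative use inside an (omitted) reconstruction loop:
# if should_refresh_full_gradient(delta, delta_old, d_phi_k, eps1): break
# M = maybe_double_batch(M, d_phi_outer, eps2)
```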
8.2.2 Implementation details

Sampling.

Sampling refers to the strategy for selecting a function f_i at each step of the stochastic gradient descent algorithm. Cyclic and uniformly random sampling are the simplest and most widely used strategies. When the component functions f_i have Lipschitz gradients, choosing f_i with a probability proportional to its Lipschitz constant, L_i, leads to better theoretical and practical convergence [43, 336]. It has recently been shown that this strategy is a good approximation to the optimal sampling strategy for the basic SGD algorithm [232, 357]. In machine learning applications, an effective strategy is to divide the data into small clusters with low within-cluster variance and use stratified sampling [356]. This last strategy is similar to a sampling technique commonly used in implementations of the ordered-subsets method for CT reconstruction: it is common to adopt subset orderings that lead to large angles between the successive projection views used by the algorithm [130, 159]. The idea behind this method is that two projections with a small angular spacing contain much redundant information; therefore, convergence should be faster if successive projections are far apart. We use this ordering for the initial stage of the algorithm (which consists of ordinary SGD updates). For the rest of the algorithm, however, we use purely random sampling in which each projection is sampled with a probability proportional to its Lipschitz constant, as illustrated in the sketch below. In our experience, this leads to slightly faster convergence in practice. We only make sure that a projection view used in the previous iteration is not sampled again in the current iteration.

Step size.

Standard step sizes for convex and strongly convex problems with Lipschitz gradient are 1/L and 2/(γ + L), respectively [43, 236]. For the initial stage of the proposed algorithm, which involves simple SGD updates, we found that a more aggressive step size of 2/L leads to faster convergence. For the subsequent iterations we use a step size of 1/L. However, as mentioned above, when the batch size, M, is larger than 4, our algorithm transforms into a quasi-Newton method. When this happens, we need to perform a line search to find a suitable step size. We used a backtracking line search with an initial step size of 2/L and the Armijo rule [243]. Interestingly, our experience shows that we do not need to perform this line search at every iteration, because the step size does not change much between iterations. In our implementation, we perform the line search after every 10 iterations of the inner loop. Moreover, the objective function used in the line search does not have to include the complete measurement misfit term, F(x), but only those f_i(x) in the current batch, making the line search much cheaper.
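A minimal sketch of the Lipschitz-proportional sampling described above is given below; it assumes that the constants L_i have already been estimated (e.g., by a power method) and the function name is illustrative.

```python
import numpy as np

def sample_view(lipschitz_constants, previous_index=None, rng=np.random.default_rng()):
    """Draw one projection-view index with probability proportional to L_i,
    excluding the view used in the previous iteration."""
    L = np.asarray(lipschitz_constants, dtype=float)
    p = L.copy()
    if previous_index is not None:
        p[previous_index] = 0.0            # do not reuse the last view
    p /= p.sum()
    return int(rng.choice(len(L), p=p))

# e.g., six views whose Lipschitz constants vary with view angle
L = [0.32, 0.35, 0.43, 0.35, 0.32, 0.40]
idx = sample_view(L, previous_index=2)
```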
Parameters ε₁ and ε₂ determine, respectively, how often the full gradient (µ̃ = ∇F(x̃)) is recomputed and how fast the batch size (M) grows. These parameters are not fixed; they are updated as the algorithm iterates. In fact, both should gradually decrease, because they are thresholds on the reduction in the objective function and this reduction is larger in the early iterations. We have developed heuristic methods for updating ε₁ and ε₂, which we explain here. Every time we start a round of updates in the inner loop, i.e., k = 1, we compute the reduction (∆Φ)_k and set ε₁ = (∆Φ)_k / 2. The informal justification for this choice is that when we start the inner loop we have a fresh µ̃ = ∇F(x̃) and, therefore, the reduction in the objective function, (∆Φ)_k, should be large. With more inner-loop iterations, the current estimate x_j^k departs further from x̃ and, hence, µ̃ = ∇F(x̃) becomes less useful. When the reduction in the objective function is less than 50% of that at the beginning of the inner loop, we decide that µ̃ = ∇F(x̃) needs to be updated. For ε₂ we follow a similar heuristic. Specifically, at the end of each iteration of the outer loop, we expect the reduction in the objective function, ∆Φ, to be at least half of that of the previous iteration. If this is not the case, we double the batch size.

Study on the implementation of the system matrix

The implementation of the system matrix A is of critical importance because it can greatly influence the performance of any iterative reconstruction algorithm. As we briefly mentioned in Section 1.2, in theory the elements of A represent the intersection lengths of rays with voxels. However, this simplified view does not account for important factors such as the size of the detector elements. Moreover, it is computationally very intensive. Therefore, several algorithms have been suggested for efficient implementation of the system matrix for cone-beam CT. Because A is too large to be stored in computer memory, these algorithms implement multiplication with A and Aᵀ, which in CT are referred to as forward-projection and back-projection, respectively.

In brief, these efficient implementations of the system matrix represent the image using some form of voxel basis function and compute the system matrix by calculating the convolution of the footprints of these basis functions with the surfaces of the detectors. Let µ(x⃗) denote the continuous 3D map of attenuation coefficients, which is a function of the spatial location x⃗. When discretized, µ(x⃗) can be written in the general form

\[ \mu(\vec{x}) = \sum_{j} \mu_j \, b(\vec{x} - \vec{x}_j) \tag{8.18} \]

where b(x⃗) is called the basis function and µ_j is the attenuation coefficient at location x⃗_j. Common types of basis functions include cubic and spherically symmetric functions [185, 193, 221]. The convolution of the footprints of these basis functions on the surfaces of the detectors is related to the values of the elements of A.

We performed a detailed study in which we considered three different implementations of the system matrix:

• The distance-driven algorithm proposed in [78].
• The separable-footprints algorithm [193].
• The algorithm proposed in [363] that uses Bessel functions of order 2 as the voxel basis function.
• In addition to the above state-of-the-art algorithms, we implemented a simple forward- and back-projection algorithm that projected only the center of each voxel onto the detector plane.

The main results of our study are as follows [152].

• As expected, there is a trade-off between speed and accuracy. More accurate implementations are also more computationally demanding.
• Fast convergence of iterative image reconstruction methods requires accurate implementation of the forward- and back-projection operations, involving a direct estimation of the convolution of the footprint of the voxel basis function with the surfaces of the detectors.
• Reconstruction of images of low-contrast objects needs a more accurate implementation of the system matrix.
• In iterative image reconstruction, implementations of the system matrix that have a decent level of accuracy lead to faster convergence than implementations that are either inaccurate or too accurate. In our experiments, the implementation based on cubic voxels with separable footprints [193] resulted in the fastest convergence of iterative reconstruction algorithms. Therefore, we used this implementation in all experiments reported in this chapter and in the rest of this dissertation.

8.2.3 Evaluation

We applied the proposed algorithm to the sets of simulated and real data described below and compared it with the Monotone Fast Iterative Shrinkage-Thresholding Algorithm (MFISTA) [17], Nesterov's third method [234], a Gradient-Projection-Barzilai-Borwein (GP-BB) algorithm as suggested in [251], and the proximal version of the original SVRG algorithm as described in [336]. All algorithms were implemented in Matlab version R2012b running on a Windows 7 PC with 32 GB of memory and a 3.4 GHz Intel Core i7 CPU.

Simulated data. Two sets of scans with average incident photon counts of N₀ = 2 × 10³ and N₀ = 2 × 10⁴ were simulated from a 3D Shepp-Logan phantom of size 256 × 256 × 256 voxels with isotropic voxels of 0.1 × 0.1 × 0.1 mm³. We will refer to these scans as the high-noise and low-noise simulated scans, respectively. Each scan consisted of 180 projections between 0° and 360° with uniform angular spacing. A flat detector of 360 × 360 pixels was considered. The distances from the source to the detector panel and to the axis of rotation were assumed to be 450 mm and 400 mm, respectively, and the distance between the centers of adjacent detector pixels was assumed to be 0.113 mm. We used the original image of the phantom for evaluating the quality of the reconstructed images. We used 30 uniformly spaced projections for reconstruction with the proposed algorithm and the other iterative algorithms. We also reconstructed the image of the phantom with the FDK algorithm, both using 30 projections and using all 180 projections.

Real data. We used micro-CT scans of the physical phantom and of a dead rat. Each scan consisted of 720 projections at 0.50° intervals. The tube voltage, tube current, and exposure time were 70 kV, 32 mA, and 16 ms, respectively, for both scans. We used the full set of 720 projections with the FDK algorithm to reconstruct high-quality images of the phantom and of the rat. We will call these high-quality images "the reference images" and regard them as the true images of the physical phantom and the rat. We used 120 equally spaced projections from the phantom and 120 equally spaced projections from the rat for reconstruction with the proposed algorithm and the other iterative algorithms.
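As a rough illustration of how the noisy low-dose scans described above can be generated, the following sketch applies the Beer-Lambert law with a Poisson photon-counting model to a vector of noise-free line integrals; N₀ and the helper name are assumptions for illustration, and this is not the exact simulation code used in this study.

```python
import numpy as np

def simulate_noisy_sinogram(line_integrals, N0, rng=np.random.default_rng()):
    """Turn ideal line integrals p (integral of mu along each ray) into noisy
    log-transformed measurements using a Poisson photon-counting model."""
    expected_counts = N0 * np.exp(-line_integrals)   # Beer-Lambert law
    counts = rng.poisson(expected_counts)
    counts = np.maximum(counts, 1)                   # avoid log(0) for fully attenuated rays
    return np.log(N0 / counts)                       # noisy line integrals

# e.g., high-noise (N0 = 2e3) versus low-noise (N0 = 2e4) versions of the same ray sums
p_ideal = np.linspace(0.0, 4.0, 5)
print(simulate_noisy_sinogram(p_ideal, 2e3))
print(simulate_noisy_sinogram(p_ideal, 2e4))
```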
8.3 Results

8.3.1 Simulated data

Figure 8.1 shows the plots of RMSE and the objective function as a function of CPU time for the different reconstruction algorithms for the high-noise and low-noise simulated data. The proposed algorithm shows a much faster convergence rate than the other algorithms on both the low-noise and the high-noise projection sets. In particular, the convergence rate of the proposed algorithm is very fast in the initial iterations. In Table 8.1, we have summarized some of the image quality criteria for images reconstructed using the different algorithms. The values in this table correspond to images reconstructed after 1000 s. The images reconstructed using the proposed algorithm have much higher quality criteria and are very close to, or slightly better than, the images reconstructed from 180 projections using the FDK algorithm.

Figure 8.1: Evolution of the reconstruction error (RMSE) and the objective function for reconstruction of the Shepp-Logan phantom. (a) RMSE, low-noise projections; (b) RMSE, high-noise projections; (c) objective function, low-noise projections; (d) objective function, high-noise projections.

                 Proposed   Prox-SVRG   Nesterov   MFISTA   GP-BB   FDK-30   FDK-180
    Low-noise
      SSIM       0.741      0.696       0.661      0.643    0.619   0.546    0.730
      MI         0.642      0.555       0.524      0.506    0.488   0.426    0.683
      CNR        5.13       4.14        3.86       3.88     3.78    3.27     5.49
    High-noise
      SSIM       0.710      0.600       0.561      0.564    0.463   0.335    0.701
      MI         0.627      0.523       0.500      0.494    0.426   0.389    0.667
      CNR        4.81       4.17        3.74       3.74     3.45    2.96     4.99

Table 8.1: Image quality criteria for the images of the Shepp-Logan phantom reconstructed from 30 projections using different algorithms. The numbers next to FDK show the number of projections used with that algorithm.

For a visual comparison, in Figure 8.2 we have shown a central slice and a central profile through the images of the phantom reconstructed from the high-noise projections using the proposed algorithm and MFISTA (which performed best among the other algorithms on these data). The image reconstructed by the proposed algorithm includes all of the important features of the phantom with very few artifacts and is much superior to the MFISTA-reconstructed image.

Figure 8.2: (a) The central slice and (b) the central profile of the Shepp-Logan phantom reconstructed using MFISTA; (c) the central slice and (d) the central profile of the Shepp-Logan phantom reconstructed using the proposed algorithm.

8.3.2 Micro-CT scan of the physical phantom

Figure 8.3 shows the plots of RMSE, SNR, CNR, and SSIM for reconstruction of the image of the physical phantom using the different algorithms. The horizontal axis is labeled "iteration number". Here, one iteration means one forward-projection and one back-projection involving all n projection views. We use this more convenient measure of computation time because for large images the computational time is dominated by the time required for forward- and back-projection. For all iterative algorithms in this study, forward- and back-projection accounted for more than 90% of the computation time. The RMSE plots in Figure 8.3 indicate that the proposed algorithm converges to the reference image much faster than the other algorithms, particularly in the initial iterations. The plots of SNR, CNR, and SSIM in Figure 8.3 show that the objective image quality criteria improve much faster with the proposed algorithm than with the other algorithms. In Table 8.2 we have summarized the values of some of the image quality measures after 30 iterations of the different algorithms. The image reconstructed using the proposed algorithm has higher quality criteria than the images reconstructed using all other algorithms, and it is also superior to the image reconstructed using FDK with 240 projections.

Figure 8.3: Plots of RMSE, SSIM, CNR, and SNR for reconstruction of the image of the physical phantom from real CBCT projections using different algorithms.

                 Proposed   Prox-SVRG   Nesterov   MFISTA   GP-BB   FDK-120   FDK-240
    SSIM         0.747      0.716       0.695      0.696    0.675   0.499     0.703
    MI           0.593      0.564       0.540      0.532    0.493   0.370     0.528
    CNR          24.0       23.6        23.6       23.5     23.0    17.0      23.3
    SNR (dB)     20.9       20.4        20.2       20.2     20.3    18.0      20.1

Table 8.2: Performance comparison between different algorithms in reconstruction of the image of the physical phantom from real data after 30 iterations. The numbers next to FDK indicate the number of projections used with that algorithm.

The phantom includes a set of fine coils that are ideal for visual assessment of the spatial resolution in the reconstructed images.
In Figure 8.4, we have shown slices through two of these coils in the images reconstructed using different algorithms. Compared with the images reconstructed by Nesterov's algorithm and Prox-SVRG, the image reconstructed by the proposed algorithm seems to be much closer to the reference image. For a closer comparison, in the same figure we have also plotted the difference between the reference image and the images reconstructed with the proposed algorithm and Nesterov's algorithm along a profile through the center of one of these coils. These plots clearly show that the image reconstructed using the proposed algorithm is closer to the reference image.

Figure 8.4: Slices through two of the coils in the images of the physical phantom: (a) the reference image, (b) reconstructed using Nesterov's method, (c) reconstructed using Prox-SVRG, (d) reconstructed using the proposed algorithm. The plots show the difference between the reference image and the images reconstructed using (e) Nesterov's method and (f) the proposed method along the vertical line shown in the image of the coil in the reference image.

8.3.3 Micro-CT scan of a rat

Figure 8.5 shows the plots of RMSE, SSIM, MI, and CNR for the different algorithms for reconstruction of the image of the rat. The general trends observable in this figure are very similar to those in Figure 8.3. What is most important to us is that the convergence of the image reconstructed with the proposed algorithm to the reference image is very fast. Also, all objective image quality measures improve much faster with the proposed algorithm. The Prox-SVRG algorithm also has a fast initial convergence rate but falls behind the proposed algorithm with more iterations.

Figure 8.5: Plots of RMSE, SSIM, CNR, and MI for reconstruction of the image of the rat from real CBCT projections using different algorithms.

Table 8.3 summarizes some of the image quality criteria for the reconstruction with different algorithms after 30 iterations. A comparison of the numbers in this table shows that the image reconstructed by the proposed algorithm has a higher quality than those reconstructed by the other algorithms. The image reconstructed by the proposed algorithm also has higher quality criteria than the image reconstructed with the FDK algorithm using 240 projections.

                 Proposed   Prox-SVRG   Nesterov   MFISTA   GP-BB   FDK-120   FDK-240
    SSIM         0.761      0.732       0.719      0.706    0.675   0.459     0.708
    MI           0.478      0.454       0.452      0.450    0.444   0.355     0.459
    CNR          22.9       22.5        22.2       22.2     22.1    17.7      22.3

Table 8.3: Performance comparison between different algorithms in reconstruction of the image of the rat after 30 iterations. The numbers next to FDK indicate the number of projections used with that algorithm.

For a visual comparison, in Figure 8.6 we have shown a representative slice from the images of the rat reconstructed by different algorithms. For a better visual comparison between the images reconstructed with the different algorithms, we have selected two regions of interest (ROI) and shown them in zoomed-in views with a narrower window of linear attenuation coefficient, µ.
The µ-window for the entire slice is [0, 0.55]. The ROI shown on the lowerleft includes fat surrounded by soft tissue; the µ-window used for displayingthis ROI is [0.14, 0.22]. The ROI shown on the lower right includes bonesurrounded by soft tissue; the µ-window used for displaying this ROI is[0.15, 0.50]. It is quite clear, especially from the ROI displayed on the lowerleft, that the image reconstructed by the proposed algorithm has a higherquality that those reconstructed by the other algorithms.8.4 DiscussionA comparison of the plots of RMSE and the objective function for our exper-iments with simulated and real data shows that the proposed algorithm hasa much faster convergence to the true or reference image than the other algo-rithms considered in this study. Plots of the objective image quality criteriashow that the proposed algorithm recovers a high-quality image much fasterthan the other algorithms. In the experiments with the simulated data, theproposed algorithm was able to reconstruct the Shepp-Logan phantom to1818.4. DiscussionFigure 8.6: A representative slice of the image of the rat reconstructedusing different algorithms: (a) The reference image, (b) reconstructed byProx-SVRG, (c) reconstructed by Nesterov’s method, and (d) reconstructedby the proposed algorithm. The zoomed-in views re-displayed with a narrowwindow of linear attenuation coefficient correspond to the rectangular ROIsshown on the reference image.a high accuracy from undersampled and noisy measurements. The imagesreconstructed by the proposed algorithm were superior to the images recon-structed by other algorithms in terms of visual quality and all quantitativecriteria used in this study. The same was true for our experiments with thereal CBCT projections.An important observation was the very fast convergence rate of the pro-posed algorithm in the early iterations. Moreover, the algorithm maintaineda good convergence rate with more iterations. In reconstruction from realCBCT projections, the number of iterations of the proposed algorithm toachieve a certain RMSE was approximately 1/3 the number of iterations re-1828.4. Discussionquired by the other algorithms, as can be seen in Figures 8.3 and 8.5. Thiscan be of high practical value in clinical applications where a fast imagerecovery is highly desirable.The Prox-SVRG algorithm was better than MFISTA, the Nesterov’smethod, and GP-BB, which are all among the best methods for image re-construction in 3D CT. As shown in Figures 8.3 and 8.5, Prox-SVRG had agood start but its convergence quickly slowed down. Our implementation ofProx-SVRG in this study was identical to that suggested in [336]. We havefound that this simple algorithm can be significantly improved if we slightlyreduced the step size or the regularization parameter with iteration number.Overall, the performance of our proposed algorithm and Prox-SVRG in thisstudy suggests that variance-reduced SGD methods can form the basis ofsuccessful algorithms for image reconstruction in CT.The value of the regularization parameter λ has a significant influenceon the performance of the proposed algorithm as well as the performanceof other algorithms considered in this chapter. In general inverse problems,and in CT reconstruction in particular, sometimes a trial-and-error methodis used to find a proper value for λ [8, 251]. However, this can be a drawbackin practice. 
There are also systematic methods for determining λ, but mostof them are computationally very expensive or apply to a very limited classof problems [102, 323]. A heuristic approach that we followed in this studywas to choose a trial value for λ, apply a small number (e.g., 3 to 5) ofproximal SGD updates and monitor the change in the values of the twoterms of the objective function, i.e., the measurement misfit term and thetotal variation. To avoid excessive computational costs, we only look at thechange in one of the n components of the measurement misfit term. Forvalues of λ that are far from the proper range of values, one or both of thetwo terms decrease very little. Only for a relatively short range of λ do bothterms decrease consistently and we use a value towards the lower end of thisrange. Even though this approach requires trying several different value ofλ, for each value only a small number of proximal SGD updates are applied.Therefore, a range of possible values for λ can be found with relatively littleeffort. For the low-noise data simulated from the Shepp-Logan phantom, forexample, the identified range was [50, 700] and we chose a value of 100. ForProx-SVRG we used the same λ that we used for the proposed algorithm.For MFISTA, Nesterov’s method, and GP-BB we started with the value ofλ that we used for the proposed algorithm as explained above, but thentried several larger and smaller values and chose the value that gave thebest reconstruction results.In addition to the regularization parameter λ, the proposed algorithm1838.4. Discussionincludes other parameters that can affect its performance and the quality ofthe reconstructed image. The two most important of these are the step size(α) and the batch size (M). For the step size, we provided a description ofthe available guidelines in Section 8.2.2. As we mentioned there, for the firstphase of the algorithm we use step sizes that are inversely proportional tothe Lipschitz constants, L, which is known to be the optimal step size [236],and for the second phase of the algorithm we use a backtracking line searchfor step size selection. Since these guidelines are based on sound theory, wedo not discuss tuning of the step size. Therefore, we focus on λ and M .In order to study the effect of the choice of λ and M on the algorithmperformance, we conducted an experiment on the scan of a rat obtainedwith the micro-CT scanner. For this scan, a lower tube voltage of 50 kVwas used and a 2-mm copper filter was also used to increase the noise level.We used 240 equally-spaced projections from this scan. We applied theproposed algorithm with three different values of λ ∈ {75, 150, 350} andthree different values of M ∈ {1, 4, 10}. Note that the proposed algorithmgradually increases the batch size. Therefore, by M here we mean thebatch size at the start of the algorithm. The results of this experimentare presented in Figures 8.7 and 8.8.In Figure 8.7 we have shown plots of RMSE, CNR, and SSIM for differ-ent values of λ and M . In Figure 8.8 we have shown a slice of the imagereconstructed with different parameter values. In both of these figures, wehave shown the result obtained with the Nesterove’s algorithm for compar-ison. On this dataset, Nesterov’e method performed better than MFISTAand GP-BB and was very close to Prox-SVRG. There are important con-clusions that can be drawn from these figures. As expected, the visual andobjective quality of the image reconstructed by the proposed algorithm isinfluenced by the choice of the parameters. 
The choice of the regularizationparameter (λ) affects the convergence behavior of the proposed algorithmand the visual quality of the reconstructed image. A larger λ leads to asmoother image with stronger denoising but a simultaneous blurring and re-duction in the sharpness of the edges. On the other hand, a smaller λ leadsto a slower convergence in terms of RMSE and reconstruction of an imagethat is in general rougher. In terms of the batch size (M) the conclusion ismore straightforward: using M = 1 always lead to the best result. As we ar-gued earlier in this chapter, this is because the projection measurements fordifferent view angles contain much shared information and, hence, the gra-dient computed based on the projection measurements from different viewangles are also highly correlated. Therefore, in the early iterations of thealgorithm, it is much more efficient to compute the update directions based1848.4. DiscussionFigure 8.7: Plots of RMSE, SSIM, and CNR for different settings of theregularization parameter, λ, and the batch size at the start of the algorithm,M . The legend for all plots is similar to the one shown on the top-left plot.on the gradient computed from one projection. Notwithstanding these in-fluences, it is interesting to note that the proposed algorithm still recovers ahigh-quality image that is visually and quantitatively better than the imagereconstructed with the Nesterov’s algorithm for a relatively wide range ofparameter values.As we mentioned above, the most computationally expensive part of theproposed algorithm is the forward and back-projection operations, whichwe have denoted with Ai and ATi . These operations accounted for approx-imately 91% of the computational time. The second most computationallyexpensive operation was the computation of the proximal operators, denotedwith proxλTV(.) in Algorithm 6, which accounted for approximately 5% ofthe computational time. As we mentioned in Section 8.1.2, we used the1858.4. DiscussionFigure 8.8: Effect of the choice of the regularization parameter, λ, and thebatch size at the start of the algorithm, M , on the visual quality of the re-constructed image of a rat. (a) The reference image, (b) FDK-reconstructed,(c) reconstructed using Nesterov’s algorithm, and the images reconstructedusing the proposed algorithm with different parameter values: (d) λ = 75,M = 1, (e) λ = 75, M = 4, (f) λ = 75, M = 10, (g) λ = 150, M = 1, (h)λ = 150, M = 4, (i) λ = 150, M = 10, (j) λ = 350, M = 1, (k) λ = 350,M = 4, (l) λ = 350, M = 10.1868.4. DiscussionChambolle’s well-known algorithm [46] to compute the proximal operations.Although a large number of iterations of this algorithm can be applied tocompute the proximal operation to a high accuracy, we have found that oneto three iterations are enough to give the proposed algorithm a good behav-ior. This experience of ours agrees with the theoretical results developedin [284]. The third most computationally expensive part of the proposedalgorithm was the application of the limited-memory BFGS algorithm tofind the update direction, the step denoted with d = Hkδ in Algorithm 6.This computation accounted for approximately 2% of the algorithm time.The rest of operations accounted for approximately 2% of the computationaltime.Most of the variance-reduced SGD algorithms proposed in recent yearsfocus on strongly convex functions or have much higher theoretical con-vergence rates for strongly convex functions [79, 148, 165, 177]. 
As wementioned above, when the number of projection measurements is less thanthe number of unknown voxels (which is almost always the case in sparse-view reconstruction), the CT reconstruction problem is not strongly convex.Nonetheless, our experimental results show that variance-reduced SGD al-gorithms can lead to efficient CT reconstruction algorithms. It is an openquestion whether the convergence rate can be improved if the CT recon-struction problem is formulated as a strongly convex optimization problem.A simple approach, also suggested in [206], is to add an `2 regularizationterm to the objective function and gradually reduce the strength of thisregularization with iterations. We applied this idea to CT reconstructionbut did not obtain good results. An entirely different possible approach toimproving the algorithm proposed in this chapter is to use acceleration tech-niques, which have been shown to work well with ordered-subsets methodfor CT reconstruction in recent years [159, 161].187Chapter 9Iterative Reconstructionwith Nonlocal Regularization9.1 IntroductionThe results of Chapter 8 showed that variance-reduced stochastic gradientdescent (VR-SGD) algorithms are a very suitable optimization approach forCT reconstruction. Not only they show very good convergence behavior,they do not need manual tuning of the step size, unlike the basic SGDmethods. The algorithm proposed in Chapter 8 improved the basic VR-SGD algorithm by gradually increasing the batch size and exploiting thecurvature of the cost function as the image estimate approached a solution.The focus of this Chapter is on the problem regularization. In chapter8 we relied on TV regularization, which has been used by many studieson CT reconstruction in the past decade. Recent studies, however, haveshown that patch-based regularization methods can outperform TV-basedregularization methods in image reconstruction. In this chapter, we considera nonlocal patch-based regularization.Because the problem of estimating the CT image from few-view andnoisy projections is ill-posed, it is critical to properly regularize the recon-struction problem. Most of the published reconstruction algorithms in thepast decade have used smoothness-promoting or edge-preserving regulariza-tion functions. These regularizers span a wide range of complexity, fromsimple roughness penalties to non-convex regularizations terms. In general,these regularizers encourage smooth or piecewise-constant solutions by pe-nalizing jumps between neighboring pixels. Algorithms that are based onsuch regularizers have been successful, and their success has contributed sig-nificantly to the growing interest in iterative CT reconstruction. However,CT images usually include fine and texture-like features that are not suitablefor reconstruction with these regularizers. Such features usually get blurredor are poorly reconstructed with these algorithms.As we saw in the review of the literature in Chapter 2, research in the1889.1. Introductionpast ten years has shown that nonlocal patch similarities can be used todevise very powerful models for natural images. Nonlocal patch similaritieshave been successfully exploited in various image processing tasks. Thesemodels are well known for preserving fine image features such as texturesand low-contrast edges even in the presence of strong noise. This is becausethe redundant information in similar image patches will help preserve gen-uine image features even when these features are not very strong or whenthe amount of noise is substantial. 
As we described in Sections 2.2 and 2.6.2, nonlocal patch-based similarities have also been successfully used to regularize various inverse problems, including CT reconstruction. Let us consider the linear model with Gaussian noise that we described in Section 8.1.2, i.e., y = Ax + w. In this model, y is the vector of sinogram data, A is the projection matrix, and w is the additive noise. To estimate x, it has been suggested to minimize a cost function of the form

\[ F(x) = \frac{1}{2}\|Ax - y\|_2^2 + R_{\mathrm{NL}}(x) \tag{9.1} \]

where the regularization term, R_NL(x), penalizes the difference between pixel values based on their patch similarity. Various formulations have been proposed for R_NL(x). For example, two common formulations are the following [114, 194]:

\[ R_{\text{NL-TV}}(x) = \sum_i \frac{1}{C(i)} \sum_{j \in S_i} G_a\!\left(\|x[j] - x[i]\|\right) \, |x(j) - x(i)| \]
\[ R_{\text{NL-H1}}(x) = \sum_i \frac{1}{C(i)} \sum_{j \in S_i} G_a\!\left(\|x[j] - x[i]\|\right) \, |x(j) - x(i)|^2 \tag{9.2} \]

In the above equations, C(i) is a normalization constant, G_a is usually a Gaussian function with bandwidth a, S_i is usually a window around the current pixel x(i), and x[i] denotes a patch centered on x(i).

Minimization of the cost function in Equation (9.1) is challenging, mainly because of the dependence of the patch similarity weights on the image x. Most of the proposed algorithms find an approximate solution by computing the weights from an initial image estimate [194] or by iteratively updating the weights from the latest image estimate [229, 354]. Moreover, many different approaches have been proposed for minimizing (9.1), including gradient descent [194], proximal gradient methods [255], majorization-minimization [348], graph-cuts methods [114], and Bregman methods [354].

In general, algorithms that use regularization functions based on nonlocal patch similarities have been reported to outperform algorithms based on smoothness-promoting regularizations. However, many of the proposed algorithms that use nonlocal patch-based regularization have the following issues:

1. They are only suitable for small-scale problems. In general, patch-based image processing algorithms are known to be very computationally intensive. This is especially the case for processing large 3D images. Existing approaches for iteratively updating the patch-similarity-based weights are very costly when applied to large 3D images.

2. For the minimization of the measurement misfit term, most of the proposed algorithms use slow methods such as gradient descent or conjugate gradient descent [146, 154, 155, 202, 353]. Hence, most of the proposed algorithms have been evaluated on small 2D images [133, 194, 353].

3. Another limitation of almost all proposed algorithms is that they compute the patch similarity weights from a small window around each pixel. For large 3D images, this window must be very small to keep the computations manageable. However, this is not a good practice, because there may be no similar patches in this window, while many similar patches may exist in other parts of the image.

In this chapter, we suggest an iterative CT reconstruction algorithm with nonlocal patch-based regularization that tries to address some of the shortcomings mentioned above. Unlike previous algorithms that use patches from a small window in the same image, we use patches from a high-quality reference image. In this respect, our approach is similar to that in Chapter 5, where we used a low-noise sinogram for interpolating noisy undersampled projections. Moreover, we use a stochastic algorithm to find a small number of similar patches that can come from any location in this reference image. For minimization of the cost function we use a VR-SGD method, as we did in Chapter 8.
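To make the NLM-style weighting of Equation (9.2) concrete, the following is a minimal sketch that forms a nonlocally weighted estimate of one pixel from a set of candidate patches; the patch-extraction step, the parameter values, and the function name are illustrative assumptions rather than the exact procedure used later in this chapter.

```python
import numpy as np

def nlm_estimate(target_patch, candidate_patches, candidate_centers, a):
    """Weighted average of candidate center pixels, with Gaussian weights
    on squared patch distances and a 1/C(i) normalization."""
    dists = np.array([np.sum((target_patch - p) ** 2) for p in candidate_patches])
    weights = np.exp(-dists / (a ** 2))      # Gaussian kernel with bandwidth a
    weights /= weights.sum()                  # normalization constant C(i)
    return float(np.dot(weights, candidate_centers))

# Toy usage: a 3x3 target patch and three noisy candidate patches
rng = np.random.default_rng(1)
target = rng.standard_normal((3, 3))
cands = [target + 0.1 * rng.standard_normal((3, 3)) for _ in range(3)]
centers = [p[1, 1] for p in cands]
print(nlm_estimate(target, cands, centers, a=0.5))
```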
9.2 Methods

9.2.1 Problem formulation

We propose to estimate the unknown image x as a minimizer of the following cost function:

\[ F(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\|A_i x - y_i\|_2^2 + \frac{\lambda_{\mathrm{NL}}}{2}\|x - x_{\mathrm{NL}}\|_2^2 + \lambda_{\mathrm{TV}}\,\mathrm{TV}(x) \tag{9.3} \]

We have written the measurement misfit term (the first of the three terms in the above objective function) as an average over the projection views, similar to our approach in Chapter 8. In this term, y_i denotes the vector of sinogram measurements in the ith projection view and A_i denotes the projection matrix for that projection view angle. In the first regularization term, x_NL is an estimate of x that is computed from a high-quality reference image using a patch-based approach similar to NLM denoising. The second regularization term is the total variation of x. By using both types of regularization, we will be able to study their effects separately and jointly.

We compute x_NL using the following equation, which is the standard NLM-type formulation:

\[ x_{\mathrm{NL}}(i) = \frac{1}{C(i)} \sum_{j \in S_i} G_a\!\left(\|x[i] - x_{\mathrm{ref}}[j]\|_F^2\right) \cdot x_{\mathrm{ref}}(j) \tag{9.4} \]

Therefore, the regularization function that we suggest is slightly different from the more commonly used forms shown in Equation (9.2). Nonetheless, the justification behind this regularization is the same. Indeed, at least one study has used a similar regularization for CT reconstruction [202].

As mentioned above, in most previous studies the patch similarity weights are computed either from an initial image estimate or from the latest image estimate on the fly. Moreover, the set of patches used to regularize the value of the ith pixel, denoted by S_i in Equations (9.2) and (9.4), is usually the set of all patches in a small window around that pixel. The proposed algorithm is different in both respects, as we explain below.

Firstly, we compute the patch similarity weights from a high-quality reference image, denoted by x_ref in Equation (9.4). Images reconstructed from low-dose scans usually contain much noise and streaking artifacts. Therefore, an initial image estimate, which is usually reconstructed using a filtered backprojection algorithm, is a poor choice for computing the patch similarities. One may expect that iteratively updating the patch similarities from the latest image estimate will gradually improve the patch similarity estimates. However, the opposite may happen, because strong artifacts in the early image estimates can result in poor patch similarity estimates, further amplifying the artifacts with more iterations of the algorithm [229]. This can be avoided by estimating the patch similarity weights from a high-quality reference image.

Secondly, it is very likely that no patches similar to patch x[i] exist in a small window around the pixel x_ref(i) (or pixel x(i), for that matter). Therefore, instead of defining S_i to be a window around pixel x_ref(i), we define it to be the set of indices of k patches from the image x_ref that are similar to the patch x[i]. These patches can be located anywhere in x_ref. Therefore, for each pixel x(i), we need to find a set of k patches similar to x[i] in x_ref. Because of the very large size of the image, we cannot hope to find the k most similar patches. Therefore, we use a stochastic approach based on the Generalized PatchMatch algorithm [13].
We described the main steps of this algorithm in Section 5.2.1. We follow the Generalized PatchMatch algorithm, except that we do not use random initializations. Instead, we use a more informed initialization that can lead to much better matches in the early iterations of the algorithm. As mentioned above, for block matching we use a high-quality prior image as the reference image, denoted by x_ref in Equation (9.4). This can be the image of the same patient (in situations where a patient is scanned several times) or of a different patient from a database. Let us denote by S_i the indices of the k patches in x_ref that are similar to x[i]. If the locations of the organs in the reference image and the image being reconstructed have not shifted much, one can initialize S_i to a set of random patches in a small window around x_ref(i). However, this will not be a good strategy if the organs have shifted significantly between x and x_ref, or if x and x_ref are images of two very different patients. In that case, we suggest finding initial values for the S_i using the following two steps, illustrated in Figure 9.1.

1. Partition x and x_ref into non-overlapping blocks and run several iterations of the Generalized PatchMatch to find a set of at least k matches in x_ref for each (non-overlapping) block in x. Let us denote each of these non-overlapping blocks of x by x_NO[j] and the indices of the set of matching blocks for x_NO[j] by S_NO(j). This step determines the overall mapping of the locations of matching features between x and x_ref.

2. For each overlapping block in x, initialize S_i with the help of the findings of the above step. Suppose that the pixel x(i) falls in x_NO[i_k]. Then, we can initialize S_i to be a set of random pixels in S_NO(i_k). Alternatively, we can search the blocks pointed to by S_NO(i_k) to find a set of matching blocks. Our experience shows that the two approaches lead to comparable results; we used the latter approach in the experiments reported in this chapter.

Note that because in this initialization approach (specifically, in Step 1 above) we run the Generalized PatchMatch for non-overlapping patches, its cost will be very low.

Figure 9.1: The proposed initialization for the Generalized PatchMatch algorithm. Step 1: The source image (left) and the reference image (right) are partitioned into non-overlapping blocks. For each block in the source image, at least k matching blocks are found in the reference image. Step 2: for each block in the source image, k matching blocks are found in the reference image with the help of the mapping discovered in Step 1. In this simple illustration, 5 matching blocks are found in Step 1 and then k = 3 blocks are identified in Step 2.

9.2.2 Optimization algorithm

We start by rewriting the proposed cost function in Equation (9.3) as

\[ F(x) = G(x) + R(x) \tag{9.5} \]

Assuming that x_NL is constant, the first term, G(x) = (1/n) Σ_{i=1}^{n} ½‖A_i x − y_i‖₂² + (λ_NL/2)‖x − x_NL‖₂², is convex and differentiable. The second term, R(x) = λ_TV TV(x), is convex but non-differentiable. We would like to minimize the proposed cost function using a proximal VR-SGD approach. To this end, we rewrite the cost function as

\[ F(x) = \frac{1}{n}\sum_{i=1}^{n} g_i(x) + R(x) = \frac{1}{n}\sum_{i=1}^{n}\left( \frac{1}{2}\|A_i x - y_i\|_2^2 + \frac{\lambda_{\mathrm{NL}}}{2}\|x - x_{\mathrm{NL}}\|_2^2 \right) + \lambda_{\mathrm{TV}}\,\mathrm{TV}(x) \tag{9.6} \]

We suggest Algorithm 7 for minimizing this cost function. The update direction in this algorithm is very similar to the basic VR-SGD update that we described in Chapter 8.
The only difference is that we weight the full gradient direction (µ̃) using the average of the Lipschitz constants, as suggested in [336]. We use the algorithm proposed in [46] for computing the proximal operation for the TV regularization term.

    input:  initial image estimate x_0 (obtained using the FDK algorithm)
    output: final image estimate after N iterations, x_N
    option 1: compute x_NL based on x_0 by performing several iterations of the Generalized PatchMatch
    for j ← 1 to N do
        x̃ = x_{j−1}
        option 2: update x_NL based on x̃ by performing one iteration of the Generalized PatchMatch
        µ̃ = (1/n) Aᵀ (A x̃ − y) + λ_NL (x̃ − x_NL)
        x_j^0 = x̃
        for k ← 1 to 2n do
            select i_k ∈ {1, ..., n} with probability L_i / Σ_i L_i
            x_j^k = prox_{α λ_TV TV} ( x_j^{k−1} − α ( (1/L_{i_k}) A_{i_k}ᵀ A_{i_k} (x_j^{k−1} − x̃) + (1/L_mean) µ̃ ) )
        end
        x_j = x_j^{2n}
    end

Algorithm 7: The proposed proximal VR-SGD algorithm for CT reconstruction by minimizing Equation (9.6).

In Algorithm 7, L_i denotes the Lipschitz constant of g_i, which is equal to λ_i + λ_NL, where λ_i is the largest eigenvalue of A_iᵀA_i. L_mean = λ_mean + λ_NL denotes the average of the L_i, where λ_mean is the average of the λ_i, and α is the step size. Let us also denote the strong convexity parameter of the whole cost function, F(x), by µ. This means that for all x and y in the domain of F we have F(x) ≥ F(y) + zᵀ(x − y) + (µ/2)‖x − y‖₂², where z is any subgradient of F at y. For the proposed cost function in Equation (9.6) we have µ ≥ λ_NL, and in most cases of interest, where A is a wide matrix (i.e., more unknown image voxels than sinogram measurements), we have µ = λ_NL. From Theorem 1 in [336], if we choose α and λ_NL such that the value of r given below is less than 1, then the proposed algorithm has geometric convergence with rate r. This means that E(F(x_n)) − F(x*) ≤ rⁿ (F(x_0) − F(x*)), where x* is the global minimizer of F.

\[ r \simeq \frac{\lambda_{\mathrm{mean}}/\lambda_{\mathrm{NL}} + 1}{2\alpha(1 - 4\alpha)n} + \frac{4\alpha}{1 - 4\alpha} \tag{9.7} \]

It is always possible to choose α and λ_NL such that r < 1. However, the step size suggested by this analysis is always less than 1/4, usually around 0.1 or smaller. As we will see later, our experiments show that larger step sizes result in faster convergence in our application.

In the above analysis, we assumed that x_NL was constant. In order to obey this assumption, we can compute x_NL from x_0 using the Generalized PatchMatch before the start of the algorithm. Alternatively, we can continually update x_NL by performing one iteration of the Generalized PatchMatch in each iteration of the proposed algorithm. Both of these options are shown in Algorithm 7, labeled Option 1 and Option 2, respectively. The theoretical convergence rate mentioned above does not apply if we choose Option 2, i.e., if x_NL is updated based on the current estimate of x. Nonetheless, Option 2 is intuitively better than Option 1. This is because with more iterations of the main algorithm, noise and artifacts in the image estimate are reduced and, therefore, x_NL will be closer to the true image. Therefore, it makes more sense to gradually improve x_NL than to spend much effort estimating it from x_0 at the algorithm start. Our results, which we will present in the next section, support this intuition.
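Since the constants λ_i (the largest eigenvalues of A_iᵀA_i) appear throughout Algorithm 7, the following is a minimal power-iteration sketch for estimating them, in the spirit of the power methods referenced earlier for computing the Lipschitz constants; the function name and the random test matrix are illustrative assumptions.

```python
import numpy as np

def largest_eigenvalue(A, n_iters=50, rng=np.random.default_rng()):
    """Estimate the largest eigenvalue of A^T A (the Lipschitz constant of
    x -> A^T (A x - y)) by power iteration."""
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        w = A.T @ (A @ v)
        v = w / np.linalg.norm(w)
    return float(v @ (A.T @ (A @ v)))   # Rayleigh quotient at the converged vector

# e.g., Lipschitz constant of one projection-view term, then L_i = lambda_i + lambda_NL
A_i = np.random.default_rng(0).standard_normal((40, 100))
lam_i = largest_eigenvalue(A_i)
L_i = lam_i + 0.01   # lambda_NL = 0.01, one of the values used in the experiments
```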
The rest of the imagesin Figure 9.2 show the images reconstructed with different algorithms. Forcomparison, we have used FDK and MFISTA algorithms. For the proposedalgorithm, we have shown the reconstruction results with two different valuesof λTV = {0, 200} and two different values of λNL = {0, 0.01}. The case withλTV = 0 corresponds to using patch-based regularization only; similarly, thecase with λNL = 0 corresponds to using TV regularization only. We usedα = 1 in all experiments in this chapter.A visual comparison of the reconstruction results in Figure 9.2 suggeststhat both regularization terms contribute positively to the quality of thereconstructed images. However, the patch-based regularization alone hasresulted in a better image than the TV regularization alone. In particular,reconstruction with (λTV = 0, λNL = 0.01) seems to have better preservedthe image sharpness than reconstruction with (λTV = 200, λNL = 0). Theobjective image quality criteria summarized in Table 9.1 supports this state-ment. Figure 9.2 and Table 9.1 are for reconstruction after 15 iterations.FDK MFISTAProposedalgorithmλTV = 200λNL = 0ProposedalgorithmλTV = 0λNL = 0.01ProposedalgorithmλTV = 200λNL = 0.01RMSE 0.068 0.039 0.032 0.027 0.026SSIM 0.710 0.745 0.750 0.785 0.785CNR 14.6 19.3 21.0 21.2 21.3Table 9.1: Image quality metrics for the experiment with simulated data.Figure 9.3(a) shows the RMSE plots for different values of the regulariza-tion parameters for up to 30 iterations. These plots show that patch-basedregularization results in much better convergence, especially as the numberof iterations increases. When λNL = 0.01, the presence of the TV regulariza-tion leads to faster initial convergence, but has no significant added positiveeffect after the first few iterations.1969.3. Results and DiscussionFigure 9.2: Reconstruction results of the experiment with the brain phan-tom. (a) the reference image, (b) a slice of the brain phantom used forblock matching, xref, (c) FDK, (d) MFISTA, (e) proposed algorithm with(λTV = 0, λNL = 0), (f) proposed algorithm with (λTV = 200, λNL = 0),(g) proposed algorithm with (λTV = 0, λNL = 0.01), (h) proposed al-gorithm with (λTV = 200, λNL = 0.01), (i) proposed algorithm with(λTV = 200, λNL = 0.01) with reconstruction of the patch-based image esti-mate, xNL, at the start of the algorithm (Option 1 in Algorithm 7).1979.3. Results and DiscussionThe RMSE plots shown in Figure 9.3(a) and the images in Figure 9.2(e)-(h) were obtained by using Option 2 in Algorithm 7. In Figure 9.2(i) we haveshown the image reconstructed using Option 1. This image clearly containsartifacts that do not exist in Figure 9.2(e)-(h). To better understand whyOption 2 leads to better results, in Figure 9.3(b) we comparethe two optionsin terms of the RMSE of the reconstructed image and the RMSE of xNL.For Option 1, we applied 30 iterations of the Generalized PatchMatch algo-rithm before the start of the algorithm, whereas for Option 2 we applied oneiteration of the Generalized PatchMatch at the beginning of each iterationof the proposed algorithm. With Option 1, xNL is created from the initialimage estimate (x0) and, hence, its error is fixed. The quality of xNL gener-ated in Option 1 is not very high, as indicated by its relatively high RMSE.Therefore, patch-based regularization with Option 1 does not lead to fastconvergence. In fact, after the initial iterations, regularization in terms ofproximity with xNL hurts the algorithm convergence because it forces theimage estimate to remain close to xNL. 
The RMSE plots shown in Figure 9.3(a) and the images in Figure 9.2(e)-(h) were obtained by using Option 2 in Algorithm 7. In Figure 9.2(i) we have shown the image reconstructed using Option 1. This image clearly contains artifacts that do not exist in Figure 9.2(e)-(h). To better understand why Option 2 leads to better results, in Figure 9.3(b) we compare the two options in terms of the RMSE of the reconstructed image and the RMSE of xNL. For Option 1, we applied 30 iterations of the Generalized PatchMatch algorithm before the start of the algorithm, whereas for Option 2 we applied one iteration of the Generalized PatchMatch at the beginning of each iteration of the proposed algorithm. With Option 1, xNL is created from the initial image estimate (x0) and, hence, its error is fixed. The quality of xNL generated in Option 1 is not very high, as indicated by its relatively high RMSE. Therefore, patch-based regularization with Option 1 does not lead to fast convergence. In fact, after the initial iterations, regularization in terms of proximity with xNL hurts the algorithm convergence because it forces the image estimate to remain close to xNL. On the other hand, the quality of xNL generated with Option 2 continues to improve with more iterations of the algorithm because it is updated using the latest image estimate. Therefore, patch-based regularization with Option 2 constantly pushes the image estimate towards a lower RMSE. In the experiments with real data reported in the next section, we will only show the results obtained with Option 2.

Figure 9.3: (a) RMSE plots for reconstruction of the brain phantom. (b) Comparison between the two approaches for estimating the nonlocal patch-based image estimate, xNL; as shown in Algorithm 7, in Option 1 xNL is estimated before the start of the image reconstruction algorithm, whereas in Option 2 xNL is iteratively updated based on the latest image estimate.

9.3.2 Real data

We evaluated the proposed algorithm on a micro-CT scan of a rat. The scan consisted of 720 projections, all of which were used to reconstruct a reference image. The proposed algorithm was then applied to reconstruct an image from a subset of 180 projections from this scan.

Figure 9.4 shows slices of the reconstructed image of the rat after 15 iterations of the proposed algorithm. In the same figure, we have also shown two profiles, the locations of which have been marked with the line segments L1 and L2 in Figure 9.4(a). This figure shows that the nonlocal patch-based regularization has resulted in a marked improvement in the quality of the reconstructed image, especially of the fine details. The plots of RMSE for this experiment were very similar to those for the simulation experiment shown in Figure 9.3, and hence they are omitted. Table 9.2 shows a summary of the objective image quality metrics for this experiment.

          FDK      MFISTA   Proposed            Proposed            Proposed
                            (λTV = 300,         (λTV = 0,           (λTV = 300,
                            λNL = 0)            λNL = 0.04)         λNL = 0.04)
  RMSE    0.0170   0.0136   0.0122              0.0109              0.0105
  SSIM    0.604    0.731    0.755               0.781               0.770

Table 9.2: Image quality metrics for the experiment with real data.

An important setting with the patch-based regularization is the value of the regularization parameter, λNL. Our extensive experiments show that this value should be approximately equal to or smaller than the smallest λi. Recall that λi is the largest eigenvalue of $A_i^T A_i$, where Ai is the projection matrix for the ith projection view. For our simulation experiment with the brain phantom, for example, λi ranged from 0.32 (for projection view angles that were integer multiples of π/2) to 0.43 (for projection view angles that were odd multiples of π/4), and values of the regularization parameter in the range λNL ∈ [0.05, 0.25] gave us good results. In the experiment with the rat scan, λi ranged between 0.087 and 0.124, and values of λNL in the range [0.02, 0.06] gave us good results.
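In practice, λi is not available in closed form for a realistic cone-beam geometry, but a few power iterations on $A_i^T A_i$ give an estimate that is accurate enough for choosing λNL and the per-view weights. The sketch below shows one way to do this; A_op and At_op are again hypothetical callables for the forward and back-projection of a single projection view.

```python
import numpy as np

def largest_eigenvalue(A_op, At_op, image_shape, n_iters=30, rng=None):
    """Estimate the largest eigenvalue of A_i^T A_i by power iteration.

    A_op / At_op: hypothetical forward and back-projection for one view.
    image_shape : shape of the image volume.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = rng.standard_normal(image_shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(n_iters):
        w = At_op(A_op(v))        # apply A_i^T A_i to the current vector
        lam = np.linalg.norm(w)   # eigenvalue estimate, since ||v|| = 1
        v = w / lam
    return lam
```

λNL can then be chosen approximately equal to or smaller than the smallest of these estimates, as suggested by our experiments.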
Figure 9.4: Reconstruction results of the experiment with the rat scan. (a) the reference image, (b) the corresponding slice of the image used for block matching, xref, (c) FDK, (d) MFISTA, (e) proposed algorithm with (λTV = 0, λNL = 0), (f) proposed algorithm with (λTV = 300, λNL = 0), (g) proposed algorithm with (λTV = 0, λNL = 0.04), (h) proposed algorithm with (λTV = 300, λNL = 0.04). The lower panel shows two small profile segments in the images reconstructed by the proposed algorithm with three different regularization parameter settings. The locations of these profiles have been marked in the reference image with white line segments.

Overall, our experience with the algorithms proposed in Chapter 8 and this chapter shows that when approximately 100 to 200 noisy projections are used for reconstruction, TV regularization alone is enough to faithfully reconstruct large image features, provided that a good value is selected for the regularization parameter. With TV regularization alone, however, faithful reconstruction of fine image features is much more difficult and sometimes impossible within a small number of iterations, even after careful tuning of the regularization parameter. Such features can be reconstructed by increasing the number of iterations. Fine image features are much easier to reconstruct by using nonlocal patch-based regularization; we showed an example of its effectiveness for recovery of fine features in Figure 9.4. The success of patch-based regularization in reconstructing fine features comes from exploiting the redundant information in similar patches, thereby preserving fine edges and textures that are normally blurred by TV-based regularization. We can say that TV is a global regularization function, treating all parts of the image with equal strength, determined by the value of the regularization parameter. Therefore, if we choose a sufficiently large regularization parameter to ensure strong noise suppression, some fine image features could be lost. Nonlocal patch-based regularization, on the other hand, treats each image location differently by finding similar patches that help preserve and amplify the local features. Therefore, nonlocal patch-based regularization can recover fine image features much more easily than TV regularization. As expected, this advantage of patch-based regularization comes with certain costs. In particular, the additional computational and memory requirements can be significant, especially for large-scale images. The algorithm proposed in this chapter also requires a high-quality reference image, which is not always available.
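The qualitative contrast drawn above between global TV smoothing and nonlocal patch-based filtering can be reproduced with off-the-shelf denoisers. The following sketch uses scikit-image on a standard test image purely as an illustration; it is not the regularization machinery used inside the proposed reconstruction algorithm.

```python
import numpy as np
from skimage import data, img_as_float
from skimage.restoration import denoise_tv_chambolle, denoise_nl_means, estimate_sigma

# Noisy test image standing in for a low-dose reconstruction.
x = img_as_float(data.camera())
noisy = x + 0.08 * np.random.default_rng(0).standard_normal(x.shape)

# Global TV regularization: one smoothing strength for the whole image.
tv = denoise_tv_chambolle(noisy, weight=0.1)

# Nonlocal patch-based filtering: each location is treated according to the
# similar patches found in its search window.
sigma = float(np.mean(estimate_sigma(noisy)))
nlm = denoise_nl_means(noisy, h=1.15 * sigma, patch_size=5,
                       patch_distance=6, fast_mode=True)
```

On piecewise-constant regions the two filters behave similarly, but textures and fine, low-contrast details are typically better preserved by the nonlocal filter, mirroring the behavior observed in Figures 9.2 and 9.4.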
Chapter 10

Conclusions

X-ray computed tomography (CT) has become one of the most essential and widely used tools in medicine. As its usage continues to increase, the need to maintain the radiation dose at a reasonably safe level becomes even more important. Therefore, in order for CT to fulfill the growing demands, the image reconstruction and processing algorithms should be greatly improved.

This dissertation investigated the potential of some of the powerful concepts and tools in image processing and optimization for image reconstruction and processing in CT. These included patch-based image models, total variation, and variance-reduced stochastic optimization algorithms. We proposed new algorithms for denoising and interpolation of CT measurements, denoising and restoration of CT images, and iterative CT reconstruction. Experiments with simulated and real CT data showed that the proposed algorithms can compete with, and often outperform, some of the state-of-the-art algorithms.

10.1 Contributions of this dissertation

This section summarizes the main findings and contributions of this dissertation under three categories: pre-processing, post-processing, and iterative reconstruction algorithms.

10.1.1 Pre-processing algorithms

Only a very small fraction of the published algorithms for CT have focused on processing the measured CT projections (i.e., the sinogram). The results of this dissertation show that denoising and interpolation of the CT projection measurements can result in substantial improvements in the quality of the reconstructed CT images.

• The results of Chapter 3 show that patch-based methods can be used to devise very effective sinogram denoising methods. To the best of our knowledge, this is the first study that exploits both nonlocal patch similarities and sparse representation in learned dictionaries for sinogram denoising.

• The results of Chapter 4 show that effective sinogram denoising algorithms can be designed based on total variation minimization. We suggested two approaches to account for the signal-dependent nature of the noise and the smooth nature of the projection measurements. We are unaware of any previously published studies that suggest either of these approaches. Both approaches proved to be effective in experiments with low-dose CT projections.

• In general, the patch-based method proposed in Chapter 3 leads to better results than the TV-based algorithms proposed in Chapter 4. The advantage of TV-based denoising methods is that they are, in general, much faster than patch-based denoising methods. The patch-based method proposed in Chapter 3 is fast, but it can also include a dictionary learning step that can add substantially to the computational time; the dictionary has to be re-trained every time the scan geometry or the angular spacing between successive projections changes. On the other hand, a shortcoming of the TV-based denoising algorithms is that they involve regularization parameters that need to be tuned carefully in order to obtain good results.

• Two very important properties of CT projection measurements are smoothness and self-similarity. The results of Chapter 5 show that these properties can be exploited to effectively interpolate and denoise noisy undersampled projections, leading to a large improvement in the quality of low-dose CT images. To the best of our knowledge, no previous study has used nonlocal patch similarities for interpolation of CT projections.

10.1.2 Post-processing algorithms

Post-processing of low-dose CT images is very challenging because of the presence of artifacts and strong noise with unknown and spatially varying distribution. This dissertation focused on using sparse representation in learned dictionaries for low-dose CT image denoising and restoration. Our results show that learned overcomplete dictionaries are effective in denoising and restoration of low-dose CT images.

• In Chapter 6, we proposed a method for removing streak artifacts that arise in images reconstructed from a small number of projections. To the best of our knowledge, this is the first algorithm to use coupled dictionaries for artifact suppression in CT images. The results of that chapter show that the proposed algorithm substantially reduces the artifacts without degrading the true image features.

• The two-level dictionary structure proposed in Chapter 7 aimed at removing noise or artifacts whose shapes are very different from those of the genuine image features, unlike the streak artifacts considered in Chapter 6. This dictionary structure combined the advantages of analytical and learned dictionaries.
In our experiments, the proposed dictionary structure effectively suppressed the noise and ring artifacts in low-dose CT images, achieving results that were comparable with or better than standard dictionary-based processing.

10.1.3 Iterative reconstruction algorithms

Two iterative reconstruction algorithms were proposed in this dissertation. An important feature of both algorithms was the use of variance-reduced stochastic gradient descent (VR-SGD) methods. To the best of our knowledge, VR-SGD algorithms have never been used for CT reconstruction before. Our results show that VR-SGD algorithms can be used to build very efficient CT reconstruction algorithms. Although the method of ordered subsets has long been used for CT reconstruction, our results indicate that VR-SGD algorithms show very good convergence, especially as the image estimate becomes close to a solution. Moreover, step size selection for VR-SGD algorithms is much easier than for the ordered subsets method, and there is no need to reduce the step size or increase the batch size.

• The hybrid stochastic-deterministic algorithm that we proposed in Chapter 8 further improved the VR-SGD method by exploiting the curvature of the cost function as the image estimate became close to a solution. Our results showed that this algorithm performed better than the basic proximal VR-SGD.

• The results of Chapter 9 show that regularization in terms of nonlocal patch similarities can be used to develop very effective CT reconstruction algorithms. Algorithms that use patch-based regularization are more computationally intensive than algorithms that use edge-preserving regularizations, such as the algorithm proposed in Chapter 8. However, fine image features such as texture, small structures, and low-contrast edges are much better preserved by exploiting nonlocal patch similarities.

Implementation of the forward and back-projection operations on a GPU can reduce the per-iteration cost of iterative reconstruction algorithms by large factors. Nonetheless, in order to make the reconstruction time of large 3D images clinically acceptable, the number of iterations needs to be reduced too. This dissertation showed that VR-SGD methods offer an efficient approach towards achieving this goal.

10.2 Future work

The methods proposed in this dissertation can be improved in various ways. There are also many related research directions that have not been explored in this dissertation. In this section, some of these potential research topics are pointed out.

10.2.1 Pre-processing algorithms

• The sinogram denoising and interpolation algorithms proposed in this dissertation were based on simplified noise models. It is well known that in low-dose CT these models are less accurate. Therefore, it will be very important to study how the performance of the proposed algorithms is affected by the accuracy of the noise model.

• Some of the studies on patch-based Poisson denoising that we reviewed in Section 2.4 have focused on extremely low Poisson counts. The results of some of these studies have been very impressive. It would be very interesting to investigate the significance of these results for very low-dose CT.

• An important limitation of the image processing methods that are based on nonlocal patch similarities is the computational cost of finding similar patches. Therefore, most algorithms employ small patch sizes to reduce the computational cost. As we showed in Chapter 3, for CT projections it is possible to project large patches/blocks into much smaller spaces.
This possibility raises several important research questions. For example, it will be interesting to know how the performance of patch-based denoising and interpolation algorithms, such as those proposed in this dissertation, is affected when much larger patch sizes are used.

10.2.2 Post-processing algorithms

• This dissertation used learned dictionaries for suppressing two types of artifacts, namely the streak artifacts that arise when the number of projections is small (Chapter 6) and ring artifacts (Chapter 7). Artifacts in CT images can originate from various sources and have different shapes [14]. These artifacts can be very strong and can thus significantly reduce the image quality. In many cases, artifacts have very different geometrical and statistical properties from the true image features. In such cases, a simple algorithm such as the dictionary-based processing method proposed in Chapter 7 may be able to significantly reduce the artifacts. More often, however, artifacts have many similarities with the true image features; this is the case, for example, for the streak artifacts that we studied in Chapter 6. In such cases, more sophisticated algorithms will be needed to suppress the artifacts without damaging the true image features. The results of this dissertation suggest that learned dictionaries may be able to reduce other types of artifacts as well.

10.2.3 Iterative reconstruction algorithms

• All dictionary-based iterative CT reconstruction algorithms that we are aware of, including all those reviewed in Chapter 2, have been proposed for 2D CT. For reconstruction of large 3D images these methods are not efficient, because they require access to individual elements of the system matrix, which is too large to store in computer memory. Therefore, there is a need for efficient optimization methods that can solve iterative dictionary-based reconstruction problems such as Equation (2.36) for large-scale 3D images. This topic was not addressed in this dissertation. However, there are existing methods that could be employed for solving this problem, for example the recently proposed plug-and-play approach [271, 322].

• The method of ordered subsets, which is equivalent to incremental/stochastic gradient descent, has long been used to accelerate various iterative reconstruction algorithms in CT [150, 220]. In recent years, this method has been employed in designing some of the state-of-the-art algorithms by combining it with other optimization techniques such as momentum [160, 161, 240]. Our results show that VR-SGD has certain important advantages over the method of ordered subsets. Therefore, it would be very interesting to investigate whether VR-SGD methods could also be combined with other optimization techniques to achieve faster reconstruction.

Bibliography

[1] Michal Aharon and Michael Elad. Sparse and redundant modeling of image content using an image-signature-dictionary. SIAM Journal on Imaging Sciences, 1(3):228–247, 2008.

[2] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. Signal Processing, IEEE Transactions on, 54(11):4311–4322, 2006.

[3] Michal Aharon, Michael Elad, and Alfred M. Bruckstein. On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them. Linear Algebra and its Applications, 416(1):48–67, 2006.

[4] A AlAfeef, P Cockshott, I MacLaren, and S McVitie.
Compressedsensing electron tomography using adaptive dictionaries: a simulationstudy. Journal of Physics: Conference Series, 522(1):012021, 2014.[5] J.B. Allen and L. Rabiner. A unified approach to short-time fourieranalysis and synthesis. Proceedings of the IEEE, 65(11):1558–1564,Nov 1977.[6] Francois Alter, Yasuyuki Matsushita, and Xiaoou Tang. An intensitysimilarity measure in low-light conditions. Association for ComputingMachinery, Inc., March 2006.[7] F. J. Anscombe. The transformation of poisson, binomial andnegative-binomial data. Biometrika, 35(3/4):246–254, 1948.[8] Richard C Aster, Clifford H Thurber, and Brian Borchers. Parameterestimation and inverse problems, volume 90. Academic Press, 2005.[9] Francis Bach, Rodolphe Jenatton, Julien Mairal, and GuillaumeObozinski. Optimization with sparsity-inducing penalties. Founda-tion and Trends in Machine Learning, 4(1):1–106, 2012.[10] Francis Bach and Eric Moulines. Non-strongly-convex smooth stochas-tic approximation with convergence rate O(1/n). In C.j.c. Burges,208BibliographyL. Bottou, M. Welling, Z. Ghahramani, and K.q. Weinberger, edi-tors, Advances in Neural Information Processing Systems 26, pages773–781. 2013.[11] Ti Bai, Xuanqin Mou, Qiong Xu, and Yanbo Zhang. Low-dose ct re-construction based on multiscale dictionary. In SPIE Medical Imaging,pages 86683L–86683L. International Society for Optics and Photonics,2013.[12] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Gold-man. PatchMatch: A randomized correspondence algorithm for struc-tural image editing. ACM Transactions on Graphics (Proc. SIG-GRAPH), 28(3), August 2009.[13] Connelly Barnes, Eli Shechtman, DanB. Goldman, and Adam Finkel-stein. The Generalized PatchMatch correspondence algorithm. InKostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Com-puter Vision ECCV 2010, volume 6313 of Lecture Notes in ComputerScience, pages 29–43. Springer Berlin Heidelberg, 2010.[14] Julia F. Barrett and Nicholas Keat. Artifacts in CT: Recognition andavoidance. RadioGraphics, 24(6):1679–1691, 2004.[15] Borsdorf A. Kstler H. Rubinstein R. Bartuschat, D. and M. Strmer.A parallel K-SVD implementation for CT image denoising. TechnicalReport CS 10, University of Erlangen-Nrnberg, 2009.[16] M.J. Bastiaans. Gabor’s expansion of a signal into gaussian elementarysignals. Proceedings of the IEEE, 68(4):538–539, April 1980.[17] Amir Beck and Marc Teboulle. Fast gradient-based algorithms forconstrained total variation image denoising and deblurring problems.Image Processing, IEEE Transactions on, 18(11):2419–2434, 2009.[18] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholdingalgorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183–202, March 2009.[19] Marcel Beister, Daniel Kolditz, and Willi A. Kalender. Iterative re-construction methods in x-ray CT. Physica Medica, 28(2):94 – 108,2012.[20] Anthony J. Bell and Terrence J. Sejnowski. The independent compo-nents of natural scenes are edge filters. Vision Research, 37(23):3327– 3338, 1997.209Bibliography[21] Martin Benning, Christoph Brune, Martin Burger, and Jahn Mu¨ller.Higher-order tv methods—enhancement via bregman iteration. Jour-nal of Scientific Computing, 54(2):269–310, 2012.[22] Frank Bergner, Timo Berkus, Markus Oelhafen, Patrik Kunz, TinsuPan, Rainer Grimmer, Ludwig Ritschl, and Marc Kachelrie. An inves-tigation of 4d cone-beam ct algorithms for slowly rotating scanners.Medical Physics, 37(9), 2010.[23] Ma¨ıtine Bergounioux and Loic Piffet. A second-order model for imagedenoising. 
Set-Valued and Variational Analysis, 18(3):277–306, 2010.[24] A. Berrington de Gonzlez, M. Mahesh, K. Kim, M. Bhargavan,R. Lewis, F. Mettler, and C. Land. Projected cancer risks fromcomputed tomographic scans performed in the United States in 2007.Archives of Internal Medicine, 169(22):2071–2077, 2009.[25] Dimitri P. Bertsekas. A new class of incremental gradient methodsfor least squares problems. SIAM J. on Optimization, 7(4):913–926,April 1997.[26] Dimitri P Bertsekas. Nonlinear programming. Athena Scientific, 1999.[27] H. Bhujle and S. Chaudhuri. Novel speed-up strategies for non-localmeans denoising with patch and edge patch based dictionaries. ImageProcessing, IEEE Transactions on, 23(1):356–365, Jan 2014.[28] Junguo Bian, Jeffrey H Siewerdsen, Xiao Han, Emil Y Sidky, Jerry LPrince, Charles A Pelizzari, and Xiaochuan Pan. Evaluation of sparse-view reconstruction from flat-panel-detector cone-beam CT. Physicsin Medicine and Biology, 55(22):6575, 2010.[29] J.M. Bioucas-Dias and M. A T Figueiredo. A new twist: Two-stepiterative shrinkage/thresholding algorithms for image restoration. Im-age Processing, IEEE Transactions on, 16(12):2992–3004, 2007.[30] Thomas Blumensath and Mike E Davies. Gradient pursuits. SignalProcessing, IEEE Transactions on, 56(6):2370–2382, 2008.[31] Lon Bottou. Stochastic gradient descent tricks. In Grgoire Montavon,GeneviveB. Orr, and Klaus-Robert Mller, editors, Neural Networks:Tricks of the Trade, volume 7700 of Lecture Notes in Computer Sci-ence, pages 421–436. Springer Berlin Heidelberg, 2012.210Bibliography[32] C. Bouman and K. Sauer. A generalized gaussian image model foredge-preserving map estimation. Image Processing, IEEE Transac-tions on, 2(3):296–310, Jul 1993.[33] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and JonathanEckstein. Distributed optimization and statistical learning via thealternating direction method of multipliers. Foundations and Trends R©in Machine Learning, 3(1):1–122, 2011.[34] Kristian Bredies, Karl Kunisch, and Thomas Pock. Total generalizedvariation. SIAM Journal on Imaging Sciences, 3(3):492–526, 2010.[35] Alex Bronstein, Pablo Sprechmann, and Guillermo Sapiro. Learningefficient structured sparse models. arXiv preprint arXiv:1206.4649,2012.[36] Ori Bryt and Michael Elad. Compression of facial images using thek-svd algorithm. Journal of Visual Communication and Image Repre-sentation, 19(4):270 – 282, 2008.[37] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A review ofimage denoising algorithms, with a new one. Multiscale Modeling &Simulation, 4(2):490–530, 2005.[38] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. Image en-hancement by non-local reverse heat equation. Preprint CMLA,22:2006, 2006.[39] Egbert Buhr, Susanne Gu¨nther-Kohfahl, and Ulrich Neitzel. Sim-ple method for modulation transfer function determination of digitalimaging detectors from edge images. In Medical Imaging 2003, pages877–884. International Society for Optics and Photonics, 2003.[40] Cesar F Caiafa and Andrzej Cichocki. Multidimensional compressedsensing and their applications. Wiley Interdisciplinary Reviews: DataMining and Knowledge Discovery, 3(6):355–380, 2013.[41] EJ Candes and DL Donoho. Curvelets and reconstruction of imagesfrom noisy radon data (invited paper)[4119-10]. Proc. SPIE, (1):108–117, 2000.[42] Emmanuel Candes, Laurent Demanet, David Donoho, and LexingYing. Fast discrete curvelet transforms. Multiscale Modeling & Sim-ulation, 5(3):861–899, 2006.211Bibliography[43] V. Cevher, S. Becker, and M. Schmidt. 
Convex optimization for bigdata: Scalable, randomized, and parallel algorithms for big data ana-lytics. Signal Processing Magazine, IEEE, 31(5):32–43, Sept 2014.[44] P. Chainais. Towards dictionary learning from images with non gaus-sian noise. In Machine Learning for Signal Processing (MLSP), 2012IEEE International Workshop on, pages 1–6, Sept 2012.[45] A. Chambolle, M. Novaga, D. Cremers, and T. Pock. An introductionto total variation for image analysis. In in Theoretical Foundationsand Numerical Methods for Sparse Recovery, De Gruyter, 2010.[46] Antonin Chambolle. An algorithm for total variation minimizationand applications. Journal of Mathematical Imaging and Vision, 20(1-2):89–97, 2004.[47] Antonin Chambolle and Pierre-Louis Lions. Image recovery via to-tal variation minimization and related problems. Numerische Mathe-matik, 76(2):167–188, 1997.[48] T. F. Chan, S. Esedoglu, F. E. Park, and A. M. Yip. Recent devel-opments in total variation image restoration. In Handbook of Mathe-matical Models in Computer Vision. Springer, Berlin, 2005.[49] Rick Chartrand, Emil Y Sidky, and Xiaochuan Pan. Nonconvex com-pressive sensing for x-ray ct: an algorithm comparison. In Signals,Systems and Computers, 2013 Asilomar Conference on, pages 665–669. IEEE, 2013.[50] P. Chatterjee and P. Milanfar. Clustering-based denoising with lo-cally learned dictionaries. Image Processing, IEEE Transactions on,18(7):1438–1451, July 2009.[51] P. Chatterjee and P. Milanfar. Patch-based near-optimal image de-noising. Image Processing, IEEE Transactions on, 21(4):1635–1649,April 2012.[52] Biao Chen and Ruola Ning. Cone-beam volume CT breast imaging:Feasibility study. Medical Physics, 29(5):755–770, 2002.[53] Qiang Chen, Philippe Montesinos, Quan Sen Sun, Peng Ann Heng,and De Shen Xia. Adaptive total variation denoising based on differ-ence curvature. Image and Vision Computing, 28(3):298 – 306, 2010.[54] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basispursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.212Bibliography[55] Wei Chen and M.R.D. Rodrigues. Dictionary learning with optimizedprojection design for compressive sensing applications. Signal Process-ing Letters, IEEE, 20(10):992–995, Oct 2013.[56] Yang Chen, Wufan Chen, Xindao Yin, Xianghua Ye, Xudong Bao,Limin Luo, Qianjing Feng, Yinsheng li, and Xiaoe Yu. Improving low-dose abdominal {CT} images by weighted intensity averaging overlarge-scale neighborhoods. European Journal of Radiology, 80(2):e42– e49, 2011.[57] Yang Chen, Luyao Shi, Qianjing Feng, Jian Yang, Huazhong Shu,Limin Luo, J.-L. Coatrieux, and Wufan Chen. Artifact suppresseddictionary learning for low-dose CT image processing. Medical Imag-ing, IEEE Transactions on, 33(12):2271–2292, Dec 2014.[58] Yang Chen, Luyao Shi, Yining Hu, Qing Cao, Fei Yu, Limin Luo,and C. Toumoulin. Confidence weighted dictionary learning algorithmfor low-dose ct image processing. In Nuclear Science Symposium andMedical Imaging Conference (NSS/MIC), 2013 IEEE, pages 1–4, Oct2013.[59] Yang Chen, Luyao Shi, Jiang Yang, Yining Hu, Limin Luo, XindaoYin, and Jean-Louis Coatrieux. Radiation dose reduction with dictio-nary learning based processing for head ct. Australasian Physical &Engineering Sciences in Medicine, 37(3):483–493, 2014.[60] Yang Chen, Xindao Yin, Luyao Shi, Huazhong Shu, Limin Luo, Jean-Louis Coatrieux, and Christine Toumoulin. 
Improving abdomen tumorlow-dose CT images using a fast dictionary learning based processing.Physics in Medicine and Biology, 58(16):5803, 2013.[61] Kihwan Choi, Jing Wang, Lei Zhu, Tae-Suk Suh, Stephen Boyd, andLei Xing. Compressed sensing based cone-beam computed tomographyreconstruction with a first-order method. Medical Physics, 37(9):5113–5125, 2010.[62] A. Cichocki, D. Mandic, L. De Lathauwer, Guoxu Zhou, Qibin Zhao,C. Caiafa, and H.A. Phan. Tensor decompositions for signal processingapplications: From two-way to multiway component analysis. SignalProcessing Magazine, IEEE, 32(2):145–163, March 2015.[63] Chris A Cocosco, Vasken Kollokian, Remi K-S Kwan, G Bruce Pike,and Alan C Evans. Brainweb: Online interface to a 3d mri simulatedbrain database. In NeuroImage, 1997.213Bibliography[64] Ronald R Coifman, Yves Meyer, and Victor Wickerhauser. Waveletanalysis and signal processing. In In Wavelets and their Applications.Citeseer, 1992.[65] Michael Collins, S. Dasgupta, and Robert E Schapire. A generalizationof principal components analysis to the exponential family. In T.G.Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in NeuralInformation Processing Systems 14, pages 617–624. MIT Press, 2002.[66] Patrick L Combettes and Jean-Christophe Pesquet. Proximal splittingmethods in signal processing. In Fixed-point algorithms for inverseproblems in science and engineering, pages 185–212. Springer NewYork, 2011.[67] Patrick L Combettes and Vale´rie R Wajs. Signal recovery by prox-imal forward-backward splitting. Multiscale Modeling & Simulation,4(4):1168–1200, 2005.[68] S.F. Cotter, R. Adler, R.D. Rao, and K. Kreutz-Delgado. Forwardsequential algorithms for best basis selection. Vision, Image and SignalProcessing, IEE Proceedings -, 146(5):235–244, Oct 1999.[69] Florent Couzinie-Devy, Julien Mairal, Francis Bach, and Jean Ponce.Dictionary learning for deblurring and digital zoom. Arxiv preprintarXiv:1110.0957, 2011.[70] Antonio Criminisi, P. Perez, and K. Toyama. Region filling and objectremoval by exemplar-based image inpainting. Image Processing, IEEETransactions on, 13(9):1200–1212, Sept 2004.[71] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and KarenEgiazarian. Image denoising by sparse 3-D transform-domain collabo-rative filtering. Image Processing, IEEE Transactions on, 16(8):2080–2095, 2007.[72] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and KarenEgiazarian. Bm3d image denoising with shape-adaptive principal com-ponent analysis. In SPARS ’09 - Signal Processing with AdaptiveSparse Structured Representations, pages 6–pp, 2009.[73] J. Darbon, A. Cunha, T.F. Chan, S. Osher, and G.J. Jensen. Fastnonlocal filtering applied to electron cryomicroscopy. In BiomedicalImaging: From Nano to Macro, 2008. ISBI 2008. 5th IEEE Interna-tional Symposium on, pages 1331–1334, May 2008.214Bibliography[74] Ingrid Daubechies. Orthonormal bases of compactly supportedwavelets. Communications on Pure and Applied Mathematics,41(7):909–996, 1988.[75] John G. Daugman. Two-dimensional spectral analysis of cortical re-ceptive field profiles. Vision Research, 20(10):847 – 856, 1980.[76] John G Daugman. Uncertainty relation for resolution in space, spa-tial frequency, and orientation optimized by two-dimensional visualcortical filters. JOSA A, 2(7):1160–1169, 1985.[77] A. Dauwe, B. Goossens, H. Q. Luong, and W. Philips. A fast non-localimage denoising algorithm, 2008.[78] Bruno De Man and Samit Basu. Distance-driven projection and back-projection in three dimensions. 
Physics in Medicine and Biology,49(11):2463–75, 2004.[79] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fastincremental gradient method with support for non-strongly convexcomposite objectives. In Advances in Neural Information ProcessingSystems, pages 1646–1654, 2014.[80] A.H. Delaney and Y. Bresler. Globally convergent edge-preserving reg-ularized reconstruction: an application to limited-angle tomography.Image Processing, IEEE Transactions on, 7(2):204–221, Feb 1998.[81] C.-A. Deledalle, F. Tupin, and L. Denis. Poisson nl means: Unsuper-vised non local means for poisson noise. In Image Processing (ICIP),2010 17th IEEE International Conference on, pages 801–804, Sept2010.[82] Charles-Alban Deledalle, Lo¨ıc Denis, and Florence Tupin. How tocompare noisy patches? patch similarity beyond gaussian noise. In-ternational journal of computer vision, 99(1):86–102, 2012.[83] Charles-Alban Deledalle, Vincent Duval, and Joseph Salmon. Non-local methods with shape-adaptive patches (nlm-sap). Journal ofMathematical Imaging and Vision, 43(2):103–120, 2012.[84] Charles-Alban Deledalle, Joseph Salmon, and Arnak Dalalyan. Im-age denoising with patch based pca: local versus global. In BMVC,volume 81, pages 425–455, 2011.215Bibliography[85] Charles-Alban Deledalle, Florence Tupin, and Lo¨ıc Denis. Patch simi-larity under non Gaussian noise. In International Conference on ImageProcessing, pages 1845 – 1848, Brussels, Belgium, September 2011.[86] M.N. Do and M. Vetterli. The contourlet transform: an efficient direc-tional multiresolution image representation. Image Processing, IEEETransactions on, 14(12):2091–2106, Dec 2005.[87] Weisheng Dong, Xin Li, D. Zhang, and Guangming Shi. Sparsity-basedimage denoising via dictionary learning and structural clustering. InComputer Vision and Pattern Recognition (CVPR), 2011 IEEE Con-ference on, pages 457–464, June 2011.[88] Weisheng Dong, D. Zhang, Guangming Shi, and Xiaolin Wu. Imagedeblurring and super-resolution by adaptive sparse domain selectionand adaptive regularization. Image Processing, IEEE Transactionson, 20(7):1838–1857, July 2011.[89] Yiqiu Dong, Michael Hintermu¨ller, and M Monserrat Rincon-Camacho. Automated regularization parameter selection in multi-scaletotal variation models for image restoration. Journal of MathematicalImaging and Vision, 40(1):82–104, 2011.[90] DAVID L. DONOHO and JAIN M. JOHNSTONE. Ideal spatial adap-tation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.[91] V. Dore and M. Cheriet. Robust nl-means filter with optimal pixel-wisesmoothing parameter for statistical image denoising. Signal Process-ing, IEEE Transactions on, 57(5):1703–1716, May 2009.[92] L. Y. Du, J. Umoh, H. N. Nikolov, S. I. Pollmann, T. Y. Lee, andD. W. Holdsworth. A quality assurance phantom for the performanceevaluation of volumetric micro-CT systems. Physics in Medicine andBiology, 52(23):7087–7108, 2007.[93] J.M. Duarte-Carvajalino and G. Sapiro. Learning to sense sparse sig-nals: Simultaneous sensing matrix and sparsifying dictionary opti-mization. Image Processing, IEEE Transactions on, 18(7):1395–1408,July 2009.[94] F.-X. Dupe and S. Anthoine. A greedy approach to sparse poissondenoising. In Machine Learning for Signal Processing (MLSP), 2013IEEE International Workshop on, pages 1–6, Sept 2013.216Bibliography[95] Vincent Duval, Jean-Franois Aujol, and Yann Gousseau. A bias-variance approach for the nonlocal means. SIAM Journal on ImagingSciences, 4(2):760–788, 2011.[96] Glenn Easley, Demetrio Labate, and Wang-Q Lim. 
Sparse directionalimage representations using the discrete shearlet transform. Appliedand Computational Harmonic Analysis, 25(1):25–46, 2008.[97] A.A. Efros and T.K. Leung. Texture synthesis by non-parametricsampling. In Computer Vision, 1999. The Proceedings of the SeventhIEEE International Conference on, volume 2, pages 1033–1038 vol.2,1999.[98] M. Elad. Sparse and redundant representation modeling; what next?Signal Processing Letters, IEEE, 19(12):922–928, Dec 2012.[99] Michael Elad. Sparse and redundant representations: from theory toapplications in signal and image processing. Springer, 2010.[100] Michael Elad and Michal Aharon. Image denoising via sparse andredundant representations over learned dictionaries. Image Processing,IEEE Transactions on, 15(12):3736–3745, 2006.[101] IA Elbakri and J.A Fessler. Statistical image reconstruction for polyen-ergetic x-ray computed tomography. Medical Imaging, IEEE Trans-actions on, 21(2):89–99, Feb 2002.[102] Yonina C Eldar. Generalized sure for exponential families: Appli-cations to regularization. Signal Processing, IEEE Transactions on,57(2):471–481, 2009.[103] K. Engan, S.O. Aase, and J. Hakon Husoy. Method of optimal direc-tions for frame design. In Acoustics, Speech, and Signal Processing,1999. Proceedings., 1999 IEEE International Conference on, volume 5,pages 2443–2446 vol.5, 1999.[104] Hakan Erdogan and Jeffrey A Fessler. Ordered subsets algorithmsfor transmission tomography. Physics in medicine and biology,44(11):2835–2851, 1999.[105] Vincent Etter, Ivana Jovanovic, and Martin Vetterli. Use of learneddictionaries in tomographic reconstruction, 2011.[106] Benjamin P Fahimian, Yu Mao, Peter Cloetens, and Jianwei Miao.Low-dose x-ray phase-contrast and absorption ct using equally slopedtomography. Physics in Medicine and Biology, 55(18):5383, 2010.217Bibliography[107] L. A. Feldkamp, L. C. Davis, and J. W. Kress. Practical cone-beamalgorithm. J. Opt. Soc. Am. A, 1(6):612–619, Jun 1984.[108] N. L. Ford, M. M. Thornton, and D. W. Holdsworth. Fundamentalimage quality limits for microcomputed tomography in small animals.Medical Physics, 30(11):2869–2877, 2003.[109] Michael P. Friedlander and Mark W. Schmidt. Hybrid deterministic-stochastic methods for data fitting. CoRR, abs/1104.2373, 2011.[110] Jrgen Frikel. Sparse regularization in limited angle tomography. Ap-plied and Computational Harmonic Analysis, 34(1):117 – 141, 2013.[111] Piotr Fryzlewicz and Guy P Nason. A haar-fisz algorithm for poissonintensity estimation. Journal of Computational and Graphical Statis-tics, 13(3):621–638, 2004.[112] Pascal Getreuer. Rudin–osher–fatemi total variation denoising usingsplit bregman. Image Processing On Line, 2012.[113] G. Gilboa, N. Sochen, and Y.Y. Zeevi. Variational denoising of partlytextured images by spatially varying constraints. Image Processing,IEEE Transactions on, 15(8):2281–2289, Aug 2006.[114] Guy Gilboa, Jerome Darbon, Stanley Osher, and Tony Chan. Nonlocalconvex functionals for image regularization. UCLA CAM-report, pages06–57, 2006.[115] Raja Giryes and Michael Elad. Sparsity based poisson denoising withdictionary learning. arXiv preprint arXiv:1309.4306, 2013.[116] Donald Goldfarb and Wotao Yin. Second-order cone programmingmethods for total variation-based image restoration. SIAM Journalon Scientific Computing, 27(2):622–645, 2005.[117] T. Goldstein and S. Osher. The split Bregman method for l1-regularized problems. SIAM Journal on Imaging Sciences, 2(2):323–343, 2009.[118] G.H. Golub and C. F. Van-Loan. Matrix Computations. 
Johns HopkinsUniversity Press, Baltimore, MD, 2013.[119] I.F. Gorodnitsky and B.D. Rao. Sparse signal reconstruction fromlimited data using focuss: a re-weighted minimum norm algorithm.Signal Processing, IEEE Transactions on, 45(3):600–616, Mar 1997.218Bibliography[120] Markus Grasmair. Locally adaptive total variation regularization. InScale Space and Variational methods in computer Vision, pages 331–342. Springer Berlin Heidelberg, 2009.[121] Markus Grasmair and Frank Lenzen. Anisotropic total variation fil-tering. Applied Mathematics & Optimization, 62(3):323–339, 2010.[122] Karol Gregor and Yann Lecun. Learning fast approximations of sparsecoding. In Machine Learning (ICML), 2010, International Conferenceon, pages 1–8. Omnipress, 2010.[123] Jie Tang Guang-Hong Chen and Shuai Leng. Prior image constrainedcompressed sensing (piccs): A method to accurately reconstruct dy-namic ct images from highly undersampled projection data sets. Med-ical Physics, 35(2):660–663, 2008.[124] C. Guillemot and O. Le Meur. Image inpainting : Overview andrecent advances. Signal Processing Magazine, IEEE, 31(1):127–144,Jan 2014.[125] Weihong Guo and Feng Huang. Adaptive total variation based filteringfor mri images with spatially inhomogeneous noise and artifacts. InBiomedical Imaging: From Nano to Macro, 2009. ISBI ’09. IEEEInternational Symposium on, pages 101–104, June 2009.[126] Xiao Han, Junguo Bian, D.R. Eaker, T.L. Kline, E.Y. Sidky, E.L. Rit-man, and Xiaochuan Pan. Algorithm-enabled low-dose micro-ct imag-ing. Medical Imaging, IEEE Transactions on, 30(3):606–620, March2011.[127] Simon Hawe, Matthias Seibert, and Martin Kleinsteuber. Separa-ble dictionary learning. In Computer Vision and Pattern Recognition(CVPR), 2013 IEEE Conference on, pages 438–445. IEEE, 2013.[128] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier:optimal algorithms for stochastic strongly-convex optimization. TheJournal of Machine Learning Research, 15(1):2489–2512, 2014.[129] Gabor T Herman. Fundamentals of computerized tomography: imagereconstruction from projections. Springer, 2009.[130] G.T. Herman and L.B. Meyer. Algebraic reconstruction techniquescan be made computationally efficient [positron emission tomographyapplication]. Medical Imaging, IEEE Transactions on, 12(3):600–609,Sep 1993.219Bibliography[131] H Hotelling. Analysis of a complex of statistical variables into principalcomponents. Journal of Educational Psychology, 24(6):417–441, 1933.[132] De-An Huang and Yu-Chiang Frank Wang. Coupled dictionary andfeature space learning with applications to cross-domain image syn-thesis and recognition. In Computer Vision (ICCV), 2013 IEEE In-ternational Conference on, pages 2496–2503. IEEE, 2013.[133] Jing Huang, Jianhua Ma, Nan Liu, Hua Zhang, Zhaoying Bian, YanqiuFeng, Qianjin Feng, and Wufan Chen. Sparse angular {CT} recon-struction using non-local means based iterative-correction {POCS}.Computers in Biology and Medicine, 41(4):195 – 205, 2011.[134] Junzhou Huang, Tong Zhang, and Dimitris Metaxas. Learning withstructured sparsity. The Journal of Machine Learning Research,12:3371–3412, 2011.[135] J.M. Hughes, D.N. Rockmore, and Yang Wang. Bayesian learningof sparse multiscale image representations. Image Processing, IEEETransactions on, 22(12):4972–4983, Dec 2013.[136] Aapo Hyva¨rinen, Jarmo Hurri, and Patrick O Hoyer. Natural Im-age Statistics: A probabilistic approach to early computational vision,volume 39. 
Springer-Verlag New York Inc, 2009.[137] David A Jaffray, Jeffrey H Siewerdsen, John W Wong, and Alvaro AMartinez. Flat-panel cone-beam computed tomography for image-guided radiation therapy. International Journal of Radiation Oncol-ogy*Biology*Physics, 53(5):1337 – 1349, 2002.[138] John W. Tukey James W. Cooley. An algorithm for the machinecalculation of complex fourier series. Mathematics of Computation,19(90):297–301, 1965.[139] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What isthe best multi-stage architecture for object recognition? In ComputerVision, 2009 IEEE 12th International Conference on, pages 2146–2153, Sept 2009.[140] Rodolphe Jenatton, Jean-Yves Audibert, and Francis Bach. Struc-tured variable selection with sparsity-inducing norms. J. Mach. Learn.Res., 12:2777–2824, November 2011.220Bibliography[141] Rodolphe Jenatton, Julien Mairal, Francis R Bach, and Guillaume RObozinski. Proximal methods for sparse hierarchical dictionary learn-ing. In Proceedings of the 27th International Conference on MachineLearning (ICML-10), pages 487–494, 2010.[142] Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, and FrancisBach. Proximal methods for hierarchical sparse coding. Journal ofMachine Learning Research, (12):2297–2334, 2011.[143] T. L. Jensen, J. H. Jørgensen, P. C. Hansen, and S. H. Jensen. Imple-mentation of an optimal first-order method for strongly convex totalvariation regularization. BIT Numerical Mathematics, 52(2):329–356,2011.[144] Xun Jia, Bin Dong, Yifei Lou, and Steve B Jiang. GPU-based iterativecone-beam CT reconstruction using tight frame regularization. Physicsin Medicine and Biology, 56(13):3787, 2011.[145] Xun Jia, Yifei Lou, Ruijiang Li, William Y. Song, and Steve B. Jiang.GPU-based fast cone beam CT reconstruction from undersampled andnoisy projection data via total variation. Medical Physics, 37(4):1757–1760, 2010.[146] Xun Jia, Zhen Tian, Yifei Lou, Jan-Jakob Sonke, and Steve B. Jiang.Four-dimensional cone beam ct reconstruction and enhancement usinga temporal nonlocal means method. Medical Physics, 39(9):5592–5602,2012.[147] Zhuolin Jiang, Zhe Lin, and L.S. Davis. Label consistent k-svd: Learn-ing a discriminative dictionary for recognition. Pattern Analysis andMachine Intelligence, IEEE Transactions on, 35(11):2651–2664, Nov2013.[148] Rie Johnson and Tong Zhang. Accelerating stochastic gradient de-scent using predictive variance reduction. In C.J.C. Burges, L. Bottou,M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advancesin Neural Information Processing Systems 26, pages 315–323. CurranAssociates, Inc., 2013.[149] Willi A. Kalender. Computed Tomography: Fundamentals, SystemTechnology, Image Quality, Applications. Publicis, 2011.[150] C. Kamphuis and F. J. Beekman. Accelerated iterative transmissionct reconstruction using an ordered subsets convex algorithm. IEEETransactions on Medical Imaging, 17(6):1101–1105, Dec 1998.221Bibliography[151] Li-Wei Kang, Chia-Wen Lin, and Yu-Hsiang Fu. Automatic single-image-based rain streaks removal via image decomposition. ImageProcessing, IEEE Transactions on, 21(4):1742–1755, April 2012.[152] Davood Karimi and Rabab Ward. On the computational implemen-tation of forward and back-projection operations for cone-beam com-puted tomography. Medical & Biological Engineering & Computing,pages 1–12, 2015.[153] K. Kavukcuoglu, M.A. Ranzato, R. Fergus, and Yann Le-Cun. Learn-ing invariant features through topographic filter maps. In ComputerVision and Pattern Recognition, 2009. CVPR 2009. 
IEEE Conferenceon, pages 1605–1612, June 2009.[154] D Kazantsev, G Van Eyndhoven, WRB Lionheart, PJ Withers,KJ Dobson, SA McDonald, R Atwood, and PD Lee. Employing tem-poral self-similarity across the entire time domain in computed to-mography reconstruction. Philosophical Transactions of the Royal So-ciety of London A: Mathematical, Physical and Engineering Sciences,373(2043):20140389, 2015.[155] Daniil Kazantsev, William M Thompson, G Van Eyndhoven, KJ Dob-son, Anders P Kaestner, WRB Lionheart, Philip J Withers, and Pe-ter D Lee. 4d-ct reconstruction with unified spatial-temporal patch-based regularization. Inverse Probl. Imaging, 9:447–467, 2015.[156] Z.S. Kelm, D. Blezek, B. Bartholmai, and B.J. Erickson. Optimizingnon-local means for denoising low dose ct. In Biomedical Imaging:From Nano to Macro, 2009. ISBI ’09. IEEE International Symposiumon, pages 662–665, June 2009.[157] C. Kervrann and J. Boulanger. Optimal spatial adaptation for patch-based image denoising. Image Processing, IEEE Transactions on,15(10):2866–2878, Oct 2006.[158] Charles Kervrann and Jrme Boulanger. Local adaptivity to variablesmoothness for exemplar-based image regularization and representa-tion. International Journal of Computer Vision, 79(1):45–69, 2008.[159] Donghwan Kim and J.A. Fessler. Ordered subsets acceleration usingrelaxed momentum for x-ray ct image reconstruction. In Nuclear Sci-ence Symposium and Medical Imaging Conference (NSS/MIC), 2013IEEE, pages 1–5, Oct 2013.222Bibliography[160] Donghwan Kim, S. Ramani, and J.A. Fessler. Combining orderedsubsets and momentum for accelerated x-ray CT image reconstruction.Medical Imaging, IEEE Transactions on, 34(1):167–178, Jan 2015.[161] Donghwan Kim, Sathish Ramani, and Jeffrey A Fessler. Accelerat-ing x-ray ct ordered subsets image reconstruction with nesterovs first-order methods. Proc. Intl. Mtg. on Fully 3D Image Recon. in Rad.and Nuc. Med, pages 22–5, 2013.[162] Hojin Kim, Ruijiang Li, Rena Lee, Thomas Goldstein, Stephen Boyd,Emmanuel Candes, and Lei Xing. Dose optimization with first-ordertotal-variation minimization for dense angularly sampled and sparseintensity modulated radiation therapy (dassim-rt). Medical Physics,39(7), 2012.[163] Stefan Kindermann, Stanley Osher, and Peter W. Jones. Deblurringand denoising of images by nonlocal functionals. Multiscale Modeling& Simulation, 4(4):1091–1115, 2005.[164] Nick Kingsbury. Complex wavelets for shift invariant analysis andfiltering of signals. Applied and computational harmonic analysis,10(3):234–253, 2001.[165] Jakub Konecny´, Jie Liu, Peter Richta´rik, and Martin Taka´c. mS2GD:Mini-batch semi-stochastic gradient descent in the proximal setting.CoRR, abs/1410.4744, 2014.[166] Jakub Konecny´, Zheng Qu, and Peter Richta´rik. Semi-stochastic co-ordinate descent. CoRR, abs/1412.6293, 2014.[167] Kenneth Kreutz-Delgado, Joseph F Murray, Bhaskar D Rao, KjerstiEngan, Te-Won Lee, and Terrence J Sejnowski. Dictionary learningalgorithms for sparse representation. Neural computation, 15(2):349–396, 2003.[168] D. Krishnan, Ping Lin, and A.M. Yip. A primal-dual active-setmethod for non-negativity constrained total variation deblurring prob-lems. Image Processing, IEEE Transactions on, 16(11):2766–2777,Nov 2007.[169] S. Krstulovic and R. Gribonval. Mptk: Matching pursuit madetractable. In Acoustics, Speech and Signal Processing, 2006. ICASSP2006 Proceedings. 2006 IEEE International Conference on, volume 3,pages III–III, May 2006.223Bibliography[170] N. Kumar, L. Zhang, and S. K. Nayar. 
What is a Good Nearest Neigh-bors Algorithm for Finding Similar Patches in Images? In EuropeanConference on Computer Vision (ECCV), pages 364–378, Oct 2008.[171] Demetrio Labate, Wang-Q Lim, Gitta Kutyniok, and Guido Weiss.Sparse multidimensional representation using shearlets. In Optics &Photonics 2005, pages 59140U–59140U. International Society for Op-tics and Photonics, 2005.[172] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simplerapproach to obtaining an o(1/t) convergence rate for the projectedstochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.[173] Guanghui Lan. An optimal method for stochastic composite optimiza-tion. Mathematical Programming, 133(1-2):365–397, 2012.[174] Patrick J LaRivire. Penalized-likelihood sinogram smoothing for low-dose CT. Medical Physics, 32(6):1676–1683, 2005.[175] Triet Le, Rick Chartrand, and ThomasJ. Asaki. A variational ap-proach to reconstructing images corrupted by Poisson noise. Journalof Mathematical Imaging and Vision, 27(3):257–263, 2007.[176] E. Le Pennec and S. Mallat. Sparse geometric image representationswith bandelets. Image Processing, IEEE Transactions on, 14(4):423–438, April 2005.[177] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gra-dient method with an exponential convergence rate for finite trainingsets. arXiv preprint arXiv:1202.6258, 2012.[178] M. Lebrun, A. Buades, and J. M. Morel. A nonlocal bayesian imagedenoising algorithm. SIAM Journal on Imaging Sciences, 6(3):1665–1688, 2013.[179] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y Ng. Effi-cient sparse coding algorithms. In Advances in neural informationprocessing systems, pages 801–808, 2006.[180] Shuai Leng, Jie Tang, Joseph Zambelli, Brian Nett, Ranjini Tolakana-halli, and Guang-Hong Chen. High temporal resolution and streak-free four-dimensional cone-beam computed tomography. Physics inMedicine and Biology, 53(20):5653, 2008.[181] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Miningof massive datasets. Cambridge University Press, 2014.224Bibliography[182] A. Levin and B. Nadler. Natural image denoising: Optimality and in-herent bounds. In Computer Vision and Pattern Recognition (CVPR),2011 IEEE Conference on, pages 2833–2840, June 2011.[183] Michael S Lewicki and Bruno A Olshausen. Probabilistic frame-work for the adaptation and comparison of image codes. JOSA A,16(7):1587–1601, 1999.[184] Michael S Lewicki and Terrence J Sejnowski. Learning overcompleterepresentations. Neural computation, 12(2):337–365, 2000.[185] R M Lewitt. Alternatives to voxels for image representation in iterativereconstruction algorithms. Physics in Medicine and Biology, 37(3):705,1992.[186] Shutao Li, Leyuan Fang, and Haitao Yin. An efficient dictionarylearning algorithm and its application to 3-d medical image denoising.Biomedical Engineering, IEEE Transactions on, 59(2):417–427, 2012.[187] Shutao Li, Haitao Yin, and Leyuan Fang. Group-sparse represen-tation with dictionary learning for medical image denoising and fu-sion. Biomedical Engineering, IEEE Transactions on, 59(12):3450–3459, Dec 2012.[188] Si Li, Qing Cao, Yang Chen, Yining Hu, Limin Luo, and ChristineToumoulin. Dictionary learning based sinogram inpainting for CTsparse reconstruction. Optik - International Journal for Light andElectron Optics, 125(12):2862 – 2867, 2014.[189] Tianfang Li, Xiang Li, Jing Wang, Junhai Wen, Hongbing Lu, JiangHsieh, and Zhengrong Liang. Nonlinear sinogram smoothing for low-dose x-ray CT. 
Nuclear Science, IEEE Transactions on, 51(5):2505–2513, Oct 2004.[190] Yinsheng Li, Yang Chen, Yining Hu, Ahmed Oukili, Limin Luo, Wu-fan Chen, and Christine Toumoulin. Strategy of computed tomogra-phy sinogram inpainting based on sinusoid-like curve decompositionand eigenvector-guided interpolation. J Opt Soc Am A Opt Image SciVis., 29(1):153–163, 2012.[191] Zhoubo Li, Lifeng Yu, Joshua D. Trzasko, David S. Lake, Daniel J.Blezek, Joel G. Fletcher, Cynthia H. McCollough, and Armando Man-duca. Adaptive nonlocal means filtering based on local noise level forct denoising. Medical Physics, 41(1):–, 2014.225Bibliography[192] Baodong Liu, Hengyong Yu, Scott S. Verbridge, Lizhi Sun, andGe Wang. Dictionary-learning-based reconstruction method for elec-tron tomography. Scanning, 36(4):377–383, 2014.[193] Yong Long, J.A. Fessler, and J.M. Balter. 3D forward and back-projection for x-ray CT using separable footprints. Medical Imaging,IEEE Transactions on, 29(11):1839–1850, 2010.[194] Yifei Lou, Xiaoqun Zhang, Stanley Osher, and Andrea Bertozzi. Im-age recovery via nonlocal operators. Journal of Scientific Computing,42(2):185–197, 2010.[195] C. Louchet and L. Moisan. Total variation as a local filter. SIAMJournal on Imaging Sciences, 4(2):651–694, 2011.[196] David G Lowe. Distinctive image features from scale-invariant key-points. International journal of computer vision, 60(2):91–110, 2004.[197] Yang Lu, Jun Zhao, and Ge Wang. Few-view image reconstructionwith dual dictionaries. Physics in Medicine and Biology, 57(1):173,2012.[198] Y.M. Lu and M.N. Do. Multidimensional directional filter banks andsurfacelets. Image Processing, IEEE Transactions on, 16(4):918–931,April 2007.[199] Marius Lysaker and XueCheng Tai. Iterative image restoration com-bining total variation minimization and a second-order functional. In-ternational Journal of Computer Vision, 66(1):5–18, 2006.[200] Jianhua Ma, Jing Huang, Qianjin Feng, Hua Zhang, Hongbing Lu,Zhengrong Liang, and Wufan Chen. Low-dose computed tomographyimage restoration using previous normal-dose scan. Medical Physics,38(10):5713–5731, 2011.[201] Jianhua Ma, Zhengrong Liang, Yi Fan, Yan Liu, Jing Huang, WufanChen, and Hongbing Lu. Variance analysis of x-ray ct sinograms inthe presence of electronic noise background. Medical Physics, 39(7),2012.[202] Jianhua Ma, Hua Zhang, Yang Gao, Jing Huang, Zhengrong Liang,Qianjing Feng, and Wufan Chen. Iterative image reconstruction forcerebral perfusion ct using a pre-contrast scan induced edge-preservingprior. Physics in Medicine and Biology, 57(22):7519, 2012.226Bibliography[203] A. Plenevaux P. Choquet A. Constantinesco E. Salmon A. LuxenA. Seret M.A. Bahri, G. Warnock. Performance evaluation of thegeneral electric explore ct 120 micro-ct using the vmct phantom. Nu-clear Instruments and Methods in Physics Research A, 648:S181–S185,2011.[204] Albert Macovski. Medical Imaging Systems. Prentice Hall, UpperSaddle River, NJ, 1983.[205] Matteo Maggioni, Vladimir Katkovnik, Karen Egiazarian, andAlessandro Foi. A nonlocal transform-domain filter for volumetricdata denoising and reconstruction. IEEE Transactions on Image Pro-cessing, 22(1):1057–7149, 2013.[206] Mehrdad Mahdavi and Rong Jin. MixedGrad: an O(1/T) convergencerate algorithm for stochastic smooth optimization. arXiv preprintarXiv:1307.7192, 2013.[207] M. Mahmoudi and G. Sapiro. Fast image and video denoising vianonlocal means of similar neighborhoods. Signal Processing Letters,IEEE, 12(12):839–842, Dec 2005.[208] Andreas Maier, Hannes G. 
Hofmann, Martin Berger, Peter Fischer,Chris Schwemmer, Haibo Wu, Kerstin Mller, Joachim Hornegger,Jang-Hwan Choi, Christian Riess, Andreas Keil, and Rebecca Fahrig.Conrada software framework for cone-beam imaging in radiology. Med-ical Physics, 40(11):–, 2013.[209] Andreas Maier, Lars Wigstrm, Hannes G. Hofmann, Joachim Horneg-ger, Lei Zhu, Norbert Strobel, and Rebecca Fahrig. Three-dimensionalanisotropic adaptive filtering of projection data for noise reduction incone beam ct. Medical Physics, 38(11), 2011.[210] Julien Mairal, Francis Bach, and Jean Ponce. Task-driven dictionarylearning. IEEE Transactions on Pattern Analysis and Machine Intel-ligence (PAMI), 34(4):791–804, 2012.[211] Julien Mairal, Francis Bach, and Jean Ponce. Sparse modeling forimage and vision processing. Foundations and Trends R© in ComputerGraphics and Vision, 8(2-3):85–283, 2014.[212] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Onlinedictionary learning for sparse coding. In International Conference onMachine Learning (ICML), 2009.227Bibliography[213] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. On-line learning for matrix factorization and sparse coding. Journal ofMachine Learning Research (JMLR), 11:19–60, 2010.[214] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and An-drew Zisserman. Non-local sparse models for image restoration. InInternational Conference on Computer Vision (ICCV). I, 2009.[215] Julien Mairal, Michael Elad, and Guillermo Sapiro. Sparse repre-sentation for color image restoration. IEEE Transactions on ImageProcessing, 17(1):53–69, 2008.[216] Julien Mairal, Guillermo Sapiro, and Michael Elad. Learning multi-scale sparse representations for image and video restoration. MultiscaleModeling & Simulation, 7(1):214–241, 2008.[217] S.G. Mallat. A theory for multiresolution signal decomposition: thewavelet representation. Pattern Analysis and Machine Intelligence,IEEE Transactions on, 11(7):674–693, Jul 1989.[218] Ste´phane Mallat. A wavelet tour of signal processing. Academic press,1999.[219] Armando Manduca, Lifeng Yu, Joshua D. Trzasko, Natalia Khaylova,James M Kofler, Cynthia M McCollough, and Joel G. Fletcher. Pro-jection space denoising with bilateral filtering and CT noise modelingfor dose reduction in CT. Medical Physics, 36(11):4911–4919, 2009.[220] S H Manglos, G M Gagne, A Krol, F D Thomas, andR Narayanaswamy. Transmission maximum-likelihood reconstructionwith ordered subsets for cone beam CT. Physics in Medicine andBiology, 40(7):1225, 1995.[221] S. Matej and R.M. Lewitt. Practical considerations for 3-D imagereconstruction using spherically symmetric volume elements. MedicalImaging, IEEE Transactions on, 15(1):68–78, Feb 1996.[222] Cynthia H. McCollough, Michael R. Bruesewitz, and James M. Kofler.CT dose reduction and dose management tools: Overview of availableoptions. RadioGraphics, 26(2):503–512,