UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Methods for image recovery in computational imaging Xiao, Lei 2017

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2017_may_xiao_lei.pdf [ 104.12MB ]
JSON: 24-1.0343606.json
JSON-LD: 24-1.0343606-ld.json
RDF/XML (Pretty): 24-1.0343606-rdf.xml
RDF/JSON: 24-1.0343606-rdf.json
Turtle: 24-1.0343606-turtle.txt
N-Triples: 24-1.0343606-rdf-ntriples.txt
Original Record: 24-1.0343606-source.json
Full Text

Full Text

Methods For Image Recovery In Computational ImagingbyLei XiaoA THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University of British Columbia(Vancouver)April 2017c© Lei Xiao, 2017AbstractAs a classic topic that has been studied for decades, image restoration is still avery active research area. Developing more effective and efficient methods arehighly desirable. This thesis addresses image restoration problems for applicationsin computational imaging including Time-of-Flight (ToF) imaging and digital pho-tography.While ToF cameras have shown great promise at low-cost depth imaging, theysuffer from limited depth-of-field and low spatial-resolution. We develop a compu-tational method to remove lens blur and increase image resolution of off-the-shelfToF cameras. The method solves latent images directly from the raw sensor dataas an inverse problem, and supports for future ToF cameras that use multiple fre-quencies, phases and exposures.Photographs taken by hand-held cameras are likely to suffer from blur causedby camera shake during exposure. Removing such blur and recovering sharp im-ages as a post-process is therefore critical. We develop a blind deblurring methodthat is purely based on stochastic random-walk optimization. This simple frame-work in combination with different priors produces comparable results to the muchmore complex state-of-the-art deblurring algorithms.Blur causes even more serious issues for document photographs as slight blurcan make Optical Character Recognition (OCR) techniques fail. We address theblind deblurring problem specifically for common document photographs. Ob-serving that the latter are mostly composed of high-order structures, our methodcaptures such domain property by a series of high-order filters as well as cus-tomized response functions. These parameters are trained from data by discrim-inative learning approach and form an end-to-end network that can efficiently andiijointly estimate blur kernels and legible images.Discriminative learning approaches achieve convincing trade-off between im-age quality and computational efficiency, however, they require separate trainingfor each restoration task and problem condition, making it time-consuming anddifficult to encompass all tasks and conditions during training. We combine dis-criminative learning and formal optimization techniques to learn image priors thatrequire a single-pass training and share across various tasks and conditions whilekeeping the efficiency as previous discriminative methods. After being trained, ourmethod can be combined with other likelihood or priors to address unseen restora-tion tasks or further improve the image quality.iiiPrefaceThe content of this thesis is based on the following publications. For each publica-tion, the contributions of each author are listed.L. Xiao, F. Heide, M. O’Toole, A. Kolb, M. B. Hullin, K. Kutulakos, and W.Heidrich. Defocus Deblurring and Superresolution for Time-of-Flight Depth Cam-eras. IEEE Conference on Computer Vision and Pattern Recognition (CVPR),2014. W. Heidrich conceived the initial idea of tackling the multi-path mitigationproblem as an inverse problem. The shallow depth-of-field limitation of ToF cam-eras was raised up during discussions between L. Xiao, F. Heide and M. O’Toole.L. Xiao developed the inverse problem formulation and optimization algorithm,designed the experiment setup, performed all experiments, and wrote the initialmanuscript. F. Heide contributed to discussions during the course of the research,and provided the code for lens calibration. A. Kolb provided the ToF camera usedin the experiments. All authors contributed to the manuscript writing.L. Xiao, J. Gregson, F. Heide, and W. Heidrich. Stochastic Blind Motion De-blurring. IEEE Transactions on Image Processing, 24(10):3071-3085, Oct 2015.L. Xiao conceived the idea, developed the algorithm, conducted all experiments,and wrote the initial manuscript. W. Heidrich provided detailed guidance duringthe course of the research, and did major editing of the final manuscript. J. Gregsonand F. Heide contributed to discussions during the course of the research and themanuscript writing.L. Xiao, J. Wang, W. Heidrich, and M. Hirsch. Learning High-Order Filters forEfficient Blind Deconvolution of Document Photographs. European Conference onComputer Vision (ECCV), 2016. The problem of document photo deblurring wasproposed by J. Wang. L. Xiao developed the method, conducted all experiments,ivand wrote the initial manuscript. J. Wang and M. Hirsch provided guidance duringthe course of the research and did major editing of the the final manuscript. W.Heidrich contributed to discussions during the course of the research.L. Xiao, F. Heide, W. Heidrich, B. Scho¨lkopf, and M. Hirsch. Learning Prox-imal Operators for Image Restoration. under review of ICCV 2017. L. Xiao con-ceived the idea, developed the method, conducted most experiments, and wrotethe initial manuscript. F. Heide contributed to discussions during the course ofthe research, provided the Halide implementation, and did major editing of themanuscript. M. Hirsch contributed to discussions during the course of the research,and did major editing of the manuscript. All authors contributed to the manuscriptwriting.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xivAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Time-of-flight depth imaging . . . . . . . . . . . . . . . . . . . . 31.2 Blind deblurring of natural images . . . . . . . . . . . . . . . . . 41.3 Blind deblurring of document photographs . . . . . . . . . . . . . 41.4 Learning proximal operators for general image restoration . . . . 51.5 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Background and Related Work . . . . . . . . . . . . . . . . . . . . . 72.1 General image restoration . . . . . . . . . . . . . . . . . . . . . . 82.1.1 Classical models . . . . . . . . . . . . . . . . . . . . . . 8vi2.1.2 Generative learning models . . . . . . . . . . . . . . . . 102.1.3 Discriminative learning models . . . . . . . . . . . . . . 112.2 Natural image deblurring . . . . . . . . . . . . . . . . . . . . . . 122.2.1 Non-blind deconvolution . . . . . . . . . . . . . . . . . . 122.2.2 Blind deconvolution . . . . . . . . . . . . . . . . . . . . 132.3 Document photograph deblurring . . . . . . . . . . . . . . . . . . 152.4 Time-of-flight imaging . . . . . . . . . . . . . . . . . . . . . . . 152.4.1 Imaging principle . . . . . . . . . . . . . . . . . . . . . . 152.4.2 ToF defocus deblurring . . . . . . . . . . . . . . . . . . . 172.5 Proximal optimization techniques . . . . . . . . . . . . . . . . . 182.5.1 Proximal operators . . . . . . . . . . . . . . . . . . . . . 182.5.2 Half quadratic splitting . . . . . . . . . . . . . . . . . . . 192.5.3 Alternating direction method of multipliers . . . . . . . . 213 Time-of-Flight Defocus Deblurring and Superresolution . . . . . . . 223.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.2.1 Algorithm overview . . . . . . . . . . . . . . . . . . . . 243.2.2 Updating kernel estimate . . . . . . . . . . . . . . . . . . 273.2.3 Updating slack variable . . . . . . . . . . . . . . . . . . . 273.2.4 Update amplitude estimate . . . . . . . . . . . . . . . . . 273.2.5 Updating depth estimate . . . . . . . . . . . . . . . . . . 283.2.6 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . 293.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.3.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . 313.3.2 Real data . . . . . . . . . . . . . . . . . . . . . . . . . . 313.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414 Stochastic Blind Motion Deblurring . . . . . . . . . . . . . . . . . . 424.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424.2 Basic algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 444.2.1 Updating the intrinsic image . . . . . . . . . . . . . . . . 45vii4.2.2 Updating the kernel . . . . . . . . . . . . . . . . . . . . . 544.2.3 Updating the weights . . . . . . . . . . . . . . . . . . . . 574.3 Algorithmic Extensions . . . . . . . . . . . . . . . . . . . . . . . 574.3.1 Color images . . . . . . . . . . . . . . . . . . . . . . . . 574.3.2 Chromatic kernels . . . . . . . . . . . . . . . . . . . . . 634.3.3 Saturated or missing data . . . . . . . . . . . . . . . . . . 634.3.4 Poisson noise . . . . . . . . . . . . . . . . . . . . . . . . 654.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674.5 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . 745 Document Image Deblurring . . . . . . . . . . . . . . . . . . . . . . 765.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 775.2.1 Cascade of shrinkage fields . . . . . . . . . . . . . . . . . 785.2.2 Multi-scale interleaved CSF for blind deconvolution . . . 795.2.3 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 815.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 865.4 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . 936 Learning Proximal Operators for Image Restoration . . . . . . . . . 956.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 956.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 966.2.1 Proximal fields . . . . . . . . . . . . . . . . . . . . . . . 976.2.2 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 996.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1016.4 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . 1207 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125A Supplemental Materials . . . . . . . . . . . . . . . . . . . . . . . . . 136A.1 Supplemental material for Chapter 3 . . . . . . . . . . . . . . . . 136A.2 Supplemental material for Chapter 5 . . . . . . . . . . . . . . . . 140viiiA.3 Supplemental material for Chapter 6 . . . . . . . . . . . . . . . . 145ixList of TablesTable 4.1 PSNR comparisons on the benchmark dataset [57] . . . . . . . 70Table 4.2 Run-time analysis. . . . . . . . . . . . . . . . . . . . . . . . . 72Table 5.1 Run-time comparison. . . . . . . . . . . . . . . . . . . . . . . 89Table 6.1 Average PSNR on 68 images from [87] for image denoising. . 102Table 6.2 Run-time comparison for image denoising. . . . . . . . . . . . 108Table 6.3 Average PSNR on 32 images from [64] for non-blind deconvo-lution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109Table 6.4 Test with different HQS iterations and model stages. . . . . . . 120xList of FiguresFigure 2.1 Principle of ToF imaging. . . . . . . . . . . . . . . . . . . . 16Figure 3.1 Simulated Buddha scene. . . . . . . . . . . . . . . . . . . . . 23Figure 3.2 Experimental setup. . . . . . . . . . . . . . . . . . . . . . . . 29Figure 3.3 Results on simulated Buddha scene with 0.5% white noise. . . 30Figure 3.4 PSNR of our estimated amplitude and depth on the Buddhascene at each iteration. . . . . . . . . . . . . . . . . . . . . . 31Figure 3.5 RGB Photographs of the real scenes. . . . . . . . . . . . . . . 32Figure 3.6 Results on the Camel scene. . . . . . . . . . . . . . . . . . . 33Figure 3.7 Results on the Character scene. . . . . . . . . . . . . . . . . . 34Figure 3.8 Results on the Board scene. . . . . . . . . . . . . . . . . . . 35Figure 3.9 Two insets of the results on the Camel scene in Fig. 3.6. . . . 36Figure 3.10 Mesh visualization for the Camel scene in Fig. 3.6. . . . . . . 36Figure 3.11 Two insets of the results on the Character scene in Fig. 3.7. . . 37Figure 3.12 Two insets of the results on the Board scene in Fig. 3.8. . . . . 37Figure 3.13 Results with different upsampling ratios. . . . . . . . . . . . . 38Figure 3.14 Experiment results using the TGV prior on the complex images. 39Figure 3.15 Comparison with post-superresolution. . . . . . . . . . . . . 40Figure 4.1 Our multi-scale scheme on the scene shown in Fig.4.8. . . . . 45Figure 4.2 Ground truth intrinsic image and kernels used for empiricalconvergence test in Fig. 4.3 and 4.6. . . . . . . . . . . . . . . 50Figure 4.3 Example of non-blind intrinsic image estimation. . . . . . . . 51xiFigure 4.4 Visualization of sample distribution for the non-blind exam-ples in Fig. 4.3. . . . . . . . . . . . . . . . . . . . . . . . . . 52Figure 4.5 Visualization of the residual between the ground truth and ourestimated intrinsic image at different iterations. . . . . . . . . 53Figure 4.6 Example of non-blind kernel estimation. . . . . . . . . . . . . 56Figure 4.7 Example of our blind estimation of intrinsic and kernel. . . . . 58Figure 4.8 Results on a noisy real-world image. . . . . . . . . . . . . . . 59Figure 4.9 Results on a real-world image and visual comparisons of state-of-the-art methods. . . . . . . . . . . . . . . . . . . . . . . . 60Figure 4.10 Closeup of the cloth region in the scene shown in Fig. 4.9. . . 61Figure 4.11 Results on chromatic blur kernel. . . . . . . . . . . . . . . . . 62Figure 4.12 Results on partially saturated image. . . . . . . . . . . . . . . 64Figure 4.13 Results on a synthetic image with Poisson noise. . . . . . . . 66Figure 4.14 Comparison on image #1 with kernel #3 and image #4 withkernel #4 from the dataset [57]. . . . . . . . . . . . . . . . . 69Figure 4.15 Results on synthetic data with difference levels of white Gaus-sian noise. In subfigure (a-f), the standard derivation (σ) ofthe noise is set to be 0, 0.01, 0.02, 0.03, 0.04 and 0.05 respec-tively. In each subfigure, the 1st row shows the input blurryimage, and the 2nd row shows our deblurred result. The PSNRvalues are shown at the left-top corner of each image. An insetis shown at the right-bottom corner of each image for better view. 73Figure 4.16 Results on more real data from [20] and [1]. . . . . . . . . . . 75Figure 5.1 Visual comparison between natural image, large-font text im-age and common text image. . . . . . . . . . . . . . . . . . . 77Figure 5.2 Algorithm architecture. . . . . . . . . . . . . . . . . . . . . . 80Figure 5.3 Example intermediate results of our algorithm on a synthetictest image. . . . . . . . . . . . . . . . . . . . . . . . . . . . 83Figure 5.4 Example intermediate results on a real-world test image. . . . 84Figure 5.5 Example of learned filters and shrinkage functions. . . . . . . 85Figure 5.6 Example kernels and images used at training of our algorithm. 86Figure 5.7 Comparison on a real image. . . . . . . . . . . . . . . . . . . 86xiiFigure 5.8 Comparison on a real image. . . . . . . . . . . . . . . . . . . 87Figure 5.9 PSNR and OCR comparison on a synthetic test dataset with 8blur kernels. . . . . . . . . . . . . . . . . . . . . . . . . . . . 88Figure 5.10 Comparison on synthetic images. . . . . . . . . . . . . . . . 88Figure 5.11 Comparison on non-English text and severely rotated images. 90Figure 5.12 Robustness test on noise level and image PPI. . . . . . . . . . 91Figure 5.13 Comparison on a real image with large-font text. . . . . . . . 91Figure 5.14 Results on spatially-varying blur kernels. . . . . . . . . . . . 92Figure 5.15 Result on example document containing color figures. . . . . 93Figure 6.1 Learned filters at each stage of the proximal operator. . . . . . 100Figure 6.2 Analysis of model generality on image denoising. . . . . . . . 102Figure 6.3 Results with input noise level σ=5. . . . . . . . . . . . . . . . 103Figure 6.4 Results with input noise level σ=15. . . . . . . . . . . . . . . 103Figure 6.5 Results with input noise level σ=25. . . . . . . . . . . . . . . 104Figure 6.6 Results with input noise level σ=5. . . . . . . . . . . . . . . . 105Figure 6.7 Results with input noise level σ=15. . . . . . . . . . . . . . . 106Figure 6.8 Results with input noise level σ=25. . . . . . . . . . . . . . . 107Figure 6.9 Our results with different fidelity weight for non-blind decon-volution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109Figure 6.10 Results on images with blur kernel #1 in the dataset. . . . . . 110Figure 6.11 Results on non-blind deconvolution with Gaussian blur. . . . . 111Figure 6.12 Results on non-blind deconvolution with Gaussian blur. . . . . 112Figure 6.13 Experiment on incorporating non-local patch similarity priorwith our model after being trained. . . . . . . . . . . . . . . . 114Figure 6.14 Experiment on incorporating a color prior with our model afterbeing trained. . . . . . . . . . . . . . . . . . . . . . . . . . . 115Figure 6.15 Experiment on joint denoising and inpainting task. . . . . . . 117Figure 6.16 Results at each HQS iteration of our method on image denois-ing with noise level σ = 25. . . . . . . . . . . . . . . . . . . 118Figure 6.17 Results at each HQS iteration of our method on non-blind de-convolution. . . . . . . . . . . . . . . . . . . . . . . . . . . . 119xiiiGlossaryADMM Alternating Direction Method of MultipliersARF Active Random FieldBCCB Block Circulant matrix with Circulant BlocksBM3D Block Matching and 3D FilteringCGD Conjugate Gradient DescentCNN Convolutional Neural NetworkCSC Convolutional Sparse CodingCSF Cascaded Shrinkage FieldsEFF Efficient Filter FlowEPLL Expected Patch Log LikelihoodFoE Field-of-ExpertsGD Gradient DescentGMM Gaussian Mixture ModelsHQS Half Quadratic Splitting methodKSVD K-Singular Value DecompositionLSQ Least SquaresxivLSSC Learned Simultaneous Sparse CodingNLM Non-Local MeanOCR Optical Character Recognitionopt-MRF Optimized Markov Random FieldsPD First Order Primal Dual methodPMD Photonic Mixer DevicePPI Pixel Per InchPSNR Peak Signal-to-Noise RatioRTF Regression Tree FieldSCD Stochastic Coordinate DescentSF Shrinkage FieldSNR Signal-to-Noise RatioTGV Total Generalized VariationToF Time-of-FlightTRD Trainable Nonlinear Reaction DiffusionTV Total VariationWNNM Weighted Nuclear Norm MinimizationxvAcknowledgmentsFirst and foremost, I would like to thank my advisor Wolfgang Heidrich for hisguidance and support throughout my time at University of British Columbia.Thanks to my thesis and RPE committee members Uri Ascher, Jim Little andRobert J. Woodham for their timely feedback and advice over the years.Thanks to all my research collaborators: Felix Heide, James Gregson, MatthewO’Toole, Matthias B. Hullin, Andreas Kolb, Kiriakos N. Kutulakos, Jue Wang,Michael Hirsch and Bernhard Scho¨lkopf.xviDedicationTo my parents and Yuna, for their love and support.xviiChapter 1IntroductionImage restoration aims at computationally enhancing the quality of images by un-doing the adverse effects of image degradation such as noise and blur. As a keyarea of image and signal processing it is an extremely well studied problem anda plethora of methods exists, see for example [74] for a recent survey. Restora-tion tasks, such as denoising, deblurring, demosaicing and resolution enhancement,have to be addressed as part of most imaging and machine vision systems.Although tremendous efforts have been dedicated to this topic, image restora-tion is still a very active research area due to the following major reasons. Froma Bayesian perspective, solving restoration tasks as statistical estimation problemsdoes not only require physically-motivated models for the data likelihood, but re-lies on effective prior knowledge on the latent images, a.k.a. priors or regularizers,as a key component. Good image priors are still an active area of investigation.While traditional methods focus on local image statistics and aim at maintain-ing edges such as Total Variation (TV) [88], bilateral filter [102] and anisotropicdiffusion [105], more recent methods exploit the non-local statistics of images[2, 22, 25, 37, 71, 101]. In particular, the highly successful Block Matching and3D Filtering (BM3D) method [22] searches for similar patches within the imageand combines them through a collaborative filtering step.On the other hand, efficiently solving restoration problems with the priors isimportant as well. Image restoration tasks are typically formulated as inverseproblems, where the latent images are estimated by solving corresponding nu-1merical optimization problems. Significant progress has been made recently inoptimization algorithms. Representative techniques such as Half Quadratic Split-ting method (HQS) [30], Split Bregman [35], Alternating Direction Method ofMultipliers (ADMM) [79], First Order Primal Dual method (PD) [15], etc, havebeen widely used in image restoration and significantly improved the restorationperformance both in time and quality. In spite of this, as the rise of mobile imagingsystems such as smartphone cameras and autonomous vehicles, developing evenmore power- and time-efficient optimization techniques are still highly desirable.Furthermore, through the successful application of machine learning and data-driven approaches, image restoration has seen revived interest and significant progressmore recently. Broadly speaking, recently proposed state-of-the-art data-drivenmethods can be grouped into two classes: generative approaches that aim at prob-abilistic models of undegraded images and discriminative approaches that try tolearn a direct mapping from degraded to clean images. Generative methods seekto learn probabilistic models of undegraded images. A simple, yet powerful sub-class include models that approximate the sparse gradient distribution of naturalimages [58, 59, 68]. More expressive generative models include Field-of-Experts(FoE) [86], K-Singular Value Decomposition (KSVD) [26] and Expected PatchLog Likelihood (EPLL) [118]. Discriminative models have become increasinglypopular for image restoration due to their attractive tradeoff between high imagerestoration quality and computational efficiency. Methods include trainable ran-dom field models such as Cascaded Shrinkage Fields (CSF) [92], Regression TreeField (RTF) [51], Trainable Nonlinear Reaction Diffusion (TRD) [18], as well asdeep convolutional networks [50] and other multi-layer perceptrons [13]. Discrim-inative approaches achieve great computational efficiency at run-time by defininga particular feed-forward structure whose trainable parameters are optimized fora particular task during training. Those learned parameters are then kept fixed atrun-time resulting in a fixed computational cost.This thesis presents our researches on above topics in image restoration anddemonstrate applications for Time-of-Flight (ToF) imaging and digital photogra-phy, as detailed below.21.1 Time-of-flight depth imagingFast and high-quality depth-sensing cameras are highly desirable in mobile robotics,human-machine interfaces, quality control and inspection, and advanced automo-tive applications. Among the wide variety of depth-sensing technologies available(depth from stereo, structured lighting, Lidar scanning, etc), continuous-wave ToFcameras have emerged as an efficient, low-cost, compact, and versatile depth imag-ing solution, such as Microsoft Kinect-2, Photonic Mixer Device (PMD), Swiss-Ranger, etc.The active light source required to produce these ToF images presents signif-icant drawbacks, however. To create a ToF image with high Signal-to-Noise Ra-tio (SNR), the light signal must be sufficiently intense to overcome sensor noise andquantization effects. Factors determining the signal strength include light sourcepower, integration time, imaging range, and lens aperture. In practice, light poweris often limited for eye safety and energy considerations, and the integration timemust be short enough to allow real-time operation. Consequently, ToF systemsideally would need to use imaging optics with large numerical aperture to makebetter use of available light. However, this would come at a cost; large apertureshave a shallow depth of field and hence introduce defocus blur in the raw ToF im-ages. Another shortcoming is the limited spatial resolution of currently availableToF sensors. Most commercial ToF cameras have fewer than 0.04 megapixels.Due to the non-linear image formation model of ToF cameras, such depth offield blur and low resolution present a significant problem for ToF cameras, gener-ating artifacts such as “flying pixels” around depth discontinuities as well as lossof texture detail.Chapter 3 addresses this problem by introducing a new computational methodto simultaneously remove the defocus blur and increase the resolution of off-the-shelf ToF cameras post capture. The method directly works on the raw sensor dataand solves the deblurring and superresolution problem in a principled way. Un-like previous ToF deblurring techniques, our approach applies regularizers directlyto the latent intensity and depth images, and supports deblurring ToF images cap-tured with multiple modulation frequencies, phases or exposures. This work waspublished in [110].31.2 Blind deblurring of natural imagesTaking photographs has become increasingly common due to the popularity ofmobile cameras in recent years. However, photos taken by hand-held camerasare likely to suffer from motion blur caused by camera shake during exposure,especially in low-light environment. The blur in an image can prevent humans fromresolving scene details and make machine vision algorithms (object recognition,segmentation, etc) fail. Therefore, removing such blur and recovering sharp imagesas a post-process is critical.When the camera motion trajectories are pre-known, the problem to estimatethe sharp latent images is called non-blind deblurring. However, in most practicalcases these information are unknown, and need to be estimated together with thesharp images. This more challenging problem is called blind deblurring. Solv-ing blind deblurring typically requires good prior knowledge of the latent imageand blur. While good priors for both the images and the blur kernels are still anactive area of investigation, many choices that have been proposed are non-linearand often even non-convex. This makes it difficult and time consuming to exper-iment with different image priors, since each new candidate typically requires acustomized optimization procedure that can require a significant effort to imple-ment.Chapter 4 presents a blind deblurring method that is purely based on simplestochastic sampling. The method relies entirely on local evaluations of the ob-jective function, without the need to compute gradient information. This makes iteffortless to implement and test new image and kernel priors. We demonstrate suchstochastic optimization solver for a variety of non-convex and non-smooth priorsand likelihood. In combination with different image and kernel priors, the methodproduces results that match or exceed the results obtained by much more com-plex state-of-the-art algorithms, which typically require customized optimizationsolvers. This work was published in [112].1.3 Blind deblurring of document photographsTaking photographs of text documents (printed articles, receipts, newspapers, books,etc) instead of scanning them has become quite common recently. The motion blur4caused by camera shake is critical for document photographs taken by hand-heldcameras, as slight blur can prevent existing Optical Character Recognition (OCR)techniques from extracting correct text from them. Removing blur and recoveringsharp, legible document images is thus highly desirable.Recent text image deblurring methods use sparse gradient priors (e.g., totalvariation [16], `0 gradient [19, 78]) and text-specific priors (e.g., text classifier [19],`0 intensity [78]) for sharp latent image estimation. These methods can producehigh-quality results in many cases, however their practical adaptation is hamperedby several drawbacks. Firstly, their use of sparse gradient priors usually forces therecovered image to be piece-wise constant. Although these priors are effective forimages with large-font text (i.e., high Pixel Per Inch (PPI)), they do not work wellfor photographs of common text documents such as printed articles and newspaperswhere the font sizes are typically small [48]. Furthermore, these methods employiterative sparse optimization techniques that are usually time-consuming for highresolution images taken by modern cameras.Chapter 5 presents a new algorithm for practical document deblurring thatachieves both high quality and high efficiency. Observing that document imagesare usually dominated by small-scale high-order structures, the algorithm learns aseries of scale- and iteration-wise high-order filters to capture the domain-specificproperty of document photographs. The method uses a discriminative learning ap-proach to train such filters and other parameters of a feed-forward network thattakes a single blurry document photograph as input and produces high-quality la-tent image and blur kernel rapidly. This work was published in [111].1.4 Learning proximal operators for general imagerestorationState-of-the-art models such as FoE [86], EPLL [118], Weighted Nuclear NormMinimization (WNNM) [37] etc, are generic in the sense that they can be ap-plied for various restoration tasks. However, the resulting iterative optimizationproblems are prohibitively expensive, rendering them impractical for applicationsthat require real-time processes and for use on mobile vision systems. Recently,a number of works [18, 87, 92] have addressed this issue by truncating the iter-5ative optimization and learning discriminative image priors, tailored to the like-lihood and optimization approach. While these methods allow one to trade-offquality with the computational budget for a given application, the learned priorsare highly specialized to the image formation and noise parameters, in contrastto optimization-based approaches. Since each individual problem instantiation re-quires costly learning and storing of the model coefficients, current proposals forlearned priors are impractical for vision applications with dynamically changing(often continuous) parameters. This is a common scenario in most real-world vi-sion settings, as well as applications in engineering and scientific imaging that relyon the ability rapidly prototype methods.Chapter 6 presents an algorithm that combines discriminative learned modelswith formal optimization methods to learn generic priors that truly share acrossproblem domains. Using proximal optimization methods [10, 30, 79] allows usto decouple the likelihood and prior which is key to learning such shared mod-els. It also means that we can rely on well-researched physically-motivated modelsfor the likelihood, while learning priors from example data. By learning general-ized proximal mappings as a prior model, our approach is computationally cheapwhile being general. We verify our technique using the same model for a varietyof diverse low-level image reconstruction tasks and problem conditions, demon-strating the effectiveness of our approach. Benefiting from the proximal splittingtechniques, our approach can naturally be combined with existing state-of-the-artpriors after being trained to further improve the reconstruction quality. This workwas published in [109].1.5 OrganizationThe remaining part of this thesis is organized as follows: Chapter 2 reviews back-ground knowledge and previous work; Chapter 3, 4, 5 and 6 present our methodsand results for each individual topic; and Chapter 7 concludes this thesis work anddiscusses potential research directions for future work.6Chapter 2Background and Related WorkLet vector b be the observed image, vector x be the latent (desired) image andfunction f be the sensing operator, the image formation process is represented asin Eq. 2.1.b = f(x) (2.1)The sensing operator f differs for each imaging task. For denoising task, f(x) =x + n where n is the noise. For inpainting and demosaicing tasks, f(x) = a x + n where vector a is a binary mask representing the measurement pattern,and  represents element-wise multiplication. For non-blind deconvolution tasks,f(x) = k ⊗ x + n where vector k is the blur kernel, and ⊗ represents two-dimensional (2D) convolution between x and k. The sensing operators f are linearfor these tasks, thus the image formation process can be re-written as:b = Ax + n (2.2)where matrix A represents the sensing operation. Specifically, for denoising taskA is an identity matrix, for inpainting and demosaicing A is a binary diagonalmatrix with diagonal element a, and for non-blind deconvolution task A is a BlockCirculant matrix with Circulant Blocks (BCCB) that represents 2D convolutionwith k. For other imaging tasks such as time-of-flight depth imaging and phaseretrieval, the sensing operators f are non-linear and more complex. The time-of-7flight depth imaging task is discussed in Section 2.4.Image restoration tasks are typically formulated as inverse problems in ad-vanced methods. As the original problems are typically ill-posed, prior knowledgeof the latent images are required as regularization. The general form of such regu-larized inverse problems is given in Eq. 2.3.x = argminxλ2||b−Ax||22 +R(x), (2.3)where the first squared term in the objective is the data fidelity (likelihood), R(x)is the image prior, and λ is a positive scalar controlling the relative weight betweenthe data fidelity and prior in the objective. The well-known TV prior [88], forexample, is formulated as R(x) = ||∇x||1 where ∇ computes the derivatives ofimage x, and the `1 norm promotes sparsity of∇x.The remaining part of this chapter is organized as follows: Section 2.1 reviewsprevious work on image restoration in general; Section 2.2 and 2.3 review previouswork on blind deconvolution of natural images and document images respectively;Section 2.4 reviews related work on time-of-flight deblurring and enhancement;and Section 2.5 reviews several advanced numerical optimization methods that arefrequently used in literature for image restoration.2.1 General image restorationBroadly speaking, recently proposed state-of-the-art image restoration methodscan be grouped into three classes: classical approaches that make no explicit useof machine learning, generative approaches that aim at probabilistic models of un-degraded natural images and discriminative approaches that try to learn a directmapping from degraded to latent images. Unlike classical methods, methods be-longing to the latter two classes depend on the availability of training data. Wereview each class in the following sections.2.1.1 Classical modelsTraditional models focus on local image statistics and aim at maintaining edgessuch as TV [88], anisotropic diffusion models [105], and bilateral filter [102]. More8recent methods exploit the non-local statistics of images such as Non-Local Mean(NLM) [2], BM3D [22], non-local variants of sparse representation methods [25,71], WNNM [37], and [101].In particular, the seminal work of bilateral filter [102] and NLM [2] can be an-alyzed within the same scheme for image filtering that is based on pixel similarity,as given in Eq.2.4.xˆ(i) =1Ci∑j∈Ωiw(i, j) · x(j)s.t., Ci =∑j∈Ωiw(i, j),(2.4)where xˆ is the estimated (filtered) image from input x, i, j are pixel indices, Ωirepresents the neighbor region of pixel x(i), scalar weight w(i, j) represents thecontribution of pixel x(j) to the estimated pixel xˆ(i), and scalar Ci is the normal-ization weight. The weight w(i, j) is defined on the similarity between pixel x(i)and x(j), that is, the more similar pixel x(j) is to pixel x(i), the more contributionis from pixel j when estimating xˆ(i).With this scheme, we give the representation of Gaussian filter, bilateral filterand NLM in Eq.2.5,2.6 and 2.7 respectively.(Gaussian filter) xˆ(i) =1Ci∑j∈Ωie−|i−j|2σ2 · x(j) (2.5)(bilateral filter) xˆ(i) =1Ci∑j∈Ωie− |i−j|2σ2v− |x(i)−x(j)|2σ2s · x(j) (2.6)(non-local mean) xˆ(i) =1Ci∑j∈xe−||α(x(Ni)−x(Nj))||22σ2 · x(j) (2.7)The Gaussian filter (Eq.2.5) defines the pixel similarity on their spatial distance|i− j|2. The bilateral filter (Eq.2.6) defines the pixel similarity on both the spatialdistance and the intensity difference, thus can preserve edges while reducing noisein the image. NLM (Eq.2.7) defines the similarity between pixels i, j on the sim-ilarity between patches x(Ni), x(Nj) centered at x(i) and x(j) respectively, andthe pixel x(j) can be from any region in the image x rather than only from the local9neighborhood of pixel i as in previous methods. α in Eq.2.7 is a pre-defined Gaus-sian kernel that gives more weight at central regions of each patch. NLM buildson the fundamental observation that similar patches often can be found within animage, and this idea has stimulated many follow-up extensions.BM3D method extends the non-local similarity idea which searches for simi-lar patches across the image as in NLM, but combines them through collaborativepatch-filtering steps rather than simple pixel averaging. The non-local sparse rep-resentation methods [25, 71] explores the patch similarity idea as well while en-forces the similar patches to have similar coefficients in the transformed domain.WNNM [37] filters similar patches within the image by applying low-rank con-straints on singular value decomposition.2.1.2 Generative learning modelsMethods of this class seek to learn probabilistic models of undegraded natural im-ages. A simple yet powerful subclass include models that approximate the sparsegradient distribution of natural images [58, 59, 68]. The `p-norm (0 < p < 1) onimage derivative has been shown as an effective prior to enforce the heavy-taileddistribution of the gradient of natural images [58, 68].More expressive generative models include KSVD [26], Convolutional SparseCoding (CSC) [12, 41, 108], FoE [86] and EPLL [118]. Both the KSVD and CSCmethods assume small patches in an image can be approximated by a linear com-bination of a few atoms from an overcomplete dictionary D that is learned fromtraining data. The dictionary D consists of K atoms {d1,d2, ...,dK}, ||dk||22 ≤ 1.The problem KSVD aims to solve is given in Eq.2.8.α = argminα||α||0, s.t. x =K∑k=1αkdk, (2.8)where x is a small patch from the image, andα = {α1,α2, ...,αK} are the coeffi-cients. The `0-norm enforces the coefficientsα to be sparse. As KSVD operates oneach image patch separately, it ignores the coherence between patches across theimage, and the resulting dictionary D typically presents redundancy. To addressthis issue, CSC methods [12, 41, 108] are recently proposed where the dictionary10and coefficients are learned on the entire images rather than patches. The problemthat CSC methods solve is given in Eq.2.9.α = argminα||x−K∑k=1dk⊗αk||22 + λ||α||1, (2.9)where the image x is approximated by a linear combination of K atoms dk eachconvolved with a sparse coefficient map αk. Each αk has the same support as theimage x.KSVD and CSC are categorized as synthesis-operator methods. The counter-part are the analysis-operator methods, of which a representative work is the FoEmodel as given in Eq.2.10.x = argminxλ2||b−Ax||22 +N∑i=1φi(Fix), (2.10)where matrix Fi represents 2D convolution with filter fi, and function φi representsthe penalty on corresponding filter response Fix. Both the filters fi and penaltyfunctions φi are learned from training data. The well-known TV regularizer canbe viewed as a special case of the FoE model where fi is the derivative operator ∇and φi the `1 norm.EPLL [118] models image patches through Gaussian Mixture Models (GMM)and applies such patch prior to the whole image by HQS optimization technique.All of these methods have in common to be agnostic to the image restoration task,i.e., they can be used for any image degradation and can be combined with anylikelihood and additional priors at test time.2.1.3 Discriminative learning modelsDiscriminative models have become increasingly popular for image restoration re-cently due to their attractive tradeoff between high image restoration quality andefficiency at test time. Discriminative approaches owe their computational effi-ciency at run-time by defining a particular feed-forward structure whose trainableparameters are optimized for a particular task during training. Those learned pa-rameters are then kept fixed at run-time resulting in a fixed computational cost.11Representative methods in this class include trainable random field modelssuch as RTF [51], CSF [92], TRD [18], as well as deep convolutional networks [50]and other multi-layer perceptrons [13]. Both CSF and TRD are derived from theFoE model (given in Eq.2.10) by unrolling corresponding optimization iterations ofEq.2.10 to be feed-forward networks. All parameters of the network are trained byminimizing the error between its output images and ground truth images. Specif-ically, CSF and TRD unroll HQS and Gradient Descent (GD) iterations that solveEq.2.10, respectively. More details of these methods can be found in Chapter 5 andChapter 6.The downside of discriminative models is that they cannot generalize acrosstasks and typically necessitate separate feed-forward architectures and separatetraining for each restoration task (denoising, demosaicing, deblurring, etc) and ev-ery possible image degradation (noise level, Bayer pattern, blur kernel, etc). InChapter 6, we try to address this issue.2.2 Natural image deblurringIn this section we review previous deblurring work on natural images. As mostblind deblurring methods alternatively estimate the blur kernel and the sharp la-tent image as non-blind subproblems, we first review the state-of-the-art non-blinddeblurring methods in Section 2.2.1 before the blind methods in Section 2.2.2. Al-though the non-blind deblurring is typically an important component of the blindmethods, note that, the priors and optimization methods developed for high-qualitynon-blind deblurring may not be effective for the blind case.2.2.1 Non-blind deconvolutionThe classic Wiener deconvolution method [107] uses an inverse filter in Fourierdomain by assuming the image and noise power spectra are known, and the re-sults typically suffer from overly smoothed edges and ringing artifacts. The TVprior, which promotes piece-wise constant images, has become very popular inimaging problems since it was first introduced in [88]. The heavy-tailed distri-bution of natural image gradients is another effective regularization for image de-blurring [27, 65]. Several non-local priors were developed to model the patch12recurrence across an image and showed significant improvement on image deblur-ring [21, 23, 24, 54]. [94] first removes the blurriness in the image with a simpleconstrained least squares minimization and then trains a neural network to removethe noise and ringing artifacts in the result image from the previous step. Thismethod needs to train separate neural network for each blur kernel, which limitsits application in practice. More recently, several methods were proposed to usetrainable random field models for image restoration including deblurring [92, 93].The CSF method [92] reduces the optimization problem of random field modelsinto cascaded quadratic minimization problems that can be efficiently solved inFourier domain.2.2.2 Blind deconvolutionWhen the blur kernel is not readily available, the problem becomes much morechallenging especially for single image input, where both the latent sharp image xand blur kernel k need to be recovered, as shown in Eq. 2.11.(x,k) = argminx,kλ2||b− k⊗ x||22 +R(x) + T (k), (2.11)where T (k) represents priors on the blur kernel. Typically k is assumed to benon-negative and sum up to one, i.e., k ≥ 0 and ||k||1 = 1. This is an ill-posedproblem to solve. It is highly under-constrained and non-convex, and is subject tomany degenerate solutions, including one where the estimated kernel is simply aDirac peak and the latent image is the blurry input.In order to make the non-convex optimization converge to a good local opti-mum, the regularizers should be discriminatively designed for the blind problem,while directly applying the image priors from non-blind deblurring research (e.g.,TV) usually works poorly especially when k is large or complex. Another, the de-signed regularizers should be efficiently solvable by the optimization method forpractical use.Recent success arises from the use of sparse priors and multi-scale scheme.Fergus et al [27] fits the heavy-tailed prior by a mixture of Gaussians and solvesthe intrinsic image gradient and blur kernel by a variational Bayesian method [75].13Richardson-Lucy algorithm [70, 84] is then applied to reconstruct the final intrinsicimage with the estimated kernel. Krishnan et al [59] introduced a scale-invariant`1/`2 prior, which compensates for the attenuation of high frequencies in the blurryimage. Xu et al [114] used the `0 regularizer on the image gradient. Goldstein etal [34] estimated the kernel from the power spectrum of the blurred image. Yue etal [115] improved [34] by fusing it with sparse gradient prior. Sun et al [99]imposed patch priors to recover good partial latent images for kernel estimation.Michaeli and Irani [73] exploited the recurrence of small image patches across dif-ferent scales of single natural images. Anwar et al [5] learned a class-specific priorof image frequency spectrum for the restoration of frequencies that cannot be re-covered with generic priors. Zuo et al [119] learned iteration-wise parameters ofthe `p regularizer on image gradients. Schelten et al [90] trained cascaded inter-leaved RTF [93] to post-improve the result of other blind deblurring methods fornatural images.Another type of methods employs filtering approaches to extract strong im-age edges from which kernels may be estimated rapidly. Cho et al [20] adoptedshock filter [77] and bilateral filter [102] to predict sharp edges. Xu et al [113]improved [20] by neglecting edges with small spatial support as they impede ker-nel estimation. Schuler et al [95] learned such nonlinear filters with a multi-layerconvolutional neural network.In case of highly noisy input, Tai et al [100] proposed a method for jointlydenoising and deblurring the image. Zhong et al [116] applied directional low-pass filters at different orientations to the input image and estimated the Radontransform of the blur kernel from each filtered image, while the final blur kernel iscomputed by inverse Radon transform.For non-uniform blur due to camera rotation, Whyte et al [106] proposed a pa-rameterized geometric model of the blurring process considering the rotational ve-locity of the camera during exposure. Gupta et al [38] modeled the spatially vary-ing kernels by a motion density function which records the fraction of time spent ineach discretized portion of the space by the camera during exposure. Harmeling etal [40] and Hirsch et al [46] combined the global camera motion model and localpatch uniform deblurring to accelerate the non-uniform kernel estimation.142.3 Document photograph deblurringIn this section we review previous deblurring work on text document photographs.Most recent methods of text deblurring use the same sparse gradient assumptiondeveloped for natural images, and augment it with additional text-specific regular-ization. Chen et al [16] and Cho et al [19] applied explicit text pixel segmentationand enforced the text pixels to be dark or have similar colors. Pan et al [78] used`0 regularized intensity and gradient priors for text deblurring. The use of sparsegradient priors makes such methods work well for large-font text images, but failon common document images that have smaller fonts.Hradisˇ et al [48] trained a convolutional neural network to directly predictthe sharp patch from a small blurry one, without considering the image forma-tion model and explicit blur kernel estimation. With a large enough model andtraining dataset, this method produces good results on English documents with se-vere noise, large defocus blurs or simple motion blur. However, this method failson more complicated motion trajectories, and is sensitive to page orientation, fontstyle and text languages. Furthermore, this method often produces “hallucinated”characters or words that appear to be sharp and natural in the output image, but arecompletely wrong semantically. This undesirable side-effect severely limits its ap-plication range as most users do not expect the text to be changed in the deblurringprocess.2.4 Time-of-flight imaging2.4.1 Imaging principleWe explain the principle of ToF depth imaging with Fig.2.1. The active light sourceemits modulated light wave with frequency f , i.e., s(t) = sin(2pift), where tdenotes time. The light reflected from the object surface with reflectance α falls onthe sensor with a phase delay φ, i.e., r(t) = α sin(2pift − φ). The phase delay φrelates to the depth z of the surface point by Eq.2.12:z =cφ4pif(2.12)15		න ݀ݐ்௧ୀ଴		න ݀ݐ்௧ୀ଴light source ݏ ݐ ൌ ݏ݅݊ 2ߨ݂ݐobjectݎ ݐ ൌ ߙ	sin 2ߨ݂ݐ െ ߶lensreceived lightsin 2ߨ݂ݐܿ݋ݏ 2ߨ݂ݐreference signalreference signalmeasurement p = ߙ ܿ݋ݏ െ߶ /2q =ߙ ݏ݅݊ െ߶ /2measurementFigure 2.1: Principle of ToF imaging.where c is the light speed. In the sensor, the received signal r(t) is further modu-lated with two reference signals sin(2pift) and cos(2pift), and the resulting mod-ulated signals are integrated over time duration 0−T to generate the ToF measure-ments p = α cos(−φ)/2 and q = α sin(−φ)/2. With these two measurements, thephase φ and the reflectance α can be calculated as:φ = arctan(−qp) (2.13)α = 2√p2 + q2, (2.14)while the depth of the object surface point can be calculated by Eq.2.12. Note thatby Euler’s formula, the two ToF measurements p, q can be viewed as the real andimaginary part of the complex value αe−iφ/2, that is, ae−i(4pifc·z), where a = α/2.16The complex measurements for all sensor pixels are represented asa ◦ e−i( 4pifc ·z), (2.15)where a and z represent the amplitude and depth map respectively, and ◦ representsHadamard product (pixel-wise multiplication) of vectors.2.4.2 ToF defocus deblurringMost existing ToF enhancement methods take as input the naive depth map fromthe camera software, rather than the raw complex-valued measurements. Lidar-Boost [96] and KinectFusion [49] utilized captures from multiple viewpoints toincrease the depth resolution. Zhu et al [117] explored the complementary char-acteristics of ToF and stereo geometry methods and combined them to producebetter-quality depths. Other methods ([28, 53, 80]) used a high-resolution RGB orintensity image to guide the upsampling of the low-resolution depth map, basedon the assumption about the co-occurrence of image discontinuities in RGB anddepth data. These methods assume the scenes are all in-focus, while in this paperwe deal with the scenes degraded by defocus blur. Furthermore, these methodsrequire additional hardware or multi-view captures, while our method uses single-view captures from a single ToF sensor.Godbaz et al [32] proposed a two-stage method for parametric blind deconvo-lution of full-field continuous-wave Lidar imaging. They estimate the lens parame-ters from a pair of Lidar measurements taken at different aperture settings, and thendeconvolve these complex-domain measurements, from which the final depth mapis computed. Godbaz et al [31] applied the coded aperture technique to extend thedepth of field for full-field continuous-wave Lidar imaging. The complex-domainLidar measurement is iteratively deconvolved with a simple Gaussian derivativeprior, while at each iteration the blur kernel of each pixel is updated according tothe currently estimated Lidar image. These two methods are close to ours in thesense of directly working on the raw measurements. In contrast to these methods,which aim to deblur the complex measurements, our approach directly estimatesthe latent amplitude and depth from the degraded measurements. This allows usto apply separate regularizations on the amplitude and depth, and also supports for17the next generation ToF cameras with multiple modulation frequencies, phases andexposures [33, 55].Single-frequency ToF cameras have limited unambiguous distance range. Ob-jects separated by the integer multiples of the full range are indistinguishable. Thenext generation of ToF cameras use multiple modulation frequencies and phasesto reduce the ambiguity [33, 55]. The multi-frequency/phase data can also helpresolve “flying pixels” (mixtures of foreground and background depth) at the ob-ject boundaries, and suppress artifacts due to global illumination [29]. The ToFdata captured with single exposure could be noisy or saturated due to scene prop-erties such as surface materials and reflectivity. Multiple exposures are proposedto increase the dynamic range of the measurements and remove those unreliablepixels [33, 39]. Our algorithm adapts well to these cameras, since it directly es-timates the latent amplitude and depth from raw measurements that could comefrom multiple sequential captures.2.5 Proximal optimization techniquesNumerical optimization methods play important roles in image restoration. Re-cently, a class of algorithms, called proximal algorithms, have attracted widespreadinterest in computational imaging and image processing as they are very generallyapplicable and well-suited to large-scale problems [79]. In proximal algorithms,the basic operation is evaluating the proximal operator of a function, which in-volves a small optimization problem and possibly has a closed-form solution. Inthe remaining part of this section, we present the definition of the proximal oper-ators and review two specific proximal algorithms (HQS, ADMM) that are widelyused in recent work.2.5.1 Proximal operatorsGiven function f : Rn → R, its proximal operator is defined asproxf/λ(v) = argminxf(x) +λ2||x− v||22 (2.16)proxf/λ(v) can be interpreted as a point in domain Rn that compromises be-18tween minimizing the function f and being close to point v. The positive scalarλ can be interpreted as the trade-off parameter. More details of proximal opera-tors can be found in [79]. Below we present several representative functions andcorresponding proximal operators.When f is an indicator function,f = IC(x) ={0,x ∈ C+∞,x /∈ C, proxf/λ(v) = argminx∈C||x− v||22 (2.17)When f is the `0 norm, its proximal operator is called hard-shrinkage:f = ||x||0, proxf/λ(v) ={0, |v| ≤ 1/λv, otherwise(2.18)When f is the `1 norm, its proximal operator is called soft-shrinkage:f = ||x||1, proxf/λ(v) =v + 1/λ, v < −1/λ0, |v| ≤ 1/λv − 1/λ, v > 1/λ(2.19)2.5.2 Half quadratic splittingIn this section, we explain the HQS algorithm. For simplicity purposes, we takethe widely used TV prior as an example. Eq. 2.20 gives the objective function tominimize.λ2||b−Ax||22 + ||∇x||1 (2.20)The objective function in Eq. 2.20 is non-smooth and difficult to solve withtraditional optimization methods such as Newton’s method. HQS [30] relaxes theoriginal objective in Eq. 2.20 to be:λ2||b−Ax||22 + ||z||1 + ρ||z−∇x||22 (2.21)19where a slack variable z is introduced to approximate x, and ρ is a positive scalar.Eq. 2.21 is iteratively minimized by solving the latent image x and the slack vari-able z alternately, as given in Eq. 2.22 and 2.23.xt = argminxλ||b−Ax||22 + ρt||zt−1 −∇x||22 (2.22)zt = argminz||z||1 + ρt||z−∇xt||22 (2.23)Importantly, ρt increases as the iteration t continues. The latter forces z to becomean increasingly good approximation of x, thus making Eq. 2.21 an increasinglygood proxy for Eq. 2.20.The latent image x update step in Eq. 2.22 is a linear least squares problem,which can be efficiently solved by Conjugate Gradient Descent (CGD), or hasclosed-form solution in Fourier domain when the sensing matrix A is identity atdenoising task or BCCB at deconvolution task as given in Eq. 2.24.xt = F−1( F(λATb + ρt∇Tzt)F(λATA + ρt∇T∇)), (2.24)where F and F−1 represent Fourier and inverse Fourier transform respectively.The slack variable z update step in Eq. 2.23 is pixel-wise separable and has aclosed-form solution with the proximal operator soft-shrinkage:zt = soft-shrinkage(∇xt, 0.5/ρt) =∇xt + 0.5/ρt, ∇xt < −0.5/ρt0, |∇xt| ≤ 0.5/ρt∇xt − 0.5/ρt, ∇xt > 0.5/ρt(2.25)The soft-shrinkage operator is introduced in Eq.2.19.HQS typically requires only a few iterations to converge with proper setting ofρt in practice.202.5.3 Alternating direction method of multipliersADMM [9, 79] relaxes the original objective in Eq. 2.20 to be:x = argminxλ2||b−Ax||22 + ||z||1 + ρ||z−∇x + u||22 (2.26)where slack variables z and u are introduced. Eq. 2.26 is iteratively minimizedby solving for the latent image x, slack variable z and u alternately, as givenin Eq. 2.27, 2.28 and 2.29.xt = argminxλ||b−Ax||22 + ρ||zt−1 −∇x + ut−1||22 (2.27)zt = argminz||z||1 + ρ||z−∇xt + ut−1||22 (2.28)ut = ut−1 + zt −∇xt (2.29)where in contrast to HQS method, the scalar ρ does not need to be increased withthe iteration t. ADMM typically requires tens of iterations to converge in practice.21Chapter 3Time-of-Flight DefocusDeblurring and Superresolution3.1 IntroductionContinuous-wave ToF cameras show great promise as low-cost depth image sen-sors in mobile applications. However, they also suffer from several challenges,including limited illumination intensity, which mandates the use of large numeri-cal aperture lenses, and thus results in a shallow depth of field, making it difficultto capture scenes with large variations in depth. Another shortcoming is the limitedspatial resolution of currently available ToF sensors.In this chapter, we address this problem by introducing a new computationalmethod to simultaneously remove defocus blur and increase the resolution of off-the-shelf ToF cameras. We do this by solving a semi-blind deconvolution problem,where prior knowledge of the blur kernel is available. Unlike past ToF deblurringtechniques, our approach applies sparse regularizers directly to the latent ampli-tude and depth images, and supports deblurring ToF images captured with multiplefrequencies, phases or exposures.Continuous-wave ToF sensors are designed to have an image formation modelthat is linear in amplitude a, but non-linear in depth z, such that the captured raw22(a) Amplitude a (b) Depth z (c) HSV mapFigure 3.1: Simulated Buddha scene. The HSV map uses amplitude a asvalue and depth z as hue. The value and hue visualizes the magnitudeand scaled phase of the complex-valued a◦g(z) in Eq.3.1, respectively.sensor data is given asa ◦ g(z) ≈ a ◦ e−i( 4pifc ·z), (3.1)where ◦ represents Hadamard product, f represents the frequency of the continuous-wave modulation, and c is the constant speed of light. The function g(z) can eitherbe calibrated [42], or, more commonly, is simply approximated by the complex-valued function from Eq. 3.1 (“cosine model” [33]). The derivation of the imageformation model is given in Section 2.4. Fig. 3.1 shows a simulated scene withvisualization.We aim to compute a solution to the following ill-posed inverse problem intro-duced in [31]:b = SK(z) (a ◦ g(z)) , (3.2)where the complex-valued vector b represents the raw ToF measurements, the real-valued matrix S is a downsampling operator, and the real-valued matrix K(z) rep-23resents the spatially-varying blur kernel for a given depth map z. The problem isill-posed because the matrix SK(z) is usually not invertible, and semi-blind be-cause the matrix K(z) is known at each depth z. In past work, S is assumed to bethe identity matrix.It is important to note that Eq.3.2 becomes the conventional image deblurringproblem when f = 0, such that a ◦ g(z) = a. Estimating the amount of defocusblur from a blurry amplitude map a is a particularly challenging problem in thiscase, requiring either multiple photos or specialized optics [68]. Unlike conven-tional cameras, ToF cameras provide additional depth information that can be usedto recover the defocus blur kernel much more robustly [31].Our method focuses on solving Eq.3.2 to recover the deblurred amplitude anddepth maps from a single blurry image, captured with an off-the-shelf ToF cam-era. Because this inverse problem is still an ill-conditioned problem, it’s criticalto choose appropriate regularizers to reflect prior information on the solution (i.e.,sparse edges). Godbaz et al [31] proposed differential priors that operate on thecomplex ToF image representing the cosine model, but it remains unclear what agood regularizer should even look like in this space. We instead choose to regu-larize our solution in the amplitude and depth map space directly. This introducescertain numerical challenges, because of the highly nonlinear relation between thedepth components and the raw ToF measurements (Eq.3.1). We relax this problemby splitting the optimization procedure into two parts, alternating between optimiz-ing for amplitude and depth. Our method can seamlessly include a super-resolutioncomponent, helping to overcome the limited sensor resolution in current generationToF cameras. Unlike earlier approaches, our method is also not inherently limitedto the cosine model, and could be easily extended to calibrated waveforms in thefuture.3.2 Method3.2.1 Algorithm overviewGiven the raw measurements b from a single view, our algorithm aims to removeoptical lens blur and produce high quality depth map z and amplitude map a. The24latent depth z and amplitude a are coupled in the measurements, thus we solvethem as a joint optimization problem.(a, z) = argmina,z||b− SK (a ◦ g(z)) ||22 + Φ(a) + Ψ(z) (3.3)Eq.3.3 shows the objective function we aim to minimize. The quadratic termrepresents a data-fitting error, assuming zero-mean Gaussian noise in the measure-ments. Φ(a) and Ψ(z) represent regularizers for amplitude a and depth z respec-tively. The algorithm alternatively estimates a and z, and update the blur kernelmatrix K at the end of each iteration according to currently estimated z.Sparse gradient priors [68] have been widely used in natural image deblurring,but are improper for depth where the gradient could be non-zero for most pixels.The sparse second-order derivative priors have been used in surface denoising [6,7, 104], but fail to model the discontinuity at object boundaries and thus do notdistinguish the blurred and latent sharp depth. In this paper, we use the second-order Total Generalized Variation (TGV) [11] for both the amplitude and depth, asshown in Eq.3.4 and3.5.Φ(a) = minyλ1||∇a− y||1 + λ2||∇y||1 (3.4)Ψ(z) = minxτ1||∇z− x||1 + τ2||∇x||1 (3.5)The TGV prior automatically balances the first and second order derivativeconstraints. Following Knoll et al [56], this can be intuitively understood as fol-lows. In flat regions of z, the second order derivative ∇2z is locally small, thusit benefits the minimization problem in Eq.3.5 to choose x = ∇z, and minimizethe second order derivative ||∇x||1. While in the sharp edges of z (i.e., at objectboundaries), ∇2z is larger than ∇z, thus it benefits to choose x as zero, and mini-mize the first order derivative ||∇z||1. Similar analysis applies for a as well. Theparameters λ1, λ2, τ1, τ2 define the relative weights of the first and second orderconstraints. A modified version of TGV was used in Ferstl et al [28] for image25Algorithm 1 Defocus deblurring for ToF depth cameraInput: Raw measurements b; modulation frequency f ; upsampling ratio rOutput: Estimated depth z and amplitude a1: a = upsample(magnitude(b), r)2: z = upsample(phase(b)/(4pif/c)), r)3: for n = 1 to N do4: K = updateKernel(z)5: c = argminc||b− SKc||22 + ρ||c− a ◦ g(z)||226: a = argminaρ||c− a ◦ g(z)||22 + Φ(a)7: z = argminzρ||c− a ◦ g(z)||22 + Ψ(z)8: end forguided depth upsampling.a◦g(z) in Eq.3.1 is highly nonlinear regarding to z. To reduce the computationcomplexity in this nonlinear problem, the algorithm splits the data-fitting term inthe objective (Eq.3.3) into a linear Least Squares (LSQ) and a pixel-wise separablenonlinear LSQ term, as in Eq.3.6. The scalar ρ defines the relative weight of thesplitting term.(a, z) = argmina,z,clinear LSQ for c︷ ︸︸ ︷||b− SKc||22 +separable nonlinear LSQ for z︷ ︸︸ ︷ρ||c− a ◦ g(z)||22+ Φ(a) + Ψ(z)(3.6)Algorithm 1 shows the high-level pseudocode of the proposed method. Theamplitude a and depth z are initialized as the magnitude and phase of the complex-valued measurement b, respectively, and upsampled by nearest-neighbor methodif superresolution wanted. Then the algorithm iteratively updates the blur kernelmatrix K, slack variable c, amplitude a and depth z. The number of iterations Nis typically set as 10. Details of each subproblem are described in Section 3.2.2 -Section Updating kernel estimateThe blur kernel is pre-calibrated at each pixel and sampled depth. The algorithmupdates K by a simple interpolated lookup in the pre-calibrated table of kernels ac-cording to the currently estimated depth z. The details of the calibration procedureare explained in Section Updating slack variableThe update of the slack variable c requires solving a linear LSQ problem (Algo-rithm 1, Line 5). Since the resulting linear equation system is positive-definite, anumber of options for efficient solvers exist.3.2.4 Update amplitude estimateAlgorithm 2 Update amplitudeInput: c, A, ρ, ρa, λ1, λ2, number of iterations: MOutput: Estimated amplitude a1: for n = 1 to M do2: a = argminaρ||c−Aa||22 + λ1ρa||∇a− y − p1 + u1||223: y = argminyλ1||∇a− y − p1 + u1||22+λ2||∇y − p2 + u2||224: p1 = argminp1||p1||1 + ρa||∇a− y − p1 + u1||225: p2 = argminp2||p2||1 + ρa||∇y − p2 + u2||226: u1 = u1 +∇a− y − p17: u2 = u2 +∇y − p28: end forBy substituting Φ(a) into the update rule for the amplitude (Algorithm 1,Line 6), we obtain the following optimization problemmina,yρ||c−Aa||22 + λ1||∇a− y||1 + λ2||∇y||1, (3.7)where A is a diagonal matrix composed of g(z). This problem is solved by ADMM [9]),as shown in Algorithm 2. The a and y updates are linear least squares problems.27The p1,2-updates are soft shrinkage problems and have closed form solutions [9].The number of ADMM iterations M is typically set to 20. More details of eachsubproblem are provided in Appendix A. Updating depth estimateAlgorithm 3 Update depthInput: c, a, ρ, ρx, τ1, τ2, number of iterations: MOutput: Updated depth z1: for n = 1 to M do2: z = argminzρ||c− a ◦ g(z)||22 + τ1ρx||∇z− x− q1 + v1||223: x = argminxτ1||∇z− x− q1 + v1||22 + τ2||∇x− q2 + v2||224: q1 = argminq1||q1||1 + ρx||∇z− x− q1 + v1||225: q2 = argminq2||q2||1 + ρx||∇x− q2 + v2||226: v1 = v1 +∇z− x− q17: v2 = v2 +∇x− q28: end forIn a similar fashion, we can substitute Ψ(z) into the update rule for the depth(Algorithm 1, Line 7), and obtain the optimization problemminz,xρ||c−a ◦ g(z)||22+τ1||∇z−x||1+τ2||∇x||1 (3.8)Once again, we apply the ADMM method to reduce this problem into easier sub-problems, as shown in Algorithm 3. For the sparse nonlinear least squares prob-lem of updating z (Algorithm 3, Line 2), we use the Levenberg-Marquardt algo-rithm [61, 76] with an analytical Jacobian for the cosine model. To adapt ourmethod to arbitrary (calibrated) waveforms, the only required change would be toreplace this derivative estimate with a tabulated version based on the calibrationdata. We use the cosine model for the experiments in Section 3.3 for fair com-parisons with previous work, which makes the same assumption. Again, the q1,2updates are soft shrinkage problems, we use M = 20 iterations, and all furtherdetails are provided in Appendix A.1.28(a) PMD-Digicam camera (b) Scene setup for PSF calibrationFigure 3.2: Experimental setup.3.2.6 CalibrationThe blur kernel (PSF) pre-calibration for real measurements is done by a similarapproach to [44]. Fig. 3.2 shows the experiment setup. Printed random noisepatterns are attached on a flat white board, which is held on a translation stage.The translation stage moves the white board to place from 60cm to 160cm awayfrom the camera, with 1cm incremental. At each place, the camera captures withlarge aperture (the same aperture used for real measurements). Then, this processis repeated but with a small aperture so that the scene is nearly in-focus. Next, theamplitude images of the two captures at each place are used to estimate the PSF asa non-blind deconvolution problem.3.3 ResultsWe test the proposed algorithm on both synthetic and real datasets, and comparewith two methods: the naive method, which computes the amplitude and depthas the magnitude and phase of the raw complex images respectively; and Godbazet al [31], which alternatively updates blur kernels and deconvolves the compleximage with a Gaussian prior.29(a) Estimated a, from left to right, by ground truth, naive method (29.1dB), Godbaz method(31.9dB) and ours (34.5dB).(b) Estimated z, from left to right, by ground truth, naive method (36.2dB), Godbaz method(37.3dB) and ours (43.0dB).Figure 3.3: Results on simulated Buddha scene with 0.5% white noise. Ourmethod significantly reduces the blurriness and suppresses noise in botha and z, and reducing the flying pixels at object boundaries. PSNR ofthe results are provided in the brackets.302830323436384042440 1 2 3 4 5 6 7 8 9 10PSNR (dB)Number of iterationsestimated depth estimated amplitudeFigure 3.4: PSNR of our estimated amplitude a and depth z on the Buddhascene at each iteration.3.3.1 Synthetic dataThe results on a simulated Buddha scene is shown in Fig. 3.3. The naive amplitudea and depth z are blurry and contain strong noise and flying pixels. In the visu-alized depth map, the flying pixels appear in different color than the foregroundand background surface at the boundaries. Godbaz et al method is highly sensitiveto noise. Their result to some extent reduces the blurriness, but contains obviousnoise, ringing artifacts and flying pixels. The proposed method significantly re-duces the blurriness and flying pixels, and suppress the noise in both the amplitudea and depth z. The results are compared with ground truth data. Our approachproduces much higher PSNR than the other methods.In Fig. 3.4, we show the PSNR values of our estimated a and z at each iteration(i.e., n in Algorithm 1).3.3.2 Real dataWe captured real datasets using the Digicam camera from PMDTechnologies (Fig. 3.2)with a 6-15mm and f/1.4 lens. 30MHz modulation frequency and 300 microsec-ond exposure time are used, and a single frame is captured for each scene. We crop31(a) Camel scene (b) Character scene(c) Board sceneFigure 3.5: RGB Photographs of the real scenes.out the pixels near image boundaries and the typical resolution of input images is250 × 180 pixels. The pre-calibrated PSFs have a width of 5-11 pixels at depthsbetween 0.6 and 1.6m.In Fig. 3.5 we show the photographs of the captured scenes. The results andcomparisons are shown in Fig. 3.6, 3.7 and 3.8, and cropped regions are shownin Fig. 3.9, 3.11 and 3.12. In Fig. 3.10, we show the mesh geometry color-codedaccording to surface normal to better illustrate the depth results. Please zoom infor better views.Similarly as in the synthetic example, Godbaz et al method is unable to handlenoise (which is common in low-end ToF cameras), and fails to recover sharp scenefeatures. Our method produces much higher quality amplitude and depth, in termof suppressing the noise, recovering sharp features and reducing flying pixels. Wealso run our algorithm with 2x superresolution (i.e., upsampling ratio r = 2 in32(a) Real part of b (b) Imaginary part of b(c) Naı¨vea (d) Naı¨vez(e) Godbaz a (f) Godbaz z(g) Our a (h) Our z(i) Our a with superresolution (j) Our z with superresolutionFigure 3.6: Results on the Camel scene. In (a-b) the red color indicates pos-itive values and blue negative in the raw measurements. The croppedregions are shown in Fig. 3.9.33(a) Real part of b (b) Imaginary part of b(c) Naı¨vea (d) Naı¨vez(e) Godbaz a (f) Godbaz z(g) Our a (h) Our z(i) Our a w/ superresolution (j) Our z w/ superresolutionFigure 3.7: Results on the Character scene. The cropped regions are shownin Fig. 3.11.34(a) Real part of b (b) Imaginary part of b(c) Naı¨vea (d) Naı¨vez(e) Godbaz a (f) Godbaz z(g) Our a (h) Our z(i) Our a w/ superresolution (j) Our z w/ superresolutionFigure 3.8: Results on the Board scene. The cropped regions are shown inFig. 3.12.35Figure 3.9: Two insets of the results on the Camel scene in Fig. 3.6. From leftto right shows the naive method, Godbaz et al method, ours, and ourswith superresolution.(a) Naı¨vez (b) Godbaz z(c) Our z (d) Our z with superresolutionFigure 3.10: Mesh visualization for the Camel scene in Fig. 3.6. The colorindicates the surface normal in horizontal direction, i.e., blue indicatesleft-faced surface and red the opposite.36Figure 3.11: Two insets of the results on the Character scene in Fig. 3.7. Fromleft to right shows the naive method, Godbaz et al method, ours, andours with superresolution.Figure 3.12: Two insets of the results on the Board scene in Fig. 3.8. Fromleft to right shows the naive method, Godbaz et al method, ours, andours with superresolution.37Figure 3.13: We run the proposed algorithm on an inset of the Characterscene with different upsampling ratios. From left to right shows theresult of naive method, and ours with 1x, 2x, 4x, 6x superresolutionrespectively. We observed that little more details are recovered beyond4x upsampling.Algorithm 1), and show that we achieve even better results with superresolutionin our joint optimization framework. We use bicubic interpolation for the down-sampling operator S. In Fig. 3.13, we show our results with different upsamplingratios, and observe that little more details are recovered beyond 4x upsampling forour dataset.We tune the parameters to generate results of compared method and ours. Forour results, ρ (in Algorithm 1), ρa (in Algorithm 2) and ρx (in Algorithm 3) arefixed as 0.125, 10 and 10 respectively. In Algorithm 2, λ1 is typically set as 0.001or 0.002, and λ2 as 20 or 40 times λ1. In Algorithm 3, τ1 is typically set as 0.0005or 0.001, and τ2 as 20 or 40 times τ1.We run our unoptimized Matlab code with a single core on an Intel i7 2.4GHzCPU. The running time is reported taken the Character scene as an example. Dur-ing the total 10 iterations in Algorithm 1 with no superresolution, our code took 80seconds for updating the slack variable c (Sec. 3.2.3), 16 seconds for the amplitude(Sec. 3.2.4), and 347 seconds for the depth (Sec. 3.2.5). We believe the code canbe further accelerated by choosing more efficient solvers for some subproblemsor running on GPU. For example, the current Levenberg-Marquardt Matlab solverused in Algorithm 3 can be replaced with much more efficient ones [3].38(a) Estimated a (b) Estimated zFigure 3.14: Experiment results using the TGV prior on the complex im-ages. The estimated amplitude is over-smoothed while the depth isstill highly noisy.3.4 DiscussionRegularization strategyTo verify the benefit of directly regularizing and solving the latent amplitude anddepth, we replaced the Gaussian derivative prior in Godbaz et al [31] with theTGV prior on the complex image, and solved the complex image using ADMMframework. The result on the Camel scene is shown in Fig. 3.14. Even though theprior weight is high enough to over-smooth the amplitude, the estimated depth stillcontains strong noise compared to Fig. 3.6. This demonstrates the high quality ofour results is to a large part owed to the approach of regularizing z and a directly.Joint deblurring and superresolutionTo show the advantage of our joint deblurring and superresolution from ToF rawmeasurements, we compare with the results of superresolution after deblurred byeach method. Two example insets are shown in Fig. 3.15. Our jointly deblurredand super-resolved depth and amplitude preserve sharp features and reduce flyingpixels better than the others.39Figure 3.15: The top row shows depth results, and bottom amplitude.From left to right: (a) naive method; (b) naive method + post-superresolution; (c) Godbaz et al + post-superresolution; (d) our de-blurred + post-superresolution; and (e) our jointly deblurred and su-perresolved. The TGV prior is used for all post-superresolution. Thesetwo scenes are from Fig. 3.9 and 3.11.Multiple frequencies, phases and exposuresThe latest generation of ToF cameras uses multiple frequencies or phases in or-der to reduce range ambiguity and improve depth resolution. Multiple exposurescould be used to increase the dynamic range of the raw measurements and removeartifacts due to lack of reflection or over-saturation in one shot. The proposed algo-rithm well adapts for these cameras since the latent amplitude and depth is solveddirectly from the raw measurements, which could come from different capturesand put together in the data-fitting term.Defocus levelThe pixel width of current ToF sensor is approximately 45µm, compared with RGBsensors which have approximately 5µm pixel size. As the ToF sensor resolutionincreases as the technology matures, the defocus effect in ToF imaging is expectedto be more obvious and the importance of deblurring ToF images will become morepronounced.40Limitations and future workBoth the proposed method and the compared Godbaz et al [31] assume white Gaus-sian noise in the raw measurements. This noise model is inaccurate due to the non-linear image formation model, typically relatively low light conditions, as well asambient light canceling in ToF imaging [33], which amplifies shot noise. As afuture work we would like to study more accurate noise models for ToF cameras.3.5 ConclusionIn this paper we proposed an effective method to simultaneously remove lens blurand increase image resolution for ToF depth cameras. Our algorithm solves the la-tent amplitude and depth directly from the raw complex images, and separate priorsare used for each to recover sharp features and reduce flying pixels and noise. Weshow our algorithm significantly improves the image quality on simulated and realdataset compared with previous work. Unlike previous approaches, our method isnot fundamentally limited to the cosine model for continuous-wave ToF cameras,which has been shown to be inaccurate for many systems (e.g., [42]) and shouldadapt to multi-frequency, multi-phase or multi-exposure ToF cameras.41Chapter 4Stochastic Blind MotionDeblurring4.1 IntroductionThe goal of deconvolution is to recover a sharp intrinsic image x from a blurredimage b. The image formation process can be modeled asb = k⊗ x + n, (4.1)where k represents the blur kernel, ⊗ represents discrete convolution, and n is anoise term. We first assume Gaussian noise for simplicity, and extend our algorithmfor Poisson noise in later sections.While there are many applications in which the kernel k is known or can becalibrated a priori (e.g., aberration correction) or is considered to be from a smallset of possible kernels (defocus blur), in other situations the kernel is unknownsince it is generated by a process that cannot be replicated (e.g., object motion orcamera shake).This blind case, where both the kernel and the intrinsic image is unknown, ishighly under-constrained, and is subject to many degenerate solutions, includingone where the estimated kernel is simply a Dirac peak. It is therefore obvious thatblind deconvolution is only feasible with the use of additional information, either42in the form of additional sensor data (e.g., inertial sensors [52] or multiple inputimages of the same scene [83]), or in the form of prior information (image priors).While good priors for both the images and the blur kernels are still an active areaof investigation, many choices that have been proposed are non-linear and ofteneven non-convex. This makes it difficult and time consuming to experiment withdifferent image priors, since each new candidate typically requires a customizedoptimization procedure that can require a significant effort to implement.In this chapter we extend the recent work in stochastic non-blind deconvolu-tion [36] to derive a blind method that is purely based on stochastic sampling. Themethod relies entirely on local evaluations of the objective function, without theneed to compute the gradient of the objective function. This makes it effortlessto implement and test new image priors. Specifically we present the followingcontributions:• a stochastic framework for blind deconvolution,• a demonstration of stochastic optimization for non-convex priors for the im-age (e.g. sparse derivatives and cross-channel information), the kernel (e.g.anisotropic diffusion), and even the data term (handling of saturation andclamping, as well as an Anscombe transformation for Poisson noise).• an implementation with a range of sparsity and color priors for the image, aswell as sparsity and smoothness priors for the kernel, that together match oroutperform existing state-of-the art methods,• a method for recovering colored blur kernels that arrive, e.g. in remote sens-ing where the exposure time varies per channel (e.g., [60]), and• a method for dealing with clipped color values in non-blind deconvolution.434.2 Basic algorithmAlgorithm 4 Stochastic blind deconvolution frameworkInput: Blurry image b; weights for priors on intrinsic image: θx; weights forpriors on the kernel: θk; maximum blur kernel size φ; number of iterations T .Output: Estimated intrinsic image x; estimated kernel k.1: S = d2 log2(φ/3) + 1e //number of scales2: for s = 1 to S do3: bs = downsample(b, s, ‘bilinear’)4: if s == 1 then5: xs = bs6: ks = initializeKernel(b)7: else8: xs = upsample(xs−1,√2, ‘bicubic’)9: ks = upsample(ks−1,√2, ‘nearest-neighbor’)10: end if11: for t = 1 to T do12: xs = updateIntrinsicImage(bs,ks,xs, θx)13: ks = updateKernel(bs,xs,ks, θk)14: end for15: [θx, θk] = updateWeights(θx, θk)16: end for17: k = ks //final estimated kernel18: x = updateIntrinsicImage(b,k,xs, θx) //final intrinsic image restorationOur algorithm is based on a multi-scale approach that iterates between kernelestimation and solving a non-blind deconvolution step (Algorithm 4). The algo-rithm uses a scale space to avoid local optima, working its way from the coarsescales to the fine scales (also see Fig. 4.1). At each scale, the method upsamplesthe kernel and intrinsic image from the next coarser scale using nearest-neighborand bicubic upsampling respectively and then alternately updates the current esti-mates for the intrinsic image (Section 4.2.1) and the kernel (Section 4.2.2) at thecurrent scale in an inner loop. Next, prior weights are adjusted (Section 4.2.3) be-fore moving to the next finer scale. At the coarsest scale, the kernel is initializedas a 3 by 3 image with either a horizontal or vertical stripe depending on the domi-nant gradient direction [27], while the blurry image for each scale is obtained withbilinear downsampling from the full-resolution blurry image.44Figure 4.1: Our multi-scale scheme on the scene shown in Fig.4.8. Row 1& 2: intrinsic images estimated at each scale. Row 3: kernel estimatedat each scale. In contrast to most existing blind deconvolution meth-ods, our algorithm can recover all color channels of the intrinsic imagesimultaneously using the cross-channel information.4.2.1 Updating the intrinsic imageGiven the kernel estimated at the current scale ks, the function updateIntrinsicIm-age(.) in Algorithm 4 updates our estimate of the intrinsic image by solving a non-blind deconvolution problem using a stochastic random walk optimization (Algo-rithm 5, also see [36]), which minimizes objectives of the form:f(xs) = ||bs − ks ⊗ xs||22 + θx · g(xs) (4.2)The quadratic term gives the data-fitting error. g(.) = [g1(.), . . . , gR(.)]T is avector of individual regularizers, and θx = [θx1, . . . , θxR]T is the correspondingvector of weights for each regularization term. The total weighted penalty (scalar)is the dot-product of θx and g(.). Examples of gi(.) are given in Eq. 4.8 - 4.12.45The intrinsic image updating method described in this section is heavily basedon the previous work [36] with small modifications. We review its details here forself-completeness and provide additional analysis about its convergence.Random walk processAs summarized in Algorithm 5, we create a random walk chain of pixel locationpi in the support domain of the intrinsic image at which we propose to add to orremove from an energy quantum edx:xsi = xsi−1 ± edx · δpi , (4.3)where δpi is the characteristic function (i.e., Kronecker delta function) for the sam-ple pixel pi, and xsi the estimated intrinsic image at ith iteration of the randomwalk. Both the positive and negative energy are evaluated at the sample pixel butonly the sample that decreases the objective function most is kept.The quantity c(pi) measures the change of the objective function f(.) if theproposed sample pi with value ±edx were to be accepted, i.e.,c(pi) = f(xsi )− f(xsi−1) (4.4)If the sample decreases the objective (i.e. c(pi) < 0), it is accepted and Eq. 4.3 isapplied (lines 6-9, Algorithm 5).The quantity a measures the rate of objective change given the new proposedsample pi, as defined in Eq. 4.5.a = c(pi)/c(pi−1) (4.5)If a is lower than a uniform-randomly generated real number between 0 and 1(a.k.a.Russian roulette strategy in Metropolis-Hasting sampling techniques [14]),and the previous sample pi−1 decreased the objective, the new sample is discardedentirely leaving the intrinsic image and random walk chain unchanged (lines 11-13, Algorithm 5). Otherwise, the random walk chain is updated to the new samplelocation pi.46Algorithm 5 Stochastic random walk optimizationInput: x0 as the initial value of the signal x to optimize, with support domain Ω;function c(pi) evaluating the change of objective given sample pi; functiont(pi|pi−1) mutating sample location from pi−1 to pi; ed as the initial mag-nitude of the sample energy;  as a constant scalar (by default 0.001); γ as aconstant scalar between 0 and 1 (by default 0.1);N as the number of iterations;M as the total number of proposed samples in each iteration.Output: Updated signal x.1: p0 = random(Ω) //initialize the random walk chain by uniform-randomlysampling the support domain Ω.2: for j = 1 to N do3: β = 0 //number of accepted samples in each iteration4: for i = 1 to M do5: pi = sample(pi−1, t(pi|pi−1), ed) //propose a new sample pi with en-ergy ed given the previous sample pi−1 and transition function t(.).6: if c(pi) < 0 then7: β = β + 18: accept(pi) //if the new sample pi decreases the objective, accept it andupdate x (e.g., by Eq. 4.3).9: end if10: a = c(pi)/c(pi−1) //compute the rate of objective change given the newsample pi.11: if a < random([0, 1]) and c(pi−1) < 0 then12: pi = pi−1 //if a is lower than a uniformly distributed random real num-ber between 0 and 1, and the previous sample pi−1 decreased the ob-jective, keep the random walk chain at previous sample location pi−1;otherwise, update the random walk chain to pi.13: end if14: end for15: p0 = pN //update the root of random walk chain for next iteration.16: if β <  ·N and ed <  then17: break //if too few samples were accepted and ed is too small meanwhile,stop the iterations.18: else if β < γ ·N then19: ed = 0.5·ed //reduce the sample magnitude by half when too few sampleswere accepted.20: end if21: end for47Evaluating c(pi)Given a proposed sample, the objective function in Eq. 4.2 changes locally becauseof the compact support of the blur kernel ks and priors g(.). We can efficientlyevaluate c(pi) by only considering the local neighborhood of the given samplepixel.In practice, we keep a second sequence of images bˆsi = ks ⊗ xsi , which repre-sents the blurred image we would expect if the intrinsic image was xsi . The bˆsi canbe updated efficiently by splatting ks ⊗ δpi during the random walk process:bˆsi = bˆsi−1 ± edx(ks ⊗ δpi) (4.6)The splat ks ⊗ δpi is just a shifted, and mirrored copy of the kernel ks and is pre-computed for acceleration. The change in the regularization term is evaluated in ananalogous manner but is specific to chosen regularizers.Sample mutationThe function sample(.) generates a new sample pi from the previous sample pi−1by the mutation function t(pi|pi−1). Two types of mutation strategies are used.The first strategy generates pi by a zero-mean Gaussian-distributed randomoffset η(0, σ) from pi−1. This mutation allows more samples to be drawn in theregions where they can reduce the objective function more effectively. Using aGaussian distribution ensures ergodicity of the random walk process. The standarddeviation of the Gaussian kernel σ is set to 4 pixels when updating intrinsic image.The second mutation strategy chooses the new sample randomly in the supportdomain with uniform probability. We stimulate this mutation with 2% probabil-ity during the random walk process. This helps to avoid start-up bias and alsocontributes to ergodicity of the random walk process.Eq. 4.7 gives the formula of the above mutation strategies:pi ={random(Ω), if q < 0.02;pi−1 + η(0, σ), otherwise(4.7)where random(Ω) means a random pixel across the image support domain Ω, and48q is a random real number between 0 and 1 generated online at each mutation.Sample energyThe magnitude of the sample energy edx is reset to be an initial large value at thebeginning of each scale, and adjusted iteratively at each scale. It is reduced by halfwhenever the percentage of accepted samples over all proposed ones in previousiteration goes below a constant scalar γ ∈ [0, 1]. This allows the method to takelarge steps early and make more subtle changes as the minimization proceeds. Inour experiments we use the initial value of edx as 0.02 and γ as 0.1. Note that theblurry and intrinsic images are normalized to be between 0 and 1.Stopping criteriaThe random walk process is terminated if the percentage of accepted samples inprevious iteration goes below a constant threshold  and the magnitude of the sam-ple energy edx is smaller than  meanwhile (lines 16-20, Algorithm 5). In ourexperiments we set  to 0.001.Regularizers g(.)A benefit of the stochastic optimization framework is that it allows very generalpriors to be used with no change to the overall algorithm. We have used a selectionof well-known and frequently used regularization terms listed below. ∇h and ∇vare horizontal and vertical first-order derivative operators respectively. ∇hh, ∇hvand ∇vv are corresponding second-order derivative operators.• Anisotropic total variation [35] (convex, but non-smooth):||∇hxs||1 + ||∇vxs||1 (4.8)• Isotropic total variation [35] (convex, but non-smooth):||(|∇hxs|2 + |∇vxs|2)1/2||1 (4.9)• Sparse first-order derivatives [27, 66] (non-convex):||∇hxs||p + ||∇vxs||p, p ∈ (0, 1] (4.10)49True intrinsic and an inset kernel 1 kernel 2 kernel 3 kernel 4Figure 4.2: Ground truth intrinsic image (800x800 pixels) and kernels (25x25pixels) used for empirical convergence test in Fig. 4.3 and 4.6. Thekernels are upsampled by nearest neighbor for visualization.• Sparse gradient [27, 66] (non-convex):||(|∇hxs|2 + |∇vxs|2)1/2||p, p ∈ (0, 1] (4.11)• Sparse second-order derivatives [62] (non-convex):||∇hhxs||p + 2||∇hvxs||p + ||∇vvxs||p, p ∈ (0, 1] (4.12)In addition we use other priors, including one to reduce chromatic artifacts (seeSection 4.3.1) as well as non-convex data term, e.g., in the case of images withPoisson noise (Section 4.3.4).Empirical convergenceFollowing the analysis in [36], our stochastic random walk framework is a formof Stochastic Coordinate Descent (SCD) [69, 89, 97]. In each iteration, the algo-rithm picks a single pixel in the image and checks if the objective can be reducedby depositing energy in this pixel. This corresponds to picking a single degree-of-freedom (i.e. a single coordinate axis in the vector of unknowns) and descend-ing along that direction, without computing full gradient of the objective function.The difference to other SCD methods is that our algorithm uses the random walkprocess to exploit spatial coherence in the deconvolution problem and focus thecomputational effort on regions with sharp edges, where most work is to be donein deconvolution (see Fig. 4.4).For smooth objectives, SCD methods provably converge as long as there is a50(a) w/ kernel 1 (b) w/ kernel 2 (c) w/ kernel 3 (d) w/ kernel 424816326412825651210240 20 40 60 80 100 120 140 160 180 200Objective fIterationsw/ kernel 1w/ kernel 2w/ kernel 3w/ kernel 42224262830323436380 20 40 60 80 100 120 140 160 180 200PSNR (dB)Iterationsw/ kernel 1w/ kernel 2w/ kernel 3w/ kernel 4Figure 4.3: Example of non-blind intrinsic image estimation. In (a)-(d), the1st row shows the inset of blurry input simulated with the true intrinsicimage and each kernel given in Fig. 4.2, and the 2nd row shows the insetof non-blind estimated intrinsic image by our stochastic random walkmethod. Only insets are shown due to limited space. The bottom rowshow the change of objective function f and PSNR when the numberof iterations increases. In each iteration, 50000 samples were proposed.Sparse gradient prior (Eq. 4.11 with p = 0.8, non-convex) was applied.51(a) Initial residual(b) Distribution of accepted sample energy edx(c) Distribution of proposed sample locations051015202530354045500 5 10 15 20 25 30 35 40 45(×103)w/ kernel 105101520253035404550 (×103)w/ kernel 20 5 10 15 20 25 30 35 40 45 05101520253035404550 (×103)w/ kernel 30 5 10 15 20 25 30 35 40 45 05101520253035404550 (×103)w/ kernel 40 5 10 15 20 25 30 35 40 45(d) Histogram of # proposed samplesFigure 4.4: Visualization of sample distribution for the non-blind examplesin Fig. 4.3. From left to right, the columns show the example with ker-nel 1,2,3,4. (a) shows the residual between initial intrinsic image andground truth. (b) shows the map of accepted sample energy edx. Thevalues of the residuals and sample energy in (a)-(b) are 10× magnifiedbefore coded in CMYK colorspace, where magenta channel indicatespositive values and yellow negative. (c) shows the normalized distribu-tion of proposed sample locations pi (including both accepted and re-jected samples.), coded in key channel in CMYK colorspace. (d) showsthe histogram of the number of proposed samples in (c). The horizontalaxis indicates the bins of the number of proposed samples (clamped at45 for limited space), and vertical axis indicates the number of pixels ateach bin. 52(a) Initial (b) Iteration 20 (c) Iteration 60 (d) Iteration 200Figure 4.5: Visualization of the residual between the ground truth and ourestimated intrinsic image at different iterations, for the example shownin Fig.4.3(a). The values of the residual are 10×magnified before codedin CMYK colorspace, where magenta channel indicates positive valuesand yellow negative.finite probability of choosing each possible coordinate axis. This is ensured by theergodicity of our mutation strategy. For general, non-smooth objectives no suchproof exists (see details in [36]), but in Fig. 4.3, we show empirical convergenceexperiments for our method in the case of non-smooth and non-convex objectivesin non-blind intrinsic image estimations.In Fig. 4.5, we visualize the residual between the true intrinsic and our esti-mation in selected iterations for the example in Fig. 4.3(a). The algorithm reducesthe residual progressively. In Fig. 4.4, we visualize the sample distribution in therandom walk process for the examples in Fig. 4.3. As shown in Fig. 4.4(a), theresidual is large mostly at pixels near the image edges. Fig. 4.4(b) shows the dis-tribution of accepted sample energy edx, which are mostly located at pixels wherethe residual is large. Note that the positive and negative edx partially overlap dur-ing the random walk process. Fig. 4.4(c) shows the distribution of the number ofproposed samples (including both accepted and rejected ones). More samples areproposed near the image edges. Fig. 4.4(d) shows the histogram of the number ofproposed samples in Fig. 4.4(c). The majority of pixels consume 5-30 samples.The intuitive reasons why such an apparently small amount of samples are re-quired are: 1) the samples are not uniform-randomly proposed. The algorithm usesrandom walk to exploit the spatial coherence in the images, thus enforce impor-tance sampling; 2) the sample energy ed is initialized as a large value to reduce the53number of required samples at the beginning (as larger ed can reduce the objectivemore effectively), and then progressively reduced to better recover smaller detailsin the image. This multi-weight strategy help reduce the total number of requiredsamples; 3) each proposed sample is evaluated with both positive and negativeenergy.4.2.2 Updating the kernelThe updateKernel(.) function is used to perform kernel estimation, i.e., to findthe kernel that best explains the blurry input bs and intrinsic image xs for a givenscale s. This operation is performed in derivative domain, minimizing the objectivefunction in Eq. 4.13 consisting of a data term and set of priors g(.) with weightsθk:f(ks) = ||∇h,vbs − ks ⊗∇h,vxs||22 + θk · g(ks), (4.13)where ∇h,v represents the set of horizontal and vertical first-order derivative oper-ators, i.e., ∇h,v = {∇h,∇v}, and thus ∇h,vxs = {∇hxs,∇vxs}. We observedthat the optimization converges faster when the data-fitting error is computed inthe derivative domain rather than in the intensity domain. This is consistent withthe findings of Cho et al [20]. On the other hand, the cross-channel prior (seeSection 4.3.1) requires the image to be represented in the intensity domain. Sincemixing the intensity domain and the gradient domain in a single subproblem wouldbe too costly, we use the gradient domain only for the kernel subproblem.The same stochastic random walk algorithm (Algorithm 5) is used for updatingthe kernel except samples are now drawn from the kernel image. As before, anenergy quantum edk is added or removed at each sample location pi causing thekernel to be updated by Eq. 4.14, where ksi is the estimated kernel at ith iterationof the random walk, and δpi is the characteristic function (i.e., Kronecker deltafunction) for the pixel located at pi. Non-negativity of the kernel is enforced byrejecting any sample that causes a kernel pixel to become negative.ksi = ksi−1 ± edk · δpi (4.14)54Updates to the kernel result in a change to the entire blurry image. To evalu-ate the data efficiently, the algorithm maintains an estimate of the gradient of thecurrent blurry image, ∇h,vbˆsi , which is compared to the down-sampled capturedblurry image to evaluate the change to the objective function. Each sample at piis a scaled Dirac function ±edk · δpi , resulting in an update rule whereby a shiftedand scaled copy of the current estimate for the intrinsic image xs is added:∇h,vbˆsi = ∇h,vbˆsi−1 ± edk(δpi ⊗∇h,vxs) (4.15)As in the intrinsic image update, it is necessary to apply a regularizer to thekernel estimation in order to enforce specific properties. For motion blur kernelswe expect kernels to be i) smooth ii) sparse, in the sense that most kernel entrieswill be zero and iii) continuous, in that the kernel should be a smooth curve overthe exposure time. We enforce these properties with the priors from Eq. 4.16-4.18respectively:• Smoothness [62]: ||∇2ks||22 (4.16)• Sparsity: ||ks||p, 0 ≤ p < 1 (4.17)• Continuity: ||ks −AD(ks)||22 (4.18)For the continuity prior, Eq. 4.18, anisotropic diffusion [81] is used for thefilter AD(.). This is a non-linear filter that favors long continuous features whichhelps to reconstruct thin motion-blur trails. A benefit of the stochastic optimizationalgorithm is that this function may be computed exactly for each sample, ratherthan linearized per iteration. As with the intrinsic update, the key benefit of thestochastic framework is that only local evaluations of the regularizers are required.After updating the kernel, a simple denoising filter is applied that sets any pixelin ks to zero whenever its eight neighboring pixels are near zero. This removesisolated speckles that sometimes occur with the stochastic random walk. The pixelswhose intensity is lower than a threshold (i.e., 0.05 times the highest pixel intensityof current estimated kernel) are also set to be zero. This can be interpreted as an55024681012140 10 20 30 40 50 60 70 80 90w/ kernel 10 10 20 30 40 50 60 70 80 90w/ kernel 20246810121416180 10 20 30 40 50 60 70 80 90w/ kernel 305101520250 10 20 30 40 50 60 70 80 90w/ kernel 405101520253035(a) Histogram of # proposed samples24816326412825651210240 5 10 15 20 25 30 35 40 45 50 55 60Objective fIterationskernel 1kernel 2kernel 3kernel 43540455055606570750 5 10 15 20 25 30 35 40 45 50 55 60PSNR (dB)Iterationskernel 1kernel 2kernel 3kernel 4Figure 4.6: Example of non-blind kernel estimation. The true intrinsic imageand kernel are given in Fig. 4.2. The input blurry images are given inFig. 4.3. (a) shows the histogram of the number of proposed samples(including both accepted and rejected samples). The bottom row showsthe change of objective function f and PSNR as the number of itera-tions increases. The initial kernels can be arbitrary for non-blind kernelestimation. Note that the sample energy is non-zero at the region wherethe pixel intensity is zero in both initial and final recovered kernel. Thisis due to the post-processing (remove isolated pixel, shrinkage, and nor-malization) at the end of each iteration. This post-processing also causesnon-monotonicity in the objective and PSNR curve. The priors definedin Eq. 4.16, 4.17 and 4.18 were applied. In each iteration, 200 sampleswere proposed.56additional shrinkage operator that ensures kernel sparsity. The kernel image isnormalized to 1 at the end of each iteration.In Fig. 4.6, we show empirical convergence test of our method for non-blindkernel estimations. Regarding the parameters in Algorithm 5, we use the initialvalue of edk as 0.01, and γ as 0.1 in the experiments.4.2.3 Updating the weightsThe weights θx and θk define the relative strength of the data-fitting error andregularizers in the objective functions for intrinsic image and kernel updates. Thealgorithm begins with initially high θx (except for the cross-channel prior explainedin Section 4.3.1) at the coarsest scale and halves them whenever a new scale isstarted, until minimum thresholds are reached. This helps to avoid local optima inthe subproblem. θk is kept unchanged for all scales in our experiments.After the kernel is estimated at the finest scale, the function updateIntrinsicIm-age(.) is applied again to generate the final estimation of intrinsic image x (i.e.,line 18, Algorithm 4).In Fig. 4.7, we visualize the progress of intrinsic and kernel estimation at multi-scale process.4.3 Algorithmic ExtensionsHaving described the basic algorithm in Section 4.2 we now proceed to intro-duce several useful extensions, including non-convex priors for color images (Sec-tion 4.3.1) and chromatic kernels (Section 4.3.2), as well as partially saturated pix-els (Section 4.3.3). Finally, we demonstrate how non-linear versions of the imageformation model can also be included to account for non-Gaussian noise models(Section 4.3.4).4.3.1 Color imagesTo recover color images corrupted by motion blur, a simple extension of the basicalgorithm might perform kernel estimation as described in Section 4.2.2 (summingthe data term over all three channels) followed by separately deblurring each in-trinsic channel. However, better results can be obtained by jointly deblurring all57(a) Blurry input (b) Our estimation (c) Ground truth45474951535557590 5 10 15 20 25 30 35PSNR (dB) of kernelIterationsbicubicnearest neighbor scale 1scale 2scale 3scale 4scale 5scale 6scale 7scale 82426283032343638400 5 10 15 20 25 30 35 40PSNR (dB) of intrinsicIterationsbicubicnearest neighbor scale 1scale 2scale 3scale 4scale 5scale 6scale 7scale 8finalFigure 4.7: Example of our blind estimation of intrinsic and kernel. The twoplots show the PSNR value of intermediate estimation of intrinsic andkernel at multi-scale scheme. To compute the PSNR values, at eachscale we upsample the intrinsic and kernel to the finest resolution bybicubic or nearest neighbor and remove the possible shifts first. The“final” step in the plot means the final restoration of the intrinsic image,i.e., line 18, Algorithm 4.58(a) Input(b) Fergus et al [27](c) Cho et al [20](d) Xu et al [1](e) OursFigure 4.8: Results on a noisy real-world image. Our algorithm significantlyreduces chromatic artifacts compared with previous methods (betterview on screen).59(a) Input (b) Fergus et al [27] (c) Shan et al [98](d) Cho et al [20] (e) Xu & Jia [113] (f) Hirsch et al [46](g) Krishnan et al [59] (h) Xu et al [114] (i) OursFigure 4.9: Results on a real-world image and visual comparisons of state-of-the-art methods. A closeup is shown in Fig. 4.10.60(a) Input (b) Shan et al [98] (c) Cho et al [20](d) Xu & Jia [113] (e) Xu et al [114] (f) OursFigure 4.10: Closeup of the cloth region in the scene shown in Fig. 4.9. Ourresult contains much fewer chromatic artifacts than the other methods.intrinsic channels simultaneously since the majority of edges in the true intrinsicimage occur in all channels, with sparse hue changes. Based on this observation,Heide et al [43] proposed a cross-channel prior to remove chromatic aberrationscaused by low-quality lenses:∑i,j∈{r,g,b}λij ||xsi · ∇h,vxsj − xsj · ∇h,vxsi ||p, 0 < p ≤ 1 (4.19)Adding the cross-channel priors to the regularizers g(.) for the intrinsic imageresults in a non-convex objective. Heide et alused an alternating minimization inwhich one channel is deblurred with the other two fixed. We instead reconstruct allchannels simultaneously by running one sampling chain per channel run in lock-step. Although the method still alternates between the channels, this occurs so61Figure 4.11: Results on chromatic blur kernel. Left column: input imageblurred with a chromatic kernel (right bottom); Right column: our re-covered intrinsic image and blur kernel (right bottom).frequently that the optimization is effectively performed simultaneously over allchannels. Our algorithm begins with low weight for cross-channel prior at thecoarsest scale and doubles it when a new scale is started.We find that the cross-channel prior proposed in [43] improves deblurring per-formance even for achromatic kernels by suppressing color artifacts that wouldbe introduced by separate deblurring of each channel. Example comparisons areshown in Fig. 4.8, 4.9 and 4.10.624.3.2 Chromatic kernelsIt is further possible to extend the method to estimating chromatic kernels that oc-cur in sensor fusion where individual channels have unique exposure times (e.g. [60]).This is accomplished by separately updating each kernel channel, but using thecross-channel prior in Eq. 4.19 during the intrinsic image update at each scaleas described in Section 4.3.1. Synthetic examples of this strategy are shown inFig. Saturated or missing dataSaturated pixels are a common occurrence when taking photos with consumer cam-eras, as is missing or unreliable data due to lens debris. Deblurring data with satu-rated pixels often results in visually objectionable ringing artifacts since the captureprocess clamps the input data in a way that is not consistent with the image for-mation model, while for debris it may be preferable to mask out such regions andallow the deblurring algorithm to inpaint plausible content. In the following dis-cussion, we consider only the case of saturated blurry pixels, however the approachapplies equally well to lens debris.To handle such saturated pixels, our algorithm performs kernel estimation asusual using all non-saturated pixels. When reconstructing the final intrinsic image,previous work, including [36], simply uses a data term that omits saturated blurryimage pixels, leading to improved results over deblurring naively. However wehave found that a two-phase approach to intrinsic image estimation yields muchimproved results.The two phase algorithm divides the intrinsic image into two regions: a reliableregion which does not contribute to saturated blurry image pixels and an unreliableregion that contains pixels do contribute saturated blurry pixels. Four binary masksare defined:• mbs Saturated pixels in the blurred input.• mxv Mask of unreliable intrinsic pixels with a saturated pixel from mbs intheir support.• mxu Mask of reliable intrinsic pixels, the inverse mask of mxv .63(a) True intrinsic image(b) Input blurry image(c) Recovered intrinsic without two-phase approach(d) Recovered intrinsic with two-phase approachFigure 4.12: Results on partially saturated image. The dynamic range ofpixel intensities in the true intrinsic image is [0, 68]. The simulatedblurry image is clamped to 1. The proposed two-phase approach helpsreduce ringing artifacts near saturated pixels.64• mbd Mask of blurred pixels with a contribution from an unreliable pixel,i.e., where k⊗mxv 6= 0.Using these masks we perform the intrinsic image reconstruction in two phases.First the intrinsic image is estimated for reliable intrinsic image pixels in mxu,masking out data term contributions from unreliable blurred pixels in mbd by min-imizing:f(x) = ||(b− k⊗ x) · (1−mbd )||22 + θx · g(x) (4.20)This optimization outputs the estimated intrinsic image everywhere that the lin-ear image formation model holds, generating samples everywhere in the image asneeded to minimize both data term and the priors g(.). The second phase recon-structs the unreliable regions, leaving the intrinsic image in the reliable regions (i.e.in mxu) fixed.f(x) = ||(b− clip(k⊗ x)) ·mbd ||22 + θx · g(x) (4.21)The second phase only generates samples within the mask mxv . By performingthe reconstruction in this method, ringing is constrained to the non-reliable imageregion unlike in the typical approach where it can spread well beyond as a conse-quence of the data fitting term. Fig. 4.12 shows our results on synthetic partiallysaturated data. When dealing with color images, the proposed two-phase recon-struction is the same as described except that the masks vary in different colorchannels.4.3.4 Poisson noiseIn previous sections, we use quadratic fidelity in the objective by assuming whiteGaussian noise in the input images. Here we extend our algorithm to deal withimages containing Poisson noise, using the Anscombe transform [4]:Ansc(z) = 2√c · z + 3/8, (4.22)65(a) True intrinsic (b) Input blurry (21.23 dB)(c) Ours, Gaussian (24.24 dB) (d) Ours, Poisson (24.70 dB)Figure 4.13: Results on a synthetic image with Poisson noise (peak intensity= 500). Each subfigure contains an inset at the right-bottom cornerfor better view. (c) shows our result with Gaussian noise assumption(using Eq. 4.2, 4.13). (d) shows our result with Poisson noise assump-tion and the Anscombe transform (using Eq. 4.23, 4.24). The resultwith Poisson noise assumption recovers more details and is less noisyespecially in bright regions. We run both experiments with numerousparameters and select the results with highest PSNR.66where vector z represents the input image, c is a scalar for converting z to itscorresponding photon number.The Anscombe transform (denoted as Ansc(.)) converts Poisson noise to ap-proximately Gaussian noise. It allows us to use quadratic term for data-fitting errorin the transform domain, and thus fits our multi-scale framework. Similar transfor-mations are available for other noise models, including mixtures of Gaussian andPoisson noise, which are common in real images.Specifically, as shown in Eq. 4.23 and 4.24, we apply the Anscombe transformon the observed blurry image b before downsampling it into each scale s. Ateach scale, the data-fitting error is computed in the transform domain, while theregularizers on the intrinsic image g(xs) are still computed in regular domain. Thisis because all the intrinsic image priors were learned from natural images, and maynot hold well in the transform domain.f(xs) = ||(Ansc(b))s −Ansc(ks ⊗ xs))||22 + θx · g(xs) (4.23)f(ks) = ||∇h,v(Ansc(b))s −∇h,vAnsc(ks ⊗ xs)||22 + θk · g(ks) (4.24)Note that the data-fitting terms are non-convex now. In Fig. 4.13, we show asynthetic example with Poisson noise (peak intensity = 500). We run our frame-work with Gaussian noise assumption (using Eq. 4.2, 4.13), and with Poisson noiseassumption (using Eq. 4.23, 4.24). The latter produces visually and quantitativelybetter results.4.4 ResultsVisual comparisonsIn Fig. 4.8, 4.9, and 4.10 we compare the results of our algorithm with severalstate-of-the-art methods on real-world images. Our algorithm significantly reduceschromatic artifacts in the recovered intrinsic images. Fig. 4.11 shows our results onsimulated data with chromatic blur. Our algorithm recovers the chromatic kernelwell and produces a clean intrinsic image without color artifacts. Fig. 4.12 con-67tains results for partially saturated data. The proposed simple two-phase methodeffectively reduces the ringing artifacts near the saturated pixels. Fig. 4.13 showsour results with the proposed non-convex data-fitting error in Eq. 4.23 and 4.24.Qualitative comparisonsWe also tested our algorithm on a real-world database that contains both groundtruth intrinsic images and kernels [57]. The database consists of 4 images, each ofwhich is blurred with 12 kernels. 8 of the kernels are approximately uniform acrossthe image, while 4 kernels are both large and exhibit strong spatial variation. Thedataset provides 199 unblurred frames recorded during the camera motion trajec-tory for each image. We run the provided script to compute the PSNR value foreach of our results. The script first estimate the optimal intensity scaling and trans-lation between the recovered image and the unblurred reference image such thattheir `2 error over three color channels becomes minimal. PSNR is then computedusing these calibrated images. The final PSNR values reported in the paper is de-fined as the maximum PSNR between the recovered image and any of the 199unblurred reference images along the trajectory.In Table 4.1, we show the PSNR values averaged over all 4 images for each ker-nel by our algorithm, and compare with Fergus et al [27], Shan et al [98], Cho etal [20], Xu and Jia [113], Krishnan et al [59], Whyte et al [106], Hirsch et al [46]and Xu et al [1]. The downloadable software by Xu et al [1] incorporates tech-nologies proposed in [113] and [114]. We adjusted the parameters in their softwareto produce the best results possible. For the other methods we use the publishedPSNR results directly from the database [57]. The PSNR value and recovered in-trinsic image for each image can be found in the supplementary material.On mostly spatially-invariant kernels (#1-7 and #12), our algorithm producesresults that are either close to or better than the best published methods. We noticethat the state-of-the-art Xu et al [1]’s results sometimes look sharper than ours,but are actually oversharpened at the edges and thus have lower PSNRs. We showexamples in Fig. 4.14, where their results even look sharper than the ground truthand contain halo artifacts at the edge pixels. This may be caused by the use ofexplicit shock filter and bilateral filter in their kernel estimation.68(a) Ours (38.82dB) (b) Xu [1] (33.34dB) (c) Reference(d) Ours (33.29dB) (e) Xu [1] (30.72dB) (f) ReferenceFigure 4.14: Comparison on image #1 with kernel #3 and image #4 with ker-nel #4 from the dataset [57]. The filter-based method Xu et al [1]over-sharpens the image and creates halo artifacts near the edges (bet-ter view on screen).69Table 4.1: PSNR (dB) comparisons on the benchmark dataset [57]. Onmostly spatially-invariant kernels (#1-7 and #12), our algorithm producesresults close to or better than the state-of-the-art methods. Our algorithmfails to produce the best results on the extremely large and spatially-variant kernels #8-11 (with size over 141 by 141 pixels). The resolutionof each input image is 800 by 800 pixels. Please see Section 4.4 for moredetails.Kernel 01 Kernel 02 Kernel 03 Kernel 04 Kernel 05 Kernel 06 Kernel 07 Kernel 12Kernel width 35 35 17 35 35 35 35 49Input 24.89 27.85 32.34 28.09 29.22 25.22 24.94 23.85Fergus [27] 20.63 29.38 31.51 29.48 25.76 22.34 20.70 20.11Shan [98] 29.42 28.28 30.77 28.60 30.04 26.85 26.83 23.76Cho [20] 32.49 31.86 31.44 30.89 32.38 30.01 30.91 28.98Xu and Jia [113] 32.45 32.58 32.22 32.01 32.98 30.43 31.40 29.51Krishnan [59] 31.57 31.09 31.67 30.77 30.58 24.59 25.80 23.32Whyte [106] 32.08 32.20 34.61 32.10 32.54 30.89 29.06 28.21Hirsch [46] 32.04 30.09 33.94 32.13 32.82 29.53 29.32 26.81Xu et al [1] 31.94 31.71 31.42 31.30 31.78 30.06 30.87 29.18Ours 32.65 32.68 35.35 33.74 32.56 31.74 30.96 30.12Kernel 08 Kernel 09 Kernel 10 Kernel 11Kernel width 141 141 141 141Input 19.55 19.82 20.50 22.90Fergus [27] 17.93 18.87 18.72 16.95Shan [98] 19.19 22.90 20.49 23.56Cho [20] 22.34 28.00 23.96 24.54Xu and Jia [113] 22.54 28.35 24.12 25.87Krishnan [59] 15.35 22.14 20.15 21.68Whyte [106] 19.07 19.80 20.38 23.19Hirsch [46] 19.49 23.50 20.19 23.40Xu et al [1] 19.52 26.46 21.77 25.77Ours 19.79 21.16 21.02 22.8870All methods perform significantly worse on the spatially-variant kernels #8-11. Our software does not currently deal with spatially-variant kernels, and insteadrecovers an average kernel for the whole image. As a result, our method only pro-duces PSNR values in the middle of the field for these kernels. Fixing this problemwould require cutting the image into tiles, solving a blind deconvolution problemfor each tile, realigning the resulting tiles (since blind deconvolution can introducean offset in the kernel and intrinsic image), and stitching the results back together.This should be easily feasible in the future, but is not currently implemented.Computational costWe compared the runtime with two state-of-the-art methods, Cho et al [20] andXu et al [1], using the executable files provided by the authors. The method byCho et al [20] requires only a few seconds per megapixel, but their results areconsistently worse than ours (see Fig. 4.8, 4.9 and 4.10 and Table 4.1). In Fig. 4.8for an 848 by 636 blurry/noisy RGB image and a 19 by 19 achromatic kernel, ourmethod requires 197.0 seconds in total (140.4 seconds for blind kernel estimation,and 56.6 seconds for non-blind deconvolution). Xu et al’s method [1] is relativelyfaster than ours at 121.9 seconds, but their results show suffer from more artifacts.We also run our algorithm on different size images and kernels and report theruntime in Table 4.2. The code was compiled with gcc and the experiments weredone on an Intel i7 CPU with 16GB RAM. The result images and parameters areshown in the supplementary document.Influence of the noiseTo test the influence of noise on the performance, we run our algorithm on syntheticdata with various noise levels. The parameters are tuned roughly for the results.The results are shown in Fig. 4.15.Priors and parameter selectionOur framework allows us to easily adapt any priors or data-fitting term in the ob-jective function. We use smoothness (Eq. 4.16), sparsity (Eq. 4.17, p = 0.8) andcontinuity (Eq. 4.18) as kernel priors for all results in the paper. We use sparse71Table 4.2: Run-time analysis. The reported time are in seconds. We run ourunoptimized code on images with pixel resolution 400×400, 800×800,1200 × 1200, and kernels with pixel resolution 15 × 15, 25 × 25, 35 ×35. The blurry images are simulated with resized intrinsic image andblur kernel at each resolution. (a) shows the experiments on gray-scaledimages. (b) shows the experiments on RGB images, where all channelswere recovered simultaneously. We increased the number of proposedsamples as the image size and kernel size increase. The result imagesand parameters are shown in the supplementary.(a) Experiments on gray-scaled imagesImagewidthKernelwidthMultiscale estimation FinalrestorationTotalIntrinsic update Kernel update40015 4.68 7.61 2.92 15.7125 13.44 14.58 5.45 34.0335 23.24 21.44 15.74 61.1280015 11.53 54.64 7.05 74.9325 19.94 59.64 14.24 95.4835 53.81 80.31 37.03 172.99120015 18.15 132.56 13.06 167.8525 30.63 139.03 25.81 199.0935 75.55 186.19 53.85 319.42(b) Experiments on color imagesImagewidthKernelwidthMultiscale estimation FinalrestorationTotalIntrinsic update Kernel update40015 13.02 25.96 8.13 47.6225 36.67 49.74 16.24 103.2135 65.02 72.77 47.33 185.8180015 32.58 180.57 19.94 234.8125 56.32 187.81 40.33 286.0435 155.47 439.09 116.72 713.16120015 48.99 402.35 35.83 491.0925 83.42 408.32 71.85 567.1235 217.86 594.26 154.54 971.1472(a) σ = 0 (b) σ = 0.01 (c) σ = 0.02(d) σ = 0.03 (e) σ = 0.04 (f) σ = 0.05Figure 4.15: Results on synthetic data with difference levels of white Gaus-sian noise. In subfigure (a-f), the standard derivation (σ) of the noise isset to be 0, 0.01, 0.02, 0.03, 0.04 and 0.05 respectively. In each subfig-ure, the 1st row shows the input blurry image, and the 2nd row showsour deblurred result. The PSNR values are shown at the left-top cornerof each image. An inset is shown at the right-bottom corner of eachimage for better view.73gradient (Eq. 4.11, p = 0.6 or 0.8), sparse second-order derivatives (Eq. 4.12, p =1) and cross-channel prior (Eq. 4.19, p = 1) as intrinsic image priors for Fig. 4.7,4.8, 4.9, 4.10 and Table 4.1. And we use isotropic total variation (Eq. 4.9) andcross-channel prior (Eq. 4.19, p = 1) as intrinsic image priors for Fig. 4.11.Regarding the number of iterations and sample mutations in our experiments,we usually use T (in Algorithm 1) as 5-10, N (in Algorithm 2) as 5 for bothintrinsic and kernel updating, andM (in Algorithm 2) as 10000-50000 for intrinsicupdating and 100-500 for kernel updating in the multi-scale process. In the finalintrinsic image restoration step (line 18, Algorithm 4), we usually use N as 5,and M as 100000-500000. These numbers are proportionally adjusted with theresolution of input image and kernel for better convergence.The prior weights θx and θk are roughly tuned for best results. In our experi-ments we set the initial and minimum-threshold weight of sparse gradient as 0.05-0.5 and 0.0005-0.002, the initial and maximum-threshold weight of cross-channelprior as 0.0001 and 0.0005 (see Section 4.2.3 for the strategy of weights updating).In the final intrinsic image restoration step, we set the weight of sparse gradientand cross-channel prior as 0.0005-0.002 and 0.0005- Conclusion and future workIn this paper we present an attempt using simple random search technique for com-plex imaging problems: a simple and effective algorithm for blind motion deblur-ring from a single input image. We propose to use cross-channel information toreduce chromatic artifacts in the estimated intrinsic images and to recover chro-matic blur kernels. We also propose a two-phase method to reduce ringing artifactswhen deblurring saturated or missing pixels. Furthermore, we propose to use anon-convex data-fitting term to deal with Poisson noisy images.Our algorithm provides an easy-to-use framework for blind deconvolution prob-lems. It allows us to easily test new priors for both the kernel and the intrinsic im-age. This kind of experimentation would be much harder with other optimizationmethods.The computational efficiency of our method is below those highly optimizedspecialized solvers. However, such solvers typically include only a single regular-74(a) Blurry input (b) Cho et al [20] (c) Xu et al [1] (d) OursFigure 4.16: Results on more real data from [20] and [1].ization term, whereas we can easily combine many, for improved image quality.In the future, we would like to extend our algorithm to handle spatially variantkernels with the method outlined above. We also would like to further improve onthe handling of saturated pixels. We observe that pixels saturated in one channelmight not be saturated in another. By employing the cross-channel informationtogether with neighboring pixels, we should be able to recover the saturated pixelsbetter.75Chapter 5Document Image Deblurring5.1 IntroductionIn this chapter, we propose a new algorithm for practical document deblurringthat achieves both high quality and high efficiency. In contrast to previous worksrelying on low-order filter statistics, our algorithm aims to capture the domain-specific property of document images by learning a series of scale- and iteration-wise high-order filters. A motivational example is shown in Fig. 5.1, where wecompare small patches extracted from a natural image, a large-font text image anda common text document image. Since most deblurring methods adopt a multi-scale framework in order to avoid bad local optima, we compare patches extractedfrom multiple scales. Evidently, the natural image and large-font text image bothcontain long, clear edges at all scales, making the use of sparse gradient priorseffective. In contrast, patches from the document image with a small font size aremostly composed of small-scale high-order structures, especially at coarse scales,which makes sparse gradient priors to be inaccurate. This observation motivates usto use high-order filter statistics as effective regularization for deblurring documentimages. We use a discriminative approach and learn such regularization terms bytraining a multi-scale, interleaved cascade of shrinkage field models [92], whichwas recently proposed as an effective tool for image restoration.76300x300 pixels150x150100x100Figure 5.1: Visual comparison between a natural image (left), a large-fonttext image (middle) and a common text document image at 150 PPI(right) at various scales.Our main contributions include:• We demonstrate the importance of using high-order filters in text documentimage restoration.• We propose a new algorithm for fast and high-quality deblurring of docu-ment photographs, suitable for processing high resolution images capturedby modern mobile devices.• Unlike the recent Convolutional Neural Network (CNN) based documentdeblurring method [48], our approach is robust to page orientation, font styleand text language, even though such variants are not included at our training.5.2 MethodAs in most previous work, we assume a simple image formation model for eachlocal text region asb = Kx + n, (5.1)where b represents the degraded image, x the sharp latent image, matrix K thecorresponding 2D convolution with blur kernel k, and n white Gaussian noise.77The goal of the post-processing is to recover x and k from single input b.The Shrinkage Field (SF) model has been recently proposed as an effective andefficient tool for image restoration [92]. It has been successfully applied to bothimage denoising and non-blind image deconvolution, producing state-of-the-art re-sults while maintaining high computational efficiency. Motivated by this success,we adopt the shrinkage field model for the challenging problem of blind deblur-ring of document images. In particular, we propose a multi-scale, interleaved CSFwhich estimates the unknown blur kernel while progressively refining the estima-tion of the latent image. This is also partly inspired by [90], which proposes aninterleaved cascade of RTF to post-improve the results of state-of-the-art naturalimage deblurring methods. However, in contrast to [90], our method does not de-pend on an initial kernel estimation from an auxiliary method. Instead, we estimateboth the unknown blur kernel and latent sharp image from a single blurry input im-age.5.2.1 Cascade of shrinkage fieldsThe shrinkage field model can be derived from the FoE model [87]:argminxD(x,b) +∑Ni=1ρi(Fix), (5.2)where D represents the data fidelity given measurement b, matrix Fi representsthe corresponding 2D convolution with filter fi, and ρi is the penalty on the filterresponse. Half-quadratic optimization [30], a popular approach for the optimiza-tion of common random field models, introduces auxiliary variables ui for all filterresponses Fix and replaces the energy optimization problem in Eq. 5.2 with aquadratic relaxation:argminx,uD(x,b) +∑Ni=1(β||Fix− ui||22 + ρi(ui)), (5.3)which for β →∞ converges to the original problem in Eq. 5.2. The key insight of[92] is that the minimizer of the second term w.r.t. ui can be replaced by a flexible1D shrinkage function ψi of filter response Fix.Different from standard random fields which are parameterized through po-78tential functions, SF models the shrinkage functions associated with the potentialdirectly. Given data formation model as in Eq. 5.1, this reduces the original op-timization problem Eq. 5.2 to a single quadratic minimization problem in eachiteration, which can be solved efficiently asxt = F−1[F(KTt−1b + λt∑Ni=1 FtiTψti(Ftixt−1))F(KTt−1) · F(Kt−1) + λt∑Ni=1F(FtiT) · F(Fti)], (5.4)where t is iteration index, K is the blur kernel matrix, F and F−1 indicate Fouriertransform and its inverse, and ψi the shrinkage function. The model parametersΘt = (f ti , ψti , λt) are trained by loss-minimization, e.g. by minimizing the `2 errorbetween estimated images xt and the ground truth. Performing multiple predic-tions of Eq. 5.4 is known as a cascade of shrinkage fields. For more details on theshrinkage fields model we refer readers to the supplemental material and [92].5.2.2 Multi-scale interleaved CSF for blind deconvolutionWe do not follow the commonly used two-step deblurring procedure where ker-nel estimation and final latent image recovery are separated. Instead, we learnan interleaved CSF that directly produces both the estimated blur kernel and thepredicted latent image. Our interleaved CSF is obtained by stacking multiple SFsinto a cascade that is intermitted by kernel refinement steps. This cascade gener-ates a sequence of iteratively refined blur kernel and latent image estimates, i.e.{kt}t=1,..,T and {xt}t=1,..,T respectively. At each stage of the cascade, we employa separately trained SF model for sharp image restoration. In addition, we learnan auxiliary SF model which generates a latent image zt that is used to facilitateblur kernel estimation. The reason of including this extra SF model at each stage isto allow for selecting features that might benefit kernel estimation and eliminatingother features and artifacts. Note that the idea of introducing such a latent featureimage for improving kernel estimation is not new, and is a rather common practicein recent state-of-the-art blind deconvolution methods [20, 113]. Fig. 5.2 depicts aschematic illustration of a single stage of our interleaved CSF approach.More specifically, given the input image b, our method recovers k and x si-79Figure 5.2: Algorithm architecture.multaneously by solving the following optimization problem:(x,k) = argminx,k||b− k⊗ x||22 +∑Ni=1ρi(Fix) + τ ||k||22,s.t. k ≥ 0, ||k||1 = 1(5.5)To this end, our proposed interleaved CSF alternates between the following blurkernel and latent image estimation steps:Update xt. For sharp image update we train a SF model with parameters Θt =(f ti , ψti , λt). Analogously to Eq. 5.4 we obtain the following update for xt atiteration t:xt = F−1[F(KTt−1b + λt∑Ni=1 FtiTψti(Ftizt−1))F(KTt−1) · F(Kt−1) + λt∑Ni=1F(FtiT) · F(Fti)](5.6)Update zt and kt. For kernel estimation we first update the latent image zt fromxt by learning a separate SF model. Denoting convolution with filter gti bymatrix Gti, we have:zt = F−1[F(KTt−1b + ηt∑Ni=1 GtiTφti(Gtixt))F(KTt−1) · F(Kt−1) + ηt∑Ni=1F(GtiT) · F(Gti)](5.7)For kernel estimation we employ a simple Tikhonov prior. Given the esti-mated latent image zt and the blurry input image b, the update for kt reads:kt = F−1[ F(zt)∗ · F(b)F(zt)∗ · F(zt) + τ t], (5.8)80where ∗ indicates complex conjugate. The model parameters learned at thisstep are denoted as Ωt = (gti, φti, ηt, τ t). Note that Ωt are trained to facilitatethe update of both kernel kt and image zt.The xt update step in Eq. 5.6 takes zt−1 rather than xt−1 as input, as zt−1 improvesfrom xt−1 w.r.t. removing blur by Eq. 5.7 at iteration t− 1. xt and zt is observedto converge as the latent image and kernel are recovered.Algorithm 6 Blind deblurring at one scaleInput: blurry image bOutput: estimated image x and kernel k.1: for t = 1 to 5 do2: Update xt by Eq. 5.6.3: Update zt by Eq. 5.7.4: Update kt by Eq. 5.8.5: kt = max(0,kt),kt = kt/||kt||1.6: end forAlgorithm 6 summarizes the proposed approach for blind deblurring of doc-ument images. Note that there is translation and scaling ambiguity between thesharp image and blur kernel at blind deconvolution. The estimated kernel is nor-malized such that all its pixel values sum up to one. In Algorithm 7 for training, xtis shifted to better align with the ground truth image x¯, before updating k. We findthat our algorithm usually converges in 5 iterations per scale.5.2.3 LearningOur interleaved CSF has two sets of model parameters at every stage t = 1, .., 5,one for sharp image restoration, Θt = (f ti , ψti , λt), and the other for blur kernelestimation, Ωt = (gti, φti, ηt, τ t). All model parameters are learned through loss-minimization.Note that in addition to the blurry input image, each model receives also theprevious image and blur kernel predictions as input, which are progressively re-fined at each iteration. This is in contrast to the non-blind deconvolution setting of[92], where the blur kernel is known and is kept fixed throughout all stages. Ourinterleaved CSF model is trained in a greedy fashion, i.e. stage by stage such that81Algorithm 7 Learning at one scaleInput: blurry image b; true image x¯; true kernel k¯.Output: model parameters (f ti , ψti , λt,gti, φti, ηt, τ t)1: for t = 1 to 5 do2: Train model parameters: (f ti , ψti , λt) to minimize ||xt − x¯||22 with gradientgiven in Eq. 5.9.3: Update xt by Eq. 5.6.4: Shift xt to better align with x¯.5: Train model parameters: (gti, φti, ηt, τ t) to minimize ||kt−k¯||22+α||zt−x¯||22with gradient given in Eq. 5.10.6: Update zt by Eq. 5.7.7: Update kt by Eq. 5.8.8: kt = max(0,kt),kt = kt/||kt||1.9: end forthe learned SF models at one stage are able to adapt to the kernel and latent imageestimated at the previous stage.More specifically, at each stage we update our model parameters by iteratingbetween the following two steps:Update xt. To learn the model parameters Θt, we minimize the `2 error betweenthe current image estimate and the ground truth image x¯, i.e. ` = ||xt− x¯||22.Its gradient w.r.t. the model parameters Θt = (f ti , ψti , λt) can be readilycomputed as∂`Θt=∂xt∂Θt∂`xt(5.9)The derivatives for specific model parameters are omitted here for brevity,but can be found in Appendix A.2.Update zt and kt. The model parameters Ωt of the SF models for kernel estima-tion at stage t are learned by minimizing the loss function ` = ||kt − k¯||22 +α||zt − x¯||22, where k¯ denotes the ground truth blur kernel and α is a cou-pling constant. This loss accounts for errors in the kernel but also prevents82blurry inputtrue kernelx at scale 1, iter 5 x at scale 2, iter 5 x at scale 3, iter 5 x at scale 4, iter 5k at scale 1, iter 5 k at scale 2, iter 5 k at scale 3, iter 5 k at scale 4, iter 5Figure 5.3: Example intermediate results of our algorithm on a synthetic testimage.the latent image used in Eq. (5.8) to diverge. Its gradient w.r.t. the modelparameters Ωt = (gti, φti, ηt, τ t) reads∂`∂Ωt=∂zt∂Ωt∂kt∂zt∂`∂kt+∂kt∂Ωt∂`∂kt+∂zt∂Ωt∂`∂zt(5.10)Again, details for the computation of the derivatives w.r.t. to specific modelparameters are included in the supplemental material. We want to point outthat the kernel estimation error ||kt − k¯||22 is back-propagated to the modelparameters (gti, φti, ηt) in the SF for zt. Hence, the latent image zt is tailoredfor accurate kernel estimation and predicted such that the refinement in kt ineach iteration is optimal. This differs from related work in [90, 119].Multi-scale approachOur algorithm uses a multi-scale approach to prevent bad local optima. The kernelwidths that are used at different scales are 5, 9, 17, 25 pixels. At each scale s, theblurry image bs, the true latent image x¯s and k¯s are downsampled (and normalizedfor k¯s) from their original resolution. The scale index s is omitted for convenience.At the beginning of each scale s > 1, the estimated image x is initialized by bicubicupsampling its estimation at the previous scale, and the blur kernel k is initialized83blurry input x at scale 1, iter 5 x at scale 2, iter 5 x at scale 3, iter 5k at scale 1, iter 5 k at scale 2, iter 5 k at scale 3, iter 5Figure 5.4: Example intermediate results (cropped) of our algorithm on areal-world test image (shown in Fig.5.8 in Chapter 5).by nearest-neighbor upsampling, followed by re-normalization. At the coarsestscale s = 1, x is initialized as b and k is initialized as a delta peak. The couplingconstant α in kernel estimation loss is defined as α = r · η, where r is the ratiobetween pixel numbers in kernel kt and image zt at current scale, η is initializedwith 1 at the coarsest scale and at each subsequent scale it is multiplied by a factorof 0.25. Algorithm 7 summarizes our learning procedure for a single scale of ourCSF model. Fig. 5.3 and 5.4 shows intermediate results of our estimated imagex and kernel k at each scale. Note that our algorithm simultaneously estimatesthe latent image and blur kernel, and does not need extra non-blind deconvolutionsteps.Model complexityIn both the model Θt for xt and model Ωt for (zt, kt), we choose to use 24 filtersf ti of size 5× 5 for trade-off between result quality, model complexity and time ef-ficiency. As in [92], we initialize the filters with a DCT filter bank. Each shrinkagefunction ψti and φti are composed of 51 equidistant-positioned radial basis func-tions (RBFs) and are initialized as identity function. We further enforce centralsymmetry to the shrinkage functions, so that the number of trainable RBFs reduces84(a) Learned filters f ti and shrinkage functions ψti in Eq. 5.6.(b) Learned filters gti and shrinkage functions φti in Eq. 5.7.Figure 5.5: Learned filters and shrinkage functions (at 3rd scale, 1st itera-tion) for updating xt (Eq. 5.6) and zt, kt (Eq. 5.7), respectively. Otherparameters learned at this iteration: λt=0.5757, ηt=0.0218, τ t=0.0018.by half to 25. Fig. 5.5 visualizes some learned models.Training datasetsWe have found that that our method works well with a relatively small trainingdataset without over-fitting. We collected 20 motion blur kernels from [92], andrandomly rotated them to generate 60 different kernels. We collected 60 sharppatches of 250 × 250 pixels cropped from documents rendered around 175 PPI,and rotated each with a random angle between -4 and 4 degrees. We then generated60 blurry images by convolving each pair of sharp image and kernel, followed byadding white Gaussian noise and quantizing to 8 bits. Fig. 5.6 visualizes exampleblur kernels and images used at our training. We used the L-BFGS solver [91] inMatlab for training, which took about 12 hours on a desktop with an Intel XeonCPU.85Figure 5.6: Example kernels and images used at training of our algorithm.BlurryXuPanHradisOursFigure 5.7: Comparison on a real image taken from [48]. Row 1-5 from topto bottom show the blurry image, result of Xu [1], Pan [78], Hradisˇ [48]and our method. Two cropped regions are shown here, the full resolutionresults along with more examples can be found in the supplemental.5.3 ResultsIn this section we evaluate the proposed algorithm on both synthetic and real-worldimages. We compare with Pan [78] and Hradisˇ [48], the state-of-the-art meth-ods for text image blind deblurring, and the natural image deblurring softwareproduced by Xu [1], which are based on recently proposed state-of-the-art tech-niques [113, 114]. We used the code and binaries provided by the authors andtuned the parameters to generate the best possible results.86BlurryPanHradisOursFigure 5.8: Comparison on a real image taken from [48]. Row 1-4 from topto bottom show the blurry image, result of Pan [78], Hradisˇ [48] andour method. Two cropped regions are shown, the full resolution resultsalong with more results can be found in the supplemental.Real-world imagesIn Fig. 5.7 and 5.8 we show comparisons on real images. The result images ofXu [1] and Pan [78] contain obvious artifacts due to ineffective image priors thatlead to inaccurate kernel estimation. Hradisˇ [48] fails to recover many charactersand distorted the font type and illumination. Our method produces the best resultsin these cases, and our results are both visually pleasing and highly legible. Thefull resolution images and more results can be found on the author webpage.Quantitative comparisonsFor quantitative evaluation, we test all methods on a synthetic dataset and compareresults in terms of PSNR. We collect 8 sharp document images with 250×250 pix-els cropped from documents rendered at 150 PPI (similar PPI as used for trainingin [48]). Each image is blurred with 8 kernels at 25×25 collected from [63], fol-lowed by adding 1% Gaussian noise and 8-bit quantization. In Fig. 5.9, we showthe average PSNR values of all 8 test images synthesized with the same blur kernel.Our method outperforms other methods in all cases by 0.5-6.0 dB. Hradisˇ [48] has8705101520251 2 3 4 5 6 7 8PSNR (dB)Xu Pan Hradis Ours0%1%10%100%1 2 3 4 5 6 7 8OCR character error ratekernel kernelFigure 5.9: PSNR and OCR comparison on a synthetic test dataset with 8blur kernels.(a) Blurry (b) Pan [78] (c) Hradisˇ[48] (d) Ours (e) GTFigure 5.10: Comparison on synthetic images from the PSNR experiments inFig. 5.9. Note that the original results of [48] break the illumination ofthe images. We clamp the intensity of their results to match the groundtruth image before computing the PSNR values.close performance to ours on kernel #3, which is close to defocus blur. It also per-forms reasonably well on kernel #6 which features a simple motion path, but failson other more challenging kernels. Some results along with the estimated kernelsare shown in Fig. 5.10 for visual comparison.An interesting question one may ask is whether improved deblur can directlylead to better OCR accuracy. To answer this question we evaluate OCR accuracyusing the software ABBYY FineReader 12. We collected 8 sharp document images88Table 5.1: Run-time comparison (in seconds).Image size 2562 5122 10242Xu [1] (C++) 14.8 33.4 -Pan [78] (Matlab) 19.6 84.3 271.9Hradisˇ [48] (C++) 48.5 193.7 594.9Hradisˇ [48] (GPU) 0.3 1.0 3.1Ours (Matlab) 2.0 3.9 11.4Pre-computation (Matlab) 1.8 4.6 15.3from the OCR test dataset in [48]. Each document image contains a continuousparagraph. We synthesized 64 blurry images with the 8 kernels and 1% Gaussiannoise similarly as in the PSNR comparison. We run the OCR software and usedthe script provided by [48] to compute the average character error rate for all 8 testimages synthesized with the same kernel1. The results are shown in Fig. 5.9. Theyare consistent with the PSNR results also in Fig. 5.9. Hradisˇ [48] performs wellon kernel #3 and #6 but fails on other challenging kernels, while our method isconsistently better than others. All the test images and results for PSNR and OCRcomparisons are included in the supplemental material.Run-time comparisonTable 5.1 provides a comparison on computational efficiency, using images blurredby a 17×17 kernel at three different resolutions. The experiments were done onan Intel i7 CPU with 16GB RAM and a GeForce GTX TITAN GPU. Assumingthe image sensor resolution is a known priori2, we pre-compute the FFTs of thetrained filters fi and gi for maximal efficiency. We report the timing of our Matlabimplementation on CPU. A GPU implementation should significantly reduce thetime as our method only requires FFT, 2D convolution and 1D look-up-table (LUT)operations, which is our future work.1We used the script “eval.py” downloaded from the author webpage [48] to compute the error rate.2This is a common assumption especially for batch processing of document images.89(a) Blurry (b) Hradisˇ[48] (c) Ours (d) GTFigure 5.11: Comparison on non-English text and severely rotated images.Note that such non-English text and large rotation were not includedin our training dataset.RobustnessIn Fig. 5.11, we show results on non-English text and severely rotated image. Al-though both Hradisˇ [48] and our method are only trained on English text data, ourmethod can be applied to non-English text as well. This is a great benefit of ourmethod as we do not need to train on every different language, or increase themodel complexity to handle them as [48] would need to do. Our method is alsorobust against a significant change of page orientation, which cannot be handledwell by [48].In Fig. 5.12, we show the results of our method when the noise level and PPI ofthe test data differs from the training data. Fig. 5.12(a) shows that the performanceof our method is fairly steady when the noise level in the test images is not toomuch higher than that of the training data, meaning that the models trained atsparse noise levels are sufficient for practical use. Fig. 5.12(b) shows that ourmethod works well in a fairly broad range of image PPIs given the training data arearound 175 PPI.In Fig. 5.13, we show a comparison on a real image with large-font text. Fol-908101214161820221% 2% 3% 4% 5%PSNR (dB)noise level of test imagestraining noise 1%training noise 3%training noise 5%(a)81012141618202224125 150 175 200 225 250PSNR (dB)PPI of test images(b)Figure 5.12: Robustness test on noise level and image PPI.(a) Blurry (b) Xu [1] (c) Pan [78] (d) Hradisˇ[48] (e) OursFigure 5.13: Comparison on a real image with large-font text. The referenceresults are from [48]. Following [48], the input of (d) Hradisˇ’ and (e)our method was downsampled by factor of 3.lowing [48], the input of Hradisˇ’ and our method was downsampled by factor of 3in order to apply the trained models without re-training. Although such downsam-pling breaks the image formation model in Eq. 5.1, our method can still generatereasonable result.Non-uniform blurOur method can be easily extended to handle non-uniform blur by dividing theimage into overlapped tiles, deblurring each tile with our proposed algorithm, andthen realigning the resulting tiles to generate the final estimated image. An exampleis shown in Fig. 5.14.Documents with color figuresOur algorithm can be easily extended to handle the documents with color figures.We first run the blind deblurring algorithm on text regions to recover the latent text91(a) Blurry(b) Hradisˇ [48](c) Ours(d) Our estimated kernel(e) Ground truth kernelFigure 5.14: Results on spatially-varying blur kernels. The blurry input issynthesized with the EFF model [47] to approximate practical pixel-wise variant blur.92(a) Blurry input (b) Our deblurred imageFigure 5.15: Result on example document containing color figures. Theright-bottom corner in (b) shows the blur kernel estimated from thetext regions. In this example we simply use [58] for the non-blind de-blurring step, although [92] can be used instead for improved results.images and blur kernels. Then we take the kernels estimated from the text regionsaround the color figures (optionally interpolate the kernels using the Efficient FilterFlow (EFF) method [47] for spatially-varying blur), and deblur the color figureregions as non-blind deconvolution. Finally, we align the text and figure regions togenerate the final image. An example is shown in in Fig. 5.15.It is a benefit that our algorithm jointly estimates the latent image and theblur kernel, and the latter can be further used for deblurring non-text regions non-blindly. Hradisˇ et al [48] does not recover the blur kernel thus cannot handle thefigure regions in the document.5.4 Conclusion and future workIn this chapter we present a new algorithm for fast and high-quality blind decon-volution of document photographs. Our key idea is to to use high-order filters fordocument image regularization, and propose to learn such filters and influencesfrom training data using multi-scale, interleaved cascade of shrinkage field mod-93els. Extensive experiments demonstrate that our approach not only produces higherquality results than the state-of-the-art methods, but is also computational efficient,and robust against noise level, language and page orientation changes that are notincluded in the training data.Our method also has some limitations. It cannot fully recover the details ofan image if it is degraded by large out-of-focus blur. In such case, Hradisˇ [48]may outperform our method given its excellent synthesis ability. As future work itwould be interesting to combine both approaches. Although we only show learningour model on document photographs, we believe such a framework can also beapplied to other domain-specific images, which we plan to explore in the future.94Chapter 6Learning Proximal Operators forImage Restoration6.1 IntroductionRecently several discriminative learning approaches [18, 92] have been proposedfor effective image restoration, achieving convincing trade-off between image qual-ity and computational efficiency. However, these methods require separate trainingfor each restoration task and problem condition. This makes it time-consuming anddifficult to encompass all tasks and conditions during training. In this chapter, wecombine discriminative learning technique with formal optimization methods tolearn generic priors that truly share across problem domains. Our models require asingle-pass training and allow reuse across various problems and conditions whileachieving comparable efficiency as previous discriminative approaches.In particular, we make the following contributions:• We propose proximal fields as a convolutional model for image priors thatare computationally cheap to train and are shared across different imagerestoration tasks and problem conditions.• Proximal fields are formulated as proximal operators, allowing their use inadvanced proximal optimization algorithms.95• We show that our approach is general by demonstrating proximal fields fordiverse low-level problems, such as denoising, deconvolution and inpainting,for varying noise settings.• We show that our method can naturally be combined with existing likelihoodand priors after being trained, to cover unseen tasks and to further improvereconstruction quality.6.2 MethodThe seminal work of FoE [87] generalizes the form of filter response based regu-larizers in the objective function given in Eq. 6.1. Vector b and x represents theobserved and latent (desired) image respectively, matrix A is the sensing opera-tor, matrix Fi represents 2D convolution with filter fi, and function φi representsthe penalty on corresponding filter response Fix. The positive scalar λ controlsrelative weight between the data fidelity (likelihood) and the regularization term.λ2||b−Ax||22 +N∑i=1φi(Fix) (6.1)The well-known anisotropic total-variation regularizer can be viewed as a specialcase of the FoE model where fi is the derivative operator∇ and φi the `1 norm.It is difficult to directly minimize Eq. 6.1 when the penalty function φi is non-linear and/or non-smooth (e.g., `p norm, 0 < p ≤ 1). Proximal algorithms [10,15, 30] instead, relax Eq. 6.1 and split the original problem into several easiersubproblems that are solved alternately until convergence.In this paper we employ the HQS algorithm [30] to relax Eq. 6.1, as it typicallyrequires much fewer iterations to converge compared with other proximal meth-ods such as ADMM [10] and PD [15]. The relaxed objective function is given inEq. 6.2:λ2||b−Ax||22 +ρ2||z− x||22 +N∑i=1φi(Fiz), (6.2)where a slack variable z is introduced to approximate x, and ρ is a positive scalar.96While most related approaches [58, 92] relax Eq. 6.1 by splitting on Fix ratherthan x, it would limit the model flexibility in our method. This will be explainedmore clearly in the next sections.With the HQS algorithm, Eq. 6.2 is iteratively minimized by solving for theslack variable z and the latent image x alternately as in Eq. 6.3 and 6.4 (t =1, 2, ..., T ).zt = argminz(ρt2||z− xt−1||22 +N∑i=1φi(Fiz)), (6.3)xt = argminx(λ||b−Ax||22 + ρt||zt − x||22), (6.4)where ρt increases as the iteration continues. The latter forces z to become anincreasingly good approximation of x, thus making Eq. 6.2 an increasingly goodproxy for Eq. Proximal fieldsThe zt-update step in Eq. 6.3 can be viewed as a proximal operation:zt := proxΘ(xt−1, ρt), (6.5)where proxΘ is called proximal operator with model parameters Θ, which in-cludes a number of filters fi and corresponding penalty functions φi. To distinguishit from traditional proximal operators which typically contain a single filter, we callour generalized proximal operators proxΘ as proximal fields.Inspired by the state-of-the-art discriminative methods [18, 92], we propose tolearn the proximal fields model proxΘ and the weight scalar λ from training data.With the above HQS relaxation, the image prior and data-fidelity term in the origi-nal objective (Eq. 6.1) are contained in two separate subproblems (Eq. 6.3 and 6.4).This makes it possible to train an ensemble of diverse tasks (e.g., denoising, deblur-ring, or with different noise levels) each of which has its own data fidelity term andweight λ, while learning a single prior model proxΘ that is shared across ensem-ble tasks. This is in contrast to state-of-the-art discriminative methods such asCSF [92] and TRD [18] which train separate priors for each task.97The proximal operator proxΘ(xt−1, ρt) can be interpreted as processing xt−1corrupted by zero-mean Gaussian noise. With this interpretation, we propose todefine proxΘ(xt−1) as a multi-stage non-linear diffusion process modified fromthe TRD [18] model, as given in Eq. 6.6.ztk = ztk−1 −N∑i=1FkiTψki (Fki ztk−1),s.t. zt0 = xt−1, k = 1, 2, ...,K.(6.6)where k is the stage index, filters Fki , function ψki are trainable model parametersat each stage, and zt0 is the initial value of ztk. Note that, different from TRD, ourmodel does not contain the reaction term which would be −ρtαk(ztk−1 − xt−1)with step size αk. The main reasons for this modification are:• The data constraint is contained in the xt update in Eq. 6.4;• More importantly, our model gets rid of the weight ρt which changes ateach HQS iteration. Therefore, our proximal operator proxΘ(xt−1, ρt) issimplified to be:zt := proxΘ(xt−1) (6.7)Note that our proximal fields model Θ = {Fki , ψki |k ∈ {1, . . . ,K}} is re-usedat all HQS iteration t, making it different than previous discriminative methods(CSF, TRD). The parameters to learn in our method Ω includes λ’s for each prob-lem class p (restoration task and problem condition), and the proximal fields modelΘ shared across different classes, i.e., Ω = {λp,Θ}. Even though the scalar pa-rameters λp are trained, our method allows users to adjust them at test time for bestperformance and non-trained problem classes. This contrasts to previous discrim-inative approaches whose model parameters are fixed at test time. The subscript pindicating the problem class in λp is omitted below for convenience. The values ofρt are pre-selected: ρ1 = 1 and ρt = 2ρt−1 for t > 1.Note that a multi-stage model as in Eq. 6.6 is not possible if we split on Fixinstead of x in Eq. 6.1 and 6.2. For clarity, an overview of the proposed algorithm98Algorithm 8 Proposed algorithmInput: degraded image b, weight λ (optional)Output: recovered image x1: x0 = b, ρ1 = 1 (initialization)2: for t = 1 to T do3: (Update zt by Eq. 6.6 below)4: zt0 = xt−15: for k = 1 to K do6: ztk = ztk−1 −∑Ni=1 FkiTψki (Fki ztk−1)7: end for8: zt = ztK9: (Update xt by Eq. 6.4 below)10: xt = argminx λ||b−Ax||22 + ρt||zt − x||2211: ρt+1 = 2ρt12: end foris given in Algorithm LearningWe consider denoising and deconvolution tasks at training, where the sensing op-erator A is an identity matrix, or a block circulant matrix with circulant blocks thatrepresents 2D convolution with blur kernels respectively. In denoising tasks, xtupdate in Eq. 6.4 has closed-form solution:xt =λb + ρtztλ+ ρt(6.8)In deconvolution tasks, xt update in Eq. 6.4 has closed-form solution in Fourierdomain:xt = F−1(F(λATb + ρtzt)F(λATA + ρt)), (6.9)where F and F−1 represent Fourier and inverse Fourier transform respectively.Note that, different than CSF [92], our method does not require FFT computationfor denoising tasks.We use L-BFGS solver [91] with analytic gradients for training. The training99(a) Filters f1i at stage 1.(b) Filters f2i at stage 2.(c) Filters f3i at stage 3.Figure 6.1: Trained filters at each stage (k in Eq. 6.6) of the proximal operatorproxΘ in our model (3 stages each with 24 5×5 filters).loss function ` is defined as the average PSNR of reconstructed images. The gra-dient of ` w.r.t. the model parameters Ω = {λp,Θ} is computed by accumulatinggradients at all HQS iterations, as shown in Eq. 6.10.∂`∂Ω=T∑t=1(∂xt∂λ∂`∂xt+∂zt∂Θ(∂`∂zt+∂xt∂zt∂`∂xt)), (6.10)where the back-propagation inside our proximal operator is computed as:∂zt∂Θ=K∑k=1∂ztk∂Θ∂zt∂ztk(6.11)The 1D functions ψki in Eq. 6.6 are parameterized as a linear combination ofequidistant-positioned Gaussian kernels whose weights are trainable. More detailsof the derivation of analytic gradients can be found in Appendix A.3.100Progressive trainingA progressive scheme is proposed to make the training more effectively. First, weset the number of HQS iterations to be 1, and train λ′s and the model Θ of eachstage in proxΘ in a greedy fashion. Then, we gradually increase the number ofHQS iterations from 1 to T where at each step the model Ω = {λ,Θ} is refinedfrom the result by previous step. The L-BFGS iterations are set to be 200 for thegreedy training steps, and 100 for the refining steps. Fig. 6.1 shows examples ofour learned filters in proxΘ.6.3 ResultsDenoisingWe compare our method with state-of-the-art image denoising techniques, includ-ing KSVD [26], FoE [87], BM3D [22], LSSC [71], WNNM [37], EPLL [118],opt-MRF [17], ARF [8], CSF [92] and TRD [18]. Our method is denoted in shortas “ProxF”. The subscript in CSF5 and TRD5 indicates the number of cascadedstages (each stage has different model parameters). The subscript and superscriptin ProxF53 indicate the number of diffusion stages (K = 3 in Algorithm 8) in ourproximal operator proxΘ, and the number of HQS iterations (T = 5 in Algo-rithm 8), respectively. Note that the complexity (size) of our model is linear to K,but independent of T . CSF, TRD and ProxF use 24 filters of size 5×5 pixels at allstages in this section.The compared discriminative methods, CSF5 and TRD5 both are trained atsingle noise level σ = 15 that is the same as the test images. In contrast, our modelis trained on 400 images (100×100 pixels) cropped from [87] with random anddiscrete noise levels (standard deviation σ) varying between 5 and 25. The imageswith the same noise level share the same data fidelity weight λ at training.All the methods are evaluated on the 68 test images with noise level σ = 15from [87] and the averaged PSNR values are reported in Table 6.1. Our results arecomparable to generic methods such as KSVD, FoE and BM3D , and very closeto discriminative methods such as CSF5, while at the same time being much moreefficient which is demonstrated later. Besides, we simply use λ learned for images101with noise σ = 15 at training to generate all the test results, although adjustingits value at test time can expectedly improve our results. Note that the compareddiscriminative methods (CSF, TRD) do not allow for such parameter tuning.Table 6.1: Average PSNR (dB) on 68 images from [87] for image denoising.KSVD FoE BM3D LSSC WNNM EPLL30.87 30.99 31.08 31.27 31.37 31.19opt-MRF ARF CSF5 TRD5 ProxF33 ProxF5331.18 30.70 31.14 31.30 30.91 31.001520253035405 15 25PSNR (dB)noise level at testinput TRD15 TRD25 ProxFFigure 6.2: Analysis of model generality on image denoising. In this plot,“TRD15” denotes the TRD model trained at noise σ = 15, and“TRD25” at noise σ = 25. “ProxF” denotes our model trained withmixed noise levels in a single pass.To verify the generality of our method on varying noise levels, we test ourmodel ProxF33 (trained with varying noise levels in a single pass) and two TRDmodels (trained at specific noise levels 15 and 25) on 3 sets of 68 images withnoise σ = 5, 15, 25 respectively. The average PSNR values are reported in Fig. 6.2and example images are shown in Fig.6.3-6.8. Despite performing slightly belowthe TRD models trained for the exact noise level used at test time, our methodis more generic and works robustly for various noise levels. Note that our modelcontains 40% fewer trainable parameters than the compared TRD models. Besides,102(a) Input w/ noise σ 5 (34.15dB) (b) TRD15 (32.57dB)(c) TRD25 (29.33dB) (d) ProxF (37.14dB)Figure 6.3: Results with input noise level σ=5.(a) Input w/ noise σ 15 (24.61dB) (b) TRD15 (31.09dB)(c) TRD25 (29.31dB) (d) ProxF (31.10dB)Figure 6.4: Results with input noise level σ=15.103(a) Input w/ noise σ 25 (20.17dB) (b) TRD15 (23.74dB)(c) TRD25 (28.44dB) (d) ProxF (28.45dB)Figure 6.5: Results with input noise level σ=25.our method uses the fidelity weight λ that is learned for each noise level at training,although adjusting its value at test time can expectedly improve our results. Thelearned value λ = 20.706 for σ = 5, λ = 2.475 for σ = 15, and λ = 0.033for σ = 25. In sharp contrast to discriminative methods, which are inherentlyspecialized for a given problem setting, i.e., noise level, the proposed approachtransfers across different problem settings. We demonstrate this generality belowfor a variety of different image reconstruction tasks.Run-time comparisonIn Table 6.2 we compare the run-time of our method and state-of-the-art meth-ods. The experiments were performed on a laptop computer with Intel i7-4720HQCPU and 16GB RAM. WNNM and EPLL ran out-of-memory for images over4 megapixels in our experiments. CSF5, TRD5 and our method ProxF33 all use“parfor” setting in Matlab. ProxF33 is significantly faster than all compared genericmethods (WNNM, EPLL, BM3D) and even the discriminative method CSF5. Run-time of ProxF33 is about 1.5 times that of TRD5, which is expected as they use 9104(a) Input w/ noise σ 5 (34.15dB) (b) TRD15 (33.29dB)(c) TRD25 (29.95dB) (d) ProxF (38.02dB)Figure 6.6: Results with input noise level σ=5.105(a) Input w/ noise σ 15 (24.61dB) (b) TRD15 (32.31dB)(c) TRD25 (30.11dB) (d) ProxF (32.03dB)Figure 6.7: Results with input noise level σ=15.106(a) Input w/ noise σ 25 (20.17dB) (b) TRD15 (23.94dB)(c) TRD25 (29.72dB) (d) ProxF (29.40dB)Figure 6.8: Results with input noise level σ=25.107Table 6.2: Run-time (seconds) comparison for image denoising on differentsize images.Image size 2562 5122 10242 20482 40962WNNM 157.73 657.75 2759.79 - -EPLL 29.21 111.52 463.71 - -BM3D 0.78 3.45 15.24 62.81 275.39CSF5 1.23 2.22 7.35 27.08 93.66TRD5 0.39 0.71 2.01 7.57 29.09ProxF33 0.60 1.19 3.45 12.97 56.19ProxF33 (Halide) 0.11 0.26 1.60 5.61 20.85versus 5 diffusion steps in total. In addition, we implement our method in Halidelanguage [82], which has become popular recently for high-performance imageprocessing applications, and report the run-time on the same CPU as mentionedabove.DeconvolutionOur general model supports image deconvolution tasks in the HQS framework. Inthis experiment, we train a model with an ensemble of denoising and deconvolutiontasks on 400 images (100×100 pixels) cropped from [87], in which 250 images aregenerated for denoising tasks with random noise levels σ varying between 5 and25, and the other 150 images are generated by blurring the images with random25×25 kernels (PSFs) and then adding Gaussian noise with σ between 1 and 5.All input images are quantized to 8 bits.We compare our method with state-of-the-art non-blind deconvolution methodsincluding Levin et al [68], Schmidt et al [93] and CSF [92] on the benchmarkdataset of [63]. Note that TRD [18] does not support non-blind deconvolution.We test the methods on 32 images from [64] and report the average PSNR valuesin Table 6.3. The results of compared methods are quoted from [92]. We run agrid search on the adjustable fidelity weight λ (the same value for all the 32 testimages) and report the best result in Table 6.3. In Fig. 6.9, we show our results withdifferent λ. Within a fairly wide range of λ, our method outperforms all previousmethods. In Fig.6.10, we show example images of our results and more images108Table 6.3: Average PSNR (dB) on 32 images from [64] for non-blind decon-volution.Input Levin [68] Schmidt [93] CSFpw3 ProxF3322.86 32.73 33.97 33.48 34.3434. 125 150 175 200 225 250 275 300PSNR (dB)fidelity weight λour result with different fidelity weight λFigure 6.9: Our results with different fidelity weight λ for the non-blind de-convolution experiment reported in Table 6.3.can be found in the supplementary material of [109].In addition, in Fig.6.11 -6.12, we show a comparison for non-blind decon-volution on the dataset from Schuler et al [94] that features Gaussian blur withstandard deviation 1.6 and noise level σ = 2. We compare our method ProxF33with EPLL [118], Krishnan et al [58], Levin et al [67], DEB-BM3D [24], IDD-BM3D [24] and MLP [94]. Note that the MLP method is tailored for the task ofnon-blind deconvolution only and is trained with exactly the same (single) blurkernel and noise level as those at test time [94]. In contrast, our method ProxF33 istrained with an ensemble of denoising and deconvolution tasks including variousblur kernels and noise levels. In particular, our training data does not comprise anyGaussian blur kernels. Nonetheless, our method is close to MLP in terms of PSNRwhile at the same time outperforming all other competitors. Please zoom in thefigures for better view.109(a) Ground truth (b) Input (23.69dB) (c) ProxF33 (33.81dB)(d) Ground truth (e) Input (22.61dB) (f) ProxF33 (33.24dB)(g) Ground truth (h) Input (23.74dB) (i) ProxF33 (33.86dB)(j) Ground truth (k) Input (24.58dB) (l) ProxF33 (34.74dB)Figure 6.10: Results on images with blur kernel #1 in the dataset.110(a) Ground truth (b) Input (32.98dB) (c) EPLL (35.25dB)(d) Krishnan (35.12dB) (e) Levin (35.00dB) (f) DEB-BM3D (35.01dB)(g) IDD-BM3D (35.26dB) (h) MLP (35.36dB) (i) ProxF33 (35.39dB)Figure 6.11: Results on non-blind deconvolution with Gaussian blur.111(a) Ground truth (b) Input(23.26dB) (c) EPLL (25.48dB)(d) Krishnan (25.56dB) (e) Levin (25.53dB) (f) DEB-BM3D (25.58dB)(g) IDD-BM3D (25.73dB) (h) MLP (25.93dB) (i) ProxF33 (25.76dB)Figure 6.12: Results on non-blind deconvolution with Gaussian blur.112Collaborative with existing priorsAs shown above, even though the fidelity weight λ is trainable, our method allowsusers to adjust its value for better performance at test time. Moreover, this propertymakes it possible to combine our model (after being trained) with existing state-of-the-art priors at test time, in which case λ typically needs to be adjusted. Again,this is not possible with previous discriminative methods (CSF, TRD).In Fig. 6.13 we show an example to incorporate a non-local patch similar-ity prior (BM3D [21]) with our method to further improve the denoising quality.BM3D performs well in removing noise especially in smooth regions but usuallyover-smoothes edges and textures. Our original model (ProxF53) well preservessharp edges however sometimes introduces artifacts in smooth regions when theinput noise level is high. By combining those two methods, which is easy with ourHQS framework, the result is improved both visually and quantitatively.We give the derivation of the proposed hybrid method below. Let S(x) repre-sents the non-local patch similarity prior. The objective function is:λ2||b−Ax||22 +N∑i=1φi(Fix) + τS(x) (6.12)Applying the HQS technique described in Section 6.2, we relax the objective to be:λ2||b−Ax||22 +ρ2||z− x||22 +N∑i=1φi(Fiz)+ρs2||v − x||22 + τS(v)(6.13)Then we minimize Eq. 6.13 by alternately solving the following 3 subproblems:zt = proxΘ(xt−1)vt = argminvρts2||v − xt−1||22 + τS(v) ≈ BM3D (xt−1,τρts)xt = argminxλ||b−Ax||22 + ρt||zt − x||22 + ρts||vt − x||22,(6.14)where proxΘ is from our previous training, and the vt subproblem is approxi-113(a) Input (20.17dB) (b) BM3D (29.62dB)(c) ProxF53 (29.48dB) (d) ProxF53 + BM3D (29.74dB)Figure 6.13: Experiment on incorporating non-local patch similarity prior(BM3D) with our model after being trained. The input noise levelσ = 25. Please zoom in for better view.114(a) Ground truth (b) Input (20.18dB)(c) TRD5 (28.06dB) (d) ProxF53 (27.80dB)(e) TV + cross (26.89dB) (f) ProxF53 + cross (28.69dB)Figure 6.14: Experiment on incorporating a color prior [45] with our modelafter being trained. The input noise level σ = 25. (e,f) show theresults by combining TV denoising with a cross-channel prior, andour method with cross-channel prior, respectively. Please zoom in forbetter view.115mated by running BM3D software on xt−1 with noise parameter τρts following [45].Similarly, our method can incorporate color image priors (e.g., cross-channeledge-concurrence prior [45]) to improve test results on color images, despite ourmodel being trained on gray-scale images. An example is shown in Fig. 6.14. Thehybrid method shares the advantages of our original model that effectively pre-serves edges and textures and the cross-channel prior that reduces color artifacts.Transferability to unseen tasksOur method allows for new data-fidelity terms that are not contained in training,with no need for re-training. We demonstrate this flexibility with an experiment onthe joint denoising and inpainting task shown in Fig. 6.15. In this experiment, 60%pixels of the input image are missing, and the measured 40% pixels are corruptedwith Gaussian noise with σ = 15. Let vector a be the binary mask for measuredpixels. The sensing matrix A in Eq. 6.1, assumed to be known, is a binary diagonalmatrix (hence A = AT = ATA) with diagonal elements a. To reuse our modeltrained on denoising/deconvolution tasks, we only need to specify A and λ. Thesubproblems of our HQS framework are given in Eq. 6.15.zt = proxΘ(xt−1),xt = (λATb + ρtzt)./(λa + ρt),(6.15)where ./ indicates element-wise division.116(a) Input (b) Delaunay interp.(23.19dB)(c) ProxF53 (25.10dB) (d) Ground truthFigure 6.15: Experiment on joint denoising and inpainting task. The inputimage (a) misses 60% pixels, and is corrupted with noise σ = 15. Ourmethod takes the result of Delaunary interpolation (b) as the initialestimation x0. Please zoom in for better view.117(a) Ground truth(b) Noisy input (20.18dB) (c) Iter 1 (22.85dB)(d) Iter 2 (25.93dB) (e) Iter 3 (28.14dB)Figure 6.16: Results at each HQS iteration of our method on image denoisingwith noise level σ = 25. Inside brackets show the PSNR values. Pleasezoom in for better view.118(a) Ground truth(b) Blurry input (23.37dB) (c) Iter 1 (27.32dB)(d) Iter 2 (28.48dB) (e) Iter 3 (29.36dB)Figure 6.17: Results at each HQS iteration of our method on non-blind de-convolution with a 25×25 PSF and noise level σ = 3. Inside bracketsshow the PSNR values. Please zoom in for better view.119Table 6.4: Test with different HQS iterations (T ) and model stages (K) forimage denoising. Average PSNR (dB) results on 68 images from [64]with noise σ = 15 and 25 are reported (before and after “/” in each cellrespectively).# HQS iterations1 3 5#stages1 29.80 / 26.81 30.89 / 28.12 30.96 / 28.283 30.54 / 27.82 30.91 / 28.19 31.00 / 28.425 30.54 / 27.83 30.92 / 28.18 -Analysis of convergence and model complexityTo better understand the convergence of our method, in Fig. 6.16 and 6.17 weshow the results of each HQS iteration of our method on denoising and non-blinddeconvolution.To understand the effect of model complexity and the number of HQS iter-ation on results, in Table 6.4 we report test results of our method using modelstrained with different HQS iterations (T in Algorithm 8), and with different stagesin proxΘ (K in Algorithm 8). The results are generated using the λ values learnedat training without post-tuning.Connections to plug-and-play priorsWe noticed that recent plug-and-play methods [72, 85, 103] make a similar ob-servation that the zt-update step in 6.3 can be interpreted as a Gaussian denoisingprocessing on xt−1. However, while they replace the zt-update step with existinggeneric Gaussian denoisers, our method trains such step as part of the whole prox-imal optimization iterations by discriminative learning for good trade-off betweenhigh-quality results and time-efficiency.6.4 Conclusion and future workIn this paper, we proposed the trainable proximal fields model, a generalizationof traditional proximal operators. By combining advanced proximal optimizationalgorithms and discriminative learning techniques, a single training pass leads to120a transferable model useful for a variety of image restoration tasks and problemconditions. Furthermore, our method is flexible and can be combined with existingpriors and likelihood terms after being trained, allowing to improve image qualityon a task at hand. In spite of this generality, our method achieves comparablerun-time efficiency as previous discriminative approaches, making it suitable forhigh-resolution image restoration and mobile vision applications.We believe that in future work, our framework incorporating advanced opti-mization with discriminative learning techniques can be extended to deep learn-ing, for training more compact and shareable models, and to solve high-level vi-sion problems. Another plan is to train our models for ensemble tasks with largerdatasets and use advanced learning optimization techniques, which can potentiallyfurther improve results.121Chapter 7ConclusionsAs a classic topic that has been studied for decades, image restoration is still a veryactive research area. While significant progress has been made, developing moretime-, memory- and power-efficient methods are still highly desirable especiallyas the rise of mobile imaging systems such as smartphone cameras, ToF cameras,autonomous vehicles, etc. Recent advances in machine learning and data-driventechniques have been inspiring new perspectives and strategies for image restora-tion problems. This thesis presents our researches on these topics and addressesseveral image restoration problems for applications in computational imaging in-cluding ToF imaging and digital photography.While continuous-wave ToF cameras have shown great promise at low-costdepth imaging, they suffer from limited depth of field and low spatial resolution. InChapter 3, we develop a computational method to remove the lens blur and increaseimage resolution of off-the-shelf ToF cameras. In contrast to previous work, ourmethod solves the depth and amplitude image directly from the raw sensor dataand supports more advanced ToF cameras that use multiple frequencies, phasesand exposures.Taking personal photographs have become increasingly common due to thepopularity of mobile cameras. However, photos taken by hand-held cameras arelikely to suffer from blur caused by camera shake during exposure. Therefore,removing such blur and recovering sharp images as a post-process is highly de-sirable. In Chapter 4 we develop a blind image deblurring method that is purely122based on simple stochastic random-walk sampling for optimization, which allowsus to easily test new priors for both image and blur kernel. We demonstrate thatthis simple framework in combination with different priors produces comparableresults to the much more complex state-of-the-art deblurring algorithms.Blur causes even more serious issues for text document photographs, as slightblur can prevent existing OCR techniques from extracting correct text from the im-ages. In Chapter 5 we address the blind deblurring problem specifically for com-mon text document photographs. Observing that the latter are mostly composed ofhigh-order structures, our method proposes to capture such domain property by aseries of iteration- and scale-wise high-order filters as well as customized responsefunctions. These filters and functions are learned from data by discriminative learn-ing approach and form an end-to-end network that can efficiently estimate both blurkernel and sharp/legible text images.Discriminative learning approaches have been recently proposed for effectiveimage restoration, achieving convincing trade-off between image quality and com-putational efficiency. However, these methods require separate training for eachrestoration task and problem condition, making it time-consuming and difficult toencompass all tasks and conditions during training. In Chapter 6 we combine thediscriminative learning idea and formal numerical optimization method, to learnimage priors that require a single-pass training and share across various tasks andconditions, while keeping the efficiency as previous discriminative methods. More-over, after being trained, our model can be easily combined with other likelihoodor priors to address unseen restoration tasks or further improve the image quality.Throughout this thesis work, developing both effective image priors and effi-cient optimization solvers have been the central tasks for a given image restorationproblem. In Chapter 3 we investigate the classic prior (TGV) and the advanced nu-merical optimization methods (ADMM and Levenberg-Marquardt algorithm). InChapter 4 we develop a stochastic-sampling based optimization method for solv-ing general image priors. In Chapter 5, we apply discriminative learning approachto learn the priors and the iterative optimization process from data. The learnedmodels are highly optimized and customized for the trained task. In Chapter 6 wecombine the numerical optimization method (HQS) with the discriminative learn-ing approach to learn prior models that can be shared across a variety of restora-123tion tasks while keeping the efficiency as previous discriminative learning methods.While this is a first attempt, I believe that the emerging between formal numericaloptimization techniques and the data-driven machine learning techniques will stim-ulate more new ideas and successes in the field of image restoration, and I wouldlike to keep investigating on it as future work.124Bibliography[1] Robust deblurring software. www.cse.cuhk.edu.hk/∼leojia/deblurring.htm.→ pages xii, 59, 68, 69, 70, 71, 75, 86, 87, 89, 91[2] B. C. A. Buades and J. M. Morel. A review of image denoising algorithms,with a new one. Multiscale Modeling and Simulation, 4(2):490–530, 2005.→ pages 1, 9[3] S. Agarwal, K. Mierle, and Others. Ceres solver. http://ceres-solver.org.→ pages 38[4] F. Anscombe. The transformation of poisson, binomial andnegative-binomial data. Biometrika, 35(3/4):246–254, 1948. → pages 65[5] S. Anwar, C. Phuoc Huynh, and F. Porikli. Class-specific image deblurring.In ICCV 2015. → pages 14[6] H. Avron, A. Sharf, C. Greif, and D. Cohen-Or. 1-sparse reconstruction ofsharp point set surfaces. ACM Transactions on Graphics, 29(5):135, 2010.→ pages 25[7] J. Balzer and S. Soatto. Second-order shape optimization for geometricinverse problems in vision. arXiv preprint arXiv:1311.2626, 2013. →pages 25[8] A. Barbu. Training an active random field for real-time image denoising.IEEE Transactions on Image Processing, 18(11):2451–2462, 2009. →pages 101[9] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributedoptimization and statistical learning via the alternating direction method ofmultipliers. Foundations and Trends in Machine Learning, 3(1):1–122,2011. → pages 21, 27, 28125[10] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributedoptimization and statistical learning via the alternating direction method ofmultipliers. Foundations and Trends in Machine Learning, 3(1):1–122,2011. → pages 6, 96[11] K. Bredies, K. Kunisch, and T. Pock. Total generalized variation. SIAMJournal on Imaging Sciences, 3(3):492–526, 2010. → pages 25[12] H. Bristow, A. Eriksson, and S. Lucey. Fast convolutional sparse coding. InCVPR 2013. → pages 10[13] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plainneural networks compete with BM3D? In CVPR 2012. → pages 2, 12[14] G. Casella and C. P. Robert. Monte carlo statistical methods, 1999. →pages 46[15] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convexproblems with applications to imaging. Journal of Mathematical Imagingand Vision, 40(1):120–145, 2011. → pages 2, 96[16] X. Chen, X. He, J. Yang, and Q. Wu. An effective document imagedeblurring algorithm. In CVPR 2011, . → pages 5, 15[17] Y. Chen, T. Pock, R. Ranftl, and H. Bischof. Revisiting loss-specifictraining of filter-based mrfs for image restoration. In German Conferenceon Pattern Recognition 2013, . → pages 101[18] Y. Chen, W. Yu, and T. Pock. On learning optimized reaction diffusionprocesses for effective image restoration. In CVPR 2015, . → pages 2, 5,12, 95, 97, 98, 101, 108[19] H. Cho, J. Wang, and S. Lee. Text image deblurring using text-specificproperties. In ECCV 2012. → pages 5, 15[20] S. Cho and S. Lee. Fast motion deblurring. In ACM Transactions onGraphics (Proc. SIGGRAPH Asia), pages 145:1–145:8, 2009. → pages xii,14, 54, 59, 60, 61, 68, 70, 71, 75, 79[21] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image restoration bysparse 3d transform-domain collaborative filtering. In Electronic Imaging2008. → pages 13, 113126[22] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising bysparse 3-D transform-domain collaborative filtering. IEEE Transactions onImage Processing, 16(8):2080–2095, 2007. → pages 1, 9, 101[23] A. Danielyan, V. Katkovnik, and K. Egiazarian. Image deblurring byaugmented lagrangian with bm3d frame prior. In Proc. WITMSE, pages16–18, 2010. → pages 13[24] A. Danielyan, V. Katkovnik, and K. Egiazarian. Bm3d frames andvariational image deblurring. IEEE Transactions on Image Processing, 21(4):1715–1728, 2012. → pages 13, 109[25] W. Dong, L. Zhang, G. Shi, and X. Li. Nonlocally centralized sparserepresentation for image restoration. IEEE Transactions on ImageProcessing, 22(4):1620–1630, 2013. → pages 1, 9, 10[26] M. Elad and M. Aharon. Image denoising via sparse and redundantrepresentations over learned dictionaries. IEEE Transactions on ImageProcessing, 15(12):3736–3745, 2006. → pages 2, 10, 101[27] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman.Removing camera shake from a single photograph. In ACM Transactionson Graphics (Proc. SIGGRAPH), pages 787–794, 2006. → pages 12, 13,44, 49, 50, 59, 60, 68, 70[28] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruether, and H. Bischof. Imageguided depth upsampling using anisotropic total generalized variation. InICCV 2013. → pages 17, 25[29] D. Freedman, Y. Smolin, E. Krupka, I. Leichter, and M. Schmidt. SRA:Fast removal of general multipath for ToF sensors. In ECCV 2014. →pages 18[30] D. Geman and C. Yang. Nonlinear image recovery with half-quadraticregularization. IEEE Transactions on Image Processing, 4(7):932–946,1995. → pages 2, 6, 19, 78, 96, 140[31] J. P. Godbaz, M. J. Cree, and A. A. Dorrington. Extending amcw lidardepth-of-field using a coded aperture. In ACCV 2010. → pages 17, 23, 24,29, 39, 41[32] J. P. Godbaz, M. J. Cree, and A. A. Dorrington. Blind deconvolution ofdepth-of-field limited full-field lidar data by determination of focal127parameters. In IS&T/SPIE Electronic Imaging, pages 75330B–75330B,2010. → pages 17[33] S. B. Gokturk, H. Yalcin, and C. Bamji. A time-of-flight depthsensor-system description, issues and solutions. In CVPR Workshops 2004.→ pages 18, 23, 41[34] A. Goldstein and R. Fattal. Blur-kernel estimation from spectralirregularities. In ECCV 2012. → pages 14[35] T. Goldstein and S. Osher. The split bregman method for l1-regularizedproblems. SIAM Journal on Imaging Sciences, 2(2):323–343, 2009. →pages 2, 49[36] J. Gregson, F. Heide, M. B. Hullin, M. Rouf, and W. Heidrich. StochasticDeconvolution. In CVPR 2013. → pages 43, 45, 46, 50, 53, 63[37] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear normminimization with application to image denoising. In CVPR 2014. →pages 1, 5, 9, 10, 101[38] A. Gupta, N. Joshi, C. L. Zitnick, M. Cohen, and B. Curless. Single imagedeblurring using motion density functions. In ECCV 2010. → pages 14[39] U. Hahne and M. Alexa. Exposure fusion for time-of-flight imaging. InComputer Graphics Forum, volume 30, pages 1887–1894, 2011. → pages18[40] S. Harmeling, H. Michael, and B. Scho¨lkopf. Space-variant single-imageblind deconvolution for removing camera shake. In NIPS 2010. → pages14[41] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible convolutionalsparse coding. In CVPR 2015. → pages 10[42] F. Heide, M. B. Hullin, J. Gregson, and W. Heidrich. Low-budget transientimaging using photonic mixer devices. ACM Transactions on Graphics, 32(4):45, 2013. → pages 23, 41[43] F. Heide, M. Rouf, M. B. Hullin, B. Labitzke, W. Heidrich, and A. Kolb.High-quality computational imaging through simple lenses. ACMTransactions on Graphics, 32(5), September 2013. → pages 61, 62128[44] F. Heide, M. Rouf, M. B. Hullin, B. Labitzke, W. Heidrich, and A. Kolb.High-quality computational imaging through simple lenses. ACMTransactions on Graphics, 32(5):149, 2013. → pages 29[45] F. Heide, M. Steinberger, Y.-T. Tsai, M. Rouf, D. Pajak, D. Reddy,O. Gallo, J. Liu, W. Heidrich, K. Egiazarian, et al. Flexisp: a flexiblecamera image processing framework. ACM Transactions on Graphics, 33(6):231, 2014. → pages 115, 116[46] M. Hirsch, C. J. Schuler, S. Harmeling, and B. Scholkopf. Fast removal ofnon-uniform camera shake. In ICCV 2011, . → pages 14, 60, 68, 70[47] M. Hirsch, S. Sra, B. Scholkopf, and S. Harmeling. Efficient filter flow forspace-variant multiframe blind deconvolution. In CVPR 2010, . → pages92, 93[48] M. Hradisˇ, J. Kotera, P. Zemcı´k, and F. Sˇroubek. Convolutional neuralnetworks for direct text deblurring. In BMVC 2015. → pages 5, 15, 77, 86,87, 88, 89, 90, 91, 92, 93, 94[49] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli,J. Shotton, S. Hodges, D. Freeman, A. Davison, et al. Kinectfusion:real-time 3d reconstruction and interaction using a moving depth camera.In UIST 2011. → pages 17[50] V. Jain and H. Seung. Natural image denoising with convolutionalnetworks. NIPS 2009. → pages 2, 12[51] J. Jancsary, S. Nowozin, T. Sharp, and C. Rother. Regression tree fields - anefficient, non-parametric approach to image labeling problems. In CVPR2012. → pages 2, 12[52] N. Joshi, S. B. Kang, C. L. Zitnick, and R. Szeliski. Image deblurring usinginertial measurement sensors. ACM Transactions on Graphics (Proc.SIGGRAPH), 29(4):30, 2010. → pages 43[53] M. Kiechle, S. Hawe, and M. Kleinsteuber. A joint intensity and depthco-sparse analysis model for depth map super-resolution. In ICCV 2013.→ pages 17[54] S. Kindermann, S. Osher, and P. W. Jones. Deblurring and denoising ofimages by nonlocal functionals. Multiscale Modeling & Simulation, 4(4):1091–1115, 2005. → pages 13129[55] A. Kirmani, A. Benedetti, and P. A. Chou. Spumic: Simultaneous phaseunwrapping and multipath interference cancellation in time-of-flightcameras using spectral methods. In ICME 2013. → pages 18[56] F. Knoll, K. Bredies, T. Pock, and R. Stollberger. Second order totalgeneralized variation (tgv) for mri. Magnetic resonance in medicine, 65(2):480–491, 2011. → pages 25[57] R. Ko¨hler, M. Hirsch, B. Mohler, B. Scho¨lkopf, and S. Harmeling.Recording and playback of camera shake: benchmarking blinddeconvolution with a real-world database. In ECCV 2012. → pages x, xii,68, 69, 70[58] D. Krishnan and R. Fergus. Fast image deconvolution usinghyper-laplacian priors. In NIPS 2009. → pages 2, 10, 93, 97, 109[59] D. Krishnan, T. Tay, and R. Fergus. Blind deconvolution using anormalized sparsity measure. In CVPR 2011. → pages 2, 10, 14, 60, 68, 70[60] L. Lele´gard, E. Delaygue, M. Bre´dif, and B. Vallet. Detecting andcorrecting motion blur from images shot with channel-dependent exposuretime. In ISPRS Annals of Photogrammetry, Remote Sensing and SpatialInformation Sciences, volume 1, pages 341–346, 2012. → pages 43, 63[61] K. Levenberg. A method for the solution of certain non-linear problems inleast squares. Quarterly of Applied Mathematics, 2:164–168, 1944. →pages 28[62] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman. Understanding andevaluating blind deconvolution algorithms. In CVPR 2009, . → pages 50,55[63] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman. Understanding andevaluating blind deconvolution algorithms. In CVPR 2009, . → pages 87,108[64] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman. Efficient marginallikelihood optimization in blind deconvolution. In CVPR 2011, . → pagesx, 108, 109, 120[65] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth froma conventional camera with a coded aperture. 26(3):70, 2007. → pages 12130[66] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth froma conventional camera with a coded aperture. ACM Transactions onGraphics (Proc. SIGGRAPH), 26(3), July 2007. → pages 49, 50[67] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Deconvolution usingnatural image priors. Massachusetts Institute of Technology, ComputerScience and Artificial Intelligence Laboratory, 2007. → pages 109[68] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth froma conventional camera with a coded aperture. ACM transactions ongraphics, 26(3):70, 2007. → pages 2, 10, 24, 25, 108, 109[69] Y. Li and S. Osher. Coordinate descent optimization for l1 minimizationwith application to compressed sensing; a greedy algorithm. InverseProblems and Imaging, 3(3):487–503, 2009. → pages 50[70] L. Lucy. An iterative technique for the rectification of observeddistributions. Astronomical Journal, 79(2):745–754, 1974. → pages 14[71] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparsemodels for image restoration. In ICCV 2009. → pages 1, 9, 10, 101[72] C. A. Metzler, A. Maleki, and R. G. Baraniuk. Bm3d-amp: A new imagerecovery algorithm based on bm3d denoising. In Image Processing (ICIP),2015 IEEE International Conference on, pages 3116–3120. IEEE, 2015.→ pages 120[73] T. Michaeli and M. Irani. Blind deblurring using internal patch recurrence.In ECCV 2014. → pages 14[74] P. Milanfar. A tour of modern image filtering: New insights and methods,both practical and theoretical. IEEE Signal Processing Magazine, 30(1):106–128, 2013. → pages 1[75] J. Miskin and D. J. MacKay. Ensemble learning for blind image separationand deconvolution. In NIPS 2000. → pages 13[76] J. Nocedal and S. J. Wright. Numerical optimization 2nd. 2006. → pages28[77] S. Osher and L. I. Rudin. Feature-oriented image enhancement using shockfilters. SIAM Journal on Numerical Analysis, 27(4):pp. 919–940, 1990. →pages 14131[78] J. Pan, Z. Hu, Z. Su, and M.-H. Yang. Deblurring text images vial0-regularized intensity and gradient prior. In CVPR 2014. → pages 5, 15,86, 87, 88, 89, 91[79] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends inOptimization, 1(3):123–231, 2013. → pages 2, 6, 18, 19, 21[80] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon. High quality depthmap upsampling for 3d-tof cameras. In ICCV 2011. → pages 17[81] P. Perona and J. Malik. Scale-space and edge detection using anisotropicdiffusion. IEEE Transactions on Pattern Analysis and MachineIntelligence, 12(7):629–639, 1990. → pages 55[82] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, andS. Amarasinghe. Halide: a language and compiler for optimizingparallelism, locality, and recomputation in image processing pipelines.ACM SIGPLAN Notices, 48(6):519–530, 2013. → pages 108[83] A. Rav-Acha and S. Peleg. Two motion-blurred images are better than one.Pattern Recognition Letter, 26(3):311–317, 2005. → pages 43[84] W. H. Richardson. Bayesian-based iterative method of image restoration.Journal of the Optical Society of America, 62(1):55–59, 1972. → pages 14[85] Y. Romano, M. Elad, and P. Milanfar. The little engine that could:Regularization by denoising (red). arXiv preprint arXiv:1611.02862, 2016.→ pages 120[86] S. Roth and M. Black. Fields of experts. International Journal ofComputer Vision, 82(2):205–229, . → pages 2, 5, 10[87] S. Roth and M. J. Black. Fields of experts: A framework for learningimage priors. In CVPR 2005, . → pages x, 5, 78, 96, 101, 102, 108[88] L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noiseremoval algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268,1992. → pages 1, 8, 12[89] S. Sardy, A. Bruce, and P. Tseng. Block coordinate relaxation methods fornonparametric wavelet denoising. Journal of computational and graphicalstatistics, 9(2), 2000. → pages 50132[90] K. Schelten, S. Nowozin, J. Jancsary, C. Rother, and S. Roth. Interleavedregression tree field cascades for blind image deconvolution. In WACV2015. → pages 14, 78, 83[91] M. Schmidt. minfunc: unconstrained differentiable multivariateoptimization in matlab.http://www.cs.ubc.ca/∼schmidtm/Software/minFunc.html. → pages 85, 99[92] U. Schmidt and S. Roth. Shrinkage fields for effective image restoration. InCVPR 2014. → pages 2, 5, 12, 13, 76, 78, 79, 81, 84, 85, 93, 95, 97, 99,101, 108, 141, 142, 144[93] U. Schmidt, C. Rother, S. Nowozin, J. Jancsary, and S. Roth.Discriminative non-blind deblurring. In CVPR 2013. → pages 13, 14, 108,109[94] C. J. Schuler, H. Christopher Burger, S. Harmeling, and B. Scholkopf. Amachine learning approach for non-blind image deconvolution. In CVPR2013. → pages 13, 109[95] C. J. Schuler, M. Hirsch, S. Harmeling, and B. Scho¨lkopf. Learning todeblur. arXiv preprint arXiv:1406.7444, 2014. → pages 14[96] S. Schuon, C. Theobalt, J. Davis, and S. Thrun. Lidarboost: Depthsuperresolution for ToF 3D shape scanning. In CVPR 2009. → pages 17[97] S. Shalev-Shwartz and A. Tewari. Stochastic methods for `1-regularizedloss minimization. The Journal of Machine Learning Research, 12:1865–1892, 2011. → pages 50[98] Q. Shan, J. Jia, and A. Agarwala. High-quality motion deblurring from asingle image. In ACM Transactions on Graphics (Proc. SIGGRAPH),pages 73:1–73:10, 2008. → pages 60, 61, 68, 70[99] L. Sun, S. Cho, J. Wang, and J. Hays. Edge-based blur kernel estimationusing patch priors. In ICCP 2013. → pages 14[100] Y.-W. Tai and S. Lin. Motion-aware noise filtering for deblurring of noisyand blurry images. In CVPR 2012. → pages 14[101] H. Talebi and P. Milanfar. Global image denoising. IEEE Transactions onImage Processing, 23(2):755–768, 2014. → pages 1, 9[102] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images.In ICCV 1998. → pages 1, 8, 9, 14133[103] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg. Plug-and-playpriors for model based reconstruction. In Global Conference on Signal andInformation Processing (GlobalSIP), 2013 IEEE, pages 945–948. IEEE,2013. → pages 120[104] R. Wang, Z. Yang, L. Liu, J. Deng, and F. Chen. Decoupling noise andfeatures via weighted `1-analysis compressed sensing. ACM Transactionson Graphics, 33(2):18, 2014. → pages 25[105] J. Weickert. Anisotropic diffusion in image processing. ECMI Series,Teubner-Verlag, Stuttgart, Germany, 1998. → pages 1, 8[106] O. Whyte, J. Sivic, A. Zisserman, and J. Ponce. Non-uniform deblurringfor shaken images. In CVPR 2010. → pages 14, 68, 70[107] N. Wiener. Extrapolation, interpolation, and smoothing of stationary timeseries, volume 7. MIT press Cambridge, MA, 1949. → pages 12[108] B. Wohlberg. Efficient algorithms for convolutional sparse representations.IEEE Transactions on Image Processing, 25(1):301–315, 2016. → pages10[109] L. Xiao, F. Heide, W. Heidrich, B. Scho¨lkopf, and M. Hirsch. Learningproximal operators for image restoration. In Under review of ICCV 2017, .→ pages 6, 109[110] L. Xiao, F. Heide, M. O’Toole, A. Kolb, M. B. Hullin, K. Kutulakos, andW. Heidrich. Defocus deblurring and superresolution for time-of-flightdepth cameras. In CVPR 2015, . → pages 3[111] L. Xiao, J. Wang, W. Heidrich, and M. Hirsch. Learning high-order filtersfor efficient blind deconvolution of document photographs. In ECCV 2016,. → pages 5[112] L. Xiao, J. Gregson, F. Heide, and W. Heidrich. Stochastic blind motiondeblurring. IEEE Transactions on Image Processing, 24(10):3071–3085,Oct 2015. → pages 4[113] L. Xu and J. Jia. Two-phase kernel estimation for robust motion deblurring.In ECCV 2010. → pages 14, 60, 61, 68, 70, 79, 86[114] L. Xu, S. Zheng, and J. Jia. Unnatural l0 sparse representation for naturalimage deblurring. In CVPR 2013. → pages 14, 60, 61, 68, 86134[115] T. Yue, S. Cho, J. Wang, and Q. Dai. Hybrid image deblurring by fusingedge and power spectrum information. In ECCV 2014. → pages 14[116] L. Zhong, S. Cho, D. Metaxas, S. Paris, and J. Wang. Handling noise insingle image deblurring using directional filters. In CVPR 2013. → pages14[117] J. Zhu, L. Wang, R. Yang, J. E. Davis, and Z. Pan. Reliability fusion oftime-of-flight depth and stereo geometry for high quality depth maps. IEEETransactions on Pattern Analysis and Machine Intelligence, 33(7):1400–1414, 2011. → pages 17[118] D. Zoran and Y. Weiss. From learning models of natural image patches towhole image restoration. In ICCV 2011. → pages 2, 5, 10, 11, 101, 109[119] W. Zuo, D. Ren, S. Gu, L. Lin, and L. Zhang. Discriminative learning ofiteration-wise priors for blind deconvolution. In CVPR 2015. → pages 14,83135Appendix ASupplemental MaterialsA.1 Supplemental material for Chapter 3This section provides implementation details for Algorithm 2 (updating amplitudeestimate) and Algorithm 3 (updating depth estimate) in Chapter 3. The symbol ∇defines the derivative operator, T defines the matrix transpose, and I defines theidentity matrix.Algorithm 2, Line 2:a = argminaρ||c−Aa||22 + λ1ρa||∇a− y − p1 + u1||22 (A.1)equals to the solution of the linear equation system:(ρATA + λ1ρa∇T∇)a = ρATc + λ1ρa∇T(y + p1 − u1) (A.2)and we solve it by the left division function in Matlab.Algorithm 2, Line 3:y = argminyλ1||∇a− y − p1 + u1||22 + λ2||∇y − p2 + u2||22 (A.3)136equals to the solution of the linear equation system:(λ1I + λ2∇T∇)y = λ1(∇a− p1 + u1) + λ2∇T(p2 − u2) (A.4)and solved by the left division function in Matlab.Algorithm 2, Line 4:p1 = argminp1||p1||1 + ρa||∇a− y − p1 + u1||22 (A.5)is a soft shrinkage problem and has closed form solution:p1 = soft-shrinkage(∇a− y + u1, 0.5ρa) (A.6)where the soft-shrinkage operator is defined as:soft-shrinkage(x, ) =x + ; x < −0;− ≤ x ≤ x− ; x > (A.7)Algorithm 2, Line 5:p2 = argminp2||p2||1 + ρa||∇y − p2 + u2||22 (A.8)is a soft shrinkage problem and has closed form solution:p2 = soft-shrinkage(∇y + u2, 0.5ρa) (A.9)Algorithm 3, Line 2:z = argminzdata fitting constraint︷ ︸︸ ︷ρ||c− a ◦ g(z)||22 +prior constraint︷ ︸︸ ︷τ1ρx||∇z− x− q1 + v1||22 (A.10)137is a nonlinear least squares problem due to the nonlinearity of the modulationfunction g(z). We solve this problem by the Levenberg-Marquardt method imple-mented in the lsqnonlin(.) function in Matlab. We provide the analytical Jacobianfor acceleration:J(z) =[Jdata(z)Jprior](A.11)where the matrix Jdata(z) and Jprior define the Jacobian of the 1st (data fittingconstraint) and 2nd (prior constraint) least squares in Eq.A.10 respectively.Since the 1st least squares are pixel-wise separable (benefit from our splittingmethod explained in Section 3.2.1 in Chapter 3), Jdata(z) is simply a diagonalmatrix composed of:−ak · ∂g(zk)∂zk· √ρ (A.12)where k is the pixel index. For the ToF cameras based on cosine model modulation(see Eq.3.1 in Chapter 3), the diagonal element in Eq.A.12 becomes:ak · i4pifc· e−i( 4pifc ·zk) · √ρ (A.13)For arbitrary modulation waveforms in the future, the diagonal element in Eq.A.12can be estimated from calibration data. Jprior is simply the matrix version of thederivative operator∇ multiplied by√τ1ρx, which is independent of z.Algorithm 3, Line 3:x = argminxτ1||∇z− x− q1 + v1||22 + τ2||∇x− q2 + v2||22 (A.14)equals to the solution of the linear equation system:(τ1I + τ2∇T∇)x = τ1(∇z− q1 + v1) + τ2∇T(q2 − v2) (A.15)and solved by the left division function in Matlab.138Algorithm 3, Line 4:q1 = argminq1||q1||1 + ρx||∇z− x− q1 + v1||22 (A.16)is a soft shrinkage problem and has closed form solution:q1 = soft-shrinkage(∇z− x + v1, 0.5ρx) (A.17)Algorithm 3, Line 5:q2 = argminq2||q2||1 + ρx||∇x− q2 + v2||22 (A.18)is a soft shrinkage problem and has closed form solution:q2 = soft-shrinkage(∇x + v2, 0.5ρx) (A.19)139A.2 Supplemental material for Chapter 5This section gives the derivation details of Eq.5.6 -5.10 in Chapter 5.Derivation of Eq.5.6 and5.7The objective of latent image estimation at t-th iteration is defined asxt = argminxt||b− kt−1 ⊗ xt||22 +N∑i=1ρti(Ftixt) (A.20)Applying the half-quadratic splitting [30] on each filter response Ftixt, we havext = argminxt,uti||b− kt−1 ⊗ xt||22 +N∑i=1(ρti(uti) +βt2||Ftixt − uti||22)(A.21)The half-quadratic optimization technique [30] solves Eq.A.21 by alternatively up-dating xt and uti. More specifically, for xt update,xt = argminxt||b− kt−1 ⊗ xt||22 +N∑i=1βt2||Ftixt − uti||22 (A.22)This is solved by setting the gradient of the right-hand linear least squares to bezero, (KTt−1Kt−1 +βt2N∑i=1FtiTFti)xt = KTt−1b +βt2N∑i=1FtiTuti (A.23)where matrix Kt−1 represents corresponding convolution with blur kernel kt−1.Because Kt−1 and Fti represent convolution process, Eq.A.23 can be efficientlysolved in Fourier domain. Let λt = βt/2, we havext = F−1[F(KTt−1b + λt∑Ni=1 FtiTuti)F(KTt−1) · F(Kt−1) + λt∑Ni=1F(FtiT) · F(Fti)](A.24)140For uti update,uti = argminutiρti(uti) +βt2||Ftixt−1 − uti||22 (A.25)This minimization problem is pixel-wise separable. The key idea of [92] is toreplace this minimization problem with a shrinkage function ψti , i.e.,uti = ψti(Ftixt−1) (A.26)By Substituting Eq.A.26 to Eq.A.24, we havext = F−1[F(KTt−1b + λt∑Ni=1 FtiTψti(Ftixt−1))F(KTt−1) · F(Kt−1) + λt∑Ni=1F(FtiT) · F(Fti)](A.27)Note that we replace xt−1 with zt−1 in our blind deblurring framework. Therefore,Eq.A.27 becomesxt = F−1[F(KTt−1b + λt∑Ni=1 FtiTψti(Ftizt−1))F(KTt−1) · F(Kt−1) + λt∑Ni=1F(FtiT) · F(Fti)](A.28)which is Eq.5.6 in Chapter 5. Eq.5.7 in Chapter 5 can be derived in similar waythus it’s omitted here.Derivation of Eq.5.8The objective of kernel estimation is defined askt = argminkt||b− kt ⊗ zt||22 + τ t||kt||22 (A.29)This is solved by setting the gradient of the right-hand linear least squares to bezero,(ZTt Zt + τt)kt = ZTt b (A.30)141where matrix Zt represents corresponding convolution with image zt. Eq.A.30 canbe efficiently solved in Fourier domain:kt = F−1[ F(zt)∗ · F(b)F(zt)∗ · F(zt) + τ t](A.31)which is Eq.5.8 in the main paper. ∗ represents conjugate transpose.Calculation of Eq.5.9For xt update, the training loss ` = ||xt − x¯||22. Its gradient w.r.t. the modelparameters Θt = (f ti , ψti , λt) is computed as∂`Θt=∂xt∂Θt∂`xt= 2∂xt∂Θt(xt − x¯) (A.32)In order to compute ∂xt/∂Θt, xt in Eq.A.28 can be rewritten asxt =(KTt−1Kt−1 + λtN∑i=1FtiTFti)−1(KTt−1b + λtN∑i=1FtiTψti(Ftizt−1))(A.33)Then ∂xt/∂Θt can be derived following the matrix calculus rules, and we refer thederivation details of ∂xt/∂Θt to the supplemental material of [92].Calculation of Eq.5.10For zt and kt update, the training loss ` = ||kt − k¯||22 + α||zt − x¯||22. Its gradientw.r.t.the model parameters Ωt = (gti, φti, ηt, τ t) is computed as∂`∂Ωt=∂zt∂Ωt∂kt∂zt∂`∂kt+∂kt∂Ωt∂`∂kt+∂zt∂Ωt∂`∂zt(A.34)where∂`∂kt= 2(kt − k¯) (A.35)∂`∂zt= 2α(zt − x¯) (A.36)142In order to compute ∂kt/∂zt and ∂kt/∂Ωt, kt in Eq.A.31 can be rewritten askt = (ZTt Zt + τt)−1(ZTt b) (A.37)For brevity, we denote (ZTt Zt + τt) as Π, and (ZTt b) as Λ, then kt = Π−1Λ. Weuse the square bracket around any image a, i.e. [a], to indicate the matrix repre-senting convolution with a. Moreover, we define matrix R representing rotating a2D image by 180 degrees, i.e.,Ra means rotating the 2D image a by 180 degrees.Then, we have∂kt∂zt=∂(Π−1Λ)∂zt=∂Λ∂ztΠ−1 −ΛTΠ−1∂Π∂ztΠ−1=∂(ZTt b)∂ztΠ−1 − ∂(Πkt)∂ztΠ−1=∂([b]Rzt)∂ztΠ−1 − ∂((ZTt Zt + τt)kt)∂ztΠ−1= RT[b]TΠ−1 − ∂(ZTt Ztkt)∂ztΠ−1= RT[b]TΠ−1 − (RT[Ztkt]T + KTt Zt)Π−1= (RT[b]T −RT[Ztkt]T −KTt Zt)Π−1(A.38)And,∂kt∂Ωt=∂Π−1Λ∂τ t=∂Λ∂τ tΠ−1 −ΛTΠ−1∂Π∂τ tΠ−1= −ΛTΠ−1∂(ZTt Zt + τt)∂τ tΠ−1= −KTt Π−1(A.39)143By substituting Eq.A.35, A.36, A.38, A.39 into Eq.A.34, we have∂`∂Ωt= −2KTt Π−1(kt − k¯)+ 2∂zt∂Ωt((RT[b]T −RT[Ztkt]T −KTt Zt)Π−1(kt − k¯) + α(zt − x¯))(A.40)The derivation of ∂zt/∂Ωt is similar as ∂xt/∂Θt, and we refer the details to thesupplemental material of [92]. We will make our training code publicly availablewith the paper.144A.3 Supplemental material for Chapter 6This section gives the derivation of the analytic gradients that are required for train-ing in Chapter 6.The HQS iterations (t = 1, 2, ..., T ) in our method read as follows:zt = proxΘ(xt−1)xt =(λATA + ρt)︸ ︷︷ ︸Πt−1 (λATb + ρtzt)︸ ︷︷ ︸Λt(A.41)For both denoising and deconvolution tasks, the xt-update in Eq.A.41 has a closed-form solution, which is given in Eq.6.8 and 6.9 in Chapter 6. In this supplementaldocument we present a derivation with the more general formula xt = Π−1t Λt.In our method, the trainable parameters Ω = {λp,Θ} include the fidelityweight λp for each problem class p, and the model parameters Θ of the proxi-mal fields that are shared across all problem classes. The training loss function ` isdefined as the negative of the average PSNR between the reconstructed and groundtruth images. The gradient of the loss ` w.r.t. Θ is computed by averaging thegradients of all images, while the gradient of the loss ` w.r.t. λp is computed by av-eraging the gradients of only those images that belong to class p. For convenience,we give the derivations for one image, and omit the class label p below.` = −20 log10(255√M||xT − xtrue||2), (A.42)where M is the number of pixels in each image, xtrue is the ground truth image,and xT is the reconstructed image.∂`∂Ω=T∑t=1(∂xt∂λ∂`∂xt+∂zt∂Θ(∂`∂zt+∂xt∂zt∂`∂xt))=T∑t=1(∂xt∂λ+∂zt∂Θ∂xt∂zt)∂`∂xt(A.43)whereby we used ∂`/∂zt = 0. Next, we provide the derivation for the partial145derivative terms in Eq.A.43:∂xt∂λ=∂Λt∂λΠ−1t −(Π−1t Λt)T∂Πt∂λΠ−1t =(∂Λt∂λ− ∂Πtxt∂λ)Π−1t=(ATb−ATAxt)TΠ−1t(A.44)∂xt∂zt=∂Λt∂ztΠ−1t −(Π−1t Λt)T∂Πt∂ztΠ−1t =∂Λt∂ztΠ−1t = ρtΠ−1t (A.45)∂`∂xt−1=∂xt∂xt−1∂`∂xt=∂zt∂xt−1∂xt∂zt∂`∂xt= ρt∂zt∂xt−1Π−1t∂`∂xt(A.46)To compute the gradient of zt w.r.t. xt−1 and Θ, i.e. ∂zt/∂xt−1 and ∂zt/∂Θ ,let’s first recall Eq. 6 in the main paper:ztk = ztk−1 −N∑i=1FkiTψki (Fki ztk−1), s.t. zt0 = xt−1, k = 1, 2, ...,K.(A.47)Note that xt−1 only appears at the first stage k = 1:zt1 = zt0 −N∑i=1F1iTψ1i (F1i zt0) = xt−1 −N∑i=1F1iTψ1i (F1ixt−1) (A.48)Therefore,∂zt∂xt−1=∂ztK∂xt−1=∂zt1∂xt−1∂ztK∂zt1=(I−N∑i=1F1iTψ′1i (F1ixt−1)F1i)∂ztK∂zt1,(A.49)where I is an identity matrix, and ∂ztK/∂zt1 can be computed by following therule below:∂ztk∂ztk−1= I−N∑i=1FkiTψ′ki (Fki ztk−1)Fki , k = 1, 2, ...,K (A.50)∂zt/∂Θ in Eq.A.43 is composed of ∂zt/∂fki and ∂zt/∂ψki , which are computed146as follows:∂zt∂fki=∂ztk∂fki∂ztK∂ztk= −∂FkiTψki (Fki ztk−1)∂fki∂ztK∂ztk(A.51)∂zt∂ψki=∂ztk∂ψki∂ztK∂ztk= −∂FkiTψki (Fki ztk−1)∂ψki∂ztK∂ztk(A.52)147


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items