UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Structure-aware computational imaging Heide, Felix 2016

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2016_november_heide_felix.pdf [ 156.8MB ]
JSON: 24-1.0318066.json
JSON-LD: 24-1.0318066-ld.json
RDF/XML (Pretty): 24-1.0318066-rdf.xml
RDF/JSON: 24-1.0318066-rdf.json
Turtle: 24-1.0318066-turtle.txt
N-Triples: 24-1.0318066-rdf-ntriples.txt
Original Record: 24-1.0318066-source.json
Full Text

Full Text

Structure-Aware Computational ImagingbyFelix HeideMSc, University of Siegen, 2011BSc, University of Siegen, 2009A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Computer Science)The University of British Columbia(Vancouver)September 2016© Felix Heide, 2016AbstractWhile traditional imaging systems directly measure scene properties, computationalimaging systems add computation to the measurement process, allowing such sys-tems to extract non-trivially encoded scene features. This dissertation demonstratesthat exploiting structure in this process allows to even recover information thatis usually considered to be completely lost. Relying on temporally and spatiallyconvolutional structure, we extract two novel image modalities that were essen-tially “invisible” before: a new temporal dimension of light propagation, and anew per-pixel radial velocity dimension, both obtained using consumer Time-of-Flight cameras. These two novel types of images represent first steps toward theinversion of light transport. Specifically, we demonstrate that Non-Line-of-Sightimaging and imaging in scattering media can be made feasible with additionaltemporal information. Furthermore, structure-aware imaging also represents a com-pletely new approach to traditional color image processing. We show that classicalhand-crafted image processing pipelines can be replaced by a single optimizationproblem exploiting image structure. This approach does not only outperform thestate-of-the-art for classical image processing problems, but enables completely newcolor camera designs. In particular, we demonstrate camera designs with radicallysimplified optical systems, as well as novel sensor designs. The computation for allimaging problems from this dissertation relies on Bayesian inference using large-scale proximal optimization methods. We present a mathematical framework and acorresponding domain-specific language to automate the development of efficient,structure-aware solvers, allowing to immediately apply the insights gained in thisdissertation to new imaging problems.iiPrefaceThis dissertation is based on the following publications, which are all collaborativeworks. For each publication, the contributions of all the authors are listed.F. Heide, M. B. Hullin, J. Gregson, and W. Heidrich, Low-Budget Transient ImagingUsing Photonic Mixer Devices, ACM Transactions on Graphics (SIGGRAPH),2013, [1]: This work is presented in Chapter 3. The initial idea was conceivedby Prof. Heidrich, Prof. Hullin, and the author. Prof. Hullin contributed the imageformation and the hardware prototype. The inverse problem and the optimizationmethod for its solution were developed by the author. Prof. Heidrich wrote the intialdraft of the manuscript. All the authors contributed equally to the writing of thefinal published article.M. O’Toole, F. Heide, L. Xiao, M. B. Hullin, W. Heidrich, and K. N. Kutulakos,Temporal Frequency Probing for 5D Transient Analysis of Global Light Transport,ACM Transactions on Graphics (SIGGRAPH), 2014, [2]: We discuss this work atthe end of Chapter 3. Dr. O’Toole and the author had the initial idea for this work.Prof. Kutulakos and Dr. O’Toole drew the connection to previous work using spatiallight modulation. Chapter 3 gives an alternative derivation, which was developed bythe author. Furthermore, the author contributed the transient inverse methods, whilethe 3D reconstructions were performed by Dr. O’Toole. Lei Xiao, Dr. O’Toole,and the author performed the experiments. The prototype camera was developed byProf. Hullin. All the authors took part in writing the published manuscript.iiiF. Heide, L. Xiao, W. Heidrich, and M. B. Hullin, Diffuse Mirrors: 3D Reconstruc-tion from Diffuse Indirect Illumination Using Inexpensive Time-Of-Flight Sensors,Computer Vision and Pattern Recognition (CVPR), 2014, [3]: Chapter 5 coversthis work. The initial idea was conceived by the author. Prof. Hullin suggested theexperimental setup and developed the imaging and light source hardware setup. LeiXiao and the author captured the measurements. The image formation and inversemethod were developed by the author. Prof. Heidrich and Prof. Hullin jointlycoordinated this project. All the authors contributed to writing the published article.F. Heide, L. Xiao, A. Kolb, M. B. Hullin, and W. Heidrich, Imaging in ScatteringMedia Using Correlation Image Sensors and Sparse Convolutional Coding, OpticsExpress, 2014, [4]: This work is presented in Chapter 4. The original idea wasdeveloped by Prof. Heidrich and the author. In particular, Prof. Heidrich proposedthe experimental setup, while the author developed the convolutional model fortransient images as well as the inverse methods. Lei Xiao and the author capturedthe measurements using an imager provided by Prof. Kolb and a prototype providedby Prof. Hullin. Prof. Heidrich, Lei Xiao, and the author wrote the article.F. Heide, W. Heidrich, and G. Wetzstein, Fast and Flexible Convolutional SparseCoding, Computer Vision and Pattern Recognition (CVPR), 2015, [5]: This publi-cation is covered in Chapter 6. The initial idea as well as the concrete optimizationmethods were conceived by the author. Furthermore, the author performed theexperimental validation of the presented approach and wrote the initial draft. Prof.Wetzstein wrote a significant portion of the published article that all the authorstook part in writing.F. Heide, W. Heidrich, M. Hullin, and G. Wetzstein, Doppler Time-of-Flight Imaging,ACM Transactions on Graphics (SIGGRAPH), 2015, [6]: The initial idea of thiswork, which is described in Chapter 7, was conceived by Prof. Heidrich andthe author. The author developed the image formation model and the capturemethod based on Orthogonal Frequency-Division Multiplexing. After the firstpilot experiments conducted in Prof. Heidrich’s laboratory, all final experimentsivwere conducted in Prof. Wetzstein’s laboratory. Prof. Hullin contributed thecamera prototype and many fruitful discussions before conducting the experiments.Prof. Heidrich and Prof. Wetzstein jointly coordinated this project. The publishedmanuscript was written by all the authors.S. Shrestha, F. Heide, W. Heidrich, and G. Wetzstein, Computational Imaging withMulti-Camera Time-of-Flight Systems, ACM Transactions on Graphics (SIGGRAPH),2016, [7]: We discuss this publication at the end of Chapter 7. Prof. Wetzsteinhad the initial idea for this work. The author conceived the applications of theproposed system to interference canceling and dynamic Non-Line-of-Sight imaging.Shikhar Shrestha developed the hardware system and captured the experimental datatogether with the author. Prof. Wetzstein wrote the initial draft of the publication.All the authors took part in writing the final publication.F. Heide, M. Rouf, M. B. Hullin, B. Labitzke, W. Heidrich, and A. Kolb, High-QualityComputational Imaging through Simple Lenses, ACM Transactions on Graphics(TOG), 2013, [8]: This work is discussed in Chapter 8. The initial idea wasconceived by the author and Prof. Heidrich. The author proposed the inverseproblem and developed the optimization methods as well as their implementations.Mushfiqur Rouf developed the extension to handle low-light regions. The authorperformed all measurements. The multispectral camera prototype discussed inSection A.11 was developed by Alex Gukov. Prof. Hullin contributed with fruitfuldiscussions. The Point Spread Function estimation was adopted from previous workwith Prof. Kolb and Dr. Labitzke. Prof. Heidrich, Prof. Hullin, and the author wrotethe manuscript.F. Heide, Q. Fu, E. Peng, and W. Heidrich, Encoded Diffractive Optics for Full-Spec-trum Computational Imaging, Nature Scientific Reports, 2016, [9]: The authorconceived the idea of this work. Parts of it are discussed in Chapter 8. Furthermore,he proposed the factorization and reconstruction method and implemented the cor-responding algorithms. Dr. Fu designed the optical parameters and fabricated thelenses. Yifan Peng performed the Modulation Transfer Function analysis. Dr. Fuvand the author conducted the experiments. Prof. Heidrich coordinated the wholeproject. All the authors took part in writing the paper.F. Heide, M. Steinberger, Y. Tsai, M. Rouf, D. Paj ↪ak, D. Reddy, O. Gallo, J. Liu,W. Heidrich, K. Egiazarian, J. Kautz, and K. Pulli, FlexISP: A Flexible CameraImage Processing Framework, ACM Transactions on Graphics (SIGGRAPH Asia),2014, [10]: This publication is covered in Chapter 9. The author had the initialidea for this work. Furthermore, the inverse problem as well as the optimizationmethods were contributed by the author. While the initial implementations were alsodeveloped by the author, Yun-Ta Tsai and Prof. Steinberger contributed the efficientGPU implementations. Dr. Reddy and Mushfiqur Rouf provided the experimentalresults for the color array camera. Dr. Gallo, Dr. Paj ↪ak, and Jing Liu performedthe interlaced High Dynamic Range experiments. Dr. Kautz performed the burstimage denoising experiments. Prof. Egiazarian contributed with many insightfuldiscussions and in the parameter selection. The published manuscript was writtenby all the authors.F. Heide, S. Diamond, M. Nießner, J. Ragan-Kelley, W. Heidrich, and G. Wetzstein,ProxImaL: Efficient Image Optimization using Proximal Algorithms, ACM Transac-tions on Graphics (SIGGRAPH), 2016, [11]: This work is presented in Chapter 10.The original idea was conceived by the author. Furthermore, the author contributedthe proposed mathematical framework. Steven Diamond and the author developedand implemented the mathematical compiler as well as all solver algorithms. Thefunctions with Halide support were implemented by the author, with help fromDr. Nießner and Dr. Ragan-Kelley. Prof. Wetzstein wrote the initial manuscriptdraft. Steven Diamond, Prof. Wetzstein, Prof. Heidrich, and the author took part inwriting the final publication.viTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xivList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . 72 Background and Related Work . . . . . . . . . . . . . . . . . . . . . 122.1 Light . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.2 The Propagation of Light . . . . . . . . . . . . . . . . . . . . . . 142.2.1 Transmission, Reflection, and Refraction . . . . . . . . . 142.2.2 Global Light Transport . . . . . . . . . . . . . . . . . . . 172.2.3 Transient Global Light Transport . . . . . . . . . . . . . . 192.3 Optics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.3.1 Diffraction . . . . . . . . . . . . . . . . . . . . . . . . . 212.3.2 Geometric Optics . . . . . . . . . . . . . . . . . . . . . . 22vii2.3.3 Aberrations . . . . . . . . . . . . . . . . . . . . . . . . . 242.4 Imaging Sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.4.1 Measuring Irradiance . . . . . . . . . . . . . . . . . . . . 282.4.2 CCD and CMOS Sensors . . . . . . . . . . . . . . . . . . 312.4.3 Sensor and Noise Model . . . . . . . . . . . . . . . . . . 332.4.4 Color Imaging . . . . . . . . . . . . . . . . . . . . . . . 382.4.5 Modified Sensor Designs . . . . . . . . . . . . . . . . . . 402.5 Time-of-Flight Imaging . . . . . . . . . . . . . . . . . . . . . . . 402.5.1 Optical Range Imaging . . . . . . . . . . . . . . . . . . . 402.5.2 Impulse Time-of-Flight Imaging . . . . . . . . . . . . . . 422.5.3 Continuous Wave Time-of-Flight Imaging . . . . . . . . . 432.6 Working Principles of Correlation Image Sensors (CIS) . . . . . . 462.6.1 Homodyne Measurement . . . . . . . . . . . . . . . . . . 472.6.2 Heterodyne Measurement . . . . . . . . . . . . . . . . . 482.6.3 Analysis in the Spectral Domain . . . . . . . . . . . . . . 492.6.4 Sensor and Noise Model . . . . . . . . . . . . . . . . . . 512.6.5 Multi-Path Interference . . . . . . . . . . . . . . . . . . . 522.6.6 Dynamic Scenes . . . . . . . . . . . . . . . . . . . . . . 542.7 Transient Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . 562.7.1 Impulse vs. Continuous Wave Transient Imaging . . . . . 572.7.2 Continuous Wave Transient Imaging . . . . . . . . . . . . 582.8 Inverse Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 582.8.1 Inverse Problems . . . . . . . . . . . . . . . . . . . . . . 592.8.2 Statistical Image Recovery . . . . . . . . . . . . . . . . . 592.8.3 Linear Models . . . . . . . . . . . . . . . . . . . . . . . 622.8.4 Optimization Methods . . . . . . . . . . . . . . . . . . . 633 Transient Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663.2 Transient CIS Image Formation . . . . . . . . . . . . . . . . . . . 673.2.1 Generalized CIS Model . . . . . . . . . . . . . . . . . . . 673.2.2 Single Reflection . . . . . . . . . . . . . . . . . . . . . . 683.2.3 Global Illumination . . . . . . . . . . . . . . . . . . . . . 69viii3.3 Transient Imaging using CIS . . . . . . . . . . . . . . . . . . . . 713.3.1 Inverse Problem . . . . . . . . . . . . . . . . . . . . . . 713.3.2 Optimization via Splitting . . . . . . . . . . . . . . . . . 743.3.3 Solving the i-subproblem . . . . . . . . . . . . . . . . . . 753.4 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 793.5.1 Synthetic Examples . . . . . . . . . . . . . . . . . . . . . 793.5.2 CIS Measurements . . . . . . . . . . . . . . . . . . . . . 803.6 Spatio-Temporal Modulation . . . . . . . . . . . . . . . . . . . . 833.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864 Imaging in Scattering Media . . . . . . . . . . . . . . . . . . . . . . 884.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884.2 Sparse Transient Mixture Model . . . . . . . . . . . . . . . . . . 904.2.1 Are Transient Images Sparse ? . . . . . . . . . . . . . . . 904.2.2 Sparse Local Model for Transient Light Interaction . . . . 924.3 Transient Convolutional Sparse Coding . . . . . . . . . . . . . . 944.3.1 Reformulation in the Spectral Domain . . . . . . . . . . . 954.3.2 Synthetic Evaluation . . . . . . . . . . . . . . . . . . . . 964.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 974.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 974.4.2 Qualitative Results . . . . . . . . . . . . . . . . . . . . . 994.4.3 Quantitative Results . . . . . . . . . . . . . . . . . . . . 994.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1015 Diffuse Mirrors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1035.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1035.2 Image Formation Model . . . . . . . . . . . . . . . . . . . . . . 1055.2.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . 1055.2.2 Stationary Light Transport . . . . . . . . . . . . . . . . . 1065.2.3 Transient Light Transport . . . . . . . . . . . . . . . . . 1075.2.4 Discretization . . . . . . . . . . . . . . . . . . . . . . . . 1075.2.5 Transient CIS Image Formation . . . . . . . . . . . . . . 108ix5.3 Reconstruction from Diffuse Indirect Illumination . . . . . . . . . 1085.3.1 Inverse Problem . . . . . . . . . . . . . . . . . . . . . . 1085.3.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . 1105.3.3 Enhancing Volume Sparsity . . . . . . . . . . . . . . . . 1125.4 Implementation and Parameter Selection . . . . . . . . . . . . . . 1125.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1135.5.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1135.5.2 Qualitative Results . . . . . . . . . . . . . . . . . . . . . 1145.5.3 Quantitative Results . . . . . . . . . . . . . . . . . . . . 1165.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1186 Convolutional Sparse Coding . . . . . . . . . . . . . . . . . . . . . . 1196.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1196.2 Mathematical Framework . . . . . . . . . . . . . . . . . . . . . . 1226.2.1 Efficient Splitting of the Objective . . . . . . . . . . . . . 1236.2.2 Generalization of the Objective . . . . . . . . . . . . . . 1246.2.3 Biconvex Minimization via Coordinate Descent . . . . . . 1276.2.4 Implementation details . . . . . . . . . . . . . . . . . . . 1276.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1286.3.1 Complexity Analysis . . . . . . . . . . . . . . . . . . . . 1286.3.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . 1296.4 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1316.4.1 Boundary Conditions . . . . . . . . . . . . . . . . . . . . 1316.4.2 Learning from Sparse Data . . . . . . . . . . . . . . . . . 1316.5 Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1336.5.1 Validation of Reconstruction . . . . . . . . . . . . . . . . 1336.5.2 Non-Normalized Data . . . . . . . . . . . . . . . . . . . 1336.5.3 Reconstruction with Known Basis . . . . . . . . . . . . . 1356.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1377 Doppler Velocity Imaging . . . . . . . . . . . . . . . . . . . . . . . . 1387.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1387.2 Dynamic CIS Image Formation . . . . . . . . . . . . . . . . . . . 141x7.3 Doppler Velocity Imaging . . . . . . . . . . . . . . . . . . . . . . 1437.3.1 Velocity Imaging via Orthogonal Frequencies . . . . . . . 1437.3.2 Simultaneous Range and Velocity . . . . . . . . . . . . . 1477.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1487.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1537.6 Toward the 3D Velocity Field . . . . . . . . . . . . . . . . . . . . 1567.7 Multi-Camera System . . . . . . . . . . . . . . . . . . . . . . . . 1597.7.1 Multi-Source Interference Cancelling . . . . . . . . . . . 1627.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1638 Simple Lens Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . 1668.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1678.2 Image Formation Model . . . . . . . . . . . . . . . . . . . . . . 1708.3 Deconvolution Using Cross-Channel Correlation . . . . . . . . . 1708.3.1 Cross-Channel Correlation . . . . . . . . . . . . . . . . . 1708.3.2 Inverse Problem . . . . . . . . . . . . . . . . . . . . . . 1738.3.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . 1738.3.4 Low-Light Imaging . . . . . . . . . . . . . . . . . . . . . 1758.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1778.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1849 Flexible Camera Image Processing . . . . . . . . . . . . . . . . . . . 1859.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1869.2 Image Formation Model . . . . . . . . . . . . . . . . . . . . . . 1899.3 Flexible Image Reconstruction . . . . . . . . . . . . . . . . . . . 1899.3.1 Inverse Problem . . . . . . . . . . . . . . . . . . . . . . 1899.3.2 Denoising Prior . . . . . . . . . . . . . . . . . . . . . . . 1909.3.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . 1919.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1939.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1959.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205xi10 Proximal Image Optimization . . . . . . . . . . . . . . . . . . . . . 20710.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20810.2 Representing Image Optimization Problems . . . . . . . . . . . . 21210.3 The ProxImaL Language . . . . . . . . . . . . . . . . . . . . . . 21510.3.1 Proxable Functions . . . . . . . . . . . . . . . . . . . . . 21610.3.2 Linear Operators . . . . . . . . . . . . . . . . . . . . . . 21710.4 Compiling Problems to Efficient Solvers . . . . . . . . . . . . . . 22010.4.1 Rewriting Problems . . . . . . . . . . . . . . . . . . . . 22010.4.2 Problem Splitting . . . . . . . . . . . . . . . . . . . . . . 22310.4.3 Problem Scaling . . . . . . . . . . . . . . . . . . . . . . 22310.5 Proximal Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 22410.5.1 Stopping Criteria . . . . . . . . . . . . . . . . . . . . . . 23010.5.2 Convergence Properties . . . . . . . . . . . . . . . . . . 23210.6 Analysis of Linear Systems . . . . . . . . . . . . . . . . . . . . . 23410.7 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 23610.8 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23710.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24411 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254A Supporting Materials . . . . . . . . . . . . . . . . . . . . . . . . . . 284A.1 Detailed Spectral Analysis of the CIS Image Formation . . . . . . 284A.2 Harmonic CIS Image Formation . . . . . . . . . . . . . . . . . . 284A.3 Transient Imaging with Gaussian-Exponential Priors . . . . . . . 286A.4 Indirect Imaging of Specular Object, and with Ambient Light . . . 291A.5 Imaging in Scattering Media of Various Densities . . . . . . . . . 296A.6 Validation of Convolutional Sparse Coding . . . . . . . . . . . . 299A.7 Non-Sinusoidal Doppler Time-of-Flight Imaging . . . . . . . . . 299A.8 Velocity Map Denoising . . . . . . . . . . . . . . . . . . . . . . 300A.9 Residual and Scale-Space Deconvolution . . . . . . . . . . . . . 301A.10 PSF Estimation using Calibration Targets . . . . . . . . . . . . . . 303A.11 Cross-Channel Deconvolution for Various Optical Systems . . . . 306xiiA.12 Natural Image Prior Design Choices . . . . . . . . . . . . . . . . 311A.13 Bayesian Color Arrays Imaging and Deblocking . . . . . . . . . . 315A.14 Poisson Data Fidelity . . . . . . . . . . . . . . . . . . . . . . . . 321xiiiList of TablesTable 6.1 Complexity of Convolutional Sparse Coding . . . . . . . . . . 128Table 9.1 Quantitative Deconvolution Results . . . . . . . . . . . . . . . 200Table 9.2 FlexISP Performance . . . . . . . . . . . . . . . . . . . . . . 204Table 10.1 Core Primitives in the ProxImaL Language . . . . . . . . . . . 215Table 10.2 Proxable Functions Provided by ProxImaL . . . . . . . . . . . 217Table 10.3 Linear Operators Provided by ProxImaL . . . . . . . . . . . . 218Table 10.4 Logic for the is_diag and is_gram_diag Subroutines . . . . . . . 235Table 10.5 Runtime for Linear Operators . . . . . . . . . . . . . . . . . . 236Table 10.6 Lines of Code Comparisons . . . . . . . . . . . . . . . . . . . 237Table 10.7 Demosaicking Results . . . . . . . . . . . . . . . . . . . . . . 238Table 10.8 Quantitative Evaluation of Poisson Deconvolution . . . . . . . 243Table A.1 Weighting Internal and External Priors . . . . . . . . . . . . . 312xivList of FiguresFigure 1.1 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . 4Figure 2.1 Propagation of Light at an Interface . . . . . . . . . . . . . . 16Figure 2.2 Global Light Transport in a Diffuse Scene . . . . . . . . . . . 18Figure 2.3 Time-resolved Global Light Transport . . . . . . . . . . . . . 19Figure 2.4 Optical Systems . . . . . . . . . . . . . . . . . . . . . . . . . 20Figure 2.5 Seidel Aberrations . . . . . . . . . . . . . . . . . . . . . . . 25Figure 2.6 Point Spread Function (PSF)s of a Biconvex Lens . . . . . . . 26Figure 2.7 CCD and CMOS APS Sensor Working Principle . . . . . . . . 31Figure 2.8 Sensor Model for Linear Digital Cameras . . . . . . . . . . . 34Figure 2.9 Working Principle of a Photonic Mixer Device (PMD) Pixel . . 45Figure 2.10 CIS Time-of-Flight (TOF) Range Imaging . . . . . . . . . . . 46Figure 2.11 Spectral Analysis of CIS TOF Imaging . . . . . . . . . . . . . 50Figure 2.12 Multi-Path Interference . . . . . . . . . . . . . . . . . . . . . 53Figure 2.13 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . 60Figure 3.1 Transient Imaging . . . . . . . . . . . . . . . . . . . . . . . 67Figure 3.2 Transient CIS Image Formation Model . . . . . . . . . . . . . 69Figure 3.3 Gaussian-Exponential Priors . . . . . . . . . . . . . . . . . . 73Figure 3.4 Setup for Transient Imaging . . . . . . . . . . . . . . . . . . 76Figure 3.5 Transient Measurement Matrix . . . . . . . . . . . . . . . . . 78Figure 3.6 Simulated Transient Imaging Results . . . . . . . . . . . . . 79Figure 3.7 Reconstructed Transient Profiles . . . . . . . . . . . . . . . . 80Figure 3.8 Test Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . 81xvFigure 3.9 Transient Reconstructions for Complex Scenes . . . . . . . . 82Figure 3.10 Spatio-Temporal Modulation . . . . . . . . . . . . . . . . . . 84Figure 3.11 Spatio-Temporal Transient Imaging . . . . . . . . . . . . . . 85Figure 4.1 Imaging in Scattering Media . . . . . . . . . . . . . . . . . . 89Figure 4.2 Continuum of Indirect Paths in Scattering . . . . . . . . . . . 90Figure 4.3 Sparsity of Transient Images . . . . . . . . . . . . . . . . . . 92Figure 4.4 Exponentially Modified Gaussians . . . . . . . . . . . . . . . 93Figure 4.5 Synthetic Evaluation . . . . . . . . . . . . . . . . . . . . . . 96Figure 4.6 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98Figure 4.7 Results for Scattering Milk Particles . . . . . . . . . . . . . . 99Figure 4.8 Results for Scattering Plaster Particles . . . . . . . . . . . . . 100Figure 4.9 Error Analysis for Imaging in Scattering Media . . . . . . . . 100Figure 5.1 Diffuse Mirrors . . . . . . . . . . . . . . . . . . . . . . . . . 104Figure 5.2 Schematic Overview . . . . . . . . . . . . . . . . . . . . . . 105Figure 5.3 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114Figure 5.4 Albedo Reconstructions . . . . . . . . . . . . . . . . . . . . 115Figure 5.5 Simultaneous Albedo and Geometry Reconstructions . . . . . 115Figure 5.6 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . 116Figure 6.1 Convolutional Sparse Coding . . . . . . . . . . . . . . . . . . 120Figure 6.2 Empirical Convergence . . . . . . . . . . . . . . . . . . . . . 129Figure 6.3 Visualization of Filter Learning . . . . . . . . . . . . . . . . 131Figure 6.4 Learned Filters . . . . . . . . . . . . . . . . . . . . . . . . . 132Figure 6.5 Boundary Extrapolation . . . . . . . . . . . . . . . . . . . . 133Figure 6.6 Learning From Sparse Observations . . . . . . . . . . . . . . 134Figure 6.7 Inpainting Results for Normalized Data . . . . . . . . . . . . 135Figure 6.8 Inpainting Non-Normalized Data . . . . . . . . . . . . . . . . 135Figure 6.9 Reconstructions Using Known Convolutional Basis . . . . . . 136Figure 7.1 Doppler Velocity Imaging . . . . . . . . . . . . . . . . . . . 139Figure 7.2 Doppler Shift in CIS Imaging . . . . . . . . . . . . . . . . . . 142Figure 7.3 Velocity Imaging Using Orthogonal Frequencies . . . . . . . 144xviFigure 7.4 Simulated and Measured Sensor Response . . . . . . . . . . . 146Figure 7.5 Depth-Dependent Offset . . . . . . . . . . . . . . . . . . . . 149Figure 7.6 Experimental Verification for Varying Depth and Velocities . . 150Figure 7.7 Measurement Accuracy . . . . . . . . . . . . . . . . . . . . . 151Figure 7.8 Poisson Denoising . . . . . . . . . . . . . . . . . . . . . . . 153Figure 7.9 Results for Large Outdoor Scene . . . . . . . . . . . . . . . . 153Figure 7.10 Results for Periodic Motion . . . . . . . . . . . . . . . . . . 154Figure 7.11 Results for Fast Motion . . . . . . . . . . . . . . . . . . . . . 154Figure 7.12 Results for Textured Objects . . . . . . . . . . . . . . . . . . 155Figure 7.13 Results for Fast Gestures . . . . . . . . . . . . . . . . . . . . 156Figure 7.14 Results for Small Objects . . . . . . . . . . . . . . . . . . . . 157Figure 7.15 Optical Flow Failure Cases . . . . . . . . . . . . . . . . . . . 157Figure 7.16 Towards 3D Flow . . . . . . . . . . . . . . . . . . . . . . . . 158Figure 7.17 Multi-Camera TOF System . . . . . . . . . . . . . . . . . . . 159Figure 7.18 Multi-Camera Prototype Setup . . . . . . . . . . . . . . . . . 161Figure 7.19 Instantaneous Doppler Velocity Imaging . . . . . . . . . . . . 161Figure 7.20 Multi-Device Interference Cancellation . . . . . . . . . . . . 162Figure 8.1 Simple Lens Imaging . . . . . . . . . . . . . . . . . . . . . . 167Figure 8.2 PSFs of Simple Lenses . . . . . . . . . . . . . . . . . . . . . 168Figure 8.3 Cross-Channel Reconstruction . . . . . . . . . . . . . . . . . 171Figure 8.4 Cross-Channel Prior . . . . . . . . . . . . . . . . . . . . . . 172Figure 8.5 Chromatic Artifacts in Low-Light Areas . . . . . . . . . . . . 176Figure 8.6 Results for a Plano-Convex Lens . . . . . . . . . . . . . . . . 178Figure 8.7 Additional Scenes Captured with a Plano-Convex Lens . . . . 179Figure 8.8 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 180Figure 8.9 Results for Commercial Lens Systems . . . . . . . . . . . . . 181Figure 8.10 Results for an Ultra-Thin Diffractive Optical Element . . . . . 183Figure 9.1 Flexible Camera Image Processing . . . . . . . . . . . . . . . 186Figure 9.2 Replacing Traditional Camera Processing Pipelines . . . . . . 187Figure 9.3 Empirical Convergence . . . . . . . . . . . . . . . . . . . . . 196Figure 9.4 Quantitative Demosaicking Results . . . . . . . . . . . . . . 196xviiFigure 9.5 Qualitative Demosaicking Results . . . . . . . . . . . . . . . 197Figure 9.6 Joint Demosaicking and Denoising Results . . . . . . . . . . 198Figure 9.7 Demosaicking Results for Real-World Captures . . . . . . . . 198Figure 9.8 Qualitative Deconvolution Results . . . . . . . . . . . . . . . 199Figure 9.9 Interlaced HDR Reconstructions . . . . . . . . . . . . . . . . 201Figure 9.10 Burst Imaging Comparison . . . . . . . . . . . . . . . . . . . 203Figure 10.1 ProxImaL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208Figure 10.2 Example DAG Representation . . . . . . . . . . . . . . . . . 216Figure 10.3 DAG for Stacked Linear Operators . . . . . . . . . . . . . . . 219Figure 10.4 ProxImaL Compiler Pipeline . . . . . . . . . . . . . . . . . . 220Figure 10.5 Absorbing Linear Operator . . . . . . . . . . . . . . . . . . . 221Figure 10.6 Runtime Comparisons . . . . . . . . . . . . . . . . . . . . . 231Figure 10.7 Interlaced HDR Results . . . . . . . . . . . . . . . . . . . . 240Figure 10.8 Burst Denoising Results . . . . . . . . . . . . . . . . . . . . 241Figure 10.9 Poisson Deconvolution Results . . . . . . . . . . . . . . . . . 242Figure 10.10 Phase Retrieval Results . . . . . . . . . . . . . . . . . . . . . 245Figure A.1 Illustration of Coordinate Descent . . . . . . . . . . . . . . . 287Figure A.2 Transient Reconstructions for Diffuse Wall Scene . . . . . . . 290Figure A.3 Transient Profiles for Diffuse Wall Scene . . . . . . . . . . . 290Figure A.4 Non-Line-of-Sight (NLOS) Imaging of a Diffuse Foamboard . 291Figure A.5 NLOS Imaging of two Whiteboards . . . . . . . . . . . . . . 292Figure A.6 NLOS Imaging of Mirrors . . . . . . . . . . . . . . . . . . . . 292Figure A.7 NLOS Imaging of Brushed Metal . . . . . . . . . . . . . . . . 293Figure A.8 NLOS Geometry Reconstruction with Ambient Light . . . . . 294Figure A.10 Frame Averaging . . . . . . . . . . . . . . . . . . . . . . . . 294Figure A.9 NLOS Albedo Reconstruction with Ambient Light . . . . . . . 295Figure A.11 Sparsity Analysis for NLOS Imaging . . . . . . . . . . . . . . 295Figure A.12 Additional Scattering Experiments Using Milk . . . . . . . . 297Figure A.13 Additional Scattering Experiments Using Plaster . . . . . . . 298Figure A.14 Reconstruction Quality for Learned Filters . . . . . . . . . . 299Figure A.15 Synthetic Results for Velocity Map Denoising . . . . . . . . . 301xviiiFigure A.16 Residual Scale-Space Deconvolution . . . . . . . . . . . . . . 302Figure A.17 PSF Calibration Target . . . . . . . . . . . . . . . . . . . . . 305Figure A.18 Additional Results for Commercial Lens Systems . . . . . . . 306Figure A.19 Deconvolution for Multispectral Cameras . . . . . . . . . . . 307Figure A.20 Evaluation of the Cross-Channel Deconvolution . . . . . . . . 308Figure A.21 Additional Evaluation of the Cross-Channel Deconvolution . . 309Figure A.22 Cross-Channel Deconvolution on Simple Lens Captures . . . 310Figure A.23 Effects of Internal and External Priors . . . . . . . . . . . . . 313Figure A.24 Comparison of Natural-Image Priors . . . . . . . . . . . . . . 314Figure A.25 Choice of Prior Weights . . . . . . . . . . . . . . . . . . . . 316Figure A.26 Cross-Channel Optical Flow . . . . . . . . . . . . . . . . . . 317Figure A.27 Color-Array Camera Results . . . . . . . . . . . . . . . . . . 318Figure A.28 JPEG Deblocking Results . . . . . . . . . . . . . . . . . . . 321xixGlossaryCIS Correlation Image SensorsLTI Linear Time-InvariantPSF Point Spread FunctionTOF Time-of-FlightMPI Multi-Path InterferenceNLOS Non-Line-of-SightCMOS Complementary Metal Oxide SemiconductorAPS Active Pixel SensorCCD Charge-Coupled DeviceCSC Convolutional Sparse CodingSPAD Single-Photon Avalanche DiodesAMCW Amplitude Modulated Continuous WavePMD Photonic Mixer DeviceHDR High Dynamic RangeSNR Signal-to-Noise RatioPDF Probability Distribution FunctionxxMAP Maximum A PosterioriMMSE Minimum Mean Square ErrorTV Total VariationADMM Alternating Direction Method of MultipliersCG Conjugate GradientD-TOF Doppler Time-of-FlightOFDM Orthogonal Frequency-Division MultiplexingDSL Domain-Specific LanguagexxiAcknowledgmentsI am grateful to have had the opportunity to work with my academic advisor Prof.Wolfgang Heidrich. His guidance and support throughout my graduate studieswent far beyond of what a doctoral student could wish for. Without his relentlessencouragement, both in the laboratory and on the whiteboard, this dissertationwould not exist. Since my very first visit in Vancouver, one year before making thedecision to pursue a doctoral degree at the University of British Columbia, he hasinstilled in me a passion for the measurement, display, and inference using light,which has only grown since. I cannot imagine a better mentor and I now graduatefull of new ideas that I am eager to explore.Furthermore, I would like to thank the members of my doctoral supervisorycommittee Prof. Robert J. Woodham and Prof. Michiel van de Panne as well asthe external examiner Prof. Yoav Y. Schechner. I sincerely appreciate the fruitfuldiscussions on many of the ideas described in the following and the time spent onthe thorough evaluation of my dissertation.I feel lucky to have had terrific colleagues in Prof. Heidrich’s laboratories.James Gregson, Lei Xiao, Gerwin Damberg, Shuochen Su, Qiang Fu, and YifanPeng have made my doctoral studies a unique experience that I already miss.I have been fortunate to collaborate with outstanding researchers over the lastfew years. Without their contribution it would not have been possible to explorethe ideas from this dissertation as extensive as presented below. In particular, Prof.Gordon Wetzstein, Prof. Matthias Hullin, Prof. Ramesh Raskar and Dr. DouglasLanman all inspired me to revisit problems which have been discarded as too hard inprevious research. Furthermore, I thank Prof. Andreas Kolb, Dr. Matthew O’Tooleand Prof. Kiriakos Kutulakos for their collaboration on some of the work discussedxxiiin this thesis. Prof. Kolb has been a mentor of mine since I finished high school, andhis guidance and encouragement during my undergraduate and graduate degree atthe University of Siegen have laid the foundation for my doctoral studies. I thank myindustry collaborators Dr. Paul Green, Dr. Kari Pulli, and Dr. Jan Kautz for takingon the challenge of realizing some of the technical ideas from this dissertation incommercial products.I would also like to thank my wonderful friends back home in Germany and inNorth America. I am grateful to know that they are there to support me, especiallyin times of large workloads like those required for parts of this dissertation.xxiiiMeinen Eltern, fu¨r ihre Unterstu¨tzung und das Interesse and der Wissenschaft,ist diese Doktorarbeit gewidmet. Ihnen danke ich von Herzen dafu¨r, mir meinegesamte Ausbildung ermo¨glicht zu haben, und in gleichem Maße fu¨r das große,ermutigende Interesse an meiner Arbeit. Ha¨tte ich nicht ihre vollste Unterstu¨tzungwa¨hrend des Studiums und der Promotion gehabt, gerade in der langen Zeit imAusland, wa¨re die Vollendung dieser Arbeit nicht mo¨glich gewesen.xxivChapter 1IntroductionThrough billions of years of evolution, pairs of single-lens eyes have emerged as thedominating sensor of the mammalian visual system. It is remarkable that complextasks such as tracking, classification and localization can be performed robustly evenwhen other sensory inputs are not present. This is enabled by the vast computationalresources that are allocated for the visual system [12]. About half of the cerebralcortex of primates has been associated with visual tasks [13]. In particular, researchsuggests that the visual cortex represents natural scenes by local convolutionalfeatures [14], [15]. It is this compressive representation that enables the mammalianvisual cortex to robustly solve complex high-level tasks.Similar to the visual system, computational imaging systems attempt to decodescene features that are non-trivially encoded in the measurements by adding compu-tation to the measurement process. During the past decade, computational imaginghas made a huge impact in scientific applications [16], consumer imaging [17] andclassification [18], robotic and industrial applications [19], microscopy [20] andhealth [21]. The key idea enabling most of these applications is to exploit structurein the unknowns, that is the scene information to be recovered, and the measurementitself.In particular, convolutional structure in the measurement has the benefit that thelatent information is encoded locally. The local encoding, due to the limited supportof the convolutional kernel, preserves global context and therefore facilitates decod-ing the hidden information from the measurement. Furthermore, it leads to a linear1model that can be reasoned about using linear algebra, is computationally efficient,and Fourier analysis has proven to be an effective analysis tool for convolutionalmodels.An example of such a convolutional model is the spatial convolution that modelsaberrations in imaging optics. Since the end of the 19th century, imaging opticshave been designed to minimize all existing optical aberrations. In 1841 Gaussfirst described the linear model of optics [22], and only a few years later, in 1857,Seidel introduced aberration theory [23], which describes non-linear deviationsfrom Gauss’s simplified linear model, that is aberrations. Since then optical designaims at eliminating these aberrations by introducing increasingly complex opticalsystems. Only recently, researchers have recognized that optical aberrations cannot only be deliberately minimized but can be designed to carry additional sceneinformation. One example is coded aperture imaging, where the optical system isdesigned to have depth-varying aberrations [24]. Another example is the recentadvances in fluorescence microscopy, where depth-varying aberrations are designedusing a diffractive optical element [20].In addition to spatially convolutional measurements, spectrally convolutional ortemporal convolutional measurements can also effectively encode scene information.An example of a temporally convolutional measurement can be found in nature,which has equipped a few species of animals with remarkable capabilities allowingthem to survive in harsh environments. One of these evolutional strategies isbiological sonar. Animals using biological sonar create pulses of sound and listento the reflections of the pulse. In other words, the Linear Time-Invariant (LTI)system defined by the scene is convolved with an impulse, and the temporal impulseresponse of the LTI system is measured. The delay between transmission andreception, that is the Time-of-Flight (TOF) of the pulses, is used to locate andclassify objects. Notably, microchiropteran bats are therefore able to navigate incomplete darkness, and toothed whales (including dolphins, porpoises, killer whalesand sperm whales) rely on it in their underwater habitat, where scattering andabsorption drastically limit the range of vision [25].Inspired by biological sonar, optical TOF sensors measure the delay betweenlight emitted into a scene and received by a co-located sensor. It was the invention ofthe laser in 1960 as well as accurate timing electronics that enabled a commercially2successful application of this principle in Lidar systems [26]. While delivering highaccuracy and range, a major drawback is that distance is only measured to a singlepoint. Hence scanning is required. Recently, Correlation Image Sensors (CIS) arerevolutionizing the 3D depth imaging market by temporally convolving a scene withamplitude-modulated flood illumination, which no longer requires scanning. Thesensors consist of an array of pixels that can shift charges generated by incomingphotons into two buckets. By doing so, CIS convolve an incoming temporal signalwith a reference signal. From the resulting convolution signals, the TOF can beestimated with high accuracy. As this range measurement technique is based on TOF,structured light or stereo matching techniques are outperformed by a significantmargin due to the small baseline (that TOF-based systems allow) or due to lackof texture. CIS technology has been developed since the early 2000s [27]. Fastadoption of CIS-based range sensing has been accelerated by the fact that they canbe fabricated in the Complementary Metal Oxide Semiconductor (CMOS) process.CMOS technology has been driven by the microprocessor industry and is thereforeinexpensively available for mass-market production. Furthermore, additional logicfor control, signal generation, or processing can be directly implemented on chipalong with the sensor as in Microsoft’s Kinect sensor for Xbox One [28]. RecentCIS systems, however, do only very limited computation in post-processing [29].The key argument of this thesis is to exploit structure, both in the unknown sceneinformation and in the measurements, by adding computation to the measurementprocess. This is akin to the mammalian visual system, which performs this process-ing in the visual cortex. By adding the computation in the form of inverse problemsderived from Bayesian inference, one can, in fact, overcome current limitationsof imaging systems and “make the invisible visible.” We demonstrate this in thisdissertation as visualized in Figure 1.1. Adding computation has allowed us toextract two entirely new image modalities, beyond traditional color and depth imag-ing. These new image dimensions enable a large variety of novel applications. Forclassical color imaging, this approach leads to new computational camera designsthat significantly outperform traditional systems in many different scenarios.As shown in Figure 1.1 on the left, the first additional dimension is an ultra-high-resolution time dimension. Exploiting temporal structure in CIS measurements,we can separate all light paths corresponding to different photon travel times along3Chapter 3, [1] Chapter 6, [5] Chapter 7, [6] Chapter 8, [8] Chapter 9, [10]Chapter 4, [4] Chapter 5, [3]Chapter 10, [11]Figure 1.1: Thesis overview and scope (key publications indicated next tothe chapter labels): Adding computation to the measurement processcan overcome limitations of imaging systems by exploiting structurein the measurements or unknowns. Adding computation to CIS allowsextracting two new imaging dimensions. The first is an additional timedimension, which is introduced by transient imaging in Chapter 3. Twoapplications of transient imaging are demonstrated with imaging in scat-tering media Chapter 4, and non-line-of-sight imaging in Chapter 5. Amodel exploiting convolutional structure in the unknowns for both thetemporal and spatial dimension is discussed in Chapter 6. The seconddimension added to CIS imagers is velocity measured via the Dopplershift, which is presented in Chapter 7. Adding computation to colorimagers enables new computational camera designs, that outperformthe state-of-the-art. Imaging with radically simplified optics is demon-strated in Chapter 8. Novel computational sensor designs for challengingscenarios such as ultra-low-light imaging are shown in Chapter 9. Anoptimization framework unifying the computation for all of the methodsabove is discussed in Chapter 10.4this dimension. In other words, this means that such an image with additionaltime dimension is the impulse response of the light transport in the static scene.As light transport is a linear time invariant system that is fully described by itsimpulse response, this so-called transient image fully describes the light transportin the scene. We introduce a physically motivated model for transient images andreconstruction methods, relying on its convolutional structure in time. The abilityto reconstruct transient images from a co-design of capture and computation hasenabled new applications at the frontiers of vision. We demonstrate reconstructionof Non-Line-of-Sight (NLOS) objects from indirect reflections, thus turning diffusewalls into mirrors, and show imaging in strongly scattering media. In both cases, weexploit information that is treated as noise in regular range-measuring CIS systems.In this sense, our approach “makes the invisible visible.”Shown in the center of Figure 1.1, the second additional dimension that weintroduce is the velocity of moving objects. Recently, we were able to demonstratethat it is possible to capture the Doppler shift of sinusoidally modulated illumina-tion using CIS with specifically designed modulation signals and low-light imagereconstruction. Using the Doppler shift, one can compute the instantaneous axialvelocity of an object. Measuring the modulated light reflected from the scene withmultiple, differently correlating sensors enables 2D velocity capture along withusual the depth and amplitude capture. We imagine a variety of applications for thisnew imaging modality. In particular, traditionally challenging problems in visionsuch as segmentation, classification, and tracking could be drastically simplified byusing this new image dimension.Shown on the right of Figure 1.1, we demonstrate that adding computationto color imagers enables a variety of new computational camera designs that candrastically outperform classical color imaging systems. We demonstrate imagingwith radically different optical designs that do not necessarily follow traditionaldesign rules, such as minimizing optical aberrations. Instead, we exploit thataberrations can carry scene information embedded in the measurement. Recently,we have demonstrated that the specific structure of chromatic aberrations can beused for imaging with simple, cheap and very compact optics. In other words, ratherthan the classical approach of adding more elements to eliminate aberrations inincreasingly complex lens designs, we simplify the optics up to a single element5and remove the aberrations in computation. Going a step further, this approach caneven enable broadband imaging with ultra-thin diffractive optics. This opens upa space of very flexible, new optical designs, relying on thin refractive as well asdiffractive elements.Considering classical color image processing, also shown on the right of Fig-ure 1.1, we demonstrate that adding computation in the form of Bayesian inferencerepresents, in fact, a change of paradigm. Color image processing is tradition-ally organized as a pipeline model. Pipeline steps can accumulate error by neverback-projecting on the measured observation. Replacing pipelines by formal op-timization does not only yield a very flexible method without this drawback butalso outperforms known methods in classical problems such as demosaicking anddeconvolution. We demonstrate this on a variety of imaging systems using emergingarray sensors, interleaved High Dynamic Range (HDR) and low-light burst imaging.Here, low-light burst imaging is another example where added computation can“make the invisible visible.”Finally, this work covers the computational challenges for all of the methodsshown in Figure 1.1. Expressing imaging tasks as optimization problems usingBayesian inference is a principled approach as the image-formation can often bewell modeled. Hence, many state-of-the-art imaging methods follow this approach,like the ones in this dissertation. However, in practice, developing a given imageformation model into a solver method can be time-consuming and error-prone, asthere is a large span of choices in this process. Different combinations of statisticalpriors, various formulations of the same objective and different solver methodscan lead to drastically varying reconstruction quality and computational efficiency.We introduce a generalized mathematical framework and a corresponding domain-specific language that allow quick exploration of this space. A novel mathematicalcompiler attempts to reformulate a problem automatically to find the most efficientproblem translation. The key idea here is, again, to exploit structure in the problem.For the large variety of applications in this thesis, we demonstrate drasticallyincreased accuracy, convergence, and computational efficiency compared to naiveblack-box optimization, in many cases multiple orders of magnitude in run-time.This dissertation makes steps towards an “optimal camera” that does not attemptto capture the full plenoptic function by “brute-force” direct sampling using as many6sensors as possible, but rather as one that relies on structure in the scene and theimage formation process to only recover the unknowns of interest using few, easilyavailable measurement. The optics, sensor and reconstruction method of this cameraare jointly optimized for a given application. This also means that there is not asingle optimum. While solving such a higher-level camera optimization problemis extremely challenging due to its highly non-convex nature involving categoricaland integer variables, parametrized architectures which allow the exploration of asubspace of the objective are feasible. This dissertation already shows samples ofthe search space that have been found by optimizing one variable (e.g. the algorithmparameters or optics) with all others fixed, essentially performing a first coordinatedescent step. We also demonstrate first steps to extend this idea even further andconsider the scene also as part of the camera. The proposed imaging system forNLOS objects can be interpreted as a method that essentially turns a diffuse surfaceinto a large-scale image sensor. One could extend this system to be able to “unfold”light paths to a large depth while estimating the reflectance of each scene pointon the fly. Starting from a direct bounce, reconstructing object points “arounda corner” from a first bounce, and finally “around two corners” from a secondbounce, and so forth. Ultimately every scene surface will act as a sensor. Suchscene-adaptive imaging systems are yet another sample point of the rich high-levelobjective. Independent of this objective being minimized numerically or sampled byresearchers, it becomes clear that the capabilities of known biological image systemsdriven by evolution can be outperformed by a significant margin. By capturing lightpropagating at the speed of light, this dissertation provides a first indicator for thismargin.1.1 Dissertation StructureThe remainder of this dissertation will be structured as follows.Chapter 2. Background and Related Work This chapter explains the buildingblocks of the image formation models in this thesis and surveys previous relatedwork. In particular, it describes the basic working principles of CIS, reviews imagingoptics and explains recent color imaging sensors. Following, we review Bayesian7approaches for the inversion of the image formation from sensor measurements.Related traditional methods and work competing with any of the proposed ap-proaches are discussed. Hence, this chapter lays the foundation for deriving andunderstanding the techniques in the chapters that follow.Chapter 3. Transient Imaging This chapter demonstrates that transient imagescan be captured by adding computation to CIS measurements, covering our work [1]and [2]. We describe the capture sequence consisting of differently modulatedimages acquired with a CIS sensor, derives a model for local light/object interaction,and finally describes an optimization procedure to infer transient images given themeasurements and model. The resulting method produces transient images at a costseveral orders of magnitude below existing methods, speeds up the capture processby orders of magnitude, is robust to ambient illumination and allows for drasticallyincreased scene sizes compared to previous methods. Furthermore, this chapterdiscusses how to combine transient imaging with spatial light modulators. Relyingon the fact that light transport is separable in frequency, we show that all basic lighttransport techniques from conventional projector-to-camera systems can be appliedper individual modulation frequency. The additional spatial light transport analysishelps to improve resolution in transient imaging.Chapter 4. Imaging in Scattering Media We use the additional temporal dimen-sion, which was introduced in the previous chapter, to enable imaging in scatteringmedia. The core ideas from this chapter have been first presented in our work [4].Imaging in volumetric scattering is an extremely hard task, since due to the scatter-ing transport components are non-trivially mixed, leaving only the paths for ballisticphotons undisturbed. Nevertheless, we demonstrate that by exploiting convolutionalstructure in transient images, we can separate mixed light paths. We introduce anew Convolutional Sparse Coding model for transient images, and we demonstratethat our system can analyze the light transport through scattering and turbid media.Chapter 5. Diffuse Mirrors In this chapter, we use the additional temporal di-mension of light to turn diffuse walls into mirrors, covering our publication [3].8Diffuse walls scatter back into all directions in contrast to mirrors that preservedirectionality of the reflected light. However, in a diffuse reflection, the temporaldimension of light transport is left intact. This chapter describes an imaging systemto recover NLOS geometry and albedo from second-order diffuse reflections, effec-tively turning walls into mirrors. An efficient optimization strategy is presented,linearizing the objective and exploiting sparsity of the unknowns.Chapter 6. Convolutional Sparse Coding This chapter proposes an efficientoptimization method for data fitting problems using Convolutional Sparse Codingmodels, described originally in our work [5]. The proposed technique is generalizedto both reconstruction and learning of convolutional sparse codes. An optimizationmethod is proposed that finds better minima than the state-of-the-art, has betterempirical convergence and is therefore significantly faster. Due to its generalformulation, the proposed approach can be applied to a large variety of problems(that do not have to be related to CIS). We demonstrate the application of featurelearning for vision.Chapter 7. Doppler Velocity Imaging This chapter shows that by adding compu-tation to CIS measurements we can image instantaneous per-pixel velocity, coveringour publications [6] and [7]. The proposed technique exploits the Doppler effect ofobjects in motion, which shifts the temporal frequency of the illumination before itreaches the camera. We describe how to design coded illumination and modulationfrequencies that directly map velocities to measured pixel intensities. An effectivereconstruction method handling the resulting Poisson noise and misalignment isintroduced. Finally, we discuss a slight modification of our imaging system thatallows for color, depth, and velocity information to be captured simultaneously.Chapter 8. Simple Lens Imaging We demonstrate high-quality imaging with sim-ple, ultra-compact optics, first proposed in our work [8] and [9]. Modern imagingoptics are highly complex systems that aim at eliminating optical aberrations byintroducing a large number of individual optical elements. We propose a radicallydifferent design approach using uncompensated, simple optics, which computation-9ally corrects for the severe aberrations introduced by the simple optics. In particular,a novel blind, self-calibrating deconvolution method is proposed that relies on thestructure of the aberrations in different color channels of the latent image. We firstdemonstrate our approach using a simple refractive spherical lens and demonstratefirst results indicating that it can even enable imaging using ultra-thin diffractiveelements under broadband illumination.Chapter 9. Flexible Camera Image Processing This chapter rethinks classicalcamera image processing as an optimization problem, covering our work [10].Traditionally, camera image processing is organized as a pipeline of modules, eachresponsible for a particular part of the imaging task. This approach introduces acumulative error as each step considers only the output of the previous step, notthe original measurement. We replace the full pipeline with a single optimizationproblem using Bayesian inference. Observations are modeled using the imageformation while priors are enforced for the latent image. We demonstrate that thisapproach is very flexible and applicable to a broad variety of computational cameradesigns ranging from traditional Bayer imaging to ultra-low-light burst imaging, ineach case outperforming the state-of-the-art in terms of reconstruction quality.Chapter 10. Proximal Image Optimization As demonstrated in all previous chap-ters, an increasingly large number of imaging tasks can be expressed as optimizationproblems using Bayesian inference, often outperforming the state-of-the-art. Whiledesigning the objective is often a very principled and elegant approach, there aremany choices in the development of a solver, making this process time-consumingand error-prone. This chapter introduces a generalized mathematical representationfor image optimization problems and a corresponding domain-specific language,published as [11], which allows for exploration of this vast array of choices. Amathematical compiler is proposed that identifies structure in the objective andbased on this chooses the best way to translate a specific problem formulation intoa solver implementation. We demonstrate that exploring the many choices in thesolver development can lead to drastically increased accuracy and computationalefficiency, often multiple orders of magnitude in reduced run-time.10Chapter 11. Conclusion This chapter describes avenues of future research thatbuild toward the vision of an optimal imaging system, which not only matchesnaturally evolved visual systems found in nature but outperforms them for giventasks. The features of such an imaging system are outlined, major challenges onthe way toward this vision are discussed, and the next immediate steps based on thework in this thesis explored.11Chapter 2Background and Related WorkIn this chapter, we describe the building blocks of the image formation in this thesis,and survey work related to the proposed techniques. Starting from first principles,we first describe established models of light in Section 2.1. Following, Section 2.2discusses the propagation of light, and in particular introduces models for lighttransport in macroscopic scenes. Section 2.3 describes light transport in opticalsystems, while Section 2.4 reviews sensors that measure properties of the lighttransport. The concept of Correlation Image Sensors for range imaging is discussedin Section 2.5 and Section 2.6. This includes a discussion of CIS imaging in dynamicscenes and the Doppler effect. Having explained the properties of light transportand their measurement in various scenarios, Section 2.8 reviews inverse methodsfor the recovery of hidden scene features encoded in the measurements.2.1 LightClassical electrodynamics explains light as electromagnetic radiation [30]. In par-ticular, it corresponds to the spectral band from 400 - 700 nm wavelength of theelectromagnetic spectrum, which can be observed by the human eye [31]. This bandlies between the ultra-violet (shorter wavelengths) and infrared (longer wavelengths).Described by Maxwell’s equations [32], the fundamental laws of electromagnetictheory, various experiments verify that light behaves like an electromagnetic wavewith synchronized oscillating magnetic and electric fields. These two fields are12oriented perpendicular to each other and the direction of propagation (and energytransfer). Hence, they form a transverse wave. Electric and magnetic fields do notrequire a medium to exist, and in vacuum an electromagnetic wave propagates atthe speed of light c = 2.99792458 · 108m/s. Treating light as an electromagneticwave allows modeling its propagation, including effects of diffraction, interference,scattering, transmission, reflection, and refraction. The approximation of geomet-ric optics (i.e. ray optics) simplifies this propagation model. Geometric opticsaccurately models macroscopic propagation if the wavelength of light is small incomparison to the scene.While treating light as an electromagnetic wave describes many of its properties,it is important to realize that light is in fact not continuous as a classical wave. Thisbecomes evident in the photoelectric effect, describing that materials emit electronsfor incident illumination only at a certain frequency of the light but not dependingon its intensity. Einstein first explained that this effect is caused by light beingcomposed of quanta, i.e. photons [33]. This quantum theory of light seems to be inconflict with the classical wave description. Modern quantum mechanics solves thisconflict by accurately describing light as a dual wave-particle, which exhibits wave-like properties for propagating through space, and behaves like a particle duringemission and absorption [34]. This means that propagation effects can be explainedeither by the wave or the particle nature of light. However, for a large numberof photons, classical fields are produced on average. Hence, most macroscopicsituations can be explained treating light as an electromagnetic wave [30].In this dissertation, we describe imaging systems relying on both the wave andthe particle behavior of light. All of these systems are optical, i.e. measure light.They can be used in a wide range of applications just like the human visual systemmentioned in Chapter 1. This universality results from the fact that radiation withsmaller wavelength (such as UV, X-ray or Gamma radiation) can be harmful tobiological organisms due to the high photon energy. Radiation of lower wavelength,on the other hand, limits the resolution of a spatial sensor due to diffraction.132.2 The Propagation of Light2.2.1 Transmission, Reflection, and RefractionThe propagation of light in free space follows immediately from the models of lightintroduced in the previous section. Using the wave model, electromagnetic wavestravel unaltered through the vacuum at the speed of light. The propagation directionis perpendicular to the electromagnetic wave front. This can be convenientlydescribed using light rays, which are lines corresponding to the flow direction ofradiant energy [30]. While the propagation in a vacuum can be described just bythe properties of the wave itself, the propagation in homogeneous media and atinterfaces between different media requires describing the interaction of light withthe molecules in the media.Transmission For molecules of a size much smaller than the wavelength of light,this interaction is described by Rayleigh Scattering [35]. This includes gases (e.g.present in air), transparent liquids (such as water) and clear solids (such as glass).In these media, every molecule acts as an oscillator that can be set to vibrate uponabsorbing a photon. After the absorption, almost instantaneously, this oscillationcauses the emission of another photon of the same frequency, that is elastic scatteringoccurs. Using the wave model of light, we can infer the forward propagation oflight in homogeneous media. The randomly distributed molecules in the mediumcreate electromagnetic waves in a scattering event. These waves interfere with eachother, defining the new direction of propagation of light. For randomly spacedmolecules, the waves will be in-phase in the forward direction and constructivelyinterfere, while lateral phases will be randomly distributed and mostly destructivelyinterfere. The denser the medium, the less lateral scattering occurs [30]. Hence, thepropagation of light in a homogeneous dense medium (such as air and glass) is, infact, the same forward propagation as in the vacuum (the lateral and back-scattercomponent can be neglected), and it can also be described by light rays.While the propagation direction is the same as in free space, the apparentphase velocity of the transmitted light can change in a medium, although thephotons all travel at the speed of light c. This is because the molecules in the14medium only oscillate perfectly in-phase at low frequencies. As the frequencyof the electromagnetic radiation increases, a phase-lag in the scattered photonsis introduced, hence reducing the phase velocity ν of the apparent transmittedlight [31]. This phase retardation by the molecules is macroscopically described bythe medium’s index of refractionn =cν. Index of Refraction (2.1)Per definition, it is n = 1 in free-space and n > 1 for other materials. See [30] fora table of n for common materials. Hence, intuitively, light appears to travel slowerin most media, although its photons and the associated electric fields propagate atthe speed of light.Reflection Let us consider a discontinuity between two media with differentrefractive indices; see Figure 2.1. When a beam of light hits this interface, thephotons in this beam scatter at the closely packed molecules across the interface.A portion of the light is back-scattered, and the remaining portion is transmitted.Hence, the beam of light is not sustained at the interface but is separated into areflected and refracted beam [31]. The fact that two well-defined beams are createdas the result of the scattering can be described using the wave model. A plane wave,incident on a flat surface at some angle, sweeps across the molecules on the surface,generating waves along this sweep. As the surface is locally flat and wave fronts areplanar, these waves constructively interfere and form a well-defined reflected wave;see Figure 2.1. Note that, in principle, all molecules before and after the interfacejointly create the reflected wave. However, the majority of the energy is reflectedin a thin layer at the surface (about half of a wavelength in depth) [30]. Due to theplanar geometry of the wave fronts and interface, the angle of the incident lightbeam θi equals the angle of reflected lightθo = θi. Law of Reflection (2.2)15Figure 2.1: Propagation of light at an interface. When a beam of plane waveshits an interface, two well-defined beams are formed. As illustrated atthe top left, one beam is reflected off the interface, and another beam isrefracted into the transmitting medium. The corresponding ray model isshown at the top right. The bottom describes the propagation behaviorof light at this interface as a result of scattering. The incident planewave sweeps across the surface of the transmitting media, generatingwaves along this sweep that constructively interfere into the reflectedand refracted plane waves.Refraction Having described the properties of the beam reflected at the interface,we now consider the light that gets scattered forward, passing the interface, as thetransmitted beam. The molecules at the interface emit waves not only into theincident medium (combining into the reflected wave) but also in the transmittingmedium; see Figure 2.1 again. These waves constructively interfere with each otheras a well-defined transmitted wave. However, because the waves are propagatingin another medium with a different refractive index, the apparent phase velocity of16the waves changes. This change in phase velocity causes the combined transmittedwave to propagate along a different direction. Hence, the beam of light bends [31],i.e. the beam gets refracted. Macroscopically this behavior is described by Snell’slaw, first described in [36]. For θo, no, θt, nt being the angles and refractive indicesof the incident and refracted beam, it issin θisin θt=νiνt=ntni, Snell’s law (2.3)where the indices of refraction on the right are derived by division of c.The laws of reflection and refraction, which are discussed above, can also be derivedfrom different models for the propagation of light. A very different point of viewis Fermat’s Principle: The path taken by a beam of light between two points is theone that can be traversed in the least time [31]. The laws of reflection, refraction,and the concept of optical path length can be derived from Fermat’s Principle [30].Another elegant model is given by electromagnetic theory, which derives theselaws from boundary conditions on the electric and magnetic fields [30]. Quantumelectrodynamics, on the other hand, describes reflection and refraction with theparticle behavior of light. Every possible light path is assigned a complex quantum-mechanical probability amplitude. The total probability amplitude is the sum of allpossible paths and describes the macroscopic effect [34].2.2.2 Global Light TransportThe propagation of light in homogeneous media and at interfaces can be macroscop-ically described with closed-form models, as explained in the previous paragraphs.For light transport in complex scenes, this is no longer possible. Instead, accuratesimulation requires the evaluation of integrals over all possible light paths in thescene, for which no analytic approximation can be found [37]. A large body of workin computer graphics covers such global light transport simulations, i.e. rendering.A detailed introduction can be found in [38]. The geometric optics approximation,i.e. the light ray model, is used to simplify the derivations for complex scenes.17Figure 2.2: Global light transport in a diffuse scene. From left to right: Directillumination only, first indirect bounce of light rays, second indirectbounce. The steady state limit of these individual ray propagations is theglobal light transport.The classical rendering equation has been introduced in [39]. It is derived fromthe law of conservation of energy: The total energy of an isolated system is constant.For a ray with initial point x and direction ωo, this can be expressed asL(x, ωo) = Le(x, ωo) + Lr(x, ωo), (2.4)where here the Spectral Radiance L is used as an energetic quantity for a bundle ofrays. Spectral radiance is the radiant power that is transmitted, refracted, absorbed orreflected by a surface per unit solid angle per unit projected area per unit wavelength[W · sr−1 ·m−2 ·Hz−1]. See [38] for an introduction into radiometry and [40] fora detailed review. Eq. (2.4) states that the radiance L leaving a point x in directionωo is the sum of the radiance emitted by the point Le, and the radiance Lr reflectedfrom other sources at the same point in the same direction. The reflected componentLr can be expressed as an integral of all components received over the hemisphereof incident angles Θ, and Eq. (2.4) becomes the rendering equationL(x, ωo) = Le(x, ωo) +∫ΩLi(x, ωi) · f(X,ωi, ωo) · cos(ωi) dωi RenderingEquation(2.5)The integral accumulates the incident radiance Li from any other point in the scenein direction ωi toward x, weighted by the bidirectional reflectance distributionfunction (BRDF) f and a cosine attenuation factor due to the incident angle ω18between the surface normal at x and inward ray. The BRDF f is an intrinsicproperty of the surface material, which allows determining the outgoing radiancefor any incident and outgoing beam geometry (by evaluating the integral from therendering equation). The function is, in fact, not unit-less but is defined with unit[sr−1] to make the rendering Eq. (2.5) consistent regarding units.Li can either originate from a light source or from reflections off other surfaces;see Figure 2.2. The standard way of solving the rendering equation in computergraphics is by starting with a solution that contains only the emissive componentsand then iterating the scheme to convergence [37]. Another approach is to expand theintegral and sample the space of all paths [41]. The work of [37] gives an overviewof efficient physically accurate simulation methods. Model approximations andcomputational methods that allow for real-time rendering are described in [42].2.2.3 Transient Global Light TransportTime averageFigure 2.3: Time-resolved Global Light Transport. In the scene on the left, thebackside of the bunny geometry is illuminated with a short temporal flashand the indirect reflections of the diffuse wall are observed. Simulationsof the light transport in this scene are shown on the right. Traditionalglobal illumination rendering assumes the speed of light to be infinite,hence computes the time average of the light transport. Time-resolvingthis average reveals structure in the individual components.To this point, we have considered the speed of light to be infinite. The renderingequation (2.5) does not model the travel-time of photons along the integratedlight ray components but assumes the transport of radiance to be instantaneous;see Figure 2.3. This is a common assumption in computer graphics, which aimsat simulating steady-state global light transport [37]. This dissertation covers the19capture and inversion of the temporally resolved propagation of light. A model fortemporally resolved light transport can be derived from the rendering equation (2.5):Assuming constant scene geometry and reflectance during the transport, it becomesL(x, ωo, t) = Le(x, ωo, t)+∫ t0∫ΩLi (yωi,x,−ωi, t0) · δ(t0 + ∆ (x, yωi,x)− t)· f(x, ωi, ωo) · cos(ωi) dωi dt0,(2.6)where we use yω,x as notational sugar for the surface point found by intersecting aray from x along direction −ω with the scene. ∆(x, y) = ‖−→xy‖/c is the time delayfor the light transport from x to y. The delta-function makes sure that light leavingy at time t0 is registered in x only at its time of arrival t0 + ∆(x, y).2.3 OpticsFigure 2.4: Optical Systems. An illuminated object surface consists of tightlypacked scattering molecules. An individual scatterer A on the surfaceemits spherical waves. A segment of the wave front enters an opticalsystem that reshapes incoming wave fronts to converge at pointB. Hence,B is an image of the source A.Optical systems have been used and developed since remote antiquity, withsimple lenses for vision correction, dating back to at least 1200 B.C.E., and burning-glasses to create fire, mentioned around 400 B.C.E. [30]. Such systems are designedto redirect the radiant flow in a scene. Using the wave model from the previousparagraphs, an illuminated object surface can be described as tightly packed scat-20tering molecules that emit spherical waves. Considering now a single scatterer A,as shown in Figure 2.4, its spherical wave fronts spread the emitted radiant energywith increasing distance, i.e. the corresponding rays diverge.An optical system can be designed to reshape the wave fronts, such that theyconverge to a single point B. Hence, this optical system produces an image of A inthe point B.2.3.1 DiffractionIt is important to note that practical optical systems are limited in size. Therefore,only a portion of the wave fronts can be transformed by the system, correspondingto a cone of rays as visualized in Figure 2.4. It follows immediately from thislimitation that a perfect inversion of the incoming waveforms will not be possible,but the waves will be diffracted at the entrance of the optical system. To see this,consider an optical system consisting of just an aperture in the same homogeneousmedium in which the incident waves travel. The straight-forward propagation inhomogeneous media as explained in Section 2.2.1 is not continued at the apertureas the waves blocked out by the aperture no longer contribute to the overall forwardcomponent. Hence, the direction of energy of an incoming plane wave will bespread out. Thomas Young in 1804 [43] was the first who described this diffractioneffect with the wave behavior of light. His famous double slit experiment observedthe redirected energy of light incident on two closely spaced slits as a diffraction(interference) pattern, which can be accurately described using the wave model oflight. Therefore, it also follows that diffraction effects decrease as the wavelengthof the incident light λ becomes small in comparison to the dimensions of the opticalsystem. However, as the physical dimensions are limited, realizable optical systemsare always diffraction-limited. Specifically, for a circular aperture, a perfect lensdoes not focus a point light to a single point but to a spread-out, symmetric, Airypattern that can be analytically expressed [30]. Hence, the theoretical limit of aperfect imaging system can be defined based on whether two nearby imaged pointsources can be resolved as distinct patterns. In 1873, Abbe defined the resolutionlimit of a perfect imaging system as the “radius” of the Airy disk pattern, which is21infinitely large (but quickly decaying). It isr =0.5λNA=0.5λn · sin(θ) , Abbe Resolution Limit (2.7)where NA = n · sin(θ) is the numerical aperture that is defined in terms of theincident angle θ and the refractive index of the medium surrounding the lens. Due tothe fuzzy definition of when two blurred sources can be separated, slightly differentcriteria exist, which lead to similar analysis results in practice. See [30] for adetailed discussion on diffraction effects in optical systems.While diffraction limits the achievable resolution of an imaging system, opticscan also be deliberately designed to exploit interference, as we will show in Chap-ter 8. This enables ultra-compact optical systems but introduces additional imagingerrors. Nevertheless, many practical optical systems can be well approximatedwith the light ray model, i.e. geometric optics, which ignores diffraction and thusconceptually assumes λ → 0. An overview of such systems will be given in thefollowing paragraphs.2.3.2 Geometric OpticsConsider a point source A on an illuminated object surface as illustrated in Fig-ure 2.4. A typical imaging scenario is now that this source is distanced far fromthe optical system (and the sensor located beneath the optics). In this case, thespherical waves reaching the system are approximately planar. Optical design aimsat reshaping this beam of plane waves to make the wave fronts converge to a singlefocus point B. To achieve constructive interference of all incident waves in B, thenumber of phase distance along each path P must be equal in both media. That isPi/νi + Pt/νt = const., with Pi, Pt being the portion of the path in the incidentand optic medium, respectively. Relying on Eq. (2.1) from Section 2.2.1, we haveniPi + ntPi = const.. Hence, the optimal lens surface for this (monochromatic)imaging scenario is the hyperbola.Although hyperbolic lenses perform well in many scenarios, in fact, sphericallenses are most common in optical systems for imaging and display applications [30].This is because spherical surfaces are relatively easy to produce by grinding tworoughly spherical surfaces, one convex and one concave, against each other. High-22quality surfaces with a tolerance of less than λ/10 can be manufactured this way [44].For a single spherical lens, closed form expressions for the focus point and changein direction of individual rays can be easily derived from either Fermat’s principleor Snell’s law. Complex optical systems consisting of multiple elements, however,have to be analyzed numerically by tracing rays through all elements, using Snell’slaw at each interface. The paraxial approximation allows a drastically simplifiedanalytic description for rays close to the optical axis. This approximation essentiallyrelies on the Taylor-series expansionsin(θ) = θ − θ33!+θ55!− θ77!+ · · · ,cos(θ) = 1− θ22!+θ44!− θ66!+ · · · .(2.8)For a cone of rays impinging on the lens at small angles θ → 0, the first term in theTaylor-series is a good approximation, that is sin(θ) ≈ θ and cos(θ) ≈ 1. For raysin this paraxial region, Snell’s law simplifies toθini ≈ θtnt, Paraxial Approximation (2.9)The paraxial approximation shows that closely spaced spherical interfaces, forminga thin lens, can focus parallel rays in a unique image point. In 1841, Gauss wasthe first to describe this linear model of optics for nearly axial rays [22], i.e. first-order or Gaussian optics. Although accurate ray-tracing simulations without theparaxial simplification are easy to obtain, the linear model using thin lenses buildsintuition for many imaging tasks. Simple analytic equations describe the relationbetween focus-point for parallel rays, object and image point distance (Gaussianlens formula), location of focus points and planes, and combinations of differentthin lenses. See [30] for an expanded discussion. The paraxial model immediatelydescribes the macroscopic effect of focusing. With a thin-lens optical system,moving the sensor relative to the lens allows focusing on different distances, forobjects at infinity with parallel incoming rays to points close to the lens. Hence,such optical systems essentially perform a perspective transform of the 3D objectspace to a 3D image space. The sensor records a 2D slice of the image space, withsome objects in focus and others out of focus.232.3.3 AberrationsDeviations from the idealized linear model of Gaussian optics are called aberrations.They accurately describe the errors introduced by using the paraxial and thin-lensapproximations. This includes the imaging artifacts of using spherical lenses insteadof hyperbolic lenses, which were introduced above. Aberrations are divided intotwo types: chromatic and monochromatic aberrations.Chromatic aberrations result from the wavelength-dependency of the refractiveindex n. Recalling Eq. (2.1), which defined n = c/ν, the wavelength-dependencyof n becomes obvious because the phase velocity is given in wavelength per period,that is ν = λ/T . As a consequence, the refraction at an interface is wavelength-dependent as well, and rays of different wavelengths take different paths throughan optical system. Consider imaging a point light source, the shift or spread of theimaged point along the optical axis is called axial chromatic aberration, and theremaining component in the image plane is lateral chromatic aberration. The firsttype is observable as chromatic halos in an image while the latter causes colorfulblur and haze in a capture. Chromatic aberrations will be discussed in more detailin Chapter 8, which presents a method for aberration removal post-capture.Monochromatic aberrations describe the deviations from linear Gaussian opticsthat occur even for a single wavelength (or narrow spectral band). This includes thefive primary aberrations, which are spherical aberration, coma, astigmatism, fieldcurvature, and distortion. The primary aberrations can be described with sufficientaccuracy by adding the third-order term from the Taylor-series (2.8) to the paraxialapproximation, that is sin(θ) ≈ θ − θ3/3!. In 1857, Seidel introduced aberrationtheory [23], which describes the primary aberrations with this model, i.e. Seidelaberrations. Note that these aberrations are non-linear in the ray directions, whilethe measured intensity transported by the rays passing through an optical systemis itself linear, as we will describe further below. As the Seidel aberrations covera vast majority of imaging scenarios, aberrations that require higher-order termsfrom Eq. (2.8) are usually considered jointly as higher-order aberrations. Figure 2.5illustrates the imaging errors introduced by the Seidel aberrations. Spherical aber-ration describes an axial shift in focus for peripheral rays. While central rays arefocused on the image plane (at Bp in Figure 2.5), peripheral rays focus in front of24Figure 2.5: Seidel aberrations: These five primary aberrations explain themost severe deviations from the idealized linear Gaussian optics formonochromatic light.or behind the image plane (at Bo), depending on the lens being convex or concave.This causes a circularly blurred image superimposed with the focused paraxialcomponent. Coma causes a lateral spread of the focused image point for off-axisobject points, i.e. rays that impinge at an angle. This makes point sources appear tohave a tail (coma) similar to a comet. Astigmatism is illustrated at the bottom left ofFig. 2.5. Peripheral rays traveling in two orthogonal planes, the tangential (includingobject point and optical axis) and sagittal plane, focus in different locations in frontof or behind the image plane, Bt and Bs in Figure 2.5. Field curvature describes theaberrations that focus planar objects, orthogonal to the optical axis, not on a flat buta curved image plane. If this aberration is present, as for spherical lenses, only raysin the paraxial region are properly focused. Finally, the bottom right of Figure 2.5visualizes distortion. This aberration describes for off-axis points a lateral shift inthe image point B′f from the ideal projected image Bf . The image is then distortedwith pincushion or barrel distortion. See [30] for a detailed description of Seideland chromatic aberrations.25All of the explained Seidel aberrations and chromatic aberrations, except thelateral distortion and chromatic shift, can be reduced by closing down the aperture.Distortion and lateral chromatic aberration can be mostly corrected post-capture insoftware. However, this introduces diffraction blur and reduces the Signal-to-NoiseRatio (SNR) of the captured image (as will be discussed in Section 2.4).Modeling Aberrations In practice, all of the discussed aberrations (includingdiffraction blur) do not occur in isolation but jointly produce the overall imagingerror. The superimposed aberrations can be elegantly described with a Point SpreadFunction (PSF). Figure 2.6 shows the PSFs for a simple biconvex spherical lens.Figure 2.6: Patch-wise calibrated PSFs for a single biconvex lens at f/2.0. AllSeidel aberrations are present, which jointly cause a spatially varyingPSF. Spherical aberration and field curvature cause the rings, which varyin size. Astigmatism leads to the cross-shaped components, and comacauses an elongated tail of these crosses away from the center. Distortionleads to a shift of the PSFs’ center of mass in the peripheral regions,which has been removed for a more compact display.26A PSF describes the response of an imaging system to a point source. Hence,in general, this can be a 4D function of the 3D spatial position and wavelength ofthis source. For a point source A with spectral distribution ξA, resulting from theillumination and material properties in the scene, the resulting PSF for a sensor withresponse ψ isBA =∫ψ(λ) ξA(λ) BA(λ) dλ, (2.10)where BA is here the spatially varying PSF for the pointA. Common optical systemsare optimized to vary little in the spectral range of the traditional RGB color filters.Hence, often PSFs are assumed to be invariant for these three channels. Commonlens surfaces vary smoothly, and therefore also the resulting PSF changes smoothlyover the image plane and smoothly with object distance [30]. Therefore, a widelyused assumption is that PSFs are spatially invariant in a local neighborhood. As-suming invariance for a considered depth-range, the image formation including allaberrations can then be expressed as a local 2D-convolution in image tile. Con-sidering an n × m tile, let J, I ∈ Rn×m be the observed blurred image, and thelatent sharp image (that would be captured without aberrations). For a given colorchannel the PSF is then a 2D convolutional kernel, and the formation of the blurredobservation can be formulated asJ = B⊗ I, (2.11)j = Bi, (2.12)where ⊗ is the convolution operator. Sensor noise is ignored here for simplicity.Note that this model for the measured image intensity is linear, while the ray di-rections in the aberrations discussed above itself are non-linear in general. In thesecond row, B ∈ Rnm×nm, and j, i ∈ Rnm are the corresponding quantities inmatrix-vector form. The individual pixels of the 2D images are here stacked incolumn vectors, and the convolutional operation becomes a convolution (Toeplitz)matrix. In the remainder of this thesis, we will use either of these identical formula-tions depending on the context. Finally, the full image will be composed of manytiles, each with the PSF assumed constant over the tile. The transition between thetiles can be modeled with a window function W ∈ Rn×m that models the influence27of each tile on a blurred output pixel. The full image J˜ is thenJ˜ =N∑a=1W · Ba ⊗ Ia, (2.13)where a linear index a ∈ {1, . . . , N} is used for the 2D grid of tiles. A matrix-form of this equation can be formulated without much effort by adding a fewlinear operators for the weighting and patch extraction. It is also straightforward toexpress this tiled image formation in the frequency domain, which diagonalizes theconvolution operation. See [45] for a detailed description and analysis of the tiledimage formation model for spatially varying PSFs.2.4 Imaging SensorsImage sensors measure the intensity of light that is emitted from an active lightsource or sources in the scene and propagates through the potentially complex sceneuntil reaching the imaging optics, which focus light from different scene points ondifferent sensor locations. In particular, the optical system gathers a cone of raysthat diverge toward the aperture plane and redirects them to converge onto the 2Dsensor plane, as described in the previous paragraphs. Hence, the directionality(in the cone of rays) is lost in the measurement process. Instead of the directionalradiance, image sensors estimate irradiance [W ·m−2].2.4.1 Measuring IrradianceSolid-state image sensors rely on the photoelectric effect to perform this mea-surement. As mentioned in Section 2.1, this effect refers to the generation ofelectron-hole pairs in a material when illuminated with light. In particular, theenergy of the generated electrons is proportional to the frequency of the illumi-nating light but not the intensity. For low frequencies of the incident illumination,below a threshold, no electrons are generated independently of the intensity. Thephotoelectric effect had first been observed by Hertz in 1887 [46] but could not bedescribed with the classical wave model of light. It was Einstein in 1905 [33] whofirst explained this effect with light being composed of quanta, i.e. photons [33].28The energy of a photon isE = hf =hcλ, (2.14)and hence proportional to the frequency is f = c/λ. Here, the factor h =6.6260755 · 10−34Js is Planck’s constant.Charge Generation When a stream of photons impinges on a material, the energyE of every photon can be absorbed by electrons in the clouds around the atoms inthe material, removing the electrons from their atoms as free charges that macro-scopically lead to the photo-electric effect. Whether this photo-conversion happensdepends on the material properties, e.g. insulators require extremely high energy togenerate free-moving charges. In solids, electrons can exist in two distinct energybands, a valence band (low energy) and a conduction band for free-moving charges(high energy). These bands are separated by an energy gap, i.e. band gap. Forisolators, this is a large band gap (electrons do not reach the conduction band atroom temperature), while the bands overlap in conductors. For silicon, the band gapis 1.11 eV. Hence, from Eq. (2.14) it follows that photons with wavelengths largerthan 1125nm will not interact with silicon at all.However, even photons with higher energy do not necessarily create an electron-hole pair in the semiconductor (hole refers here to the electron-hole in the atom thatcontained the freed electron). Photon absorption is a statistical process dependingon the energy of the photon and the material properties, which can be modeled withan exponential [47]. The penetration depth of a photon is defined as the depth where1/e (37%) of the incident radiation are absorbed. Hence, this depth exponentiallydepends on the wavelength of the photons. For example, red light penetrates siliconmuch less deeply than blue light does. Photons of longer wavelengths may not getdetected at all, or may generate charges at deep penetration depths that accumulatein nearby pixels, i.e. cross-talk.The overall charge generation process of a sensor is summarized with its quan-tum efficiency η, which is defined as the ratio of the detected electron-hole pairs eto the number of incoming photons i; see Eq. (2.15).η(λ) =e(λ)i. Quantum Efficiency (2.15)29Here, i and e are measured for the total area of a sensor pixel, which also includeslight-insensitive areas occupied by on-sensor electronics. Hence, this definition ofthe quantum efficiency also models fill-factor effects, including micro-lens opticsthat aim at increasing it. Note that η depends on the wavelength of the incidentradiation, including depth-dependent photon absorption effects [47].Charge Accumulation Without any electric field applied to the semiconductor, theelectrons generated from photon absorptions move around by thermal diffusion andself-induced drift. Thermal diffusion describes the movement in random directionsdue to thermal energy for temperatures larger than 0◦K, and self-induced drift occursbetween charges of the same polarity due to repulsive Coulomb forces [47]. Due tothis movement, eventually, the electron-hole pairs recombine.To detect them, electric fields are generated in the individual photodetectors toseparate the charges. In most silicon detectors this field is created by doping narrowzones in the silicon. Doping refers to the process of introducing impurities into thesilicon to alter its electrical properties. Two types of dopants exist: electron donorimpurities (n-type) that contribute free electrons and electron acceptor impurities(p-type) that create holes. A very basic detector is the p–n diode. It consists ofa junction of p-type and n-type doped semiconductor materials. At this junction,some electrons from the n-region (that reach the conduction band) diffuse into thep-type region to combine with holes, creating positive/negative ions on both sides,respectively. This creates an electric field in this depletion region [47]. If a photongets absorbed in this depletion region, the resulting charge is moved by this electricfield toward the cathode. Hence, a current is produced. In conventional CMOSsensors, p–n diodes are often reverse-biased. In this operating mode, a potentialis applied to the semiconductor, which pulls electrons and holes away from thejunction. Hence, the depletion region is significantly increased [47]. For furtherreading on semiconductor physics, we refer the reader to [47].The number of electrons that can be retained in the depletion zone is called thewell capacity. When high irradiance exceeds this capacity, charges may travel tonearby pixels, which causes so-called Blooming artifacts. Greater well capacityprevents this and increases the dynamic range of a sensor. Dynamic range is definedas the ratio of the largest detectable signal to the smallest detectable signal. This30sensor metric quantifies how well both bright and dark features can be measured.It depends not only on the well capacity but also on the dark current. Electronsthat are thermally generated in the depletion zone cannot be distinguished fromphotoconverted electrons [27]. This causes a measurement floor for low irradiancevalues, i.e. the dark current. Dark current can be reduced by cooling the sensor.The well capacity can be increased by enlarging the depletion zone, which can,in general, be achieved by increasing the pixel area, and changing doping anddetector-dependent properties (e.g. the gate voltage in CCD pixels [47]).2.4.2 CCD and CMOS SensorsHaving described the basic principles of charge generation and accumulation insemiconductor imagers, we now review the two main sensor types: CCD sensorsand CMOS Active Pixel Sensor (APS) sensors. These differ drastically in the ampli-fication and the transfer of the photoconverted charges away from the charge bins inthe photodetectors. Figure 2.7 illustrates the two sensor designs.Figure 2.7: CCD and CMOS APS sensor working principle. In a CCD sensor,charges accumulated in the individual collection sites are sequentiallytransferred to neighboring pixels by charge-coupling. Similar to a bucketbrigade, charges are moved along lines off the sensor to a single outputamplifier. In contrast, CMOS APS sensors connect a buffer amplifier toeach pixel’s charge collection site. The locally amplified charge can thenbe read out over relatively long wires using column and row addressing(red wire for the red pixel).31As a very basic image sensor, one could realize a grid of p–n diodes in a siliconsubstrate with metal connectors (wires) from each diode to an output stage. As themetal connectors would need to cover the whole sensor area, they are relatively longcompared to the pixel size. Hence, a large read-out capacitance would be introducedby these wires, causing a low SNR and low read-out speed [27]. CCD and CMOSAPS sensors solve this problem using different approaches.Charge-Coupled Device (CCD) sensors were first proposed in 1969 [48], [49].These sensors consist of a grid of diodes as photodetectors. The diodes are biasedin the deep depletion mode, which means that the depletion zone under the gate iscompletely depleted of free electrons. The depletion region increases with the gatevoltage, similar to the p–n diode described above. For further reading on the CCDphotodiodes; see [49]. During integration, photoconverted charges are accumulatedin the depletion region under the photogate. See Figure 2.7 for an illustration on thetop left. To read out the collected charges after the integration, the neighboring gatevoltage is increased to match the bias potential. This causes the potential wells underboth gates to merge. Hence, due to thermal diffusion and self-induced drift, thecharges will distribute equally under both gates [27], i.e. charge-coupling will occur.This is illustrated on the center left in Figure 2.7. Afterward, the potential of thephotogate is removed, causing the charges to migrate completely under the adjacentgate; see the bottom left in Figure 2.7. Repeating this process, the photogeneratedcharges can be moved along lines of gates to the output.This charge transfer is very efficient: Typical CCD sensors drop only between0.01% and 0.001% of the charges between two pixels (which mostly consist ofthree to four gates) [27]. Hence, for a scanline length of 1, 000 pixels, a drop ofonly 1% can be achieved. After leaving the sensor area, the charges are convertedto an output voltage sequentially in a single, low-capacitance output amplifier [27].This leads to a low-noise and homogeneous output across all sensor pixels. Whilethis is a strong benefit of CCD sensors, it also leads to low read-out speed due tothe sequential output. Furthermore, high gate voltages between 10V and 20V arerequired for large depletion regions. Therefore, CCD sensors suffer from largerpower consumption. Finally, the CCD fabrication process prohibits processing logic,such as read-out or signal processing logic, to be integrated on the same chip as thesensor [49].32CMOS APS sensors have almost completely replaced CCD sensors in the con-sumer market because they resolve many of the issues discussed above. In particular,CMOS technology enables the integration of processing logic along with the sensoron a single chip; it has low power consumption and drastically reduced manufactur-ing cost [47]. CMOS APS sensors consist of a grid of reversely biased photodiodes,each connected to a buffer amplifier. Hence, every pixel contains an active stage [27].The locally amplified charges can then be read out over relatively long wires withlarge capacity. The individual pixels are addressed in a matrix with row and columndecoders, as visualized on the right in Figure 2.7. Hence, this allows reading outimage sub-regions at high speed, which is in contrast to CCD sensors that requirefull-sensor read-out. In addition, different integration start and end times can beused in each pixel.However, compared to CCD sensors, the increased flexibility in operation andmanufacturing come at the cost of a reduced SNR. Due to manufacturing variationin the photo-detector and buffer amplifier (e.g. gate size and doping concentration),every pixel has a different offset and gain, resulting in fixed pattern noise. Besidesthis fundamentally different noise component, traditionally sensors in CMOS tech-nology suffer from lower quantum efficiency, well capacity and higher read-outnoise. However, the image quality of CMOS sensors is reaching CCD quality evenin high-accuracy scientific applications [50]. Finally, it is even possible to realizeCCD structures in CMOS [27]. This allows using the highly efficient charge transfermechanism also on CMOS sensors, such as CIS TOF sensors, which will be reviewedin detail in Section Sensor and Noise ModelTo understand sensor measurements and perform inference tasks using measureddata, it is essential to accurately model the underlying measurement process. How-ever, as discussed in the previous paragraphs, a variety of different sensor types,operating modes and technical implementations exist. Hence, to compare differentsensors for different applications, not only an accurate but also a generalizing modelis necessary. The EMVA Standard 1288 [51] presents such a unified model alongwith calibration procedures for the model parameters. It covers all digital color (and33monochrome) cameras with linear photo response, and can even serve as a basisfor emerging sensor types, such as CIS TOF cameras [52]. Since its introductionin 2005, the EMVA Standard 1288 has been adopted by a large number of sensormanufacturers [51]. In the following, we introduce a sensor and noise model thatclosely follows this standard.Figure 2.8: Sensor model for digital cameras with linear photo response. Dur-ing integration, a fraction η(λ) of the photons with wavelength λ generateelectric charges. These charges and a signal floor of thermally generatedcharges (dark noise) are accumulated in the pixel. Subsequently, thecollected charges are amplified with gain K and quantified, resulting inthe measured digital value.Sensor Model Figure 2.8 illustrates the sensor model. For simplicity, we considerspectral irradiance E of a single wavelength λ to be incident on the sensor. Onlythe quantum efficiency η from Eq.(2.15) is assumed to be wavelength-dependent.Hence, all wavelength-dependent effects can be modeled as linear combinations ofthe single wavelength model. During an integration of time t a mean ofi =AEthf=AEλthc[#photons] (2.16)photons hits a single pixel. Here, A is the area of the pixel and the second equationis derived from Eq. (2.14). Only a fraction η(λ) of the i photons generates chargesthat are accumulated in the pixels charge bin. However, thermally generated chargesidark are also accumulated in the charge bins. The combined charges are amplifiedwith the gain factor K of units of DN/e−, which is digits per electrons. Note thatthis is an analog amplification. The amplified analog signal is finally converted intoa digital signal in the analog-to-digital converter (ADC), resulting in the measured34digital pixel value j. This value can be modeled asj = min ( K(η(λ)i+ idark) + n, jsat )= min ( Kη(λ)i+ jdark +K(ne + nd) + nq, jsat )(2.17)The operator min(·, jsat) clamps the measurement to the saturation value jsat, whichis defined as the maximum measurable gray value. Note that jsat is limited by thefull well capacity, the amplitude gainK and the ADC. The corresponding irradiance,however, does not reach the well capacity but a slightly smaller value, as the ADCusually reaches its maximum value before the maximum well capacity is achieved.The variable n in the first row of Eq. (2.17) models additive noise introduced in themeasurement process, which will be discussed in detail below. In the second row,we introduce jdark = Kidark as the dark signal that is measured without any lightincident on the sensor. As idark consists of thermally generated charges, this darksignal depends on the temperature (and grows proportional to the integration timet). Note that the measurement can also be expressed in terms of incident irradianceE, combining Eq. (2.16) and Eq. (2.17) leading toj = min(Kη(λ)AEλthc+ jdark +K(ne + nd) + nq, jsat). (2.18)Using Eq. (2.18), the linearity of the sensor can easily be verified by measuring themean response µj for varying exposure t or irradiance E.Noise Model To describe the stochastic variation in the measurement j, allvariables in Eq. (2.17) except for K are random variables. The variable n =K(ne + nd) + nq models additive noise that is introduced at three points in themeasurement process. The component ne models the shot noise fluctuations ofphoto-generated charges η(λ)i and the thermal noise charges idark [51] as an additiveoffset to their sum. Both, the photo-generated charges and thermal noise charges,follow a Poisson distribution. Since the sum of Poisson-distributed random variablesitself is Poisson-distributed, which allows us to use a single random variable ne. Thesecond component nd from Eq. (2.17) models normal-distributed noise introducedin the read-out of the charges and the amplifier. Both, ne and nd are amplified by35gain factor K, which also converts the fluctuations in the electrons to digits. Finally,nq models post-amplifier noise in the ADC. nq is uniformly distributed betweenthe quantization levels [53]. Due to the laws of error propagation, the (temporal)variance of n is given asσ2j = K2(σ2e + σ2d) + σ2q= K2σ2d + σ2q +Kµj ,(2.19)where σ2j , σ2e , σ2d, σ2q are the variances of j, ne, nd, nq. In the second row of Eq (2.19),we have used a property of the Poisson distribution; that is its variance equals itsmean, i.e. σ2e = µj . Using Eq. (2.19), it is possible to calibrate the system gainK from the linear relationship between the variance σj and µj , [51]. The offsetK2σ2d + σ2q allows calibrating σ2d, as σq is known from the ADC quantization levels.Finally, having estimated K, it is possible to calibrate the quantum efficiency η(λ)using the linear relationship between Kη(λ) and µj described in Eq. (2.17). For adetailed description of this calibration approach, classically known as the PhotonTransfer Method, we refer the reader to [54], [55].Above, we have discussed a sensor and noise model for an individual pixel.Note that often spatial non-uniformities between different pixels are described as anadditional type of noise, so-called fixed pattern noise. However, because this effectis temporally static, it is, in fact, not noise but inhomogeneous parameters of thesame per-pixel model.The EMVA Standard 1288 describes standardized calibration methods not onlyfor these non-uniformities but for all parameters of the sensor and noise modelthat we have introduced above. This also includes the spectral sensitivity, whichis captured in η in our model. We refer the reader to [51], [56] for an in-depthdiscussion and a reference implementation.Signal-to-Noise Ratio and Model Approximations Having introduced the sensorand noise model above, we can analyze the effects of the model parameters on the36signal quality. A common quality measure is the Signal-to-Noise Ratio, that isSNR(µi) =η(λ)µiσj=η(λ)µi√η(λ)µi + µjdark + σ2d + σ2q/K2, (2.20)which is defined as the ratio of the mean signal and the standard deviation of thenoise. The effect of the quantization noise is damped by factor 1/K2, and henceis often ignored in practice. An accurate and compact noise model is the mixedPoissonian-Gaussian model from [57]. Reducing the model to only Poisson andGaussian noise allows exploitation of the heteroskedastic normal approximationP(µ) ≈ N (µ, µ), HeteroskedasticApproximation(2.21)where here P(µ) is a Poisson distribution with mean µ and N (µ, µ) is a Gaussiandistribution with mean and variance µ. This approximation becomes more accuratewith increasing λ. It allows modeling the overall noise n as a Gaussian distributionwith a spatially changing variance, depending on the (unknown) latent signal valueη(λ)µi [57]. While this approximation simplifies inference tasks for high andmedium photon scenarios, low photon scenarios are common in many applications.The border cases areSNR(µi) ≈√η(λ)µi for η(λ)µi  µjdark + σ2d + σ2q/K2η(λ)µi√µjdark +σ2d+σ2q/K2for η(λ)µi  µjdark + σ2d + σ2q/K2.(2.22)The SNR changes from a linear dependency on µi in the low photon range to asquare root relationship in the high photon range. In other words, for well-exposedimages, the SNR scales with square root of incoming photons. Using Eq. (2.16),this means that increasing the pixel area by 4 (e.g. doubling width and height)or increasing the exposure time by 4 will increase the SNR by 2× (given that nosaturation occurs).Finally, we can define an ideal sensor as one with no read-out noise, no quantiza-tion noise, dark current of 0, and quantum efficiency of η(λ) = 1. Using Eq. (2.20),an ideal sensor defines the upper bound SNRideal(µi) =√µi.37Multi-Shot Capture Due to saturation in our model from Eq. (2.18) and motion blurfor dynamic scenes, the SNR cannot be arbitrarily increased by a longer exposuretime. Furthermore, the square-root dependence requires long exposure times forsignificant improvements in SNR. A successful approach to enabling longer exposuretimes is using a burst of short exposure captures [58]. The burst captures have to beregistered to compensate for camera or scene motion but can drastically increasethe SNR [59]. Capturing several images also allows for an increased dynamicrange, which was introduced above. High-dynamic-range (HDR) imaging can beimplemented by capturing an image stack with different exposure times. By doingso, measurements can be reconstructed that contain more detail in the very darkand bright areas than would otherwise be possible. We refer the reader to [17] for adetailed discussion. Multiple frames can also be combined to reduce motion blur[60] or to super-resolve more detail than is available from a single image [61].2.4.4 Color ImagingThe sensor model from above has been described for incident light of a single wave-length. For a spectrum of light Ω the model can be easily extended by integratingthe photo-generated charges in Ω, as done in the charge bucket by the accumulation.Eq. (2.17) becomesj = min(K∫Ωr(λ) η(λ) i(λ) dλ+ jdark + n, jsat), (2.23)where i(λ) is the incident number of photons per wavelength λ. As already describedin Section 2.3.3, image sensors sample the incident spectrum by adding differentspectral band-pass filters on the pixels, i.e. color filters. As perfect band-passfilters are hard to realize in practice, we model the filters with a general modulationr : R → R+. In order to sample different spectral bands, the different filters arespatially multiplexed on the pixel array. Different sets of spectral filters may beused, whose measured response (the accumulated j) eventually can be convertedinto a parameter vector of a chosen color space. A key result from colorimetryis that in most scenarios the response from three band-pass filters is enough tovisually distinguish colors. Hence, commonly, three color filters peaking around38the red, green and blue band are used, arranged in a Bayer array [62]. The colorscan then also be described with three scalars, i.e. the R, G, B component, whichmay be converted to different color spaces. We refer the reader to [17] for a detailedintroduction to photometry and colorimetry.Color Image Processing Early color image processing aims at estimating theseRGB pixel values at full sensor resolution from the noisy, subsampled measurementsj. Traditionally, this task is implemented as a pipeline of simple stages [63]. Theseimage signal processors (ISPs) usually take in raw Bayer sensor measurements,interpolate over defective or stuck pixels, demosaick the sparse color samples to adense image with RGB in every pixel [64], attempt to denoise the noisy signal [65],enhance edges, tonemap the image to 8 bits per channel, and optionally compressthe image. The camera capture parameters are controlled by the auto exposure,focus, and white balancing algorithms [66].Demosaicking and denoising play key roles in this image processing pipelinebecause they are both hard inverse estimation problems. We refer the reader to [64]for an overview of state-of-the-art demosaicking methods. Many approaches havebeen proposed for image denoising. Self-similarity and sparsity are two key conceptsin modern image denoising. Non-local image modeling utilizes structural similaritybetween image patches. Non-local means (NLM) [67] filters a single image using aweighted average of similar patches for each pixel. Many orthogonal transforms,such as DFT, DCT, and wavelets, have good decorrelation and energy compactionproperties for natural images. This property has been utilized in local transform-based denoising schemes, such as wavelet shrinkage [68] and sliding window DCTfiltering [69]. BM3D [70] was the first denoising algorithm to simultaneouslyexploit sparse transform-domain representations and non-local image modeling.Combining internal (such as BM3D) and external (such as Total Variation) denoisinghas recently been shown to be advantageous; [71] run two methods separately andthen merge the results.392.4.5 Modified Sensor DesignsA variety of sensor designs and capture methods has been proposed that differfrom traditional Bayer array color imaging with a single main lens as discussedabove. One line of research is to use sensor arrays with individual optics. Anextremely thin camera design is possible by integrating several small lenses over asensor, as in the TOMBO [72] and PiCam [73] design. By using per-lens spectralfilters, the PiCam design eliminates the Bayer array. Furthermore, the optics can bedrastically simplified as chromatic aberrations only need to be corrected over a smallwavelength band of the per-lens filter. Tiled sensors like PiCam also allow measuringangular components of the light transport from the different tiled viewpoints in aso-called light field [74, 75].Captured light field can be post-processed to allowchanging of the viewpoint or focus, creating a large virtual aperture, extractingdepth from the scene, etc. Traditional Bayer sensor designs have also been modifiedto extend the dynamic range that they capture. For example, the density of thecolor filter array can be varied per pixel [76], but this comes at the cost of reducedlight efficiency. Alternatively, the sensor design itself can be modified to allow forper-row selection of the exposure time or sensor gain [77, 78].2.5 Time-of-Flight ImagingHaving discussed a broad variety of image sensors that measure irradiance, this sec-tion on TOF imaging describes a fundamentally different sensor type. TOF imagingsystems measure the Time-of-Flight of light traveling from an active source along aglobal illumination path through the scene back to the sensor. Assuming a collocatedsource/sensor and a single reflector, such a TOF measurement immediately allowsinferring the distance of this reflector.2.5.1 Optical Range ImagingA large variety of different methods for optical range measurements exists. Stoykovaet al. [79] give an extensive overview. Optical range systems can be categorizedinto the two main groups of triangulation-based and collinear setups, where thesub-categories single/multi-camera and passive/active exist.40Collinear methods. Collinear setups do not rely on a baseline (for triangulation)and can, in theory, share the optical path for illumination and camera optics (al-though that is in practice mostly not the case). Passive collinear single camerasystems include the shape-from-X methods: shape-from-shading [80], [81], shape-from-defocus [82], [24], shape-from-texture [83]. Due to major ambiguities in theforward model describing the image formation, these techniques lead to extremelychallenging inverse problems lacking efficient regularization techniques. Activecollinear single-camera systems include TOF-based systems [84], [85], includingalso shutter-based approaches [86], [29], depth-from-defocus [87], intensity-falloff-based methods [88] and also methods based on interferometry [89], [79].Triangulation-based methods. Triangulation-based methods require a baselinebetween multiple cameras, or between illumination and camera (if the system isactive). Passive triangulation-based systems using multiple cameras [90] rely onsolving the correspondence problem [91], which is computationally hard and can failin regions with little texture. Furthermore, the depth accuracy depends on the sizeof the baseline, which is undesirable for small devices. Active triangulation-basedsystems rely on modulated illumination and a camera for triangulation, [92], [93].The modulated illumination is introduced to solve the correspondence problem [94].While being robust to regions with little texture, the methods suffer from accuracyfor small baselines and ambient illumination, thus need high-intensity focusedillumination. To achieve high accuracy, many structured light methods rely onmultiple projected patterns and therefore are severely limited by acquisition speed.Spatial multiplexing [95] can overcome some of these limitations, however, at thecost of significantly decreasing spatial resolution and adding expensive computation.Comparison of optical range imaging methods. Comparing all optical rangingsystems listed above, passive single-camera systems represent the class with thelowest accuracy due to the major ambiguities in the image formation [79]. Passivetriangulation-based systems resolve many of these ambiguities by introducingmultiple views and therefore significantly improve accuracy in comparison topassive single-camera systems [91]. However, precision is still limited by solving41the correspondence problem, which is significantly simplified with active structuredlight illumination methods. Comparing now TOF cameras to triangulation-basedsystems, Beder et al. [96] show that, for indoor scenarios, TOF cameras improvedepth accuracy an order of magnitude over passive stereo systems. Comparablestructured light systems can achieve similar accuracy as demonstrated by Langmannet al. [97]. However, the main difference between TOF cameras and triangulation-based systems is that the accuracy of the latter is a quadratic function of the inversedepth and only linearly depending on the baseline as derived in [98]. While therange of TOF cameras can be extended by simply increasing the illumination power,this is a fundamental limit for active and passive triangulation-based systems. Italso means that high accuracy requires large baselines, which severely restricts thesize of the system. Because of this fundamental limitation and the drawbacks oftriangulation-based methods outlined in the previous paragraph, TOF imaging is avery promising technology. Below, we review TOF imaging in detail.2.5.2 Impulse Time-of-Flight ImagingTOF range imaging systems can be classified by the type of their illuminationmodulation, that is impulse modulation or continuous wave modulation. In impulseTime-of-Flight imaging, a pulse of light is emitted into the scene and synchronouslyan accurate timer is started. The light pulse travels from the illumination source tothe object and is directly reflected back to the sensor (assuming direct reflection at thescene object). Once the sensor detects the reflected light pulse, the timer is stopped,measuring the round-trip time τ . Given the speed of light c = 299, 792, 458 m/sand assuming collinear sensor and illumination, one can then calculate the depthd of the object point reflecting the pulse as d = cτ/2. Classical Lidar rangeimaging [99], [26], is based on impulse Time-of-Flight imaging, with [100] amongone of the first introducing a practical Lidar system.Impulse illumination has the advantage that a high amount of energy is trans-mitted in a short time-window. Therefore, high SNR is achieved with regard tobackground illumination [27], while the mean optical power stays at low eye-safelevels. The high SNR allows for high accuracy measurements over a wide range ofdistances [26]. A further advantage is that Multi-Path Interference can be reduced42simply by detection of the first peak and temporal gating. We will discuss Multi-PathInterference in detail later in this section.Drawbacks of impulse illumination are that due to varying ambient illumination,attenuation due to scattering and varying scene depth, both the peak intensity of thepulse and the ambient illumination offset change. Thus, the sensors require a largedynamic range to detect pulses, which are usually implemented by thresholding [99].Another drawback of pulsed illumination is that currently only laser diodes deliverthe required short pulse width with high optical power [27]. However, the pulserepetition rate of laser diodes is limited to 10 kHz - 150 kHz in recent systems [101].This severely limits the frame rates of scanning systems that scan the scene point-by-point and also average multiple captures for improved SNR. Furthermore, scannerssignificantly increase sensitivity to vibrations, cost and system size [85]. In fact,most impulse-illuminated systems are scanning systems [26]. This is due to thedifficulty in implementing high-accuracy pixel-level timers [84]. To illustratethe required precision, note that a for a distance resolution of 1 mm, light pulsesneed to be separated with a time resolution of 6.6 ps. Only recently, advances inimplementing large arrays of Single-Photon Avalanche Diodes (SPAD) in standardCMOS enable efficient scanner-less pulsed range imaging [102]. However, SPADsimplemented in silicon technology at room temperature are not able to reach thetiming resolutions necessary for mm-resolution [102]. Therefore, either simpleaveraging of many periodic pulse trains in time-correlated single-photon counting(TCSPC) or continuous wave modulation is done to improve the timing uncertainty.The principles and benefits of continuous wave modulation are described in thefollowing paragraphs.2.5.3 Continuous Wave Time-of-Flight ImagingIn continuous wave TOF imaging, rather than using impulse illumination, the lightsource is modulated. The electromagnetic waves present in the illumination itself(e.g. in the case of a laser illumination only a very narrow spectrum) representthe carrier wave that gets modulated. In general, different modulation methodsexist. While frequency modulation for TOF imaging has been explored [103], byfar the most popular modulation technique is amplitude modulation due to its43comparatively very simple implementation. As the fundamental principle of thesesensors is based on temporal correlation (or convolution), we use in this dissertationthe term Correlation Image Sensors (CIS) .In CIS range imaging the intensity of the light source is modulated with aperiodic modulation signal. Rather than directly measuring the TOF as in impulseTOF imaging, the phase difference θ between a reference signal (often the transmittedsignal) and the received signal is measured. θ can be estimated by cross-correlationbetween the reference and received signal. Given the modulation frequency f of theperiodic illumination, it is θ = 2pifτ and thus the depth is d = cτ/2 = c/(4pif)θ.One of the first CIS systems was introduced by Lange [27]. A survey of recentsystems and their limitations can be found in [85], [84]. Using Amplitude ModulatedContinuous Wave (AMCW) illumination rather than pulsed illumination has theadvantage that the light source does not need to provide short pulse width at highrates, thus, in practice, inexpensive LED illumination can be used [27]. Furthermore,accurate per-pixel timing is required for pulsed TOF imaging. This is not necessaryfor Photonic Mixer Devices or time-gated photon counters [84], which thereforecan be implemented significantly more easily.A variety of different modulation waveforms has been proposed. Most com-mon are sinusoidal waves or square waves [104], [27]. Frequency modulation viachirping has been explored in [105]. Pseudo-noise modulation has been proposedin [106], and pseudo-random coded illumination was used by Whyte et al. [107].CIS sensors Sensors for correlation imaging are based on two working principles:correlation during or after exposure. The most popular class of sensors performs thecorrelation of the incoming modulated signal during exposure. This is achieved byusing Photonic Mixer Device (PMD) pixels that direct the photo-generated chargetoward multiple taps (commonly two). Such a pixel is illustrated in Figure 2.9. Thecharge accumulation and transport are achieved using the CCD principle discussedin Section 2.4. The main difference from a common CCD image sensor is that thecharge transfer between the group of accumulation sites occurs during the exposure.The direction of the transfer can be modeled as a scalar between -1 (maximumfield strength toward the left bin) and +1 (fully toward the right bin) accordingto a temporally varying modulation function f . The left bin corresponds then to44negative values while the right models positive values. This means that the sensoressentially correlates a temporally varying illumination function g with the sensormodulation function f ; see the right side of Figure 2.9. The correlation value isthe subtraction of the left from the right (as we assigned negative values to the leftbin). Using a zero-mean sensor modulation function f , unmodulated ambient lightis removed in this subtraction as constant bin offset after or even during integration.This is also illustrated on the right side of Figure 2.9. The CIS sensor designwas first proposed by [108], the application to phase imaging was shown in [109]and simultaneously a design for cm-accurate depth imaging was first presentedby Schwarte et al. [110], who also introduced the term Photonic Mixer Device [111].We refer the reader to [27] for a detailed discussion of the Photonic Mixer Devicesensor design. Nearly all commercial CIS designs are currently based on PMDtechnology [85]. An alternative approach is to do the correlation after exposureElectronFieldPhoton2-tap Photonic Mixer DeviceCharge storagebin 1Charge storagebin 2PhotonsTime0DCDC DCBackground (DC) suppression .Field Direction-1 +10Figure 2.9: Working principle of a PMD pixel. Photo-generated charges aredirected toward two bins using an electric field between two electrodes,i.e. the CCD transport principle (left). The intensity g of the light sourceand the electric field f is modulated during integration (right). Thesensor correlates the signal g with f . Unmodulated light is accumulatedequally in both buckets and can be removed during or after integration.on the electronic signal generated by the photosensitive pixel during measurement.The correlation can be done with analog electronics using standard photodiodes as45sensors. However, this sensing approach requires complex electronics per pixel,has high-power consumption and suffers from noise in the signal processing [84].Using SPAD sensors in Geiger mode, the correlation can be performed in the digitaldomain after read-out; see [112].2.6 Working Principles of CISHaving listed the benefits of continuous wave TOF imaging using CIS in the previoussection, we now describe the basic working principles of this technology in detail.Sinusoidal waves are often used in the CIS literature [85] to approximate the actualshape of the modulation waveform. In the following, we assume a sine wavemodel for simplicity of notation. However, it can be expanded straightforwardlyto a general model by using a superposition of all harmonics for the modulationwaveforms. Figure 2.10 on the left presents an overview of the CIS image formationusing sinusoidal modulation.Figure 2.10: CIS TOF imaging with its most common capture modes. Left:Homodyne setup using the same frequency for illumination and sensormodulation. Right: Heterodyne setup using different frequencies.Assuming a single diffuse reflector in the scene, the light source sends intensity-modulated light g(t) with a periodic modulation signal into the scene. Assuming46that the modulation signal is cosine, the signal incident on the sensor is s(t):g(t) = cos(ωgt) + 1s(t) = a cos(ωgt+ φ) + If(t) = cos(ωf t+ θ)(2.24)We see that the illumination g(t) received a phase shift φ due to the round-triptravel time of the photons, a change in amplitude a and an offset I due to ambientillumination. As discussed in the previous paragraphs, each Photonic Mixer Devicecan direct incoming photons into two neighboring buckets. By doing this at acertain temporal rate, the incoming signal can be multiplied with another periodicmodulation function f(t). Assuming again that this is a cosine with a phase shift θas in Eq. (2.25), we get the continuous signal p(t) in Eq. (2.25).p(t) = [cos(ωf t+ θ)] · [a cos(ωgt+ φ) + I]=a2cos((ωg − ωf )t− φ+ θ)+a2cos((ωg + ωf )t+ φ+ θ)+I cos(ωf t+ θ)(2.25)The continuous p(t) is then integrated over the exposure time T and results in themeasurement b. The reason why the sensor modulates is that due to the high fre-quency of the illumination (in the MHz range), one would need very high temporalsampling rate to sample the incoming signal s directly. As we will see below this isnot necessary for the sampling of the modulated s. Note that CIS sensors allow forextended exposure intervals, much like regular cameras, by integrating over a verylarge number of periods of the wave, thus essentially correlating f and s. This is akey benefit of CIS sensors, in contrast to impulse-based systems.2.6.1 Homodyne MeasurementIn general, the angular frequencies ωg, ωf of the illumination/sensor modulationcan be different. In a homodyne setup, we have ωf = ωg. In a heterodyne setup, itis ωf 6= ωg. The homodyne setting is used in nearly all commercial systems [85]. It47is illustrated on the left of Figure 2.10 with ω = ωf = ωg and results in Eq. (2.26):bθ(ρ) =∫ ρ+Tρp(t) dt =12a cos(θ − φ), (2.26)where the last two terms from Eq. (2.25) disappear due the integration (which islow-pass filtering plus sampling) when assuming the T  1/ω. As illustratedin Figure 2.10, b is constant over time. This is the reason why it was useful toinclude the phase shift θ in the sensor modulation. This additional shift allows usto evaluate the correlation measurement b at different samples. Evenly samplingfor θi = i · 2pi/N for i ∈ {0, . . . , N − 1} allows us to do a frequency analysis of bby using the discrete Fourier transform. The signal of interest will be in the firstfrequency bin, giving the estimate in Eq. (2.27).φest = − atan(∑i bθi(0) sin(θi)∑i bθi(0) cos(θi))aest =√(∑ibθi(0) sin(θi))2+(∑ibθi(0) cos(θi))2 (2.27)For N = 4, one gets the form commonly mentioned in the literature [29]. Havingfinally computed the estimate φest of φ, one can simply compute the distance dfrom the time of flight using dest = c2·ω · φest.2.6.2 Heterodyne MeasurementFor the heterodyne setting [113, 114], it is ωg 6= ωf , which is illustrated on the rightin Figure 2.10. In particular, let us assume a small frequency shift ωf = ωg − ωδ.We then get from Eq. (2.25):bθ(ρ) =∫ ρ+Tρp(t) dt =a2∫ ρ+Tρcos ((ωg − ωf )t− φ+ θ) dt+a2∫ ρ+Tρcos ((ωg + ωf )t− φ+ θ) dt︸ ︷︷ ︸≈0+∫ ρ+TρI cos(ωf t+ θ) dt︸ ︷︷ ︸≈0≈a2cos (ωδρ− φ+ θ) ,(2.28)48We see that the difference in the two modulation frequencies introduces a low-frequency beating pattern that preserves the distance-dependent phase shift φ. Thisis illustrated in Figure 2.10 at the bottom right. Again, as in the homodyne case,the high-frequency terms are removed in the integration. In particular, we assumehere a finite integration period T with 2pi(1/T ) >> ωf and ωf >> ωδ. Integratingfrom −∞ to +∞ would result in bθ(ρ) = 0, and no meaningful signal would beextracted. In fact, the last approximation for Eq. (2.28) assumes that T is smallenough so that the cosine integrand is locally linear.Finally, the reconstruction of the unknowns φ, θ, a can be done by samplingthe harmonic b over time with θi = i · ωδ/FPS, where FPS are here the cameraFPS. Estimating the unknowns from the samples is then done in the same way as inEq.(2.26).2.6.3 Analysis in the Spectral DomainIn the previous paragraphs, we have derived and explained the image formationfor CIS in the temporal domain. The same image formation model can also bederived in the frequency domain, which can help build a deeper understanding of themeasurement process. Figure 2.11 shows a spectral analysis of the homodyne andheterodyne measuring mode. The plots show the amplitude spectra. Phase spectraremain unchanged except the first row, where interaction with the scene introducesthe distance-dependent phase shift φ.In the first row, the amplitude spectrum of the single-frequency illuminationspectrum gˆ gets attenuated by the factor a. We absorb here illumination distancefall-off and non-perfect reflection at the scene object in this factor. The second rowshows that the spectrum sˆ incident on the sensor gets convolved with the modulationspectrum fˆ . This is because multiplication in the spatial domain corresponds toconvolution in the spectral domain. This convolution results in a high-frequencycomponent, the sum of ωg + ωf , and a low-frequency component, the differenceof ωg − ωf . Thus, in the homodyne setup, the low frequency is, in fact, 0. In theheterodyne configuration, it is a small beating frequency (assuming that ωg and ωflie close together). The last row in Figure 2.11 shows that the integration duringexposure corresponds to multiplication with a sinc low-pass filter in the spectral49Figure 2.11: Spectral analysis of CIS TOF imaging for the homodyne measur-ing mode (top) and heterodyne measuring mode (bottom).50domain, removing the high-frequency components. A mathematical derivation ofthe spectral analysis discussed above can be found in the Appendix in Section A. Sensor and Noise ModelThe model used so far is an idealized one. It includes neither noise, nor system-atic deviations from the ideal sensor or light source. However, in practice, themost commonly used operating mode is the homodyne mode with the inversetrigonometric estimation from Eq. (2.27). Therefore, various types of errors in thedepth estimation can be observed. A list of the most severe deviations is given in[85]. One can observe a systematic distance-dependent error, i.e. wiggling error[115], an intensity-dependent depth error [116], temperature-dependent depth devia-tions [117], exposure-time-dependent deviations [117], static pixel inhomogeneities[85], and random fluctuations (noise) in the depth estimate. Lindner and Kolb [116]and Khalmann et al. [117] identify the error components depending on distance, in-tegration time and amplitude as the most severe ones. Both approaches reduce thesesystematic errors by first measuring the deviation for a range of different estimateddepths, amplitudes and integration times. A depth estimate is then corrected usingthe interpolated deviation from a 3D lookup-table or B-spline [116]. However, thisdata-driven approach does not identify the inaccuracies in the model correspondingto the observed depth error, and therefore it does not generalize well beyond depthestimation (especially for higher-dimensional CIS-based imaging such as transientimaging).In contrast, Schmidt and Ja¨hne [118] and Schmidt [52] propose a modified phys-ical model to explain the observed error. As discussed in the previous section, CISpixels are groups of collection sites with CCD transfer to the collection sites. Hence,the EMVA Standard 1288 for image sensors, reviewed in Section 2.4.3, can beextended to model CIS sensors. To represent the systematic errors, the works [118]and [52] extend it to allow for non-sinusoidal illumination g and sensor modulationf , non-linear photo-response, adaptive suppression of background illumination andspatial non-uniformities. The depth-dependent errors can then be explained byhigher-order harmonics in the modulation signals, whereas the amplitude and inten-sity dependence are due to a non-linear photo-response and background suppression.51While a non-linear photo-response can easily be inverted after integration in eachbucket [52], different approaches address non-sinusoidal modulation. Payne et al.[119] propose a phase-shifting scheme during the integration, which significantlyreduces higher-order harmonics. Without changing the acquisition, the ideal sinu-soidal model can be expanded to allow for arbitrary periodic modulation signalsusing a superposition of all harmonics present in the waveform. This harmonicCIS model is derived in Section A.2. Noise can be modeled as a sum of shot noise,read-out noise and amplifier noise, which are independent between taps, exactly asdiscussed for image sensors in Section 2.4.3. In the case of a sensor reading outall taps independently, the quantization noise can also be considered per tap. If atap-difference value is read out, the quantization is applied after the tap subtraction.For a 2-tap sensor, the mixed Poisson-Gaussian noise model from Section 2.4.3then becomes a Skellam-Gaussian noise model. We refer the reader to [52] fora discussion of calibrating the sensor and noise model parameters, again closelyfollowing the EMVA 1288 calibration procedures.2.6.5 Multi-Path InterferenceIn CIS TOF imaging, it is essential to handle the effects of global illumination.Describing the basic measurement principles above, we have assumed illuminationalong a single path from the light source to a single diffuse reflector and back tothe camera pixel. In general scenes, however, emitted light can be reflected orscattered many times at different scene points until returned to the sensor. Thus,multiple returns along different paths can be superimposed at the sensor, which isthe Multi-Path Interference (MPI) problem. Resolving MPI is critical, especially inCIS TOF imaging, for two reasons. First, the scene is flood-illuminated, creatingsignificantly more potential MPI than with a scanned focused beam illumination.Second, when sine modulation is used (as in many commercial systems [85]),multiple superimposed sinusoids of the same frequency result in another sinusoidwith the same frequency but different phase. See the illustration in Figure 2.12.This means that even for a very low number of mixed paths, resolving MPI ishopeless with a single frequency. In contrast, impulse TOF could untangle a lownumber of separated paths using simple thresholding. Recently, many approaches52Figure 2.12: Multi-Path Interference. In the scene on the top, contributionsfrom direct and indirect light paths (direct red and indirect blue) are su-perimposed in the measurement. In CIS TOF imaging, the sinusoids foreach path result in another measured sinusoid of different phase (black),indistinguishable of a single reflector at a different depth. Impulse TOFcan separate the components in principle. However, the sensor requiresextremely high temporal resolution and almost continuous readout.have been proposed to mitigate MPI effects. The methods from [120], [121], [122]all model light transport only for the piecewise Lambertian scenes. A scene approxi-mation step followed by an MPI reduction step based on the previous scene proxy isdone (either once or iteratively [122]). Another class of methods models sparse two-path MPI for specular scenes [123], [124], [125]. Reduction of MPI is performedper-pixel without estimating a scene approximation. All of the methods describedso far consider only special cases of general light transport. Recently Freedmanet al. [126] made an attempt to generalize per-pixel MPI reduction by assuming atemporal sparse backscatter signal with a robust data-term. The multi-frequencymethod can be implemented efficiently using a look-up-table. Finally, the lastclass of methods not only relies on CIS measurements, like all approaches above,but uses additional spatial light modulation [127],[2]. Recently, O’Toole et al. [2]achieved state-of-the-art results combining spatial light modulation and sparselycoded backscatter.532.6.6 Dynamic ScenesMotion Artifacts So far we have considered static scenes. Hence, if measurementtime is of no concern, an arbitrary number of measurements can be acquired se-quentially, i.e. temporally multiplexing. However, real-world scenes can containmotion of the camera or any of the objects in the scene. This scene motion con-sists of an axial component along the optical axis and a lateral component in theorthogonal plane. Due to the motion, in practice, temporal multiplexing requires theoverall capture time of all N for a range estimate to be small compared to the scenemotion. This can be achieved by increasing the frame-rate of the camera or thenumber of measurements that can be acquired simultaneously during a frame, forexample through spatial multiplexing. 4-tap (or 4-bucket) sensor designs have beenproposed [27], [128], which add two more charge buckets to each pixel. Hence,four correlation measurements can be acquired during a single exposure [128].However, spatial multiplexing sacrifices spatial resolution, which is low for cur-rent CIS sensors [85] due to the large pixel structures required for the CCD-basedcharge transport. Hence, the majority of CIS TOF camera systems use the discussed2-tap design as a trade-off between pure time multiplexing (1-tap) and pure spatialmultiplexing (4-tap).Different approaches addressing motion artifacts for 2-tap CIS sensors have beenproposed in the literature. [52] and [129] present models for the raw measurementsin each bucket and corresponding calibration procedures. Doing so, they are able toreduce the bias between the individual taps of a CIS, which is usually removed bysimply using the measurements θ2, θ3 from Section 2.6.1. These two measurementsare 180◦ offset with regard to θ0, θ1 and therefore exactly invert the role of thebuckets in the correlation. Having removed the systematic bias between the twotaps, the measurements for θ2, θ3 are no longer necessary, and N becomes 2, that isthe frame rate increases by 2×. Another approach is presented in [130] and [131].Both works accumulate the values from both buckets of a CIS pixel, yielding anintensity image of the scene. Having computed intensity images for all sequentiallycaptured measurements, optical flow is computed between all frames, and the rawdata is warped to the first frame.54Doppler Effect While scene motion causes artifacts when correlating sequentialframes, sinusoidal illumination, in fact, encodes both motion and depth information,even during a single frame. Radial velocity is encoded in the frequency due to theDoppler effect, and depth, as before, in the phase. The Doppler effect describes thechange in frequency of a wave for the receiver of this wave that is moving relativeto its source. For an emitted wave of frequency ωg, a reflected signal of frequencyωs = ωg + ∆ω is arriving on the sensor, where the frequency shift depends on theradial object velocity as well as the emitter frequency:∆ω =vcωg. Doppler Shift (2.29)Christian Doppler first discovered this effect, observing that the spectrum of astro-nomical objects shifts depending on their velocity [132]. Since then the Dopplereffect has found widespread use in astronomical imaging, meteorology, traffic lawenforcement, radiology, healthcare, and aviation. Doppler spectroscopy, for ex-ample, measures radial velocity of otherwise undetectable planets by observingwavelength shifts of their respective stars [133]. The rate of expansion of the uni-verse can be estimated by Doppler spectroscopy as well [134]. Doppler velocimetryis a common technique in healthcare to measure blood flow using ultrasound orlasers. The basics of ultrasound Doppler velocimetry can be found in [135]. Al-brecht [136] gives a detailed survey of laser Doppler velocimetry. Laser Dopplervelocimetry also has found widespread commercial adoption with the contact-lessmeasurement of movement in computer mice [137]. Doppler radar velocimetry usesradio waves for velocity estimation. This technology has numerous applications inaviation, meteorology, and remote sensing. Meikle [138] presents a detailed surveyof state-of-the-art radar technologies. In its simplest form, continuous wave Dopplerradar sends out continuous electromagnetic waves and determines the velocity ofthe targeted scene object directly from the returned Doppler-shifted signal. Thismethod is commonly used in police speed guns. Doppler pulse radar simultaneouslymeasures velocity and range and is broadly used in aviation for tracking, classifica-tion and prediction [139]. In Doppler pulse radar, short pulses of radio energy aresent out to an object. The round-trip TOF of the returned pulse determines a rangeestimate, while its Doppler shift provides an estimate of the velocity. Both Doppler55radar and Doppler Lidar are commonly used in meteorological applications, such aswind velocity estimation [140, 141].All of the methods reviewed above rely on the Doppler effect of the light waveitself, that is the carrier wave for a CIS system. As the frequency of the amplitudemodulation in a CIS TOF system is six orders of magnitude lower than the carrierfrequency, the Doppler effect is also reduced by this factor, recalling Eq. (2.29).For example, given a typical modulation frequency of 50 MHz and a speed of 10m/s, this results in a tiny frequency shift of 1.67Hz. However, by exploiting thetemporally convolutional image formation of CIS sensors, we show in Chapter 7that it is nevertheless possible to extract the Doppler shift using inexpensive CISTOF cameras. Our approach allows imaging of both axial velocity and range. It is afull-field imaging method, meaning that it does not require the scene to be scannedsequentially unlike most existing Doppler radar or Lidar systems that capture only asingle scene point at a time.2.7 Transient ImagingWith the light transport models from the previous sections at hand, we now reviewwork related to transient imaging, which we realize in the following chapter usingCIS sensors. Recalling Section 2.2.3, global light transport is not an instantaneousprocess but has a temporal dimension. Transient imaging is a new imaging modalitycapturing this temporal dimension. To illustrate this, imagine an illumination sourceemitting a short pulse of light. Having a camera with ultra-high temporal resolution,these pulses could be observed “in flight” as they traverse a scene and before thelight distribution achieves a global equilibrium. A transient image is then the rapidsequence of frames capturing this scene response, hence representing the impulseresponse of a scene. As light transport is a linear time invariant system that is fullydescribed by its impulse response, such a transient image fully describes the lighttransport in the scene. This illustrates the potential of this new image modality.The original idea behind transient imaging goes back to work performed inthe late 1970s by Abramson [142, 143] under the name “light-in-flight recording.”Abramson created holographic recordings of scenes illuminated by picosecondlasers, from which it was possible to optically reconstruct an image of the wave56front at a specific time. While the scene complexity was limited by technicalconstraints of the holographic setup, other researchers already used this approachfor tasks such as shape measurements (e.g. [144]).2.7.1 Impulse vs. Continuous Wave Transient ImagingRecently, interest in transient imaging has been rekindled by the development offemtosecond and picosecond lasers, allowing for ultra-short impulse illumination,as well as ultra-fast camera technologies. Velten et al. were the first to combineboth technologies [145] to capture transient images, which allows for a simplifiedsetup compared to the holographic approach, as well as significantly more generalscene geometries. Transient imaging has many exciting applications. Starting withthe introduction of an image formation model [146] and the pilot experiments byKirmani et al. [147], there have been several proposals to use transient images asa means of reconstructing 3D geometry that is not directly visible to either thecamera or the light sources Pandharkar et al. [148], Velten et al. [149], to capturesurface reflectance [150], to perform lens-less imaging [151] or simply to visualizelight transport in complex environments to gain a better understanding of opticalphenomena [152]. Wu et al. [153] recently proposed the use of transient imagestogether with models of light/object interaction to factor the illumination into directand indirect components.Unfortunately, impulse transient imaging relies on expensive custom hardware,namely a femtosecond laser as a light source, and a streak camera for the imagecapture as in Velten et al.’s setup [145]. While capturing high-resolution transientimages, these components amount to hundreds of thousands of dollars’ worthof equipment that is bulky, extremely sensitive, difficult to operate, potentiallydangerous to the eye, and slow. For example, a streak camera measures only asingle scanline of a transient image in each measurement. To obtain a full transientimage, it is, therefore, necessary to mechanically scan the scene. Due to the verylimited amount of light in a femtosecond pulse, averaging of multiple measurements,complicated calibration and noise suppression algorithms are required to obtaingood image quality. All in all, capture times of an hour or more have been reportedfor a single transient image [145].572.7.2 Continuous Wave Transient ImagingIn this dissertation, we address this problem in the following Chapter 3 by intro-ducing a novel method using continuous wave illumination with an inexpensivewidespread CMOS CIS plus computation for obtaining the transient image.A number of recent works have built on our computational CIS-based transientimaging, of which many are described later in this dissertation. In Chapter 5, wepresent a CIS-based transient imager able to recover geometry and albedo outsidethe line of sight from second-order diffuse reflections, effectively turning walls intomirrors. Lin et al. [154] and [155] present different regularization approaches forthe reconstruction of transient images from the CIS measurements. We analyze thesparsity of transient images in Chapter 4 and present a new physically motivatedmodel in Chapter 4, Chapter 6. Gupta et al. [156], Freedman et al. [126] andKadambi et al. [157] present approaches for resolving multi-path ambiguities byreconstructing (partly complete) transient images. All of these works rely on ourimage formation model. In Chapter 4 we present a new approach for imagingthrough scattering and turbid media based on this model. Finally, Jarabo et al.[158] have investigated effective rendering of transient images, which allows forthe simulation of complex transient imaging scenarios.2.8 Inverse MethodsA large variety of optical measurement systems has been described in the previoussections, ranging from traditional color cameras to impulse- or correlation-basedTOF cameras. For each system, the measurement process has been explained andformalized with a mathematical model. Relying on these models, this sectionreviews approaches for the inference of scene features that are non-trivially encodedin the measurements, i.e. hidden or latent features. Similar to the mammalian visualsystem, described in the introduction of this thesis, this inference is done by addingcomputation to the measurement process.582.8.1 Inverse ProblemsThe measurement process can be generalized for a large class of optical imagingsystems (including the ones covered in this thesis), that isj = M (i), (2.30)where here i ∈ Rm is a vector of m latent scene features and j ∈ Rn is the mea-surement vector consisting of a set of n observations. M models the measurementsj, i.e. the forward model. The model is parametrized by i. It is important to note,that this model M is not a function, but is in fact a random process that generatesthe observations as samples from the distribution describing the measurement pro-cess. As introduced at the very beginning of this thesis in Section 2.1, the inherentparticle-nature of light causes statistical fluctuations in optical imaging systems.The inverse problem corresponding to Eq. (2.30) isi˜ = M−1(j) (2.31)This means, given an observation j, an inverse forward model M−1 is evaluated torecover an estimate i˜ of the hidden parameter vector i. A large variety of differentM−1 has been proposed, varying in the distance i˜− i depending on the propertiesof the forward model. Again, this is because M is not an invertible function, forwhich this distance could be made 0 by selecting its inverse function as the inverseforward model.2.8.2 Statistical Image RecoveryTo estimate the latent unknowns from the observations, we rely on Bayesian in-ference. By using Bayes’ rule, Bayesian inference exploits both the observationsand the prior knowledge on the unknowns. In contrast, Frequentist inference doesnot rely on priors but only on the measured observations. Both approaches havebeen used for inference in imaging systems, for example, traditional versus prior-based burst imaging discussed in [10]. However, in the remainder of this thesis, weshow a wide variety of imaging problems that benefit from prior knowledge on theunknowns.59Figure 2.13: Bayesian Inference. The left plot illustrates the Bayesian ap-proach to inference: The likelihood of observing a measurement iscombined with prior knowledge on the latent variables. Using Bayes’rule the combined belief is expressed as a posterior distribution. Theright shows a more complex, bimodal, posterior.Figure 2.13 illustrates Bayesian inference. Statistical variations in the forwardmodel M are formalized in our approach using the following likelihood functionL(i|j) = p(j|i) ∝ exp ( −G(j, i) ) . Likelihood (2.32)The likelihood equals the sampling Probability Distribution Function (PDF) p(j|i),which describes the probability of an observed measurement given the latent vari-ables i. Without loss of generality, the sampling distribution is modeled here as anexponential on a distance function G : Rn × Rm → R. Intuitively, G assigns adistance between all possible measurements and unknowns, which is transformedinto a PDF by the exponential. Prior knowledge on the unknowns can be encodedwithp(i) ∝ exp ( −F (i) ) , Prior (2.33)which assigns a probability to an arbitrary unknown (independent of any observa-tions being made). Again an exponential model is used, in which intuitively Fassigns a cost to less likely unknowns. Using Bayes’ rule the posterior probabilityp(i|j) isp(i|j) = p(j|i)p(i)p(j). Bayes’ Rule (2.34)60The posterior combines both, prior knowledge and observed measurements, andhence describes all that can be inferred about the unknowns i from the j (assumingall available priors are used). However, computing the full posterior PDF is infeasiblefor most inverse problems in imaging, which require large-scale inference withmillions of latent variables and observations. Therefore, usually point estimates ofthe posterior are used. A strong choice for a wide variety of imaging applications isthe Maximum A Posteriori (MAP) estimateiMAP = argmaxip(i|j) = argmaxip(j|i)p(i), MAP estimate (2.35)iMMSE = E(i|j) =∫p(j|i)x di. MMSE estimate (2.36)MAP estimates the unknowns for the point where the posterior reaches its maximumprobability. While this is a good strategy for unimodal posterior distributions, itcan lead to undesired estimates for multimodal posteriors. Figure 2.13 shows abimodal posterior on the right, which contains a large probability mass in the modewithout the maximum. Here, other estimates, such as the Minimum Mean SquareError (MMSE) estimate in Eq. (2.36), can be beneficial. Given a point estimate,Bayesian inference can also provide measures of the uncertainty in this estimate,such as the variance or confidence intervals of the posterior. We refer the readerto [159] for a detailed discussion of Bayesian inference.MAP estimation has been proven to be a successful approach for many imagingapplications. In Chapter 9, we demonstrate that even classic image processingtasks (such as demosaicking), which traditionally are approached with specializedhand-tuned pipelines, significantly benefit from Bayesian MAP estimation. TheMAP problem from Eq. (2.35) can be reformulated asiMAP = argmaxip(i|j)= argmini− log (p(j|i)p(i))= argmini− log (p(j|i))− log (p(i))= argminiG(j, i)︸ ︷︷ ︸Data Fidelity+ F (i)︸︷︷︸Regularizer.(2.37)61The logarithmic transform in the second row can be applied since the logarithmis a monotonic function. In the last row, we have used Eq. (2.32) and Eq. (2.33).The final objective function is a sum of two terms: a Data Fidelity term originatingfrom the likelihood, and a Regularizer term from the prior distribution. All thedeterministic components of the data fidelity term, that is the ones independentof noise, are called image formation model. In contrast, the forward model Mfrom Section 2.8.1 includes noise. Literature in the field often uses the term “prior”interchangeably with the term ”regularizer.” Finally, Eq. (2.38) shows an examplefollowing the Bayesian approach described in the previous paragraphs.L(i|j) ∝ exp(−‖j−Bi‖222σ2)and p(i) ∝ exp ( −‖i‖1 )⇒ G(x) = 12σ2‖j−Bx‖22 and F (x) = ‖x‖1(2.38)The likelihood models the blurred observations j using the convolutional modelfrom Eq. (2.12) in Section 2.3.3. B is here the convolution matrix corresponding toa convolution with kernel B. The observations follow a normal distribution witha standard deviation of σ. Hence, the likelihood models Gaussian noise with thesame standard deviation. The prior places a Laplacian distribution on the unknownsi. Intuitively, this choice corresponds to the prior knowledge of the unknowns beingsparsely distributed, e.g. astronomy images. The second row of Eq. (2.38) shows thecorresponding `2-norm data fidelity and `1-norm regularizer. The image formationmodel as defined above is here B.2.8.3 Linear ModelsA large class of inverse problems in imaging can be formulated by using linearimage formation models and priors on linear transforms. Eq. (2.39) shows themodified likelihood and prior for this linear model.L(i|j) ∝ exp ( −G(Φi) )p(i) ∝ exp ( −F (Ωi) )(2.39)62Many inverse problems in imaging are accurately modeled using a linear imageformation Φ. Remember that both light transport and conventional image sensorsare linear; see Section 2.2 and Section 2.4. Problems with linear image formationinclude classical image processing problems, such as demosaicking [10], deconvolu-tion [160], [161], inpainting [162], and denoising [163]. Furthermore, the remainderof this thesis presents inverse problems for traditional and novel computational cam-era designs, that all can be modeled using a linear image formation. A large numberof recent reconstruction approaches rely on image priors that are formulated onthe linear transforms Ωi. This includes Total Variation (TV) regularizers [164], hy-perlaplacian gradient priors [160], [165], mixture model priors [163], patch-basedshrinkage [166], and convolutional priors [167].Given the general linear reconstruction model from Eq. (2.39), we follow theBayesian approach described in the previous chapter yielding the MAP estimateiMAP = argminiG(Φi) + F (Ωi). (2.40)2.8.4 Optimization MethodsOver the past years, numerical optimization has become a standard tool for solvinga number of classical restoration and reconstruction problems in computationalphotography. Examples include the traditional image processing problems listedabove as well as image editing problems such as tone mapping [168], Poisson-blending [169] and colorization [170]. Very efficient solvers have been developedfor most of these problems [167, 171]. Optimization techniques are also becom-ing increasingly popular solutions for scientific imaging problems such as X-raytomography [172] and phase retrieval [173].The literature on algorithms for solving image optimization problems is exten-sive. A particularly fruitful line of research has focused on solving convex optimiza-tion problems with the structure of the MAP problem above from Eq. (2.40), that isproblems where F and G are convex. The main challenge in solving Eq. (2.40) isthat the objective has to be jointly minimized over the sum of the data and regularizerterm. Splitting methods avoid solving this challenging problem by turning the jointproblem into a sequence of simpler optimization problems, alternately involving63only one of the two terms. This is achieved by introducing a slack variable y givingiMAP = argminiG(Φi) + F (Ωi)= argminiG(Φi) + F (y) s.t. y = Ωi .(2.41)It is easy to see that this problem is equivalent to Eq. (2.40) by substituting theconstraint. A prominent splitting method is the Alternating Direction Method ofMultipliers (ADMM) [174]. This method finds a saddle point of an augmentedversion of the classical Lagrangian for the constrained problem from the second rowof Eq (2.41). The augmented Lagrangian isLρ(i,y, λ) = G(Φi) + F (y) + λT (Ωi− y) + ρ2‖Ωi− y‖22, (2.42)where λ is here the Lagrange multiplier associated with the constraint and ρ ∈ R+is a penalty for the quadratic term added to the classic Lagrangian. See [174] for adetailed discussion of this augmented version. ADMM can then be formulated as thefollowing algorithm.Algorithm 1 ADMM for Eq. (2.41)1: for k = 1 to V do2: ik+1 = argminiLρ(i,yk, λk)3: yk+1 = argminyLρ(ik+1,y, λk)4: λk+1 = λk + ρ(Ωik+1 − yk+1)5: end forIn the first two steps, the method minimized the augmented Lagrangian al-ternately with regard to i and y while fixing the other variables. The last stepperforms a gradient ascent of the Lagrange multiplier. The Lagrange multiplieraccumulates here the (scaled) consensus error. A powerful tool for the description,implementation and analysis of splitting algorithms is given by proximal operators.The proximal operator of a function f is defined asproxτf (v) = argminx(f(x) +12τ‖x− v‖22),ProximalOperator(2.43)64where τ > 0. The proximal operator optimizes over the function in isolation butincorporates a quadratic term that can be used to link the optimization with a broaderalgorithm. It can be interpreted as a gradient step for f when τ is small and f isdifferentiable [175]. We can reformulate Algorithm 1 as the following Algorithm 2using proximal operators.Algorithm 2 Proximal ADMM for Eq. (2.41)1: for k = 1 to V do2: ik+1 = argminiG(Φi) + 12ρ‖Ωi− yk − λ˜k‖223: yk+1 = argminyF (y) + 12ρ‖Ωik+1 − y − λ˜k‖22 = proxFρ (Ωik+1 + λ˜k)4: λ˜k+1 = λ˜k + Ωik+1 − yk+15: end forHere, we use the scaled Lagrange multiplier λ˜ = λ/ρ as a notational short-cut. We can now also understand why proximal algorithms like ADMM providean efficient way to solve the considered linear MAP problems. If G is quadratic orseparable, the first row becomes computationally cheap. If now the function F isseparable, then also its proximal operators parallelize and yield very efficient dis-tributed solver methods. A number of proximal algorithms have been explored; see[175]. Besides ADMM, prominent examples include the proximal point algorithm[176], forward-backward splitting [177], the Pock-Chambolle algorithm [178, 179],the split Bregman method [180], ISTA and FISTA [181], and half-quadratic splitting[182]. Recent work has applied these methods and new ones such as iPiano [183] tononconvex optimization problems and found conditions that guarantee convergence(though not necessarily to the global optimum); see, e.g., [184–186].65Chapter 3Transient ImagingIn this chapter, we demonstrate transient imaging using inexpensive CIS TOF cam-eras. It is shown that CIS measurements encode transient images in convolutionalstructure. We derive a convolutional model for the local light/object interaction.Using a sequence of differently modulated CIS images, an inverse problem can besolved to infer the latent transient image from these measurements. The resultingmethod produces transient images at a cost several orders of magnitude below exist-ing impulse-based methods, listed in Section 2.7, while simultaneously simplifyingand drastically speeding up the capture process.3.1 IntroductionTo enable fast, simple, and inexpensive transient imaging, we replace direct, impulse-based, acquisition methods with a CIS TOF camera plus computation. In particular,we attempt to substitute the complex femtosecond laser and streak camera setupdiscussed previously in Section 2.7.In the following, we describe and demonstrate the approach that makes thispossible. We derive a convolutional model for the relationship between transientimaging and traditional time-of-flight imaging using Photonic Mixer Devices. Theestimation of transient images from CIS measurements is then formulated as aninverse problem. This problem is ill-posed but can be solved in a MAP approach,following Section 2.8, by introducing regularization terms as well as a model of66Time in ns012345678910111213141516Figure 3.1: Left: Our capture setup for transient images (from left: computer,signal generator, power supply, modulated light source, CIS camera).Middle: A disco ball with many mirrored facets. Right: The same sphereas seen by our transient imager when illuminated from the left, coloredaccording to the time offset of the main intensity peak.local surface/light interactions. Finally, we demonstrate a prototype system witha modified PMD camera that allows for flexible measurements of Time-of-Flightimages using a range of different modulation frequencies and phases. The system isinexpensive, portable, eye-safe, insensitive to background light, and acquires data ata much higher rate than streak cameras, enabling transient imaging even outside labenvironments.3.2 Transient CIS Image Formation3.2.1 Generalized CIS ModelIn the related work Section 2.6, we have have introduced an idealized model usingperfectly sinusoidal modulation in the CIS image formation. While this approximatemodel builds intuition and is sufficiently accurate for many applications, in generalonly the following assumptions on g, s have to be made. The sensor modulation fωis a zero-mean periodic function with period T = 2pi/ω, that isfω(t+ k · T ) = fω(t) and∫ T0fω(t) dt = 0. (3.1)For an arbitrary incident irradiance s, the CIS measures the following modulatedexposurebω,φ =∫ NT0s(t) fω(t+ φ) dt, (3.2)67where φ is a programmable temporal offset. This is a temporally convolutionalmodel. Recalling Section 2.6, CIS sensors integrate over a large number of Nperiods (ca. 104–105), hence, they essentially correlate s and f . In contrast toimpulse-based systems, this allows extended exposures, much like regular cameras.3.2.2 Single ReflectionTo illustrate the general model, we first consider the special case of a single reflectorin the scene. Recalling Section 2.6, depth TOF imaging commonly assumes thedirect illumination of a single object point by a single point light source, effectivelymeaning that only a single light path contributes to the sensor reading. This isillustrated on the top of Figure 3.2. Under this assumption we obtain the incidentsignal sω from Eq.(3.3).sω(t) = I + αa · gω(t+ τ), (3.3)with I being the DC component of the light source plus any ambient illumination,a is the modulation amplitude for the light source, α is an attenuation term due tosurface reflectance and distance-based intensity falloff, and τ is the TOF from thelight source to the object point and then to the CIS pixel. The modulated exposuremeasured by the sensor becomes thenbω,φ =∫ NT0(I + αagω(t+ τ)) fω(t+ φ) dt=I ·N∫ T0fω(t+ φ) dt︸ ︷︷ ︸=0+ αa ·N∫ T0gω(t+ τ)fω(t+ φ) dt︸ ︷︷ ︸=cω,φ(τ).(3.4)The correlation coefficients cω,φ(τ) can either be determined analytically for specificfω, gω such as sinusoids, or they can be calibrated using objects at different knowndistances d using the relationship τ = ω/c · d, where c is here the speed of light.Relating the general model back to range imaging, two measurements bω,0◦ andbω,90◦ are obtained for per pixel range estimation. Using the known cω,φ(τ), it isthen possible to solve for the pixel intensity αa and the distance d of the objectvisible at that pixel. In general, measurements bωi,φi for many different modulation68Figure 3.2: Top: Operating principle of a conventional CIS PMD sensor. Lightfrom a modulated light source arrives at a pixel sensor via a single lightpath with time delay τ . The PMD sensor modulates the incident light witha reference signal fω and integrates the resulting modulated exposureto obtain a distance-dependent correlation. Bottom: in the presence ofglobal illumination, the incident illumination is a superposition of lightwith different phase shifts. Multiple measurements at different phasesand frequencies are taken to analyze such light interactions.frequencies and phases (ωi, φi) can be acquired. This yields a collection of different,travel-time dependent correlation coefficients cωi,φi(τ).3.2.3 Global IlluminationUsing a multitude of measurements over a wide frequency range, we can relax therequirement that light only arrives at the sensor via single light path, and move to afull global illumination model instead. Temporally-varying global light transporthas been reviewed in Section 2.2.3 and is illustrated in Figure 3.2 on the bottom.69Integrating over all light paths in the scene, Eq. (3.4) then becomesbωi,φi =∫ ∞0∫P∫ NT0(I + δ(|p| − τ)αp · agωi(t+ τ)) fω(t+ φi) dt dp dτ=∫ ∞0∫ NT0(I + i(τ) · agωi(t+ τ)) fω(t+ φi) dt dτ(3.5)We integrate now the modulated light along all ray paths, connecting the light sourceand the sensor pixel, P is the space of all such ray paths. The factor αp is the lightattenuation along such a path p, and |p| is the travel time along path p. FollowingSection 2.2.3, the delta-function makes sure that the path contributions with thesame travel time τ are integrated for a certain time offset τ . In the second row ofEq.(3.5), we encapsulate all path components of the same length in i. As explainedin the related work in Section 2.2.3, this is a pixel of the transient image.i(τ) =∫Pδ(|p| − τ)αp dp Transient Pixel (3.6)I(x, y, τ) = ix,y(τ) Transient Image (3.7)Note that above we have considered the measurement bωi,φi for a single pixel. Aregular grid of transient pixels is a transient image, hence it is three-dimensional,with two spatial and one time coordinate. We can see from Eq. (3.6), that thetransient image is the impulse response of a scene, i.e. the intensity profile asa function of time when the scene is illuminated by a very short pulse of light.Recognizing that i does not depend on t in Eq. (3.5), we can reformulate it asEq. (3.8) using the correlation coefficients.bωi,φi =∫ ∞0∫ NT0i(τ)︸︷︷︸TransientPixelagωi(t+ τ)︸ ︷︷ ︸LightModulationfω(t+ φi)︸ ︷︷ ︸SensorModulationdtdτ=a∫ ∞0i(τ)N∫ T0gωi(t+ τ)fω(t+ φi) dt︸ ︷︷ ︸=cωi,φi (τ)dτ(3.8)70Hence, the convolutional model allows for separating the transient component fromthe correlation on the CIS sensor. Discretizing τ in m temporal bins {τ1, . . . , τm}to the following linear modelb = Ci+ n with Ci,j = cωi,φi(τj) (3.9)The vectorized transient image is i ∈ Rm, with the components stacked. Themeasurement b ∈ Rn denotes the vector of all n modulated exposures observed atall pixels in the image. The correlation coefficients cωj ,φj are arranged in a matrixC that maps a transient image to the vector of observations b. Finally, n modelsadditive noise, which we have discussed in Section Transient Imaging using CISThis section describes how we estimate a transient image from the sensor measure-ments b. We first formulate the corresponding linear inverse problem and thenpresent a method to solve this optimization problem.3.3.1 Inverse ProblemThe model from Eq. (3.9) contains a linear transform C which is not well condi-tioned and additive noise n. Approximating g and f with sinusoidal waveformsas in Section 2.5, C becomes a truncated Fourier transform. The maximum fre-quency is limited by the sensor and light source and usually around 100 MHz [85].This means the correlation coefficients vary slowly with distance for the range offeasible modulation frequencies and scene scales (Section Section 3.4). Due tothe ill-conditioned C and the noise, the inverse problem of recovering i from b isill-posed, hence directly solving for i fails.The challenging inverse problem is solved following the MAP approach in-troduced in Section 2.8. We use spatial and temporal priors, and a likelihoodapproximating the noise distribution. In particular, two types of priors are used:a prior on the gradients, leading to a convex optimization problem which can besolved to high accuracy, and a Gaussian-Exponential prior leading to an objectivefunction with local minima. Specifically, the spatial and temporal gradients are71assumed to follow a heavy-tailed distribution. In other words, we assume that thetemporal and spatial profiles are sparse. The Gaussian-Exponential prior assumesthat a transient pixel can be expressed using a model m(·) of the local interactionof light with objects. Approximating the n as Gaussian noise, the resulting MAPestimation problem from Section 2.8 becomesiopt = argmini12‖Ci− b‖22 + λ∑x‖∇zix‖H + θ∑τ‖∇xiτ‖Hs.t. i ∈ m(C)(3.10)The first term is the Data Fidelity, which originates from the Gaussian likelihood.Gradient Priors The other two terms in the first line respectively represent gradientregularizations along the temporal direction for each individual pixel, and alongthe spatial dimensions for each individual time slice. We expect the gradientsfor both the spatial and temporal slices to be sparse, but with occasional outliers(discontinuities). In our work we chose the Huber penalty function [187]. TheHuber penalty ‖.‖H is an `1/`2 hybrid error measure and defined as‖x‖H =∑kh(xk) with h(x) ={|x| − 2 if |x| > |x|22 else(3.11)Huber fitting is commonly used in robust statistics since it allows for outliers inthe common Gaussian models [174]. The parameter  > 0 defines the trade-offbetween the Gaussian fit (quadratic) for small values and a Laplacian fit (`1-norm).In Eq. (3.10), spatial and temporal gradients are Huber-penalized. In contrast tothe Laplacian gradient penalty, the Huber gradient-penalty does not introduce non-smooth discontinuities in smooth regions. If we would allow  = 0, the spatial `1gradient-penalty would become the well-known TV penalty [188], which containssuch discontinuities as staircasing [178].Gaussian-Exponential Priors The Gaussian-Exponential prior leads to the con-straint in the MAP problem from Eq. (3.10). Here, m(C) is a notational shortcut forthe image of our model function m(·) that expresses the transient pixel profiles as a720 10 20 30 40 50 60 7000. in nsIntensity OriginalFitted model0 10 20 50 60 7000. 30 40Time in nsOriginalFitted modelFigure 3.3: Two different time profiles of the model mx(·) fitted to ground-truth measurements, which were acquired by Velten et al. using a fem-tosecond laser and streak camera [152].mixture of exponentials and Gaussians. This model is inspired by recent work byWu et al. [189] and applies to each pixel independently. Specifically, we model thelight interaction at each local surface point as a combination of surface reflectionand subsurface scattering. For surface reflection, the temporal impulse response toa short pulse of light is a Dirac peak, which we represent as a Gaussian Gσ, whereσ is related to the temporal resolution of the acquisition system. For subsurfacescattering, the temporal impulse response can be modeled as an exponential decayE [189]. In a complex scene with global illumination, the model mx(·) for the timeprofile at pixel x can therefore be expressed as a superposition of K Gaussians andexponentials, that ismx(u) =Kx∑krx,kGσ(τ − px,k) + ax,kE(dx,k, τ − px,k), (3.12)where the model parameters u are the set of all (temporal) positions px,k for theGaussians and exponentials, combined with their amplitudes rx,k and ax,k, andfinally the exponential decay parameters dx,k which depend on the density of thescattering medium.u =⋃x,k{rx,k,px,k,ax,k,dx,k} .73To illustrate the model we have fitted ground-truth traces from a streak camerasetup. For this experiment, we used the “tomato” transient image that Velten et al.obtained using a femtosecond laser and streak camera [152]. Figure 3.3 shows thefitted transient profiles for different pixel locations.3.3.2 Optimization via SplittingThe objective from Eq. (3.10) is non-convex due to the constraint, which containsthe non-convex set m(C). The constraint even contains integer variables (Kx). Incontrast, note that all penalties from the first line are convex. To solve this problem,we introduce a variable u and a new consensus constraint m(u) = i.iopt,uopt = argmini,u12‖Ci− b‖22 + λ∑x‖∇zix‖H + θ∑τ‖∇xiτ‖H ,s.t. m(u) = i(3.13)with the consensus constraint this problem is equivalent to the one from Eq. (3.10).We solve this problem by first eliminating the constraint as an additional penaltyterm, and then perform coordinate descent w.r.t. i and u. See Section A.3 for adetailed discussion. The coordinate descent splits the hard joint problem into asequence of simpler problems only involving i and u. The i-subproblem is a convexproblem since it does no longer involve the model. The u-subproblem is separableper pixel. Finally, the integer variable can be iteratively estimated as part of thecoordinate descent.Summarizing, the Gaussian-exponential prior makes the optimization problemhard and costly to solve. The described coordinate-descent has weak convergenceguarantees and is not guaranteed to find a global minimum. Due to the manyparameters in C the u-subproblem is expensive. In Chapter 4 we will introduce amodified model that exploits the convolutional structure of transient images. Bydoing so we eliminate all of the discussed issues and combine the i-subproblemwith the model. Hence, in the following, we will cover the i-subproblem, whilethe i-subproblem is discussed in Section A.3. Furthermore, note that omittingthe Gaussian-exponential prior also results in an optimization of the form of i-subproblem.743.3.3 Solving the i-subproblemThe i-subproblem is given in Eq. (3.14).iopt = argmini12‖Ci− b‖22 +λ∑x‖∇zix‖H + θ∑τ‖∇xiτ‖H +ρ2‖i−m‖22(3.14)The last term is here a quadratic coupling term. In our coordinate descent schemeit is m = m(u) from the previous u-subproblem; see Section A.3. Eq. (3.14) is aconvex problem of the form discussed in Section 2.8.3. In particular, we chooseΦ = I, G(v) =12‖Cv − b‖22 +ρ2‖v −m‖22 ,Ω = S, F (v) = ‖v‖H ,(3.15)where here the matrix S applies the temporal derivative operator to all pixel timesequences and the spatial gradient operators to all time slices (according to Eq. 3.14).Since we can express our problem in the form of Eq. (2.40), it can be solved witha proximal algorithm as discussed in Section 2.8.4. We use the first-order primal-dual framework by Chambolle and Pock [178], which is equivalent to a linearizedversion of ADMM [175]. We discuss various proximal algorithms, including theChambolle-Pock method, in Chapter 10. Note that different choices of the algorithmand the functions in Eq. (3.15) can be made. Chapter 10 also presents method toautomatically compare amongst different algorithms and problem formulations toyield an optimal solver method.To solve the problem with the Chambolle-Pock method, the proximal operatorsfor F ∗ and G have to be known. These are given byproxτF ∗(v) =v1+τmax(1,∣∣∣ v1+τ ∣∣∣)proxτG (v) =[τCTC+ (1 + τ)I]−1(τCTb+ v + τρm),(3.16)The Huber proximal operator amounts here to a simple pointwise evaluation. The75second proximal operator requires the solution of a linear system. However, sinceC is of small dimensions, it can be solved to high accuracy using a direct inver-sion in parallel for each pixel. The inverse can be cached which makes this stepcomputationally efficient in the iterative scheme.3.4 SetupFigure 3.4: The main components of our capture setup. Left: Modified PMDimager. Right: Light source with six uncollimated laser diodes.We now describe the physical setup we developed to acquire transient imagesof real scenes.The CIS camera is based on the PMDTechnologies CamBoard nano, an eval-uation platform for the PMD PhotonICs 19k-S3 image sensor. We removed theon-board light source and the default 830 nm longpass filter. Since our techniqueis based on the use of a wide range of modulation frequencies, we intercepted theon-board modulation signal and replaced it with our own input from an externalsource, and added a trigger output that signals the start of the integration phase. Weconfirmed that the sensor can be operated at modulation frequencies up to 180 MHz,but managed to obtain stable results only up to 110 MHz.Our light source is a bank of six 650 nm laser diodes with a total average outputpower of 2.4 W. Since the beams are not collimated, eye safety is not a concernat reasonable distances from the device. Two groups of three diodes are driventhrough a iC-Haus 6-channel constant current driver each (type iC-HG), with eachdiode occupying two driver outputs for DC bias and the high-frequency modulation.76Using a fast photodiode (Thorlabs FDS010) and a 500 MHz, 2 GS/s oscilloscope,we confirm that this setup can modulate the output intensity by a full 100% up to afrequency of 180 MHz.As our modulation source signal, we use the DDS9m/02 signal generator fromNovatech Instruments, which is based on the Analog Devices AD9958 direct digitalsynthesizer chip. We use the digital output of the signal generator to modulate boththe light source and the camera with a square wave.A microcontroller circuit sets modulation frequencies and phases on the syn-thesizer board. Reacting to the trigger signal from the camera, it switches themodulation signals for light source and camera in accordance with the currentlyset integration time (we take an exposure series from 120 to 1920µs in steps ofone f-stop). The sensor data is read out through the API provided with the cameraboard.Measurement Routine. In order to complete a full measurement over a frequencyrange from 10 to 120 MHz in steps of 0.5 MHz, our capture system takes about 90seconds. Note that we operate the setup at a duty cycle of less than 1%, in order toavoid overheating of the critical components (signal buffer and laser driver ICs, laserdiodes, PMD sensor) that are not equipped with heatsinks or active cooling. Wetherefore estimate that with proper thermal management, another significant speedupwill be achieved, reducing the overall acquisition time to only a few seconds.There are several ways our method deals with the dynamic range issues. Theworking principle of PMD sensors is itself very effective in suppressing ambientillumination, the sensor provides high-bit depth readouts (14 bits) and, finally, wedo take an exposure sequence as described just above. That said, if very bright lightpaths and very dim light paths mix in the same pixel, the reconstruction quality ofthe dim paths will suffer.Calibration. In order to obtain the correlation matrix C, we perform a calibrationstep. We place camera and light source as close to each other as possible, and facinga diffuse white wall, with no other objects nearby that could scatter light into theline of sight. For each frequency, we sample the distance-dependent correlation77coefficients by varying the relative phase between the sensor and the light sourcemodulation signals. This allows us to emulate different optical delays withoutmechanically moving parts.The calibration measurements for frequencies from 10 to 120 MHz in 1 MHzsteps and distances from 0 to 20 m in 0.1 m steps take about 6 hours to complete,with a further 30 minutes to extract the matrix from the raw sensor readings. Weaverage the calibration matrices over a 40×40 pixel region in order to reduceacquisition noise.We note, however, that the calibrated matrix C obtained in this fashion is alsovalid for different geometric configurations of sensor and camera, with only a changein the physical interpretation of the resulting reconstructed transient images. Thismeans that the calibration step is essentially a one-time operation. Figure 3.5 showsa color visualization of the recovered correlation coefficients, with the vertical axiscorresponding to different frequencies, and the horizontal axis corresponding todifferent path lengths, or travel times.Time in nsFrequency in MHz0 10 20 30 40 50 60020406080100120−0.1−0.0500.050.1Figure 3.5: A visualization of the correlation coefficients for different pathlengths (horizontal axis) and modulation frequencies (vertical axis).783.5 ResultsWe evaluated our approach using both synthetic data for ground truth comparisons,and measurements using our custom setup.3.5.1 Synthetic ExamplesFor our synthetic experiments, we used a transient image that Velten et al. obtainedusing a femtosecond laser and streak camera [152]. Since this dataset was capturedfor a smaller scene than the size of scenes we target with our setup, we simulateda larger scene by scaling the time dimension by a factor of 20. We then used thePMD sensor model from Eq. (3.5) to simulate measurements of bωi,φi for differentfrequencies and phases, also adding Gaussian noise with a sigma of 1%.Figure 3.6: A couple of selected frames from simulation with ground-truthdata. Top: Original ground-truth frames. Bottom: Reconstructions usingour method. Temporal profiles for a few representative pixels are shownin Figure 3.7.Figure 3.6 shows reconstructed frames (bottom row) in comparison with theground truth (top). The key features are reproduced, although a certain amount oftemporal smoothing is noticeable. An examination of time profiles for additionalpixels confirms this analysis (Figure 3.7). In green, we also show a direct least-squares fit of the model to the ground truth curves. These curves demonstratethe expressiveness of the model. Even though the direct fit exhibits shows sharpdiscontinuities due to the use of exponential segments with a step function onset,790 2 4 6 8 10 12 14 1600. in nsIntensitySignal at (1, 56)  Ground truthModelReconstruction0 2 4 6 8 10 12 14 1600. in nsIntensitySignal at (61, 156)  Ground truthModelReconstruction0 2 4 6 8 10 12 14 1600. in nsIntensitySignal at (86, 96)  Ground truthModelReconstruction0 2 4 6 8 10 12 14 1600. in nsIntensitySignal at (131, 151)  Ground truthModelReconstruction0 2 4 6 8 10 12 14 1600. in nsIntensitySignal at (86, 156)  Ground truthModelReconstruction0 2 4 6 8 10 12 14 1600. in nsIntensitySignal at (121, 101)  Ground truthModelReconstructionFigure 3.7: A selection of time profiles for different pixels from the examplein Figure 3.6. Each graph shows the original time profile, which we treatas ground truth (blue), a direct fit of the model to this ground truth curve(green), and finally the result of our full optimization procedure usingsimulated PMD measurements (red, see text).the examples show that the key features of the time sequences can be representedusing the model. Furthermore, in the actual reconstruction from CIS measurements,the i-step smoothes out the discontinuities. These results demonstrate that ourmethod can qualitatively reconstruct transient images, including complex scenarioswith multiple discrete pulses of light arriving at a surface point, as well as broadtemporal profiles resulting from indirect illumination via a diffuse surface. Note theresolution limit for our method is scene-dependent due to non-linear nature of ourMAP approach.3.5.2 CIS MeasurementsUsing our own setup, we captured a few settings with characteristic light transport.Individual time slices of three additional datasets are shown in Figure 3.9. Thescenes and single-image visualizations of the recovered transient images are shownin Figure 3.1 and Figure 3.8. See the Appendix Section A.3 for an additional simplescene. The left column of Figure 3.9 shows a wavefront propagating though a scene80Figure 3.8: Our test scenes, in addition to the one from Figure 3.1. All scenesare illuminated from the left, with the rainbow images encoding thetemporal location of the main intensity peak for each pixel. Top: Variousobjects on a table, and their reflections in mirrors placed nearby. Notehow the items and mirrored counterparts light up at different times.Bottom: Four bottles filled with slightly scattering water. The specularreflections reach the camera before the light scattered inside the bottles.with a mirrored disco ball placed in the corner of a room. In the first frame, thewavefront has just reached the front of the ball. In the second frame, the ball is nowfully illuminated, and we see the wavefront propagating along the left wall. Thethird frame shows the first caustics generated by reflections in the mirror. Morecaustics appear for longer light paths near the top and bottom of the fourth image,and the direct illumination is now approaching the back of the corner from bothsides. First indirect illumination in the floor is visible. In the last frame, the causticshave disappeared, and the indirect illumination is now lighting up the shadow of theball.The second column of the figure shows a scene with several bottles, filledwith water and a small amount of milk to create scattering. In the first frame, the81Figure 3.9: Time slices from three transient images captured with our setup,and reconstructed with the method from Section 3.3. Each of the threecolumns shows a result for one of the test-scenes shown in Figure 3.1(discoball scene on the left) and Figure 3.8 (water bottles scene in thecenter and mirrors scene on the right). A variety of global illuminationeffects can be observed. Please see the text for a detailed explanations ofthe different effects which we capture in each example.82wavefront has just reached the front of the leftmost bottles, and is reflecting off theirsurface. In the second frame, scattering effects are becoming visible in the bottles.Next, the light reaches the far wall, showing caustics of the light transport throughthe bottles. Indirect illumination of the back wall from light scattered in the bottlesappears in the fourth frame. This light continues to illuminate the back wall evenafter the bottles themselves have darkened (last frame).Finally, the last column of Figure 3.9 shows a scene with several foregroundobjects and two mirrors. We first see initial reflections coming off the foregroundobjects. As the light propagates further, the foreground objects are now fullyilluminated, and the wavefront reaches the back walls, but the mirrors remain dark.In the third frame reflections of the foreground objects are starting to become visiblein both the left and the right mirror. In the fourth frame, the left mirror shows areflection of an object in the right mirror. This reflection lingers in the last frame,even after the wavefront has passed by the foreground objects.3.6 Spatio-Temporal ModulationThe presented approach can also be easily combined with spatial modulation. Anumber of recent works [190–192] analyze or capture light transport through spatialmodulation using projector-camera systems, where the transport is here againassumed to be instantaneous. An established way to describe such systems is givenusing the light-transport matrix T [192, 193], that isi = T p (3.17)where here now i represents a 2D image with the X pixels stacked in a columnvector, T is the scene’s X×Y transport matrix, and p is a vector that represents thescene’s spatially-varying illumination, e.g., a 2D pattern projected onto the scene bya conventional video projector.The spatial light transport can be interpreted as the time-average of the tempo-rally resolved transport, that isT =∫ ∞0T˜(τ) dτ, (3.18)83i(a) sceneit(b) temporal responseij(c) spatial responseFigure 3.10: Simulated transient light transport, rendered using a modifiedpath tracing integrator for non-caustic transport and a photon mappingintegrator for caustic transport. For each camera pixel i, we simulatethe travel time τ of all light paths from the scene lit by a projector pixelj. (a) The scene contains, from left to right, a diffuse bunny, a mirrorcasting caustic light on the bunny, a glossy bowl, a diffuse v-shapedwedge, and a glass dragon. (b) A transient image of pixels i acrosstimes τ . (c) Conventional light transport matrix, describing each camerapixel i in response to projector pixel j.where T˜(·) simply extends the transient image from Eq.(3.6) by the spatial patterndimension. Without modulation we have p = 1 and considering a pixel k it isi(τ) =∑a T˜(τ)k,a. Now, emitting a spatio-temporal pattern p˜(t), will producethe incident sensor signals˜(t) =∫ ∞−∞T˜(τ) p˜(t− τ) dτ def= (T˜ ∗ p˜)(t) (3.19)where the operator ∗ convolves a matrix function and a vector function over time(with notation adopted from [194]). With this temporally modulated signal incidenton the sensor, we can adapt Eq. (3.8) specifically for CIS sensors and getbωi,φi =∫ NT0fωi(t+ φi)∫ ∞−∞T˜(τ)p˜(t− τ) dτ dt=[∫ ∞−∞T˜(τ)(N∫ T0fωi(t+ φi)gωi(t− τ) dt)dτ]p=[∫ ∞−∞T˜(τ)cωi,φi(τ) dτ]p (3.20)84It becomes obvious that adding spatial modulation leads to a straight-forward ex-tension to Eq. (3.8). We also see that the light transport is separable for each CISmeasurement. Hence, we can simply code the illumination by using a projectorinstead of our flood light source. For example, one can replacing the RGB lightsource of an off-the-shelf DLP LightCrafter projector with a collimated version ofthe light source from Section 3.4. The spatial modulation can then, for example,be used to only selectively capture direct/retroreflective and indirect componentsas discussed in [190]. Since direct paths always satisfy the epipolar constraint,simply randomly turning pixels corresponding to epipolar lines on the projector(a) PMD image (b) t = 1.9 ns (c) t = 3.2 ns (d) t = 5.0 nsFigure 3.11: Transient imaging in combination with spatial modulation. (a)Scene captured with a normal camera under ambient illumination (rows1, 3) and CIS camera under projector illumination (rows 2, 4). (b-d)Frames from the transient image reconstructed as before (rows 1, 3) andwith added spatial modulation (rows 2, 4). Notice the direct componentis a sharp impulse travelling along the walls.85(with probability 0.5) will suppress indirect paths from the projector to the cam-era. The direct/retroreflective component can then be estimated by capturing anunmasked measurement and subtracting the indirect-only component from it. Werefer the reader to [190] for an in depth discussion of spatial light transport probing.Figure 3.11 shows results of transient imaging combined with this approach. Theadded spatial modulation aids the transient reconstruction by separating the directand indirect components, hence simplifying the unmixing significantly. Note thatthe direct pulse of light traveling along the wall and the caustic light reflected bythe mirror have high-temporal resolution in our reconstructions, whereas they ap-pear very broad when reconstructed without transport decomposition. We refer thereader to our publication [2] for an in-depth analysis on the combination of transientimaging with spatial modulation, which goes beyond the scope of this dissertation.3.7 DiscussionIn this chapter we have presented a method for transient imaging with CIS TOFcameras. The hardware was modified to allow for different modulation frequenciesand phases for both the sensor and the illumination. Solving a MAP estimationproblem with spatial and temporal priors, we can robustly reconstruct transientimages from CIS measurements. Unlike existing systems for the capture of transientimages, ours does not require ultrafast temporal sensing or illumination. As a result,we achieve transient images with hardware costing a fraction of the price of thefemtosecond laser/streak camera combinations. Our method reduced acquisitiontimes, makes hardware more portable, and supports easier modes of operation.A disadvantage of our system is the limited spatial and temporal resolutionof CIS. The PMD PhotonICs 19k-S3 has a spatial resolution of only 160×120pixels. Sensors of higher resolution are available, such as the 512×424 sensor ofthe Microsoft® Kinect® 2. However, most commercial camera systems like theKinect® 2 are system-on-chip designs [195] which cannot be modified as proposedin this chapter and do not expose modulation control. The time resolution of ourapproach is limited by the maximum modulation frequency that can be appliedto the illumination and the PMD sensor around 110 MHz, which corresponds to aspatial wavelength of about 2.70 m. The size of the investigated scene should not86be significantly below this wavelength. We therefore work with larger-scale scenesthan Velten et al. [152].The Gaussian-Exponential prior leads to an non-convex optimization problemthat is hard and expensive to solve. While the convex i-substep can be solvedefficiently, in under a second, the u-step takes several hours per dataset. Furthermore,the model allows for discontinuities which can be observed in Figure 3.3. Weeliminate these issues in the following Chapter 4 by introducing a modified model.By exploiting the convolutional structure of transient images, a sparse model withoutdiscontinuities and corresponding convex objective can be formulated.In summary, we have demonstrated the combination of an inexpensive hardwaresetup and an inverse problem to acquire transient images using CIS sensors. Thisapproach reduces the barrier of entry for performing research in transient imagingand its applications. In the following chapters, we build on the proposed approachand demonstrate such applications.Chapter 4 demonstrates imaging in scattering andturbid media, and Chapter 5 presents a method which enables imaging of objectsoutside of the line of sight. In both of these traditionally challenging imagingscenarios, we show that adding computation can “make the invisible visible.”87Chapter 4Imaging in Scattering MediaIn this chapter, we demonstrate imaging in scattering and turbid media using CIS.This is an extremely challenging scenario since scattering events can occur not justat object surfaces, but at any location in a volume of point scatterers. Unmixing thecontinuum of path contributions is a hard inverse problem that is hopeless to solvewithout strong priors.We solve this problem relying on a new physically-motivated model for transientimages derived from an analysis of sparsity in transient images. By exploitingconvolutional structure, we eliminate the issues of the Gaussian-Exponential priorsmentioned before, at the very end of the previous chapter. The model leads to aconvex convolutional sparse coding problem for recovering transient images fromCIS. We demonstrate our method for imaging through scattering media of variousdensities.4.1 IntroductionImaging through scattering media has recently received a lot of attention. Whilemany works have considered microscopic settings such as imaging in biologicaltissue [196, 197], we consider here the macroscopic problem, having in mindultimate target applications such as underwater imaging or imaging through fog.Impulse-based approaches have been reviewed in Section 2.7.1. These methodsimage individual femtosecond laser pulses with streak cameras [147, 149], or fast88Distance in m0.630.851.081.311.541.761.992.222.452.67Figure 4.1: Example of imaging in scattering media using our approach. Left:Original scene with objects submerged in water-filled glass tank. Center:160 ml milk added. Right: Objects that are “invisible” due to strongscattering, like the structured object on the top right or parts of thecircular object become detectable using our approach.gated cameras [196, 198, 199]. However, impulse-based approaches suffer fromlow SNR, since ambient illumination can easily overpower the laser pulse. Therefore,various approaches have been explored to increase the SNR. Dalgleish et al. [200]combine time gated imaging with spatial modulation for an underwater imagingapplication. Mullen and Contarino [201] propose a hybrid system that combinesgated imaging with microwave amplitude modulation per pulse see also [202–204].The amplitude modulation in the GHz-range reduces multi-path mixing componentssignificantly as also discussed in [156]. However, these hybrid impulse-basedapproaches still directly sample the modulated incident scene response which isaffected by ambient light. Therefore, many repeated impulses per capture arerequired. In contrast, CIS TOF cameras allow for the integration over many (ca. 104–105) pulses of a modulated light source. It is key to realize that these sensors performadaptive in-pixel background suppression during the integration. As discussedin Section 2.6.4, and in more detail in [52, 205, 206], the unmodulated backgroundcomponent of photo-electrons is measured in both buckets equally and can bedrained during exposure (potentially multiple times). This operating mode allowsCIS sensors to amplify the signal component independently of the ambient light byincreasing the exposure time, while saturation due to ambient light no longer occurs.Note also that it allows a per-pixel exposure time [205].can be effectively used for imaging in scattering and turbid media, when com-bined with computational analysis based on sparse coding; see Figure 4.1. RecallingChapter 3, the primary advantage of Correlation Image Sensors is that they have89Figure 4.2: Left: Surface reflections superimpose indirect path contributions(red) on the direct path (blue). As already discussed in Section 2.6.5,TOF imaging has to account for the indirect path contributions. Right:Imaging in scattering media results in indirect contributions not onlyfrom object surfaces, but due to scattering in the entire volume (lightblue). Note that in both cases a continuum of indirect paths is measured.extended exposure intervals, much like regular cameras, and they integrate overmany (ca. 104–105) pulses of a modulated light source, instead of a single pulse,resulting in drastically reduced capture time, as well as slower, less expensiveelectronics. However, for imaging in scattering media multi-path contributions areeven stronger than for regular TOF imaging; see Figure 4.2. At a pixel a mixtureof path lengths is measured, where only few ballistic photons directly hit objectssubmerged in the scattering media, and a large number of indirect paths are createdby scattering events in the entire volume. This strong scattering, which makes tradi-tional imaging very challenging, can only be handled if multi-path contributions areremoved effectively. In the following, we present a sparse convolutional model thatencodes strong prior knowledge on the transient image, and hence enables effectiveMPI suppression in scattering and turbid media.4.2 Sparse Transient Mixture Model4.2.1 Are Transient Images Sparse ?The idea of using sparse representations for signal recovery was first analyzedextensively by Donoho and Candes et al. [207, 208] and since then has found usein many domains [209]. Several recent works [126, 210, 211] attempt to resolvemulti-path interference by assuming sparsity of the transient profiles in the Dirac90basis. The vastly popular idea of compressed sensing [207] can be applied in astraightforward manner, by trying to solve the basis pursuit denoising problem [212]:iopt = argmini‖i‖1 subject to ‖Ci− b‖22 < , (4.1)following again the notation from Chapter 3. Eq. (4.1) is a convex relaxation of thesparsity requirement ‖i‖0 < K, where K determines the sparsity, and hence canbe solved using convex optimization methods. The problem from Eq. (4.1) can beinterpreted in our Bayesian framework from Section 2.8.1 as the MAP estimationproblemiopt = argmini12‖Ci− b‖22 + λ ‖i‖1 (4.2)The parameter λ is here related to the reciprocal of Lagrange multiplier of the con-strained basis pursuit denoising problem from Eq. (4.1). For appropriate parametersof λ,  both problems yield the same solutions [212]. This formulation has first beenproposed for compressed sensing tasks in [213]. From a Bayesian perspective, thisproblem is MAP estimation problem with Gaussian Likelihood and a Laplacian priordistribution on the basis coefficients. As in the previous chapters, the parameterλ can then be simply interpreted as a weight absorbing the scaling variables ofthe likelihood and prior distribution, i.e. a trade-off between the data fidelity andregularizer.The basic assumption of the compressed sensing approach from Eq. (4.1), (4.2)is i being sparse. However, this is violated for many realistic environments, andin particular in the case of scattering media. For regular scenes the (intensity-modulated) radiosity leaving a single scene point integrates over a continuum ofscene points, and thus in general cannot be sparse. For example, any concaveobject can be expected to deliver a non-sparse response. Especially for imagingin scattering media, the assumption of temporal sparsity breaks down. This isdemonstrated empirically in Figure 4.3 on the left by plotting the histogram of the`0-“norm”, that is the number of non-zero entries, of a high-resolution transientimage captured by Velten et al. [214].The transient image depicts a scene composed of scattering objects (a tomato)and diffuse surfaces. However, note that even pixels that do not view a scattering910 50 100 150 20023456789101112K-sparsitylog 2 empirical distribution0 50 100 150 200051015K-sparsity of reconstructed imagelog 2 empirical distributionFigure 4.3: Sparsity of 40K signals of a transient image measured image (left).Sparsity after fitting to the convolutional basis proposed in this chapter(right).object are not necessarily completely sparse, as we will show later in Section 4.3.2.The signal has been thresholded for 0.1 of the peak signal value along each pixelso as not to interpret sensor noise as sparse components. A large number of pixelsignals have more than K > 100 components, which cannot be considered sparse,given a time discretization of M = 220 in this example.4.2.2 Sparse Local Model for Transient Light InteractionHaving shown that the popular sparse model does not apply in the Dirac basis forrealistic scenes (and especially not for scattering media), we now introduce a modelfor local light transport interaction. This model leads to an overcomplete basistransforming the signal into a space where it is sparse/compressible.Building on Section 3.3, we model the temporal PSF of direct local surfacereflection from a single point as a Dirac peak, while the temporal structure ofscattering processes is best represented by an exponential decay. However, now,in both cases, this PSF is convolved with a Gaussian that models the temporal PSFand resolution of the correlation image sensor. It is therefore plausible to describea transient pixel by a mixture of exponentially modified Gaussians. Originallydeveloped to describe chromatographic peaks [215], this versatile model has sincebeen adopted in different fields such as high-energy physics, where it is used tomodel the response of pixel detectors [216]. A single exponentially modified920 50 100 150 200 25000.ρ=25.0, σ=10.0Time stepIntensity (normalized)0 50 100 150 200 25000.ρ=25.0, σ=1.0Time stepIntensity (normalized)0 50 100 150 200 25000.ρ=1.0, σ=5.0Time stepIntensity (normalized)Figure 4.4: A few samples of the exponentially modified Gaussian signalslocated at µ = 50. Curves are normalized by adjusting amplitude a.Gaussian can be defined ash (τ ; a, σ, ρ, µ) = a · exp(12(σρ)2− τ − µρ)·(1 + erf((τ − µ)− σ2ρ√2σ)),(4.3)where a (amplitude), σ (Gaussian width), ρ (skew) and µ (peak center) are theparameters of the exponentially modified Gaussian function, and τ is the traveltime at which it is evaluated. If we stack the parameters for a single exponentiallymodified Gaussian in a vector u = [a, σ, ρ, µ]T , then we can model the transienttime-profile i as the mixture:i(τ) =N∑k=1h (τ ;uk) , (4.4)where N is here the number of mixture components. Figure 4.4 shows a fewsamples of the signals. We note that the joint modeling of the exponential andGaussian nature of our signals has a key advantage over previous models: thebasis functions and hence the reconstructions are inherently smooth. The Gaussian-Exponential prior from Section 3.3, on the other hand, produces discontinuoussolutions whenever exponential components are being used.Now, fitting this model to the data from Figure 4.3 (using the method describedbelow), we can see that it defines a basis that transforms the transient image signalsinto a space that is significantly sparser than the signal itself; see Figure 4.3, right.934.3 Transient Convolutional Sparse CodingIn order to find the optimal mixture parameters, commonly a non-convex non-linear problem is solved, which is prone to local minima and expensive to solve.Furthermore it leaves the number of mixing components open, which was solvedin Chapter 3 using an alternate scheme that also offers no guarantee of globalconvergence.We follow here a different approach and first linearize the above model bysampling the parameter space and later use it in basis pursuit fashion in a convexreconstruction problem. Linearizing the basis functions givesD = [h (s;u0) , . . . , h (s;uN )] (4.5)where s is the vector of all the sampling positions of the signal and N is thenumber of samples in the set containing all C = {σ, ρ, µ} with a = 1. Includingthis (massively overcomplete) basis D in Eq. (4.2) leads to the following pursuitdenoising problem.xopt = argminx12‖CDx− b‖22 + λ ‖x‖1 (4.6)The over-completeness of the basis is now handled by the sparsity prior. Thevalues of a have now become the vector of basis coefficients x. For simplicity, weconsider only one pixel, although the approach can be trivially extended to multiplepixels/measurements.However, the basis D is still extremely large (due to the large set C), whichmakes the optimization inefficient even for problems of moderate size. One sig-nificant improvement to this situation is to exploit the convolutional nature of thebasis. Instead of sampling the parameter space for µ (i.e., the translation of the peakalong the time axis), one can fix µ and reformulate the problem as the followingoptimization problem:xopt =argminx12∥∥∥∥∥CK∑k=0Dk ⊗ xk − b∥∥∥∥∥22+ λ ‖x‖1subject to Dk ∈ h(s, C′) ∀k ∈ {1, . . . ,K}(4.7)94The basis vectors are now sampled from the space of pulse shapes, C′ = {σ, ρ},and invariant to translation. Therefore the size of the basis is drastically reducedfrom previously N = K · dim(x) to just K, i.e., by typically around 2 to 3 ordersof magnitude.This approach is motivated by the signal processing of the acoustic nerve,where shift-invariant sparse coding has first been proposed by Lewicki and Se-jnowski [217]. Recently, convolutional sparse coding has also been used for audioand image detection tasks [218–220] motivated by the recent success of convolu-tional deep neural networks in image classification and detection [18].4.3.1 Reformulation in the Spectral DomainThe problem from Eq. (4.7) can be defined even more compactly in the Fourierdomain. The convolution reduces to pointwise multiplication, here expressed by theoperator :xopt =argminx12∥∥∥∥∥C F−1(K∑k=0Dˆk  xˆk)− b∥∥∥∥∥22+ λK∑k=0‖xk‖1 ,subject to Dk ∈ h(s, C′) ∀k ∈ {1, . . . ,K}(4.8)The operators F, F−1 denote here the Fourier transform and the inverse Fouriertransform, respectively. The operator ·ˆ is a notational shortcut the frequency domainrepresentation of a variable, that is vˆ = F(v). Due to the frequency domainformulation, this linear inverse problem can be solved with an efficient ADMMalgorithm. In Chapter 6 we describe a generalized approach to convolutional sparsecoding, which also allows to learn convolutional structure when not given by thephysical model. A detailed description of frequency domain convolutional sparsecoding will be given in this chapter. Since it will also cover the implementationdetails of our approach, the remainder of this chapter evaluates the proposed modelfor transient imaging in scattering media.954.3.2 Synthetic EvaluationTo generate ground-truth pixel profiles, we sampled a high resolution “ground truth”transient image measured by Velten et al. using direct temporal method [214]. Theobservations are then generated by assuming a typical sinusoidal measurementmatrix C where f, g are defined as in Eq. (2.24). We sample ω evenly spaced in 100steps from 10 MHz to 120 MHz, which is a realistic range for recent CIS systems,recalling Chapter 3, and φ as 0, pi/2, giving exactly 100 ·2 measurements j per pixel.The measurements are normalized and then corrupted with 1% Gaussian noise.Figure 4.5 shows synthetic results for two different pixel profiles. We compare ourmethod (“Reconstruction Gauss-Exp”) to the ground-truth signal (“Original”) aswell as two recent methods from transient imaging literature: Lin et al.’s smoothfrequency-domain interpolation (“FFT model”) [221], the non-linear non-convexmodel-fit from Chapter 3 (“Sequential model”). For the sake of completeness, wefurther add comparisons to various state-of-the-art sparse reconstruction techniques,namely LASSO (“Sparse reconstruction LASSO”) [222], OMP (“Sparse reconstruc-tion OMP”) [223] as well as Generalized Approximate Message Passing [224] using0 5 10 15 20 25 30 35 400. [ns]Response0 OriginalReconstruction Gauss-Exp model0 5 10 15 20 25 30 35 400. [ns]Response0 OriginalFFT modelSequential model0 5 10 15 20 25 30 35 400. [ns]Response0 OriginalSparse reconstruction LASSOSparse reconstruction OMPReconstruction Gauss-Exp 0 5 10 15 20 25 30 35 4000. [ns]ResponseOriginalGAMP-DMMGAMP-MAP0 5 10 15 20 25 30 35 400. [ns]Response0 OriginalReconstruction Gauss-Exp 0 5 10 15 20 25 30 35 400. [ns]Response0 OriginalFFT modelSequential model0 5 10 15 20 25 30 35 400. [ns]Response0 OriginalSparse reconstruction LASSOSparse reconstruction OMPReconstruction Gauss-Exp0 5 10 15 20 25 30 35 4000. [ns]ResponseOriginalGAMP-DMMGAMP-MAPReconstruction Gauss-Exp (a) (b) (c) (d)Figure 4.5: Example showing the effect of our sparse coding optimization ontwo pixels from the “Tomato” dataset. From left to right: (a) Proposednew model, (b) FFT and sequential models, (c) state-of-the-art sparsereconstruction (LASSO and OMP), (d) two state-of-the-art compressedsensing models (GAMP).96Donoho/Maleki/Montanari-style thresholding (“GAMP-DMM”) and assuming aLaplacian signal in an MAP formulation (“GAMP-MAP”).One can see that, although these pixels are dominated by direct reflections, thesignals are not sparse at all, and time-domain sparse backscatter models as usedby Freedman et al.’s SRA [126], or Bhandari et al. [211] are not capable of closereconstructions. Our method, on the other hand, produces solutions that followthe ground-truth distributions more closely than any of the competing models. Inparticular, the exponentially modified Gaussian basis outperforms approaches thatsolve non-linear non-convex optimization, such as the one from Chapter 3. Out ofall the methods tested, the one delivering the poorest fit is [221], which imposes theweakest prior by just interpolating smoothly in the Fourier domain.4.4 ResultsIn this section we show results for imaging in scattering media using our reconstruc-tion method proposed in the previous section. The measurement setup is explainedfirst. After that the results are analyzed qualitatively and quantitatively.4.4.1 SetupCamera We use a correlation image ToF camera prototype for our experimentsusing red laser light as illumination, see Figure 4.6 on the left.The TOF camera consists of a modulated light source and a correlation imagedetector. As in the previous Chapter 5, the sensor is a PMD CamBoard nano,modified to allow for external control of the modulation signal (since access to itsFPGA configuration was not available). The light source consists of an array of 650nm, 250mW laser diodes which are operated in pulse mode, also described in theprevious chapter.We measure ω evenly spaced in 3 steps from 20 MHz to 60 MHz and φ evenlyspaced between 0 and 5 m/cω in 201 steps and with an additional shift of (0, pi/2),resulting in a measurement vector j with exactly N = 201 · 2 · 3 samples per pixel.Calibration As described in Chapter 3, we calibrated the matrix C by measuringa diffuse planar target that is mounted on a translation stage perpendicular to the97-1 -0.5 0 0.5 1-0.500.511.52x in metersz in metersSetup (TOP VIEW)-1 Scattering mediaCameraLight sourceFrustumFigure 4.6: Camera prototype and setup: We use a modified PMD Technolo-gies CamBoard nano using an array of red laser diodes for illumination(left) as described in the previous Chapter 5. In our setup we image atank filled with a scattering medium of different concentrations frontalwith the cameras. The spatial dimensions and arrangement of the setupis shown in the center. The setup is shown on the right.z-axis. The target is translated along this axis at positions according to differenttravel times τi, i ∈ {1 . . .M}. We populate C column by column for each τi.Measurement Setup An image of our measurement setup is shown in Figure 4.6 onthe right. We placed a water tank with glass hull at a distance of 1.2 m in front of ourcamera, so that the optical axis intersects the center of the largest planar side. Thelight source is placed at a slight offset to the right to eliminate direct reflections ofthe air-to-glass interface on the camera-facing wall. See the schematic in Figure 4.6in the center for the exact spatial alignment.Scattering Media We filled the tank with 80 liters of water and submerged objectsat different positions in the water (and in particular at different distances to thecamera). We then evaluated our approach on two different scattering materialswith a series of different concentrations. First, we conducted a sequence of 100experiments using homogenized milk from 0 ml to 500 ml in 5 ml steps. Second, weconducted a sequence of 50 experiments using Gypsum plaster with a continuum ofparticle sizes ≤0.125 mm. We used 0 oz to 50 oz in 1 oz steps.For each of the 150 experiments we take a full measurement with the measure-98ment parameters as described above.4.4.2 Qualitative ResultsFigure 4.7 and Figure 4.8 show qualitative results for imaging through scatteringmedia of increasing density. Objects are immersed in a tank filled with water, towhich increasing concentrations of milk (Figure 4.7) or plaster (Figure 4.8) areadded. See Section A.5 for the full sequence of all concentrations that we tested.With increasing concentration, visibility through the scattering medium quicklydrops off for a conventional camera (top row). On the other hand, a light transportanalysis based on Correlation Image Sensors, not only increases the ability to detectobjects in highly turbid solutions, but also allows for a simultaneous estimation ofdistance (color coded images in the bottom column of each figure).4.4.3 Quantitative ResultsTo quantitatively analyze our results, we measure the error of the depth estimatefor three distinct camera pixels with respect to the measured ground truth depths.The three camera pixels ‘plate’, ‘holder’ and ‘bar’ are chosen as representativesfor objects located at different depths in the reconstruction volume. Their spatialDistance in m0.630.851.081.311.541.761.992.222.452.67Figure 4.7: Qualitative results for the milk experiment described in Sec-tion 4.4.1. Experiments in each column from left to right: 0 ml, 20ml, 40 ml, 80 ml, 300 ml of milk in water. Each column shows theregular camera image (top), peak image for red camera reconstruction(bottom). Peak images are here a parabola fit through the two nearestneighbor points of the strongest peak, where the position is encoded asas hue and the intensity is encoded as value in the HSV color model.99Distance in m0.630.851.081.311.541.761.992.222.452.67Figure 4.8: Qualitative results for the plaster experiment from Section 4.4.1.Experiments from left to right: 0 oz,4 oz, 8 oz, 16 oz, 59 oz of plaster inwater. The same visualization as in Figure 4.7 has been used.positions are shown in Figure 4.9 (left). All objects have approximately diffusereflectance except for the ‘plate’ which does have a specular component. Groundtruth depth measurements for all pixels were acquired manually. The pixel ‘plate’ islocated at 5.5 cm behind front facing glass wall of the tank, pixel ‘holder’ at 21.5 cminto the tank, and ‘bar’ at 40.0 cm, touching the rear-facing wall of the tank.The center and right portions of Figure 4.9 show the error with respect to theground truth pixel depth for all 100 experiments with milk as scattering media andall 50 experiments with plaster as scattering media as described in Section 4.4.1.We can see that the error of pixels ‘plate’ and ‘holder’ has a fairly low slope and0 200 600 80001020304050 Concentration [ml]Error [cm]Milk experiment error at 3 different pixels: PlateHolderBarBar naive400 0 10 20 30 40 50 600102030405060Concentration [oz]Error [cm]Plaster experiment error at 3 different pixels: PlateHolderBarBar naiveFigure 4.9: Error for three specific pixels shown on the left. The error ofthe position is shown for all 100 experiments using milk as scatteringmedium and all 50 experiments using plaster as scattering medium. Theerror is measured with respect to 0% concentration and corrected forthe speed of light in water. Standard time-of-flight depth reconstruc-tion (atan solution from Section 2.6.1) breaks down even at very lowconcentrations of scattering agent.100remains almost flat around 1 cm–5 cm even for strong concentrations. These pixeldepths are located close to the front and in the middle of the reconstruction volume,so the scattering is reduced in comparison to pixel ‘bar’ which is at the very rear ofthe reconstruction volume. Its error is significantly larger; around 10 cm–20 cm dueto the increased scattering. However, performing a naı¨ve reconstruction as describedin Section 2.6.1 on the same pixel resulted in even significantly larger errors around30 cm–60 cm for both the plaster and the milk sequence of experiments.4.5 DiscussionIn this chapter, we have demonstrated that CIS imagers can be used for imaging inscattering and turbid media. In the presence of scattering, light rays emitted into thescattering volume are perturbed by a large number of point scatterers, each causinga scattering event. Unlike MPI problems for range imaging, with large free-spacecomponents, the point scatterers are continuously distributed in the volume. Hence,imaging in scattering media is a hard MPI problem, which requires the unmixing ofthis continuum of path contributions.The key to solving this problem using Correlation Image Sensorss lies in ex-ploiting strong prior knowledge on the transient light transport. By identifyingconvolutional sparse structure in transient images, we have formulated a compres-sive and expressive representation. This representation is based on exponentiallymodified Gaussians, which are tailored towards representing combinations of sur-face reflection and volumetric scattering. Exploiting convolutional structure oftransient images is key for allowing the prior to find the sparse signal structure. Bysolving the resulting convolutional sparse coding problem, we have demonstratedimaging in turbid and scattering media. Our method significantly outperforms naiveTOF range imaging methods that do not resolve MPI. It is robust across a large rangeof scattering densities, demonstrating that our model generalizes well.In contrast to impulse-based methods, our CIS-based approach inherits theadvantages mentioned in the previous Chapter 3, that is robustness to ambient illu-mination (all measurements presented in this chapter were taken with room lightingswitched on), and works with higher light levels than approaches based on singlelight pulses. However, the proposed approach requires computational reconstruc-101tion, whereas impulse-based methods perform direct sampling. The quality of ourmulti-path unmixing is limited by the condition number of C, while impulse-basedmethods are limited by the pulse-shape and temporal camera resolution. Since C isa truncated Fourier transform, see Section 3.4, the modulation frequency determinesthe limit here, which for commercially available CIS imagers is up to 100 MHz.However, this means also that our method will directly benefit from increases inmodulation frequency for future TOF range imagers. Furthermore, our convolu-tional sparse model, which enables the results in this chapter, generalizes to otherTOF technologies, such as direct temporal sampling (C can be calibrated with adelay-sweep, as before in Chapter 3).102Chapter 5Diffuse MirrorsThe functional difference between a diffuse wall and a mirror is well understood:one scatters back into all directions, and the other one preserves the directionalityof reflected light. The temporal structure of the light, however, is left intact byboth. In other words, assuming simple surface reflection, photons that arrive firstare reflected first.In this chapter, we exploit this insight to recover objects outside the line ofsight from second-order diffuse reflections, effectively turning walls into mirrors.We formulate the reconstruction task as a linear inverse problem on the transientresponse of a scene, which we acquire, similar to the previous Chapters 4 and 3,using an affordable setup consisting of a modulated light source and a CIS TOFcamera. By exploiting sparsity in the reconstruction domain, we achieve resolutionsin the order of a few centimeters for object shape (depth and laterally) and albedo.Our method is robust to ambient light and works for large room-sized scenes. It isdrastically faster and less expensive than previous approaches using femtosecondlasers and streak cameras, and does not require any moving parts.5.1 IntroductionObject reconstruction from real-world imagery is one of the central problems incomputer vision, and the very mechanism of image formation (each pixel measuringlight flux as a multidimensional plenoptic integral) is one of the main reasons why it10300. in m(a) Setup (b) Reconstructed depth(volume as probability)(c) Reconstructed depth(strongest peak)Figure 5.1: (a) illustration of our measurement scenario (to scale). A diffusewall is illuminated by a modulated laser beam and observed by a TOFcamera. From diffuse reflections, we infer the geometry and albedo ofobjects within a bounding volume (green) that is completely occludedto both light source and camera, but visible from most locations on thewall. In this example, of two letters are placed in the unknown volume;see the scene photograph in top right inset in (a). The shape of the lettersbecomes clearly visible in the reconstruction (b,c).is so challenging. To overcome the limitations of standard monocular images takenunder uncontrolled illumination with respect to many vision tasks, a wide range ofnovel capturing approaches has emerged that extend the concept of digital imagingwith structured light or new sensing techniques involving masks, filters or integraloptics (light fields) [225].The additional temporal dimension introduced by transient imaging can solvesome of these challenges. By undoing the temporal mixing of the light transport,transient imaging has enabled the use of diffuse reflectors to image objects viathe time profile of reflections from ultra-short laser pulses [149, 226]; see alsoSection 2.7. However, reconstruction of the this data from transient images is a hardinverse problem, that is sensitive to the exact parametrization of the problem as wellas the priors and regularization terms that are employed. In this chapter, we developa new parametrization for this inverse problem, and combine it with a novel set ofsparsity inducing priors to achieve a robust reconstruction of geometry and albedofrom transient images.As discussed in Chapter 3, a major challenge is that the instrumentation requiredto measure the transient images themselves has traditionally suffered from severe104practical limitations including excessive hardware cost (hundreds of thousands ofdollars), long acquisition times (hours) and the difficulty of keeping the sensitivesystem calibrated. We solve this problem by building on the method from theprevious Chapter 3 using widespread CIS TOF sensors for obtaining the transientimage. The inverse problems for transient image reconstruction and geometryrecover can be merged into a single non-linear optimization problem that can besolved efficiently. The result is a system that is by several orders of magnitude moreaffordable and acquires data faster than previous solutions. We demonstrate theeffectiveness of our setup and the computational scheme by reconstructing both alow-contrast albedo and the geometry of hidden objects.5.2 Image Formation Model5.2.1 AssumptionsWe make several assumptions for the image formation; see Figure 5.2. The hidden0 0.5 1 1.5 2 2.5 3 3.500.511.522.533.5 wLight sourceDiffuse wallVolume VOccluderl0dxclxzToF cameranxFigure 5.2: A schematic of our measurement setup illustrated in Figure 5.1.All distances are in meters. An isotropic point light source is createdon the diffuse wall by illuminating it with a laser beam. This diffuselight source illuminates scene objects in the unknown scene volume(green), outside of the line of sight of the camera. Some of the lightis reflected back to the wall (1st bounce) and can be measured by thecamera bouncing of the diffuse wall (2nd bounce).105scene is modeled as a diffuse height field, which in turn is represented as a collectionof differential patches dx with orientation nx inside a volume V . We assume thatthe camera points at a section of the diffuse wall, and is in focus. Light is emittedas a laser ray from position l0 and illuminates a single point l on the diffuse wall,outside the field of view of the camera. Radiometrically, we treat this point as asingle, isotropic point light emitting a radiance Le(l). From l, the light illuminatesthe scene, and after a single bounce returns to the diffuse wall. Patches dw on thewall are chosen such that there is a one-to-one correspondence between patches andcamera pixels. Finally, we ignore occulusions in the height field.5.2.2 Stationary Light TransportWith these assumptions, starting from the diffuse version of the Rendering Equationintroduced in Section 2.2.2, we can model the stationary (i.e. time independent)light transport as follows.L(l) = Le(l)L(x) = Le(l)ρ(x)R(l,x)L(w) =∫VLe(l)ρ(x)R(l,x)ρ(w)R(x,w) dx(5.1)with ρ(.) denoting the diffuse albedo of a patch, and the unoccluded geometry termR(x,y) =cos](y − x,nx) · cos](x− y,ny)|y − x|2 . (5.2)between the patch at x and another patch at y. We re-write the radiance at a wallpatch (Eq. (5.1)) asL(w) = Le(l)ρ(w)∫Vr(x)v(x) dx, (5.3)where the geometry termr(x) =cos](x− l,nl) · cos](x−w,nw)|x− l|2 · |x−w|2 (5.4)106is independent of both the albedo and orientation of the patch dx, whilev(x) = ρ(x) · cos](l− x,nx) · cos](w − x,nx) (5.5)is a term that isolates both of these unknown quantities. We can interpret v(x) eitheras a generalized albedo term or as a continuous volume occupancy that indicateswhether or not a given voxel location is occupied by the surface to be reconstructed.Note that in this parametrization, the image formation is linear in v(x).5.2.3 Transient Light TransportFollowing the approach from the previous chapter, the transient version of Eq. (5.3)is obtained by adding a time coordinate t and counting only light contributions suchthat t is the sum of emission time t0 and the travel time τ for a given light path fromthe laser l0 to a camera pixel c. In our image formation model, the relevant lightpaths only differ in the position of the surface element dx, i.e. t = t0 + τ(x).Recalling that we assume a one-to-one correspondence between wall patches wand camera pixels c, we obtain the transient image formation modelL(c, t)=∫ t0Le(l, to)ρ(w)∫Vδ(t0 + τ(x)− t)r(x)v(x) dx dt0, (5.6)where the travel time τ(x) is the total path length divided by the speed of light c:τ(x) = (|l0 − l|+ |l− x|+ |x−w|+ |w − c|)/c (5.7)We note that this transient image formation model is independent of the way thetransient image has been acquired. It therefore applies to all known approachesfor generating transient images, including femtosecond imaging [149] as well ascorrelation-based measurements from the previous Chapter DiscretizationThe problem of reconstructing geometry from indirect light amounts to recoveringthe diffuse height field represented as the continuous voxel densities v(x). To thisend, we discretize the volume v(x) from Eq. (5.6) into a Euclidean voxel grid, and107represent it as a vector v of size m. The transient image (radiance) is represented asa vector i ∈ Rnt, where n is the number of camera pixels/wall patches, and t is thenumber of time steps. The discrete version of Equation 5.6 is then given asi = Pv (5.8)with the light transport tensor P ∈ Rnt×m.5.2.5 Transient CIS Image FormationUnlike Velten et al. [149], in our implementation we do not measure the transientimage directly. Instead, we build on-top of our work from the previous Chapter 3:using a standard CIS TOF sensor with a modulated illumination source, we obtain asequence of modulated exposure measurements b for different modulation frequen-cies and phases. Recalling Eq. (3.9), the transient image can then be recovered fromb = Ci+ n, where the correlation matrix C is obtained through a straightforwardcalibration step, and n represents sensor noise. Substituting Eq. (5.8) for i, wearrive at our full image formation model:b = CPv + n (5.9)5.3 Reconstruction from Diffuse Indirect IlluminationThe image formation model from Eq. (5.9) cannot be inverted directly, due to noise,and since the light transport matrix P is poorly conditioned, as is the correlationmatrixC; see Section 3.3. We follow the Bayesian MAP approach from Section 2.8.3to estimate the voxel grid. In particular, we use spatial priors on the voxel grid, anda likelihood approximating the noise distribution. Our approach is described in thefollowing.5.3.1 Inverse ProblemWe solve the following MAP estimation problemvopt = argminv12‖CPv − b‖22 + Γ(v), (5.10)108where the first term is the Data Fidelity term. As in the previous chapter, this termresults from a Gaussian likelihood, that is we assume n to be Gaussian distributed.Priors The second term implements the statistical priors on v in our MAP frame-work. It isΓ(v) = λ∑z‖∇xvz‖1 + θ ‖Wv‖1 +∑x,yindC(vx,y) (5.11)We assume a sparse Laplacian distribution on the spatial gradients height field,leading to the first `1 penalty term on the spatial gradients for all volume depthsz. Furthermore, we assume that the volume v is sparsely populated, justified byour assumption of height field geometry. This second term is implemented as aweighted `1 norm of the volume itself. The weight matrix W will be obtained usingan iteratively reweighted `1 scheme (IRL1, see Section 5.3.3). Finally, the last termin Eq. (5.11) encodes the knowledge of the height field, by constraining the volumeto have at most one non-zero entry for each 2D (x, y) coordinate. We encode thisprior using a projection onto an indicator set of possible depth values for each (x, y)coordinate:indC(p) ={0 if p ∈ C∞ else withC = {d ∈ Rz | card (d) = 1 ∧ 1Td = 1Tp}(5.12)We note that the second and third term of the regularizer both have the purpose ofencouraging a single surface reflection along the z-dimension of the reconstructionvolume. The term from Eq (5.12) is stronger than the `1 regularization, since itprefers exactly a single-non-zero solutions (in contrast to just sparse solutions). Onthe other hand, it makes the overall optimization non-convex as C is a non-convexset. So having both terms enables us to trade the convexity of our objective functionfor the sparsity of our model by adjusting the weight θ from Equation 5.11.1095.3.2 OptimizationThe optimization problem from Eq. (5.10) is of the form discussed in Section 2.8.3.In particular, we chooseΦ = I, G(p) =12‖CPp− b‖22 ,Ω =[DTx ,DTy ,WIT , IT]T, F (p) = λ ‖p1,2‖1 + θ ‖p3‖1 + indC(p4),(5.13)where Dx,Dy are derivative operators for the x, y dimensions for all z coordinates(stacked on-top of each other) and I is the identity matrix. We have chosen F (·):Γ(v) = F (Ωv). The index identifies here the component stacked on-top ofeach other in Ωv. Note that the minimum of Γ(v) is obtained by independentlyminimizing F for each component of Ωv.Eq. (5.13) represents our specific MAP estimation problem from Eq. (5.10) in thegeneral linear MAP form from Eq. (2.40). Hence, it can be solved with a proximalalgorithm as discussed in Section 2.8.4. We use a linearized version of ADMM [175].Using v as primal variable, the augmented Lagrangian from Eq. (2.42) becomesLρ(v,y, λ) = G(Φv) + F (y) + λT (Ωv − y) + ρ2‖Ωv − y‖22, (5.14)The ADMM method from Algorithm 1 performs three steps per iteration, eachupdating v,y, λ alternately. The individual steps are as follows:v-step The update of the volume v proceeds as follows:vk+1 = argminvLρ(v,yk, λk)= argminv12‖CPv − b‖22 + (λk)T(Ωv − yk)+ρ2∥∥∥Ωv − yk∥∥∥22≈ argminv12‖CPv − b‖22 + (λk)T(Ωv − yk)+ρ(ΩTΩvk − ΩTyk)Tv +µ2∥∥∥v − vk∥∥∥22=(PTCTCP+ µI)−1 (PTCTb+ µvk − ρ(ΩTΩvk − ΩTyk)+ ΩTλk).(5.15)110Note that in the third step we have made an approximation that linearizes thequadratic term from the second line in the proximity of the previous solution vk.This linearization approach is known under several different names, includingLinearized ADMM or inexact Uzawa method (e.g. [175, 227, 228]). The additionalparameter µ satisfies the relationship 0 < µ ≤ 1/ (ρ‖Ω‖22).y-step The slack variable y is updated as follows:yk+1 =argminyLρ(vk+1,y, λk)=argminyF (y) + (λk)T(Ωvk+1 − y)+ρ2∥∥∥Ωvk+1 − y∥∥∥22=argminyF (y) +ρ2∥∥∥∥(Ωvk+1 − λkρ)− y∥∥∥∥22(5.16)Both F (.) and the least square term can be minimized independently for eachcomponent in y. Using the slack variable y, the minimization involving the difficultfunction F has now been turned into a sequence of much simpler problems in just afew variables. To derive the specific solutions to these problems, we note that thelast line in Eq. (5.16) can be interpreted as a proximal operator:yk+1 = prox(1/ρ)F(Ωvk+1 − λkρ)(5.17)using the standard definition from Section 2.8.4. For our problem, we require theproximal operators for the `1 norm and for the indicator set. These are given asproxγ|·|(p) = (p− γ)+ − (−p− γ)+proxγ indC(·)(p) = ΠC(p)(5.18)The first term is the well-known point-wise shrinkage [175] and the second is theprojection on the set C.λ-step The final step of the ADMM algorithm 1 is to update the Lagrange multiplierby adding the (scaled) error, that is λk+1 := λk + ρ(Ωvk+1 − yk+1).1115.3.3 Enhancing Volume SparsityTo further enhance the sparsity of the convex `1-regularized part of our objective,we have placed a weight W on the individual components of the `1 volume penalty(second term in Eq. (5.11)).This approach has been proposed by [229]. The idea is that the weights Wcapture the support of our sparse solution. This support is estimated iteratively fromthe last solution, which allows for improved recovery of the sparse non-negativecomponents. As proposed in [229], we use the update ruleWj+1 := diag(1|vj |+ ), (5.19)where the division is here point-wise, and diag (·) denotes the diagonal matrixwith the diagonal from the argument. The iteration variable j from above is foran outer iteration on top of our original iteration from Algorithm 1 with the stepsfrom the previous paragraphs.5.4 Implementation and Parameter SelectionParameters For our specific Algorithm 1 with the steps from the previous para-graphs, we use the parameters ρ = 1.1 and µ = 0.5 · 1/ (ρ‖Ω‖22), which producedgood results for all of our tested datasets. Note that Ω changes for every outer IRL1iteration, and thus µ has to be recomputed for every iteration. We estimate ‖Ω‖22by running the power method for ΩTΩ with random initialization. We use 3 outerIRL1 iterations and an update weight of  = 0.1.Implementation of the v-step For a very high resolution sensor and reconstructionvolume, storing P would be infeasible. In this scenario one can implement P as theprocedural operator performing the transient light transport exactly as described inSection 5.2.3. The transient rendering operation parallelizes very well over eachinput pixel. One can implement its transposePT similarly as the dot product of eachtransient image for a considered voxel accumulated over the whole voxel volume.Thus again only transient rendering and some additional dot-products are required.112Finally, the v-step from the presented linearized ADMM can be implemented usingConjugate Gradient (CG). Instead of applying explicit matrix multiplication insideCG, we replace each of the products with P or PT with the operations definedabove.We implemented this version first. However, since our sensor only has avery low resolution of 120 × 160, we were actually able to fully precomputeand efficiently store P (in < 8GB RAM) as a sparse matrix which speeds up thereconstruction dramatically. Note that this approach would not be possible if thesensor or reconstruction resolution were significantly higher.Pre-factorization for Speedup Instead of minimizing ‖CPv − b‖22 as a data termone can also pre-factor the optimization and first solve for a transient image C−1jand then use this as an observation in the changed data term ‖Pv −C−1b‖22. Wehave used the i-subproblem from Section 3.3.3 to pre-factor the optimization anddid not notice a strong difference in reconstruction quality in comparison to usingthe not pre-factored version. The advantage of pre-factoring is that the method getssped up even more since all matrix application of C have been handled before andC itself can be inverted more efficiently than the full CP.5.5 Results5.5.1 SetupOur instrumentation comprises a modulated light source and a CIS detector, as usedfor the purpose of transient imaging in Chapter 3.The detector is based on a filterless version of the TOF development kit Cam-Board Nano by PMD Technologies, and extended with an external frequency-controllable modulation source (a workaround in lack of access to the FPGA config-uration for the CamBoard). We determined that for our setup an integration time of10 milliseconds delivers the optimal SNR, which we further improve by averagingover multiple measurements; see Section A.4 for details.The light source consists of six 650 nm, 250 mW laser diodes with collimationoptics and custom driving hardware to emit pulses of approximately 2-3 nanoseconds113Figure 5.3: Left: Photo of our capture setup facing the diffuse wall (lightsource covered with black photo board). To the left, behind an occluder,lies the reconstruction volume. Right: Close-up on the light sourcewithout cover.duration at variable repetition rate. The primary difference to the hardware setupfrom the previous chapter is that in the proposed setup, the diodes are not diffusedto act as a spot light. Instead, we focus each laser diode with individual optics ontoa single spot l on the wall (Figure 5.2, Figure 5.3). Their overall duty cycle duringcapture is less than 1%, allowing operation with only the lens holders doubling asheat sinks.Our reconstruction volume has a size of 1.5 m×1.5 m×2.0 m and is distanced1.5 m from the flat, diffuse wall. The camera and illumination are about 2 m fromthe wall; please see Figure 5.2 for the exact spatial arrangement.5.5.2 Qualitative ResultsGeometry Our first test is to reconstruct the geometry of two letters cut outof cardboard that were painted with white color, and placed at different depths(Figure 5.1). We show two visualizations of the recovered depth information in thevolume. In the center image we treat the voxels as an occupancy probability and weshow the expected value of the distribution for each pixel, i.e. the sum of distancesweighted by the occupancy probability.Since the expected value is not robust to outliers, we show in the right imagethe depth value with the strongest peak for each (x, y) pixel. This amounts to thevoxel with the highest probability of occupancy in our reconstruction. Note, that114Figure 5.4: Albedo reconstruction example: Reconstruction of scene imagewith a flat surface but varying albedo (right). Center: the color-codeddepth map of strongest peak along z-coordinate shows a flat geometry.Left: Albedo image at the depth map’s depth.here we threshold the volume, with very low albedo/occupancy probability as grey.Albedo The next experiment Figure 5.4 shows the recovery of a spatially varyingalbedo on a flat surface. The color-coded depth map shows the depth of the strongestdensity in the reconstructed volume for each pixel (x, y) as before. The left of thefigure shows the albedo v(x) sampled exactly at the depth map positions (theposition of the strongest peak).Figure 5.5: Simultaneous albedo and geometry: Reconstruction of scene im-age with varying albedo (letter on plane in the front) and varying depthfor the letter in back (right). Albedo image, reconstruction value exactlyat the depth position from the depth map (center). Color-coded depthmap of strongest peak along z-coordinate (left).115Albedo and Geometry Figure 5.5 shows an example of variation in both geometryand albedo. In this case, the planar surface in the front could not be reconstructed inthe depth map due to the low albedo limiting the reflected light.Different Materials In Section A.4 we show several examples of reconstructionswith non-Lambertian surfaces. We find that Lambertian scenes result in verysparse volume reconstructions that clearly represent a height field structure. Withincreasingly non-Lambertian surfaces the energy is spread out more and morethroughout the volume (as our model is violated).Effects of Ambient Light and Frame Averaging One of the advantages of themethod from the previous Chapter 3, is that it is rather insensitive to ambientillumination. We tested whether this robustness also applies to our approach forreconstructing geometry and albedo. Section A.4 presents results with strongambient illumination which has only a minor effect on reconstruction quality.5.5.3 Quantitative ResultsTo evaluate our reconstruction results, we compared the distance maps with manuallymeasured scene geometry. Figure 5.6 shows a quantitative evaluation for thegeometry reconstruction example shown above. in mFigure 5.6: Quantitative evaluation: Reconstruction of scene image with letter”L” and ”F” cut out of white painted cardboard (right). Color-codeddepth map of strongest peak along z-coordinate visualized with color barfor depth in m (left). (x, y) ground truth geometry acquired from scenemeasurements (middle).116Lateral Resolution In order to evaluate the spatial resolution, we show an imageof the measured scene geometry of the flat targets. The same discretization as forthe shown depth map has been used. Having in mind that our reconstruction volumefor all shown results had a size of 1.5 m × 1.5 m × 2.0 m (x × y × z), we seethat we can achieve an (x, y) resolution of approximately ±5 cm. The accuracy ofthe reconstruction varies with different materials. Materials that have little or nooverlap in the space-time profile (e.g. mirror example in Section A.4), allow forhigh reconstruction precision (around ±2 cm for the mirror example). The precisionfor more complex materials degraded to around ±15 cm tolerance. Overall thespatial resolution is limited by the low resolution of our sensor (which was only120×160 pixels).Note that due to our robust measurement and reconstruction procedure we areable to achieve the shown results for significantly larger scenes than previouslypossible with the femtosecond laser approach demonstrated in [149]. Velten et al.report distances of up to 25 cm from object to the wall and a reconstruction volumeof (40 cm)3 due to the low SNR for large distance bounces, whereas we demonstratefor the first time much larger room-sized environments.Depth Resolution For the temporal resolution achieved in the above example ofFigure 5.6, we see from the given color bar a depth distance of approximately 0.6m,where the measured distance was 0.75m. For all similarly diffuse materials wereach also roughly a tolerance of ±15cm. For simple strong reflectors like themirror we have less temporal superposition, so for the mirror example we obtaina high temporal resolution of below 5 cm error in our 2 m depth range, with morecomplex materials producing a precision of around ±20 cm.As shown above the resolution of our approach depends on the scene content.The achievable resolution should in the future scale linearly with the availability ofhigher resolution TOF cameras, such as the upcoming Kinect® 2 as mentioned inSection 3.7. We have shown that our method degrades somewhat gracefully withusing different materials, although a certain scene dependence is inherent in thenon-linear nature of the inverse problem we solve.1175.6 DiscussionIn this chapter, we have presented a method for reconstructing hidden, i.e. Non-Line-of-Sight, geometry and albedo values from transient images of diffuse reflections.This approach involves hard inverse problems that can only be solved using sta-tistical priors such as sparsity in the geometry, and our primary contribution isto identify a linearized image formation model, regularization terms, and corre-sponding numerical solvers to recover geometry and albedo under this difficultscenario.Despite these numerical challenges, we show that our method can be combinedwith the work from the previous Chapter 3 on transient imaging using inexpensiveCIS cameras, which itself involves a hard inverse problem. We demonstrate that it ispossible to combine these two inverse problems and solve them jointly in a singleoptimization method. As a result our approach has several advantages over previousmethods employing femtosecond lasers and streak cameras [149]. These includea) low hardware cost, b) no moving parts and simplified calibration, c) capturetimes that are reduced from hours to minutes, and d) robustness under ambientillumination in large room-sized environments. We believe that, as a result, ourmethod shows great promise for applications of indirect geometry reconstructionoutside the lab.118Chapter 6Convolutional Sparse CodingConvolutional Sparse Coding (CSC) has enabled exciting applications in computervision and machine learning. The previous Chapters 3, 4 and 5 have demonstratedthat physically-motivated convolutional codes allow the recovery of high-qualitytransient images from CIS measurements, enabling imaging of objects outside ofthe line of sight and imaging in scattering and turbid media. Going beyond recon-struction tasks, CSC can even serve as a strategy for unsupervised features learning,when a simple physical model cannot explain the data. In particular, it allowslearning features of natural images, that subsequently are used for classificationand reconstruction tasks. As opposed to patch-based methods, convolutional sparsecoding operates on whole images, thereby seamlessly capturing the correlationbetween local neighborhoods.In this chapter, we propose a new approach to solving CSC problems, both forlearning and reconstruction, and show that our method converges significantly fasterand also finds better solutions than the state of the art. In addition, the proposedmethod is the first efficient approach to allow for proper boundary conditions tobe imposed and it also supports feature learning from incomplete data as well asgeneral reconstruction problems.6.1 IntroductionAn increasing number of computer vision tasks rely on image statistics as a keyprior in Bayesian inference Section 2.8. Low-level problems that benefit from119Figure 6.1: Patchwise Sparse Coding (left) divides the data into small patchesof a given size (blue). Every patch is modeled as the superposition ofa sparse set of learned dictionary atoms. The patchwise representation,however, contains shifted version of the same feature (third and lastatom here). Convolutional Sparse Coding (right) finds a more compactrepresentation by exploiting this convolutional structure in the data, andmodels it as a sum of sparsely-distributed convolutional features.good priors include inpainting, denoising, deblurring, and super-resolution, whilerecognition, classification and other higher-level tasks often use learned featuresas priors for natural images. In this chapter, we revisit one strategy for unsuper-vised learning of image features: Convolutional Sparse Coding (CSC). CSC wasintroduced in the context of modeling receptive fields in human vision [230], but ithas recently been demonstrated to have important applications in a wide range ofcomputer vision problems such as low/mid-level feature learning, low-level recon-struction [231, 232], as part of more complex hierarchical structures or networks inhigh-level computer vision challenges [220, 233, 234], and in physically-motivatedcomputational imaging problems as the one from the Chapter 4. Beyond theseapplications, CSC could find applications in many other reconstruction tasks andfeature-based methods, including deblurring, denoising, inpaiting, classification,localization, and tracking. While we focus on 2D image features in this chapter, ourmethod straightforwardly generalizes to higher dimensional data, such as videos,and directly applies to 1D signals, such as the transient profiles discussed in theprevious chapters.CSC is closely related to popular patch-based learning and reconstruction meth-ods [163, 235, 236]. However, features learned with patch-based methods oftencontain shifted versions of the same features and latent structures of the underlyingsignal may be lost when dividing it into small patches. See Figure 6.1 on the left for120an illustration. A more elegant way to model many of these problems is to use a sumof sparsely-distributed convolutional features; see Figure 6.1 on the right. The maindrawback of convolutional models, however, is their computational complexity. Notonly is it very challenging to find a solution to convolutional sparse coding problemsin a reasonable amount of time, but finding a good local minimum is difficult as well.Generally, CSC for feature learning is a non-convex problem and many existingmethods provide little to no guarantees for global convergence. Seminal advances infast convolutional sparse coding have recently shown that feature learning via CSCcan be efficiently solved in the frequency domain. Grosse et al. [237] were the firstto propose a frequency domain method for 1D audio signals, while [238–240] laterdemonstrate efficient frequency domain approaches for 2D image data. While this isthe first step towards making CSC practical, these frequency methods can introduceboundary artifacts for both learning and reconstruction [233] and, as inherentlyglobal approaches, make it difficult to work with incomplete data.Building on recent advances in optimization [174, 175, 241–243], we proposea new splitting-based approach to convolutional sparse coding. We not only showthat our formulation allows us to easily incorporate proper boundary conditions andlearn from sparse observations, but we also demonstrate that the proposed methodis faster and converges to better solutions than the state of the art.In the following, we first derive a flexible formulation of the convolutionalsparse coding problem, and an efficient solution by splitting the objective into asum of simple, convex functions. This formulation fits into most recent proximaloptimization frameworks. Following, we demonstrate that the proposed methodallows for proper boundary conditions to be imposed without sacrificing perfor-mance; it converges faster than alternative methods and finds better solutions. Thelatter is verified using several low-level reconstruction problems. Finally, we showthat our flexible formulation allows feature learning from incomplete observations,and yields an efficient solver when working with known features, such as theexponentially modified Gaussians from Chapter 4.1216.2 Mathematical FrameworkTraditionally, convolutional sparse coding problems are expressed in the form ofthe following Eq. (6.1). See again Figure 6.1 on the right for an illustration.argmind,x12‖j−K∑k=1dk ⊗ xk‖22 + βK∑k=1‖xk‖1subject to ‖dk‖22 ≤ 1 ∀k ∈ {1, . . . ,K},(6.1)where xk are sparse feature maps that approximate the data term j when convolvedwith the corresponding filters dk of fixed spatial support. Here j ∈ RD,xk ∈ RDare vectorized images, dk ∈ RM are the vectorized 2D filters, k = 1 . . .K, and ⊗is the 2D convolution operator defined on the vectorized inputs. While the aboveequation is strictly only valid for a single target image, it can easily be generalizedto multiple images j.Bristow et al. [238, 239] have shown remarkable improvements in efficiency byexploiting Parseval’s theorem for solving Eq. (6.1), which states that the energy of asignal is equivalent — up to a constant — to that of its Fourier transform. We willneglect this constant in the following. Eq. (6.1) can therefore be reformulated [238–240] asargmind,x12‖jˆ−K∑k=1dˆk  xˆk‖22 + βK∑k=1‖tk‖1subject to ‖sk‖22 ≤ 1 ∀k ∈ {1, . . . ,K}sk = SΦT dˆk ∀k ∈ {1, . . . ,K}tk = xk ∀k ∈ {1, . . . ,K},(6.2)which expresses the computationally expensive convolution operations as moreefficient multiplications in the frequency domain. Here,ˆdenotes the frequencyrepresentation of a signal, is the component-wise product, Φ is the discrete Fouriertransform (DFT) matrix, and S projects a filter onto its (small) spatial support. Theslack variables sk and tk allow Eq. (6.2) to be solved by splitting the objective intomultiple subproblems that each can be solved efficiently.It is important to note that Eqs. (6.1) and (6.2) are actually only equivalent underthe assumption of circular boundary conditions [220]. Kavukcuoglu et al. [233]122point out that boundary conditions are one essential hurdle in the convolutionalmodel that affects the optimization even for non-circular boundary conditions,because pixels close to the boundary are, in general, covered by fewer filters thancenter pixels. While heuristics [238] might be acceptable for learning filters withvery small spatial support, this assumption does not necessarily hold for largerfilters or for general reconstruction problems. We propose the following, generalformulation for convolutional sparse coding:argmind,x12‖j−MK∑k=1dk ⊗ xk‖22 + βK∑k=1‖xk‖1subject to ‖dk‖22 ≤ 1 ∀k ∈ {1, . . . ,K}.(6.3)Here,M is a diagonal or block-diagonal matrix, such that it decouples linear systemsof the form (MTM+ I)x = b into many small and independent systems that areefficiently solved. For example, for boundary handling M can be a binary diagonalmatrix that masks out the boundaries of the padded estimation∑Kk=1 dk ⊗ xk. Thisallows us to use unmodified filters in boundary regions, without requiring circularboundaries or other conditions. Furthermore, we show that M allows for efficientlearning and reconstruction from incomplete data.Unfortunately, Eq. (6.3) cannot be solved directly with the “Fourier trick” dis-cussed in the literature (Eq. (6.2)). In the following, we derive a formulation thatis not only flexible in allowing us to solve Eq. (6.3) efficiently, but we also showthat our formulation solves the conventional convolutional sparse coding problem(Eq. (6.1)) faster than previous methods and converges to better solutions.6.2.1 Efficient Splitting of the ObjectiveTo efficiently solve Eq. (6.3), we reformulate it such that the constraints are includedin the objective via an indicator function indC(·). This indicator function is definedon the convex set of the constraints C = {v | ‖Sv‖22 ≤ 1}. With the constraintsencoded in the indicator penalty, we get the following unconstrained objective:argmind,x12‖j−MK∑k=1dk ⊗ xk‖22 + βK∑k=1‖xk‖1 +K∑k=1indC(dk), (6.4)123which can be expressed as the following sum of functionsargmind,xf1(Dx) +K∑k=1(f2(xk) + f3(dk)) , with (6.5)f1(v) =12‖j−Mv‖22, f2(v) = β‖v‖1, f3(v) = indC(v).Here, x = [xT1 . . .xTK ]T and D = [D1 . . .DK ] is a concatenation of Toeplitzmatrices, each one representing a convolution with the respective filter dk. Eq. (6.5)consists of a sum of functions fi, which are all simple to optimize individually,whereas their sum is challenging. Following [242], we define f1 with M includedbecause that splits the data term into two different subproblems involving M and Dseparately but never jointly.6.2.2 Generalization of the ObjectiveTo derive this result more intuitively, we consider the general objective from (6.5)argminxI∑i=1fi (Kix) , (6.6)where Ki : Rbi×ai are arbitrary matrices, fi : Rbi → R are closed, proper, convexfunctions, and i ∈ {1, . . . , I}, such that fi(Kj · ) : Rai → R; I is the number offunctions in the sum.Eq. (6.6) is motivated by recent work in image deconvolution [8, 242], whichhave a similar objective that consists of a sum of simple convex functions. Theproblem in Eq. (6.6) can be reformulated asargminxI∑i=1fi (Kix) = f (Kx) , withK =K1...KI and f(v) = I∑i=1fi(vi),(6.7)where vi selects the i-th support of v. Using a formulation with the stacked matrix124K allows us to remap Eq. (6.6) to existing proximal algorithms, such as ADMM,which have been discussed in detail in Section 2.8.4.OptimizationEq. (6.7) is an optimization problem of the form presented in Section 2.8.3. We setG = 0, Ω = K and F = f to express our problem in the general form. This settingmay seem unintuitive, but will become clear after deriving our solver method below.In particular, the general ADMM method 1 becomes now Algorithm. 3, with x asprimal variable. We observe that the resulting minimization becomes separable inAlgorithm 3 ADMM for a sum of functions in Eq. (6.7)1: for k = 1 to V do2: xk+1 = argminx‖Kx− y + λk‖223: yk+1i = prox fiρ(Kixk+1i + λki )4: λk+1 = λk + (Kxk+1 − yk+1)5: end forall the fi. SinceM is included in f1, we can solve the update in line 2 of Algorithm 3efficiently in the Fourier domain. Note that the combination of splitting M fromthe filter generation Dx via f1(Dx) leads to a non-standard application of ADMM;the standard approach from Section 2.8.4 would set G(x) = 12‖j−MDv‖22 as thedata fidelty, with MD as a single operator.In the following, we will describe each subproblem of Algorithm 3. Note,that although we derive an ADMM method for solving Eq. (6.7), this is equallypossible for other proximal algorithms, that have been listed in Section 2.8.4. InChapter 10, we rely on a general formulation for image optimization problems,similar to Eq. (6.6), as the basis for a corresponding domain-specific language,that allows to “compile” our optimization problem into proximal algorithms withdifferent splittings of the objective into G,F .x-step Solving the first quadratic subproblem from Algorithm 3 (line 2) givesxopt = argminx‖Kx− ξ‖22 = (KTK)−1(KT ξ) (6.8)125Here, we have set ξ = y − λk as a notational shortcut. Depending on whether wesolve for the filters (i.e. x = d) or for the feature maps (i.e. x = x), we get:dopt = (X†X+ 2I)−1(X†ξ1 + ξ2 + ξ3) for x = dxopt = (D†D+ I)−1(D†ξ1 + ξ2) for x = x(6.9)Here X is a concatenation of Toeplitz matrices for the respective sparse codes xkand ξi selects again the i-th support of τ as defined for Eq. (6.7). The operator ·†defines here the conjugate transpose, with notation borrowed from [240]. In bothcases, one can find a variable reordering for the equations systems in Eq. (6.9), thatmakes (X†X + 2I) and (D†D + I) block-diagonal [238, 240], which makes theinversion efficient by parallelization over the j ∈ {1 . . . D} different blocks. Theinverse can be efficiently computed for each block j using the Woodbury formula,giving(X†jXj + 2I)−1 =12I− 12X†j(2I+XjX†j)−1Xj(D†jDj + I)−1 = I− D†jDj1 +DjD†j,(6.10)where the second equation holds, since a block in Dj is just a row-vector. Wecompute the first inverse (2I+XjX†j)−1 by computing its Cholesky factorization.In contrast to the direct inversion in [239] (due to the code update, this has to be donein each iteration of their method), caching this factorization leads to a significantlydecreased running time as described below.y-step The proximal operators for Algorithm 3 (line 3) are simple to derive andwell known in literature [175]:proxθf1(v) = (I+ θMTM)−1(v + θMT j) Quadraticproxθf2(v) = max(1− θβ|v| , 0) v Shrinkageproxθf3(v) ={Sv‖Sv‖2 : ‖Sv‖22 ≥ 1Sv : elseProjectionwhere the inverse in proxθf1 is cheap to evaluate as M is usually (block-)diagonal.1266.2.3 Biconvex Minimization via Coordinate DescentAbove, we have described Algorithm 3, which can solve the bi-convex problem (6.3)for x or d when the respective other variable is fixed. To jointly solve for both, wefollow the standard approach of alternating between them, yielding Algorithm 4.Algorithm 4 CSC learning using coordinate descent1: Algorithm penalty parameters: ρd ∈ R+, ρx ∈ R+2: Initialize variables: d0,x0, λ0d, λ0x3: repeat {Outer iterations}4: Kernel update: Solve Eq. (6.5) w.r.t. d:di, λid ← argmind f1(Xd) +∑Kk=1 f3(dk) usingAlg. 3 with ρ = ρd, λ = λi−1d5: Code update: Solve Eq. (6.5) w.r.t. x:xi, λix ← argminx f1(Dx) +∑Kk=1 f2(xk) usingAlg. 3 with ρ = ρx, λ = λi−1x6: until No more progress in both directions.With this alternating approach, we have constructed a coordinate descent for xand d. The individual Lagrange multipliers are initialized with the ones from theprevious iteration. In practice, we run each of the two sub-routines until sufficientprogress has been made. The step-size of the coordinate descent is defined by theprogress each local optimization makes. Using a constant number of P iterations foreach substep gave us a sufficiently good performance. We stop the algorithm whennone of the two optimizations can further decrease the objective; a local minimumis found. It also follows that our algorithm monotonically decreases the objectivefor its iteration sequence di,xi.6.2.4 Implementation detailsFor the objective in Eq. (6.3), we found that the parameter β = 1 delivers a goodtradeoff between sparsity and data fit. All results in this chapter are computed withthis setting. We have verified that other settings of β lead to quantitatively similarresults. For Algorithm 4, we have chosen the heuristic values ρd = 1/(100·max(j))and ρx = 1/(10 ·max(j)). This choice dampens variations in the optimization pathof the filters more than for the codes.1276.3 Analysis6.3.1 Complexity AnalysisThis section analyzes the complexity of the proposed optimization approach andcompares the theoretical runtime with alternative methods. For D being the numberof pixels for a single image in j,N being the number of images,K being the numberof kernels to learn, M being the size of the filter support, and P inner iterations (ofthe substeps in Algorithm 4), the computation cost is shown in Table 6.1.Method Cost (in flops)Zeiler et al. [220] PN · ( KD︸︷︷︸Conjugate gradient· KDM︸ ︷︷ ︸Spatial convolutions+ KD︸︷︷︸Shrinkage)Bristow et al. [238, 239] PN · ( K3D︸ ︷︷ ︸Linear systems+KD log(D)︸ ︷︷ ︸FFTs+ KD︸︷︷︸Shrinkage)Ours (K > N) KN2D + (P − 1)KND︸ ︷︷ ︸Linear systems+ PN · (KD log(D)︸ ︷︷ ︸FFTs+ KD︸︷︷︸Shrinkage)Ours (K ≤ N) K3D + (P − 1)K2D︸ ︷︷ ︸Linear systems+ PN · (KD log(D)︸ ︷︷ ︸FFTs+ KD︸︷︷︸Shrinkage)Table 6.1: Time complexity of our approach compared to other methods.We observe immediately that Bristow’s method has significantly better perfor-mance than Zeiler et al. when K < DM . Its dominating cost is the inversion ofthe D linear systems. Note that in contrast to spatial methods, Bristow’s method, aswell as ours, is independent of the filter size M .For the proposed method, we consider two cases: K > N and K ≤ N . In thefirst case (K > N ), we exploit the inversion trick as explained in Eq. (6.10). Here,each of the D matrices Xj is an N × K matrix. Thus, by using Eq. (6.10), wereduce the cost of the inverse from K3 to KN2. Since we cache the factorizations,this cost is only for one of the P local iterations. For the other (P − 1) iterations,the back-solves cost only KN (instead of K2 for the naive inversion).In the second case, when K ≤ N , we have the full cost of the Choleskyfactorization K3 of the D matrices Xj once per P iterations, but again for all other(P − 1) iterations, only the back-solve cost K2. Thus, by caching the factorizations,we are able to achieve a speedup of the linear systems by P1+(P−1)/K in this case.128For our setting of P = 10, even a moderate number of K = 100 filters leads to aspeedup of 9×. In the following, we show that not only the complexity per iterationdecreases, but the convergence improves as well.6.3.2 ConvergenceFor two datasets of different size, we plot the empirical convergence of the proposedalgorithm and compare it to the state of the art in Figure 6.2. In both cases we learnK = 100 filters. The first dataset is the fruit datasets [220] with N = 10 images.In this case, we have K > N . The proposed algorithm converges in 13 iterationswhereas [238, 239] has a slowly decreasing objective and was fully converged afterabout 300 iterations. To be able to compare objective values, all compared methodshere are implemented using circular convolution, with edge-tapering applied tomake the convolution circular. Only the central part of the convolution withoutTime in seconds0 2000 4000 6000 8000 10000 12000 140002220181614121024LogObjective2220181614121024Iterations0 20 40 60 80 100House Dataset (100 images)Fruit Dataset (10 images)LogObjectiveIterations2220181614121080 20 40 60 80 100Time in seconds2220181614121080 200 400 600 800 1000 1200Bristow et al., optimizedproposed methodBristow et al., originalFigure 6.2: Convergence for two datasets (top N = 10 images, bottom N =100). The proposed algorithm converges to a better solution in less timethan competing methods.129a padding of kernel size M is included in the objective values. Note that thesolution of our algorithm is not only found in fewer iterations, but it also convergesto a solution that has a lower objective. The objective combines the data fittingterm as well as the sparsity term. We used the same sparsity weighting (from[238, 239]) for the comparison, although the same qualitative convergence wasmeasured for different weights. We also plot the convergence in an absolute timescale demonstrating the achieved speedup. We have further compared our methodto [238, 239] with our factorization strategy from the first line in Eq (6.10). Whilethis strategy improves the speed of their method since the inverses in each iterationare done more efficiently, the performance does not reach that of our method.For the second dataset, we have randomly sampled N = 100 images of size100× 100 from a city scene. In this case K ≤ N and N,K are large. Thus, solvingthe linear systems becomes the dominating cost (see Table 6.1) and the benefit ofcaching (which cannot be done in [238, 239]) becomes even more apparent. Hence,especially for large datasets, our method converges faster and usually finds a betteroptimum. Note that we are already comparing to [238, 239] with our improvedfactorization strategy from the first line in Eq (6.10).We also compare convolutional coding to patch-based sparse coding. Oneof the main challenges are large datasets, for which most patch-based methodsbecome infeasible. Consider learning from 10 images with 1000 × 1000 pixelseach. Incorporating all patches into the learning requires 10 million training patches.K-SVD, for example, could not handle that much data on our computer, so we ran acomparison for 10 100× 100 pixel images. Using all patches in the training set and100 iterations, K-SVD took 13.1 hours to converge whereas our method only tookabout 4.5 minutes (both on an Intel Xeon E5/Core i7 machine with 132 GB RAM).In addition to the convergence analysis above, we also show the evolution of thefilters throughout the learning process in Figure 6.3 for the fruit dataset. Initially,the filters seem random and then turn into Gaussian blob-like structures after a fewiterations. After about 8 iterations, the observed structures are very similar to thosefrequently occurring in patch-based dictionary learning methods, whereas the filterseventually converge to Gabor-like shapes.130Figure 6.3: Visualization of filters learned from fruit dataset after 1, 7, 8, and13 iterations. The evolution goes from random patterns over Gaussianblob-like structures and eventually converges to filters that bear similarityto Gabor patches.6.4 LearningWe trained our filters on the fruit and city datasets [220] with local contrast normal-ization applied. Figure 6.4 shows the resulting filters after convergence (ours after13 iterations, Bristow after 300 iterations). Although the filters look similar at firstglance, our results contain fewer dataset-specific features, which makes them moregeneral as we demonstrate in Section Boundary ConditionsEq. 6.3 is an elegant formulation that allows for general boundary conditions to beintegrated into the learning and also the reconstruction steps. Usually, we maskout the boundary so that it does not contribute to the objective function. As seen inFigure 6.5, the boundary is still reconstructed via extrapolated data fitting — linesand other high-frequency structures continue across the image boundary but quicklyfall off thereafter.6.4.2 Learning from Sparse DataThe mixing matrix M in Eq. 6.3 not only allows for general boundary constraintsbut for any type of linear operator to be applied. In Figure 6.6, we demonstratethat this can be used for learning filters from incomplete data. We subsequentlyuse the filters learned from incomplete measurements of an image to predict themissing parts of that image and evaluate the achieved peak signal-to-noise ratio for131Figure 6.4: Filters learned on the city and fruit datasets [220]. We showthumbnails of the datasets along with filters learned with the proposedmethod (left) and with that described in [238, 239]. In both cases, ourmethod finds a local optimum with an objective that is 3 − 4× lowerthan comparable methods.varying levels of incompleteness. As is expected, the quality of the reconstructionsdrops with a decreasing amount of observations. Nevertheless, learning filters fromincomplete measurements may be interesting for many applications (e.g. adaptivefilter learning for demosaicking), but is currently not supported by any existing(efficient) method for convolutional sparse coding. Figure 6.6 shows that even forsubsampling rates of up to 50%, the learned filters and quality of the inpaintedreconstructions are reasonably good.132Figure 6.5: The proposed formulation allows us to use non-circular boundaryconditions, as demonstrated for three examples. In practice, the regionsoutside the image boundary (red) are extrapolated but do not affect theobjective.6.5 ReconstructionIn this section, we evaluate the proposed algorithm for reconstructing signals whenthe filters are either already learned or known a priori. Reconstruction problems aresolved for all filters with the code update step from Algorithm Validation of ReconstructionFigure 6.7 shows an example image reconstructed from incomplete measurements.This is an inpainting problem, which we test with filters learned from the fruitdatabase. We compare reconstruction quality using filters learned with the proposedmethod and filters learned with the method described in [238, 239]. Not only do ourfilters lead to a better reconstruction around edges and sharp features, but our frame-work allows us to solve this inpainting problem without sacrificing performancein the CSC solver, which was not possible in previous work due to the “Fouriertrick”. See Section A.5 for an evaluation using dataset of 22 images. With a singleexception, our method outperforms previous work for all test images.6.5.2 Non-Normalized DataIn most results, we show contrast-normalized examples. However, non-normalizeddata can be intuitively handled in two ways. Either the filters are directly learnedfrom non-normalized training data or a low-frequency term for the DC offset is133Subsampling 90 % 70 % 50 % 30 % 10 %PSNR 23.2 dB 21.3 dB 19.2 dB 16.5 dB 14.8 dBFigure 6.6: Learning from sparse observations. The top two rows show exam-ples for 50% sampling. The original images are shown in the left column,randomly subsampled observations in the center, and reconstructionsof the entire image using filters learned from these sparse observationson the right. Bottom table and filters: evaluation of the learned filtersfor different subsampling rate of the data. We show 16 out of the total100 learned filters for each sampling rate above the table. One can seethat for less than 50% sampling, the reconstruction quality significantlydrops due decreasing filter quality.added to the set of filter after learning. Typical Gabor-like filters with different DCoffsets are observed for the former approach. The latter approach can be interpretedas adding a smoothness prior in the form of the low-frequency term rather thanrigorously enforcing sparsity. A reconstruction then has to jointly solve for thefilter coefficients as well as the low-frequency term, which is shown in Figure 6.8.We have also compared this approach with a state-of-the-art compressive sensingmethod from Dong et al. [244]. Using the image dataset from Section A.6 (50%134Figure 6.7: Inpainting example showing: original image (left), randomlysampled incomplete observations (center left), reconstruction with fil-ters learned with the proposed algorithm (center right), and filtersfrom [238, 239] (right). In addition to this example, we evaluate thereconstruction quality for a larger dataset in Section A.6.Figure 6.8: Inpainting non-normalized data: randomly-subsampled, incom-plete observations (left and center right), reconstruction with the pro-posed filters (center left and right).random sampling with non-normalized data) their method achieves a mean PSNRof 23.5 dB while ours achieves 29.0 dB. This preliminary result suggests that furtherinvestigation of sparse convolutional coding might lead to many fruitful applicationseven outside of the low-level feature learning.6.5.3 Reconstruction with Known BasisWe also evaluate the proposed method for fitting convolutional models to sparse andnoisy measurements when the filters are known a priori, such as the physically moti-135vated filters presented in Chapter 4. In general, physically-motivated convolutionalsparse coding may have a wide range of applications in radar, sonar, ultrasoundand seismic imaging. Figure 6.9 demonstrates reconstructions of sparsely-sampleddata in 1D and 2D. Filters are sampled from a Gaussian distribution and the mea-surements of the 2D example are further corrupted by i.i.d. Gaussian noise with astandard deviation of σ = 0.01. This experiment demonstrates the utility of CSC tonon-feature-learning-type applications, such as general Gaussian mixture models.The proposed method is capable of recovering the latent signal with a high qualityfrom only 6.25% of the samples.40 60 80 100 120 140 16000. indicesIntensity0 20 40 60 80 100 120 140 160 180 20000. Basis ReconstructionReconstructionSparse SamplesNoisy OriginalSampledOriginalReconstructionSignal indicesFigure 6.9: Reconstructions for known convolutional basis. The filters inthis example are sampled from 1D Gaussians (top left) and used to fita convolution model to a sparse set of samples (top right). The sameexperiment is performed in 2D, where a target signal is corrupted bynoise (bottom left), subsampled (bottom center), and then reconstructedfrom only 6.25% of the noisy measurements (bottom right).1366.6 DiscussionIn this chapter, we have described a method for learning and reconstruction usingconvolutional sparse codes. By exploiting convolutional structure in the unknowns,Convolutional Sparse Coding has the potential to replace or supplement popularpatch-based learning and reconstruction methods. The proposed method is applica-ble to a wide range of computer vision problems, such as feature learning, denoising,inpainting, and demosaicking.At its core, the method relies on a generalized formulation of CSC, whichallows to tackle a variety of different problems with convolutional structure in theunknowns. This general formulation enables proper handling of boundary conditionsby treating boundaries as unknowns. It allows for feature learning from incompleteobservations, or any type of linear operator applied to the estimation. Based on thegeneral CSC formulation, an efficient proximal algorithm is derived, which solvesfor both the convolutional codes and sparse feature maps in an alternating coordinatedescent scheme. The resulting method is faster than the state-of-the-art and findsbetter solutions.Although already faster than existing methods, our formulation is inherentlyparallel and the runtime could further be significantly improved by an efficientGPU implementation. An interesting avenue for future research is to evaluatelearning and reconstructing features in higher-dimensional problems, such as 3Dhyperspectral image data [245] or 4D light fields [246]. For higher-dimensionalproblems the proposed approach is limited by its memory needs. Full resolutionfeature maps and kernels in the frequency domain are required. Hence, the memoryrequirements scale linear with the number of filters K, but exponentially with thedimensionality. Besides higher-dimensional problems, it would be interesting toapply the proposed framework to more complex hierarchical convolutional structuresand networks [220, 233] that could be particularly useful for high-level computervision applications, such as recognition.137Chapter 7Doppler Velocity ImagingIn the previous chapters, we have demonstrated that the temporal dimension of lighttransport, i.e. transient images, can be extracted from CIS TOF measurements. Thisnew temporal dimension has enabled us to image non directly visible objects fromindirect reflections, and imaging in scattering media.This chapter demonstrates that we can extract a further novel imaging modalityfrom CIS TOF measurements: per-pixel radial velocity measurement. The proposedtechnique exploits the Doppler effect of objects in motion, which shifts the temporalillumination frequency before it reaches the camera. Relying again on the temporallyconvolutional imaging formation of CIS, it is possible to code illumination andmodulation frequencies of the TOF camera, such that object velocities directly mapto measured pixel intensities.Furthermore, we show that a slight modification of our imaging system allowsfor color, depth, and velocity information to be captured simultaneously. Combiningthe optical flow computed on the RGB frames with the measured metric radialvelocity allows us to further estimate the full 3D metric velocity field of the scene.7.1 IntroductionPioneers of photography, including Eadweard Muybridge and Harold “Doc” Edger-ton, advanced imaging technology to reveal otherwise invisible motions of high-speed events. Today, understanding the motion of objects in complex scenes is at138Figure 7.1: A new computational imaging system (left) that allows for met-ric radial velocity information to be captured instantaneously for eachpixel (center row). The Doppler effect of objects in motion is detectedas a frequency shift of temporally modulated illumination. By captur-ing a few coded CIS measurements and conventional RGB frames, wedemonstrate that color, velocity, and depth information can be recordedsimultaneously. For the captured example, the left-most frame showsa static object (velocity map is constant), which is then moved towards(positive radial velocity) or away from (negative velocity) the camera.the core of computer vision, with a wide range of applications in object tracking,segmentation, recognition, motion deblurring, navigation of autonomous vehicles,and defense. Usually, object motion or motion parallax are estimated via opticalflow [247]: recognizable features are tracked across multiple video frames. Thecomputed flow field provides the basis for many computer vision algorithms, in-cluding depth estimation. Unfortunately, optical flow is computationally expensive,fails for untextured scenes that do not contain good features to track, and it onlymeasures 2D lateral motion perpendicular to the camera’s line of sight. Further, theunit of optical flow is pixels; metric velocities cannot be estimated unless depthinformation of the scene is also available. For the particular application of depthestimation, many limitations of optical flow estimation can be overcome usingactive illumination, as done by most structured illumination and TOF cameras. Withthe emergence of RGB-D imaging, for example facilitated by Microsoft® Kinect®,complex and untextured 3D scenes can be tracked by analyzing both color anddepth, resulting in richer visual data that is useful for many applications.139In this chapter, we introduce a new approach to directly imaging radial objectvelocity. Our approach analyzes the Doppler effect in CIS TOF cameras: objectmotion towards or away from the camera shifts the temporal illumination frequencybefore it is recorded. The Doppler effect has been reviewed in detail in Section 2.6.6.While conventional TOF cameras encode phase information into intensity, instead,we propose here Doppler Time-of-Flight (D-TOF) as a new imaging mode, wherebythe change of illumination frequency (corresponding to radial object velocity) isdirectly encoded into the measured intensity. The required camera hardware is thesame as for conventional TOF imaging, but illumination and modulation frequenciesare carefully designed. Our approach is a full-field imaging method, meaning that itdoes not require the scene to be sequentially scanned unlike most existing Dopplerradar or Lidar systems that only capture a single scene point at a time. Furthermore,we can combine depth and velocity imaging using either two TOF cameras or usingthe same device by alternating the modulation frequencies between successive videoframes; color images can be obtained with a conventional camera.The proposed technique offers a new imaging modality that is ideally suitedfor fast motion. Note that our method is fundamentally different from Optical flow,which is traditionally used to estimate apparent motion in dynamic scenes; seeSection 2.6.6. Optical flow from a single camera is restricted to estimating lateralmotion whereas the Doppler is observed only for radial motion towards or awayfrom the camera. In fact, it is a complementary technique: together, optical flowon RGB frames and D-TOF allow for the metric 3D velocity field to be estimated,which is otherwise not easily possible. In general, however, D-TOF is independent ofthe RGB flow and works robustly for cases where optical flow often fails, includinguntextured scenes and extremely high object velocities. We also discuss a modefor simultaneous range and velocity imaging. As with standard TOF imaging, ourmethod requires a few subframes to be captured with different modulation signals.In our prototype system, we use rapid time-sequential acquisition of the requiredsubframes for simplicity, which is a common strategy for regular TOF imaging.However, appropriate hardware, such as the multi-camera extension which we willdiscuss in Section 7.7, the method can be implemented as a true snapshot imagingapproach.In the following, we first derive a mathematical framework for velocity esti-140mation with CIS TOF cameras, implement a prototype CIS imaging system, andvalidate the proposed model extensively in simulation and with the prototype. Theimaging system is evaluated using a range of different types of motion, for texturedand untextured surfaces as well as indoors and under strong outdoor ambient illumi-nation. Finally, we demonstrate that the velocities measured with our system can becombined with RGB flow, allowing for the metric 3D velocity field to be estimatedon a per-pixel basis.7.2 Dynamic CIS Image FormationIn Section 2.6, we have explained the working principle of CIS TOF imaging forstatic scenes. In this section, we analyze how the image formation behaves forobjects in motion.In fact, the conventional CIS image formation model breaks down when objectsof interest move with a non-negligible radial velocity. In this case, the illuminationfrequency undergoes a Doppler shift when reflected from an object in motion; seeagain Section 2.6.6.The illumination arriving at the sensor is now frequency-shifted to ωs = ωg +∆ω, where the change in temporal frequency ∆ω depends on the radial objectvelocity and the illumination frequency, as defined in the definition of the Dopplershift in Eq. (2.29) from Section 2.6.6. For intuition, we consider the case of anapproximately constant velocity v throughout the exposure time. If we continue toassume a homodyne setting with ωf = ωg = ω, our measurement is still heterodyne(due to the frequency shift ∆ω). In particular, Eq. (2.28) from the related workSection 2.6.1 on CIS heterodyning now becomes:bθ(ρ) ≈ a2cos(−∆ωρ− φ+ θ). (7.1)Note that this equation is now dependent on the time of measurement. Unfortunately,the introduced temporal intensity variation makes it more difficult to estimate phaseand therefore also depth. In audio signal processing, this time-dependent low-frequency artifact is known as a beating pattern. We illustrate this in Figure 7.2.The phase estimate from Eq. (2.27) becomes then Eq. (7.2). For simplicity, we141Figure 7.2: Depth imaging. For static scenes, measurements are unambiguous:different phase shifts result in unique intensity measurements (left). Fordynamic scenes, the Doppler shift results in a low-frequency beatingpattern that makes measured intensities ambiguous, and hence preventsreliable depth estimation (right).sample here only N = 2 phases with θi = i · 2pi/N for i ∈ {0, . . . , N − 1}.φest(ρ) = − atan(bpi/2(ρ)b0(ρ))+ ∆ωρ, (7.2)where the distortion ∆ωρ linearly depends on the (unknown) object velocity. Notethat, in practice, the estimated phase for moving objects corresponds to its averagethroughout the exposure.To summarize, in the homodyne setup, where the frequency of the light sourceand the frequency of the camera reference signal are identical, the Doppler shiftintroduced by moving objects results in mismatched frequencies on the imagesensor. This situation is a heterodyne measurement. However, in heterodyne rangeimaging, as described in Section 2.6.1, the frequency mismatch is deliberatelyintroduced in the sensor modulation. A major limitation of heterodyne ranging isthat multiple (> 2) measurements have to be captured to reliably estimate phase anddepth. Since only low frequency beats can be detected reliably, a significant amountof time needs to pass between the two measurements for robust phase estimation.142For moving objects, the necessity to capture multiple images would place severeconstraints on the velocity. To facilitate reliable velocity estimation, we derive a newcomputational TOF imaging methodology in the following section. Inspired by thegeneral concept of Orthogonal Frequency-Division Multiplexing (OFDM) e.g. [248],D-TOF uses illumination and on-sensor modulation frequencies that are orthogonalwithin the exposure time of the camera. Using this choice of frequencies alongwith a newly-devised reconstruction method, we demonstrate the first approach toper-pixel radial velocity estimation.7.3 Doppler Velocity ImagingAs illustrated in Figure 7.2 (right), the low-frequency beating pattern created by theDoppler effect makes it difficult or impossible to capture reliable Doppler frequencyand phase information. Consider the following example: a road cyclist travels at aspeed of v = 10ms towards the camera. For an illumination frequency of 50 MHz(i.e. ωg = 50 · 106 · 2pi/s), the observed Doppler shift is only∆ω =vcωg =10ms300 · 106ms· 50 · 106 2pis≈ 1.672pis(7.3)A frequency shift of only 1.67 Hz may seem small enough to be safely ignored.However, we show in the following that even such a minute change contains valuableinformation that can be used for velocity estimation.7.3.1 Velocity Imaging via Orthogonal FrequenciesInspired by multiplexing techniques in digital communication, we devise an uncon-ventional way to extract velocity information from the small Doppler shift observedby a TOF camera. We can interpret the camera system as a communication channeland consider the illumination a carrier signal. The carrier is optically modified bymoving objects — we observe a change in carrier amplitude, phase, and frequency.The secondary modulation in the sensor followed by a low-pass filter of the exposuretime corresponds to the demodulation process in communication. Conventionalcommunication channels use orthogonal frequencies; any inter-carrier interference(which could be caused by a frequency drift) is a polluting signal (see e.g. [248]).143Figure 7.3: Velocity imaging. Illumination ωg and modulation ωf frequenciesare designed to be orthogonal within the exposure time T . For staticscenes (left), this particular choice of frequencies will integrate to zero.The Doppler shift of moving scenes destroys the orthogonality andresults in an approximately linear relationship between radial velocityand recorded intensity (right).For Doppler TOF, we deliberately design the frequencies in the receiver and trans-mitter to be orthogonal, such that the (usually polluting) inter-carrier interferencecarries the desired velocity information.An illustration of our orthogonal frequency Doppler TOF is shown in Figure 7.3.For the application of direct velocity imaging, we would like to ensure that themeasured signal for a stationary object is zero (or a constant intensity offset); seeleft of Figure 7.3. We can achieve this by operating the TOF camera in heterodynemode with two orthogonal frequencies ωg and ωf . While any two sine waves withfrequencies ωg 6= ωf will be orthogonal for sufficiently long integration times, thisis not the case for finite integrals (exposures) in the presence of low frequencybeating patterns. Designing both frequencies to be orthogonal is done by settingωg = k2piTand ωf = l2piTwith k, l ∈ N, k 6= l, (7.4)i.e. having the exposure time T be an integer multiple of the period of both signals.144It is then easy to show from Eq. (2.25) thatbθ(ρ) =∫ ρ+Tρp(t) dt = 0 (7.5)for stationary objects (ωs = ωg). In practice, we set l = k + 1 and we setk = ωgT/2pi, which depends on T and the desired frequency ωg.Given these two orthogonal frequencies we now use the inter-carrier interfer-ence to extract valuable information about the Doppler shift. We achieve this bycomputing the ratio of a heterodyne measurement and a homodyne measurement.Using only the low-frequency terms from Eq. (2.25), this ratio can be expressed asEq. (7.6). Without loss of generality, we assume an exposure interval of [0 . . . T ].r =∫ T0 cos(ωf t+ θ) · (a cos((ωg + ∆ω)t+ φ) + s0) dt∫ T0 cos(ωgt+ θ) · (a cos((ωg + ∆ω)t+ φ) + s0) dt≈∫ T0a2 cos((ωf − ωg −∆ω)t+ θ − φ) dt∫ T0a2 cos(−∆ωt+ θ − φ) dt=a2(ωf−ωg−∆ω) [sin((ωf − ωg)t−∆ωt+ θ − φ)]T0a−2∆ω [−∆ωt+ θ − φ)]T0=−∆ωωf − ωg −∆ω ·sin((ωf − ωg)T −∆ωT + θ − φ)− sin(θ − φ)sin(−∆ωT + θ − φ)− sin(θ − φ)︸ ︷︷ ︸=1≈ −∆ωωf − ωg(7.6)since (ωf −ωg)T = (k− l) 2pi, and ∆ω  ωf −ωg. Note that we are only varyingthe sensor modulation frequency ωf here for the two measurements. Hence, thephases φ in numerator and denominator are identical, and the same offsets θ areselected in the homodyne and heterodyne measurement.Figure 7.4 shows the model derived here. On the left side, we see the full modelwithout any approximations (i.e. without neglecting high frequency componentsin Eq. (7.6)). Although the image formation is nonlinear, for a relative largerange of metric velocities it is very well approximated (Figure 7.4, top right)by our linear model (Eq. (7.6)). We experimentally verify the model using our145Figure 7.4: Simulated and measured intensities for a range of different veloci-ties. Although the mapping from radial velocity to measured intensityis generally nonlinear (top left), throughout a large range of velocitiesthe conversion is approximately linear (top right). We verify the pre-dicted mapping using our prototype camera (bottom). These particularmeasurements were captured with a static scene, and acquired with amodulation frequency of ωf = 60 MHz and an illumination frequencyof ωg = 60 MHz + 1 KHz + ∆ω. Thus, the Doppler shift for an objectmoving at a specific velocity was programmed into the illuminationfrequency for this particular experiment.camera prototype (Figure 7.4, bottom). With known, orthogonal illuminationand modulation frequencies ωg, ωf , it is therefore straightforward to compute theDoppler shift ∆ω from Eq. (7.6). The ratio image r can be interpreted as a directmeasurement of the instantaneous per-pixel radial velocity.We note that this approach still requires two measurements: one heterodyneimage and one homodyne image. There are several possible solutions for eitheracquiring these truly simultaneously, or they can be acquired in quick succession.146For instantaneous measurements, two synchronized TOF sensors can be mounted ina co-axial setup; one of the sensors is modulated with the same frequency as the lightsource (ωg), while the other uses a slightly different frequency ωf 6= ωg. However,this approach requires accurate synchronization and control over the modulationwaveforms of each camera, which is challenging for off-the-shelf camera systems.Nonetheless, we describe such a hardware system as an extension at the end of thischapter in Section 7.7.Instead of using two distinct sensors, it would also be possible to multiplexpixels with two different modulation frequencies onto the same image sensor, eitherin alternating scanlines or in a checkerboard pattern. This concept is similar in spiritto techniques that have been proposed for HDR cameras [249, 250].A third possibility is to rapidly alternate between two modulation frequenciesusing a single TOF camera. In this case, the measurements are not truly instanta-neous, and alignment problems may occur for very fast motions. However, thetwo measurements can be taken immediately after each other, as fast as the camerahardware allows, e.g. at 30 or 60 Hz. We follow this approach as it only requires asingle TOF camera and is therefore easy to realize. Note that, similar to heterodynedepth estimation [114], the Doppler shift can also be estimated directly from the low-frequency beating pattern, but at the cost of requiring multiple measurements thatare much more widely spaced in time (hence not suitable for velocity estimation).Finally, we note that the model from Eq. 7.6 only holds for sinusoidal modulationfunctions. If other periodic signals are being used, additional harmonic frequencycomponents are introduced, which distort the measurements for both stationary andmoving targets. However, these offsets are systematic and can be calibrated for aspecific TOF camera/lights source combination (see Section 7.4 and Section A.7).7.3.2 Simultaneous Range and VelocityIn many applications it may be useful to obtain both velocity and range measure-ments at the same time. As in standard TOF imaging, this can be achieved bycapturing a second homodyne measurement with the phase offset by pi/2. Simulta-neous range and velocity imaging therefore requires a total of three measurements:a heterodyne image with θ = 0, a homodyne image with θ = 0, and a homodyne147image with θ = pi/2.As discussed in Section 7.2, motion introduces a velocity-dependent distortion∆ωρ of the depth measurement (Eq. (7.2)). However, since the distortion linearlydepends on the Doppler shift ∆ω, which is known from the velocity estimationstep (Eq. (7.6)), we can now correctly estimate the phase delay (and hence thedepth) from Eq. Equation 7.2. This only requires an additional calibration step toobtain ∆ωρ for a specific velocity, which corresponds to estimating the time offsetρ between the start of the exposure time and the reference time for signal generationin the camera and light source.As mentioned, simultaneous velocity and range imaging requires three distinctmeasurements. We note that the illumination signal is the same for all three mea-surements, only the reference signal for the camera changes. As in the case ofvelocity-only imaging, this means that all three measurements can potentially beacquired at the same time using either multiple sensors with a shared optical axis, ora special sensor design with interleaved pixels. If neither option is available, rapidframe-sequential imaging is also possible.7.4 ImplementationHardware For all physical experiments, we use the experimental TOF CIS camerasystem from Chapter 4 and Chapter 5. As described in these chapters, our systemcomprises a custom RF modulated light source and a demodulation camera basedon the PMD Technologies PhotonICs 19k-S3 sensor. The light source is again anarray of 650 nm laser diodes.We run all of the presented results with 30 MHz. The modulation signals arenearly sinusoidal, but contain multiple low-amplitude harmonic components. Toavoid systematic errors in depth and velocity estimation, these components must becalibrated as described in the following.Correcting for Higher-order Harmonics Our camera prototype has the drawbackthat the periodic modulation functions are not perfectly sinusoidal, although they arevery close. In addition to the fundamental frequency, this introduces higher-order148150 MHz0 1 2 3 4 5Distance [m]30 MHz0 1 2 3 4 5Distance [m]80 MHz0 1 2 3 4 5Distance [m]0.04-0.08-0.0400IntensityFigure 7.5: Depth-dependent offset introduced by higher-order frequency com-ponents for a range of modulation frequencies. These offsets are cal-ibrated in a one-time offline process and then used to correct the rawphase measurements on a per-pixel basis.harmonic components to the modulation signal. Please refer to Section A.2 andSection A.7 for a detailed derivation of the image formation in these conditions.Unfortunately, the higher-order components are generally not orthogonal, thus theycan cause a phase-dependent offset. We calibrate this offset for different modulationfrequencies and phase shifts θ using a static target. The depth-dependent offsets areplotted for different modulation frequencies in Figure 7.5.This offset is calibrated in an offline process and raw phase measurements canbe corrected digitally using a lookup table. Note that for relatively low modulationfrequencies, such as 30 MHz, we find a fairly large depth range (around 1 m) tobe almost independent of this offset. In practice, it is therefore relatively easy toremove the higher-order frequency components.Calibrating Phase Response As is standard practice in time-of-flight cameras, wecalibrate the physical intensity response for different phase shifts φ in an offlinecalibration. Following [115], we measure the physical intensity response for aphase sweep of the illumination frequency and fit a fifth-order polynomial to themeasurements. This is used as a lookup table for converting phase to depth ratherthan solving Eq. (2.27) directly. With our prototype, we measure a notable zeroth-order component of the fitted polynomial, corresponding to fixed pattern phasenoise. This is easily corrected with the lookup table. Any other illumination-specificterms, for example introduced by the baseline between camera and light source, are149automatically calibrated with the described procedure and do not require additionalprocessing.Verification of Calibration Procedure The two calibration procedures describedabove are performed for all spatial locations on the sensor independently. Toverify our calibration routines, we image a static target and apply a frequency andphase sweep to the modulation function, simulating objects at different velocitiesand depths. The results shown in Figure 7.4 (left) demonstrate that the measuredintensities for a constant phase but varying Doppler shift follow the model derivedin the Section 7.3. Other than a small amount of noise, which is mostly due to arelatively low signal-to-noise ratio, the curve is linear and behaves as predicted.In Figure 7.6 (left), we verify experimental measurements for a range of differentphase offsets in the modulation frequency. This simulates objects at various depths,as indicated in the legend. Finally, we also test the velocity-dependent behavior fora range of different pixels over the sensor location and show results in Figure 7.6-20 -15 -10 -5 0 5 10 15 20-5-4-3-2-1012345xc10-3Responsec0.62cmResponsec0.75cmResponsec0.87cmResponsec1.00cmResponsec1.12cmResponsec1.25cm-20 -15 -10 -5 0 5 10 15 20-6-4-20246xc10MeasuredcIntensitiescforcRangecofcDifferentcDistances RangecofcDifferentcPixelcLocationsVelocitycinc[m/s] Velocitycinc[m/s]Figure 7.6: Experimental verification of the imaging system for varying objectvelocities and depths (left) as well as velocity-dependent behavior fora range of different pixel locations on the sensor (right). All of thisdata is captured using a large planar target perpendicular to the cameraand sweeping the illumination frequency (to simulate different Dopplershifts) and phase (to simulate different object distances).1503 3 .5 4 4 .5 5 5 .5 633 .544 .555 .56Experimental]SetupRecorded]Audio-110NormalizedAmplitude0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6[s]VelocityGround]Truth]from]Audio]][m/s]AudioD-ToFMeasured][m/s]Raw]Homodyne]Measurement Raw]Heterodyne]MeasurementFigure 7.7: Experimental validation of velocity estimation using a fan withadjustable rotation speed (three settings). We measure the ground truthvelocity of the rotating blades (top left) by analyzing audio recordings(top, lower left). The top right plot shows the velocity measured byD-TOF compared to the ground truth for a varying rotation speed. As thespeed becomes larger, estimation errors increase to a maximum of about0.2 m/s. The bottom row shows the unprocessed full-field measurementsof the homodyne (left) and the heterodyne (right) frequency setting withthe pixel indicated for which we plotted the velocities on the top right.(right). The remaining variance over pixel locations and phases is minimal.Figure 7.7 shows another experiment that we used to verify the accuracy ofour prototype D-TOF camera. In this example, we adjusted the speed of a rotatingfan and imaged its blades such that, throughout the time it takes for a single bladeto move across a pixel, forward motion is observed by that pixel. The exposuretime of the TOF camera was set to 1.5 ms and the fan was captured from a frontalperspective (raw homodyne and heterodyne measurements shown in Figure 7.7bottom). We manually measured the slope of the fan blades, which is constant over151the entire blades. The radius of the plotted position was measured, allowing us tocalculate the “ground truth” velocity when the rotation speed of the fan is known.Since the exact rotation speed is not actually known, we measure it by mountinga small pin on one of the blades and mounting a piece of flexible plastic in frontof the fan, such that the rotating pin strikes the plastic exactly once per revolution,creating a distinct sound. We record the sound (sampled at 44 KHz) of this setup toestimate the ground truth velocity of the fan blades, observed by one pixel, which iscompared with the corresponding D-TOF estimate (Figure 7.7, top right). For thisexperiment, the estimation error is always below 0.2 m/s. Errors are mainly due tothe low SNR of the measured Doppler-shifted signal.Subframe Alignment Although the required heterodyne and homodyne shots canbe captured simultaneously as shown later in Section 7.7, the hardware modificationsfor an alternating capture using a single-sensor require significantly less effort. Inthis case, since we are dealing with moving objects, the individual shots cannotbe assumed to be perfectly aligned. This results in velocity artifacts around edgesin the scene. We can mitigate, although not completely remove, these artifacts bycomputing a SIFT flow on the raw data and warping them to a reference frame.While not perfect, the SIFT flow delivered sufficiently good warps for most captures.Denoising With our system, we capture an extremely small frequency shift (inthe Hz range) relative to the modulation frequency (the MHz range). Additionally,the quantum efficiency of emerging time-of-flight sensors is still far from that ofmodern solid state sensors [251]. Therefore, the slight Doppler shift in our prototypeis only measured by a very low photon count in the measured signal. For suchlow photon counts the heteroskedastic normal approximation for the Poisson noisefrom Section 2.4.3 is not accurate anymore. Standard denoising methods assumingGaussian noise fail therefore in such low-light scenarios. We apply a binning-basednon-local means denoising strategy to all captured velocity maps (see Figure 7.8).Please see the Section A.8 for more details and denoising comparisons.152Figure 7.8: Velocity maps color-coded in grayscale. The maps computed fromraw measurements (top) are corrupted by Poisson noise. To accountfor this, we apply a binning-based non-local means-type denoiser to thereconstructed velocity images (bottom).7.5 ResultsWe show results captured with our prototype imaging system in Figures 7.1, 7.9,7.10, 7.12, 7.11, 7.13, 7.14. The results validate the proposed imaging system for avariety of challenging indoor and outdoor scenes. Color images are recorded withthe same exposure time as the time-of-flight camera. Most of the scenes have a slightred tint, because we work with eye-safe red illumination in the visible spectrum.Like current commercial TOF cameras, future implementations of this system wouldmost likely use invisible, near infrared wavelengths to encode velocity and depthFigure 7.9: Complex scene with ambient illumination and a large depth range.The velocity is robustly estimated within the range of the illumination(approx. 5m inside), even in outdoor settings.153Figure 7.10: This result shows a periodic motion of a hand along the opticalaxis. The static scene on the left results in no response of the sensor,whereas forward (center) and backward (right) movement result inpositive and negative responses, respectively.information. The reconstructed velocity maps are color-coded; absolute units areindicated in the color bars. As expected, static scenes result in a constant velocitymap whereas velocity is directly encoded in the measurements and subsequentlyreconstructed for each sensor pixel independently. In addition to the velocity maps,Figures 7.1, 7.12, 7.13, 7.14 also show the corresponding depth maps that can beestimated from an additional capture as well as the velocity maps (see Section 7.3.2).Figure 7.11: Even extremely fast motion, such as a bullets shot with a springairsoft gun, can be captured with our system. The airsoft gun is beingadvertised as shooting bullets with 99 m/s; we measure a radial velocityof 98.2 m/s (average of the peak pixels).154Figure 7.12: This result shows periodic motions in z for a textured object.Although the estimated velocity is mostly correct, shadows and darkscene parts are challenging for robust velocity estimation.The selection of scenes shows a wide range of motion types that can be re-constructed with the proposed method, but it also highlights several challenges ofD-TOF and TOF in general. D-TOF requires two frames to be captured, and theymust be aligned if recorded with a single camera. In some instances, such as theresults from Figure 7.10 and Figure 7.11, the alignment is challenging and anyerrors will propagate into the velocity maps, especially around depth-discontinuities.These artifacts can be mitigated by optimizing the camera firmware to minimizeswitching time between the subframes or by using two co-axial TOF cameras, asdemonstrated below in Section 7.7. Objects with dark albedos, as for exampleobserved in Figure 7.12, are challenging for any TOF method because only a smallamount of the coded illumination is reflected back to the camera. Similarly, shadowsare very challenging and often result in either no depth/velocity estimation or errors(sweater in Figure 7.9 and regions between fingers in Fig. 7.13). Whereas some of155Figure 7.13: We envision a wide range of applications for our technique,including gaming and human-computer interaction.these limitations can be overcome with better hardware, others are inherent to thetime-of-flight approach. Please see Section 7.8 for a more detailed discussion.7.6 Toward the 3D Velocity FieldOptical flow computed from conventional video sequences estimates the 2D projec-tion of the 3D flow field onto the image plane. The radial component is usually lost.Furthermore, optical flow is an ill-posed problem and may fail in many scenarios.Doppler TOF addresses two problems of optical flow: first, it can help in caseswhere optical flow fails either due to large displacements or missing scene struc-tures. Second, our technique also helps in cases where the optical flow estimationis successful; in this case, we can recover the 3D metric flow by combining metricradial velocity and the 2D optical pixel flow.156Figure 7.14: Physical props for gaming, such as ping pong balls fired withthis toy gun, could be tracked and enable new HCI techniques.Figure 7.15 shows an example scene where regular optical flow [252] as well asSIFT-flow [253] fail due to limited structure in the scene. Our method can success-fully capture the velocity of the objects and could also lead to a proper segmentationFigure 7.15: Failure case of optical flow for a moving, but untextured scene(left). Optical flow [252] and SIFT flow [253] for two succeeding colorframes are shown in the second and third column; the 2D flow vectorsare color-coded with the shown color wheel (insets). Both methodscannot recover the true 2D motion of the fan and wrongly segment thescene. Our orthogonal velocity estimate can resolve this problem andproperly segment the scene.157Figure 7.16: Towards 3D flow: when optical flow succeeds, the full 3D metricflow is uniquely estimated from both 2D pixel flow and the radial veloc-ity maps. The top images show a frame where optical flow computedreasonable estimates. The bottom shows full 3D velocity estimate fordifferent views. Note that the optical flow helps us to determine thatfan’s velocity is slightly rotated to the upper right, where the center ofrotation is located (bottom left).of the scene. Note that having additional depth estimates for conventional flowwould also only be of limited help since flat surfaces also do not deliver enoughfeatures for correspondence matching.Figure 7.16 shows a scene where the optical flow estimate is reasonable. Inthis case, the orthogonal component that our method captures completes the 2Dspatial flow estimates and uniquely determines the full metric 3D flow. Given theoptical flow estimates fx, fy for the horizontal and vertical image coordinates, onecan compute the metric velocity vectors vx = fx·ZF , fy =fx·ZF , where F is thefocal length of the lens and Z the corresponding depth estimate (see [254]). Inconjunction with the velocity estimate vz in the orthogonal direction along theoptical axis, the full 3D metric flow is V = (vx, vy, vz). An example is shown inFigure 7.16. Please note that the 3D flow field is only as reliable as the estimatedradial velocity and the RGB 2D flow. If one of them fails, so will the 3D flow.1587.7 Multi-Camera SystemPreviously, we have mentioned multi-camera CIS systems a few times as an ex-tension to the proposed method that could remove remaining artifacts and extendthe capabilities of our approach. In particular, two CIS cameras which share thesame optical path allow for simultaneous capture of the two required captures perframe, thus eliminating the need for alignment. Furthermore, three cameras, eachrepresenting a stereo-pair with both other cameras, would allow for the capture offull 3D metric flow without needing to solve the ill-posed optical flow estimationproblem. Stereo and multi-camera ToF arrays also have been shown to providebenefits for regular depth estimation [255–257].However, systems for dynamic scenes have not been demonstrated before sincesynchronizing the exposures as well as the modulation waveforms is not straight-forward. In this section, we present an extendable multi-camera TOF system thatallows for synchronization, custom waveforms for each light source and camera.We also show that the orthogonal frequency modulation from this chapter can beused to eliminate multi-camera interference for depth imaging. An overview ofthe entire system is illustrated in Figure 7.17. We build on the Texas Instruments’Multi7cameraSynchronization HostbcomputerMicrocontrollerb1MCUEARMbCortex7M8bPO7bitbRISCbcoreDirectbdigitalbSynthesizerb1DDSEAnalogbDevicesbAD99,9HighbSpeedbComparatorAnalogbDevicesbAD9,FPExposurebGatingTIbSNB8LVCFO0AbToFbcamerasbTIbOPT8O8FFunctionbGeneratorF,bHzNbP66bmVSPIUSBUSBFigure 7.17: System overview. The proposed multi-camera time-of-flightsystem is built around the TI OPT8241 camera developer kit (left).All connected cameras are driven in slave mode, such that an externalmicrocontroller (MCU) manages and synchronizes the sensor exposuresof all cameras. The DDS generates analog waveforms that are digitizedbefore being fed into the cameras. An additional exposure gating circuitguarantees that no residual signals are received unless exposing.159TOF camera development kit (CDK) OPT8241-CDK-EVM, or Tin Tin, and developexternal control electronics for signal generation and synchronization. Tin Tin is astate-of-the-art TOF camera with sensor resolution of 320× 240, providing externalsignal control as well as synchronization. In contrast to the previously discussedsystem from Chapters 3 and 5, which required complicated modifications to accessmodulation lines, this system has connections for external modulation signals. TinTin has a built-in laser light engine. At 850 nm, the diffused laser diodes offer afrequency of 12-80 MHz at 50% duty cycle. The on-board light engine can also bedisabled and replaced by a custom light source, such as a projector which couldprovided spatio-temporally illumination as in Section 3.6.The waveform-generation circuitry in our system is built around Direct DigitalSynthesis (DDS), which allows for precise waveform control in contrast to FPGA-based signal synthesis. We use an Analog Devices AD9959 four-channel DDSto generate sinusoidal waveforms with independent control over frequency, phase,and amplitude. Each of these four channels can control either a camera or lightsource. To digitize the analog DDS waveform, a clock fan-out buffer (AnalogDevices AD9513) is used as a high-speed comparator. An additional high speedbuffer chip (TI SNB8LVCFO0A) acts as an exposure gating mechanism, ensuringthat no residual signal is received by the sensor unless exposing. We use an ARMCortex M4 STM32F407VGT6 microcontroller to synchronize the exposures of allcameras and control the waveforms on the DDS. The microcontroller acts as themaster for all connected devices. The frequency and phase settings of each frame inevery channel can be controlled individually. Finally, the camera SDK allows toread out raw frames through the USB interface.Figure 7.18 shows different configurations of our setup. For example, we canrun the cameras in light field mode, where each sensor observes the scene fromdifferent perspectives or we can optically align them using a beam splitter so thatthey share the same perspective. Figure 7.19 shows a captured bullet in flightcaptured by the beam splitter setup. This is the same experiment as in Figure 7.11.We show the raw frames here to demonstrate our hardware setup. The homodyneand heterodyne frames showing the flying bullet are captured simultaneously anddo not contain displacements, which caused the streak artifact in Figure 7.11.160Figure 7.18: Multi-camera prototype setup. The left shows an array of sensorsobserving the scene from different perspectives. On the right twocameras are optically aligned with a beam splitter. They both sharethe same perspective and single light source, but each of the sensorsis modulated at different frequencies. In this mode, we can capturehomodyne and heterodyne frames simultaneously.Figure 7.19: Example from Figure 7.11, but using the beam splitter setupshown on the right of Figure 7.18. Homodyne and heterodyne framesof a static scene are shown on the (left) and of the same scene with abullet flying at about 99 m/s (center).1617.7.1 Multi-Source Interference CancellingThe proposed hardware system also allows to eliminate multi-source interference forclassical depth imaging. When multiple cameras illuminate the same scene, such asin Figure 7.18 on the left, then the temporally-coded illumination waveforms of eachcamera interfere with one another. Note, that this case differs significantly from themulti-path interference discussed in Section 2.6.5. Due to unsynchronized exposuresand slight variation in the illumination frequencies, the multi-source interferenceresults in a temporally beating measurement error. The frequency variation betweendifferent light sources is an inherent property of using independent sources which allhave their own crystal oscillator. Using only a single light source would eliminatethis problem (assuming synchronized exposures). In that case, however, parts of theimaged scene would never be illuminated, which defies the purpose of the differentperspectives captured by the sensor array.Previously, Castaneda et al. [255] proposed to capture all possible combinationsof activated light sources, Li et al. [256] proposed to capture and average more than100 frames to statistically mitigate multi-device interference. Neither option seemsfeasible for dynamic scenes. However, since the proposed hardware system allowsto synchronize all exposures, simply changing the modulation frequencies of eachcamera/source pair by a large enough amount removes multi-source interference;see also Figure 7.20 on the right.Figure 7.20: Multi-device interference. The left shows the reference phaseand frequency sweep response of a single camera. Two CIS cameras,each running in homodyne mode, interfere at one particular frequencywith one another (center left). Choosing an closely spaced orthogonalfrequency pair with a difference of a few hundred Hz eliminates multi-device interference (center right), achieving the same effect as choosingtwo frequencies with MHz difference (right).162This is easy to see recalling Eq. (2.25) from Section 2.6: If we consider acamera/source pair with large enough difference in the modulation frequency, thecontinuous signal p(t) at the sensor only becomes high frequency and averagesout in the integration. The smallest frequency component is exactly the differencebetween the light source and camera pair. Hence, selecting a large frequencydifference between all camera/source pairs ensures low interference, but uses theavailable modulation spectrum very inefficiently, thus limiting the number of devices.The orthogonal frequency selection that we have used previously to amplify thetiny Doppler shift, allows for an efficient usage of the modulation spectrum whilecompletely eliminating interference. In particular, we pick a frequency ωf =p2piT for sensor demodulation, as well as mutually orthogonal frequencies l =1 . . .M as ωl = (p+ l − 1) 2piT for each light source. For this particular choice ofillumination frequencies, the camera operates in homodyne mode with respect tothe first light source, but in (orthogonal) heterodyne mode with respect to all theother lights. The orthogonality eliminates interference for a given integration timeT , while allowing for closely spaced modulation frequencies. Figure 7.20 shows acalibration result for a modulation frequency of 22 MHz and 2 ms exposure time.The orthogonal frequencies are only separated by 508 Hz, but allow for perfectinterference canceling, as achieved by MHz shifts.We refer the reader to our publication [7] for a more detailed description of thepresented multi-camera system and further applications of it which goes beyondthe scope of this dissertation, such as high-speed range imaging, de-scattering ofdynamic scenes and by non-line-of-sight motion detection via frequency gating.7.8 DiscussionIn this chapter, we propose a new computational imaging modality that directlycaptures radial object velocity via Doppler Time-of-Flight Imaging. We demonstratea variety of experimental results for different types of settings. We have validatedour model in simulation and experimentally. We have also demonstrated the optionalcombination of footage captured using an RGB camera with the depth and velocityoutput of our coded TOF camera. Together, this data represents simultaneous per-pixel RGB, depth, and velocity estimates of a scene and allows for the 3D velocity163field to be estimated. Finally, we have presented a hardware extension to our systemenabling multi-camera/source capture.Note that D-TOF is complimentary to optical flow. It allows for the depth biasof xz-flow to be removed and enables recording of the metric 3D velocity field ofthe scene. However, if only radial velocity is required, our technique can also beused stand-alone, independent of optical flow.As a fundamentally new imaging modality, D-TOF implemented with our ex-perimental hardware has several limitations. Foremost, the resolution of the PMDsensor in our prototype is limited to 160× 120 pixels and the SNR of the measured,Doppler-shifted signal is low. Together, these limitations result in low-resolutionand noisy footage. We apply state-of-the-art denoising strategies which filter outmost of the noise but do produce artifacts in some cases. Furthermore, D-TOFrequires two frames to be acquired with different modulation frequencies. Currently,we capture these frames in sequence, which results in slight misalignment betweenthe frames observed as velocity artifacts around depth discontinuities. However,as already mentioned in Section 3.7, there is a clear path to addressing all of thesechallenges: low-cost TOF sensors higher resolutions and significantly improvednoise characteristics are already on the market. The multi-camera extension fromSection 7.7 uses sensors with QVGA resolution. Some consumer cameras, suchas the Microsoft® Kinect® 2, provide almost VGA resolution and also use highermodulation frequencies to increase the achievable depth resolution for range imag-ing. Note that increased sensor and illumination frequency would directly improvethe signal-to-noise ratio in our setup, because the Doppler effect is proportional tothe source frequency. With access to signal control of illumination and on-sensormodulation, D-TOF could be readily implemented on high-quality consumer TOFcameras. Nevertheless, D-TOF shares other limitations with TOF, including the needfor active illumination, limited range, and problematic processing in the presence ofstrong ambient illumination or objects with dark albedos, and global illuminationeffects.However, TOF cameras have entered the consumer market only a few years ago,but already transformed the way machines perceive the world. Human-computerinteraction, robotics and machine vision, navigation for autonomous vehicles, andmany other fundamental computer vision tasks have seen dramatic improvements164using these devices. With Doppler Time-of-Flight, we contribute a fundamentallynew imaging modality that could impact all of these applications. By amplifyingthe tiny Doppler shift through carefully coded illumination and sensor modulation,our approach, much like Chapter 3, “makes the invisible visible”.165Chapter 8Simple Lens ImagingUp until now, the methods presented in this dissertation have exploited the tempo-rally convolutional structure of the CIS measurements. In this chapter we investigatethe spatially convolutional structure of aberrations in imaging optics. Optical sys-tems and aberration theory, describing their imperfections, have been introduced inSection 2.3. Modern imaging optics are highly complex systems consisting of upto two dozen individual optical elements. This complexity is required in order tocompensate for all geometric and chromatic aberrations of a single lens, includinggeometric distortion, field curvature, wavelength-dependent blur, and color fringing.While traditional optical design attempts to minimize all of these aberrationsequally, we demonstrate in the following that chromatic aberrations, in fact, effi-ciently encode scene information. Exploiting the specific structure of chromaticaberrations allows us to realize radically simplified optical designs, using cheap,simple, and compact optics.We propose a Bayesian approach to color image deconvolution, which at itscore relies on a novel statistical prior for the correlation between different spectralbands. For chromatic aberrations that allow to focus in at least one channel, thisapproach allows to efficiently correct images captured through uncompensated,simple optics. Finally, we present a method for the calibration of all aberrations ofan optical system as a per-channel, spatially-varying PSF.166Figure 8.1: High-quality imagery through poorly performing lenses: Relyingon the specific structure of chromatic aberrations, we present a Bayesianreconstruction method for the removal of aberrations post-capture. Thisenables imaging using simple lenses that suffer from severe aberrations.From left to right: Camera with our lens system containing only asingle glass element (the plano-convex lens lying next to the camera),unprocessed input image, result estimated using the proposed approach.8.1 IntroductionOver the past decades, camera optics have become increasingly complex. Thelenses of modern single lens reflex (SLR) cameras may contain a dozen or moreindividual lens elements, which are used to optimize light efficiency of the opticalsystem while minimizing aberrations, i.e., non-linear deviations from an idealizedthin lens model. Optical systems and their aberrations have been described in detailin the related work part of this thesis. Recalling Section 2.3, aberrations includeeffects such as geometric distortions, chromatic aberration (wavelength-dependentfocal length), spherical aberration (focal length depends on the distance from theoptical axis), and coma (angular dependence on focus); see again Figure 2.5. Allsingle lens elements with spherical surfaces suffer from these artifacts, and as aresult cannot be used in high-resolution, high-quality photography. Instead, modernoptical systems feature a combination of different lens elements with the intentof canceling out aberrations. For example, an achromatic doublet is a compoundlens made from two glass types of different dispersion, i.e., their refractive indicesdepend on the wavelength of light differently. The result is a lens that is (in thefirst order) compensated for chromatic aberration, but still suffers from the otherartifacts mentioned above.Despite their better geometric imaging properties, modern lens designs are not167Figure 8.2: Patch-wise estimated PSFs for a single plano-convex lens at f/4.5.See Figure 2.6 for a single biconvex lens. The PSF estimation and non-blind deblurring in our method is done in patches to account for the PSFs’spatial variance.without disadvantages, including a significant impact on the cost and weight ofcamera objectives, as well as increased lens flare. In this chapter, we investigatea radically different approach to high-quality photography: instead of ever morecomplex optics, we propose to revisit much simpler optics, such as plano-convexlenses, used for hundreds of years (see, e.g., [258]), while correcting for the ensuingaberrations computationally.The aberrations introduces by these simple lens designs follow a specific struc-ture. Figure 2.6 and Figure 8.2 shows the variation of the PSF over the image planefor two simple lens elements. We can make several observations: The blur is highlyspatially varying, ranging from disc-like structures (spherical aberration) with di-ameters of 50 pixels and more to elongated streaks (coma and astigmatism). Wecan address this problem by subdividing the image into tiles over which we assumea constant PSF, as discussed in Section 2.3.3. The blur is also highly wavelength-dependent (chromatic aberration). This results in objectionable color fringing inthe image. Note that these chromatic aberrations are much more severe than forpartially corrected optics. For example, Kang [259] proposes a method specifically168for removing color fringing in images. However, this method is based on edgedetection, which is feasible for images taken with partially corrected optics, but isimpossible in the presence of very large blurs that result from the use of uncorrectedoptics.Although the aberrations are severe, at the same time, the PSF of at least one ofthe color channels often contains more energy in high spatial frequencies than theothers (one channel is usually focused significantly better than the others); note herethat we do not require it to be perfectly in focus. This suggests that we can exploitcross-channel correlation and reconstruct spatial frequencies that are preserved in atleast one of the channels.In our approach we rely on a novel cross-channel prior that encodes this cor-relation between different spectral channels. Following our Bayesian approachfrom the related work Section 2.8.1, we formulate a deconvolution method whichefficiently incorporates this prior in a convex optimization problem. We derivea proximal algorithm to solve this problem which is guaranteed to converge to aglobal optimum. The proposed approach is therefore able to handle much largerand more dense PSFs than previous methods (see, e.g., [165]), such as disk-shapedkernels with diameters of 50 pixels and more, which occur frequently in uncorrectedoptics unless the apertures are stopped down to an impractical size. To calibratethe kernels for a given optical system, we present a method for robust per-channelspatially-varying PSF estimation based following the same Bayesian approach asfor the deconvolution problem, but treating the PSF as an unknown. Our novelcross-channel prior, which is at the core of our proposed method, can also be com-bined with patch-based priors, and in fact improves the robustness of patch-basedreconstruction methods, which will be described in Chapter 9.In the following, we first review the image formation model including spatiallyvarying aberrations. Subsequently our novel cross-channel prior for color images isderived. Next, the robust Bayesian approach, and solver method for the resultingoptimization problem are described. We demonstrate that our method enables high-quality photographs on modern 12 megapixel digital SLRs using only single lenselements such as plano convex or biconvex lenses, and achromatic doublets. Weshow high quality results comparable to conventional lenses, with apertures aroundf/4.5. Furthermore, we present an extension of our method that allows for color169imaging even with highly chromatic diffractive elements, such as ultra-thin Fresnelphase plates. Finally, we demonstrate that our formulation outperforms compet-ing deconvolution approaches in terms quality and runtime. Using our method,simplified optics become viable alternatives to conventional camera objectives.8.2 Image Formation ModelThe color image formation in the presence of spatially varying aberrations hasbeen described in detail in Section 2.3.3. Recalling Eq. (2.12), we can assumeaberrations to be spatially invariant in a small n × m tile on the sensor. Usingj ∈ Rnm and i ∈ Rnm as the vectorized observed blurred image and latent sharpimage, respectively, the image formation can be formulate in the matrix-vector formj = Bi+ n, (8.1)where B ∈ Rnm×nm is the matrix corresponding to the convolution with the PSFB in the tile, and n represents sensor noise. As discussed in Section 2.3.3, the fullimage will be composed of many tiles, each with a corresponding PSF.8.3 Deconvolution Using Cross-Channel Correlation8.3.1 Cross-Channel CorrelationReal optical systems suffer from dispersion in the lens elements, leading to awavelength dependency of the PSF known as chromatic aberration.While complex modern lens assemblies are designed to minimize these arti-facts through the use of multiple lens elements that compensate for each others’aberrations, it is worth pointing out that even very good lenses still have a residualamount of chromatic aberration. For the simple lenses we aim to use, the chromaticaberrations are very severe – one color channel is focused significantly better (al-though never perfectly in focus) than the other channels, which are blurred beyondrecognition (of course excluding achromatic lenses which compensate for chromaticaberrations).Given individual PSFs B{1...3} for each color channel J{1...3} of an image J170PSNR = 29.2 dBPSNR = 20.7 dBBlurry observation Cross-channel reconstructionIndependent per-channel reconstruction[Krishnan and Fergus, 2009]Figure 8.3: Effect of using the cross-channel prior for a simulation using astrongly chromatic blur (in the very top left) Left: Blurred and noisyobservation. Middle: Independently deconvolved results for each channelusing [160]. Right: Deconvolution with our cross channel prior.one might attempt to independently deconvolve each color channel. As Figure 8.3demonstrates, this approach does not in general produce acceptable results, sincefrequencies in some of the channels may be distorted beyond recovery, leading tothe severe red fringing and residual blur in this case.We propose to share information between the deconvolution processes of thedifferent channels, so that frequencies preserved in one channel can be used to helpthe reconstruction in another channel. The key observation of our cross-channelprior is that for a latent (sharp) image, changes in chroma and luma are sparselydistributed in natural images. That means, in most areas the gradients of differentchannels are close, except for few large changes in chroma and luma (e.g. commonlyat object or material boundaries); see the left of Figure 8.4. In other words, for apair of channels l, k we assume∇ik/ik ≈ ∇il/il⇔ ∇ik · il ≈ ∇il · ik,(8.2)where the errors in the approximation are sparsely distributed around edges in theimage. Note that the division and multiplication /, · are pixel-wise operators. Hence,we model the difference in the normalized gradients with a heavy-tailed prior:p(ic|γ, il) ∝ exp−γ∑l 6=c∑a‖Haic · il −Hail · ic‖1 , (8.3)171PositionIntensityCross-channel gradient magnitudeLogProbablityFigure 8.4: Blurred scanline on the top left with different PSFs in each channeland a sharp green channel. Reconstructed scanline on the bottom left.Our cross-channel prior enforces gradient consistency between the colorchannels (green shaded regions) and allows sparse hue changes (redshaded regions). The right shows the cross-channel gradient statisticsfrom 10,000 randomly selected images from the Imagenet data set [260].The empirical distribution is heavy-tailed.for a considered channel x. The matrices Ha are different gradient operators, whichwill be described in detail below.Note that the proposed prior places stronger assumptions (normalized consistentgradients) on the color-channel coupling compared to methods that assume onlyconsistent gradients, such as [261]. This stronger assumption is verified empiricallyusing of 10,000 images from the Imagenet data set [260]. The right of Figure 8.4shows the empirical distribution of the accumulated cross-channel gradients, thatis c˜ = Haic · il −Hail · ic for all images, channels and gradient filters, where icis here the red or blue channel and ir is the green channel. The plot also showsa Gaussian fit p(c˜) ∝ e−γ|c˜|2 , Laplacian fit p(c˜) ∝ e−γ|c˜| and Hyper-Laplacianp(c˜) ∝ e−γ|c˜|23 . We can see that the empirical cross-channel gradient distibutionis heavy-tailed and thus not well approximated by a normal distribution. While aHyper-Laplacian fit best models the underlying distribution, the Laplacian is thebest convex relaxation of the fitted distribution. We therefore chose in Eq. (8.3) theLaplacian distribution as the best convex fit.1728.3.2 Inverse ProblemTo invert the image formation model from Eq. (8.2) using the described prior andincluding noise, we follow again the Bayesian MAP approach from Section 2.8.3, asfor all inverse problems discussed in this dissertation. Assuming n to be Gaussiandistributed, we formulate the problem of jointly deconvolving all channels as(i1...3)opt = argmini1...33∑c=1‖Bcic − jc‖22 + λc5∑a=1‖Haic‖1+∑l 6=cβcl2∑a=1‖Haic · il −Hail · ic‖1,(8.4)with the first term being the Data Fidelity term, which results from the Gaussianlikelihood. The second term enforces a heavy-tailed distribution for both gradientsand curvatures of all channels. The gradient matrices H{1,2}, implement the firstderivatives, while H{3...5} correspond to the second derivatives. We use the samekernels as Levin et al. [161] but employ an `1 norm in our method rather than afractional norm. This ensures that our problem is convex. The last term of Eq. (8.4)implements our cross-channel prior, resulting from the prior distribution Eq. (8.3).λc, βcl ∈ R with c, l ∈ {1 . . . 3} are weights for the image prior and cross channelprior terms, respectively.8.3.3 OptimizationThe joint objective from Eq. (8.4) for all channels is a hard non-convex problem. Tomake it feasible and efficient to solve, we minimize it by coordinate descent over thechannels. That means, we alternately minimize with respect to one channel whilefixing all the other channels. Similar to the optimization for CSC from Section 6.2.3,the resulting objective in each step becomes then convex.Note that for the specific case of a given sharp channel, the objective Eq. (8.4)is separable per channel. In this case, all channels can be solved jointly leadingto a complexity of our method growing linearly with the number of channels. Allconsidered simple optics in this chapter offer an almost sharp channel for a givenchannel. The coordinate descent approach discussed above can be applied for173general reconstruction problems. Minimizing Eq. (8.4) with regard to a singlechannel c can be formulated as the problem:(ic)opt= argminx‖Bcx− jc‖22 + λc5∑a=1‖Hax‖1 +∑l 6=cβcl2∑a=1‖Hax · il −Hail · x‖1= argminx‖Bcx− jc‖22 + λc5∑a=1‖Hax‖1 +∑l 6=cβcl2∑a=1‖(DilHa −DHail)x‖1= argminx‖Bcx− jc‖22 + ‖Sx‖1 withS =λcH1...λcH5βcl (DilH1 −DH1il)...βcl (DilH2 −DH2il)...,(8.5)where D denotes the diagonal matrix with the diagonal taken from the subscript. Sis the matrix consisting of the sequence of all t = 5 + 2(3− 1) matrices comingfrom the `1 minimization terms in Eq. (8.4). In Eq. (8.5) it becomes clear that ourdeconvolution problem is of the form discussed in Section 2.8.3. In particular, weuseΦ = I, G(v) = ‖Bcv − jc‖22,Ω = S, F (v) = ‖v‖1 ,(8.6)So, since our problem is expressible as Eq. (2.40), we can solve it using a proximalalgorithm as described in Section 2.8.4. We use the Chambolle-Pock method,as in Chapter 3. This method is equivalent to a linearized version of ADMM,which has been described in Chapter 6. We discuss proximal algorithms, includingthe Chambolle-Pock method in detail in Chapter 10. To solve our problem withChambolle-Pock’s method, the proximal operators for F ∗ and G have to be given.174These operators areproxτF ∗(v) =vmax (1, |v|) (8.7)proxτG (v) = x˜ = argminx‖Bcx− jc‖22 +12τ‖x− v‖22⇔ 1τ(x˜− v) + 2 (BTc Bcx˜−BTc j) = 0⇔ x˜ = F−1(τ2F (Bc)∗ F (jc) + F (v)τ2 |F (Bc)|2 + 1), (8.8)The first proximal operator is a projection on the `1-norm ball and described indetail in [178]. Note that the division is here point-wise. The second proximaloperator is the solution to a linear system as shown in the second line. Since thesystem matrix is composed of convolution matrices with a large support we canefficiently solve this linear system in the Fourier domain (last line).Parameters Adopting the notation from [178], we use θ = 1, σ = 10 and τ =0.9/(σL2) for our specific Chambolle-Pock method with Eq. (8.7),(8.8). That onlyleaves the operator norm L = ‖Ω‖ which is required for the method. We find thevalue L by using the power iteration where all matrix-vector-multiplications withSTS can be decomposed into filtering operations (other eigenvalue methods likethe Arnoldi iteration are consequently also suited for computing L).8.3.4 Low-Light ImagingFor image regions with very low intensity values, the method described above fails.This is since the cross-channel prior from Eq. (8.3) is not effective due to the huenormalization, which reduces the term near to zero. As a result, significant colorartifacts (such as color ringing) can remain in very dark regions, see Figure 8.5.Note that by allowing luma gradients in the prior from Eq. (8.3) this is an inherentdesign problem of this prior and not an optimization issue.In this scenario, we therefore propose to match absolute (rather than relative)gradient strengths between color channels using a Gaussian distribution. Theadditional prior can be expressed as a modification of operator G from (8.6), which175Figure 8.5: Chromatic artifacts in low-light areas. Left: Original blurredpatch captured with a plano-convex lens. Center: Reconstruction usingEq. (8.4). Notice the chromatic artifacts in all low-intensity areas andthe correct reconstruction in the other areas. Right: Regularization forlow-intensity areas added as discussed below.becomes nowG(v) = ‖Bcv − jc‖22 + λb∑l 6=c∑2a=1 ‖Dw (Hav −Hail)‖22 (8.9)where diag (w) is a spatial mask that selects dark regions below a threshold . Themask is blurred slightly with a Gaussian kernel Bσ with standard deviation σ = 3to avoid discontinuities at border regions affected by the additional prior:w =(1−∑l βclT (il)∑l βcl)⊗ Bσ with T (i)k ={1 ; ik ≤ 0 ; else, (8.10)where T is here a simple thresholding operator with threshold  = 0.05 and in ourimplementation. The proximal operator from Eq. (8.8) becomes thenproxτG (v) = x˜ = argminx‖Bcx− jc‖22 +12τ‖x− v‖22+λb∑l 6=c2∑a=1‖Dw (Hav −Hail)‖22⇔2τBTc Bc + I+ 2τλb∑l 6=c2∑a=1HTaD2wHa x˜= 2τBTc jc + v + 2τλb∑l 6=c2∑a=1HTaD2wHail.(8.11)176The linear system from the last row cannot be solved efficiently in the Frequencydomain anymore, since the spatially diagonal operator D2w does not diagonalizein the Frequency domain. Since blur kernel sizes of the order of magnitude of102 × 102 can be expected for practical applications, the linear system is not sparse,nor separable and therefore impractical to invert explicitly. We solve the systemusing the Conjugate Gradient (CG) algorithm. This allows us to express the matrix-vector multiplication in CG-algorithm as a sequence of filtering operations as before.Residual and Scale-Space Deconvolution The deconvolution approach describedabove is used in a residual framework which we adopt from [262]. By adaptingthe prior weights in sequential passes of our deconvolution method, residual de-convolution makes our basic deconvolution approach more robust to the parameterselection. This reduces artifacts in regions with strong gradients. Furthermore,we accelerate our method using a scale-space approach as in [263]. The residualand scale-space deconvolution technique is described in detail in the AppendixSection A.9, including a discussion on how we handle saturated regions in theresidual deconvolution.8.4 ResultsIn this section we present a number of results captured with different optical systems.Specifically, we show captures using simple lens elements as well as standardconsumer lens systems, and finally present an extension of our method to ultra-thin diffractive optical elements. We demonstrate full-spectrum color results andinvestigate multi-spectral imaging with our approach. The spatially varying PSFsof the simple lens elements are estimated using two captures of a planar target atdifferent aperture settings. Note that any PSF estimation method can be used hereand the whole estimation process is a calibration procedure, which only needs to beperformed once for each lens. Our PSF estimation technique, which only requirestargets printed on consumer laser-printers, is described in Section A.10. Note alsothat PSFs from Figure 2.6 and Figure 8.2 have been calibrated using our method.177Figure 8.6: Images captured with a simple plano-convex lens (left) and re-stored with our method (right). Note recovered spatial detail and theabsence of color fringing. Also note the bokeh and graceful handling ofout-of-focus blur in the top three images.Simple Lens Elements Figure 8.1, Figure 8.6 and Figure 8.7 show several resultsfrom the most basic refractive optical system, a single plano-convex lens (focallength 130 mm, f/4.5) shown lying on the side in Figure 8.1. These resulst arecaptured with a Canon EOS D40 camera, using standard ISO100 and autoexposuresettings. The corresponding PSFs are the ones from Figure 8.2. We note that allresults show good detail down to the native sensor resolution, demonstrating that our178Figure 8.7: Additional scenes captured using the simple plano-convex lens asin Figure 8.6.method is indeed capable of producing high quality digital photographs with verysimple lens designs. The parking meter image and the twigs show that deconvolvingwith the PSF for one particular scene depth preserves the deliberate defocus blur(bokeh) for objects at different depths without introducing artifacts. The manholecover image at the top shows focus change over a tilted plane. Saturated regions asshown in the second inset of this image are handled properly by our method. It alsoincludes a result under low-light conditions.Figure 8.8 test the limits of our method using lenses with a larger aperture off/2. The top of Figure 8.8 shows the results for a 100 mm biconvex lens, with PSFsfrom Figure 2.6. The larger aperture increases the depth dependency of the PSF.Therefore our deconvolution method produces very sharp reconstructions in-plane179Figure 8.8: Limitations of our approach. The left shows captured image, andthe right shows the deblurred result. Top: Results from biconvex lensat f/2. A very low cross-prior weight has been used here to illustratethe depth-dependency of the PSF. Bottom: Cropped central region of theimage from achromatic lens with f/2.0.(see color checker), but suffers from some ringing out-of-plane (see text on theforeground object). The bottom result of Figure 8.8 shows results for a 100 mmachromatic doublet, which does not contain spectrally varying blur. Consideringthat the PSFs exhibit disk-like structures in all channels, our deconvolution methodstill achieves reasonable quality, however cannot rely on the cross-channel prior.Commercial Lens Systems Results for a commercial camera lens, a Canon 28–105mm zoom lens at 105 mm and f/4.5, are shown in Figure 8.9 and Figure A.18 fromthe Appendix Section A.11. We use the Canon EOS D40 with the same settings asbefore. While this commercial lens shows much reduced aberrations compared to180Figure 8.9: Test scene captured with a commercial Canon lens (105 mm, f/4.5).Left: Captured input image. Right: Deblurred result.the uncorrected optics used above, there still are some residual blurs that can beremoved with our method. In particular, the vertical edges of the calibration patternin Figure A.18 reveal a small amount of chromatic aberration that is removed byour method. The PSFs for this lens are around 11× 11 pixels in diameter.Multi-Spectrum Imaging Our aberration correction method cannot only be usedfor RGB imaging but also for multi-spectral imagers. In this case, the cross-channel prior is applied between all frequency bands pairs. Results are shown inSection A.11.Considering the wavelength-dependent PSFs, it is important to note that theassumption of fixed PSFs for each color-channel of an RGB-sensor is often violated.This assumption is made for all the RGB sensor results in this chapter and is aclassical assumption in deconvolution literature. However, one cannot tell from atri-chromatic sensor the exact wavelength distribution. Metamers (different spectrathat produce the same tri-stimulus response) will have different blur kernels, sothere can always be situations where the assumption of fixed per-channel PSFs willfail, such as for example light sources with multiple narrow spectra or albedos with181very narrow color tuning. This will introduce errors in the data-fitting term of ourobjective function, which to a certain extent our cross-channel prior can account for.In addition to the simulated result from Figure 8.3, we have compared our methodto state-of-the-art deconvolution methods for measured data in Section A.11. Theadditional results demonstrate that even for incorrectly estimated PSFs our methodcan still provide plausible estimates.Diffractive Optical Elements However, for even further increased chromatic aber-rations this approach fails. Diffractive optical elements (DOEs) introduce highlywavelength-dependent aberrations due to the inherent dispersion. Only relying ondiffraction, such elements can be fabricated on ultra-thin planes, and hence representa promising avenue for reducing the size and complexity of conventional lenseseven further (compared to the simple refractive lenses from above). Furthermore,diffractive optics allow for a very flexible optical design since complex phase platescan be fabricated using photo lithograph, while, in contrast, high-quality refrac-tive optics rely on spherical optics (recall Section 2.3). Recent examples includecubic phase plates for extended depth of field imaging [264], DOEs producingdouble-helix PSFs [20] for single-molecule microscopy beyond the diffraction limit,and anti-symmetric gratings integrated with a CMOS sensor to produce an ultra-miniature lens-less imager PicoCam [265]. Figure 8.10 shows captures acquiredusing a diffractive optical element which implements a Fresnel lens that is designedfor the wavelength of 555 nm (center of green band). No other optical elements areused, and the results are captured with a 4 megapixel Canon 70D DSLR sensor withthe same settings as before. For this optical system any incident wavelength exceptthe single design wavelength will not be focused, depending on the spectral distribu-tions of objects and illumination. Hence, the captures in Figure 8.10 exhibit extremechromatic aberrations with blur kernels of diameters well over 100 pixels. Notethat the PSFs are also highly spatially varying due to varying material/illuminationproperties.In order to handle such aberrations, we cannot rely on pre-calibration, butestimate the spatially varying PSFs in a blind deconvolution approach only fromthe captured image. Using cross-channel information in the blind estimation turnsout to be key for achieving good PSF estimation results, as classical methods such182Figure 8.10: Results capture with an ultra-thin diffractive optical element (topleft inset), implementing a Fresnel phase plate optimized for the singlewavelength 555 nm (green). A set of representable scenes under indoorlighting conditions are shown here. For each individual result, we com-pare the aberrated observation (top) and a cross-channel reconstruction(bottom). Small insets below each image highlight details and illustratehow our method removes the severe chromatic aberrations.as [266–268] fail for the large kernels. Reconstruction results using the blindestimation are shown in Figure 8.10. To increase efficiency, we use here a slightreformulation of the prior from Eq.(8.3) without the luma normalization. Whilereaching almost similar reconstruction quality, the reformulation leads to 5 ordersof magnitude reduced runtime, which is a critical aspect to make blind estimationmethods practical. We refer the reader to our publication [9] for further details onfull-spectrum imaging with diffractive optics, including an in-depth discussion ofthe blind extension to our cross-channel reconstruction from this chapter, whichgoes beyond the scope of this dissertation.1838.5 DiscussionIn this chapter, we have exploited the structure of aberrations in imaging optics. Inparticular, we have presented a Bayesian method for color image deconvolutionrelying on the specific structure of chromatic aberrations that focus well in at leastone channel. We exploit statistical correlation between the focused and the otherchannels to recover severely degraded color channels, even with disk-like blurkernels of diameters of 50 pixels and more. This enables high-quality imaging usingvery simple, uncompensated lens elements instead of complex lens systems. Goinga step further, we have demonstrate that our approach can even be extended forbroadband imaging with ultra-thin diffractive optics.Overall our method produces image quality comparable to that of commercialpoint-and-shoot cameras even with the most basic refractive and diffractive elements:a single plano-convex lens and a Fresnel phase plate. However, for large aperturesof f/2 and more, the quality degrades due to the depth dependency of the PSF andsignificantly larger blur kernels, which destroy much of the frequency content.Furthermore, the scene and illumination determine the incident spectrum and hencemake the PSF estimation challenging. While a blind extension of our approach (asdiscussed for the diffractive experiments) mitigates these issues, our method doesstill achieve the quality level of a high-end lens in combination with a good SLRcamera. We conclude that in order to achieve that level of image quality it may stillbe necessary to optimize the lens design to partially compensate for aberrations. Inparticular the combination of thin refractive and diffractive elements is a promisingdirection for very flexible, new hybrid optical designs. Our work demonstrates thatfuture lenses can be simpler than they are today, thus paving the road for lighter,cheaper, and more compact camera lenses.Our method removes aberrations post-capture assuming that the optics aredesigned in a separate process to the best possible quality. I believe that jointoptimization of the optics together with the image reconstruction can lead to excitingnew areas in the optical design space, going far beyond first attempts such as [269],since the the freedom in the optical design is inherently limited by the achievablequality of the reconstruction method. Note also that this is in strong contrast toclassical optics design which solely aims at minimizing the spot-size.184Chapter 9Flexible Camera ImageProcessingWhile the previous chapter has demonstrated that computation enables new opticaldesigns by removing chromatic aberrations, we generalize this idea in the followingand show that it enables a variety of new computational camera designs whichoutperform classical color imaging systems.Color image processing is traditionally organized as a series of cascaded mod-ules, each responsible for addressing a particular problem. Demosaicking, denoising,deconvolution and tone-mapping represent a subset of these problems, which areusually treated by independent pipeline stages; see Section 2.4.4. While this divide-and-conquer approach offers many benefits, it also introduces a cumulative error,as each step in the pipeline only considers the output of the previous step, not theoriginal sensor data.We propose to completely replace such pipelines by Bayesian inference. Inparticular, building on Section 2.8.1, we propose an end-to-end system that isaware of the camera and image model, enforces natural-image priors, while jointlyaccounting for common image processing steps like demosaicking, denoising,deconvolution, and so forth, all directly in a given output representation (e.g., YUV,DCT). Our system is flexible and we demonstrate it on regular Bayer images as wellas emerging array sensors, such as interleaved HDR, and low-light burst imaging. Inall cases, we achieve large improvements in image quality and signal reconstruction185tGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGB B B BB B B BB B B BB B B BB B B BBBBBBR R R RR R R RR R R RR R R RR R R RRRRRRGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGB B B BB B B BB B B BB B B BB B B BBBBBBR R R RR R R RR R R RR R R RR R R RRRRRRGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG G G GGGGGGGGGGB B B BB B B BB B B BB B B BBBBBR R R RR R R RR R R RR R R RR R R RRRRRRGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG G G GGGGGGGGGGB B B BB B B BB B B BB B B BBBBBR R R RR R R RR R R RR R R RR R R RRRRRRGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGB B B BB B B BB B B BB B B BB B B BBBBBBR R R RR R R RR R R RR R R RR R R RRRRRRG IRBR(a)( ) (b)( ) (c)( ) (d)( )Figure 9.1: Our end-to-end method jointly accounts for demosaicking, de-noising, deconvolution, and missing data reconstruction. Following aBayesian approach, image formation and priors are separated, which al-lows us to support both conventional and unconventional sensor designs.Examples (our results at lower right): (a) Demosaicking+denoisingof Bayer data (top: regular pipeline). (b) Demosaicking+denoisingof a burst stack (top: first frame). (c) Demosaicking+denoising+HDRfrom interlaced exposure (top: normalized exposure). (d) Denois-ing+reconstruction of a color camera array (top: naı¨ve reconstruction).compared to state-of-the-art techniques, even on very classical problems such asdemosaicking. This demonstrates that treating color image processing as an inverseproblem in a Bayesian framework, is, in fact, a change of paradigm, enablingadvances for well understood problems as well as completely new camera designs.9.1 IntroductionModern camera systems rely heavily on computation to produce high-quality digitalimages. See Section 2.4.4 for a review of color image processing, including non-traditional sensor designs. Even relatively simple camera designs reconstruct aphotograph using a complicated process consisting of tasks such as dead-pixelelimination, noise removal, spatial upsampling of subsampled color information(e.g., demosaicking of Bayer color filter arrays), sharpening, and image compression.More specialized camera architectures may require additional processing, such asmulti-exposure fusion for high-dynamic-range imaging or parallax compensation incamera arrays.The complexity of this process is traditionally tackled by splitting the imageprocessing into several independent pipeline stages [63]. Splitting image reconstruc-tion into smaller, seemingly independent tasks has the potential benefit of making186... ... ......Standard pipelineNatural-image priorsOur approachOutput imageRAW Input Output imageGamutmappingDemosaicking JPEGcompressionDenoisingImageFormationBayesianReconstructionFigure 9.2: The traditional camera processing pipeline consists of many cas-cad