UBC Theses and Dissertations

Exploiting temporal structures in computational photography Su, Shuochen 2018

Exploiting Temporal Structures in Computational Photography

by

Shuochen Su

B.Eng., Tsinghua University, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

April 2018

© Shuochen Su, 2018

Abstract

Despite the tremendous progress in computer vision over the last decade, images captured by a digital camera are often regarded as the instantaneous 2D projection of the scene for simplification. This assumption poses a significant challenge to machine perception algorithms due to the existence of many imaging- and scene-induced artifacts in the camera’s measurements such as the hand-shake blur and time-of-flight multi-path interference. In this thesis we introduce time-resolved image formation models for color and depth cameras by exploiting the temporal structure within their raw sensor measurements of the scene. Specifically, we present our efforts on leveraging the inter-scanline and cross-frame content correlations for image and video deblurring, as well as utilizing the complementary components of time-of-flight frequency measurements in collaboration for improved depth acquisition. By tackling these limitations we also enable novel imaging applications such as direct material classification from raw time-of-flight data. In addition, we devise post-processing algorithms with temporal structure awareness so that the hidden information can be decoded efficiently with existing off-the-shelf hardware devices. We believe that the proposed time-resolved modeling of the encoding-decoding process of a digital camera opens the door to many exciting directions in computational photography research.

Lay Summary

Photos are the most common way of capturing an instance of our life, and the world surrounding us.
While most people think of images as snapshots that freeze instantaneous moments, they are in fact integrals over small time segments. Such a mechanism leads to challenges in color and depth imaging unless the temporal structures of the measurement are fully exploited, as we will present in this thesis. We demonstrate this with four examples in computational photography: image deblurring, video deblurring, end-to-end depth imaging, and material classification with raw depth data.

Preface

This dissertation is based on the following publications.

S. Su and W. Heidrich. Rolling shutter motion deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1529-1537, 2015 [128]. This work is presented in Chapter 3. W. Heidrich introduced the research direction of rolling shutter rectification to the author. The author proposed, experimentally validated, and documented the idea of solving the joint problem of rolling shutter removal and image deblurring. Both authors contributed to the discussion and paper writing.

S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang. Deep Video Deblurring for Hand-held Cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1279-1288, 2017 [130]. This work is presented in Chapter 4. O. Wang had the initial idea of leveraging deep learning for video deblurring, and supervised the project. The author designed and optimized the deep convolutional neural network architecture, proposed the dataset generation framework, conducted all the experiments, and led the paper writing. All the authors contributed to the discussion and paper revisions.

S. Su, F. Heide, G. Wetzstein, and W. Heidrich. Deep end-to-end time-of-flight imaging. Accepted to the IEEE Conference on Computer Vision and Pattern Recognition, 2018 [131]. This work is presented in Chapter 5.
The author proposed, validated, and documented the idea of exploiting time-of-flight raw measurements for multi-path compensation. F. Heide provided the time-of-flight cameras and GPU resources for the experiment, and contributed to the discussion. All the authors took part in writing the manuscript.

S. Su, F. Heide, R. Swanson, J. Klein, C. Callenberg, M. Hullin, and W. Heidrich. Material classification using raw time-of-flight measurements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3503-3511, 2016 [129]. This work is presented in Chapter 6. W. Heidrich had the idea of exploiting raw measurements of time-of-flight cameras for object recognition. The author initiated the discussion with F. Heide and W. Heidrich on narrowing down the scope to material classification, conducted simulations and experiments to verify the idea, and proposed the depth normalization algorithm. M. Hullin helped derive Algorithm 3 in the frequency space. J. Klein and C. Callenberg helped with data collection. All authors took part in writing the paper.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Thesis Overview
    1.1.1 Rolling Shutter Image Deblurring
    1.1.2 Video Deblurring for Hand-held Cameras
    1.1.3 End-to-End Time-of-Flight Imaging
    1.1.4 Material Classification Using Time-of-Flight Measurements
  1.2 Organization
2 Background and Related Work
  2.1 Overview of Image Sensing Pipeline
  2.2 Light Transport
  2.3 Sensor Integration
    2.3.1 CMOS Sensors
    2.3.2 Time-of-Flight Sensors
  2.4 Post-Processing
    2.4.1 Motion Blur Removal
    2.4.2 Depth Ambiguity Compensation
  2.5 Computational Approaches
    2.5.1 Numerical Optimization
    2.5.2 Learning-Based Methods
3 Rolling Shutter Image Deblurring
  3.1 Introduction
  3.2 Method
    3.2.1 Motion Blur in Rolling Shutter Cameras
    3.2.2 Camera Motion Modeling
    3.2.3 Deblurring Rolling Shutter Motion Blurred (RSMB) Image
  3.3 Experiments and Results
    3.3.1 Synthetic Images
    3.3.2 Real Images
    3.3.3 Parameters and Computational Performance
  3.4 Conclusion and Discussion
4 Multi-Frame Video Deblurring
  4.1 Introduction
  4.2 Method
    4.2.1 Network Architecture
    4.2.2 Dataset
  4.3 Experiments and Results
  4.4 Conclusion and Discussion
5 Time-of-Flight Imaging
  5.1 Introduction
  5.2 Method
    5.2.1 Depth Estimation Network
    5.2.2 Loss Functions
    5.2.3 Training and Implementation
  5.3 Datasets
    5.3.1 Synthetic Dataset
    5.3.2 Real Dataset
  5.4 Experiments and Results
    5.4.1 Ablation Study
    5.4.2 Comparison to Sequential Approaches
  5.5 Conclusion and Discussion
6 Material Classification
  6.1 Introduction
  6.2 Traditional Methods
  6.3 Decoding Material Signatures from Time-of-Flight (TOF) Measurements
  6.4 Methods
    6.4.1 Data Preprocessing
    6.4.2 Features
    6.4.3 Learning Models
  6.5 Experiments and Results
    6.5.1 Dataset Acquisition
    6.5.2 Classification Results
    6.5.3 Comparison with RGB Based Approaches
    6.5.4 Scene Labeling
  6.6 Conclusion and Discussion
7 Conclusion
Bibliography
A Supporting Materials
  A.1 Supporting Materials for Chapter 5
    A.1.1 Introduction
    A.1.2 Additional Results
    A.1.3 Additional Experiments
    A.1.4 Implementation Details
    A.1.5 Limitations
  A.2 Supporting Materials for Chapter 6
    A.2.1 Raw Correlation Frames
    A.2.2 Fixed Pattern Noise Removal
    A.2.3 Depth Normalization
    A.2.4 Dataset

List of Tables

Table 4.1 Specifications of the DBN model.
Table 4.2 PSNR/MSSIM measurements for each approach.
Table 5.1 Quantitative ablation studies on the proposed network and its performance against traditional sequential approaches.
Table 6.1 Validation accuracies from different learning models.
Table 6.2 Confusion matrix.
Table 6.3 Testing accuracies with and without considering spatial coherence.
Table A.1 Specifications of the generator and discriminator networks of the proposed TOFNET model.

List of Figures

Figure 2.1 A camera sensor integrates the mixture of direct and indirect reflectance from the scene.
Figure 2.2 Illustration of the global and rolling shutter sensor mechanisms.
Figure 2.3 TOF sensors measure depth by estimating the time delay from light emission to light detection.
Figure 2.4 An example Convolutional Neural Networks (CNN) architecture [140].
Figure 3.1 Simulated blur kernels for global and rolling shutter cameras.
Figure 3.2 The global and rolling shutter sensor mechanisms.
Figure 3.3 Effect of camera motion on pixel shift in the image space.
Figure 3.4 Pipeline of θ0 initialization.
Figure 3.5 Intermediate results when processing an RSMB image.
Figure 3.6 Synthetic results (kernel 1).
Figure 3.7 Synthetic results (kernel 2).
Figure 3.8 PSNR of the synthetic results.
Figure 3.9 Comparison with state-of-the-art uniform and non-uniform blind deblurring algorithms (1/2).
Figure 3.10 Comparison with state-of-the-art uniform and non-uniform blind deblurring algorithms (2/2).
Figure 3.11 Results from alternative approaches.
Figure 4.1 Blur in videos can be significantly attenuated by learning how to aggregate information from nearby frames.
Figure 4.2 Architecture of the proposed DeBlurNet model.
Figure 4.3 A selection of blurry/sharp pairs from our ground truth dataset.
Figure 4.4 Quantitative results from our test set, with PSNRs relative to the ground truth (1/2).
Figure 4.5 Quantitative results from our test set, with PSNRs relative to the ground truth (2/2).
Figure 4.6 Quantitative comparison of different approaches.
Figure 4.7 Qualitative comparisons to existing approaches (1/4).
Figure 4.8 Qualitative comparisons to existing approaches (2/4).
Figure 4.9 Qualitative comparisons to existing approaches (3/4).
Figure 4.10 Qualitative comparisons to existing approaches (4/4).
Figure 4.11 Generalization to unseen types of data.
Figure 4.12 Visualization of learned filters.
Figure 5.1 Traditional and proposed time-of-flight imaging frameworks.
Figure 5.2 Illustration of dual-frequency correlation images of a corner scene synthesized with and without multipath interference.
Figure 5.3 The proposed TOFNET architectures.
Figure 5.4 Synthetic dataset generation.
Figure 5.5 Comparisons and ablation study on a corner scene.
Figure 5.6 Results on synthetic dataset.
Figure 5.7 Results on real indoor scenes.
Figure 6.1 Visually similar but structurally distinct material samples in RGB.
Figure 6.2 Simulation of Temporal Point Spread Function (TPSF)s.
Figure 6.3 Illustrations of experimental setup.
Figure 6.4 Our modulation signal φω(t) at 20MHz and 80MHz.
Figure 6.5 Effectiveness of depth normalization, visualized in the dimensionality-reduced [141] feature space.
Figure 6.6 Our classifier successfully recognizes the actual material of each paper-printed replica.
Figure 6.7 Our classifier successfully labeled the segmented scene.
Figure A.1 Synthetic results.
Figure A.2 Synthetic results.
Figure A.3 Synthetic results.
Figure A.4 Synthetic results.
Figure A.5 Synthetic results.
Figure A.6 Synthetic results.
Figure A.7 Real results.
Figure A.8 Real results.
Figure A.9 Real results.
Figure A.10 Real results.
Figure A.11 Real results.
Figure A.12 Real results.
Figure A.13 Real results.
Figure A.14 Real results.
Figure A.15 Real results.
Figure A.16 Evaluation of the temporal consistency with experimentally captured raw ToF measurements.
Figure A.17 Robustness to albedos on two pairs of synthetic scenes.
Figure A.18 Robustness to albedos on two pairs of synthetic scenes.
Figure A.19 Effect of λa and λs variations.
Figure A.20 Synthetic data generation steps.
Figure A.21 The raw correlation measurements from our ToF camera without any preprocessing.
Figure A.22 The effectiveness of fixed pattern noise removal.
Figure A.23 The effectiveness of depth normalization.

Glossary

ADMM Alternating Direction Method of Multipliers
AMCW Amplitude-Modulated Continuous Wave
BRDF The Bidirectional Reflectance Distribution Function
CCD Charge-Coupled Device
CGAN Conditional Generative Adversarial Networks
CMOS Complementary Metal-Oxide-Semiconductor
CNN Convolutional Neural Networks
GAN Generative Adversarial Networks
GSMB Global Shutter Motion Blurred
GSMD Global Shutter Motion Deblurring
MAP Maximum A Posteriori
MPI Multi-Path Interference
PSF Point Spread Function
RSMB Rolling Shutter Motion Blurred
RSMD Rolling Shutter Motion Deblurring
SPAD Single-Photon Avalanche Diode
TOF Time-of-Flight
TPSF The Temporal Point Spread Function

Acknowledgments

First and foremost, I would like to express my utmost gratitude to my research supervisor, Professor Wolfgang Heidrich, for his guidance, support, and encouragement during my wonderful four and a half years here at the University of British Columbia. Without him overseeing my academic progress and introducing all possible resources both here at UBC and KAUST, this dissertation would not have been possible.

Furthermore, I am grateful to my doctoral and RPE supervisory committee members, Professor James Little, Professor Michiel van de Panne, and Professor Robert J. Woodham for the fruitful discussions and timely feedback over the years.

I feel lucky to collaborate with many outstanding researchers in the field. Thanks to Dr. Felix Heide, Professor Matthias Hullin and Professor Gordon Wetzstein for their expertise and fantastic support on time-of-flight related projects and beyond. Thanks to Dr. Oliver Wang, Dr.
Jue Wang, Professor Mauricio Delbracio and Professor Guillermo Sapiro for all the inspirational exchange on the video deblurring project.

Finally, I would like to thank my wonderful colleagues at Professor Heidrich’s labs: Yifan Peng, Robin Swanson, Jonathan Klein, Clara Callenberg, Dr. Qiang Fu, Dr. Hadi Amata, Dr. Lei Xiao, Dr. James Gregson, Dr. Gerwin Damberg, and many friends back home and here in North America.

To my parents and Rute for their love and support.

Chapter 1
Introduction

Color and depth are two dominant sources of information that are fundamental to our perception of and interaction with the world. From early pinhole film cameras to modern RGB and depth digital cameras, humans have been trying to revolutionize the image sensing system in order to replicate and enrich our visual experience. Thanks to their drastically reduced cost and smaller footprint over the last decade, RGB and depth cameras have become increasingly popular in mobile and wearable devices, such as cellphones, tablets, virtual reality helmets and self-driving cars, just to name a few. This not only enables better photography and depth perception, but also opens the door to exciting new ways for human-computer interaction, telecommunication, and transportation. In the meantime, the pursuit of smaller, lighter and more energy efficient devices poses significant challenges to the computational side of imaging, where a stabilized, noise-free, and high dynamic range image only meets some of the many desirable features of modern photography.

Traditional film-like digital photography captures an image as a 2D projection of the scene. Due to the limited capabilities of the camera, the recorded image is often a partial, compressed, and degraded representation of the world.
Recent advances in computational photography aim to tackle these limitations, either via the “Epsilon Photography” based approach, where the scene is reconstructed from multiple observations captured over variations of the camera settings, or via the “Coded Photography” methods, in which the camera system is deliberately modified to encode a machine-readable representation of the world which can then be decoded to synthesize the essence of our visual experience [111]. Both approaches, however, require specific adaptations to the capturing process with potential hardware modifications.

Our goal in this thesis is to escape the limitations of traditional digital cameras to enable novel imaging applications, while maintaining the acquisition experience with existing, off-the-shelf hardware. To this end, we reformulate the conventional camera’s optics and imaging sensor as a spatiotemporal encoder of the light transport, and devise post-processing algorithms with temporal structure awareness to effectively decode the hidden information within the scene measurements.

Temporal Structure in Photography

While most people think of photos as snapshots that capture instantaneous moments, they are in fact the recording of a small integral of time. This is because the recording surface, whether it is film or an electronic sensor, has to be exposed to light for a certain amount of time to collect enough photons. Such a mechanism leads to challenges but also benefits in image processing and scene understanding, if the temporal structure of the resulting image is fully exploited.

For example, consider the scenario where the exposure time is within a few tenths of a second. In this case, human hand tremor becomes non-negligible, which often leads to blurriness in images captured by a hand-held camera. Over the past years, researchers have been focusing on the removal of such degradation, either in a non-blind fashion in which the camera motion is given, e.g.
from the recording of a gyroscope [51, 102], or in a blind way where neither the sharp image nor the camera motion is known in advance [17, 148]. In both cases, the exposure of the whole image is treated as if it starts and ends at exactly the same time. This assumption does not hold for cameras with Complementary Metal-Oxide-Semiconductor (CMOS) sensors that adopt an electronic rolling shutter mechanism, which is the case for the majority of image sensors on the market nowadays. In this thesis, we argue that by incorporating the row-wise temporal structure of the CMOS measurement into the optimization, we are able to achieve better image deblurring results than traditional methods that ignore this sequential exposure characteristic. The single image scenario can be extended to multi-frame settings, where subsequent images in the video clip are blurred from temporally adjacent segments of the camera motion trajectory. We will present the effectiveness of multi-frame sharpness transfer for video deblurring, as well as a data-driven approach to learn the aggregation function using Convolutional Neural Networks (CNN).

The temporal blurriness in images can be a reflection of not only camera shake, but also the interaction of light and the scene. When a pixel is exposed to light, its intensity is the contribution of multiple incoming light rays that can be traced back to the light source along different paths. Depending on the scene geometry and material properties, each light ray is blurred differently every time it bounces off an object, making the final measurement at the pixel a combination of attenuated and phase-shifted copies of the illumination. One way to visualize this temporal interaction is through transient imaging, where the recovery of light transport is made possible with dedicated imagers via either direct [143] or indirect measurements [44].
Neither of the methods, however, is practical when it comes to depth imaging or scene understanding, due to their high demands on budget and computation when reconstructing the transient state of the scene. In this thesis, we introduce an end-to-end depth imaging framework that compensates for the temporal blurriness caused by multi-path backscattering, overcoming one of the main challenges in Time-of-Flight (TOF) imaging. We will also present an approach that exploits the encoded temporal structure in TOF raw measurements to represent materials in the scene, specifically visually similar yet structurally distinct materials which are difficult to classify with RGB information alone.

1.1 Thesis Overview

In this section, we provide an outline of our research on image restoration and scene understanding with CMOS and TOF cameras.

1.1.1 Rolling Shutter Image Deblurring

Although motion blur and rolling shutter deformations are closely coupled artifacts in images taken with CMOS image sensors, the two phenomena have so far mostly been treated separately, with deblurring algorithms being unable to handle rolling shutter wobble, and rolling shutter algorithms being incapable of dealing with motion blur.

In Chapter 3 we present an approach that delivers sharp and undistorted output given a single rolling shutter motion blurred image. The key to achieving this is a global modeling of the camera motion trajectory as a polynomial function of time, which enables each scanline of the image to be deblurred with the corresponding motion segment. We show the results of the proposed framework through experiments on synthetic and real data.

1.1.2 Video Deblurring for Hand-held Cameras

Motion blur from camera shake is a major problem in videos captured by hand-held devices.
Unlike single-image deblurring, video-based approaches can take advantage of the abundant information that exists across neighboring frames. As a result, the best-performing methods rely on the alignment of nearby frames. However, aligning images is a computationally expensive and fragile procedure, and methods that aggregate information must, therefore, be able to identify which regions have been accurately aligned and which have not, a task that requires high-level scene understanding.

In Chapter 4, we introduce a deep learning solution to video deblurring, where a CNN is trained end-to-end to learn how to accumulate information across frames. To train this network, we collected a dataset of real videos recorded with a high frame rate camera, which we use to generate synthetic motion blur for supervision. We show that the features learned from this dataset extend to deblurring motion blur that arises due to camera shake in a wide range of videos, and compare the quality of results to a number of other baselines.

1.1.3 End-to-End Time-of-Flight Imaging

Existing time-of-flight image processing pipelines consist of a sequence of operations including modulated exposures, denoising, phase unwrapping and multi-path interference correction. While this cascaded modular design offers several benefits, such as closed-form solutions and power-efficient processing, it also suffers from error accumulation and information loss as each module can only observe the output from its direct predecessor, resulting in erroneous depth estimates.

In Chapter 5 we present an end-to-end image processing framework for time-of-flight cameras. We depart from a conventional pipeline model and propose a deep convolutional neural network architecture that recovers scene depth directly from dual-frequency, raw TOF correlation measurements.
To train this network, we simulate TOF images for a variety of scenes using a time-resolved renderer, devise depth-specific losses, and apply normalization and augmentation strategies to generalize this model to real captures. We demonstrate that the proposed network can efficiently exploit the spatiotemporal structures of TOF frequency measurements, and validate the performance of the joint multi-path removal, denoising, and phase unwrapping method on a wide range of challenging scenes.

1.1.4 Material Classification Using Time-of-Flight Measurements

TOF cameras capture the correlation between a reference signal and the temporal response of the material to incident illumination. Such measurements encode unique signatures of the material, i.e., the degree of subsurface scattering inside a volume. Consequently, they offer an orthogonal domain of feature representation compared to conventional spatial and angular reflectance-based approaches.

In Chapter 6 we present a material classification method using raw time-of-flight measurements. We demonstrate the effectiveness, robustness, and efficiency of our method through experiments and comparisons on real-world materials.

1.2 Organization

The outline of the thesis is as follows.

Chapter 2: Background and Related Work. In this chapter, we provide an overview of image sensing pipelines, including the light transport, sensor integration, and temporal structures in measurements from color and depth imaging sensors. We will also bridge the structural information to applications in image restoration and scene understanding, and discuss the computational part of the problem, where the fundamentals of numerical optimization, machine learning, and convolutional neural networks will be summarized.

Chapter 3: Rolling Shutter Image Deblurring. Next, we will showcase the application of structure-aware imaging for digital photography deshaking.
As introduced previously, we model the camera motion trajectory as a function of time, and associate the blurry image with the camera poses along this trajectory.

Chapter 4: Multi-Frame Video Deblurring. We then extend the idea of temporal structure-aware image restoration from using multiple rows in a single image to multiple frames in a video sequence. By aggregating sharp patches across the temporal domain, our method can restore a crisper and smoother version of hand-held videos.

Chapter 5: End-to-End Time-of-Flight Imaging. In this chapter, we dive into the microsecond scale where the camera motion becomes a less significant source of image blurriness than multi-path interference in light propagation. Unlike traditional pipeline-based depth reconstruction methods, we demonstrate an end-to-end trainable image processing framework for TOF depth imaging, enabling the joint removal of noise, multi-path ambiguity, and many other reconstruction errors.

Chapter 6: Material Classification. Instead of compensating for Multi-Path Interference (MPI) for improved depth accuracy in TOF depth imaging, in this chapter, we rethink the depth distortion of different materials as their unique signatures, and validate the concept of deliberately representing materials with TOF raw correlation measurements for scene understanding.

Chapter 7: Conclusion and Future Work. We conclude the thesis with discussions on limitations and future work.

Chapter 2
Background and Related Work

In this chapter, we reformulate a conventional camera’s optics and imaging sensor as a spatiotemporal encoder of the light transport, and devise post-processing algorithms with temporal structure awareness to decode the hidden information from scene measurements.

The organization of this chapter is as follows.
First, we review the basics of the image sensing pipeline for color and depth cameras, including how light propagates before it hits the imaging sensor, and the way that different types of sensors encode and integrate irradiance over time. We then discuss the limitations of traditional image processing pipelines and review past works that address these challenges from different perspectives. At the end of this chapter, we provide an overview of related numerical optimization and learning-based techniques, which lay the foundation for our proposed methods in the remaining chapters of the thesis.

2.1 Overview of Image Sensing Pipeline

Following Szeliski's [133] interpretation, we categorize the image formation process of digital cameras into three consecutive stages.

The first one takes place before the light reaches the imaging sensor, where photons emitted from one or many light sources propagate through some media (e.g. air, smoke), interact with objects, and finally pass through the optics, aperture, and shutter of the camera. During this stage, many scene-related properties, such as material, depth, and motion, are encoded in the incident light rays. Depending on their total travel time from light to sensor, i.e. the TOF, these signals reflect the transient state E(t) of the scene [57, 143].

The second stage of the image sensing pipeline happens on the sensor chip, where all incident photons arriving within an exposure duration T are converted and accumulated to represent the steady state of the scene. The electrical signals are then passed through a set of sense amplifiers, transforming into the well-known raw image from the camera

\[ B = \int_0^T E(t)\, f(t)\, dt. \tag{2.1} \]

Here, f(t) is a sensor-specific integration function, e.g. a box filter in color imaging and a sinusoidal wave in TOF-based depth imaging.
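As a rough numerical illustration of Equation 2.1, the sketch below approximates the integral with a midpoint Riemann sum for two integration functions, a box filter and a zero-mean sinusoid. The function names, the 20 MHz modulation frequency, the 10 ns delay, and the toy transient signal are all assumptions made for this example and do not come from the thesis.

```python
import math

def integrate_measurement(E, f, T, steps=10000):
    """Approximate B = integral_0^T E(t) f(t) dt with a midpoint Riemann sum."""
    dt = T / steps
    return sum(E((k + 0.5) * dt) * f((k + 0.5) * dt) for k in range(steps)) * dt

def box(t):
    """Box filter: every instant within the exposure is weighted equally."""
    return 1.0

def make_sinusoid(freq_hz, phase=0.0):
    """Zero-mean sinusoidal integration function (hypothetical TOF-style reference)."""
    omega = 2.0 * math.pi * freq_hz
    return lambda t: math.cos(omega * t - phase)

FREQ = 20e6          # assumed modulation frequency in Hz
T = 100 / FREQ       # exposure spanning exactly 100 modulation periods
DELAY = 10e-9        # assumed travel time of the returning light, in seconds

def E(t):
    """Toy transient irradiance: constant ambient level plus a delayed modulated return."""
    return 2.0 + math.cos(2.0 * math.pi * FREQ * (t - DELAY))

B_box = integrate_measurement(E, box, T)
B_sin = integrate_measurement(E, make_sinusoid(FREQ), T)
```

Under these assumptions the box filter returns the total collected energy (2T here), while the sinusoid cancels the constant term and keeps a value that depends on the delay, namely (T/2)·cos(2π·FREQ·DELAY).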
In theory, f can take any form to encode the transient state of the scene, thus the raw measurements B may appear distorted or random to the human eye.

To obtain the final perceivable image, the integrated light signal must then go through a third stage consisting of application-specific pipeline processing modules. For example, it is common practice to apply demosaicing and gamma correction in color imaging, as well as phase unwrapping and multi-path correction in TOF depth imaging.

Conventionally, these three stages (light transport, sensor integration, and post-processing) are modeled and addressed as individual subproblems in isolation. As a result, severe cumulative error and information loss can be introduced in the process [45], leading to an erroneous representation of the view, which in turn poses a great challenge to scene understanding. In the next sections, we will mathematically model each stage of the image sensing pipeline, and relate the three around the idea of temporal encoding-decoding.

2.2 Light Transport

Figure 2.1: A camera sensor integrates the mixture of direct (green) and indirect (orange) reflectance from the scene. The gray rays inside the slab indicate light propagation underneath the object surface commonly seen in translucent materials such as wax and skin, i.e. the subsurface scattering.

The world that we normally observe with our eyes, capture with cameras, and simulate with computer graphics renderers is in its steady state. Suppose we have an ultra-fast camera which can record the propagation of light from illumination g(t), as illustrated in Figure 2.1.
In this case, the pixel irradiance E(t) in Equation 2.1 can be decomposed into the superposition of many attenuated and temporally shifted copies of g(t) along all possible paths p ∈ P, weighted by the contribution $\alpha_p$:

\[ E(t) = \int_P \alpha_p\, g(t - |p|)\, dp, \tag{2.2} \]

where

\[ \alpha(\tau) = \int_P \alpha_p\, \varepsilon(|p| - \tau)\, dp \tag{2.3} \]

is defined as the Temporal Point Spread Function (TPSF) representing the summed contribution of all paths of equal length |p| = τ, with ε(·) being the Dirac delta function. Combined, the multi-path backscatter E(t) can be expressed as the convolution of g(t) with α(τ):

\[ E(t) = \int_0^{\tau_{\max}} \alpha(\tau)\, g(t - \tau)\, d\tau. \tag{2.4} \]

As light interacts with objects in the scene, e.g. via reflections and subsurface scattering shown in Figure 2.1, the TPSF it creates not only reflects the camera-object distance, but also forms a unique signature that can describe many physical properties of the scene, such as the Bidirectional Reflectance Distribution Function (BRDF) [95] and material [150] of the surface. Past efforts to capture and analyze this signature have relied on detailed reconstructions of the TPSF in the form of transient images [150], which can be captured either directly using experimental equipment such as streak cameras and femtosecond lasers [143], or indirectly with consumer TOF depth cameras [44], albeit at a significant computational cost as well as a lower resolution.

2.3 Sensor Integration

In this section, we introduce common sensor integration mechanisms for color and depth cameras. Specifically, we review the CMOS technology, and the TOF depth sensing technique that combines the CMOS sensor with a modulated light source.

2.3.1 CMOS Sensors

CMOS sensors have become the most popular imagers found in consumer and scientific cameras in the past decade. There are a number of similarities between CMOS and its counterpart, the Charge-Coupled Device (CCD) technology, but one major distinction is the way each sensor reads the signal accumulated at a given pixel after the exposure.
The CCD sensors often adopt the global shutter mechanism, meaning that every pixel is exposed simultaneously at the same instant in time. A typical CMOS sensor, on the other hand, utilizes an electronic rolling shutter, where each scanline of the image is exposed at a slightly different time segment. Concretely, this means that the integration function f(t) (Equation 2.1) of adjacent rows is shifted by a constant value in CMOS sensors. Figure 2.2 illustrates the relationship between the CCD/global shutter and CMOS/rolling shutter sensors. When camera motion exists, all pixels in a global shutter camera integrate over the same motion trajectory, while different scanlines integrate over a slightly different segment of the trajectory in a rolling shutter sensor. The rolling shutter design is particularly beneficial when the view is changing from frame to frame, for example in the case of video recording and cellphone live views, which require high frame rate and fast readout. On the other hand, the sequential integration nature of CMOS cameras poses a great challenge to computer vision algorithms such as change detection [110] and blind deblurring [120], which typically ignore the structural difference across rows of the image.

Figure 2.2: Illustration of the global and rolling shutter sensor mechanisms. The horizontal bars represent the exposure interval of a scanline in the sensor, $t_e$ denotes the exposure time of each scanline, and $t_r$ is the time the CMOS sensor takes to read out the i-th scanline before proceeding to the next.

Geometric distortions in rolling shutter images have received more attention, but typically without considering blur. A major motivation for rolling shutter correction is the removal of wobbles in videos [8], where inter-row parallax is larger than one pixel. The work by Saurer et al.
[116] is another example that attempts to make traditional computer vision algorithms (e.g., stereo, registration) work on RS cameras. Most of such work relies on sharp input images so that feature detectors can reliably deliver inter-frame homography estimates. Meilland et al. [92] were the first to propose a unified framework for rolling shutter and motion blur, but they rely on a sequence of images that can be registered together. Another approach was presented by Pichaikuppan et al. [110], but it targets change detection and requires a sharp global shutter reference image.

In Chapter 3 we introduce a method to jointly remove motion blur and rolling shutter deformation from single images, by treating the CMOS sensor as row-wise temporal encoders with overlapping integration functions f(t) (Equation 2.1).

2.3.2 Time-of-Flight Sensors

Figure 2.3: TOF sensors measure depth by estimating the time delay from light emission to light detection.

A TOF camera measures the modulated exposure of the scene with time-modulated but spatially uniform illumination g(t) and sensor demodulation signals f(t). The image formation model of correlation sensors in homodyne mode has been derived in previous works [44, 46, 47]. Following Lange [73], we model a raw correlation measurement for integration time T as

\[ b_{\omega,\psi} = \int_0^T E(t)\, f_\omega(t - \psi/\omega)\, dt, \tag{2.5} \]

where E is the irradiance previously derived in Equation 2.4, and f is a programmable reference signal with angular frequency ω and phase offset ψ. Typically, f is zero-mean to make the imager insensitive to ambient light, i.e., the DC component of E(t).

Scenes with different geometric and photometric properties can cause multiple path contributions to be linearly combined in a single sensor pixel, as previously shown in Figure 2.1 and adapted to TOF depth imaging in Figure 2.3.
Here, a TOF image sensor correlates the reference signal with a mixture of (a) direct surface reflection (black); (b) surface inter-reflection or multiple scattering inside surface volume (red); and (c) subsurface scattering (green). The camera observes a mixture of (a), (b) and (c), as indicated in blue, resulting in erroneous depth estimation known as MPI. While this phenomenon has mostly been addressed as a limitation of TOF depth imaging [96], there are a number of recent works that deliberately rely on MPI for new imaging applications, such as non-line-of-sight imaging [46] and material classification [136].

2.4 Post-Processing

The encoded temporal information in CMOS and TOF measurements motivates us to devise new temporal-structure-aware decoding algorithms for image restoration and scene understanding. In this section, we review past works on motion deblurring and TOF depth reconstruction, and discuss ways to reformulate the traditional post-processing algorithms with temporal constraints.

2.4.1 Motion Blur Removal

Traditionally, a blurry image or video frame B can be modeled as the linear combination of images observed over a number of camera poses, assuming that the camera adopts a global shutter mechanism,

\[ B = \sum_i w_i L_i + N, \quad \text{s.t.} \quad \sum_i w_i = 1. \tag{2.6} \]

Here, $L_i$ is the latent sharp image L as observed from the i-th discrete camera pose, $w_i$ is the weight of that pose, and N models the noise.

There exist two main approaches to recovering L from its noisy and blurry observation B: deconvolution-based methods that solve inverse problems, and those that rely on multi-image aggregation and fusion.

Using Deconvolution

Motion blur from camera shake is one of the most noticeable degradations in hand-held photography. Without information about the underlying camera motion, an estimate of the latent sharp image can be restored using so-called blind deblurring, which jointly approximates a blurring kernel and the underlying sharp image via deconvolution [72].
In recent years many successful methods have been introduced [28, 70, 94, 120, 128, 154, 155]; see [145] for a recent survey. Some representative work on this problem includes Cho et al. [17], in which salient edges and the FFT are used to speed up uniform deblurring, and the introduction of motion density functions by Gupta et al. [39], which specializes in non-uniform cases. Maximum A Posteriori (MAP) estimation is a popular way to formulate the objective of blind deblurring; a theoretical analysis of its convergence was recently conducted in [106].

Multiple-image deconvolution methods use additional information to alleviate the severe ill-posedness of single-image deblurring. These approaches collect, for example, image bursts [55], blurry-noisy pairs [156], flash/no-flash image pairs [108], gyroscope information [103], high frame rate sequences [134], or stereo pairs [119] for deblurring. These methods generally assume static scenes and require the input images to be aligned. For video, temporal information [79], optical flow [63] and scene models [101, 151] have been used for improving both kernel and latent frame estimation.

All of the above approaches strongly rely on the accuracy of the assumed image degradation model (blur, motion, noise) and its estimation, and thus may perform poorly when the simplified degradation models are insufficient to describe real data, or when the model estimation is suboptimal. As a result, these approaches tend to be more fragile than aggregation-based methods [23], and often introduce undesirable artifacts such as ringing and amplified noise.

Multi-Image Aggregation

Multi-image aggregation methods directly combine multiple images in either the spatial or the frequency domain without solving any inverse problem. Lucky imaging is a classic example, in which multiple low-quality images are aligned and the best pixels from different ones are selected and merged into the final result [60, 74].
For denoising, this has been extended to video using optical flow [81] or piecewise homographies [85] for alignment.

For video deblurring, aggregation approaches rely on the observation that in general not all video frames are equally blurred. Sharp pixels thus can be transferred from nearby frames to deblur the target frame, using for example homography alignment [90]. Cho et al. further extend this approach using patch-based alignment [19] for improved robustness against moving objects. The method, however, cannot handle large depth variations due to the underlying homography motion model, and the patch matching process is computationally expensive. Klose et al. [67] show that 3D reconstruction can be used to project pixels into a single reference coordinate system for pixel fusion. Full 3D reconstruction, however, can be fragile for highly dynamic videos.

Recently, Delbracio and Sapiro [22] show that aggregating multiple aligned images in the Fourier domain can lead to effective and computationally highly efficient deblurring. This technique was extended to video [23], where nearby frames are warped via optical flow for alignment. This method is limited by optical flow computation and evaluation, which is not reliable near occlusions and outliers.

All of the above approaches have explicit formulations on how to fuse multiple images. In Chapter 4, we instead adopt a data-driven approach to learn how multiple images should be aggregated to generate an output that is as sharp as possible.

2.4.2 Depth Ambiguity Compensation

TOF image sensors are a class of devices that have been well explored for use in depth acquisition [38, 117] and since extended for various applications. When operated as range imagers, the quality delivered by correlation sensors suffers from multi-path interference, whose removal has therefore been the subject of extensive research [25, 31, 96].
Although MPI, phase unwrapping, and denoising are coupled, none of the existing reconstruction methods address them in a joint and computationally efficient manner.

Depth Calculation

In the ideal case where indirect light paths are not present, one can reliably recover the scene depth¹ at each pixel by capturing a pair of modulated exposures $b_{\omega,\psi}$ (Equation 2.5) with the same ω but different ψ as

\[ d = \frac{c\,\phi}{2\omega}, \tag{2.7} \]

where

\[ \phi = \operatorname{atan2}(b_{\omega,\pi/2},\, b_{\omega,0}) \tag{2.8} \]

is the measured phase, and c denotes the speed of light. The amplitude of $b_{\omega,0} + j\, b_{\omega,\pi/2}$ is proportional to the backscattering strength and often serves as a measure of confidence for depth reconstruction. While computationally cheap, applying Eq. 2.7 to real-world data often leads to poor depth estimates. This is not only because of sensor noise, but also because the measurements from Eq. 2.8 are inherently ambiguous due to phase wrapping and MPI, which require solving the ill-posed reconstruction problems described in the following.

¹ Converting distance to depth is trivial given the camera intrinsic matrix. We use depth, distance, and path length interchangeably in this work.

Phase Unwrapping

Due to the periodic nature of the phase measurements in Eq. 2.8, the depth estimate also "wraps around", and is only unambiguous for distances smaller than half of the modulation wavelength, i.e. in the $[0, c\pi/\omega]$ range. An established method for resolving phase ambiguity acquires measurements at two different modulation frequencies [26], preserving long distance range by unwrapping high-frequency phases with their lower-frequency counterpart.

Specifically, dual-frequency methods disambiguate the true depth from other, phase-wrapped candidates by measuring b at two different frequencies, $\omega_1$ and $\omega_2$ [26]. This effectively extends the maximum unambiguous depth range to $d_{\max} = c\pi / \mathrm{GCD}(\omega_1, \omega_2)$, where GCD denotes the greatest common divisor of the two frequencies.
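The depth calculation of Eqs. 2.7 and 2.8 and the two unambiguous ranges above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the exposures follow an ideal, noise-free, direct-only model (b at offset 0 proportional to cos φ, b at offset π/2 proportional to sin φ), the modulation frequencies are integers in Hz so that the GCD is well defined, and all function names and the 40/70 MHz pair are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def phase_from_pair(b_0, b_90):
    """Eq. 2.8: phase from exposures with offsets 0 and pi/2, mapped to [0, 2*pi)."""
    return math.atan2(b_90, b_0) % (2.0 * math.pi)

def depth_from_phase(phi, freq_hz):
    """Eq. 2.7: d = c * phi / (2 * omega), with omega = 2 * pi * freq_hz."""
    return C * phi / (2.0 * 2.0 * math.pi * freq_hz)

def unambiguous_range(freq_hz):
    """Single-frequency range c * pi / omega, which equals c / (2 * f)."""
    return C / (2.0 * freq_hz)

def dual_frequency_range(f1_hz, f2_hz):
    """d_max = c * pi / GCD(omega1, omega2); the 2*pi factors reduce it to c / (2 * gcd(f1, f2))."""
    return C / (2.0 * math.gcd(f1_hz, f2_hz))

def ideal_pair(depth_m, freq_hz):
    """Assumed noise-free, direct-only exposures: b_0 = cos(phi), b_90 = sin(phi)."""
    phi = 2.0 * (2.0 * math.pi * freq_hz) * depth_m / C
    return math.cos(phi), math.sin(phi)

F1, F2 = 40_000_000, 70_000_000  # assumed modulation frequencies in Hz

# A target inside the single-frequency range is recovered correctly
near = depth_from_phase(phase_from_pair(*ideal_pair(2.0, F1)), F1)
# A target beyond the range wraps around and is reported one full range too close
far = depth_from_phase(phase_from_pair(*ideal_pair(5.0, F1)), F1)
```

With these assumed frequencies, the 40 MHz range is roughly 3.75 m, so the 5 m target is reported at about 1.25 m, while the dual-frequency range grows to roughly 15 m because the two frequencies share a 10 MHz common divisor.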
To recover an unknown depth $d^*$ one can create a lookup table $T_\omega(\hat{d})$ between candidate depth $\hat{d} \in [0, \ldots, d_{\max}]$ and phase observations $\Phi = [\phi_1, \phi_2]$, and solve the following 1D search problem [40],

\[ d^* = \arg\min_{\hat{d}} \left\| T_{[\omega_1,\omega_2]}(\hat{d}) - \Phi \right\|_2. \tag{2.9} \]

However, in the presence of noise, this idealized per-pixel method often fails due to the lack of spatial priors. Recently, Lawin et al. [75] proposed kernel density function estimates as a hand-crafted prior for more robust phase unwrapping, by jointly optimizing spatial, unwrapping and phase likelihoods. However, their weights must be chosen empirically or via cross validation on additional data.

While effective for direct-only scenes, the dual-frequency acquisition approach also becomes inaccurate in the presence of MPI. When multi-frequency measurements are not accessible, statistical priors, such as the amplitude smoothness [41] and surface normal constraints [20, 27], can be leveraged. Our method in Chapter 5, however, is not built on such hand-crafted priors that only model a subset of the rich statistics of natural scenes. Instead, we learn the spatial prior directly from a large corpus of training data.

MPI Correction

The second core challenge when applying Eq. 2.7 to real-world scenarios is that instead of a single directly-reflected path, scenes with different geometric and material properties can cause multiple light paths to be linearly combined in a single sensor pixel, as illustrated in Fig. 2.1. For common sinusoidal modulation, these path mixtures lead to measurements that are identical to the ones from longer direct paths, resulting in inherently ambiguous measurements.

MPI distortions are commonly reduced in a post-processing step.
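To make the ambiguity under sinusoidal modulation concrete, the following sketch mixes a direct return with a weaker, longer indirect return and applies Eqs. 2.7 and 2.8 to the sum. It assumes an ideal model in which each path contributes a phasor whose real and imaginary parts play the roles of the exposures at phase offsets 0 and π/2; the amplitudes, path depths, frequency, and function names are hypothetical.

```python
import cmath
import math

C = 299_792_458.0  # speed of light in m/s

def path_phasor(amplitude, depth_m, freq_hz):
    """Phasor b_0 + j*b_90 of a single return under ideal sinusoidal modulation (assumed model)."""
    phi = 2.0 * (2.0 * math.pi * freq_hz) * depth_m / C
    return amplitude * cmath.exp(1j * phi)

def depth_from_phasor(z, freq_hz):
    """Eqs. 2.7 and 2.8 applied to the real and imaginary parts of a phasor."""
    phi = math.atan2(z.imag, z.real) % (2.0 * math.pi)
    return C * phi / (2.0 * 2.0 * math.pi * freq_hz)

FREQ = 40_000_000                        # assumed modulation frequency in Hz
direct = path_phasor(1.0, 1.0, FREQ)     # direct reflection from a surface 1.0 m away
indirect = path_phasor(0.5, 1.6, FREQ)   # weaker indirect return, equivalent to 1.6 m
mixed = direct + indirect                # the pixel only observes the sum

biased_depth = depth_from_phasor(mixed, FREQ)
# A single direct path at the biased depth, with the mixed amplitude, yields the same measurement
lookalike = path_phasor(abs(mixed), biased_depth, FREQ)
```

In this toy setting the estimated depth lands between 1.0 m and 1.6 m, and the mixed measurement cannot be told apart from a single direct path at that longer distance.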
A large body of work explores either analytic solutions to the simplified two-path or diffuse-only problems [32, 36], or attempts to solve MPI in isolation as a computationally costly optimization problem [31, 61] with strong assumptions on the scene sparsity. Formalizing the intensity-modulated light source in homodyne mode as $g_\omega(t)$, E(t) becomes a superposition of many attenuated and phase-shifted copies of $g_\omega(t)$, along all possible paths of equal travel time τ as previously derived in Section 2.2:

\[ E(t) = \int_0^{\tau_{\max}} \alpha(\tau)\, g_\omega(t - \tau)\, d\tau. \tag{2.10} \]

When E(t) is substituted in Eq. 2.5, we model the correlation integral as in [44],

\[ b_{\omega,\psi} = \int_0^{\tau_{\max}} \alpha(\tau) \cdot \rho(\omega, \psi/\omega + \tau)\, d\tau, \tag{2.11} \]

where the scene-independent functions $f_\omega$ and $g_\omega$ have been folded into ρ, which depends only on the imaging device and can be calibrated in advance. Essentially, Eq. 2.11 maps the latent, scene-dependent TPSF α(τ) to the sensor observations $b_{\omega,\psi}$. This motivates us to devise the learning framework for material representation, which will be described in Chapter 6. When the TPSF, α(τ), is carefully modeled, a number of MPI-related inverse problems can be solved effectively, including transient imaging [44, 80] and imaging in scattering media [47]. Note that while MPI is commonly treated as an unwanted measurement distortion, recent approaches deliberately rely on it for non-line-of-sight imaging [46] and material classification [129].

Alternative Acquisition

Recent alternative approaches attempt to resolve ambiguities in the capture process. Gupta et al. [40] propose to separate indirect illumination using high-frequency modulations in the GHz range, which remains theoretical due to the limitations of existing hundred-MHz-range CMOS techniques.
A number of works have proposed hybrid structured light-TOF systems [5, 97, 99], requiring coded and carefully synchronized illumination with a significantly enlarged footprint due to the projector-camera baseline, thus removing many of the inherent benefits of TOF technology.

2.5 Computational Approaches

Throughout this thesis, we adopt various computational approaches depending on the characteristics of each problem. In this section, we briefly introduce these methods and review previous work on their development and applications.

2.5.1 Numerical Optimization

A mathematical optimization problem can be represented in the general form

\[ \begin{aligned} \text{minimize} \quad & f_0(x) \\ \text{subject to} \quad & f_i(x) \le b_i, \quad i = 1, \ldots, m, \end{aligned} \tag{2.12} \]

where the vector $x = (x_1, \ldots, x_n)$ denotes the optimization variable, the function $f_0 : \mathbb{R}^n \to \mathbb{R}$ is the objective function, and $f_i : \mathbb{R}^n \to \mathbb{R}$, $i = 1, \ldots, m$ are the constraint functions. A vector $x^*$ is called optimal if it has the smallest objective value among all vectors that satisfy the constraints [9].

In this thesis, we are interested in a specific kind of optimization problem known as the convex optimization problem. More formally, the objective and constraint functions of a convex problem satisfy the inequality

\[ f_i(\alpha x + \beta y) \le \alpha f_i(x) + \beta f_i(y) \tag{2.13} \]

for all $x, y \in \mathbb{R}^n$ and all $\alpha, \beta \in \mathbb{R}$ with $\alpha + \beta = 1$ and $\alpha \ge 0$, $\beta \ge 0$.

One common type of convex optimization problem in computational photography has the form

\[ \text{minimize} \quad f(x) + g(x), \tag{2.14} \]

where $f, g : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ are closed proper convex functions. We can write this problem in an equivalent consensus form

\[ \begin{aligned} \text{minimize} \quad & f(x) + g(z) \\ \text{subject to} \quad & x - z = 0, \end{aligned} \tag{2.15} \]

where the consensus constraint x = z must be satisfied. The augmented Lagrangian of Equation 2.15 is given by

\[ L_\rho(x, z, y) = f(x) + g(z) + y^T (x - z) + (\rho/2) \| x - z \|_2^2, \tag{2.16} \]

where ρ > 0 is the weight of the quadratic penalty term for the equality constraint, and $y \in \mathbb{R}^n$ is a dual variable associated with the constraint.
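As a small worked instance of the consensus form in Equation 2.15, the sketch below alternately minimizes the augmented Lagrangian of Equation 2.16 over x and z and then updates the dual variable y, which is the ADMM scheme. The scalar toy problem, with f(x) = ½(x − a)² and g(z) = λ|z|, and all function names are assumptions chosen because both partial minimizations have closed forms; it is not the solver used in the thesis.

```python
def soft_threshold(v, kappa):
    """Minimizer over z of kappa*|z| + 0.5*(z - v)**2 for a scalar v."""
    return max(v - kappa, 0.0) - max(-v - kappa, 0.0)

def admm_scalar(a, lam, rho=1.0, iters=200):
    """Minimize 0.5*(x - a)**2 + lam*|z| subject to x - z = 0 by alternating updates."""
    x = z = y = 0.0
    for _ in range(iters):
        # x-step: minimize L_rho over x; setting the derivative to zero gives a closed form
        x = (a + rho * z - y) / (1.0 + rho)
        # z-step: minimize L_rho over z, which reduces to a soft threshold
        z = soft_threshold(x + y / rho, lam / rho)
        # dual step: move y along the remaining consensus error x - z
        y = y + rho * (x - z)
    return x, z, y

x_big, z_big, y_big = admm_scalar(3.0, 1.0)       # |a| > lam: the solution is a shrunk copy of a
x_small, z_small, y_small = admm_scalar(0.5, 1.0)  # |a| <= lam: the solution is exactly zero
```

For this toy objective the known minimizer is the soft threshold of a by λ, so the iterates should approach 2 for a = 3 and 0 for a = 0.5, with x and z agreeing at convergence.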
Equation 2.15 can be solved iteratively with the Alternating Direction Method of Multipliers (ADMM), which alternately minimizes the augmented Lagrangian in Equation 2.16 and converges under mild assumptions [9, 10]:

\[ \begin{aligned} x^{k+1} &:= \arg\min_x \left( f(x) + (\rho/2) \| x - z^k + (1/\rho) y^k \|_2^2 \right) \\ z^{k+1} &:= \arg\min_z \left( g(z) + (\rho/2) \| x^{k+1} - z + (1/\rho) y^k \|_2^2 \right) \\ y^{k+1} &:= y^k + \rho (x^{k+1} - z^{k+1}). \end{aligned} \tag{2.17} \]

As can be seen, the dual variable y is the scaled running sum of the consensus errors, and all the objective terms are handled completely separately. In fact, after defining the proximal operator [10] $\operatorname{prox}_{\lambda f} : \mathbb{R}^n \to \mathbb{R}^n$ of f as

\[ \operatorname{prox}_{\lambda f}(v) = \arg\min_x \left( f(x) + (1/2\lambda) \| x - v \|_2^2 \right), \tag{2.18} \]

Equation 2.17 can be rewritten in the proximal form

\[ \begin{aligned} x^{k+1} &:= \operatorname{prox}_{\lambda f}(z^k - u^k) \\ z^{k+1} &:= \operatorname{prox}_{\lambda g}(x^{k+1} + u^k) \\ u^{k+1} &:= u^k + x^{k+1} - z^{k+1}, \end{aligned} \tag{2.19} \]

where $u^k = (1/\rho) y^k$ is the scaled dual variable and $\lambda = 1/\rho$. Now, all the functions in Equation 2.19 are only accessed through their proximal operators, which are well known in optimization. For example, the proximal operators of the $\ell_1$ norm and of the indicator function of a set C are

\[ \begin{aligned} \operatorname{prox}_{\lambda |\cdot|}(v) &= (v - \lambda)_+ - (-v - \lambda)_+, \\ \operatorname{prox}_{\lambda\, \mathrm{ind}_C(\cdot)}(v) &= \Pi_C(v), \end{aligned} \tag{2.20} \]

where $(\cdot)_+ = \max(\cdot, 0)$ is applied elementwise and $\Pi_C$ denotes the Euclidean projection onto C.

Throughout this thesis, we apply optimization methods suited to each problem: for the inverse problem of latent image update in Chapter 3, we adopt ADMM; for the stochastic objective optimization in deep neural networks (Chapter 4, Chapter 5), we use Adam [65]. We also show that optimization serves as a foundation for advanced machine learning algorithms such as the derivation of the maximum-margin hyperplane for SVM classifiers, which we discuss in Chapter 6.

2.5.2 Learning-Based Methods

Different from the optimization-based methods where the data fidelity and prior terms are often designed by hand, a learning problem takes a set of data samples as input, and tries to predict unknown properties of data. We can separate learning problems into two categories:

• supervised learning, and
• unsupervised learning.

Supervised learning covers the scenario where additional attributes that we want to predict come with the data.
Depending on whether the output value is discrete or continuous, this problem can be either a classification problem or a regression problem.

In Chapter 4 and Chapter 5, we will be focusing on the regression problem, where the goal is to generate the desired representation of the data in domain B from a set of correlated inputs in domain A. For example, in image colorization [53], domain A consists of images in grayscale, and the goal is to learn the mapping function from grayscale to color space in domain B. Usually, there exists structural similarity between the two domains [54].

In Chapter 6, we will be focusing on the classification problem, where the goal is to predict the class of unlabeled data from labeled data. For example, in the famous Iris flower classification problem [30], four features were measured from each sample: the length and the width of the sepals and petals. The aim is to assign each unlabeled feature vector to one of the species.

Convolutional Neural Networks

Figure 2.4: An example CNN architecture [140].

CNNs are a specific type of learning framework for approximating complex unknown functions. As an extension to artificial neural networks, they are made up of neurons that have learnable weights and biases. Unlike ordinary neural networks, CNN architectures make the explicit assumption that the inputs are images, which allows for certain properties to be encoded into the architecture. This also comes with the benefit that the number of weights to be learned can be drastically reduced due to the weight sharing scheme in convolutional layers [76]. An example of a CNN is shown in Figure 2.4, where a regular 3-layer neural network and the corresponding CNN are shown on the left and right-hand side respectively.
As can be seen, the layers of a CNN have neurons arranged in three dimensions (width, height, depth), take a 3D volume as input, and transform it into a 3D output volume of neuron activations.

Similar to the objective function in numerical optimization, the whole CNN expresses a single differentiable score function that is trained against a chosen loss function. Parameter updates are done by backpropagation, and there are a number of optimizers, such as SGD and Adam [65], that can be applied depending on the characteristics of the problem.

Data-Driven Image Generation

Recently, CNNs have been applied to achieve leading results on a wide variety of reconstruction problems. These methods tend to work best when large training datasets can be easily constructed, for example by adding synthetic noise for denoising [153], removing content for inpainting [104], removing color information for colorization [53], or downscaling for super-resolution [24, 84]. Super-resolution networks have been applied to video sequences before [52, 62, 121], but these approaches address a different problem, with its own set of challenges. In this work we focus on deblurring, where blurry frames can vary greatly in appearance from their neighbors, making information aggregation more challenging.

CNNs have also been used for single- [15, 132] and multi- [149] image deblurring, using synthetic training data. One problem with synthetic blur is that real blur has significantly different characteristics, as it depends on both the scene depth and object motion. In Chapter 4, we show that by leveraging multiple video frames, training on real blur, and directly estimating the sharp images, our method can produce substantially better results.

While feedforward architectures work well for local operations on natural images, a large receptive field is typically desired for non-local inverse problems, such as MPI removal.
Recently, Conditional Generative Adversarial Networks (CGAN) have shown high-quality image translation results under supervised [54] and unsupervised [159] settings. Unlike traditional Generative Adversarial Networks (GAN) [37], in a CGAN both the generator G and the discriminator D observe an input image. By combining a GAN loss with traditional pixel losses, one can then learn to penalize structured differences between output and target images, without relying on domain knowledge [4]. This is made possible by the competing nature of G and D: G is trained to generate outputs that are indistinguishable from the data distribution, while D tries to discriminate between real data and model predictions. We adopt these successful CGAN strategies to train our depth generation network, and combine them with pixel loss and smoothness terms on the depth maps and their gradients.

When ground truth labels are difficult to collect, unsupervised learning becomes a powerful tool. Examples include [4, 158], which exploit left-right / cross-channel structures, and [159], in which unpaired consistency between two image domains is used for image translation purposes. When a self-regularization term is added to the input image, [123] shows that realism can be added to synthetic training data with unlabeled ground truths. We adopt a similar approach both to train our depth prediction network and to refine the statistics of the synthetic dataset with real TOF captures.

Chapter 3

Rolling Shutter Image Deblurring

In this chapter, we present a single-image blind deconvolution method for CMOS cameras that adopt the rolling shutter mechanism. We model a Rolling Shutter Motion Blurred (RSMB) image with its sequential-exposure characteristic, and iteratively optimize for both the latent sharp image and the parameters associated with the camera motion trajectory.
This allows us to restore a sharp and undistorted output, which will be demonstrated with experimental results on synthetic and real data.

3.1 Introduction

Figure 3.1: Simulated blur kernels for global (a) and rolling shutter (b) cameras. Blur kernels in global shutter images tend to be spatially invariant (assuming a static scene and negligible in-plane rotation) while in a rolling shutter image they are always spatially variant. (c) Spatial variance of the blur kernel in a real RSMB image indicated by light

Motion blur from camera shake is one of the most noticeable degradations in hand-held photography. One common assumption made in almost all previous deblurring methods [17, 39, 50, 68, 106, 135, 148, 154, 155] is the use of a global shutter sensor. That is, each part of the image is regarded as having been exposed during the exact same interval of time. This assumption, however, does not hold for images captured by a CMOS sensor that uses an electronic rolling shutter, which is the case for the majority of image sensors on the market. This is because a rolling shutter exposes each row sequentially rather than simultaneously. The popularity of rolling shutter sensors in mobile devices makes the issue especially relevant, as such devices are particularly susceptible to motion blur, especially in low light scenarios. Figure 3.1 demonstrates some samples of the Point Spread Function (PSF) simulated by applying the same motion to a global and a rolling shutter camera during exposure. Assuming a static scene and no in-plane rotation, the blur kernel of the global shutter image in Figure 3.1(a) is shift-invariant, since all pixels integrate over the same motion trajectory.
For the rolling shutter image, however, different scanlines integrate over a slightly different segment of the trajectory, resulting in a shift-variant kernel in Figure 3.1(b) even in the case of a static scene and no in-plane rotation. Thus, when applied to the wide range of RSMB images such as that shown in Figure 3.1(c), existing methods [17, 39, 50, 68, 106, 135, 148, 154, 155] are destined to fail.

The shift variance of the rolling shutter kernel can be modeled by capturing the overall camera motion with a gyroscope and computing different kernels for each scanline [102]. Without such specialized hardware, an alternative is to solve different blind deconvolution problems for blocks of scanlines, but these solutions would have to be stitched together, which is made more difficult by the rolling shutter wobble.

Figure 3.2: Illustration of the global shutter (top) and rolling shutter (middle) sensor mechanisms. The horizontal bar represents the exposure interval of the i-th scanline in the sensor. M+1 is the number of scanlines in both sensors, $t_e$ denotes the exposure time of each scanline, and $t_r$ is the time the rolling shutter sensor takes to read out the i-th scanline before proceeding to the next. When camera motion exists (bottom), all pixels in a global shutter camera integrate over the same motion trajectory, while different scanlines integrate over a slightly different segment of the trajectory in a rolling shutter sensor.

In this chapter, we introduce a single image approach that deblurs RSMB images by estimating and parametrically modeling each degree of freedom of the camera motion trajectory as polynomial functions of time. The motivation of this parametric representation is based on a recent study of human camera shake from Kohler et al. [68], see Section 3.2.2.
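The scanline timing summarized in the caption of Figure 3.2 can be sketched as follows: every row is exposed for the same duration, but in a rolling shutter the window of row i starts i readout intervals later, so each row averages a different segment of the camera trajectory. The 1D polynomial trajectory, the timing constants, and the function names below are hypothetical values chosen for illustration only.

```python
def exposure_window(i, t_e, t_r, rolling=True):
    """Start and end time of scanline i; a global shutter uses the same window for all rows."""
    start = i * t_r if rolling else 0.0
    return start, start + t_e

def mean_offset(trajectory, window, samples=100):
    """Average camera offset seen by a scanline, sampled uniformly over its exposure window."""
    start, end = window
    times = [start + (end - start) * k / samples for k in range(samples + 1)]
    return sum(trajectory(t) for t in times) / len(times)

def shake(t):
    """Hypothetical 1D camera translation (in pixels) as a polynomial in time (seconds)."""
    return 400.0 * t - 3000.0 * t * t

T_E, T_R, ROWS = 0.02, 0.0001, 480  # assumed exposure time, readout time, scanline count

rolling_offsets = [mean_offset(shake, exposure_window(i, T_E, T_R)) for i in range(ROWS)]
global_offsets = [mean_offset(shake, exposure_window(i, T_E, T_R, rolling=False))
                  for i in range(ROWS)]
```

In this toy setup every entry of `global_offsets` is identical, whereas `rolling_offsets` drifts from the first row to the last, which is the row-dependent behavior that a shift-invariant blur model cannot capture.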
To achieve good initial estimates of the motion trajectory coefficients, we adopt a back-projection technique [50] to estimate higher dimensional camera poses from their 2D projections, i.e., PSFs. The specific contributions of this work are:
• A blind deblurring technique that handles the characteristics of rolling shutter;
• A method for motion trajectory estimation and refinement from a single RSMB image.

Throughout the chapter we assume the scene to be sufficiently far away so that planar homographies are able to describe the transformations. Based on an analysis of the data from Kohler et al. [68], we also determine that in-plane rotation is negligible except for very wide angle lenses, allowing us to restrict ourselves to a 2D translational motion model.

3.2 Method

3.2.1 Motion Blur in Rolling Shutter Cameras

In the global shutter case, motion blur with a spatially varying kernel can be modeled as a blurred image $\mathbf{G} \in \mathbb{R}^{(M+1)\times(N+1)}$ resulting from the integration over all instances of the latent image $\mathbf{L} \in \mathbb{R}^{(M+1)\times(N+1)}$ seen by the camera at poses along its motion path during the exposure period $t \in [0, t_e]$:

$$\mathbf{G} = \frac{1}{t_e}\int_{0}^{t_e} \mathbf{L}_{\mathbf{p}(t)}\,dt + \mathbf{N}, \tag{3.1}$$

where $\mathbf{p}(t) = (p_1(t), p_2(t), p_3(t)) \in \mathbb{R}^3$ corresponds to the camera pose at time $t$, and $\mathbf{L}_{\mathbf{p}(t)}$ is the latent image transformed to the pose $\mathbf{p}$ at a given time. $\mathbf{N}$ represents a noise image, which is commonly assumed to follow a Gaussian distribution in each pixel.

The above model proves to be effective for formulating a Global Shutter Motion Blurred (GSMB) image but cannot be directly applied to the RSMB case, as scanlines in rolling shutter sensors are exposed sequentially rather than simultaneously, as is assumed in Equation (3.1). Specifically, as illustrated in Figure 3.2, although each scanline is exposed for the same duration $t_e$, the exposure window is offset by $t_r$ from scanline to scanline [11]. We thus rewrite Equation (3.1) for each row $\mathbf{b}_i$ in an RSMB image $\mathbf{B} = (\mathbf{b}_0^T, \ldots, \mathbf{b}_M^T)^T$ as

$$\mathbf{b}_i = \frac{1}{t_e}\int_{i\cdot t_r}^{i\cdot t_r + t_e} \mathbf{l}_i^{\mathbf{p}(t)}\,dt + \mathbf{n}_i, \tag{3.2}$$

where the subscript $i$ also indicates the i-th row in $\mathbf{L}_{\mathbf{p}(t)}$ and $\mathbf{N}$.

Equation (3.2) can be expressed in discrete matrix-vector form after assuming a finite number of time samples during the exposure of each row. Assuming that the camera intrinsic matrix $\mathbf{C} \in \mathbb{R}^{3\times 3}$ is given, $b_{ij} \in \mathbf{b}_i = (b_{i0}, \ldots, b_{iN})$ can be exactly determined at any pixel location $\mathbf{x} = (i, j)^T$ in $\mathbf{B}$:

$$b_{ij} = \frac{1}{|T_i|}\sum_{t \in T_i} \Gamma_{\mathbf{L}}\big(w(\mathbf{x}; \mathbf{p}(t))\big) + n_{ij}, \tag{3.3}$$

where $T_i = \{\, i\cdot t_r + \tfrac{j}{K} t_e \,\}_{j=0 \ldots K}$ is a set of uniformly spaced time samples in the exposure window of row $i$, $\Gamma_{\mathbf{L}}(\cdot)$ is the function that bi-linearly interpolates the intensity at some sub-pixel position in $\mathbf{L}$, and $w(\cdot)$ is a warping function [7, 18] that maps positions $\mathbf{x}$ from the camera frame back to the reference frame of the latent image $\mathbf{L}$ according to the current camera pose $\mathbf{p}$:

$$w(\mathbf{x}; \mathbf{p}) = \frac{\mathbf{D}\mathbf{H}(\mathbf{D}^T\mathbf{x} + \mathbf{e})}{\mathbf{e}^T\mathbf{H}(\mathbf{D}^T\mathbf{x} + \mathbf{e})}. \tag{3.4}$$

Here $\mathbf{H} = \mathbf{C}\mathbf{R}\mathbf{C}^{-1}$ is a homography matrix, $\mathbf{R} = e^{\mathbf{p}_\times}$ is its rotational component (footnote 1), $\mathbf{e} = (0, 0, 1)^T$, and

$$\mathbf{D} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}.$$

From Equations (3.3) and (3.4) a sparse convolution matrix $\mathbf{K}$ can be created and Equation (3.2) can be rewritten as

$$\mathbf{b} = \mathbf{K}\mathbf{l} + \mathbf{n}, \tag{3.5}$$

where $\mathbf{b}$, $\mathbf{l}$ and $\mathbf{n}$ respectively are the RSMB input, latent image, and noise, all in vector form.

Footnote 1: $\mathbf{p}_\times = \begin{bmatrix} 0 & -p_3 & p_2 \\ p_3 & 0 & -p_1 \\ -p_2 & p_1 & 0 \end{bmatrix}$, with $e^{\mathbf{p}_\times}$ evaluated as the matrix exponential.

3.2.2 Camera Motion Modeling

The sequence of camera poses $\mathbf{p}(t)$ describes a 1D continuous path through the 3D camera pose space [39]. Before explaining how this serves as an important constraint for blind Rolling Shutter Motion Deblurring (RSMD), we first seek a suitable model for $\mathbf{p}(t)$ from $t = 0$ to $t_e + M t_r$, and then rewrite the RSMB image formation model based on this modeling.

Exposure Time of Interest

To discuss the model for $\mathbf{p}(t)$, we need to be more specific about the temporal range of $\bigcup_{i=0}^{M} T_i$ for exposing the whole image. If the exposure time is large compared to the aggregate readout time for a full frame (i.e.
$t_e \gg M t_r$), then the rolling shutter distortion is dominated by the motion blur and conventional deblurring methods can be used with good results. Unfortunately, this scenario can only occur in still photography, since in video mode the exposure time cannot exceed one over the frame rate (see Chapter 4). The other extreme is that there is sufficient light for the exposure time to be very short compared to the readout time (i.e. $t_e \ll M t_r$). In this case, the blur is typically negligible, except for very fast camera motion. In between these two extremes is the common scenario where the exposure time approaches the aggregate readout time. While this scenario is particularly common in small-format cameras and cell-phones, it affects all cameras in lower light conditions. In this case, both rolling shutter distortion and motion blur are present, and existing methods cannot be used.

Pose Trajectory Model

To get a sense of what the sequence of $\mathbf{p}(t)$ typically looks like for a shaky hand-held camera during the time of interest, we performed an analysis over all the 40 publicly available camera motion trajectories from Kohler et al. [68]. The camera pose was recorded at 500Hz by a Vicon system while 6 human subjects were asked to hold a camera still in a natural way. Detailed descriptions of the experiment can be found in the original paper.

Figure 3.3: Left: a segment of the camera motion trajectory from [68] in 3D rotational pose space (Rot x, Rot y, Rot z in degrees over time in seconds); Right: the resulting pixel shift at the corner of a full-HD image (in pixels over time in seconds), assuming the rotation center is the image center. With a moderately angled lens (50mm in this example), the effect of roll (Rot z) reduces to a subpixel-level shift in most handheld shake cases.
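As a rough numerical illustration of the timing quantities used in Equations (3.2) and (3.3), the following Python sketch computes the exposure window of a scanline, the uniformly spaced sample set $T_i$, and the total capture span $t_e + M t_r$. The function names and the example values (450 scanlines, an exposure equal to the aggregate readout time, 31 samples per row) are assumptions chosen for illustration and are not taken from the original implementation.

```python
def exposure_window(i, t_e, t_r):
    """Start and end time of the exposure of scanline i (Equation 3.2)."""
    start = i * t_r
    return start, start + t_e


def time_samples(i, t_e, t_r, K):
    """K + 1 uniformly spaced samples T_i inside the window of scanline i."""
    start, _ = exposure_window(i, t_e, t_r)
    return [start + (j / K) * t_e for j in range(K + 1)]


def capture_span(M, t_e, t_r):
    """Time from the first row opening to the last row (index M) closing."""
    return M * t_r + t_e


# Illustrative values (assumed): 450 scanlines, exposure equal to total readout.
M = 449
t_e = 1.0 / 50.0
t_r = t_e / M
samples_last_row = time_samples(M, t_e, t_r, K=30)
print(len(samples_last_row), capture_span(M, t_e, t_r))
```

With these assumed values the whole frame spans roughly twice the per-row exposure time, which is the intermediate regime where both rolling shutter distortion and motion blur matter.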
Here we illustrate in Figure 3.3 the three rotational pose trajectories during a randomly selected 1/25 s segment (108-128) from the 39th dataset.

As can be seen, even though the blur kernel of an RSMB image varies spatially, the decomposed $p_1(t)$, $p_2(t)$, $p_3(t)$ from the underlying camera motion are in fact parameterizable. This observation generalizes to the other samples from the same dataset. Therefore we decide to fit polynomial functions to the pose trajectories

$$\mathbf{p}(t) = \mathbf{t}\boldsymbol{\theta}, \tag{3.6}$$

where $\mathbf{t} = (t^P, \ldots, t^0)$, $t \in [0, M t_r + t_e]$, and $\boldsymbol{\theta}$ is a $(P+1)\times 3$ matrix having the coefficients of each polynomial function as entries. In this work, we set the polynomial degree $P$ to 3 or 4, which achieves a good fit. We note that quartic splines provide a better fit if the trajectory is more complex, i.e., the camera shakes at higher frequencies, but this is rarely the case for natural hand-held camera shakes within the exposure time of interest [68].

Finally, we also note that there is an interesting subset of RSMB images captured under medium to long focal lengths, where the contribution of in-plane rotation ($p_3(t)$) is, in fact, so small that it does not result in a noticeable blur. Figure 3.3 (right) shows an example of the maximal blur caused by in-plane rotation in a full-HD image for an assumed focal length of 50mm. In this example, the blur just due to the in-plane rotation is below one pixel even in the corner of the image.
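The two ideas above, evaluating the polynomial pose model of Equation (3.6) and checking how far a small roll moves a corner pixel, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the coefficient rows are ordered from the highest power to the constant term to match $\mathbf{t} = (t^P, \ldots, t^0)$, and the roll angle of 0.05 degrees is an illustrative value rather than a measurement from [68].

```python
import math


def eval_trajectory(theta, t):
    """Evaluate p(t) = t * theta with Horner's rule.

    theta: list of P + 1 rows (highest power first), one coefficient per pose dimension.
    """
    pose = [0.0] * len(theta[0])
    for row in theta:
        pose = [p * t + c for p, c in zip(pose, row)]
    return pose


def roll_shift_pixels(roll_deg, width, height):
    """Displacement of an image corner under a rotation about the image center."""
    radius = math.hypot(width / 2.0, height / 2.0)
    return 2.0 * radius * math.sin(math.radians(abs(roll_deg)) / 2.0)


# p(t) = (t^2 + 3, 2t + 4) evaluated at t = 2
print(eval_trajectory([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]], 2.0))
# Corner shift in a 1920x1080 frame for an assumed roll of 0.05 degrees
print(roll_shift_pixels(0.05, 1920, 1080))
```

For the assumed roll angle the corner moves by slightly less than one pixel, in line with the sub-pixel behavior described for Figure 3.3 (right).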
Similar results are obtained for other motions from Kohler et al.'s database. We therefore neglect in-plane rotation in this work, and reduce Equation (3.6) to a 2D yaw/pitch space instead of a full 3D rotational pose space for the rest of this chapter.

RSMB Modeling in Trajectory Coefficients

With the trajectory model defined in Equation (3.6), the convolution matrix $\mathbf{K}$ in Equation (3.5) becomes a function of $\boldsymbol{\theta}$:

$$\mathbf{b} = \mathbf{K}(\boldsymbol{\theta})\mathbf{l} + \mathbf{n}, \tag{3.7}$$

where $\mathbf{K}(\boldsymbol{\theta})$ is determined by rewriting Equations (3.3) and (3.4) in $\boldsymbol{\theta}$ accordingly.

3.2.3 Deblurring RSMB Image

Having the forward RSMB model defined in the form of a camera motion model, the latent image $\mathbf{l}$ can be recovered from $\mathbf{b}$ by solving an inverse problem. An overview of our blind RSMD approach is summarized in Algorithm 1 and we will describe it in detail in this section.

Objective

Our objective function for RSMD is given by

$$\min_{\mathbf{l},\boldsymbol{\theta}} \frac{1}{2}\|\mathbf{b} - \mathbf{K}(\boldsymbol{\theta})\mathbf{l}\|_2^2 + \mu\|\nabla\mathbf{l}\|_1, \tag{3.8}$$

which is composed of a data error term based on Equation (3.5) and a sparsity prior on the latent image gradient $\nabla\mathbf{l}$, weighted by a scalar $\mu$. Since the camera motion and kernel normalization [106] are inherently implemented in $\mathbf{K}$, no additional prior for $\mathbf{K}(\boldsymbol{\theta})$ is required in the objective, which differentiates our method from [39, 155].

Algorithm 1 RSMD algorithm overview.
Input: RSMB image $\mathbf{b}$, initial weight $\mu_0$
Output: Deblurred image $\mathbf{l}$, trajectory coefficients $\boldsymbol{\theta}$
1: Obtain the trajectory initialization $\boldsymbol{\theta}^0$ (Section 3.2.3)
2: for k = 1 to n do
3:   Update $\mathbf{l}$ (Equation 3.15)
4:   Update $\boldsymbol{\theta}$ (Equation 3.9)
5:   Decrease $\mu$ by $\tau$
6: end for

Similar to conventional blind deblurring algorithms, we update $\mathbf{l}$ and $\boldsymbol{\theta}$ in an alternating fashion. We initialize $\mu$ with a relatively large value $\mu_0$, so that in the early iterations only the most salient structure in $\mathbf{l}$ is preserved; this guides the refinement of the kernel coefficients $\boldsymbol{\theta}$ while the $\boldsymbol{\theta}$ estimates are not yet accurate. As the optimization progresses, we decrease $\mu$ by a factor of $\tau$ after each iteration to preserve more details in $\mathbf{l}$.
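The alternation in Algorithm 1 and the decreasing prior weight can be sketched as follows. This is only an outline under stated assumptions: `update_l` and `update_theta` are hypothetical callables standing in for the two subproblems, and clamping $\mu$ at a lower bound is one possible reading of how the schedule ends. The example values $\mu_0 = 10^{-2}$, $\tau = 0.7$ and a lower bound of $10^{-3}$ are the ones reported in Section 3.3.3.

```python
def mu_schedule(mu0, tau, mu_min):
    """Weights used per iteration: multiply by tau until mu_min is reached."""
    values = []
    mu = mu0
    while mu > mu_min:
        values.append(mu)
        mu = max(mu * tau, mu_min)
    values.append(mu_min)
    return values


def alternate(update_l, update_theta, theta0, schedule):
    """Alternate latent-image and trajectory updates in the order of Algorithm 1."""
    theta, latent = theta0, None
    for mu in schedule:
        latent = update_l(theta, mu)         # step 3: update l with theta fixed
        theta = update_theta(latent, theta)  # step 4: update theta with l fixed
    return latent, theta


schedule = mu_schedule(1e-2, 0.7, 1e-3)
print(len(schedule), schedule[0], schedule[-1])
```

Under these assumptions the schedule is short, so the number of outer iterations is fixed by $\mu_0$, $\tau$ and the lower bound rather than by a separate convergence test.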
Intermediate outputs are shown in Figure 3.5.

Update of Trajectory Coefficients

The objective for updating $\boldsymbol{\theta}$ is given by

$$\boldsymbol{\theta}^{k+1} = \arg\min_{\boldsymbol{\theta}} \sum_{i=0}^{M}\sum_{j=0}^{N} r_{ij}(\boldsymbol{\theta})^2, \tag{3.9}$$

where $r_{ij}(\boldsymbol{\theta})$ is the residual at $\mathbf{x}$, which depends on $\boldsymbol{\theta}$:

$$r_{ij}(\boldsymbol{\theta}) = b_{ij} - \frac{1}{|T_i|}\sum_{t \in T_i} \Gamma_{\mathbf{L}^k}\big(w(\mathbf{x}; \mathbf{t}\boldsymbol{\theta})\big). \tag{3.10}$$

Here $\mathbf{L}^k$ is the latent image estimated from the previous iteration $k$. Solving Equation (3.9) is a non-linear optimization task since pixel values in $\mathbf{L}$ are, in general, non-linear in $\boldsymbol{\theta}$.

Gauss-Newton Method

Motivated by image registration, i.e., the Lucas-Kanade algorithm [7], we adopt the Gauss-Newton method for this non-linear least squares problem. In each iteration, Gauss-Newton updates $\boldsymbol{\theta}$ by

$$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} + \Delta\boldsymbol{\theta}, \tag{3.11}$$

where $\Delta\boldsymbol{\theta}$ is the solution of the linear system

$$\mathbf{J}_r^T\mathbf{J}_r\,\mathrm{vec}(\Delta\boldsymbol{\theta}) = -\mathbf{J}_r^T\mathbf{r}. \tag{3.12}$$

Here $\mathbf{r}$ is the residual vector, and $\mathbf{J}_r$ is the Jacobian of the residual, both evaluated at $\boldsymbol{\theta}^k$.

The calculation of $\mathbf{J}_r$ is carried out as follows by applying the chain rule to Equation (3.10):

$$\nabla r_{ij}(\boldsymbol{\theta})\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^k} = -\frac{1}{|T_i|}\sum_{t} \mathbf{J}_{\Gamma_{\mathbf{L}^k}}(w)\,\mathbf{J}_w(\mathbf{p})\,\mathbf{J}_{\mathbf{p}}(\boldsymbol{\theta}), \tag{3.13}$$

where $\mathbf{J}_{\Gamma_{\mathbf{L}^k}}(w)$ is the gradient of image $\mathbf{L}^k$ at $w(\mathbf{x}; \mathbf{t}\boldsymbol{\theta}^k)$. We find Conjugate Gradients efficient for calculating $\Delta\boldsymbol{\theta}$ in Equation (3.12).

Figure 3.4: Pipeline of $\boldsymbol{\theta}^0$ initialization. (a) RSMB image; (b) Locally estimated latent patches and blur kernels; (c) Blur kernels after thinning operation; (d) Polynomial fitting (Rot x fitted, Rot y fitted) of the traced time-stamped data from (c), plotted over time in seconds and shown in units of both pixels and degrees.

Initialization of Trajectory Coefficients

A good initial estimate of $\boldsymbol{\theta}$ is essential for the convergence of Gauss-Newton. We approach this problem by solving blind deconvolution problems for blocks of several scanlines over which the PSF is assumed to be approximately constant for initialization purposes only. The recovered PSFs for each block are then back-projected into pose space, and initial trajectory estimates are fit.
An illustration of the initialization process is also given in Figure 3.4.

Local blur kernel estimation

Given an RSMB image (Figure 3.4a), we first divide it into several horizontal regions. Inside each region a first estimate of the kernel is recovered with a conventional blind deblurring algorithm

$$\min_{\mathbf{k},\mathbf{l}} \frac{1}{2}\|\mathbf{k} * \mathbf{l} - \mathbf{b}\|_2^2 + \omega_1\|\nabla\mathbf{l}\|_1 + \frac{\omega_2}{2}\|\mathbf{k}\|_2^2, \tag{3.14}$$

where $\mathbf{k}$ is the blur kernel and $\mathbf{b}$ is the blurry input for a horizontal region of $\mathbf{B}$. $\omega_1$ and $\omega_2$ are weights that balance the trade-off between the data term and the priors on the intrinsic image and blur kernel. The blind deconvolution in Equation (3.14) is itself a non-convex optimization problem, and a multi-scale strategy is adopted for avoiding local minima. We show examples of the estimated $\mathbf{l}$ and $\mathbf{k}$ in Figure 3.4b.

The blur kernel estimated from the previous step is usually noisy. We extract its skeleton by applying morphological thinning. To preserve the temporal information indicated by intensities, we convolve the initial blur kernel in Figure 3.4b with a normalized all-one kernel (e.g., 3×3), and then assign the corresponding intensities to its thinned version. Results of this step are shown in Figure 3.4c.

Back-projection and tracing

Back-projecting local blur kernels to the camera pose space is trivial for the yaw-pitch only case, but in order to reconstruct the motion trajectory a timestamp needs to be assigned to each camera pose. We achieve this by tracing the curves in Figure 3.4c from one end to the other while assigning to each pixel, as its timestamp, the intensity value accumulated along the way. To avoid outliers when fitting in the next step we also assign a confidence value to the traced data that is proportional to its intensity. Figure 3.4d plots the traced result in scattered dots, where the horizontal axis is the time and the vertical axis is the traced pixel location.
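The timestamp assignment described above can be sketched as follows, assuming the thinned kernel has already been converted into an ordered list of pixel positions with their intensities. The function name, the choice of placing each timestamp at the middle of a pixel's share of the exposure, and the normalization of the confidence by the peak intensity are assumptions made for this illustration.

```python
def assign_timestamps(path, intensities, exposure):
    """Attach a timestamp and a confidence to each pixel of a traced kernel curve.

    path: (x, y) positions ordered from one end of the thinned kernel to the other.
    intensities: kernel intensity at each position (time spent at that pose).
    exposure: duration covered by the kernel, in seconds.
    """
    total = sum(intensities)
    peak = max(intensities)
    traced = []
    accumulated = 0.0
    for (x, y), weight in zip(path, intensities):
        # Timestamp at the middle of this pixel's share of the exposure (assumed).
        t = (accumulated + 0.5 * weight) / total * exposure
        traced.append((t, x, y, weight / peak))
        accumulated += weight
    return traced


for sample in assign_timestamps([(0, 0), (1, 0), (2, 1)], [1.0, 1.0, 2.0], 1.0):
    print(sample)
```

Reversing `path` and `intensities` gives the opposite tracing direction, so both candidates can be generated with the same function before comparing their fitting residuals.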
The tracing direction is determined by enumerating all possible directions and picking the one with the least fitting residual.

Polynomial fitting

Finally, $\boldsymbol{\theta}^0$ can be estimated by fitting polynomials to the aligned time-stamped data from the previous step. We show the fitted curve in Figure 3.4d as well.

Latent Image Update

Fixing $\boldsymbol{\theta}^k$, we solve the latent image update subproblem

$$\mathbf{l}^{k+1} = \arg\min_{\mathbf{l}} \frac{1}{2}\|\mathbf{K}(\boldsymbol{\theta}^k)\mathbf{l} - \mathbf{b}\|_2^2 + \mu\|\nabla\mathbf{l}\|_1. \tag{3.15}$$

Figure 3.5: Intermediate results when processing an RSMB image. Top (a): RSMB input image; Middle (b): intrinsic image updates; Bottom (c): camera motion trajectory coefficient $\boldsymbol{\theta}$ updates (visualized as blur kernels). Columns in the intermediate steps represent the first, middle and final iterations, as well as the ground truth values.

We use the ADMM algorithm outlined in Algorithm 2 to address the presence of the L1 norm in the objective.

Algorithm 2 ADMM algorithm
1: $\mathbf{l}^{k+1} = \arg\min_{\mathbf{l}} L_\rho(\mathbf{l}, \mathbf{j}^k, \boldsymbol{\lambda}^k)$
2: $\mathbf{j}^{k+1} = \arg\min_{\mathbf{j}} L_\rho(\mathbf{l}^{k+1}, \mathbf{j}, \boldsymbol{\lambda}^k)$
3: $\boldsymbol{\lambda}^{k+1} = \boldsymbol{\lambda}^k + \rho(\mathbf{D}\mathbf{l}^{k+1} - \mathbf{j}^{k+1})$

Here we derive the optimization. We first rewrite the problem as

$$\mathbf{l}_{opt} = \arg\min_{\mathbf{l}} G(\mathbf{l}) + F(\mathbf{j}) \quad \text{subject to } \mathbf{D}\mathbf{l} = \mathbf{j}, \tag{3.16}$$

where $G(\mathbf{l}) = \frac{1}{2\mu}\|\mathbf{K}\mathbf{l} - \mathbf{b}\|_2^2$, $F(\mathbf{j}) = \|\mathbf{j}\|_1$, and $\mathbf{D}$ here denotes the gradient operator, so that $\mathbf{j} = \nabla\mathbf{l}$ at the solution. This can be expressed as an augmented Lagrangian

$$L_\rho(\mathbf{l}, \mathbf{j}, \boldsymbol{\lambda}) = G(\mathbf{l}) + F(\mathbf{j}) + \boldsymbol{\lambda}^T(\mathbf{D}\mathbf{l} - \mathbf{j}) + \frac{\rho}{2}\|\mathbf{D}\mathbf{l} - \mathbf{j}\|_2^2. \tag{3.17}$$

This is now a standard problem that can be solved efficiently with ADMM. Please refer to [10] for more details.

3.3 Experiments and Results

We perform a series of experiments on both synthetic and real RSMB images, and conduct quantitative and qualitative comparisons with conventional blind deblurring work, see Figures 3.6, 3.7, 3.9, and 3.10.
To further demonstrate the power of our method, we also compare against a strategy of first rectifying the rolling shutter wobble from videos containing the specific frames before applying conventional blind deblurring algorithms.

Figure 3.6: Two sets of test images, Clock and Fish, synthetically blurred with a typical kind of camera motion (kernel 1). Zoomed-in comparisons shown next to the RSMB images are, from left to right, cropped regions of the blurred image, output from [17] which assumes uniform motion, output from [155] which assumes a non-uniform model, and our RSMD output highlighted in green.

3.3.1 Synthetic Images

The synthetic RSMB images are obtained by simulating a rolling shutter sensor with $t_r = 1/(50M)$ s and $t_e = 1/50$ s, which is one of the standard settings in conventional CMOS sensor cameras when capturing still images or video frames. For synthesizing the camera motion, we selected two segments of the motion trajectories from Kohler et al. [68] that are of the length of our specified $t_e$ and $t_r$, and applied the motion blur to two images, Fish and Clock. This gives us 4 sets of RSMB images as shown in Figures 3.6 and 3.7 along with the ground truth.

Figure 3.7: Two sets of test images, Clock and Fish, synthetically blurred with a typical kind of camera motion (kernel 2). Zoomed-in comparisons shown next to the RSMB images are, from left to right, cropped regions of the blurred image, output from [17] which assumes uniform motion, output from [155] which assumes a non-uniform model, and our RSMD output highlighted in green.

Although both rolling shutter and MB deformations exist simultaneously for all kinds of camera motions, the two specific camera motions in the experiment here were selected to highlight specific behaviors.
The first motion is highly curved, which means that different regions in the rolling shutter image are blurred along different directions, thus emphasizing the spatial variation of the blur kernel. The second blur kernel is predominantly linear, such that the blur kernel is similar for different image regions, although they are displaced by different amounts. This results in geometric distortions known as rolling shutter wobble.

On the first motion, both the uniform [17] and non-uniform [155] methods fail to address the sequential exposure mechanism of rolling shutter. As a result, ringing artifacts are unavoidable because of the incorrect kernel estimation, even though a relatively large weight on the image prior was used to obtain the results. Because our model takes temporal information into account and optimizes the global camera motion trajectory, instead of the discrete camera poses, a better kernel estimate and thus a sharper latent image can be obtained.

With the second kind of blur, the resulting distortion/wobble could in the past only be addressed non-blindly with multiple sharp frames. Previous work [17, 155] successfully approximates the dominant blur kernel in this case, but leaves geometric distortions unaddressed due to the lack of global motion trajectory modeling within the exposure of the image.

Figure 3.8: PSNR of the synthetic results. [Bar chart of PSNR (dB) for the Blurred, Cho, Xu, and Ours results on Fish (kernel 1), Fish (kernel 2), Clock (kernel 1), and Clock (kernel 2). Bar values, in groups of (Blurred, Cho, Xu, Ours): 21.77, 23.75, 23.08, 31.02; 22.02, 23.27, 23.63, 31.91; 23.53, 24.22, 26.21, 28.14; 23.63, 24.02, 24.44, 30.29.]

To perform a quantitative evaluation we adopt the metric described in [68], where a minimization problem is first solved to find the optimal scale and translation between the ground truth and the output image from each algorithm. We give the PSNR in Figure 3.8, where our method outperforms previous works.
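For reference, the PSNR value itself can be computed as in the short Python sketch below. It assumes the two images have already been aligned, have the same size, and store intensities in [0, 1]; the scale and translation search of the metric in [68] is not reproduced here, and the function name is a placeholder.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """PSNR in dB between two equally sized images given as flat lists."""
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


# A uniform error of 0.1 on a unit-range image
print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
```

Because PSNR is logarithmic in the mean squared error, a residual misalignment of even a pixel or two can lower the score noticeably, which is why the alignment step precedes the measurement.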
Notice that although the geometric misalignment (due to wobble) contributes to part of the PSNR loss in previous methods, their deblurred outputs also contain significant artifacts.

3.3.2 Real Images

To collect the real RSMB images we captured short footage in a handheld setting using a Canon 70D DSLR camera mounted with a 50mm lens. A single frame was then extracted from the video, which was shot at 24fps with an exposure time of 1/50 s for each frame. With a sequence of RSMB images in hand, we are able to perform our blind RSMD on each frame to test our joint rolling shutter and MB recovery. The footage also provides a necessary input for conventional rolling shutter rectification/video stabilization algorithms. We chose a DSLR over a cell-phone because it gives us access to specific values for $t_e$ and $t_r$, as needed by our method. We adopted the method in [11] to obtain the value of $t_r$, which is determined by the frame rate and the total number of scanlines per frame.

We show our results in comparison to those of [17, 155] in Figure 3.9 and Figure 3.10. The insets clearly demonstrate the details recovered by our method.

We compare our single image method with a rolling shutter rectification followed by blind deblurring procedure in Figure 3.11 (left). The entire video was processed with the rolling shutter correction feature included in Adobe After Effects CS6, before applying the conventional blind deblurring algorithm [154] to the rectified frame. This method still does not deliver results comparable to ours. This is due to the fact that conventional rolling shutter/video stabilization work is incapable of dealing with motion blur. Even after correctly rectifying the image, the blur kernel inside the frame is still difficult to model using traditional Global Shutter Motion Deblurring (GSMD) techniques.

In Figure 3.11 (right) we show the result of stitching together multiple deblurred blocks from the RSMB input.
With this method, there is a tradeoff in quality between large blocks with potentially non-uniform kernels, and small blocks with potentially insufficient detail. The global trajectory model in our method inherently addresses this problem.

3.3.3 Parameters and Computational Performance

On an 8GB, 4 core computer our un-optimized MATLAB implementation takes about 1 minute for $\boldsymbol{\theta}$ initialization, 15 minutes for greyscale kernel estimation, and 2 minutes for the final deblurring on each channel of the 800×450 sized color image. The majority of the time is spent on the Jacobian matrix and sparse convolution matrix computations. We note that due to the spatial variation in the blur kernels we are inherently limited to methods that don't solve the deconvolution problem in the Fourier domain. This is a property not just of our solution, but of the RSMD problem in general.

In all of our experiments, we set $K$ as 30, which achieves a good balance between discretization accuracy and efficiency. $\mu$ is initialized as 1e-2 and decreases by a factor of $\tau = 0.7$ after each iteration until it reaches 1e-3 to output the final $\boldsymbol{\theta}$ estimate. The $\mu$ for computing the final color image is set as 7e-4.

Figure 3.9: Comparison of our method with state-of-the-art uniform [17] and non-uniform blind deblurring [155] algorithms (1/2). (a) Blurred. (b) Cho et al. [17].

Figure 3.10: Comparison of our method with state-of-the-art uniform [17] and non-uniform blind deblurring [155] algorithms (2/2). (a) Xu et al. [155]. (b) Ours.

Figure 3.11: Results from alternative approaches. (a) Results of applying blind deblurring on the rolling shutter rectified images. (b) Results of the deblur-and-stitch strategy. (c) Results of the proposed joint approach.

3.4 Conclusion and Discussion

We presented an approach for rolling shutter motion deblurring by alternating between optimizing the intrinsic image and optimizing the camera motion trajectory in pose space.
One limitation of our work is that it depends on the imperfect blur kernel estimation from uniform deblurring at multiple regions for trajectory initialization. Another limitation is that non-negligible in-plane rotation commonly exists in wide angle images. In the future, it would be interesting to extend our method to this general case by fitting a full 3D rotational pose trajectory. The key technical challenge here is to solve the initialization problem (Section 3.2.3) for kernels that are no longer shift invariant even within a horizontal region of the image, and to back-project the resulting kernels into the 3D pose space.

We also realize that even though a polynomial function is sufficient to describe human camera shake within the exposure time of interest, RSMB also widely exists in images captured by drones, street view cars, etc., where the camera motion trajectory is likely to be more irregular. In these cases, a non-blind gyroscope assisted method could be a better choice.

Chapter 4

Multi-Frame Video Deblurring

In this chapter, we extend the idea of temporal structure-aware image restoration from using multiple rows in a single image to multiple frames in a video sequence. Similar to hand-held photography, motion blur from camera shake is a major problem in videos captured by mobile devices. Different from single-image methods, however, video deblurring can leverage the abundant information across multiple frames, due to the fact that subsequent images in the video clip are blurred from temporally adjacent segments of the camera motion trajectory. We present an end-to-end learning-based approach for multi-frame video deblurring. Our key contribution is the introduction of a generic CNN architecture that automates this process while being robust to alignment errors. Both qualitative and quantitative evaluations are conducted to validate the effectiveness and efficiency of the approach.

4.1 Introduction

Hand-held video capture devices are now commonplace.
As a result, video stabilization has become an essential step in video capture pipelines, often performed automatically at capture time (e.g., iPhone, Google Pixel), or as a service on sharing platforms (e.g., YouTube, Facebook). While stabilization techniques have improved dramatically, the remaining motion blur is a major problem with all stabilization techniques. This is because the blur becomes obvious when there is no motion to accompany it, yielding highly visible "jumping" artifacts. In the end, the remaining camera shake motion blur limits the amount of stabilization that can be applied before these artifacts become too apparent.

Figure 4.1: Blur in videos can be significantly attenuated by learning how to aggregate information from nearby frames. Top: crops of consecutive frames from a blurry video; Bottom: outputs from the proposed data-driven approach, in this case using simple homography alignment.

The most successful video deblurring approaches leverage information from neighboring frames to sharpen blurry ones, taking advantage of the fact that most hand-shake motion blur is both short and temporally uncorrelated. By borrowing "sharp" pixels from nearby frames, it is possible to reconstruct a high-quality output. Previous work has shown significant improvement over traditional deconvolution-based deblurring approaches, via patch-based synthesis that relies on either lucky imaging [19] or weighted Fourier aggregation [23].

One of the main challenges associated with aggregating information across multiple video frames is that the differently blurred frames must be aligned. This can be done, for example, by nearest neighbor patch lookup [19] or optical flow [23]. However, warping-based alignment is not robust around disocclusions and areas with low texture, and often yields warping artifacts.
In addition to the alignment computation cost, methods that rely on warping therefore have to disregard information from misaligned content or warping artifacts, which can be hard to identify by looking at local image patches alone.

To this end, we present the first end-to-end data-driven approach to video deblurring, the results of which can be seen in Figure 4.1. We specifically address blur that arises due to hand-held camera shake, i.e., blur that is temporally uncorrelated; however, we show that our deblurring extends to other types of blur as well, including motion blur from object motion. We experiment with a number of differently learned configurations based on various alignment types: no alignment, frame-wise homography alignment, and optical flow alignment. On average optical flow performs the best, although in many cases a projective transform (i.e. homography) performs comparably with significantly less computation required. Notably, our approach also enables the generation of high-quality results without computing any alignment or image warping, which makes it highly efficient and robust to scene types. Essential to this success is the use of an autoencoder-type network with skip connections that increases the receptive field and is yet easy to train.

Our main contribution is an end-to-end solution to train a deep neural network to learn how to deblur images, given a short stack of neighboring video frames. We describe the architecture we found to give the best results, and the method we used to create a real-world dataset from high frame rate capture. We compare qualitatively on videos previously used for video deblurring, and quantitatively on our ground truth data set. We also present a test set of videos showing that our method generalizes to a wide range of scenarios. Both datasets are made available to the public to encourage follow-up work.

4.2 Method

Overview.
Image alignment is inherently challenging as determining whether the aligned pixels in different images correspond to the same scene content can be difficult with only low-level features. High-level features, on the other hand, provide sufficient additional information to help separate incorrectly aligned image regions from correctly aligned ones. To make use of both low-level and high-level features, we therefore train an end-to-end system for video deblurring, where the input is a stack of neighboring frames and the output is the deblurred central frame in the stack. Furthermore, our network is trained using real video frames with realistically synthesized motion blur. In the following, we first present our neural network architecture, then describe a number of experiments for evaluating its effectiveness and comparing with existing methods. The key advantage of our approach is that it lessens the requirement for accurate alignment, a fragile component of prior work.

Figure 4.2: Architecture of the proposed DeBlurNet model, which takes the stacked nearby frames as input, and processes them jointly through a number of convolutional layers until generating the deblurred central frame. The depth of each block represents the number of activation maps in response to learned kernels. The diagram distinguishes flat-convolutional, down-convolutional and up-convolutional layers as well as skip connections. See Table 4.1 for detailed configurations.

Figure 4.3: A selection of blurry/sharp pairs (split left/right respectively) from our ground truth dataset. Images are best viewed on-screen and zoomed in.

4.2.1 Network Architecture

We use an encoder-decoder style network, which has been shown to produce good results for a number of generative tasks [104, 124].
In particular, we choose a variation of the fully convolutional model proposed in [124] for sketch cleanup.

Table 4.1: Specifications of the DBN model. Each convolutional layer is followed by batch normalization and ReLU, except those that are skip connected to deeper layers, where only batch normalization has been applied, before the sum is rectified through a ReLU layer [42]. For example, the input to F4_1 is the rectified summation of U1 and F2_3. Note that for the skip connection from the input layer to F6_2 (marked ∗), only the central frame of the stack is selected. At the end of the network a Sigmoid layer is applied to normalize the intensities. We use the Torch implementation of SpatialConvolution and SpatialFullConvolution for down- and up-convolutional layers.

layer   kernel size   stride     output size      skip connection
input   -             -          15×H×W           to F6_2∗
F0      5×5           1×1        64×H×W           to U3
D1      3×3           2×2        64×H/2×W/2       -
F1_1    3×3           1×1        128×H/2×W/2      -
F1_2    3×3           1×1        128×H/2×W/2      to U2
D2      3×3           2×2        256×H/4×W/4      -
F2_1    3×3           1×1        256×H/4×W/4      -
F2_2    3×3           1×1        256×H/4×W/4      -
F2_3    3×3           1×1        256×H/4×W/4      to U1
D3      3×3           2×2        512×H/8×W/8      -
F3_1    3×3           1×1        512×H/8×W/8      -
F3_2    3×3           1×1        512×H/8×W/8      -
F3_3    3×3           1×1        512×H/8×W/8      -
U1      4×4           1/2×1/2    256×H/4×W/4      from F2_3
F4_1    3×3           1×1        256×H/4×W/4      -
F4_2    3×3           1×1        256×H/4×W/4      -
F4_3    3×3           1×1        256×H/4×W/4      -
U2      4×4           1/2×1/2    128×H/2×W/2      from F1_2
F5_1    3×3           1×1        128×H/2×W/2      -
F5_2    3×3           1×1        64×H/2×W/2       -
U3      4×4           1/2×1/2    64×H×W           from F0
F6_1    3×3           1×1        15×H×W           -
F6_2    3×3           1×1        3×H×W            from input∗

We add symmetric skip connections [88] between corresponding layers in the encoder and decoder halves of the network, where features from the encoder side are added element-wise to each corresponding layer. This significantly accelerates the convergence and helps generate much sharper video frames. We perform an early fusion of neighboring frames that is similar to the FlowNetSimple model in [29], by concatenating all images in the input layer. The training loss is MSE to the ground truth sharp image, which will be discussed in more detail in Section 4.2.2.
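The three operations named above (early fusion by channel concatenation, element-wise skip summation followed by ReLU, and the MSE training loss) are simple enough to state in plain Python. The sketch below is only meant to pin down the arithmetic on toy lists; the function names are hypothetical, batch normalization is omitted, and none of this is the Torch code used for the experiments.

```python
def stack_frames(frames):
    """Early fusion: concatenate the channels of neighboring frames.

    frames: list of frames, each a list of channels (R, G, B),
            each channel a flat list of intensities.
    """
    return [channel for frame in frames for channel in frame]


def skip_sum_relu(decoder_features, encoder_features):
    """Element-wise sum of two feature lists followed by ReLU."""
    return [max(0.0, d + e) for d, e in zip(decoder_features, encoder_features)]


def mse(prediction, target):
    """Mean squared error between two flat lists of intensities."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


five_frames = [[[float(i)] * 4 for _ in range(3)] for i in range(5)]
print(len(stack_frames(five_frames)))           # number of stacked input channels
print(skip_sum_relu([1.0, -2.0], [0.5, 1.0]))   # rectified skip summation
print(mse([0.0, 1.0], [0.0, 0.0]))              # loss on a two-pixel toy image
```

Stacking five RGB frames in this way produces the 15 input channels listed for the input layer in Table 4.1.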
We refer to this network as DeBlurNet, or DBN, and show a diagram of it in Figure 4.2. It consists of three types of convolutional layers: down-convolutional layers, which compress the spatial resolution of the features while increasing the spatial support of subsequent layers; flat-convolutional layers, which perform non-linear mapping and preserve the size of the image; and finally up-convolutional layers, which increase the spatial resolution. Please refer to Table 4.1 for detailed configurations.

Alignment. One of the main advantages of our method is the ability to work well without accurate frame-to-frame alignment. To this end, we create three versions of our dataset with varying degrees of alignment, and use these to train DBN. At one end, we use no alignment at all, relying on the network to abstract spatial information through a series of down-convolution layers. This makes the method significantly faster, as alignment usually dominates running time in multi-frame aggregation methods. We refer to this network as DBN+NOALIGN. We also use optical flow [113] to align stacks (DBN+FLOW), which is slow to compute and prone to errors (often introducing additional warping artifacts), but allows pixels to be aggregated more easily by removing the spatial variance of corresponding features. Finally, we use a single global homography to align frames, which provides a compromise between the two approaches in terms of computational complexity and alignment quality (DBN+HOMOG). The homography is estimated using SURF features and a variant of RANSAC [139] to reject outliers.

Implementation details. During training, we use a batch size of 64, and patches of 15×128×128, where 15 is the total number of RGB channels stacked from the crops of 5 consecutive video frames.
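To make the layer configuration of Table 4.1 concrete, the following Python sketch traces the feature-map shape through the network for a given patch size and checks that each element-wise skip connection joins tensors of identical shape. It only tracks shapes; no convolution is performed. The layer list is transcribed from Table 4.1, while the helper names and the encoding of layer types are assumptions of this sketch.

```python
# (name, kind, output channels); kind is "flat", "down" (stride 2) or "up" (stride 1/2).
LAYERS = [
    ("F0", "flat", 64), ("D1", "down", 64),
    ("F1_1", "flat", 128), ("F1_2", "flat", 128), ("D2", "down", 256),
    ("F2_1", "flat", 256), ("F2_2", "flat", 256), ("F2_3", "flat", 256),
    ("D3", "down", 512),
    ("F3_1", "flat", 512), ("F3_2", "flat", 512), ("F3_3", "flat", 512),
    ("U1", "up", 256),
    ("F4_1", "flat", 256), ("F4_2", "flat", 256), ("F4_3", "flat", 256),
    ("U2", "up", 128), ("F5_1", "flat", 128), ("F5_2", "flat", 64),
    ("U3", "up", 64), ("F6_1", "flat", 15), ("F6_2", "flat", 3),
]
# Element-wise skip connections: decoder layer -> encoder layer it is summed with.
SKIPS = {"U1": "F2_3", "U2": "F1_2", "U3": "F0"}


def trace_shapes(height, width, frames=5):
    """Return {layer name: (channels, height, width)} for a stacked RGB input."""
    shapes = {"input": (3 * frames, height, width)}
    h, w = height, width
    for name, kind, channels in LAYERS:
        if kind == "down":
            h, w = h // 2, w // 2
        elif kind == "up":
            h, w = h * 2, w * 2
        shapes[name] = (channels, h, w)
    return shapes


shapes = trace_shapes(128, 128)
for decoder, encoder in SKIPS.items():
    assert shapes[decoder] == shapes[encoder], (decoder, encoder)
print(shapes["input"], shapes["F3_3"], shapes["F6_2"])
```

For a 128×128 training patch the bottleneck features are 16×16, and the final layer returns a three-channel map at the input resolution, matching the central frame it is summed with.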
We observed that a patch size of 128 was sufficient to provide enough overlapping content in the stack even if the frames are not aligned. We use ADAM [65] for optimization, and fix the learning rate at 0.005 for the first 24,000 iterations, then halve it every subsequent 8,000 iterations until it reaches the lower bound of $10^{-6}$. For all the results reported in the chapter, we train the network for 80,000 iterations, which takes about 45 hours on an NVIDIA Titan X GPU. Default values of β1, β2 and ε are used, which are 0.9, 0.999, and $10^{-8}$ respectively, and we set weight decay to 0.

Method    Input          PSDEBLUR       WFA [23]       DBN+SINGLE     DBN+NOALIGN    DBN+HOMOG      DBN+FLOW
#1        24.14 / .859   24.42 / .908   25.89 / .910   25.75 / .901   27.83 / .940   27.93 / .945   28.31 / .956
#2        30.52 / .958   28.77 / .952   32.33 / .974   31.15 / .966   33.11 / .980   32.39 / .975   33.14 / .982
#3        28.38 / .914   25.15 / .928   28.97 / .931   29.30 / .946   31.29 / .973   30.97 / .969   30.92 / .973
#4        27.31 / .900   27.77 / .928   28.36 / .925   28.38 / .922   29.73 / .948   29.82 / .948   29.99 / .954
#5        22.60 / .852   22.02 / .890   23.99 / .910   23.63 / .885   25.12 / .930   24.79 / .925   25.58 / .944
#6        29.31 / .951   25.74 / .932   31.09 / .975   30.70 / .962   32.52 / .978   31.84 / .972   32.39 / .981
#7        27.74 / .939   26.11 / .948   28.58 / .955   29.23 / .959   30.80 / .975   30.46 / .972   30.56 / .975
#8        23.86 / .906   19.75 / .822   24.78 / .926   25.62 / .936   27.28 / .962   26.64 / .955   27.15 / .963
#9        30.59 / .976   26.48 / .963   31.30 / .981   31.92 / .983   33.32 / .989   33.15 / .989   32.95 / .989
#10       26.98 / .926   24.62 / .938   28.20 / .960   28.06 / .949   29.51 / .969   29.30 / .969   29.53 / .975
Average   27.14 / .918   25.08 / .921   28.35 / .944   28.37 / .941   30.05 / .964   29.73 / .962   30.05 / .969

Table 4.2: PSNR/MSSIM [68] measurements for each approach, averaged over all frames, for 10 test datasets (#1→#10).

As our network is fully convolutional, the input resolution is restricted only by GPU memory.
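The learning-rate schedule from the implementation details above can be written down as a small function. This is a sketch under one stated assumption: the first halving is taken to happen at iteration 24,000 (the text does not say whether it happens there or 8,000 iterations later). The function name and parameter names are illustrative.

```python
def dbn_learning_rate(iteration, base_lr=0.005, hold=24000, step=8000, floor=1e-6):
    """Step schedule: constant for `hold` iterations, then halved every `step`.

    Assumption: the first halving is applied at iteration `hold` itself.
    The rate is never allowed to drop below `floor`.
    """
    if iteration < hold:
        return base_lr
    halvings = (iteration - hold) // step + 1
    return max(base_lr * 0.5 ** halvings, floor)


lr_start = dbn_learning_rate(0)
lr_first_drop = dbn_learning_rate(24000)
lr_at_end = dbn_learning_rate(80000)
```

Under this reading the rate at the final iteration is 0.005/256, so the $10^{-6}$ floor would only come into play in runs longer than the 80,000 iterations reported here.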
At test time, we pass a 960×540 frame into the network and tile this if the video frame is of a larger resolution. Since our approach deblurs images in a single forward pass, it is computationally very efficient. Using an NVIDIA Titan X GPU, we can process a 720p frame within 1s without alignment. Previous approaches took on average 15s [23] and 30s [19] per frame on CPUs. The recent neural deblurring method [15] takes more than 1 hour to fully process each frame, and the approach of Kim et al. [63] takes several minutes per frame.

4.2.2 Dataset

Generating realistic training data is a major challenge for tasks where ground truth data cannot be easily collected/labeled. For training our neural network, we require two video sequences of exactly the same content: one blurred by camera-shake motion blur, and its corresponding sharp version. Capturing such data is extremely hard. One could imagine using a beam-splitter and multiple cameras to build a special capturing system, but this setup would be challenging to construct robustly and would present a host of other calibration issues.

One solution would be to use rendering techniques to create synthetic videos for training. However, if not done properly, this often leads to a domain gap, where models trained on synthetic data do not generalize well to real-world data. For example, we could apply synthetic motion blur on sharp video frames to simulate camera shake blur. However, in real-world scenarios, the blur depends not only on camera motion but also on scene depth and object motion, and is thus very difficult to render properly.

In this work, we propose to collect real-world sharp videos at a very high frame rate, and synthetically create blurred ones by accumulating a number of short exposures to approximate a longer exposure [137]. In order to simulate realistic motion blur at 30fps, we capture videos at 240fps, and subsample every eighth frame to create the 30fps ground truth sharp video.
We then average together a temporally centered window of 7 frames (3 on either side of the ground truth frame) to generate synthetic motion blur at the target frame rate.

Since there exists a time period between adjacent exposures (the “duty cycle”), simply averaging consecutive frames will yield ghosting artifacts. To avoid this, [64] proposed to only use frames whose relative motions in-between are smaller than 1 pixel. To use all frames for rendering, we compute optical flow between adjacent high fps frames, and generate an additional 10 evenly spaced inter-frame images, which we then average together. Examples of the dataset are shown in Figure 4.3. We have also released this dataset publicly for future research.

In total, we collect 71 videos, each with 3-5s average running time. These are used to generate 6708 synthetic blurry frames with the corresponding ground truth. We subsequently augment the data by flipping, rotating (0°, 90°, 180°, 270°), and scaling (1/4, 1/3, 1/2) the images, and from this we draw on average 10 random 128×128 crops. In total, this gives us 2,146,560 pairs of patches. We split our dataset into 61 training videos and 10 testing videos. For each video, its frames are used for either training or testing, but not both, meaning that the scenes used for testing have not been seen in the training data.

The training videos are captured at 240fps with an iPhone 6s, GoPro Hero 4 Black, and Canon 7D. The reason to use multiple devices is to avoid bias towards a specific capturing device that may generate videos with some unique characteristics. We test on videos captured by other devices, including Nexus 5x and Moto X mobile phones and a Sony a6300 consumer camera.

Limitations. We made a significant effort to capture a wide range of situations, including long panning, selfie videos, scenes with moving content (people, water, trees), recorded with a number of different capture devices. While it is quite diverse, it also has some limitations.
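The core of the blur synthesis in this section (take every eighth frame of the 240fps clip as ground truth and average a centered 7-frame window around it) can be sketched as follows. This is a simplified illustration: it works on single-channel frames stored as nested lists and omits the optical-flow interpolation of intermediate frames. All function and parameter names are hypothetical.

```python
def average_frames(frames):
    """Pixel-wise mean of equally sized frames (each an H x W nested list)."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / n for x in range(width)]
            for y in range(height)]


def synthesize_pairs(video, step=8, half_window=3):
    """Return (blurry, sharp) pairs from a high-frame-rate clip.

    Every `step`-th frame serves as the sharp ground truth; the blurry frame
    is the mean of the 2*half_window+1 frames centered on it.
    """
    pairs = []
    for c in range(0, len(video), step):
        if c - half_window < 0 or c + half_window >= len(video):
            continue  # skip centers whose window would run off the clip
        window = video[c - half_window:c + half_window + 1]
        pairs.append((average_frames(window), video[c]))
    return pairs


# A 17-frame, 1x1-pixel clip that is dark except for a flash at frame 8.
clip = [[[0.0]] for _ in range(17)]
clip[8] = [[7.0]]
pairs = synthesize_pairs(clip)
```

In this toy clip only frame 8 has a full window inside the clip, and the flash is spread over the 7-frame average, which is the behaviour the longer simulated exposure is meant to mimic.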
As our blurry frames are averaged from multiple input frames, the noise characteristics will be different in the ground truth image. To reduce this effect, we recorded input videos in high light situations, where there was minimal visible noise even in the original 240fps video, meaning that our dataset only contains scenes with sufficient light. An additional source of error is that using optical flow for synthesizing motion blur adds possible artifacts which would not exist in real-world data. We found, however, that because the input video is recorded at 240fps, the motion between frames is small, and we did not observe visual artifacts from this step.

As we will show in Section 4.3, despite these limitations, our trained model still generalizes well to new capture devices and scene types, notably on low-light videos. We believe future improvements to the training data set will further improve the performance of our method.

4.3 Experiments and Results

We conduct a series of experiments to evaluate the effectiveness of the learned model, and also the importance of individual components.

Effects of using multiple frames. We analyze the contribution of using a temporal window by keeping the same network architecture as DBN, but replicating the central reference frame 5 times instead of inputting a stack of neighboring frames, and retrain the network with the same hyper-parameters. We call this approach DBN+SINGLE. Qualitative comparisons are shown in Figures 4.7 to 4.10, and quantitative results are shown in Table 4.2 and Figures 4.4 and 4.5. We can see that using neighboring frames greatly improves the quality of the results. We chose a 5 frame window as it provides a good compromise between result quality and training time [62]. Single-image methods are also provided as the reference: PSDEBLUR for blind uniform deblurring with off-the-shelf shake reduction software in Photoshop, and [155] for non-uniform comparisons.

[Figure 4.4 panels: (a) Input. (b) Zoom-in comparisons: Input 21.79dB, PSDEBLUR 24.09dB, WFA [23] 21.53dB, DBN+SINGLE 24.51dB, DBN+NOALIGN 27.24dB, DBN+HOMOG 26.66dB, DBN+FLOW 26.69dB, ground-truth.]
Figure 4.4: Quantitative results from our test set, with PSNRs relative to the ground truth. Here we compare DBN with a single-image approach, PSDEBLUR, and a multi-frame video deblurring method, WFA [23]. DBN achieves comparable results to [23] without alignment, and improved results with alignment.

[Figure 4.5 panels: (a) Input. (b) Zoom-in comparisons: Input 31.72dB, PSDEBLUR 31.13dB, WFA [23] 29.83dB, DBN+SINGLE 31.49dB, DBN+NOALIGN 32.89dB, DBN+HOMOG 34.76dB, DBN+FLOW 34.87dB, ground-truth.]
Figure 4.5: Quantitative results from our test set, with PSNRs relative to the ground truth. Here we compare DBN with a single-image approach, PSDEBLUR, and a multi-frame video deblurring method, WFA [23]. DBN achieves comparable results to [23] without alignment, and improved results with alignment.

[Figure 4.6 plot: x-axis Input (dB), 20 to 36; y-axis Gain (dB), -2 to +4; curves: Input, PSDEBLUR, WFA, DBN+SINGLE, DBN+NOALIGN, DBN+HOMOG, DBN+FLOW.]
Figure 4.6: Quantitative comparison of different approaches. In this plot, the PSNR gain of applying different methods and configurations is plotted versus the sharpness of the input frame (we treat the difference between the input and ground truth in units of PSNR as a measure of “blurriness” of the input). We observe that all multi-frame methods provide a quality improvement for blurry input frames, with diminishing improvements as the input frames get sharper. DBN+NOALIGN and DBN+FLOW perform the best, but qualitatively, DBN+FLOW and DBN+HOMOG are often comparable, and superior to no alignment. We provide a single-image uniform blur kernel deblurring method as reference (PSDEBLUR).

Effects of alignment.
In this set of experiments, we analyze the impact of input image alignment on the output restoration quality; namely, we compare the results of DBN+NOALIGN, DBN+HOMOG, and DBN+FLOW. See Table 4.2 and Figure 4.6 for quantitative comparisons, and Figures 4.7 to 4.10 for the qualitative comparison. Our main conclusions are that DeBlurNet with optical flow and with homography are often qualitatively equivalent, and DBN+FLOW often has higher PSNR. On the other hand, DBN+NOALIGN performs even better than DBN+FLOW and DBN+HOMOG in terms of PSNR, especially when the input frames are not too blurry, e.g. >29dB. However, we observe that DBN+NOALIGN fails gracefully when input frames are much blurrier, which leads to a drop in PSNR and MSSIM (see Table 4.2 and Figure 4.6). In this case, DBN+FLOW and DBN+HOMOG perform better. One possible explanation for this is that when the input quality is good, optical flow errors will dominate the final performance of the deblurring procedure. Indeed, sequences with high input PSNR have small relative motion (a consequence of how the dataset is created) so there is not too much displacement from one frame to the next, and DBN+NOALIGN is able to directly handle the input frames without any alignment.

Comparisons to existing approaches. We compare our method to existing approaches in Figures 4.4 to 4.10. Specifically, we show a quantitative comparison to WFA [23], and qualitative comparisons to Cho et al. [19], Kim et al. [64], and WFA [23]. We also compare to single image deblurring methods, Chakrabarti [15], Xu et al. [155], and the Shake Reduction feature in Photoshop CC 2015 (PSDEBLUR). We note that PSDEBLUR can cause ringing artifacts when used in an automatic setting on sharp images, resulting in a sharp degradation in quality (Figure 4.6). The results of [19] and [64] are the ones provided by the authors, WFA [23] was applied for a single iteration with the same temporal window, and for [15, 155] we use the implementations provided by the authors.
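Since the comparisons in this section are reported as PSNR against the ground truth, and Figure 4.6 plots the PSNR gain over the input, a short sketch of both quantities may be useful. It uses the standard definition PSNR = 10 log10(peak² / MSE) on single-channel images with intensities in [0, 1]; the function names and the example values are illustrative and are not taken from the evaluation code.

```python
import math


def psnr(image, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB between two H x W nested lists."""
    a = [v for row in image for v in row]
    b = [v for row in reference for v in row]
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def psnr_gain(blurry, restored, sharp, peak=1.0):
    """Improvement of the restored frame over the blurry input, in dB."""
    return psnr(restored, sharp, peak) - psnr(blurry, sharp, peak)


sharp = [[0.5, 0.5], [0.5, 0.5]]
blurry = [[0.6, 0.4], [0.6, 0.4]]          # error of 0.1 per pixel
restored = [[0.51, 0.49], [0.51, 0.49]]    # error of 0.01 per pixel
gain = psnr_gain(blurry, restored, sharp)
```

Reducing the per-pixel error by a factor of 10 corresponds to a gain of 20dB, which gives a sense of scale for the gains of a few dB on the vertical axis of Figure 4.6.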
Due to the large number of frames, we are only able to compare quantitatively to approaches which operate sufficiently fast, which excludes many non-uniform deconvolution based methods. It is important to note that the test images have not been seen during the training procedure, and many of them have been shot by other cameras. Our conclusion is that DBN often produces superior quality deblurred frames, even when the input frames are aligned with a global homography, which requires substantially less computation than prior methods.

Generalization to other types of videos. As discussed in Section 4.2.2, our training set has some limitations. Despite these, Figure 4.11 shows that our method can generalize well to other types of scenes not seen during training. This includes videos captured in indoor, low-light scenarios and motion blur originating from an object moving, rather than the temporally uncorrelated blur from camera shake. While our dataset has instances of motion blur in it, it is dominated by camera-shake blur. Nonetheless, the network is able to produce a moderate amount of object motion deblurring as well, which is not handled by other lucky imaging approaches.

[Figures 4.7 and 4.8 panels: (a) Input. (b) Zoom-in comparisons: Input, PSDEBLUR, L0DEBLUR [155], NEURAL [15], WFA [23], DBN+SINGLE, DBN+NOALIGN, DBN+HOMOG, DBN+FLOW.]
Figure 4.7: Qualitative comparisons to existing approaches.
Figure 4.8: Qualitative comparisons to existing approaches.

[Figures 4.9 and 4.10 panels: (a) Input. (b) Zoom-in comparisons: Input, PSDEBLUR, Cho et al. [19], Kim and Lee [63], WFA [23], DBN+SINGLE, DBN+NOALIGN, DBN+HOMOG, DBN+FLOW.]
Figure 4.9: Qualitative comparisons to existing approaches.
Figure 4.10: Qualitative comparisons to existing approaches.

[Figure 4.11 panels: Input, DBN+HOMOG, Input, DBN+HOMOG.]
Figure 4.11: Our proposed method can generalize to types of data not seen in the training set. The first example shows a low-light, noisy video, and the second shows an example with motion blur, rather than camera shake. The biker is in motion, and is blurred in all frames in the stack, but the network can still perform some moderate deblurring.

Figure 4.12: Here we selectively visualize 3 out of 64 filters (highlighted) and their response at F0 from DBN+FLOW.

Other experiments. We tested with different fusion strategies, for example, late fusion, i.e. aggregating features from deeper layers after high-level image content has been extracted from each frame, with both shared and non-shared weights. Experimental results show that this produced slightly worse PSNR and training and validation loss, but it occasionally helped in challenging cases where DBN+NOALIGN fails.
However, this improvement is not consistent, so we left it out of our proposed approach.

Multi-scale phase-based methods have proven to be able to generate sharp images using purely Eulerian representations [93], so we experimented with multiscale-supervised, Laplacian reconstructions, but found similarly inconclusive results. While the added supervision helps in some cases, it likely restricts the network from learning useful feature maps that help in other frames.

We also tried directly predicting the sharp Fourier coefficients, as in [22]; however, this approach did not work as well as directly predicting output pixels. One possible reason is that the image quality is more prone to reconstruction errors of Fourier coefficients, and we have not found a robust way to normalize the scale of Fourier coefficients during training, compared with the straightforward way of applying Sigmoid layers when inputs are in the spatial domain.

Visualization of learned filters. Here we visualize some filters learned from DBN+FLOW, specifically at F0, to gain some insight into how it deblurs an input stack. It can be observed that DBN not only learns to locate the corresponding color channels to generate the correct tone (Figure 4.12, left), but is also able to extract edges of different orientations (Figure 4.12, middle), and to locate the warping artifacts (Figure 4.12, right).

Limitations. One limitation of this work is that we address only a subset of the types of blur present in the video; in particular, we focus on motion blur that arises due to camera-shake from hand-held camera motion. In practice, our dataset contains all types of blur that can be reduced by a shorter exposure time, including object motion, but this type of motion occurs much less frequently.
Explicitly investigating other sources of blur, for example, focus and object motion, which would require different input and training data, is an interesting area for future work.

Although no temporal coherence is explicitly imposed and no post-processing is done, the processed sequences are in general temporally smooth. However, when images are severely blurred, our proposed model, especially DBN+NOALIGN, can introduce temporal artifacts that become more visible after stabilization. In the future, we plan to investigate better strategies to handle unaligned cases, for example through the multi-scale reconstruction [12, 34].

We would also like to augment our training set with a wider range of videos, as this should increase general applicability of the proposed approach.

4.4 Conclusion and Discussion

We have presented a learning-based approach to multi-image video deblurring. Despite the above limitations, our method generates results that are often as good as or superior to the state-of-the-art approaches, with no parameter tuning and without the explicit need for challenging image alignment. It is also highly efficient due to the relaxation of the quality of alignment required: using a simplified alignment method, our approach can generate high-quality results within a second, which is substantially faster than existing approaches, many of which take minutes per frame.

In addition, we conducted a number of experiments showing the quality of results varying the input requirements. We believe that similar strategies could be applied to other aggregation based applications.

Chapter 5
Time-of-Flight Imaging

In this chapter, we introduce a learning-based framework for time-of-flight depth imaging. Our approach jointly addresses denoising, phase unwrapping, and MPI compensation by exploiting the correlation between raw TOF measurements of multiple frequencies and spatial scales.
The proposed architecture significantly outperforms conventional depth imaging pipelines, while being highly efficient with 100+ fps on modern GPUs.

5.1 Introduction

[Figure 5.1 diagram labels: Traditional (Phase & Amplitude; DN, PU, MP; Depth Baseline); Proposed (Raw Sensor Data; ToFNet; Depth Proposed).]
Figure 5.1: Top: given phase and amplitude images from dual-frequency measurements, traditional TOF cameras apply a sequence of techniques for depth map generation, such as denoising (DN), phase unwrapping (PU) and multipath correction (MP). This often leads to inaccurate depth estimation as low-frequency phases are particularly prone to global illumination [40] and various types of sensor noise; Bottom: we train a deep convolutional network to predict scene depth directly from a TOF camera’s raw correlation measurements. The proposed method is substantially more robust to noise and MPI, and runs in real-time.

Recently, Amplitude-Modulated Continuous Wave (AMCW) time-of-flight cameras such as Microsoft’s Kinect One have not only become widely adopted in interactive commercial applications, but have also emerged as an exciting imaging modality in computer vision [33, 48, 122]. Combined with conventional color cameras, RGB-D data allows for high-fidelity scene reconstruction [56], enabling the collection of large 3D datasets which drive 3D deep learning [16] for scene understanding [49, 126], action recognition [98], and facial and pose tracking [69]. Beyond enabling these core computer vision applications, RGB-D cameras have wide-spread applications in human-computer interaction, robotics, and for tracking in emerging augmented or virtual reality applications [78]. Due to low power requirements, low-cost CMOS sensor technology, and small sensor-illumination baseline [41], AMCW time-of-flight cameras have the potential to become a cornerstone imaging technology.
For brevity, we will in the following refer to AMCW time-of-flight cameras simply as TOF cameras, with the implicit understanding that they are distinct from other time-of-flight imaging technologies, such as direct temporal sampling with Single-Photon Avalanche Diode (SPAD) (e.g. [144]).

TOF cameras measure depth by illuminating a scene with periodic amplitude-modulated flood-light, which is reflected back to the camera along direct as well as indirect light paths. The camera then measures the phase shift of the incident signal with respect to the illumination signal. To extract depth from these raw phase measurements, a number of challenging reconstruction problems must be solved. For a single diffuse reflector in the scene, the phase measurements encode depth unambiguously only up to an integer phase wrapping, which is addressed by phase unwrapping methods [41]. In the presence of global illumination, multiple light paths interfere along direct and indirect paths, leading to severe MPI distortion of the depth maps. Finally, raw TOF measurements are affected by severe noise due to the low absorption depth of the IR modulation, and immature sensor technology [69] compared to RGB CMOS image sensors.

Conventionally, these three reconstruction problems, phase unwrapping, MPI reduction, and denoising, are solved in a pipeline approach where each step addresses an individual subproblem in isolation, as in [26, 32, 77, 97]. While this design facilitates divide-and-conquer algorithms, it ignores the coupling between individual sub-modules and introduces cumulative error and information loss in the reconstruction pipeline.
For example, established multi-frequency unwrapping methods [27] become inaccurate in the presence of MPI or noise, leading to noticeable unwrapping errors and subsequently inaccurate shape recovery.

Instead of building a reconstruction pipeline, or relying on additional hardware, we present a data-driven approach that generates a depth map directly from the raw modulated exposures of the TOF camera (see Figure 5.1). Specifically, we make the following contributions:

• We propose a learning-based approach for end-to-end time-of-flight imaging by jointly solving phase unwrapping, MPI compensation, and denoising from the raw correlation measurements. The proposed architecture significantly outperforms conventional depth image pipelines, while being highly efficient with interactive framerates on modern GPUs.
• We validate that the proposed reconstruction approach effectively removes MPI, phase wrapping, and sensor noise, both in simulation and on experimentally acquired raw dual-frequency measurements.
• We introduce a large-scale raw correlation time-of-flight dataset with known ground truth depth labels for every pixel. The dataset and architecture are published for full reproducibility of the proposed method.

We are not the first to apply deep learning to resolve ambiguities in TOF depth reconstruction. Son et al. [125] use a robotic arm to collect TOF range images with the corresponding ground truth labels from a structured light sensor, and train a feedforward neural network to remove multipath distortions. Concurrent with our work, Marco et al. [89] train an encoder-decoder network that takes TOF range images as input and predicts the multipath-corrected version. Both of these approaches, however, are not end-to-end, as they post-process depth from a specific type of camera’s pipeline output. Much of the information presented in the raw TOF images has already been destroyed in the depth images that serve as input to these methods. By ignoring the coupled nature of the many subproblems in TOF reconstruction they artificially limit the depth imaging performance, as we show in this work.

[Figure 5.2 panels: (a) Scene. (b) Correlation images.]
Figure 5.2: Illustration of dual-frequency correlation images of a corner scene synthesized with and without multipath interference. MPI introduces scene- and frequency-dependent offsets to TOF data (bottom right), which is, in turn, treated as features by our method. Images are simulated by the method in Section 5.3.1, and are normalized for visualization.

5.2 Method

[Figure 5.3 diagram labels: legend (flat convolution, down convolution, up convolution, skip connection, ResNet block, total variation); generator (Input, F1_1, F1_2, D1, F2, D2, R1, R2, ..., R8, R9, U1, F3, U2, F4_1, F4_2, TV, GT); discriminator (pred/GT, D1, D2, D3, F1, 0/1).]
Figure 5.3: The proposed TOFNET architectures, consisting of, top: a symmetrically skip-connected encoder-decoder generator network G, and bottom: a patchGAN discriminator network D. We implement Lsmooth as a regularization layer, denoted as TV (total variation) here. Please refer to the supplemental material in Section A.1 for detailed layer specifications.

In this section, we describe the proposed reconstruction architecture and learning loss functions that allow us to directly estimate depth from raw TOF measurements. To build intuition for this end-to-end approach, we synthesize and analyze the correlation images of a corner scene with Equation 2.11 in Figure 5.2. MPI introduces a per-pixel phase offset, depending not only on scene-specific properties such as distance, geometry, and material [35], but also the modulation signals. We demonstrate that the inverse mapping from correlation images to depth maps can be learned by leveraging the spatiotemporal structures of raw TOF measurements in a large corpus of synthetic training data.
Specifically, we treat depth generation as a multi-channel image fusion problem, where the desired depth map is the weighted combination of the same scene measured at multiple $[\omega_i, \psi_j]$ illumination-sensor configurations. Our reconstruction network is trained to fuse these spatiotemporal structures, jointly performing MPI removal, denoising and phase unwrapping, while penalizing artifacts in the resulting depth maps via a novel loss function.

5.2.1 Depth Estimation Network

The proposed depth generation network architecture takes the correlated nature of the raw TOF measurements into account. In contrast to conventional RGB or grayscale intensity images, the pixel values in $B_{\omega,\psi}$ are more sensitive to scene and camera settings, e.g. the frequency, phase offset and power of illumination signals. An ideal network should therefore learn cross channel correlations, as well as spatial features that are invariant to albedo, amplitude and scale variations.

Moreover, the input correlation measurements and output depth images should both be consistent with the underlying scene geometry. While the two should share depth gradients, albedo gradients do not necessarily align with depth edges and should be rejected.

With these motivations, we design a multi-scale network, TOFNET, following an encoder-decoder network architecture with skip connections [112] and ResNet [43] bottleneck layers (see Figure 5.3). Specifically, the network takes a stack of modulated exposures $[B_{\omega_i,\psi_j}]$, $i, j = [1, 2]$ as input to generate a phase-unwrapped and MPI-compensated distance image. We then convert the network output to a depth map with calibrated camera intrinsics.

The encoder (F1_1 to D2) of the generator G spatially compresses the input up to 1/4 of its original resolution, while generating feature maps with increasing receptive field.
The ResNet blocks at the bottleneck maintain the number of features, while refining their residuals across multiple channels so that they can reconstruct a finer, cleaner depth after upsampling. We also design symmetrically connected skip layers between F1_2-U2 and F2-U1 by element-wise summation. These skip connections are designed around the notion that scene structures should be shared between inputs and outputs [54]. The discriminator network D consists of 3 down-convolutional layers, classifying G’s prediction in overlapping patches. During training, we also randomly augment the scale of input images by a number of coarse and fine levels to learn scale-robust features.

We propose a number of input/output configurations as well as data normalization and augmentation strategies to accompany the network design. Specifically, instead of relying on the network to learn amplitude invariant features, we apply pixel-wise normalization to correlation inputs with their corresponding amplitudes. This effectively improves the model’s robustness to illumination power and scene albedo, thereby reducing the required training time as amplitude augmentation becomes unnecessary. One drawback with the normalization scheme is that the input may contain significantly amplified noise in regions where reflectivity is low or distances are too large due to the inverse square law. To this end, we introduce an edge-aware smoothness term to leverage the unused amplitude information, by feeding the amplitude maps into the TV regularization layer described in the following section.

5.2.2 Loss Functions

Due to the vastly different image statistics of depth and RGB image data, traditional $\ell_1$/$\ell_2$-norm pixel losses that work well in RGB generation tasks lead to poor depth reconstruction performance with blurry image outputs. In the following, we devise domain-specific criteria tailored to depth image statistics.

L1 loss.
We minimize the mean absolute error between the generator’s output depth $d$ and target depth $\tilde{d}$ due to its robustness to outliers,

$$\mathcal{L}_{L1} = \frac{1}{N} \sum_{i,j} |d_{ij} - \tilde{d}_{ij}|. \quad (5.1)$$

Depth gradient loss. To enforce locally-smooth depth maps, we introduce an L1 penalty term on depth gradients, i.e. a total variation loss, which is further weighted by image gradients in an edge-aware fashion [4]. Denoting $w$ as the amplitude of correlation inputs $[b_{\omega_i,0}, b_{\omega_i,\pi/2}]$, we have

$$\mathcal{L}_{smooth} = \frac{1}{N} \sum_{i,j} |\partial_x d_{ij}| e^{-|\partial_x w_{ij}|} + |\partial_y d_{ij}| e^{-|\partial_y w_{ij}|}. \quad (5.2)$$

Adversarial loss. To further adapt to depth-statistics, we introduce a patch-level conditional adversarial loss [159], minimizing the structural gap between a model-generated depth $d$ and ground-truth depth $\tilde{d}$. We adopt the least square GAN [87] to stabilize the training process,

$$\mathcal{L}_{adv} = \frac{1}{2} \mathbb{E}_{y \sim p_{depth}(y)} [(D(y) - 1)^2] + \frac{1}{2} \mathbb{E}_{x \sim p_{corr}(x)} [(D(G(x)))^2]. \quad (5.3)$$

Overall loss. Our final loss is a weighted combination of

$$\mathcal{L}_{total} = \mathcal{L}_{L1} + \lambda_s \mathcal{L}_{smooth} + \lambda_a \mathcal{L}_{adv}. \quad (5.4)$$

During training, G and D are optimized alternatingly, such that G gradually refines the depth it generates to convince D to assume the result to be correct (label 1), while D gets better and better at distinguishing correct and incorrect depth estimates by minimizing the squared distance in Equation 5.3.

5.2.3 Training and Implementation

Both G and D are trained on 128×128 patches. We first randomly downsample the original 240×320 images within a [0.6, 1] scaling range and apply random cropping. This multiscale strategy effectively increases the receptive field and improves the model’s robustness to spatial scales. Each convolution block in Figure 5.3 contains spatial convolution and ReLU/Leaky ReLU (in D) nonlinearity layers, omitting batch normalization to preserve cross-channel correlations. In all of our experiments, we set the loss weights in Equation 5.4 to be λs = 0.0001 and λa = 0.1.
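The four loss terms of Section 5.2.2 can be illustrated with a small numerical sketch. The code below is not the training implementation: it evaluates Equations 5.1 to 5.4 on nested lists, and it assumes forward differences for the gradients, no contribution from differences that would cross the image border, and batch means in place of the expectations. The discriminator scores are passed in as plain numbers, and all function names are hypothetical.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error between two H x W depth maps (Equation 5.1)."""
    n = len(pred) * len(pred[0])
    return sum(abs(p - t)
               for row_p, row_t in zip(pred, target)
               for p, t in zip(row_p, row_t)) / n


def smoothness_loss(depth, amplitude):
    """Edge-aware total variation of depth (Equation 5.2).

    Assumption: forward differences; border pixels contribute only the
    differences that stay inside the image.
    """
    h, w = len(depth), len(depth[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            if j + 1 < w:  # horizontal gradient
                dx_d = depth[i][j + 1] - depth[i][j]
                dx_w = amplitude[i][j + 1] - amplitude[i][j]
                total += abs(dx_d) * math.exp(-abs(dx_w))
            if i + 1 < h:  # vertical gradient
                dy_d = depth[i + 1][j] - depth[i][j]
                dy_w = amplitude[i + 1][j] - amplitude[i][j]
                total += abs(dy_d) * math.exp(-abs(dy_w))
    return total / (h * w)


def lsgan_loss(d_real, d_fake):
    """Least-squares GAN objective (Equation 5.3) from discriminator scores."""
    real_term = sum((s - 1.0) ** 2 for s in d_real) / len(d_real)
    fake_term = sum(s ** 2 for s in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term


def total_loss(l1, smooth, adv, lambda_s=1e-4, lambda_a=0.1):
    """Weighted combination of the three terms (Equation 5.4)."""
    return l1 + lambda_s * smooth + lambda_a * adv


depth = [[1.0, 2.0], [1.0, 2.0]]           # predicted depth with a vertical edge
target = [[1.0, 1.0], [1.0, 1.0]]
flat_amplitude = [[1.0, 1.0], [1.0, 1.0]]  # no amplitude edge
edge_amplitude = [[0.0, 5.0], [0.0, 5.0]]  # amplitude edge at the same place
```

In this toy case the same depth edge is penalized far less when the amplitude map has an edge at the same location, which is the edge-aware behaviour the exponential weighting in Equation 5.2 is meant to provide.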
We train our model using the ADAM optimizer with an initial learning rate of 0.00005 for the first 50 epochs, before linearly decaying it to 0 over another 100 epochs. The training takes 40 hours to complete on a single Titan X GPU.

5.3 Datasets

Because large raw TOF datasets with ground truth depth do not exist, we simulate synthetic measurements with known ground truth to train the proposed architecture. To validate that the synthetic training results map to real camera sensors, we evaluate on experimental measurements acquired with a TOF development board with raw data access.

5.3.1 Synthetic Dataset

To simulate realistic TOF measurements, we have extended pbrt-v3 [109] for time-resolved rendering. Specifically, we perform bidirectional path tracing [100] with histogram binning according to the path-length of the sampled path. For each scene model and camera-light configuration, our renderer synthesizes a sequence of transient images consisting of a discretized TPSF at every pixel. The raw TOF images can then be simulated by correlating the transient pixels with the frequency-dependent correlation matrix ρ (see Equation 2.11). During training we randomly apply additive Gaussian noise to the raw images, which generalizes well to real TOF data of various noise levels due to the fact that both Poisson and Skellam [13] noise are well approximated by Gaussian noise at high photon counts.

[Figure 5.4 panels: (a) Blender scenes used to generate the dataset. (b) Exemplar intensity images. (c) Real data collection. (d) Depth statistics: x-axis depth (m), 0 to 6; y-axis percent (%), 0 to 15; curves: synthetic, real.]
Figure 5.4: We synthesize transient/correlation images by “animating” a virtual camera along physically-plausible paths in the publicly available blender scenes: BATHROOM, BREAKFAST, CONTEMPORARY-BATHROOM, PAVILION, and WHITE-ROOM [1]. Reasonable alignment can be observed between depth distributions of synthetic and real datasets.

We select a number of publicly available indoor and outdoor scene models [1], which include a diverse set of geometric structures at real-world 3D scales (see Figure 5.4a and Figure 5.4b). The ground truth depth maps are generated using Blender’s Z pass renderer. Each scene is observed by flying the virtual camera across multiple viewing points and angles that lie along physically plausible paths [91]. To generalize our model to real-world reflectivity variations, we additionally augment the surface albedo of each object for training. In total, our synthetic TOF dataset contains 100,000 correlation-depth image pairs of size 320×240, including 5 scenes with 10 reflectivity variations observed from 250 viewing points and 8 sensor mirroring/orientations.

We further validate our synthetic dataset by comparing the depth-range distribution between synthetic and real datasets. Our synthetic dataset has a mean depth of 2.35m as a reasonable range for indoor scenes, and it matches the measured empirical depth distribution (see Figure 5.4d).

5.3.2 Real Dataset

We capture the physical validation TOF measurements using an off-the-shelf Texas Instruments OPT8241-CDK-EVM camera, shown in Figure 5.4c, which operates at 48MHz by default. We modify the frequency setting by adjusting the corresponding onboard register via the VoxelSDK [2]. We select 40 and 70MHz as the modulation frequencies for both real and synthesized measurements as our camera prototype achieves a high modulation contrast within this range. Note that the proposed architecture itself is not limited to this range and our network can generalize to any pair/set of modulation frequencies.
We also calibrate the phase non-linearity [3] for the two frequencies, after which we treat the measured signal as sinusoidal.

We evaluate the proposed framework on a diverse set of scenes collected under both controlled and in-the-wild conditions, including wall corners, concave objects, as well as every-day environments such as an office, bedroom, bathroom, living room, and kitchen. See Figure 5.7 for examples. Note that the real scenes are much more cluttered, consisting of skin, cloth, fabric and mirrors with irregular shape and complex reflectance not present during training.

5.4 Experiments and Results

In this section, we present an ablation study to validate the proposed architecture design, and present synthetic and physical experiments that verify the reconstruction performance compared to existing approaches. Table 5.1 and Figure 5.6 show synthetic results on a test set containing 9,400 synthetic correlation-depth images sampled from unseen scene-reflectivity-view configurations. Figure 5.7 shows physical results on raw TOF measurements. We follow Adam et al. [6] to categorize the pixel-wise multipath ratio into low, average, and strong levels, which allows us to understand the performance of each method when applied to direct illumination, e.g. a planar wall, and difficult global illumination cases, e.g. a concave corner. In the following, we quantify depth error with the mean absolute error (MAE) and the structural similarity (SSIM) [146] of the predicted depth map compared to the ground truth.

5.4.1 Ablation Study

We evaluate the contribution of individual architecture components to the overall reconstruction performance by designing a series of ablation experiments with truncated architectures and varying input configurations.

Effect of architecture components.
Table 5.1 compares the performance of the proposed network architecture, denoted as COMBINED, against four ablated variants, namely
• BASELINE: where we remove the skip connections from G and only minimize the pixel loss L_L1;
• SKIPCONN: same as BASELINE except that G now includes skip connections which encourage structural similarity between input and output;
• TV: same as SKIPCONN except that two loss functions are used for training: L_L1 and L_smooth;
• ADV: same as SKIPCONN except that both L_L1 and L_adv are minimized.

Corresponding corner scene scanlines are also shown in Figure 5.5. The BASELINE and SKIPCONN networks achieve an overall 3.1 cm and 3.0 cm depth error which already outperforms traditional pipeline approaches by a substantial margin. However, the generated depth maps suffer from noticeable reconstruction noise in flat areas. By introducing total variation regularization during training, the TV network generates outputs without such artifacts, however still containing global depth offsets. Introducing the adversarial loss in the ADV network, which learns a depth-specific structural loss, reduces this global offset. We also find that the adversarial network generates much sharper depth maps with fewer "flying pixel" artifacts around depth edges. Finally, with skip connections, TV, and adversarial combined, our proposed network achieves the best balance between accuracy, smoothness and processing speed.

Network         Input   Low MPI          Avg. MPI         Strong MPI       Overall          Speed (FPS)
EMPTY           N/A     1.205 / 0.0000   2.412 / 0.0000   2.453 / 0.0000   2.190 / 0.0000   N/A
BASELINE        corr.   0.028 / 0.9994   0.030 / 0.9945   0.110 / 0.9959   0.031 / 0.9613   415.5
SKIPCONN        corr.   0.029 / 0.9993   0.030 / 0.9930   0.109 / 0.9949   0.030 / 0.9565   421.0
TV              corr.   0.026 / 0.9995   0.028 / 0.9956   0.109 / 0.9957   0.030 / 0.9625   418.4
ADV             corr.   0.026 / 0.9994   0.027 / 0.9937   0.107 / 0.9953   0.028 / 0.9593   418.8
COMBINED        corr.   0.025 / 0.9996   0.028 / 0.9957   0.107 / 0.9958   0.029 / 0.9631   418.8
COMBINED        phase   0.034 / 0.9987   0.051 / 0.9888   0.143 / 0.9938   0.055 / 0.9395   521.4
COMBINED [89]   depth   0.061 / 0.9960   0.060 / 0.9633   0.171 / 0.9815   0.064 / 0.8291   529.8
PHASOR [40]     phase   0.011 / 0.9975   0.102 / 0.9523   1.500 / 0.8869   0.347 / 0.6898   5.2∗
SRA [31]        corr.   0.193 / 0.9739   0.479 / 0.8171   0.815 / 0.8822   0.463 / 0.6005   32.3∗

Table 5.1: Quantitative ablation studies on the proposed network and its performance against traditional sequential approaches. EMPTY serves as reference for the mean depth range of the test set. We report MAE and SSIM for each scenario (MAE / SSIM), with MAE measured in meters. In the rightmost column, runtime is reported in FPS (∗CPU implementation).

Effect of inputs. Although raw correlation images are the natural input choice for the proposed end-to-end architecture, it is also possible to post-process the phase or depth estimation from existing methods' output. Specifically, we evaluate the following input configurations
• CORR., where the input to the network is a stack of raw dual frequency correlation images as presented before;
• PHASE, where we convert the TOF data into two phase maps using Equation 2.8; and
• DEPTH, where similar to [89] we first apply phase unwrapping (Equation 2.9) to obtain raw depth, and rely on TOFNET to remove noise and MPI.

Figure 5.5: Comparisons and ablation study on a corner scene. Scanline plots of depth (m) against ground truth (GT) for the ablated variants, including Baseline, Combined (Phase) and Combined (Depth).

To this end, we modify the number of input channels at the F1_1 layer of G and retrain the weights. All other layers and hyperparameters are kept the same.

As shown in Table 5.1, the COMBINED+PHASE network achieves an overall 5.5 cm depth error, which is closest to the COMBINED+CORR. variant.
Different from the smooth correlation inputs, the COMBINED+PHASE network must learn to disambiguate edges caused by phase wrapping from those resulting from depth boundaries, and thus becomes less confident when assigning depth values.

The COMBINED+DEPTH network, on the other hand, takes the phase unwrapped depth as input, but must learn to remove the newly introduced depth errors from the previous step as well as correct for MPI. Consequently, it generates depth maps that are much noisier than COMBINED+PHASE, yet still quantitatively superior to pipeline approaches. Note that this observation matches that in [89], the code of which is unavailable at the time of this work.

5.4.2 Comparison to Sequential Approaches

Next, we compare the proposed direct reconstruction network to representative sequential pipeline approaches. Specifically, we compare with a TOF pipeline consisting of raw bilateral denoising [138], lookup-table phase unwrapping [40, 41], and non-linearity correction as first three blocks. We will denote the depth map generated from this sub-pipeline as PHASOR. To compensate for MPI we also apply the state-of-the-art sparse reflections analysis technique [31] as the last stage, indicated as SRA. We note that other works on MPI and phase unwrapping [26, 32, 36, 58, 66] either share similar image formation models, or require tailored acquisition strategies, e.g. a larger number of phase or frequency measurements than our approach, making it difficult to draw direct comparisons.

Figure 5.6: Results on synthetic dataset. Top: Reduction of MPI in a corner scene from CONT-BATHROOM. Bottom: Challenging long range scene from PAVILION where denoising, phase unwrapping and MPI are jointly solved by our approach. Columns: Scene, Phase/Corr., Phasor, SRA, Depth2Depth, Proposed, GT.

Quantitative results on synthetic dataset.
In Figure 5.6 we compare our proposed end-to-end solution against PHASOR, SRA, and our depth post-processing variant COMBINED+DEPTH, denoted as DEPTH2DEPTH here [89], on two representative scenes from the test set. As expected, PHASOR generates the most noise among all of the methods, due to the lack of MPI modeling in its image formation. SRA better suppresses the sensor and multipath noise, however it does not significantly compensate for MPI distortions in our experiments, possibly due to the violation of the sparsity assumption [31] in our synthesized backscattering α(τ) (Equation 2.11) which contains strong indirect decay. The DEPTH2DEPTH variant performs inconsistently and is particularly sensitive to input depth quality. Finally, our method consistently generates depth that is much closer to the ground truth in terms of noise suppression and detail preservation. Please refer to the supplemental material in Section A.1 for an extensive set of additional scenes.

Figure 5.7: Results on real indoor scenes, where the coupled sensor noise, depth discontinuity (see wrapped edges in phase images) and multi-path ambiguity must be addressed in a joint end-to-end manner. Our approach faithfully reconstructs cleaner depth with reduced multipath distortions (see Figure 5.5 and supplemental in Section A.1 for scanline comparisons). Notice the elimination of "flying" regions in our end-to-end recovered depth compared to the TOF depth as a result of isolated pipeline steps. Rows: CONCAVEWALL, KITCHEN, LIVINGROOM, OFFICE, PERSON. Columns: Scene, Phase/Corr., Phasor, SRA, Depth2Depth, Proposed.

Qualitative results on real data. To validate that TOFNET generalizes to real camera data, we conduct qualitative experiments in a number of challenging environments, shown in Figure 5.7.
Particularly, we evaluate on everyday scenes such as CONCAVEWALL, KITCHEN, LIVINGROOM, OFFICE and PERSON, where traditional pipeline methods commonly fail in the presence of noise, low reflectivity, long range and MPI. While the pipeline methods either partially or overly compensate MPI and introduce high-frequency artifacts, the proposed method consistently generates piece-wise smooth depth maps with reasonable shapes, demonstrating the effectiveness of the learned spatial-correlation features. We refer the reader to the supplemental material in Section A.1 for additional full-size reconstruction results.

Failure cases. TOFNET fails gracefully when the measurement contains saturation, inadequately modeled materials, low reflectivity and finer geometric structures. Nevertheless, due to the depth-dependent prior architecture, our model will estimate the unreliable regions adaptively based on the local neighborhood, achieving a more stable performance than traditional techniques.

5.5 Conclusion and Discussion

We have presented a learning framework for end-to-end TOF imaging and validated its effectiveness on joint denoising, phase unwrapping and MPI correction for both synthesized and experimentally captured TOF measurements. In the future, we plan to apply our framework to more types of TOF cameras, including impulse-based SPAD detectors. We are also exploring the co-design of modulation function and reconstruction method with our framework, potentially enabling imaging modalities beyond the capabilities of current TOF depth cameras, such as imaging in scattering media.

Chapter 6
Material Classification

In this chapter, we introduce a novel imaging application of TOF cameras on material classification. The proposed method deliberately relies on the multi-path interference effect that is encoded in a TOF sensor's raw correlation measurements for extracting material-dependent features.
It is expected to serve as a valuable complement to existing RGB-based techniques.

6.1 Introduction

Material classification is a popular, yet difficult, problem in computer vision. Everyday scenes may contain a variety of visually similar, yet structurally different, materials that may be useful to identify. Autonomous robots and self-driving vehicles, for example, must be aware of whether they are driving on concrete, metal, pavement, or black ice. As further advances in robotics and human-computer interaction are made, the need for more accurate material classification will grow.

One aspect of materials that has seen little use in classification is the way light temporally interacts with a material. As light interacts with an object, e.g., via reflection and subsurface scattering, it creates a TPSF (see Section 2.2), a unique signature that can describe the physical properties of each material. Past efforts to capture and analyze this signature have relied on detailed reconstructions of this temporal point spread function in the form of transient images. These can be captured either directly using bulky and expensive equipment such as streak cameras and femtosecond lasers [143, 150], or indirectly with inexpensive time of flight cameras [44], albeit at a significant computational cost as well as a lower resolution.

Figure 6.1: Visually similar but structurally distinct material samples in RGB. Each material was captured under the exact same illumination and camera settings. No adjustments were made except for some white balancing only for reproducing the images.

Our approach exploits raw measurements from TOF cameras' correlation image sensor for material feature representation.
This method requires very few frequency sweeps allowing for near one-shot captures similar to coded flash methods. By completely circumventing the inverse problem which is neither easy to solve nor able to produce robust solutions [44], our features can be directly fed into a pre-trained material classifier that predicts results in a timely manner. Furthermore, our method allows for per pixel labeling which enables more accurate material classification. Nevertheless, there are significant challenges inherent to this approach, including depth and noise which create ambiguities due to the correlation nature of the TOF image formation model, and the camera's limited temporal resolution [44] relative to that of a streak camera [143].

In this work, we take the first step to collect a dataset consisting of visually similar but structurally distinct materials, i.e., paper, styrofoam, towel, and wax as seen in Figure 6.1. To ensure that our classifier is robust to both distance and angle variations, we take measurements from a variety of positions. Experimental results show that classification from TOF raw measurements alone can achieve accuracies up to 81%. We also present superior results of our method compared to those based on reflectance in a real-world scenario where the latter fail, e.g., classifying printed replicas.
Together these experiments show that the representation of materials with raw TOF measurements, although at the expense of sacrificing temporal resolution, has the potential to work well on material classification tasks.

Our specific contributions are:
• We develop a method to represent materials as raw measurements from inexpensive and ubiquitous correlation image sensors, which is both budget- and computation-friendly;
• We demonstrate near single-shot, pixel-wise material classification which is robust to ambient light and thus can be potentially deployed in everyday environments;
• Finally, we show that our recognition results can be further improved by including spatial information.

6.2 Traditional Methods

The robust classification of materials from optical measurements is a long-standing challenge in computer vision. Existing techniques rely on color, shading, and texture in both active and passive settings; some even use indirect information like the shape or affordance of an object. For a comprehensive overview of the state of the art, we refer the reader to a recent survey by Weinmann and Klein [147]. Here we identify the following groups of works, some of which are associated with reference databases:
• techniques based on natural RGB images and textures [14, 83, 142];
• gonioreflectometric techniques [82, 86, 115, 157] that investigate materials' response to directional illumination;
• techniques that use patterned active illumination to recover parameters of subsurface light transport [127], and finally,
• techniques that employ other aspects of materials, such as their thermal properties [114], micro-scale surface geometry obtained through mechanical contact [59], or other physical parameters like elasticity [21].

Common to all these approaches is that they fail if the suitable information is not available or their capture conditions are not strictly met.
Some methods, in particular ones that rely on natural RGB images, are susceptible to adversarial input and could easily be fooled by human intervention, printed photos of objects, or reflections. Furthermore, these techniques often rely on object detection to infer material information [118]. As a whole, the problem of classifying materials remains unsolved. With our method, we propose the temporal analysis of subsurface scattering as a new source of information. To our knowledge, it is the first method capable of per-pixel classification without the need for structured illumination or gonioreflectometric measurements. We demonstrate that our method is capable of producing robust results under lab conditions, and that it forms a valuable complement to existing techniques.

Correlation image sensors are a class of devices that have been well explored for use in depth acquisition [38, 117] and since extended for various applications. When operated as range imagers, the quality delivered by correlation sensors suffers from multi-path interference, whose removal has therefore been the subject of extensive research [25, 31, 96]. Contrary to this line of work, our method is enabled by the insight that time-domain effects of multi-path scattering can carry valuable information about the material. To our knowledge, the only other work that explicitly makes use of this relation is a method by Naik et al. [95], in which low-parameter models are fitted to streak tube images to recover the reflectance of non-line-of-sight scenes. In a sense, Naik et al.'s method maps angular information to the time domain where it is picked up by a high-end optoelectronic imaging system.
Our proposed method, in contrast, does not make use of angular or spatial distributions and works on a type of measurement that is routinely available from both configurable TOF development toolkits such as ESPROS EPC 660 and TI OPT8241-CDK-EVM, and consumer-grade hardware like the Microsoft Kinect v2 and Google's Project Tango smartphone (with multi-frequency functionality enabled).

6.3 Decoding Material Signatures from TOF Measurements

In this section, we relate raw TOF measurements to material optical properties. Here we focus on two phenomena in particular: multiple scattering inside of a surface volume (e.g., between strands of a towel), and subsurface scattering which commonly occurs in wax and styrofoam. Throughout this chapter, we restrict materials to planar geometries and ignore macro-scale multi-bounce reflections between object surfaces. This results in the TPSF model in Equation 6.5, which is simpler than the mixture model found in [47].

The image formation model of correlation sensors in homodyne mode has been derived in Chapter 2. Following the derivation in Section 2.2 and substituting E(t) in Equation 2.5 with Equation 2.4, we obtain a correlation integral of sensor response and optical impulse response:
\begin{align}
b_{\omega,\psi} &= \int_0^T f_\omega(t - \psi/\omega) \int_0^{\tau_{max}} \alpha(\tau)\, g_\omega(t - \tau)\, d\tau\, dt \tag{6.1} \\
&= \int_0^{\tau_{max}} \alpha(\tau) \underbrace{\int_0^T f_\omega(t - \psi/\omega)\, g_\omega(t - \tau)\, dt}\; d\tau \tag{6.2} \\
&=: \int_0^{\tau_{max}} \alpha(\tau) \cdot c(\omega, \psi/\omega + \tau)\, d\tau, \tag{6.3}
\end{align}
where the scene-independent functions f_ω and g_ω have been folded into a correlation function c(ω, ψ/ω + τ) that is only dependent on the imaging device and can be calibrated in advance (Section 6.4.1). Expressing the real-valued c by its Fourier series we arrive at:
\[ b_{\omega,\psi} = \sum_{k=1}^{\infty} g_k \int_0^{\tau_{max}} \alpha(\tau) \cos\left( k\omega\,(\psi/\omega + \tau) + \phi_k \right) d\tau, \tag{6.4} \]
where g_k is the amplitude and φ_k the phase of the kth harmonic. In essence, Equation 6.4 shows that the correlation sensor probes a TPSF's frequency content. The change in the temporal profile α(τ) will be reflected in its Fourier spectrum. This is the effect we expect to see in the measurement b_{ω,ψ} between structurally different materials.

Figure 6.2: Simulation of TPSF α, measurement B and B^aligned at ψ = 0 (blue) and ψ = π/2 (red). Note that for effect the time scale of the TPSF has been exaggerated. The temporal support of the true TPSF is typically below a nanosecond. Panels: (a) Material 1 at distance 1 (µ = 10, λ = 0.2). (b) Material 1 at distance 2 (µ = 30, λ = 0.2). (c) Material 2 at distance 2 (µ = 30, λ = 0.4). Each panel shows, from top to bottom, the normalized TPSF response over time (ns), the raw measurements b over frequency (MHz), and the aligned raw measurements b over frequency (MHz).

Our camera images a material target while cycling through the relative phases {φ_{j=1...n}} and frequencies {ω_{i=1...m}} from Equation 6.4, generating m measurement vectors b_{1...m}, each of which corresponds to one modulation frequency and is sampled at n different phases. We stack all these vectors together and obtain the total measurement matrix B = (b_1 . . . b_m). The latent TPSF α(τ) only helps with the derivation and is never reconstructed.

Both the strength and challenges of using correlation measurements as material features can be illustrated via simulation.
In Figure 6.2, for example, we demonstrate the simulation of B at ψ = 0 and π/2 and why it is necessary to address depth ambiguities.

First, we approximate α(τ) with an exponentially modified Gaussian model which Heide et al. [47] found to compactly represent typical TPSFs:
\[ \alpha(\tau; a, \sigma, \lambda, \mu) = a \cdot \exp\left( (\sigma\lambda)^2 / 2 - (\tau - \mu)\lambda \right) \cdot \left( 1 + \operatorname{erf}\left( \frac{(\tau - \mu) - \sigma^2 \lambda}{\sqrt{2}\,\sigma} \right) \right). \tag{6.5} \]
The intensity of the TPSF at any given time τ is a function of amplitude a, Gaussian width σ, skew λ, and peak center µ. While λ relates to a material's scattering coefficient [150], a and µ are connected to albedo, light falloff and depth-related parameters which are irrelevant to material structural properties. Similarly, σ models the temporal resolution of a correlation image sensor, which, without loss of generality, remains constant in our simulation.

To test our concept in simulation, we assume the correlation function c(ω, ψ/ω + τ) (Equation 6.3) to be a pure sinusoid. By applying Equation 6.4 to the given c(ω, ψ/ω + τ) and α(τ), we simulate measurements at several discrete frequencies ω_i from 10 to 120 MHz and two modulation delays ψ = 0 and ψ = π/2.

On the top row of Figure 6.2 we show three TPSFs with varying peak centers and skews. Specifically, Figure 6.2a and Figure 6.2b differ in µ, while Figure 6.2b and Figure 6.2c differ in λ. As can be seen in the middle row of Figure 6.2, B is affected by both the material-independent parameter µ and the material-dependent parameter λ. To make B invariant to µ, i.e., to depth variations, we need to compensate for a global temporal shift that originates from translation.

To this end, we have developed a depth normalization method which is detailed in Section 6.4.1. Examples of normalized measurements are plotted along the bottom row of Figure 6.2.
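The simulation described above can be reproduced in spirit with a short sketch. It evaluates the TPSF of Equation 6.5 and integrates Equation 6.4 numerically for a pure sinusoidal correlation function (a single harmonic with g_1 = 1 and φ_1 = 0). The values of µ and λ follow the panel titles of Figure 6.2; the Gaussian width σ = 1 ns, the integration step, and the function names are assumptions made for illustration only.

```python
import math

def tpsf(tau, a=1.0, sigma=1.0, lam=0.2, mu=10.0):
    """Exponentially modified Gaussian TPSF of Equation 6.5 (tau in ns)."""
    decay = math.exp((sigma * lam) ** 2 / 2.0 - (tau - mu) * lam)
    onset = 1.0 + math.erf(((tau - mu) - sigma ** 2 * lam) / (math.sqrt(2.0) * sigma))
    return a * decay * onset

def simulate(freq_mhz, psi, dt=0.05, tau_max=200.0, **tpsf_params):
    """Riemann-sum approximation of Equation 6.4 for a sinusoidal correlation function."""
    omega = 2.0 * math.pi * freq_mhz * 1e-3  # angular frequency in rad per ns
    steps = int(tau_max / dt)
    return dt * sum(tpsf(k * dt, **tpsf_params) * math.cos(psi + omega * k * dt)
                    for k in range(steps))

def amplitude(freq_mhz, **tpsf_params):
    """Magnitude of the two-phase measurement (psi = 0 and psi = pi/2)."""
    b_0 = simulate(freq_mhz, 0.0, **tpsf_params)
    b_90 = simulate(freq_mhz, math.pi / 2.0, **tpsf_params)
    return math.hypot(b_0, b_90)
```

Under these assumptions, changing µ alters the individual values b at ψ = 0 and ψ = π/2 but not their combined magnitude, whereas changing λ alters how the magnitude falls off with frequency. This is the depth ambiguity and the material dependence that the text refers to.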
While the depth dependent differences are eliminated in Figures 6.2a and 6.2b, the material intrinsic properties remain intact when comparing Figures 6.2b and 6.2c.

Figure 6.3: Illustrations of experimental setup. (a) Noise removal setup (light source, diffusor, sensor). (b) Dataset acquisition setup.

6.4 Methods

6.4.1 Data Preprocessing

Removing fixed pattern noise. Our measurements show that there exists modulation frequency dependent fixed pattern noise which necessitates per-pixel calibration for its removal. Similar to [107], and as illustrated in Figure 6.3a, we expose the camera sensor to diffuse light to create a noise calibration matrix. We can then divide future measurements by this amplitude-normalized calibration data to compensate for the fixed pattern noise.

Depth normalization. Next, we describe how unwanted variations in amplitude and phase were removed from the input data. This serves to align measurements regardless of distance and intensity while leaving frequency-dependent effects unaffected. First, we take measurements from a set of modulation frequencies {ω_{i=1...m}} and choose one to serve as a point of reference for each material; for the sake of convenience, let ω_1 be the reference frequency for any given material.

The procedure summarized below is performed for all pixels, materials, and distances independently. It determines the total time delay of a given measurement vector by analyzing the phase shift at the fundamental of the ω_1 measurement. It then shifts all measurements by the respective phase to compensate for this delay.

1. Determine the complex amplitude of the signal at its base frequency ω_1. To this end, we take the vector of n phase-shifted {φ_{j=1...n}} measurements b_{ω_1,φ_j} and, using a discrete Fourier transform, obtain coefficients c_{ω_1,k} such that
\[ b_{\omega_1,\phi_j} = \sum_{k=0}^{n/2-1} c_{\omega_1,k} \cdot e^{\mathrm{i} k \phi_j}, \qquad c_{\omega_1,k} \in \mathbb{C}. \tag{6.6} \]
   Note that the negative spectral coefficients follow directly from the complex conjugate and are thus omitted from our derivation.
   From the coefficient of the fundamental frequency, c_{ω_1,1}, we obtain the desired delay τ_ref and amplitude factor a_ref by which we will compensate:
\[ \tau_{ref} = \angle(c_{\omega_1,1}) / \omega_1, \qquad a_{ref} = |c_{\omega_1,1}|. \tag{6.7} \]
2. We then propagate this correction to the measured signal at all modulation frequencies ω_{i=1...m} by altering their corresponding Fourier coefficients. Again, we Fourier transform the n phase samples for modulation frequency ω_i as in Equation 6.6:
\[ b_{\omega_i,\phi_j} = \sum_{k=0}^{n/2-1} c_{\omega_i,k} \cdot e^{\mathrm{i} k \phi_j}, \qquad c_{\omega_i,k} \in \mathbb{C}. \tag{6.8} \]
   Next, we phase-shift the coefficients c_{ω_i,k} to compensate for the delay τ_ref, and normalize their amplitude with respect to a_ref:
\begin{align}
c^{aligned}_{\omega_i,k} &= c_{\omega_i,k} \cdot e^{-\mathrm{i} k \omega_i \tau_{ref}} / a_{ref} \tag{6.9} \\
&= \frac{c_{\omega_i,k}}{|c_{\omega_1,1}|} \cdot \left( \frac{c_{\omega_1,1}}{|c_{\omega_1,1}|} \right)^{-|k|\,\omega_i/\omega_1}. \tag{6.10}
\end{align}
3. Finally, by substituting the new coefficients c^{aligned}_{ω_i,k} back into Equation 6.8 we obtain the compensated measurements b^{aligned}_{ω_i,φ_j}.

An equivalent algorithm which is more compact and straightforward to implement is provided in Algorithm 3.

Algorithm 3 Depth alignment for measurements at modulation frequency ω_i
Input: b_{ω_1}: vector of n phase-shifted {φ_{l=1..n}} measurements at base modulation frequency ω_1; b_{ω_i}: vector of n phase-shifted measurements at other frequency ω_i
Output: Aligned measurement b^{aligned}_{ω_i}
    \hat{b}_{ω_1} := FFT(b_{ω_1})
    \hat{b}_{ω_i} := FFT(b_{ω_i})
    for k = 1 to #harmonics do
        \hat{b}^{aligned}_{ω_i,k} := \frac{\hat{b}_{ω_i,k}}{|\hat{b}_{ω_1,1}|} \cdot \left( \frac{\hat{b}_{ω_1,1}}{|\hat{b}_{ω_1,1}|} \right)^{-|k|\,ω_i/ω_1}
    end for
    b^{aligned}_{ω_i} := IFFT(\hat{b}^{aligned}_{ω_i})

6.4.2 Features

After preprocessing, the raw correlation measurement at each pixel is denoted by B^aligned, where each element is a depth and amplitude normalized complex number b^{aligned}_{ω_i,φ_j}. This complex matrix is then vectorized into an n × m × 2 dimensional feature vector for training and testing. Representing materials in such a high dimensional space poses well known challenges to classification. Overfitting could be unavoidable if our number of data points is limited.
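As an aside on the depth normalization of Section 6.4.1, the steps of Algorithm 3 can be sketched in plain Python as follows. This is an illustrative reading of the algorithm, not the original implementation. It assumes n equispaced phase samples, uses a small hand-written DFT in place of an FFT library, fills the negative harmonics by complex conjugation so that the output stays real, and only rescales the DC and Nyquist coefficients. The function names and the `freq_ratio` parameter (ω_i / ω_1) are hypothetical.

```python
import cmath
import math

def dft(samples):
    """Fourier coefficients c_k such that b_j = sum_k c_k * exp(i*k*phi_j), phi_j = 2*pi*j/n."""
    n = len(samples)
    return [sum(s * cmath.exp(-2j * math.pi * k * j / n) for j, s in enumerate(samples)) / n
            for k in range(n)]

def idft(coeffs):
    """Inverse of dft(): evaluate the Fourier series at the n sample phases."""
    n = len(coeffs)
    return [sum(c * cmath.exp(2j * math.pi * k * j / n) for k, c in enumerate(coeffs))
            for j in range(n)]

def align_depth(b_ref, b_i, freq_ratio):
    """Align measurements b_i (at omega_i) using the reference b_ref (at omega_1).

    freq_ratio is omega_i / omega_1. Returns the real-valued aligned samples.
    """
    n = len(b_i)
    c_ref = dft(b_ref)[1]             # fundamental of the reference frequency
    a_ref = abs(c_ref)                # amplitude factor (Equation 6.7)
    theta_ref = cmath.phase(c_ref)    # equals omega_1 * tau_ref (Equation 6.7)
    c_i = dft(b_i)
    aligned = [c / a_ref for c in c_i]  # DC and Nyquist terms: amplitude only
    for k in range(1, (n + 1) // 2):
        # Phase shift of Equation 6.9, written via the reference phase (Equation 6.10)
        aligned[k] = c_i[k] / a_ref * cmath.exp(-1j * k * freq_ratio * theta_ref)
        aligned[n - k] = aligned[k].conjugate()  # negative harmonic by conjugate symmetry
    return [value.real for value in idft(aligned)]
```

Because the reference phase is only known modulo 2π, this sketch relies on ω_i being an integer multiple of ω_1, which holds for the 10, 20, ..., 80 MHz sweep used in Section 6.5.1 with ω_1 chosen as 10 MHz.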
Furthermore, higher dimensional feature data requires longer training time.

To address these two issues, we compare the classification accuracy using features in both the original space and dimensionally reduced space, e.g., after PCA. In theory, features that share similar rows in B^aligned are highly correlated, as the only difference between the fundamentals of b^{aligned}_{ω_i,φ_{j1}} and b^{aligned}_{ω_i,φ_{j2}} is a fixed phase shift |φ_{j2} − φ_{j1}|. Our modulation signal φ_ω(t) is nearly sinusoidal, see Figure 6.4, therefore most features in the original space may still be correlated to a certain degree. We show a comparison between the two with a real dataset and how the number of required measurements can be minimized in Section 6.5.2.

Figure 6.4: Our modulation signal φ_ω(t) at 20 MHz and 80 MHz. Each panel plots amplitude against phase (rad, 0 to 2π) for the measured signal and an ideal sinusoid.

6.4.3 Learning Models

It is important to see whether the classification accuracy benefits most from tweaking parameters for different learning models, or from the features themselves. To this end, we evaluated several supervised learning methods including Nearest Neighbors, Linear SVM, RBF SVM, Decision Tree, Random Forest, AdaBoost, LDA, and QDA using both the MATLAB classificationLearner and scikit-learn [105] implementations. The best results from every model can be found in Table 6.1.

Training and validation. During the training stage, we performed 5-fold cross-validation and reported the mean accuracy. A confusion matrix of the best performing model is given in Section 6.5.2.

Additionally, several test cases are reported in Sections 6.5.3 and 6.5.4. These tests include special cases such as detecting material photo replicas, and scene labeling.

6.5 Experiments and Results

6.5.1 Dataset Acquisition

A prototype TOF camera system composed of a custom RF modulated light source and a demodulation camera was used to collect our dataset, similar to that used in [47].
Our light source is an array of 650 nm laser diodes equipped with a diffuser sheet to provide full-field illumination. The sensor is a PMD CamBoard nano development kit with a clear glass PMD Technologies PhotonICs 19k-S3 sensor (without NIR filter), a spatial resolution of 165×120 pixels, and a custom 2-channel modulation source with 150 MHz bandwidth that serves to generate the signals f(t) and g(t) (Section 6.3). In our experiments, we limit ourselves to the frequency range from 10 MHz to 80 MHz that is also commonly used by other TOF sensor vendors.

Data points. We collected data from 4 structurally different yet visually similar materials:
• paper: a stack of normal printing paper;
• styrofoam: a regular piece of polystyrene foam;
• towel: two layers of a hand towel;
• wax: a large block of wax.

A photo of our experimental setup can be seen in Figure 6.3b. To cover a wide range of distances and viewing angles, we placed the material samples at 10 distances ranging from 1 to 2 meters from the camera. The three viewing angles (flat, slightly tilted, and more tilted) were achieved by adjusting the tripod to set positions. Under each physical and modulation setting, a total of 4 frames were captured with identical exposure times to account for noise. We then randomly sample 25 locations from the raw correlation frames as data points. In total, our dataset consists of 10×3×4×25 = 3000 observations for each material.

Table 6.1: Validation accuracies (%) from different learning models.

Model              Original   After PCA   High freqs
Decision tree        72.2       64.3        68.4
Nearest neighbor     69.5       74.7        69.1
Linear SVM           76.9       69.8        68.6
RBF SVM              80.9       77.7        71.5
Random forest        79.9       75.1        70.0
AdaBoost             72.1       61.1        69.3
LDA                  60.0       58.3        62.7
QDA                  62.6       60.4        64.8

Features. Under each of the 30 physical distance and angle settings, a total of 64 frames were captured to cover a wide range of modulation frequencies ω and phases ψ in B, including
• 8 frequencies: 10, 20, ..., 80 MHz, and
• 8 equispaced delays from 0 to 2π.

We then follow the data preprocessing method in Section 6.4.1 to normalize the depth and amplitudes, where ω1 is chosen as 10 MHz. This leaves us with a 64-dimensional complex-valued or, equivalently, a 128-dimensional real-valued feature $B^{\mathrm{aligned}}$ at each data point. Finally, before training, each feature is standardized to have zero mean and unit variance.

Figure 6.5 shows a 2D projection of the wax and paper features from our dataset before and after preprocessing. It is clear that the removal of fixed pattern noise and the normalization of amplitude and phase significantly improve the separability of our features.

Figure 6.5: Effectiveness of depth normalization, visualized in the dimensionally reduced [141] feature space. (a) PCA visualization of features before preprocessing. (b) PCA visualization of features after preprocessing.

6.5.2 Classification Results

As previously mentioned in Section 6.4.3, we now report and analyze the classification accuracies of different learning models. The mean validation accuracy for each method can be found in Table 6.1. We observe that while SVM with an RBF kernel generally has the highest accuracy, most methods (Decision Tree, Nearest Neighbor, SVM, and Random Forest) perform comparably. This suggests that the power of our algorithm is a result of the features, i.e., the TOF raw measurements, rather than the learning model.

Table 6.2: Confusion matrix (%). Labels in the leftmost column denote the true labels, while those in the top row correspond to predicted labels.

             paper   styrofoam   towel    wax
paper         62.7      3.9       33.5    0.0
styrofoam      6.1     82.2       11.7    0.0
towel         18.3      4.1       77.6    0.0
wax            0.0      0.1        0.0   99.9

Table 6.3: Testing accuracies (%) with and without considering spatial coherence.

             without spatial coherence   with spatial coherence
paper                  70.6                      80.0
styrofoam              90.8                      95.8
towel                  72.0                      74.1
wax                   100.0                     100.0

The confusion matrix in Table 6.2 shows how often each category is misclassified as another.
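To make the layout of Table 6.2 concrete, the following minimal sketch computes a row-normalized confusion matrix in percent, with true labels along the rows and predicted labels along the columns. It is an illustration only: the function name `confusion_percent` and the toy label lists are assumptions introduced here, not part of the thesis code or data.

```python
from collections import Counter

# Class order follows Table 6.2.
MATERIALS = ["paper", "styrofoam", "towel", "wax"]


def confusion_percent(true_labels, predicted_labels, classes=MATERIALS):
    """Row-normalized confusion matrix in percent.

    Row i is the true class and column j the predicted class, so every
    row with at least one sample sums to 100.
    """
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = []
    for true_class in classes:
        row_total = sum(counts[(true_class, p)] for p in classes)
        matrix.append([
            100.0 * counts[(true_class, p)] / row_total if row_total else 0.0
            for p in classes
        ])
    return matrix


# Made-up labels for illustration only; these are not the thesis data.
y_true = ["paper"] * 4 + ["towel"] * 4 + ["wax"] * 2
y_pred = ["paper", "paper", "towel", "paper",
          "towel", "paper", "towel", "towel",
          "wax", "wax"]

cm = confusion_percent(y_true, y_pred)
for name, row in zip(MATERIALS, cm):
    print(f"{name:>10}: " + "  ".join(f"{v:5.1f}" for v in row))
```

The diagonal of such a matrix gives the per-class accuracy, while off-diagonal entries in a row show which other materials that class is confused with.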
Paper and towel, for example, are most commonly misclassified as each other. One possible explanation could be that the paper used in our experiments had stronger absorption, thus behaving similarly to the surface inter-reflectance of the towel. Wax, however, comes with a greater degree of subsurface scattering compared to the other materials, which is reflected directly in its accuracy. Throughout this experiment, we fix the RBF kernel scale as $\sigma = \sqrt{P}$, where P is the number of features.

To study whether we can further reduce the dimensionality of the discriminative feature representation for each material, we performed two experiments. First, we reduce the feature dimension by performing a principal component analysis prior to training, validation, and testing. If 95% of the variance is kept, we are left with 5D features. These accuracies are shown in the center column of Table 6.1. We also empirically handpicked b at two higher modulation frequencies, 70 MHz and 80 MHz, as features. These accuracies are shown in the rightmost column of Table 6.1. As we can see, although the highest validation accuracy is achieved by representing features in the original high-dimensional space, there is a balance between the number of features and acceptable accuracy which warrants further research. Furthermore, when only the selected higher frequencies are used for measuring and predicting unseen material, the capturing time is greatly reduced from 12.3 s to 3.0 s.

Lastly, we test our best-trained classifier on a separate testing set. These test accuracies are reported in the column “without spatial coherence” in Table 6.3. We also show that by simply introducing spatial consistency in our testing stage, an improvement of up to about 10 percentage points can be reached, as for paper (70.6% to 80.0%). This spatial consistency is implemented by ranking all the predicted labels within a region of interest in each test frame.
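One plausible reading of this ranking step is a vote over the per-pixel predictions inside the region of interest, where each label's probability is taken to be its empirical frequency in that region. The sketch below follows that reading; the function name `roi_vote`, the half-open ROI convention, and the toy label map are assumptions made for illustration and are not taken from the thesis implementation.

```python
from collections import Counter


def roi_vote(label_map, roi):
    """Rank the predicted labels inside a rectangular region of interest.

    label_map: 2D list of per-pixel predicted labels.
    roi: (row_start, row_end, col_start, col_end), end indices exclusive.
    Returns the top-ranked label and the full ranking as
    (label, empirical probability) pairs, most frequent first.
    """
    r0, r1, c0, c1 = roi
    votes = Counter(
        label_map[r][c] for r in range(r0, r1) for c in range(c0, c1)
    )
    total = sum(votes.values())
    if total == 0:
        raise ValueError("region of interest contains no pixels")
    ranked = [(label, n / total) for label, n in votes.most_common()]
    return ranked[0][0], ranked


# Toy per-pixel predictions (hypothetical, for illustration only).
frame = [
    ["paper", "paper", "towel", "wax"],
    ["paper", "towel", "paper", "wax"],
    ["paper", "paper", "paper", "wax"],
]

label, ranking = roi_vote(frame, (0, 3, 0, 3))
print(label, ranking)
```

In this toy frame, the two isolated "towel" pixels inside the region are outvoted by the surrounding "paper" pixels, which is the kind of per-pixel error that a region-level decision is meant to suppress.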
Then the label with the highest probability is chosen as the final prediction. These results are shown in the column “with spatial coherence”.

6.5.3 Comparison with RGB Based Approaches

Reflectance-based methods can be easily fooled by replicas, e.g., printed pictures of actual materials, as the RGB information itself is insufficient to represent the intrinsic material properties. Furthermore, they are sensitive to even small changes in illumination. While a colored ambient light source may change the RGB appearance and therefore its classification using traditional methods, the material structure is unchanged.

To validate the advantages and robustness of our method over RGB approaches, we devised a simple experiment. First, we photographed our raw materials, making small post-processing alterations to ensure the photos appear as similar to our lab conditions as possible. Those photos were then printed on regular printing paper, similar to that used in our earlier classification experiments, and placed on top of the paper stack used earlier before taking TOF measurements.

Figure 6.6: Our classifier successfully recognizes the actual material of each paper printed replica when attached to a paper stack. Panels: (a) Styrofoam, (b) Towel, (c) Wax. Top: Reference RGB images taken with a DSLR camera which could be confusing to RGB based methods and even human eyes. Bottom: classification results overlaid on top of the TOF amplitude image. Green dots indicate a correct classification (paper) and red indicates a misclassification. For clarity, the boundaries of each printed replica are highlighted in blue.

Experimental results, shown along the bottom row of Figure 6.6, reveal that our feature representation is invariant to RGB textures, as all paper replicas were correctly classified as the actual material: paper. Reference RGB images captured with a Canon EOS 350D DSLR camera next to the TOF camera can be seen in the top row of Figure 6.6.
It is worth noting that this approach is limited by the fact that printed paper itself is less reflective, and therefore darker, than the actual materials.

Due to the different scope and nature of our methods, direct comparison with RGB based approaches may be unfair because they unavoidably rely on object cues. Nevertheless, we explored results from many of the best trained RGB-based deep neural network methods. When testing our photo replicas (seen previously in Figure 6.6, top) on a pre-trained CaffeNet model based on the network architecture of Krizhevsky et al. [71] for ImageNet, the wax replica is classified as “doormat”, while both towel and styrofoam are tagged with “towel” as the top predictions. When only a central region within the blue boundary of our photo replicas is fed to the network, the wax, towel, and styrofoam replicas are recognized as “water snake”, “paper towel”, and “velvet” respectively. These results are not surprising, as these models only use local correlation from RGB information whereas our approach exploits completely new features for classification.

6.5.4 Scene Labeling

Finally, we created a scene, shown in Figure 6.7, where each material was placed at different distances and angles from the camera. In this scene, we used an 8 mm wide angle lens with the TOF camera instead of the 50 mm lens used previously, as it was difficult to assemble the materials into such a narrow field of view without significant occlusion.

Figure 6.7: Our classifier successfully labeled the segmented scene. (a) Scene of materials captured by the reference RGB camera; (b) Segmented and labeled amplitude image from our TOF camera. Red: paper; green: styrofoam; blue: towel; yellow: wax.

As we can see, the entire wax region is correctly labeled at each pixel. For the most part, both styrofoam and paper are correctly classified as well. Towel, on the other hand, is recognized as wax and styrofoam.
One possible explanation could be that, as it is placed at the edge of the frame, vignetting becomes significant and introduces additional noise to the features after preprocessing.

6.6 Conclusion and Discussion

We have proposed a method for distinguishing between materials using only TOF raw camera measurements. While these are merely the first steps towards TOF material classification, our technique is already capable of identifying different materials which are very similar in appearance. Furthermore, through the careful removal of noise and depth dependencies, our method is robust to depth, angle, and ambient light variations, allowing for classification in outdoor and natural settings.

Future work. Although we are able to achieve high accuracy with our current classifiers and datasets, our method could be further refined by additional training data and a more diverse set of materials. As a valuable complement to existing techniques, we believe that our method could also be used in combination with state-of-the-art RGB algorithms, or by incorporating traditional spatial and image priors. In the future, we would also like to relax the restriction of planar materials and investigate the robustness of our method to object shape variations.

Chapter 7

Conclusion

In this thesis, we have presented our progress on making consumer color and depth cameras more reliable for scene perception.
Central to the thesis research is the rethinking of imaging sensors as the spatiotemporal encoder of the light transport, and the derivation of post-processing algorithms with temporal structure awareness so that hidden information within the scene measurements can effectively be extracted.

To address the inherent sensing deficiencies of color and depth cameras, we introduced three approaches that leverage the correlations between scanlines (Chapter 3), frames (Chapter 4), and modulation functions (Chapter 5) respectively. In Chapter 3 we presented the rolling shutter motion deblurring algorithm, which alternately recovers an intrinsic sharp image and estimates the optimal camera motion associated with the blurring process. In Chapter 4 a data-driven video deblurring framework is introduced. The method effectively utilizes complementary components of multiple images for collaboration, while also being efficient due to the relaxation of the high-quality image alignment procedure. A similar deep learning framework has been generalized to time-of-flight depth imaging in Chapter 5, enabling joint end-to-end depth inference in the presence of noise, phase ambiguity, and MPI.

Despite the demonstrated success of these methods over traditional techniques, there are many limitations that are worth investigating in future research. One critical assumption made in rolling shutter deblurring is the success of trajectory initialization from local blur kernel estimations. Another limitation lies in the neglect of in-plane rotation and the simplified polynomial modeling of the camera motion, which may not be suitable to describe the high-frequency vibrations from drones and autonomous vehicles. In these cases, a non-blind gyroscope-assisted method may be a better choice. Similarly, to guarantee the success of our video deblurring framework on other types of video blur such as defocus blur, the pretrained model might need to be fine-tuned.
This also applies to the proposed TOF imaging system, in which higher order harmonics may exist in other camera models with non-sinusoidal modulation signals. Improving model generalization is an unsolved and active research topic in the deep learning community. Some recent works on transfer learning for general image restoration may shed some light in this direction [152].

Finally, we introduce a new application of TOF cameras for material classification. The method deliberately relies on MPI to categorize materials of different scattering properties, allowing us to achieve superior per-pixel accuracy using just conventional classifiers. In the future, we plan to combine our method with RGB-based algorithms, as well as to investigate robust features for identifying objects with both geometric and material variations. We believe that the thesis shall benefit many other imaging and scene understanding problems where a deeper understanding of the raw sensor measurements has not been developed yet.

Bibliography

[1] Scenes for pbrt-v3. Accessed: 2017-10-29.
[2] VoxelSDK: an SDK supporting TI’s 3D Time of Flight cameras. Accessed: 2017-10-29.
[3] OPT8241-CDK-EVM: Voxel Viewer User Guide. Accessed: 2017-10-29.
[4] Unsupervised monocular depth estimation with left-right consistency. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 270–279, July 2017.
[5] S. Achar, J. R. Bartels, W. L. Whittaker, K. N. Kutulakos, and S. G. Narasimhan. Epipolar time-of-flight imaging. ACM Transactions on Graphics (TOG), 36(4):37, 2017.
[6] A. Adam, C. Dann, O. Yair, S. Mazor, and S. Nowozin. Bayesian time-of-flight for realtime shape, illumination and albedo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(5):851–864, 2017.
[7] S. Baker and I. Matthews. Lucas-kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221–255, 2004.
[8] S. Baker, E. Bennett, S. B. Kang, and R. Szeliski. Removing rolling shutter wobble. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2392–2399. IEEE, 2010.
[9] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
[10] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
[11] D. Bradley, B. Atcheson, I. Ihrke, and W. Heidrich. Synchronization and rolling shutter compensation for consumer video camera arrays. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1–8. IEEE, 2009.
[12] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision (ECCV), pages 354–370. Springer, 2016.
[13] C. Callenberg, F. Heide, G. Wetzstein, and M. Hullin. Snapshot difference imaging using time-of-flight sensors. ACM Transactions on Graphics (TOG), 36(6):220:1–220:11, 2017.
[14] B. Caputo, E. Hayman, and P. Mallikarjuna. Class-specific material categorisation. In The IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1597–1604. IEEE, 2005.
[15] A. Chakrabarti. A neural approach to blind motion deblurring. arXiv preprint arXiv:1603.04771, 2016.
[16] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[17] S. Cho and S. Lee. Fast motion deblurring. In ACM Transactions on Graphics (TOG), volume 28, page 145. ACM, 2009.
[18] S. Cho, H. Cho, Y.-W. Tai, Y. S. Moon, J. Cho, S.
Lee, and S. Lee. Lucas-kanade image registration using camera parameters. volume 8301, pages 83010V–83010V–7, 2012. doi:10.1117/12.907776.
[19] S. Cho, J. Wang, and S. Lee. Video deblurring for hand-held cameras using patch-based synthesis. ACM Transactions on Graphics (TOG), 31(4):64, 2012.
[20] R. Crabb and R. Manduchi. Fast single-frequency time-of-flight range imaging. In The IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 58–65, 2015.
[21] A. Davis, K. L. Bouman, J. G. Chen, M. Rubinstein, F. Durand, and W. T. Freeman. Visual vibrometry: Estimating material properties from small motions in video. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5335–5343, 2015.
[22] M. Delbracio and G. Sapiro. Burst deblurring: Removing camera shake through fourier burst accumulation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[23] M. Delbracio and G. Sapiro. Hand-held video deblurring via efficient fourier aggregation. IEEE Transactions on Computational Imaging, 1(4):270–283, 2015.
[24] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision (ECCV), 2014.
[25] A. Dorrington, J. Godbaz, M. Cree, A. Payne, and L. Streeter. Separating true range measurements from multi-path and scattering interference in commercial range cameras. In Proceedings of SPIE, volume 7864, 2011.
[26] A. A. Dorrington, J. P. Godbaz, M. J. Cree, A. D. Payne, and L. V. Streeter. Separating true range measurements from multi-path and scattering interference in commercial range cameras. In Conference on the Three-Dimensional Imaging, Interaction, and Measurement, volume 7864, pages 1–1. SPIE–The International Society for Optical Engineering, 2011.
[27] D.
Droeschel, D. Holz, and S. Behnke. Multi-frequency phase unwrapping for time-of-flight cameras. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1463–1469. IEEE, 2010.
[28] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman. Removing camera shake from a single photograph. ACM Transactions on Graphics (TOG), 25(3):787–794, 2006.
[29] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852, 2015.
[30] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.
[31] D. Freedman, Y. Smolin, E. Krupka, I. Leichter, and M. Schmidt. Sra: Fast removal of general multipath for tof sensors. In European Conference on Computer Vision (ECCV), pages 234–249. Springer, 2014.
[32] S. Fuchs, M. Suppa, and O. Hellwich. Compensation for multipath in tof camera measurements supported by photometric calibration and environment integration. In The IEEE International Conference on Computer Vision Systems, pages 31–41. Springer, 2013.
[33] J. Gall, H. Ho, S. Izadi, P. Kohli, X. Ren, and R. Yang. Towards solving real-world vision problems with rgb-d cameras. In The IEEE Conference on Computer Vision and Pattern Recognition Tutorial, 2014.
[34] G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In European Conference on Computer Vision (ECCV), 2016.
[35] I. Gkioulekas, A. Levin, F. Durand, and T. Zickler. Micron-scale light transport decomposition using interferometry. ACM Transactions on Graphics (TOG), 34(4):37, 2015.
[36] J. P. Godbaz, M. J. Cree, and A. A. Dorrington. Closed-form inverses for the mixed pixel/multipath interference problem in amcw lidar.
In Conference on Computational Imaging X, volume 8296, pages 1–15. SPIE, 2012.
[37] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.
[38] M. Grzegorzek, C. Theobalt, R. Koch, and A. Kolb. Time-of-Flight and Depth Imaging. Sensors, Algorithms and Applications, volume 8200. Springer, 2013.
[39] A. Gupta, N. Joshi, C. L. Zitnick, M. Cohen, and B. Curless. Single image deblurring using motion density functions. In European Conference on Computer Vision (ECCV), pages 171–184. Springer, 2010.
[40] M. Gupta, S. K. Nayar, M. B. Hullin, and J. Martin. Phasor imaging: A generalization of correlation-based time-of-flight imaging. ACM Transactions on Graphics (TOG), 34(5):156, 2015.
[41] M. Hansard, S. Lee, O. Choi, and R. P. Horaud. Time-of-flight cameras: principles, methods and applications. Springer Science & Business Media, 2012.
[42] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[43] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[44] F. Heide, M. B. Hullin, J. Gregson, and W. Heidrich. Low-budget transient imaging using photonic mixer devices. ACM Transactions on Graphics (TOG), 32(4):45, 2013.
[45] F. Heide, M. Steinberger, Y.-T. Tsai, M. Rouf, D. Pajak, D. Reddy, O. Gallo, J. Liu, W. Heidrich, K. Egiazarian, et al. Flexisp: a flexible camera image processing framework. ACM Transactions on Graphics (TOG), 33(6):231, 2014.
[46] F. Heide, L. Xiao, W. Heidrich, and M. B. Hullin.
Diffuse mirrors: 3d reconstruction from diffuse indirect illumination using inexpensive time-of-flight sensors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3222–3229, 2014.
[47] F. Heide, L. Xiao, A. Kolb, M. B. Hullin, and W. Heidrich. Imaging in scattering media using correlation image sensors and sparse convolutional coding. Optics Express, 22(21):26338–26350, 2014.
[48] F. Heide, W. Heidrich, M. Hullin, and G. Wetzstein. Doppler time-of-flight imaging. ACM Transactions on Graphics (TOG), 34(4):36, 2015.
[49] S. Hickson, S. Birchfield, I. Essa, and H. Christensen. Efficient hierarchical graph-based segmentation of rgbd videos. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 344–351, 2014.
[50] Z. Hu and M.-H. Yang. Fast non-uniform deblurring using constrained camera pose subspace. In British Machine Vision Conference (BMVC), pages 1–11, 2012.
[51] Z. Hu, L. Yuan, S. Lin, and M.-H. Yang. Image deblurring using smartphone inertial sensors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[52] Y. Huang, W. Wang, and L. Wang. Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Advances in Neural Information Processing Systems (NIPS), 2015.
[53] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG), 35(4):110, 2016.
[54] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[55] A. Ito, A. C. Sankaranarayanan, A. Veeraraghavan, and R. G.
Baraniuk. Blurburst: Removing blur due to camera shake using multiple images. ACM Transactions on Graphics (TOG), Submitted.
[56] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et al. Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pages 559–568. ACM, 2011.
[57] A. Jarabo, B. Masia, J. Marco, and D. Gutierrez. Recent advances in transient imaging: A computer graphics and vision perspective. Visual Informatics, 1(1):65–79, 2017.
[58] D. Jimenez, D. Pizarro, M. Mazo, and S. Palazuelos. Modelling and correction of multipath interference in time of flight cameras. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 893–900. IEEE, 2012.
[59] M. K. Johnson, F. Cole, A. Raj, and E. H. Adelson. Microgeometry capture using an elastomeric sensor. In ACM Transactions on Graphics (TOG), volume 30, page 46. ACM, 2011.
[60] N. Joshi and M. F. Cohen. Seeing Mt. Rainier: Lucky imaging for multi-image denoising, sharpening, and haze removal. In IEEE International Conference on Computational Photography (ICCP), 2010.
[61] A. Kadambi, R. Whyte, A. Bhandari, L. Streeter, C. Barsi, A. Dorrington, and R. Raskar. Coded time of flight cameras: sparse deconvolution to address multipath interference and recover time profiles. ACM Transactions on Graphics (TOG), 32(6):167, 2013.
[62] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2):109–122, 2016.
[63] T. H. Kim and K. M. Lee. Generalized video deblurring for dynamic scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[64] T. H. Kim, S. Nah, and K. M. Lee.
Dynamic scene deblurring using a locally adaptive linear blur model. arXiv preprint arXiv:1603.04265, 2016.
[65] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[66] A. Kirmani, A. Benedetti, and P. A. Chou. Spumic: Simultaneous phase unwrapping and multipath interference cancellation in time-of-flight cameras using spectral methods. In IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2013.
[67] F. Klose, O. Wang, J.-C. Bazin, M. Magnor, and A. Sorkine-Hornung. Sampling based scene-space video processing. ACM Transactions on Graphics (TOG), 34(4):67, 2015.
[68] R. Köhler, M. Hirsch, B. Mohler, B. Schölkopf, and S. Harmeling. Recording and playback of camera shake: Benchmarking blind deconvolution with a real-world database. In European Conference on Computer Vision (ECCV), pages 27–40. Springer, 2012.
[69] A. Kolb, E. Barth, R. Koch, and R. Larsen. Time-of-flight cameras in computer graphics. In Computer Graphics Forum, volume 29, pages 141–159. Wiley Online Library, 2010.
[70] D. Krishnan, T. Tay, and R. Fergus. Blind deconvolution using a normalized sparsity measure. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[71] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
[72] D. Kundur and D. Hatzinakos. Blind image deconvolution. IEEE Signal Processing Magazine, 13(3):43–64, 1996.
[73] R. Lange. 3d time-of-flight distance measurement with custom solid-state image sensors in cmos/ccd-technology. 2000.
[74] N. Law, C. Mackay, and J. Baldwin. Lucky imaging: high angular resolution imaging in the visible from the ground. Astronomy & Astrophysics, 446(2):739–745, 2006.
[75] F. J. Lawin, P.-E. Forssén, and H. Ovrén. Efficient multi-frequency phase unwrapping using kernel density estimation. In European Conference on Computer Vision (ECCV), pages 170–185. Springer, 2016.
[76] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[77] F. Lenzen, K. I. Kim, H. Schäfer, R. Nair, S. Meister, F. Becker, C. S. Garbe, and C. Theobalt. Denoising strategies for time-of-flight data. In Time-of-Flight and Depth Imaging. Sensors, Algorithms, and Applications, pages 25–45. Springer, 2013.
[78] H. Li, L. Trutoiu, K. Olszewski, L. Wei, T. Trutna, P.-L. Hsieh, A. Nicholls, and C. Ma. Facial performance sensing head-mounted display. ACM Transactions on Graphics (TOG), 34(4):47, 2015.
[79] Y. Li, S. B. Kang, N. Joshi, S. M. Seitz, and D. P. Huttenlocher. Generating sharp panoramas from motion-blurred videos. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[80] J. Lin, Y. Liu, M. B. Hullin, and Q. Dai. Fourier analysis on transient imaging with a multifrequency time-of-flight camera. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3230–3237, 2014.
[81] C. Liu and W. T. Freeman. A high-quality video denoising algorithm based on reliable motion estimation. In European Conference on Computer Vision (ECCV), pages 706–719. Springer, 2010.
[82] C. Liu and J. Gu. Discriminative illumination: Per-pixel classification of raw materials based on optimal projections of spectral brdf. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):86–98, 2014.
[83] C. Liu, L. Sharan, E. H. Adelson, and R. Rosenholtz. Exploring features in a bayesian framework for material recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 239–246. IEEE, 2010.
[84] D. Liu, Z. Wang, B.
Wen, J. Yang, W. Han, and T. S. Huang. Robust single image super-resolution via deep networks with sparse prior. IEEE Transactions on Image Processing, 25(7):3194–3207, 2016.
[85] Z. Liu, L. Yuan, X. Tang, M. Uyttendaele, and J. Sun. Fast burst images denoising. ACM Transactions on Graphics (TOG), 33(6):232, 2014.
[86] M. A. Mannan, D. Das, Y. Kobayashi, and Y. Kuno. Object material classification by surface reflection analysis with a time-of-flight range sensor. In Advances in Visual Computing, pages 439–448. Springer, 2010.
[87] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), 2017.
[88] X.-J. Mao, C. Shen, and Y.-B. Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. arXiv preprint arXiv:1603.09056, 2016.
[89] J. Marco, Q. Hernandez, A. Muñoz, Y. Dong, A. Jarabo, M. Kim, X. Tong, and D. Gutierrez. Deeptof: Off-the-shelf real-time correction of multipath interference in time-of-flight imaging. ACM Transactions on Graphics (TOG), 36(6), 2017.
[90] Y. Matsushita, E. Ofek, W. Ge, X. Tang, and H.-Y. Shum. Full-frame video stabilization with motion inpainting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(7):1150–1163, July 2006.
[91] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison. Scenenet rgb-d: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2678–2687, 2017.
[92] M. Meilland, T. Drummond, and A. I. Comport. A unified rolling shutter and motion blur model for 3d visual registration. In The IEEE International Conference on Computer Vision (ICCV), pages 2016–2023. IEEE, 2013.
[93] S. Meyer, O. Wang, H. Zimmer, M. Grosse, and A.
Sorkine-Hornung. Phase-based frame interpolation for video. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[94] T. Michaeli and M. Irani. Blind deblurring using internal patch recurrence. In European Conference on Computer Vision (ECCV), 2014.
[95] N. Naik, S. Zhao, A. Velten, R. Raskar, and K. Bala. Single view reflectance capture using multiplexed scattering and time-of-flight imaging. In ACM Transactions on Graphics (TOG), volume 30, page 171. ACM, 2011.
[96] N. Naik, A. Kadambi, C. Rhemann, S. Izadi, R. Raskar, and S. Bing Kang. A light transport model for mitigating multipath interference in time-of-flight sensors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 73–81, 2015.
[97] N. Naik, A. Kadambi, C. Rhemann, S. Izadi, R. Raskar, and S. B. Kang. A light transport model for mitigating multipath interference in time-of-flight sensors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 73–81. IEEE, June 2015.
[98] B. Ni, G. Wang, and P. Moulin. Rgbd-hudaact: A color-depth video database for human daily activity recognition. In Consumer Depth Cameras for Computer Vision, pages 193–208. Springer, 2013.
[99] M. O’Toole, F. Heide, L. Xiao, M. B. Hullin, W. Heidrich, and K. N. Kutulakos. Temporal frequency probing for 5d transient analysis of global light transport. ACM Transactions on Graphics (TOG), 33(4):87, 2014.
[100] M. O’Toole, F. Heide, D. B. Lindell, K. Zang, S. Diamond, and G. Wetzstein. Reconstructing transient images from single-photon sensors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1539–1547, 2017.
[101] C. Paramanand and A. N. Rajagopalan. Non-uniform motion deblurring for bilayer scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[102] S. H. Park and M. Levoy.
Gyro-based multi-image deconvolution for removing handshake blur. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. → pages 2, 26
[103] S. H. Park and M. Levoy. Gyro-based multi-image deconvolution for removing handshake blur. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. → pages 14
[104] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. arXiv preprint arXiv:1604.07379, 2016. → pages 22, 50
[105] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. → pages 93
[106] D. Perrone and P. Favaro. Total variation blind deconvolution: The devil is in the details. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. → pages 14, 24, 25, 31
[107] C. Peters, J. Klein, M. B. Hullin, and R. Klein. Solving trigonometric moment problems for fast transient imaging. ACM Transactions on Graphics (TOG), 34(6), Nov. 2015. doi:10.1145/2816795.2818103. → pages 90
[108] G. Petschnigg, R. Szeliski, M. Agrawala, M. Cohen, H. Hoppe, and K. Toyama. Digital photography with flash and no-flash image pairs. ACM Transactions on Graphics (TOG), 23(3):664–672, 2004. → pages 14
[109] M. Pharr, W. Jakob, and G. Humphreys. Physically based rendering: From theory to implementation. Morgan Kaufmann, 2016. → pages 74
[110] V. Pichaikuppan, R. Narayanan, and A. Rangarajan. Change detection in the presence of motion blur and rolling shutter effect. In European Conference on Computer Vision (ECCV), pages 123–137. Springer International Publishing, 2014. → pages 10, 11
[111] R. Raskar. Computational photography: Epsilon to coded photography. Emerging Trends in Visual Computing, pages 238–253, 2009. → pages 2
[112] O.
Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015. → pages 72
[113] J. Sánchez Pérez, E. Meinhardt-Llopis, and G. Facciolo. TV-L1 Optical Flow Estimation. Journal of Image Processing On Line (IPOL), 3:137–150, 2013. doi:10.5201/ipol.2013.26. → pages 52
[114] P. Saponaro, S. Sorensen, A. Kolagunda, and C. Kambhamettu. Material classification with thermal imagery. June 2015. → pages 86
[115] M. Sato, S. Yoshida, A. Olwal, B. Shi, A. Hiyama, T. Tanikawa, M. Hirose, and R. Raskar. Spectrans: Versatile material classification for interaction with textureless, specular and transparent surfaces. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 2191–2200. ACM, 2015. → pages 85
[116] O. Saurer, K. Koser, J.-Y. Bouguet, and M. Pollefeys. Rolling shutter stereo. In The IEEE International Conference on Computer Vision (ICCV), pages 465–472. IEEE, 2013. → pages 11
[117] R. Schwarte, Z. Xu, H. Heinol, J. Olk, R. Klein, B. Buxbaum, H. Fischer, and J. Schulte. New electro-optical mixing and correlating sensor: facilities and applications of the photonic mixer device. In Proceedings of SPIE, volume 3100, pages 245–253, 1997. → pages 15, 86
[118] G. Schwartz and K. Nishino. Automatically discovering local visual material attributes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3565–3573, 2015. → pages 86
[119] A. Sellent, C. Rother, and S. Roth. Stereo video deblurring. In European Conference on Computer Vision (ECCV), 2016. → pages 14
[120] Q. Shan, J. Jia, and A. Agarwala. High-quality motion deblurring from a single image. ACM Transactions on Graphics (TOG), 27(3), 2008. → pages 10, 13
[121] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang.
Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. → pages 22
[122] S. Shrestha, F. Heide, W. Heidrich, and G. Wetzstein. Computational imaging with multi-camera time-of-flight systems. ACM Transactions on Graphics (TOG), 35(4):33, 2016. → pages 67
[123] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. → pages 23, 147, 148
[124] E. Simo-Serra, S. Iizuka, K. Sasaki, and H. Ishikawa. Learning to simplify: fully convolutional networks for rough sketch cleanup. ACM Transactions on Graphics (TOG), 35(4):121, 2016. → pages 50
[125] K. Son, M.-Y. Liu, and Y. Taguchi. Automatic learning to remove multipath distortions in time-of-flight range images for a robotic arm setup. In IEEE International Conference on Robotics and Automation, 2016. → pages 69
[126] S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 567–576, 2015. → pages 67
[127] J. Steimle, A. Jordt, and P. Maes. Flexpad: Highly flexible bending interactions for projected handheld displays. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’13, pages 237–246, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1899-0. doi:10.1145/2470654.2470688. → pages 86
[128] S. Su and W. Heidrich. Rolling shutter motion deblurring. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1529–1537, 2015. → pages iv, 13
[129] S. Su, F. Heide, R. Swanson, J. Klein, C. Callenberg, M. Hullin, and W. Heidrich. Material classification using raw time-of-flight measurements. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3503–3511, 2016.
→ pages v, 18
[130] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang. Deep video deblurring for hand-held cameras. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1279–1288, 2017. → pages iv
[131] S. Su, F. Heide, G. Wetzstein, and W. Heidrich. Deep end-to-end time-of-flight imaging. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. Accepted. → pages iv
[132] J. Sun, W. Cao, Z. Xu, and J. Ponce. Learning a convolutional neural network for non-uniform motion blur removal. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. → pages 22
[133] R. Szeliski. Computer vision: algorithms and applications. Springer Science & Business Media, 2010. → pages 7
[134] Y.-W. Tai, H. Du, M. S. Brown, and S. Lin. Image/video deblurring using a hybrid camera. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008. → pages 14
[135] Y.-W. Tai, P. Tan, and M. S. Brown. Richardson-lucy deblurring for scenes under a projective motion path. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011. → pages 24, 25
[136] K. Tanaka, Y. Mukaigawa, T. Funatomi, H. Kubo, Y. Matsushita, and Y. Yagi. Material classification using frequency- and depth-dependent time-of-flight distortion. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 79–88, 2017. → pages 13
[137] J. Telleen, A. Sullivan, J. Yee, O. Wang, P. Gunawardane, I. Collins, and J. Davis. Synthetic shutter speed imaging. Computer Graphics Forum, 26(3):591–598, 2007. → pages 54
[138] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In The IEEE International Conference on Computer Vision (ICCV), pages 839–846. IEEE, 1998. → pages 79
[139] P. H. Torr and A. Zisserman. Mlesac: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78(1):138–156, 2000. → pages 52
[140] S. University.
Cs231n: Convolutional neural networks for visual recognition. Accessed: 2016-09-01. → pages xi, 21
[141] L. J. van der Maaten, E. O. Postma, and H. J. van den Herik. Dimensionality reduction: A comparative review. Journal of Machine Learning Research, 10(1-41):66–71, 2009. → pages xii, 96
[142] M. Varma and A. Zisserman. A statistical approach to material classification using image patch exemplars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):2032–2047, 2009. → pages 85
[143] A. Velten, D. Wu, A. Jarabo, B. Masia, C. Barsi, C. Joshi, E. Lawson, M. Bawendi, D. Gutierrez, and R. Raskar. Femto-photography: Capturing and visualizing the propagation of light. ACM Transactions on Graphics (TOG), 32(4):44, 2013. → pages 3, 8, 10, 84
[144] F. Villa, B. Markovic, S. Bellisai, D. Bronzi, A. Tosi, F. Zappa, S. Tisa, D. Durini, S. Weyers, U. Paschen, et al. Spad smart pixel for time-of-flight and time-correlated single-photon counting measurements. IEEE Photonics Journal, 4(3):795–804, 2012. → pages 68
[145] R. Wang and D. Tao. Recent progress in image deblurring. arXiv preprint arXiv:1409.6838, 2014. → pages 13
[146] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. → pages 77
[147] M. Weinmann and R. Klein. A short survey on optical material recognition. In Proceedings of the Eurographics Workshop on Material Appearance Modeling, pages 35–42. Eurographics, 2015. → pages 85
[148] O. Whyte, J. Sivic, A. Zisserman, and J. Ponce. Non-uniform deblurring for shaken images. International Journal of Computer Vision, 98(2):168–186, 2012. → pages 2, 24, 25
[149] P. Wieschollek, B. Schölkopf, H. Lensch, and M. Hirsch. End-to-end learning for image burst deblurring. In Asian Conference on Computer Vision (ACCV), 2016. → pages 22
[150] D. Wu, A. Velten, M. O’Toole, B. Masia, A. Agrawal, Q. Dai, and R. Raskar.
Decomposing global light transport using time of flight imaging. International Journal of Computer Vision, 107(2):123–138, 2014. → pages 9, 10, 84, 89
[151] J. Wulff and M. J. Black. Modeling blurred video with layers. In European Conference on Computer Vision (ECCV), 2014. → pages 14
[152] L. Xiao, F. Heide, W. Heidrich, B. Schölkopf, and M. Hirsch. Discriminative transfer learning for general image restoration. arXiv preprint arXiv:1703.09245, 2017. → pages 103
[153] J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 341–349, 2012. → pages 22
[154] L. Xu and J. Jia. Two-phase kernel estimation for robust motion deblurring. In European Conference on Computer Vision (ECCV), pages 157–170. Springer, 2010. → pages 13, 24, 25, 41
[155] L. Xu, S. Zheng, and J. Jia. Unnatural l0 sparse representation for natural image deblurring. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1107–1114. IEEE, 2013. → pages 13, 24, 25, 32, 38, 39, 40, 41, 42, 43, 58, 59, 60, 61
[156] L. Yuan, J. Sun, L. Quan, and H.-Y. Shum. Image deblurring with blurred/noisy image pairs. ACM Transactions on Graphics (TOG), 26(3):1, 2007. → pages 14
[157] H. Zhang, K. Dana, and K. Nishino. Reflectance hashing for material recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. → pages 85
[158] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. → pages 23
[159] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), 2017.
→ pages 22, 23, 73

Appendix A
Supporting Materials

A.1 Supporting Materials for Chapter 5

A.1.1 Introduction
In this supplemental document, we provide additional full-size depth reconstruction results and implementation details of the dataset generation and network training, to ensure the reproducibility of the proposed approach.

A.1.2 Additional Results
In this section, we provide full-size depth maps generated by different methods to evaluate the performance of the proposed end-to-end reconstruction framework for joint phase unwrapping, MPI correction and denoising. In the following figures, we use these labels:
• SCENE: the amplitude image of the ToF measurements;
• PHASE: the phase map measured at ω2 = 70 MHz, for visualization of phase ambiguity;
• CORRELATION: the raw ToF correlation image measured at ω2 = 70 MHz and ψ1 = 0 rad, one of the four input images to our network;
• SINGLE: the depth map directly computed from a single, lower modulation frequency, ω1 = 40 MHz;
• PHASOR: the dual-frequency phase-unwrapped depth map [26, 40];
• SRA: the phase-unwrapped and MPI-compensated depth map [31];
• DEPTH2DEPTH (D2D): the depth map generated by a depth post-processing network similar to [89];
• OURS: the depth map generated with our proposed method;
• GT: the ground truth depth (available for synthetic data only).
The resulting depth maps are best compared along with the depth scanlines shown at the bottom of each individual figure, where meters are the default unit of length. As expected, SINGLE suffers the most from both phase ambiguity and MPI distortion, because it compensates for neither. The optimization-based methods, PHASOR and SRA, reduce MPI and phase ambiguity; however, they also introduce “flying pixels” in the presence of noise.
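The phase ambiguity that affects SINGLE, and that the dual-frequency methods resolve, can be illustrated with a small worked example. The following is a minimal Python sketch, not the implementation of PHASOR, SRA or the proposed network: it assumes an ideal continuous-wave ToF model without noise or MPI, and the helper names (`ambiguity_range`, `wrapped_phase`, `depth_from_phase`, `unwrap_dual`) as well as the brute-force search over wrap counts are illustrative assumptions.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def ambiguity_range(freq_hz):
    """Largest depth a single modulation frequency can measure without wrapping."""
    return C / (2.0 * freq_hz)


def wrapped_phase(depth_m, freq_hz):
    """Ideal (noise-free, MPI-free) phase for a given depth, wrapped to [0, 2*pi)."""
    return (4.0 * math.pi * freq_hz * depth_m / C) % (2.0 * math.pi)


def depth_from_phase(phase_rad, freq_hz):
    """Single-frequency depth estimate; wrong by a multiple of the ambiguity range."""
    return C * phase_rad / (4.0 * math.pi * freq_hz)


def unwrap_dual(phase1, f1, phase2, f2, max_depth):
    """Hypothetical brute-force unwrapping: try all wrap counts at both
    frequencies and keep the pair of depth candidates that agree best."""
    r1, r2 = ambiguity_range(f1), ambiguity_range(f2)
    d1, d2 = depth_from_phase(phase1, f1), depth_from_phase(phase2, f2)
    best_err, best_depth = None, None
    n1 = 0
    while d1 + n1 * r1 <= max_depth:
        cand1 = d1 + n1 * r1
        n2 = 0
        while d2 + n2 * r2 <= max_depth:
            cand2 = d2 + n2 * r2
            err = abs(cand1 - cand2)
            if best_err is None or err < best_err:
                best_err, best_depth = err, 0.5 * (cand1 + cand2)
            n2 += 1
        n1 += 1
    return best_depth


true_depth = 5.0  # meters, beyond the 40 MHz ambiguity range
p40 = wrapped_phase(true_depth, 40e6)
p70 = wrapped_phase(true_depth, 70e6)
single = depth_from_phase(p40, 40e6)
dual = unwrap_dual(p40, 40e6, p70, 70e6, max_depth=10.0)
```

Under these assumptions, the unambiguous range at 40 MHz is about 3.75 m, so a surface at 5 m is reported at roughly 1.25 m by the single-frequency estimate, while combining the 40 MHz and 70 MHz phases recovers 5 m. With real measurements, noise and MPI perturb both phases, which is why the candidates no longer agree exactly.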
The learning-based depth post-processing framework, DEPTH2DEPTH, compensates for these ambiguities, yet introduces new, high-frequency artifacts when the input depth is of lower quality. Our results, on the other hand, are consistently more accurate and robust across various capture scenarios.

Synthetic Results
We demonstrate results on synthesized raw ToF measurements with known ground truth depth labels in Fig. A.1, A.2, A.3, A.4, A.5 and A.6. The depth scanlines below each subfigure are sampled from the middle row of the corresponding depth maps, highlighted as red lines in the ground truth GT images in Fig. A.1h, A.2h, A.3h, A.4h, A.5h and A.6h.

Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours, (h) GT; depth panels include a scanline plot against GT.
Figure A.1: Synthetic results on a long-range scene with relatively low MPI and sensor noise.

Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours, (h) GT; depth panels include a scanline plot against GT.
Figure A.2: Synthetic results on a long-range scene with relatively high MPI and sensor noise.

Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours, (h) GT; depth panels include a scanline plot against GT.
Figure A.3: Synthetic results on a common indoor scene with relatively low MPI and sensor noise.
Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours, (h) GT; depth panels include a scanline plot against GT.
Figure A.4: Synthetic results on a common indoor scene with moderate MPI and sensor noise.

Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours, (h) GT; depth panels include a scanline plot against GT.
Figure A.5: Synthetic results on a common indoor scene with moderate MPI and sensor noise.

Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours, (h) GT; depth panels include a scanline plot against GT.
Figure A.6: Synthetic results on a common indoor scene with strong MPI and sensor noise.

Real Results
We demonstrate results on experimentally captured raw ToF measurements in Fig. A.7, A.8, A.9, A.10, A.11, A.12, A.13, A.14 and A.15. The depth scanlines are sampled from the middle row of each result, highlighted as red lines in the baseline SINGLE outputs in Fig. A.7c, A.8c, A.9c, A.10c, A.11c, A.12c, A.13c, A.14c, and A.15c.

Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours; depth panels include a scanline plot against Single.
Figure A.7: Real results from an office scene consisting of a concrete wall corner, sofa, chair and glass window.
Our method generates a piecewise smooth depth map while compensating for MPI distortion and glass reflectance.

Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours; depth panels include a scanline plot against Single.
Figure A.8: Real results from a cluttered, long-range office scene consisting of a painted wall, bookshelves, cardboard boxes and dark carpet. Our method generates a piecewise smooth depth map, compensates for MPI distortion, and removes noise in the low-reflectivity regions.

Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours; depth panels include a scanline plot against Single.
Figure A.9: Real results from a bedroom scene consisting of a painted wall, bed and duvets. Our method generates a piecewise smooth depth map while compensating for MPI distortion. Notice the sharp corner reconstruction from our method in the scanline of Fig. A.9g, compared to the rounded output of the baseline approach caused by strong multipath distortion near the wall corner.

Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours; depth panels include a scanline plot against Single.
Figure A.10: Real results from a kitchen scene consisting of cabinets and kitchenware. Our method generates a piecewise smooth depth map while preserving depth details and compensating for MPI distortion.
Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours; depth panels include a scanline plot against Single.
Figure A.11: Real results from a long-range indoor scene consisting of planar walls and mirrors of different orientations and distances. Our method generates a piecewise smooth depth map while resolving phase ambiguities and compensating for MPI distortion.

Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours; depth panels include a scanline plot against Single.
Figure A.12: Real results from a cluttered living room scene consisting of painted walls, window blinds, hanger, a rocking chair and mirror. Our method generates a piecewise smooth depth map while resolving phase ambiguities and compensating for MPI distortion.

Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours; depth panels include a scanline plot against Single.
Figure A.13: Real results from a cluttered living room scene consisting of painted walls, window blinds, hanger, and chairs. Our method generates a piecewise smooth depth map while compensating for MPI distortion.
Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours; depth panels include a scanline plot against Single.
Figure A.14: Real results from a bathroom scene consisting of a ceramic bathtub and plastic curtains, with strong MPI leading to 10–20 cm depth distortions. Our method generates a piecewise smooth depth map while compensating for MPI distortions. Notice that, similar to other approaches, the proposed framework fails in the curtain area, treating the tile boundaries as depth edges. This is a limitation of our method, which we discuss in Sec. A.1.5.

Panels: (a) Scene, (b) Phase/Correlation, (c) Single, (d) Phasor, (e) SRA, (f) Depth2Depth, (g) Ours; depth panels include a scanline plot against Single.
Figure A.15: Real results from an indoor corner scene consisting of orthogonal painted walls. Our method generates a piecewise smooth depth map while compensating for MPI distortion.

A.1.3 Additional Experiments
In this section we demonstrate the temporal consistency of our method, its robustness to scene albedo, and the effect of loss functions.

Temporal Consistency
In Fig. A.16, we demonstrate that the proposed method produces temporally consistent depth predictions. Specifically, we capture a short image burst with our ToF camera and record the reconstructed depth map for each individual frame, without our framework seeing any other frame.
The proposed method generates temporally consistent depth maps with reduced MPI distortion, which is also apparent in the scanline visualization at the bottom of the figure.

Columns: FRAME 1 to FRAME 5. Rows: SCENE, TOFDEPTH, OURDEPTH, and scanlines (ToF vs. Ours).
Figure A.16: Evaluation of the temporal consistency with experimentally captured raw ToF measurements. We denote TOFDEPTH as the single-frequency output measured at ω1 = 40 MHz, and OURDEPTH as the results of the proposed framework.

Robustness to Albedos
We demonstrate the robustness of our framework with respect to scene albedo variations in Fig. A.17 and Fig. A.18. Note that for these results only object reflectivity varies. Compared to competing methods, our results are substantially less affected by albedo variations, in addition to eliminating noise, phase ambiguity and MPI.

Columns: ALBEDO 1 and ALBEDO 2 for each of two scenes. Rows: SCENE, SINGLE, PHASOR, SRA, D2D, OURS, GT.
Figure A.17: Robustness to albedos on two pairs of synthetic scenes.

Columns: ALBEDO 1 and ALBEDO 2 for each of two scenes. Rows: SCENE, SINGLE, PHASOR, SRA, D2D, OURS, GT.
Figure A.18: Robustness to albedos on two pairs of synthetic scenes.

Effect of λs and λa
We have trained variants of TOFNET with different pairs of λs and λa. See Fig. A.19 for example results on real ToF measurements. With increasing λa, i.e. the weight of the adversarial loss, our network learns to generate sharper depth edges, but it also hallucinates high-frequency noise (see columns λa = 0.03 and λa = 0.1). When the weight on the smoothness term, λs, is increased, our network generates over-regularized “cartoon”-style depth maps with reduced noise in an edge-aware fashion.
As discussed in the main paper, our final objective function combines the relative strengths of each loss.

Columns: λa = 0, 0.01, 0.03, 0.1. Rows (for each scene): λs = 0, 0.0001, 0.01.
Figure A.19: Effect of λa and λs variations on two example scenes, BOOKSHELF (top) and KITCHEN (bottom). Results are best viewed on a computer screen.

A.1.4 Implementation Details

Dataset Generation
To generate the synthetic dataset for training and quantitative evaluations, we first simulate the transient images of a given scene model (Fig. A.20a) with our time-resolved renderer. Fig. A.20b shows a few synthesized transient images from this step, in which both direct peak and indirect global illumination are present. The correlation images of the scene can then be generated as described in the main manuscript, and these results are shown in Fig. A.20c. Finally, we provide the full-size phase and amplitude images in Fig. A.20d.

Panels: (a) 3D scene model in Blender. (b) Transient frames at 11 ns, 13 ns, 16 ns and 20 ns. (c) Correlation images at (40 MHz, 0 rad), (70 MHz, 0 rad), (40 MHz, π/2 rad) and (70 MHz, π/2 rad). (d) Phase and amplitude images at 40 MHz and 70 MHz.
Figure A.20: Synthetic data generation steps.

Network Training
Tab. A.1 provides layer specifications of the TOFNET model. Each convolutional layer in G is followed by a ReLU, except those that are skip-connected to deeper layers, where the sum is rectified through a ReLU layer. At the end of the network a Tanh layer is applied to normalize the intensities. In D we use LeakyReLU with a slope of 0.2 and, unlike traditional GAN methods, omit the final Sigmoid layer.
We adopt the Torch implementation of SpatialConvolution and SpatialFullConvolution for down- and up-convolutional layers.

Generator G
Layer    Kernel Size   Stride     Output Size      Skip Connection
Input    -             -          64×H×W           to TV
F1_1     7×7           1×1        64×H×W           -
F1_2     3×3           1×1        64×H×W           to U2
D1       3×3           2×2        128×H/2×W/2      -
F2       3×3           1×1        128×H/2×W/2      to U1
D2       3×3           2×2        256×H/4×W/4      -
R1-R9    3×3           1×1        256×H/4×W/4      -
U1       4×4           1/2×1/2    128×H/2×W/2      from F2
F3       3×3           1×1        128×H/2×W/2      -
U2       4×4           1/2×1/2    64×H×W           from F1_2
F4_1     3×3           1×1        64×H×W           -
F4_2     3×3           1×1        64×H×W           -
TV       -             -          1×H×W            from Input∗

Discriminator D
Layer    Kernel Size   Stride     Output Size
Input    -             -          1×H×W
D1       4×4           2×2        64×H/2×W/2
D2       4×4           2×2        128×H/4×W/4
D3       4×4           2×2        256×H/8×W/8
F1       4×4           1×1        1×H/4×W/4

Table A.1: Specifications of the generator and discriminator networks of the proposed TOFNET model. We train on image patches with H = 128 and W = 128. ∗The amplitude image is extracted from the input correlation measurements and is skipped to the TV layer for calculating the edge-aware smoothness term.

We apply two recent successful techniques from [123] and [87] to stabilize the training procedure. In particular, we optimize for the least-squares loss during the D updates using a history of the generated depth maps. This strategy proves effective at reducing model oscillation and generating higher-quality results in our experiments. Our code and datasets will be made publicly available in the future.

A.1.5 Limitations
As shown in the previous sections, our method degrades gracefully when the scene contains saturated regions and/or severe noise as a result of low reflectivity and long-distance light fall-off; see Fig. A.14c.
In the future, we plan to investigate this problem by considering the sensor’s noise model and radiometric response function when simulating the training data, as well as devising unsupervised strategies [123] to improve the realism of the synthetic training results.

A.2 Supporting Materials for Chapter 6
In this supplementary material, we provide further details about the way our dataset was collected and how our preprocessing removes unwanted noise and depth variations while leaving intrinsic material properties intact. For convenience, all of the corresponding numerical raw data from Figures A.21, A.22 and A.23 are also provided in accompanying .csv files. In the future we plan to make the full dataset publicly available online.

A.2.1 Raw Correlation Frames
In Figure A.21 we show a subset of the raw correlation frames captured directly by our ToF camera, using the same method described in the paper. Specifically, we show all four materials (rows), i.e., paper, styrofoam, towel and wax, observed at two distances from the camera (columns). The modulation frequency and phase of each individual frame are specified in the title of each panel. As expected, the raw correlation frames are highly susceptible to both noise and depth variations.

A.2.2 Fixed Pattern Noise Removal
Following the description in Section 4.1 of the paper, we show the results of applying our fixed pattern noise removal algorithm to the raw frames in Figure A.21. While the fixed pattern noise clearly disappears in Figure A.22, the depth ambiguity still remains. Note that these calibration methods are based on the assumption of a purely sinusoidal correlation signal. Nevertheless, our method is very effective because our correlation signal closely approximates a pure sinusoid. This relationship can be seen in Figure 5 of the paper.

A.2.3 Depth Normalization
To address depth ambiguity we follow the depth normalization method introduced in Section 4.1 of the paper.
The resulting measurements after preprocessing are shown in Figure A.23. Note that in all figures only the real part is shown. Differences can be more accurately compared using the .csv files provided.

A.2.4 Dataset
Finally, after the preprocessing steps, we arrive at the correlation frames where only material-discriminative components are left at each pixel. From these, we create the datapoints for training and testing as described in Section 5 of the paper.

Panels: rows PAPER, STYROFOAM, TOWEL, WAX; columns Dst 1, Dst 2; each frame at f = 80.0 MHz, phase 1.5 rad.
Figure A.21: The raw correlation measurements from our ToF camera without any preprocessing. The rows from top to bottom show all four materials reported in the paper, i.e., paper, styrofoam, towel and wax, observed at two distances, Dst 1 (left) and Dst 2 (right). As expected, the raw correlation frames are susceptible to both fixed pattern noise (left column) and a material-independent phase offset (compare the measurements at the two distances).
Panels: rows PAPER, STYROFOAM, TOWEL, WAX; columns Dst 1, Dst 2; each frame at f = 80.0 MHz, phase 1.5 rad.
Figure A.22: The effectiveness of fixed pattern noise removal described in Section 6.4.1 of the paper. While the vertical stripes are clearly removed from the frames, material-independent depth still affects the measurements.

Panels: rows PAPER, STYROFOAM, TOWEL, WAX; columns Dst 1, Dst 2; each frame at f = 80.0 MHz, phase 1.5 rad.
Figure A.23: The effectiveness of depth normalization described in paragraph 6.4.1 of the paper. Note that in all figures only the real part is shown. Differences can be more accurately compared using the .csv files provided.
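The idea behind removing a material-independent phase offset can be sketched in a few lines of Python. This is an illustrative toy model under stated assumptions, not the depth normalization procedure of the paper: it assumes each pixel yields one complex phasor per modulation frequency, that distance only adds a round-trip delay and a common amplitude fall-off, and that the phase at the reference frequency does not wrap. The function names `simulate` and `normalize_depth`, the chosen frequencies and the material response values are all hypothetical.

```python
import cmath
import math

C = 299_792_458.0  # speed of light in m/s


def simulate(material_response, distance_m, falloff):
    """Toy forward model (an assumption): each frequency's material phasor is
    rotated by the round-trip delay and scaled by a common fall-off factor."""
    tau = 2.0 * distance_m / C
    return {
        f: falloff * g * cmath.exp(1j * 2.0 * math.pi * f * tau)
        for f, g in material_response.items()
    }


def normalize_depth(measurements, f_ref):
    """Remove the distance-dependent delay and amplitude using the phasor at a
    reference frequency (hypothetical scheme, valid while its phase is unwrapped)."""
    m_ref = measurements[f_ref]
    tau_hat = cmath.phase(m_ref) / (2.0 * math.pi * f_ref)  # estimated delay
    scale = abs(m_ref)                                      # common fall-off
    return {
        f: m * cmath.exp(-1j * 2.0 * math.pi * f * tau_hat) / scale
        for f, m in measurements.items()
    }


# Hypothetical material response at two modulation frequencies.
response = {20e6: 1.0 + 0.0j, 80e6: 0.6 * cmath.exp(0.3j)}

near = normalize_depth(simulate(response, distance_m=1.0, falloff=1.0), f_ref=20e6)
far = normalize_depth(simulate(response, distance_m=1.5, falloff=0.4), f_ref=20e6)
max_diff = max(abs(near[f] - far[f]) for f in response)
```

In this toy setting the normalized phasors at the two distances coincide, so whatever remains depends only on the material response. Real measurements additionally contain noise and the fixed pattern offsets discussed in A.2.2, which this sketch ignores.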

