UBC Theses and Dissertations

Deep learning based multi-modal image analysis for enhanced situation awareness and environmental perception Liu, Shuo 2018

DEEP LEARNING BASED MULTI-MODAL IMAGE ANALYSIS FOR ENHANCED SITUATION AWARENESS AND ENVIRONMENTAL PERCEPTION

by

Shuo Liu

B.Eng., Beijing University of Technology, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF APPLIED SCIENCE

in

THE COLLEGE OF GRADUATE STUDIES
(Electrical Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA
(Okanagan)

August 2018

© Shuo Liu, 2018

The following individuals certify that they have read, and recommend to the College of Graduate Studies for acceptance, the thesis entitled:

DEEP LEARNING BASED MULTI-MODAL IMAGE ANALYSIS FOR ENHANCED SITUATION AWARENESS AND ENVIRONMENTAL PERCEPTION

submitted by Shuo Liu in partial fulfillment of the requirements of the degree of Master of Applied Science.

Dr. Zheng Liu, School of Engineering
Supervisor

Dr. Loïc Markley, School of Engineering
Supervisory Committee Member

Dr. Yang Cao, School of Engineering
Supervisory Committee Member

Dr. Liwei Wang, School of Engineering
University Examiner

Abstract

Situation awareness (SA) plays an important role in surveillance and security applications. In an SA system, the perception capability is fundamental and essential. However, accurate perception of the key elements and events in a complex and dynamic environment is a challenging task, due to target scale variations, complex backgrounds, and poor illumination conditions. The research presented in this thesis aims to address these challenges by developing deep learning approaches for enhanced environmental perception.

Firstly, a deep learning based multi-modal image fusion method is proposed for the automatic target detection task, in which the information from visible, thermal, and temporal images is fused with a multi-channel convolutional neural network (CNN). Compared with conventional multi-modal image fusion methods, the fusion strategy in the proposed method is learned automatically at the training stage rather than designed by hand.

Secondly, the deep multi-modal image fusion (DMIF) is further improved in a new framework, where the multi-modal image fusion, region proposal, and region-wise classification modules are integrated into an end-to-end neural network. Thus, it becomes more efficient to train and optimize the neural network. In addition, a deeper CNN is also implemented. Comprehensive experiments demonstrate that the proposed DMIF can successfully address the challenges that arise from target scale variations and complex backgrounds in a dynamic environment.

Finally, to enhance environmental perception on dark nights, a deep learning based thermal image translation method (namely IR2VI) is presented. As a visible camera usually does not work at night without sufficient illumination, multi-modal image fusion methods for context enhancement will not function properly in this specific situation. The proposed IR2VI method is able to translate nighttime thermal images into daytime, human-favorable visible images. Experimental results show the superiority of IR2VI over the state of the art.

Lay Summary

The accurate perception of targets in a complex and dynamic environment (e.g., a battlefield) is critical for a situation awareness system (e.g., a security system). However, conventional perception applications face many challenges, due to the target scale variations, complex backgrounds, and poor illumination conditions in such environments. In this thesis, a deep learning based multi-modal image fusion method is proposed to address the challenges that arise from target scale variations and complex backgrounds, which enables the environmental perception to be more accurate, efficient, and robust.
To tackle the challenge of poor illumination on dark nights, a novel thermal image translation neural network is proposed, which can translate a nighttime thermal infrared image into a daytime, human-favorable visible image. Experimental results demonstrate the effectiveness of the proposed methods.

Preface

This thesis is based on research work conducted in the School of Engineering at The University of British Columbia, Okanagan Campus, under the supervision of Prof. Zheng Liu. Published works are contained in this thesis.

Chapter 3 is based on the following published paper and used with permission of IOS Press:

• Shuo Liu, Vijay John, and Zheng Liu. On the prospects of using deep learning for surveillance and security applications. In D.J. Hemanth and V.V. Estrela, editors, Deep Learning for Image Processing Applications, pages 218-243. IOS Press, Netherlands, 2017.

Chapter 4 will be submitted for consideration as an IEEE refereed paper.

Chapter 5 is based on the following published paper and used with permission of IEEE:

• Shuo Liu, Vijay John, Erik Blasch, Zheng Liu, and Ying Huang. IR2VI: Enhanced night environmental perception by unsupervised thermal image translation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1153-1160. IEEE, 2018. (© 2018 IEEE)

I am the principal contributor to these works. Prof. Zheng Liu provided me with research ideas and suggestions to improve my research. Prof. Vijay John, Prof. Erik Blasch, Prof. Ying Huang, and Prof. Huan Liu helped me prepare the manuscripts for scholarly publication by checking the validity of experimental results and proofreading.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Background and Motivation
  1.2 Literature Review and Challenges
    1.2.1 Environmental Perception System
    1.2.2 Multi-Modal Image Fusion
    1.2.3 Deep Learning
  1.3 Thesis Outline and Contributions
2 Deep Learning: State of the Art
  2.1 Introduction
  2.2 Convolutional Neural Network
  2.3 Generative Adversarial Network
  2.4 Summary
3 Multi-Channel CNN-based Automatic Target Detection for Enhanced Situation Awareness
  3.1 Introduction
  3.2 Multi-channel CNN-based ATD
    3.2.1 Image Fusion
    3.2.2 Regions of Interest Proposal
    3.2.3 Neural Network Architecture
    3.2.4 Training Details
  3.3 Experiments
    3.3.1 Dataset
    3.3.2 Experimental Setup
    3.3.3 Evaluation
  3.4 Summary
4 Deep Multi-Modal Image Fusion
  4.1 Introduction
  4.2 Deep Fusion Methodology
    4.2.1 Deep Multi-Modal Image Fusion
    4.2.2 Region Proposal Neural Network
    4.2.3 Classification & Regression Sub-Network
  4.3 Experimental Results
    4.3.1 Dataset
    4.3.2 Experimental Setup
    4.3.3 Comparison with the State-of-the-Arts
    4.3.4 Analysis of Target Scales
    4.3.5 Analysis of Environmental Complexity
    4.3.6 Statistical Analysis
    4.3.7 Discussions
  4.4 Summary
5 Thermal Image Translation for Enhanced Environmental Perception at Night
  5.1 Introduction
  5.2 Thermal Image Translation (IR2VI)
    5.2.1 IR-GVI
    5.2.2 GVI-CVI
    5.2.3 Evaluation methods
  5.3 Experimental Results
    5.3.1 Dataset
    5.3.2 Experiments Setup
    5.3.3 IR-GVI Results
    5.3.4 GVI-CVI Results
  5.4 Summary
6 Conclusions
Bibliography

List of Tables

Table 3.1 Neural network configuration.
Table 3.2 Performance comparison on accuracy and time cost of different methods.
Table 4.1 Performance comparison on time cost of different methods.
Table 4.2 Observation distance (OD), SNR and corresponding AP for different image models.
Table 4.3 Results of multiple linear regression for the data in Table 4.2.
Table 5.1 Evaluation results of different image translation methods using different NR-IQA criteria.
Table 5.2 Ranking results of different IR-GVI methods by the TOPSIS based on the different NR-IQA criteria.
Table 5.3 AP scores of the object detector on the generated GVI images by different IR-GVI methods.
Table 5.4 Evaluation of different image colorization methods.
Table 5.5 Ranking results of different GVI-CVI methods by the TOPSIS based on the different NR-IQA criteria.

List of Figures

Figure 1.1 The simplified version of Endsley's SA model. Modified image based on the source: [1]
Figure 2.1 The illustration of LeNet-5 architecture. Source: [2]
Figure 2.2 Illustration of Generative Adversarial Network
Figure 3.1 Left: the appearance of the target is indistinguishable from the background environment. Right: the scale of the target varies dramatically.
Figure 3.2 The pipeline of proposed multi-channel CNN-based ATD.
Figure 3.3 The procedure of motion estimation.
Figure 3.4 Illustration of different image fusion architectures.
Figure 3.5 Appearance of targets in training dataset and testing dataset.
Figure 3.6 Average precision (AP) comparison between different experimental designs.
Figure 3.7 The visual results of the proposed method.
Figure 4.1 Two sample images from the SENSIAC dataset illustrating complex situations.
Figure 4.2 Overall framework of deep multi-modal image fusion based target detection (DMIF).
Figure 4.3 The illustration of proposed deep multi-modal image fusion module.
Figure 4.4 The neural network configuration of Region Proposal Neural Network module.
Figure 4.5 The neural network configuration of the region-wise classification & regression sub-network.
Figure 4.6 Comparison of the state-of-the-art methods. Left: The overall Precision-Recall curves of different methods. Right: The local enlarged image from the left.
Figure 4.7 The AP comparison against observation distance in large target scales.
Figure 4.8 The AP comparison against observation distance in small target scales.
Figure 4.9 Illustration of target area vs local environment area.
Figure 4.10 The distribution of SNR value of MWIR imageries against distances.
Figure 5.1 The overview of the IR2VI method.
Figure 5.2 An overall architecture of the proposed Texture-Net.
Figure 5.3 An example of the results from the CycleGAN to illustrate the incorrect mapping problem. The CycleGAN incorrectly mapped the vegetation to the ground.
Figure 5.4 Illustration of different state-of-the-art GVI colorization architectures.
Figure 5.5 Subjective comparison of different IR-GVI methods on the SENSIAC dataset.
Figure 5.6 The precision-recall curve of the object detector on different synthesis images.
Figure 5.7 Samples of colorized images by different GVI-CVI methods.

Glossary

ATR  Automatic Target Recognition
ATD  Automatic Target Detection
AP  Average Precision
CNN  Convolutional Neural Network
CVI  Colorful Visible Image
CSR  Convolutional Sparse Representation
DL  Deep Learning
DPM  Deformable Part Models
DTCWT-SR  Dual-Tree Complex Wavelet Transform with Sparse Representation
DMIF  Deep Multi-Modal Image Fusion
GAN  Generative Adversarial Network
GPU  Graphics Processing Unit
GVI  Grayscale Visible Image
HOG  Histogram of Oriented Gradients
HVS  Human Visual System
IR  Infrared
MCDM  Multiple Criteria Decision Making
MWIR  Mid-Wave Infrared
MI  Motion Image
NR-IQA  Non-Reference Image Quality Assessment
NSS  Natural Scene Statistic
OD  Observation Distances
R-CNN  Region-based Convolutional Neural Network
ROI  Region of Interest
RPN  Region Proposal Network
SA  Situation Awareness
SIFT  Scale-Invariant Feature Transform
SVM  Support Vector Machine
SENSIAC  Military Sensing Information Analysis Center
SSD  Single Shot MultiBox Detector
SNR  Signal-to-Noise Ratio
SGD  Stochastic Gradient Descent
TIT  Thermal Image Translation
TOPSIS  The Technique for Order of Preference by Similarity to Ideal Solution
VI  Visible Image

Acknowledgments

Many people have helped me to complete this dissertation.
Above all, I owe immense gratitude to my supervisor, Professor Zheng Liu, whose guidance and encouragement throughout the long course of my degree have been indispensable, heartfelt, and inspiring. Under his guidance, I have attained invaluable research skills, and I also thank him for giving me the opportunity to work on multiple projects with companies, which helped me gain further hands-on working experience. He has been an inspiration to work with, a true mentor, and a great role model.

I would like to thank Dr. Liwei Wang for his willingness to serve as my external examiner. I would also like to thank Dr. Loïc Markley and Dr. Yang Cao for their willingness to serve on the advisory committee. I really appreciate their valuable time and constructive comments on my research and thesis. I would also like to express my thanks to Dr. Huan Liu, Dr. Ying Huang, Dr. Mingliang Gao, Dr. Vijay John, and Dr. Erik Blasch for their feedback, constructive comments, and valuable suggestions on my research work.

In addition, I offer heartfelt thanks to all of my colleagues at the Intelligent Sensing, Diagnostic and Prognostic Research Lab (ISDPRL), who have given me their encouragement, academic ideas, and support for two years.

Finally, I owe my family the greatest appreciation, and I would like to thank them for their patience, understanding, support, and love over all these years. None of my achievements would have been possible without their constant encouragement and support.

Chapter 1
Introduction

1.1 Background and Motivation

Situation Awareness (SA) plays an important role in surveillance and security applications: it allows robots or human operators to respond immediately and take actions that increase the survivability and security of the equipment, platforms, and forces [3].
Whether the tactical scenario is the onslaught of an array of combat vehicles coming through the Fulda Gap, which was feared during the Cold War [4], or the identification of humans with intent to kill in an urban scene, the identification of the threat for avoidance and engagement is paramount to survival and threat neutralization.

SA is defined as the perception of the elements of the environment, the comprehension of their meaning (understanding), and the projection (prediction) of their status in order to enable decision superiority [1].

Figure 1.1: The simplified version of Endsley's SA model. Modified image based on the source: [1]

The simplified version of Endsley's SA model [1] is depicted in Figure 1.1, showing the three levels of perception, comprehension, and projection. These are construed as aspects or levels of mental representation:

1. Level 1. Perception provides information about the presence, characteristics, and activities of elements in the environment.
2. Level 2. Comprehension encompasses the combination, interpretation, storage, and retention of information, yielding an organized representation of the current situation by determining the significance of objects and events.
3. Level 3. Projection involves forecasting future events.

Perception is the foundation and the key to the SA model. However, the performance of a perception system is affected by many factors, e.g., target scale variations, complex backgrounds, and poor illumination conditions. In recent decades, numerous researchers have harnessed multi-modal image fusion to enhance environmental perception [5–9]. They assume that the fused image can expand the capability of a vision system under varying environmental conditions, target variations, and viewpoint obscuration. Generally, multi-modal image fusion techniques are divided into three levels depending on the stage at which fusion takes place: pixel level, feature level, and decision level [10].
Pixel-level fusion is the most popular research direction in the image fusion field because it introduces minimal artifacts: the pixels of the fused image are determined from a set of image pixels or other forms of image parameters at the lowest physical level [11]. By contrast, the higher fusion levels, i.e., feature level and decision level, combine information in the form of image feature descriptors and probabilistic variables. Pixel-level fusion can be further divided into spatial-domain and transform-domain algorithms. Common techniques adopted in the fusion operation include principal component analysis [12], wavelets [13], and sparse representation [7]. Theoretically, the fused image with enhanced context is suitable for both human visual perception and low-level image processing. However, these traditional multi-modal image fusion approaches offer limited benefit for real-time machine processing and analysis in practical situations.

In the last few years, Deep Learning (DL) has dramatically advanced the state of the art in visual object recognition, speech recognition, and natural language processing [14]. Technically, Deep Learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction, an idea inspired by the natural visual perception mechanism of living creatures. In this thesis, deep learning based approaches are proposed to address the challenges that arise from target variations, complex backgrounds, and poor illumination conditions, in order to advance the state of the art of environmental perception and SA systems.

1.2 Literature Review and Challenges

1.2.1 Environmental Perception System

Generally, an environmental perception system comprises five key components [15]: target detection, tracking, classification, recognition, and identification.
Among these components, target tracking aims to track one or multiple targets over time based on a given accurate location. In [16], Gundogdu et al. proposed an ensemble tracking algorithm, which is able to switch between different correlators according to the current target appearance. Even though they achieved a promising accuracy, there was still room for improvement in computational efficiency. To this end, Demir et al. [17] implemented an efficient tracker by leveraging the co-difference matrix. In addition to tracking, researchers have also focused on high-level tasks, such as Automatic Target Recognition (ATR). A series of studies proposed a general system, based on a shape generative model, that supports recognition, segmentation, and pose estimation jointly [18–20]. Target detection is the fundamental and important component of an environmental perception system, especially under challenging conditions in the military context. However, to the best of the author's knowledge, only a few reports on military target detection are publicly available. Most recently, Millikan et al. [21] proposed an infrared-focused military target detector combining image reconstruction and quadratic correlation techniques, but its accuracy was not sufficient for application in a practical scenario.

Numerous challenges remain to be solved in perception system design. First of all, the scale of the target varies with range. Specifically, the scenario is expansive, and the target of interest could be extremely far from the surveillance devices and sensors. As a result, the targeted object captured in the image or video is rather small and cannot be easily detected. The second challenge is the complex environment of the scenario. Rocks and trees could obscure the target.
Meanwhile, the targeted objects are likely to disguise themselves so that they cannot easily be recognized by the perception system.

1.2.2 Multi-Modal Image Fusion

Multi-modal image fusion techniques could offer an effective solution to the challenges posed by the complex environment [22, 23]. The fusion operation generates a composite image with complementary information from multi-modal images acquired over a wider range of the electromagnetic spectrum. The high-level perception tasks are then carried out on the fused outcomes. For example, Zheng et al. [24] improved the performance of vehicle identification and threat analysis via multi-modal image fusion. In particular, the Infrared (IR)/thermal image and the visible image are widely adopted in multi-modal imaging systems for perception applications. The fusion operation can be implemented at the pixel, feature, or decision level. Numerous studies on pixel-level fusion have been reported in the last decade. The intuitive results achieved by pixel-level fusion can benefit end users through direct observation. Among these pixel-level methods, transform-domain approaches are the dominant solution, motivated by the human visual system [25]. The general steps of transform-domain multi-modal image fusion are transforming the input images to a specific transform domain, performing the fusion operation by combining coefficients, and generating the fused image by applying the inverse transform.
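The three steps above can be sketched in a few lines of Python. This is a minimal illustration and not one of the cited methods: it assumes two registered grayscale images of equal size with even dimensions, uses a single-level Haar wavelet as the transform, averages the approximation coefficients, and keeps the detail coefficient with the larger magnitude (a choose-max rule). The function names `haar2d`, `ihaar2d`, and `fuse` are hypothetical, and the random arrays only stand in for real visible and thermal images.

```python
import numpy as np


def haar2d(img):
    """Step 1: single-level 2D Haar transform (unnormalized averages/differences)."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # vertical average of row pairs
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # vertical difference of row pairs
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0      # approximation sub-band
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0      # detail sub-bands
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh


def ihaar2d(ll, lh, hl, hh):
    """Step 3: exact inverse of haar2d."""
    h, w = ll.shape
    a = np.empty((h, 2 * w))
    d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    img = np.empty((2 * h, 2 * w))
    img[0::2, :], img[1::2, :] = a + d, a - d
    return img


def fuse(img_a, img_b):
    """Step 2: combine coefficients, then invert the transform."""
    ca = haar2d(np.asarray(img_a, dtype=float))
    cb = haar2d(np.asarray(img_b, dtype=float))
    ll = (ca[0] + cb[0]) / 2.0                          # average the approximations
    details = [np.where(np.abs(da) >= np.abs(db), da, db)  # choose-max on details
               for da, db in zip(ca[1:], cb[1:])]
    return ihaar2d(ll, *details)


# Random stand-ins for a registered visible/thermal image pair (assumption).
rng = np.random.default_rng(0)
visible = rng.random((8, 8))
thermal = rng.random((8, 8))
fused = fuse(visible, thermal)
```

The fusion rule here is fixed by hand, which is exactly the kind of design choice that a learned fusion strategy would replace.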
Various transform methods have been proposed, including the stationary wavelet transform [26], discrete wavelet transform [27], non-subsampled contourlet transform [28], self-fractional Fourier functions [29], Dual-Tree Complex Wavelet Transform with Sparse Representation (DTCWT-SR) [6], and Convolutional Sparse Representation (CSR) [7]. In addition, fusion operations have also been implemented with hand-crafted fusion strategies, such as guided filtering based weighted average [30] and choose-max [31]. A comprehensive review of the state of the art is available in [22].

1.2.3 Deep Learning

Deep Learning has recently brought a series of breakthroughs for many generic computer vision challenges [14], such as image recognition [32], object detection [33], and semantic segmentation [34]. Among the different types of deep neural networks in the Deep Learning family, the Convolutional Neural Network (CNN) plays an important role. It is a trainable feed-forward neural network mainly composed of convolution layers, pooling layers, and normalization layers. By training on a large-scale dataset, a CNN can learn a hierarchical representation of an object or scene. Recent work reported in [35–37] has shown that deeper CNNs can achieve better performance on computer vision tasks. Driven by this insight, He et al. proposed a very deep neural network, called ResNet [36], which comprises hundreds of convolution layers and broke many records in numerous tasks. Meanwhile, there are few deep CNN based methods for multi-modal image fusion. Recently, a CNN based fusion method for multi-focus images was reported in [38]. Zhong et al. [39] made an effort to solve the remote sensing fusion problem with a CNN model.

1.3 Thesis Outline and Contributions

The thesis is organized into six chapters.

Chapter 1 presents the background of the SA model, and the current solutions and challenges in enhancing environmental perception.
In addition, this chapter provides a literature review related to SA and environmental perception.

Chapter 2 reviews the state of the art of the Deep Learning techniques related to this thesis. First, the background of Deep Learning is presented. Second, one of the most important types of Deep Learning techniques, the CNN, is introduced. Finally, an emerging technique in Deep Learning, the Generative Adversarial Network (GAN), is described.

In Chapter 3, a multi-channel CNN based Automatic Target Detection (ATD) algorithm for enhanced SA is proposed, in which the visible, thermal, and temporal images are fused in a CNN. The proposed method achieves 98.34% Average Precision (AP) on the Military Sensing Information Analysis Center (SENSIAC) dataset, which is better than the baseline, single-modal ATD.

In Chapter 4, a new Deep Multi-Modal Image Fusion (DMIF) framework is proposed, which is a further improved version of the multi-channel CNN based ATD algorithm presented in Chapter 3. In the DMIF framework, a deeper CNN model is adopted to carry out both the multi-modal image fusion and ATD tasks. In addition, the handcrafted region proposal module used in Chapter 3, selective search, is replaced with a more efficient module, the Region Proposal Network (RPN). Consequently, the DMIF framework is a fully end-to-end neural network that can be optimized efficiently on a Graphics Processing Unit (GPU) device. Extensive experiments on the SENSIAC dataset show that DMIF achieves 99.73% AP with competitive run-time performance.

In Chapter 5, a Deep Learning based thermal image translation method, namely IR2VI, is presented to enhance environmental perception on dark nights. The IR2VI algorithm is able to translate nighttime IR images into daytime colorful visible images, which provide an end user with more semantic information for further perception. Experimental results show the superiority of IR2VI over the state of the art.

Chapter 6 concludes the thesis.
Future work is also suggested.

Chapter 2
Deep Learning: State of the Art

2.1 Introduction

Machine learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Conventional machine learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern recognition or machine learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation from which the learning system could detect or classify patterns in the input.

With the growing availability of large-scale data sets and advanced computational resources, Deep Learning methods have become very popular in recent years. Deep Learning is a subset of machine learning. It typically uses multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts.
The key difference between conventional machine learning and Deep Learning is that the layers of features in Deep Learning methods are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

How is a Deep Learning system trained? Take as an example a simple image classification system that aims to classify images containing a person, a pet, or a car. To train this system, the first step is to collect a large dataset of images of people, pets, and cars, each labeled with its category. During training, the system is shown an image and produces an output in the form of a vector of scores, one for each category. The desired category should have the highest score among all categories, but this is unlikely to happen before training. By computing an objective function that measures the error (or distance) between the output scores and the desired pattern of scores, the system modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as knobs that define the input-output function of the system. In a typical Deep Learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labeled examples with which to train the system.

2.2 Convolutional Neural Network

Among the different types of Deep Learning architectures, CNNs have been the most extensively studied in the last few years. There are numerous variants of CNN architectures in the literature. However, their basic components are very similar. The famous LeNet-5 [2], for example, consists of three types of layers, namely convolutional, pooling, and fully-connected layers. The convolutional layer aims to learn feature representations of the inputs. As shown in Figure 2.1, a convolutional layer is composed of several convolution kernels, which are used to compute different feature maps.
Specifically, each neuron of a feature map is connected to a region of neighboring neurons in the previous layer. Such a neighborhood is referred to as the neuron's receptive field in the previous layer. The new feature map can be obtained by first convolving the input with a learned kernel and then applying an element-wise nonlinear activation function to the convolved results. Note that, to generate each feature map, the kernel is shared by all spatial locations of the input. The complete feature maps are obtained by using several different kernels.

Figure 2.1: Illustration of the LeNet-5 architecture. Source: [2]

Mathematically, the feature value at location $(i, j)$ in the $k$-th feature map of the $l$-th layer, $z^{l}_{i,j,k}$, is calculated by:

$$z^{l}_{i,j,k} = {W^{l}_{k}}^{T} x^{l}_{i,j} + b^{l}_{k}, \tag{2.1}$$

where $W^{l}_{k}$ and $b^{l}_{k}$ are the trainable weight vector and bias term of the $k$-th filter of the $l$-th layer, respectively, and $x^{l}_{i,j}$ is the input patch centred at location $(i, j)$ of the $l$-th layer. Note that the kernel $W^{l}_{k}$ that generates the feature map $z^{l}_{:,:,k}$ is shared. Such a weight sharing mechanism has several advantages: it reduces the model complexity and makes the neural network easier to train. The activation function introduces nonlinearities to the CNN, which are desirable for multi-layer neural networks to detect nonlinear features. Let $a(\cdot)$ denote the nonlinear activation function. The activation value $a^{l}_{i,j,k}$ of the convolutional feature $z^{l}_{i,j,k}$ can be computed as:

$$a^{l}_{i,j,k} = a(z^{l}_{i,j,k}). \tag{2.2}$$

Typical activation functions are Sigmoid, TanH and rectified linear units (ReLU) [40].

The pooling layer aims to achieve shift-invariance by reducing the resolution of the feature maps. It is usually placed between two convolutional layers. Each feature map of a pooling layer is connected to its corresponding feature map of the preceding convolutional layer.
Denoting the pooling function as $P(\cdot)$, for each feature map $a^{l}_{:,:,k}$ the operation is as follows:

$$y^{l}_{i,j,k} = P(a^{l}_{m,n,k}), \quad \forall (m,n) \in R_{ij}, \tag{2.3}$$

where $R_{ij}$ is a local neighbourhood around location $(i, j)$. The typical pooling operations are average pooling [41] and max pooling [42]. The kernels in the first convolutional layer tend to detect low-level features such as edges and curves, while the kernels in higher layers learn to encode more abstract features. By stacking several convolutional and pooling layers, higher-level feature representations can be gradually extracted.

After several convolutional and pooling layers, there may be one or more fully-connected layers, which aim to perform high-level reasoning [32, 35]. They take all neurons in the previous layer and connect them to every single neuron of the current layer to generate global semantic information.

The last layer of a CNN is an output layer. For classification tasks, the Softmax [43] operator is commonly used, which transforms all the activations in the layer before the output layer into a series of values that can be interpreted as probabilities. Let $\theta$ denote all the parameters of a CNN (e.g., the weight vectors and bias terms). The optimum parameters for a specific task can be obtained by minimizing an appropriate loss function defined on that task. Suppose there are $N$ desired input-output relations $\{(x^{(n)}, y^{(n)});\ n \in [1, \ldots, N]\}$, where $x^{(n)}$ is the $n$-th input data, $y^{(n)}$ is its corresponding target label and $o^{(n)}$ is the output of the CNN. The loss of the CNN can be calculated as follows:

$$L = \frac{1}{N} \sum_{n=1}^{N} l(\theta; y^{(n)}, o^{(n)}). \tag{2.4}$$

Training a CNN is a problem of global optimization. By minimizing the loss function, the best fitting set of parameters can be found.
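To make Equations 2.1 to 2.4 concrete, the following is a minimal sketch of one forward pass in plain Python, not the implementation used in this thesis. It assumes stride 1 and no padding for the convolution, ReLU for $a(\cdot)$, max pooling for $P(\cdot)$, and cross-entropy on Softmax outputs for the per-sample loss $l$; the function names, the toy input and the edge kernel are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def conv2d(x: Matrix, w: Matrix, b: float) -> Matrix:
    """Equation 2.1: apply one shared kernel w (plus bias b) at every location.

    Assumptions: stride 1, no padding, and the patch is indexed from its
    top-left corner rather than its centre.
    """
    kh, kw = len(w), len(w[0])
    return [[b + sum(w[m][n] * x[i + m][j + n]
                     for m in range(kh) for n in range(kw))
             for j in range(len(x[0]) - kw + 1)]
            for i in range(len(x) - kh + 1)]


def relu(z: Matrix) -> Matrix:
    """Equation 2.2 with a(.) chosen as the rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in z]


def max_pool(a: Matrix, size: int = 2) -> Matrix:
    """Equation 2.3 with P(.) = max over non-overlapping size x size regions."""
    return [[max(a[i + m][j + n] for m in range(size) for n in range(size))
             for j in range(0, len(a[0]) - size + 1, size)]
            for i in range(0, len(a) - size + 1, size)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into values that can be read as probabilities."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_loss(outputs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Equation 2.4, assuming l(.) is cross-entropy on the Softmax outputs."""
    per_sample = [-math.log(softmax(o)[y]) for o, y in zip(outputs, labels)]
    return sum(per_sample) / len(per_sample)


# Toy 5x5 input containing a vertical edge, and a hypothetical 2x2 edge kernel.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
kernel = [[-1.0, 1.0], [-1.0, 1.0]]
pooled = max_pool(relu(conv2d(image, kernel, b=0.0)))

# Loss for one sample whose raw class scores are all equal (three classes).
uniform_loss = mean_loss([[0.0, 0.0, 0.0]], [1])
```

In this toy case the kernel responds only where the intensity steps from 0 to 1, so the pooled map is non-zero only in its left column, and a network that scores all three classes equally incurs a loss of $\log 3$.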
Stochastic Gradient Descent (SGD) is a common solution for optimizing a CNN [2].

2.3 Generative Adversarial Network

The Generative Adversarial Network (GAN) is an emerging technique in the Deep Learning community, which enables a deep learning model to be trained in an unsupervised manner. It helps to solve tasks such as image generation from descriptions, obtaining high-resolution images from low-resolution ones, image translation, and so on. In 2014, Goodfellow et al. [44] first proposed the idea of the GAN, which is composed of two components: the generator G and the discriminator D. G produces fake samples from the latent variable z (i.e., random numbers drawn from a distribution), whereas D takes both fake samples and real samples and decides whether its input is real or fake. D produces a higher probability when it determines that its input is more likely to be real. G and D oppose each other to achieve their individual goals, hence the term adversarial.

Figure 2.2: Illustration of the Generative Adversarial Network

When this adversarial situation is formulated as an objective function, the GAN solves the “minimax” problem in Equation 2.5 with parametrized neural networks G and D. The $p_{data}(x)$ and $p_z(z)$ in Equation 2.5 denote the real data probability distribution defined in the data space $X$ and the probability distribution of the latent variable $z$ defined on the latent space $Z$, respectively. It should be noted that G maps the latent variable $z$ from $Z$ into an element of $X$, whereas D takes an input $x$ and distinguishes whether $x$ comes from the real samples or from G. The objective function is thus represented as follows:

$$\min_{G} \max_{D} V(G, D) = \min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_{z}}[\log(1 - D(G(z)))], \tag{2.5}$$

where $V(G, D)$ is a binary cross-entropy function, which is commonly used to measure the performance of a classification model whose output is a probability value between 0 and 1.
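As an illustration of Equation 2.5, the sketch below estimates $V(G, D)$ from a handful of discriminator outputs by replacing the two expectations with sample means. It is a minimal sketch under stated assumptions, not part of the thesis method: the function name `gan_value` and the probability values are hypothetical, and the networks G and D themselves are not modelled.

```python
import math
from typing import Sequence


def gan_value(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Sample estimate of V(G, D) in Equation 2.5.

    d_real holds D(x) for real samples x, d_fake holds D(G(z)) for generated
    samples; every entry must lie strictly between 0 and 1.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well (hypothetical outputs).
confident_value = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])

# A discriminator that outputs 0.5 for every sample.
undecided_value = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The first value is larger than the second, which matches the roles in Equation 2.5: D gains by pushing $V$ up, while G gains by pushing it down. When D outputs 0.5 for every sample, the estimate equals $-\log 4$.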
From D's perspective, if a sample comes from real data, D will maximize its output, and if a sample comes from G, D will minimize its output; thus, the $\log(1 - D(G(z)))$ term is derived in Equation 2.5. Meanwhile, G wants to deceive D, so it tries to maximize D's output when a fake sample is presented to D. Consequently, D tries to maximize $V(G, D)$ while G tries to minimize $V(G, D)$, which is where the “minimax” relationship in Equation 2.5 comes from. Figure 2.2 shows an outline of the GAN, where D produces a higher probability (near 1) when it decides its input is more likely to come from real data.

Theoretically, assuming that the models of G and D both have sufficient capacity, the Nash equilibrium for Equation 2.5 can be achieved through the following procedure. First, D is trained to obtain an optimal discriminator for a fixed G, and then G tries to fool D so that $D(G(z))$ produces a high probability. By iteratively optimizing such a “minimax” problem, D can no longer discriminate whether its input is real or fake, because $p_{data}(x) = p_g(x)$ has been achieved, where $p_g$ denotes the distribution of the samples generated by G, and $D(x)$ produces a probability of $\frac{1}{2}$ for all real and fake samples.

2.4 Summary

In this chapter, the basic principles and the state-of-the-art techniques (CNN and GAN) of Deep Learning are introduced. Since the CNN, with its ability to process 2D data, is suitable for image processing and computer vision tasks, it is adopted in this research to enhance environmental perception and situation awareness. For the night vision enhancement in Chapter 5, the GAN is applied to the unsupervised image-to-image translation task.

Chapter 3
Multi-Channel CNN-based Automatic Target Detection for Enhanced Situation Awareness

3.1 Introduction

Automatic Target Detection (ATD) is a key technology for SA systems and military operations. In the case of a military mission, sensors can be placed on the ground or mounted on unmanned aerial vehicles to acquire sensory data.
The acquired raw data are fed into the ATD algorithms, where the targets are located and highlighted. Fast and accurate ATD can increase the lethality and survivability of the weapons platform/soldier in the battlefield. For decades, numerous ATD algorithms have been proposed. Generally, these algorithms can be classified into two main categories: background modelling approaches and learning-based approaches. Background modelling approaches assume that background pixels have a similar color (or intensity) over time in a fixed camera. The background is abstracted from the input image, while the foreground (moving objects) region is determined by marking the pixels in which a significant difference occurs. In [45], the authors modeled the background using a kernel density estimation method over a joint domain-range representation of image pixels. A multilayer codebook-based background subtraction model was proposed in [46], which can remove most of the non-stationary background and significantly increase the processing efficiency. Reference [47] proposed a motion detection model based on probabilistic neural networks. These methods are designed for the stationary camera scenario. In [48–50], the authors proposed several schemes that can handle the problems in the moving camera scenario. The background modelling based methods are effective for detecting moving objects, whereas when the objects are still or moving slowly, those methods are unsatisfactory.

Figure 3.1: Left: the appearance of the target is indistinguishable from the background environment. Right: the scale of the target varies dramatically.

Another popular category is the learning-based approaches. Traditionally, hand-engineered features, e.g., Scale-Invariant Feature Transform (SIFT) [51] and Histogram of Oriented Gradients (HOG) [52], are first extracted and then fed into a classifier, e.g., boosting trees [53], Support Vector Machine (SVM) [54], or random forest [55].
The typical work in this paradigm is the Deformable Part Models (DPM) [56]. More recently, the Convolutional Neural Network (CNN) has had a significant impact on the ATD research community and helped achieve promising results in many difficult object detection challenges [57–59]. Overfeat [60] was the first method to utilize CNN models in a sliding window fashion for the ATD task. It has two CNNs, one for classifying whether a window contains an object and the other for predicting the bounding box of the object. After that, the most popular CNN based ATD framework appeared, called Region-based Convolutional Neural Network (R-CNN) [33], which uses a pre-trained CNN to extract features from region proposals generated by selective search [61], and then classifies these region proposals with class-specific linear SVMs. The significant advantage of this work is derived from replacing hand-engineered features with a CNN feature extractor. Subsequently, variants of R-CNN were proposed, mainly to reduce the computational burden [62–64].

Currently, these ATD methods are only applicable to general scenarios. When it comes to a complex environment, e.g., the battlefield, many challenges arise. Firstly, the battlefield is extremely complex. As shown in the left example of Figure 3.1, the appearance of the target, including its color and texture, is similar to the background, because soldiers always attempt to disguise themselves or their vehicles. Secondly, due to the vast battlefield, the scale of objects changes dramatically with their distance to the sensors. These environmental factors limit the ability of a generic ATD algorithm. Thirdly, the military ATD application always runs on an embedded platform whose computational and memory resources are limited.
In this case, the ability to run at high frame rates with relatively high accuracy is important to military ATD.

A number of image fusion based methods have been proposed in the literature to enhance environmental perception [26, 65–67]. The basic scheme behind the image fusion based methods is that multiple images acquired in different ranges of the electromagnetic spectrum are fused into one image by pixel-level algorithms such as principal component analysis [67] or discrete wavelet transform [65]; the enhanced image is then fed as input to the ATD system. However, these kinds of image fusion based ATD methods are neither effective nor efficient when they are applied in practical scenarios. To address these limitations, a novel deep learning based image fusion approach is proposed to improve detector performance in challenging scenarios, which exploits the unsupervised feature learning characteristic of CNNs. Compared with high-level image fusion algorithms, the proposed method can achieve higher accuracy and computational efficiency. In addition, the state-of-the-art ATD framework is applied in the military scenario, and cross-domain transfer learning techniques are employed to compensate for insufficient data. In this way, the proposed framework achieves promising results on the SENSIAC dataset. To sum up, the contributions of this chapter are as follows:

1. A multi-channel Deep Learning based image fusion approach is proposed. Compared with traditional image fusion methods with a handcrafted fusion strategy, the proposed method can automatically learn how to fuse the essential information from visible, thermal, and temporal images in a CNN.

2. The proposed enhanced ATD framework achieves 98.34% AP and 98.90% Top1 accuracy on the SENSIAC dataset, which is better than the baseline (single-modal ATD).

3.2 Multi-channel CNN-based ATD

The proposed multi-channel CNN-based ATD framework is illustrated in Figure 3.2. The framework is composed of four modules: 1) a multi-channel image fusion module, which fuses three different types of images into a composited image; 2) a CNN feature extractor, used for extracting high-level semantic representations from the composited image; 3) a Region of Interest (ROI) proposal module, which operates on the composited image to generate hundreds or thousands of ROIs; and 4) an ROI classification & regression module, which produces refined bounding boxes and the probability of each category.

Figure 3.2: The pipeline of the proposed multi-channel CNN-based ATD, which includes four main components: 1) multi-channel image fusion, 2) ROI proposal, 3) CNN feature extractor, and 4) ROI classification & regression.

3.2.1 Image Fusion

Image Selection

Multi-sensor data often provide complementary information for context enhancement, which can further enhance the performance of ATD. Two types of images from different sensors, Mid-Wave Infrared (MWIR) images and visible images, are investigated. In addition to the images acquired from these two sensors, a motion image generated from two consecutive visible frames is considered in order to complement the description of objects.

MWIR: The infrared (IR) spectrum can be divided into different spectral bands. The IR bands include the active IR band and the passive (thermal) IR band. The main difference between the active and passive infrared bands is that the passive IR image can be acquired without any external light source. The passive (thermal) IR band is further divided into the mid-wave infrared (3-5 µm) and the long-wave infrared (7-14 µm).
In general, MWIR cameras can sense temperature variations over targets and background at a long distance, and produce thermograms in the form of 2D images. Their ability to present large contrasts between cool and hot surfaces is extremely useful for many computer vision tasks such as image segmentation and object detection. However, due to low-resolution sensor arrays and the possible absence of auto-focus lens capabilities, the high-frequency content of the objects, e.g., edges and texture, is mostly missed.

Visible image: The range of the electromagnetic spectrum of the visible image is from 380 nm to 750 nm. This type of image can be easily and conveniently acquired via various kinds of general cameras. In comparison with the MWIR image, the visible image is sensitive to illumination changes, preserves high-frequency information and can provide a relatively clear perspective of the environment. In most computer vision topics, the visible image has been a major object of study for decades. Thus, there are a large number of public visible datasets across many research areas. On the other hand, the significant drawbacks of the visible image are that it has poor quality in harsh environmental conditions with unfavorable lighting and pronounced shadows, and that there is no dramatic contrast between background and foreground when the environment is extremely complicated, such as the battlefield.

Figure 3.3: The procedure of motion estimation. Here t is the current frame and t-5 is the frame 5 frames earlier.

Motion image: In general, the moving objects are the targets in the battlefield. Therefore, estimating the motion of objects can provide significant cues to segment those targets. Various motion estimation algorithms have been proposed in recent decades, such as dense optical flow methods, point correspondence methods, and background subtraction, and each of them has shown effectiveness in many computer vision tasks.
However, considering the trade-off between accuracy and computational complexity, none of the complicated motion estimation approaches is adopted; instead, a straightforward and easy-to-implement method is employed. The proposed motion estimation method is illustrated in Figure 3.3: the motion map is estimated from two consecutive sampled frames. Specifically, the images are sampled every 5 frames, and the previously sampled frame is subtracted from the current frame; the absolute difference is the desired motion image. The method can be formulated as follows:

$$M_t(x, y) = |I_t(x, y) - I_{t-5}(x, y)|, \tag{3.1}$$

where $M_t(x, y)$ represents the motion value of frame $t$ at pixel $(x, y)$ and $I_t(x, y)$ denotes the pixel value of frame $t$ at pixel $(x, y)$.

In this way, only the subtraction operator is employed in this procedure, which is more efficient than other popular methods, e.g., optical flow [68]. Even though this method can bring some noise into the motion image, the image can still provide enough complementary information in the later fusion stage.

Fusion Architecture

The possible configurations of information fusion for object detection can be formalized into three categories, namely pixel-level, feature-level and decision-level fusion architectures. An illustration is shown in Figure 3.4. Keeping these possibilities in mind helps to highlight the important benefits of the proposed fusion method in terms of efficiency and effectiveness.

Pixel-level fusion: A typical pixel-level fusion architecture is illustrated in Figure 3.4(a). This configuration is the lowest-level technique: it deals directly with the pixels obtained from the sensors and tries to improve the visual quality. Typically, multiple images from different sources are combined into one single image in a pixel-wise manner, after which it is fed into the ATD system to generate final results.
One of the main advantages of pixel-level fusion is its low computational complexity and easy implementation.

Figure 3.4: Illustration of different image fusion architectures: (a) pixel-level fusion architecture; (b) feature-level fusion architecture; (c) decision-level fusion architecture.

Feature-level fusion: As a higher-level fusion system than Figure 3.4(a), one might pursue Figure 3.4(b), in which different types of images are simultaneously fed into their own independent lower parts of the ATD system, typically called feature extractors. For instance, this lower-level system might be a hand-engineered feature extractor for a traditional ATD system, or high-level convolutional layers for a CNN based system. The concatenated features produced by the independent lower systems are then fed into one upper (decision-making) system to produce the final results. Although feature-level fusion is usually robust to noise and misregistration, it requires almost double the memory and computing resources to handle the feature fusion procedure in a parallel fashion, especially for CNN based methods.

Decision-level fusion: The decision-level fusion scheme illustrated in Figure 3.4(c) operates on the highest level, and refers to fusing the discriminative results from different systems designed for the various types of images. Note that for an ATD system, which is usually based on machine learning algorithms, this high-level fusion might not establish a good relationship between the interior characteristics of the different types of images. In addition, since this duplication multiplies the required resources and running time, it might also be practically challenging to implement.

In the enhanced ATD framework, a novel multi-channel Deep Learning based image fusion approach is proposed, which is similar in style to pixel-level fusion.
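The inputs to this approach can be sketched in a few lines. The sketch below computes the motion image of Equation 3.1 by frame differencing and then stacks three co-registered single-channel images into one 3-channel composite, using the channel assignment given for the fusion module of Figure 3.2 (MWIR in red, motion in green, visible in blue). It is a minimal illustration under stated assumptions rather than the thesis implementation: images are represented as nested lists of 8-bit integers, all images are assumed to be the same size and already registered, and the function names and pixel values are hypothetical.

```python
from typing import List, Tuple

Gray = List[List[int]]                   # single-channel image, values 0..255
Composite = List[List[Tuple[int, int, int]]]  # rows of (R, G, B) pixels


def motion_image(current: Gray, previous: Gray) -> Gray:
    """Equation 3.1: absolute difference between frame t and frame t-5."""
    return [[abs(c - p) for c, p in zip(row_c, row_p)]
            for row_c, row_p in zip(current, previous)]


def compose(mwir: Gray, motion: Gray, visible: Gray) -> Composite:
    """Stack the three images as channels without modifying any pixel value.

    Channel order (assumed from the description of Figure 3.2):
    R = MWIR, G = motion, B = visible.
    """
    return [[(r, g, b) for r, g, b in zip(row_r, row_g, row_b)]
            for row_r, row_g, row_b in zip(mwir, motion, visible)]


# Hypothetical 2x2 frames: one visible pixel changes strongly between samples.
visible_t = [[10, 200], [30, 40]]
visible_t_minus_5 = [[10, 50], [35, 40]]
mwir_t = [[1, 2], [3, 4]]

motion_t = motion_image(visible_t, visible_t_minus_5)
composite_t = compose(mwir_t, motion_t, visible_t)
```

Because the composite only rearranges the raw values, any weighting between the three sources is left to the convolutional layers that consume it, which is the design choice that distinguishes this approach from hand-engineered pixel-level fusion.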
As can be seen in the multi-channel image fusion module in Figure 3.2, the three types of raw images (MWIR, visible image and motion image) are first concatenated into a composited image, with the MWIR image in the red channel, the motion image in the green channel and the visible image in the blue channel. The result is a 3-channel composited image. It is worth noting that no pixel values of the raw images are modified in this procedure. The composited image is then fed as input to a CNN for training the enhanced ATD in an end-to-end manner. Meanwhile, the features from the different source images can be fused together inside the CNN in an unsupervised learning fashion. Therefore, compared with feature-level and decision-level fusion methods, the proposed approach is easier to implement and has lower computational complexity. Compared with conventional pixel-level fusion methods, the deep learning technique is employed to fuse images from different sources instead of hand-engineered algorithms such as the discrete wavelet transform.

3.2.2 Regions of Interest Proposal

As can be seen in the ROI proposal module in Figure 3.2, given an image, the ROI proposal algorithm outputs a set of class-independent locations that are likely to contain objects. Different from the exhaustive “sliding window” paradigm, which proposes every possible candidate region and generates around $10^4$ to $10^5$ windows per image, ROI proposal methods try to reach high object recall with considerably fewer windows. Popular object detectors such as R-CNN and Fast R-CNN [62] use the selective search [61] method as their ROI proposal module.

Selective search [61] is an ROI proposal method that combines the intuitions of bottom-up segmentation and exhaustive search. The whole algorithm can be simplified as follows. Firstly, the algorithm from Felzenszwalb et al. [69] is adopted to create initial regions.
Then the similarities between all neighboring regions are calculated and the two most similar regions are grouped together. After that, new similarities are calculated between the resulting region and its neighbors. In this iterative manner, the process of grouping the most similar regions is repeated until the whole image becomes a single region. Finally, the object location boxes are extracted from each region. Because of this hierarchical grouping process, the generated locations come from all scales.

3.2.3 Neural Network Architecture

The great success of CNNs in recent years has aroused broader interest in CNN based generic object detection among researchers. The most typical CNN based object detector is R-CNN [33], which utilizes the selective search method to generate a set of ROI proposals from the input image and then feeds each ROI to a CNN to obtain the final results. However, this paradigm is slow, because many heavily overlapping ROIs have to go through the CNN separately, which results in a large amount of redundant computation. The spatial pyramid pooling neural network [63] and Fast R-CNN [62] successfully solved this problem by proposing spatial pyramid pooling and ROI pooling, respectively. They suggested that the whole image can go through the CNN once, and that the final decision can be made on the last feature maps produced by the CNN by using their proposed pooling strategies.

The proposed framework is illustrated in Figure 3.2. The composited three-channel image is first fed into the CNN feature extractor to generate convolutional feature maps. It should be noted that the final convolutional feature maps are also the fused result of the three types of images, obtained by unsupervised learning. After that, for each ROI generated by the ROI proposal module, an ROI pooling process is performed directly on the convolutional feature maps, instead of on the input image, to extract a fixed-length feature vector.
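The way ROI pooling turns regions of different sizes into outputs of one fixed size can be sketched as follows. This is a simplified, single-channel illustration in the spirit of the ROI pooling of Fast R-CNN [62], not the Caffe layer used in this work: the ROI is assumed to be given directly in feature-map coordinates as `(top, left, bottom, right)` with exclusive bottom and right edges, the output grid is 2x2 for readability (Table 3.1 lists a 6x6 ROI-Pool layer), and the function name `roi_max_pool` is hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def roi_max_pool(feature_map: Matrix,
                 roi: Tuple[int, int, int, int],
                 out_h: int = 2,
                 out_w: int = 2) -> Matrix:
    """Max-pool one ROI of a feature map into a fixed out_h x out_w grid.

    The ROI is split into out_h x out_w bins (floor for the bin start, ceiling
    for the bin end, so neighbouring bins may overlap by one cell) and the
    maximum of each bin is kept. The ROI must contain at least one cell.
    """
    top, left, bottom, right = roi
    h, w = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        r0 = top + (i * h) // out_h
        r1 = top + ((i + 1) * h + out_h - 1) // out_h   # ceiling division
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Hypothetical 4x6 feature map whose value encodes its position (row*10 + col).
fmap = [[float(r * 10 + c) for c in range(6)] for r in range(4)]

whole_map = roi_max_pool(fmap, (0, 0, 4, 6))   # a 4x6 ROI
small_roi = roi_max_pool(fmap, (1, 1, 4, 4))   # a 3x3 ROI
```

Both ROIs produce a 2x2 output even though their sizes differ, so the flattened result has the same length for every proposal and can be passed to fully connected layers.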
The reason for choosing ROI pooling instead of spatial pyramid pooling is that the gradients can propagate to the CNN layers in the training stage, which helps the CNN learn how to fuse the multiple channel-independent images in an unsupervised fashion. Finally, the extracted vector is sent to a fully connected neural network which has two output ports, one for classification and the other for bounding box regression.

Taking the trade-off between accuracy and computational complexity into account, an efficient CNN architecture called VGGM [70] is selected as the CNN feature extractor in the proposed framework. Specifically, VGGM [70] is a shallower version of the VGG16 [35] neural network and a wider version of the AlexNet [32] neural network; it is faster than VGG16 and more accurate than AlexNet. More detail about the VGGM configuration can be seen in Table 3.1.

Table 3.1: Neural network configuration. The neural network architecture contains two modules: the first module is the CNN feature extractor, which includes 5 convolutional layers; the second module is the ROI classification and regression, which has an ROI pooling layer and 4 fully connected layers. In the layer type column, "Conv" represents the convolutional layer; "LRN" means the Local Response Normalization (LRN) layer; "Max-Pool" is the max pooling layer; "ROI-Pool" represents the ROI pooling layer; and "FC" means the fully connected layer. The Rectified Linear Unit (ReLU) [40] is employed as the activation function. Dropout [71] is utilized to avoid overfitting in the training stage.

| Layer # | # Input Channels | # Output Channels | Kernel Size | Layer Type | Stride | Pad | Activation Function | Dropout |
|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 96 | 7x7 | Conv | 2 | | ReLU | |
| 2 | 96 | 96 | | LRN | | | | |
| 3 | 96 | 96 | 3x3 | Max-Pool | 2 | | | |
| 4 | 96 | 256 | 5x5 | Conv | 2 | 1 | ReLU | |
| 5 | 256 | 256 | | LRN | | | | |
| 6 | 256 | 256 | 3x3 | Max-Pool | 2 | | | |
| 7 | 256 | 512 | 3x3 | Conv | 1 | 1 | ReLU | |
| 8 | 512 | 512 | 3x3 | Conv | 1 | 1 | ReLU | |
| 9 | 512 | 512 | 3x3 | Conv | 1 | 1 | ReLU | |
| 10 | 512 | 36 | 6x6 | ROI-Pool | | | | |
| 11 | 36 | 4096 | | FC | | | | X |
| 12 | 4096 | 1024 | | FC | | | | X |
| 13 | 1024 | 2 | | FC | | | | |
| 14 | 1024 | 8 | | FC | | | | |

3.2.4 Training Details

Transfer Learning

Transferring general information between different data sources for related tasks is an effective technique to deal with insufficient training data and overfitting in the deep learning scenario. For instance, the variants of R-CNN first train the model on the large ImageNet [57] dataset and then fine-tune it on their domain-specific dataset. However, they were limited to transferring information between RGB (visible image) models.

The target training dataset includes visible images, IR images, and motion maps, whose data distribution is significantly different from that of large-scale public visible image datasets such as ImageNet [57]. The goal is to let the CNN model gain essential common knowledge from a large-scale visible dataset and then transfer this information to accelerate training on the domain-specific dataset as well as to boost overall performance.

Following general transfer learning techniques, the VGGM model is pre-trained on the large-scale RGB image dataset, ImageNet, which contains the most common objects and scenes in daily life. Before the neural network is trained on the composited images, the weights of all the convolutional layers are initialized with the transferred weights. This allows the neural network to adapt to the new data distribution in an end-to-end learning setting.

Loss Function

As shown in Table 3.1, the proposed architecture has two sub-networks. The first classifies each ROI and outputs a discrete probability distribution over two categories (background and target).
The second regresses the bounding box offsets of each ROI: for each category, it outputs a tuple $(t_x, t_y, t_w, t_h)$, whose elements indicate the shift relative to the central coordinates, height and width of the original proposal ROI.

As in [62], the negative log likelihood objective is utilized for classification:

$$L_{cls}(p, u) = -\left[u \log(p) + (1 - u) \log(1 - p)\right], \tag{3.2}$$

where $p$ represents the predicted probability of the target category and $u$ is the actual category. For regression, the smooth $L_1$ loss function [62] is adopted in this chapter:

$$L_{bbox}(t^{u}, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^{u}_{i} - v_{i}), \tag{3.3}$$

in which

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^{2} & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise,} \end{cases} \tag{3.4}$$

where $t^{u}$ denotes the predicted bounding box offsets for class $u$ and $v$ denotes the true offsets. In the training stage, the two losses are combined as follows:

$$L(p, u, t^{u}, v) = L_{cls}(p, u) + [u = 1] L_{bbox}(t^{u}, v), \tag{3.5}$$

where the indicator $[u = 1]$ means that the bounding box regression is trained only when the class is target.

3.3 Experiments

3.3.1 Dataset

Figure 3.5: Appearance of targets in the training dataset and the testing dataset.

The Military Sensing Information Analysis Center (SENSIAC) dataset was employed for evaluation in the experiments. This dataset contains 207 GB of MWIR video and 106 GB of visible video along with the bounding box data. All video was taken using commercial cameras operating in the MWIR and visible bands. The targets are of various types, including people, foreign military vehicles, and civilian vehicles. The data were collected during both daytime and night, and the distance between cameras and targets varied from 500 to 5000 meters.

In the experiments, only vehicles were set as targets; 5 types of vehicles were used as training targets and 3 types of vehicles were used as testing targets. Their names and appearance are shown in Figure 3.5. Each type of vehicle was selected at 3 different distances between cameras and targets (1000, 1500 and 2000 meters).
It should be noted that, no matter how many fine-grained types of vehicle there are, they were all treated as one class, “vehicle.” Thus, the problem becomes a binary (vehicle and background) object detection problem. Moreover, because the raw data are in video format, one image in every 5 frames was sampled to increase the difference between consecutive samples. In total, 4573 frames were used as training data and 2812 frames were used as testing data.

3.3.2 Experimental Setup

In the experiments, a computer with an NVIDIA GeForce GTX 1080 GPU, an Intel Core i7 CPU and 32 GB of memory was utilized. The proposed framework was implemented using the Caffe deep learning toolbox [72], which supports the new neural network layers needed for target detection. The hyper-parameter setup of Fast R-CNN [62] was followed, which has been proven to be efficient for training. Specifically, all the neural networks were trained for 40000 iterations, with an initial learning rate of 0.001 that was reduced to 0.0001 for the last 10000 iterations, a momentum of 0.9 and a weight decay of 0.0005.

3.3.3 Evaluation

Metrics

For all the metrics, a detection is considered a true or false positive based on whether its overlap ratio with the ground truth bounding box exceeds 0.5. The overlap ratio is calculated by the following equation:

$$a_o = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}, \tag{3.6}$$

where $a_o$ denotes the overlap ratio, and $B_p$ and $B_{gt}$ denote the predicted bounding box and the ground truth bounding box, respectively. The $\mathrm{area}(\cdot)$ function calculates the area of a region.

Average Precision (AP) is a gold-standard metric for evaluating the performance of an ATD algorithm. The AP value is obtained by computing the area under the Precision-Recall curve. Precision measures how accurate the predictions are, i.e., the percentage of positive predictions that are correct. Recall measures how well the model finds all the positives. Their mathematical definitions are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{3.7}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{3.8}$$

where TP denotes true positives, FP denotes false positives and FN denotes false negatives.

Top1 precision is a metric that is widely used in classification tasks, where the probabilities of multiple classes are predicted and the class with the highest score is selected. The Top1 precision score is then computed as the number of times the predicted label matches the target label, divided by the total number of samples. In the experiments, there was only one target in each image. Thus, the Top1 precision metric was employed to evaluate the performance of the framework in a practical scenario.

Results and Analysis

Six incremental experiments were designed to examine the effectiveness of the proposed fusion method. At the beginning, the performance of the detection algorithm on the three single-modal images (Visible, MWIR and Motion) was evaluated independently. Because all of the single-modal images were in single-channel format and the CNN requires a 3-channel image as input, the desired images were generated by duplicating the single-channel image three times. After that, the visible and MWIR images were fused together; to meet the input requirement of the CNN, the visible channel was duplicated. Then, the temporal information, i.e., the motion image, was integrated to form the complete composited 3-channel image. In addition, the decision-level fusion method was also evaluated. Specifically, three different single-modal neural networks are first applied to predict possible bounding boxes of targets. Then, these bounding boxes are assembled.

Figure 3.6: Average precision (AP) comparison between different experimental designs: single-modal images (visible, MWIR and Motion image, respectively), the composited image of visible and MWIR (Visible-MWIR), the composited image of visible, MWIR and Motion (3-Channels), and decision-level fusion.
Finally, the evaluation is conducted on the assembled bounding boxes.

Figure 3.7: The visual results of the proposed method. Examples 1 and 2 demonstrate the performance of the different types of input on large and small object detection, respectively. Different columns denote different types of input image. The raw input image, the generated feature map, and the final output are shown in consecutive rows. In the final output image, the green bounding box represents the position of the object predicted by the system.

Table 3.2: Performance comparison on accuracy and time cost of different methods.

                         Accuracy (%)      Running Time (s/image)
Methods                  AP      Top1      ROIs Proposal   Neural Networks   Overall
Visible                  97.31   98.04     1.378           0.164             1.542
MWIR                     95.63   96.91     1.144           0.069             1.213
Motion                   91.64   92.39     1.167           0.038             1.205
Visible-MWIR             97.37   98.18     1.505           0.248             1.753
3-Channels               98.34   98.90     1.272           0.235             1.507
Decision-level Fusion    97.52   97.93     3.690           0.271             3.961

Figure 3.6 shows the AP curves of the six incremental experiments. In the single-modal image experiments, the CNN based detector performed well overall, especially on the visible single-modal image, which achieved 97.31% average precision and 98.04% Top1 accuracy, as shown in the “Accuracy” columns of Table 3.2. The visible-MWIR composited image achieved a better result than the best single-modal image. It should be noted that the 3-channel image achieved both the highest average precision (98.34%) and the highest Top1 accuracy (98.90%), which means the proposed method only falsely detected 16 frames out of the 2812 testing frames.
It is also interesting that, although the average precision of the decision-level fusion method is higher than that of the best single-modal method, its Top1 accuracy in practical application is lower than that of the visible single-modal approach, and it is extremely time-consuming (3.961 s per image).

To further verify the effectiveness of the proposed unsupervised image fusion method, the feature map of the last convolutional layer and the final output of the proposed framework are presented in Figure 3.7. The feature map is the output of the CNN feature extractor in Figure 4.2, and for the fused image, it contains the fused high-level features. It can be reasoned that if the object in the feature map is segmented clearly, the framework will produce a better result. In the examples of Figure 3.7, it is clear that the 3-channel input fuses the complementary information from the three single-modal images well and yields an enhanced feature map. Its final output also confirms that the enhanced feature map boosts the performance.

3.4 Summary

In this chapter, a Deep Learning based multi-modal image fusion method is proposed for the ATD task, in which the information from visible, thermal, and temporal images is fused with a multi-channel CNN. In this solution, the fusion strategy is learned at the training stage rather than designed by hand. Moreover, Fast R-CNN is employed as the target detector, and a transfer learning technique is also applied. The SENSIAC dataset is employed for evaluation, on which the proposed method achieves promising results with 98.34% AP and 98.90% Top1 accuracy. However, the running time of the method proposed in this chapter is still too slow for practical scenarios.
Thus, improving it further remains the task of the next chapter.

Chapter 4

Deep Multi-Modal Image Fusion

4.1 Introduction

Multi-modal image fusion techniques are often employed [5–9] to address the challenges arising from complex scenarios by fusing the complementary cross-spectrum information acquired through a multi-modal imaging system. In chapter 3, a multi-channel Deep Learning based image fusion method is introduced. However, this previous work [73] only fused multi-modal images in a shallow CNN and utilized the Fast R-CNN [62] framework for target detection. Even though its results are promising compared with single image modality methods, its running time and accuracy can still be improved for practical automated applications.

In fact, many challenges exist for automated SA applications, including target scale variations, environmental diversity, and real-time response requirements. In most scenarios, the scene is vast and expansive, as illustrated by the examples on the left side of Figure 4.1. The different distances from the imaging sensors to the target dramatically vary the scale of the target. The sample image on the right side of Figure 4.1 demonstrates the complexity of a scenario. The target is hard to discriminate from the complex background by its appearance (color and texture) because of its camouflage. Moreover, rocks or trees can also obscure the targets. These environmental factors limit automated SA applications, especially ATD.
In addition, ATD applications must be able to operate around the clock and provide immediate indications, warnings, and responses, thereby increasing the requirements for robust and real-time performance.

Figure 4.1: Two sample images from the SENSIAC dataset illustrating complex situations.

In this chapter, the previous work [73], presented in chapter 3, is improved in terms of both accuracy and efficiency in a new framework, and comprehensive analysis and experiments are conducted as well. Specifically, a deeper CNN model is adopted to carry out both the multi-modal image fusion and the target detection tasks. The handcrafted region proposal module “selective search” [61] used in the previous work [73] is replaced with a more efficient module, the Region Proposal Network (RPN) [64]. Consequently, the DMIF framework is a fully end-to-end neural network which can easily be optimized on a GPU device. Moreover, a comprehensive analysis of the noise factors from the complex scenario is performed in the experimental section to show the effectiveness of the proposed DMIF framework.

Figure 4.2 presents the overall architecture of the DMIF framework, which consists of three main modules. The first is the deep multi-modal fusion module. The pre-processed multi-modal images are concatenated into three channels and fed into a deep CNN, which carries out the fusion operation and feature extraction. The other two modules, i.e., the RPN and the region-wise classification & regression, are adopted from the state-of-the-art target detector [64] and generate the bounding box of the target to present to a human operator.

The contributions of this chapter are as follows:

1. A novel deep multi-modal image fusion (DMIF) approach is proposed, which learns to extract deep features in an unsupervised manner and fuses synergistic information from multi-modal images.

2. An efficient ATD framework is built to integrate the image fusion, region proposal, and region-wise classification into an end-to-end CNN.

3. The proposed framework achieves a better performance, 99.73% AP, on the SENSIAC dataset in comparison with other state-of-the-art methods. Moreover, a sensitivity analysis of the noise factors from the complex scenario is performed as well.

4.2 Deep Fusion Methodology

4.2.1 Deep Multi-Modal Image Fusion

Multi-modal image fusion aims to enhance the scene context by fusing the complementary information from multiple input images. In this chapter, three different types of images are considered: (1) the Mid-Wave Infrared (MWIR) image, (2) the grayscale Visible Image (VI), and (3) the Motion Image (MI) generated from two consecutive visible frames. At the fusion stage, a deep CNN learns to fuse the multi-modal images through an unsupervised training process.

Figure 4.2: Overall framework of deep multi-modal image fusion based target detection (DMIF), including three major modules: (1) deep multi-modal image fusion, (2) region proposal neural network, and (3) classification and regression sub-network.

Multi-Modal Images

In this subsection, a brief description of how the multi-modal images are acquired and pre-processed is presented.

Mid-wave infrared image

The MWIR image belongs to the category of passive infrared (IR) images, for which no external light source is required, in contrast to active IR images. The electromagnetic spectrum of the MWIR image ranges from 3 µm to 5 µm. Thus, the MWIR imager can capture temperature variations over the target and the background over a relatively long distance and produce thermograms in the form of a two-dimensional (2D) image. The value at each coordinate of a thermogram represents the relative temperature.
In order to be processed by the DMIF deep fusion module, the thermograms need to be transformed into general grayscale images by applying the following linear normalization [74]:

$$I(x,y) = \frac{\left(T(x,y) - \mathrm{Min}(T)\right) \times (v_{max} - v_{min})}{\mathrm{Max}(T) - \mathrm{Min}(T)} + v_{min}, \tag{4.1}$$

where $T$ and $I$ are the 2D thermogram and the grayscale thermal image, respectively, and $(x,y)$ indicates the 2D coordinate in the image array. $\mathrm{Max}(\cdot)$ and $\mathrm{Min}(\cdot)$ return the maximum and minimum value of the data. The intensity range of the grayscale thermal image is $[v_{min}, v_{max}]$.

Visible image

The VI in this chapter covers the electromagnetic spectrum range from 380 nm to 750 nm. This spectral range enables the VI to capture sufficient edge and texture information from the scene. However, its disadvantage is that the VI is extremely sensitive to luminance variation. In the experiments, the VI is assumed to be already aligned with the MWIR image, so no registration is needed.

Motion image

It is well known that a moving object generates a motion trajectory. Hence, motion estimation is a straightforward way to obtain the location information of moving targets, even though associated noise is sometimes present. DMIF leverages a motion image modality in the fusion process. Taking the computational complexity into account, an efficient motion estimation method is used to obtain the motion image. The method is formulated as follows:

$$M_t(x,y) = \left| V_t(x,y) - V_{t-\delta}(x,y) \right|, \tag{4.2}$$

where $M$ and $V$ represent the motion image and the original visible image, respectively, $t$ indicates the $t$-th frame in a continuous image sequence, and $\delta$ is the frame interval between two consecutive keyframes. $(x,y)$ indicates the 2D coordinate in the image array.

Deep Fusion

As described in [75], an image fusion algorithm must solve two key problems: (1) effectively extracting the image features from the input source images and (2) combining the features from multiple sources into the fused image.
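The two pre-processing steps in Equations (4.1) and (4.2) can be sketched as follows. This is a minimal illustration using nested Python lists as images; the function names, the default output range of 0 to 255, and the handling of a constant thermogram are assumptions for the example, and a real pipeline would normally use an array library.

```python
def normalize_thermogram(thermogram, v_min=0.0, v_max=255.0):
    """Linear normalization of Eq. (4.1) for a 2D thermogram (list of rows)."""
    values = [v for row in thermogram for v in row]
    t_min, t_max = min(values), max(values)
    if t_max == t_min:
        # Eq. (4.1) is undefined for a constant image; mapping it to v_min
        # is an assumption of this sketch.
        return [[v_min for _ in row] for row in thermogram]
    scale = (v_max - v_min) / (t_max - t_min)
    return [[(v - t_min) * scale + v_min for v in row] for row in thermogram]


def motion_image(frame_t, frame_prev):
    """Absolute frame difference of Eq. (4.2); frame_prev is the frame
    delta steps earlier and must have the same size as frame_t."""
    return [
        [abs(a - b) for a, b in zip(row_t, row_prev)]
        for row_t, row_prev in zip(frame_t, frame_prev)
    ]
```

For instance, `normalize_thermogram([[10, 20], [30, 50]])` maps the coldest pixel to 0 and the hottest to 255, with the others placed linearly in between.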
In traditional image fusion, multi-scale transform based methods [6, 26, 27, 31] or sparse representation based methods [7] are developed to solve the first problem, feature representation. Fusion strategies, e.g., weighted average [30] and choose-max [31], are developed to address the second problem, feature fusion. These principles help to better understand the proposed deep fusion module.

As illustrated in Figure 4.3, the three obtained images are concatenated into the RGB channels of one image. Note that the pixel values of each image are not modified at this stage. In DMIF, VI is put in the Blue (B) channel, MI is put in the Green (G) channel, and MWIR is put in the Red (R) channel. Then, a set of learnable kernels is convolved with the RGB-channel image to generate a set of feature maps. Taking one kernel as an example, the convolution procedure [2] can be formulated as follows:

$$y^{l+1}(i,j) = \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} \sum_{c=0}^{d-1} w(a,b,c)\, x^{l}(i+a, j+b, c) + \beta, \tag{4.3}$$

where $l$ denotes the $l$-th convolutional layer within the deep CNN. $y$ is the generated 2D feature map of the $(l+1)$-th convolutional layer, while $x$ is the original 3D feature bank of the $l$-th convolutional layer (i.e., the RGB-channel image for the first convolutional layer). $(i,j)$ indicates the 2D coordinate in the feature map. $w$ is the convolutional kernel (weight), whose width and height are $m$ and whose depth is $d$. Note that the kernel depth $d$ must equal the number of channels of the original feature bank. $\beta$ is the bias value.

Figure 4.3: Illustration of the proposed deep multi-modal image fusion module. $m$ represents the convolution kernel size, and $w$ indicates the learnable weights of a convolution kernel. $d$ is the number of channels of an image or of the feature maps.

The first 3D convolution operation in the deep fusion procedure is a kind of weighted fusion strategy.
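To make the "weighted fusion" reading of Equation (4.3) concrete, the sketch below stacks the three modalities in the channel order described above and applies a single kernel. It is an illustration only: the function names, the stride of 1, the absence of padding, and the hand-picked kernel weights are assumptions, whereas in DMIF the weights are learned during training.

```python
def stack_channels(mwir, mi, vi):
    """Build an H x W x 3 image with (R, G, B) = (MWIR, MI, VI)."""
    height, width = len(mwir), len(mwir[0])
    return [
        [[mwir[i][j], mi[i][j], vi[i][j]] for j in range(width)]
        for i in range(height)
    ]


def conv_single_kernel(x, w, beta=0.0):
    """Direct transcription of Eq. (4.3) for one kernel.

    x: H x W x d feature bank, w: m x m x d kernel, beta: bias.
    Stride 1 and no padding are assumptions of this sketch.
    """
    m, d = len(w), len(w[0][0])
    height, width = len(x), len(x[0])
    out = []
    for i in range(height - m + 1):
        row = []
        for j in range(width - m + 1):
            acc = beta
            for a in range(m):
                for b in range(m):
                    for c in range(d):
                        acc += w[a][b][c] * x[i + a][j + b][c]
            row.append(acc)
        out.append(row)
    return out


# With a 1 x 1 kernel the convolution reduces to a per-pixel weighted
# average of the three modalities (weights chosen by hand here).
fused = conv_single_kernel(
    stack_channels([[1, 2], [3, 4]], [[0, 0], [0, 0]], [[10, 10], [10, 10]]),
    [[[0.5, 0.2, 0.3]]],
)
```

With larger kernels the same operation also mixes neighbouring pixels, which is how one kernel can combine within-modality and cross-modality information at once.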
Nevertheless, in contrast to the traditional weighted fusion rule, the DMIF fusion is learned in the training stage, which means the algorithm can learn how to assign the weights to each imaging modality and choose the important information for both within-modality and cross-modality feature learning.

Generally, a deep CNN stacks a combination of convolutional layers, activation layers, normalization layers, and pooling layers, and repeats this pattern until the feature maps are reduced to a small spatial size. The CNN thus has multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. As a result, the neural network can learn a hierarchical representation of the multi-modal images.

Recent work [35, 37] has demonstrated that deeper CNNs can achieve better performance on representation learning tasks. However, directly adding convolutional layers causes a degradation problem. To address this issue, He et al. [36] proposed a residual block module, which allows information to be passed directly through, making the backpropagated error signals less prone to exploding or vanishing. This solution makes it possible to train neural networks with hundreds of layers. They also introduced a deep CNN model, called ResNet 101, which consists of 101 convolutional layers. In this chapter, this state-of-the-art ResNet 101 is adopted to fuse the multi-modal images and extract deep feature maps for the region proposal neural network and the classification & regression sub-network. The material from [36] is not duplicated here because of space limits; readers are referred to [36] for more details. This work leverages transfer learning for faster training by pre-training the ResNet 101 on a larger-scale image dataset, ImageNet [57].
The pre-trained ResNet 101 is truncated, and only the fully convolutional layers are used in this work. Dilated convolution [76] is also applied to increase the receptive field, as in [77].

4.2.2 Region Proposal Neural Network

The objective of the region proposal is to generate a set of class-independent locations that are likely to contain targets. The selective search algorithm [61] was adopted to accomplish this task in the previous work [73]. As selective search has a complex implementation that can only run on the CPU, it is not efficient for real-time applications. Recently, Ren et al. [64] introduced the Region Proposal Network (RPN), a fully convolutional neural network that accelerates the region proposal procedure. The RPN outputs a set of rectangular target proposals with corresponding objectness scores and shares the convolutional computation with the target detection neural network. In other words, the RPN is a small neural network module that performs region proposal on the last layer of the main deep CNN. The core idea behind the RPN is the anchor. Specifically, anchors are a set of reference boxes with different scales and aspect ratios on a regular grid in the image. The generated region proposals are expressed as offsets to the anchors, and thus the number of region proposals is fixed.

The configuration of the RPN is presented in Figure 4.4. Specifically, a 3×3 convolution layer with a Relu [40] activation function slides over the feature maps generated by ResNet 101, followed by two sibling 1×1 convolution layers, i.e., one outputs the region proposals and the other outputs the corresponding objectness scores. Readers are referred to [64] for details on the loss function and implementation.

Figure 4.4: The neural network configuration of the Region Proposal Neural Network module. For the setting of a convolution layer, “(size,size,number)” denotes the width, height, and number of the convolution kernels. “With Relu” means that the convolution layer is followed by a rectified linear unit (Relu) activation function [40].

4.2.3 Classification & Regression Sub-Network

As shown in Figure 4.5, the classification & regression sub-network is applied to each region proposal and generates classification scores as well as four offset values with respect to the bounding box of the region proposal. Hence, this sub-network has two functional heads, i.e., classification and regression. The first classifies the region proposals and outputs a discrete probability distribution over two categories (target and background) by using the Softmax [43] function. The second regresses the bounding box offsets of the region proposals and outputs a tuple $(t_x, t_y, t_w, t_h)$, whose elements indicate the shift relative to the central coordinate, width, and height of the original region proposal.

Figure 4.5: The neural network configuration of the region-wise classification & regression sub-network. For the configuration of ROI pooling, “(size,size)” denotes the width and height of the pooling kernel. For the configuration of a fully connected layer, “(number)” represents the number of output neurons. A fully connected layer “with Relu” is followed by a Relu activation function.

To train the classification head, the cross entropy is used as the loss function, which measures the performance of a classification model whose output is a probability value between 0 and 1:

$$L_{cls}(p,u) = -\left[ u \log(p) + (1-u) \log(1-p) \right], \tag{4.4}$$

where $p$ and $u$ represent the predicted probability and the ground-truth label of the target/background, respectively.
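A minimal sketch of the two head losses is given below: the binary cross entropy of the classification head and the smooth L1 regression loss of Fast R-CNN [62] used by the regression head, combined so that the regression term only contributes for target proposals. It assumes that the cross entropy carries the conventional leading minus sign (so the loss is non-negative) and that smooth L1 switches from quadratic to linear at |x| = 1, as defined in [62]; the function names and the clamping constant are illustrative.

```python
import math


def cross_entropy(p, u, eps=1e-12):
    """Binary cross entropy; p is the predicted target probability,
    u is the ground-truth label (1 = target, 0 = background)."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(u * math.log(p) + (1 - u) * math.log(1.0 - p))


def smooth_l1(x):
    """Smooth L1 of Fast R-CNN: quadratic for |x| < 1, linear otherwise."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def detection_loss(p, u, t, v):
    """Classification loss plus box regression loss; the regression
    term is only added when the proposal is labelled as a target (u == 1).
    t and v are the predicted and true offsets (tx, ty, tw, th)."""
    loss = cross_entropy(p, u)
    if u == 1:
        loss += sum(smooth_l1(ti - vi) for ti, vi in zip(t, v))
    return loss
```

For a background proposal (`u == 0`) the predicted offsets have no effect on the loss, which matches the intent that box regression is trained on targets only.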
Meanwhile, the smooth L1 loss function [62] is adopted as the loss function for the regression head:

$$L_{bbox}(t^u, v) = \sum_{i \in \{x,y,w,h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i), \tag{4.5}$$

in which

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise}, \end{cases} \tag{4.6}$$

where $t^u$ denotes the predicted bounding box offsets for class $u$ and $v$ denotes the true offsets. At the training stage, the complete loss function is as follows:

$$L(p, u, t^u, v) = L_{cls}(p,u) + [u = 1]\, L_{bbox}(t^u, v), \tag{4.7}$$

where the indicator $[u = 1]$ means that the bounding box regression is trained only when the class is a target.

4.3 Experimental Results

4.3.1 Dataset

For a fair comparison, the SENSIAC dataset was used, as for the previous method introduced in chapter 3. In the comparison experiments, the same dataset setting was employed, where 4573 pairs of images were used for training and 2812 pairs of images were used for testing. For the detailed analysis experiments, however, 9 different Observation Distances (OD) from 1000 to 5000 meters in 500-meter steps were selected. To further reduce the overall data size in these experiments, the sample rate was reduced to 3 Hz. Eventually, there were 7688 training images and 3542 testing images in the detailed analysis experiments.

4.3.2 Experimental Setup

The proposed deep fusion system was implemented with the Tensorflow deep learning toolbox [78]. Compared with Caffe [72], which was used in the previous work, Tensorflow is easier to deploy and has better GPU support. For training, a machine with an NVIDIA GeForce GTX 1080 GPU, an Intel Core i7 CPU, and 32 GB of memory was used. For the hyper-parameters, the setup of Faster R-CNN [64], which has been proven to be efficient and effective for the training stage, was followed. Specifically, all the neural networks were trained for 60000 iterations with an initial learning rate of 0.003, reduced to 0.0003 for the last 40000 iterations, with batch size 1, momentum 0.9, and weight decay 0.0005.
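The learning-rate schedule stated above can be read as a single step: 0.003 for the first 20000 iterations and 0.0003 for the remaining 40000. The sketch below encodes that reading; the function and parameter names are hypothetical, and the exact switching iteration is an assumption, since the text only gives the totals.

```python
def learning_rate(iteration, total_iters=60000, late_iters=40000,
                  base_lr=0.003, late_lr=0.0003):
    """Step schedule: base_lr first, late_lr for the last late_iters iterations."""
    if not 0 <= iteration < total_iters:
        raise ValueError("iteration outside the training run")
    switch_at = total_iters - late_iters  # 20000 under the stated settings
    return base_lr if iteration < switch_at else late_lr
```

The same function covers the chapter 3 setting by calling it with `total_iters=40000, late_iters=10000, base_lr=0.001, late_lr=0.0001`.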
Additionally, all the newly added layers were initialized from a Gaussian distribution with zero mean and a variance of 0.001.

The de-facto standard Average Precision (AP) was selected as the evaluation metric, which is calculated as the ratio between the area under the Precision-Recall curve and the entire area (which is 1).

4.3.3 Comparison with the State-of-the-Arts

In this section, the proposed deep fusion system is compared with the previous work [73], state-of-the-art target detectors [64, 79], and image fusion methods [6, 7] to validate its effectiveness and efficiency. For the comparison with generic target detectors (Faster R-CNN [64] and Single Shot MultiBox Detector (SSD) [79]), their original algorithms were trained only on the visible image dataset without any parameter modification. Because the DMIF method is based on Faster R-CNN but replaces the original CNN with a deeper CNN (ResNet 101), the performance of the improved detector, “Faster R-CNN (ResNet 101)”, is reported as well. For a fair comparison with the conventional image fusion methods (Dual-Tree Complex Wavelet Transform with Sparse Representation (DTCWT-SR) [6] and Convolutional Sparse Representation (CSR) [7]), their algorithms were applied to fuse the VI and MWIR images under the default configurations, and the fused images were then fed to Faster R-CNN (ResNet 101) for target detection.

Figure 4.6: Comparison of the state-of-the-art methods. Left: the overall Precision-Recall curves of the different methods. Right: a locally enlarged view of the left plot.

For the accuracy comparison, the experimental results are presented in Figure 4.6. The proposed DMIF method achieved the best performance with 99.73% AP, which is almost the ceiling performance. Generally speaking, the image fusion based detectors performed much better than any of the generic target detectors (single image modality).
The DMIF method gained a 1.82 percentage-point improvement over the best generic target detector, Faster R-CNN (ResNet 101). The results demonstrate that the DMIF method is able to enhance target-based situation awareness. In comparison with the state-of-the-art hand-engineered image fusion methods (DTCWT-SR and CSR), DMIF achieved improvements of 0.85 and 0.90 percentage points, which suggests that DMIF learns a better strategy for assigning the weights to each imaging modality and choosing the important cross-modality information than those hand-engineered strategies. Moreover, the proposed DMIF method also outperformed the previous work [73] by 1.39 percentage points.

Another important consideration is run-time efficiency. As noted previously, a general image fusion based target detection framework includes three essential modules: (1) the image fusion module, (2) the region proposal module, and (3) the neural network based classification module. The efficiency comparison results are given in Table 4.1. The selected methods were grouped into single image modality based detectors, which are the generic target detectors, and multi-modal image fusion based detectors. In the single image modality based group, the SSD method was the fastest, needing only 0.020 seconds to process one image. In the multi-modal image fusion based group, the proposed DMIF method took only 0.243 seconds to process one image, which is around 6× faster than the previous work [73] and over one order of magnitude faster than the other conventional image fusion based methods. The reason is that the neural network module can be easily optimized on a GPU device, whereas the main computational cost in a multi-modal image fusion based detection system lies in the image fusion module and the region proposal module. For instance, the CSR + Faster R-CNN (ResNet 101) method spent 6.179 seconds on the image fusion module, and the previous work required 1.272 seconds on the region proposal module.
In contrast, the proposed DMIF method combines those three modules into an end-to-end neural network, which enables them to share computational resources and to be optimized on a GPU device. Generally, the frame rate of a video is 30 frames/second (fps). Because the DMIF method needs the motion information, it samples the frames at 6 Hz. In this way, the DMIF method can achieve near real-time performance. Even though the DMIF method was not faster than the generic target detectors, it is feasible to deploy it in a practical application.

Table 4.1: Performance comparison on time cost of different methods.

Methods                                   Running Time (second/image)   AP (%)
single image modality based:
  Faster R-CNN                            0.101                         94.92
  Faster R-CNN (ResNet 101)               0.242                         97.91
  SSD                                     0.020                         95.17
multi-modal image fusion based:
  Previous Work                           1.507                         98.34
  CSR + Faster R-CNN (ResNet 101)         6.413                         98.83
  DTCWT-SR + Faster R-CNN (ResNet 101)    2.758                         98.88
  DMIF (ours)                             0.243                         99.73

4.3.4 Analysis of Target Scales

As described in section 4.1, many factors degrade the performance of target detection. One critical factor is the target scale, i.e., the observation distance between the imaging system and the target, especially in a complex scenario. In this subsection, experiments were conducted for a comprehensive analysis of the multi-scale situation with the SENSIAC dataset.

Note that there is an inverse relationship between the target scale and the observation distance from the imaging system to the target: a longer observation distance leads to a smaller target scale. For the sake of simplicity, the term “observation distance” is used to represent the relative target scale in the experiments.

A set of data across a long range of observation distances (from 1000 to 5000 meters) was selected, and the detector was trained with all of the selected data. For evaluation, the detection performance for targets at different observation distances was assessed.
The observation distance range of [1000, 2000] meters was classified as the large target scale, while the [2500, 5000] meter range was classified as the small target scale. To verify the effectiveness of the DMIF method, 5 detectors of incremental image modality were implemented, ranging from single image modality (MWIR, VI, and MI) to multi-modal image fusion (MWIR-VI and MWIR-VI-MI).

Figure 4.7 plots the AP results against observation distance for the different modalities at the short observation distances. Overall, multi-modal image fusion achieved better results than single image modality. However, the multi-modal image fusion of MWIR and VI outperformed the multi-modal fusion of MWIR, VI, and MI at the three short observation distances. This is because the performance of the single MI modality was extremely poor at the observation distances of 1500 and 2000 meters. When this modality was added to the fusion system, the overall performance degraded accordingly.

The different methods were further compared at the long observation distances. Figure 4.8 shows that the performance of all types of image modality decreased significantly as the observation distance increased. Similar to the results of the short observation distance test, multi-modal image fusion performed better than single image modality, especially for the extremely small targets. It is worth mentioning that even for a target 3500 meters away, the deep multi-modal image fusion of MWIR, VI, and MI could still achieve 80% AP. However, the multi-modal image fusion system was not able to beat the single VI modality at the observation distance of 3000 meters. The single MWIR modality at 3000 meters was even worse than at 3500 meters.
Hence, this raises the question of whether other critical factors affect the detection performance.

Figure 4.7: The AP comparison against observation distance in large target scales.

Figure 4.8: The AP comparison against observation distance in small target scales.

4.3.5 Analysis of Environmental Complexity

In a complex scene, clusters of trees and rocks are ideal natural covers to obscure targeted objects, which introduces a challenge for target detection. This critical impact factor is defined as “environmental complexity.” To measure this factor, the signal and noise for the target detection scenario are described in Figure 4.9. The red dashed bounding box is the target area, which the system is to detect, so it is treated as the signal. Then, the target bounding box is enlarged by a factor of $\sqrt{2}$ to give the green bounding box. Thus, the area between the red and the green bounding boxes refers to the local environment area. From observation, it is clear that noise factors appearing within the local environment degrade the performance of target detection, so the local environment area is treated as the noise. To quantify the environmental complexity, the Signal-to-Noise Ratio (SNR) [80] is calculated as follows:

$$\mathrm{SNR} = \frac{\mu_{signal}}{\sigma_{noise}}, \tag{4.8}$$

where $\mu_{signal}$ is the mean value of the signal and $\sigma_{noise}$ is the standard deviation of the noise. When there is noise in the local environment, e.g., a cluster of trees or rocks, $\sigma_{noise}$ increases and the SNR value decreases. Therefore, a higher SNR score indicates a lower environmental complexity.

Figure 4.9: Illustration of target area vs local environment area.

The SNR distribution of the MWIR imagery against observation distance is plotted in Figure 4.10. The SNR score at 3000 meters was 1.17 lower than that at 2500 meters and almost half of the SNR value at 3500 meters. In other words, the natural environment at 3000 meters was much more complex than that of its neighbors.
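The SNR measure of Equation (4.8) can be sketched as follows. Several details are assumptions of this example rather than statements from the text: the box format `(x1, y1, x2, y2)` with exclusive upper bounds, enlarging each side of the box by √2 about its centre (which doubles its area), rounding the enlarged box outward and clipping it to the image, and using the population standard deviation for the noise.

```python
import math
import statistics


def expand_box(box, factor=math.sqrt(2)):
    """Scale a box (x1, y1, x2, y2) about its centre by the given factor."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = (x2 - x1) / 2.0 * factor, (y2 - y1) / 2.0 * factor
    return cx - half_w, cy - half_h, cx + half_w, cy + half_h


def local_snr(image, box):
    """SNR of Eq. (4.8): mean of the target area divided by the standard
    deviation of the surrounding ring between target and enlarged box."""
    x1, y1, x2, y2 = box
    ex1, ey1, ex2, ey2 = expand_box(box)
    height, width = len(image), len(image[0])
    # Round the enlarged box outward and clip it to the image bounds.
    ex1, ey1 = max(0, math.floor(ex1)), max(0, math.floor(ey1))
    ex2, ey2 = min(width, math.ceil(ex2)), min(height, math.ceil(ey2))
    signal, noise = [], []
    for y in range(ey1, ey2):
        for x in range(ex1, ex2):
            inside_target = x1 <= x < x2 and y1 <= y < y2
            (signal if inside_target else noise).append(image[y][x])
    if not noise:
        raise ValueError("no local environment pixels around the target")
    sigma = statistics.pstdev(noise)
    return statistics.mean(signal) / sigma if sigma > 0 else math.inf
```

A cluttered surrounding (trees, rocks) raises the spread of the ring pixels and therefore lowers the returned value, in line with the interpretation given above.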
This explains why the detection performance of the MWIR modality at 3000 meters was worse than that of the other methods in Figure 4.8.

Figure 4.10: The distribution of the SNR value of MWIR imagery against distance.

4.3.6 Statistical Analysis

As discussed in the above sections, two factors, i.e., target scale (observation distance) and environmental complexity, are critical to the target detection performance. In this section, a statistical method is employed to verify whether the proposed deep fusion system mitigates the impact of the target scale and environmental complexity. To accomplish this, the SNR, the observation distance, and the corresponding AP for the different image modalities were calculated, and the results are summarized in Table 4.2.

Table 4.2: Observation distance (OD), SNR, and corresponding AP for different image modalities.

          MWIR             VI               MI               MWIR-VI          MWIR-VI-MI
OD (m)    SNR      AP      SNR      AP      SNR      AP      SNR      AP      SNR      AP
1000      5.865    0.988   3.413    1.0     2.726    0.990   4.231    1.0     4.001    1.0
1500      8.666    0.974   4.799    0.981   2.813    0.932   6.088    0.986   5.426    0.981
2000      10.848   0.936   4.759    0.974   2.727    0.944   6.788    0.990   6.111    0.978
2500      7.460    0.763   4.568    0.944   2.636    0.820   5.532    0.981   4.888    0.988
3000      6.294    0.264   6.831    0.657   2.632    0.509   6.652    0.476   5.252    0.500
3500      13.265   0.403   7.069    0.627   2.536    0.510   9.134    0.775   7.623    0.818
4000      10.919   0.002   5.995    0.588   2.357    0.306   7.636    0.555   6.424    0.708
4500      11.294   0.006   7.449    0.278   2.524    0.157   8.731    0.478   7.089    0.301
5000      4.677    0.035   6.925    0.025   2.138    0.009   6.176    0.100   4.580    0.116

Multiple linear regression analysis is used to evaluate the association between two or more independent variables and a dependent/response variable.
In this chapter, the observation distance (OD) and the environmental complexity (SNR) were set as the two independent variables and the performance of the detectors (AP) as the dependent variable. Hence, the linear regression model is formulated as follows:

$$\mathrm{AP} = b_0 + b_1 \times \mathrm{OD} + b_2 \times \mathrm{SNR}, \tag{4.9}$$

where $b_1$ and $b_2$ are the estimated coefficients and $b_0$ is a constant term. Multiple linear regression yields the equation that minimizes the distance between the fitted model and all of the data points. If the two factors, i.e., target scale and environmental complexity, have a great influence on the detection, the estimated multiple linear regression model will fit the data well. In other words, a lower goodness-of-fit of the multiple linear regression means the detection system depends less on the observation distance (target scale) and environmental complexity. So a low goodness-of-fit is the desired outcome.

To measure the goodness-of-fit of the model, three well-known statistical metrics ($R^2$, adjusted $R^2$, and p-value) were adopted. $R^2$, also called the coefficient of determination, is a statistical measure of how close the data are to the fitted regression model. The value of $R^2$ is in the range [0, 1]. A higher value means that the multiple linear regression has a better goodness-of-fit, and hence that the target detection model depends more on the noise factors. The adjusted $R^2$ is a modified version of $R^2$ that includes an additional term penalizing the model for each additional explanatory variable. Consequently, any variable without a strong correlation makes the adjusted $R^2$ decrease. The p-value is used to test the null hypothesis that the independent variables (i.e., target scale and environmental complexity) have no effect on the response variable (i.e., average precision). So in this case, a higher p-value indicates that the null hypothesis is more plausible.
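The regression analysis can be sketched in plain Python as below, using the MWIR-VI-MI column of Table 4.2 as input. This is an illustrative reconstruction, not the tool used in the thesis: the helper names are hypothetical, the observation distance is expressed in kilometres for numerical convenience (which does not change $R^2$), and the p-value is assumed to be that of the overall F-test, computed with the closed form of the F survival function that holds for exactly two predictors.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ap_model(od, snr, ap):
    """Least-squares fit of AP = b0 + b1*OD + b2*SNR; returns (coefficients, R^2)."""
    rows = [[1.0, d, s] for d, s in zip(od, snr)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ap)) for i in range(3)]
    coeffs = solve_linear(xtx, xty)
    predicted = [sum(c * v for c, v in zip(coeffs, r)) for r in rows]
    mean_ap = sum(ap) / len(ap)
    ss_res = sum((y - p) ** 2 for y, p in zip(ap, predicted))
    ss_tot = sum((y - mean_ap) ** 2 for y in ap)
    return coeffs, 1.0 - ss_res / ss_tot


def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def overall_p_value(r2, n):
    """p-value of the overall F-test for k = 2 predictors, using the closed
    form P(F(2, d) > f) = (1 + 2 f / d) ** (-d / 2)."""
    d = n - 3
    f_stat = (r2 / 2.0) / ((1.0 - r2) / d)
    return (1.0 + 2.0 * f_stat / d) ** (-d / 2.0)


# MWIR-VI-MI column of Table 4.2 (OD converted to kilometres).
od_km = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
snr = [4.001, 5.426, 6.111, 4.888, 5.252, 7.623, 6.424, 7.089, 4.580]
ap = [1.0, 0.981, 0.978, 0.988, 0.500, 0.818, 0.708, 0.301, 0.116]
coeffs, r2 = fit_ap_model(od_km, snr, ap)
print(r2, adjusted_r2(r2, len(ap), 2), overall_p_value(r2, len(ap)))
```

With nine observations and two predictors, an $R^2$ of 0.8284 corresponds to an adjusted $R^2$ of about 0.771 and an overall p-value of about 0.005 under these formulas.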
In other words, a regression model with a higher p-value means the detection system has less dependence on the observation distance (target scale) and environmental complexity.

A set of multiple linear regression models was estimated for the detectors of the different image modalities. The results are given in Table 4.3. The overall evaluation showed that all the multiple linear regression models of the detectors have a high goodness-of-fit, which means the target scale and environmental complexity have a strong effect on the performance of the detectors. On the other hand, the proposed deep multi-modal fused system (MWIR-VI-MI) had the lowest values for both $R^2$ and adjusted $R^2$, and the highest p-value compared with the single-modal and double-modal methods. This means that the DMIF method is able to alleviate the impact of the target scale and environmental complexity in comparison with the single image modalities.

Table 4.3: Results of multiple linear regression for the data in Table 4.2.

               MWIR     VI       MI          MWIR-VI   MWIR-VI-MI
R^2            0.8927   0.8823   0.9573      0.8595    0.8284
Adjusted R^2   0.8570   0.8431   0.9431      0.8127    0.7712
p-value        0.0012   0.0016   7.769e-05   0.0027    0.0050

4.3.7 Discussions

The experiments for algorithm comparison demonstrated both the effectiveness and efficiency of the proposed deep multi-modal fusion system. In the analysis of observation distance, the deep multi-modal fusion performed better than the individual modality methods, especially at a long observation distance. However, when any single image modality involved in the fusion framework had a degraded performance, it also introduced degradation to the overall deep multi-modal detection.

Two factors, i.e., observation distance (target scale) and environmental complexity, were investigated for their impacts on detection performance. As the target became smaller at a longer distance, the detection performance generally got worse.
Another observation is that lower environmental complexity allows a better detection result. When taking both factors into account, the statistical analysis provided evidence that the proposed deep multi-modal fusion could mitigate the two impacts.

4.4 Summary

In this chapter, a novel Deep Multi-Modal Image Fusion (DMIF) module is proposed and integrated into a state-of-the-art generic target detection framework, i.e., Faster R-CNN, for target detection in complex scenarios. The detection of small targets in a complex environment will enhance the capability of real-time situation awareness. The overall framework is comprised of image fusion, region proposal, and classification & regression functionalities configured in an end-to-end network. The extensive experiments on the dataset show that the proposed method can achieve 99.73% AP with competitive run-time performance. Moreover, the proposed fusion method can operate with varying noise factors from the complex scene. The DMIF framework has great potential to be applied to a real-world scenario. Nevertheless, the multi-modal image fusion based context enhancement methods are only effective at daytime, dusk, or dawn, when the visible camera is still able to work. The environmental perception at night is also critical for a SA system. Thus, in the next chapter, the work will focus on how to enhance the environmental perception at night.

Chapter 5
Thermal Image Translation for Enhanced Environmental Perception at Night

5.1 Introduction

Humans have poor night vision compared to many animals, partly because human eyes lack a tapetum lucidum [81]. This biological deficiency may lead to undesirable fatalities. For example, vehicle collisions are much more likely to happen at night than during daytime [82].
Hence, context enhancement plays a critical role in many night vision applications.

A straightforward way to enhance the context in night vision is to employ thermal or infrared (IR) and visible (VI) image fusion approaches [83–86], where an IR sensor can enhance thermal objects in a night environment from a visual spectrum background [87]. However, an IR/VI image fusion method only works at dawn or dusk. On a dark night with poor illumination conditions, the visible camera does not function properly [88]. In this case, the IR sensor still works, because the energy emitted by an object reaches the IR sensor and can then be converted into a temperature value.

Theoretically, the useful semantic information of an image to the Human Visual System (HVS) includes contour, texture and color [89–91]. Compared with the Colorful Visible Image (CVI), the IR image only has the contour information. When the image is presented to a human, a CVI is preferred. In a nighttime scenario, translating an IR image to a CVI would be a possible solution to enhance environmental perception at night. In recent years, numerous studies have attempted to solve this challenging task by colorizing the IR images using different models [92–94]. Recent progress in machine learning might advance nighttime imagery. Generally, machine learning models are employed to predict the color values directly. However, these approaches can only add color information to the IR image, while the texture information is also critical to the HVS. In addition, these IR colorization methods usually require large-scale datasets with corresponding ground truth data for training. For an IR image captured at night, it is almost impossible to find a pixel-to-pixel aligned daytime CVI.

Since the final objective is to translate a nighttime IR image to a CVI, i.e., to add texture and color information to an IR image, a two-step thermal image translation method, called IR2VI, is built.
The first step is to translate a nighttime IR image to a daytime Grayscale Visible Image (GVI); this step is called IR-GVI and aims to add texture information to the IR image. The second step is to colorize a GVI to a colorful visible image (CVI); this step is named GVI-CVI and aims to add color information to the GVI.

The IR-GVI step is formulated as an unsupervised image-to-image translation task, which aims to model the mapping between two different data distributions without fully paired training datasets. This was a challenging task until the Generative Adversarial Network (GAN) based methods were proposed [44, 95–97] in recent years. However, when applying the state-of-the-art unsupervised image-to-image translation methods to the IR-GVI task directly, two major problems arise. Firstly, when most areas of the input image are overly bright, these state-of-the-art methods face an incorrect mapping problem. Secondly, the images generated by the state-of-the-art methods lack fine details, especially for small objects. In this chapter, an unsupervised IR-GVI translation algorithm, namely Texture-Net, is proposed to address these challenges. The Texture-Net is a GAN-based method, and its basic architecture is comprised of a generator, a global discriminator, and a Region of Interest (ROI) discriminator. To deal with the incorrect mapping problem, a structure connection is developed in the generator, enabling the generated image to keep the original structure information. An ROI focal loss is also presented, which consists of an ROI cycle-consistency loss and an ROI adversarial loss, to add more fine details in the concerned areas.

The GVI-CVI step can be treated as a general GVI colorization procedure, although the input is a synthetic image, not a real gray-scale image. Due to the advancement of the Convolutional Neural Network (CNN), numerous automatic GVI colorization approaches have been proposed in recent years.
In this chapter, the state-of-the-art methods [98–101] are investigated, and the best one [98] is incorporated into the proposed IR2VI framework via comprehensive evaluation experiments.

It is a challenge to evaluate the method without any ground truth data. Thus, a multi-criteria decision making based non-reference image quality assessment method is presented to evaluate the performance of the IR2VI.

To summarize, the contributions of this chapter include:

• A two-step thermal image translation method, named IR2VI, is proposed. To the author's best knowledge, this is the first method that is able to enhance the environmental perception at night via translating the nighttime IR image to the CVI.

• Extensive experiments including subjective and objective evaluation are carried out, which demonstrate the effectiveness of the proposed method over the state-of-the-art methods.

5.2 Thermal Image Translation (IR2VI)

The IR2VI framework consists of two steps: the first (IR-GVI) is to translate a nighttime IR image to a GVI, and the second (GVI-CVI) is to colorize a GVI to a CVI. The flowchart of the proposed IR2VI is illustrated in Figure 5.1. In the IR-GVI step, the objective is to add texture semantic information. In the GVI-CVI step, the objective is to add color semantic information. In this section, the design decisions and details for each step are presented.

Figure 5.1: The overview of the IR2VI method. The IR2VI consists of two steps: (1) IR-GVI, which takes a thermal image as input and adds the texture information using the Texture-Net; (2) GVI-CVI, which adds the color information using the image colorization algorithm.

5.2.1 IR-GVI

In the IR-GVI step, the Texture-Net is proposed to add texture information to the nighttime IR image. As can be seen in Figure 5.2, the basic architecture of Texture-Net includes a generator, a global discriminator, and an ROI discriminator.
The generator translates an IR image to a synthetic GVI that looks similar to the real GVI, while the global discriminator distinguishes translated GVI images from real ones. The ROI discriminator aims to distinguish the ROIs between translated GVI images and real ones. In this way, the synthetic GVI images are designed to be indistinguishable from the real GVI images.

Similar to CycleGAN [97] and StarGAN [95], the residual auto-encoder architecture from Johnson et al. [102] with 9 residual blocks [36] is adopted for the generative neural network. The naming convention used in the image translation community [96, 97, 103] is adopted, and the neural network configuration is expressed as follows:

c7s1-32, d64, d128, R128, R128, R128, R128, R128, R128, R128, R128, R128, u64, u32, c7s1-1, c7s1-1^structure, F

where c7s1-k represents a 7×7 Convolution-BatchNorm-ReLU layer with k filters and stride 1. The superscript "structure" marks the layer that belongs to the structure connection module, which will be introduced in the following subsection. dk denotes a 3×3 Convolution-BatchNorm-ReLU (CBR) layer with k filters and stride 2. Reflection padding is employed to reduce boundary artifacts. Rk denotes a residual block which consists of two 3×3 convolutional layers with the same number of filters on both layers. uk represents a 3×3 fractional-strided CBR layer with k filters and stride 1/2. F denotes the fusion layer, where sum and tanh functions are utilized to fuse the output information from both the structure connection and the residual auto-encoder. The PatchGAN [104] with 4 hidden layers is adopted for all the discriminative neural networks, with the neural network configuration as follows:

C64, C128, C256, C512, C512

where Ck denotes a 4×4 Convolution-BatchNorm-LeakyReLU layer with k filters and stride 2 (except for the last layer, which has stride 1). After the last layer, a convolution is applied to produce a 1-dimensional output. BatchNorm is not applied to the first C64 layer. The slope of the LeakyReLU is set to 0.2.

Figure 5.2: An overall architecture of the proposed Texture-Net. Note that this is a brief illustration of the architecture, which actually needs to be duplicated for training in the CycleGAN way.

For training the Texture-Net, four loss functions are utilized (cycle consistency loss, global adversarial loss, ROI cycle-consistency loss, and ROI adversarial loss). Details about each loss function are provided in the following sections.

The Texture-Net framework evolves from the CycleGAN [97]. In contrast to CycleGAN, two important improvements are made: (1) a structure connection module is added into the generator to constrain the structure deformation; and (2) an ROI focal loss is calculated in the training stage, which enables the translation procedure to focus on the critical regions.

Implementation Details

Structure Connection

Incorrect mapping is a common issue for unsupervised image translation models, which lack direct supervision. When objects in the source image are overly bright, which is an extremely common situation for IR images at night, the translation models become confused and map the objects to any random permutation of objects in the target domain. As in the example in Figure 5.3, the CycleGAN incorrectly maps the ground to the vegetation and the vehicle to a different object. To solve the incorrect mapping problem, a shortcut is added to the generator to connect the input image with the generated image, which is called the "structure connection." A 7×7 convolution layer is adopted to extract the detailed structure information from the IR image, which is then fused with the semantic information generated by the residual auto-encoder model.
In this way, the deep CNN is able to focus on the semantic-level task while the structure connection enables the synthetic GVI image to keep the original structure information.

Figure 5.3: An example of the results from the CycleGAN to illustrate the incorrect mapping problem. The CycleGAN incorrectly mapped the vegetation to the ground.

Cycle Consistency Loss

The cycle consistency loss was proposed by Zhu et al. in [97]. The basic idea is to learn two mapping functions $G : IR \to GVI$ and $F : GVI \to IR$, which can translate the image between the two domains. For $x \in IR$, it forces $F(G(x)) \approx x$, while for $y \in GVI$, it forces $G(F(y)) \approx y$. Thus, it becomes possible to constrain the cycle-consistency and eliminate undesirable mappings. The cycle consistency loss can be formulated as follows:

$$\mathcal{L}_{cyc}(G,F) = \mathbb{E}_{x \sim IR_{data}}\left[\|F(G(x)) - x\|_1\right] + \mathbb{E}_{y \sim GVI_{data}}\left[\|G(F(y)) - y\|_1\right]. \tag{5.1}$$

Global Adversarial Loss

The global adversarial loss is derived from the global discriminator, which aims to distinguish the full-size image of the real domain from the full-size image of the synthetic domain. Because the image fed to the discriminator neural network is full-sized, the adversarial loss is designated as the global adversarial loss. As mentioned above, two mapping functions are created for the cycle consistency loss. The global adversarial loss is applied to both mapping functions. Taking the mapping function $G : IR \to GVI$ as an example, its discriminator is $D^{g}_{GVI}$. Thus, the global adversarial loss is formulated as:

$$\mathcal{L}_{adv}(G, D^{g}_{GVI}, IR, GVI) = \mathbb{E}_{y \sim GVI_{data}}\left[\log D^{g}_{GVI}(y)\right] + \mathbb{E}_{x \sim IR_{data}}\left[\log\left(1 - D^{g}_{GVI}(G(x))\right)\right]. \tag{5.2}$$

ROI Focal Loss

Generally, the images generated via adversarial training often lack fine details and realistic textures [103, 105]. This is particularly evident when the concerned object is extremely small. To this end, an ROI focal loss is proposed, which consists of an ROI adversarial loss and an ROI cycle-consistency loss. The ROI approach is suitable for training datasets with bounding boxes.
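Equations 5.1 and 5.2 can be illustrated numerically with the sketch below. It is only a toy illustration under stated assumptions, not the training code of the Texture-Net: the mappings `G` and `F` and the discriminator `D` are plain Python callables standing in for neural networks, the discriminator is assumed to return a single probability per image (a PatchGAN returns a map of patch scores), the expectations are replaced by batch means, and the small constant `eps` is added only to keep the logarithm finite.

```python
import numpy as np


def cycle_consistency_loss(G, F, ir_batch, gvi_batch):
    """Equation 5.1: L1 distance between each image and its round-trip translation."""
    forward = np.mean([np.abs(F(G(x)) - x).sum() for x in ir_batch])
    backward = np.mean([np.abs(G(F(y)) - y).sum() for y in gvi_batch])
    return float(forward + backward)


def adversarial_loss(G, D, ir_batch, gvi_batch, eps=1e-12):
    """Equation 5.2: log-likelihood of real GVI images plus that of rejecting fakes."""
    real_term = np.mean([np.log(D(y) + eps) for y in gvi_batch])
    fake_term = np.mean([np.log(1.0 - D(G(x)) + eps) for x in ir_batch])
    return float(real_term + fake_term)


# Toy stand-ins (hypothetical): G doubles the intensities and F halves them,
# so F(G(x)) == x and G(F(y)) == y hold exactly.
G = lambda x: 2.0 * x
F = lambda y: 0.5 * y
D = lambda img: 0.5  # an undecided discriminator

ir_batch = [np.array([[0.1, 0.2], [0.3, 0.4]])]
gvi_batch = [np.array([[0.5, 0.6], [0.7, 0.8]])]

loss_cyc = cycle_consistency_loss(G, F, ir_batch, gvi_batch)
loss_adv = adversarial_loss(G, D, ir_batch, gvi_batch)
print("cycle consistency loss:", loss_cyc)
print("global adversarial loss:", loss_adv)
```

Because the toy mappings are exact inverses of each other, the cycle term vanishes; any deviation from a perfect round trip produces a positive penalty.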
In contrast to the cycle consistency loss and the global adversarial loss, which take the full-size image as input, the ROI focal loss operates on the ROIs. To obtain the ROIs from the full-size image, the ROI pooling layer [62], which was originally proposed for object detection, is implemented. Based on the provided bounding boxes, the ROI pooling layer is able to crop an arbitrary area and reshape it to a fixed-size image. In this work, the fixed size of the ROI image is set to 64×64 and the ROI pooling function is named $R(\cdot)$. As with the cycle consistency loss and the global adversarial loss, the ROI focal loss is applied to both mapping functions. Here, the mapping function $G : IR \to GVI$ is used as an example.

The ROI cycle-consistency loss can be formulated as follows:

$$\mathcal{L}^{roi}_{cyc}(G,F) = \mathbb{E}_{x \sim IR_{data}}\left[\|R(F(G(x))) - R(x)\|_1\right] + \mathbb{E}_{y \sim GVI_{data}}\left[\|R(G(F(y))) - R(y)\|_1\right]. \tag{5.3}$$

The neural network configuration of the ROI discriminator is the same as that of the global discriminator. The ROI adversarial loss can be expressed as follows:

$$\mathcal{L}^{roi}_{adv}(G, D^{roi}_{GVI}, IR, GVI) = \mathbb{E}_{y \sim GVI_{data}}\left[\log D^{roi}_{GVI}(R(y))\right] + \mathbb{E}_{x \sim IR_{data}}\left[\log\left(1 - D^{roi}_{GVI}(R(G(x)))\right)\right], \tag{5.4}$$

where $D^{roi}_{GVI}$ represents the ROI discriminator for GVI images.

Full Objective

Finally, the complete objective function can be written as:

$$\mathcal{L}_{full} = \mathcal{L}_{adv}(G, D^{g}_{GVI}, IR, GVI) + \mathcal{L}_{adv}(G, D^{g}_{IR}, IR, GVI) + \lambda_{cyc}\mathcal{L}_{cyc}(G,F) + \lambda_{roi}\left(\lambda_{cyc}\mathcal{L}^{roi}_{cyc}(G,F) + \mathcal{L}^{roi}_{adv}(G, D^{roi}_{GVI}, IR, GVI) + \mathcal{L}^{roi}_{adv}(G, D^{roi}_{IR}, IR, GVI)\right), \tag{5.5}$$

where $\lambda_{cyc}$ and $\lambda_{roi}$ are the hyper-parameters that control the relative importance of the cycle consistency loss and the ROI focal loss. For simplicity, $\mathcal{L}_{full}$ represents $\mathcal{L}(G, F, D^{roi}_{GVI}, D^{roi}_{IR}, D^{g}_{GVI}, D^{g}_{IR})$. Finally, the method solves:

$$G^{*}, F^{*} = \arg\min_{F,G}\ \max_{D^{roi}_{GVI},\, D^{roi}_{IR},\, D^{g}_{GVI},\, D^{g}_{IR}} \mathcal{L}_{full}. \tag{5.6}$$

5.2.2 GVI-CVI

The objective of the GVI-CVI step is to add color semantic information to the synthetic images. Since the synthetic images generated by the Texture-Net are GVI images, a state-of-the-art image colorization algorithm can be applied to accomplish the goal.
Generally, the GVI colorization algorithms can be roughly classified into three categories: reference-based [106–108], user-guided [109–113] and fully automatic [98–101, 114]. Since building a fully automatic thermal image translation framework is the objective of this work, the fully automatic approaches are investigated.

Mathematically, the fully automatic GVI colorization problem can be formulated as learning a function $f : X \to Y$, where $X$ is a GVI and $Y$ is the parameterized color maps. The state-of-the-art methods [98–101] focused on how to construct an optimal learning function $f$ and the color parameterization. The state-of-the-art architectures for the learning function $f$ are illustrated in Figure 5.4.

Figure 5.4: Illustration of different state-of-the-art GVI colorization architectures. (a) single-stream architecture, (b) skip-layer architecture, and (c) two-stream architecture.

Figure 5.4(a) represents the single-stream learning architecture, which is usually utilized for solving the image classification challenge. The single stream consists of a stack of convolution layers with different kernel sizes or strides. Zhang et al. [98] is the most representative work in this style. It adopted CIE Lab as the color parameterization. The GVI can be considered as the lightness L, so the objective of the learning function is to predict the a and b color maps. In addition, Zhang et al. [98] treated the colorization problem as multinomial classification, and they also proposed a class re-balancing method to deal with the biased ab values in natural images. They trained their model on ImageNet [57].

Figure 5.4(b) denotes the skip-layer learning architecture. Compared with the single-stream learning architecture, the key difference is that the skip-layer architecture has many links which incorporate the feature responses from different levels of the primary stream, and these responses are combined into a hypercolumn. Larsson et al.
[100] utilized this configuration in their colorization work, where they adopted a Hue-based color space as the color parameterization. Instead of predicting the color value directly, they predicted the color histogram. The ImageNet training set was adopted for training their model.

Figure 5.4(c) illustrates the two-stream learning architecture, in which the first stream extracts local features and the other stream extracts global features, followed by a fusion layer that fuses these features together. Iizuka et al. [99] first proposed this kind of architecture for the GVI colorization task. Then, Federico et al. [101] replaced its global feature extracting stream with a more powerful pre-trained image classification model, Inception-ResNet-v2 [115]. Both utilized CIE Lab as the color parameterization and treated the colorization problem as a regression task. Iizuka et al. [99] trained their model on the Places dataset [116], while Federico et al. [101] used a small subset of ImageNet (around 60,000 images) as the training dataset.

5.2.3 Evaluation methods

Evaluating the IR2VI method and the individual steps is a challenge since there is no ground truth associated with the translated GVI or the colorized CVI images. To evaluate the quality of the translation and colorization results, different Non-Reference Image Quality Assessment (NR-IQA) methods are applied to the synthetic GVI images and colorized CVI images. Since the dataset adopted in the work has nighttime IR images and daytime GVI images along with bounding boxes, it is also possible to assess different IR-GVI methods by performing object detection on the synthesized GVI images.

Figure 5.5: Subjective comparison of different IR-GVI methods on the SENSIAC dataset. (a) The nighttime IR (input), (b) images generated by CycleGAN [97], (c) images generated by UNIT [96], (d) images generated by StarGAN [95], (e) images generated by the Texture-Net.

Non-Reference Image Quality Assessment

Image quality assessment aims to automatically predict the quality of distorted images as it would be perceived by an average human observer. Generally, image quality assessment methods fall into reference-based and non-reference (blind) categories. Since there is no reference (ground truth) for either the synthetic GVI images or the colorized CVI images, the non-reference algorithms (NR-IQA) are used as the quantitative methods for the experimental results in this chapter.

The NR-IQA approaches are roughly divided into two categories: Natural Scene Statistic (NSS) methods and learning-based methods. NSS methods generally depend on the assumption that natural scenes have certain statistical properties that vary with the level of quality. Among these methods, BIQAA [117] is based on measuring entropy, BLIINDS-II [118] is based on discrete cosine transform coefficients, BRISQUE [119] is based on the locally normalized luminance coefficients, and NIQE [120] is based on a "quality aware" collection of statistical features. On the other hand, the learning-based methods learn to predict human judgments of image quality from datasets of human-rated images. For example, NIMA [121] adopted state-of-the-art classification CNN architectures for predicting both the technical and aesthetic qualities of the images by training the models on a large-scale human-rated image dataset. RankIQA [122] is a CNN-based NR-IQA model, but rather than being trained on a large-scale dataset, it used the "learning from rankings" technique to augment a small dataset for training.
Compared with the NSS models, these kinds of methods rely heavily on the training dataset.

To evaluate the quality of the colorized CVI images, both NSS-based methods (BIQAA, BLIINDS-II, BRISQUE, and NIQE) and learning-based methods (NIMA and RankIQA) were applied. However, only the NSS-based methods were applied to assess the quality of the synthetic GVI images. This is because the learning-based methods were trained on RGB image datasets, which makes them unsuitable for assessing the GVI images.

Since no single NR-IQA method is fully accurate and reliable, a decision cannot be made depending on any single method. Multiple metrics can give comprehensive but contradictory judgments. Thus, it is necessary to apply a Multiple Criteria Decision Making (MCDM) method to the NR-IQA assessment results to make a fair evaluation. In this chapter, the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) [123] is used as the MCDM method. The basic principle behind TOPSIS is that it selects the alternative which has the shortest distance from the ideal solution and the farthest distance from the negative-ideal solution. TOPSIS can incorporate relative weights of criterion importance. In the experiments, the same weight was assigned to each criterion.

Object Detection

To assess the performance of different IR-GVI methods using object detection, the Faster R-CNN with the ResNet-101 neural network presented in [77] is adopted as the object detector and trained on the daytime GVI image dataset (target domain). Then, different IR-GVI methods are employed to generate the synthetic GVI images from the nighttime IR images. Finally, the trained object detector model is applied to the synthetic GVI image collections.
To evaluate the performance of the object detector, the de facto standard Average Precision (AP) is employed, which is calculated as the ratio of the area under the precision-recall curve (less than 1) to the entire area (equal to 1).

5.3 Experimental Results

This section introduces the Military Sensing Information Analysis Center (SENSIAC) [124] dataset that was used in all the experiments. Then the settings of the hardware and the training details are listed. Lastly, both subjective qualitative and objective quantitative analyses are conducted on the IR-GVI and GVI-CVI steps of IR2VI individually.

5.3.1 Dataset

In these experiments, the proposed IR2VI was also evaluated on the SENSIAC dataset. The dataset was collected during both daytime and nighttime with multiple observation distances from 500 to 5000 meters. It is worth noting that SENSIAC has paired IR and GVI videos in the daytime but only has IR videos in the nighttime. Three different observation distances, i.e., 1000, 1500, and 2000 meters, were selected according to the previous work in chapter 3. For training the IR-GVI models, the keyframes were sampled at 3 Hz (every 10 frames). Thus, there were 2700 nighttime IR and 2691 daytime GVI training images. Note that all the nighttime IR images were preprocessed by histogram equalization prior to being fed into the models. For the object detection based evaluation method, the keyframes were sampled at 6 Hz (every 5 frames), so that they are unseen in the stage of training the IR-GVI models. Thus, there were 4573 daytime GVI images for training the object detector and 3240 nighttime IR images for evaluating the IR-GVI methods. In the GVI-CVI step, since SENSIAC does not have CVI images, the state-of-the-art colorization methods could not be trained on this dataset.
But the GVI-CVI methods could be evaluated based on the synthetic GVI images from the SENSIAC testing dataset.

5.3.2 Experiments Setup

A workstation with an NVIDIA GeForce GTX 1080 GPU, an Intel Core i7 CPU and 32 GB of memory was employed in the work. The Texture-Net was developed based on CycleGAN [97] using the PyTorch deep learning toolbox [125]. A parameter tuning procedure was conducted to find suitable hyper-parameters. In this way, $\lambda_{cyc} = 5$ and $\lambda_{roi} = 0.1$ were set in Equation 5.5. The neural network was trained from scratch, and the weights were initialized from a Gaussian distribution with zero mean and 0.02 standard deviation. The Adam solver [126] was employed with a batch size of 2. The learning rate was set to 0.0002 for the first 20 epochs and then linearly decayed to zero over the next 20 epochs. To make a fair comparison, the default settings of the other IR-GVI methods were kept, except for the image channel, image size, batch size, and training epochs. To be specific, since all the images in the SENSIAC dataset have one channel, the number of channels was set to 1 for both the input and output of the IR-GVI methods. Because of the limited capacity of the GPU memory, the training epochs were set to 40 with batch size 2 for CycleGAN [97] and UNIT [96], and to 40 with batch size 12 for StarGAN [95]. IR images were center-cropped to 256×256 pixels before being fed into the neural networks. The Texture-Net is an object-based framework, so the images were cropped to 256×256 with at least one object. Because the generator neural network of every method is a fully convolutional network which is able to take an image of arbitrary size as input, the full-size IR image was fed to the neural network in the testing stage.
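The learning-rate schedule described above (constant for 20 epochs, then a linear decay to zero over the next 20) can be sketched as a small function. This is an illustrative sketch only: the function name `learning_rate` and its parameter names are hypothetical, epochs are assumed to be counted from zero, and the exact epoch at which the rate reaches zero is an assumption, since the text does not specify the convention.

```python
def learning_rate(epoch, base_lr=0.0002, constant_epochs=20, decay_epochs=20):
    """Learning rate for a given zero-based epoch index.

    The rate stays at base_lr for the first `constant_epochs` epochs and then
    decreases linearly, reaching zero after `decay_epochs` further epochs.
    """
    if epoch < constant_epochs:
        return base_lr
    # Fraction of the decay phase that has already elapsed.
    progress = (epoch - constant_epochs) / decay_epochs
    return base_lr * max(0.0, 1.0 - progress)


schedule = [learning_rate(epoch) for epoch in range(41)]
print(schedule[0], schedule[19], schedule[30], schedule[40])
```

In a PyTorch training loop, a function of this shape is typically expressed as a multiplicative factor relative to the base rate and passed to a scheduler, but the plain function is enough to show the shape of the schedule.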
For the GVI-CVI methods, the original settings in the works of [98–101] were kept without further modifications.

5.3.3 IR-GVI Results

In this section, the proposed Texture-Net is compared with the state-of-the-art IR-GVI methods: CycleGAN [97], UNIT [96] and StarGAN [95].

Subjective Comparisons

For a fair comparison, all the methods were trained on the same training set and tested on unseen images. Figure 5.5 shows the images translated from unseen images by the different IR-GVI methods. It is apparent that the CycleGAN and the UNIT had serious incorrect mapping problems. The CycleGAN could not tell the vegetation from the ground, so it incorrectly mapped the ground to a forest. In the second testing image, the CycleGAN incorrectly generated two vehicles. The GVI images translated by UNIT were almost identical to each other and carried little semantic information. The StarGAN method had few incorrect mapping problems but lacked sharp texture information. The Texture-Net clearly provided the highest visual quality of translation results compared to the state-of-the-art methods. It could not only bring the spatial semantic information but also make the target clear. Thus, the Texture-Net benefited from the advantages of the structure connection module and the ROI focal loss.

Non-reference Image Quality Evaluation

Thirty IR images were randomly selected and then translated into synthetic GVI images by the different methods. Table 5.1 shows the assessment results of the NSS-based NR-IQA methods on these synthetic images. For the BIQAA [117] metric, a higher value means better quality. On the contrary, for BLIINDS-II [118], NIQE [120] and BRISQUE [119], a lower value means better quality. Table 5.1 shows that the synthetic images generated by the Texture-Net had better quality in both the spatial (NIQE and BRISQUE) and transform (BLIINDS-II) domain assessment aspects.
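The TOPSIS procedure described in Section 5.2.3 can be sketched as follows. This is a minimal sketch of the textbook form of TOPSIS (vector normalization, equal weights by default), which may differ in detail from the implementation of [123] used in the thesis; the function name `topsis` and the three-alternative score matrix are hypothetical and chosen only for illustration. Criteria where a higher value is better (such as BIQAA) are marked as benefit criteria, and those where a lower value is better (such as NIQE) as cost criteria.

```python
import numpy as np


def topsis(scores, benefit, weights=None):
    """Relative closeness of each alternative to the ideal solution.

    scores  : (n_alternatives, n_criteria) array of raw criterion values
    benefit : one boolean per criterion, True if a higher value is better
    weights : optional criterion weights (equal weights when omitted)
    Assumes no criterion column is all zeros and the alternatives are not
    all identical, so that the divisions below are well defined.
    """
    scores = np.asarray(scores, dtype=float)
    benefit = np.asarray(benefit, dtype=bool)
    n_criteria = scores.shape[1]
    if weights is None:
        weights = np.full(n_criteria, 1.0 / n_criteria)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()

    # Vector-normalize each criterion column, then apply the weights.
    weighted = scores / np.linalg.norm(scores, axis=0) * weights

    # The ideal solution takes the best value of every criterion,
    # the negative-ideal solution the worst.
    ideal = np.where(benefit, weighted.max(axis=0), weighted.min(axis=0))
    negative_ideal = np.where(benefit, weighted.min(axis=0), weighted.max(axis=0))

    dist_ideal = np.linalg.norm(weighted - ideal, axis=1)
    dist_negative = np.linalg.norm(weighted - negative_ideal, axis=1)
    return dist_negative / (dist_ideal + dist_negative)


# Hypothetical scores: column 0 is a benefit criterion, column 1 a cost criterion.
example_scores = [[0.9, 10.0],
                  [0.5, 20.0],
                  [0.1, 30.0]]
closeness = topsis(example_scores, benefit=[True, False])
print(closeness)
```

An alternative that is best on every criterion coincides with the ideal solution and receives a closeness of 1, while one that is worst on every criterion receives 0; all other alternatives fall in between.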
The TOPSIS ranking results in Table 5.2 also support this observation and indicate that the proposed Texture-Net could generate higher-quality GVI images than the other state-of-the-art IR-GVI methods.

Table 5.1: Evaluation results of different image translation methods using different NR-IQA criteria.

             CycleGAN        UNIT            StarGAN         Texture-Net (ours)
BIQAA        0.875±0.091     0.624±0.277     0.003±0.001     0.822±0.164
BLIINDS-II   19.233±2.996    25.333±2.039    31.800±4.276    17.400±4.789
NIQE         4.919±0.157     6.660±0.261     7.836±0.548     4.467±0.104
BRISQUE      32.522±3.415    38.506±3.425    56.483±2.528    31.351±1.557

Table 5.2: Ranking results of different IR-GVI methods by TOPSIS based on the different NR-IQA criteria.

               CycleGAN   UNIT   StarGAN   Texture-Net
TOPSIS Score   0.935      0      0.627     0.953

Object Detection Evaluation

For the quantitative objective evaluation, the object detection based evaluation protocol introduced in Section 5.2.3 was applied. Figure 5.6 and Table 5.3 show the precision-recall curves and AP scores of the object detector on the translated GVI images generated by the different IR-GVI methods, respectively.

Table 5.3: AP scores of the object detector on the GVI images generated by different IR-GVI methods.

         CycleGAN      UNIT   StarGAN   Texture-Net
AP (%)   7.62×10^-3    0.37   28.48     91.70

The results clearly show that there was a large margin between the different IR-GVI methods, and the Texture-Net achieved the best AP score of 91.70%, which is 63.22 percentage points above the second-ranked method, StarGAN. These results demonstrate that the Texture-Net is capable of adding semantic texture information and object shape information simultaneously to the original thermal images.
Even though the GVI images translated by the StarGAN lacked texture information, the blurred shape information could still help the VI object detector to accomplish detection. However, the incorrect mapping problems in CycleGAN and UNIT made the VI detector fail completely, as indicated by a nearly zero AP score.

Figure 5.6: The precision-recall curve of the object detector on different synthesized images.

Table 5.4: Evaluation of different image colorization methods.

             Iizuka et al.   Larsson et al.   Zhang et al.    Federico et al.
RankIQA      0.6480±0.333    0.202±0.306      0.551±0.403     0.139±0.459
NIMA         4.726±1.542     4.589±1.528      4.702±1.517     4.819±1.557
BIQAA        0.003±0.001     0.003±0.001      0.003±0.001     0.004±0.001
BLIINDS-II   2.700±3.583     3.000±3.735      1.967±2.658     3.750±4.264
NIQE         5.534±1.782     5.545±1.759      5.538±1.799     5.899±1.916
BRISQUE      24.972±3.551    23.602±3.684     24.726±3.552    24.452±2.957

Figure 5.7: Samples of colorized images by different GVI-CVI methods. (a) Nighttime IR images, (b) synthetic GVI images generated by the Texture-Net, (c) colorized CVI images by Zhang et al. [98], (d) colorized CVI images by Iizuka et al. [99], (e) colorized CVI images by Larsson et al. [100], and (f) colorized CVI images by Federico et al. [101].

5.3.4 GVI-CVI Results

In this section, the state-of-the-art GVI-CVI methods were applied to the synthetic GVI images generated by the Texture-Net module and evaluated in both a subjective and an objective way.

Subjective Comparisons

As can be seen in Figure 5.7, four synthetic images generated by the Texture-Net were sampled randomly and colorized by four different GVI-CVI methods. The left two columns are the nighttime IR images and the synthetic GVI images generated by the Texture-Net, respectively. The four columns on the right are the CVI images colorized by Zhang et al. [98], Iizuka et al. [99], Larsson et al. [100], and Federico et al. [101], respectively. As can be seen, the images colorized by Zhang et al. had richer and more distinct colors than those colorized by the other methods.
However, this method also has a significant drawback: it produced some clearly incorrect colored blobs. For example, some parts of the ground were colorized blue. The results generated by the method of Iizuka et al. had a relatively even color distribution, but the objects in these images did not stand out, e.g., the vegetation was dark rather than green. The method of Larsson et al. seems to simply put a green color layer on top of the GVI images. The method of Federico et al. sometimes worked well, e.g., on the top image, but at other times seemed to scribble colors on the GVI images. Visually, no single GVI-CVI method was perfect; thus, it is necessary to conduct an objective evaluation.

Non-reference Image Quality Evaluation

To conduct the objective evaluation, 30 images generated by the Texture-Net were sampled randomly and colorized. Different NR-IQA methods, including both NSS-based and learning-based methods, were performed on the colorized images. The assessment results are listed in Table 5.4. It is hard to tell from these results which GVI-CVI method is the best, so an MCDM method (TOPSIS) is needed to make the decision. After the TOPSIS ranking, Table 5.5 shows that the method of Zhang et al. [98] was better than the other GVI-CVI methods.

Table 5.5: TOPSIS ranking results of different GVI-CVI methods based on the different NR-IQA criteria.

               Iizuka et al.   Larsson et al.   Zhang et al.   Federico et al.
TOPSIS Score   0.745           0.253            0.773          0.232

5.4 Summary

In this chapter, a two-step IR2VI method for enhanced environmental perception at night is proposed, which includes the IR-GVI and GVI-CVI steps. In the IR-GVI step, a Texture-Net is presented which is able to translate the IR images to GVI images in an unsupervised manner. Thanks to the proposed structure connection module, Texture-Net is able to overcome the incorrect mapping problem which is commonly faced by the state-of-the-art IR-GVI methods.
Moreover, the proposed ROI focal loss enables Texture-Net to generate synthetic GVI images with fine details. In the GVI-CVI step, a comprehensive investigation of GVI colorization methods is conducted, and the best-performing method among those compared (Zhang et al. [98]) is incorporated into the IR2VI framework. The results demonstrate the effectiveness of the proposed method.

Chapter 6

Conclusions

In this thesis, deep learning based approaches are proposed to address the challenges that exist in modern Situation Awareness (SA) systems due to target scale variations, complex backgrounds, and poor illumination conditions. The main contributions can be summarized as follows:

• To tackle the challenges that arise from target scale variations and complex backgrounds in complex environments, a novel Deep Multi-Modal Image Fusion (DMIF) approach is proposed, which can learn how to extract deep features in an unsupervised manner and fuse synergistic information from multi-modal images. In this approach, the image fusion, region proposal, and region-wise classification modules are integrated into an end-to-end Automatic Target Detection (ATD) framework. The proposed method achieves 99.73% Average Precision (AP) on the Military Sensing Information Analysis Center (SENSIAC) dataset with competitive run-time performance, outperforming the baseline and other state-of-the-art methods.

• To overcome the inability of visible cameras to operate at night without artificial lighting, an IR2VI framework is proposed, which can translate nighttime thermal infrared images to daytime colorful visible images. To the best of the author's knowledge, this is the first framework to enhance night vision by using an unsupervised neural network translation technique. In addition, the weaknesses of the state-of-the-art unsupervised image translation methods are examined and some improvements (e.g., structure connection and focal Region of Interest (ROI) loss) are proposed.
Extensive experiments, including subjective and objective evaluations, are carried out and verify the effectiveness of the system.

To make the proposed methods more robust and general, there are many possible future extensions. For the DMIF, the framework has achieved near-ceiling performance (99.73% AP) on the SENSIAC dataset. Currently, the DMIF method has not been evaluated on other scenarios, due to the lack of publicly available datasets that meet the requirements. If more datasets become public, the proposed DMIF will be examined on other scenarios. Another possible extension would be to try different feature-level fusion strategies within CNNs. For example, the different image modalities could be fed into CNNs individually, and the features then fused at a specific layer. For the IR2VI, one important possible extension would be to evaluate whether fusing the translated visible image with the infrared image can boost performance. In addition, since the SENSIAC dataset used in this thesis only has grayscale visible images, a two-step framework is proposed to translate thermal images to colorful visible images. Thus, when related datasets with colorful visible images become available in the future, verifying whether the image translation can be conducted in one shot would be another possible extension.

Bibliography

[1] M. R. Endsley and D. J. Garland, Situation awareness analysis and measurement, 1st ed. Boca Raton, FL, USA: CRC Press, 2000.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[3] A. C. Muller and S. Narayanan, “Cognitively-engineered multisensor image fusion for military applications,” Information Fusion, vol. 10, no. 2, pp. 137–149, Apr. 2009.
[4] J. A.
Ratches, “Review of current aided/automatic target acquisition technology for military target acquisition tasks,” Optical Engineering, vol. 50, no. 7, pp. 072 001–072 001, 2011.
[5] Z. Wenda, L. Huimin, and W. Dong, “Multisensor image fusion and enhancement in spectral total variation domain,” IEEE Transactions on Multimedia, vol. PP, no. 99, pp. 1–1, 2017, in press.
[6] Y. Liu, S. Liu, and Z. Wang, “A general framework for image fusion based on multi-scale transform and sparse representation,” Information Fusion, vol. 24, pp. 147–164, Jul. 2015.
[7] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang, “Image fusion with convolutional sparse representation,” IEEE Signal Processing Letters, vol. 23, no. 12, pp. 1882–1886, Dec. 2016.
[8] H. Hai-Miao, W. Jiawei, L. Bo, G. Qiang, and Z. Jin, “An adaptive fusion algorithm for visible and infrared videos based on entropy and the cumulative distribution of gray levels,” IEEE Transactions on Multimedia, vol. 19, no. 12, pp. 2706–2719, Dec. 2017.
[9] G. Jie, M. Zhenjiang, and Z. Xiao-Ping, “Efficient heuristic methods for multimodal fusion and concept fusion in video concept detection,” IEEE Transactions on Multimedia, vol. 17, no. 4, pp. 498–511, Apr. 2015.
[10] G. Piella, “A general framework for multiresolution image fusion: from pixels to regions,” Information Fusion, vol. 4, no. 4, pp. 259–280, 2003.
[11] X. Jin, Q. Jiang, S. Yao, D. Zhou, R. Nie, J. Hai, and K. He, “A survey of infrared and visual image fusion methods,” Infrared Physics & Technology, vol. 85, pp. 478–501, 2017.
[12] H. Li, L. Liu, W. Huang, and C. Yue, “An improved fusion algorithm for infrared and visible images based on multi-scale transform,” Infrared Physics & Technology, vol. 74, pp. 28–37, 2016.
[13] A. Aran, S. Munshi, V. K. Beri, and A. K. Gupta, “Spectral band invariant wavelet-modified mach filter,” Optics and Lasers in Engineering, vol. 46, no. 9, pp.
656–665, 2008.
[14] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[15] E. Blasch, C. Yang, and I. Kadar, “Summary of tracking and identification methods,” in SPIE Defense+ Security. Baltimore, MD, USA: International Society for Optics and Photonics, Jun. 2014, pp. 909 104–909 104.
[16] E. Gundogdu, H. Ozkan, H. S. Demir, H. Ergezer, E. Akagündüz, and S. K. Pakin, “Comparison of infrared and visible imagery for object tracking: Toward trackers with superior IR performance,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops, Boston, MA, USA, June 7-12, 2015, 2015, pp. 1–9.
[17] H. S. Demir and A. E. Çetin, “Co-difference based object tracking algorithm for infrared videos,” in 2016 IEEE International Conference on Image Processing, ICIP 2016, Phoenix, AZ, USA, September 25-28, 2016, 2016, pp. 434–438.
[18] J. Gong, G. Fan, J. P. Havlicek, N. Fan, and D. Chen, “Infrared target tracking, recognition and segmentation using shape-aware level set,” in IEEE International Conference on Image Processing, ICIP 2013, Melbourne, Australia, September 15-18, 2013, 2013, pp. 3283–3287.
[19] J. Gong, G. Fan, L. Yu, J. P. Havlicek, D. Chen, and N. Fan, “Joint view-identity manifold for infrared target tracking and recognition,” Computer Vision Image Understanding, vol. 118, no. Supplement C, pp. 211–224, Jan. 2014.
[20] L. Yu, G. Fan, J. Gong, and J. P. Havlicek, “Joint infrared target recognition and segmentation using a shape manifold-aware level set,” Sensors, vol. 15, no. 5, pp. 10 118–10 145, Apr. 2015.
[21] B. Millikan, A. Dutta, Q. Sun, and H. Foroosh, “Fast detection of compressively sensed IR targets using stochastically trained least squares and compressed quadratic correlation filters,” IEEE Transactions on Aerospace and Electronic Systems, vol. 53, no. 5, pp. 2449–2461, Oct. 2017.
[22] Z. Liu, E. Blasch, G. Bhatnagar, V. John, W.
Wu, and R. S. Blum, “Fusing synergistic information from multi-sensor images: An overview from implementation to performance assessment,” Information Fusion, vol. 42, pp. 127–145, Jul. 2018.
[23] R. S. Blum and Z. Liu, Multi-sensor image fusion and its applications, 1st ed., ser. Signal Processing and Communications. Hoboken, NJ, USA: Taylor and Francis, 2005.
[24] Y. Zheng and E. Blasch, “Multispectral image fusion for vehicle identification and threat analysis,” in Sensing and Analysis Technologies for Biomedical and Cognitive Applications 2016, vol. 9871. Baltimore, MD, USA: International Society for Optics and Photonics, Apr. 2016, p. 98710G.
[25] B. A. Olshausen et al., “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, Jun. 1996.
[26] M. Beaulieu, S. Foucher, and L. Gagnon, “Multi-spectral image resolution refinement using stationary wavelet transform,” in Geoscience and Remote Sensing Symposium, 2003. IGARSS’03. Proceedings. 2003 IEEE International, vol. 6. Toulouse, France: IEEE, Jul. 2003, pp. 4032–4034.
[27] A. Loza, D. Bull, N. Canagarajah, and A. Achim, “Non-gaussian model-based fusion of noisy images in the wavelet domain,” Computer Vision Image Understanding, vol. 114, no. 1, pp. 54–65, Jan. 2010.
[28] B. Gaurav, W. Q. M. Jonathan, and L. Zheng, “Directive contrast based multimodal medical image fusion in NSCT domain,” IEEE Transactions on Multimedia, vol. 15, no. 5, pp. 1014–1024, Aug. 2013.
[29] K. K. Sharma and M. Sharma, “Image fusion based on image decomposition using self-fractional Fourier functions,” Signal, Image and Video Processing, vol. 8, no. 7, pp. 1335–1344, Oct. 2014.
[30] S. Li, X. Kang, and J. Hu, “Image fusion with guided filtering,” IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2864–2875, Jul. 2013.
[31] J. Hu and S.
Li, “The multiscale directional bilateral filter and its application to multisensor image fusion,” Information Fusion, vol. 13, no. 3, pp. 196–206, Jul. 2012.
[32] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, USA, 2012, pp. 1106–1114.
[33] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, 2014, pp. 580–587.
[34] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015, pp. 3431–3440.
[35] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, Jun. 2014.
[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 770–778.
[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015, pp. 1–9.
[38] Y. Liu, X. Chen, H. Peng, and Z. Wang, “Multi-focus image fusion with a deep convolutional neural network,” Information Fusion, vol. 36, pp. 191–207, Jul. 2017.
[39] J. Zhong, B. Yang, G. Huang, F. Zhong, and Z.
Chen, “Remote sensing image fusion with convolutional neural network,” Sensing and Imaging, vol. 17, no. 1, p. 10, Dec. 2016.
[40] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, 2010, pp. 807–814.
[41] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, “End-to-end text recognition with convolutional neural networks,” in Proceedings of the 21st International Conference on Pattern Recognition, ICPR 2012, Tsukuba, Japan, November 11-15, 2012, 2012, pp. 3304–3308.
[42] Y. Boureau, J. Ponce, and Y. LeCun, “A theoretical analysis of feature pooling in visual recognition,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, 2010, pp. 111–118.
[43] N. M. Nasrabadi, “Pattern recognition and machine learning,” Journal of Electronic Imaging, vol. 16, no. 4, p. 049901, 2007.
[44] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 2672–2680.
[45] Y. Sheikh and M. Shah, “Bayesian object detection in dynamic scenes,” pp. 74–79, 2005.
[46] J. Guo, C. Hsia, Y. Liu, M. Shih, C. Chang, and J. Wu, “Fast background subtraction based on a multilayer codebook model for moving object detection,” IEEE Transactions on Circuits Systems for Video Technology, vol. 23, no. 10, pp. 1809–1821, 2013.
[47] B. Chen and S. Huang, “Probabilistic neural networks based moving vehicles extraction algorithm for intelligent traffic surveillance systems,” Information Sciences, vol. 299, pp. 283–295, 2015.
[48] K. Yun, J. Lim, and J. Y.
Choi, “Scene conditional background update for moving object detection in a moving camera,” Pattern Recognition Letters, vol. 88, pp. 57–63, 2017.
[49] W. Hu, C. Chen, T. Chen, D. Huang, and Z. Wu, “Moving object detection and tracking from video captured by moving camera,” Journal of Visual Communication and Image Representation, vol. 30, pp. 164–180, 2015.
[50] F. J. López-Rubio and E. López-Rubio, “Foreground detection for moving cameras with stochastic approximation,” Pattern Recognition Letters, vol. 68, pp. 161–168, 2015.
[51] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[52] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, 2005, pp. 886–893.
[53] P. A. Viola and M. J. Jones, “Rapid object detection using a boosted cascade of simple features,” vol. 1, pp. 511–518, 2001.
[54] J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
[55] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[56] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
[57] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[58] M. Everingham, L. Van Gool, C. K. I Williams, J. Winn, A. Zisserman, M. Everingham, L. K.
Van Gool Leuven, B. CKI Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[59] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, 2014, pp. 740–755.
[60] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” CoRR, vol. abs/1312.6229, Jun. 2013.
[61] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
[62] R. B. Girshick, “Fast R-CNN,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 1440–1448.
[63] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[64] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 91–99.
[65] Y. Niu, S. Xu, L. Wu, and W. Hu, “Airborne infrared and visible image fusion for target perception based on target region segmentation and discrete wavelet transform,” Mathematical Problems in Engineering, vol. 2012, no. 275138, 2012.
[66] J. Han and B.
Bhanu, “Fusion of color and infrared video for moving human detection,” Pattern Recognition, vol. 40, no. 6, pp. 1771–1784, 2007.
[67] M. A. Smeelen, P. B. W. Schwering, A. Toet, and M. Loog, “Semi-hidden target recognition in gated viewer images fused with thermal IR images,” Information Fusion, vol. 18, pp. 131–147, 2014.
[68] B. K. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, no. 1-3, pp. 185–203, 1981.
[69] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.
[70] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” arXiv preprint arXiv:1405.3531, pp. 1–11, 2014.
[71] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[72] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia, MM ’14, Orlando, FL, USA, November 03 - 07, 2014, 2014, pp. 675–678.
[73] S. Liu and Z. Liu, “Multi-channel CNN-based object detection for enhanced situation awareness,” in Sensors and Electronics Technology (SET) panel Symposium SET-241 on 9th NATO Military Sensing Symposium, Quebec City, QC, Canada, Jun. 2017.
[74] R. C. Gonzalez, R. E. Woods et al., “Digital image processing,” 1992.
[75] J. Wang, J. Peng, X. Feng, G. He, and J. Fan, “Fusion method for infrared and visible images by using non-negative sparse representation,” Infrared Physics & Technology, vol. 67, pp. 477–489, Nov. 2014.
[76] F. Yu and V.
Koltun, “Multi-scale context aggregation by dilated convolutions,” CoRR, vol. abs/1511.07122, Jun. 2015.
[77] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, “Speed/accuracy trade-offs for modern convolutional object detectors,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 3296–3297.
[78] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” CoRR, vol. abs/1603.04467, 2016.
[79] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg, “SSD: single shot multibox detector,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I. Amsterdam, The Netherlands: Springer, Oct. 2016, pp. 21–37.
[80] D. J. Schroeder, “Chapter 17 - detectors, signal-to-noise, and detection limits,” in Astronomical Optics (Second Edition), 2nd ed., D. J. Schroeder, Ed. San Diego, CA, USA: Academic Press, 2000, pp. 425–443.
[81] T. Chijiiwa, T. Ishibashi, and H. Inomata, “Histological study of choroidal melanocytes in animals with tapetum lucidum cellulosum,” Graefe’s Archive for Clinical and Experimental Ophthalmology, vol. 228, no. 2, pp. 161–168, 1990.
[82] J. M.
Sullivan, “Assessing the potential benefit of adaptive headlighting using crash databases,” University of Michigan, Ann Arbor, Transportation Research Institute, 1999.
[83] G. Bhatnagar and Z. Liu, “A novel image fusion framework for night-vision navigation and surveillance,” Signal, Image and Video Processing, vol. 9, no. 1, pp. 165–175, 2015.
[84] A. Ulhaq, X. Yin, J. He, and Y. Zhang, “FACE: Fully automated context enhancement for night-time video sequences,” Journal of Visual Communication and Image Representation, vol. 40, pp. 682–693, 2016.
[85] Z. Zhou, M. Dong, X. Xie, and Z. Gao, “Fusion of infrared and visible images for night-vision context enhancement,” Applied Optics, vol. 55, no. 23, pp. 6480–6490, 2016.
[86] C. H. Son and X. P. Zhang, “Near-infrared fusion via color regularization for haze and color distortion removals,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2017.
[87] Z. Liu, E. Blasch, Z. Xue, J. Zhao, R. Laganiere, and W. Wu, “Objective assessment of multiresolution image fusion algorithms for context enhancement in night vision: a comparative study,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pp. 94–109, 2012.
[88] M. Jeong, B. C. Ko, and J. Y. Nam, “Early detection of sudden pedestrian crossing for safe driving during summer nights,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 6, pp. 1368–1380, Jun. 2017.
[89] S. O. Dumoulin, S. C. Dakin, and R. F. Hess, “Sparsely distributed contours dominate extra-striate responses to complex scenes,” Neuroimage, vol. 42, no. 2, pp. 890–901, 2008.
[90] G. Paschos, “Perceptually uniform color spaces for color texture analysis: an empirical evaluation,” IEEE Transactions on Image Processing, vol. 10, no. 6, pp. 932–937, Jun. 2001.
[91] B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan, and A.
Yamada, “Color and texture descriptors,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 703–715, 2001.
[92] M. Limmer and H. P. A. Lensch, “Infrared colorization using deep convolutional neural networks,” in 15th IEEE International Conference on Machine Learning and Applications, ICMLA 2016, Anaheim, CA, USA, December 18-20, 2016, 2016, pp. 61–68.
[93] P. L. Suárez, A. D. Sappa, and B. X. Vintimilla, “Learning to colorize infrared images,” in Trends in Cyber-Physical Multi-Agent Systems. The PAAMS Collection - 15th International Conference, PAAMS 2017, Porto, Portugal, June 21-23, 2017, Special Sessions. Springer, 2017, pp. 164–172.
[94] P. L. Suarez, A. D. Sappa, and B. X. Vintimilla, “Infrared image colorization based on a triplet DCGAN architecture,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 212–217.
[95] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” arXiv preprint arXiv:1711.09020, 2017.
[96] M. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 700–708.
[97] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 2242–2251.
[98] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III. Springer, 2016, pp. 649–666.
[99] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification,” ACM Transactions on Graphics (TOG), vol. 35, no. 4, p. 110, 2016.
[100] G. Larsson, M. Maire, and G. Shakhnarovich, “Learning representations for automatic colorization,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV. Springer, 2016, pp. 577–593.
[101] B. Federico, M. Diego-Gonzalez, and R.-G. Lucas, “Deep-koalarization: Image colorization using CNNs and Inception-ResNet-v2,” arXiv preprint arXiv:1712.03400, Dec. 2017.
[102] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II. Springer, 2016, pp. 694–711.
[103] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional GANs,” arXiv preprint arXiv:1711.11585, 2017.
[104] P. Isola, J. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, Jul. 2017, pp. 5967–5976.
[105] Q. Chen and V. Koltun, “Photographic image synthesis with cascaded refinement networks,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 1520–1529.
[106] A. Y. Chia, S. Zhuo, R. K. Gupta, Y. Tai, S. Cho, P. Tan, and S. Lin, “Semantic colorization with internet images,” vol. 30, no. 6, p. 156, 2011.
[107] R. Ironi, D. Cohen-Or, and D.
Lischinski, “Colorization by example,” in Proceedings of the Eurographics Symposium on Rendering Techniques, Konstanz, Germany, June 29 - July 1, 2005, 2005, pp. 201–210.
[108] T. Welsh, M. Ashikhmin, and K. Mueller, “Transferring color to greyscale images,” vol. 21, no. 3, pp. 277–280, 2002.
[109] A. Levin, D. Lischinski, and Y. Weiss, “Colorization using optimization,” vol. 23, no. 3, pp. 689–694, 2004.
[110] Y. Huang, Y. Tung, J. Chen, S. Wang, and J. Wu, “An adaptive edge detection based colorization algorithm and its applications,” in Proceedings of the 13th ACM International Conference on Multimedia, Singapore, November 6-11, 2005, 2005, pp. 351–354.
[111] H. Chang, O. Fried, Y. Liu, S. DiVerdi, and A. Finkelstein, “Palette-based photo recoloring,” ACM Transactions on Graphics (TOG), vol. 34, no. 4, p. 139, 2015.
[112] R. Zhang, J.-Y. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. Efros, “Real-time user-guided image colorization with learned deep priors,” arXiv preprint arXiv:1705.02999, 2017.
[113] B. Sheng, H. Sun, M. Magnor, and P. Li, “Video colorization using parallel optimization in feature space,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 3, pp. 407–417, Mar. 2014.
[114] Z. Cheng, Q. Yang, and B. Sheng, “Deep colorization,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 415–423.
[115] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., 2017, pp. 4278–4284.
[116] B. Zhou, À. Lapedriza, J. Xiao, A. Torralba, and A.
Oliva, “Learning deep features for scene recognition using places database,” in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 487–495.
[117] S. Gabarda and G. Cristóbal, “Blind image quality assessment through anisotropy,” Journal of the Optical Society of America A, vol. 24, no. 12, pp. B42–B51, 2007.
[118] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, 2012.
[119] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012.
[120] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a completely blind image quality analyzer,” IEEE Signal Processing Letters, vol. 20, no. 3, pp. 209–212, 2013.
[121] H. Talebi and P. Milanfar, “NIMA: Neural image assessment,” IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3998–4011, 2018.
[122] X. Liu, J. van de Weijer, and A. D. Bagdanov, “RankIQA: Learning from rankings for no-reference image quality assessment,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 1040–1049.
[123] S. Opricovic and G. Tzeng, “Compromise solution by MCDM methods: A comparative analysis of VIKOR and TOPSIS,” European Journal of Operational Research, vol. 156, no. 2, pp. 445–455, 2004.
[124] “Military sensing information analysis center (SENSIAC),” 2008, online; accessed 01-November-2017. [Online]. Available: https://www.sensiac.org/external/products/list databases/
[125] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A.
Lerer, “Automatic differentiation in PyTorch,” 2017.
[126] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.

