Deep Learning with Limited Labeled Image Data for Health Informatics

by

Jiannan Zheng

B.Sc., Xi'an Jiaotong University, 2010
M.Sc., Concordia University, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

April 2018

© Jiannan Zheng, 2018

Abstract

Deep learning is a data-driven technique for developing intelligent systems using a large amount of training data. Amongst the deep learning applications, this thesis focuses on problems in health informatics. Compared to general deep learning applications, health informatics problems are complex, unique and pose problem-specific challenges. Many of these problems, however, face a common challenge: the lack of labeled data. In this thesis, we explore the following three ways to overcome three specific image-based health informatics problems:

1) The use of image patches instead of whole images as the input for deep learning. To increase the data size, each image is partitioned into non-overlapping, mid-level patches. This approach is illustrated by addressing the food image recognition problem. Automatic food recognition could be used for nutrition analysis. We propose a novel deep framework for mid-level food image patches. Evaluations on 3 benchmark datasets demonstrate that the proposed approach achieves superior performance over baseline convolutional neural networks (CNN) methods.

2) The use of prior knowledge to reduce the high dimensionality and complexity of raw data: We illustrate this idea on magnetic resonance imaging (MRI) images, for diagnosing a common mental-health disorder, Attention Deficit Hyperactivity Disorder (ADHD). MRI has been increasingly used in analyzing ADHD with machine learning algorithms. We propose a multi-channel 3D CNN based automatic ADHD diagnosis approach using MRI scans. Evaluations on the ADHD-200 Competition dataset show that the proposed approach achieves state-of-the-art accuracy.

3) The use of synthetic data pre-training along with real data domain adaptation to increase the available labeled data during training: We illustrate this idea on 2-D/3-D image registration problems. We propose a fully automatic and real-time CNN-based 2-D/3-D image registration system. Evaluations on Transesophageal Echocardiography (TEE) X-ray images from clinical studies demonstrate that the proposed system outperforms existing methods in accuracy and speed. We further propose a pairwise domain adaptation module (PDA module), designed to be flexible for different deep learning-based 2-D/3-D registration frameworks with improved performance. Evaluations on two clinical applications demonstrate the PDA module's advantages for 2-D/3-D medical image registration with limited data.

Lay Summary

Deep learning is a data-driven technique for developing learning systems using a large amount of training data. In this thesis, we explore three ways to overcome the common limited data challenge of using deep learning in health informatics problems: 1) the use of image patches instead of whole images as the input for deep learning; 2) the use of prior knowledge of the problem to reduce the high dimensionality and complexity of raw data; and 3) the use of synthetic data pre-training along with real data domain adaptation in training. We illustrated the above three ideas in three specific health informatics problems: 1) 2-D deformable image recognition; 2) 3-D image-based diagnosis; and 3) 2-D/3-D image registration. Our results showed that the proposed methods outperform conventional deep learning methods for the health informatics problems studied in this thesis.
Preface

This dissertation is written based on a collection of manuscripts. The majority of the research, including literature study, algorithm development and implementation, numerical studies and report writing, was conducted by the candidate, with suggestions from Prof. Z. Jane Wang. The manuscripts were primarily drafted by the candidate, with helpful revisions and comments from Prof. Z. Jane Wang (papers in Chapters 2-5), Dr. Rui Liao (papers in Chapters 4-5), Dr. Shun Miao (papers in Chapters 4-5) and Dr. Liang Zou (paper in Chapter 3).

Chapter 2 is based on the following manuscript:

• J. Zheng, L. Zou and Z. J. Wang, "Mid-Level Deep Food Part Mining for Food Image Recognition," IET Computer Vision, 2017.

Chapter 3 is based on the following manuscripts:

• L. Zou, J. Zheng, C. Miao, M. McKeown and Z. J. Wang, "3D CNN based automatic diagnosis of attention deficit hyperactivity disorder using functional and structural MRI," IEEE Access, vol. 5, pp. 23626-23636, 2017.

• L. Zou, J. Zheng and M. McKeown, "Deep Learning based Automatic Diagnosis of Attention Deficit Hyperactivity Disorder," IEEE Global Conference on Signal and Information Processing (GlobalSIP), Montreal, Canada, 2017.

Chapter 4 is based on the following manuscript:

• J. Zheng, S. Miao and R. Liao, "Learning CNNs with Pairwise Domain Adaption for Real-Time 6DoF Ultrasound Transducer Detection and Tracking from X-Ray Images," International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 645-654, Quebec, Canada, 2017.

And finally, Chapter 5 is based on the following manuscript:

• J. Zheng, S. Miao, Z. J. Wang and R. Liao, "Pairwise Domain Adaptation Module for CNN-based 2D/3D Registration," SPIE Journal of Medical Imaging, vol. 5, no. 2, 2018.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
1.1 Motivation
1.1.1 Introduction to Deep Learning
1.1.2 Training of Deep Learning Models
1.1.3 Health Informatics and Our Motivation
1.2 Problems Investigated
1.2.1 Deep Learning for Food Image Recognition
1.2.2 Deep Learning for Attention Deficit Hyperactivity Disorder Diagnosis
1.2.3 Deep Learning for 2-D/3-D Medical Imaging Registration
1.3 Related Works
1.3.1 Food Image Recognition
1.3.2 ADHD Diagnosis with FMRI and SMRI
1.3.3 2-D/3-D Medical Imaging Registration
1.4 Thesis Objective, Outline and Main Contributions
1.4.1 Objective
1.4.2 Outline and Main Contributions
2 Combining Mid-Level Image Partitioning with Deep Learning: Food Image Recognition
2.1 Motivation and Objectives
2.2 Methods
2.2.1 Related Methods
2.2.2 Mid-Level Deep Food Part Framework with DCNN Features
2.2.3 Mid-Level Deep Food Part Label Mining Scheme
2.3 Experiments and Results
2.3.1 Food-101
2.3.2 UEC Food 100 & 256
2.3.3 Performance Analysis
2.4 Discussion
2.5 Conclusion
3 Encoding Prior Knowledge via Feature Extraction: MRI based ADHD Diagnosis
3.1 Motivation and Objectives
3.2 Experiment Dataset
3.3 Proposed Method
3.3.1 Low-Level Feature Extraction based on Prior Knowledge
3.3.2 3D Convolutional Neural Networks
3.3.3 Single Modality 3D CNN Architecture
3.3.4 Multi-Modality 3D CNN Architecture
3.3.5 Training of the 3D CNN Architecture
3.4 Experiments and Results
3.4.1 Experiment Setup
3.4.2 Comparison of Single and Multi-modality Architectures
3.4.3 Comparison with Existing Methods
3.5 Discussion
3.6 Conclusion
4 Learning CNNs with Synthetic Data Pre-Training and Pairwise Domain Adaptation: Real-Time 6DOF TEE Transducer Registration
4.1 Motivation and Objectives
4.2 Methods
4.2.1 Problem Formulation: 2-D/3-D Registration
4.2.2 System Overview
4.2.3 Hierarchical CNN Regression Architecture
4.2.4 Pairwise Domain Adaptation
4.2.5 A CNN Classifier to Resolve Pose Ambiguity
4.3 Experiments and Results
4.4 Conclusion
5 Learning CNNs with Universal Pairwise Domain Adaptation Module: 2-D/3-D Medical Imaging Registration via CNN and DRL
5.1 Motivation and Objectives
5.2 Methods
5.2.1 Problem Statement
5.2.2 Pairwise Domain Adaptation Module
5.3 Experiments and Results
5.3.1 Experiment Setup
5.3.2 Performance Analysis
5.3.3 Evaluation of the Proposed Methods
5.4 Conclusion
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
6.2.1 Compact CNN model for Image Recognition
6.2.2 Extension of Deep Learning-based MRI Data Analysis
6.2.3 Multi-task Learning for 2-D/3-D Registration
6.2.4 GAN-based Artifact Removal for Spine Vertebral Registration
Bibliography

List of Tables

Table 2.1 Recognition accuracy evaluation of the 3 part-level label mining strategies described in Section 2.2.
Table 2.2 Recognition accuracy results on the dataset Food-101.
Table 2.3 Recognition accuracy results on the datasets UEC Food 100 & 256.
Table 2.4 Selection of pooling strategies.
Table 3.1 Some details of the datasets utilized in this chapter.
Table 3.2 Diagnosis performance comparisons between the proposed method and state-of-the-art methods based on the ADHD-200 dataset.
Table 3.3 The diagnosis performance on the hold-out testing data from different sites.
Table 4.1 Hierarchical CNN regression architecture.
Table 4.2 Quantitative results on the proposed hierarchical CNN regression system in detection (Det) and tracking (Trak) mode. Numbers in the table show (success rate, mean PTRE) under different PTRE error ranges.
Table 5.1 Performance (RMSE) comparison of the PDA+ module and fine-tuning with different numbers of training data.
Table 5.2 Quantitative results of the proposed PDA module on the problem of CNN regression-based 2-D/3-D registration for the TEE transducer.
Table 5.3 Quantitative results of the proposed PDA module on the problem of DRL-based 2-D/3-D registration for the spine vertebra.

List of Figures

Figure 1.1 Architecture of CNN (LeNet-5) [61].
Figure 1.2 Reinforcement learning scheme. In DRL, the agent is implemented as a deep learning model.
Figure 1.3 Sample images from the UEC Food 100 dataset [71].
Figure 1.4 Examples of 2-D/3-D medical image registration applications, with 2-D X-ray images on the left and 2-D/3-D overlay results on the right. (a) TEE transducer registration: the 3-D CAD model of the TEE transducer is overlaid in yellow, and the blue and red cones are the TEE projection target using the ground truth pose and the estimated pose of the TEE transducer. The TEE projection target is defined as the four corners of the TEE imaging cone at 60 mm depth. (b) Spine vertebra registration: the CT volume of the spine is overlaid in yellow on 2-D X-ray spine images.
Figure 1.5 A summary of this thesis: investigated approaches, applicable problems, challenges and solutions.
Figure 2.1 The flowchart of the proposed food part CNN (FP-CNN) framework with off-the-shelf DCNN features.
Figure 2.2 Examples of food part data from 4 food part clusters belonging to 2 food categories: sushi and meat sauce spaghetti. In the first row, two different styles of sushi are clustered into two different clusters. In the second row, spaghetti and meat sauce are clustered into corresponding clusters.
Figure 2.3 Examples of food images from 3 testing food image datasets: (a) UEC Food 100 and UEC Food 256 datasets, (b) Food-101 dataset.
Figure 2.4 Per-category result comparison between the baseline Inception-v3(ft) method and the proposed Inception-v3(ft)+FP-CNN(ft) method on the Food-101 dataset.
Figure 2.5 Per-category result comparison between the baseline Inception-v3(ft) method and the proposed Inception-v3(ft)+FP-CNN(ft) method on the UEC Food 256 dataset.
Figure 2.6 Performance vs. K of the proposed FP-CNN(ft) with AlexNet(ft) method on the UEC Food 100 database.
Figure 3.1 Illustration of the voxels within the whole brain. Each color represents a specific brain region defined by the Automated Anatomical Labeling (AAL) atlas.
Figure 3.2 A flowchart for ADHD classification based on FMRI and SMRI using 3D CNN.
Figure 3.3 Differences between the 2D convolution and the 3D convolution. (a) 2D convolution: $h_{1,1} = \sum_{x=1}^{3}\sum_{y=1}^{3} W_{x,y} V_{x,y} + b$; (b) 3D convolution: $h_{1,1,1} = \sum_{x=1}^{3}\sum_{y=1}^{3}\sum_{z=1}^{3} W_{x,y,z} V_{x,y,z} + b$, where $W$ is the weight of the kernel, $V$ is the feature map in the previous layer and $b$ is the bias term.
Figure 3.4 Architecture of the proposed 3D CNN for diagnosing ADHD. We utilize three types of 3D features across the whole brain as the inputs, including REHO, FALFF and VMHC. This architecture contains 6 layers, including four convolutional layers and two FC layers.
Figure 3.5 Statistical results of 3D CNN approaches corresponding to different features over 50 individual runs. First, we evaluate single modality approaches where FMRI features and SMRI features are utilized individually. We further test the performance of multi-modality approaches where FMRI and SMRI features are combined via the proposed multi-modality 3D CNN architecture. The red asterisks and lines represent the average and median values respectively. The edges of the box are the lower and upper quartiles.
Figure 4.1 (a) An illustration of TEE (https://www.elcaminohospital.org/library/transesophageal-echocardiogram). (b) An overlay example of the TEE transducer imaging cone on an X-ray image.
Figure 4.2 A correct pose estimation result and its flipped counterpart.
Figure 4.3 Problem formulation of 2-D/3-D registration.
Figure 4.4 Illustration of TEE 6DOF pose parameters [95].
Figure 4.5 Flowchart of the proposed 6DOF pose estimation system.
Figure 4.6 Flowchart of our CNN regression method. With a given X-ray image, a DRR image is rendered. Then 140×80 TEE transducer patches are extracted from the X-ray image and the DRR image and fed into the CNN regressor to estimate the pose offset δT.
Figure 4.7 Comparison of real clinical TEE transducer X-ray data (left) and synthetically generated DRR data (right).
Figure 4.8 The proposed pairwise domain adaptation scheme.
Figure 4.9 Qualitative comparison of the method in [95] and the proposed system. (a), (c): results from the method in [95]. (b), (d): results from the proposed system.
Figure 5.1 Comparison of real clinical spine vertebra X-ray data (left) and synthetically generated DRR data (right).
Figure 5.2 Comparison of problem frameworks of (a) CNN regression-based 2-D/3-D registration for the TEE transducer, (b) DRL-based 2-D/3-D registration for the spine vertebra.
Figure 5.3 (a) Illustration of the PDA module plugged into a pre-trained CNN model. (b) Illustration of the PDA module with the basic loss. (c) Illustration of the PDA module with the multilayer loss (PDA+ module).
Figure 5.4 Training feature distance $L_D$ and synthetic feature distance $S$, testing feature distance and testing loss of the proposed PDA+ approach on the TEE dataset over iterations.
Figure 5.5 Example results of TEE registration with the original CNN model (left) and with the PDA+ module (right).
Figure 5.6 Example results of spine vertebra registration with the original CNN model (left) and with the PDA+ module (right). The blue and red crosses are the target and estimated vertebra centers, respectively.

Glossary

6DOF: 6 Degrees of Freedom
AAL: Automated Anatomical Labeling
ADHD: Attention Deficit Hyperactivity Disorder
ALFF: Amplitude of Low Frequency Fluctuations
BN: Batch normalization
TDC: typically developing children
CAD: Computer Aided Design
CT: Computed Tomography
CNN: convolutional neural networks
CORAL: Deep Correlation Alignment
CPU: Central processing unit
CSF: cerebrospinal fluid
DAN: Deep Adaptation Network
DBN: Deep Belief Networks
DCNN: deep convolutional neural networks
DDC: Deep Domain Confusion
DNN: deep neural network
DPARSF: Data Processing Assistant for Resting State FMRI
DRL: Deep reinforcement learning
DRR: Digitally Reconstructed Radiographies
FALFF: fractional ALFF
FC: fully-connected
FCP: Functional Connectomes Project
FCN: fully convolutional network
FMRI: functional MRI
FPS: frames per second
FP-CNN: food part CNN
FWHM: Full Width at Half Maximum
GAN: generative adversarial networks
GM: gray matter
GMP: Gray Matter Probability
GPU: Graphics processing unit
HIPAA: Health Insurance Portability and Accountability Act
HOG: histogram of oriented gradients
IFV: improved Fisher Vector
KKI: Kennedy Krieger Institute
MDP: Markov Decision Process
MKL: multiple kernel learning
MMD: Maximum Mean Discrepancy
MNI: Montreal Neurological Institute
MRI: magnetic resonance imaging
MRMR: Maximum Relevance Minimum Redundancy
NYU: New York University Child Study Center
PBT: Probabilistic Boosting Tree
PCA: Principal Component Analysis
PDA module: pairwise domain adaptation module
PET: Positron Emission Tomography
PKU: Peking University
PTRE: projected target registration error
REHO: Regional Homogeneity
RELU: Rectified Linear Unit
RF: Random Forest
RFDC: Random Forest Discriminant Components
RMSE: Root Mean Square Error
RNN: recurrent neural networks
ROI: Region-of-Interest
SDAE: Stacked Denoising Autoencoders
SELC: supervised extreme learning committee
SFP: Superpixels based Food Part
SGD: Stochastic Gradient Descent
SHD: Structural Heart Disease
SIFT: Scale-Invariant Feature Transform
SMRI: structural MRI
SVM: Support Vector Machine
TEE: Transesophageal Echocardiography
TRE: Target Registration Error
UEC: University of Electro-Communications
VBM: voxel-based morphometry
VLAD: Vector of Locally Aggregated Descriptors
VMHC: Voxel-Mirrored Homotopic Connectivity
WM: white matter

Acknowledgments

I would like to express my special appreciation and thanks to my supervisor, Dr. Z. Jane Wang, for supervising me throughout my doctoral study. I would like to thank her for her guidance and support for my study and career development. I would also like to thank my committee members, Dr. Rabab Ward and Dr. Martin McKeown, for serving as my committee members and for their brilliant comments and suggestions. I would like to thank my mentors, lab-mates, and coauthors, Dr. Rui Liao, Dr. Shun Miao, Dr. Liang Zou, Dr. Chunsheng Zhu, Pegah Kharazmi, Shaohui Liu, Prof. Tim Lee, Prof. Xun Chen, Prof. Chen He, Dr. Aiping Liu and Dr. Zhenyu Guo and many others for the insightful discussions during my doctoral study.

A special thanks to my family. Words cannot express how grateful I am to my mother, my father, grandmothers and grandfathers, my girlfriend Judy and her family for all the sacrifices that they have made to support my PhD adventure. I would also like to thank all my friends who supported me to strive towards my goal.

Chapter 1
Introduction

1.1 Motivation

1.1.1 Introduction to Deep Learning

Deep learning, or deep hierarchical learning, is a group of machine learning methods that aim to learn feature representations from data in raw form. Traditional machine learning methods employ hand-crafted feature extractors which are designed from prior knowledge of the target data distribution. As opposed to these so-called "shallow" learning methods, deep learning is a data-driven approach, which means that the feature extractors are parameterized and can be tuned to better fit the data distribution via supervised training. At the early stage of deep learning research, it was difficult to train deep learning models with many layers and a large number of parameters, mainly due to 3 reasons: 1) large datasets cannot easily fit into memory for optimization by gradient descent techniques, and the problems are non-convex, which makes global optimization difficult; 2) the beginning layers of deeper networks are extremely hard to train due to the "vanishing gradient" problem; 3) even though datasets are growing in magnitude, it is still common that the number of model parameters is much larger than the dataset size, making the training of deep learning models prone to overfitting.

Figure 1.1: Architecture of CNN (LeNet-5) [61].

Recent advances overcame these 3 main challenges and boosted the development of deep learning. The first problem is addressed by Stochastic Gradient Descent (SGD) [9], where training data is randomly sampled and fed into deep learning models in small mini-batches to update the model with small steps toward the global minimum, avoiding the need for large-scale computing arrays and reducing the training cost of deep learning models. The "vanishing gradient" problem is resolved by the Rectified Linear Unit (RELU) as the activation function in deep learning architectures [79]. The RELU activation function constrains the gradients of each neuron to be either 0 or 1; therefore, multiplying accumulated gradients in deeper models will not cause the gradient to "vanish" (become 0) or "explode" (become infinite), which makes deep models more trainable. Overfitting is reduced by the dropout technique, which randomly sets activations to 0 in fully-connected (FC) layers to reduce the capacity of these layers, since FC layers consist of a large number of parameters [92].
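A minimal PyTorch-style sketch of these three ingredients together (illustrative only; the layer sizes, dataset and hyper-parameters below are placeholders, not models used in this thesis): a small CNN with RELU activations and dropout on its FC layer, updated by mini-batch SGD.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# A tiny CNN combining the three fixes discussed above: RELU activations
# (trainable deep stacks), dropout on the FC layer (reduced overfitting),
# and mini-batch SGD updates (scalable optimization).
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),                  # randomly zero FC activations during training
            nn.Linear(32 * 8 * 8, num_classes), # assumes 32x32 input images
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One SGD step on a randomly sampled mini-batch (fake 32x32 RGB images here).
images, labels = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()      # RELU passes gradients of 0 or 1 per unit, limiting vanishing gradients
optimizer.step()     # small step toward the minimum using only this mini-batch
```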
For high dimensional image data, the introduction of convolutional layers and convolutional neural networks (CNN) greatly reduces the number of parameters by "weight sharing", which means that, in the convolutional layers, the weights of all input neurons across the whole spatial region are shared and parameterized as image filters [61]. Convolutional layers also preserve spatial information, since the activation of each filter depends on neighboring inputs during filtering (Figure 1.1). CNN has been proven to be suitable for image-based applications such as image recognition, image segmentation, object detection and image restoration.

Figure 1.2: Reinforcement learning scheme. In DRL, the agent is implemented as a deep learning model.

Deep reinforcement learning (DRL), or action learning, was proposed to allow artificial agents (e.g. CNN models) to mimic human actions and solve complex tasks [76]. Instead of supervised training with target labels, DRL training is based on the rewards of each possible action at each state of the agent in the environment (Figure 1.2). The applications of DRL include gaming, intelligent web services, autonomous driving and robotics. Other advances in deep learning are recurrent neural networks (RNN) for speech and video analysis [74] and generative adversarial networks (GAN) for image and data synthesis [33]. In this dissertation, we will mainly focus on CNN and DRL models.

1.1.2 Training of Deep Learning Models

Typical deep learning models contain millions of parameters to solve complex tasks and model various data distributions. Training of deep learning models usually requires a large amount of data to represent the target data distribution. During training, data augmentation techniques on the training data are designed to guarantee that the training data distribution covers the testing data distribution. For instance, with random noise and occlusions applied to the training data, the trained CNN models can better tackle noisy testing data; with random rotation and translation added to the training data, the trained CNN models will be rotation and translation invariant.

Typical deep learning training methods can be categorized as supervised learning and unsupervised learning. Supervised learning is when the training data includes both the input and the target output of the models. The target output, or label, is defined based on the specific task. Labels can be integers which stand for different classes in a classification problem, or floating-point numbers for 6 Degrees of Freedom (6DOF) pose parameters in a registration problem. In each iteration of training, the prediction of the model is compared with the corresponding label. A loss is calculated based on this comparison and further used to modify the weights in the model. Although supervised learning is widely used in deep learning, the effort of labeling data can be enormous due to the large scale of the dataset (e.g. large-scale image recognition) or the difficulty of the task (e.g. 2-D/3-D registration for 6DOF parameters).

Unsupervised learning is when the training data only includes the input of the models. Typical unsupervised learning methods include Deep Belief Networks (DBN) [42] and Stacked Denoising Autoencoders (SDAE) [13]. During training, the model tries to reconstruct the input from the training dataset. The loss is usually defined as the difference between the input and the corresponding reconstruction. Unsupervised learning can be used for feature learning and clustering with a large amount of unlabeled data. However, to perform better on specific tasks, supervised learning with task labels is usually necessary after unsupervised feature learning.
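A minimal sketch of this reconstruction-based training, assuming a generic denoising autoencoder with arbitrary layer sizes (not the specific DBN/SDAE variants cited above): the training signal is the input itself, so no task labels are needed.

```python
import torch
import torch.nn as nn

# Minimal denoising autoencoder: encode the (corrupted) input, decode it back,
# and penalize the difference between the reconstruction and the clean input.
class DenoisingAE(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, in_dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAE()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.rand(64, 784)                             # a mini-batch of unlabeled inputs
x_noisy = x + 0.1 * torch.randn_like(x)             # corrupt the input
loss = nn.functional.mse_loss(model(x_noisy), x)    # reconstruction error vs. the clean input
optimizer.zero_grad()
loss.backward()
optimizer.step()
```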
1.1.3 Health Informatics and Our Motivation

Health informatics seeks to improve health care via health information technologies with high quality and high efficiency. Health informatics includes devices and methods for optimized information acquisition, storage, retrieval and use in health and biomedicine. Since many health informatics data are image-based, CNN and DRL have great potential in health informatics applications. Although CNN and DRL have been proven to be effective for general image-based applications, health informatics problems are unique, and it is difficult to apply general deep learning methods directly. More importantly, many health informatics problems share a common challenge: a lack of labeled data to train deep learning models with millions of parameters. Thus, in this dissertation, our aim is to develop more efficient deep learning-based methods which can efficiently deal with the limited data challenge and fit the nature of health informatics problems. More specifically, we plan to explore three different approaches to deal with the limited data challenge while taking problem-specific challenges into account:

1. Combining the mid-level image partitioning based approach with deep learning to extract multi-resolution deep features, and fine-tuning the deep learning model with mid-level image part data with mined "labels". We illustrate this idea on the 2D deformable image recognition problem, and use food image recognition as a case study.

2. Pre-processing and feature extraction based on prior knowledge to reduce the high dimensionality and complexity of raw data compared to the limited number of samples. We illustrate this idea on the 3D image-based diagnosis problem, and use MRI based Attention Deficit Hyperactivity Disorder (ADHD) diagnosis as a case study.

3. Generating a large amount of realistic synthetic data to train deep learning models, and performing pairwise domain adaptation to generalize the trained models on real data. We illustrate this idea on 2-D/3-D medical imaging registration problems, and use Transesophageal Echocardiography (TEE) registration and spine registration problems as case studies.

In the following sections, we will first elaborate the three investigated health informatics problems and their challenges in Section 1.2. Then we will give literature reviews of the three problems investigated (Section 1.3). In the last section of this chapter, we provide the objectives and an outline of this thesis with the main contributions (Section 1.4).

1.2 Problems Investigated

1.2.1 Deep Learning for Food Image Recognition

Compared to other types of image recognition problems, deformable image recognition is more challenging due to the variations of the object shapes. In this thesis, we use food image recognition as an illustrative example. Food images are highly deformable (Figure 1.3). In addition, food is an inseparable part of our lives, as it provides the necessary nutrition to support our daily activities. Proper food intake affects people's health condition and life quality. Food has also gained increasing impact on popular culture with the rapid technology advances in mobile devices and social media.
Due to its growing importance and popularity, food images are acquired and analysed in many health related applications, such as nutrition assessment applications [6, 113] and personal food logging applications [4, 54], and have become a unique and important research field for health informatics. Figure 1.3 shows some examples of food images. In Figure 1.3, the class label of the first row is curry beef, while the label of the second row is hamburger steak. Existing deep learning-based food image recognition methods tend to treat food images as general images and apply deep learning models directly (i.e. AlexNet [57]). However, compared to other types of image recognition problems, the food image recognition problem is unique and more complex:

• food images have large variances across different cultures and regions. For many particular regions, food image data can be limited;

• food items are highly deformable. Food images have large intra-class variances, which means that food images within the same class can appear quite differently. Food images also have small inter-class variances.

Directly applying large deep learning models to this complex problem with limited data often results in overfitting and generalization issues. Therefore, more problem-specific deep learning methods are highly needed to improve the generalization and performance of food image recognition.

Figure 1.3: Sample images from the UEC Food 100 dataset [71].

1.2.2 Deep Learning for Attention Deficit Hyperactivity Disorder Diagnosis

ADHD is one of the most common mental-health disorders, affecting around 5%-10% of school-age children [85]. ADHD can be characterized by excessively impulsive, hyperactive or inattentive behaviors. These symptoms begin at an early age and may continue throughout one's lifespan, leading to serious limitations in life for the patients while creating huge burdens for families and society. The traditional diagnosis of ADHD mainly depends on ratings of behavioral symptoms, which can be unreliable [21, 83]. For instance, the diagnosis criteria for children are mainly based on reports of behaviors from parents or teachers, rather than reports of the potential patient's mental state phenomena. Therefore, a robust and efficient clinical tool that can provide assistance and diagnosis to psychiatrists is highly desired. The main challenges of deep learning-based ADHD diagnosis are:

• the high dimensionality of 3-D brain magnetic resonance imaging (MRI) data compared to the limited number of labeled data (100s);

• brain MRI data has large intra-class variances between patients;

• spatial information is also important for brain MRI data analysis.

1.2.3 Deep Learning for 2-D/3-D Medical Imaging Registration

2-D/3-D registration, which aligns the pre-operative 3-D data and the intra-operative 2-D data into the same coordinate system, is one of the key enabling technologies of image-guided procedures [72]. By aligning the 2-D and 3-D data, accurate 2-D/3-D fusion can provide complementary information for advanced image-guided radiation therapy, radiosurgery, endoscopy and interventional radiology [69].

Figure 1.4: Examples of 2-D/3-D medical image registration applications, with 2-D X-ray images on the left and 2-D/3-D overlay results on the right. (a) TEE transducer registration: the 3-D CAD model of the TEE transducer is overlaid in yellow, and the blue and red cones are the TEE projection target using the ground truth pose and the estimated pose of the TEE transducer. The TEE projection target is defined as the four corners of the TEE imaging cone at 60 mm depth. (b) Spine vertebra registration: the CT volume of the spine is overlaid in yellow on 2-D X-ray spine images.
Figure 1.4 demonstrates two examples of 2-D/3-D registration: 1) estimation of the 6DOF pose of a TEE transducer in an X-ray image by registering its Computer Aided Design (CAD) model with the X-ray image, which is the key enabling technology for fusing live ultrasound and X-ray images for image-guided therapy; 2) registration of a spine vertebra in Computed Tomography (CT) and X-ray images, which has a wide range of applications in interventional imaging when the spine is a visible object in the imaging field.

For 2-D/3-D medical imaging registration, data acquisition is difficult since data can only be captured during surgical operations or training. Besides, accurate labeling of the 6DOF pose from 2-D X-ray projection images is extremely challenging and thus typically requires a bi-plane imaging setup that is rarely clinically available. Existing deep learning-based 2-D/3-D medical imaging registration methods employ synthetically generated data for training, and apply the trained models on real clinical data. This often leads to poor generalization on real data, and limits the capture range and computational efficiency of the 2-D/3-D registration system.

1.3 Related Works

1.3.1 Food Image Recognition

Recently there have been many works focusing on food image recognition problems, and most of them are based on low-level local features. Zhu et al. employed color and texture features to classify segmented food items in a food database with 63 images and 19 food items [113]. He et al. introduced global color texture descriptors on segmented food items for classification, then incorporated two types of contextual information, co-occurrence patterns and personalized learning models [40]. Aizawa et al. employed global color, circle and block features with food personal likelihood analysis [4, 54]. Anthimopoulos et al. optimized the Bag-of-Features model by evaluating various types of low-level features and coding methods and showed that the combination of the Scale-Invariant Feature Transform (SIFT) with the color feature yields better performance [6]. Farinella et al. also employed low-level features in a Bag-of-Textons model [23] to tackle the food image retrieval and classification problem. More recently, they proposed a Consensus Vocabularies method to improve the Bag-of-Visual-Words representation [22]. The Consensus Vocabularies method aims to discover consensus representations from different feature spaces to boost the performance. Their experiments on PFID-61 [12] showed superior performance. Yanai et al. introduced a fast low-level features approach which showed promising performance and efficiency on the University of Electro-Communications (UEC) Food 100 and UEC Food 256 image datasets [53, 103]. Although low-level features are simple and computationally efficient, their discriminative power is limited by their small patch size and low dimensionality. As a result, they are less powerful in capturing more complex image components [20].

Mid-level image part-based approaches have shown promising performance on food images. Bossard et al. proposed a Random Forest (RF) based method to mine food parts with high probability for each class [8]. Zheng et al. introduced a density ratio-based approach to mine food parts with more appearances in the target class and fewer appearances in background classes [110].
Since these models employ low-level local features with improved Fisher Vector (IFV) encoding to represent mid-level food parts, their recognition power is limited. Recently a supervised extreme learning committee (SELC) has been developed which trains a series of extreme learning committees over different image features to automatically choose optimal features for food recognition [70]. Benefiting from the ability to learn and represent powerful feature representations with labeled data, recently developed deep learning approaches achieve state-of-the-art performances in several food image recognition problems. Yanai et al. introduced the deep convolutional neural networks (DCNN) model to extract one DCNN feature vector for every food image, then classify the DCNN image features with a linear Support Vector Machine (SVM) [52]. Further fine-tuning of the AlexNet model with more food images achieved improved performance on benchmark food image datasets [102]. Fine-tuning with other DCNN architectures (e.g. Inception-v3 and GoogLeNet) also gained success [37, 89]. In general, mid-level approaches mine representative image parts across the dataset [110], while deep learning approaches learn powerful image features due to their deep structure with superior model capacity.

1.3.2 ADHD Diagnosis with FMRI and SMRI

MRI, including functional MRI (FMRI) and structural MRI (SMRI), has been investigated in many studies to diagnose ADHD [48, 111, 112]. For instance, Zhu et al. trained a classifier based on fisher-discriminant-analysis (FDA) using FMRI scans from 24 subjects (12 typically developing children (TDC) and 12 ADHD) and achieved a leave-one-out cross-validation accuracy of 85%. However, the number of samples utilized in these studies is relatively small and the generalization of their findings may be arguable [21]. In order to accelerate the understanding of the neural basis of ADHD and obtain objective diagnosis methods, the ADHD-200 consortium publicly released a large-scale neuroimaging dataset along with the phenotypic information. They further released a hold-out testing dataset and held the ADHD-200 global competition in 2011 [10]. Twenty-one international teams, from different scientific disciplines, joined the competition and submitted their diagnostic labels. Accuracies derived by internal cross-validation ranged from 55%-78%; however, the accuracies reported on the external hold-out test dataset were substantially lower. Teams were ranked based on the diagnosis accuracy on the hold-out testing dataset, and out of these 21 teams, the best binary classifier based on the neuroimages achieved a diagnostic accuracy of 61.54% [1].

Subsequent to this competition, researchers continued to work on automatic diagnosis of ADHD based on the ADHD-200 competition dataset. In [17], Dai et al. employed different image processing techniques to extract multimodal features, including features from MRI and FMRI. For MRI, they extracted Cortical Thickness and Gray Matter Probability (GMP), while for FMRI, Regional Homogeneity (REHO) and functional connectivity were extracted. They compared the effects of using different features against each other. In addition, they further integrated multimodal image features using multiple kernel learning (MKL) and obtained a diagnosis accuracy of 61.54%, which is comparable to the best result in the competition. Ghiassian et al. introduced the histogram of oriented gradients (HOG) descriptor, from visual object recognition, to the study of ADHD diagnosis [30].
To avoid overfitting in the training of classifiers, they selected the most relevant 211 features from 116,480 features using Maximum Relevance Minimum Redundancy (MRMR). They evaluated several classifiers and found that the SVM achieved the best performance of 62.57%. Dey et al. proposed a novel framework for the automatic diagnosis of ADHD based on brain functional connectivity networks. They first selected a sequence of highly active voxels and constructed the connectivity network between them. They obtained an average accuracy of 62.81% on the hold-out testing dataset when classification was performed on all the subjects. They also concluded that the performance can be improved by integrating gender information. In [35], Guo et al. explored the functional connectivity between voxels and obtained an average accuracy of 63.75% based on a social network method. These results represent the highest diagnostic performance on the ADHD-200 hold-out test dataset.

1.3.3 2-D/3-D Medical Imaging Registration

2-D/3-D Registration

In the literature, intensity-based 2-D/3-D registration methods were first adopted for 6DOF pose estimation of the TEE transducer from X-ray images [28, 43]. In these methods, Digitally Reconstructed Radiographies (DRR) of the TEE transducer are generated from a 3D model and then iteratively registered with the X-ray image. Although intensity-based methods have been proven to be accurate, they typically have a low computational efficiency due to the iterative DRR generation (e.g. 2 frames per second (FPS) in [95]), and lack robustness due to the non-convexity of the image similarity metric. Manual initialization of the object pose in the close neighborhood of the correct pose is therefore often needed in these methods. Fast approximations of DRR have been proposed to accelerate intensity-based methods [38, 49]. However, the approximated DRRs often have a lower image quality, which subsequently leads to lower registration accuracy.

Recently, deep learning-based methods have shown promising results in 2-D/3-D registration. A CNN-based regression approach was proposed to model the non-convex mappings between registration parameters and image residual features [72]. The authors further improved the capture range and computational efficiency by modeling the complex mapping problem with a hierarchical CNN regression architecture [109]. The reported framework shows state-of-the-art performance in both accuracy and speed for 2-D/3-D registration of rigid objects. A Markov Decision Process (MDP) formulation for image registration was introduced for 3-D/3-D registration [63], where an artificial agent is trained to perform registration by taking a series of actions in the MDP environment. The agent-based approach was furthermore extended to 2-D/3-D registration [73]. It is widely recognized that successful training of deep learning models often requires a large number of labeled data, which for many applications (e.g., image-guided surgery) are difficult or impractical to collect. The above methods exploit a large number of synthetically generated data for training. However, the domain difference between synthetic and real data often causes a performance gap when applying the trained model on real data, and limits the system performance, speed and memory efficiency.
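For intuition, the CNN-regression formulation can be pictured with the simplified sketch below: an X-ray patch and a DRR patch rendered at the current pose estimate are stacked as input channels, and the network regresses the 6DOF pose offset. The architecture, patch size and names here are placeholders for illustration and do not reproduce the hierarchical networks of [72, 109] or the system described in Chapter 4.

```python
import torch
import torch.nn as nn

# Simplified pose-offset regressor: the input is a 2-channel stack of an X-ray
# patch and a DRR patch rendered at the current pose estimate; the output is a
# 6DOF offset (3 translations + 3 rotations) that corrects that estimate.
class PoseOffsetRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 13 * 13, 256), nn.ReLU(), nn.Linear(256, 6)
        )

    def forward(self, xray_patch, drr_patch):
        x = torch.cat([xray_patch, drr_patch], dim=1)  # stack the two images as channels
        return self.fc(self.conv(x))

# In a registration loop, the predicted offset would be applied to the pose, a new
# DRR rendered, and the step repeated until the offset is close to zero.
net = PoseOffsetRegressor()
xray = torch.randn(1, 1, 64, 64)   # placeholder 64x64 patches
drr = torch.randn(1, 1, 64, 64)
delta_T = net(xray, drr)           # shape (1, 6)
```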
Deep Domain Adaptation

Domain adaptation methods have been investigated to address the problem of domain shift by establishing knowledge transfer from the source-domain training data to the target-domain testing data to extract domain-invariant features. In the literature, deep domain adaptation works can be categorized into two directions: discrepancy-based and adversarial-based [16]. The strategy of discrepancy-based methods is to guide the model training towards the target domain by minimizing a defined domain distance. Deep Domain Confusion (DDC) employs the Maximum Mean Discrepancy (MMD) as the domain loss for adaptation [96]. MMD measures the domain difference by calculating the norm of the difference between the means of the two domains. The Deep Adaptation Network (DAN) minimizes MMD with multiple kernels and extends the multiple-kernel MMD loss to multiple layers in the CNN [67]. The Deep Correlation Alignment (CORAL) method simply matches the mean and covariance of the data distributions of the two domains [94]. In contrast, adversarial-based methods aim to encourage domain confusion through an adversarial objective, i.e. a binary domain classifier [27, 97]. Since these methods generally model the domain distance over source and target data distributions in an unsupervised fashion, they still require a large number of data from both domains. In addition, these methods tend to compare deep features at high-level layers of the CNN (mostly FC layers). Since FC layers are more task-specific, employing FC features will limit the flexibility of the adapted model.
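In its simplest (linear) form, the MMD criterion reduces to the squared distance between the mean feature vectors of the two domains. The sketch below shows only this minimal version, not the multi-kernel, multi-layer estimator used in DAN; the feature dimensions and batch sizes are placeholders.

```python
import torch

def linear_mmd(source_feats: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
    """Squared distance between the mean feature vectors of two domains.

    source_feats, target_feats: (N, D) and (M, D) batches of CNN features
    extracted from source-domain and target-domain images.
    """
    delta = source_feats.mean(dim=0) - target_feats.mean(dim=0)
    return torch.dot(delta, delta)

# Typical use: total_loss = task_loss + lambda_mmd * linear_mmd(f_src, f_tgt),
# so training is pulled toward features whose statistics match across domains.
f_src = torch.randn(32, 256)   # e.g. features of synthetic (source) images
f_tgt = torch.randn(32, 256)   # e.g. features of real (target) images
domain_loss = linear_mmd(f_src, f_tgt)
```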
1.4 Thesis Objective, Outline and Main Contributions

1.4.1 Objective

The global objective of this thesis is to develop effective deep learning methods to address the problem of limited data for health informatics. Figure 1.5 summarizes the main challenges of health informatics problems and the objectives of this thesis. As stated above, applications of deep learning methods in three practical health informatics problems have great potential, but remain technically challenging: 1) deep learning for food image recognition, 2) deep learning for ADHD diagnosis with FMRI and MRI, and 3) deep learning for 2-D/3-D medical imaging registration.

Figure 1.5: A summary of this thesis: investigated approaches, applicable problems, challenges and solutions.

The main challenge in the first problem is that, in some cases, the training data for food image recognition is limited compared to that for general image recognition. Besides, food images are more complex, with large intra-class variances and small inter-class variances. Therefore, our first objective is to develop a deep learning method for food image recognition to improve the performance with limited data. The main challenge associated with the second problem is that the labeled data is extremely limited compared to the high dimensionality and variances of ADHD MRI data. These issues are particularly challenging and cannot be solved by the state-of-the-art methods. Therefore, the second objective of this thesis is to develop a robust deep learning-based ADHD diagnosis system with FMRI and MRI scans. The main challenge of the third problem is that accurately labeled real clinical data is extremely limited, and automated 2-D/3-D registration with high accuracy, speed and memory efficiency is needed by clinical applications. Therefore, the third objective is to develop a hierarchical CNN-based 2-D/3-D registration framework with pairwise domain adaptation and robust real-time performance. In addition, the fourth objective of this thesis is to design a universal pairwise domain adaptation module (PDA module) which is suitable for different 2-D/3-D registration problems and deep learning frameworks with improved performance.

1.4.2 Outline and Main Contributions

Chapter 2 - Combining Mid-Level Image Partitioning with Deep Learning: Food Image Recognition

In this chapter, we propose a novel mid-level deep food part mining framework to better utilize DCNN models as powerful multi-resolution feature extractors for food images. We name it the food part CNN (FP-CNN) framework. Since food image data is limited, fine-tuning the DCNN model with food part data is investigated as an effective approach. To the best of our knowledge, we are the first to tackle the challenge of training the DCNN model with unlabeled mid-level food part data. Finally, we train a DCNN (AlexNet) model on the food part dataset with the mined part-level labels. We denote this method as FP-CNN(ft). The experiments on several benchmark food image datasets show that the proposed approach consistently outperforms the mid-level or deep learning-based food image recognition approaches with the same amount of training data. On small image datasets (around 10,000 images), the advantage of the proposed approach is more significant.

In summary, the main contributions of this work are as follows:

1. We propose a novel FP-CNN framework to better utilize DCNN models as powerful multi-resolution feature extractors for food images by integrating the DCNN models with the mid-level image partitioning-based approach.

2. We overcome the problem of limited data by mining mid-level food part data, and then fine-tuning the DCNN model. A clustering-based food part label mining scheme with 3 strategies is designed to mine part-level labels from unlabeled food part data. Finally, the food part dataset with the mined part-level labels is used to fine-tune the FP-CNN(ft) model.

Chapter 3 - Encoding Prior Knowledge via Feature Extraction: MRI based ADHD Diagnosis

Inspired by the way that radiologists examine brain images, in this chapter we design a 3D CNN model to learn hierarchical spatial patterns to diagnose ADHD from FMRI and MRI features. To overcome the challenge of limited data with high dimensionality and variances in the raw data, we first encode prior knowledge into 6 types of 3D features as the input of deep learning models. More specifically, we extract 3 low-level features from FMRI data: REHO, fractional ALFF (FALFF) [114] and Voxel-Mirrored Homotopic Connectivity (VMHC) [115]; and 3 voxel-based morphometry features from MRI data: gray matter (GM), white matter (WM) and cerebrospinal fluid (CSF) density in Montreal Neurological Institute (MNI) space. Furthermore, we employ the 3D CNN model [45, 78] to learn latent 3D local patterns with a limited number of labeled data. We discover that FMRI and MRI features are complementary, and design a multi-modality architecture to enhance the classification accuracy. The performance on the independent hold-out testing dataset shows that the proposed 3D CNN approach outperforms the state-of-the-art studies in the literature, even with fewer training samples.
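A minimal sketch of the multi-channel 3D input idea, assuming placeholder volume dimensions and layer widths (not the architecture reported in Chapter 3): the whole-brain feature volumes are stacked as channels of one 5-D tensor and passed through 3D convolutions, so that local spatial patterns are preserved in all three axes.

```python
import torch
import torch.nn as nn

# Stack three whole-brain feature volumes (e.g. REHO, FALFF, VMHC) as channels
# of a single 5-D tensor: (batch, channels, depth, height, width).
reho  = torch.randn(1, 1, 49, 58, 47)   # placeholder brain-volume size
falff = torch.randn(1, 1, 49, 58, 47)
vmhc  = torch.randn(1, 1, 49, 58, 47)
volume = torch.cat([reho, falff, vmhc], dim=1)        # shape (1, 3, 49, 58, 47)

# A tiny 3D CNN head: 3D convolutions learn local patterns jointly over x, y and z.
model = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
    nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(16, 2),                                  # ADHD vs. typically developing
)
logits = model(volume)                                 # shape (1, 2)
```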
In summary, the main contributions of this chapter are fourfold:

1. To train deep learning models over limited MRI data with high dimensionality and high data variances, instead of directly using raw MRI data as input, we extract 3 FMRI and 3 SMRI features based on prior knowledge.

2. We retain the spatial information throughout the learning process. Rather than representing the low-level features (including REHO, FALFF and VMHC) as a vector, we keep these low-level features in 3-order tensors (also called 3-dimensional arrays). Inspired by the way that radiologists examine brain images, we design 3D CNN models to learn hierarchical 3D patterns from high-dimensional low-level features and show promising results.

3. We investigate and summarize the strengths of both FMRI and SMRI features in the diagnosis of ADHD. We find that the 3D CNN using GM density from SMRI achieves the highest classification accuracy on the ADHD-200 hold-out testing dataset.

4. We find that FMRI and SMRI features are complementary, and design a multi-modality 3D CNN architecture to combine features from both FMRI and SMRI. The proposed multi-modality 3D CNN approach achieves a state-of-the-art accuracy of 69.15% on the hold-out testing data of the ADHD-200 global competition, demonstrating the importance of incorporating both structural and functional images for the diagnosis of neurodevelopmental disorders.

Chapter 4 - Learning CNNs with Synthetic Data Pre-Training and Pairwise Domain Adaptation: Real-Time 6DOF TEE Transducer Registration

In this chapter, we overcome the limitation on labeled data by synthetic data pre-training and pairwise domain adaptation. We propose a fully automatic 6DOF TEE transducer pose detection and tracking system based on hierarchical CNNs to improve system accuracy and efficiency. The contributions of the proposed system are threefold:

1. First, to overcome the limitation on labeled data, a pairwise domain adaptation method is proposed to refine CNNs using only a small number of annotated real X-ray images, leading to significantly improved accuracy in TEE transducer pose estimation for real X-ray images.

2. Second, to fully automate 2-D/3-D registration for TEE and improve computational efficiency, we propose a framework based on lightweight CNNs to hierarchically regress the 6DOF pose parameters of the TEE transducer, which removes the need for pose initialization and significantly reduces the memory footprint (0.15 vs. 2.39 GB) and computation time (83.3 vs. 13.6 FPS) when compared to the CNN-based method in [72].

3. Finally, we propose a self-symmetry resolution mechanism, where a CNN classifier is trained to differentiate correct poses from their flipped counterparts by seeing synthetically-generated pairs as examples and focusing on the unsymmetric part of the transducer, i.e., the fiducial ball and hole markers (Figure 4.2).

Chapter 5 - Learning CNNs with Universal Pairwise Domain Adaptation Module: 2-D/3-D Medical Imaging Registration via CNN and DRL

In Chapter 4, we exploit the ability to generate corresponding synthetic data for each labeled real data sample with the exact same pose parameters, and define a pairwise loss measuring the distance between features from paired real and synthetic data to represent the performance gap between the domains. In this chapter, we further propose a PDA module to bridge the performance gap, and extend pairwise domain adaptation to a universal method for different 2-D/3-D registration applications and deep learning methods. The proposed PDA module is 1) powerful, with additional network capacity to model complex domain variances; 2) flexible for different deep learning-based 2-D/3-D registration frameworks; 3) easy to plug into any pre-trained CNN model; and 4) trainable with hierarchical pairwise losses using only a few real-synthetic pairs. The proposed PDA module is evaluated on 2 different deep learning frameworks with 2 different clinical problems: CNN-based residual regression for TEE transducer registration and DRL-based policy learning for spine vertebra registration. The experiment results demonstrate the PDA module's advantage in generalization and superior performance on real clinical data. The proposed PDA module has the potential to benefit any medical imaging problem where paired real-synthetic data can be obtained.
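The core of the pairwise idea is compact: because every annotated real image has a synthetic counterpart rendered at exactly the same pose, the adaptation loss can compare features of matched pairs directly rather than matching whole-distribution statistics. The sketch below shows only this core loss with hypothetical tensor shapes; the module architecture and multilayer losses described in Chapter 5 are omitted.

```python
import torch

def pairwise_feature_loss(real_feats: torch.Tensor, synth_feats: torch.Tensor) -> torch.Tensor:
    """Mean squared distance between features of paired real and synthetic images.

    real_feats[i] and synth_feats[i] come from a real X-ray image and the DRR
    rendered with the same ground-truth pose, so row i of each tensor forms a pair.
    """
    return (real_feats - synth_feats).pow(2).sum(dim=1).mean()

# With only a handful of annotated real images, minimizing this loss pulls
# real-image features toward the synthetic-image features the model was trained on.
f_real  = torch.randn(8, 512)   # features of 8 annotated real images
f_synth = torch.randn(8, 512)   # features of their pose-matched synthetic renderings
adaptation_loss = pairwise_feature_loss(f_real, f_synth)
```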
Chapter 6 - Conclusions and Future Works

This chapter concludes the thesis by summarizing the addressed problems and presented results. Some perspectives for future work are listed and discussed.

Chapter 2
Combining Mid-Level Image Partitioning with Deep Learning: Food Image Recognition

In this chapter, we focus on combining the mid-level image partitioning based approach with deep learning to extract multi-resolution deep features, and on fine-tuning the deep learning model with mid-level image part data with mined "labels" to overcome the limitation on labeled data. We illustrate this idea on the food image recognition problem and propose a novel framework to better utilize DCNN models as powerful multi-resolution feature extractors for food image recognition.

With the rapid technology advances in mobile devices and social media, food images have become a unique and important research field in the domain of health informatics. To better understand food images, accurate annotation is desired. Recently, deep learning methods have shown promising performance in multiple food image recognition problems [37, 52, 89, 102]. However, compared with other image recognition problems, food data usually has large intra-class variances and small inter-class variances. In addition, dataset sizes for food images are generally smaller than for general object recognition problems, which makes the training of large CNN models more prone to overfitting. Thus, directly applying CNN methods to the food image recognition problem is problematic. In this chapter, we propose a mid-level deep food part framework to improve food image recognition accuracy by mining meaningful mid-level food part features and integrating them into the deep learning pipeline. In the experiments, the proposed method significantly outperforms the existing CNN methods with a limited number of training data (around 10,000 images).

This chapter is structured as follows: Section 2.1 introduces the motivation and objectives of this work; Section 2.2 presents the proposed mid-level deep food part framework; Section 2.3 reports the results and compares the proposed method with the state-of-the-art methods; Section 2.4 discusses the strengths and potential limitations of the proposed method; Section 2.5 draws the conclusion and introduces the future work.

2.1 Motivation and Objectives

Figure 1.3 shows some food image samples from the dataset "UEC Food 100" [71]. The class label of the first row is curry beef, while that of the second row is hamburger steak. Compared to other types of objects, food items are highly deformable; food images have large intra-class variances, which means that food images within the same class can appear quite differently. In addition, food images have small inter-class variances.
To address the above challenges in food image recognition, an in-tuitive method is to manually annotate food images. However, this method is prob-lematic as the quality of manual annotation highly depends on the users’ knowledgeof nutrition, not to mention the process itself can be troublesome and inconvenient.Lately, many researches have focused on food image recognition problems.Most of these works are based on low-level local features, such as, color, texture,23HOG [18] and SIFT [68]. Low-level local features are hand-crafted feature vectorssampled from small image patches, usually with 8×8 or 16×16 pixels. Low-levelfeatures are simple and can be computed fast, but their discriminative power islimited by their small patch size and low dimensionality. More recently, mid-levelimage partitioning based methods showed superior performances on food imagerecognition mainly because of their suitability for modeling deformable food partswith significantly larger image patches [8, 110]. Mid-level image parts are patcheswith semantic meanings, and aim to replace low-level visual words [62]. However,these mid-level approaches generally employ low-level local features with the IFVcoding method [84] to represent food image parts. Therefore, the overall recog-nition performance is still limited by the discriminative power of the low-levelfeatures and the IFV coding method.Concurrently, deep learning methods, inspired by the human brain’s deep ar-chitecture, have attracted extensive attention in computer vision problems. Un-like low-level or mid-level feature based methods, deep learning methods focus onend-to-end learning instead of feature engineering which is often labor-intensiveand varies from task to task [7]. Deep learning methods can replace hand-craftedfeatures with multi-layer feature representations learned from large-scale datasetsvia supervised learning. Among various deep learning methods, DCNN is partic-ularly designed for image recognition and achieves state-of-the-art performancesin many image recognition tasks [50, 57]. Yanai et al. first introduced the DCNNfeature extraction to the area of food image recognition [52]. The DCNN modelused here is also known as AlexNet [57] which has 5 convolutional layers and 3FC layers. Fine-tuning the AlexNet with food images (AlexNet(ft)) also showedsuperior performance on several food image datasets [102].From the literature, we recognize that both mid-level based approaches andDCNN based approaches have their advantages, but perhaps most importantly,24these two approaches can be considered complementary and thus can be exploredjointly to improve the recognition accuracy of food images. In this chapter, we pro-pose a novel mid-level deep food part mining framework to better utilize DCNNmodels as powerful multi-resolution feature extractors for food images. We name itFP-CNN framework. Since food image data is limited, fine-tuning the DCNN modelwith food part data is investigated as an effective approach. To our best knowledge,we are the first to tackle the challenge of training the DCNN model with unla-beled mid-level food part data. Finally, we train a DCNN (AlexNet) model on thefood part dataset with the mined part-level labels. We denote this method as FP-CNN(ft). 
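To make the fine-tuning step just mentioned concrete, the sketch below shows the generic recipe for adapting a pre-trained DCNN to a food-label space: the original classification layer is replaced by one sized for the target classes and the whole network is re-trained with a small learning rate. This is only an illustrative tf.keras sketch, not the implementation used in this thesis (which fine-tunes AlexNet); AlexNet is not distributed with Keras, so MobileNetV2 stands in as a hypothetical backbone, and num_classes, the optimizer settings and the commented-out fit call are placeholders.

import tensorflow as tf

num_classes = 100                                   # e.g., UEC Food 100; placeholder value
base = tf.keras.applications.MobileNetV2(           # stand-in backbone; the thesis uses AlexNet
    include_top=False, weights="imagenet", input_shape=(224, 224, 3), pooling="avg")

# Replace the original classification layer with one sized for the food classes.
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
model = tf.keras.Model(base.input, outputs)

# Fine-tune the whole network with a small learning rate, as in typical DCNN fine-tuning.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_data=(val_images, val_labels), epochs=30)

The FP-CNN(ft) model introduced below is trained with the same kind of recipe, except that the labels are the mined part-level labels rather than the image-level labels.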
The experiments on several benchmark food image datasets show thatthe proposed approach consistently outperforms the mid-level or deep learning-based food image recognition approaches with the same amount of training data.On small image dataset (around 10,000), the advantage of the proposed approachis more significant.In summary, the main contributions of this work are as follows:1. We propose a novel FP-CNN framework to better utilize DCNN models aspowerful multi-resolution feature extractors for food images by integratingthe DCNN models with the mid-level image partitioning-based approach.2. We overcome the limitation on data by mining mid-level food part data, andthen fine-tuning DCNN model. A clustering-based food part label miningscheme with 3 strategies is designed to mine part-level labels from unlabeledfood parts data. Finally, the food part dataset with the mined part-level labelsare used to fine-tune the FP-CNN(ft) model.252.2 Methods2.2.1 Related MethodsThere have been several recent works that integrate mid-level based approacheswith DCNN based approaches for general image recognition problems. Gong etal. proposed a multi-level DCNN feature pooling framework [32]. The frameworkfirst densely samples mid-level image patches on 3 scales and calculates DCNNactivations. The mid-level image patches here (128× 128 and 64× 64) are largerthan low-level image patches (16×16 or 8×8) to capture more complex image spa-tial structures [32]. Then, the dimension of DCNN features is reduced from 4096through Principle Component Analysis (PCA) and further encoded with the Vectorof Locally Aggregated Descriptors (VLAD) pooling. Similarly, Liu et al. proposeda sparse coding based fisher vector encoding framework [65]. This frameworkdensely samples mid-level image patches and calculates DCNN activations fol-lowing PCA dimension reduction and feature encoding. Li et al. first denselyextracted sparsified and binarized DCNN features, then they built the DCNN fea-ture pool to mine class specified patterns from DCNN feature vectors via features’coverage and confidence [62]. The work presented in this chapter has the followingnovelties and differences when compared with the above approaches:• The approaches mentioned above were designed for general object and sceneimage recognition problems, and they sample mid-level image patches frominput images at a regular dense grid. For instance, [65] sampled mid-levelimage patches with size of 227× 227 and the stride of 8 pixels from theoriginal 512×512 image, and thus approximately 900 DCNN features needto be extracted for an input image. To better extract representations for de-formable food parts with arbitrary shapes and sizes, we adopt a superpixel26based mid-level feature extraction approach [8]. The proposed approach ex-tracts only 30 meaningful patches with arbitrary shapes and sizes which aremore suitable for food images. Extracting a much smaller number of mid-level representations also speeds up the following DCNN feature extractionprocess.• The above mentioned approaches generally employ the off-the-shelf DCNNmodel to represent mid-level image part data, and require complex featureencoding process to form final image representations. In this work, we focuson improving DCNN model with simplifying feature encoding process. Wetrain a FP-CNN(ft) model with the target data, which in our case is the unla-beled food parts data. Deep learning with unlabeled or weakly labeled datais challenging and less effective. 
We are the first to tackle this challenge by designing 3 strategies to mine part-level labels for unlabeled food part data.

2.2.2 Mid-Level Deep Food Part Framework with DCNN Features

In this section we present the proposed FP-CNN framework with DCNN features. The FP-CNN framework integrates the off-the-shelf DCNN features with the mid-level image partitioning approach (as illustrated in Figure 2.1).

Figure 2.1: The flowchart of the proposed food part CNN (FP-CNN) framework with off-the-shelf DCNN features.

Mid-Level Food Parts Extraction

In order to extract meaningful food parts from food images, superpixel segmentation methods can be adopted due to their outstanding adherence to image edges and boundaries. In fact, the superpixel-based patch sampling strategy samples fewer DCNN activations than the dense sampling strategies in [32, 65]. Among various superpixel segmentation methods, graph-based segmentation efficiently extracts a global segmentation from simple local (pairwise) pixel similarities and precisely preserves the arbitrary shape of an image region [24]. The method calculates D(C1, C2) to determine whether there is a segmentation boundary between two components C1 and C2 [24]:

D(C_1, C_2) =
\begin{cases}
\text{true}, & \text{if } \mathit{Dif}(C_1, C_2) > \min\big(\mathit{Int}(C_1), \mathit{Int}(C_2)\big) \\
\text{false}, & \text{otherwise}
\end{cases}    (2.1)

where the internal difference Int(C) is the largest weight in the minimum spanning tree of the component, MST(C, E),

\mathit{Int}(C) = \max_{e \in MST(C, E)} w(e)    (2.2)

and Dif(C1, C2) defines the difference between two components C1, C2 to be the minimum-weight edge connecting the two components,

\mathit{Dif}(C_1, C_2) = \min_{v_i \in C_1,\; v_j \in C_2,\; (v_i, v_j) \in E} w\big((v_i, v_j)\big)    (2.3)

where the weight of an edge, w(e), is defined by the intensity difference between the pixels connected by edge e. The graph-based segmentation method assumes that edges between two pixels in the same component should have relatively lower weights, and edges between pixels in different components should have higher weights.

In the proposed FP-CNN framework we segment an input food image X into 30 food parts which contain meaningful food components (as shown in Figure 2.2) [8, 110]. For each segmented food part, a square bounding area is extracted as a food part sub-image. Finally, from food image X, we can obtain a food part set {x_i^p}, where i = {1, 2, ..., n} and x_i^p represents the ith food part.

X \xrightarrow{\text{segmentation}} \{x_i^p\}    (2.4)

DCNN Feature Extraction

For the original image X and each extracted food part sub-image x_i^p, deep learning features f^X and f_i^p will be extracted by using DCNN models DCNN-1 and DCNN-2 (Figure 2.1). In this section, we implement only the pre-trained AlexNet model in DCNN-1 and DCNN-2 to evaluate the performance improvement of the proposed framework. We will discuss the fine-tuning of DCNN-2 in the following section. Here the responses of the second-to-last FC layer (i.e., fc7) of AlexNet are extracted as features for each food part sub-image and the original image. The dimension of the extracted DCNN features is 4096.

\{x_i^p\} \xrightarrow{\text{DCNN feature extraction}} \{f_i^p\}    (2.5)

Final Image Representation

In order to form a representation vector R^p from the food part features {f_i^p}, 2-level spatial pyramid matching and max pooling are adopted. First, each input food image is divided into 4 spatial regions (2 × 2). Together with the original image, 5 spatial regions are considered in total. In the jth spatial region, we select the food part features whose centers are within this region.
If we denote the numberof selected features to be n, then we obtain n feature vectors of 4096 dimensions.Pooling is then performed to obtain a 4096 dimensions descriptor f rj for jth spa-tial region. Then, by concatenating { f rj }, Rp will be formed with the dimensionof (1+ 2× 2)× 4096. Finally, Rp will be further concatenated with f X to formthe final image representation (as in Figure 2.1). Hence, the dimension of imagerepresentation R is (1+ 1+ 2× 2)× 4096 = 24576. Compared with other mid-level image partitioning based approaches, max pooling is proved to be powerfulon deep learning features, and it also simplifies the training and testing process.30The final image representation is classified by the final classifier to predict the foodimage label.Compared with mid-level image partitioning based approaches in [8, 110], ourproposed framework employs DCNN features to extract a more powerful represen-tation for food parts, as well as to simplify and speed up the training and testingprocess by employing max pooling over part features. In addition to DCNN featurebased framework in [52, 102], our proposed framework explores mid-level foodparts information by extracting deformable food parts via superpixel segmenta-tion. Furthermore, spatial pyramid matching encodes food parts localization infor-mation. In Section 2.3, we will compare the performance of proposed frameworkwith the above mentioned approaches on three different food image datasets.2.2.3 Mid-Level Deep Food Part Label Mining SchemeAs stated in the previous sections, recent DCNN with mid-level data mining frame-works can be categorized into two directions: multi-level feature pooling and pat-tern mining. These earlier mentioned work all rely on pre-trained DCNN models.In this work, we explore a new strategy by training a DCNN-FP model with unla-beled food part data to replace the DCNN-2 component in Figure 2.1. The trainingof the FP-CNN(ft) model is different from the traditional DCNN training and fine-tuning problems since our targeted data is food part data without strong (target)labels. More specifically, we need to address the following two problems:• Food parts are unlabeled, and food images from a single class generally con-tain many different kinds of food parts. Directly feeding food parts andimage-level labels into the supervised learning framework could yield poorrecognition performance.• Unsupervised learning alone is not powerful enough. To train deep networks31in an unsupervised manner, such as DBN [42] or SDAE [13], the modelusually needs 10 or even 100 times more training data to achieve compara-ble performance as in supervised learning [14]. Our preliminary study onSDAE in exploring unsupervised learning with limited food parts data in-deed yielded poor recognition performance in our food image recognitionproblem.To deal with the above concerns, one possible solution is to create a labeledfood part dataset. For instance, Oquab et al. create 3 image part labels (positive,negative or background) based on the overlap of image parts and image segmenta-tion ground truth [80, 81]. However, these 3 part labels are too coarse to model acomplex food image part distribution in practice.To tackle this challenge, in this work, a label mining scheme is designed tocreate strong part-level labels for food part data. Given a food image dataset withN classes:X = {Xc},c= 1,2, ...,N (2.6)where Xc = {xc} represents training food images in class c. 
We can obtain a training food part data subset P_c for each of the N food classes by superpixel segmentation:

X_c \xrightarrow{\text{segmentation}} P_c = \{p_c\}    (2.7)

where p_c represents a training superpixel in class c. We should note that here c indicates the image-level label of the training images {x_c}, and it is not the "true" label of the food parts {p_c}. Then we resize the food parts {p_c} to the AlexNet input size (227 × 227) and extract DCNN features {f_c} for the superpixels via the pre-trained AlexNet model:

P_c \xrightarrow{\text{pre-trained AlexNet}} F_c    (2.8)

where F_c = {f_c} represents the set of DCNN feature vectors of the mid-level patches in class c.

Figure 2.2: Examples of food part data from 4 food part clusters belonging to 2 food categories: sushi and meat sauce spaghetti. In the first row, two different styles of sushi are clustered into two different clusters. In the second row, spaghetti and meat sauce are clustered into their corresponding clusters.

To create "true" labels, or strong labels, for {p_c}, we propose investigating the following three strategies to mine latent variables as part-level labels:

• Strategy I: Treat every food part equally and cluster the food part features {f_c} into K food part clusters {F_k^c} with k-means clustering.

\{f_c\} \xrightarrow{k\text{-means}} \{F_k^c\}, \quad k = 1, 2, ..., K, \; c = 1, 2, ..., N    (2.9)

Then we introduce the cluster labels as latent variables. For food parts F_k^c from the kth cluster of image class c, we assign (c − 1) × K + k as their part-level label.

• Strategy II: Instead of sampling food parts equally, we adopt a saliency detection technique [25] to ignore possible backgrounds and reduce the total number of food parts. The saliency detection method can be used to detect salient objects in an image, which in our case is the food. For a segmented food part p_c, if less than half of its pixels are in the salient box created by the saliency detection technique [25], we consider it a background part and remove it from the food part dataset. After saliency detection, we continue with strategy (I) and mine part-level labels.

• Strategy III: Since food images have small inter-class variances, food images from different classes may share similar food parts. In order to find food parts representative of a certain class, we adopt an iterative mining strategy similar to [99]. Following the steps in strategy (II), after obtaining the food part clusters {F_k^c} from class c, we treat these parts as positive samples with labels k = {1, 2, ..., K}, and sample food parts from other classes as negative samples with label −1. The size of the negative set is roughly two times that of the positive set. Then we iteratively train a linear SVM on the two sets and update the labels of the food parts in the positive set. This process is repeated 5 times. Finally, only the representative parts with labels k = {1, 2, ..., K} are kept.

After label mining, K food part clusters for each class c, and N × K clusters for the entire dataset, will be constructed. Each cluster consists of food parts with similar appearance and structures from a food class. Figure 2.2 shows food part data from 2 different food classes in the UEC Food 100 dataset: sushi and meat sauce spaghetti. In each row, food part data from 2 different food part clusters are displayed. We can observe that the appearances of food parts from different clusters are diverse and discriminative. Thus the proposed mid-level food part mining scheme can learn additional meaningful food part features to improve the classification performance. By assigning the cluster labels (c − 1) × K + k to the food part clusters, a labeled food part dataset can be obtained.
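As a concrete illustration of Strategy I, the following sketch clusters each class's food part features with k-means and assigns the mined part-level labels. It is a minimal Python sketch under stated assumptions: scikit-learn's KMeans stands in for the clustering step, the 4096-dimensional fc7 part features are assumed to be precomputed, and the function name and data layout are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def mine_part_labels(part_features_by_class, K=20):
    """Strategy I sketch: cluster each class's food part features with k-means
    and assign (c - 1) * K + k as the mined part-level label.
    part_features_by_class is assumed to be a list whose (c-1)-th entry is an
    (n_c, 4096) array of fc7 features for the parts segmented from class c."""
    mined_features, mined_labels = [], []
    for c, feats in enumerate(part_features_by_class, start=1):
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(feats)
        # km.labels_ gives the 0-based cluster index k for every part of class c,
        # so the mined labels below range over 0 ... N*K - 1.
        mined_features.append(feats)
        mined_labels.append((c - 1) * K + km.labels_)
    return np.vstack(mined_features), np.concatenate(mined_labels)

The mined labels span N × K classes (written 0-based here so they can be fed directly to a softmax classifier), and training the FP-CNN(ft) model then proceeds as ordinary supervised fine-tuning on these labels.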
Finally, we fine-tune anAlexNet model based with the labeled food part dataset. In Section 2.3 we willnumerically evaluate the performance of each of the three part-level label miningstrategies. Finally, we will evaluate the performance of our proposed FP-CNNframework with the fine-tuned AlexNet model (FP-CNN(ft)).2.3 Experiments and ResultsIn this section, we evaluated the performance of the proposed approaches on 3benchmark food image datasets: UEC Food 100 [71], UEC Food 256 [51] andFood-101 [8]. Figure 2.3 shows example food images from the 3 testing datasets.In the proposed FP-CNN with off-the-shelf DCNN features approach, DCNN-1and DCNN-2 are the AlexNet pre-trained with 1 million non-food images as statedin [52, 102]. In the FP-CNN(ft) approach, we fine-tuned a AlexNet model to re-place DCNN-2. We built the food part training dataset from training images, andwe then followed the food part mining scheme and mined “true” part-level la-bels and fine-tune AlexNet afterwards. DCNN-1 is replaced by AlexNet(ft) andInception-v3(ft) as in [37, 102]. For each dataset, FP-CNN(ft) models are fine-tuned separately with the corresponding training and validation data. For all exper-iments, we set the number of superpixels segmentation of each food image to be30, and extracted the response of FC7 for all AlexNet models. In the experimentswe compare the proposed approaches to the existing approaches in the literature35(a)(b)Figure 2.3: Examples of food images from 3 testing food image datasets: (a)UEC Food 100 and UEC Food 256 datasets, (b) Food-101 dataset.with the same experiment settings and training data.2.3.1 Food-101Food-101 dataset contains 101 food categories, with 1000 images per category[8]. The data is officially split by the authors of Food-101 dataset: 750 imagesof each class are selected for training and validation, and the remaining 250 im-36Table 2.1: Recognition accuracy evaluation of the 3 part-level label miningstrategies described in Section 2.2.Methods AccuracyAlexNet(ft) [102] 68.44 %Strategy I 76.45%Strategy II 75.28%Strategy III 74.56%ages are selected for testing. All the training and testing images contain back-ground. In addition, the training images are not cleaned, and contain noisy labeledimages. Fine-tuning of AlexNet(ft), Inception-v3(ft) and FP-CNN(ft) models aredone using training and validation images. We empirically selected learning rateto be 0.001 with step decay. We set batch size to be 100. For FP-CNN(ft) modelwe selected number of food part clusters K = 20, and replace last FC layer withK×N = 2020 neurons. FP-CNN(ft) is trained for 60 epochs, the final accuracy onvalidation food parts is 37.34%. For the final classification step, we apply 2-levelspatial pyramid matching and max pooling. The dimension of the final image rep-resentation is (1+1+2×2)×4096= 24576 for AlexNet(ft)+FP-CNN(ft) method,and 2048+(1+2×2)×4096 = 22528 for Inception-v3(ft)+FP-CNN(ft) method.The final classifier is a deep neural network (DNN) with 2 FC layers.We first inspected the effectiveness of the proposed 3 part-level label miningstrategies on AlexNet(ft)+FP-CNN(ft) method. As shown in Table 2.1, all 3 part-level label mining strategies outperformed baseline AlexNet(ft) approach [102] inthe literature. We can also observe that strategy I – k-means clustering yieldedthe best performance. 
The following reasons could contribute to the worse perfor-mance of Strategy II & III: (1) saliency detection and SVM may misclassified somefood parts as background or into negative set; (2) background parts are somehow37Table 2.2: Recognition accuracy results on the dataset Food-101.Methods Accuracylow-level and mid-levelFV [8] 38.88 %RFDC [8] 50.76 %pre-trained DCNN FP-CNN 54.16 %DCNN trained from scratch CNN [8] 56.40 %Extreme learning committee SELC [70] 55.89 %fine-tuned DCNNAlexNet(ft) [102] 68.44 %Inception-v3(ft) [37] 86.25%∗AlexNet(ft)+FP-CNN(ft) 76.45%Inception-v3(ft)+FP-CNN(ft) 87.96%∗ The accuracy 86.25% in the brackets is the implementation in this work. The accu-racy 88.28% is reported in [37].helpful for the overall image recognition framework. For instance, most food con-tainers will be classified as background. However, some food containers are strongcues for certain food classes (eg. round plate for spaghetti Figure 2.2(d)). Hence,we applied strategy I in the following experiments.Table 2.2 shows the experimental results of our proposed approaches and theresults of existing methods on Food-101 dataset. We first tested the proposed FP-CNN approach with off-the-shelf DCNN features. We compared the proposed FP-CNN approach with the low-level feature based FV approach and mid-level imagepartitioning based approach Random Forest Discriminant Components (RFDC) in[8]. We can observe that the proposed FP-CNN approach outperformed FV andRFDC. The proposed FP-CNN approach and RFDC are all exploring mid-levelfood part properties while the proposed FP-CNN approach relies on more pow-erful pre-trained DCNN features. We then compared the proposed FP-CNN(ft)approach with existing DCNN based approaches. CNN is trained directly on38Figure 2.4: Per-category result comparison between the baseline Inception-v3(ft) method and the proposed Inception-v3(ft)+FP-CNN(ft) methodon Food-101 dataset.Food-101 dataset [8]. AlexNet(ft) [102] and Inception-v3(ft) [37] are fine-tunedwith Food-101 training data. We can notice that there is an 20% performancegap between AlexNet(ft) and Inception-v3(ft). Compared with AlexNet model,Inception-v3 model is more powerful with much more convolution layers (54 vs5). As a result, the proposed FP-CNN(ft) with AlexNet(ft) approach improvedAlexNet(ft) from 68.44% to 76.45%. For Inception-v3(ft) method, [37] reportedaccuracy of 88.28%. The implementation in this work achieved an accuracy of86.25%. This could be caused by different image pre-processing methods andweights updating methods. To better evaluate the proposed FP-CNN(ft) method,we mainly compare the performance with our implementation of Inception-v3(ft).As a result, Inception-v3(ft)+FP-CNN(ft) achieved 87.96% and improved overInception-v3(ft) by 1.71%. Figure 2.4 shows the per-category result of 20 ran-domly selected categories of the baseline Inception-v3(ft) method and the pro-39posed Inception-v3(ft)+FP-CNN(ft) method. As can be observed, for most foodcategories, the proposed method improved the performance except “takoyaki”. Inthe presented categories, “edamame” is the easiest category while “tuna tartare”and “filet mignon” are the hardest categories.2.3.2 UEC Food 100 & 256UEC Food 100 contains 100 Japanese food image categories with more than 100images per category. Some images contain multiple classes of food. Boundingbox information is provided in the database. UEC Food 256 has similar structureas UEC Food 100 but with 256 Japanese food image categories. 
In the experi-ments we follow the 5-fold cross-validation setting as stated in the previous works[51, 52, 102]. For each dataset, 80% of data are randomly selected for fine-tuningof the AlexNet(ft), Inception-v3(ft) and FP-CNN(ft) models, as well as trainingof the final classifier. For food part data mining scheme we applied strategy I.After fine-tuning, FP-CNN(ft) models achieve 55.71% and 47.23% of accuracyon food part validation data from UEC 100 dataset and UEC 256 dataset respec-tively. Since UEC Food 100 & 256 have less number of images in each class,here we omit spatial pyramid matching and applied max pooling on the originalfood image. The dimension of the final image representation is (1+ 1)× 4096 =8192 for AlexNet(ft)+FP-CNN(ft) method, and 2048+4096= 6144 for Inception-v3(ft)+FP-CNN(ft) method. The final classifier is a DNN with 2 FC layers.Table 2.3 shows the experimental results on UEC Food 100 & 256. First, wetested the proposed FP-CNN approach with off-the-shelf DCNN features. ColorFV and RootHOG FV are low-level based approaches reported in [52]. Superpixelsbased Food Part (SFP) is a mid-level based food image recognition approach rely-ing on superpixels segmentation [110]. DCNN features approach in [52] employspre-trained DCNN model. It can be noted that the proposed FP-CNN approach40Table 2.3: Recognition accuracy results on the datasets UEC Food 100 &256.Methods UEC 100 UEC 256low-level & mid-levelColor FV [52] 53.04 % 41.60 %RootHOG FV [52] 50.14 % 36.46 %SFP [110] 60.50 % -pre-trained DCNNDCNN features [52] 57.87 % 43.98 %AlexNet+FP-CNN 69.43 % 58.29 %ELC SELC [70] 84.31 % -fine-tuned DCNNAlexNet(ft) [102] 75.25 % 63.64 %Inception-v3(ft) [37] 81.45 % 76.17 %AlexNet(ft)+FP-CNN(ft) 82.46 % 75.93 %Inception-v3(ft)+FP-CNN(ft) 86.51 % 78.60 %outperformed the above approaches on UEC Food 100 & 256 datasets by 11.56%and 14.31%. We then tested the proposed FP-CNN(ft) approach. When com-paring with the proposed approach with pre-trained DCNN model, the proposedAlexNet(ft)+FP-CNN(ft) improves the performance by 13.03% on UEC Food 100dataset and 17.64% on UEC Food 256 dataset.The proposed AlexNet(ft)+FP-CNN(ft) approach improved the AlexNet(ft) ap-proach by 7.21% on UEC Food 100 dataset and 12.29% on UEC Food 256 dataset,and also achieved comparable performance with inception-v3(ft) approach whichhas a deeper architecture. The proposed Inception-v3(ft)+FP-CNN(ft) approachimproved Inception-v3(ft) on both UEC Food 100 & 256 dataset by 5.06% and2.43% respectively, and achieved the state-of-the-art performance. Figure 2.5 showsthe per-category result of 20 randomly selected categories of the baseline Inception-v3(ft) method and the proposed Inception-v3(ft)+FP-CNN(ft) method. As can beobserved, the proposed method consistantly improved the performance for most41Figure 2.5: Per-category result comparison between the baseline Inception-v3(ft) method and the proposed Inception-v3(ft)+FP-CNN(ft) methodon UEC Food 256 dataset.of the food categories. In the presented categories, “cabbage roll”, “eight treasurerice” and “laulau” are the easiest category while “sour prawn soup” is the hardestcategories.2.3.3 Performance AnalysisSelection of number of clusters K The key parameter of FP-CNN(ft) model isthe number of food part clusters K. To select food part cluster number K, wetested different values of K on UEC Food 100 dataset and summarized the resultsin Fig. 4. With DNN classifier, the proposed FP-CNN(ft) with AlexNet(ft) methodachieves the best performance of 82.46% with K = 10. 
Thus we select K = 10 forUEC Food 100 and 256 dataset. For Food-101 dataset which has more images percategory (1000) we select K = 20.Pooling strategy and final classifiers We further compared different pooling42Figure 2.6: Performance vs. K of the proposed FP-CNN(ft) with AlexNet(ft)method on the UEC Food 100 database.strategies and final classifiers on each dataset. As shown in Table 2.4, max poolingachieves better result on UEC Food 100 and 256 datasets, while spatial pyramidpooling achieves better result on Food-101 datasets. This can be explained thatFood-101 images have large field of view, more background and more complexstructures while UEC Food images are close-in, with less background and lesscomplexity. We also tested 2 different classifiers: linear SVM and DNN with 1hidden layer on UEC Food 100 and 256 datasets. DNN with 1 hidden layer per-forms slightly better than linear SVM on UEC Food 100 (82.46% vs 81.25%) andUEC Food 256 (75.93% vs 74.57%).Cross domain performance In this experiment, we fine-tune AlexNet(ft) andAlexNet-FP models with Food101 dataset, and test on UEC Food 100 dataset. The43Table 2.4: Selection of pooling strategies.Datasets Max Average Spatial PyramidUEC Food 100 82.46% 78.60% 78.12%UEC Food 256 75.93% 70.12% 69.75%Food-101 75.51% 70.23% 76.45%result is 78.34% which is worse than training with UEC Food 100 data (82.46%).This can be explained by domain shifting [67]. The food feature extractors learnedfrom source domain (Food-101) can be used to extract features on target domain(UEC Food 100) and achieve reasonable results. However, to achieve better results,fine-tuning or domain adaptation on target domain is needed [67].Computational cost The low-level methods, Color FV & RootHOG FV [52]are fast with a small memory footprint, and have around 0.01 seconds/image onCentral processing unit (CPU) implementation. However, the performance is poordue to the low complexity of the methods. The mid-level methods, with muchhigher complexity and better performance, result in around 1-2 seconds/image ofspeed on CPU implementation (0.8 seconds/image on 8 core CPU for RFDC [8]and 1.5 seconds/image on a single CPU for SFP [110]). On the other hand, benefit-ing from recent developed Graphics processing unit (GPU) computing hardwares,deep learning methods achieved high speed on GPU despite the large complexityof deep learning models. As a result, on a single Maxwell Nvidia TitanX GPU,AlexNet model with Caffe [47] implementation achieves around 0.008 second-s/image, and Inception-v3 with optimized Tensorflow [3] implementation achievesaround 0.007 seconds/image. However, the training efforts for Inception-v3 modelare higher since Inception-v3 model has more much more convolution layers (54 vs5). In practice, AlexNet-based approach might be prefered. Based on our currentimplementation (without further optimization), the required computational time of44the proposed FP-CNN method is about 0.25 seconds since it explores more mid-level information, and achieves better performance in return. The memory foot-print for FP-CNN is 230 mb which is same as AlexNet model. In the future weplan to optimize the FP-CNN model by introducing a network architecture withmuch smaller input size (i.e. 48× 48) since the food part data are small, whilecurrently we deploy AlexNet model with 227× 227 input. 
This direction of opti-mization can potentially improve the speed to be comparable to that of the AlexNetapproach and with a much smaller memory footprint.2.4 DiscussionFrom the above result, the following observation can be made:• Compared with existing mid-level part based approaches RFDC [8] and SFP[110], the proposed FP-CNN approach improved the performance since ourproposed framework employs DCNN features to extract a more powerfulfood part representation.• Compared with food image recognition approach based on pre-trained DCNNfeatures [52], the proposed FP-CNN approach improved the performance onUEC Food 100 & 256 datasets with a large margin. The reason is that ourproposed framework effectively explores mid-level deformable food partsinformation rather than simply extracting DCNN features from the wholefood image.• Compared with existing fine-tuned DCNN approaches, the proposed AlexNet(ft)+FP-CNN(ft) approach improved over AlexNet(ft), while Inception-v3(ft)+FP-CNN(ft) improved over Inception-v3(ft). This demonstrates that FP-CNNmodel and deep learning model can provide complimentary information, and45integrating FP-CNN model with deep learning models can improve the clas-sification performance for food image.2.5 ConclusionIn this work, we focused on the problem of food image recognition and proposeda novel framework to better adopt DCNN models to food image applications. Theproposed framework jointly exploring the advantages of both the mid-level basedapproaches and the DCNN approaches. We first integrated the off-the-shelf DCNNfeatures with the mid-level based approach as a powerful mid-level image partrepresentation. To further improve the performance of the proposed framework,fine-tuning DCNN model with the target dataset is proposed as an effective ap-proach. To our knowledge, the proposed method is the first attempt in the literatureto tackle the challenge of fine-tuning the DCNN model with unlabeled mid-levelfood image part data. We presented a novel food part label mining scheme with 3strategies to mine part-level labels from unlabeled food parts data. We evaluatedthese 3 strategies and found that the simple k-means clustering can achieve the bestperformance. Finally, for each food image dataset, we trained a DCNN model onfood part dataset with the mined part-level labels. The experiments on 3 benchmarkfood image datasets showed that the proposed approach can improve the baselineDCNN fine-tune approach with a large margin without employing many differentfeatures or very deep DCNN architectures. Future work will be focusing on 1)developing more compact and cost efficient deep learning models to enable mo-bile based DCNN applications, 2) domain adaptation across different cuisines anddifferent cultures.46Chapter 3Encoding Prior Knowledge viaFeature Extraction: MRI basedADHD DiagnosisIn this chapter, we explore the approach of pre-processing and feature extractionbased on prior knowledge to reduce the high dimensionality and complexity of rawdata compared to limited number of labels, and illustrate on MRI-based ADHDdiagnosis.ADHD is increasingly regarded as a neurodevelopment disorder [77], and var-ious neuroimaging techniques, such as MRI and Positron Emission Tomography(PET), have been adopted in ADHD studies [100]. Due to MRI data’s high di-mensionality and noise, manually diagnosing ADHD via MRI data is extremelydifficult. 
Recently, the deep learning method DBN has been adopted to classify ADHD from TDC using extracted feature vectors, which neglects the spatial information of the brain. Since the layers in a DBN are fully connected, the DBN is more prone to overfitting with limited MRI data. On the other hand, the CNN is more effective in 2-D and 3-D imaging-based applications and can preserve spatial patterns from the training images. Nevertheless, it is extremely difficult to train CNN models with a limited number of high-dimensional MRI scans. In this chapter, we encode prior knowledge into the CNN training framework through 3D FMRI and MRI feature extraction, and propose a multi-channel 3D CNN architecture that achieves state-of-the-art diagnosis accuracy.

This chapter is structured as follows: Section 3.1 introduces the motivation and objectives of this chapter; Section 3.2 presents the experiment ADHD dataset and the data pre-processing methods; Section 3.3 presents the proposed 3D CNN framework; Section 3.4 reports the results and compares the proposed method with the state-of-the-art methods; Section 3.5 discusses the strengths and potential limitations of the proposed method; Section 3.6 draws the conclusion and introduces future work.

3.1 Motivation and Objectives

In order to automatically diagnose neurodevelopment disorders, such as ADHD and Parkinson's disease, a multitude of features extracted from FMRI have been investigated in the literature. These features can be categorized into voxel-level features and region-level features. Yang et al. investigated the Amplitude of Low Frequency Fluctuations (ALFF) [108] and revealed that abnormal frontal activities at the resting state are associated with the underlying physiopathology of ADHD [104]. Long et al. extracted REHO [107] and ALFF from FMRI data and employed these features to classify early Parkinson's disease [66]. Although these voxel-level features are simple and intuitive to extract, they usually have high dimensionality, thus feature selection is generally needed before classification [93]. Alternatively, many researchers considered certain predefined regions and extracted region-level features from each region or pair of regions. For instance, Eloyan et al. investigated the functional connectivity between five regions
MRI and FMRI provide complementaryinformation about the brain change. Therefore, to further improve the classifica-tion accuracy, we set forth to develop an ADHD classification method using bothMRI and FMRI. The combination of different measures (i.e., FMRI and MRI) mayincrease the reliability of the diagnosis model [5].Regarding deep learning methods on ADHD diagnosis, Kuang et al. firstly in-troduced a DBN with three hidden layers to discriminate ADHD with frequencydomain features from FMRI[60]. However, the conventional one-dimension neuralnetwork (e.g., DBN), which employs a vector as the input, generally neglects thetopology information of the input data. It is worth noting that, due to the 3D natureof neuroimaging data such as MRI and PET, to make a clinical decision, neurolo-49gists or radiologists need to navigate through 2D planes. They investigate the local3D patterns of neural images and combine these local information across the wholebrain [93]. It was shown that local 3D patterns from the whole brain, rather thanfrom an individual voxel or predefined region, may contribute to the diagnosis ofneurological disorders [55, 82]. Inspired by the way that radiologists examine brainimages, in this chapter we design a 3D CNN model to learn hierarchical spatial pat-terns to diagnose ADHD from FMRI and MRI features. To overcome the challengeof limited data with high dimensionality and high variances in raw data, we firstencode prior knowledge into 6 types of 3D features as the input of deep learn-ing models. More specifically, we extract 3 low-level features from FMRI data:REHO, FALFF [114] and VMHC [115]; and 3 voxel-based morphometry features:GM, WM and CSF density in MNI space from MRI data. Furthermore, we employthe 3D CNN model [45, 78] to learn latent 3D local patterns with limited numberof labeled data. We discover that FMRI and MRI features are complementary, anddesign a multi-modality architecture to enhance the classification accuracy. Theperformance on the independent hold-out testing dataset shows that the proposed3D CNN approach outperforms the state-of-the-art studies in the literature, evenwith less training samples.In summary, the main contributions of this chapter are three folds:1. To train deep learning models over limited MRI data with high dimension-ality and high data variances, instead of directly use raw MRI data as input,we extract 3 FMRI and 3 SMRI features based on prior knowledge.2. We retain the spatial information throughout the learning process. Ratherthan representing the low-level features (including REHO, FALFF and VMHC)as a vector, we keep these low-level features in 3-order tensors (also called3-dimensional array). Inspired by the way that radiologists examine brain50images, we design 3D CNN models to learn hierarchical 3D patterns fromhigh-dimensional low-level features and show promising results.3. We investigate and summarize both FMRI and MRI features’ strength in thediagnosis of ADHD. We find that 3D CNN using GM density from MRIachieves the highest classification accuracy on ADHD-200 hold out testingdataset.4. We find that FMRI and MRI features are complementary, and design a multi-modality 3D CNN architecture to combine features from both FMRI andMRI. 
The proposed multi-modality 3D CNN approach achieves state-of-the-art accuracy of 69.15% on the holdout testing data of ADHD-200 globalcompetition, demonstrating the importance of incorporating both structuraland functional images for diagnosis of neurodevelopment disorders.3.2 Experiment DatasetThe FMRI data analyzed in this chapter is from the ADHD-200 consortium. Ini-tially, they made available a large training dataset consisting of 776 FMRI scansand T1-weighted structural scans. Among them, 491 are obtained from typicallydeveloping individuals and 285 from patients with ADHD (ages: 7-21 years old).Characteristic information of subjects are also provided, including age, gender,handedness and IQ scores. The data was collected by 8 institutions around theworld and was shared anonymously without any protected health information inaccordance with the Health Insurance Portability and Accountability Act (HIPAA)guidelines and 1000 Functional Connectomes Project (FCP) protocols [26]. We re-fer to this dataset as the “original training dataset” excluding 108 subjects whoseFMRI data were regarded as with the ‘questionable’ quality by the data curators.For the ADHD-200 Global competition, the ADHD-200 consortium released a51Table 3.1: Some details of the datasets utilized in this chapter.Dataset ADHD (male) TDC (male) In Total (male)Original training dataset 239 (188) 429 (225) 668 (413)Hold-out testing dataset 77 (60) 94 (46) 171 (106)Preprocessed training dataset 197 (158) 362 (190) 559 (348)hold-out dataset from 94 TDC and 77 ADHD patients as well as 26 participantswithout diagnostic information. We refer the subset of this dataset as the “hold-outtesting dataset” consisting of 171 subjects for whom diagnostic data were released.Details of scan parameters, diagnostics criteria and other site-specific protocols areavailable at http://fcon 1000.projects.nitrc.org/indi/adhd200/.Raw data sharing demands intensive coordinating efforts, huge manpower andlarge-amount data storing/management facilities. In addition, the preprocessing ofmedical images always requires professional medical knowledge and thus it mayimpede other scientific communities (such as machine learning experts) to join inthe field of neuroimaging. To address these concerns, Chaogan et al. initiated theR-FMRI maps project (http://mrirc.psych.ac.cn/RFMRIMaps) [2] and encouragedscientists to share the preprocessed data through this project. For the ADHD-200dataset, they preprocessed the entire hold-out testing dataset and a subset of theoriginal training dataset. In this chapter, we refer this as the “preprocessed trainingdataset” and train the 3D CNN based on this dataset. For details of these datasets,please see Table 3.1.All resting-state FMRI images were preprocessed using Data Processing Assis-tant for Resting State FMRI (DPARSF) programs [101]. The following steps wereperformed:521. Slice timing correction;2. Realign and reslice correction of head motion for each volume relative toinitial one;3. Regress out the nuisnace covariates, such as regressing out the head motioneffects from the realigned data;4. Spatially coregistered (normalized) to standardized space;5. Voxel-wise band pass filtering (0.01-0.1Hz, which is regarded as the tradi-tional bandpass frequency range for FMRI);6. Normalization of anatomic images to MNI template space using unified seg-mentation of anatomic images;7. 
Smoothing with a 4mm Full Width at Half Maximum (FWHM) Gaussiankernel.3.3 Proposed MethodIn order to automatically diagnose ADHD, after data preprocessing steps, ourframework starts by extracting low-level 3D FMRI and MRI features as illustratedin Figure 3.2. The CNN networks and softmax classifer are then trained to clas-sify the ADHD cases from the TDC cases. In this section, we will present ourautomatic ADHD diagnosis framework in detail.3.3.1 Low-Level Feature Extraction based on Prior KnowledgeConsidering the fact that the number of subject samples is still limited relativeto millions of parameters in DNN, we first encode our prior knowledge voxel-wisely and extract 3 types of popular low-level features from FMRI scans, includ-53Figure 3.1: Illustration of the voxels within the whole brain. Each color rep-resents a specific brain region defined by the AAL atlas.ing REHO, FALFF and VMHC. We further exclude boundary areas of these threetypes of features (which are filled with zeros) and extract a cube with the size of47× 60× 46. All voxels within the brain are presented graphically in Figure 3.1by color-coding the region that each voxel belongs to based on the AAL atlas [98].REHO maps local brain activity across the whole brain and has been used todetect the abnormal neural activity in children with ADHD [17]. It measures thefunctional synchronization of a given voxel with its nearest neighbors.FALFF has been successfully utilized to detect the abnormal spontaneous brainactivity of various neuropsychiatric disorders, such as ADHD, Parkinson’s dis-ease and schizophrenia. It measures the ratio of power spectrum of low-frequency(0.01Hz-0.1Hz) range to that of the entire detectable frequency range. FALFF is54Figure 3.2: A flowchart for ADHD classification based on FMRI and MRIusing 3D CNN.55the normalized ALFF. It provides a more specific measure in detecting spontaneousbrain activities [114].Functional homotopy is a fundamental characteristic of the brain’s functionalarchitecture. In this chapter, VMHC is evaluated, which quantifies functional ho-motopy by providing a voxel-wise measure of connectivity between hemispheres.Recently, VMHC was used to analyze the group difference between children withand without ADHD [90].In addition, 3D low-level morphological features are extracted through voxel-based morphometry (VBM) analysis of high-resolution T1-weighted images. Inthis chapter, we also employ the density of GM, WM and CSF in MNI space asthe input of the 3D CNN. GM includes regions of the brain involved in musclecontrol, sensory perception and self-control [75]. Majority of the brain’s neuronalcell bodies are contained in GM. In T1-weighted image, GM appreas as dark gray.WM is composed of bundles, which connect various GM areas and carry nerve im-pulses between neurons. In T1-weighted image, WM appears as light gray. CSF isa body fluid found in the brain which provids mechanical and immunological pro-tection to the brain inside the skull. In T1-weighted image, CSF appreas as black.These three kinds of morphological features are derived from image segmentation[29]. After segmentation, each voxel contains three measures of the probabili-ties, according which it belongs to specific segmentation classes, corresponding toGM, WM and CSF respectively. 
We further exclude boundary areas of these threetypes of features (which are filled with zeros) and extract a cube with the size of90×117×100.MRI data preprocessing and feature extraction were performed with DPARSF.All the data and features used in this chapter are publicly available and can bedownloaded through the R-FMRI maps project [2]. It is worth mentioning thatother types of features can also be incorporated into the proposed approach. We56particularly consider the above 6 types of low-level features to illustrate the pro-posed method mainly due to their popularity in MRI studies. Figure 3.2 shows theflowchart for ADHD classification based on FMRI and MRI using 3D CNN.3.3.2 3D Convolutional Neural NetworksSimilar to the traditional deep learning architectures, CNN models are hierarchicalarchitectures where several convolutional layers are stacked on top of each other.This ‘deep’ stacking strategy is inspired by the hierarchical structure of humanbrain. Traditional CNNs have 2D convolutional kernels for applications on 2D im-ages. However, it is challenging to apply 2D CNN on 3D data because convolutionsin 2D CNN only can capture 2-dimensional spatial information, and neglect the in-formation along the third dimension. To address this concern, Ji et al. extended theidea of 2D CNN used for 2D images to a 3D convolution in both space (2D) andtime for video classification [45]. It can effectively incorporate the motion infor-mation in video analysis [45, 78]. Similar to video data (x,y,t), the extracted lowlevel features mentioned in Section 3.3.1 have 3 dimensions (x,y,z). Therefore, inthis chapter, we employ 3D convolutions to learn the 3D local patterns across thewhole brain to assist the diagnosis of ADHD.Compared to fully-connected DBNs, the convolutional layers have two mainproperties: partially connected and weight sharing [82]. In the convolutional layer,unlike in the DBN, an output neuron is connected to only a local region of the inputfeature maps. This property reduces the number of parameters and thus making theCNN less prone to be overfitting. Another benefit of this character is that the con-volutional layer can retain local spatial patterns which benefits image related tasks.The weights sharing property means weights in convolutional kernels are sharedacross the whole spatial region of the feature maps which further reduces the num-ber of parameters and increases the generalization capability of the network. It is57(a)(b)Figure 3.3: Differences between the 2D convolution and the 3D convolution.(a) 2D convolution: h1,1 = ∑3x=1∑3y=1Wx,yVx,y+ b; (b) 3D convolution:h1,1,1 =∑3x=1∑3y=1∑3z=1Wx,y,zVx,y,z+b, whereW is the weight of the ker-nel, V is the feature map in the previous layer and b is the bias term.58common to periodically insert a pooling layer between successive convolutionallayers in a CNN. The pooling operation reduces the spatial size of the feature mapsand the number of parameters. As shown in Figure 3.3, the 2D CNN are appliedon 2D features maps to extract the spatial features, whereas, to detect the 3D localpatterns in our case, 3D kernels are convolved over 3D feature cubes. 
More specifically, for the case of the 3D CNN, the value at position (x, y, z) on the jth feature map in the ith layer is obtained as follows,

h_{x,y,z}^{i,j} = f\big( (W^{i,j} * V^{i-1})_{x,y,z} + b^{i,j} \big)    (3.1)

where W^{i,j} and b^{i,j} are the weights and the bias for the jth feature map respectively, V^{i-1} denotes the set of input feature maps from the (i − 1)th layer connected to the current layer, f is the non-linear function and ∗ is the convolution operation. In the training process, all the weights of the convolutional kernels in the CNN, W, together with the biases b, are optimized with respect to a given loss function.

A 3D convolutional layer is effective in learning local patterns and exploring the spatial information across the 3D input images [55]. It should be mentioned that the complexity of the learned local patterns is closely related to the number of 3D convolutional kernels in the network. With more kernels, the network can learn deeper and more powerful features, but on the other hand, the network will be more prone to overfitting. A general principle is that the network should have more convolutional layers to learn deeper features, and fewer feature maps in each layer to limit the overall complexity [39, 44].

3.3.3 Single Modality 3D CNN Architecture

For the three FMRI features and three MRI features mentioned in Section 3.3.1, we design a universal single modality 3D CNN architecture to diagnose ADHD.

Figure 3.4: Architecture of the proposed 3D CNN for diagnosing ADHD. It utilizes three types of 3D features across the whole brain as the inputs, including REHO, FALFF and VMHC. This architecture contains 6 layers, including four convolutional layers and two FC layers.

The architecture takes either FMRI features (including REHO, VMHC and FALFF) with the dimension of 47 × 60 × 46, or MRI features (including GM, WM and CSF) with the dimension of 90 × 117 × 100, as input. In this chapter, we first reduce the feature map size with max-pooling (2 × 2 × 2 for FMRI features and 4 × 4 × 4 for MRI features), which reduces the three-dimensional spatial resolution of the input to 23 × 30 × 23 for FMRI features and 22 × 29 × 25 for MRI features respectively. In our preliminary experiments, we observed that this setting can boost the performance and improve network generalization by greatly reducing the number of parameters. The computational cost reduces dramatically as well. We then train 32 different 3D kernels with size of 5 × 5 × 5 on all three channels at the first convolutional layer C1. We further downsample the feature maps with max-pooling. The output feature maps after these layers are made up of 32 feature maps of size 9 × 13 × 9 and 9 × 13 × 11 for FMRI and MRI features respectively. Three additional convolutional layers, C2, C3 and C4, are further employed to learn deeper features with 64 output maps. After these four convolutional layers, the output feature maps are fully connected to 512 neurons in F5. F6 is the last layer and is topped with the softmax activation function to output the probabilities of the two classes, ADHD and TDC.

It is worth mentioning that, similar to most deep learning problems, the choice of the specific network architecture is generally problem-dependent. In our preliminary study, we have tested a variety of 3D architectures with different numbers of convolutional layers and kernel sizes.
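For reference, the following tf.keras sketch assembles the single-modality branch described above for the FMRI input size. Only the ingredients stated in the text are taken as given (the initial max-pooling, 32 kernels of 5 × 5 × 5 in C1, three further convolutional layers with 64 maps, the 512-neuron F5, the 2-way softmax F6, batch normalization and dropout); the C2-C4 kernel size, the padding choices and the ReLU activation are assumptions, so this should be read as an approximation rather than the exact specification.

import tensorflow as tf
from tensorflow.keras import layers

def single_modality_3dcnn(input_shape=(47, 60, 46, 3), num_classes=2):
    # Sketch of the FMRI branch; kernel sizes for C2-C4, padding and ReLU are assumptions.
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.MaxPooling3D(pool_size=2)(inputs)                  # 47x60x46 -> 23x30x23
    x = layers.Conv3D(32, 5, activation="relu")(x)                # C1: 32 kernels of 5x5x5
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling3D(pool_size=2)(x)                       # -> 9x13x9, as in the text
    for _ in range(3):                                            # C2-C4: 64 feature maps each
        x = layers.Conv3D(64, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(512, activation="relu")(x)                   # F5
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # F6: ADHD vs. TDC
    return tf.keras.Model(inputs, outputs)

The MRI branch would differ only in its input shape (90 × 117 × 100 × 3) and the 4 × 4 × 4 initial pooling; the multi-modality model of the next subsection concatenates the two 512-dimensional branch outputs before the final 2-way FC layer.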
The 3D architecture described above yields the best performance on the ADHD-200 dataset and outperforms the existing methods.

3.3.4 Multi-Modality 3D CNN Architecture

Although the single modality 3D CNN on FMRI or MRI improves the classification performance over the existing methods in our experiments, one should note that FMRI and MRI carry significantly different information. In addition, considering the complicated pathologic process of ADHD, morphometric and functional changes may happen simultaneously in the brain of ADHD children. Therefore, these two types of features can be complementary, and a combination of them can boost the performance in diagnosing ADHD. In this section we present a multi-modality 3D CNN architecture which takes both FMRI and MRI features as input in a 3D CNN training framework.

Figure 3.4 demonstrates the proposed multi-modality 3D CNN architecture. The architecture contains two separate branches for FMRI features (right) and MRI features (left). Each branch takes either FMRI or MRI features as input and learns a 512-dimensional feature vector through back-propagation. The branches have the same CNN structure as in the previous single modality CNN architecture: 4 convolutional layers, 2 max-pooling layers and 1 FC layer. The two output 512-dimensional feature vectors are then concatenated and fed into an FC layer with output size 2, which stands for the 2 classes (ADHD and TDC). This multi-modality architecture has three advantages: 1) FMRI and MRI features have feature maps of different sizes, thus a single modality architecture is not able to combine these two modalities; 2) since FMRI and MRI carry different information, the two branches are able to learn hierarchical CNN features for FMRI and MRI separately without interfering with one another; 3) this architecture also enables joint training of the feature extractor and the classifier, which has proven to be more effective than training them separately. As a result, the proposed multi-modality 3D CNN architecture yields better performance in our experiments.

3.3.5 Training of the 3D CNN Architecture

The training of the above architectures is carried out by optimizing a loss function via updating the network's parameters {W, b}. In this chapter, we select the cross-entropy as the loss function, which is defined as follows,

\mathcal{L}(W, b) = -\frac{1}{N} \sum_{n=1}^{N} \Big( y_n \ln H_{W,b}(x_n) + (1 - y_n) \ln\big(1 - H_{W,b}(x_n)\big) \Big)    (3.2)

where N is the number of samples, x_n and y_n are the input and corresponding label of the nth sample, H_{W,b}(·) is the function learned by the network and H_{W,b}(x_n) represents the output of the neural network given the input x_n. The weights of the 3D convolutional networks are randomly initialized based on the Xavier initialization strategy [46]. Then the 3D CNN architecture is trained via SGD with mini-batches of 20 training samples. The weights W are updated for every mini-batch as:

\nabla_{W_t} = \big\langle \nabla_{W_t} \mathcal{L}(W_t) \big\rangle_{\text{mini-batch}}
v_{t+1} = \gamma v_t - \alpha \nabla_{W_t}
W_{t+1} = W_t + v_{t+1}    (3.3)

where v_t is the current velocity vector, α is the learning rate and γ is the momentum. Momentum accumulates the velocity vector in directions of persistent reduction in the objective across iterations and can accelerate the training process.

The large number of parameters in our network makes it susceptible to overfitting. Besides taking advantage of the intrinsic properties of the 3D CNN architecture, such as partial connectivity, weight sharing and pooling, we additionally adopt several methods to avoid overfitting. We use the dropout technique [91] with a probability of 0.5 in the FC layers.
The large number of parameters in our network makes it susceptible to overfitting. Besides taking advantage of the intrinsic features of the 3D CNN architecture, such as its partially connected character, weight sharing and pooling, we additionally adopt several methods to avoid overfitting. We use the dropout technique [91] with a probability of 0.5 in the FC layers. During dropout, the inputs of layers F5 and F6 are randomly set to 0 with a probability of 0.5. This dropout procedure is a variant of data augmentation and has been shown to be an effective way to reduce overfitting in deep neural networks [91]. Batch normalization (BN) is a regularization technique that provides faster and better convergence of network training. We add BN layers after every convolutional layer and FC layer in our architectures.

3.4 Experiments and Results

3.4.1 Experiment Setup

To evaluate the proposed single and multi-modality 3D CNN architectures, we use the ADHD-200 dataset, where 559 subjects are used for training the 3D CNNs. We then test the performance of the trained models on the hold-out testing dataset with 171 subjects. Percentage prediction accuracy of the two-class diagnosis (TDC vs. ADHD) is used for evaluation and comparison of the proposed methods and the methods reported in the literature. In the experiment, we report 6 single-feature approaches, one for each of the FMRI features (REHO, FALFF and VMHC) and MRI features (GM, WM and CSF); 2 combined approaches using the 3 FMRI features and the 3 MRI features separately (FMRI-all and MRI-all); and 2 multi-modality approaches with both FMRI and SMRI features (All and FALFF+GM). We split the training dataset into 4 folds for cross validation. We set the learning rate to 0.0001, and decay the learning rate after every 20 epochs of training by a factor of 0.5. We set the batch size to 20 and train each approach for 100 epochs to ensure that training has converged. Considering the effect of the dropout technique as well as the random initialization of the network parameters, we repeat the experiments 50 times and report the average accuracy.

3.4.2 Comparison of Single and Multi-modality Architectures

Figure 3.5 shows the statistical results of the 3D CNN approaches. First, we compare the results of the single modality approaches. The single modality 3D CNN architecture with FALFF achieves a mean accuracy of 66.04%, while the model with GM achieves 65.86%. Surprisingly, combining all 3 FMRI features or all 3 SMRI features does not yield better performance than FALFF and GM alone. One possible explanation of this phenomenon is that REHO, VMHC, WM and CSF may not contain additional information complementary to FALFF and GM, so combining different FMRI or SMRI features within the single modality 3D CNN architecture does not benefit the ADHD classification task.

We then evaluate the multi-modality approaches. Two approaches are evaluated in this test: 1) all 3 FMRI features and 3 SMRI features are used as input; 2) since FALFF and GM achieve superior performance, we combine only these two features. As a result, FALFF+GM improves over FALFF and GM by a large margin, and achieves the state-of-the-art performance on the ADHD-200 dataset with an average accuracy of 69.15% and a best accuracy of 71.49%. The reduced variance indicates that the training of the multi-modality 3D CNN is more stable than that of the single modality CNNs. As stated in Section 3.3.4, the multi-modality 3D CNN architecture is able to learn FMRI and SMRI convolutional features separately through two separate CNN branches, and combine the learned high-level features to boost the classification accuracy. The results demonstrate the superior performance of the proposed multi-modality 3D CNN architecture.

3.4.3 Comparison with Existing Methods

Table 3.2 shows the results of related works and the proposed approaches when they are evaluated on the ADHD-200 hold-out testing dataset.
Figure 3.5: Statistical results of the 3D CNN approaches corresponding to different features over 50 individual runs. First, we evaluate single modality approaches where FMRI features and SMRI features are utilized individually. We further test the performance of multi-modality approaches where FMRI and SMRI features are combined via the proposed multi-modality 3D CNN architecture. The red asterisks and lines represent the average and median values, respectively. The edges of the box are the lower and upper quartiles.

In [17], Dai et al. employed features from SMRI and FMRI, including REHO, functional connectivity, GM and cortical thickness. They integrated multi-modal image features using MKL and obtained a diagnosis accuracy of 61.54%. Ghiassian et al. adopted HOG features and a feature selection process; they evaluated several classifiers and found that the best performance, 62.57%, was obtained via SVM [30]. Dey et al. first selected a sequence of highly active voxels and constructed the connectivity network between them, obtaining an average accuracy of 62.81% on the hold-out testing dataset [19]. In [35], Guo et al. explored the functional connectivity between voxels and obtained an average accuracy of 63.75% based on a social network method. These results represent the highest diagnostic performance reported on the ADHD-200 hold-out test dataset. As a comparison, our proposed single modality architecture, which only takes FALFF or GM as input, yields better performance than the existing methods on the hold-out testing dataset. When combining FALFF and GM with the multi-modality 3D CNN, the proposed method achieves a classification accuracy of 69.15%, which improves over the existing methods by a large margin.

As in [60], we also test the proposed method on the individual subsets of the ADHD-200 hold-out testing set from Peking University (PKU), Kennedy Krieger Institute (KKI) and New York University Child Study Center (NYU) over 50 individual runs. As shown in Table 3.3, the average accuracy of the proposed method is superior to the best result of the ADHD-200 competition in [1] and to that of the DBN method [60], especially on PKU and NYU. We also include the number of subjects in each subset. The differences between the performances on different subsets also suggest the heterogeneity of the entire dataset. In summary, our proposed 3D CNN-based architecture achieves the state-of-the-art classification performance in ADHD classification.

3.5 Discussion

It should be mentioned that, compared with the astonishing results of published studies [48, 111, 112] using neural images not obtained from the ADHD-200 competition, the diagnostic performance based on the ADHD-200 competition dataset seems rather inferior. However, considering the heterogeneity of the clinical manifestation of ADHD, it is always hard to generalize the findings of studies that utilized a small number of samples [21].
Table 3.2: Diagnosis performance comparisons between the proposed method and state-of-the-art methods based on the ADHD-200 dataset.

Method             Features                                                           Classifier                Accuracy
[17]               REHO, functional connectivity, GM and cortical thickness           Multi-kernel learning     61.54%
[30]               Histogram of oriented gradients                                    Support vector machine    62.57%
[19]               Functional connectivity networks                                   Support vector machine    62.81%
[35]               Functional connectivity, assortative mixing and synchronization    Support vector machine    63.75%
Proposed method    FALFF                                                              Single-modality 3D CNN    66.04%
Proposed method    GM density                                                         Single-modality 3D CNN    65.86%
Proposed method    FALFF, GM density                                                  Multi-modality 3D CNN     69.15%

Table 3.3: The diagnosis performance on the hold-out testing data from different sites.

Site (number of subjects)    ADHD-200 Competition [1]    DBN [60]    The proposed method
PKU (51)                     51.05%                      54.00%      62.95%
KKI (11)                     61.90%                      71.82%      72.82%
NYU (41)                     35.19%                      37.41%      70.50%

The ADHD-200 dataset is probably much more difficult to classify because of its heterogeneity and its relatively large sample size [10]. To address this concern, we feel that taking into account phenotypic information and scanner information may improve the diagnostic accuracy. In general, the development of automatic ADHD diagnostic tools from FMRI scans is challenging work, and there is still a long way to go before these tools can assist ADHD diagnosis in practice.

Compared to its counterpart, SMRI is more sensitive to the biophysical properties of brain tissue, whereas functional MRI is more sensitive to temporally changing neural activity. However, FMRI and SMRI are usually analyzed separately, and the joint information is not explored. To the best of our knowledge, our work is the first study to examine a 3D CNN model for the diagnosis of ADHD exploring both FMRI and SMRI. Considering the complicated pathologic process of ADHD, it is arbitrary to make a diagnosis based on a single modality. Functional and anatomical changes may happen simultaneously. Therefore, features from SMRI and FMRI can provide complementary information about the brain changes. Combining them leads to more rigorous pathophysiological models and further increases the diagnosis accuracy. These results indicate that multi-modality classification is a promising direction for finding neuroimaging biomarkers of ADHD.

In addition, the performance is directly affected by the selected features. Several FMRI studies suggest that ADHD is also associated with dysfunction of brain sub-networks [34, 64]. Other types of features (e.g., other FMRI features, multi-modal features) and prior knowledge (e.g., gender information) can be incorporated into the proposed approach to further improve the ADHD diagnosis performance. In the future, we will use additional features with our current classification method to further improve ADHD diagnosis performance. However, it is worth mentioning that a larger number of features utilized in a DNN requires more training samples and may also result in overfitting, especially when the number of training samples is limited.

To avoid overfitting during 3D CNN training, several methods are considered in this chapter: 1) the 3D CNN is adopted to take advantage of its intrinsic features, such as its partially connected, weight-sharing and pooling architecture; 2) we carefully designed the number of layers and feature maps to avoid overfitting while retaining sufficient capacity for the network to solve the complex ADHD diagnosis problem; 3) we performed data augmentation via the dropout technique at the FC layers, which contain most of the weights in the network.
As a result, the 3D CNN models are well trained and yield state-of-the-art classification accuracy on the ADHD diagnosis problem.

3.6 Conclusion

With the availability of the large scale ADHD-200 dataset and the inspiring success of deep learning in many recognition problems, we are motivated to develop an automatic diagnosis algorithm based on deep learning to classify ADHD vs. TDC using MRI scans. Inspired by the way that radiologists examine 3D neural images, we propose an automatic and effective 3D CNN architecture for ADHD classification which explores the complementarity of FMRI and SMRI. The proposed 3D CNN method is fundamentally different from previous attempts to classify ADHD using MRI scans. Specifically, we first encode prior knowledge into six types of 3D low-level features that have been used to diagnose ADHD in the literature, including REHO, FALFF and VMHC as well as GM, WM and CSF densities in MNI space. A 3D CNN-based strategy is then used to extract high-level features from each modality. Unlike previous methods that mostly flattened low-level features into a vector, which may neglect the potential 3D local patterns, we keep these low-level features as third-order tensors and train the 3D CNN on them. We further combine the FMRI and SMRI features with a multi-modality 3D CNN architecture which yields the state-of-the-art performance. Experimental results on the hold-out ADHD-200 testing dataset show that the proposed 3D CNN is superior to previous works in the literature, even with a smaller number of training samples.

Chapter 4

Learning CNNs with Synthetic Data Pre-Training and Pairwise Domain Adaptation: Real-Time 6DOF TEE Transducer Registration

In this chapter, we explore the approach of generating a large amount of realistic synthetic data to train deep learning models, and performing pairwise domain adaptation to generalize the trained models to real data. We illustrate this approach on 6DOF TEE transducer registration.

Given the recent development in medical imaging technologies, image-guided procedures are becoming increasingly popular with reduced invasiveness and post-procedure complications [15]. TEE and X-ray fluoroscopy are the two major imaging modalities for guiding catheter-based Structural Heart Disease (SHD) procedures [49]. TEE can capture soft tissue anatomies in great detail, while X-ray can provide high quality visualization of medical devices (Figure 4.1). Robust and accurate 6DOF pose detection and tracking of the TEE transducer from X-ray images brings the two image modalities into the same coordinate system, and thus enables advanced image guidance, e.g., fused visualization and joint analysis of anatomy and devices (Figure 4.1). Such multi-modality image fusion can greatly facilitate minimally invasive image-guided procedures. In the literature, deep learning-based 2-D/3-D registration for the TEE transducer has demonstrated superior performance [72]. Since labeled real surgical data for TEE pose estimation is limited, the CNNs in [72] are trained with synthetically generated data, which leads to poor generalization on real data and limits the capture range and computational efficiency of the registration system. In this chapter, we propose a pairwise domain adaptation method that employs limited real clinical data to enable more effective training of CNNs.
We present a hierarchical CNN framework based on the proposed pairwise domain adaptation for automatic 6DOF TEE transducer pose estimation with real-time performance and high accuracy.

This chapter is structured as follows: Section 4.1 introduces the motivation and objectives of this chapter; Section 4.2 presents the proposed hierarchical CNN architecture and pairwise domain adaptation; Section 4.3 reports the results and compares the proposed method with the state-of-the-art methods; Section 4.4 draws the conclusion and introduces future work.

4.1 Motivation and Objectives

2-D/3-D registration, which aligns the pre-operative 3-D data and the intra-operative 2-D data into the same coordinate system, is one of the key enabling technologies of image-guided procedures [72].

Figure 4.1: (a) An illustration of TEE. (https://www.elcaminohospital.org/library/transesophageal-echocardiogram) (b) An overlay example of the TEE transducer imaging cone on an X-ray image.

The modalities of pre-operative 3-D data include CT, MRI and PET of patients and CAD models of medical devices, while the intra-operative 2-D data include ultrasound and X-ray. By aligning the 2-D and 3-D data, accurate 2-D/3-D fusion can provide complementary information for advanced image-guided radiation therapy, radiosurgery, endoscopy and interventional radiology [69]. In the literature, intensity-based 2-D/3-D registration methods were first adopted for 6DOF pose estimation of the TEE transducer from X-ray images [28, 43]. In these methods, DRRs of the TEE transducer are generated from a 3D model and then iteratively registered with the X-ray image. Although intensity-based methods have been proven to be accurate, they typically have a low computational efficiency due to the iterative DRR generation (e.g., 2 FPS in [95]), and lack robustness due to the non-convexity of the image similarity metric. Manual initialization of the object pose in the close neighborhood of the correct pose is therefore often needed in these methods. Fast approximations of DRR have been proposed to accelerate intensity-based methods [38, 49]. However, the approximated DRRs often have a lower image quality, which subsequently leads to lower registration accuracy. Recent developments in CNNs for 2-D/3-D registration significantly improved robustness and computational efficiency compared to intensity-based methods [72]. However, two drawbacks limit their application in real clinical scenarios: First, they still have a relatively small capture range (i.e., 10 degrees in rotation) and therefore rely on pose initialization, which by itself is a non-trivial problem. Second, they have a large memory footprint (i.e., 2.39 GB) due to the large number (i.e., 324) of regressors employed. A fully automatic 6DOF pose estimation and tracking system for the TEE transducer was proposed recently in [95]. This method employs template matching to estimate the out-of-plane rotation parameters, which inherently has a low accuracy due to the limited template sampling density (i.e., 4-degree intervals). In addition, the self-symmetry of the TEE transducer is particularly challenging for template matching on translucent X-ray images, i.e., its appearances in flipped poses are very similar and hard to differentiate (Figure 4.2).

In this chapter, we overcome the limitation on labeled data by synthetic data pre-training and pairwise domain adaptation.

Figure 4.2: A correct pose estimation result and its flipped counterpart.
We propose a fully automatic 6DOF TEE transducer pose detection and tracking system based on hierarchical CNNs to improve system accuracy and efficiency. The contributions of the proposed system are threefold:

1. First, to overcome the limitation on labeled data, a pairwise domain adaptation method is proposed to refine CNNs using only a small number of annotated real X-ray images, leading to significantly improved accuracy in TEE transducer pose estimation on real X-ray images.

2. Second, to fully automate 2-D/3-D registration for TEE and improve computational efficiency, we propose a framework based on lightweight CNNs to hierarchically regress the 6DOF pose parameters of the TEE transducer, which removes the need for pose initialization and significantly reduces the memory footprint (0.15 vs. 2.39 GB) and computation time (83.3 vs. 13.6 FPS) compared to the CNN-based method in [72].

3. Finally, we propose a self-symmetry resolution mechanism, where a CNN classifier is trained to differentiate correct poses from their flipped counterparts by seeing synthetically generated pairs as examples and focusing on the unsymmetric part of the transducer, i.e., the fiducial ball and hole markers (Figure 4.2).

Figure 4.3: Problem formulation of 2-D/3-D registration.

4.2 Methods

4.2.1 Problem Formulation: 2-D/3-D Registration

A common X-ray imaging system is illustrated in Figure 4.3. Assuming that the beam divergence is corrected by the X-ray imaging system and the X-ray sensor has a logarithmic static response, the generation of an X-ray image can be defined as

I(p) = \int \mu(L(p,r))\, dr,   (4.1)

where I is the X-ray image, I(p) is the intensity of the X-ray image I at point p, L(p,r) is the ray from the X-ray source to point p, parameterized by r, and \mu(·) is the X-ray attenuation coefficient.

Figure 4.4: Illustration of TEE 6DOF pose parameters [95].

Denoting the X-ray attenuation map of the object to be imaged as J : R^3 → R, and the 3-D transformation from the object coordinate system to the X-ray imaging coordinate system as T : R^3 → R^3, the attenuation coefficient at point x in the X-ray imaging coordinate system is

\mu(x) = J(T^{-1} \circ x).   (4.2)

Combining Equation 4.1 and Equation 4.2, we have

I(p) = \int J(T^{-1} \circ L(p,r))\, dr.   (4.3)

In 2-D/3-D registration problems, L is determined by the X-ray imaging system, J is provided by the 3-D pre-operative data (e.g., CT intensity), and the transformation T is to be estimated from the input X-ray image I. Note that given J, L and T, a synthetic X-ray image I(·) (often referred to as a DRR) can be computed following Equation 4.3 using the Ray-Casting algorithm [59].

A rigid-body 3-D transformation T can be parameterized by a vector t with 3 in-plane and 3 out-of-plane transformation parameters (Figure 4.4 [72]). In particular, the in-plane transformation parameters include 2 translation parameters, tx and ty, and 1 rotation parameter, tθ; the out-of-plane transformation parameters include 1 out-of-plane translation parameter, tz, and 2 out-of-plane rotation parameters, tα and tβ.
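For illustration, the following is a minimal NumPy/SciPy sketch of the ray-casting computation in Equation 4.3. The detector geometry, the number of samples per ray, the trilinear interpolation and the assumption that the object volume is indexed directly in voxel coordinates are all illustrative simplifications; a practical implementation would use an optimized GPU ray caster as in [59].

```python
import numpy as np
from scipy.ndimage import map_coordinates

def render_drr(volume, T, source, detector_grid, n_samples=256):
    """Sketch of Equation 4.3: I(p) = integral of J(T^-1 o L(p, r)) dr.

    volume        : 3-D attenuation map J, indexed in the object (voxel) coordinate system.
    T             : 4x4 rigid transform from object to X-ray imaging coordinates.
    source        : (3,) X-ray source position in imaging coordinates.
    detector_grid : (H, W, 3) positions of detector pixels p in imaging coordinates.
    """
    T_inv = np.linalg.inv(T)
    H, W, _ = detector_grid.shape
    drr = np.zeros((H, W), dtype=np.float32)
    steps = np.linspace(0.0, 1.0, n_samples)                                  # parameter r along each ray
    for i in range(H):
        for j in range(W):
            p = detector_grid[i, j]
            ray = source[None, :] + steps[:, None] * (p - source)[None, :]    # points on L(p, r)
            ray_h = np.concatenate([ray, np.ones((n_samples, 1))], axis=1)    # homogeneous coordinates
            obj = (T_inv @ ray_h.T)[:3]                                       # T^-1 o L(p, r), voxel coords assumed
            vals = map_coordinates(volume, obj, order=1, mode="constant", cval=0.0)
            drr[i, j] = vals.sum() * np.linalg.norm(p - source) / n_samples   # Riemann sum of the line integral
    return drr
```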
4.2.2 System Overview

Our goal is to develop a fully automatic system for 6DOF detection and tracking of the TEE transducer from 2-D X-ray images with high robustness, accuracy and efficiency. Instead of regressing the 6 parameters together, we divide the complex 6DOF estimation problem into 3 easier sub-problems based on the effects of the transformation parameters. In particular, the 6DOF parameters can be grouped as: 1) in-plane parameters (tx, ty, tθ), whose effects are approximately 2-D rigid-body transformations, 2) out-of-plane rotation parameters (tα, tβ), whose effects are shape changes, and 3) the out-of-plane translation parameter tz, whose effect is scaling.

In the proposed system, the 3 groups of parameters are estimated hierarchically, as illustrated in Figure 4.5. In particular, initial in-plane (tx, ty, tθ) and out-of-plane translation (tz) estimates are generated by marginal space learning as described in [95]. Then the out-of-plane rotation parameters (tα, tβ) are estimated using CNN regressors, which are first trained purely on synthetic X-ray images and then refined to generalize to real X-ray images via pairwise domain adaptation. Following the initial estimation, the in-plane parameters (tx, ty, tθ) and the out-of-plane translation parameter tz are further refined using two separate CNNs to achieve more accurate pose estimation. The details are presented in the following sections.

Figure 4.5: Flowchart of the proposed 6DOF pose estimation system.

4.2.3 Hierarchical CNN Regression Architecture

To achieve fully automatic pose estimation, we need to globally estimate the out-of-plane rotation parameters (tα, tβ) without initialization. The mapping between the out-of-plane rotation of the transducer and its appearance in the X-ray image is very complex and difficult to model accurately with one regressor of reasonably small size. Therefore, in this chapter, we propose a three-level hierarchical regression of (tα, tβ) using global, coarse and refined levels of CNNs. In addition, using the parameter space partitioning scheme introduced in [72], we partition the parameter spaces of the three levels into 2, 8 and 8 zones, respectively, and train separate CNNs for the zones to further reduce the learning complexity. The CNNs for each level and zone only focus on the features and mappings that are relevant to pose recovery within their corresponding parameter range and capture range, making it possible to use lightweight CNNs at each level. Specifically, we employ two CNN regressors to cover the full parameter space with full capture range at the global level, and set the capture range of the coarse level to cover the residual pose estimation error resulting from the global level, as shown in Table 4.1. The refined level then has three CNN regressors in each zone to further refine all 6 parameters. Details about each level are given in Table 4.1.

A single CNN regressor architecture is illustrated in Figure 4.6. Given an X-ray image of the TEE transducer with ground truth pose parameters t and an initial estimate of the pose parameters t0, a DRR is rendered at pose t0, and the CNN regressor is trained to estimate the pose parameter residual t − t0 by comparing the X-ray and DRR images. The ROI of the transducer is first detected from the initial in-plane (tx, ty, tθ) and out-of-plane translation (tz) estimates [95].

Figure 4.6: Flowchart of our CNN regression method. With a given X-ray image, a DRR image is rendered. Then 140×80 TEE transducer patches are extracted from the X-ray image and the DRR image and fed into the CNN regressor to estimate the pose offset δT.
Based on these four parameters, the ROI of the TEE transducer in the input X-ray image can be extracted.

Table 4.1: Hierarchical CNN regression architecture.

Level      Zones    CNNs    Parameter range per zone    Capture range
Global     2        2       tα: 90, tβ: 180             δtα: ±45, δtβ: ±90
Coarse     8        8       tα: 45, tβ: 90              δtα: ±30, δtβ: ±30
Refined    8        24      tα: 45, tβ: 90              δtα: ±5, δtβ: ±5

Following ROI extraction, normalization is performed on the X-ray and DRR patches. Intensity values of the patches are normalized from [0,255] to [0,1]. Since foreground pixels (the TEE probe) usually lie in the range [0.2,0.8], we ignore background pixels (intensity values in [0,0.2] and [0.8,1]) during normalization. This normalization process ensures that the input data to the CNN model has consistent brightness and contrast. The X-ray and DRR patches are further resized to 140×80 pixels. The image residual feature is calculated via subtraction and fed into the CNN regressor. The CNN regressor aims to model the mapping between the registration parameters δt and the image residual feature. In our system, the CNN regressor needs a large capacity to enable a large capture range and/or high accuracy, while remaining lightweight to enable real-time clinical applications. We therefore design a CNN architecture with deeper convolutional structures and smaller FC layers compared to [72]. In particular, the CNN regressor has 10 convolutional layers with 3×3 kernels and incremental feature map numbers [32,32,48,48,64,64,96,96,128,128]. Pooling layers with 2×2 kernels are added after every two convolutional layers. Following the convolutional and pooling layers, there is one FC layer with 1024 neurons, and the last FC layer outputs the registration parameters. The memory footprint of the resulting CNN regressor is 3.44 MB.

In the initial phase of training, we generate a large number of synthetic X-ray images following [72], since accurate labeling of the 6DOF TEE pose from 2-D X-ray projection images is extremely challenging and typically requires a bi-plane imaging setup that is rarely clinically available. To generate synthetic training data for a specific zone Z, two images are rendered: a DRR image I_{t_0} with a random starting pose t_0 in Z's parameter range, and a synthetic X-ray image I_{t_0+\delta t_i} with ground truth parameter \delta t_i in Z's capture range. During training, we randomly generate 1 million synthetic samples with t_0 and \delta t_i following uniform distributions within the zone Z and the capture range. The image residual feature is calculated as

X_i = I_{t_0+\delta t_i} - I_{t_0}.   (4.4)

Figure 4.7: Comparison of real clinical TEE transducer X-ray data (left) and synthetically generated DRR data (right).

The CNN regressor is then trained with the following loss function:

L = \frac{1}{N} \sum_{i=1}^{N} \| \delta T_i - f(X_i; W) \| = \frac{1}{N} \sum_{i=1}^{N} \| \delta T_i - f(I_{t_0+\delta t_i} - I_{t_0}; W) \|,   (4.5)

where N is the number of training samples, W denotes the CNN weights to be learned, f(X_i; W) is the output of the CNN regressor given the input image residual feature X_i, and \| \cdot \| denotes the Euclidean distance. We employ SGD [58] with this Euclidean loss to train each CNN regressor. The CNN weights are updated for 100,000 iterations with a batch size of 50.
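As a concrete summary of the regressor just described, the following PyTorch sketch builds the 10-layer convolutional backbone, the 1024-neuron FC layer and the output layer, and applies it to a residual feature as in Equations 4.4 and 4.5. The ReLU activations, zero padding, single-channel residual input and the mean-squared-error stand-in for the Euclidean loss are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class PoseResidualRegressor(nn.Module):
    """Sketch of the lightweight CNN regressor: 10 conv layers (3x3 kernels),
    2x2 pooling after every two conv layers, one 1024-neuron FC layer, and a
    final FC layer outputting the pose residual."""
    def __init__(self, n_params=2):            # e.g. (t_alpha, t_beta) at the global level
        super().__init__()
        widths = [32, 32, 48, 48, 64, 64, 96, 96, 128, 128]
        layers, in_ch = [], 1                   # input: single-channel residual X = I_xray - I_drr (assumed)
        for i, out_ch in enumerate(widths):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            if i % 2 == 1:                      # pool after every two conv layers
                layers.append(nn.MaxPool2d(2))
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(1024), nn.ReLU(inplace=True),
                                  nn.Linear(1024, n_params))

    def forward(self, x_residual):
        return self.head(self.features(x_residual))

# Residual feature (Equation 4.4) from normalized 140x80 X-ray and DRR patches, then regression.
xray, drr = torch.rand(8, 1, 80, 140), torch.rand(8, 1, 80, 140)
model = PoseResidualRegressor(n_params=2)
delta_t = model(xray - drr)                                                # -> shape (8, 2)
loss = torch.nn.functional.mse_loss(delta_t, torch.zeros_like(delta_t))   # MSE stand-in for the loss in Eq. 4.5
```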
Figure 4.8: The proposed pairwise domain adaptation scheme.

4.2.4 Pairwise Domain Adaptation

The above CNNs are pre-trained on synthetic data and do not generalize ideally to real data. Therefore, domain adaptation training is needed to minimize the gap between synthetic and real data. To handle the domain shifting problem between synthetic and real data, existing deep domain adaptation methods focus on unsupervised distribution modeling by employing domain distance measurements based on the data distribution, such as MMD [67, 96]. However, these distribution-based measurements cannot be applied to our problem due to the lack of massive real data to represent the statistical distribution. Another intuitive solution is to fine-tune the CNN model pre-trained with synthetic data using labeled real data. However, with a very limited number of labeled real data (tens to hundreds), the performance improvement from naive fine-tuning could still be limited without exploiting other priors.

In this chapter, our aim is to design a domain adaptation method that is suitable for 2-D/3-D registration with very limited real data, by exploiting the fact that paired real and synthetic data can be generated. Specifically, for a real X-ray image with a known ground truth pose of the object to be registered, we can render a DRR image with the same 6DOF pose. As shown in Figure 4.7, the image appearance difference (e.g., object appearance, artifacts, noise and background) is the only difference between the generated real-synthetic image pairs that causes the performance gap. If we consider a CNN model to be a trained feature extractor Φ(·) followed by a regressor/classifier R(·), our aim is to train a domain invariant feature extractor Φ_A(·) which has consistent performance over paired real data I_r and synthetic data I_s, and has similar behavior to the pre-trained feature extractor Φ(·) over synthetic data I_s:

Φ_A(I_r) ≈ Φ_A(I_s) ≈ Φ(I_s).   (4.6)

In addition, for a well trained regressor R(·), the results from real data I_r and synthetic data I_s should both be close to the ground truth:

R(Φ_A(I_r)) ≈ R(Φ(I_s)) ≈ GT.   (4.7)

Thus, to train the domain invariant feature extractor Φ_A(·) with a set of real-synthetic pairs P, a pairwise domain loss L_D can be defined as

L_D = \frac{1}{|P|} \sum_{(I_r, I_s) \in P} \| Φ_A(I_r) − Φ(I_s) \|.   (4.8)

Minimizing L_D forces Φ_A(·) to extract domain invariant features. To ensure that the adapted feature extractor Φ_A(·) retains a consistent performance on the original task, we add the pre-training loss L over synthetic data as a regularization term:

L_{all} = \frac{1}{|P|} \sum_{(I_r, I_s) \in P} \| Φ_A(I_r) − Φ(I_s) \| + λ L,   (4.9)

where λ is a parameter that balances the level of domain adaptation and the performance on the original CNN task. In this work, we empirically set λ = 1 in all experiments. Unlike many deep domain adaptation methods which adapt the higher-level task-specific FC layers, we focus on convolutional features, which 1) can better model the appearance difference between domains in image registration problems (Figure 4.7), 2) can be applied across different tasks (e.g., estimation of different pose parameters) using the same training data, and 3) involve tuning only convolutional layers with a small number of parameters, which minimizes the risk of overfitting when the number of labeled real data is limited.

An overview of the pairwise domain adaptation pipeline is illustrated in Figure 4.8. The convolutional layers Φ(·) and the FC layers come from the pre-trained model, and Φ_A(·) denotes the adapted convolutional layers. To train the adapted network, we generate a few real-synthetic pairs P (|P| ≈ 50) for the domain distance L_D. We set the batch size to 50 and train each domain adaptation network for 40,000 iterations with SGD.
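A minimal PyTorch sketch of the pairwise domain loss in Equations 4.8 and 4.9 is given below, assuming the paired real and synthetic patches are aligned index-by-index in the two batches; the function and argument names are illustrative, and the task loss L is assumed to be computed elsewhere on synthetic data.

```python
import torch
import torch.nn.functional as F

def pairwise_domain_loss(phi_adapted, phi_frozen, real_batch, synth_batch, lam=1.0, task_loss=None):
    """Sketch of Equations 4.8-4.9: pull the adapted features of real X-ray patches
    toward the frozen features of their paired DRR patches, optionally regularized
    by the pre-training task loss L.

    phi_adapted : feature extractor being adapted (trainable)
    phi_frozen  : pre-trained feature extractor (weights frozen)
    real_batch  : real X-ray patches I_r, shape (B, C, H, W)
    synth_batch : paired DRR patches I_s rendered at the same ground-truth poses
    """
    with torch.no_grad():
        target = phi_frozen(synth_batch)                       # Phi(I_s), fixed reference
    l_d = F.pairwise_distance(phi_adapted(real_batch).flatten(1),
                              target.flatten(1)).mean()        # (1/|P|) * sum ||Phi_A(I_r) - Phi(I_s)||
    return l_d if task_loss is None else l_d + lam * task_loss
```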
4.2.5 A CNN Classifier to Resolve Pose Ambiguity

Due to the symmetric shape of the TEE transducer and the translucent nature of X-ray imaging, the appearance of the TEE transducer in the X-ray image for a given pose is very similar to that when the transducer is flipped. In particular, flipping happens in the recovered pose when the resulting parameters (tα, tβ, tz) and the ground truth parameters (tα^G, tβ^G, tz^G) have the following relationship:

tα + tα^G = 0,  tβ + tβ^G = 180,  tz − tz^G = 0.   (4.10)

To avoid flipping in the estimated pose, for every input case we initialize two starting poses (tα = 0, tβ = 0 and tα = 0, tβ = 180) and perform pose estimation to obtain two symmetric candidate poses T1 and T2. A classifier then needs to be trained to select the correct pose from the two candidates.

Figure 4.2 shows an example of two overlay results using the two candidate poses obtained from the two initializations. The two results look very similar, but one can still tell that T1 is the correct pose from the unsymmetric markers highlighted in the red circle. Therefore, we extract six 20×20 local patches of the markers to form 120×20 image residual features F1 and F2 from the two candidates T1 and T2, respectively. F1 and F2 are then fed into two CNN models, each with 8 convolutional layers with 3×3 kernels, 4 pooling layers with 2×2 kernels and 1 FC layer with 1024 output neurons. These two CNN models share weights. The outputs are then concatenated and fed into a binary classifier with two FC layers to classify the correct pose. The CNN classifier is trained on 500,000 synthetic symmetry pairs with poses randomly selected from the whole parameter space. We set the batch size to 50 and train the network for 100,000 iterations.
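The pose-ambiguity classifier described above can be sketched as the following shared-weight (Siamese-style) PyTorch model. The convolutional channel widths, the hidden size of the final FC layers and the single-channel residual inputs are illustrative assumptions; only the 8 convolutional layers with 3×3 kernels, 4 pooling layers, the 1024-neuron embedding, the weight sharing and the 2-way output follow the text.

```python
import torch
import torch.nn as nn

class FlipResolver(nn.Module):
    """Sketch of the ambiguity classifier: a shared-weight CNN encodes the marker-patch
    residual features of the two candidate poses; the concatenated embeddings are
    classified by two FC layers (correct pose = T1 or T2)."""
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 1
        for i, out_ch in enumerate([32, 32, 64, 64, 96, 96, 128, 128]):   # 8 conv layers, 3x3 kernels (widths assumed)
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
            if i % 2 == 1:
                layers.append(nn.MaxPool2d(2))                             # 4 pooling layers, 2x2 kernels
            in_ch = out_ch
        self.encoder = nn.Sequential(*layers, nn.Flatten(), nn.LazyLinear(1024), nn.ReLU(inplace=True))
        self.classifier = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(inplace=True), nn.Linear(256, 2))

    def forward(self, f1, f2):
        e1, e2 = self.encoder(f1), self.encoder(f2)       # the two branches share weights
        return self.classifier(torch.cat([e1, e2], dim=1))

# F1, F2: 120x20 residual features built from the six 20x20 marker patches of candidates T1, T2.
f1, f2 = torch.rand(16, 1, 120, 20), torch.rand(16, 1, 120, 20)
logits = FlipResolver()(f1, f2)                           # -> shape (16, 2)
```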
4.3 Experiments and Results

In the experiments we validated our system with a total of 1663 fluoro images from clinical studies. Ground truth labels were acquired manually by 4 experts using our interactive annotation tool. To perform domain adaptation for each CNN at the global and coarse levels, we generated 10,000 synthetic data to compute the loss term L. For the domain distance L_D, we randomly picked 1 real sequence with N ≈ 50 frames from each zone, and repeated this test three times to cross-validate our method. We compared our system with the method in [95], since both provide fully automatic TEE pose estimation without user initialization. Since the purpose of TEE registration is to accurately overlay the ultrasound image captured by TEE onto the X-ray image, the accuracy of TEE registration was evaluated by the average projected target registration error (PTRE) at the four corners of the TEE imaging cone at 60 mm depth. We carried out our experiments on an NVIDIA GeForce GTX 1080 GPU.

The results are summarized in Table 4.2. First, we compared our proposed system with and without pairwise domain adaptation. Using PTRE > 2.5 mm as the criterion, without pairwise domain adaptation our system achieved an error rate of 21.73% and a mean PTRE of 1.37 mm. With domain adaptation, the error rate and mean PTRE were reduced by 69.95% and 13.87%, respectively.

Figure 4.9: Qualitative comparison of the method in [95] and the proposed system. (a), (c): results from the method in [95]. (b), (d): results from the proposed system.

This demonstrates that the adapted CNNs generalize better on real X-ray images using the proposed pairwise domain adaptation, which requires only a small number of labeled data. We further diagnosed the 6.53% failure cases, and only 6 cases were due to pose ambiguity. This means that our proposed CNN pose selection strategy is effective, with a success rate of 99.62%. We finally compared our proposed system with the method in [95], where the error rate for PTRE > 2.5 mm was reduced from 50.41% to 6.53%, and the frame rate was improved from 5.0 FPS to 25.8 FPS. Figure 4.9 shows a qualitative comparison of [95] and our proposed system, demonstrating significantly improved overlay accuracy.

Table 4.2: Quantitative results of the proposed hierarchical CNN regression system in detection and tracking modes. Each entry shows (error rate, mean PTRE) under the given PTRE threshold.

Method                            PTRE > 2.5 mm        PTRE > 4.0 mm        FPS
Detection in [95]                 (50.41%, 1.46 mm)    (19.82%, 2.13 mm)    5.0
Proposed w/o domain adaptation    (21.73%, 1.37 mm)    (5.32%, 1.67 mm)     25.8
Proposed w/ domain adaptation     (6.53%, 1.18 mm)     (1.64%, 1.27 mm)     25.8
Tracking in [95]                  (25.26%, 1.40 mm)    (8.19%, 1.70 mm)     10.4
Proposed w/ tracking              (5.97%, 1.22 mm)     (0.61%, 1.31 mm)     83.3

To meet real-time application requirements, a tracking method is developed based on the proposed system. For every input sequence, the first frame goes through the complete pipeline of global, coarse and refined pose regression and flipping detection, to compute an accurate initial pose of the TEE transducer. The following frames are then regressed with the refined-level CNNs only, using the result from the last frame as the initial pose. We benchmarked the speed of the methods in [72] and [95]: our proposed method runs at 83.3 FPS, which significantly improves over [95] (10.4 FPS) and [72] (13.6 FPS). In terms of memory efficiency, our proposed system has 2 + 8 + 3×8 = 34 CNN regressors and a memory footprint of 146 MB, compared with 324 regressors and 2.39 GB in [72].

4.4 Conclusion

In this chapter, we presented a 6DOF TEE transducer pose detection and tracking system based on hierarchical CNNs trained with domain adaptation, which significantly outperforms the previous methods in robustness, accuracy, computational efficiency and memory footprint. The next step is to deploy the proposed system in a clinical setup to evaluate its performance and assess the added value, for the fusion of ultrasound and X-ray images, of the significantly improved accuracy and speed.

Chapter 5

Learning CNNs with Universal Pairwise Domain Adaptation Module: 2-D/3-D Medical Imaging Registration via CNN and DRL

2-D/3-D registration, which aligns the pre-operative 3-D data and the intra-operative 2-D data into the same coordinate system, is one of the key enabling technologies of image-guided procedures [72]. By aligning the 2-D and 3-D data, accurate 2-D/3-D fusion can provide complementary information for advanced image-guided radiation therapy, radiosurgery, endoscopy and interventional radiology [69]. In Chapter 4 we proposed a pairwise domain adaptation method and investigated a hierarchical CNN approach for the TEE transducer pose estimation problem.
In this chapter, we present a PDA module which can further improve domain adaptation performance, and can be adopted for different 2-D/3-D medical imaging registration problems and deep learning frameworks.

This chapter is structured as follows: Section 5.1 introduces the motivation and objectives of this chapter; Section 5.2 presents the proposed PDA module; Section 5.3 reports the results and compares the proposed method with the state-of-the-art methods; Section 5.4 draws the conclusion and introduces future work.

5.1 Motivation and Objectives

In Chapter 4 we investigated a hierarchical CNN approach which demonstrates the suitability of CNNs for a 2-D/3-D medical imaging application. However, while deep learning offers large modeling capacity, sufficient training of such a deep model requires a large number of labeled data, which may be expensive or even impractical to collect from clinical procedures, especially for 2-D X-ray images, which tend to lack depth information. Therefore, the data-driven approaches in the literature are often trained with synthetically generated data before they are tested on real clinical data. Specifically, synthetic data are generated by rendering DRR images from pre-operative 3-D data (e.g., CAD models and CT) with random poses to simulate real X-ray images. Even though variations such as background and contrast medium are randomly added to make the appearance of the synthetic data more realistic, domain differences between real and synthetic data still exist. Compared with synthetically generated data, the TEE probe in real X-ray images (Figure 4.7) is blurred, with artifacts in the center of the probe, while spine vertebrae in real X-ray images (Figure 5.1) have distinct sharpness and noise. In addition, artifacts and other devices can be present in real clinical X-ray images. These domain differences lead to the domain shifting problem that degrades the performance of the trained deep models on real clinical data.

Figure 5.1: Comparison of real clinical spine vertebra X-ray data (left) and synthetically generated DRR data (right).

In our preliminary work (Chapter 4), we exploited the ability to generate a corresponding synthetic image for each labeled real image with the exact same pose parameters, and defined a pairwise loss measuring the distance between features from paired real and synthetic data to represent the performance gap between the domains. In this chapter, we further propose a PDA module to bridge the performance gap, and extend pairwise domain adaptation into a universal method for different 2-D/3-D registration applications and deep learning methods. The proposed PDA module is 1) powerful, with additional network capacity to model complex domain variances, 2) flexible for different deep learning-based 2-D/3-D registration frameworks, 3) easy to plug into any pre-trained CNN model, and 4) trainable with hierarchical pairwise losses using only a few real-synthetic pairs. The proposed PDA module is evaluated on 2 different deep learning frameworks for 2 different clinical problems: CNN-based residual regression for TEE transducer registration and DRL-based policy learning for spine vertebra registration. The experimental results demonstrate the PDA module's advantages in generalization and its superior performance on real clinical data.
The proposed PDA module has the potential to benefit any medical imaging problem where paired real-synthetic data can be obtained.

5.2 Methods

5.2.1 Problem Statement

CNN Regression-based 2-D/3-D Registration for TEE Transducer

TEE and X-ray fluoroscopy are the two major live imaging modalities for image-guided catheter-based interventions. TEE can provide detailed visualization of soft tissue anatomies, while X-ray can capture medical devices. Accurate 2-D/3-D registration of the TEE transducer from X-ray images can enable advanced image guidance, e.g., fused visualization and joint analysis of anatomy and devices. In the previous chapter, a CNN regression-based approach was proposed to estimate the transformation parameters t. Figure 5.2(a) shows the structure of the CNN regressor. In this chapter, to better study the domain adaptation effect, we only focus on the global-level CNN regressors for the out-of-plane parameters (tα, tβ), which are the most difficult to estimate from a single 2-D X-ray image.

DRL-based 2-D/3-D Registration for Spine Vertebra

In image-guided spine surgery, registration of 3-D pre-operative CT data and the 2-D intra-operative X-ray image can provide valuable assistance such as vertebra localization and device path planning. To address this problem, an MDP agent-based framework was proposed to train an artificial agent which iteratively chooses the optimal action to recover the 6DOF parameters t of the target vertebra [63].

Figure 5.2: Comparison of the problem frameworks of (a) CNN regression-based 2-D/3-D registration for the TEE transducer, and (b) DRL-based 2-D/3-D registration for spine vertebra.

Specifically, the process of 2-D/3-D registration is formulated as an MDP, where at every time point i with a pose t_i, an artificial agent modeled by a deep neural network observes the X-ray image and the DRR rendered with the pose t_i, and produces an action a_i to modify the pose. At time point i, the state s_i is defined by the current transformation T_i. The reward r(s_i, a_i) of action a_i is calculated as

r(s_i, a_i) = D(T_g, T_i) − D(T_g, a_i \circ T_i),   (5.1)

where T_g is the ground truth transformation, D(T_g, T_i) defines the distance between the ground truth transformation and the current transformation, and a_i \circ T_i is the new transformation after taking action a_i. In this chapter the action set A has 12 candidate transformations, with ±1 for each of the 6DOF parameters. More detailed formulations can be found in [63]. As shown in Figure 5.2(b), the CNN architecture has two branches for the input X-ray image and DRR image separately. A min-max normalization is applied to the X-ray and DRR images to normalize their intensities to [0,1]. The input X-ray and DRR images are resized to 128×128 pixels. Each branch has 4 convolutional layers with 3×3 kernels and increasing feature map numbers [64,64,128,128]. After each convolutional layer, a pooling layer with 2×2 kernels is added. The convolutional features are then concatenated and fed into 4 FC layers with decreasing numbers of neurons [1024,512,256,12]. The output of the last FC layer corresponds to the rewards of the 12 candidate actions. Similar to Equation 4.5, the training loss L is defined as

L = \frac{1}{N} \sum_{i=1}^{N} \| r(s_i, a_i) − f(I_{t_0+\delta t_i}, I_{t_0}; W) \|.   (5.2)
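The reward in Equation 5.1 can be illustrated with the short NumPy sketch below, in which the transformation composition a ∘ T is simplified to an additive update of the 6DOF parameter vector and the distance D is taken as a plain Euclidean distance; both are illustrative stand-ins for the formulation in [63].

```python
import numpy as np

def pose_distance(t_a, t_b):
    """Illustrative distance D between two poses given as 6-vectors
    (tx, ty, tz, t_alpha, t_beta, t_theta); the exact form of D follows [63]."""
    return float(np.linalg.norm(np.asarray(t_a, dtype=float) - np.asarray(t_b, dtype=float)))

def reward(t_current, t_ground_truth, action):
    """Equation 5.1: r(s, a) = D(Tg, Ti) - D(Tg, a o Ti), with a o Ti simplified to Ti + a."""
    t_next = np.asarray(t_current, dtype=float) + np.asarray(action, dtype=float)
    return pose_distance(t_ground_truth, t_current) - pose_distance(t_ground_truth, t_next)

# The 12 candidate actions: +/-1 on each of the 6 parameters.
actions = [sign * np.eye(6)[i] for i in range(6) for sign in (+1, -1)]
t_gt, t_cur = np.zeros(6), np.array([3.0, -2.0, 0.0, 1.0, 0.0, 0.0])
best = max(actions, key=lambda a: reward(t_cur, t_gt, a))
print(best, reward(t_cur, t_gt, best))   # the greedy action reduces the distance to the ground truth
```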
As shown in Figure 5.2, the above two problems differ in the following aspects:

1. The tasks of the CNN models are different. In TEE registration, the CNN model outputs the 6DOF transformation parameters; in spine registration, the CNN model outputs 12 rewards for the 12 candidate transformations.

2. The CNN architectures are different. In TEE registration, the CNN model takes a single residual image as input; in spine registration, there are two branches that process the X-ray and DRR images separately.

3. The learning methods are different. In TEE registration, the CNN model is trained with supervised learning; in spine registration, the CNN model is trained with reinforcement learning.

In the following sections we present the proposed PDA module, which can significantly improve the 2-D/3-D registration performance for both applications with only a few real data (around 100), despite the differences in CNN architecture, input-output model and learning method.

5.2.2 Pairwise Domain Adaptation Module

In Chapter 4 we presented the Pairwise Domain Loss, which modifies the last convolutional layer in the CNN model for domain adaptation. To further improve the performance, we propose a PDA module which can be plugged into CNN models and adds extra network capacity for the purpose of domain adaptation without modifying the weights of the pre-trained model. In this way, the pre-trained Φ(·) and R(·) remain unchanged and focus on the original CNN task, while the PDA module focuses on extracting domain invariant features for domain adaptation. The design of the PDA module is shown in Figure 5.3(a). The PDA module Φ_A(·) consists of k convolutional layers that transform features extracted by the pre-trained Φ(·) into domain invariant features that generalize better on real clinical data.

In the PDA module, we replace the pre-training loss L in Equation 4.9 with a synthetic feature distance S, which is the Euclidean distance between the original synthetic feature and the transformed synthetic feature (Figure 5.3(b)):

S = \frac{1}{|B|} \sum_{I_s \in B} \| Φ_A(I_s) − Φ(I_s) \|,   (5.3)

where B denotes the synthetic dataset and |B| denotes its size. Together, the loss function becomes

L_{all} = L_D + λS = \frac{1}{|P|} \sum_{(I_r, I_s) \in P} \| Φ_A(I_r) − Φ(I_s) \| + λ \frac{1}{|B|} \sum_{I_s \in B} \| Φ_A(I_s) − Φ(I_s) \|.   (5.4)

This loss encourages the PDA module to find a balance between two goals: 1) the real feature is transferred to be as close as possible to the corresponding synthetic feature, and 2) the synthetic feature largely remains unchanged. Since L_{all} is independent of the task-specific FC layers, the PDA module can be applied across different tasks using the same training data.

The kernel size and feature map numbers of the convolutional layers in the PDA module are set to be identical to those of the last convolutional layer in the pre-trained CNN model, to keep the feature map dimensions consistent after adaptation. In this chapter, we set the kernel size to 3×3 and the number of feature maps to 128. In order to train the PDA module, we initialize the convolution kernels as identity matrices. More specifically, for a convolutional layer in the PDA module with weights W ∈ R^{3×3×128×128}, we set

W(1,1,k,k) = 1,  k = 1, 2, ..., 128,   (5.5)

and set the rest of the weights to 0. The purpose of this identity initialization is to ensure that, at the beginning of training, the PDA module preserves the meaningful input synthetic features. In this way, the training mainly focuses on reducing the domain distance L_D and is easier to converge.

Figure 5.3: (a) Illustration of the PDA module plugged into a pre-trained CNN model. (b) Illustration of the PDA module with the basic loss. (c) Illustration of the PDA module with the multilayer loss (PDA+ module).
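A minimal PyTorch sketch of the PDA module with identity initialization (Equation 5.5) is shown below. The ReLU between the added layers is an assumption (the text does not specify an activation); with it, the identity behavior at initialization holds for the non-negative post-ReLU features produced by the pre-trained extractor. The per-layer outputs are also returned so that per-layer losses, as introduced next, can be computed.

```python
import torch
import torch.nn as nn

class PDAModule(nn.Module):
    """Sketch of the PDA module: k extra 3x3 conv layers (128 feature maps) appended to the
    frozen feature extractor, with identity initialization so that the module initially
    passes synthetic features through unchanged (Equation 5.5)."""
    def __init__(self, channels=128, k=3):
        super().__init__()
        self.layers = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1) for _ in range(k))
        for conv in self.layers:
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)
            for c in range(channels):
                conv.weight.data[c, c, 1, 1] = 1.0   # center tap = 1: identity mapping at initialization

    def forward(self, feat):
        per_layer = []
        for conv in self.layers:
            feat = torch.relu(conv(feat))            # activation is an assumption
            per_layer.append(feat)
        return per_layer[-1], per_layer              # adapted feature and per-layer features

# Identity check at initialization (for non-negative inputs, e.g. post-ReLU features).
phi = torch.rand(2, 128, 9, 9)
adapted, _ = PDAModule()(phi)
print(torch.allclose(adapted, phi, atol=1e-6))       # True
```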
In Figure 5.3(c) we further enhance the PDA module by introducing a multilayer loss, where a pairwise feature distance D_l and a synthetic feature distance S_l are calculated for each layer l in the PDA module:

D_l = \frac{1}{|P|} \sum_{(I_r, I_s) \in P} \| Φ_A^l(I_r) − Φ^l(I_s) \|,   (5.6)

S_l = \frac{1}{|B|} \sum_{I_s \in B} \| Φ_A^l(I_s) − Φ^l(I_s) \|,   (5.7)

where Φ_A^l(·) and Φ^l(·) denote the adapted and original feature extractors at layer l. The PDA module loss can then be defined as

L_{all} = L_D + λS = \sum_{l=1}^{k} (D_l + λ_l S_l),   (5.8)

where k is the number of convolutional layers in the PDA module, and λ_l = 1 in all experiments. By introducing the multilayer loss, the domain distance of the lower layers in the PDA module can be modeled more flexibly. In addition, the weights are updated with gradients calculated at each layer, which reduces the effect of vanishing gradients and leads to better supervision for the training of the PDA module, similar to ResNet [73]. In the experiments, we denote the PDA module with the multilayer loss as the PDA+ module. The proposed PDA module has the following merits:

1. The novel direct measurement of domain distance on paired data allows training of the PDA module using a small number of real data, by focusing on the key image appearance differences between domains while excluding other factors such as pose. In contrast, previous domain adaptation methods typically employ statistical domain distance measurements, which require a large number of data from both domains.

2. The pairwise loss allows the domain adaptation to be performed on convolutional layers, which are more correlated with image appearance (i.e., synthetic vs. real data). In contrast, previous domain adaptation methods typically only adapt FC layers because statistical distance measurements are not reliable on high dimensional feature maps.

3. The proposed PDA module is a flexible plug-and-play module that can be applied to general network structures. Since the PDA module is added and trained after the main network is trained, it does not affect the network design and training of the main network.

5.3 Experiments and Results

5.3.1 Experiment Setup

We evaluated the proposed PDA and PDA+ modules on two clinical datasets for TEE and spine registration. To pre-train the CNN models, we followed the training procedure of Chapter 4, where 1 million synthetic data were generated with random poses and backgrounds. SGD [58] was employed to update the weights with the task loss L for 100,000 iterations with a batch size of 50. We empirically set λ = 1 in all experiments. The details of the two datasets are as follows:

1. The TEE dataset consists of 1663 X-ray images. A 3-D CAD model of the TEE transducer was used to generate DRR images and synthetic training data. To demonstrate the performance of the PDA module, in this chapter we only focus on the global-level CNN regressors for the out-of-plane parameters (tα, tβ), which are the most difficult to estimate from a single 2-D X-ray image. Therefore, the starting pose was set to (tα = 0, tβ = 0), and the capture range was δtα ∈ [−45, 45] and δtβ ∈ [−90, 90]. The performance was evaluated via the Root Mean Square Error (RMSE) of tα and tβ.

2. The spine dataset consists of 420 X-ray images with 42 corresponding C-Arm CT volumes. We sampled the X-ray images with the primary C-Arm angle in [165, 195], and trained the agent with a capture range of 20 mm for the translation parameter errors (δtx, δty, δtz) and 10 degrees for the rotation parameter errors (δtα, δtβ, δtθ).
The performance was evaluated via the Target Registration Error (TRE), computed as the RMSE of the locations of seven landmarks on a chosen spine vertebra, and the error rate was measured using the criterion TRE > 10 mm, following [73]. The ground truth was provided by the calibration of the C-Arm system.

Figure 5.4: Training feature distance L_D and synthetic feature distance S, testing feature distance and testing loss of the proposed PDA+ approach on the TEE dataset over iterations.

5.3.2 Performance Analysis

We first conducted a property analysis of the PDA module using the TEE data. Figure 5.4 shows the task loss L, the feature distance L_D and the synthetic feature distance S on testing data during PDA+ training. Two randomly selected sequences with around 100 image frames were used for domain adaptation. The feature distance on testing data reduces during training, demonstrating that, using the direct supervision provided by the pairwise domain loss, the PDA module can be effectively trained with a small number of data to reduce the domain distance on unseen testing data without noticeable overfitting. The synthetic feature distance S starts from 0 due to the identity initialization of the PDA module, and stays at a small value during training, showing that the PDA module can preserve the features of synthetic data while adapting the features of real data toward those of the corresponding synthetic data. The training curves also demonstrate that the feature distance L_D and the task loss L are strongly correlated (i.e., they are reduced in parallel), which indicates that the proposed pairwise domain distance is an effective domain distance measurement, and that minimizing it can effectively improve model generalization on real data.

Table 5.1: Performance (RMSE, in degrees) comparison of the PDA+ module and fine-tuning with different numbers of training data.

Number of sequences               1      2      4      6      8      10     12
Number of real X-ray images       64     97     194    296    412    647    806
M2. Fine-tuning with task loss    N/A    6.87   6.28   5.84   5.53   4.86   4.00
M5. PDA+ module                   6.80   4.59   4.45   4.30   4.16   4.04   3.96

Additionally, we compared the proposed transfer learning method PDA+ with fine-tuning of the CNN model using the task loss on real data. Fine-tuning re-trains the pre-trained CNN model with a small learning rate. In comparison, the proposed transfer learning method PDA+ fixes the pre-trained CNN model and inserts the PDA module after the convolutional layers. Other domain adaptation methods reported in the literature, which use statistics-based or adversarial domain distances, require a large number of data and cannot be applied to our problems with limited data. Table 5.1 shows the performance comparison of the PDA+ module and the fine-tuning method using an increasing number of training data. As shown in Table 5.1, the RMSE of the PDA+ module reduces rapidly while the data number is still relatively small, and using around 100 data is sufficient to train the PDA+ module effectively. In comparison, the RMSE of the fine-tuning method reduces slowly when the data number is less than 400, and only becomes comparable with that of PDA+ when the number of data is increased to 800.
This shows that fine-tuning using real data requires a relatively large number of data (i.e., ~800) to achieve its optimal performance, while the proposed PDA+ module can effectively transfer the model to the target domain using a much smaller number of data (i.e., ~100).

5.3.3 Evaluation of the Proposed Methods

In this section we tested 2 baseline methods and 3 proposed methods: M1: baseline CNN trained purely on synthetic data without domain adaptation; M2: fine-tuning of the CNN model on real data using the task loss [105]; M3: fine-tuning of the CNN model on real data using the Pairwise Domain Loss (Equation 4.9); M4: PDA module (Figure 5.3(b)); M5: PDA+ module (Figure 5.3(c)). For the PDA and PDA+ modules, 3 convolutional layers were used. In addition, the performance analysis in the previous section showed that domain adaptation using 100 to 150 real data can already achieve close-to-optimal performance. Thus, for the TEE dataset, we randomly sampled 2 sequences with in total ~100 X-ray images for domain adaptation. For the spine dataset, we randomly selected 140 X-ray images for domain adaptation. Since the real X-ray data are limited and unevenly distributed in both datasets, to better evaluate the proposed methods and compare with the existing methods, we employ the rest of the real data for testing. We update the weights for 40,000 iterations to guarantee that training has converged, and select the model at the end of training to test the performance. The test was repeated 3 times to cross-validate the proposed methods.

Table 5.2: Quantitative results of the proposed PDA module on the problem of CNN regression-based 2-D/3-D registration for the TEE transducer.

Method                          RMSE (degrees)
M1. Baseline CNN                7.60
M2. Fine-tuning w. task loss    6.87
M3. Fine-tuning w. L_all        5.59
M4. PDA module                  4.79
M5. PDA+ module                 4.59

Table 5.3: Quantitative results of the proposed PDA module on the problem of DRL-based 2-D/3-D registration for spine vertebra.

Method                          Error rate (TRE > 10 mm)    Mean TRE (mm)
M1. Baseline CNN                16.07%                      7.45
M2. Fine-tuning w. task loss    15.71%                      6.80
M3. Fine-tuning w. L_all        13.92%                      6.68
M4. PDA module                  12.26%                      5.93
M5. PDA+ module                 11.20%                      5.65

The results for TEE and spine are summarized in Table 5.2 and Table 5.3, respectively. First, we compared the performance with and without the proposed Pairwise Domain Loss. Comparing M2 and M3 shows that by using the Pairwise Domain Loss, the RMSE in TEE registration was reduced by 18.63% (i.e., from 6.87 degrees to 5.59 degrees), and the error rate and mean TRE in spine registration were reduced by 11.39% (i.e., from 15.71% to 13.92%) and 1.76% (i.e., from 6.80 mm to 6.68 mm), respectively. This demonstrates the effectiveness of the Pairwise Domain Loss for domain adaptation using a small number of paired data from both domains.

The comparison between M3 and M4 shows that the PDA module further improves the domain adaptation performance. In particular, the RMSE for TEE registration was reduced by 14.31% (i.e., from 5.59 degrees to 4.79 degrees), and the error rate and mean TRE for spine registration were reduced by 11.93% (i.e., from 13.92% to 12.26%) and 11.23% (i.e., from 6.68 mm to 5.93 mm), respectively. This is due to the extra modeling power provided by the PDA module solely for the purpose of domain adaptation. In addition, the result of M5 shows that the PDA+ module further improves the domain adaptation performance with the multilayer loss in Equation 5.8, which shows that the multilayer loss can better model domain distance in the underlying layers and leads to better supervision for the domain adaptation training.
When comparing M5 with M2, the RMSE for TEE registration was reduced by 33.19% (from 6.87 degrees to 4.59 degrees), and the error rate and mean TRE in spine registration were reduced by 28.71% (from 15.71% to 11.20%) and 16.91% (from 6.80 mm to 5.65 mm), respectively. This demonstrates that the proposed PDA+ module provides a significant improvement over the baseline of fine-tuning with the task loss. In summary, the proposed Pairwise Domain Loss and PDA+ module are shown to be effective in improving the generalization of deep learning-based 2-D/3-D registration methods on real clinical data.

Samples of qualitative results of the PDA+ module on TEE and spine data are shown in Figure 5.5 and Figure 5.6, respectively. Figure 5.5 shows that the accuracy of TEE transducer pose estimation is significantly improved after applying the PDA+ module. Figure 5.6 shows that without the PDA+ module, the agent could register the spine vertebra in the 3-D CT to a wrong vertebra in the X-ray image, due to the appearance difference between the synthetic training data and the real testing X-ray image. In comparison, with the PDA+ module, the agent successfully registers the spine vertebra.

Figure 5.5: Example results of TEE registration with the original CNN model (left) and with the PDA+ module (right).

Figure 5.6: Example results of spine vertebra registration with the original CNN model (left) and with the PDA+ module (right). The blue and red crosses are the target and estimated vertebra centers, respectively.

5.4 Conclusion

In this chapter, we presented a PDA module to tackle the domain shift problem in CNN-based 2-D/3-D registration. A Pairwise Domain Loss was proposed to effectively model the domain difference between synthetically generated pre-training data and real clinical data. In addition, a PDA module was proposed to learn domain-invariant features using only a few paired real and synthetic data. The proposed PDA module was evaluated on two different 2-D/3-D registration problems, demonstrating its advantages in generalization and flexibility for clinical applications. The proposed PDA module can be plugged into any pre-trained CNN model and has the potential to benefit any medical imaging problem where a small number of paired real-synthetic data can be obtained.

Chapter 6
Conclusion and Future Work

In this thesis, we have developed a set of effective deep learning methods to address the problem of limited data in health informatics, including: 1) combining mid-level image partitioning with deep learning, 2) encoding prior knowledge via feature extraction, and 3) synthetic data pre-training with pairwise domain adaptation. In this chapter, we summarize the findings presented in this thesis and discuss potential future work.

6.1 Conclusion

In Chapter 2, we focused on combining a mid-level image partitioning approach with deep learning to extract multi-resolution deep features, and on fine-tuning the deep learning model on mid-level image part data with mined "labels" to overcome the limitation of labeled data. We illustrated this idea on the food image recognition problem and proposed a novel framework to better utilize DCNN models as powerful multi-resolution feature extractors for food image recognition. The proposed framework jointly exploits the advantages of both the mid-level based approaches and the DCNN approaches. To further improve the performance of the proposed framework with limited labeled data, fine-tuning the DCNN model with mid-level food part data was proposed as an effective approach. To our knowledge, the proposed method is the first attempt in the literature to tackle the challenge of fine-tuning the DCNN model with unlabeled mid-level food image part data. We further presented a novel food part label mining scheme with 3 strategies to mine part-level labels from unlabeled food part data. We evaluated these 3 strategies and found that simple k-means clustering achieves the best performance. Finally, for each food image dataset, we trained a DCNN model on the food part dataset with the mined part-level labels. The experiments on 3 benchmark food image datasets showed that the proposed approach significantly improves over the baseline DCNN fine-tuning approach without employing many different features or very deep DCNN architectures.
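As a rough illustration of the label mining step summarized above, the following snippet sketches the k-means strategy that performed best: DCNN features of unlabeled mid-level food patches are clustered, and the cluster indices serve as surrogate part-level labels for fine-tuning. The feature dimensionality, the number of clusters and the function name are placeholders chosen for the example, not the settings used in Chapter 2.

import numpy as np
from sklearn.cluster import KMeans

def mine_part_labels(part_features, num_clusters=100, seed=0):
    """Assign surrogate labels to unlabeled mid-level food patches by
    clustering their DCNN features; the cluster index of each patch is
    then used as its mined part-level label for fine-tuning."""
    kmeans = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10)
    mined_labels = kmeans.fit_predict(part_features)
    return mined_labels, kmeans.cluster_centers_

# Example: placeholder features extracted from a pre-trained DCNN for N patches.
features = np.random.rand(5000, 4096).astype(np.float32)
labels, centers = mine_part_labels(features, num_clusters=100)
# `labels` would then serve as targets when fine-tuning the DCNN on patches.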
In Chapter 3, we explored the approach of pre-processing and feature extraction based on prior knowledge to reduce the high dimensionality and complexity of the raw data relative to the limited number of labels, and illustrated it on MRI-based ADHD diagnosis. We developed an automatic diagnosis algorithm based on deep learning to classify ADHD vs. TDC using MRI scans. The proposed 3D CNN method is fundamentally different from previous attempts to classify ADHD using MRI scans. Specifically, to overcome the issue of high dimensionality and high data variance compared to the extremely limited number of samples, we first encoded prior knowledge into six types of 3D low-level features, including REHO, FALFF and VMHC, as well as GM, WM and CSF densities in MNI space. Unlike previous methods, which mostly treated the low-level features as a vector and may therefore neglect potential 3D local patterns, we kept these low-level features as 3-order tensors and trained the 3D CNN directly on them. We further combined the FMRI and SMRI features with a multi-modality 3D CNN architecture, which yields state-of-the-art performance. Experimental results on the hold-out ADHD-200 testing dataset showed that the proposed 3D CNN is superior to previous works in the literature, even with a smaller number of training samples.
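To make the multi-modality idea concrete, here is a minimal sketch of a two-branch 3-D CNN in which the fMRI-derived volumes (REHO, FALFF, VMHC) and the sMRI-derived density volumes (GM, WM, CSF) are processed by separate 3-D convolutional branches and fused before classification. The layer sizes, pooling choices and the roughly 20-by-30-by-20 input resolution (mentioned again in Section 6.2.2 below) are illustrative assumptions, not the exact architecture of Chapter 3.

import torch
import torch.nn as nn

class MultiChannel3DCNN(nn.Module):
    """Illustrative two-branch 3-D CNN: one branch for the three fMRI
    feature volumes and one for the three sMRI density volumes."""

    def __init__(self, num_classes=2):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d(2),
                nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),
            )
        self.fmri_branch = branch()
        self.smri_branch = branch()
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, fmri_vol, smri_vol):
        # fmri_vol, smri_vol: (batch, 3, D, H, W) low-level feature tensors
        f = self.fmri_branch(fmri_vol).flatten(1)
        s = self.smri_branch(smri_vol).flatten(1)
        return self.classifier(torch.cat([f, s], dim=1))

# Example forward pass with inputs of roughly the resolution used in Chapter 3.
model = MultiChannel3DCNN()
logits = model(torch.randn(1, 3, 20, 30, 20), torch.randn(1, 3, 20, 30, 20))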
In Chapter 4, we explored the approach of generating a large amount of realistic synthetic data to train deep learning models and performing pairwise domain adaptation to generalize the trained models to real data, and illustrated it on 6DOF TEE transducer registration. We developed a 6DOF TEE transducer pose detection and tracking system based on hierarchical CNNs trained with domain adaptation, which significantly outperforms previous methods in robustness, accuracy, computational efficiency and memory footprint. A pairwise domain loss was proposed to effectively model the domain difference between synthetically generated pre-training data and real clinical data, in order to overcome the problem of limited data. A hierarchical CNN architecture was proposed to enable automatic global pose detection. A compact CNN model was designed to reduce the computational cost and memory footprint. Finally, a CNN classifier was trained to resolve the pose ambiguity of the TEE transducer. Experiments on 1663 clinical X-ray images demonstrated that the proposed system achieves state-of-the-art accuracy with significantly improved frame rate and memory efficiency.

In Chapter 5, we developed a PDA module to tackle the domain shift problem in CNN-based 2-D/3-D registration. The PDA module learns domain-invariant features using only a few paired real and synthetic data. The proposed PDA module was evaluated on two different 2-D/3-D registration problems, TEE transducer registration and spine vertebral registration, demonstrating its advantages in generalization and flexibility for clinical applications. The proposed PDA module can be plugged into any pre-trained CNN model and has the potential to benefit any medical imaging problem where a small number of paired real-synthetic data can be obtained.

6.2 Future Work

6.2.1 Compact CNN model for Image Recognition

One possible future research direction for deep learning-based image recognition is mobile-based image recognition with compact CNN models. In Chapter 2, DCNN-FP is an AlexNet model, which takes a 227x227 image as input. Thus, for each food part image, the system needs to resize the small part image to 227x227 and feed it into the AlexNet model to extract features, which is both time-consuming and unnecessary. To reduce the computational cost and enable mobile-based applications, one possible way is to adopt network compression techniques, such as pruning [36], parameter binarization [87] and vector quantization [31], to reduce the model memory footprint and computational cost without changing the network input size. Another possible way is to first train a large network with a large input size (AlexNet or Inception-v3), then use features from its middle layers to guide the training of a small network with a smaller input size for fast deployment. Existing methods include teacher-student network training [88] and dark knowledge transfer [41].

Another possible future research direction is domain adaptation of food images from different types of cuisine. In Section 2.3.3, we examined the cross-domain performance of the system and showed that training the CNN model on the Food-101 dataset and testing on the UEC Food 100 dataset yields worse results. This can be explained by domain shift [67]. Different from the Food-101 dataset, UEC Food 100 is a Japanese food image dataset (Figure 2.3). The food feature extractors learned from the source domain (Food-101) can be used to extract features on the target domain (UEC Food 100) and achieve reasonable results. However, to achieve better results, domain adaptation on the target domain is needed [67].

6.2.2 Extension of Deep Learning-based MRI Data Analysis

One possible future research direction is to process each of the FMRI and SMRI features with individual CNN branches. In Chapter 3, we treated FMRI features and SMRI features differently and fed each type of feature into a different CNN branch, which improved the diagnosis performance. From Figure 3.5, we can also observe that when the FMRI features or SMRI features are combined in a single-channel CNN, the performance becomes worse. This shows that different FMRI and SMRI features have different properties and behaviors, and that a single-channel CNN is not sufficient to model them jointly. Feeding each FMRI and SMRI feature into an individual CNN branch would therefore potentially improve the diagnosis performance further.

Another possible future research direction is to adopt dilated convolutional layers [106] to increase the network input size. In Chapter 3, we first downsized the input features with max-pooling to a lower resolution of around 20-by-30-by-20, simply because a large 3D input would result in GPU memory overflow during training. Dilated convolutional layers can exponentially expand the receptive field to handle large 3D inputs without loss of resolution [106], as sketched below. The proposed 3D CNN-based system can also be extended to other types of diseases (Alzheimer's disease, etc.).
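The snippet below is a small sketch of this receptive-field property: stacking 3x3x3 dilated 3-D convolutions with doubling dilation rates grows the receptive field exponentially with depth while the volume keeps its full spatial resolution. The channel count and dilation schedule are arbitrary choices for illustration, not a proposed architecture.

import torch
import torch.nn as nn

def dilated_3d_stack(channels=8, dilations=(1, 2, 4, 8)):
    """Stack of 3x3x3 dilated 3-D convolutions. With dilations doubling at
    each layer, the receptive field grows exponentially with depth while the
    spatial resolution of the volume is preserved (no pooling)."""
    layers = []
    in_ch = 1
    for d in dilations:
        layers += [nn.Conv3d(in_ch, channels, kernel_size=3,
                             padding=d, dilation=d),
                   nn.ReLU()]
        in_ch = channels
    return nn.Sequential(*layers)

# Receptive field along one axis: 1 + 2 * sum(dilations) = 31 voxels here,
# while the output volume keeps the same spatial size as the input.
net = dilated_3d_stack()
out = net(torch.randn(1, 1, 32, 32, 32))
print(out.shape)   # torch.Size([1, 8, 32, 32, 32])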
6.2.3 Multi-task Learning for 2-D/3-D Registration

In Chapter 4, by designing a deep and compact CNN model and applying pairwise domain adaptation, we greatly reduced the number of CNN regressors from 324 in [72] to 34, with each CNN model focusing on its own parameter group, zone and capture range. We also reduced the memory footprint from 2.39 GB to 146 MB. However, training the 34 CNN models is still complicated and time-consuming. One potential way to ease the training effort is multi-task learning, sharing CNN weights across parameter groups and zones, which could further simplify the training process and reduce the memory footprint of the system.

6.2.4 GAN-based Artifact Removal for Spine Vertebral Registration

In Chapter 5, during the experiments, we discovered that when devices and implants are present in the X-ray image, the registration task is more likely to fail (Figure 5.6). One possible future research direction is therefore GAN-based artifact removal for spine vertebral registration. The GAN was recently proposed by Goodfellow et al. and has achieved great success in image synthesis, image in-painting and style transfer [33]. In our preliminary research, we were able to accurately segment the artifact with a fully convolutional network (FCN) and remove it from the X-ray image. Potentially, a GAN could not only remove the artifact but also fill in the blank areas corresponding to the artifact, improving the registration accuracy of the system. GAN-based artifact removal can also be extended to other medical imaging problems.

Bibliography

[1] The adhd-200 global competition. http://fcon_1000.projects.nitrc.org/indi/adhd200/results.html. → pages 12, 67, 68
[2] The r-fmri maps project. http://mrirc.psych.ac.cn/RfMRIMaps. → pages 52, 56
[3] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016. → page 44
[4] K. Aizawa, Y. Maruyama, H. Li, and C. Morikawa. Food balance estimation by using personal dietary tendencies in a multimedia food log. Multimedia, IEEE Transactions on, 15(8):2176–2185, 2013. → pages 7, 10
[5] M. Angriman, A. Beggiato, and S. Cortese. Anatomical and functional brain imaging in childhood adhd: Update 2013. Current Developmental Disorders Reports, 1(1):29–40, 2014. → page 49
[6] M. Anthimopoulos, L. Gianola, L. Scarnato, P. Diem, and S. Mougiakakou. A food recognition system for diabetic patients based on an optimized bag-of-features model. Biomedical and Health Informatics, IEEE Journal of, 18(4):1261–1271, July 2014. → pages 7, 10
[7] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013. → page 24
[8] L. Bossard, M. Guillaumin, and L. Van Gool. Food-101 – mining discriminative components with random forests. In Computer Vision – ECCV 2014, pages 446–461. Springer, 2014. → pages 11, 24, 27, 29, 31, 35, 36, 38, 39, 44, 45
[9] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010. → page 2
[10] M. R. Brown, G. S. Sidhu, R. Greiner, N. Asgarian, M. Bastani, P. H. Silverstone, A. J. Greenshaw, and S. M. Dursun.
Adhd-200 globalcompetition: diagnosing adhd using personal characteristic data canoutperform resting state fmri measurements. Frontiers in systemsneuroscience, 6:69, 2012. → pages 12, 69[11] C.-W. Chang, C.-C. Ho, and J.-H. Chen. Adhd classification by a textureanalysis of anatomical brain mri data. Frontiers in systems neuroscience, 6:66, 2012. → page 49[12] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang. Pfid:Pittsburgh fast-food image dataset. In Image Processing (ICIP), 2009 16thIEEE International Conference on, pages 289–292. IEEE, 2009. → page11[13] M. Chen, K. Q. Weinberger, F. Sha, and Y. Bengio. Marginalized denoisingauto-encoders for nonlinear representations. In Proceedings of the 31stInternational Conference on Machine Learning (ICML-14), pages1476–1484, 2014. → pages 4, 32[14] A. Coates, A. Karpathy, and A. Y. Ng. Emergence of object-selectivefeatures in unsupervised feature learning. In F. Pereira, C. Burges,L. Bottou, and K. Weinberger, editors, Advances in Neural InformationProcessing Systems 25, pages 2681–2689. Curran Associates, Inc., 2012.URL http://papers.nips.cc/paper/4497-emergence-of-object-selective-features-in-unsupervised-feature-learning.pdf. → page 32[15] D. Comaniciu, K. Engel, B. Georgescu, and T. Mansi. Shaping the futurethrough innovations: From medical imaging to precision medicine.Medical Image Analysis, (33):19–26, 2016. → page 72[16] G. Csurka. Domain adaptation for visual applications: A comprehensivesurvey. arXiv preprint arXiv:1702.05374, 2017. → page 15[17] D. Dai, J. Wang, J. Hua, and H. He. Classification of adhd children throughmultimodal magnetic resonance imaging. Frontiers in systemsneuroscience, 6:63, 2012. → pages 12, 54, 65, 68116[18] N. Dalal and B. Triggs. Histograms of oriented gradients for humandetection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005.IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE,2005. → page 24[19] S. Dey, A. R. Rao, and M. Shah. Attributed graph distance measure forautomatic detection of attention deficit hyperactive disordered subjects.Frontiers in neural circuits, 8, 2014. → pages 67, 68[20] C. Doersch, A. Gupta, and A. A. Efros. Mid-level visual element discoveryas discriminative mode seeking. In Advances in Neural InformationProcessing Systems, pages 494–502, 2013. → page 11[21] A. Eloyan, J. Muschelli, M. B. Nebel, H. Liu, F. Han, T. Zhao, A. D.Barber, S. Joel, J. J. Pekar, S. H. Mostofsky, et al. Automated diagnoses ofattention deficit hyperactive disorder using magnetic resonance imaging.Frontiers in systems neuroscience, 6:61, 2012. → pages 8, 12, 49, 67[22] G. M. Farinella, M. Moltisanti, and S. Battiato. Food recognition usingconsensus vocabularies. In New Trends in Image Analysis andProcessing–ICIAP 2015 Workshops, pages 384–392. Springer, 2015. →page 11[23] G. M. Farinella, D. Allegra, M. Moltisanti, F. Stanco, and S. Battiato.Retrieval and classification of food images. Computers in Biology andMedicine, 77:23–39, 2016. → page 10[24] P. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based imagesegmentation. International Journal of Computer Vision, 59(2):167–181,2004. → pages 28, 29[25] J. Feng, Y. Wei, L. Tao, C. Zhang, and J. Sun. Salient object detection bycomposition. In Computer Vision (ICCV), 2011 IEEE InternationalConference on, pages 1028–1035. IEEE, 2011. → page 34[26] C. for Disease Control, Prevention, et al. Hipaa privacy rule and publichealth. guidance from cdc and the us department of health and humanservices. 
MMWR: Morbidity and mortality weekly report, 52(Suppl. 1):1–17, 2003. → page 51[27] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette,M. Marchand, and V. Lempitsky. Domain-adversarial training of neuralnetworks. Journal of Machine Learning Research, 17(59):1–35, 2016. →page 15117[28] G. Gao, G. Penney, Y. Ma, et al. Registration of 3d trans-esophagealechocardiography to x-ray fluoroscopy using image-based probe tracking.Medical image analysis, 16(1):38–49, 2012. → pages 13, 75[29] E. D. Gennatas, B. B. Avants, D. H. Wolf, T. D. Satterthwaite, K. Ruparel,R. Ciric, H. Hakonarson, R. E. Gur, and R. C. Gur. Age-related effects andsex differences in gray matter density, volume, mass, and cortical thicknessfrom childhood to young adulthood. Journal of Neuroscience, 37(20):5065–5073, 2017. → page 56[30] S. Ghiassian, R. Greiner, P. Jin, and M. Brown. Learning to classifypsychiatric disorders based on fmr images: Autism vs healthy and adhd vshealthy. In Proceedings of 3rd NIPS Workshop on Machine Learning andInterpretation in NeuroImaging, 2013. → pages 13, 66, 68[31] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deepconvolutional networks using vector quantization. arXiv preprintarXiv:1412.6115, 2014. → page 112[32] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless poolingof deep convolutional activation features. In Computer Vision–ECCV 2014,pages 392–407. Springer, 2014. → pages 26, 27[33] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. InAdvances in neural information processing systems, pages 2672–2680,2014. → pages 3, 114[34] M. Greicius. Resting-state functional connectivity in neuropsychiatricdisorders. Current opinion in neurology, 21(4):424–430, 2008. → page 69[35] X. Guo, X. An, D. Kuang, Y. Zhao, and L. He. Adhd-200 classificationbased on social network method. In International Conference on IntelligentComputing, pages 233–240. Springer, 2014. → pages 13, 67, 68[36] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deepneural networks with pruning, trained quantization and huffman coding.arXiv preprint arXiv:1510.00149, 2015. → page 112[37] H. Hassannejad, G. Matrella, P. Ciampolini, I. De Munari, M. Mordonini,and S. Cagnoni. Food image recognition using very deep convolutionalnetworks. In Proceedings of the 2nd International Workshop onMultimedia Assisted Dietary Management, pages 41–49. ACM, 2016. →pages 12, 22, 35, 38, 39, 41118[38] C. R. Hatt, M. A. Speidel, and A. N. Raval. Real-time pose estimation ofdevices from x-ray images: Application to x-ray/echo registration forcardiac interventions. Medical image analysis, 2016. → pages 14, 75[39] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for imagerecognition. In Proceedings of the IEEE conference on computer visionand pattern recognition, pages 770–778, 2016. → page 59[40] Y. He, C. Xu, N. Khanna, C. J. Boushey, and E. J. Delp. Context basedfood image analysis. In Image Processing (ICIP), 2013 20th IEEEInternational Conference on, pages 2748–2752. IEEE, 2013. → page 10[41] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neuralnetwork. arXiv preprint arXiv:1503.02531, 2015. → page 113[42] G. E. Hinton. Deep belief networks. Scholarpedia, 4(5):5947, 2009. →pages 4, 32[43] R. J. Housden, A. Arujuna, Y. Ma, et al. Evaluation of a real-time hybridthree-dimensional echo and x-ray imaging system for guidance of cardiaccatheterisation procedures. 
In MICCAI, pages 25–32. Springer, 2012. →pages 13, 75[44] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, andK. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parametersand¡ 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016. → page59[45] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks forhuman action recognition. IEEE transactions on pattern analysis andmachine intelligence, 35(1):221–231, 2013. → pages 18, 50, 57[46] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fastfeature embedding. In Proceedings of the 22nd ACM internationalconference on Multimedia, pages 675–678. ACM, 2014. → page 63[47] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fastfeature embedding. arXiv preprint arXiv:1408.5093, 2014. → page 44[48] B. A. Johnston, B. Mwangi, K. Matthews, D. Coghill, K. Konrad, and J. D.Steele. Brainstem abnormalities in attention deficit hyperactivity disorder119support high accuracy individual diagnostic classification. Human brainmapping, 35(10):5179–5189, 2014. → pages 12, 67[49] M. Kaiser, M. John, A. Borsdorf, et al. Significant acceleration of 2d-3dregistration-based fusion of ultrasound and x-ray images by mesh-based drrrendering. In SPIE Medical Imaging, pages 867111–867111, 2013. →pages 14, 73, 75[50] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, andL. Fei-Fei. Large-scale video classification with convolutional neuralnetworks. In Computer Vision and Pattern Recognition (CVPR), 2014IEEE Conference on, pages 1725–1732. IEEE, 2014. → page 24[51] Y. Kawano and K. Yanai. Automatic expansion of a food image datasetleveraging existing categories with domain adaptation. In Proc. of ECCVWorkshop on Transferring and Adapting Source Knowledge in ComputerVision (TASK-CV), 2014. → pages 35, 40[52] Y. Kawano and K. Yanai. Food image recognition with deep convolutionalfeatures. In Proceedings of the 2014 ACM International Joint Conferenceon Pervasive and Ubiquitous Computing: Adjunct Publication, pages589–593. ACM, 2014. → pages 11, 22, 24, 31, 35, 40, 41, 44, 45[53] Y. Kawano and K. Yanai. Foodcam: A real-time food recognition systemon a smartphone. Multimedia Tools and Applications, pages 1–25, 2014.ISSN 1380-7501. doi:10.1007/s11042-014-2000-8. URLhttp://dx.doi.org/10.1007/s11042-014-2000-8. → page 11[54] K. Kitamura, C. de Silva, T. Yamasaki, and K. Aizawa. Image processingbased approach to food balance analysis for personal food logging. InMultimedia and Expo (ICME), 2010 IEEE International Conference on,pages 625–630. IEEE, 2010. → pages 7, 10[55] J. Kleesiek, G. Urban, A. Hubert, D. Schwarz, K. Maier-Hein,M. Bendszus, and A. Biller. Deep mri brain extraction: a 3d convolutionalneural network for skull stripping. NeuroImage, 129:460–469, 2016. →pages 50, 59[56] M. Kobel, N. Bechtel, K. Specht, M. Klarho¨fer, P. Weber, K. Scheffler,K. Opwis, and I.-K. Penner. Structural and functional imaging approachesin attention deficit/hyperactivity disorder: does the temporal lobe play akey role? Psychiatry Research: Neuroimaging, 183(3):230–236, 2010. →page 49120[57] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification withdeep convolutional neural networks. In F. Pereira, C. Burges, L. Bottou,and K. Weinberger, editors, Advances in Neural Information ProcessingSystems 25, pages 1097–1105. Curran Associates, Inc., 2012. 
URLhttp://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf. → pages 7, 24[58] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification withdeep convolutional neural networks. In Advances in neural informationprocessing systems, pages 1097–1105, 2012. → pages 83, 101[59] J. Kruger and R. Westermann. Acceleration techniques for gpu-basedvolume rendering. In Proceedings of the 14th IEEE Visualization 2003(VIS’03), page 38. IEEE Computer Society, 2003. → page 78[60] D. Kuang, X. Guo, X. An, Y. Zhao, and L. He. Discrimination of adhdbased on fmri data with deep belief network. In International Conferenceon Intelligent Computing, pages 225–232. Springer, 2014. → pages49, 67, 68[61] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learningapplied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. → pages xiii, 2, 3[62] Y. Li, L. Liu, C. Shen, and A. van den Hengel. Mid-level deep patternmining. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEEConference on, pages 971–980. IEEE, 2015. → pages 24, 26[63] R. Liao, S. Miao, P. de Tournemire, S. Grbic, A. Kamen, T. Mansi, andD. Comaniciu. An artificial agent for robust image registration. In AAAI,pages 4168–4175, 2017. → pages 14, 95, 96[64] P. Lin, J. Sun, G. Yu, Y. Wu, Y. Yang, M. Liang, and X. Liu. Global andlocal brain network reorganization in attention-deficit/hyperactivitydisorder. Brain imaging and behavior, 8(4):558–569, 2014. → page 69[65] L. Liu, C. Shen, L. Wang, A. van den Hengel, and C. Wang. Encoding highdimensional local features by sparse coding based fisher vectors. InAdvances in Neural Information Processing Systems, pages 1143–1151,2014. → pages 26, 27121[66] D. Long, J. Wang, M. Xuan, Q. Gu, X. Xu, D. Kong, and M. Zhang.Automatic classification of early parkinson’s disease with multi-modal mrimaging. PloS one, 7(11):e47714, 2012. → page 48[67] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable featureswith deep adaptation networks. In International Conference on MachineLearning, pages 97–105, 2015. → pages 15, 44, 84, 113[68] D. G. Lowe. Distinctive image features from scale-invariant keypoints.International journal of computer vision, 60(2):91–110, 2004. → page 24[69] P. Markelj, D. Tomazˇevicˇ, B. Likar, and F. Pernusˇ. A review of 3d/2dregistration methods for image-guided interventions. Medical imageanalysis, 16(3):642–661, 2012. → pages 9, 75, 91[70] N. Martinel, C. Piciarelli, and C. Micheloni. A supervised extreme learningcommittee for food recognition. Computer Vision and ImageUnderstanding, 148:67–86, 2016. → pages 11, 38, 41[71] Y. Matsuda, H. Hoashi, and K. Yanai. Recognition of multiple-food imagesby detecting candidate regions. In Multimedia and Expo (ICME), 2012IEEE International Conference on, pages 25–30. IEEE, 2012. → pagesxiii, 8, 23, 35[72] S. Miao, Z. J. Wang, and R. Liao. A cnn regression approach for real-time2d/3d registration. IEEE transactions on medical imaging, 35(5):1352–1363, 2016. → pages 8, 14, 20, 73, 75, 76, 78, 80, 82, 89, 90, 91, 114[73] S. Miao, S. Piat, P. Fischer, A. Tuysuzoglu, P. Mewes, T. Mansi, andR. Liao. Dilated fcn for multi-agent 2d/3d medical image registration.2017. → pages 14, 100, 102[74] T. Mikolov, M. Karafia´t, L. Burget, J. Cernocky`, and S. Khudanpur.Recurrent neural network based language model. In Interspeech, volume 2,page 3, 2010. → page 3[75] A. Miller, R. Alston, and J. Corsellis. 
Variation with age in the volumes ofgrey and white matter in the cerebral hemispheres of man: measurementswith an image analyser. Neuropathology and applied neurobiology, 6(2):119–132, 1980. → page 56[76] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al.122Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. → page 3[77] T. E. Moffitt, R. Houts, P. Asherson, D. W. Belsky, D. L. Corcoran,M. Hammerle, H. Harrington, S. Hogan, M. H. Meier, G. V. Polanczyk,et al. Is adult adhd a childhood-onset neurodevelopmental disorder?evidence from a four-decade longitudinal cohort study. American Journalof Psychiatry, 172(10):967–977, 2015. → page 47[78] P. Molchanov, S. Gupta, K. Kim, and J. Kautz. Hand gesture recognitionwith 3d convolutional neural networks. In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition Workshops, pages1–7, 2015. → pages 18, 50, 57[79] V. Nair and G. E. Hinton. Rectified linear units improve restrictedboltzmann machines. In Proceedings of the 27th international conferenceon machine learning (ICML-10), pages 807–814, 2010. → page 2[80] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferringmid-level image representations using convolutional neural networks. InComputer Vision and Pattern Recognition (CVPR), 2014 IEEE Conferenceon, pages 1717–1724. IEEE, 2014. → page 32[81] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization forfree?–weakly-supervised learning with convolutional neural networks. InProceedings of the IEEE Conference on Computer Vision and PatternRecognition, pages 685–694, 2015. → page 32[82] A. Payan and G. Montana. Predicting alzheimer’s disease: a neuroimagingstudy with 3d convolutional neural networks. arXiv preprintarXiv:1502.02506, 2015. → pages 50, 57[83] X. Peng, P. Lin, T. Zhang, and J. Wang. Extreme learning machine-basedclassification of adhd using brain structural mri data. PloS one, 8(11):e79476, 2013. → page 8[84] F. Perronnin, J. Sa´nchez, and T. Mensink. Improving the fisher kernel forlarge-scale image classification. In Computer Vision–ECCV 2010, pages143–156. Springer, 2010. → page 24[85] G. V. Polanczyk, E. G. Willcutt, G. A. Salum, C. Kieling, and L. A. Rohde.Adhd prevalence estimates across three decades: an updated systematicreview and meta-regression analysis. International journal ofepidemiology, 43(2):434–442, 2014. → page 7123[86] M.-g. Qiu, Z. Ye, Q.-y. Li, G.-j. Liu, B. Xie, and J. Wang. Changes of brainstructure and function in adhd children. Brain topography, 24(3-4):243–252, 2011. → page 49[87] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenetclassification using binary convolutional neural networks. In EuropeanConference on Computer Vision, pages 525–542. Springer, 2016. → page112[88] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio.Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014. →page 113[89] A. Singla, L. Yuan, and T. Ebrahimi. Food/non-food image classificationand food categorization using pre-trained googlenet model. In Proceedingsof the 2nd International Workshop on Multimedia Assisted DietaryManagement, pages 3–11. ACM, 2016. → pages 12, 22[90] K. Somandepalli, C. Kelly, P. T. Reiss, X. Zuo, R. C. Craddock, C. Yan,E. Petkova, F. X. Castellanos, M. P. Milham, and A. 
Di Martino.Short-term test–retest reliability of resting state fmri metrics in childrenwith and without attention-deficit/hyperactivity disorder. DevelopmentalCognitive Neuroscience, 15:83–93, 2015. → page 56[91] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov. Dropout: a simple way to prevent neural networks fromoverfitting. Journal of Machine Learning Research, 15(1):1929–1958,2014. → pages 63, 64[92] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov. Dropout: a simple way to prevent neural networks fromoverfitting. Journal of machine learning research, 15(1):1929–1958, 2014.→ page 2[93] H.-I. Suk, S.-W. Lee, D. Shen, A. D. N. Initiative, et al. Hierarchicalfeature representation and multimodal fusion with deep learning for ad/mcidiagnosis. NeuroImage, 101:569–582, 2014. → pages 48, 50[94] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domainadaptation. In Computer Vision–ECCV 2016 Workshops, pages 443–450.Springer, 2016. → page 15124[95] S. Sun, S. Miao, T. Heimann, T. Chen, M. Kaiser, M. John, E. Girard, andR. Liao. Towards automated ultrasound transesophageal echocardiographyand x-ray fluoroscopy fusion using an image-based co-registration method.In International Conference on Medical Image Computing andComputer-Assisted Intervention, pages 395–403. Springer, 2016. → pagesxv, 14, 75, 78, 79, 81, 87, 88, 89[96] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domainconfusion: Maximizing for domain invariance. arXiv preprintarXiv:1412.3474, 2014. → pages 15, 84[97] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarialdiscriminative domain adaptation. In NIPS Workshop on AdversarialTraining, (WAT), 2017. → page 15[98] N. Tzourio-Mazoyer, B. Landeau, D. Papathanassiou, F. Crivello, O. Etard,N. Delcroix, B. Mazoyer, and M. Joliot. Automated anatomical labeling ofactivations in spm using a macroscopic anatomical parcellation of the mnimri single-subject brain. Neuroimage, 15(1):273–289, 2002. → page 54[99] X. Wang, B. Wang, X. Bai, W. Liu, and Z. Tu. Max-marginmultiple-instance dictionary learning. In Proceedings of the 30thInternational Conference on Machine Learning, pages 846–854, 2013. →page 34[100] L. Weyandt, A. Swentosky, and B. G. Gudmundsdottir. Neuroimaging andadhd: fmri, pet, dti findings, and methodological limitations.Developmental neuropsychology, 38(4):211–225, 2013. → page 47[101] C. Yan and Y. Zang. Dparsf: a matlab toolbox for pipeline data analysis ofresting-state fmri. Frontiers in systems neuroscience, 4, 2010. → page 52[102] K. Yanai and Y. Kawano. Food image recognition using deep convolutionalnetwork with pre-training and fine-tuning. In Multimedia & ExpoWorkshops (ICMEW), 2015 IEEE International Conference on, pages 1–6.IEEE, 2015. → pages 11, 22, 24, 31, 35, 37, 38, 39, 40, 41[103] K. Yanai, T. Takamu, and Y. Kawano. Real-time photo mining from thetwitter stream: Event photo discovery and food photo detection. InMultimedia (ISM), 2014 IEEE International Symposium on, pages295–302. IEEE, 2014. → page 11125[104] H. Yang, Q. Wu, L. Guo, Q. Li, X. Long, X. Huang, R. C. Chan, andQ. Gong. Abnormal spontaneous brain activity in medication-naive adhdchildren: a resting state fmri study. Neuroscience letters, 502(2):89–93,2011. → page 48[105] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable arefeatures in deep neural networks? In Advances in Neural InformationProcessing Systems, pages 3320–3328, 2014. → page 104[106] F. Yu and V. Koltun. 
Multi-scale context aggregation by dilatedconvolutions. arXiv preprint arXiv:1511.07122, 2015. → page 113[107] Y. Zang, T. Jiang, Y. Lu, Y. He, and L. Tian. Regional homogeneityapproach to fmri data analysis. Neuroimage, 22(1):394–400, 2004. → page48[108] Y. Zang, Y. He, C. Zhu, Q. Cao, M. Sui, M. Liang, T. Li-Xia, J. Tian-Zi,and W. Yu-Feng. Altered baseline brain activity in children with adhdrevealed by resting-state functional mri. Brain and Development, 29(2):83–91, 2007. → page 48[109] J. Zheng, S. Miao, and R. Liao. Learning cnns with pairwise domainadaption for real-time 6dof ultrasound transducer detection and trackingfrom x-ray images. In International Conference on Medical ImageComputing and Computer-Assisted Intervention, pages 646–654. Springer,2017. → page 14[110] J. Zheng, Z. J. Wang, and C. Zhu. Food image recognition via superpixelbased low-level and mid-level distance coding for smart home applications.Sustainability, 9(5):856, 2017. → pages 11, 12, 24, 29, 31, 40, 41, 44, 45[111] C. Zhu, Y. Zang, M. Liang, L. Tian, Y. He, X. Li, M. Sui, Y. Wang, andT. Jiang. Discriminative analysis of brain function at resting-state forattention-deficit/hyperactivity disorder. In International Conference onMedical Image Computing and Computer-Assisted Intervention, pages468–475. Springer, 2005. → pages 12, 67[112] C.-Z. Zhu, Y.-F. Zang, Q.-J. Cao, C.-G. Yan, Y. He, T.-Z. Jiang, M.-Q. Sui,and Y.-F. Wang. Fisher discriminative analysis of resting-state brainfunction for attention-deficit/hyperactivity disorder. Neuroimage, 40(1):110–120, 2008. → pages 12, 67126[113] F. Zhu, M. Bosch, I. Woo, S. Kim, C. J. Boushey, D. S. Ebert, and E. J.Delp. The use of mobile devices in aiding dietary assessment andevaluation. Selected Topics in Signal Processing, IEEE Journal of, 4(4):756–766, 2010. → pages 7, 10[114] Q. Zou, C. Zhu, Y. Yang, X. Zuo, X. Long, Q. Cao, Y. Wang, and Y. Zang.An improved approach to detection of amplitude of low-frequencyfluctuation (alff) for resting-state fmri: fractional alff. Journal ofneuroscience methods, 172(1):137–141, 2008. → pages 18, 50, 56[115] X. Zuo, C. Kelly, A. Di Martino, M. Mennes, D. S. Margulies, S. Bangaru,R. Grzadzinski, A. C. Evans, Y.-F. Zang, F. X. Castellanos, et al. Growingtogether and growing apart: regional and sex differences in the lifespandevelopmental trajectories of functional homotopy. The Journal ofneuroscience, 30(45):15034–15043, 2010. → pages 18, 50127
