EFFICIENT DEEP LEARNING OF 3D STRUCTURAL BRAIN MRISFOR MANIFOLD LEARNING AND LESION SEGMENTATION WITHAPPLICATION TO MULTIPLE SCLEROSISbyTOM BROSCHDipl.-Ing., Otto-von-Guericke-Universität Magdeburg, 2010A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Biomedical Engineering)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)July 2016© Tom Brosch, 2016AbstractDeep learning methods have shown great success in many research areas such as objectrecognition, speech recognition, and natural language understanding, due to their ability toautomatically learn a hierarchical set of features that is tuned to a given domain and robustto large variability. This motivates the use of deep learning for neurological applications,because the large variability in brain morphology and varying contrasts produced bydifferent MRI scanners makes the automatic analysis of brain images challenging.However, 3D brain images pose unique challenges due to their complex content and highdimensionality relative to the typical number of images available, making optimizationof deep networks and evaluation of extracted features difficult. In order to facilitatethe training on large 3D volumes, we have developed a novel training method for deepnetworks that is optimized for speed and memory. Our method performs training ofconvolutional deep belief networks and convolutional neural networks in the frequencydomain, which replaces the time-consuming calculation of convolutions with element-wisemultiplications, while adding only a small number of Fourier transforms.We demonstrate the potential of deep learning for neurological image analysis usingtwo applications. One is the development of a fully automatic multiple sclerosis (MS)lesion segmentation method based on a new type of convolutional neural network thatconsists of two interconnected pathways for feature extraction and lesion prediction. ThisiiAbstractallows for the automatic learning of features at different scales that are optimized foraccuracy for any given combination of image types and segmentation task. Our networkalso uses a novel objective function that works well for segmenting underrepresentedclasses, such as MS lesions. The other application is the development of a statistical modelof brain images that can automatically discover patterns of variability in brain morphologyand lesion distribution. We propose building such a model using a deep belief network, alayered network whose parameters can be learned from training images. Our results showthat this model can automatically discover the classic patterns of MS pathology, as wellas more subtle ones, and that the parameters computed have strong relationships to MSclinical scores.iiiPrefaceThis thesis is primarily based on two journal papers, three conference papers, and one bookchapter, resulting from the collaboration between multiple researchers. In all publications,the contribution of the author was in developing, implementing, and evaluating themethod. All co-authors contributed to the editing of the manuscripts.Parts of the introduction to deep learning in Chapter 2 have been published in:• Brosch, Tom, Y. Yoo, L.Y.W. Tang, and R. Tam. Deep learning of brain images andits application to multiple sclerosis. In: G. Wu and M. Sabuncu (Eds.): MachineLearning and Medical Imaging, chapter 3, Elsevier, 2016. (in press)The contribution of the author was in writing the introduction of the book chapter. Y. Yooand R. Tam wrote the remaining sections of the chapter. All co-authors contributed to theediting of the manuscript.The study described in Chapter 3 has been published in:• Brosch, Tom and R. Tam. Efficient training of convolutional deep belief networks inthe frequency domain for application to high-resolution 2D and 3D images. NeuralComputation, 27(1), pages 211–227, 2015.ivPrefaceThe contribution of the author was in developing, implementing, and evaluating themethod. R. Tam helped with his valuable suggestions in improving the methodology.A study described in Chapter 4 has been published in:• Brosch, Tom, Y. Yoo, L.Y.W. Tang, D.K.B. Li, A. Traboulsee, and R. Tam. Deepconvolutional encoder networks for multiple sclerosis lesion segmentation. In: N.Navab et al. (Eds.): MICCAI 2015, Part III, LNCS 9351, pages 3–11, 2015.• Brosch, Tom, L.Y.W. Tang, Y. Yoo, D.K.B. Li, A. Traboulsee, and R. Tam. Deep 3Dconvolutional encoder networks with shortcuts for multiscale feature integrationapplied to multiple sclerosis lesion segmentation. IEEE Transactions on MedicalImaging, 35(5), pages 1229–1239, 2016.The contribution of the author was in developing, implementing, and evaluating themethod. L.Y.W. Tang helped with the evaluation of other lesion segmentation methods. A.Traboulsee and D.K.B. Li provided the data and clinical input. Y. Yoo, L.Y.W. Tang, and R.Tam helped with their valuable suggestions in improving the methodology.A study described in Chapter 5 has been published in:• Brosch, Tom and R. Tam, for the Alzheimer’s Disease Neuroimaging Initiative.Manifold learning of brain MRIs by deep learning. In: K. Mori et als. (Eds.):MICCAI 2013, Part II, LNCS 8150, pages 633–640, 2013.The contribution of the author was in developing, implementing, and evaluating themethod. R. Tam helped with his valuable suggestions in improving the methodology.vPrefaceA study described in Chapter 5 has been published in:• Brosch, Tom, Y. Yoo, A. Traboulsee, D.K.B. Li, and R. Tam. Modeling the variabilityin brain morphology and lesion distribution in multiple sclerosis by deep learning.In: P. Golland et al. (Eds.): MICCAI 2014, Part II, LNCS 8674, pages 462–469, 2014.The contribution of the author was in developing, implementing, and evaluating themethod. A. Traboulsee and D.K.B. Li provided the data and clinical input. Y. Yoo and R.Tam helped with their valuable suggestions in improving the methodology.viTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1.1 A short introduction to multiple sclerosis . . . . . . . . . . . . . . . . . . 11.1.2 Measuring disease state and progression . . . . . . . . . . . . . . . . . . 3viiTable of Contents1.2 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.2.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.2.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.2.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.3 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Background on Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.1.1 Dense neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.1.2 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . . . 152.2 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.2.1 From restricted Boltzmann machines to deep belief networks . . . . . . 182.2.2 Variants of restricted Boltzmann machines and deep belief networks . . 213 Training of Convolutional Models in the Frequency Domain . . . . . . . . . . . . . . 283.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303.3.1 Training in the spatial domain . . . . . . . . . . . . . . . . . . . . . . . . 303.3.2 Training in the frequency domain . . . . . . . . . . . . . . . . . . . . . . 323.3.3 Mapping of strided convolutions to stride-1 convolutions . . . . . . . . 353.3.4 GPU implementation and memory considerations . . . . . . . . . . . . . 363.4 Evaluation of running time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393.4.1 Comparison of running times for training sconvDBNs . . . . . . . . . . 393.4.2 Comparison of running times for calculating convolutions with cuDNN 453.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50viiiTable of Contents4 White Matter Lesion Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.2.1 Unsupervised methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524.2.2 Supervised methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534.2.3 Patch-based deep learning methods . . . . . . . . . . . . . . . . . . . . . 534.2.4 Fully convolutional methods . . . . . . . . . . . . . . . . . . . . . . . . . 544.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554.3.1 Model architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574.3.2 Gradient calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594.3.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624.3.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644.4 Experiments and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644.4.1 Data sets and preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 654.4.2 Comparison to other methods . . . . . . . . . . . . . . . . . . . . . . . . . 684.4.3 Measures of segmentation accuracy . . . . . . . . . . . . . . . . . . . . . 694.4.4 Training parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724.4.5 Impact of the training set size on the segmentation performance . . . . 734.4.6 Comparison on public data sets . . . . . . . . . . . . . . . . . . . . . . . . 744.4.7 Comparison of network architectures, input modalities, and publiclyavailable methods on clinical trial data . . . . . . . . . . . . . . . . . . . 794.4.8 Comparison for different lesion sizes . . . . . . . . . . . . . . . . . . . . 824.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85ixTable of Contents5 Manifold Learning by Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 885.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 885.2 Manifold learning for medical image analysis . . . . . . . . . . . . . . . . . . . 895.3 Manifold learning using deep belief networks . . . . . . . . . . . . . . . . . . . 925.3.1 General architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 925.3.2 Unit types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 935.3.3 Incorporating a region of interest . . . . . . . . . . . . . . . . . . . . . . . 945.4 Manifold of MRIs of Alzheimer’s disease patients . . . . . . . . . . . . . . . . 955.4.1 Data sets and preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 965.4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 975.5 Variability of morphology and lesion distribution . . . . . . . . . . . . . . . . 1015.5.1 Data sets and preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 1025.5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1025.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1085.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1126 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1136.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1146.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1156.2.1 New applications of deep learning . . . . . . . . . . . . . . . . . . . . . . 1156.2.2 Improving the performance on small data sets using data augmentation 1166.2.3 Rotation-invariant neural networks . . . . . . . . . . . . . . . . . . . . . . 1176.2.4 Increasing the expressive power of single neurons . . . . . . . . . . . . . 118Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120xList of Tables2.1 Key variables and notation. For notational simplicity, we assume the inputimages to be square 2D images. . . . . . . . . . . . . . . . . . . . . . . . . . . 233.1 Comparison of the memory required for storing key variables using differenttraining methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393.2 Training parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.3 Hardware specification of our test system for training sconvDBNs usingdifferent implementations for calculating convolutions. . . . . . . . . . . . . 403.4 CNN layer parameters for the comparison with cuDNN. . . . . . . . . . . . . 463.5 Hardware specification of our test system for the comparison with cuDNN. 463.6 Comparison of running times for calculating key operations for training aCNN layer for different batch sizes . . . . . . . . . . . . . . . . . . . . . . . . 483.7 Comparison of running times of time critical operations of two example deeplearning models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494.1 Population, disease and scanner characteristics of the two clinical trial datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664.2 Parameters of the 3-layer CEN for the evaluation on the challenge data sets . 754.3 Comparison of our method with state-of-the-art lesion segmentation methods 77xiList of Tables4.4 Selected methods out of the 52 entries submitted for evaluation to the MIC-CAI 2008 MS lesion segmentation challenge . . . . . . . . . . . . . . . . . . . 774.5 Comparison of our method with the second and third ranked methods fromthe ISBI MS lesion segmentation challenge . . . . . . . . . . . . . . . . . . . . 784.6 Parameters of the 3-layer CEN used on the clinical trial data set. . . . . . . . 794.7 Parameters of the 7-layer CEN-s used on the clinical trial data set. . . . . . . 804.8 Comparison of the segmentation accuracy of different CEN models, othermethods, and input modalities. . . . . . . . . . . . . . . . . . . . . . . . . . . . 814.9 Lesion size groups as used for the detailed analysis. . . . . . . . . . . . . . . 825.1 Parameters of the DBN used to learn the manifold of brain MRIs of AD andhealthy subjects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 975.2 Pearson correlation of demographic and clinical parameters with manifoldcoordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1015.3 Training parameters of the morphology DBN, lesion DBN, and joint DBN . 1055.4 Pearson correlations of clinical scores with distribution parameters andimaging biomarkers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110xiiList of Figures1.1 Demyelination in MS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.1 Schematic depiction for the calculations in a dense neural network . . . . . . 142.2 Convolutional neural network with two convolutional layers, one poolinglayer and one dense layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.3 Graph representation of an RBM with 3 hidden and 5 visible units . . . . . . 183.1 Comparison of training algorithms of convRBMs in the spatial and frequencydomain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.2 Mapping of strided convolutions to stride-1 convolutional . . . . . . . . . . . 363.3 Comparison of running times for training a sconvRBMs on 2D images . . . 423.4 Comparison of running times for training a sconvRBMs on 3D images . . . 443.5 Comparison of running times of key operations for training a single CNN layer. 473.6 Comparison of running times of key operations for training a single CNN layer. 484.1 Pre-training and fine-tuning a 7-layer convolutional encoder network withshortcut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.2 Improvement in the mean DSC computed on the training and test sets duringtraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72xiiiList of Figures4.3 ROC curves for different sensitivity ratios r . . . . . . . . . . . . . . . . . . . 734.4 Comparison of DSC scores calculated on the training and test sets for varyingnumbers of training samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744.5 Example segmentations of our method for three different subjects from theMICCAI challenge data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764.6 Impact of increasing the depth of the network on the segmentation perfor-mance of very large lesions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834.7 Comparison of segmentation performance of different CEN architectures forsmall lesions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834.8 Comparison of segmentation accuracy and lesion detection measures of a7-layer CEN-s and LST-LGA for different lesion sizes . . . . . . . . . . . . . . 845.1 General architecture of the DBN used to learn the manifold of brain MRIs . 945.2 Generalizability vs. specificity of a DBN for different numbers of layers . . . 985.3 Axial slices from generated volumes from the manifold . . . . . . . . . . . . 995.4 Axial slices of volumes from the training set plotted against their manifoldcoordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1005.5 Proposed model for discovering patterns of variability in MS brain . . . . . . 1035.6 Slices from generated volumes from the morphology, lesion, and joint model 1095.7 Axial and mid-sagittal slices of volumes representing the spectrum of MSFCscores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111xivList of Abbreviations9-HPT 9-hole peg testAD Alzheimer’s diseaseADNI Alzheimer’s Disease Neuroimaging InitiativeCD contrastive divergenceCEN convolutional encoder networkCEN-s convolutional encoder network with shortcut connectionsCNN convolutional neural networkCNS central nervous systemconvDBN convolutional deep belief networkconvRBM convolutional restricted Boltzmann machinecuDNN NVIDIA CUDA Deep Neural Network libraryDBN deep belief networkDNN dense neural networkxvList of AbbreviationsDSC Dice similarity coefficientEMS expectation maximization segmentationfCNN fully convolutional neural networkFFT fast Fourier transformFLAIR fluid-attenuated inversion recoveryFLIRT FSL linear image registration toolFN false negativeFNIRT FSL non-linear image registration toolFP false positiveGPU graphics processing unitICBM International Consortium for Brain MappingIsomap isometric feature mappingLEM Laplacian eigenmapsLFPR lesion-wise false positive rateLL lesion loadLLE locally linear embeddingLST Lesion Segmentation ToolboxxviList of AbbreviationsLST-LGA lesion growth algorithmLST-LPA lesion prediction algorithmLTPR lesion-wise true positive rateMLE maximum likelihood estimationMMSE mini-mental state examinationMNI Montreal Neurological InstituteMOPS model of population and subjectMR magnetic resonanceMRF Markov random fieldMRI magnetic resonance imagingMS multiple sclerosisnBV normalized brain volumeNReLU noisy rectified linear unitOASIS Open Access Series of Imaging StudiesPASAT paced auditory serial addition testPCD persistent CDPDw proton density-weightedpp percentage pointsxviiList of AbbreviationsPPMS primary progressive MSPPV positive predictive valuePRMS progressive relapsing MSRBM restricted Boltzmann MachineRF random forestRMSE root mean squared errorROC receiver operating characteristicROI region of interestRRMS relapsing remitting MSsconvDBN strided convolutional deep belief networksconvRBM strided convolutional restricted Boltzmann machineSGD stochastic gradient descentSLS Salem Lesion SegmentationSPMS secondary progressive MSSSD sum of squared differencesSVM support vector machineT1w T1-weightedT25W timed 25-foot walkT2w T2-weightedxviiiList of AbbreviationsTP true positiveTPR true positive rateVD relative absolute volume differencexixAcknowledgementsFirst and foremost, I want to express my deep gratitude to my supervisor Roger Tam,without whom I would not have started my PhD at UBC. From the very beginning of myinternship at the MS/MRI research group until the end of my PhD, Roger has createda very kind, enjoyable, and productive working environment. He gave me the freedomand trust, which allowed me to pursue my personal research interests. I will miss ourweekly meetings, which not only allowed me to discuss new ideas and potential struggles,but which were also filled with many joyful conversations. I’m most grateful for theunprecedented amount of time he dedicates to the supervision of his students, which hasalways motivated me to do the best research I can.I am very grateful for the continued support and valuable feedback from my co-supervisor Rafeef Abugharbieh and all members of my supervisory committee: PurangAbolmaesumi, Jane Z. Wang, and Ghassan Hamarneh. I would also like to thank mydepartmental and university defense committee, Robert Rohling, Jim Little, and MartinMcKeown, and the external examiner Daniel Rueckert for their valuable criticism andinsightful feedback.I’d like to thank everyone from the MS/MRI research group I’ve had the pleasureto be working with. Special thanks goes to Anthony Traboulsee and David K.B. Li forproviding insightful feedback about potential clinical applications of my research, AndrewxxAcknowledgementsRiddehough for providing a unique perspective about the operation of clinical trials, KevinLam and Rasmus Storjohann, who have taught me lots about software development, andKen Bigelow and Vilmos Soti for maintaining a productive working environment.I want to thank all my friends I’ve met at UBC who’ve made my everyday a joyful one.I want to thank my labmates Chunfang, Fahime, Saurabh, Youngjin, Robert, Lisa, Vanessa,and Niño for many stimulating discussions and welcome distractions, when I neededone. I also want to thank my friends from the other labs, Al, Antonio, Colin, Nishantand everyone from BMEGSA and ECEGSA for making me feel a part of the UBC/SFUresearch community.Finally, I want to express my gratitude to my family, who have supported me throughoutall these years. I want to thank my parents and my parents-in-law, for being understandingand supportive of our decision to move to Canada, even if it meant that we sometimeswon’t see each other for more than a year. I also owe the deepest gratitude to my wifeAnne, who has just been amazing since the moment I met her. Her encouragement andwarming love has carried me through all the years and allowed me to find the rightbalance between focusing on my research and enjoying this beautiful country.xxiDedication—To my wonderful wife.xxiiChapter 1Introduction1.1 Motivation1.1.1 A Short Introduction to Multiple SclerosisMultiple sclerosis (MS) is a chronic and degenerative disease of the central nervoussystem (CNS) characterized by the formation of inflammatory and demyelinating lesions.Presumably due to the breakdown of the blood-brain-barrier, the body’s own immunesystem attacks the myelin sheaths that act as insulating covers of the axons of neurons (seeFigure 1.1). This disrupts the ability of parts of the nervous system to communicate, andcan lead to the manifestation of a large range of different signs and symptoms. The brainis a very plastic organ and is often able to compensate for the damage by forming newneural pathways. In the later stages of the diseases, however, the amount of tissue damageis often too high to be offset, which leads to the progressive accumulation of disease.The clinical presentation of MS is very heterogeneous due to the wide range of areasof the brain and spinal cord that can be affected. Characteristic but not specific physicalsigns and symptoms include the loss of sensitivity or changes in sensation such as tingling1Chapter 1 IntroductionMyelin sheathsDamaged myelinFigure 1.1: Due to the break down of the blood brain barrier, the body’s own immune system attacksthe myelin sheaths of the axons, which causes the formation of demyelinating lesions, visibleprimarily in the white matter on conventional magnetic resonance imaging (MRI) scans.Lesions are highlighted in white.or numbness, which typically starts in the fingers and toes, as well as muscle weakness,difficulty in moving, difficulties with coordination and balance, problems with speech orswallowing, visual problems, feeling tired, acute or chronic pain, and bladder and boweldifficulties. In addition, MS often leads to cognitive problems such as having difficultieslearning and remembering information, or psychiatric and emotional problems such asdepression or frequent mood swings.The course of the disease is unpredictable. Most patients (around 80 %) are initiallydiagnosed as having relapsing remitting MS (RRMS), which is characterized by alternatingperiods of worsening due to inflammatory attacks and the formation of new lesions,and periods of remission and recovery. The majority (around 65 %) of RRMS patientstransition to secondary progressive MS (SPMS). In this stage, the body is no longer able2Chapter 1 Introductionto compensate for the tissue damage, which leads to the unremitting and progressiveaccumulation of disability. Other types of MS include the primary progressive form(PPMS), characterized by the progression of disability from onset with no remissions afterthe initial symptoms, and progressive relapsing MS (PRMS), which shows progressiveaccumulation of disability in addition to clear superimposed relapses.1.1.2 Measuring Disease State and ProgressionThere is currently no cure for MS. Existing therapies that focus on symptomatic manage-ment and prevention of further damage have variable degrees of effectiveness, althoughseveral recent breakthroughs are promising. To monitor and further our understanding ofthe disease, many biomarkers have been developed that allow for the objective measure-ment of normal and pathogenic processes, as well as the monitoring of treatment effect.Current MS biomarkers can be roughly classified into generic-immunogenetic, laboratorial,and imaging biomarkers (Katsavos and Anagnostouli, 2013). For this thesis, we focuson the accurate measurement of existing lesion-based imaging biomarkers such as lesionvolume and lesion count, and the development of new biomarkers that capture changes inbrain morphology and white matter lesions—two hallmarks of MS pathology.Accurate segmentation of MS lesions has shown to be a very challenging due to the largevariability in lesion size, shape, intensity, and locations, as well as the large variability inimaging contrasts produced by scanners used at different clinical sites. Despite a growinginterest in developing fully automatic lesion segmentation methods, semi-automaticmethods are still the standard in clinical research, although their use is time-consuming,laborious, and potentially biasing. It is therefore highly desirable to develop a fully-automatic lesion segmentation method that is robust to large variability, while still beingable to segment lesions with high accuracy and sensitivity.3Chapter 1 IntroductionImaging biomarkers used in clinical trials mostly focus on volumetric measures ofglobal and local changes, which are important and relatively easy to compute, but onlycorrelate modestly with clinical scores. One reason for the modest correlation is thatthey do not reflect potentially important structural variations, such as shape changes inthe brain and the spatial dispersion of lesions. Therefore, it would be highly desirableto develop biomarkers that capture potentially important patterns of the variability inbrain morphology and lesion distribution, which would advance our understanding ofthe complex pathology of MS.1.2 Proposed Method1.2.1 ObjectivesThe clinical motivation of the thesis is to develop methods that facilitate the automaticmeasurement of MS disease state and progression that are visible on conventional struc-tural MRIs. To that end, we have identified two key applications. One is the developmentof a fully automatic lesion segmentation method that is able to segment lesions overa large range of sizes and in the presence of varying imaging contrasts and imagingartifacts produced by different scanners, which would allow for the accurate measurementof lesion-based imaging biomarkers such as lesion load and lesion count. The otherkey application is the development of a method that automatically discovers potentiallyimportant patterns of variability in brain morphology and lesion distribution, with thegoal to derive new imaging biomarkers that correlate stronger with the clinical measuresof MS disability than traditionally used volumetric measures. The global objective of thethesis is to determine the capabilities of deep learning (LeCun et al., 2015) for these twoclinical applications.4Chapter 1 Introduction1.2.2 OverviewThe two methods developed for segmenting MS lesions and modelling patterns of vari-ability are both based on deep learning, a field within machine learning that is inspired bythe learning capabilities of the brain. The human brain has often served as a model forcomputer vision algorithms. For example, SIFT features (Lowe, 1999), inspired by neuronsof the inferior temporal cortex, have proved to be very robust for object recognition.Resembling the receptive field of simple cells of the primary visual cortex, 2D Gaborfilters have been used to describe textures for segmentation (Grigorescu et al., 2002). Incontrast, deep learning algorithms try to mimic how the visual system learns insteadof copying what it has learned. First evidence for the learning capabilities of the visualsystem were found by (Wiesel and Hubel, 1963), who investigated the visual cortex ofcats. They showed that the receptive fields of neurons are learned from a continuousstream of images early in the development of the visual system (Wiesel and Hubel, 1963),but they are also fine-tuned later (Karni and Sagi, 1991). This allows the neurons of thevisual cortex to adapt to the type of images to which it is exposed. While it is difficult,for example, for an average person to recognize the differences between cows of the samebreed, the feature detecting neurons of the visual system of cattle farmers are highly tunedto their appearance, allowing them to recognize individual cows easily. This suggests thata learning-based model for classification should not only learn to perform the requestedtask based on a set of pre-defined features, but also learn the features that are most suitableto perform the task. The joint learning of feature extraction and prediction, also knownas end-to-end learning, is possible through the use of deep learning methods, which usemultiple layers of nonlinear processing units to learn a feature hierarchy directly from theinput data without a dedicated feature extraction step.5Chapter 1 IntroductionDeep learning has successfully been used in many research areas such as object recogni-tion (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), natural languageunderstanding (Collobert et al., 2011b), and language translation (Sutskever et al., 2014).Deep learning methods are particularly successful due to their ability to recognize complexand highly nonlinear patterns in large amounts of training data, which facilitates thelearning of models that are robust to large variability. This motivates the use of deeplearning methods for segmenting MS lesions, because the large variability in lesion shape,size, contrast, and location as well as changes in imaging contrasts produced by differentMRI scanners make lesion segmentation challenging. Beyond voxel classification, deeplearning methods can also be used to model highly nonlinear patterns of variability ingroups of images. This allows deep learning models such as deep belief networks to beused for manifold learning, e.g., of hand written digits (Hinton et al., 2006) or, as wewill show in Chapter 5, brain MRIs (Brosch and Tam, 2013). However, deep learningalgorithms as implemented by widely used deep learning frameworks were originallydeveloped for application to small 2D images and do not scale well to large 3D volumes interms of training time and memory requirements, which prevents the use of out-of-the-boximplementations for 3D medical image analysis.1.2.3 ContributionsIn the course of developing deep learning-based methods for MS lesion segmentation andpattern discovery, we have made the following main contributions:1. We have developed a novel training algorithm for convolutional deep belief networksand convolutional neural networks that performs training in the frequency domain.The speed-ups gained by our method compared to state-of-the art spatial domain6Chapter 1 Introductionimplementations and the reduced memory requirements compared to other fre-quency domain methods enable the application of deep learning to high-resolution3D medical images.2. We have developed a neural network architecture that jointly learns features at differ-ent scales that are tuned to segmenting MS lesions and performs the segmentationbased on the automatically learned features. The joint learning of feature extractorand classifier facilitates the learning of features that are robust to the large variabilityof MS lesions and varying contrasts produced by different scanners.3. We have developed a novel objective function for training neural networks that issuitable for the classification of vastly unbalanced classes, such as the segmentationof MS lesions, which typically comprise less than one percent of the image.4. This is the first work to demonstrate that deep learning can be applied to manifoldlearning of brain MRIs.5. We have developed a framework for modelling changes in brain morphology andlesion distribution with only a few parameters, which also show improved correlationwith clinical scores compared to established volumetric imaging biomarkers.1.3 Thesis OutlineThe rest of this thesis is organized into five chapters as outlined below:Chapter 2—Background on Deep LearningIn this chapter, we will briefly introduce the supervised and unsupervised deep learningmodels that form the basis for the lesion segmentation and manifold learning methods,7Chapter 1 Introductionwhich are discussed further in Chapter 4 and Chapter 5. We will start with a descriptionof dense neural networks (Farley and Clark, 1954; Rumelhart et al., 1986; Werbos, 1974)and convolutional neural networks (Fukushima, 1980; LeCun et al., 1989, 1998). In thesecond part, we will give a brief overview of restricted Boltzmann Machines (Freund andHaussler, 1992; Hinton, 2010), which are the building blocks of deep belief networks(Hinton et al., 2006).Chapter 3—Training of Convolutional Models in the Frequency DomainDeep learning has traditionally been computationally expensive and advances in trainingmethods have been the prerequisite for improving its efficiency in order to expand itsapplication to a variety of image classification problems. In this chapter, we addressthe problem of efficient training of convolutional deep belief networks by learning theweights in the frequency domain, which eliminates the time-consuming calculation ofconvolutions. An essential consideration in the design of the algorithm is to minimize thenumber of transformations to and from frequency space. We have evaluated the runningtime improvements using two standard benchmark data sets, showing a speed-up ofup to 8 times on 2D images and up to 200 times on 3D volumes. In addition, we havedirectly compared the time required to calculate convolutions using our method withthe NVIDIA CUDA Deep Neural Network library (cuDNN), the current state-of-the-artlibrary for calculating 2D and 3D convolutions, with the results showing that our methodcan calculate convolutions up to 20 times faster than cuDNN. Our training algorithmmakes training of convolutional deep belief networks and convolutional neural networkson 3D volumes with a resolution of up to 128× 128× 128 voxels practical, which opensnew directions for using deep learning for medical image analysis.8Chapter 1 IntroductionChapter 4—White Matter Lesion SegmentationIn this chapter, we present a novel segmentation approach based on deep 3D convolutionalencoder networks with shortcut connections and apply it to the segmentation of MSlesions in magnetic resonance images. Our model is a neural network that consists oftwo interconnected pathways, a convolutional pathway, which learns increasingly moreabstract and higher-level image features, and a deconvolutional pathway, which predictsthe final segmentation at the voxel level. The joint training of the feature extraction andprediction pathways allows for the automatic learning of features at different scales that areoptimized for accuracy for any given combination of image types and segmentation task.In addition, shortcut connections between the two pathways allow high- and low-levelfeatures to be integrated, which enables the segmentation of lesions across a wide rangeof sizes. We have evaluated our method on two publicly available data sets (MICCAI 2008and ISBI 2015 challenges) with the results showing that our method performs comparablyto the top-ranked state-of-the-art methods, even when only relatively small data setsare available for training. In addition, we have compared our method with five freelyavailable and widely used MS lesion segmentation methods (EMS, LST-LPA, LST-LGA,Lesion-TOADS, and SLS) on a large data set from an MS clinical trial. The results showthat our method consistently outperforms these other methods across a wide range oflesion sizes.Chapter 5—Manifold Learning by Deep LearningManifold learning of medical images plays a potentially important role for modellinganatomical variability within a population with applications that include segmentation,registration, and prediction of clinical parameters. In this chapter, we describe a novel9Chapter 1 Introductionmethod for learning the manifold of 3D brain images and for building a statistical modelof brain images that can automatically discover spatial patterns of variability in brainmorphology and lesion distribution. We propose building such a model using a deepbelief network, a layered network whose parameters can be learned from training images.In contrast to other manifold learning algorithms, this approach does not require a prebuiltproximity graph, which is particularly advantageous for modelling lesions, because theirsparse and random nature makes defining a suitable distance measure between lesionimages challenging. Our results show that this model can automatically learn a low-dimensional manifold of brain volumes that detects modes of variations that correlate todemographic and disease parameters. Furthermore, our model can automatically discoverthe classic patterns of MS pathology, as well as more subtle ones, and the computedparameters have strong relationships to MS clinical scores.Chapter 6—Conclusions and Future WorkThis chapter concludes the thesis with a brief summary of the problems addressed and keyresults. In addition, some directions for future work are given in the broader context ofdeep learning for medical imaging. In particular, we will give suggestions for new medicalimage analysis applications of deep learning and how to deal with relatively small datasets using data augmentation. In addition, we will discuss two potential advancements ofneural networks that we believe will be particularly important for medical applications.10Chapter 2Background on Deep LearningDeep learning (LeCun et al., 2015) is a field within machine learning that has beenstudied since the early 1980s (Fukushima, 1980). However, deep learning methods did notgain in popularity until the late 2000s with the advent of fast general purpose graphicsprocessors (Raina et al., 2009), layerwise pre-training methods (Hinton et al., 2006;Hinton and Salakhutdinov, 2006), and large data sets (Deng et al., 2009; Krizhevskyet al., 2012). Since then, deep learning methods have become the state of the art in manynon-medical (Krizhevsky et al., 2012; Sainath et al., 2013) and medical (Cires¸an et al.,2012; Kamnitsas et al., 2015) applications. There are many different algorithms andmodels that are commonly referred to as deep learning methods, all of which have twoproperties in common: 1) the use of multiple layers of nonlinear processing units forextracting features, and 2) the layers are organized to form a hierarchy of low-level tohigh-level features. Representing data in a feature hierarchy has many advantages forclassification and other applications. To give an example of a feature hierarchy, let usconsider the domain of face images. The lowest layer of the feature hierarchy is composedof the raw pixel intensities, which are the most basic features of an image. Multiple pixels11Chapter 2 Background on Deep Learningcan be grouped to form general image features like edges and corners, which can befurther combined to form face parts such as different variations of noses, eyes, mouths,and ears. Finally, multiple face parts can be combined to form a variety of face images.Learning a feature hierarchy facilitates the parameterization of a large feature space witha small number of values by capturing complex relationships between feature layers. Forexample, a feature hierarchy consisting of three prototypical shapes for mouths, eyes, ears,and noses is able to represent 3× 3× 3× 3 = 81 different prototypical faces with only3 + 3 + 3 + 3 = 12 features. Without a hierarchical representation of the data, a modelwould require 81 prototypical face features to span the same face manifold.In this chapter, we will briefly introduce the supervised and unsupervised deep learningmodels that form the basis for the lesion segmentation and manifold learning methods,which are discussed further in Chapter 4 and Chapter 5. We will start with a descriptionof dense neural networks (DNNs) (Farley and Clark, 1954; Rumelhart et al., 1986;Werbos, 1974) and convolutional neural networks (CNNs) (Fukushima, 1980; LeCun et al.,1989, 1998). In the second part, we will give a brief overview of restricted BoltzmannMachines (RBMs) (Freund and Haussler, 1992; Hinton, 2010), which are the buildingblocks of deep belief networks (DBNs) (Hinton et al., 2006).2.1 Supervised LearningA typical pipeline for classifying images consists of two main steps. In the first step,predefined or learned features are extracted from the input images, which are then used totrain a separate supervised learning model, such as a random forest (RF) (Breiman, 2001)or a support vector machine (SVM) (Cortes and Vapnik, 1995), to perform classificationor prediction. Alternatively, classification and prediction can be performed with a single12Chapter 2 Background on Deep Learningmodel that takes the raw input data and produces the desired output, such as classprobabilities. This type of learning is called end-to-end learning and has shown greatpotential for medical image analysis (Cires¸an et al., 2012). The most popular modelsfor end-to-end learning are neural networks due to their ability to learn a hierarchicalset of features from raw input data. This allows the learning of features that are tunedfor a given combination of input modalities and classification task, but is more prone tooverfitting than unsupervised feature learning methods, especially when the amount oflabeled data is limited. In this section, we will start with an introduction to dense neuralnetworks, followed by a concise overview of convolutional neural networks.2.1.1 Dense Neural NetworksA dense neural network (see Figure 2.1) is a deterministic function that maps input datato the desired outputs through the successive application of multiple nonlinear mappingsof the following formzl = Wlxl−1 + bl , (2.1)xl = fl(zl), (2.2)where xl denotes a vector containing the units of layer l, x0 denotes a vector containing theinput of the neural network, xL denotes a vector containing the output, L is the numberof computational layers, fl are transfer functions, Wl are weight matrices, and bl are biasterms. Popular choices for the transfer function are the sigmoid function f (x) = sigm(x)and the rectified linear function f (x) = max(0, x). The same transfer function is typicallyused for all layers except for the output layer. The choice of the output transfer function13Chapter 2 Background on Deep Learningx0x1xlxL−1xLWlHidden unitsInput units Output unitsFigure 2.1: Schematic depiction for the calculations in a dense neural network. Each hidden and outputunit calculates the weighted sum of its inputs followed by the application of a transferfunction.depends on the learning task. For classification, a 1-of-n encoding of the output class isusually used in combination with the softmax transfer function defined assoftmax(a)i =exp(ai)∑nj=1 exp(aj), (2.3)where a denotes an n-dimensional output vector.Given a training set D = {(x(i)0 , y(i)) | i ∈ [1, N]}, a neural network is trained byminimizing the error between the predicted outputs x(i)L and the given labels y(i)θˆ = arg minθN∑i=1E(x(i)L , y(i)), (2.4)where θ denotes the trainable parameters of the neural network. Typical choices forthe error function are the sum of squared differences (SSD) and the cross-entropy. Theminimization problem can be solved using stochastic gradient descent (SGD) (Polyak andJuditsky, 1992; Rumelhart et al., 1986), which requires the calculation of the gradient of14Chapter 2 Background on Deep Learningthe error function with respect to the model parameters. The gradient can be calculatedby backpropagation (Werbos, 1974) as followsδL = ∇xL E · f ′L(zL), (2.5)δl = (WTl+1δl+1) · f ′l (zl) for l < L, (2.6)∇Wl E = δlxTl−1, (2.7)∇bl E = δl , (2.8)where ∇xL E denotes the gradient of the error function with respect to the predicted outputand · denotes element-wise multiplication.2.1.2 Convolutional Neural NetworksThe structure of CNNs is inspired by the complex arrangement of simple and complex cellsfound in the visual cortex (Hubel and Wiesel, 1962, 1968). Simple cells are only connectedto a small sub-region of the previous layer and need to be tiled to cover the entire visualfield. In a CNN (see Figure 2.2), simple cells are represented by convolutional layers,which exhibit a similar mechanism of local connectivity and weight sharing. Complexcells combine the activations of simple cells to add robustness to small translations. Thesecells are represented in the form of pooling layers. After several alternating convolutionaland pooling layers, the activations of the last convolutional layer are fed into one or moredense layers to carry out the final classification.For multimodal 3D volumes, the neurons of convolutional and pooling layers arearranged in a 4D array, where the first three dimensions correspond to the dimensions of15Chapter 2 Background on Deep LearningConvolutional layerPooling layerVectorizehidden unitsDense layerconvolutionpoolingMultimodalinput imagesHidden unitsOutputunitsFigure 2.2: Convolutional neural network with two convolutional layers, one pooling layer and onedense layer. The activations of the last layer are the output of the network.the input volume, and the forth dimension indexes the input modality or channel. Theactivations of the output of a convolutional layer are calculated byx(l)j = f(C∑i=1w˜(l)ij ∗ x(l−1)i + b(l)j), (2.9)where l is the index of a convolutional layer, x(l)j denotes the jth channel of the outputvolume, w(l)ij is a 3D filter kernel connecting the ith channel of the input volume to the jthchannel of the output volume, b(l)j denotes the bias term of the jth output channel, and w˜denotes a flipped version of w, i.e. w˜(a) = w(−a). CNNs can be trained using stochasticgradient descent, where the gradient can be derived analogously to dense neural networksand calculated using backpropagation (LeCun et al., 1989, 1998).Different types of operations (Scherer et al., 2010) have been proposed for the poolinglayers, with the common goal of creating a more compact representation of the input data.16Chapter 2 Background on Deep LearningThe most commonly used type of pooling is max-pooling. Therefore, the input to thepooling layer is divided into small blocks and only the maximum value of each block aspassed on to the next layer, which makes the representation of the input invariant to smalltranslations in addition to reducing its dimensionality.A major challenge for gradient-based optimization methods is the choice of an appro-priate learning rate. Classic stochastic gradient descent (LeCun et al., 1998) uses a fixed ordecaying learning rate, which is the same for all parameters of the model. However, thepartial derivatives of parameters of different layers can vary substantially in magnitude,which can require different learning rates. In recent years, there has been an increasinginterest in developing methods for automatically choosing independent learning rates.Most methods (e.g., AdaGrad by Duchi et al., 2011; AdaDelta by Zeiler, 2012; RMSpropby Dauphin et al., 2015; and Adam by Kingma and Ba, 2014) collect different statistics ofthe partial derivatives over multiple iterations and use this information to set an adaptivelearning rate for each parameter. This is especially important for the training of deepnetworks, where the optimal learning rates often differ greatly for each layer.2.2 Unsupervised LearningOne of the most important applications of deep learning is to learn patterns of variabilityin the form of a feature hierarchy from unlabeled images. The key to learning such ahierarchy is the ability of deep models to be trained layer by layer, where each layer acts asa nonlinear feature extractor. Various methods have been proposed for feature extractionfrom unlabeled images. In this section, we will first introduce the restricted Boltzmannmachines (Freund and Haussler, 1992; Hinton, 2010), which are the building blocks ofthe later described deep belief networks (Hinton et al., 2006).17Chapter 2 Background on Deep Learningh1 h2 h3v1 v2 v3 v4 v5w11Visible units vHidden units hWeights W betweenvisible and hidden unitsFigure 2.3: Graph representation of an RBM with 3 hidden and 5 visible units. An RBM models thejoint probability of visible and hidden units. Edges between vertices denote conditionaldependence between the corresponding random variables.2.2.1 From Restricted Boltzmann Machines to Deep Belief NetworksAn RBM is a probabilistic graphical model defined by a bipartite graph as shown inFigure 2.3. The units of the RBM are divided into two layers, one of visible units v and theother of hidden units h. There are no direct connections between units within either layer.An RBM defines the joint probability of visible and hidden units in terms of the energy E,p(v,h | θ) = 1Z(θ)e−E(v,h|θ). (2.10)When the visible and hidden units are binary, the energy is defined as−E(v,h | θ) =∑i,jviwijhj +∑ibivi +∑jcjhj, (2.11)= vTWh+ bTv+ cTh, (2.12)where Z(θ) is a normalization constant, W denotes the weight matrix that connects thevisible units with the hidden units, b is a vector containing the visible bias terms, c is avector containing the hidden bias terms, and θ = {W,b, c} are the trainable parameters ofthe RBM.18Chapter 2 Background on Deep LearningInferenceThe hidden units represent patterns of similarity that can be observed in groups of images.Given a set of model parameters θ, the features of an image can be extracted by calculatingthe expectation of the hidden units. The posterior distribution of the hidden units giventhe visible units can be calculated byp(hj = 1 | v, θ) = sigm(wT·,jv+ cj), (2.13)where w·,j denotes the jth column vector of W, and sigm(x) is the sigmoid functiondefined as sigm(x) = (1 + exp(−x))−1, x ∈ R. An RBM is a generative model, whichallows for the reconstruction of an input signal given its features. This is achieved bycalculating the expectation of the visible units given the hidden units. The posteriordistribution p(vi = 1 | h, θ) can be calculated byp(vi = 1 | h, θ) = sigm(wTi,·h+ bi), (2.14)where wi,· denotes the ith row vector of W. Reconstructing the visible units can be usedto visualize the learned features. To visualize the features associated with a particularhidden unit, all other hidden units are set to zero and the expectation of the visible unitsis calculated, which represents the pattern that causes a particular hidden unit to beactivated.TrainingRBMs can be trained by maximizing the likelihood or, more commonly, the log-likelihoodof the training data, D = {vn | n ∈ [1, N]}, which is called maximum likelihood estimation19Chapter 2 Background on Deep Learning(MLE). The gradient of the log-likelihood function with respect to the weights, W, is givenby the mean difference of two expectations∇W log p(D | θ) = 1NN∑n=1E[vhT | vn, θ]−E[vhT | θ]. (2.15)The first expectation can be estimated using a mean field approximationE[vhT | vn, θ] ≈ E[v | vn, θ]E[hT | vn, θ], (2.16)= vnE[hT | vn, θ]. (2.17)The second expectation is typically estimated using a Monte Carlo approximationE[vhT | θ] ≈ 1SS∑s=1vshTs , (2.18)where S is the number of generated samples, and vs and hs are samples drawn fromp(v | θ) and p(h | θ), respectively. Samples from an RBM can be generated efficientlyusing block Gibbs sampling, in which the visible and hidden units are initialized withrandom values and alternately sampled given the previous state usinghj = I(yj < p(hj = 1 | v, θ)) with yj ∼ U(0, 1) (2.19)vi = I(xi < p(vi = 1 | h, θ)) with xi ∼ U(0, 1) (2.20)where z ∼ U(0, 1) denotes a sample drawn from the uniform distribution in the interval[0, 1] and I is the indicator function, which is defined as 1 if the argument is true and 0otherwise. After several iterations, a sample generated by the Gibbs chain is distributedaccording to p(v,h | θ).20Chapter 2 Background on Deep LearningIf the Gibbs sampler is initialized at a data point from the training set and onlyone Monte Carlo sample is used to approximate the second expectation in (2.15), thelearning algorithm is called contrastive divergence (CD) (Hinton, 2002). Alternatively,persistent CD (PCD) (Tieleman, 2008) uses several separate Gibbs chains to generatedata independent samples from the model, which results in a better approximation of thegradient of the log-likelihood than CD.To speed up the training of RBMs using either CD and PCD, the data set is usuallydivided into small subsets called mini-batches and a gradient step is performed for eachmini-batch. To avoid confusion with a gradient step, the term “iteration” is generallyavoided and the term “epoch” is used instead to indicate a sweep through the entire dataset. Additional tricks to monitor and speed up the training of an RBM can be found inHinton’s RBM training guide (Hinton, 2010). A detailed explanation and comparison ofdifferent training algorithms is given in Chapter 3.Deep Belief NetworksA single RBM can be regarded as a nonlinear feature extractor. To learn a hierarchical setof features, multiple RBMs are stacked and trained layer by layer, where the first RBM istrained on the input data and subsequent RBMs are trained on the hidden unit activationscomputed from the previous RBMs. The stacking of RBMs can be repeated to initializeDBNs of any depth.2.2.2 Variants of Restricted Boltzmann Machines and Deep Belief NetworksMany variants of RBMs and DBNs have been proposed to adapt them for differentdomains. In this section, we will first introduce the convolutional deep belief networks21Chapter 2 Background on Deep Learning(convDBNs), which allow DBNs to be applied to high-resolution images, followed by adiscussion of different unit types, which allow DBNs to be applied to real-valued data likethe intensities of some medical images.Convolutional Deep Belief NetworksA potential drawback of DBNs is that the learned features are location dependent. Hence,features that can occur at many different locations in an image, such as edges and corners,must be relearned for every possible location, which dramatically increases the numberof features required to capture the content of large images. To increase the translationalinvariance of the learned features, Lee et al. introduced the convDBN (Lee et al., 2009,2011). In a convDBN, the units of each layer are organized in a multidimensional arraythat reflects the arrangement of pixels in the input images. The units of one layer areonly connected to the units of a sub-region of the previous layer, and share the sameweights with all other units of the same layer. This greatly reduces the number of trainableweights, which reduces the risk of overfitting, reduces the memory required to storethe model parameters, speeds up the training, and thereby facilitates the application tohigh-resolution images.A convDBN consists of alternating convolutional and pooling layers, which are followedby one or more dense layers. Each convolutional layer of the model can be trained in a22Chapter 2 Background on Deep LearningTable 2.1: Key variables and notation. For notational simplicity, we assume the input images to besquare 2D images.Symbol Descriptionv(i) a 2D array containing the units of the ith input channelh(j) a 2D array containing the units of jth output channel or featuremapw(ij) a 2D array containing the weights of filter kernels connectingvisible units v(i) to hidden units h(j)bi bias terms of the visible unitscj bias terms of the hidden unitsNc number of channels of the visible unitsNv width and height of the image representing the visible unitsNk number of filters and feature mapsNh width and height of a feature map• element-wise product followed by summation∗ valid convolution~ full convolutionw˜(ij) horizontally and vertically flipped version of w(ij), i.e.,w˜(ij)uv = w(ij)Nw−u+1,Nw−v+1, where Nw denotes the width andheight of a filter kernelgreedy layerwise fashion by treating it as a convolutional restricted Boltzmann machine(convRBM). The energy of a convRBM with binary visible and hidden units is defined asE(v,h) = −Nc∑i=1Nk∑j=1Nh∑x,y=1Nw∑u,v=1h(j)xy w(ij)uv v(i)x+u−1,y+v−1 −Nc∑i=1biNv∑x,y=1v(i)xy −Nk∑j=1cjNh∑x,y=1h(j)xy (2.21)= −Nc∑i=1Nk∑j=1h(j) • (w˜(ij) ∗ v(i))−Nc∑i=1biNv∑x,y=1v(i)xy −Nk∑j=1cjNh∑x,y=1h(j)xy . (2.22)The key terms and notations are defined in Table 2.1. At the first layer, the number ofchannels Nc is one when trained on unimodal images, or equal to the number of inputmodalities when trained on multimodal images. For subsequent layers, Nc is equal to thenumber of filters of the previous layer.23Chapter 2 Background on Deep LearningThe posterior distributions p(h | v) and p(v | h) can be derived from the energyequation and are given byp(h(j)xy = 1 | v) = sigm( Nc∑i=1(w˜(ij) ∗ v(i))xy + cj), and (2.23)p(v(i)xy = 1 | h) = sigm( Nk∑j=1(w(ij) ~ h(j))xy + bi). (2.24)To train a convRBM on a set of images D = {vn | n ∈ [1, N]}, the weights and biasterms can be learned by CD. During each iteration of the algorithm, the gradient of eachparameter is estimated and a gradient step with a fixed learning rate is applied. Thegradient of the filter weights can be approximated by∆w(ij) ≈ 1N(v(i)n ∗ h˜(j)n − v′(i)n ∗ h˜′(j)n ), (2.25)where h(j)n and h′(j)n are samples drawn from p(h(j) | vn) and p(h(j) | v′n), and v′(i)n =E[v(i) | hn].Strided Convolutional ModelsStrided convolutions are a type of convolution that shifts the filter kernel as a slidingwindow with a step size or stride s > 1, stopping at only Nv/s positions. Replacingstride-1 convolutions with strided convolutions, we can define the energy function of astrided convolutional restricted Boltzmann machine (sconvRBM) as followsE(v,h) = −Nc∑i=1Nk∑j=1Nh∑x,y=1Nw∑u,v=1h(j)xy w(ij)uv v(i)s(x−1)+u,s(y−1)+v −Nc∑i=1biNv∑x,y=1v(i)xy −Nk∑j=1cjNh∑x,y=1h(j)xy .(2.26)24Chapter 2 Background on Deep LearningStrided convolutional RBMs have several advantages over traditional convRBMs. The useof strided convolutions reduces the number of hidden units per channel to Nh = Nv/s,hence significantly reducing training time and memory required for storing the hiddenunits during training. Furthermore, strided convolutional deep belief networks (scon-vDBNs) do not require pooling layers to reduce the dimensionality, because dimensionalityreduction is already performed within the sconvRBM layers. Consequently, inference inan sconvDBN is invertible, which allows for the visualization of detected patterns similarto DBNs. Rules for inference, sampling, and training of an sconvRBM can be derivedanalogous to convRBMs. Furthermore, in Section 3.3.3, we will show how to convert ansconvRBM into an equivalent convRBM by reorganizing the hidden units, which enablesefficient training and inference of sconvRBMs in the frequency domain.Alternative Unit TypesTo model real-valued inputs like the intensities of some medical images, the binary visibleunits of an RBM can be replaced with Gaussian visible units, which leads to the followingenergy function− E(v,h | θ) =∑i,jviσiwijhj +∑i(vi − bi)22σ2i+∑jcjhj, (2.27)where the mean of the ith visible unit is encoded in the bias term bi, and its standarddeviation is given by σi. Although approaches have been proposed for learning thestandard deviation (Cho et al., 2011), the training data is often simply standardized to25Chapter 2 Background on Deep Learninghave zero mean and unit variance, which yields the following simplified rules for inferringof the visible and hidden units:E[hj | v, θ] = sigm(wT·,jv+ cj), (2.28)E[vi | h, θ] = wTi,·h+ bi. (2.29)A binary hidden unit can only encode two states. In order to increase the expressivepower of the hidden units, Nair et al. proposed using noisy rectified linear units (NReLUs)(Nair and Hinton, 2010) as the hidden units, and showed that this can improve thelearning performance of RBMs. The signal of an NReLU is the sum of an infinite numberof binary units, all of which having the same weights but different bias terms. In thespecial case where the offsets of their bias terms are set to −0.5,−1.5, . . . , the sum of theirprobabilities and therefore the expectation of an NReLU is extremely close to having aclosed form:E[hj | v, θ] =∞∑i=1sigm(wT·,jv+ cj − i + 0.5), (2.30)≈ log(1+ exp(wT·,jv+ cj)). (2.31)However, sampling of this type of unit involves the repeated calculation of the sigmoidfunction, which can be time-consuming. If a sample is not constrained to being an integer,a fast approximation can be calculated withhj ∼ max(0, µj +N (0, sigm(µj))), (2.32)µj = wT·,jv+ cj, (2.33)26Chapter 2 Background on Deep Learningwhere N (0, σ2) denotes Gaussian noise.In this section, we have introduced the most basic deep learning methods, which formthe basis for the segmentation and manifold learning methods explained in Chapter 4 andChapter 5, respectively. The next chapter details our training algorithm of convDBNs andCNNs in full, along with a comparison of alternative training methods.27Chapter 3Training of Convolutional Models in theFrequency Domain3.1 Related WorkDeep learning (LeCun et al., 2015) has traditionally been computationally expensive andadvances in training methods have been the prerequisite for expanding its applicationto a variety of image classification problems. The development of layer-wise trainingmethods (Hinton et al., 2006) greatly improved the efficiency of the training of deepbelief networks (DBNs) (Hinton et al., 2006), which has made feasible the use of largesets of small images (e.g. 28× 28), such as those used for hand-written digit recognition.Subsequently, new directions for speeding up the training of deep models were openedwith the advance of programmable graphics cards (GPUs), which can perform thousandsof operations in parallel. Raina et al. (2009) demonstrated that by using graphics cards,training of restricted Boltzmann Machines (RBMs) (Freund and Haussler, 1992; Hinton,2010) on small image patches (e.g. 24× 24) can be performed up to 70 times faster than28Chapter 3 Training of Convolutional Models in the Frequency Domainon the CPU, facilitating the application to larger training sets. However, the number oftrainable weights of a deep belief network (DBN) increases greatly with the resolution ofthe training images, which can make training on large images impracticable. In order toscale DBNs to high-resolution images, Lee et al. (2009, 2011) introduced the convolutionaldeep belief network (convDBN), a deep generative model that uses weight sharing andlocal connectivity to reduce the number of trainable weights. They showed that a convDBNcan be used to classify images with a resolution up to 200× 200 pixels. To speed up thetraining of convolutional neural networks (CNNs) (Fukushima, 1980; LeCun et al., 1989,1998) on high-resolution images, Krizhevsky et al. (2012) replaced traditional convolutionsof the first layer of their CNN with strided convolutions, a type of convolution that shiftsthe filter kernel as a sliding window with a fixed step size or stride greater than one.Through the use of strided convolutions, the number of hidden units in each convolutionallayer is greatly reduced, which reduces both training time and required memory. Using ahighly optimized GPU implementation of convolutions, they were able to train a CNN onimages with a resolution of 256× 256 pixels, achieving state-of-the-art performance onthe ILSVRC-2010 and ILSVRC-2012 natural image classification competitions (Krizhevskyet al., 2012). An alternative approach for calculating convolutions was proposed byMathieu et al. (2014) who sped up the training of CNNs by calculating convolutionsbetween batches of images and filters using fast Fourier transforms (FFTs), albeit atthe cost of additional memory required for storing the filters. Recently, NVIDIA hasreleased a library called cuDNN (Chetlur et al., 2014) that provides GPU-optimizedimplementations of 2D and 3D convolutions among other operations that are frequentlyused to implement deep learning methods, which further reduces training times comparedto the previous state-of-the-art.29Chapter 3 Training of Convolutional Models in the Frequency Domain3.2 OverviewIn this chapter, we detail our training algorithm and GPU implementation in full, with athorough analysis of the running time on high-resolution 2D images (512× 512) and 3Dvolumes (128× 128× 128), showing speed-ups of up to 8-fold and 200-fold, respectively fortraining convolutional restricted Boltzmann machines (convRBMs) and a 7-fold speed upfor computing key operations for the training of CNNs compared to cuDNN. The proposedmethod performs training in the frequency domain, which replaces the calculation oftime-consuming convolutions with simple element-wise multiplications, while addingonly a small number of FFTs. In contrast to similar FFT-based approaches (e.g., Mathieuet al., 2014), our method does not use batch processing of the images as a means toreduce the number of FFT calculations, but rather minimizes FFTs even when processinga single image, which significantly reduces the required amount of scarce GPU memory.In addition, we formalize the expression of the strided convolutional deep belief network(sconvDBN), a type of convDBN that uses strided convolutions to speed up training andreduce memory requirements, in terms of stride-1 convolutions, which enables the efficienttraining of sconvDBNs in the frequency domain.3.3 Algorithm3.3.1 Training in the Spatial DomainBefore we explain the training algorithm in the frequency domain, we will briefly revisethe basic steps for training convRBMs in the spatial domain. The weights and bias termsof a convRBM can be learned by contrastive divergence (CD). During each iteration of the30Chapter 3 Training of Convolutional Models in the Frequency Domainalgorithm, the gradient of each parameter is estimated and a gradient step with a fixedlearning rate is applied. The gradient of the filter weights can be approximated by∆w(ij) ≈ 1N(v(i)n ∗ h˜(j)n − v′(i)n ∗ h˜′(j)n ) (3.1)where vn, n ∈ [1, N] are images from the training set, h(j)n and h′(j)n are samples drawnfrom p(h(j) | vn) and p(h(j) | v′n), v′(i)n = E[v(i) | hn], ∗ denotes valid convolution, andh˜(j)n denotes a horizontal and vertical flipped version of h(j)n . To apply the model toreal-valued data like certain types of images, the visible units can be modelled as Gaussianunits. When the visible units are mean–centred and standardized to unit variance, theexpectation of the visible units is given byE[v(i) | h] =Nk∑j=1w(ij) ~ h(j) + bi, (3.2)where ~ denotes full convolution. A binary hidden unit can only encode two states. Inorder to increase the expressive power of the hidden units, we use noisy rectified linearunits (NReLUs) as the hidden units, which have been shown to improve the learningperformance of RBMs (Nair and Hinton, 2010). The hidden units can be sampled withh(j) ∼ max(0, µ(j) +N (0, sigm(µ(j)))) (3.3)µ(j) =Nc∑i=1w˜(ij) ∗ v(i) + cj (3.4)where N (0, σ2) denotes Gaussian noise, and sigm(x) is the sigmoid function definedas sigm(x) = (1 + exp(−x))−1, x ∈ R. The learning algorithm in the spatial domain issummarized in Figure 3.1(a).31Chapter 3 Training of Convolutional Models in the Frequency Domaininput : Images D = {Vn | n ∈ [1, N]}output : Weights gradient ∆w1 ∆w = 02 foreach image v ∈ D do3 v′ = 04 for j = 1 to Nk do5 µ(j) = ∑Nci=1 w˜(ij) ∗ v(i) + cj67 h(j) ∼ max(0, µ(j) +N (0, sigm(µ(j))))89 for i = 1 to Nc do10 ∆w(ij) = ∆w(ij) + h˜(j) ∗ v(i)11 v′(i) = v′(i) +w(ij) ~ h(j)12 ∀i : v′(i) = v′(i) + bi13 for j = 1 to Nk do14 µ′(j) = ∑Nci=1 w˜(ij) ∗ v′(i) + cj1516 h′(j) ∼ max(0, µ′(j)+17 N (0, sigm(µ(j))))1819 for i = 1 to Nc do20 ∆w(ij) = ∆w(ij) − h˜′(j) ∗ v′(i)(a) Training in the spatial domaininput : Images Dˆ = {vˆn | n ∈ [1, N]}output : Weights gradient ∆wˆ1 ∆wˆ = 02 foreach image vˆ ∈ Dˆ do3 vˆ′ = 04 for j = 1 to Nk do5 µˆ(j) = ∑Nci=1 wˆ(ij) · vˆ(i)6 µ(j) = ifft(µˆ(j)) + cj7 h(j) ∼ max(0, µ(j) +N (0, sigm(µ(j))))8 hˆ(j) = fft(h(j))9 for i = 1 to Nc do10 ∆wˆ(ij) = ∆wˆ(ij) + hˆ(j) · vˆ(i)11 vˆ′(i) = vˆ′(i) + wˆ(ij) · hˆ(j)12 ∀i : vˆ′(i)0,0 = vˆ′(i)0,0 + N2vbi13 for j = 1 to Nk do14 µˆ′(j) = ∑Nci=1 wˆ(ij) · vˆ′(i)15 µ′(j) = ifft(µˆ′(j)) + cj16 h′(j) ∼ max(0, µ′(j)+17 N (0, sigm(µ(j))))18 hˆ′(j) = fft(h′(j))19 for i = 1 to Nc do20 ∆wˆ(ij) = ∆wˆ(ij) − hˆ′(j) · vˆ′(i)(b) Training in the frequency domainFigure 3.1: Comparison of training algorithms of convRBMs in (a) the spatial and (b) the frequencydomain. Training in the frequency domain replaces the 5NkNC convolutions required in thespatial domain with simple element-wise multiplications, while adding only 4Nk Fouriertransforms. The other operations are equivalent in both domains.3.3.2 Training in the Frequency DomainThe computational bottleneck of the training algorithm in the spatial domain is thecalculation of convolutions, which needs to be performed 5× Nc × Nk times per iteration.To speed up the calculation of convolutions, we perform training in the frequency domain,32Chapter 3 Training of Convolutional Models in the Frequency Domainwhich maps the convolutions to simple element-wise multiplications. This is especiallyimportant for the training on 3D images due to the relatively large number of weights of a3D kernel compared to 2D. Calculating Fourier transforms is the most time-consumingoperation and a potential performance bottleneck, when the number of Fourier transformsis high. To minimize the number of Fourier transforms, we map all operations needed fortraining to the frequency domain whenever possible, which allows the training algorithmto stay almost entirely in the frequency domain. This is in contrast to the FFT-basedtraining method developed by Mathieu et al. (2014), which only maps the calculationof convolutions to the frequency domain and minimizes the number of FFT calculationsby processing batches of images. Unlike their method, our method performs trainingprimarily in the frequency domain, which reduces the number of FFT calculations evenwhen each image is processed sequentially, and therefore significantly reduces the memoryrequired for training. All of the scalar operations needed for training (multiplications andadditions) can be readily mapped to the frequency domain, because the Fourier transformis a linear operation. Another necessary operation is the flipping of a convolutional kernel,w˜(u, v) = w(Nw − u + 1, Nw − v + 1). To calculate flipping efficiently, we reindex theweights of a convolutional kernel w(ij)uv from u, v ∈ [1, Nw] to u, v ∈ [−bNw/2c, b(Nw −1)/2c, which simplifies the calculation of a flipped kernel to w˜(u, v) = w(−u,−v). Thisallows flipping of a kernel in the spatial domain to be expressed by the element-wisecalculation of the complex conjugate in the frequency domain, which follows directlyfrom the time-reversal property of the Fourier transform, i.e., if h(x) = f (−x), thenhˆ(ξ) = fˆ (−ξ); and the reality condition, fˆ (−ξ) = fˆ (ξ), where xˆ = F (x) denotes x in the33Chapter 3 Training of Convolutional Models in the Frequency Domainfrequency domain. Using the aforementioned mappings, equations (3.1) to (3.4) can berewritten as∆wˆ(ij) = vˆ(i) · hˆ(j) − vˆ′(i) · hˆ′j (3.5)E[vˆ(i)xy | hˆ] =Nk∑j=1wˆ(ij)xy hˆ(j)xy + N2vbi for x, y = 0Nk∑j=1wˆ(ij)xy hˆ(j)xy for x, y 6= 0(3.6)hˆ(j) ∼ F(max(0,F−1(µˆ(j)) + cj +N (0, σ2)))(3.7)µˆj =Nc∑i=1wˆ(ij) · vˆ(i) (3.8)where σ2 = sigm(F−1(µˆ(j)) + cj), F−1 denotes the inverse Fourier transform, and ·denotes element-wise multiplication. The algorithm for approximating the gradient in thefrequency domain is summarized in Figure 3.1(b).The only operations that cannot be directly mapped to the frequency domain are thecalculation of the maximum function, the generation of Gaussian noise, and trimming ofthe filter kernels. To perform the first two operations, an image needs to be mapped to thespatial domain and back. However, these operations need only be calculated 2Nk times periteration and are therefore not a significant contributor to the total running time. Becausefilter kernels are padded to the input image size, the size of the learned filter kernels mustbe explicitly enforced by trimming. This is done by transferring the filter kernels to thespatial domain, setting the values outside of the specified filter kernel size to zero, andthen transforming the filter kernels back to the frequency domain. This procedure needsto be performed only once per mini-batch. Since the number of mini-batches is relativelysmall compared to the number of training images, trimming of the filter kernels also does34Chapter 3 Training of Convolutional Models in the Frequency Domainnot add significantly to the total running time of the training algorithm. Although wehave presented the algorithm in the context of training convRBMs, the same mapping ofoperations can be used to speed up the training of other convolutional models, such asconvolutional neural networks.3.3.3 Mapping of Strided Convolutions to Stride-1 ConvolutionsWe have developed a method that allows convolutions with a stride s > 1 to be expressedequivalently as convolutions with stride s = 1 by reorganizing the values of v(i) and w(ij)to V(i′) and W(i′ j) as illustrated in Figure 3.2. This reindexing scheme allows the energyfunction to be expressed in terms of conventional (stride-1) convolutions, which facilitatestraining in the frequency domain. The new indices of V(i′)x′y′ and W(i′ j)u′v′ can be calculatedfrom the old indices of vi(xy) and w(ij)uv as follows:x′ = b(x− 1)/sc+ 1 u′ = b(u− 1)/sc+ 1 (3.9)y′ = b(y− 1)/sc+ 1 v′ = b(v− 1)/sc+ 1 (3.10)i′ = s2(i− 1) + s((y− 1) mod s) + ((x− 1) mod s) + 1 (3.11)After reorganizing v(i) and w(ij) to V(i′) and W(i′ j), the energy of the model can berewritten asE(V,h) = −NC∑i=1Nk∑j=1Nh∑x,y=1NW∑u,v=1h(j)xy W(ij)uv V(i)x+u−1,y+v−1 −NC∑i=1biNV∑x,y=1V(i)xy −Nk∑j=1cjNh∑x,y=1h(j)xy (3.12)= −NC∑i=1Nk∑j=1h(j) • (W˜(ij) ∗V(i))−NC∑i=1biNV∑x,y=1V(i)xy −Nk∑j=1cjNh∑x,y=1h(j)xy (3.13)35Chapter 3 Training of Convolutional Models in the Frequency Domain4 2 1 8 2 74 2 3 4 3 13 4 5 5 1 27 5 1 2 2 32 1 2 2 7 56 9 5 4 6 60.1 -0.1 0.3 0.40.2 -0.3 0.1 0.1-0.4 0.2 0.2 -0.30.5 -0.2 -0.2 0.46.8 24.6 3.6( )Sliding filterwindows∗(with a step size of s)=4 1 23 5 12 2 7 2 8 74 5 21 2 5 4 3 37 1 26 5 6 2 4 15 2 39 4 6 0.1 0.3-0.4 0.2( )-0.1 0.40.2 -0.3( )0.2 0.10.5 -0.2( )-0.3 0.1-0.2 0.4( )0.5 -1.11.4 1.4( )2.3 2.41.2 -0.8( )4.4 13.5 1.7( )-0.4 -0.3-1.5 1.3( )6.8 24.6 3.6( )∗=∗=∗=∗=+ + + =Imagev(1)Filterw(1,j)Featuremaph(j)RearrangepixelsRearrangepixelsV(1)W(1,j)V(2)W(2,j)V(3)W(3,j)V(4)W(4,j)h(j)Figure 3.2: Illustration of convolutions with a sliding window step size s = 2 as used during filteringin a sconvDBN. A convolution of a given stride size (left side) can be efficiently calculatedas the sum of multiple individual convolutions with s = 1 (right side) after rearranging thepixels of the input image and the filter kernels. The bottom row shows the actual valuesproduced by the convolutions, which are the features from the image extracted by the filter.The figure is best viewed in colour.where • denotes the element-wise product followed by summation. The number ofchannels, number of visible units per channel and number of weights per channel afterreorganization are given by NC = Nc × s2, N2V = N2v/s2 and N2W = N2w/s2, respectively.3.3.4 GPU Implementation and Memory ConsiderationsTo further reduce training times, the algorithm can be efficiently implemented on graphicscards, which can perform hundreds of operations in parallel. To optimize efficiency, itis crucial to maximize the utilization of the large number of available GPU cores, whileminimizing the required amount of GPU memory. Because our algorithm requires only arelatively small number of FFT calculations per iteration, the computational bottleneck is36Chapter 3 Training of Convolutional Models in the Frequency Domainthe calculation of the element-wise operations, which can be performed in parallel. Thus,we distribute the processing of a single 2D image over Nv(bNv/2c+ 1)× Nc independentthreads, with one thread per element in the frequency domain. The large number ofparallel threads results in a high utilization of the GPU, even when each image of a mini-batch is processed sequentially, which we do in our method, because this greatly reducesthe amount of GPU memory required to store the visible and hidden units compared toprocessing batches of images in parallel.Due to the relatively small amount of GPU memory available compared to CPU memory,memory requirements are an important consideration when designing a GPU algorithm.However, the total amount of required memory is highly implementation-dependent (e.g.the use of temporary variables for storing intermediate results) and a comparison can onlybe done with common elements at the algorithmic level. Therefore, in the remainder ofthis section, we focus on the memory requirements of key variables such as the visibleunits, the hidden units, and the filters, which are required by all implementations. Inour training algorithm, all key variables are stored in the frequency domain, where eachelement in the frequency domain is represented by a single-precision complex number.Due to the symmetry of the Fourier space, the number of elements that need to be storedis roughly half the number of elements in the spatial domain. Thus, the memory requiredfor storing the visible and hidden units in the frequency domain is roughly the same asin the spatial domain. A potential drawback of training in the frequency domain is thatthe filters need to be padded to the size of the visible units before applying the Fouriertransformation, which increases the memory required for storing the filters on the GPU.37Chapter 3 Training of Convolutional Models in the Frequency DomainThe total amount of memory in bytes required for storing the visible units, hidden units,and padded filters is given by the sum of their respective terms4N2v Nc +4N2v Nks2+ 4N2v NkNc. (3.14)As a comparison, the training method of Krizhevsky et al. (2012) processes batches ofimages in order to fully utilize the GPU, which requires the storing of batches of visibleand hidden units. The memory needed for storing the key variables is given by4N2v NbNc +4N2v NbNks2+ 4N2wNkNc, (3.15)where Nb is the number of images per mini-batch. Depending on the choice of the batchsize, this method requires more memory for storing the visible and hidden units, whilerequiring less memory for storing the filters. Alternatively, Mathieu et al. (2014) proposedspeeding up the training of CNNs by calculating convolutions between batches of imagesand filters using FFTs. The memory required for storing the visible units, hidden units,and filters using this method is given by4N2v NbNc + 4N2v NbNk + 4N2v NkNc. (3.16)Table 3.1 shows a comparison of the memory per GPU required for storing key variableswhen training a network used in previous work by Krizhevsky et al. (2012). For the firstlayer, a comparison with Mathieu et al.’s training method could not be performed, becausethat method does not support strided convolutions, which would significantly reducethe memory required for storing the hidden units. In all layers, the proposed approachcompensates for the increased memory requirements for the filters by considering one38Chapter 3 Training of Convolutional Models in the Frequency DomainTable 3.1: Comparison of the memory required for storing key variables using different training methods:our method (Freq), Krizhevsky et al.’s spatial domain method (Spat), and Mathieu et al.’smethod using batched FFTs (B-FFT). A comparison with Mathieu et al.’s method could not bemade for the first layer, because that method does not support strided convolutions. In alllayers, our method consumes less memory for storing the key variables than the other twomethods.Layer Nb Nv Nc Nw Nk s Memory in MBFreq Spat B-FFT1 128 224 3 11 48 4 28.7 147.1 —2 128 55 48 5 128 1 72.9 260.5 330.93 128 27 128 3 192 1 69.2 114.8 182.34 128 13 192 3 192 1 24.0 33.0 55.55 128 13 192 3 128 1 16.1 27.3 42.3image at a time rather than a batch, and still outperforms batched learning in the spatialdomain in terms of speed (see Section 3.4), despite of not using batches.3.4 Evaluation of Running Time3.4.1 Comparison of Running Times for Training sconvDBNsTo demonstrate where the performance gains are produced, we trained a two-layersconvDBN on 2D and 3D images using our frequency-domain method and the followingmethods that all compute convolutions on the GPU, but using different approaches: 1) ourspatial domain implementation that convolves a single 2D or 3D image with a single 2Dor 3D filter kernel at a time, 2) Krizhevsky’s spatial domain convolution implementation(Krizhevsky, 2012), which is a widely used method (e.g., Hinton and Srivastava, 2012;Scherer et al., 2010; Zeiler and Fergus, 2013) that calculates the convolution of batchesof 2D images and 2D filter kernels in parallel (note that this method cannot be applied to3D images, so it was only used for the 2D experiments), and 3) our implementation that39Chapter 3 Training of Convolutional Models in the Frequency DomainTable 3.2: Training parameters.Parameter ImageNet (2D) OASIS (3D)1st layer 2nd layer 1st layer 2nd layerFilter size 5 to 52 5 to 13 3 to 36 3 to 9Stride size 1, 2, 4 1 1, 2, 4 1Number of channels 3 16, 32, 64 1 8, 16, 32Number of filters 32 32 32 16Image dimension 5122 1282 1283 643Number of images 256 256 100 100Batch size 128 128 25 25Table 3.3: Hardware specification of our test system for training sconvDBNs using different implementa-tions for calculating convolutions.Processor Intel i7-3770 CPU @ 3.40 GHzCPU Memory 8 GBGraphics Card NVIDIA GeForce GTX 660GPU Cores 960 cores @ 1.03 GHzGPU Memory 2 GBcalculates convolutions using FFTs, but without mapping the other operations that wouldallow the algorithm to stay in the frequency domain when not computing convolutions.The parameters that we used for training convRBMs on 2D and 3D images are summarizedin Table 3.2. The key parameters that we varied for our experiments are the filter size andstride size of the first layer, and the filter size and the number of channels of the secondlayer. Because the number of channels of the second layer is equal to the number of filtersof the first layer, we also varied the number of filters of the first layer in order to attainthe desired number of channels. For all implementations, the training time is directlyproportional to the number of filters. Therefore, a detailed comparison of all four methodswith a varying number of filters was not performed. The hardware details of our testenvironment are summarized in Table 3.3.40Chapter 3 Training of Convolutional Models in the Frequency DomainFor the comparison on 2D images, we used a data set of 256 natural colour imagesfrom the ImageNet data set (Deng et al., 2009). All images were resampled to a resolutionof 512× 512 pixels per colour channel. For the evaluation on 3D images, we used 100magnetic resonance imaging (MRI) scans of the brain from the Open Access Series ofImaging Studies (OASIS) data set (Marcus et al., 2007). We resampled all volumes to aresolution of 128× 128× 128 voxels and a voxel size of 2× 2× 2 mm.Running Time Analysis on 2D Colour Images (ImageNet)Figure 3.3(a) shows a comparison of running times for training the first sconvRBM layer on256 images with varying filter and stride sizes. Due to internal limitations of Krizhevsky’sconvolution implementation, it cannot be applied to images with a resolution of 512× 512pixels when using a stride size smaller than four, and those comparisons could not bemade. Our frequency domain implementation is between 2 to 24 times faster than ourconvolution implementation, where the speed gains are larger for larger filter and stridesizes. The running time of the methods that calculate convolutions in the frequencydomain (fft and freq) are independent of the filter size, because the filters need to bepadded to the size of the input images before calculating the FFT and therefore the samenumber of operations are required for training for all filter sizes. For a stride of one,the impact of the convolution implementation on the total running time is relatively low,because the computational bottleneck is the inference and sampling of the hidden units.As the number of hidden units decreases with larger strides, the running time becomesmore dependent on the time spent to calculate convolutions. Hence, the differencesbetween the four methods are more pronounced for larger strides. For a stride of four,training in the frequency domain is between 8 to 24 times faster than training in the spatialdomain using our convolution implementation and 2 to 7 times faster than using batched41Chapter 3 Training of Convolutional Models in the Frequency DomainStride size 1 Stride size 2 Stride size 4l l l 132.6 139.1 34.5l l l 108.9 112.3 11l l l 126.9 106.3 39.1 5.310301005 9 13 10 18 26 20 36 52Filter sizeTraining time per epoch [s]Methodlspat1fftspat2freq(a) Running times of training a first layer sconvRBM with stride sizes of 1, 2, and 4.16 channels 32 channels 64 channelsl l l 44.3 82.5 14.8 3.1l l l 85.7 92.8 26.8 4.2l l l 168.4 122.8 51.5 6.4310301005 9 13 5 9 13 5 9 13Filter sizeTraining time per epoch [s]Methodlspat1fftspat2freq(b) Running times of training a second layer sconvRBM with 16, 32, and 64 channels (stride size 1).Figure 3.3: Comparison of running times for training a (a) first and (b) second layer sconvRBM on 2Dimages using our frequency domain method (freq) and three alternative methods using differ-ent convolution implementations: single image convolutions (spat1), batched convolutions(spat2), and convolution by using FFTs (fft). Due to internal limitations of the implementationof batched convolutions, a comparison with spat2 could not be performed for images with aresolution of 512× 512 when using a stride size smaller than four.convolutions. Calculating convolutions by FFTs is the slowest method for all stride sizesand 2D filter sizes up to 44, largely due to the cost of calculating Fourier transforms.Figure 3.3(b) shows a similar comparison for training the second convRBM layer for astride size of one and varying filter sizes and numbers of channels. In contrast to trainingthe first layer, training times mostly depend on the calculation of convolutions, where the42Chapter 3 Training of Convolutional Models in the Frequency Domainimpact of calculating convolutions on the total running time increases with an increasingnumber of channels. Training in the frequency domain is between 5 to 26 times faster thantraining in the spatial domain using single-image convolutions, and 2 to 8 times faster thanusing batched convolutions. For all channel sizes, batched training is about 3 to 4 timesfaster than non-batched training and calculating convolutions using FFTs is much slowerthan batched training and training in the frequency domain. To summarize, training of2D images in the frequency domain is much faster than training in the spatial domaineven for small filter sizes. Using the largest filter kernels in both layers, the proposedmethod is shown to yield a speedup of 7 to 8 times compared to state-of-the-art GPUimplementations.Running Time Analysis on 3D Volumes (OASIS)Figure 3.4 shows the comparison of running times for training a first and second layersconvRBM on 3D volumes for varying filter sizes, stride sizes, and varying numbers ofchannels. In contrast to training on 2D images, the computational costs of calculating3D convolutions break even with calculating FFTs even for small filter sizes, because thenumber of multiplications and additions per convolution increases cubically, instead ofquadratically, with the filter kernel size. As a result, simply training by convolutions in thefrequency domain is faster than in the spatial domain. However, our proposed trainingalgorithm still outperforms both other methods, even at the smallest filter size. For filtersizes of five and larger, our frequency domain implementation is between 3.5 to 200 timesfaster than our spatial domain implementation using single-image convolutions and 2.7 to17 times faster than calculating convolutions by FFTs. Similar to the results on 2D images,training times of the first layer using a stride of one depend strongly on the time requiredto calculate the expectation of the hidden units and to sample the hidden units. Hence,43Chapter 3 Training of Convolutional Models in the Frequency DomainStride size 1 Stride size 2 Stride size 4l l l l 1068.3 142.9 53.4l l l l 1027.7 91.1 10.2l l l l 1028 87.3 5103010030010003 5 7 9 6 10 14 18 12 20 28 36Filter sizeTraining time per epoch [s]Methodlspatfftfreq(a) Running times of training a first layer sconvRBM with stride sizes of 1, 2, and 4.8 channels 16 channels 32 channelsl l l l 513.8 46.1 5.3l l l l 1024 83.6 7.6l l l l 2043.3 158.7 14103010030010003 5 7 9 3 5 7 9 3 5 7 9Filter sizeTraining time per epoch [s]Methodlspatfftfreq(b) Running times of training a second layer sconvRBM with 8, 16, and 32 channels (stride size 1).Figure 3.4: Comparison of running times for training a (a) first and (b) second layer sconvRBM on 3Dvolumes using a single 3D image convolution implementation (spat), an implementationthat calculates convolutions by using FFTs (fft), and our proposed implementation in thefrequency domain (freq).performance improvements of our frequency domain method are more pronounced forlarger strides and numbers of channels, where the impact of calculating convolutions onthe total training time is also larger. This makes the proposed method particularly suitablefor training sconvRBMs on high-resolution 3D volumes.44Chapter 3 Training of Convolutional Models in the Frequency Domain3.4.2 Comparison of Running Times for Calculating Convolutions with cuDNNThe NVIDIA CUDA Deep Neural Network library (cuDNN) (Chetlur et al., 2014) is aGPU-accelerated library of primitives that are commonly required to implement deeplearning methods such as CNNs and convDBNs. It is used internally by state-of-the-artdeep learning frameworks such as Caffe (Jia et al., 2014), Theano (Bastien et al., 2012), andTorch (Collobert et al., 2011a). The library in its 7th version provides highly optimizedimplementations for calculating, e.g., batched 2D and 3D convolutions, pooling operations,and different transfer functions. In this set of experiments, we directly compared the timerequired to calculate convolutions when training a CNN using our frequency domainimplementation with cuDNN. We only consider convolutions because they are the mosttime-consuming operations and because all other operations are calculated in the same way.For the comparison with our frequency domain implementation, the time measurementsalso include the time required to calculate FFTs during the gradient calculation and fortransforming the accumulated gradient to spatial domain and back in order to trim thepadded filters to the original filter size. The parameters that we used for evaluating therunning time of different convolution implementations are summarized in Table 3.4. Theparameters represent typical values of the first and second layer of a CNN, which are thelayers that require the most time to train, because the visible units of subsequent layersare commonly of much smaller resolution. The key parameters that we varied are theimage size, filter size, and batch size. The hardware details of our test environment aresummarized in Table 3.5.Figure 3.5 shows a comparison of the running times for calculating convolutions usingour frequency domain implementation and cuDNN for varying image sizes. The FFTimplementation that we used internally, cuFFT, is optimized for image sizes that can be45Chapter 3 Training of Convolutional Models in the Frequency DomainTable 3.4: CNN layer parameters for the comparison with cuDNN.Experiment #Channels Image size Filter size Filter count Batch sizeVarying 2 163, 173, . . . , 1283 93 32 8image sizes 32 103, 113, . . . , 643 93 32 8Varying 2 1283 93 32 1, 2, 4, 8batch sizes 32 643 93 32 1, 2, 4, 8Varying 2 1283 13, 23, . . . , 93 32 8filter sizes 32 643 13, 23, . . . , 93 32 8Table 3.5: Hardware specification of our test system for the comparison with cuDNN.Processor Intel i5-2500 CPU @ 3.30 GHzCPU Memory 32 GBGraphics Card NVIDIA GeForce GTX 780GPU Cores 2304 cores @ 0.94 GHzGPU Memory 3 GBfactorized as 2a × 3b × 5c. Therefore, we also evaluated the time required to calculateconvolutions when the input images are automatically padded to a size for which cuFFThas been optimized. For image sizes larger than 1103 voxels, cuFFT requires significantlymore temporary memory for calculating FFTs of unoptimized image sizes. Consequently,only the running times of the padded frequency domain implementation and cuDNNwere measured for image sizes larger than 1103 voxels. Calculating convolutions in thefrequency domain scales better with an increase of the image size than cuDNN. WhilecuDNN is the fastest method for very small image sizes, the padded frequency domainimplementation is more than 20 times faster for images with a size of 1283 voxels and2 input channels, and more than 18 times faster for images with a size of 643 voxelsand 32 channels. Calculating the FFT and therefore the running time of the frequencydomain implementation varies significantly depending on the input image size, andcareful padding of the input images is required to achieve consistently high performance.46Chapter 3 Training of Convolutional Models in the Frequency Domain2 channels 32 channels 1.250 0.060 1.295 0.0710.0010.0100.1001.00040 80 120 20 40 60 80Image sizeRunning time per image [s]MethodcuDNNFreqPaddedFigure 3.5: Comparison of running times of key operations for training a single CNN layer for varyingnumber of channels and input sizes. Frequency domain training and training using cuDNNscales comparable for small numbers of channels. For large numbers of channels, frequencydomain method scales slightly better to larger images.Table 3.6 shows a comparison of running times per image for varying batch sizes andfor varying number of channels. Increasing the batch size has only a minor effect onthe running time of cuDNN, while significantly increasing the required amount of GPUmemory. In contrast, the implementation in the frequency domain does not requiremore memory, because each image of a mini-batch is processed sequentially. Increasingthe batch size reduces the average running time per image of our frequency domainimplementation, because the FFTs required to trim the padded filters to the original filtersize need to be calculated only once per mini-batch, and have therefore a smaller impacton the overall running time when the number of images per mini-batch is larger.The impact of the filter size on the running time is shown in Figure 3.6. For the firstlayer, training in the frequency domain is faster for all filter sizes, where the speed-upsare larger for larger filters. At a filter size of 9, the speed-up is 20 times. For 32 channels,cuDNN is faster for very small filter sizes and training in the frequency domain breaks47Chapter 3 Training of Convolutional Models in the Frequency DomainTable 3.6: Comparison of running times for calculating key operations for training a CNN layer fordifferent batch sizes. Increasing the batch size reduces the impact of cropping the learnedfilters on the overall running time and consequently reduces the average time to process oneimage. The cuDNN implementation only benefits mildly from using larger batches.Batch size Running time [s]2 channels 32 channelsFreq cuDNN Freq cuDNN1 0.109 1.397 0.153 1.3132 0.081 1.011 0.106 1.3074 0.068 1.249 0.082 1.3048 0.060 1.247 0.071 1.2962 channels 32 channelsl l lll llll 1.250 0.060lllllllll 1.295 0.0710.030.100.301.001 3 5 7 9 1 3 5 7 9Filter sizeRunning time per image [s]Methodl cuDNNFreqFigure 3.6: Comparison of running times of key operations for training a single CNN layer for varyingnumber of channels and filter sizes.even at a filter size of 3. For filter sizes larger than 3, training in the frequency domain issubstantially faster with a speed-up of up to 18 times for a filter size of 9.Table 3.7 shows a comparison of running times for calculating convolutions of ourfrequency domain implementation with cuDNN for the 7-layer convolutional encodernetwork (CEN) that we use in Chapter 4 to segment multiple sclerosis (MS) lesions, andthe first sconvRBM of the DBN model described in Chapter 5 to model the variability inlesion distribution and brain morphology. For the 7-layer CEN, the second convolutionaland the first deconvolutional layer benefits the most from training in the frequency domain48Chapter 3 Training of Convolutional Models in the Frequency DomainTable 3.7: Comparison of running times of time critical operations of the 7-layer CEN-s used for seg-menting lesions, and the first sconvRBM of the lesion DBN using to model lesion distribution.The running times of pooling layers we excluded.Experiment Running time [s] Speed-upFreq cuDNN7-layer CENused for lesionsegmentation1st convolutional layer 0.077 0.498 6.5072nd convolutional layer 0.052 0.524 10.0901st deconvolutional layer 0.052 0.524 10.0902nd deconvolutional layer 0.148 0.517 3.501total 0.328 2.063 6.288First sconvRBM ofthe lesion DBNdirect calculation of stridedconvolutions0.076 0.068 0.904strided convolutions mapped tostride-1 convolutions0.019 0.071 3.786due to the relatively large number of channels in these layers. Overall, the convolutionsrequired to train the 7-layer CEN can be calculated 6.3 times faster in the frequencydomain compared to cuDNN. In a second experiment, we compared the time requiredto directly calculate strided convolutions with mapping strided convolutions to stride-1 convolutions by reorganizing the visible units of an sconvRBM. For the frequencydomain implementation, the direct method calculates stride-1 convolutions instead ofstrided convolutions and discards the additional hidden units afterwards, because stridedconvolutions cannot directly be calculated in the frequency domain. Mapping stridedconvolutions to stride-1 convolutions speeds up the training in the frequency domain bya factor of four compared to calculating stride-1 convolutions in the frequency domain.Overall, calculating the convolutions for training the first sconvRBM is 3.7 times fasterusing our frequency domain implementation than the direct implementation of stridedconvolutions of cuDNN.49Chapter 3 Training of Convolutional Models in the Frequency Domain3.5 ConclusionsWe have presented a fast training method for convolutional models, which performstraining in the frequency domain in order to replace the time-consuming computationof convolutions with simple element-wise multiplications. We have shown that it is alsoessential to map the other operations to frequency space wherever possible to minimizethe number of Fourier transforms, and that this greatly decreases training times overperforming only the convolutions in the frequency domain. In addition, our methodcan be efficiently implemented on the GPU and is faster than a highly optimized GPUimplementation of batched convolutions in the spatial domain. We have evaluated therunning time improvements using two standard benchmark data sets, showing a speed-upof up to 8 times on 2D images from the ImageNet data set and up to 200 times on 3Dvolumes from the OASIS data set. In addition, we have directly compared the timerequired to calculate convolutions using our method with cuDNN, the current state-of-the-art library for calculating 2D and 3D convolutions, with the results showing that ourmethod can calculate convolutions up to 20 times faster than cuDNN.50Chapter 4White Matter Lesion Segmentation4.1 IntroductionMultiple sclerosis (MS) is an inflammatory and demyelinating disease of the centralnervous system with pathology that can be observed in vivo by magnetic resonanceimaging (MRI). MS is characterized by the formation of lesions, primarily visible inthe white matter on conventional MRI. Imaging biomarkers based on the delineationof lesions, such as lesion load and lesion count, have established their importance forassessing disease progression and treatment effect. However, lesions vary greatly in size,shape, intensity and location, which makes their automatic and accurate segmentationchallenging.4.2 Related WorkMany automatic methods have been proposed for the segmentation of MS lesions over thelast two decades (García-Lorenzo et al., 2013), which can be classified into unsupervisedand supervised methods.51Chapter 4 White Matter Lesion Segmentation4.2.1 Unsupervised MethodsUnsupervised methods do not require a labeled data set for training. Instead, lesionsare identified as an outlier of, e.g., a subject-specific generative model of tissue intensi-ties (Roura et al., 2015; Schmidt et al., 2012; Tomas-Fernandez and Warfield, 2015;Van Leemput et al., 2001), or a generative model of image patches representing a healthypopulation (Weiss et al., 2013). Alternatively, clustering methods have been used tosegment healthy and lesion tissue, where lesions are modelled as a separate tissue class(Shiee et al., 2010; Sudre et al., 2015). In many methods, spatial priors of healthy tissuesare used to reduce false positives. For example, in addition to modelling MS lesions asa separate intensity cluster, Lesion-TOADS (Shiee et al., 2010) employs topological andstatistical atlases to produce a topology-preserving segmentation of all brain tissues.To account for local changes of the tissue intensity distributions, Tomas-Fernandez andWarfield (2015) combined the subject-specific model of the global intensity distributionswith a voxel-specific model calculated from a healthy population, where lesions aredetected as outliers of the combined model. A major challenge of unsupervised methodsis that outliers are often not specific to lesions and can also be caused by intensityinhomogeneities, partial volume, imaging artifacts, and small anatomical structures suchas blood vessels, which leads to the generation of false positives. To overcome theselimitations, Roura et al. (2015) employed an additional set of rules to remove falsepositives, while Schmidt et al. (2012) used a conservative threshold for the initial detectionof lesions, which are later grown in a separate step to yield an accurate delineation.52Chapter 4 White Matter Lesion Segmentation4.2.2 Supervised MethodsCurrent supervised approaches typically start with a set of features, which can rangefrom small and simple to large and highly variable, and are either predefined by theuser (Geremia et al., 2011, 2010; Guizard et al., 2015; Subbanna et al., 2015) or gatheredin a feature extraction step such as by deep learning (Yoo et al., 2014). Voxel-basedsegmentation algorithms (Geremia et al., 2010; Yoo et al., 2014) feed the features andlabels of each voxel into a general classification algorithm, such as a random forest(Breiman, 2001), to classify each voxel and to determine which set of features are themost important for segmentation in the particular domain. Voxel features and the labelsof neighboring voxels can be incorporated into Markov random field-based (MRF-based)approaches (Subbanna et al., 2015, 2009) to produce a spatially consistent segmentation.As a strategy to reduce false positives, Subbanna et al. (2015) combined a voxel-levelMRF with a regional MRF, which integrates a large set of intensity and textural featuresextracted from the regions produced by the voxel-level MRF with the labels of neighboringnodes of the regional MRF. Library-based approaches leverage a library of pre-segmentedimages to carry out the segmentation. For example, Guizard et al. (2015) proposed asegmentation method based on an extension of the non-local means algorithms (Coupéet al., 2011). The centres of patches at every voxel location are classified based on matchedpatches from a library containing pre-segmented images, where multiple matches areweighted using a similarity measure based on rotation-invariant features.4.2.3 Patch-based Deep Learning MethodsA recent breakthrough for the automatic image segmentation using deep learning comesfrom the domain of cell membrane segmentation, in which Cires¸an et al. (2012) proposed53Chapter 4 White Matter Lesion Segmentationclassifying the centres of image patches directly using a convolutional neural network(CNN) (LeCun et al., 1998) without a dedicated feature extraction step. Instead, featuresare learned indirectly within the lower layers of the neural network during training,while the higher layers can be regarded as performing the classification, which allows thelearning of features that are specifically tuned to the segmentation task. However, the timerequired to train patch-based methods can make the approach infeasible when the sizeand number of patches are large.4.2.4 Fully Convolutional MethodsRecently, different CNN architectures (Brosch et al., 2015; Kang and Wang, 2014; Liet al., 2014; Long et al., 2015; Ronneberger et al., 2015) have been proposed that are ableto feed through entire images, which removes the need to select representative patches,eliminates redundant calculations where patches overlap, and therefore these modelsscale up more efficiently with image resolution. Kang and Wang (2014) introduced thefully convolutional neural network (fCNN) for the segmentation of crowds in surveillancevideos. However, fCNNs produce segmentations of lower resolution than the input imagesdue to the successive use of convolutional and pooling layers, both of which reduce thedimensionality. To predict segmentations of the same resolution as the input images,we recently proposed using a 3-layer convolutional encoder network (CEN) (Broschet al., 2015) for MS lesion segmentation. The combination of convolutional (LeCun et al.,1998) and deconvolutional (Zeiler et al., 2011) layers allows our network to producesegmentations that are of the same resolution as the input images.Another limitation of the traditional CNN is the trade-off between localization accuracy,represented by lower-level features, and contextual information, provided by higher-levelfeatures. To overcome this limitation, Long et al. (2015) proposed fusing the segmentations54Chapter 4 White Matter Lesion Segmentationproduced by the lower layers of the network with the upsampled segmentations producedby higher layers. However, using only low-level features was not sufficient to producea good segmentation at the lowest layers, which is why segmentation fusion was onlyperformed for the three highest layers. Instead of combining the segmentations producedat different layers, Ronneberger et al. (2015) proposed combining the features of differentlayers to calculate the final segmentation directly at the lowest layer using an 11-layeru-shaped network architecture called u-net. Their network is composed of a traditionalcontracting path (first half of the u), but augmented with an expanding path (last half of theu), which replaces the pooling layers of the contracting path with upsampling operations.To leverage both high- and low-level features, shortcut connections are added betweencorresponding layers of the two paths. However, upsampling cannot fully compensate forthe loss of resolution, and special handling of the border regions is still required.4.3 MethodsWe propose a new convolutional network architecture that combines the advantages of aCEN (Brosch et al., 2015) and a u-net (Ronneberger et al., 2015). Our network is dividedinto two pathways, a traditional convolutional pathway, which consists of alternatingconvolutional and pooling layers, and a deconvolutional pathway, which consists ofalternating deconvolutional and unpooling layers (Zeiler et al., 2011) and predicts thefinal segmentation. Similar to the u-net, we introduce shortcut connections between layersof the two pathways. In contrast to the u-net, our method produces segmentations of thesame resolution as the input images, independent of the number of layers and filter sizesused, which enables the use of deeper networks without requiring special handling of theborder regions.55Chapter 4 White Matter Lesion SegmentationInput images Segmentationsw(1)cw(3)c w(3)dw(1)dw(1)sx(0)x(1)x(2)x(3)y(0)y(1)y(2)y(3)Conv.layerPoolinglayerConv.layerDeconv.layer withshortcutUnpoolinglayerDeconv.layerconvolution pooling deconvolution unpoolingconvolution pooling deconvolution unpoolingInput imageswˆ(1)poolingwˆ(2)v(1)h(1)v(2)h(2)conv-RBM1convRBM2Pre-training Fine-tuningFigure 4.1: Pre-training and fine-tuning a 7-layer convolutional encoder network with shortcut. Theactivations calculated by deconvolution through a shortcut are added to the correspondinglayer activations before applying the transfer function. For deeper network architectures,there is a shortcut between all convolutional and deconvolutional layers except for thetop-most layer. Pre-training is performed on the input images using a stack of convolutionalRBMs. The pre-trained weights and bias terms are used to initialize a CEN, which isfine-tuned on pairs of input images, x(0), and segmentations, y(0).The task of segmenting MS lesions is defined as finding a function s that maps multi-modal images I, e.g., I = (IFLAIR, IT1), to corresponding binary lesion masks S, where 1denotes a lesion voxel and 0 denotes a non-lesion voxel. Given a set of training images In,n ∈ N, and corresponding segmentations Sn, we model finding an appropriate functionfor segmenting MS lesions as an optimization problem of the following formsˆ = arg mins∈S ∑nE(Sn, s(In)), (4.1)where S is the set of possible segmentation functions, and E is an error measure thatcalculates the dissimilarity between ground truth segmentations and predicted segmenta-56Chapter 4 White Matter Lesion Segmentationtions. The set of possible segmentation functions S is defined by the network architecture,where a particular function s is defined by the combination of network architecture andlearned parameters of the network.4.3.1 Model ArchitectureThe set of possible segmentation functions, S , is modelled by the convolutional encodernetwork with shortcut connections (CEN-s) illustrated in Figure 4.1. A CEN-s is a type ofCNN (LeCun et al., 1998) that is divided into two interconnected pathways, the convolu-tional pathway and the deconvolutional (Zeiler et al., 2011) pathway. The convolutionalpathway consists of alternating convolutional and pooling layers. The input layer of theconvolutional pathway is composed of the image voxels x(0)i (p), i ∈ [1, C], where i indexesthe modality or input channel, C is the number of modalities or channels, and p ∈ N3are the coordinates of a particular voxel. The convolutional layers automatically learn afeature hierarchy from the input images. A convolutional layer is a deterministic functionof the following formx(l)j = max(0,C∑i=1w˜(l)c,ij ∗ x(l−1)i + b(l)j), (4.2)where l is the index of a convolutional layer, x(l)j , j ∈ [1, F], denotes the feature mapcorresponding to the trainable convolution filter w(l)c,ij, F is the number of filters of thecurrent layer, b(l)j are trainable bias terms, ∗ denotes valid convolution, and w˜ denotesa flipped version of w, i.e., w˜(a) = w(−a). To be consistent with the inference rules ofconvolutional restricted Boltzmann machines (convRBMs) (Lee et al., 2009), which areused for pre-training, convolutional layers convolve the input signal with flipped filterkernels, while deconvolutional layers calculate convolutions with non-flipped filter kernels.57Chapter 4 White Matter Lesion SegmentationWe use rectified linear units (Nair and Hinton, 2010) in all layers except for the outputlayers, which have shown to improve the classification performance of CNNs (Krizhevskyet al., 2012). A convolutional layer is followed by an average pooling layer (Scherer et al.,2010) that halves the number of units in each dimension by calculating the average of eachblock of 2× 2× 2 units per channel.The deconvolutional pathway consists of alternating deconvolutional and unpoolinglayers with shortcut connections to the corresponding convolutional layers. The firstdeconvolutional layer uses the extracted features of the convolutional pathway to calculateabstract segmentation featuresy(L−1)i = max(0,F∑j=1w(L)d,ij ~ y(L)j + c(L−1)i), (4.3)where y(L) = x(L), L denotes the number of layers of the convolutional pathway, w(L)d,ijand c(L−1)i are trainable parameters of the deconvolutional layer, and ~ denotes fullconvolution. To be consistent with the general notation of deconvolutions (Zeiler et al.,2011), the non-flipped version of w is convolved with the input signal.Subsequent deconvolutional layers use the activations of the previous layer and corre-sponding convolutional layer to calculate more localized segmentation featuresy(l)i = max(0,F∑j=1w(l+1)d,ij ~ y(l+1)j +F∑j=1w(l+1)s,ij ~ x(l+1)j + c(l)i), (4.4)where l is the index of a deconvolutional layer with shortcut, and w(l+1)s,ij are the shortcutfilter kernels connecting the activations of the convolutional pathway with the activationsof the deconvolutional pathway. The last deconvolutional layer integrates the low-level58Chapter 4 White Matter Lesion Segmentationfeatures extracted by the first convolutional layer with the high-level features from theprevious layer to calculate a probabilistic lesion masky(0)1 = sigm(F∑j=1(w(1)d,1j ~ y(1)j + w(1)s,1j ~ x(1)j)+ c(0)1), (4.5)where we use the sigmoid function defined as sigm(z) = (1+ exp(−z))−1, z ∈ R, insteadof the rectified linear function in order to obtain a probabilistic segmentation with values inthe range between 0 and 1. To produce a binary lesion mask from the probabilistic outputof our model, we chose a fixed threshold such that the mean Dice similarity coefficient(DSC) (Dice, 1945) is maximized on the training set and used the same threshold for theevaluation on the test set.4.3.2 Gradient CalculationThe parameters of the model can be efficiently learned by minimizing the error E for eachsample of the training set, which requires the calculation of the gradient of E with respectto the model parameters (LeCun et al., 1998). Typically, neural networks are trained byminimizing either the sum of squared differences (SSD) or the cross-entropy. Here, we willderive the gradient on the example of the SSD, due to the similarity of the SSD with thelater proposed objective function. The SSD can be calculated for a single image as followsE =12∑p(S(p)− y(0)(p))2, (4.6)59Chapter 4 White Matter Lesion Segmentationwhere p ∈ N3 are the coordinates of a particular voxel. The partial derivatives of the errorwith respect to the model parameters can be calculated using the delta rule and are givenby∂E∂w(l)d,ij= δ(l−1)d,i ∗ y˜(l)j ,∂E∂c(l)i=∑pδ(l)d,i (p), (4.7)∂E∂w(l)s,ij= δ(l−1)d,i ∗ x˜(l)j , (4.8)∂E∂w(l)c,ij= x(l−1)i ∗ δ˜(l)c,j , and∂E∂b(l)i=∑pδ(l)c,i (p). (4.9)For the output layer, δ(0)d,1 can be calculated byδ(0)d,1 =(y(0)1 − S)y(0)1(1− y(0)1). (4.10)The derivatives of the error with respect to the parameters of the other layers can becalculated by applying the chain rule of partial derivatives, which yields toδ(l)d,j =C∑i(w˜(l)d,ij ∗ δ(l−1)d,i)I(y(l)j > 0), (4.11)δ(l)c,i =F∑j(w(l+1)c,ij ~ δ(l+1)c,j)I(x(l)i > 0), (4.12)where l is the index of a deconvolutional or convolutional layer, δ(L)c,i = δ(L)d,j , and I(z)denotes the indicator function defined as 1 if the predicate z is true and 0 otherwise. If a60Chapter 4 White Matter Lesion Segmentationlayer is connected through a shortcut, δ(l)c,j needs to be adjusted by propagating the errorback through the shortcut connection. In this case, δ(l)c,j is calculated byδ(l)c,j =(δ(l)c,j′ + w˜(l)s,ij ∗ δ(l−1)d,i)I(x(l)j > 0), (4.13)where δ(l)c,j′ denotes the activation of unit δ(l)c,j before taking the shortcut connection intoaccount.The sum of squared differences is a good measure of classification accuracy, if the twoclasses are fairly balanced. However, if one class contains vastly more samples, as is thecase for lesion segmentation, the error measure is dominated by the majority class andconsequently, the neural network would learn to ignore the minority class. To overcomethis problem, we use a combination of sensitivity and specificity, which can be usedtogether to measure classification performance even for vastly unbalanced problems. Moreprecisely, the final error measure is a weighted sum of the mean squared difference ofthe lesion voxels (sensitivity) and non-lesion voxels (specificity), reformulated to be errorterms:E = r∑p(S(p)− y(0)(p))2S(p)∑p S(p)+ (1− r)∑p(S(p)− y(0)(p))2 (1− S(p))∑p(1− S(p)) . (4.14)We formulate the sensitivity and specificity errors as squared errors in order to yieldsmooth gradients, which makes the optimization more robust. The sensitivity ratio r canbe used to assign different weights to the two terms. Due to the large number of non-lesion voxels, weighting the specificity error higher is important, but based on preliminaryexperimental results (Brosch et al., 2015), the algorithm is stable with respect to changesin r, which largely affects the threshold used to binarize the probabilistic output. A61Chapter 4 White Matter Lesion Segmentationdetailed evaluation of the impact of the sensitivity ratio on the learned model is presentedin Section 4.4.4.To train our model, we must compute the derivatives of the modified objective func-tion with respect to the model parameters. Equations (4.7)–(4.9) and (4.11)–(4.13) area consequence of the chain rule and independent of the chosen similarity measure.Hence, we only need to derive the update rule for δ(0)d,1 . With α = 2r(∑p S(p))−1 andβ = 2(1− r)(∑p(1− S(p)))−1, we can rewrite E asE =12∑p(S(p)− y(0)(p))2αS(p) +12∑p(S(p)− y(0)(p))2β(1− S(p)) (4.15)=12∑p(αS(p) + β(1− S(p)))(S(p)− y(0)1 (p))2. (4.16)Our objective function is similar to the SSD, with an additional multiplicative term appliedto the squared differences. The additional factor only depends on the target segmentationS and is therefore constant with respect to the model parameters. Consequently, δ(0)d,1 canbe derived analogously to the SSD case, and the new factor is simply carried over:δ(0)d,1 =(αS + β(1− S))(y(0)1 − S)y(0)1 (1− y(0)1 ). (4.17)4.3.3 TrainingAt the beginning of the training procedure, the model parameters need to be initialized andthe choice of the initial parameters can have a big impact on the learned model (Sutskeveret al., 2013). In our experiments, we found that initializing the model using pre-training(Hinton and Salakhutdinov, 2006) on the input images was required in order to beable to fine-tune the model using the ground truth segmentations without getting stuck62Chapter 4 White Matter Lesion Segmentationearly in a local minimum. Pre-training can be performed layer by layer (Hinton et al.,2006) using a stack of convRBMs (see Figure 4.1), thereby avoiding the potential problemof vanishing or exploding gradients (Hochreiter, 1991). The first convRBM is trainedon the input images, while subsequent convRBMs are trained on the hidden activationsof the previous convRBM. After all convRBMs have been trained, the model parametersof the CEN-s can be initialized as follows (showing the first convolutional and the lastdeconvolutional layers only, see Figure 4.1)w(1)c = wˆ(1), w(1)d = 0.5wˆ(1), w(1)s = 0.5wˆ(1) (4.18)b(1) = bˆ(1), c(0) = cˆ(1), (4.19)where wˆ(1) are the filter weights, bˆ(1) are the hidden bias terms, and cˆ(1) are the visiblebias terms of the first convRBM.A major challenge for gradient-based optimization methods is the choice of an appro-priate learning rate. Classic stochastic gradient descent (LeCun et al., 1998) uses a fixed ordecaying learning rate, which is the same for all parameters of the model. However, thepartial derivatives of parameters of different layers can vary substantially in magnitude,which can require different learning rates. In recent years, there has been an increasinginterest in developing methods for automatically choosing independent learning rates.Most methods (e.g., AdaGrad by Duchi et al., 2011; AdaDelta by Zeiler, 2012; RMSpropby Dauphin et al., 2015; and Adam by Kingma and Ba, 2014) collect different statistics ofthe partial derivatives over multiple iterations and use this information to set an adap-tive learning rate for each parameter. This is especially important for the training ofdeep networks, where the optimal learning rates often differ greatly for each layer. In63Chapter 4 White Matter Lesion Segmentationour initial experiments, networks obtained by training with AdaDelta, RMSprop, andAdam performed comparably well, but AdaDelta was the most robust to the choice ofhyperparameters, so we used AdaDelta for all results reported.4.3.4 ImplementationPre-training and fine-tuning were performed using the highly optimized GPU-acceleratedimplementation of 3D convRBMs and CNNs (Brosch and Tam, 2015) explained inChapter 3. Our frequency domain implementation significantly speeds up the train-ing by mapping the calculation of convolutions to simple element-wise multiplications,while adding only a small number of Fourier transforms. This is especially beneficialfor the training on 3D volumes, due to the increased number of weights of 3D kernelscompared to 2D. Although GPU-accelerated deep learning libraries based on the NVIDIACUDA Deep Neural Network library (cuDNN) (Chetlur et al., 2014) are publicly available(e.g., Bastien et al., 2012; Collobert et al., 2011a; Jia et al., 2014), we trained our modelsusing our own implementation, because we optimized it for memory facilitating thetraining on large 3D volumes and because it performs the most computationally intensivetraining operations six times faster than cuDNN in a direct comparison (see Table 3.7 onpage 49).4.4 Experiments and ResultsWe evaluated our method on two publicly available data sets (from the MICCAI 2008 MSlesion segmentation challenge1 and the ISBI 2015 longitudinal MS lesion segmentation1http://www.ia.unc.edu/MSseg/64Chapter 4 White Matter Lesion Segmentationchallenge2), which allows for a direct comparison with many state-of-the-art methods. Inaddition, we have used two much larger data sets containing two and four different MRIsequences from a multi-centre clinical trial in secondary progressive MS and relapsingremitting MS, which tends to have the most heterogeneity among the MS subtypes.These data sets are challenging due to the large variability in lesion size, shape, location,and intensity as well as varying contrasts produced by different scanners. The clinicaltrial data sets were used to estimate the number of images required to train a CENwithout overfitting and to carry out a detailed analysis of different CEN architecturesusing different combinations of modalities, with a comparison to five publicly availablestate-of-the-art methods.4.4.1 Data Sets and PreprocessingPublic Data SetsThe data set of the MICCAI 2008 MS lesion segmentation challenge (Styner et al., 2008)consists of 43 T1-weighted (T1w), T2-weighted (T2w), and fluid-attenuated inversion re-covery (FLAIR) MRIs, divided into 20 training cases for which ground truth segmentationsare made publicly available, and 23 test cases. In a first experiment, we evaluated ourmethod on the 20 training cases using 5-fold cross-validation. In a second experiment, wetrained our model on the 20 training cases and then used the trained model to segmentthe 23 test cases, which were sent to the challenge organizers for independent evaluation.The data set of the ISBI 2015 longitudinal MS lesion segmentation challenge consists of21 visit sets, each with T1w, T2w, proton density-weighted (PDw), and FLAIR MRIs. Thechallenge was not open for new submissions at the time of writing this article. Therefore,2http://iacl.ece.jhu.edu/MSChallenge65Chapter 4 White Matter Lesion SegmentationTable 4.1: Population, disease and scanner characteristics of the two clinical trial data sets.Data set 1 Data set 2MS subtype Secondary progressive MS Relapsing remitting MSEDSS 5.5± 1.0 3.3± 1.4Age 49.58± 8.00 37.37± 8.81Gender male 175 68female 311 125unknown 14 2Field strength 1 T 26 —1.5 T 383 3223 T 77 24unknown 14 31we evaluated our method on the training set using leave-one-subject-out cross-validation,following the evaluation protocol of the second place method by Jesson and Arbel (2015)and third place method by Maier and Handels (2015) from the challenge proceedings.The paper of the first place method does not have sufficient details to replicate theirevaluation.Clinical Trial Data SetsWe used two different clinical trial data sets to evaluate our method (see Table 4.1). Thefirst data set contains images of 500 secondary progressive MS patients split equallyinto training and test sets. The images were acquired from 45 different scanning sites.For each subject, the data set contains T2- and PD-weighted MRIs with a voxel sizeof 0.937× 0.937× 3.000 mm. The main preprocessing steps included rigid intra-subjectregistration, brain extraction, intensity normalization, and background cropping. Thisdata set was used to evaluate how many images are required to train a simple 3-layer CENwithout overfitting. Training on a single GeForce GTX 780 graphics card took between 666Chapter 4 White Matter Lesion Segmentationand 32 hours per model depending on the training set size. However, once the network istrained, segmentation of new images can be performed in less than one second.The second data set was collected from 67 different scanning sites using different 1.5 Tand 3 T scanners, and consists of T1w, T2w, PDw, and FLAIR MRIs from 195 relapsingremitting MS patients, most with two time points (377 visit sets in total). In contrast to theprevious data set, this data set also contains T1w and FLAIR MRIs, which allows for adirect comparison with other competing methods, which require these modalities. Theimage dimensions and voxel sizes vary by site, but most of the T1w images have close to1 mm isotropic voxels, while the other images have voxel sizes close to 1× 1× 3 mm. Allimages were skull-stripped using the brain extraction tool (BET) (Jenkinson et al., 2005),followed by an intensity normalization to the interval [0, 1], and a 6 degree-of-freedomintra-subject registration using one of the 3 mm scans as the target image to align thedifferent modalities. To speed-up the training, all images were cropped to a 164× 206× 52voxel subvolume with the brain roughly centred. The ground truth segmentations for bothdata sets were produced using an existing semiautomatic 2D region-growing technique,which has been used successfully in a number of large MS clinical trials (e.g., Kappos et al.,2006; Traboulsee et al., 2008). Each lesion was manually identified by an experiencedradiologist and then interactively grown from the seed point by a trained technician.We divided the data set into a training (n = 250), validation (n = 50), and test set(n = 77) such that images of each set were acquired from different scanning sites. Thetraining, validation, and test sets were used for training our models, for monitoringthe training progress, and to evaluate performance, respectively. The training set wasalso used to perform parameter tuning of the other methods used for comparison. Pre-training and fine-tuning of our 7-layer CEN-s took approximately 27 hours and 37 hours,67Chapter 4 White Matter Lesion Segmentationrespectively, on a single GeForce GTX 780 graphics card. Once the network is trained, newmulti-contrast images can be segmented in less than one second.4.4.2 Comparison to Other MethodsWe compared our method with five publicly available methods, some of which are widelyused for clinical research and are established in the literature (e.g., Guizard et al., 2015;Subbanna et al., 2015; Sudre et al., 2015) as reference points for comparison. These fivemethods include:1. Expectation maximization segmentation (EMS) method (Van Leemput et al., 2001);2. Lesion growth algorithm (LST-LGA) (Schmidt et al., 2012), as implemented in theLesion Segmentation Toolbox (LST) version 2.0.11;3. Lesion prediction algorithm (LST-LPA) also implemented in the same LST toolbox;4. Lesion-TOADS version 1.9 R (Shiee et al., 2010); and5. Salem Lesion Segmentation (SLS) toolbox (Roura et al., 2015).The Lesion-TOADS software only takes T1w and FLAIR MRIs and has no tunable parame-ters, so we used the default parameters to carry out the segmentations. The performanceof EMS depends on the choice of the Mahalanobis distance κ, the threshold t used tobinarize the probabilistic segmentation, and the modalities used. We applied EMS tosegment lesions using two combinations of modalities: a) T1w, T2w, and PDw, as used inthe original paper (Van Leemput et al., 2001), and b) all four available modalities (T1w,T2w, PD2, FLAIR). We compared the segmentations produced for all combinations ofκ = 2.0, 2.2, . . . , 4.6 and t = 0.05, 0.10, . . . , 1.00 with the ground truth segmentations on the68Chapter 4 White Matter Lesion Segmentationtraining set and chose the values that maximized the average DSC (κ = 2.6, t = 0.75 forthree modalities; κ = 2.8, t = 0.9 for four modalities).The LST-LGA and LST-LPA of the LST toolbox only take T1w and FLAIR MRIs asinput, and we used those modalities to tune the initial threshold κ of LST-LGA forκ = 0.05, 0.10, . . . , 1.00 and the threshold t used by LST-LPA to binarize the probabilisticsegmentations for t = 0.05, 0.10, . . . , 1.00. The optimal parameters were κ = 0.10 andt = 0.45, respectively.Similarly, the SLS toolbox, which is the most recently published work (Roura et al., 2015)also only takes T1w and FLAIR MRIs as input. This method uses an initial brain tissuesegmentation obtained from the T1w images and segments lesions by treating lesionalpixels as outliers to the normal appearing grey matter brain tissue on the FLAIR images.This method has three key parameters: ωnb, ωts, and αsls. We tuned these parameterson the training set via a grid-search over a range of values as suggested by Roura et al.(2015).4.4.3 Measures of Segmentation AccuracyWe used the following six measures to produce a comprehensive evaluation of segmen-tation accuracy as there is generally no single measure that is sufficient to capture allinformation relevant to the quality of a produced segmentation (García-Lorenzo et al.,2013).The first measure is the DSC (Dice, 1945) that computes a normalized overlap valuebetween the produced and ground truth segmentations, and is defined asDSC =2× TP2× TP+ FP+ FN, (4.20)69Chapter 4 White Matter Lesion Segmentationwhere TP, FP, and FN denote the number of true positive, false positive, and false negativevoxels, respectively. A value of 100 % indicates a perfect overlap of the produced segmen-tation and the ground truth. The DSC incorporates measures of over- and underestimationinto a single metric, which makes it a suitable measure to compare overall segmentationaccuracy. In addition, we have used the true positive rate (TPR) and the positive predictivevalue (PPV) to provide further information on specific aspects of segmentation perfor-mance. The TPR is used to measure the fraction of the lesion regions in the ground truththat are correctly identified by an automatic method. It is defined asTPR =TPTP+ FN, (4.21)where a value of 100 % indicates that all true lesion voxels are correctly identified. The PPVis used to determine the extent of the regions falsely classified as lesion by an automaticmethod. It is defined as the fraction of true lesion voxels out of all identified lesion voxelsPPV =TPTP+ FP, (4.22)where a value of 100 % indicates that all voxels that are classified as lesion voxels areindeed lesion voxels as defined by the ground truth (no false positives).We also measured the relative absolute volume difference (VD) between the groundtruth and the produced segmentation by computing their volumes (Vol), i.e.,VD =Vol(Seg)−Vol(GT)Vol(GT), (4.23)where Seg and GT denote the obtained segmentation and ground truth, respectively.However, it has been noted (García-Lorenzo et al., 2013) that a wide variability exists70Chapter 4 White Matter Lesion Segmentationeven between the lesion segmentations of trained experts, and thus, the achieved volumedifferences reported in the literature have ranged from 10 % to 68 %.For more precise evaluation, we have also included the lesion-wise true positive rate(LTPR) and the lesion-wise false positive rate (LFPR) that are much more sensitive inmeasuring the segmentation accuracy of smaller lesions, which are important to detectwhen performing early disease diagnosis (García-Lorenzo et al., 2013). More specifically,the LTPR measures the TPR on a per lesion-basis and is defined asLTPR =LTP#RL, (4.24)where LTP denotes the number of lesion true positives, i.e., the number of lesions in thereference segmentation that overlap with a lesion in the produced segmentation, and #RLdenotes the total number of lesions in the reference segmentation. An LTPR with a valueof 100 % indicates that all lesions are correctly identified. Similarly, the lesion-wise falsepositive rate (FPR) measures the fraction of the segmented lesions that are not in theground truth and is defined asLFPR =LFP#PL, (4.25)where LFP denotes the number of lesion false positives, i.e., the number of lesions in theproduced segmentation that do not overlap with a lesion in the reference segmentation,and #PL denotes the total number of lesions in the produced segmentation. An LFPR witha value of 0 % indicates that no lesions were incorrectly identified.71Chapter 4 White Matter Lesion Segmentationlllllll ll ll l ll l l l l ll l02040600 100 200 300 400 500EpochDSC [%]l Training setValidation setFigure 4.2: Improvement in the mean DSC computed on the training and test sets during training. Onlysmall improvements can be observed after 400 epochs.4.4.4 Training ParametersIn this set of experiments, we evaluated the impact of the number of epochs and thesensitivity ratio on the trained networks. Figure 4.2 shows the mean DSC evaluatedon the training and validation sets of the second clinical data set as computed duringtraining of a 7-layer CEN-s up to 500 epochs. The mean DSC scores increase monotonically,but the improvements are minor after 400 epochs. The optimal number of epochs is atrade-off between accuracy and time required for training. Due to the relatively smallimprovements after 400 epochs, we decided to stop the training procedure at 500 epochs.For the challenge data sets, due to their small sizes, we did not employ a subset of thedata for a dedicated validation set to choose the number of epochs. Instead, we set thenumber of epochs to 2500, which corresponds to roughly the same number of gradientupdates compared to the clinical trial data set.To determine an effective sensitivity ratio, we measured the performance on the valida-tion set over a range of values. For each choice of ratio, we binarized the segmentations72Chapter 4 White Matter Lesion Segmentationt = 0.80t = 0.90t = 0.94t = 0.9802550750.0 0.1 0.2 0.3 0.4 0.5FPR [%]TPR [%]Sensitivity ratio0.010.020.050.1Figure 4.3: ROC curves for different sensitivity ratios r. A ‘+’ marks the TPR and FPR of the optimalthreshold. Varying the value of r results in almost identical ROC curves and only causes achange of the optimal threshold t, which shows the robustness of our method with respectto the sensitivity ratio.using a threshold that maximized the DSC on the training set. Figure 4.3 shows a set ofreceiver operating characteristic (ROC) curves for different choices of the sensitivity ratioranging from 0.01 to 0.10 and the corresponding optimal thresholds. The plots illustrateour findings that our method is not sensitive to the choice of the sensitivity ratio, whichmostly affects the optimal threshold. We chose a fixed sensitivity ratio of 0.02 for all ourexperiments.4.4.5 Impact of the Training Set Size on the Segmentation PerformanceTo evaluate the impact of the training set size on the segmentation performance, we traineda CEN with 3 layers on the first in-house data set with varying number of training samplesand calculated the mean DSC on the training and test sets as illustrated in Figure 4.4.For small training sets, there is a large difference between the DSCs on the training andtest sets, which indicates that the training set is too small to learn a representative set of73Chapter 4 White Matter Lesion Segmentationlll lll lllllll ll l l l50556065700 50 100 150 200 250Number of training samplesAverage DSC [%]llTraining setTest setFigure 4.4: Comparison of DSC scores calculated on the training and test sets for varying numbersof training samples. At around 100 samples, the model becomes stable in terms of testperformance and the small difference between training and test DSCs, indicating thatoverfitting of the training data no longer occurs.features. At around 100 samples, the model becomes stable in terms of test performanceand the small differences between training and test DSCs, indicating that overfitting of thetraining data is no longer occurring. With 100 training subjects, our method achieves amean DSC on the test set of 57.38 %.4.4.6 Comparison on Public Data SetsTo allow for a direct comparison with a large number of state-of-the-art methods, weevaluated our method on the MICCAI 2008 MS lesion segmentation challenge (Styneret al., 2008) and the ISBI 2015 longitudinal MS lesion segmentation challenge. As shownin the previous section, approximately 100 images are required to train the 3-layer CENwithout overfitting and we expect the required number of images to be even higher whenadding more layers. Due to the relatively small size of the training data sets provided by74Chapter 4 White Matter Lesion SegmentationTable 4.2: Parameters of the 3-layer CEN for the evaluation on the challenge data sets. The number ofinput channels c is 3 for the MICCAI challenge and 4 for the ISBI challenge.Layer type Kernel Size #Filters Number of unitsInput — — 164× 206× 156× cConvolutional 9× 9× 9× c 32 156× 198× 148× 32Deconvolutional 9× 9× 9× 32 1 164× 206× 156× 1the two challenges, we used a CEN with only three layers on these data sets to reduce therisk of overfitting. The parameters of the models are summarized in Table 4.2.MICCAI 2008 MS Lesion Segmentation ChallengeIn a first experiment, we evaluated the performance of our method using 5-fold cross-validation on the training set of the MICCAI challenge. Figure 4.5 shows a comparisonof lesion masks produced by our method with the ground truth for three subjects. Thefirst two rows show the FLAIR, T1w, T2w, ground truth segmentations, and predictedsegmentations of two subjects with a DSC of 60.58 % and 61.37 %. Despite the largecontrast differences between the two subjects, our method performed well and consistently,which indicates that our model was able to learn features that are robust to a large range ofintensity variations. The last row shows a subject with a DSC of 9.01 %, one of the lowestDSC scores from the data set. Our method segmented lesions that have similar contrast tothe other two subjects, but these regions were not classified as lesions by the manual rater.This highlights the difficulty of manual lesion segmentation, as the difference betweendiffuse white matter pathology and focal lesions is often indistinct.A quantitative comparison of our method with other state-of-the-art methods is summa-rized in Table 4.3. Our method outperforms the winning method (Souplet et al., 2008) ofthe MS lesion segmentation challenge 2008 and the currently best unsupervised method75Chapter 4 White Matter Lesion SegmentationFLAIR T1w T2w Ground truth Our methodCHB07(DSC=60.58%)CHB04(DSC=61.37%)UNC09(DSC=9.01%)Figure 4.5: Example segmentations of our method for three different subjects from the MICCAI challengedata set. Our method performed well and consistently despite the large contrast differencesseen between the first two rows. In the third row, our method also segmented lesions thathave similar contrast, but these regions had not been identified as lesions by the manualrater, which highlights the difficulty in distinguishing focal lesions from diffuse damage,even for experts.reported on that data set (Weiss et al., 2013) in terms of mean TPR and PPV. Our methodperforms comparably to a current method (Geremia et al., 2011, 2010) that uses a carefullydesigned set of features specifically designed for lesion segmentation, despite our methodhaving learned its features solely from a relatively small training set.A comparison of our method with other state-of-the-art methods evaluated on the testset of the MICCAI challenge is summarized in Table 4.4. Our method ranked 6th (2nd ifonly considering methods with only one submission, i.e., without subsequent parameter-tuning and adjustments) out of 52 entries submitted to the challenge, outperforming76Chapter 4 White Matter Lesion SegmentationTable 4.3: Comparison of our method with state-of-the-art lesion segmentation methods in terms ofmean TPR, PPV, and DSC on the training set of the MICCAI 2008 lesion segmentationchallenge. Our method performs comparably to the best methods reported on the MS lesionsegmentation challenge data set.Method TPR PPV DSCSouplet et al. (2008) 20.65 30.00 —Weiss et al. (2013) 33.00 36.85 29.05Geremia et al. (2010) 39.85 40.35 —Our method 39.71 41.38 35.52Table 4.4: Selected methods out of the 52 entries submitted for evaluation to the MICCAI 2008 MSlesion segmentation challenge. Columns LTPR, LFPR, and VD show the average computedfrom the two raters in percent. Challenge results last updated: Dec 15, 2015.Rank Method Score LTPR LFPR VD1, 3, 9 Jesson and Arbel (2015) 86.94 48.7 28.3 80.22 Guizard et al. (2015) 86.11 49.9 42.8 48.84, 20, 26 Tomas-Fernandez and Warfield (2015) 84.46 46.9 44.6 45.65, 7 Jerman et al. (2015) 84.16 65.2 63.8 77.56 Our method 84.07 51.6 51.3 57.811 Roura et al. (2015) 82.34 50.2 41.9 111.613 Geremia et al. (2010) 82.07 55.1 74.1 48.924 Shiee et al. (2010) 79.90 52.4 72.7 74.5the recent SLS by Roura et al. (2015), and popular methods such as the random forestapproach by Geremia et al. (2010), and Lesion-TOADS by Shiee et al. (2010), but not aswell as the patch-based segmentation approach by Guizard et al. (2015), or the modelof population and subject (MOPS) approach by Tomas-Fernandez and Warfield (2015),which used additional images to build the intensity model of a healthy population. Thisis a very promising result for the first submission of our method given the simplicity ofthe model and the small training set size.77Chapter 4 White Matter Lesion SegmentationTable 4.5: Comparison of our method with the second and third ranked methods from the ISBI MSlesion segmentation challenge. The evaluation was performed on the training set usingleave-one-subject-out cross-validation. GT1 and GT2 denote that the model was trained withthe segmentations provided by the first and second rater as the ground truth, respectively.Method Rater 1 Rater 2DSC LTPR LFPR DSC LTPR LFPRRater 1 — — — 73.2 64.5 17.4Rater 2 73.2 82.6 35.5 — — —Jesson and Arbel (2015) 70.4 61.1 13.5 68.1 50.1 12.7Maier and Handels (2015) (GT1) 70 53 48 65 37 44Maier and Handels (2015) (GT2) 70 55 48 65 38 43Our method (GT1) 68.4 74.5 54.5 64.4 63.0 52.8Our method (GT2) 68.3 78.3 64.5 65.8 69.3 61.9ISBI 2015 Longitudinal MS Lesion Segmentation ChallengeIn addition, we evaluated our method on the 21 publicly available labeled cases fromthe ISBI 2015 longitudinal MS lesion segmentation challenge. The challenge organizershave only released the names of the top three teams, only two of which have publisheda summary of their mean DSC, LTPR, and LFPR scores for both raters to allow for adirect comparison. Following the evaluation protocol of the second (Jesson and Arbel,2015) and third (Maier and Handels, 2015) place methods, we trained our model usingleave-one-subject-out cross-validation on the training images and compared our results tothe segmentations provided by both raters. Table 4.5 summarizes the performance of ourmethod, the two other methods for comparison, and the performance of the two raterswhen compared against each other. Compared to the second and third place methods,our method was more sensitive and produced significantly higher LTPR scores, but alsohad more false positives, which resulted in slightly lower but still comparable DSC scores.This is again a promising result on a public data set.78Chapter 4 White Matter Lesion SegmentationTable 4.6: Parameters of the 3-layer CEN used on the clinical trial data set.Layer type Kernel Size #Filters Number of unitsInput — — 164× 206× 52× 2Convolutional 9× 9× 5× 2 32 156× 198× 48× 32Deconvolutional 9× 9× 5× 32 1 164× 206× 52× 14.4.7 Comparison of Network Architectures, Input Modalities, and PubliclyAvailable Methods on Clinical Trial DataQuantitative ComparisonTo determine the effect of network architectures, we compared the segmentation perfor-mance of three different networks using T1w and FLAIR MRIs. Specifically, we trained a3-layer CEN and two 7-layer CENs, one with shortcut connections and one without. Toinvestigate the effect of different input image types, we additionally trained two 7-layerCEN-s on the modalities used by EMS (T1w, T2w, PDw) and all four modalities (T1w,T2w, PDw, FLAIR). The parameters of the networks are given in Table 4.6 and Table 4.7.To roughly compensate for the anisotropic voxel size of the input images, we chose ananisotropic filter size of 9× 9× 5. In addition, we ran the five competing methods dis-cussed in Section 4.4.2 with Lesion-TOADS, SLS, and the two LST methods using the T1wand FLAIR images, and EMS using three (T1w, T2w, PDw) and all four modalities inseparate tests. A comparison of the segmentation accuracy of the trained networks andcompeting methods is summarized in Table 4.8.All CEN architectures performed significantly better than all other methods regardlessof the input modalities, with LST-LGA being the closest in overall segmentation accuracy.Comparing CEN to LST-LGA, the improvements in the mean DSC scores ranged from3 percentage points (pp) for the 3-layer CEN to 17 pp for the 7-layer CEN with shortcut79Chapter 4 White Matter Lesion SegmentationTable 4.7: Parameters of the 7-layer CEN-s used on the clinical trial data set.Layer type Kernel Size #Filters Number of unitsInput — — 164× 206× 52× 2Convolutional 9× 9× 5× 2 32 156× 198× 48× 32Average Pooling 2× 2× 2 — 78× 99× 24× 32Convolutional 9× 10× 5× 32 32 70× 90× 20× 32Deconvolutional 9× 10× 5× 32 32 78× 99× 24× 32Unpooling 2× 2× 2 — 156× 198× 48× 32Deconvolutional 9× 9× 5× 32 1 164× 206× 52× 1trained on all four modalities. The improved segmentation performance was mostly dueto an increase in lesion sensitivity. LST-LGA achieved a mean lesion TPR of 37.50 %,compared to 54.55 % produced by the CEN with shortcut when trained on the samemodalities, and 62.49 % when trained on all four modalities, while achieving a comparablenumber of lesion false positives. The mean lesion FPRs and mean volume differences ofLST-LGA and the 7-layer CEN-s were very close, when trained on the same modalities,and the CEN-s further reduced its FPR when trained on more modalities.This experiment also showed that increasing the depth of the CEN and adding theshortcut connections both improve the segmentation accuracy. Increasing the depth of theCEN from three layers to seven layers improved the mean DSC by 3 pp. The improvementwas confirmed to be statistically significant using a one-sided paired t-test (p-value of1.3× 10−5). Adding a shortcut to the network further improved the segmentation accuracyas measured by the DSC by 3 pp. A second one-sided paired t-test was performed toconfirm the statistical significance of the improvement with a p-value of less than 1× 10−10.Qualitative Comparison of Network ArchitecturesThe impact of increasing the depth of the network on the segmentation performance ofvery large lesions is illustrated in Figure 4.6, where the true positive, false negative, and80Chapter 4 White Matter Lesion SegmentationTable 4.8: Comparison of the segmentation accuracy of different CEN models, other methods, and inputmodalities. The table shows the mean of the Dice similarity coefficient (DSC), lesion truepositive rate (LTPR), and lesion false positive rate (LFPR). Because the volume difference (VD)is not limited to the interval [0, 100], a single outlier can heavily affect the calculation of themean. We therefore excluded outliers before calculating the mean of the VD for all methodsusing the box plot criterion.Method DSC [%] LTPR [%] LFPR [%] VD [%]Input modalities: T1w and FLAIR3-layer CEN 49.24 57.33 61.39 43.457-layer CEN 52.07 43.88 29.06 37.017-layer CEN-s 55.76 54.55 38.64 36.30Lesion-TOADS (Shiee et al., 2010) 40.04 56.56 82.90 49.36SLS (Roura et al., 2015) 43.20 56.80 50.80 12.30LST-LGA (Schmidt et al., 2012) 46.64 37.50 38.06 36.77LST-LPA (Schmidt et al., 2012) 46.07 48.02 52.94 41.62Input modalities: T1w, T2w, and PDw7-layer CEN-s 61.18 52.00 36.68 29.38EMS (Van Leemput et al., 2001) 42.94 44.80 76.58 49.29Input modalities: T1w, T2w, FLAIR, and PDw7-layer CEN-s 63.83 62.49 36.10 32.89EMS (Van Leemput et al., 2001) 39.70 49.08 85.01 34.51false positive voxels are highlighted in green, yellow, and red, respectively. The receptivefield of the 3-layer CEN has a size of only 17× 17× 9 voxels, which reduces its abilityto identify very large lesions marked by two white circles. In contrast, the 7-layer CENhas a receptive field size of 49× 53× 26 voxels, which allows it to learn features that cancapture much larger lesions. Consequently, the 7-layer CEN, with and without shortcut, isable to learn a feature set that captures large lesions much better than the 3-layer CEN,which results in an improved segmentation. However, increasing the depth of the networkwithout adding shortcut connections reduces the network’s sensitivity to very small lesionsas illustrated in Figure 4.7. In this example, the 3-layer CEN was able to detect three smalllesions, indicated by the white circles, which were missed by the 7-layer CEN. Adding81Chapter 4 White Matter Lesion SegmentationTable 4.9: Lesion size groups as used for the detailed analysis.Group Mean lesion size [mm3] #Samples Lesion load [mm3]Very small [0, 70] 6 1457± 1492Small (70, 140] 24 4298± 2683Medium (140, 280] 24 12 620± 9991Large (280, 500] 14 13 872± 5814Very large > 500 9 35 238± 27 531shortcut connections enables our model to learn a feature set that spans a wider range oflesion sizes, which increases the sensitivity to small lesions and, hence, allows the 7-layerCEN-s to detect all three small lesions (highlighted by the white circles), while still beingable to segment large lesions.4.4.8 Comparison for Different Lesion SizesTo examine the effect of lesion size on segmentation performance, we stratified the test setinto five groups based on their mean reference lesion size as summarized in Table 4.9. Acomparison of segmentation accuracy and lesion detection measures of a 7-layer CEN-strained on different input modalities and the best performing competing method LST-LGAfor different lesion sizes is illustrated in Figure 4.8. The 7-layer CEN-s outperformedLST-LGA for all lesions sizes except for very large lesions when trained on T1w andFLAIR MRIs. The advantage extended to all lesion sizes when the CEN-s was trained onall four modalities, which could not be done for LST-LGA. The differences were larger forsmaller lesions, which are generally more challenging to segment for all methods. Thedifferences between the two approaches were due to a higher sensitivity to lesions asmeasured by the LTPR, especially for smaller lesions, while the number of false positiveswas approximately the same for all lesion sizes.82Chapter 4 White Matter Lesion Segmentation7-layer CEN3-layer CEN 7-layer CEN-sFigure 4.6: Impact of increasing the depth of the network on the segmentation performance of verylarge lesions. The true positive, false negative, and false positive voxels are highlighted ingreen, yellow, and red, respectively. The 7-layer CEN, with and without shortcut, is ableto segment large lesions much better than the 3-layer CEN due to the increased size of thereceptive field. This figure is best viewed in colour.7-layer CEN3-layer CEN 7-layer CEN-s7-layer CEN3-layer CEN 7-lay r CEN-s7-layer CEN3-layer CEN 7-layer CEN-s7-layer CEN3-layer CEN 7-layer CEN-s7-layer CEN3-layer CEN 7-layer CEN-s7-layer CEN3-layer CEN 7-layer CEN-s7-layer CEN3-layer CEN 7-layer CEN-s7-laye CEN3-layer CEN 7-layer CEN-s7-layer CEN3-layer CEN 7-layer CEN-s7-layer CEN3-layer CEN 7-layer CEN-sFigure 4.7: Comparison of segmentation performance of different CEN architectures for small lesions.The white circles indicate lesions that were detected by the 3-layer CEN and the 7-layerCEN, but only with shortcut. Increasing the network depth decreases the sensitivity to smalllesions, but the addition of a shortcut allows the network to regain this ability, while stillbeing able to detect large lesions (see Figure 4.6). This figure is best viewed in colour.83Chapter 4 White Matter Lesion Segmentationlllllll020406080Very small Small Medium Large Very largeLesion sizeDSC [%]Method (modalities):7−layer CEN−s(T1w, FLAIR)7−layer CEN−s(T1w, T2w, PDw, FLAIR)LST−LGA(T1w, FLAIR)lllll0255075100Very small Small Medium Large Very largeLesion sizeLTPR [%]Method (modalities):7−layer CEN−s(T1w, FLAIR)7−layer CEN−s(T1w, T2w, PDw, FLAIR)LST−LGA(T1w, FLAIR)llll0255075100Very small Small Medium Large Very largeLesion sizeLFPR [%]Method (modalities):7−layer CEN−s(T1w, FLAIR)7−layer CEN−s(T1w, T2w, PDw, FLAIR)LST−LGA(T1w, FLAIR)Figure 4.8: Comparison of segmentation accuracy and lesion detection measures of a 7-layer CEN-strained on different input modalities and the best performing competing method LST-LGA fordifferent lesion sizes. The 7-layer CEN-s outperforms LST-LGA for all lesions sizes except forvery large lesions when trained on T1w and FLAIR MRIs, and for all lesion sizes when trainedon all four modalities, due to a higher sensitivity to lesions, while producing approximatelythe same number of false positives. Outliers are denoted by black dots.84Chapter 4 White Matter Lesion Segmentation4.5 DiscussionThe automatic segmentation of MS lesions is a very challenging task due to the largevariability in lesion size, shape, intensity, and location, as well as the large variabilityof imaging contrasts produced by different scanners used in multi-centre studies. Mostunsupervised methods model lesions as an outlier class or a separate cluster in a subject-specific model, which makes them inherently robust to inter-subject and inter-scannervariability. However, outliers are often not specific to lesions and can also be causedby intensity inhomogeneities, partial volume, imaging artifacts, and small anatomicalstructures such as blood vessels, which leads to the generation of false positives. On theother hand, supervised methods can learn to discriminate between lesion and non-lesiontissue, but are more sensitive to the variability in lesion appearance and different contrastsproduced by different scanners. To overcome those challenges, supervised methods requirelarge data sets that span the variability in lesion appearance and careful preprocessing tomatch the imaging contrast of new images with those of the training set. Library-basedapproaches have shown great promise for the segmentation of MS lesions, but do notscale well to very large data sets due to the large amount of memory required to storecomprehensive sample libraries and the time required to scan such libraries for matchingpatches. On the other hand, parametric deep learning models such as convolutional neuralnetworks scale much better to large training sets, because the size required to store themodel is independent of the training set size, and the operations required for training andinference are inherently parallelizable, which allows them to take advantage of very fastGPU-accelerated computing hardware. Furthermore, the combination of many nonlinearprocessing units allows them to learn features that are robust under large variability, whichis crucial for the segmentation of MS lesions.85Chapter 4 White Matter Lesion SegmentationConvolutional neural networks were originally designed to classify entire images anddesigning networks that can segment images remains an important research topic. Earlyapproaches have formulated the segmentation problem as a patch-wise classificationproblem, which allows them to directly use established classification network architecturesfor image segmentation. However, a major limitation of patch-based deep learningapproaches is the time required for training and inference. Fully convolutional networkscan perform the segmentation much more efficiently, but generally lack the precision toperform voxel-accurate segmentation and cannot handle unbalanced classes.To overcome these challenges, we have presented a new method for the automaticsegmentation of MS lesions based on deep convolutional encoder networks with shortcutconnections. The joint training of the feature extraction and prediction pathways allows forthe automatic learning of features at different scales that are tuned for a given combinationof image types and segmentation task. Shortcuts between the two pathways allow high-and low-level features to be leveraged at the same time for more consistent performanceacross scales. In addition, we have proposed a new objective function based on thecombination of sensitivity and specificity, which makes the objective function inherentlyrobust to unbalanced classes such as MS lesions, which typically comprise less than 1 % ofall image voxels. We have evaluated our method on two publicly available data sets and alarge data set from an MS clinical trial, with the results showing that our method performscomparably to the best state-of-the-art methods, even for relatively small training set sizes.We have also shown that when a suitably large training set is available, our method isable to segment MS more accurately than widely-used competing methods such as EMS,LST-LGA, SLS, and Lesion-TOADS. The substantial gains in accuracy were mostly due toan increase in lesion sensitivity, especially for small lesions. Overall, our proposed CENwith shortcut connections performed consistently well over a wide range of lesion sizes.86Chapter 4 White Matter Lesion SegmentationOur segmentation framework is very flexible and can be easily extended. One suchextension could be to incorporate prior knowledge about the tissue type of each non-lesionvoxel into the segmentation procedure. The probabilities of each tissue class could beprecomputed by a standard segmentation method, after which they can be added as anadditional channel to the input units of the CEN, which would allow the CEN to takeadvantage of intensity information from different modalities and prior knowledge abouteach tissue class to carry out the segmentation. In addition, our method can be appliedto other segmentation tasks. Although we have only focused on the segmentation of MSlesions in this paper, our method does not make any assumptions specific to MS lesionsegmentation. The features required to carry out the segmentation are solely learned fromtraining data, which allows our method to be used to segment different types of pathologyor anatomy when a suitable training set is available.87Chapter 5Manifold Learning by Deep Learning5.1 IntroductionChanges in brain morphology such as global and regional atrophy and the formation ofwhite matter lesions are two hallmarks of multiple sclerosis (MS) pathology. In order toquantify the pathological manifestation of MS, a number of imaging biomarkers derivedfrom magnetic resonance imaging (MRI) scans of the brain, such as lesion volume andwhole brain volume, have been proposed, which have established their importance for thestudy of MS. However, MS is a complex disease whose pathological variability extendswell beyond what can be captured by global and local volumetric measures. It would behighly desirable to have a method that can automatically discover potentially importantpatterns of variability in brain morphology and lesion distribution, which would advanceour understanding of the complex pathology of MS. In addition, the joint modelling ofbrain morphology and lesion distribution would further our knowledge of how these twokey pathological features interact. However, this type of modelling is very challengingdue to the high dimensionality of the data.88Chapter 5 Manifold Learning by Deep LearningIn recent years, there has been an increased interest in biomarker discovery usingmanifold learning to form high-level, low-dimensional representations of medical images(Aljabar et al., 2011; Wolz et al., 2012, 2010b). In order to discover common patterns ofvariability in a group of images, each image of the data set is regarded as a point in ahigh-dimensional image space (called the ambient space), with nx × ny × nz coordinates,where nx, ny, nz are the dimensions of each image. On the other hand, each image couldalso be identified by a smaller set of parameters that describe shape variations and patternsthat are common for a particular group of images. These parameters span a new spacecalled the manifold space. The task of manifold learning is to discover the low-dimensionalspace and its parameters, which can then be used to model the anatomical variabilitywithin a population, or serve as biomarkers to track disease state and progression.In the remainder of this section, we will briefly revise common manifold learningtechniques and applications for medical image analysis. A main part of the chapter willbe an explanation of the deep learning manifold learning framework using a deep beliefnetwork. We have evaluated the proposed manifold learning method on two tasks: a)to learn the variability of magnetic resonance (MR) images of the brain of Alzheimer’sdisease (AD) patients, and b) to discover patterns of morphological variability and lesiondistribution in MS.5.2 Manifold Learning for Medical Image AnalysisMost popular manifold learning methods for medical image analysis can be roughlyclassified into local and global methods (Cayton, 2005). Local algorithms such as locallylinear embedding (LLE) (Saul and Roweis, 2003) and Laplacian eigenmaps (LEM) (Belkinand Niyogi, 2003) try to find a low dimensional representation of the data such that89Chapter 5 Manifold Learning by Deep Learninglocal distances between points in ambient space are best preserved when mapped tomanifold space. In contrast, global methods such as isometric feature mapping (Isomap)(Tenenbaum et al., 2000) try to find an embedding of the manifold space that best preservesgeodesic distances of points that lie far apart in ambient space. Both types of methodsrequire building a proximity graph that is used to capture neighborhood information(local methods) or to estimate geodesic distances by calculating the shortest path distancesbetween all points on the affinity graph (global methods). Both types of methods assumethat the manifold is locally linear, which means that distances between neighboring pointsin manifold space can be approximated by their distances in ambient space. Once theproximity graph has been created, a unique solution to the inherent optimization problemof each algorithm can be found analytically by solving an eigenvalue problem.Manifold learning methods have been successfully applied to various image analysisproblems such as the segmentation of the hippocampus (Wolz et al., 2010a), to regularizethe segmentation of heart (Zhang et al., 2006) and brain ventricles (Etyngier et al., 2007),and to constrain the deformable registration of brain images to have biologically plausibleparameters (Hamm et al., 2010), while most methods focus on clinical prediction (Aljabaret al., 2011; Bhatia et al., 2012; Duchateau et al., 2012; Gerber et al., 2010; Guerreroet al., 2014; Wachinger et al., 2015; Wolz et al., 2012). Gerber et al. used Isomap to predictclinical parameters of AD patients (Gerber et al., 2010, 2009), and Wolz et al. used LEMto perform biomarker discovery (Wolz et al., 2011, 2012), also of AD patients. Duchateauet al. (2011, 2012) discriminate between normal and abnormal motion patterns of the heartbased on the definition of pathology as a deviation from normality. After learning themanifold representing normal motion patterns, abnormal motion patterns can be detectedbased on the distance of a new data sample from the manifold surface. Wachingeret al. (2015) proposed a new shape descriptor of the brain called BrainPrint, which is90Chapter 5 Manifold Learning by Deep Learningbuild by solving the eigenvalue problem of the 2D and 3D Laplace-Beltrami operator onmeshes of segmented cortical and subcortical structures. The descriptor defines a similaritymeasure on brain images, which can be used to perform subject identification (Wachingeret al., 2014b), and to predict Alzheimer’s disease (Wachinger et al., 2014a). Bhatiaet al. (2012) model regional variations of the brain by building a manifold on hierarchicalimage patches and applied it to finding discriminative regions of 3D brain MR images forclassifying Alzheimer’s disease. Guerrero et al. (2014) have used manifold learning onregions of interest (ROIs) to extract features that represent inter-subject variability of ADpatients. In a first step, sparse regression is used to automatically detect the ROI usingmini-mental state examination (MMSE) scores as the independent variable. Then, LEMwith cross-correlation as the distance measure is used to learn the manifold of the ROIs,from which the features are extracted.A main challenge of most manifold learning methods (e.g., Isomap, LEM, LLE) is theconstruction of the proximity graph. Building the proximity graph requires choosingan appropriate neighborhood criterion, which can be challenging due to the sensitivityto noise of commonly used criteria (e.g., k-nearest neighbors, e-ball), and, consequently,may result in an unstable topology of the learned manifold (Balasubramanian andSchwartz, 2002). Other challenges include avoiding erroneously disconnected regionsand shortcuts, finding a trade-off between sampling density and curvature of local patchesof the manifold, and finding a suitable distance metric. A number of solutions havebeen proposed, but there is no general solution. To increase the robustness to noise andvarying parameter settings, Carreira-Perpiñán and Zemel (2005) proposed buildingthe graph from a graph ensemble that combines multiple spanning trees, each fit to aperturbed version of the data set. A trade-off between sampling density and curvature oflocal patches of the manifold can be found automatically by adapting the selection of the91Chapter 5 Manifold Learning by Deep Learningneighborhood sizes through neighborhood contraction and expansion (Zhang et al., 2012).Gerber et al. (2010) have shown that the choice of a suitable distance measure is crucialfor manifold learning using Isomap and that the warping distance between brain imagesimproves the learning performance over previously used Euclidean distances in the imagespace.5.3 Manifold Learning using Deep Belief NetworksDeep generative models such as deep belief networks (DBNs) (Hinton et al., 2006) are apromising alternative to previously used brain manifold learning methods due to theirability to discover patterns of similarity in groups of images. In contrast to most commonlyused manifold learning algorithms (e.g., LLE, LEM, Isomap), DBNs do not assume themanifold space to be locally linear and do not require a previously defined similaritymeasure or the construction of a proximity graph, which makes them more generallyapplicable to a wide range of manifold learning tasks. This is particularly important forthe application to brain manifold learning, because the high-dimensionality of brain MRIsmake defining a neighborhood criterion challenging.5.3.1 General ArchitectureThe general architecture of the DBN used for manifold learning of brain MRIs is illustratedin Figure 5.1. The first restricted Boltzmann Machine (RBM) (Freund and Haussler, 1992;Hinton, 2010) receives the intensity values of a group of images in ambient space asinput and reduces the dimensionality of each image by discovering patterns of similaritythat are common within groups of images. Subsequent RBMs receive the hidden unitactivations of the previous RBM as input, thus learning successively more complex and92Chapter 5 Manifold Learning by Deep Learningabstract patterns from a training set, where the hidden units of the last layer represent thecoordinates of an image in manifold space. The number of trainable weights increasessignificantly with the resolution of the training images. In order to scale the model tohigh-resolution images, the first several layers of our DBN are strided convolutionalrestricted Boltzmann machines (sconvRBMs) (Brosch and Tam, 2015), a type of RBM thatuses weight sharing and local connectivity between units of adjacent layers to reduce thenumber of trainable weights. Due to the much smaller number of trainable parameterscompared to dense RBMs, sconvRBMs are best suited for learning low- to mid-levelfeatures from very high-dimensional data such as brain MRIs. Compared to other morecommonly used convolution-based RBMs (Lee et al., 2009), an advantage of sconvRBMs isthat inference is invertible, which allows the reconstruction of the visible units from thehidden unit activations. In our application, this allows for the mapping of parametersfrom the manifold space back to the ambient space. Due to the local connectivity of theweights, sconvRBMs can only learn local patterns in the data. In order to learn patternsthat describe images as a whole, sconvRBM layers are followed by dense RBMs, whichintegrate the extracted feature from all image locations in order to model global patterns.5.3.2 Unit TypesTo apply the model to real-valued data like the intensities of some medical images,the visible units are modelled as Gaussian units. As suggested in Hinton et al.’s RBMtraining guide (Hinton, 2010), we did not learn the variance of the Gaussian units andstandardized the inputs to having zero mean and unit variance instead. There are twotypical choices for the hidden units of an RBM, binary hidden units and noisy rectifiedlinear units (NReLUs). A binary hidden unit can only encode two states. In order to learnpatterns of variability that span a continues spectrum of changes instead of binary on/off93Chapter 5 Manifold Learning by Deep LearningInput imageVisible unitsStridedconvolutionStrided convolutional RBMVectorizeVectorizedhidden unitsHidden units ofsconvRBMsDense RBMM1M2Hidden units ofdense RBMsFigure 5.1: General architecture of the DBN used to learn the manifold of brain MRIs. The visible unitsof the DBN represent the intensity values of images in ambient space. The hidden units ofthe last layer represent the image coordinates in manifold space (M1, M2). The first severallayers of the DBN are sconvRBMs followed by one or more dense RBM layers.patterns, we use NReLUs as the hidden units, which have also shown to improve thelearning performance (Nair and Hinton, 2010) of RBMs.5.3.3 Incorporating a Region of InterestA challenge for training DBNs on brain MRIs is that large black regions without localstructure can lead to random activations of the hidden units and consequently to thelearning of random filters. To overcome this problem, we propose incorporating a ROI terminto the energy equation of the convolutional restricted Boltzmann machine (convRBM),which allows constraining the filter learning process to a given ROI. This can be achievedby the element-wise multiplication of the visible and hidden units with a binary mask,94Chapter 5 Manifold Learning by Deep Learningwhich sets the visible and hidden units outside of the ROI to zero, thereby removing theircontribution to the energy of the model. The modified energy of the model is given byE(v,h, rv, rh) =−Nc∑i=1Nk∑j=1(rh · h(j)) • (w˜(ij) ∗ (rv · v(i)))−Nc∑i=1biNv∑x,y=1rv,xyv(i)xy −Nk∑j=1cjNh∑x,y=1rh,xyh(j)xy , (5.1)where rv and rh are binary masks defining the ROI with respect to the visible and hiddenunits, and · denotes element-wise multiplication. Because sconvRBMs can be mapped toequivalent convRBMs, we can use the same modification to restrict the learning of filtersof sconvRBMs.5.4 Manifold of MRIs of Alzheimer’s Disease PatientsIn a first experiment, we applied our manifold learning framework using DBNs to thetask of learning the manifold of brain MRIs of AD and healthy subjects. DBNs haveshown to be able to learn the manifold of handwritten digits (Hinton et al., 2006) orsmall natural images (Krizhevsky, 2010). However, those data sets contain a large numberof small images, where the number of images (number of data points) well exceeds thenumber of pixels per image (the dimension of a data point). For this experiment, we wereinterested if DBNs are able to learn the manifold of images without overfitting, whenthe dimension of a data point well exceeds the number of data points, as is commonlythe case for biomedical data sets. A second question was if DBNs can learn meaningfulpatterns despite the limited amount of training data.95Chapter 5 Manifold Learning by Deep Learning5.4.1 Data Sets and PreprocessingWe have evaluated the proposed method on a subset of the Alzheimer’s Disease Neu-roimaging Initiative (ADNI) data set (Petersen et al., 2010), containing 300 T1-weighted(T1w) MRIs of AD and normal subjects. The images were provided skull-stripped and biasfield corrected. We resampled all images to a resolution of 128× 128× 128 voxels and avoxel size of 2.0× 2.0× 2.0 mm. We then normalized their intensities to a common range,and rigidly registered them to a group-wise mean image prior to training and testing. Wedid not perform non-rigid registration for spatial normalization in order to evaluate thecapabilities of the method without the added confound of complex registration parameters.The data set was divided into a training set and a test set such that each set contains 75AD and 75 normal subjects.5.4.2 MethodTo learn the manifold of brain MRIs, we used a DBN with three sconvRBM layers andtwo dense RBM layers as described in Section 5.3.1. The parameters of the DBN werechosen to yield a continuous reduction of the dimensionality of the input images and aresummarized in Table 5.1. For this experiment, we used circular convolutions instead ofvalid convolutions for the training of the sconvRBMs, which is a type of convolution thatdoes not reduce the output size by the filter size. This simplifies the choice of parametersof the DBN, because the sizes of the hidden layers are independent of the filter sizes. Forthe first layer, we chose a stride size of 4× 4× 4, which results in a strong reduction ofthe dimensionality in order to reduce the amount of memory required to train the secondlayer. Increasing the stride size decreases the overlap of adjacent sliding windows. Tocompensate, we also chose a larger filter size for the first layer than for all subsequent96Chapter 5 Manifold Learning by Deep LearningTable 5.1: Parameters of the DBN used to learn the manifold of brain MRIs of AD and healthy subjects.Layer type Kernel size Stride size #Filters/ Output size#Hidden unitsInput — — — 128× 128× 128× 1sconvRBM1 20× 20× 20× 1 4× 4× 4 32 32× 32× 32× 32sconvRBM1 14× 14× 14× 32 2× 2× 2 32 16× 16× 16× 32sconvRBM1 10× 10× 10× 32 2× 2× 2 64 8× 8× 8× 64RBM1 — — 32 32RBM2 — — 2 2layers. After the application of three sconvRBMs, the dimension of each image is reducedto 8× 8× 8 and small enough for RBMs. The training of the DBN took approximately 43hours on two GeForce GTX 560 Ti graphics cards.5.4.3 ResultsGeometric Fit of the ManifoldThe geometric fit of the learned manifold model was evaluated in terms of the gen-eralizability to new images and the specificity to images from the training set. Thegeneralizability was measured in terms of the average root mean squared error (RMSE)between images I and their reconstructions R, normalized by the intensity range of theinput imagesRMSE =√1N ∑p(I(p)− R(p))2max(I)−min(I) , (5.2)where p denotes the coordinates of a particular voxel of an image and N denotes to totalnumber of voxels. The specificity was measured by calculating the average RMSE betweenimages randomly generated from the manifold model using block Gibbs sampling andthe most similar images from the training set. Figure 5.2 shows a comparison of the97Chapter 5 Manifold Learning by Deep Learning0.040.060.080.100.121 2 3 4 5LayerRMSESpecificity errorTest errorTraining errorFigure 5.2: Generalizability vs. specificity of a DBN for different numbers of layers. The similarity ofthe reconstruction errors between the training and test images indicates that no overfittingoccurs. The opposite slopes of the reconstruction errors and error of generated images(specificity error) indicates a trade-off between generalizability vs. specificity in the earlierphases of training, but the low errors at layer 5 indicate that the method is both generalizableand specific.reconstruction errors between the training and test sets, and the specificity at differentlayers of the DBN. The similarity of the reconstruction errors between the training andtest images indicates that no overfitting is occurring. The average reconstruction error atthe last layer is below 6 %. Even though the very small reconstruction error is partiallydue to head MRIs having a large amount of homogeneous background, it demonstratesthe ability of the learned manifold to capture most of the visual information with onlytwo manifold parameters. The opposite slopes of the reconstruction errors and errorof generated images indicates a trade-off between generalizability and specificity in theearlier phases of training. The low errors at the end of training (layer 5) indicates that themethod is able to be both specific and generalizable.98Chapter 5 Manifold Learning by Deep LearningM1M2Figure 5.3: Axial slices from generated volumes from the manifold. An increase of the first and sec-ond manifold dimension visually correlates with an increase in brain and ventricle size,respectively.Visualization of the Learned ManifoldFigure 5.3 shows axial slices of 16 volumes sampled at the grid points of a 2D regular gridin manifold space. Volumes sampled along the first manifold dimension M1 (from left toright) appear to increase in brain size, and the images sampled along the second manifolddimension M2 (from bottom to top) appear to increase in ventricle size. Figure 5.4 showsan axial slice of each image of the training set plotted against its manifold coordinates.Consistent with images sampled from the manifold, an increase in ventricle size, which isindicative of brain atrophy (a hallmark of AD), visually correlates with an increase of thesecond manifold coordinate. The AD/normal status is indicated by the frame colour ofeach image. The vertical separation between AD and normals suggests that the second99Chapter 5 Manifold Learning by Deep Learning0.02.55.07.50 4 8 12M1M2DiagnosisADNormalFigure 5.4: Axial slices of volumes from the training set plotted against their manifold coordinates. Thebrains with larger ventricles, indicative of atrophy, are mostly at the top, which is also wheremost of the AD patients are.manifold coordinate is potentially of practical use in differentiating between AD andnormal.Correlations with Clinical ParametersTo evaluate the potential of the manifold coordinates to reveal or predict clinically relevantinformation, we have calculated the Pearson correlation r of demographic parameters(age, gender) and disease parameters (MMSE score, AD/normal status) with the manifoldcoordinates (M1 and M2). The results of the correlation tests are summarized in Table 5.2.Age, MMSE and AD/normal status show highly significant correlations with M2 (p-values100Chapter 5 Manifold Learning by Deep LearningTable 5.2: Pearson correlation r of demographic and clinical parameters with manifold coordinates (M1,M2). The stronger correlation in each column is highlighted in bold. The level of statisticalsignificance is indicated by the number of asterisks (* p < 0.05, ** p < 0.01, *** p < 0.001)Age Gender MMSE AD/normal statusM1 −0.173* 0.449*** 0.012 −0.032M2 0.447*** 0.186* −0.402*** 0.413***between 9.85× 10−9 and 3.53× 10−7), which makes intuitive sense because M2 visuallycorrelates with ventricle size. The first manifold coordinate correlates strongest withgender (p-value = 8.24× 10−9), which also makes sense in terms of the general differencein size between male and female. The strength and significance of the correlationsdemonstrate the potential of deep learning of brain images for classification and predictionof disease status.5.5 Variability of Morphology and Lesion DistributionIn this section, we present a new method for modelling the variability in brain morphologyand lesion distribution in a large set of MRIs of MS patients. Similar to our previous workon modelling the manifold of brain MRIs of AD and healthy subjects, our method is builtusing a DBN (Hinton et al., 2006), a layered network whose parameters can be learnedfrom training images. However, a difference to the previous approach is that the DBN istrained on deformation fields and lesions masks, instead of brain MRIs, in order to focuson the learning of the variability in brain morphology and lesion distribution, withoutthe confounding influence of intensity variations. An advantage of DBNs over othermanifold learning methods is that it does not require a prebuilt proximity graph, which isparticularly beneficial for modelling lesions, because the spareness and randomness ofMS lesions make defining a suitable distance measure challenging and potentially biasing.101Chapter 5 Manifold Learning by Deep LearningInstead, the DBN approach assumes that a particular lesion configuration is a samplefrom an unknown distribution of lesion configurations that can be parameterized by arelatively small set of lesion distribution parameters. We model morphological and lesionvariability with two individual DBNs, then form a joint model by replacing the individualtop network layers with a new layer that joins both DBNs, similarly to the work on thejoint modelling of auditory and visual signals for speech recognition (Ngiam et al., 2011).Our results show that this model can automatically discover the classic patterns of MSpathology, as well as more subtle ones, and that the distribution parameters computed arefound to have strong relationships to MS clinical scores.5.5.1 Data Sets and PreprocessingThe proposed method was evaluated on a data set from an MS clinical trial of 474 secondaryprogressive MS patients. For each subject, the data set contains one T1w, T2-weighted(T2w), and proton density-weighted (PDw) MRI with a resolution of 256× 256× 50 voxelsand a voxel size of 0.937× 0.937× 3.000 mm. The main preprocessing steps included rigidregistration, brain extraction, intensity normalization, and background cropping.5.5.2 MethodOur proposed model for pattern discovery is illustrated in Figure 5.5 and consists of threemain components (a) a model that aims to find patterns of morphological changes indeformation fields (top pathway), (b) a model that aims to find patterns in the spatialdistribution of lesions (bottom pathway), and (c) a joint model that aims to find concurringdeformation and lesion distribution patterns (combined pathways).102Chapter 5 Manifold Learning by Deep LearningMorphologyDBNDisplacementfield uux uy uzStridedconvolutionStrided convolutional RBMN = 32N = 64Vectorizehidden unitsN = 32Dense RBMD1D2LesionDBNLesionmask lVisibleunitsN = 32HiddenunitsN = 64N = 64L1L2J1J2J3J4N = 16JointDBNFigure 5.5: Proposed model for discovering patterns of variability in MS brain. Our model consists ofthree main components (a) a model that aims to find patterns of morphological changesin deformation fields (top pathway), (b) a model that aims to find patterns in the spatialdistribution of lesions (bottom pathway), and (c) a joint model that aims to find concurringdeformation and lesion distribution patterns (combined pathways)Morphology ModelThe morphology model is learned from a set of displacement fields that are calculatedvia non-rigid registration from a set of T1w brain MRIs D ⊂ I, I = {In | In : Ω 7→ R},Ω ⊂ R3 and the ICBM 152 nonlinear atlas template image (Fonov et al., 2011). First, allimages of the training set are aligned to MNI space by a 9 degree-of-freedom registrationusing the FSL linear image registration tool (FLIRT) (Jenkinson et al., 2002). Then foreach image In ∈ D, the displacement field un, u : Ω 7→ R3, that describes the non-rigidtransformation from template coordinates to image coordinates is calculated using the103Chapter 5 Manifold Learning by Deep LearningFSL non-linear image registration tool (FNIRT) (Andersson et al., 2007), where thedisplacement field un is represented by a 40× 48× 22 grid of 3D displacement vectors.We assume that the displacement fields un are samples from an unknown distributionp(u | D1, . . . ) that can be parameterized by far fewer parameters than the dimensionalityof the fields themselves. In practice, the user typically selects the number of parametersto represent the data being explored. The task of finding patterns is to discover theunderlying probability density function p(u | D1, . . . ), where the parameters (D1, . . . )Trepresent the patterns of variability of the displacement field distribution. This allows usto compare the morphology of two brains at a very high level in terms of the distributionparameters of their displacement fields u1 and u2 given by E[D1, . . . | u1] and E[D1, . . . | u2].Furthermore, we can visualize the modes of morphological variability of MS brain images,by sampling the space of displacement fields spanned by (D1, . . . )T by calculating theexpectation E[u | D1, . . . ] for a range of values for (D1, . . . )T.The probability density function p(u) is modelled by a DBN (Hinton et al., 2006). Thefirst RBM layer receives the displacement fields of a training set as input and reduces thedimensionality of each field by discovering patterns of similarity that are common withingroups of displacement fields. Each subsequent RBM receives the hidden unit activationsof the previous RBM as input, thus learning successively more complex and abstractpatterns from the training data. In particular, we use a DBN with three sconvRBMs andtwo dense RBMs as described in Section 5.3.1. Compared to other more commonly usedconvolution-based RBMs (Lee et al., 2009), an advantage of sconvRBMs is that inferenceis invertible, which allows the reconstruction of the visible units from the hidden unitactivations. In our application, this would allow for the reconstruction of deformationfields from distribution parameters.104Chapter 5 Manifold Learning by Deep LearningTable 5.3: Training parameters of the morphology DBN, lesion DBN, and joint DBN. The visible unitsof the joint DBN are composed of the concatenated hidden units of the first dense RBMs ofthe morphology and lesion DBNs.Layer Parameter Morphology DBN Lesion DBN Joint DBNInput Image size 40× 48× 22× 3 160× 192× 56× 1 —sconvRBM1 Stride size 2× 2× 1 4× 4× 2 —Filter size 10× 10× 7 20× 20× 10 —Filters 32 32 —Output size 16× 20× 16× 32 36× 44× 24× 32 —sconvRBM2 Stride size 2× 2× 2 2× 2× 2 —Filter size 10× 10× 10 14× 14× 10 —Filters 64 64 —Output size 4× 6× 4× 64 12× 16× 8× 64 —sconvRBM3 Stride size 1× 1× 1 2× 2× 2 —Filter size 3× 5× 3 10× 14× 6 —Filters 32 64 —Output size 2× 2× 2× 32 2× 2× 2× 64 —RBM1 Visible units 256 512 —Hidden units 16 16 —RBM2 Visible units 16 16 32Hidden units 2 2 4The parameters of the morphology DBN are summarized in Table 5.3. In contrast to theDBN used to learn the manifold of brain MRIs of AD and healthy subjects, we use validconvolutions for the sconvRBMs, because they allow a much faster reduction of the dimen-sionality, especially for layers two and three, where the image sizes are relatively smallcompared to the filter sizes. This allows for a greater reduction of dimensionality withinthe sconvRBM layers, and consequently for a more gradual reduction of the number ofhidden units during the transition of sconvRBMs to dense RBMs. To roughly compensatefor the anisotropic voxel size of the input images and corresponding deformation fields,we chose an anisotropic filter and stride size of 10× 10× 7 and 2× 2× 1, respectively.105Chapter 5 Manifold Learning by Deep LearningAfter three sconvRBMs, the size of a deformation field is reduced to 2× 2× 2× 32 andsmall enough for RBMs.Once the weights and bias terms of the DBN have been learned from training data,we can use the model for inference. Let vd,1, . . . , vd,5, hd,1, . . . ,hd,5, and θd,1, . . . , θd,5denote the visible units, hidden units, and parameters, respectively, of each RBM of themorphology DBN. Then, for a given displacement field un, we can calculate the parameters(D1, . . . )T of un ∼ p(u | D1, . . . ) with(D1, . . . )T = E[D1, . . . | un] = E[h | vd,5, θd,5] (5.3)vd,i+1 = E[h | vd,i, θd,i] (5.4)where i ∈ [1, 4] and vd,1 = un. Inversely, the mean displacement field u¯ given thedistribution parameters can be calculated byu¯ = E[u | D1, . . . ] = E[v | hd,1, θd,1] (5.5)hd,i = E[v | hd,i+1, θd,i+1] (5.6)where i ∈ [1, 4] and hd,5 = (D1, . . . )T.Lesion ModelThe input into our lesion model is a set of 3D binary lesion masks ln ∈ I, which havebeen created from T2w and PDw MRIs by experts using a semi-automatic method. Alllesion masks are spatially aligned to MNI space using the transformations as calculatedfor the corresponding T1w images. Analogous to the morphology model, we assumethat lesion masks ln are samples from an unknown distribution ln ∼ p(l | L1, . . . ) that106Chapter 5 Manifold Learning by Deep Learningcan be parameterized by only relatively few parameters (L1, . . . )T. The task of findinglesion patterns is to discover the underlying probability density function p(l | L1, . . . ),where the parameters (L1, . . . )T represent patterns of variability of MS lesions. Similarto the morphology DBN, the lesion DBN consists of three sconvRBMs and two denseRBMs, whose parameters are summarized in Table 5.3. For the first layer, we chose ananisotropic stride size of 4× 4× 2, to roughly compensate for the anisotropic voxel size ofthe input images. A limiting factor for the choice of parameters is the amount of availableGPU memory. Due to the much higher resolution of the lesion masks compared to thedeformation fields, we increased the stride size compared to the morphology DBN inorder to reduce the required amount of memory for storing the hidden units of the firstlayer and visible units of the second layer. Choosing a stride size of 4× 4× 2 reduces thesize of one channel by a factor of roughly 32, which allows the use of 32 filters, whilekeeping the amount of memory required to train the second layer roughly equal comparedto the first layer. To account for the sparse activation of MS lesion masks, we constrain thefilter learning process to a pre-defined ROI, which was chosen to contain all white matterlesions from the training set. Similarly to the morphology model, for a trained lesion DBNand a given lesion mask ln, we can calculate the parameters (L1, . . . )T of ln ∼ p(l | L1, . . . )by(L1, . . . )T = E[L1, . . . | ln] = E[h | vl,5, θl,5] (5.7)vl,i+1 = E[h | vl,i, θl,i] (5.8)107Chapter 5 Manifold Learning by Deep Learningwhere i ∈ [1, 4] and vl,1 = ln. Likewise, the mean lesion configuration l¯ given thedistribution parameters (L1, . . . )T can be calculated byl¯ = E[l | L1, . . . ] = E[v | hl,1, θl,1] (5.9)hl,i = E[v | hl,i+1, θl,i+1] (5.10)where vl,1, . . . , vl,5, hl,1, . . . ,hl,5, and θl,1, . . . , θl,5 denote the visible units, hidden units, andparameters of each RBM of the lesion DBN, respectively, i ∈ [1, 4], and hl,5 = (L1, . . . )T.Joint ModelTo discover concurring patterns of morphology and lesion distribution, we combine themorphology DBN and the lesion DBN to form the joint DBN, which defines the jointdistribution p(u, l | J1, . . . ). The joint DBN consists of two pathways (see Figure 5.5), eachcorresponding to the first 4 layers of the morphology and lesion DBNs, respectively, and a5th RBM layer with 4 hidden units, which replaces the 5th layer of the individual DBNsand combines the hidden unit activations of the 4th layer RBMs. That is, the 5th RBMdefines the joint probability p(vj,hj | θj), where the vector vj contains the concatenatedhidden units of the 4th layer RBMs, vj = (hTd,4,hTl,4)T, and hj = (J1, . . . )T are the modes ofvariability of morphological and lesion distribution changes.5.5.3 ResultsVisualization of Patterns of VariabilityThe invertibility of our model allows the reconstruction of images from the distributionparameters to visualize the discovered patterns of variability. Figure 5.6(a) shows axial108Chapter 5 Manifold Learning by Deep LearningD1D2(a) Morphology manifoldL1L 2(b) Lesion manifoldJ1J2J3J4Jx(c) Joint manifoldFigure 5.6: Slices from generated volumes from the (a) morphology, (b) lesion, and (c) joint model. Themorphology model captures ventricular enlargement (D1) and decrease in brain size (D2) asthe main modes of variation. For the lesion model, L1 captures an increase in lesion loadthroughout the WM, while L2 captures primarily periventricular lesion load variations. Theparameters of the joint model capture combinations of the variability found in the individualmodels.slices from volumes generated from the 2-parameter morphology model p(u | D1, D2). Togenerate these images, we calculated the mean displacement fields for varying values ofD1 and D2 to span the range of variability represented by the training set and applied theinverse deformations to the ICBM 152 template image. The most apparent morphologicalvariability captured by the morphology model is ventricular enlargement for D1 and overallbrain size for D2. Figure 5.6(b) shows axial slices from the mean lesion configurationsE[l | L1, L2] for varying lesion distribution parameters. An increase of L2 visually correlateswith an increase of lesions specifically around the ventricles, whereas an increase of L1visually correlates with an increase of lesions in the entire white matter.To visualize concurring patterns of morphology and lesion distribution, we sampledimages from the joint model p(u, l | J1, . . . , J4) as shown in Figure 5.6(c). The images are109Chapter 5 Manifold Learning by Deep LearningTable 5.4: Pearson correlations r of clinical scores with distribution parameters of the morphologymodel (D1, D2), lesion model (L1, L2), joint model (J1, J2, J3, J4), normalized brain volume(nBV), and lesion load (LL). The level of statistical significance is indicated by the number ofasterisks (* p < 0.05, ** p < 0.01, *** p < 0.001).T25W 9-HPT PASAT MSFCIndividualmodelsD1 −0.129** −0.215*** −0.282*** −0.315***D2 0.087 0.116* 0.089 0.139**L1 −0.058 −0.231*** −0.392*** −0.367***L2 −0.091 −0.354*** −0.427*** −0.464***JointmodelJ1 0.107* 0.286*** 0.336*** 0.379***J2 −0.038 −0.210*** −0.227*** −0.256***J3 −0.118* −0.369*** −0.453*** −0.494***J4 −0.049 −0.206*** −0.383*** −0.346***ImagingbiomarkersnBV 0.053 0.144** 0.247*** 0.235***LL −0.074 −0.286*** −0.400*** −0.406***deformed template images with superimposed lesion masks. For each row, we varied onlyone distribution parameter and set the remaining parameters to their mean values. Of thefour parameters, J3 visually corresponds most closely to the “classic” progression of MSpathology, with an enlargement of the ventricles paired with an increased periventricularlesion load. The parameters J2 and J4 also reveal simultaneous morphological and lesionvariations that are visible on the chosen axial slice. For J1, a lesion pattern is not obviousunless the images are viewed sagittally, which reveals changes in lesion load in the pons.Correlations with Clinical ScoresTo evaluate the potential of the distribution parameters to reveal clinically relevant in-formation, we have calculated the Pearson correlation r of the clinical MS FunctionalComposite (MSFC) score (Fischer et al., 1999) and its components (Timed 25-Foot Walk,T25W; 9-Hole Peg Test, 9-HPT; Paced Auditory Serial Addition Test, PASAT) with thedistribution parameters and two established MS imaging biomarkers (normalized brain110Chapter 5 Manifold Learning by Deep Learning1.5 0 −1.5 −3 −4.5(a) Morphological changes1.5 0 −1.5 −3 −4.5(b) Lesion changesFigure 5.7: Axial (top row) and mid-sagittal (bottom row) slices of volumes representing the spectrumof MSFC scores from 1.5 to −4.5. A decrease in MSFC shows the classic pattern of MSpathology.volume, nBV, as calculated by SIENAX (Smith et al., 2002) and lesion load, LL). Theresults of the correlation tests are summarized in Table 5.4. In the individual models, allparameters correlate significantly with 9-HPT, PASAT, and MSFC, except for D2, whichdid not show a statically significant correlation with PASAT. The morphology parameterD1 correlates more strongly with these scores than nBV, as does the lesion parameter L2than LL. For T25W, D1 shows a modest but significant correlation while nBV does not.In the joint model, all parameters correlate significantly with 9-HPT, PASAT, and MSFC,with J3 being particularly strong. The parameter J3 shows stronger correlations than nBVor LL for all clinical scores, including a modest significant correlation to T25W, which isnot shown by nBV nor LL. The significant correlation between J1 and T25W is particularlyinteresting because the lesion changes modelled by J1 occur in the pons, which is knownto be of clinical significance for mobility. Our model captures the largest variability inthe data set and inter-subject variability is a possible confound. However, the strongercorrelations compared to nBV and LL suggest that disease variability is more dominant.111Chapter 5 Manifold Learning by Deep LearningProgression of a “Mean” SPMS Brain along a Range of MSFC ScoresAnother benefit of our model is the ability to visualize the progression of a “mean”secondary progressive MS (SPMS) brain along a range of MSFC scores. To demonstrate,we trained four independent linear models to predict the distribution parameters J1, . . . , J4given the MSFC (Ji = ai + biMSFC). Figure 5.7 shows the axial (top row) and mid-sagittal(bottom row) slices of generated images representing the range of MSFC scores from 1.5to −4.5. Consistent with previous results, a decrease in MSFC visually correlates with anincrease in the size of the ventricles, an increase in periventricular lesions, and an increasein lesions in the pons region.5.6 SummaryWe have introduced a new method for modelling the variability in brain morphologyand lesion distribution of a large set of MRIs of AD and MS patients. We have proposedtwo models, both of which are based on DBNs. In our first approach, variability ofbrain morphology is modelled as a manifold of brain MRIs. In our second approach, wetrain three DBNs: one for morphology, one for lesion distribution, and one that jointlymodels both. The morphology model is learned from deformation fields, which allowsthe algorithm to focus on structural changes without the confound of contrast variationsbetween images. We have demonstrated that such a model, which requires no built-inpriors on image similarity, can automatically discover patterns of variability that can beparameterized in a low-dimensional space and are clinically relevant. In addition, ourmodel can generate sample images from model parameters for visualization.112Chapter 6Conclusions and Future WorkIn this thesis, we have addressed two challenges for measuring the state and progression ofmultiple sclerosis (MS). In Chapter 4, we presented a fully automatic segmentation methodof MS lesions, which enables the accurate computation of lesion-based imaging biomarkerssuch as lesion load and lesion count. In Chapter 5, we presented a novel method forthe automatic discovery of patterns of variability in MS subjects that show improvedcorrelations to clinical scores over traditional volumetric biomarkers. Both methods arebased on deep learning, a class of algorithms that can learn a feature hierarchy froma set of images. A major challenge of deep learning methods is the time and memoryrequired to process large 3D volumes. To address this challenge, we have developeda novel training algorithm for convolutional deep learning methods that facilitates thetraining of full resolution 3D volumes and presented our algorithm with a comprehensivecomparison with alternative training methods in Chapter 3. Improving the training speedis a rapidly advancing topic, but there are still very few publications that use images ofsuch high dimensionality, which suggests that computational speed is still a bottleneck ingeneral. Although the developed methods were developed and evaluated in the context of113Chapter 6 Conclusions and Future WorkMS, the methods are very general and can potentially be applied to various segmentationand manifold learning problems.6.1 ContributionsIn the course of developing deep learning-based methods for MS lesion segmentation andpattern discovery, we have made the following main contributions:1. We have developed a novel training algorithm for convolutional deep belief networksand convolutional neural networks that performs training in the frequency domain.The speed-ups gained by our method compared to state-of-the art spatial domainimplementations and the reduced memory requirements compared to other fre-quency domain methods enable the application of deep learning to high-resolution3D medical images.2. We have developed a neural network architecture that jointly learns features at differ-ent scales that are tuned to segmenting MS lesions and performs the segmentationbased on the automatically learned features. The joint learning of feature extractorand classifier facilitates the learning of features that are robust to the large variabilityof MS lesions and varying contrasts produced by different scanners.3. We have developed a novel objective function for training neural networks that issuitable for the classification of vastly unbalanced classes, such as the segmentationof MS lesions, which typically comprise less than one percent of the image.4. This is the first work to demonstrate that deep learning can be applied to manifoldlearning of brain MRIs.114Chapter 6 Conclusions and Future Work5. We have developed a framework for modelling changes in brain morphology andlesion distribution with only a few parameters, which also show improved correlationwith clinical scores compared to established volumetric imaging biomarkers.6.2 Future Work6.2.1 New Applications of Deep LearningMotivated by the success of deep learning for the two applications explored in this theses,a possible direction for further research is to investigate new applications of deep learningfor medical image analysis. The success of deep learning methods mostly stems fromtheir ability to learn feature hierarchies directly from the data. This allows deep learningmethods to adapt to the challenges presented by the data, such as imaging artifacts andcontrast variations, without explicitly having to model them. However, deep learningmethods can take a lot of time to train and require large amounts of training data inorder to learn a feature representation that generalizes well to new data samples. Toovercome those challenges, first applications of deep learning for medical image analysiswere limited to problems that can be cast as a patch-wise classification problem, suchas the segmentation of cell membranes (Cires¸an et al., 2012) and the detection of cellcarcinoma (Cruz-roa et al., 2013). In this fashion, every patch of an image is considereda training sample, which greatly increases the amount of labelled training cases. Inaddition, training on relatively small 2D patches made training feasible. Similarly, fullyconvolutional approach for image segmentation leverage every pixel of an image as atraining sample, while being more computationally efficient than patch-based approaches.However, the number of voxels within a layer decreases with the depth of the network,115Chapter 6 Conclusions and Future Workwith the result that deeper layers are trained with fewer training samples, which increasesthe risk of overfitting.A possible direction for future research is to investigate the application of the proposeddeep learning segmentation framework to other segmentation tasks. Preliminary resultson the segmentation of the corpus callosum and the grey matter of the spinal cord showgreat promise. Another potential application could be the detection of landmark points.Detecting landmarks is similar to segmentation in that every voxel of an image can beleveraged as a training sample. However, the number of positive samples per image istypically much smaller than for segmentation, which might require the development ofalternative training approaches.A different direction for future research is to investigate the potential of deep learningmethods to be incorporated into existing model learning frameworks. Originally, activeshape (Cootes et al., 1995) and active appearance (Cootes et al., 2001) models employprincipal component analysis in order to reduce the dimensionality of the input featurespace. Similarly, our proposed manifold learning framework can be used for dimensional-ity reduction, which might yield to the learning of more biologically plausible models,due to their ability to find highly nonlinear patterns of variability in the data.6.2.2 Improving the Performance on Small Data Sets using Data AugmentationA promising approach for improving the performance on small data sets is data aug-mentation. Data augmentation allows the artificial increase of the training data set sizeby generating new training samples from existing ones. Data augmentation has beensuccessfully used to deal with the problem of small medical data sets. For example, Ron-neberger et al. (2015) used automatically generated deformations to artificially increasethe data set size for segmenting neuronal structures in electron microscopic stacks, but also116Chapter 6 Conclusions and Future Worksimpler data augmentation techniques such as adding rotations, translations and contrastvariations can help to avoid overfitting. However, hand-engineered data augmentationapproaches depend on how well the variability in the data is understood. Alternatively,automatically learned unsupervised deep generative models such as deep belief networkscan potentially be used to learn the variability of the training data in order to generatenew samples that are biologically plausible.6.2.3 Rotation-invariant Neural NetworksAlthough neural networks are able to learn features that are invariant to the variabilityin the data, doing so often requires the learning of multiple variants of the same featurein order to capture the range of variability. In order to decrease the number of weightsof a neural network, and therefore to reduce the training time and risk of overfitting,different neural network architectures have been developed with build-in invariance orequivariance properties. For example, previously used sigmoid activation functions aresensitive to the intensity range of the input, which requires the relearning of featuredetectors for images with varying contrast. To overcome, rectified linear units have beenproposed, which are equivariant to intensity variations. In order to allow the learning oftranslation-invariant features, convolutional neural networks have been proposed, whichgreatly reduce the number of weights required to capture reoccurring patterns at differentlocations in the image compared to dense neural networks. However, features learned by aconvolutional neural network are not invariant to rotations, which leads to the relearningof the same class of features at different poses, and consequently increases the number offilters required to sample the pose space. This problem is especially pronounced in 3D.While sampling one feature under 2D rotations in 30 degree steps leads to 12 samples, itleads to 1728 samples for full 3D rotations, as three angles are required to determine a117Chapter 6 Conclusions and Future Work3D pose. In order to reduce the number of filters required to capture structures in 3D, itmight therefore be necessary to develop neural network architectures that are inherentlyinvariant or equivariant to rotations, which would allow for the learning of a diverse setof features that capture fine details in 3D with much fewer filters than currently employedconvolutional neural networks.6.2.4 Increasing the Expressive Power of Single NeuronsCurrent neural networks can only calculate the weighted sum of its inputs followed bythe application of a nonlinear function. Although it has been shown that this model isable to approximate arbitrary functions (Cybenko, 1989), doing so might require a largenumber of units and layers, which makes them more difficult to train and requires moretraining data in order to reduce the risk of overfitting. An alternative for increasing theexpressive power of neural networks without increasing their complexity is to designmore computationally powerful units. In their review of the computational properties ofdendrites, London and Häusser (2005) noted that some neurons were found that are ableto calculate the multiplication of its inputs. For example, the looming sensitive neurons inthe locust’s visual system are able to detect approaching objects on a collision course bycalculating the multiplication of angular size and speed of approaching object (Londonand Häusser, 2005). The multiplication is possibly encoded as the sum of logarithmicinputs followed by exponentiation. Similarly, Godfrey and Gashler (2016) have proposeda type of neuron for artificial neural networks that is able to learn to calculate the logarithmor exponential of its inputs, which allows for the learning of networks that can calculatearbitrary polynomials with significantly fewer parameters than traditional rectified linearor sigmoid neural networks. Alternatively, Livni et al. (2013) have proposed a networkarchitecture that can learn arbitrary polynomials using a combination of units that calculate118Chapter 6 Conclusions and Future Workweighted sums of its inputs and units that calculate the product of two input numbers.However, these network architectures needs to be investigated further in order to show ifthe theoretical advantages translate into improvements on practical problems.119BibliographyAljabar, Paul, R. Wolz, L. Srinivasan, S. J. Counsell, M. A. Rutherford, A. D.Edwards, J. V. Hajnal and D. Rueckert (2011). A combined manifold learning analysisof shape and appearance to characterize neonatal brain development.. IEEE Transactions onMedical Imaging, 30(12):2072–2086. (cited on pages 89 and 90)Andersson, Jesper L. R., M. Jenkinson and S. Smith (2007). Non-linear registration,aka Spatial normalisation. Technical Report TR07JA2, FMRIB Centre, Oxford, UnitedKingdom. (cited on page 104)Balasubramanian, Mukund and E. L. Schwartz (2002). The Isomap algorithm and topologi-cal stability. Science, 295(5552):7–7. (cited on page 91)Bastien, Frédéric, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron,N. Bouchard, D. Warde-Farley and Y. Bengio (2012). Theano: new features and speedimprovements. arXiv preprint arXiv:1211.5590, pp. 1–10. (cited on pages 45 and 64)Belkin, Mikhail and P. Niyogi (2003). Laplacian eigenmaps for dimensionality reduction anddata representation. Neural Computation, 15(6):1373–1396. (cited on page 89)Bhatia, Kanwal K, A. Rao, A. N. Price, R. Wolz, J. Hajnal and D. Rueckert (2012).Hierarchical manifold learning. In N. Ayache et al. (Eds.): MICCAI 2012, Part I, LNCS, vol.7510, pp. 512–519. Springer. (cited on pages 90 and 91)Breiman, Leo (2001). Random forests. Machine learning, 45(1):5–32. (cited on pages 12 and 53)Brosch, Tom and R. Tam (2013). Manifold learning of brain MRIs by deep learning. In K. Moriet al. (Eds.): MICCAI 2013, Part II, LNCS, vol. 8150, pp. 633–640. Springer. (cited on page 6)Brosch, Tom and R. Tam (2015). Efficient training of convolutional deep belief networks in thefrequency domain for application to high-resolution 2D and 3D images.. Neural Computation,27(1):211–227. (cited on pages 64 and 93)Brosch, Tom, Y. Yoo, L. Y. Tang, D. K. Li, A. Traboulsee and R. Tam (2015). Deepconvolutional encoder networks for multiple sclerosis lesion segmentation. In A. Frangi et al.(Eds.): MICCAI 2015, Part III, LNCS, vol. 9351, pp. 3–11. Springer. (cited on pages 54, 55,and 61)120BibliographyCarreira-Perpiñán, Miguel Á and R. S. Zemel (2005). Proximity graphs for clustering andmanifold learning. In Advances in Neural Information Processing Systems, pp. 225–232. (citedon page 91)Cayton, Lawrence (2005). Algorithms for manifold learning. Technical Report CS2008-0923,University of California, San Diego. (cited on page 89)Chetlur, Sharan, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaroand E. Shelhamer (2014). cuDNN: Efficient primitives for deep learning. arXiv preprintarXiv:1410.0759. (cited on pages 29, 45, and 64)Cho, KyungHyun, A. Ilin and T. Raiko (2011). Improved learning of Gaussian-Bernoullirestricted Boltzmann machines. In Artificial Neural Networks and Machine Learning–ICANN2011, pp. 10–17. Springer. (cited on page 25)Cires¸an, Dan C., A. Giusti and J. Schmidhuber (2012). Deep neural networks segmentneuronal membranes in electron microscopy images. In Advances in Neural InformationProcessing Systems, pp. 1–9. (cited on pages 11, 13, 53, and 115)Collobert, Ronan, K. Kavukcuoglu and C. Farabet (2011a). Torch7: A matlab-likeenvironment for machine learning. In BigLearn, NIPS Workshop, no. EPFL-CONF-192376,pp. 1–6. (cited on pages 45 and 64)Collobert, Ronan, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa(2011b). Natural language processing (almost) from scratch. The Journal of Machine LearningResearch, 12:2493–2537. (cited on page 6)Cootes, Timothy F., G. Edwards and C. Taylor (2001). Active appearance models. IEEETransactions on Pattern Analysis and Machine Intelligence, 23(6):681–685. (cited onpage 116)Cootes, Timothy F., C. J. Taylor, D. Cooper and J. Graham (1995). Active shape models—their training and application. Computer Vision and Image Understanding, 61(1):38–59.(cited on page 116)Cortes, Corinna and V. Vapnik (1995). Support-vector networks. Machine Learning,20(3):273–297. (cited on page 12)Coupé, Pierrick, J. V. Manjón, V. Fonov, J. Pruessner, M. Robles and D. L. Collins(2011). Patch-based segmentation using expert priors: application to hippocampus and ventriclesegmentation.. NeuroImage, 54(2):940–954. (cited on page 53)Cruz-roa, Angel Alfonso, J. Edison, A. Ovalle, A. Madabhushi and F. A. Gonz (2013).A deep learning architecture for image representation, visual interpretability and automated121Bibliographybasal-cell carcinoma cancer detection. K. Mori et al. (Eds.): MICCAI 2013, Part II, LNCS,vol. 8150, pp. 403–410. (cited on page 115)Cybenko, George (1989). Approximation by superpositions of a sigmoidal function. Mathemat-ics of Control, Signals and Systems, 2(4):303–314. (cited on page 118)Dauphin, Yann N., H. de Vries, J. Chung and Y. Bengio (2015). RMSProp and equilibratedadaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390. (citedon pages 17 and 63)Deng, Jia, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei (2009). ImageNet: A large-scalehierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition,pp. 248–255. (cited on pages 11 and 41)Dice, Lee R (1945). Measures of the amount of ecologic association between species. Ecology,26(3):297–302. (cited on pages 59 and 69)Duchateau, Nicolas, M. De Craene, G. Piella and A. F. Frangi (2011). Characterizingpathological deviations from normality using constrained manifold-learning. C. Fichtinger etal. (Eds.): MICCAI 2011, Part III, LNCS, vol. 6893, pp. 256–63. (cited on page 90)Duchateau, Nicolas, M. De Craene, G. Piella and A. F. Frangi (2012). Constrainedmanifold learning for the characterization of pathological deviations from normality. MedicalImage Analysis, 16(8):1532–1549. (cited on page 90)Duchi, John, E. Hazan and Y. Singer (2011). Adaptive subgradient methods for online learningand stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.(cited on pages 17 and 63)Etyngier, Patrick, F. Ségonne and R. Keriven (2007). Shape priors using manifold learningtechniques. In 11th International Conference on Computer Vision, pp. 1–8. IEEE. (cited onpage 90)Farley, B. and W. Clark (1954). Simulation of self-organizing systems by digital computer.Transactions of the IRE Professional Group on Information Theory, 4(4):76–84. (cited onpages 8 and 12)Fischer, J S, R. A. Rudick, G. R. Cutter and S. C. Reingold (1999). The multiple sclerosisfunctional composite measure (MSFC): an integrated approach to MS clinical outcome assessment.Multiple Sclerosis, 5(4):244–250. (cited on page 110)Fonov, Vladimir, A. C. Evans, K. Botteron, C. R. Almli, R. C. McKinstry and D. L.Collins (2011). Unbiased average age-appropriate atlases for pediatric studies.. NeuroImage,54(1):313–327. (cited on page 103)122BibliographyFreund, Yoav and D. Haussler (1992). Unsupervised learning of distributions on binaryvectors using two layer networks. In Advances in Neural Information Processing Systems, pp.912–919. (cited on pages 8, 12, 17, 28, and 92)Fukushima, Kunihiko (1980). Neocognitron: A self-organizing neural network model for amechanism of pattern recognition unaffected by shift in position. Biological Cybernetics,36(4):193–202. (cited on pages 8, 11, 12, and 29)García-Lorenzo, Daniel, S. Francis, S. Narayanan, D. L. Arnold and D. L. Collins(2013). Review of automatic segmentation methods of multiple sclerosis white matter lesionson conventional magnetic resonance imaging. Medical Image Analysis, 17(1):1–18. (cited onpages 51, 69, 70, and 71)Gerber, Samuel, T. Tasdizen, P. T. Fletcher, S. Joshi and R. Whitaker (2010). Manifoldmodeling for brain population analysis. Medical Image Analysis, 14(5):643–653. (cited onpages 90 and 92)Gerber, Samuel, T. Tasdizen, S. Joshi and R. Whitaker (2009). On the manifold structureof the space of brain images. In G.-Z. Yang et al. (Eds.): MICCAI 2009, Part I, LNCS, vol. 5761,pp. 305–312. Springer. (cited on page 90)Geremia, Ezequiel, O. Clatz, B. H. Menze, E. Konukoglu, A. Criminisi and N. Ayache(2011). Spatial decision forests for MS lesion segmentation in multi-channel magnetic resonanceimages. NeuroImage, 57(2):378–390. (cited on pages 53 and 76)Geremia, Ezequiel, B. H. Menze, O. Clatz, E. Konukoglu, A. Criminisi and N. Ayache(2010). Spatial decision forests for MS lesion segmentation in multi-channel MR images. InT. Jian et al. (Eds.): MICCAI 2010, Part I. LNCS, vol. 6362, pp. 111–118. Springer. (cited onpages 53, 76, and 77)Godfrey, Luke B. and M. S. Gashler (2016). A continuum among logarithmic, linear, andexponential functions, and its potential to improve generalization in neural networks. arXivpreprint arXiv:1602.01321, pp. 1–6. (cited on page 118)Grigorescu, Simona E, N. Petkov and P. Kruizinga (2002). Comparison of texture featuresbased on gabor filters. IEEE Transactions on Image Processing, 11(10):1160–1167. (cited onpage 5)Guerrero, Ricardo, R. Wolz, A. Rao and D. Rueckert (2014). Manifold populationmodeling as a neuro-imaging biomarker: Application to ADNI and ADNI-GO. NeuroImage,94:275–286. (cited on pages 90 and 91)123BibliographyGuizard, Nicolas, P. Coupé, V. S. Fonov, J. V. Manjón, D. L. Arnold and D. L. Collins(2015). Rotation-invariant multi-contrast non-local means for MS lesion segmentation. Neu-roImage: Clinical, 8:376–389. (cited on pages 53, 68, and 77)Hamm, J., D. Ye and R. Verma (2010). GRAM: A framework for geodesic registration onanatomical manifolds. Medical Image Analysis, 14(5):633–642. (cited on page 90)Hinton, Geoffrey, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior,V. Vanhoucke, P. Nguyen, T. N. Sainath et al. (2012). Deep neural networks for acousticmodeling in speech recognition: The shared views of four research groups. IEEE SignalProcessing Magazine, 29(6):82–97. (cited on page 6)Hinton, Geoffrey E. (2002). Training products of experts by minimizing contrastive divergence..Neural Computation, 14(8):1771–800. (cited on page 21)Hinton, Geoffrey E. (2010). A practical guide to training restricted Boltzmann machines.Technical Report UTML TR 2010-003, Department of Computer Science, University ofToronto. (cited on pages 8, 12, 17, 21, 28, 92, and 93)Hinton, Geoffrey E., S. Osindero and Y.-W. Teh (2006). A fast learning algorithm for deepbelief nets. Neural Computation, 18(7):1527–1554. (cited on pages 6, 8, 11, 12, 17, 28, 63, 92, 95,101, and 104)Hinton, Geoffrey E. and R. Salakhutdinov (2006). Reducing the dimensionality of datawith neural networks. Science, 313(5786):504–507. (cited on pages 11 and 62)Hinton, Geoffrey E. and N. Srivastava (2012). Improving neural networks by preventingco-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, pp. 1–18. (cited on page 39)Hochreiter, Sepp (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma,Technische Universität München. (cited on page 63)Hubel, David H. and T. N. Wiesel (1962). Receptive fields, binocular interaction and functionalarchitecture in the cat’s visual cortex. The Journal of Physiology, 160(1):106. (cited on page 15)Hubel, David H. and T. N. Wiesel (1968). Receptive fields and functional architecture ofmonkey striate cortex. The Journal of Physiology, 195(1):215–243. (cited on page 15)Jenkinson, Mark, P. Bannister, M. Brady and S. Smith (2002). Improved optimization forthe robust and accurate linear registration and motion correction of brain images. NeuroImage,17(2):825–841. (cited on page 103)Jenkinson, Mark, M. Pechaud and S. Smith (2005). BET2: MR-based estimation of brain,skull and scalp surfaces. In 11th Annual Meeting of the Organization for Human Brain Mapping,vol. 17. Toronto, ON. (cited on page 67)124BibliographyJerman, Tim, A. Galimzianova, F. Pernus, B. Lik and Z. Spiclin (2015). Combiningunsupervised and supervised methods for lesion segmentation. In Proceedings the MICCAI 2015Brain Lesions Workshop, pp. 1–12. (cited on page 77)Jesson, Andrew and T. Arbel (2015). Hierarchical MRF and random forest segmentation ofMS lesions and healthy tissues in brain MRI. In Proceedings of the 2015 Longitudinal MultipleSclerosis Lesion Segmentation Challenge, pp. 1–2. (cited on pages 66, 77, and 78)Jia, Yangqing, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadar-rama and T. Darrell (2014). Caffe: Convolutional architecture for fast feature embedding.arXiv preprint arXiv:1408.5093. (cited on pages 45 and 64)Kamnitsas, Konstantinos, L. Chen, C. Ledig, D. Rueckert and B. Glocker (2015).Multi-scale 3D convolutional neural networks for lesion segmentation in brain MRI. IschemicStroke Lesion Segmentation, p. 13. (cited on page 11)Kang, Kai and X. Wang (2014). Fully convolutional neural networks for crowd segmentation.arXiv preprint arXiv:1411.4464. (cited on page 54)Kappos, L, A. Traboulsee, C. Constantinescu, J.-P. Erälinna, F. Forrestal, P. Jongen,J. Pollard, M. Sandberg-Wollheim, C. Sindic, B. Stubinski et al. (2006). Long-termsubcutaneous interferon beta-1a therapy in patients with relapsing-remitting MS. Neurology,67(6):944–953. (cited on page 67)Karni, A. and D. Sagi (1991). Where practice makes perfect in texture discrimination: evidencefor primary visual cortex plasticity. Proceedings of the National Academy of Sciences ofthe United States of America, 88(11):4966–4970. (cited on page 5)Katsavos, Serafeim and M. Anagnostouli (2013). Biomarkers in multiple sclerosis: Anup-to-date overview. Multiple Sclerosis International, 2013(II):20. (cited on page 3)Kingma, Diederik and J. Ba (2014). Adam: A method for stochastic optimization. arXivpreprint arXiv:1412.6980. (cited on pages 17 and 63)Krizhevsky, Alex (2010). Convolutional deep belief networks on CIFAR-10. Unpublishedmanuscript, pp. 1–9. (cited on page 95)Krizhevsky, Alex (2012). High-performance C++/CUDA implementation of convolutionalneural networks. http://code.google.com/p/cuda-convnet/. (cited on page 39)Krizhevsky, Alex, I. Sutskever and G. Hinton (2012). ImageNet classification with deepconvolutional neural networks. In Advances in Neural Information Processing Systems, pp.1–9. (cited on pages 6, 11, 29, 38, and 58)125BibliographyLeCun, Yann, Y. Bengio and G. Hinton (2015). Deep learning. Nature, 521(7553):436–444.(cited on pages 4, 11, and 28)LeCun, Yann, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard andL. D. Jackel (1989). Backpropagation applied to handwritten zip code recognition. NeuralComputation, 1(4):541–551. (cited on pages 8, 12, 16, and 29)LeCun, Yann, L. Bottou, Y. Bengio and P. Haffner (1998). Gradient-based learning appliedto document recognition. Proceedings of the IEEE, 86(11):2278–2324. (cited on pages 8, 12, 16,17, 29, 54, 57, 59, and 63)Lee, Honglak, R. Grosse, R. Ranganath and A. Y. Ng (2009). Convolutional deep beliefnetworks for scalable unsupervised learning of hierarchical representations. In Proceedings ofthe 26th Annual International Conference on Machine Learning, pp. 609–616. ACM. (cited onpages 22, 29, 57, 93, and 104)Lee, Honglak, R. Grosse, R. Ranganath and A. Y. Ng (2011). Unsupervised learning ofhierarchical representations with convolutional deep belief networks. Communications of theACM, 54(10):95–103. (cited on pages 22 and 29)Li, Hongsheng, R. Zhao and X. Wang (2014). Highly efficient forward and backwardpropagation of convolutional neural networks for pixelwise classification. arXiv preprintarXiv:1412.4526, pp. 1–10. (cited on page 54)Livni, Roi, S. Shalev-Shwartz and O. Shamir (2013). An algorithm for training polynomialnetworks. arXiv preprint arXiv:1304.7045, pp. 1–22. (cited on page 118)London, Michael and M. Häusser (2005). Dendritic computation. Annual Review ofNeuroscience, 28:503–32. (cited on page 118)Long, Jonathan, E. Shelhamer and T. Darrell (2015). Fully convolutional networks forsemantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and PatternRecognition, pp. 3431–3440. (cited on page 54)Lowe, D.G. (1999). Object recognition from local scale-invariant features. Proceedings of theSeventh IEEE International Conference on Computer Vision, 2:1150–1157. (cited on page 5)Maier, Oskar and H. Handels (2015). MS lesion segmentation in MRI with random forests.In Proceedings of the 2015 Longitudinal Multiple Sclerosis Lesion Segmentation Challenge, pp.1–2. (cited on pages 66 and 78)Marcus, Daniel S, T. H. Wang, J. Parker, J. G. Csernansky, J. C. Morris and R. L.Buckner (2007). Open Access Series of Imaging Studies (OASIS): cross-sectional MRI datain young, middle aged, nondemented, and demented older adults.. Journal of CognitiveNeuroscience, 19(9):1498–507. (cited on page 41)126BibliographyMathieu, Michael, M. Henaff and Y. LeCun (2014). Fast training of convolutional networksthrough FFTs. In 2nd International Conference on Learning Representations, pp. 1–9. (cited onpages 29, 30, 33, and 38)Nair, Vinod and G. E. Hinton (2010). Rectified linear units improve restricted Boltzmannmachines. In Proceedings of the 27th Annual International Conference on Machine Learning,pp. 807–814. (cited on pages 26, 31, 58, and 94)Ngiam, Jiquan, A. Khosla, M. Kim, J. Nam, H. Lee and A. Y. Ng (2011). Multimodaldeep learning. In 28th International Conference on Machine Learning, pp. 689–696. (cited onpage 102)Petersen, R.C., P. Aisen, L. Beckett, M. Donohue, A. Gamst et al. (2010). Alzheimer’sdisease neuroimaging initiative (ADNI). Neurology, 74(3):201–209. (cited on page 96)Polyak, Boris T. and A. B. Juditsky (1992). Acceleration of stochastic approximation byaveraging. SIAM Journal on Control and Optimization, 30(4):838–855. (cited on page 14)Raina, Rajat, A. Madhavan and A. Y. Ng (2009). Large-scale deep unsupervised learningusing graphics processors. In Proceedings of the 26th Annual International Conference onMachine Learning, pp. 873–880. ACM. (cited on pages 11 and 28)Ronneberger, Olaf, P. Fischer and T. Brox (2015). U-Net: Convolutional networks forbiomedical image segmentation. In N. Navab et al. (Eds.): MICCAI 2015, Part III, LNCS, vol.9351, pp. 234–241. Springer. (cited on pages 54, 55, and 116)Roura, Eloy, A. Oliver, M. Cabezas, S. Valverde, D. Pareto, J. C. Vilanova, L. Ramió-Torrentà, À. Rovira and X. Lladó (2015). A toolbox for multiple sclerosis lesion segmenta-tion. Neuroradiology, 57(10):1031–1043. (cited on pages 52, 68, 69, 77, and 81)Rumelhart, David E., G. E. Hinton and R. J. Williams (1986). Learning representations byback-propagating errors. Nature, 323:533–536. (cited on pages 8, 12, and 14)Sainath, Tara N., A.-r. Mohamed, B. Kingsbury and B. Ramabhadran (2013). Deepconvolutional neural networks for LVCSR. In 2013 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP), pp. 8614–8618. IEEE. (cited on page 11)Saul, LK and S. Roweis (2003). Think globally, fit locally: unsupervised learning of lowdimensional manifolds. The Journal of Machine Learning Research, 4:119–155. (cited onpage 89)Scherer, Dominik, A. Müller and S. Behnke (2010). Evaluation of pooling operations inconvolutional architectures for object recognition. In International Conference on ArtificialNeural Networks, pp. 92–101. Springer. (cited on pages 16, 39, and 58)127BibliographySchmidt, Paul, C. Gaser, M. Arsic, D. Buck, A. Förschler, A. Berthele, M. Hoshi,R. Ilg, V. J. Schmid, C. Zimmer et al. (2012). An automated tool for detection of FLAIR-hyperintense white-matter lesions in multiple sclerosis. NeuroImage, 59(4):3774–3783. (citedon pages 52, 68, and 81)Shiee, Navid, P.-L. Bazin, A. Ozturk, D. S. Reich, P. A. Calabresi and D. L. Pham (2010).A topology-preserving approach to the segmentation of brain images with multiple sclerosislesions. NeuroImage, 49(2):1524–1535. (cited on pages 52, 68, 77, and 81)Smith, Stephen M., Y. Zhang, M. Jenkinson, J. Chen, P. Matthews, A. Federico andN. De Stefano (2002). Accurate, robust, and automated longitudinal and cross-sectional brainchange analysis. NeuroImage, 17(1):479–489. (cited on page 111)Souplet, Jean-Christophe, C. Lebrun, N. Ayache and G. Malandain (2008). Anautomatic segmentation of T2-FLAIR multiple sclerosis lesions. In MIDAS Journal - MICCAI2008 Workshop. (cited on pages 75 and 77)Styner, Martin, J. Lee, B. Chin, M. Chin, O. Commowick, H. Tran, S. Markovic-Plese,V. Jewells and S. Warfield (2008). 3D segmentation in the clinic: A grand challenge II: MSlesion segmentation. MIDAS Journal - MICCAI 2008 Workshop, pp. 1–6. (cited on pages 65and 74)Subbanna, Nagesh, D. Precup, D. Arnold and T. Arbel (2015). IMaGe: Iterative multilevelprobabilistic graphical model for detection and segmentation of multiple sclerosis lesions in brainMRI. In Information Processing in Medical Imaging, pp. 514–526. Springer. (cited on pages 53and 68)Subbanna, NK, M. Shah, S. Francis, S. Narayanan, D. Collins, D. Arnold andT. Arbel (2009). MS lesion segmentation using Markov Random Fields. In Proceedings theMICCAI 2009 Workshop on Medical Image Analysis on Multiple Sclerosis, pp. 1–12. (cited onpage 53)Sudre, Carole, M. J. Cardoso, W. Bouvy, G. Biessels, J. Barnes and S. Ourselin (2015).Bayesian model selection for pathological neuroimaging data applied to white matter lesionsegmentation. IEEE Transactions on Medical Imaging, 34(10):2079–2102. (cited on pages 52and 68)Sutskever, Ilya, J. Martens, G. Dahl and G. Hinton (2013). On the importance of initial-ization and momentum in deep learning. In Proceedings of the 30th International Conference onMachine Learning (ICML-13), pp. 1139–1147. (cited on page 62)Sutskever, Ilya, O. Vinyals and Q. V. Le (2014). Sequence to sequence learning with neuralnetworks. In Advances in Neural Information Processing Systems, pp. 3104–3112. (cited onpage 6)128BibliographyTenenbaum, J B, V. de Silva and J. C. Langford (2000). A global geometric framework fornonlinear dimensionality reduction.. Science (New York, N.Y.), 290(5500):2319–2323. (citedon page 90)Tieleman, Tijmen (2008). Training restricted Boltzmann machines using approximations to thelikelihood gradient. In Proceedings of the 25th International Conference on Machine learning,pp. 1064–1071. ACM. (cited on page 21)Tomas-Fernandez, Xavier and S. K. Warfield (2015). A model of population and subject(MOPS) intensities with application to multiple sclerosis lesion segmentation. IEEE Transac-tions on Medical Imaging, 34(6):1349–1361. (cited on pages 52 and 77)Traboulsee, A, A. Al-Sabbagh, R. Bennett, P. Chang, D. Li et al. (2008). Reduction inmagnetic resonance imaging T2 burden of disease in patients with relapsing-remitting multiplesclerosis: analysis of 48-week data from the EVIDENCE (EVidence of Interferon Dose-response:European North American Comparative Efficacy) study. BMC neurology, 8(1):11. (cited onpage 67)Van Leemput, Koen, F. Maes, D. Vandermeulen, A. Colchester and P. Suetens (2001).Automated segmentation of multiple sclerosis lesions by model outlier detection. IEEE Transac-tions on Medical Imaging, 20(8):677–688. (cited on pages 52, 68, and 81)Wachinger, C, K. Batmanghelich, P. Golland and M. Reuter (2014a). BrainPrint inthe computer-aided diagnosis of Alzheimer’s disease. In Proceedings of the MICCAI WorkshopChallenge on Computer-Aided Diagnosis of Dementia Based on Structural MRI Data, pp.129–138. (cited on page 91)Wachinger, Christian, P. Golland, W. Kremen, B. Fischl, M. Reuter, A. D. N. Ini-tiative et al. (2015). BrainPrint: A discriminative characterization of brain morphology.NeuroImage, 109:232–248. (cited on page 90)Wachinger, Christian, P. Golland and M. Reuter (2014b). BrainPrint: identifying subjectsby their brain. In P. Golland et al. (Eds.): MICCAI 2014, Part III, LNCS, vol. 8675, pp. 41–48.Springer. (cited on page 91)Weiss, Nick, D. Rueckert and A. Rao (2013). Multiple sclerosis lesion segmentation usingdictionary learning and sparse coding. In K. Mori et al. (Eds.): MICCAI 2013, Part I, LNCS,vol. 8149, pp. 735–742. Springer. (cited on pages 52, 76, and 77)Werbos, Paul (1974). Beyond regression: New tools for prediction and analysis in the behavioralsciences. PhD Thesis, Harvard University. (cited on pages 8, 12, and 15)Wiesel, Torsten Niels and D. H. Hubel (1963). Single-cell responses in striate cortex ofkittens deprived of vision in one eye. Journal of Neurophysiology, 26(1003):17. (cited on page 5)129BibliographyWolz, Robin, P. Aljabar, J. V. Hajnal, A. Hammers and D. Rueckert (2010a). LEAP:learning embeddings for atlas propagation.. NeuroImage, 49(2):1316–1325. (cited on page 90)Wolz, Robin, P. Aljabar, J. V. Hajnal, J. Lotjonen and D. Rueckert (2011). Manifoldlearning combining imaging with non-imaging information. In International Symposium onBiomedical Imaging, pp. 1637–1640. IEEE. (cited on page 90)Wolz, Robin, P. Aljabar, J. V. Hajnal, J. Lötjönen and D. Rueckert (2012). Nonlineardimensionality reduction combining MR imaging with non-imaging information.. MedicalImage Analysis, 16(4):819–830. (cited on pages 89 and 90)Wolz, Robin, P. Aljabar, J. V. Hajnal and D. Rueckert (2010b). Manifold learning forbiomarker discovery in MR imaging. In F. Wang et al. (Eds.): MLMI 2010, LNCS, vol. 6357,pp. 116–123. Springer. (cited on page 89)Yoo, Youngjin, T. Brosch, A. Traboulsee, D. K. Li and R. Tam (2014). Deep learning ofimage features from unlabeled data for multiple sclerosis lesion segmentation. In G. Wu et al.(Eds.): MLMI 2014, LNCS, vol. 8679, pp. 117–124. Springer. (cited on page 53)Zeiler, Matthew D. (2012). ADADELTA: An adaptive learning rate method. arXiv preprintarXiv:1212.5701. (cited on pages 17 and 63)Zeiler, Matthew D. and R. Fergus (2013). Stochastic pooling for regularization of deepconvolutional neural networks. In 1st International Conference on Learning Representations,pp. 1–9. (cited on page 39)Zeiler, Matthew D., G. W. Taylor and R. Fergus (2011). Adaptive deconvolutional networksfor mid and high level feature learning. In 2011 IEEE International Conference on ComputerVision, pp. 2018–2025. Ieee. (cited on pages 54, 55, 57, and 58)Zhang, Qilong, R. Souvenir and R. Pless (2006). On manifold structure of cardiac MRIdata: Application to segmentation. In 2006 IEEE Conference on Computer Vision and PatternRecognition,, vol. 1, pp. 1092–1098. IEEE. (cited on page 90)Zhang, Zhenyue, J. Wang and H. Zha (2012). Adaptive manifold learning. IEEE Transactionson Pattern Analysis and Machine Intelligence, 34(2):253–265. (cited on page 92)130
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Efficient deep learning of 3D structural brain MRIs...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Efficient deep learning of 3D structural brain MRIs for manifold learning and lesion segmentation with… Brosch, Tom 2016
pdf
Page Metadata
Item Metadata
Title | Efficient deep learning of 3D structural brain MRIs for manifold learning and lesion segmentation with application to multiple sclerosis |
Creator |
Brosch, Tom |
Publisher | University of British Columbia |
Date Issued | 2016 |
Description | Deep learning methods have shown great success in many research areas such as object recognition, speech recognition, and natural language understanding, due to their ability to automatically learn a hierarchical set of features that is tuned to a given domain and robust to large variability. This motivates the use of deep learning for neurological applications, because the large variability in brain morphology and varying contrasts produced by different MRI scanners makes the automatic analysis of brain images challenging. However, 3D brain images pose unique challenges due to their complex content and high dimensionality relative to the typical number of images available, making optimization of deep networks and evaluation of extracted features difficult. In order to facilitate the training on large 3D volumes, we have developed a novel training method for deep networks that is optimized for speed and memory. Our method performs training of convolutional deep belief networks and convolutional neural networks in the frequency domain, which replaces the time-consuming calculation of convolutions with element-wise multiplications, while adding only a small number of Fourier transforms. We demonstrate the potential of deep learning for neurological image analysis using two applications. One is the development of a fully automatic multiple sclerosis (MS) lesion segmentation method based on a new type of convolutional neural network that consists of two interconnected pathways for feature extraction and lesion prediction. This allows for the automatic learning of features at different scales that are optimized for accuracy for any given combination of image types and segmentation task. Our network also uses a novel objective function that works well for segmenting underrepresented classes, such as MS lesions. The other application is the development of a statistical model of brain images that can automatically discover patterns of variability in brain morphology and lesion distribution. We propose building such a model using a deep belief network, a layered network whose parameters can be learned from training images. Our results show that this model can automatically discover the classic patterns of MS pathology, as well as more subtle ones, and that the parameters computed have strong relationships to MS clinical scores. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2016-07-13 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0305854 |
URI | http://hdl.handle.net/2429/58427 |
Degree |
Doctor of Philosophy - PhD |
Program |
Biomedical Engineering |
Affiliation |
Applied Science, Faculty of |
Degree Grantor | University of British Columbia |
GraduationDate | 2016-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2016_september_brosch_tom.pdf [ 4.11MB ]
- Metadata
- JSON: 24-1.0305854.json
- JSON-LD: 24-1.0305854-ld.json
- RDF/XML (Pretty): 24-1.0305854-rdf.xml
- RDF/JSON: 24-1.0305854-rdf.json
- Turtle: 24-1.0305854-turtle.txt
- N-Triples: 24-1.0305854-rdf-ntriples.txt
- Original Record: 24-1.0305854-source.json
- Full Text
- 24-1.0305854-fulltext.txt
- Citation
- 24-1.0305854.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0305854/manifest