UBC Theses and Dissertations
Deep learning for feature discovery in brain MRIs for patient-level classification with applications to multiple sclerosis. Yoo, Youngjin. 2018.

Full Text

Deep Learning for Feature Discovery in Brain MRIs for Patient-Level Classification with Applications to Multiple Sclerosis

by Youngjin Yoo

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Faculty of Graduate and Postdoctoral Studies (Biomedical Engineering)

The University of British Columbia (Vancouver)

May 2018

© Youngjin Yoo, 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled "Deep learning for feature discovery in brain MRIs for patient-level classification with applications to multiple sclerosis", submitted by Youngjin Yoo in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Biomedical Engineering.

Examining Committee:

Roger Tam, Biomedical Engineering (Research Supervisor)
Rafeef Abugharbieh, Electrical and Computer Engineering (Research Co-supervisor)
Purang Abolmaesumi, Biomedical Engineering (Supervisory Committee Member)
Donna Lang, Neuroscience (University Examiner)
Alexandre Bouchard-Côté, Genome Science and Technology (University Examiner)

Additional Supervisory/Examining Committee Members:

Ghassan Hamarneh, Computer Science (Supervisory Committee Member)
Shuo Li, Medical Imaging and Biophysics (External Examiner)

Abstract

Network architectures and training strategies are crucial considerations in applying deep learning to neuroimaging data, but attaining optimal performance remains challenging, because the images involved are high-dimensional and the pathological patterns to be modeled are often subtle. Additional challenges include limited annotations, heterogeneous modalities, and the sparsity of certain image types. In this thesis, we developed detailed methodologies to overcome these challenges for automatic feature extraction from multimodal neuroimaging data to perform image-level classification and segmentation, with applications to multiple sclerosis (MS).

We developed our new methods in the context of four MS applications. The first was an unsupervised deep network for MS lesion segmentation, which was the first to use image features learned completely automatically from unlabeled data. The deep-learned features were then refined with a supervised classifier, using a much smaller set of annotated images. We assessed the impact of unsupervised learning by observing the segmentation performance as the amount of unlabeled data was varied. Secondly, we developed an unsupervised learning method for modeling joint features from quantitative and anatomical MRIs to detect early MS pathology, which was novel in its use of deep learning to integrate high-dimensional myelin and structural images. Thirdly, we developed a supervised model that extracts brain lesion features that can predict conversion to MS in patients with early isolated symptoms. To efficiently train a convolutional neural network on sparse lesion masks and to reduce the risk of overfitting, we proposed utilizing the Euclidean distance transform to increase information density, together with a combination of downsampling, unsupervised pretraining and regularization during training. The fourth method models multimodal features of brain lesion and diffusion patterns to distinguish between MS and neuromyelitis optica, a neurological disorder similar to MS, to support differential diagnosis. We present a novel hierarchical multimodal fusion architecture that can improve joint learning of heterogeneous imaging modalities.
Our results show that these models can discover subtle patterns of MS pathology and provide better classification and prediction performance than the imaging biomarkers previously used in clinical studies, even with relatively small sample sizes.

Lay Summary

It is often very difficult to predict how the symptoms of a person with multiple sclerosis (MS) will change over time, because MS affects each person differently. Magnetic resonance imaging (MRI) is already widely used to monitor tissue damage in MS patients, but the measurements commonly taken on MR images do not predict future symptoms well. In this thesis, we developed new computational methods, based on artificial intelligence, that automatically identify changes in brain MR images that signify how a patient may get worse, before she or he actually does. The resulting software can predict future symptoms from MR images it has not seen before. With it, doctors will be able to gain much more useful information from each patient's MRI and give more specialized treatment to each person, to hopefully reduce or delay their future MS symptoms.

Preface

This thesis is primarily based on two journal papers, three conference papers and one book chapter, resulting from collaboration between multiple researchers. In all publications, the contribution of the author was in developing, implementing, and evaluating the method. All co-authors contributed to the editing of the manuscripts.

The literature survey described in Chapter 3 has been published in:

• T. Brosch, Y. Yoo, L.Y.W. Tang and R. Tam. Deep learning of brain images and its application to multiple sclerosis. In G. Wu and M. Sabuncu (Eds.): Machine Learning and Medical Imaging, Chapter 3, Elsevier, 2016.

The contribution of the author was in writing the Deep Learning in Neuroimaging section of the book chapter. T. Brosch and R. Tam wrote the remaining sections of the chapter. All co-authors contributed to the editing of the manuscript.

The study described in Chapter 4 has been published in:

• Y. Yoo, T. Brosch, A. Traboulsee, D.K.B. Li and R. Tam. Deep learning of image features from unlabeled data for multiple sclerosis lesion segmentation. In Proceedings of Medical Image Computing and Computer Assisted Intervention (MICCAI) Workshop on Machine Learning in Medical Imaging (MLMI), pages 113–120, 2014.

The contribution of the author was in developing, implementing, and evaluating the method. A. Traboulsee and D.K.B. Li provided the data and clinical input. T. Brosch and R. Tam helped with their valuable suggestions in improving the methodology.

The study described in Chapter 5 has been published in:

• Y. Yoo, L.Y.W. Tang, T. Brosch, D.K.B. Li, S. Kolind, I. Vavasour, A. Rauscher, A. MacKay, A. Traboulsee and R. Tam. Deep learning of joint myelin and T1w MRI features in normal-appearing brain tissue to distinguish between multiple sclerosis patients and healthy controls. NeuroImage: Clinical, 17:169–178, 2018.

The contribution of the author was in developing, implementing, and evaluating the method. A. Traboulsee and D.K.B. Li provided the data and clinical input. S. Kolind, I. Vavasour, A. Rauscher and A. MacKay helped with their expertise in MRI physics. L.Y.W. Tang, T. Brosch and R. Tam helped with their valuable suggestions in improving the methodology.

The studies described in Chapter 6 have been published in:
• Y. Yoo, L.Y.W. Tang, T. Brosch, D.K.B. Li, L. Metz, A. Traboulsee and R. Tam. Deep learning of brain lesion patterns for predicting future disease activity in patients with early symptoms of multiple sclerosis. In Proceedings of Medical Image Computing and Computer Assisted Intervention (MICCAI) Workshop on Deep Learning in Medical Image Analysis (DLMIA), pages 86–94, 2016.

• Y. Yoo, L.Y.W. Tang, D.K.B. Li, L. Metz, S. Kolind, A. Traboulsee and R. Tam. Deep learning of brain lesion patterns and user-defined clinical and MRI features for predicting conversion to multiple sclerosis from clinically isolated syndrome. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, pages 1–10, 2017.

The contribution of the author was in developing, implementing, and evaluating the method. A. Traboulsee, D.K.B. Li and L. Metz provided the data and clinical input. S. Kolind helped with her expertise in MRI physics. L.Y.W. Tang, T. Brosch and R. Tam helped with their valuable suggestions in improving the methodology.

The study described in Chapter 7 has been published in:

• Y. Yoo, L.Y.W. Tang, S. Kim, H. Kim, L.E. Lee, D.K.B. Li, S. Kolind, A. Traboulsee and R. Tam. Hierarchical multimodal fusion of deep-learned lesion and tissue integrity features in brain MRIs for distinguishing neuromyelitis optica from multiple sclerosis. In Proceedings of Medical Image Computing and Computer Assisted Intervention (MICCAI) Part III, pages 480–488, 2017.

The contribution of the author was in developing, implementing, and evaluating the method. S. Kim, H. Kim, L.E. Lee, A. Traboulsee and D.K.B. Li provided the data and clinical input. S. Kolind helped with her expertise in MRI physics. L.Y.W. Tang and R. Tam helped with their valuable suggestions in improving the methodology.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Technical challenges
  1.4 Outline and contributions

2 Foundation in Deep Learning
  2.1 Deep feature learning by a supervised framework
    2.1.1 Dense neural networks
    2.1.2 Convolutional neural networks
  2.2 Deep feature learning by an unsupervised framework
    2.2.1 Boltzmann machines
    2.2.2 Restricted Boltzmann machines
    2.2.3 Building deep representations
  2.3 Recent advances

3 Literature Survey for Deep Learning in Neuroimaging
  3.1 Registration of neuroimaging data by deep learning
  3.2 Segmentation of neuroimaging data by deep learning
  3.3 Classification of neuroimaging data by deep learning

4 Feature Learning from Unlabeled Data for Multiple Sclerosis Lesion Segmentation
  4.1 Introduction
  4.2 Materials and Methods
    4.2.1 Algorithm overview
    4.2.2 Unsupervised feature learning using RBMs and deep learning
    4.2.3 Feature vector construction for supervised learning
    4.2.4 Random forest training and prediction
  4.3 Experiments and Results
  4.4 Conclusion

5 Detecting Multiple Sclerosis Pathology in Normal-Appearing Brain Tissue
  5.1 Introduction
  5.2 Material and methods
    5.2.1 Subjects
    5.2.2 MRI acquisition and preprocessing
    5.2.3 Cross-validation procedure
    5.2.4 Overview of the feature learning and classification pipeline
    5.2.5 Normal-appearing patch extraction
    5.2.6 Unsupervised deep learning of joint myelin-T1w features
    5.2.7 Image-level feature vector construction and random forest training
    5.2.8 Determining random forest and LASSO parameters
  5.3 Results
    5.3.1 Performance evaluation
    5.3.2 Separate analysis in NAWM and NAGM
  5.4 Discussion
  5.5 Conclusion

6 Predicting the Conversion Risk to Multiple Sclerosis from Clinically Isolated Syndrome
  6.1 Introduction
  6.2 Materials and Preprocessing
    6.2.1 Study participants
    6.2.2 MRI acquisition
    6.2.3 User-defined MRI and clinical measurements
  6.3 Methods
    6.3.1 The CNN architecture
    6.3.2 Euclidean distance transform of lesion masks
    6.3.3 CNN training
    6.3.4 Data augmentation and regularization
    6.3.5 Incorporating user-defined MRI and clinical measurements
  6.4 Experimental Results
    6.4.1 Impact of EDT and unsupervised pretraining on CNN extraction of lesion features
    6.4.2 Performance comparison to other prediction models
    6.4.3 Late fusion approach
    6.4.4 Applying feature selection to the user-defined features
    6.4.5 Replacing the output layer with a random forest classifier
  6.5 Discussions and Conclusion

7 Differentiating Neuromyelitis Optica from Multiple Sclerosis
  7.1 Introduction
  7.2 Materials and Preprocessing
  7.3 Methods
  7.4 Experimental Results
  7.5 Conclusion

8 Conclusions and Future Work
  8.1 Summary of thesis contributions
  8.2 Discussion
    8.2.1 Patient-level classification in neuroimaging
    8.2.2 Key aspects in developing successful deep learning methods for patient-level brain MRI classification
    8.2.3 The potential of deep learning to be incorporated into clinical practice
  8.3 Future work
    8.3.1 Unsupervised generative models for imaging biomarker discovery
    8.3.2 Domain adaptation
    8.3.3 Highly heterogeneous multimodal network
    8.3.4 Advanced deep network architectures and hyperparameter optimization
    8.3.5 Interpreting clinical relevance of deep-learned features
    8.3.6 Enhancing neuroimaging data quality by deep learning

Bibliography

List of Tables

Table 4.1: DSC results calculated on 50 T2w/PDw test pairs for evaluating MS lesion segmentation performance
Table 4.2: Average TPR/PPV/DSC results for indirect comparison with state-of-the-art lesion segmentation methods
Table 4.3: Relative discriminative power of the features used for MS lesion segmentation as determined by the random forest
Table 5.1: Performance comparison between 6 different feature types with and without LASSO for MS/NC classification on normal-appearing brain tissues
Table 5.2: Separate analysis results for MS/NC classification in NAWM and NAGM
Table 6.1: Baseline characteristics of the participants used for predicting short-term future disease activity
Table 6.2: The user-defined MRI and clinical measurements used for predicting short-term future disease activity
Table 6.3: Performance comparison between 18 different prediction models for predicting short-term clinical status conversion in patients with early MS symptoms
Table 6.4: Prediction performance of three different prediction models with selected user-defined features for predicting short-term clinical status conversion in patients with early MS symptoms
Table 7.1: Training methods and their hyperparameters used for training deep learning networks that distinguish NMO from MS
Table 7.2: Performance comparison between 7 classification models that distinguish NMO from MS

List of Figures

Figure 1.1: Epidemiologic study of the natural evolution of multiple sclerosis
Figure 1.2: A high-level schematic illustration of the developed deep learning methods
Figure 2.1: A graphical representation of a Boltzmann machine
Figure 2.2: A graphical representation of a restricted Boltzmann machine
Figure 2.3: Gibbs sampling procedure on a restricted Boltzmann machine
Figure 2.4: Greedy unsupervised layer-wise pretraining
Figure 4.1: A training algorithm for our MS lesion segmentation framework
Figure 4.2: Probabilistic lesion segmentation example
Figure 5.1: An example of a myelin map of a healthy control subject at several different slices
Figure 5.2: A schematic illustration of the proposed algorithm for detecting MS pathology in normal-appearing brain tissues
Figure 5.3: Voxel-wise t-test results displayed in the MNI152 template showing the most discriminative locations between RRMS patients and normal controls
Figure 5.4: Discriminative patches in the MNI152 template for distinguishing between RRMS and NC in normal-appearing brain tissue
Figure 5.5: The multimodal deep learning network architecture used to extract a joint myelin-T1w feature representation
Figure 5.6: Features at two RBM layers learned from myelin images and T1w images
Figure 5.7: Influence of the number of decision trees on the generalizability of MS/NC classification
Figure 5.8: Deep-learned features separately extracted from predominantly NAWM, NAGM and all normal-appearing patches by the T1w modality-specific network
Figure 5.9: The relative importance of the deep-learned joint myelin-T1w features in different sub-cortical brain areas
Figure 6.1: The proposed CNN architecture for predicting short-term future disease activity in patients with early symptoms of MS
Figure 6.2: Grid search results for optimizing the replication and scale factors for incorporating user-defined MRI and clinical measurements
Figure 6.3: The influence of the Euclidean distance transform on unsupervised pretraining for the first convolutional layer
Figure 6.4: The influence of the Euclidean distance transform and pretraining on supervised training
Figure 6.5: Visualizations showing the influence of the Euclidean distance transform and pretraining on the learned manifold space, reduced to two dimensions using t-SNE
Figure 6.6: The average relative importance of the 11 user-defined features for predicting short-term clinical status conversion
Figure 7.1: The network architectures for distinguishing NMOSD from MS on brain MRIs
Glossary

AD: Alzheimer's disease
AI: artificial intelligence
AUC: area under the receiver operating characteristic curve
BM: Boltzmann machine
BPF: brain parenchymal fraction
CD: contrastive divergence
CDMS: clinically definite multiple sclerosis
CIS: clinically isolated syndrome
CNN: convolutional neural network
CNS: central nervous system
CRBM: convolutional restricted Boltzmann machine
CSF: cerebrospinal fluid
CT: computed tomography
DAWM: diffusely abnormal white matter
DBM: deep Boltzmann machine
DBN: deep belief network
DNN: deep neural network
DSC: Dice similarity coefficient
DTI: diffusion tensor imaging
EDSS: Expanded Disability Status Scale
EDT: Euclidean distance transform
FA: fractional anisotropy
FLAIR: fluid-attenuated inversion recovery
fMRI: functional magnetic resonance imaging
GM: gray matter
HMF: hierarchical multimodal fusion
HOG: histogram of oriented gradients
ICA: independent component analysis
ISA: independent subspace analysis
LASSO: least absolute shrinkage and selection operator
MCI: mild cognitive impairment
MCMC: Markov chain Monte Carlo
MLP: multilayer perceptron
MLR: multivariable logistic regression
MR: magnetic resonance
MRF: Markov random field
MRI: magnetic resonance imaging
MS: multiple sclerosis
MTI: magnetization transfer imaging
MWF: myelin water fraction
MWI: myelin water imaging
NABT: normal-appearing brain tissue
NAGM: normal-appearing gray matter
NAWM: normal-appearing white matter
NMOSD: neuromyelitis optica spectrum disorder
PCD: persistent contrastive divergence
PDw: proton density-weighted
PET: positron emission tomography
PPV: positive predictive value
RBM: restricted Boltzmann machine
RRMS: relapsing-remitting multiple sclerosis
ROI: region of interest
SAE: stacked autoencoder
SGD: stochastic gradient descent
SVM: support vector machine
SWI: susceptibility-weighted imaging
TPR: true positive rate
T1w: T1-weighted
T2w: T2-weighted
WM: white matter

Acknowledgments

This thesis would not have been possible without the help and support of many people. Most of all, I would like to thank my supervisor Roger Tam for his invaluable guidance and support, for giving me the freedom to explore my own ideas, and for providing critical feedback on my work. I would also like to thank many fellows at the UBC MS/MRI Research Group for discussing ideas, answering questions, providing datasets and sharing feedback on my research: Lisa Y.W. Tang, Andrew Riddehough, David K.B. Li and Anthony Traboulsee. The wonderful support from Ken Bigelow, Vilmos Soti and Kevin Lam at the UBC MS/MRI Research Group has allowed me to complete my work toward the PhD. Without the MS/MRI Research Group, this thesis would not exist. Thanks! My thanks to Shannon Kolind, Irene Vavasour, Alex MacKay, Alex Rauscher, Luanne Metz, Lisa E. Lee, Ho Jin Kim, Suhyun Kim and the UBC MRI Research Centre for kindly providing their datasets and productive suggestions. I am very grateful to my supervisory committee members, Purang Abolmaesumi, Ghassan Hamarneh and Jane Z. Wang, for their encouragement and insightful comments. I would also like to thank my university exam committee members, Donna Lang, Alexandre Bouchard-Côté and Vesna Sossi, and the external examiner Shuo Li for their valuable criticism and constructive feedback.
Finally, Tom Brosch, Saurabh Garg, Chungfang Wang, Fahime Sheikhzadeh, Adam Poriski and Marco Law were my invaluable colleagues in our lab, who were always willing to help me and made my lab life very exciting.

Dedication

To my wonderful wife, children and parents.

Chapter 1
Introduction

1.1 Motivation

Multiple sclerosis (MS) is a chronic autoimmune demyelinating disorder of the central nervous system (CNS), in which the insulating covers of nerve cells in the brain and spinal cord are damaged. It is the principal cause of severe, non-traumatic physical disability among young adults in countries such as Canada, Norway and Sweden. A person with MS can have a wide range of demyelinating symptoms, including numbness, muscle weakness, blurred vision, dizziness, difficulties with coordination and balance, loss of bladder and bowel control, fatigue, and memory problems. Psychiatric and mental problems such as depression and unstable mood are also common. MS typically exhibits two main pathological processes: neuroinflammation, which can have relapse and recovery, and neurodegeneration, which causes permanent and irreversible damage over time [Koudriavtseva and Mainero, 2016]. There is currently no effective therapy to slow further progression in MS patients who are at later disease stages. There is therefore increasing evidence that accurately predicting the disease course of MS can improve long-term prognosis, because early intervention may mitigate clinical worsening by slowing or reducing clinically silent pathological processes. However, the individual disease course is highly variable, as shown in Figure 1.1: a new patient with early MS symptoms may take from 2 to 20 years to develop an acute handicap as measured by the Expanded Disability Status Scale (EDSS) [Barillot et al., 2016]. Why such discrepancy exists in the population, or why some symptoms evolve sooner in some patients and later in others, remains mostly unknown. Precision medicine [Collins and Varmus, 2015] to delay disease worsening cannot be achieved without objective and accurate criteria to validate new treatment options. Although more studies are needed, research efforts have been made in the field of magnetic resonance imaging (MRI) and body fluid molecular biomarkers in MS, and recent findings suggest that precision medicine is slowly but steadily becoming reality in MS [Comabella et al., 2016].

MRI plays an essential role in answering the questions above because it can be used for monitoring brain abnormalities such as white matter (WM) plaques and brain atrophy in vivo, which are the most visible signs of MS pathology seen on conventional MRI [Traboulsee et al., 2005].

Figure 1.1: Epidemiologic study of the natural evolution of multiple sclerosis in five subgroups of MS patients as a 2-stage disease course. At onset (time = 0), a new patient with early MS symptoms may take from 2 (red) to more than 20 (green) years to reach a clinical score highlighting acute handicap (EDSS = 3). It is mostly unknown why the disease course of MS is highly variable, especially at its early stages. Image courtesy: Barillot et al. [2016].
Conventional magnetic resonance (MR) images such as T1-weighted (T1w), T2-weighted (T2w) and proton density-weighted (PDw) images, which exploit three physical properties of tissue protons to generate signals, have been deeply integrated into clinical practice for examining the human brain and spinal cord, because they provide high tissue resolution and a wide variety of contrast types, which are useful properties for computer-assisted imaging to assess brain structure [Ashburner et al., 2003]. In addition to the conventional MRI techniques, a number of advanced sequences are the subject of intense research and have the potential for adoption into clinical practice. The advanced sequences offer quantitative information, such as physiological, functional and chemical information, which cannot be provided by the anatomical contrast types of conventional MR images. Current advanced methods include diffusion tensor imaging (DTI) [Le Bihan et al., 2001], MR spectroscopy [Soares and Law, 2009], and blood oxygen level-dependent imaging or functional magnetic resonance imaging (fMRI) [Song et al., 2006]. Myelin water imaging (MWI) [MacKay et al., 1994] is a particularly important quantitative sequence in MS, because it has been shown that even brain tissues that appear normal may exhibit decreased myelin content as revealed by MWI [Laule et al., 2004], giving it the potential to detect MS pathology much earlier.

MR images of the human brain represent very complex content. A typical 3D MRI volume consists of several million voxels, and most modern brain imaging protocols include multiple MR modalities. This creates a high-dimensionality problem when conducting clinical studies to investigate brain structure, function and pathology in a population context, which requires large databases of brain MR images. The high-dimensionality problem refers to the fact that increasing the dimensionality of the input data can make extracting representative and generalizable patterns more difficult, because the variability in the features can also become high-dimensional, making it challenging to model the data distribution. Therefore, to efficiently model high-dimensional data, a large amount of input data is generally required. However, in medical image analysis, the amount of data available is usually limited, and the limitation is often worse in brain imaging studies than in other medical imaging domains such as chest X-ray studies.

The most common approach to handling the high-dimensionality problem is to focus on predefined regions or structures, based on biological hypotheses. Although focusing on specific brain areas or measuring volumes (such as brain and lesion volume) simplifies the modeling of high-dimensional images, such measures can be coarse in the sense that they may not reflect all pathologically relevant pattern types involved in brain diseases. Brain imaging studies that depend strongly on biological hypotheses (for instance, that some specific predefined regions are pathologically sensitive) may miss the most significant findings altogether if the hypotheses are too narrow. Currently, there are only modest correlations between the pathology observable on structural MRI (e.g., the hallmark WM lesion volume, brain atrophy and cortical volume/thickness [Barkhof et al., 2009, Filippi and Agosta, 2010]) and the clinical symptoms experienced by MS patients, and inconsistent or even contradictory findings exist in the neuroimaging literature.
Furthermore, since the abnormal regions affected by the disease can span multiple predefined regions, it is very difficult to manually define changes precisely enough to identify disease-related pathological patterns effectively.

Model-based approaches with hand-crafted features extensively designed by expert users [Rueckert et al., 2016] can also be used to analyze high-dimensional data, with or without focusing on predefined regions or structures. The advantage of model-based approaches is that they reflect the user's knowledge, which acts as a regularizer favoring reasonable or probable solutions over implausible ones. Although there have been some successful advances in medical image analysis over the last decade [Rueckert et al., 2016], model-based approaches can still miss important pathological findings if the pathologies are too complex and subtle to be modeled manually under human assumptions.

Machine learning is a branch of computer algorithms that can learn complicated relationships or patterns from empirical data and make predictions. It provides an effective way to reduce dimensionality and identify hidden patterns in complicated data, and is consequently an attractive tool in many domains. Given a sufficiently powerful machine learning method, providing more data to the method is likely to lead to better performance. Traditional machine learning approaches, such as the support vector machine (SVM) [Cortes and Vapnik, 1995] and random forests [Breiman, 2001], require labeled data, such as a lesional voxel vs. a non-lesional voxel, or an MS brain image vs. a healthy brain image. Although providing more labeled data to machine learning methods is a technically feasible way to achieve better performance, this can be very expensive and time-consuming in medical image analysis, as experts' laborious work is needed to acquire annotated medical images. In addition, traditional machine learning methods still require a substantial effort from expert users to extensively design hand-crafted features to achieve good performance.

Recently, the research focus in medical image analysis has moved away from hand-crafted (and often explicitly designed) models toward data-driven, implicit models that are designed to discover imaging biomarkers (or features) in an automated way from large sets of medical imaging data, which can be used to generate hypotheses without prior knowledge. A number of researchers are working on deep learning algorithms [LeCun et al., 2015], a branch of machine learning methods based on artificial neural networks that can automatically learn feature representations (often from unlabeled data), thus avoiding time-consuming human engineering [Raina et al., 2007]. Deep learning has been shown to be a powerful feature extractor that is trainable in a hierarchical manner. It has revolutionized the field of machine learning and demonstrated impressive improvements over traditional algorithms in visual and sound recognition applications [LeCun et al., 2015]. Deep learning is also becoming a key player in analyzing complex medical images in many radiology research applications. The applications of deep learning in medical imaging [Wu et al., 2016, Litjens et al., 2017] include image segmentation (e.g., brain, spine), image registration, computer-aided detection and diagnosis, and brain function or activity analysis from fMRI.
This thesis provides a more comprehensive survey of neuroimaging applications of deep learning in Chapter 3.

1.2 Objectives

Motivated by the above, the main objective of the thesis is to develop supervised and unsupervised deep learning methods that can automatically extract brain MRI features to perform patient-level classification and prediction of future disease course and clinical outcome, and to improve our understanding of MS pathology. More specifically, we intend to determine whether the MRI features extracted by deep learning are more sensitive and specific for pathology detection, prediction of disease worsening and differential diagnosis in MS patients than the traditional MRI biomarkers that have been used in clinical studies (e.g., lesion load and brain volume), which capture only volumetric changes and may not reflect potentially important structural variations such as shape and spatial lesion dispersion. In doing so, we identify several technical challenges, described in Section 1.3. In consideration of these challenges, we develop detailed deep learning methodologies with particular technical design considerations, namely patch-based versus whole image-based model architectures, integration between supervised and unsupervised models, interaction between training data and chosen classifiers, training and regularization strategies, increasing the information density of sparse input data, and hierarchical multimodal fusion, to attain optimal performance in applying deep learning to high-dimensional structural and quantitative MRI data analysis. Our long-term clinical objective is to identify latent MRI patterns that may indicate a faster rate of worsening and that can distinguish MS from similar neurological diseases, which would enable more personalized treatment options for individual patients with such patterns, and to provide computer-aided diagnosis and prognosis tools that could be incorporated into routine clinical practice.

1.3 Technical challenges

Neurological pathology patterns that are clinically relevant are complex and subtle, and thus attaining optimal performance with deep learning in neuroimaging applications is challenging. The technical reason is that the choice of model architectures and training methods can contribute to overfitting and training instability in complicated ways, and deep learning architectures developed in the computer vision community, most of which are designed for classifying datasets with huge sample sizes, low dimensionality and great data variability (such as the ImageNet and CIFAR-10 datasets), do not work well for neuroimaging studies. In developing deep learning networks for analyzing high-dimensional brain MRI data, we identify and address several technical challenges in this thesis, which are unique compared to applying deep learning to the machine learning datasets commonly used in traditional computer vision. Firstly, neuroimaging data is high-dimensional, which greatly increases the risk of overfitting due to technical issues such as a lack of generalization, intrinsically caused by the existence of noise features that do not contribute to the reduction of classification error [Fan and Fan, 2008] or that can actually increase classification error on test datasets.
Neuroimaging datasets often contain small training samples and limited annotations, as these are expensive and time-consuming to acquire, which significantly increases the magnitude of the overfitting and training instability problems. When training data is high-dimensional and its size is limited, training instability can occur due to unstable gradient computation (e.g., Liu et al. [2017]), mostly caused by large variability between small training batches. These issues are more pronounced in deep learning because the number of model parameters of deep networks is much larger than in traditional machine learning techniques. Secondly, neuroimaging data can consist of sparse image types, such as brain lesion masks, that make training unstable and can be another source of overfitting. Thirdly, neuroimaging data often consists of heterogeneous imaging and clinical modalities that require alternative network architectures and training strategies to integrate effectively. Overall, the thesis addresses the technical challenges of dealing with relatively small amounts of annotated training samples, data sparsity such as in lesion masks, and integration of heterogeneous imaging data, with the goal of using deep learning to extract subtle but clinically relevant patterns in high-dimensional spaces that can exist in multiple brain regions and in multiple MR modalities.

1.4 Outline and contributions

The thesis first provides a technological foundation and reviews recent advances in deep learning, then surveys recent progress in adopting deep learning for neuroimaging data analysis, which is the main theme of the thesis. The thesis then presents a deep learning network that automatically extracts MRI features entirely from unlabeled data for MS lesion segmentation, a critical preprocessing phase for patient-level classification because the formation of focal lesions is the hallmark of MS pathology. The thesis then describes an unsupervised deep learning network that can model a joint feature representation from quantitative and anatomical MRIs, which can be used for classifying MS patients and healthy controls to detect disease pathology in normal-appearing brain tissue at an early MS stage. We develop another deep learning network that can predict the short-term risk of conversion to clinically definite MS in patients with early isolated symptoms. We then build on that work to develop a deep learning network that can distinguish between MS and neuromyelitis optica spectrum disorder (NMOSD), a disease similar to MS that requires different treatments, to support differential diagnosis. Finally, we summarize the contributions of the thesis and conclude with discussion and future work. The major contributions of the thesis are described below, and a high-level schematic illustration of the developed deep learning models is depicted in Figure 1.2.

1. MRI feature extraction entirely from unlabeled data for MS lesion segmentation by deep learning: Limited annotations in MR images from MS patients make developing automated lesion segmentation methods challenging. We develop an unsupervised deep learning method that can model a manifold space entirely from unlabeled multimodal MRIs, which is subsequently refined with a supervised classifier to estimate the data distribution of MS lesions in WM, using a relatively small set of annotated images.
To our knowledge this is the first study to use image features learned completely automatically from unlabeled data to perform MS lesion segmentation. We assess the impact of unsupervised learning by observing the segmentation performance as the amount of unlabeled data is varied.

2. Unsupervised joint myelin-T1w MRI feature modeling for MS pathology detection in normal-appearing brain tissue by deep learning: MWI is an important imaging modality in MS, but modeling clinically useful features from MWI by deep learning had not been previously attempted. We develop an unsupervised deep learning method that can model joint feature representations from high-dimensional quantitative (myelin) and structural (T1w) MRI to detect early signs of MS abnormality in normal-appearing brain tissue (NABT). This study is novel in investigating the applicability of deep learning for automatic feature extraction from myelin image data. This work demonstrates that automated feature learning on myelin images, in combination with T1w MR images, via multimodal deep learning can significantly enhance the performance of MS vs. normal classification in NABT over the user-defined MRI measurements commonly employed in the current literature.

3. Deep learning of brain lesion patterns and user-defined clinical and radiological measurements for predicting the conversion risk to clinically definite MS from clinically isolated syndrome: We develop a deep learning method that extracts brain lesion features that, when combined with user-defined radiological and clinical measurements obtained by the evaluating physicians, can predict short-term (2-year) disease activity in patients with early MS symptoms more accurately than the imaging biomarkers that have been used in clinical studies. To the best of our knowledge, this study is the first attempt to perform patient-level classification of brain lesion masks by deep learning. To efficiently train a convolutional neural network (CNN) on sparse lesion masks and to reduce the risk of overfitting, we propose utilizing the Euclidean distance transform (EDT) to increase information density (see the sketch after this list), and a combination of downsampling, unsupervised pretraining and regularization to avoid overfitting and improve training convergence. To further enhance the prediction performance, we also propose incorporating user-defined radiological and clinical measurements into a CNN via feature replication and amplification. We provide detailed analyses of the impact of EDT, unsupervised training and the incorporation of user-defined measurements on network training.

4. Deep learning of brain lesion and diffusion features for distinguishing neuromyelitis optica from MS: NMOSD exhibits substantial similarities to MS in clinical presentation, but some drugs for MS can exacerbate NMOSD. We investigate whether multimodal deep learning of brain MRI lesion and diffusion patterns, in both supervised and unsupervised manners, can discover visually indistinct image features that can be used to distinguish NMOSD from MS. We develop a novel hierarchical multimodal fusion architecture and training strategies that can improve joint feature learning of heterogeneous imaging modalities, using a CNN architecture for modeling brain lesion patterns and a patch-based deep neural network (DNN) architecture for modeling brain diffusion patterns. We show that the proposed method can learn a manifold space from heterogeneous imaging modalities more efficiently, and enhances the accuracy of differential diagnosis over the traditional multimodal deep learning approach and the existing MRI measurements that have been used in clinical studies.
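To make the distance-transform step of contribution 3 concrete, the following is a minimal sketch, not the thesis implementation: it assumes a binary 3D lesion mask stored as a NumPy array, uses SciPy's distance_transform_edt, and the exponential squashing at the end is an illustrative normalization choice rather than the exact one used in Chapter 6.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def lesion_mask_to_edt(mask):
    """Turn a sparse binary lesion mask into a dense distance map.

    For every voxel, compute the Euclidean distance to the nearest
    lesion voxel (zero inside lesions), so that the empty background
    also carries spatial information about lesion location.
    """
    # distance_transform_edt measures the distance to the nearest zero
    # element, so invert the mask to measure distance to the nearest lesion.
    dist = distance_transform_edt(mask == 0)
    # Squash to a bounded range so that far-away voxels do not dominate
    # the network input (an assumption, not the thesis' normalization).
    return np.exp(-dist / 10.0)

# Toy example: a single "lesion" voxel in an 8x8x8 volume.
mask = np.zeros((8, 8, 8), dtype=np.uint8)
mask[4, 4, 4] = 1
dense = lesion_mask_to_edt(mask)
print(dense[4, 4, 4], dense[0, 0, 0])  # 1.0 at the lesion, smaller far away
```

The point of the transform is visible in the toy output: the raw mask is zero almost everywhere, whereas the transformed volume varies smoothly across the whole field of view, giving the convolutional filters non-trivial gradients everywhere.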
Figure 1.2: A high-level schematic illustration of the deep learning models developed in this thesis, covering the four contributions: (1) unsupervised learning for lesion segmentation (stacked RBMs on 3D T2w/PDw patches, trained on 581 SPMS patients with 1400 unlabeled and 100 labeled scans, followed by feature concatenation and supervised random forest training); (2) detecting MS pathology in NABT (stacked RBMs on 3D myelin and T1w patches from 55 RRMS patients and 44 healthy controls, with joint features fed to a random forest); (3) predicting the conversion risk to MS from CIS (an EDT-preprocessed CNN on brain lesion masks from 140 CIS patients, 80 converters and 60 non-converters, with replicated and scaled clinical measures and a logistic regression output layer); and (4) distinguishing NMO from MS (a hierarchical multimodal fusion network combining a lesion-mask CNN and a DTI patch-based DNN for 82 NMO patients and 52 RRMS patients). A yellow box represents a layer trained in an unsupervised manner, a gray box a layer trained in a supervised manner, and a light blue box a layer trained in both unsupervised and supervised manners. EDT represents the Euclidean distance transform. More details, such as regularization and training strategies, are described in the following chapters.

Chapter 2
Foundation in Deep Learning

A neural network is a biologically inspired programming paradigm that enables a computer to learn patterns from observational data. Deep learning [LeCun et al., 2015] is a powerful set of approaches for learning in neural networks. Deep learning methods became popular in the late 2000s with the aid of big datasets, efficient training algorithms, and powerful computing resources. Since then, deep learning methods have rapidly become a dominant methodology for analyzing medical images as well. A crucial concept of deep learning is to let computers learn the features that optimally represent the data for a given application. This concept provides the basis of most deep learning methods: the use of multiple layers of nonlinear transformation units for extracting hierarchical features from low-level to high-level. Most deep learning networks can be trained in a supervised (training labels are required) or an unsupervised (training labels are not required) manner. In this chapter, we discuss the technical concepts of supervised and unsupervised deep learning approaches that provide the foundation for the feature learning methodologies developed in the thesis.
2.1 Deep feature learning by a supervised framework

The most common form of machine learning is supervised learning. During training, a deep learning network is shown an example for each training target and produces a prediction for each target category. We compute an objective function that measures the error between the predictions and the desired training targets. The network then modifies its model parameters to reduce this error. These adjustable model parameters define the nonlinear input-output mapping function of the network.

2.1.1 Dense neural networks

Most neural networks can be considered a generalization of logistic regression. The activation $a$ of each neuron represents a linear combination of some input $\mathbf{x}$ and a set of adjustable model parameters, $\mathbf{W}$ and $b$, followed by an element-wise non-linearity $f(\cdot)$:

$a_i = f(\mathbf{W}_i^T \mathbf{x} + b)$   (2.1)

A neural network can consist of multiple layers of stacked neurons through which a signal is propagated in one direction (i.e., feed-forward), $f(\mathbf{W}_L^T f(\mathbf{W}_{L-1}^T \cdots) + \mathbf{b}_L)$, where the intermediate layers are known as hidden layers. When a network has multiple layers it is often called a DNN. DNNs exploit the property that many natural signal representations are compositional hierarchies, in which higher-level features are extracted by composing lower-level features.

Given a training set $D = \{\mathbf{x}^{(t)}, \mathbf{y}^{(t)}\}_{t=1}^{T}$, a DNN is trained by minimizing the error between the predictions $\hat{\mathbf{y}}^{(t)}$ and the given labels $\mathbf{y}^{(t)}$. The most common objective functions for measuring the prediction error are the sum of squared differences and the cross-entropy. The objective function, averaged over all training samples, can be viewed as a kind of hilly landscape in the high-dimensional space of model parameter values [LeCun et al., 2015]. In practice, stochastic gradient descent (SGD) is the typical procedure for optimizing the objective function with respect to the model parameters; it computes the average gradient over a small set of training samples and adjusts the model parameters accordingly. The algorithm is repeated for many small sets of training samples until the objective function stops decreasing. The process is stochastic because each small set of training samples gives an approximation of the average gradient over all samples. Although SGD is a simple procedure, it usually finds a good set of model parameters surprisingly quickly compared with far more sophisticated optimization techniques [Bousquet and Bottou, 2008]. The gradient can be calculated by backpropagation [Werbos, 1974] and the chain rule:

$z_i = \mathbf{W}_i^T \mathbf{x} + b, \qquad \frac{\delta E}{\delta z_i} = \frac{\delta E}{\delta a_i}\frac{\delta a_i}{\delta z_i}, \qquad \frac{\delta E}{\delta \mathbf{W}_i} = \frac{\delta E}{\delta z_i}\frac{\delta z_i}{\delta \mathbf{W}_i} = \frac{\delta E}{\delta z_i}\mathbf{x},$   (2.2)

where $E$ is the objective function.
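Equations 2.1 and 2.2 can be made concrete with a minimal NumPy sketch of a one-hidden-layer network trained by SGD on a toy problem. The layer sizes, sigmoid activations, learning rate and synthetic data below are illustrative assumptions, not the configuration used anywhere in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: 100 samples, 8 features, binary targets.
X = rng.normal(size=(100, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# One hidden layer: a = f(W1 x + b1), prediction = f(w2 . a + b2).
W1, b1 = rng.normal(0, 0.1, (16, 8)), np.zeros(16)
w2, b2 = rng.normal(0, 0.1, 16), 0.0

lr = 0.5
for epoch in range(200):
    for x, t in zip(X, y):
        # Forward pass (Equation 2.1, applied layer by layer).
        a1 = sigmoid(W1 @ x + b1)
        yhat = sigmoid(w2 @ a1 + b2)
        # Backward pass: the chain rule of Equation 2.2 with a
        # cross-entropy objective, for which dE/dz2 = yhat - t.
        dz2 = yhat - t
        dw2, db2 = dz2 * a1, dz2
        dz1 = dz2 * w2 * a1 * (1.0 - a1)  # dE/dz1 = dE/da1 * da1/dz1
        dW1, db1 = np.outer(dz1, x), dz1  # dE/dW1 = dE/dz1 * x^T
        # Stochastic gradient descent update of every parameter.
        W1 -= lr * dW1; b1 -= lr * db1
        w2 -= lr * dw2; b2 -= lr * db2

acc = np.mean(((sigmoid(X @ W1.T + b1) @ w2 + b2) > 0) == (y > 0.5))
print(f"training accuracy: {acc:.2f}")
```

Each pass through the inner loop is one SGD step: a forward pass through Equation 2.1 followed by the chain-rule gradients of Equation 2.2 and a parameter update in the direction of steepest descent.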
2.1.2 Convolutional neural networks

Convolutional neural networks (CNNs) were inspired by the structure found in the visual cortex of the human brain. Four characteristics of CNNs exploit the properties of natural signal representations: local connections, shared weights, pooling and the use of many layers. The structure of CNNs is directly linked to the classic notions of simple cells and complex cells in visual neuroscience [Hubel and Wiesel, 1962], and the overall architecture is similar to the hierarchy of the visual cortex ventral pathway [Felleman and Van Essen, 1991].

The main benefit of CNNs is their weight sharing property, which takes advantage of the fact that similar structures occur in different locations in natural signals. This is realized by the convolution operation, the main workhorse of CNNs. Weight sharing by convolution drastically reduces the number of model parameters that need to be learned, because the number of weights no longer depends on the size of the input signal.

At each layer, the activations of a convolutional layer are calculated by convolving with a set of filter kernels $\mathbf{W} = \{\mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_K\}$ and subsequently adding biases $B = \{b_1, b_2, \ldots, b_K\}$:

$\mathbf{X}_k^l = f(\tilde{\mathbf{W}}_k^{l-1} \ast \mathbf{X}^{l-1} + b_k^{l-1})$   (2.3)

where $\ast$ denotes a convolution operation, $\tilde{\mathbf{W}}$ denotes a flipped version of $\mathbf{W}$, and each new feature map $\mathbf{X}_k$ is subject to an element-wise non-linearity $f(\cdot)$. CNNs can be trained using SGD, where the gradients are derived similarly to DNNs and calculated by backpropagation [LeCun et al., 2015].

Convolutional layers are typically interleaved with pooling layers, in which neighboring activation values are aggregated using the max or mean operations, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. At the end of the convolutional layers, fully-connected layers (regular DNNs) are typically attached to perform a classification task, or fully convolutional layers are attached to perform a semantic segmentation task [Shelhamer et al., 2017].
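A minimal 2D, single-channel sketch of Equation 2.3 followed by max pooling is given below. The thesis networks operate on 3D volumes with multiple channels; the 2D case, the filter count and the kernel sizes here are illustrative assumptions chosen for brevity. Cross-correlation is used, which equals convolution with the flipped kernel $\tilde{\mathbf{W}}$ in the text.

```python
import numpy as np
from scipy.signal import correlate2d

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def conv_layer(x, kernels, biases):
    """Equation 2.3 for one 2D input channel: one feature map per kernel."""
    return np.stack([sigmoid(correlate2d(x, w, mode="valid") + b)
                     for w, b in zip(kernels, biases)])

def max_pool(fmap, s=2):
    """Aggregate non-overlapping s x s neighborhoods by their maximum."""
    h, w = fmap.shape[0] // s * s, fmap.shape[1] // s * s
    return fmap[:h, :w].reshape(h // s, s, w // s, s).max(axis=(1, 3))

rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28))                 # toy single-channel image
kernels = rng.normal(0, 0.1, size=(4, 5, 5))  # 4 learnable 5x5 filters
biases = np.zeros(4)

maps = conv_layer(x, kernels, biases)
pooled = np.stack([max_pool(m) for m in maps])
print(maps.shape, pooled.shape)               # (4, 24, 24) (4, 12, 12)
```

Note how the parameter count (4 kernels of 5x5 weights plus 4 biases) is independent of the 28x28 input size, which is exactly the weight-sharing benefit described above.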
2.2 Deep feature learning by an unsupervised framework

Deep learning has been shown to be a powerful trainable feature extractor that can provide more discriminative and generalized data representations at multiple levels, which are the theoretical advantages behind deep architectures [Bengio et al., 2013]. Many researchers have demonstrated that unsupervised feature learning is less prone to overfitting than supervised learning [Bengio et al., 2013], although this is still debatable, particularly when labeled training samples are abundant. Unsupervised deep learning has typically been used as an initialization for a deep supervised neural network or as an input to a standard supervised machine learning classifier; in both cases the unsupervised learning component can be seen as a prior [Bengio, 2009]. It is known that kernel machines such as the SVM [Cortes and Vapnik, 1995] can be more powerful when using an appropriate kernel, i.e., an appropriate feature space, which can be learned by deep learning [Bengio et al., 2013]. In the deep learning community, this effect is known as the regularization effect, which hypothesizes that unsupervised deep learning is capable of capturing a comprehensive set of features that represent all of the key variations in the input data, which can also improve generalizability, the typical aim of regularization.

Deep learning approaches, especially deep neural networks such as deep belief networks (DBNs) [Hinton et al., 2006], deep Boltzmann machines (DBMs) [Salakhutdinov and Hinton, 2009], stacked denoising autoencoders [Vincent et al., 2010], and many other variants, became popular in the last decade as a way of learning more discriminative and abstract data representations. They have been applied to various machine learning tasks, such as classifying large sets of images of different nature (e.g., Krizhevsky et al. [2012]) and speech recognition (e.g., Hinton et al. [2012]), and have shown impressive improvements over conventional machine learning approaches.

It is a common notion in the deep learning community that there is no definite advantage in performance between probabilistic models (e.g., restricted Boltzmann machine (RBM) models) and deterministic models (e.g., autoencoders). However, it is possible to sample from a learned RBM model, which may be useful for learning more generalized data distributions of brain structures. Therefore, in this thesis, we focus on probabilistic deep neural network models such as RBMs [Smolensky, 1986] and DBNs, which express the joint space of the observed data $\mathbf{x}$ and the latent variables $\mathbf{h}$ as a probability distribution $p(\mathbf{x}, \mathbf{h})$.

An RBM model deals with the case that has no target values (i.e., unlabeled training data). In this case, the training data $D$ consists only of input visible vectors:

$D = \{\mathbf{x}^{(t)}\}_{t=1}^{T},$   (2.4)

where $t$ denotes the sample index in the training data and $T$ is the sample size. The goal of unsupervised learning in a probabilistic framework is to model a distribution $p(\mathbf{x}|\boldsymbol{\theta})$ of a given set of training samples, parameterized by $\boldsymbol{\theta}$. Undirected graphical models, also called Markov random fields (MRFs), model the joint distribution $p(\mathbf{x}, \mathbf{h})$ by factorizing unnormalized nonnegative clique potentials:

$p(\mathbf{x}, \mathbf{h}) = \frac{1}{Z_{\boldsymbol{\theta}}} \prod_i \psi_i(\mathbf{x}) \prod_j \eta_j(\mathbf{h}) \prod_k \upsilon_k(\mathbf{x}, \mathbf{h}),$   (2.5)

where $\psi_i(\mathbf{x})$, $\eta_j(\mathbf{h})$, and $\upsilon_k(\mathbf{x}, \mathbf{h})$ are the clique potentials describing the interactions within the visible units, within the hidden units, and between the visible and hidden units, respectively. The partition function $Z_{\boldsymbol{\theta}}$ normalizes the distribution.

2.2.1 Boltzmann machines

Within the context of unsupervised learning, a Boltzmann distribution with clique potentials constrained to be positive is a particular form of MRF:

$p(\mathbf{x}, \mathbf{h}|\boldsymbol{\theta}) = \frac{1}{Z_{\boldsymbol{\theta}}} \exp(-\varepsilon(\mathbf{x}, \mathbf{h}|\boldsymbol{\theta})),$   (2.6)

where the energy $\varepsilon(\mathbf{x}, \mathbf{h}|\boldsymbol{\theta})$ defines the probability distribution and contains the interactions described by the MRF clique potentials, and $\boldsymbol{\theta}$ are the model parameters that characterize the interactions between the stochastic units. The Boltzmann machine (BM) [Ackley et al., 1985] was originally defined as a network of symmetrically coupled stochastic units, via the Boltzmann distribution. The BM energy function specifies the probability distribution over $\mathbf{x}$ and $\mathbf{h}$:

$\varepsilon_{\mathrm{BM}}(\mathbf{x}, \mathbf{h}|\boldsymbol{\theta}) = -\tfrac{1}{2}\mathbf{x}^T\mathbf{U}\mathbf{x} - \tfrac{1}{2}\mathbf{h}^T\mathbf{V}\mathbf{h} - \mathbf{x}^T\mathbf{W}\mathbf{h} - \mathbf{b}^T\mathbf{x} - \mathbf{c}^T\mathbf{h},$   (2.7)

where $\boldsymbol{\theta} = \{\mathbf{U}, \mathbf{V}, \mathbf{W}, \mathbf{b}, \mathbf{c}\}$ are the model parameters that, respectively, encode the visible-to-visible interactions, the hidden-to-hidden interactions, the visible-to-hidden interactions, the visible self-connections (also called visible biases), and the hidden self-connections (also called hidden biases), as illustrated in Figure 2.1. In general, inference in the BM is intractable because it involves computing an expectation over all possible configurations of the input $\mathbf{x}$ under the distribution formed by the model.

Figure 2.1: A graphical representation of a Boltzmann machine with 5 visible units and 4 hidden units. $\mathbf{U}$, $\mathbf{V}$ and $\mathbf{W}$ are the model parameters that represent the visible-to-visible, hidden-to-hidden, and visible-to-hidden interactions, respectively.

In training probabilistic models, parameter learning is typically performed by minimizing the BM energy (Equation 2.7), which is equivalent to maximizing the likelihood of the training data (or, equivalently, the log-likelihood). With $T$ training samples, the log-likelihood is given by marginalizing out the hidden units:

$\mathcal{L}(\boldsymbol{\theta}) = \sum_{t=1}^{T} \log p(\mathbf{x}^{(t)}|\boldsymbol{\theta}) = \sum_{t=1}^{T} \log \sum_{\mathbf{h}} p(\mathbf{x}^{(t)}, \mathbf{h}|\boldsymbol{\theta}) = \sum_{t=1}^{T} \left[ \log \sum_{\mathbf{h}} \exp(-\varepsilon_{\mathrm{BM}}(\mathbf{x}^{(t)}, \mathbf{h}|\boldsymbol{\theta})) - \log Z_{\boldsymbol{\theta}} \right].$   (2.8)
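The role of the partition function $Z_{\boldsymbol{\theta}}$ in Equations 2.6 and 2.8 can be illustrated by brute force on a BM small enough to enumerate: for the 5-visible/4-hidden model of Figure 2.1 there are only $2^9 = 512$ joint configurations. The random parameters and the evaluated input below are purely illustrative; for realistic unit counts this enumeration is exactly what becomes intractable.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
nv, nh = 5, 4  # the sizes drawn in Figure 2.1

# Random symmetric couplings with zero self-interactions.
U = rng.normal(0, 0.1, (nv, nv)); U = (U + U.T) / 2; np.fill_diagonal(U, 0)
V = rng.normal(0, 0.1, (nh, nh)); V = (V + V.T) / 2; np.fill_diagonal(V, 0)
W = rng.normal(0, 0.1, (nv, nh))
b, c = np.zeros(nv), np.zeros(nh)

def energy(x, h):
    """Boltzmann machine energy, Equation 2.7."""
    return -(0.5 * x @ U @ x + 0.5 * h @ V @ h + x @ W @ h + b @ x + c @ h)

# Brute-force partition function: feasible only because 2^(5+4) = 512.
configs = [np.array(s) for s in product([0, 1], repeat=nv + nh)]
Z = sum(np.exp(-energy(s[:nv], s[nv:])) for s in configs)

def log_likelihood(x):
    """Marginal log p(x), one summand of Equation 2.8."""
    px = sum(np.exp(-energy(x, np.array(h)))
             for h in product([0, 1], repeat=nh))
    return np.log(px) - np.log(Z)

print(log_likelihood(np.array([1, 0, 1, 0, 1])))
```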
In training probabilistic models, the parameters are typically learned by minimizing the BM energy (Equation 2.7) over the training data, which corresponds to maximizing the likelihood of the training data (or equivalently the log-likelihood). With T training samples, the log-likelihood is given by marginalizing out the hidden units:

    L(θ) = ∑_{t=1}^{T} log p(x^{(t)} | θ) = ∑_{t=1}^{T} log ∑_h p(x^{(t)}, h | θ)
         = ∑_{t=1}^{T} [ log ∑_h exp(−ε_BM(x^{(t)}, h | θ)) − log Z_θ ].    (2.8)

The parameters can be learned by maximizing Equation 2.8. Gradient-based maximization requires computing its gradient, using, for example, a stochastic gradient descent algorithm. We can obtain the steepest direction (gradient) for each parameter by computing the partial derivative of the marginal log-likelihood with respect to a parameter θ:

    ∂L(θ)/∂θ = ∑_{t=1}^{T} E_{p(h|x^{(t)},θ)} [ ∂(−ε_BM(x^{(t)}, h | θ))/∂θ ]
             − ∑_{t=1}^{T} E_{p(x,h|θ)} [ ∂(−ε_BM(x, h | θ))/∂θ ].    (2.9)

The first term computes the expectation of the partial derivative of the BM energy with respect to the parameter under the posterior distribution of the hidden units, with the visible units "clamped" to training samples from the data distribution. The expectation in the second term is the same quantity over the full joint p(x, h | θ) under the model distribution (the "unclamped" condition) represented by the BM. During training, the gradient locally moves the model distribution toward the data distribution until the two forces are in equilibrium, where the sufficient statistics (the gradient of the energy function) have equal expectations between visible units sampled from the data distribution and visible units sampled from the model distribution.

2.2.2 Restricted Boltzmann machines

The RBM, proposed by Smolensky [1986], is the most popular subclass of BMs. As with the BM, it is defined by a Boltzmann distribution with clique potentials constrained to be positive:

    p(x, h | θ) = (1/Z_θ) exp(−ε_RBM(x, h | θ)).    (2.10)

It has a bipartite structure such that each visible unit is connected to all hidden units and each hidden unit to all visible units, but there are no connections between units of the same type, as shown in Figure 2.2. The energy function is defined as

    ε_RBM(x, h | θ) = −xᵀWh − bᵀx − cᵀh.    (2.11)

Figure 2.2: A graphical representation of a restricted Boltzmann machine with 5 visible units and 4 hidden units. W is the weight matrix that represents the visible-to-hidden interactions.

Unlike in the BM, the Markov blanket of each unit contains only the units on the other side. Due to this property, the hidden units are conditionally independent given a state of the visible units, so that inference is defined as:

    p(h_i = 1 | x, θ) = σ(∑_j W_ij x_j + c_i),
    p(x_j = 1 | h, θ) = σ(∑_i W_ij h_i + b_j),    (2.12)

where σ(s) = 1/(1 + exp(−s)) is the sigmoid function.

To learn the model parameters, we can obtain the steepest direction for each parameter by computing the partial derivative of the marginal log-likelihood with respect to a parameter θ over the T training samples:

    Δθ ∝ (1/T) ∑_{t=1}^{T} ∂(−ε_RBM(x^{(t)}, μ^{(t)} | θ))/∂θ
       − (1/M) ∑_{m=1}^{M} ∂(−ε_RBM(x̃^{(m)}, h̃^{(m)} | θ))/∂θ,    (2.13)

where μ_j^{(t)} = p(h_j = 1 | x^{(t)}, θ), M is the Monte Carlo sample size, and (x̃^{(m)}, h̃^{(m)}) are drawn by a block Gibbs Markov chain Monte Carlo (MCMC) sampling procedure:

    x̃^{(m)} ∼ p(x | h̃^{(m−1)}),    h̃^{(m)} ∼ p(h | x̃^{(m)}).    (2.14)

Due to the conditional independence property of an RBM, this Gibbs sampling can be parallelized over each type of layer, either the visible or the hidden layer, given the state of the units in the other layer. The state of each unit is updated by a new sample in two parallelized steps, as illustrated in Figure 2.3.

Figure 2.3: Gibbs sampling procedure on a restricted Boltzmann machine, where x^{<0>} is initialized to a random binary vector.
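The conditionals of Equation 2.12 and the block Gibbs chain of Equation 2.14 translate directly into code. Below is a minimal numpy sketch (all names and sizes are illustrative; W is stored as a visible-by-hidden matrix):

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def sample_h(x, W, c, rng):
        # p(h_i = 1 | x): all hidden units can be sampled in parallel.
        p = sigmoid(x @ W + c)
        return (rng.random(p.shape) < p).astype(float)

    def sample_x(h, W, b, rng):
        # p(x_j = 1 | h): all visible units can be sampled in parallel.
        p = sigmoid(h @ W.T + b)
        return (rng.random(p.shape) < p).astype(float)

    # Block Gibbs chain (Equation 2.14), initialized with a random binary vector.
    rng = np.random.default_rng(0)
    D, K = 5, 4
    W, b, c = rng.normal(0, 0.1, (D, K)), np.zeros(D), np.zeros(K)
    x = rng.integers(0, 2, D).astype(float)
    for m in range(1000):
        h = sample_h(x, W, c, rng)
        x = sample_x(h, W, b, rng)

After enough alternations, (x, h) is approximately a sample from the model distribution, which is what the second term of Equation 2.13 requires.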
Similarly to Section 2.1.2, the convolutional restricted Boltzmann machine (CRBM) is defined by sharing the weights between the hidden and visible layers among all locations in an image [Lee et al., 2011]. The energy of a CRBM with binary visible and hidden units is

    ε_CRBM(x, h) = −∑_{k=1}^{K} h^k • (W̃^k ∗ x) − ∑_{k=1}^{K} b_k ∑_{i,j} h^k_{i,j} − c ∑_{i,j} x_{i,j},    (2.15)

where ∗ denotes a convolution operation, • denotes an element-wise product followed by summation, and all visible units share a single bias c. As with standard RBMs, the posterior distributions can be derived from the energy function and are given by

    p(h^k_{ij} = 1 | x) = σ((W̃^k ∗ x)_{ij} + b_k),
    p(x_{ij} = 1 | h) = σ((∑_k W^k ∗ h^k)_{ij} + c),    (2.16)

which can be used for performing block Gibbs sampling.

The model parameters of a CRBM can be trained by contrastive divergence (CD) [Lee et al., 2011]. Over the training data, each model parameter is updated by a gradient that can be approximated by

    ΔW^k ∝ (1/N) (Q̃^{(0),k} ∗ x^{(0)} − Q̃^{(n),k} ∗ x^{(n)}),    (2.17)

where Q is the posterior over the hidden units as in Equation 2.16.
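Continuing the sketch above, a one-step contrastive divergence (CD-1) update for the fully connected RBM of Section 2.2.2 approximates Equation 2.13 by replacing the model expectation with a single Gibbs step started from the data. This is an illustrative sketch, not the thesis implementation; it reuses sigmoid, sample_h and sample_x from the previous listing:

    def cd1_update(X, W, b, c, lr, rng):
        # Positive phase: hidden posteriors with visibles clamped to a
        # mini-batch X of binary row vectors.
        mu = sigmoid(X @ W + c)
        # Negative phase: one step of block Gibbs starting from the data.
        h = sample_h(X, W, c, rng)
        x_neg = sample_x(h, W, b, rng)
        mu_neg = sigmoid(x_neg @ W + c)
        # Update: data statistics minus one-step "reconstruction" statistics.
        T = X.shape[0]
        W += lr * (X.T @ mu - x_neg.T @ mu_neg) / T
        b += lr * (X - x_neg).mean(axis=0)
        c += lr * (mu - mu_neg).mean(axis=0)
        return W, b, c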
2.2.3 Building deep representations

A number of researchers have suggested that it may be advantageous to utilize unsupervised neural networks as the first step in training deeper models that are designed to perform a potentially different target task (e.g., Hinton et al. [2006], Bengio et al. [2007], Poultney et al. [2006], LeCun et al. [2015]). For example, an RBM may be trained to transform the coordinate system of the input data into one that is more useful for supervised learning tasks. The main idea is to learn a hierarchy of features one level at a time, using unsupervised feature learning to learn a new transformation at each level that is composed with the previously learned transformations. That is, one layer of weights is added to the deep network at each iteration of unsupervised feature learning. This procedure is referred to as greedy layer-wise unsupervised pretraining (Figure 2.4). Finally, the inferred activations of the learned set of layers can be combined and used as input to standard supervised machine learning classifiers such as SVMs [Cortes and Vapnik, 1995] or random forests [Breiman, 2001], or to initialize a deep supervised neural network predictor or a deep generative model.

Figure 2.4: Greedy unsupervised layer-wise pretraining in the case of RBMs. The dashed directed lines indicate copying of the activations of the hidden units of the pretrained models.

Greedy layer-wise unsupervised pretraining, which aims to incrementally learn multiple layers of representations, constitutes one of the most important principles in deep learning [Bengio, 2009]. It is known that a simple regression model can perform a classification task perfectly if the training samples are linearly separable. In greedy layer-wise unsupervised pretraining, each subsequent intermediate layer transforms the coordinate system of the immediately lower layer such that the training samples potentially become linearly separable after going through the multiple layers of nonlinear transformations. This property suggests that incrementally building a sequence of intermediate layers gradually learns more generalized and discriminative representations of the data, which can improve performance compared to using the raw data representation, as evaluated on other machine learning tasks [Bengio, 2009, Cho, 2014]. As a result, a deep-learned data representation often yields improved generalizability, which is an important property for avoiding overfitting. In practice, a transformation can be learned with one visible and one hidden layer to extract a better representation. Using the representation from the previous stage, another model with one visible and one hidden layer is repeatedly trained to extract more abstract and discriminative representations than those obtained from the previous layers. This procedure is illustrated in Figure 2.4.

The representations learned by greedy layer-wise unsupervised pretraining may be used to improve deep learning models by jointly optimizing either all of the layers or some of the layers with respect to a supervised training criterion, which is called fine-tuning in the deep learning community. For example, the learned representation can be converted into a deep multilayer perceptron (MLP) [LeCun et al., 2015], which consists of multiple layers of artificial neural networks, by adding a logistic regression layer or a softmax layer on top of the pretrained network. Fine-tuning is then performed via supervised gradient descent on the log-likelihood cost function, after initializing the parameters of the deep MLP with the parameters learned by the unsupervised pretraining. However, some researchers have indicated that jointly optimizing the lower layers with respect to a supervised training criterion can be challenging, because the top two layers of a deep MLP may tend to overfit the training data while the lower layers learn irrelevant features [Bengio et al., 2013].
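The greedy procedure itself reduces to a short loop: train one RBM, push the data through it, and train the next RBM on the resulting activations. Below is a minimal sketch, assuming a hypothetical train_rbm helper (e.g., repeated cd1_update calls as sketched earlier) and the sigmoid function defined above:

    def pretrain_stack(X, layer_sizes, train_rbm):
        # Greedy layer-wise pretraining (Figure 2.4): each layer's deterministic
        # hidden activations become the next layer's "visible" training data.
        params, data = [], X
        for K in layer_sizes:
            W, b, c = train_rbm(data, n_hidden=K)
            params.append((W, b, c))
            data = sigmoid(data @ W + c)
        return params

The returned parameters can then be used to initialize a deep MLP for the supervised fine-tuning described above, or the final activations can be fed to a separate classifier.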
2.3 Recent advances

Generalizations to real-valued data   Most typical data, including neuroimaging data, consist of real-valued variables x ∈ R^{d_x}. A number of efforts have been made to generalize generative models such as RBMs to better capture real-valued data. The most widely used approach to modeling real-valued observations in the RBM framework is the Gaussian-Bernoulli RBM, in which the energy function is modified by adding a quadratic bias term on the visible units [Krizhevsky and Hinton, 2009].

Advanced training strategies   CD [Hinton et al., 2006] estimates the negative phase of RBM training with a very short Gibbs sampling chain (just one step is known to work well in practice) initialized with the training data used in the first term of the gradient. By assuming that the model changes little between updates, the persistent contrastive divergence (PCD) algorithm [Tieleman, 2008] initializes the Gibbs chain at the last state of the chain from the previous update. Various forms of reducing the learning rate to compensate (e.g., Tieleman [2008], Desjardins et al. [2010], Cho et al. [2010], Salakhutdinov [2009]) have been proposed to address the divergence problem. However, low learning rates can result in very slow training. To speed up training, Hinton suggested using momentum [Hinton, 2012]. The rectified linear unit [Nair and Hinton, 2010] has recently become a popular choice of nonlinearity for hidden units, improving both training speed and generalization performance. More recently, adaptively adjusting the learning rate for each model parameter has become increasingly popular. For example, AdaGrad [Duchi et al., 2011] performs larger updates for infrequently updated parameters and smaller updates for frequently updated ones, which can improve the robustness of SGD. Zeiler introduced AdaDelta, an extension of AdaGrad designed to counteract its rapidly diminishing learning rate by restricting the accumulation of past gradients to a window [Zeiler, 2012]. Adaptive moment estimation (ADAM) is another method that computes adaptive learning rates for each parameter using estimates of the first and second moments of the past gradients [Kingma and Ba, 2014].

Regularization   Lee et al. [2008] proposed forcing the representation to be sparse, meaning that only a small number of the hidden units should be activated in response to a given stimulus, by adding a sparsity constraint term to the hidden bias update. To reduce overfitting to the training data, Hinton introduced a training strategy called weight decay [Hinton, 2012], which penalizes large weights during training, resulting in an L2-norm regularization functional. Dropout [Srivastava et al., 2014] is another popular technique for reducing overfitting. It randomly drops units during training to simulate the effect of using many "thinned" networks and produce an averaged solution. Most methods regularize deep networks implicitly; however, more explicit and direct regularization methods have recently been studied, such as adding time-dependent random noise to the training gradient at every training step [Neelakantan et al., 2015] and minimizing the cross-covariance of hidden activations [Cogswell et al., 2015].
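As a concrete example of the techniques above, dropout amounts to applying a random mask to activations. Below is a minimal numpy sketch in the commonly used "inverted" form, which rescales at training time and is equivalent in expectation to the original scheme of Srivastava et al. [2014]:

    import numpy as np

    def dropout(a, p_drop, rng, train=True):
        # Zero each unit with probability p_drop during training; rescale the
        # survivors so that expected activations match the full network.
        if not train:
            return a
        mask = (rng.random(a.shape) >= p_drop) / (1.0 - p_drop)
        return a * mask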
Multimodal deep learning   Ngiam et al. introduced a method to learn a shared representation between audio and video data in a deep autoencoder framework [Ngiam et al., 2011]. Srivastava and Salakhutdinov proposed the DBN [Srivastava and Salakhutdinov, 2012a] and DBM [Srivastava and Salakhutdinov, 2012b] architectures for learning a joint representation between image and text data, even when some data modalities are missing, by sampling from their conditional distribution to impute missing values. Rastegar et al. [2016] studied a multimodal deep learning framework that exploits cross weights between the representations of the modalities, designed to model gradual interactions of the modalities in a deep network manner. Arevalo et al. [2017] recently introduced the gated multimodal unit to find an intermediate representation based on a combination of data from different modalities, in which the network decides how the modalities influence the activation of the hidden unit using multiplicative gates.

Building-in invariance   Building invariant features is an important research issue for learning representations that are robust to directions of variance in the training data that are not informative to the given task. Invariant features can be particularly useful for learning a discriminative classifier for high-dimensional data such as neuroimaging data. In deep learning frameworks, convolution with pooling is the most popular and powerful approach to extracting invariant features. The idea of the convolutional layer is based on the fact that the same local feature computation is likely to be relevant at all translated positions of the receptive field [Hubel and Wiesel, 1959]. Convolutional versions of the RBM and DBN [Lee et al., 2009] have been developed to directly train large convolutional layers in an unsupervised fashion.

Generative adversarial networks   Goodfellow et al. [2014] introduced an alternative generative modeling framework based on an adversarial process, in which two models are trained: a generative model that captures the data distribution, and a discriminative model that estimates the probability that a sample came from the training data rather than from the generator. The two models are trained simultaneously, with the generative model trained to maximize the probability of the discriminative model making a mistake. This framework has dramatically sharpened the potential of artificial intelligence (AI)-generated content, and has become one of the most active research areas in the machine learning community.

Chapter 3

Literature Survey for Deep Learning in Neuroimaging

Deep learning has attracted the attention of neuroimaging researchers, who are investigating its potential for analyzing medical images such as computed tomography (CT), MRI and positron emission tomography (PET). More specifically, there are two main reasons why deep learning methods are seen as highly promising for neuroimaging data analysis. First, neuroimaging data are generally high-dimensional. For example, MR images, the most commonly used imaging modality in neuroimaging, are typically 3D volumes that each contain several million voxels. In addition, due to the continuing development of advanced MRI techniques such as DTI [Le Bihan et al., 2001], functional MRI [Song et al., 2006], susceptibility weighted imaging (SWI) [Haacke et al., 2004] and MWI [MacKay et al., 1994, Alonso-Ortiz et al., 2015], modern MRI data are often multimodal, which further compounds the high-dimensionality problem discussed in Chapter 1. The capability of deep learning to automatically capture abstract and discriminative features in a hierarchical manner may prove particularly useful for reducing the high dimensionality of neuroimaging data and extracting the key patterns that represent important variations. Second, neuroimaging studies often lack labeled data, because the labels typically require expert annotations, which can be time-consuming and expensive to obtain. Compared to the huge datasets of labeled natural images commonly used in the machine learning community, which often contain hundreds of thousands of labeled training samples, a neuroimaging dataset with several hundred labeled training samples would already be considered large. The limited amount of labeled training data is a common cause of overfitting, making methodological development for neuroimaging applications challenging. Many deep learning models can be trained in an unsupervised fashion to learn a representative feature set, which can then be used to initialize a supervised model that can be trained with a smaller set of labeled images. By starting with a more representative set of features, the supervised model can potentially be made more robust to overfitting.

The current main applications of deep learning to neuroimaging are segmentation and classification, although there has also been some work on image registration.
In neuroimaging, segmentation is typically used to extract desired structures or regions of the CNS, such as WM, gray matter (GM), cerebrospinal fluid (CSF), the spinal cord and the corpus callosum, whose volume or shape can subsequently be used to perform diagnosis, monitor disease progression or study pathology with group comparisons. Neuroimage classification can be used to perform automatic computer-aided diagnosis, prognosis, or early disease detection. Traditional machine learning techniques for neuroimage classification generally require hand-crafted features based on domain knowledge or assumptions about disease pathology, which may be subject to bias. Several recent articles have shown that deep learning can automatically learn discriminative features for classification with little prior knowledge.

3.1 Registration of neuroimaging data by deep learning

Deformable image registration [Sotiras et al., 2013] is an important process in many neurological studies for determining anatomical correspondences, which can be used, for example, for atlas-based segmentation of brain structures. The principle behind deformable image registration is to determine the optimal transformation that maximizes the feature similarities between two images, which often relies on user-selected features such as Gabor filters.

Wu et al. proposed using an unsupervised 2-layer stacked convolutional independent subspace analysis (ISA) network, an extension of independent component analysis (ICA), to directly learn the basis image filters that represent the training dataset [Wu et al., 2013]. During image registration, the coefficients of these learned basis filters are used as morphological signatures to detect correspondences. When incorporated into existing registration methods, the data-adaptive features learned by the unsupervised deep learning framework demonstrated improved results over hand-crafted features. The method was later extended by incorporating a convolutional stacked autoencoder (SAE) network to identify intrinsic features in 3D image patches [Wu et al., 2015b].

3.2 Segmentation of neuroimaging data by deep learning

In machine learning, the most common approach to the image segmentation problem consists of two stages: hand-designed feature detectors are used to build feature vectors for each input in the first step, sometimes with prior knowledge such as spatial regularization or contour smoothness, and then the extracted features and target labels are used to train a supervised classifier to perform segmentation. This generally requires much labeled training data and good domain knowledge to design or select features, such as Gabor filters [Jain and Farrokhnia, 1990], Haar wavelets [Mallat, 1989], SIFT [Lowe, 1999] or variational level-sets [Li et al., 2006]. However, such features are designed based on the user's prior knowledge. With the hypothesis that the most effective features can be learned directly from training data, a number of researchers have recently adopted the deep learning framework.

Kim et al. proposed integrating an unsupervised 2-layer stacked convolutional ISA into a multi-atlas based segmentation framework for hippocampus segmentation [Kim et al., 2013a]. The authors compared traditional hand-crafted image features with the hierarchical feature representations learned from 7.0T MR images.
Guo et al. investigated using a 2-layer SAE to learn features for segmenting the hippocampus in infant T1- and T2-weighted MR brain images [Guo et al., 2014]. The deep-learned features were used to measure inter-patch similarity for sparse patch matching in a multi-atlas based segmentation framework, and demonstrated an improvement of 4–8% in Dice similarity coefficient over features based on intensity, Haar wavelets, histograms of oriented gradients (HOG) [Dalal and Triggs, 2005] and local gradient patterns [Ojala et al., 2002]. Zhang et al. proposed employing a deep CNN to perform infant brain segmentation on T1w, T2w and fractional anisotropy (FA) images [Zhang et al., 2015]. They trained a 3-layer CNN using approximately 10,000 local patches extracted from all voxels in a training set of 10 brain images. Havaei et al. proposed a fully automatic brain tumor segmentation method based on deep CNNs [Havaei et al., 2015]. Brosch et al. [2015, 2016] recently proposed a deep 3D convolutional encoder network with shortcut connections and showed that increasing architecture depth and adding shortcut connections improved MS lesion segmentation performance on multimodal structural MRIs. It was also recently shown that the convolutional encoder network works well for segmenting the corpus callosum [Tang et al., 2016] and spinal cord grey matter [Porisky et al., 2017]. Very recent studies suggest that deeper and more sophisticated networks with appropriate training strategies have the potential to further enhance segmentation performance on neuroimaging data. Chen et al. [2017] proposed a voxel-wise residual network with a set of effective training schemes for brain segmentation. The residual learning was used to alleviate the vanishing gradient problem when training a deep network, so that the performance gains achieved by increasing the network depth could be fully leveraged. Salehi et al. [2017] presented an auto-context CNN, in which intrinsic local and global image features are learned through 2D patches of different window sizes, in an application to brain extraction. Valverde et al. [2017] showed that cascading two 3D patch-wise CNNs can improve MS lesion segmentation performance, where the first network is trained to select candidate lesional voxels and the second network is trained to reduce the number of misclassified voxels coming from the first network.

3.3 Classification of neuroimaging data by deep learning

Typically, classification of human neuroimaging data has been used only to demonstrate the performance of a proposed hand-crafted feature (e.g., a set of voxel intensities or the size of particular regions-of-interest) or a feature selection method, both of which often require a thorough understanding of the disease by the user. Deep learning algorithms have recently attracted considerable attention from neuroimaging researchers due to the promise of automatic feature discovery for the tasks of computer-aided disease diagnosis and prediction. Several recent studies have shown that deep learning
methods can improve neuroimaging data classification by learning physiologically important image feature representations and discovering multimodal latent patterns in a data-driven way.

Plis et al. adopted a 3-layer DBN for performing schizophrenia diagnosis using structural brain MR images from 198 schizophrenia patients and 191 matched controls [Plis et al., 2014]. Their result reinforces the hypothesis that unsupervised pretraining can potentially lead to progressively more discriminative features at higher layers of data representation. Plis et al. used the same 3-layer DBN model described above to investigate its potential for diagnosing Huntington disease [Plis et al., 2014]. Hjelm et al. proposed using a Gaussian-Bernoulli RBM model to isolate linear factors in functional brain imaging data by fitting a probability distribution model to the data, in order to identify functional networks [Hjelm et al., 2014]. Suk et al. proposed a 3-layer SAE model with feature selection for classifying between Alzheimer's disease (AD) and mild cognitive impairment (MCI), a prodromal stage of AD [Suk et al., 2013]. Gray matter tissue volumes from MRI, mean signal intensities from PET, and biological measures from CSF were used as features. In a follow-up study, motivated by the fact that features extracted from brain regions-of-interest (ROIs) are pre-determined by the user and may not reflect small or subtle but potentially important pathological changes, Suk et al. proposed using a multimodal DBM framework [Salakhutdinov and Hinton, 2012] to learn a joint feature representation from paired 3D patches of MRI and PET images [Suk et al., 2014]. Liu et al. investigated a deep learning architecture consisting of an SAE with a softmax output layer for performing four-class classification simultaneously: AD, NC, MCI-C and MCI-NC [Liu et al., 2014b]. Brosch et al. adopted a convolutional multimodal DBN to automatically discover joint spatial patterns of variability in brain morphology and lesion distribution [Brosch et al., 2014]. It was demonstrated that the learned latent parameters have stronger relationships to MS clinical scores than volumetric measures. Liu et al. developed a framework based on SAEs to extract high-level ROI features from 3D PET images [Liu et al., 2014c] for AD classification. Pinaya et al. [2016] trained a DBN to extract features from brain morphometry data in structural MRIs for discriminating between healthy controls (N=83) and patients with schizophrenia (N=143). The authors showed that the DBN provided higher classification accuracy (73.6%) than a support vector machine (68.1%). To predict survival time for brain tumor patients, Nie et al. [2016] proposed a 3D deep learning model to extract high-level features that can better represent the properties of different brain tumors from T1w, fMRI and DTI images. More specifically, the proposed model consists of a 3D patch-wise CNN for T1w images, multi-channel CNNs to extract information from all the channels of fMRI or DTI images, and an SVM to fuse the multimodal features and make predictions. The authors reported 89.9% accuracy for predicting overall survival time using 69 patients.

Chapter 4

Feature Learning from Unlabeled Data for Multiple Sclerosis Lesion Segmentation

In this chapter, an automatic method for MS lesion segmentation in multi-channel 3D MR images is presented. The main novelty of the method is that it learns the spatial image features needed for training a supervised classifier entirely from unlabeled data. This is in contrast to other current supervised methods, which require the user to preselect or design the features to be used. Our method can learn an extensive set of image features with minimal user effort and bias.
In addition, by separating the feature learning from the classifier training that uses labeled (pre-segmented) data, the feature learning can take advantage of the much larger pool of available unlabeled data. Our method uses deep learning for feature learning and a random forest for supervised classification, but potentially any supervised classifier can be used. Quantitative validation is carried out using 1450 T2w and PDw pairs of MRIs of MS patients. The results demonstrate that segmentation performance increases with the amount of unlabeled data used, even when the number of labeled images is fixed.

4.1 Introduction

MS is a chronic, inflammatory and demyelinating disease of the brain and spinal cord. Lesions are a hallmark of MS pathology, and are primarily visible in WM on conventional MRI scans. Manual segmentation by expert users is a common way to determine the extent of MS lesions, but it is a time-consuming task and can suffer from intra- and inter-expert variability. Automatic segmentation is an attractive alternative, but it is a challenging task and remains an open problem [García-Lorenzo et al., 2013]. Many automatic approaches have been proposed over the last two decades, and they fall into two main categories: supervised and unsupervised. Supervised methods learn from previously segmented training images, and use user-selected image features to discriminate between lesions and healthy tissue (e.g., Geremia et al. [2011]). The availability of representative labeled images and the choice of image features are important considerations and may be difficult to optimize. Some methods use a very large starting set of features and select the more discriminative ones through labeled training (e.g., Morra et al. [2008]). Unsupervised methods do not require labeled training data, but instead typically use an intensity clustering method to model tissue distributions and rely on an expert's a priori knowledge of MRI and anatomy to reduce false positives (e.g., Shiee et al. [2010]). While both supervised and unsupervised approaches have had some success, supervised methods that can automatically learn useful spatial features from unlabeled images are an attractive alternative that remains under-investigated. The amount of unlabeled data typically far exceeds that of labeled data, and using a large database to build a feature set has the potential to improve robustness and generalizability over current supervised methods.

We present a new method for automatically learning, from unlabeled images, the image features to be used for MS lesion segmentation. We train our model on a large batch of unlabeled images to identify common patterns, then add labels to a subset of the training images so that the features and labels can be used in a supervised learning method to perform the segmentation. To our knowledge, this is the first attempt to automatically learn discriminative 3D image features from unlabeled images for MS lesion segmentation. Previous papers have proposed advanced feature selection methods, such as those based on modifications of random forests [Montillo et al., 2011, Yaqub et al., 2011], but the features were still pre-determined and filtered using relatively small sets of labeled data to identify the more discriminative ones. The main difference is that our method automatically learns data-driven features from unlabeled images without the potential bias of predefined features or of features learned from labeled data. This allows large data sets to be used to generate broadly representative feature sets. We show that the learned features enable segmentation performance that is competitive with hand-crafted features, and that increasing the amount of unlabeled data improves segmentation performance, even when the amount of labeled data is fixed.

4.2 Materials and Methods

Our dataset consists of the image data from 581 MS patients scanned at multiple time points. The total number of cases, where a case consists of a pair of T2w and PDw scans, is 1450. Each T2w/PDw pair was acquired using a dual-echo MR sequence, so the two scans are inherently co-registered. The dataset was collected from 48 sites, each using a different scanner, as part of a clinical trial in MS. All the images have the same resolution, 256 × 256 × 50, and the same voxel size, 0.936 × 0.936 × 3.000 mm³. We divided the dataset into independent training and test sets. The training set consists of 1400 cases from 531 patients and the test set contains 50 cases from 50 patients. Within the training set, 100 cases from 100 patients have expert segmentations that we used for supervised training. For preprocessing, N3 inhomogeneity correction [Sled et al., 1998] is first applied. Then, the entire set of T2w (and, independently, PDw) images is intensity-normalized to produce a mean of 0 and a standard deviation of 1. Skull-stripping is then performed with the brain extraction tool [Smith, 2002].

4.2.1 Algorithm overview

Our algorithm for learning image features from unlabeled data is built using RBMs, which are two-layer, undirected networks each consisting of a visible layer and a hidden layer, where the activations of the hidden units capture patterns in the visible units. RBMs can be stacked to form a DBN for learning more abstract features. Our model (Figure 4.1) consists of two RBMs, one for the T2w images and the other for the PDw images, that learn smaller-scale features. In addition, two DBNs are used, again separately for the T2w and PDw images, to learn larger-scale features. After training with unlabeled data, the model can be used to identify, in a probabilistic sense, the learned features in any given image. The model is then applied to a subset of the training data that has lesion labels. The learned features that are found are then fed, along with the labels, into a random forest, which is used to build a voxel-wise probabilistic classifier to find lesion voxels in unseen images.

Figure 4.1: The training algorithm for our MS lesion segmentation framework. A large number of unlabeled images and a smaller number of labeled images are used in a deep learning framework to generate the feature vectors used to train a random forest classifier.

4.2.2 Unsupervised feature learning using RBMs and deep learning

To target features at different scales, we extract image patches of two different sizes at the same locations from each image. To make feature learning on large batches of data feasible, we extract 100 uniformly spaced and non-overlapping patches at each scale, and set those patches as a mini-batch. The spacing and patch sizes allow complete coverage of the whole brain in most images. For the smaller scale, we use a patch size of 9 × 9 × 3, and convert the image values into one-dimensional vectors v_1, ..., v_100 ∈ R^D with D = 243. For the larger-scale features, we use a 3D patch size of 15 × 15 × 5, giving D = 1125.
We learn features from each 3D image patch using a Gaussian-Bernoulli RBM model [Krizhevsky and Hinton, 2009] with a set of binary hidden random units h of dimension K (K = 500 for the smaller scale, 1000 for the larger scale), a set of real-valued visible random units v of dimension D (D = 243 for the smaller scale, 1125 for the larger scale), and symmetric connections between these two layers represented by a weight matrix W ∈ R^{D×K}. We follow a published guide [Hinton, 2012] for choosing the number of hidden units to avoid severe overfitting. We minimize the energy function [Krizhevsky and Hinton, 2009]:

    E(v, h) = (1/2) ∑_{i=1}^{D} (v_i − c_i)² − ∑_{j=1}^{K} b_j h_j − ∑_{i=1}^{D} ∑_{j=1}^{K} v_i W_ij h_j,    (4.1)

where b_j are the hidden unit biases (b ∈ R^K) and c_i are the visible unit biases (c ∈ R^D). The units of the binary hidden layer (conditioned on the visible layer) are independent Bernoulli random variables, P(h_j = 1 | v) = σ(∑_i W_ij v_i + b_j), where σ(s) = 1/(1 + exp(−s)) is the sigmoid function. The visible units (conditioned on the hidden layer) are independent Gaussians with diagonal covariance, P(v_i | h) = N(∑_j W_ij h_j + c_i, 1). We perform the contrastive divergence approximation [Hinton et al., 2006] to update the weights and biases during training. In order to capture a higher-level representation of local brain structures, another layer of hidden units is stacked on top of the larger-scale RBM to form a deep belief network. Hinton et al. [2006] showed that greedily training each pair of layers (from lowest to highest) as an individual RBM, using the previous layer's activations as input, is an efficient approach for training DBNs. Our DBN has a layer of real-valued visible units v of dimension D = 1125 and two layers of K = 1000 binary hidden units h.
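Relative to the binary RBM of Chapter 2, only the visible conditional changes: given the hidden states, each visible unit is drawn from a unit-variance Gaussian centered on its top-down input. A minimal sketch (illustrative names; assumes the inputs have been standardized as described above):

    import numpy as np

    def sample_v_gaussian(h, W, c, rng):
        # P(v_i | h) = N(sum_j W_ij h_j + c_i, 1), with W of size D x K.
        mean = h @ W.T + c
        return mean + rng.standard_normal(mean.shape)

The hidden units are sampled exactly as in the binary case, so block Gibbs sampling and contrastive divergence carry over unchanged.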
4.2.3 Feature vector construction for supervised learning

To train a random forest, we use the labeled set of training images and construct feature vectors by applying our trained RBM/DBN model to 200 image patches within the lesion mask and 3800 image patches from normal-appearing tissue in each T2w and PDw image. The patches are extracted in the same way as described above. The activations of the RBM/DBN model represent the strength of the learned features present in the labeled images. We define x as a voxel location and let v^{s1}(x) represent a one-dimensional vector reformatted from a 3D image patch of size 9 × 9 × 3 centered at x. We define v^{s2}(x) as a one-dimensional vector reformatted from a 3D image patch of size 15 × 15 × 5 centered at x. We let I_T2(x) and I_PD(x) represent the intensity values at a voxel position x of a T2w image and a PDw image, respectively. A feature vector g ∈ R^L with L = 5002 is constructed for a given voxel in a pair of T2w/PDw images by concatenating the intensity values and the activations of the learned features:

    g(x) = {I_T2(x), T^{s1}_{1:500}(v^{s1}(x)), T^{s2,1}_{1:1000}(v^{s2}(x)), T^{s2,2}_{1:1000}(v^{s2}(x)),
            I_PD(x), J^{s1}_{1:500}(v^{s1}(x)), J^{s2,1}_{1:1000}(v^{s2}(x)), J^{s2,2}_{1:1000}(v^{s2}(x))},    (4.2)

where T^{s1}_k is the activation of the k-th hidden unit of the trained RBM when the image patch v^{s1}(x) from the T2w image is used as input. Similarly, T^{s2,n}_k is the activation of the k-th hidden unit of the n-th layer of the trained DBN when the image patch v^{s2}(x) from the T2w image is used as input. J^{s1}_k and J^{s2,n}_k are the analogous activations calculated from the PDw image.

4.2.4 Random forest training and prediction

We have chosen to use a random forest [Breiman, 2001] for supervised classification because random forests have been used successfully for MS lesion segmentation with hand-crafted features [Geremia et al., 2011], and because random forests are able to provide information on the relative importance of the features used. We construct a random forest consisting of 30 randomized binary decision trees with a maximum depth of 20. We use the same structure for the random forest as used in previous work [Geremia et al., 2011] on MS lesion segmentation, which may not necessarily be optimal for our learned features, but should be sufficient for a proof-of-concept. As described above, we collect feature vectors from image patches inside and outside of the lesion mask of each labeled image. The information gain is used to measure the quality of a split. To segment the lesions in a new image, a feature vector for each voxel is computed using Equation 4.2, and voxel-wise classification is performed by propagating the computed feature vectors through all the trees by successive application of the relevant binary tests. The final posterior probability is estimated by averaging the posteriors from every leaf node in all trees.
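For reference, a forest with this structure could be instantiated with, for example, scikit-learn (not necessarily the implementation used in this work); G and y below are synthetic stand-ins for the matrix of feature vectors g(x) and the voxel-wise lesion labels:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    G = rng.normal(size=(1000, 5002))   # one g(x) feature vector per row
    y = rng.integers(0, 2, 1000)        # 1 = lesion voxel, 0 = non-lesion

    # 30 trees, maximum depth 20, information gain ("entropy") as the split
    # criterion, matching the configuration described above.
    forest = RandomForestClassifier(n_estimators=30, max_depth=20,
                                    criterion="entropy")
    forest.fit(G, y)
    lesion_prob = forest.predict_proba(G)[:, 1]   # voxel-wise posterior
    importances = forest.feature_importances_     # per-feature importance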
4.3 Experiments and Results

To evaluate the segmentation performance using the automatically learned features, we used a validation procedure in which we varied the amount of unlabeled data (100, 400, 700, 1000, and 1400 cases) used for training the RBMs and DBNs, while keeping the labeled images (100 cases) and test images (50 cases) the same, and compared the automatic probabilistic segmentations to the binary segmentations produced by the experts. The parameters for training the RBMs and DBNs were kept consistent for all experiments. We used three measures for comparing segmentations: the Dice similarity coefficient (DSC), the true positive rate (TPR) and the positive predictive value (PPV) [García-Lorenzo et al., 2013, Weiss et al., 2013]. To produce binary segmentations, we thresholded the probabilistic segmentations using a visually derived value of 0.4. Since relative segmentation accuracy generally increases with lesion load, we stratified the cases into 5 lesion load categories for interpreting the results. An example of a segmentation result with a larger lesion load is shown in Figure 4.2.

Figure 4.2: Probabilistic segmentation example. (a) T2w input image, (b) PDw input image, (c) probabilistic segmentation result, (d) ground truth. DSC = 73.15%.

Table 4.1 summarizes the segmentation performance as measured by the DSC. For all of the lesion load categories, there is an apparent trend toward greater accuracy with an increase in the number of unlabeled training images. The improvement is monotonic up to 700 cases, except for a slight aberration in the lowest lesion load category. However, in all categories, the DSC decreased slightly when using 1000 cases as compared to 700 cases. This may be a problem arising from some unusual similarities between some of the 700 unlabeled cases, the labeled cases, and the test images, leading to over-fitting; this could be investigated by further experiments with multiple randomizations.

Table 4.1: DSC results (%) calculated on 50 T2w/PDw test pairs. Ten T2w/PDw pairs were used for each lesion load range and average scores were computed. The number of unlabeled images used for feature learning was varied, while the supervised training set was fixed at 100 T2w/PDw pairs. There is an apparent trend toward improved accuracy with a greater number of unlabeled training images.

    Number of cases                       Lesion load (×1000 mm³)
    (number of patients)    0.0-4.0   4.0-7.8   7.8-14.7   14.7-28.5   28.5+
    100 (45)                 12.8      32.7      45.2       51.2       51.4
    400 (152)                12.1      34.2      48.1       54.8       55.3
    700 (264)                12.8      35.5      49.0       56.3       56.5
    1000 (384)               12.2      34.2      47.8       55.1       55.8
    1400 (532)               14.0      36.2      49.6       55.7       55.4
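For completeness, the three overlap measures can be computed from a thresholded probabilistic segmentation and the expert mask as follows (a minimal sketch; scores are scaled to percentages as in the tables):

    import numpy as np

    def overlap_scores(prob, gt, threshold=0.4):
        # Binarize the probabilistic segmentation, then compute DSC, TPR, PPV.
        seg = prob >= threshold
        gt = gt.astype(bool)
        tp = np.logical_and(seg, gt).sum()
        dsc = 200.0 * tp / (seg.sum() + gt.sum())
        tpr = 100.0 * tp / gt.sum()
        ppv = 100.0 * tp / seg.sum()
        return dsc, tpr, ppv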
To compare indirectly with other state-of-the-art methods [Geremia et al., 2011, Weiss et al., 2013, Souplet et al., 2008], we selected 38 cases from our test set so that the range in lesion load (128 mm³ to 20695 mm³) is similar to that of the dataset used for evaluation in [Geremia et al., 2011, Weiss et al., 2013, Souplet et al., 2008] (105 mm³ to 22542 mm³). Table 4.2 shows the performance statistics for the other methods and our own, using the features learned from 1400 cases, and demonstrates that our method is highly competitive in accuracy, although the use of different datasets only allows for an indirect comparison. The lower PPV value suggests that our model tends to under-segment the lesions compared to the other methods.

Table 4.2: Average TPR/PPV/DSC results (%) for indirect comparison with state-of-the-art methods. Our method is compared to three state-of-the-art methods (Souplet et al. [2008], Geremia et al. [2011], Weiss et al. [2013]), which were evaluated on a dataset with a lesion load range of 105 mm³ to 22542 mm³. Note that DSC measures were not available in [Souplet et al., 2008, Geremia et al., 2011], and our dataset was different, which only allows for an indirect comparison. The lesion load range of our dataset is 128 mm³ to 20695 mm³.

    Method                    TPR        PPV        DSC
    Souplet et al. [2008]     19 ± 14    30 ± 16    n/a
    Geremia et al. [2011]     39 ± 18    40 ± 20    n/a
    Weiss et al. [2013]       33 ± 18    37 ± 19    29 ± 13
    Our method                58 ± 17    35 ± 24    38 ± 19

Finally, we examined the training results of the random forest to determine which sets of features (intensity, RBM, DBN first layer, DBN second layer) and which MR channel were the most important. Table 4.3 shows the relative discriminative power of each category of features, as represented by the percentage of nodes in which the category of features was selected by the random forest. These results suggest that spatial features were much more important (96.9%) than intensity features (3.1%) for distinguishing between lesion and non-lesion voxels. The spatial features computed from the second layer of the DBNs were selected slightly more often (35.0%) than those from the first layer (32.7%) and the RBMs (29.2%), but the RBM contribution still seems significant. The features learned from the T2w images were selected slightly more often (51.6%) than the features learned from the PDw images (45.3%).

Table 4.3: Relative discriminative power (%) of the features used for MS lesion segmentation as determined by the random forest. The percentages indicate the relative frequency with which each category of features was selected when training the random forest using the features learned from 1400 unlabeled cases.

    T2w         T2w     T2w DBN   T2w DBN   PDw         PDw     PDw DBN   PDw DBN
    intensity   RBM     Layer 1   Layer 2   intensity   RBM     Layer 1   Layer 2
    1.3         14.9    17.7      19.0      1.9         14.3    15.0      16.0

4.4 Conclusion

We have presented a new MS lesion segmentation method based on automatic feature learning from unlabeled images. Using a multi-scale RBM/DBN framework, we showed that the automatically learned features can be highly competitive with hand-crafted features for subsequent use in the supervised training of random forests, and that adding more unlabeled images generally increases segmentation performance, with the main advantage that minimal manual effort is involved. The main current limitation is the high dimensionality of the feature vectors used for training the random forest, which is the reason we used only 4000 patches per labeled image. This limitation is more critical than the small number of patches used for RBM and DBN training, because far fewer labeled images are typically available. Future work would include improvements in training efficiency for the RBMs, DBNs, and random forests in order to use a greater number of sample patches for both the unsupervised and supervised stages. Another limitation is that although we have shown that increasing the amount of unlabeled training data generally increases segmentation performance, the interactions between the unlabeled, labeled, and test data are poorly characterized and deserve further investigation (for example, by varying the amount of labeled data). In addition, our model can likely be further optimized, for instance by tuning the deep learning and random forest parameters, and by adding more layers to the network. Despite the limitations, we believe we have demonstrated the potential of unsupervised feature learning for MS lesion segmentation. Accurately segmenting MS lesions is critical, as segmented lesion masks can be used for a number of different disease classification and prognostication applications in MS by deep learning. In Chapters 6 and 7, we will show how segmented lesion masks can be exploited to perform patient-level brain MRI classification in MS applications.

Chapter 5

Detecting Multiple Sclerosis Pathology in Normal-Appearing Brain Tissue

Myelin imaging [MacKay et al., 1994] is a form of quantitative MRI that measures myelin content and can potentially allow demyelinating diseases such as MS to be detected earlier. Although focal lesions are the most visible signs of MS pathology on conventional MRI scans, it has been shown that even tissues that appear normal may exhibit decreased myelin content, as revealed by myelin-specific images (i.e., myelin maps). Current methods for analyzing myelin maps typically use global or regional mean myelin measurements to detect abnormalities, but ignore finer spatial patterns that may be characteristic of MS. In this chapter, we present a machine learning method to automatically learn, from multimodal MR images, latent spatial features that can potentially improve the detection of MS pathology at an early stage. More specifically, 3D image patches are extracted from myelin maps and the corresponding T1w MRIs, and are used to learn a latent joint myelin-T1w feature representation via unsupervised deep learning.
Using a data set of images from MS patients and healthy controls, a common set of patches is selected via a voxel-wise t-test performed between the two groups. In each MS image, any patches overlapping with focal lesions are excluded, and a feature imputation method is used to fill in the missing values. A feature selection process is then utilized to construct a sparse representation. The resulting normal-appearing features are used to train a random forest classifier. We evaluate the proposed model using the myelin and T1w images of 55 relapsing-remitting MS (RRMS) patients and 44 healthy controls to determine whether the method can identify image features that are more sensitive and specific to MS pathology in normal-appearing brain tissues. Our experimental results suggest that the proposed method has strong potential for this task.

5.1 Introduction

MS is an autoimmune disorder characterized by inflammation, demyelination, and degeneration in the central nervous system. MRI is invaluable for monitoring and understanding the pathology of MS in vivo from the earliest stages of the disease. One promising MR imaging modality is MWI [MacKay et al., 1994], a quantitative MRI technique that specifically measures myelin content (Figure 5.1) in the form of the myelin water fraction (MWF), which is defined as the ratio of the water trapped within myelin to the total amount of water. Although white matter lesions have traditionally been considered the hallmark of MS pathology, histological studies and the MWI technique have shown that MS alterations also occur in tissues that appear normal on conventional MRIs. For example, a study using MWI found that a cohort of MS patients had 16% lower mean global MWF in normal-appearing white matter (NAWM) than healthy controls [Laule et al., 2004]. In addition, normal-appearing gray matter (NAGM) has also been shown to have reduced MWF in MS patients [Steenwijk et al., 2014]. Although myelin imaging has been indispensable in enhancing our understanding of MS, most analyses to date (e.g., MacKay et al. [1994], Laule et al. [2004], Yoo and Tam [2013]) only use mean myelin measurements, either over the whole brain or in predefined regions, and disregard the fine-scale spatial patterns of myelin content that may potentially be useful for MS diagnosis.

Figure 5.1: An example of a myelin map of a healthy control subject at several different slices in the dataset described in Section 5.2.1. The intensity reflects the relative amount of myelin present, except for the extraparenchymal areas.

Deep learning [LeCun et al., 2015] is a machine learning approach that uses layered hierarchical, graphical networks to extract features from data at progressively higher levels of abstraction. In recent years, methods based on deep learning have attracted much attention due to their breakthrough classification performance in many application domains, including image recognition and natural language processing [LeCun et al., 2015]. Unsupervised deep learning can be particularly useful in neuroimaging, a domain in which the number of labeled training images is typically limited.
For example, unsupervised deep learning of neuroimaging data has been used to perform various tasks such as classification between MCI and AD [Suk et al., 2014], and to model morphological and lesion variability in MS [Brosch et al., 2014].

In view of this, we employ deep learning to extract latent spatial features from myelin maps, both on their own and combined with structural MRIs, to determine whether the deep-learned features can improve the detection of MS pathology. In doing so, we employ multimodal deep learning [Ngiam et al., 2011] to discover and model correlations between hidden variables (latent MRI patterns discovered by deep learning) in the normal-appearing brain tissues of coregistered pairs of myelin maps and T1w MRIs. The myelin and T1w scans provide complementary information, in that the former contain myelin-specific features while the latter contain more general morphological features. Both types of features are known to be impacted by MS, but the benefits of deep learning for extracting myelin or myelin-T1w features are unknown. We hypothesize that deep learning can uncover spatial features in myelin maps that are more sensitive and specific to MS pathology than mean myelin measurements, and that multimodal deep learning can extract more sensitive and specific features than those extracted from either the myelin or the T1w modality alone.

Our method uses a four-layer DBN [Hinton et al., 2006] that is applied to 3D image patches of NAWM and NAGM to learn a latent feature representation. The image patches are selected via a voxel-wise t-test performed between the MS and healthy control groups. To target only normal-appearing image patches, any patches overlapping with focal MS lesions are excluded, and a feature imputation technique is used to account for missing features originating from regions with focal lesions. We then apply the least absolute shrinkage and selection operator (LASSO) [Tibshirani, 1996] as a feature selection method to construct a sparse feature representation, reducing the risk of overfitting to the training data. The final features are then used to train a random forest [Breiman, 2001] that discriminates images of MS subjects from those of normal subjects.

5.2 Material and methods

5.2.1 Subjects

A cohort of 55 RRMS patients and a cohort of 44 age- and sex-matched normal control (NC) subjects were included in this study. The median age and range for both groups were 45 and 30–60, respectively. Of the RRMS patients, 63.6% (35/55) were female, and 63.6% (28/44) of the NC subjects were female. The McDonald 2010 criteria [Polman et al., 2011], which rely on clinical and radiological evidence to facilitate the diagnosis of MS and are currently regarded as the gold standard test, were used to diagnose the patients with MS. All patients underwent a neurological assessment and were scored on the EDSS [Kurtzke, 1983]. The median EDSS and range were 4 and 0–5. Informed consent from each participant and ethical approval by the local ethics committee were obtained prior to the study.

5.2.2 MRI acquisition and preprocessing

The T1w images were acquired with a gradient echo sequence (TR = 28 ms, TE = 4 ms, flip angle = 27°, voxel size = 0.977 × 0.977 × 3.000 mm³, image dimensions = 256 × 256 × 60).
The myelin images were acquired with a 3D GRASE sequence [Prasloski et al., 2012b] (32 echoes, TE = 10, 20, 30, ..., 320 ms, TR = 1200 ms, voxel size = 0.958 × 0.958 × 2.500 mm³, image dimensions = 256 × 256 × 40), and processed with the non-negative least squares fitting algorithm with non-local spatial regularization [Yoo and Tam, 2013] and stimulated echo correction [Prasloski et al., 2012a]. All images were acquired on a Philips Achieva 3T scanner with an 8-channel SENSE head coil. Lesion masks were produced for the MS images using a semi-automatic segmentation method [McAusland et al., 2010] applied to T2w and PDw MRI pairs. The T1w images were preprocessed by applying the N3 inhomogeneity correction method [Sled et al., 1998] iteratively over multiple scales [Jones and Wong, 2002], followed by denoising and skull-stripping. The multi-scale N3 method works similarly to the N4 algorithm [Tustison et al., 2010], but was optimized to work with the magnetic field of our scanner.

The non-zero T1w intensities were normalized to a range from 0 to 1, and then standardized to have zero mean and unit standard deviation (SD) to enable the use of Gaussian visible units (explained in Section 5.2.6). In addition, normalization of the T1w intensities made the appearance of high-contrast edge features between normal-appearing tissue and cerebrospinal fluid more consistent across individuals. In general, this allows the distribution of these edge features to be modeled more accurately during deep learning, which makes training of the networks faster and more stable. The brain masks computed from the T1w images and the intensity standardization were also applied to the myelin images. The myelin images were co-registered to the T1w images using linear registration. Non-linear registration with FSL FNIRT [Jenkinson et al., 2012] was performed on the T1w images to align them to the MNI152 template [Mazziotta et al., 2001], and the computed transforms were also applied to the myelin images.

5.2.3 Cross-validation procedure

To maximize use of the available data, we performed a cross-validation procedure in which a rotating subset of the subjects acted as the test data, while the rest were used for training. We used an 11-fold cross-validation procedure in which each fold consisted of 9 test subjects (5 MS and 4 NC) and 90 training subjects (50 MS and 40 NC). This partitioning allowed all 99 subjects to be tested once.
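This partition corresponds to a stratified 11-fold split, which could be generated with, for example, scikit-learn (an illustrative sketch, not necessarily how the folds were produced in this study):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    y = np.array([1] * 55 + [0] * 44)   # 55 RRMS patients, 44 controls
    skf = StratifiedKFold(n_splits=11, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(np.zeros(len(y)), y):
        # Each fold: 90 training subjects (50 MS, 40 NC) and
        # 9 test subjects (5 MS, 4 NC).
        assert len(test_idx) == 9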
Figure 5.2: A schematic illustration of the proposed algorithm for detecting multiple sclerosis pathology on normal-appearing brain tissues using a latent hierarchical myelin-T1w feature representation. (Pipeline: myelin and T1w image datasets from 55 RRMS and 44 normal subjects → extraction of normal-appearing 3D patches in the MNI152 space → joint myelin-T1w patch-level feature learning → image-level feature vector construction → sparse representations → supervised random forest classification into multiple sclerosis vs. normal.)

5.2.5 Normal-appearing patch extraction

Instead of using all voxels in an image, patch extraction is commonly used for medical image classification to improve discriminative task accuracy and to reduce computational burden [Wu et al., 2015a]. We extract discriminative candidate patches on normal-appearing brain tissue from the myelin and T1w images in the MNI152 template space using a voxel-wise t-test to determine the statistical significance of the group difference between MS and NC images, as similarly done in previous studies [Suk et al., 2014, Tong et al., 2014] for AD/MCI diagnosis. The voxel-wise t-test results for each modality (myelin and T1w) are shown in Figure 5.3. Based on the voxel-wise t-test, the voxels with individual p-values lower than 0.05 are selected as the centers of candidate patches. The mean p-value for each candidate patch is then computed.

Figure 5.3: Voxel-wise t-test results displayed in the MNI152 template showing the most discriminative locations between 55 RRMS patients and 44 normal controls. The red areas indicate statistical significance (p < 0.05). Most of the voxels selected from the myelin maps are located in cerebral white matter regions (the top three images), while most selected from the T1w images are from the cortex and periventricular areas (the bottom three images).

Starting with the patches with the lowest mean p-values, patches are selected in a greedy manner [Suk et al., 2014] while enforcing an overlap of less than 50% with any previously selected patches. These patches are then further selected by including only those with mean p-values smaller than the average p-value of all candidate patches of both modalities. Finally, the patches overlapping with focal lesions are excluded for each patient in order to retain only the normal-appearing patches. Patch sizes from 7 × 7 × 7 to 11 × 11 × 11 have been suggested to be a good range for capturing local structural information in related work [Suk et al., 2014, Tong et al., 2014, Liu et al., 2014a]. From this perspective, we chose a patch size of 9 × 9 × 9 for our experiments. From the data in this study, the number of selected patches ranged from 8000 to 10000 depending on the images used for training in each cross-validation fold, and on the amount of lesion present.
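A simplified sketch of this selection procedure is given below; the array names are hypothetical, and the greedy, overlap-constrained patch selection is only summarized in comments.

```python
# Sketch of candidate-patch-center selection via a voxel-wise t-test
# (Section 5.2.5). Illustrative only; array names are placeholders.
import numpy as np
from scipy import stats

def candidate_patch_centers(ms_imgs, nc_imgs, p_thresh=0.05):
    """ms_imgs, nc_imgs: 4D arrays (subjects x X x Y x Z), MNI152 space."""
    _, p = stats.ttest_ind(ms_imgs, nc_imgs, axis=0)  # two-sample t-test
    return np.argwhere(p < p_thresh), p               # candidate centers

# Candidate 9x9x9 patches would then be ranked by mean p-value and
# selected greedily, enforcing <50% overlap with previously selected
# patches, before excluding lesion-overlapping patches per subject.
```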
Examples of the selected patches in the MNI152 template are displayed in Figure 5.4.

Figure 5.4: Discriminative patches in the MNI152 template that were extracted from the scans of 50 RRMS patients and 40 normal controls used as training data in one cross-validation fold, using the patch extraction method described in Section 5.2.5. In this figure, the patches have been rescaled from 9 × 9 × 9 to 3 × 3 × 3 for the purpose of visualization. The patches are used for training the multimodal unsupervised deep learning network with the goal of extracting features that can be used to detect MS pathology on normal-appearing tissues.

Figure 5.5: The multimodal deep learning network architecture used to extract a joint myelin-T1w feature representation. (N = 729 visible units per modality; modality-specific hidden layers of K1 = 500 and K2 = 500 units for T1w and myelin, feeding a multimodal layer of K3 = 1000 units topped by K4 = 100 units.)

5.2.6 Unsupervised deep learning of joint myelin-T1w features

The network architecture (Figure 5.5) for unsupervised deep learning consists of two modality-specific DBNs, one for myelin features and the other for T1w features, which are fed into a joint network that learns multimodal features. The number of network layers and number of hidden units were determined from previous literature [Suk et al., 2014, Yoo et al., 2014] and a widely used guide [Hinton, 2012] for training RBMs.

We convert the selected patches into one-dimensional vectors v_1, v_2, … ∈ R^D with D = 729. The number of feature vectors from each image depends on the number of excluded patches due to the presence of lesions. We learn features for the myelin and T1w input vectors independently by using a Gaussian-Bernoulli RBM [Krizhevsky and Hinton, 2009] for each modality. Each RBM has real-valued visible units v of dimension D = 729, binary hidden units h of dimension K1 = 500, and symmetric connections between these two layers as represented by a weight matrix W ∈ R^{D×K1}. The energy function of a Gaussian-Bernoulli RBM [Krizhevsky and Hinton, 2009] is defined as:

E(\mathbf{v},\mathbf{h}) = \sum_{i=1}^{D} \frac{(v_i - c_i)^2}{2\sigma_i^2} - \sum_{j=1}^{K_1} b_j h_j - \sum_{i=1}^{D} \sum_{j=1}^{K_1} \frac{v_i}{\sigma_i} W_{ij} h_j,    (5.1)

where b_j is the bias for the j-th hidden unit (b ∈ R^{K1}), σ_i is the variance term for the i-th visible unit, and c_i is the bias for the i-th visible unit (c ∈ R^D). The variance term is set to 1 by standardizing the dataset as described in Section 5.2.2. The units of the binary hidden layer (conditioned on the visible layer) are independent Bernoulli random variables P(h_j = 1 | v) = σ(Σ_i W_{ij} v_i + b_j), where σ(s) = 1/(1 + exp(−s)) is the sigmoid function. The visible units (conditioned on the hidden layer) are D independent Gaussians with diagonal covariance: P(v_i | h) = N(Σ_j W_{ij} h_j + c_i, 1). In order to capture higher-level correlations between the first-level features, another layer of binary hidden units of dimension K2 = 500 is stacked on top of each RBM to form a DBN for each modality. We follow a standard layer-by-layer approach for training a DBN [Hinton et al., 2006], in which each RBM adopts the previous layer's activations as its input. Figure 5.6 shows a large variety of spatial features learned by this network from both myelin and T1w images, which supports the hypothesis that myelin maps contain potentially useful structural information.

Figure 5.6: Features at two RBM layers learned from myelin images (top) and T1w images (bottom). The deep network is able to learn a large variety of spatial features from both myelin and T1w images, which supports the hypothesis that myelin maps contain potentially useful structural information for detecting MS pathology.
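To make the formulation above concrete, the following minimal sketch evaluates the two conditional distributions of Equation 5.1 with σ_i = 1; the weights here are random placeholders rather than parameters trained on our data.

```python
# Numerical sketch of the Gaussian-Bernoulli RBM conditionals with
# sigma_i = 1 (Eq. 5.1). Weights are untrained placeholders.
import numpy as np

rng = np.random.default_rng(0)
D, K1 = 729, 500                       # 9x9x9 patch, first hidden layer
W = rng.normal(0.0, 0.01, size=(D, K1))
b, c = np.zeros(K1), np.zeros(D)       # hidden and visible biases

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

v = rng.normal(size=D)                 # a standardized input patch
p_h = sigmoid(v @ W + b)               # P(h_j = 1 | v)
h = (rng.random(K1) < p_h).astype(float)
v_mean = W @ h + c                     # mean of P(v_i | h) = N(., 1)
```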
We next build a joint model (Figure 5.5) that finds multimodal myelin and T1w patterns by modeling the joint distribution between myelin and T1w features. We form a multimodal DBN by adding a layer of K3 = 1000 binary hidden units that are connected to both the myelin and the T1w DBNs, thereby combining their second-layer activations. Finally, we model higher-level multimodal features by stacking another layer of binary hidden units on top of the multimodal RBM. For this multimodal layer, the dimensionality is reduced to K4 = 100 for each patch.

We perform contrastive divergence [Hinton et al., 2006] to approximate gradient descent to update the weights and biases during training. To avoid the difficulty of setting a fixed learning rate and decay schedule, we apply AdaDelta [Zeiler, 2012], which adaptively determines the learning rate for each model parameter and improves the chances of convergence to a global minimum. Given the high dimensionality of the feature vectors and the inherent risk of overfitting to the training data, we use two common regularization approaches during training, consisting of weight decay [Hinton, 2012] with the penalty coefficient 0.0002, which penalizes large weights, and dropout [Srivastava et al., 2014] with a probability of 0.5, which randomly drops hidden units to simulate the effect of using many "thinned" networks to produce an average solution.

5.2.7 Image-level feature vector construction and random forest training

For input into a supervised image-level classifier, single-modality or multi-modality features can be used. Single-modality feature vectors can be constructed by concatenating the second-layer activations from the individual myelin and T1w DBNs for all normal-appearing patches. Joint multimodal feature vectors can be constructed by concatenating the top-level hidden unit activations of the multimodal DBN. For the patches excluded due to lesions, we model each missing feature element using a normal distribution N(μ_i, σ_i) whose parameters μ_i and σ_i are estimated as the mean and SD of all feature values from the training dataset, where i = 1, …, m and m = P × K4, and P is the number of patches and K4 is the number of units in the top level. We then impute each missing feature element with a value sampled from the normal distribution, as previously described [Marlin, 2008].

Since the feature dimension is very high (P × K4, depending on the number of patches P) relative to the number of training samples (90), we construct a sparse representation using a linear regression model to reduce the risk of overfitting during supervised training. In previous work by Kim et al. [2013b], it has been shown that applying LASSO [Tibshirani, 1996] as a feature selection method to reduce the dimensionality of the latent features learned by DBNs is beneficial to classification performance. Accordingly, we also explore the impact of using LASSO in our framework. More specifically, LASSO employs the following objective function:

\min_{\mathbf{q}} \| X\mathbf{q} - \mathbf{y} \|_2^2 + \lambda_1 \| \mathbf{q} \|_1,    (5.2)

where X ∈ R^{T×m} and y ∈ R^{T×1} denote the data matrix and the label vector respectively, and T is the number of subjects used for training. The vector q ∈ R^{m×1} holds the regression coefficients and λ_1 is a regularization parameter that controls the sparsity of the model. After LASSO, the non-zero elements in the regression coefficient vector q are used to select the covariates to form a sparse feature representation for each image.

Finally, given the feature vectors and labels, we train a random forest [Breiman, 2001] using the information gain to measure the quality of a split for each node.
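A sketch of this feature-selection and classification stage using scikit-learn is shown below. The feature matrix here is a random placeholder for the imputed image-level feature vectors, the penalty value stands in for the λ_1 found by nested cross-validation (Section 5.2.8), and scikit-learn scales the data term of Equation 5.2 by a constant factor.

```python
# Sketch of LASSO feature selection followed by random forest training
# (Section 5.2.7). Placeholder data; not the thesis implementation.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2000))            # imputed image-level features
y = np.array([1] * 50 + [0] * 40)          # 50 MS, 40 NC training labels

lasso = Lasso(alpha=0.01).fit(X, y)        # alpha stands in for lambda_1
selected = np.flatnonzero(lasso.coef_)     # keep non-zero coefficients

rf = RandomForestClassifier(
    n_estimators=10**5,                    # value found in Section 5.2.8
    max_depth=30,
    criterion="entropy",                   # information-gain splits
    oob_score=True,                        # out-of-bag generalization error
    n_jobs=-1,
).fit(X[:, selected], y)
```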
We also compute the relative importance of each patch for classification between MS and NC by permuting the features for each patch among the training data and computing the generalization error as measured by the out-of-bag error [Breiman, 2001]. Then, we relate the feature importance to anatomical regions by using the Harvard-Oxford sub-cortical structural atlas [Desikan et al., 2006], which was derived from the MNI152 template [Mazziotta et al., 2001], and the central voxel of each patch, to enhance the interpretability of the results. The feature importance for each anatomical region is determined by averaging feature importance values from every patch belonging to each anatomical region. The procedure for determining the random forest and LASSO parameters is described in the following section.

5.2.8 Determining random forest and LASSO parameters

The number of decision trees and their depth determine the generalizability of the random forest. In general, overly shallow trees lead to underfitting while overly deep trees lead to overfitting. We found that tree depths between 20 and 40 produced almost identical out-of-bag errors in our case. From this perspective, the tree depth was empirically set to 30 to avoid under- and overfitting. To determine a suitable number of trees, we started with 10 and increased it by a step size of 0.2 on a log scale until we observed a stabilization in the out-of-bag error (Figure 5.7) using the entire dataset. We determined an appropriate value of 10^5, which was used for all of our experiments.

After fixing the random forest parameters, we performed a nested cross-validation procedure to determine the LASSO regularization parameter λ1, which we expected to vary between cross-validation folds. For each of the 11 folds, we performed a nested cross-validation with 10 inner folds. For each inner fold, we varied λ1 between 10^−7 and 1 with a step size of 1 on a log scale, and the λ1 that produced the best mean MS/NC classification accuracy was used for the outer fold.

Figure 5.7: Influence of the number of decision trees on the generalizability of MS/NC classification, as measured by the out-of-bag error, on normal-appearing brain tissues.

5.3 Results

5.3.1 Performance evaluation

Let TP, TN, FP and FN denote True Positive, True Negative, False Positive and False Negative, respectively. The ability of our proposed method to extract discriminative features was evaluated by using the deep-learned features of normal-appearing brain tissues to classify each subject as MS or NC, and measuring several different aspects of classification performance:

• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Sensitivity = TP / (TP + FN)
• Specificity = TN / (TN + FP)
• Area under the receiver operating characteristic curve (AUC)

We performed an 11-fold cross-validation procedure as described in Section 5.2.3. We performed the patch selection, unsupervised deep learning and random forest training using only the training data for each fold.

We used three regional features as baseline comparators: the regional mean T1w intensity, the regional mean myelin content, and the regional mean myelin-T1w features, which were formed by concatenation of the myelin content and T1w intensity feature vectors for each image. All regional means were computed on the same class-discriminative patches as used for unsupervised deep learning.
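The four evaluation measures above can be computed as in the following sketch (illustrative; not the evaluation code used in this work).

```python
# Sketch: computing the four evaluation measures from predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """y_pred: hard labels; y_score: classifier probabilities for AUC."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "auc": roc_auc_score(y_true, y_score),
    }
```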
We independently trained the random forest classifier for each mean-based feature. To determine the LASSO regularization parameter λ1, we performed the same nested cross-validation procedure described in Section 5.2.8. The top three rows of Table 5.1 show the classification performance for each mean-based feature type. To analyze the effect of LASSO, the supervised training was initially done without this regularization. Rows 4 to 6 of Table 5.1 show the classification performance of the three mean-based features when including the LASSO regularization.

Table 5.1: Performance comparison (%) between 6 different feature types with and without LASSO for MS/NC classification on normal-appearing brain tissues. We performed an 11-fold cross-validation on 55 RRMS and 44 NC images and computed the average performance (and standard deviation) for each feature type. The highest value for each measure is in bold. Overall, deep learning improved the classification results over the regional mean-based features across all four measures. In addition, LASSO had a positive effect, but more so for the regional mean-based features than the deep-learned features.

Feature type       Accuracy     Sensitivity  Specificity  AUC
Regional mean without LASSO
  T1w intensity    63.6 (16.5)  74.5 (18.7)  50.0 (28.2)  62.3 (16.9)
  Myelin content   72.7 (13.7)  74.6 (18.7)  68.2 (17.9)  72.3 (13.8)
  Myelin-T1w       67.7 (8.8)   72.7 (22.5)  61.4 (14.7)  67.1 (9.2)
Regional mean with LASSO
  T1w intensity    66.7 (10.6)  76.4 (17.2)  54.5 (24.1)  65.5 (11.5)
  Myelin content   73.7 (13.7)  76.4 (18.7)  70.5 (17.9)  73.4 (12.6)
  Myelin-T1w       70.7 (12.8)  70.9 (21.4)  70.5 (20.8)  70.7 (12.0)
Deep-learned without LASSO
  T1w              70.1 (13.6)  81.8 (20.9)  56.8 (22.3)  69.3 (13.7)
  Myelin           83.8 (11.0)  85.5 (18.0)  81.8 (14.4)  83.6 (10.5)
  Myelin-T1w       86.9 (9.3)   85.5 (15.0)  88.6 (12.5)  87.0 (9.0)
Deep-learned with LASSO
  T1w              70.1 (13.6)  81.8 (20.9)  56.8 (22.3)  69.3 (13.7)
  Myelin           83.8 (11.0)  85.5 (18.0)  81.8 (14.4)  83.6 (10.5)
  Myelin-T1w       87.9 (8.4)   87.3 (12.9)  88.6 (12.5)  88.0 (8.5)

To determine the effectiveness of feature extraction by deep learning, we compared three deep-learned feature types: deep-learned T1w features, deep-learned myelin features, and the output of the multimodal DBN, which combines the deep-learned myelin and T1w features. These features were also tested with and without LASSO regularization, with the results shown in Table 5.1. Overall, deep learning improved the classification results over the regional mean-based features across all four evaluation metrics. In addition, LASSO had a positive effect, but more so for the mean-based features than the deep-learned features. The accuracy rate attained by the deep-learned myelin-T1w features with LASSO was statistically better than the ones attained by all of the regional features and the deep-learned T1w features with LASSO (p < 0.01, two-sided Wilcoxon test), but it was not statistically better than the one attained by the deep-learned myelin features with LASSO.

5.3.2 Separate analysis in NAWM and NAGM

To determine the relative contributions of white and gray matter to classification performance, we evaluated each deep-learned feature type on predominantly NAWM and NAGM separately. Since LASSO proved to be beneficial for the previous experiments, we applied it to all of the regional NAWM and NAGM analyses. Using the WM and GM masks computed from the T1w MNI152 template, we excluded all patches that did not overlap with WM or GM. For the normal-appearing patches that overlapped with both the WM and GM masks, we labeled each patch as a NAWM patch if the WM voxel count was larger than the GM voxel count, and vice versa.
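This labeling rule can be sketched as follows; the mask names and patch representation are hypothetical, and the behaviour on exact ties is an assumption, since the text does not specify it.

```python
# Sketch of the NAWM/NAGM majority-vote patch labeling rule.
import numpy as np

def label_patch(wm_mask, gm_mask, patch):
    """patch: tuple of slices locating one 9x9x9 normal-appearing patch."""
    wm_count = int(wm_mask[patch].sum())
    gm_count = int(gm_mask[patch].sum())
    # ties fall through to NAGM here; the thesis does not specify ties
    return "NAWM" if wm_count > gm_count else "NAGM"
```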
Table 5.2: Separate analysis results in NAWM and NAGM. The table shows a performance comparison (%) between deep-learned features for MS/NC classification. We performed an 11-fold cross-validation on 55 RRMS and 44 NC images and computed the average performance (and standard deviation).

Feature type   Accuracy     Sensitivity  Specificity  AUC
Deep-learned on NAWM
  T1w          66.7 (7.0)   78.2 (19.2)  56.8 (24.1)  67.3 (10.3)
  Myelin       82.8 (11.5)  81.8 (19.9)  84.1 (11.8)  83.2 (10.4)
  Myelin-T1w   74.7 (13.5)  74.5 (17.6)  75.0 (16.3)  74.8 (13.7)
Deep-learned on NAGM
  T1w          68.7 (6.7)   78.2 (19.1)  59.1 (22.0)  68.9 (9.9)
  Myelin       80.8 (12.6)  85.5 (15.0)  75.0 (21.3)  80.2 (13.0)
  Myelin-T1w   73.7 (14.4)  74.5 (22.5)  72.7 (14.7)  73.6 (14.7)

The separate analysis results computed with each deep-learned feature type are summarized in Table 5.2. For both NAWM and NAGM, the deep-learned myelin features alone provided the best overall classification performance. Using NAWM patches gave higher performance than that attained by using NAGM patches for the myelin images, while for the T1w images, using NAGM patches gave better performance. This is consistent with the observations that the most discriminative patches in the myelin images came from subcortical WM, while the most discriminative patches in the T1w images came from the cortical and periventricular regions. Overall, the maximum classification performance of using NAWM and NAGM patches separately did not approach that of using all normal-appearing patches together.

5.4 Discussion

The regional mean myelin content features were more discriminative than the regional mean T1w intensity features, which is not surprising given that the T1w sequence is a structural MR sequence designed to show tissue contrast and not for direct quantification. The mean myelin features achieved mean classification performance rates of 73.7% (accuracy, SD 13.7%) and 73.4% (AUC, SD 12.6%) with LASSO, which are approximately 7% (accuracy) and 8% (AUC) higher than those of the regional mean T1w intensity features. However, the regional combined mean myelin-T1w features produced mean classification performance rates of 70.7% for both accuracy and AUC, showing that direct concatenation of the regional mean myelin content and T1w intensity features resulted in reduced classification performance when compared to mean myelin content alone, largely due to reduced specificity, but were still better than the performance achieved using T1w intensity features alone.

Applying LASSO as a feature selection method improved the classification performance for the regional mean features. When including LASSO regularization in the supervised classifier for the regional mean features, the feature dimensionality reduction rate by LASSO was about 60–80%. For the regional mean T1w intensities, LASSO improved classification performance rates by approximately 3% for both mean accuracy and AUC. The impact of LASSO was smaller for the regional mean myelin features, resulting in about a 1% improvement in classification accuracy.
LASSO also improved the classification performance of the regional combined mean myelin-T1w features, but not beyond that attained by the regional mean myelin features, again suggesting that direct concatenation of heterogeneous modalities is not an effective strategy for improving the classification performance. Overall, the regional mean myelin content feature type was the most accurate, sensitive, and specific regional mean MRI biomarker for distinguishing between MS and NC on normal-appearing brain tissues.

Unsupervised deep learning of the regional myelin contents and T1w intensities yielded superior classification performance over using the regional mean myelin contents and T1w intensities. Without LASSO, the deep-learned T1w features improved the classification performance by about 6% in both mean accuracy and mean AUC over the regional mean T1w intensity features. In addition, the SDs for accuracy and AUC decreased by approximately 3%, showing a more consistent performance across folds. Similarly, the deep-learned myelin features improved the classification performance over the regional mean myelin content features by about 11% in both mean accuracy and mean AUC, demonstrating that spatial feature learning of myelin maps by unsupervised deep learning can produce radiologically useful information associated with MS pathology. Similarly to the deep-learned T1w features, the SDs for accuracy and AUC also decreased by about 3%.

The joint deep-learned regional myelin-T1w features were more discriminative than either of the modality-specific deep-learned feature types, and improved accuracy and AUC by about 4% over the deep-learned myelin features, showing that, in contrast to the case of simple concatenation of regional mean T1w and myelin features, deep-learned joint features improved the classification performance, and decreased the SDs by about 2%. Compared to the regional mean myelin-T1w features, the deep-learned multimodal features improved the classification performance by approximately 17% in mean accuracy and AUC.

We observed a relatively small impact when including the LASSO regularization in the supervised classifier for the deep-learned features. The feature dimensionality reduction rate by LASSO was about 30–50%, which is smaller than in the case of the regional mean features, suggesting that the deep-learned features had less redundancy. For both the deep-learned T1w features and the deep-learned myelin features, LASSO did not change the classification performance. For the deep-learned joint myelin-T1w features, LASSO improved the classification performance by about 1% in both mean accuracy and AUC. The impact of LASSO was smaller than for the regional mean features. This could be due to the fact that unsupervised deep learning is already capable of extracting less redundant and more independent feature sets, which reduced the impact of dimensionality reduction by LASSO.

Overall, the proposed deep-learned joint myelin-T1w features provided the best performance, surpassing all other feature types substantially in accuracy, sensitivity, specificity, and AUC. In addition, they significantly reduced the SDs of both accuracy and AUC compared to the regional mean features, showing a more consistent classification performance across folds. When used independently, the deep-learned myelin features also performed well, surpassing all other regional mean features on all four evaluation measures.
All deep-learned features outperformed their regional mean counterparts, with or without LASSO, which indicates that both myelin and T1w modalities contain discriminative latent spatial patterns. Our main conclusion is that the deep-learned myelin features provide valuable pathological information that is more sensitive and specific than the use of regional mean myelin and/or T1w measurements for MS diagnosis on normal-appearing brain tissues, especially when combined jointly with deep-learned T1w features.

We analyzed the effect of separately using predominantly NAWM and NAGM patches on the various deep learning models we built. The deep-learned myelin features extracted from NAWM patches achieved mean classification performance rates of 82.8% in accuracy (SD 11.5%) and 83.2% in AUC (SD 10.4%), which are approximately 2–3% higher in accuracy and AUC than those of the deep-learned myelin features with NAGM patches, suggesting that the deep-learned myelin features are more pathologically relevant to NAWM than NAGM, which is expected due to the greater myelin content in WM. When using NAWM and NAGM patches separately, the variety of feature patterns learned by the T1w-specific network was reduced compared to those learned by the T1w-specific network with all normal-appearing patches, as shown in Figure 5.8.

Figure 5.8: Deep-learned features separately extracted from predominantly NAWM, NAGM and all normal-appearing patches by the T1w modality-specific network. The variety of feature patterns learned by the T1w-specific network with NAWM and NAGM patches is reduced compared to that learned by the T1w-specific network with all normal-appearing patches.

The more limited feature set led to classification accuracy rates of 66.7% with NAWM patches and 68.7% with NAGM patches, which are lower than those of the deep-learned T1w features with all normal-appearing patches (70.1%). Since this limited feature set was used as an input to the multimodal myelin-T1w layer, the deep-learned myelin-T1w features did not improve the classification performance in either NAWM or NAGM patches, as shown in Table 5.2.

To enhance interpretability of the results and examine their relationship to published MS literature, we determined the relative contribution of the deep-learned joint myelin-T1w features in each patch location, and used the Harvard-Oxford sub-cortical atlas to compute the mean importance values in particular sub-cortical regions and structures. As shown in Figure 5.9, the six most discriminative sub-cortical brain regions and structures were found to be the cerebral white matter, lateral ventricles, putamen, thalamus, hippocampus, and amygdala. The most discriminative sub-cortical brain regions were the cerebral white matter and lateral ventricles. The high importance of the cerebral white matter and lateral ventricles is likely due to demyelination in the periventricular region, combined with morphological changes caused by brain atrophy, both of which are strongly associated with MS pathology. The observed importance of the sub-cortical gray matter structures is consistent with previous MS studies (e.g., Hulst and Geurts, 2011), which showed that these specific structures undergo substantial structural and/or chemical changes.

It is important to acknowledge the limitations of our study.
First, due to the relatively small training sample size, this study can only provide preliminary results and does not ensure that the proposed model will generalize to produce the exact same results in other cohorts. Secondly, our study only included RRMS patients and did not include progressive MS patients. The proposed model may extract different features for progressive MS cohorts because patients with progressive MS can have different patterns of demyelination and morphological changes throughout the brain. To evaluate this approach for detecting very early MS pathology, our future work should include patients with clinically isolated syndrome (CIS), a prodromal stage of MS, with the clinical goal of enabling earlier diagnosis.

Figure 5.9: The relative importance of the deep-learned joint myelin-T1w features in different sub-cortical brain areas for RRMS vs. NC classification on normal-appearing brain tissues. The regions shown are the brain stem, pallidum, accumbens, cerebral cortex, caudate, amygdala, hippocampus, thalamus, putamen, lateral ventricle, and cerebral white matter, plotted against relative feature importance (0–0.9).

As we stated above, the T1w sequence is a structural imaging sequence for displaying tissue contrast and not for direct quantification, and this limitation cannot be corrected by intensity normalization, which is likely the main reason why the regional mean T1w intensity features produced relatively low classification accuracy, as shown in Table 5.1. It could be argued that for complementing the myelin scans, quantitative T1 relaxometry may be appropriate. However, a primary goal of this study was to determine whether myelin scans contained spatial features finer than regional means that would be useful for distinguishing MS, so a comparison to regional mean intensity features seems appropriate. We believe the primary reason that deep learning on T1w images produced useful features for classification is that the model captured spatial variabilities in the high-contrast boundaries between normal-appearing tissue and cerebrospinal fluid as induced by atrophy and other morphological changes. This is visually verifiable in Figures 5.6 and 5.8, which show that the features extracted by deep learning from the T1w images are mostly high-contrast edges in various shapes and orientations. For deep learning, the intensity normalization procedure is meant to enable the use of Gaussian visible units and to allow the distribution of these edge features to be modeled more accurately, which generally makes training of the networks faster and more stable, as also stated in Section 5.2.2. In contrast, deep learning appeared to capture more intensity variations in the myelin images, indicative of changes in myelin content, as shown in Figure 5.6, especially in RBM layer 1.

In summary, our experimental results have demonstrated the following for the task of detecting MS pathology on normal-appearing brain tissues:

• The regional mean myelin content features were more discriminative than the regional mean T1w intensity features.
• Direct concatenation of the regional mean myelin content features and the regional mean T1w intensity features did not improve the classification accuracy over using each feature type alone.
• Unsupervised deep learning of the regional myelin contents and T1w intensities yielded superior classification performance over using the regional mean myelin contents and T1w intensities.
• The joint deep-learned regional myelin-T1w features were more discriminative than either of the modality-specific deep-learned feature types.
• Applying LASSO to produce sparser feature representations improved the classification performance for the regional mean features, but the impact of LASSO was marginal for the deep-learned regional features.
• The maximum classification performance of using predominantly NAWM and NAGM patches separately was achieved by the deep-learned regional myelin features. However, it did not approach that of using all normal-appearing patches together.

5.5 Conclusion

We have demonstrated that unsupervised deep learning of normal-appearing brain tissues on myelin and T1w images can extract information that could be useful for early MS detection, and provides superior classification performance to the traditional regional mean MRI measurements when using the same supervised classifier. In addition, we have shown that unsupervised deep learning of joint myelin and T1w features improves the classification performance over deep learning of either modality alone. Using a four-layer multimodal deep learning network to learn latent features, unbiased feature imputation to exclude lesion voxels, a feature selection method (LASSO) to construct sparse feature representations, and a random forest as a supervised classifier, we achieved a mean classification accuracy of 87.9% between RRMS patients and healthy controls on normal-appearing brain tissues, using an 11-fold cross-validation procedure. The local brain structures that were found to be important for MS classification by our method were consistent with known MS pathology and previous MS literature. In future work, we plan to extend the proposed framework to include other MRI modalities used for studying MS pathology, such as multi-echo susceptibility-weighted imaging [Denk and Rauscher, 2010]. We also plan to apply our framework to other subgroups of MS patients and longitudinal data for applications in MS prognostication. As we stated in Section 5.4, for detecting very early MS pathology, the classification model should include patients with CIS, a prodromal stage of MS, with the clinical goal of enabling earlier diagnosis. In the next chapter, we will discuss how deep learning of brain MRIs can be utilized for detecting very early MS pathology in CIS patients, which could help physicians provide more personalized treatment options to patients who have a higher risk of developing clinically definite multiple sclerosis (CDMS). Unfortunately, our current CIS dataset does not include MWI scans, so we will show how deep learning of structural MRIs alone can be used for predicting the individual risk; our future work will include MWI scans to develop a more accurate prediction model, because this chapter has improved our understanding of the usefulness of myelin spatial features for detecting MS pathology on normal-appearing tissue.

Chapter 6

Predicting the Conversion Risk to Multiple Sclerosis from Clinically Isolated Syndrome

MS is a neurological disease with an early course that is characterized by attacks of clinical worsening, separated by variable periods of remission.
The ability to predict the risk of attacks in a given time frame can be used to identify patients who are likely to benefit from more proactive treatment. We aim to determine whether deep learning can extract latent MS lesion features that, when combined with user-defined radiological and clinical measurements, can predict conversion to MS (defined with criteria that include new T2w lesions, new T1w gadolinium-enhancing lesions, and/or new clinical relapse) in patients with early MS symptoms (clinically isolated syndrome), a prodromal stage of MS, more accurately than imaging biomarkers that have been used in clinical studies to evaluate overall disease states, such as lesion volume and brain volume. More specifically, we use convolutional neural networks to extract latent MS lesion patterns that are associated with conversion to definite MS (based on the McDonald 2005 criteria) using lesion masks segmented from baseline MR images. The main challenges are that lesion masks are generally sparse and the number of training samples is small relative to the dimensionality of the images. To cope with sparse voxel data, we propose utilizing the EDT for increasing information density by populating each voxel with a value indicating the distance to the closest lesion boundary. To reduce the risk of overfitting resulting from high image dimensionality, we use a combination of downsampling, unsupervised pretraining, and regularization during training. Detailed analyses of the impact of EDT, unsupervised pretraining and incorporating user-defined clinical measurements on network training are presented. We validate the ability of our model to predict disease activity that is indicative of radiological or clinical conversion to definite MS within two years, using the baseline MRI scans and all available user-defined clinical measurements from 140 subjects in a 7-fold cross-validation procedure. Our results demonstrate the potential benefit of automatic extraction of latent lesion distribution features by deep learning.

6.1 Introduction

MS is a neurological disorder characterized by inflammation, demyelination, and degeneration in the central nervous system. There is increasing evidence that early detection and intervention can improve long-term prognosis. However, the disease course of MS is highly variable, especially in its early stages, and it is difficult to predict which patients will progress more quickly and therefore benefit from more proactive treatment. The McDonald criteria [Polman et al., 2005, 2011], which are a combination of clinical and MRI indicators of disease activity, facilitate the diagnosis of MS in patients who present early symptoms suggestive of MS.

However, predicting which patients will meet a given set of criteria for disease activity within a certain time frame remains a challenge. MRI is invaluable for monitoring and understanding the pathology of MS in vivo from the earliest stages of the disease, but common imaging biomarkers that are used in clinical studies to evaluate overall disease state, such as brain and lesion volumes, are not strongly predictive of future disease activity (e.g. Odenthal and Coulthard, 2015), especially when only baseline measures are available, which is often the case when a patient first requires a diagnosis. Researchers have attempted to define more sophisticated MRI features that are more predictive.
Recently, Wottschel et al. [2015] employed a support vector machine trained on user-defined clinical and radiological features to predict the conversion of CIS, a prodromal stage of MS, to CDMS. The subjects consisted of 74 CIS subjects, of whom 22 (30%) and 33 (44%) developed CDMS within one and three years, respectively. The features included demographic information and clinical measurements at baseline, as well as MRI-derived features such as lesion volume, brain volume and lesion distance from the center of the brain. The reported accuracy of predicting the conversion to CDMS was 71.4% at one year and 68% at three years.

User-defined features typically require expert domain knowledge and a significant amount of trial-and-error to select, and are subject to user bias. An alternate approach is to automatically learn patterns and extract latent features using machine learning. In recent years, deep learning [LeCun et al., 2015] has received much attention due to its use of automated feature extraction to achieve breakthrough success in many applications, some of which involve high-dimensional and complex content such as neuroimaging data. For example, deep learning of neuroimaging data has been used to perform various tasks such as classification between mild cognitive impairment and Alzheimer's disease (e.g., Suk et al., 2014, Liu et al., 2015) and to model pathological variability in MS [Brosch et al., 2014].

In this work, using the baseline MRIs and user-defined radiological and clinical measurements of patients with early symptoms of MS but not yet meeting the McDonald 2005 criteria for MS diagnosis, we aim to predict which patients worsened to meet the conversion criteria within two years. MS exhibits a complex pathology that is still not well understood, but it is known that change in spatial lesion distribution may be an indicator of disease activity [Giorgio et al., 2013]. Our clinical motivation is to discover white matter lesion patterns that may indicate a faster rate of worsening, so that patients who exhibit such patterns can be selected for more personalized treatment. The main computational approach used in this chapter is to employ CNNs to identify latent lesion pattern features whose variability can maximally distinguish those patients at risk of short-term disease activity from those who will remain relatively stable. We also incorporate user-defined radiological features such as brain volume and clinical measurements such as EDSS [Kurtzke, 1983] into the CNN to determine whether these user-defined features improve prediction accuracy. We present a detailed comparison with other prediction models, such as logistic regression and random forests [Breiman, 2001], as applied to the user-defined radiological and clinical measurements. The remainder of this chapter is organized as follows: the demographic information of the dataset is described in Section 6.2.1. This is followed by a description of the MRI acquisition parameters and the preprocessing pipeline in Section 6.2.2, and a description of the user-defined radiological and clinical measurements in Section 6.2.3. We then describe the technical aspects of our prediction model in Section 6.3. In Section 6.4, a detailed analysis of the proposed method is provided, and its prediction performance is evaluated and compared to other prediction models. Additional analyses and experiments are also provided in Section 6.4. Finally, we provide a discussion and conclusions in Section 6.5.
6.2 Materials and Preprocessing

6.2.1 Study participants

From 2009 to 2013, 140 subjects between the ages of 18 and 60 with onset of their first demyelinating symptoms within the previous 180 days were recruited from 12 MS clinics. A minimum of two lesions that were at least 3 mm in diameter on a T2w screening brain MRI was required; one had to be ovoid, periventricular or infratentorial. Cerebrospinal fluid oligoclonal bands, or spinal MRI changes typical of demyelination, were required for subjects over age 50. Key exclusions included: a better explanation for the event; a previous event reasonably attributable to demyelination; or meeting the 2005 McDonald criteria for MS [Polman et al., 2005]. When the study began in 2008, all subjects were diagnosed as having CIS, but in 2010 the McDonald criteria were revised [Polman et al., 2011] to target earlier detection, and 42 subjects (30%) would have been considered to have MS at baseline. While the diagnostic criteria for MS changed during the study, the 2005 criteria remain valid as a method of confirming MS disease activity. At the end of two years, 80 of the patients had converted to meet the 2005 McDonald criteria, while 60 did not. Baseline characteristics of the participants are summarized in Table 6.1. The study protocol and informed consent were approved by Health Canada and site institutional review boards.

Table 6.1: Baseline characteristics of the participants.

Characteristic                            Values
Mean (SD) age at CISa onset in years      35.9 (9.2)
Number of females                         97 (69%)
Number of white race                      120 (86%)
Median EDSSb (range)                      1.5 (0–4.5)
Median CISa duration (range) in days      84 (21–190)

a clinically isolated syndrome
b expanded disability status scale

6.2.2 MRI acquisition

Cranial PDw and T2w MRIs were acquired for each patient in accordance with a standardized scanning protocol from 12 different sites. The PDw and T2w images were acquired using sequence parameters with TE = 8–19 ms and TR = 2000–3400 ms for PDw, and TE = 78–91 ms and TR = 2800–8000 ms for T2w. The image dimensions are 256 × 256 × 60 with a voxel size of 0.937 × 0.937 × 3.000 mm3. Preprocessing consisted of skull stripping and linear intensity normalization. All images were spatially normalized to a standard template (MNI152) [Mazziotta et al., 2001] using affine registration. The T2w and PDw scans were segmented to produce lesion masks via a semi-automated multimodal method [McAusland et al., 2010], which requires a user to place a seed point on each lesion location, but the rest of the segmentation process is fully automatic. This method was previously validated extensively for accuracy (mean Dice coefficient of 80% compared to a gold standard on a large multi-center dataset) and high reproducibility between expert raters. In the current study, we used one expert rater for seeding and another trained expert for quality review, but no post-correction was applied. The mask images were then downsampled to 128 × 128 × 30 with Gaussian pre-filtering to reduce computation overhead and the risk of overfitting during feature learning.
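This downsampling step can be sketched with scipy as below; the Gaussian filter width is an assumption, as it is not specified in the text.

```python
# Sketch of mask downsampling (256x256x60 -> 128x128x30) with Gaussian
# pre-filtering to reduce aliasing. The filter width is an assumed value.
import numpy as np
from scipy import ndimage

def downsample(volume, factor=0.5, sigma=1.0):
    smoothed = ndimage.gaussian_filter(volume.astype(float), sigma=sigma)
    return ndimage.zoom(smoothed, zoom=factor, order=1)

mask = np.zeros((256, 256, 60))
mask[100:110, 120:130, 28:32] = 1.0        # a toy lesion
small = downsample(mask)                   # shape (128, 128, 30)
```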
6.2.3 User-defined MRI and clinical measurements

Subjects had clinical assessments performed at baseline by qualified neurologists from 12 different sites. All MRI data for each subject were evaluated and assessed with three user-defined MRI outcomes measured at baseline: T2w lesion volume, normalized whole brain volume and diffusely abnormal white matter (DAWM) [Laule et al., 2013]. The brain volume is measured as the ratio of brain parenchymal volume to total intracranial volume, which is called brain parenchymal fraction (BPF) [Rudick et al., 2000]. DAWM is defined as white matter regions with mild MRI hyperintensity, ill-defined boundaries and reduced myelin content, which may be associated with MS disability and progression [Laule et al., 2013]. Due to the lack of separation between DAWM and NAWM, the DAWM outcome is reported as a binary measure reflecting presence (1) or absence (0). In addition, there were 8 demographic and clinical variables. We employed gender information as a predictive variable because the prevalence of MS is known to be higher in women (e.g., Orton et al., 2006). We also utilized five anatomical regions (cerebrum, optic nerve, cerebellum, brain stem, and spinal cord), which were used by the evaluating physicians to classify the location(s) of each patient's initial CIS event. EDSS is a measure of neurologic impairment in MS that is the combination of grades within 8 functional systems such as those located in the cerebellum and the brain stem [Kurtzke, 1983]. Finally, the clinical classification of presenting symptoms as either monofocal (indicative of a single lesion) or multifocal (indicative of more than one lesion) was employed as a predictor. In total, 11 user-defined measurements at baseline were used, which are summarized in Table 6.2.

Table 6.2: The 11 user-defined MRI and clinical measurements at baseline that were used for predicting short-term (2 years) future disease activity (conversion to definite MS based on the McDonald 2005 criteria) in patients with early symptoms of MS.

Nomenclature  Measures (predictors)                                            Data type
BOD           Burden of disease (T2w lesion volume)                            Float
BPF           Brain parenchymal fraction (the ratio of brain parenchymal
              volume to total intracranial volume)                             Float
DAWM          Diffusely abnormal white matter (Yes=1, No=0)                    Integer
gender        Gender (Female=1, Male=0)                                        Integer
cerebrum      Initial CISa event at cerebrum* (Yes=1, No=0)                    Integer
opticnerve    Initial CISa event at optic nerve* (Yes=1, No=0)                 Integer
cerebellum    Initial CISa event at cerebellum* (Yes=1, No=0)                  Integer
brainstem     Initial CISa event at brain stem* (Yes=1, No=0)                  Integer
spinalcord    Initial CISa event at spinal cord* (Yes=1, No=0)                 Integer
edss          EDSSb score                                                      Float
cistype       Is the CISa at onset monofocal (=0) or multifocal (=1)?          Integer

a clinically isolated syndrome
b expanded disability status scale
* The anatomical location that a physician determined to be the most likely location of the initial attack.

6.3 Methods

6.3.1 The CNN architecture

Our CNN architecture is a 9-layer model (Figure 6.1), consisting of three 3D convolutional layers interleaved with three max-pooling layers, followed by two fully connected (fc) layers, and finally a logistic regression output layer. The first convolutional layer has 12 filters with kernel size 7 × 7 × 7. The second convolutional layer has 24 filters with kernel size 5 × 5 × 5. The third convolutional layer has 48 filters with kernel size 3 × 3 × 3. We utilize a relatively small number of convolutional filters to reduce the risk of overfitting. The first fc layer has 1000 hidden units and the second fc layer has 100 hidden units.
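For illustration, this architecture can be re-expressed in PyTorch as below; the thesis implementation used Theano and cuDNN, and details such as padding and the fully-connected-layer activations are assumptions where the text leaves them open.

```python
# Illustrative PyTorch re-expression of the 9-layer CNN (Section 6.3.1).
# Not the original Theano implementation; padding and fc activations
# are assumptions. Input: 1-channel 128x128x30 distance-transformed masks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LesionCNN(nn.Module):
    def __init__(self, n_clinical=0):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 12, 7), nn.LeakyReLU(0.3), nn.MaxPool3d(2),
            nn.Conv3d(12, 24, 5), nn.LeakyReLU(0.3), nn.MaxPool3d(2),
            nn.Conv3d(24, 48, 3), nn.LeakyReLU(0.3), nn.MaxPool3d(2),
        )
        self.fc1 = nn.LazyLinear(1000)            # first fc layer
        self.fc2 = nn.Linear(1000 + n_clinical, 100)
        self.out = nn.Linear(100, 2)              # logistic-regression layer

    def forward(self, x, clinical=None):
        z = self.features(x).flatten(1)
        z = F.leaky_relu(self.fc1(z), 0.3)
        if clinical is not None:                  # fuse the (replicated and
            z = torch.cat([z, clinical], dim=1)   # scaled) measurements
        z = F.leaky_relu(self.fc2(z), 0.3)
        return self.out(z)

model = LesionCNN(n_clinical=165)                 # 11 measurements x 15
logits = model(torch.randn(2, 1, 128, 128, 30), torch.randn(2, 165))
```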
6.3.2 Euclidean distance transform of lesion masks

MS lesions typically occupy a very small percentage of a brain image, and as a result the binary lesion masks contain mostly zeros. From our preliminary experiments, we observed that the CNN model learns mostly noisy patterns from the binary lesion masks, which is likely due to the fact that sparse lesion voxels can be ignored or deformed into noise spikes by the various stages of convolution and pooling operations during training. As described in Section 6.4, the training and test results show that the binary lesion masks are not appropriate as the input to the CNN model. We could have also used raw MR images as the input, but the lesion voxels would almost certainly be lost in the learning process due to their sparsity. To overcome this problem, we propose increasing the density of information in the lesion masks by the EDT [Maurer et al., 2003], which measures the Euclidean distance between each voxel and the closest lesion. The EDTs of the binary lesion masks form the input to our CNN model. From Figure 6.1, we can see examples of how the spatial distribution of the lesions is densely captured and better represented than in the original binary masks. The impact of the transform on training a deep learning network will be presented in Section 6.4.

Figure 6.1: The proposed CNN architecture (fc = fully connected layer) for predicting short-term (2 years) future disease activity (conversion to definite MS based on the McDonald 2005 criteria) in patients with early symptoms of MS. The Euclidean distance transform is used for increasing information density from sparse lesion masks. The CNN extracts latent lesion features and also incorporates user-defined MRI and clinical measurements at the second fc layer. The feature dimensionality and dynamic range between the learned latent lesion features and the user-defined measurements are significantly different, which can lead to unstable model training. To compensate for this mismatch, we perform feature replication and scaling for the user-defined measurements during model training (described in Section 6.3.5). (Pipeline: lesion mask dataset → Euclidean distance transform → 7×7×7 convolution, 12 filters → max-pool, /2 → 5×5×5 convolution, 24 filters → max-pool, /2 → 3×3×3 convolution, 48 filters → max-pool, /2 → fc 1000 → fc 100 → logistic regression layer, with the user-defined measurements entering after feature replication and scaling.)
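The transform can be sketched as follows; scipy computes distances to the nearest zero voxel, so the binary mask is inverted first, and the voxel spacing shown is an assumption based on the acquisition parameters in Section 6.2.2.

```python
# Sketch of the EDT input encoding: each voxel receives its Euclidean
# distance (in mm) to the nearest lesion voxel.
import numpy as np
from scipy.ndimage import distance_transform_edt

def lesion_edt(lesion_mask, spacing=(0.937, 0.937, 3.0)):
    """lesion_mask: binary 3D array with 1 inside lesions."""
    # distance_transform_edt measures distance to the nearest zero,
    # so invert the mask: background voxels -> distance to a lesion.
    return distance_transform_edt(lesion_mask == 0, sampling=spacing)

mask = np.zeros((128, 128, 30))
mask[60:64, 60:64, 14:16] = 1
dist = lesion_edt(mask)        # 0 inside lesions, growing with distance
```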
6.3.3 CNN training

It has been shown that unsupervised pretraining can improve the optimization performance of supervised deep networks, especially when training sets are limited, which often happens in the medical imaging domain [Tajbakhsh et al., 2016], but the gains are dependent on data properties. We investigated the impact of using a 3D convolutional DBN for pretraining to initialize our CNN model. Our convolutional DBN has the same network architecture as the convolutional and pooling layers of our CNN. For our DBN and CNN, we used the leaky rectified non-linearity [Maas et al., 2013] (negative slope α = 0.3), which is designed to prevent the problem associated with non-leaky units failing to reactivate after encountering certain conditions due to large gradient flow. Our convolutional DBN was initialized using a robust method [He et al., 2015] that particularly considers the rectified non-linearity and has been shown to allow successful training of deep networks on natural images, and trained using contrastive divergence [Lee et al., 2011]. To analyze the influence of EDT and pretraining on supervised training, we trained our CNN under four conditions: no EDT and no pretraining, no EDT with pretraining, EDT without pretraining, and EDT with pretraining. For all four experiments, we used negative log-likelihood maximization with AdaDelta [Zeiler, 2012] (conditioning constant ε = 1e−12 and decay rate ρ = 0.95) and a mini-batch size of 20 for training. Since there are more converters than non-converters in the dataset, the class weights in the cost function (cross entropy) for supervised training were automatically adjusted to be inversely proportional to the class frequencies observed in the training set. We used Theano [Theano Development Team, 2016] and cuDNN [Chetlur et al., 2014] to implement the CNN models.

6.3.4 Data augmentation and regularization

Due to the high dimensionality of the input images relative to the number of samples in the dataset, even after downsampling, the proposed network can suffer from overfitting. Data augmentation is an established approach for reducing the risk of overfitting by artificially creating training samples to increase the dataset size. To generate more training samples, we performed data augmentation by applying random rotations (±3 degrees), translations (±2 mm), and scaling (±2 percent) to the mask images, which increased the number of training images fourfold. To regularize training, we applied dropout [Srivastava et al., 2014] with p = 0.5, weight decay (L2-norm regularization) with penalty coefficient 2e−3 and L1-norm regularization with penalty coefficient 1e−6. We empirically determined the L1- and L2-norm parameters using a widely used training guide [Orr and Müller, 2012] to determine appropriate ranges. Finally, we applied early stopping, which also acts as a regularizer to improve the generalization ability, with a convergence target (negative log-likelihood) of 0.59. The convergence target was determined as the point during training when the generalization loss (defined as the relative increase of the test error over the minimum-so-far during training) started to increase, as suggested by Prechelt [2012]. The value of the convergence target was found by cross-validation.

6.3.5 Incorporating user-defined MRI and clinical measurements

We construct a multimodal deep learning prediction model that performs data fusion between the latent lesion features extracted by the CNN and the user-defined MRI and clinical measurements. We do so by incorporating the user-defined measurements into the CNN as depicted in Figure 6.1. Before fusing the user-defined measurements into the CNN, we standardized them to have zero mean and unit standard deviation. The feature dimensionality and dynamic range between the latent lesion features learned by the CNN and the user-defined measurements are significantly different, which can lead to unstable model training. In our case, the dimensionality and the dynamic range of the latent lesion features are much larger than those of the user-defined measurements, so the influence of the user-defined measurements can easily be ignored during model training. To compensate for this mismatch, we perform feature replication and scaling to increase the dimensionality and dynamic range of the user-defined measurements before model training. When creating a feature vector for the user-defined measurements, we replicate and scale each feature element by a replication factor and a scaling factor respectively, as similarly done by Flake [2012] and Riedmiller [2012].
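This step can be sketched as follows; the function name is illustrative, and the factors shown are those selected by the grid search described next.

```python
# Sketch of feature replication and scaling for the 11 standardized
# user-defined measurements before fusion into the CNN (Section 6.3.5).
import numpy as np

def replicate_and_scale(features, replication=15, scale=10.0):
    """Repeat each standardized measurement and enlarge its range."""
    return np.tile(features * scale, replication)

clinical = np.random.default_rng(0).normal(size=11)  # standardized inputs
fused = replicate_and_scale(clinical)                # length 11 * 15 = 165
```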
We determine the feature replication and scale factors used to incorporate the user-defined MRI and clinical measurements by using a grid search. The parameters that produced the highest average prediction accuracy evaluated by a 7-fold cross-validation procedure were determined as optimal. We varied the replication factor over the set of values {1, 5, 10, 15, 20, 25, 30} and the scale factor over the set of values {1, 5, 10, 15, 20, 25}, with the results illustrated by the heat map in Figure 6.2. The parameter pairs (15, 10) and (10, 15), with the first number in each pair being the replication factor and the second number being the scaling factor, both produced the highest prediction accuracy, but we selected (15, 10) because the region of high accuracy around the peak was considerably larger, which indicated that those parameters are more generalizable and robust.

Figure 6.2: Grid search results for optimizing the replication and scale factors for incorporating user-defined MRI and clinical measurements. Interpolated values were used at non-grid points to produce a smooth plot. The optimal parameters were determined as those that produced the highest prediction accuracy in a 7-fold cross-validation procedure. The center of the yellow circle indicates the selected optimal parameters. (Axes: replication factor vs. scale factor; color: cross-validation accuracy in %, ranging from 68 to 75.)

6.4 Experimental Results

We next present an analysis of the impact of EDT and unsupervised pretraining on the extraction of latent lesion features using the CNN described in Section 6.3. Then we present the results of the final prediction model and its comparison to other prediction models using a 7-fold cross-validation procedure in which each fold contained 120 subjects for training and 20 subjects for testing.

6.4.1 Impact of EDT and unsupervised pretraining on CNN extraction of lesion features

To see the impact of EDT on unsupervised pretraining, we computed the root mean squared (RMS) reconstruction error with and without EDT after each epoch during training of the convolutional DBN. We observed that pretraining with EDT converged faster and produced a lower reconstruction error at convergence than pretraining without EDT, as illustrated by Figure 6.3, which plots the first layer's RMS values over 300 epochs.

Figure 6.3: The influence of EDT on unsupervised pretraining for the first convolutional layer. Pretraining with EDT converged faster and produced lower reconstruction error after convergence. The plots show averaged reconstruction errors and standard deviations (shaded area) in a 7-fold cross-validation. EDT also greatly improved consistency across folds.

To analyze the impact of EDT and pretraining on supervised training, we compared four different scenarios, the results of which are shown in Figure 6.4. Without EDT, the prediction errors at convergence were similar between those obtained with and without pretraining. In both cases, the training made little progress in the prediction error on both the training and test sets in each fold, with the main difference being generally more fluctuations during testing.
We determine the feature replication and scaling factors used to incorporate the user-defined MRI and clinical measurements by using a grid search. The parameters that produced the highest average prediction accuracy, evaluated by a 7-fold cross-validation procedure, were determined as optimal. We varied the replication factor over the set {1, 5, 10, 15, 20, 25, 30} and the scaling factor over the set {1, 5, 10, 15, 20, 25}, with the results illustrated by the heat map in Figure 6.2. The parameter pairs (15, 10) and (10, 15), with the first number in each pair being the replication factor and the second being the scaling factor, both produced the highest prediction accuracy, but we selected (15, 10) because the region of high accuracy around that peak was considerably larger, which indicated that those parameters are more generalizable and robust.

Figure 6.2: Grid search results for optimizing the replication and scaling factors for incorporating user-defined MRI and clinical measurements. Interpolated values were used at non-grid points to produce a smooth plot. The optimal parameters were determined as those that produced the highest prediction accuracy in a 7-fold cross-validation procedure. The center of the yellow circle indicates the selected optimal parameters.

6.4 Experimental Results

We next present an analysis of the impact of EDT and unsupervised pretraining on the extraction of latent lesion features using the CNN described in Section 6.3. We then present the results of the final prediction model and its comparison to other prediction models using a 7-fold cross-validation procedure in which each fold contained 120 subjects for training and 20 subjects for testing.

6.4.1 Impact of EDT and unsupervised pretraining on CNN extraction of lesion features

To see the impact of EDT on unsupervised pretraining, we computed the root mean squared (RMS) reconstruction error with and without EDT after each epoch during training of the convolutional DBN. We observed that pretraining with EDT converged faster and produced lower reconstruction error at convergence than pretraining without EDT, as illustrated by Figure 6.3, which plots the first layer's RMS values over 300 epochs.

Figure 6.3: The influence of EDT on unsupervised pretraining for the first convolutional layer. Pretraining with EDT converged faster and produced lower reconstruction error after convergence. The plots show averaged reconstruction errors and standard deviations (shaded area) in a 7-fold cross-validation. EDT also greatly improved consistency across folds.
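For reference, the EDT-based densification applied to each binary lesion mask before it enters the network can be computed as in the sketch below, assuming scipy; the exact variant used in our implementation (e.g., any normalization or inversion of the distance map) may differ.

    import numpy as np
    from scipy import ndimage

    def lesion_mask_to_edt(mask):
        # Convert a sparse binary lesion mask into a dense map in which each
        # voxel holds the Euclidean distance to the nearest lesion voxel
        # (zero inside lesions).  distance_transform_edt computes distances
        # to the nearest zero-valued voxel, so we pass the complement of
        # the lesion mask.
        return ndimage.distance_transform_edt(mask == 0)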
To analyze the impact of EDT and pretraining on supervised training, we compared four different scenarios, the results of which are shown in Figure 6.4. Without EDT, the prediction errors at convergence were similar between those obtained with and without pretraining. In both cases, the training made little progress in the prediction error on both the training and test sets in each fold, with the main difference being generally more fluctuation during testing. Overall, the final prediction errors remained close to the initial errors at the start of training. When using EDT, the optimization converged without pretraining at approximately 500 epochs, but converged with pretraining at approximately 170 epochs. Without pretraining, the prediction errors for the training datasets steadily decreased at a slow rate and then remained constant after approximately 400 epochs. The prediction errors increased for most test sets at an early stage, then started to decrease after approximately 250 epochs and remained constant after approximately 400 epochs. However, the final average prediction error on the test sets was almost the same as the initial prediction error. In contrast, with both EDT and pretraining, the prediction errors on both training and test datasets increased at an early stage, then decreased fairly steadily up to about 150 epochs and remained stable afterward. The final average prediction error on the test sets was approximately 14% lower than the initial prediction error.

Figure 6.4: The influence of EDT and pretraining on supervised training. The four plots in the left box show averaged training costs and standard deviations (shaded area), and the four plots in the right box show averaged prediction errors and standard deviations (shaded area) on both training and test datasets for each epoch during supervised training in a 7-fold cross-validation. Overall, we show that both EDT and unsupervised pretraining are necessary for successful training.

Figure 6.5 shows visualizations of the manifolds produced by the CNN outputs, reduced to two dimensions using t-distributed stochastic neighbor embedding (t-SNE) [Van Der Maaten, 2014]. When EDT and pretraining were not used, the two groups (converters and non-converters) showed poor linear separability in the learned manifold space. The two groups were more distinguishable in the manifold space learned by the CNN with EDT and pretraining.

Figure 6.5: Visualizations showing the influence of EDT and pretraining on the learned manifold space in one cross-validation fold, reduced to two dimensions using t-SNE [Van Der Maaten, 2014]. Each subject in the dataset is represented by a two-dimensional feature vector, and the axes represent the feature element values in the learned low-dimensional map. The converter and non-converter groups, for both the training set and the test set, are more linearly separable in the manifold space when EDT and pretraining are used.

6.4.2 Performance comparison to other prediction models

Let TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively. We considered several different aspects of prediction performance, computed per fold as sketched below:

• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Sensitivity = TP / (TP + FN)
• Specificity = TN / (TN + FP)
• Area under the receiver operating characteristic curve (AUC)
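A small sketch of these measures computed from per-subject test predictions, using scikit-learn only for the AUC:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def performance_measures(y_true, y_prob, threshold=0.5):
        # y_true: binary labels (1 = converter); y_prob: predicted
        # conversion probabilities from the model under evaluation.
        y_true = np.asarray(y_true)
        y_pred = (np.asarray(y_prob) >= threshold).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        return {'accuracy': (tp + tn) / float(len(y_true)),
                'sensitivity': tp / float(tp + fn),
                'specificity': tn / float(tn + fp),
                'auc': roc_auc_score(y_true, y_prob)}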
We used a 7-fold cross-validation procedure in which each fold contained 120 subjects for training and 20 subjects for testing. The number of training images for each fold was increased to 480 by data augmentation. We did not apply data augmentation to the user-defined measurements, so we used the same radiological and clinical measurements for each augmented image, except for BOD. Since we applied scaling to the input lesion masks when augmenting the dataset, BOD values were augmented by calculating the total lesion volume of each augmented lesion mask. For comparison to the user-defined measurements that have been used in clinical studies, such as BOD, BPF and EDSS, two popular classifiers, logistic regression and random forests [Breiman, 2001], were also used.

The prediction performance for each individual user-defined measurement is summarized in Table 6.3 (rows 1 to 11). The highest prediction accuracy (65.0%, SD=14.6%) and AUC (67.6%, SD=14.9%) were obtained by logistic regression with BOD (row 1). We also evaluated the prediction performance of all user-defined measurements together (rows 12 and 13) using multivariable logistic regression and random forests. The multivariable logistic regression model produced comparable accuracy and AUC to BOD alone, while improving sensitivity at the sacrifice of specificity. We observed an improvement in prediction performance of approximately 3% in accuracy over logistic regression with BOD when using a random forest with all user-defined measurements together (row 13). The random forest also decreased the SD by approximately 4% for both accuracy and AUC relative to the logistic regression model with BOD.

The prediction performance for each deep learning prediction model is summarized in Table 6.3 (rows 14 to 18). For comparison purposes, the prediction performance measures achieved by the CNN without user-defined measurements [Yoo et al., 2016] are included (row 14).

Table 6.3: Performance comparison (%) between 18 different prediction models for predicting short-term (2 years) clinical status conversion (based on the McDonald 2005 criteria) in patients with early MS symptoms. The same training parameters were used for all the CNNs. We performed a 7-fold cross-validation procedure on 80 converters and 60 non-converters and computed the average performance (and standard deviation) for each prediction model. Refer to Table 6.2 for the definitions of the predictor variables.

No.  Prediction model                                  EDT*  Unsup.  Accuracy     Sensitivity  Specificity  AUC
1    Logistic regression with BOD                      N/A   N/A     65.0 (14.6)  54.3 (17.2)  80.9 (8.2)   67.6 (14.9)
2    Logistic regression with BPF                      N/A   N/A     53.6 (9.2)   74.1 (10.5)  30.0 (15.6)  49.5 (7.0)
3    Logistic regression with DAWM                     N/A   N/A     50.0 (5.3)   30.0 (8.3)   77.0 (11.5)  53.5 (5.9)
4    Logistic regression with gender                   N/A   N/A     56.4 (10.3)  37.6 (11.8)  80.7 (17.0)  59.1 (12.7)
5    Logistic regression with cerebrum                 N/A   N/A     43.6 (9.1)   15.8 (35.8)  79.1 (38.3)  47.5 (3.9)
6    Logistic regression with optic nerve              N/A   N/A     41.4 (11.2)  46.4 (25.6)  37.2 (29.5)  41.8 (8.8)
7    Logistic regression with cerebellum               N/A   N/A     46.4 (9.9)   10.3 (12.3)  95.2 (8.1)   52.8 (2.6)
8    Logistic regression with brainstem                N/A   N/A     42.6 (12.2)  59.1 (23.6)  23.3 (14.9)  41.2 (10.6)
9    Logistic regression with spinal cord              N/A   N/A     56.4 (9.5)   47.7 (20.6)  62.6 (21.7)  55.1 (9.7)
10   Logistic regression with EDSS                     N/A   N/A     54.3 (8.6)   58.2 (30.0)  45.9 (36.9)  52.1 (8.1)
11   Logistic regression with CIS type                 N/A   N/A     49.3 (7.3)   25.3 (15.1)  79.2 (18.1)  52.2 (6.2)
12   Multivariable logistic regression, all
     user-defined measurements                         N/A   N/A     65.7 (10.8)  61.3 (15.4)  69.5 (18.5)  65.4 (11.1)
13   Random forests, all user-defined measurements     N/A   N/A     67.9 (10.6)  65.9 (14.7)  69.9 (18.9)  67.9 (10.9)
14   CNN without user-defined measurements             Yes   Yes     72.9 (10.3)  78.6 (13.9)  65.1 (16.8)  71.8 (10.2)
15   CNN with all user-defined measurements            No    No      61.4 (11.9)  80.0 (9.4)   36.5 (17.2)  58.2 (12.3)
16   CNN with all user-defined measurements            No    Yes     61.4 (7.9)   80.0 (6.3)   36.7 (11.4)  58.3 (8.1)
17   CNN with all user-defined measurements            Yes   No      66.4 (8.3)   73.6 (10.4)  56.9 (8.5)   65.3 (7.7)
18   CNN with all user-defined measurements            Yes   Yes     75.0 (11.3)  78.7 (12.2)  70.4 (15.4)  74.6 (11.4)

* Euclidean distance transform; Unsup. = unsupervised pretraining

When incorporating user-defined measurements but without using EDT, the CNN (with and without pretraining) produced lower prediction accuracy rates than those attained by the logistic regression model with BOD. In addition, these cases produced high sensitivity but low specificity, possibly due to overfitting on the sparse lesion image data. When EDT was used without pretraining, the prediction accuracy was higher than that of BOD by 1.4%, but the AUC was lower by 2.3%. The gap between sensitivity and specificity was reduced but still remained relatively large. The CNN with EDT and pretraining (row 18) improved the prediction performance by approximately 10% in accuracy and 7% in AUC when compared to the logistic regression model with BOD. In addition, the SDs for both accuracy and AUC decreased by approximately 3–4%, showing more consistent performance across folds. When compared with the random forest with all user-defined measurements (row 13), the CNN with EDT and pretraining improved the prediction performance by approximately 7% in both accuracy and AUC. Overall, incorporating user-defined measurements into the CNN achieved the best prediction performance in both accuracy (75.0%, SD=11.3%) and AUC (74.6%, SD=11.4%) (row 18). The sensitivity and specificity were 78.7% and 70.4%. This model also provided the best balance between sensitivity and specificity, and was the only model with overall better performance than our previous CNN with EDT and pretraining but no user-defined features (row 14).

6.4.3 Late Fusion Approach

For comparison purposes, in order to produce an approximation to late fusion, we have conducted an experiment that combines the CNN features and the user-defined features by averaging the output prediction probabilities of the individual models: the CNN (row 14 in Table 6.3) and the random forest (row 13 in Table 6.3). Using the same 7-fold cross-validation procedure, the late fusion approach achieved average prediction performance rates of 72.9% (accuracy, SD=12.8%), 76.1% (sensitivity, SD=14.0%), 68.7% (specificity, SD=16.8%), and 72.4% (AUC, SD=12.9%). The prediction measures were similar to those achieved by the proposed model (row 18 in Table 6.3), which fuses the CNN features and the user-defined features at an earlier stage, but the overall performance rates were slightly lower.
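This probability-level averaging is straightforward; a minimal sketch, assuming the two trained models expose predicted conversion probabilities, follows.

    import numpy as np

    def late_fusion(cnn_prob, forest_prob):
        # Approximate late fusion: average the output conversion
        # probabilities of the trained CNN and the trained random forest.
        return 0.5 * (np.asarray(cnn_prob) + np.asarray(forest_prob))

    # A subject is predicted to convert if the fused probability is >= 0.5
    fused = late_fusion([0.7, 0.2], [0.6, 0.4])  # placeholder probabilities
    predictions = (fused >= 0.5).astype(int)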
6.4.4 Applying Feature Selection to the User-Defined Features

In machine learning, feature selection is often used in model construction. It can sometimes enhance generalization by reducing overfitting, especially when some features are redundant or irrelevant. For comparison purposes, we have conducted an experiment with a feature selection method applied to the user-defined features. In order to select discriminative features from the 11 user-defined features, we computed the relative importance of each feature for classification between converters and non-converters by permuting the features among the training data and computing the average generalization error, as measured by the out-of-bag error [Breiman, 2001], using a random forest and a 7-fold cross-validation procedure. The computed relative importance is shown in Figure 6.6. We then selected discriminative features by choosing the features whose relative importance was larger than the median importance (0.0459 in our case). The selected features were BOD, BPF, EDSS, gender, spinal cord and cerebellum.

Figure 6.6: The average relative importance of the 11 user-defined features for predicting short-term (2 years) clinical status conversion (based on the McDonald 2005 criteria) in patients with early MS symptoms in a 7-fold cross-validation procedure. A random forest was used for computing the relative importance.
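One way to approximate this permutation-based, out-of-bag importance estimate with scikit-learn is sketched below; the refitting strategy and normalization are simplifications of Breiman's original procedure, and the input X is assumed to be a subjects-by-features array.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def permutation_importance(X, y, n_trees=500, seed=0):
        # Importance of each feature, estimated as the drop in out-of-bag
        # accuracy when that feature's values are permuted among subjects,
        # normalized to a relative importance.
        rng = np.random.RandomState(seed)
        baseline = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                          random_state=seed).fit(X, y).oob_score_
        drops = np.zeros(X.shape[1])
        for j in range(X.shape[1]):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            permuted = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                              random_state=seed).fit(X_perm, y).oob_score_
            drops[j] = max(baseline - permuted, 0.0)
        importance = drops / drops.sum() if drops.sum() > 0 else drops
        selected = importance > np.median(importance)  # keep above-median features
        return importance, selected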
To analyze the impact of feature selection, we separately performed the experiments in Table 6.3 (rows 12, 13 and 18) using the selected features only. The results are summarized in Table 6.4.

Table 6.4: Prediction performance (%) of three different prediction models for predicting short-term (2 years) clinical status conversion (based on the McDonald 2005 criteria) in patients with early MS symptoms. The models were trained only with the user-defined features selected by a random forest (6 features). We performed a 7-fold cross-validation procedure on 80 converters and 60 non-converters and computed the average performance (and standard deviation) for each prediction model.

Prediction model                                               Accuracy    Sensitivity  Specificity  AUC
Multivariable logistic regression with discriminative
user-defined measurements                                      63.6 (7.4)  57.6 (8.3)   71.6 (12.6)  64.6 (7.9)
Random forests with discriminative user-defined measurements   65.7 (9.8)  58.7 (12.7)  75.0 (11.3)  66.8 (9.8)
CNN (EDT*, pretraining) with discriminative user-defined
measurements                                                   73.8 (9.2)  74.1 (10.2)  72.0 (13.0)  73.2 (9.4)

* Euclidean distance transform

The results showed that feature selection improved specificity, but slightly decreased all other performance measures. Feature selection may be useful for achieving greater balance across the given performance measures, but was not definitively better on this dataset. Overall, none of the three models matched the prediction performance of their corresponding models in Table 6.3 that utilized all the available user-defined features. This could suggest that there might be some underlying correlations between the latent lesion features and the less discriminative user-defined features, which could have been identified by the deep learning network, or there could also be subtle relationships between combinations of the low-ranking user-defined features and the clinical outcome.

6.4.5 Replacing the output layer with a random forest classifier

In the proposed method, the output layer is a logistic regression layer that uses a softmax function [Bishop, 2006] to generate a prediction probability. For comparison purposes, we replaced the output layer with a random forest. Since replacing the output layer with a random forest makes the entire network non-differentiable (and thus not end-to-end trainable), we first trained the network with the logistic regression layer, and then trained a random forest classifier on the output activation values of the top fc layer (fc 100) for each subject. When training a random forest with the latent CNN features only, the model achieved average prediction performance rates of 65.7% (accuracy, SD=5.6%), 78.8% (sensitivity, SD=6.3%), 48.2% (specificity, SD=10.4%), and 63.5% (AUC, SD=6.0%). When training with the latent CNN features and the 11 user-defined features, the model achieved average prediction performance rates of 67.9% (accuracy, SD=9.2%), 78.7% (sensitivity, SD=7.8%), 53.4% (specificity, SD=15.0%), and 66.0% (AUC, SD=9.6%). Although both models achieved high sensitivity, the other three performance measures were markedly lower than those achieved when logistic regression was used as the output layer. We conjecture that replacing the output layer with a random forest degraded the prediction performance because the features extracted by the deep learning network were optimized for classification with the logistic regression layer, and were therefore suboptimal for the random forest.
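Schematically, the two-stage replacement looks as follows, assuming scikit-learn and a hypothetical extract_fc100 helper that runs the already-trained CNN up to the fc 100 layer.

    from sklearn.ensemble import RandomForestClassifier

    def forest_on_fc100(extract_fc100, trained_cnn, X_train, y_train, X_test):
        # Stage 1 (done beforehand): train the CNN end-to-end with its
        # logistic regression output layer, then freeze its parameters.
        # Stage 2: treat the fc 100 activations as fixed feature vectors.
        forest = RandomForestClassifier(n_estimators=500, class_weight='balanced')
        forest.fit(extract_fc100(trained_cnn, X_train), y_train)
        return forest.predict_proba(extract_fc100(trained_cnn, X_test))[:, 1]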
6.5 Discussions and Conclusion

The main potential of the proposed prediction model is that it can be used for the prediction of early individual disease course from baseline data. Many previous relevant studies have analyzed group differences between CIS and MS subjects (e.g., Tintore et al., 2006; Dobson et al., 2012). However, it should be noted that single-subject prediction and group difference analysis are very different problems, as they address essentially distinct research hypotheses [Arbabshirani et al., 2017]. The ability to accurately predict the risk of early MS worsening for an individual subject may enable more timely and better-informed treatment options. Using a more tailored prognosis model, physicians would be able to gain much more useful information from patients' MRI and clinical measurements, and give more specialized treatment to each patient to hopefully reduce or delay their future MS symptoms.

A limitation of the present work is that, due to the relatively small training sample size, this study can only provide preliminary results and does not ensure that the learned model will generalize to other cohorts. In particular, the grid search that was used to determine the replication and scaling factors could have resulted in biased parameters, because the dataset was not large enough to create an independent validation set devoted to parameter search. A fully nested cross-validation would be the theoretical solution to avoid this potential bias, but the computation time would have been prohibitive, as a nested grid search procedure would have taken several months with current hardware. Another limitation is some uncertainty in the diagnostic status assigned to 9 subjects. More specifically, these 9 non-converters were only monitored to month 12 because the study finished before they reached the 24-month endpoint. As the trial ended and no further follow-up was conducted, it remains unknown whether these 9 subjects converted between month 12 and month 24. Another clinical limitation is that we did not include the most commonly used clinical imaging predictors of conversion to MS, which are the occurrences of gadolinium-enhancing, juxtacortical, infratentorial and periventricular lesions [Barkhof et al., 1997]. The main reason is that the McDonald criteria include those imaging biomarkers, and we wanted to keep the predictors and target as independent as possible. A comprehensive clinical prediction model would likely include those commonly used imaging biomarkers. For generating the lesion masks, we used a semi-automated multimodal method [McAusland et al., 2010] that performs lesion segmentation based on seed points placed by a trained expert. Only one expert was involved in placing the seed points, and while the segmentation method is robust to seed point placement, a user bias cannot be eliminated as a possibility.

Our current method for incorporating low-dimensional user-defined measurements into a deep learning network adopts a simple approach based on feature replication and scaling. Future work would involve examining more sophisticated strategies studied in the literature, such as augmenting input feature vectors with their squared values to increase the feature dynamic range, and creating an augmented network that has the ability to learn higher-order features, both of which were proposed by Flake [2012]. Instead of replicating feature vectors to increase their dimensionality, an alternative would be to expand the feature space by taking polynomial combinations of feature vectors, which was originally proposed to improve the performance of support vector machines [Bhagavatula et al., 2014]. Recently, Xu et al. [2016] utilized batch normalization [Ioffe and Szegedy, 2015] to incorporate non-image features into a deep learning network for cervical dysplasia diagnosis, to compensate for the different statistical properties of each modality.
However, in our case, the small mini-batch size could limit the efficacy of batch normalization.

In conclusion, we have presented a CNN architecture that learns latent brain lesion features useful for identifying patients with early MS symptoms who are at risk of future disease activity within two years, and that can additionally incorporate user-defined MRI and clinical measurements to further improve prediction performance. We presented methods to overcome the sparsity of lesion image data and the high dimensionality of the images relative to the number of training samples. In particular, we showed that the Euclidean distance transform and unsupervised pretraining are both key steps to successful optimization, when supported by a synergistic combination of data augmentation and regularization strategies. We have also demonstrated that deep learning of MS brain lesion patterns can be used in combination with user-defined measurements to predict the short-term risk of conversion to the 2005 McDonald criteria for MS in patients with early symptoms.

Another clinical demand for enabling effective treatment options in MS patients with early symptoms is to accurately distinguish MS from other diseases that share similar clinical and imaging characteristics, because treatment options can differ significantly and a correct and timely diagnosis is essential for managing patients. In the next chapter, we will discuss how deep learning of structural and quantitative brain MRIs can be used to improve differential diagnosis performance.

Chapter 7

Differentiating Neuromyelitis Optica from Multiple Sclerosis

Neuromyelitis optica spectrum disorder (NMOSD) is a disease of the central nervous system that is often misdiagnosed as MS because the two diseases share similar clinical and radiological characteristics. Two key pathological signs of NMOSD and MS that are detectable on MRI are white matter lesions and alterations in tissue integrity as measured by FA values on DTIs. This chapter proposes a multimodal deep learning model that discovers latent features in brain lesion masks and DTIs for distinguishing NMOSD from MS. The main technical challenge is to optimally extract and integrate features from two very heterogeneous image types (lesion masks and FA maps). Our solution is to first build two modality-specific pathways, each designed to accommodate the expected feature density and scale, and then integrate them into a hierarchical multimodal fusion (HMF) model. The HMF model contains two multimodal fusion layers operating at two different scales, which in turn are joined by a multi-scale fusion layer. We hypothesize that the HMF approach would allow the automatic extraction of joint features of heterogeneous image types to be optimized with greater efficiency and accuracy than the traditional multimodal approach of combining only the top-layer modality-specific features with a single fusion layer. We evaluate the proposed method on 82 NMOSD patients and 52 MS patients in a seven-fold cross-validation. The results show that the HMF approach can improve differential diagnosis performance over the conventional multimodal fusion approach, and that the proposed model is easier and much faster to train, which are practical benefits.

7.1 Introduction

NMOSD and MS are serious diseases of the CNS with similar initial clinical presentations.
Early initiation of therapy is key to preventing attack-related disability in NMOSD, and differentiation from MS is important because some treatments for MS can exacerbate NMOSD [Kim et al., 2015]. However, differential diagnosis is currently difficult because, while a serum-based biomarker for NMOSD exists, it is only moderately sensitive (70–90% in current studies), and so MRI has an increasingly important role in differentiating NMOSD from other inflammatory disorders of the CNS, particularly MS [Kim et al., 2015]. For example, previous DTI studies have shown that tissue integrity, as measured by FA values, is differentially affected by NMOSD and MS in some key brain regions such as the corpus callosum, corticospinal tract, and optic radiation (e.g., Eshaghi et al. [2015]). WM lesions are another MRI abnormality that is common in both NMOSD and MS patients, and descriptive criteria based on lesion location and configuration have been derived by experts to distinguish between the two diseases [Kim et al., 2015], but these guidelines require subjective expertise to apply and would be difficult to validate on a large scale. Thus, an automated and sensitive method for characterizing the differential MRI features on multiple scales of tissue damage (from slight alterations in FA to visible lesions) is highly desired as an additional tool for more accurate and prompt diagnosis.

Recently, machine learning algorithms applied to user-defined features have shown some promise in differentiating MRI scans of patients with NMOSD and MS. The most relevant work has been done by Eshaghi et al. [2015, 2016], who utilized user-defined MRI features such as lesion volume, regional gray matter volume, regional average FA values, and functional connectivity values from resting-state functional MRIs, along with clinical predictors such as cognitive scores. Using classifiers such as support vector machines and random forests, they reported accuracy rates of up to 88% (on 30 NMOSD and 25 MS patients) when using all features, and 80% (on 50 NMOSD and 49 MS patients) when using only imaging features. User-defined features typically require expert domain knowledge to select, and the time and cost required to identify and extract these features tend to rise quickly with the number and complexity of features. An alternate and complementary approach is to automatically learn patterns and extract latent features using machine learning. Deep learning is often used for feature learning, but to our knowledge had not been used for this application.

We herein propose a deep learning model to automatically discover latent brain lesion and FA patterns that can distinguish NMOSD from MS. Our model consists of two modality-specific pathways that are joined by multiple fusion connections into a multimodal pathway (Figure 7.1). Each modality-specific pathway has an architecture that is designed to accommodate the expected feature density and scale. To extract lesion patterns from the generally sparse whole binary image volumes, a CNN is applied, with an approach to handling sparsity and overfitting similar to [Yoo et al., 2016]. For the generally dense FA maps, a patch-based fully-connected (fc) network is used to learn latent tissue integrity features from the most discriminative areas.
Inspired by the multi-layer fusion methods proposed for jointly modeling spatial and temporal information in video data [Karpathy et al., 2014], the two pathways are integrated by two multimodal fusion layers operating at two different scales, which in turn are joined by a multi-scale fusion layer. We hypothesize that this hierarchical multimodal fusion approach is more effective at modeling joint features of heterogeneous image types than the traditional approach of combining only the top-layer features by concatenation followed by a single fusion layer [Ngiam et al., 2011].

7.2 Materials and Preprocessing

Our study included 82 NMOSD and 52 RRMS patients. The median (range) ages of disease onset were 30 (8–48) and 27.5 (14–45) years for the NMOSD and MS cohorts, respectively, and their median (range) disability scores on the EDSS were 2 (0–7) and 2 (0–6.5). Baseline fluid-attenuated inversion recovery (FLAIR) images and DTIs were acquired, and lesion masks were computed from the FLAIR images using an automated method [Jeon et al., 2011]. We used the FSL [Jenkinson et al., 2012] DTI tool to generate the FA images. All images were spatially normalized to the FSL standard templates (MNI152 1mm for FLAIR and FMRIB58 1mm for DTI) by deformable registration using FSL FNIRT.

7.3 Methods

We employ a CNN to discover latent brain MRI lesion features that are sensitive to the pathological differences between the two diseases. Lesions typically occupy only a very small percentage of each image's voxels, which would not be suitable for deep learning, so we transform the data to a denser representation, in this case with an EDT, similarly to [Yoo et al., 2016]. The lesion network consists of three convolutional layers, three max-pooling layers and two fc layers, as depicted in Figure 7.1. We pretrain the convolutional layers using a 3D convolutional DBN and contrastive divergence [Lee et al., 2011].

Figure 7.1: The network architectures for distinguishing NMOSD from MS on brain MRIs: (a) a conventional multimodal architecture; (b) the proposed multimodal architecture, which performs hierarchical (red connections) multimodal (blue connections) fusion. Multimodal fusion occurs at multiple scales to allow heterogeneous imaging features to be combined.

For supervised training, we use the leaky rectified non-linearity [Maas et al., 2013] and negative log-likelihood maximization with AdaDelta [Zeiler, 2012] for adaptively controlling the learning rate. Since there are more NMOSD than MS subjects in the dataset, the class weights in the cost function (cross-entropy) for supervised training are automatically adjusted in each fold to be inversely proportional to the class frequencies observed in the training set. To regularize training, we apply dropout [Srivastava et al., 2014] and weight decay. Finally, we apply early stopping, which also acts as a regularizer to improve generalizability [Prechelt, 2012], with a convergence target of negative log-likelihood that was determined by cross-validation. The convergence target was used to stop training when the generalization loss (defined as the relative increase of the validation error over the minimum-so-far during training) started to increase. The values of the hyperparameters used for training all networks in this chapter are given in Table 7.1, and were determined from previous literature [Yoo et al., 2016] and a widely used RBM and neural network training guide [Grégoire et al., 2012].
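One way to implement this stopping rule is sketched below; the bookkeeping and the small tolerance are illustrative, and the convergence targets are those later listed in Table 7.1.

    def generalization_loss(val_errors):
        # Relative increase of the validation error over the minimum
        # observed so far during training (Prechelt, 2012).
        return val_errors[-1] / min(val_errors) - 1.0

    def should_stop(train_cost, val_errors, target=0.35, tol=1e-3):
        # Stop once the training cost reaches the convergence target chosen
        # by cross-validation (0.45, 0.2 and 0.35 for the lesion, FA and
        # HMF networks, respectively), or once the generalization loss
        # starts to increase beyond a small tolerance.
        return train_cost <= target or generalization_loss(val_errors) > tol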
Table 7.1: Training methods and their hyperparameters used for training all networks in this chapter.

Training method                                Hyperparameter
Initialization                                 N(0, sqrt(2/n_l)) (conv. layers), N(0, 0.01) (fc layers)
Unit type                                      Leaky rectified non-linearity (α = 0.3)
Control of learning rate in unsupervised       AdaDelta (ε = 1e−10 for the first layer, ε = 1e−11 for the
training                                       higher layers, and ρ = 0.95)
Control of learning rate in supervised         AdaDelta (ε = 1e−11 for the lesion network, ε = 1e−10 for the
training                                       FA network, ε = 1e−9 for the HMF network, and ρ = 0.95)
Weight decay in unsupervised training          Penalty coefficient = 5e−4
Weight decay in supervised training            Penalty coefficient = 2e−3 (lesion and HMF networks), 5e−4 (FA network)
Cost function in supervised training           Cross-entropy with weights set inversely proportional to class frequencies
Dropout                                        fc layers only, p = 0.5
Noisy gradient regularization                  η = 1.0 and γ = 0.55 (FA network), η = 10.0 and γ = 0.55 (HMF network)
Early stopping                                 Convergence target = 0.45 (lesion network), 0.2 (FA network), 0.35 (HMF network)
Mini-batch size                                10

To model spatial patterns of tissue integrity, we propose using a patch-based network that uses a DBN to extract features from individual FA patches in an unsupervised manner. The patch-level features are then concatenated into image-level feature vectors for supervised training and classification. The DBN consists of three stacked RBMs, and the supervised classifier is composed of three fc directed layers (Figure 7.1). We extract 3D discriminative candidate patches of size 9×9×9 in the parenchymal component of the FMRIB58 template space using a voxel-wise t-test to determine the statistical significance of the group difference between NMOSD and MS, similarly to [Suk et al., 2014]. The voxels with individual p-values lower than 0.01 are selected as the centers of candidate patches. The mean p-value for each candidate patch is then computed. Starting with the patches with the lowest mean p-values, patches are selected while enforcing an overlap of less than 50% with any previously selected patches. These patches are then further selected by including only those with mean p-values smaller than the average p-value of all candidate patches. For training the RBM layers, we perform contrastive divergence with AdaDelta. After training the RBM layers, we train the fc layers for image-level classification using negative log-likelihood maximization with AdaDelta and a class-balanced cross-entropy cost function. For regularization, we apply dropout, weight decay, noisy gradient optimization [Neelakantan et al., 2015], and early stopping.
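The candidate-patch selection just described can be sketched as follows, assuming scipy; the overlap test is a simplified proxy for the less-than-50% overlap rule, and border handling is omitted.

    import numpy as np
    from scipy import stats

    def select_patches(fa_nmosd, fa_ms, parenchyma, size=9, p_thresh=0.01):
        # fa_nmosd, fa_ms: arrays of shape (subjects, x, y, z) holding the
        # spatially normalized FA maps of each group; parenchyma: boolean
        # mask of the parenchymal component of the FMRIB58 template.
        _, pvals = stats.ttest_ind(fa_nmosd, fa_ms, axis=0)  # voxel-wise t-test
        pvals = np.where(parenchyma, pvals, 1.0)             # restrict to brain tissue
        r = size // 2

        def mean_p(c):
            x, y, z = c
            return pvals[x - r:x + r + 1, y - r:y + r + 1, z - r:z + r + 1].mean()

        centers = [tuple(c) for c in np.argwhere(pvals < p_thresh)]
        candidates = sorted(centers, key=mean_p)             # lowest mean p first
        selected = []
        for c in candidates:
            # Chebyshev distance >= r keeps the overlap with previously
            # selected patches below 50% (simplified criterion)
            if all(np.max(np.abs(np.subtract(c, s))) >= r for s in selected):
                selected.append(c)
        cutoff = np.mean([mean_p(c) for c in candidates])
        return [c for c in selected if mean_p(c) < cutoff]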
We propose an HMF model that aims to combine heterogeneous modalities by first performing multimodal fusion at two different scales, and then fusing the multi-scale features separately in an additional network layer. Figure 7.1 shows a conventional multimodal architecture (Figure 7.1-a), which uses a single fusion layer to combine the top-level activations of the modality-specific pathways, and the proposed HMF model (Figure 7.1-b), which fuses the lesion and FA features at the top two layers of their respective networks (blue connections), and then fuses the two levels of multimodal features using a separate layer (red connections). We hypothesize that the additional pathways offered by the HMF model would allow the two modalities to be fused in a more optimized manner during supervised training.

After the modality-specific networks are trained as explained above, both the HMF and conventional multimodal networks can be trained similarly. The entire network is trained in the same supervised manner as the modality-specific networks, using the training configuration summarized in Table 7.1. However, we noticed early in our experiments that full network training of the conventional architecture tended to be unstable, even over a broad range of hyperparameters, similarly to the observations of Xu et al. [2016] when training heterogeneous modalities. For comparison, we therefore also fixed the model parameters of the trained modality-specific networks and trained only the multimodal and higher layers, which we call partial multimodal training.

7.4 Experimental Results

We evaluated our model using a 7-fold cross-validation procedure and assessed the classification performance using four key measures: accuracy, sensitivity, specificity, and AUC, as shown in Table 7.2. We compared the HMF model to multivariable logistic regression (MLR) and random forest models applied to user-defined imaging features, specifically WM lesion volume and mean FA values of three brain structures (corpus callosum, corticospinal tract, and optic radiation), which have been used with some success in previous literature on NMOSD/MS classification [Eshaghi et al., 2015]. When using all of the user-defined features, the MLR and random forest models produced performance values below 70%, except for MLR in sensitivity (73.2%) and the random forest in specificity (71.4%), but even these did not match the performance of the HMF model (accuracy=81.3%, sensitivity=85.3%, specificity=75.0%, AUC=80.1%). The accuracy attained by the HMF model was statistically better than that attained by the random forest model (p < 0.05, two-sided Wilcoxon test), while the other deep networks were not. The deep-learned features significantly outperformed the user-defined features even after a dramatic reduction in feature dimensionality to 100 units, which shows the potential of deep learning with the proposed fusion approach for differential NMOSD/MS diagnosis using brain MRIs, and our analysis of the training and test errors suggests that even greater accuracy is possible with a larger dataset.
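The statistical comparison above was made on the paired per-fold accuracies; with scipy it amounts to the following sketch (the accuracy values shown are placeholders, not the study's fold-wise results).

    from scipy import stats

    # Paired fold-wise accuracies from the same 7-fold split (placeholders)
    hmf_acc = [0.85, 0.80, 0.79, 0.90, 0.75, 0.84, 0.76]
    forest_acc = [0.70, 0.65, 0.72, 0.74, 0.60, 0.71, 0.68]

    # Two-sided Wilcoxon signed-rank test on the paired per-fold scores
    statistic, p_value = stats.wilcoxon(hmf_acc, forest_acc)
    significant = p_value < 0.05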
We also compared the HMF model to the two modality-specific networks and to the conventional multimodal fusion model, with full and partial multimodal training. The lesion-pattern and FA-pattern networks performed similarly, with the lesion-pattern network achieving slightly higher accuracy at 78.3% (vs. 76.9%) and AUC at 76.7% (vs. 74.8%); both modality-specific networks achieved a strong sensitivity score of 84.1%, but both also had specificity scores below 70%. The conventional multimodal fusion network with full multimodal training performed similarly to the modality-specific networks, possibly due to insufficient fusion pathways causing one modality to dominate. The conventional multimodal fusion network with partial multimodal training (accuracy=79.1%, sensitivity=81.7%, specificity=75.0%, AUC=78.4%) performed the closest to the HMF model, but did not exceed it on any measure. The HMF model improved sensitivity by 3.6% over the conventional multimodal network with partial multimodal training while retaining the same specificity (75.0%). Given that the main difference between the two models is the fusion approach, we attribute the improvement in sensitivity to the HMF model's ability to learn combined features that are more distinct between the two diseases.

Table 7.2: Performance comparison (%) between 7 classification models for differentiating NMOSD from MS. We performed a 7-fold cross-validation procedure on 82 NMOSD and 52 MS patients, and computed the average performance (and standard deviation) for each model.

Classifier                                          EDT  Unsup.  Accuracy     Sensitivity  Specificity  AUC
Multivariable logistic regression with
user-defined features†                              N/A  N/A     67.7 (11.2)  73.2 (8.2)   59.2 (20.5)  66.2 (12.6)
Random forest with user-defined features†           N/A  N/A     68.6 (9.4)   66.8 (11.0)  71.4 (12.0)  69.1 (9.2)
Lesion pattern network                              Yes  Yes     78.3 (11.1)  84.1 (14.9)  69.4 (18.5)  76.7 (11.6)
FA pattern network                                  No   Yes     76.9 (6.7)   84.1 (11.7)  65.6 (14.7)  74.8 (6.6)
Conventional multimodal fusion network
(full multimodal training)                          Yes  Yes     76.1 (10.4)  82.9 (13.2)  65.8 (21.6)  74.4 (11.6)
Conventional multimodal fusion network
(partial multimodal training)                       Yes  Yes     79.1 (11.3)  81.7 (15.3)  75.0 (16.5)  78.4 (11.2)
Hierarchical multimodal fusion                      Yes  Yes     81.3 (10.1)  85.3 (15.2)  75.0 (16.5)  80.1 (10.2)

† WM lesion load, mean FA value in the corpus callosum, mean FA values in the left and right corticospinal tracts, and mean FA values in the left and right optic radiations. Unsup. = unsupervised training.

To analyze the impact of the HMF model on training speed, we computed convergence rates, measured as the average number of training epochs needed to reach the convergence target, for the HMF and conventional fusion models, both trained using partial multimodal training. Using an NVIDIA TITAN X graphics card with 3584 cores and 12 GB of device memory, the conventional multimodal fusion network converged at an average of 513 epochs (4.2 hours), while the HMF model achieved convergence at an average of 104 epochs (0.8 hours), resulting in a 4–5 times speedup.

The results suggest that separating the multimodal fusion and hierarchical fusion procedures in the HMF model can produce a more accurate and generalizable joint distribution from very heterogeneous neuroimaging modalities, which led to improved classification performance and training speed. Overall, the faster and more stable training of the HMF model, along with its superior classification results, make it a demonstrably more practical yet still relatively simple approach.

7.5 Conclusion

We have proposed a deep network model for performing the differential diagnosis between NMOSD and MS by automatically extracting discriminative image features from sparse brain lesion masks and dense FA maps, and integrating these heterogeneous features with a hierarchical multimodal fusion approach. The lesion features are extracted by a convolutional neural network and the tissue integrity features are extracted by a patch-based dense neural network.
The joint features are formed by two multimodal fusion layers operating at two scales, followed by a third fusion layer that combines the features across scales. Using 82 NMOSD and 52 MS subjects in a 7-fold cross-validation procedure, we showed that the HMF model handily outperformed multivariable regression and random forest classifiers using common user-defined imaging features, and was also more accurate and sensitive than the conventional single-layer fusion approach using the identical modality-specific networks for feature extraction. Although the performance of the HMF model was not dramatically improved over the conventional fusion approach, the improvements would be clinically useful if they can be validated on a large scale, and overall we found the HMF model easier and much faster to train, which are practical benefits. In this study, the HMF approach was applied to only two scales and two modalities, and future work will investigate whether the approach can be generalized. In addition, direct comparisons to other advanced multimodal architectures should be done to determine whether they would have benefits for heterogeneous neuroimaging data.

Chapter 8

Conclusions and Future Work

In this thesis, we have developed detailed methodologies for automated biomarker discovery in brain MRIs using deep learning, which can be applied to patient-level classification, prediction of disease course, and segmentation in MS. We did so by overcoming several key inherent technical challenges in applying deep learning to neuroimaging data, such as the high dimensionality of the input data, small training samples and limited annotations, data sparsity, and heterogeneous modalities. In Chapter 4, we presented a deep learning model that can extract multiscale MRI features entirely from unlabeled data. The developed patch-based network and unsupervised training method were used for the patient-level classification models presented in this thesis. In Chapter 5, we presented an unsupervised deep learning model that learns a joint feature representation from quantitative and structural MRI for detecting MS pathology in NABT. In Chapter 6, we introduced a deep learning model that can predict the individual risk of conversion to CDMS in CIS patients using brain lesion masks and clinical measurements at baseline. In Chapter 7, we presented a deep learning model that can distinguish NMOSD (a CNS disorder that exhibits clinical and radiological symptoms similar to MS) from RRMS (a disease subtype of MS) by jointly modeling brain lesion features with a CNN architecture and diffusion features with a patch-based DNN architecture.

8.1 Summary of thesis contributions

In the course of developing deep learning methods to overcome the technical challenges of automated brain MRI biomarker discovery in MS applications, we have made the following contributions:

1. We have developed a patch-based deep learning method that automatically learns features at multiple scales entirely from unlabeled multimodal MRI scans, which are then refined with a supervised classifier using a relatively small set of annotated images. Our experimental results showed that adding more unlabeled data to our deep learning network generally improves segmentation accuracy, which would be beneficial for neuroimaging data analysis, where annotated images are limited and expensive to acquire.
To our knowledge, at the time of publishing, this was the first study to use image features learned completely automatically from unlabeled data to perform MS lesion segmentation. Since then, there have been attempts to extract patch-wise features from annotated images with supervised deep learning networks, most notably the following. Prieto et al. [2017] proposed training a supervised 3D patch-based CNN with balanced MS lesion and non-lesion samples generated by the lesion frequency map. To reduce the number of misclassified voxels, Valverde et al. [2017] recently introduced a supervised 3D patch-wise deep network that relies on a cascade of two CNNs, in which the output of the first CNN is used to select the input features of the second CNN. Instead of extracting patch-wise deep-learned features, learning MS lesion patterns from entire images without patch selection, using convolutional and deconvolutional layers trained in both supervised and unsupervised manners, has been investigated by Brosch et al. [2015, 2016].

2. We have developed an unsupervised deep learning method that can model a joint feature representation from quantitative and anatomical MRI modalities for detecting MS pathology in NABT, which is novel in its use of deep learning to extract features from myelin images and to integrate high-dimensional myelin and T1w images. We demonstrated that the deep-learned joint myelin-T1w feature representation significantly outperformed the regional mean MRI feature types for distinguishing between RRMS patients and healthy controls in NABT. This result suggests that deep learning of myelin and T1w images has the potential to enable earlier and more accurate classification of MS abnormalities, which would allow clinicians to provide more personalized treatment options to MS patients at an early disease stage, if the proposed model proves successful with other disease subtypes and longitudinal data.

3. We have developed a deep learning method that can model brain lesion distribution patterns, in combination with radiological and clinical measurements defined by the evaluating physicians, for predicting the individual risk of conversion from CIS to CDMS. To the best of our knowledge, this is the first study to perform patient-level classification of brain lesion masks by deep learning. To efficiently train a CNN on sparse lesion masks and to reduce the risk of overfitting, we proposed utilizing the EDT to increase information density, and a combination of downsampling, unsupervised pretraining and regularization to avoid overfitting and improve training convergence. We also proposed incorporating the user-defined clinical measurements into the CNN to further improve prediction performance, using a simple approach based on feature replication and amplification that compensates for the mismatch in feature dimensionality and dynamic range between the CNN features and the low-dimensional user-defined clinical features. Our results showed that our deep learning model can produce more accurate patient-level predictions than the clinical biomarkers that have been used in clinical studies. Instead of automatically extracting patient-level lesion features by deep learning, recent relevant work [Doyle et al., 2017] has proposed an unsupervised clustering framework
that categorizes probabilistic patient-level lesion representations using a variety of user-defined features, such as lesion size, RIFT (a rotation-invariant generalization of SIFT) [Lazebnik et al., 2005], local binary patterns [Ahonen et al., 2006] and intensity features, for predicting future MS activity.

4. We have proposed a novel hierarchical multimodal fusion approach that can discover latent features from heterogeneous brain MRI modalities, using a CNN architecture for sparse lesion masks and a patch-based DNN architecture for FA maps, which can be used to distinguish NMOSD from MS. We have shown that the proposed hierarchical multimodal fusion model can efficiently extract brain lesion and diffusion patterns that are relevant to the differential diagnosis and can improve classification performance over the conventional multimodal fusion approach, and that the proposed model is easier and much faster to train, which are practical benefits. As future work, a comparative study between the proposed method and other recent advanced multimodal architectures, such as the gating-unit-based architecture [Arevalo et al., 2017] and the architecture with cross-weights [Rastegar et al., 2016], would be needed to determine whether they also have benefits for integrating heterogeneous neuroimaging data.

Overall, we have demonstrated that deep learning can automatically extract clinically useful MRI biomarkers in a data-driven manner from high-dimensional brain images to improve patient-level clinical classification and prediction over previously used imaging biomarkers, even with small training datasets. In consideration of the major challenges for clinical adoption of deep learning in neuroimaging, careful development of network architectures and training strategies for each application was found to be crucial. Based on our results, we believe we have found evidence of the great potential of deep learning in neuroimaging for patient-level prediction in neurological disorders, and its clinical adoption will be accelerated by further validation and by emerging trends such as decentralized data sharing and advanced imaging modalities.

8.2 Discussion

8.2.1 Patient-level classification in neuroimaging

Many neuroimaging studies have demonstrated abnormality and pathology in a population context, with a single feature or multiple features, in a patient cohort compared with a healthy group using group statistical tests. Enabling precision medicine requires the design of patient-level prognostic and diagnostic tools, but the relationship between group statistical tests and individual discrimination power is not straightforward. According to the literature survey discussed in Chapter 3, a number of deep learning methods for segmenting brain MRIs, such as lesion and tumour segmentation and tissue classification, have been developed, and excellent improvements in segmentation performance have recently been achieved, which should accelerate the adoption of deep learning based segmentation methods into clinical practice in the near future. However, although there have been some attempts, we have observed relatively slow progress and few achievements in developing successful deep learning methods for patient-level classification and prediction problems using neuroimaging data.
We feel that such progress is currently hindered by a number of technical difficulties, such as high dimensionality, small datasets, limited annotations, label noise and uncertainty, class imbalance, inconsistent data, data sparsity of certain image types, heterogeneous modalities, the black-box characteristics of deep networks, and the subtleness of the pathological patterns to be discovered. Although some of these challenges, such as limited annotations, data sparsity of certain image types and heterogeneous modalities, have been tackled in this thesis, many of them have not been thoroughly investigated and thus remain challenging. In particular, the high-dimensionality problem is expected to remain a major obstacle in applying deep learning to neuroimaging for patient-level classification in the near future. We discuss some potential strategies to further overcome some of these difficulties as future work in Section 8.3.

8.2.2 Key aspects in developing successful deep learning methods for patient-level brain MRI classification

The primary hypothesis presented in this thesis was that the capability of deep learning to automatically capture discriminative and abstract features in a hierarchical manner may prove useful for effectively reducing the high dimensionality of neuroimaging data to extract key MRI patterns that are representative of important disease variations in MS patients. However, although CNNs are showing impressive performance in many image recognition competitions, we observed that directly employing the CNNs commonly used in the computer vision community does not necessarily produce optimal performance for neuroimaging studies. After testing a number of different deep learning architectures and training methods, it seemed that there is no single perfect network architecture and training strategy for all MS applications. Network architectures, training methods and hyperparameters contributed to the final prediction performance in complicated ways, due to the black-box behaviors of deep learning and the technical difficulties noted above, and no clear, simple recipe appeared to exist for attaining optimal performance. Throughout the thesis projects, careful development of deep learning architectures and training strategies, based on empirical experience and detailed analyses, was found to be crucial in applying deep learning to each MS application. More specifically, incorporating task-specific properties and reflecting domain knowledge were important considerations when developing network architectures and training strategies. In our work, to address the high-dimensionality property, we have shown that employing a hierarchical 3D patch-level architecture to integrate myelin and T1w images can be an effective way to perform patient-level classification on high-dimensional multimodal images. We have also demonstrated that increasing the information density of the input data with the EDT was helpful for handling the sparseness of brain lesion masks when training a CNN, and that the hierarchical multimodal fusion approach was useful for efficiently training a deep network on heterogeneous imaging modalities. These attempts show the importance of incorporating task-specific properties and reflecting domain knowledge.
Another example is that, to reflect the high-dimensionality property and to avoid overfitting when applying deep learning to neuroimaging data, we employed unsupervised deep learning architectures and utilized a combination of various regularization methods, such as dropout, early stopping, noisy gradient optimization, weight decay and downsampling, for more extensive regularization than is typical in traditional computer vision domains. Determining which type of network architecture would work for each type of data (e.g., lesion mask, myelin map and FA map) was also an essential consideration in incorporating the given task properties.

Another important aspect was the set of preprocessing methods, which should be carefully considered in the development of successful deep networks. Due to imperfections in neuroimaging quality and the inhomogeneity of scanners between different clinical sites, the employment of appropriate preprocessing methods, such as linear and nonlinear registration, intensity normalization and inhomogeneity correction, was essential to ensure consistent quality in the training datasets. Although the impact of data augmentation was not thoroughly studied in this thesis, it would also be one feasible way to cope with small training samples, and deserves further investigation.

8.2.3 The potential of deep learning to be incorporated into the clinical practice

To evaluate the applicability of our proposed methods to clinical practice, we applied them to four important MS applications in measuring the state and progression of MS. Our results demonstrated that these models can discover subtle patterns of MS pathology and provide better classification and prediction performance than the imaging biomarkers previously used in clinical studies, even with small training samples. We believe our results and the related literature summarized in Chapter 3 support the hypothesis that applying deep learning to neuroimaging data has great potential to provide accurate, robust and generalizable patient-level classification and prediction solutions for MS and other neurological disorders. Since the patient-level classification models developed in this thesis were validated by cross-validation due to the small datasets, validation on much larger datasets will be required, because cross-validation does not guarantee that the models will generalize to a new dataset. Emerging trends such as decentralized data sharing, which can provide much larger datasets, would likely accelerate acceptance of these deep learning methods among clinicians and patients. In addition, adding advanced MRI sequences that are more sensitive and specific to MS pathology, such as MWI, SWI, DTI and magnetization transfer imaging (MTI), to our models would further enable more timely and better-informed treatment options and early initiation of effective therapy, to hopefully reduce or delay future MS symptoms (e.g., Vavasour et al. [2017]).

The long-term clinical goal is to establish a technology that will provide clinical predictions in routine practice to MS neurologists, who may choose to make adjustments to treatment plans based on personalized prognoses. In addition, the results of the thesis projects will potentially benefit the development of new MS therapies by discovering MRI features that are more sensitive to disease-relevant changes, thereby providing greater value to image-based studies that monitor treatment effects in support of personalized medicine.
8.2.3 The potential of deep learning to be incorporated into clinical practice

To evaluate the applicability of our proposed methods to clinical practice, we applied them to four important MS applications for measuring the state and progression of MS. Our results demonstrated that these models can discover subtle patterns of MS pathology and provide better classification and prediction performance than the imaging biomarkers previously used in clinical studies, even with small training samples. We believe our results, and the related literature summarized in Chapter 3, support the hypothesis that applying deep learning to neuroimaging data has great potential to provide accurate, robust and generalizable patient-level classification and prediction solutions for MS and other neurological disorders. Since the patient-level classification models developed in this thesis were validated by cross-validation due to the small datasets, validation on much larger datasets will be required, because cross-validation does not guarantee that the models will generalize to a new dataset. Emerging trends such as decentralized data sharing, which can provide much larger datasets, would likely accelerate acceptance of deep learning methods among clinicians and patients. In addition, adding advanced MRI sequences that are more sensitive and specific to MS pathology, such as MWI, SWI, DTI and magnetization transfer imaging (MTI), to our models would enable more timely and better-informed treatment options and early initiation of effective therapy, hopefully reducing or delaying future MS symptoms (e.g., Vavasour et al. [2017]).

The long-term clinical goal is to establish a technology that will provide clinical predictions in routine practice to MS neurologists, who may choose to adjust treatment plans based on personalized prognoses. In addition, the results of the thesis projects may benefit the development of new MS therapies by discovering MRI features that are more sensitive to disease-relevant changes, thereby providing greater value to image-based studies that monitor treatment effect in support of personalized medicine. Although the deep learning methods developed in this thesis were mostly validated on small datasets, they nevertheless appeared promising for neuroimaging data analysis even at these data sizes, and we believe that, with several thousand subjects and advanced MRI sequences, it would be possible to provide reliable prediction and classification models for neuroimaging applications in MS.

8.3 Future work

8.3.1 Unsupervised generative models for imaging biomarker discovery

Deep learning, especially with supervised learning methods, has achieved impressive results in many challenging applications by reaching or exceeding human-level performance. These successes were possible mainly by feeding large amounts of labeled data to enable deeper network training. However, labeled data can be expensive in neuroimaging, as collecting it requires expert domain knowledge and time-consuming manual labor. In unsupervised feature learning, we can provide a large amount of unlabeled data to deep networks to learn a good feature representation of the input data, with or without the support of supervised approaches, as we demonstrated in Chapter 4. In this way, it would be possible to discover new imaging biomarkers from unlabeled data without requiring effort from radiologists or neurologists, which can be useful for discovering subtle pathological patterns in complex neurological diseases. For example, Brosch et al. [2014] showed that an unsupervised convolutional deep learning network could automatically identify two pathological imaging patterns in MS, brain atrophy and lesion volume, which are commonly used imaging biomarkers. Thus, discovering subtle pathology from large amounts of unlabeled neuroimaging data by deep learning will be an interesting and promising research direction in future work.

8.3.2 Domain adaptation

It has recently been reported that the performance of deep networks often deteriorates when they are tested on a new dataset whose distribution is different from that of the training data. In medical imaging applications, this can occur due to variations in imaging protocols between clinical sites. The degradation can be more pronounced in neuroimaging data analysis, in which data size is often limited and combining datasets from multiple sites is therefore common. Domain adaptation is the problem of training a set of machine learning models on data drawn from different distributions. Kamnitsas et al. [2017] recently introduced an unsupervised domain adaptation method for image segmentation based on adversarial training of two CNNs. They modeled domain-invariant features by attempting to classify the domain of the input data from the activations of the segmentation network, using a multi-connected adversarial network trained simultaneously with the segmentation network. As the need to combine neuroimaging data from multiple sites will inevitably increase, investigating deep architectures, particularly adversarial networks, for domain adaptation and analyzing their capabilities in multi-center studies will be important future work.
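To illustrate the general adversarial idea, the following PyTorch sketch attaches a domain discriminator to intermediate features through a gradient reversal layer, a common way to implement this kind of objective; it is a simplified toy example, not the multi-connected architecture of Kamnitsas et al. [2017], and the layer sizes, task labels and domain labels are hypothetical.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainAdversarialNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical 3D feature extractor shared by both objectives.
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(4), nn.Flatten())
        self.task_head = nn.Linear(16 * 4 ** 3, 2)    # e.g., lesion present or not
        self.domain_head = nn.Linear(16 * 4 ** 3, 2)  # e.g., site A vs. site B

    def forward(self, x, lam=1.0):
        f = self.features(x)
        # The reversed gradient trains the shared features to confuse the
        # domain classifier, encouraging invariance to the acquisition
        # site, while the task head is trained normally.
        return self.task_head(f), self.domain_head(GradReverse.apply(f, lam))
```

During training, both the task loss and the domain classification loss are minimized; because the reversal flips the gradient flowing into the shared features, the feature extractor learns representations that support the task while discarding site-specific cues.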
8.3.3 Highly heterogeneous multimodal network

The high tissue resolution and the wide variety of contrast types have made MRI the most versatile imaging technique in neuroradiology. Conventional MRI protocols such as T1w, T2w, PDw and FLAIR, which provide morphological information about brain tissue, have thus become the main method of brain examination and can be used for monitoring and predicting the results of therapeutic treatment and surgical intervention in neurological diseases. To increase the biological sensitivity and specificity of clinical brain MRI, researchers are actively developing more quantitative imaging sequences such as fMRI, DTI, MWI, MTI and SWI. Recent clinical studies often employ advanced neuroimaging sequences that are high-dimensional and contain both structural and quantitative modalities providing complementary information. To efficiently model latent joint features from these heterogeneous MRI modalities, there will be a great need for deep learning models that can be efficiently trained on large-scale multimodal datasets. In early studies of multimodal deep learning (e.g., Ngiam et al. [2011], Srivastava and Salakhutdinov [2012b]), the capability of deep learning to model joint feature representations from heterogeneous data (e.g., between image and text) was validated using simple multimodal architectures that rely on top-level feature concatenation. More recent studies suggest that a joint feature representation between heterogeneous modalities can be learned more effectively by sophisticated architectures such as gating-unit-based architectures [Arevalo et al., 2017] and architectures with cross-weights [Rastegar et al., 2016]. In neuroimaging applications, the number of modalities per sample can be much larger than in other domains, making further investigation of advanced multimodal architectures and efficient training strategies significant. In Chapter 7, we developed a hierarchical multimodal approach for modeling a joint feature representation between lesion distribution and brain diffusion patterns, but further research would be needed for advanced MRI sequences that consist of multiple heterogeneous modalities. One expected challenge is that some subjects might have MRI scans with missing modalities, which can make network training more difficult.
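As a minimal sketch of gating-based fusion in the spirit of Arevalo et al. [2017], the PyTorch module below fuses two modality feature vectors with a learned gate; the dimensions and the modality assignments in the comments are hypothetical, and this is not the exact hierarchical architecture developed in Chapter 7.

```python
import torch
import torch.nn as nn

class GatedMultimodalUnit(nn.Module):
    """Fuses two modality feature vectors through a learned gate."""
    def __init__(self, dim_a, dim_b, dim_out):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)
        self.proj_b = nn.Linear(dim_b, dim_out)
        self.gate = nn.Linear(dim_a + dim_b, dim_out)

    def forward(self, feat_a, feat_b):
        h_a = torch.tanh(self.proj_a(feat_a))  # e.g., lesion-mask features
        h_b = torch.tanh(self.proj_b(feat_b))  # e.g., diffusion features
        # The gate decides, per output dimension, how much each modality
        # contributes to the fused representation for this sample.
        z = torch.sigmoid(self.gate(torch.cat([feat_a, feat_b], dim=1)))
        return z * h_a + (1.0 - z) * h_b
```

Compared with top-level concatenation, a learned gate lets the network weight modalities differently per sample and per feature, which is attractive when modality quality varies across subjects.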
8.3.4 Advanced deep network architectures and hyperparameter optimization

Since deep learning methods brought a revolution to the computer vision community, many researchers have introduced novel architectures and training strategies over the past few years. For instance, visual attention models [Ba et al., 2014, Jaderberg et al., 2015], fully residual convolutional networks [Chang, 2016], batch normalization [Ioffe and Szegedy, 2015], CNNs with dilated convolution [Gu et al., 2017] and capsule networks [Sabour et al., 2017] are some examples of these recent innovations. These methods have provided novel solutions to many previously unsolved image analysis problems, and are already helping to achieve impressive performance in medical image analysis, including detection/recognition, segmentation, registration, computer-aided diagnosis and disease quantification, to name some of the most important problems in the field. However, we believe that there is still much room for significant improvement and that, for these novel deep learning solutions to enter the field of neuroimaging data analysis, dedicated methods must be designed that take into account the specific properties of neuroimaging data, such as high dimensionality and the subtlety of the pathological patterns to be modeled. Therefore, exploring dedicated methods for neuroimaging data analysis that incorporate contextual information or domain-specific knowledge into these advanced deep learning methods, to reflect the data properties, will be one of the main components of future work. In clinical practice, it is common to scan patients with neurological disorders at multiple time points, but how to effectively model longitudinal radiological patterns from temporal MRI data with deep learning has not been fully studied. Investigating advanced architectures for longitudinal analysis would therefore be one future direction in this line of research.

The success of deep learning depends heavily on finding an architecture and hyperparameter settings that fit the task, which requires both human expertise and labor. New architectures are currently built manually through careful experiments or modified from existing networks. However, as deep learning methods have scaled up to more complicated architectures, designing them by hand has become challenging. Recently, Miikkulainen et al. [2017] explored an automated approach based on evolutionary methods, which can optimize deep learning architectures without much human intervention. Baker et al. [2016] studied using reinforcement learning to automatically generate high-performing CNN architectures for a given learning task. Both studies demonstrated that automatically designed architectures and automated optimization of training hyperparameters have the potential to achieve results comparable to human designs. With anticipated increases in computing power, these evolution- or reinforcement-learning-driven network designs may surpass human designs in the future, and should be thoroughly investigated for neuroimaging applications.

8.3.5 Interpreting the clinical relevance of deep-learned features

When using deep learning to investigate underlying latent patterns in complex images such as multimodal brain MRIs, it remains challenging to intuitively interpret what the models actually learn, because of their black-box behavior. A careful analysis, such as computing the correlation and significance between learned features and clinical symptoms, is valuable for understanding the pathological relevance of learned features (e.g., Brosch et al. [2014]), but understanding exactly what each neuron has learned, and thus what computation it is performing, has not been fully studied. Some have supposed that deep-learned representations are highly distributed, and thus that any individual neuron or dimension is uninterpretable, but others have argued that many neurons represent abstract features in a more local, and thus interpretable, manner (e.g., Olah et al. [2017]). A visualization technique based on regularized optimization in image space [Yosinski et al., 2015] or a dimensionality reduction technique such as t-SNE [Maaten and Hinton, 2008] could be useful for understanding and interpreting the clinical relevance of learned features together with the involved physicians.
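As a brief sketch of the dimensionality-reduction approach, the snippet below projects hypothetical deep-learned patient feature vectors to two dimensions with scikit-learn's t-SNE; in practice, the random features used here would be replaced by activations extracted from a trained network, and the resulting clusters would be inspected against clinical labels.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical deep-learned feature vectors for 200 patients,
# e.g., activations of the penultimate layer of a trained network.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 512)).astype(np.float32)

# Project to 2D for visual inspection; if the embedded points cluster
# by clinical status, the learned features likely carry clinically
# relevant structure worth examining with the involved physicians.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
print(embedding.shape)  # (200, 2)
```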
Ranking deep-learned features could also be useful for improving interpretability. For example, Chang et al. [2017] proposed a feature ranking method for deep learning based on the dropout framework, which was validated on predicting drug response. In another example, Li et al. [2017] proposed an additional network layer that can provide useful insight into the inner workings of the network and the relationships between predictions. The proposed layer consists of an encoder for comparing predictions within the latent space and a decoder for visualizing the learned features. This line of research has not been thoroughly pursued for neuroimaging applications, and interpreting learned features and understanding their pathological relevance must therefore be explored further in order to provide more useful and practical methodologies to neurologists and clinicians.

8.3.6 Enhancing neuroimaging data quality by deep learning

While deep learning approaches are the subject of intense research for medical image analysis, their use for enhancing the quality of medical images, or for rendering with specific effects, has received less attention so far. The quality of medical images is of great significance, as small variations may reveal subtle but important pathological information. Novel deep learning methods could be developed that enhance the quality of medical images by learning the physics of the underlying image formation process. For instance, applying compressed sensing to MRI, which aims to reconstruct images from significantly fewer measurements, has received attention because it potentially offers significant scan time reductions, with benefits for patients and healthcare economics. However, current reconstruction algorithms for compressed sensing are very slow to converge, which limits the technique to non-real-time applications. Recently, Mousavi and Baraniuk [2017] introduced a DNN to model the inverse transformation from measurement vectors to signals and showed that deep learning can closely approximate the reconstructions produced by state-of-the-art recovery algorithms while reducing convergence time by up to hundreds of times. As another example, in recent super-resolution approaches that use training data to estimate high-resolution images, which have shown high reconstruction performance, deep learning has become a key tool for modeling the non-linear mapping between low- and high-resolution images (e.g., Dong et al. [2016]). These attempts are being adopted and further developed for medical imaging. For example, Pham et al. [2017] recently employed a 3D CNN to generate high-resolution brain MRI images from low-resolution inputs with the aid of patches from other high-resolution brain images. Thus, new deep learning architectures, cost functions and training strategies specifically designed to enhance the quality of neuroimaging data will be an interesting subject of future work.

Bibliography

David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
Timo Ahonen, Abdenour Hadid, and Matti Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006.
Eva Alonso-Ortiz, Ives R Levesque, and G Bruce Pike. MRI-based myelin water imaging: A technical review. Magnetic Resonance in Medicine, 73(1):70–81, 2015.
Mohammad R Arbabshirani, Sergey Plis, Jing Sui, and Vince D Calhoun. Single subject prediction of brain disorders in neuroimaging: Promises and pitfalls. NeuroImage, 145:137–165, 2017.
John Arevalo, Thamar Solorio, Manuel Montes-y Gómez, and Fabio A González. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992, 2017.
John Ashburner, John G Csernansk, Christos Davatzikos, Nick C Fox, Giovanni B Frisoni, and Paul M Thompson. Computer-assisted imaging to assess brain structure in healthy and diseased brains. The Lancet Neurology, 2(2):79–88, 2003.
Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
Christian Barillot, Gilles Edan, and Olivier Commowick. Imaging biomarkers in multiple sclerosis: From image analysis to population imaging. Medical Image Analysis, 33:134–139, 2016.
Frederik Barkhof, Massimo Filippi, David H Miller, Philip Scheltens, Adriana Campi, Chris H Polman, Giancarlo Comi, Herman J Ader, Nick Losseff, and Jacob Valk. Comparison of MRI criteria at first presentation to predict conversion to clinically definite multiple sclerosis. Brain, 120(11):2059–2069, 1997.
Frederik Barkhof, Peter A Calabresi, David H Miller, and Stephen C Reingold. Imaging outcomes for neuroprotection and repair in multiple sclerosis trials. Nature Reviews Neurology, 5(5):256–266, 2009.
Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19:153, 2007.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
Sruti Bhagavatula, Christopher Dunn, Chris Kanich, Minaxi Gupta, and Brian Ziebart. Leveraging machine learning to improve unwanted resource filtering. In Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop, pages 95–102. ACM, 2014.
Christopher M Bishop. Pattern recognition. Machine Learning, 128:1–58, 2006.
Olivier Bousquet and Léon Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
Tom Brosch, Youngjin Yoo, David KB Li, Anthony Traboulsee, and Roger Tam. Modeling the variability in brain morphology and lesion distribution in multiple sclerosis by deep learning. In Medical Image Computing and Computer-Assisted Intervention, pages 462–469. Springer, 2014.
Tom Brosch, Youngjin Yoo, Lisa YW Tang, David KB Li, Anthony Traboulsee, and Roger Tam. Deep convolutional encoder networks for multiple sclerosis lesion segmentation. In Medical Image Computing and Computer-Assisted Intervention, pages 3–11. Springer, 2015.
Tom Brosch, Lisa YW Tang, Youngjin Yoo, David KB Li, Anthony Traboulsee, and Roger Tam. Deep 3D convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation. IEEE Transactions on Medical Imaging, 35(5):1229–1239, 2016.
Chun-Hao Chang, Ladislav Rampasek, and Anna Goldenberg. Dropout feature ranking for deep learning models. arXiv preprint arXiv:1712.08645, 2017.
Peter D Chang. Fully convolutional deep residual neural networks for brain tumor segmentation. In International Workshop on Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, pages 108–118. Springer, 2016.
Hao Chen, Qi Dou, Lequan Yu, Jing Qin, and Pheng-Ann Heng. VoxResNet: Deep voxelwise residual networks for brain segmentation from 3D MR images. NeuroImage, 2017.
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
Kyunghyun Cho. Foundations and Advances in Deep Learning. PhD thesis, Aalto University, 2014.
KyungHyun Cho, Tapani Raiko, and Alexander Ilin. Parallel tempering is efficient for learning restricted Boltzmann machines. In International Joint Conference on Neural Networks, pages 1–8, 2010.
Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068, 2015.
Francis S Collins and Harold Varmus. A new initiative on precision medicine. New England Journal of Medicine, 372(9):793–795, 2015.
Manuel Comabella, Jaume Sastre-Garriga, and Xavier Montalban. Precision medicine in multiple sclerosis: biomarkers for diagnosis, prognosis, and treatment response. Current Opinion in Neurology, 29(3):254–262, 2016.
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, volume 1, pages 886–893. IEEE, 2005.
Christian Denk and Alexander Rauscher. Susceptibility weighted imaging with multiple echoes. Journal of Magnetic Resonance Imaging, 31(1):185–191, 2010.
Rahul S Desikan, Florent Ségonne, Bruce Fischl, Brian T Quinn, Bradford C Dickerson, Deborah Blacker, Randy L Buckner, Anders M Dale, R Paul Maguire, Bradley T Hyman, et al. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage, 31(3):968–980, 2006.
Guillaume Desjardins, Aaron C Courville, Yoshua Bengio, Pascal Vincent, and Olivier Delalleau. Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 145–152, 2010.
Ruth Dobson, Sreeram Ramagopalan, and Gavin Giovannoni. The effect of gender in clinically isolated syndrome (CIS): a meta-analysis. Multiple Sclerosis Journal, 18(5):600–604, 2012.
Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.
Andrew Doyle, Doina Precup, Douglas L Arnold, and Tal Arbel. Predicting future disease activity and treatment responders for multiple sclerosis patients using a bag-of-lesions brain representation. In Medical Image Computing and Computer-Assisted Intervention, pages 186–194. Springer, 2017.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
Arman Eshaghi, Sadjad Riyahi-Alam, Roghayyeh Saeedi, Tina Roostaei, Arash Nazeri, Aida Aghsaei, Rozita Doosti, Habib Ganjgahi, Benedetta Bodini, Ali Shakourirad, et al. Classification algorithms with multi-modal data fusion could accurately distinguish neuromyelitis optica from multiple sclerosis. NeuroImage: Clinical, 7:306–314, 2015.
Arman Eshaghi, Viktor Wottschel, Rosa Cortese, Massimiliano Calabrese, Mohammad Ali Sahraian, Alan J Thompson, Daniel C Alexander, and Olga Ciccarelli. Gray matter MRI differentiates neuromyelitis optica from multiple sclerosis using random forest. Neurology, 87(23):2463–2470, 2016.
Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.
Daniel J Felleman and David C Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex (New York, NY: 1991), 1(1):1–47, 1991.
Massimo Filippi and F Agosta. Imaging biomarkers in multiple sclerosis. Journal of Magnetic Resonance Imaging, 31(4):770–788, 2010.
Gary William Flake. Square unit augmented, radially extended, multilayer perceptrons. In Neural Networks: Tricks of the Trade, pages 143–161. Springer, 2012.
Daniel García-Lorenzo, Simon Francis, Sridar Narayanan, Douglas L Arnold, and D Louis Collins. Review of automatic segmentation methods of multiple sclerosis white matter lesions on conventional magnetic resonance imaging. Medical Image Analysis, 17(1):1–18, 2013. doi:10.1016/j.media.2012.09.004.
Ezequiel Geremia, Olivier Clatz, Bjoern H Menze, Ender Konukoglu, Antonio Criminisi, and Nicholas Ayache. Spatial decision forests for MS lesion segmentation in multi-channel magnetic resonance images. NeuroImage, 57(2):378–90, 2011. doi:10.1016/j.neuroimage.2011.03.080.
Antonio Giorgio, Marco Battaglini, Maria Assunta Rocca, Alessandro De Leucio, Martina Absinta, Ronald van Schijndel, Alex Rovira, Mar Tintoré, Declan Chard, Olga Ciccarelli, et al. Location of brain lesions predicts conversion of clinically isolated syndromes to multiple sclerosis. Neurology, 80(3):234–241, 2013.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
Grégoire Montavon, Genevieve B Orr, et al. Neural Networks: Tricks of the Trade. Springer, 2012.
Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. Recent advances in convolutional neural networks. Pattern Recognition, 2017.
Yanrong Guo, Guorong Wu, Leah A Commander, Stephanie Szary, Valerie Jewells, Weili Lin, and Dinggang Shen. Segmenting hippocampus from infant brains by sparse patch matching with deep-learned features. In Medical Image Computing and Computer-Assisted Intervention, pages 308–315. Springer, 2014.
E Mark Haacke, Yingbiao Xu, Yu-Chung N Cheng, and Jürgen R Reichenbach. Susceptibility weighted imaging (SWI). Magnetic Resonance in Medicine, 52(3):612–618, 2004.
Mohammad Havaei, Axel Davy, David Warde-Farley, Antoine Biard, Aaron Courville, Yoshua Bengio, Chris Pal, Pierre-Marc Jodoin, and Hugo Larochelle. Brain tumor segmentation with deep neural networks. arXiv preprint arXiv:1505.03540, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
Geoffrey E Hinton. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade, pages 599–619. Springer, 2012.
R Devon Hjelm, Vince D Calhoun, Ruslan Salakhutdinov, Elena A Allen, Tulay Adali, and Sergey M Plis. Restricted Boltzmann machines for neuroimaging: an application in identifying intrinsic networks. NeuroImage, 96:245–260, 2014.
David H Hubel and Torsten N Wiesel. Receptive fields of single neurones in the cat's striate cortex. The Journal of Physiology, 148(3):574–591, 1959.
David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106–154, 1962.
Hanneke E Hulst and Jeroen JG Geurts. Gray matter imaging in multiple sclerosis: what have we learned? BMC Neurology, 11(1):153, 2011.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
Anil K Jain and Farshid Farrokhnia. Unsupervised texture segmentation using Gabor filters. In IEEE International Conference on Systems, Man and Cybernetics, pages 14–19. IEEE, 1990.
Mark Jenkinson, Christian F Beckmann, Timothy EJ Behrens, Mark W Woolrich, and Stephen M Smith. FSL. NeuroImage, 62(2):782–790, 2012.
Seun Jeon, Uicheul Yoon, Jun-Sung Park, Sang Won Seo, Jung-Hyun Kim, Sung Tae Kim, Sun I Kim, Duk L Na, and Jong-Min Lee. Fully automated pipeline for quantification and localization of white matter hyperintensity in brain magnetic resonance image. International Journal of Imaging Systems and Technology, 21(2):193–200, 2011.
Craig Jones and Erick Wong. Multi-scale application of the N3 method for intensity correction of MR images. In Medical Imaging 2002, pages 1123–1129. International Society for Optics and Photonics, 2002.
Konstantinos Kamnitsas, Christian Baumgartner, Christian Ledig, Virginia Newcombe, Joanna Simpson, Andrew Kane, David Menon, Aditya Nori, Antonio Criminisi, Daniel Rueckert, et al. Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In Information Processing in Medical Imaging, pages 597–609. Springer, 2017.
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition, pages 1725–1732. IEEE, 2014.
Ho Jin Kim, Friedemann Paul, Marco A Lana-Peixoto, et al. MRI characteristics of neuromyelitis optica spectrum disorder: an international update. Neurology, 84(11), 2015.
Minjeong Kim, Guorong Wu, and Dinggang Shen. Unsupervised deep learning for hippocampus segmentation in 7.0 tesla MR images. In Machine Learning in Medical Imaging (MICCAI workshop), pages 1–8. Springer, 2013a.
Yelin Kim, Honglak Lee, and Emily Mower Provost. Deep learning for robust feature generation in audiovisual emotion recognition. In International Conference on Acoustics, Speech and Signal Processing, pages 3687–3691. IEEE, 2013b.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Tatiana Koudriavtseva and Caterina Mainero. Neuroinflammation, neurodegeneration and regeneration in multiple sclerosis: intercorrelated manifestations of the immune response. Neural Regeneration Research, 11(11):1727, 2016.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. University of Toronto, Tech. Rep., 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
John F Kurtzke. Rating neurologic impairment in multiple sclerosis: an expanded disability status scale (EDSS). Neurology, 33(11):1444–1444, 1983.
C Laule, IM Vavasour, GRW Moore, J Oger, David K B Li, DW Paty, and AL MacKay. Water content and myelin water fraction in multiple sclerosis. Journal of Neurology, 251(3):284–293, 2004.
Cornelia Laule, Vlady Pavlova, Esther Leung, Guojun Zhao, Alex L MacKay, Piotr Kozlowski, Anthony L Traboulsee, David KB Li, and GR Wayne Moore. Diffusely abnormal white matter in multiple sclerosis: further histologic studies provide evidence for a primary lipid abnormality with neurodegeneration. Journal of Neuropathology & Experimental Neurology, 72(1):42–52, 2013.
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. A sparse texture representation using local affine regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1265–1278, 2005.
Denis Le Bihan, Jean-François Mangin, Cyril Poupon, Chris A Clark, Sabina Pappata, Nicolas Molko, and Hughes Chabriat. Diffusion tensor imaging: concepts and applications. Journal of Magnetic Resonance Imaging, 13(4):534–546, 2001.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
Honglak Lee, Chaitanya Ekanadham, and Andrew Y Ng. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems, pages 873–880, 2008.
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning, pages 609–616. ACM, 2009.
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the ACM, 54(10):95–103, 2011.
Oscar Li, Hao Liu, Chaofan Chen, and Cynthia Rudin. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. arXiv preprint arXiv:1710.04806, 2017.
Shuo Li, Thomas Fevens, Adam Krzyżak, and Song Li. Automatic clinical image segmentation using pathological modeling, PCA and SVM. Engineering Applications of Artificial Intelligence, 19(4):403–410, 2006.
Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM van der Laak, Bram van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. arXiv preprint arXiv:1702.05747, 2017.
Bo Liu, Ying Wei, Yu Zhang, and Qiang Yang. Deep neural networks for high dimension, low sample size data. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017.
Manhua Liu, Daoqiang Zhang, and Dinggang Shen. Hierarchical fusion of features and classifier decisions for Alzheimer's disease diagnosis. Human Brain Mapping, 35(4):1305–1319, 2014a.
Siqi Liu, Sidong Liu, Weidong Cai, Hangyu Che, Sonia Pujol, Ron Kikinis, Dagan Feng, and Michael J Fulham. Multi-modal neuroimaging feature learning for multi-class diagnosis of Alzheimer's disease. IEEE Transactions on Biomedical Engineering, 62(4):1132–1140, 2014b.
Siqi Liu, Sidong Liu, Weidong Cai, Hangyu Che, Sonia Pujol, Ron Kikinis, Michael Fulham, and Dagan Feng. High-level feature based PET image retrieval with deep learning architecture. Journal of Nuclear Medicine, 55(supplement 1):2028–2028, 2014c.
Siqi Liu, Sidong Liu, Weidong Cai, Hangyu Che, Sonia Pujol, Ron Kikinis, Dagan Feng, Michael J Fulham, et al. Multimodal neuroimaging feature learning for multiclass diagnosis of Alzheimer's disease. IEEE Transactions on Biomedical Engineering, 62(4):1132–1140, 2015.
David G Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, volume 2, pages 1150–1157. IEEE, 1999.
Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, volume 30, 2013.
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
Alex MacKay, Kenneth Whittall, Julian Adler, David K B Li, Donald Paty, and Douglas Graeb. In vivo visualization of myelin water in brain by magnetic resonance. Magnetic Resonance in Medicine, 31(6):673–677, 1994.
Stephane G Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989.
Benjamin M Marlin. Missing data problems in machine learning. PhD thesis, University of Toronto, 2008.
Calvin R Maurer, Rensheng Qi, and Vijay Raghavan. A linear time algorithm for computing exact Euclidean distance transforms of binary images in arbitrary dimensions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):265–270, 2003.
John Mazziotta, Arthur Toga, Alan Evans, Peter Fox, Jack Lancaster, Karl Zilles, Roger Woods, Tomas Paus, Gregory Simpson, Bruce Pike, et al. A probabilistic atlas and reference system for the human brain: International Consortium for Brain Mapping (ICBM). Philosophical Transactions of the Royal Society B: Biological Sciences, 356(1412):1293–1322, 2001.
Jon McAusland, Roger Tam, Erick Wong, Andrew Riddehough, and David K B Li. Optimizing the use of radiologist seed points for improved multiple sclerosis lesion segmentation. IEEE Transactions on Biomedical Engineering, 57(11):2689–2698, 2010.
Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju, Arshak Navruzyan, Nigel Duffy, and Babak Hodjat. Evolving deep neural networks. arXiv preprint arXiv:1703.00548, 2017.
Albert Montillo, Jamie Shotton, John Winn, Juan Eugenio Iglesias, Dimitri Metaxas, and Antonio Criminisi. Entangled decision forests and their application for semantic segmentation of CT images. In Information Processing in Medical Imaging, pages 184–196. Springer, 2011.
Jonathan Morra, Zhuowen Tu, Arthur Toga, and Paul Thompson. Automatic segmentation of MS lesions using a contextual model for the MICCAI grand challenge. In MS Lesion Segmentation Challenge (MICCAI Workshop), pages 1–7. Citeseer, 2008.
Ali Mousavi and Richard G Baraniuk. Learning to invert: Signal recovery via deep convolutional networks. arXiv preprint arXiv:1701.03891, 2017.
Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, pages 807–814, 2010.
Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689–696, 2011.
Dong Nie, Han Zhang, Ehsan Adeli, Luyan Liu, and Dinggang Shen. 3D deep learning for multi-modal imaging-guided survival time prediction of brain tumor patients. In Medical Image Computing and Computer-Assisted Intervention, pages 212–220. Springer, 2016.
C Odenthal and A Coulthard. The prognostic utility of MRI in clinically isolated syndrome: a literature review. American Journal of Neuroradiology, 36(3):425–431, 2015.
Timo Ojala, Matti Pietikainen, and Topi Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971–987, 2002.
Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. doi:10.23915/distill.00007. https://distill.pub/2017/feature-visualization.
Genevieve B Orr and Klaus-Robert Müller. Neural Networks: Tricks of the Trade. Springer, 2012.
Sarah-Michelle Orton, Blanca M Herrera, Irene M Yee, William Valdar, Sreeram V Ramagopalan, A Dessa Sadovnick, George C Ebers, Canadian Collaborative Study Group, et al. Sex ratio of multiple sclerosis in Canada: a longitudinal study. The Lancet Neurology, 5(11):932–936, 2006.
Chi-Hieu Pham, Aurélien Ducournau, Ronan Fablet, and François Rousseau. Brain MRI super-resolution using deep 3D convolutional networks. In Biomedical Imaging (ISBI 2017), 14th International Symposium on, pages 197–200. IEEE, 2017.
Walter HL Pinaya, Ary Gadelha, Orla M Doyle, Cristiano Noto, André Zugman, Quirino Cordeiro, Andrea P Jackowski, Rodrigo A Bressan, and João R Sato. Using deep belief network modelling to characterize differences in brain morphometry in schizophrenia. Scientific Reports, 6:38897, 2016.
Sergey M Plis, Devon R Hjelm, Ruslan Salakhutdinov, Elena A Allen, Henry J Bockholt, Jeffrey D Long, Hans J Johnson, Jane S Paulsen, Jessica A Turner, and Vince D Calhoun. Deep learning for neuroimaging: a validation study. Frontiers in Neuroscience, 8, 2014.
C Polman, S Reingold, G Edan, M Filippi, H Hartung, Ludwig Kappos, Fred Lublin, Luanne Metz, Henry McFarland, Paul O'Connor, et al. Diagnostic criteria for multiple sclerosis: 2005 revisions to the McDonald criteria. Annals of Neurology, 58(6):840–846, 2005.
Chris H Polman, Stephen C Reingold, Brenda Banwell, Michel Clanet, Jeffrey A Cohen, Massimo Filippi, Kazuo Fujihara, Eva Havrdova, Michael Hutchinson, Ludwig Kappos, et al. Diagnostic criteria for multiple sclerosis: 2010 revisions to the McDonald criteria. Annals of Neurology, 69(2):292–302, 2011.
Adam Porisky, Tom Brosch, Emil Ljungberg, Lisa Y.W. Tang, Youngjin Yoo, Benjamin De Leener, Anthony Traboulsee, Julien Cohen-Adad, and Roger Tam. Grey matter segmentation in spinal cord MRIs via 3D convolutional encoder networks with shortcut connections. In Deep Learning in Medical Image Analysis (MICCAI workshop). Springer, 2017.
Christopher Poultney, Sumit Chopra, Yann LeCun, et al. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, pages 1137–1144, 2006.
Thomas Prasloski, Burkhard Mädler, Qing-San Xiang, Alex MacKay, and Craig Jones. Applications of stimulated echo correction to multicomponent T2 analysis. Magnetic Resonance in Medicine, 67(6):1803–1814, 2012a.
Thomas Prasloski, Alexander Rauscher, Alex L MacKay, Madeleine Hodgson, Irene M Vavasour, Cornelia Laule, and Burkhard Mädler. Rapid whole cerebrum myelin water imaging using a 3D GRASE sequence. NeuroImage, 63(1):533–539, 2012b.
Lutz Prechelt. Early stopping - but when? In Neural Networks: Tricks of the Trade, pages 53–67. Springer, 2012.
Juan C Prieto, Michele Cavallari, Miklos Palotai, Alfredo Morales Pinzon, Svetlana Egorova, Martin Styner, and Charles RG Guttmann. Large deep neural networks for MS lesion segmentation. In Medical Imaging 2017: Image Processing, volume 10133, page 101330F. International Society for Optics and Photonics, 2017.
Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. Self-taught learning: transfer learning from unlabeled data. In International Conference on Machine Learning, pages 759–766. ACM, 2007.
Sarah Rastegar, Mahdieh Soleymani, Hamid R Rabiee, and Seyed Mohsen Shojaee. MDL-CW: A multimodal deep learning framework with cross weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2601–2609, 2016.
Martin Riedmiller. 10 steps and some tricks to set up neural reinforcement controllers. In Neural Networks: Tricks of the Trade, pages 735–757. Springer, 2012.
Richard A Rudick, Elizabeth Fisher, Jar-Chi Lee, Jeffrey T Duda, and Jack Simon. Brain atrophy in relapsing multiple sclerosis: relationship to relapses, EDSS, and treatment with interferon β-1a. Multiple Sclerosis Journal, 6(6):365–372, 2000.
Daniel Rueckert, Ben Glocker, and Bernhard Kainz. Learning clinically useful information from images: Past, present and future. Medical Image Analysis, 33:13–18, 2016.
Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3859–3869, 2017.
Ruslan Salakhutdinov and Geoffrey Hinton. An efficient learning procedure for deep Boltzmann machines. Neural Computation, 24(8):1967–2006, 2012.
Ruslan Salakhutdinov and Geoffrey E Hinton. Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 448–455, 2009.
Ruslan R Salakhutdinov. Learning in Markov random fields using tempered transitions. In Advances in Neural Information Processing Systems, pages 1598–1606, 2009.
Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour. Auto-context convolutional neural network (Auto-Net) for brain extraction in magnetic resonance imaging. IEEE Transactions on Medical Imaging, 2017.
Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2017.
Navid Shiee, Pierre-Louis Bazin, Arzu Ozturk, Daniel S Reich, Peter A Calabresi, and Dzung L Pham. A topology-preserving approach to the segmentation of brain images with multiple sclerosis lesions. NeuroImage, 49(2):1524–1535, 2010.
John G Sled, Alex P Zijdenbos, and Alan C Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Transactions on Medical Imaging, 17(1):87–97, 1998.
Stephen M Smith. Fast robust automated brain extraction. Human Brain Mapping, 17(3):143–155, 2002.
Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. 1986.
DP Soares and M Law. Magnetic resonance spectroscopy of the brain: review of metabolites and clinical applications. Clinical Radiology, 64(1):12–21, 2009.
Allen W Song, Scott A Huettel, and Gregory McCarthy. Functional neuroimaging: Basic principles of functional MRI. Handbook of Functional Neuroimaging of Cognition, 2:22–52, 2006.
Aristeidis Sotiras, Christos Davatzikos, and Nikos Paragios. Deformable medical image registration: A survey. IEEE Transactions on Medical Imaging, 32(7):1153–1190, 2013.
Jean-Christophe Souplet, Christine Lebrun, Nicholas Ayache, and Grégoire Malandain. An automatic segmentation of T2-FLAIR multiple sclerosis lesions. In MS Lesion Segmentation Challenge (MICCAI Workshop), 2008.
Nitish Srivastava and Ruslan Salakhutdinov. Learning representations for multimodal data with deep belief nets. In International Conference on Machine Learning Workshop, 2012a.
Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2222–2230, 2012b.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Martijn D Steenwijk, Marita Daams, Petra JW Pouwels, Lisanne J Balk, Prejaas K Tewarie, Joep Killestein, Bernard MJ Uitdehaag, Jeroen JG Geurts, Frederik Barkhof, and Hugo Vrenken. What explains gray matter atrophy in long-standing multiple sclerosis? Radiology, 2014.
Heung-Il Suk, Seong-Whan Lee, Dinggang Shen, Alzheimer's Disease Neuroimaging Initiative, et al. Latent feature representation with stacked auto-encoder for AD/MCI diagnosis. Brain Structure and Function, pages 1–19, 2013.
Heung-Il Suk, Seong-Whan Lee, Dinggang Shen, Alzheimer's Disease Neuroimaging Initiative, et al. Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis. NeuroImage, 101:569–582, 2014.
Nima Tajbakhsh, Jae Y Shin, Suryakanth R Gurudu, R Todd Hurst, Christopher B Kendall, Michael B Gotway, and Jianming Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 35(5):1299–1312, 2016.
Lisa YW Tang, Tom Brosch, XingTong Liu, Youngjin Yoo, Anthony Traboulsee, David Li, and Roger Tam. Corpus callosum segmentation in brain MRIs via robust target-localization and joint supervised feature extraction and prediction. In Medical Image Computing and Computer-Assisted Intervention, pages 406–414. Springer, 2016.
Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In International Conference on Machine Learning, pages 1064–1071. ACM, 2008.
M Tintore, A Rovira, J Rio, C Nos, E Grive, N Tellez, R Pelayo, M Comabella, J Sastre-Garriga, and X Montalban. Baseline MRI predicts future attacks and disability in clinically isolated syndromes. Neurology, 67(6):968–972, 2006.
Tong Tong, Robin Wolz, Qinquan Gao, Ricardo Guerrero, Joseph V Hajnal, Daniel Rueckert, Alzheimer's Disease Neuroimaging Initiative, et al. Multiple instance learning for classification of dementia in brain MRI. Medical Image Analysis, 18(5):808–818, 2014.
Anthony Traboulsee, Guojun Zhao, and David KB Li. Neuroimaging in multiple sclerosis. Neurologic Clinics, 23(1):131–148, 2005.
Nicholas J Tustison, Brian B Avants, Philip A Cook, Yuanjie Zheng, Alexander Egan, Paul A Yushkevich, and James C Gee. N4ITK: improved N3 bias correction. IEEE Transactions on Medical Imaging, 29(6):1310–1320, 2010.
Sergi Valverde, Mariano Cabezas, Eloy Roura, Sandra González-Villà, Deborah Pareto, Joan C Vilanova, Lluís Ramió-Torrentà, Àlex Rovira, Arnau Oliver, and Xavier Lladó. Improving automated multiple sclerosis lesion segmentation with a cascaded 3D convolutional neural network approach. NeuroImage, 155:159–168, 2017.
Laurens Van Der Maaten. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15(1):3221–3245, 2014.
Irene Vavasour, Cornelia Laule, Shannon Kolind, Roger Tam, David Li, Alex MacKay, and Anthony Traboulsee. Myelin water imaging provides evidence of long-term remyelination and neuroprotection in alemtuzumab treated multiple sclerosis patients (P5.335). Neurology, 88(16 Supplement):P5–335, 2017.
Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.
Nick Weiss, Daniel Rueckert, and Anil Rao. Multiple sclerosis lesion segmentation using dictionary learning and sparse coding. In Medical Image Computing and Computer-Assisted Intervention, pages 735–742. Springer, 2013.
Paul John Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Doctoral Dissertation, Applied Mathematics, Harvard University, MA, 1974.
V Wottschel, DC Alexander, PP Kwok, DT Chard, ML Stromillo, N De Stefano, AJ Thompson, DH Miller, and O Ciccarelli. Predicting outcome in clinically isolated syndrome using machine learning. NeuroImage: Clinical, 7:281–287, 2015.
Guorong Wu, Minjeong Kim, Qian Wang, Yaozong Gao, Shu Liao, and Dinggang Shen. Unsupervised deep feature learning for deformable registration of MR brain images. In Medical Image Computing and Computer-Assisted Intervention, pages 649–656. Springer, 2013.
Guorong Wu, Pierrick Coupé, Yiqiang Zhan, Brent Munsell, and Daniel Rueckert. Patch-based techniques in medical imaging. Springer, 2015a.
Guorong Wu, Min-Jeong Kim, Qian Wang, Brent Munsell, and Dinggang Shen. Scalable high performance image registration framework by unsupervised deep feature representations learning. IEEE Transactions on Biomedical Engineering (in press), 2015b.
Guorong Wu, Dinggang Shen, and Mert R Sabuncu. Machine Learning and Medical Imaging. Academic Press, 2016. ISBN 978-0-12-804076-8.
Tao Xu, Han Zhang, Xiaolei Huang, Shaoting Zhang, and Dimitris N Metaxas. Multimodal deep learning for cervical dysplasia diagnosis. In Medical Image Computing and Computer-Assisted Intervention, pages 115–123. Springer, 2016.
Mohammad Yaqub, M Kassim Javaid, Cyrus Cooper, and J Alison Noble. Improving the classification accuracy of the classic RF method by intelligent feature selection and weighted voting of trees with application to medical image segmentation. In Machine Learning in Medical Imaging (MICCAI Workshop), pages 184–192. Springer, 2011.
Youngjin Yoo and Roger Tam. Non-local spatial regularization of MRI T2 relaxation images for myelin water quantification. In Medical Image Computing and Computer-Assisted Intervention, pages 614–621. Springer, 2013.
Youngjin Yoo, Tom Brosch, Anthony Traboulsee, David KB Li, and Roger Tam. Deep learning of image features from unlabeled data for multiple sclerosis lesion segmentation. In Machine Learning in Medical Imaging (MICCAI workshop), pages 117–124. Springer, 2014.
Youngjin Yoo, Lisa W Tang, Tom Brosch, David KB Li, Luanne Metz, Anthony Traboulsee, and Roger Tam. Deep learning of brain lesion patterns for predicting future disease activity in patients with early symptoms of multiple sclerosis. In Deep Learning in Medical Image Analysis (MICCAI workshop), pages 86–94. Springer, 2016.
Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
Matthew D Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Wenlu Zhang, Rongjian Li, Houtao Deng, Li Wang, Weili Lin, Shuiwang Ji, and Dinggang Shen. Deep convolutional neural networks for multi-modality isointense infant brain image segmentation. NeuroImage, 108:214–224, 2015.
