Machine Learning for MRI-guided Prostate Cancer Diagnosis and Interventions

by

Alireza Mehrtash

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

October 2020

© Alireza Mehrtash 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Machine Learning for MRI-guided Prostate Cancer Diagnosis and Interventions

submitted by Alireza Mehrtash in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering.

Examining Committee:

Purang Abolmaesumi, Department of Electrical and Computer Engineering (Supervisor)
Leonid Sigal, Department of Computer Science (University Examiner)
Zhen Jane Wang, Department of Electrical and Computer Engineering (University Examiner)
James Duncan, Yale University (External Examiner)

Additional Supervisory Committee Members:

Robert Rohling, Department of Electrical and Computer Engineering (Supervisory Committee Member)
William Wells, Brigham and Women's Hospital, Harvard Medical School (Supervisory Committee Member)
Tina Kapur, Brigham and Women's Hospital, Harvard Medical School (Supervisory Committee Member)

Abstract

Prostate cancer is the second most prevalent cancer in men worldwide. Magnetic Resonance Imaging (MRI) is widely used for prostate cancer diagnosis and for guiding biopsy procedures due to its ability to provide superior contrast between cancer and adjacent soft tissue. Appropriate clinical management of prostate cancer critically depends on meticulous detection and characterization of the disease and, when necessary, precise biopsy procedures.

The goal of this thesis is to develop computational methods to aid radiologists in diagnosing prostate cancer in MRI and planning necessary interventions. To this end, we have developed novel methods for assessing the probability of clinically significant prostate cancer in MRI, localizing biopsy needles in MRI, and providing segmentation of structures such as the prostate gland. The proposed methods in this thesis are based on supervised machine learning techniques, in particular deep convolutional neural networks (CNNs). We have also developed methodology that is necessary in order for such deep networks to eventually be useful in clinical decision-making workflows; this spans the areas of domain adaptation, confidence calibration, and uncertainty estimation for CNNs. We used domain adaptation to transfer the knowledge of lesion segmentation learned from MRI images obtained using one set of acquisition parameters to another. We also studied predictive uncertainty in the context of medical image segmentation to provide model confidence (i.e., expectation of success) at inference time. We further proposed parameter ensembling by perturbation for calibration of neural networks.

Lay Summary

Prostate cancer is the most frequently diagnosed cancer in North American men and the second most common cancer in men worldwide. Early detection of prostate cancer increases the chances of long-term survival. Magnetic Resonance Imaging (MRI) can aid doctors in better screening for prostate cancer. However, prostate cancer screening with MRI is not 100% accurate and often leads to missing high-risk patients and unnecessary aggressive treatment for low-risk patients.
The purpose of this thesis is to develop reliable computational methods to aid physicians in better diagnosis and treatment of prostate cancer patients.

Preface

This thesis is primarily based on five manuscripts resulting from the collaboration among multiple researchers. The manuscripts have been modified accordingly to present a consistent thesis.

A study described in Chapter 2 has been published in:

• Alireza Mehrtash, Alireza Sedghi, Mohsen Ghafoorian, Mehdi Taghipour, Clare M. Tempany, William M. Wells III, Tina Kapur, Parvin Mousavi, Purang Abolmaesumi, Andrey Fedorov. Classification of clinical significance of MRI prostate findings using 3D convolutional neural networks. Medical Imaging 2017: Computer-Aided Diagnosis. International Society for Optics and Photonics, 10134: 101342A, 2017.

The contribution of the author was in developing, implementing, and evaluating the method. Drs. Sedghi and Ghafoorian contributed to developing and implementing the proposed method. Drs. Tempany and Taghipour provided clinical insight. Drs. Fedorov, Abolmaesumi, Mousavi, and Kapur helped with their valuable suggestions in improving the methodology.

A version of Chapter 3 has been published in:

• Alireza Mehrtash, Mohsen Ghafoorian, Guillaume Pernelle, Alireza Ziaei, Friso G. Heslinga, Kemal Tuncali, Andriy Fedorov, Ron Kikinis, Clare M. Tempany, William M. Wells, Purang Abolmaesumi, Tina Kapur. Automatic needle segmentation and localization in MRI with 3-D convolutional neural networks: application to MRI-targeted prostate biopsy. IEEE Transactions on Medical Imaging, 38(4):1026-1036, 2018.

The contribution of the author was in developing, implementing, and evaluating the method. Drs. Ghafoorian and Pernelle provided valuable scientific input to improve the proposed method. Dr. Ziaei and F.G. Heslinga created the needle segmentation ground truth. Dr. Tuncali performed the biopsy procedures, with technical support from Dr. Fedorov. Dr. Tempany provided clinical insight for the MR-guided prostate biopsy procedure. Profs. Kapur, Wells, Abolmaesumi, and Kikinis helped with their valuable suggestions in improving the methodology.

A version of Chapter 4 has been published in:

• Mohsen Ghafoorian, Alireza Mehrtash, Tina Kapur, Nico Karssemeijer, Elena Marchiori, Mehran Pesteie, Charles R.G. Guttmann, Frank-Erik de Leeuw, Clare M. Tempany, Bram van Ginneken, Andriy Fedorov, Purang Abolmaesumi, Bram Platel, William M. Wells. Transfer learning for domain adaptation in MRI: Application in brain lesion segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, 2017.

Dr. Ghafoorian and the author contributed equally to developing, implementing, and evaluating the method. Prof. de Leeuw helped with data preparation and annotation. Drs. Wells, Platel, Abolmaesumi, Fedorov, van Ginneken, Tempany, Guttmann, and Pesteie helped with their valuable suggestions in improving the methodology.

A version of Chapter 6 has been published in:

• Alireza Mehrtash, William M. Wells III, Clare M. Tempany, Purang Abolmaesumi, Tina Kapur. Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation. IEEE Transactions on Medical Imaging, 2020.

The contribution of the author was in developing, implementing, and evaluating the method. Profs. Kapur, Abolmaesumi, Tempany, and Wells helped with their valuable suggestions in improving the methodology.
All co-authors contributed to the editing of the manuscript.

A version of the study presented in Chapter 7 will be published in the proceedings of the 2020 Conference on Neural Information Processing Systems (NeurIPS):

• Alireza Mehrtash, Purang Abolmaesumi, Polina Golland, Tina Kapur, Demian Wassermann, William M. Wells III. PEP: Parameter Ensembling by Perturbation. NeurIPS 2020.

The contribution of the author was in developing, implementing, and evaluating the method. Prof. Wells contributed to the mathematical derivation of the local analysis. Drs. Abolmaesumi, Wassermann, Golland, and Kapur helped with their valuable contributions and suggestions in improving the methodology.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication
1 Introduction
  1.1 Clinical Background
  1.2 Magnetic Resonance Imaging for Prostate Cancer
  1.3 Machine Learning in Prostate Cancer Imaging
  1.4 Objectives
  1.5 Contributions
  1.6 Thesis Outline
2 Prostate Cancer Diagnosis in MRI
  2.1 Introduction and Background
  2.2 Data
  2.3 Patch-based Cancer Classifier
    2.3.1 Preprocessing
    2.3.2 Network Architecture
    2.3.3 Training
    2.3.4 Results
  2.4 FCN Classifier and Uncertainty in Biopsy Location
    2.4.1 Gaussian Weighted Loss
    2.4.2 Location Uncertainty-aware Inference
    2.4.3 Experimental Setup
    2.4.4 Results
  2.5 Discussion and Conclusion
3 Biopsy Needle Localization in MRI
  3.1 Introduction and Background
  3.2 Methods
    3.2.1 MRI-Targeted Biopsy Clinical Workflow
    3.2.2 Data
    3.2.3 Data Annotation
    3.2.4 Data Preprocessing
    3.2.5 Convolutional Neural Networks
    3.2.6 Network Architecture
    3.2.7 Training
  3.3 Experimental Setup
    3.3.1 Observer Study
    3.3.2 Evaluation Metrics
    3.3.3 Test-time Augmentation
    3.3.4 Ensembling
    3.3.5 Implementation and Deployment
  3.4 Results
    3.4.1 Tip Localization
    3.4.2 Tip Axial Plane Detection
    3.4.3 Trajectory Localization
    3.4.4 Needle Direction
    3.4.5 Data Augmentation
    3.4.6 Execution Time
  3.5 Discussion
  3.6 Conclusion
4 Transfer Learning for Domain Adaptation in MRI
  4.1 Introduction and Background
  4.2 Materials and Method
    4.2.1 Dataset
    4.2.2 Sampling
    4.2.3 Network Architecture and Training
    4.2.4 Domain Adaptation
    4.2.5 Experiments
  4.3 Results
  4.4 Discussion and Conclusion
5 Weakly-supervised Medical Image Segmentation
  5.1 Introduction and Background
  5.2 Method
  5.3 Applications and Data
  5.4 Experimental Setup
    5.4.1 Partial Annotation Generation
    5.4.2 Training
    5.4.3 Partial Loss Functions
  5.5 Results
  5.6 Discussion and Conclusion
6 Uncertainty Estimation for Image Segmentation
  6.1 Introduction and Background
  6.2 Related Works
  6.3 Contributions
  6.4 Applications & Data
    6.4.1 Brain Tumor Segmentation Task
    6.4.2 Ventricular Segmentation Task
    6.4.3 Prostate Segmentation Task
    6.4.4 Data Pre-processing
  6.5 Methods
    6.5.1 Model
    6.5.2 Calibration Metrics
    6.5.3 Confidence Calibration with Ensembling
    6.5.4 Segment-level Predictive Uncertainty Estimation
  6.6 Experiments
    6.6.1 Training Baselines
    6.6.2 Cross-entropy vs. Dice
    6.6.3 MC dropout
    6.6.4 Confidence Calibration
    6.6.5 Segment-level Predictive Uncertainty
  6.7 Results
  6.8 Discussion
  6.9 Conclusion
7 PEP: Parameter Ensembling by Perturbation
  7.1 Introduction and Background
  7.2 Method
    7.2.1 Baseline Model
    7.2.2 Hierarchical Model
    7.2.3 Local Analysis
  7.3 Experiments
    7.3.1 ImageNet Experiments
    7.3.2 MNIST and CIFAR-10 Experiments
  7.4 Conclusion
8 Conclusion and Future Work
  8.1 Contributions
  8.2 Future Work
Bibliography

List of Tables

2.1 Classification quality of models for diagnosing clinically significant prostate cancer in MRI evaluated on reported biopsy locations (n=325). Models trained with partial cross-entropy loss are compared with those trained with Gaussian cross-entropy loss. The results of inference-time biopsy location adjustments are also provided for multiple Gaussian kernel sizes.
3.1 Number of patients and needle MRIs for the training/validation and test sets.
3.2 Needle tip localization error (mm) for test cases for the proposed CNN method and the second observer*.
3.3 Trajectory localization error averaged over test cases for the proposed CNN method and the second observer* (units are in millimeters).
3.4 Needle direction error quantified as the deviation angle averaged over test cases for the proposed CNN method and the second observer* (units are in degrees).
3.5 Impact of training-time and test-time augmentation on performance*.
4.1 Number of patients for the domain adaptation experiments.
5.1 Segmentation quality of models in terms of the Dice coefficient (95% CI) of foreground structures: weakly-supervised training with partial annotations (points and scribbles) is compared with fully supervised training. Models trained with partial CE loss (PCL) [27] are compared with those that were trained with the proposed partial Dice loss (PDL). Fractions of partial annotations to full labels are given (abbreviated to fr.). Boldface indicates statistically significant differences between model pairs (p-value<0.05).
5.2 Segmentation quality of models in terms of the 95th percentile Hausdorff distance (95% CI) of foreground structures. Models trained with partial cross-entropy loss (PCL) [27] are compared with those that were trained with the proposed partial Dice loss (PDL). Boldface indicates statistically significant differences between model pairs (p-value<0.05).
6.1 Number of patients for training, validation, and test sets used in this study.
6.2 Calibration quality and segmentation performance for baselines trained with cross-entropy loss (L_CE) are compared with those that were trained with Dice loss (L_DSC) and those that were calibrated with ensembling (M=50) and MC dropout. Boldfaced font indicates the best results for each application (model) and shows that the differences are statistically significant.
7.1 ImageNet results: For all models except VGG19, PEP achieves statistically significant improvements in calibration compared to the baseline (BL) and temperature scaling (TS), in terms of NLL and Brier score. PEP also reduces test errors, while TS does not have any effect on test errors. Although TS and PEP outperform the baseline in terms of ECE% for DenseNet121, DenseNet169, ResNet, and VGG16, the improvements in ECE% are not consistent among the methods. T* and σ* denote the optimized temperature for TS and the optimized sigma for PEP, respectively. Boldfaced font indicates the best results for each metric of a model and shows that the differences are statistically significant (p-value<0.05).
7.2 MNIST and CIFAR-10 results: The table summarizes the experiments described in Section 7.3.2.

List of Figures

1.1 Prostate anatomy [53].
1.2 Multiparametric MRI of a patient with clinically significant prostate cancer. Arrows mark the lesion location. (a) Axial T2-weighted MR. (b) Computed high b-value (1400 s/mm²) diffusion-weighted MR. (c) ADC map. (d) KTrans parametric map from dynamic contrast-enhanced T1-weighted MRI.
2.1 Distribution of training and test datasets of the PROSTATEx challenge. (a) Training samples: the distribution of lesion findings shows that the training dataset is not balanced in terms of both zonal distribution and the clinical significance of the finding. (b) Test samples are not balanced in terms of zones.
2.2 CNN model for the PROSTATEx challenge.
2.3 ROC curve for prostate cancer diagnosis.
2.4 Illustration of true positive prostate cancer diagnosis.
2.5 Method overview. For training time (a), we propose a weighted cross-entropy (CE) loss to handle sparse biopsy ground truth. For inference time (b-d), we propose a probabilistic framework which models noise in observed locations and makes adjustments. An FCN architecture is used for the seamless rendering of the cancer probability maps (c). (a) shows a sample training image (ADC) together with a Gaussian loss weight centered on the biopsy location. The weight is applied to the pixel-level samples of the CE loss. The trained network with optimized parameters, θ̂, is then used for inference. (b) shows a sample test image, I_i, with the reported (observed) biopsy position, x_i^o, marked on the left peripheral zone. (c) shows the network prediction for the probability of cancer at each pixel, p(z_i = 1 | I_i, x_i, θ̂), where z_i = 1 denotes a cancerous biopsy outcome. (d) shows the input image overlaid with a Gaussian denoting the probability of the latent true biopsy location given the observed biopsy location. (e) shows the probability distribution for the latent true biopsy location, p(x_i | x_i^o, I_i, θ̂, z_i = 1). Using (e), the presumably misplaced reported location of the biopsy can be adjusted.
The proposed network uses multi-modal inputs; here, for simplicity, we only show the ADC input.
2.6 Examples of biopsy location adjustments. The first column shows the input ADC images with the given biopsy locations (black crosshairs). The b-value and KTrans images were used for inference but are not shown here. For all four examples, the most probable latent location for cancer was found (white crosshairs) using Equation 2.6. The last two columns show the calculated true latent biopsy location probabilities given that the result is clinically significant (third column) or insignificant (fourth column). The top two rows show false positive predictions that turned into true positives by the proposed adjustment. The bottom two rows show true negatives that turned into false positives by adjustment.
3.1 Transperineal in-gantry MRI-targeted prostate biopsy procedure: (a) The patient is placed in the supine position in the MRI gantry, and his legs are elevated to allow for transperineal access. The skin of the perineum is prepared and draped in a sterile manner, and the needle guidance template is positioned. (b), (c) and (d): Axial, sagittal and coronal views of intraprocedural T2-W MRI with the needle tip marked by a white arrow. (e) 3D rendering of the needle (blue), segmented by our method, and visualized relative to the prostate gland (purple), and an MRI cross-section that is orthogonal to the plane containing the needle tip.
3.2 (a) and (b): Examples of needle-induced susceptibility artifacts in MRI where, instead of a single hypointense (dark) region, there are two hypointense regions separated by a hyperintense (bright) region. In such cases, the human expert followed the needle carefully across several slices to ensure the integrity of the annotation. The arrow marks the needle identified by the expert.
3.3 Original, cropped and padded, and segmentation volumes of interest (VOIs). (a) The original grayscale volume (VOI_ORIG, red box) is cropped in the x and y directions and padded in the z direction to a volume of size 164.5 × 164.5 × 165.6 mm (188 × 188 × 46 voxels) centered on the prostate gland (VOI_CP, blue box). VOI_CP is used as the network input. The network output segmentation map is of size 88 × 88 × 64.8 mm (100 × 100 × 18 voxels) (VOI_SEG, green box). The adjusted voxel spacing for the volumes is 0.88 × 0.88 × 3.6 mm. (b), (c), (d) show axial, sagittal and coronal views, respectively, of a patient case overlaid with the boundaries of the volumes VOI_ORIG, VOI_CP and VOI_SEG.
3.4 Schematic overview of the anisotropic 3D fully convolutional neural network for needle segmentation and localization in MRI. The network architecture consists of 14 convolutional layers, 3 max-pooling and 3 up-sampling layers. Convolutional layers were applied without padding, while max-pooling layers halved the size of their inputs only in the in-plane directions. The parameters, including the kernel sizes and number of kernels, are explained in each corresponding box. Shortcut connections ensure the combination of low-level and high-level features. The input to the network is the 3D volume image with the prostate gland at the center (188 × 188 × 46) and the output segmentation map has a size of 100 × 100 × 18.
3.5 An example test case. Green, yellow, and red contours show the needle segmentation boundaries of the ground truth, the proposed system, and the second observer, respectively. The arrows mark the needle tips. (a) The first row shows the ground truth. The second row shows predictions of the proposed system. The third row shows second observer annotations. (b) Zoomed view of slices in (a). (c) Coronal views. (d) 3D rendering of the needle relative to the prostate gland (blue), ground truth and CNN predictions. For the proposed CNN, the measured needle tip localization error (ΔP), tip axial plane detection error (ΔA), Hausdorff distance (HD), and angular deviation error (Δθ) are 1.76 mm, 0 voxels, 1.24 mm, and 0.30°, respectively.
3.6 Box plots of the needle tip deviation error and Hausdorff distance (HD) in millimeters for the test cases. Distances of the automatic (CNN) method and the second observer are shown, which are comparable. The median tip localization error and the median HD for both the CNN and the second observer are 0.88 mm (1 pixel in the transaxial plane) and 1.24 mm, respectively.
3.7 Bar charts of the needle tip axial plane localization error (ΔA). Needle tip axial plane distance errors of the automatic (CNN) method and the second observer are shown. The results of the automatic CNN method are comparable with the second observer.
4.1 Architecture of the convolutional neural network used in our experiments. The shallowest i layers are frozen and the remaining d - i layers are fine-tuned. d is the depth of the network, which was 15 in our experiments.
4.2 (a) The comparison of Dice scores on the target domain with and without transfer learning. A logarithmic scale is used on the x axis. (b) Given a deep CNN with d = 15 layers, transfer learning was performed by freezing the i initial layers and fine-tuning the last d - i layers. The Dice scores on the test set are illustrated with the color-coded heatmap. On the map, the number of fine-tuned layers is shown horizontally, whereas the target domain training set size is shown vertically.
4.3 Examples of the brain WMH MRI segmentations. (a) Axial T1-weighted image. (b) FLAIR image. (c-f) FLAIR images with WMH segmented labels: (c) reference (green) WMH. (d) WMH (red) from a domain-adapted model (f̃_ST(.)) fine-tuned on five target training samples. (e) WMH (yellow) from a model trained from scratch (f̃_T(.)) on 100 target training samples. (f) WMH (orange) from a model trained from scratch (f̃_T(.)) on 5 target training samples.
5.1 Sample cardiac MRI image (a) with different forms of annotations (b-d); yellow, purple, green, and blue colors correspond to the right ventricle, endocardium, left ventricle, and background, respectively. Fully supervised training of FCNs for semantic segmentation requires annotation of all pixels (b). The goal of this study is to develop weakly-supervised segmentation methods for training FCNs with a single point (c) and scribble (d). In this study, points refer to single-pixel marks for each class on each image slice. Scribbles have a width of one pixel. In this example, the sizes of points and scribbles are exaggerated for better visualization.
5.2 Examples of segmentation from scribble-supervised training of models with partial cross-entropy loss (CE), partial Dice loss (DSC), and models trained with full masks. The rows from top to bottom show the results for segmentation of the right ventricle, the prostate gland, and the kidney, respectively.
6.1 Calibration and out-of-distribution detection. Models for prostate gland segmentation were trained with T2-weighted MR images acquired using phased-array coils. The results of inference are shown for two test examples imaged with: (a) a phased-array coil (in-distribution example), and (b) an endorectal coil (out-of-distribution example). The first column shows T2-weighted MRI images with the prostate gland boundary drawn by an expert (white line). The second column shows the MRI overlaid with uncalibrated segmentation predictions of an FCN trained with Dice loss. The third column shows the calibrated segmentation predictions of an ensemble of FCNs trained with Dice loss. The fourth column shows the histogram of the calibrated class probabilities over the predicted prostate segment of the whole volume. Note that the bottom row has a much wider distribution compared to the top row, indicating that this is an out-of-distribution example. In the middle column, predicted prostate class probabilities ≤ 0.001 have been masked out.
6.2 Improvements in calibration as a function of the number of models in the ensemble for baselines trained with cross-entropy and Dice loss functions. Calibration quality in terms of NLL improves as the number of models M increases for prostate, heart, and brain tumor segmentation. For each task, an ensemble of size M=10 trained with Dice loss outperforms the baseline model (M=1) trained with cross-entropy in terms of NLL. The same plot with 0.95 CIs and for both whole-volume and bounding-box measurements is given in Figure 4 of the Supplementary Material.
6.3 Segment-level predictive uncertainty estimation. Top row: Scatter plots and linear regression between the Dice coefficient and the average of entropy over the predicted segment, H(Ŝ). For each of the regression plots, Pearson's correlation coefficient (r) and the 2-tailed p-value for testing non-correlation are provided. Dice coefficients are logit transformed before plotting and regression analysis. For the majority of the cases in all three segmentation tasks, the average entropy correlates well with the Dice coefficient, meaning that it can be used as a reliable metric for predicting the segmentation quality of the predictions at test time. Higher entropy means less confidence in predictions and more inaccurate classifications, leading to poorer Dice coefficients. However, in all three tasks there are a few cases that can be considered outliers. (A) For prostate segmentation, samples are marked by their domain: PROSTATEx (source domain), and the multi-device, multi-institutional PROMISE12 dataset (target domain). As expected, on average, the source domain performs much better than the target domain, meaning that average entropy can be used to flag out-of-distribution samples.
6.4 The two bottom rows correspond to two of the cases from the PROMISE12 dataset that are marked in (A): Case I and Case II. These show the prostate T2-weighted MRI at different locations of the same patient with overlaid calibrated class probabilities (confidences) and histograms depicting the distribution of probabilities over the segmented regions. The white boundary overlay on the prostate denotes the ground truth. The wider probability distribution in Case II is associated with a higher average entropy, which correlates with a lower Dice score. Case I was imaged with a phased-array coil (the same as the images that were used for training the models), while Case II was imaged with an endorectal coil (an out-of-distribution case in terms of imaging parameters). The samples in the scatter plots in (B) and (C) are marked by their associated foreground segments. The color bar for the class probability values is given in Figure 6.1. Qualitative examples for the brain and heart applications and scatter plots for models trained with cross-entropy are given in Figures 7 and 8 of the Supplementary Material, respectively.
7.1 Parameter Ensembling by Perturbation (PEP) on pre-trained InceptionV3 [175]. The rectangle shaded in gray in (a) is shown in greater detail in (b). The average log-likelihood of the ensemble average, L(σ), has a well-defined maximum at σ = 1.85 × 10^-3. The ensemble also has a noticeable increase in likelihood over the individual ensemble item average log-likelihoods, ln(L), and over their average. In this experiment, an ensemble size of 5 (M = 5) was used for PEP and the experiments were run on 5000 validation images.
7.2 Improving pre-trained DenseNet169 with PEP (M=10). (a) and (b) show the reliability diagrams of the baseline and the PEP. (c) shows examples of misclassifications corrected by PEP. The examples were among those with the highest PEP effect on the correct class probability. (c) Top row: brown bear and lampshade changed into Irish terrier and boathouse; middle row: band aid and pomegranate changed into sandal and strawberry; bottom row: bathing cap and wall clock changed into volleyball and pinwheel. The histograms at the right of each image illustrate the probability distribution of the ensemble. Vertical red and green lines show the predicted class probabilities of the baseline and the PEP for the correct class label. (For more reliability diagrams see the Supplementary Material.)
7.3 The relationship between overfitting and the PEP effect. (a) shows the average of NLLs on the test set for CIFAR-10 baselines (red line) and PEP L (black line). The baseline curve shows overfitting as a result of overtraining. The degree of overfitting was calculated by subtracting the training NLL (loss) from the test NLL (loss). PEP reduces overfitting and improves log-likelihood. The PEP effect is more substantial as the overfitting grows. (b), (c), (d) show scatter plots of overfitting vs. PEP effect for CIFAR-10, MNIST (MLP), and MNIST (CNN), respectively.
List of Abbreviations

2D Two Dimensional
3D Three Dimensional
ADC Apparent Diffusion Coefficient
AUC Area Under the Curve
BN Batch Normalization
CADe Computer-aided detection
CADx Computer-aided diagnosis
CE Cross-entropy
CNN Convolutional Neural Network
CT Computed Tomography
CZ Central Zone
DCE Dynamic Contrast-enhanced
DWI Diffusion Weighted Imaging
EM Expectation-maximization
FCN Fully Convolutional Neural Network
FLAIR Fluid-attenuated Inversion Recovery
GGG Gleason Grade Group
GPU Graphical Processing Unit
HD Hausdorff Distance
MCD Monte Carlo Dropout
mpMRI Multi-parametric Magnetic Resonance Imaging
MRI Magnetic Resonance Imaging
NLL Negative Log Likelihood
NN Neural Network
PCa Prostate Cancer
PEP Parameter Ensembling by Perturbation
PI-RADS Prostate Imaging Reporting and Data System
PK Pharmacokinetics
PZ Peripheral Zone
ReLU Rectified Linear Unit
ROC Receiver Operating Characteristic
SGD Stochastic Gradient Descent
SVM Support Vector Machine
TRUS Transrectal Ultrasound
TS Temperature Scaling
TZ Transition Zone
WMH White Matter Hyperintensities

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my advisor, Prof. Purang Abolmaesumi. I am grateful for his invaluable insight, guidance, and continuous support during my PhD studies. I particularly appreciate him for giving me the freedom to find and investigate different ideas.

I feel very fortunate to have Dr. Tina Kapur as my co-supervisor, guide, and friend. She is not just an outstanding mentor, she is a wonderful human being. One of a kind! Thank you, Tina! Thank you for believing in me and for always being there for me. Thank you for your endless support to make this dream come true.

Many thanks to my wonderful co-supervisor Prof. William (Sandy) M. Wells. I was so lucky to have the chance to work with you. I enjoyed all the machine learning chats we had over the past years and I learned a lot from you.

I am thankful to my committee members, Prof. Robert Rohling, Prof. Leonid Sigal, and Prof. Zhen Jane Wang, for reading my thesis and providing me with constructive comments and insightful feedback.

I thankfully acknowledge the grants that supported my research during the course of my studies from the National Center for Image Guided Therapy (NIH P41EB015898), the Natural Science and Engineering Research Council of Canada (NSERC), and the Canadian Institutes of Health Research (CIHR).

I am so grateful to my mentors at SPL over the past years. I would like to thank Prof. Ron Kikinis for his full support since I joined SPL. I would like to thank Prof. Clare M. Tempany for always being supportive of my research. Moreover, I would like to thank her for keeping me up-to-date with all the clinical insights that I needed for my research in prostate cancer. I would like to thank Dr. Andrey Fedorov, from whom I learned a lot about working with medical imaging data. I would like to thank all the other amazing SPL members that I worked with over the past years. Thank you Steve, Jay, Junichi, Brunilda, and Danielle. I would also like to thank the amazing researchers that I met at SPL and worked on several ideas with together. Thank you Alireza Sedghi, Joeky, Mehdi, Friso, and Prashin.

I feel lucky to have met Mohsen Ghafoorian at SPL in November 2016. Despite
I enjoyed workingon a medical imaging challenge with Mehran, Amir, and Jorden. I had anamazing time and learned a lot from you guys. I am also very happy that Imet Mehran at RCL. I miss working with him and all the brainstorming thatwe had during our coffee breaks.Special thanks to Saeideh and Rasool for taking us from the airport thenight we arrived in Vancouver and taking the best care of us for our wholestay in Vancouver. Because of you, we felt at home.I offer my gratitude to my amazing parents-in-law for their full supportand outstanding care throughout my studies. Thank you, Fereshteh and Ali! Iam very fortunate to have you in my life. Special thanks to my brother-in-lawAmir for all his encouragements.I am so grateful to my wonderful sisters. Thank you Tahmineh for yourunconditional support, selfless care, and devotion which helped me to fulfillmy dream. Thank you Manjijeh for always believing in me and encouragingme. I want to express my gratitude to my brother-in-law Homayoun, whomhas always been a source of inspiration for me. I would also like to thankKiana and Roxana who have always been the best nieces one could have everwished for.And finally, I would have not been able to finish my PhD without the dailylove that I received from my better half Roya and my adorable daughtersDiana and Amitis. Thank you Roya for your unwavering love, support,encouragement, and patience in every step of this journey.xxivTo the loving memory of Fatemeh Khorgami and Javad MehrtashTo Roya, Diana, and AmitisxxvChapter 1Introduction1.1 Clinical BackgroundThe prostate is a walnut-shaped gland that is part of the male reproductivesystem (Figure 1.1). It is located in the pelvis at the base of the urinarybladder and surrounds the urethra. The prostate produces the seminal fluidthat combines with sperm from the testes. The alkaline nature of the prostaticfluid helps in reducing the acidity of the vaginal environment which couldextend the lifespan of sperm. The prostate is composed of both globular andfibromuscular tissues are enclosed in a surface termed the prostatic capsuleor prostatic fascia [171]. Along the urethra, from superior to inferior, theprostate is composed of three primary regions, the base (below the bladder),the midgland, and the apex (inferior part in the vicinity of the urogenitaldiaphragm). Histologically, the prostate is divided into four primary zones:anterior fibromuscular stroma (AS), the transition zone (TZ), the central zone(CZ), and the peripheral zone (PZ). AS contains no glandular tissue. CZ andTZ surround the ejaculatory ducts and the proximal urethra, respectively.TZ, CZ, and PZ contain about 5%, 20%, and 70 − 80% of the glandulartissue, respectively [183].Prostate cancer is the second most frequently diagnosed cancer in menand the fifth leading cause of cancer mortality worldwide [144]. In the UnitedStates, it is the most frequently diagnosed, noncutaneous male malignancyand the second leading cause of cancer-related mortality among men in theUnited States [165]. Statistics of prostate cancer frequency, morbidity, andmortality can be examined in many different ways. It is a very commoncancer, as it is a “tumor of aging," but it has a very low disease-specificParts of Sections 1.1, 1.2, and 1.3 are adapted from Wenya Linda Bi, Ahmed Hosny,Matthew B. Schabath, Maryellen L. Giger, Nicolai J. Birkbak, Alireza Mehrtash, TavisAllison, Omar Arnaout, Christopher Abbosh, Ian F. Dunn, Raymond H. Mak, Rulla M.Tamimi, Clare M. Tempany, Charles Swanton, Udo Hoffmann, Lawrence H. 
Schwartz,Robert J. Gillies, Raymond Y. Huang, Hugo J. W. L. Aerts. Artificial intelligence in cancerimaging: clinical challenges and applications. CA: A Cancer Journal for Clinicians. WileyPeriodicals, Inc. on behalf of American Cancer Society 2019.11.1. Clinical BackgroundFigure 1.1: Prostate anatomy [53].mortality, all of which reinforce its characterization as a complex publichealth concern that impacts a large population. Although prostate canceris a serious disease, most men diagnosed with prostate cancer do not dieof it [56]. The key clinical problems in prostate cancer diagnosis todayinclude 1) overdiagnosis and overtreatment resulting from an inability topredict the aggressiveness and risk of a given cancer; and 2) inadequatetargeted biopsy sampling, leading to misdiagnosis and to disease progressionin men with seemingly low-risk prostate cancer. In a meta-analysis [111], thereported rate of overdiagnosis of nonclinically significant prostate cancer wasas high as 67%, leading to unnecessary treatment and associated morbidity.Because of this range of clinical behavior, it is necessary to differentiate menwho have clinically significant tumors (those with a biopsy Gleason score 7or higher and/or tumor volume > 0.5 ml) [195] as candidates for therapyfrom those who have clinically insignificant tumors and can safely undergoactive surveillance. It has been noted that potential survival benefits fromaggressively treating early-stage prostate cancer are undermined by harmfrom the unnecessary treatment of indolent disease.The current screening procedures for prostate cancer include a digitalrectal examination (DRE) and the prostate-specific antigen (PSA) bloodtest (which has recently been downgraded because of high false positiverates). DRE can only detect late-stage prostate cancer in the PZ. Henceit lacks the required sensitivity for early-stage cancer and cancer in otherzones. PSA elevated levels often indicate the presence of prostate cancer.On the other hand, patients with prostatitis or benign prostatic hyperplasiacan also have higher than normal PSA levels. Hence, while PSA screening21.2. Magnetic Resonance Imaging for Prostate Canceris a sensitive diagnostic test, it lacks the required specificity. An abnormalscreening indicates the possibility of prostate cancer, and random systematic(sextant) biopsies of the entire organ are performed on the patient underthe guidance of transrectal ultrasound (TRUS). These biopsies randomlysample a very small part of the gland and the results sometimes miss themost aggressive tumor within the gland [36, 58, 155].1.2 Magnetic Resonance Imaging for ProstateCancerMulti-parametric Magnetic Resonance Imaging (mpMRI) provides the re-quired soft tissue contrast for detection and localization of suspicious clinicallysignificant prostate lesions and gives information about tissue anatomy, func-tion, and characteristics (Figure 1.2). Importantly, it has superior capabilitiesto detect the “clinically significant disease.” Recent years have seen a growthin the volume of mpMRI examination of prostate cancer due to its ability todetect these lesions and allow targeted biopsy sampling. A large populationstudy from the UK suggested that use of mpMRI as a triage before primarybiopsy can reduce the number of unnecessary biopsies by a quarter anddecrease overdiagnosis of clinically insignificant disease [4]. This was furthervalidated in the and on smaller data sets than would be optimal. 
In themultinational PRECISION study of 500 patients [81], men randomized tompMRI prior to biopsy experienced a significant increase in the detection ofclinically significant disease over the current standard of care, which employsa 10-12 core transrectal ultrasound-guided biopsy (38% vs 26%).MRI has demonstrated value in not just detecting and characterizingclinically significant prostate cancer, but also in guiding biopsy needles to thesuspicious targets [4]. MRI has been incorporated into the biopsy procedurein two different ways. The first is MR/ultrasound fusion biopsy in whichtargets for biopsy are identified in diagnostic mpMRI, displayed in the TRUSimage (using MR/Ultrasound fusion) during a routine sextant biopsy, andadditionally sampled. The second is in-bore biopsy that is performed insidethe bore of an MR scanner; the diagnostic mpMRI with marked targets isoverlaid on a rapid acquisition intra-operative MR image (using MR/MRfusion) and targeted sampling is performed by the internationalist. A study ofover 1000 men undergoing biopsy for suspected prostate cancer [164] showedtargeted MR/ultrasound fusion biopsy, compared with standard extended-sextant TRUS biopsy, was associated with increased detection of high-riskprostate cancer and decreased detection of low-risk prostate cancer. Smaller31.3. Machine Learning in Prostate Cancer ImagingFigure 1.2: Multiparametric MRI of a patient with clinically significantprostate cancer. Arrows mark the lesion location. (a) Axial T2-weightedMR. (b) computed high–b value (1400 sec/mm2) diffusion-weighted MR.(c) ADC map (d) KTrans parametric map from dynamic contrast enhancedT1-weighted MRI.studies for in-bore biopsies have reported requiring significantly fewer coresand revealed a significantly higher percentage of cancer involvement perbiopsy core [136, 139, 187, 196].1.3 Machine Learning in Prostate Cancer ImagingThe growing trend towards mpMRI has introduced a demand for experiencedradiologists to interpret the exploding volumes of oncological prostate MRIs.Furthermore, reading challenging cases and reducing the high rate of interob-server disagreements on findings is a remaining challenge for prostate MRI.In 2015, the European Society of Urogenital Radiology, American Collegeof Radiology, and AdmeTech foundation published the second version ofProstate Imaging Reporting and Data system (PI-RADS). These provideguidelines for radiologists in reading and interpreting the prostate mpMRI,which aim to increase the consistency of interpretation and communicationof mpMRI findings. Over the past ten years, machine learning models havebeen developed as Computer-aided detection (CADe) and Computer-aided di-agnosis (CADx) systems to detect, localize, and characterize prostate tumors[109]. In conjunction with PI-RADS, accurate CAD systems can increase theinter-rater reliability and improve the diagnostic accuracy of mpMRI readingand interpretation [48]. In preliminary analyses, it has been shown that theaddition of a CADx system can improve the performance of radiologists inprostate cancer interpretation.The clinical motivation of this thesis is to aid radiologists in the detectionand classification of prostate cancer. While clinically prostate cancer has41.3. 
swung between extremes of under- and over-diagnosis and treatment, the underlying computer vision questions have remained unchanged: Are there distinct patterns in MRI images of patients suspected of prostate cancer that can be automatically detected to help detect and biopsy the cancer? Will a pattern recognition method developed for one set of MRI images work when the acquisition protocol or scanner is upgraded? Is this addressable with available technology, or are methodological improvements needed? Based on these questions, we developed the objectives for this thesis, which are described in the next section.

Computational methods, mostly based on supervised machine learning, have been successfully applied to imaging modalities such as MRI and ultrasound to detect suspicious lesions and differentiate clinically significant cancers from the rest. Recent application of deep learning in prostate cancer screening and aggressive cancer diagnosis has produced promising results. Preliminary work in mpMRI CADx systems focused primarily on classic supervised machine learning methodologies, including combinations of feature extractors and shallow classifiers. In this category of machine learning systems, feature engineering plays a central role in the overall performance of the CAD system. Combinations of CADe and CADx systems have been reported that use intensity, anatomical, pharmacokinetic (PK), texture, and blobness features [106]. PK metrics can be extracted from a time-signal analysis of intravenous contrast passing through a given volume of tissue. They include parameters such as wash-in and wash-out. Texture features are also signal based and depend heavily on imaging techniques. Others used intensity features calculated from mpMRI sequences, including T2-weighted, ADC, high b-value DWI, and a T2 estimation map from the proton density image [106], or only features extracted from PK analysis and DTI parameter maps [125]. Similar image-based features were included in CAD systems [19, 49, 129, 161], and many of these systems use support vector machines (SVMs) for classification [21, 94, 125, 188].

In the past few years, advancements in Deep Learning (DL) have dominated the field of computer-assisted PCa detection using mpMRI or ultrasound information, as individual modalities [192]. Most of the research has utilized Convolutional Neural Networks (CNNs) and various optimization strategies for achieving state-of-the-art performance in PCa detection. Several groups have taken advantage of the multi-sequence nature of the mpMRI data by stacking each modality as input channels, similar to RGB images [22, 110], and integrating their information early in the training. Kiraly et al. [87] used Fully Convolutional Networks (FCNs) for localization and classification of prostate lesions and achieved an Area Under the Curve (AUC) of 0.83 by training on 202 patients. Schelb et al. [154] proposed a U-Net architecture on bi-parametric prostate MRI (T2-weighted and ADC), and achieved performance similar to that of the Prostate Imaging Reporting and Data System (PI-RADS), the clinical standard of mpMRI scoring [183]. In another recent study, Sedghi et al.
[159] demonstrated the potential of integrating multimodal information from MRI and temporal ultrasound to improve prostate cancer detection. By using Fully Convolutional Neural Networks (FCNs) as the architecture of choice, they created cancer probability maps over entire imaging planes immediately.

The results of the ongoing research in the use of machine learning for the detection and characterization of prostate cancer are promising and demonstrate ongoing improvement. The recent body of research in prostate cancer image analysis shows a transition from feature engineering and classic machine learning methods towards deep learning and the use of large training sets. Unlike lung and breast cancers, clinical routines in prostate cancer have not yet adopted regulated CAD systems. However, the recently achieved results of deep learning techniques on mid-size datasets such as the PROSTATEx benchmark are promising. As is now evident, there has been a rapid growth in prostate MR exam volumes worldwide and an increasing demand for accurate interpretations. Accurate CAD systems will improve the diagnostic accuracy of prostate MRI readings, which will result in better care for individual patients, as fewer patients with benign and indolent tumors (false positives) will need to undergo invasive biopsy and/or radical prostatectomy procedures, which can lower their quality of life. On the other hand, early detection of prostate cancer improves the prognosis of patients with clinically significant prostate cancer. Computer-assisted detection and diagnosis systems for prostate cancer help clinicians by potentially reducing the chances of either missing or overdiagnosing suspicious targets on diagnostic MRIs, although this merits additional validation in trials before routine clinical incorporation.

1.4 Objectives

The main objective of this thesis is to develop reliable machine learning models and algorithms that can improve MRI-guided prostate cancer diagnosis and interventions. We start by building solutions and applications to facilitate prostate cancer diagnosis and interventions. We investigate the problem of distinguishing the normal gland from cancer using mpMRI images of the prostate. We study methods for localizing the biopsy needle tip and trajectory in MRI scans obtained for MRI-guided prostate biopsy.

We then address common challenges regarding data in real clinical setups. We propose approaches to improve our diagnosis system by recognizing that uncertainty in biopsy location is an issue. This led us to model the error and use probabilistic inference to accommodate this error. We develop transfer learning methodologies to address the domain shift problem.

Furthermore, we study the problem of prostate segmentation in the context of weakly-supervised learning and uncertainty estimation. Prostate segmentation in MRI is an important preprocessing step for several tasks, such as automated fusion of imaging data for targeted biopsy (guided by MRI alone or by MRI-ultrasound fusion), quantification of PSA density in assessing treatment response, and dose planning in radiotherapy. We propose a methodology for weakly-supervised segmentation with partially annotated ground truth. We further investigate methods to improve the calibration of deep segmenters through ensembling.

Finally, we propose a novel methodology for ensembling based on parameter perturbation.

1.5 Contributions

This thesis is an attempt to develop techniques that are essential for MRI-guided prostate cancer diagnosis and interventions.
In the course of achieving this objective, the following contributions were made:

• Developing a novel deep neural network for diagnosing clinically significant prostate cancer in mpMRI. The method uses diffusion and dynamic contrast images together with information about the location of the suspicious target to predict the probability of clinically significant cancer.

• Developing a novel probabilistic framework to model the uncertainty regarding the location of the biopsy samples. Also, developing a novel Gaussian weighted loss function as a form of data augmentation (label imputation) to train FCNs with sparse biopsy locations. The framework provides posterior probabilities of latent true biopsy locations given the image, the model, the observed biopsy location, and the biopsy outcome.

• Developing a novel method for fast automatic needle tip and trajectory localization and visualization in MRI for prostate biopsies. The proposed method has a performance comparable with human inter-observer concordance, reinforcing its clinical acceptability.

• Developing a novel transfer learning technique for domain adaptation of networks trained with one set of MRI acquisition parameters. This is an essential step for deployment of machine learning models in practice, where imaging parameters change. The proposed method is capable of tuning the deep network to the new domain. Here, we perform experiments on brain MRI images to assess the contributions. Since there are no prior assumptions regarding the specific problem of white matter hyperintensities (WMH) segmentation, we anticipate that the proposed method can be generalized to other medical imaging problems, including prostate cancer diagnosis with MRI. However, confirmation of this requires multi-domain prostate MRI datasets and further experimentation.

• Proposing a novel method for weakly-supervised semantic segmentation with point and scribble supervision in FCNs. We also propose partial Dice loss, a variant of the Dice loss function for deep weakly-supervised segmentation with sparse pixel-level annotations. Here, in addition to prostate segmentation, we evaluate the proposed method on heart and kidney segmentation problems.

• Developing a novel technique based on ensembling for confidence calibration and predictive uncertainty estimation for deep medical image segmentation. Also, proposing a novel entropy-based metric to predict the segmentation quality of foreground structures, which can further be used to detect out-of-distribution test inputs. We evaluate our contributions across three medical image segmentation applications of the prostate, the heart, and the brain.

• Proposing a new technique for confidence calibration and uncertainty estimation of neural networks without the need for network modification or several rounds of training. We proposed parameter ensembling by perturbation (PEP), which prepares an ensemble of parameter values as perturbations of the optimal parameter set from training, drawn from a Gaussian with a single variance parameter (a brief sketch of this idea follows the list).
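To make the PEP idea in the last contribution concrete, the following is a minimal sketch, not the thesis implementation: each ensemble member is a copy of a trained network whose parameters are perturbed with zero-mean Gaussian noise with a single standard deviation sigma, and the members' softmax outputs are averaged. The model, input batch, and the strategy for choosing sigma on a validation set are placeholders.

```python
import copy
import torch

def pep_predict(model, x, sigma, num_members=10):
    """Parameter Ensembling by Perturbation (PEP), minimal sketch.

    Each member is a copy of the trained model whose weights are perturbed
    with zero-mean Gaussian noise of standard deviation `sigma`. The ensemble
    prediction is the average of the members' predicted class probabilities.
    """
    member_probs = []
    for _ in range(num_members):
        member = copy.deepcopy(model)          # perturb a copy, keep the original intact
        with torch.no_grad():
            for p in member.parameters():
                p.add_(sigma * torch.randn_like(p))
            member.eval()
            member_probs.append(torch.softmax(member(x), dim=1))
    return torch.stack(member_probs).mean(dim=0)

# In practice, sigma would be selected on a held-out validation set,
# e.g. by a small grid search that maximizes the average log-likelihood
# of pep_predict's output against the validation labels.
```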
For the patch-based method, we use 3D convolutional neural networks and fully connected layers. For this model, in addition to image features, we also feed location features of the suspicious finding to the network. Our results suggest that, for the proposed architecture, the combination of diffusion weighted MRI (DWI) and parametric maps from dynamic contrast-enhanced (DCE) MRI serves as the best set of imaging features for diagnosing prostate cancer. The second proposed architecture, the FCN, makes it feasible to perform prediction on the whole gland in a single inference. Partial cross-entropy loss is used to train FCNs on sparse ground truth locations. Furthermore, we study methods to address the sparsity of the training data and the location uncertainty of the ground truth for deep cancer classifiers. We observe that the Gaussian weighted loss improves the area under the receiver operating characteristic curve and that the proposed biopsy location adjustment substantially improves the sensitivity of the models.

CHAPTER 3: BIOPSY NEEDLE LOCALIZATION IN MRI

Image guidance improves tissue sampling during biopsy by allowing the physician to visualize the tip and trajectory of the biopsy needle relative to the target in MRI, CT, ultrasound, or other relevant imagery. A system for fast automatic needle tip and trajectory localization and visualization in MRI was developed and tested in the context of an active clinical research program in prostate biopsy at Brigham and Women's Hospital. Needle tips and trajectories were annotated on 583 T2-weighted intra-procedural MRI scans acquired after needle insertion for 71 patients who underwent transperineal MRI-targeted biopsy procedures at our institution. The images were divided into two independent training-validation and test sets at the patient level. A deep 3-dimensional fully convolutional neural network model was developed, trained, and deployed on these samples. The accuracy of the proposed method, as tested on previously unseen data, was an average of 2.80 mm in needle tip detection and 0.98° in needle trajectory angle. An observer study was designed in which independent annotations by a second observer, blinded to the original observer, were compared to the output of the proposed method. The resultant error was comparable to the measured inter-observer concordance, reinforcing the clinical acceptability of the proposed method.

CHAPTER 4: TRANSFER LEARNING FOR DOMAIN ADAPTATION IN MRI

It is well known that variations in MRI acquisition protocols result in different appearances of normal and diseased tissue in the images. Convolutional neural networks (CNNs), which have been shown to be successful in many medical image analysis tasks, are typically sensitive to variations in imaging protocols. Therefore, in many cases, networks trained on data acquired with one MRI protocol do not perform satisfactorily on data acquired with different protocols. This limits the use of models trained on large annotated legacy datasets for a new dataset from a different domain, which is a recurring situation in clinical settings. In this study, we investigated the following central questions regarding domain adaptation in medical image analysis: given a fitted legacy model, 1) how much data from the new domain is required for adequate adaptation of the original network? and 2) what portion of the pre-trained model parameters should be retrained given a certain number of new domain training samples? To address these questions, we conducted extensive experiments on the white matter hyperintensity segmentation task.
We trained a CNN on legacy MR images for a specific task and evaluated the performance of the domain-adapted network on the same task with images from a different domain. We then compared the performance of the model to surrogate scenarios in which either the same trained network is used or a new network is trained from scratch on the new dataset. The domain-adapted network, tuned with only two training examples, substantially outperformed a similar network trained from scratch on the same set of examples.

CHAPTER 5: WEAKLY-SUPERVISED MEDICAL IMAGE SEGMENTATION

Fully convolutional neural networks (FCNs), including U-Nets, have achieved state-of-the-art results in semantic segmentation for numerous medical imaging applications. Training deep models for segmentation requires high-quality pixel-level ground truth annotations, which are time-consuming and expensive to obtain. Partial annotations such as points or scribbles can be used as less expensive alternatives. In this chapter, we study weakly-supervised FCN-based segmentation methods that can be trained with only a single annotated point or a single annotated scribble per slice of a medical image volume. We propose the use of a partial Dice loss function in our methods because it encourages higher Dice values for the collections of pixels where the ground truth is known. Furthermore, we systematically compare partial Dice loss with partial cross-entropy loss in terms of segmentation quality and demonstrate statistically significant performance improvement. We evaluate the proposed methods through extensive experiments on five segmentation tasks across three medical image domains: images of the prostate, the kidney, and the heart. Among these applications, our methods with single-point and single-scribble supervision achieve 51%-95% and 86%-97% of the performance of fully supervised training, respectively.

CHAPTER 6: UNCERTAINTY ESTIMATION IN SEGMENTATION

Fully convolutional neural networks (FCNs), and in particular U-Nets, have achieved state-of-the-art results in semantic segmentation for numerous medical imaging applications. Moreover, batch normalization and Dice loss have been used successfully to stabilize and accelerate training. However, these networks are poorly calibrated, i.e., they tend to produce overconfident predictions for both correct and erroneous classifications, making them unreliable and hard to interpret. In this chapter, we study predictive uncertainty estimation in FCNs for medical image segmentation. We make the following contributions: 1) we systematically compare cross-entropy loss with Dice loss in terms of segmentation quality and uncertainty estimation of FCNs; 2) we propose model ensembling for confidence calibration of FCNs trained with batch normalization and Dice loss; 3) we assess the ability of calibrated FCNs to predict the segmentation quality of structures and detect out-of-distribution test examples. We conduct extensive experiments across three medical image segmentation applications of the prostate, the heart, and the brain to evaluate our contributions. The results of this study offer considerable insight into predictive uncertainty estimation and out-of-distribution detection in medical image segmentation and provide practical recipes for confidence calibration.
Moreover, we consistently demonstrate that model ensembling improves confidence calibration.

CHAPTER 7: PEP: PARAMETER ENSEMBLING BY PERTURBATION

Ensembling is recognized as an effective approach for increasing the predictive performance and calibration of deep networks. We introduce a new approach, Parameter Ensembling by Perturbation (PEP), that constructs an ensemble of parameter values as random perturbations of the optimal parameter set from training by a Gaussian with a single variance parameter. The variance is chosen to maximize the log-likelihood of the ensemble average (L) on the validation data set. Empirically, and perhaps surprisingly, L has a well-defined maximum as the variance grows from zero (which corresponds to the baseline model). Conveniently, the calibration level of the predictions also tends to improve until the peak of L is reached. In most experiments, PEP provides a small improvement in performance and, in some cases, a substantial improvement in empirical calibration. We show that this "PEP effect" (the gain in log-likelihood) is related to the mean curvature of the likelihood function and the empirical Fisher information. Experiments on ImageNet pre-trained networks, including ResNet, DenseNet, and Inception, showed improved calibration and likelihood. We further observed a mild improvement in classification accuracy on these networks. Experiments on classification benchmarks such as MNIST and CIFAR-10 showed improved calibration and likelihood, as well as a relationship between the PEP effect and overfitting; this demonstrates that PEP can be used to probe the level of overfitting that occurred during training. In general, no special training procedure or network architecture is needed, and in the case of pre-trained networks, no additional training is needed.

CHAPTER 8: CONCLUSION AND FUTURE WORKS

This chapter includes a short summary followed by a discussion of the methods for prostate cancer diagnosis and interventions. It also includes suggestions for future work.

Chapter 2
Prostate Cancer Diagnosis in MRI

2.1 Introduction and Background

Prostate cancer is the most frequently diagnosed noncutaneous male malignancy and the second leading cause of cancer-related mortality among men in the United States [165]. Magnetic resonance imaging (MRI) is widely used for prostate cancer detection, localization, diagnosis, and guidance of biopsy procedures due to its ability to provide superior contrast between cancer and adjacent soft tissue [183]. Convolutional neural networks (CNNs) have been successfully used for prostate cancer detection, localization, and characterization in medical images [192]. Machine learning models are often trained with pathology results from biopsy procedures [7]. Training deep CNNs with sparse labels can be approached with both patch-based [22, 110, 156, 190] and fully convolutional neural network (FCN) models [25, 71, 87, 88, 154]. In patch-based training, samples centered at the biopsy points are created and the network acts as a binary classifier; the model output therefore consists of two neurons corresponding to the binary labels. The input of an FCN is the whole image slice containing the prostate, and the output is a probability map with the same dimensions as the input image. FCNs can thus be used to create a cancer probability map for the whole prostate. Furthermore, FCN architectures allow efficient training and learning of contextual features and provide a computationally efficient way to estimate cancer probability for the whole volume.
Biopsy ground truth locations can become noisy due to registration and sampling errors during biopsy procedures. Further noise can be introduced as a result of inter-modality image registration or in the course of annotating biopsy locations on multiparametric MRI (mpMRI). Noisy ground truth can have adverse effects on both training and inference of computer-assisted diagnosis systems.

Section 2.3 of this chapter is adapted from Alireza Mehrtash, Alireza Sedghi, Mohsen Ghafoorian, Mehdi Taghipour, Clare M. Tempany, William M. Wells III, Tina Kapur, Parvin Mousavi, Purang Abolmaesumi, Andriy Fedorov. Classification of clinical significance of MRI prostate findings using 3D convolutional neural networks. Medical Imaging 2017: Computer-Aided Diagnosis. International Society for Optics and Photonics, 10134:101342A, 2017.

In this chapter, we study the problem of clinically significant prostate cancer diagnosis with CNNs. We develop both patch-based and FCN models for diagnosis. We further propose a probabilistic framework to incorporate biopsy location uncertainty into the inference. In summary, we make the following contributions:

• We present a 3D CNN tailored to the task of diagnosing clinically significant prostate cancer for suspicious findings in mpMRI. The proposed network benefits from the explicit addition of location-aware features (zonal information of the finding).

• We propose an FCN for end-to-end diagnosis and segmentation of clinically significant cancer tissue. To do so, we present a Gaussian weighted loss as a label imputation mechanism for training FCNs with sparse biopsy data. We compare the proposed loss function with partial cross-entropy (CE) [27, 159, 177], in which only the biopsy locations are used for the loss calculation during optimization. We observe that FCNs trained with the proposed loss function achieve better classification results than those trained with partial CE loss.

• We propose a probabilistic framework for modeling ground truth location uncertainty. By using priors on the observed biopsy locations, we calculate the probability of the true latent biopsy locations. Using the posterior biopsy location probability distributions, we adjust the biopsy location toward nearby positive findings (lesions). We observe that, compared to the baselines, the updated biopsy locations significantly improve sensitivity by detecting lesions for which the reported biopsy location was displaced.

• We train and validate our proposed methods on the PROSTATEx dataset [7, 106].

The rest of this chapter is organized as follows: in Section 2.2, we describe the PROSTATEx dataset that was used for this study. Section 2.3 presents the proposed 3D patch-based CNN model for cancer diagnosis. Section 2.4 covers the proposed FCN model and the probabilistic framework for clinically significant cancer segmentation and diagnosis. Section 2.5 presents a discussion and our conclusions from this chapter.

Figure 2.1: Distribution of the training and test datasets of the PROSTATEx challenge. (a) Training samples: the distribution of lesion findings shows that the training dataset is not balanced in terms of either zonal distribution or the clinical significance of the findings. (b) Test samples are not balanced in terms of zones.

2.2 Data

The training dataset consisted of 204 patients with 330 suspicious lesion findings, and the test dataset consisted of 140 patients with 208 findings. For each of the findings, an assignment to one of four possible prostate anatomic regions was available.
These anatomic regions are: the peripheral zone (PZ), which comprises 70-80% of the glandular tissue and accounts for ≈70% of prostate cancers; the transition zone (TZ), which comprises 5% of the glandular tissue and accounts for ≈25% of prostate cancers; the central zone (CZ), which comprises 20% of the glandular tissue and accounts for ≈5% of prostate cancers; and the non-glandular anterior fibromuscular stroma (AS) [109]. The training and test samples in the PROSTATEx challenge were from the PZ, TZ, AS, and seminal vesicles (SV), as illustrated in Figure 2.1.

2.3 Patch-based Cancer Classifier

2.3.1 Preprocessing

After minor data cleaning that consisted of excluding patients with incomplete series as well as SV findings, we selected 201 subjects with 321 findings for training and validation purposes. In order to augment and balance the training dataset, we used flipping and translation of the original data. As a result of data augmentation, we generated 5-fold cross-validation datasets with 10,000 training and 2,000 validation samples for each fold. For training-validation splitting, we used stratified sampling based on pathology outcome and prostate zone to make the subgroups homogeneous. Image intensities were normalized to be within the range of [0,1]. 3D patches of size 40×40×40 mm for T2 and 32×32×12 for the DWI and DCE-MRI images, centered at the finding locations, served as training image patches.

2.3.2 Network Architecture

Our CNN architecture, illustrated in Figure 2.2, included three input streams: ADC maps and maximum b-value images from DWI, and Ktrans from DCE-MRI. Similar to the work of Ghafoorian et al. [45], we added explicit zone information to the first dense layer. The DCE-MRI and DWI streams, with input sizes of (32×32×12), had 9 convolutional layers combining (3×3×1) and (3×3×3) filter sizes. Max-pooling layers of size 2×2×1 were applied in selected middle layers. At the end of each stream, the output of the last convolutional layer was connected to a dense layer. The neurons of this layer were concatenated with the zonal information of the finding and fed to another set of three fully connected layers. The leaky rectified linear unit [114], which allows a small, non-zero gradient when the unit is not active, was used as the non-linearity.

2.3.3 Training

For training the network, we used the stochastic gradient descent algorithm with the Adam update rule [86], a mini-batch size of 64, and a binary cross-entropy loss function. We initialized the CNN weights randomly from a Gaussian distribution using the He method [59]. We also batch-normalized [70] the intermediate responses of all layers to accelerate convergence. To prevent overfitting, in addition to batch normalization, we used dropout with 0.25 probability as well as L2 regularization with a λ2 = 0.005 penalty on the neuron weights. We used an early stopping policy by monitoring validation performance and picked the model with the highest accuracy on the validation set. Cross-validation was used to find the best combination of input channels and the number of filters for the convolutional layers.

Figure 2.2: Architecture of the proposed 3D CNN for the PROSTATEx challenge for detection of clinically significant cancer. The network uses a combination of the ADC map, maximum b-value (BVAL) from DWI, and Ktrans from DCE-MRI with zone information. (Input streams: ADC, BVAL, and Ktrans, each of size 32×32×12; building blocks: Conv3D 3×3×1, Conv3D 3×3×3, MaxPool3D 2×2×1, Flatten, Dense, and Merge; the zone information enters at the dense layers and the output is a sigmoid unit.)
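To make the training configuration of Section 2.3.3 concrete, the following is a minimal sketch of one possible Keras setup. The stream depth, filter counts, zone encoding, patience, and names are illustrative simplifications and assumptions, not the exact thesis code.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def image_stream(name):
    """One 3D convolutional input stream (simplified to three conv layers)."""
    inp = layers.Input(shape=(32, 32, 12, 1), name=name)
    x = inp
    for filters in (16, 32, 64):
        x = layers.Conv3D(filters, (3, 3, 3), padding='same',
                          kernel_initializer='he_normal',
                          kernel_regularizer=regularizers.l2(0.005))(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
    x = layers.MaxPooling3D(pool_size=(2, 2, 1))(x)
    return inp, layers.Dense(16)(layers.Flatten()(x))

(adc_in, adc), (bval_in, bval), (ktrans_in, ktrans) = (
    image_stream('adc'), image_stream('bval'), image_stream('ktrans'))
zone_in = layers.Input(shape=(4,), name='zone')   # one-hot zone information (number of zones assumed)

x = layers.Concatenate()([adc, bval, ktrans, zone_in])
for units in (64, 32):
    x = layers.Dense(units)(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dropout(0.25)(x)
out = layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model([adc_in, bval_in, ktrans_in, zone_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='binary_crossentropy', metrics=['accuracy'])
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy',
                                              patience=10,
                                              restore_best_weights=True)
# model.fit(..., callbacks=[early_stop])
```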
Figure 2.3: Comparison of classifiers trained with the architecture in Figure 2.2 on different folds of cross-validation. (ROC curves, sensitivity versus 1 - specificity; Model 1: Az = 0.79, Model 2: Az = 0.85, Model 3: Az = 0.76, Model 4: Az = 0.77.)

2.3.4 Results

Our training-validation results indicate that the combination of the ADC, maximum b-value, and Ktrans modalities together with zonal information of the lesion leads to the best performance, characterized by the area under the receiver operating characteristic (ROC) curve (Az). Figure 2.3 shows the results of training on different folds of our cross-validation. For test data prediction, we combined the predictions of the best 4 out of the 5 models by averaging their outputs. Figure 2.4 shows an example of a true positive finding in the validation set.

This network was evaluated by the organizers of the PROSTATEx challenge on a held-out test set containing 206 findings from 140 patients and achieved an area under the ROC curve (AUC) of 0.80. This is within the range of our validation results, indicating that the proposed model generalized well to the test data. Our results are also comparable with the Az values of 0.79 and 0.83 achieved by an experienced human reader for PI-RADS v1 and PI-RADS v2, respectively [80]. The proposed method ranked 6th out of 72 entries in the challenge [7].

Figure 2.4: An example of a PZ true positive in the validation set. Only modalities (d-f), together with the zone information (zone = PZ), were used by the network to predict the clinical significance of the finding.

2.4 FCN Classifier and Uncertainty in Biopsy Location

Here, we consider MRI-guided cancer diagnosis with sparse biopsy ground truth as a weakly-supervised binary classification problem. The input images I_i ∈ R^n are n-dimensional. Sparse labels z_i ∈ {0, 1} are the reported biopsy results, where 0 corresponds to benign tissue or clinically insignificant cancer (Gleason score ≤ 3+3) and 1 corresponds to clinically significant cancer (Gleason score ≥ 3+4). Each biopsy label comes with a reported (observed) biopsy coordinate x_i^o ∈ R^3 that can be noisy. Figure 2.5 visually illustrates the problem and the proposed method.

2.4.1 Gaussian Weighted Loss

Deep neural networks are often optimized using maximum likelihood estimation:

\hat{\theta} = \arg\max_{\theta} \sum_{i} \ln p(z_i \mid I_i, x_i^o, \theta)    (2.1)

In this problem the labeled data is sparse. Partial CE loss [27, 177] can be used to train an FCN with partially labeled data. We can consider labeling the adjacent pixels with the same label as the reported biopsy points. Label imputation can be done around the biopsy point by considering a conditional probability between pseudo-label sample pairs (x_i, ẑ_i) and the biopsy location (x_i^o, z_i), such that the pseudo-labels have the same class label as the observed label. We assume the conditional probability p(x_i^o | x_i) has a Gaussian distribution, p(x_i^o | x_i) ∼ N(x_i − x_i^o, Σ). Using this, we can rewrite Equation 2.1 as:

\hat{\theta} = \arg\max_{\theta} \sum_{i} \mathbb{E}_{p(x_i \mid x_i^o, I_i, \theta, z_i)} \left[ \ln p(z_i \mid I_i, x_i, \theta) \right]    (2.2)

Figure 2.5: Method overview. At training time (a), we propose a weighted cross-entropy (CE) loss to handle sparse biopsy ground truth. At inference time (b-d), we propose a probabilistic framework which models noise in the observed locations and makes adjustments.
The FCN architecture is used for seamless rendering of the cancer probability maps (c). (a) shows a sample training image (ADC) together with a Gaussian loss weight centered on the biopsy location. The weight is applied to the pixel-level samples of the CE loss. The trained network with optimized parameters, θ̂, is then used for inference. (b) shows a sample test image, I_i, with the reported (observed) biopsy position, x_i^o, marked on the left peripheral zone. (c) shows the network prediction for the probability of cancer at each pixel, p(z_i = 1 | I_i, x_i, θ̂), where z_i = 1 denotes a cancerous biopsy outcome. (d) shows the input image overlaid with a Gaussian denoting p(x_i^o | x_i), which relates the latent true biopsy location to the observed biopsy location. (e) shows the probability distribution of the latent true biopsy location, p(x_i | x_i^o, I_i, θ̂, z_i = 1). Using (e), the presumably misplaced reported location of the biopsy can be adjusted. The proposed network uses multi-modal inputs; for simplicity, only the ADC inputs are shown here.

Equation 2.2 can be further expanded into:

\hat{\theta} = \arg\max_{\theta} \sum_{i} \sum_{x_i} p(x_i \mid x_i^o, I_i, \theta, z_i) \cdot \ln p(z_i \mid I_i, x_i, \theta)    (2.3)

Here, p(x_i | x_i^o, I_i, θ, z_i) can be considered a weight that is applied to the pixel-level samples during training. The weighted loss function in FCNs can be interpreted as an alternative to shift augmentation in patch-based training.

2.4.2 Location Uncertainty-aware Inference

The observed locations of the biopsy points x_i^o can be noisy. By modeling the noise in the biopsy locations as Gaussian, we can relate the latent true biopsy coordinate x_i ∈ R^3 to the observed location as:

p(x_i^o \mid x_i) \sim \mathcal{N}(x_i - x_i^o, \Sigma)    (2.4)

Given the image I_i, the classifier estimates the probability of cancer at location x_i as p(z_i | I_i, x_i, θ̂). We can form a conditional probability to represent the most probable latent location of the biopsy:

p(x_i \mid x_i^o, I_i, \hat{\theta}, z_i) = \frac{p(z_i \mid I_i, x_i, \hat{\theta}) \cdot p(x_i^o \mid x_i)}{\sum_{x_i} p(z_i \mid I_i, x_i, \hat{\theta}) \cdot p(x_i^o \mid x_i)}    (2.5)

For each pixel in the image, the probability of the latent location given a positive cancer outcome, p(x_i | x_i^o, I_i, θ̂, z_i = 1), or a negative outcome, p(x_i | x_i^o, I_i, θ̂, z_i = 0), can be calculated. To improve the sensitivity of the model and reduce the chance of missing cancer, the reported biopsy location can be adjusted to the most probable latent biopsy location x* given a positive outcome:

x^{*} = \arg\max_{x_i} p(x_i \mid x_i^o, I_i, \hat{\theta}, z_i = 1)    (2.6)

2.4.3 Experimental Setup

Data and Preprocessing

We used 203 patients with 325 suspicious lesions from the training set of the PROSTATEx dataset [106]. Stratified 6-fold cross-validation was used for the training and validation of the proposed methods. Registration with mutual information maximization [191] was used to adjust possible misalignments between the mpMRI sequences. We followed the same registration procedure as Kiraly et al. [87]. The registered images were then resampled to a resolution of 0.5×0.5×3 mm. All axial slices were then cropped at the center to create images of size 224×224 pixels as the input size of the FCN. Image intensities were normalized to be within the range of [0,1].
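Before turning to the model and training details, the following minimal sketch illustrates the Gaussian weight map and weighted cross-entropy of Section 2.4.1, assuming a TensorFlow/Keras backend; the function names are illustrative rather than the thesis code.

```python
import numpy as np
import tensorflow as tf

def gaussian_weight_map(shape, center, sigma):
    """2D Gaussian weight centered on the reported biopsy location.

    shape  : (H, W) of an axial slice
    center : (row, col) of the observed biopsy point
    sigma  : standard deviation of the weight in pixels
    """
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2)).astype(np.float32)

def weighted_partial_cross_entropy(y_true, y_pred, weights, eps=1e-7):
    """Pixel-wise binary CE weighted by the Gaussian map.

    Pixels far from the biopsy point receive a weight close to zero, so
    only the neighbourhood of the (imputed) biopsy label drives the loss;
    a very narrow weight map recovers the partial CE baseline.
    """
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    ce = -(y_true * tf.math.log(y_pred) +
           (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_sum(weights * ce) / (tf.reduce_sum(weights) + eps)
```

The same kind of Gaussian map, renormalized with the network output as in Equation 2.5, can also serve as the location prior used for the inference-time adjustment.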
Model & Training

We used an architecture similar to U-Net [147] but with three input channels and fewer kernel filters at each layer. The inputs of the model are the ADC and high b-value images from DWI and Ktrans from DCE-MRI. The input and output of the model have sizes of 224×224×3 and 224×224×2, respectively. The network has the same number of layers as the original U-Net. The numbers of kernels for the encoder section of the network are 16, 16, 32, 32, 32, 32, 64, 64, 128, and 128. The parameters of the convolutional layers were initialized randomly from a Gaussian distribution [59]. For optimization, stochastic gradient descent with the Nesterov Adam update rule [34] was used. A mini-batch of 16 examples was used during training. The initial learning rate was set to 0.0005 and was reduced by a factor of 0.5 if the average validation loss did not improve by 0.001 over 5 epochs. We used 50 epochs for training the models with an early stopping policy. For each training run, the model checkpoint was saved at the epoch where the validation loss was lowest. For each of the validation folds, the model was trained 5 times with partial CE and 5 times with Gaussian weighted CE, each with random weight initialization and random shuffling of the training data. We used ensembling by averaging the network predictions to boost the performance and calibration of the models [96].

Experiments

We compare the classification performance of models trained with partial CE loss with that of models trained with Gaussian-weighted CE. We generate 2D Gaussian weight maps with σ values of 0.5, 1, and 2. For all trained models, we calculate p(x_i | x_i^o, I_i, θ̂, z_i = 1) with σ values of 5, 9, and 15 and find the adjusted biopsy location x*. We then compare the baseline predictions at the reported biopsy locations with the probabilities at the adjusted biopsy locations. For statistical tests and for calculating 95% confidence intervals (CI), we use bootstrapping (n = 1000).

2.4.4 Results

Table 2.1 compares the classification performance of the models trained with partial CE with that of models trained with Gaussian CE loss with σ = 2. The area under the receiver operating characteristic (ROC) curve (Az) improved from 0.74 to 0.78 by including adjacent labels in the loss calculation. Gaussian CE with σ values of 0.5 and 1 achieved Az (95% CI) of 0.78 (0.72-0.84) and 0.77 (0.71-0.83), respectively. Both Az values were significantly better than the baseline with partial CE. Table 2.1 also compares the performance at the original (observed) biopsy points with that at the adjusted locations for different σ values. As expected, biopsy location adjustment acts in favor of finding lesions and increases sensitivity. The increase in sensitivity comes at the cost of a notable reduction in the specificity of the models.

Figure 2.6 provides examples of the location uncertainty estimation and biopsy location adjustments of the proposed probabilistic framework.

2.5 Discussion and Conclusion

In this chapter, we studied training deep cancer classifiers using sparse biopsy annotations. Moreover, we modeled biopsy location uncertainty and proposed a method for improving the sensitivity of the models by biopsy location adjustment. The FCN was trained with DWI and DCE-MRI data as input and biopsy ground truth as targets for clinically significant prostate cancer detection. We proposed a Gaussian weighted loss function as a form of data augmentation (label imputation) to train FCNs with sparse biopsy locations. Furthermore, we proposed a probabilistic framework to model the uncertainty regarding the location of the biopsy samples. The framework provides posterior probabilities of the latent true biopsy locations given the image, model, observed biopsy location, and biopsy outcome. The models were assessed on 325 biopsy locations. The results of our experiments show that the proposed weighted loss provides better classification compared to baselines that only used the given sparse labels in their loss function.
We showed improved sensitivity and area under the ROC curve through biopsy location adjustment, at the expense of more false positives.

The main limitation of our work is the relatively small number of cases that were used for the development and validation of our system. Only about one fourth of the biopsies were clinically significant cancers. Not only were the cancerous samples limited, they were also heterogeneous in terms of cancer severity and lesion size. Despite this, the achieved results are promising and should be validated on larger patient populations and preferably with independent test sets.

Future work will explore the use of the posterior probabilities of the latent true biopsy coordinates for improving training procedures. Through an expectation-maximization (EM) framework, Equation 2.5 can be used as an E-step to re-estimate the probability distribution over biopsy locations given the prior knowledge and the classifier's output. The maximum likelihood estimation (Equation 2.1) can then be updated to include the current knowledge of the distribution from which the samples are drawn (M-step).

Table 2.1: Classification quality of models for diagnosing clinically significant prostate cancer in MRI, evaluated on the reported biopsy locations (n = 325). Models trained with partial cross-entropy loss are compared with those trained with Gaussian cross-entropy loss. The results of inference-time biopsy location adjustment are also provided for multiple Gaussian kernel sizes.

Biopsy Location       TP    TN    FN    FP    Sens (%)   Spec (%)   F-1    Az (95% CI)
Partial Cross-entropy Loss
  Original            37    210   40    38    48.05      84.68      0.49   0.74 (0.67-0.80)
  Adjusted (σ = 5)    55    161   22    87    71.43      64.92      0.50   0.76 (0.70-0.82)
  Adjusted (σ = 9)    68    116    9   132    88.31      46.77      0.49   0.77 (0.70-0.83)†
  Adjusted (σ = 15)   72     79    5   168    93.50      31.85      0.45   0.78 (0.71-0.84)†
Gaussian Cross-entropy Loss (σ = 2)
  Original            41    212   36    36    53.25      85.48      0.53   0.78 (0.72-0.83)
  Adjusted (σ = 5)    47    198   30    50    61.04      79.84      0.54   0.79 (0.74-0.85)†
  Adjusted (σ = 9)    57    174   20    74    74.03      70.16      0.55   0.77 (0.72-0.83)
  Adjusted (σ = 15)   65    127   12   121    78.00      51.21      0.49   0.79 (0.73-0.84)†

TP = true positives; TN = true negatives; FN = false negatives; FP = false positives; Sens = sensitivity; Spec = specificity.
† Differences are statistically significant (p-value < 0.01).

Figure 2.6: Examples of biopsy location adjustments. The first column shows the input ADC images with the given biopsy locations (black crosshairs). The b-value and Ktrans images were used for inference but are not shown here. For all four examples, the most probable latent cancer location was found (white crosshairs) using Equation 2.6. The last two columns show the calculated latent true biopsy location probabilities given that the result is clinically significant (third column) or insignificant (fourth column). The top two rows show false positive predictions that turned into true positives by the proposed adjustment. The bottom two rows show true negatives that turned into false positives by the adjustment. (Column headings: ADC, p(z = 1 | I, x, θ̂), p(x | x^o, I, θ̂, z = 1), p(x | x^o, I, θ̂, z = 0).)

Chapter 3
Biopsy Needle Localization in MRI

3.1 Introduction and Background

When screening indicates the possibility of prostate cancer in an individual, the standard of care includes non-targeted systematic (sextant) biopsies of the entire organ under the guidance of transrectal ultrasound (TRUS).
These biopsies randomly sample a very small part of the gland and sometimes miss the most aggressive tumor within the gland [36, 58, 155]. MRI has demonstrated value not just in detecting and localizing the cancer, but also in guiding biopsy needles to the suspicious targets [4]. In particular, biopsies with intra-operative MRI guidance require significantly fewer cores than the standard TRUS approach and reveal a significantly higher percentage of cancer involvement per biopsy core [136, 139, 187, 196].

Accurately placing needles in suspicious target tissue is critical for the success of a biopsy procedure. An intra-operative MRI allows the physician to check the position and trajectory of the needle relative to the suspicious target in a three-dimensional (3D) stack of cross-sectional images and to make needed adjustments. Physicians achieve targeting accuracy in the range of 3-6 mm for MRI-guided prostate biopsy, which is adequate for the task since clinically significant prostate cancer lesions are typically larger than 0.5 mL in volume, or 9.8 mm in diameter (assuming spherical lesions) [90, 170, 182]. Automatic localization of the needle tip and trajectory can aid the physician by providing rapid 3D visualization that reduces their cognitive load and the duration of the procedure. In addition, for the realization of robot-guided percutaneous needle placement procedures, accurate and automatic needle localization is a necessary part of the feedback loop [145].

This chapter is adapted from Alireza Mehrtash, Mohsen Ghafoorian, Guillaume Pernelle, Alireza Ziaei, Friso G. Heslinga, Kemal Tuncali, Andriy Fedorov, Ron Kikinis, Clare M. Tempany, William M. Wells, Purang Abolmaesumi, Tina Kapur. Automatic needle segmentation and localization in MRI with 3-D convolutional neural networks: application to MRI-targeted prostate biopsy. IEEE Transactions on Medical Imaging, 38(4):1026-1036, 2018.

While MRI is the imaging modality of choice for identifying suspicious biopsy targets because of its ability to provide superior soft tissue contrast, it poses two types of challenges for the needle localization task. The first challenge is that parts of a needle may appear substantially different from others in an MRI scan while also being difficult to distinguish from the surrounding tissue [141]. This variability in the grayscale appearance of needles confounds automatic segmentation algorithms and is addressed in this study.

Today, aside from the proposed work, there are no automatic solutions for the segmentation of needles from MRI images [32, 167]. Even manual segmentation from MRI is tedious and error-prone, and to the best of our knowledge, is not attempted in clinical or research programs. The second challenge, while not addressed in this study, is worth noting: an MRI does not directly show the geometric location of a needle. Instead, the needle is detected through a loss of signal due to the susceptibility artifact that occurs at the interfaces of materials with substantially different magnetic resonance properties, commonly referred to as the needle artifact. Studies report a displacement between the actual needle tip and the needle tip artifact [167]. For brevity, the term needle is used instead of needle artifact in this study. The needle trajectory is defined as the set of points connecting the centers of the artifact across a stack of axial cross-sections.
The needle tip is the center of the needle artifact in the most distal plane.

Several approaches have been suggested in the literature for segmentation and localization of needle-like, i.e., elongated tubular, objects in medical images. Segmentation of tortuous and branched structures, such as blood vessels [105, 193], white matter tracts [57, 133], or nerves [174], is the target of many reported methods. Other methods target straight or bent catheters [65, 116, 137]. Depending on the clinical application, the proposed techniques have been applied to different image modalities, including ultrasound [3, 13, 65], computed tomography [52, 128], and MRI [116, 137], for the purpose of localization after insertion or real-time guidance during insertion. Many attempts have been made to employ hand-crafted and kernel-based methods to segment and localize such objects, which can be considered line detection algorithms. The reported methods are based on 3D Hough transforms [13, 65], model-based and raycasting-based search [116, 137], orthogonal 2-dimensional projections [3], generalized Radon transforms [132], and random sample consensus (RANSAC) [184].

Deep convolutional neural networks (CNNs) use the power of representation learning for complex pattern recognition tasks [51]. Deep model representations are learned through multiple levels of abstraction in a supervised training scheme, as opposed to hand-crafting of features. CNNs have been extensively used in medical image analysis and have outperformed conventional methods for many tasks [107]. For instance, CNNs have been shown to achieve outstanding performance for segmentation [45], localization [29], cancer diagnosis [120], quality assessment [2], and vessel segmentation [189].

In this work, we propose a CNN-based system for automatic segmentation and localization of biopsy needles in MR images. The proposed system uses CNNs to extract hierarchical representations from MRI to segment needles for the purpose of tip and trajectory localization. An asymmetric 3D fully convolutional neural network with in-plane pooling and up-sampling layers was designed to handle the anisotropic nature of the needle MR images. The proposed asymmetry in the network design is computationally efficient and allows whole volumetric MR images to be used for training. A large dataset of MRI acquired in transperineal prostate biopsy procedures was used for developing the system; 583 volumetric T2-weighted MRI scans from 71 biopsy procedures (on 71 distinct patients) were used to design, optimize, train, and test the deep learning models.

The performance of CNNs and other supervised machine learning methods is measured against that of experienced humans, which is known to be variable for medical image analysis tasks; observer studies are used to establish ranges for human performance against which automated CNNs can be rated. An observer study was conducted to compare the quality of the predictions against a second observer.

To promote further research and facilitate reproduction of the results, the resulting trained deep learning model is publicly available via DeepInfer [119], an open-source deployment platform for trained models. To the best of our knowledge, we are the first and only group to attempt fully automatic segmentation and localization of needles in MRI.
Since there are no prior assumptions regarding the prostate images, the proposed method can be generalized and adopted in other clinical procedures for needle segmentation and localization in MRI.

The rest of this chapter is organized as follows: in Section 3.2 we describe the methods for this study, including the clinical workflow of in-gantry MRI-targeted prostate biopsy and details of the proposed deep learning system. Sections 3.3 and 3.4 cover the experimental setup and results, respectively, of applying the proposed system to the MRI-targeted biopsy procedure. Section 3.5 presents a discussion and our conclusions from this study.

Figure 3.1: Transperineal in-gantry MRI-targeted prostate biopsy procedure: (a) The patient is placed in the supine position in the MRI gantry, and his legs are elevated to allow for transperineal access. The skin of the perineum is prepared and draped in a sterile manner, and the needle guidance template is positioned. (b), (c), and (d): Axial, sagittal, and coronal views of intraprocedural T2-weighted MRI with the needle tip marked by a white arrow. (e) 3D rendering of the needle (blue), segmented by our method, visualized relative to the prostate gland (purple) and an MRI cross-section that is orthogonal to the plane containing the needle tip.

3.2 Methods

3.2.1 MRI-Targeted Biopsy Clinical Workflow

The general workflow of an in-gantry transperineal MRI-targeted prostate biopsy involves imaging in two stages: a) the preoperative stage, during which multiparametric MRI consisting of T1, T2, diffusion weighted, and dynamic contrast enhanced images is acquired and the cancer-suspicious targets are marked, and b) the intraoperative stage, during which the patient is immobilized on the table top inside the gantry of the MRI scanner and tissue samples are acquired transperineally with a biopsy needle under intraoperative MRI guidance. At the beginning of the intra-operative stage, anesthesia is administered to the patient and a grid template is affixed to his perineum to facilitate targeted sampling. Intra-operative MR images are acquired as needed to optimize the skin entry point and depth for each needle insertion. One or more biopsy samples are taken for each target, depending on the sample quality. Samples are sent for histological analysis, and the institutional post-operative care protocol is followed for the patient. At our site, almost 600 such procedures have been performed under intravenous conscious sedation; one to five biopsy samples are obtained using an off-the-shelf 18-gauge side-cutting MR-compatible core biopsy needle, and the patient is discharged, on average, two hours later [38, 136, 179].

3.2.2 Data

The data used in this study consist of 583 intraprocedural MRI scans obtained from 71 patients who underwent transperineal MRI-guided biopsy between December 2010 and September 2015. This retrospective study was HIPAA compliant, and institutional review board (IRB) approval and informed consent were obtained. The patients in this cohort had prostate MRI lesions suspicious for new cancer, recurrent cancer after prior therapy, or higher-grade cancer than their initial diagnosis. Each of the intraprocedural MRI scans is an axial fast spin echo (FSE) T2-weighted volume with a size range of 256-320 × 204-320 × 18-30 voxels, voxel spacing in the range of 0.53-0.94 mm in-plane, and slice thickness of 3.6-4.8 mm.
The acquisition parameters for the FSE sequence were set as follows: repetition time (TR) of 3000 ms, echo time (TE) of 106 ms, and flip angle (FA) of 120 degrees [136]. The imaging takes about one minute and is performed after needle insertion to visualize the needle relative to the target. These scans were acquired on either a conventional wide-bore 3T MR scanner (Verio, Siemens Healthcare, Erlangen, Germany) or a ceiling-mounted version of it (IMRIS/Siemens Verio; IMRIS, Minnetonka, Minn).

3.2.3 Data Annotation

A custom needle annotation software tool was used by an expert human rater to interactively mark the needle trajectory and tip on each of the 583 MRI scans [137]. These annotations are also referred to as ground truth. In these images, a needle can be identified by the dark susceptibility artifact around its shaft, as seen in Figure 3.1(c) and (d). The annotation tool allowed the human rater to place several control points ranging from the tip of the needle to its base. These control points were then used to fit a Bézier curve representing the trajectory of the needle artifact. Thus, the manual needle trajectory relies only on the observer input (and not on the underlying gray scale values). Ground truth needle segmentation label maps were then generated by creating a 4 mm diameter cylinder around the Bézier curve to cover the hypointense artifact that surrounds the needle shaft, as seen in Figure 3.1(e).

It should be noted that even for experienced human observers, there can be ambiguity in picking the axial plane containing the needle tip due to the large slice thickness and partial volume effects. In addition, there are cases, as shown in Figure 3.2, where the needle susceptibility artifact consists of two hypointense regions separated by a hyperintense one (instead of a single hypointense region). The human observer followed the needle carefully across several slices to ensure the integrity of the annotation.

Figure 3.2: (a) and (b): Examples of needle-induced susceptibility artifacts in MRI where, instead of a single hypointense (dark) region, there are two hypointense regions separated by a hyperintense (bright) region. In such cases, the human expert followed the needle carefully across several slices to ensure the integrity of the annotation. The arrow marks the needle identified by the expert.

Table 3.1: Number of patients and needle MRIs for the training/validation and test sets.

                training & validation    test
  # patients             50               21
  # images              410              173

The annotated images were split at the patient level into 70% for training/cross-validation for algorithm development and 30% for final testing (Table 3.1).

3.2.4 Data Preprocessing

Prior to training the CNN models, the data were preprocessed in four steps: resampling, cropping, padding, and intensity normalization, as follows.

Resampling: First, the data were resampled to a common resolution of 0.88 × 0.88 × 3.6 mm. The MR images and the ground truth segmentation maps were resampled with linear and nearest neighbor interpolation, respectively. The SimpleITK implementation of the interpolation methods was used for image resampling [113].
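As an illustration of this resampling step, the following is a minimal sketch using SimpleITK; the spacing follows the text, while the function and file names are illustrative rather than the thesis code.

```python
import SimpleITK as sitk

def resample_to_spacing(image, new_spacing=(0.88, 0.88, 3.6), is_label=False):
    """Resample an image (or label map) to a common voxel spacing.

    Grayscale MR images use linear interpolation; ground truth label maps
    use nearest neighbor interpolation so that label values are preserved.
    """
    original_spacing = image.GetSpacing()
    original_size = image.GetSize()
    new_size = [int(round(sz * sp / nsp))
                for sz, sp, nsp in zip(original_size, original_spacing, new_spacing)]
    interpolator = sitk.sitkNearestNeighbor if is_label else sitk.sitkLinear
    return sitk.Resample(image, new_size, sitk.Transform(), interpolator,
                         image.GetOrigin(), new_spacing, image.GetDirection(),
                         0, image.GetPixelID())

# Example usage (file names are placeholders):
# t2 = resample_to_spacing(sitk.ReadImage("case_t2.nii.gz"))
# needle_label = resample_to_spacing(sitk.ReadImage("case_needle.nii.gz"), is_label=True)
```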
Cropping: Second, to constrain the search area for the needle tip, each MR image was cropped to a box of size 165.4 × 165.4 × 165.6 mm (188 × 188 × 46 voxels) around the center of the prostate gland. The size of the box was chosen to be large enough to easily accommodate the largest expected diseased gland, and small enough to fit in GPU memory for efficient processing. Even though a very coarsely selected bounding box that contains the prostate gland is sufficient for this step, we used a separate deep network, a customized variant of the 2D U-Net architecture that was readily available to us, to perform the segmentation automatically [119] (http://www.deepinfer.org/models/prostate-segmenter/).

Padding: Third, the borders of the cropped volume were padded by 50 mm (14 pixels) in the z direction. The zero padding in the z direction is required to accommodate the reduction in the spatial dimensions of the output of the 3D convolutional filters. As a result of the convolution operations, in the in-plane directions (x and y) the output of the final layer of the network (green box in Figure 3.3) is 88 pixels smaller than the input image (blue box in Figure 3.3).

Intensity Normalization: Fourth, to reduce the heterogeneity of the grayscale distribution in the data, intensities were truncated and re-scaled to the range between the 0.1% and 99% quantiles of the intensity histogram and then normalized to the range of [0, 1].

Figure 3.3: Original, cropped and padded, and segmentation volumes of interest (VOIs). (a) The original grayscale volume (VOI_ORIG, red box) is cropped in the x and y directions and padded in the z direction to a volume of size 165.4 × 165.4 × 165.6 mm (188 × 188 × 46 voxels) centered on the prostate gland (VOI_CP, blue box). VOI_CP is used as the network input. The network output segmentation map is of size 88 × 88 × 64.8 mm (100 × 100 × 18 voxels) (VOI_SEG, green box). The adjusted voxel spacing for the volumes is 0.88 × 0.88 × 3.6 mm. (b), (c), (d) show axial, sagittal, and coronal views, respectively, of a patient case overlaid with the boundaries of the volumes VOI_ORIG, VOI_CP, and VOI_SEG.
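A minimal sketch of the intensity normalization step, assuming the volume is held as a NumPy array (the function name is illustrative):

```python
import numpy as np

def normalize_intensities(volume, lower_q=0.1, upper_q=99.0):
    """Truncate intensities to the [0.1%, 99%] quantiles and rescale to [0, 1]."""
    lo, hi = np.percentile(volume, [lower_q, upper_q])
    clipped = np.clip(volume, lo, hi)
    return (clipped - lo) / (hi - lo + 1e-8)
```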
3.2.5 Convolutional Neural Networks

In this chapter, a binary classification model based on CNNs is proposed for needle segmentation and localization in prostate MR images. The deep network architecture is composed of sequential convolutional layers l ∈ [1, L]. At each convolutional layer l, the input feature map (image) is convolved with a set of K kernels W_l = {W^1, ..., W^K} and biases b_l = {b^1, ..., b^K} to generate a new feature map. A non-linear activation function f is then applied to this feature map to generate the output Y_l, which is the input to the next layer. The n-th feature map of the output of the l-th layer can be expressed as:

Y_l^n = f\left( \sum_{k=1}^{K} W_l^{n,k} * Y_{l-1}^{k} + b_l^n \right)    (3.1)

The concatenation of the feature maps at each layer provides a combination of patterns to the network, which becomes increasingly complex for deeper layers. Training of the CNN is usually done through several iterations of stochastic gradient descent (SGD), in which several samples of training data (a batch) are processed by the network. At each iteration, based on the calculated loss, the network parameters (kernel weights and biases) are optimized by SGD in order to decrease the loss.

Medical image segmentation can be formulated as a pixel-level classification problem, which can be solved by convolutional neural networks. Leveraging the volumetric nature of the data through the inter-slice dependence of 2D slices is a key factor in 3D biomedical image classification and segmentation problems. Representation learning for segmentation in 3D has been done in different ways: directly by the use of 3D convolutional filters, with multi-view CNNs on 2D images, and with recurrent architectures [107]. 3D convolutional filters can be used in 3D architectures known as fully convolutional neural networks (FCNs) or through patch-based sliding-window methods [27].

The use of FCNs for image segmentation allows for end-to-end learning, with each pixel of the input image being mapped by the FCN to the output segmentation map. This class of neural networks has shown great success for the task of semantic segmentation [112]. During training, the FCN aims to learn representations based on local information. Although patch-based methods have shown promising results in segmentation tasks [45], FCNs have the advantage of avoiding the computational overhead of sliding-window-based computation. The efficiency of FCNs at prediction time makes them better suited for procedures such as intra-operative imaging, where time is an important factor. One drawback of 3D FCNs is the memory constraint of graphical processing units (GPUs), which must hold the large set of parameters during the optimization process; this limits the input size, the number of model parameters, and the mini-batch size in stochastic gradient descent iterations. In addition to CNNs, recurrent neural networks (RNNs) have also been successfully used for segmentation tasks by feeding prior information from adjacent locations, such as nearby slices or nearby patches, into the classifier [6].

The CNN proposed in this work is a 3D FCN. FCNs for segmentation usually consist of an encoder (contracting) path and a decoder (expanding) path [8, 147]. The encoder path consists of repeated convolutional layers followed by activation functions, with max-pooling layers on selected feature maps. The encoder path decreases the resolution of the feature maps by computing the maximum of small patches of units of the feature maps. However, good resolution is critical for accurate segmentation; therefore, in the decoder path, up-sampling is performed to restore the initial resolution, and the feature maps are concatenated while keeping the computation and memory requirements tractable. As a result of the multiple convolutional layers and max-pooling operations, the feature maps are reduced and the intermediate layers of an FCN become successively smaller. Therefore, following the convolutions, an FCN uses inverse convolutions (or backward convolutions) to up-sample the intermediate layers until the input resolution is matched [35, 112]. FCNs with skip-connections are able to combine high-level abstract features with low-level high-resolution features, which has been shown to be successful in segmentation tasks [27].

3.2.6 Network Architecture

We present a fully automatic approach for needle localization by segmentation in prostate MRI based on a 14-layer deep anisotropic 3D FCN with skip-connections (Figure 3.4). The network architecture is inspired by the 3D U-Net model [27]. We improved the network architecture to efficiently handle the anisotropic nature of the MRI volumes and for the specific problem of needle segmentation in MRI. Due to the time constraints of intraoperative imaging, MRIs taken during the interventional procedure often have thick slices but high resolution in the axial plane, which leads to anisotropic voxels. Pooling and up-sampling were only applied to the in-plane axes (x and y) to handle the anisotropic nature of the needle MRI. The proposed asymmetry in the network design is computationally efficient and allows the whole volumetric MRI to be used for training.
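To make the in-plane-only pooling concrete, the following is a minimal Keras sketch of one encoder stage and the matching decoder stage with a skip connection. The filter counts are illustrative and, unlike the actual network described below, the sketch uses 'same' padding so that the output matches the input size.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    """Two 3x3x3 convolutions, each followed by batch normalization and ReLU."""
    for _ in range(2):
        x = layers.Conv3D(filters, (3, 3, 3), padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    return x

inputs = layers.Input(shape=(188, 188, 46, 1))      # cropped and padded T2 volume
enc = conv_block(inputs, 16)
skip = enc
# Pooling acts only on the in-plane (x, y) axes; the thick-slice z axis is untouched.
down = layers.MaxPooling3D(pool_size=(2, 2, 1))(enc)
bottom = conv_block(down, 32)
# Up-sampling is likewise restricted to the in-plane axes.
up = layers.UpSampling3D(size=(2, 2, 1))(bottom)
up = layers.Concatenate()([up, skip])               # skip connection
dec = conv_block(up, 16)
outputs = layers.Conv3D(1, (1, 1, 1), activation='sigmoid')(dec)

model = tf.keras.Model(inputs, outputs)
```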
Figure 3.4: Schematic overview of the anisotropic 3D fully convolutional neural network for needle segmentation and localization in MRI. The network architecture consists of 14 convolutional layers, 3 max-pooling layers, and 3 up-sampling layers. Convolutional layers were applied without padding, while max-pooling layers halved the size of their inputs only in the in-plane directions. (Encoder-decoder stages use 16, 32, 64, 256, 64, 32, and 16 kernels of size 3×3×3, with 2×2×1 max-pooling/up-sampling between stages; building blocks: 3D conv + BN + ReLU, 3D max pooling, 3D up-sampling, and copy-and-concatenate shortcut connections.) Shortcut connections ensure the combination of low-level and high-level features. The input to the network is the 3D image volume with the prostate gland at the center (188 × 188 × 46), and the output segmentation map has a size of 100 × 100 × 18.

As illustrated in Figure 3.4, the proposed network consists of 14 convolution layers. Each convolution layer has a kernel size of (3 × 3 × 3) with a stride of 1 in all three dimensions. Since the input of the proposed network is a single T2-weighted MRI, the number of channels for the first layer is equal to one. After each convolutional layer, a rectified linear unit (ReLU), f(x) = max(0, x), is used as the nonlinear activation function [55], except for the last layer, where a sigmoid function, S(x) = e^x (e^x + 1)^{-1}, is used to map the output to a class probability between 0 and 1. There are 3 max-pooling and 3 up-sampling layers of size (2 × 2 × 1) in the encoder and decoder paths, respectively. The network has a total of 3,231,233 trainable parameters. The input to the network is the 3D image volume with the prostate gland at the center (188 × 188 × 46), and the output segmentation map size is 100 × 100 × 18, which corresponds to a receptive field of size 88 × 88 × 65 mm.

3.2.7 Training

During training of the proposed network, we aimed to minimize a loss function that measures the quality of the segmentation on the training examples. This loss L_t over N training volumes can be defined as:

L_t = -\frac{1}{N} \sum_{n=1}^{N} \frac{2 |X_n \cap Y_n|}{|X_n| + |Y_n| + s}    (3.2)

where X_n is the output segmentation map, Y_n is the ground truth obtained from expert manual segmentation for the n-th training volume, and s (set to 5) is a smoothing coefficient which prevents the denominator from being zero. This loss function has demonstrated utility in image segmentation problems where there is a heavy imbalance between the classes, as in our case, where most of the data is considered background [124].
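A minimal sketch of this soft Dice loss, assuming a Keras/TensorFlow backend and a batch of shape (N, x, y, z, 1); the function name is illustrative:

```python
import tensorflow as tf

def soft_dice_loss(y_true, y_pred, smooth=5.0):
    """Negative soft Dice averaged over the batch, following Equation 3.2.

    y_true : ground truth needle masks, values in {0, 1}
    y_pred : network probabilities, values in [0, 1]
    smooth : smoothing coefficient s that keeps the denominator non-zero
    """
    axes = (1, 2, 3, 4)                       # reduce over everything but the batch axis
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    denominator = tf.reduce_sum(y_true, axis=axes) + tf.reduce_sum(y_pred, axis=axes)
    dice = (2.0 * intersection) / (denominator + smooth)
    return -tf.reduce_mean(dice)
```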
We used an SGD algorithm with the Adam update rule [86], as implemented in the Keras framework [26]. During training, we used a mini-batch of 4 image volumes. The initial learning rate was set to 0.001 and was reduced by a factor of 0.8 if the average validation Dice score did not improve by 10^-5 over five epochs. The parameters of the convolutional layers were initialized randomly from a Gaussian distribution using the He method [59]. To prevent overfitting, in addition to batch normalization [70], we used dropout with 0.1 probability as well as L2 regularization with a λ2 = 10^-5 penalty on all convolutional layers except the last one. Training was performed on 410 MRI scans from 50 patients using five-fold cross-validation with splitting at the patient level. Each training sample was a 3D patch (also referred to as the input volume or VOI_CP) of size 188 × 188 × 46 voxels. Data augmentation was performed by flipping the 3D volumes horizontally (left to right), which doubled the amount of training examples [44]. Cross-validation was used to optimize and tune the hyperparameters, including the CNN architecture, the training scheme, and the best epoch (model checkpoint) for test-time deployment. For each cross-validation fold, we used 100 as the maximum number of epochs for training and an early stopping policy based on monitoring validation performance. This resulted in five trained models, one from each of the cross-validation folds, that are aggregated later with the ensembling method (described in Section 3.3.4) for test-time prediction.

3.3 Experimental Setup

3.3.1 Observer Study

We designed an observer study in which a second observer, blinded to the annotations by the first observer (the ground truth), segmented the needle trajectory on the test set (n = 173 images) using the same annotation tools as the first observer. We compared the performance of both the proposed automatic system and the second observer against the first observer (ground truth).

3.3.2 Evaluation Metrics

We evaluated the accuracy of the system by measuring how well it localizes the tip of the needle and how well it segments the entire trajectory of the needle. In addition to measuring the tip and angular deviation errors, which are commonly used to quantify the targeting accuracy of percutaneous needle insertion procedures [13], we report the number of axial planes contained in the tip error because of the high anisotropy of the dataset. We used the Hausdorff distance to measure the quality of the segmentation along the entire length of the needle (beyond the tip error) [116, 137]. The metrics are defined below; a short code sketch follows the list.

• Tip deviation error ∆P: The ground truth needle tip position was determined as the center of the needle artifact in the most distal plane of the needle segmentation image P(x, y, z). The tip deviation ∆P is quantified as the 3D Euclidean distance, in millimeters, between the prediction P̂ and the manually specified ground truth P.

• Tip axial plane detection error ∆A: The tip plane detection error is the absolute value of the distance, in voxels, between the ground truth axial plane index A containing the needle tip and the predicted axial plane index Â.

• Hausdorff distance HD: Trajectory accuracy was calculated by measuring the Hausdorff distance between the N-D point sets of the predicted needle X̂ and the ground truth needle X, defined as

d_H(X, \hat{X}) = \max \left\{ \sup_{x \in X} \inf_{\hat{x} \in \hat{X}} d(x, \hat{x}), \; \sup_{\hat{x} \in \hat{X}} \inf_{x \in X} d(x, \hat{x}) \right\}    (3.3)

where sup represents the supremum, inf the infimum, and x and x̂ are points from X and X̂, respectively.

• Angular deviation error ∆θ: The true needle direction θ was defined as the angle between the needle shaft and the axial plane. The angular deviation between the ground truth needle direction θ and the predicted needle direction θ̂ quantifies the accuracy of the needle direction prediction (∆θ = |θ − θ̂|).
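A minimal sketch of these metrics, assuming the needle tip and trajectory have been converted to coordinates and point sets with NumPy/SciPy (the function names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def tip_deviation(p_pred, p_true):
    """3D Euclidean distance (mm) between predicted and ground truth needle tips."""
    return float(np.linalg.norm(np.asarray(p_pred) - np.asarray(p_true)))

def tip_plane_error(a_pred, a_true):
    """Absolute difference (in slices) between predicted and true tip axial indices."""
    return abs(int(a_pred) - int(a_true))

def hausdorff_distance(points_pred, points_true):
    """Symmetric Hausdorff distance between two point sets, as in Equation 3.3."""
    d1 = directed_hausdorff(points_pred, points_true)[0]
    d2 = directed_hausdorff(points_true, points_pred)[0]
    return max(d1, d2)

def angular_deviation(theta_pred, theta_true):
    """Absolute angular difference between predicted and true needle directions."""
    return abs(theta_pred - theta_true)
```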
3.3.3 Test-time Augmentation

Test-time augmentation seeks to improve classification by analyzing multiple augmentations or variants of the same image and averaging the results. Recently, it has been used to improve pulmonary nodule classification from CT [168], detection of lacunes from MRI [44], and prostate cancer diagnosis from MRI [74]. We performed test-time augmentation by flipping the 3D volumes horizontally, which doubles the test data.

We conducted experiments to quantify the impact of training-time and test-time augmentation and performed analysis to measure the statistical significance of the results. Paired comparison of needle tip localization errors for unequal cases was performed using the Wilcoxon signed-rank test (two-tailed). For reporting statistical analysis results, statistical significance was set at 0.05.

3.3.4 Ensembling

As reported in Section 3.2.7, cross-validation resulted in five trained models. Combined with test-time augmentation, this results in 10 segmentation maps for each test case, i.e. for each of the five trained models, there is one prediction for the test image and one for its flipped variant. To obtain the final binary prediction, we used an iterative ensembling, or voting, mechanism to aggregate the results of the 10 predictions at the voxel level (Algorithm 1).

Algorithm 1 Overview of the ensembling algorithm
Require: S, the sum image of n predictions
 1: Constant: ν = 100, minimum number of voxels in a needle
 2: Initialize: τ = ⌈n/2⌉, minimum vote threshold
 3: Initialize: B = 0, output segmentation image, same shape as S
 4: repeat
 5:   for voxel s[i] ∈ S and b[i] ∈ B do
 6:     if s[i] ≥ τ then
 7:       b[i] ← 1   {voxel is needle}
 8:     else
 9:       b[i] ← 0   {voxel is not needle}
10:     end if
11:   end for
12:   τ ← τ − 1   {relax threshold if no needle found}
13:   c ← Σ_{b[j]∈B} b[j]   {c is the total number of needle voxels}
14: until c ≥ ν or τ = 0

The input to the algorithm is S, the sum image of all predictions. In an iterative procedure, the binary segmentation map B is generated by thresholding S using τ. τ is initialized at the majority-vote value, τ = ⌈n/2⌉, where n is the number of predictions. ν is a constant that represents the minimum size of a needle, which was measured at 100 voxels over the 410 needles in the training set. The iterative procedure reduces τ until a needle is found.

Finally, to obtain the needle tip and trajectory, the binary segmentation map is converted to 3D points in space by taking the center of the bounding box of the needle in each axial slice. The most distal point in the z-axis is considered the needle tip. A compact sketch of this voting procedure is given below.
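The following NumPy sketch mirrors Algorithm 1 and the tip-extraction step described above; it assumes the individual predictions are binary maps of identical shape with axes ordered (x, y, z), and the function names are illustrative.

```python
import numpy as np

def ensemble_vote(predictions, min_needle_voxels=100):
    S = np.sum(np.stack(predictions, axis=0), axis=0)   # sum image of n predictions
    n = len(predictions)
    tau = int(np.ceil(n / 2.0))                          # start at majority vote
    while True:
        B = (S >= tau).astype(np.uint8)                  # threshold the vote image
        tau -= 1                                         # relax threshold if needed
        if B.sum() >= min_needle_voxels or tau == 0:
            return B

def needle_tip(binary_map):
    # Tip = center of the needle bounding box in the most distal axial slice.
    zs = np.where(binary_map.sum(axis=(0, 1)) > 0)[0]
    z_tip = zs.max()
    xs, ys = np.where(binary_map[:, :, z_tip])
    return ((xs.min() + xs.max()) / 2.0, (ys.min() + ys.max()) / 2.0, z_tip)
```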
3.3.5 Implementation and Deployment

The proposed algorithm for needle localization and segmentation was implemented in Keras V2.0 [26] with the TensorFlow back-end [1] and trained on an Nvidia GTX Titan X GPU with 12 GB of memory, hosted on a machine running the Ubuntu 14.04 operating system with a 3.4 GHz Intel(R) Core(TM) i7-6800K CPU and 64 GB of memory. Training of large volumetric 3D networks was enabled and accelerated by the efficient cuDNN (https://developer.nvidia.com/cudnn) implementation of deep neural network layers. The trained models were deployed in the open-source DeepInfer toolkit and are publicly available for download and use from the DeepInfer model repository (http://www.deepinfer.org/models/prostate-needle-finder/).

3.4 Results

We tested the proposed method on a previously unseen test set of 173 MRI volumes from 21 patients. Figure 3.5 visually illustrates the localization of a single needle and the corresponding measured quality metrics in an example that is representative of the results of the proposed system. In the rest of this section, we present quantitative results of the performance of the proposed system against the ground truth, and also that of a second observer against the ground truth.

3.4.1 Tip Localization

The average tip localization errors for the proposed automatic system and the second observer relative to the ground truth are presented in Table 3.2. Corresponding box plots are presented in Figure 3.6. The median needle tip deviation for both the CNN and the second observer was 0.88 mm (1 pixel in the transaxial plane). Perfect matching of the predicted needle tip (0 mm deviation) was achieved for 32 needles (18% of test images) and 46 needles (27% of test images) for the CNN and the second observer, respectively.

Table 3.2: Needle tip localization error (mm) for test cases for the proposed CNN method and the second observer*.

                     ΔP      σΔP     RMS(ΔP)    M(ΔP)
  CNN               2.80     3.59      4.54      0.88
  Second observer   2.89     4.05      4.98      0.88

* ΔP, σΔP, RMS(ΔP), and M(ΔP) are the mean, standard deviation, root mean square, and median of the needle tip deviation, respectively. The first row indicates the error of the CNN against the ground truth, and the second row indicates the error of the second observer against the ground truth.

Figure 3.5: An example test case. Green, yellow, and red contours show the needle segmentation boundaries of the ground truth, the proposed system, and the second observer, respectively. The arrows mark the needle tips. (a) The first row shows the ground truth, the second row shows predictions of the proposed system, and the third row shows the second observer annotations. (b) Zoomed view of the slices in (a). (c) Coronal views. (d) 3D rendering of the needle relative to the prostate gland (blue), ground truth, and CNN predictions. For the proposed CNN, the measured needle tip localization error (ΔP), tip axial plane detection error (ΔA), Hausdorff distance (HD), and angular deviation error (Δθ) are 1.76 mm, 0 voxels, 1.24 mm, and 0.30°, respectively.

Figure 3.6: Box plots of the needle tip deviation error and Hausdorff distance (HD) in millimeters for the test cases. Distances of the automatic (CNN) method and the second observer are shown, which are comparable. The median tip localization error and the median HD for both the CNN and the second observer are 0.88 mm (1 pixel in the transaxial plane) and 1.24 mm, respectively.

3.4.2 Tip Axial Plane Detection

The bar chart in Figure 3.7 summarizes the accuracy results for needle tip axial plane detection. For 113 images (65%) the algorithm detected the correct axial slice (ΔA = 0) containing the needle tip, which is comparable to the agreement between the two observers (108 cases, 62%). The algorithm missed the needle tip by one slice (ΔA = 1) on 44 images (25%), by two slices (ΔA = 2) for 9 images (5%), and by three or more slices (ΔA ≥ 3) for 7 images (4%). The bar chart shows that the performance of the CNN and the second observer are in the same range.

3.4.3 Trajectory Localization

Table 3.3 presents the results of the needle trajectory localization error in terms of the directed Hausdorff distance (HD) for the test cases for both the CNN and the second observer. Trajectory localization errors are summarized as the mean, standard deviation, root mean square, and median of the error. Corresponding box plots are presented in Figure 3.6.

3.4.4 Needle Direction

Table 3.4 presents the needle direction error in terms of the angular deviation (Δθ) for the test cases for both the CNN and the second observer.

Figure 3.7: Bar charts of the needle tip axial plane localization error (ΔA). Needle tip axial plane distance errors of the automatic (CNN) method and the second observer are shown.
The results of the automatic CNN method are comparable with those of the second observer.

Needle angular deviation errors are summarized as the mean, standard deviation, root mean square, and median of the error.

Table 3.3: Trajectory localization error averaged over test cases for the proposed CNN method and the second observer* (units are in millimeters).

                     HD      σHD     RMS(HD)    M(HD)
  CNN               3.00     3.15      4.35      1.24
  Second observer   2.29     2.82      3.63      1.24

* HD, σHD, RMS(HD), and M(HD) are the mean, standard deviation, root mean square, and median of the needle trajectory localization Hausdorff distance, respectively.

Table 3.4: Needle direction error quantified as the deviation angle averaged over test cases for the proposed CNN method and the second observer* (units are in degrees).

                     Δθ      σΔθ     RMS(Δθ)    M(Δθ)
  CNN               0.98     1.10      1.47      0.68
  Second observer   0.97     1.04      1.43      0.75

* Δθ, σΔθ, RMS(Δθ), and M(Δθ) are the mean, standard deviation, root mean square, and median of the needle deviation angle, respectively.

3.4.5 Data Augmentation

Table 3.5 summarizes the impact of training-time and test-time augmentation on system performance, as measured by the mean and standard deviation of the needle tip localization error in millimeters. The bottom-right cell indicates the best performance, obtained when both training-time and test-time augmentation are used. While both training-time and test-time augmentation did demonstrate smaller averages of needle tip deviation errors and fewer failures, we did not find the improvements to be statistically significant.

Table 3.5: Impact of training-time and test-time augmentation on performance*.

  ΔP ± σΔP                              Train-Time Augmentation
                                        Without             With
  Without Test-Time Augmentation        4.92 ± 13.22        3.07 ± 3.70†
  With Test-Time Augmentation           3.93 ± 8.83         2.80 ± 3.59

* This table quantifies the system performance as measured by the mean and standard deviation of the needle tip localization error in millimeters.
† This model failed to segment one needle out of 173 in the test set. All other models did not miss any needles.

3.4.6 Execution Time

The execution time of the proposed system was measured for inference on the test set of 173 volumes in the same environment that was described in Section 3.3.5. The average localization time using the proposed system was 29 seconds. This includes preprocessing, running the five models on the original MRI volume and its flipped version, ensembling, and resampling back to the original spatial resolution of the input image. In comparison, the second observer annotated a needle in 52 seconds on average.

3.5 Discussion

Automatic localization of the needle tip and visualization of needle trajectories relative to the target can aid interventionalists in percutaneous needle placement procedures. Furthermore, accurate needle tip and trajectory localization is necessary for robot-guided needle placement. To the best of our knowledge, this is the first report of a fully automatic system for biopsy needle segmentation and localization in MRI with deep convolutional neural networks. A fairly large dataset of 583 MRI volumes from 71 patients suspected of prostate cancer was used to design, optimize, and test the proposed system. The system achieves human expert level performance for MRI-targeted prostate biopsy procedures. The results on an unseen test set show a mean accuracy error of 2.8 mm in detection of the needle tip, 96% detection of the axial tip plane within 2 slices, a mean Hausdorff distance of 3 mm in needle trajectory, and a
mean 0.98° error in needle trajectory angle, all of which lie within the range of agreement between human experts as shown by an observer study.

Our results support the findings of other studies in using 3D fully convolutional neural networks, including 3D U-Net and its variants, for biomedical image segmentation to achieve promising results [107]. Additionally, the deployed trained model segments and localizes a needle in a 3D MRI volume in 29 seconds, which makes it viable for adoption in the clinical workflow of MRI-targeted prostate biopsy procedures. The results of the experiments to quantify the effect of data augmentation demonstrated smaller averages of needle tip deviation errors. However, unlike Ghafoorian et al. [44], we did not find the improvements to be statistically significant. Further analysis on larger test sets is required to statistically assess the effect of data augmentation for the needle segmentation problem. By preserving the ratio between in-plane resolution and slice thickness with anisotropic max-pooling and down-sampling, we were able to train and deploy our model with whole 3D MRI volumes as inputs to the networks.

CNNs tend to be sensitive to variations in MRI acquisition protocols. Variations in parameters during the acquisition of the MRI volumes result in different appearances of tissue and needle artifact [46]. Although we used a fairly large dataset of 583 MRI volumes in our experiments, and these MRI were acquired on two different MRI scanners, they were all obtained in a single institution using substantially similar MRI protocols. It is therefore a reasonable conclusion that the performance of the trained models will degrade when applied to data acquired using substantially different MRI parameters. Domain adaptation and transfer learning techniques can be used to address this issue [46]. Moreover, due to the large slice thickness of 3.6 mm and the partial volume effect, in many cases there is ambiguity in identifying the correct axial plane containing the needle tip. In this study we used the first observer as the gold standard and compared the second observer and the proposed method against it. Ideally, we would have had multiple observers and used majority voting for needle segmentation and tip localization.

3.6 Conclusion

We presented a new method for biopsy needle localization in MRI. A deep 3D fully convolutional neural network model was developed, trained, and deployed using 583 T2-weighted MRI scans from 71 patients. The accuracy of the proposed method, as tested on previously unseen data, was 2.80 mm on average in needle tip detection, and 0.98° in needle trajectory angle. We further designed an observer study in which independent annotations by a second observer, blinded to the original observer, were compared to the output of the proposed method. We showed that 3D convolutional neural networks, designed with some attention to domain knowledge, can effectively segment and localize needles from in-gantry MRI-targeted prostate biopsy images.
The results of this study suggest that our proposed system can be used to detect and localize biopsy needles in MRI within the range of clinical acceptance and human-expert performance.

Chapter 4

Transfer Learning for Domain Adaptation in MRI

(This chapter is adapted from Mohsen Ghafoorian, Alireza Mehrtash, Tina Kapur, Nico Karssemeijer, Elena Marchiori, Mehran Pesteie, Charles R. G. Guttmann, Frank-Erik de Leeuw, Clare M. Tempany, Bram van Ginneken, Andriy Fedorov, Purang Abolmaesumi, Bram Platel, William M. Wells. Transfer learning for domain adaptation in MRI: Application in brain lesion segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer 2017.)

4.1 Introduction and Background

Deep neural networks have been extensively used in medical image analysis and have outperformed conventional methods for specific tasks such as segmentation, classification, and detection [107]. For instance, on brain MR analysis, convolutional neural networks (CNNs) have been shown to achieve outstanding performance for various tasks including white matter hyperintensities (WMH) segmentation [45], tumor segmentation [77], microbleed detection [33], and lacune detection [44]. Although many studies report excellent results on specific domains and image acquisition protocols, the generalizability of these models to test data with different distributions is often not investigated or evaluated. Therefore, to ensure the usability of the trained models in real-world practice, which involves imaging data from various scanners and protocols, domain adaptation remains a valuable field of study. This becomes even more important when dealing with Magnetic Resonance Imaging (MRI), which demonstrates high variations in soft tissue appearances and contrasts among different protocols and settings.

Mathematically, a domain D can be expressed by a feature space χ and a marginal probability distribution P(X), where X = {x_1, ..., x_n} ∈ χ [135]. A supervised learning task on a specific domain D = {χ, P(X)} consists of a pair of a label space Y and an objective predictive function f(.) (denoted by T = {Y, f(.)}). The objective function f(.) can be learned from the training data, which consists of pairs {x_i, y_i}, where x_i ∈ X and y_i ∈ Y. After the training process, the learned model, denoted by f̃(.), is used to predict the label for a new instance x. Given a source domain D_S with a learning task T_S and a target domain D_T with learning task T_T, transfer learning is defined as the process of improving the learning of the target predictive function f_T(.) in D_T using the information in D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T [135]. We denote by f̃_ST(.) the predictive model initially trained on the source domain D_S and domain-adapted to the target domain D_T.

In the medical image analysis literature, transfer classifiers such as adaptive SVM and transfer AdaBoost have been shown to outperform common supervised learning approaches in segmenting brain MRI when trained only on a small set of target domain images [186]. In another study, a machine learning-based sample weighting strategy was shown to be capable of handling multi-center chronic obstructive pulmonary disease images [24]. Recently, several studies have also investigated transfer learning methodologies for deep neural networks applied to medical image analysis tasks.
Some studies used networks pre-trained on natural images to extract features, followed by another classifier such as a support vector machine (SVM) or a random forest [37]. Other studies [163, 176] performed layer fine-tuning on the pre-trained networks to adapt the learned features to the target domain.

Considering the hierarchical feature learning fashion of CNNs, we expect the first few layers to learn features for general, simple visual building blocks, such as edges, corners, and simple blob-like structures, while the deeper layers learn more complicated, abstract, task-dependent features. In general, the ability to learn domain-dependent high-level representations is an advantage that enables CNNs to achieve great recognition capabilities. However, it is not obvious how these qualities are preserved during the transfer learning process for domain adaptation. For example, it would be practically important to determine how much data on the target domain is required for domain adaptation with sufficient accuracy for a given task, or how many layers from a model fitted on the source domain can be effectively transferred to the target domain. Or, more interestingly, given a number of available samples on the target domain, what layer types, and how many of them, can we afford to fine-tune? Moreover, there is a common scenario in which a large set of annotated legacy data is available, often collected in a time-consuming and costly process. Upgrades in the scanners, acquisition protocols, etc., as we will show, might make the direct application of models trained on the legacy data unsuccessful. To what extent these legacy data can contribute to a better analysis of new datasets, or vice versa, is another question worth investigating.

In this chapter, we aim to answer the questions discussed above. At the time of running the experiments of this study, we did not have access to multi-domain prostate cancer datasets. Hence, we chose to perform experiments on the brain WMH segmentation problem, where we used transfer learning methodology for domain adaptation of models trained on legacy MRI data. Since there are no prior assumptions regarding the specific problem of WMH segmentation, we expect that the proposed method can be generalized to other medical imaging domain adaptation problems, including prostate cancer diagnosis with MRI. However, confirmation of such expectations requires multi-domain datasets and further experimentation.

Table 4.1: Number of patients for the domain adaptation experiments.

            Source Domain                    Target Domain
  Set     Train   Validation   Test      Train   Validation   Test
  Size     200        30        50        100        26        33

4.2 Materials and Method

4.2.1 Dataset

The Radboud University Nijmegen Diffusion tensor and Magnetic resonance imaging Cohort (RUN DMC) [185] is a longitudinal study of patients diagnosed with small vessel disease. The baseline scans acquired in 2006 consisted of fluid-attenuated inversion recovery (FLAIR) images with a voxel size of 1.0×1.2×5.0 mm and an inter-slice gap of 1.0 mm, scanned with a 1.5 T Siemens scanner. The follow-up scans in 2011, however, were acquired differently, with a voxel size of 1.0×1.2×3.0 mm and a slice gap of 0.5 mm. The follow-up scans demonstrate a higher contrast, as the partial volume effect is less of an issue due to the thinner slices. For each subject, we also used a 3D T1 magnetization-prepared rapid gradient-echo (MPRAGE) image with a voxel size of 1.0×1.0×1.0 mm, which is the same between the two datasets.
We should note that even though the two scanning protocols differ only in the FLAIR scans, it is generally accepted that FLAIR is by far the most contributing modality for WMH segmentation. Reference WMH annotations on both datasets were provided semi-automatically, by manually editing segmentations provided by a WMH segmentation method [43] wherever needed.

The T1 images were linearly registered to the FLAIR scans, followed by brain extraction and bias-field correction operations. We then normalized the image intensities to be within the range of [0, 1].

In this study, we used 280 patient acquisitions with WMH annotations from the baseline as the source domain, and 159 scans from all the patients that were rescanned in the follow-up as the target domain. Table 4.1 shows the data split into the training, validation, and test sets. It should be noted that the same patient-level partitioning which was used on the baseline was respected on the follow-up dataset to prevent potential label leakage.

4.2.2 Sampling

We sampled 32×32 patches to capture local neighborhoods around WMH and normal voxels from both the FLAIR and T1 images. We assigned each patch the label of the corresponding central voxel. To be more precise, we randomly selected 25% of all voxels within the WMH masks, and randomly selected the same number of negative samples from the normal-appearing voxels inside the brain mask. We augmented the dataset by flipping the patches along the y axis. This procedure resulted in training and validation datasets of size ∼1.2m and ∼150k on the baseline, and ∼1.75m and ∼200k on the follow-up.

4.2.3 Network Architecture and Training

We stacked the FLAIR and T1 patches as the input channels and used a 15-layer architecture consisting of 12 convolutional layers of 3×3 filters, 3 dense layers of 256, 128, and 2 neurons, and a final softmax layer. We avoided using pooling layers, as they would result in a shift-invariance property that is not desirable in segmentation tasks, where the spatial information of the features should be preserved. The network architecture is illustrated in Figure 4.1.

Figure 4.1: Architecture of the convolutional neural network used in our experiments. The shallowest i layers are frozen and the remaining d − i layers are fine-tuned, where d is the depth of the network, which was 15 in our experiments.

To tune the weights in the network, we used the Adam update rule [86] with a mini-batch size of 128 and a binary cross-entropy loss function. We used the Rectified Linear Unit (ReLU) activation function as the non-linearity and the He method [59], which randomly initializes the weights drawn from a N(0, √(2/m)) distribution, where m is the number of inputs to a neuron. Activations of all layers were batch-normalized to speed up convergence [70]. A decaying learning rate was used, with a starting value of 0.0001, for the optimization process. To avoid over-fitting, we regularized our networks with a drop-out rate of 0.3 as well as L2 weight decay with λ2 = 0.0001. We trained our networks for a maximum of 100 epochs with an early stopping policy. For each experiment, we picked the model with the highest area under the curve on the validation set. A sketch of this architecture, together with the layer-freezing scheme used later for domain adaptation (Section 4.2.4), is given below.
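The sketch below builds a network of the shape described above and freezes the shallowest i weight layers before fine-tuning, anticipating the adaptation procedure of Section 4.2.4. It assumes Keras with a TensorFlow backend; the convolutional filter counts follow Figure 4.1, while other details (padding, learning rate, loss name) are illustrative rather than exact.

```python
from tensorflow.keras import layers, models, optimizers

def build_network(input_shape=(32, 32, 2)):
    # 12 conv layers (3x3) followed by dense layers of 256, 128, and 2 neurons.
    model = models.Sequential()
    filters = (16, 16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 512)
    model.add(layers.Conv2D(filters[0], (3, 3), input_shape=input_shape))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('relu'))
    for n_filters in filters[1:]:
        model.add(layers.Conv2D(n_filters, (3, 3)))
        model.add(layers.BatchNormalization())
        model.add(layers.Activation('relu'))
    model.add(layers.Flatten())
    for n_units in (256, 128):
        model.add(layers.Dense(n_units))
        model.add(layers.BatchNormalization())
        model.add(layers.Activation('relu'))
    model.add(layers.Dense(2, activation='softmax'))
    return model

def freeze_shallowest_layers(model, i):
    # Freeze the i shallowest weight layers (conv/dense); the remaining d - i
    # layers are fine-tuned on the target domain. Recompile after the change.
    weight_layers = [l for l in model.layers
                     if isinstance(l, (layers.Conv2D, layers.Dense))]
    for layer in weight_layers[:i]:
        layer.trainable = False
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                  loss='categorical_crossentropy')  # two-class cross-entropy
    return model
```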
We trained our networks with a patch-based approach. At segmentation time, however, we converted the dense layers to their equivalent convolutional counterparts to form a fully convolutional network (FCN). FCNs are much more efficient, as they avoid the repetitive computations on neighboring patches by feeding the whole image into the network. We prefer the conceptual distinction between dense and convolutional layers at training time to keep the generality of the experiments for classification problems as well (e.g., testing the benefits of fine-tuning the convolutional layers in addition to the dense layers). Patch-based training allows class-specific data augmentation to handle domains with hugely imbalanced class ratios (e.g., the WMH segmentation domain).

4.2.4 Domain Adaptation

To build the model f̃_ST(.), we transferred the learned weights from f̃_S, then froze the shallowest i layers and fine-tuned the remaining d − i deeper layers with the training data from D_T, where d is the depth of the trained CNN. This is illustrated in Figure 4.1. We used the same optimization update rule, loss function, and regularization techniques as described in Section 4.2.3.

4.2.5 Experiments

On the WMH segmentation domain, we investigated and compared three different scenarios: 1) training a model on the source domain and directly applying it to the target domain; 2) training networks on the target domain data from scratch; and 3) transferring the model learned on the source domain onto the target domain with fine-tuning. To identify the target domain dataset sizes where transfer learning is most useful, the second and third scenarios were explored with different training set sizes of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 25, 50, and 100 cases. We extensively expanded the third scenario, investigating the best freezing/tuning cut-off for each of the mentioned target domain training set sizes. We used the same network architecture and training procedure across the different experiments. The reported metric for the segmentation quality assessment is the Dice score.

4.3 Results

The model trained on the set of images from the source domain (f̃_S) achieved a Dice score of 0.76. The same model, without fine-tuning, failed on the target domain with a Dice score of 0.005. Figure 4.2(a) demonstrates and compares the Dice scores obtained with three domain-adapted models to a network trained from scratch on different target training set sizes. Figure 4.2(b) illustrates the target domain test set Dice scores as a function of the target domain training set size and the number of abstract layers that were fine-tuned.

Figure 4.2: (a) Comparison of Dice scores on the target domain with and without transfer learning; a logarithmic scale is used on the x axis. (b) Given a deep CNN with d = 15 layers, transfer learning was performed by freezing the i initial layers and fine-tuning the last d − i layers. The Dice scores on the test set are illustrated with a color-coded heatmap; the number of fine-tuned layers is shown horizontally and the target domain training set size vertically.

Figure 4.3 presents and compares qualitative results of WMH segmentation for several different models on a single sample slice.
Figure 4.3: Examples of the brain WMH MRI segmentations. (a) Axial T1-weighted image. (b) FLAIR image. (c–f) FLAIR images with WMH segmentation labels: (c) reference (green) WMH; (d) WMH (red) from a domain-adapted model (f̃_ST(.)) fine-tuned on five target training samples; (e) WMH (yellow) from a model trained from scratch (f̃_T(.)) on 100 target training samples; (f) WMH (orange) from a model trained from scratch (f̃_T(.)) on 5 target training samples.

4.4 Discussion and Conclusion

We observed that while f̃_S demonstrated a decent performance on D_S, it totally failed on D_T. Although the same set of learned representations is expected to be useful for both, as the two tasks are similar, the failure comes as no surprise, since the distribution of the responses to these features is different. Observing the comparisons presented in Figure 4.2(a), it turns out that given only a small set of training examples on D_T, the domain-adapted model substantially outperforms the model trained from scratch with the same amount of training data. For instance, given only two training images, f̃_ST achieved a Dice score of 0.63 on a test set of 33 target domain test images, while f̃_T resulted in a Dice score of 0.15. As Figure 4.2(b) suggests, with only a few D_T training cases available, the best results can be achieved by fine-tuning only the last dense layers; otherwise, the enormous number of parameters compared to the training sample size would result in over-fitting. As soon as more training data becomes available, it makes more sense to fine-tune the shallower representations as well (e.g., the last convolutional layers). It is also interesting to note that tuning the first few convolutional layers is rarely useful, considering their domain-independent characteristics. Even though we did not experiment with training-time fully convolutional networks such as U-Net [147], arguments can be made that the same conclusions would apply to such architectures.

Chapter 5

Weakly-supervised Medical Image Segmentation

5.1 Introduction and Background

Fully convolutional neural networks (FCNs), and in particular U-Nets [27, 147], have been increasingly used for semantic segmentation of both normal organs and lesions and have achieved top-ranking results in medical imaging challenges [92, 126]. The use of FCNs for image segmentation allows efficient training and learning of contextual features. FCNs are commonly trained using masks for which the ground truth is available for all of the pixels of the whole input image. Creating accurate pixel-level labels for medical images is time-consuming, expensive, and requires a high level of expertise. Annotation cost can be substantially reduced by using weaker supervision, i.e. by not requiring full pixel-level labels.

Different weakly-supervised methods have been proposed for learning semantic segmentation with various forms of weak annotations. Unlabeled data can be at the level of images or pixels. In medical imaging research, much of the focus has been on studying image-level unlabeled data, i.e. when the training data consist of images with full annotations and images without any labels [9, 12, 23, 42, 130, 197, 198]. Other forms of weak supervision have also been studied, including response evaluation criteria in solid tumors (RECIST) [18], bounding boxes [143], and image-level tags [134]. Less studied are the cases with pixel-level unlabeled data, i.e. annotations in the form of partial labels [84, 199], scribbles [20, 73], or points [148].
Examples of full and partial annotations for cardiac MRI segmentation are shown in Figure 5.1.

Figure 5.1: Sample cardiac MRI image (a) with different forms of annotations (b–d). Yellow, purple, green, and blue colors correspond to the right ventricle, endocardium, left ventricle, and background, respectively. Fully supervised training of FCNs for semantic segmentation requires annotation of all pixels (b). The goal of this study is to develop weakly-supervised segmentation methods for training FCNs with a single point (c) or scribble (d). In this study, points refer to single-pixel marks for each class on each image slice. Scribbles have a width of one pixel. In this example, the sizes of points and scribbles are exaggerated for better visualization.

A growing body of literature deals with the problem of training segmentation models with a mixture of labeled and unlabeled medical imaging data. Most methods propose hybrid loss functions with both supervised and semi-supervised components [9, 20, 23]. Different forms of pseudo-labels have been used as weak annotation for learning from unlabeled data. These methods include initial coarse segmentation [18, 73], self-learning [9], and uncertainty-aware label propagation [130, 197]. Other methods use prior knowledge about anatomical structures [198, 199], adversarial training [130], conditional random field (CRF) post-processing [9, 20, 73, 143], and attention-based learning [23, 130].

Bai et al. [9] and Baur et al. [12] were among the first to leverage unlabeled data for improving FCN performance for medical image segmentation tasks. Baur et al. [12] proposed a semi-supervised method for multiple sclerosis lesion segmentation by adding an auxiliary manifold embedding loss to the supervised Dice loss. Using an iterative self-learning method, Bai et al. [9] showed improvements in cardiac segmentation quality by including unlabeled MRI samples. Cai et al. [18] proposed a weakly supervised slice-propagated segmentation method for lymph node segmentation with RECIST annotations. Can et al. [20] presented a scribble-supervised learning framework which includes a region growing step for creating uncertainty maps for labels and an Expectation-Maximization approach for learning the network parameters. In another scribble-supervised segmentation work, Ji et al. [73] used partial cross-entropy (CE) [177] and dense CRF loss for the task of brain tumor segmentation. Kervadec et al. [84] proposed a novel loss function that includes constraints on the sizes of structures and showed promising results for cardiac segmentation in MRI. Sedai et al. [158] and the authors of [197] successfully used uncertainty-aware pseudo-labels for semi-supervised segmentation of retinal layers and the atrium, respectively. Zhen et al. proposed a semi-supervised adversarial learning model with atlas priors for liver segmentation in CT scans [198]. Zhu et al. [199] proposed a prior-aware loss function, regularizing the organ size distributions of the model output, for an abdominal segmentation problem in CT scans.

In this chapter, we study the problem of weakly-supervised semantic segmentation with point and scribble supervision in FCNs. Specifically, we explore how far we can go with a single annotated point or a single annotated scribble per slice of a volumetric data set. We also propose the partial Dice loss, a variant of the Dice loss [124] for deep weakly-supervised segmentation with sparse pixel-level annotations.
We furthermore compare the partial Dice loss with partial CE [27, 159, 177] in terms of segmentation quality. Finally, we assess point- and scribble-supervised segmentation on five different semantic segmentation tasks from medical images of the heart, the prostate, and the kidney. In a majority of these experiments, the partial Dice loss provides a statistically significant performance improvement over partial CE. The use of single point supervision results in 51%−95% of the performance of fully supervised training, and the use of single scribble supervision achieves 86%−97% of the performance of fully supervised training.

5.2 Method

Semantic segmentation can be formulated as a pixel-level classification problem. In this setup, the pixels in the training image and label pairs can be considered as N i.i.d. data points D = {x_n, y_n}_{n=1}^{N}, where x ∈ R^M is the M-dimensional input and each pixel y_i in y ∈ R^M can belong to one and only one of the K possible classes, k ∈ {1, ..., K}. In weakly-supervised learning with partial annotations, the ground truth labels are only available for a subset of the pixels in y. Here, we use FCNs for image segmentation, which allows for end-to-end learning, with each pixel of the input image being mapped by the FCN to the output segmentation map; it is also more straightforward to implement segment-level loss functions such as the Dice loss in this architecture. Neural networks can be formulated as parametric conditional probability models, p(y_i = k | x_i, θ), where the parameter set θ is chosen to minimize a loss function, and p(ŷ_i = k | x_i, θ) is the probability of pixel i belonging to class k. Subsequently, p(y_i = k | x_i, θ*) is used for inference, where θ* is the optimized parameter set.

The Dice coefficient was originally developed as a measure of similarity between two sets; it is twice the size of the intersection divided by the sum of the sizes of the two sets. The soft Dice loss function [124] is a generalized measure where the probabilistic output of a segmenter is compared to the training data, set memberships are augmented with label probability, and a smoothing factor is added to the denominator to make the loss function differentiable. With the Dice loss, the parameter set θ is chosen to minimize the negative of the weighted Dice of different structures. Here, we propose the partial Dice loss, in which the parameter set is chosen to minimize the negative of the Dice of different structures over only the pixels where the ground truth is known:

L_PDL = −2 Σ_{k=1}^{K} [ Σ_{i=1}^{N} m_i · p(ŷ_i = k | x_i, θ) · 1(y_i = k) ] / [ Σ_{i=1}^{N} m_i · ( p(ŷ_i = k | x_i, θ) + 1(y_i = k) ) + ε ],    (5.1)

where m_i is the mask that is applied to each pixel: m_i is 1 for pixels where the ground truth (partial annotation) is available and 0 for unlabeled pixels. 1(y_i = k) is the binary indicator denoting whether class label k is the correct class of the ith pixel, N is the number of pixels used in each mini-batch, and ε is the smoothing factor.

A similar masked Dice loss function was previously proposed as part of a semi-supervised training approach in ASDNet [130]. There, masking was used to filter out unconfident pseudo-label pixels, which were generated through label propagation for images without any ground truth. Here, masking is used to limit the Dice loss calculation (L_PDL above) to only the labeled pixels in a weakly-supervised framework where labels (or ground truth annotations) are available only for small subsets of pixels via points and scribbles. A sketch of this masked loss is given below.
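The following is a minimal TensorFlow sketch of the partial (masked) Dice loss of Eq. (5.1), assuming channels-last tensors of shape (batch, H, W, K), a one-hot `y_true`, a softmax `y_pred`, and a mask of shape (batch, H, W, 1) that is 1 for annotated pixels and 0 elsewhere; it is an illustration under these assumptions, not the original implementation.

```python
import tensorflow as tf

def partial_dice_loss(y_true, y_pred, mask, eps=1e-5):
    # `mask` broadcasts over the class axis: 1 = annotated pixel, 0 = unlabeled.
    mask = tf.cast(mask, y_pred.dtype)
    numerator = tf.reduce_sum(mask * y_pred * y_true, axis=[1, 2])        # (batch, K)
    denominator = tf.reduce_sum(mask * (y_pred + y_true), axis=[1, 2]) + eps
    dice_per_class = 2.0 * numerator / denominator
    # Negative Dice summed over classes, averaged over the mini-batch.
    return -tf.reduce_mean(tf.reduce_sum(dice_per_class, axis=-1))
```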
Similar to the partial Dice loss, a partial CE loss, L_PCL, can be defined to calculate the loss only over the sparse ground truth. Partial CE has been successfully used in the segmentation of biomedical [27] and natural images [177]. With the weighted partial CE loss, the parameter set θ is chosen to maximize the average log-likelihood over the training pixels where the ground truth is known:

L_PCL = − Σ_{i=1}^{N} m_i · ln( p(ŷ_i = k | x_i, θ) ) · 1(y_i = k),    (5.2)

where m_i is the mask that is applied to each pixel, p(ŷ_i = k | x_i, θ) is the probability of the pixel belonging to class k, 1(y_i = k) is the binary indicator denoting whether class label k is the correct class of the ith pixel, and N is the number of pixels used in each mini-batch.

5.3 Applications and Data

We performed experiments on five different semantic segmentation tasks from medical images of the heart, the kidney, and the prostate. These five segmentations include three structures in the heart (the left ventricle, the right ventricle, and the endocardium), along with the kidney and the prostate gland, as described next. For heart segmentation, data from the MICCAI 2017 ACDC challenge for automated cardiac diagnosis were used [194]. This is a four-class segmentation task; cine MR images (CMRI) of patients are to be segmented into the left ventricle, the endocardium, the right ventricle, and the background. This dataset consists of end-diastole and end-systole images of 100 patients; we used only the end-diastole images in our study. For kidney segmentation, data from the MICCAI 2019 KiTS challenge for kidney tumor segmentation were used [61]. The training dataset consists of 210 arterial-phase abdominal CT scans of kidney cancer patients. In this study, we only considered the problem of kidney segmentation, not tumors; hence, we considered healthy and cancerous kidney tissue as the same class. For prostate segmentation, the public PROSTATEx dataset [106], together with 40 prostate gland full annotations from Meyer et al. [123], was used. This is a two-class segmentation task; axial T2-weighted images of men suspected of having prostate cancer are to be segmented into the prostate gland and background. For all three segmentation tasks, the patients were split into training (40%), validation (10%), and testing (50%) sets. Prostate and cardiac images were resampled to common in-plane resolutions of 0.5 × 0.5 mm and 2 × 2 mm, respectively. Kidney images were resampled to a resolution of 1 × 1 × 1 mm. All axial slices were then cropped at the center to create images of size 224 × 224 pixels as the input size of the FCN. Image grayscales were normalized to be within the range of [0, 1].

5.4 Experimental Setup

5.4.1 Partial Annotation Generation

For all the training and validation data, partial annotations were generated automatically in the form of single points and single scribbles per slice per class. Points were generated by randomly sampling a single pixel from each of the foreground classes and the background class for all of the 2D slices of each 3D volume. A single scribble was generated on each 2D slice by first randomly sampling the start and end points of the scribble; the A* path search algorithm was then used to construct a path between these points [16]. Scribbles were generated for foreground and background classes only on 2D slices where there was a foreground class. Slices with no foreground were left unlabeled. Figure 5.1 shows examples of automatically generated points and scribbles and their corresponding full masks. A sketch of the point-sampling step is given below.
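As a concrete illustration of the point-annotation generation described above, the following NumPy sketch samples one pixel per class present in a 2D label slice; the function name and the convention of marking unannotated pixels with −1 are assumptions for illustration only.

```python
import numpy as np

def sample_point_annotation(full_mask, rng=np.random):
    # `full_mask` is a 2D integer label image (0 = background).
    partial = np.full(full_mask.shape, -1, dtype=np.int32)   # -1 = unannotated
    for class_id in np.unique(full_mask):
        ys, xs = np.nonzero(full_mask == class_id)
        j = rng.randint(len(ys))                              # random pixel of this class
        partial[ys[j], xs[j]] = class_id
    return partial
```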
5.4.2 Training

For all experiments, we used a baseline FCN model similar to the two-dimensional U-Net architecture [147] but with fewer kernel filters at each layer. The input and output of the FCN have a size of 224 × 224 pixels. The network has the same number of layers as the original U-Net but with fewer kernels: the numbers of kernels for the encoder section of the network were 8, 8, 16, 16, 32, 32, 64, 64, 128, and 128. The parameters of the convolutional layers were initialized randomly from a Gaussian distribution [59]. For optimization, stochastic gradient descent with the Adam update rule [86] was used. During training, we used a mini-batch of 16 examples. The initial learning rate was set to 0.005 and was reduced by a factor of 0.5−0.8 if the average validation Dice score did not improve by 0.001 in 10 epochs. We used 1000 epochs for training the models with an early stopping policy. For the weakly-supervised training experiments, partial annotations were used for both the training and validation labels. For each run, the model checkpoint was saved at the epoch where the validation loss was lowest. For each of the three segmentation problems, and for each type of partial ground truth, the model was trained 10 times with the partial CE loss and 10 times with the partial Dice loss, each with random weight initialization and random shuffling of the training data. Ensembling [121] was used to combine the test predictions. The same ensembling procedure was used for fully supervised training. All the deep models were implemented and optimized using the Keras framework [26].

5.4.3 Partial Loss Functions

CE loss and Dice loss are the two most commonly used loss functions in training FCNs for semantic segmentation. Dice loss [124] is robust to class imbalance and directly optimizes the model for semantic segmentation performance. CE indirectly improves segmentation through pixel-level classification, and models trained with CE loss generally produce better-calibrated class probabilities [121]. Here, we compare the segmentation quality of models trained on partial annotations with the partial CE loss against models trained with the partial Dice loss. We also compare weakly-supervised training with fully supervised segmentation using Dice loss. We assess the segmentation quality of the models with the Dice coefficient and the 95th percentile Hausdorff distance (HD95).

5.5 Results

Partial annotations were generated as points and scribbles for all five segmentation tasks. For held-out test images, Dice and HD95 were calculated. Bootstrapping (n = 1000) was performed and 95% confidence intervals (CIs) were calculated. P-values of less than 0.05 were regarded as statistically significant. Table 5.1 provides the proportions of partial annotations to full labels; it also compares the averages of Dice coefficients of foreground segments for single and ensemble models trained with partial Dice loss, partial CE loss, and models trained with full masks.

For cardiac segmentation with point supervision, Dice coefficients of the endocardium and the left ventricle were significantly better for models trained with partial Dice loss. Models trained with partial CE showed significantly better performance for the right ventricle.
Point and scribble supervision achieved ranges of 71%−97% and 78%−97% of the performance of fully supervised annotation.

For prostate segmentation with point or scribble supervision, no statistically significant differences were found between models trained with either Dice loss or CE. Point and scribble supervision achieved 74% and 90% of the performance of fully supervised annotation.

For kidney segmentation, partial Dice loss was significantly better for both points and scribbles. Point and scribble supervision achieved ranges of 42%−51% and 84%−87% of the performance of fully supervised annotation.

Table 5.2 compares the averages of HD95 of foreground segments for single and ensemble models trained with partial Dice loss, partial CE loss, and models trained with full masks. For the endocardium, the prostate gland, and the kidney, models trained with partial Dice loss and point supervision showed significantly better segmentation in terms of HD95. For the right ventricle, models trained with partial Dice loss and scribble supervision showed significantly better results. The differences between the two partial loss functions were not statistically significant for the rest of the segments. Figure 5.2 visually compares the ensemble models trained with point supervision and full masks through representative examples over the five segmentation tasks.

Table 5.1: Segmentation quality of models in terms of the Dice coefficient (95% CI) of foreground structures: weakly-supervised training with partial annotations (points and scribbles) is compared with fully supervised training. Models trained with partial CE loss (PCL) [27] are compared with those trained with the proposed partial Dice loss (PDL). Fractions of partial annotations to full labels are given (abbreviated to fr.). Boldface indicates statistically significant differences between model pairs (p-value < 0.05).

                        R. Ventricle        Endocardium         L. Ventricle        Prostate            Kidney
  Point Supervision
  fr.                   0.19%               0.27%               0.20%               0.02%               0.13%
  PCL                   0.94 (0.87−0.96)    0.64 (0.45−0.79)    0.69 (0.40−0.87)    0.71 (0.56−0.85)    0.38 (0.22−0.64)
  PDL                   0.92 (0.87−0.96)    0.70 (0.49−0.82)    0.74 (0.44−0.87)    0.71 (0.52−0.84)    0.46 (0.31−0.63)
  Scribble Supervision
  fr.                   2.93%               4.91%               2.56%               0.75%               2.07%
  PCL                   0.91 (0.89−0.95)    0.80 (0.70−0.89)    0.91 (0.80−0.95)    0.81 (0.69−0.88)    0.76 (0.60−0.91)
  PDL                   0.94 (0.92−0.96)    0.83 (0.71−0.89)    0.91 (0.73−0.96)    0.83 (0.72−0.90)    0.78 (0.66−0.88)
  Full Supervision      0.97 (0.94−0.98)    0.90 (0.74−0.95)    0.95 (0.90−0.97)    0.96 (0.92−0.97)    0.90 (0.77−0.96)

Table 5.2: Segmentation quality of models in terms of the 95th percentile Hausdorff distance (95% CI) of foreground structures. Models trained with partial cross-entropy (PCL) [27] are compared with those trained with the proposed partial Dice loss (PDL). Boldface indicates statistically significant differences between model pairs (p-value < 0.05).
                        R. Ventricle        Endocardium         L. Ventricle        Prostate            Kidney
  Point Supervision
  PCL                   6.8 (2.0−14.2)      15.4 (8.0−29.7)     29.3 (15.1−70.0)    17.4 (11.4−22.7)    35.7 (11.7−57.3)
  PDL                   5.9 (2.0−13.4)       8.9 (4.0−15.2)     25.0 (10.2−54.7)    14.1 (8.0−25.5)     31.1 (11.0−53.8)
  Scribble Supervision
  PCL                   4.9 (2.8−11.5)       4.1 (2.8−10.8)      7.4 (2.8−15.2)     11.9 (6.1−31.6)     29.1 (3.3−52.8)
  PDL                   3.7 (2.0−10.3)       3.6 (2.0−12.0)      8.3 (2.0−22.4)     13.1 (6.1−20.6)     27.3 (4.1−49.1)
  Full Supervision      2.3 (2.0−10.2)       2.2 (2.0−2.8)       3.7 (2.0−14.3)      2.3 (1.6−3.4)      20.7 (1.4−47.2)

Figure 5.2: Examples of segmentation from scribble-supervised training of models with partial cross-entropy loss (CE), partial Dice loss (DSC), and models trained with full masks. The rows from top to bottom show the results for segmentation of the right ventricle, the prostate gland, and the kidney, respectively.

5.6 Discussion and Conclusion

Through extensive experiments, we have assessed weakly-supervised segmentation with partial annotations for medical image segmentation with FCNs. We proposed the partial Dice loss for weakly-supervised segmentation with partial annotations, specifically points and scribbles. Moreover, we compared the partial Dice loss with the partial CE loss, and compared both with fully supervised segmentation. We performed these assessments using five segmentation tasks across three medical imaging domains to ensure the generalizability of the findings. The results show that in a majority of the experiments, the partial Dice loss provides a statistically significant improvement in segmentation quality over partial CE in terms of the Dice coefficient and the 95th percentile Hausdorff distance.

Further work needs to be carried out to include self-learning methodologies in the proposed weakly-supervised FCN framework. Such self-learning methods can be combined with uncertainty estimation methods [121] to produce confidence-aware pseudo-labels that can be used to further boost performance. We conclude that partial annotations, including points and scribbles, are a promising direction for weakly-supervised segmentation using FCNs.

Chapter 6

Uncertainty Estimation for Image Segmentation

(This chapter is adapted from Alireza Mehrtash, William M. Wells III, Clare M. Tempany, Purang Abolmaesumi, Tina Kapur. Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation. IEEE Transactions on Medical Imaging, 2020.)

6.1 Introduction and Background

Fully convolutional neural networks (FCNs), and in particular the U-Net [147], have become a de facto standard for semantic segmentation in general and in medical image segmentation tasks in particular. The U-Net architecture has been used for segmentation of both normal organs and lesions and has achieved top-ranking results in several international segmentation challenges [76, 92, 126]. Despite the numerous applications of U-Nets, very few works have studied the capability of these networks in capturing predictive uncertainty.

Predictive uncertainty, or prediction confidence, is described as the ability of a decision-making system to provide an expectation of success (i.e. correct classification) or failure for the test examples at inference time. Using a frequentist interpretation of uncertainty, the predictions (i.e. class probabilities) of a well-calibrated model should match the probability of success of those inferences in the long run [54]. For instance, if a well-calibrated brain tumor segmentation model classifies 100 pixels, each with a probability of 0.7, as cancer, we expect 70 of those pixels to be correctly classified as cancer. However, a poorly calibrated model with similar classification probabilities is expected to result in many more or many fewer correctly classified pixels. A minimal numerical check of this frequentist notion is sketched below.
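The following NumPy sketch makes the frequentist interpretation above concrete by binning predicted probabilities and comparing the mean predicted probability in each bin with the observed frequency of correct classification; the function names, the binning scheme, and the resulting expected-calibration-error summary are illustrative and not the calibration metrics defined later in this chapter.

```python
import numpy as np

def reliability_bins(probs, labels, n_bins=10):
    # `probs`: predicted foreground probabilities, `labels`: binary ground truth.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        if in_bin.any():
            confidence = probs[in_bin].mean()    # mean predicted probability
            accuracy = labels[in_bin].mean()     # observed frequency of success
            rows.append((confidence, accuracy, int(in_bin.sum())))
    return rows

def expected_calibration_error(probs, labels, n_bins=10):
    n = len(probs)
    return sum(w / n * abs(conf - acc)
               for conf, acc, w in reliability_bins(probs, labels, n_bins))
```

For a well-calibrated model, the per-bin confidence and accuracy values are close, and the summary error is small.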
Miscalibration frequently occurs in many modern neural networks (NNs) that are trained with advanced optimization methods [54]. Poorly calibrated NNs are often highly confident in misclassification [5]. In some applications, for example medical image analysis or automated driving, overconfidence can be dangerous.

The soft Dice loss function [124], also known as Dice loss, is a generalized measure where the probabilistic output of a segmenter is compared to the training data, set memberships are augmented with label probability, and a smoothing factor is added to the denominator to make the loss function differentiable. With the Dice loss, the model parameter set is chosen to minimize the negative of the weighted Dice of different structures. Dice loss is robust to class imbalance and has been successfully applied in many segmentation problems [173]. Furthermore, Batch Normalization (BN) effectively stabilizes convergence and also improves the performance of networks for natural image classification tasks [70]. BN and Dice loss have made FCN optimization seamless, and the addition of BN to the U-Net has improved optimization and segmentation quality [27]. However, it has been reported that both BN and Dice loss have adverse effects on calibration quality [15, 54, 153]. Consequently, FCNs trained with BN and Dice loss do not produce well-calibrated probabilities, leading to poor uncertainty estimation. In contrast to Dice loss, cross-entropy loss provides better-calibrated predictions and uncertainty estimates, as it is a strictly proper scoring rule [50]. Yet, the use of cross-entropy as the loss function for training FCNs can be challenging in situations where there is a high class imbalance, e.g., where most of an image is considered background [173]. Hence, it is of great significance and interest to study methods for confidence calibration of FCNs trained with BN and Dice loss.

Another important aspect of uncertainty estimation is the ability of a predictive model to distinguish in-distribution test examples (i.e. those similar to the training data) from out-of-distribution test examples (i.e. those that do not fit the distribution of the training data) [62]. The ability of models to detect out-of-distribution inputs is especially important for medical imaging applications, as deep networks are sensitive to domain shift, which is a recurring situation in medical imaging [46]. For instance, networks trained on one MRI protocol often do not perform satisfactorily on images obtained with slightly different parameters or on out-of-distribution test images. Hence, in the face of an out-of-distribution sample, an ideal model knows and announces "I do not know" and seeks human intervention, if possible, instead of failing silently. Figure 6.1 shows an example of out-of-distribution detection from a U-Net model that was trained with BN and Dice loss for prostate gland segmentation, before and after confidence calibration.

Figure 6.1: Calibration and out-of-distribution detection.
Models for prostate gland segmentation were trained with T2-weighted MR images acquired using phased-array coils. The results of inference are shown for two test examples imaged with (a) a phased-array coil (in-distribution example) and (b) an endorectal coil (out-of-distribution example). The first column shows T2-weighted MRI images with the prostate gland boundary drawn by an expert (white line). The second column shows the MRI overlaid with the uncalibrated segmentation predictions of an FCN trained with Dice loss. The third column shows the calibrated segmentation predictions of an ensemble of FCNs trained with Dice loss. The fourth column shows the histogram of the calibrated class probabilities over the predicted prostate segment of the whole volume. Note that the bottom row has a much wider distribution compared to the top row, indicating that this is an out-of-distribution example. In the middle column, predicted prostate class probabilities ≤ 0.001 have been masked out.

6.2 Related Works

There has been a recent growing interest in uncertainty estimation and confidence measurement with deep NNs. Although most studies on uncertainty estimation have been done through Bayesian modeling of the NN, there has been some recent interest in using non-Bayesian approaches such as ensembling methods. Here, we first briefly review Bayesian and non-Bayesian methods, and then review the recent literature on uncertainty estimation for semantic segmentation applications.

In the Bayesian approach, the deterministic parameters of the NN are replaced by prior probability distributions. Using Bayesian inference, given the data samples, a posterior probability distribution over the parameters is calculated. At inference time, instead of a single scalar probability, the Bayesian NN gives probability distributions over the output label probabilities [115], which models the NN's predictive uncertainty. Gal and Ghahramani [41] proposed to use dropout [169] as a Bayesian approximation. They proposed Monte Carlo dropout (MC dropout), in which dropout layers are applied before every weight layer together with non-linearities. The probabilistic Gaussian process is approximated at inference time by running the model several times with active dropout layers. Implementing MC dropout is straightforward, and it has been applied in several application domains, including medical imaging [102]. In a similar Bayesian approach, Teye et al. [180] showed that training NNs with BN [70] can be used to approximate the inference of Bayesian NNs. For networks with BN and without dropout, Monte Carlo Batch Normalization (MCBN) can be considered an alternative to MC dropout. In another Bayesian work, Heo et al. [64] proposed a method that allows the attention model to leverage uncertainty; by learning Uncertainty-aware Attention (UA) with variational inference, they improved both model calibration and performance in attention models. Seo et al. [160] proposed a variance-weighted loss function that enables learning single-shot calibration scores; in combination with stochastic depth and dropout, their method can improve confidence calibration and classification accuracy. Recently, Liao et al. [104] proposed a method for modeling the uncertainty due to intra-observer variability in 2D echocardiography using a cumulative density function probability method.

Non-Bayesian approaches have also been proposed for probability calibration and uncertainty estimation. Guo et al. [54] studied the problem of confidence calibration in deep NNs.
Through experiments, they analyzed the effect of different parameters such as depth, width, weight decay, and BN on calibration. They also used temperature scaling as a simple post-processing step to calibrate trained models. Ensembling has been used as an effective tool to improve the classification performance of deep NNs in several applications, including medical image segmentation [78, 118]. Following the success of ensembling methods [31] in improving baseline performance, Lakshminarayanan et al. proposed Deep Ensembles, in which model averaging is used to estimate predictive uncertainty [96]. By training collections of models with random initialization of parameters and adversarial training, they provided a simple approach to assess uncertainty. This observation motivated some of the experimental design in our work. Unlike MC dropout, using Deep Ensembles does not require network architecture modification. In [96], the authors showed that Deep Ensembles outperforms MC dropout on two image classification problems. On the downside, Deep Ensembles requires retraining a model from scratch several times, which is computationally expensive for large datasets and complex models.

Predictive uncertainty estimation has also been studied specifically for the problem of semantic segmentation with deep NNs. Bayesian SegNet [82] was among the first works to address uncertainty estimation in FCNs by using MC dropout. The authors applied MC dropout by adding dropout layers after the pooling and upsampling blocks of the three innermost layers of the encoder and decoder sections of the SegNet architecture. Using similar approaches for uncertainty estimation, Kwon et al. [95] and Sedai et al. [157] used Bayesian NNs for uncertainty quantification in the segmentation of ischemic stroke lesions and the visualization of retinal layers, respectively. Sander et al. [153] applied MC dropout to capture instance segmentation uncertainty in ambiguous regions and compared different loss functions in terms of the resulting miscalibration. Kohl et al. [89] proposed a Probabilistic U-Net that combines an FCN with a conditional variational autoencoder to provide multiple segmentation hypotheses for ambiguous images. In similar work, Hu et al. [66] studied uncertainty quantification in the presence of multiple annotations arising from inter-observer disagreement. They used a probabilistic U-Net to quantify uncertainty in the segmentation of lung abnormalities. Baumgartner et al. [11] presented a probabilistic hierarchical model in which separate latent variables are used for different resolutions and a variational autoencoder is used for inference. Rottmann and Schubert [149] proposed a prediction quality rating method for the segmentation of nested multi-resolution street scene images by measuring both pixel-wise and segment-wise uncertainty as predictive metrics for segmentation quality. Recently, Karimi et al. [79] used ensembling to estimate the uncertainty of difficult-to-segment regions and used this information to improve clinical target volume estimation in prostate ultrasound images. In another recent work, Jungo and Reyes [75] studied uncertainty estimation for brain tumor and skin lesion segmentation tasks.

In conjunction with uncertainty estimation and confidence calibration, several works have studied out-of-distribution detection [30, 62, 100, 103, 162]. In a non-Bayesian approach, Hendrycks and Gimpel [62] used the softmax prediction probability as a baseline to effectively predict misclassification and out-of-distribution examples at test time.
[103] used temperaturescaling and input perturbations to enhance the baseline method of Hendrycksand Gimpel [62]. In the context of a generative NN scheme, Lee et al.[100]used a loss function that encourages confidence calibration and this resulted inimprovements in out-of-distribution detection. Similarly, DeVries and Taylor[30] proposed a hybrid with a confidence term to improve out-of-distributiondetection. Shalev et al. [162] used multiple semantic dense representations ofthe target labels to detect misclassified and adversarial examples.6.3 ContributionsIn this chapter, we study predictive uncertainty estimation for semanticsegmentation with FCNs and propose ensembling for confidence calibrationand reliable predictive uncertainty estimation of segmented structures. Insummary, we make the following contributions:• We analyze the choice of loss function for semantic segmentation inFCNs. We compare the two most commonly used loss functions intraining FCNs for semantic segmentation: cross-entropy loss and Diceloss. We train models with these loss functions and compare the result-ing segmentation quality and predictive uncertainty estimation. Weobserve that FCNs trained with Dice loss perform significantly bettersegmentation compared to those trained with cross-entropy but at thecost of poor calibration.• We propose model ensembling [96] for confidence calibration of FCNstrained with Dice loss and batch normalization. By training collectionsof FCNs with random initialization of parameters and random shufflingof training data, we create an ensemble that improves both segmentationquality and uncertainty estimation. We also compare ensembling withMC dropout [41, 82]. We empirically quantify the effect of the numberof models on calibration and segmentation quality.• We propose to use average entropy over the predicted segmented objectas a metric to predict segmentation quality of foreground structures,which can be further used to detect out-of-distribution test inputs.726.4. Applications & DataTable 6.1: Number of patients for training, validation, and test sets used inthis study.Application Brain Heart ProstateDataset CBICA TCIA ACDC PROSTATEx PROMISE12†# Training 66 − 40 16 −# Validation 22 − 10 4 −# Test − 102 50 20 35† Used only for out-of-distribution detection experiments.Our results demonstrate that object segmentation quality correlatesinversely with the average entropy over the segmented object and canbe used effectively for detecting out-of-distribution inputs.• We demonstrate our method for uncertainty estimation and confidencecalibration on three different segmentation tasks from MRI images ofthe brain, the heart, and the prostate. Where appropriate, we reportthe statistical significance of our findings.6.4 Applications & DataTable 6.1 shows the number of patient images in each dataset and how wesplit these into training, validation, and test sets. In the following subsections,we briefly describe each segmentation task, data characteristics, and pre-processing.6.4.1 Brain Tumor Segmentation TaskFor brain tumor segmentation, data from the MICCAI 2017 BraTS challenge[10, 122] was used. This is a four-class segmentation task; multiparametricMRI of brain tumor patients are to be segmented into enhancing tumor, non-enhancing tumor, edema, and background. The training dataset consists of190 multiparametric MRI (T1-weighted, contrast-enhanced T1-weighted, T2-weighted, and FLAIR sequences) from brain tumor patients. The dataset isfurther subdivided into two sets: CBICA and TCIA. 
The images in the CBICA set were acquired at the Center for Biomedical Image Computing and Analytics (CBICA) at the University of Pennsylvania [10]. The images in the TCIA set were acquired across multiple institutions and are hosted by the National Cancer Institute's The Cancer Imaging Archive (TCIA). The CBICA subset was used for training and validation, and the TCIA subset was reserved as the test set.

6.4.2 Ventricular Segmentation Task

For heart ventricle segmentation, data from the MICCAI 2017 ACDC challenge for automated cardiac diagnosis was used [194]. This is a four-class segmentation task: cine MR images (CMRI) of patients are to be segmented into the left ventricle, the myocardium, the right ventricle, and the background. The dataset consists of end-diastole (ED) and end-systole (ES) images of 100 patients. We used only the ED images in our study.

6.4.3 Prostate Segmentation Task

For prostate segmentation, the public PROSTATEx [106] and PROMISE12 [108] datasets were used. This is a two-class segmentation task: axial T2-weighted images of men suspected of having prostate cancer are to be segmented into the prostate gland and the background. For the PROSTATEx dataset, 40 images with annotations from Meyer et al. [123] were used; all of these images were acquired at the same institution. The PROSTATEx dataset was used for both training and testing purposes, and the PROMISE12 dataset was set aside for testing only. PROMISE12 is a heterogeneous multi-institutional dataset acquired using different MR scanners and acquisition parameters. We used the 50 training images for which ground truth is available.

6.4.4 Data Pre-processing

Prostate and cardiac images were resampled to common in-plane resolutions of 0.5 × 0.5 mm and 2 × 2 mm, respectively. Brain images were resampled to a resolution of 1 × 1 × 2 mm. All axial slices were then cropped at the center to create images of size 224 × 224 pixels as the input size of the FCN. Image intensities were normalized to be within the range of [0, 1].

6.5 Methods

6.5.1 Model

Semantic segmentation can be formulated as a pixel-level classification problem, which can be solved by convolutional neural networks [107]. The pixels in the training image and label pairs can be considered as N i.i.d. data points D = \{x_n, y_n\}_{n=1}^{N}, where x \in \mathbb{R}^M is the M-dimensional input and y takes one and only one of the K possible classes, k \in \{1, \ldots, K\}. The use of FCNs for image segmentation allows for end-to-end learning, with each pixel of the input image being mapped by the FCN to the output segmentation map. Compared to FCNs, patch-based NNs are much slower at inference time, as they require sliding-window mechanisms for predicting each pixel [112]. Moreover, it is more straightforward to implement segment-level loss functions such as Dice loss in FCN architectures. FCNs for segmentation usually consist of an encoder (contracting) path and a decoder (expanding) path [112, 147]. FCNs with skip connections are able to combine high-level abstract features with low-level high-resolution features, which has been shown to be successful in segmentation tasks [27, 147]. NNs can be formulated as parametric conditional probability models, p(y_j | x_j, \theta), where the parameter set \theta is chosen to minimize a loss function. Both cross-entropy (CE) and the negative of the Dice Similarity Coefficient (DSC), known as Dice loss, have been used as loss functions for training FCNs. Class weights are used to aid optimization convergence and to deal with class imbalance.
With CE loss, the parameter set is chosen to maximize the average log-likelihood over the training data:

L_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} \omega_k \ln\big(p(\hat{y}_i = k \mid x_i, \theta)\big)\cdot \mathbb{1}(y_i = k),    (6.1)

where p(\hat{y}_i = k \mid x_i, \theta) is the probability of pixel i belonging to class k, \mathbb{1}(y_i = k) is the binary indicator denoting whether class label k is the correct class of the i-th pixel, \omega_k is the weight for class k, and N is the number of pixels used in each mini-batch.

With the Dice loss, the parameter set is chosen to minimize the negative of the weighted Dice of the different structures:

L_{\mathrm{DSC}} = -2\sum_{k=1}^{K} \omega_k \frac{\sum_{i=1}^{N} p(\hat{y}_i = k \mid x_i, \theta)\cdot \mathbb{1}(y_i = k)}{\sum_{i=1}^{N}\big[p(\hat{y}_i = k \mid x_i, \theta) + \mathbb{1}(y_i = k)\big] + \epsilon},    (6.2)

where p(\hat{y}_i = k \mid x_i, \theta) is the probability of pixel i belonging to class k, \mathbb{1}(y_i = k) is the binary indicator denoting whether class label k is the correct class of the i-th pixel, \omega_k is the weight for class k, N is the number of pixels used in each mini-batch, and \epsilon is the smoothing factor that makes the loss function differentiable. Subsequently, p(y_i \mid x_i, \theta^*) is used for inference, where \theta^* is the optimized parameter set.

6.5.2 Calibration Metrics

The output of an FCN for each input pixel is a class prediction \hat{y}_j and its associated class probability p(y_j \mid x_j, \theta). The class probability can be considered the model confidence, or probability of correctness, and can be used as a measure of predictive uncertainty at the pixel level. Strictly proper scoring rules are used to assess the calibration quality of predictive models [50]. In general, scoring rules assess the quality of uncertainty estimation in models by rewarding well-calibrated probabilistic forecasts. Negative log-likelihood (NLL) and the Brier score [17] are both strictly proper scoring rules that have previously been used in several studies for evaluating predictive uncertainty [41, 54, 96]. In a segmentation problem, for a collection of N pixels, NLL is calculated as:

\mathrm{NLL} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} \ln\big(p(\hat{y}_i = y_k \mid x_i, \theta)\big)\cdot \mathbb{1}(\hat{y}_i = y_k).    (6.3)

The Brier score (Br) measures the accuracy of probabilistic predictions:

\mathrm{Br} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{K}\sum_{k=1}^{K}\big[p(\hat{y}_i = y_k \mid x_i, \theta) - \mathbb{1}(\hat{y}_i = y_k)\big]^2.    (6.4)

In addition to NLL and the Brier score, we directly assess the predictive power of a model by analyzing the confidence values of test examples versus their measured expected accuracy. To do so, we use reliability diagrams as visual representations of model calibration and the Expected Calibration Error (ECE) as a summary statistic for calibration [54, 127]. Reliability diagrams plot expected accuracy as a function of class probability (confidence); the reliability diagram of a perfectly calibrated model is the identity function. To measure expected accuracy, the samples are binned into M groups and the accuracy and confidence of each group are computed. Taking D_m to be the indices of samples whose confidence predictions fall in the interval \big(\frac{m-1}{M}, \frac{m}{M}\big], the expected accuracy of D_m is \mathrm{Acc}(D_m) = \frac{1}{|D_m|}\sum_{i \in D_m} \mathbb{1}(\hat{y}_i = y_i), and the average confidence on bin D_m is P(D_m) = \frac{1}{|D_m|}\sum_{i \in D_m} p(\hat{y}_i = y_i \mid x_i, \theta). ECE is calculated as the weighted average of the differences between accuracy and average confidence over the bins:

\mathrm{ECE} = \sum_{m=1}^{M}\frac{|D_m|}{N}\,\big|\mathrm{Acc}(D_m) - P(D_m)\big|,    (6.5)

where N is the total number of samples. In other words, ECE is the average of the gaps on the reliability diagram.
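As a concrete illustration of these metrics, the following minimal NumPy sketch computes NLL, Brier score, and ECE for a batch of predicted class probabilities; the function name, the number of bins, and the array layout are assumptions made for this example rather than the evaluation code used in the thesis.

```python
import numpy as np

def calibration_metrics(probs, labels, n_bins=10, eps=1e-12):
    """probs: (N, K) predicted class probabilities; labels: (N,) true class indices."""
    n, k = probs.shape
    onehot = np.eye(k)[labels]                          # (N, K) one-hot ground truth
    nll = -np.mean(np.log(probs[np.arange(n), labels] + eps))
    brier = np.mean(np.sum((probs - onehot) ** 2, axis=1) / k)

    conf = probs.max(axis=1)                            # confidence of the predicted class
    correct = (probs.argmax(axis=1) == labels)
    ece = 0.0
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (conf > lo) & (conf <= hi)             # D_m: samples in this confidence bin
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap                  # weight gap by |D_m| / N
    return nll, brier, ece
```

For segmentation, the same computation can be applied after flattening the per-pixel probabilities of a volume (or of the dilated bounding boxes described in Section 6.6.4).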
6.5.3 Confidence Calibration with Ensembling

We propose to empirically determine whether ensembling [31] results in confidence calibration of otherwise poorly calibrated FCNs trained with Dice loss. To this end, similar to the Deep Ensembles method [96], we train M FCNs with random initialization of the network parameters and random shuffling of the training dataset in mini-batch stochastic gradient descent. However, unlike the Deep Ensembles method, we do not use any form of adversarial training. We train each of the M models in the ensemble from scratch and then compute the probability of the ensemble, p_E, as the average of the baseline probabilities:

p_E(y_j = k \mid x_j) = \frac{1}{M}\sum_{m=1}^{M} p(y_j = k \mid x_j, \theta^*_m),    (6.6)

where p(y_j = k \mid x_j, \theta^*_m) are the individual model probabilities.

6.5.4 Segment-level Predictive Uncertainty Estimation

For segmentation applications, besides the pixel-level confidence metric, it is desirable to have a confidence metric that captures model uncertainty at the segment level. Such a metric would be very useful in clinical applications for decision making. For a well-calibrated system, we anticipate that a segment-level confidence metric can predict the segmentation quality in the absence of ground truth. The metric can also be used to detect out-of-distribution samples and hard or ambiguous cases. Such metrics have previously been proposed for street scene segmentation [149]. Given the pixel-level class predictions \hat{y}_i and their associated ground truth classes y_i, for a predicted segment \hat{S}_k = \{s \in (x_i, \hat{y}_i) \mid \hat{y}_i = k\} we propose to use the average of the pixel-wise entropy values over the predicted foreground segment \hat{S}_k as a scalar metric for the volume-level confidence of that segment:

H(\hat{S}_k) = -\frac{1}{|\hat{S}_k|}\sum_{i \in \hat{S}_k}\Big[p(\hat{y}_i = k \mid x_i, \theta)\cdot\ln\big(p(\hat{y}_i = k \mid x_i, \theta)\big) + \big(1 - p(\hat{y}_i = k \mid x_i, \theta)\big)\cdot\ln\big(1 - p(\hat{y}_i = k \mid x_i, \theta)\big)\Big].    (6.7)

(Following the convention in the semantic segmentation literature, we use foreground and background labels regardless of whether the problem is a binary or a K-class segmentation [112].) In calculating the average entropy of \hat{S}_k, we assume binary classification: the probability of belonging to class k, p(\hat{y}_i = k \mid x_i, \theta), and the probability of belonging to any other class, 1 - p(\hat{y}_i = k \mid x_i, \theta).
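The two quantities above are straightforward to compute from saved softmax outputs. The sketch below shows one possible NumPy implementation of the ensemble average of Equation 6.6 and the segment-level entropy of Equation 6.7; the function names and array shapes are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def ensemble_probability(member_probs):
    """Equation 6.6: average the per-model probability maps.

    member_probs: (M, K, H, W) softmax outputs of M independently trained FCNs.
    Returns the ensemble probability map of shape (K, H, W).
    """
    return np.mean(member_probs, axis=0)

def segment_entropy(prob_map, k, eps=1e-12):
    """Equation 6.7: average binary entropy over the predicted segment of class k.

    prob_map: (K, H, W) ensemble probabilities for one slice or volume.
    """
    pred = prob_map.argmax(axis=0)          # pixel-level class predictions
    p_k = prob_map[k][pred == k]            # probabilities over the predicted segment S_k
    if p_k.size == 0:
        return 0.0                          # class k was not predicted anywhere
    entropy = -(p_k * np.log(p_k + eps) + (1.0 - p_k) * np.log(1.0 - p_k + eps))
    return float(entropy.mean())
```

A high H(\hat{S}_k) flags a low-confidence segment and, as shown later in Figure 6.3, tends to accompany a low Dice score.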
6.6 Experiments

6.6.1 Training Baselines

For all of the experiments, we used a baseline FCN model similar to the two-dimensional U-Net architecture [147] but with fewer kernel filters at each layer. The input and output of the FCN have a size of 224 × 224 pixels. Except for brain tumor segmentation, which used a three-channel input (T1CE, T2, FLAIR), the input for the other problems was a single channel. The network has the same number of layers as the original U-Net but with fewer kernels: the numbers of kernels in the encoder section were 8, 8, 16, 16, 32, 32, 64, 64, 128, and 128. The parameters of the convolutional layers were initialized randomly from a Gaussian distribution [59]. For each of the three segmentation problems, the model was trained 100 times with CE loss and 100 times with Dice loss, each time with random weight initialization and random shuffling of the training data. For the models trained with Dice loss, the softmax activation function of the last layer was substituted with a sigmoid function, as this improved convergence substantially. For CE loss, class weights ω_k were calculated as the inverse frequencies of the class labels over the combined pixels of the training and validation data. For Dice loss, uniform class weights ω_k were used for all the foreground segments, except for the myocardium class in heart segmentation, whose class weight was three times that of the other two foreground classes. For optimization, stochastic gradient descent with the Adam update rule [86] was used. During training, we used a mini-batch of 16 examples for prostate segmentation and 32 examples for the brain tumor and cardiac segmentation tasks. The initial learning rate was set to 0.005, and it was reduced by a factor of 0.5 to 0.8 if the average validation Dice score did not improve by 0.001 within 10 epochs. We trained for up to 1000 epochs with an early stopping policy. For each run, the model checkpoint was saved at the epoch where the validation DSC was highest.

6.6.2 Cross-entropy vs. Dice

CE loss aims to minimize the average negative log-likelihood over the pixels, while Dice loss improves segmentation quality in terms of the Dice coefficient directly. As a result, we expect models trained with CE to achieve a lower NLL and models trained with Dice loss to achieve better Dice coefficients. Here, our main focus is to observe the segmentation quality (in terms of Dice) of models trained with CE and the calibration quality of models trained with Dice loss. We compare models trained with CE to those trained with Dice loss on the three segmentation tasks.

6.6.3 MC dropout

MC dropout was implemented by modifying the baseline network as was done in Bayesian SegNet [82]. Dropout layers were added to the three innermost encoder and decoder layers with a dropout probability of 0.5. At inference time, Monte Carlo sampling was performed with 50 samples, and the mean of the samples was used as the final prediction.
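As an illustration of the MC dropout inference step just described, the following PyTorch sketch keeps the dropout layers active at test time and averages several stochastic forward passes; the model and the number of samples are placeholders, not the exact architecture used here.

```python
import torch

def mc_dropout_predict(model, image, n_samples=50):
    """Monte Carlo dropout inference: average predictions of stochastic forward passes.

    model:  a segmentation network containing dropout layers.
    image:  input tensor of shape (1, C, H, W).
    """
    model.eval()                                       # freeze batch-norm statistics
    for module in model.modules():                     # re-enable only the dropout layers
        if isinstance(module, (torch.nn.Dropout, torch.nn.Dropout2d)):
            module.train()
    with torch.no_grad():
        samples = [torch.softmax(model(image), dim=1) for _ in range(n_samples)]
    return torch.stack(samples).mean(dim=0)            # mean probability map (1, K, H, W)
```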
6.6.4 Confidence Calibration

We used ensembling (Equation 6.6) to calibrate batch-normalized FCNs trained with Dice loss. For the three segmentation problems, we made ensemble predictions and compared them with the baselines in terms of calibration and segmentation quality. For calibration quality, we compared NLL, Brier score, and ECE%. For segmentation quality, we compared Dice and the 95th percentile Hausdorff distance. Moreover, for calibration quality assessment we calculated the metrics on two sets of samples from the held-out test datasets: 1) the whole test dataset (all pixels of the test volumes), and 2) pixels belonging to dilated bounding boxes around the foreground segments. The foreground segments and the adjacent background around them usually have the highest uncertainty and difficulty; at the same time, background pixels far from the foreground segments show less uncertainty but greatly outnumber the foreground pixels. Using bounding boxes removes most of the correct (certain) background predictions from the statistics and leads to better highlighting of the differences among models. For all three problems, we constructed bounding boxes around the foreground structures. The boxes were then dilated by 8 mm in each direction of the in-plane axes and by 2 slices (which translates to 4 mm to 20 mm) in each direction of the out-of-plane axis.

We also measured the effect of ensembling by calculating p_E(y|x) (Equation 6.6) for ensembles with numbers of models (M) of 1, 2, 5, 10, 25, and 50. To provide better statistics and reduce the effect of chance in reporting performance, for each ensemble size we sampled the 100 baseline models n times and reported the averages and 0.95 CI of the NLL and Dice. For example, for M=50, instead of reporting the means of NLL and Dice of a single set of 50 models (out of the 100 trained models), we sampled n sets of 50 models and reported the averages and 0.95 CI of the NLL and Dice. For the prostate and heart segmentation tasks n was set to 50, and for brain tumor segmentation n was set to 10.

6.6.5 Segment-level Predictive Uncertainty

For each of the segmentation problems, we calculated the volume-level confidence H(\hat{S}) (Equation 6.7) for each of the foreground labels and compared it with Dice. For prostate segmentation, we are also interested in observing the difference between the two test datasets: the PROSTATEx test set (which matches the source domain) and the PROMISE12 set (which can be considered a target domain).

Finally, in all the experiments, bootstrapping (n=100) was used for statistical tests and for calculating 95% confidence intervals (CI). P-values of less than 0.01 were regarded as statistically significant. In all the presented tables, boldfaced text indicates the best results for each instance and shows that the differences are statistically significant.

6.7 Results

Table 6.2 compares the calibration quality and segmentation performance of baselines and ensembles (M=50) trained with CE loss with those trained with Dice loss and those calibrated with MC dropout. The averages and 95% CI values of NLL, Brier score, and ECE% for the bounding boxes around the segments are provided. Table 6.2 also compares the averages and 95% CI values of the Dice coefficients of foreground segments for baselines trained with cross-entropy loss, Dice loss, and baselines calibrated with ensembling (M=50), computed over the whole volume. Calibration quality results for whole volumes and segmentation quality results in terms of Hausdorff distances are provided in Tables I and II of the Supplementary Material, respectively.

For all tasks and across all segments, in terms of segmentation performance, baselines trained with Dice loss outperform those trained with CE loss, and ensembles of models trained with Dice loss outperform all other models. For all three segmentation tasks, calibration quality in terms of NLL and ECE% was significantly better for baseline (single) models trained with CE compared to those trained with Dice loss. However, the direction of change of the Brier score was not consistent between models trained with CE and models trained with Dice loss: for the bounding boxes of the brain tumor and prostate segmentations, the Brier scores were significantly better for models trained with Dice loss than for those trained with CE, while for heart segmentation the opposite was true. The ensemble models show significantly better calibration quality for all metrics across all tasks; in all cases, ensembling outperformed the baselines and the MC dropout models. We observe that ensembling significantly improves the calibration quality of models trained with Dice loss. MC dropout also consistently improves the calibration quality of models trained with Dice loss; however, for models trained with CE loss, MC dropout only improves the calibration quality for the prostate application and not for the brain and heart applications.

The graphs in Figure 6.2 show the quantitative improvement in calibration and segmentation as a function of the number of models in the ensemble, for each of the three segmentation applications (the prostate, the heart, and the brain tumors). As can be seen, for the prostate, the heart, and the brain tumor segmentation, even an ensemble of five baselines (M=5) trained with Dice loss reduces the NLL by about 66%, 44%, and 62%, respectively.
Qualitative examples of the improvement as a function of the number of models in the ensemble are provided in Figures 5 and 6 of the Supplementary Material.

Figure 6.3 provides scatter plots of the Dice coefficient vs. the proposed segment-level predictive uncertainty metric, H(\hat{S}) (Equation 6.7), for models trained with Dice loss and calibrated with ensembling (M=50). For better visualization, Dice values were logit transformed, logit(p) = ln(p/(1−p)), as in [131]. In all three segmentation tasks, we observed a strong correlation (0.77 ≤ |r| ≤ 0.92) between the logit of the Dice coefficient and the average entropy over the predicted segment. For the prostate segmentation task, a clustering is obvious between the test cases from the source domain (PROSTATEx dataset) and those from the target domain (PROMISE12). Investigation of individual cases reveals that most of the poorly segmented cases, which were predicted correctly using H(\hat{S}), can be considered out-of-distribution examples, as they were imaged with endorectal coils.

Figure 6.2: Improvements in calibration as a function of the number of models in the ensemble for baselines trained with cross-entropy and Dice loss functions. Calibration quality in terms of NLL improves as the number of models M increases for prostate, heart, and brain tumor segmentation. For each task, an ensemble of size M=10 trained with Dice loss outperforms the baseline model (M=1) trained with cross-entropy in terms of NLL. The same plot with 0.95 CIs, for both whole-volume and bounding-box measurements, is given in Figure 4 of the Supplementary Material.
Table 6.2: Calibration quality and segmentation performance of baselines trained with cross-entropy (L_CE) compared with those trained with Dice loss (L_DSC) and those calibrated with ensembling (EN, M=50) and MC dropout (MCD). Boldfaced font in the original table indicates the best results for each application (model) and shows that the differences are statistically significant.

Application (Model)   | NLL (95% CI)      | Brier (95% CI)    | ECE% (95% CI)       | Dice Segment I*   | Dice Segment II*  | Dice Segment III*
Brain (L_CE)          | 0.52 (0.16−1.66)  | 0.23 (0.08−0.62)  | 8.11 (1.54−26.23)   | 0.37 (0.00−0.84)  | 0.47 (0.07−0.82)  | 0.58 (0.03−0.87)
Brain (MCD L_CE)      | 0.81 (0.16−2.62)  | 0.36 (0.08−0.92)  | 13.41 (0.80−43.26)  | 0.34 (0.00−0.81)  | 0.34 (0.03−0.76)  | 0.54 (0.02−0.86)
Brain (EN L_CE)       | 0.29 (0.11−0.71)  | 0.15 (0.05−0.40)  | 3.28 (0.52−10.06)   | 0.49 (0.00−0.92)  | 0.59 (0.11−0.86)  | 0.68 (0.04−0.91)
Brain (L_DSC)         | 0.62 (0.17−2.70)  | 0.23 (0.06−0.55)  | 13.20 (2.60−33.55)  | 0.45 (0.00−0.89)  | 0.60 (0.10−0.90)  | 0.67 (0.07−0.91)
Brain (MCD L_DSC)     | 1.14 (0.28−4.04)  | 0.18 (0.06−0.49)  | 8.96 (2.41−23.87)   | 0.43 (0.00−0.88)  | 0.58 (0.08−0.89)  | 0.64 (0.03−0.91)
Brain (EN L_DSC)      | 0.31 (0.16−0.78)  | 0.14 (0.08−0.35)  | 3.71 (0.94−15.27)   | 0.51 (0.00−0.93)  | 0.66 (0.11−0.91)  | 0.74 (0.16−0.92)
Heart (L_CE)          | 0.36 (0.16−1.18)  | 0.17 (0.09−0.41)  | 5.75 (1.42−17.99)   | 0.77 (0.17−0.91)  | 0.73 (0.45−0.86)  | 0.91 (0.63−0.97)
Heart (MCD L_CE)      | 0.36 (0.17−1.10)  | 0.17 (0.09−0.41)  | 5.70 (1.39−17.93)   | 0.78 (0.27−0.90)  | 0.73 (0.47−0.86)  | 0.92 (0.64−0.97)
Heart (EN L_CE)       | 0.23 (0.13−0.58)  | 0.13 (0.07−0.30)  | 2.51 (0.58−10.15)   | 0.81 (0.18−0.93)  | 0.77 (0.56−0.88)  | 0.93 (0.79−0.97)
Heart (L_DSC)         | 0.62 (0.17−2.70)  | 0.23 (0.06−0.55)  | 13.20 (2.60−33.55)  | 0.84 (0.14−0.96)  | 0.81 (0.49−0.90)  | 0.92 (0.64−0.97)
Heart (MCD L_DSC)     | 0.41 (0.17−1.51)  | 0.45 (0.11−0.81)  | 36.79 (6.17−70.58)  | 0.84 (0.12−0.96)  | 0.78 (0.04−0.89)  | 0.91 (0.61−0.97)
Heart (EN L_DSC)      | 0.31 (0.16−0.78)  | 0.14 (0.08−0.35)  | 3.71 (0.94−15.27)   | 0.87 (0.12−0.96)  | 0.83 (0.59−0.91)  | 0.93 (0.71−0.98)
Prostate (L_CE)       | 0.40 (0.22−0.79)  | 0.25 (0.13−0.47)  | 8.08 (1.60−25.50)   | 0.83 (0.62−0.91)  | −                 | −
Prostate (MCD L_CE)   | 0.30 (0.14−0.69)  | 0.16 (0.08−0.30)  | 5.23 (0.70−12.75)   | 0.77 (0.49−0.89)  | −                 | −
Prostate (EN L_CE)    | 0.16 (0.13−0.25)  | 0.09 (0.06−0.16)  | 4.12 (1.92−7.04)    | 0.87 (0.68−0.92)  | −                 | −
Prostate (L_DSC)      | 0.74 (0.31−1.60)  | 0.11 (0.06−0.27)  | 5.72 (3.20−12.57)   | 0.88 (0.72−0.93)  | −                 | −
Prostate (MCD L_DSC)  | 0.48 (0.22−1.03)  | 0.11 (0.07−0.25)  | 5.23 (2.75−11.60)   | 0.86 (0.67−0.93)  | −                 | −
Prostate (EN L_DSC)   | 0.15 (0.07−0.25)  | 0.07 (0.04−0.14)  | 2.02 (0.48−3.89)    | 0.90 (0.76−0.95)  | −                 | −

† The calibration quality metrics (NLL, Brier, ECE%) are calculated over bounding boxes; for whole-volume results see Table I of the Supplementary Material.
‡ Segmentation performance is reported as the average Dice score (95% CI); a comparison of the Hausdorff distances of the different models is provided in Table II of the Supplementary Material.
* For the brain application, segments I, II, and III correspond to non-enhancing tumor, edema, and enhancing tumor, respectively. For the heart application, segments I, II, and III correspond to the right ventricle, the myocardium, and the left ventricle, respectively. For the prostate application, segment I corresponds to the prostate gland.
MCD stands for Monte Carlo dropout; EN stands for ensembling.

Figure 6.3: Segment-level predictive uncertainty estimation. Top row: scatter plots and linear regressions between the Dice coefficient and the average entropy over the predicted segment, H(\hat{S}), for (A) prostate segmentation (r = −0.92), (B) brain tumor segmentation (r = −0.78), and (C) cardiac segmentation (r = −0.77). For each of the regression plots, Pearson's correlation coefficient (r) and the two-tailed p-value for testing non-correlation are provided (p < 0.001 in all cases). Dice coefficients are logit transformed before plotting and regression analysis. For the majority of cases in all three segmentation tasks, the average entropy correlates well with the Dice coefficient, meaning that it can be used as a reliable metric for predicting the segmentation quality of the predictions at test time.
Higher entropy means less confidence in the predictions and more inaccurate classifications, leading to poorer Dice coefficients. However, in all three tasks there are a few cases that can be considered outliers. (A) For prostate segmentation, samples are marked by their domain: PROSTATEx (source domain) and the multi-device, multi-institutional PROMISE12 dataset (target domain). As expected, on average the source domain performs much better than the target domain, meaning that the average entropy can be used to flag out-of-distribution samples.

Figure 6.4: The two bottom rows correspond to the two cases from the PROMISE12 dataset that are marked in (A): Case I (H(\hat{S}) = 0.20) and Case II (H(\hat{S}) = 0.59). These show the prostate T2-weighted MRI at different locations of the same patient, with overlaid calibrated class probabilities (confidences) and histograms depicting the distribution of probabilities over the segmented regions. The white boundary overlaid on the prostate denotes the ground truth. The wider probability distribution in Case II is associated with a higher average entropy, which correlates with a lower Dice score. Case I was imaged with a phased-array coil (the same as the images used for training the models), while Case II was imaged with an endorectal coil (an out-of-distribution case in terms of imaging parameters). The samples in the scatter plots in (B) and (C) are marked by their associated foreground segments. The color bar for the class probability values is given in Figure 6.1. Qualitative examples for the brain and heart applications, and scatter plots for models trained with cross-entropy, are given in Figures 7 and 8 of the Supplementary Material, respectively.

6.8 Discussion

Through extensive experiments, we have rigorously assessed uncertainty estimation for medical image segmentation with FCNs. Furthermore, we proposed ensembling for confidence calibration of FCNs trained with Dice loss. We performed these assessments on three common medical image segmentation tasks to ensure the generalizability of the findings. The results consistently show that for baseline (single) models, cross-entropy loss is better than Dice loss in terms of uncertainty estimation as measured by NLL and ECE%, but falls short in segmentation quality. We then showed that ensembling with M ≥ 5 notably calibrates the confidence of models trained with either Dice loss or CE loss. Importantly, we also observed that in addition to the reduction in NLL, the segmentation accuracy in terms of the Dice coefficient and Hausdorff distance also improved through ensembling. We further showed that ensembling outperforms MC dropout in estimating the uncertainty of deep image segmenters, which confirms previous findings in the image classification literature [96]. Consistent with the results of previous studies [92], we observed that segmentation quality improved with ensembling. The results of our experiments comparing cross-entropy with Dice loss are in line with the results of Sander et al. [153].

Importantly, we demonstrated the feasibility of constructing metrics that capture the predictive uncertainty of individual segments. We showed that the average entropy of a segment can predict the quality of the segmentation in terms of the Dice coefficient. Preliminary results suggest that calibrated FCNs have the potential to detect out-of-distribution samples.
Specifically, for prostate segmentation, the ensemble correctly predicted the cases where it failed due to differences in imaging parameters (such as different imaging coils). However, it should be noted that this is an early attempt to capture the segment-level quality of segmentation, and the results thus need to be interpreted with caution. The proposed metric could be improved by adding prior knowledge about the labels. Furthermore, it should be noted that the proposed metric does not encompass any information on the number of samples used in its estimation.

As introduced in the methods section, some loss functions are "proper scoring rules", a desirable quality that promotes well-calibrated probabilistic predictions. The Deep Ensembles method requires a proper scoring rule as the baseline loss function [96]. The question arises: "Is the Dice loss a proper scoring rule?" Here, we argue that there is a fundamental mismatch in the potential usage of the Dice loss as a scoring rule. Scoring rules are functions that compare a probabilistic prediction with an outcome. In the context of binary segmentation, an outcome corresponds to a binary vector of length n, where n is the number of pixels. The difficulty with using scoring rules here is that the corresponding probabilistic prediction is a distribution on binary vectors. However, the predictions made by deep segmenters are collections of n label probabilities. This is in contrast to distributions on binary vectors, which are more complex and in general characterized by probability mass functions with 2^n parameters, one for each of the 2^n possible outcomes (the number of possible binary segmentations). The essential problem is that deep segmenters do not predict distributions on outcomes (binary vectors). One potential workaround is to say that the network does predict the required distributions, by constructing them as the product of the marginal distributions. This, though, has the problem that the predicted distributions will not be similar to the more general data distributions, so in that sense they are bound to be poor predictions.

We used segmentation tasks in the brain, the heart, and the prostate to assess uncertainty estimation. Although each of these tasks was performed on MRI images, there were subtle differences between them. The brain segmentation task was performed on a three-channel input (T1 contrast-enhanced, FLAIR, and T2), while the other two were performed on single-channel inputs (T2 for the prostate and cine images for the heart). Moreover, the number of training samples, the size of the target segments, and the homogeneity of the samples differed in each task. Only publicly available datasets were used in this study, to allow others to easily reproduce these experiments and results. The ground truth was created by experts, and independent test sets were used for all experiments. For the prostate gland and brain tumor segmentation tasks, we used multi-scanner, multi-institution test sets. For all three tasks, the boundaries of the target segments were commonly identified as areas of high estimated uncertainty. Compared to the prostate and heart applications, we observed lower segmentation quality in the brain tumor application. Segmentation of lesions (in this case brain tumors) is generally a harder problem than segmentation of organs (in this case the heart and the prostate gland), partly because lesions are more heterogeneous.
However, as shown in Figure 6.3, the calibrated models successfully predicted the segmentation quality and the total failures (where the model failed to predict any meaningful structure, e.g., Dice score ≤ 0.05). Our focus was not on achieving state-of-the-art results on the three mentioned segmentation tasks, but on using them to understand and improve the uncertainty prediction capabilities of FCNs. Since we performed several rounds of training with different loss functions, we limited the number of parameters in the models to speed up each training round; we carried out experiments with 2D CNNs (not 3D), used fewer convolutional filters in our baseline compared to the original U-Net, and performed limited (not exhaustive) hyperparameter tuning to allow reasonable convergence. 2D U-Nets have been used extensively to segment 3D images, and we used these to conduct the experiments reported above. 2D vs. 3D is one of the many design choices, or hyper-parameters, in constructing deep networks for semantic segmentation, and there is no clear-cut answer that 3D U-Nets are always better for 3D images; in fact, in some applications, 2D networks have outperformed 3D networks [92]. However, in the case of confidence calibration using deep ensembles, preliminary experiments (included in Appendix F of the Supplementary Material) indicate no difference between using 3D U-Nets and 2D U-Nets. A comprehensive empirical study on this topic would be quite interesting.

In this chapter, we compared the calibration qualities of two commonly used loss functions and showed that the loss function directly affects calibration quality and segmentation performance. As stated earlier, calibration quality is an important metric that provides information about the quality of the predictions. We think it is important for users of deep networks to be aware of the calibration qualities associated with different loss functions; to that end, it would be interesting to investigate the calibration and segmentation quality of other commonly used loss functions, such as combinations of Dice loss and cross-entropy loss, as well as the recently proposed Lovász-Softmax loss [14], which we think is promising for medical image segmentation.

For the proposed segment-level predictive uncertainty measure (Equation 6.7), we assumed binary classification, and the entropy of the foreground class was calculated by considering every other class as background. However, there are neighborhood relationships between classes and adjacent pixels that could be further integrated using measures such as multi-class entropy or related strategies such as Wasserstein losses [39].

There remains a need to study calibration methods that, unlike ensembling, do not require training from scratch, which is time-consuming. In this study, we only investigated uncertainty estimation for MR images. Although parameter changes occur more often in MRI compared to computed tomography (CT), it would still be very interesting to study uncertainty estimation in CT images. Parameter changes in CT can also be a source of failure for CNNs; for instance, changes in slice thickness or the use of contrast can result in prediction failures, and it is highly desirable to predict such failures through model confidence.
We believe that our research will serve as a base for future studies on uncertainty estimation and confidence calibration for medical image segmentation. Further study of the sources of uncertainty in medical image segmentation is needed. Uncertainty has been classified as aleatoric or epistemic in medical applications [69] and in Bayesian modeling [83]. Aleatoric uncertainty refers to the types of uncertainty that exist due to noise or the stochastic behavior of a system. In contrast, epistemic uncertainties are rooted in limitations in knowledge about the model or the data. In this study, we consistently observed higher levels of uncertainty at specific locations such as boundaries. For example, in the prostate segmentation task, single and multiple raters often have higher inter- and intra-rater disagreement in the delineation of the base and apex of the prostate than at the mid-gland boundaries [108]. Such disagreements can leave their traces on models that are trained using ground truth labels with such discrepancies. It has been shown that, with enough training data from multiple raters, deep models are able to surpass human agreement on segmentation tasks [107]. However, little work has been done on the correlation between ground truth quality and the model uncertainty that results from rater disagreements [172, 178].

6.9 Conclusion

We conclude that model ensembling can be used successfully for confidence calibration of FCNs trained with Dice loss. Also, the proposed average entropy metric can be used as an effective predictive metric for estimating the performance of the model at test time when the ground truth is unknown.

Chapter 7

PEP: Parameter Ensembling by Perturbation

(This chapter is adapted from: Alireza Mehrtash, Purang Abolmaesumi, Polina Golland, Tina Kapur, Demian Wassermann, William M. Wells III. PEP: Parameter Ensembling by Perturbation. NeurIPS 2020.)

7.1 Introduction and Background

Deep neural networks have achieved remarkable success on many classification and regression tasks [98]. In the usual usage, the parameters of a conditional probability model are optimized by maximum likelihood on large amounts of training data [51]. Subsequently, the model, in combination with the optimal parameters, is used for inference. Unfortunately, this approach ignores uncertainty in the value of the estimated parameters; as a consequence, over-fitting may occur and the results of inference may be overly confident. In some domains, for example medical applications or automated driving, overconfidence can be dangerous [5].

Probabilistic predictions can be characterized by their level of calibration, an empirical measure of consistency with outcomes, and work by Guo et al. shows that modern neural networks (NNs) are often poorly calibrated, and that a simple one-parameter temperature scaling method can improve their calibration level [54]. Explicitly Bayesian approaches such as Monte Carlo Dropout (MCD) [41] have been developed that can improve likelihoods or calibration. MCD approximates a Gaussian process at inference time by running the model several times with active dropout layers. Similar to the MCD method [41], Teye et al. [180] showed that training NNs with batch normalization (BN) [70] can be used to approximate inference with Bayesian NNs. Directly related to the problem of uncertainty estimation, several works have studied out-of-distribution detection. Hendrycks and Gimpel [62] used a softmax prediction probability baseline to effectively predict misclassification and out-of-distribution examples at test time. Liang et al. [103] used temperature scaling and input perturbations to enhance the baseline
method of Hendrycks and Gimpel [62]. In a recent work, Rohekar et al. [146] proposed a method for confounding training in deep NNs by sharing neural connectivity between generative and discriminative components. They showed that their BRAINet architecture, which is a hierarchy of deep neural connections, can improve uncertainty estimation. Hendrycks et al. [63] showed that pre-training can improve uncertainty estimation. Thulasidasan et al. [181] showed that mixup training can improve the calibration and predictive uncertainty of models. Corbière et al. [28] proposed True Class Probability as an alternative to the classic Maximum Class Probability; they showed that learning the proposed criterion can improve model confidence and failure prediction. Raghu et al. [142] proposed a method for direct uncertainty prediction that can be used for medical second opinions; they showed that deep NNs can be trained to predict the uncertainty scores of data instances with high human reader disagreement.

Ensemble methods [31] are regarded as a straightforward way to increase the performance of base networks and have been used by the top performers in imaging challenges such as ILSVRC [175]. The approach typically prepares an ensemble of parameter values that are used at inference time to make multiple predictions with the same base network. Different ensembling methods have been proposed for improving model performance, such as M-heads [101] and Snapshot Ensembles [67]. Following the success of ensembling methods in improving baseline performance, Lakshminarayanan et al. proposed Deep Ensembles, in which model averaging is used to estimate predictive uncertainty [96]. By training collections of models with random initialization of parameters and adversarial training, they provided a simple approach to assess uncertainty.

Deep Ensembles and MCD have both been used successfully in several applications for uncertainty estimation and calibration improvement. However, Deep Ensembles requires retraining a model from scratch for several rounds, which is computationally expensive for large datasets and complex models. Moreover, Deep Ensembles cannot be used to calibrate pre-trained networks for which the training data are not available. MCD requires the network architecture to have dropout layers; in many modern networks, BN removes the need for dropout [70]. Hence, network modification is needed if the original architecture does not have dropout layers, and in some cases it is challenging or infeasible to use MCD on out-of-the-box pre-trained networks. Parameter (weight) perturbation at training time has been used to good effect in variational Bayesian deep learning [85] and to improve adversarial robustness [72].

In this chapter, we propose Parameter Ensembling by Perturbation (PEP),
We showempirically that the log-likelihood of the ensemble average (L) on hold-outvalidation and test data grows initially from that of the baseline model to awell-defined peak as the spread of the parameter ensemble increases. We alsoshow that PEP may be used to probe curvature properties of the likelihoodlandscape. We conduct experiments on deep and large networks that havebeen trained on ImageNet (ILSVRC2012) [150] to assess the utility of PEPfor improvements on calibration and log-likelihoods. The results show thatPEP can be used for probability calibration on pre-trained networks such asDenseNet [68], Inception [175], ResNet [60], and VGG [166]. Improvementsin the log-likelihood range from small to significant but they are almostalways observed in our experiments. To compare PEP with MCD and DeepEnsembles, we run experiments on classification benchmarks such as MNISTand CIFAR-10 which are small enough for us to re-train and add dropoutlayers. Finally, We perform further experiments to study the relationshipbetween over-fitting and the “PEP effect,” (the gain in log-likelihood over thebaseline model) where we observe larger PEP effects for models with higherlevels of over-fitting.In this chapter, we limit our experiments to computer vision benchmarkssuch as MNIST, CIFAR-10, and ImageNet. The proposed PEP method andthe theoretical developments apply to deep NNs in general. Here, we donot run any experiments on medical images. However, we expect that theproposed method can be generalized and adopted well for medical imagingapplications in general and prostate cancer diagnosis in MRI in particular.As we showed in Chapter 6, deep NNs trained on medical images are oftenpoorly calibrated. PEP provides an affordable method to calibrate suchmodels without the additional cost of training. Importantly, PEP does notrequire access to training data. This could facilitate the calibration of modelstrained for medical applications where security and privacy are top priorities.To the best of our knowledge, this is the first report of using ensemble ofperturbed deep nets as an accessible and computationally inexpensive methodfor calibration and performance improvement. Our method is potentiallymost useful when the cost of training from scratch is too high in terms ofeffort or carbon footprint.937.2. Method7.2 MethodIn this section, we describe the PEP model and analyze local properties of theresulting PEP effect (the gain in log-likelihood over the comparison baselinemodel). In summary PEP is formulated in the Bayes’ network (hierarchicalmodel) framework; it constructs ensembles by Gaussian perturbations of theoptimal parameters from training. The single variance parameter is chosen tomaximize the likelihood of ensemble average predictions on validation data,which, empirically, has a well-defined maximum. PEP can be applied to anypre-trained network; only one standard training run is needed, and no specialtraining or network architecture is needed.7.2.1 Baseline ModelWe begin with a standard discriminative model, e.g., a classifier that predictsa distribution on yi given an observation xi,p(yi;xi, θ) . (7.1)Training is conventionally accomplished by maximum likelihood,θ∗ .= argmaxθL(θ) where the log-likelihood is: L(θ) .=∑ilnLi(θ) ,(7.2)and Li(θ).= p(yi;xi, θ) are the individual likelihoods. 
Subsequent predictions are made with the model using \theta^*.

7.2.2 Hierarchical Model

Empirically, different optimal values of \theta are obtained on different data sets; we aim to model this variability with a very simple parametric model, an isotropic normal distribution with mean and scalar variance parameters,

p(\theta; \bar{\theta}, \sigma) := N(\theta; \bar{\theta}, \sigma^2 I).    (7.3)

The product of Eqs. 7.1 and 7.3 specifies a joint distribution on y_i and \theta; from this we can obtain model predictions by marginalizing over \theta, which leads to

p(y_i; x_i, \bar{\theta}, \sigma) = E_{\theta \sim N(\bar{\theta}, \sigma^2 I)}\big[p(y_i; x_i, \theta)\big].    (7.4)

We approximate the expectation by a sample average,

p(y_i; x_i, \bar{\theta}, \sigma) \approx \frac{1}{m}\sum_j p(y_i; x_i, \theta_j), \quad \text{where } \{\theta_j\}_{j=1}^{m} \overset{\mathrm{IID}}{\leftarrow} N(\bar{\theta}, \sigma^2 I),    (7.5)

i.e., the predictions are made by averaging over the predictions of an ensemble. The log-likelihood of the ensemble prediction as a function of \sigma is then

L(\sigma) := \sum_i \ln \frac{1}{m}\sum_j L_i(\theta_j), \quad \text{where } \{\theta_j\}_{j=1}^{m} \overset{\mathrm{IID}}{\leftarrow} N(\bar{\theta}, \sigma^2 I)    (7.6)

(dependence on \bar{\theta} is suppressed for clarity). We estimate the model parameters as follows. First we optimize \theta with \sigma fixed at zero using a training data set (when \sigma \to 0, the \theta_j \to \bar{\theta}); then

\theta^* = \arg\max_{\bar{\theta}} \sum_i \ln p(y_i; x_i, \bar{\theta}),    (7.7)

which is equivalent to maximum likelihood parameter estimation of the base model. Next, we optimize over \sigma (using a validation data set), with \theta fixed at the previous estimate \theta^*,

\sigma^* = \arg\max_{\sigma} \sum_i \ln \frac{1}{m}\sum_{\theta_j} p(y_i; x_i, \theta_j), \quad \text{where } \{\theta_j\}_{j=1}^{m} \overset{\mathrm{IID}}{\leftarrow} N(\theta^*, \sigma^2 I).    (7.8)

Then, at test time, the ensemble prediction is

p(y_i; x_i, \theta^*, \sigma^*) \approx \frac{1}{m}\sum_{\theta_j} p(y_i; x_i, \theta_j), \quad \text{where } \{\theta_j\}_{j=1}^{m} \overset{\mathrm{IID}}{\leftarrow} N(\theta^*, \sigma^{*2} I).    (7.9)

In our experiments, perhaps somewhat surprisingly, L(\sigma) has a well-defined maximum away from \sigma = 0 (which corresponds to the baseline model). As \sigma grows from 0, L(\sigma) rises to a well-defined peak value, then falls dramatically (Figure 7.1). Conveniently, the calibration quality tends to improve until the L(\sigma) peak is reached. It may be that L(\sigma) initially grows because the classifiers corresponding to the ensemble parameters remain accurate, and the ensemble performs better as the classifiers become more independent [31]. Figure 7.1 shows L(\sigma) for experiments with InceptionV3 [175], along with the average log-likelihoods (ln(L)) of the individual ensemble members. Note that in the figures, following current machine learning style, we use averaged log-likelihoods, while in this section we follow the estimation literature convention that log-likelihoods are summed rather than averaged. We can see that for several members, ln(L) grows somewhat initially; this indicates that the optimal parameter from training is not optimal for the validation data. Interestingly, the ensemble has a more robust increase, which persists over scale substantially longer than for the individual networks. We have observed this L(\sigma) "increase to peak" phenomenon in many experiments with a wide variety of networks.

Figure 7.1: Parameter Ensembling by Perturbation (PEP) on pre-trained InceptionV3 [175]. The rectangle shaded in gray in (a) is shown in greater detail in (b). The average log-likelihood of the ensemble average, L(\sigma), has a well-defined maximum at \sigma = 1.85 \times 10^{-3}. The ensemble also shows a noticeable increase in likelihood over the individual ensemble members' average log-likelihoods, ln(L), and over their average. In this experiment, an ensemble size of 5 (M = 5) was used for PEP and the experiments were run on 5000 validation images.
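The procedure in Equations 7.5-7.9 amounts to adding independent Gaussian noise to a trained parameter vector and averaging the resulting predictions. The PyTorch sketch below illustrates this idea under simple assumptions (perturbing all parameters of a generic classifier and a fixed sigma); it is an illustrative rendering of the method, not the authors' released implementation.

```python
import copy
import torch

def pep_predict(model, inputs, sigma, m=5):
    """Parameter Ensembling by Perturbation (Eqs. 7.5-7.9), sketched for a classifier.

    model:  trained network whose parameters play the role of theta*.
    inputs: batch of test examples, shape (B, ...).
    sigma:  standard deviation of the isotropic Gaussian perturbation.
    m:      ensemble size.
    Returns the ensemble-average class probabilities, shape (B, K).
    """
    probs = []
    with torch.no_grad():
        for _ in range(m):
            member = copy.deepcopy(model)            # theta_j starts at theta*
            for p in member.parameters():
                p.add_(sigma * torch.randn_like(p))  # theta_j ~ N(theta*, sigma^2 I)
            member.eval()
            probs.append(torch.softmax(member(inputs), dim=1))
    return torch.stack(probs).mean(dim=0)            # Eq. 7.9: average member predictions
```

In practice, sigma would be chosen on a validation set by maximizing the ensemble log-likelihood L(sigma) of Equation 7.8, as described in the ImageNet experiments below.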
7.2.3 Local Analysis

In this section, we analyze the nature of the PEP effect in the neighborhood of \theta^*. Returning to the log-likelihood of a PEP ensemble (Eq. 7.6), and "undoing" the approximation by sample average,

L(\sigma) \approx \sum_i \ln E_{\theta \sim N(\theta^*, \sigma^2 I)}\big[L_i(\theta)\big].    (7.10)

Next we develop a local approximation to the expected value of the log-likelihood. Suppose x \sim N(\mu, \Sigma). We seek a local approximation to E_x[f(x)]. Using a second-order Taylor expansion about \mu,

E_x[f(x)] \approx E_x\big[f(\mu) + (x-\mu)^T \nabla f(\mu) + \tfrac{1}{2}(x-\mu)^T H_f(\mu)(x-\mu)\big],    (7.11)

where H_f(x) is the Hessian of f(x). Then, as the gradient term vanishes,

E_x[f(x)] \approx f(\mu) + \tfrac{1}{2}\, E_x\big[(x-\mu)^T H_f(\mu)(x-\mu)\big]    (7.12)

E_x[f(x)] \approx f(\mu) + \tfrac{1}{2}\, E_x\big[x^T H_f(\mu)\, x - 2\, x^T H_f(\mu)\,\mu + \mu^T H_f(\mu)\,\mu\big]    (7.13)

or

E_x[f(x)] \approx f(\mu) + \tfrac{1}{2}\big[E_x[x^T H_f(\mu)\, x] - \mu^T H_f(\mu)\,\mu\big].    (7.14)

Now, using E_x[x^T \Lambda x] = \mathrm{TR}(\Lambda\Sigma) + \mu^T \Lambda \mu (TR is the trace; see [117]),

E_x[f(x)] \approx f(\mu) + \tfrac{1}{2}\,\mathrm{TR}\big(H_f(\mu)\,\Sigma\big).    (7.15)

Summarizing, for x \sim N(\mu, \Sigma),

E_x[f(x)] \approx f(\mu) + \tfrac{1}{2}\,\mathrm{TR}\big(H_f(\mu)\,\Sigma\big),    (7.16)

where H_f(x) is the Hessian of f(x) and TR is the trace. In the special case that \Sigma = \sigma^2 I,

E_x[f(x)] \approx f(\mu) + \frac{\sigma^2}{2}\,\Delta f(\mu),    (7.17)

where \Delta is the Laplacian, or mean curvature. The appendix shows that the third Taylor term vanishes due to Gaussian properties, so that the approximation residual is O(\sigma^4 \partial^4 f(\mu)), where \partial^4 is a specific fourth-derivative operator.

Applying this to the log-likelihood in Eq. 7.10 yields

L(\sigma) \approx \sum_i \ln\Big[L_i(\theta^*) + \frac{\sigma^2}{2}\,\Delta L_i(\theta^*)\Big] \approx \sum_i \Big[\ln L_i(\theta^*) + \frac{\sigma^2}{2}\,\frac{\Delta L_i(\theta^*)}{L_i(\theta^*)}\Big]    (7.18)

(to first order), or

L(\sigma) \approx L(\theta^*) + B_\sigma(\theta^*),    (7.19)

where L(\theta) is the log-likelihood of the base model (Eq. 7.2) and

B_\sigma(\theta) := \frac{\sigma^2}{2}\sum_i \frac{\Delta L_i(\theta)}{L_i(\theta)}    (7.20)

is the PEP effect. Note that the PEP effect may be dominated by data items that have low likelihood, perhaps because they are difficult cases or incorrectly labeled. Next we establish a relationship between the PEP effect and the Laplacian of the log-likelihood of the base model. From the Appendix (Eq. 11),

\Delta L(\theta) = \sum_i \Big[\frac{\Delta L_i(\theta)}{L_i(\theta)} - \big(\nabla \ln L_i(\theta)\big)^2\Big]    (7.21)

(here the square in the second term on the right is the dot product of two gradients). Then

\Delta L(\theta) = \frac{2}{\sigma^2}\, B_\sigma(\theta) - \sum_i \big(\nabla \ln L_i(\theta)\big)^2    (7.22)

or

B_\sigma(\theta) = \frac{\sigma^2}{2}\Big[\Delta L(\theta) + \sum_i \big(\nabla \ln L_i(\theta)\big)^2\Big].    (7.23)

The empirical Fisher information (FI) is defined in terms of the outer product of gradients as

\tilde{F}(\theta) := \sum_i \nabla \ln L_i(\theta)\, \nabla \ln L_i(\theta)^T    (7.24)

(see [93]). So, the second term in Eq. 7.23 is the trace of the empirical FI. Then, finally,

B_\sigma(\theta) = \frac{\sigma^2}{2}\Big[\Delta L(\theta) + \mathrm{TR}\big(\tilde{F}(\theta)\big)\Big].    (7.25)

The first term of the PEP effect, the mean curvature of the log-likelihood, can be positive or negative (we expect it to be negative near the mode), while the second term, the trace of the empirical Fisher information, is non-negative. As a sum of squared gradients, we may expect the second term to grow as \theta moves away from the mode.

The first term may also be seen as a (negative) trace of an empirical FI. If the sum is converted to an average, it approximates an expectation that is equal to the negative of the trace of the Hessian form of the FI, while the second term is the trace of a different empirical FI. The empirical FI is said to be most accurate at the mode of the log-likelihood [93]. So, if \theta^* is close to the log-likelihood mode on the new data, we may expect the terms to cancel; if \theta^* is farther from the log-likelihood mode on the new data, they may no longer cancel.

Next, we discuss two cases; in both, we examine the log-likelihood of the validation data, L(\theta), at \theta^*, the result of optimization on the training data. In general, \theta^* will not coincide with the mode of the log-likelihood of the validation data. Case 1: \theta^* is 'close' to the mode of the validation data, so we expect the mean curvature to be negative. Case 2: \theta^* is 'not close' to the mode of the validation data, so the mean curvature may be positive.
Next, we discuss two cases; in both we examine the log-likelihood of the validation data, L(\theta), at \theta^*, the result of optimization on the training data. In general, \theta^* will not coincide with the mode of the log-likelihood of the validation data. Case 1: \theta^* is 'close' to the mode of the validation data, so we expect the mean curvature to be negative. Case 2: \theta^* is 'not close' to the mode of the validation data, so the mean curvature may be positive. We conjecture that case 1 characterizes the likelihood landscape on new data when the baseline model is not overfitted, and that case 2 is characteristic of an overfitted model (where, empirically, we observe a positive PEP effect).

As these are local characterizations, they are only valid near \theta^*. While the analysis may predict the PEP effect for small \sigma, as it grows, and the \theta_j move farther from the mode, the log-likelihood will inevitably decrease dramatically (and there will be a peak value between the two regimes).

There has been a lot of work recently concerning the curvature properties of the log-likelihood landscape. Ghorbani et al. point out that the "Hessian of training loss ... is crucial in determining many behaviors of neural networks"; they provide tools to analyze the Hessian spectrum and point out characteristics associated with networks trained with BN [47]. Sagun et al. [151] point out that there is a 'bulk' of zero-valued eigenvalues of the Hessian that can be used to analyze overparameterization, and in a related paper discuss implications that "shed light on the geometry of high-dimensional and non-convex spaces in modern applications" [152]. Fort et al. [40] analyze Deep Ensembles from the perspective of the loss landscape, discussing multiple modes and associated connectors among them. While the entire Hessian spectrum is of interest, some insights may be gained from the avenues to characterizing the mean curvature that PEP provides.

7.3 Experiments

This section reports the performance of PEP and compares it to temperature scaling [54], MCD [41], and Deep Ensembles [96], as appropriate. The first set of results is on ImageNet pre-trained networks, where the only comparison is with temperature scaling (no training of the baselines was carried out, so MCD and Deep Ensembles were not evaluated). Then we report performance on smaller networks, MNIST and CIFAR-10, where we compare to MCD and Deep Ensembles as well. We also show that the PEP effect is strongly related to the degree of overfitting of the baseline networks.

Evaluation metrics: Model calibration was evaluated with negative log-likelihood (NLL), Brier score [17], and reliability diagrams [127]. NLL and Brier score are proper scoring rules that are commonly used for measuring the quality of classification uncertainty [41, 54, 96, 140]. Reliability diagrams plot expected accuracy as a function of class probability (confidence), and perfect calibration is achieved when confidence (x-axis) matches expected accuracy (y-axis) exactly [54, 127]. Expected Calibration Error (ECE) is used to summarize the results of the reliability diagram. Details of the evaluation metrics are given in the Supplementary Material.

7.3.1 ImageNet Experiments

We evaluated the performance of PEP using large-scale networks that were trained on the ImageNet (ILSVRC2012) [150] dataset. We used the subset of 50,000 validation images and labels that is included in the development kit of ILSVRC2012. From the 50,000 images, 5,000 images were used as a validation set for optimizing \sigma in PEP and the temperature T in temperature scaling. The remaining 45,000 images were used as the test set. Golden section search [138] was used to find the \sigma^* that maximizes L(\sigma). The search range for \sigma was 5 \times 10^{-5} to 5 \times 10^{-3}, the ensemble size was 5 (M = 5), and the number of iterations was 7. On the test set of 45,000 images, PEP was evaluated using \sigma^* and an ensemble size of 10 (M = 10). A single crop of the center of images was used for the experiments.
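A minimal sketch of the golden-section search over \sigma is shown below. The bracket and iteration count follow the description above; ensemble_log_likelihood is the hypothetical helper from the earlier sketch, and the actual thesis code may differ in its details.

```python
import numpy as np

def golden_section_max(f, lo, hi, iters=7):
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    invphi = (np.sqrt(5.0) - 1.0) / 2.0            # 1/phi, about 0.618
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc > fd:              # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                    # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
    return (a + b) / 2.0

# Example usage on the validation set (illustrative names assumed):
# sigma_star = golden_section_max(
#     lambda s: ensemble_log_likelihood(theta_star, s, predict_proba,
#                                       x_val, y_val, m=5),
#     5e-5, 5e-3, iters=7)
```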
Evaluation was performed on six pre-trained networks from the Keras library [26]: DenseNet121, DenseNet169 [68], InceptionV3 [175], ResNet50 [60], VGG16, and VGG19 [166]. For all pre-trained networks, Gaussian perturbations were added to the weights of all convolutional layers (a minimal sketch of this perturbation step is given at the end of this subsection). Table 7.1 summarizes the optimized T and \sigma values, model calibration in terms of NLL and Brier score, and classification errors. For all the pre-trained networks, except VGG19, PEP achieves statistically significant improvements in calibration compared to the baseline and temperature scaling. Note the reduction in top-1 error of DenseNet169 by about 1.5 percentage points, and the reduction in all top-1 errors. Figure 7.2 shows the reliability diagram for DenseNet169, before and after calibration with PEP, with some corrected misclassification examples.

Table 7.1: ImageNet results: For all models except VGG19, PEP achieves statistically significant improvements in calibration compared to baseline (BL) and temperature scaling (TS), in terms of NLL and Brier score. PEP also reduces test errors, while TS does not have any effect on test errors. Although TS and PEP outperform the baseline in terms of ECE% for DenseNet121, DenseNet169, ResNet, and VGG16, the improvements in ECE% are not consistent among the methods. T^* and \sigma^* denote the optimized temperature for TS and the optimized sigma for PEP, respectively. Boldfaced font indicates the best results for each metric of a model and shows that the differences are statistically significant (p-value < 0.05).

Model        | T^*  | \sigma^* (x 10^-3) | NLL (BL / TS / PEP)   | Brier (BL / TS / PEP) | ECE% (BL / TS / PEP) | Top-1 error % (BL / PEP)
DenseNet121  | 1.10 | 1.94               | 1.030 / 1.018 / 0.997 | 0.357 / 0.356 / 0.349 | 3.47 / 1.52 / 2.03   | 25.73 / 25.13
DenseNet169  | 1.23 | 2.90               | 1.035 / 1.007 / 0.940 | 0.354 / 0.350 / 0.331 | 5.47 / 1.75 / 2.35   | 25.31 / 23.74
InceptionV3  | 0.91 | 1.94               | 0.994 / 0.975 / 0.950 | 0.328 / 0.328 / 0.317 | 1.80 / 4.19 / 2.46   | 22.96 / 22.26
ResNet50     | 1.19 | 2.60               | 1.084 / 1.057 / 1.023 | 0.365 / 0.362 / 0.350 | 5.08 / 1.97 / 2.94   | 26.09 / 25.18
VGG16        | 1.09 | 1.84               | 1.199 / 1.193 / 1.164 | 0.399 / 0.399 / 0.391 | 2.52 / 2.08 / 1.64   | 29.39 / 28.83
VGG19        | 1.09 | 1.03               | 1.176 / 1.171 / 1.165 | 0.394 / 0.394 / 0.391 | 4.77 / 4.50 / 4.48   | 28.99 / 28.75

[Figure 7.2, panels (a) and (b): reliability diagrams (confidence vs. accuracy) for the baseline (ECE: 5.79) and PEP (ECE: 2.36); panel (c): corrected misclassification examples.]

Figure 7.2: Improving pre-trained DenseNet169 with PEP (M = 10). (a) and (b) show the reliability diagrams of the baseline and the PEP. (c) shows examples of misclassifications corrected by PEP. The examples were among those with the highest PEP effect on the correct class probability. (c) Top row: brown bear and lampshade changed into Irish terrier and boathouse; Middle row: band aid and pomegranate changed into sandal and strawberry; Bottom row: bathing cap and wall clock changed into volleyball and pinwheel. The histograms at the right of each image illustrate the probability distribution of the ensemble. Vertical red and green lines show the predicted class probabilities of the baseline and the PEP for the correct class label. (For more reliability diagrams see the Supplementary Material.)
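The sketch below illustrates the perturbation step for pre-trained Keras models. It is an assumption-laden sketch rather than the thesis code: it perturbs only Conv2D kernels, and the example model and \sigma value in the usage comment are purely illustrative.

```python
import numpy as np
import tensorflow as tf

def perturb_conv_weights(model, sigma, seed=None):
    """Return a copy of `model` whose convolutional kernels have i.i.d.
    Gaussian noise of standard deviation sigma added to them."""
    rng = np.random.default_rng(seed)
    member = tf.keras.models.clone_model(model)
    member.set_weights(model.get_weights())
    for layer in member.layers:
        if isinstance(layer, tf.keras.layers.Conv2D):
            weights = layer.get_weights()            # [kernel] or [kernel, bias]
            noise = sigma * rng.standard_normal(weights[0].shape)
            weights[0] = (weights[0] + noise).astype(weights[0].dtype)
            layer.set_weights(weights)
    return member

# Example usage (illustrative): a PEP ensemble of 10 perturbed copies.
# base = tf.keras.applications.DenseNet169(weights="imagenet")
# ensemble = [perturb_conv_weights(base, sigma=2.9e-3, seed=j) for j in range(10)]
# probs = np.mean([member.predict(images) for member in ensemble], axis=0)
```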
7.3.2 MNIST and CIFAR-10 Experiments

The MNIST handwritten digits dataset [97] consists of 60,000 training images and 10,000 test images. The CIFAR-10 dataset [91] consists of 50,000 training images and 10,000 test images. We created validation sets by setting aside 10,000 and 5,000 training images from MNIST and CIFAR-10, respectively. For the MNIST dataset, the predictive uncertainty was evaluated for two different neural networks: a Multi-Layer Perceptron (MLP) and a Convolutional Neural Network (CNN) similar to LeNet [99] but with smaller kernel sizes. The MLP is similar to the one used in [96] and has 3 hidden layers with 200 neurons each, ReLU non-linearities, and BN after each layer. For MCD experiments, dropout layers were added after each hidden layer with a 0.5 dropout rate, as suggested in [41]. The CNN for the MNIST experiments has two convolutional layers with 32 and 64 kernels of size 3x3 and stride 1, followed by two fully connected layers (with 128 and 64 neurons each), with BN after both types of layers. Here, again for MCD experiments, dropout was added after all layers with a 0.5 dropout rate, except for the first and last layers. For the CIFAR-10 dataset, the CNN architecture has 2 convolutional layers with 16 kernels of size 3x3 followed by a 2x2 max-pooling; another 2 convolutional layers with 32 kernels of size 3x3 followed by a 2x2 max-pooling; and finally, two dense layers of size 128 and 10. BN was applied to all convolutional layers (a minimal Keras sketch of this CIFAR-10 CNN is shown below). For MCD experiments, dropout was added in the same way as for the MNIST CNN.

Each network was trained and evaluated 25 times with different initializations of parameters (weights and biases) and random shuffling of the training data. For optimization, stochastic gradient descent with the Adam update rule [86] was used. Each baseline was trained for 15 epochs. Training was performed for another 25 rounds with dropout for the MCD experiments. Models trained and evaluated with active dropout layers were used for MCD evaluation only, and baselines without dropout were used for the rest of the experiments. The Deep Ensembles method was tested by averaging the output of the 10 baseline models. MCD was tested on 25 models and the performance was averaged over all 25 models. Temperature scaling and PEP were tested on the 25 trained baseline models without dropout and the results were averaged.
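For concreteness, a minimal Keras sketch of the CIFAR-10 CNN described above follows. ReLU activations, the softmax output layer, and the exact placement of BN are assumptions, since those details are not fully specified in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cifar10_cnn():
    """Small CNN per Section 7.3.2: 2x Conv(16, 3x3) + 2x2 max-pool,
    2x Conv(32, 3x3) + 2x2 max-pool, Dense(128), Dense(10); BN on conv layers."""
    model = tf.keras.Sequential([
        layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
        layers.BatchNormalization(),
        layers.Conv2D(16, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",                      # Adam update rule, as in [86]
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```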
Table 7.2: MNIST and CIFAR-10 results: The table summarizes the experiments described in Section 7.3.2.

NLL
Experiment    | Baseline      | PEP           | Temp. Scaling | MCD           | Deep Ensembles
MNIST (MLP)   | 0.096 ± 0.01  | 0.079 ± 0.01  | 0.074 ± 0.01  | 0.094 ± 0.00  | 0.044 ± 0.00
MNIST (CNN)   | 0.036 ± 0.00  | 0.034 ± 0.00  | 0.032 ± 0.00  | 0.031 ± 0.00  | 0.021 ± 0.00
CIFAR-10      | 1.063 ± 0.03  | 0.982 ± 0.02  | 0.956 ± 0.02  | 0.798 ± 0.01  | 0.709 ± 0.00

Brier
MNIST (MLP)   | 0.037 ± 0.00  | 0.035 ± 0.00  | 0.035 ± 0.00  | 0.040 ± 0.00  | 0.020 ± 0.00
MNIST (CNN)   | 0.016 ± 0.00  | 0.015 ± 0.00  | 0.015 ± 0.00  | 0.014 ± 0.00  | 0.010 ± 0.00
CIFAR-10      | 0.469 ± 0.01  | 0.450 ± 0.01  | 0.447 ± 0.01  | 0.381 ± 0.01  | 0.335 ± 0.00

ECE %
MNIST (MLP)   | 1.324 ± 0.16  | 0.528 ± 0.12  | 0.415 ± 0.10  | 2.569 ± 0.17  | 0.839 ± 0.08
MNIST (CNN)   | 0.517 ± 0.07  | 0.366 ± 0.08  | 0.259 ± 0.06  | 0.832 ± 0.06  | 0.287 ± 0.05
CIFAR-10      | 11.718 ± 0.72 | 4.599 ± 0.82  | 1.318 ± 0.26  | 7.109 ± 0.62  | 8.867 ± 0.23

Classification Error %
MNIST (MLP)   | 2.264 ± 0.22  | 2.286 ± 0.24  | 2.264 ± 0.22  | 2.452 ± 0.14  | 1.285 ± 0.05
MNIST (CNN)   | 0.990 ± 0.13  | 0.990 ± 0.12  | 0.990 ± 0.13  | 0.842 ± 0.06  | 0.659 ± 0.03
CIFAR-10      | 33.023 ± 0.68 | 32.949 ± 0.74 | 33.023 ± 0.68 | 27.207 ± 0.66 | 22.880 ± 0.21

[Figure 7.3, panels (a)-(d): (a) NLL vs. training epoch for the CIFAR-10 baseline and PEP; (b)-(d) PEP effect vs. degree of overfitting for CIFAR-10, MNIST (MLP), and MNIST (CNN).]

Figure 7.3: The relationship between overfitting and the PEP effect. (a) shows the average of NLLs on the test set for CIFAR-10 baselines (red line) and PEP (black line). The baseline curve shows overfitting as a result of overtraining. The degree of overfitting was calculated by subtracting the training NLL (loss) from the test NLL (loss). PEP reduces overfitting and improves log-likelihood. The PEP effect is more substantial as the overfitting grows. (b), (c), (d) show scatter plots of overfitting vs. PEP effect for CIFAR-10, MNIST (MLP), and MNIST (CNN), respectively.

Table 7.2 compares the calibration quality and test errors of the baselines with PEP, temperature scaling [54], MCD [41], and Deep Ensembles [96]. The averages and standard deviation values for NLL, Brier score, and ECE% are provided. For all cases, it can be seen that PEP achieves better calibration in terms of lower NLL compared to the baseline. Deep Ensembles achieves the best NLL and classification errors in all the experiments. Compared to the baseline, temperature scaling and MCD improve calibration in terms of NLL for all three experiments.

Effect of Overfitting on PEP Effect: We ran experiments to quantify the effect of overfitting on the PEP effect and on the optimized \sigma values. For the MNIST and CIFAR-10 experiments, model checkpoints were saved at the end of each epoch. Different levels of overfitting as a result of over-training were observed for the three experiments. \sigma^* was calculated for each epoch, PEP was performed, and the PEP effect was measured. Figure 7.3 (a) shows the effect of PEP on calibration and on reducing NLL for the CIFAR-10 models. Figures 7.3 (b-d) show that the PEP effect increases with overfitting. Furthermore, we observed that the \sigma^* values also increase with overfitting, meaning that larger perturbations are required for more overfitted models.

7.4 Conclusion

We proposed PEP for improving calibration and performance in deep learning. PEP is computationally inexpensive and can be applied to any pre-trained network. On classification problems, we show that PEP effectively improves probabilistic predictions in terms of log-likelihood, Brier score, and expected calibration error. It also nearly always provides small improvements in accuracy for pre-trained ImageNet networks. We observe that the optimal size of perturbation and the log-likelihood increase from the ensemble correlate with the amount of overfitting. Finally, PEP can be used as a tool to investigate the curvature properties of the likelihood landscape.

Chapter 8
Conclusion and Future Work

Prostate cancer is the most commonly diagnosed cancer in North American men and the second most common cancer in men worldwide. The ultimate diagnosis of prostate cancer is through histopathology analysis of prostate biopsy or radical prostatectomy. MRI has shown promising results for detection and characterization of prostate cancer and in guiding biopsy needles to suspicious targets.
Despite promising results in using MRI for prostate cancer management, open problems exist regarding detection and characterization of prostate cancer and image-guided interventions.

In this thesis, novel algorithms and methods were proposed with the ultimate goal of improving MRI-guided prostate cancer diagnosis and interventions. In Chapter 2, we proposed models to classify prostate cancer at a given biopsy location as clinically significant or not in diagnostic MRI images. We further proposed models to handle biopsy location uncertainty at training and inference times. In Chapter 3, we proposed models to automatically detect the tip of biopsy needles on intra-procedural MRI images. In Chapter 4, we investigated domain adaptation techniques to determine whether a CNN trained to perform a task on MRI images could be tuned to images acquired with different acquisition parameters. In Chapter 5, we proposed a partial Dice loss function for weakly-supervised segmentation with single points and scribbles. In Chapter 6, we studied uncertainty estimation in semantic segmentation and proposed methods to improve confidence calibration using ensembles of models. Finally, in Chapter 7, we proposed a general methodology for uncertainty estimation of neural networks using parameter ensembling by perturbation.

8.1 Contributions

This thesis is an attempt to develop techniques that are essential for MRI-guided prostate cancer diagnosis and interventions. In the course of achieving this objective, the following contributions were made:

• A novel deep learning technique was proposed for diagnosing clinically significant prostate cancer in mpMRI. The method uses diffusion-weighted imaging (DWI) and dynamic contrast-enhanced (DCE) MRI sequences and information about the location of the suspicious target to diagnose clinically significant cancer. The proposed method was tested on an unseen patient dataset of 206 findings from 140 patients and achieved an area under the receiver operating characteristic curve (AUC) of 0.80. The performance is comparable with the AUC values of experienced human readers for PI-RADS.

• A novel probabilistic framework was proposed to include biopsy location uncertainty at inference time for the diagnosis of clinically significant prostate cancer lesions with FCNs. Moreover, a Gaussian weighted loss was proposed as a label imputation mechanism for training FCNs with sparse biopsy data. The proposed loss function was compared with partial cross-entropy (CE), where biopsy locations are used for loss calculation in optimization. It was observed that using the updated biopsy location improves sensitivity significantly by detecting lesions where the biopsy location was displaced. The proposed method was trained and validated using a 6-fold cross-validation scheme with 352 biopsy locations from 203 patients suspected of having prostate cancer.

• A novel asymmetric 3D deep CNN was developed to localize and visualize the tip and trajectory of biopsy needles in MRI. Needles were annotated on 583 T2-weighted intra-procedural MRI scans acquired after needle insertion for 71 patients. The accuracy of the proposed method, as tested on previously unseen data, was 2.80 mm on average in needle tip detection, and 0.98° in needle trajectory angle. Additionally, an observer study was designed in which independent annotations by a second observer, blinded to the original observer, were compared to the output of the proposed method.
The resultant error was comparable to the measured inter-observer concordance, reinforcing the clinical acceptability of the proposed method. To the best of our knowledge, this was the first report of a fully automatic system for biopsy needle segmentation and localization in MRI with deep convolutional neural networks.

• A novel technique was developed for domain adaptation of networks trained with one set of MRI acquisition parameters. The following questions regarding domain adaptation were investigated: given a fitted model on a certain dataset domain, 1) how much data from the new domain is required for a decent adaptation of the original network?; and 2) what portion of the pre-trained model parameters should be retrained given a certain number of the new domain training samples? We trained a CNN on one set of images and evaluated the performance of the domain-adapted network on the same task with images from a different domain. We then compared the performance of the model to the surrogate scenarios where either the same trained network is used or a new network is trained from scratch on the new dataset. The proposed method is capable of tuning the deep network to the new domain.

• A novel technique was proposed for weakly-supervised semantic segmentation with point and scribble supervision in FCNs. A novel loss function, partial Dice loss, a variant of Dice loss [124], was proposed for deep weakly-supervised segmentation with sparse pixel-level annotations. Partial Dice loss was compared with partial cross-entropy [27, 177] in terms of segmentation quality. Finally, point- and scribble-supervised segmentation were compared with fully-supervised segmentation on five different semantic segmentation tasks from medical images of the heart, the prostate, and the kidney. In a majority of these experiments, partial Dice loss provided statistically significant performance improvement over partial cross-entropy. The use of single point supervision results in 51%–95% of the performance of fully supervised training, and the use of single scribble supervision achieves 86%–97% of the performance of fully supervised training.

• A novel technique was developed for confidence calibration and predictive uncertainty estimation for deep medical image segmentation. Despite producing high-quality segmentations, FCNs trained with batch normalization and Dice loss are poorly calibrated. We systematically compared cross-entropy loss with Dice loss in terms of segmentation quality and uncertainty estimation of FCNs; we proposed model ensembling for confidence calibration of FCNs trained with BN and Dice loss; and we further assessed the ability of calibrated FCNs to predict the segmentation quality of structures and to detect out-of-distribution test examples. We consistently demonstrated that model ensembling is considerably effective for confidence calibration.

• A novel technique was developed for confidence calibration and uncertainty estimation of neural networks. The proposed technique, the parameter ensembling by perturbation (PEP) approach, prepares an ensemble of parameter values as perturbations of the optimal parameter set from training by a Gaussian with a single variance parameter. Experiments on classification benchmarks such as MNIST and CIFAR-10 showed improved calibration and likelihood.
To demonstrate the scalability of PEP on deep networks, experiments were conducted on ImageNet; these show that PEP can be used for uncertainty estimation and probability calibration on pre-trained networks.

8.2 Future Work

Novel methods have been presented in this thesis for MRI-guided prostate cancer diagnosis and interventions. In addition, we proposed methods for domain adaptation, confidence calibration, and uncertainty estimation in medical images. A number of interesting areas of research can be suggested as follows:

• The proposed models for cancer diagnosis were developed and validated only on a cohort of patients from a single institution. Further experimental investigation is needed using larger multi-institutional datasets. It would be essential to determine the performance of the proposed prostate cancer diagnosis approaches across a wider range of patient populations.

• Future work should explore the use of the posterior probabilities on latent true biopsy coordinates for improving training procedures of CAD systems with noisy ground truth. Through an expectation-maximization (EM) framework, the posterior can be used in the E-step to re-estimate the probability distribution on biopsy locations given the prior knowledge and the classifier's output. Maximum likelihood estimation can then be updated to include the current knowledge of the distribution.

• Further work needs to be carried out to embed the proposed automatic localization method in the workflow of transperineal in-gantry MRI-targeted prostate biopsies. In order to do this, a study has to be designed to determine how the needle trajectory should be presented to the interventionalist to help them make the most efficacious decisions during the procedure (e.g., whether the insertion point or angle of a suboptimal trajectory should be changed). On a wider level, research is also needed to transfer the framework and proposed methodology to other types of image-guided procedures that involve needle detection and localization.

• In Chapter 4, due to lack of access to multi-domain prostate cancer datasets, we studied transfer learning for the problem of brain white matter hyperintensities (WMH) segmentation. Further experiments are required to evaluate the proposed methods with multi-institutional prostate cancer MRI datasets.

• Additional work needs to be carried out to establish the effect of the loss function on confidence calibration for the deep FCNs proposed in Chapter 6. It would be interesting to investigate the calibration and segmentation quality of other loss functions, such as combinations of Dice loss and cross-entropy loss, as well as the recently proposed Lovász-Softmax loss [14].

• The proposed parameter ensembling by perturbation (PEP) method was evaluated on computer vision benchmarks. Further evaluation of PEP on medical imaging benchmarks and applications, including prostate cancer diagnosis in MRI, would be interesting.

Bibliography

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Amir H Abdi, Christina Luong, Teresa Tsang, Gregory Allan, Saman Nouranian, John Jue, Dale Hawley, Sarah Fleming, Ken Gin, Jody Swift, Robert Rohling, and Purang Abolmaesumi. Automatic quality assessment of echocardiograms using convolutional neural networks: Feasibility on the apical Four-Chamber view.
IEEE Transactions onMedical Imaging, 36(6):1221–1230, 2017.[3] M Aboofazeli, P Abolmaesumi, P Mousavi, and G Fichtinger. A newscheme for curved needle segmentation in three-dimensional ultrasoundimages. In 2009 IEEE International Symposium on Biomedical Imaging:From Nano to Macro, pages 1067–1070, 2009.[4] Hashim U Ahmed, Ahmed El-Shater Bosaily, Louise C Brown, RhianGabe, Richard Kaplan, Mahesh K Parmar, Yolanda Collaco-Moraes,Katie Ward, Richard G Hindley, Alex Freeman, et al. Diagnosticaccuracy of multi-parametric MRI and TRUS biopsy in prostate can-cer (PROMIS): a paired validating confirmatory study. The Lancet,389(10071):815–822, 2017.[5] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, JohnSchulman, and Dan Mané. Concrete problems in AI safety. arXivpreprint arXiv:1606.06565, 2016.[6] Simon Andermatt, Simon Pezold, and Philippe Cattin. Multi-dimensional gated recurrent units for the segmentation of biomedical3D-Data. In Deep Learning and Data Labeling for Medical Applications,pages 142–151. Springer International Publishing, 2016.[7] Samuel G Armato, Henkjan Huisman, Karen Drukker, Lubomir Hadji-iski, Justin S Kirby, Nicholas Petrick, George Redmond, Maryellen L112BibliographyGiger, Kenny Cha, Artem Mamonov, et al. PROSTATEx challengesfor computerized classification of prostate lesions from multiparametricmagnetic resonance images. Journal of Medical Imaging, 5(4):044501,2018.[8] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: Adeep convolutional encoder-decoder architecture for scene segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.[9] Wenjia Bai, Ozan Oktay, Matthew Sinclair, Hideaki Suzuki, MartinRajchl, Giacomo Tarroni, Ben Glocker, Andrew King, Paul MMatthews,and Daniel Rueckert. Semi-supervised learning for network-basedcardiac MR image segmentation. In International Conference on MedicalImage Computing and Computer-Assisted Intervention, pages 253–260.Springer, 2017.[10] Spyridon Bakas, Hamed Akbari, Aristeidis Sotiras, Michel Bilello, Mar-tin Rozycki, Justin S Kirby, John B Freymann, Keyvan Farahani,and Christos Davatzikos. Advancing the cancer genome atlas gliomaMRI collections with expert segmentation labels and radiomic features.Scientific Data, 4:170117, 2017.[11] Christian F Baumgartner, Kerem C Tezcan, Krishna Chaitanya, An-dreas M Hötker, Urs J Muehlematter, Khoschy Schawkat, Anton SBecker, Olivio Donati, and Ender Konukoglu. Phiseg: Capturing uncer-tainty in medical image segmentation. In International Conference onMedical Image Computing and Computer-Assisted Intervention, pages119–127. Springer, 2019.[12] Christoph Baur, Shadi Albarqouni, and Nassir Navab. Semi-superviseddeep learning for fully convolutional networks. In International Confer-ence on Medical Image Computing and Computer-Assisted Intervention,pages 311–319. Springer, 2017.[13] Parmida Beigi, Robert Rohling, Tim Salcudean, Victoria A Lessoway,and Gary C Ng. Needle trajectory and tip localization in Real-Time3-D ultrasound using a moving stylus. Ultrasound in Medicine andBiology, 41(7):2057–2070, 2015.[14] Maxim Berman, Amal Rannen Triki, and Matthew B Blaschko. TheLovász-softmax loss: A tractable surrogate for the optimization of theintersection-over-union measure in neural networks. In Proceedings of113Bibliographythe IEEE Conference on Computer Vision and Pattern Recognition,pages 4413–4421, 2018.[15] Jeroen Bertels, David Robben, Dirk Vandermeulen, and Paul Suetens.Optimization with soft Dice can lead to a volumetric bias. 
arXivpreprint arXiv:1911.02278, 2019.[16] Andreas Bresser et al. Python pathfinding. https://github.com/brean/python-pathfinding.[17] Glenn W Brier. Verification of forecasts expressed in terms of probability.Monthly weather review, 78(1):1–3, 1950.[18] Jinzheng Cai, Youbao Tang, Le Lu, Adam P Harrison, Ke Yan, JingXiao, Lin Yang, and Ronald M Summers. Accurate weakly-superviseddeep lesion segmentation using large-scale clinical annotations: Slice-propagated 3D mask generation from 2D RECIST . In InternationalConference on Medical Image Computing and Computer-Assisted Inter-vention, pages 396–404. Springer, 2018.[19] Andrew Cameron, Farzad Khalvati, Masoom A Haider, and AlexanderWong. MAPS: A quantitative radiomics approach for prostate cancerdetection. IEEE Transactions on Biomedical Engineering, 63(6):1145–1156, 2016.[20] Yigit B Can, Krishna Chaitanya, Basil Mustafa, Lisa M Koch, EnderKonukoglu, and Christian F Baumgartner. Learning to segment medicalimages with scribble-supervision alone. In Deep Learning in MedicalImage Analysis and Multimodal Learning for Clinical Decision Support,pages 236–244. Springer, 2018.[21] Ian Chan, William Wells, 3rd, Robert V Mulkern, Steven Haker, Jian-qing Zhang, Kelly H Zou, Stephan E Maier, and Clare M C Tempany.Detection of prostate cancer by integration of line-scan diffusion, T2-mapping and T2-weighted magnetic resonance imaging; a multichannelstatistical classifier. Medical physics, 30(9):2390–2398, 2003.[22] Quan Chen, Xiang Xu, Shiliang Hu, Xiao Li, Qing Zou, and YunpengLi. A transfer learning approach for classification of clinical signifi-cant prostate cancers from mpMRI scans. In Medical Imaging 2017:Computer-Aided Diagnosis, volume 10134, page 101344F. InternationalSociety for Optics and Photonics, 2017.114Bibliography[23] Shuai Chen, Gerda Bortsova, Antonio García-Uceda Juárez, Gijs vanTulder, and Marleen de Bruijne. Multi-task attention-based semi-supervised learning for medical image segmentation. In InternationalConference on Medical Image Computing and Computer-Assisted Inter-vention, pages 457–465. Springer, 2019.[24] V. Cheplygina, I. P. Pena, J. H. Pedersen, D. A Lynch, L. Sørensen, andM. de Bruijne. Transfer learning for multi-center classification of chronicobstructive pulmonary disease. arXiv preprint arXiv:1701.05013, 2017.[25] Eleni Chiou, Francesco Giganti, Elisenda Bonet-Carne, Shonit Pun-wani, Iasonas Kokkinos, and Eleftheria Panagiotaki. Prostate cancerclassification on verdict DW-MRI using convolutional neural networks.In International Workshop on Machine Learning in Medical Imaging,pages 319–327. Springer, 2018.[26] François Chollet et al. Keras. https://keras.io, 2015.[27] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox,and Olaf Ronneberger. 3D U-Net: learning dense volumetric segmenta-tion from sparse annotation. In International Conference on MedicalImage Computing and Computer-assisted Intervention, pages 424–432.Springer, 2016.[28] Charles Corbière, Nicolas Thome, Avner Bar-Hen, Matthieu Cord,and Patrick Pérez. Addressing failure prediction by learning modelconfidence. In Advances in Neural Information Processing Systems,pages 2898–2909, 2019.[29] Bob D de Vos, Jelmer M Wolterink, Pim A de Jong, Tim Leiner,Max A Viergever, and Ivana Isgum. ConvNet-Based localization ofanatomical structures in 3-D medical images. IEEE Transactions onMedical Imaging, 36(7):1470–1481, 2017.[30] Terrance DeVries and Graham W Taylor. Learning confidence forout-of-distribution detection in neural networks. 
arXiv preprintarXiv:1802.04865, 2018.[31] Thomas G Dietterich. Ensemble methods in machine learning. InInternational Workshop on Multiple Classifier Systems, pages 1–15.Springer, 2000.115Bibliography[32] SP DiMaio, DF Kacher, RE Ellis, G Fichtinger, N Hata, GP Zientara,LP Panych, R Kikinis, and FA Jolesz. Needle artifact localization in3T MR images. Studies in Health Technology and Informatics, 119:120,2005.[33] Q. Dou, H. Chen, L. Yu, L. Zhao, J. Qin, D. Wang, V. CT Mok, L. Shi,and P. A. Heng. Automatic detection of cerebral microbleeds from MRimages via 3D convolutional neural networks. IEEE Transactions onMedical Imaging, 35(5):1182–1195, 2016.[34] Timothy Dozat. Incorporating nesterov momentum into adam. InICLR Workshop, 2016.[35] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map predictionfrom a single image using a multi-scale deep network. In Advances inNeural Information Processing Systems, pages 2366–2374, 2014.[36] Jonathan I Epstein, Zhaoyong Feng, Bruce J Trock, and Phillip MPierorazio. Upgrading and downgrading of prostate cancer from biopsyto radical prostatectomy: incidence and predictive factors using themodified gleason grading system and factoring in tertiary grades. Eu-ropean Urology, 61(5):1019–1024, 2012.[37] A. Esteva, B. Kuprel, R. A Novoa, J. Ko, S. M Swetter, H. M Blau,and S. Thrun. Dermatologist-level classification of skin cancer withdeep neural networks. Nature, 542(7639):115–118, 2017.[38] Andriy Fedorov, Kemal Tuncali, Fiona M Fennessy, Junichi Tokuda,Nobuhiko Hata, William M Wells, Ron Kikinis, and Clare M Tem-pany. Image registration for targeted MRI-guided transperineal prostatebiopsy. Journal of Magnetic Resonance Imaging, 36(4):987–992, 2012.[39] Lucas Fidon, Wenqi Li, Luis C Garcia-Peraza-Herrera, JinendraEkanayake, Neil Kitchen, Sébastien Ourselin, and Tom Vercauteren.Generalised Wasserstein Dice score for imbalanced multi-class segmen-tation using holistic convolutional networks. In International MICCAIBrainlesion Workshop, pages 64–76. Springer, 2017.[40] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensem-bles: A loss landscape perspective. arXiv preprint arXiv:1912.02757,2019.116Bibliography[41] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approxima-tion: Representing model uncertainty in deep learning. In InternationalConference on Machine Learning, pages 1050–1059, 2016.[42] Pierre-Antoine Ganaye, Michaël Sdika, and Hugues Benoit-Cattin. Semi-supervised learning for segmentation under semantic constraint. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 595–602. Springer, 2018.[43] M. Ghafoorian, N. Karssemeijer, I. WM van Uden, F.E de Leeuw,T. Heskes, E. Marchiori, and B. Platel. Automated detection of whitematter hyperintensities of all sizes in cerebral small vessel disease.Medical Physics, 43(12):6246–6258, 2016.[44] Mohsen Ghafoorian, Nico Karssemeijer, Tom Heskes, Mayra Bergkamp,Joost Wissink, Jiri Obels, Karlijn Keizer, Frank-Erik de Leeuw,Bram van Ginneken, Elena Marchiori, and Bram Platel. Deep multi-scale location-aware 3D convolutional neural networks for automateddetection of lacunes of presumed vascular origin. Neuroimage Clinical,14:391–399, 2017.[45] Mohsen Ghafoorian, Nico Karssemeijer, Tom Heskes, Inge WM vanUden, Clara I Sanchez, Geert Litjens, Frank-Erik de Leeuw, Bramvan Ginneken, Elena Marchiori, and Bram Platel. Location sensitivedeep convolutional neural networks for segmentation of white matterhyperintensities. 
Scientific Reports, 7(1):1–12, 2017.[46] Mohsen Ghafoorian, Alireza Mehrtash, Tina Kapur, Nico Karssemeijer,Elena Marchiori, Mehran Pesteie, Charles RG Guttmann, Frank-Erikde Leeuw, Clare M Tempany, Bram van Ginneken, et al. Transferlearning for domain adaptation in MRI: Application in brain lesionsegmentation. In International Conference on Medical Image Computingand Computer-Assisted Intervention, pages 516–524. Springer, 2017.[47] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investiga-tion into neural net optimization via hessian eigenvalue density. InInternational Conference on Machine Learning, pages 2232–2241, 2019.[48] Valentina Giannini, Simone Mazzetti, Enrico Armando, Silvia Cara-balona, Filippo Russo, Alessandro Giacobbe, Giovanni Muto, andDaniele Regge. Multiparametric magnetic resonance imaging of theprostate with computer-aided detection: experienced observer perfor-mance study. European Radiology, 27(10):4200–4208, 2017.117Bibliography[49] Shoshana B Ginsburg, Ahmad Algohary, Shivani Pahwa, Vikas Gu-lani, Lee Ponsky, Hannu J Aronen, Peter J Boström, Maret Böhm,Anne-Maree Haynes, Phillip Brenner, Warick Delprado, James Thomp-son, Marley Pulbrock, Pekka Taimen, Robert Villani, Phillip Stricker,Ardeshir R Rastinehad, Ivan Jambor, and Anant Madabhushi. Ra-diomic features for prostate cancer detection on MRI differ betweenthe transition and peripheral zones: Preliminary findings from a multi-institutional study. Journal of Magnetic Resonance Imaging, 46(1):184–193, 2017.[50] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules,prediction, and estimation. Journal of the American Statistical Associ-ation, 102(477):359–378, 2007.[51] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.MIT Press, 2016. http://www.deeplearningbook.org.[52] Joseph Görres, Michael Brehler, Jochen Franke, Karl Barth, Sven YVetter, Andrés Córdova, Paul A Grützner, Hans-Peter Meinzer, IvoWolf, and Diana Nabers. Intraoperative detection and localizationof cylindrical implants in cone-beam CT image data. InternationalJournal of Computer Assisted Radiology and Surgery, 9(6):1045–1057,2014.[53] Henry Gray. Anatomy of the human body, volume 8. Lea & Febiger,1878.[54] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. Oncalibration of modern neural networks. In Proceedings of the 34thInternational Conference on Machine Learning-Volume 70, pages 1321–1330. JMLR. org, 2017.[55] Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rod-ney J Douglas, and H Sebastian Seung. Digital selection and ana-logue amplification coexist in a cortex-inspired silicon circuit. Nature,405(6789):947–951, 2000.[56] Freddie C Hamdy, Jenny L Donovan, J Athene Lane, Malcolm Mason,Chris Metcalfe, Peter Holding, Michael Davis, Tim J Peters, Emma LTurner, Richard M Martin, et al. 10-year outcomes after monitoring,surgery, or radiotherapy for localized prostate cancer. New EnglandJournal of Medicine, 375(15):1415–1424, 2016.118Bibliography[57] Xiang Hao, Kristen Zygmunt, Ross T Whitaker, and P Thomas Fletcher.Improved segmentation of white matter tracts with adaptive Rieman-nian metrics. Medical Image Analysis, 18(1):161–175, 2014.[58] Elmira Hassanzadeh, Daniel I Glazer, Ruth M Dunne, Fiona M Fennessy,Mukesh G Harisinghani, and Clare M Tempany. Prostate imagingreporting and data system version 2 (PI-RADS v2): a pictorial review.Abdominal Radiology, 42(1):278–289, 2017.[59] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 
Delvingdeep into rectifiers: Surpassing human-level performance on ImageNetclassification. In Proceedings of the IEEE International Conference onComputer Vision, pages 1026–1034, 2015.[60] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deepresidual learning for image recognition. pages 770–778, 2016.[61] Nicholas Heller, Niranjan Sathianathen, Arveen Kalapara, EdwardWalczak, Keenan Moore, Heather Kaluzniak, Joel Rosenberg, PaulBlake, Zachary Rengel, Makinna Oestreich, et al. The KiTS19 challengedata: 300 kidney tumor cases with clinical context, CT semanticsegmentations, and surgical outcomes. arXiv preprint arXiv:1904.00445,2019.[62] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclas-sified and out-of-distribution examples in neural networks. In 5thInternational Conference on Learning Representations, ICLR 2017,2017.[63] Dan Hendrycks, Kimin Lee, and Mantas Mazeika. Using pre-trainingcan improve model robustness and uncertainty. arXiv preprintarXiv:1901.09960, 2019.[64] Jay Heo, Hae Beom Lee, Saehoon Kim, Juho Lee, Kwang Joon Kim,Eunho Yang, and Sung Ju Hwang. Uncertainty-aware attention for reli-able interpretation and prediction. In Advances in Neural InformationProcessing Systems, pages 909–918, 2018.[65] William Thomas Hrinivich, Douglas A Hoover, Kathleen Surry,Chandima Edirisinghe, Jacques Montreuil, David D’Souza, AaronFenster, and Eugene Wong. Simultaneous automatic segmentationof multiple needles using 3D ultrasound for high-dose-rate prostatebrachytherapy. Medical Physics, 44(4):1234–1245, 2017.119Bibliography[66] Shi Hu, Daniel Worrall, Stefan Knegt, Bas Veeling, Henkjan Huisman,and Max Welling. Supervised uncertainty quantification for segmenta-tion with multiple annotations. In International Conference on MedicalImage Computing and Computer-Assisted Intervention, pages 137–145.Springer, 2019.[67] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft,and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get M for free.In 5th International Conference on Learning Representations, ICLR2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.OpenReview.net, 2017.[68] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian QWeinberger. Densely connected convolutional networks. In Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition,pages 4700–4708, 2017.[69] Abhaya Indrayan. Medical biostatistics. Chapman and Hall/CRC, 2012.[70] Sergey Ioffe and Christian Szegedy. Batch normalization: Acceler-ating deep network training by reducing internal covariate shift. InInternational Conference on Machine Learning, pages 448–456, 2015.[71] Junichiro Ishioka, Yoh Matsuoka, Sho Uehara, Yosuke Yasuda, ToshikiKijima, Soichiro Yoshida, Minato Yokoyama, Kazutaka Saito, KazunoriKihara, Noboru Numao, et al. Computer-aided diagnosis of prostatecancer on magnetic resonance imaging using a convolutional neuralnetwork algorithm. BJU International, 122(3):411–417, 2018.[72] Ahmadreza Jeddi, Mohammad Javad Shafiee, Michelle Karg, ChristianScharfenberger, and Alexander Wong. Learn2perturb: an end-to-endfeature perturbation learning to improve adversarial robustness. arXivpreprint arXiv:2003.01090, 2020.[73] Zhanghexuan Ji, Yan Shen, Chunwei Ma, and Mingchen Gao. Scribble-based hierarchical weakly supervised learning for brain tumor segmen-tation. In International Conference on Medical Image Computing andComputer-Assisted Intervention, pages 175–183. Springer, 2019.[74] Hongsheng Jin, Zongyao Li, Ruofeng Tong, and Lanfen Lin. 
A deep 3Dresidual CNN for false-positive reduction in pulmonary nodule detection.Medical physics, 45(5):2097–2107, 2018.120Bibliography[75] Alain Jungo and Mauricio Reyes. Assessing reliability and challengesof uncertainty estimations for medical image segmentation. In Interna-tional Conference on Medical Image Computing and Computer-AssistedIntervention, pages 48–56. Springer, 2019.[76] Kaggle. TGS salt identification challenge, segment salt de-posits beneath the earth’s surface. https://www.kaggle.com/c/tgs-salt-identification-challenge, 2018.[77] K. Kamnitsas, C. Ledig, V. Newcombe, J. P. Simpson, A. D. Kane,D. K. Menon, D. Rueckert, and B. Glocker. Efficient multi-scale 3DCNN with fully connected CRF for accurate brain lesion segmentation.Medical Image Analysis, 36:61–78, 2017.[78] Konstantinos Kamnitsas, Wenjia Bai, Enzo Ferrante, Steven McDonagh,Matthew Sinclair, Nick Pawlowski, Martin Rajchl, Matthew Lee, Bern-hard Kainz, Daniel Rueckert, et al. Ensembles of multiple models andarchitectures for robust brain tumour segmentation. In InternationalMICCAI Brainlesion Workshop, pages 450–462. Springer, 2017.[79] Davood Karimi, Qi Zeng, Prateek Mathur, Apeksha Avinash, SaraMahdavi, Ingrid Spadinger, Purang Abolmaesumi, and Septimiu ESalcudean. Accurate and robust deep learning-based segmentation ofthe prostate clinical target volume in ultrasound images. Medical ImageAnalysis, 57:186–196, 2019.[80] Moritz Kasel-Seibert, Thomas Lehmann, René Aschenbach, Felix VGuettler, Mohamed Abubrig, Marc-Oliver Grimm, Ulf Teichgraeber,and Tobias Franiel. Assessment of PI-RADS v2 for the detection ofprostate cancer. European Journal of Radiology, 85(4):726–731, 2016.[81] Veeru Kasivisvanathan, Antti S Rannikko, Marcelo Borghi, ValeriaPanebianco, Lance A Mynderse, Markku H Vaarala, Alberto Briganti,Lars Budäus, Giles Hellawell, Richard G Hindley, et al. MRI-targetedor standard biopsy for prostate-cancer diagnosis. New England Journalof Medicine, 378(19):1767–1777, 2018.[82] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. BayesianSegNet: Model uncertainty in deep convolutional encoder-decoderarchitectures for scene understanding. arXiv preprint arXiv:1511.02680,2015.121Bibliography[83] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesiandeep learning for computer vision? In Advances in Neural InformationProcessing Systems, pages 5574–5584, 2017.[84] Hoel Kervadec, Jose Dolz, Meng Tang, Eric Granger, Yuri Boykov,and Ismail Ben Ayed. Constrained-CNN losses for weakly supervisedsegmentation. Medical Image Analysis, 54:88–99, 2019.[85] Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin,Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learn-ing by weight-perturbation in Adam. arXiv preprint arXiv:1806.04854,2018.[86] D. Kingma and J. Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014.[87] Atilla P Kiraly, Clement Abi Nader, Ahmet Tuysuzoglu, Robert Grimm,Berthold Kiefer, Noha El-Zehiry, and Ali Kamen. Deep convolutionalencoder-decoders for prostate cancer detection and classification. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 489–497. Springer, 2017.[88] Simon Kohl, David Bonekamp, Heinz-Peter Schlemmer, KaneschkaYaqubi, Markus Hohenfellner, Boris Hadaschik, Jan-Philipp Radtke,and Klaus Maier-Hein. Adversarial networks for the detection of ag-gressive prostate cancer. arXiv preprint arXiv:1702.08014, 2017.[89] Simon Kohl et al. 
A probabilistic U-Net for segmentation of ambiguousimages. In Advances in Neural Information Processing Systems, pages6965–6975, 2018.[90] Axel Krieger, Sang-Eun Song, Nathan Bongjoon Cho, Iulian I Iorda-chita, Peter Guion, Gabor Fichtinger, and Louis L Whitcomb. Devel-opment and evaluation of an actuated MRI-compatible robotic systemfor MRI-guided prostate intervention. IEEE/ASME Transactions onMechatronics, 18(1):273–284, 2013.[91] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers offeatures from tiny images. Technical report, Citeseer, 2009.[92] Hugo J Kuijf, J Matthijs Biesbroek, Jeroen De Bresser, Rutger Heinen,Simon Andermatt, Mariana Bento, Matt Berseth, Mikhail Belyaev,M Jorge Cardoso, Adria Casamitjana, et al. Standardized assessment of122Bibliographyautomatic segmentation of white matter hyperintensities; results of theWMH segmentation challenge. IEEE Transactions on Medical Imaging,38(11):2556–2568, 2019.[93] Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations ofthe empirical Fisher approximation for natural gradient descent. InAdvances in Neural Information Processing Systems 32, pages 4156–4167. Curran Associates, Inc., 2019.[94] Jin Tae Kwak, Sheng Xu, Bradford J Wood, Baris Turkbey, Pe-ter L Choyke, Peter A Pinto, Shijun Wang, and Ronald M Summers.Automated prostate cancer detection using T2-weighted and high-b-value diffusion-weighted magnetic resonance imaging. Medical physics,42(5):2368–2378, 2015.[95] Yongchan Kwon, Joong-Ho Won, Beom Joon Kim, and Myunghee ChoPaik. Uncertainty quantification using Bayesian neural networks in clas-sification: Application to ischemic stroke lesion segmentation. MedicalImaging with Deep Learning, 2018.[96] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell.Simple and scalable predictive uncertainty estimation using deep en-sembles. In Advances in Neural Information Processing Systems, pages6402–6413, 2017.[97] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.[98] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.Nature, 521(7553):436, 2015.[99] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al.Gradient-based learning applied to document recognition. Proceedingsof the IEEE, 86(11):2278–2324, 1998.[100] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Trainingconfidence-calibrated classifiers for detecting out-of-distribution samples.In 6th International Conference on Learning Representations, ICLR2018, 2018.[101] Stefan Lee, Senthil Purushwalkam, Michael Cogswell, David Crandall,and Dhruv Batra. Why M heads are better than one: Training a diverseensemble of deep networks. arXiv preprint arXiv:1511.06314, 2015.123Bibliography[102] Christian Leibig, Vaneeda Allken, Murat Seçkin Ayhan, Philipp Berens,and Siegfried Wahl. Leveraging uncertainty information from deepneural networks for disease detection. Scientific Reports, 7(1):17816,2017.[103] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th InternationalConference on Learning Representations, ICLR 2018, 2018.[104] Zhibin Liao, Hany Girgis, Amir Abdi, Hooman Vaseli, Jorden Hethering-ton, Robert Rohling, Ken Gin, Teresa Tsang, and Purang Abolmaesumi.On modelling label uncertainty in deep neural networks: Automaticestimation of intra-observer variability in 2D echocardiography qualityassessment. IEEE Transactions on Medical Imaging, 39(6):1868–1883,2019.[105] Paweł Liskowski and Krzysztof Krawiec. 
Segmenting retinal bloodvessels with deep neural networks. IEEE Transactions on MedicalImaging, 35(11):2369–2380, 2016.[106] Geert Litjens, Oscar Debats, Jelle Barentsz, Nico Karssemeijer, andHenkjan Huisman. Computer-aided detection of prostate cancer inMRI. IEEE Transactions on Medical Imaging, 33(5):1083–1092, 2014.[107] Geert Litjens et al. A survey on deep learning in medical image analysis.Medical Image Analysis, 42:60–88, 2017.[108] Geert Litjens, Robert Toth, Wendy van de Ven, Caroline Hoeks, SjoerdKerkstra, Bram van Ginneken, Graham Vincent, Gwenael Guillard,Neil Birbeck, Jindang Zhang, et al. Evaluation of prostate segmentationalgorithms for MRI: the PROMISE12 challenge. Medical Image Analysis,18(2):359–373, 2014.[109] Lizhi Liu, Zhiqiang Tian, Zhenfeng Zhang, and Baowei Fei. Computer-aided Detection of Prostate Cancer with MRI: Technology and Appli-cations. Academic Radiology, 23(8):1024–1046, 2016.[110] Saifeng Liu, Huaixiu Zheng, Yesu Feng, and Wei Li. Prostate cancerdiagnosis using deep learning with 3D multiparametric MRI. In MedicalImaging 2017: Computer-Aided Diagnosis, volume 10134, page 1013428.International Society for Optics and Photonics, 2017.124Bibliography[111] Stacy Loeb, Marc A Bjurlin, Joseph Nicholson, Teuvo L Tammela,David F Penson, H Ballentine Carter, Peter Carroll, and Ruth Etzioni.Overdiagnosis and overtreatment of prostate cancer. European Urology,65(6):1046–1055, 2014.[112] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convo-lutional networks for semantic segmentation. In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition, pages3431–3440, 2015.[113] Bradley Christopher Lowekamp, David T Chen, Luis Ibáñez, and DanielBlezek. The design of SimpleITK. Frontiers in Neuroinformatics, 7:45,2013.[114] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlin-earities improve neural network acoustic models. In in ICML Workshopon Deep Learning for Audio, Speech and Language Processing, vol-ume 30, 2013.[115] David JC MacKay. A practical Bayesian framework for backpropagationnetworks. Neural Computation, 4(3):448–472, 1992.[116] Andre Mastmeyer, Guillaume Pernelle, Ruibin Ma, Lauren Barber,and Tina Kapur. Accurate model-based segmentation of gynecologicbrachytherapy catheter collections in MRI-images. Medical ImageAnalysis, 42:173–188, 2017.[117] Arakaparampil M Mathai and Serge B Provost. Quadratic forms inrandom variables: theory and applications. Dekker, 1992.[118] Alireza Mehrtash, Mohsen Ghafoorian, Guillaume Pernelle, AlirezaZiaei, Friso G Heslinga, Kemal Tuncali, Andriy Fedorov, Ron Kikinis,Clare M Tempany, William M Wells, et al. Automatic needle segmen-tation and localization in MRI with 3-D convolutional neural networks:Application to MRI-targeted prostate biopsy. IEEE Transactions onMedical Imaging, 38(4):1026–1036, 2018.[119] Alireza Mehrtash, Mehran Pesteie, Jorden Hetherington, Peter A.Behringer, Tina Kapur, William M. Wells III, Robert Rohling, AndriyFedorov, and Purang Abolmaesumi. DeepInfer: Open-source deeplearning deployment toolkitfor image-guided therapy. In SPIE MedicalImaging. International Society for Optics and Photonics, 2017.125Bibliography[120] Alireza Mehrtash, Alireza Sedghi, Mohsen Ghafoorian, MehdiTaghipour, Clare M Tempany, William M Wells III, Tina Kapur, ParvinMousavi, Purang Abolmaesumi, and Andriy Fedorov. Classification ofclinical significance of MRI prostate findings using 3D convolutionalneural networks. In Medical Imaging 2017: Computer-Aided Diagnosis,volume 10134, page 101342A. 
International Society for Optics andPhotonics, 2017.[121] Alireza Mehrtash, William M Wells, Clare M Tempany, Purang Abol-maesumi, and Tina Kapur. Confidence calibration and predictiveuncertainty estimation for deep medical image segmentation. IEEETransactions on Medical Imaging, 2020.[122] Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz,Johannes Slotboom, Roland Wiest, et al. The multimodal brain tumorimage segmentation benchmark (BRATS). IEEE Transactions onMedical Imaging, 34(10):1993–2024, 2015.[123] Anneke Meyer, Alireza Mehrtash, Marko Rak, Daniel Schindele, MartinSchostak, Clare Tempany, Tina Kapur, Purang Abolmaesumi, AndriyFedorov, and Christian Hansen. Automatic high resolution segmentationof the prostate from multi-planar MRI. In 2018 IEEE 15th InternationalSymposium on Biomedical Imaging (ISBI 2018), pages 177–181, 2018.[124] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net:Fully convolutional neural networks for volumetric medical image seg-mentation. In 2016 Fourth International Conference on 3D Vision(3DV), pages 565–571. IEEE, 2016.[125] Mehdi Moradi, Septimiu E Salcudean, Silvia D Chang, Edward CJones, Nicholas Buchan, Rowan G Casey, S Larry Goldenberg, andPiotr Kozlowski. Multiparametric MRI maps for detection and gradingof dominant prostate tumors. Journal of Magnetic Resonance Imaging,35(6):1403–1413, 2012.[126] MRBrainS18. Grand challenge on MR brain segmentation at MICCAI2018. https://mrbrains18.isi.uu.nl/, 2018.[127] Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht.Obtaining well calibrated probabilities using Bayesian binning. In Pro-ceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence,pages 2901–2907, 2015.126Bibliography[128] Huu-Giao Nguyen, Céline Fouard, and Jocelyne Troccaz. Segmentation,separation and pose estimation of prostate brachytherapy seeds in CTimages. IEEE Transactions on Biomedical Engineering, 62(8):2012–2024, 2015.[129] Emilie Niaf, Olivier Rouvière, Florence Mège-Lechevallier, FlavieBratan, and Carole Lartizien. Computer-aided diagnosis of prostatecancer in the peripheral zone using multiparametric MRI. Physics inMedicine & Biology, 57(12):3833–3851, 2012.[130] Dong Nie, Yaozong Gao, Li Wang, and Dinggang Shen. ASDNET:Attention based semi-supervised deep networks for medical image seg-mentation. In International Conference on Medical Image Computingand Computer-Assisted Intervention, pages 370–378. Springer, 2018.[131] Marc Niethammer, Kilian M Pohl, Firdaus Janoos, and William MWells III. Active mean fields for probabilistic image segmentation:Connections with Chan–Vese and Rudin–Osher–Fatemi models. SIAMJournal on Imaging Sciences, 10(3):1069–1103, 2017.[132] Paul M Novotny, Jeff A Stoll, Nikolay V Vasilyev, Pedro J del Nido,Pierre E Dupont, Todd E Zickler, and Robert D Howe. GPU based real-time instrument tracking with three-dimensional ultrasound. MedicalImage Analysis, 11(5):458–464, 2007.[133] Lauren J O’Donnell and Carl-Fredrik Westin. Automatic tractographysegmentation using a high-dimensional white matter atlas. IEEETransactions on Medical Imaging, 26(11):1562–1575, 2007.[134] Xi Ouyang, Zhong Xue, Yiqiang Zhan, Xiang Sean Zhou, QingfengWang, Ying Zhou, Qian Wang, and Jie-Zhi Cheng. Weakly supervisedsegmentation framework with uncertainty: A study on pneumothoraxsegmentation in chest X-ray. 
In International Conference on MedicalImage Computing and Computer-Assisted Intervention, pages 613–621.Springer, 2019.[135] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactionson Knowledge and Data Engineering, 22(10):1345–1359, 2010.[136] Tobias Penzkofer, Kemal Tuncali, Andriy Fedorov, Sang-Eun Song,Junichi Tokuda, Fiona M Fennessy, Mark G Vangel, Adam S Kibel,Robert V Mulkern, William M Wells, Nobuhiko Hata, and Clare M CTempany. Transperineal in-bore 3-T MR imaging-guided prostate127Bibliographybiopsy: a prospective clinical observational study. Radiology, 274(1):170–180, 2015.[137] Guillaume Pernelle, Alireza Mehrtash, Lauren Barber, Antonio Dam-ato, Wei Wang, Ravi Teja Seethamraju, Ehud Schmidt, Robert ACormack, Williams Wells, Akila Viswanathan, and Tina Kapur. Val-idation of catheter segmentation for MR-Guided gynecologic cancerbrachytherapy. In Medical Image Computing and Computer-AssistedIntervention–MICCAI 2013, Lecture Notes in Computer Science, pages380–387. Springer, Berlin, Heidelberg, 2013.[138] William H Press, Saul A Teukolsky, William T Vetterling, and Brian PFlannery. Numerical recipes 3rd edition: The art of scientific computing.Cambridge university press, 2007.[139] Michael Quentin, Dirk Blondin, Christian Arsov, Lars Schimmöller,Andreas Hiester, Erhard Godehardt, Peter Albers, Gerald Antoch, andRobert Rabenalt. Prospective evaluation of magnetic resonance imagingguided in-bore prostate biopsy versus systematic transrectal ultrasoundguided prostate biopsy in biopsy naïve men with elevated prostatespecific antigen. The Journal of Urology, 192(5):1374–1379, 2014.[140] Joaquin Quinonero-Candela, Carl Edward Rasmussen, Fabian Sinz,Olivier Bousquet, and Bernhard Schölkopf. Evaluating predictive un-certainty challenge. In Machine Learning Challenges Workshop, pages1–27. Springer, 2005.[141] Khashayar Rafat Zand, Caroline Reinhold, Masoom A Haider, AsakoNakai, Laurian Rohoman, and Sharad Maheshwari. Artifacts andpitfalls in MR imaging of the pelvis. Journal of Magnetic ResonanceImaging, 26(3):480–497, 2007.[142] Maithra Raghu, Katy Blumer, Rory Sayres, Ziad Obermeyer, BobbyKleinberg, Sendhil Mullainathan, and Jon Kleinberg. Direct uncertaintyprediction for medical second opinions. In International Conference onMachine Learning, pages 5281–5290, 2019.[143] Martin Rajchl, Matthew CH Lee, Ozan Oktay, Konstantinos Kamnitsas,Jonathan Passerat-Palmbach, Wenjia Bai, Mellisa Damodaram, Mary ARutherford, Joseph V Hajnal, Bernhard Kainz, et al. Deepcut: Objectsegmentation from bounding box annotations using convolutional neuralnetworks. IEEE Transactions on Medical Imaging, 36(2):674–683, 2016.128Bibliography[144] Prashanth Rawla. Epidemiology of prostate cancer. World Journal ofOncology, 10(2):63, 2019.[145] Mark Renfrew, Mark Griswold, and M Cenk Çavuşogˆlu. Active lo-calization and tracking of needle and target in robotic image-guidedintervention systems. Autonomous Robots, 42(1):83–97, 2018.[146] Raanan Yehezkel Rohekar, Yaniv Gurwicz, Shami Nisimov, and GalNovik. Modeling uncertainty by learning a hierarchy of deep neuralconnections. In Advances in Neural Information Processing Systems,pages 4246–4256, 2019.[147] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convo-lutional networks for biomedical image segmentation. In InternationalConference on Medical Image Computing and Computer-assisted Inter-vention, pages 234–241. Springer, 2015.[148] Holger Roth, Ling Zhang, Dong Yang, Fausto Milletari, Ziyue Xu,Xiaosong Wang, and Daguang Xu. 
Weakly supervised segmentation from extreme points. In Large-Scale Annotation of Biomedical Data and Expert Label Synthesis and Hardware Aware Learning for Medical Imaging and Computer Assisted Intervention, pages 42–50. Springer, 2019.
[149] Matthias Rottmann and Marius Schubert. Uncertainty measures and prediction quality rating for the semantic segmentation of nested multi resolution street scene images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
[150] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[151] Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.
[152] Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
[153] Jörg Sander, Bob D de Vos, Jelmer M Wolterink, and Ivana Išgum. Towards increased trustworthiness of deep learning segmentation methods on cardiac MRI. In Medical Imaging 2019: Image Processing, volume 10949, page 1094919. International Society for Optics and Photonics, 2019.
[154] Patrick Schelb, Simon Kohl, Jan Philipp Radtke, Manuel Wiesenfarth, Philipp Kickingereder, Sebastian Bickelhaupt, Tristan Anselm Kuder, Albrecht Stenzinger, Markus Hohenfellner, Heinz-Peter Schlemmer, Klaus H Maier-Hein, and David Bonekamp. Classification of cancer at prostate MRI: Deep learning versus clinical PI-RADS assessment. Radiology, page 190938, 2019.
[155] Ivo G Schoots, Monique J Roobol, Daan Nieboer, Chris H Bangma, Ewout W Steyerberg, and MG Myriam Hunink. Magnetic resonance imaging–targeted biopsy may enhance the diagnostic accuracy of significant prostate cancer detection compared to standard transrectal ultrasound-guided biopsy: a systematic review and meta-analysis. European Urology, 68(3):438–450, 2015.
[156] Jarrel CY Seah, Jennifer SN Tang, and Andy Kitchen. Detection of prostate cancer on multiparametric MRI. In Medical Imaging 2017: Computer-Aided Diagnosis, volume 10134, page 1013429. International Society for Optics and Photonics, 2017.
[157] Suman Sedai, Bhavna Antony, Dwarikanath Mahapatra, and Rahil Garnavi. Joint segmentation and uncertainty visualization of retinal layers in optical coherence tomography images using Bayesian deep learning. In Computational Pathology and Ophthalmic Medical Image Analysis, pages 219–227. Springer, 2018.
[158] Suman Sedai, Bhavna Antony, Ravneet Rai, Katie Jones, Hiroshi Ishikawa, Joel Schuman, Gadi Wollstein, and Rahil Garnavi. Uncertainty guided semi-supervised segmentation of retinal layers in OCT images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 282–290. Springer, 2019.
[159] Alireza Sedghi, Alireza Mehrtash, Amoon Jamzad, Amel Amalou, William M Wells III, Tina Kapur, Jin Tae Kwak, Baris Turkbey, Peter Choyke, Peter Pinto, et al. Improving detection of prostate cancer foci via information fusion of MRI and temporal enhanced ultrasound. International Journal of Computer Assisted Radiology and Surgery, 2020.
[160] Seonguk Seo, Paul Hongsuck Seo, and Bohyung Han. Learning for single-shot confidence calibration in deep neural networks through stochastic inferences.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9030–9038, 2019.
[161] Vijay Shah, Baris Turkbey, Haresh Mani, Yuxi Pang, Thomas Pohida, Maria J Merino, Peter A Pinto, Peter L Choyke, and Marcelino Bernardo. Decision support system for localizing prostate cancer based on multiparametric magnetic resonance imaging. Medical Physics, 39(7):4093–4103, 2012.
[162] Gabi Shalev, Yossi Adi, and Joseph Keshet. Out-of-distribution detection using multiple semantic label representations. In Advances in Neural Information Processing Systems, pages 7375–7385, 2018.
[163] H. C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5):1285–1298, 2016.
[164] M Minhaj Siddiqui, Soroush Rais-Bahrami, Baris Turkbey, Arvin K George, Jason Rothwax, Nabeel Shakir, Chinonyerem Okoro, Dima Raskolnikov, Howard L Parnes, W Marston Linehan, et al. Comparison of MR/ultrasound fusion–guided biopsy with ultrasound-guided biopsy for the diagnosis of prostate cancer. JAMA, 313(4):390–397, 2015.
[165] Rebecca L Siegel, Kimberly D Miller, and Ahmedin Jemal. Cancer statistics, 2020. CA: A Cancer Journal for Clinicians, 70(1):7–30, 2020.
[166] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[167] SE Song, NB Cho, II Iordachita, P Guion, G Fichtinger, A Kaushal, K Camphausen, and LL Whitcomb. Biopsy catheter artifact localization in MRI-guided robotic transrectal prostate intervention. IEEE Transactions on Biomedical Engineering, 59(7):1902–11, 2012.
[168] Yang Song, Yu-Dong Zhang, Xu Yan, Hui Liu, Minxiong Zhou, Bingwen Hu, and Guang Yang. Computer-aided diagnosis of prostate cancer using a deep convolutional neural network from multiparametric MRI. Journal of Magnetic Resonance Imaging, 2018.
[169] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[170] Thomas A Stamey, Fuad S Freiha, John E McNeal, Elise A Redwine, Alice S Whittemore, and Hans-Peter Schmid. Localized prostate cancer. Relationship of tumor volume to clinical significance for treatment of prostate cancer. Cancer, 71(S3):933–938, 1993.
[171] Susan Standring. Gray’s Anatomy: The Anatomical Basis of Clinical Practice. Elsevier Health Sciences, 41st edition, 2016.
[172] Carole H Sudre, Beatriz Gomez Anson, Silvia Ingala, Chris D Lane, Daniel Jimenez, Lukas Haider, Thomas Varsavsky, Ryutaro Tanno, Lorna Smith, Sébastien Ourselin, et al. Let’s agree to disagree: Learning highly debatable multirater labelling. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 665–673. Springer, 2019.
[173] Carole H Sudre et al. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 240–248. Springer, 2017.
[174] Sharmin Sultana, Jason Blatt, Benjamin Gilles, Tanweer Rashid, and Michel Audette. MRI-based medial axis extraction and boundary segmentation of cranial nerves through discrete deformable 3D contour and surface models.
IEEE Transactions on Medical Imaging, 2017.
[175] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
[176] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. Todd Hurst, Christopher B. Kendall, Michael B. Gotway, and Jianming Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 35(5):1299–1312, 2016.
[177] Meng Tang, Abdelaziz Djelouah, Federico Perazzi, Yuri Boykov, and Christopher Schroers. Normalized cut loss for weakly-supervised CNN segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1818–1827, 2018.
[178] Ryutaro Tanno, Ardavan Saeedi, Swami Sankaranarayanan, Daniel C Alexander, and Nathan Silberman. Learning from noisy labels by regularized estimation of annotator confusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11244–11253, 2019.
[179] Clare Tempany, Jagadeesan Jayender, Tina Kapur, Raphael Bueno, Alexandra Golby, Nathalie Agar, and Ferenc A Jolesz. Multimodal imaging for improved diagnosis and treatment of cancers. Cancer, 121(6):817–827, 2015.
[180] Mattias Teye, Hossein Azizpour, and Kevin Smith. Bayesian uncertainty estimation for batch normalized deep networks. In International Conference on Machine Learning, pages 4914–4923, 2018.
[181] Sunil Thulasidasan, Gopinath Chennupati, Jeff A Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In Advances in Neural Information Processing Systems, pages 13888–13899, 2019.
[182] Gaurie Tilak, Kemal Tuncali, Sang-Eun Song, Junichi Tokuda, Olutayo Olubiyi, Fiona Fennessy, Andriy Fedorov, Tobias Penzkofer, Clare Tempany, and Nobuhiko Hata. 3T MR-guided in-bore transperineal prostate biopsy: A comparison of robotic and manual needle-guidance templates. Journal of Magnetic Resonance Imaging, 42(1):63–71, 2015.
[183] Baris Turkbey, Andrew B. Rosenkrantz, Masoom A. Haider, Anwar R. Padhani, Geert Villeirs, Katarzyna J. Macura, Clare M. Tempany, Peter L. Choyke, François Cornud, Daniel J. A. Margolis, Harriet C. Thoeny, Sadhna Verma, Jelle O. Barentsz, and Jeffrey C. Weinreb. Prostate Imaging Reporting and Data System version 2.1: 2019 update of Prostate Imaging Reporting and Data System version 2. European Urology, 2019.
[184] M Uherčík, J Kybic, H Liebgott, and C Cachard. Model fitting using RANSAC for surgical tool localization in 3-D ultrasound images. IEEE Transactions on Biomedical Engineering, 57(8):1907–1916, 2010.
[185] A. G. van Norden, K. F. de Laat, R. A. Gons, I. W. van Uden, E. J. van Dijk, L. J. van Oudheusden, R. A. Esselink, B. R. Bloem, B. G. van Engelen, M. J. Zwarts, I. Tendolkar, M. G. Olde-Rikkert, M. J. van der Vlugt, M. P. Zwiers, D. G. Norris, and F. E. de Leeuw. Causes and consequences of cerebral small vessel disease. The RUN DMC study: a prospective cohort study. Study rationale and protocol. BMC Neurology, 11:29, 2011.
[186] A. Van Opbroek, M. A. Ikram, M. W. Vernooij, and M. De Bruijne. Transfer learning improves supervised image segmentation across imaging protocols. IEEE Transactions on Medical Imaging, 34(5):1018–1030, 2015.
[187] Sadhna Verma, Peter L Choyke, Steven C Eberhardt, Aytekin Oto, Clare M Tempany, Baris Turkbey, and Andrew B Rosenkrantz.
The current state of MR imaging-targeted biopsy techniques for detection of prostate cancer. Radiology, 285(2):343–356, 2017.
[188] P C Vos, J O Barentsz, N Karssemeijer, and H J Huisman. Automatic computer-aided detection of prostate cancer based on multiparametric magnetic resonance image analysis. Physics in Medicine & Biology, 57(6):1527–1542, 2012.
[189] Juan Wang, Huanjun Ding, Fatemeh Azamian, Brian Zhou, Carlos Iribarren, Sabee Molloi, and Pierre Baldi. Detecting cardiovascular disease from mammograms with deep learning. IEEE Transactions on Medical Imaging, 2017.
[190] Zhiwei Wang, Chaoyue Liu, Danpeng Cheng, Liang Wang, Xin Yang, and Kwang-Ting Cheng. Automated detection of clinically significant prostate cancer in mp-MRI images based on an end-to-end deep neural network. IEEE Transactions on Medical Imaging, 37(5):1127–1139, 2018.
[191] William M Wells III, Paul Viola, Hideki Atsumi, Shin Nakajima, and Ron Kikinis. Multi-modal volume registration by maximization of mutual information. Medical Image Analysis, 1(1):35–51, 1996.
[192] Rogier R Wildeboer, Ruud JG van Sloun, Hessel Wijkstra, and Massimo Mischi. Artificial intelligence in multiparametric prostate cancer imaging with focus on deep-learning methods. Computer Methods and Programs in Biomedicine, 189:105316, 2020.
[193] Onno Wink, Wiro J Niessen, and Max A Viergever. Multiscale vessel tracking. IEEE Transactions on Medical Imaging, 23(1):130–133, 2004.
[194] Jelmer M Wolterink, Tim Leiner, Max A Viergever, and Ivana Išgum. Automatic segmentation and disease classification using cardiac cine MR images. In International Workshop on Statistical Atlases and Computational Models of the Heart, pages 101–110. Springer, 2017.
[195] Tineke Wolters, Monique J Roobol, Pim J van Leeuwen, Roderick CN van den Bergh, Robert F Hoedemaeker, Geert JLH van Leenders, Fritz H Schröder, and Theodorus H van der Kwast. A critical analysis of the tumor volume threshold for clinically insignificant prostate cancer using a data set of a randomized screening trial. The Journal of Urology, 185(1):121–125, 2011.
[196] David A Woodrum, Akira Kawashima, Krzysztof R Gorny, and Lance A Mynderse. Targeted prostate biopsy and MR-guided therapy for prostate cancer. Abdominal Radiology, 41(5):877–888, 2016.
[197] Lequan Yu, Shujun Wang, Xiaomeng Li, Chi-Wing Fu, and Pheng-Ann Heng. Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 605–613. Springer, 2019.
[198] Han Zheng, Lanfen Lin, Hongjie Hu, Qiaowei Zhang, Qingqing Chen, Yutaro Iwamoto, Xianhua Han, Yen-Wei Chen, Ruofeng Tong, and Jian Wu. Semi-supervised segmentation of liver using adversarial learning with deep atlas prior. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 148–156. Springer, 2019.
[199] Zhen Zhu, Mengde Xu, Song Bai, Tengteng Huang, and Xiang Bai. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 593–602, 2019.
