Diabetic Retinopathy Classification Using an Efficient Convolutional Neural Network

by

Jiaxi Gao

B.Eng., Xi'an Jiaotong University, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2019

© Jiaxi Gao, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, a thesis/dissertation entitled:

Diabetic Retinopathy Classification Using an Efficient Convolutional Neural Network

submitted by Jiaxi Gao in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:
Cyril Leung, Electrical and Computer Engineering (Supervisor)
Rabab Ward, Electrical and Computer Engineering (Supervisory Committee Member)
José Martí, Electrical and Computer Engineering (Supervisory Committee Member)

Abstract

Diabetic retinopathy (DR) is a diabetic complication that affects the eyes and may lead to blurred vision or even blindness. The diagnosis of DR from retinal fundus images is traditionally performed by ophthalmologists, who inspect the images for the presence and significance of many subtle features, a process that is cumbersome and time-consuming. As there are many undiagnosed and untreated cases of DR, screening all diabetic patients for DR is a huge challenge.

Deep convolutional neural networks (CNNs) have rapidly become a powerful tool for analyzing medical images, and previous works have used deep learning models to detect DR automatically. However, these methods employed very deep CNNs which require vast computational resources. Thus, there is a need for more computationally efficient deep learning models for automatic DR diagnosis.
The primary objective of this research is to develop a robust and computationally efficient deep learning model to diagnose DR automatically.

In the first part of this thesis, we propose a computationally efficient deep CNN model, MobileNet-Dense, which is based on the recently proposed MobileNetV2 and DenseNet models. The effectiveness of the proposed MobileNet-Dense model is demonstrated using two widely used benchmark datasets, CIFAR-10 and CIFAR-100.

In the second part of the thesis, we propose an automatic DR classification system based on an ensemble of the proposed MobileNet-Dense model and the MobileNetV2 model. The performance of our system is evaluated and compared with some of the state-of-the-art methods using two independent DR datasets, the EyePACS dataset and the Messidor database. On the EyePACS dataset, our system achieves a quadratic weighted kappa (QWK) score of 0.852, compared to a QWK score of 0.849 achieved by the benchmark method, while using 32% fewer parameters and 73% fewer multiply-adds (MAdds). On the Messidor database, our system outperforms the state-of-the-art method on both the Normal/Abnormal and the Referable/Non-Referable classification tasks.

Lay Summary

Diabetic retinopathy (DR) is a diabetic complication that affects the eyes and may cause vision impairment or even vision loss. The disease results in progressively developing retinal abnormalities such as exudates, hemorrhages and microaneurysms. As the number of patients with undiagnosed DR is increasing globally, and manually diagnosing DR is cumbersome and time-consuming, automatic DR screening is of great significance.

Convolutional neural networks (CNNs) have rapidly become a popular tool for medical image processing and analysis. Previous works which applied deep CNN models to the automatic screening of DR need vast computational resources.
In this thesis, we study the use of computationally efficient deep CNN models for the automatic classification of DR.

Preface

I hereby declare that I am the author of this thesis. This thesis is an original, unpublished work carried out under the supervision of Professor Cyril Leung.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Notation
Acknowledgements
Dedication
1 Introduction
  1.1 An Overview of Diabetic Retinopathy
    1.1.1 Diabetes Mellitus and Diabetic Retinopathy
    1.1.2 Diabetic Retinopathy Classification Rules
  1.2 Related Work
    1.2.1 DR Screening Using Traditional Machine Learning Algorithms
    1.2.2 DR Screening Using Deep Learning Algorithms
    1.2.3 Discussions
  1.3 Motivations
  1.4 Contributions
  1.5 Organization of the Thesis
2 Review of Neural Network Models
  2.1 Fully Connected Neural Networks
    2.1.1 The Architecture of FCNN
    2.1.2 Training the Neural Network Model
  2.2 Convolutional Neural Networks
    2.2.1 Standard Convolutional Layer
    2.2.2 Depthwise Separable Convolutional Layer
    2.2.3 Max Pooling Layer
    2.2.4 Fully Connected Layer
  2.3 Typical Workflow of a Deep Learning Project
3 MobileNet-Dense: An Efficient Convolutional Neural Network
  3.1 Related Works
    3.1.1 CNN with Skip Connections: ResNet and DenseNet
    3.1.2 Efficient CNNs: MobileNetV1 and MobileNetV2
  3.2 MobileNet-Dense: An Efficient CNN Based on Dense Connections and Depthwise Separable Convolutional Layers
  3.3 Experiment
    3.3.1 Experiment Environment
    3.3.2 CIFAR Datasets
    3.3.3 Training
    3.3.4 Performance Evaluation
  3.4 Summary
4 Automatic Diabetic Retinopathy Classification System Design
  4.1 Exploratory Data Analysis
    4.1.1 EDA of Training Labels
    4.1.2 EDA of Training Images
  4.2 Image Augmentation and Preprocessing
  4.3 CNN Training
    4.3.1 Evaluation Metric and Loss Function
    4.3.2 Hyperparameter Tuning and Training
  4.4 Ensemble Learning
    4.4.1 Feature Extraction and Feature Blending
    4.4.2 Feature Reduction
    4.4.3 Model Stacking
  4.5 Summary
5 Performance Evaluation of the Diabetic Retinopathy Classification System
  5.1 Performance Evaluation on the EyePACS Test Set
    5.1.1 The QWK Score and Model Complexity
    5.1.2 Performance Evaluation for Each Class
  5.2 Performance Evaluation on the Messidor Database
    5.2.1 EDA of the Messidor Database
    5.2.2 Performance Evaluation
  5.3 Summary
6 Conclusion
  6.1 Main Contributions
  6.2 Future Work
Bibliography

List of Tables

1.1 DR grading rules [1]
3.1 Error rates of plain CNNs and ResNets on the ImageNet dataset [2]
3.2 The structure of MobileNetV1 [3]
3.3 The structure of MobileNetV2 [4]
3.4 The structure of MobileNet-Dense
3.5 The structure of MobileNetV2 for CIFAR datasets
3.6 The structure of MobileNet-Dense for CIFAR datasets
3.7 Performance on CIFAR-10 and CIFAR-100 datasets
4.1 DR label information in the EyePACS dataset [5]
4.2 The structure of MobileNet-Dense for DR classification
4.3 The structure of MobileNetV2 for DR classification
5.1 Performance comparison on the EyePACS test set
5.2 Per-class performance on the EyePACS test set
5.3 Performance comparison on the Normal/Abnormal screening task
5.4 Performance comparison on the Referable/Non-Referable screening task

List of Figures

1.1 Annotated results of an image with DR [6]
2.1 The structure of a FCNN with two hidden layers
2.2 The structure of the AlexNet [7]
2.3 Illustration of convolving a single 7 × 7 × 3 filter over a 32 × 32 × 3 input feature map with a stride of 1 (shifting the filter one unit horizontally or vertically at a time). There are 26 × 26 spatial locations for a 7 × 7 × 3 filter to slide over a 32 × 32 × 3 input feature map, so this procedure generates a 26 × 26 × 1 output feature map, where each element of the output feature map is the sum of the element-wise multiplication of the filter and the patch of the input feature map it overlaps
2.4 Illustration of convolving a DF × DF × M input feature map with a standard convolutional layer. This convolutional layer contains N convolution filters, each of dimension DK × DK × M
2.5 Illustration of convolving a DF × DF × M input feature map with a depthwise convolutional layer. The depthwise convolutional layer contains M convolutional filters, each of dimension DK × DK × 1
2.6 Illustration of convolving a DG × DG × M input feature map with a pointwise convolutional layer. The pointwise convolutional layer contains N convolutional filters, each of dimension 1 × 1 × M
2.7 Illustration of max pooling operations
3.1 Building blocks of ResNet [2]
3.2 Building blocks of DenseNet [8]
3.3 Illustration of a typical dense block [8] constructed using 5 bottleneck blocks. Each bottleneck block is illustrated by a single blue node. Any two nodes within the same dense block are directly connected
3.4 A deep DenseNet [8] with two dense blocks. The Conv1x1 layer and pooling layer between two dense blocks are applied to downsample the feature map
3.5 Building block of MobileNetV1 [3]
3.6 Building blocks of MobileNetV2 [4]
3.7 Building blocks of MobileNet-Dense
4.1 Flowchart of the proposed automatic DR classification system
4.2 Distribution of DR label values in the EyePACS training set
4.3 Label correlation in the EyePACS training set
4.4 Random sample images from the EyePACS dataset [5]
4.5 Underexposed/normal/overexposed images. The small red/green/blue figures under each retinal image are the histograms of the red/green/blue color channels
4.6 Illustration of the original image and the augmented image using the Fancy PCA algorithm
5.1 Confusion matrices of: (a) MobileNetV2, (b) MobileNet-Dense, (c) model ensemble (2+1)
5.2 Random sample images from the Messidor database [9]
5.3 Distribution of DR grades in the Messidor database
5.4 ROC curves for the Normal/Abnormal task
5.5 ROC curves for the Referable/Non-Referable task

List of Abbreviations

ANN    Artificial Neural Network
Acc    Accuracy
AUC    Area Under the ROC Curve
CNN    Convolutional Neural Network
Conv   Convolutional Layer
DM     Diabetes Mellitus
DR     Diabetic Retinopathy
EDA    Exploratory Data Analysis
FCNN   Fully Connected Neural Network
FN     False Negative
FP     False Positive
MAdds  Multiply-Adds
MSE    Mean Squared Error
NPDR   Non-Proliferative Diabetic Retinopathy
PCA    Principal Component Analysis
PDR    Proliferative Diabetic Retinopathy
QWK    Quadratic Weighted Kappa
ReLU   Rectified Linear Unit
ROC    Receiver Operating Characteristic Curve
SD     Standard Deviation
TN     True Negative
TP     True Positive

Notation

$A$          Matrix
$a_{i,j}$    The $(i,j)$th entry of matrix $A$
$\mathbf{a}$ Vector
$\mathbf{1}$ All-one column vector
$I$          Identity matrix
$(\cdot)^T$  Transpose
$*$          Convolution operation
$\star$      Cross-correlation operation
var(·)       Variance operator
Cov(·)       Covariance operator
$\sigma(\cdot)$ Non-linear activation function
$\odot$      Hadamard product

Acknowledgements

I would like to express my sincere gratitude to my supervisor, Professor Cyril Leung, for his immeasurable support and guidance throughout my research studies. His patience and guidance helped me overcome challenges and finish this thesis.
Without Professor Leung's guidance, this thesis would not have been possible.

This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada under Grant RGPIN 1731-2013, the UBC Faculty of Applied Science, the UBC PMC-Sierra Professorship in Networking and Communications, and the National Research Foundation, Prime Minister's Office, Singapore, under its IDM Futures Funding Initiative.

I would like to thank the EyePACS and Messidor program partners for providing the retinal image datasets.

I would like to thank my co-supervisor, Professor Chunyan Miao (Nanyang Technological University), for her guidance and assistance during my studies in Singapore. I would also like to thank all colleagues who helped me with my research.

Most importantly, none of this would have been possible without the love, encouragement and patience of my parents. I am deeply grateful to my parents, who encouraged me to pursue further education abroad. Without my parents, I would not be who I am today.

Dedication

To my parents and families.

Chapter 1
Introduction

This chapter provides a brief introduction to diabetic retinopathy (DR). In Section 1.1, we discuss DR, the importance of diagnosing it and the diagnosis rules. Section 1.2 reviews previous works on the automatic diagnosis of DR using machine learning approaches. The motivations of this thesis are discussed in Section 1.3 and its contributions are outlined in Section 1.4. The organization of the thesis is given in Section 1.5.

1.1 An Overview of Diabetic Retinopathy

1.1.1 Diabetes Mellitus and Diabetic Retinopathy

Diabetes Mellitus (DM) is a chronic disease characterized by consistently high blood glucose levels over a long-term period [10].
If blood glucose levels are not controlled, many long-term complications may occur, such as diabetic retinopathy (DR), diabetic foot or diabetic kidney disease [11].

DR is a chronic and progressive diabetic complication which damages the retina. Globally, DR is expected to affect 191 million people by 2030 [12]. DR is caused by lasting harm to the retinal vessels from high blood glucose levels, which can block or damage the tiny retinal blood vessels that nourish the retina. In response, the human body attempts to grow new blood vessels in the eye to maintain the nourishment. The new blood vessels are weak, which makes them likely to leak and bleed [13]. As a result, patients may experience progressive vision disorders, from blurred vision to vision loss [14].

1.1.2 Diabetic Retinopathy Classification Rules

According to [14], there are two stages of DR: Non-Proliferative Diabetic Retinopathy (NPDR) and Proliferative Diabetic Retinopathy (PDR). NPDR can be further categorized as mild, moderate or severe [15]. NPDR is the early stage of DR, during which the retinal arteries become weak and small red dots such as microaneurysms, or even hemorrhages, can be found in retinal images. PDR is a condition in which the retina lacks oxygen, and spots appear on the retina as a result of the circulatory system attempting to maintain the delivery of oxygen.

The classification rules for DR are given in Table 1.1. A sample retinal image with manually annotated common DR lesions [6] is shown in Figure 1.1. The typical lesions of DR are briefly discussed below:

• Hard exudates: Hard exudates are one of the constellations of retinal lesions that define DR. They usually appear in retinal images as tiny yellow-white flecks with sharp edges and varying sizes [16].

• Soft exudates: Soft exudates, also known as cotton wool spots, appear as white patches with blurred and hazy edges [17].
Exudates, including soft exudates and hard exudates, are among the most common early lesions of DR [18].

• Microaneurysms: Microaneurysms are the earliest clinically visible signs of NPDR and are caused by dilatations of thin blood vessels. They usually appear as small red dots with sharp edges (20 to 200 microns) in clusters [19].

Figure 1.1: Annotated results of an image with DR [6].

• Hemorrhages: Retinal hemorrhages are bleeding spots in the retina. They appear as blots in the retinal fundus image, and their shapes vary [17].

1.2 Related Work

In this section, we review and discuss some previous works on the automatic detection of DR.

Currently, DR is mostly diagnosed manually by inspecting retinal images. The process is time-consuming and challenging since some lesions in the retinal image are very tiny or subtle, such as the microaneurysms illustrated in Figure 1.1. In order to improve the efficiency and accuracy of DR classification, many automatic or semi-automatic methods using computer vision and machine learning algorithms have been
proposed.

Table 1.1: DR grading rules [1]

Grade 0 (No apparent retinopathy): No visible sign of abnormalities.
Grade 1 (Mild NPDR): Presence of microaneurysms only.
Grade 2 (Moderate NPDR): More than just microaneurysms, but less than severe NPDR.
Grade 3 (Severe NPDR): Any of the following: (1) more than 20 intraretinal hemorrhages; (2) venous beading; (3) intraretinal microvascular abnormalities; (4) no signs of PDR.
Grade 4 (PDR): Either or both of the following: (1) neovascularization; (2) vitreous/preretinal hemorrhage.

Generally, automatic DR screening algorithms have evolved from traditional computer vision techniques, which combine manually designed feature extraction algorithms with traditional classification algorithms, to end-to-end deep learning algorithms.

1.2.1 DR Screening Using Traditional Machine Learning Algorithms

Before the development of deep learning algorithms, especially deep neural networks, a feature extraction step was needed for general computer vision tasks. Features, in this case, are distinguishing and significant small image patches. Many feature extraction algorithms were proposed in the 1990s, such as SIFT [20] and SURF [21], which have been widely applied to object recognition [22] and medical image retrieval [23]. Following the feature extraction step, traditional machine learning classification algorithms, such as support vector machines, logistic regression or decision trees, are trained on the extracted features to classify the image. This pipeline was applied to almost all traditional computer vision tasks, including image classification, image segmentation, etc.

Automatic DR detection based on traditional computer vision techniques follows the same pipeline. First, specific features are extracted from the fundus images using one or more manually designed feature extraction algorithms. Classifiers are then trained on the extracted features to classify the DR stages.
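The extract-features-then-classify structure can be sketched as follows. This is a minimal illustration only, not the code of any cited system: a crude per-channel color histogram stands in for engineered descriptors such as SIFT/SURF responses, a nearest-centroid rule stands in for the SVM or logistic-regression classifiers mentioned above, and the "images" are synthetic arrays.

```python
import numpy as np

def histogram_feature(image, bins=16):
    """Hand-crafted feature: normalized per-channel intensity histogram.
    A stand-in for engineered descriptors, purely for illustration."""
    feats = []
    for c in range(image.shape[-1]):
        hist, _ = np.histogram(image[..., c], bins=bins, range=(0, 256))
        feats.append(hist / hist.sum())           # normalize per channel
    return np.concatenate(feats)

def fit_centroids(features, labels):
    """'Train' a nearest-centroid classifier on the extracted features."""
    return {k: features[labels == k].mean(axis=0) for k in np.unique(labels)}

def predict(centroids, feature):
    """Assign the class whose feature centroid is closest."""
    return min(centroids, key=lambda k: np.linalg.norm(feature - centroids[k]))

# Toy data: dark "class 0" images vs. bright "class 1" images.
rng = np.random.default_rng(0)
dark = rng.integers(0, 80, size=(10, 32, 32, 3))
light = rng.integers(150, 256, size=(10, 32, 32, 3))
X = np.array([histogram_feature(im) for im in np.concatenate([dark, light])])
y = np.array([0] * 10 + [1] * 10)
model = fit_centroids(X, y)
print(predict(model, histogram_feature(light[0])))  # → 1
```

The key point of the pipeline is visible here: all of the discriminative power lives in `histogram_feature`; the learning step only optimizes over whatever the human-designed feature exposes.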
Several such works are briefly reviewed below.

A multilayer neural network model was trained on features extracted by recursive region growing segmentation algorithms to perform a binary DR classification task [24]. The sensitivity and specificity of the proposed system are 80.2% and 70.7% respectively.

Recursive region growing segmentation algorithms were also adopted in [25] to extract DR visual features such as exudates, hemorrhages and microaneurysms. The features were then used for Healthy/DR binary classification. The sensitivity and specificity of the proposed system are 74.8% and 82.7% respectively.

Lee et al. [26] introduced methods for automatically grading the severity of three early DR lesions, namely hemorrhages, hard exudates and cotton-wool spots. An algorithm is trained to classify NPDR based on the absence of these three types of lesions. The overall accuracy of the proposed system is 81.7%.

In [27], a decision support system was proposed for the early detection of DR (the presence of microaneurysms) using Bayes optimality criteria. The method was able to identify the early stage of DR with a sensitivity of 100% and a specificity of 67%.

The automatic DR classification methods discussed in this section use manually designed features. The performance of such methods is highly dependent on the feature extraction process; machine learning algorithms are only used for numerical optimization based on the human-designed features. Thus, domain knowledge
of the DR disease plays a crucial role in building effective DR classification models. The limitation of applying conventional computer vision algorithms to DR lesion detection or DR classification is that the manually designed features are often over-specified or incomplete, and require a long time and much experience to design and validate.

1.2.2 DR Screening Using Deep Learning Algorithms

Instead of manually designing features and feeding them to a classifier for DR screening, many researchers are now building end-to-end deep learning models which learn all the needed features automatically. Several works which use deep neural network models to detect DR automatically are briefly reviewed below.

A 13-layer deep convolutional neural network (CNN) for screening DR was discussed in [29]. Several data augmentation techniques (i.e., image rotation, image flipping and image shifting) were applied to mitigate the class imbalance problem. Five thousand images from the EyePACS DR training set [5] were retained as the test set, and the rest of the images were used for training the proposed model. The trained network achieves 95% specificity but only 30% sensitivity on the test set, which indicates that the proposed model is highly biased toward the negative (healthy) case.

In [30], a residual neural network [2] was used to classify the retinal images, with the quadratic weighted kappa (QWK) [31] adopted as the evaluation metric. The proposed system achieves a QWK score of 0.51 on the EyePACS DR test set.

A deeply supervised ResNet was proposed in [32] to automatically classify the grade of DR. In the proposed architecture, three sets of additional side-output layers were added to the intermediate layers of the ResNet to provide additional regularization during the training process. 20% of the images from the EyePACS DR training dataset were retained as the test set. The results show that by leveraging the predictions of
intermediate supervised layers, the ResNet can focus on multi-scale learning, which leads to improved performance compared to a standard ResNet without side-output layers. The QWK score, specificity, sensitivity and accuracy of the proposed system on the test set are 0.73, 94%, 67% and 81% respectively.

In [33], a non-local means denoising method was applied to the fundus images to remove noise from the EyePACS DR dataset. AlexNet [7] and GoogleNet [34] were used to classify the pre-processed retinal images. An AUC score of 0.78 is achieved by the GoogleNet model, compared to an AUC score of 0.68 achieved by AlexNet.

In [35], a computationally efficient CNN model, the MobileNet model [3], was used for binary DR screening (DR vs. healthy images) on the EyePACS DR dataset. An accuracy of 0.73 was achieved by the proposed method. Considering the severe class imbalance of this dataset (73.4% of the images are healthy and 26.6% contain DR), the accuracy score alone may not be appropriate for demonstrating the effectiveness of the proposed model: as an extreme example, a trivial classifier which always predicts the healthy class would also achieve an accuracy of 0.73 on the EyePACS DR dataset.

In [36], GoogleNet and VGGNet were modified into two corresponding networks, CKML (Combined Kernels with Multiple Losses Network) and VNXK (VGGNet with Extra Kernel). The retinal images were converted to a hybrid LGI color space, and models trained on the LGI retinal images were compared with models trained on RGB retinal images. The results show that both CKML and VNXK achieve higher accuracy with LGI images. However, the proposed models are also highly biased toward the healthy samples: the per-class accuracies from class 0 (healthy) to class 4 (proliferative DR) are 97.6%, 11.9%, 57.9%, 33.2% and 36.8% respectively.

In [37], a novel Zoom-in-Net model was proposed. It mimics the zoom-in process
of an ophthalmologist examining retinal images. Zoom-in-Net consists of three parts: Inception-ResNet is adopted as the main network (M-Net) for DR classification, while two small CNNs, the Attention Network (A-Net) and the Crop Network (C-Net), are used for attention localization in the DR images. The single Zoom-in-Net achieves a QWK score of 0.849 on the entire EyePACS DR test set. With an ensemble of three Zoom-in-Net models, the QWK score improves to 0.854.

1.2.3 Discussions

There are few direct comparisons of the discussed deep learning methods, although all of them used the EyePACS DR dataset. Possible reasons include:

1. Different classification tasks were performed. For example, a binary DR classification (healthy vs. DR) task was performed in [33, 35], while DR classification with 5 classes was considered in [36, 37, 32].

2. Different numbers of retinal images were retained as the test set. For example, 20% of the retinal images in the EyePACS DR test set were used to report the model performance in [32, 29], while the entire EyePACS DR test set was used in [37].

3. Different classification metrics were used to assess performance. For example, the accuracy (Acc) score was used in [38, 33], while the quadratic weighted kappa (QWK) score was reported in [30, 32, 37].

1.3 Motivations

DR is one of the main causes of vision loss and blindness, and manually diagnosing DR from retinal images is time-consuming and challenging. To this end, several works have applied deep CNN models to diagnose DR automatically. However, the proposed methods employed very deep CNN models (e.g., the ResNet-based method [32], the InceptionNet-based method [33] and the Inception-ResNet-based model [37]) which require vast computational resources. Research on DR screening using computationally efficient CNN models is therefore of great practical significance.
This thesis is therefore focused on applying end-to-end, accurate and computationally efficient CNN models to automatic DR classification.

1.4 Contributions

The main contributions of this thesis are summarized as follows:

• In Chapter 3, a computationally efficient CNN model, MobileNet-Dense, is proposed. Two benchmark datasets, CIFAR-10 and CIFAR-100, are used to demonstrate the effectiveness of the proposed model, and its accuracy (Acc) scores are compared with those of the MobileNetV2 model [4]. The MobileNet-Dense model and the MobileNetV2 model achieve similar Acc scores on both CIFAR-10 and CIFAR-100, but the proposed MobileNet-Dense uses 62% fewer parameters and 36% fewer multiply-adds (MAdds).

• In Chapter 4, a computationally efficient automatic DR classification system is proposed, based on an ensemble of the proposed MobileNet-Dense model and the MobileNetV2 model.

• In Chapter 5, the performance of the proposed DR classification system is evaluated using the EyePACS DR dataset and the Messidor database. For the EyePACS DR dataset, the top-ranked method [39] in the Kaggle DR challenge is chosen as the benchmark. Our system achieves a QWK score of 0.852, compared to a QWK score of 0.849 achieved by the benchmark method, while using 32% fewer parameters and 73% fewer MAdds. For the Messidor database, the proposed system is evaluated on the Normal/Abnormal and the Referable/Non-Referable screening tasks. On the Normal/Abnormal task, our system achieves an AUC of 0.962 and an Acc of 0.917, compared to an AUC of 0.921 and an Acc of 0.905 achieved by the state-of-the-art method [37].
On the Referable/Non-Referable task, our system achieves an AUC of 0.970 and an Acc of 0.924, compared to an AUC of 0.957 and an Acc of 0.911 achieved by the state-of-the-art method [37].

1.5 Organization of the Thesis

The remainder of the thesis is organized as follows.

In Chapter 2, we introduce some basic concepts of deep learning methods which are directly related to the work described in later chapters. Specifically, we first briefly review the basic concepts and core layers of conventional fully connected neural networks (FCNNs). We then recall the basic concepts and core layers of convolutional neural networks (CNNs), including the standard convolutional layer, the depthwise separable convolutional layer, the pooling layer and the fully connected layer. The typical workflow of applying a deep learning model to a real-world task is summarized in the last section.

In Chapter 3, we propose the computationally efficient MobileNet-Dense model and demonstrate its effectiveness using two benchmark datasets, CIFAR-10 and CIFAR-100.

In Chapter 4, we propose a computationally efficient DR classification system based on an ensemble of the proposed MobileNet-Dense model and the MobileNetV2 model. The system is trained using the publicly available EyePACS DR dataset, and the workflow of the proposed system is described in detail.

In Chapter 5, we demonstrate the effectiveness and efficiency of the proposed automatic DR classification system using the EyePACS DR dataset and the Messidor database.

In Chapter 6, the main contributions of this thesis are summarized and some possible future extensions are outlined.

Chapter 2
Review of Neural Network Models

The primary approach in this research project is to apply computationally efficient CNNs to the automatic diagnosis of DR. This chapter provides some background on neural network models, especially CNNs and computationally efficient CNNs.
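Much of the efficiency discussed in this chapter comes from replacing standard convolutions with depthwise separable convolutions (reviewed in Section 2.2.2). The savings can be previewed with the usual cost model for a stride-1, same-padding layer, as used in the MobileNet papers [3, 4]; the function name `conv_costs` and the example layer sizes below are illustrative only.

```python
def conv_costs(DF, DK, M, N):
    """MAdds and parameter counts for one standard convolutional layer
    versus its depthwise separable factorization (stride 1, same padding).
    DF: spatial size of the feature map, DK: kernel size,
    M: input channels, N: output channels."""
    std_madds = DK * DK * M * N * DF * DF
    std_params = DK * DK * M * N
    # Depthwise step (M filters of DK x DK x 1) + pointwise step (N filters of 1 x 1 x M).
    sep_madds = DK * DK * M * DF * DF + M * N * DF * DF
    sep_params = DK * DK * M + M * N
    return std_madds, sep_madds, std_params, sep_params

# Example layer: 32 x 32 feature map, 3 x 3 kernels, 64 -> 128 channels.
std_madds, sep_madds, std_p, sep_p = conv_costs(DF=32, DK=3, M=64, N=128)
print(f"MAdds ratio:  {sep_madds / std_madds:.3f}")  # analytically 1/N + 1/DK^2
print(f"Params ratio: {sep_p / std_p:.3f}")
```

For 3 × 3 kernels the MAdds ratio is 1/N + 1/9, i.e., roughly an 8–9× reduction for any reasonably wide layer, which is the arithmetic behind the parameter and MAdds savings quoted in the contributions above.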
In Section 2.1, we review the conventional fully connected neural network, which is the foundation of CNNs. In Section 2.2, we review the core layers of CNNs, including the standard convolutional layer, the depthwise separable convolutional layer, the max pooling layer and the fully connected layer. The typical workflow of applying neural network models to a real-world task is summarized in Section 2.3.

2.1 Fully Connected Neural Networks

A fully connected neural network (FCNN) is a neural network model inspired by biological neural networks [40].

2.1.1 The Architecture of FCNN

The structure of a simple FCNN with two hidden layers is shown in Figure 2.1. There are three types of layers in a typical FCNN: the input layer, the hidden layers and the output layer. The input layer is used to take in the input information. The hidden layers are used to perform a non-linear combination of the information from the previous (input or hidden) layer. The output layer performs a weighted sum on the results of the last hidden layer. All layers are composed of neurons (also known as nodes). Nodes in two adjacent layers are fully connected by edges. Each edge represents a weight value learned during the training process.

Figure 2.1: The structure of an FCNN with two hidden layers.

2.1.2 Training the Neural Network Model

Training a neural network model consists of executing the feed forward process and the backpropagation process for all training samples iteratively.

The Feed Forward Process

The feed forward process refers to feeding the input vector into a layer or a neural network model to calculate the output vector.

Let $w^l_{j,k}$ denote the weight value between the $k$th node in the $(l-1)$th layer and the $j$th node in the $l$th layer, $b^l_j$ denote the bias value of the $j$th node in the $l$th layer, and $z^l_j$ denote the output value (before activation) of the $j$th node in the $l$th layer. Let $\sigma(\cdot)$ denote the non-linear activation function applied on $z^l_j$, and let $a^l_j$ denote the activation of $z^l_j$. There are several types of activation functions, e.g., Sigmoid $f(x) = \frac{1}{1+\exp(-x)}$, Tanh $f(x) = \frac{2}{1+\exp(-2x)} - 1$ and ReLU $f(x) = \max(0, x)$. For training deep neural networks, the ReLU (Rectified Linear Unit) activation function is preferred since it does not have the vanishing gradient problem [41] and it leads to more sparsity, which accelerates the training process [42]. The feed forward process from the $(l-1)$th layer to the $l$th layer is given by:

\[ a^l_j = \sigma(z^l_j) = \sigma\Big(\sum_k w^l_{j,k} a^{l-1}_k + b^l_j\Big) \tag{2.1} \]

We can rewrite Eq. (2.1) in vector form as follows:

\[ a^l = \sigma(z^l) = \sigma(W^l a^{l-1} + b^l) \tag{2.2} \]

where $W^l$ denotes the weight matrix from the $(l-1)$th layer to the $l$th layer, and $a^l$ and $b^l$ are column vectors which denote the activation and bias of the $l$th layer respectively. Suppose there are $L$ layers in the neural network model. Given an input vector $a^0$, we can calculate the output of the $l$th ($0 < l < L$) layer $a^l$ using Eq. (2.2).

The Backpropagation Process

The backpropagation algorithm [43] is used for updating the learnable parameters (i.e., weights and biases) during the backpropagation process, in such a way as to minimize the loss function. Loss functions are used to measure the error (disagreement) between the predicted results and the ground truth labels, and the error is propagated backward through the neural network's layers (i.e., from the output layer to the input layer). There are many commonly used loss functions, e.g., the categorical cross-entropy loss and the mean squared error (MSE). The categorical cross-entropy loss is often used for classification tasks and is given by:

\[ L = -\frac{1}{N}\sum_i \sum_j t_{i,j} \log(p_{i,j}) \tag{2.3} \]

where $p_{i,j}$ denotes the predicted probability value of sample $i$ to be in class $j$ given by the neural network model, $t_{i,j}$ denotes the ground truth probability value of sample $i$ to be in class $j$ (i.e., $t_i = [0, \ldots, 1, \ldots, 0]$ contains a single 1 at the $j$th position if sample $i$ is labeled as class $j$) and $N$ denotes the total number of samples.

The MSE loss is often used for regression tasks and is given by:

\[ L = \frac{1}{N}\sum_{i=1}^{N}(o_i - t_i)^2 \tag{2.4} \]

where $N$ denotes the total number of samples, $o_i$ denotes the predicted value of sample $i$ and $t_i$ denotes the target value of sample $i$.

After an appropriate loss function $L$ is chosen, the partial derivatives $\frac{\partial L}{\partial W^l}$ and $\frac{\partial L}{\partial b^l}$ are calculated using the chain rule to update the corresponding weights $W^l$ and biases $b^l$. To explain the backpropagation process, let $e^l$ denote the error of the output of the $l$th layer. Then $e^l$ is given by:

\[ e^l = \frac{\partial L}{\partial z^l} = \frac{\partial L}{\partial a^l} \frac{\partial a^l}{\partial z^l} = \frac{\partial L}{\partial a^l} \odot \sigma'(z^l) \tag{2.5} \]

where $\odot$ denotes the Hadamard product [44]. Using the chain rule, $e^l$ is calculated from $e^{l+1}$, as shown below:

\[ e^l = \frac{\partial L}{\partial a^l} \odot \sigma'(z^l) = \frac{\partial L}{\partial z^{l+1}} \frac{\partial z^{l+1}}{\partial a^l} \odot \sigma'(z^l) = \frac{\partial L}{\partial z^{l+1}} \frac{\partial (W^{l+1} a^l + b^{l+1})}{\partial a^l} \odot \sigma'(z^l) = \big((W^{l+1})^T e^{l+1}\big) \odot \sigma'(z^l) \tag{2.6} \]

Suppose there are $L$ layers in the neural network model. Given the error of the output layer $e^L$ (measured using the loss function), the error of the $l$th layer $e^l$ can be calculated using Eq. (2.6).

The partial derivatives of the loss function with respect to $W^l$ and $b^l$ are calculated using $e^l$, as shown below:

\[ \frac{\partial L}{\partial W^l} = \frac{\partial L}{\partial z^l} \frac{\partial z^l}{\partial W^l} = \frac{\partial L}{\partial z^l} \frac{\partial (W^l a^{l-1} + b^l)}{\partial W^l} = e^l (a^{l-1})^T \tag{2.7} \]

\[ \frac{\partial L}{\partial b^l} = \frac{\partial L}{\partial z^l} \frac{\partial z^l}{\partial b^l} = \frac{\partial L}{\partial z^l} \frac{\partial (W^l a^{l-1} + b^l)}{\partial b^l} = e^l \tag{2.8} \]

The weights $W^l$ and biases $b^l$ are updated using the corresponding partial derivatives:

\[ W^l = W^l - \alpha \frac{\partial L}{\partial W^l} \tag{2.9} \]

\[ b^l = b^l - \alpha \frac{\partial L}{\partial b^l} \tag{2.10} \]

where $\alpha$ is the step size (also known as the learning rate). If $\alpha$ is too large, we may overshoot the optimal value. On the other hand, if $\alpha$ is too small, it will take a long time to reach the optimal value, or the loss may converge to a local minimum or saddle point instead of the global minimum [45].
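The feed forward and backpropagation equations above can be made concrete with a small numerical sketch (not part of the thesis; the layer sizes and the 0.5-scaled MSE loss are illustrative assumptions). It implements Eqs. (2.2) and (2.5)–(2.8) for a two-layer FCNN with sigmoid activations and checks one backpropagated gradient against a finite-difference estimate:

```python
import numpy as np

# Minimal sketch of Eqs. (2.2) and (2.5)-(2.8) for a 2-layer FCNN with
# sigmoid activations and the loss L = 0.5 * sum((a^L - t)^2).
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, b, a0):
    """Feed forward (Eq. 2.2); returns per-layer pre-activations and activations."""
    a, zs, acts = a0, [], [a0]
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl            # z^l = W^l a^{l-1} + b^l
        a = sigmoid(z)             # a^l = sigma(z^l)
        zs.append(z)
        acts.append(a)
    return zs, acts

def backward(W, zs, acts, t):
    """Backpropagation: e^L from the loss, then Eq. (2.6) applied backwards."""
    sig_prime = lambda z: sigmoid(z) * (1 - sigmoid(z))
    e = (acts[-1] - t) * sig_prime(zs[-1])            # e^L (Eq. 2.5)
    dW, db = [], []
    for l in reversed(range(len(W))):
        dW.insert(0, np.outer(e, acts[l]))            # Eq. (2.7): e^l (a^{l-1})^T
        db.insert(0, e)                               # Eq. (2.8)
        if l > 0:
            e = (W[l].T @ e) * sig_prime(zs[l - 1])   # Eq. (2.6)
    return dW, db

W = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
b = [rng.normal(size=4), rng.normal(size=2)]
x, t = rng.normal(size=3), np.array([0.0, 1.0])

zs, acts = forward(W, b, x)
dW, db = backward(W, zs, acts, t)

# Check dW[0][0, 0] against a central finite-difference estimate of the loss.
loss = lambda: 0.5 * np.sum((forward(W, b, x)[1][-1] - t) ** 2)
eps = 1e-6
W[0][0, 0] += eps; l_plus = loss()
W[0][0, 0] -= 2 * eps; l_minus = loss()
W[0][0, 0] += eps
assert abs((l_plus - l_minus) / (2 * eps) - dW[0][0, 0]) < 1e-6
```

The final assertion is exactly the gradient check implied by Eqs. (2.5)–(2.8): the analytically backpropagated derivative agrees with a numerical perturbation of the loss.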
Thus the optimal learning rate α needs to be carefully searched for.

The optimal learnable parameters are obtained by iteratively executing the feed forward process and the backpropagation process until the loss converges. The number of epochs describes the number of times the model has been trained using the entire training set. In other words, once we have updated the learnable parameters using all the training samples, an epoch has been completed. In general, training a neural network model requires multiple epochs. Ideally, we should calculate the mean loss over all training samples and update the learnable parameters once using the gradients of the mean loss. In practice, the datasets we use may contain thousands or millions of samples (e.g., ImageNet has more than 1 million training images), and updating the learnable parameters with the mean loss of all training samples is computationally inefficient. A commonly used strategy to solve this problem is to estimate the gradient and update the parameters using a small batch of training samples at a time. In other words, the parameters of the neural network receive many approximate updates instead of a single more accurate update per epoch. For a concrete example, the learnable parameters will be updated 10 times per epoch if we have a training set with 1000 samples and we select the mini-batch size to be 100. This mini-batch strategy works well in most practical applications. The pseudocode of the backpropagation algorithm with mini-batches is listed in Algorithm 2.1.

Algorithm 2.1 The Backpropagation Algorithm
1. Sample a mini-batch of data from the training dataset, without replacement.
2. Forward propagate the mini-batch samples through the neural network.
3. Estimate the partial derivatives with backpropagation.
4. Update the learnable parameters using the partial derivatives.
5. Repeat step 1 to step 4 until all the training samples are sampled (one epoch finished).
6. Repeat step 5 until the loss value converges.

2.2 Convolutional Neural Networks

The convolutional neural network (CNN) is one of the most widely adopted deep neural network models in the current research literature. CNNs have been shown to perform very well in many computer vision tasks such as face recognition [46] and object detection [47].

Figure 2.2 illustrates the architecture of AlexNet [7]. AlexNet is a typical CNN model which shed light on the design of CNN architectures. AlexNet is the winner of the ImageNet Large Scale Visual Recognition Competition 2012 (the ImageNet Competition is a benchmark competition for image recognition; its dataset contains over 14,000,000 images categorized into 1000 classes). AlexNet was the first CNN used in this competition and yielded a reduction in the error rate from 25.8% to 16.4%. The AlexNet architecture contains three types of layers: standard convolutional layers, max pooling layers, and fully connected layers.

Figure 2.2: The structure of AlexNet [7].

2.2.1 Standard Convolutional Layer

The convolutional layer is the most important layer in a CNN model, and it is trained using the backpropagation algorithm to extract specific features from the input vector. The output of the convolutional layer, also known as the output feature map, is generated by convolving multiple filters over the input feature map. For convolution operations in CNNs, we often use convolutions over two axes (width and height). The convolution of a two-dimensional image I and a two-dimensional kernel K is given as follows [48]:

\[ S(i, j) = (I * K)(i, j) = \sum_m \sum_n I_{m,n} K_{i-m,j-n} \tag{2.11} \]

where m and n index the width and height of the convolution filter respectively. Normally we apply kernels with symmetric size (i.e., m = n). Using the commutative property of the convolution operation, we can rewrite Eq.
(2.11) as:

\[ S(i, j) = (K * I)(i, j) = \sum_m \sum_n I_{i-m,j-n} K_{m,n} \tag{2.12} \]

In practice, many neural network libraries (e.g., TensorFlow [49]) implement a related function, called the cross-correlation function, instead of implementing the convolution function. The convolutional layer is constructed using as many cross-correlation filters as desired. The definition of the cross-correlation function is given as follows [48]:

\[ S(i, j) = (K \star I)(i, j) = \sum_m \sum_n I_{i+m,j+n} K_{m,n} \tag{2.13} \]

A CNN with cross-correlation kernels and a CNN with convolution kernels will give the same results, since the backpropagation algorithm will learn the appropriate values of the kernel in the appropriate places. Given the definitions of the convolution and the cross-correlation, a kernel implementation based on convolution will learn a kernel which is flipped relative to a kernel implementation based on cross-correlation [48]. In this thesis, the cross-correlation operation is referred to as the convolution operation.

The convolution operation preserves the spatial information between pixels by learning features using only small patches of the input feature map. Figure 2.3 illustrates the process of convolving a 32 × 32 × 3 input feature map with a 7 × 7 × 3 convolution filter (also known as a kernel). The kernel slides across the input feature map over two axes (width and height). At each location, the element-wise multiplication of the kernel values and the input elements it overlaps is computed, and the results are summed up to obtain the value at the current location of the output feature map. More generally, a convolutional layer contains multiple convolution filters; the input feature map is convolved with each filter independently, and thus each filter produces its own output feature map. The output of the convolutional layer is the stack of those output feature maps. The standard convolutional layer is illustrated in Figure 2.4.
We use the same notation as in [3] to describe the convolutional layer. To be more specific, DF, DK and DG denote the spatial width (height) of the input feature map, the kernel and the output feature map respectively, while M and N denote the depth of the input feature map and the output feature map respectively. In the standard convolutional layer, the DF × DF × M input feature map is convolved with N independent DK × DK × M convolution filters. Each filter produces a DG × DG × 1 output feature map; the output feature map of this layer is the stack of the N output feature maps, and it has a dimension of DG × DG × N, where DG is given by:

\[ D_G = \frac{D_F - D_K + 2P}{S} + 1 \tag{2.14} \]

where S denotes the stride of the convolution filters and P denotes the zero padding size (i.e., zero padding means padding the input feature map with zeros around the border). The depth of the output feature map is determined by the number of convolution filters applied in the convolutional layer. Taking the first convolutional layer shown in Figure 2.2 as an example, the input feature map of the first convolutional layer is a 227 × 227 × 3 image. The first convolutional layer contains 96 filters with a stride of 4, and the dimension of each filter is 11 × 11 × 3. Assuming no zero padding is applied in this layer, the spatial width (or height) of the output feature map is DG = (227 − 11 + 2 × 0)/4 + 1 = 55. The depth of the output feature map is 96 since there are 96 convolution filters in this layer. The size of the output feature map of this layer is thus 55 × 55 × 96, as illustrated in Figure 2.2.

To measure the computational cost of a layer, a widely used metric is the number of multiply-adds [3, 4, 50, 8]. In this thesis, we use MAdds to denote the number of multiply-adds for simplicity.
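As a quick sanity check of Eq. (2.14), the output size can be computed for the two examples discussed above (a minimal sketch; the helper name is ours, not from the thesis):

```python
# Minimal sketch of Eq. (2.14): spatial output size of a convolutional layer,
# D_G = (D_F - D_K + 2P) / S + 1, assuming the division is exact.
def conv_output_size(d_f, d_k, stride=1, padding=0):
    return (d_f - d_k + 2 * padding) // stride + 1

# First convolutional layer of AlexNet: 227x227x3 input, 96 filters of
# size 11x11x3, stride 4, no zero padding -> 55x55 spatial output.
assert conv_output_size(227, 11, stride=4, padding=0) == 55
# Figure 2.3 example: 32x32x3 input, 7x7x3 filter, stride 1 -> 26x26 output.
assert conv_output_size(32, 7) == 26
```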
The MAdds of a convolutional layer is given in [3], as shown below:

\[ \text{MAdds} = C_{in} \times K^2 \times H_{out} \times W_{out} \times C_{out} \tag{2.15} \]

where K is the width (or height) of the convolution filters, C_in is the depth of the input feature map, and H_out, W_out and C_out are the height, width and depth of the output feature map respectively.

The MAdds of the typical convolutional layer shown in Figure 2.4 is calculated using Eq. (2.15), as shown below:

\[ \text{MAdds} = M \times D_K^2 \times D_G^2 \times N \tag{2.16} \]

The deployment of deep learning models in real-world applications is not only constrained by MAdds. Another important factor is the number of learnable parameters [3, 50, 8, 51]. In this thesis, we use Parameters to denote the number of learnable parameters for simplicity. In general, the strong representation ability of a CNN model comes from its millions of learnable parameters. The learnable parameters and the model structure need to be stored on disk and loaded into memory during the training and testing processes. Storing a typical CNN (ResNet [2], GoogleNet [34], VGGNet [52], etc.) needs more than 300MB of space. This may not be a problem for a high-end GPU (graphics processing unit), but it is unaffordable for resource-constrained devices such as mobile devices or Internet of Things (IoT) devices [53]. Given equivalent performance, a CNN architecture with fewer learnable parameters has three main advantages [51]:

• Less communication overhead for distributed training.
• Fewer data transfers when exporting updated models to clients.
• Feasible deployment on resource-constrained devices.

The overall Parameters of the standard convolutional layer shown in Figure 2.4 is given by:

\[ \text{Parameters} = D_K^2 \times M \times N \tag{2.17} \]

2.2.2 Depthwise Separable Convolutional Layer

The standard convolutional layer is widely applied in many modern CNN architectures (e.g., AlexNet, ResNet [2]), and CNN models are becoming deeper in order to obtain better performance [2, 34].
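Eqs. (2.16) and (2.17) can be sketched as a small helper (illustrative only; the function name is ours, and the example reuses the first AlexNet convolutional layer described earlier):

```python
# Sketch of Eqs. (2.16) and (2.17): cost of a standard convolutional layer
# with M input channels, N filters of size D_K x D_K x M, and a
# D_G x D_G x N output feature map. Bias terms are omitted, as in the text.
def standard_conv_cost(d_k, m, n, d_g):
    madds = m * d_k ** 2 * d_g ** 2 * n      # Eq. (2.16)
    params = d_k ** 2 * m * n                # Eq. (2.17)
    return madds, params

# First AlexNet convolutional layer: 96 filters of 11x11x3, 55x55x96 output.
madds, params = standard_conv_cost(d_k=11, m=3, n=96, d_g=55)
assert params == 11 * 11 * 3 * 96            # 34,848 weights
```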
On the other hand, the MAdds or Parameters of a very deep CNN (a CNN model with many standard convolutional layers) may become unaffordable. In [54], a computationally efficient convolutional layer, namely the depthwise separable convolutional layer, was proposed to tackle this problem. It has been shown that by replacing the standard convolutional layers in AlexNet with depthwise separable convolutional layers, the accuracy remains approximately the same while the training speed is significantly faster [54]. The difference between a standard convolutional layer and a depthwise separable convolutional layer is described below.

In the standard convolutional layer, each convolution filter spans the entire depth of the input feature map, and the weighted sum is carried out in a single step.

Figure 2.3: Illustration of convolving a single 7 × 7 × 3 filter over a 32 × 32 × 3 input feature map with a stride of 1 (shifting the filter one unit horizontally or vertically at a time). There are 26 × 26 spatial locations for a 7 × 7 × 3 filter to slide over a 32 × 32 × 3 input feature map; thus this procedure generates a 26 × 26 × 1 output feature map, where each element of the output feature map is the sum of the element-wise multiplication of the filter and the small patch of the input feature map it overlaps.

On the other hand, the depthwise separable convolutional layer decomposes the standard convolution into a depthwise convolution step and a pointwise convolution step, as described below:

1. Depthwise Convolution Step: In the depthwise convolution step, the input feature map is convolved with a depthwise convolutional layer, as illustrated in Figure 2.5. Suppose the input feature map has a dimension of DF × DF × M. The depthwise convolutional layer regards the DF × DF × M input feature map as M independent DF × DF × 1 input feature maps. Each DF × DF × 1 input feature map is then convolved with a DK × DK × 1 convolution filter to create a DG × DG × 1 output feature map.
This process generates M independent DG × DG × 1 output feature maps. We stack those output feature maps to form the output feature map of this layer. Thus the output feature map of this depthwise convolutional layer has a dimension of DG × DG × M. The MAdds of this depthwise convolutional layer is D_K² × D_G² × M. The Parameters of this layer is D_K × D_K × M.

Figure 2.4: Illustration of convolving a DF × DF × M input feature map with a standard convolutional layer. This convolutional layer contains N convolution filters. Each filter has a dimension of DK × DK × M.

Figure 2.5: Illustration of convolving a DF × DF × M input feature map with a depthwise convolutional layer. The depthwise convolutional layer contains M convolutional filters. Each filter has a dimension of DK × DK × 1.

2. Pointwise Convolution Step: In the pointwise convolution step, the output feature map generated by the depthwise convolution step is convolved with a pointwise convolutional layer, as illustrated in Figure 2.6. In this step, the output feature map generated by the depthwise convolutional layer is convolved with 1 × 1 × M convolution filters, called pointwise convolution filters. Each pointwise convolution filter generates a DG × DG × 1 output feature map. We apply N such pointwise filters to generate N output feature maps, and those output feature maps are stacked to form the output feature map of this pointwise convolutional layer. Thus the output feature map of this pointwise convolutional layer has a dimension of DG × DG × N. The MAdds of this layer is D_G² × M × N. The Parameters of this layer is M × N.

The total MAdds of the depthwise separable convolutional layer is the sum of the MAdds of the depthwise convolutional layer and the MAdds of the pointwise convolutional layer.
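As a numerical sanity check, the two-step cost just described can be compared against the standard-layer cost of Eq. (2.16) (a minimal sketch; the layer sizes chosen are arbitrary example values):

```python
# Sketch comparing the MAdds of a standard convolutional layer with those of
# a depthwise separable layer (depthwise step + pointwise step).
def standard_madds(d_k, m, n, d_g):
    return d_k ** 2 * d_g ** 2 * m * n       # Eq. (2.16)

def separable_madds(d_k, m, n, d_g):
    depthwise = d_k ** 2 * d_g ** 2 * m      # D_K^2 x D_G^2 x M
    pointwise = d_g ** 2 * m * n             # D_G^2 x M x N
    return depthwise + pointwise

d_k, m, n, d_g = 3, 64, 128, 56              # example layer dimensions
ratio = separable_madds(d_k, m, n, d_g) / standard_madds(d_k, m, n, d_g)
# The ratio equals 1/N + 1/D_K^2, roughly 1/9 for a 3x3 kernel.
assert abs(ratio - (1 / n + 1 / d_k ** 2)) < 1e-12
```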
The comparison of the MAdds of the standard convolutional layer illustrated in Figure 2.4 and the MAdds of the depthwise separable convolutional layer illustrated in Figure 2.5 and Figure 2.6 is summarized below:

Figure 2.6: Illustration of convolving a DG × DG × M input feature map with a pointwise convolutional layer. The pointwise convolutional layer contains N convolutional filters. Each filter has a dimension of 1 × 1 × M.

\[ \text{MAdds of the standard convolutional layer} : D_K^2 \times D_G^2 \times M \times N. \]
\[ \text{MAdds of the depthwise separable convolutional layer} : D_K^2 \times D_G^2 \times M + D_G^2 \times M \times N. \]
\[ \frac{\text{MAdds of the depthwise separable convolutional layer}}{\text{MAdds of the standard convolutional layer}} = \frac{1}{N} + \frac{1}{D_K^2}. \]

The Parameters of the depthwise separable convolutional layer is the sum of the Parameters of the depthwise convolutional layer and the Parameters of the pointwise convolutional layer. The comparison of the Parameters of the standard convolutional layer and the Parameters of the depthwise separable convolutional layer is summarized below:

\[ \text{Parameters of the standard convolutional layer} : D_K^2 \times M \times N. \]
\[ \text{Parameters of the depthwise separable convolutional layer} : M \times D_K^2 + M \times N. \]
\[ \frac{\text{Parameters of the depthwise separable convolutional layer}}{\text{Parameters of the standard convolutional layer}} = \frac{1}{N} + \frac{1}{D_K^2}. \]

It can be seen that the MAdds or Parameters of the depthwise separable convolutional layer is $\frac{1}{N} + \frac{1}{D_K^2}$ of those of the standard convolutional layer [3]. For modern CNNs, the depth of the output feature map N is normally much greater than $D_K^2$ (typically N > 32 and $D_K = 3$). Thus, the MAdds or Parameters of the depthwise separable convolutional layer is approximately $\frac{1}{D_K^2} = \frac{1}{9}$ of those of the standard convolutional layer.

2.2.3 Max Pooling Layer

The max pooling layer is another core layer of CNNs; it aims to reduce the width and height of the input feature map by max pooling operations (i.e., taking the max value from sub-regions of the input feature map).
It also reduces the over-fitting problem by providing a down-sampled feature map. Max pooling works by sliding a window across the input feature map and outputting the maximum value of the overlapped sub-region. Figure 2.7 illustrates the max pooling operation with a 2 × 2 × 1 pooling filter and a stride of 2 over a 4 × 4 × 1 input feature map.

There are three max pooling layers in the AlexNet architecture, as shown in Figure 2.2. Considering the first max pooling layer as an example, 3 × 3 max pooling filters with a stride of 2 are applied to the 55 × 55 × 96 feature map. The resulting output of this pooling layer is a 27 × 27 × 96 feature map. Note that the pooling layer does not change the depth of the feature map; it only downsamples the width and height of the input feature map.

Figure 2.7: Illustration of max pooling operations.

2.2.4 Fully Connected Layer

The fully connected layer provides a non-linear combination of the extracted features so that the output layer can use these features to classify the input vector into the corresponding class. The fully connected layers are usually placed after the last convolutional layer or pooling layer, as shown in Figure 2.2. These fully connected layers are the same as the fully connected layers of the typical FCNN discussed in Section 2.1.

The MAdds of a fully connected layer is given by:

\[ \text{MAdds} = I \times O \tag{2.18} \]

The number of parameters of a fully connected layer is given by:

\[ \text{Parameters} = (I + 1) \times O \tag{2.19} \]

where I is the input dimensionality and O is the output dimensionality in Eq. (2.18) and Eq. (2.19).

2.3 Typical Workflow of a Deep Learning Project

The basic concepts of neural networks were reviewed in the previous sections. In this section, we summarize the typical workflow of applying neural network models in a real-world application:

1. Data Acquisition: First, prepare a dataset for the application.
The dataset is then split into three parts, namely the training set, the validation set and the test set. The training set is used for training the model with the backpropagation algorithm. The validation set is used for hyperparameter optimization, and the test set is used for performance evaluation.

2. Exploratory Data Analysis: After we have obtained the dataset, exploratory data analysis (EDA) is performed in order to discover latent patterns such as imbalanced data distributions or features at different scales, to identify anomalies or outliers using statistical methods, and to illustrate data properties using graphical representations.

3. Data Preprocessing and Data Augmentation: After EDA, data preprocessing and data augmentation techniques are applied to the dataset. Preprocessing techniques are widely used for improving the convergence of the neural network model [55]. For image datasets, standardization (subtracting the mean and dividing by the standard deviation individually for each RGB channel) is commonly used. Data augmentation techniques are used to generate artificial training samples and expand the training dataset. They have been widely used for improving model performance and reducing overfitting, especially for training with a small or imbalanced dataset [56]. Commonly used image augmentation techniques include image cropping, image flipping, image rotation, etc.

4. Architecture Design: Next, we need to find a set of architectures which are suitable for the application. If we have limited computational resources or real-time prediction is needed, we may use computationally efficient CNN models for the task. On the other hand, we may use very deep CNN models to obtain good performance if adequate computational resources are available.

5. Hyperparameter Optimization and Training: Next, we need to choose a set of hyperparameters to train the model.
Hyperparameters are non-learnable parameters which need to be predefined before the training process begins. The hyperparameter optimization process consists of sampling hyperparameters from a manually specified subset of the hyperparameter space, training models with the different hyperparameter settings and evaluating those models on the validation set. The combination of hyperparameters which yields the optimal model is selected as the optimal hyperparameter set. The final model is the model trained with the optimal hyperparameter set.

6. Performance Evaluation: After the model is trained using the optimal hyperparameters, the performance of the final model is evaluated using the unseen test set.

7. Ensemble Learning (optional): Consistent performance improvements can be obtained by applying ensemble learning (i.e., ensemble learning refers to using multiple models to obtain better performance than any single model [57]). Commonly used ensemble learning techniques include Bagging (i.e., voting with equal weights), Stacking (i.e., training a meta-classifier to combine the predictions of several base-level models), etc.

Chapter 3
MobileNet-Dense: An Efficient Convolutional Neural Network

In this chapter, we propose and assess the performance of a novel CNN architecture, MobileNet-Dense. In Section 3.1, we review the related work. In Section 3.2, we introduce the MobileNet-Dense model based on dense connectivity and depthwise separable convolutional layers. The effectiveness of the MobileNet-Dense model is illustrated in Section 3.3 using two benchmark datasets.

3.1 Related Works

We first review the CNNs with skip connections (i.e., ResNet and DenseNet).
Next, we review the computationally efficient CNN model constructed using depthwise separable convolutional layers (i.e., MobileNetV1) and the computationally efficient CNN model constructed using both depthwise separable convolutional layers and residual skip connections (i.e., MobileNetV2).

3.1.1 CNNs with Skip Connections: ResNet and DenseNet

As computer vision tasks become increasingly complex, state-of-the-art CNNs are rapidly becoming deeper. VGGNet [52] extends the depth of AlexNet from eight to nineteen layers and achieves an error rate of 7.3% compared to an error rate of 16.4% achieved by AlexNet on the ImageNet dataset. GoogleNet [34] achieves an error rate of 6.7% on the ImageNet dataset and extends the depth in a different way: GoogleNet concatenates the output feature maps produced by filters of different sizes to enrich the features learned from the input feature map at each layer, but the general idea is similar, namely to boost performance by extending the depth of the CNN. Although making CNNs deeper generally improves performance, this is not always true. Simply adding more layers to an appropriately deep CNN may result in higher training and testing errors [58, 59, 2].

Table 3.1: Error rates of plain CNNs and ResNets on the ImageNet dataset [2]

              Plain CNN    ResNet
  18 layers   27.94%       27.88%
  34 layers   28.54%       25.03%

ResNet

In order to tackle this performance degradation problem, the Residual Network (i.e., a CNN with residual skip connections) was proposed in [2]. In [2], an 18-layer plain CNN (i.e., a CNN without residual skip connections), a 34-layer plain CNN, an 18-layer ResNet and a 34-layer ResNet were trained and compared on the ImageNet dataset. The results are shown in Table 3.1. It can be seen that adding 16 layers to the 18-layer ResNet leads to improved performance, while adding 16 layers to the 18-layer plain CNN leads to degraded performance.
In addition, both the 18-layer ResNet and the 34-layer ResNet outperform their plain counterparts. These results indicate that residual connections are effective not only for avoiding the performance degradation problem but also for boosting performance.

The basic bottleneck blocks of the ResNet architecture are shown in Figure 3.1, in which the skip connections are those connections which skip one or more layers. In the rest of this thesis, we use ConvNxN layer to denote a convolutional layer which contains multiple N × N convolution filters. The basic building block of the ResNet model is called the bottleneck block, since a Conv1x1 layer is applied as a 'bottleneck' to reduce the depth of the input feature map and thereby improve the computational efficiency. We first explain the residual bottleneck block (stride=1) shown in Figure 3.1. Suppose the input to this bottleneck block is a DF × DF × M feature map, denoted by F_in. It can be seen that there are two paths for F_in to go forward. The main path (illustrated by blue arrows) is formed by a stack of standard convolutional layers, and the skip connection path (illustrated by green arrows) simply performs an identity mapping. In the main path, the first Conv1x1 layer is used for reducing the depth of the input feature map proportionally by a factor of t (t < 1), and thus it generates a DF × DF × tM feature map F_1. The following Conv3x3 layer generates a DF × DF × tM feature map F_2. The last Conv1x1 layer is used for expanding (restoring) the depth of the feature map, and it generates a DF × DF × M feature map F_3. The output feature map F_out is obtained by performing element-wise addition of F_in and F_3. For the residual bottleneck block (stride=2), a Conv3x3 layer with a stride of 2 is applied in the main path to downsample the width and height of the feature map. Therefore, a Conv1x1 layer
Therefore, a Conv1x1 layerwith a stride of 2 is applied in the skip connection path to match the dimensions.The ResNet involves skip connections to let intermediate convolutional layerslearn a residual mapping f(x) = H(x) − x rather than a desired underlying map-ping as H(x), which solves the performance degradation problem [2]. To explainwhy, suppose a ResNet with l residual bottleneck blocks is sufficient for a specifictask and we construct a ResNet with l +m residual bottleneck blocks, then ideallythe backpropagation algorithm will force the residual to zero for all the m redun-dant bottleneck blocks and thus the last m redundant bottleneck blocks are simplyperforming identity mapping. Therefore the performance of the ResNet with l +m35Chapter 3. MobileNet-Dense: An Efficient Convolutional Neural NetworkInputConv3x3,Stride=1,ReluConv1x1,ReluConv1x1,ReluResidual bottleneck block (Stride=1) Elementwise AddIdentity Mapping OutputMDD FF MDD FF MDD FF InputConv3x3,Stride=2,ReluConv1x1,ReluConv1x1,ReluElementwise AddOutputMDD FF Residual bottleneck block (Stride=2)MDD FF 222Conv1x1,Stride=2,ReluMDD FF 222MDD FF MDD FF 222tMDD FF tMDD FF tMDD FF tMDD FF 22:inF :inF:1F :1F:2F :2F:inF:3F :3F:outF :outF:1sF:inF MDD FF Figure 3.1: Building blocks of ResNet [2].residual bottleneck blocks should not worse than the performance of the ResNet withl residual bottleneck blocks.DenseNetA variant of residual skip connection, namely the dense connection, is proposed inthe DenseNet architecture [8]. The two types of building blocks of the DenseNetarchitecture, namely the bottleneck block and the transition block, are shown in36Chapter 3. 
Figure 3.2: Building blocks of DenseNet [8].

For the bottleneck block, the first Conv1x1 layer in the main path is convolved with a D_F × D_F × M input feature map F_in and generates a D_F × D_F × 4K feature map F_1. The following Conv3x3 layer generates a D_F × D_F × K feature map F_2. The input feature map F_in is then concatenated with F_2 along the depth axis to form the output feature map F_out. Thus F_out is a D_F × D_F × (M + K) feature map. A dense block is constructed using multiple bottleneck blocks, as shown in Figure 3.3. All bottleneck blocks within a dense block are directly connected with each other. In other words, each bottleneck block receives the output feature maps from all preceding bottleneck blocks and passes on its output feature map to all subsequent bottleneck blocks. This connectivity is called dense connectivity. Compared to traditional connectivity (i.e., each building block only receives the output feature map from the nearest preceding block and passes on its output feature map to the nearest subsequent block), dense connectivity allows more feature reuse and promotes information flow throughout the network by introducing more connections [8]. A DenseNet model is constructed using a stack of these dense blocks, as shown in Figure 3.4. Between two adjacent dense blocks, a transition block is applied to downsample the size of the feature map.
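The depth bookkeeping of dense connectivity (each bottleneck block contributes K new channels, so the depth grows from M to M + K per block) can be sketched in a few lines of numpy. This is an illustration only: the random "features" stand in for the real convolution outputs.

```python
import numpy as np

def dense_bottleneck(x, k):
    """Toy DenseNet bottleneck: produce k new feature channels and
    concatenate them with the input along the depth axis."""
    new_features = np.random.randn(x.shape[0], x.shape[1], k)
    return np.concatenate([x, new_features], axis=-1)

# Depth grows by K per block: M -> M + K -> M + 2K -> ...
x = np.random.randn(8, 8, 16)   # a D_F x D_F x M feature map with M = 16
for _ in range(3):              # three bottleneck blocks with K = 12
    x = dense_bottleneck(x, 12)
assert x.shape == (8, 8, 16 + 3 * 12)
```

This growth is why DenseNet halves the depth in its transition blocks: without them the concatenated depth would increase linearly with network depth.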
It is found that DenseNet achieves an Acc of 77.5% compared to an Acc of 78.2% achieved by ResNet on the ImageNet dataset while using 55% fewer parameters and 49% fewer MAdds [60].

3.1.2 Efficient CNNs: MobileNetV1 and MobileNetV2

Despite the great success of modern CNNs (e.g., ResNet, DenseNet) in many image recognition applications, one noteworthy drawback of such CNNs is their high computational cost. In this section, we review the computationally efficient MobileNetV1 [3] model, which is constructed using depthwise separable convolutional layers, and the MobileNetV2 model, which is constructed using depthwise separable convolutional layers and residual skip connections.

MobileNetV1

The structure of MobileNetV1 [3] is described in Table 3.2 and the basic building block is shown in Figure 3.5. The MobileNetV1 model is basically a CNN model with a stack of depthwise separable convolutional layers. A hyperparameter α is introduced to reduce (α < 1) or increase (α > 1) the model complexity by altering the depth of the feature map of each building block proportionally by a factor of α; α is thus used to trade off model performance against computational cost. Unlike traditional CNN architectures, the MobileNetV1 architecture does not apply any pooling layers, since it has been shown in [61] that pooling layers can be replaced by convolutional layers with an increased stride without loss of accuracy. Therefore, a depthwise Conv3x3 layer with a stride of 2 is applied to downsample the spatial width and height of the feature map.

Figure 3.3: Illustration of a typical dense block [8] constructed using 5 bottleneck blocks. Each bottleneck block is illustrated by a single blue node. Any two nodes within the same dense block are directly connected.

Figure 3.4: A deep DenseNet [8] with two dense blocks.
The Conv1x1 layer and pooling layer between two dense blocks are applied to downsample the size of the feature map.

MobileNetV2

Sandler et al. [4] proposed the MobileNetV2 architecture, which is constructed using depthwise separable convolutional layers with residual skip connections. The two types of building blocks for MobileNetV2, namely the inverted residual block with linear bottleneck and the reduction block with linear bottleneck, are illustrated in Figure 3.6. We briefly review these two building blocks below:

• The inverted residual block with linear bottleneck: The inverted residual block begins with a Conv1x1 layer which expands the depth of the input feature map by a factor of t (t > 1). Then a depthwise Conv3x3 layer is convolved with the expanded feature map. The last Conv1x1 layer compresses (restores) the depth of the feature map to allow for the element-wise addition. If the expansion ratio t is set to be less than 1, this reduces to the classical residual bottleneck block (shown in Figure 3.1). The main reason for expanding the depth of the feature map at the first Conv1x1 layer is to prevent the information loss caused by the ReLU activation functions [4]. It has been shown that expanding the depth of the feature map with a sufficiently large expansion rate and applying ReLU on the expanded feature map is resistant to this information loss [4]. In addition, the inverted residual building block is more memory efficient than the classical residual block [4].
Figure 3.5: Building block of MobileNetV1 [3].

Table 3.2: The structure of MobileNetV1 [3]

Input        Operator             M     r  s
672² × 3     Conv3x3 Layer        32    1  2
336² × 32    Building Block       64    1  1
336² × 64    Building Block       128   1  2
168² × 128   Building Block       128   1  1
168² × 128   Building Block       256   1  2
84² × 256    Building Block       256   1  1
84² × 256    Building Block       512   1  2
42² × 512    Building Block       512   5  1
42² × 512    Building Block       1024  1  2
21² × 1024   GlobalAvgPool Layer  1024  1  -
1² × 1024    Output Layer         -     -  -

Suppose the input is a 672² × 3 image. Each line in this table describes a layer or a sequence of building blocks repeated r times. M denotes the depth of the output feature map for each layer or sequence. s denotes the stride of the Conv3x3 layer.

The linear bottleneck refers to replacing the ReLU non-linear activation with a linear activation on the output feature map of the last Conv1x1 layer. Experimental results showed that applying a linear activation on the output feature map of the last Conv1x1 layer leads to improved performance, since it also prevents the information loss caused by the ReLU non-linearity [4].

• The reduction block with linear bottleneck: The reduction block of MobileNetV2 begins with a Conv1x1 layer which expands the depth of the input feature map by a factor of t (t > 1), followed by a depthwise Conv3x3 layer and a Conv1x1 layer with linear activation. The depthwise Conv3x3 layer is responsible for downsampling the width and height of the feature map, and the last Conv1x1 layer is responsible for compressing the depth of the feature map. The reduction block does not include a residual skip connection since the spatial width and height of its output feature map F_out are smaller than those of the input feature map F_in.

The standard MobileNetV2 architecture is described in Table 3.3. It can be seen that the MobileNetV2 model is constructed using a stack of residual modules.
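Both MobileNet generations owe their efficiency to depthwise separable convolutions. As a rough sketch, the saving can be estimated with the standard cost formulas from the MobileNetV1 paper [3], where D_K is the kernel size, D_F the feature map size, M the input depth and N the output depth:

```python
# Multiply-add (MAdds) cost of a standard convolution vs. a depthwise
# separable convolution, per the formulas in the MobileNetV1 paper [3].
def standard_conv_madds(d_k, d_f, m, n):
    return d_k * d_k * m * n * d_f * d_f

def depthwise_separable_madds(d_k, d_f, m, n):
    # depthwise D_K x D_K stage + pointwise 1x1 stage
    return d_k * d_k * m * d_f * d_f + m * n * d_f * d_f

# For a 3x3 kernel the cost ratio is 1/N + 1/D_K^2, i.e. roughly an
# 8x-9x reduction once N is large:
std = standard_conv_madds(3, 112, 64, 128)
sep = depthwise_separable_madds(3, 112, 64, 128)
assert sep < std
ratio = sep / std   # = 1/128 + 1/9, about 0.119
```

The layer sizes (112, 64, 128) are illustrative, not taken from the tables in this chapter.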
Each residual module is constructed using multiple building blocks, where the first building block can be either an inverted residual block or a reduction block and the following building blocks are inverted residual blocks. Similar to the MobileNetV1 architecture, a hyperparameter α is introduced to trade off performance and computational cost. It is found that MobileNetV2 achieves an Acc of 0.720 compared to an Acc of 0.706 achieved by MobileNetV1 on the ImageNet dataset while using 19% fewer parameters and 48% fewer MAdds [4].

Figure 3.6: Building blocks of MobileNetV2 [4].

Table 3.3: The structure of MobileNetV2 [4]

Input        Operator             M     r  s  t
672² × 3     Conv3x3 Layer        32    1  2  -
336² × 32    Residual Module      16    1  1  1
336² × 16    Residual Module      24    2  2  6
168² × 24    Residual Module      32    3  2  6
84² × 32     Residual Module      64    4  2  6
42² × 64     Residual Module      96    3  1  6
42² × 96     Residual Module      160   3  2  6
21² × 160    Residual Module      320   1  1  6
21² × 320    Conv1x1 Layer        1280  1  1  -
21² × 1280   GlobalAvgPool Layer  1280  1  -  -
1² × 1280    Output Layer         -     -  -  -

Suppose the input is a 672² × 3 image. Each line in this table describes a layer or a residual module. Each residual module is constructed using r building blocks, where the first building block can be either an inverted residual block or a reduction block and the following (r − 1) building blocks are inverted residual blocks. M denotes the depth of the output feature map for each layer or residual module. s denotes the stride of the Conv3x3 layer. t denotes the expansion rate of each building block.
Except for the first residual module, a constant expansion rate is applied throughout the network.

3.2 MobileNet-Dense: An Efficient CNN Based on Dense Connections and Depthwise Separable Convolutional Layers

As discussed in Section 3.1, DenseNet achieves similar performance to ResNet on the ImageNet dataset while using fewer MAdds and parameters. Also, MobileNetV2 achieves better results than MobileNetV1 through the addition of residual connections [4]. Based on these two observations, we propose a novel MobileNet-Dense model which is constructed by adding dense connectivity to the inverted bottleneck block.

The core building blocks of the proposed MobileNet-Dense, namely the bottleneck block and the reduction block, are shown in Figure 3.7. These two building blocks are discussed as follows:

• The bottleneck block: In order to prevent the information loss caused by the ReLU activation, the design of the bottleneck block follows the design of the inverted residual block [4]. Specifically, the bottleneck block of MobileNet-Dense is constructed using a stack of 3 convolutional layers: a Conv1x1 layer, a depthwise Conv3x3 layer, and a Conv1x1 layer with linear activation. The first Conv1x1 layer is responsible for expanding the depth of the feature map by a factor of t so as to prevent the information loss. The last Conv1x1 layer with linear activation is responsible for compressing the depth of the feature map so as to improve the computational efficiency. Following the design of DenseNet, we let each bottleneck block generate a feature map with K channels; in other words, the last Conv1x1 layer compresses the depth of the feature map to a fixed value K. The input feature map F_in and the output feature map of the last Conv1x1 layer, F_3, are then concatenated along the depth axis to allow for the dense connectivity. The hyperparameter K can be used to trade off performance and computational cost.
Figure 3.7: Building blocks of MobileNet-Dense.

• The reduction block: The reduction block begins with a Conv1x1 layer, followed by a depthwise Conv3x3 layer with a stride of 2 and a Conv1x1 layer with linear activation. The first and last Conv1x1 layers are used for expanding and then compressing the depth of the feature map. The depthwise Conv3x3 layer with a stride of 2 is used for downsampling the width and height of the feature map. The reduction block does not include the skip connection (i.e., feature concatenation) since the spatial width and height of its output feature map F_out are smaller than those of the input feature map F_in.

MobileNet-Dense is constructed using a stack of dense modules, where each dense module is constructed by densely connecting r building blocks. To be more specific, each dense module begins with one reduction block to downsample the width and height of the feature map, followed by (r − 1) bottleneck blocks. Between two adjacent dense modules, a Conv1x1 layer with linear activation is applied to halve the depth of the feature map in order to improve the computational efficiency. The structure of the proposed MobileNet-Dense is summarized in Table 3.4.

3.3 Experiment

We empirically demonstrate the effectiveness of MobileNet-Dense on two widely used benchmark datasets, namely CIFAR-10 and CIFAR-100.
We select the MobileNetV2 model as the benchmark model for comparison.

Table 3.4: The structure of MobileNet-Dense

Input        Operator             M     r  s  K   t
672² × 3     Conv3x3 Layer        32    1  2  -   -
336² × 32    Bottleneck Block     48    1  1  16  1
336² × 48    Dense Module         96    2  2  48  3
168² × 96    Conv1x1 Layer        48    1  1  -   -
168² × 48    Dense Module         144   3  2  48  3
84² × 144    Conv1x1 Layer        72    1  1  -   -
84² × 72     Dense Module         216   4  2  48  3
42² × 216    Conv1x1 Layer        108   1  1  -   -
42² × 108    Dense Module         300   5  2  48  3
21² × 300    Conv1x1 Layer        1280  1  1  -   -
21² × 1280   GlobalAvgPool Layer  1280  1  -  -   -
1² × 1280    Output Layer         -     -  -  -   -

Suppose the input is a 672² × 3 image. Each line in this table describes a layer or a dense module. Each dense module is constructed using r building blocks, where the first building block is a reduction block and the following (r − 1) building blocks are bottleneck blocks. We constructed the standard MobileNet-Dense model using 4 dense modules, which contain 2, 3, 4 and 5 building blocks respectively. M denotes the depth of the output feature map for each layer or dense module. K denotes the number of convolution filters applied in the last Conv1x1 layer of each bottleneck block. t denotes the expansion rate of each building block. The K and t of the first bottleneck block (the 2nd line of this table) are fixed to 16 and 1 respectively; the K and t of the following 4 dense modules are found by grid search (in this table we set K = 48 and t = 3 for illustration purposes).

3.3.1 Experiment Environment

The hardware and software environments used in our experiments are summarized below:

1. Hardware:
• CPU (Central Processing Unit): Intel i7-6850K Processor
• RAM (Random-access Memory): Kingston 48GB DDR4
• GPU (Graphics Processing Unit): Nvidia GeForce GTX 1080 Ti

2.
Software:
• OS (Operating System): Ubuntu 16.04.4 LTS
• Coding Environment: Python 2.7
• Core Libraries: CUDA 9.0, cuDNN 7.0, Tensorflow-gpu 1.9.0, Keras 2.2.0

3.3.2 CIFAR Datasets

The CIFAR-10 and CIFAR-100 datasets consist of colored RGB natural images, each of size 32 × 32 pixels. CIFAR-10 consists of 10 classes, each with 6000 images. CIFAR-100 consists of 100 classes, each with 600 images. For both CIFAR-10 and CIFAR-100, the training set contains 50000 images and the test set contains 10000 images.

3.3.3 Training

Following the typical workflow described in Section 2.3, we first apply preprocessing and augmentation techniques to the CIFAR datasets. For preprocessing, we standardized the image data using the RGB channel means and standard deviations. For data augmentation, we used two image augmentation techniques (mirroring/shifting) which have been widely used on these two datasets [2, 8, 62]. As the images in the CIFAR-10 and CIFAR-100 datasets are only 32 × 32 pixels, we changed the stride of the first Conv3x3 layer from 2 to 1 and removed the first bottleneck block with a stride of 2 for both the MobileNetV2 and MobileNet-Dense models, in order to preserve the spatial width and height of the feature map at the beginning of the network. The MobileNetV2 and MobileNet-Dense models we trained on the CIFAR datasets are summarized in Table 3.5 and Table 3.6, respectively. There are two hyperparameters (t and α) for MobileNetV2 and two hyperparameters (t and K) for MobileNet-Dense. Grid search is applied to optimize the hyperparameter values. Considering the computational cost, we defined the search space as follows:

t ∈ {2, 3, 4, 5, 6}
K ∈ {16, 32, 48, 64}
α ∈ {1.0, 1.1, 1.2, 1.3}

All the networks with different hyperparameter values are trained using the Adam optimizer [63].
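The grid search over the space above can be sketched as follows. This is a minimal illustration: `train_and_evaluate` is a hypothetical stand-in for training a model with the given hyperparameters and returning its validation accuracy.

```python
from itertools import product

def grid_search(train_and_evaluate, search_space):
    """Exhaustively evaluate every hyperparameter combination and keep
    the one with the highest validation score."""
    best_params, best_score = None, float("-inf")
    for values in product(*search_space.values()):
        params = dict(zip(search_space.keys(), values))
        score = train_and_evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for validation accuracy, peaked at t=3, K=48:
space = {"t": [2, 3, 4, 5, 6], "K": [16, 32, 48, 64]}
params, score = grid_search(lambda t, K: -abs(t - 3) - abs(K - 48) / 16, space)
assert params == {"t": 3, "K": 48}
```

In practice each evaluation is a full training run, so the search cost is the product of the grid sizes; this is why the search space above is kept small.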
We trained each network using a batch size of 64 for 200 epochs. The initial learning rate is set to 0.001, as suggested in [63], and it is successively decreased by a factor of 10 at 40%, 60% and 80% of the total number of training epochs. The best hyperparameter set found for MobileNet-Dense is [t = 3, K = 48] and the best hyperparameter set found for MobileNetV2 is [t = 6, α = 1.3].

3.3.4 Performance Evaluation

The performance of MobileNet-Dense and MobileNetV2 on the CIFAR-10 and CIFAR-100 datasets is summarized in Table 3.7.

Table 3.5: The structure of MobileNetV2 for the CIFAR datasets

Input       Operator             M     r  s  t
32² × 3     Conv3x3 Layer        40    1  1  -
32² × 40    Residual Module      24    1  1  1
32² × 24    Residual Module      32    1  1  6
32² × 32    Residual Module      40    3  2  6
16² × 40    Residual Module      80    4  2  6
8² × 80     Residual Module      128   3  1  6
8² × 128    Residual Module      208   3  2  6
4² × 208    Residual Module      416   1  1  6
4² × 416    Conv1x1 Layer        1664  1  1  -
4² × 1664   GlobalAvgPool Layer  1664  1  -  -
1² × 1664   Softmax Layer        -     -  -  -

The input is a 32² × 3 image. The differences between MobileNetV2 for CIFAR and the standard MobileNetV2 model described in Table 3.3 are highlighted in red.

Table 3.6: The structure of MobileNet-Dense for the CIFAR datasets

Input       Operator             M     r  s  K   t
32² × 3     Conv3x3 Layer        32    1  1  -   -
32² × 32    Bottleneck Block     48    1  1  16  1
32² × 48    Bottleneck Block     96    1  1  48  3
32² × 96    Conv1x1 Layer        48    1  1  -   -
32² × 48    Dense Module         144   3  2  48  3
16² × 144   Conv1x1 Layer        72    1  1  -   -
16² × 72    Dense Module         216   4  2  48  3
8² × 216    Conv1x1 Layer        108   1  1  -   -
8² × 108    Dense Module         300   5  2  48  3
4² × 300    Conv1x1 Layer        1280  1  1  -   -
4² × 1280   GlobalAvgPool Layer  1280  1  -  -   -
1² × 1280   Softmax Layer        -     -  -  -   -

The input is a 32² × 3 image.
The differences between MobileNet-Dense for CIFAR and the standard MobileNet-Dense model described in Table 3.4 are highlighted in red.

Table 3.7: Performance on the CIFAR-10 and CIFAR-100 datasets

Method           MAdds¹  Parameters²  Acc (CIFAR-10)  Acc (CIFAR-100)  Testing Time³
MobileNetV2      138M    3.7M         0.931           0.731            325 µs
MobileNet-Dense  89M     1.4M         0.934           0.727            270 µs

¹ MAdds is in millions and is the number of multiply-adds for predicting one image.
² Parameters is in millions.
³ Testing time is in microseconds and is the inference time for predicting one image.

It can be seen that MobileNet-Dense and MobileNetV2 achieve similar accuracies on both the CIFAR-10 and CIFAR-100 datasets, while MobileNet-Dense uses 62% fewer learnable parameters and 36% fewer MAdds than MobileNetV2. In terms of testing time per image, MobileNet-Dense is 17% faster than MobileNetV2. The results indicate that the proposed MobileNet-Dense is effective and efficient.

3.4 Summary

In this chapter, a novel CNN model, namely MobileNet-Dense, was proposed. The classification performance and the computational cost of MobileNet-Dense were compared with those of the MobileNetV2 model using the CIFAR-10 and CIFAR-100 datasets. It is found that the MobileNet-Dense and MobileNetV2 models achieve similar classification performance on these two benchmark datasets, while the MobileNet-Dense model uses 62% fewer parameters, 36% fewer MAdds and 17% less running time than the MobileNetV2 model. These results indicate that the proposed MobileNet-Dense model is effective and efficient.

Chapter 4

Automatic Diabetic Retinopathy Classification System Design

This chapter describes the design of the computationally efficient DR classification system. The workflow of the DR classification system, shown in Figure 4.1, follows the typical workflow discussed in Section 2.3, and the structure of this chapter follows the workflow illustrated in Figure 4.1. Section 4.1 describes the exploratory data analysis (EDA) of the EyePACS dataset [5].
In Section 4.2, the data augmentation and image preprocessing techniques used in our system are described. In Section 4.3, the hyperparameter tuning and the training details are discussed. In Section 4.4, the ensemble learning method employed in our system is described. A summary is provided in Section 4.5.

4.1 Exploratory Data Analysis

Exploratory Data Analysis (EDA) [64] is a core process in general machine learning tasks. It refers to the initial inspection of the data in order to discover underlying patterns (such as an imbalanced data distribution), identify anomalies or outliers using statistical methods, and illustrate data properties using graphical representations. As the EyePACS DR dataset [5] contains retinal images and the corresponding DR severity labels, we perform basic EDA on both the retinal images and the labels.

Figure 4.1: Flowchart of the proposed automatic DR classification system.

Table 4.1: DR label information in the EyePACS dataset [5]

DR Severity Level   DR Severity Label
No DR               0
Mild DR             1
Moderate DR         2
Severe DR           3
Proliferative DR    4

4.1.1 EDA of Training Labels

The EyePACS DR dataset [5] is a large DR dataset with high-resolution images taken under different imaging conditions. There are 17653/5453/21335 pairs of color images in the training/validation/test sets, and the corresponding DR severity levels are provided. As shown in Table 4.1, the DR levels are rated from 0 to 4 based on the severity of the DR disease as assessed by an ophthalmologist. The distribution of DR levels in the training dataset is illustrated in Figure 4.2. It can be seen that the distribution is very non-uniform, with 73% of the images labeled as healthy, 6% as mild DR, 15% as moderate DR, 2% as severe DR and 2% as proliferative DR.
The EyePACS DR training set also shows that the DR labels for a given pair of eyes are highly correlated, as shown in Figure 4.3. To be more specific, 87% of pairs of eyes have the same DR severity label, and for 95% of pairs of eyes the DR label values differ by at most 1.

Figure 4.2: Distribution of DR label values in the EyePACS training set.

4.1.2 EDA of Training Images

Randomly sampled retinal images with different DR levels from the EyePACS DR training set are shown in Figure 4.4. Each row contains five retinal images with the same DR level. Retinal images in this dataset were taken under different illumination conditions or by different types of cameras [5]; therefore the image sizes and lighting conditions vary (e.g., the 4th image with moderate DR and the 2nd and 3rd images with proliferative DR in Figure 4.4 are relatively dark compared with the other images). The sizes of the training images vary from 400 × 315 to 5184 × 3456. For illustration purposes, all images shown in Figure 4.4 are resized to 672 × 672 using bilinear interpolation. The dataset contains some underexposed images (i.e., images which are too dark) and overexposed images (i.e., images which are too bright) [5]. It is useful to check whether there are many poorly exposed images, since existing deep learning models have been shown to be sensitive to image quality [65].

We converted the RGB images to 8-bit grayscale images and computed the histogram of each grayscale image to screen for poorly exposed images. For an underexposed image, a large portion of the pixel values are close to 0 (which corresponds to black), whereas for an overexposed image, a large portion of the pixel values are close to 255 (which corresponds to white). We consider an image to be underexposed if 90% of its pixel values are between 0 and 30.
Figure 4.3: Label correlation in the EyePACS training set.

We consider an image to be overexposed if 80% of its pixel values are between 225 and 255. The two thresholds are set differently since the retinal images contain black borders. Note that there are other ways to screen for poorly exposed images; for example, an alternative method would be to use the average pixel value of the grayscale image. Our classification rule yields 145 underexposed images and 14 overexposed images in the training dataset, and 190 underexposed images and 11 overexposed images in the test set. As the poorly exposed images comprise only about 0.5% of the training set or test set, we did not remove any images from the training or test sets. An underexposed retinal image, a properly exposed image and an overexposed image, each with its RGB histograms, are shown in Figure 4.5.

Figure 4.4: Random sample images from the EyePACS dataset [5].

4.2 Image Augmentation and Preprocessing

As discussed in Section 4.1, the distribution of the DR label values in the EyePACS dataset is highly imbalanced and the sizes of the images vary from 400 × 315 pixels to 5184 × 3456 pixels. Therefore, we first resize all the images in our dataset to a uniform size using bilinear interpolation.

Figure 4.5: Underexposed/normal/overexposed images: (a) an underexposed image; (b) a properly exposed image; (c) an overexposed image.
Small figures in red/green/blue under each retinal image are the histograms of the corresponding color channel.

We resize all images to 672 × 672 pixels, since some of the subtle differences between retinal images with different DR levels may not be captured at a smaller size. To deal with the data imbalance problem, state-of-the-art solutions for learning from imbalanced data include sampling methods (undersampling the majority classes or oversampling the minority classes to balance the dataset) and cost-sensitive learning [66] (assigning different weights to samples from different classes). As deep learning models usually have millions of parameters, a large amount of data is required to train such models. Undersampling results in training models with limited data (i.e., some training samples from the majority class are dropped), while cost-sensitive learning requires us to carefully pre-define the weights assigned to each class, which may be time-consuming. Thus, we choose to augment the minority classes using several image augmentation techniques to balance the training set.

We applied two types of image augmentation techniques to expand the training set. The first type includes geometric augmentation techniques, which alter the geometry of the image by mapping the individual pixel values to new destinations [67]. To be more specific, we applied image flipping (horizontal or vertical), image rotation (0-360 degrees) and image cropping (90%). The second type is the Fancy PCA color augmentation proposed in [7], which alters the pixel values of the RGB channels. The basic idea of Fancy PCA color augmentation is to alter the intensities of the RGB channels of the image by adding small perturbations to those channels.
The perturbation for each image is computed from the eigenvectors of the covariance matrix of the RGB channels, multiplied by the corresponding eigenvalues and scaled by Gaussian random variables with a mean of zero and a standard deviation of 0.1. The RGB perturbations are then added to the RGB channels of the original image to form the augmented image. The Fancy PCA algorithm is summarized in Algorithm 4.1. A comparison of an original retinal image and the corresponding image augmented using Fancy PCA is shown in Figure 4.6. Although the difference between these two images is not easily visible to the human eye, it has been shown that applying the Fancy PCA algorithm generally improves model performance [7]. To generate an image, we first randomly apply one of the geometric augmentation techniques to the input image. The augmented image is then standardized by subtracting the channel means and dividing by the channel standard deviations. Finally, we apply the Fancy PCA color augmentation technique to bring more variety to the training images. We utilized the open source scikit-image library [68] to implement the discussed data augmentation techniques.

Figure 4.6: Illustration of the original image and the augmented image using the Fancy PCA algorithm.

4.3 CNN Training

As illustrated in Figure 4.1, the proposed DR classification system is based on the ensemble of two computationally efficient CNN models, namely MobileNet-Dense and MobileNetV2.
In this section, we describe the training details of the MobileNet-Dense and MobileNetV2 models, including the choice of the loss function and hyperparameter tuning.

4.3.1 Evaluation Metric and Loss Function

Evaluation Metric

As the DR disease develops progressively, we consider the prediction of DR severity levels as an ordinal classification task [69] rather than an ordinary classification task. Ordinal classification differs from ordinary classification in that the labels are ordered (e.g., disease levels from 0 to 4), while the labels for an ordinary classification task are not ordered (e.g., cat vs. dog). Many metrics can be used to evaluate classification performance, such as accuracy, specificity, precision, recall, F1-score, etc.

Algorithm 4.1 Fancy PCA color augmentation algorithm [7].
Input: An m × m × 3 standardized image H.
1: Create an m² × 3 matrix Hs from the m × m × 3 input image H, where Hs[:, 1] contains all the red pixel data, Hs[:, 2] contains all the green pixel data and Hs[:, 3] contains all the blue pixel data.
2: Compute the covariance matrix of Hs.
3: Calculate the eigenvectors p and eigenvalues λ of the covariance matrix.
4: Obtain the augmented RGB image K using:
[k^R_{i,j}, k^G_{i,j}, k^B_{i,j}]^T = [h^R_{i,j}, h^G_{i,j}, h^B_{i,j}]^T + [p_1, p_2, p_3][α_1 λ_1, α_2 λ_2, α_3 λ_3]^T
• p_i is the ith eigenvector of the 3 × 3 covariance matrix.
• λ_i is the ith eigenvalue of the 3 × 3 covariance matrix.
• α_i is the ith Gaussian random variable with a mean of zero and a standard deviation of 0.1.
• k^R_{i,j}, k^G_{i,j}, k^B_{i,j} are the ijth entries of the red/green/blue channels of the augmented image K.
• h^R_{i,j}, h^G_{i,j}, h^B_{i,j} are the ijth entries of the red/green/blue channels of the input image H.
Output: An m × m × 3 augmented RGB image K.
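A numpy sketch of Algorithm 4.1 follows. It is a minimal, unoptimized illustration: the input image is assumed to be standardized, σ = 0.1 follows the algorithm, and the fixed RNG seed is only for reproducibility.

```python
import numpy as np

def fancy_pca(image, sigma=0.1, rng=np.random.default_rng(0)):
    """Fancy PCA color augmentation (Algorithm 4.1, sketched with numpy)."""
    flat = image.reshape(-1, 3)                  # the m^2 x 3 matrix Hs
    cov = np.cov(flat, rowvar=False)             # 3x3 covariance of the RGB channels
    eigvals, eigvecs = np.linalg.eigh(cov)       # lambda_i and p_i (columns)
    alpha = rng.normal(0.0, sigma, size=3)       # Gaussian scaling factors
    perturbation = eigvecs @ (alpha * eigvals)   # [p1 p2 p3][a1 l1, a2 l2, a3 l3]^T
    return image + perturbation                  # same shift added to every pixel

img = np.random.default_rng(1).normal(size=(8, 8, 3))
aug = fancy_pca(img)
assert aug.shape == img.shape
```

Because the perturbation is a single 3-vector added to every pixel, the augmentation shifts the overall color cast of the image along its principal color directions rather than adding per-pixel noise, which is why the change is barely visible in Figure 4.6.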
In this thesis, we adopt the quadratic weighted kappa (QWK) [31] score as the evaluation metric, since QWK is designed to measure the agreement of two raters on labels with ordinal scales and has been used for reporting DR classification performance in existing work [37, 32, 70, 39].

The QWK, k_w, is defined as follows:

k_w = 1 − (Σ_{i,j} w_{i,j} o_{i,j}) / (Σ_{i,j} w_{i,j} e_{i,j})    (4.1)

W, O and E are M × M matrices used for calculating the QWK score, where M is the number of classes. The matrix O is the histogram matrix of the observed ratings, where o_{i,j} is the number of images that received a rating i from rater A (human) and a rating j from rater B (the trained classifier). The matrix W is the weight matrix, where w_{i,j} is the weight assigned to each type of disagreement, given by:

w_{i,j} = (i − j)² / (M − 1)²    (4.2)

The matrix E is the histogram matrix of the expected ratings; e_{i,j} is calculated as the outer product of the two raters' histogram vectors of ratings (assuming that the rating processes of rater A and rater B are independent of each other), as shown below:

e_{i,j} = (Σ_j o_{i,j})(Σ_i o_{i,j}) / N    (4.3)

where N is the total number of samples in the evaluation dataset. The N in the denominator scales e_{i,j} so that e_{i,j} and o_{i,j} have the same scale.

Loss Function

After the evaluation metric is chosen, a loss function is needed to train our models. We cannot use the non-differentiable QWK as the loss function, since the backpropagation algorithm requires the loss function to be differentiable. For multi-class classification tasks, the categorical cross-entropy loss L is commonly used; it is given by:

L = −(1/N) Σ_i Σ_j t_{i,j} log(p_{i,j})    (4.4)
where p_{i,j} is the predicted probability that sample i belongs to class j, given by the neural network model; t_{i,j} is the ground-truth probability that sample i belongs to class j (i.e., t_i = [0, . . . , 1, . . . , 0] contains a single 1 at the j-th position if sample i is labeled as class j); and N is the total number of samples.

The categorical cross-entropy is not suitable for the ordinal DR classification task, as it fails to consider the order of the labels and only takes the predicted probability of the true class into account, which results in the same loss being assigned to different types of misclassification errors. To see this, suppose there are two retinal images I_1 and I_2 with proliferative DR (i.e., the ground-truth probability vectors t_1 = t_2 = [0, 0, 0, 0, 1]). Suppose the class probabilities predicted by the model for I_1 and I_2 are:

p_1 ≜ [p_{1,0}, p_{1,1}, p_{1,2}, p_{1,3}, p_{1,4}] = [0.9, 0, 0, 0, 0.1]
p_2 ≜ [p_{2,0}, p_{2,1}, p_{2,2}, p_{2,3}, p_{2,4}] = [0, 0, 0, 0.9, 0.1]

With p_1, the model predicts that I_1 has a 90% probability of being an image without DR and a 10% probability of being an image with proliferative DR. With p_2, the model predicts that I_2 has a 90% probability of being an image with severe DR and a 10% probability of being an image with proliferative DR. Although neither p_1 nor p_2 gives the correct prediction, p_2 is better than p_1 in the sense that the predicted severity level is closer to the ground-truth severity level. We would expect the loss function to assign different loss values to p_1 and p_2; however, since only the true class contributes to the sum in Equation (4.4), the corresponding categorical cross-entropy loss values for p_1 and p_2 are the same:

L_1 = −1 × log(0.1) = log(10)
L_2 = −1 × log(0.1) = log(10)

Following the 2nd-place method [70] in the Kaggle DR challenge, we consider the DR
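The equal-loss behavior can be verified numerically. The snippet below is an illustrative sketch; the helper name and the toy quadratic penalty on the expected severity level are ours, not the thesis implementation.

```python
import numpy as np

def cross_entropy(t, p, eps=1e-12):
    """Categorical cross-entropy of one sample (Equation 4.4, N = 1)."""
    return -np.sum(t * np.log(np.clip(p, eps, 1.0)))

t  = np.array([0, 0, 0, 0, 1.0])     # ground truth: proliferative DR
p1 = np.array([0.9, 0, 0, 0, 0.1])   # mistaken mostly for "no DR"
p2 = np.array([0, 0, 0, 0.9, 0.1])   # mistaken mostly for "severe DR"

# Cross-entropy cannot tell the two mistakes apart ...
assert np.isclose(cross_entropy(t, p1), cross_entropy(t, p2))

# ... whereas a quadratic penalty on the expected severity level can.
levels = np.arange(5)
err1 = (levels @ p1 - 4) ** 2        # expected level 0.4: far from 4
err2 = (levels @ p2 - 4) ** 2        # expected level 3.1: close to 4
assert err1 > err2
```

This sensitivity to label ordering is the property that motivates the MSE-based ordinal regression formulation adopted for training.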
classification problem as an ordinal regression problem and use the MSE loss to train the CNN models, as both the MSE loss and the QWK score apply a quadratic penalty to the disagreement between the predicted result and the ground-truth label. We treat the categorical labels (0, 1, 2, 3, 4) as continuous values and minimize the MSE between the output of our CNN model and the ground-truth categorical label. Note that the output of our neural network is a real number bounded between 0 and 4 (e.g., y_pred = 3.67). The categorical predictions are then obtained by thresholding the predicted real values. The threshold values are given by [η_0 = 0, η_1, η_2, η_3, η_4, η_5 = 4]; for any η_i < y_pred < η_{i+1}, we interpret y_pred as class i. The optimal threshold values are searched using the gradient-free Powell's method [71] on the validation set, and these optimal thresholds are applied to the test set to obtain the test QWK score.

4.3.2 Hyperparameter Tuning and Training

Hyperparameter Tuning

The MobileNet-Dense model and the MobileNetV2 model for DR classification are summarized in Table 4.2 and Table 4.3 respectively. Compared to the standard MobileNet-Dense and MobileNetV2 models discussed in Chapter 3, two fully connected layers are added between the GlobalAvgPool layer and the output layer to improve the performance at the cost of some added complexity.

As in the experiments on the CIFAR datasets, grid search is used to find the hyperparameters (t and K) of the MobileNet-Dense model and the hyperparameters (t and α) of the MobileNetV2 model. The hyperparameter search space is listed below:

t ∈ {2, 3, 4, 5, 6}
K ∈ {16, 32, 48, 64}
α ∈ {1.0, 1.1, 1.2, 1.3}

Table 4.2: The structure of MobileNet-Dense for DR classification

Input        Operator               M     r  s  K   t
672² × 3     Conv3x3 Layer          32    1  2  -   -
336² × 32    Dense Block            48    1  1  16  1
336² × 48    Dense Module           96    2  2  48  3
168² × 96    Conv1x1 Layer          48    1  1  -   -
168² × 48    Dense Module           144   3  2  48  3
84² × 144    Conv1x1 Layer          72    1  1  -   -
84² × 72     Dense Module           216   4  2  48  3
42² × 216    Conv1x1 Layer          108   1  1  -   -
42² × 108    Dense Module           300   5  2  48  3
21² × 300    Conv1x1 Layer          1280  1  1  -   -
21² × 1280   GlobalAvgPool Layer    1280  1  -  -   -
1280         Fully Connected Layer  256   2  -  -   -
256          Output Layer           1     -  -  -   -

Table 4.3: The structure of MobileNetV2 for DR classification

Input        Operator               M     r  s  t
672² × 3     Conv3x3 Layer          40    1  2  -
336² × 40    Residual Module        24    1  1  1
336² × 24    Residual Module        32    2  2  6
168² × 32    Residual Module        40    3  2  6
84² × 40     Residual Module        80    4  2  6
42² × 80     Residual Module        128   3  1  6
42² × 128    Residual Module        208   3  2  6
21² × 208    Residual Module        416   1  1  6
21² × 416    Conv1x1 Layer          1664  1  1  -
21² × 1664   GlobalAvgPool Layer    1664  1  -  -
1664         Fully Connected Layer  256   2  -  -
256          Output Layer           1     -  -  -

In order to speed up the hyperparameter tuning process, models with different hyperparameter settings are trained using images of a small size (128 × 128 pixels), as it is pointed out in [72] that the optimal hyperparameter values are largely independent of the image size. We randomly sample 60% of the EyePACS training data for training, and the hyperparameter set that yields the model with the highest QWK on the validation set is chosen. The hyperparameter sets found for MobileNet-Dense and MobileNetV2 are [t = 3, K = 48] and [t = 6, α = 1.3] respectively.

Training CNN Models

The MobileNet-Dense and MobileNetV2 models are trained using a batch size of 4 for 200 epochs. The initial learning rate is set to 10⁻⁴ and is reduced to 10⁻⁵ at 80% of the total number of training epochs.
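The step schedule just described can be written as a small helper (our own sketch, not the thesis code):

```python
def learning_rate(epoch, total_epochs=200, base_lr=1e-4, reduced_lr=1e-5):
    """Step schedule: the base rate of 1e-4 is cut to 1e-5 once
    80% of the training epochs have elapsed."""
    return base_lr if epoch < 0.8 * total_epochs else reduced_lr
```

In a Keras setup, a function like this could be passed to a LearningRateScheduler callback so the drop happens automatically during training.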
We trained each model for 200 epochs, as we observed that training for more than 250 epochs resulted in overfitting, while 200 epochs are sufficient for the validation loss to converge. After each epoch, we evaluate the QWK score of our model on the validation set, and the model checkpoints (the model checkpoint at epoch t stores the current values of all the learnable parameters) that achieve a QWK score higher than 0.810 are saved for further testing and model ensembling.

4.4 Ensemble Learning

Ensembles of machine learning models (ensemble learning) generally improve a classification system's robustness and accuracy [73]. Ensembling mimics the common practice of making decisions based on the decisions of multiple experts, and it has been widely applied in medical image classification tasks such as cancer detection [74] and Alzheimer's disease detection [75]. We therefore apply ensemble learning to improve the performance of the classification system. The ensemble learning process consists of three steps: the feature extraction step, the feature reduction step and the model stacking step.

4.4.1 Feature Extraction and Feature Blending

We exploit the label correlation between a pair of eyes by training classifiers that screen DR using features from both eyes, as it has been shown in [70, 39] that utilizing this label correlation generally improves the overall performance. Specifically, the features of each retinal image are extracted from the last fully connected layer of the trained MobileNet-Dense model and the trained MobileNetV2 model (the feature vector of each retinal image is 256-dimensional, since the last fully connected layer contains 256 nodes). The features of the left eye and the right eye are then concatenated.
The concatenated feature is given by:

F_concatenated = [F_MobileNet-Dense^left, F_MobileNet-Dense^right, F_MobileNetV2^left, F_MobileNetV2^right]

where F_MobileNet-Dense^left and F_MobileNetV2^left denote the left eye's features extracted from the MobileNet-Dense model and the MobileNetV2 model respectively. Similarly, F_MobileNet-Dense^right and F_MobileNetV2^right denote the right eye's features extracted from the MobileNet-Dense model and the MobileNetV2 model respectively.

4.4.2 Feature Reduction

PCA (Principal Component Analysis) feature reduction is performed on the concatenated features in order to reduce their dimensionality, thereby reducing overfitting and speeding up the training process. We select the number of components such that 99% of the variance is explained.

4.4.3 Model Stacking

We use the stacking technique [76] to perform ensemble learning. Stacking is an ensemble learning technique that combines multiple base-level models via a meta-classifier. The base-level models (i.e., MobileNet-Dense and MobileNetV2) are first trained using the EyePACS DR training set; the meta-classifier is then trained using the features extracted from the base-level models. Specifically, we trained one FCNN as the meta-classifier for screening left eyes using the concatenated features, and another FCNN as the meta-classifier for screening right eyes. The two FCNNs share the same structure: the first hidden layer contains 256 hidden nodes, the second hidden layer contains 128 hidden nodes, and the output layer contains one node. The MSE loss is used for training, and a weight decay of 0.025 is applied to reduce overfitting. The FCNNs are trained using a batch size of 256 for 120 epochs.
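The feature blending of Section 4.4.1 and the PCA reduction of Section 4.4.2 can be sketched as follows. This illustration uses random stand-in features and our own helper names; scikit-learn's PCA(n_components=0.99) would achieve the same reduction.

```python
import numpy as np

def pca_reduce(features, var_kept=0.99):
    """Project features onto the smallest number of principal components
    that together explain `var_kept` of the variance (Section 4.4.2)."""
    x = features - features.mean(axis=0)
    # Squared singular values are proportional to per-component variance.
    _, s, vt = np.linalg.svd(x, full_matrices=False)
    ratio = (s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(ratio), var_kept)) + 1
    return x @ vt[:k].T

# Stand-in 256-d features for 100 patients: left/right eye from each CNN.
rng = np.random.default_rng(0)
f_left_dense, f_right_dense, f_left_v2, f_right_v2 = rng.normal(size=(4, 100, 256))
concatenated = np.hstack([f_left_dense, f_right_dense, f_left_v2, f_right_v2])
reduced = pca_reduce(concatenated)   # keeps 99% of the variance
```

The reduced matrix, one row per patient, is what the FCNN meta-classifiers consume in the stacking step.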
The initial learning rate is set to 0.0005, and it is successively decreased by a factor of 10 at 40%, 60% and 80% of the total number of training epochs.

4.5 Summary

In this chapter, we proposed an automatic DR classification system based on the ensembling of two computationally efficient CNN models, namely the MobileNet-Dense model and the MobileNetV2 model. The system is designed following the typical workflow described in Section 2.3. Exploratory data analysis of the EyePACS training set was first described, followed by the image augmentation techniques used to address the data imbalance problem. Next, hyperparameter optimization and the training details of the MobileNet-Dense and MobileNetV2 models were discussed. Lastly, the ensemble learning procedure was described.

Chapter 5

Performance Evaluation of the Diabetic Retinopathy Classification System

In this chapter, we present the performance evaluation of the automatic DR classification system proposed in Chapter 4. We demonstrate the effectiveness of the proposed DR classification system using two independent DR datasets, namely the EyePACS dataset and the Messidor database. In Section 5.1, the performance of the proposed DR classification system is compared with some state-of-the-art methods [39, 37] on the EyePACS dataset. In Section 5.2, the performance of the proposed DR classification system is compared with some state-of-the-art methods [36, 37] on the Messidor database [9]. A brief summary is provided in Section 5.3.

5.1 Performance Evaluation on the EyePACS Test Set

In this section, we present the performance evaluation of the proposed system using the EyePACS test set. The overall performance is evaluated using the QWK score [31], the MAdds and the Parameters. The classification performance for each DR severity class is measured using the precision, recall and F1-score [77].
Table 5.1: Performance comparison on the EyePACS test set

Method                  Test QWK  Parameters¹  MAdds²  Testing Time³
Method in [39]⁴         0.849     11.8M        45.0B   -
Zoom-In-Net [37]        0.854     -            -       -
MobileNet-Dense         0.825     1.8M         3.7B    25.6 ms
MobileNetV2             0.822     4.2M         4.6B    28.7 ms
Model Ensemble (1+1)⁵   0.851     6.2M         8.3B    54.3 ms
Model Ensemble (2+1)⁵   0.852     8.0M         12.0B   79.9 ms
Model Ensemble (1+2)⁵   0.852     10.4M        12.9B   83.0 ms
Model Ensemble (2+2)⁵   0.852     12.3M        16.6B   108.6 ms

¹ Parameters are in millions.
² MAdds are in billions; the value is the MAdds for predicting one image.
³ Testing time is in milliseconds; the value is the inference time for predicting one image.
⁴ Method in [39] is the top-ranked method in the Kaggle DR Detection Challenge; it ensembled three CNNs to obtain the final results. We use this method as the benchmark method.
⁵ Model Ensemble (M+N) refers to the ensemble of the top-M saved checkpoints of MobileNet-Dense and the top-N saved checkpoints of MobileNetV2.

5.1.1 The QWK Score and Model Complexity

We use the QWK score, MAdds and Parameters to report the overall performance. The QWK score, MAdds and Parameters of the proposed system and those of some state-of-the-art models are presented in Table 5.1. It can be seen that the MobileNet-Dense model achieves a QWK score of 0.825 on the EyePACS test set, which is slightly higher than the QWK score of 0.822 achieved by the MobileNetV2 model, while using 57% fewer Parameters, 20% fewer MAdds and 11% less running time. Let Model Ensemble (M+N) denote the ensemble of the top-M saved checkpoints of MobileNet-Dense and the top-N saved checkpoints of MobileNetV2. The same QWK score (0.852) is achieved by Model Ensemble (2+1), Model Ensemble (1+2) and Model Ensemble (2+2). Considering the MAdds and the Parameters, we select the result of Model Ensemble (2+1) as our final result for comparison with existing works [39, 37].
As shown in Table 5.1, the Zoom-In-Net [37] method achieves the highest QWK score (0.854) on the EyePACS test set. Unfortunately, we are not able to estimate the Parameters or MAdds of Zoom-In-Net based on the information provided in [37]. We therefore selected the top-ranked method in the Kaggle DR challenge (Method [39] in Table 5.1) as the benchmark method. Our Model Ensemble (2+1) achieves a QWK score of 0.852, compared to a QWK score of 0.849 achieved by the benchmark method [39], while using 32% fewer parameters and 73% fewer MAdds. These results indicate that our system is effective and efficient compared to the benchmark method.

5.1.2 Performance Evaluation for Each Class

As the QWK score measures the overall agreement between two raters' classification results, it is hard to interpret the model performance for each class from the QWK score alone. Therefore, we also investigate the performance of the proposed DR classification system for each DR severity class using the commonly used metrics of precision, recall and F1-score [77].

We first recall the concept of the confusion matrix, since the precision, recall and F1-score can be easily calculated from it. A confusion matrix C is a matrix in which rows represent the ground-truth labels and columns represent the classification results, as shown in Figure 5.1. The entry c_{i,j} is the number of samples with ground-truth class i that are predicted as class j by the classifier. The precision, recall and F1-score for class k are calculated as follows:

Precision_k = TP_k / (TP_k + FP_k)    (5.1)

Recall_k = TP_k / (TP_k + FN_k)    (5.2)

F1_k = 2 × (Precision_k × Recall_k) / (Precision_k + Recall_k)    (5.3)

where TP_k, FP_k and FN_k denote the numbers of true positives, false positives and false negatives for class k respectively. TP_k is the number of samples correctly predicted as class k by the classifier, i.e., c_{k,k}.
FP_k is the number of samples wrongly predicted as class k by the classifier, i.e., Σ_{i ≠ k} c_{i,k}. FN_k is the number of samples of class k wrongly predicted as some other class, i.e., Σ_{j ≠ k} c_{k,j}.

As the confusion matrices of the state-of-the-art methods [39, 37] on the EyePACS test set are not available, we only report the precision, recall and F1-score of our DR classification system. The confusion matrices of the predicted results on the EyePACS test set are given in Figure 5.1, and the corresponding precision, recall and F1-scores are shown in Table 5.2.

It can be seen from Table 5.2 that ensemble learning generally improves the classification performance for each class. Another noteworthy point is that our system achieves relatively low F1-scores on screening Mild DR and Severe DR. One possible reason is that the categorical predictions are obtained by thresholding the output values so as to maximize the overall QWK score rather than the F1-score of each class; adjusting the thresholds may therefore improve the F1-scores of these two classes.
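Equations (5.1)–(5.3) can be computed directly from a confusion matrix. The sketch below uses a toy 3-class matrix and our own helper name:

```python
import numpy as np

def per_class_scores(c):
    """Precision, recall and F1 per class from a confusion matrix C,
    where c[i, j] counts samples of true class i predicted as class j
    (Equations 5.1-5.3)."""
    tp = np.diag(c).astype(float)
    fp = c.sum(axis=0) - tp          # column sums minus the diagonal
    fn = c.sum(axis=1) - tp          # row sums minus the diagonal
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 3-class confusion matrix.
c = np.array([[5, 1, 0],
              [1, 3, 1],
              [0, 1, 4]])
precision, recall, f1 = per_class_scores(c)
```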
Figure 5.1: Confusion matrices of (a) MobileNetV2, (b) MobileNet-Dense and (c) Model Ensemble (2+1) on the EyePACS test set. Rows are ground-truth labels and columns are predicted labels.

(a) MobileNetV2
               Healthy  Mild  Moderate  Severe  Proliferative
Healthy         29293   1770      272      48       20
Mild             1482   1199      355       6        0
Moderate          899   1292     3169     835       87
Severe             19     30      297     493      138
Proliferative      25     49      144     271      477

(b) MobileNet-Dense
               Healthy  Mild  Moderate  Severe  Proliferative
Healthy         29402   1707      236      40       18
Mild             1487   1222      318      14        1
Moderate          900   1332     2504    1429      117
Severe             26     20      147     594      190
Proliferative      31     66       87     263      519

(c) Model Ensemble (2+1)
               Healthy  Mild  Moderate  Severe  Proliferative
Healthy         29441   1640      280      30       12
Mild             1333   1224      475      10        0
Moderate          675   1089     3217    1228       73
Severe             12     24      190     683       68
Proliferative      12     33      102     321      498

Table 5.2: Per-class performance on the EyePACS test set

               Precision                 Recall                    F1-score
Class          MobV2¹ Mob-D² Ensemble³   MobV2  Mob-D  Ensemble    MobV2  Mob-D  Ensemble
Healthy        0.92   0.92   0.94        0.93   0.94   0.94        0.93   0.93   0.94
Mild           0.28   0.28   0.31        0.39   0.40   0.40        0.32   0.33   0.35
Moderate       0.75   0.76   0.75        0.50   0.40   0.51        0.60   0.52   0.61
Severe         0.30   0.25   0.30        0.50   0.61   0.70        0.37   0.36   0.42
Proliferative  0.66   0.61   0.76        0.49   0.54   0.52        0.57   0.57   0.62

¹ MobV2 denotes the MobileNetV2 model.
² Mob-D denotes the MobileNet-Dense model.
³ Ensemble denotes Model Ensemble (2+1).

5.2 Performance Evaluation on the Messidor Database

In this section, we test the performance of the proposed automatic DR classification system on another widely used independent diabetic retinopathy dataset, namely the Messidor database [9], to show the generalization ability of the proposed system.
As the label distribution and image quality of the Messidor database differ from those of the EyePACS dataset, basic EDA of the Messidor database is first given, followed by a comparison of our results with the state-of-the-art results [36, 78, 37].

5.2.1 EDA of the Messidor Database

The Messidor database consists of 1200 retinal images. The retinal images are in one of three sizes: 1440 × 960, 2240 × 1488, or 2304 × 1536. A retinopathy grade is provided by an ophthalmologist for each retinal image in the Messidor database, and the grade ranges from 0 to 3. Randomly sampled retinal images with different DR grades from the Messidor database are shown in Figure 5.2; each row contains five retinal images with the same DR grade.

Figure 5.2: Random sample images from the Messidor database [9].

The distribution of DR grades in the Messidor database is shown in Figure 5.3: 547 images (46%) are labeled as grade 0, 153 (13%) as grade 1, 246 (20%) as grade 2, and 254 (21%) as grade 3.

Figure 5.3: Distribution of DR grades in the Messidor database.

5.2.2 Performance Evaluation

As the 1200 images in the Messidor database are not adequate to train a CNN model, Vo and Verma [36] suggested building classifiers using features extracted from CNN models trained on other DR datasets such as the EyePACS dataset. As the DR grade for the Messidor database ranges from 0 to 3 while the DR label for the EyePACS dataset ranges from 0 to 4, we adopt the same scheme as in [36, 37] and perform two binary classification tasks (Referable versus Non-Referable, and Normal versus Abnormal) to demonstrate the generalization ability.
For the Normal/Abnormal classification task, images with grade 0 are treated as normal and images with other grades are treated as abnormal. For the Referable/Non-Referable classification task, images with grades 0 and 1 are treated as non-referable and images with grades 2 and 3 are treated as referable. The training schemes and the results for these two binary classification tasks are described below:

• For the Normal/Abnormal classification task, following the scheme used in [36, 37], we trained a logistic classifier using the 256-dimensional features (i.e., the output of the last fully connected layer) extracted from the EyePACS training set, and tested the performance using the entire Messidor database. The AUC (area under the ROC curve) and Acc (accuracy) scores are used as the evaluation metrics, as shown in Table 5.3. The MobileNet-Dense model and the MobileNetV2 model achieve the same AUC score and the same Acc score. Small improvements are obtained by the ensemble of the MobileNet-Dense model and the MobileNetV2 model: the AUC score reaches 0.962 and the Acc score reaches 0.917. The results show that our system outperforms the Zoom-In-Net method in both AUC and Acc scores. The ROC curves of our methods are shown in Figure 5.4.

• For the Referable/Non-Referable classification task, following the scheme used in [36, 37], 10-fold cross-validation is performed on the entire Messidor database (the dataset is randomly partitioned into 10 subsets with the same number of samples in each subset; a single subset is retained as the test data, the remaining 9 subsets are used as training data, and this process is repeated 10 times so that each of the 10 subsets is used once as the test data). A logistic regression model is trained to perform the binary classification task.
The mean AUC score and mean Acc score are used as the evaluation metrics, as shown in Table 5.4. The mean AUC scores achieved by MobileNet-Dense and MobileNetV2 are 0.967 and 0.970 respectively, and the mean Acc scores achieved by MobileNet-Dense and MobileNetV2 are 0.922 and 0.923 respectively. A small improvement in the mean Acc score (from 0.923 to 0.924) is obtained by the ensemble of the MobileNet-Dense model and the MobileNetV2 model. The results show that our system outperforms the Zoom-In-Net method in both mean AUC and mean Acc scores. The ROC curves of our methods are shown in Figure 5.5.

Figure 5.4: ROC curves for the Normal/Abnormal task (MobileNet-Dense AUC = 0.959, MobileNetV2 AUC = 0.959, Model Ensemble AUC = 0.962).

Figure 5.5: ROC curves for the Referable/Non-Referable task (MobileNet-Dense AUC = 0.967, MobileNetV2 AUC = 0.970, Model Ensemble AUC = 0.970).
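The 10-fold partitioning used above can be sketched as follows. This is a NumPy-only illustration with our own helper name; in each round, a logistic classifier would be fit on the training folds and evaluated on the held-out fold.

```python
import numpy as np

def ten_fold_splits(n, seed=0):
    """Partition n samples into 10 equal folds and yield the
    (train_indices, test_indices) pair for each cross-validation round."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, 10)
    for k in range(10):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        yield train, test

# The Messidor database holds 1200 images, so each test fold has 120.
splits = list(ten_fold_splits(1200))
```

Each sample appears in exactly one test fold, so averaging the per-fold AUC and Acc over the 10 rounds uses every image once for testing, as described above.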
Table 5.3: Performance comparison on the Normal/Abnormal screening task

Method                  AUC    Acc
VNXK [36]               0.870  0.871
CKML [36]               0.862  0.858
Human Expert A [78]     0.922  -
Human Expert B [78]     0.865  -
Zoom-In-Net [37]        0.921  0.905
MobileNet-Dense         0.959  0.908
MobileNetV2             0.959  0.908
Model Ensemble (1+1)    0.962  0.917

Table 5.4: Performance comparison on the Referable/Non-Referable screening task

Method                  Mean AUC  Mean Acc
VNXK [36]               0.887     0.893
CKML [36]               0.891     0.897
Human Expert A [78]     0.94      -
Human Expert B [78]     0.92      -
Zoom-In-Net [37]        0.957     0.911
MobileNet-Dense         0.967     0.922
MobileNetV2             0.970     0.923
Model Ensemble (1+1)    0.970     0.924

In summary, our method outperforms the state-of-the-art Zoom-In-Net method on both the Referable/Non-Referable and the Normal/Abnormal screening tasks. These results indicate that our proposed methods generalize well when performing DR screening on retinal images from different sources.

5.3 Summary

In this chapter, the performance of our automatic DR classification system was presented. Results were obtained using the EyePACS test set and the Messidor database. On the EyePACS test set, the proposed system achieves a QWK score of 0.852, compared to a QWK score of 0.849 achieved by the benchmark method [39], while using 32% fewer parameters and 73% fewer MAdds. On the Messidor database, our system outperforms the state-of-the-art method [37] on both the Normal/Abnormal and the Referable/Non-Referable screening tasks. These results indicate that our system is effective and efficient, and generalizes well for screening retinal images.

Chapter 6

Conclusion

In this chapter, we summarize the main contributions.
Some possible directions for future work are outlined as well.

6.1 Main Contributions

• In Chapter 3, we proposed a novel MobileNet-Dense model based on the DenseNet model [8] and the MobileNetV2 model [4]. We demonstrated its effectiveness and efficiency using the CIFAR-10 and CIFAR-100 datasets. MobileNet-Dense and MobileNetV2 achieve similar Acc on both datasets, but MobileNet-Dense uses 62% fewer parameters and 36% fewer MAdds.

• In Chapter 4, we proposed an automatic DR classification system based on the ensemble of two computationally efficient CNN models, namely the proposed MobileNet-Dense model and the existing MobileNetV2 model [4]. The system is trained using the EyePACS training set.

• In Chapter 5, we evaluated the proposed DR classification system on the EyePACS dataset and the Messidor database. On the EyePACS dataset, the top-ranked method [39] in the Kaggle DR challenge was chosen as the benchmark method; our system achieves a test QWK score of 0.852, compared to a test QWK score of 0.849 achieved by the benchmark method, while using 32% fewer parameters and 73% fewer MAdds. On the Messidor database, our system achieves an AUC of 0.962 and an Acc of 0.917 on the Normal/Abnormal screening task, compared to an AUC of 0.921 and an Acc of 0.905 achieved by the state-of-the-art method [37]. On the Referable/Non-Referable screening task, our system achieves a mean AUC of 0.970 and a mean Acc of 0.924, compared to a mean AUC of 0.957 and a mean Acc of 0.911 achieved by the state-of-the-art method [37].

6.2 Future Work

Some possible extensions of our research work are listed below:

• Optimizing the CNN architecture: We proposed a novel MobileNet-Dense model which is constructed using 4 dense modules as the default setting.
A shallower MobileNet-Dense model (e.g., one with 3 dense modules) may achieve similar performance with fewer MAdds and Parameters, while a deeper MobileNet-Dense model may achieve noticeably better performance at an acceptable cost.

• Pruning the trained model: Another possible extension of our work is to further improve the computational efficiency of our DR classification system by pruning and retraining the trained MobileNet-Dense and MobileNetV2 models (removing the convolution filters in convolutional layers, or the weights in fully connected layers, that are least correlated with the final results), as it has been shown that network pruning and retraining reduce the number of MAdds and Parameters of a model while maintaining similar performance in various applications and on benchmark datasets [79, 80].

• Enhancing the image contrast: As mentioned in Section 4.1, the EyePACS dataset contains some poorly exposed images. It would be interesting to investigate whether the overall classification performance of our system could be significantly improved by enhancing the image contrast using the adaptive histogram equalization algorithm or the contrast limited adaptive histogram equalization algorithm [81].

• Stacking MobileNet-Dense with different CNN models: Both MobileNetV2 and MobileNet-Dense use the depthwise separable convolutional layer as the basic building block, and the two models achieve similar performance on both the EyePACS dataset and the Messidor database. The features extracted from these two models might therefore be highly correlated, which results in limited improvement when performing ensemble learning.
It would be interesting to investigate whether we can improve the performance of the DR classification system by stacking the MobileNet-Dense model with other types of computationally efficient CNN models which are not constructed using depthwise separable convolutional layers (e.g., SqueezeNet [51]).

Bibliography

[1] "Indian Diabetic Retinopathy Image Dataset (IDRiD) website," https://idrid.grand-challenge.org/grading/, accessed: 2018-05-30.
[2] K. He et al., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 770–778.
[3] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[4] M. Sandler et al., "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018, pp. 4510–4520.
[5] "Kaggle Diabetic Retinopathy Detection," https://www.kaggle.com/c/diabetic-retinopathy-detection/data, accessed: 2018-04-30.
[6] T. Kauppi et al., "The DIARETDB1 diabetic retinopathy database and evaluation protocol," in Proceedings of Medical Image Understanding and Analysis (MIUA), vol. 1, 2007, pp. 1–10.
[7] A. Krizhevsky et al., "ImageNet classification with deep convolutional neural networks," in Proceedings of the International Conference on Neural Information Processing Systems (NIPS). Curran Associates, Inc., 2012, pp. 1097–1105.
[8] G. Huang et al., "Densely connected convolutional networks," in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 2261–2269.
[9] E. Decencière et al., "Feedback on a publicly distributed image database: the Messidor database," Image Analysis and Stereology, vol. 33, no. 3, pp. 231–234, 2014.
[10] F. Chentli, S. Azzoug, and S. Mahgoun, "Diabetes mellitus in elderly," Indian Journal of Endocrinology and Metabolism, vol. 19, no. 6, p. 744, 2015.
[11] D. M. Nathan, "Long-term complications of diabetes mellitus," New England Journal of Medicine, vol. 328, no. 23, pp. 1676–1685, 1993.
[12] Y. Zheng et al., "The worldwide epidemic of diabetic retinopathy," Indian Journal of Ophthalmology, vol. 60, no. 5, pp. 428–431, 2012.
[13] "Facts about diabetic eye disease," https://nei.nih.gov/health/diabetic/retinopathy, accessed: 2019-03-20.
[14] Early Treatment Diabetic Retinopathy Study Research Group et al., "Classification of diabetic retinopathy from fluorescein angiograms: ETDRS report number 11," Ophthalmology, vol. 98, no. 5, pp. 807–822, 1991.
[15] E. Chew et al., "American Academy of Ophthalmology retina panel: Preferred practice patterns," American Academy of Ophthalmology, 2003.
[16] C. I. Sánchez et al., "Retinal image analysis to detect and quantify lesions associated with diabetic retinopathy," in Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), vol. 1. IEEE, 2004, pp. 1624–1627.
[17] V. Esmann et al., "Types of exudates in diabetic retinopathy," Acta Medica Scandinavica, vol. 174, no. 3, pp. 375–384, 1963.
[18] R. Klein et al., "The Wisconsin epidemiologic study of diabetic retinopathy: VII. Diabetic nonproliferative retinal lesions," Ophthalmology, vol. 94, no. 11, pp. 1389–1400, 1987.
[19] J. Kaur and D. H. Sinha, "Automated detection of diabetic retinopathy using fundus image analysis," International Journal of Computer Science and Information Technologies, vol. 3, p. 4794, 2012.
[20] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[21] H. Bay et al., "SURF: Speeded up robust features," in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2006, pp. 404–417.
[22] C. Geng and X. Jiang, "Face recognition based on the multi-scale local image structures," Pattern Recognition, vol. 44, no. 10-11, pp. 2565–2575, 2011.
[23] L. J. Zhi et al., "Medical image retrieval using SIFT feature," in Proceedings of the International Congress on Image and Signal Processing (CISP). IEEE, 2009, pp. 1–4.
[24] C. Sinthanayothin et al., "Automated detection of diabetic retinopathy on digital fundus images," Diabetic Medicine, vol. 19, no. 2, pp. 105–112, 2002.
[25] A. Singalavanija et al., "Feasibility study on computer-aided screening for diabetic retinopathy," Japanese Journal of Ophthalmology, vol. 50, no. 4, pp. 361–366, 2006.
[26] S. C. Lee et al., "Computer classification of nonproliferative diabetic retinopathy," Archives of Ophthalmology, vol. 123, no. 6, pp. 759–764, 2005.
[27] P. Kahai et al., "A decision support framework for automated screening of diabetic retinopathy," International Journal of Biomedical Imaging, vol. 2006, pp. 1–8, 2006.
[28] B. Antal et al., "An ensemble-based system for microaneurysm detection and diabetic retinopathy grading," IEEE Transactions on Biomedical Engineering, vol. 59, no. 6, pp. 1720–1726, 2012.
[29] H. Pratt et al., "Convolutional neural networks for diabetic retinopathy," Procedia Computer Science, vol. 90, pp. 200–205, 2016.
[30] S. izza Rufaida and M. I. Fanany, "Residual convolutional neural network for diabetic retinopathy," in Proceedings of the IEEE International Conference on Advanced Computer Science and Information Systems (ICACSIS). IEEE, 2017, pp. 367–374.
[31] J. Cohen, "Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit," Psychological Bulletin, vol. 70, no. 4, pp. 213–220, 1968.
[32] D. Zhang et al., "Diabetic retinopathy classification using deeply supervised ResNet," in Proceedings of IEEE SmartWorld, Ubiquitous Intelligence and Computing, Advanced and Trusted Computed, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). IEEE, 2017.
[33] M. Alban and T. Gilligan, "Automated detection of diabetic retinopathy using fluorescein angiography photographs," Report of Stanford Education, 2016.
[34] C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015, pp. 1–9.
[35] S. Suriyal et al., "Mobile assisted diabetic retinopathy detection using deep neural network," in Proceedings of the Global Medical Engineering Physics Exchanges/Pan American Health Care Exchanges (GMEPE/PAHCE). IEEE, 2018, pp. 1–4.
[36] H. H. Vo and A. Verma, "New deep neural nets for fine-grained diabetic retinopathy recognition on hybrid color space," in Proceedings of the IEEE International Symposium on Multimedia (ISM). IEEE, 2016, pp. 209–215.
[37] Z. Wang et al., "Zoom-in-Net: Deep mining lesions for diabetic retinopathy detection," in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 2017, pp. 267–275.
[38] X. Zhang and O. Chutatape, "A SVM approach for detection of hemorrhages in background diabetic retinopathy," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), vol. 4. IEEE, 2005, pp. 2435–2440.
[39] B. Graham, "Competition report of Minpooling," https://www.kaggle.com/c/diabetic-retinopathy-detection/discussion/15801, 2015, online; accessed 30 January 2019.
[40] S. Agatonovic-Kustrin and R. Beresford, "Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research," Journal of Pharmaceutical and Biomedical Analysis, vol. 22, no. 5, pp. 717–727, 2000.
[41] S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.
[42] S. Shi and X.
Chu, “Speeding up convolutional neural networks by exploitingthe sparsity of rectifier units,” arXiv preprint arXiv:1704.07724, 2017.[43] D. E. Rumelhart et al., “Learning internal representations by error propagation,”California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.[44] C. Davis, “The norm of the schur product operation,” Numerische Mathematik,vol. 4, no. 1, pp. 343–344, 1962.[45] Y. N. Dauphin et al., “Identifying and attacking the saddle point problem inhigh-dimensional non-convex optimization,” in Proceeding of Neural InformationProcessing Systems (NIPS), 2014, pp. 2933–2941.[46] O. M. Parkhi et al., “Deep face recognition,” in Proceedings of the British Ma-chine Vision Conference(BMVC), vol. 41, no. 3. British Machine Vision Asso-ciation, 2015, pp. 1–12.[47] R. Girshick et al., “Rich feature hierarchies for accurate object detection andsemantic segmentation,” in Proceedings of the IEEE conference on ComputerVision and Pattern Recognition (CVPR). IEEE, 2014, pp. 580–587.89Bibliography[48] I. Goodfellow et al., Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.[49] M. Abadi, A. Agarwal, P. Barham et al., “TensorFlow: Large-Scalemachine learning on heterogeneous systems,” 2015, software available fromtensorflow.org. [Online]. Available: http://tensorflow.org/[50] P. Molchanov et al., “Pruning convolutional neural networks for resource efficientinference,” arXiv preprint arXiv:1611.06440, 2016.[51] F. N. Iandola et al., “Squeezenet: Alexnet-level accuracy with 50x fewer param-eters and< 0.5 mb model size,” arXiv preprint arXiv:1602.07360, 2016.[52] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scaleimage recognition,” arXiv preprint arXiv:1409.1556, 2014.[53] Z. Liu et al., “Learning efficient convolutional networks through network slim-ming,” in Proceedings of the IEEE International Conference on Computer Vision(ICCV), 2017, pp. 2736–2744.[54] L. Sifre and S. 
Mallat, “Rigid-motion scattering for image classification,” Ph.D.dissertation, Citeseer, 2014.[55] Y. LeCun et al., “Gradient-based learning applied to document recognition,”Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.[56] L. Perez and J. Wang, “The effectiveness of data augmentation in image classi-fication using deep learning,” arXiv preprint arXiv:1712.04621, 2017.[57] L. Rokach, “Ensemble-based classifiers,” Artificial Intelligence Review, vol. 33,no. 1-2, pp. 1–39, 2010.[58] K. He and J. Sun, “Convolutional neural networks at constrained time cost,” inProceedings of the IEEE conference on Computer Vision and Pattern Recognition(CVPR). IEEE, 2015, pp. 5353–5360.[59] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXivpreprint arXiv:1505.00387, 2015.[60] G. Pleiss et al., “Memory-efficient implementation of Densenets,” arXiv preprintarXiv:1707.06990, 2017.[61] J. T. Springenberg et al., “Striving for simplicity: The all convolutional net,”arXiv preprint arXiv:1412.6806, 2014.[62] C.-Y. Lee et al., “Deeply-supervised nets,” in Proceedings of Machine LearningResearch (MLR), 2015, pp. 562–570.90Bibliography[63] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXivpreprint arXiv:1412.6980, 2014.[64] C. H. Yu, “Exploratory data analysis,” Methods, vol. 2, pp. 131–160, 1977.[65] S. Dodge and L. Karam, “Understanding how image quality affects deep neuralnetworks,” in Proceedings of 8th International Conference on Quality of Multi-media Experience (QoMEX). IEEE, 2016, pp. 1–6.[66] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactionson Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2008.[67] L. Taylor and G. Nitschke, “Improving deep learning using generic data augmen-tation,” arXiv preprint arXiv:1708.06020, 2017.[68] S. Van der Walt et al., “scikit-image: Image processing in Python,” PeerJ, vol. 2,p. e453, 2014.[69] C. Winship and R. D. 
Mare, “Regression models with ordinal variables,” Amer-ican sociological review, pp. 512–525, 1984.[70] A. Mathis and B. Stephan, “Competition report of TeamoO,” https://www.kaggle.com/c/diabetic-retinopathy-detection/discussion/15617, 2015, on-line; accessed 30 January 2019.[71] M. J. Powell, “An efficient method for finding the minimum of a function ofseveral variables without calculating derivatives,” The computer journal, vol. 7,no. 2, pp. 155–162, 1964.[72] T. Hinz et al., “Speeding up the hyperparameter optimization of deep convo-lutional neural networks,” International Journal of Computational Intelligenceand Applications, p. 1850008, 2018.[73] C. Zhang and Y. Ma, Ensemble machine learning: methods and applications.Springer, 2012.[74] X. Yuan et al., “A regularized ensemble framework of deep learning for can-cer detection from multi-class, imbalanced training data,” Pattern Recognition,vol. 77, pp. 160–172, 2018.[75] A. Ortiz et al., “Ensembles of deep learning architectures for the early diagnosisof the alzheimer’s disease,” International Journal of Neural Systems, vol. 26,no. 07, p. 1650025, 2016.[76] D. H. Wolpert, “Stacked generalization,” Neural networks, vol. 5, no. 2, pp. 241–259, 1992.91Bibliography[77] D. M. Powers, “Evaluation: from precision, recall and f-measure to roc, informed-ness, markedness and correlation,” 2011.[78] C. I. Sánchez et al., “Evaluation of a computer-aided diagnosis system for diabeticretinopathy screening on public data,” Investigative Ophthalmology and VisualScience, vol. 52, no. 7, pp. 4866–4871, 2011.[79] S. Han et al., “Deep compression: Compressing deep neural networks with prun-ing, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149,2015.[80] H. Li et al., “Pruning filters for efficient convnets,” arXiv preprintarXiv:1608.08710, 2016.[81] S. M. Pizer et al., “Adaptive histogram equalization and its variations,” Computervision, graphics, and image processing, vol. 39, no. 3, pp. 355–368, 1987.92