{"http:\/\/dx.doi.org\/10.14288\/1.0431094":{"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool":[{"value":"Applied Science, Faculty of","type":"literal","lang":"en"},{"value":"Electrical and Computer Engineering, Department of","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/dataProvider":[{"value":"DSpace","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#degreeCampus":[{"value":"UBCV","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/creator":[{"value":"Mokhtari, Masoud","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/issued":[{"value":"2023-04-18T19:22:17Z","type":"literal","lang":"en"},{"value":"2023","type":"literal","lang":"en"}],"http:\/\/vivoweb.org\/ontology\/core#relatedDegree":[{"value":"Master of Applied Science - MASc","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#degreeGrantor":[{"value":"University of British Columbia","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/description":[{"value":"Machine learning frameworks for medical applications must be explainable, generalizable despite the scarcity of training data, and able to tackle various clinical tasks with minimal modifications. Explainability is crucial for safety-critical applications, as users must recognize when human supervision is needed. Additionally, developing strong inductive bias from sparsely labeled data is essential, given that large-scale medical datasets are not widely available. In a clinical setting, numerous metrics are measured daily, making it logistically challenging to maintain separate models for individual metrics. This highlights the need for flexible frameworks that preserve explainability. In this thesis, we address these requirements by proposing three frameworks that harness the representation power of graph neural networks (GNNs) or transformers, improving the state-of-the-art and enhancing the practicality of machine learning in medical applications.\r\n\r\nOur first framework aims to provide explainability in the prediction pipeline. We demonstrate its effectiveness using the task of left ventricular ejection fraction estimation from echocardiographic videos. This framework employs GNNs to learn a weighted graph between the frames of an input echocardiogram before producing a single ejection fraction estimate. Our results show that the learned latent structure aligns with clinical guidelines for predicting ejection fraction and can serve as a surrogate for the model's confidence in its predictions.\r\n\r\nThe second framework improves model generalizability for sparsely labeled data using GNNs. We apply the framework to the task of clinical landmark detection, where only a small number of frames in a video are labeled. To maximize the use of supervisory signals, we employ a multi-scale objective function and a hierarchical graph structure. Our results indicate that this approach builds better inductive bias and outperforms previous work.\r\n\r\nLastly, we propose a flexible framework that offers attention-based explainability on multiple levels, making it suitable for various clinical tasks. This framework utilizes Transformers, a special instance of GNNs, to capture patch-wise, frame-wise, and video-wise interactions in echocardiographic data. This approach aids in identifying pertinent information for a specific clinical metric. 
To showcase the flexibility of this framework, we consider two critical cardiac tasks: aortic stenosis detection and ejection fraction estimation.","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/aggregatedCHO":[{"value":"https:\/\/circle.library.ubc.ca\/rest\/handle\/2429\/84247?expand=metadata","type":"literal","lang":"en"}],"http:\/\/www.w3.org\/2009\/08\/skos-reference\/skos.html#note":[{"value":"Graph Neural Networks and Transformers for EnhancedExplainability and Generalizability in Medical MachineLearningbyMasoud MokhtariBASc, The University of British Columbia, 2021A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of Applied ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)April 2023\u00a9 Masoud Mokhtari, 2023The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:Graph Neural Networks and Transformers for Enhanced Explainabil-ity and Generalizability in Medical Machine Learningsubmitted by Masoud Mokhtari in partial fulfillment of the requirements for thedegree of Master of Applied Science in Electrical and Computer Engineering.Examining Committee:Purang Abolmaesumi, Professor, Department of Electrical and Computer Engi-neering, UBCCo-supervisorRenjie Liao, Professor, Department of Electrical and Computer Engineering, UBCCo-supervisorTim Salcudean, Professor, Department of Electrical and Computer Engineering,UBCSupervisory Committee MemberiiAbstractMachine learning frameworks for medical applications must be explainable, gen-eralizable despite the scarcity of training data, and able to tackle various clinicaltasks with minimal modifications. Explainability is crucial for safety-critical appli-cations, as users must recognize when human supervision is needed. Additionally,developing strong inductive bias from sparsely labeled data is essential, given thatlarge-scale medical datasets are not widely available. In a clinical setting, nu-merous metrics are measured daily, making it logistically challenging to maintainseparate models for individual metrics. This highlights the need for flexible frame-works that preserve explainability. In this thesis, we address these requirements byproposing three frameworks that harness the representation power of graph neuralnetworks (GNNs) or transformers, improving the state-of-the-art and enhancingthe practicality of machine learning in medical applications.Our first framework aims to provide explainability in the prediction pipeline.We demonstrate its effectiveness using the task of left ventricular ejection frac-tion estimation from echocardiographic videos. This framework employs GNNsto learn a weighted graph between the frames of an input echocardiogram beforeproducing a single ejection fraction estimate. Our results show that the learnedlatent structure aligns with clinical guidelines for predicting ejection fraction andcan serve as a surrogate for the model\u2019s confidence in its predictions.The second framework improves model generalizability for sparsely labeleddata using GNNs. We apply the framework to the task of clinical landmark detec-tion, where only a small number of frames in a video are labeled. To maximizethe use of supervisory signals, we employ a multi-scale objective function and ahierarchical graph structure. 
The second framework improves model generalizability for sparsely labeled data using GNNs. We apply the framework to the task of clinical landmark detection, where only a small number of frames in a video are labeled. To maximize the use of supervisory signals, we employ a multi-scale objective function and a hierarchical graph structure. Our results indicate that this approach builds better inductive bias and outperforms previous work.

Lastly, we propose a flexible framework that offers attention-based explainability on multiple levels, making it suitable for various clinical tasks. This framework utilizes Transformers, a special instance of GNNs, to capture patch-wise, frame-wise, and video-wise interactions in echocardiographic data. This approach aids in identifying pertinent information for a specific clinical metric. To showcase the flexibility of this framework, we consider two critical cardiac tasks: aortic stenosis detection and ejection fraction estimation.

Lay Summary

Machine learning models have great potential to be deployed for medical applications, but they must be explainable, flexible, and work well even with limited data. This thesis proposes three innovative approaches to tackle these challenges using graph neural networks and Transformers, which are considered some of today's most advanced machine learning techniques. The first approach focuses on making predictions more explainable through a learned graphical structure, using an example of estimating heart function from echocardiogram videos. The second approach improves the model's ability to work with limited data by using a hierarchical structure, demonstrated through detecting clinical landmarks in echocardiograms. The final approach offers a flexible framework that can be easily modified for different clinical tasks while maintaining explainability. These approaches make machine learning models more practical for real-world clinical applications, as they provide explainability, work with limited data, and can be tailored to different tasks.

Preface

The studies presented in this thesis were conducted at the Robotics and Control Laboratory at the University of British Columbia under the supervision of Dr. Purang Abolmaesumi and Dr. Renjie Liao. Our methods and data usage were approved by the University of British Columbia's Research Ethics Board, with certificate number H20-00365.

This thesis comprises one published work and two manuscripts under review. Chapters 2 to 4 are modified versions of these works, as detailed below.

A version of the study presented in Chapter 2 has been published as:

M. Mokhtari, T. Tsang, P. Abolmaesumi, and R. Liao. EchoGNN: Explainable ejection fraction estimation with graph neural networks. In L. Wang, Q. Dou, P. T. Fletcher, S. Speidel, and S. Li, editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2022, pages 360–369, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-16440-8.

The author's contribution involved developing and evaluating the methodology and preparing the manuscript. Dr. Tsang provided clinical expertise, while Drs. Abolmaesumi and Liao supervised the technical aspects of the project and offered constructive feedback throughout the process.

A version of the study presented in Chapter 3 has been submitted to a peer-reviewed conference and is currently under consideration for publication. The version presented in this thesis is the extended version with additional results and information.

M. Mokhtari, M. Mahdavi, H. Vaseli, C. Luong, P. Abolmaesumi, T. Tsang, and R. Liao. EchoGLAD: Hierarchical graph neural networks for left ventricle landmark detection on echocardiograms. Preprint, 2023.
The author's contributions included designing, developing, and evaluating a major portion of the proposed framework, as well as writing the manuscript. Hooman Vaseli and Mobina Mahdavi provided valuable support in the development of the framework and the evaluation of the baselines. Drs. Luong and Tsang offered crucial clinical expertise, while Drs. Abolmaesumi and Liao supervised the technical aspects of the project.

A version of the study presented in Chapter 4 has been submitted to a peer-reviewed conference and is currently under consideration for publication:

M. Mokhtari, N. Ahmadi, T. Tsang, P. Abolmaesumi, and R. Liao. GEMTrans: A general, echocardiography-based, multi-level transformer framework for cardiovascular diagnosis. Preprint, 2023.

The author's contributions included designing, developing, and evaluating the core of the proposed framework, as well as writing the manuscript. Neda Ahmadi worked on the contributions involving prototypical learning that are outside the scope of this thesis (and therefore not presented in detail) and provided valuable support in establishing baselines. Dr. Tsang offered valuable clinical expertise, while Drs. Abolmaesumi and Liao supervised the technical aspects of the project.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Graph Neural Networks
    1.1.1 Graph Convolutional Networks
    1.1.2 Message Passing Paradigm
    1.1.3 Transformers
  1.2 Clinical Details
    1.2.1 Echocardiography
    1.2.2 Left Ventricle Assessment
  1.3 Thesis Objectives and Contributions
    1.3.1 Thesis Outline
2 Explainable Ejection Fraction Estimation with Graph Neural Networks
  2.1 Introduction and Related Works
    2.1.1 Contributions
  2.2 Method
    2.2.1 Problem Setup
    2.2.2 Proposed Framework
    2.2.3 Training and Objective Function
  2.3 Experiments
    2.3.1 Dataset
    2.3.2 Implementation Details
    2.3.3 Results and Discussion
    2.3.4 Ablation Study
  2.4 Conclusion
3 Hierarchical Graph Neural Networks for Left Ventricle Landmark Detection
  3.1 Introduction and Related Works
    3.1.1 Contributions
  3.2 Method
    3.2.1 Problem Setup
    3.2.2 Proposed Framework
    3.2.3 Training and Objective Function
  3.3 Experiments
    3.3.1 Dataset
    3.3.2 Implementation Details
    3.3.3 Results and Discussion
    3.3.4 Ablation Study
  3.4 Conclusion
4 General, Echocardiogram-Based, Multi-Level Transformer Framework for Cardiovascular Diagnoses
  4.1 Introduction and Related Works
    4.1.1 Contributions
  4.2 Method
    4.2.1 Problem Setup
    4.2.2 Proposed Framework
    4.2.3 Training and Objective Function
  4.3 Experiments
    4.3.1 Dataset
    4.3.2 Implementation Details
    4.3.3 Results and Discussion
    4.3.4 Ablation Study
  4.4 Conclusion
5 Conclusions and Future Work
  5.1 Proposed Framework for Future Work
Bibliography
A Supporting Materials
  A.1 EchoGNN
  A.2 EchoGLAD
    A.2.1 Training Details of Prior Works
    A.2.2 Additional Quantitative Results
    A.2.3 Additional Qualitative Results
  A.3 GEMTrans

List of Tables

Table 2.1: Summary of quantitative results. Lower values are better for all metrics except R2 and F1. Ouyang et al. (AF) average predictions over all possible 32-frame clips in a sampled video. Reynaud et al. (R) and (M) are transformer-based models with different sampling techniques. Esfeh et al. use Bayesian neural networks (BNNs). We mark the models that cannot predict end-systolic (ES) and end-diastolic (ED) locations with "-" in the average frame distance (AFD) metric. EchoGNN is the only model that provides explainability and ES/ED location estimations without direct supervision.
Table 2.2: Ablation study results. The Aug., Class., and Pretrain columns indicate whether the model employs data augmentation, classification loss, and pretraining, respectively. We observe that the classification loss enhances performance for under-represented groups, while pretraining and data augmentation reduce overall ejection fraction (EF) error.

Table 3.1: Quantitative results on the private test set for models trained on the private training set in terms of mean absolute error (MAE), mean percentage error (MPE), and success detection rate (SDR). Lower values for MAE and MPE, and higher values for SDR, are better. Our model has the best average performance over the three measurements, which shows its superiority in the in-distribution (ID) setting for the high-data regime.

Table 3.2: Quantitative results on the public UIC test set for models trained on the public UIC training set in terms of mean absolute error (MAE), mean percentage error (MPE), and success detection rate (SDR). Lower values for MAE and MPE, and higher values for SDR, are better. Although the number of training samples is much lower for the Unity Imaging Collaborative (UIC) dataset compared to our private dataset, our model still outperforms previous works on average over the three measurements, which showcases its accuracy in the low-data, in-distribution (ID) setting. We were not able to train Lin et al.'s model [48] on this dataset, since it relies on consistency among frames in a video, whereas this dataset only contains individual frames.

Table 3.3: Quantitative results on the public UIC test set for models trained on the private training set in terms of mean absolute error (MAE), mean percentage error (MPE), and success detection rate (SDR). Lower values for MAE and MPE, and higher values for SDR, are better. This table shows the out-of-distribution (OOD) performance of the models when trained on a larger dataset and tested on a smaller external dataset. In this case, our model outperforms previous works by a large margin, which attests to the generalizability of our framework.

Table 3.4: Ablation results on the validation set of our private dataset for the model architecture. Vanilla U-Net uses the output of a simple U-Net model for segmentation, while U-Net Main Graph only uses the pixel-level graph. Main Model is our model with the proposed hierarchical approach. Lastly, Single-Scale Loss uses the same framework as the Main Model but only computes the loss for the model's predictions on the main graph during training. The addition of a hierarchical representation and the use of a multi-scale loss improve performance.

Table 3.5: Ablation results on the validation set of our private dataset for different node feature extraction methods. The U-Net-based method outperforms the others.

Table 4.1: Quantitative results for ejection fraction (EF) on the test set. LV Biplane dataset results for models not supporting multi-video training are indicated by "-". MAE is the mean absolute error, and R2 indicates the variance captured by the model.

Table 4.2: Quantitative results for aortic stenosis (AS) on the test set. Severity is a four-class classification task, while Detection involves the binary detection of AS.
Table 4.3: Ablation study on the validation set of EchoNet-Dynamic. Both spatial and temporal attention supervision are effective for ejection fraction (EF) estimation, while the model does not converge without pretraining the Vision Transformer (ViT). MAE is the mean absolute error, and R2 indicates the variance captured by the model.

Table A.1: Quantitative results on the private test set for models trained on the private training set (in-distribution (ID), high-data regime) in terms of success detection rate (SDR) for the inter-ventricular septal (IVS) and left ventricular posterior wall (LVPW) measurements. Higher values for SDR are better. Our model outperforms prior works on both measurements across all thresholds.

Table A.2: Quantitative results on the Unity Imaging Collaborative (UIC) test set for models trained on the UIC training set (ID, low-data regime) in terms of success detection rate (SDR) for the inter-ventricular septal (IVS) and left ventricular posterior wall (LVPW) measurements. Higher values for SDR are better. While Chen et al. [11] outperform all models for IVS, our model has better LVPW accuracy compared to prior works. It must be noted that Lin et al. [48] require input videos for training, whereas UIC only contains individual frames; they are therefore excluded from this table.

Table A.3: Quantitative results on the public Unity Imaging Collaborative (UIC) test set for models trained on the private training set (out-of-distribution (OOD) setting) in terms of success detection rate (SDR) for the inter-ventricular septal (IVS) and left ventricular posterior wall (LVPW) measurements. Higher values for SDR are better. Our model outperforms the state-of-the-art (SOTA) for LVPW on average across the thresholds. For IVS, while McCouat et al. [53] perform better at lower thresholds, our model is robust to outliers, showing a higher SDR at the 3 and 6 millimeter (mm) thresholds.

List of Figures

Figure 1.1: A graph, denoted by G(V,E), consists of a set of nodes V and the set of edges E between them. Graph data can be represented by an adjacency matrix A, which indicates connections between nodes, and a feature matrix X containing node-specific information. Graph neural network (GNN) layers learn node embeddings that transform the input node features (X) while incorporating the structural information of the graph (A). These learned embeddings H can be utilized for various node-level, edge-level, and graph-level tasks.

Figure 1.2: Echocardiography (echo), an ultrasound (US) imaging modality, is widely used due to its non-invasive, cost-effective, and portable nature. There are around 14 echo views. (A) We display an example of an apical four-chamber (A4C) view of the heart. The left ventricular ejection fraction (LVEF) is typically estimated using this view because of the visibility of the left ventricle (LV). (B) We present an example of an apical two-chamber (A2C) view of the heart and an echo frame depicting various chambers. (C) We show an example of a parasternal long-axis (PLAX) echo, which can be used to obtain important LV measurements or assess the function of the aortic valve (AV). (D) We provide an example of a parasternal short-axis (PSAX) echo where the function of the AV can be assessed due to its visibility in this view. (All figures are sourced from Wikimedia Commons.)
Figure 1.3: The 14 standard echocardiography (echo) views are obtained based on how the transducer is positioned, as a combination of the chosen acoustic window and the imaging plane. (A) We demonstrate various acoustic windows, representing the position of the transducer on the patient. (B) We illustrate different imaging planes of the heart, which refer to the orientation of the transducer with respect to the axis of the left ventricle (LV). (All figures are taken from [57] with permission from Elsevier.)

Figure 1.4: The left ventricle (LV) measurements, including left ventricular internal diameter (LVID), inter-ventricular septal (IVS), and left ventricular posterior wall (LVPW) thickness, are usually characterized on a parasternal long-axis (PLAX) echocardiography (echo) frame by placing either four or six landmarks on the frame. (The original figure is sourced from Wikimedia Commons; we modified it to highlight the measurements.)

Figure 2.1: EchoGNN has three main components. (1) Video Encoder: encodes video frames into vector embeddings while preserving the temporal dimension. (2) Attention Encoder: infers weights over the nodes (video frames) and edges (relationships among frames) of the echo-graph. (3) Graph Regressor: estimates ejection fraction (EF) using the inferred weighted graph. This figure shows an example where each patient has an apical two-chamber (A2C) and an apical four-chamber (A4C) echo video.

Figure 2.2: Video Encoder network architecture. We use modular blocks containing 3D convolutions with residual connections to generate low-dimensional frame embeddings.

Figure 2.3: Explainability through the learned weighted graph. (Top) An example where the model has learned the periodic nature of the data, and the learned weights enable identification of end-systolic (ES) and end-diastolic (ED) locations. (Bottom) Another example where the left ventricle (LV) region is cropped (as indicated by the arrow), and the learned weights are distributed more evenly, suggesting the need for expert intervention.

Figure 2.4: End-systolic (ES) and end-diastolic (ED) frame approximation from learned echo-graph weights. We first use a threshold to convert the sum of outgoing edge weights into a binary format (alternatively, frame weights can be used); this threshold is selected based on average frame distance (AFD) performance on the validation set. Consecutive 1-valued weights form a block. The left-most and right-most frames in each block are the approximated ED and ES locations, respectively. We reject samples where the size of the block is equal to 55, indicating that the model has not learned the periodic nature of the data. By rejecting these samples, we achieve an average frame distance of 4.15 for ES and 3.68 for ED.

Figure 3.1: (A) Inter-ventricular septal (IVS), left ventricular internal diameter (LVID), and left ventricular posterior wall (LVPW) measurements visualized on a parasternal long-axis (PLAX) echocardiography (echo) frame: four landmark coordinates are normally enough to characterize these measurements. (B) Left ventricle (LV) landmark label smoothing example: if the wall landmark labels (e.g., within the circle) are smoothed by an isotropic Gaussian distribution, points along the visualized wall and points perpendicular to it are penalized equally. Ideally, if the model learns the edge of the wall, it should be penalized less.
Figure 3.2: Overview of our proposed model architecture. Hierarchical Feature Construction provides node features for the hierarchical graph representation of each echo frame, where the nodes in the main graph correspond to pixels in the image and the nodes in the auxiliary graphs correspond to patches in the image. Graph neural networks (GNNs) are used to process the hierarchical graph representation and produce node embeddings for the auxiliary graphs and the main graph. Multi-layer perceptrons (MLPs) are followed by a Sigmoid output function to map the node embeddings into landmark heatmaps of different granularity over the input echo frame.

Figure 3.3: An example of the model's prediction for a single input echo. Here, we show the model's prediction for the case where only three auxiliary graphs are used. The model learns the left ventricle (LV) landmarks at different resolutions to achieve high accuracy on the main pixel-level task. We show zoomed-in versions of the higher-resolution task to enable comparison. The patch size for each image is also shown in the figure.

Figure 3.4: A convolutional neural network (CNN) is initially used to expand the number of feature maps. The intermediate features of the decoder part of a U-Net are then used as node features such that deeper representations correspond to node features of finer graphs. This way, the node features correspond to proper patches in the original image, while also providing richer information due to the increasing size of the receptive field.

Figure 3.5: Examples demonstrating the distribution difference between the private dataset and the Unity Imaging Collaborative (UIC) public dataset. While the private dataset contains mostly clean samples that are free of extra annotations and markings, UIC's dataset contains samples that include the Doppler window (A, B) or other annotations outside the image area (C, D, E). Additionally, while most samples in the private dataset are properly aligned in the frame, there are cases in UIC where the LV is placed uncharacteristically at the top of the image (C).

Figure 3.6: Qualitative visualization of our model on two failure cases from the test set of our private dataset. Failure Example 1 is a low-quality PLAX image that also corresponds to a patient with severe LVH, a scenario that occurs rarely in our dataset. Failure Example 2 belongs to a case with a low-quality PLAX view with unclear boundaries for the walls and the chambers of the LV.

Figure 3.7: Qualitative ablation results for the model architecture. Landmark heatmaps from top to bottom are color-coded with red, cyan, pink, and green, respectively. Vanilla U-Net struggles to make confident and accurate landmark predictions. While the addition of a main grid graph in Main Graph relatively increases the model's performance, it still does not produce accurate results. In contrast, the Main Model produces confident prediction heatmaps by relying on a hierarchical graph representation as well as multi-scale objectives. We also see that the removal of the multi-scale objective (Single-Scale Loss) degrades performance.
Figure 3.8: Different approaches to node feature construction. A convolutional neural network (CNN) is initially used to expand the number of feature maps. Different feature construction methods can then be employed: (A) 2D average pooling layers with different kernel sizes are used to generate features for nodes of auxiliary graphs with different coarseness levels. (B) Multiple CNN layers are used to transform the image, and the intermediate features are used as node features such that deeper layers contain the features for coarser graphs. (C) The intermediate features of the decoder part of a U-Net are used as node features such that deeper representations correspond to node features of finer graphs.

Figure 4.1: GEMTrans overview. The multi-level transformer network processes one or multiple echocardiography (echo) videos and is composed of three main components. The Spatial Transformer Encoder (STE) produces attention among patches in the same image frame, while the Temporal Transformer Encoder (TTE) captures the temporal dependencies among the frames of each video. Lastly, the Video Transformer Encoder (VTE) produces an embedding summarizing all available data for a patient by processing the learned embedding of each video. Different downstream tasks can then be performed using this final learned embedding. During training, both the final prediction and the attention learned by different layers of the framework can be supervised (not all connections are shown, for cleaner visualization).

Figure 4.2: Attention supervision example. For ejection fraction (EF), the spatial attention is penalized if attention is given outside the union of the left ventricle (LV) masks for end-diastole (ED) and end-systole (ES). The temporal attention is also encouraged to give more attention to ED/ES locations.

Figure 4.3: Learned patch attention from the Spatial Transformer Encoder (STE). We visualize the learned attention of the STE: for EF, the model focuses on the walls of the LV, while for AS, the model learns to attend to the valve area, which is clinically correct.

Figure 4.4: Learned patch-level prototypes. Learned prototypes use Spatial Transformer Encoder (STE) attention to properly focus on the valve area for a healthy and a severe aortic stenosis (AS) case. In the healthy case, the aortic valve (AV) is thin and not calcified. However, in the severe case, the calcification of the AV is apparent (i.e., the valve appears bright in the image). Frame-level and ejection fraction (EF) prototypes are presented in Section A.3.

Figure 5.1: Proposed future work. We recommend employing self-supervised learning (SSL) to pretrain a multi-level, transformer-based framework. This framework can be trained on vast amounts of unlabeled medical data and subsequently fine-tuned on specific tasks to achieve high levels of task-specific performance.
Figure A.1: Additional quantitative results for EchoGNN. (Left) The confusion matrix for our best-performing model. The chosen ejection fraction (EF) categories indicate different levels of heart failure risk, with patients having EF below 40% requiring medical monitoring. (Right) The scatter plot shows how closely our model's EF estimates align with the ground truth. We observe that the model struggles with EF values between 30% and 40%, and we argue that this is due to the high inter-observer variability in the ground truth labels, which is more prominent for samples that lie near pathological boundaries.

Figure A.2: Additional examples of EchoGNN's explainability capability. (Left) Instances where the learned frame weights enable clear identification of end-systolic (ES) and end-diastolic (ED) locations. (Right) Examples with atypical zoomed-in apical four-chamber (A4C) echocardiography (echo) videos in which the left ventricle (LV) is not entirely visible and is cropped, resulting in the model distributing frame weights more evenly and not clearly indicating the positions of ED and ES.

Figure A.3: Success examples in the in-distribution (ID) setting, where the model is trained and tested on the private dataset. Despite some cases having low-quality or noisy samples (B, C), the model successfully predicts the left ventricle (LV) measurements. Additionally, in cases where the papillary muscles (which could be mistaken for LV walls) are visible in the image (A, D, E), the model is not confused and finds the proper LV landmarks.

Figure A.4: Failure examples in the in-distribution (ID) setting, where the model is trained and tested on the private dataset. In cases A, D, and E, we can see that for the inter-ventricular septal (IVS) measurement, the ground truth is placed at a bulge appearing in the wall (which could be an indicator of left ventricular hypertrophy (LVH)), an under-represented case in our dataset. In case B, the ground truth for the IVS is placed somewhere in the middle of the wall, while the model includes the whole wall structure, which could be a mistake in ground truth labeling. Lastly, case C is an out-of-distribution sample where the echo seems to be flipped, such that the orientation of the landmarks does not match the common slope observed in the dataset.

Figure A.5: Success examples in the out-of-distribution (OOD) setting, where the model is trained on the private dataset and tested on the Unity Imaging Collaborative (UIC) public dataset. For case A, despite the upper wall being barely visible, the model accurately captures the inter-ventricular septal (IVS) measurement. In case B, despite the presence of the Doppler window, the model performs within acceptable margins. In case C, the model is not confused by the papillary muscle, and in case D, despite the low quality of the image, the model succeeds in finding the landmarks. Lastly, in case E, despite the left ventricle (LV) appearing in an unusual location, the model performs well.

Figure A.6: Failure examples in the out-of-distribution (OOD) setting, where the model is trained on the private dataset and tested on the Unity Imaging Collaborative (UIC) public dataset. In most failure cases, the image is of poor quality (A, B, D, E). In case C, we see an unusually small left ventricular internal diameter (LVID) measurement, which is not well represented in the private training dataset.
Figure A.7: Prototypical network structure. For spatial (patch-level) prototypes, the learned local tokens $z_{k,t} \in \mathbb{R}^{HW/p^2 \times d}$ of the Spatial Transformer Encoder (STE) are used. The M patches with the highest attention are included, and the rest are eliminated. The remaining patches are compared with B learnable prototypes for each class, $P^l = \{p^l_{b,c}\}_{b=1,c=1}^{B,C}$ with $p^l_{b,c} \in \mathbb{R}^d$, producing a similarity vector $s \in \mathbb{R}^{B \times C}$, where C is the number of classes. Fully connected layers map these similarity scores to the output. For temporal prototypes, the frame-level tokens $z'_{k,t}$ of the Temporal Transformer Encoder (TTE) are given as input. The M' frames with high temporal attention are kept and compared with H learnable prototypes, and the similarity scores produce the output.

Figure A.8: Patch-level prototypes for ejection fraction (EF). This figure visualizes the patch-level prototypes that represent the left ventricle (LV) in end-systolic (ES) and end-diastolic (ED) frames. This suggests that these frames contribute the most to the final estimation of EF, which is clinically correct, since the ratio of the blood volumes in ED and ES is used to find EF.

Figure A.9: Additional patch-level prototypes for aortic stenosis (AS). The left figures demonstrate discarded patches based on the acquired attention of the Spatial Transformer Encoder (STE); patches with low attention are eliminated. The right figures display the areas that correspond to the learned prototypes. In both the healthy and severe cases, there is a notable emphasis on the aortic valve.

Figure A.10: Frame-level prototypes for aortic stenosis (AS). Two instances of frame-level prototypes are visualized for healthy and severe AS. The majority of frame-level prototypes are indicative of the end-systole and mid-systole stages of the heart cycle, in which the restriction of the valve's motion and the detection of the aortic valve's calcification are easier.

Glossary

A2C: apical two-chamber
A4C: apical four-chamber
AFD: average frame distance
AS: aortic stenosis
AV: aortic valve
BCE: binary cross entropy
BNN: Bayesian neural network
CNN: convolutional neural network
DL: deep learning
echo: echocardiography
ED: end-diastolic
EDV: end-diastolic volume
EF: ejection fraction
ES: end-systolic
ESV: end-systolic volume
FCN: fully convolutional network
GCN: graph convolutional network
GNN: graph neural network
GPU: graphics processing unit
ID: in-distribution
IVS: inter-ventricular septal
LA: left atrium
LN: LayerNorm
LSTM: long short-term memory
LV: left ventricle
LVEF: left ventricular ejection fraction
LVH: left ventricular hypertrophy
LVID: left ventricular internal diameter
LVM: left ventricular mass
LVOT: left ventricular outflow tract
LVPW: left ventricular posterior wall
MAE: mean absolute error
MHA: multi-head attention
ML: machine learning
MLP: multi-layer perceptron
mm: millimeters
MPE: mean percentage error
MV: mitral valve
OOD: out-of-distribution
PLAX: parasternal long-axis
PSAX: parasternal short-axis
RAM: random access memory
SDR: success detection rate
SOTA: state-of-the-art
SSL: self-supervised learning
ST: Spatial Tokenizer
STE: Spatial Transformer Encoder
SV: stroke volume
TTE: transthoracic echocardiography
TTE: Temporal Transformer Encoder
TVT: training/validation/test
UIC: Unity Imaging Collaborative
US: ultrasound
ViT: Vision Transformer
VTE: Video Transformer Encoder

Acknowledgments

First and foremost, I would like to thank my parents for providing me with an environment where I could focus on my studies and for facing the challenges of immigrating to a new country so that their children could have a better life.
A special thanks is due to Dr. Purang Abolmaesumi for his unwavering support over the past two years, during which I felt valued and accommodated. I am also extremely grateful to Dr. Liao for providing me with incredibly helpful technical guidance throughout this journey. I could not have asked for a better academic experience as a graduate student. I would also like to thank Dr. Tim Salcudean for graciously accepting to be on my thesis committee and chairing the defense.

Chapter 1

Introduction

Deploying machine learning (ML) models in the medical domain presents a unique set of challenges. These models must be explainable, able to generalize despite the scarcity of training data, and flexible enough to accommodate various tasks while maintaining explainability. In this thesis, we address these challenges by proposing three novel frameworks that leverage the representation power of graph neural networks (GNNs) and transformers, advancing the state-of-the-art (SOTA) and enhancing the practical utility of ML in medical applications.

Explainability is a critical aspect of ML models for medical applications, as it allows users to understand and trust the model's predictions and recognize when human supervision is required. To this end, our first framework aims to provide explainability in the prediction pipeline. We demonstrate its effectiveness using the task of left ventricular ejection fraction (LVEF) estimation from echocardiography (echo) videos. The framework employs GNNs to learn a graph structure between the frames of an input echo before producing an ejection fraction (EF) estimate. Our results show that the learned latent structure aligns with clinical guidelines, effectively serving as a surrogate for the model's confidence in its predictions.

Expanding upon the importance of explainability, we next tackle the challenge of generalizability. A primary obstacle in developing ML models for medical applications is the limited availability of labeled data. Consequently, establishing strong inductive bias from such sparse datasets is crucial. In response, our second framework aims to enhance model generalizability for data with minimal labeling by utilizing GNNs. We apply this framework to the task of left ventricle (LV) landmark detection, where only a few frames in a video are labeled. Leveraging a multi-scale objective function and a hierarchical graph structure, we optimize the use of supervisory signals. Our results demonstrate that this approach effectively builds a more robust inductive bias and surpasses prior work.
Another important consideration is that various metrics are measured daily in a clinical setting, and maintaining separate models for each metric becomes logistically challenging. Hence, there is a need for adaptable models that can efficiently handle diverse tasks with minimal modifications while preserving a high degree of explainability. Consequently, our third framework offers attention-based explainability on multiple levels, enabling it to adapt effortlessly to various clinical tasks. This framework utilizes transformers, a sub-family of GNNs, to capture patch-wise, frame-wise, and video-wise interactions in echo data for each patient. This approach aids in identifying pertinent information for a specific clinical metric. To showcase the flexibility of this framework, we consider two critical cardiac tasks: aortic stenosis (AS) detection and EF estimation.

In conclusion, this thesis presents three innovative frameworks that address the key challenges of explainability, generalizability, and flexibility in ML models for medical applications. By leveraging the powerful representation capabilities of GNNs and transformers, our proposed frameworks hold the potential to improve the practical utility of ML in the medical domain. In Section 1.1, we provide an introduction to GNNs and transformers, while in Section 1.2 we present a background on the clinical problems that we consider for evaluating our proposed frameworks.

1.1 Graph Neural Networks

Graph data, characterized by nodes and connecting edges, is prevalent across various fields due to its ability to represent complex relationships beyond Euclidean space. Examples of graph structures include social networks, molecular structures, and protein-to-protein interactions [29, 45, 83]. However, while graph data is powerful, efficiently extracting information from it is challenging. Cui et al. [83] describe the following challenges in processing and utilizing graph data:

- Computational cost: Edge-level relationships between nodes often necessitate computationally intensive combinatorial or iterative steps. For example, finding the shortest path between two nodes in a graph requires enumerating numerous potential paths. Traditional graph processing approaches struggle to handle large graphs because of these high computational costs.

- Parallelizability: The success of many ML models is attributed to their ability to be parallelized using accelerated hardware, such as graphics processing units (GPUs). However, distributing the computational cost for groups of nodes in a graph across different servers is challenging, as the dependency of nodes, captured by edges, leads to high communication costs.

- Compatibility with traditional ML models: Conventional ML models excel at processing independent data. For instance, convolutional neural networks (CNNs) learn representations from each image independently. This independence assumption does not apply to graph data, as nodes are interdependent and explicitly connected via edges.

Addressing these challenges is vital for harnessing the full potential of graph data and expanding the application of ML models across various domains, leading to the emergence of numerous graph representation learning methods.

Graph representation learning is essential for effectively utilizing graph data, as it learns dense, continuous, and low-dimensional vector representations for nodes in a graph. The primary goal of these representations is to eliminate redundant information while encoding the structural information of the original graph. To accomplish this, the distance between vectors in the learned embedding space should reflect the relationships initially encoded by the edges in the graph.
In summary, two main objectives guide graph representation learning. The first objective is to ensure that the original graph can be reconstructed using the learned embeddings, which extends the concept of vector distance to edge relationships. The second objective is to support various inference tasks with the learned embeddings, such as predicting missing links or identifying important nodes.

As shown in Figure 1.1, GNNs are a family of deep learning (DL) models that produce node and/or edge embeddings with the characteristics mentioned above, utilizing the power of neural networks to enable various downstream tasks on graphs, including graph classification, node classification, and link prediction. GNNs have become a popular research area, featuring various architectures and training strategies such as graph recurrent neural networks [70], graph autoencoders [41], graph reinforcement learning [56], and graph adversarial methods [9].

Figure 1.1: A graph, denoted by G(V,E), consists of a set of nodes V and the set of edges E between them. Graph data can be represented by an adjacency matrix A, which indicates connections between nodes, and a feature matrix X containing node-specific information. GNN layers learn node embeddings that transform the input node features (X) while incorporating the structural information of the graph (A). These learned embeddings H can be utilized for various node-level, edge-level, and graph-level tasks.

In Section 1.1.1, we describe graph convolutional networks (GCNs) [42], a popular GNN variant. In Section 1.1.2, we present a formulation that most GNN models follow. Lastly, in Section 1.1.3, we introduce transformers, which can be regarded as a sub-family of GNNs.

1.1.1 Graph Convolutional Networks

In this section, we introduce GCNs, a popular GNN variant that was among the first models to demonstrate the representation power of GNNs. To understand GCNs, as outlined in the course taught by Dr. Renjie Liao [47] and his PhD thesis [46], it is essential to review some signal processing fundamentals, such as one-dimensional convolutions and their connection to the Fourier transform.

The Fourier transform is defined based on the eigenfunction of the Laplacian operator $\nabla^2$, which, as shown in Equation 1.1, is $e^{-2\pi i \xi t}$. Using this eigenfunction, the Fourier transform converts a signal $f(t)$ into the Fourier domain, as shown in Equation 1.2. This transformation is widely employed in the signal processing domain, as it simplifies computationally expensive operations. In fact, the convolution theorem asserts that a costly operation like convolution with a filter $h(t)$ can be efficiently replaced by multiplication using the Fourier transform, as highlighted in Equation 1.3.

$$\nabla^2 \big(e^{-2\pi i \xi t}\big) = -(2\pi\xi)^2 e^{-2\pi i \xi t} \quad (1.1)$$

$$\hat{f}(\xi) = \int_{\mathbb{R}} f(t)\, e^{-2\pi i \xi t}\, dt \quad (1.2)$$

$$(f * h)(t) = \int_{\mathbb{R}} f(\tau)\, h(t-\tau)\, d\tau = \int_{\mathbb{R}} \hat{f}(\xi)\, \hat{h}(\xi)\, e^{2\pi i \xi t}\, d\xi \quad (1.3)$$

Drawing inspiration from the concepts described above, defining convolutions for graphs requires identifying a Laplacian operator on graphs and graph convolutions in the Fourier domain. Let us denote a graph by G(V,E), where V is the set of nodes and E is the set of edges between the nodes. We can define the graph Laplacian as $L = D - A$, where $A$ is the adjacency matrix indicating the connections in E, and $D$ is the diagonal degree matrix defined as $D_{ii} = \sum_{j=1}^{N} A_{ij}$. As shown in Equation 1.4, we can normalize this graph Laplacian so that its eigenvalues lie in $[0, 2]$, and since $L$ is symmetric, we can use spectral decomposition to factor it into the matrix of eigenvectors $U$ and the diagonal matrix $\Lambda$ of eigenvalues $\lambda_i$, as shown in Equation 1.5.

$$L = D^{-\frac{1}{2}} (D - A) D^{-\frac{1}{2}} = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \quad (1.4)$$

$$L = U \Lambda U^T = \sum_{i=1}^{N} \lambda_i u_i u_i^T \quad (1.5)$$
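To make Equations 1.4 and 1.5 concrete, the following small numerical sketch (our illustration, not code from this thesis) builds the normalized Laplacian of a toy four-node cycle graph and verifies its spectral decomposition:

```python
import numpy as np

# Toy 4-node cycle graph: adjacency matrix A, degree matrix D,
# normalized Laplacian L (Equation 1.4), and its decomposition (Equation 1.5).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))                        # D_ii = sum_j A_ij
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt  # Equation 1.4

lam, U = np.linalg.eigh(L)                        # Equation 1.5 (L is symmetric)
assert np.all(lam >= -1e-9) and np.all(lam <= 2 + 1e-9)  # eigenvalues in [0, 2]
assert np.allclose(U @ np.diag(lam) @ U.T, L)     # L = U Lambda U^T
```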
Now, we can define the graph Fourier transform as $\hat{X} = U^T X$ and graph convolution for a graph signal $X \in \mathbb{R}^{N \times 1}$ and a filter $h_\theta$ as shown in Equation 1.6.

$$h_\theta * X = U h_\theta(\Lambda) U^T X \quad (1.6)$$

One issue with the above formulation is that the construction of $h_\theta$ requires spectral decomposition, which is computationally expensive. To address this, the formulation is approximated using Chebyshev polynomials, the details of which are skipped as they are outside the scope of this thesis. The final result is shown in Equation 1.7, where approximations are made and the formulation is expanded to the multi-input/multi-output setting. A nonlinearity $\sigma$ is added to complete the definition of the GCN as originally presented by Kipf et al. [42].

$$\sigma(h_W * X) \approx \sigma\big(\tilde{D}^{-\frac{1}{2}} (A + I)\, \tilde{D}^{-\frac{1}{2}} X W\big) = \sigma(\tilde{L} X W), \quad \text{where } \tilde{D}_{ii} = \sum_j (A + I)_{ij} \quad (1.7)$$

The above formulation can be characterized using neural networks, where $W$ contains the learnable weights of the network. By stacking multiple such layers, each node gathers information from nodes farther away. Such an information passing operation is known as message passing, which is summarized in Section 1.1.2 and can be regarded as a formulation that almost all GNN variants fit into.
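As an illustration of Equation 1.7, the following minimal PyTorch sketch (our own, not the implementation used in this thesis) realizes a single GCN layer; the class name GCNLayer and the choice of ReLU for $\sigma$ are our assumptions:

```python
import torch
import torch.nn as nn

# A minimal sketch of one GCN layer, H = sigma(L_tilde @ X @ W), where
# L_tilde = D_tilde^{-1/2} (A + I) D_tilde^{-1/2} is the renormalized adjacency.
class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # the weights W

    def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        A_hat = A + torch.eye(A.size(0), device=A.device)     # A + I (self-loops)
        d = A_hat.sum(dim=1)                                  # D_tilde_ii
        D_inv_sqrt = torch.diag(d.pow(-0.5))
        L_tilde = D_inv_sqrt @ A_hat @ D_inv_sqrt             # Equation 1.7
        return torch.relu(L_tilde @ self.linear(X))           # sigma(L_tilde X W)

# Usage: stacking two such layers lets each node aggregate two-hop information.
A = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
X = torch.randn(3, 8)
H = GCNLayer(8, 16)(X, A)   # node embeddings of shape (3, 16)
```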
1.1.2 Message Passing Paradigm

According to Liao et al. [46, 47], two significant challenges in processing graph data are the unordered nature of nodes and the varying sizes of node neighborhoods in the graph. The first challenge is exacerbated by the fact that different permutations of an adjacency matrix (an example of which is shown in Figure 1.1) can correspond to the same graph, since only the order of nodes is changed while the connections remain the same. This is known as graph isomorphism.

In other words, when dealing with graph data, our inputs consist of unordered sets rather than individual (and independent) samples. This highlights the need for a family of permutation-invariant and permutation-equivariant DL models. A function $f$ is permutation-invariant if, for any permutation matrix $P$, $Y = f(PX)$, meaning the same output $Y$ is produced for permuted input data $X$. A function $f$ is equivariant if the condition shown in Equation 1.8 holds for node representations $H$.

$$H = f(X) \;\Longrightarrow\; PH = Pf(X) = f(PX) \quad (1.8)$$

Zaheer et al. [86] prove that for suitable transformations $\rho$ and $\phi$, functions of the form $\rho\big(\sum_{x \in X} \phi(x)\big)$ are permutation-invariant, while functions $f_\Theta : \mathbb{R}^M \mapsto \mathbb{R}^M$ of the form $\sigma(\Theta x)$ are equivariant if $\Theta = \lambda I + \gamma(\mathbf{1}\mathbf{1}^T)$ for $\lambda, \gamma \in \mathbb{R}$ and $\mathbf{1} = [1, \dots, 1]^T \in \mathbb{R}^M$. Using these findings, Zaheer et al. [86] introduce Deep Sets, a family of permutation-invariant and equivariant DL models suitable for acting on sets.

Closely related to these findings, Gilmer et al. [23] introduce a message passing paradigm that most GNN models fit into. In Equation 1.9, we show the formulation presented in the Graph Representation Learning book [29]:

$$h_u^{(k+1)} = \text{UPDATE}^{(k)}\Big(h_u^{(k)},\; \text{AGGREGATE}^{(k)}\big(\{h_v^{(k)}, \forall v \in N(u)\}\big)\Big) = \text{UPDATE}^{(k)}\big(h_u^{(k)},\; m_{N(u)}^{(k)}\big), \quad (1.9)$$

where $h_u^{(k)}$ is a hidden embedding for each node $u$ in the graph, $N(u)$ is the set of neighboring nodes of $u$, and $k$ indicates the iteration number. Here, the UPDATE and AGGREGATE functions are arbitrary differentiable functions realized via neural networks. It must be noted that the AGGREGATE function takes a set as input and must therefore be permutation-invariant, which can be achieved by operations such as mean, max, or sum. As an example, Hamilton et al. [29] present a simple GNN framework in Equation 1.10:

$$h_u^{(k)} = \sigma\Big(W_{\text{self}}^{(k)} h_u^{(k-1)} + W_{\text{neigh}}^{(k)} \sum_{v \in N(u)} h_v^{(k-1)} + b^{(k)}\Big) \quad (1.10)$$

In this context, $W_{\text{self}}$ and $W_{\text{neigh}}$ can be represented by two neural networks that update a node's representation by utilizing past information of the node and its neighbors, respectively. Similarly, various GNN variants can be defined by specifying the AGGREGATE and UPDATE functions.
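The layer in Equation 1.10 can be sketched in a few lines; the following PyTorch snippet (our illustration, with the hypothetical class name SimpleMessagePassing and sum aggregation as one permutation-invariant choice) makes the UPDATE/AGGREGATE split explicit:

```python
import torch
import torch.nn as nn

# A minimal sketch of the simple message passing layer in Equation 1.10:
# each node combines its own past embedding (W_self) with a sum-aggregated
# message from its neighbors (W_neigh).
class SimpleMessagePassing(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim, bias=True)    # W_self and b
        self.w_neigh = nn.Linear(in_dim, out_dim, bias=False)  # W_neigh

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # AGGREGATE: A @ H sums neighbor embeddings (permutation-invariant).
        messages = A @ H
        # UPDATE: combine self and neighbor information, then apply sigma.
        return torch.relu(self.w_self(H) + self.w_neigh(messages))

# Usage on a 3-node path graph with 8-dimensional node features.
A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = torch.randn(3, 8)
H_next = SimpleMessagePassing(8, 16)(H, A)  # shape (3, 16)
```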
While transformers are highly capable models for processing text corpora, the original formulation is not suitable for spatially processing images. To address this, Dosovitskiy et al. [16] propose the Vision Transformer (ViT), which extends the capabilities of transformers to images by using patches of the image as input tokens. As our proposed framework in Chapter 4 employs ViTs, we provide a more detailed description in that chapter.

1.2 Clinical Details

In this thesis, we propose three novel vision-based ML frameworks for medical applications. To demonstrate the effectiveness of these frameworks, we focus on echo as our imaging modality and select the tasks of LVEF estimation, LV landmark detection, and AS detection. In Section 1.2.1, we provide a brief background on echo, while in Section 1.2.2 we discuss the various medical tasks under consideration.

1.2.1 Echocardiography

Ultrasound (US) imaging has been widely used in clinical practice due to its non-invasive, cost-effective, and portable nature, as well as its excellent temporal resolution [79]. In US imaging, ultrasonic waves are transmitted into the tissue. The propagating waves are partially reflected at the boundaries of tissues, allowing their position to be measured over time. Various wave phenomena such as diffraction, refraction, attenuation, and scattering must be considered during both the data acquisition and image reconstruction steps.

US imaging has various modes, including A-mode, B-mode, and M-mode. In A-mode imaging, the simplest case, the pulse-echo principle is used to find the position of various tissue boundaries in a one-dimensional manner. In B-mode, which is the mode used in the experiments presented in this thesis, a two-dimensional image is acquired by translating the US transducer between multiple A-mode acquisitions. In the case of cardiac imaging, since the rib bones block the view, the transducer is tilted to obtain an image rather than translated, resulting in the cone-shaped images illustrated in Figure 1.2. Lastly, M-mode is similar to A-mode but adds a temporal dimension to show movement over time.

One important application of US imaging is echo, specifically transthoracic echocardiography (TTE), which is the primary modality that depicts the anatomic structure and function of the heart over time. TTE imaging involves externally placing a hand-held transducer at various angles to visualize different regions of the heart. This technique is effective in providing guidance for various clinical scenarios, including the detection of valvular heart disease, coronary artery disease, cardiomyopathy, stroke, and heart failure. Despite these advantages, some limitations of this technique include operator dependence, a narrow field of view, and limited tissue characterization [27].

Figure 1.2: Echocardiography (echo), an Ultrasound (US) imaging modality, is widely used due to its non-invasive, cost-effective, and portable nature. There are around 14 echo views. (A) We display an example of an apical four-chamber (A4C) view of the heart. The left ventricular ejection fraction (LVEF) is typically estimated using this view because of the visibility of the left ventricle (LV). (B) We present an example of an apical two-chamber (A2C) view of the heart and an echo frame depicting various chambers. (C) We show an example of a parasternal long-axis (PLAX) echo, which can be used to obtain important LV measurements or assess the function of the aortic valve (AV). (D) We provide an example of a parasternal short-axis (PSAX) echo where the function of the AV can be assessed due to its visibility in this view. (All figures are sourced from Wikimedia Commons.)
Figure 1.3: The 14 standard echocardiography (echo) views are obtained based on how the transducer is positioned, as a combination of the chosen acoustic window and the imaging plane. (A) We demonstrate the various acoustic windows, representing the position of the transducer on the patient. (B) We illustrate the different imaging planes of the heart, which refer to the orientation of the transducer with respect to the axis of the left ventricle (LV). (All figures are taken from [57] with permission from Elsevier.)

As previously mentioned, the heart can be depicted from various angles based on the positioning of the transducer. Multiple standard views can be obtained from the combination of the acoustic window and the cardiac plane. As shown in Figure 1.3 (A), the acoustic windows include the parasternal, apical, subcostal, and suprasternal windows. As shown in Figure 1.3 (B), the planes include the long-axis, short-axis, and apical planes. Around 14 standard views can be obtained in total. In the following list, we describe the views most relevant to the clinical tasks considered in this thesis.

• Parasternal Window:

  – The parasternal long-axis (PLAX) view is typically the first view obtained during an examination. In this view, the left atrium (LA), LV, left ventricular outflow tract (LVOT), aortic valve (AV), and mitral valve (MV) are visualized. As discussed in Chapter 3, we use this view to find the LV landmarks, and in Chapter 4, we use this view to detect AV anomalies. An example of this view is shown in Figure 1.2 (C).

  – The parasternal short-axis (PSAX) view can be obtained by rotating the transducer 90 degrees from the PLAX view. In this view, the AV, MV, papillary muscles, and the LV apex are visible. As discussed in Chapter 4, this is another view we use for AS detection. An instance is shown in Figure 1.2 (D).

• Apical Window:

  – The apical four-chamber (A4C) view is the first view obtained in the apical window. This view is used for volumetric assessment of the LV, which is why we use it for EF estimation in Chapter 2 and Chapter 4. An example of this view is illustrated in Figure 1.2 (A).

  – The apical two-chamber (A2C) view is obtained by rotating the transducer 90 degrees from the A4C view. This view allows visualization of the LA, MV, and LV. Due to the visibility of the LV in this view, we use it for LVEF estimation in Chapter 4. An example of this view is visualized in Figure 1.2 (B).

1.2.2 Left Ventricle Assessment

The heart's LV is composed of a complex network of muscles, including sub-endocardial and sub-epicardial fibers with a longitudinal disposition, along with mid-wall circumferential fibers. The contraction and relaxation of these fibers cause the LV to expand or contract, thereby pumping oxygenated blood throughout the body. The phase during which the LV is contracted is known as the end-systolic (ES) phase, and the phase when it is expanded is known as the end-diastolic (ED) phase [6]. Due to the critical role of the LV in circulating oxygenated blood, non-invasive assessment of its function is of utmost importance and a central task in cardiology [65].

Ejection Fraction

One such assessment, and one of the most commonly used metrics, is the left ventricular ejection fraction (LVEF).
LVEF is defined as the ratio between the change in LV volume from the ED to the ES phase and the end-diastolic volume, as shown in Equation 1.14, where the difference between the end-diastolic volume (EDV) and the end-systolic volume (ESV) is known as the stroke volume (SV) [6]. The EDV and ESV values are often estimated from A2C and A4C echo by either delineating the LV region or through direct visual assessment.

EF = (EDV − ESV)/EDV × 100 = SV/EDV × 100.    (1.14)

LVEF has important implications in determining diagnosis and initiating cardioprotective pharmacotherapies to prevent heart failure, with the most prognostic value present when the measured EF is below 40% [65]. However, while this ratio is one of the most used parameters in both clinical and research settings, it has significant limitations.

These limitations arise from the fact that obtaining the A2C and A4C views accurately and with high quality is challenging, so much so that even an experienced professional can make a standard error of 6.3% in their estimate of the EF. Moreover, determining the true cause of a specific LVEF ratio is difficult, since the SV can be affected by several factors, including myocardial contractility (the strength of heart muscle contractions), heart rate (the number of heartbeats per minute), loading conditions (the volume and pressure conditions affecting the heart), and dyssynchrony of contraction (the lack of coordinated muscle contractions). Additionally, since the ratio is normalized by the EDV, comparing the EF values of two patients can be misleading, as two patients with the same EF can have drastically different EDV values [6].

Such limitations and the high error rate in LVEF estimates underscore the need for accurate and explainable automatic ML frameworks, as explored in Chapter 2.

Landmark Detection for Hypertrophy

Left ventricular hypertrophy (LVH), one of the leading predictors of adverse cardiovascular outcomes, is a condition in which the heart's mass abnormally increases due to anatomical changes in the LV. Hypertension, AS, and intense athletic training are some factors that cause overexertion of the heart and consequently affect its anatomy [5, 26]. These anatomical changes include an increase in septal and LV wall thickness, as well as enlargement of the LV chamber. More specifically, inter-ventricular septal (IVS) thickness, left ventricular posterior wall (LVPW) thickness, and left ventricular internal diameter (LVID) are assessed to investigate LVH and the risk of heart failure [54].

Figure 1.4: The left ventricle (LV) measurements, including left ventricular internal diameter (LVID), inter-ventricular septal (IVS), and left ventricular posterior wall (LVPW) thickness, are usually characterized on a parasternal long-axis (PLAX) echocardiography (echo) frame by placing either four or six landmarks on the frame. (The original figure is sourced from Wikimedia Commons. We modified it to highlight the measurements.)

These measurements are performed on echo frames, specifically during either the ED or ES phase of the cardiac cycle, where the LV is fully expanded or contracted. Additionally, it is recommended that these linear measurements be performed on the PLAX view [43]. As shown in Figure 1.4, clinical professionals characterize these measurements by placing four or six landmarks (clicks) on a PLAX echo frame.

Since LVH is closely related to left ventricular mass (LVM), the aforementioned LV measurements must be used to approximate the LVM [44].
One formula to achieve this is the cube formula of Devereux et al. [14], shown in Equation 1.15, where the measurements are made during the ED phase of the cardiac cycle.

LVM = 0.8 × 1.04 × ((IVS + LVID + LVPW)^3 − LVID^3) + 0.6    (1.15)
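As a worked example of Equations 1.14 and 1.15, the two functions below compute LVEF and LVM directly from the definitions; the function names are ours, and we assume the Devereux measurements are given in centimetres so that the result is in grams.

```python
def ejection_fraction(edv: float, esv: float) -> float:
    """LVEF in percent via Equation 1.14: stroke volume over end-diastolic volume."""
    return (edv - esv) / edv * 100.0


def lv_mass(ivs: float, lvid: float, lvpw: float) -> float:
    """LV mass via the cube formula of Equation 1.15 (end-diastolic measurements)."""
    return 0.8 * 1.04 * ((ivs + lvid + lvpw) ** 3 - lvid ** 3) + 0.6


# Example: EDV = 120 mL and ESV = 50 mL give an LVEF of about 58.3 %.
print(ejection_fraction(120.0, 50.0))
```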
Automating this task with ML frameworks is challenging since only a few frames in an echo video are labeled (usually the ED or ES frames), and on those labeled frames, only four or six landmarks exist. This sparse labeling scheme makes it difficult to build accurate models and highlights the need for generalizable models, as we discuss in Chapter 3.

Aortic Stenosis Detection

Aortic stenosis (AS) is one of the most common valvular heart diseases, often leading to death if interventions such as transcatheter aortic valve (AV) replacement are not carried out. This condition typically affects the older population, with a prevalence of up to 5% in individuals over 75 years of age [7, 67, 78]. AS occurs when the AV becomes progressively calcified, immobile, and restricted, causing abnormal blood flow out of the LV.

The AV typically consists of three cusps that collectively form a gate between the aorta and the left ventricular outflow tract (LVOT). A healthy AV is not calcified and fully opens to prevent obstruction of blood flow out of the heart during systole. However, there are three main factors contributing to the development of AS. First, a bicuspid AV, which affects about 0.5 to 1% of the population, occurs when the AV has only two cusps instead of the usual three. Second, rheumatic heart disease, a condition leading to the inflammation and scarring of the AV, is another contributing factor. Lastly, age-related calcific degeneration, characterized by thickening and fibrosis of the AV cusps, is a common cause of AS in older individuals [67].

The calcification of the AV can be observed via echo. As shown in Figure 1.2 (C, D), the AV is visible in both the PLAX and PSAX views. Therefore, in Chapter 4, we investigate the feasibility of a transformer-based network for detecting AS from the PLAX and PSAX echo views.

1.3 Thesis Objectives and Contributions

The overall objective of this thesis is to address the challenges in deploying ML models for medical applications. We focus on three main challenges: explainability, generalizability, and flexibility. For each challenge, we propose a novel framework:

• We propose an explainable framework using GNNs for LVEF estimation from echo videos. This framework learns a latent structure that aligns with clinical guidelines and provides a surrogate for model confidence.

• We develop a generalizable framework for sparsely labeled data using GNNs, focusing on medical landmark detection. By employing a multi-scale objective function and a hierarchical graph structure, this framework maximizes supervisory signals and outperforms prior work.

• We introduce a highly flexible framework leveraging transformers for various clinical tasks. This framework offers attention-based explainability at multiple levels, capturing patch-wise, frame-wise, and video-wise interactions in echo data. The versatility of this framework is showcased through its application to AS detection and EF estimation, all while maintaining a high level of explainability.

1.3.1 Thesis Outline

The next chapters of the thesis are organized as follows:

Chapter 2 - Explainable Ejection Fraction Estimation with Graph Neural Networks: EF, a key indicator of cardiac function, is estimated from echo with high inter-observer variability due to manual processes and variable video quality. This necessitates reliable and explainable ML techniques for rapid assessment. We introduce EchoGNN, a GNN model for estimating EF from echo videos. The model infers a latent echo-graph from echo cine series frames, estimates weights over nodes and edges, and uses a GNN regressor to predict EF. The learned graph weights provide explainability, identifying critical frames for EF estimation and indicating when human intervention is needed. EchoGNN achieves competitive performance, offering explainability essential for addressing the task's inherent inter-observer variability.

Chapter 3 - Hierarchical Graph Neural Networks for Left Ventricle Landmark Detection: Automating LV chamber assessment with ML faces challenges due to sparse clinical labels, leading to reliance on isotropic label smoothing, which ignores anatomical information and induces bias. We introduce EchoGLAD, a hierarchical GNN for LV landmark detection in echo. Our contributions include a multi-resolution hierarchical graph representation learning framework and hierarchical supervision with a multi-level loss. We achieve state-of-the-art mean absolute errors on public and private datasets under in-distribution (ID) and out-of-distribution (OOD) settings. Furthermore, our model demonstrates better OOD generalization than previous works.

Chapter 4 - General, Echocardiogram-Based, Multi-Level Transformer Framework for Cardiovascular Diagnoses: Vision-based medical ML methods have become popular as secondary verification layers. For these safety-critical applications, explainability and accuracy are essential. Additionally, methods must process multiple echo videos from various heart views for different cardiovascular tasks. Prior work lacks explainability or focuses on single tasks. We propose a General, Echo-based, Multi-Level Transformer (GEMTrans) framework that offers explainability and multi-video training, capturing intra-frame, intra-video, and inter-video relationships. We demonstrate the framework's flexibility with EF and AS severity detection tasks and show that it outperforms prior works on these tasks.

Chapter 5 - Conclusions and Future Work: We summarize the contributions of this thesis, followed by suggestions for future work, including the dissemination of a multi-tasking, transformer-based model to the ML community.

Chapter 2

Explainable Ejection Fraction Estimation with Graph Neural Networks

2.1 Introduction and Related Works

Ejection fraction (EF) is a ratio indicating the volume of blood pumped by the heart, crucial in monitoring cardiovascular health and potentially indicating heart failure [32, 51]. EF is computed using the stroke volume (SV), the blood volume difference in the left ventricle (LV) between the end-systolic (ES) and end-diastolic (ED) phases of the cardiac cycle, denoted by end-systolic volume (ESV) and end-diastolic volume (EDV), respectively [2]. These volumes are estimated from Ultrasound (US) videos of the heart, or echocardiography (echo), which involves detecting the frames corresponding to ES and ED and tracing the LV region.
The manual process of detecting the correct frames and making proper traces is prone to human error. Therefore, the American Society of Echocardiography recommends performing EF estimation for up to 5 cardiac cycles and averaging the results [44]. However, this guideline is seldom followed in practice, and a single representative beat is selected for evaluation instead, resulting in inter-observer variations of 7.6% to 13.9% in the EF ratio [62].

Automatic EF estimation techniques provide professionals with an additional layer of verification. Additionally, with the rise of point-of-care US devices, which are used by less experienced echo users, automating clinical measurements like EF is becoming increasingly necessary [1]. However, for broad adoption, such automation techniques must be explainable to determine when human intervention is required.

Different ML architectures have been proposed to perform automatic EF estimation [35, 38, 62, 66, 74], most of which lack reliable explainability mechanisms. Some of these models fail to provide the model's confidence in their predictions [62, 66, 74] or have low accuracy due to unrealistic data augmentation during training and over-reliance on ground truth labels [66].

More specifically, prior works use convolutional neural networks (CNNs) in their EF estimation pipelines [35, 38, 62, 66]. Ouyang et al. [62] use ResNet-based (2+1)D convolutions [81] to estimate and average EF for all possible 32-frame clips in an echo, while Kazemi Esfeh et al. [38] use a similar approach in the Bayesian neural network (BNN) setting. Recent work uses the encoder of ResNetAE [30] to reduce data dimensionality before using transformers [82] to jointly perform ES/ED frame detection and EF estimation [66]. While these methods show different levels of accuracy and success in predicting EF, they either lack explainability or rely significantly on accurate clinical labels, which are inherently noisy and subject to significant inter-observer variability. For example, the transformer-based approach requires ES/ED frame index labels in addition to EF labels in its training pipeline [66]. Lastly, while Kazemi Esfeh et al. and Jafari et al. [35, 38] report uncertainty in their predictions, they still lack explainable indicators as to why models fail or succeed for different cases.

2.1.1 Contributions

In this chapter, in response to the need for an explainable and accurate ML EF estimation model, we introduce EchoGNN, a novel deep learning (DL) model for explainable EF estimation. Our approach first infers a latent graph between frames of one or multiple echo cine series. It then estimates EF based on this latent graph via GNNs [72], a class of DL models that efficiently capture graph data. To the best of our knowledge, our work is the first to investigate GNNs in the context of US videos and EF estimation. Moreover, our work brings explainability through latent graph learning, inspiring further work in this domain.
Lastly, as an added advantage, the number of parameters of our model is significantly smaller than in prior work, which is highly desirable for deploying such models on mobile clinical devices.

In summary, our contributions are threefold:

• We introduce EchoGNN, a novel DL model for explainable EF estimation through GNN-based latent graph learning.

• We present a weakly-supervised training pipeline for EF estimation without direct reliance on ground truth ES/ED frame labels.

• Our model has a much lower number of parameters compared to prior work, significantly reducing computational and memory requirements.

2.2 Method

2.2.1 Problem Setup

We consider the following supervised problem for EF estimation: assume that for each patient i ∈ [N] in dataset D, there is a ground truth EF ratio y_i ∈ R, and there are K echo videos x_{ik} ∈ R^{T×H×W}, where k ∈ [K], T is the number of frames, and H and W are the height and width of each frame, respectively. The goal of our model is to learn a function f : R^{K×T×H×W} → R to estimate EF from echo videos. For notational simplicity, and since our evaluation dataset only contains one video per patient, we assume that K = 1. However, it is important to note that our model is flexible in this regard and can handle multiple videos per patient.

2.2.2 Proposed Framework

As shown in Figure 2.1, EchoGNN is composed of three main components: Video Encoder, Attention Encoder, and Graph Regressor. In the following subsections, we discuss the details pertaining to each component.

Figure 2.1: EchoGNN has three main components. (1) Video Encoder: encodes video frames into vector embeddings while preserving the temporal dimension; (2) Attention Encoder: infers weights over the nodes (video frames) and edges (relationships among frames) of the echo-graph; (3) Graph Regressor: estimates ejection fraction (EF) using the inferred weighted graph. This figure shows an example where each patient has an apical two-chamber (A2C) and an apical four-chamber (A4C) echo video.

Video Encoder

The original echo videos are high-dimensional and must be mapped into lower-dimensional embeddings to reduce the memory footprint and remove redundant information. The Video Encoder is used to learn a mapping f_ve : R^{T×H×W} → R^{T×d} from input echo videos x_i ∈ R^{T×H×W} to d-dimensional embeddings h_j^i ∈ R^d, where j ∈ [T] is the frame number. The temporal dimension is preserved because the Attention Encoder requires embeddings for all frames to produce interpretable weights over them. We use a custom network consisting of 3D convolutions and residual connections to utilize both the spatial and temporal information in the video when generating the embeddings. This network's architecture is provided in Figure 2.2.

Figure 2.2: Video Encoder network architecture: We use modular blocks containing 3D convolutions with residual connections to generate low-dimensional frame embeddings.

Lastly, following [82], periodic positional encodings are added to the generated frame embeddings to encode the sequential nature of video data.
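For concreteness, a minimal sketch of the periodic positional encodings of [82] is shown below; we assume the standard sinusoidal form and an even embedding dimension d, since the exact variant is not spelled out here.

```python
import math
import torch


def sinusoidal_positional_encoding(num_frames: int, d: int) -> torch.Tensor:
    """One d-dimensional sinusoidal code per frame index (d assumed even)."""
    position = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)  # (T, 1)
    div_term = torch.exp(torch.arange(0, d, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d))                       # (d/2,)
    pe = torch.zeros(num_frames, d)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe


# Added element-wise to the (T, d) frame embeddings produced by the Video Encoder:
# h = h + sinusoidal_positional_encoding(h.size(0), h.size(1))
```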
Attention Encoder

For each patient, we construct an echo-graph, which is a complete graph where each node corresponds to a frame in the echo video, and the edges capture the non-Euclidean relationships between these frames. Formally, we denote the echo-graph by G_echo(V, E), where V is the set of nodes corresponding to echo frames such that |V| = T, and E is the set of edges between the nodes representing the relationships between video frames, such that if v_1, v_2 ∈ V are connected, then e_{v_1,v_2} ∈ E. We use the frame embeddings from our Video Encoder as the node features of G_echo. That is, h_1^i, h_2^i, ..., h_T^i are the features for v_1, v_2, ..., v_T. These embeddings can be represented as a matrix H^i ∈ R^{T×d} such that each row is the embedding for a frame in the echo video of patient i.

Inspired by [40], we propose using GNNs to learn and assign weights to both the edges and the nodes of the echo-graph. The edge and node weights are learned to encode the importance of each frame (node weights) and of the relationships among frames (edge weights) for the final EF estimation.

The Attention Encoder infers weights over the edges and nodes of the echo-graph using message passing-based GNNs [23]. A single message passing step is sufficient for each node to capture information from all other nodes, since the echo-graph is a complete graph. More specifically, the following operations are used to obtain the weight of each edge e_{v_k,v_s}:

u_{k,s} = MLP_1([h_k^i ∥ h_s^i])    (node → edge)    (2.1)

v_s = MLP_2(∑_{k≠s} u_{k,s})    (edge → node)    (2.2)

z_{k,s} = MLP_3([v_k ∥ v_s])    (node → edge)    (2.3)

a_{k,s} = σ(z_{k,s}),    (2.4)

where σ is the Sigmoid function, [·∥·] is the concatenation operator, and a_{k,s} ∈ [0,1] is the inferred weight for the directed edge from v_k to v_s. Similarly, weights for each node w_s ∈ [0,1] are generated by inserting another edge → node operation after Equation 2.3. All multi-layer perceptrons (MLPs) use two fully connected linear layers with ELU [13] activation and batch normalization.
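The following is a simplified PyTorch sketch of Equations 2.1 to 2.4 over a complete echo-graph; batch normalization is omitted, the MLP widths are placeholders, and the node-weight head is reduced to a single MLP on the aggregated node messages.

```python
import torch
import torch.nn as nn


def mlp(d_in: int, d_hidden: int, d_out: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ELU(), nn.Linear(d_hidden, d_out))


class AttentionEncoder(nn.Module):
    """Infers edge weights a_{k,s} and node weights w_s (Equations 2.1-2.4)."""

    def __init__(self, d: int, hidden: int):
        super().__init__()
        self.mlp1 = mlp(2 * d, hidden, hidden)   # node -> edge (Eq. 2.1)
        self.mlp2 = mlp(hidden, hidden, hidden)  # edge -> node (Eq. 2.2)
        self.mlp3 = mlp(2 * hidden, hidden, 1)   # node -> edge (Eq. 2.3)
        self.mlp4 = mlp(hidden, hidden, 1)       # extra edge -> node head for w_s

    def forward(self, h: torch.Tensor):
        t = h.size(0)                            # h: (T, d) frame embeddings
        # Eq. 2.1: u_{k,s} = MLP1([h_k || h_s]) for every ordered pair (k, s).
        pair = torch.cat([h.unsqueeze(1).expand(t, t, -1),
                          h.unsqueeze(0).expand(t, t, -1)], dim=-1)
        u = self.mlp1(pair)                      # (T, T, hidden)
        # Eq. 2.2: v_s = MLP2(sum over k != s of u_{k,s}).
        mask = 1.0 - torch.eye(t, device=h.device).unsqueeze(-1)
        v = self.mlp2((u * mask).sum(dim=0))     # (T, hidden)
        # Eqs. 2.3-2.4: a_{k,s} = sigmoid(MLP3([v_k || v_s])).
        vpair = torch.cat([v.unsqueeze(1).expand(t, t, -1),
                           v.unsqueeze(0).expand(t, t, -1)], dim=-1)
        edge_w = torch.sigmoid(self.mlp3(vpair)).squeeze(-1)  # A in [0,1]^{T x T}
        node_w = torch.sigmoid(self.mlp4(v)).squeeze(-1)      # w in [0,1]^T
        return edge_w, node_w
```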
Regressor

Our Regressor network uses GNN layers on the learned weighted echo-graph to perform EF estimation. Specifically, for each patient, the output of the Attention Encoder can be represented as a weighted adjacency matrix A ∈ [0,1]^{T×T} and a node weight vector w ∈ [0,1]^T. The Regressor uses A to generate embeddings over the frames of the echo video:

H^l = g^l(A, H^{l−1}),    l = 1, ..., L    (2.5)

where H^l ∈ R^{T×d_g} is the matrix of learned node embeddings at layer l, H^0 is the matrix of frame embeddings from the Video Encoder, and g^l is composed of a graph convolutional network (GCN) layer followed by batch normalization and ELU activation [42]. To represent the whole graph with a single vector embedding, the node embeddings are averaged using the frame weights w generated by the Attention Encoder:

h_{graph}^i = (∑_{j=1}^{T} w_j H_j^l) / (∑_{j=1}^{T} w_j),    (2.6)

where H_j^l ∈ R^{d_g} is the jth row of H^l, and w_j is the jth scalar weight in the frame weight vector. h_{graph}^i is mapped into an EF estimate using an MLP with two fully connected linear layers, ELU activation, and batch normalization.

2.2.3 Training and Objective Function

The model is differentiable end-to-end. Therefore, we use gradient descent with the mean absolute error (MAE) between the predicted EF estimates ỹ_i and the ground truth EF values y_i ∈ Y as the optimization objective, computed as L = (1/N) ∑_{i=1}^{N} |ỹ_i − y_i|.

2.3 Experiments

2.3.1 Dataset

We use the EchoNet-Dynamic public EF dataset, consisting of 10,030 A4C echo videos obtained between 2016 and 2018 at Stanford University Hospital. Each echo frame has a dimension of 112×112, and the dataset provides the ESV, EDV, contour tracings of the LV, and EF ratios for each patient [62]. We use the provided splits from mutually exclusive patients, with 7,465 samples for training, 1,288 samples for validation, and 1,277 samples for testing. The data distribution in the training set is unbalanced, with only 12.7% of samples having an EF ratio below 40%. Clinically, however, such patients are the most critical to detect for timely intervention [8, 37].

Frame Sampling: To stay within reasonable memory requirements, we use a fixed number of frames per echo, denoted by T_fixed. During training, we uniformly sample an initial frame index j in [1, T_total^i − T_fixed], where T_total^i is the total number of frames in echo video i. We then use T_fixed samples starting from j. Following [62], we set T_fixed to 64 and use zero padding in the temporal dimension when T_total^i < T_fixed. At test time, we extract multiple back-to-back clips, with each clip containing T_fixed frames and the first clip starting from index 0. We use zero padding in the temporal dimension if T_total^i < T_fixed and overlap the last clip with the previous one if the last clip overshoots T_total^i. We independently estimate EF for each clip and report the average prediction.
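A sketch of the training-time clip sampling described above (0-indexed, with the zero-padding branch for short videos); the test-time back-to-back extraction follows the same pattern.

```python
import numpy as np


def sample_training_clip(video: np.ndarray, t_fixed: int = 64) -> np.ndarray:
    """Uniformly sample a T_fixed-frame clip from a (T_total, H, W) echo video."""
    t_total = video.shape[0]
    if t_total < t_fixed:
        # Zero-pad the temporal dimension when the video is shorter than T_fixed.
        pad = np.zeros((t_fixed - t_total, *video.shape[1:]), dtype=video.dtype)
        return np.concatenate([video, pad], axis=0)
    start = np.random.randint(0, t_total - t_fixed + 1)  # uniform initial index
    return video[start:start + t_fixed]
```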
Data Augmentation: Occasionally, A4C echo is zoomed in on the LV region for certain clinical studies [18, 64]. To allow learning of this under-represented distribution, we augment our training set by using a fixed cropping window of 90×72 centered at the top of each frame and interpolating the result to the original 112×112 dimension, which creates the desired zoom-in effect.

2.3.2 Implementation Details

The Video Encoder uses custom convolution blocks with 16, 32, 64, 128, and 256 channels. The Attention Encoder employs a hidden dimension of 128 for the MLP layers, and the Regressor utilizes a 3-layer GCN with 128, 64, and 32 hidden dimensions, followed by an MLP with a hidden dimension of 16. We use the Adam optimizer [39] with a learning rate of 1e-4, a batch size of 80, and 2500 training epochs. Our framework is implemented using PyTorch [63] and PyG [19], and training was performed on two Nvidia Titan V GPUs. Pretraining: We use ES/ED index labels in a pretraining step to train the Video Encoder and the Attention Encoder, giving higher weights to ES and ED frames. Classification Loss: We bin the EF values into four ranges, [0, 30], (30, 40], (40, 55], and (55, 100], and use a cross-entropy loss encouraging the model to learn EF's clinical categories [8].

2.3.3 Results and Discussion

Explainability

The key advantage of EchoGNN over prior work is the explainability it provides through the learned weights on the echo-graph. As shown in Figure 2.3, the learned weights can indicate when human intervention is required. We observe two different scenarios: (1) the model learns the periodic nature of echo videos and assigns larger weights to frames and edges that are between the ES and ED phases before performing EF estimation. This means that the locations of ES and ED can be approximated using these weights. (2) The model cannot detect the locations of the ES and ED frames and distributes the weights more evenly. We see that in these cases, we have either an atypical zoomed-in A4C echo or an echo where the LV is not entirely visible and is cropped. In such cases, an expert can evaluate the video and determine if new videos must be obtained. Additional explainability examples are provided in Section A.1.

Figure 2.3: Explainability through the learned weighted graph: (Top) An example where the model has learned the periodic nature of the data, and the learned weights enable identification of end-systolic (ES) and end-diastolic (ED) locations. (Bottom) Another example where the left ventricle (LV) region is cropped (as indicated by the arrow), and the learned weights are distributed more evenly, suggesting the need for expert intervention.

To quantitatively measure the explainability of EchoGNN, for cases where the model learns the periodic nature of the data (1173 samples out of 1277), we use the average frame distance (AFD) as in [66]. The AFD is computed as AFD = (1/N) ∑_{i=1}^{N} |j_i − j̃_i|, with j_i and j̃_i being the true and approximated indices, respectively, for sample i. As shown in Table 2.1, our model achieves better ED AFD and comparable ES AFD without using ground-truth ES/ED locations for training, whereas Reynaud et al. [66] use such supervision. This demonstrates the explainability power of EchoGNN. AFD computation details are provided in Figure 2.4.

Figure 2.4: End-systolic (ES) and end-diastolic (ED) frame approximation from the learned echo-graph weights: We first use a threshold to convert the sum of outgoing edge weights into a binary format (alternatively, frame weights can be used). Note that this threshold is selected based on average frame distance (AFD) performance on the validation set. Consecutive 1-valued weights together form a block. The left-most and right-most frames in each block are the approximated ED and ES locations, respectively. We reject samples where the size of the block is equal to 55, indicating that the model has not learned the periodic nature of the data. By rejecting these samples, we achieve an average frame distance of 4.15 for ES and 3.68 for ED.
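A sketch of the block-finding procedure of Figure 2.4 (the function name and the trailing-block handling are ours; the threshold and the rejection size are empirical choices made on the validation set).

```python
import numpy as np


def approximate_ed_es(edge_weights: np.ndarray, thresh: float, reject_len: int = 55):
    """Approximate (ED, ES) frame pairs from the learned edge weights."""
    outgoing = edge_weights.sum(axis=1)           # per-frame sum of outgoing edge weights
    binary = (outgoing > thresh).astype(int)      # threshold tuned on the validation set
    pairs, start = [], None
    for i, b in enumerate(list(binary) + [0]):    # sentinel 0 closes a trailing block
        if b and start is None:
            start = i                             # block opens: left-most frame is ED
        elif not b and start is not None:
            if i - start < reject_len:            # overly long blocks => periodicity not learned
                pairs.append((start, i - 1))      # (ED, ES) = (left-most, right-most) frame
            start = None
    return pairs
```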
EF Estimation

To evaluate the error in the predicted EF values, we use MAE. Additionally, as a measure of the amount of explained variance in the data, we report the model's R² score. Moreover, we report the F1 score for the task of indicating whether EF values are lower than 40%, which is a strong indicator of heart failure [37].

As shown in Table 2.1, our model significantly outperforms Reynaud et al. [66] without direct supervision of the ES and ED frame locations during training. Our model has predictive performance similar to Esfeh et al. [38], but with a much lower number of parameters and the added benefit of explainability through the learned latent graph structures. Ouyang et al. (AF) [62] require large amounts of random access memory (RAM) due to sampling all 32-frame clips in a video, making it impossible for us to train and evaluate the model. As a result, we only report results from the paper and cannot produce additional metrics, such as the F1 score, which is not originally reported; we show this with N/A in Table 2.1. This model's weak performance compared to our model demonstrates the sensitivity of Ouyang et al. (AF) to frame locations in a clip. Lastly, our model has a significantly lower number of parameters, making it desirable for deployment on mobile clinical devices. Our model's EF scatter plot and confusion matrix are provided in Section A.1.

Table 2.1: Summary of quantitative results: Lower values are better for all metrics except R² and F1. Ouyang et al. (AF) average predictions over all possible 32-frame clips in a sampled video. Reynaud et al. (R) and (M) are transformer-based models with different sampling techniques. Esfeh et al. use Bayesian neural networks (BNNs). We mark the models that cannot predict end-systolic (ES) and end-diastolic (ED) locations with "-" in the average frame distance (AFD) metric. EchoGNN is the only model that provides explainability and ES/ED location estimations without direct supervision.

| Model | R² | MAE | F1 < 40% | ES AFD | ED AFD | #params (×10⁶) |
|---|---|---|---|---|---|---|
| Ouyang et al. (AF) [62] | 0.4 | 7.35 | N/A | - | - | 31.5 |
| Reynaud et al. (R) [66] | 0.48 | 6.76 | 0.70 | 2.86 | 7.88 | 346.8 |
| Reynaud et al. (M) [66] | 0.52 | 5.95 | 0.55 | 3.35 | 7.17 | 346.8 |
| Esfeh et al. [38] | 0.75 | 4.46 | 0.77 | - | - | 31.5 |
| Ours | 0.76 | 4.45 | 0.78 | 4.15 | 3.68 | 1.7 |

2.3.4 Ablation Study

In Table 2.2, we observe that the classification loss enhances the model's performance for under-represented samples, while pretraining and data augmentation reduce the EF error and increase the model's capability to represent the variance in the data.

Table 2.2: Ablation study results: The Aug., Class., and Pretrain columns indicate whether the model employs data augmentation, classification loss, and pretraining, respectively. We observe that the classification loss enhances performance for under-represented groups, while pretraining and data augmentation reduce the overall ejection fraction (EF) error.

| Aug. | Class. | Pretrain | R² | MAE | F1 < 40% |
|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | 0.75 | 4.48 | 0.76 |
| ✓ | ✓ | ✗ | 0.74 | 4.59 | 0.77 |
| ✓ | ✗ | ✓ | 0.75 | 4.47 | 0.73 |
| ✗ | ✓ | ✓ | 0.75 | 4.47 | 0.77 |
| ✓ | ✓ | ✓ | 0.76 | 4.45 | 0.78 |

2.4 Conclusion

In this work, we present a deep learning model that delivers the advantage of explainability through GNN-based latent graph learning. While we have demonstrated the success of our framework for EF estimation, we argue that the same pipeline could be applied to other datasets and problems, introducing a new paradigm for video processing and prediction tasks in clinical data and beyond. While our model surpasses previous works in EF estimation and offers explainability, there are certain limitations that future work can address. First, although the explainability provided over the frames and edges of the echo-graph allows for the identification of cases requiring closer inspection, it does not enable the determination of regions within each frame that the model is uncertain about. We contend that an attention map over the pixels in each frame could further enhance explainability. Second, creating a complete graph for long videos results in a significant memory cost. While this is not an issue for echo, where videos are relatively short, alternative graph construction methods should be explored for longer videos.

Chapter 3

Hierarchical Graph Neural Networks for Left Ventricle Landmark Detection

3.1 Introduction and Related Works

Left ventricular hypertrophy (LVH), one of the leading predictors of adverse cardiovascular outcomes, is the condition in which the heart's mass abnormally increases secondary to anatomical changes in the left ventricle (LV). Hypertension, aortic stenosis (AS), and/or intense athletic training are some conditions that cause overexertion of the heart and consequently affect its anatomy [5, 26]. These anatomical changes include an increase in the septal and LV wall thickness, and the enlargement of the LV chamber.
More specifically, the inter-ventricular septal (IVS) thickness, left ventricular posterior wall (LVPW) thickness, and left ventricular internal diameter (LVID) are assessed to investigate LVH and the risk of heart failure [54].

Echocardiography (echo), a noninvasive Ultrasound (US) cardiac imaging technique, depicts the different chambers of the heart as it pumps blood over time, allowing its anatomy and function to be evaluated. The clinical procedure for finding LV measurements involves selecting a representative frame in the echo and pinpointing four or six anatomical pixel locations (landmarks) on the selected frame based on clinical guidelines. As shown in Figure 3.1 (A), these landmarks characterize the IVS, LVPW, and LVID, and allow cardiac function assessment.

Figure 3.1: (A) Inter-ventricular septal (IVS), left ventricular internal diameter (LVID), and left ventricular posterior wall (LVPW) measurements visualized on a parasternal long-axis (PLAX) echocardiography (echo) frame: We can see that four landmark coordinates are normally enough to characterize these measurements. (B) Left ventricle (LV) landmark label smoothing example: If the wall landmark labels (e.g., within the circle) are smoothed by an isotropic Gaussian distribution, points along the visualized wall and ones perpendicular to it are penalized equally. Ideally, if the model learns the edge of the wall, it should be penalized less.

The manual nature of the frame selection and landmark detection procedures mentioned previously, along with varying levels of operator experience, leads to high inter-observer variability in the clinical evaluation for finding LV landmarks. This has prompted interest in automatic LV landmark detection methods, particularly as point-of-care US imaging devices become more popular. However, since LV landmark detection involves identifying a small number of pixel locations (usually four or six) in a high-dimensional image, achieving high accuracy with machine learning (ML) models is challenging due to the sparse positive training signals associated with the correct pixel locations. This problem is exacerbated by the fact that ground-truth LV labels are only provided for a few frames in an echo video. This highlights the need for generalizable models that establish strong inductive bias despite the sparsity of labeled data.

Most prior works for this task have focused on LVID and use convolutional neural networks (CNNs) such as fully convolutional networks (FCNs) [52] or U-Nets [68] to perform either segmentation or a regression of landmark locations. Sofka et al. [75] use FCNs to generate prediction heatmaps followed by a center-of-mass layer to produce the coordinates of the landmark locations. In addition, they use long short-term memory (LSTM) [71] units to extract the temporal relationships among echo frames. Another work [48] uses a modified U-Net model to produce a segmentation map followed by a focal loss that penalizes pixel predictions in close proximity to the ground truth landmark locations, modulated by a Gaussian distribution. A tracking head is also used to encourage consistency between the predicted landmarks of adjacent echo frames. Jafari et al. [36] employ a similar U-Net model with BNNs [25] to estimate the uncertainty in model predictions and reject samples that exhibit high uncertainty. Gilbert et al. [22] smooth ground truth labels by placing two-dimensional Gaussian heatmaps around landmark locations at angles that are statistically obtained from training data.
Lastly, Duffy et al. [17] use atrous convolutions [10] to make predictions for the LVID, IVS, and LVPW measurements.

Other related works focus on detecting cephalometric landmarks from X-rays of the head. These works are highly transferable to the task of LV landmark detection, as cephalometric landmark detection also involves detecting a sparse number of landmarks. McCouat et al. [53] abstain from using Gaussian label smoothing but still rely on one-hot labels and treat landmark detection as a pixel-wise classification task. Chen et al. [11] create a feature pyramid from the intermediate layers of a ResNet [81] model followed by a self-attention module. Lastly, Yao et al. [85] focus on adversarial attacks in the context of detecting cephalometric landmarks.

While showing different levels of performance, previous works to date are all CNN-based in a supervised learning setting with direct pixel-level predictions. They need to predict only a few positive pixels in a high-dimensional image and are thus largely limited by the sparse annotation issue. Some prior works smooth the pixel labels by adding Gaussian distributions around the landmarks to alleviate the issue. However, as shown in Figure 3.1 (B), we argue that the distribution's shape and the angle of its placement on the image introduce misleading bias into the training process and hurt prediction performance.

3.1.1 Contributions

In this work, we address the challenges posed by sparse annotation and avoid label smoothing by proposing an echo-based GNN for LV landmark detection (EchoGLAD). Our framework, illustrated in Figure 3.2 and described in Section 3.2, learns useful representations on a hierarchical grid graph constructed from the input echo image and performs multi-level prediction tasks.

As shown in Figure 3.2, at the bottom (fine) level, a grid graph is formed where each pixel serves as a node connected to its vertical and horizontal neighbors. For the upper (coarse) levels, a similar grid graph is created, with nodes corresponding to image patches instead of individual pixels. Cross-level edges are added so that each node in an upper-level graph connects to the nodes of the same region in the lower-level graph. Node classification is performed at each level to predict whether a node (pixel or patch) contains the landmark. Upper-level prediction tasks are simpler than lower-level tasks and provide additional guidance for the model to improve the lower-level predictions. Moreover, our framework captures both cross-level and within-level dependencies among nodes by leveraging message passing over this hierarchical graph.

Our approach differs from prior works as it aims to avoid the issues demonstrated in Figure 3.1 (B) and the sparse annotation problem by introducing simpler auxiliary tasks to guide the main pixel-level task. Consequently, the ML model learns the landmark locations without relying on Gaussian label smoothing.
Additionally, our framework enhances representation learning through efficient message passing [23, 72] in GNNs among pixels and patches at different levels, without the high computational complexity of transformers [16, 50].

In summary, our contributions are threefold:

• We propose EchoGLAD, a novel GNN framework for LV landmark detection, performing message passing over hierarchical graphs constructed from an input echo;

• We introduce a hierarchical supervision that is automatically induced from sparse annotations to alleviate the issue of label smoothing;

• We evaluate our model on two LV landmark datasets and show that it not only achieves state-of-the-art mean absolute error (MAE) (1.46 mm and 1.86 mm across the three LV measurements) but also outperforms other methods in OOD testing (achieving 4.3 mm).

Figure 3.2: Overview of our proposed model architecture: Hierarchical Feature Construction provides node features for the hierarchical graph representation of each echo frame, where the nodes in the main graph correspond to pixels in the image, and nodes in the auxiliary graphs correspond to patches in the image. Graph Neural Networks (GNNs) are used to process the hierarchical graph representation and produce node embeddings for the auxiliary graphs and the main graph. Multi-Layer Perceptrons (MLPs) are followed by a Sigmoid output function to map the node embeddings into landmark heatmaps of different granularity over the input echo frame.

3.2 Method

3.2.1 Problem Setup

We consider the following supervised setting for LV wall landmark detection. We have a dataset D = {X, Y}, where |D| = n is the number of {x_i, y_i} pairs such that x_i ∈ X, y_i ∈ Y, and i ∈ [1, n]. Each x_i ∈ R^{H×W} is an echo image of the heart, where H and W are the height and width of the image, respectively, and each y_i is the set of four point coordinates [(h_1^i, w_1^i), (h_2^i, w_2^i), (h_3^i, w_3^i), (h_4^i, w_4^i)] indicating the landmark locations in x_i. Our goal is to learn a function f : R^{H×W} → R^{4×2} that predicts the four landmark location coordinates for each input image. Figure 3.3 further clarifies how the model generates landmark location heatmaps at different scales.

3.2.2 Proposed Framework

As shown in Figure 3.2, each input echo frame is represented by a hierarchical grid graph where each sub-graph corresponds to the input echo frame at a different resolution. The model produces heatmaps over both the main pixel-level task and the coarse auxiliary tasks. It must be noted that, due to their lower resolutions, the auxiliary tasks have smaller solution spaces and are easier for the model to predict. While the pixel-level heatmap prediction is of the main interest, we use a hierarchical multi-level loss approach where the model's predictions over the auxiliary tasks are used during training to optimize the model through comparisons to coarser versions of the ground truth. The intuition behind this approach is that the model learns nuances in the data by performing landmark detection on the easier auxiliary tasks and uses this established reasoning when performing the difficult pixel-level task. The hierarchical graph representation is described in the following subsections, while the objective functions for optimizing the model are explained in Section 3.2.3.

Hierarchical Graph Construction

To learn representations that better capture the dependencies among pixels and patches, we introduce a hierarchical grid graph along with multi-level prediction tasks.
These tasks increase in difficulty as a result of the increasing number of locations that the model must consider to predict the landmarks. For example, the simplest task consists of a grid graph with only four nodes, where each node corresponds to one of four equally-sized patches in the original echo image. In the main task (the one at the bottom of Figure 3.2, which is the most difficult), the number of nodes is equal to the number of pixels in the original image.

More formally, let us denote a graph as G = (V, E), where V is the set of nodes and E is the set of edges in the graph, such that if v_i, v_j ∈ V and there is an edge from v_i to v_j, then e_{i,j} ∈ E. As shown in Figure 3.2, to build hierarchical task representations, for each image x ∈ X and ground truth y ∈ Y, K different auxiliary graphs G_k(V_k, E_k) are constructed using the following steps for each k ∈ [1, K]:

1. 2^k × 2^k = 4^k nodes are added to V_k to represent each patch in the image. Note that larger values of k correspond to graphs of finer resolution, while smaller values of k correspond to coarser graphs.

2. Edges are added in a grid-like manner such that e_{l−1,s}, e_{l+1,s}, e_{l,s−1}, e_{l,s+1} ∈ E_k for each l, s ∈ [1, ..., 2^k] if these neighbouring nodes exist in the graph (border nodes will not have four neighbouring nodes).

3. A patch feature embedding z_j^k, where j ∈ [1, ..., 4^k], is generated and associated with that patch (node) v_j ∈ V_k. The patch feature construction technique is described in Section 3.2.2.

4. Binary node labels ŷ^k ∈ {0,1}^{4^k×4} are generated such that ŷ_j^k = 1 if at least one of the ground truth landmarks in y is contained in the patch associated with node v_j ∈ V_k. Note that for each auxiliary graph, four different one-hot labels are predicted, corresponding to each of the four landmarks required to characterize the LV measurements.

Figure 3.3: An example of the model's prediction for a single input echo: Here, we show the model's prediction for the case where only three auxiliary graphs are used. We see that the model is learning the left ventricle (LV) landmarks at different resolutions (patch sizes of 112×112, 56×56, 28×28, and 1×1) to achieve high accuracy on the main pixel-level task. We show zoomed-in versions of the higher-resolution task to enable comparison.

The main graph, G_main, has a grid structure and contains H × W nodes regardless of the value of K, where each node corresponds to a pixel in the image.

Additionally, to allow the model to propagate information across levels, we add inter-graph edges such that each node in a graph is connected to the four nodes in the corresponding region of the next finer graph, as shown in the middle part of Figure 3.2. Through these edges, GNNs enable message passing between tasks of different difficulty to establish better inductive bias and ease landmark detection in the main pixel-level graph.
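To make steps 1-2 and the cross-level wiring concrete, the sketch below enumerates the grid edges of each auxiliary graph and the inter-graph edges to the four child nodes at the next finer level; the node-id scheme, and the omission of the main pixel-level graph and node features, are our simplifications.

```python
def build_hierarchical_edges(num_levels: int):
    """Grid edges for levels k = 1..K (2^k x 2^k patches) plus cross-level edges."""

    def node_id(k: int, r: int, c: int) -> int:
        offset = sum(4 ** j for j in range(1, k))  # nodes of all coarser levels
        return offset + r * 2 ** k + c             # row-major index within level k

    edges = []
    for k in range(1, num_levels + 1):
        n = 2 ** k
        for r in range(n):
            for c in range(n):
                if r + 1 < n:  # vertical grid neighbour (step 2)
                    edges.append((node_id(k, r, c), node_id(k, r + 1, c)))
                if c + 1 < n:  # horizontal grid neighbour (step 2)
                    edges.append((node_id(k, r, c), node_id(k, r, c + 1)))
                if k < num_levels:
                    # Cross-level edges to the four children covering the same region.
                    for dr in (0, 1):
                        for dc in (0, 1):
                            edges.append((node_id(k, r, c),
                                          node_id(k + 1, 2 * r + dr, 2 * c + dc)))
    return edges
```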
Node Feature Construction

The graph representation described in the previous section is not complete without proper node features, denoted by z ∈ R^{|V|×d}, characterizing the patches or pixels of the image. As shown in Figure 3.4, to achieve this, the grey-scale image is initially expanded in the channel dimension using a CNN. The features are then fed into a U-Net, where the decoder part is used to obtain node features such that deeper layer embeddings correspond to the node features of the finer graphs (the details of the architecture of the CNN and the U-Net are provided in Section 3.3.2). This means that the main pixel-level graph receives the features of the last layer of the network. This way, the finer-scale graphs, where the prediction task is more difficult due to the large solution space, contain information from larger receptive fields, which further aids the prediction process.

Figure 3.4: A convolutional neural network (CNN) is initially used to expand the number of feature maps. The intermediate features of the decoder part of a U-Net are then used as node features such that deeper representations correspond to the node features of finer graphs. This way, the node features correspond to the proper patches in the original image, while also providing richer information due to the increasing size of the receptive field.

It is also possible to obtain node features using a vanilla CNN or average pooling with different window sizes. However, as shown in Section 3.3.4, the U-Net backbone provides the best performance.

Hierarchical Message Passing

In this section, we explain how we perform message passing on the hierarchical graph described in the previous sections using GNNs to learn node representations for predicting landmarks (node labels). The whole hierarchical graph created for each sample, i.e., the main graph, the auxiliary graphs, and the cross-level edges, is collectively denoted as G_i, where i ∈ [1, ..., n]. Each G_i is fed into GNN layers followed by an MLP:

h_nodes^{l+1} = ReLU(GNN^l(G_i, h_nodes^l)),    l ∈ [1, ..., L]    (3.1)

h_out = σ(MLP(h_nodes)),    (3.2)

where σ is the Sigmoid function, h_nodes^l ∈ R^{|V_{G_i}|×d} is the set of d-dimensional embeddings for all nodes in the graph at layer l, and h_out ∈ [0,1]^{|V_{G_i}|×4} is the four-channel prediction for each node, with each channel corresponding to a heatmap for one of the pixel landmarks (the details of the GNN layers used are provided in Section 3.3.2). The initial node features h_nodes^1 are set to the features z described above. The output coordinates (x_out^p, y_out^p) for each landmark p ∈ [1, 2, 3, 4] are obtained by taking the expected value of the individual heatmaps h_out^p along the x and y directions:

x_out^p = ∑_{s=1}^{|V_{G_i}|} softmax(h_out^p)_s · loc_x(s);    (3.3)

y_out^p = ∑_{s=1}^{|V_{G_i}|} softmax(h_out^p)_s · loc_y(s).    (3.4)

Here, we vectorize the two-dimensional heatmap into a single vector before feeding it to the softmax operator. loc_x and loc_y return the x and y positions of a node in the image. It must be noted that, unlike some prior works such as Duffy et al. [17] that use post-processing steps such as imposing thresholds on the heatmap values, our work directly uses the output heatmaps to find the final predictions.
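A sketch of the coordinate extraction in Equations 3.3 and 3.4 for the main pixel-level graph, where the heatmap is vectorized, softmax-normalized, and reduced to expected (x, y) positions; the function name and tensor layout are ours.

```python
import torch


def heatmap_to_coordinates(heatmap: torch.Tensor) -> torch.Tensor:
    """Soft-argmax over a (H, W, 4) heatmap, one channel per landmark -> (4, 2)."""
    h, w, p = heatmap.shape
    probs = torch.softmax(heatmap.reshape(-1, p), dim=0)  # normalize over all nodes
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    loc_x = xs.reshape(-1, 1)                             # loc_x(s) for every node s
    loc_y = ys.reshape(-1, 1)                             # loc_y(s) for every node s
    x_out = (probs * loc_x).sum(dim=0)                    # Equation 3.3
    y_out = (probs * loc_y).sum(dim=0)                    # Equation 3.4
    return torch.stack([x_out, y_out], dim=1)
```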
3.2.3 Training and Objective Function

To train the network, we leverage two types of objective functions. 1) Weighted binary cross-entropy (BCE): We use a BCE loss because each node should be classified as either a landmark or a non-landmark. However, since the number of landmark locations is much smaller than the number of non-landmark locations, we use a higher weight for the landmark nodes, as disclosed in Section 3.3.2. 2) L2 regression of landmark coordinates: To give the model better training signals, we add a regression loss term in addition to the node classification objective. The regression objective is the L2 loss between the predicted coordinates found using Equation 3.3 and Equation 3.4 and the ground truth labels.
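A minimal sketch of the combined objective (the positive-class weight is the one disclosed in Section 3.3.2, while the balancing coefficient alpha and all names are our assumptions).

```python
import torch
import torch.nn.functional as F


def landmark_loss(node_probs: torch.Tensor, node_labels: torch.Tensor,
                  coords_pred: torch.Tensor, coords_true: torch.Tensor,
                  pos_weight: float = 9000.0, alpha: float = 1.0) -> torch.Tensor:
    """Weighted BCE over node labels plus L2 regression on landmark coordinates."""
    # Up-weight the rare landmark nodes: weight = pos_weight where label == 1, else 1.
    weights = 1.0 + (pos_weight - 1.0) * node_labels
    bce = F.binary_cross_entropy(node_probs, node_labels, weight=weights)
    l2 = ((coords_pred - coords_true) ** 2).sum()  # soft-argmax outputs vs. labels
    return bce + alpha * l2
```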
3.3 Experiments

3.3.1 Dataset

To train and evaluate the performance of our framework, we use an internal private dataset and an external public dataset. Internal Dataset: Our private dataset contains 29,867 PLAX echo frames labelled with the LV landmarks characterizing LVID, IVS, and LVPW. We split the dataset in a patient-exclusive manner with 23,824, 3,004, and 3,039 frames for training, validation, and testing, respectively. These frames are down-sampled to a fixed size of 224×224. External Dataset: The public Unity Imaging Collaborative (UIC) [31] LV landmark dataset consists of a combination of 3,822 end-systolic (ES) and end-diastolic (ED) PLAX echo frames acquired from seven British echocardiography labs. The provided splits contain 1,613, 298, and 1,911 training, validation, and testing samples, respectively. Similar to the private dataset, we down-sample the frames to a fixed size of 224×224.

Datasets' Distribution Discrepancy

In Figure 3.5, we show a few examples that demonstrate the distribution difference between our private dataset and UIC's public dataset. We see that there are a number of common cases in UIC that are not represented in our private dataset, which further highlights the importance of the OOD quantitative results presented in this chapter.

Figure 3.5: Examples demonstrating the distribution difference between the private dataset and the Unity Imaging Collaborative (UIC) public dataset: While the private dataset contains mostly clean samples that are free of extra annotations and markings, UIC's dataset contains samples that include the Doppler window (A, B) or other annotations outside the image area (C, D, E). Additionally, while most samples in the private dataset are properly aligned in the frame, there are cases in UIC where the LV is placed uncharacteristically at the top of the image (C).

3.3.2 Implementation Details

Our model creates K = 7 auxiliary graphs to aid the main pixel-level task. For the node features, the initial single-layer CNN uses a kernel size of 3 and zero-padding to output features with a dimension of 224×224×4 (C = 4). The U-Net's encoder contains 7 layers with 128×128, 64×64, 32×32, 16×16, 8×8, 4×4, and 2×2 spatial dimensions and 8, 16, 32, 64, 128, 256, and 512 channels, respectively. The intermediate features of the U-Net are fed into a 1×1 convolution layer to equalize the number of channels to 128, meaning that each node feature is a 128-dimensional vector. Three graph convolutional network (GCN) [42] layers with a hidden node dimension of 128 are used. The node embeddings obtained from the GNN layers are fed into two-layer MLPs with hidden dimensions of 32 and 16 and an output dimension of four, followed by a Sigmoid activation function to produce landmark probabilities over the nodes. To optimize the model, we use the Adam optimizer [39] with an initial learning rate of 0.001, β of (0.9, 0.999), and a weight decay of 0.0001, and for the weighted BCE loss, we use a weight of 9000. The model is implemented using PyTorch [63] and PyTorch Geometric [19] and is trained on two 32-GB Nvidia Titan GPUs. We provide the training details for prior works in Section A.2.

3.3.3 Results and Discussion

Quantitative Results

The four landmark coordinate predictions of the model and the expert annotations are used to create the predicted and ground truth LVID, IVS, and LVPW measurements. The unit for these measurements is pixels, which must be converted to millimeters (mm) using pixel-to-mm ratios that are available and specific to every image frame. The model is evaluated in terms of how close these predicted values and the ground-truth labels are. More specifically, the error is calculated using the mean absolute error (MAE) in mm and the mean percentage error (MPE) in percent as follows:

MAE = |L_pred − L_true|    (3.5)

MPE = 100 × |L_pred − L_true| / L_true,    (3.6)

where L_pred and L_true are the predicted and ground truth values for each measurement, respectively. Furthermore, we report the success detection rate (SDR) for LVID at the 1, 2, 3, and 6 mm thresholds. This rate shows the percentage of samples where the absolute error between the ground truth and the LVID prediction is below the specified threshold. We provide the SDR for IVS and LVPW in Section A.2. Using these metrics, we compare our model's performance with six prominent prior works in both the ID and OOD settings.

In-Distribution Results. To compare our model's performance with previous works in the ID setting, where the training and test sets come from the same distribution, we separately train and test the models on the private and public datasets. As shown in Table 3.1, on the private dataset, our model outperforms previous work in all measurements except for LVID, where the performance is similar to McCouat et al. [53]. In Table 3.2, we report the test results on the UIC dataset, showing that our model outperforms previous works for all measurements except for IVS, where the performance is on par with Chen et al. [11].

Out-of-Distribution Results. To fully investigate the generalization ability of our model compared to previous works, we train all models on the private dataset (which consists of a larger number of samples than UIC) and test the trained models on the public UIC dataset. Based on our visual assessment, and as described in Section 3.3.1, the UIC dataset looks very different from the private dataset, thus serving as an OOD test-bed. As shown in Table 3.3, our model significantly outperforms previous works under this setting, which suggests that our framework generalizes better.

Table 3.1: Quantitative results on the private test set for models trained on the private training set in terms of mean absolute error (MAE), mean percentage error (MPE), and success detection rate (SDR): Lower values for MAE and MPE and higher values for SDR are better. We see that our model has the best average performance over the three measurements, which shows the superiority of our model in the in-distribution (ID) setting for the high-data regime.

| Model | MAE LVID [mm] ↓ | MAE IVS [mm] ↓ | MAE LVPW [mm] ↓ | MPE LVID [%] ↓ | MPE IVS [%] ↓ | MPE LVPW [%] ↓ | SDR LVID <1.0 mm ↑ | <2.0 mm ↑ | <3.0 mm ↑ | <6.0 mm ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Gilbert et al. [22] | 2.9 | 1.4 | 1.4 | 6.5 | 14.5 | 15.2 | 26.7 | 48.1 | 65.6 | 88.9 |
| Lin et al. [48] | 9.4 | 11.2 | 9.0 | 21.2 | 116.5 | 92.9 | 13.6 | 26.0 | 34.4 | 49.1 |
Table 3.1: Quantitative results on the private test set for models trained on the private training set in terms of mean absolute error (MAE), mean percentage error (MPE) and success detection rate (SDR): Lower values for MAE and MPE, and higher values for SDR, are better. We see that our model has the best average performance over the three measurements, which shows the superiority of our model in the in-distribution (ID) setting for the high-data regime.

                      MAE [mm] ↓            MPE [%] ↓             SDR [%] of LVID < ↑
Model                 LVID  IVS   LVPW      LVID  IVS    LVPW     1.0 mm  2.0 mm  3.0 mm  6.0 mm
Gilbert et al. [22]   2.9   1.4   1.4       6.5   14.5   15.2     26.7    48.1    65.6    88.9
Lin et al. [48]       9.4   11.2  9.0       21.2  116.5  92.9     13.6    26.0    34.4    49.1
McCouat et al. [53]   2.2   1.3   1.4       4.8   13.5   15.1     33.2    58.3    76.1    93.9
Chen et al. [11]      2.3   1.2   1.2       5.2   12.6   13.8     32.1    60.4    74.7    92.6
Yao et al. [85]       4.8   3.3   3.4       10.5  34.6   37.3     16.8    32.3    45.8    74.9
Duffy et al. [17]     2.5   1.2   1.2       5.4   13.2   13.5     28.4    52.1    70.1    93.0
Ours                  2.2   1.1   1.1       4.8   11.2   12.2     33.0    62.4    74.9    94.4

Table 3.2: Quantitative results on the public UIC test set for models trained on the public UIC training set in terms of mean absolute error (MAE), mean percentage error (MPE) and success detection rate (SDR): Lower values for MAE and MPE, and higher values for SDR, are better. Although the number of training samples is much lower for UIC compared to our private dataset, we see that our model still outperforms previous works on average over the three measurements, which showcases the accuracy of our model in the low-data regime and in-distribution (ID) settings. We were not able to train Lin et al.'s model [48] on this dataset since they rely on consistency among frames in a video, whereas this dataset only contains individual frames.

                      MAE [mm] ↓            MPE [%] ↓             SDR [%] of LVID < ↑
Model                 LVID  IVS   LVPW      LVID  IVS    LVPW     1.0 mm  2.0 mm  3.0 mm  6.0 mm
Gilbert et al. [22]   5.2   2.5   3.1       12.2  19.0   22.7     17.1    32.2    45.2    70.0
McCouat et al. [53]   2.5   1.6   2.4       7.5   14.8   19.9     29.3    56.4    71.8    91.7
Chen et al. [11]      2.3   1.5   2.3       7.1   12.5   21.4     31.2    57.3    84.8    94.6
Yao et al. [85]       15.4  8.8   9.2       44.8  78.5   80.5     3.4     7.5     12.1    24.6
Duffy et al. [17]     8.7   3.4   3.8       24.8  34.8   34.1     7.0     13.7    20.5    42.4
Ours                  2.2   1.5   1.9       6.2   14.0   16.9     32.0    58.9    81.2    94.9

Table 3.3: Quantitative results on the public UIC test set for models trained on the private training set in terms of mean absolute error (MAE), mean percentage error (MPE) and success detection rate (SDR): Lower values for MAE and MPE, and higher values for SDR, are better. This table shows the out-of-distribution (OOD) performance of the models when trained on a larger dataset and tested on a smaller external dataset. We can see that in this case, our model outperforms previous works by a large margin, which attests to the generalizability of our framework.

                      MAE [mm] ↓            MPE [%] ↓              SDR [%] of LVID < ↑
Model                 LVID  IVS   LVPW      LVID   IVS    LVPW     1.0 mm  2.0 mm  3.0 mm  6.0 mm
Gilbert et al. [22]   9.5   4.8   4.1       23.5   32.3   26.8     11.5    22.5    31.7    52.2
Lin et al. [48]       51.5  51.7  41.3      121.0  375.8  298.0    5.8     11.3    15.1    24.6
McCouat et al. [53]   5.9   3.6   4.4       18.5   30.5   36.4     18.2    34.6    52.9    72.3
Chen et al. [11]      7.4   5.3   6.9       22.5   49.4   62.4     14.6    28.9    40.3    65.3
Yao et al. [85]       22.4  14.2  13.4      68.8   141.7  121.6    4.1     7.7     12.1    25.1
Duffy et al. [17]     13.7  4.1   5.5       36.8   36.4   45.4     2.5     6.2     9.6     20.6
Ours                  5.8   2.8   4.3       18.4   23.8   34.6     18.6    35.8    49.0    74.9

Qualitative Results

To better understand the model's behaviour when it fails to make correct predictions, we visualize the predictions against the ground truth for two failure cases in Figure 3.6. In the first failure example, our model underestimates the wall thickness. This case belongs to a patient diagnosed with severe LVH, which causes extreme thickening of the LV walls. We believe this failure is caused by the low image quality making it hard to pinpoint the edges of the LV walls. In the second failure example, our model is confused by the sub-optimal quality of the PLAX view, in which the chambers and the walls, especially the upper IVS landmark, are not properly captured.
We also provide an exhaustive set of qualitative results showing multiple success and failure cases for both the ID and OOD settings in Section A.2.

Figure 3.6: Qualitative visualization of our model on two failure cases from the test set of our private dataset: Failure Example 1 is a low-quality PLAX image that also corresponds to a patient with severe LVH, a scenario that occurs rarely in our dataset. Failure Example 2 belongs to a case with a low-quality PLAX view with unclear boundaries for the walls and the chambers of the LV.

3.3.4 Ablation Study

To show the incremental performance benefits that arise from our architectural choices, we performed an ablation study over different variants of the model. Vanilla U-Net uses the last-layer activations of a U-Net model without hierarchy or graph components. U-Net Main Graph only uses a single graph that represents the original frame at the pixel level; the node features come from the last-layer activations of a U-Net model. Single-Scale Loss uses the same framework as the main model but replaces the multi-scale loss with a single loss computed only for the main graph. Main Model is our main framework that uses the intermediate layers of a U-Net model to produce node features for seven auxiliary graphs and a main graph; the loss associated with each auxiliary task is computed individually to create a multi-scale objective (sketched below).

In Table 3.4, we quantitatively show the benefits of a hierarchical graph representation with a multi-scale objective for the task of LV landmark detection. More specifically, we see that the addition of a single main graph to a vanilla U-Net model significantly increases accuracy. We postulate that this is due to GNNs enabling message passing between landmark locations that are highly correlated. Additionally, we see that the addition of hierarchy into the model through the use of auxiliary tasks and a multi-scale loss achieves the best performance, showing that the model is able to build a better inductive bias by incorporating simpler tasks alongside the main, more difficult task. However, having a hierarchical framework without a multi-scale loss (Single-Scale Loss) causes the model to perform worse than the single-graph variant.

In addition to the quantitative results of the ablation study provided in Table 3.4, we provide qualitative comparisons between different variants of the model in Figure 3.7. We see that the Vanilla U-Net and U-Net Main Graph variants of the model struggle, as evidenced by the diffuseness of the predicted heatmaps. In contrast, the Main Model performs significantly better despite the low quality of the images.
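A minimal sketch of the multi-scale objective that separates the Main Model from the Single-Scale Loss variant is given below; the per-graph BCE term and the uniform weighting of the terms are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def multi_scale_objective(graph_logits, graph_labels):
    """Sketch of the multi-scale objective (uniform weighting is assumed).

    graph_logits / graph_labels: lists over the main pixel-level graph and
    the K auxiliary graphs; entry k holds the (N_k, 4) node logits and the
    correspondingly coarsened binary landmark labels for graph k. Keeping
    only the first (main-graph) term recovers the Single-Scale Loss variant.
    """
    losses = [F.binary_cross_entropy_with_logits(z, y)
              for z, y in zip(graph_logits, graph_labels)]
    return torch.stack(losses).sum()
```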
Table 3.4: Ablation results on the validation set of our private dataset for the model architecture: Vanilla U-Net uses the output of a simple U-Net model for segmentation, while U-Net Main Graph only uses the pixel-level graph. Main Model is our model that uses our proposed hierarchical approach. Lastly, Single-Scale Loss has the same framework as the Main Model but only computes the loss for the model's predictions on the main graph during training. We see that the addition of a hierarchical representation and the use of a multi-scale loss improve performance.

                    MPE [%]
Model               LVID   IVS    LVPW
Vanilla U-Net       5.31   13.17  13.47
U-Net Main Graph    4.98   11.67  12.78
Single-Scale Loss   5.41   12.37  12.80
Main Model          4.91   11.45  12.36

Figure 3.7: Qualitative ablation results for the model architecture: Landmark heatmaps from top to bottom are color-coded with red, cyan, pink and green, respectively. We see that Vanilla U-Net struggles to make confident and accurate landmark predictions. While the addition of a main grid graph in U-Net Main Graph relatively improves the model's performance, it still does not produce accurate results. In contrast, the Main Model produces confident prediction heatmaps by relying on a hierarchical graph representation as well as multi-scale objectives. We also see that the removal of the multi-scale objective (Single-Scale Loss) degrades performance.

Ablation Study on Feature Constructors

As shown in Figure 3.8, we consider three different approaches to hierarchical feature construction: two-dimensional average pooling, vanilla CNNs, and U-Net intermediate features. The U-Net approach is described previously, and a description of the other approaches is provided below:

• 2D Average Pooling: We use 2D average pooling layers with different kernel sizes to summarize patch-level information. That is, for each k ∈ {1, ..., K}, where K is the number of auxiliary graphs, we use a 2D average pooling layer with a kernel size of (⌊H/2^k⌋, ⌊W/2^k⌋) and a stride of (⌊H/2^k⌋, ⌊W/2^k⌋) (see the sketch following this discussion).

• Vanilla CNN Features: A CNN is used to construct the features for the auxiliary graphs such that deeper layers contain the features for coarser graphs. The kernel size for each CNN layer is chosen so that the resulting intermediate feature map's dimension matches the number of nodes in the corresponding graph. The intuition behind this approach is that deeper features have larger receptive fields corresponding to larger patches of the original image.

In Table 3.5, we study the impact of these different node feature extraction methods. We see that the U-Net-based model significantly outperforms the others. We hypothesize that the Vanilla CNN and average-pooling-based methods perform poorly because their shallow representations for the main-level graph do not provide enough information for the task. In contrast, the U-Net-based method provides a larger receptive field for each node in the main pixel-level task, which is also the most challenging.
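To make the pooling-based constructor concrete, here is a minimal sketch following the kernel/stride rule stated in the first bullet above; the function name and feature-map shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def avg_pool_node_features(feat, num_aux_graphs=7):
    """feat: (C, H, W) output of the initial single-layer CNN.
    Returns a list with one (num_nodes, C) node-feature matrix per auxiliary
    graph; graph k summarizes the image with non-overlapping patches of size
    (floor(H / 2^k), floor(W / 2^k))."""
    _, H, W = feat.shape
    node_feats = []
    for k in range(1, num_aux_graphs + 1):
        kh, kw = H // 2 ** k, W // 2 ** k          # kernel = stride
        pooled = F.avg_pool2d(feat.unsqueeze(0),
                              kernel_size=(kh, kw),
                              stride=(kh, kw)).squeeze(0)   # (C, h_k, w_k)
        node_feats.append(pooled.flatten(1).t())            # (num_nodes, C)
    return node_feats
```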
Table 3.5: Ablation results on the validation set of our private dataset for different node feature extraction methods: We see that the U-Net-based method outperforms the others.

                    MPE [%]
Feature Extractor   LVID   IVS    LVPW
Average Pooling     15.21  25.32  23.41
Vanilla CNN         9.36   15.35  20.53
HiEchoGNN           4.91   11.45  12.36

Figure 3.8: Different approaches to node feature construction: A convolutional neural network (CNN) is initially used to expand the number of feature maps. Different feature construction methods can then be employed: (A) 2D average pooling layers with different kernel sizes are used to generate features for nodes of auxiliary graphs with different coarseness levels. (B) Multiple CNN layers are used to transform the image, and the intermediate features are used as node features such that deeper layers contain the features for coarser graphs. (C) The intermediate features of the decoder part of a U-Net are used as node features such that deeper representations correspond to node features of finer graphs.

3.4 Conclusion

In this work, we introduce a novel hierarchical GNN for left ventricle landmark detection. The model performs better than the state-of-the-art on most measurements without relying on label smoothing. The performance gains of our model are attributed to two design choices. First, the hierarchical graph construction and the hierarchical message-passing GNNs enable better information propagation and representation learning. Second, the hierarchical supervision makes learning easier for the pixel-level prediction task. While the model shows promising performance, we believe that the scalability of the framework to higher-resolution images must be studied as a future direction.

Chapter 4

General, Echocardiogram-Based, Multi-Level Transformer Framework for Cardiovascular Diagnoses

4.1 Introduction and Related Works

Echocardiography (echo) is an Ultrasound (US) imaging modality widely used to effectively depict dynamic cardiac anatomy from different standard views [79]. The challenges in accurately making echo-based diagnoses have given rise to vision-based machine learning (ML) models for automatic predictions. Numerous works perform segmentation of cardiac chambers: Liu et al. [49] perform left ventricle (LV) segmentation using feature pyramids and a segmentation coherency network, while Cheng et al. [12] and Thomas et al. [80] use contrastive learning and GNNs for the same purpose, respectively. Others introduce ML frameworks for echo view classification [21, 28] or for detecting important phases (e.g., end-systolic (ES) and end-diastolic (ED)) in a cardiac cycle [20]. Other bodies of work focus on making disease predictions from input echo. For instance, Duffy et al. [17] perform landmark detection to predict left ventricular hypertrophy (LVH), while Roshanitabrizi et al. [69] predict rheumatic heart disease from Doppler echo using an ensemble of transformers and convolutional neural networks (CNNs).

As previously mentioned, the versatility of echo and its numerous applications in assessing heart function have led to the development of a variety of automatic ML frameworks. This presents logistical challenges, as multiple models must be maintained for different cardiac metrics. Consequently, there is a need for a flexible model capable of handling various clinical tasks with minimal modifications. Additionally, due to the safety-critical nature of the problem, such a model must maintain a high level of explainability despite its flexibility. Furthermore, given the multi-view nature of echo, the model must be able to capture interactions between multiple videos for a specific patient. In this chapter, we propose a General, Echo-based, Multi-Level Transformer (GEMTrans) framework that provides explainability while simultaneously enabling multi-video training, capturing the interplay among echo image patches within the same frame, across all frames in the same video, and between videos.
This framework can easily be adapted to various tasks, as its multi-level structure can capture a wide range of relationships while providing attention-based explainability.

In this chapter, we focus on ejection fraction (EF) estimation and aortic stenosis (AS) detection as two example applications to showcase the flexibility of our framework and to enable its comparison to prior works in a tractable manner. Therefore, in the following two paragraphs, we give a brief introduction to EF, AS and prior automatic detection works specific to these tasks.

EF is a ratio that indicates the volume of blood pumped by the heart and is an important indicator of heart function. The clinical procedure for estimating this ratio involves finding the ES and ED frames in echo cine series (videos) and tracing the LV on these frames. A high level of inter-observer variability of 7.6% to 13.9% has been reported in clinical EF estimates [62]. Due to this, various ML models have been proposed to automatically estimate EF and act as secondary layers of verification. More specifically, Esfeh et al. [38] propose a Bayesian neural network (BNN) that produces uncertainty estimates along with EF predictions, while Reynaud et al. [66] use a BERT [15] transformer to capture frame-to-frame relationships. Recently, Mokhtari et al. [58] provide explainability in their framework by learning a graph structure among the frames of an echo.

The other cardiovascular task we consider is the detection of AS, a condition in which the aortic valve (AV) becomes calcified and narrowed; it is typically detected using spectral Doppler measurements [61, 76]. High inter-observer variability, limited access to expert cardiac physicians, and the unavailability of spectral Doppler in many point-of-care US devices are challenges that can be addressed through the use of automatic AS detection models. For example, Huang et al. [33, 34] predict the severity of AS from single echo images, while Ginsberg et al. [24] adopt a multitask training scheme.

4.1.1 Contributions

In this chapter, we introduce GEMTrans, a multi-level, transformer-based framework that enables multi-video training. Our framework distinguishes itself from previous echo-based works in several ways. First, to the best of our knowledge, our model is the first to generate attention maps at the patch, frame, and video levels for echo data while simultaneously processing multiple videos, as demonstrated in Figure 4.1. Second, unlike prior works specifically designed for a single cardiovascular task, our framework is versatile and can be easily adapted to various echo-based metrics. This level of flexibility stems from its multi-level attention mechanism. For instance, in the case of EF, capturing the temporal attention between echo frames is crucial to track changes in LV volume, whereas for AS, spatial attention to the valve area is essential. The model is trained to adjust its learned attention accordingly, outperforming all previous works.
Lastly, we offer patch- and frame-level, attention-guided prototypical explainability.

Our contributions are summarized below:

• We propose GEMTrans, a general, transformer-based, multi-level ML framework for vision-based medical predictions on echo cine series (videos).

• We show that the task-specific attention learned by the model is effective in highlighting the important patches and frames of an input video, which allows the model to achieve a mean absolute error (MAE) of 4.49 on two EF datasets and a detection accuracy of 96.5% on an AS dataset.

• We demonstrate how prototypical learning can be easily incorporated into the framework for added multi-layer explainability.¹

¹ The contributions mentioned in relation to prototypical learning are not within the scope of this thesis and were made by Neda Ahmadi. We have included these to maintain consistency with the version submitted to MICCAI 2023, but we will not elaborate on these concepts further.

Figure 4.1: GEMTrans Overview: The multi-level transformer network processes one or multiple echocardiography (echo) videos and is composed of three main components. The Spatial Transformer Encoder (STE) produces attention among patches in the same image frame, while the Temporal Transformer Encoder (TTE) captures the temporal dependencies among the frames of each video. Lastly, the Video Transformer Encoder (VTE) produces an embedding summarizing all available data for a patient by processing the learned embedding of each video. Different downstream tasks can then be performed using this final learned embedding. During training, both the final prediction and the attention learned by different layers of the framework can be supervised (not all connections are shown for cleaner visualization).

4.2 Method

4.2.1 Problem Setup

The input data for both the EF and AS tasks consist of one or multiple B-mode echo videos denoted by X ∈ R^{K×T×H×W}, where K is the number of videos per sample, T is the number of frames per video, and H and W are the height and width of each grey-scale image frame.

For EF estimation, we consider both the single-video (A4C) and the dual-video (A2C and A4C) settings, corresponding to K = 1 and K = 2, respectively. Our datasets consist of triplets {x_i^ef, y_i^ef, y_i^seg}, where i ∈ [1, ..., n] is the sample number, y_i^ef ∈ [0, 1] is the ground truth EF value, y_i^seg ∈ {0,1}^{H×W} is the binary LV segmentation mask, and x_i ∈ R^{K×T×H×W} are the input videos defined previously. The goal is to learn an EF estimation function f_ef : R^{K×T×H×W} → R.

For AS classification, we consider the dual-video setting (PLAX and PSAX). Here, our dataset D_as = {X_as, Y_as} consists of pairs {x_i^as, y_i^as}, where x_i^as are the input videos and y_i^as ∈ {0,1}^4 is a one-hot label indicating healthy, mild, moderate and severe AS cases. Our goal is to learn f_as : R^{2×T×H×W} → R^4, which produces a probability distribution over the AS severity classes.
4.2.2 Proposed Framework

As shown in Figure 4.1, we employ a three-level transformer network, where the levels are tasked with patch-wise, frame-wise and video-wise attention, respectively.

The Spatial Transformer Encoder (STE) captures the attention among patches within a given frame and follows the Vision Transformer (ViT) [16] architecture. As shown in Equation 4.1 and Equation 4.2, the Spatial Tokenizer (ST) first divides the image into non-overlapping p×p patches before flattening and linearly projecting each patch:

x̂_{k,t} = [x̂_{k,t,1}, x̂_{k,t,2}, ..., x̂_{k,t,HW/p²}] = f_patch(x_{k,t}, p);   (4.1)

x'_{k,t} = [x'_{k,t,1}, x'_{k,t,2}, ..., x'_{k,t,HW/p²}] = f_lin(flatten(x̂_{k,t})),   (4.2)

where p is the patch size, k ∈ [1, ..., K] is the video number, t ∈ [1, ..., T] is the frame number, f_patch : R^{H×W} → R^{HW/p²×p×p} splits the image into equally-sized patches, and f_lin : R^{p²} → R^d is a linear projection that maps the flattened patches into d-dimensional embeddings. The tokens obtained from the ST are then fed into a transformer network [82] as illustrated in the following equations:

h⁰_{k,t} = [cls_spatial; x'_{k,t}] + E_pos;   (4.3)

h'ˡ_{k,t} = MHA(LN(h^{l−1}_{k,t})) + h^{l−1}_{k,t},   l ∈ [1, ..., L];   (4.4)

hˡ_{k,t} = MLP(LN(h'ˡ_{k,t})) + h'ˡ_{k,t},   l ∈ [1, ..., L];   (4.5)

z_{k,t} = LN(h^L_{k,t,0}),   (4.6)

where cls_spatial ∈ R^d is a token similar to the [class] token introduced by Devlin et al. [15], E_pos ∈ R^d is a learnable positional embedding, MHA is a multi-head attention network [82], and LN denotes the LayerNorm operation. The result z_{k,t} ∈ R^d can be regarded as an embedding summarizing the t-th frame in the k-th video of a sample.

The Temporal Transformer Encoder (TTE) takes as input the learned embeddings of the STE for each video, z_{k,1...T}, and performs operations similar to those in Equations 4.3 to 4.6 to generate a single embedding v_k ∈ R^d representing the whole video from the k-th view. The Video Transformer Encoder (VTE) is the same as the TTE, with the difference that each input token v_k ∈ R^d is a representation of a complete video from a certain view. The output of the VTE is an embedding u_i ∈ R^d summarizing the data available for patient i. This learned embedding can be used for various downstream tasks as described in Section 4.2.3.
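A minimal sketch of the spatial tokenizer and STE read-out is given below. It loosely mirrors Equations 4.1 to 4.6 using standard PyTorch modules; the class names, the per-token positional embedding, and the use of nn.TransformerEncoder are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SpatialTokenizer(nn.Module):
    """Sketch of Equations 4.1-4.2: split a grey-scale frame into p x p
    patches, flatten each, and linearly project to d dimensions."""
    def __init__(self, p=16, d=768):
        super().__init__()
        self.p = p
        self.lin = nn.Linear(p * p, d)           # f_lin

    def forward(self, x):                        # x: (B, H, W)
        B, H, W = x.shape
        p = self.p
        # f_patch via unfold: (B, H/p, W/p, p, p) -> (B, HW/p^2, p^2)
        patches = x.unfold(1, p, p).unfold(2, p, p)
        patches = patches.reshape(B, -1, p * p)
        return self.lin(patches)                 # (B, HW/p^2, d)

class STE(nn.Module):
    """Sketch of Equations 4.3-4.6: prepend a cls token, add a positional
    embedding, apply L pre-norm transformer layers, and read out the
    normalized cls embedding z_{k,t}."""
    def __init__(self, num_patches, d=768, L=12, heads=12):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d))
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, d))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           norm_first=True, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=L)
        self.ln = nn.LayerNorm(d)

    def forward(self, tokens):                   # tokens: (B, HW/p^2, d)
        B = tokens.shape[0]
        h = torch.cat([self.cls.expand(B, -1, -1), tokens], dim=1) + self.pos
        h = self.encoder(h)
        return self.ln(h[:, 0])                  # per-frame embedding z_{k,t}
```

In the same spirit, the TTE would consume the per-frame embeddings of one video and the VTE the per-video embeddings of one patient.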
More formally,we define ATTNspatialcls \u2208 [0,1]HW\/p2and ATTNtemporalcls \u2208 [0,1]T to be the Lth-layer,softmax-normalized spatial and temporal attention learned by the multi-head at-56Figure 4.2: Attention supervision example: For ejection fraction (EF), thespatial attention is penalized if attention is given outside the union of theleft ventricle (LV) mask for end-diastolic (ED) and end-systolic (ES).The temporal attention is also encouraged to give more attention toED\/ES locations.tention (MHA) module (see Equation 4.4) of STE and TTE, respectively. Theattention loss is defined asy\u2032seg = OR(y\u2032edseg,y\u2032esseg); (4.7)Lspatialattn, s =\uf8f1\uf8f2\uf8f3(ATTNspatialcls,s \u22120)2, if y\u2032seg,s = 0 (outside LV)0, otherwise;(4.8)Ltemporalattn, t =\uf8f1\uf8f2\uf8f3(ATTNtemporalcls,s \u22121)2, if t \u2208 [ED,ES]0, otherwise;(4.9)Lattn = \u03bbtemporal\u03a3Tt=1Ltemporalattn, t +\u03bbspatial\u03a3HW\/p2s=1 Lspatialattn, s , (4.10)where y\u2032edseg,y\u2032esseg \u2208 {0,1}HW\/p2are the coarsened versions (to match patch size) ofyseg at the ED and ES locations, OR is the bit-wise logical or function, ed\/es in-dicate the ED\/ES temporal frame indices, and \u03bbtemporal,\u03bbspatial \u2208 [0,1] control theeffect of spatial and temporal losses on the overall attention loss.Prototypical LearningPrototypical learning provides explainability by presenting training examples (pro-totypes) as the reasoning for choosing a certain prediction. As an example, in the57context of AS, patch-level prototypes can indicate the most important patches inthe training frames that correspond to different AS cases. Due to the multi-layernature of our framework and inspired by Xue et al. [84], we can expand this ideaand use our learned attention to filter out uninformative details prior to learningpatch and frame-level prototypes. As shown in Figure 4.1, the STE and TTE\u2019s at-tention information are used in prototypical branches to learn these prototypes. Itmust be noted that we are not using prototypical learning to improve performance,but rather to provide added explainability. For this reason, prototypes are obtainedin a post-processing step using the pretrained transformer framework. The networkused to obtain these prototypes is shown in Section A.3.4.2.3 Training and Objective FunctionThe output embedding of VTE denoted by u\u2208Rn\u00d7d can be used for various down-stream tasks. We use Equation 4.11 to generate predictions for EF and use an L2loss between y\u02c6ef and yief for optimization denoted by Lef. For AS severity classi-fication, Equation 4.12 is used to generate predictions and a cross-entropy loss isused for optimization shown as Las:y\u02c6ef = \u03c3(MLP(u)) y\u02c6ef \u2208 Rn\u00d71; (4.11)y\u02c6as = softmax(MLP(u)) y\u02c6as \u2208 Rn\u00d74; (4.12)Loverall = Lef or as+Lattn, (4.13)where \u03c3 is the Sigmoid function. Our overall loss function is shown in Equa-tion 4.13, where Lattn is desribed in Section 4.2.2.4.3 Experiments4.3.1 DatasetWe compare our model\u2019s performance to prior works on three datasets. In sum-mary, for the single-video case, we use the EchoNet Dynamic dataset that consistsof 10,030 A4C echo videos obtained at Stanford University Hospital [62] with a58training\/validation\/test (TVT) split of 7,465, 1,288 and 1,277. For the dual-videosetting, we use a private dataset of 5,143 pairs of A2C\/A4C videos with a TVT splitof 3,649, 731, 763. 
4.3 Experiments

4.3.1 Dataset

We compare our model's performance to prior works on three datasets. For the single-video case, we use the EchoNet Dynamic dataset, which consists of 10,030 A4C echo videos obtained at Stanford University Hospital [62], with a training/validation/test (TVT) split of 7,465, 1,288 and 1,277. For the dual-video setting, we use a private dataset of 5,143 pairs of A2C/A4C videos with a TVT split of 3,649, 731 and 763. For the AS severity detection task, we use a private dataset of PLAX/PSAX pairs with a balanced number of healthy, mild, moderate and severe AS cases and a TVT split of 1,875, 257, and 258. For all datasets, the frames are resized to 224×224.

4.3.2 Implementation Details

All models were trained on four 32-GB NVIDIA Tesla V100 GPUs, with hyper-parameters found using a Weights & Biases random sweep [4]. Lastly, we use the ViT network from [55], pre-trained on ImageNet-21K, for the STE module.

4.3.3 Results and Discussion

Quantitative Results

In Table 4.1 and Table 4.2, we show that our model outperforms all previous works in EF estimation and AS detection. For EF, we use the MAE between the ground truth and the predicted EF values, and the R² correlation score. For AS, we use accuracy for both the four-class severity prediction and the binary detection of AS. The results show the flexibility of our model in being adapted to various echo-based tasks while achieving high levels of performance.

Qualitative Results

In addition to its superior quantitative performance, we show that our model provides explainability through its task-specific learned attention (Figure 4.3) and the learned prototypes (Figure 4.4). More specifically, we see that the model learns to attend to the valve area for AS while focusing on the walls of the LV in the ED and ES frames for EF, which follows the clinical procedures. Additional prototypical results are shown in Section A.3.

Table 4.1: Quantitative results for ejection fraction (EF) on the test set: LV Biplane dataset results for models not supporting multi-video training are indicated by "-". MAE is the mean absolute error (in EF percentage points), and R² indicates the variance captured by the model.

                       EchoNet Dynamic            LV Biplane
Model                  MAE [%] ↓   R² Score ↑     MAE [%] ↓   R² Score ↑
Ouyang et al. [62]     7.35        0.40           -           -
Reynaud et al. [66]    5.95        0.52           -           -
Esfeh et al. [38]      4.46        0.75           -           -
Thomas et al. [80]     4.23        0.79           -           -
Mokhtari et al. [58]   4.45        0.76           5.12        0.68
Ours                   4.15        0.79           4.84        0.72

Table 4.2: Quantitative results for aortic stenosis (AS) on the test set: Severity is a four-class classification task, while Detection involves the binary detection of AS.

                       Accuracy [%] ↑
Model                  Severity   Detection
Huang et al. [33]      73.7       94.1
Bertasius et al. [3]   75.3       94.8
Ginsberg et al. [24]   74.4       94.2
Ours                   76.2       96.5

4.3.4 Ablation Study

We show the effectiveness of our design choices in Table 4.3, using EF estimation on the EchoNet Dynamic dataset as our test-bed. It is evident that the attention loss described in Section 4.2.2 is effective for EF when intermediary labels are available, while it is necessary to use a pre-trained ViT, as the medical datasets in our experiments are not sufficiently large to build a good inductive bias from scratch.

Figure 4.3: Learned patch attention from the Spatial Transformer Encoder (STE): We visualize the learned attention of the STE. For EF, the model focuses on the walls of the LV, while for AS, the model learns to attend to the valve area, which is clinically correct.
Table 4.3: Ablation study on the validation set of EchoNet Dynamic: We see that both spatial and temporal attention supervision are effective for ejection fraction (EF) estimation, while the model does not converge without pre-training the Vision Transformer (ViT). MAE is the mean absolute error (in EF percentage points), and R² indicates the variance captured by the model.

Model                     MAE [%] ↓   R² Score ↑
No Spatial Attn. Sup.     4.42        0.77
No Temporal Attn. Sup.    4.54        0.76
No ViT Pretraining        5.61        0.45
Ours                      4.11        0.80

Figure 4.4: Learned patch-level prototypes: The learned prototypes use Spatial Transformer Encoder (STE) attention to properly focus on the valve area for a healthy and a severe aortic stenosis (AS) case. We can see that in the healthy case, the aortic valve (AV) is thin and not calcified. However, in the severe case, the calcification of the AV is apparent (i.e., the valve appears bright in the image). Frame-level and ejection fraction (EF) prototypes are presented in Section A.3.

4.4 Conclusion

In this chapter, we introduced a multi-level, transformer-based framework suitable for processing echo videos and showed superior performance over prior works on two challenging tasks, while providing explainability through the use of prototypes and the learned attention of the model. As also described in Chapter 5, future work will include training a large multi-task model with a comprehensive echo dataset that can be disseminated to the community for a variety of clinical applications.

Chapter 5

Conclusions and Future Work

Echocardiography (echo) is a non-invasive, cost-effective, and portable Ultrasound (US) imaging modality widely used to assess heart function by depicting its dynamic anatomy. The left ventricle (LV), responsible for pumping oxygenated blood throughout the body, is a crucial heart chamber, and many echo-based metrics focus on its function. In this thesis, we explored the following clinical metrics and diagnoses:

• Left ventricular ejection fraction (LVEF): The ejection fraction (EF) represents the ratio between the LV's volumes during the end-diastolic (ED) and end-systolic (ES) phases, indicating its ability to pump blood. Clinical determination of this ratio exhibits significant inter-observer variability.

• LV measurements: The heart's dimensions and wall thickness impact its function, with abnormal increases leading to left ventricular hypertrophy (LVH). To monitor these changes, the inter-ventricular septal (IVS) thickness, left ventricular posterior wall (LVPW) thickness, and left ventricular internal diameter (LVID) are measured, characterized by four to six landmarks on an echo frame.

• Aortic stenosis (AS): AS, the most common valvular heart disease, occurs when the aortic valve (AV) becomes calcified, hindering its ability to open and close efficiently while pumping blood. One method of detecting AS from echo involves visually inspecting the AV; however, high inter-observer variability has been observed.

Automatic estimation or detection machine learning (ML) frameworks can significantly benefit such clinical tasks by serving as secondary layers of verification, particularly with the increasing popularity of point-of-care US devices, which are often operated by less-experienced users. However, due to the safety-critical nature of health-related tasks, these frameworks must be explainable, so that it is clear when human intervention is needed. Furthermore, they must be highly generalizable due to the scarcity of labeled medical data. Additionally, given the multitude of cardiac tasks, maintaining multiple per-task ML frameworks is logistically challenging.
Therefore, the proposed frameworks must be flexible, capable of handling various tasks with minimal modifications while maintaining a high level of explainability.

In this thesis, with the aforementioned principles in mind, we investigated three innovative frameworks that harness the representational power of graph neural networks (GNNs) and transformers:

• We presented an explainable GNN-based framework for estimating LVEF from echo videos. This approach learns a latent structure aligned with clinical guidelines while offering a surrogate for model confidence. We showed that, while outperforming previous work, this framework also has a much smaller memory footprint, making it suitable for deployment on hand-held devices.

• We designed a highly generalizable GNN framework for medical landmark detection in sparsely labeled data. By implementing a multi-scale objective function and a hierarchical graph structure, our framework maximizes supervisory signals and surpasses previous methods in both the in-distribution (ID) and out-of-distribution (OOD) settings.

• We introduced a versatile transformer-based framework for multiple clinical tasks. This framework delivers attention-based explainability at various levels, capturing interactions between patches, frames, and videos in echo data. The flexibility of this framework is demonstrated through its successful application to AS detection and EF estimation while maintaining a high degree of explainability.

While the presented frameworks exhibit promising results, there is still room for further refinement to ensure readiness for deployment in a clinical setting. Therefore, in Section 5.1, we propose a future work direction that incorporates our learnings from these frameworks, with the aim of advancing towards a deployable and powerful ML framework for medical applications.

5.1 Proposed Framework for Future Work

Figure 5.1: Proposed future work: We recommend employing self-supervised learning (SSL) to pretrain a multi-level, transformer-based framework. This framework can be trained on vast amounts of unlabeled medical data and subsequently fine-tuned on specific tasks to achieve high levels of task-specific performance.

Although clean, labeled medical datasets are scarce, accessing unlabeled medical datasets is considerably easier. This situation points to an ML training paradigm that enables models to learn robust representations without requiring labels, known as self-supervised learning (SSL). Numerous approaches to SSL have been developed, including contrastive and reconstructive methods [73]. Therefore, given the availability of unlabeled medical datasets, we choose to focus on SSL for our future work.

Given the effectiveness of the multi-level, attention-based framework introduced in Chapter 4, we propose the network structure depicted in Figure 5.1. Several observations can be made by comparing this figure to the original framework presented in Figure 4.1. First, drawing upon our findings from Chapter 3, where we demonstrated the effectiveness of considering an input image at multiple scales, we incorporate an additional level of hierarchy into the original GEMTrans framework, employing a ViT on both coarse and fine patches.
Second, we recommend a reconstructive SSL approach in which patches, frames, or entire videos are masked and the model is tasked with reconstructing the masked data, leading to the learning of robust representations.

Once the proposed network is trained on large volumes of unlabeled data, it learns to build informative multi-scale representations. By fine-tuning such a model for specific medical tasks, we suggest that high levels of performance can be achieved, resulting in a high-performing, multi-task, explainable framework suitable for deployment in clinical settings.
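As a minimal illustration of the reconstructive objective suggested above (the masking ratio, the zeroing of masked tokens, and the decoder head are all assumptions made for the sketch; a learned mask token is the more common choice in practice):

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(tokens, encoder, decoder, mask_ratio=0.5):
    """Sketch of the proposed reconstructive SSL objective.

    tokens  : (B, N, d) patch (or frame, or video) embeddings.
    encoder : the multi-level transformer being pretrained.
    decoder : a light-weight head mapping embeddings back to token space.
    """
    B, N, _ = tokens.shape
    mask = torch.rand(B, N, device=tokens.device) < mask_ratio   # True = masked
    corrupted = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = decoder(encoder(corrupted))                          # (B, N, d)
    # The reconstruction error is computed only on the masked positions.
    return F.mse_loss(recon[mask], tokens[mask])
```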
Bibliography

[1] C. B. Amaral, D. C. Ralston, and T. K. Becker. Prehospital point-of-care ultrasound: A transformative technology. SAGE Open Medicine, 8:2050312120932706, 2020. PMID: 32782792. → page 19

[2] D. Bamira and M. Picard. Imaging: Echocardiology—assessment of cardiac structure and function. In R. S. Vasan and D. B. Sawyer, editors, Encyclopedia of Cardiovascular Research and Medicine, pages 35–54. Elsevier, Oxford, 2018. ISBN 978-0-12-805154-2. → page 18

[3] G. Bertasius, H. Wang, and L. Torresani. Is space-time attention all you need for video understanding? In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 813–824. PMLR, 2021. → page 60

[4] L. Biewald. Experiment tracking with weights and biases, 2020. → page 59

[5] A. B. Bornstein, S. S. Rao, and K. Marwaha. Left Ventricular Hypertrophy. StatPearls Publishing, 2022. → pages 14, 30

[6] M. Cameli, S. Mondillo, M. Solari, F. M. Righini, V. Andrei, C. Contaldi, E. De Marco, M. Di Mauro, R. Esposito, S. Gallina, R. Montisci, A. Rossi, M. Galderisi, S. Nistri, E. Agricola, and D. Mele. Echocardiographic assessment of left ventricular systolic function: from ejection fraction to torsion. Heart Failure Reviews, 21(1):77–94, Jan 2016. ISSN 1573-7322. → pages 12, 13

[7] B. A. Carabello. Introduction to aortic stenosis. Circulation Research, 113(2):179–185, 2013. → page 15

[8] M. Carroll. Ejection fraction: Normal range, low range, and treatment, Nov 2021. URL https://www.healthline.com/health/ejection-fraction. → pages 24, 25

[9] L. Chen, J. Li, J. Peng, T. Xie, Z. Cao, K. Xu, X. He, Z. Zheng, and B. Wu. A survey of adversarial learning on graph. arXiv preprint arXiv:2003.05730, 2020. → page 4

[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP, 06 2016. → page 32

[11] R. Chen, Y. Ma, N. Chen, D. Lee, and W. Wang. Cephalometric landmark detection by attentive feature pyramid fusion and regression-voting. In D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, and A. Khan, editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, pages 873–881, Cham, 2019. Springer International Publishing. ISBN 978-3-030-32248-9. → pages xiv, 32, 42, 43, 44, 45, 80, 81, 82, 83

[12] L.-H. Cheng, X. Sun, and R. J. van der Geest. Contrastive learning for echocardiographic view integration. In L. Wang, Q. Dou, P. T. Fletcher, S. Speidel, and S. Li, editors, Medical Image Computing and Computer Assisted Intervention, pages 340–349, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-16440-8. → page 51

[13] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In Y. Bengio and Y. LeCun, editors, ICLR (Poster), 2016. → page 23

[14] R. B. Devereux, D. R. Alonso, E. M. Lutas, G. J. Gottlieb, E. Campo, I. Sachs, and N. Reichek. Echocardiographic assessment of left ventricular hypertrophy: Comparison to necropsy findings. The American Journal of Cardiology, 57(6):450–458, 1986. ISSN 0002-9149. → page 15

[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, NAACL-HLT (1), pages 4171–4186. Association for Computational Linguistics, 2019. ISBN 978-1-950737-13-0. → pages 52, 56

[16] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. → pages 8, 33, 55

[17] G. Duffy, P. P. Cheng, N. Yuan, B. He, A. C. Kwan, M. J. Shun-Shin, K. M. Alexander, J. Ebinger, M. P. Lungren, F. Rader, D. H. Liang, I. Schnittger, E. A. Ashley, J. Y. Zou, J. Patel, R. Witteles, S. Cheng, and D. Ouyang. High-Throughput Precision Phenotyping of Left Ventricular Hypertrophy With Cardiovascular Deep Learning. JAMA Cardiology, 7(4):386–395, 2022. ISSN 2380-6583. → pages 32, 39, 43, 44, 45, 51, 80, 81, 82, 83

[18] D. Ferraioli, G. Santoro, M. Bellino, and R. Citro. Ventricular septal defect complicating inferior acute myocardial infarction: A case of percutaneous closure. Journal of Cardiovascular Echography, 29:17, 01 2019. → page 25

[19] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. → pages 25, 41

[20] A. M. Fiorito, A. Østvik, E. Smistad, S. Leclerc, O. Bernard, and L. Lovstakken. Detection of cardiac events in echocardiography using 3D convolutional recurrent neural networks. In IEEE International Ultrasonics Symposium, pages 1–4, 2018. → page 51

[21] X. Gao, W. Li, M. Loomes, and L. Wang. A fused deep learning architecture for viewpoint classification of echocardiography. Information Fusion, 36:103–113, 2017. ISSN 1566-2535. → page 51

[22] A. Gilbert, M. Holden, L. Eikvil, S. A. Aase, E. Samset, and K. McLeod. Automated left ventricle dimension measurement in 2D cardiac ultrasound via an anatomically meaningful CNN approach. In Smart Ultrasound Imaging and Perinatal, Preterm and Paediatric Image Analysis, pages 29–37. Springer, 2019. → pages 32, 43, 44, 45, 79, 81, 82, 83

[23] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning – Volume 70, ICML'17, pages 1263–1272. JMLR.org, 2017. → pages 7, 22, 33

[24] T. Ginsberg, R.-e. Tal, M. Tsang, C. Macdonald, F. T. Dezaki, J. van der Kuur, C. Luong, P. Abolmaesumi, and T. Tsang. Deep video networks for automatic assessment of aortic stenosis in echocardiography. In J. A. Noble, S. Aylward, A. Grimwood, Z. Min, S.-L. Lee, and Y. Hu, editors, Simplifying Medical Ultrasound, pages 202–210, Cham, 2021. Springer International Publishing. → pages 53, 60

[25] E. Goan and C. Fookes. Bayesian Neural Networks: An Introduction and Survey, pages 45–87. Springer International Publishing, Cham, 2020. ISBN 978-3-030-42553-1. → page 32
[26] A. H. Gradman and F. Alfayoumi. From left ventricular hypertrophy to congestive heart failure: Management of hypertensive heart disease. Progress in Cardiovascular Diseases, 48(5):326–341, 2006. ISSN 0033-0620. → pages 14, 30

[27] M. D. Grant, R. D. Mann, S. D. Kristenson, R. M. Buck, J. D. Mendoza, J. M. Reese, D. W. Grant, and E. A. Roberge. Transthoracic echocardiography: Beginner's guide with emphasis on blind spots as identified with CT and MRI. RadioGraphics, 41(4):1022–1042, 2021. → page 11

[28] A. N. Gu, C. Luong, M. H. Jafari, N. Van Woudenberg, H. Girgis, P. Abolmaesumi, and T. Tsang. Efficient echocardiogram view classification with sampling-free uncertainty estimation. In J. A. Noble, S. Aylward, A. Grimwood, Z. Min, S.-L. Lee, and Y. Hu, editors, Simplifying Medical Ultrasound, pages 139–148, Cham, 2021. Springer International Publishing. ISBN 978-3-030-87583-1. → page 51

[29] W. L. Hamilton. Graph representation learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2020. → pages 2, 7

[30] B. Hou. ResNetAE, 2019. URL https://github.com/farrell236/ResNetAE. → page 19

[31] J. P. Howard, C. C. Stowell, G. D. Cole, K. Ananthan, C. D. Demetrescu, K. Pearce, R. Rajani, J. Sehmi, K. Vimalesvaran, G. S. Kanaganayagam, E. McPhail, A. K. Ghosh, J. B. Chambers, A. P. Singh, M. Zolgharni, B. Rana, D. P. Francis, and M. J. Shun-Shin. Automated left ventricular dimension assessment using artificial intelligence developed and validated by a UK-wide collaborative. Circulation: Cardiovascular Imaging, 14(5):e011951, 2021. → page 40

[32] H. Huang, P. Nijjar, J. Misialek, A. Blaes, N. Derrico, F. Kazmirczak, I. Klem, A. Farzaneh-Far, and C. Shenoy. Accuracy of left ventricular ejection fraction by contemporary multiple gated acquisition scanning in patients with cancer: Comparison with cardiovascular magnetic resonance. Journal of Cardiovascular Magnetic Resonance, 19, 12 2017. → page 18

[33] Z. Huang, G. Long, B. Wessler, and M. C. Hughes. A new semi-supervised learning benchmark for classifying view and diagnosing aortic stenosis from echocardiograms. In Proceedings of the 6th Machine Learning for Healthcare Conference, 2021. → pages 53, 60

[34] Z. Huang, G. Long, B. Wessler, and M. C. Hughes. TMED 2: A dataset for semi-supervised classification of echocardiograms. 2022. → page 53

[35] M. H. Jafari, N. V. Woudenberg, C. Luong, P. Abolmaesumi, and T. Tsang. Deep Bayesian image segmentation for a more robust ejection fraction estimation. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1264–1268, 2021. → page 19

[36] M. H. Jafari, C. Luong, M. Tsang, A. N. Gu, N. Van Woudenberg, R. Rohling, T. Tsang, and P. Abolmaesumi. U-LanD: Uncertainty-driven video landmark detection. IEEE Transactions on Medical Imaging, 41(4):793–804, 2022. → page 32

[37] A. P. Kalogeropoulos, G. C. Fonarow, V. Georgiopoulou, G. Burkman, S. Siwamogsatham, A. Patel, S. Li, L. Papadimitriou, and J. Butler. Characteristics and Outcomes of Adult Outpatients With Heart Failure and Improved or Recovered Ejection Fraction. JAMA Cardiology, 1(5):510–518, 08 2016. ISSN 2380-6583. → pages 24, 27

[38] M. M. Kazemi Esfeh, C. Luong, D. Behnami, T. Tsang, and P. Abolmaesumi. A deep Bayesian video analysis framework: Towards a more robust estimation of ejection fraction. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, pages 582–590. Springer International Publishing, 2020. ISBN 978-3-030-59713-9. → pages 19, 27, 28, 52, 60
[39] D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014. → pages 25, 41

[40] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel. Neural relational inference for interacting systems. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2688–2697. PMLR, 10–15 Jul 2018. → page 22

[41] T. N. Kipf and M. Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning, 2016. → page 4

[42] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net, 2017. → pages 4, 6, 23, 40

[43] C. B. Kristensen, F. Steensgaard-Hansen, K. A. Myhr, N. J. Løkkegaard, S. H. Finsen, C. Hassager, and R. Møgelvang. Left ventricular mass assessment by 1- and 2-dimensional echocardiographic methods in hemodialysis patients: Changes in left ventricular volume using echocardiography before and after a hemodialysis session. Kidney Medicine, 2(5):578–588.e1, 2020. ISSN 2590-0595. → page 14

[44] R. M. Lang, L. P. Badano, V. Mor-Avi, J. Afilalo, A. Armstrong, L. Ernande, F. A. Flachskampf, E. Foster, S. A. Goldstein, T. Kuznetsova, P. Lancellotti, D. Muraru, M. H. Picard, E. R. Rietzschel, L. Rudski, K. T. Spencer, W. Tsang, and J.-U. Voigt. Recommendations for cardiac chamber quantification by echocardiography in adults: An update from the American Society of Echocardiography and the European Association of Cardiovascular Imaging. Journal of the American Society of Echocardiography, 28(1):1–39.e14, 2015. ISSN 0894-7317. → pages 14, 18

[45] J. Leskovec. CS224W: Machine learning with graphs, 2022. URL http://web.stanford.edu/class/cs224w/index.html. → page 2

[46] R. Liao. Deep Learning on Graphs: Theory, Models, Algorithms and Applications. PhD thesis, University of Toronto (Canada), 2021. → pages 4, 6

[47] R. Liao. EECE 571F: Deep learning with structures, 2022. URL https://lrjconan.github.io/UBC-EECE571F-DL-Structures/. → pages 4, 6

[48] J. Lin, G. Sahebzamani, C. Luong, F. T. Dezaki, M. Jafari, P. Abolmaesumi, and T. Tsang. Reciprocal landmark detection and tracking with extremely few annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15170–15179, 2021. → pages xii, xiv, 32, 43, 44, 45, 79, 81, 82, 83

[49] F. Liu, K. Wang, D. Liu, X. Yang, and J. Tian. Deep pyramid local attention neural network for cardiac structure segmentation in two-dimensional echocardiography. Medical Image Analysis, 67:101873, 2021. ISSN 1361-8415. → page 51
The American Journal of Cardiology, 101(7):1016\u20131022, 2008. ISSN 0002-9149. \u2192 page 18[52] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks forsemantic segmentation. In 2015 IEEE Conference on Computer Vision andPattern Recognition (CVPR), pages 3431\u20133440, 2015. \u2192 page 32[53] J. McCouat and I. Voiculescu. Contour-hugging heatmaps for landmarkdetection. In 2022 IEEE\/CVF Conference on Computer Vision and PatternRecognition (CVPR), pages 20565\u201320573, 2022. \u2192 pagesxiv, 32, 42, 43, 44, 45, 80, 81, 82, 83[54] T. M. McFarland, M. Alam, S. Goldstein, S. D. Pickard, and P. D. Stein.Echocardiographic diagnosis of left ventricular hypertrophy. Circulation, 57(6):1140\u20131144, 1978. ISSN 00097322. \u2192 pages 14, 30[55] L. Melas-Kyriazi. Vit pytorch, 2020. URLhttps:\/\/github.com\/lukemelas\/PyTorch-Pretrained-ViT. \u2192 page 59[56] N. Mingshuo, C. Dongming, and W. Dongqi. Reinforcement learning ongraphs: A survey. arXiv preprint arXiv:2204.06127, 2022. \u2192 page 4[57] C. Mitchell, P. S. Rahko, L. A. Blauwet, B. Canaday, J. A. Finstuen, M. C.Foster, K. Horton, K. O. Ogunyankin, R. A. Palma, and E. J. Velazquez.Guidelines for performing a comprehensive transthoracic echocardiographicexamination in adults: Recommendations from the american society ofechocardiography. Journal of the American Society of Echocardiography, 32(1):1\u201364, Jan. 2019. \u2192 pages xvi, 11[58] M. Mokhtari, T. Tsang, P. Abolmaesumi, and R. Liao. Echognn:Explainable ejection fraction estimation with graph neural networks. InL. Wang, Q. Dou, P. T. Fletcher, S. Speidel, and S. Li, editors, MedicalImage Computing and Computer Assisted Intervention \u2013 MICCAI 2022,pages 360\u2013369, Cham, 2022. Springer Nature Switzerland. ISBN978-3-031-16440-8. \u2192 pages 52, 6073[59] M. Mokhtari, N. Ahmadi, T. Tsang, P. Abolmaesumi, and R. Liao.Gemtrans: A general, echocardiography-based, multi-level transformerframework for cardiovascular diagnosis. Preprint, 2023.[60] M. Mokhtari, M. Mahdavi, H. Vaseli, C. Luong, P. Abolmaesumi, T. Tsang,and R. Liao. Echoglad: Hierarchical graph neural networks for left ventriclelandmark detection on echocardiograms. Preprint, 2023.[61] C. M. Otto, R. A. Nishimura, R. O. Bonow, B. A. Carabello, J. P. Erwin,F. Gentile, H. Jneid, E. V. Krieger, M. Mack, C. McLeod, P. T. O\u2019Gara, V. H.Rigolin, T. M. Sundt, A. Thompson, and C. Toly. 2020 acc\/aha guideline forthe management of patients with valvular heart disease: Executive summary.Journal of the American College of Cardiology, 77(4):450\u2013500, 2021. \u2192page 53[62] D. Ouyang, B. He, A. Ghorbani, N. Yuan, J. Ebinger, C. Langlotz,P. Heidenreich, R. Harrington, D. Liang, E. Ashley, and J. Zou. Video-basedai for beat-to-beat assessment of cardiac function. Nature, 580, 2020. \u2192pages 18, 19, 24, 27, 28, 52, 58, 60, 80[63] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen,Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang,Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang,J. Bai, and S. Chintala. Pytorch: An imperative style, high-performancedeep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer,F. d'Alche\u00b4-Buc, E. Fox, and R. Garnett, editors, Advances in NeuralInformation Processing Systems 32, pages 8024\u20138035. Curran Associates,Inc., 2019. \u2192 pages 25, 41[64] V. Patil and H. Patil. Isolated non-compaction cardiomyopathy presentedwith ventricular tachycardia. 
[65] E. Potter and T. H. Marwick. Assessment of left ventricular function by echocardiography: The case for routinely adding global longitudinal strain to ejection fraction. JACC: Cardiovascular Imaging, 11(2, Part 1):260–274, 2018. ISSN 1936-878X. → pages 12, 13

[66] H. Reynaud, A. Vlontzos, B. Hou, A. Beqiri, P. Leeson, and B. Kainz. Ultrasound video transformers for cardiac ejection fraction estimation. In M. de Bruijne, P. C. Cattin, S. Cotin, N. Padoy, S. Speidel, Y. Zheng, and C. Essert, editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, pages 495–505. Springer International Publishing, 2021. ISBN 978-3-030-87231-1. → pages 19, 26, 27, 28, 52, 60

[67] L. Ring, B. N. Shah, S. Bhattacharyya, A. Harkness, M. Belham, D. Oxborough, K. Pearce, B. S. Rana, D. X. Augustine, S. Robinson, and C. Tribouilloy. Echocardiographic assessment of aortic stenosis: a practical guideline from the British Society of Echocardiography. Echo Research & Practice, 8(1):G19–G59, Mar 2021. ISSN 2055-0464. → page 15

[68] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015. → page 32

[69] P. Roshanitabrizi, H. R. Roth, A. Tompsett, A. R. Paulli, K. Brown, J. Rwebembera, E. Okello, A. Beaton, C. Sable, and M. G. Linguraru. Ensembled prediction of rheumatic heart disease from ungated Doppler echocardiography acquired in low-resource settings. In L. Wang, Q. Dou, P. T. Fletcher, S. Speidel, and S. Li, editors, Medical Image Computing and Computer Assisted Intervention, pages 602–612, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-16431-6. → page 52

[70] L. Ruiz, F. Gama, and A. Ribeiro. Gated graph recurrent neural networks. IEEE Transactions on Signal Processing, 68:6303–6318, 01 2020. → page 4

[71] H. Sak, A. W. Senior, and F. Beaufays. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. CoRR, abs/1402.1128, 2014. → page 32

[72] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008. → pages 19, 33

[73] M. C. Schiappa, Y. S. Rawat, and M. Shah. Self-supervised learning for videos: A survey. ACM Computing Surveys, 2022. → page 65

[74] E. Smistad, A. Østvik, I. M. Salte, D. Melichova, T. M. Nguyen, K. Haugaa, H. Brunvand, T. Edvardsen, S. Leclerc, O. Bernard, B. Grenne, and L. Løvstakken. Real-time automatic ejection fraction and foreshortening detection using deep learning. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 67(12):2595–2604, 2020. → page 19

[75] M. Sofka, F. Milletari, J. Jia, and A. Rothberg. Fully convolutional regression network for accurate detection of measurement points. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 258–266. Springer, 2017. → page 32

[76] E. Spitzer, R. T. Hahn, P. Pibarot, T. de Vries, J. J. Bax, M. B. Leon, and N. M. V. Mieghem. Aortic stenosis and heart failure: Disease ascertainment and statistical considerations for clinical trials. Cardiac Failure Review, 5:99–105, 2019. → page 53

[77] J. Stacey, Y. Belinkov, and M. Rei. Supervising model attention with human explanations for robust natural language inference. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):11349–11357, 2022. → page 56
[78] G. Strange, S. Stewart, D. Celermajer, D. Prior, G. M. Scalia, T. Marwick, M. Ilton, M. Joseph, J. Codde, and D. Playford. Poor long-term survival in patients with moderate aortic stenosis. Journal of the American College of Cardiology, 74(15):1851–1863, 2019. ISSN 0735-1097. → page 15

[79] P. Suetens. Ultrasound imaging, pages 128–158. Cambridge University Press, 2nd edition, 2009. → pages 9, 51

[80] S. Thomas, A. Gilbert, and G. Ben-Yosef. Light-weight spatio-temporal graphs for segmentation and ejection fraction prediction in cardiac ultrasound. In L. Wang, Q. Dou, P. T. Fletcher, S. Speidel, and S. Li, editors, Medical Image Computing and Computer Assisted Intervention, pages 380–390, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-16440-8. → pages 51, 60

[81] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018. → pages 19, 32

[82] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017. → pages 8, 19, 21, 55, 56

[83] L. Wu, P. Cui, J. Pei, and L. Zhao. Graph Neural Networks: Foundations, Frontiers, and Applications. Springer Singapore, Singapore, 2022. → page 2

[84] M. Xue, Q. Huang, H. Zhang, L. Cheng, J. Song, M.-H. Wu, and M. Song. ProtoPFormer: Concentrating on prototypical parts in vision transformers for interpretable image recognition. ArXiv, 2022. → page 58

[85] Q. Yao, Z. He, H. Han, and S. K. Zhou. Miss the point: Targeted adversarial attack on multiple landmark detection. In A. L. Martel, P. Abolmaesumi, D. Stoyanov, D. Mateus, M. A. Zuluaga, S. K. Zhou, D. Racoceanu, and L. Joskowicz, editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, pages 692–702, Cham, 2020. Springer International Publishing. ISBN 978-3-030-59719-1. → pages 32, 43, 44, 45

[86] M. Zaheer, S. Kottur, S. Ravanbhakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola. Deep sets. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 3394–3404, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. → pages 6, 7

Appendix A

Supporting Materials

A.1 EchoGNN

We show the scatter plots and a confusion matrix for the model's ejection fraction (EF) predictions in Figure A.1. We also provide additional qualitative results in Figure A.2.
Figure A.2: Additional examples of EchoGNN's explainability capability: (left) Instances where the learned frame weights enable clear identification of end-systolic (ES) and end-diastolic (ED) locations. (right) Examples of atypical, zoomed-in apical four-chamber (A4C) echocardiography (echo) videos in which the left ventricle (LV) is not entirely visible and is cropped; here the model distributes the frame weights more evenly and does not clearly indicate the positions of ED and ES.

A.2 EchoGLAD

A.2.1 Training Details of Prior Works

For prior works originally designed for cephalometric landmark detection, marked by "*", we modify the codebase to predict four LV landmark locations instead. A summary of how training was performed for each model is provided below:

• Lin et al. [48] and Gilbert et al. [22]: For these models, the code is not publicly available, and we implement their models using the information provided in their papers. Additionally, since the original RDT model only performs LVID predictions, we modify the model and add four output channels to enable predictions over all three measurements.

• McCouat et al.* [53]: We use a batch size of 1 and 15 training epochs, followed by 15 temperature-scaling epochs using the validation dataset. Other hyper-parameters are kept the same as in the original model.

• Chen et al.* [11]: We use a radius of 11 pixels for label smoothing, a batch size of 1, and 400 training epochs. We select the model with the lowest validation error in terms of LV measurements.

• Yao et al.* [85]: We use the original hyper-parameters, including a batch size of 1 and 230 training epochs, and select the model with the lowest validation error in terms of LV measurements.

• Duffy et al. [17]: The training script for this model is not provided. We contacted the authors and were informed that they use a similar training script (with the same hyper-parameters) as their prior work [62]. We modified that script with the loss function and forward model described in the paper and trained the model for 50 epochs with a batch size of 20 samples.

A.2.2 Additional Quantitative Results

We provide additional quantitative results, including the success detection rate (SDR) for the inter-ventricular septal (IVS) and left ventricular posterior wall (LVPW) measurements. In Table A.1, we see that in the high-data, in-distribution (ID) regime, our model outperforms all prior works for all measurements. As shown in Table A.2, for the models trained and tested on the public Unity Imaging Collaborative (UIC) dataset (low-data, ID regime), while Chen et al. [11] outperforms all models for IVS, our model achieves the best results for LVPW. Lastly, as shown in Table A.3, in the out-of-distribution (OOD) setting, where the models are trained on the private dataset and tested on the UIC public test set, our model outperforms prior works for LVPW and performs on par with the state-of-the-art (SOTA) for IVS.
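For reference, the SDR reported in Tables A.1 to A.3 is the percentage of test samples whose error falls below a given threshold. The following is a minimal sketch of that computation, assuming per-sample measurement errors are already expressed in millimetres; the function name and the exact definition of the error (point-to-point landmark distance vs. derived measurement error) are illustrative assumptions rather than the exact evaluation code.

import numpy as np

def success_detection_rate(errors_mm: np.ndarray,
                           thresholds_mm=(1.0, 2.0, 3.0, 6.0)) -> dict:
    """SDR[%]: fraction of samples whose error is below each threshold.

    errors_mm holds one error per test sample in millimetres, e.g. the
    absolute difference between a predicted and a ground-truth IVS
    (or LVPW) measurement.
    """
    return {t: 100.0 * float(np.mean(errors_mm < t)) for t in thresholds_mm}

# Example with synthetic errors drawn from a half-normal distribution.
rng = np.random.default_rng(0)
errors = np.abs(rng.normal(loc=0.0, scale=1.5, size=1000))
print(success_detection_rate(errors))  # SDR rises as the threshold grows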
Table A.1: Quantitative results on the private test set for models trained on the private training set (in-distribution (ID), high-data regime) in terms of success detection rate (SDR) for inter-ventricular septal (IVS) and left ventricular posterior wall (LVPW). Higher values for SDR are better. Our model outperforms prior works on both measurements across all thresholds.

Model                  |  SDR[%] of IVS < ↑              |  SDR[%] of LVPW < ↑
                       |  1.0 mm  2.0 mm  3.0 mm  6.0 mm |  1.0 mm  2.0 mm  3.0 mm  6.0 mm
Gilbert et al. [22]    |  51.3    80.6    92.0    98.4   |  49.4    78.7    92.3    98.6
Lin et al. [48]        |  22.6    39.1    47.6    55.6   |  25.4    42.1    50.1    56.3
McCouat et al. [53]    |  54.3    83.2    93.2    99.2   |  51.3    79.6    91.4    98.4
Chen et al. [11]       |  55.5    83.1    94.3    99.3   |  52.8    82.3    93.6    99.2
Yao et al. [85]        |  27.9    50.3    66.8    88.4   |  26.9    50.5    67.8    89.8
Duffy et al. [17]      |  50.4    81.0    94.0    99.6   |  53.6    82.5    94.4    99.5
Ours                   |  58.2    85.9    95.3    99.7   |  54.8    84.3    95.4    99.7

Table A.2: Quantitative results on the Unity Imaging Collaborative (UIC) test set for models trained on the UIC training set (ID, low-data regime) in terms of success detection rate (SDR) for inter-ventricular septal (IVS) and left ventricular posterior wall (LVPW). Higher values for SDR are better. While Chen et al. [11] outperforms all models for IVS, our model has better LVPW accuracy than prior works. Note that Lin et al. [48] requires input videos for training, whereas UIC only contains individual frames; it is therefore excluded from this table.

Model                  |  SDR[%] of IVS < ↑              |  SDR[%] of LVPW < ↑
                       |  1.0 mm  2.0 mm  3.0 mm  6.0 mm |  1.0 mm  2.0 mm  3.0 mm  6.0 mm
Gilbert et al. [22]    |  31.3    56.3    71.3    91.3   |  22.6    44.7    60.5    86.4
McCouat et al. [53]    |  46.4    74.4    87.2    97.1   |  34.4    59.4    74.9    94.0
Chen et al. [11]       |  47.6    79.4    89.3    97.9   |  42.1    61.2    78.5    97.1
Yao et al. [85]        |   2.1     4.6     8.7    28.8   |   6.1    12.9    20.2    46.6
Duffy et al. [17]      |  18.9    36.8    55.7    87.7   |  20.8    41.2    58.1    90.1
Ours                   |  42.8    72.0    88.0    98.7   |  37.6    62.6    79.4    97.1

Table A.3: Quantitative results on the public Unity Imaging Collaborative (UIC) test set for models trained on the private training set (out-of-distribution (OOD) setting) in terms of success detection rate (SDR) for inter-ventricular septal (IVS) and left ventricular posterior wall (LVPW). Higher values for SDR are better. Our model outperforms the state-of-the-art (SOTA) for LVPW on average across the thresholds. For IVS, while McCouat et al. [53] performs better at the lower thresholds, our model is more robust to outliers, showing a higher SDR at the 3 mm and 6 mm thresholds.

Model                  |  SDR[%] of IVS < ↑              |  SDR[%] of LVPW < ↑
                       |  1.0 mm  2.0 mm  3.0 mm  6.0 mm |  1.0 mm  2.0 mm  3.0 mm  6.0 mm
Gilbert et al. [22]    |  11.5    22.5    31.7    52.2   |  21.3    39.3    53.6    80.5
Lin et al. [48]        |   5.8    11.3    15.1    24.6   |  10.5    19.7    27.1    39.2
McCouat et al. [53]    |  29.5    52.4    61.5    88.9   |  25.3    51.6    59.4    86.4
Chen et al. [11]       |  20.9    39.0    54.6    79.8   |  22.8    42.5    56.2    81.9
Yao et al. [85]        |   8.7    17.2    16.7    50.0   |  10.5    21.6    31.8    55.1
Duffy et al. [17]      |  21.8    39.1    54.2    74.7   |  19.6    37.0    51.3    74.0
Ours                   |  28.1    50.1    66.6    90.7   |  26.2    48.4    63.6    87.0

A.2.3 Additional Qualitative Results

• Figure A.3: multiple challenging settings where the model properly captures the LV landmarks in the ID setting.

• Figure A.4: examples, with possible explanations, of cases where the model fails to produce the correct LV landmarks in the ID setting.

• Figure A.5: examples where the model performs well on UIC samples in the OOD setting, in which the model is trained on the private dataset.

• Figure A.6: examples where the model fails to accurately capture the LV landmarks in the OOD setting.
(Image panels: Ground Truth vs. Predictions, columns A–E.)
Figure A.3: Success examples in the in-distribution (ID) setting where the model is trained and tested on the private dataset. As shown, despite some cases having low-quality or noisy samples (B, C), the model successfully predicts the left ventricle (LV) measurements. Additionally, in cases where the papillary muscles (which could be mistaken for LV walls) are visible in the image (A, D, E), the model is not confused and finds the proper LV landmarks.

(Image panels: Ground Truth vs. Predictions, columns A–E.)
Figure A.4: Failure examples in the in-distribution (ID) setting where the model is trained and tested on the private dataset. In cases A, D, and E, we can see that for the inter-ventricular septal (IVS) measurement, the ground truth is placed at a bulge appearing in the wall (which could be an indicator of left ventricular hypertrophy (LVH)), an under-represented case in our dataset. In case B, the ground truth for the IVS is placed somewhere in the middle of the wall, while the model includes the whole wall structure, which could be a mistake in ground-truth labeling. Lastly, in case C, we see an out-of-distribution sample where the echo appears flipped, such that the orientation of the landmarks does not match the common slope observed in the dataset.

(Image panels: Ground Truth vs. Predictions, columns A–E.)
Figure A.5: Success examples in the out-of-distribution (OOD) setting where the model is trained on the private dataset and tested on the Unity Imaging Collaborative (UIC) public dataset. For case A, despite the upper wall being barely visible, the model accurately captures the inter-ventricular septal (IVS) measurement. In case B, despite the presence of the Doppler window, the model performs within acceptable margins. In case C, the model is not confused by the papillary muscle, and in case D, despite the low quality of the image, the model succeeds in finding the landmarks. Lastly, in E, despite the left ventricle (LV) appearing in an unusual location, the model performs well.

(Image panels: Ground Truth vs. Predictions, columns A–E.)
Figure A.6: Failure examples in the out-of-distribution (OOD) setting where the model is trained on the private dataset and tested on the Unity Imaging Collaborative (UIC) public dataset. In most failure cases, the image is of poor quality (A, B, D, E). In case C, we see an unusually small left ventricular internal diameter (LVID) measurement, which is not well-represented in the private training dataset.

A.3 GEMTrans

In Figure A.7, we show the prototypical networks used to generate spatial and temporal prototypes from the local tokens of the Spatial Transformer Encoder (STE) and the Temporal Transformer Encoder (TTE). We present temporal ejection fraction (EF) prototypes in Figure A.8, and provide additional prototypical results for aortic stenosis (AS) in Figure A.9 and Figure A.10.

Figure A.7: Prototypical network structure: For spatial (patch-level) prototypes, the learned local tokens $z_{k,t} \in \mathbb{R}^{HW/p^2 \times d}$ of the Spatial Transformer Encoder (STE) are used. The $M$ patches with the highest attention are included and the rest are eliminated. The remaining patches are compared with $B$ learnable prototypes for each class, $P_l = \{p^l_{b,c}\}_{b=1,c=1}^{B,C} \subset \mathbb{R}^d$, producing a similarity vector $s \in \mathbb{R}^{B \times C}$, where $C$ is the number of classes. Fully connected layers map these similarity scores to the output. For temporal prototypes, the frame-level tokens $z'_{k,t}$ of the Temporal Transformer Encoder (TTE) are given as input; the $M'$ frames with the highest temporal attention are kept and compared with $H$ learnable prototypes, and the similarity scores produce the output.
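To make the pipeline in Figure A.7 easier to follow, the following is a minimal PyTorch sketch of a prototype-based head of this kind: top-M token selection by attention, similarity against B learnable prototypes per class, and fully connected layers over the resulting B × C similarity scores. The cosine similarity, the max-pooling over kept tokens, and all layer sizes are illustrative assumptions rather than the exact GEMTrans implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypicalHead(nn.Module):
    """Sketch of a prototype-based head in the spirit of Figure A.7."""

    def __init__(self, num_kept: int, num_prototypes: int,
                 num_classes: int, dim: int):
        super().__init__()
        self.num_kept = num_kept  # M patches (or M' frames) kept by attention
        # B learnable prototypes per class, each living in the token space R^d
        self.prototypes = nn.Parameter(
            torch.randn(num_prototypes, num_classes, dim))
        # Fully connected layers map the B x C similarity scores to logits
        self.fc = nn.Sequential(
            nn.Linear(num_prototypes * num_classes, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, tokens: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
        # tokens: (N, d) local tokens; attn: (N,) attention score per token.
        # Keep the M tokens with the highest attention; discard the rest.
        kept = tokens[attn.topk(self.num_kept).indices]          # (M, d)
        # Cosine similarity between every kept token and every prototype.
        sim = F.cosine_similarity(
            kept[:, None, None, :],          # (M, 1, 1, d)
            self.prototypes[None, :, :, :],  # (1, B, C, d)
            dim=-1,
        )                                    # -> (M, B, C)
        # Pool over tokens to get one score per prototype: s in R^{B x C}.
        s = sim.max(dim=0).values            # (B, C)
        return self.fc(s.flatten())          # (C,) class logits

# Example: 49 patch tokens of dimension 32, 4 prototypes per each of 3 classes.
head = PrototypicalHead(num_kept=8, num_prototypes=4, num_classes=3, dim=32)
logits = head(torch.randn(49, 32), torch.rand(49))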
M\u2019 frames with high tem-poral attention are kept and compared with H learnable prototypes andthe similarity scores produce the output.89End-Systolic End-DystolicFigure A.8: Patch-level prototypes for ejection fraction (EF): This figure vi-sualizes the patch-level prototypes that represent the left ventricle (LV)in end-systolic (ES) and end-diastolic (ED) frames. This suggests thatthese frames are the most significant in contributing to the final estima-tion of EF, which is clinically correct since the ratio of the volume ofblood in ED and ES are used to find EF.HealthySevereDiscarded Patches Final PrototypeFigure A.9: Additional patch-Level prototypes for aortic stenosis (AS): Leftfigures demonstrate discarded patches based on the acquired attentionof Spatial Transformer Encoder (STE). Patches with low attention areeliminated. The right figures display the areas that correspond to thelearned prototypes. In both the healthy and severe cases, there is anotable emphasis on the aortic valve.90HealthySevereFigure A.10: Frame-level prototypes for aortic stenosis (AS): Two instancesof frame-level prototypes are visualized for healthy and severe AS.The majority of frame-level prototypes are indicative of end-systoleand mid-systole stage of the heart cycle in which the restriction ofvalve\u2019s motion and detection of the aortic valve\u2019s calcification is eas-ier.91","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/hasType":[{"value":"Thesis\/Dissertation","type":"literal","lang":"en"}],"http:\/\/vivoweb.org\/ontology\/core#dateIssued":[{"value":"2023-05","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/isShownAt":[{"value":"10.14288\/1.0431094","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/language":[{"value":"eng","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#degreeDiscipline":[{"value":"Electrical and Computer Engineering","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/provider":[{"value":"Vancouver : University of British Columbia Library","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/publisher":[{"value":"University of British Columbia","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/rights":[{"value":"Attribution-NonCommercial-ShareAlike 4.0 International","type":"literal","lang":"*"}],"https:\/\/open.library.ubc.ca\/terms#rightsURI":[{"value":"http:\/\/creativecommons.org\/licenses\/by-nc-sa\/4.0\/","type":"literal","lang":"*"}],"https:\/\/open.library.ubc.ca\/terms#scholarLevel":[{"value":"Graduate","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/contributor":[{"value":"Abolmaesumi, Purang","type":"literal","lang":"en"},{"value":"Liao, Renjie","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/title":[{"value":"Graph neural networks and transformers for enhanced explainability and generalizability in medical machine learning","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/type":[{"value":"Text","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#identifierURI":[{"value":"http:\/\/hdl.handle.net\/2429\/84247","type":"literal","lang":"en"}]}}