GENERALIZATION PERFORMANCE OF DEEP MODELS FOR ASSESSING ECHO IMAGE QUALITY IN DIFFERENT ULTRASOUND MACHINES

by

Andrea Fung

B.Sc., The University of Western Ontario, 2018

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

The Faculty of Graduate and Postdoctoral Studies

(Experimental Medicine)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2021

© Andrea Fung, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Generalization performance of deep learning models for assessing echo image quality in different ultrasound machines

submitted by Andrea Fung in partial fulfillment of the requirements for the degree of Master of Science in Experimental Medicine

Examining Committee:
Dr. Teresa Tsang, Cardiology, Co-supervisor
Dr. Purang Abolmaesumi, Electrical and Computer Engineering, Co-supervisor
Dr. Christina Luong, Cardiology, Supervisory Committee Member
Dr. Wendy Tsang, Cardiology, Additional Examiner

Abstract

Background: In comparison with other advanced imaging techniques (e.g. computed tomography or magnetic resonance imaging), cardiac ultrasound interpretation is less accurate, with a higher prevalence of low quality images. The problem can be more severe when non-experts use point-of-care ultrasound (PoCUS) to acquire and interpret images. Artificial intelligence (AI) models that provide image quality rating and feedback can help novice users to identify suboptimal image quality in real-time. However, such models have only been validated on cart-based ultrasound systems typically used in echocardiography labs. In this study, we examined the performance of an AI deep learning image quality feedback model trained on cart-based ultrasound systems when applied to PoCUS devices.

Methods: We enrolled 107 unselected patients from an out-patient echocardiography facility at the Vancouver General Hospital. A single sonographer obtained 9 standard image views with a cart-based system and with a hand-held PoCUS device. All the images obtained were assigned image quality ratings by the AI model and by 2 expert physician echocardiographers. Image quality was graded based on percent endocardial border visualization (poor quality = 0-25%; fair quality = 26-50%; good quality = 51-75%; excellent quality = 76-100%). Statistical methods were used to compare the model’s classification performance on cart-based vs. PoCUS data with respect to echocardiographer opinion: percent agreement, weighted kappa, positive predictive value (precision), negative predictive value, sensitivity (recall), and specificity.

Results: Percent agreement and weighted kappa were comparable on PoCUS and cart-based ultrasound clips. Overall, the model’s positive predictive value, negative predictive value, sensitivity, and specificity were neither better nor worse on either machine type.

Conclusions: We conclude that AI-based image quality feedback models designed for cart-based systems can perform well when applied to hand-held PoCUS devices. Researchers may consider using cart-based ultrasound data to train models for PoCUS to overcome data collection and labelling barriers.

Lay Summary

Cardiac ultrasound (US) is a commonly used heart imaging technique. In imaging exams, expert technicians and cardiologists use high-end US equipment to obtain and interpret patient images.
More recently, non-cardiologists are using lower-end US machines to better diagnose and manage patient problems. However, one problem is that some users are inexperienced with cardiac US and may incorrectly interpret US images due to poor image quality. Artificial intelligence (AI) can help doctors identify poor images to prevent false test results that can affect patient care. We studied whether an AI image quality grading model developed for high-end US machines can perform comparably on low-end US machines. Our findings showed that performance was comparable on both machine types. The ability for AI to adapt to different machines may make the process of implementing AI into medicine quicker and more feasible.   vi Preface This thesis was in collaboration with cardiology and biomedical engineering researchers at the University of British Columbia (UBC). The study was proposed by co-supervisors, Dr. Teresa Tsang and Dr. Purang Abolmaesumi. I was responsible for developing the study design, recruiting patients, and performing data analysis. All written work was conducted by myself. Ethics approval for this project, “Artificial intelligence application across ultrasound machines for assessing cardiac image quality”, was obtained by Dr. Teresa Tsang from the Clinical Research Ethics Board (CREB) (certificate #H18-02450).   Dr. Nathaniel Moulson, Dr. Teresa Tsang, and Dr. Purang Abolmaesumi contributed valuable feedback that had improved the study design. Shane Balthazaar and Ken Szeto obtained all image data from patients. The deep learning model used in this study was designed by Zhibin Liao at UBC. The description in Chapter 3.4 on the model architecture was based on Zhibin Liao’s published work: Liao, Z., Girgis, H., Abdi, A., Vaseli, H., Hetherington, J., Rohling, R., ... & Abolmaesumi, P. (2019). On Modelling Label Uncertainty in Deep Neural Networks: Automatic Estimation of Intra-observer Variability in 2D Echocardiography Quality Assessment. arXiv preprint arXiv:1911.00674. Nathan Van Woudenberg collected data labels from the model and clinical experts. Dr. Christina Luong and Dr. Hany Girgis were the clinical experts who annotated the entire dataset. Dr. Eric Sayre assisted with choosing appropriate statistical tests for the study.   vii Table of Contents Abstract .......................................................................................................................................... iii Lay Summary .................................................................................................................................. v Preface............................................................................................................................................ vi Table of Contents .......................................................................................................................... vii List of Tables .................................................................................................................................. x List of Figures ................................................................................................................................ xi Thesis Organization ........................................................................................................................ 1 1 Clinical Background ............................................................................................................... 
2 1.1 Echocardiography and User Expertise ...................................................................................... 2 1.2 Echocardiography for Non-Experts (Point-of-care Ultrasound)............................................... 4 1.3 Artificial Intelligence for Assessing Image Quality ................................................................. 6 1.4 Gap in the Literature ................................................................................................................. 7 2 Artificial Intelligence Background ......................................................................................... 9 2.1 Artificial Intelligence, Machine Learning, and Deep Learning ................................................ 9 2.2 Supervised and Unsupervised Learning.................................................................................. 10 2.3 Variables of a Neural Network: Hyperparameters and Parameters ........................................ 11 2.4 ML Data .................................................................................................................................. 12 2.5 Introduction to Neural Networks ............................................................................................ 13 2.5.1 Neural Networks and the Human Brain ........................................................................... 13 2.5.2 Neural Network Organization .......................................................................................... 14 2.5.3 Convolutional Neural Network for Computer Vision ..................................................... 15  viii 2.5.4 Recurrent Neural Networks for Sequential Data ............................................................. 17 2.6 Neural Network Learning ....................................................................................................... 18 2.7 Motivation ............................................................................................................................... 18 3 Methods................................................................................................................................. 20 3.1 Sample..................................................................................................................................... 20 3.2 Data Collection ....................................................................................................................... 20 3.3 Gold Standard Data Classification .......................................................................................... 21 3.4 Neural Network Design .......................................................................................................... 22 3.5 Evaluation Metrics .................................................................................................................. 25 3.5.1 Percent Agreement ........................................................................................................... 27 3.5.2 Weighted Kappa............................................................................................................... 28 3.5.3 Positive Predictive Rate, Negative Predictive Rate, Sensitivity, Specificity ................... 30 4 Results ................................................................................................................................... 33 4.1 Participant and Data Characteristics ....................................................................................... 
33 4.2 Percent Agreement .................................................................................................................. 35 4.3 Weighted Kappa...................................................................................................................... 36 4.4 PPV ......................................................................................................................................... 37 4.5 NPV......................................................................................................................................... 37 4.6 Sensitivity ............................................................................................................................... 38 4.7 Specificity ............................................................................................................................... 38 5 Discussion ............................................................................................................................. 39 5.1 Model Application in Cart-based vs PoCUS Machines .......................................................... 39 5.2 Clinical Implications of Model Generalization Across Ultrasound Machines ....................... 42  ix 5.3 Research Implications of Model Generalization Across Ultrasound Machines ..................... 44 5.4 Study Strengths and Limitations ............................................................................................. 46 5.5 Conclusion and Future Directions .................................................................................... 47 References ..................................................................................................................................... 48 Appendices .................................................................................................................................... 60 Appendix A: 14 standard echo image views................................................................................. 60 Appendix B: Number of studies in the training set per image view ............................................. 61 Appendix C: Calculating unweighted kappa ................................................................................ 62 Appendix D: Calculating weighted kappa .................................................................................... 64 Appendix E: Clinical Indications for study sample transthoracic echocardiograms (N=106) ..... 65 Appendix F: Contingency tables comparing model and expert ratings of image on cart-based vs. PoCUS images .............................................................................................................................. 66 Appendix G: Positive predictive value (PPV) of the BakeNeko model’s image quality ratings on PoCUS vs. cart-based image data ................................................................................................. 67 Appendix H: Negative predictive value (NPV) of the BakeNeko model’s image quality ratings on PoCUS vs. cart-based image data ............................................................................................ 68 Appendix I: Sensitivity of the BakeNeko model’s image quality ratings on PoCUS vs. cart-based image data ..................................................................................................................................... 69 Appendix J: Specificity of the BakeNeko model’s image quality ratings on PoCUS vs. 
cart-based image data ..................................................................................................................................... 70     x List of Tables Table 3.1 4-point scoring system for grading image quality based on percent endocardial border definition ....................................................................................................................................... 21 Table 3.2 Frequency of model vs. expert rater agreement using the 4-point image quality grading system ........................................................................................................................................... 26 Table 3.3 Proportions of model vs. expert rater agreement using the 4-point image quality grading system .............................................................................................................................. 27 Table 3.4 A weight matrix using the 4-point image quality grading system ................................ 29 Table 3.5 A linear weight matrix using the 4-point image quality grading system ...................... 29 Table 3.6 Frequency of model vs. expert rater agreement on a binary image quality classification problem ......................................................................................................................................... 31 Table 4.1 Anthropometric characteristics and clinical assessment of the study population (N=107) ......................................................................................................................................... 34    xi List of FiguresFigure 2.1 A (Fully-connected) Deep Neural Network ................................................................ 14 Figure 2.2 A Convolutional Neural Network ............................................................................... 16 Figure 2.3 A Recurrent Neural Network....................................................................................... 17 Figure 4.1 Raters’ distribution of image quality classifications in cart-based and PoCUS datasets....................................................................................................................................................... 35 Figure 4.2 Percent agreement between model and expert ratings of image quality on PoCUS vs. cart-based ultrasound clips ............................................................................................................ 36 Figure 4.3 Weighted kappa agreement (±95% CI) between model and expert ratings of image quality on PoCUS cart-based ultrasound clips.............................................................................. 37    1 Thesis Organization The rest of this paper is organized as follows. Chapter 1 gives an introduction to cardiac ultrasound and discusses artificial intelligence-based solutions to a growing problem in the field. Chapter 2 gives an introduction to artificial intelligence terminology and concepts, particularly in deep learning, to assist the understanding of the study methodology, statistics, and discussion. The motivation for this study is explained at the end of chapter 2. Chapter 3 describes the study methodology from data collection to statistical analysis. Study results are reported in Chapter 4. Chapter 5 discusses the study findings and their implications to clinical research and practice as well as study limitations and future directions.     
2 1 Clinical Background Cardiovascular disease (CVD) has remained the global leading cause of death for the past 15 years (1). In 2019, approximately 18.6 million people died from CVD, accounting for 33% of mortalities that year (2). Imaging tests are critical to reducing CVD burden. Among medical imaging techniques, echocardiography (echo; cardiac ultrasound) is most preferred for CVD screening, diagnosis, and management (3).  1.1 Echocardiography and User Expertise Echo uses ultrasound to create cardiac images that contain real-time information about heart structure, function, and hemodynamics (3). The most common echo test is a 2D transthoracic echo (TTE). Compared to Computerized Tomography and Magnetic Resonance Imaging, echo is more cost-effective, safe, widely available, and portable (4, 5). However, one limitation of echo is its operator-dependency – that is, the accuracy of echo depends on the user’s expertise with image acquisition and interpretation.  For 2D-TTE, sonographers need much technical skill, knowledge, and experience to consistently produce good quality images. According to the American Society of Echocardiography (ASE), acceptable 2D-TTE image quality is defined by: (1) sufficient endocardial border definition to assess morphology and wall motion and (2) visibility of expected structures (6). For novice operators, it is often difficult to move an ultrasound probe effectively on the patient to achieve these aforementioned goals. A skilled hand is needed, as slight movements or changes in applied  3 pressure can markedly affect image quality. Anatomical knowledge is also needed to capture relevant structures across the 14 standard cross-sectional views acquired in a typical diagnostic test. Additionally, practical experience is needed to optimize image quality for all types of patient cases. In ‘technically-difficult’ studies, where patient features, conditions, and/or medical equipment impede ultrasound penetration (eg. lung disease), sonographers must know how to improve quality by adjusting imaging procedures to the individual (7-10). Considering the level of skill needed to perform echo, it is unsurprising that lab accreditation as well as sonographer experience and credentials are significantly associated with better image quality (11-13).   Optimizing image quality is essential, as suboptimal images are known to reduce the accuracy and reliability of echo interpretation. Suboptimal images have poor visibility, which can lead echocardiographers to produce erroneous measurements or miss pathologies. Previous studies have associated poor image quality with lower reproducibility of left ventricular volume, quantitative and visual left ventricular ejection fraction (14-18), global longitudinal strain (16), and regional wall motion abnormalities (19). Other studies have associated poor image quality with lower accuracy of intracardiac mass (20), infective endocarditis (21) and bicuspid aortic valve diagnoses (7). In addition, contrast-enhancing agents that improve image quality (ie. endocardial border definition) have been associated with fewer uninterpretable cases, higher measurement accuracy and reliability, and higher diagnostic accuracy (22, 23). The ASE also acknowledges the importance of image quality and cautious interpretation when dealing with poor quality images, as 2011 guidelines on quality of lab operations advises clinicians to forgo performing measurements on suboptimal images (6).    
4 In echo, policies, regulations, and standards have been established to minimize poor quality imaging and misinterpretation. At minimum, diagnostic labs are expected to have cart-based machines with M-mode, 2D, colour Doppler and spectral Doppler imaging capabilities (3, 4, 6, 24, 25). All machines must also comply with international Digital Imaging Communications in Medicine (DICOM) standards (6). These methods ensure uniform image production, processing, storage, and display. In addition, both sonographers and echocardiographers undergo extensive training before practicing professionally. Sonographers complete years of training in an accredited sonography program with a supervised clinical internship and pass national credentialing exams. For independent TTE interpretation, clinicians fulfill level II competency requirements (ie. perform 150+ and interpret 300+ TTEs), which is completed after cardiology training and 6 months of full-time echo training (26).  1.2 Echocardiography for Non-Experts (Point-of-care Ultrasound) Over the past decade, echo has been used increasingly amongst non-experts. Specifically, clinicians are using small devices to perform point-of-care ultrasound (PoCUS) tests as a complement to the physical exam. It is known that PoCUS increases diagnostic accuracy and expediency in the hands of experienced users (4, 27-29). However, since most clinicians are inexperienced with cardiac PoCUS, many of them may not have the skills to reap these benefits.   The literature suggests that many novice users do not have sufficient skill and are frequently acquiring and interpreting images with suboptimal quality. In an emergency medicine study,  5 novice users with cardiac PoCUS training were far less able to acquire sufficient quality images than expert users (13). The proportion of interpretable exams by novice vs. expert users was: 56% vs. 96% for left ventricular ejection fraction, 26% vs. 92% for right ventricular dilation, and 21% vs. 67% for inferior vena cava compliance, respectively (13). Many PoCUS users have also reported low self-efficacy with performing exams. In a 2017 internal medicine survey, 48% and 43% of resident and staff respondents, respectively, felt uncomfortable with acquiring interpretable PoCUS images (30). In addition, the same survey showed many respondents (57% of residents and 43% of staff) felt uncertain with how to interpret images – a critical skill for assessing image quality.    Poor user competency may be largely attributed to insufficient regulation and standardization of PoCUS. There are no requirements to ensure all clinicians using PoCUS undergo basic training. Internal medicine surveys suggests many PoCUS users do not complete formal PoCUS training (30, 31). Nevertheless, even those who have completed formal training may lack important competencies due to program variability. There are no universal training and credentialing recommendations, thus some PoCUS programs are more effective than others for building complete user competency. For example, programs vary considerably in program duration, which can range from 2 hours to over 3 months, and minimum scanning requirements, which can range from 20-100 scans (29, 32-35). Literature reviews argue short programs (under 12 hours) and bare minimum scanning requirements do not give clinicians enough experience to build expected competencies (29, 33). In addition, many programs have narrow case study requirements. 
Programs may follow different clinical guidelines, which have varied recommendations for abnormal study requirements (29, 33, 36). This can leave clinicians ill- 6 equipped to perform PoCUS on difficult or atypical patient cases. The lack of evidence-based, standardized training hinders clinicians from developing complete competencies within their scope of practice.   According to the non-profit organization, the Emergency Care Research Institute (32), PoCUS was ranked the 2nd largest medical technology hazard in 2020. Insufficient physician training has contributed much to its potential harm, but health care facilities also lack expert feedback and sufficient human oversight, which has been shown to improve echo quality and interpretation accuracy (37-41). As PoCUS adoption rates continue to grow rapidly, expert oversight may become increasingly unavailable (29, 42). As such, it is important to consider new methods of safeguarding PoCUS practices. 1.3 Artificial Intelligence for Assessing Image Quality Recently, artificial intelligence (AI) models in dermatology, ophthalmology, and pathology have achieved diagnostic accuracy that is equal or superior to clinical experts (43). Since these ground-breaking achievements, many have considered using AI to improve accuracy, reliability, and workflow in diagnostic imaging. In echo, AI applications have mainly focused on detection and segmentation tasks (43). Although, several researchers have also validated AI models for assessing image quality (44-48). Abdi et al. (46) developed an AI model that provided numeric image quality feedback to sonographers in real-time with expert-level accuracy. The proposed model examined 5 standard echo views and achieved performance that was comparable to inter-observer variability. Van Woudenberg et al. (49) showed that an AI model for identifying assessing image quality could be applied to PoCUS devices in real-time. The model processed 30  7 frames per second to produce outputs with a mean latency of 0.35 seconds, a sufficient computing speed for delivering real-time feedback (49). As a safeguard, automated image quality feedback has the potential to reduce the use of suboptimal images and improve the image acquisition skill of novice users (50). 1.4 Gap in the Literature Previous studies have shown that automating image quality assessment is feasible and effective. However, image quality feedback models have only been validated on cart-based machines exclusively used in echo labs, despite inexperienced PoCUS users needing this service most in hand-held ultrasound devices.   It is unknown whether such DL models can generalize predictions to new devices. AI models may perform worse on data that has not been included in model training (ie. new/unseen data) if it has learned (fit) the training dataset too well. For example, it may include irrelevant, random, or outlying characteristics of data and fail to capture overarching patterns. Since PoCUS devices have lower spatial and temporal resolution; less artifact reduction abilities and image enhancement features; limited ultrasound frequencies that shorten scanning depth; and less contrast than cart-based systems, a model solely trained with cart-based ultrasound images may use machine-specific features to discern image quality (25, 29, 33). As a result of overfitting to cart-based machine data, a model may perform worse on data from new machines – eg. PoCUS devices. 
Secondly, Dodge and Karam (51) found that increasing levels of noise and blur to test images reduced classification accuracy of state-of-the-art AI models. Hence, AI models may also be inferior on PoCUS devices because they tend to produce more noisy and blurry images.  8 Further studies are needed to adequately determine whether cart-based machine trained models can be applied to hand-held devices, used outside echo lab settings, to improve PoCUS.   9 2 Artificial Intelligence Background In this chapter, we introduce basic concepts in AI with a focus on a subfield known as deep learning. In particular, we discuss how deep learning models are designed, trained, and applied to medical imaging problems.  2.1 Artificial Intelligence, Machine Learning, and Deep Learning AI is a field in computer science that uses artificial systems to perform tasks which normally require human intelligence. The earliest applications of AI in healthcare involved the use of expert systems, which mimicked the intellectual decision-making process of human experts (55). With this approach, engineers had to encode rules (eg. if-then statements) that the expert system would perform algorithmically to make decisions for a defined task.   In the 1990’s, machine learning (ML) emerged as a new approach in AI. As opposed to following programmed rules, ML systems used data samples (a training dataset) to learn how to make accurate output decisions (55). For a defined task, ML required an engineer to select features that could discriminate data. Subsequently, these hand-engineered features were fed into a prediction model (classifier) that learned to create output decisions. A popular prediction model in ML was the artificial neural network (ANN), a network of computational units with input, intermediate (hidden), and output layers. The input layer received information from the outside world (eg. hand-engineered features), the hidden layer(s) organized and applied transformations  10 to the inputs, and the output layer produced categorical or numeric output predictions. Success in ML relied on selecting a set of highly relevant and discriminative features, which depended on the engineer’s knowledge of the particular medical field and problem of interest.   ML systems only began achieving exceptional performance after deep learning (DL), a subfield of ML, had re-emerged (55). DL refers to the use of deep neural networks (DNN), which are simply ANNs with many hidden layers between the input and output layer. The main difference between traditional ML and DL is that the latter can act as a feature extractor as well as a prediction model. By learning patterns from training data, DNNs automatically find features that are relevant to a defined task.   In recent years, DNNs have gained tremendous attention and success due to increased availability of data and affordable computer processing power (55). Unlike traditional ML models, DNNs continue improving as training sets become larger (55). However, they also require much more computing power than shallow ANNs to learn and make inferences in a practical time frame. DNNs have been shown to outperform other AI methods and shallower ANNs on diverse tasks (55-58). In medical imaging, DNNs have achieved top accuracy in diagnosing disease and assessing image quality (56, 58, 59). 2.2 Supervised and Unsupervised Learning In medicine, there are two main approaches to learning different ML tasks: unsupervised and supervised learning (55, 56). 
Unsupervised learning involves building a mathematical model that describes data and explores its hidden characteristics. It is used to perform tasks, such as  11 grouping data meaningfully (cluster analysis), simplifying high-dimensional data (principal component analysis), and creating artificial data (generative adversarial networks). In contrast, supervised learning models have a different purpose – to map the relationship between variables (input and output) with a mathematical model that solves a defined classification or regression problem. While unsupervised learning only requires input data, supervised learning requires input data as well as corresponding output data (labels). The output labels are correct annotations (ground-truth) associated with each input, which the model aims to predict. For example, imagine a ML model that uses supervised learning to classify echo vs. non-echo images. The training dataset would consist of input data (ie. echo images and random images) as well as output labels (ie. ‘echo’ vs. ‘not echo’ annotations corresponding to each image).   Most DL applications in medicine use supervised learning. In echo, supervised learning has been used to perform segmentation, chamber quantification, valve assessment, image view classification, and more. For the remainder of chapter 2, all ML and DL concepts will refer to neural networks with supervised learning.   2.3 Variables of a Neural Network: Hyperparameters and Parameters A neural network has two types of variables, hyperparameters and parameters. Parameters are variables that allow the model to learn relevant data patterns and make accurate predictions. During training, parameters are modified and learn to take on optimal values (57). Ultimately,  12 the purpose of training is to optimize the parameter values, as they determine the model’s success.   Hyperparameters are variables that determine model structure and approach to learning, such as the number of ANN layers or the type of neural network used (57-58). They are defined before training and can only be changed manually (58). Engineers can change hyperparameters to improve parameter training and, as a result, the model’s performance.  2.4 ML Data ML data is commonly separated into 3 discrete subsets for different stages of model development: training, refining, and testing (holdout validation). The training set helps the neural network determine an optimal set of parameters that maximizes performance measures of interest. In the learning phase, the engineer typically moves between forming ideas, coding, and experimenting with the training set. Next, the validation set is a separate dataset used to adjust model hyperparameters. During this phase, users may evaluate and compare different neural networks to determine the best performing one (56). Finally, the test set is used to evaluate the performance of the final model (56). Unlike the validation set, which is used to modify the model(s), the test set is considered an unseen, objective dataset. In medical ML, neural networks may also be tested on external validation sets to assess how the model would perform in different institutions. The goal of ML is to achieve good generalization, where a neural network is able to make accurate predictions on new data. Good generalization is best estimated by its performance on test sets (internal and external).   
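As a concrete illustration of this three-way division, the short sketch below splits a set of hypothetical study identifiers into training, validation, and test subsets at the study level. The identifiers, the 60/20/20 ratios, and the use of scikit-learn’s train_test_split are illustrative assumptions only; they do not describe how any dataset in this thesis was actually partitioned.

# Illustrative three-way split at the study (patient) level.
# Identifiers and ratios are assumptions for demonstration only.
from sklearn.model_selection import train_test_split

study_ids = [f"study_{i:04d}" for i in range(5000)]  # hypothetical study identifiers

# First carve off the training set (60%), then split the remaining 40%
# evenly into validation (20%) and test (20%) sets.
train_ids, holdout_ids = train_test_split(study_ids, test_size=0.4, random_state=42)
val_ids, test_ids = train_test_split(holdout_ids, test_size=0.5, random_state=42)

print(len(train_ids), len(val_ids), len(test_ids))  # 3000 1000 1000

Splitting by study rather than by individual clip keeps all clips from one patient in the same subset, which avoids leaking patient-specific information from training into evaluation.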
A 60/20/20% split is traditionally used to form training, validation, and test sets, respectively, although the distribution of data across subsets can vary. More recently, engineers are choosing to allocate a larger proportion of data to training due to the growing availability of ML data (eg. a 98/1/1% split).

2.5 Introduction to Neural Networks

2.5.1 Neural Networks and the Human Brain

The structure of ANNs was originally inspired by the human brain. Small functional units in the brain, known as neurons, communicate with each other to process information and carry out specific functions. As an example, imagine a human visualizing and processing an image of a heart. In the brain, neurons early in the visual pathway detect lower-level structures (eg. edges, colours), and those later in the visual pathway build upon these features to detect higher-level structures (eg. a heart shape). Gradually, the brain develops a sophisticated understanding of the visual input and recognizes a heart image. Similarly, in neural networks, artificial neurons receive inputs (eg. pixels of a heart image) and interact with other neurons in a hierarchical manner to make a judgement (eg. classifying the image as a heart).

On a cellular level, biological neurons have 4 key components which enable them to communicate: dendrites, cell body, axon, and synapses. Dendrites enable neurons to receive input signals. The cell body organizes and manipulates inputs to decide upon sending an output signal (action potential) to other neurons. The axon carries output signals to other areas. The synapse is a special connection between neurons where information is transferred. Each neural connection has a ‘synaptic strength’, or a level of intensity that affects the strength of the next neuron’s input signal. Similar to the brain, ANNs consist of computational units, known as (artificial) neurons, that interact with each other to produce outputs. Each artificial neuron receives an input, transforms it, and forms an output. Just as the brain controls the synaptic strength of neural connections, an ANN multiplies inputs with corresponding ‘weights’ to control the strength of neural connections. A weight affects the intensity of one neuron’s output signals on the following neurons connected to it (60). Similar to the cell body, each artificial neuron transforms inputs into outputs using a non-linear function, known as an activation function. Activation functions enable neural networks to learn complex, non-linear mathematical models of data.

2.5.2 Neural Network Organization

ANNs organize neurons into three types of layers: the input layer, hidden layers, and the output layer (Figure 2.1).

Figure 2.1 A (Fully-connected) Deep Neural Network

The input layer is the first layer, which introduces data variables into the neural network. The hidden layers situated between the input and output layers help extract data features. As mentioned, DNNs have many hidden layers, which allow them to extract and use more predictive features to make output decisions. Each layer of neurons builds upon the previous to extract more abstract information from data. Therefore, having hidden layers also increases the ability of neural networks to extract more complex and abstract features of data (57, 58). The output layer is the final layer and acts as a classifier that generates a classification or regression output (57, 60). As a system, neural networks perform computations from the front to back layer to make output predictions (57).
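To make the layer terminology concrete, here is a minimal fully-connected network written in PyTorch. The input size, hidden-layer widths, activation function, and four-class output are illustrative assumptions and do not correspond to the model architecture used in this thesis.

# Minimal fully-connected ("deep") neural network; sizes are illustrative only.
import torch
import torch.nn as nn

class SmallDNN(nn.Module):
    def __init__(self, n_inputs=120 * 120, n_hidden=64, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),   # input layer -> first hidden layer
            nn.ReLU(),                       # non-linear activation function
            nn.Linear(n_hidden, n_hidden),   # second hidden layer
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes),  # output layer (eg. 4 quality classes)
        )

    def forward(self, x):
        # Flatten an image into a vector of input variables, then compute
        # front-to-back through the layers to produce class scores.
        return self.net(x.flatten(start_dim=1))

model = SmallDNN()
scores = model(torch.randn(8, 1, 120, 120))  # a batch of 8 dummy images
print(scores.shape)  # torch.Size([8, 4])

Each nn.Linear layer holds the learnable weights of the connections between two layers of neurons, and the ReLU calls supply the non-linear activation discussed above.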
In DL, there are many different types of neural networks architectures. For certain tasks, some prove to perform better than others. For image recognition tasks, convolutional neural networks (CNN) are considered state-of-the art algorithms (56, 58). On the other hand, the most popular type of architecture for sequential tasks (eg. language translation, video analysis) is the recurrent neural network (RNN). Accordingly, the most suitable neural network architectures for a ML project depends on the particular problem of interest. For the purpose of understanding the DL models used in this study, the next two subsections solely describe CNN and RNN architectures.  2.5.3 Convolutional Neural Network for Computer Vision A CNN is a neural network that uses convolutional layers to extract image features. In each convolutional layer, filters are used to detect different image features (56). The filters are made up of learnable weights (parameters) that are optimized during neural network training. Linear matrix operations instruct each filter to scan an input image for a specific feature and represent it  16 spatially in a ‘feature map’ (56, 57). For example, a ‘vertical line’ filter would produce a feature map that locates vertical lines in the image (60). The outputs of each layer (feature maps) are transformed by a non-linear (activation) function and then passed on to the next layer as inputs. As with all DNNs, filters in deeper layers extract higher-level features (eg. shapes, identity) by combining simpler features discovered in earlier layers (eg. edges, textures, lines) (57).   After a convolutional layer, a pooling layer is often used to reduce the dimensional size of feature maps. Unlike convolutional layers, pooling layers do not contain parameters or extract features. Rather, pooling is used to improve model training speed and generalization.   A typical CNN structure alternates between convolutional and pooling layers. After the last convolutional and pooling layers, one or more fully-connected layers are used to organize feature information and make an output prediction. An example of a CNN structure is illustrated in Figure 2.2.  Figure 2.2 A Convolutional Neural Network     17  2.5.4 Recurrent Neural Networks for Sequential Data An RNN is ideal for analyzing sequential data. It is a cyclic neural network with a temporal element and can be thought of as a chain of duplicate cells, each processing a different part (time-step) of a sequence. In video analysis, each cell analyzes and makes a prediction on one image frame (time-step).   Cells have an identical neural network structure, which may be as simple as an ANN with aa single hidden layer. Each cell (x) receives two inputs: raw data from a corresponding time-step (eg. cell 2 receives input from image frame 2), and outputs from the previous cell (eg. x-1). Once a cell receives input, it functions like a traditional neural network to produce an output that is sent to the next cell (Figure 2.3). Consequently, rather than making independent predictions about each image frame, RNNs allow the model to make predictions on image frames based on features of the image frame, itself, as well as information on all previous image frames.   Figure 2.3 A Recurrent Neural Network     18 2.6 Neural Network Learning  Once preliminary neural network structures are developed and data is collected and prepared, training begins. 
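As a deliberately tiny example of what training involves, the sketch below runs a few batches of made-up images through a small convolutional classifier, computes a cost against ground-truth labels, and updates the weights. The architecture, optimizer, and random data are illustrative assumptions; the batching, cost function, and parameter updates that this loop performs are described in the paragraphs that follow.

# Minimal sketch of supervised training for a tiny CNN classifier.
# Architecture, optimizer settings, and data are illustrative assumptions only.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolutional layer (feature maps)
    nn.ReLU(),
    nn.MaxPool2d(2),                            # pooling layer (downsampling)
    nn.Flatten(),
    nn.Linear(8 * 60 * 60, 4),                  # fully-connected output layer (4 classes)
)

loss_fn = nn.CrossEntropyLoss()                  # cost function vs. ground-truth labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Dummy batches: 16 single-channel 120x120 "images" with 4-class labels each.
batches = [(torch.randn(16, 1, 120, 120), torch.randint(0, 4, (16,))) for _ in range(5)]

for images, labels in batches:
    predictions = model(images)           # forward pass through the layers
    cost = loss_fn(predictions, labels)   # average error on this batch
    optimizer.zero_grad()
    cost.backward()                       # feedback: gradients of the cost
    optimizer.step()                      # update the weights to reduce the cost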
In ML, weights are central to learning; they are parameters which determine the model’s ability to extract features and determine outputs. Often, neural network parameters are initially set to random values. Then, through exposure to training data, parameters learn to take on optimal values.  Parameters are optimized in an iterative manner. On a fixed number of training samples, referred to as a ‘batch’, a neural network makes a prediction per sample, receives overall feedback of its overall performance, and updates parameters based on this feedback. Feedback is based on a neural network’s cost function, which calculates the average difference between ground-truth and neural network outputs on one batch (58). Based on the cost function, another algorithm is used to direct parameters updates toward minimizing cost. Parameter optimization and neural network accuracy improves gradually with exposure to each training batch. 2.7 Motivation  Inexperienced users, who only use PoCUS devices, would benefit most from having DL-based image quality feedback for cardiac applications. Yet, to our knowledge, no such model has been validated on PoCUS devices. In contrast, several DL-based image quality assessment models have been validated on cart-based ultrasound data. Lack of PoCUS data and clinical experts to annotate data are two major bottlenecks to executing DL projects with PoCUS data. The ability to apply a single model to different machines can alleviate data collection and labelling needs.  19 Currently, it is unknown whether models can generalize predictions across different types of machines. In this study, we investigate whether a DL model trained to assess echo image quality on cart-based machines can perform comparably on hand-held PoCUS devices.    20 3 Methods 3.1 Sample One-hundred seven patients who were scheduled for a 2D-TTE at an out-patient echo lab at Vancouver General Hospital (VGH) were recruited to participate in this study. Only participants under age 18 were ineligible. Informed consent was obtained. Study participants were assigned a unique study number to link each subject to corresponding image data.  3.2 Data Collection Two sonographers were instructed to acquire image data from patients using cart-based and PoCUS machines. In the same room where the patient’s TTE was performed, one of two sonographers collected cine data with the same cart-based system used for the diagnostic exam (Vivid E95 and Vivid E9, General Electric, Chicago; iE33, Philips, Andover, MA) as well as a POCUS device (Lumify, Philips, Andover, MA; Clarius Mobile Health, Vancouver, BC). Similar to how many clinicians solely use one POCUS device, each sonographer used the same POCUS device for the entire study.  On each patient, the same sonographer acquired image data with the cart-based system and then with a PoCUS device. Ultrasound clips from nine views were acquired with each ultrasound machine in the following order: parasternal long-axis; parasternal short-axis on 4 levels (aortic, mitral, papillary muscle, apical); apical 4-chamber; apical 2-chamber; subcostal 4-chamber;  21 subcostal inferior vena cava. These are core image views that have been included in cardiac PoCUS and bedside echo tools for assessing image acquisition skill (61, 66). Each patient exam took approximately 15 minutes to complete. All image data was de-identified.  3.3 Gold Standard Data Classification  Currently, expert opinion is the gold standard for assessing echo image quality. 
While there is variation in expert ratings, physician experience and accreditation modestly improve the reliability of raters (17). Accordingly, the study dataset was evaluated by two echocardiographers who have completed level III training, a standing that indicates possession of the highest level of knowledge and experience in echo (62).   Experts were instructed to use a 4-point scoring system. As described by Liao et al. (43), this is a categorical method for assessing image quality based on visibility of the endocardial border:   Table 3.1 4-point scoring system for grading image quality based on percent endocardial border definition Image quality Endocardial border definition 1 (poor) 0-25% 2 (fair) 26-50% 3 (good) 51-75% 4 (excellent) 76-100%    22 Researchers assess image quality based on endocardial border definition (EBD), as it is critical for chamber size, wall thickness, and wall motion assessment (14-16, 18, 19, 63-65). Without adequate EBD, physicians may choose to forgo quantitative and/or qualitative LV assessment (9, 22, 66). To date, there is no consensus method for assessing echo image quality. Consequently, the literature consists of numerous methods for assessing image quality, which vary in subjectivity, quickness, and clinical relevance (13-17, 19, 67, 68). The 4-point method was chosen over other methods for its quickness and clinical relevance.   TeamViewer, an encrypted online screen sharing platform, was used to provide experts with access to study data. Image data was presented to experts in a randomized order to minimize confirmation bias. Each expert rater assigned a cardiac view and quality classification to each ultrasound clip in the dataset, while blinded to the other’s ratings. Expert ratings of image quality served as a benchmark for assessing and comparing model performance between machine types.  3.4 Neural Network Design  For this study, we used a pre-trained DNN developed by Liao et al. (2019), known as the BakeNeko model. The model produces three outputs for each image or video input: (1) the image view; (2) a numerical image quality score from 0-100%; and (3) standard deviation for the image quality score to indicate output uncertainty. The uncertainty score addresses human error in training data labels and is intended to help users take observer variability into account when using image quality scores (47).    23 The DL model uses a dense convolutional neural network (DenseNet), followed by a long short-term memory (LSTM) network, as illustrated below:  Figure 3.1 The BakeNeko model architecture   The DenseNet is a type of CNN architecture that uses ‘DenseBlocks’. Each layer in a DenseBlock receives input from the outputs of all preceding layers, thus giving each layer direct access to information from all layers. This dense connectivity between neural network layers promotes feature reuse, diverse feature extraction and more effective parameter updates. Compared to traditional CNNs, DenseNets have been shown to have higher computational efficiency, training speed, and network accuracy (70).   A special RNN architecture known as the Long Short-Term Memory (LSTM) network was used in conjunction with the DenseNet to produce output(s) for ultrasound clips. Traditional RNNs poorly retain information from early time steps (ie. image frames), rendering them unable to capture correlations between distant time steps. Consequently, LSTMs were developed to improve the ability for RNNs to capture long-range time dependencies in sequential data. 
As with any RNN, the LSTM consists of a connected series of cells, where each cell represents a time step. The LSTM uses a ‘hidden state’ to pass previous and current time step information to  24 each cell to produce an output per time step. Importantly, the LSTM also uses a ‘cell state’ to selectively store and pass information through the entire neural network. Three gates are used to control the flow of information in and out of the cell state. The selectivity of the cell state enables the LSTM to retain long-term memory of important information and help the neural network learn long-term dependencies. Following the LSTM module, logistic regression was used to output a mean and standard deviation estimate of image quality between 0 and 1.   The image quality assessment model was built and validated with 5,000 unique TTEs retrieved from DICOM files in the VGH Picture Archiving and Communications System (PACS) database. The data was exclusively taken from cart-based ultrasound machines, Philips’ iE33 and General Electric Vivid I. Most of the data was acquired by certified sonographers, but some was also acquired by trainees (ie. cardiology residents, sonography students). Ultrasound clips were taken from 14 echo image views (Appendix A).   TTE exams were divided into three mutually exclusive datasets: 60% training set (3,000 TTEs), 20% validation set (1,000 TTEs), and 20% test set (1,000 TTEs). For the number of training studies per image view, see Appendix B. A level III echocardiographer assigned image quality labels twice to each ultrasound clip at different times, using the 4-point grading system.   Before neural network training, all ultrasound clips in the training set were cropped and reduced to 120x120 pixels. This standardized inputs from different ultrasound machines, which helped the neural network improve generalizability to new data. The dataset was also artificially expanded using data augmentation, whereby existing data (ie. the 3,000 TTEs) was modified  25 with techniques, such as image scaling, translation, and rotation. Training data for the BakeNeko model was augmented by randomly translating ultrasound clips up to 10% of the total image size (in pixels) and randomly rotating them 5 degrees. Data augmentation allows the training set to represent many geometric variations that exist in real-world image data.   Refer to Liao et al. (47) for further information on the neural network design and training. For the purposes of this study, the model’s numeric outputs were converted into categorical outputs, based on the 4-point grading system.  3.5 Evaluation Metrics Contingency tables are frequently used in medical imaging studies to evaluate and summarize DL performance on classification tasks (59, 71, 72). In machine learning, they are known as confusion matrices. For this study, contingency tables were used to summarize DL performance on cart-based and PoCUS data with respect to expert opinion. Each contingency table compared the frequency distribution of model and expert ratings of image quality on the same dataset N, as shown in Table 3.2.     26 Table 3.2 Frequency of model vs. 
expert rater agreement using the 4-point image quality grading system

              Model (j)
              Poor    Fair    Good    Excellent
Expert (i)
  Poor        n1,1    n1,2    n1,3    n1,4
  Fair        n2,1    n2,2    n2,3    n2,4
  Good        n3,1    n3,2    n3,3    n3,4
  Excellent   n4,1    n4,2    n4,3    n4,4

Table rows and columns represented image quality rating categories (poor, fair, good, excellent), and cells (ni,j) represented the number of ultrasound clips with corresponding row and column rating assignments.

Contingency tables were used to calculate six statistics for assessing DL performance across machine types: percent agreement, weighted kappa, positive predictive value (PPV), negative predictive value (NPV), sensitivity, and specificity.

The remainder of section 3.5 will base statistical calculations on contingency tables with proportions of observed agreement (poi,j), where poi,j = ni,j / N:

Table 3.3 Proportions of model vs. expert rater agreement using the 4-point image quality grading system

              Model (j)
              Poor     Fair     Good     Excellent
Expert (i)
  Poor        po1,1    po1,2    po1,3    po1,4
  Fair        po2,1    po2,2    po2,3    po2,4
  Good        po3,1    po3,2    po3,3    po3,4
  Excellent   po4,1    po4,2    po4,3    po4,4

3.5.1 Percent Agreement

The most basic statistic for assessing the reliability of instruments, methods, or tests is percent agreement. Percent agreement can be defined as the percentage of observer ratings in data that are completely concordant. It is calculated as the number of agreements between raters divided by the total number of rated cases (N), multiplied by 100 (73). In the confusion matrix from Table 3.3, it is simply the sum of the diagonal cells:

Po = po1,1 + po2,2 + po3,3 + po4,4

One limitation of percent agreement is that all unmatched observer ratings are treated the same. However, for clinical tools or tests, it is important to discriminate between different types and levels of error. For example, a ‘poor-excellent’ disagreement would be significant because it impacts the clinician’s decision (ie. to forgo image interpretation). However, a ‘good-excellent’ disagreement may be insignificant if both opinions lead to the same outcome (ie. using an interpretable image). Hence, in addition to percent agreement, weighted kappa (κw) was used to assess the severity of disagreement between the model and experts.

3.5.2 Weighted Kappa

κw is a statistic adapted from Cohen’s kappa (κ). Originally, the κ coefficient was developed to measure agreement between two observers above random agreement that may have occurred due to chance. For a given (rated) dataset, κ is determined by calculating the overall probability of observed agreement (Po) and correcting for the probability of chance agreement (Pe):

κ = (Po − Pe) / (1 − Pe)

Refer to Appendix C for calculations of Po and Pe. Similar to percent agreement, κ also treats all disagreement as equal. Thus, κw was developed to measure rater agreement on ordinal data more appropriately. κw places partial value on observations depending on the extent of rater disagreement, making it the most common statistic for measuring observer agreement on ordinal scales (74). It uses a defined weighting scheme to penalize disagreement increasingly as observer ratings become more dissimilar (ie. differ by a higher number of classes). For example, cases in which observer scores differ by 1 class (eg. poor-fair image quality) are penalized less than when scores differ by 2 classes (eg. poor-good image quality). An example of a weight scheme is shown below.
Table 3.4 A weight matrix using the 4-point image quality grading system

              Model (j)
              Poor    Fair    Good    Excellent
Expert (i)
  Poor        w1,1    w1,2    w1,3    w1,4
  Fair        w2,1    w2,2    w2,3    w2,4
  Good        w3,1    w3,2    w3,3    w3,4
  Excellent   w4,1    w4,2    w4,3    w4,4

The type of weight matrix will depend on the nature of the problem and the type of rating scale used. For the purpose of this study, we applied a linear weighting system where the weight of each cell decreased by 0.333 for every 1-category difference between observer ratings:

Table 3.5 A linear weight matrix using the 4-point image quality grading system

              Model (j)
              Poor    Fair    Good    Excellent
Expert (i)
  Poor        1       0.66    0.33    0
  Fair        0.66    1       0.66    0.33
  Good        0.33    0.66    1       0.66
  Excellent   0       0.33    0.66    1

The difference between an unweighted and a weighted kappa calculation is the inclusion of the weight matrix:

κw = (Po(w) − Pe(w)) / (1 − Pe(w))

Refer to Appendix D for calculations of Po(w) and Pe(w).

κw and κ coefficients range from -1 to 1. The denominator, 1 − Pe or 1 − Pe(w), standardizes the coefficient such that a value of 0 represents agreement due to pure chance. A negative value represents disagreement beyond chance, which is usually a rare occurrence. A positive value represents agreement above chance. The greater the level of agreement, the higher the positive value, up to 1.

3.5.3 Positive Predictive Rate, Negative Predictive Rate, Sensitivity, Specificity

When test data contains a class imbalance, where a disproportionate amount of data belongs to one or more select classes, agreement statistics alone are an insufficient indicator of model performance. Percent agreement and κw focus disproportionately on the most frequent classification(s) in data. Consequently, these statistics can mask model biases by ignoring poor performance on the least frequent classification(s) in data. Liao et al. (47) showed that a randomly selected dataset from VGH PACS that was mostly acquired by professional sonographers contained only 8.3% poor quality ultrasound clips. Since class imbalance is expected in this study, the model’s performance on each image quality classification was evaluated using four measures of accuracy: positive predictive value, negative predictive value, sensitivity, and specificity.

Positive predictive value, negative predictive value, sensitivity, and specificity are traditionally used to evaluate binary problems, where rating categories represent a positive and a negative class (65, 75). For example, the positive class and negative class in a binary image quality problem may represent acceptable and unacceptable image quality, respectively (Table 3.6).

Table 3.6 Frequency of model vs. expert rater agreement on a binary image quality classification problem

                          Model
                          Acceptable    Unacceptable
Expert   Acceptable       TP            FN
         Unacceptable     FP            TN

In the contingency table above, cell counts describe 4 types of outcomes:
• A true positive (TP), which refers to a correct prediction of the positive class
• A true negative (TN), which refers to a correct prediction of the negative class
• A false positive (FP), which refers to an incorrect prediction of the positive class
• A false negative (FN), which refers to an incorrect prediction of the negative class

TP, TN, FP, and FN values in the contingency table can be used to calculate positive predictive value, negative predictive value, sensitivity, and specificity. In multi-class problems, classifications may be dichotomized such that these statistics can be calculated for each outcome category (ie. poor, fair, good, and excellent).
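To tie these definitions together numerically, the sketch below computes percent agreement, linearly weighted kappa, and one-vs-rest TP/FP/FN/TN counts (from which the four accuracy measures defined in the next subsection follow) for a small, invented 4x4 contingency table. The cell counts are made up for illustration only, and scikit-learn’s cohen_kappa_score stands in for the R vcd Kappa function that was actually used in this study.

# Agreement and per-class accuracy statistics from a 4x4 contingency table.
# The table values below are invented for illustration only.
import numpy as np
from sklearn.metrics import cohen_kappa_score

labels = ["poor", "fair", "good", "excellent"]
# Rows: expert rating (i); columns: model rating (j).
table = np.array([
    [10,  5,  2,  0],
    [ 4, 40, 12,  3],
    [ 1, 15, 60, 20],
    [ 0,  4, 25, 50],
])
N = table.sum()

# Percent agreement: sum of the diagonal cells divided by N, times 100.
percent_agreement = 100 * np.trace(table) / N

# Linearly weighted kappa. Expanding the table back into paired ratings lets
# us reuse cohen_kappa_score(weights="linear"); the study used R's vcd package.
expert, model = zip(*[(i, j) for i in range(4) for j in range(4)
                      for _ in range(table[i, j])])
kappa_w = cohen_kappa_score(expert, model, weights="linear")

# One-vs-rest dichotomization: TP/FP/FN/TN per image quality class,
# from which PPV, NPV, sensitivity, and specificity follow.
for k, name in enumerate(labels):
    tp = table[k, k]
    fp = table[:, k].sum() - tp    # model said class k, expert disagreed
    fn = table[k, :].sum() - tp    # expert said class k, model disagreed
    tn = N - tp - fp - fn
    print(name, "PPV", tp / (tp + fp), "NPV", tn / (tn + fn),
          "sensitivity", tp / (tp + fn), "specificity", tn / (tn + fp))

print("percent agreement:", round(percent_agreement, 1))
print("weighted kappa:", round(kappa_w, 3))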
Positive predictive value (PPV), also known as precision, is the proportion of predicted positives that are truly positive. In this context, PPV describes how often the model is correct on positive cases. PPV is calculated as:

PPV = \frac{TP}{TP + FP}

Negative predictive value (NPV) is the proportion of predicted negatives that are truly negative. Contrary to PPV, NPV describes how often the model is correct on negative cases. NPV is calculated as:

NPV = \frac{TN}{TN + FN}

Sensitivity, also known as recall or the TP rate, is the proportion of actual positives that are correctly predicted. Sensitivity describes how often the model detects a positive result. It is calculated as:

Sensitivity = \frac{TP}{TP + FN}

Specificity, also known as the TN rate, is the proportion of actual negatives that are correctly predicted. Specificity describes how often the model detects a negative result. It is calculated as:

Specificity = \frac{TN}{TN + FP}

4 Results

In some cases, a particular view could be acquired from a participant with only one machine type. To control for image view as a potential confounder, patient cines without view-matched cart-based and PoCUS data were excluded. A total of 1925 cines were collected; statistical analyses included 1820 cines after excluding PoCUS and cart-based data that were not matched by patient and image view.

The data were divided by machine type into cart-based and PoCUS subsets. Comparisons of model and expert ratings on the cart-based and PoCUS datasets were summarized in contingency tables. Percent agreement and linear κw were computed on the cart-based and PoCUS datasets, and asymptotic 95% confidence intervals were calculated for all κw values. PPV, NPV, sensitivity, and specificity were calculated for each image quality class. All analyses were conducted using R software, version 3.5.1 (76), with the R Base Package (version 3.6.2). The Kappa and confint functions from the vcd package (version 1.4-5) were used to calculate κw and its confidence intervals.

4.1 Participant and Data Characteristics

Demographics of the study population are shown in Table 4.1. For clinical indications of the population, see Appendix E.

Table 4.1 Anthropometric characteristics and clinical assessment of the study population (N=107)

Anthropometric Characteristics                             Mean (SD), Frequency (%)
Age (years)                                                63 (16)
Female                                                     56 (52)
Height (cm)                                                168 (9)
Weight (kg)                                                73 (17)
Obesity (BMI ≥ 30)                                         17 (16)
Body surface area (m2)                                     1.8 (0.2)
Systolic blood pressure (mmHg) (N = 102)                   136 (24)
Diastolic blood pressure (mmHg) (N = 102)                  70 (11)
Heart rate (bpm) (N = 77)                                  71 (11)

Clinical Assessment                                        Mean (SD), Frequency (%)
Technically difficult                                      26 (24)
Elevated pulmonary artery systolic pressure (N = 63)       8 (13)
Abnormal LV size                                           8 (7)
Abnormal LVEF                                              20 (19)
LVMI > normal (N = 104)                                    17 (16)
Abnormal LV wall motion (N = 70)                           22 (31)
Dilated LA volume (N = 105)                                58 (55)
Dilated RA volume (N = 106)                                18 (17)
Dilated RV size                                            4 (4)
Abnormal RV systolic function (N = 106)                    6 (6)
Dilated IVC (N = 104)                                      6 (6)
Pericardial effusion                                       13 (12)

The distribution of classifications on the cart-based and PoCUS datasets is shown in Figure 4.1. For both datasets, the proportion of data belonging to each image quality class was imbalanced, with the fewest ultrasound clips labelled 'poor'. The model tended to assign more 'fair' ratings and fewer 'excellent' ratings than the experts.

Figure 4.1 Raters' distribution of image quality classifications in cart-based and PoCUS datasets
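As noted in the analysis description above, κw and its asymptotic confidence intervals were obtained with the Kappa and confint functions from the vcd package. The sketch below illustrates one way this calculation can be carried out, alongside a by-hand computation following the weighted kappa formula and Appendix D. It is an illustrative sketch rather than the thesis analysis code; it reuses the cart-based Model vs. Expert 1 counts from Appendix F and assumes that vcd's "Equal-Spacing" weighting corresponds to the linear weights of Table 3.5.

```r
# Illustrative sketch only (not the thesis analysis code): linearly weighted
# kappa for a 4x4 model-vs-expert contingency table, computed by hand and
# cross-checked against vcd::Kappa()/confint(), the functions named above.
library(vcd)

# Cart-based Model vs. Expert 1 counts from Appendix F
# (rows = model ratings, columns = expert ratings).
counts <- matrix(c(10, 20,   4,   6,
                    6, 66, 142,  61,
                    3, 11, 148, 233,
                    2,  1,  29, 168),
                 nrow = 4, byrow = TRUE)

k <- nrow(counts)
w <- 1 - abs(outer(1:k, 1:k, "-")) / (k - 1)   # linear weights of Table 3.5

p_obs <- counts / sum(counts)                  # observed cell proportions
p_exp <- outer(rowSums(p_obs), colSums(p_obs)) # chance-expected proportions

Po_w <- sum(w * p_obs)                         # weighted observed agreement
Pe_w <- sum(w * p_exp)                         # weighted chance agreement
kappa_w_by_hand <- (Po_w - Pe_w) / (1 - Pe_w)

# "Equal-Spacing" weights in vcd are assumed here to match the linear
# weighting scheme of Table 3.5.
fit <- Kappa(counts, weights = "Equal-Spacing")

print(kappa_w_by_hand)
print(fit)           # unweighted and weighted kappa estimates
print(confint(fit))  # asymptotic 95% confidence intervals
```

For reference, the corresponding percent agreement is simply the proportion of exactly matching ratings, sum(diag(counts)) / sum(counts).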
Appendix F shows contingency tables comparing model and expert ratings on cart-based vs. PoCUS datasets. Errors occurred most often in the top-right off-diagonal cells, indicating that the model consistently underrated image quality.

4.2 Percent Agreement

Figure 4.2 shows percent agreement between raters (the model and the experts) on image quality ratings for cart-based vs. PoCUS data.

Figure 4.2 Percent agreement between model and expert ratings of image quality on PoCUS vs. cart-based ultrasound clips

The difference in agreement between the expert raters on cart-based (61%) and hand-held ultrasound data (57%) was only 4%. Overall, percent agreement between the model and the experts was modest relative to inter-expert agreement. In comparisons between the model and experts 1 and 2, percent agreement was 2% and 6% lower on PoCUS data, respectively.

4.3 Weighted Kappa

Figure 4.3 summarizes κw agreement between raters on cart-based and PoCUS data. According to the kappa guidelines of Landis and Koch (77), the experts achieved moderate agreement with κw = 0.53. Consistent with the percent agreement results, κw agreement between the model and the expert raters was significantly lower than between the expert raters. In all comparisons, there was no significant difference in κw coefficients between cart-based and PoCUS data.

Figure 4.3 Weighted kappa agreement (±95% CI) between model and expert ratings of image quality on PoCUS vs. cart-based ultrasound clips

4.4 PPV

Appendix G shows the model's PPV per image quality classification on the PoCUS vs. cart-based datasets. PPV was highest on excellent ultrasound clips. When expert 1 was the gold standard, PPV was higher on cart-based data for half of the image quality classifications (i.e. fair and good quality). When expert 2 was the gold standard, PPV was higher on cart-based data for only one image quality classification (i.e. excellent quality).

4.5 NPV

Appendix H shows the model's NPV per image quality classification on the PoCUS vs. cart-based datasets. NPV was highest on poor ultrasound clips. When expert 1 was the gold standard, NPV was markedly higher on cart-based data for half of the image quality classifications (i.e. fair and good quality). When expert 2 was the gold standard, NPV was markedly higher on cart-based data for one image quality classification (i.e. fair quality).

4.6 Sensitivity

Appendix I shows the model's sensitivity per image quality classification on the PoCUS vs. cart-based datasets. Sensitivity was highest on the fair quality classification for both the PoCUS and cart-based datasets. When expert 1 was the gold standard, sensitivity was higher on cart-based data for three image quality classifications; however, sensitivity was higher on PoCUS data for the poor quality class. When expert 2 was the gold standard, sensitivity was higher on cart-based data for half of the image quality classifications (i.e. good and excellent quality).

4.7 Specificity

Appendix J shows the model's specificity per image quality classification on the PoCUS vs. cart-based datasets. Specificity was highest on poor and excellent ultrasound clips. When either expert was the gold standard, specificity was higher on cart-based data for half of the image quality classifications (i.e. poor and fair quality).

5 Discussion

Many clinicians have inadequate cardiac PoCUS skill and are prone to using poor quality images that can reduce test reliability and accuracy. DL-based image quality feedback offers a scalable and cost-effective way to help clinicians identify suboptimal images and improve their image acquisition skill (46, 78).
Cardiac ultrasound is performed with a wide range of machine brands and types, from cart-based systems to hand-held devices. DL implementation would be most practical if models could be applied to most ultrasound machines. We explored whether a DL model for assessing echo image quality on cart-based machines could be adequately applied to hand-held PoCUS devices.

5.1 Model Application in Cart-based vs. PoCUS Machines

In this study, the BakeNeko model demonstrated comparable performance on the cart-based and PoCUS datasets. In model vs. expert comparisons of image quality ratings, percent agreement was similar across machine types and κw agreement was not significantly different across machine types. In addition, model sensitivity, specificity, PPV, and NPV were neither better nor worse on either machine type. Despite exclusive training on cart-based data, our DL model demonstrated roughly equivalent performance on vastly different machines, suggesting that DL may have strong generalization ability across imaging machines.

The findings of this study are supported by previous echo and radiology studies in the literature. In a prospective study, Moulson et al. (79) showed that a DL-based model for assessing left ventricular ejection fraction had comparable performance on PoCUS and cart-based images (with adequate image quality), even though the model was only trained on cart-based ultrasound data. In linear regression analysis, Moulson et al. (79) showed a similar correlation between quantitative left ventricular ejection fraction estimates and echocardiographer biplane assessment on cart-based (adjusted R2 = 0.42) and PoCUS data (adjusted R2 = 0.45). In fact, the median difference in left ventricular ejection fraction scores between the DL model and the echocardiographer was 7.5% lower (p < 0.001) on PoCUS images compared with cart-based images. In a cardiac MRI study by Chen et al. (54), a DL segmentation model trained on a single-scanner dataset achieved comparable LV, RV, and myocardial segmentation accuracy on scanners from different manufacturers and with varied magnetic field strengths. In another MRI study, Zhang et al. (80) showed that a DL-based breast tissue segmentation model had high-to-very-high correlation in performance across 4 different scanners (R2 = 0.96-0.99). Consistent with the present study, these findings from the literature suggest that DL is capable of generalizing to data from vastly different imaging machines. Previous studies have applied DL to various echo tasks, such as view classification, valve segmentation, and disease classification, and the ability of these models to generalize to PoCUS devices is promising (43).

Interestingly, several external validation studies have shown that current models generalize poorly to external institutions (52, 53, 81-83). While most external validation studies have not directly investigated the relationship between new imaging equipment and DL performance, some have suggested that differences in imaging systems may reduce DL performance (52, 53). Our findings suggest that machine-specific differences are unlikely to affect DL performance substantially. Indeed, differences in image appearance may affect model performance; however, current state-of-the-art methods, such as data augmentation and normalization, can improve model generalization considerably (82). With extensive data augmentation ("BigAug"), Zhang et al.
(84) showed that DL models can achieve similar performance on unseen domains (i.e. machines from different ultrasound vendors) as on source domains (i.e. machines from the same ultrasound vendor). Accordingly, these state-of-the-art methods may be sufficient to represent data from various imaging systems and achieve good generalization across machines.

Current DL models may be more substantially affected by factors other than machine-specific differences, such as differences in patient population. Previous studies show that DL performance is worse when training sets and external validation sets have different patient populations (e.g. different prevalence of disease). In particular, DL models for both disease and non-disease related tasks (e.g. image artifact and noise reduction) have shown worse performance on patient populations with different types of disease, disease prevalence, and disease severity (81, 83, 85). Koehler et al. (81) noted that poor generalization to external datasets occurred despite "massive data augmentation". Perhaps data augmentation and normalization methods enable DL models to generalize to different imaging machines, but not to data from different patient populations. Researchers have recommended that training data represent "a wider anatomical spectrum" to overcome poor generalization to data from different patient populations (85), which may not be achieved with common augmentation tactics such as rotation, translation, and scaling. As such, new approaches may be needed to handle this specific cause of poor DL generalization. Defining and distinguishing pathology-driven vs. machine-driven causes of poor generalization may help researchers better understand generalization error and develop more effective strategies for improving DL generalization performance.

5.2 Clinical Implications of Model Generalization Across Ultrasound Machines

Poor generalizability is a major barrier to implementing DL in clinical practice. Multi-site validation studies show that DL models can perform worse when tested at external clinical sites, suggesting that DL models may adapt unsatisfactorily to variable and dynamic environments.

To overcome poor DL performance in new institutions, researchers have suggested fine-tuning models on local data before deployment (53, 86, 87). However, fine-tuning has important limitations that must be considered. Firstly, fine-tuning is resource-intensive; it requires funding, data infrastructure, human expertise, and time to execute. There is also concern that DL performance could worsen over time due to changes in the clinical environment (e.g. changes in the patient population, imaging protocols, and operator skill and training). The potential need for close monitoring and regular fine-tuning of DL models can make DL applications relatively high-maintenance. Secondly, fine-tuning limits the accessibility and impact of DL. Due to the versatility of cardiac ultrasound, clinicians may use a wide range of ultrasound machines to meet different clinical needs. There are over ten key players in the ultrasound market, and each manufacturer produces machines with different settings, capabilities, and/or levels of quality. Researchers have stated that fine-tuning may be needed when new machines are introduced to an institution (53). Our study suggests that DL models that prove to be clinically acceptable on one ultrasound machine may be viable on most other ultrasound machines.
Having DL models available on most machines would allow clinicians to access them widely. In general, broad access to DL models can help a larger number of clinicians use ultrasound to improve decision-making, especially in the absence of sufficient PoCUS training or expert oversight. Moreover, on a global scale, fine-tuning can also affect the accessibility of DL adoption, particularly in low-resource settings. While AI is said to be "ideally suited to situations where human expertise is a scarce resource" (88), fine-tuning requires institutional resources that low-resource settings may not be able to afford. Paradoxically, this may hinder safe implementation of DL services in the settings that would benefit from DL tools the most, such as rural regions and second and third world countries. DL implementation in low-resource settings may be more plausible with less need for fine-tuning.

Fine-tuning may be an effective short-term solution, but it has significant limitations in the long term. Hence, regardless of fine-tuning strategies, improving DL generalization ability is critical to successful and broad adoption of AI. Understanding and identifying specific generalization problems is the first step toward finding effective strategies that yield robust DL models which serve the needs of the local and global community.

5.3 Research Implications of Model Generalization Across Ultrasound Machines

Typically, training and test data are randomly drawn from the same data source because researchers desire a training set that closely resembles real-world data (i.e. the data they aim to make inferences on). However, since the findings of this study suggest that DL models can generalize across different ultrasound machines, researchers may consider training DL models for PoCUS with cart-based ultrasound data to improve project workflow and overcome data collection and labelling barriers.

Data availability is a large barrier in ML applications to medicine, as sufficient training data is required to produce highly accurate models (89). In DL, model accuracy tends to improve as the training set increases in size and variety (90); such characteristics are more often associated with cart-based data. Much cart-based ultrasound data is stored in hospital PACS, a well-structured and standardized data storage solution. In comparison, the data archiving infrastructure for PoCUS is relatively poor; even where PoCUS data archiving is expected, studies have shown that documentation of exams can be poor (91). As a first-line imaging technique, TTEs are performed on a relatively diverse patient population, and as a result, cart-based ultrasound data from PACS can be quite heterogeneous. Additionally, with decades' worth of image data stored in PACS, cart-based training data may be more likely to cover a wide range of patient conditions than most available PoCUS datasets. Some researchers may argue that public datasets can be used to develop PoCUS models. However, public datasets may be subject to selection bias (86) and, being less representative of real-world data, can cause models to generalize poorly. Due to the quantity and variety of images stored in PACS, researchers may have a greater capacity to build accurate and generalizable models with cart-based ultrasound data as opposed to PoCUS data.

Lastly, the willingness to use cart-based training data enables researchers to conduct studies in a more efficient manner.
To our knowledge, no image quality model has been trained on PoCUS data. DL projects can be time-consuming, costly, and laborious. Bearing in mind that about 80% of the work is devoted to data collection, organization, and labelling, researchers may bypass the most time-consuming parts of a DL project by applying or modifying existing models rather than working from scratch (92). Image quality assessment models have been trained and validated on cart-based ultrasound data (44-47, 50), and some may be suitable for PoCUS machines. Another advantage of using cart-based ultrasound (PACS) data is that the associated patient echo reports can be used to overcome labelling barriers. For our particular DL task, the kind of annotations needed to train a real-time image quality feedback model are not available in echo reports. However, other information in echo reports, such as chamber quantification, could reduce labelling barriers and improve the efficiency of research. Timely intervention is important, especially for ML applications to PoCUS. Without effective mitigation strategies, the potential harm associated with handling poor quality data will only increase with the growing uptake of PoCUS. Leveraging pre-trained models and annotated echo reports are ways in which the process of deploying AI solutions into clinical practice may be accelerated.

5.4 Study Strengths and Limitations

This is one of the few studies to have investigated the ability of DL models to generalize their performance to different imaging machines, and it helps build our understanding of DL and its ability to adapt to different real-world conditions. While most DL studies on echo image quality have been retrospective, we conducted a prospective study. Participants were recruited randomly from an out-patient echo lab with no exclusion criteria, minimizing selection bias. The study was designed such that each imaging device was used by the same operator and on the same patient population. These methods allowed us to control for potential confounding by operator skill and patient type, which are suspected to affect DL generalization ability (93).

One limitation of this study was the use of the 4-point rating system to assess image quality. While clinically relevant, this rating system defines quality narrowly: it only considers endocardial border delineation. However, appropriate echo interpretation also relies on other indicators of image quality, such as a correct scan plane, appropriate instrument settings, visible valves, and absence of artifacts. Additionally, image quality ratings between the echocardiographers were quite variable (κw = 0.53; percent agreement = 57-61%), suggesting that the gold standard could have been more reliable. The quality and accuracy of the statistical analysis may be improved by having multiple echocardiographers agree upon gold standard ratings with a majority or consensus vote.

Another study limitation was the qualitative nature of the statistical comparisons. In addition to qualitative assessment, future steps may involve more sophisticated statistics to measure significant differences in percent agreement, PPV, NPV, sensitivity, and specificity on PoCUS vs. cart-based data.

Lastly, this study only tested the generalization ability of one neural network architecture. Recent studies show that neural network architecture significantly affects test accuracy in image view classification problems (72, 94).
Accordingly, some neural network architectures may have greater ability to generalize performance across different machines than others. Thus, although promising, the findings of this study may not generalize to different neural network architectures.  5.5 Conclusion and Future Directions Ultrasound is being used increasingly by clinicians to perform cardiac PoCUS. Automated image quality feedback can help prevent novice users from interpreting suboptimal images, especially when expert oversight is unavailable. Several researchers have trained and validated image quality assessment models on cart-based ultrasound data. Our findings show that such models, trained exclusively on cart-based data, may be adequately applied to PoCUS devices without lowering model performance. This suggests that DL can have robust performance in a wide range of ultrasound machines without fine-tuning. In future studies, researchers may evaluate DL performance on cart-based vs. PoCUS machines at different clinical sites, with different PoCUS users, and/or with different model architectures to determine the conditions in which DL models have good generalization ability. Future studies may also compare DL performance on cart-based vs. PoCUS machines for other echo tasks, such as view and disease classification.   48 References 1. World Health Organization. The top 10 causes of death. Available at: https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death. Accessed November 20, 2020. 2. Global Health Data Exchange. GBD Results Tool. Available at: http://ghdx.healthdata.org/gbd-results-tool. Accessed November 20, 2020. 3. Evangelista A, Flachskampf F, Lancellotti P, et al. European Association of Echocardiography recommendations for standardization of performance, digital storage and reporting of echocardiographic studies. Eur J Echocardiogr 2008;9:438-448.  4. Spencer KT, Kimura BJ, Korcarz CE, Pellikka PA, Rahko PS, Siegel RJ. Focused cardiac ultrasound: Recommendations from the American Society of Echocardiography. J Am Soc Echocardiogr 2013;26:567-581. 5. Savino K, Ambrosio G. Handheld ultrasound and focused cardiovascular echography: Use and information. Medicina (Kaunas) 2019;55:423.  6. Picard MH, Adams D, Bierig SM, et al. American Society of Echocardiography recommendations for quality echocardiography laboratory operations. J Am Soc Echocardiogr 2011;24:1-10. 7. Jain R, Ammar KA, Kalvin L, et al. Diagnostic accuracy of bicuspid aortic valve by echocardiography. Echocardiography 2018;35:1932-1938.  49 8. Hu K, Liu D, Niemann M, et al. Methods for assessment of left ventricular systolic function in technically difficult patients with poor imaging quality. J Am Soc Echocardiogr 2013;26:105-113. 9. Foley T, Mankad, S, Anavekar N, et al. Measuring left ventricular ejection fraction – techniques and potential pitfalls. European Cardiology Review 2012;8:108.  10. Kim J, Zeng H, Ghadiyaram D, Lee S, Zhang L, Bovik AC. Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment. IEEE Signal Processing Magazine 2017;34:130-141.  11. Thaden JJ, Tsang MY, Ayoub C, et al. Association between echocardiography laboratory accreditation and the quality of imaging and reporting for valvular heart disease. Circulation. Cardiovascular imaging 2017;10. 12. Bremer, ML. Relationship of sonographer credentialing to Intersocietal Accreditation Commission echocardiography case study image quality. J Am Soc Echocardiogr 2016;29:43-38. 13. 
Bobbia X, Pradeilles C, Claret PG, et al. Does physician experience influence the interpretability of focused echocardiography images performed by a pocket device? Scand J Trauma Resusc Emerg Med 2015;23:52-52. 14. Tighe DA, Rosetti M, Vinch CS, et al. Influence of image quality on the accuracy of real time three-dimensional echocardiography to measure left ventricular volumes in unselected patients: a comparison with gated-SPECT imaging. Echocardiography 2007;24:1073-80. 15. Grossgasteiger M, Hien MD, Graser B, et al. Image quality influences the assessment of left ventricular function: an intraoperative comparison of five 2-dimensional echocardiographic  50 methods with real-time 3-dimensional echocardiography as a reference. Journal of Ultrasound in Medicine 2014;33:297-306. 16. Nagata Y, Kado Y, Onoue T, et al. Impact of image quality on reliability of the measurements of left ventricular systolic function and global longitudinal strain in 2D echocardiography. Echo Research and Practice 2018;5:27-39. 17. Cole GD, Dhutia NM, Shun-Shin MJ, et al. Defining the real-world reproducibility of visual grading of left ventricular function and visual estimation of left ventricular ejection fraction: impact of image quality, experience and accreditation. Int J Cardiovasc Imaging 2015;31:1303-1314. 18. Kusunose K, Haga A, Abe T, Sata M. Utilization of artificial intelligence in echocardiography. Circulation 2019;83:1623-1629. 19. Hoffmann R, Lethen H, Marwick T, et al. Analysis of interinstitutional observer agreement in interpretation of dobutamine stress echocardiograms. J Am Coll Cardiol 1996;27:330-336. 20. Ragland MM, Tak T. The role of echocardiography in diagnosing space-occupying lesions of the heart. Clinical Medicine & Research 2006;4:22-32. 21. Connolly K, Ong G, Kuhlmann M, et al. Use of the valve visualization on echocardiography grade tool improves sensitivity and negative predictive value of transthoracic echocardiogram for exclusion of native valvular vegetation. J Am Soc Echocardiogr 2019;32:1551-1557. 22. Yu EHC, Sloggett CE, Iwanochko RM, Rakowski H, Siu SC. Feasibility and accuracy of left ventricular volumes and ejection fraction determination by fundamental, tissue harmonic, and intravenous contrast imaging in difficult-to-image patients. J Am Soc Echocardiogr 2000;13:216-224.  51 23. Kurt M, Shaikh KA, Peterson L, et al. Impact of contrast echocardiography on evaluation of ventricular function and clinical management in a large prospective cohort. J Am Coll Cardiol 2009;53:802-810. 24. Mitchell C, Rahko PS, Blauwet LA, et al. Guidelines for performing a comprehensive transthoracic echocardiographic examination in adults: Recommendations from the American Society of Echocardiography. J Am Soc Echocardiogr 2019;32(1):1-64.  25. Cardim N, Dalen H, Voigt JU, et al. The use of handheld ultrasound devices: a position statement of the European Association of Cardiovascular Imaging (2018 update). Eur Heart J Cardiovasc Imaging. 2019;20(3):245-252.  26. Sanfilippo AJ, Bewick D, Chan KL, et al. Guidelines for the provision of echocardiography in Canada: recommendations of a joint Canadian Cardiovascular Society/Canadian Society of Echocardiography Consensus Panel. Can J Cardiol 2005;21:763. 27. Labovitz AJ, Noble VE, Bierig M, et al. Focused cardiac ultrasound in the emergent setting: A consensus statement of the American Society of Echocardiography and American College of Emergency Physicians. J Am Soc Echocardiogr 2010;23:1225-1230. 28. Arntfield RT, Millington SJ. 
Point of care cardiac ultrasound applications in the emergency department and intensive care unit--a review. Curr Cardiol Rev 2012;8:98–108.  29. Chamsi-Pasha MA, Sengupta PP, Zoghbi WA. Handheld echocardiography: Current state and future perspectives. Circulation 2017;136:2178-2188. 30. Lewis K, McConnell M, Azzam K. A systematic needs assessment for point of care ultrasound in internal medicine residency training programs. Can J Gen Intern Med 2017; 12.  52 31. Ailon J, Mourad O, Nadjafi M, Cavalcanti R. Point-of-care ultrasound as a competency for general internists: A survey of internal medicine training programs in Canada. Can Med Educ J 2016;7:e51-e69.  32. Adoption of Point-of-Care Ultrasound Is Outpacing Safeguards. Emergency Care Research Institute.https://www.ecri.org/components/HDJournal/Pages/Top_10_hazards_2020_No_2_POCUS.aspx?tab=2. Published 2019. Accessed November 20, 2020. 33. Luong CL, Ong K, Kaila K, Pellikka PA, Gin K, Tsang TSM. Focused cardiac ultrasonography: Current applications and future directions. J Ultrasound Med 2019;38:865-876.  34. AlEassa EM, Ziesmann MT, Kirkpatrick AW, Wurster CL, Gillman LM. Point of care ultrasonography use and training among trauma providers across Canada. Can J Surg. 2016;59:6–8.  35. Mirabel M, Celermajer D, Beraud A, Jouven X, Marijon E, Hagège AA. Pocket-sized focused cardiac ultrasound: Strengths and limitations. Arch Cardiovasc Dis 2015;108:197-205. 36. International expert statement on training standards for critical care ultrasonography. Intensive Care Med 2011;37:1077–1083. 37. Wong M, Staszewsky L, Volpi A, Latini R, Barlera S, Höglund C. Quality assessment and quality control of echocardiographic performance in a large multicenter international study: Valsartan in Heart Failure Trial (Val-HeFT). J Am Soc Echocardiogr 2002;15:293-307. 38. Boyd JS, LoPresti CM, Core M, et al. Current use and training needs of point-of-care ultrasound in emergency departments: A national survey of VA hospitals. Am J Emerg 2019;37:1794-1797.  53 39. Wenger J, Steinbach TC, Carlbom D, Farris RW, Johnson NJ, Town J. Point of care ultrasound for all by all: A multidisciplinary survey across a large quaternary care medical system. J Clin Ultrasound 2020;48:443-451. 40. Mosier JM, Malo J, Stolz LA, et al. Critical care ultrasound training: A survey of US fellowship directors. J Crit Care 2014;29:645-649. 41. Eisen LA, Leung S, Gallagher AE, Kvetan V. Barriers to ultrasound training in critical care medicine fellowships: A survey of program directors. Crit Care Med 2010;38:1978-1983. 42. LoPresti CM, Jensen TP, Dversdal RK, Astiz DJ. Point-of-care ultrasound for internal medicine residency training: A position statement from the alliance of academic internal medicine. Am J Med 2019;132:1356-1360. 43. Litjens G, Ciompi F, Wolterink JM, et al. State-of-the-art deep learning in cardiovascular image analysis. JACC Cardiovasc Imaging 2019;12:1549-1565. 44. Pavani SK, Subramanian N, Gupta MD, Annangi P, Govind SC, Young B. Quality metric for parasternal long axis b-mode echocardiograms. MICCAI 2012;7511:478-485 45. Abdi AH, Luong C, Tsang T, et al. Automatic quality assessment of echocardiograms using convolutional neural networks: Feasibility on the apical four-chamber view. IEEE Transactions on Medical Imaging 2017;36:1221-1230. 46. Abdi AH, Luong C, Tsang T, et al. Quality assessment of echocardiographic cine using recurrent neural networks: Feasibility on five standard view planes. MICCAI 2017;10435:302-310. 47. Liao Z, Girgis H, Abdi A, et al. 
On modelling label uncertainty in deep neural networks: Automatic estimation of intra- observer variability in 2D echocardiography quality assessment. IEEE Trans Med Imaging 2020;39:1868-1883.  54 48. Mazomenos EB, Bansal K, Martin B, Smith A, Wright S, Stoyanov D. Automated performance assessment in transoesophageal echocardiography with convolutional neural networks. MICCAI 2018;11073:256-264. 49. Van Woudenberg N, Liao Z, Abdi AH, et al. Quantitative echocardiography: Real-time quality estimation and view classification implemented on a mobile android device. POCUS MICCAI 2018;11042:74-81. 50. Snare SR, Torp H, Orderud F, Haugen BO. Real-time scan assistant for echocardiography. IEEE Trans Ultrason Ferroelectr Freq Control 2012;59:583-589. 51. Dodge S, Karam L. A study and comparison of human and deep learning recognition performance under visual distortions. ICCCN 2017:1-7.  52. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med 2018;15:e1002683.  53. Blaivas M, Arntfield R, White M. DIY AI, deep learning network development for automated image classification in a point‐of‐care ultrasound quality assurance program. J Am Coll Emerg Physicians Open 2020;1:124-131. 54. Chen C, Bai W, Davies RH, et al. Improving the generalizability of convolutional neural network-based segmentation on cmr images. Front Cardiovasc Med 2020;7:105. 55. Schmidt-Erfurth U, Sadeghipour A, Gerendas BS, Waldstein SM, Bogunović H. Artificial intelligence in retina. Prog Retin Eye Res 2018;67:1-29. 56. Kim M, Yun J, Cho Y, et al. Deep learning in medical imaging. neurospine 2019;16:657-668. 57. Do S, Song KD, Chung JW. Basics of deep learning: A radiologist's guide to understanding published radiology articles on deep learning. Korean J Radiol 2020;21(1):33-41.   55 58. Abdelhafiz D, Yang C, Ammar R, Nabavi S. Deep convolutional neural networks for mammography: Advances, challenges and applications. BMC Bioinformatics 2019;20(Suppl 11):281.  59. Sujit SJ, Gabr RE, Coronado I, Robinson M, Datta S, Narayana PA. Automated image quality evaluation of structural brain magnetic resonance images using deep convolutional neural networks. CIBEC 2018; 33-36. 60. Soffer S, Ben-Cohen A, Shimon O, Amitai MM, Greenspan H, Klang E. Convolutional neural networks for radiologic images: A radiologist's guide. Radiology. 2019;290:590-606.  61. Kumar A, Kugler J, Jensen T. Evaluation of trainee competency with point-of-care ultrasonography (POCUS): A conceptual framework and review of existing assessments. JGIM 2019;34:1025-1031. 62. Wiegers SE, Ryan T, Arrighi JA, et al. 2019 ACC/AHA/ASE advanced training statement on echocardiography (revision of the 2003 ACC/AHA clinical competence statement on echocardiography): A report of the ACC Competency Management Committee. Catheter Cardiovasc Interv 2019;94:481-505. 63. Rajpoot K, Grau V, Noble JA, Becher H, Szmigielski C. The evaluation of single-view and multi-view fusion 3D echocardiography using image-driven segmentation and tracking. Med Image Anal 2011;15:514-528.  64. Charron C, Templier F, Goddet NS, Baer M, Vieillard-Baron A, Group of Investigators of SAMU 92. Difficulties encountered by physicians in interpreting focused echocardiography using a pocket ultrasound machine in prehospital emergencies. Eur J Emerg Med 2015;22:17-22.  56 65. Huang M, Wang C, Chiang J, Liu P, Tsai W. 
Automated recognition of regional wall motion abnormalities through deep neural network interpretation of transthoracic echocardiography. Circulation 2020;142:1510-1520. 66. Hundley WG, Kizilbash AM, Afridi I, Franco F, Peshock RM, Grayburn PA. Administration of an intravenous perfluorocarbon contrast agent improves echocardiographic determination of left ventricular volumes and ejection fraction: Comparison with cine magnetic resonance imaging J Am Coll Cardiol 1998;32:1426-1432 67. Medvedofsky D, Mor-Avi V, Byku I, Singh A, Weinert L, Yamat M, et al. Three-dimensional echocardiographic automated quantification of left heart chamber volumes using an adaptive analytics algorithm: Feasibility and impact of image quality in nonselected patients. J Am Soc Echocardiogr 2017;30:879-885. 68. Gaudet J, Waechter J, McLaughlin K, et al. Focused critical care echocardiography: development and evaluation of an image acquisition assessment tool. critical care medicine 2016;44:e329-e335. 69. Chen Z, Lin W, Wang S, Xu L, Li L. Image quality assessment guided deep neural networks training. ArXiv 2017. 70. Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. ArXiv 2016. 71. Gamil ME, Fouad MM, Abd El Ghany MA, Hoffinan K. Fully automated CADx for early breast cancer detection using image processing and machine learning. ICM 2018:108-111. 72. Madani A, Arnaout R, Mofrad M, Arnaout R. Fast and accurate view classification of echocardiograms using deep learning. NPJ Digit Med 2018;1:6.   57 73. Araujo J, Born DG. Calculating percentage agreement correctly but writing its formula incorrectly. Behav Anal 1985;8:207-208.  74. Sim J, Wright CC. The kappa statistic in reliability studies: Use, interpretation, and sample size requirements. Physical Therapy 2005;85:257-268. 75. Shin SJ, You SC, Jeon H, et al. Style transfer strategy for developing a generalizable deep learning application in digital pathology. Comput Methods Programs Biomed 2021;198:105815-105815. 76. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing 2019.  77. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-174. 78. Olusanya O, Wong A, Kirk-Bayley J, Parulekar P. Incorporating point-of-care ultrasound into daily intensive care unit rounds: Another source of interruptions? J Intensive Care Soc 2018.   79. Moulson N, Fung A, Balthazaar S, et al. Artificial intelligence assessment of left ventricular volumes and function on POCUS imaging.  Can J Cardiol 2019;35:S3-S4. 80. Zhang Y, Chen J, Chang K, et al. Automatic breast and fibroglandular tissue segmentation in breast MRI using deep learning by a fully-convolutional residual neural network U-Net. Acad Radiol 2019;26:1526-1535. 81. Koehler S, Tandon A, Hussain T, et al. How well do U-Net-based segmentation trained on adult cardiac magnetic resonance imaging data generalise to rare congenital heart diseases for surgical planning? ArXiv 2020.  58 82. Onofrey JA, Casetti-Dinescu DI, Lauritzen AD, et al. Generalizable multi-site training and testing of deep neural networks using image normalization. Proc IEEE Int Symp Biomed Imaging 2019:348-351. 83. Cheplygina V, Pena IP, Pedersen JH, Lynch DA, Sorensen L, de Bruijne M. Transfer learning for multicenter classification of chronic obstructive pulmonary disease. IEEE J Biomed Health Inform 2018;22:1486-1496. 84. Zhang L, Wang X, Yang D, et al. 
Generalizing deep learning for medical image segmentation to unseen domains via deep stacked transformation. IEEE Trans Med Imaging 2020;39:2531-2540.
85. Diller G, Lammers AE, Babu-Narayan S, et al. Denoising and artefact removal for transthoracic echocardiographic imaging in congenital heart disease: utility of diagnosis specific deep learning algorithms. Int J Cardiovasc Imaging 2019;35:2189-2196.
86. Maleki F, Muthukrishnan N, Ovens K, Reinhold C, Forghani R. Machine learning algorithm validation: From essentials to advanced applications and implications for regulatory certification and deployment. Neuroimaging Clin N Am 2020;30:433-445.
87. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med 2019;17:195.
88. Buch VH, Ahmed I, Maruthappu M. Artificial intelligence in medicine: Current trends and future possibilities. Br J Gen Pract 2018;68:143-144.
89. Chen PC, Liu Y, Peng L. How to develop machine learning models for healthcare. Nat Mater 2019;18:410-414.
90. Ghorbani A, Ouyang D, Abid A, et al. Deep learning interpretation of echocardiograms. NPJ Digit Med 2020;3:1-10.
91. Aziz S, Bottomley J, Mohandas V, Ahmad A, Morelli G, Thenabadu S. Improving the documentation quality of point-of-care ultrasound scans in the emergency department. BMJ Open Qual 2020;9:e000636.
92. Council J. Data challenges are halting AI projects, IBM executive says. Available at: https://www.wsj.com/articles/data-challenges-are-halting-ai-projects-ibm-executive-says-11559035800.
93. Blaivas M, Blaivas L. Are all deep learning architectures alike for point-of-care ultrasound?: Evidence from a cardiac image classification model suggests otherwise. J Ultrasound Med 2020;39:1187-1194.
94. Blaivas L, Blaivas M. Are convolutional neural networks trained on ImageNet images wearing rose-colored glasses?: A quantitative comparison of ImageNet, computed tomographic, magnetic resonance, chest x-ray, and point-of-care ultrasound images for quality. J Ultrasound Med 2020.

Appendices

Appendix A: 14 standard echo image views

Parasternal long axis
Parasternal short axis – 4 levels
  Aortic valve
  Mitral annulus valve
  Mitral valve papillary muscle
  Apex
Apical
  2-chamber
  3-chamber
  4-chamber
  5-chamber
Subcostal
  4-chamber
  5-chamber
  Inferior vena cava
Suprasternal
Right ventricular inflow

Appendix B: Number of studies in the training set per image view

Image view                                              Number of studies
Parasternal long axis                                   390
Parasternal short axis – aortic valve level             401
Parasternal short axis – mitral annulus valve level     388
Parasternal short axis – papillary muscle level         187
Parasternal short axis – apex level                     63
Apical 2-chamber                                        335
Apical 3-chamber                                        283
Apical 4-chamber                                        359
Apical 5-chamber                                        128
Subcostal 4-chamber                                     172
Subcostal 5-chamber                                     29
Subcostal inferior vena cava                            218
Suprasternal                                            46
Right ventricular inflow                                131

Appendix C: Calculating unweighted kappa

Unweighted kappa formula:

\kappa = \frac{P_o - P_e}{1 - P_e}

The probability of observed agreement per cell is calculated as

p_{o_{ij}} = \frac{x_{ij}}{N}

where x_{ij} is the number of matched or unmatched ratings between the two observers for an outcome classification.
                         Model (j)
                 Poor       Fair       Good       Excellent    Row sum (p_o_i)
Expert (i)
  Poor           po1,1      po1,2      po1,3      po1,4
  Fair           po2,1      po2,2      po2,3      po2,4
  Good           po3,1      po3,2      po3,3      po3,4
  Excellent      po4,1      po4,2      po4,3      po4,4
Column sum (p_o_j)

Total observed agreement (P_o) is calculated as:

P_o = p_{o_{1,1}} + p_{o_{2,2}} + p_{o_{3,3}} + p_{o_{4,4}}

Let the probability of chance agreement per cell be denoted p_{e_{ij}} = p_{o_i} \times p_{o_j}, where p_{o_i} and p_{o_j} are the marginal totals of expert and model ratings for an outcome classification, respectively.

                         Model (j)
                 Poor       Fair       Good       Excellent
Expert (i)
  Poor           pe1,1      pe1,2      pe1,3      pe1,4
  Fair           pe2,1      pe2,2      pe2,3      pe2,4
  Good           pe3,1      pe3,2      pe3,3      pe3,4
  Excellent      pe4,1      pe4,2      pe4,3      pe4,4

Total chance agreement (P_e) is calculated as:

P_e = p_{e_{1,1}} + p_{e_{2,2}} + p_{e_{3,3}} + p_{e_{4,4}}

Appendix D: Calculating weighted kappa

\kappa_w = \frac{P_{o(w)} - P_{e(w)}}{1 - P_{e(w)}}

The overall weighted proportions of observed and expected agreement, P_o(w) and P_e(w), are calculated by taking the sum of the weighted proportions of observed agreement and expected agreement over all individual cells:

P_{o(w)} = \sum_{i,j} p_{o_{ij}}(w_{ij})

P_{e(w)} = \sum_{i,j} p_{e_{ij}}(w_{ij})

The terms p_{o_{ij}}(w_{ij}) and p_{e_{ij}}(w_{ij}) are the element-wise products of the linear weight matrix with the p_o and p_e matrices, respectively:

p_{o_{ij}}(w_{ij}) = p_{o_{ij}} \times w_{ij}

p_{e_{ij}}(w_{ij}) = p_{e_{ij}} \times w_{ij}

                         Model (j)
                 Poor     Fair     Good     Excellent
Expert (i)
  Poor           w1,1     w1,2     w1,3     w1,4
  Fair           w2,1     w2,2     w2,3     w2,4
  Good           w3,1     w3,2     w3,3     w3,4
  Excellent      w4,1     w4,2     w4,3     w4,4

Appendix E: Clinical indications for study sample transthoracic echocardiograms (N=106)

Clinical Indication                        Frequency (%)
Abnormal ECG                               3 (3)
Aortic aneurysm                            1 (1)
Arrhythmia                                 11 (10)
Cardiac mass/thrombus                      1 (1)
Cardiogenic shock                          1 (1)
Chemotherapy                               2 (2)
Chest pain                                 2 (2)
Click                                      2 (2)
Coronary artery disease                    5 (5)
Dyspnea                                    3 (3)
Embolic source                             1 (1)
Heart failure/cardiomyopathy               9 (8)
Hypertension                               3 (3)
Lung disease                               1 (1)
LV function                                45 (42)
LV hypertrophy                             1 (1)
LV obstruction                             1 (1)
Murmur                                     7 (7)
Palpitations                               8 (8)
Pericardial effusion                       4 (4)
Preoperative exam                          1 (1)
Pulmonary hypertension                     5 (5)
Stroke                                     3 (3)
Valve disease                              15 (14)

Appendix F: Contingency tables comparing model and expert ratings of image quality on cart-based vs. PoCUS images

(1) Cart-based ultrasound, Model vs. Expert 1

                           Expert 1
                 Poor   Fair   Good   Excellent   Total
Model Poor        10     20      4       6          40
      Fair         6     66    142      61         275
      Good         3     11    148     233         395
      Excellent    2      1     29     168         200
      Total       21     98    323     468         910

(2) PoCUS, Model vs. Expert 1

                           Expert 1
                 Poor   Fair   Good   Excellent   Total
Model Poor        19     53     12       1          85
      Fair        14    114    231      59         418
      Good         1     16    143     166         326
      Excellent    0      0     19      62          81
      Total       34    183    405     288         910

(3) Cart-based ultrasound, Model vs. Expert 2

                           Expert 2
                 Poor   Fair   Good   Excellent   Total
Model Poor        19     11      7       3          40
      Fair        18     87    135      35         275
      Good         5     36    164     190         395
      Excellent    2      2     29     167         200
      Total       44    136    335     395         910

(4) PoCUS, Model vs. Expert 2

                           Expert 2
                 Poor   Fair   Good   Excellent   Total
Model Poor        30     42     12       1          85
      Fair        30    166    174      48         418
      Good         1     28    162     135         326
      Excellent    0      1     17      63          81
      Total       61    237    365     247         910

Appendix G: Positive predictive value (PPV) of the BakeNeko model's image quality ratings on PoCUS vs. cart-based image data

[Figure: bar chart of PPV (probability, 0-1) for the poor, fair, good, and excellent classes, for each rater comparison: Cart vs. Expert 1, Cart vs. Expert 2, PoCUS vs. Expert 1, PoCUS vs. Expert 2]
Appendix H: Negative predictive value (NPV) of the BakeNeko model's image quality ratings on PoCUS vs. cart-based image data

[Figure: bar chart of NPV (probability, 0-1) for the poor, fair, good, and excellent classes, for each rater comparison: Cart vs. Expert 1, Cart vs. Expert 2, PoCUS vs. Expert 1, PoCUS vs. Expert 2]

Appendix I: Sensitivity of the BakeNeko model's image quality ratings on PoCUS vs. cart-based image data

[Figure: bar chart of sensitivity (probability, 0-1) for the poor, fair, good, and excellent classes, for each rater comparison: Cart vs. Expert 1, Cart vs. Expert 2, PoCUS vs. Expert 1, PoCUS vs. Expert 2]

Appendix J: Specificity of the BakeNeko model's image quality ratings on PoCUS vs. cart-based image data

[Figure: bar chart of specificity (probability, 0-1) for the poor, fair, good, and excellent classes, for each rater comparison: Cart vs. Expert 1, Cart vs. Expert 2, PoCUS vs. Expert 1, PoCUS vs. Expert 2]
