UBC Theses and Dissertations


Weakly supervised landmark detection for automatic measurement of left ventricular diameter in videos… Sahebzamani, Ghazal 2021


Full Text


Weakly Supervised Landmark Detection for Automatic Measurement of Left Ventricular Diameter in Videos of PLAX from Cardiac Ultrasound

by

Ghazal Sahebzamani

B.Sc., The University of Tehran, 2018

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Applied Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

March 2021

© Ghazal Sahebzamani, 2021

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled "Weakly Supervised Landmark Detection for Automatic Measurement of Left Ventricular Diameter in Videos of PLAX from Cardiac Ultrasound", submitted by Ghazal Sahebzamani in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:

Purang Abolmaesumi, Electrical and Computer Engineering (Supervisor)
Robert Rohling, Electrical and Computer Engineering (Supervisory Committee Member)
Jane Wang, Electrical and Computer Engineering (Additional Examiner)

Abstract

Being an easily accessible, cheap, and low-risk modality, cardiac ultrasound serves as one of the standard and most commonly used protocols for cardiac assessment at the point of care. A comprehensive analysis of the structure and function of the heart involves a number of measurements, and by evaluating them, a cardiologist is able to determine the overall condition of the heart. One of these measurements tracks the changes of the left ventricle over a full cardiac cycle to interpret the left ventricle's efficiency in pumping blood. Specifically, the left ventricle's diameter is measured in two phases of the heart: end-diastole, when the heart expands to its maximum size and the left ventricle fills with all the blood it can hold, and end-systole, when the left ventricle contracts and pumps the blood out towards the body.
Currently, this assessment is conducted manually by sonographers, who measure the diameter on those two frames. However, there are many challenges to this procedure. Foremost, the process of labeling ultrasound videos is highly expensive and time-consuming. Additionally, this type of labeling requires professional background knowledge of the heart's structure. This is clearly a drawback, as many ultrasound probes are shifting towards hand-held devices meant to be accessible to users with various levels of expertise. Even among professionals, there is a high amount of labeling error, partly due to the noisy nature of ultrasound images. These labeling errors can easily lead to wrong diagnoses, such as reporting an abnormal patient's condition as normal. Therefore, in an attempt to mitigate these challenges, this thesis proposes a deep neural network-based model to automate the detection of the landmarks corresponding to the diameter of the left ventricle. The dataset used for this work is sparse in the temporal dimension, with annotations available only on end-diastolic and end-systolic frames, while the final goal of the model is to provide meaningful measurements across the entire cardiac cycle. The proposed network achieves a 10.08% average ejection fraction error and a 12.18% mean percentile error, which is satisfactory based on the requirements of this project.

Lay Summary

The length of the left ventricle, and how it changes throughout a cardiac cycle, is indicative of the health of the heart in clinical studies. Currently, this measurement is computed through manual labeling by experienced sonographers. This labeling procedure is highly costly, time-consuming, and prone to inter-observer and intra-observer errors. It also requires professional background knowledge of the heart, which limits the usage of this modality among users with different levels of expertise.
In order to overcome these challenges, this thesis proposes a method to automate the labeling of the keypoints required for assessing left-ventricular function based on one of the views of cardiac ultrasound: the parasternal long axis view, or PLAX in short, which is one of the main views in ultrasound exams and the easiest one for a less experienced user to acquire.

Preface

An extension of this work has been utilized in a publication resulting from a collaboration of multiple researchers, professors, sonographers, and cardiologists from the University of British Columbia, the Department of Electrical and Computer Engineering, and Vancouver General Hospital. This paper is available on arXiv.org and has been accepted to The 2021 Conference on Computer Vision and Pattern Recognition (CVPR 2021).

This thesis has been conducted under the approval of the medical research ethics board, certificate number H19-00132. The idea of the proposed approach was formed in light of instructive suggestions and under the consistent guidance of Prof. Purang Abolmaesumi. The author has also sought technical advice from Fatemeh Taheri Dezaki many times during the process. This thesis has also been read and commented on by Pardiss Danaei.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
1 Introduction and Background
  1.1 Clinical Background
    1.1.1 Introduction to the Human Heart
    1.1.2 Echocardiogram
    1.1.3 PLAX 2D Measurements
    1.1.4 Fractional Shortening
    1.1.5 Neural Networks & Deep Learning
    1.1.6 Convolutional Neural Networks
    1.1.7 Sequential Neural Networks
    1.1.8 Automatic Landmark Detection in Echocardiograms
  1.2 Thesis Objective
    1.2.1 Contributions
  1.3 Thesis Outline
2 End-to-end Landmark Detection in Weakly Labeled Spatio-temporal PLAX Echocardiograms Using a Deep Neural Network
  2.1 Introduction
  2.2 Materials and Methods
    2.2.1 Dataset
    2.2.2 Sampling
  2.3 Network Architecture
  2.4 Loss
  2.5 Experiments
    2.5.1 Results
    2.5.2 Ablation Study
  2.6 Discussion and Conclusion
3 Discussion and Comparison with Reference Methods
  3.1 Introduction
  3.2 Methods
    3.2.1 Model 1: Sofka et al. [29]
    3.2.2 Model 2: Gilbert et al. [10]
    3.2.3 Results
  3.3 Discussion and Conclusion
4 Conclusion
  4.1 Contributions
  4.2 Future Work
Bibliography

List of Tables

Table 2.1 EF and FS error analysis for the proposed models
Table 2.2 Mean distance error for each landmark's predicted coordinates in pixels
Table 2.3 Mean distance error for each landmark's predicted coordinates in centimeters
Table 2.4 Evaluation metrics on the modified network
Table 3.1 Mean percent error for the three architectures (two reference methods proposed in [10, 29] plus our approach)
Table 3.2 Number of parameters for the three architectures (two reference methods proposed in [10, 29] plus our approach)

List of Figures

Figure 1.1 Various types of blood vessels
Figure 1.2 An overview of the heart chambers
Figure 1.3 The heart and its main vessels
Figure 1.4 The heart valves
Figure 1.5 The heart wall and its layers
Figure 1.6 Sarcomeres and how they change in contraction
Figure 1.7 Changes of the voltage across specialized cardiac cells in the SA node during impulse generation
Figure 1.8 Diagram of action potentials in myocardial contractile cells
Figure 1.9 ECG diagram of a normal heart in one cycle
Figure 1.10 An example of PLAX view from cardiac ultrasounds
Figure 1.11 LVID measurement in PLAX
Figure 1.12 Biplane disk summation for the biplane method of discs (modified Simpson's rule)
Figure 1.13 An overview of BiLSTM architecture
Figure 2.1 The base architecture for the proposed method
Figure 2.2 Substituting one-directional LSTMs with bidirectional LSTMs in the proposed architecture
Figure 2.3 Adding the LVD constraint to the architecture
Figure 2.4 The train and validation loss for all variations of the proposed model
Figure 2.5 The train and validation loss for all variations of the proposed model during the latter epochs, for a clear view
Figure 2.6 P-value of pairwise t-test between the proposed models for EF and FS errors
Figure 2.7 P-value of pairwise t-test between the proposed models' errors in prediction of the landmarks' coordinates (in pixels)
Figure 2.8 P-value of pairwise t-test between the proposed models' errors in prediction of the landmarks' coordinates (in centimeters)
Figure 2.9 Histograms of EF and FS errors for all models
Figure 2.10 Boxplots of EF and FS errors for all models
Figure 2.11 Scatter plots of predicted EF and FS vs. ground truth for all models
Figure 2.12 Histogram of distance errors of the landmarks' predicted coordinates in pixels for all models
Figure 2.13 Boxplot of distance errors of the landmarks' predicted coordinates in pixels for all models
Figure 2.14 Histogram of distance errors of the landmarks' predicted coordinates in centimeters for all models
Figure 2.15 Boxplot of distance errors of the landmarks' predicted coordinates in centimeters for all models
Figure 2.16 Predicted landmarks for ED and ES frames of some random test samples
Figure 2.17 Predicted landmarks for ED and ES frames of some random test samples
Figure 2.18 Predicted landmarks for ED and ES frames of some random test samples
Figure 2.19 Substitution of LSTMs with dense layers in the original model for an ablation study
Figure 2.20 P-value of pairwise t-test between the modified models and the best proposed model for errors in the landmarks' coordinate predictions (in centimeters)
Figure 2.21 Scatter plots of EF errors for the modified models having dense layers instead of sequence layers
Figure 2.22 Histogram and boxplots of EF and FS errors for the modified models having dense layers instead of sequence layers
Figure 2.23 Distribution of distance errors of the landmarks' predicted coordinates in pixels for the modified models having dense layers instead of sequence layers
Figure 2.24 Distribution of distance errors of the landmarks' predicted coordinates in centimeters for the modified models having dense layers instead of sequence layers
Figure 3.1 The first architecture proposed by the reference paper in [29]
Figure 3.2 The second architecture proposed by the reference paper in [29]
Figure 3.3 The architecture proposed by the reference paper in [10]
Figure 3.4 The quality distribution of the dataset used for this thesis
Glossary

3D: Three-dimensional
2D: Two-dimensional
BiLSTM: Bidirectional Long Short-Term Memory
CNN: Convolutional Neural Network
DL: Deep Learning
ED: End-diastole
EDV: End Diastolic Volume
EF: Ejection Fraction
ES: End-systole
ESV: End Systolic Volume
FCN: Fully Convolutional Networks
FS: Fractional Shortening
GB: Gigabyte
GPU: Graphics Processing Unit
LV: Left Ventricle
LVD: Left Ventricular Diameter
LVEF: Left Ventricle Ejection Fraction
LVID: Left Ventricular Interior Diameter
LSTM: Long Short-Term Memory
ML: Machine Learning
MRI: Magnetic Resonance Imaging
PLAX: Parasternal Long Axis
RNN: Recurrent Neural Network
SV: Stroke Volume
UniLSTM: Unidirectional Long Short-Term Memory
US: Ultrasound

Chapter 1

Introduction and Background

1.1 Clinical Background

1.1.1 Introduction to the Human Heart

The human heart is a muscular organ that receives oxygen-rich blood from the lungs and pumps it out, so that oxygen and nutrients get delivered to the cells all over the body through a network of blood vessels. It then receives the blood containing carbon dioxide back from the cells and sends it to the lungs, where the carbon dioxide is removed through exhalation.

This organ is about the size of a clenched fist, weighs about 8 to 12 ounces, and is located between the lungs, inside the chest cavity (thoracic cavity), behind the ribs and posterior to the breastbone (sternum), slightly tilted towards the left. The heart beats around 100,000 times per day and pumps approximately five liters of blood per minute [18].

This network of blood vessels consists of:

• Arteries: Blood vessels that move blood away from the heart. They usually carry oxygenated blood and deliver it to all parts of the body (except the pulmonary arteries, which receive deoxygenated blood from the heart and bring it to the lungs). The arteries branch into smaller vessels called arterioles.

• Veins: Blood vessels that bring blood back to the heart. They usually carry deoxygenated blood from the body towards the heart (except the pulmonary veins, which carry oxygenated blood from the lungs towards the heart). Veins branch into smaller vessels called venules.

• Capillaries: Small vessels that exchange oxygen, carbon dioxide, and other nutrients, waste, and water with tissues through their thin walls. These vessels connect arterioles to venules and exchange oxygen and other substances with body tissues through two processes called passive diffusion and pinocytosis (Figure 1.1).

Figure 1.1: Various types of blood vessels. (This image is a work of the National Institutes of Health, part of the United States Department of Health and Human Services. As a work of the U.S. federal government, the image is in the public domain.)

The heart consists of four chambers (Figure 1.2):

• Right atrium (RA)
• Left atrium (LA)
• Right ventricle (RV)
• Left ventricle (LV)

Figure 1.2: An overview of the heart chambers. (From the Heart Foundation of New Zealand, found at www.heartfoundation.org.nz)

The atria are thin-walled chambers that receive blood from the veins, while the ventricles are thick-walled chambers below the atria that are filled by the blood pouring out of the atria and pump it out. The right atrium receives deoxygenated blood from the body through large veins called the inferior vena cava and the superior vena cava. When the right atrium contracts, it pours deoxygenated blood into the right ventricle. After the right ventricle is filled, it contracts to pump out the deoxygenated blood and deliver it to the lungs through the pulmonary artery. When the deoxygenated blood reaches the lungs, it can be re-oxygenated through inhalation. Then, this oxygen-rich blood passes through the pulmonary vein to enter the left atrium of the heart.
Next, the left atrium contracts to fill the left ventricle with the oxygenated blood, so that this blood can finally be pumped to the body once again via a rather forceful contraction.

The blood ejected from the left ventricle flows through the aorta, which is the largest artery in the body. The aorta begins at the left ventricle, extends upward in the chest for about two inches, and then forms an arch to descend towards the abdomen. The aorta consists of four sections: the ascending aorta, the aortic arch, the descending aorta, and the abdominal aorta. The heart itself receives its oxygen-rich blood supply from the coronary arteries, which branch off from the ascending aorta. Small arteries also branch off from other sections of the aorta to supply blood to the head, neck, arms, chest, ribs, and abdomen.

There are two main coronary arteries:

• Left Main Coronary Artery (LMCA): Supplies oxygen-rich blood to the left atrium and the left ventricle. This artery itself divides into two main branches, called the left anterior descending artery and the circumflex artery. The former supplies blood to the anterior wall and a part of the anterolateral wall of the left ventricle, as well as to most of the anterior ventricular septum, while the latter, the circumflex artery, supplies blood to the left ventricle's lateral wall. In a small percentage of the population (cases with left-dominant hearts), this artery gives rise to another branch called the posterior descending artery (PDA), which supplies blood to the lower and rear portions of the heart.

• Right Coronary Artery (RCA): Supplies oxygen-rich blood mainly to the right atrium and the right ventricle. The RCA divides into smaller branches, including the sinoatrial nodal branch, the right posterior descending artery (in the majority of people, who have right-dominant hearts), and the acute marginal artery.
The sinoatrial nodal branch supplies blood to the SA node and the AV node, which are responsible for regulating the heart rhythms. The marginal branch delivers blood to the lateral portion of the right ventricle.

The main vessels discussed above can be visualized in Figure 1.3.

Figure 1.3: The heart and the main vessels involved in the blood circulation of this organ. (Coronary.pdf: Patrick J. Lynch, medical illustrator; derivative work: Fred the Oyster; adaption and further labeling: Mikael Häggström. CC BY-SA 3.0, https://creativecommons.org/licenses/by-sa/3.0, via Wikimedia Commons)

The right and left sides of the heart are separated by a muscular wall called the septum. The left ventricle is larger than the right ventricle because it has to pump blood to various parts of the body instead of just the lungs, and therefore requires a stronger muscular structure.

There are also four valves in the heart; these valves open while a chamber is filling with blood, and close after the chamber's contraction in order to prevent any backflow of the blood that is leaving that chamber. These four valves are:

• Mitral valve (also called the bicuspid valve): located between the left atrium and the left ventricle
• Tricuspid valve: located between the right atrium and the right ventricle
• Aortic valve: located between the left ventricle and the aorta
• Pulmonic valve (also called the pulmonary valve): located between the right ventricle and the pulmonary artery

The valves sit on a ring-like, fibrous structure called the annulus, to which they are attached through a few flaps called leaflets or cusps. Figure 1.4 depicts the position of these valves relative to the other formerly discussed structures. The heart is surrounded by a fluid-filled sac called the pericardium, which serves as a protective layer. The heart wall consists of three layers: 1) the epicardium, 2) the myocardium, and 3) the endocardium, from the outermost to the innermost layer, respectively (Figure 1.5).
The contraction of the heart and the synchronization of the heartbeat are both enabled by the cardiac muscle in the myocardium. The cardiac muscle cells divide into two main categories:

1. Contractile cells (also called the working cells): They constitute around 99% of the cardiac muscle cells and are capable of contracting when triggered. In other words, when these cells receive an electrical impulse (an action potential), their fibers become shorter, and as a result, a contraction (systole) occurs in the muscle.

2. Specialized cardiac cells: These cells are capable of generating rhythmic electrical impulses (action potentials) on their own. This property is called self-excitability, or automaticity. Being able to transmit electrical impulses as well, these cells form the heart's conductive system, which initiates the trigger for the cardiac contractile cells to contract.

Figure 1.4: The heart valves and their locations relative to the main heart vessels and chambers. (Wapcaplet, CC BY-SA 3.0, http://creativecommons.org/licenses/by-sa/3.0/, via Wikimedia Commons)

The cardiac muscle's basic contractile unit is called the sarcomere, which consists of two kinds of contractile protein filaments: actin, forming thin filaments, and myosin, forming thick filaments (Figure 1.6). These filaments slide past one another to generate or propagate the mechanical force required for contraction. The repeating patterns of sarcomeres form fine contractile fibers called myofibrils.

Figure 1.5: The heart wall and its layers. (By [4], CC BY 3.0, via Wikimedia Commons)

The specialized cardiac cells have relatively few myofilaments and no organized sarcomeres, and therefore do not contribute to contraction. The cardiac muscle cells are attached to each other by structures called intercalated discs. Three types of cell junctions compose intercalated discs: fascia adherens, desmosomes, and gap junctions.
Specifically, action potentials can be transmitted between adjacent cells directly through gap junctions, via an exchanged flow of ions, resulting in a rapid spread of electrical depolarization throughout the muscle. A desmosome, on the other hand, is a cell structure that fastens the ends of cardiac muscle fibers together so that they do not pull apart under the stress of contracting fibers.

How do specialized cells generate electrical impulses?

A series of intracellular chemical events coordinates the generation of action potentials. In the specialized cells, there are multiple types of channels, each of which is permeable to specific ions such as Na+, K+, and Ca2+. All cells in the body have an abundance of ions inside and outside the cell.

Figure 1.6: An illustration of sarcomeres, consisting of actin and myosin filaments, and how they change in contraction. (By [26], CC BY-SA 3.0, http://creativecommons.org/licenses/by-sa/3.0, via Wikimedia Commons)

The cell voltage is determined by these ions as they move through small channels in the membrane along the concentration gradient between the cell's inside and outside environments. The ions can also sometimes move against the concentration gradient by expending energy. For instance, all cells have sodium–potassium pumps that are powered by adenosine triphosphate (ATP), a substance providing energy for many processes in the body. These pumps use the energy of each ATP molecule to move three Na+ ions out of the cell and bring two K+ ions into the cell. However, specialized cardiac cells have some unusual channels called HCN channels (hyperpolarization-activated cyclic nucleotide-gated) that open at a membrane potential of -50 mV or less. Once these channels become activated, an inward flux of Na+ rushes into the cell, causing a slow depolarization.
This inward current is referred to as the funny current, due to the unusual behavior of these HCN channels, which are exclusive to limited types of neurons and to the specialized cardiac muscle cells. Then, owing to this slow rise in voltage, as the membrane potential hits -40 mV, a number of voltage-gated Ca2+ channels open, causing an inward flow of Ca2+ into the cell and a steeper rise in the membrane potential. Once the membrane potential reaches around +10 mV, the same voltage-gated Ca2+ channels close, but another class of voltage-gated channels, permeable to K+, opens up, pouring K+ ions out of the cell. As a result, the membrane potential decreases again until it stops at around -60 mV, which causes the same K+ channels to close again. This process repeats itself again and again, generating cyclic action potentials (Figure 1.7).

Figure 1.7: The diagram of the changes of the voltage across specialized cardiac cells in the SA node during impulse generation. (OpenStax College, CC BY 3.0, https://creativecommons.org/licenses/by/3.0, via Wikimedia Commons)

The cardiac specialized cells exist in the following structures:

• SA node: The sinoatrial (SA) node is a group of specialized cells located at the junction of the upper wall of the right atrium and the opening of the superior vena cava. In healthy people, these cells generate action potentials at a faster rate than the other specialized cells, producing about 60 to 80 action potentials per minute. As the fastest rate determines the heartbeat, this node usually coordinates the rate of the heartbeat, and therefore its cells are also referred to as pacemaker cells. However, in patients whose SA node function is disturbed, other nodes such as the atrioventricular (AV) node will take over.
More specifically, the intrinsic depolarization rates of the specialized cells can be arranged in the following order: AV node > bundle of His > Purkinje fibers, all of which are discussed in the items below. Upon each action potential generated by the SA node, the impulse spreads throughout the right atrium to send contraction signals to its myocardial contractile cells, and afterwards reaches the next node, called the AV node. As nodal cells are coupled to all types of cardiac cells, action potentials propagate to all conducting cells and muscle cells, resulting in a contraction. The electrical impulse travels through three main pathways of conductive myocardial cells in the atria: the anterior, middle, and posterior tracts. Additionally, another tract of conductive myocardial cells stems from the anterior path and extends out towards the left atrium; it is called Bachmann's bundle. This bundle, also known as the interatrial bundle, is a cluster of parallel myocardial strands on the inner walls of the atria that connects the right atrium to the left atrium, and through it the depolarization travels from the right atrium to the left atrium. As a result of this depolarization, the contractile cells start contracting from the superior to the inferior segments of the atria, causing them to pour their blood into the ventricles beneath them. Both the SA node and Bachmann's bundle usually receive their oxygen supply from the right coronary artery.

• AV node: This node contains a cluster of other specialized conductive cardiac cells and is located in the inferior wall of the right atrium, in the atrioventricular septum, close to the opening of the coronary sinus. The job of this node is to insert a delay in the propagation of the electrical impulse from the atria to the ventricles, long enough (approximately 120 ms) to ensure all of their blood is completely poured into the ventricles before being pumped out by the ventricular contraction.
Moreover, other impulses reaching the septum directly will not be able to pass to the ventricles unless they go through the AV node. This delay is also long enough for the atria to finish their contraction and relax (atrial diastole).

• Bundle of His: This path includes a continuation of specialized conductive cardiac myofibers, located in the upper part of the interventricular septum, that pass the electrical impulse from the AV node down through the septum.

• Bundle branches: At the bottom of the bundle of His, the path splits into right and left branches, carrying the impulse to the right and left ventricle, respectively.

• Purkinje fibers: These fibers are additional paths that are also composed of specialized conducting cells and extend through the subendocardial surface of the ventricle walls, from the apex of the heart toward the atrioventricular septum and the base of the heart. As they have extensive gap junctions, they show a fast conduction rate, which allows the rapid transmission of the impulse to the myocardium of the ventricles.

Furthermore, the heart rate can be modulated by autonomic nervous stimulation. The autonomic nervous system is a part of the peripheral nervous system, controlling the involuntary movement of smooth muscles and glands. There are special receptors in the SA node that can control the SA node's firing rate by modifying the behavior and the activation duration of the channels.

The action potential in myocardial contractile cells is slightly different. These cells have a resting potential of about -80 mV. As discussed before, they receive their action potentials from the specialized cells, through gap junctions. When the action potential reaches the cell, voltage-gated Na+ channels open and an influx of Na+ occurs, resulting in a depolarization phase. Then, at about +20 mV, a brief rapid repolarization follows from the opening of fast K+ channels.
In phase 2, the repolarization slows when a series of Ca2+ channels open, causing an influx of Ca2+ ions. This influx balances the outflow of K+ ions, producing a plateau phase. The plateau provides a refractory period, preventing the cell from being restimulated, which could otherwise result in an incomplete contraction. Next, the Ca2+ channels close again, while slow K+ channels open instead, resulting in rapid repolarization through the open K+ channels. Finally, the potential returns to the resting state by means of the Na-K ATP pumps (Figure 1.8).

The net electrical activity of the heart can be monitored using several electrodes placed on the skin, producing a sequential diagram called an electrocardiogram (also known as an ECG or EKG). Each cycle of this diagram represents a heartbeat consisting of three phases, each corresponding to the depolarization or repolarization of a specific part of the heart. These phases are: the P wave, which corresponds to the depolarization of the atria; the QRS complex, which represents the depolarization of the ventricles; and the T wave, which indicates the repolarization of the ventricles (Figure 1.9).

Figure 1.8: Diagram of action potentials in myocardial contractile cells. (From "Anatomy & Physiology", provided by OpenStax CNX, licensed under CC BY 4.0, https://creativecommons.org/licenses/by/4.0/)

Figure 1.9: ECG diagram of a normal heart in one cycle. (Created by Agateller (Anthony Atkielski), converted to SVG by atom; public domain, via Wikimedia Commons)

1.1.2 Echocardiogram

An echocardiogram (or echo, in short) is a type of non-invasive test that uses ultrasound waves to image the heart and show how effectively all the chambers and the valves work. This method can also capture movements across time. In this method, a transducer (also called a probe) transmits high-frequency sound waves, in the ultrasound range, into the body.
Whenever these waves hit a boundary, a part of them gets reflected and travels back to the probe, while the remaining portion penetrates deeper into the body to reach further boundaries. By measuring the time between sending and receiving each pulse (while knowing the speed of sound in tissue), the machine can calculate the depth of each boundary and map them to an image. The ultrasound waves are generated using one or several quartz crystals. According to the piezoelectric effect, when electrical currents are applied to piezoelectric crystals (such as quartz), their size changes rapidly, leading to vibrations that form sound waves. Conversely, these crystals are also capable of generating electrical currents in the presence of pressure. Therefore, they can be used as both transmitter and receiver at the same time. In order to prevent back reflection from the probe, a sound-absorbing substance can be utilized. Additionally, acoustic lenses can help with focusing the waves. By viewing echoes, cardiologists can assess the shape, size, thickness, and function of different heart structures, or the effectiveness of specific cardiac treatments. A variety of heart conditions can be diagnosed using echoes, such as heart murmurs, damage to the heart muscle following a heart attack, or infections in the heart.
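The pulse-echo depth calculation described above reduces to depth = c·t/2, where t is the round-trip time and c is an assumed constant speed of sound in soft tissue (commonly taken as about 1540 m/s, the standard simplification used by ultrasound machines). A minimal sketch:

```python
# Pulse-echo depth estimation: depth = c * t / 2.
# Assumes a constant speed of sound in soft tissue (~1540 m/s).
SPEED_OF_SOUND_TISSUE = 1540.0  # m/s

def echo_depth_m(round_trip_time_s: float) -> float:
    """Depth of a reflecting boundary from the round-trip pulse time.

    The factor of 2 accounts for the pulse travelling to the boundary
    and back to the probe.
    """
    return SPEED_OF_SOUND_TISSUE * round_trip_time_s / 2.0

# An echo received 130 microseconds after transmission corresponds
# to a boundary roughly 10 cm deep.
depth = echo_depth_m(130e-6)
```
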
Examples of heart diseases that can be detected by echo include:

• Atherosclerosis: a buildup of fatty substances in the arteries, forming plaque on the walls, which could lead to wall motion abnormalities, blood pumping malfunction, and the formation of blood clots in the long run

• Cardiomyopathy: a condition in which the heart muscle loses its ability to pump blood efficiently, which can potentially lead to heart failure

• Congenital heart disease: also known as a congenital heart defect, a problem in the structure of the heart that occurs during the formation of a fetus

• Heart failure: a condition in which the heart becomes so weak, stiff, or damaged that it cannot pump blood efficiently. Additionally, the heart muscle may not relax properly to fully receive the flow of blood back from the lungs, which might lead to fluid buildup (congestion) in the vessels and lungs and swelling (edema) in other parts of the body, such as the feet.

• Aneurysm: a condition where a part of the wall of an artery weakens and develops a large bulge, which is at risk of rupturing, leading to internal bleeding

• Heart valve disease: when one or more heart valves malfunction, which may introduce abnormalities in the blood flow; in severe cases, valvular heart disease might lead to an enlarged heart or heart failure.
Valve disease takes many forms, such as stenosis (narrowing of the valve opening), prolapse (when the flaps of a valve do not close properly, and one or several flaps bulge upward into the atrium), and regurgitation (leaking through the heart valves).

• Cardiac tumor: an abnormal growth on the heart surface, myocardium, or heart valves

• Pericarditis: an inflammation or infection of the pericardium

• Atrial or septal wall defects: a hole between the atria or the two ventricles that allows blood to pass from the left side to the right side; this condition might put the patient at risk of heart failure, poor blood flow, or stroke.

Generally speaking, there are two main types of echo based on the acquisition: transthoracic echocardiogram (TTE) and transesophageal echocardiogram (TEE). TTE is the most common type of echo, in which the probe moves across the front of the patient's chest. In TEE, however, a thin, flexible transducer is passed through the patient's esophagus to provide a closer view of the heart structures without having to pass through the chest bones. There are also various modalities of ultrasound:

• A-mode (amplitude mode): A simple mode that gives one-dimensional representations. The vertical axis corresponds to the magnitude of the reflected wave, while the horizontal axis displays time. This mode is currently obsolete.

• B-mode: The most common echo modality, capturing static or real-time cross-sectional images of the heart. The 2D plane is formed by a transducer consisting of a number of crystal elements lined up next to each other in the form of a curve or a straight line. These elements can be fired at once (sequential array) or group by group (phased array), each with a small delay, to steer or focus the beam into a preferred shape or angle.
In this mode, the brightness of the points is indicative of the amplitude of the signal, and the position along the vertical line depends on the round-trip travel time of the wave back to the probe.

• M-mode (motion mode): A mode suitable for monitoring the movement of specific tissues across time. In this mode, a single ultrasound beam is sent into the body through a stationary probe, held at a fixed desired location, generating a continuous signal on a moving paper strip.

• 3D echo: Refers to a volumetric image of the heart, achieved by capturing multiple 2D images from various angles and putting them together to form a 3D reconstruction. It can also be acquired in real time, allowing the cardiologist to view a 3D model of their patient's beating heart.

• Doppler echo: This mode can display the flow of blood through the heart vessels, valves, and chambers, and can also show other moving structures such as a fetal heartbeat. In this mode, the shift in the ultrasound wave frequency is proportional to the velocity of the moving object, such as blood. When cardiologists decide to assess the blood flow through the heart's chambers and valves, they use a technique called Doppler echocardiography. It works by measuring sound waves reflected from moving objects, in this case red blood cells. A color Doppler test uses a computer to change the sound waves into different colors in order to show the speed and direction of blood flow in real time. There are three types of Doppler echo:

1. Continuous-wave Doppler: The cardiologist listens to the sounds made by the transducer to detect a change of pitch that indicates a narrowing or blockage in the blood flow stream.

2. Duplex Doppler: A B-mode transducer generates 2D images of the tissues, and then a Doppler probe delivers flow information. The Doppler information is translated into a visual graph showing velocity and direction in the same image containing the 2D scans of the structure.

3.
Color Doppler: This mode also combines B-mode with Doppler imaging, except that the Doppler information, such as velocity and direction, is color-coded and overlaid on the B-mode scan.

Echocardiographic Tomographic Views

In the standard 2D TTE echo, five cross-sectional windows are essential:

1. Left parasternal, which consists of two views called the parasternal long-axis view (PLAX) and the parasternal short-axis view (PSAX). PLAX is the most straightforward view for a less experienced user to acquire.

2. Apical, consisting of apical four-chamber (A4C), apical five-chamber (A5C), apical two-chamber (A2C), and apical three-chamber (A3C)

3. Subcostal, consisting of the subcostal four-chamber view (SC4C) and the subcostal long-axis inferior vena cava view (SC IVC)

4. Suprasternal notch, which has the aortic arch view

5. Right parasternal window, consisting of the ascending aorta view.

In this thesis, the focus is on PLAX views. An example of this view can be seen in Figure 1.10.

Figure 1.10: An example of a PLAX view from cardiac ultrasound.

PLAX 2D Measurements

The standard measurements extracted from PLAX views are:

• The thickness of the interventricular septum (IVS)

• The thickness of the left ventricle anterior wall (LVAW)

• The left ventricular internal diameter (LVID) for LV size measurement. This thesis may refer to this quantity interchangeably as LVD, standing for left ventricle diameter.

• The left ventricle posterior wall thickness (LVPW).

Some other measurements have also been defined based on LV size, including ejection fraction (EF) and left ventricular fractional shortening (LVFS, or FS in short). In this thesis, the focus is on LV size measurement based on LVID and its corresponding parameters (EF and FS).

LVID

The protocol for this measurement is to freeze the cine loop on the ED and ES frames and measure the internal LV diameter along the diagonal line, perpendicular to the LV long axis, in both frames (Figure 1.11). These measurements are denoted LVIDd and LVIDs, respectively.
In our dataset, two landmarks are annotated for this task to represent the starting and ending points of the diagonal lines. These landmarks are placed on the anteroseptal wall (the upper landmark) and the inferolateral wall (the bottom landmark).

Figure 1.11: LVID measurement in PLAX. The green line denotes the left ventricular internal diameter, which connects the anteroseptal wall to the inferolateral wall (From [15]).

Ejection Fraction

Ejection fraction (EF) is a measurement representing the ability of the heart to pump blood. This parameter is known as the predominant measurement for global systolic function analysis. Although this quantity can be measured for the right ventricle, it is commonly measured for the left ventricle to indicate the percentage of blood leaving the heart and flowing into the body after each left ventricular contraction. Specifically, EF is defined as:

EF(%) = (SV / EDV) × 100,    (1.1)

where SV is the stroke volume, defined as:

SV = end-diastolic volume (EDV) − end-systolic volume (ESV).    (1.2)

An EF from 55% to 70% is considered normal, while an EF from 41% to 54% is deemed slightly below normal. An EF of 40% or less is considered reduced and could be an indicator of moderate to severe heart failure and of conditions such as cardiomyopathy, coronary artery disease, and heart valve disease. Similarly, an EF of 75% or more is also abnormal and could be an indicator of a condition called hypertrophic cardiomyopathy.

EF can be measured using various types of tests, such as echo, magnetic resonance imaging (MRI), and nuclear medicine scans. According to the EF definition, accurate measurement of this quantity depends on reliable estimation of the left ventricular volume; therefore, MRI is generally considered the gold-standard method for EF measurement due to its utilization of tomographic techniques [9, 14].
However, there are robust methods to estimate the left ventricle's 3D volume from the information accessible in 2D ultrasound views, in combination with prior medical knowledge provided by researchers. As a result, despite the variability of EF estimation from echo, this method remains a powerful tool, offering EF estimation in a cheap, accessible, and non-invasive setting.

LVEF Estimation in Echo

There are various methods to estimate EF in echo using different types of scans (e.g., M-mode, two-dimensional, and three-dimensional). Each of these methods might also depend on various types of measurements extracted from echo images, such as specific lengths (one-dimensional), areas (two-dimensional), or volumetric measurements (three-dimensional). The method recommended by the American Society of Echocardiography for LVEF estimation in echo is the biplane method of disks (also called the modified Simpson method). In this method, the left ventricle's endocardial cavity is identified in both the A2C and A4C views, and then, in each view, this border is divided into a number of disks. Assuming each of these disks is an elliptical cross-section of a short cylinder with a small, constant height (the distance between the disks), the left ventricular volume can be estimated by summing up the volumes of these small cylinders. For this estimation, the major and minor radii of each cylinder's cross-section are determined by measuring the lengths of that disk in the two views (Figure 1.12).

Figure 1.12: Biplane disk summation for the biplane method of disks (modified Simpson's rule). This approximation is useful for estimating the LV's 3D volume based on 2D views and for calculating EF. (From the American Society of Echocardiography Recommendations for Cardiac Chamber Quantification in Adults)

The volumetric estimations should be calculated at both end-diastole (ED) and end-systole (ES) to calculate EF.
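The disk summation described above amounts to V = (π/4)·(L/n)·Σᵢ aᵢbᵢ, where aᵢ and bᵢ are the i-th disk's diameters measured in the two views, L is the long-axis length, and n is the number of disks. A rough sketch (not the thesis's method; the argument names are illustrative):

```python
import math

def biplane_disk_volume(a_diams, b_diams, long_axis_len):
    """Modified Simpson (biplane method of disks) LV volume estimate.

    a_diams, b_diams: disk diameters measured in the two apical views
    (same number of disks n in each); long_axis_len: LV long-axis
    length. Each disk is modeled as an elliptical cylinder of height
    h = long_axis_len / n, so
    V = sum_i pi * (a_i / 2) * (b_i / 2) * h = (pi / 4) * h * sum_i a_i * b_i.
    """
    assert len(a_diams) == len(b_diams), "both views must use the same disk count"
    n = len(a_diams)
    h = long_axis_len / n  # constant disk height
    return (math.pi / 4.0) * h * sum(a * b for a, b in zip(a_diams, b_diams))
```

As a sanity check, feeding equal diameters in both views reproduces the volume of a circular cylinder.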
Compared to other methods, this method imposes fewer assumptions on the LV geometry, which is more realistic, as the geometry can differ from person to person and can be affected by various conditions. However, in this thesis, we use one-dimensional EF estimation based on PLAX data. This method uses the Teichholz formula [30] to estimate the three-dimensional volume of the left ventricle from the one-dimensional left ventricular diameters (LVD), measured according to the following equations:

EDV = [7 / (2.4 + LVIDd)] × LVIDd³,    (1.3)

ESV = [7 / (2.4 + LVIDs)] × LVIDs³,    (1.4)

where EDV and ESV represent the end-diastolic and end-systolic volumes, respectively. It should be noted that this method relies on geometric assumptions about the heart, which may be inaccurate and invalid for abnormal cases, as it was derived from normal data. As a result, it is not recommended to use this method in clinical settings. Therefore, in this work, the LVEF measurement is merely used as a comparison tool to evaluate the errors in predictions and is by no means intended to serve as a valid metric to assess, diagnose, or draw any medical conclusions about the subjects.

1.1.4 Fractional Shortening

Another useful measurement in cardiac assessment is fractional shortening (FS), which is calculated from linear measurements extracted from 2D echo images. By definition, FS is calculated as follows:

LVFS(%) = [(LVIDd − LVIDs) / LVIDd] × 100.    (1.5)

However, this measure also remains a poor method for assessing global LV systolic function, as it reliably correlates with EF only in normal cases, where no regional wall motion abnormalities are present. The normal range for FS is considered to be from 25% to 45%.

1.1.5 Neural Networks & Deep Learning

Deep learning refers to a class of algorithms whose goal is to approximate an objective function by taking advantage of specific patterns occurring in rather large datasets.
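The one-dimensional measurements above, the Teichholz volumes (Equations 1.3–1.4) combined with the EF definition (Equation 1.1) and FS (Equation 1.5), amount to a few lines of arithmetic. A minimal sketch (diameters in cm, volumes in mL):

```python
def teichholz_volume(lvid_cm: float) -> float:
    """Teichholz LV volume (mL) from a single internal diameter (cm):
    V = 7 / (2.4 + D) * D^3."""
    return 7.0 / (2.4 + lvid_cm) * lvid_cm ** 3

def ejection_fraction(lvidd_cm: float, lvids_cm: float) -> float:
    """One-dimensional EF (%) via Teichholz volumes (Eqs. 1.1-1.4)."""
    edv = teichholz_volume(lvidd_cm)
    esv = teichholz_volume(lvids_cm)
    return (edv - esv) / edv * 100.0

def fractional_shortening(lvidd_cm: float, lvids_cm: float) -> float:
    """LVFS (%) = (LVIDd - LVIDs) / LVIDd * 100 (Eq. 1.5)."""
    return (lvidd_cm - lvids_cm) / lvidd_cm * 100.0
```

As noted above, these formulas rest on geometric assumptions derived from normal hearts, so they serve here only as comparison metrics, not clinical tools.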
In these algorithms, a large number of nodes are attached together according to a particular architecture to form a neural network. This naming stems from the fact that these models were initially inspired by the human brain. Each node of this network can introduce a type of nonlinearity to the flow of data, so that the network's final outputs effectively represent the result of a complex transformation applied to the input. The goal is to optimize the weights of the connections between the network nodes by minimizing a particular loss function. This loss function associates a cost with the distance between the desired outputs and the neural network's outputs. By iterating over a large number of samples and optimizing the weights based on the loss function, the model will finally converge to an optimum, where it supposedly acts as the true transformation of interest.

1.1.6 Convolutional Neural Networks

Convolutional neural network (CNN) refers to a class of neural networks suited for image inputs. In these networks, a series of convolutional filters is applied to an input image, each filter having learnable parameters. A large number of these convolutional layers are stacked together to form a CNN. Moving deeper into the network, the later filters have been shown to extract high-level semantic features, in contrast to the initial layers, which tend to focus on low-level features such as colors and edges. These networks are the core algorithms of the state-of-the-art solutions for many computer vision tasks, such as object detection, segmentation, and visual scene understanding.

1.1.7 Sequential Neural Networks

These networks are a class of models that take sequential data as input or output. Examples of sequential data include text, videos, and time series. A very common type of these networks is the recurrent neural network (RNN).
The main property of RNNs is that they use the output of the previous layers as input to the subsequent layers, which allows them to maintain a sort of memory. In traditional RNNs, each cell's output is directly connected to the immediately following cell. Therefore, as sequences grow longer, the gradients of the earlier cells start to vanish, and the network tends to remember only the most recent information from the sequence. To overcome this problem, long short-term memory networks (LSTM) were introduced [12], which are capable of learning long-term dependencies between items of the sequence. LSTMs fall into two general categories: one-directional (UniLSTM) and bi-directional (BiLSTM). In UniLSTMs, the output of each cell of the model depends on the outputs of all the cells before it, whereas in BiLSTMs, each cell's output depends on both future and past cells [11]. Figure 1.13 shows a diagram of the BiLSTM architecture. Precisely, a BiLSTM consists of two stacked LSTMs, one working on the original sequence and the other taking the reversed version of it as input. For the final output of each cell, the outputs of the two stacked one-directional LSTM cells are combined. Both UniLSTMs and BiLSTMs are used in this thesis for capturing the temporal dependency between the landmarks.

Figure 1.13: An overview of the BiLSTM architecture. (Figure from [6])

1.1.8 Automatic Landmark Detection in Echocardiograms

As discussed in Section 1.1.3, monitoring the changes of LVID throughout cardiac cycles and the analysis of EF play a crucial role in cardiac functional assessment and the diagnosis of numerous cardiac diseases. Currently, the standard way to make these assessments based on the PLAX view involves a sonographer going through the ED and ES frames of each patient's cine loops and labeling LVID on them manually. However, this procedure is time-consuming and expensive; thus, it imposes a heavy burden on the health-care system.
Additionally, the noisy nature of ultrasound images, along with the challenges of acquiring good-quality images, makes the measurements subject to high inter-observer and intra-observer variability [31], which degrades the reliability of the metrics and the outcomes of examinations, and thus the clinical decisions required for patients. Needless to say, the challenges of the labeling procedure demand experienced specialists with high levels of expertise and clinical background. This makes the assessments impossible to conduct for a typical, non-professional user. This limitation is at odds with the aim of recent hand-held ultrasound devices and the trend to popularize their usage among regular people. These challenges have given rise to the idea of delegating this labeling task to an automated method with high reliability.

Prior Works

Deep learning has piqued the interest of many researchers as a method of automatically predicting clinical measurements. Access to large training datasets has significantly improved the accuracy of predictions, especially for tasks such as segmentation and keypoint localization of anatomical structures. For pixel-level label predictions, recent methodologies often utilize fully convolutional neural networks (FCN) as their main component [1–3, 19, 20, 22, 27]. Many of the FCN-based approaches have channeled methodologies used in pose detection tasks [24, 32]. In these methods, to localize the desired structures, heatmaps corresponding to the region of interest are generated at some point in the network [23]. In [23], for the task of multiple landmark detection, a CNN architecture is used to combine the local appearance of a certain landmark with the spatial configuration of other landmarks.
In [10], a U-net structure is utilized for a problem similar to our work, where the objective is predicting some LV measurements (including LVID). Their proposed model predicts heatmaps corresponding to each of the landmarks involved in the measurements. It is important to note that these methods are crafted for datasets consisting of individual frames rather than temporal sequences. Time plays a vital role in calculating the clinical measurements of interest in this work, such as EF; hence, relying purely on the methods mentioned above will not be sufficient for our problem of interest, nor for other temporally constrained or real-time applications.

To tackle the limitations of the previously noted methods, Savioli et al. and Dezaki et al. [7, 28] use spatio-temporal models when dealing with sequential data, and particularly echo cine loops. In [29], a convolutional long short-term memory (CLSTM) network is used to improve temporal consistency, in conjunction with a center-of-mass layer placed on top of an FCN architecture to regress keypoints directly out of the heatmap predictions. Recurrent neural networks have also been applied in many works in the cardiac segmentation domain, such as [8]. Another study [16] first extracts multi-scale features using pyramid ConvBlocks, and these features are later aggregated via hierarchical ConvLSTMs. Some studies, such as [5, 13, 34], feed motion information to their network by estimating the motion vector between consecutive frames. In [25], for a similar weakly supervised problem, an optical flow branch is used to obtain a motion estimation. This estimation then enforces spatio-temporal smoothness for a segmentation task with sparse labels in the temporal dimension.
The downside of using optical flow estimation is that it might generate drastic errors in consecutive frames that are highly variable. This problem becomes prominent in ultrasound images, where the boundaries are fuzzy and frames contain considerable amounts of noise and artifacts. Therefore, for such weakly supervised tasks, where labels are distant in the temporal domain, the described approach is not optimal.

While most of the previously analyzed methods consider temporal coherence, the constraints set to achieve it might not be enforced on the model in the desired way [5, 8, 13, 16, 25, 28, 34]. Wei et al. [33] propose an interesting method for temporally consistent segmentation of echocardiograms. In this problem, per echo cycle, the labels are only on the end-diastolic and end-systolic frames. The approach is composed of two co-learning strategies for segmentation and tracking. The first estimates shape and motion fields at the appearance level, and the second imposes additional temporal consistency at the shape level on the previously predicted segmentation maps. While promising, this approach is not entirely applicable to our problem at hand, since our labels are coordinates and do not contain the rich features embedded in segmentation labels.

1.2 Thesis Objective

This thesis aims to propose a model for automatic prediction and tracking of LVD landmarks throughout PLAX videos of cardiac ultrasound. This approach will not only facilitate the clinical cardiac assessment of patients by alleviating the challenges of the labor-intensive task of manual labeling, but will also make it feasible for a user with any level of expertise to acquire the measurements. The goal is to design a deep neural network and train it on a sparse dataset consisting of ground-truth landmark locations denoting LVD for the ED and ES frames of a number of PLAX videos. During inference, this network should be able to predict the landmarks for any frame in any phase of a cardiac cycle.
With this temporally coherent predictor, it would be possible to estimate EF by computing the minimum and maximum lengths of LVD through a cycle. The Teichholz formula for one-dimensional EF approximation is used for this estimation. Ideally, the network should have a light-weight architecture so that it is capable of functioning on mobile, hand-held ultrasound devices.

Implicitly, the smooth temporal predictions of LVD landmarks could also enable the network to provide a rough estimate of the cardiac cycle phase. This is possible by comparing each frame's predicted LVD length with those of the other frames of the sequence.

1.2.1 Contributions

The contributions of this work are listed in the following points:

• A deep neural network framework has been proposed for estimating temporally consistent LVD landmarks across PLAX cardiac videos;

• An optimization strategy has been utilized for training the network effectively with sparse labels in the temporal dimension;

• Four variations of the model have been experimented with to choose the best model. The results suggest that the superior model consists of a series of convolutional layers for global information extraction, BiLSTM layers for learning temporally coherent landmarks out of the global embeddings, and a constraint on the length of LVD;

• An ablation study was performed to ensure the model benefits from its sequential layers in a way that is desired and clinically impactful;

• A post-processing pipeline has been implemented for comprehensive analysis of the performance of the model based on the common evaluation metrics of PLAX video;

• The results of the model were compared to two of the most relevant and recent studies in the literature.

1.3 Thesis Outline

• Chapter 1 provides a clinical background on the anatomy of the heart and introduces the problem targeted in this thesis, along with a brief overview of the exploited techniques.
Related works in the literature are analyzed in Section 1.1.8.

• Chapter 2 delves into the details of the proposed method. It also aims to justify the choice of the model's building blocks through a series of experiments, and the final results are discussed and compared.

• Chapter 3 compares the performance and structure of the model with two of the most relevant papers, chosen as the state of the art.

• Chapter 4 summarizes the methods, findings, and contributions of this work. A set of ideas is also suggested in Section 4.2 for anyone interested in extending this work in future studies.

Chapter 2

End-to-end Landmark Detection in Weakly Labeled Spatio-temporal PLAX Echocardiograms Using a Deep Neural Network

2.1 Introduction

In this thesis, the goal is to design an end-to-end network that takes advantage of the labeled ED or ES frames, wherever they are available in the dataset, to train a network capable of predicting smooth, continuous labels for each frame of a PLAX video. The test video will have an unknown duration. Ideally, the network should be able to predict ED and ES frames robustly during inference. This is possible by measuring the distance between the upper and lower landmarks (LVD) on all frames and finding the frames on which LVD is maximized or minimized, respectively. This chapter consists of 1) details of data preparation for this task, 2) the proposal of an end-to-end spatio-temporal model for landmark localization, 3) details of the loss function proposed for training the network, and 4) validation results of the network.

2.2 Materials and Methods

2.2.1 Dataset

The dataset used for this project was provided by Vancouver General Hospital, Vancouver, BC, Canada. Ethics approval was obtained, and data were anonymized prior to use. This dataset consists of 1367 PLAX echo videos collected from Philips cart-based machines.
The dataset was divided into training, test, and validation sets according to a ratio of 60:20:20, with each set sampled randomly.

To prepare the videos, first, the center of each video was cropped out using a linear transform to omit any other material, including text, marks, or the ECG signal surrounding the video, which could wrongly bias the network without containing useful information. All videos were then resized to a fixed size, which was set to 256×256 for this project. This number was chosen empirically, as it yielded acceptable performance for the network while occupying a reasonable amount of GPU memory. This made it possible to train all experiments on the machines provided in our lab, with their specific hardware capacities, in a reasonable amount of time. However, without any doubt, choosing a higher resolution would lead to more accurate results in a setting with no limits on GPU memory and training time.

The original labels were provided for two keyframes, namely ED and ES, each label consisting of the coordinates of the two landmarks. As these labels were reported with respect to the original video's coordinate system, the same mask used for cropping the original data was applied to the labels to transform them into the main window's coordinate system. The new labels in the 256-sized image were then obtained by scaling them with a factor of 256 / (original size).

As the durations of the videos differed, a window length of 200 frames was chosen to store each sample, and videos having more frames were excluded from the training set. This number was picked by visualizing the histogram of the temporal lengths of the videos and choosing a cut-off threshold that resulted in minimal loss of samples in the dataset. The images were converted to grayscale, and the intensities were normalized during training.

2.2.2 Sampling

As mentioned in Section 2.2.1, the training dataset is first prepared in the form of 820 videos of 200 frames.
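The label transformation described in Section 2.2.1 (a shift into the cropped window followed by a 256/original-size rescale) can be sketched as follows; the `(x0, y0, w, h)` crop box and the function names are illustrative, not the thesis code:

```python
TARGET = 256  # fixed spatial size used during training

def transform_label(point_xy, crop_box):
    """Map a landmark from raw-video coordinates to the 256x256 image.

    point_xy: (x, y) in the original video's coordinate system;
    crop_box: (x0, y0, w, h) locating the cropped imaging window
    inside the raw frame (illustrative representation of the crop).
    """
    x0, y0, w, h = crop_box
    x, y = point_xy
    # shift into the cropped window, then scale by 256 / original size
    return ((x - x0) * TARGET / w, (y - y0) * TARGET / h)
```
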
However, only a few frames in each sample have labels. We can only train the detector on the labeled frames, while the other frames can be used to teach the network temporal consistency. Obviously, due to memory and training limitations, we cannot feed all 200 frames to any network at once. Therefore, a window containing a sequence of frames was cropped out of each video to form one data point. Here, the size of this window is another hyperparameter that must be chosen heuristically. Although larger windows improve the network's temporal performance, they sacrifice training time and increase the number of parameters. By trying out different numbers, a size of 30 was found to be an ideal candidate, leading the network to a good local optimum while keeping the experiments feasible.

One crucial step in sampling was to choose a good set of 30 frames out of the whole video. Ideally, we wanted to select a chunk that contained at least one labeled frame, but we did not want this frame to appear at constant positions throughout the samples. For instance, we did not want the window to always start at an ED frame. This would have biased the network towards seeing special cases during training, leading to poor generalization at test time, where the starting frame could be at any phase of a cardiac cycle. Therefore, the first labeled frame was identified in each video. Next, the window's starting frame was chosen as a frame occurring slightly before the first labeled frame. Specifically, a random number between zero and seven was sampled from a uniform distribution each time. This number represents the number of frames included in the window before the first labeled frame. The number seven was another hyperparameter chosen heuristically; we did not want to diverge too much from the label, but it was also desirable to insert some level of randomness into the samples.
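The window-selection step above can be sketched as follows (a simplified version with illustrative names; offsets near the start or end of the cine loop are handled by clamping):

```python
import random

WINDOW = 30      # frames per training sample
MAX_OFFSET = 7   # max frames to include before the first labeled frame

def choose_window_start(first_labeled_idx: int, num_frames: int) -> int:
    """Pick the start of a WINDOW-frame chunk containing a labeled frame.

    A random offset of 0..7 frames is placed before the labeled frame
    so labels do not always sit at a fixed window position; the result
    is clamped so the full window stays inside the video.
    """
    offset = random.randint(0, min(MAX_OFFSET, first_labeled_idx))
    start = first_labeled_idx - offset
    return max(0, min(start, num_frames - WINDOW))
```
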
In cases where the first labeled frame occurred before the 7th frame of the cine loop, the starting frame was picked as a random frame between the first frame of the cine loop and the first labeled frame.

After choosing the starting frame from a video, a single training sample was formed by stacking it with the next 29 consecutive frames. In case more labels were present in the same video that had not been included in the 30-frame window formed before, the sampling was repeated by identifying the next available labeled frame, with the shifting performed according to the same method.

2.3 Network Architecture

This method's main architecture consists of a series of time-distributed convolutional layers, followed by a sequence model (LSTM). The convolutional layers learn to extract global, meaningful representations of the images so that the network understands the clinical meanings of the landmarks. As the definitions of the landmarks are consistent throughout the whole video, or, in other words, since the landmarks occur on specific shapes forming the upper and lower walls of the LV, the same weights were used for each frame in the sequence. The temporal consistency of these representations, however, is ensured by means of the LSTM model. This model helps the unlabeled frames of each video learn meaningful representations based on the few labeled frames in the sequence, so that the middle frames are bound to follow a smooth trajectory in the cycle, from the maximum diameter at ED to the minimum diameter at ES.

Here, we found that employing a 2-layer stacking of LSTMs outperforms the one-layer version, as it adds more parameters to the network and prevents it from underfitting. The first LSTM takes a sequence of 64-dimensional latent representations as input, which are extracted by the preceding CNN layers, and outputs a sequence of embeddings in a smaller, 32-dimensional space.
Next, the second LSTM takes these 32-dimensional embedding vectors as input and outputs a 4-dimensional vector of coordinates for each frame. These four entries are the predicted x and y coordinates of the upper and lower landmarks. An overview of this model can be seen in Figure 2.1.

Figure 2.1: The base architecture for the proposed method. This model consists of a series of convolutional layers for extracting global, shared spatial embeddings, followed by a 2-layer LSTM for predicting a smooth set of coordinates from the spatial embeddings across the temporal dimension.

In another variation of this model, we substituted the one-directional LSTMs with bidirectional LSTMs to condition the predictions not only on the temporal constraints of the previous frames but also on the future frames (Figure 2.2). Additionally, this modification prevents the adverse effect of getting better predictions for each window's final frames than for its initial frames, which is a downside of one-directional LSTMs. The whole pipeline is trained end-to-end.

In another variation of the base architecture, we added a zero-parameter arithmetic layer that takes the 4-dimensional predicted coordinates as input and computes the Euclidean distance between the upper and lower landmarks. This prediction represents the LV diameter, a crucial factor in predicting accurate EF and linked to many critical pathological conclusions. By training this diameter against ground truths equal to the Euclidean distance between the top and bottom landmarks in the labeled frames, we implicitly teach the network to predict coordinates under the constraint of producing meaningful diameters that change smoothly across the frames of the cine loop, which should eventually regularize the EF prediction (Figure 2.3).
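A compact PyTorch sketch of this pipeline is given below, under stated assumptions: the text does not specify the convolutional trunk's exact layers, so the encoder here is an illustrative stand-in that only preserves the described 64-dimensional embedding, 32-dimensional LSTM hidden size, 4-dimensional coordinate output, and zero-parameter diameter head; the linear projection after a bidirectional second LSTM is also our guess, since a BiLSTM doubles the output width.

```python
import torch
import torch.nn as nn

class LandmarkSeqNet(nn.Module):
    """Sketch: time-distributed CNN encoder + 2-layer (Bi)LSTM head."""

    def __init__(self, bidirectional=True, with_diameter=True):
        super().__init__()
        self.with_diameter = with_diameter
        # small illustrative CNN applied to every frame with shared weights
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 64),            # 64-d spatial embedding per frame
        )
        d = 2 if bidirectional else 1
        self.lstm1 = nn.LSTM(64, 32, batch_first=True, bidirectional=bidirectional)
        self.lstm2 = nn.LSTM(32 * d, 4, batch_first=True, bidirectional=bidirectional)
        # with a BiLSTM the second layer outputs 2*4 features; project to 4
        self.head = nn.Linear(4 * d, 4)

    def forward(self, x):                 # x: (batch, time, 1, H, W)
        b, t = x.shape[:2]
        z = self.encoder(x.flatten(0, 1)).view(b, t, 64)
        z, _ = self.lstm1(z)
        z, _ = self.lstm2(z)
        coords = self.head(z)             # (b, t, 4): x/y of upper & lower landmark
        if self.with_diameter:
            # zero-parameter layer: Euclidean distance between the two landmarks
            diam = torch.linalg.norm(coords[..., :2] - coords[..., 2:], dim=-1)
            return torch.cat([coords, diam.unsqueeze(-1)], dim=-1)  # (b, t, 5)
        return coords
```

A forward pass on a (batch, 30, 1, H, W) clip returns (batch, 30, 5) when the diameter head is enabled: four coordinates plus the derived LV diameter per frame.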
This modification can be applied to both the one-directional and bidirectional versions of the base architecture.

Figure 2.2: Substituting one-directional LSTMs with bidirectional LSTMs in the proposed architecture of Figure 2.1, achieving more precise temporal consistency by taking both future and past frames into account.

Figure 2.3: Adding a fifth output dimension for each frame. This prediction comes from a zero-parameter layer that computes the L2 norm between the predicted coordinates of the upper and lower landmarks, in an attempt to implicitly regularize LVD and EF.

2.4 Loss

The loss function used for training the networks is the smooth L1 loss. As the labels are sparse, a one-hot vector with the same temporal length as the data point (30, as discussed above) is formed for each sample during data preparation. This vector has ones at the indices of annotated frames and zeros elsewhere. Each training batch therefore has a corresponding batch of one-hot vectors. This one-hot batch is multiplied into the loss at each training step so that gradients are not backpropagated through unlabeled frames.

2.5 Experiments

The networks described above were first trained on the train set and validated on the validation set. The best model was chosen by monitoring the train and validation losses. Specifically, we looked for a model in which the training error is low enough (to prevent underfitting) while the validation loss is at its lowest value and has not yet started to increase (to prevent overfitting). Figures 2.4 and 2.5 show the trend of train and validation loss across all training epochs. Figure 2.5 is a zoomed-in view of Figure 2.4 over the later epochs, where the variations in the loss become more subtle.
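The masking trick can be illustrated with a small NumPy sketch (the function names are ours; the thesis applies the same idea inside its training framework): the per-frame smooth L1 loss is multiplied by the one-hot label vector, so unlabeled frames contribute neither loss nor gradient.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Elementwise smooth L1 (Huber-like) loss."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)

def masked_loss(pred, target, label_mask):
    """Average smooth L1 over labeled frames only.

    pred, target: (time, 4) coordinate arrays; label_mask: (time,) one-hot
    vector with 1 on annotated frames.  Multiplying by the mask zeroes the
    loss (and hence the gradient) on unlabeled frames.
    """
    per_frame = smooth_l1(pred, target).mean(axis=-1)   # (time,)
    return (per_frame * label_mask).sum() / max(label_mask.sum(), 1)
```

In an autodiff framework the same multiplication stops gradient flow through the unlabeled frames, which is exactly the behavior described above.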
According to these figures, the models at epoch 250 were chosen as the best candidates for testing for all four variations: BiLSTM without the LV diameter constraint, BiLSTM with the LV diameter constraint, UniLSTM without the LV diameter constraint, and UniLSTM with the LV diameter constraint. After fixing the hyperparameters in these experiments, the training and validation sets were combined, all models were retrained on the combined dataset with the same settings, and training was stopped at the chosen epochs. The test set was then used to report the final test accuracy and to extract statistics and qualitative results for evaluating all models.

All networks were trained with a batch size of 5 using the Adam optimizer, a weight decay of 0.00001, and an initial learning rate of 0.0001, divided by ten at epochs 50 and 120, for a total of 400 epochs.

Figure 2.4: The train loss (red) and validation loss (blue) plotted for all four variations of the models in Section 2.3: (a) UniLSTM without LVD constraint; (b) BiLSTM without LVD constraint; (c) UniLSTM with LVD constraint; (d) BiLSTM with LVD constraint.

Figure 2.5: Similar to Figure 2.4, the train loss (red) and validation loss (blue) plotted for all four model variations of Section 2.3 during the later epochs, for a clearer view.

2.5.1 Results

The following results are all derived from the test set to evaluate and compare the proposed method's performance. Table 2.1 shows the mean EF and FS errors for all models. According to this table, the network with BiLSTM and the LVD constraint achieves considerably better results in terms of EF and FS.
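The learning-rate schedule described above amounts to a simple step function; a minimal sketch (the helper name is ours):

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(50, 120), gamma=0.1):
    """Step schedule used here: lr divided by ten at epochs 50 and 120."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

In PyTorch the same behavior is typically obtained with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 120], gamma=0.1)`.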
This was expected, as the accuracy of both the EF and FS metrics relies on the precision of the LVD estimate, and the LVD constraint appears to be successful in regularizing the prediction of this quantity. Moreover, both EF and FS must be computed at ED and ES, and at test time there are no phase labels across the cine loop; the network therefore has to take the minimum and maximum of the predicted LVD over a window of frames containing at least one cycle and assign them to the ES and ED phases, respectively. It is thus crucial for the network to predict smooth, cyclic, temporally consistent LVD values, ensuring that the minima and maxima occur due to the heart's motion rather than noise. With this in mind, the choice of BiLSTM justifies the improvements, as it enforces temporal consistency in both directions and therefore provides smoother LVD measurements across time.

To see how statistically significant the differences between the errors in Table 2.1 are, a pairwise t-test was conducted on the models (Figure 2.6). The results show that, overall, the p-values for EF and FS are relatively large, indicating that the differences in the error distributions of these two quantities fail to reach statistical significance. Nevertheless, for both measures, the model with BiLSTM and the LVD constraint makes the most significant improvement over the network with UniLSTM without the LVD constraint and the network with BiLSTM without the LVD constraint. This shows that adding the LVD constraint has been most helpful for the model with BiLSTM layers.
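This ED/ES assignment can be sketched as follows. Note that the thesis does not state its exact EF formula, so the cube-method approximation used here, alongside the standard fractional-shortening definition, is an assumption for illustration only.

```python
import numpy as np

def ef_fs_from_lvd(lvd):
    """Estimate EF and FS from a predicted LVD curve spanning >= one cycle.

    EDD/ESD are taken as the max/min of the curve.  FS is the standard
    fractional shortening; EF uses the cube-method approximation
    (assumed here, since the thesis does not spell out its EF formula).
    """
    lvd = np.asarray(lvd, dtype=float)
    edd, esd = lvd.max(), lvd.min()      # end-diastolic / end-systolic diameter
    fs = (edd - esd) / edd * 100.0
    ef = (edd ** 3 - esd ** 3) / edd ** 3 * 100.0
    return ef, fs
```

This also shows why smoothness matters: a single noisy spike in the curve would be picked up by `max`/`min` and directly corrupt both metrics.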
The results also show no statistically significant difference among the models without the LVD constraint; the same holds to some extent for the two models with the LVD constraint.

Table 2.1: EF and FS error analysis for the proposed models

  Model                                             Mean EF Error   Mean FS Error
  Network with UniLSTM and without LVD constraint   10.67%          7.75%
  Network with BiLSTM and without LVD constraint    10.6%           7.77%
  Network with UniLSTM and LVD constraint           10.34%          7.52%
  Network with BiLSTM and LVD constraint            10.08%          7.21%

Figure 2.6: Statistical significance of the EF and FS errors in Table 2.1, evaluated using the p-values of pairwise paired t-tests between models for (a) EF error and (b) FS error. Brighter colors correspond to lower p-values and therefore higher statistical significance.

To measure how accurate each landmark's location is, the 2D distance between the ground truth and predicted coordinates is computed for the frames with ground truth labels. To calculate this distance, the predicted locations have to be rescaled to the size of the original image before preparation, because the network was trained on samples of a fixed size (256×256). As the predicted numbers represent coordinates in pixels, the ground truth locations are also expressed in pixels. Table 2.2 shows the mean distance error for both landmarks across all models. According to this table, the network with BiLSTM and the LVD constraint beats all other models in terms of location accuracy. However, among the models without the LVD constraint, the model with UniLSTM predicts per-point locations better. One explanation could be that the BiLSTM puts more emphasis on smoothing the changes of LVD across time, resulting in weaker per-point location estimation.
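For reference, the paired t-test used throughout these comparisons operates on per-sample error differences between two models; a minimal NumPy sketch of the t statistic (in practice a library routine such as `scipy.stats.ttest_rel` would also supply the p-value):

```python
import numpy as np

def paired_t_statistic(err_a, err_b):
    """t statistic of a paired t-test between two models' per-sample errors."""
    d = np.asarray(err_a, float) - np.asarray(err_b, float)
    n = d.size
    # mean difference divided by its standard error
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))
```

The p-value then follows from the t distribution with n - 1 degrees of freedom.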
However, adding the LVD constraint seems to have fixed this issue while taking advantage of the strong regularization power of the bidirectional LSTMs.

Again, to evaluate the statistical significance of the coordinate prediction errors (in pixels) across the model variants, p-values of pairwise paired t-tests were computed (Figure 2.7). These results show that the model modifications lead to more significant improvements in predicting the lower landmark than the upper landmark. As Table 2.2 also shows larger errors for the lower landmark, it can be concluded that predicting the lower landmark is harder for the model, and most of the significant improvements in the results stem from better prediction of this landmark. Furthermore, for both landmarks, the most significant improvement occurs for the BiLSTM model after adding the LVD constraint, compared to the same model without this constraint, which is further evidence of the effective role of the LVD constraint in the presence of BiLSTM layers. It can also be observed that, in the absence of the LVD constraint, the BiLSTM layers lead to a significant improvement in the lower landmark compared to the UniLSTM layers.
There appears to be no significant difference in the landmark error distributions among the models trained with the LVD constraint.

Table 2.2: Mean distance error for each landmark's predicted coordinates in pixels

  Model                                             Upper Landmark   Lower Landmark
  Network with UniLSTM and without LVD constraint   7.34             9.51
  Network with BiLSTM and without LVD constraint    7.51             9.86
  Network with UniLSTM and LVD constraint           7.4              9.42
  Network with BiLSTM and LVD constraint            7.26             9.33

Figure 2.7: Statistical significance of the errors in the predicted coordinates (in pixels) for each landmark in Table 2.2, evaluated using the p-values of paired t-tests for (a) the upper landmark and (b) the lower landmark. Brighter colors correspond to lower p-values and therefore higher statistical significance.

As reporting the error in pixels is not particularly meaningful in clinical applications, it is more comprehensible to report the distance error in centimeters, which conveys how the methods actually scale with respect to the heart's dimensions. To obtain this error, each sample's error was scaled based on the original ultrasound depth information found in the sample's DICOM header, and the total error was averaged over the test set.
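A hypothetical sketch of this rescaling, assuming the image height spans the imaging depth recorded in the DICOM header (the thesis does not spell out the exact conversion, so the function and its parameters are illustrative):

```python
def pixel_error_to_cm(err_px, depth_cm, image_height_px=256):
    """Convert a pixel distance error to centimeters.

    Illustrative assumption: the image height covers the full imaging
    depth read from the sample's DICOM header, so one pixel corresponds
    to depth_cm / image_height_px centimeters.
    """
    return err_px * depth_cm / image_height_px
```

For example, a 7.26-pixel error at a 16 cm imaging depth on a 256-pixel-tall image corresponds to roughly 0.45 cm.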
The resulting numbers are presented in Table 2.3. The p-values of the paired t-tests for these errors are shown in Figure 2.8; their key findings agree with Figure 2.7, as explained in Section 2.5.1.

Table 2.3: Mean distance error for each landmark's predicted coordinates in centimeters

  Model                                             Upper Landmark   Lower Landmark
  Network with UniLSTM and without LVD constraint   0.697            0.889
  Network with BiLSTM and without LVD constraint    0.706            0.923
  Network with UniLSTM and LVD constraint           0.697            0.887
  Network with BiLSTM and LVD constraint            0.687            0.874

Figure 2.8: Statistical significance of the errors in the predicted coordinates (in centimeters) for each landmark in Table 2.3, evaluated using the p-values of paired t-tests for (a) the upper landmark and (b) the lower landmark. Brighter colors correspond to lower p-values and therefore higher statistical significance.

B) Distribution Analysis

To analyze how successful the proposed models are at minimizing the number of outliers, a number of distribution analyses are presented in this section. Figures 2.9 and 2.10 display the histograms and box plots of the EF error distribution for all models, respectively. According to these figures, adding the LVD constraint has tightened the distributions relative to the counterpart models without it. These figures also show that the models with bidirectional LSTMs have slightly wider distributions than the models with unidirectional LSTMs; however, the difference between the two distributions decreases with the addition of the LVD constraint.
This can be explained with the same justification presented in Section 2.5.1: the bidirectional LSTMs push the model to focus more on smoothing the LVD curve across time, which may pull the LVD predictions at ED and ES closer together and lead to higher EF and FS errors. However, the distributions also show that this issue is clearly resolved by adding the LVD constraint, with a drop in both the mean and variance of the EF errors.

To give a better sense of these distributions, scatter plots of the predicted EF and FS versus their corresponding ground truth values are shown in Figure 2.11. The distributions of the errors in predicting the locations of the two landmarks are shown in Figures 2.12 and 2.13 (in pixels) and Figures 2.14 and 2.15 (in centimeters).

Figure 2.9: Histograms of EF and FS errors for all models.

Figure 2.10: Boxplots of EF and FS errors for all models.
Figure 2.11: Scatter plots of predicted EF and FS vs. ground truth for all models.

Figure 2.12: Histograms of distance errors of the landmarks' predicted coordinates in pixels for all models.

Figure 2.13: Boxplots of distance errors of the landmarks' predicted coordinates in pixels for all models.

Figure 2.14: Histograms of distance errors of the landmarks' predicted coordinates in centimeters for all models.

Figure 2.15: Boxplots of distance errors of the
landmarks' predicted coordinates in centimeters for all models.

Taking all of the above results into account, the network with BiLSTM and the LVD constraint appears to beat the other three based on the metrics of interest. In Figures 2.16 to 2.18, landmark predictions at ED and ES are plotted for some random test samples.

Figure 2.16: Predicted landmarks for ED and ES frames of some random test samples (red: ground truth, green: predicted).

Figure 2.17: Predicted landmarks for ED and ES frames of random test samples (red: ground truth, green: predicted).

Figure 2.18: Predicted landmarks for ED and ES frames of random test samples (red: ground truth, green: predicted).

2.5.2 Ablation Study

To test how effective the sequence models are at enforcing temporal consistency on the predictions, in a paradigm similar to an ablation study, the sequence layers of the original models were removed and replaced with a simple time-distributed two-layer dense architecture. This network was also trained on the train set, and the best epoch and hyperparameters were chosen based on performance on the validation set. The chosen model was then retrained on the combination of the train and validation sets with those hyperparameters and tested on the test set. Figure 2.19 shows an overview of the resulting models, both with and without the LVD constraint.
The model with the BiLSTM layers and LVD constraint was picked as the best model from the previous sections, and the results of this ablation study will only be compared against this model. The reason for this choice is that this model led to better results overall. Although the differences between the two models with the LVD constraint did not show strong statistical significance, it is crucial to bear in mind that clinical significance is different from statistical significance. While no means of measuring clinical significance was at hand, picking the model with the best mean errors still seems the most reliable decision.

Figure 2.19: Substitution of the LSTMs with dense layers in the original model for an ablation study: (a) without and (b) with the LV diameter branch added.

The results of the modified models on the test set are discussed in the following sections.

A) Statistical Analysis

Table 2.4 shows the mean errors according to the predefined metrics, and Figure 2.20 shows the p-values of the paired t-tests between the modified models and the model with BiLSTM and LVD constraint, which was chosen as the best model.

Table 2.4: Evaluation metrics for the modified network

  Metric                                          Without LVD Constraint   With LVD Constraint
  Mean EF error                                   12.18%                   12.16%
  Mean FS error                                   9.23%                    9.27%
  Mean distance error, upper landmark (pixels)    7.74                     8.55
  Mean distance error, lower landmark (pixels)    10.25                    10.89
  Mean distance error, upper landmark (cm)        0.730                    0.803
  Mean distance error, lower landmark (cm)        0.963                    1.016

Figure 2.20: Statistical significance
of the errors in Table 2.4 for the modified models, computed against the best model using the p-values of paired t-tests: (a) EF error; (b) FS error; (c) upper landmark coordinate error (pixels); (d) lower landmark coordinate error (pixels); (e) upper landmark coordinate error (cm); (f) lower landmark coordinate error (cm). Brighter colors correspond to lower p-values and therefore higher statistical significance.

According to Table 2.4, adding the LVD constraint makes no significant change in the EF and FS errors for the modified models. However, adding this constraint results in a significant drop in the accuracy of pixel coordinate prediction. This is because, without the sequence layers, the model has no way of regularizing the predictions temporally. Hence, the network might have learned something akin to predicting the average of the LVD at ED and ES frames for both landmarks; in this case, the LVD constraint seems to have misled the network.

On the other hand, comparing these errors with those of the network with BiLSTM (Table 2.1), and considering the p-values in Figure 2.20, we can observe that all errors are significantly larger for the new network with dense layers instead of sequence layers. Therefore, we can safely conclude that the addition of the sequence layers has improved the temporal consistency in the desired way: it has led to more accurate predictions, both in per-point location and in the cyclic functional metrics that are crucial for clinical assessment.

B) Distribution Analysis

The distribution analysis of the modified models is depicted in the following figures. Figure 2.21 shows the scatter plots of EF and FS errors, and Figure 2.22 shows the histograms and boxplots of these metrics. These figures also support the conclusion that adding the LVD constraint is not helpful, even in terms of reducing the number of outliers, for the model without sequence layers. Figures 2.23 and 2.24 show the distributions of the distance errors for each landmark's predicted coordinates in pixels and centimeters.
Comparing these plots with the best model of the previous section, the network with BiLSTM and LVD constraint, we can clearly see that the use of BiLSTM also tightens the error distributions and reduces the number of outliers.

Figure 2.21: Scatter plots of predicted vs. ground truth EF and FS for the modified models having dense layers instead of sequence layers, with and without the LVD constraint.

Figure 2.22: Histograms and boxplots of EF and FS errors for the modified models having dense layers instead of sequence layers.
Figure 2.23: Distribution of distance errors of the landmarks' predicted coordinates in pixels for the modified models having dense layers instead of sequence layers.

Figure 2.24: Distribution of distance errors of the landmarks' predicted coordinates in centimeters for the modified models having dense layers instead of sequence layers.

2.6 Discussion and Conclusion

In this chapter, a series of architectures were proposed and tested for solving the landmark detection problem in a continuous cine loop with sparse labels. According to the experiments, the model with BiLSTM layers and the LVD constraint serves our purpose best, as it provides smooth tracking of predictions across time with the lowest errors in EF, FS, and per-landmark location accuracy. The paired t-test results show a statistically significant improvement due to the addition of the LVD constraint for the models with BiLSTM.
Although the difference between the models with LSTM and BiLSTM layers, both in the presence of the LVD constraint, is not highly statistically meaningful, the improvements in the statistics might still be remarkable from a clinical point of view. Therefore, the most reliable decision was to pick this model as the final proposed method. The ablation study shows that the choice of sequence layers also leads to a significant improvement in errors, rejecting with high confidence the null hypothesis that the good results arose by chance. As shown in the experiments, these predictions are robust enough to estimate the phase of the frames at test time solely by assessing the predicted LVD and comparing it with the predictions on the other frames of the cycle. Therefore, the proposed method could also serve as a phase detector on PLAX views in the absence of other designated tools.

Chapter 3

Discussion and Comparison with Reference Methods

3.1 Introduction

In this chapter, the method proposed in Chapter 2 will be compared with two of the most prominent recent works in the literature tackling the same problem as this thesis, and the advantages and disadvantages of each method will be discussed. The first paper to be addressed is [29], where an FCN is utilized to regress keypoints corresponding to the same temporal LVD landmarks targeted in this thesis.
The outputs of the FCN then pass through a series of ConvLSTM layers and, finally, a center-of-mass layer, which assigns the final coordinates from the output heatmaps of the ConvLSTMs. The second paper studied is [10], where a modified U-net CNN approach is adopted to predict three LV measurements from PLAX images, one of which is the pair of LVD landmarks, on single frames instead of the whole temporal sequence. These networks are presented in detail in the following sections, and their performance is weighed against our method.

Figure 3.1: (a) CNN architecture for regressing landmarks; (b) ConvLSTM layers following the convolutional layers for enforcing temporal consistency. (Figure from [29])

3.2 Methods

3.2.1 Model 1: Sofka et al. [29]

In this paper, two architectures are discussed, one of which is presented as their proposed architecture. In the first architecture, a series of convolutional and pooling layers are followed by a ConvLSTM model (Figure 3.1). The second, and their main proposed architecture, adds a series of transposed convolutional layers after a path of FCN and ConvLSTM layers similar to Figure 3.1. This architecture resembles a U-net [27], which is well suited for segmentation tasks, except that the bottleneck is able to learn temporal coherency through the ConvLSTMs. Lastly, an arithmetic center-of-mass layer turns the heatmap results of the segmentation into the output coordinates for each of the two landmarks. The architecture is depicted in Figure 3.2. The networks are trained with a Euclidean loss and the Adam optimizer.
Their dataset consists of 4981 video frames for training, 628 for validation, and 90501 for testing. As stated in the paper, two professional sonographers labeled the frames manually, and the average of the two annotators' locations was taken as the ground truth.

Figure 3.2: (a) ConvLSTM layers take the output of the FCN; (b) FCN directly regressing keypoints without the use of ConvLSTMs. In both cases, a center-of-mass layer is designed as the final layer to compute coordinates from the predicted heatmaps. (Figure from [29])

3.2.2 Model 2: Gilbert et al. [10]

This method uses a modified U-net structure to predict heatmaps corresponding to the landmarks in a segmentation-based approach. As their work predicts three measurements, each containing two landmarks, the model's final output has six channels. They also adopt a spatial-numerical transform block [21] on top of the predicted probability maps to estimate the center of mass of these heatmaps as the final coordinate predictions. The architecture can be seen in Figure 3.3.

Figure 3.3: The modified U-net architecture proposed in [10] to predict heatmaps for each of the landmarks, from which the center of mass is calculated to obtain the final predicted locations. (Figure from [10])

This model is trained using a combination of three losses. The first is the root mean squared error (RMSE) between the predicted heatmaps and ground truth heatmap labels, constructed by centering a 2D Gaussian at the ground truth locations; these Gaussian labels are rotated so that they are elongated along an axis orthogonal to the measurement line. The second loss is the L2 loss between the ground truth and predicted coordinates and between the lengths of each of the measurements (including LVD).
The third loss is the angle loss between the ground truth and predicted measurement lines, calculated using the cosine similarity between the two sets of vectors.

The dataset used for this method consists of 585 images of ED and ES PLAX views, collected from 309 unique patients. Thirty-two of the recordings were labeled multiple times by the same annotator to calculate intra-observer variability; these recordings were used as the test dataset for reporting the metrics in Table 3.1. Apart from these 64 images, the remaining 521 images were used for training and validation.

3.2.3 Results

Both reference methods chose the mean percent error of the measurement lengths as their evaluation metric. Therefore, for the sake of comparison, the same metric was also computed for the method proposed in Chapter 2. The method in Section 3.2.1 reports this metric at different percentiles, so the same was done for the proposed method of this thesis. The approximate number of parameters for each model is shown in Table 3.2.

Table 3.1: Mean percent error of LVD for the three architectures (the two reference methods proposed in [10, 29] plus our approach)

  Model                              50th    75th    95th    100th
  Sofka et al. [29]: CNN+ConvLSTM    4.89    8.68    17.51   -
  Sofka et al. [29]: FCN+ConvLSTM    4.87    8.68    18.27   -
  Gilbert et al. [10]                -       -       -       6
  Proposed model in Chapter 2        4.61    7.2     12.18   12.18

Table 3.2: Number of parameters for the three architectures (the two reference methods proposed in [10, 29] plus our approach)

  Model                              Number of Parameters (in Millions)
  Sofka et al. [29]: CNN+ConvLSTM    10.3 M
  Sofka et al. [29]: FCN+ConvLSTM    2.7 M
  Gilbert et al. [10]                7.7 M
  Proposed model in Chapter 2        4.5 M

It should be noted, however, that the exact architectural details and layer sizes are not specified in Sofka et al. [29].
Therefore, the models were implemented based on assumptions drawn from the schematic diagrams of their models provided in the paper.

3.3 Discussion and Conclusion

In this section, the details of the three models will be discussed and compared. First, it should be noted that our goal was to train a model with the GPU resources available to this project, which were constrained to 16 GB; therefore, running a model with a high number of parameters was neither feasible nor intended for this project. The model from Sofka et al. [29] reports numbers on their own dataset that are already worse than those of our proposed method. However, we were not able to train this method on our dataset, as ConvLSTMs are very hard to train and require large amounts of data to optimize. As mentioned in Section 3.2.1, this work has had access to a substantially larger dataset, which was not available to us. Still, we found that segmentation-based networks such as U-Net did not achieve precisions as high as the current method, where the inputs of the LSTMs are embeddings and the outputs are coordinates. Another downside of this model is that there is no mention of a sparse dataset, meaning their labels are most likely available for all the frames of the sequence. On the contrary, we only have annotations for ED and ES frames in our dataset, and our model has introduced a technique in the loss function to deal with these sparse labels. Moreover, according to Table 3.2, the first architecture presented in this paper has more parameters than our model, and although their second proposed architecture seems smaller, the use of ConvLSTMs leads to occupying rather large memory.
Therefore, our proposed method appears to be more powerful in the sense that it achieves better results with fewer resources, is much easier to train, converges very fast, and requires fewer labels per cycle to predict LVD at any phase of the cine loop at inference time.

Another reason our method works sufficiently well is that its mean percent error reported in Table 3.1 falls within the inter-observer variability range reported in [29]. Although this range is bound to their own dataset, and we had no means of measuring inter-observer variability within our dataset, the reported range should still be a good proxy for the global range of this quantity due to their large sample size.

The model from [10] has reported better results for LVD (Table 3.1). However, it is only used for single-frame images. Their proposed network is still a modified U-Net [27] with a segmentation-based approach, which requires a rather large number of parameters to be fitted into our pipeline alongside the LSTMs. Consequently, training this network would most probably be much more unstable, similar to the model in Sofka et al. [29]. Additionally, this model also has a larger number of parameters than our model, based on Table 3.2. We assume the improvement of this model arises from two sources:

1. The combination of angle loss, heatmap loss, and coordinate loss;

2. The training and testing were limited to single-frame images. As a result, there was no need to account for a temporal smoothing network, which might have impaired the accuracy of per-point location estimation.

As mentioned earlier, our current method achieves satisfactory results for smooth landmark detection by a deep neural network.
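To make the loss combination used in [10] concrete, the following is a minimal NumPy sketch of the three terms, together with the differentiable center-of-mass (soft-argmax) layer that both reference models use to turn heatmaps into coordinates. This is an illustration under simplifying assumptions, not the authors' implementation: the Gaussian label here is isotropic rather than rotated and elongated, and all function names are ours.

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma=2.0):
    """Ground-truth heatmap: a 2D Gaussian at (x, y). Note that [10]
    additionally rotates/elongates it orthogonal to the measurement line."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    g = np.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def soft_argmax_2d(heatmap):
    """Center of mass of a heatmap -> (x, y) coordinates."""
    h = heatmap / heatmap.sum()
    ys, xs = np.mgrid[0:heatmap.shape[0], 0:heatmap.shape[1]]
    return np.array([(h * xs).sum(), (h * ys).sum()])

def heatmap_rmse(pred, target):
    """First term: RMSE between predicted and ground-truth heatmaps."""
    return np.sqrt(np.mean((pred - target) ** 2))

def coord_l2(pred_pts, gt_pts):
    """Second term: L2 error on landmark coordinates (the measurement
    length is penalized analogously)."""
    return np.mean(np.linalg.norm(pred_pts - gt_pts, axis=-1))

def angle_loss(pred_a, pred_b, gt_a, gt_b):
    """Third term: 1 - cosine similarity between predicted and
    ground-truth measurement lines."""
    v_p, v_g = pred_b - pred_a, gt_b - gt_a
    cos = np.dot(v_p, v_g) / (np.linalg.norm(v_p) * np.linalg.norm(v_g))
    return 1.0 - cos
```

Because the soft-argmax is differentiable, the coordinate and angle terms can backpropagate through the predicted heatmaps, which is presumably why both reference models place the center-of-mass computation at the end of the network.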
However, in the future, it would be interesting to investigate the effectiveness of the three losses introduced in [10] and to find a way to incorporate their method, with fewer parameters, into our current video-based pipeline.

It is also worth mentioning that our dataset was highly diverse in terms of quality. A large number of low-quality samples can be detrimental to the training of the model. For a more detailed understanding of the dataset, the quality percentage of each image was predicted using [17], such that 0% corresponds to the worst quality and 100% denotes the best possible quality. A distribution analysis was then performed on these values, yielding a mean of 52.6% and a standard deviation of 18.1%. As shown in Figure 3.4, the distribution is unbalanced, biased towards low-quality videos.

By looking at some random samples of the dataset, it was also observed that the distribution of the dataset is very wide; some of the images were captured from odd angles, which can easily mislead the network. An attempt was made to account for some of these variabilities through augmentation. However, many of these deformations resulted from a shift in the source probe; in these cases, the whole image might have been captured from an unusual angle, which might not be fixable by means of standard augmentations.

Figure 3.4: The quality distribution of the dataset used for this thesis: (a) histogram and (b) boxplot of the quality of the images in the dataset. These figures show a large variability in the quality of the samples; the histogram in (a) shows that low-quality images constitute the majority.

Chapter 4

Conclusion

In this thesis, a deep neural network has been proposed for automatic detection of two landmarks in PLAX views of cardiac ultrasound videos.
These landmarks are the starting and ending points of the LV diameter measurement in PLAX, one of which is located on the anteroseptal wall and the other on the inferolateral wall of the LV. The clinical significance of this measurement lies in the fact that, upon precise prediction of LVD in the ED and ES phases, the model can provide an estimate of EF, which is an indicator of many cardiac conditions. The dataset used for this work has sparse labels in the temporal dimension, with annotations available only on ED or ES frames. The presented model is a network consisting of a series of convolutional and sequence layers. The convolutional layers are responsible for extracting globally meaningful embeddings, conveying the rough location of the landmarks relative to the other anatomical structures in the image. These embeddings are then passed to a sequence of LSTMs to enforce temporal consistency across the frames. This is achieved by teaching the network to estimate the landmarks of the cycle's inner frames based on the labels of earlier or later frames of the same sequence. Another constraint has also been introduced to force the network to predict landmarks subject to a more temporally meaningful condition: the length of the line connecting the two landmarks should follow the same temporal trend as that of the individual landmarks. This constraint is referred to as the LVD constraint throughout this thesis. Four variations of this model were tested and compared in Chapter 2.2, and as a result of the analysis, the best one appeared to be the model with Bidirectional LSTMs and the LVD constraint. To verify that the sequence layers were enforcing temporal consistency in the clinically desired way, namely enhancing the EF prediction, the sequence layers were substituted by dense layers in an experiment.
The results of this ablation study revealed significant drops in accuracy across all metrics, justifying the use of the sequence layers. In Chapter 3, the proposed model was compared against two of the most relevant and recent studies in the literature, chosen as references. The results support the effectiveness of our model according to the same metrics reported by the reference methods. The advantages of the presented model were also highlighted in terms of the number of parameters, stability, and scalability to temporal videos without depending on densely labeled data. Therefore, the proposed method is deemed to outperform the other two with respect to this project's requirements.

4.1 Contributions

The contributions of this work can be summarized as follows:

• A deep neural network architecture has been proposed to predict LVD landmarks in PLAX videos from cardiac ultrasound sequences;

• A special strategy has been utilized in the loss function to account for the sparse ground truth labels, as annotations were only available for ED and ES frames;

• Four variations of the model were introduced and experimented with to choose the best possible model and justify its building blocks;

• The capability of the model to learn temporal coherence was investigated and verified by analyzing a variation of the model with its sequential layers substituted by non-temporal layers;

• A post-processing pipeline has been implemented for comprehensive analysis of the main clinical metrics used for PLAX videos;

• The performance of the model was compared against two of the most relevant and recent papers in the literature.

4.2 Future Work

In this section, several possible additions are suggested to pursue this thesis in the future:

1. Collecting more high-quality data;

2. Investigating the effect of an angle loss or heatmap loss similar to [10];

3. Adding an optical flow estimation branch to achieve better temporal coherency.
It is suggested to use a deep neural network approach for this aim so that the whole pipeline can be optimized end-to-end. This contrasts with classical approaches, which generally do not work well on noisy images such as ultrasound recordings. This idea was tested through multiple experiments; however, many of the established methods in the literature for neural network-based optical flow estimation tend to fail on ultrasound images. This arises from the fact that these networks were trained on datasets with considerably dissimilar distributions. It is possible to retrain or fine-tune these networks given access to flow maps for the current dataset. Therefore, it is suggested to prepare these flow labels in the future to expand the model's capacity for enforcing temporal consistency between frames. This step will most likely lead to substantial improvements in EF and FS errors;

4. Another idea is to utilize uncertainty estimation models to account for the large number of low-quality images in the dataset. This could be achieved by teaching the network to predict an uncertainty level for its outputs based on the quality of the corresponding input images.

Bibliography

[1] M. Avendi, A. Kheradvar, and H. Jafarkhani. A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Medical Image Analysis, 30:108–119, 2016. → page 24

[2] W. Bai, O. Oktay, M. Sinclair, H. Suzuki, M. Rajchl, G. Tarroni, B. Glocker, A. King, P. M. Matthews, and D. Rueckert. Semi-supervised learning for network-based cardiac MR image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 253–260. Springer, 2017.

[3] W. Bai, M. Sinclair, G. Tarroni, O. Oktay, M. Rajchl, G. Vaillant, A. M. Lee, N. Aung, E. Lukaschuk, M. M. Sanghvi, et al.
Automated cardiovascular magnetic resonance image analysis with fully convolutional networks. Journal of Cardiovascular Magnetic Resonance, 20(1):65, 2018. → page 24

[4] Blausen.com staff. Medical gallery of Blausen Medical 2014. WikiJournal of Medicine, 1(2), 2014. ISSN 2002-4436. doi:10.15347/wjm/2014.010. → page 7

[5] S. Chen, K. Ma, and Y. Zheng. TAN: Temporal affine network for real-time left ventricle anatomical structure analysis based on 2D ultrasound videos. arXiv preprint arXiv:1904.00631, 2019. → page 25

[6] S. Cornegruta, R. Bakewell, S. Withey, and G. Montana. Modelling radiological language with bidirectional long short-term memory networks. arXiv preprint arXiv:1609.08409, 2016. → page 23

[7] F. T. Dezaki, Z. Liao, C. Luong, H. Girgis, N. Dhungel, A. H. Abdi, D. Behnami, K. Gin, R. Rohling, and P. Abolmaesumi. Cardiac phase detection in echocardiograms with densely gated recurrent neural networks and global extrema loss. IEEE Transactions on Medical Imaging, 38(8):1821–1832, 2018. → page 25

[8] X. Du, S. Yin, R. Tang, Y. Zhang, and S. Li. Cardiac-DeepIED: Automatic pixel-level deep segmentation for cardiac bi-ventricle using improved end-to-end encoder-decoder network. IEEE Journal of Translational Engineering in Health and Medicine, 7:1–10, 2019. → page 25

[9] T. A. Foley, S. V. Mankad, N. S. Anavekar, C. R. Bonnichsen, M. F. Morris, T. D. Miller, and P. A. Araoz. Measuring left ventricular ejection fraction: techniques and potential pitfalls. Eur Cardiol, 8(2):108–114, 2012. → page 19

[10] A. Gilbert, M. Holden, L. Eikvil, S. A. Aase, E. Samset, and K. McLeod. Automated left ventricle dimension measurement in 2D cardiac ultrasound via an anatomically meaningful CNN approach. In Q. Wang, A. Gomez, J. Hutter, K. McLeod, V. Zimmer, O. Zettinig, R. Licandro, E. Robinson, D. Christiaens, E. A. Turk, and A. Melbourne, editors, Smart Ultrasound Imaging and Perinatal, Preterm and Paediatric Image Analysis, pages 29–37, Cham, 2019. Springer International Publishing.
ISBN 978-3-030-32875-7. → pages viii, ix, xii, 24, 63, 65, 66, 67, 68, 69, 73

[11] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005. ISSN 0893-6080. doi:https://doi.org/10.1016/j.neunet.2005.06.042. URL https://www.sciencedirect.com/science/article/pii/S0893608005001206. IJCNN 2005. → page 23

[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–80, 12 1997. doi:10.1162/neco.1997.9.8.1735. → page 23

[13] M. H. Jafari, H. Girgis, Z. Liao, D. Behnami, A. Abdi, H. Vaseli, C. Luong, R. Rohling, K. Gin, T. Tsang, et al. A unified framework integrating recurrent fully-convolutional networks and optical flow for segmentation of the left ventricle in echocardiography data. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 29–37. Springer, 2018. → page 25

[14] R. M. Lang, L. P. Badano, V. Mor-Avi, J. Afilalo, A. Armstrong, L. Ernande, F. A. Flachskampf, E. Foster, S. A. Goldstein, T. Kuznetsova, P. Lancellotti, D. Muraru, M. H. Picard, E. R. Rietzschel, L. Rudski, K. T. Spencer, W. Tsang, and J.-U. Voigt. Recommendations for Cardiac Chamber Quantification by Echocardiography in Adults: An Update from the American Society of Echocardiography and the European Association of Cardiovascular Imaging. European Heart Journal - Cardiovascular Imaging, 16(3):233–271, 02 2015. ISSN 2047-2404. doi:10.1093/ehjci/jev014. URL https://doi.org/10.1093/ehjci/jev014. → page 19

[15] R. M. Lang, L. P. Badano, V. Mor-Avi, J. Afilalo, A. Armstrong, L. Ernande, F. A. Flachskampf, E. Foster, S. A. Goldstein, T. Kuznetsova, P. Lancellotti, D. Muraru, M. H. Picard, E. R. Rietzschel, L. Rudski, K. T. Spencer, W. Tsang, and J.-U. Voigt. Recommendations for Cardiac Chamber Quantification by Echocardiography in Adults: An Update from the American Society of Echocardiography and the European Association of Cardiovascular Imaging.
European Heart Journal - Cardiovascular Imaging, 16(3):233–271, 02 2015. ISSN 2047-2404. doi:10.1093/ehjci/jev014. URL https://doi.org/10.1093/ehjci/jev014. → page 18

[16] M. Li, W. Zhang, G. Yang, C. Wang, H. Zhang, H. Liu, W. Zheng, and S. Li. Recurrent aggregation learning for multi-view echocardiographic sequences segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 678–686. Springer, 2019. → page 25

[17] Z. Liao, H. Girgis, A. Abdi, H. Vaseli, J. Hetherington, R. Rohling, K. Gin, T. Tsang, and P. Abolmaesumi. On modelling label uncertainty in deep neural networks: Automatic estimation of intra-observer variability in 2D echocardiography quality assessment. IEEE Transactions on Medical Imaging, 39(6):1868–1883, 2020. doi:10.1109/TMI.2019.2959209. → page 69

[18] Michigan Medicine. Anatomy of a human heart. https://healthblog.uofmhealth.org/heart-health/anatomy-of-a-human-heart, Feb 2019. Accessed: 2021-01-05. → page 1

[19] F. Milletari, N. Navab, and S.-A. Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016. → page 24

[20] T. A. Ngo, Z. Lu, and G. Carneiro. Combining deep learning and level set for the automated segmentation of the left ventricle of the heart from cardiac cine magnetic resonance. Medical Image Analysis, 35:159–171, 2017. → page 24

[21] A. Nibali, Z. He, S. Morgan, and L. Prendergast. Numerical coordinate regression with convolutional neural networks. arXiv preprint arXiv:1801.07372, 2018. → page 65

[22] O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. A. Cook, A. De Marvao, T. Dawes, D. P. O'Regan, et al. Anatomically constrained neural networks (ACNNs): application to cardiac image enhancement and segmentation. IEEE Transactions on Medical Imaging, 37(2):384–395, 2017. → page 24

[23] C. Payer, D. Štern, H. Bischof, and M. Urschler.
Regressing heatmaps for multiple landmark localization using CNNs. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 230–238. Springer, 2016. → page 24

[24] T. Pfister, J. Charles, and A. Zisserman. Flowing ConvNets for human pose estimation in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 1913–1921, 2015. → page 24

[25] C. Qin, W. Bai, J. Schlemper, S. E. Petersen, S. K. Piechnik, S. Neubauer, and D. Rueckert. Joint learning of motion estimation and segmentation for cardiac MR image sequences. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 472–480. Springer, 2018. → page 25

[26] Richfield, David. Medical gallery of David Richfield. WikiJournal of Medicine, 1(2), 2014. ISSN 2002-4436. doi:10.15347/wjm/2014.009. → page 8

[27] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015. → pages 24, 64, 68

[28] N. Savioli, M. S. Vieira, P. Lamata, and G. Montana. Automated segmentation on the entire cardiac cycle using a deep learning work-flow. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 153–158. IEEE, 2018. → page 25

[29] M. Sofka, F. Milletari, J. Jia, and A. Rothberg. Fully convolutional regression network for accurate detection of measurement points. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 258–266. Springer, 2017. → pages viii, ix, xii, 25, 63, 64, 65, 67, 68, 69

[30] L. E. Teichholz, T. Kreulen, M. V. Herman, and R. Gorlin. Problems in echocardiographic volume determinations: Echocardiographic-angiographic correlations in the presence or absence of asynergy. The American Journal of Cardiology, 37(1):7–11, 1976.
ISSN 0002-9149. doi:https://doi.org/10.1016/0002-9149(76)90491-4. URL https://www.sciencedirect.com/science/article/pii/0002914976904914. → page 21

[31] A. Thorstensen, H. Dalen, B. H. Amundsen, S. A. Aase, and A. Stoylen. Reproducibility in echocardiographic assessment of the left ventricular global and regional function, the HUNT study. European Journal of Echocardiography, 11(2):149–156, 12 2009. ISSN 1525-2167. doi:10.1093/ejechocard/jep188. URL https://doi.org/10.1093/ejechocard/jep188. → page 24

[32] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, pages 1799–1807, 2014. → page 24

[33] H. Wei, H. Cao, Y. Cao, Y. Zhou, W. Xue, D. Ni, and S. Li. Temporal-consistent segmentation of echocardiography with co-learning from appearance and shape. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 1–8. Springer, 2020. → page 25

[34] W. Yan, Y. Wang, Z. Li, R. J. Van Der Geest, and Q. Tao. Left ventricle segmentation via optical-flow-net from short-axis cine MRI: preserving the temporal coherence of cardiac motion. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 613–621. Springer, 2018. → page 25

