UBC Theses and Dissertations


A multifaceted quantitative validity assessment of laparoscopic surgical simulators Kinnaird, Catherine 2004

Full Text

A MULTIFACETED QUANTITATIVE VALIDITY ASSESSMENT OF LAPAROSCOPIC SURGICAL SIMULATORS

By: CATHERINE KINNAIRD
BASc, University of Guelph, 2001

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE IN THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF MECHANICAL ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
July 2004
© Catherine Kinnaird, 2004

Library Authorization

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Catherine Kinnaird    16/07/04
Name of Author (please print)    Date (dd/mm/yyyy)

Title of Thesis: A MULTIFACETED QUANTITATIVE VALIDITY ASSESSMENT OF LAPAROSCOPIC SURGICAL SIMULATORS
Degree: MASc
Department of Mechanical Engineering
The University of British Columbia, Vancouver, BC, Canada
Year: 2004

Abstract

The objective of this work was to design an experimental surgical tool and data acquisition protocol to quantitatively measure surgeon motor behaviour in a human operating room (OR). We want to use expert OR behaviour data to evaluate the concurrent validity of two types of laparoscopic surgical simulators. Current training and evaluation methods are subjective and potentially unreliable, and surgical simulators have been recognized as potential objective training and measurement tools, even though their validity has not been quantitatively established. We compare surgeon motor behaviour in the OR to a ~$50 000 virtual reality simulator and a ~$1 physical "orange" simulator.
It is our contention that if expert behaviour in a simulator is the same as in the OR, then that simulator is a valid measurement tool. A standard laparoscopic surgical tool is instrumented with optical, magnetic, and force/torque sensors to create a hybrid system. We use the hybrid tool in a pilot study to collect continuous kinematics and force/torque profiles in a human OR. We compare the position, velocity, acceleration, jerk, and force/torque profiles of two expert surgeons across analogous tasks in the three settings (OR, VR, and physical) using the Kolmogorov-Smirnov statistic. We find that intra- and intersubject differences within settings are small (D < 0.3), which indicates that experts exhibit the same motor behaviour within each setting. This also helps to validate our choice of performance measures and analysis method. However, we find larger intersetting expert differences (0.3 < D < 1) from the OR to the simulators. We suspect that experts behave the same as each other in all settings, but that OR behaviour is considerably different from simulator behaviour. In other words, for this preliminary study we find that the VR and physical simulators both demonstrate poor performance validity.
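The distributional comparison described in the abstract can be sketched in a few lines. The following Python snippet is illustrative only (it is not the thesis code): it computes the two-sample Kolmogorov-Smirnov statistic D, the maximum vertical distance between two empirical cumulative distribution functions, on synthetic stand-ins for tool-tip velocity profiles.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic D: the maximum vertical
    distance between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
vel_or = np.abs(rng.normal(20.0, 5.0, 2000))  # synthetic OR tool-tip speeds (mm/s)
vel_vr = np.abs(rng.normal(22.0, 6.0, 2000))  # synthetic VR-simulator speeds

D = ks_statistic(vel_or, vel_vr)
print(f"D = {D:.3f}")
```

Identical samples give D = 0 and completely disjoint samples give D = 1, which is the scale behind the D < 0.3 ("similar") and 0.3 < D < 1 ("different") readings quoted above.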
Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
Chapter 1: Introduction and Literature Review
  1.1 Introduction and Objectives
  1.2 Minimally Invasive Surgery
  1.3 Training and Simulators
  1.4 Validity
  1.5 Performance Evaluation
    1.5.1 Assessment Forms
    1.5.2 Performance Measures
  1.6 Multi-faceted Assessment Tools
    1.6.1 Kinematics
    1.6.2 Force/Torque Signatures
    1.6.3 Hybrid Tools and Methods
  1.7 Context Comparisons - Analysis of Performance Measures
  1.8 Project Goals
Chapter 2: Experimental Hybrid Surgical Tool for Multi-faceted Performance Measure Assessment
  2.1 Introduction
  2.2 Tool Design
    2.2.1 Study Design
      Hierarchical Decomposition
      Performance Measures
    2.2.2 Tool Design Considerations
      Kinematics
      Force/Torque
    2.2.3 Equipment Specifications
      Kinematics
      Force/Torque Hybrid System
      Sensor Bracket
      Video
      Surgical Tool
      Simulators
    2.2.4 Component Integration
  2.3 Data Collection
    2.3.1 Data Acquisition Software
    2.3.2 Experimental Protocol
  2.4 Equipment Calibration and Data-Processing
    2.4.1 Kinematics Data Registration and Calibration
      Component Calibration
      OR System Calibration
      Registration
      Repeatability Analysis
    2.4.2 Force/Torque Data Calibration and Registration
      Calibration
      F/T Registration
    2.4.3 Data Synchronization
    2.4.4 Electrosurgical Effects and Data Removal
    2.4.5 Data Fusion
    2.4.6 Data Segmenting and Performance Measure Extraction
      Data Segmenting
      Performance Measure Extraction
  2.5 Discussion and Recommendations
    2.5.1 Discussion
    2.5.2 Recommendations
Chapter 3: Results from a Quantitative Validity Assessment of Two Laparoscopic Surgical Simulators
  3.1 Introduction
  3.2 Methods
    3.2.1 Operating Room
    3.2.2 Simulators
    3.2.3 Performance Measures
  3.3 Context Comparisons
    3.3.1 Kolmogorov-Smirnov Statistic
    3.3.2 Assessing Difference and Reliability
      Bootstrapping
      Data Dependency
      Moving Block Bootstrap
      Assigning Confidence Intervals to D-values
      Assessing Difference
  3.4 Results
    3.4.1 ? 1: Intrasubject Intraprocedural OR Comparisons
      Expert 1: Intrasubject Intraprocedural OR
      Expert 2: Intrasubject Intraprocedural OR
    3.4.2 ? 2: Intrasubject Interprocedural OR: Expert 2
    3.4.3 ? 3: Intrasubject VR Simulator Comparisons
    3.4.4 ? 4, ? 5, & ? 6: Intersubject Intrasetting Comparisons
    3.4.5 ? 7: Intrasubject Intersetting Comparisons
    3.4.6 ? 8: Intersubject Intersetting (? 7 Expert 1 versus ? 7 Expert 2)
    3.4.7 ? 9: Intersetting Intersubject (? 4 versus ? 5 versus ? 6)
  3.5 Discussion
    3.5.1 Context Comparisons
      Intraprocedural OR Variability
      Intertrial Intrasetting Variability
      Intersubject Intrasetting
      Intrasubject Intersetting
      Intersubject Intersetting Paired Differences
      Intersetting Intersubject Paired Differences
    3.5.2 Performance Measure Reliability
  3.6 Conclusions
Chapter 4: Conclusions and Recommendations
  4.1 Introduction
  4.2 Review of Present Research
    4.2.1 Experimental Hybrid Surgical Tool
    4.2.2 Data Collection
      Operating Room
      Simulators
    4.2.3 Performance Measures
    4.2.4 Context Comparisons
    4.2.5 Simulator Validity Assessment
  4.3 Relevance of Present Research
  4.4 Future Research Recommendations
    4.4.1 Hardware Improvements
    4.4.2 Software Improvements
    4.4.3 OR Data Acquisition
    4.4.4 Data Analysis and Performance Measure Extraction
  4.5 Future / Concurrent Studies
List of Terms
Bibliography
Appendix A: OR Study Experimental Protocol and Data Acquisition Procedures
Appendix B: Operational Definitions
Appendix C: Sensor Bracket Design Drawings

List of Tables

Table 1.1 - Validity definitions, listed from least to most rigorous to demonstrate
Table 2.1 - The reliability of the transformations
Table 3.1 - The performance measures available from each of the three settings
Table 3.2 - Summary of data trials from each setting
Table B.1 - Hierarchical subtask dissection definition
Table B.2 - Kinematics and force performance measures

List of Figures

Figure 1.1 - A typical MIS operating room
Figure 1.2 - Degrees of freedom of the MIS tool tip
Figure 1.3 - Two types of inanimate simulator models
Figure 1.4 - The KS statistic
Figure 2.1 - Hierarchical decomposition of laparoscopic cholecystectomy
Figure 2.2 - The virtual reality version of the dissection of the cystic duct
Figure 2.3 - The tool tip frame is established with respect to the tool handle
Figure 2.4 - The F/T hybrid system
Figure 2.5 - The F/T hybrid sensor bracket
Figure 2.6 - Sensor component integration for OR data acquisition
Figure 2.7 - Data acquisition software
Figure 2.8 - Data post-processing steps for data from each sensor
Figure 2.9 - Three-dimensional frames from magnetic and optical sensors
Figure 2.10 - We use a calibration rig to set up the tool tip reference frame
Figure 2.11 - MDMArray passive marker face frames
Figure 2.12 - Schematic of experimental tool
Figure 2.13 - Calculation of the grip constant K for force in the x direction
Figure 2.14 - An illustration of how the grip forces are removed
Figure 2.15a - Pitch of the experimental tool shaft
Figure 2.15b - Roll of the experimental tool shaft
Figure 2.16 - Reproduced gravity effects for one sample trial
Figure 2.17 - Schematic of experimental tool for F/T data registration
Figure 2.18 - Data synchronization protocol
Figure 2.19 - Visual examination of registered kinematics data
Figure 2.20 - Raw strain gauge data from OR experiment #3
Figure 2.21 - Raw and filtered data from OR experiment #3
Figure 2.22 - Raw F/T data in one direction affected by cautery
Figure 2.23 - Cautery filtered magnetic data
Figure 2.24 - Projection of the tool path vector on the tool axis vectors
Figure 3.1 - Study design for statistically assessing simulator validity
Figure 3.2 - Surgical steps where the experimental hybrid tool is used
Figure 3.3 - Tool tip reference frame
Figure 3.4a - Representation of the Kolmogorov-Smirnov statistic D
Figure 3.4b - Representation of the KS statistic
Figure 3.5 - Sample autocorrelation for absolute velocity data for one OR trial
Figure 3.6 - The MBB method breaks the dependent dataset that is to be resampled
Figure 3.7 - Assigning a confidence interval to any particular difference value
Figure 3.8 - Finding a CPD of D-values between CPD_RS and CPD_ref
Figure 3.9 - Context comparisons made between subjects and settings
Figure 3.10 - Expert 1 intraprocedure OR trial 1, segments 1 through 3
Figure 3.11 - Expert 1 intraprocedural segment comparisons
Figure 3.12 - Expert 2 intraprocedure OR trial 2, segments 1 through 3
Figure 3.13 - Expert 2 intraprocedural OR trial 2 segment comparisons
Figure 3.14 - Expert 2 intraprocedure OR trial 3, segments 1 through 3
Figure 3.15 - Expert 2 intraprocedural OR trial 3 segment comparisons
Figure 3.16 - Expert 2 intrasubject interprocedural comparisons
Figure 3.17 - Expert 1 intertrial VR CPDs
Figure 3.18 - Expert 2 intertrial VR CPDs
Figure 3.19 - Intrasubject intertrial VR comparisons for both experts
Figure 3.20 - Intersubject performance measure CPDs in the OR
Figure 3.21 - Intersubject performance measure CPDs in the VR simulator
Figure 3.22 - Intersubject performance measure CPDs in the physical simulator
Figure 3.23 - Intersubject intrasetting comparisons
Figure 3.24 - Expert 1 intersetting performance measure CPDs
Figure 3.25 - Expert 2 intersetting performance measure CPDs
Figure 3.26 - Expert 1 intrasubject intersetting comparisons
Figure 3.27 - Expert 2 intrasubject intersetting comparisons
Figure 3.28 - Intersubject intersetting paired differences
Figure 3.29 - Intersetting intersubject paired differences
Figure A.1 - UBC Hospital operating room equipment layout
Figure B.1 - Five levels of the hierarchical decomposition
Figure B.2 - Stage level diagram for cystic duct dissection (CDD) and gallbladder dissection (GBD)
Figure C.1 - SolidWorks design drawings for sensor bracket assembly

Acknowledgements

First and foremost, I would like to thank Dr. Tony Hodgson for his vision, enthusiasm and never-ending optimism in this project. I am particularly grateful to the expert surgeons who participated in this project: Dr. Alex Nagy and Dr. Neely Panton. Your interest was unflagging despite relentless OR experiments, patient consent forms and unfamiliar surgical tools. Many thanks to Marlene Purvy and the staff at UBC Hospital for smiling and putting up with us in an already crowded OR. Thanks also to Brandon Lee for volunteering to help with the sometimes painful sensor bracket design process. To my partners in crime, Joanne Lim and Iman Brouwer: thank you for always being there to help, and for making me laugh on the cold, dreary mornings in the OR when things looked their bleakest. I think the curse has been lifted! Of course, thank you to my friend and colleague Stacy Bullock; I know that I can always count on you to give me an honest opinion on absolutely everything. To Sayra Cristancho, and my other colleagues in the NCL lab: I am passing the torch, enjoy! I am supremely grateful to the CESEI team: Ferooz (Ferocious), Vanessa, Marlene, Humberto and Dr. Qayumi, for their technical and moral support, and all the tasty muffins. Most of all I would like to thank my family. Thank you for always knowing exactly what I needed, from a sympathetic ear to a kick in the pants. I could not have done this without you. Last, but certainly not least, to Carolyn Greaves, Carolyn Sparrey and Christina Niosi: I will always be thankful for your help, advice, friendship, and this serious Starbucks addiction.
Chapter 1: Introduction and Literature Review

1.1 Introduction and Objectives

Minimally invasive surgery (MIS), as opposed to traditional open surgery, has become the approach of choice, accounting for as much as 70% of all high-volume surgeries annually in the United States (http://www.allhealth.edu/mis/). Laparoscopic procedures, the MIS solution to abdominal complaints, have many patient benefits such as less scarring and shorter recovery periods. For surgeons, however, MIS means longer operating times in awkward working conditions. Also, minimally invasive surgical skills are difficult to learn, and there has been increasing pressure on surgeons to reduce costs, prove their ability to operate well, maintain their performance and deliver acceptable results (Darzi 1999, Kohn 1999). Since the skill set required to perform a successful laparoscopic procedure is difficult to acquire, and most training is done in the operating room, patients may be put at risk during the steep learning curve (Hamilton 2001, Gallagher 2001). Simulators have been recognized as a potentially unlimited and safe training resource where novice surgeons could work through the more dangerous portions of the learning curve (Grantcharov 2001). Simulators have been used for decades to train pilots, astronauts, and nuclear power plant operators, and have only recently been used to teach surgical skills in an attempt to emulate the successful pilot training programs (Woodman 1999). Surgical simulators range from simple inanimate physical models to complex three-dimensional, anatomically correct virtual reproductions, and are playing an increasingly important role in education. Nevertheless, until they have been shown to be a valid and reliable measurement source, their contribution to surgical education will remain unknown (Feldman 2004).
The objective of this research was to apply a quantitative assessment of surgical performance in order to appraise the validity of two laparoscopic surgical simulators: virtual reality and inanimate. To accomplish this, specific tasks in the operating room are compared to analogous tasks in the simulated setting. The methods and assessment techniques used here may also be used as a benchmark by which other simulators can be assessed. The larger goal of this work is to assess the validity, the transfer of training, and the minimum technological requirements of a virtual reality surgical simulator. Here, we want to determine whether the simulator is a reliable measurement device and establish an effective assessment technique for other simulators.

1.2 Minimally Invasive Surgery

Minimally invasive surgery (MIS) is a relatively new technique that allows the surgeon to operate through 3 or 4 keyhole-sized surgical incisions. Tiny cameras (endoscopes) and specially designed instruments allow the surgeon to proceed by viewing the operative field on a video screen (Figure 1.1). Laparoscopic surgery is a MIS solution to many abdominal problems, achieving the same results as its open surgical counterparts. Minimally invasive surgery was first conceived at the beginning of the 20th century, but did not garner much enthusiasm until the 1960s with the advent of fiber optics. In 1987 Philippe Mouret performed the first laparoscopic cholecystectomy and the popularity of this MIS procedure soared (Nagy 1992, Poulin 1992). The benefits of laparoscopic surgery include less scarring, shorter recovery time and hospital stays, and reduced postoperative pain. Because these benefits are so great, an estimated 95 percent of gall bladder removals are now done laparoscopically (Franklin 1998).

Figure 1.1: A typical MIS operating room (left, Source: http://www.med.nyu.edu/urology/fgp/programs/mininvasive.html); operative instruments and laparoscopic points of insertion (right, Source: http://iregtl.iai.fzk.de/TRAINER/mic_defl.html)

Although the advantages of MIS to the patient are numerous, there are limitations that must be addressed. The long, straight tools that are inserted into the abdomen have only 4 degrees of freedom (DOF), as compared to the 6 DOF available in an open procedure (Person 2000). This restriction of movement is due to the lateral constraints of the trocars (guide tubes) against the abdominal wall (Figure 1.2). The effect of the body wall on the surgical instruments also leads to the so-called fulcrum effect: when the surgeon's hand moves to the right, the tool moves to the left within the body. This effect is counterintuitive, especially for the novice surgeon (Gallagher 1998). The 2-dimensionality of the video adversely affects the surgeon's depth perception, hampering complex task completion such as knot tying. There is poor tactile feedback in laparoscopic surgery, and there has been some disagreement about the reliability and reproducibility of this feedback (Bholat 1999, Sjoerdsma 1997). The lack of tactile feedback has been shown to significantly lengthen the learning curve for laparoscopy (Rosen 1999), especially if feedback varies from tool to tool. Finally, the instruments and operating room setup force the surgeon to assume awkward positions for extended periods of time, leading to muscle fatigue and pain. Postural and EMG data both support this observation (Berguer 1997, Person 2001, Ullrich 2002). While the limitations of laparoscopy do not outweigh the benefits, they do significantly extend the learning curve and hinder efforts to perform more advanced operations.
Figure 1.2: Degrees of freedom of the MIS tool tip: roll, pitch and yaw about the fulcrum (Source: http://www.spectrumsurgical.com/catalog/laparoscopic.htm)

1.3 Training and Simulators

Despite the increased complexity of the new MIS procedures, surgical training today closely resembles the training techniques of yore. Apprenticeships and the "see one, do one, teach one" aphorism, first introduced by Halsted in 1904 (Halsted bulletin 1904), still govern most training curricula. Given the complexity of today's operative suite this is inappropriate, especially when one considers that most residents graduating from programs in the United States have inadequate laparoscopic experience (Bailey 1991, Dent 1991, Villegas 2004). Aggravating this condition are increased societal pressures to reduce surgical training time, and caps on resident work hours. The potential of simulators to mitigate the risk involved in learning to function in complex settings has long been recognized. The first flight simulator was developed in 1939 (Kaufmann 2001), and today simulators are used to train pilots and other professionals in high-risk environments, such as nuclear power plant operators. Surgery fits this description well, and it seems only natural for surgeon training to emulate the very successful pilot training programs. Traditionally, surgeons have been trained outside of the OR on cadavers and animals. However, cadavers are expensive and do not adequately represent living tissue, so they are of limited use in training programs (Fitzpatrick 2001, Kaufmann 2001). Animal models raise ethical concerns and do not reproduce the biological or anatomical conditions surgeons encounter (Wolfe 1993). There are two types of simulators available: inanimate and virtual reality (VR) (Figure 1.3).
Inanimate simulators are relatively inexpensive and usually take the form of a physical model or a simple benchtop dexterity task that allows the student to practice handling the surgical tools. In more advanced versions, there are full-scale organ or mannequin models. Ethicon Edusurgery released the first "Preceptor" laparoscopic computer-based simulator in 1995 (Villegas 2004). Recent advances in technology have allowed the development of sophisticated virtual reality simulators, with three-dimensional recreations of human anatomy with which the user can interact. Haptic feedback, such as the sense of touch, has been added to some high-end systems in an attempt to improve the fidelity of VR simulators (Satava 2001). Of course, increasing complexity brings higher prices. A typical commercially available VR simulator can cost as much as $50 000, and is typically only used for one type of surgical procedure. It is therefore of great interest to determine how performance degrades with decreasing quality (and price) of VR and inanimate simulators. The effectiveness of both types of simulator is unknown from a purely quantitative point of view, and is assessed in this study. It is hypothesized that since high-end VR simulators are technologically complex, they must be superior to the simpler (and cheaper) inanimate physical models, but a quantitative study comparing various motor behaviours between these two types of simulators and the "real" operating room task has not been attempted until now. In the present study we have examined expert performance in a virtual reality and a physical simulator and compared this performance to that in a real operating room setting. Simulators have two main purposes: training, and evaluation or certification. This raises two primary questions: 1) What should be measured, in terms of performance, for the evaluation of skill? 2) How should validity be demonstrated?
In order to evaluate performance, we need to determine which performance measures are most important and best represent the core set of skills used by an expert surgeon in the operating room (Feldman 2004). This set of skills may very well be the same set needed to prove the validity of a simulator. If behaviour on a simulator is the same as performance in the "real" situation, is the validity of the simulator proven?

Figure 1.3: Top: two types of inanimate simulator models. Top right: sophisticated abdominal model with removable gallbladder. Top left: pick-and-place task simulator. Bottom left: virtual reality haptic interface. Bottom right: virtual reality software.

1.4 Validity

In psychometrics, a valid measure is one that measures what it is supposed to measure (http://www.wikipedia.org/wiki/Validity_(psychometric)). A validated simulator would reliably measure or teach the skills it was designed to teach. Validity tests help to answer the question: "Does proficiency in the simulator correlate with proficiency in reality?" (Berg 2001). In order to be widely adopted as a teaching and assessment tool, a simulator must be shown to be valid. The surgical community's enthusiasm for validity tests is relatively new, whereas validity tests have been used in psychology since the beginning of the century (Gallagher 2003). Flight simulators have been subjected to rigorous validation tests for decades (Blaauw 1982), and similar approaches in surgical education are only now being put to use. Validity is a broad term with many definitions. In 1974 the American Psychological Association developed a set of standards and definitions to guide validity and reliability research studies (APA 1974). We are primarily interested in the different forms of behavioural correspondence validity. Behavioural correspondence validity is concerned with how the human operator treats the simulator as compared to the real world situation.
This may be achieved by comparing the simulator and the real situation during identical tasks and circumstances in terms of human operator behaviour (Blaauw 1982). The types of behavioural correspondence validity (hereafter referred to simply as validity) vary in assessment complexity (Table 1.1) (Reber 1995).

Table 1.1: Validity definitions, listed from the least to the most rigorous to demonstrate.

Validity    | Definition                                                | Who has tried it?
Face        | Expert opinion                                            | Haluck 2001, McCarthy 1999
Content     | Checklist of matching elements                            | Paisley 2001, Schijven 2002
Construct   | Differentiates between skill levels                       | Adrales 2003, Datta 2002, Gallagher 2004, Grantcharov 2002, Mahmood 2003, Taffinder 1998, Schijven 2003
Concurrent  | Correlates with gold standard                             | Ahlberg 2002, Feldman 2004, Grantcharov 2004
Performance | Quantifiable performance measures same as "real" setting  | Present study
Predictive  | Predicts future results                                   | N/A

Face validity is based on expert opinion. An expert would use the simulator and offer a judgment about how well the simulator replicates the real task. Face validity is a subjective measure that is usually assessed in the first phase of validity testing. When an expert uses a checklist of matching elements to reduce rater subjectivity, content validity is tested. Content validity is historically the second step in validity evaluation and is simply an extension of face validity. Schijven had 120 surgeons practice on, and fill out a questionnaire about, the Xitact LS500 simulator. He found that, based on surgeon opinion, the simulator was a good representation of the laparoscopic surgical tasks and would be an important and useful tool in a surgical curriculum (Schijven 2002). Unfortunately, the simplest forms of validity to demonstrate are also the most subjective, and evaluation subjectivity is exactly the problem that simulators are supposed to address. When a test is able to discriminate between skill levels, construct validity is shown.
On a simulator, an expert should show a marked improvement over a novice's performance on analogous tasks. Construct validity is the most common validity test applied to surgical simulators. This test is still relatively simple and does not require the opinion of an evaluator. Mahmood showed that operators who differ in clinical experience and technical ability (determined by years of experience) also differ in their performance during simulated colonoscopy (Mahmood 2003). Concurrent validity is proven when test performance correlates with the current gold standard. In the case of surgical simulators, the gold standard is performance in the operating room. Feldman and Grantcharov showed that performance on a simulator correlated with performance in the operating room by using performance-specific checklists (Feldman 2004, Grantcharov 2004). This approach is extremely time consuming and still imparts a significant amount of subjectivity on the test due to the checklist component. Lastly, predictive validity is the most difficult to determine effectively. Once predictive validity has been demonstrated, the test in question can be used to predict future performance. Predictive validation entails high stakes, as decisions about the future of surgeons may be made based on simulator performance. If a simulator were shown to be predictively valid, then it could theoretically be used to "weed out" students early in their training (Gallagher 2003). In the present work we introduce performance validity, a new definition to add to the validity list. Performance validity is simply an extension of concurrent validity. We quantitatively assess whether measurable quantities of performance in the OR (e.g. velocity and force profiles) are the same as those in a surgical simulator. It is our contention that if behaviour across two settings is the same, then we will have established validity.
When subjected to validity tests, not all simulators lead to better performance or measure what they claim to measure. This demonstrates the need to establish validity before simulators are incorporated into any program. A simulator may fail a validity test for several reasons (Berg 2001):

a) The simulator is not testing surgical skill but eye-hand coordination, manual dexterity, et cetera
b) Manual dexterity is not necessarily correlated with surgical skill
c) Test subject bias: the expert surgeon is not doing as many surgeries as in the past and does not perform accordingly
d) The subjects may under-perform due to "exam stress"

To the best of our knowledge this work is the first effort to quantitatively assess human behaviour in the operating room and compare this performance to that on analogous tasks in simulated settings. A validated simulator can be used with confidence for teaching, assessment and evaluating new tool designs.

1.5 Performance Evaluation

In the future, surgical simulators may be used both to train and to evaluate surgeons throughout their careers. In the past decade there has been intense scrutiny of the current methods of training and assessment of surgeons, due in part to several cases where poor outcomes were attributed to inadequate technical skills (Darzi 2001). Current assessments of surgical performance in the operating room are subjective and potentially unreliable (Rosser 1998). A reliable, standardized method of establishing competence, although recognized by all as imperative, is a controversial issue. Concerns about fairness and defensibility must all be addressed (Darzi 2001).

1.5.1 Assessment Forms

The current gold standard in surgical assessment is performance checklists. Performance checklists, or structured skills assessment forms, have been used in an attempt to objectify assessment by identifying key areas for successful completion of a task.
Checklists have been applied to video data (Eubanks 1999), as well as during bench station simulations (Winckel 1994). Objective Structured Assessment of Technical Skill (OSATS) is a set of operation-specific checklists used to formally assess discrete segments of surgical tasks using bench model simulations (Martin 1997). A significant body of work demonstrates the validity and reliability of OSATS (Datta 2002, Faulkner 1996, Goff 2002, MacRae 2000, Reznick 1997). These forms do not represent a truly objective assessment because human opinion is still required for their application. Research indicates that global rating scales, when administered by experts, are more reliable than highly structured checklists. This seems like a counter-intuitive result. The strength of highly structured checklists may also be their greatest weakness, in that thoroughness and step-wise thinking are rewarded, while efficiency and independent thought are penalized (Reznick 1997). Also, expert surgeons often have different opinions on the relative importance of certain technical skills (Baldwin 1999), indicating another problem inherent to performance checklists.

1.5.2 Performance Measures

A variety of performance measures have been used to evaluate surgical performance. The most popular is time, because it is an easy metric to gather and analyze. Time has been used as an indicator of the quality and efficiency of learning surgical skills. It has been shown that time to complete a task decreases as experience increases (Chung 1998, Coleman 2002, den Boer 2001, Hasson 2001, Rosser 1997, Weghorst 1998, Welty 2003). Time has been criticized when used as a lone measure of human performance: an increase in competency is not always represented by a decrease in task completion time (Keele 1973, Childs 1980). Error frequency has also been used, by means of a Human Reliability Analysis (Hanna 1997, Joice 1998, Rew 2000, Scott 2000, Seymour 2004).
Joice categorized error using external error modes, or ways that the procedure may be performed erroneously, and from that data some inferences were made about the psychological factors causing the error. Oto recognized that many of the criticisms in laparoscopic surgery are of the "too near, too far" variety, and used fuzzy logic to categorize performance on a simulator (Oto 1995). The Imperial College Surgical Assessment Device (ICSAD) utilizes an electromagnetic tracker attached to the surgeon's hand to track hand movements on a standardized task (Grober 2003, Taffinder 1998, Smith 2002). Forces and torques exerted by the tools on operative tissues have also been examined, both in the form of grip force and tool tip forces (de Visser 2002, Morimoto 1997). Rosen's group has done extensive work using force/torque signatures to evaluate performance in a porcine model. Rosen uses Markov modeling and a task decomposition to categorize performance levels and establish construct validity (Rosen 2001). Event sequencing and task analysis organize the complex task and evaluate performance based on the subjects' transitions between states (Mehta 2002). McBeth demonstrated that a multi-faceted approach to assessment, including event sequencing, completion time, postural data and tool tip motion analysis, was feasible in the evaluation of human performance (McBeth 2002). McBeth was one of the first to take his assessment methods to the operating room. A multi-faceted approach to surgical competence has been recognized as the only way to decisively assess any measurement device (Bann 2003, Chaudry 1999, Darzi 2001, Mehta 2002, Smith 2001).

1.6 Multi-faceted Assessment Tools

In order to answer the question of validity and gather the performance measures that will allow us to compare behaviour across different settings, we need a multi-faceted assessment tool to gather operating room data.
1.6.1 Kinematics

For this study we have built on previous work in our laboratory by McBeth to gather kinematics data for the tool tip during a laparoscopic surgery (McBeth 2002). For our analysis we required a high frequency tracking system capable of returning three-dimensional position and orientation information. Kinematics data for assessing surgical skill can be acquired using several different types of motion-capture devices, including accelerometers and optoelectronic, magnetic, and ultrasonic trackers. Optoelectronic systems are commonly used to track human movements because they are easily sterilized, accurate and lightweight (Berguer 2002, Birkfellner 1998, McBeth 2002). Accuracy and sterilizability are crucial aspects of a tracking system used in a surgical environment. Passive systems require no wire attachments, but can be affected by ambient light sources. These systems also depend on a line of sight between the marker and camera. McBeth used an optical tracking system to track tool tip kinematics during a laparoscopic procedure and found that due to line-of-sight problems, some procedures recorded virtually no data, which led to unreliable results. Also, the Polaris optoelectronic system is limited by a 30 Hz sampling rate, resulting in poor reproductions of the velocity and acceleration profiles. For this project we wanted to improve McBeth's tracking method to get a continuous, high frequency kinematics data stream. One type of tracking system available to us that appears to overcome the drawbacks of the optoelectronic system is an electromagnetic tracking system. Electromagnetic systems are popular, commercially available motion-tracking devices due to their relatively low cost and continuous high frequency sampling. Some disadvantages of magnetic systems include fixed wire attachments to the tool interface unit, and static metal interference that affects the accuracy of measurements.
Several groups have used electromagnetic tracking systems in a surgical environment, but all report accuracy and drift issues (Bann 2003, Datta 2002, Milne 1996, Smith 2002). Extensive calibrations have been done to overcome the metal interference in an operating room environment, with poor results (Birkfellner 1998). For this reason, it was determined that the magnetic system could not be used alone to track surgeon movement in the OR. For this study a hybrid kinematics sensor incorporating optoelectronic and magnetic tracking systems was used to overcome the line of sight and low sampling rate problems of the optical sensor and the accuracy issues of the magnetic tracker. Using the hybrid sensor we can get a continuous high frequency dataset that is better than what either sensor could provide individually. In this project kinematics data from both sensors are fused by shifting and warping the high frequency magnetic data over the higher accuracy, lower frequency optical data "fixes".

1.6.2 Force/Torque Signatures

We gathered high frequency and continuous force/torque data of the tool tip and tool handle using a technique similar to Rosen's (Rosen 2001). Rosen mounted a custom-designed tri-axial force sensor to the shaft of a surgical tool to measure forces and torques of the tool tip. Rosen's group used this tool in a porcine model ("pig lab") and used a statistical technique to differentiate a novice from an expert surgeon based on the force/torque signatures. To our knowledge, this is the first attempt at measuring forces and torques of a minimally invasive surgical tool in a human OR. For this study we used a commercially available tri-axial strain gauge transducer capable of measuring forces and torques in three dimensions. ATI Industrial Automation Inc. produces transducers of this type in different sizes and for a variety of applications. This sensor is sterilizable and relatively lightweight.
A transducer bracket that is easily sterilized and allows for the transference of forces from the tool shaft to the sensor is required. Strain gauges on the handle of the surgical tool measure perpendicular surgeon grip force. Rosen's group in Washington also used a single-axis transducer on the handle of their tool to measure grip forces and calibrate their force transducer.

1.6.3 Hybrid Tools and Methods

The kinematics and force/torque hybrid systems were incorporated with a standard laparoscopic dissector tool. To the best of our knowledge this is the first tool of its type designed to gather surgeon performance measures. Other groups have examined forces and torques in pig models and benchtop tasks (Hu 2004, Rosen 2000). Kinematics alone has also been examined (Birkfellner 1998, Datta 2002, McBeth 2001, Smith 2001). However, while a multi-faceted assessment tool has been recognized as an important requirement for building a new surgical education protocol (Darzi 2001, Gallagher 2003), it had yet to be designed until this project. Human operating room data acquisition protocols were designed based on previous work in our laboratory (McBeth 2001). These protocols were modified to collect from the sensors mounted to the surgical tool. Tool calibration techniques were also designed to gather the data necessary to synchronize and register the sensor data to a common time and reference frame (the surgical tool tip).

1.7 Context Comparisons - Analysis of Performance Measures

Assuming that we are able to measure kinematics, forces and torques of the surgical tool in a human operation, we want to compare proficiency in the operating room to performance on two types of simulators: virtual reality and physical. A surgical procedure is an extremely complex and variable task, and in order to organize data analysis and assessment of performance, the task is decomposed into smaller and more manageable sections.
Surgical simulators commonly teach discrete components of the complex task in a very structured environment, and decomposition will give us analogous data blocks in the surgery to be compared to simulator performance. The hierarchical approach to complex task training and assessment relies on the theory that each subtask level depends on the skills learned in earlier subtasks (Fabiani 1989). Comparisons of individual dissection tasks are made across different settings; in this case, they are made between the operating room and simulator. Hierarchical decompositions have been created for many surgical procedures in order to facilitate a more thorough understanding of those procedures. Cao's work produced a decomposition of a laparoscopic cholecystectomy to be used in performance assessment, tool evaluations and training (Cao 1999). When comparing across contexts an appropriate measure of differences is required. The Kolmogorov-Smirnov (KS) statistic is very useful for comparing measures across different settings when the underlying distributions are unknown. This statistic measures the maximum absolute difference between two cumulative distributions, referred to as the D value (Figure 1.4). The D value varies between 0 and 1, with zero indicating identical distributions and one representing the maximum possible difference between them. The KS statistic has commonly been used to determine whether two samples originate from the same underlying distribution. This is given by the P value: the closer the P value is to zero, the less likely it is that the samples come from the same original distribution. The benefit of the KS statistic lies in the fact that it does not make any a priori assumptions about the shape of the input distributions. The KS statistic is internally normalized, so we are able to compare performance measures of different magnitudes by using their cumulative probability distributions.
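The D statistic is straightforward to compute from the two samples' empirical cumulative distributions. The following is a minimal pure-Python sketch (libraries such as SciPy's `ks_2samp` also return the associated P value):

```python
import bisect

def ks_d_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D statistic: the maximum absolute
    difference between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample with values <= x
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    # The maximum difference can only occur at an observed value.
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in set(a) | set(b))

d_identical = ks_d_statistic([1, 2, 3, 4], [1, 2, 3, 4])  # 0.0: identical samples
d_disjoint = ks_d_statistic([1, 2, 3], [10, 11, 12])      # 1.0: fully separated
```

Because the statistic is computed on cumulative probabilities rather than raw values, the same comparison works for completion times in seconds and forces in newtons alike.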
Figure 1.4: The KS statistic: D is the maximum vertical difference between two cumulative probability distributions (CPDs).

1.8 Project Goals

The larger goal of this project is to quantify the fidelity of surgical simulators by determining validity and transfer of training as well as the minimum technological simulator requirements. These results may then be used in the design of new simulators, surgical tools and techniques, and surgeon training and evaluation. This work is concerned with quantitatively assessing the validity of two simulators by measuring human performance in a surgical setting and comparing that to behaviour on analogous tasks in simulated settings. The objectives addressed here are:

• Design and build a hybrid tool capable of measuring continuous tool tip kinematics, forces and torques.
• Design a data collection protocol and calibration tests for data synchronization, fusion and analysis.
• Demonstrate an approach to assessing validity by conducting a study comparing motor behaviour of expert surgeons in the operating room, virtual reality simulator and physical simulator.

1.9 Outline

Chapter 2 describes the design of the experimental surgical tool capable of collecting kinematics and force/torque signatures, and the methods followed to collect and analyze OR data to obtain performance measures analogous to simulator tasks. The performance measures used to evaluate the surgeon in the operating room and simulators are described. The hybrid tool and the equipment used to gather the data in the operating room are shown. Operating room protocol is discussed to outline the steps executed to successfully gather OR data. The post-processing section has a thorough description of the registration, synchronization, fusion and filtering techniques used on the raw data in order to extract useful performance measures. Experimental results of comparisons between the operating room and simulators are given in Chapter 3.
The reliability of the performance measures used is analyzed, and an evaluation of the impact of procedural and subject variability is given as well. The final conclusions of this work are presented in Chapter 4. Clinical utility and recommendations for future studies are included. Conclusions are made with respect to concurrent complementary studies and the impact of this work on surgical simulation and education.

Chapter 2: Experimental Hybrid Surgical Tool for Multi-faceted Performance Measure Assessment

2.1 Introduction

More than two thirds of all gallbladder removals are now done laparoscopically (Kologlu 2004). Laparoscopic procedures, while offering many benefits to patients, require skills that are more difficult to learn than many of their open surgical counterparts. However, the training and evaluation curricula have not kept pace with the technological complexity of today's operative suite, and due to the nature of surgery, hands-on practice is critical. The current approach of training on patients carries intrinsic risks that society is no longer comfortable assuming. Therefore, many training programs are looking towards simulators, both virtual reality and inanimate (physical), as unlimited training and evaluation devices. Nevertheless, until simulators have been quantitatively shown to measure and teach what they purport, their contribution to any training program is unknown. This chapter outlines the design and implementation of an experimental surgical tool that is used to collect continuous tool tip kinematics, forces and torques. This tool is used to quantitatively evaluate the performance validity of two types of surgical simulators: virtual reality and physical. The study method provides the foundation for the design and integration of the experimental tool.
Since a surgical procedure is a highly variable task, we used a hierarchical decomposition (task analysis) of the laparoscopic procedure to provide an organizational framework for the analysis. From various levels of the decomposition, performance measures will be extracted to evaluate surgeon motor behaviour. A detailed description of the decomposition and the performance measures chosen is given in Section 2.2.1. There are several concerns that have to be addressed when designing a surgical tool for human use. These criteria, which include providing sufficient working length of the surgical tool, wire management, electrical current isolation and sterilizability, were accommodated to reach a final tool design. This work was done in conjunction with another student, Joanne Lim, and complementary details are given in her thesis (Lim 2004). The experimental protocol for gathering the OR and simulator data is outlined in Section 2.3. Rigorous post-processing of the data is necessary before extracting any performance measures. The data must be registered and synchronized to a common reference and time frame. The kinematics and force/torque data are fused to get continuous high frequency streams. Finally, the data are filtered to remove noise and the performance measures are extracted. A description of all calibration and post-processing steps is given in Section 2.4. Finally, this chapter ends with a discussion about the methods applied here in the design and application of the experimental hybrid surgical tool. I have also included a set of software and hardware recommendations to help improve the system for future studies. The experimental tool described is used to evaluate the performance validity of VR and physical simulators. This tool will be used in future studies to automate performance evaluation and facilitate quicker analyses of simulator validity and tool designs.

2.2 Tool Design

The following sections describe the design process for creating our experimental surgical tool.
The study design reduces the complexity of a surgical procedure through task decomposition. Decomposition allows us to find tasks in the OR that are analogous to tasks in the simulators. Comparisons can then be made across settings and contexts. In consultation with expert surgeons, we created a set of critical performance measures. Data acquisition equipment choices were made based on the desired performance measures.

2.2.1 Study Design

Hierarchical Decomposition

A surgical procedure is a highly complex and variable task in a volatile environment. We are attempting to assess the validity of surgical simulators by comparing expert surgeon behaviour in the simulators and human OR. In the OR studies, data were gathered using an experimental surgical tool from three laparoscopic cholecystectomies (gallbladder removals), with no a priori selections as to patient or surgical staff. This served to further increase the variability between surgical cases. We selected two simulators, VR and physical, in which to analyze expert performance. In contrast to the OR, the VR surgical simulator used provides a highly structured teaching and evaluation environment. This simulator consists of teaching modules where task types, such as clipping and dissecting, are performed separately. The physical simulator is a dissection task in a standard box trainer and provides a more structured environment than the OR, but less structure than the VR simulator. We address this difference in structure with a task analysis, or hierarchical decomposition. Because of the intrinsic variability in surgical procedures we want to divide the OR data into segments that are analogous to tasks on the simulators. To accomplish this we use a hierarchical decomposition approach based on Cao's work (Cao 1999). The decomposition provides a structured framework for classifying the steps of a cholecystectomy.
Each level of the decomposition (Figure 2.1) is increasingly detailed, from the overall goals of the procedure down to the level of motions of each tool (Cao 1999). The decomposition consists of five levels. The Phase level (1) describes the global goals of the procedure, such as gallbladder removal. The Stage level (2) is concerned with the local goals within each global goal, while the Task level (3) deals with the use of a single tool within each Stage. The Subtask level (4) describes the movements of each tool, and the Action level (5) describes the kinematics features of each movement of an individual tool (McBeth 2001). This five-level decomposition provides an organizational framework by which the OR data can be segmented and compared to performance on the simulators.

Figure 2.1: Hierarchical decomposition of laparoscopic cholecystectomy, modified from Cao's work by McBeth for this project (McBeth 2001). (CD: Cystic Duct, CA: Cystic Artery, GB: Gallbladder)

In the OR and physical simulators we use an experimental surgical laparoscopic tool to collect continuous motor behaviour data. The experimental surgical tool tip used for this study is a Maryland dissector. Therefore we are limited to the Task, Subtask and Action levels of the decomposition within the dissection Stages of the procedure. The tool is used primarily in dissecting and isolating the cystic duct and artery. The OR dissection tasks are considered to be the same as the cystic duct dissection module on the VR simulator (Figure 2.2). The orange dissection on the physical simulator is analogous to both the OR and VR dissection tasks. By comparing motor behaviour on these specific tasks we can assess the validity of the VR and physical simulators.

Figure 2.2: The virtual reality software version of the dissection of the cystic duct is shown on the left, while a picture of a "real" cystic duct dissection is shown on the right.
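To make the segmentation concrete, the five-level decomposition can be represented as a nested structure from which the analogous data segments are pulled. The fragment below is illustrative only; the labels are examples for this sketch, not Cao's full taxonomy:

```python
# Illustrative fragment of a five-level hierarchical decomposition.
decomposition = {
    "phase": "gallbladder removal",                      # level 1: global goal
    "stages": [{
        "stage": "isolate cystic duct",                  # level 2: local goal
        "tasks": [{
            "task": "dissect with Maryland dissector",   # level 3: single tool
            "subtasks": [{
                "subtask": "spread tissue",              # level 4: tool movement
                "actions": ["reach", "orient", "grasp"], # level 5: kinematics features
            }],
        }],
    }],
}

def actions_in(decomp):
    """Collect all level-5 actions under a phase, e.g. to pull out the
    OR data segments analogous to a simulator dissection module."""
    return [action
            for stage in decomp["stages"]
            for task in stage["tasks"]
            for subtask in task["subtasks"]
            for action in subtask["actions"]]
```

Walking the tree in this way yields exactly the dissection segments that are compared to the simulator modules.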
Performance Measures

Before we can select the equipment to collect expert OR motor behaviour data, a clear understanding of the performance measures for successful task completion is necessary. There are many performance measures from which to choose to evaluate surgeon performance. In consultation with expert surgeons and the literature (Cotin 2002, Gallagher 2003, Villegas 2003) we have chosen tool tip kinematics, tool tip forces and torques, and completion time. Previous work in our lab has successfully examined postural joint angles and event sequencing of a laparoscopic procedure, and these measures can easily be re-implemented in future work. Future studies will involve assessing error frequency and automating the task decomposition process to find the desired performance measures. The following sections outline the performance measures used in this study. All of the performance measures discussed can be calculated from the VR and physical simulator data.

Time

Time is the most commonly evaluated performance measure because it is relatively easy to compute (Derossis 1998, Fried 1999, Hanna 1998, Hodgson 1999, Keyser 2000, McBeth 2001). Results from these studies indicate that novices take longer on real and simulated surgical tasks than experts. However, a decrease in task completion time does not necessarily correlate with an increase in skill level (Childs 1980). Time is not sufficient as the sole performance measure as it does not capture many important aspects of motor performance, and as such, is only one of the performance measures in this study.

Kinematics

Use of surgeon movement kinematics to assess or evaluate performance is a relatively new concept. Torkington used an electromagnetic tracker to capture surgeon hand movements in real and simulated tasks, and Birkfellner's group has been developing a hybrid system to track tool tip kinematics in the OR (Birkfellner 1998, Torkington 2001).
McBeth used an optical tracking system to gather tool tip kinematics, and we improve his method by modifying the system to obtain continuous, high frequency position, velocity, acceleration, and jerk profiles of the surgical tool tip (McBeth 2001). We use the absolute position of the tool tip vector (Figure 2.3) with respect to a world coordinate frame to calculate a variety of continuous kinematics performance measures, including distance from mean, velocity, acceleration, and jerk.

Figure 2.3: The tool tip Cartesian reference frame (x, y, z) was established with respect to the tool handle where the kinematics sensors are mounted.

Force/Torque Signatures

Appropriate use of force while handling tissues is an important skill for novice surgeons to learn and has been chosen as a performance measure in this study. Force profiles are available from both simulators. Rosen's group in Washington (Rosen 2001) has done some of the first work using force/torque (F/T) signatures to classify surgeon performance in a porcine model. A similar approach to gathering and calibrating the F/T data is applied here to get continuous tool tip force profiles in an OR.

2.2.2 Tool Design Considerations

When the desired performance measures were established, the focus of this work turned towards designing a hybrid surgical tool to collect the necessary data. The experimental surgical hybrid tool had to incorporate the sensors we had available and be capable of collecting continuous high frequency kinematics and F/T data. The tool was designed in conjunction with another student, Joanne Lim, and complementary details about the sensor fusion algorithm and the design of the sensor-mounting bracket can be found in her thesis (Lim 2004). The following sections describe the tool design methods and assembly.

Kinematics

Kinematics data were collected using a hybrid tool incorporating both optoelectronic and magnetic tracking systems.
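The derivative-based measures named in Section 2.2.1 (velocity, acceleration, jerk) can be obtained from a uniformly sampled position stream by numerical differentiation. A minimal sketch using central differences on synthetic 1-D data (in practice the fused data are filtered first, as described in Section 2.4):

```python
def central_difference(samples, dt):
    """Differentiate a uniformly sampled signal with central differences;
    the result is two samples shorter than the input."""
    return [(samples[i + 1] - samples[i - 1]) / (2.0 * dt)
            for i in range(1, len(samples) - 1)]

dt = 1.0 / 120.0  # fused kinematics stream at ~120 Hz
# Synthetic 1-D tool tip position under a constant 9 mm/s^2 acceleration.
position = [0.5 * 9.0 * (i * dt) ** 2 for i in range(10)]
velocity = central_difference(position, dt)      # ramps linearly
acceleration = central_difference(velocity, dt)  # ~9 everywhere
jerk = central_difference(acceleration, dt)      # ~0 everywhere
```

Because each differentiation amplifies measurement noise, the jerk profile is the most sensitive of the three to the quality of the underlying position data, which is one reason the low-noise optical "fixes" matter.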
It has been shown that optoelectronic tracking alone often leaves large gaps in the data, reducing our ability to extract useful performance measures (McBeth 2001). Optoelectronic systems are also sensitive to ambient light from OR lamps (Wagner 2002). For our analysis we required continuous high frequency data acquisition of the tool tip. In order to achieve this goal several options were considered:

• Further improvement of the marker array designs.
• Incorporation of another motion tracking system into the current optical system, such as multiple cameras, accelerometers/gyros, Shape Tape, or electromagnetic tracking.

By improving the design of the marker arrays and incorporating multiple cameras there may be fewer marker occlusions, but this will not eliminate them entirely. A sensor fusion scheme including the magnetic tracker system was therefore chosen. The magnetic tracking system was readily available and easily incorporated with the previously established experimental protocol. The resulting system utilizes the accuracy of the optoelectronic system as well as the continuous high frequency data acquisition of the magnetic motion capture system. Electromagnetic trackers have been used extensively in medical applications (Datta 2002, Frantz 2003, Smith 2002), but there are significant limitations when using them in a surgical environment. Magnetic tracking measures suffer from increased variance at greater working volumes due to the decay of the magnetic field (Day 2000, LaScalza 2003). The data from magnetic trackers are also affected by static and dynamic ferrous metal both in and around the working volume of the sensor system. There have been many studies on the calibration of magnetic tracking systems in a surgical environment (Birkfellner 1998, Frantz 2003, Hanada 2002, Perie 2002). These calibration techniques usually involve placing the receiver in known positions and creating a lookup table to "fix" the magnetic data.
This is an extremely time consuming exercise that only calibrates for static metal interference, and does not improve accuracy to an acceptable level (Birkfellner 1998). Sensor fusion has been defined as the combination of sensory data such that the resulting information is better than would be possible if the sources were used independently (Elmenreich 2002). The potential for magnetic sensors to solve the line of sight problems intrinsic to optical tracking has been noted in other studies (Birkfellner 1998, Nakamoto 2000). Birkfellner used calibrated magnetic measures to "fill in" the gaps in the optoelectronic data. This method does not solve our slow sampling rate problem and still involves time-consuming calibrations. Nakamoto used the optical system to dynamically measure the position of the transmitter and receiver within the optical reference frame, permitting the user to move the transmitter intraoperatively and achieve the optimal transmitter-receiver distance. We are trying to cause as little disturbance to the surgical procedure as possible, so this solution is not practical for our study. To fully utilize the power of both systems, we fused both data sets into one continuous high frequency data stream. Our fusion method uses the slower, but more reliable, optoelectronic measures as "fixes", and warps the magnetic data onto the optical data. This method retains the high frequency content of the magnetic data, as well as the high accuracy characteristics of the optical data.

Force/Torque

A large goal of this project was to obtain continuous F/T tissue manipulation profiles of the tool tip. F/T data were collected using a hybrid system consisting of a tri-axial transducer mounted to the tool shaft and a uni-axial transducer on the tool handle. Minimally invasive techniques require a delicate hand, and appropriate use of tools and forces on tissues is an important skill.
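The fix-and-warp fusion described above can be illustrated with a one-dimensional sketch: the optical-minus-magnetic offset is measured at each optical fix and interpolated in between, so the corrected stream keeps the magnetic sampling rate while being pinned to the optical positions. This is illustrative only; the actual implementation (Lim 2004) operates on full six-degree-of-freedom poses.

```python
def interpolate(times, values, t):
    """Piecewise-linear interpolation, clamped at the ends."""
    if t <= times[0]:
        return values[0]
    for i in range(len(times) - 1):
        if times[i] <= t <= times[i + 1]:
            w = (t - times[i]) / (times[i + 1] - times[i])
            return values[i] + w * (values[i + 1] - values[i])
    return values[-1]

def fuse(mag_t, mag_x, opt_t, opt_x):
    """Warp a high-rate magnetic stream onto sparse optical 'fixes'."""
    # Offset between each accurate optical fix and the magnetic estimate.
    offsets = [x - interpolate(mag_t, mag_x, t) for t, x in zip(opt_t, opt_x)]
    # Apply the interpolated correction to every magnetic sample.
    return [x + interpolate(opt_t, offsets, t) for t, x in zip(mag_t, mag_x)]

# Magnetic stream with a constant 2 mm bias; optical fixes are exact.
mag_t = [i / 8.0 for i in range(9)]   # high-rate samples over 1 s
mag_x = [t + 2.0 for t in mag_t]      # true position is x(t) = t
fused = fuse(mag_t, mag_x, [0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
```

In this toy case the interpolated correction is a constant -2 mm, so the fused stream recovers the true trajectory while every high-rate magnetic sample is preserved.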
The F/T system used in this study is similar to one successfully implemented in a pig model by Rosen's Washington group. Their tool has a custom-designed, tri-axial force sensor measuring forces and torques of the tool shaft (Rosen 2001). The sensor used in this study was mounted off-axis to the tool shaft (Figure 2.3), unlike the Washington group's sensor, which had a hole through its centre. Two primary concerns when designing the F/T hybrid system were:

1) Maintaining regular operation of the tool while transmitting the forces felt by the tip to the sensor.
2) Allowing for the use of cautery (a high frequency current used to intraoperatively coagulate tissues) while preserving the electrical isolation of the tool shaft.

The outer shaft of the tool needed to be cut in order to transmit the forces felt along the innermost shaft through the bracket to the sensor. This compromised the isolation coating (the outermost black layer) on the tool. Care was taken to minimize the area of "live" shaft that is exposed, and the bracket design eliminated the risk of accidental contact. The working area of the tool that is inserted into the patient's abdomen was not modified, so as not to introduce any risk to the patient. The F/T sensor is ethylene oxide (gas) sterilized, but no liquid can be allowed in between the layers of the sensor as this may damage and contaminate it. To overcome this problem, before each OR data collection the F/T sensor is wrapped in Opsite, a sterile surgical plastic wrap.

Figure 2.4: The F/T hybrid system. The F/T sensor is mounted to the tool and is capable of reading forces and torques in three dimensions. The grip force can be correlated to the grasping force felt at the tip of the tool.

2.2.3 Equipment Specifications

The equipment used to create the experimental system is described in the sections below.
The Polaris optical tracker and Fastrak magnetic tracker make up the kinematics hybrid system. The F/T hybrid system consists of a tri-axial force transducer and a strain gauge system mounted to the tool handle. A modified surgical tool accommodates the sensors via custom-designed hardware. Video data were also collected for each OR and physical simulator trial to allow us to decompose the tasks.

Kinematics

Optical Tracking System

We use a hybrid system to gather kinematics data, combining an optoelectronic motion capture system at a sampling frequency of ~30 Hz and an electromagnetic tracker at a sampling frequency of ~120 Hz. The commercially available optoelectronic system, the Polaris Hybrid Optical Tracking System made by Northern Digital Inc., is capable of tracking both passive and active arrays in three dimensions. This study employs passive marker arrays: sets of three spherical retro-reflective targets. The Polaris camera supplies the infrared light source to illuminate and track the passive targets. There must be a clear line of sight between the camera and the passive markers. The optoelectronic system is used because it is well accepted by surgeons, accurate (~0.2-0.3 mm), and easily sterilized. Its use is limited, however, by the low sampling frequency and line of sight restrictions. These problems are addressed by the hybrid system. The Polaris system requires a minimum of three passive markers to establish a three-dimensional frame in space. The location and orientation of this reference frame with respect to the camera frame is recorded by the Polaris system. A Multi-Directional Marker Array (MDMArray) was mounted to the experimental surgical tool to track its movement in space. The MDMArray was custom designed by Paul McBeth for this project in order to enhance trackability by increasing the angles at which it may be viewed (McBeth 2001), as compared to standard planar arrays.
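The three-marker minimum follows from the geometry: three non-collinear points are the fewest that fix a rigid body's position and orientation. The sketch below shows one standard way to construct an orthonormal frame from three marker positions; it is illustrative only, not NDI's actual algorithm, which fits the pose to the full measured array.

```python
def frame_from_markers(p1, p2, p3):
    """Build an orthonormal right-handed frame from three non-collinear
    points: origin at p1, x toward p2, z normal to the marker plane."""
    def sub(a, b):
        return [a[i] - b[i] for i in range(3)]

    def cross(a, b):
        return [a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]]

    def unit(a):
        m = sum(c * c for c in a) ** 0.5
        return [c / m for c in a]

    x = unit(sub(p2, p1))            # x axis along marker 1 -> marker 2
    z = unit(cross(x, sub(p3, p1)))  # z axis normal to the marker plane
    y = cross(z, x)                  # y completes the right-handed frame
    return p1, x, y, z

origin, x, y, z = frame_from_markers([0, 0, 0], [1, 0, 0], [0, 1, 0])
```

Each geometrically unique face of the MDMArray defines such a frame, which is why the tracker can report a full 6-DOF pose from whichever face is currently visible.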
The MDMArray has five geometrically unique faces, and can be rotated ~350° longitudinally and ~160° perpendicular to the longitudinal axis. The Polaris system tracks one face of the MDMArray at a time, depending on which is visible to the camera, thereby improving the continuity of location and orientation data during surgical tool movement.

Electromagnetic Tracking System

The electromagnetic system used in this study, as part of our hybrid tool kinematics system, is the Polhemus 3SPACE Fastrak six-degree-of-freedom magnetic measurement system. This system uses a stationary transmitting antenna that emanates low frequency magnetic fields to track a movable receiving antenna (the receiver). The transmitter and receiver each have three mutually orthogonal coils. Each loop in the transmitter is excited sequentially, and the magnetic fields generated on each of the coils of the receiver are measured. From these measurements, the position and orientation of the receiver with respect to the transmitter are calculated. The magnetic system collects continuous location and orientation data at 120 Hz. It does suffer from relatively poor accuracy (as compared to the optical tracker) and metal interference (both dynamic and static). The product manual quotes an accuracy of 2 mm within the 1 m³ working volume. However, one study found that the quoted accuracy might only be achieved in an OR setting with a transmitter-receiver distance of 22 mm (Milne 1996). Our experience with the Fastrak system supports these results. Using the Polaris optical tracking system in conjunction with the Fastrak magnetic tracking system we are able to get continuous, accurate position and orientation data of the surgical tool tip.

Force/Torque Hybrid System

For this study, force/torque (F/T) signatures of the tool tip are investigated for specific tasks in the OR using a commercially available multi-axis system.
The multi-axis F/T transducer used is the Mini 40 Transducer, supplied by ATI Industrial Automation Inc. This self-calibrating transducer is a monolithic structure that uses tri-axial strain gauge technology to convert forces and torques into analog signals. The transducer strain gauges are mounted to beams that have been machined from a solid piece of metal, which increases the strength of the system. Forces and torques about each of the three axes are recorded at 120Hz in counts per unit force, and streamed directly to a Matlab data file via custom-designed Matlab drivers supplied by Willem Atsma in the Neuromotor Control Lab (Atsma 2001). These F/T data are calibrated and registered to give forces and torques at the tip of the surgical tool. Strain gauges are mounted in a half bridge configuration to the handle of the surgical tool. In this configuration we measure the single-axis, perpendicular grip forces on the handle. These data are used to calibrate the F/T sensor. The full specifications regarding the electronics of the strain gauge transducer system can be found in Lim's thesis (Lim 2004).

Sensor Bracket

A bracket was designed to accommodate the magnetic sensor receiver and the F/T transducer. The bracket was designed in consultation with an expert surgeon and a lab volunteer, Brandon Lee, who detailed the bracket using SolidWorks (Appendix C). Full design specifications and design considerations for the bracket can be found in Joanne Lim's thesis (Lim 2004). The surgical tool was slightly modified to accept this bracket and transmit forces felt along the inner shaft of the tool to the transducer. The bracket consists of two segments, the top and bottom segments (Figure 2.5). The bracket is mounted in between a small cut portion of the outer shaft of the tool. The forces are transferred through the top segment, which attempts to move against the bottom segment, thereby registering a force on the F/T sensor.
A rapid prototyping machine at the British Columbia Institute of Technology (BCIT) was used to construct the bracket out of rigid polyethylene. A non-conductive material is required because of the aforementioned high frequency current that is transmitted through the shaft to facilitate intraoperative tissue cutting and coagulation. The sensors and the user must be isolated from this high voltage current. Finally, we ensured that there are no sharp edges on the bracket so the surgical staff will not cut their surgical gloves on it.

Figure 2.5: The F/T hybrid sensor bracket. The ATI force/torque transducer is mounted below the tool shaft via a custom designed bracket. Tool shaft forces are transferred to the top segment, which then applies a force to the transducer/bottom segment rigid body, and a force is registered on the F/T sensor.

Video

We record video images of each surgery for sensor time synchronization and task analysis. The video images are time stamped and synchronized to the data streams using video editing equipment. The laparoscope (surgical camera) is connected to a VCR and this video is recorded on a standard VHS tape. The laparoscope video provides a view of how the tools were moving inside the abdomen at each point in time. A video camera was also used to collect video data of the surgeon's arm movement outside the patient. The outside video data are used for time synchronizing the sensor data to the laparoscope video.

Surgical Tool

A standard laparoscopic modular tool, supplied by Storz Endoscopy, that is used for cholecystectomy (gall bladder removal) was modified to accept the sensors described above. The tool was chosen because of its modularity. The future goal in our lab is to collect data from all aspects of the procedure and this tool has interchangeable tips. In this work we are only examining dissection tasks, and need one tip, the Maryland dissector tip.
Future work will easily incorporate hook and scissors tips, allowing us to collect data from all aspects of a laparoscopic procedure.

Simulators

Virtual Reality Simulator

The VR simulation system incorporates the Reachin software and Immersion hardware. The Reachin Laparoscopic Trainer is a haptic feedback laparoscopic training program. The Immersion surgical station consists of two laparoscopic tools with interchangeable handles. Each tool has four haptic degrees of freedom and a rotating tip, similar to the experimental tool. The Immersion hardware and Reachin software complement one another well and are combined to form a complete force feedback VR laparoscopic simulator. Learning on the trainer is done in a stepwise fashion through the skill levels. There are cholecystectomy-specific training tasks such as clip and cut, cholangiography and dissection. The stepwise, logical fashion of the simulation software lends itself well to the hierarchical decomposition protocol described earlier and allows us to compare performance between gallbladder dissection tasks in the OR and VR simulator.

Physical Simulator

The subject is asked to use the experimental tool to dissect a mandarin orange in the physical simulation task. Standard laparoscopic equipment, such as a laparoscope, light source and tower, is used to simulate an operative suite. The orange is secured in a standard box trainer, and the subject is asked to dissect the skin from the orange and remove several segments, causing as little damage to the pulp as possible. This task is done using the experimental surgical tool and data are collected in the same fashion as for the OR trials.

2.2.4 Component Integration

The MDMArray used in previous studies was integrated with the sensor bracket (Figure 2.5). The sensor bracket and MDMArray are mounted as close to the handle as possible, to allow sufficient working length for the surgeon.
Standard laparoscopic surgical equipment was used for each OR procedure, including a laparoscopic camera and light supplied by Stryker Endoscopy. A block diagram of the data acquisition equipment and component wiring assembly is shown in Figure 2.6. All equipment used for this study was approved by the Biomedical Engineering Department at Vancouver General Hospital, and sterilized using ethylene oxide, where appropriate.

Figure 2.6: Sensor component integration for OR data acquisition.

2.3 Data Collection

2.3.1 Data Acquisition Software

The data are collected from the optical and magnetic tracking systems using Matlab. Optical data are collected at 30Hz, via a serial port interface, using the existing software package designed specifically for this project (McBeth, 2002). The optical data interface indicates when the passive markers are occluded, allowing the user to improve camera placement prior to data collection. Magnetic data are collected at 120Hz using a graphical user interface on a separate laptop computer; Matlab supports only one serial port object, so collecting from the magnetic sensor requires a separate computer. A data synchronization protocol is followed for each experiment to time-match the sensor data, and is explained in detail later (Section 2.4.3). The strain gauge data are converted from analog to digital using a Measurement Computing PCI data acquisition card. Matlab's Data Acquisition Toolbox supports this DAQ card and allows us to stream the strain gauge readings, at 120Hz, directly to a file. Matlab drivers for the ATI F/T transducer and custom Matlab functions (Atsma 2001) designed to communicate with the transducer ISA data acquisition card are used to collect streaming tool forces and torques at 120Hz. Modifications made to the optical tracking software package allow us to collect optical kinematics, grip force, and F/T data all from the same graphical user interface (Figure 2.7).
Figure 2.7: Data acquisition software: the Polaris Motion Capture GUI (first version by McBeth, 2001) collects streaming data from the passive arrays, strain gauges and force/torque sensor with one button; the Start Fastrak GUI collects streaming magnetic data.

2.3.2 Experimental Protocol

Two expert surgeons were evaluated, using the experimental tool, in three clinical laparoscopic cholecystectomies over a period of three months at University of British Columbia Hospital (January - March, 2004). These same surgeons were also evaluated on the cystic duct dissection module of the VR and physical simulators in the Centre of Excellence for Surgical Education and Innovation (CESEI) within this period. We did not make any prior selection of patients or OR staff for the OR trials. The patient group was made up of both males and females, with varying weights and ages. One researcher scrubs in and is responsible for tool calibrations and assisting the surgeon with the experimental tool. The scrubbed researcher wraps the F/T sensor with OptSite (a sterile surgical wrap) to keep fluids from damaging and contaminating the sensor. The other researcher handles the computer operations, as well as the video data collection. A minimum of two researchers per operation is required, with an optional third assistant to help with any wiring or equipment placement problems. A detailed description of the OR data collection procedure can be found in Appendix A. The experimental hybrid tool is one of the first tools to be used once all of the trocars have been inserted into the abdomen and abdominal exploration has taken place.
When the surgeon indicates his readiness for the hybrid tool, the scrubbed researcher hands down the wires from the sterile field (wires to the strain gauges, F/T and magnetic sensors) for the other investigator to plug in to their respective tool interface units located on a stool underneath the OR table. The scrubbed investigator is also responsible for the calibration characteristic moves. These moves are done to aid in the time synchronization of the sensor data. The characteristic movements are done either at the beginning or at the end of the procedure, depending on which is deemed least disruptive for the OR staff. The first movement consists of a large upward arm movement, to match the kinematics data. The large arm movement is accompanied by a "hit" at the peak of the movement, to synchronize the kinematics data with the F/T data. Finally, a hard "squeeze" of the tool handles is done to time-match the data from the gauges and the F/T sensor. The characteristic moves are, as the name implies, much larger than other expected moves and are easily distinguished from the rest of the data stream. Post calibrations include digitizing a body frame on the patient to transform the tool tip kinematics to an anatomical body frame with respect to a world coordinate frame. Also, a gravity vector is established with the tool in a neutral position to remove the gravity effects from the raw F/T data. In actual use, however, we found that the OR calibrations were often not practical or possible. Manual compensation for the lack of calibration data is a time-consuming process, and system improvements should include auto-calibration programs.

2.4 Equipment Calibration and Data Post-Processing

Using the experimental hybrid surgical tool, and following the experimental protocol outlined above, data were collected from three laparoscopic cholecystectomies and two physical simulator trials.
These data required a significant amount of post processing before the desired performance measures could be extracted (Figure 2.8). Details for each processing level are given in the following sections.

Figure 2.8: Data post processing steps for the data from each sensor. Each stream (optical, magnetic, force/torque and strain gauge) is synchronized and filtered for ESU effects; gravity and grip forces are removed from the force data; the kinematics and force data are transformed to the tool tip in the anatomical body frame; the kinematics streams are fused (data fusion and GCV); and the performance measures are extracted.

2.4.1 Kinematics Data Registration and Calibration

The goal of the calibration and registration procedures is to obtain two data streams from the kinematics sensors which represent the same location in space, specifically the tool tip, with respect to an anatomical reference frame (Figure 2.9). The optical and magnetic sensors collect position and orientation data in three dimensions, as described earlier in this chapter. The Polaris optical tracking system tracks five geometrically unique faces, sets of three passive retro-reflective markers, which are mounted on the MDMArray. These faces each represent their own 3D frame in space, with the optical system tracking one at a time, depending on which face is visible to the camera. The magnetic system receiver also represents a frame, and its location and orientation are recorded continuously with respect to the transmitter reference frame.

Figure 2.9: Three-dimensional frames from the magnetic and optical sensors. All data need to be registered to the tool tip in the camera frame.

In order to fuse the data streams from the optical and magnetic sensors, both data streams must represent the locations and orientations of the same frame in space. We want to track the location of the tool tip frame in space with respect to an anatomical body frame.
The following sections outline the component calibrations that were done to calculate the rigid body transforms between the frames on the tool, as well as the OR calibrations that are done intraoperatively to find the transforms between the camera and transmitter reference frames. Finally, a discussion of how we use the rigid body transforms on the tool and the transformations between sensor frames to find the location and orientation of the tool tip in a common reference frame is given.

Component Calibration

The optical and magnetic sensors give 3D locations and orientations of the surgical tool with respect to their separate reference frames (the camera and transmitter respectively) and we use this information in pre-calibrations to find the rigid body transforms between the rigid bodies on the tool itself. The component calibration exercises were done multiple times in our lab at Vancouver General Hospital (VGH) to ensure accuracy. We want to find the location and orientation of the surgical tool tip, so the first step in the component calibration exercises is to set up a tool tip frame with respect to the different reference frames on the experimental tool. A tool tip frame is established using optical and magnetic point probes, and a calibration object. The jaws of the tool tip are digitized and are used to create the y vector of the tool tip frame. A point along the tool shaft is also digitized, providing a third point to set up a three-dimensional frame (Figure 2.10). We are then able to find the location of the tip frame in the camera and transmitter frames respectively (Equations 2.1 and 2.2):

T^camera_tipframe = [unit(X^camera_vect), unit(Y^camera_vect), unit(Z^camera_vect), C^camera] (2.1)

T^transmitter_tipframe = [unit(X^transmitter_vect), unit(Y^transmitter_vect), unit(Z^transmitter_vect), C^transmitter] (2.2)

**unit: denotes a unit vector

Figure 2.10: We use a calibration rig to set up the tool tip reference frame. By digitizing three points on the tool, the tool tip frame is established for both the magnetic and optical sensors, with Y_vect = (B - C) and Z_vect = X_vect x Y_vect.

The vectors are calculated using optical and magnetic point probes, resulting in two sets of vectors with respect to the camera and transmitter respectively. With the tool in the calibration object, stationary data from each of the five faces and the receiver are collected, giving T^camera_face (4x4x5) and T^transmitter_receiver. The rigid body transforms from the tool tip reference frame to the five faces and the receiver frame are calculated as shown in Equations 2.3 and 2.4:

T^face_tipframe = (T^camera_face)^-1 * T^camera_tipframe (2.3)

T^receiver_tipframe = (T^transmitter_receiver)^-1 * T^transmitter_tipframe (2.4)

The next step in component calibration involves finding the transformations between each of the optical frames (the five geometrically unique faces on the MDMArray) in order to transform the data from the five faces to one face, Face A in our case, using the method outlined below. Figure 2.11 shows the different frames involved in the procedure. The imaginary reference frame used here is simply the tool tip reference frame established earlier. We have already calculated T^face_imaginary for each face (the tip frame is referred to here as the imaginary reference frame) and we can now use these transforms to find the transformations between each of the faces on the tool.

Figure 2.11: MDMArray passive marker face frames, labelled A, B, C, D, E. Stationary data are collected to find the transforms between the passive marker faces.
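The transform algebra in Equations 2.3 and 2.4 can be sketched with 4x4 homogeneous matrices. The following is a minimal Python/NumPy illustration (the thesis implementation is in Matlab); the identity rotations and the translation values are hypothetical, chosen only to show how a rigid-body transform is built, inverted and composed.

```python
import numpy as np

def make_T(R, t):
    """Build a 4x4 homogeneous transform from a rotation matrix and translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def inv_T(T):
    """Invert a rigid-body transform: [R t]^-1 = [R' -R't]."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

# Hypothetical stationary calibration data: one face pose and the tip pose,
# both expressed in the camera frame.
T_cam_face = make_T(np.eye(3), np.array([10.0, 0.0, 0.0]))
T_cam_tip = make_T(np.eye(3), np.array([10.0, 0.0, 330.0]))

# Eq. 2.3: rigid-body transform from the face frame to the tip frame.
T_face_tip = inv_T(T_cam_face) @ T_cam_tip
```

Equation 2.4 is the same composition with the transmitter/receiver pair in place of the camera/face pair.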
The tool reference frame to which the data from all other faces are transformed is Face A (toolref) in this study. The separate data streams for each face are transformed to toolref using Equation 2.5 (for all faces except Face A):

T^camera_toolref = T^camera_face * T^face_imaginary * (T^toolref_imaginary)^-1 (2.5)

Using the equations given above, we are able to register the data from each of the five faces to one reference face (toolref), with respect to the camera frame. The rigid body transformation between the tool reference frame (Face A) and the receiver frame is also required for registration. Using the tip frame transforms and stationary transforms from the magnetic and optical sensors, we can find this rigid body transformation using Equation 2.6:

T^toolref_receiver = (T^camera_toolref)^-1 * T^camera_tipframe * (T^transmitter_tipframe)^-1 * T^transmitter_receiver (2.6)

OR System Calibration

The magnetic and optical data are registered to a common reference frame using intraprocedural calibrations. Stationary data from the optical and magnetic sensors are collected before and after the procedure to find the transformation between the camera and transmitter reference frames. This transformation varies between procedures, but not within a procedure. Stationary data provide a segment of data from each sensor where we are confident that the tool tip is in the same location over a period of time. The transformation T^camera_transmitter is the key transform for registering the magnetic and optical data. Using the stationary calibration data from the optical and magnetic sensors, we find this transform using Equation 2.7:

T^camera_transmitter = T^camera_toolref * T^toolref_receiver * (T^transmitter_receiver)^-1 (2.7)

Finally, we set up an anatomical body frame using the surgical tool as a point probe to digitize three points on the patient's body.
The umbilicus, xiphoid process and right hip are digitized and the anatomical frame calculated. The anatomical body frame allows us to give meaningful descriptions about how the tool moves inside the body, with respect to anatomical landmarks. In practice, we were rarely able to gather the anatomical frame data. The patient is moved intraoperatively, as the surgeon determines the optimal operating position, and this prevents treatment of the anatomical frame as a true rigid body. In future studies, a passive marker attached to the patient will allow us to calculate the location of the body frame at all points in a surgery.

Registration

Using the rigid body transforms found as described above and the transformations calculated using stationary OR calibration data, the magnetic and optical data can be expressed as tool tip location and orientation with respect to a common anatomical reference frame. To register the optical data we use Equation 2.8:

T^bodyframe_tipframe = (T^camera_bodyframe)^-1 * T^camera_toolref * T^toolref_tipframe (2.8)

For the magnetic data, Equation 2.9:

T^bodyframe_tipframe = (T^camera_bodyframe)^-1 * T^camera_transmitter * T^transmitter_receiver * T^receiver_tipframe (2.9)

Two streams of data representing the location and orientation of the same point in space, the tool tip, with respect to anatomical landmarks have now been achieved. In this study, we choose either the transmitter or camera frame as the common frame for the sensors instead of the anatomical reference frame, for the reasons described earlier. Further processing, including synchronization, filtering, fusion and performance measure extraction, uses this registered data and is discussed in the sections below.

Repeatability Analysis

The repeatability of the calibration and registration exercises outlined above was analyzed.
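Registration of a magnetic sample via Equation 2.9 is a chain of the transforms found above. A hedged Python/NumPy sketch follows (the thesis uses Matlab); every transform below is a hypothetical pure translation, chosen only so the chained result is easy to verify by hand.

```python
import numpy as np

def translate(t):
    """4x4 homogeneous transform with identity rotation (illustration only)."""
    T = np.eye(4)
    T[:3, 3] = t
    return T

def inv_T(T):
    """Invert a rigid-body transform."""
    Ti = np.eye(4)
    Ti[:3, :3] = T[:3, :3].T
    Ti[:3, 3] = -T[:3, :3].T @ T[:3, 3]
    return Ti

# Hypothetical pre-computed transforms (mm), stand-ins for the calibrations:
T_cam_body = translate([0.0, 50.0, 0.0])    # anatomical frame digitization
T_cam_xmtr = translate([200.0, 0.0, 0.0])   # Eq. 2.7, from stationary OR data
T_xmtr_rcvr = translate([0.0, 0.0, 100.0])  # one streamed magnetic sample
T_rcvr_tip = translate([0.0, 0.0, 250.0])   # pre-calibration (Eq. 2.4)

# Eq. 2.9: the magnetic sample expressed as the tool tip pose in the body frame.
T_body_tip = inv_T(T_cam_body) @ T_cam_xmtr @ T_xmtr_rcvr @ T_rcvr_tip
```

With identity rotations the chained translation is simply the sum of the individual offsets minus the body-frame offset, which makes the composition order easy to check.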
The component calibration protocol for finding the transformation between the tip and the tool reference frame for the optical sensor, and the transformation between the tip and the magnetic receiver frame, was done twenty times to establish a measure of repeatability (Table 2.1).

Table 2.1: The reliability of the transformations. The tip frame was calculated twenty times, and the transformations found between the tip frames and the tool sensor frames. The means and standard deviations of the locations and orientations of the tool reference frames are given.

                                X (mm)            Y (mm)           Z (mm)            Roll (rad)     Pitch (rad)    Yaw (rad)
Optical: T^tipframe_toolref     -0.128 +/- 1.08   59.13 +/- 2.85   336.81 +/- 0.39   1.3 +/- 0.17   0.02 +/- 0.02  1.5 +/- 0.04
Magnetic: T^tipframe_receiver   -39.28 +/- 13.42  -56.28 +/- 9.14  333.90 +/- 5.07   1.32 +/- 0.13  1.23 +/- 0.06  0.31 +/- 0.14

From the above results it is apparent that there can be a significant amount of variability in the magnetic pre-calibration measures. This would obviously lead to more errors when the transformation between the camera and transmitter is found in the OR calibration procedures. These errors are compensated for, however, when the two streams are fused (Lim 2004).

2.4.2 Force/Torque Data Calibration and Registration

This section outlines the calibration procedures and registration of the three-dimensional F/T data to get meaningful values of forces and torques at the surgical tool tip. The strain gauge data are used in calibrating the 6D F/T sensor.

Calibration

Grip Calibration

In order to understand the data from the F/T sensors and use them in a logical way, an understanding of the mechanics of the tool is required (Figure 2.12). The experimental tool shaft consists of several layers. The innermost layer is attached to both the handles and the tool tip to facilitate opening and closing of the jaws. The outer layers of the tool provide protection and insulation from the inner shaft.
When the handles of the tool are opened, the inner shaft moves and shortens, thereby pulling on the tool jaws and forcing them open (Figure 2.12). The F/T sensor, by design, senses this movement of the tool shaft. A portion of the outer layer of the tool has been cut away, between which the bracket is attached, and all the forces felt by the innermost shaft are transferred through the bracket to the F/T sensor. Through calibration, the strain gauge data are used to separate the grip force from the tissue manipulation forces at the tool tip.

Figure 2.12: Schematic of the experimental tool. Operation of the handle moves the shaft, causing the jaws to open or close. The ATI force/torque sensor "feels" this shaft movement.

We use a calibration algorithm to separate the grip forces from tissue manipulation forces. The tool was held in a neutral position and the handle of the tool was opened and shut multiple times while recording streaming data from the ATI force sensor and strain gauges. Care was taken not to exert any extra external forces on the tool during this data collection. The six degrees of F/T data were then plotted separately against the strain gauge readings to obtain a K-value for each component of force (Figure 2.13). This K-value is referred to as the grip constant. These grip constants, when multiplied by experimental gauge data, represent the F/T due to grip and are subtracted from the raw F/T data, where appropriate.

Figure 2.13: Calculation of the grip constant K for force in the x direction (sample calibration grip data, one direction; force plotted against strain data (V)). The friction loop taken by an open-and-squeeze motion is indicated with arrows following the motion. The loop is due to friction within the tool, and is not consistent across different grip strengths.

Investigation of the strain versus F/T graphs indicates that at either end of the calibration plot are flat sections where the strain data increase and the force data remain constant.
The flat section at the far left corner of the calibration plot may be due to friction within the tool handle. The strain gauges feel the initial force needed to overcome this friction before any force is transferred to the F/T transducer. There is also a significant amount of hysteresis in the release after a squeeze. When the tool handles are opened, the friction is overcome in the opposite direction, leading to a friction loop (loop in Figure 2.13). Unfortunately, this loop does not remain constant and its location varies with different squeeze strengths. The flat section in the upper right corner of the calibration plot (Figure 2.13) is due to saturation of the F/T transducer. The transducer has a significant amount of overload protection, but grip operations can sometimes reach the maximum readable value of the sensor, causing the gauge data to increase while the force data remain constant. These difficulties led to an unexpected increase in the complexity of the grip compensation algorithm. Due to time constraints, a simple compensation scheme was devised. A mean K-value was calculated from the linear portions of the friction loop corresponding to open and squeeze movements. When the F/T data are within this "linear" range, grip forces are removed; outside of these regions no compensation is applied. This leads to F/T data with misleadingly high force peaks. Nevertheless, for the data shown (Figure 2.14) the root mean squared error is 174.38 counts per unit force (13N), whereas the root mean squared error for the raw force data (no compensation) is 397.81 counts per unit force (31N). It is apparent from Figure 2.14 that the grip force calibration often under-compensates for the grip force, especially at the higher force peaks (* in Figure 2.14). At a "typical" large force peak of 50N only ~25N are removed, leaving 25N of force. Although the compensation for grip force under-removes the grip values, it is still much superior to no compensation.
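The grip-constant fit and the in-range compensation rule can be sketched as follows. This is a hedged Python/NumPy illustration, not the thesis Matlab code: the synthetic strain values, the slope of 120 counts per volt, and the linear-range bounds are all invented for the example.

```python
import numpy as np

# Hypothetical calibration recording restricted to the linear region of the
# open/squeeze plot: strain gauge voltage vs. one F/T axis (counts).
strain = np.linspace(0.1, 1.0, 50)
rng = np.random.default_rng(0)
force = 120.0 * strain + rng.normal(0.0, 1.0, 50)  # "true" K of 120 + noise

# Grip constant K: least-squares slope of force vs. strain (cf. Figure 2.13).
A = np.column_stack([strain, np.ones_like(strain)])
K, offset = np.linalg.lstsq(A, force, rcond=None)[0]

def remove_grip(raw_force, gauge, K, lo=0.1, hi=1.0):
    """Subtract the grip contribution only inside the calibrated linear range."""
    if lo <= gauge <= hi:
        return raw_force - K * gauge
    return raw_force  # outside the linear range: no compensation applied
```

The guard on the gauge reading mirrors the simple scheme in the text: compensation is only trusted where the strain-to-force relationship was observed to be linear, which is also why large peaks remain under-compensated.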
Future studies should make addressing these grip calibration and friction issues a priority before collecting more OR data. Grip Fixed Force Data in X Direction (OR #3) Figure 2.14: An illustration of how the grip forces are removedfrom the raw force data giving two separate force data streams: tissue manipulation and grip forces. * This peak shows the under- compensation of the grip force prevalent at higher grip forces. This leads to a F/T data stream of unusually high data, especially in the axial direction. 48 Gravity Effects Calibration The F/T sensor is has a high resolution, and because the sensor/bracket assembly is mounted off axis to the tool shaft and its mass is not balanced, the sensor is capable of reading forces due to the mass of the tool itself. These readings vary significantly (up to 5N) when the tool is in different orientations of roll and pitch (rotations about the z and y axes with respect to the tool tip frame). With roll and pitch representing rotation of the tool shaft about the z and y-axes. To improve the accuracy of the force readings, these gravity effects need to be subtracted from the raw force data. The effect of gravity on the force readings is due to a combination of roll and pitch of the bracket. These forces are sinusoidal for both roll and pitch; our model for recreating these gravity effects uses this characteristic. The model is shown in Equation 2.10. The x matrix contains the calibration parameters that need to be calculated before the gravity effects forces may be determined. cos(rolli) sin(rolli) cos(pitchi) sin(pitchi) 1 F n (b) cos(rolln) sin(rolln) cos(pitchn) cos(pitchn) 1 (A) rl r2 * Pi p2 off (2. (x) The tool was moved around in space with different combinations of pitch and roll to gather the data required to solve Equation 2.10, and find the calibration parameters given in matrix x. During this movement, calibration data from the magnetic tracker and F/T sensor were collected. 
Care was taken during this procedure to prevent the application of any external forces, such as touching the tool handles, as this would give rise to the confounding grip forces described earlier. Inclination, or pitch, of the bracket is calculated by establishing a gravity vector, found by holding the tool in a neutral position and taking magnetic sensor readings. Pitch is calculated from the dot product between the gravity vector and the location of the position sensor at each time point (Figure 2.15a).

Figure 2.15a: Pitch of the experimental tool shaft. The value is found by calculating the dot product between a position sensor vector and a previously established gravity vector.

The roll angle is calculated at each time point by taking the dot product between the position sensor vector and a vector in the horizontal tool plane.

Figure 2.15b: Roll of the experimental tool shaft. A horizontal vector is established using the gravity vector, and roll is the dot product between the tool position vector and this horizontal vector.

The inclination and roll angles at each sample in the calibration dataset are used to solve Equation 2.10 using the pseudoinverse (pinv) function in Matlab. Equation 2.10 is treated as a least squares problem, A*x = b, and we want to solve for the calibration parameters matrix x. The pinv function uses singular value decomposition to solve the least squares problem as given above. A singular value decomposition of A gives Equation 2.11 (Numerical Recipes in C, 1992):

[A]_MxN = [U]_MxN * [diag(w_1 ... w_n)] * [V]^T_NxN (2.11)

The parameters matrix (x) is then calculated using Equation 2.12:

[x] = [V] * [diag(1/w_i)] * [U]^T * [b] (2.12)

The calibration parameters in matrix x (r1 through off) are used with experimental roll and pitch values to determine the gravity effects forces (Equation 2.10).
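The least-squares solution of Equation 2.10 via the SVD-based pseudoinverse (Equations 2.11-2.12) can be sketched as below. This is an illustrative Python/NumPy stand-in for Matlab's pinv; the "true" parameter values are invented so the recovery can be checked, and in practice b would come from the F/T sensor readings rather than being synthesized.

```python
import numpy as np

rng = np.random.default_rng(1)
roll = rng.uniform(-np.pi, np.pi, 200)       # roll angles from the magnetic tracker
pitch = rng.uniform(-np.pi / 2, np.pi / 2, 200)  # pitch angles

# Hypothetical "true" gravity-effect parameters (r1, r2, p1, p2, off).
x_true = np.array([1.5, -0.8, 2.0, 0.5, 0.3])

# Build the design matrix A of Eq. 2.10: one sinusoidal row per sample.
A = np.column_stack([np.cos(roll), np.sin(roll),
                     np.cos(pitch), np.sin(pitch),
                     np.ones_like(roll)])
b = A @ x_true  # synthesized calibration forces

# Eqs. 2.11-2.12: pseudoinverse solution of A*x = b (numpy.linalg.pinv uses SVD,
# as Matlab's pinv does).
x_hat = np.linalg.pinv(A) @ b

# Reproduced gravity forces for new roll/pitch samples are then A_new @ x_hat,
# and are subtracted from the raw F/T stream.
```

With noise-free synthetic data and a full-rank A, the pseudoinverse recovers the parameters exactly; with real sensor noise it returns the minimum-norm least-squares estimate.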
These gravity effects are subtracted from the F/T data, similar to the procedure used to separate the grip forces from the raw F/T data. The reproduced gravity effects, in the z force direction, for one sample trial are shown in Figure 2.16.

Figure 2.16: Reproduced gravity effects for one sample trial (fixed force in the z direction; RMS error: mean value subtraction 0.615 N, sinusoidal model 0.319 N). The "fixed" force data are calculated by subtracting the reproduced gravity forces from the raw force data. Note that the fixed force data are smoothed and shifted to the zero point as expected.

The fixed force data represent how well this calibration procedure works. As would be expected, the "fixed" force data have been smoothed and shifted to zero. A simpler force model, which subtracts the mean force value from the total, was also used for comparison. The RMS error of the example data shown (Figure 2.16) was calculated, and the more sophisticated sinusoidal model performed significantly better, with almost 2 times lower RMS error, than the simple mean subtraction method. The force sensor calibrations outlined above are used to remove the grip and gravity forces from all experimental surgical tool data. The gravity forces are removed first, followed by the grip forces.

F/T Registration

We are interested in the continuous manipulation forces experienced at the tip of the tool, so the calibrated F/T measures need to be transformed to the tip. The transformation between the F/T sensor frame, referred to as the force frame, and the tip frame was found by digitizing three points on the F/T sensor, as well as three points on the tip. Care was taken to establish the same tip frame as in the kinematics data registration. The transformation from the tip frame to the force frame was calculated using Equation 2.13.
T^tipframe_forceframe = T^tipframe_transmitter * (T^forceframe_transmitter)^(-1)    (2.13)

Equations 2.14 and 2.15 describe the method used to transform calibrated force data to the tool tip (Figure 2.17). The R^tipframe vector is the location of the force frame origin expressed in the tip frame.

Force^tipframe = T^tipframe_forceframe * Force^forceframe    (2.14)

Torque^tipframe = T^tipframe_forceframe * Torque^forceframe + (R^tipframe) cross (Force^tipframe)    (2.15)

Figure 2.17: Schematic of the experimental tool for F/T data registration.

Once the force data from the F/T sensor was separated into meaningful values and registered to the tool tip, the F/T performance measures were extracted for analysis.

2.4.3 Data Synchronization

The next step in the post processing system is data synchronization (Figure 2.18). Synchronization algorithms are used to time-synchronize data from the position and F/T systems. Due to sensor latencies, and the fact that two computers are required, each sensor begins streaming data at slightly different points in time. The large characteristic moves, done either pre- or postoperatively (Section 2.3.2), provide a region at the end of the output data stream in which to compare time differences. For example, the large arm movement is easy to see when examining the data because it is a significantly more dramatic movement than would be seen in "normal" surgical behaviour. Also, since the tool is held stationary before and after the characteristic move, we are able to discriminate between a characteristic move and an accidental move (e.g., dropping the tool).

Figure 2.18: Data synchronization protocol. Data is synched in three consecutive stages.
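The tip registration of Equations 2.14 and 2.15 above amounts to rotating the measured wrench into the tip frame and adding the moment of the transferred force. A minimal sketch, with hypothetical argument names:

```python
import numpy as np

def wrench_to_tip(R_tip_force, r_force_in_tip, force_ff, torque_ff):
    """Transform a force/torque pair from the force-sensor frame to the tool
    tip (Equations 2.14-2.15). R_tip_force rotates force-frame vectors into
    the tip frame; r_force_in_tip locates the force-frame origin in the tip
    frame, supplying the moment arm for the cross-product term."""
    force_tip = R_tip_force @ force_ff
    torque_tip = R_tip_force @ torque_ff + np.cross(r_force_in_tip, force_tip)
    return force_tip, torque_tip

# A 1 N lateral force applied one unit along the shaft from the tip
# produces a 1 N*m moment about the tip, even with zero measured torque.
f, t = wrench_to_tip(np.eye(3), np.array([0.0, 0.0, 1.0]),
                     np.array([1.0, 0.0, 0.0]), np.zeros(3))
```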
The first two synchronization stages (Figure 2.18) use the "characteristic" move, and the third synchs the force/torque data using the "big squeeze" window of time. In order to synchronize the kinematics data, the final segment of registered optical and magnetic data is visually examined (Figure 2.19). Once the large characteristic move is identified, a window of 1-2 s is selected from both the optical and magnetic data. This time window reduces the time over which the optimization is done. An initial guess of Δt is made from the time difference between the tops of the two peaks. The true value of Δt is calculated by solving Equation 2.16 using a non-linear least squares optimization function (lsqnonlin) from the Matlab Optimization Toolbox. Find Δt to minimize:

RMS = sqrt( Σ_i [Z_opt(t_0(i)) - Z_mag(t_0(i) + Δt)]^2 )    (2.16)

Equation 2.16 is solved for the x, y and z position data, and the value of Δt with the lowest cost is used to synch the position data streams. The repeatability of the kinematics data synchronization protocol was assessed by collecting a set of optical and magnetic data containing seventeen characteristic peaks. The difference in time (Δt) between the optical and magnetic data was calculated seventeen times for the same dataset; over the seventeen trials, Δt = 2.38 +/- 0.047 s.

Figure 2.19: Visual examination of registered kinematics data to select the data synchronization window.

Once the position data has been synchronized, the final segment of kinematics and F/T data is visually inspected. The window of time corresponding to the top of the characteristic large move is chosen. This is taken as the large peak in the F/T data values, because at the top of the characteristic move the tool is "hit". The time difference between these two peaks is taken as Δt2. Finally, the synchronized 3D F/T data and strain gauge data are examined for evidence of the "big squeeze" characteristic move.
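The Δt search of Equation 2.16 above can be sketched with a simple grid search standing in for lsqnonlin. The Gaussian "characteristic move" peak and all window sizes here are synthetic illustrations, not the thesis data.

```python
import numpy as np

def estimate_dt(t, z_opt, z_mag, dt_guess, search=0.5, step=0.001):
    """Find the time shift minimizing the RMS mismatch of Equation 2.16.
    A grid search over interpolated shifts stands in for lsqnonlin."""
    def rms(dt):
        return np.sqrt(np.mean((z_opt - np.interp(t + dt, t, z_mag)) ** 2))
    candidates = np.arange(dt_guess - search, dt_guess + search, step)
    return min(candidates, key=rms)

# Synthetic characteristic move: a peak seen by both sensors, offset by 0.25 s.
t = np.linspace(0.0, 5.0, 2000)
z_mag = np.exp(-((t - 2.5) ** 2) / 0.05)
true_dt = 0.25
z_opt = np.interp(t + true_dt, t, z_mag)   # optical stream sees the peak earlier
dt = estimate_dt(t, z_opt, z_mag, dt_guess=0.2)
```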
The "big squeeze" is located approximately 5 seconds after the first characteristic move. The time difference between the two peaks of force data from both sensors is then used to calculate Δt3. The video data also needs to be synchronized to the sensor data stream. We have two sets of video data from each OR trial: the internal laparoscope video and the external camera view. These two videos are synchronized using the characteristic movement of the surgeon during the insertion of the laparoscope through the trocars into the body cavity. Once the same points have been identified, the videos are time-stamped starting from zero using video editing equipment. The next step is to synchronize the video data to the sensor data stream. This is accomplished by looking at where the magnetic kinematics data is "flat" in the middle of the procedure dataset. These "flat" areas occur when the surgeon puts the tool down to apply clips to the cystic arteries and ducts. The start and end points of the flat regions of magnetic data are synchronized to surgeon movement in the external video. The repeatability of the method for finding Δt2 and Δt3 was assessed using the same procedure as for finding Δt: the characteristic move was done several times within one dataset, and the Δt values for each move were compared to find the mean values and standard deviations. Thirteen peaks were examined for Δt2, giving a mean and standard deviation of 0.796 +/- 0.024 s. Twenty peaks were synched for Δt3, giving a mean and standard deviation of 0.671 +/- 0.017 s.

2.4.4 Electrosurgical Effects and Data Removal

Electrosurgery is an aspect of laparoscopy that allows the surgeon to perform the procedure in a virtually bloodless environment. Electrosurgical units (ESUs) and instruments generate and deliver radio frequency (RF) currents to the tissue to cut, cauterize or coagulate.
The tool used for this study is a monopolar tool, meaning that the cautery current passes from the active electrode to the passive electrode attached to the patient prior to surgery. The cautery wire is attached to the surgical tool via the wire port located above the handles. The current travels along the innermost shaft to the tool tip, where it is applied to tissue. When set to cutting, the ESU delivers a continuous 400 kHz, ~1200 V current. In the coagulation setting, a 3500 V, 250 kHz current is supplied in 40 kHz bursts. Usually a combination of the two settings, "blend", is used so the surgeon is able to dissect and cut tissues as well as prevent and stop any bleeding. A literature search indicated that this was the first attempt to bring an experimental tool equipped in this way into a human operating room, and we were unsure of how the electrosurgical cautery current would affect the sensor readings. Preliminary testing with a sample ESU indicated that using cautery would not be detrimental to the operation of the sensors, but would cause a significant amount of noise in the strain gauge, magnetic sensor and F/T data streams. It was found that a pure cutting current did not have any noticeable effect on our data, while coagulation and blend effects were very significant. This may be due to the high power "bursts" of current in the coagulation setting. Stationary test data was collected using a sample ESU supplied by the biomedical engineering department, at the highest possible power on the coagulation setting. The strain gauge data are affected while the cautery current is passing through the tool (Figure 2.20). The surgeon is trained in judicious use of cautery, especially around delicate areas such as ducts and arteries, but OR results indicate that cautery may be continuously applied for as long as 10 seconds. It is apparent that when cautery is "on" the gauge data profiles abruptly and dramatically change.
This feature is exploited in order to remove the unwanted noisy sections from the data stream. Our removal technique examines small packets of time (~1/10 s) and compares the change in velocity range from that of the preceding time segment. If this difference exceeds a given threshold (determined by comparing the change from known "good" data to noisy data), then that small segment of time is removed from the stream. This removal technique is only implemented when the segment of data being analyzed has significant amounts of cautery noise. For a data segment with only one or two affected areas, the noisy data are manually removed to ensure that the minimum amount of data is removed (Figure 2.21).

Figure 2.20: Raw strain gauge data from OR experiment #3 (coagulation power = 30 W). At the time shown the tool was lying stationary while cautery was applied, indicating the effect of cautery on the signal output.

Figure 2.21: Raw and filtered/removed data stream from OR experiment #3. The large portion of noisy data is removed using the filtering scheme described above, along with some other smaller unwanted portions.

The F/T data are also affected while the cautery current is passing through the tool (Figure 2.22). Once the cautery current ceases to pass through the tool, the data stream abruptly returns to where it was before the current was turned on. A frequency analysis of the affected F/T data led us to believe that the high frequency noise was filterable. A generalized cross validation (GCV) filter is used to filter the data. This filter iterates to find an optimal smoothing parameter by minimizing a cost function (Woltring 1986).
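The packet-based removal technique described above can be sketched as follows. The packet length and threshold are illustrative values only, and a deterministic synthetic burst stands in for the cautery noise.

```python
import numpy as np

def remove_noisy_packets(signal, fs, packet_s=0.1, threshold=5.0):
    """Drop ~0.1 s packets whose peak-to-peak range jumps sharply relative to
    the last clean packet -- the signature of cautery noise in the stream.
    threshold is a hypothetical tuning value, set from clean vs noisy data."""
    n = max(1, int(packet_s * fs))
    keep = np.ones(signal.size, dtype=bool)
    clean_range = np.ptp(signal[:n])            # baseline from the first packet
    for start in range(n, signal.size, n):
        seg_range = np.ptp(signal[start:start + n])
        if abs(seg_range - clean_range) > threshold:
            keep[start:start + n] = False       # flag packet as cautery noise
        else:
            clean_range = seg_range             # update the clean baseline
    return signal[keep]

# A quiet 1 Hz signal with a large synthetic cautery burst from 1.0 s to 1.4 s.
fs = 100
t = np.arange(0, 2, 1 / fs)
sig = 0.1 * np.sin(2 * np.pi * t)
sig[100:140] += 10 * np.sin(2 * np.pi * 30 * t[100:140])
clean = remove_noisy_packets(sig, fs)           # four 0.1 s packets removed
```

Comparing each packet to the last clean baseline, rather than its immediate predecessor, keeps the first clean packet after a burst from being discarded along with the noise.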
The GCV filter significantly improves the signal, but some spikes due to cautery are often still apparent. Subsequently, a removal technique similar to that used for the gauge data is applied, whereby the changes in velocity from one time segment to another (~1/10 s) are examined and the segments that exceed a given threshold are removed.

Figure 2.22: Raw F/T data in one direction affected by cautery (coagulation power = 75 W).

Finally, the magnetic tracker is also affected while cautery is on (Figure 2.23). The magnetic tracker, however, recognizes that an error is occurring during many of these measurements, and this error is acknowledged in the "monitoring" column of the output data matrix. This makes the initial removal of the ESU effects relatively simple, as we can automatically search the monitoring column for error messages and remove all the data where an error was noted. Any residual effects may be removed either manually (if only one or two affected areas remain in the time segment of interest) or by examining the velocity changes, as was done to filter both the strain and F/T datasets.

Figure 2.23: Cautery filtered/removed magnetic data. To obtain the filtered curve, the initial error rows are removed, and the data is then further filtered by removing 0.04 s data segments.

2.4.5 Data Fusion

Once the kinematics data has been registered to a common reference frame, time synchronized and filtered (to remove ESU effects), the data are fused to give one continuous high-frequency data stream. The data fusion protocol uses the optical data for "fixes", on to which the magnetic data is warped. This work has been done in conjunction with another transfer-of-training study. A brief description of the process is outlined in the following paragraph; details of the data fusion algorithms can be found in Joanne Lim's thesis (Lim 2004).
The first step in fusing the data involves creating a matrix of differences by subtracting the magnetic data from the optical data (where optical data is available). This matrix of difference measures is then interpolated using a generalized cross validation function to form a smooth curve, which is added back to the magnetic data. This warps the magnetic data directly on to the optical data, maintaining the high frequency, continuous characteristics of the magnetic data.

2.4.6 Data Segmenting and Performance Measure Extraction

The formatted OR data is segmented using the time-stamped OR video and laparoscope video data to get information about various parts of each procedure. For this study we are concerned with the dissection tasks in each procedure. The performance measures from these segments are compared against analogous tasks on the VR and physical simulators.

Data Segmenting

The synchronized, time-stamped internal OR video is used to establish the start and end points of each dissection task. The start of a dissection task is taken as the point when the experimental surgical tool first touches tissue, and the end as the point when the tool is removed from tissue. These points are used to segment the kinematics and F/T data to obtain formatted streams of the dissection portions of the surgery. Similar start and end points for task decomposition are used for the physical simulator data analysis. The entire data stream from the VR simulator is used for analysis; no segmenting is required in this setting because it consists of a single dissection segment, versus the large, variable OR task.

Performance Measure Extraction

Once the dissection data are separated from the rest of the data stream, performance measures are extracted from the critical portions.

Kinematics

The 3D kinematics data from the simulators and OR are differentiated using a generalized cross validation (GCV) function to obtain velocity, acceleration and jerk profiles.
The GCV function minimizes the root mean squared error to fit a smoothed polynomial to the data, and the derivatives of the polynomial give the estimates of velocity, acceleration and jerk. The projections of the tool path vectors on the tool tip axis vectors are used to calculate the individual component performance measures (Figure 2.24). Each tool tip performance measure is calculated in six tool tip directions: axial (z), grasp (y), translate (x), transverse plane, absolute, and roll about the tool axis (Figure 2.15b).

Figure 2.24: Projection of the tool path vector on the tool axis vectors gives the magnitude of the tool position derivatives in the directions of the tool axis vectors.

Force/Torque

The simulator and OR F/T profiles are compared using the formatted OR data and the simulator F/T tool tip data. The same six components outlined in the kinematics section are analyzed (axial, grasp, translate, transverse, absolute, roll torque). Torque profiles around the tool axis are assessed for the OR and physical simulator data but are, at this point, unavailable from the VR software.

2.5 Discussion and Recommendations

2.5.1 Discussion

The goal of this study was to design and build a hybrid experimental system capable of gathering continuous kinematics, forces and torques of a surgical tool tip. This tool is used in a pilot study to objectively assess the performance validity of two types of surgical simulators: VR and physical. Our primary contribution is the development of a method to calibrate and collect continuous kinematics and force/torque data of a surgical tool tip in a human OR. Hybrid kinematics and force/torque (F/T) systems have been designed to achieve this: an optoelectronic system is used in conjunction with a magnetic tracking system. Previous work in our lab indicated that an optoelectronic sensor alone is not a sufficient kinematics tracking system due to marker occlusions.
By combining the two systems, we are able to exploit the benefits of both while obtaining better results than either system is capable of individually. A surgical tool with interchangeable tips is used as the experimental surgical tool, so that in future studies data can be collected from all aspects of a procedure. In this study, only one dissection tool tip (the Maryland dissector tip) is used. The kinematics and F/T systems are mounted to the surgical tool using a custom-designed bracket that transfers all forces along the shaft to the F/T sensor. The bracket also incorporates the optical and magnetic sensors. The kinematics data are time synchronized using large characteristic movements, done intra-operatively, and are then registered to a common reference frame. We are tracking the 3D location and orientation of the tool tip with respect to a common world reference frame. Once the data streams have been synchronized and registered to a common reference frame, they can be fused to obtain one continuous 3D tool tip kinematics profile. The data fusion method warps the magnetic sensor measures on to the more accurate optical measures, thereby providing the desired accurate, continuous, high frequency tool tip position data. Forces and torques of the tool tip are collected using a hybrid system made up of a tri-axial transducer mounted to the tool shaft and a strain gauge system on the tool handles. Extensive calibrations and registrations are done to separate gravity and grip effects in order to obtain the tissue manipulation forces at the tip. We attempted to remove the grip forces by correlating the strain gauge data (force applied to the tool handle) to the F/T data during pure grip operations. However, because of friction in the tool system and hysteresis in the F/T measures, we have an "under-removal" of the grip forces.
In other words, at large grip values of ~50 N, for example, we might only be able to consistently remove 20-35 N, leaving an extra 15-30 N in the data stream. The final F/T data stream therefore contains misleadingly high tool tip forces due to the "extra" axial grip force. Some recommendations for improvement of this system are found in Section 2.5.2. Due to intra-operative electrosurgical effects, a significant amount of filtering is needed to remove noise from all the data streams before performance measure extraction. Since the noise imparted by cautery is non-zero, linear filters did not work, so the rapid velocity changes characteristic of affected sections were used to remove the offending noisy data. A GCV filter is used to differentiate the kinematics data and obtain velocity, acceleration and jerk profiles. The final, processed data stream consists of position, velocity, acceleration, jerk, forces and torques of the tool tip.

2.5.2 Recommendations

The experimental system developed here is one of the first of its kind for assessing surgical performance and is, to the best of our knowledge, the first to be used in a human operating room. That being said, there are several recommended improvements that should be made before the system is ready for widespread data collection. These improvements include:

1) A smaller F/T sensor with a larger sensing range. The primary complaint from the experienced users was that the large bracket assembly impeded normal rotation of the tool tip. A larger sensing range would eliminate the saturation of the sensor during some grip manipulations.

2) Redesign the sensor bracket assembly to reduce the amount of friction present when opening and closing the handles of the bracket, and to improve the force transference to the F/T transducer.

3) Improve the strain gauge system to reduce the amount of interference from OR equipment. Additional filters mounted directly on the handle may improve this slightly.
All filtering is currently done at the interface (A/D conversion).

4) The magnetic receiver, which is mounted on the top segment of the bracket, has a tendency in certain orientations to occlude one of the optical markers. The bracket should therefore be redesigned to be more compact and to hold the receiver in a location that cannot occlude the optical sensor tracking.

5) The data acquisition system should be written in a program that supports multiple serial port objects, such as LabVIEW, to collect data from all sensors using one computer.

6) A method to dynamically track the position and orientation of the magnetic sensor transmitter in the optical coordinate system would facilitate dynamic registration of the tool data. This would permit the user to change camera and transmitter locations intra-operatively to achieve optimal placement.

7) A passive marker attached to the body would allow for dynamic tracking of the body frame and facilitate easier analysis of movement with respect to anatomical landmarks. This would be important if this tool is ever used to evaluate the quality of surgical performance.

Chapter 3 contains the results of a pilot study in which we used the experimental tool in three laparoscopic cholecystectomies. Two expert surgeons were assessed in the operating room and in the two surgical simulators described here: virtual reality and physical. We statistically compare their behaviour to each other and to behaviour in the virtual reality and physical simulator settings. We draw context comparisons across each of the three settings to investigate the reliability of our chosen performance measures as well as to evaluate the simulators' performance validity.

Chapter 3: Results from a Quantitative Validity Assessment of Two Laparoscopic Surgical Simulators

3.1 Introduction

The goal of this work is to use an experimental surgical tool to quantitatively assess the performance validity of laparoscopic surgical simulators.
It is widely recognized that the old teaching paradigm of "see one, do one, teach one" does not meet the challenges of surgical education in the new age of technology (Darzi 1999, Satava 2001). Virtual reality (VR) simulation was proposed over a decade ago as an extension of the successful flight-training program for pilots (Schijven 2003, Woodman 1999). A significant body of work from the last few years supports the validity of surgical simulation as a useful and reliable training tool (Gallagher 2002, Grantcharov 2003, Schijven 2003, Andrales 2003). Nevertheless, none of the published studies have thoroughly and quantitatively addressed how motor behaviour in the simulator compares to motor behaviour in the real human operating room (OR). The work presented here is a first attempt at tackling this issue. The different types of validity were outlined in Section 1.4 of Chapter 1. In this study we are evaluating the performance validity of two types of surgical simulators. Concurrent validity is defined as "an evaluation that reflects the extent to which the scores generated by the assessment tool actually correlate with factors with which they should correlate" (Gallagher 2003). Conventionally, simulator concurrent validity has been evaluated using checklists as the gold standard assessment. We want to remove all subjectivity from the evaluation by quantitatively assessing surgeon motor behaviour and comparing that to OR behaviour; we define this as a performance validity test. The goal of this project is to use the hybrid tool described in Chapter 2 to collect motor behaviour data from expert surgeons in human operating rooms, and to statistically compare the OR data to the experts' behaviour on analogous tasks in virtual reality and physical simulators. If the behaviour across both settings is statistically similar, then performance validity of the simulator will have been established.
We are comparing kinematics and force tool tip profiles across experts and settings. A measure of intrasubject, intrasetting variability is necessary to help us determine whether differences seen between settings can be teased out from the intrasubject differences. To do this we need a robust statistic that is insensitive to outliers, units and scaling. We chose the Kolmogorov-Smirnov (KS) statistic, which compares cumulative probability distributions and gives a D-value representing the maximum absolute difference between the two distributions. The benefits of this method include requiring no a priori assumptions about the shape of the cumulative probability distributions, and insensitivity to scaling and outliers. The methods and results described here are unique because, to the best of our knowledge, we are the first group to attempt to determine simulator validity this way; they may serve as a benchmark by which other simulators may be assessed. These methods may also be used to evaluate new tool designs and teaching protocols.

Figure 3.1: Study design for statistically assessing simulator validity by comparing expert surgeon behaviour in the simulator and OR using the KS statistic.

Due to recruitment issues and lengthy analysis times, a relatively small sample and subject size was used. Some initial hypotheses are drawn about the behaviour of experts in the operating room and simulator, and about the differences between the two settings. We are able to draw some preliminary conclusions concerning intrasubject and intersubject variability and what this means about the reliability of our performance measures. Each expert's OR behaviour is compared to that on an analogous task on the VR and physical surgical simulators, and hypotheses are drawn about simulator validity and its potential as a teaching and assessment tool.
The assessment of primary concern is a quantitative one (performance validity), although face and construct validity are discussed based on expert comments and behaviour. The implications of the face and construct validity suggestions are discussed and compared to the performance validity test results.

3.2 Methods

We assessed two expert surgeons; kinematics and force/torque data from their performances in the OR, VR and physical simulators were gathered over a period of four months. This data is used to draw some preliminary conclusions about the validity of the VR and physical simulators.

3.2.1 Operating Room

Data from three laparoscopic cholecystectomies performed by two expert surgeons was collected at the University of British Columbia hospital from January through March 2004. The hybrid experimental tool described in detail in Chapter 2 was used for each procedure. The UBC ethical review board gave ethical approval for this work. All patients were required to give informed consent before data were collected from their procedures, and all signed the appropriate forms. All equipment used for OR data collection was ethylene oxide sterilized, where appropriate, and approved by the Biomedical Engineering department. The protocol followed to obtain the OR data is outlined in Section 2.3.2 of Chapter 2. The rigorous post processing steps outlined in Chapter 2 were followed to obtain synchronized and formatted kinematics and force/torque data for the tool. This data was segmented according to the hierarchical decomposition discussed earlier to obtain data from dissection tasks throughout the procedure. In each procedure there are a total of three segments of dissection data (Figure 3.2). The first segment consists primarily of exploration and anatomy identification: the surgeon dissects the triangle of Calot and identifies the relevant anatomy, the cystic duct and artery. The cystic artery is then clipped using a clipping tool and separated.
The experimental tool is used to dissect the cystic duct. Once the cystic duct is clipped and cut, the Maryland experimental tool is used, in conjunction with the hook tool, to dissect and separate the gallbladder from the liver bed. The specific portions of each procedure in which the expert surgeon uses the experimental tool tend to vary slightly from surgery to surgery. The laparoscope video was used to identify the relevant segments. The time taken to isolate and dissect the triangle of Calot and to separate the gallbladder from the liver bed varied significantly from case to case depending on complexity. For example, in OR #2 the patient was suffering from a chronically inflamed gallbladder, which led to a much longer procedure: this surgery lasted ~1.5 hours, as compared to ~45 minutes for a "normal" surgery.

Figure 3.2: Steps where the experimental hybrid tool is used. The tip in this study is a dissector tip. The steps, and the approximate duration of each, are:
- Exploration and dissection of the triangle of Calot (experimental tool), ~5 minutes
- Clipping and separation of the cystic artery (clipping tool and scissors), ~4.5 minutes
- Dissection of the cystic duct (experimental tool), ~3 minutes
- Clipping and separation of the cystic duct (clipping tool and scissors), ~1.5 minutes
- Beginning of the dissection of the gallbladder from the liver bed (experimental tool), ~3 minutes
- Completion of the gallbladder dissection (hook) and removal of the gallbladder, ~30 minutes
The times given for each segment are approximate only and were determined from the video archives.

3.2.2 Simulators

The virtual reality (VR) simulator uses the laparoscopic surgical workstation from Immersion Medical Corporation and the Reachin laparoscopic training software. The Reachin haptic feedback laparoscopic training software is explained in more detail in Chapter 2. The training software is based on modules of increasing complexity.
For this study, the cystic duct dissection module was deemed to most closely resemble the real OR task. The physical simulator is a mandarin orange dissection task: the subjects were asked to remove the peel from an orange inside a standard box trainer and to dissect out several segments using the experimental surgical tool, causing as little "damage" to the inside of the orange as possible. This simulator was designed in consultation with one of the expert surgeons, who deemed the task sufficiently similar to the operating room experience to be of use as a training task. It was set up to resemble a standard operative suite, using standard OR equipment including a laparoscopic tower and camera. The two participating experts came to the Centre of Excellence for Surgical Education and Innovation (CESEI) on the 3rd floor of the Jim Pattison Pavilion at Vancouver General Hospital 4-5 times and completed the cystic duct dissection module on the VR simulator and the orange dissection on the physical simulator. A short training session (~10-15 minutes) was given to each surgeon before the initial data collection to familiarize the expert with the basics of the simulation. We did not want to expose the experts to extensive training on the simulator, because in a perfect simulation the expert should be able to treat the simulator exactly as a human patient, and no training should be necessary. Throughout each test the simulator software collects and records kinematics and force/torque data of the tool tip. Post processing software written by Iman Brouwer for the VR simulator formats the raw data and provides continuous kinematics and force/torque performance measures. The data in this form is analogous to most of the formatted OR data; the performance measures are extracted and behavioural differences are assessed.

3.2.3 Performance Measures

The performance measures available from each setting vary slightly (Table 3.1).
The measures presented describe the movement of the tool tip (Figure 3.3). Continuous velocity, acceleration, jerk and force profiles are compared for each expert across all three settings: OR, VR and physical simulators. We chose to look at 26 continuous measures of performance (Table 3.1) to get a more complete picture of motor performance than is available from a summary statistic such as mean velocity.

Table 3.1: The performance measures available from each of the three settings. The orange simulator and OR settings are capable of recording all measures, as they both use the experimental hybrid tool, while the VR simulator is limited in roll and tool tip forces.

                          Operating Room   VR Simulator   Orange Simulator
Tip Distance from Mean:
  Absolute                      X               X                X
  Roll                          X                                X
Tip Velocity:
  x, y, z                       X               X                X
  Transverse                    X               X                X
  Absolute                      X               X                X
  Roll                          X                                X
Tip Acceleration:
  x, y, z                       X               X                X
  Transverse                    X               X                X
  Absolute                      X               X                X
  Roll                          X                                X
Tip Jerk:
  x, y, z                       X               X                X
  Transverse                    X               X                X
  Absolute                      X               X                X
  Roll                          X                                X
Tip Force:
  x, y, z                       X                                X
  Transverse                    X                                X
  Absolute                      X               X                X
  Roll                          X                                X

Figure 3.3: Tool tip reference frame (grasp, translate, rotation and transverse plane directions). The performance measures presented represent the behaviour of the tool tip.

3.3 Context Comparisons

Once data was collected and formatted from the OR, VR and physical simulators, context comparisons were made. We wanted to examine intra- and intersubject variability in all settings, as well as differences in operator behaviour across the three settings. These assessments, comparing expert performance in a live OR to performance on an orange, require a robust descriptive statistic that can pick up differences in behaviour without changing or making any assumptions about the data.

3.3.1 Kolmogorov-Smirnov Statistic

The Kolmogorov-Smirnov (KS) statistic takes two cumulative probability distributions (CPDs) of difference measures (e.g., velocity, jerk, force) and measures the maximum absolute vertical difference (D) between them (Figure 3.4a).
The values of D range between 0 (identical) and 1 (maximum difference). The KS statistic requires no a priori assumptions about the shape of the distributions and allows us to easily compare behaviour in different settings. The statistic is also internally normalizing, since it is derived from cumulative probability distributions, and it is comparatively insensitive to outliers in the datasets (Hodgson 2002). This is an important characteristic for this study, as outliers are prevalent in the OR data even after extensive filtering (Section 2.3, Chapter 2). The KS statistic is also insensitive to x-axis rescaling: we could, for example, plot the data on a logarithmic x-axis for plotting clarity and still obtain the same D-value. Previous work has shown that the KS statistic is valuable for assessing behavioural differences across contexts (Boer 1996; McBeth 2001). All behaviour comparisons are made using the KS statistic. To calibrate the scale of D, we took two normally distributed curves (n = 1000) with initially the same mean and shifted them apart; a D-value of 0.3 corresponds to a shift of approximately one standard deviation between two normally distributed curves (Figure 3.4b).

Figure 3.4a: Representation of the Kolmogorov-Smirnov statistic D, the maximum difference between two cumulative probability distributions. The KS statistic is insensitive to rescaling; when the log of the x-axis is plotted, we get the same D-value between the two distributions.

Figure 3.4b: Representation of the KS statistic between two normally distributed curves as they are shifted apart by 1, 3 and 5 standard deviations.

3.3.2 Assessing Difference and Reliability

Bootstrapping

Using the aforementioned KS statistic, quantitative comparisons can be made between subjects and settings. In order to assign a confidence interval to our estimate of D, and so get a sense of its statistical reliability, we use a bootstrapping method based on a single set of measures.
There are other, analytical, methods we could use to put a confidence interval on the D-values. However, given the complexity of our data and of the comparisons we are trying to make, a more general method is required, as the analytical methods become impractically complex. Bootstrapping is a computer-intensive method that involves randomly resampling the dataset and applying the KS statistic at each bootstrapping cycle. This gives a measure of the accuracy of the D-value by assigning a confidence interval to it. The bootstrap method attempts to recreate the relationship between "population" and "sample" by assuming that the available sample (in this case, the OR and simulator data) is representative of the underlying population; by appropriately resampling the available data to generate the bootstrap sample, the original relationship between population and sample can be estimated (Efron 1986, Lahiri 2003). However, due to the small sample size in this study, the bootstrap estimates are only used to give an idea of the error on the difference value calculated for that specific case. For example, if we were to measure the height of one doctor and one engineer with an inaccurate ruler, the confidence interval on this measurement would characterize our measurement technique and would not let us draw any conclusions about the heights of all engineers and doctors. Each of the confidence intervals should be examined with this fact in mind.

Data Dependency

The simplest application of the bootstrap method assumes completely independent data. However, our datasets are temporally correlated, such that the value at one data point x_i depends on the values of earlier data points (x_{i-1}, x_{i-2}, ..., x_{i-m}), with m unknown. This correlation effectively reduces the number of independent points and would artificially shrink the calculated confidence intervals.
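To make the dependence concrete, the lag at which samples become effectively independent can be estimated from the sample autocorrelation. The sketch below (hypothetical helper names, not the thesis's analysis code) returns the first lag at which the autocorrelation falls below a chosen cutoff, such as the 0.135 value used in this section:

```python
def autocorrelation(series, lag):
    """Sample autocorrelation r at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

def decorrelation_lag(series, cutoff=0.135):
    """First lag at which the autocorrelation drops below the cutoff;
    samples spaced at least this far apart are treated as
    effectively independent."""
    for lag in range(1, len(series)):
        if autocorrelation(series, lag) < cutoff:
            return lag
    return len(series)
```

Dividing the total number of samples by this lag gives the effective number of independent points used when sizing the bootstrap confidence intervals.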
An autoregressive analysis was performed on a sample dataset to provide a basic understanding of the extent to which our data are correlated (Figure 3.5). An analytical formula based on a first-order autoregressive model gives the cutoff for determining where the data become uncorrelated at an autoregressive r-value (the y-axis of Figure 3.5) of 0.135 (e-Handbook of Statistical Methods). The autocorrelation output indicates that the data are essentially uncorrelated at lags of 65. This means that, when collecting data at 120 Hz, there is one completely independent data sample approximately every 0.5 seconds. For example, if we have 48 000 data points (400 seconds of data at 120 Hz), there are roughly 48 000 / 65 ≈ 740 independent samples.

Figure 3.5: Sample autocorrelation output for absolute velocity data from one OR trial (given measurements Y = Y_1, Y_2, ..., Y_N (Lahiri 2003)). The cutoff r-value of 0.135 is crossed at a lag of approximately 65 samples.

In the case of temporally correlated data, the bootstrap method becomes slightly more complex, but the basic concept remains the same. There are several methods for resampling dependent data (Lahiri 2003), but most require a firm understanding of the order of the autoregressive model that fits the dataset, which can be difficult to ascertain. The bootstrap method chosen for our data is the moving block bootstrap. This method accounts for the dependency of the data but does not require any knowledge of the autoregressive model order. The moving block bootstrap is discussed in the following section.

Moving Block Bootstrap

The moving block bootstrap (MBB) is applicable to a dependent dataset without any parametric model knowledge or assumptions (Kunsch 1989, Liu 1992). Instead of resampling single values as in the standard bootstrap method, the MBB randomly resamples blocks of data, thereby maintaining the dependence structure of the original dataset within each block (Lahiri 2003).
To apply the MBB, the original dataset X_n = {X_1, ..., X_n} is divided into overlapping blocks of length l to create a set of blocks {B_1, ..., B_N}, where N = n - l + 1 (Figure 3.6).

Figure 3.6: The MBB method breaks the dependent dataset that is to be resampled into N = n - l + 1 overlapping blocks. These blocks are then randomly resampled with replacement and the resampled dataset assembled from the resampled blocks, thereby preserving the dependent structure of the original dataset.

From the set of blocks {B_1, ..., B_N}, a suitable number, k, of blocks is resampled with replacement to create a resampled set of blocks {B*_1, ..., B*_k}. The new dataset, X*, is then assembled from the elements of B*. The value of k is chosen so that each bootstrap sample is the same length as the original sample (n = k * l). The block length l increases with the length of the original sample and also depends on what the bootstrap sample is being used to estimate; for our purpose of finding confidence intervals, l is on the order of n^(1/5) (Hall 1995).

Assigning Confidence Intervals to D-values

When comparing two CPDs (for example, absolute velocity profiles between the VR simulator and OR), a measure of the reliability of the resulting D-value is important. In order to assign a confidence interval to a particular difference measure (D-value), we use the MBB method described above to resample each dataset separately and then measure the D-value between the two independently resampled datasets. By MBB-resampling each CPD separately, and finding 1000 D-values, a CPD of D-values is created. The 2.5th to 97.5th percentiles of this CPD of D-values form the confidence interval on D_1-2 (Figure 3.7). The confidence interval indicates the range of D-values we are likely to obtain, assuming the underlying distribution is represented. The size of a confidence interval depends on the effective size of the dataset and the variability within the distribution.
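A minimal sketch of the MBB resampling step just described (illustrative only; the function name, block length argument and random source are assumptions):

```python
import random

def moving_block_bootstrap(data, block_len, rng=random):
    """One moving-block-bootstrap resample of a temporally dependent
    series.  The series is cut into the N = n - l + 1 overlapping
    blocks of length l described above; blocks are drawn with
    replacement and concatenated until the resample reaches the
    original length n, preserving within-block dependence."""
    n = len(data)
    blocks = [data[i:i + block_len] for i in range(n - block_len + 1)]
    resample = []
    while len(resample) < n:
        resample.extend(rng.choice(blocks))
    return resample[:n]
```

With block_len on the order of n**(1/5), repeating this for each of the two datasets being compared and recomputing D at every cycle yields the distribution of D-values from which the 2.5th-97.5th percentile confidence interval is read off.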
Given the large size of each dataset, even after accounting for dependency, small confidence intervals are expected around a D-value calculated between two distributions. These confidence intervals give a measure of the error involved in calculating the difference value examined for each comparison; because of the small sample size, we are not able to use this method to draw conclusions about the actual population variability.

Figure 3.7: To assign a confidence interval to any particular difference value (D-value), each CPD is independently resampled 1000 times to create a CPD of D-values from which the confidence interval is extracted.

Assessing Difference

When using the KS statistic to compare subject behaviour between settings, a measure of statistical significance is necessary to assess how different the measured values actually are. One distribution of performance measures (e.g., velocity, force), assigned to be the reference (CPD_ref), is resampled, and each resampled distribution (CPD_rs) is compared to the reference. This is done ~1000 times to obtain a distribution of difference values between the reference and the resampled distributions, CPD(D_rs-ref). The D-value at the 95th percentile of CPD(D_rs-ref) is the critical D-value (D_cr). This value is used to assess statistical difference: any new measured D-value (D_meas) greater than D_cr is considered statistically different from the reference, and vice versa. For example, if we are trying to determine whether the velocity profile of Expert 2 (E2) differs from that of Expert 1 (E1) in a simulator, E1 is taken as the reference and resampled to obtain CPD(D_rs-ref). If D_E1-E2 falls outside the 95th percentile of CPD(D_rs-ref), then the two experts' behaviour in the simulator is considered different (Figure 3.8).
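A hedged sketch of this critical-value procedure, self-contained with a compact D computation inlined (the function names and the default block length are illustrative assumptions, not the study's code):

```python
import random

def ks_d(a, b):
    """Maximum vertical distance between two empirical CPDs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def critical_d(reference, block_len=8, n_boot=1000, rng=random):
    """95th percentile of D between the reference distribution and
    moving-block resamples of itself (CPD(D_rs-ref)).  A measured
    D-value above this threshold is taken as a statistically real
    behavioural difference."""
    n = len(reference)
    blocks = [reference[i:i + block_len] for i in range(n - block_len + 1)]
    ds = []
    for _ in range(n_boot):
        resample = []
        while len(resample) < n:
            resample.extend(rng.choice(blocks))
        ds.append(ks_d(reference, resample[:n]))
    return sorted(ds)[int(0.95 * n_boot)]
```

Any D_meas between two experts (or settings) can then be compared directly against the returned threshold.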
Figure 3.8: By finding a CPD of D-values between CPD_rs and CPD_ref, we can assess the statistical relevance of any measured D-value. If D_meas is greater than the D_cr value of CPD(D_rs-ref), then the two CPDs under consideration are different.

3.4 Results

Due to time constraints and the complexity of collecting data from a human OR, we were limited in the amount of data we were able to successfully collect (Table 3.2). A total of five OR trials were attempted, but owing to complications within the surgery, such as the surgeon contaminating the tool or computer failure, only data from the three successful collections will be presented.

Table 3.2: Summary of data trials from each setting.

            OR    VR Simulator    Physical Simulator
Expert 1     2          2                  1
Expert 2     1          2                  1

Intrasubject results from each of the trials, where available, will be presented first. Each OR trial consists of several segments of data. The intraoperative intrasubject comparisons are presented first; these were made to assess the level of intraprocedural variability, and allow some conclusions about the differences actually seen between the observed segments. We report the variability associated with redoing the same task with the same patient at the same stage in the procedure; these variability measures exclude the variability between patients at each stage of the procedure. The confidence intervals from bootstrapping are essentially a measure of the error on the whole distribution. The intrasubject intrasetting results give a measure of the interprocedural variability between OR and simulator trials; we find a point estimate of the difference between procedures, and our bootstrapping routine gives a measure of the error associated with finding these differences. The next logical step is intrasetting intersubject comparisons: Expert 1 is compared to Expert 2 in each of the three contexts (OR and both simulators). Finally, intersetting intersubject comparisons are drawn.
Each expert's OR behaviour is compared to his simulator behaviour, thereby assessing the validity of each simulator as we have defined it. The difference measures (Figure 3.9) found in the preceding comparisons are then paired to examine the differences in settings between experts; in other words, we investigate whether the difference from OR to simulator for Expert 1 is the same as that for Expert 2. Finally, the differences in experts between settings are examined, answering the question of whether differences between experts in the OR carry over to the simulators.

A1: Intrasubject intraprocedural OR comparisons
A2: Intrasubject interprocedural OR comparisons (E1)
A3: Intrasubject intertrial VR simulator comparisons
A4: Intersubject intrasetting OR comparisons
A5: Intersubject intrasetting VR simulator comparisons
A6: Intersubject intrasetting physical simulator comparisons
A7: Intrasubject intersetting comparisons (OR versus VR, OR versus physical)
A8: Intersubject intersetting: A7 (Expert 1) versus A7 (Expert 2)
A9: Intersetting intersubject: A4 versus A5 versus A6

Figure 3.9: Context comparisons made between subjects and settings to assess inter- and intraprocedural variability and simulator validity.

A variety of D-values from all comparisons are calculated. Recall that 0 represents two curves that are exactly the same and 1 is the absolute maximum difference possible. The size of the calculated D-value depends on the original sample sizes of the distributions being compared: the larger the sample size, the smaller a D-value must be to be considered "similar". However, since distributions across all settings have sample sizes on the order of several thousand, some generalizations about the relative sizes of the D-values can be made. Generally, D-values of approximately 0.02 - 0.05 are found when comparing two curves where one is resampled from itself (D_rs-ref).
For two curves that look completely different, such as absolute force between VR and OR, a D-value of approximately 0.8 - 1 was commonly found. For this project, a D-value less than 0.3 is considered "small", and behaviour is considered to be similar.

3.4.1 A1: Intrasubject Intraprocedural OR Comparisons

Within each OR procedure there are three dissection segments. The cumulative probability distributions of all performance measures across all directions are presented. The CPDs of the twenty-six performance measures (Table 3.1) are examined across the settings and shown in one plot with twenty-six subplots (Figure 3.10). The velocity, acceleration, jerk and force performance measures are evaluated in six tool tip directions: axial (z), grasp (y), translation (x), transverse (sqrt(x^2 + y^2)), absolute (sqrt(x^2 + y^2 + z^2)), and roll about the tool axis (Figure 3.3). The distance from mean performance measure (Dmean) is only calculated in the absolute and roll directions because of the sensitivity of this measure, in the individual directions, to the choice of global reference frame. The 75th percentile window of the cumulative probability distributions is shown for easy interpretation of the critical median values; the important differences between compared CPDs are virtually always in this region. The distributions all have very long tails, and if the entire CPD were shown the critical regions would appear nearly vertical and relevant differences would be difficult to see. When we calculate differences between two CPDs for each performance measure in one particular setting (for example, Expert 1 versus Expert 2 behaviour in the physical simulator), these differences are also presented in one plot (Figure 3.11). The D-values are shown with their corresponding confidence intervals (Figure 3.7).
We also show the critical confidence interval for determining statistical difference (CPD(D_rs-ref); Figure 3.8) to illustrate how close the measured D-values are to that critical value. To investigate the variability within procedures, each segment was compared to a reference CPD made up of the combined data from the other two segments, and a confidence interval was placed on each of these differences. The results from each OR trial are presented below.

Expert 1: Intrasubject Intraprocedural OR

The performance measures for each of the segments of Expert 1, OR trial 1, were extracted. All performance measures describe movement or behaviour of the tool tip frame (Figure 3.10). All cumulative probability distributions are, at first glance, very similar in shape and range. We immediately notice that the CPDs for velocity, acceleration and jerk were very consistent in their shape and differences, which suggests that only one derivative of position may be necessary. Force and distance from mean (Dmean) distributions are very different in shape. It appears that the three primary CPDs to examine, for this and all future CPD performance measure graphs, are the force, Dmean and velocity subplots; this holds true across all nine setting comparisons (Figure 3.9). Segment 1 is the most different from the other two segments, especially in the lateral and absolute directions of the kinematic performance measures. Recall that Segment 1 contains the majority of the data and represents the exploration and primary dissection phase of each procedure. The kinematic measure CPDs are mostly symmetric about 0, while the force distributions show a definite bias in one direction. Note that the roll torque indicates that the surgeon consistently rotates the tool in one direction. Also, axial forces are largest, as this is a combination of the axial tip force and the residual grip forces not removed through calibration.
The largest differences are seen in the deviation from mean measures, while the derivatives of rotation are the most consistent between segments. Each of the segments was statistically compared to a grouping of the other two segments (Figure 3.11). The D-values and their corresponding confidence intervals represent the amount of variability between segments in this OR procedure; this measures the importance of intraprocedural context, because each segment represents a slightly different OR task. The reference CPD for each segment comparison is resampled from itself (Figure 3.8) to create D_rs-ref. The closer the experimental D-value confidence interval between one segment and its reference (D_seg-ref) falls to the 95th percentile of CPD(D_rs-ref), the closer the behaviour is to being the "same". The performance measure CPDs indicate that segments 2 and 3 are very close to each other; to investigate this, these segments were compared directly and plotted (Figure 3.11). As expected from the CPDs, segment 1 is the most different from the other two segments across most performance measures. The lateral and transverse direction performance measures are more different than the individual components for all comparisons. The segment 2 versus segment 3 comparisons are very similar, with performance measure differences consistently smaller than for the other comparisons. Several performance measures even fall within the resampled reference distribution, CPD(D_rs-ref), indicating that the values being compared are essentially the same. These values occur in the individual kinematic components such as roll and x acceleration.

Figure 3.10: Expert 1 intraprocedure OR trial 1, segments 1 through 3. Each graph represents a performance measure of behaviour at the tool tip in a particular direction.
Figure 3.11: Expert 1 intraprocedural segment comparisons. The closer D_seg-ref is to D_cr of CPD(D_rs-ref), the closer the behaviour within that segment is to the total of the other two segments (x = translation, y = grasp, z = axial, Trans = transverse, Abs = absolute, Roll = rotation).

Expert 2: Intrasubject Intraprocedural OR

Expert 2 was the subject for two OR data collection procedures; data from each of these procedures are presented separately. For OR trial 1, which incidentally represents data from the first time the tool was used in an OR, segment 1 is significantly different from the other two segments across all performance measures (Figure 3.12). Each segment is subsequently compared to the total of the other two segments to get D_seg-ref and CPD(D_rs-ref) for each segment across each performance measure. Segment 2 is also compared to segment 3 to assess any context differences between those segments (Figure 3.13). The difference between any one segment and its reference can be quite large (>0.5), especially in force behaviour. However, when segment 2 is compared with segment 3 it is apparent that this difference is less than 0.2 for most performance measures (Figure 3.13). The intrasubject intraprocedural OR trial 2 segment CPDs (Figure 3.14) indicate very little difference between segments. This difference is statistically assessed, and the results (Figure 3.15) support the observation of small differences (<0.3) between segments for most performance measures. Force and distance from mean measures again have the largest relative differences between segments. This OR trial shows the least intersegment variability of any of the OR experiments.
Figure 3.12: Expert 2 intraprocedure OR trial 2, segments 1 through 3. Each graph represents a tool tip performance measure in a particular direction.

Figure 3.13: Expert 2 intraprocedural OR trial 2 segment comparisons. The closer D_seg-ref is to D_cr of CPD(D_rs-ref), the closer the behaviour within that segment is to the total of the other two segments (x = translation, y = grasp, z = axial, Trans = transverse, Abs = absolute, Roll = rotation).

Figure 3.14: Expert 2 intraprocedure OR trial 3, segments 1 through 3. Each graph represents a performance measure of behaviour at the tool tip in a particular direction.

Figure 3.15: Expert 2 intraprocedural OR trial 3 segment comparisons. The closer D_seg-ref is to D_rs-ref, the closer the behaviour within that segment is to the total of the other two segments (x = translation, y = grasp, z = axial, Trans = transverse, Abs = absolute, Roll = rotation).

3.4.2 A2: Intrasubject Interprocedural OR: Expert 2

Data from two OR trials were collected from Expert 2. The segment performance measure data from each of these trials were combined, and data from Trial 1 (Figure 3.12) were compared to data from Trial 2 (Figure 3.14).
Across most performance measures the difference values are below 0.3 (Figure 3.16). The largest differences are again apparent in the force and distance from mean behaviour measures. It is also noted that the differences are on par with, and even slightly lower than, the intraprocedural segment comparisons.

Figure 3.16: Expert 2 intrasubject interprocedural comparisons. The closer D_OR1-OR2 is to D_rs-ref, the closer the behaviour in the two procedures is to one another (x = translation, y = grasp, z = axial, Trans = transverse, Abs = absolute, Roll = rotation).

3.4.3 A3: Intrasubject VR Simulator Comparisons

Each expert performed the VR cystic duct dissection task twice, and intrasubject intertrial behaviour was examined. Very little variability was seen in the shape or range of any performance measure CPD, for either expert (Figures 3.17 and 3.18). Recall from Table 3.1 that only 17 performance measures are calculated for the VR simulator, as opposed to the 26 from the OR and physical simulator data: we do not have any roll data for the VR simulator, or separate force components (only absolute force). We noted the extremely small range of absolute force values in the VR simulator; the maximum force value at the 99th percentile was found to be only 4 N. The difference between expert behaviour in one trial versus the other was compared in the same way as for the intertrial OR comparisons, and these differences reflect the visual observations made from the performance measure CPDs, as many of the D_VR1-VR2 values fall within the 95th percentile of D_rs-ref (Figure 3.19). This means that for those performance measures, behaviour between trials is, statistically speaking, essentially the same. Again, the largest relative differences are seen in the force and distance from mean measures.
Figure 3.17: Expert 1 intertrial VR CPDs. Each graph represents a performance measure of behaviour at the tool tip in a particular direction. Recall that data are not available for any of the roll direction performance measures, or for separate directions of force (only absolute force). Note the extremely small range of the absolute force values.

Figure 3.18: Expert 2 intertrial VR CPDs. Each graph represents a performance measure of behaviour at the tool tip in a particular direction.

Figure 3.19: Intrasubject intertrial VR comparisons for both experts. The closer D_VR1-VR2 is to D_cr of CPD(D_rs-ref), the closer the behaviour in the two trials is to one another (x = translation, y = grasp, z = axial, Trans = transverse, Abs = absolute, Roll = rotation).

3.4.4 A4, A5 & A6: Intersubject Intrasetting Comparisons

Expert 1 performance measure CPDs are compared to Expert 2 performance measure CPDs in each of the three settings (Figures 3.20 through 3.22), and the differences between the experts' performance measure CPDs are assessed (Figure 3.23). The OR and physical simulator intersubject differences are on the order of 2 - 4 times larger than the VR intersubject differences. On visual examination, all CPD shapes are very similar, indicating that, at first glance, the experts may behave the same way as each other in all three settings. The OR force data has the largest intersubject variability, while the absolute force differences between experts in the orange and VR simulators are very comparable (within 0.01 of each other in Figure 3.23). Most of the D_E1-E2 values in the physical simulator and OR, with the exception of force, are below 0.35 and are in the same range as Expert 2's intrasubject intertrial OR differences. Recall, at this point, the amount of structure intrinsic to each setting: the VR simulator presents a highly structured environment, while the OR dissections are very unstructured.

Figure 3.20: Intersubject performance measure CPDs in the OR.

Figure 3.21: Intersubject performance measure CPDs in the VR simulator.

Figure 3.22: Intersubject performance measure CPDs in the physical simulator.

Figure 3.23: Intersubject intrasetting comparisons: E1 versus E2 across all three settings (OR, VR, and physical simulator). Recall that in the VR simulator only 17 performance measures are available (Table 3.1), as opposed to 26 for the OR and physical simulators (x = translation, y = grasp, z = axial, Trans = transverse, Abs = absolute, Roll = rotation).

3.4.5 A7: Intrasubject Intersetting Comparisons

The intrasubject behaviour across each setting was examined to assess the validity of the two simulators, VR and physical (Figures 3.24 and 3.25). The differences between OR and VR, OR and physical, and physical and VR were assessed for each expert (Figures 3.26 and 3.27). The intrasubject intersetting differences are much larger than any of the intersubject intrasetting comparisons. Note that the velocities are significantly larger in the physical simulator than in the VR and OR settings. Also, the forces exhibited in the OR and physical settings are so much larger than those seen in the VR simulator that the VR force CPD is not even apparent in the absolute force subplot of Figures 3.24 and 3.25.
Figure 3.24: Expert 1 intersetting performance measure CPDs: OR, VR and physical simulators. The force data from the VR simulator are so low that they are not apparent on this graph; refer to Figure 3.17 for Expert 1's absolute force CPD.

Figure 3.25: Expert 2 intersetting performance measure CPDs: OR, VR and physical simulators. The force data from the VR simulator are so low that they are almost not apparent on this graph; refer to Figure 3.18 for Expert 2's absolute force CPD.

The differences between each of the settings highlight the differences seen in the CPDs for each expert. For many of the kinematic performance measures, across both experts, the difference between the OR and physical settings is quite low (less than 0.4). The D_VR-OR values for the kinematic performance measures in the x, y, and z directions are below 0.3, while the values in the lateral and absolute planes are significantly larger. We consistently find that the largest intersetting differences occur for comparisons between the VR and physical simulators. As expected from the CPDs, the absolute force profile differences are very large (0.7 - 1). All intersetting differences are consistently larger than anything seen in the intrasetting comparisons. Also, the spread of D-values in the intersetting comparisons indicates a large amount of variability in the performance measures between settings.

Figure 3.26: Expert 1 intrasubject intersetting comparisons: D_VR-OR, D_P-OR, D_VR-P (x = translation, y = grasp, z = axial, Trans = transverse, Abs = absolute, Roll = rotation).
Figure 3.27: Expert 2 intrasubject intersetting comparisons: D_VR-OR, D_P-OR, D_VR-P (x = translation, y = grasp, z = axial, Trans = transverse, Abs = absolute, Roll = rotation).

3.4.6 A8: Intersubject Intersetting (A7 Expert 1 versus A7 Expert 2)

The intersubject intersetting comparison was done to assess whether the experts treat the simulators differently from the OR in a similar way. A confidence interval was placed on the difference between the experts' OR-to-VR D-values (D_E1,OR-VR - D_E2,OR-VR), as well as on the difference between their OR-to-physical D-values (D_E1,OR-Phys - D_E2,OR-Phys). If the confidence interval on (D_E1,OR-sim - D_E2,OR-sim) covers 0, then it can be stated that the differences between the experts' behaviour in the OR versus the simulator are the same (Figure 3.28); a confidence interval that covers 0 indicates the differences in intersetting comparisons between experts are negligible. This analysis reveals that for the physical simulator, the magnitudes of each expert's difference between the OR and simulator settings are similar, and the confidence intervals envelope zero, except in several of the force directions. The VR simulator difference intervals are mostly negative and consistently very close to zero; the negative values reflect the fact that Expert 1's difference magnitudes are, for the most part, larger than Expert 2's.

Figure 3.28: Intersubject intersetting paired differences, to analyze whether the experts differ in the same way (x = translation, y = grasp, z = axial, Trans = transverse, Abs = absolute, Roll = rotation).
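The paired-difference test used in this section can be sketched as follows. This is a self-contained illustration using plain resampling for brevity, whereas the study uses the moving block bootstrap; all names are hypothetical:

```python
import random

def ks_d(a, b):
    """Maximum vertical distance between two empirical CPDs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def paired_difference_ci(or1, sim1, or2, sim2, n_boot=500, rng=random):
    """Confidence interval on (D_E1,OR-sim - D_E2,OR-sim).  Each of
    the four datasets is resampled independently at every bootstrap
    cycle, the paired difference of D-values is recorded, and the
    2.5th-97.5th percentiles form the interval.  If the interval
    covers 0, the two experts' OR-to-simulator differences are
    statistically indistinguishable."""
    def resample(xs):
        return [rng.choice(xs) for _ in xs]
    diffs = sorted(
        ks_d(resample(or1), resample(sim1)) -
        ks_d(resample(or2), resample(sim2))
        for _ in range(n_boot))
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    return lo, hi, (lo <= 0.0 <= hi)
```

The A9 comparison in the next section follows the same pattern with (D_OR,E1-E2 - D_sim,E1-E2) as the paired quantity.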
3.4.7 A9: Intersetting Intersubject (A4 versus A5 versus A6)

To assess whether differences between experts in the OR carry over to the simulators, another paired difference was examined. The difference between experts in the OR was compared to the difference between experts in each of the simulators: (DOR,E1-E2 - Dsim,E1-E2) was calculated from the bootstrapped data and a confidence interval was placed on these differences (Figure 3.29). The differences between experts in the OR appear to transfer quite closely to the physical simulator, except in the force performance measures. Again, the VR difference intervals are small and clustered close to zero. The VR results imply that the differences between experts in the OR are greater than the differences between experts in the VR simulator, as was seen earlier.

Figure 3.29: Intersetting intersubject paired differences to analyze whether differences between experts in the OR carry over to the simulators (x=translation, y=grasp, z=axial, Trans=transverse, Abs=absolute, Roll=Rotation).

3.5 Discussion

This work was undertaken to quantitatively assess the performance validity of two types of surgical simulators, virtual reality and physical. The test was not designed to assess concurrent or construct validity; given our results, we are able to make some suggestions about concurrent and construct validity, but more tests need to be done to confirm these suspicions. The motor behaviour of the surgical tool tip was used to perform the quantitative comparisons between settings. On the path to answering the ultimate question of how valid the simulators are, several intermediate questions were proposed:

1) What is the amount of intraprocedural OR variability and what effect does this have on the present study?
How reliable are the proposed performance measures?

2) How much intertrial OR and VR variability is there for each expert surgeon? Can our validity assessment be used with a nominal number of trials and subjects and still provide believable preliminary results?

3) How different from each other are the two experts' motor behaviour profiles in each setting (OR, VR, and physical)?

4) How different are experts' motor behaviour profiles between the OR and simulators? What does this imply about simulator performance validity?

5) Are experts' OR-to-simulator differences comparable across experts? What does this imply about simulator validity?

6) Do differences between experts in the OR carry over to the simulators? Based on these results, could the simulators be used to assess future OR motor performance?

One of the limitations of this study was the small number of trials and subjects available to us. We were not able to specifically estimate the variability across procedures within a given surgeon or across surgeons as a whole. The confidence intervals on each of the discussed difference magnitudes represent the measurement error. However, we are able to conclude that the measurement resolution is good because of the very small confidence intervals.

3.5.1 Context Comparisons

Each of the questions posed above is answered with respect to the pilot study conducted and the results presented in earlier sections.

Intraprocedural OR Variability

The results from comparing intraprocedural segments across the three OR trials indicate that there can be a significant amount of variability between dissection segments within one procedure. This is expected because each segment represents a different portion of an OR procedure, with slightly different goals in mind. The intersegment variability indicates that context within a surgery makes a difference.
When all segments are combined, these data should be thought of not as three repetitions of the same task, but as three separate tasks combined to form one dissection time series. For two of the OR trials, segment 1 showed the largest differences from the other two segments; OR trial 1 from Expert 2 had the largest differences between segment 1 and the other two segments. Recall that segment 1 represents the exploration and anatomy identification stage early in the procedure. Generally, segments 2 and 3 represent different stages of cystic duct and gallbladder dissection, and have the smallest amount of difference between them. We suspect, based on these preliminary results, that segment 1 is the most variable portion of the intraoperative dissection tasks, depending on the complexity of the case and patient anatomy.

From the three OR trials, three levels of difference were obtained. OR trial 2 from Expert 2 showed the least intersegment variability, with most D-values below 0.3. Expert 2 OR trial 1 had the largest D-values, but it should be kept in mind that these data are from the first time the tool was used in an OR setting. Expert 1 OR trial 1 had D-values somewhere in the middle between the other two procedures. Nonetheless, while a certain amount of difference was seen between segments, we feel comfortable stating that intraprocedural repeatability can be very high, particularly when one considers specific OR dissection segments (segments 2 and 3).

Intertrial Intrasetting Variability

Intrasubject Intertrial OR

The intertrial OR behaviour of Expert 2 reveals that intertrial variability is on par with, and can even be slightly lower than, the differences found between segments of the same procedure. The three procedure segments are grouped to form a single extended dissection task that is compared between procedures.
Because the subcontexts differ within a procedure, we might expect to see, and do in fact see, lower intertrial differences than intraprocedural OR differences, despite differences in patients and OR staff (D < 0.3). This result lends credence to the theory behind the method we are using to assess simulator validity: if intertrial OR variability is low, we can confidently use these data to assess simulator validity. An intertrial variability assessment by McBeth showed that, for clipping task completion time, significant variations between patients were uncommon and an analysis of two or three procedures was sufficient to detect outliers and guard against misleading results (McBeth 2001). Our results support this finding by showing relatively small differences between two OR trials. The gathering and analysis of OR data is such a time-consuming process that the minimum number of OR trials required should be determined; this study provides an initial first step towards setting that critical number.

Intrasubject Intertrial VR

A statistical comparison of intertrial expert surgeon behaviour differences showed that surgeons exhibit less variability across different VR simulator trials than they do in the OR. We anticipated this result because the simulated cystic duct dissection is a highly structured task that is exactly the same each time, whereas the OR dissection task is entirely the opposite. This result also shows that very few trials are needed from each subject when examining simulator performance. It is interesting to note that intertrial differences were the largest for absolute force profiles, as was the case with the intertrial OR differences. This also demonstrates that relevant OR force data were obtained despite the electromagnetic noise and sensor problems (refer to Sections 2.4.2 and 2.4.4 of Chapter 2). Force appears to be a sensitive measure of context performance differences.
Intersubject Intrasetting Comparisons

Operating Room

Intersubject OR differences were also surprisingly low for most comparisons, except the force measures. Intersubject differences were mostly below 0.3 and of the same magnitude as the intrasubject intertrial comparisons. This suggests that experts achieve successful OR results with similar tool use patterns (as assessed with multiple performance measures).

Virtual Reality Simulator

Intersubject VR differences are similarly very low. The experts received almost no training or familiarization time with the simulator, yet immediately handled the simulator in a similar way to one another. It is also interesting to note the experts' opinions about the simulator. If this were a face validity study (Table 1.1), the VR simulator would have fared poorly. Neither expert considered the simulator to be a useful training device. One expert said that, due to inconsistencies with OR protocol, extensive training on the simulator could lead to "bad habits" in the OR. Intersubject behaviour comparisons were very close even though one expert had much stronger objections about the inappropriateness of the simulator. This shows that opinion, or face validity, may not be consistently reflected in the data.

Physical Simulator

Intrasetting intersubject comparisons on the orange simulator were made even though only one trial from each expert on the physical simulator was available. These results show intersubject differences that fall somewhere in between the OR and VR intersubject differences. This suggests that intersubject physical simulator behaviour is almost the same across experts, with difference magnitudes below 0.3 for the most part. Face validity does not necessarily hold up here either. Expert 1 thought that the orange dissection task was such a good representation of the actual OR experience that it has been incorporated into the new general surgery training curriculum for 2004/2005.
Expert 2, on the other hand, saw no relevance whatsoever in the orange dissection task.

The magnitude of intersubject intrasetting differences decreased with the increasing levels of structure intrinsic to each setting, from OR to physical to VR. The OR is a highly unstructured task, with countless variables that can change from trial to trial, while the VR tasks are exactly the same no matter the number of trials. We find the greatest intersubject differences in the OR and the lowest in the VR simulator. These observations are reflected in the data and further support the validity of our performance measures and assessment methods.

Intrasubject Intersetting Comparisons

Each expert's behaviour was individually compared across the three settings: OR to VR, OR to physical, and VR to physical. The intersetting comparisons reveal much larger differences than any of the intrasetting analyses. For both experts, the OR to VR and OR to physical difference values are spread across the entire range of D-values (from 0 to 1). Several of the kinematics performance measures are on the order of 0.2 - 0.3, but in the lateral and absolute directions values as large as 0.4 - 0.6 are seen. While the differences between the OR and each of the simulators are large relative to what we have seen before, the most surprising result was that the largest differences were observed between the VR and physical simulators. The OR to VR and OR to physical simulator difference magnitudes are comparable. However, this does not mean that experts treat the two simulators in the same way; the data prove the opposite to be true. The OR behaviour typically falls somewhere in the middle of the two extremes of simulator behaviour. The kinematics profile ranges (velocity, acceleration and jerk) in the physical simulator, especially for Expert 1, are so much larger than those seen in the OR that further trials should be conducted and this behaviour investigated.
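The velocity, acceleration and jerk profiles compared here are successive time derivatives of the sampled tool tip position. A hedged sketch of that computation — Python rather than the Matlab used in this work, with an invented sampling rate and trace:

```python
import numpy as np

def kinematics_profiles(pos, fs):
    """Speed, acceleration and jerk magnitude profiles from a sampled
    3-D tool tip position trace (n x 3, in metres) at fs Hz, using
    central finite differences."""
    dt = 1.0 / fs
    vel = np.gradient(pos, dt, axis=0)    # m/s
    acc = np.gradient(vel, dt, axis=0)    # m/s^2
    jerk = np.gradient(acc, dt, axis=0)   # m/s^3
    return (np.linalg.norm(vel, axis=1),
            np.linalg.norm(acc, axis=1),
            np.linalg.norm(jerk, axis=1))

# Synthetic 1 s trace at 100 Hz: tip oscillating 5 cm along x
fs = 100.0
t = np.arange(0.0, 1.0, 1.0 / fs)
pos = np.column_stack([0.05 * np.sin(2 * np.pi * t),
                       np.zeros_like(t), np.zeros_like(t)])
speed, accel, jerk = kinematics_profiles(pos, fs)
# Peak speed should be near 0.05 * 2*pi, about 0.31 m/s
```

In practice the position trace would be low-pass filtered before differentiation, since each successive derivative amplifies sensor noise — a concern the text notes for the cautery-corrupted OR data.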
By contrast, the velocities in the VR simulator are significantly lower than those seen in the OR and physical simulator, which may indicate a certain amount of hesitation in the experts' behaviour towards the VR simulator. This may be due to unfamiliarity with the simulated environment, the unusual operating volume, or the constraints that implicitly affect the surgeons' strategy.

Of the two simulators, the force values in the VR simulator appear to be the least like those exhibited in the real situation. While there are some concerns with the completeness and accuracy of the experimental tool force measures (Sections 2.4.2 and 2.4.4 of Chapter 2), the extremely low force values in the VR simulator are dramatically different. This is corroborated by expert comments about the poor haptic realism of the simulator; most expert surgeons consider the haptic feedback of the simulator to be much too low, using expressions such as "is it (the haptic feedback) on?" We also note that force is the most variable performance measure in all comparisons. This points to the importance of force as an indicator of surgeon performance. Improvements to the haptic software, as well as to the experimental hybrid tool, would enable further tests of this hypothesis.

Intersubject Intersetting Paired Differences

The difference intervals between expert behaviour in the OR versus the simulators were used to analyze whether experts treat the simulators differently from the OR to a similar extent. The results indicate that they do. For the most part, these difference measures are clustered close to zero (-0.2 < D < 0.2). This demonstrates that the differences are similar for both experts and adds support to some of the previous suggestions that experts are extremely consistent in their approach to different tasks. In a concurrently valid simulator we would expect experts to differ from the OR in the same way as each other, as these results imply.
Intersetting Intersubject Paired Differences

Looking at the data in a slightly different way, we found that differences between experts in the OR carry over to differences in the simulators, particularly the physical simulator. Although we do not have enough data to test this explicitly, our results are compatible with the hypothesis that both simulators would pass construct validity tests, i.e. that they would be able to distinguish between surgeons of various skill levels. Since experts are very consistent in their approach, a database could be built to which all other performances could be compared. Based on these results, the simulators could also feasibly be used to assess OR performance. Obviously a significant amount of future work on both validity studies and simulator improvements needs to be done, but this is a first step towards assessing construct and, perhaps ultimately, predictive validity. Our results suggest that simulators are capable of eliciting the intrinsic differences between experts that are found in the OR. The physical simulator appears to be better at maintaining this difference, but recall that the same experimental hybrid tool was used for the OR and the physical simulator, so differences in the VR simulator may be partly due to the tool interface rather than the surgical simulation itself (i.e. tissue interaction and visual effects).

3.5.2 Performance Measure Reliability

The results from this study support the reliability of our performance measures. For the most part, we found that intrasetting kinematics and force-based performance measure distributions were consistent and are therefore reasonably reliable measures of performance. Both expert surgeons have similar tool tip patterns across most of the performance measures investigated here. This is directly relevant to the question of whether experts reach the same successful OR conclusion in the same way.
It is important because future studies can more confidently use tool tip performance measures to assess surgeon performance. An analysis of novice surgeon performance contrasted with our expert surgeon database would be the next logical step in validating the performance measures. The force performance measure and its associated tool tip directions suffered from the largest amount of intersubject and intersetting variability. This leads us to believe that force may be a particularly sensitive measure of differences between subjects and settings. More data may be required to reliably assess force distributions, and this should be investigated in future studies. The distance-from-mean performance measure was also quite variable with respect to the other kinematics measures.

3.6 Conclusions

We were able to successfully gather OR, VR and physical simulator tool tip motor behaviour data from two expert surgeons over a period of four months. The data from these trials were compared against each other using the KS statistic to quantitatively assess the performance measures and, ultimately, to test simulator validity. We found that experts' inter- and intraprocedural variability can be very low, indicating that the performance measures used are appropriate and reliable. The intrasubject OR to VR, OR to physical and physical to VR comparisons showed much higher differences than any of the intrasetting comparisons. This suggests that the simulators investigated here do not demonstrate performance validity; in other words, experts do not behave in the same way on the simulators as in the OR. However, we saw that differences between experts in the OR do indeed carry over to the simulators, indicating possible construct validity of both simulators. Intersubject difference magnitudes between the OR and simulators also imply construct validity.
More validity studies involving many more surgeons at different skill levels need to be conducted to fully assess all of these preliminary observations. However, we have shown the feasibility of undertaking a study such as this and of using the data in a constructive way to assess surgeon performance, new tool designs and teaching methods.

Chapter 4

Conclusions and Recommendations

4.1 Introduction

The goals of the present work were to design an experimental hybrid surgical tool to collect surgeon OR performance data, and to use it to quantitatively assess the validity of two laparoscopic surgical simulators. A standard laparoscopic tool was modified to accept a variety of sensors that are used to measure tool tip motor behaviour in a human operating room. This tool was used over a period of four months to collect operating room data from two expert laparoscopic surgeons. During this time, data on the experts' behaviour in two simulators (VR and physical) were also gathered. Performance measures from all three settings were extracted and compared. We have shown the reliability of our performance measures, and the feasibility of the methods and tools described. Our preliminary results indicate poor performance validity for both simulators.

4.2 Review of Present Research

4.2.1 Experimental Hybrid Surgical Tool

A hybrid surgical tool was successfully designed, in conjunction with Joanne Lim and with assistance from Brandon Lee, to measure continuous kinematics, forces and torques at the tool tip. As far as we know, this is the first attempt to mount sensors directly on a minimally invasive surgical tool for OR studies. The design of a surgical tool presented a unique set of problems: the tool and accompanying sensors must be easily sterilized, incorporate well with other OR tools, and be accepted by surgeons and operative staff.
One of the biggest challenges to overcome was the design of a sensor bracket to hold the F/T sensor and transmit the three-dimensional forces from the tool shaft to the sensor, which was mounted off-axis to the tool shaft. This sensor bracket is mounted to the laparoscopic tool shaft, and the forces at the inner tool shaft are transmitted through the bracket to the sensor without compromising the normal function of the tool. The hybrid kinematics and force/torque systems are mounted on the sensor bracket, and together form the hybrid system for continuous motor behaviour data collection. The kinematics hybrid system is made up of optoelectronic and electromagnetic systems. It exploits the accuracy of the optical sensor and the continuous high-frequency sampling of the magnetic sensor to create a hybrid position tracking system that provides data that are more reliable than either sensor individually. The F/T hybrid system is made up of a tri-axial F/T transducer to measure forces and torques in three dimensions about the tool shaft, and strain gauges mounted to the handle of the tool. The force system is the weakest link in the experimental tool: the kinematics hybrid system has been under development in our lab for several years and produces accurate and reliable data, while the force system is new with this project. Improving the F/T system should be made a top priority for future studies. Friction within the tool and a less-than-optimal bracket design resulted in misleadingly high tip force data, even after calibration. We believe that the theory and calibration methods outlined are correct, and will provide the foundation for accurate tip force data once the bracket is redesigned. If we obtain a sensor with a hole through the middle, similar to Rosen's (Rosen 2001), the sensor could be mounted on-axis to the tool shaft and we would no longer need a complicated bracket design.
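The mechanical job of the bracket — carrying the tool shaft wrench out to an off-axis sensor — corresponds to a standard rigid-body wrench transformation. The sketch below is ours, not the thesis's calibration model; the offset vector and values are invented, and both frames are assumed to share one orientation:

```python
import numpy as np

def transfer_wrench(f_sensor, t_sensor, r_tip_from_sensor):
    """Re-express a measured wrench (force, torque) at the tool tip.

    For a rigid link the force is unchanged, while the torque picks up
    the moment of the force about the new reference point:
        f_tip = f_sensor
        t_tip = t_sensor - r x f_sensor
    where r runs from the sensor origin to the tip.
    """
    f = np.asarray(f_sensor, dtype=float)
    tau = np.asarray(t_sensor, dtype=float)
    r = np.asarray(r_tip_from_sensor, dtype=float)
    return f, tau - np.cross(r, f)

# Hypothetical geometry: sensor 30 mm off-axis (x) and 350 mm behind
# the tip (z); units are N, N*m and m
f_tip, t_tip = transfer_wrench([0.0, 2.0, 0.0],
                               [0.0, 0.0, 0.0],
                               [0.03, 0.0, -0.35])
```

An on-axis sensor with a through-hole, as suggested above, makes r purely axial and removes the off-axis moment term that the bracket must otherwise transmit without flexing.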
Rigorous post-processing of the data provides continuous kinematics and F/T at the tool tip. Interference from OR equipment is a significant problem for the kinematics and force systems, primarily from the cautery (high-frequency, high-voltage) current used to cut and coagulate tissue. Our tool is particularly well suited to gathering tool tip kinematics data when no cautery is being applied. Cautery interferes most severely with the magnetic and strain gauge transducers. Extensive filtering was done, but noisy spikes were still common. This caused the biggest problems for F/T calibration and for calculating the derivatives of the kinematics data.

4.2.2 Data Collection

Operating Room

The experimental hybrid tool was taken to the OR a total of five times. Unfortunately, due to equipment and tool contamination difficulties, data from only three trials were gathered. Data collection in a human OR is a complex undertaking. We had to deal with equipment logistics, sterilization concerns, patient scheduling and consent (newly required in this study; previous studies were free of this burden), as well as a hospital employee union strike. These challenges prevented us from taking measurements on a weekly basis. An estimated 10-15 hours of work went into collecting the data from every one-hour procedure (equipment checks, marker sterilization and equipment transport). At least two researchers were necessary for every OR collection: one researcher scrubbed in and assisted with the experimental tool while the other operated the computer and video collection processes. Given the present state of the system, it is not practical to evaluate a large number of surgeons in the OR, but the system does serve as a valuable starting point for future improvements. It would be difficult to quantify the effects of the tool on surgeon performance. Expert surgeons had some misgivings about the experimental tool and made comments that included "cumbersome", "heavy" and "awkward".
The surgeons' primary concern was how heavy and large the bracket is. The bracket impedes normal roll around the tool axis, as the weight of the bracket and the wires tends to "pull" against tool rotation. These complaints are supported in the data by the high levels of jerk in the roll performance measures. Small, wireless sensors would be invaluable in improving this system by reducing the weight and wire management issues. Finally, the calibrations planned for the OR turned out to be impractical in actual use, which led to extremely time-consuming manual registration and synchronization calculations.

Simulators

The virtual reality simulator data collection was relatively simple compared to the OR data. The virtually noiseless data are processed by the simulator software and presented
Future studies should be done to determine the relative importance of a variety of performance measures in order to find the minimum number required. We chose to calculate and compare continuous measures of tool tip kinematics and forces and torques. Continuous measures were chosen, as opposed to summary statistics because we wanted to capture all aspects of motor performance in each setting. Also, it was not apparent a priori which summary measures would be most diagnostic or useful. Our performance measures appear to be reliable because we observed little variability between experts in the same setting. As well, the performance measures are able to detect the intrinsic differences between experts across the three settings, although these observations should be confirmed on a larger study. 4.2.4 Context Comparisons A variety of context comparisons were made across all three settings. The context comparisons were done to establish reliability of the performance measures, and validate 114 our assessment technique. The context comparison differences give us a set of difference values to use as a benchmark by which we compare, and assess, the relative magnitudes of the values obtained from the OR to simulator validity tests. In order to quantitatively calculate the differences, the Kolmogorov-Smirnov (KS) statistic was used. The KS statistic is useful because it allows us to compare subject behaviour over a wide range of different settings without making any parametric assumptions about the distribution of the data. We examined differences in surgical behaviour using twenty-seven performance measures. The observed differences in behaviour were compared to a reference curve to establish a measure of statistically significant difference. A bootstrapping technique was used to put confidence intervals on all of the calculated differences. 
These intervals are simply a measure of the error associated with the measurement of these D-values, and represent the resolution of the difference values. Intraprocedural comparisons indicate that context is important within a procedure. We suspect, based on the results, that the first dissection exploration segment is the most "different" segment in a procedure. A range of intraprocedural differences was seen, but motor behaviour in at least two of the three segments in each procedure is very similar (D<03.). The intrasubject interprocedural OR comparisons show that the intrasubject difference magnitudes between surgeries are also low. These differences are even lower than intraprocedural variability because we compare two trials of essentially the same task, while intraprocedural comparisons are looking at differences between three different subtasks. VR context comparisons show that experts behave in essentially the same way between trials. This is as expected because the simulator provides a highly structured data collection environment with no model variation. Intersubject intrasetting comparisons show increasing levels of difference from the most structured environment to the least, VR simulator to physical simulator to OR. Conclusions and suggestions drawn from the context comparisons include: 115 • Intraprocedural context matters (i.e.: one segment usually varies from the other two segments), but intraprocedural repeatability can be very high. • Exploration and anatomy identification segment (segment 1) is usually most variable, depending on complexity of case and patient anatomy. • On average, intrasubject interprocedural OR differences are low (D = 0.1 - 0.2). • Subjects behave the same way across different VR trials and these differences are much less than in OR. • Intersubject intrasetting variability is comparable to intrasubject intrasetting variability. 
• Intrasubject intersetting D-values decrease with increasing structure in setting (OR to VR to physical). • Force/torque signatures and distance from mean measures are the most sensitive performance measure to context differences in all comparisons. 4.2.5 Simulator Validity Assessment We assessed expert behaviour in the OR by comparing it to expert behaviour in the simulators and used these results to quantitatively assess validity. We find large differences, with respect to context comparison D-values, for all three settings. The largest differences are between the two simulators. Since the differences are comparable from OR to VR, and OR to physical, it is impossible to determine whether one simulator is superior to the other. Keep in mind, however, that the VR simulator costs ~$50 000 and the physical simulator ~$1. Expert behaviour in the physical simulator is significantly more "extreme" than the OR. Kinematics derivative ranges for physical simulator are, in particular, significantly higher than the same measures in the OR. Experts demonstrate, by comparison, hesitant behaviour in the VR setting, in that their kinematics derivatives and force values in particular, are much lower than the other two settings. This could be attributed to the "foreignness" of the virtual reality environment and the experts' discomfort with that. Both experts agree that the virtual reality simulator is a poor representation of the real task. It is interesting to note, however, that differences 116 between experts in the OR are maintained in the simulator. Also, experts tend to treat the two simulators differently in the same way. This shows us that experts consistently approach different tasks in a similar way, even though this happens to be different from the OR. Neither expert considered the VR simulator to be a useful measurement device. One expert considered the physical simulator to be a valuable training module while the other did not. 
These differences in opinion, however, were often not reflected in the data. Some preliminary conclusions from the validity tests include: • Both simulators are different from the OR, but are more different from each other. The OR behaviour falls somewhere in the middle of two extremes of motor behaviour exhibited in the simulators. • Experts treat simulators differently from the OR in a similar way. • Intersubject differences carry over to the simulator which hints at construct validity. • Face validity observations do not necessarily hold; the data often do not support expert observations. • Both simulators demonstrate poor concurrent validity using these performance measures. 4.3 Relevance of Present Research Until now, the other surgical simulator concurrent validity investigations used subjective checklists as the gold standard by which to compare simulator performance assessments (Seymour 2004). Several groups have examined face validity (Haluck 2001, McCarthy 1999). Although face validity was not specifically assessed in this study, observations from the participating expert surgeons were noted and most of these observations were not reflected in the data. For example, Expert 2 saw potential in the VR simulator for teaching basic surgical tool handling skills, yet his differences between the OR and VR 117 were larger than those from Expert 1, who incidentally has very strong anti-VR simulator views. Construct validity is probably the most common form of validity test for simulators (Adrales 2003,Datta 2002, Mahmood 2003, Taffmder 1998, Schijven 2003). Our results indicate that the simulators in this study would probably pass a construct validity test as differences between experts carry over from the OR to the simulator. Nonetheless, this preliminary study indicates that the motor behaviour between the OR and simulator are quite different. 
It could be hypothesized that experts will always demonstrate superior performance to novices on any task that requires manipulation of laparoscopic tools, even if the task does not emulate the actual OR task; passing a construct validity test is therefore weak evidence that a simulator is useful for training. As several expert surgeons have remarked, the virtual reality simulator may "train bad habits". Once construct validity has been shown for the VR simulator, it may be used to assess the "level" of a particular surgeon. However, in order to be used as a training and certification device, all levels of validity (Table 1.1) should be thoroughly investigated and proven.

4.4 Future Research Recommendations

Through consultation with other engineers, surgeons and surgical staff, some recommendations for future use of the system and methods have been suggested.

4.4.1 Hardware Improvements

• Redesign the sensor mounting bracket, centring its mass about the tool shaft in a compact way. Construct the bracket out of more rigid material to reduce warping, improve force transmission, and reduce friction in the tool operation.
• Investigate the possibility of using a different, preferably wireless, force/torque transducer with a larger sensing range.
• Investigate how to improve the strain gauge system, with filters or different gauge types, to reduce the amount of electromagnetic interference from OR equipment.
• Incorporate more surgical tool tips for data analysis. From our experience, the cautery hook would be the next most logical tool tip to examine.

4.4.2 Software Improvements

• Design a single program, run from a laptop computer, capable of acquiring data from and calibrating all sensors.
• "Vectorize" all Matlab code to improve the speed of post-processing calculations.

4.4.3 Data Acquisition

• Keep all equipment on one small dedicated cart.
• Use one stand capable of holding both the optical and video cameras.
• Develop a system whereby only one researcher needs to be in the OR for the full procedure, as OR crowding was a serious problem at times. The scrubbed researcher should remain throughout, but once the equipment is started the other researcher should leave the room.
• Gather video data of the VR simulator trials, to be certain of what the surgeon was doing at each point throughout each trial.

4.4.4 Data Analysis and Performance Measure Extraction

• Design an automated method for data registration and calibration, as planned extensive intra-OR calibrations are impractical and often not feasible. A marker in the data indicating exactly when each sensor begins collecting would help with this.
• Investigate ways of quantifying human reliability and frequency of errors (Joice 1998).
• Improve F/T sensor data calibration methods to take into account the force of the trocar on the tool shaft, and improve the models for removing the gravity and grip effects.
• Develop an automated system for extracting desired performance measures.

4.5 Future / Concurrent Studies

The next logical step in this project is to continue using the experimental hybrid tool to gather more OR data, once the F/T system has been improved. The work presented here is a first attempt to use a structured quantitative method to assess surgeon behaviour in a human OR. We have shown that the methods and tools used here are a feasible way to assess surgical tools and simulators. Several questions about the design and methods remain unanswered:

• Are the performance measures presented here capable of distinguishing between surgeons of various skill levels?
• What is the minimum number of performance measures required to get a well-rounded measure of a surgeon's skill level?
• How much data from each of the settings do we need to collect to obtain sufficient data and reduce patient/procedural variability to a nominal level?
• Can we verify that time spent in a simulator translates into improved operating room performance (transfer of training)?
• What are the minimum technological requirements of a VR simulator?

The last two questions are being addressed in two concurrent studies. Joanne Lim is using the experimental hybrid tool to assess how simulator training, both VR and physical, affects operating room performance. Iman Brouwer is measuring how the quality of the haptic hardware affects simulator performance.

A large future study in our lab is a continuation of this research. The newest study, proposed by Sayra Cristancho, is to use the experimental hybrid tool to assess more surgeons in the OR, and to use these data to create an expert surgeon database. The analysis of OR data is an extremely time-consuming process and limits the usability of the tool and methods. By automating the analysis, many more procedures can be examined and a very thorough understanding of how surgeons behave in the OR can be determined. This automated analysis could be used to further assess simulator validity and also to examine surgical curricula and new tool designs. Two surgical residents are involved in this study. Hamish Hwang is using the data from this study as well as Joanne Lim's work to assess error rates in laparoscopic surgery. He is using the laparoscopic video data from each procedure as well as the tool tip motor behaviour to draw parallels between quantitative data and qualitative quality observations by an expert performance evaluator. Hanna Piper is assessing the proposed general surgery curricula using, in part, the experimental hybrid tool and methods. She proposes to measure performance on certain bench-top modules before and after training to determine the potential of the module. The results from this thesis, combined with the impending results from concurrent studies and future work, will provide a significant and unique contribution to the area of surgical simulator validity assessment.
List of Terms

Abs	Absolute
Acc'n	Acceleration
βl	Block of length l of dependent data
βl*	Block of length l randomly resampled from the original block set
{β1 ... βN}	Blocks of dependent data created from the original dependent dataset
{β1* ... βk*}	Resampled blocks of dependent data from the original block set
CA	Cystic artery
CPD	Cumulative Probability Distribution
CPD(DRs-ref)	Cumulative Probability Distribution for the bootstrapped data resampled from itself and compared to the reference
CD	Cystic Duct
CDD	Cystic Duct Dissection
D	KS (D) statistic difference measure
Dcr	Critical D-measure at the 95th percentile of CPD(DRs-ref)
D1-2	KS (D) statistic between two CPDs
Δt	Time difference between 2 data streams
ES	Electrosurgery
ESU	Electrosurgical Unit
F/T	Force/Torque
GB	Gallbladder
GBD	Gallbladder dissection
GUI	Graphical User Interface
Hz	Hertz
k	Length of resampled block set (k = N/l)
K	Grip constant vector
KS	Kolmogorov-Smirnov
l	Length of an individual block for MBB
mag	Magnetic data
MBB	Moving Block Bootstrap
MDMArray	Multi-Dimensional Marker Array
mm	Millimetres
N	Length of block set (by default N = n - l + 1)
n	Length of original data set
opt	Optical data
Phy	Physical (simulator)
rad	Radians
r	Number of bootstrapping cycles
RF	Radio Frequency
RMS	Root Mean Square error
Roll	Rotation about the experimental tool axis (see Rot)
Rot	Rotation about the experimental tool axis
Rtipframe	3D vector representing the force frame in the tip frame
s	Seconds
synch	Synchronization
1T2	The transformation from frame 2 to frame 1 (the location and orientation of frame 2 in frame 1)
Trans	Transverse
toolref	Tool reference frame (Face A on the MDMArray)
V	Volts
vect	Vector
vel	Velocity
VR	Virtual Reality (simulator)
x	Translation direction of tool tip frame
Xi	Data at point i
Xi*	Resampled data at point i
{X1 ... Xn}	Original data set
{X1* ... Xn*}	Resampled data set, created from the resampled block set
y	Grasping direction of tool tip frame
z	Axial direction of tool tip frame

Bibliography

American Psychological Association (APA), American Educational Research Association, and National Council on Measurement in Education (1974). Standards for educational and psychological tests. APA, Washington (DC).
Atsma W. ATI force transducer driver. http://www.mech.ubc.ca/~watsma/ATI-FT_driver/. March 20, 2002.
Bailey R., Imbembo A., Zucker K. (1991). Establishment of a laparoscopic cholecystectomy training program. American Journal of Surgery 163: 46-52.
Baldwin P.J., Paisley A.M., Brown S.P. (1999). Consultant surgeons' opinion of the skills required of basic surgical trainees. British Journal of Surgery 86(8): 1078-1082.
Bann S.D., Khan M.S., Darzi A.W. (2003). Measurement of surgical dexterity using motion analysis of simple bench tasks. World Journal of Surgery 27(4): 390-394.
Berg D., Berkley J., Weghorst S., Raugi G., Turkiyyah G., Ganter M., Quintanilla F., Oppenheimer P. (2001). Issues in validation of a dermatologic surgery simulator. Studies in Health Technology and Informatics 81: 60-5.
Berguer R., Rab G.T., Abu-Ghaida H., Alarcon A., Chung J. (1997). A comparison of surgeons' posture during laparoscopic and open surgical procedures. Surgical Endoscopy 11(2): 139-142.
Birkfellner W., Watzinger F., Wanschitz F., Ewers R., Bergmann H. (1998). Calibration of tracking systems in a surgical environment. IEEE Transactions on Medical Imaging 17(5): 737-742.
Bholat O.S., Haluck R.S., Kutz R.H., Gorman P.J., Krummel T.M. (1999). Defining the role of haptic feedback in minimally invasive surgery. Studies in Health Technology and Informatics 62: 62-66.
Bloom M.B., Rawn C.L., Salzberg A.D., Krummel T.M. (2003). Virtual reality applied to procedural testing: the next era. Annals of Surgery 237(3): 442-448.
Boer E.R., Fernandez M., Pentland A., Leu A. (1996).
Method for evaluating human and simulated drivers in real traffic situations. IEEE Vehicular Technology Conference 3: 1810-15.
Cao C.G., MacKenzie C.L., Ibbotson J.A., Turner L.J., Blair N.P., Nagy A.G. (1999). Hierarchical decomposition of laparoscopic procedures. Studies in Health Technology and Informatics 62: 83-9.
Chaudhry A., Sutton C., Wood J., Stone R., McCloy R. (1999). Learning rate for laparoscopic surgical skills on MIST VR, a virtual reality simulator: quality of human-computer interface. Annals of the Royal College of Surgeons of England 81(4): 281-286.
Chung J.Y., Sackier J.M. (1998). A method of objectively evaluating improvements in laparoscopic skills. Surgical Endoscopy 12(9): 1111-6.
Childs J.M. (1980). Time and error measures of human performance: a note on Bradley's optimal-pessimal paradox. Human Factors 22(1): 113-117.
Coleman R.L., Muller C.Y. (2002). Effects of a laboratory-based skills curriculum on laparoscopic proficiency: a randomized trial. American Journal of Obstetrics and Gynecology 186(4): 836-42.
Cotin S., Stylopoulos N., Ottensmeyer M., Neumann P., Rattner D., Dawson S. (2002). Metrics for laparoscopic skills trainers: the weakest link! Lecture Notes in Computer Science 248: 35-43.
Darzi A., Datta V., Mackay S. (2001). The challenge of objective assessment of surgical skill. American Journal of Surgery 181(6): 484-486.
Datta V., Chang A., Mackay S., Darzi A. (2002). The relationship between motion analysis and surgical technical assessments. American Journal of Surgery 184(1): 70-73.
Datta V., Mackay S., Mandalia M., Darzi A. (2001). The use of electromagnetic motion tracking analysis to objectively measure open surgical skill in the laboratory-based model. Journal of the American College of Surgeons 193(5): 480-485.
Darzi A., Smith S., Taffinder N. (1999). Assessing operative skill: needs to become more objective. British Medical Journal 318(7188): 887-888.
Day J., Murdoch D., Dumas G. (2000).
Calibration of position and angular data from a magnetic tracking device. Journal of Biomechanics 33: 1039-1045.
den Boer K.T., de Wit L.T., Davids P.H., Dankelman J., Gouma D.J. (2001). Analysis of the quality and efficiency in learning laparoscopic skills. Surgical Endoscopy 15(5): 497-503.
Dent T. (1991). Training, credentialing, and granting clinical privileges for laparoscopic general surgery. American Journal of Surgery 161: 399-403.
Derossis A.M., Bothwell J., Sigman H.H., Fried G.M. (1998). The effect of practice on performance in a laparoscopic simulator. Surgical Endoscopy 12(9): 1117-1120.
de Visser H., Heijnsdijk E.A., Herder J.L., Pistecky P.V. (2002). Forces and displacements in colon surgery. Surgical Endoscopy 16(10): 1426-30.
Efron B., Tibshirani R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1: 54-77.
Elmenreich W. (2002). An introduction to sensor fusion - research report. Vienna University of Technology, Austria.
Eubanks T.R., Clements R.H., Pohl D., Williams N., Schaad D.C., Horgan S., Pellegrini C. (1999). An objective scoring system for laparoscopic cholecystectomy. Journal of the American College of Surgeons 189(6): 566-574.
Fabiani M., Buckley J., Gratton G., Coles M.G.H., Donchin E. (1989). The training of a complex task performance. Acta Psychologica 71: 259-299.
Faulkner H., Regehr G., Martin J., Reznick R. (1996). Validation of an objective structured assessment of technical skill for surgical residents. Academic Medicine 71(12): 1363-1365.
Feldman L.S., Sherman V., Fried G.M. (2004). Using simulators to assess laparoscopic competence: ready for widespread use? Surgery 135: 28-42.
Fitzpatrick C.M., Kolesari G.L., Brasel K.J. (2001). Teaching anatomy with surgeons' tools: use of the laparoscope in clinical anatomy. Clinical Anatomy 14: 349-53.
Franklin M. (1998). Laparoscopic surgery reduces hospital stays.
San Antonio Business Journal http://www.bizjournals.com/sanantonio/stories/1998/07/27/focus4.html?t=printable. May 12, 2004.
Frantz D.D., Wiles A.D., Leis S.E., Kirsch S.R. (2003). Accuracy assessment protocols for electromagnetic tracking systems. Physics in Medicine and Biology 48: 2241-2251.
Gallagher A.G., Lederman A.B., McGlade K., Satava R.M., Smith C.D. (2004). Discriminative validity of the minimally invasive surgical trainer in virtual reality (MIST-VR) using criteria levels based on expert performance. Surgical Endoscopy 18(24): 660-5.
Gallagher A.G., McClure N., McGuigan J., Ritchie K., Sheehy N.P. (1998). An ergonomic analysis of the fulcrum effect in the acquisition of endoscopic skills. Endoscopy 30(7): 617-20.
Gallagher A.G., Ritter E.M., Satava R.M. (2003). Fundamental principles of validation and reliability: rigorous science for the assessment of surgical education and training. Surgical Endoscopy 17(10): 1525-9.
Gallagher A.G., Satava R.M. (2002). Virtual reality as a metric for the assessment of laparoscopic psychomotor skills. Surgical Endoscopy 16(12): 1746-52.
Gallagher A.G., Smith C.D., Bowers S.P., Seymour N.E., Pearson A., McNatt S., Hananel D., Satava R.M. (2003). Psychomotor skills assessment in practicing surgeons experienced in performing advanced laparoscopic procedures. Journal of the American College of Surgeons 197(3): 479-88.
Girija G., Raol J.R., Appavu Raj R., Sudesh K. (2000). Tracking and multi-sensor data fusion. Sadhana 25(2): 159-167.
Goff B.A., Nielsen P.E., Lentz G.M., Chow G.E., Chalmers R.W., Fenner D., Mandel L.S. (2002). Surgical skills assessment: a blinded examination of obstetrics and gynecology residents. American Journal of Obstetrics and Gynecology 186(4): 613-617.
Grantcharov T.P., Bardram L., Funch-Jensen P., Rosenberg J. (2003). Learning curves and impact of previous operative experience on performance on a virtual reality simulator to test laparoscopic surgical skills. The American Journal of Surgery 185: 146-9.
Grantcharov T.P., Kristiansen V.B., Bendix J., Bardram L., Rosenberg J., Funch-Jensen P. (2004). Randomized clinical trial of virtual reality simulation for laparoscopic skills training. British Journal of Surgery 91(2): 146-50.
Grantcharov T.P., Rosenberg J., Pahle E., Funch-Jensen P. (2001). Virtual reality computer simulation. Surgical Endoscopy 15: 242-4.
Grober E.D., Hamstra S.J., Wanzel K.R., Reznick R.K., Matsumoto E.D., Sidhu R.S., Jarvi K.A. (2003). Validation of novel and objective measures of microsurgical skill: hand-motion analysis and stereoscopic visual acuity. Microsurgery 23(4): 317-22.
Hall P., Horowitz J.L., Jing B.Y. (1995). On blocking rules for the bootstrap with dependent data. Biometrika 82(3): 561-74.
Halsted W.S. (1904). The training of the surgeon. Bulletin of the Johns Hopkins Hospital 15: 267-275.
Haluck R.S., Webster R.W., Snyder A.J., Melkonian M.G., Mohler B.J., Dise M.L., Lefever A. (2001). A virtual reality surgical trainer for navigation in laparoscopic surgery. Studies in Health Technology and Informatics 81: 171-6.
Hamilton E.C., Scott D.J., Fleming J.B., Rege R.V., Laycock R., Bergen P.C., Tesfay S.T., Jones D.B. (2001). Comparison of video trainer and virtual reality training systems on acquisition of laparoscopic skills. Surgical Endoscopy 16: 406-11.
Hanada E., Takano K., Antoku Y., Matsumura K., Watanabe Y., Nose Y. (2002). A practical procedure to prevent electromagnetic interference with electronic medical equipment. Journal of Medical Systems 26(1): 61-5.
Hasson H.M., Kumari N.V., Eekhout J. (2001). Training simulator for developing laparoscopic skills. Journal of the Society of Laparoendoscopic Surgeons 5(3): 255-265.
Hodgson A.J. (1994). Considerations in applying dynamic programming filters to the smoothing of noisy data. Journal of Biomechanical Engineering 116: 528-31.
Hodgson A.J., Person J.G., Salcudean S.E., Nagy A.G. (1999). The effects of physical constraints in laparoscopic surgery.
Medical Image Analysis 3(3): 275-83.
Hodgson A.J., McBeth P.B. (2002). Comparing motor performance on similar tasks in different settings: statistical characteristics of a nondimensional difference measure. Internal Document.
Hu T., Tholey G., Desai J.P., Castellanos A.E. (2004). Evaluation of a laparoscopic grasper with force feedback. Surgical Endoscopy 18(5): 863-7.
Joice P., Hanna G.B., Cuschieri A. (1998). Errors enacted during endoscopic surgery - a human reliability analysis. Applied Ergonomics 29(6): 409-14.
Kaufmann C.R. (2001). Computers in surgical education and the operating room. Annales Chirurgiae et Gynaecologiae 90(2): 141-146.
Kindratenko V. (2000). A survey of electromagnetic position tracker calibration techniques. Virtual Reality: Research, Development, and Applications 5(2): 169-82.
Kohn L.T., Corrigan J.M., Donaldson M.S. (1999). To err is human: building a safer health system. National Academy Press, Washington DC.
Kologlu M., Tutuncu T., Yuksek Y.N., Gozalah U., Daglar G., Kama N.A. (2004). Using a risk score for conversion from laparoscopic to open cholecystectomy in resident training. Surgery 135(3): 282-87.
Künsch S.N. (1991). Second order optimality of stationary bootstrap. Statistics and Probability Letters 11: 335-41.
Lahiri S.N. (2003). Resampling Methods for Dependent Data. Springer-Verlag.
Lentz G.M., Mandel L.S., Lee D., Gardella C., Melville J., Goff B.A. (2001). Testing surgical skills of obstetric and gynecologic residents in a bench laboratory setting: validity and reliability. American Journal of Obstetrics and Gynecology 184(7): 1462-8; discussion 1468-1470.
Lim J. (2004). Thesis: A quantitative evaluation of the transfer of training between the operating room and a laparoscopic surgical simulator.
Liu R.Y., Singh K. (1992). Moving blocks jackknife and bootstrap capture weak dependence. Exploring the Limits of the Bootstrap. Wiley, New York: 225-48.
MacRae H., Regehr G., Leadbetter W., Reznick R.K. (2000).
A comprehensive examination for senior surgical residents. American Journal of Surgery 179(3): 190-193.
Mahmood T., Darzi A. (2003). A study to validate the colonoscopy simulator. Surgical Endoscopy 17(10): 1583-9.
Martin J.A., Regehr G., Reznick R., MacRae H., Murnaghan J., Hutchison C., Brown M. (1997). Objective structured assessment of technical skill (OSATS) for surgical residents. British Journal of Surgery 84(2): 273-8.
McBeth P.B. (2002). Thesis: A methodology for quantitative performance evaluation in minimally invasive surgery. University of British Columbia.
McCarthy A., Harley P., Smallwood R. (1999). Virtual arthroscopy training: do the "virtual skills" developed match the real skills required? Studies in Health Technology and Informatics 62: 221-7.
Moorthy K., Smith S., Brown T., Bann S., Darzi A. (2003). Evaluation of virtual reality bronchoscopy as a learning and assessment tool. Respiration 70(2): 195-9.
Mehta N.Y., Haluck R.S., Frecker M.I., Snyder A.J. (2002). Sequence and task analysis of instrument use in common laparoscopic procedures. Surgical Endoscopy 16(2): 280-285.
Milne A.D., Chess D.G., Johnson J.A., King G.J.W. (1996). Accuracy of an electromagnetic tracking device: a study of the optimal operating range and metal interference. Journal of Biomechanics 29(6): 791-3.
Morimoto A.K., Foral R.D., Kuhlman J.L., Zucker K.A., Curet M.J., Bocklage T., MacFarlane T.I., Kory L. (1997). Force sensor for laparoscopic Babcock. Studies in Health Technology and Informatics 39: 354-61.
Nagy A.G., Poulin E.C., Girotti M.J., Litwin D.E., Mamazza J. (1992). History of laparoscopic surgery. Canadian Journal of Surgery 35(3): 271-274.
Nakamoto M., Sato Y., Tamaki Y., Nagano H., Miyamoto M., Sasama T., Monden M., Tamura S. (2000). Lecture Notes in Computer Science (Proc. MICCAI 2000) 1935: 839-848.
NIST/SEMATECH e-Handbook of Statistical Methods. http://www.itl.nist.gov/div898/handbook/, June 8, 2004.
Ota D., Loftin B., Saito T., Lea R., Keller J. (1995).
Virtual reality in surgical education. Computers in Biology and Medicine 25(2): 127-37.
Perie D., Tate A.J., Cheng P.L., Dumas G.A. (2002). Evaluation and calibration of an electromagnetic tracking device for biomechanical analysis of lifting tasks. Journal of Biomechanics 35: 293-7.
Person J.G. (2000). Thesis: A foundation for the design and assessment of improved instruments for minimally invasive surgery.
Person J.G., Hodgson A.J., Nagy A.G. (2001). Automated high-frequency posture sampling for ergonomic assessment of laparoscopic surgery. Surgical Endoscopy 15(9): 997-1003.
Poulin F., Amiot L.P. (2002). Interference during the use of an electromagnetic tracking system under OR conditions. Journal of Biomechanics 35: 733-737.
Poulin E.C., Mamazza J., Litwin D.E., Nagy A.G., Girotti M.J. (1992). Laparoscopic cholecystectomy: strategy and concerns. Canadian Journal of Surgery 35(3): 285-289.
Press W.H., Teukolsky S.A., Vetterling W.T., Flannery B.P. (1992). Numerical Recipes in C, 2nd Ed. Cambridge University Press.
Reber A.S. (1995). The Penguin dictionary of psychology. (2nd ed.), Penguin, New York.
Regehr G., MacRae H., Reznick R.K., Szalay D. (1998). Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Academic Medicine 73(9): 993-997.
Reznick R., Regehr G., MacRae H., Martin J., McCulloch W. (1997). Testing technical skill via an innovative "bench station" examination. American Journal of Surgery 173(3): 226-230.
Richards C., Rosen J., Hannaford B., Pellegrini C., Sinanan M. (2000). Skills evaluation in minimally invasive surgery using force/torque signatures. Surgical Endoscopy 14: 791-798.
Rohrer B., Fasoli S., Krebs H.I., Hughes R., Volpe B., Frontera W.R., Stein J., Hogan N. (2002). Movement smoothness changes during stroke recovery. The Journal of Neuroscience 22(18): 8297-304.
Rosen J., Hannaford B., MacFarlane M.P., Sinanan M. (1999).
Force controlled and teleoperated endoscopic grasper for minimally invasive surgery - experimental performance evaluation. IEEE Transactions on Biomedical Engineering 46(10): 1212-1221.
Rosen J., Hannaford B., Richards G., Sinanan M. (2001). Markov modeling of minimally invasive surgery based on tool/tissue interaction and force/torque signatures for evaluating surgical skills. IEEE Transactions on Biomedical Engineering 48(5): 579-591.
Rosser J.C. Jr., Rosser L.E., Savalgi R.S. (1998). Objective evaluation of a laparoscopic surgical skill program for residents and senior surgeons. Archives of Surgery 133(6): 657-61.
Rosser J.C. Jr., Rosser L.E., Savalgi R.S. (1997). Skill acquisition and assessment for laparoscopic surgery. Archives of Surgery 132(2): 200-204.
Satava R.M. (2001). Surgical education and surgical simulation. World Journal of Surgery 25(11): 1484-9.
Schijven M., Jakimowicz J. (2002). Face-, expert, and referent validity of the Xitact LS500 Laparoscopy Simulator. Surgical Endoscopy 8: 8.
Schijven M., Jakimowicz J. (2003). Virtual reality surgical laparoscopic simulators. Surgical Endoscopy 17(12): 1943-50.
Seymour N.E., Gallagher A.G., Roman S.A., O'Brien M.K., Anderson D.K., Satava R.M. (2004). Analysis of errors in laparoscopic surgical procedures. Surgical Endoscopy 18(4): 592-5.
Sjoerdsma W., Herder J.L., Howard M.J., Jansen A., Bennenberg J.J.G., Grimbergen C.A. (1997). Force transmission of laparoscopic grasping instruments. Minimally Invasive Therapy and Allied Technology 6: 274-278.
Smith C.D., Farrell T.M., McNatt S.S., Metreveli R.E. (2001). Assessing laparoscopic manipulative skills. American Journal of Surgery 181(6): 547-50.
Smith S.G., Torkington J., Brown T.J., Taffinder N.J., Darzi A. (2002). Motion analysis. Surgical Endoscopy 16(4): 640-5.
Taffinder N., Sutton C., Fishwick R.J., McManus I.C., Darzi A. (1998).
Validation of virtual reality to teach and assess psychomotor skills in laparoscopic surgery: results from randomised controlled studies using the MIST VR laparoscopic simulator. Studies in Health Technology and Informatics 50: 124-130.
Torkington J., Smith S.G.T., Rees B.L., Darzi A. (2001). The role of the basic surgical skills course in the acquisition and retention of laparoscopic skill. Surgical Endoscopy 15: 1071-5.
Uhrich M.L., Underwood R.A., Standeven J.W., Soper N.J., Engsberg J.R. (2002). Assessment of fatigue, monitor placement, and surgical experience during simulated laparoscopic surgery. Surgical Endoscopy 16(4): 635-9.
Wagner A., Schicho K., Birkfellner W., Figl M., Seemann R., Konig F., Kainberger F., Ewers R. (2002). Quantitative analysis of factors affecting intraoperative precision and stability of optoelectronic and electromagnetic tracking systems. Medical Physics 29(5): 905-912.
Welty G., Schippers E., Grablowitz V., Lawong A.G., Tittel A., Schumpelick V. (2003). Is laparoscopic cholecystectomy a mature operative technique? Surgical Endoscopy 16(5): 820-7.
Weghorst S., Airola C., Oppenheimer P., Edmond C.V., Patience T., Heskamp D., Miller J. (1998). Validation of the Madigan ESS simulator. Studies in Health Technology and Informatics 50: 399-405.
Wikipedia - The Free Encyclopedia. http://www.wikipedia.org/wiki/Validity (psychometric). November 2, 2003.
Winckel C.P., Reznick R.K., Cohen R., Taylor B. (1994). Reliability and construct validity of a structured technical skills assessment form. American Journal of Surgery 167(4): 423-427.
Wolfe B.M., Szabo Z., Moran M.E., Chan P., Hunter J.G. (1993). Training for minimally invasive surgery. Need for surgical skills. Surgical Endoscopy 7: 93-95.
Woltring H.J. (1986). A Fortran package for generalized, cross-validatory spline smoothing and differentiation. Advances in Engineering Software 8(2): 142-51.
Villegas L., Schneider B.E., Callery M.P., Jones D.B. (2003). Laparoscopic skills training.
Surgical Endoscopy 17(12): 1879-88.

Appendix A: OR Study Experimental Protocol and Data Acquisition Procedures

A.1 Experimental Protocol

Researchers: Catherine Kinnaird, Joanne Lim, Dr. Antony Hodgson
Location: University of British Columbia Hospital
Procedure: MIS cholecystectomy

Study Protocol

This is a protocol for a motor behaviour study of an expert surgeon performing a MIS cholecystectomy. Before the patient is brought to the operative suite, all equipment is turned on and initialized. The video camera and laparoscope VCR operation are checked and started. The researchers then leave the room while the patient is brought in and anaesthetized. Once the patient is asleep, the surgeon and one researcher scrub in. While the patient is being prepared for surgery, the scrubbed researcher obtains the sterile surgical tool from the OR staff. She uses the sterile tools to cut a small section of OpSite® with which to wrap the force/torque sensor. The scrubbed researcher is responsible for ensuring that the experimental tool is assembled correctly and that all sensors and optical markers are firmly secured to the tool. During this time, the other researcher continues with any necessary equipment checks and initializations. When the surgeon indicates her readiness for the experimental tool, the scrubbed researcher carefully lowers the electrical cords from the tool to the un-scrubbed researcher below the surgical table, outside of the surgical sterile field. Great care must be taken at this point to ensure maintenance of the sterile field. The minimum amount of cord necessary to reach the tool interface units should be passed down, while the excess cord is secured in the sterile field with a clip. Data collection begins at this point and the equipment records until the surgeon indicates completion. After the sensors are plugged in, the scrubbed researcher remains by the table to assist the surgeon with the experimental tool and to make any necessary adjustments.
It is critical to cause as little disturbance to the staff and surgeon as possible. If at any point in the surgery the surgeon feels uncomfortable, the tool is removed from the field and any equipment is promptly moved out of the way. Approval for this experiment was granted through the University of British Columbia Clinical Research Ethics Board and the University of British Columbia Hospital. In addition, the Sterile Supply and Biomedical Engineering departments at the University of British Columbia and Vancouver Hospitals approved all equipment.

A.2 Equipment List

The following outlines the hardware and software components required for experiments conducted in the OR:

Hardware:
1 - PC, 2.4 GHz Duron (tower, keyboard, mouse, monitor)
1 - Laptop (minimum 800 MHz, with 1 COM serial port)
2 - Tripods
1 - Canon digital video recorder and Canon adaptor
1 - power bar
1 - VHS cassette
1 - digital videotape
2 - serial port cables
1 - portable computer desk
1 - Polaris Position Sensor
1 - Polaris Tool Interface Unit
1 - Logitech web cam
1 - Fastrak Tool Interface Unit
1 - Magnetic tracker transmitter
1 - ATI force/torque mux box and cord
1 - strain gauge interface box
1 - stool
1 - Tupperware container
1 - masking tape
1 - double-sided tape
1 - isolation transformer
*1 - Experimental surgical tool:
	MDMArray (5 reflective markers)
	3 - NDI mounting posts
	Magnetic sensor receiver
	Force/Torque sensor
	Strain gauge system
*3 - Allen keys (variety of sizes)
*1 - scissors

*Equipment requiring sterilization

Software:
Windows 2000
Matlab 6.5
Tera Term Pro V2.3
Logitech QuickCam V5.4.1
FTgui
Matlab Programs:
- PMCS.m
- body_frame_calib2

A.3 OR Procedure

The following are the step-by-step procedures for data collection from the experimental hybrid tool in the OR.
A.3.1 Equipment Set-up

Required time: 30 min - 1.5 hr
Suggested start time: 1 day prior to surgery (17:00-17:50)

Set up the equipment in the OR as shown (Figure A.1).

Figure A.1: UBC Hospital operating room equipment layout (McBeth 2001), showing the computer cart, the tripods for the optical sensor and video camera, and the positions of the surgeon, researcher and assistant.

A.3.2 Pre-operative Set-up and Software Initialization

Required time: 10-30 min
Suggested start time: 7:00 for an 8:00 procedure

A.3.2.1 Start-up Procedure (Teraterm)

1) Run Teraterm.
2) Turn on the Polaris Tool Interface Unit (wait 20 s for the beep; RESETBE6F will appear in the command window).
3) Type COMM50000 - Reply: OKAYA896
4) In the Teraterm window: Setup - Serial Port: change the baud rate to 115200.
5) Type INIT followed by Enter (note: a space follows INIT) - Reply: OKAYA896
6) Teraterm window: File - Exit.

A.3.2.2 OR Data Collection Procedures

1) Start the digital video recorder (focus the camera on the surgeon's arm and the experimental tool).
2) Start the MIS VCR to record the laparoscope.
3) Start Matlab R12.
4) Set the current directory to the folder where data is to be collected (e.g. OR_test3, et cetera).
5) In the command window type: PMCS
6) The graphical user interface will appear (Figure 2.7).
7) Select the radio buttons for all faces.
8) STOP: check equipment connections before initialization of the Polaris.
9) Press the Initialize Polaris button when ready; the Polaris will beep in acknowledgment.
10) Wait until all status bars are illuminated in yellow (check for error messages in the Command Window).
11) When the surgeon indicates that she is ready to use the experimental tool, the sensor cords are plugged in to the tool interface units under the surgical table.
12) On the laptop, start FTGui
13) In FTGui, select the "logging to port" radio button and select the appropriate folder
14) In FTGui, select the "continuous data collection" radio button
15) In FTGui, push the Options button - output data - Metric
16) In FTGui, push the Options button - hemispheres - set the hemispheres to 0,0,-1
17) In FTGui, push Record Data
18) In the PMCS GUI, select the appropriate tool used by the surgeon (Tool 2 in our case)
19) Monitor the status bars and adjust equipment (e.g. the camera) if required (red status bar: missing marker; yellow status bar: loading tool identification files; green status bar: tracked marker)
20) At the end of the procedure, in FTGui press Stop data recording, then press Stop Tracker, followed by Shut-down Polaris (do not do this in the opposite order)

Appendix B Operational Definitions

B.1 Hierarchical Decomposition Operational Definitions (McBeth 2001)

In order to organize the massive amount of OR data and to find tasks that are analogous to simulator tasks, we used a hierarchical decomposition modified from Cao by McBeth in 2001. The decomposition describes the procedure in terms of surgical phases and stages, tool tasks and subtasks, and fundamental tool actions. This five-level hierarchical decomposition provides the foundation for a quantitative analysis of surgeon performance (Figure B.1). In the work presented here we look primarily at the task and subtask levels of the experimental surgical tool during dissection tasks. McBeth did, however, show the feasibility of going as far as examining the action level of the tool by using fuzzy logic action states.

Figure B.1: Five levels of the hierarchical decomposition

B.1.1 Phase Level

Phases are the fundamental levels of a procedure, forming the backbone and the foundation for further decomposition.
A laparoscopic cholecystectomy procedure is divided into five distinct phases: 1) trocar preparation, 2) cystic duct dissection, 3) gallbladder dissection, 4) gallbladder removal, and 5) closure. Each phase has associated goals that must be accomplished in order to proceed to the next phase. This study dealt with stages, tasks and subtasks in the cystic duct and gallbladder dissections only. Future work will incorporate all aspects of the procedure by examining multiple tool tips.

B.1.2 Stage Level

The stages within each phase also have goals, but these goals are not necessarily compulsory to proceed to the next stage. The stages of the cystic duct and gallbladder dissection (CDD and GBD) phases are shown below (Figure B.2).

Figure B.2: Stage level diagram for cystic duct dissection (CDD) and gallbladder dissection (GBD) (note: all stage level definitions are based on video observation)

B.1.3 Task Level

A task is a set of movements performed with a single tool to achieve a desired effect. A task segment is defined from the time the tool tip is placed in the distal end of the trocar until the tool is pulled out through the same trocar. This is how we defined the specific dissection segments discussed in Chapter 3.

B.1.4 Subtask Level

The subtask level defines how the experimental tool tip is moving inside the patient. The subtask names for a dissection task are shown in Table B.1.
Table B.1: Hierarchical subtask dissection definitions

Free Space Movement - Approach
  Definition: Tool is moving toward tissue upon entry into the trocar
  Start: Entry of the tool tip into the distal end of the trocar
  Stop: Tool tip in contact with tissue

Tissue Manipulation
  Definition: Tool is in contact with the tissue
  Start: Initial contact of the tool tip with the tissue
  Stop: Final contact of the tool tip with the tissue

Free Space Movement - Withdrawal
  Definition: Tool is moving away from the tissue, being pulled out of the trocar
  Start: Final contact of the tool tip with tissue
  Stop: Exit of the tool tip from the distal end of the trocar

B.1.5 Action Level

The action states are made up of 12 types of tool movements and stand as the foundation of the manipulation and dissection tasks. The action level was not specifically examined in this study. It is possible for tool movements to be a combination or a collection of actions; there are a total of 72 feasible combinations of the 12 action states (McBeth 2001).

B.2 Performance Measure Definitions

The definitions of the performance measures presented in Chapter 3 are shown in Table B.2. The individual components of the performance measures are calculated by projecting the tool path vectors onto the tool axis vectors. In this way the location of the world frame is irrelevant, which allows us to compare performance measures across different settings.

Table B.2: Kinematic and force performance measures (x, y, z are tool-frame path components; x', x'', x''' denote first, second, and third time derivatives; the subscript f denotes force components)

Distance from mean:
  Absolute (mm): sqrt((x_i - x̄)² + (y_i - ȳ)² + (z_i - z̄)²)
  Roll (rad): Roll_i - mean(Roll)

Velocity:
  Axial (mm/s): z'_i
  Grasp (mm/s): y'_i
  Translate (mm/s): x'_i
  Transverse (mm/s): sqrt(x'_i² + y'_i²)
  Absolute (mm/s): sqrt(x'_i² + y'_i² + z'_i²)
  Roll (rad/s): Roll'_i

Acceleration:
  Axial (mm/s²): z''_i
  Grasp (mm/s²): y''_i
  Translate (mm/s²): x''_i
  Transverse (mm/s²): sqrt(x''_i² + y''_i²)
  Absolute (mm/s²): sqrt(x''_i² + y''_i² + z''_i²)
  Roll (rad/s²): Roll''_i

Jerk:
  Axial (mm/s³): z'''_i
  Grasp (mm/s³): y'''_i
  Translate (mm/s³): x'''_i
  Transverse (mm/s³): sqrt(x'''_i² + y'''_i²)
  Absolute (mm/s³): sqrt(x'''_i² + y'''_i² + z'''_i²)
  Roll (rad/s³): Roll'''_i

Force:
  Axial (N): z_f
  Grasp (N): y_f
  Translate (N): x_f
  Transverse (N): sqrt(x_f² + y_f²)
  Absolute (N): sqrt(x_f² + y_f² + z_f²)
  Roll torque (N·m): torque about the tool axis

Appendix C Sensor Bracket Design Drawings

C.1 Sensor Bracket

The SolidWorks drawings of the sensor bracket are shown in Figure C.1.

Figure C.1: SolidWorks design drawings for the sensor bracket assembly, top and side views (drawn by Brandon Lee)
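To make the projection idea behind Table B.2 concrete: because each component is computed from displacements of the tool path projected onto the tool axis, the choice of world-frame origin cancels out. The following is a minimal pure-Python sketch; the function and variable names are illustrative, not from the thesis software, and the tool axis is assumed to be a unit vector.

```python
import math

def project_components(path, axis):
    """Split successive tool-path displacements into axial and
    transverse parts relative to the tool axis.

    path: list of (x, y, z) tool-tip samples in any fixed frame
    axis: unit vector along the tool shaft
    Only displacements between samples are used, so the world-frame
    origin is irrelevant, as noted in section B.2.
    """
    axial, transverse, absolute = [], [], []
    for p0, p1 in zip(path, path[1:]):
        d = [b - a for a, b in zip(p0, p1)]          # displacement vector
        a = sum(di * ei for di, ei in zip(d, axis))  # axial projection
        mag = math.sqrt(sum(di * di for di in d))    # absolute magnitude
        t = math.sqrt(max(mag * mag - a * a, 0.0))   # in-plane remainder
        axial.append(a)
        transverse.append(t)
        absolute.append(mag)
    return axial, transverse, absolute
```

Dividing these per-sample displacements by the sampling interval gives the velocity rows of Table B.2; repeated differencing of the resulting series gives the acceleration and jerk rows.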

