RISK PREDICTION MODELS FOR BINARY RESPONSE VARIABLES FOR THE CORONARY BYPASS OPERATION

by

HONGBIN ZHANG
M.Eng., Institute of Software, Academia Sinica, 1987

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
in
THE FACULTY OF GRADUATE STUDIES
Department of Statistics

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
June 1993
©Hongbin Zhang, 1993

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Statistics
The University of British Columbia
Vancouver, Canada

Abstract

The ability to predict 30 day operative mortality and complications following coronary artery bypass surgery in the individual patient has important implications both clinically and for the design of clinical trials. This thesis focuses on setting up risk stratification algorithms. Because the response variables are binary, logistic regression analysis and classification trees (recursive partitioning) were applied to the variables identified by the Health Data Research Institute in Portland, Oregon. The data set contains records for 18171 patients who had coronary artery bypass surgery in one of several hospitals between 1968 and 1991.
Statistical models are set up, one from each method, for six outcome variables of the surgery: 30 day operative mortality, renal shutdown complication, central nervous system complication, pneumothorax complication, myocardial infarction complication and low output syndrome. The risk groups vary across the different outcomes. A history of cardiac surgery is strongly associated with operative mortality, and patients who suffer from a central nervous system disease tend to have higher risks for all the outcomes. Further study is necessary to consider the differences among hospitals and to divide the population according to the type of previous cardiac surgery.

Contents

Abstract
Acknowledgements
1 Introduction
2 Coronary Bypass Surgery and Data Source
   2.1 Coronary Bypass Surgery
   2.2 Source of Data
3 Initial Data Analysis
   3.1 Summary Statistics for the Data
   3.2 Odds Ratio Analysis
4 Logistic Regression Analysis
   4.1 Univariate Analysis and Comparison of Models
   4.2 Stepwise Logistic Regression
   4.3 Best Subsets Selection
   4.4 Goodness-of-fit: Hosmer-Lemeshow Grouping Test
5 The Tree-based Model
   5.1 Recursive Partitioning: Growing a Classification Tree
   5.2 Getting the Right Size Tree: Pruning the Classification Tree
   5.3 Applications and Results
6 Discussion and Conclusion
Bibliography
A Merged Cardiac Registry
B Expanded Definitions for Merged Cardiac Registry

List of Tables

1 Code Sheet for the MCR Data
2 Code Sheet for the MCR Data (continued)
3 Code Sheet for the MCR Data (continued)
4 Tabular summary for categorical variables
5 Tabular summary for categorical variables (continued)
6 Tabular summary for categorical variables (continued)
7 Odds ratio for STA
8 Odds ratio for REMS
9 Odds ratio for NEMS
10 Odds ratio for PUMS
11 Odds ratio for MI
12 Odds ratio for LOMS
13 Binary risk factors identified from odds ratio
14 Two-way table for PMI and LVD
15 Stepwise regression procedure: a demonstration
16 Results of stepwise regression procedure for each outcome
17 Models obtained from best subsets procedure for each outcome
18 Hosmer-Lemeshow grouping test for selected models
19 Final logistic regression model for STA
20 Final logistic regression model for REMS
21 Final logistic regression model for NEMS
22 Final logistic regression model for PUMS
23 Final logistic regression model for MI
24 Final logistic regression model for LOMS
25 List of risk factors and prediction range for the different methods and different outcomes

List of Figures

1 Boxplots for continuous variables AGE and BSA
2 Binary tree: an example
3 Original tree for STA
4 Plots of deviance versus size for sequences of subtrees. (a): sequence obtained from sample data; (b): sequence evaluated on test data
5 Tree model for STA
6 Tree model for REMS
7 Tree model for NEMS
8 Tree model for PUMS
9 Tree model for MI
10 Tree model for LOMS

Acknowledgements

I would like to express my appreciation to my supervisor, Dr. Harry Joe, for his guidance, suggestions and assistance in producing this thesis. I would also like to thank Dr. Stan Page for his critical reading and helpful comments. I am indebted to the Health Data Research Institute for some financial support and for providing the data, and to Scott Page, M.D., for advice on medical aspects of this study. The support of the UBC Department of Statistics is also gratefully acknowledged. Finally, I thank my wife, Hua, for her continuous encouragement and support. Without her belief in me, it might have taken me longer to get here.
Chapter 1

Introduction

Since the beginning of coronary bypass surgery at the Good Samaritan Hospital and Medical Center, Portland, Oregon, in 1968, patient data have been recorded in order to manage the care of the patients. This management will be measured by outcome analysis of the 30 day operative mortality and complications arising from the surgery. Although of secondary concern, some complications directly affect the quality of life of patients; for example, renal shutdown may require dialysis treatment. From a clinical point of view, a patient with low risk of mortality or complications could be spared the discomfort and expense of unnecessary treatment in a coronary care unit; based on a prognostic assessment, such patients could be placed in an intermediate care unit or a general ward, or be discharged early from the hospital. For these benefits to be realized, however, it is necessary to establish risk stratification algorithms.

Two approaches have been devised for surgery patients, depending on the strategy used for derivation. In one strategy, a panel of experts identifies and assigns weights to clinical variables believed to be associated with outcomes of interest. The second strategy uses statistical modeling to relate empirical data for many patient variables to outcomes of interest. In this study, the odds ratio gives a simple summary for the binary risk factors; logistic regression analysis and tree-based methods are used to set up the stratification algorithms. While doing the analysis, answers to the following questions are sought:

1. What risk groups are most important in predicting the outcomes?
2. How does each stratification algorithm perform in predicting?

The clinical background of the coronary bypass procedure and the data description are presented in Chapter 2, while the initial data analysis is performed in Chapter 3.
In Chapter 4 and Chapter 5 the statistical methodologies and results of analyses are described. Conclusions and suggestions are given in Chapter 6.

Chapter 2

Coronary Bypass Surgery and Data Source

2.1 Coronary Bypass Surgery

Coronary heart disease is the comprehensive term for all of the clinical manifestations that result from atherosclerotic narrowing or occlusion of the arteries which supply the heart muscle. In several countries of the industrialized world, it is the major cause of death in both men and women. Despite accumulating knowledge about the epidemiology and pathology of disease of the heart and coronary arteries, there is not, as yet, a way of intervening to definitely arrest the natural progression of atherosclerosis in order to prevent or cure this condition.

Compared with medical treatment, surgery is seemingly the more radical approach to coronary artery disease. In 1963, the first coronary artery bypass surgery was performed in the United Kingdom. Nowadays, it is the most common form of elective surgery. The surgical technique of coronary grafting involves opening the chest wall and temporarily stopping the heart while circulation is maintained with a heart-lung machine. A vein is removed to be used as the graft material. Each obstructed section of artery is then bypassed by attaching one end of the vein to the aorta carrying blood for the heart, and the other end to the artery beyond the stenosis or occlusion. The heart is restarted, the chest wall closed, and the operation completed.

Coronary artery surgery has been described as relieving or very much reducing angina in over 90% of patients. The adverse results are operative death and complications arising from the surgery, although the mortality rate is believed to be decreasing with increasing surgical experience and skill.
2.2 Source of Data

MCR (Merged, Multi-Center, Multi-Specialty Clinical Registries) is an international database system developed by the Health Data Research Institute (formerly Dendrite Systems, Inc.) in which information on patients who had heart-related surgery is recorded. This database system is used by several hospitals, and the contributors (patients, sometimes the doctors) are encouraged to enter information. To do so, MCR uses a long systematic set of questions to elicit information both prior to and after the operation (see Appendix). The data set analyzed here consists of 18171 patients from the MCR who had coronary bypass surgery between 1968 and 1991.

The pre-operation information includes date of operation, patient's age, gender, prior myocardial infarction, existence or non-existence of other diseases, body surface area, etc. The post-operation information includes the patient's status during or after the bypass surgery; for example, complications, such as renal or neurological problems, and survival status to 30 days following surgery.

The variables of primary interest in our analysis are the outcome variables indicating the patient's complications and survival status after the surgery. All variables studied and their abbreviations are listed in Table 1 to Table 3.

Table 1: Code Sheet for the MCR Data

Variable  Name                                         Codes/Values   Abbreviation
1         Age                                          Years          AGE
2         Sex                                          0=Male 1=Female  SEX
3         Prior Myocardial Infarction                  0=No 1=Yes     PMI
(Variables 4-23 pertain to other diseases)
4         Obesity                                      0=No 1=Yes     OBE
5         Chronic Obstructive Pulmonary Disease        0=No 1=Yes     COP
6         Diabetes                                     0=No 1=Yes     DIA
7         Cholesterol Level>200                        0=No 1=Yes     CH2
8         Cholesterol Level>300                        0=No 1=Yes     CH3
9         Renal Disease                                0=No 1=Yes     REN
10        Hypertension                                 0=No 1=Yes     HTN
11        Alcohol Abuse                                0=No 1=Yes     ETO
12        Drug Abuse                                   0=No 1=Yes     DRU
13        Marfan's Syndrome (a skeletal abnormality)   0=No 1=Yes     MAR
14        HIV+                                         0=No 1=Yes     HIV
15        AIDS                                         0=No 1=Yes     AID
16        Cancer                                       0=No 1=Yes     CA
17        Anemia                                       0=No 1=Yes     ANE
18        Liver Disease                                0=No 1=Yes     LIV
19        Central Nervous System Disease               0=No 1=Yes     CNS
20        Prior Cerebrovascular Accident               0=No 1=Yes     PCA

Table 2: Code Sheet for the MCR Data (continued)

Variable  Name                           Codes/Values                                        Abbreviation
21        Rheumatic Heart Disease        0=No 1=Yes                                          RHE
22        Pulmonary Hypertension         0=No 1=Yes                                          PUL
23        Chronic Dialysis               0=No 1=Yes                                          CHR
(Variables 24-29 pertain to type of prior cardiac surgery)
24        Other Surgery                  0=No 1=Yes                                          OTH
25        No Surgery                     0=No 1=Yes                                          NON
26        Coronary Bypass Graft          0=No 1=Yes                                          CAB
27        Valve Replacement              0=No 1=Yes                                          VAL
28        Congenital                     0=No 1=Yes                                          CON
29        Pacemaker                      0=No 1=Yes                                          PAC
30        Left Ventricular Dysfunction   0=Normal 1=40-49% 2=30-39% 3=20-29% 4=<20%          LVD
31        Prior Operation Status         1=Elective 2=Urgent 3=Emergency 4=Desperate         POS
32        Body Surface Area              Square meters                                       BSA

Table 3: Code Sheet for the MCR Data (continued)

Variable  Name                                Codes/Values   Abbreviation
(Variables 33-47 pertain to the complications after surgery)
33        Reoperation for Bleeding            0=No 1=Yes     REP
34        Renal Shutdown (Mild)               0=No 1=Yes     REM
35        Renal Shutdown (Severe)             0=No 1=Yes     RES
36        Wound (Severe)                      0=No 1=Yes     WOU
37        Neurological (Mild)                 0=No 1=Yes     NEM
38        Neurological (Severe)               0=No 1=Yes     NES
39        Pulmonary (Mild)                    0=No 1=Yes     PUM
40        Pulmonary (Severe)                  0=No 1=Yes     PUS
41        Myocardial Infarction               0=No 1=Yes     MI
42        Low Output (Mild)                   0=No 1=Yes     LOM
43        Low Output (Severe)                 0=No 1=Yes     LOS
44        Clotting                            0=No 1=Yes     CLO
45        Sepsis                              0=No 1=Yes     SEP
46        Gastrointestinal Bleeding           0=No 1=Yes     GIB
47        Diffuse Intravascular Coagulation   0=No 1=Yes     DIC
48        Discharge/30 Day Status             1=Live 2=Died in OR 3=Died in Hosp/30D 4=Reop 5=Died Late Cardiac 6=Unrelated Death 9=Lost to Follow-up   STA

Chapter 3

Initial Data Analysis

3.1 Summary Statistics for the Data

Since the data are collected from several populations, missing values on several variables are inevitable. Of 18,171 MCR patients, 12,000 had complete data for the 32 variables selected. The missing data mostly occur in two variables: prior myocardial infarction (30%) and left ventricular dysfunction (29.3%). These are the only two variables that measure previous damage to the heart, so they should not be excluded.

Twenty diseases or conditions are suspected to be related to the success of the surgery, but seven of them can be removed because of their low incidence: drug abuse, 0.3% (70 patients); Marfan's syndrome, 0.1% (20 patients); HIV-positive test and AIDS, 0.0% (0 patients); anemia, 0.0% (18 patients); pulmonary hypertension, 0.3% (59 patients); chronic dialysis, 0.0% (11 patients).
The remaining 13 diseases are: OBE (obesity: 1.5x expected body weight), COP (patient with distinct limitations revealed at time of study or on treatment such as bronchodilators), DIA (diabetes: patient on oral medicine or insulin), CH2 (patient with cholesterol blood level between 200 and 299), CH3 (patient with cholesterol blood level above 300), REN (renal failure: patients not on dialysis with creatinines above 2.5), HTN (hypertension), ETO (patients who have undergone treatment for alcohol abuse or come in intoxicated), CA (history of malignant disease, cured or not), LIV (history of hepatitis, cholangitis, but not gall bladder disease), CNS (history of brain abscess, encephalitis, or clinical dementia), PCA (history of stroke with or without residual), RHE (history of rheumatic heart disease).

Prior cardiac operation makes the surgery more difficult technically. Six categories are distinguished: OTH (other cardiac surgery), NON (no surgery), CAB (coronary bypass surgery), VAL (valve operation), CON (congenital surgery) and PAC (pacemaker operation). The incidences of the last two kinds of operation are low, 0.2% and 0.6% respectively, so these are removed as risk factors.

AGE and SEX are two important variables. BSA (body surface area) is calculated from height and weight using a nomogram formula and can be an important factor. Although the prior operation status is important, from a decision point of view we do not include it this time.

Post-operation variables (outcomes) include 11 complications and the 30 day status. Among these 11 complications, we only consider those which are clinically related to the 30 day status.
Hence we remove the following complications: REP (reoperation for bleeding, suspected tamponade), WOU (wound: dehiscence or infection), CLO (clotting: prolonged bleeding problems, low platelets), SEP (septicemia, pneumonia, wound infection, etc.), GIB (gastrointestinal bleeding, perforated ulcer, cholecystitis) and DIC (diffuse intravascular coagulation). We study the remaining five complications: REMS (renal shutdown), NEMS (peripheral nerve, central nervous system defect), PUMS (pneumothorax, prolonged respiratory support), MI (intra- or post-operation myocardial infarction by EKG or enzymes) and LOMS (low output syndrome). These response variables were obtained by combining their two levels (mild and severe) into one, so that the resulting variables are binary. The response variable for the 30 day operative mortality was obtained by combining the original statuses 2, 3 and 5 into "1".

In Table 4 to Table 6, summary statistics are given, as well as some special features such as missing values.

3.2 Odds Ratio Analysis

A natural way to represent the association of a binary risk factor and a binary outcome is the 2x2 contingency table, as follows:

2x2 Contingency Table

                       outcome1   outcome2   Sample Size
with risk factor          a          b          $n_1$
without risk factor       c          d          $n_2$

We suppose that such a table has been generated by drawing two independent binomial samples of sizes $n_1$ and $n_2$, with probabilities for outcome1 being $p_1$ and $p_2$ respectively. For example, in our study, an outcome variable is the status after operation, and the samples correspond to the patients with and without a given risk factor. The odds ratio in such a table is defined as

$$\psi = \frac{p_1(1 - p_2)}{p_2(1 - p_1)}.$$

The odds ratio (as well as its logarithm) is widely used as a measure of association in 2x2 contingency tables because it is simple to interpret.
For example, if the outcome is the presence or absence of lung cancer and the populations are smokers and non-smokers, then $\psi = 2$ indicates that the odds of lung cancer among smokers are twice those among non-smokers in the study population. It has also been pointed out that the odds ratio forms a useful approximation to the relative risk in retrospective studies [Rothman, 1986]. The coefficients estimated in a logistic regression can also be interpreted as log-odds ratios (logarithms of odds ratios).

Table 4: Tabular summary for categorical variables

Variable  Heading  Code   Count   Frequency  Remark
V2        SEX      0      13084   72.0%      1 patient with no sex recorded
                   1       5086   28.0%
V3        PMI      0       7451   41.0%      5436 missing data (30.0%), coded -1
                   1       5284   29.0%
V4        OBE      0      16648   91.7%
                   1       1523    8.3%
V5        COP      0      15977   88.0%
                   1       2194   12.0%
V6        DIA      0      15276   84.1%
                   1       2895   15.9%
V7        CH2      0      16008   88.1%
                   1       2163   11.9%
V8        CH3      0      17373   95.7%
                   1        798    4.3%
V9        REN      0      17056   93.9%
                   1       1115    6.1%
V10       HTN      0      11175   61.5%
                   1       6996   38.5%
V11       ETO      0      17454   96.1%
                   1        717    3.9%
V12       DRU      0      18101   99.7%      ignored in future analysis
                   1         70    0.3%
V13       MAR      0      18151   99.9%      ignored in future analysis
                   1         20    0.1%
V14       HIV      0      18171  100.0%      ignored in future analysis
                   1          0    0.0%
V15       AID      0      18171  100.0%      ignored in future analysis
                   1          0    0.0%
V16       CA       0      17860   98.3%
                   1        311    1.7%
V17       ANE      0      18153  100.0%      ignored in future analysis
                   1         18    0.0%

Table 5: Tabular summary for categorical variables (continued)

Variable  Heading  Code   Count   Frequency  Remark
V18       LIV      0      17946   98.8%
                   1        225    1.2%
V19       CNS      0      17427   96.0%
                   1        744    4.0%
V20       PCA      0      17700   97.5%
                   1        471    2.5%
V21       RHE      0      17545   96.6%
                   1        626    3.4%
V22       PUL      0      18112   99.7%      ignored in future analysis
                   1         59    0.3%
V23       CHR      0      18160  100.0%      ignored in future analysis
                   1         11    0.0%
V24       OTH      0      17649   97.2%
                   1        522    2.8%
V25       NON      0      15418   84.8%      ignored in future analysis
                   1       2753   15.2%
V26       CAB      0      16525   91.0%
                   1       1646    9.0%
V27       VAL      0      17733   97.6%
                   1        438    2.4%
V28       CON      0      18140   99.8%      ignored in future analysis
                   1         31    0.2%
V29       PAC      0      18055   99.4%      ignored in future analysis
                   1        116    0.6%
V30       LVD      0       8972   49.3%      5320 (29.3%) missing data, coded -1
                   1       1854   10.2%
                   2       1524    8.3%
                   3        306    1.7%
                   4        195    1.0%
V31       POS      1      14137   77.8%      433 (2.4%) missing data; ignored in future analysis
                   2       1935   10.6%
                   3       1501    8.3%
                   4        165    0.9%
V33       REP      0      17471   96.2%
                   1        700    3.8%
V34       REM      0      17623   97.0%      combined with RES in future analysis to form a new variable REMS
                   1        548    3.0%

Table 6: Tabular summary for categorical variables (continued)

Variable  Heading  Code   Count   Frequency  Remark
V35       RES      0      17881   98.4%
                   1        290    1.6%
V36       WOU      0      17964   98.9%
                   1        207    1.1%
V37       NEM      0      16803   92.5%      combined with NES in future analysis to form a new variable NEMS
                   1       1368    7.5%
V38       NES      0      17768   97.8%
                   1        403    2.2%
V39       PUM      0      10173   56.0%      combined with PUS in future analysis to form a new variable PUMS
                   1       7998   44.0%
V40       PUS      0      15643   86.1%
                   1       2528   13.9%
V41       MI       0      17194   94.7%
                   1        977    5.3%
V42       LOM      0      17389   95.7%      combined with LOS in future analysis to form a new variable LOMS
                   1        782    4.3%
V43       LOS      0      17321   95.4%
                   1        850    4.6%
V44       CLO      0      17803   98.0%      ignored in future analysis
                   1        368    2.0%
V45       SEP      0      17904   98.6%      ignored in future analysis
                   1        267    1.4%
V46       GIB      0      17415   95.8%      ignored in future analysis
                   1        756    4.2%
V47       DIC      0      18117   99.7%      ignored in future analysis
                   1         54    0.3%
V48       STA      0         10    0.0%      612 (3.4%) missing data; only the cases 1, 2, 3 and 5
                   1      15513   85.4%      are considered. That is:
                   2        519    2.9%      15513 (95.1%) alive, coded 0
                   3        234    1.3%      794 (4.9%) died, coded 1
                   4        465    2.6%
                   5         41    0.2%
                   6        316    1.7%
                   9         10    0.0%

There are several estimates of the odds ratio [Walter, 1987], but the most common one is the maximum likelihood estimate (MLE)

$$\hat{\psi}_{MLE} = \frac{ad}{bc}.$$

The derivation is simple: for a binomially distributed random variable with parameters $p$ and $n$, the MLE of $p$ is $a/n$, where $n$ is the sample size and $a$ is the number of "successes". By the invariance principle, the MLE of the odds $p_1/(1-p_1)$ is $a/b$. Similarly, the MLE of $p_2/(1-p_2)$ is $c/d$. Hence the above estimate obtains.

The estimate of the odds ratio is more useful when accompanied by an interval estimate or confidence interval (CI). A brief review of various methods for CI construction is given by Fleiss (1979). We use the result derived by Bishop et al. (1975), in which it is proved that $\log\hat{\psi}$ is asymptotically normal with mean $\log\psi$ and variance $(n_1 p_1(1 - p_1))^{-1} + (n_2 p_2(1 - p_2))^{-1}$.
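To make the estimate and its asymptotic variance concrete, the sketch below computes $\hat{\psi} = ad/(bc)$ for a single 2x2 table. Substituting $\hat{p}_1 = a/n_1$ and $\hat{p}_2 = c/n_2$ into the variance above gives $1/a + 1/b + 1/c + 1/d$, which is used here to form a normal-theory interval on the log scale. The function name `odds_ratio_ci` and the cell counts are hypothetical and are not taken from the MCR data; this is an illustration under those assumptions, not the code used for the thesis.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.959964):
    """Sample odds ratio and normal-theory interval for a 2x2 table.

    a, b: counts of outcome1 / outcome2 among subjects with the risk factor
    c, d: counts of outcome1 / outcome2 among subjects without it
    z:    standard normal quantile (1.959964 gives a 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    psi_hat = (a * d) / (b * c)                       # MLE of the odds ratio
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # plug-in SE of log(psi_hat)
    log_psi = math.log(psi_hat)
    lower = math.exp(log_psi - z * se_log)
    upper = math.exp(log_psi + z * se_log)
    return psi_hat, se_log, (lower, upper)


# Hypothetical counts, for illustration only.
psi_hat, se_log, (lower, upper) = odds_ratio_ci(30, 70, 15, 85)
print(f"odds ratio = {psi_hat:.2f}, SE(log) = {se_log:.2f}, "
      f"95% CI = ({lower:.2f}, {upper:.2f})")
```

Working on the log scale keeps the interval inside the positive half-line and makes it symmetric about $\log\hat{\psi}$ rather than about $\hat{\psi}$ itself.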
The asymptotic normality of $\log\hat{\psi}$ follows from an application of the delta method. An estimated variance of $\log\hat{\psi}$ is

$$\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d} = SE(\log\hat{\psi})^2.$$

So, a $100(1-\alpha)\%$ CI of $\psi$ is

$$\exp\{\log\hat{\psi} \pm z_{1-\alpha/2}\, SE(\log\hat{\psi})\},$$

where $z_\beta$ is the $\beta$ quantile of the standard normal distribution.

In Table 7 to Table 12, the odds ratios for each outcome variable are given. Statistically, only those variables whose 95% CI does not contain 1 are regarded as strongly related to the outcome. In Table 13, the risk factors identified by odds ratio are listed. If the estimated odds ratio is larger than 1, we say the variable is positively related to the outcome; otherwise, we say it is negatively related.

Nearly all binary explanatory variables (including the groups of other diseases and prior cardiac surgeries) are positively related to the 30 day mortality and complications, except CH2 and CH3. Variables CH2 and CH3 measure high cholesterol blood levels, which may lead to increased risk of vascular disease in many organs. MI and LOMS are vascular-related complications. Unfortunately, CH2 and CH3 have a negative association with them, so we have some doubt whether the measurement of cholesterol blood level is correct and, consequently, we did not include CH2 and CH3 in our model building procedure. From a medical point of view, cholesterol blood level may be a surrogate for nutritional level.

Table 7: Odds ratio for STA

Variable  Heading  $\hat{\psi}$  SE(log $\hat{\psi}$)  95% CI of $\psi$  Remark
V2        SEX      1.17   0.24   (0.73, 1.90)
V4        OBE      0.38   0.72   (0.09, 1.58)
V5        COP      1.00   0.38   (0.47, 2.12)
V6        DIA      0.31   0.72   (0.63, 2.16)
V7        CH2      0.40   0.42   (0.17, 0.94)   negatively related
V8        CH3      1.25   0.41   (0.56, 2.80)
V9        REN      1.95   0.38   (0.91, 4.19)
V10       HTN      1.32   0.23   (0.84, 2.08)
V11       ETO      1.18   0.60   (0.35, 3.86)
V16       CA       1.61   0.74   (0.37, 6.95)
V19       CNS      3.05   0.45   (1.24, 7.44)   positively related
V20       PCA      1.98   0.53   (0.68, 5.69)
V21       RHE      1.26   0.61   (0.38, 4.14)
V24       OTH      2.35   0.39   (1.09, 5.10)   positively related
V26       CAB      3.53   0.27   (2.07, 6.05)   positively related
V27       VAL      4.23   0.56   (1.40, 12.8)   positively related

Table 8: Odds ratio for REMS

Variable  Heading  $\hat{\psi}$  SE(log $\hat{\psi}$)  95% CI of $\psi$  Remark
V2        SEX      1.48   0.21   (0.97, 2.26)
V4        OBE      1.13   0.37   (0.54, 2.37)
V5        COP      1.86   0.26   (1.11, 3.11)   positively related
V6        DIA      1.80   0.24   (1.11, 2.91)   positively related
V7        CH2      0.67   0.37   (0.32, 1.40)
V8        CH3      0.88   0.46   (0.35, 2.20)
V9        REN      9.78   0.22   (6.30, 15.2)   positively related
V10       HTN      1.75   0.20   (1.18, 2.62)   positively related
V11       ETO      1.34   0.47   (0.53, 3.95)
V18       LIV      3.58   0.62   (1.04, 12.3)   positively related
V19       CNS      3.40   0.33   (1.75, 6.61)   positively related
V20       PCA      1.33   0.60   (0.40, 4.34)
V21       RHE      0.88   0.59   (0.27, 2.85)
V24       OTH      0.90   0.52   (0.32, 2.50)
V26       CAB      0.98   0.35   (0.49, 1.98)
V27       VAL      0.64   1.01   (0.08, 4.76)

Table 9: Odds ratio for NEMS

Variable  Heading  $\hat{\psi}$  SE(log $\hat{\psi}$)  95% CI of $\psi$  Remark
V2        SEX      1.34   0.15   (0.99, 1.81)
V4        OBE      2.11   0.21   (1.37, 3.25)   positively related
V5        COP      1.22   0.21   (0.80, 1.84)
V6        DIA      1.38   0.18   (0.96, 1.98)
V7        CH2      0.69   0.25   (0.42, 1.14)
V8        CH3      0.54   0.39   (0.25, 1.17)
V9        REN      1.62   0.23   (1.02, 2.57)   positively related
V10       HTN      1.49   0.14   (1.12, 1.97)   positively related
V11       ETO      1.40   0.32   (0.73, 2.68)
V16       CA       2.96   0.43   (1.26, 6.93)   positively related
V18       LIV      2.20   0.55   (0.74, 6.53)
V19       CNS      3.92   0.24   (2.41, 6.39)   positively related
V20       PCA      2.02   0.37   (0.97, 4.19)
V21       RHE      0.81   0.43   (0.34, 1.88)
V24       OTH      0.94   0.35   (0.47, 1.90)
V26       CAB      1.09   0.24   (0.67, 1.75)
V27       VAL      0.58   0.73   (0.14, 2.46)

Table 10: Odds ratio for PUMS

Variable  Heading  $\hat{\psi}$  SE(log $\hat{\psi}$)  95% CI of $\psi$  Remark
V2        SEX      1.08   0.09   (0.90, 1.30)
V4        OBE      3.38   0.16   (2.45, 4.66)   positively related
V5        COP      3.45   0.13   (2.66, 4.48)   positively related
V6        DIA      1.91   0.11   (1.52, 2.40)   positively related
V7        CH2      0.36   0.15   (0.26, 0.49)   negatively related
V8        CH3      0.49   0.20   (0.33, 0.74)   negatively related
V9        REN      3.32   0.16   (2.41, 4.58)   positively related
V10       HTN      1.86   0.08   (1.57, 2.20)   positively related
V11       ETO      2.21   0.21   (1.46, 3.36)   positively related
V16       CA       0.56   0.41   (0.25, 1.25)
V19       CNS      2.12   0.21   (1.39, 3.21)   positively related
V20       PCA      0.67   0.29   (0.37, 0.91)
V21       RHE      0.97   0.23   (0.61, 1.54)
V24       OTH      0.58   0.22   (0.37, 0.91)
V26       CAB      1.22   0.14   (0.92, 1.61)
V27       VAL      1.03   0.34   (0.52, 2.01)

Table 11: Odds ratio for MI

Variable  Heading  $\hat{\psi}$  SE(log $\hat{\psi}$)  95% CI of $\psi$  Remark
V2        SEX      0.76   0.19   (0.52, 1.11)
V4        OBE      1.70   0.26   (1.01, 2.85)   positively related
V5        COP      2.11   0.20   (1.40, 3.16)   positively related
V6        DIA      1.13   0.22   (0.73, 1.75)
V7        CH2      0.39   0.36   (0.19, 0.80)   negatively related
V8        CH3      0.75   0.39   (0.34, 1.63)
V9        REN      1.80   0.25   (1.09, 2.99)   positively related
V10       HTN      1.58   0.16   (1.14, 2.18)   positively related
V11       ETO      1.75   0.34   (0.89, 3.44)
V18       LIV      6.43   0.46   (2.60, 15.86)  positively related
V19       CNS      1.35   0.37   (0.64, 2.83)
V20       PCA      2.06   0.41   (0.92, 4.64)
V21       RHE      1.12   0.43   (0.48, 2.61)
V24       OTH      0.99   0.39   (0.45, 2.16)
V26       CAB      1.28   0.26   (0.76, 2.13)
V27       VAL      1.73   0.53   (0.60, 4.95)

Table 12: Odds ratio for LOMS

Variable  Heading  $\hat{\psi}$  SE(log $\hat{\psi}$)  95% CI of $\psi$  Remark
V2        SEX      1.69   0.15   (1.24, 2.32)   positively related
V4        OBE      1.08   0.28   (0.61, 1.91)
V5        COP      1.06   0.23   (0.66, 1.68)
V6        DIA      1.27   0.20   (0.85, 1.89)
V7        CH2      1.20   0.22   (0.77, 1.89)
V8        CH3      0.86   1.00   (0.01, 0.62)   negatively related
V9        REN      1.99   0.23   (1.24, 3.17)   positively related
V10       HTN      1.31   0.15   (0.96, 1.78)
V11       ETO      1.01   0.40   (0.46, 2.21)
V16       CA       2.33   0.49   (0.88, 6.13)
V18       LIV      1.18   0.74   (0.27, 5.09)
V19       CNS      1.17   0.37   (0.56, 2.46)
V20       PCA      1.22   0.47   (0.48, 3.10)
V21       RHE      2.20   0.32   (1.17, 4.15)   positively related
V24       OTH      1.60   0.31   (0.86, 2.99)
V26       CAB      1.84   0.22   (1.19, 2.84)   positively related
V27       VAL      1.09   0.60   (0.33, 3.60)

Table 13: Binary risk factors identified from odds ratio

Outcome  Binary Risk Factors
STA      VAL (4.23)  CAB (3.53)  CNS (3.05)  OTH (2.35)
REMS     REN (9.78)  LIV (3.58)  CNS (3.40)  COP (1.86)  DIA (1.80)  HTN (1.75)
NEMS     CNS (3.93)  CA (2.96)   OBE (2.11)  REN (1.62)  HTN (1.49)
PUMS     COP (3.45)  OBE (3.38)  ETO (2.23)  CNS (2.12)  DIA (1.91)  HTN (1.86)
MI       LIV (6.43)  COP (2.11)  REN (1.80)  OBE (1.70)  HTN (1.58)
LOMS     RHE (2.20)  REN (1.99)  CAB (1.84)  SEX (1.69)

PMI indicates the health of the heart muscle, while LVD measures the left ventricular function. So they essentially measure the same phenomenon, although PMI is much less specific. Their association can be seen in the two-way table in Table 14. The last column is the percent of YES among non-missing cases. Compared with the percentage in the normal category, the percentage of patients who had a prior myocardial infarction increases as the left ventricular function gets worse. This confirms our knowledge about these two variables.

Table 14: Two-way table for PMI and LVD

                      prior MI
LVD       missing    no    yes   total   percent
missing     239      70     65    374
normal       59     413    183    655     30%
40-49%        3      61     78    142     56%
30-39%       40      24     40    104     63%
20-29%        0       5     14     19     74%
<20%          7       1      5     13     83%
total       348     574    385   1307     40%

AGE and BSA are the only two continuous variables in the data set. Biologically, they may be related to many other variables, but the most interesting relations (based on previous studies) are the following: the association of diabetes with AGE and BSA, and the relationship between BSA and gender as well as hypertension. In Figure 1, these relations are displayed by boxplots.

[Figure 1: Boxplots for continuous variables AGE and BSA. Panels are grouped by Diabetes (two panels), Sex and Hypertension.]
Boxplots have proven to be quite a good exploratory tool, especially when several boxplots are placed side by side for comparison, as in the current cases. The most striking visual feature is the box, which shows the limits of the middle half of the data (the white line inside the box represents the median and the ends of the box represent the lower and upper quartiles). The first horizontal lines beyond the box (the whiskers) are drawn to the nearest value not beyond a standard span from the quartiles. Points beyond, which may be outliers, are drawn individually. The standard span is 1.5 times the difference between the upper and lower quartiles [Hoaglin et al., 1983].

There is little difference in the distribution of age between the populations of patients with and without diabetes. The same holds for the distribution of body surface area with respect to diabetes and hypertension. Only the relation between gender and body surface area appears to be significant: female patients tend to have smaller body surface area. Although this is generally true, notice that, by the odds ratio analysis, female patients have a higher risk of operative mortality. We need to investigate this relation further, as some cardiologists believe that gender is a poor surrogate for body surface area.

Chapter 4

Logistic Regression Analysis

The methodology of logistic regression analysis has become extremely popular among biostatisticians in recent years; see for example Lemeshow et al. (1988).

Let $y_i$, $i = 1, \ldots, n$, be independent binary random variables. Logistic regression is a method for assessing the dependence of $p_i = \Pr(y_i = 1)$ on explanatory variables $x_i$. The dependence is postulated as

$$p_i = \frac{e^{x_i^T\beta}}{1 + e^{x_i^T\beta}}, \qquad 1 - p_i = \frac{1}{1 + e^{x_i^T\beta}},$$

for $i = 1, \ldots, n$, where $x_i^T = (x_{i1}, \ldots, x_{ip})$ is a row of known constants and $\beta = (\beta_1, \ldots, \beta_p)^T$ is a column of unknown parameters.
The equations above are equivalent to

$$g(p_i) = x_i^T\beta.$$

Here $g(p) = \log(p/(1 - p))$ is called the logistic transformation of the probability $p$, and the above equation is called a linear logistic model.

There are several ways to estimate the logistic parameters $\beta$ [Hosmer and Lemeshow, 1989]. The maximum likelihood procedure is based on the conditional likelihood

$$L(p; y) = \prod_{i=1}^{n} f(y_i; p_i \mid x_i),$$

where

$$f(y; p \mid x) = p^{y}(1 - p)^{1-y}$$

and $p = (p_1, \ldots, p_n)$, $y = (y_1, \ldots, y_n)$. It is convenient to deal with the log-likelihood function. In our case, it is

$$l(\beta; y) = \sum_{i=1}^{n} \left\{ y_i x_i^T\beta - \log[1 + \exp(x_i^T\beta)] \right\}.$$

To compute the maximum likelihood estimates, it is necessary to solve the score equations $\partial l(\beta; y)/\partial\beta = 0$. Commonly, the Newton-Raphson method or the iteratively reweighted least squares method is used to calculate $\hat{\beta}$, the estimate of $\beta$ [Rao, 1973].

Our goal is to use logistic regression to develop an objective model for prediction of 30 day operative mortality and complications among patients. Typically, the first step in this modeling process is data reduction; from all available predictor variables, only those most associated with the outcome are selected for inclusion in the final model. If, after this step, there is still a large set of characteristics, a stepwise logistic regression analysis can be applied to reduce the number of predictor variables. An alternative method is the best subsets selection procedure, which provides several candidate models.

4.1 Univariate Analysis and Comparison of Models

For continuous variables, the test of association between the outcome and the independent variable can be carried out using a Student t-test analogous to that in linear regression [Weisberg, 1980]. For categorical variables, we use the likelihood ratio test, which is defined as follows.

The deviance function is defined as

$$D(\hat{\mu}; y) = 2\log L(y; y) - 2\log L(\hat{\mu}; y),$$

where $\hat{\mu}$ denotes the vector of fitted means under the model. The difference in deviance between two models measures the contribution of the parameters by which they differ.
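As an illustration of the Newton-Raphson iteration and of the deviance, the following sketch fits an intercept-only model and a one-covariate model to a small binary data set and reports the difference in deviance. It relies on the fact that, for ungrouped binary responses, the saturated log-likelihood is zero, so the deviance reduces to $-2\,l(\hat{\beta}; y)$. The data are hypothetical (a 2x2 table expanded into individual records), and the helper names `solve`, `fit_logistic` and `deviance` are illustrative assumptions; this is not the software used for the thesis.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_logistic(X, y, max_iter=25, tol=1e-10):
    """Newton-Raphson for the score equations of the linear logistic model."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        eta = [sum(xj * bj for xj, bj in zip(row, beta)) for row in X]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        # Score vector X^T (y - mu) and information matrix X^T W X.
        score = [sum(row[j] * (yi - mi) for row, yi, mi in zip(X, y, mu))
                 for j in range(p)]
        info = [[sum(row[j] * row[k] * mi * (1.0 - mi) for row, mi in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        step = solve(info, score)
        beta = [bj + sj for bj, sj in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


def deviance(X, y, beta):
    """Deviance for binary responses: -2 times the maximized log-likelihood."""
    loglik = 0.0
    for row, yi in zip(X, y):
        eta = sum(xj * bj for xj, bj in zip(row, beta))
        loglik += yi * eta - math.log(1.0 + math.exp(eta))
    return -2.0 * loglik


# Hypothetical records (risk factor x, outcome y), for illustration only.
records = [(1, 1)] * 30 + [(1, 0)] * 70 + [(0, 1)] * 15 + [(0, 0)] * 85
y = [float(yi) for _, yi in records]
X_null = [[1.0] for _ in records]                  # intercept only
X_full = [[1.0, float(xi)] for xi, _ in records]   # intercept + risk factor

beta_null = fit_logistic(X_null, y)
beta_full = fit_logistic(X_full, y)
dev_null = deviance(X_null, y, beta_null)
dev_full = deviance(X_full, y, beta_full)
lr_stat = dev_null - dev_full                      # difference in deviance
print(f"slope = {beta_full[1]:.4f}, deviance difference = {lr_stat:.4f}")
```

With a single binary covariate the fitted slope coincides with the log of the sample odds ratio of the underlying 2x2 table, which ties this sketch back to the odds ratio analysis of Section 3.2.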
The distribution theory is asymptotic [McCullagh and Nelder, 1989]; for comparing two nested models with estimated means $\hat p_1$ and $\hat p_2$, the difference in deviance

\[ D(\hat p_1, \hat p_2) = D(\hat p_1; y) - D(\hat p_2; y) \]

has an asymptotic $\chi^2$ distribution (under the null hypothesis that the smaller model is correct) with degrees of freedom $\nu = \nu_1 - \nu_2$ equal to the difference in the dimensions of the parameter spaces implicit in the models with means $p_1$ and $p_2$. Therefore, to test the association of a single variable $x$ with the outcome, we only need to compare the model

\[ g(p_{1i}) = \beta_0 \]

with the model

\[ g(p_{2i}) = \beta_0 + \beta_1 x_i \]

and find out how much the variable $x$ improves the predictive value of the model.

4.2 Stepwise Logistic Regression

In stepwise logistic regression, models are built by adding new variables and seeing how much they improve the fit, and by dropping variables that do not improve the fit by a "significant" amount. Usually the procedure starts with an arbitrary model and stops when no step will decrease the value of a selection criterion. The selection criterion used here is AIC (Akaike's Information Criterion) [Akaike, 1973]

\[ \mathrm{AIC} = D + 2p, \]

where $D$ is the deviance of the current model and $p$ is the dimension (number of variables) of the model. The changes in AIC due to augmenting or reducing a model by a given variable reflect both the change in deviance caused by the step and the change in the dimension of the model. The rationale of AIC is that the more parameters a model contains, the less accurately they can be estimated, and the predictive value of the model may get worse. AIC adjusts the deviance for the number of parameters estimated. Thus, the model with the minimum AIC gives the best fit to the data according to the AIC criterion. Therefore, we think of AIC as a useful tool for the quick comparison of parametric models, although it does not indicate that the better of two models is "significantly better".

Take STA as an example. The initial model contains the 14 variables obtained from univariate screening. The first variable deleted was SEX, leading to AIC=477.90; the second one deleted was OBE, leading to AIC=475.97; ...; the last one deleted was CA, leading to AIC=471.56. The procedure is summarized in Table 15. In Table 16, the results of stepwise logistic regression for the various outcome variables are given with the corresponding best AIC values.

Table 15: Stepwise regression procedure: a demonstration

  Variables involved in the current model                            Operation   AIC
  AGE, SEX, PMI, OBE, COP, ETO, CA, CNS, PCA, OTH, CAB, LVD, BSA                 479.8
  AGE, PMI, OBE, COP, ETO, CA, CNS, PCA, OTH, CAB, LVD, BSA          -SEX        477.9
  AGE, PMI, COP, ETO, CA, CNS, PCA, OTH, CAB, LVD, BSA               -OBE        475.9
  AGE, PMI, COP, CA, CNS, PCA, OTH, CAB, LVD, BSA                    -ETO        473.5
  AGE, PMI, CA, CNS, PCA, OTH, CAB, LVD, BSA                         -COP        472.2
  AGE, PMI, CNS, PCA, OTH, CAB, LVD, BSA                             -CA         471.5

Table 16: Results of stepwise regression procedure for each outcome

  Outcome   Variables involved                        AIC
  STA       AGE, PMI, CNS, PCA, OTH, CAB, LVD, BSA    471.56
  REMS      AGE, SEX, PMI, COP, DIA, REN, CNS, LVD    437.92
  NEMS      AGE, PMI, OBE, ETO, CA, CNS               744.55
  PUMS      PMI, DIA, HTN, CNS, BSA                   1105.37
  MI        PMI, LIV, PCA                             684.66
  LOMS      AGE, PMI, DIA, REN, RHE, CAB, LVD, BSA    784.16

4.3 Best Subsets Selection

The best subsets selection is an alternative to the stepwise procedure for model building. This approach has been available for linear regression for years and makes use of the branch and bound algorithm of Furnival and Wilson (1974).
Typical software implementing this method will identify a specified number of "best" models containing one, two, three variables, and so on, up to the single model containing all $p$ variables. For the case of logistic regression, Hosmer and Lemeshow (1989) proposed a method which can be performed in a straightforward manner using any program for best subsets linear regression. The best subsets selection procedure is regarded as a more reliable and informative method, because the stepwise procedure leads to a single subset of variables and does not suggest alternative good subsets. In this procedure, $C_p$ statistics are used for selecting the best subsets [Draper and Smith, 1981]; a model with a $C_p$ value close to the number of predictors is better.

In the logistic model, let $\hat\beta$ be the maximum likelihood estimate and $\hat\pi_i$ be the estimated logistic probability computed using $\hat\beta$ and the data for the $i$th case, $x_i$. We define two matrices $X$ and $V$:

\[ X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \qquad V = \begin{pmatrix} \hat\pi_1(1-\hat\pi_1) & 0 & \cdots & 0 \\ 0 & \hat\pi_2(1-\hat\pi_2) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \hat\pi_n(1-\hat\pi_n) \end{pmatrix}. \]

It may be shown [Pregibon, 1981] that $\hat\beta = (X'VX)^{-1}X'Vz$, where the vector $z$ contains pseudovalues, $z = X\hat\beta + V^{-1}r$, and $r$ is the vector of residuals, $r = (y - \hat\pi)$. A computation for the best subsets logistic regression model can therefore be performed using a best subsets linear regression program with the dependent variable $z$, case weights $v_i$ equal to the diagonal elements of $V$, and the original covariates $x$.

In this study, for each outcome, we provide three candidate models produced by the best subsets selection procedure (Table 17). One interesting finding is that the model obtained by the stepwise procedure is among the three models.

Table 17: Models obtained from best subsets procedure for each outcome

  Outcome   Model code   Variables included                             C_p
  STA       S_1          AGE, PMI, CNS, PCA, OTH, CAB, LVD, BSA         7.02
            S_2          AGE, CA, CNS, PCA, OTH, CAB, LVD, BSA          6.4
            S_3          AGE, COP, CNS, PCA, OTH, CAB, LVD, BSA         8.0
  REMS      R_1          AGE, SEX, PMI, COP, DIA, REN, LIV, CNS         7.1
            R_2          AGE, SEX, PMI, COP, DIA, REN, CNS, LVD         8.4
            R_3          AGE, SEX, PMI, COP, DIA, REN, HTN, CNS         8.6
  NEMS      N_1          AGE, PMI, OBE, ETO, CA, CNS                    4.4
            N_2          AGE, PMI, ETO, CA, CNS, CAB                    4.6
            N_3          AGE, PMI, ETO, CA, CNS, PCA                    5.8
  PUMS      P_1          AGE, PMI, REN, HTN, CNS, BSA                   6.7
            P_2          PMI, OBE, REN, HTN, CNS, BSA                   7.5
            P_3          AGE, PMI, OBE, HTN, CNS, BSA                   7.9
  MI        M_1          PMI, COP, LIV, PCA                             3.9
            M_2          PMI, COP, CNS, PCA                             5.1
            M_3          PMI, LIV, CNS, PCA                             5.3
  LOMS      L_1          AGE, PMI, DIA, REN, RHE, OTH, CAB, LVD, BSA    11.0
            L_2          AGE, PMI, DIA, HTN, RHE, OTH, CAB, LVD, BSA    11.5
            L_3          AGE, SEX, PMI, DIA, RHE, OTH, CAB, LVD, BSA    12.2

4.4 Goodness-of-fit: Hosmer-Lemeshow Grouping Test

After the above procedures, we would like to know how effective the models are in describing the outcome variables. This is referred to as goodness-of-fit. One test was proposed by Hosmer and Lemeshow (1980). The Hosmer-Lemeshow grouping test creates groups based on the values of the estimated probabilities. Suppose we have $n$ observations. With this method, use of $g = 10$ groups results in the first group containing the $n_1 = n/10$ subjects having the smallest estimated probabilities, and the last group containing the $n_{10} = n/10$ subjects having the largest estimated probabilities. The Hosmer-Lemeshow goodness-of-fit statistic $\hat C$ is obtained by calculating

\[ \hat C = \sum_{k=1}^{g} \frac{(o_k - n_k \bar\pi_k)^2}{n_k \bar\pi_k (1 - \bar\pi_k)}, \]

where

\[ o_k = \sum_{j \in A_k} y_j \qquad \text{and} \qquad \bar\pi_k = \sum_{j \in A_k} \hat\pi_j / n_k, \]

with $A_k$ consisting of the subjects in the $k$th group. It can be shown that $\hat C$ is asymptotically well approximated by the chi-square distribution with $g - 2$ degrees of freedom, $\chi^2(g-2)$, if the model is correct. A small value of $\hat C$ indicates a good fit.

From the prediction point of view, we used this statistic as the final criterion for model selection. That is, among the candidate models obtained from stepwise logistic regression and the best subsets selection procedure, we chose the one with the smallest value of $\hat C$. In Table 18, the grouping tests for each selected model are shown.

Table 18: Hosmer-Lemeshow grouping test for selected models

  Decile        S_2           R_1           N_3           P_2           M_2           L_2
  of risk    Obs   Exp     Obs   Exp     Obs   Exp     Obs   Exp     Obs   Exp     Obs   Exp
  g1           0   0.7       1   0.5       1   1.2       9   9.9       1   3.1       8   2.9
  g2           0   1.4       2   1.1       3   2.6      10  11.2       4   3.1       4   4.6
  g3           2   2.0       1   1.6       3   4.1      11  12.1       2   3.1       4   5.8
  g4           3   2.6       1   2.2       6   5.8      12  13.2       3   3.2       4   7.0
  g5           4   3.3       3   2.8      10   9.8      18  14.5      13   7.5      11   8.4
  g6           4   4.3       3   3.6       9   7.7      19  16.8       9   9.1       7  10.1
  g7           5   5.7       7   4.9      10  12.2      21  19.4       6   9.0       9  11.9
  g8          12   7.7       6   6.9      15  15.9      18  23.4      18  16.3      14  14.4
  g9          15  11.4       8   9.8      22  21.1      34  31.0      22  20.4      21  18.9
  g10         20  25.8      29  27.3      42  40.7      44  45.2      24  27.2      35  33.0
  Total       65    65      61    61     121   121     196   196     102   102     117   117
  C-hat         7.25          3.75          1.82          3.76          8.29          13.9
  Prob          0.51          0.88          0.98          0.87          0.41          0.08

Judging from the p-values, all the selected models fit quite well except possibly for the one with LOMS as outcome.
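The grouping statistic described above can be sketched in a few lines. The code below is an illustrative sketch, not the S/Splus routine used for Table 18: the function names and the twenty-case toy data are hypothetical. The tail probability uses the closed form that holds when the number of degrees of freedom is even, which covers the case g = 10 (8 degrees of freedom); it is applied here to two of the statistics reported in Table 18 (7.25 and 13.9) as a check on the reported p-values.

```python
import math

def hosmer_lemeshow(y, p_hat, groups=10):
    """Return (C, table); table rows are (n_k, observed o_k, expected n_k * pibar_k)."""
    n = len(y)
    # Sort the cases by estimated probability and cut into contiguous groups.
    order = sorted(range(n), key=lambda i: p_hat[i])
    c_stat = 0.0
    table = []
    for k in range(groups):
        idx = order[k * n // groups:(k + 1) * n // groups]
        n_k = len(idx)
        o_k = sum(y[i] for i in idx)
        pi_k = sum(p_hat[i] for i in idx) / n_k
        c_stat += (o_k - n_k * pi_k) ** 2 / (n_k * pi_k * (1.0 - pi_k))
        table.append((n_k, o_k, n_k * pi_k))
    return c_stat, table

def chi2_sf_even_df(x, df):
    """P(chi-square_df > x) for even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

# Hypothetical toy data: two groups of ten cases with constant fitted probabilities.
y_toy = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0] + [1] * 7 + [0] * 3
p_toy = [0.2] * 10 + [0.6] * 10
c_toy, table_toy = hosmer_lemeshow(y_toy, p_toy, groups=2)

# Tail probabilities for two statistics reported in Table 18 (g = 10, so df = 8).
p_sta = chi2_sf_even_df(7.25, 8)
p_loms = chi2_sf_even_df(13.9, 8)
```

In the toy data the first group contributes nothing (2 observed, 2 expected) and the second contributes (7 - 6)^2 / (10 x 0.6 x 0.4), so the whole statistic comes from the second group.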
The final logistic regression models are given in Tables 19 to 24, together with the maximum estimated probability, which was calculated by putting the higher value for all the risk factors in the model (for continuous variables, we use their mean values). This number indicates the range of probability that a model can predict. Since OTH and CAB cannot be 1 at the same time, when these two appear together in the model, we use the one with the larger coefficient.

All the results are obtained using version 3.1 of the statistical software S/Splus. When coding dummy variables, treatment contrasts were used [Chambers and Hastie, 1990]. As mentioned in section 3.2, the estimated coefficients here can be interpreted as log-odds ratios. We simply calculate $\exp(\hat\beta)$ to give an odds ratio for each predictor with the other factors held fixed. For example, in the STA model, variable OTH has a coefficient 1.77, which gives exp(1.77)=5.87. This means that the odds of operative mortality for patients who had another cardiac operation are 5.87 times the odds for those who had not. Another example is that AGE leads to an odds ratio of exp(0.048)=1.049. This means an additional multiplicative risk of 1.049 for each increase in age of one year, all other variables held fixed. Since this number is larger than 1, we consider age to be a contributor to operative mortality; older patients tend to have higher risk. For a categorical variable, the odds ratios should be interpreted as a comparison to the first category. In both of the categorical variables PMI and LVD, the first category happens to represent the missing value, and such a comparison can provide some insight into the missing value category.
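The odds-ratio arithmetic quoted above is easy to reproduce. The short sketch below uses the two STA coefficients mentioned in the text (1.770 for OTH and 0.048 for AGE) and the standard error of the OTH coefficient from Table 19 (0.444). The ten-year age effect and the Wald-type 95% interval are additions for illustration only; the interval assumes approximate normality of the estimated coefficient and is not reported in this study.

```python
import math

coef_oth, se_oth = 1.770, 0.444   # OTH in the STA model (Table 19)
coef_age = 0.048                  # AGE in the STA model, per year of age

# Odds ratio for a binary factor: exp(coefficient).
or_oth = math.exp(coef_oth)

# Odds ratio per year of age, and for a ten-year difference in age.
or_age_1yr = math.exp(coef_age)
or_age_10yr = math.exp(10 * coef_age)

# Illustrative Wald-type 95% interval for the OTH odds ratio (an assumption,
# based on a normal approximation for the coefficient estimate).
or_oth_low = math.exp(coef_oth - 1.96 * se_oth)
or_oth_high = math.exp(coef_oth + 1.96 * se_oth)
```

Because the model is linear on the log-odds scale, the ten-year odds ratio is simply the one-year odds ratio raised to the tenth power.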
Table 19: Final logistic regression model for STA

  Variable   Heading   $\hat\beta$   SE($\hat\beta$)   $\hat\beta$/SE($\hat\beta$)   exp($\hat\beta$)
  Constant             -2.811        1.753             -1.603
  V24        OTH        1.770        0.444              3.981                        5.871
  V26        CAB        1.327        0.344              3.856                        3.770
  V32        BSA       -2.061        0.654             -3.148                        0.127
  V1         AGE        0.048        0.015              3.108                        1.049
  V30        LVD0      -0.092        0.333             -0.277                        0.911
             LVD1      -0.430        0.541             -0.794                        0.650
             LVD2       0.973        0.420              2.314                        2.648
             LVD3       1.995        0.665              2.997                        7.357
             LVD4      -3.914        5.808             -0.673                        0.019
  V19        CNS        1.160        0.496              2.337                        3.190
  V20        PCA        1.227        0.608              2.017                        3.411
  V16        CA         0.868        0.745              1.164                        2.382
  Maximum prediction probability: 0.97

Table 20: Final logistic regression model for REMS

  Variable   Heading   $\hat\beta$   SE($\hat\beta$)   $\hat\beta$/SE($\hat\beta$)   exp($\hat\beta$)
  Constant             -9.004        1.186             -7.586
  V9         REN        1.855        0.335              5.523                        6.391
  V1         AGE        0.065        0.016              4.020                        1.067
  V5         COP        1.032        0.345              2.988                        2.806
  V3         PMI1       0.597        0.422              1.415                        1.817
             PMI2       1.146        0.407              2.813                        3.146
  V19        CNS        1.162        0.494              2.350                        3.196
  V2         SEX        0.577        0.288              2.000                        1.780
  V6         DIA        0.668        0.342              1.951                        1.950
  V18        LIV        0.836        0.935              0.893                        2.307
  Maximum prediction probability: 0.92

Table 21: Final logistic regression model for NEMS

  Variable   Heading   $\hat\beta$   SE($\hat\beta$)   $\hat\beta$/SE($\hat\beta$)   exp($\hat\beta$)
  Constant             -7.934        0.846             -9.377
  V1         AGE        0.089        0.012              7.362                        1.093
  V19        CNS        1.292        0.356              3.625                        3.640
  V3         PMI1      -0.811        0.241             -3.360                        0.444
             PMI2      -0.861        0.261             -3.301                        0.422
  V16        CA         1.662        0.553              3.002                        5.270
  V11        ETO        0.591        0.415              1.425                        1.807
  V20        PCA        0.606        0.522              1.161                        1.833
  Maximum prediction probability: 0.76

Table 22: Final logistic regression model for PUMS

  Variable   Heading   $\hat\beta$   SE($\hat\beta$)   $\hat\beta$/SE($\hat\beta$)   exp($\hat\beta$)
  Constant             -0.720        0.690             -1.043
  V3         PMI1      -0.442        0.097             -4.530                        0.642
             PMI2      -0.097        0.060             -1.620                        0.907
  V10        HTN        0.451        0.162              2.773                        1.569
  V32        BSA       -0.737        0.362             -2.037                        0.478
  V19        CNS        0.554        0.338              1.635                        1.740
  V9         REN        0.329        0.248              1.326                        1.389
  V4         OBE        0.286        0.269              1.063                        1.331
  Maximum prediction probability: 0.40

Table 23: Final logistic regression model for MI

  Variable   Heading   $\hat\beta$   SE($\hat\beta$)   $\hat\beta$/SE($\hat\beta$)   exp($\hat\beta$)
  Constant             -1.820        0.172             -10.53
  V3         PMI1      -1.991        0.303             -6.567                        0.136
             PMI2      -0.899        0.248             -3.619                        0.406
  V20        PCA        1.469        0.524              2.801                        4.348
  V19        CNS        0.428        0.412              1.039                        1.534
  V5         COP        0.285        0.276              1.029                        1.329
  Maximum prediction probability: 0.59

Table 24: Final logistic regression model for LOMS

  Variable   Heading   $\hat\beta$   SE($\hat\beta$)   $\hat\beta$/SE($\hat\beta$)   exp($\hat\beta$)
  Constant             -1.916        1.265             -1.514
  V32        BSA       -1.603        0.485             -3.300                        0.201
  V3         PMI1       0.723        0.341              2.121                        2.062
             PMI2       1.015        0.332              3.049                        2.760
  V26        CAB        0.838        0.293              2.855                        2.312
  V30        LVD0      -0.244        0.285             -0.854                        0.783
             LVD1      -0.204        0.373             -0.548                        0.814
             LVD2       0.608        0.355              1.711                        1.838
             LVD3       1.214        0.553              2.193                        3.367
             LVD4       1.338        0.817              1.638                        3.813
  V24        OTH        0.879        0.392              2.243                        2.410
  V21        RHE        1.000        0.454              2.200                        2.720
  V1         AGE        0.022        0.010              2.129                        1.022
  V10        HTN        0.315        0.209              1.503                        1.370
  V6         DIA        0.467        0.265              1.759                        1.596
  Maximum prediction probability: 0.93

Chapter 5

The Tree-based Model

The tree-based model has gradually become a popular tool in clinical and epidemiological studies because of its clinical interpretability. The technique was introduced by Morgan et al.
(1964); however, more ground-breaking ideas were introduced by Breiman et al. (1984), and the resulting computer program is named CART (Classification And Regression Trees). The tree-based model procedure used in version 3.1 of S/Splus departs slightly from CART in that it uses the recursive partitioning (RP) method proposed by Ciampi et al. (1987). Also, compared with CART, the procedure is far less automatic in tree building, as the unbundling of procedures for growing, displaying and challenging trees requires user initiation in all phases.

5.1 Recursive Partitioning: Growing a Classification Tree

In general, the tree-based model is fitted by creating a binary tree using an RP algorithm. The data have the form $(y^{(i)}, x^{(i)})$, $i = 1, \ldots, N$, where $y$ is a multinomially distributed variable with $s$ categories and $x$ is assumed to be a vector of categorical variables $x = (x_1, \ldots, x_k)$; for each $j$, $x_j$ has a finite number of categories $l_1, \ldots, l_{m_j}$. The categories of $x_j$ can be either ordered or unordered.

In what follows, we refer to $y$ as the criterion variable and to the components of $x$ as predictor variables. Predictors contain background information used to define strata which are homogeneous according to a criterion variable; for each homogeneous stratum one can define a unique criterion quantity, independent of the $x$ variables given the stratum. In our study, the criterion quantity is the vector of probabilities of being assigned to each outcome, i.e., $p = (p_1, \ldots, p_s)$ such that $\sum_{j=1}^{s} p_j = 1$.

Figure 2: Binary tree: an example (legend: non-terminal node, terminal node)

Our aim is to grow a binary tree with nodes representing subsets of observations. In particular, the root of the tree represents the entire set of observations and the terminal nodes represent strata that are more homogeneous (see Figure 2).

The tree is constructed based on a set of Split Defining Statements (SDS) such as $x_j \in A_j$, where $A_j$ is a subset of the $m_j$ categories of $x_j$. For $x_j$ unordered, $A_j$ can be any of the $2^{m_j} - 1$ nontrivial subsets of $l_1, \ldots, l_{m_j}$; for $x_j$ ordered, $A_j$ can be any of the $m_j - 1$ subsets of the form $A_j = \{l_1\}, \{l_1, l_2\}, \ldots, \{l_1, \ldots, l_{m_j - 1}\}$.

In an RP tree, each nonterminal node is split by an SDS into two nodes which represent subsets as dissimilar as possible from the point of view of the criterion quantity. Ciampi et al. (1987) applied the likelihood ratio statistic (LRS) as a natural measure of dissimilarity, as follows. Let $P_1$, $P_2$ be disjoint sets and let $P$ denote their union. We shall assume that the criterion quantity is represented by a parameter $\theta$ which may take different values $\theta_1$, $\theta_2$ for $P_1$ and $P_2$, and that likelihood functions $L_1(\theta_1)$, $L_2(\theta_2)$ can be defined for $P_1$, $P_2$. We shall also assume that the likelihood function for $P$ is of the form

\[ L(\theta_1, \theta_2) = L_1(\theta_1) L_2(\theta_2). \]

Consider now the hypothesis

\[ H_0 : \theta_1 = \theta_2 = \theta \]

and the alternative

\[ H : \theta_1 \neq \theta_2 . \]

Then the log LRS of $H$ versus $H_0$ is defined as

\[ \rho(H \mid H_0) = 2 \log \frac{L_1(\hat\theta_1) L_2(\hat\theta_2)}{L_1(\hat\theta) L_2(\hat\theta)}, \]

where $\hat\theta_1$, $\hat\theta_2$ are the MLEs of $\theta_1$ and $\theta_2$ under $H$ and $\hat\theta$ is the MLE of $\theta$ under $H_0$. Clearly, the larger $\rho(H \mid H_0)$ is, the greater is the evidence in the data that $P_1$ and $P_2$ are heterogeneous with respect to the criterion quantity. It therefore provides a reasonable and general measure of dissimilarity:

\[ d(P_1, P_2) = \rho(H \mid H_0). \]

In our case, the criterion quantity $p = (p_1, \ldots, p_s)$ denotes the probabilities that $y$ falls into each of the possible categories. We have for a given subset or node:

\[ L(p; y) = \prod_{j=1}^{s} p_j^{n_j}, \]

where $n_j$ is the number of individuals at the node that fall into the $j$th category and $y = (y_1, \ldots, y_n)$. At a given node, let $M$ be the number of observations in the node and $n_j$ be the number of $y$'s falling into the $j$th category; then the maximum likelihood estimate of $p$ is

\[ \hat p = \left( \frac{n_1}{M}, \ldots, \frac{n_s}{M} \right). \]

A description of the RP algorithm is: Denote by $N = \{N_1, \ldots, N_r\}$ the current collection of nodes.
(1) To initialize, set $r = 1$ and let $N_1$ represent all the observations.
(2) For every $N_i \in N$ and every split $(P_{i1}, P_{i2})$ defined by an element of SDS, compute $d(P_{i1}, P_{i2})$.
(3) Among all nodes choose the node $N_{i^*}$ corresponding to the split $(P^*_{i^*1}, P^*_{i^*2})$ with the largest dissimilarity and replace $N_{i^*}$ by two nodes representing $P^*_{i^*1}$ and $P^*_{i^*2}$. Use the resulting collection of nodes as current and go to (2), where $r$ has increased by 1.

In the tree-based model in S/Splus, an intuitive way is used to implement the above algorithm. The deviance of a node is defined as

\[ D(p; y) = -2\, l(p; y), \]

where $l(p; y) = \log L(p; y)$. It can be shown that the deviance is identically zero if all the $y$'s are the same, and increases as the $y$'s deviate from this case. The deviance $D_T(y)$ of a tree $T$ is defined as the sum of the deviances of all its terminal nodes, $\sum_{t \in T} D(\hat p_t; y)$, where $\hat p_t$ is the vector of the observed proportions of the $s$ categories for node $t$. Splitting proceeds by comparing the deviance of the tree $T$ with that of larger trees $T'$ in which a terminal node of $T$ has been split into two. The split that maximizes the change in deviance

\[ \Delta D = D_T(y) - D_{T'}(y) \]

is the next split that is chosen.

5.2 Getting the Right Size Tree: Pruning the Classification Tree

The above discussion implies that nodes become more and more pure (homogeneous) as splitting progresses. In the limit, a tree can have as many terminal nodes as there are observations. In S/Splus, two thresholds are introduced to stop the splitting process: (a) the node deviance is less than some small fraction of the root node deviance (say 1%); and (b) the node is smaller than some absolute minimum size (say 10). This also introduces another problem: if the threshold is set too high, good splits may be lost. There are two ways out of this dilemma: one is to use new (independent) data to guide the selection of the right size tree, and the other is to reuse the existing data by the method of cross-validation.
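The deviance-based splitting rule of Section 5.1 can be sketched as follows. This is a minimal illustration and not the S/Splus implementation: the function names and the six-case toy data are hypothetical, and only a single ordered predictor with a binary response is handled. The last lines apply the node deviance to the counts reported for the first STA split (65 deaths among 1307 patients at the root, divided into 16 among 672 and 49 among 635).

```python
import math

def node_deviance(counts):
    """-2 * multinomial log-likelihood evaluated at the observed proportions."""
    n = sum(counts)
    return -2.0 * sum(c * math.log(c / n) for c in counts if c > 0)

def class_counts(labels):
    """Counts of the two response categories (0 and 1) in a node."""
    return [labels.count(0), labels.count(1)]

def best_threshold_split(values, labels):
    """Try every cut point of an ordered predictor; return (gain, cut)."""
    parent = node_deviance(class_counts(labels))
    cuts = sorted(set(values))
    best_gain, best_cut = 0.0, None
    for lo, hi in zip(cuts, cuts[1:]):
        cut = (lo + hi) / 2.0
        left = [l for v, l in zip(values, labels) if v < cut]
        right = [l for v, l in zip(values, labels) if v >= cut]
        # Change in deviance when the node is replaced by its two children.
        gain = (parent
                - node_deviance(class_counts(left))
                - node_deviance(class_counts(right)))
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_gain, best_cut

# Hypothetical toy node: the predictor separates the two categories perfectly.
toy_gain, toy_cut = best_threshold_split([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])

# Reduction in deviance for the first STA split, from the reported counts.
sta_gain = (node_deviance([65, 1307 - 65])
            - node_deviance([16, 672 - 16])
            - node_deviance([49, 635 - 49]))
```

In the toy node both children are pure, so their deviance is zero and the gain equals the deviance of the parent node.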
In this case, S/Splus provides a function called "prune". The idea of pruning is more easily described by tree terminology:

Notation 1: A binary tree is denoted by $T$. A node $t$ on the tree is denoted by $t \in T$.

Definition 1: A branch $T_t$ of $T$ with node $t \in T$ consists of the node $t$ and all descendants of $t$ in $T$.

Definition 2: Pruning a branch $T_t$ from a tree $T$ involves cutting off $T_t$ just below the node $t$. The resulting tree is a subtree of $T$ denoted by $T - T_t$.

Definition 3: $T'$ is a pruned subtree of $T$ if $T'$ is obtained by successively pruning off the branches of $T$.

In S/Splus, the importance of a pruned subtree $T'$ is captured by the cost-complexity measure

\[ D_\alpha(T') = D(T') + \alpha \cdot \mathrm{size}(T'), \]

where $D(T')$ is the deviance of the subtree, $\mathrm{size}(T')$ is the number of terminal nodes of $T'$ and $\alpha$ is the cost-complexity parameter. For any specified $\alpha$, cost-complexity pruning determines the subtree $T'$ that minimizes $D_\alpha(T')$ over all subtrees of $T$.

Figure 3: Original tree for STA.

As is known from the RP algorithm, the deviance of a tree $T$ is smaller than that of its subtrees when $\alpha$ is set to zero. But when the size of the tree is taken into consideration, that is, $\alpha > 0$, pruning provides us an upward way to snip off the least important branches. In the extreme case, only the root node is left if $\alpha$ is set sufficiently large. A sequence of subtrees $T = T_0 \succ T_1 \succ \cdots \succ T_k = \mathrm{root}$ with decreasing size can be obtained by setting an increasing sequence of values of $\alpha$: $\alpha_0 = 0 < \alpha_1 < \cdots < \alpha_k$.

5.3 Applications and Results

For each outcome, the described recursive partitioning procedure was performed on a sample data set. The original tree underwent cross-validation testing on a new data set by the pruning algorithm, and the right size of the trees was decided.
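To make the role of the cost-complexity parameter concrete, the sketch below evaluates the measure D(T') + alpha * size(T') over a short list of candidate subtrees and picks the minimizer for several values of alpha. The (size, deviance) pairs are made-up numbers for illustration only, not results from this study, and the function names are hypothetical; the S/Splus pruning function searches over the subtrees of the fitted tree itself rather than over a fixed list.

```python
def cost_complexity(deviance, size, alpha):
    """Cost-complexity measure: deviance plus alpha times the number of terminal nodes."""
    return deviance + alpha * size

def best_subtree(subtrees, alpha):
    """subtrees: list of (size, deviance) pairs; returns the pair minimizing the measure."""
    return min(subtrees, key=lambda t: cost_complexity(t[1], t[0], alpha))

# Hypothetical nested sequence: deviance falls as the number of terminal nodes grows.
subtrees = [(1, 100.0), (3, 80.0), (7, 70.0), (15, 66.0)]

choice_alpha_0 = best_subtree(subtrees, alpha=0.0)     # no penalty: largest tree wins
choice_alpha_1 = best_subtree(subtrees, alpha=1.0)     # moderate penalty
choice_alpha_20 = best_subtree(subtrees, alpha=20.0)   # heavy penalty: root only
```

With alpha = 1 the four candidates score 101, 83, 77 and 81, so the seven-node subtree is selected; as alpha grows the choice moves toward the root, which mirrors the nested sequence described above.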
Figure 4: Plots of deviance versus size for sequences of subtrees. (a): sequence obtained from sample data; (b): sequence evaluated on test data

Again take STA as an example. We use the variables obtained from the initial data analysis as the predictor variables. Each of these variables is considered for splitting the root node of the sample data set (1307 patients, of whom 65 died). In the first round, AGE is the variable leading to the two nodes that are the most different (with mortality rates 16/672 and 49/635 respectively; refer to Figure 5). We continue this procedure and use the same group of predictor variables to split each of the two nodes. For example, the winner for the left node is LVD, while OTH is the best one for the right node. This process is continued until each node contains fewer than 10 patients. See Figure 3 for the resulting tree. This tree has 63 terminal nodes and is obviously too large to use, so pruning is necessary.

Figure 4(a) displays the plot of deviance versus size (number of nodes) for the sequence of subtrees of the above tree. It should not be surprising that the sequence produced provides little guidance on what size tree is adequate. But we can use new data to guide the selection of the right size tree by using the pruning algorithm described in section 5.2. In S/Splus, this function provides a sequence of subtrees and the deviance evaluated on the test data. Figure 4(b) illustrates this functionality for the STA data. Usually this sequence will not be monotone, and the turning point will suggest the right size; for example, for the STA data, a seven-node tree is suggested and $\alpha = 1.125$. The binary tree (see Figure 5) has three terminal nodes corresponding to low risk, and four terminal nodes corresponding to high risk.
The size of the risk of a node is defined relative to the sample population risk. Patients who are over 64.5 years old and have had some other previous cardiac operation appear to have a relatively high risk of mortality. Patients who are less than 64.5 years old and have normal left ventricular function seem to be at much lower risk than those in the same age group but with a worse condition on LVD. Body surface area also plays an important role here. As we can see, with the same condition on age and LVD referred to above, patients with a smaller body surface area tend to face a higher risk of mortality. Similar interpretations can be made for the tree models with the response variable being one of the complication variables; see Figures 6 to 10.

Figure 5: Tree model for STA. The number under each terminal node is the observed proportion of the 30 day operative mortality; for example, in the leftmost node, which has low risk, 9 out of 606 patients in the node died after the operation.

Figure 6: Tree model for REMS

Figure 7: Tree model for NEMS
Figure 8: Tree model for PUMS

Figure 9: Tree model for MI

Figure 10: Tree model for LOMS

Chapter 6

Discussion and Conclusion

Our aim is to set up risk stratification models for some response variables. Besides STA, we are also interested in other variables, i.e., the complications. At first, using odds ratio analysis, we obtained some idea of the association between binary risk factors and the response variable. After that, two methods were applied using S/Splus: the logistic regression model and the tree-based model.

In the logistic regression procedure, there are several steps of work before achieving the final model. First, each predictor variable's association with the response variable was tested with the likelihood ratio test. Only those which show a potential relation with the response variable (p-value < 0.25) were selected.
Linearity of the continuous variables with respect to the response variable was checked and confirmed. Secondly, if the dimension of the variable space was still large (more than 10), a stepwise procedure was used. This would usually reduce the dimension to less than 10. Thirdly, a best subsets program was run on the sample to give some alternative candidate models. After that, the Hosmer-Lemeshow grouping test was applied to check the goodness-of-fit, and finally the best models were selected.

In the tree-based model, a classification tree was grown based on a recursive partitioning method in S/Splus. After that, a pruning algorithm was applied to get a tree with 4 to 6 levels.

When doing statistical analysis, instead of using the entire data set, a sample subset was used. (This was done partly in order to reduce computational time.) While sampling, the 3000 data entries at the end of the data file were untouched. This latter part was kept for validation testing.

The missing values in PMI and LVD bring some trouble to the analysis. Fortunately, these are categorical variables and we can code an extra category labeled as "no information". Consequently, PMI and LVD become categorical variables with 3 and 6 categories respectively, and dummy variables are created to replace them in the logistic regression models, but not in the tree models.

A summary of important risk factors for the dependent status variable STA and the complication variables REMS, NEMS, PUMS, MI and LOMS is given in Table 25. The binary risk factors which have significant odds ratios, and the risk factors which are included in the final logistic model and the final pruned tree model, are listed (in order of importance). There is substantial overlap in the important risk factors from the 3 methods. The ranges of predicted risks are summarized in Table 25 for each of the dependent variables.
The range for the logistic regression model is wider than that of the tree model because the logistic regression separates out the cases more than the tree.

AGE, as we anticipated, is associated with all outcomes. It seems that SEX is not strongly associated with operative mortality and complications, since it does not appear as a predictor variable in any of the final logistic regression or tree models. This could be because the variable BSA accounts somewhat for the gender variable (that is, BSA is a partial surrogate). Male and female patients face the same level of risk for the same value of BSA and other variables.

What about body surface area? BSA has a strong association with operative mortality. But with the complications, it is not always as important. It is an important risk factor for LOMS and takes a middle position of importance in predicting PUMS and MI. Its importance in the models for REMS and NEMS is much less.

Prior cardiac operation plays a very important role in predicting the 30 day operative mortality.

Table 25: List of risk factors and prediction range for the different methods and different outcomes

Outcome PUMS COP OBE ETO CNS DIA HTN Method Odds Ratio STA VAL CAB CNS OTH REMS REN LIV CNS COP DIA HTN NEMS CNS CA OBE REN HTN Logistic Regression OTH CAB BSA AGE LVD CNS PCA CA REN AGE COP PMI CNS SEX DIA LIV AGE CNS PMI CA ETO PCA Prediction Range Tree-based Model 0 - 0.97 AGE OTH LVD CAB BSA Prediction Range 0.0 - 0.4 0 - 0.92 REN AGE PMI COP LVD CNS BSA 0.01 - 0.80 0 - 0.76 AGE PMI CNS OBE BSA CA ETO 0.02 - 0.6 MI LIV COP REN OBE HTN LOMS RHE REN CAB SEX PMI PCA CNS REN OBE PMI PCA CNS COP 0 - 0.40 PMI AGE HTN CNS CAB 0 - 0.59 PMI PCA BSA AGE CNS HTN BSA PMI CAB LVD OTH RHE AGE HTN DIA 0 - 0.93 BSA LVD AGE CAB PMI REN 0.0 - 0.625 0 - 0.60 0.04 - 0.57

Two out of the three variables CAB, OTH and VAL appear in the logistic model and the tree model.
It is not surprising that these variables are weighted most heavily by cardiologists. But they seem to have a weak association with some of the complications.

PMI and LVD are also important predictor variables. As we know, they measure roughly the same thing, damage of the heart muscle; they seldom appear together in the same model and play the role alternately.

As to the diseases, CNS and HTN deserve particular attention. CNS appears in all the logistic models except the one with LOMS as outcome, although in every appearance its position in importance is around the middle. HTN's function is revealed when analyzing its relation with complications. For all complications, patients with hypertension have higher risk. Actually, all the diseases studied appear as important predictors in different models for predicting the various complications. But they tend to be associated with particular outcomes; for example, REN (renal failure) is the most important risk factor in the REMS model and PCA seems closely related to the MI complication.

One of the difficulties in this study was that the patient data were from several populations. The technique and experience may vary across different hospitals. One possibility is to separate the patients, develop the models within one population, and then seek generalization. Unfortunately, the MCR database did not provide such information. Another approach, which may be more feasible, is to include the variables which describe the operation, such as X-clamp time, type of oxygenator used, etc., to capture the difference between populations, since the database did record this information.

As we know, logistic regression is more powerful for obtaining prediction probabilities in the range of 0.1 to 0.9. So, although the logistic regression and tree models identify nearly the same group of risk factors, when predicting we suggest the latter be used since its prediction range is narrower.
Another suggestion, also proposed by cardiologists, is to separate the population according to the prior cardiac operation performed. Some such subpopulations are: (a) patients who had a coronary bypass operation, (b) patients who had a valve operation, and (c) patients who had both operations. Hopefully, analyses based on these subpopulations will lead to more interesting and important findings.

Bibliography

Akaike, H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle, in Second International Symposium on Information Theory. Akademia Kiado, Budapest, 267-281.
Bishop, Y. M. M., Fienberg, S. E., and Holland, P. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Boston.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Chambers, J. M., and Hastie, T. J. (1990). Statistical Models in S. Wadsworth, Belmont, CA.
Ciampi, A., Chang, C-H., Hogg, S., and McKinney, S. (1987). Recursive partitioning: a versatile method for exploratory data analysis in biostatistics, in Biostatistics (eds. I. B. MacNeill and G. J. Umphrey). D. Reidel Publishing, New York.
Draper, N.D., and Smith, H. (1981). Applied Regression Analysis, Second Edition. Wiley, New York.
Fleiss, J. (1979). Confidence intervals for the odds ratio in case-control studies: state of art. Journal of Chronic Diseases, 32, 69-77.
Furnival, G. M., and Wilson, R. W. (1974). Regression by leaps and bounds. Technometrics, 16, 499-511.
Hoaglin, D. C., Mosteller, F., and Tukey, J. W., editors (1983). Understanding Robust and Exploratory Data Analysis. Wiley, New York.
Hosmer, D. W., and Lemeshow, S. (1980). A goodness-of-fit testing for the multiple logistic regression model. Communications in Statistics, A10, 1043-1069.
Hosmer, D. W., and Lemeshow, S. (1989). Applied Logistic Regression. Wiley, New York.
Lemeshow, S., Teres, D., Avrunin, J. S., and Pastides, H. (1988). Predicting the outcome of intensive care unit patients. Journal of the American Statistical Association, 83, 348-356.
McCullagh, P., and Nelder, J. A. (1989). Generalized Linear Models, Second Edition. Chapman Hall, London.
Morgan, J. N., and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58, 415-434.
Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9, 705-724.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, Second Edition. Wiley, New York.
Rothman, K. J. (1986). Modern Epidemiology. Little Brown, Boston.
Walter, S. D. (1987). Point estimation of the odds ratio in sparse 2x2 contingency tables, in Biostatistics (eds. I. B. MacNeill and G. J. Umphrey). D. Reidel Publishing, New York.
Weisberg, S. (1980). Applied Linear Regression. Wiley, New York.

Appendix A
Merged Cardiac Registry

MERGED CARDIAC REGISTRY
AN INTERNATIONAL DATABASE
DENDRITE SYSTEMS, INC.
1: DEMOG: Entered by the system
2: PRIOR ENTRY REGISTRY MCR: Entered by the system
3: DATE OF SURGERY: Enter date MM/DD/YY
4: AGE: Enter age in years
5: SEX: Entered by the system
6: (Reserved for future use)
7: PRIOR MI: 1=No, 2=Yes
8: MOST RECENT MI: 1=0-6h, 2=6h-24h, 3=1d-7d, 4=1w-6w, 5=>6w
9: OTHER DISEASES: 0[ ]=Other, 1[ ]=Obesity, 2[ ]=COPD, 3[ ]=Diab, 4[ ]=Chol >200, 5[ ]=Chol >300, 6[ ]=Renal, 7[ ]=Htn, 8[ ]=ETOH, 9[ ]=Drug Abuse, 10[ ]=Marfans, 11[ ]=HIV+, 12[ ]=AIDS, 13[ ]=CA, 14[ ]=Blood, 15[ ]=Liver, 16[ ]=CNS, 17[ ]=Prior CVA, 18[ ]=RheumHD, 19[ ]=Pulm Htn, 20[ ]=Chronic Dialysis
10: SMOKING NOW: 0=No, 1=Q>2y, 2=Y<1pk/d, 3=Y>1pk/d
11: PRIOR CARD SURG: 0[ ]=Other, 1[ ]=None, 2[ ]=CABG, 3[ ]=Valve, 4[ ]=Cong, 5[ ]=Pacemaker
12: LV DYSFUNCTION: 1=Nor, 2=40-49%, 3=30-39%, 4=20-29%, 5=<20%
13: LVEF: Ejection Fraction, enter %
14: CAD > 70%: 1[ ]=No, 2[ ]=LAD, 3[ ]=CX, 4[ ]=RC, 5[ ]=Branch, 6[ ]=L Main, 7[ ]=1 Vessel, 8[ ]=2 Vessel, 9[ ]=3 Vessel
15: OTHER CARD PATH: 0[ ]=Other, 1[ ]=Ao St, 2[ ]=Ao Insf, 3[ ]=Mitr St, 4[ ]=Mitr Insf, 5[ ]=Tricusp, 6[ ]=Pulm, 7[ ]=Cong, 8[ ]=Acq VSD, 9[ ]=LV Aneur, 10[ ]=Ao Aneur, 11[ ]=Ascending diss, 12[ ]=Descending diss
16: DATE MOST RECENT PTCA: Enter date, leave blank if none
17: PTCA RESULT: 1[ ]=N/A, 2[ ]=Success, 3[ ]=Failed, 4[ ]=Had complication
18: NO. PTCA VESSELS: Enter number of vessels dilated
19: REASON FOR OP: 0[ ]=Other, 1[ ]=Ang, 2[ ]=Urgent, 3[ ]=Arrh, 4[ ]=Anat, 5[ ]=Fail PTCA, 6[ ]=Tumor, 7[ ]=Endocarditis, 8[ ]=Trauma, 9[ ]=Ao diss, 10[ ]=Ao Aneur
20: PRE-OP STATUS: 1=Elect, 2=Urgent, 3=Emerg, 4=Desperate
21: HEMODYNAMIC STAT: 1=Stbl, 2=Stbl on meds, 3=Unstbl on meds, 4=Cardiogenic shock on meds/IABP
22: OXYGENATOR: 0=Other, 1=H1500, 2=Shiley, 3=TMO, 4=CMII, 5=Sci25, 6=Sci35, 7=Maxima, 8=BCM7, 9=Sarns, 10=Terumo, 11=SciUltra
23: OTHER OP DEVICES: 0[ ]=Other, 2[ ]=Cel Sav, 3[ ]=HemoConcen, 4[ ]=Dial Filter, 5[ ]=IABP, 6[ ]=BioMed Pump, 7[ ]=LHAD, 8[ ]=RHAD, 9[ ]=Art Filter, 10[ ]=Plasma Phor, 11[ ]=MyoTempProb, 12[ ]=Cooling Pad, 13[ ]=Delphin Pump
24: THROMBOLYTIC Rx: 1[ ]=tPA, 2[ ]=Strepto, 3[ ]=Urokin, 4[ ]=IntrCor, 5[ ]=IntrVein
25: CARDIOPLEGIA: 1[ ]=None, 2[ ]=Cryst, 3[ ]=Blood, 4[ ]=Retro (Cor Sinus), 5[ ]=Intermit Clamp
26: X-CLAMP TIME: Enter time in minutes
27: BODY SURFACE AREA: Enter in square meters (e.g., 2.3)
28: NO. CABGs: Enter number of distal anastomoses
29: VALVES REPLACED: 1[ ]=Ao, 2[ ]=Mitr, 3[ ]=Tri, 4[ ]=Ao combined c Ao graft
30: REPAIRS: 1[ ]=Ao, 2[ ]=Mitr, 3[ ]=Tri, 4[ ]=Cong, 5[ ]=Acq VSD, 6[ ]=LV Aneur
31: AORTIC PROSTHESIS: 0=Other, 1=SE, 2=BS, 3=St. J, 4=Ed Porc, 5=Hancock, 6=Froz Homo, 7=Medtronic
32: MITRAL PROSTHESIS: 0=Other, 1=SE, 2=BS, 3=St. J, 4=Ed Porc, 5=Hancock, 6=Froz Homo, 7=Ring, 8=Medtronic, 9=Omnisci
33: (Reserved for future use)
34: (Reserved for future use)
35: BLOOD PRODUCTS: 1[ ]=Fresh Froz Plasma, 2[ ]=Platelets, 3[ ]=Cryo
36: DONOR TRANSFUSIONS: Enter number of units
37: AUTOLOGOUS TRANSFUSIONS: Enter number of units
38: (Reserved for future use)
39: COMPLICATIONS: 0[ ]=Other, 1[ ]=Reop/Bleed, 2[ ]=Renal/Mild, 3[ ]=Renal/Sev, 4[ ]=Wound/Sev, 5[ ]=Neuro/Mild, 6[ ]=Neuro/Sev, 7[ ]=Pulm/Mild, 8[ ]=Pulm/Sev, 9[ ]=MI, 10[ ]=Low Out/Mild, 11[ ]=Low Out/Sev, 12[ ]=Clotting, 13[ ]=Sepsis, 14[ ]=GI/GB, 15[ ]=DIC
40: (Reserved for future use)
41: (Reserved for future use)
42: DAYS IN ICU: Enter number of days
43: DAYS SURG/DISCH: Enter number of days
44: PARSONNET RISK: Calculated & entered by the system
45: (Reserved for future use)
46: TRANSFER TO NEW ENTRY: Entered by the system
47: DISCHG/30 DAY STATUS: 0=UNK, 1=ALIVE, 2=DIED IN OR, 3=DIED IN HOSP/30D, 4=REOP, 5=DIED LATE CARD, 6=UNREL DEATH, 9=LOST TO FU

DENDRITE SYSTEMS, INC.
MCR Version 2, January 01, 1990

Appendix B
Expanded Definitions for Merged Cardiac Registry

MERGED CARDIAC REGISTRY
Expanded Definitions for Version 2

1. DEMOG N: (There is no user correspondence file set up for this question.) The program will use your Demographic number. It is the only patient identification that is sent to Dendrite. At Dendrite an offset will be added which is group specific. The offsets will not be published.

2. PRIOR ENTRY REGISTRY MCR: (There is no user correspondence file set up for this question.) This information is entered at the time of transfer. It follows reoperations for both valves and bypasses.

3. DATE OF SURGERY: (There is no user correspondence file set up for this question.) This is the date for this procedure. If a patient is in more than one Source Registry on the same date, the entries will be merged. If the patient has two entries in the same Source Registry on the same date, they will also be merged.

4. AGE: This is the patient's age in years at the time of operation. If you don't have this question in your Source Registry(ies), you should consider adding it. For a minimal fee, we can give you the ability to calculate a default answer for age if registry question "DATE OF SURGERY" and demographic question "DATE OF BIRTH" have been entered.
Please call Dendrite for information on this feature.

5. SEX: (There is no user correspondence file set up for this question.) The program gets this information from your demographic file automatically.

6. Reserved for future use.

7. PRIOR MI:
1=No (If no clinical MI.)
2=Yes (One or more clinical MIs.)
Do not include silent MIs diagnosed only on angiography.

8. MOST RECENT MI: Select the answer that reflects the interval from the most recent MI to this operation. This could be important as a risk factor. If you don't have this question in your Source Registry(ies), you should consider adding it.

9. OTHER DISEASES:
0[ ]=Other (Use "Other" to record a disease not listed in the answers but that you feel is significant.)
1[ ]=Obesity (1.5x expected body weight.)
2[ ]=COPD (Patient with distinct limitations revealed at time of study or on treatment - bronchodilators, etc.)
3[ ]=Diab (Patient on oral meds or insulin.)
4[ ]=Chol >200 (Patients from 200 - 299.)
5[ ]=Chol >300 (Patients above 300.)
6[ ]=Renal (Patients with creatinines above 2.5 not on dialysis.)
7[ ]=Hypertension (History of treatment.)
8[ ]=ETOH (Patients who have undergone treatment or come in intoxicated.)
9[ ]=Drug Abuse (History or current use of cocaine, heroin, etc.)
10[ ]=Marfans (Patient with diagnosis or you diagnose.)
11[ ]=HIV+ (Positive test for AIDS. Not clinical disease.)
12[ ]=AIDS (Clinical disease.)
13[ ]=CA (History of malignant disease - cured or not.)
14[ ]=Blood (History of anemia not related to blood loss; e.g., sickle cell. Also, leukemia or lymphoma even if in remission.)
15[ ]=Liver (History of hepatitis, cholangitis, but not gall bladder disease.)
16[ ]=CNS (History of brain abscess, encephalitis, or clinical dementia.)
17[ ]=Prior CVA (History of stroke with or without residual.)
18[ ]=RheumHD (History of Rheumatic Heart Disease.)
19[ ]=Pulm Htn (PA pressures >60mmHg systolic.)
20[ ]=Chronic Dialysis (Not successful transplants.)

10. SMOKING NOW: Smoking now is within ten (10) days or at the time of catheterization. Consider answer 2 to mean mild and answer 3 to mean heavy. Do not count pipe smoking or chewing tobacco.

11. PRIOR CARDIAC SURG: Use "Other" for tumors, stab wounds, Vinebergs, etc. We have added answer 5[ ]=Pacemaker.

12. LV DYSFUNCTION: Select the answer that reflects the estimate from non-planimetry or echo, gated, etc.

13. LVEF: This is the actual left ventricular ejection fraction. We only consider planimetry by angiography a valid means to answer this question. For other means (gated blood pool, echo), use question #12.

14. CAD >70%: Since it is possible that the answer for this question could come from multiple Source Questions, a no answer will be considered the same as none.
1[ ]=No (None/no coronary disease.)
2[ ]=LAD
3[ ]=Cx (Includes the large OM as well if >70%.)
4[ ]=RCA (Includes PDA.)
5[ ]=Branch (Includes intermediate, large diagonal but does not define which system.)
6[ ]=L. Main
7[ ]=1 Vessel Disease
8[ ]=2 Vessel Disease
9[ ]=3 Vessel Disease

15. OTHER CARDIAC PATHOLOGY:
0[ ]=Other (For dissections of the aorta, tumors of the heart.)
1[ ]=Ao St (Aortic stenosis with a gradient >60mmHg or valve area <.8CM.)
2[ ]=Ao Insuf (Aortic insufficiency moderate or great.)
3[ ]=Mitr St (Mitral stenosis with a gradient >60mmHg.)
4[ ]=Mitr Insuf (Significant mitral leak with V-waves.)
5[ ]=Tricusp (Either stenosis, leak, or both.)
6[ ]=Pulm (Valve stenosis.)
7[ ]=Cong (Any diagnosis of congenital heart disease.)
8[ ]=Acq VSD (VSD post MI or surgery.)
9[ ]=LV Aneur (Localized paradoxical segment.)
10[ ]=Ao Aneur (Ascending, arch, or descending aneurysm.)
11[ ]=Asc Diss (Ascending dissection of the aorta.)
12[ ]=Dsc Diss (Descending thoracic aortic dissection.)

16. DATE MOST RECENT PTCA: Enter date. This new format will allow date arithmetic later.
If you have data in the old format, we can help you transform it.

17. PTCA RESULT: Enter the initial 5-day result judged by the surgeon. A complication would include MI, MI in progress, perforation, etc., within this 5 day period.

18. NUMBER OF VESSELS PTCA'd: Enter the number of vessels dilated prior to this operation. (A triple PTCA would count as 3.)

19. REASON FOR OP:
0[ ]=Other (Use "Other" to record an answer not listed, but that you feel is significant.)
1[ ]=Ang (Angina uncontrollable with meds.)
2[ ]=CHF (Congestive heart failure - low output state.)
3[ ]=Arrh (Arrhythmia.)
4[ ]=Anat (Anatomy; left main, etc. in otherwise stable patient.)
5[ ]=Failed PTCA (PTCA that was performed within five (5) days if you are treating the same vessel.)
6[ ]=Tumor
7[ ]=Endocarditis (Patient has had positive cultures.)
8[ ]=Trauma
9[ ]=Ao Dissection
10[ ]=Ao Aneurysm

20. PREOP STATUS:
1=Elect (Elective scheduled case.)
2=Urgent (Case moved up on schedule.)
3=Emerg (Emergency case - do ASAP.)
4=Desperate (Case that has arrested, is very near death, or in severe low output.)

21. HEMODYNAMIC STAT:
1=Stbl (Stable patient.)
2=Stbl on meds (CI >2 on meds or IABP.)
3=Unstbl on meds (CI <2 on meds or IABP.)
4=Cardiogenic shock on meds/IABP (CI <2 and falling.)

22. OXYGENATOR:
0=Other (Use "Other" to record an answer that is not listed, but that you feel is significant.)
1=H1500 (Harvey bubbler includes H1300.)
2=Shiley (Shiley bubbler.)
3=TMO (Travenol membrane.)
4=CMII (Cobe membrane.)
5=Sci25 (SciMed SM25.)
6=Sci35 (SciMed SM35.)
7=Maxima (J&J (Medtronic) membrane.)
8=BCM7 (Bentley membrane.)
9=Sarns (Membrane.)
10=Terumo (Membrane.)
11=SciUltra (SciMed Ultrox I.)

23. OTHER OPERATIVE DEVICES:
0[ ]=Other (Use "Other" to record an answer that is not listed, but that you feel is significant.)
Answer 1 has been deleted.
2[ ]=Cell Saver (Any brand.)
3[ ]=HemoConcen (Ultra filtration device to remove H2O.)
4[ ]=Dial Filter (Renal dialysis filter in circuit.)
5[ ]=IABP (Intra or post op.)
6[ ]=BioMed Pump (Biomedicus pump rather than roller pump.)
7[ ]=LHAD (Any long term use of left heart assist device post bypass.)
8[ ]=RHAD (Any long term use of right heart assist device post bypass.)
9[ ]=Art Filter (Any filter in the arterial line.)
10[ ]=Plasma Phor (For the use of plasma phoresis for platelet rich plasma.)
11[ ]=MyoTmpProb (For the use of myocardial temps where monitored.)
12[ ]=Cooling Pad (If a cooling pad is placed under or on the heart during crossclamp.)
13[ ]=Delphin Pump (Sarns centrifugal pump rather than roller pump.)

24. THROMBOLYTIC Rx:
1[ ]=tPA (tPA used within 24 hours.)
2[ ]=Strepto (Streptokinase used within 36 hours.)
3[ ]=Urokin (Urokinase infused.)
4[ ]=IntraCor (Intracoronary infusion used.)
5[ ]=IntraVein (Intravenous.)

25. CARDIOPLEGIA: This question has been changed to a type 7.
1[ ]=None (None or just slush.)
2[ ]=Cryst (For cold +/- high K+.)
3[ ]=Blood (For cardioplegia solutions containing blood.)
4[ ]=Retro-cor sinus (Any use of retrograde perfusion.)
5[ ]=Intermit Clamp (Can be combined with any of the above answers.)

26. X-CLAMP TIME: Enter your answer in minutes.

27. BODY SURFACE AREA: This is a new question. Enter your answer in square meters (e.g., 2.3).

28. NO CABGs: Enter the total number of distal coronary anastomoses for this procedure.

29. VALVES REPLACED: Enter those valves replaced with a prosthesis. If this information is in your subprocedure section, you may be required to set up one or more secondary questions to create the criteria for transfer.

30. REPAIRS: This includes debridement, commissurotomy, partial resection. May be combined with questions 31-32 if the repair fails or needs supplement.

31. AORTIC PROSTHESIS: Enter the type of prosthesis used.

32. MITRAL PROSTHESIS: Enter the type of prosthesis used.

33. Reserved for future use.

34. Reserved for future use.

35. BLOOD PRODUCTS: Enter any use of the blood products listed on the MCR.

36. DONOR TRANSFUSIONS: Enter the number of units of bank blood/packed cells on this admission.

37. AUTOLOGOUS TRANSFUSIONS: Enter the number of units of blood drawn 5 to 30 days preop for elective use at surgery. Do not enter blood withdrawn at the time of surgery or plasma phoresis.

38. Reserved for future use.

39. COMPLICATIONS:
0[ ]=Other (Use "Other" to record an answer not listed, but that you feel is significant.)
1[ ]=Reop/Bleed (Reoperation for bleeding, suspected tamponade.)
2[ ]=Renal/Mild (Mild renal shutdown not requiring dialysis.)
3[ ]=Renal/Sev (Severe renal shutdown requiring dialysis.)
4[ ]=Wound/Sev (Dehiscence or infection - for sternal wounds only.)
5[ ]=Neuro/Mild (Peripheral nerve, brachial plexus, confusion or CNS defect that clears before discharge.)
6[ ]=Neuro/Sev (CNS defect that does not clear in 7 days.)
7[ ]=Pulm/Mild (Pneumothorax, hemothorax, atelectasis, air leak.)
8[ ]=Pulm/Sev (Prolonged respiratory support, ARDS or pneumonia requiring antibiotics.)
9[ ]=MI (Intra- or post-op MI by EKG or enzymes.)
10[ ]=Low Output/Mild (Low output syndrome postop requiring drugs for a short time.)
11[ ]=Low Output/Sev (Severe low output syndrome postop with prolonged use of drugs or IABP.)
12[ ]=Clotting (Prolonged bleeding problems, low platelets, etc.)
13[ ]=Sepsis (Septicemia, pneumonia, wound infection, etc.)
14[ ]=GI/GB (GI bleed, perforated ulcer, cholecystitis, hepatitis, etc.)
15[ ]=DIC (Diffuse intravascular coagulation.)

40. Reserved for future use.

41. Reserved for future use.

42. DAYS IN ICU: Enter the number of days rounded off (e.g., 27 hrs. = 1 day, 30 hrs. = 2 days).

43. DAYS SURG/DISCH: Enter the number of days from surgery to discharge.

44. PARSONNET RISK: (There is no user correspondence file set up for this question.) Use the "R" option to get risk calculations once you have transferred your data to the MCR.

45. Reserved for future use.

46. TRANSFER TO NEW ENTRY: (There is no user correspondence file set up for this question.) This is a system-generated question to follow re-entries in your registry(ies).

47. DISCH/30D STATUS:
0=UNK (This is a historical answer in use before the addition of follow-up.) DO NOT USE THIS ANSWER WHEN CREATING YOUR CORRESPONDENCE FILE.
1=ALIVE
2=DIED IN OR (Died in the operating room.)
3=DIED IN HOSP/30D (Died in or out of hospital within 30 days of surgery.)
4=REOP (Your reoperations only.)
5=DIED LATE CARDIAC (Died after 30-day interval, cardiac-related.)
6=UNRELATED DEATH (Died after 30-day interval, non-cardiac-related.)
9=LOST TO FU (Patient who can no longer be followed because the patient cannot be located.)
The only follow-up transferred at this time is survival status. This is moved automatically if your Source Registry(ies) have follow-up.
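The answer codes defined in this appendix lend themselves to simple recoding before modelling. The following is a minimal sketch of how question 47 could be collapsed into a binary 30-day operative mortality indicator, and how the check-box answers to question 11 could place a patient in one of the prior-surgery subpopulations suggested in the discussion. The function names, the group labels, and the choice to leave codes 0, 4 and 9 undetermined are assumptions made for illustration; the recoding actually used in the thesis is not specified in this appendix.

```python
# Question 47 (DISCH/30D STATUS) codes, as defined in the registry.
DIED_WITHIN_30_DAYS = {2, 3}  # 2=DIED IN OR, 3=DIED IN HOSP/30D
SURVIVED_30_DAYS = {1, 5, 6}  # 1=ALIVE; 5 and 6 are deaths after the 30-day interval

def operative_mortality(status_code):
    """Return 1 for death within 30 days, 0 for 30-day survival, None otherwise.

    Assumption: 0=UNK, 4=REOP and 9=LOST TO FU are left undetermined,
    because the definitions do not say how they relate to 30-day survival.
    """
    if status_code in DIED_WITHIN_30_DAYS:
        return 1
    if status_code in SURVIVED_30_DAYS:
        return 0
    return None

# Question 11 (PRIOR CARDIAC SURG) check-box codes used below.
PRIOR_CABG = 2
PRIOR_VALVE = 3

def prior_surgery_group(checked_answers):
    """Classify a patient by the set of boxes checked for question 11.

    The labels "bypass", "valve", "both" and "neither" are hypothetical names
    for subpopulations (a), (b), (c) and all remaining patients.
    """
    has_cabg = PRIOR_CABG in checked_answers
    has_valve = PRIOR_VALVE in checked_answers
    if has_cabg and has_valve:
        return "both"
    if has_cabg:
        return "bypass"
    if has_valve:
        return "valve"
    return "neither"

# Made-up example records: (question 47 code, boxes checked for question 11).
records = [(1, {1}), (3, {2}), (2, {2, 3}), (9, {3})]
for status, surgeries in records:
    print(operative_mortality(status), prior_surgery_group(surgeries))
```

Returning None rather than 0 for undetermined codes keeps those patients out of both the numerator and the denominator unless the analyst decides otherwise.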
Item Metadata
Title | Risk prediction models for binary response variables for the coronary bypass operation |
Creator | Zhang, Hongbin |
Date | 1993 |
Date Issued | 2008-08-28T22:27:17Z |
Extent | 3265612 bytes |
Genre | Thesis/Dissertation |
Type | Text |
File Format | application/pdf |
Language | eng |
Collection | Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2008-08-28 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0086332 |
URI | http://hdl.handle.net/2429/1571 |
Degree | Master of Science - MSc |
Program | Statistics |
Affiliation | Science, Faculty of; Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1993-11 |
Campus | UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |