RISK PREDICTION MODELS FOR BINARY RESPONSE VARIABLES
FOR THE CORONARY BYPASS OPERATION

by

HONGBIN ZHANG

M.Eng., Institute of Software, Academia Sinica, 1987

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE STUDIES
Department of Statistics

We accept this thesis as conforming
to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
June 1993
© Hongbin Zhang, 1993

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

(Signature)

Department of Statistics
The University of British Columbia
Vancouver, Canada

Abstract

The ability to predict 30 day operative mortality and complications following coronary artery bypass surgery in the individual patient has important implications clinically and for the design of clinical trials. This thesis focuses on setting up risk stratification algorithms.

Because the response variables are binary, logistic regression analyses and classification trees (recursive partitioning) were applied to the variables identified by the Health Data Research Institute in Portland, Oregon. The data set contains records for 18171 patients who had coronary artery bypass surgery in one of several hospitals between 1968 and 1991.
Statistical models are set up, one from each method, for six outcome variables of the surgery: 30 day operative mortality, renal shutdown complication, central nervous system complication, pneumothorax complication, myocardial infarction complication and low output syndrome.

The risk groups vary across different outcomes. The history of cardiac surgery has a strong association with operative mortality, and patients who suffer from a central nervous system disease tend to have higher risks for all the outcomes. Further study is necessary to consider the differences among hospitals and to divide the population according to the type of previous cardiac surgery.

Contents

Abstract
Acknowledgements
1 Introduction
2 Coronary Bypass Surgery and Data Source
    2.1 Coronary Bypass Surgery
    2.2 Source of Data
3 Initial Data Analysis
    3.1 Summary Statistics for the Data
    3.2 Odds Ratio Analysis
4 Logistic Regression Analysis
    4.1 Univariate Analysis and Comparison of Models
    4.2 Stepwise Logistic Regression
    4.3 Best Subsets Selection
    4.4 Goodness-of-fit: Hosmer-Lemeshow Grouping Test
5 The Tree-based Model
    5.1 Recursive Partitioning: Growing a Classification Tree
    5.2 Getting the Right Size Tree: Pruning the Classification Tree
    5.3 Applications and Results
6 Discussion and Conclusion
Bibliography
A Merged Cardiac Registry
B Expanded Definitions for Merged Cardiac Registry

List of Tables

1 Code Sheet for the MCR Data
2 Code Sheet for the MCR Data (continued)
3 Code Sheet for the MCR Data (continued)
4 Tabular summary for categorical variables
5 Tabular summary for categorical variables (continued)
6 Tabular summary for categorical variables (continued)
7 Odds ratio for STA
8 Odds ratio for REMS
9 Odds ratio for NEMS
10 Odds ratio for PUMS
11 Odds ratio for MI
12 Odds ratio for LOMS
13 Binary risk factors identified from odds ratio
14 Two-way table for PMI and LVD
15 Stepwise regression procedure: a demonstration
16 Results of stepwise regression procedure for each outcome
17 Models obtained from best subsets procedure for each outcome
18 Hosmer-Lemeshow grouping test for selected models
19 Final logistic regression model for STA
20 Final logistic regression model for REMS
21 Final logistic regression model for NEMS
22 Final logistic regression model for PUMS
23 Final logistic regression model for MI
24 Final logistic regression model for LOMS
25 List of risk factors and prediction range for the different methods and different outcomes

List of Figures

1 Boxplots for continuous variables AGE and BSA
2 Binary tree: an example
3 Original tree for STA
4 Plots of deviance versus size for sequences of subtrees. (a): sequence obtained from sample data; (b): sequence evaluated on test data
5 Tree model for STA
6 Tree model for REMS
7 Tree model for NEMS
8 Tree model for PUMS
9 Tree model for MI
10 Tree model for LOMS

Acknowledgements

I would like to express my appreciation to my supervisor, Dr. Harry Joe, for his guidance, suggestions and assistance in producing this thesis. I would also like to thank Dr. Stan Page for his critical reading and helpful comments. I am indebted to the Health Data Research Institute for some financial support and for providing the data, and to Scott Page, M.D., for advice on medical aspects of this study. The support of the UBC Department of Statistics is also gratefully acknowledged.

Finally, I thank my wife, Hua, for her continuous encouragement and support. Without her belief in me, it might have taken me longer to get here.

Chapter 1
Introduction

Since the beginning of coronary bypass surgery at the Good Samaritan Hospital and Medical Center, Portland, Oregon, in 1968, patient data have been recorded in order to manage the care of the patients.
This management will be measured by outcome analysis of the 30 day operative mortality and complications arising from the surgery. Although of secondary concern, some complications directly affect the quality of life of patients; for example, renal shutdown may require dialysis treatment. From a clinical point of view, a patient with low risk of mortality or complications could be spared the discomfort and expense of unnecessary treatment in a coronary care unit; based on a prognostic assessment, such patients could be placed into an intermediate care unit or a general ward, or be discharged early from the hospital.

However, for these benefits to be realized, it is necessary to establish some risk stratification algorithms. Two approaches have been devised for surgery patients, depending on the strategy used for derivation. In one strategy, a panel of experts identifies and assigns weights to clinical variables believed to be associated with outcomes of interest. The second strategy uses statistical modeling to relate empirical data for many patient variables to outcomes of interest.

In this study, the odds ratio gives a simple summary for the binary risk factors; the logistic regression analysis and the tree-based methods are used to set up the stratification algorithms. While doing the analysis, answers to the following questions are sought:

1). What risk groups are most important in predicting the outcomes?
2). How does each stratification algorithm perform in predicting?

The clinical background of the coronary bypass procedure and the data description are presented in Chapter 2, while the initial data analysis is performed in Chapter 3. In Chapter 4 and Chapter 5 the statistical methodologies and results of analyses are described.
Conclusions and suggestions are given in Chapter 6.

Chapter 2
Coronary Bypass Surgery and Data Source

2.1 Coronary Bypass Surgery

Coronary heart disease is the comprehensive term which includes all of the clinical manifestations that result from atherosclerotic narrowing or occlusion of the arteries which supply the heart muscle. In several countries of the industrialized world, it is the major cause of death in both men and women. Despite accumulating knowledge about the epidemiology and pathology of disease of the heart and coronary arteries, there is not, as yet, a way of intervening to definitely arrest the natural progression of atherosclerosis in order to prevent or cure this condition. Compared with medical treatment, the surgical technique is seemingly more radical for coronary artery disease. In 1963, the first coronary artery bypass surgery was performed in the United Kingdom. Nowadays, it is the most common form of elective surgery.

The surgical technique of coronary grafting involves opening the chest wall and temporarily stopping the heart while circulation is maintained with a heart-lung machine. A vein is removed to be used as the graft material. Each obstructed section of artery is then bypassed by attaching one end of the vein to the aorta carrying blood for the heart, and the other end to the artery beyond the stenosis or occlusion. The heart is restarted, the chest wall closed, and the operation completed.

Coronary artery surgery has been described as relieving or very much reducing angina in over 90% of patients. Bad results are operative death and complications arising from the surgery, although the mortality rate is believed to be decreasing with increasing surgical experience and skill.

2.2 Source of Data

MCR (Merged, Multi-Center, Multi-Specialty Clinical Registries) is an international database system developed by Health Data Research Institute (formerly Dendrite Systems, Inc.) in which information on patients who had heart related surgery is recorded. This database system is used by several hospitals, and the contributors (patients, sometimes the doctors) are encouraged to enter information. In doing that, MCR uses a long systematic set of questions to elicit information both prior to operation and after operation (see Appendix). The data set analyzed here consists of 18171 patients from the MCR who had coronary bypass surgery between 1968 and 1991.

The pre-operation information includes date of operation, patient's age, gender, prior myocardial infarction, existence or non-existence of other diseases, body surface area, etc. The post-operation information includes the patient's status during or after the bypass surgery; for example, complications, such as renal or neurological problems, and survival status to 30 days following surgery.

The variables of primary interest in our analysis are those outcome variables indicating the patient's complications and survival status after the surgery. All variables studied and their abbreviations are listed in Table 1 to Table 3.

Table 1: Code Sheet for the MCR Data

Variable  Name                                         Codes/Values    Abbreviation
1         Age                                          Years           AGE
2         Sex                                          0=Male 1=Female SEX
3         Prior Myocardial Infarction                  0=No 1=Yes      PMI
          (Variables 4-23 pertain to other diseases)
4         Obesity                                      0=No 1=Yes      OBE
5         Chronic Obstructive Pulmonary Disease        0=No 1=Yes      COP
6         Diabetes                                     0=No 1=Yes      DIA
7         Cholesterol Level > 200                      0=No 1=Yes      CH2
8         Cholesterol Level > 300                      0=No 1=Yes      CH3
9         Renal Disease                                0=No 1=Yes      REN
10        Hypertension                                 0=No 1=Yes      HTN
11        Alcohol Abuse                                0=No 1=Yes      ETO
12        Drug Abuse                                   0=No 1=Yes      DRU
13        Marfan's Syndrome (a skeletal abnormality)   0=No 1=Yes      MAR
14        HIV+                                         0=No 1=Yes      HIV
15        AIDS                                         0=No 1=Yes      AID
16        Cancer                                       0=No 1=Yes      CA
17        Anemia                                       0=No 1=Yes      ANE
18        Liver Disease                                0=No 1=Yes      LIV
19        Central Nervous System Disease               0=No 1=Yes      CNS
20        Prior Cerebrovascular Accident               0=No 1=Yes      PCA

Table 2: Code Sheet for the MCR Data (continued)

Variable  Name                                         Codes/Values    Abbreviation
21        Rheumatic Heart Disease                      0=No 1=Yes      RHE
22        Pulmonary Hypertension                       0=No 1=Yes      PUL
23        Chronic Dialysis                             0=No 1=Yes      CHR
          (Variables 24-29 pertain to type of prior cardiac surgery)
24        Other Surgery                                0=No 1=Yes      OTH
25        No Surgery                                   0=No 1=Yes      NON
26        Coronary Bypass Graft                        0=No 1=Yes      CAB
27        Valve Replacement                            0=No 1=Yes      VAL
28        Congenital                                   0=No 1=Yes      CON
29        Pacemaker                                    0=No 1=Yes      PAC
30        Left Ventricular Dysfunction                 0=Normal        LVD
                                                       1=40-49%
                                                       2=30-39%
                                                       3=20-29%
                                                       4=<20%
31        Prior Operation Status                       1=Elective      POS
                                                       2=Urgent
                                                       3=Emergency
                                                       4=Desperate
32        Body Surface Area                            Square meters   BSA

Table 3: Code Sheet for the MCR Data (continued)

Variable  Name                                         Codes/Values           Abbreviation
          (Variables 33-47 pertain to the complications after surgery)
33        Reoperation for Bleeding                     0=No 1=Yes             REP
34        Renal Shutdown (Mild)                        0=No 1=Yes             REM
35        Renal Shutdown (Severe)                      0=No 1=Yes             RES
36        Wound (Severe)                               0=No 1=Yes             WOU
37        Neurological (Mild)                          0=No 1=Yes             NEM
38        Neurological (Severe)                        0=No 1=Yes             NES
39        Pulmonary (Mild)                             0=No 1=Yes             PUM
40        Pulmonary (Severe)                           0=No 1=Yes             PUS
41        Myocardial Infarction                        0=No 1=Yes             MI
42        Low Output (Mild)                            0=No 1=Yes             LOM
43        Low Output (Severe)                          0=No 1=Yes             LOS
44        Clotting                                     0=No 1=Yes             CLO
45        Sepsis                                       0=No 1=Yes             SEP
46        Gastrointestinal Bleeding                    0=No 1=Yes             GIB
47        Diffuse Intravascular Coagulation            0=No 1=Yes             DIC
48        Discharge/30 Day Status                      1=Live                 STA
                                                       2=Died in OR
                                                       3=Died in Hosp/30D
                                                       4=Reop
                                                       5=Died Late Cardiac
                                                       6=Unrelated Death
                                                       9=Lost to Follow-up

Chapter 3
Initial Data Analysis

3.1 Summary Statistics for the Data

Since the data are collected from several populations, missing values on several variables are inevitable. Of 18,171 MCR patients, 12,000 had complete data for the 32 variables selected. The missing data mostly occur in two variables: prior myocardial infarction (30%) and left ventricular dysfunction (29.3%).
These are the only two variables which measure previous damage to the heart, so they should not be excluded.

Twenty kinds of diseases/conditions are suspected to be related to the success of the surgery. Seven of them can be removed because of their low incidence: drug abuse, 0.3% (70 patients); Marfan's syndrome, 0.1% (20 patients); positive test for HIV (HIV+) and AIDS, 0.0% (0 patients); anemia, 0.0% (18 patients); pulmonary hypertension, 0.3% (59 patients); chronic dialysis, 0.0% (11 patients). The remaining 13 diseases are: OBE (obesity: 1.5x expected body weight), COP (patient with distinct limitations revealed at time of study or on treatment, such as bronchodilators), DIA (diabetes: patient on oral medicine or insulin), CH2 (patient with cholesterol blood level between 200 and 299), CH3 (patient with cholesterol blood level above 300), REN (renal failure: patients not on dialysis with creatinines above 2.5), HTN (hypertension), ETO (patients who have undergone treatment for alcohol abuse or come in intoxicated), CA (history of malignant disease, cured or not), LIV (history of hepatitis, cholangitis, but not gall bladder disease), CNS (history of brain abscess, encephalitis, or clinical dementia), PCA (history of stroke with or without residual), RHE (history of rheumatic heart disease).

Prior cardiac operation makes the surgery more difficult technically. Six categories are distinguished: OTH (other cardiac surgery), NON (no prior surgery), CAB (coronary bypass surgery), VAL (valve operation), CON (congenital surgery) and PAC (pacemaker operation). The incidences of the last two kinds of operation are low, 0.2% and 0.6% respectively, so these are removed as risk factors.

AGE and SEX are two important variables.
BSA (body surface area) is calculated from height and weight using a nomogram formula and can be an important factor.

Although the prior operation status is important, from a decision point of view, we do not include it this time.

Post operation variables (outcomes) include 11 complications and the 30 day status. Among these 11 complications, we only consider those which are clinically related with the 30 day status. Hence we remove the following complications: REP (reoperation for bleeding, suspected tamponade), WOU (wound: dehiscence or infection), CLO (clotting: prolonged bleeding problems, low platelets), SEP (septicemia, pneumonia, wound infection, etc.), GIB (gastrointestinal bleeding, perforated ulcer, cholecystitis) and DIC (diffuse intravascular coagulation). We study the remaining five complications: REMS (renal shutdown), NEMS (peripheral nerve, central nervous system defect), PUMS (pneumothorax, prolonged respiratory support), MI (intra- or post-operation myocardial infarction by EKG or enzymes) and LOMS (low output syndrome). These response variables were obtained by combining their two levels (mild and severe) into one so that the resulting variables are binary.

The response variable for the 30 day operative mortality was obtained by combining the cases with original statuses 2, 3 and 5 into "1".

In Table 4 to Table 6, summary statistics are given, as well as some special features such as missing values.

3.2 Odds Ratio Analysis

A natural way to represent the association of a binary risk factor and a binary outcome is the 2x2 contingency table, as follows:

2x2 Contingency Table

                       outcome 1   outcome 2   Sample size
with risk factor           a           b           n1
without risk factor        c           d           n2

We suppose that such a table has been generated by drawing two independent binomial samples of sizes $n_1$ and $n_2$, with probabilities for outcome 1 being $p_1$ and $p_2$ respectively.
For example, in our study, an outcome variable is the status after operation, and the samples correspond to the patients with and without a given risk factor.

The odds ratio in such a table is defined as

$$\Psi = \frac{p_1 (1 - p_2)}{p_2 (1 - p_1)}.$$

The odds ratio (as well as its logarithm) is widely used as a measure of association in 2x2 contingency tables due to its simple interpretability. For example, if the outcome is the presence or absence of lung cancer and the populations are smokers and non-smokers, then $\Psi = 2$ indicates that the odds of lung cancer among smokers are twice those among non-smokers in the study population. It has also been pointed out that the odds ratio forms a useful approximation to the relative risk in retrospective studies [Rothman, 1986]. The coefficients estimated in a logistic regression can also be interpreted as log-odds ratios (logarithm of odds ratio).

Table 4: Tabular summary for categorical variables

Variable  Heading  Code  Count   Frequency  Remark
V2        SEX      0     13084   72.0%      1 patient with no sex recorded
                   1     5086    28.0%
V3        PMI      0     7451    41.0%      5436 missing data (30.0%), coded -1
                   1     5284    29.0%
V4        OBE      0     16648   91.7%
                   1     1523    8.3%
V5        COP      0     15977   88.0%
                   1     2194    12.0%
V6        DIA      0     15276   84.1%
                   1     2895    15.9%
V7        CH2      0     16008   88.1%
                   1     2163    11.9%
V8        CH3      0     17373   95.7%
                   1     798     4.3%
V9        REN      0     17056   93.9%
                   1     1115    6.1%
V10       HTN      0     11175   61.5%
                   1     6996    38.5%
V11       ETO      0     17454   96.1%
                   1     717     3.9%
V12       DRU      0     18101   99.7%      ignored in future analysis
                   1     70      0.3%
V13       MAR      0     18151   99.9%      ignored in future analysis
                   1     20      0.1%
V14       HIV      0     18171   100.0%     ignored in future analysis
                   1     0       0.0%
V15       AID      0     18171   100.0%     ignored in future analysis
                   1     0       0.0%
V16       CA       0     17860   98.3%
                   1     311     1.7%
V17       ANE      0     18153   100.0%     ignored in future analysis
                   1     18      0.0%

Table 5: Tabular summary for categorical variables (continued)

Variable  Heading  Code  Count   Frequency  Remark
V18       LIV      0     17946   98.8%
                   1     225     1.2%
V19       CNS      0     17427   96.0%
                   1     744     4.0%
V20       PCA      0     17700   97.5%
                   1     471     2.5%
V21       RHE      0     17545   96.6%
                   1     626     3.4%
V22       PUL      0     18112   99.7%      ignored in future analysis
                   1     59      0.3%
V23       CHR      0     18160   100.0%     ignored in future analysis
                   1     11      0.0%
V24       OTH      0     17649   97.2%
                   1     522     2.8%
V25       NON      0     15418   84.8%      ignored in future analysis
                   1     2753    15.2%
V26       CAB      0     16525   91.0%
                   1     1646    9.0%
V27       VAL      0     17733   97.6%
                   1     438     2.4%
V28       CON      0     18140   99.8%      ignored in future analysis
                   1     31      0.2%
V29       PAC      0     18055   99.4%      ignored in future analysis
                   1     116     0.6%
V30       LVD      0     8972    49.3%      5320 (29.3%) missing data, coded -1
                   1     1854    10.2%
                   2     1524    8.3%
                   3     306     1.7%
                   4     195     1.0%
V31       POS      1     14137   77.8%      433 (2.4%) missing data; ignored in future analysis
                   2     1935    10.6%
                   3     1501    8.3%
                   4     165     0.9%
V33       REP      0     17471   96.2%
                   1     700     3.8%
V34       REM      0     17623   97.0%      combined with RES in future analysis to form a new variable REMS
                   1     548     3.0%

Table 6: Tabular summary for categorical variables (continued)

Variable  Heading  Code  Count   Frequency  Remark
V35       RES      0     17881   98.4%
                   1     290     1.6%
V36       WOU      0     17964   98.9%
                   1     207     1.1%
V37       NEM      0     16803   92.5%      combined with NES in future analysis to form a new variable NEMS
                   1     1368    7.5%
V38       NES      0     17768   97.8%
                   1     403     2.2%
V39       PUM      0     10173   56.0%      combined with PUS in future analysis to form a new variable PUMS
                   1     7998    44.0%
V40       PUS      0     15643   86.1%
                   1     2528    13.9%
V41       MI       0     17194   94.7%
                   1     977     5.3%
V42       LOM      0     17389   95.7%      combined with LOS in future analysis to form a new variable LOMS
                   1     782     4.3%
V43       LOS      0     17321   95.4%
                   1     850     4.6%
V44       CLO      0     17803   98.0%      ignored in future analysis
                   1     368     2.0%
V45       SEP      0     17904   98.6%      ignored in future analysis
                   1     267     1.4%
V46       GIB      0     17415   95.8%      ignored in future analysis
                   1     756     4.2%
V47       DIC      0     18117   99.7%      ignored in future analysis
                   1     54      0.3%
V48       STA      0     10      0.0%       612 (3.4%) missing data;
                   1     15513   85.4%      only the cases 1, 2, 3 and 5
                   2     519     2.9%       are considered. That is:
                   3     234     1.3%       15513 (95.1%) alive, coded 0
                   4     465     2.6%       794 (4.9%) died, coded 1
                   5     41      0.2%
                   6     316     1.7%
                   9     10      0.0%

There are several estimates of the odds ratio [Walter, 1987], but the most common one is the maximum likelihood estimate (MLE)

$$\hat{\Psi}_{MLE} = \frac{ad}{bc}.$$

The derivation of this is simple: for a binomially distributed random variable with parameters $p$ and $n$, the MLE of $p$ is $a/n$, where $n$ is the sample size and $a$ is the number of "successes". By the invariance principle, the MLE of the odds $p_1/(1-p_1)$ is $a/b$. Similarly, the MLE of $p_2/(1-p_2)$ is $c/d$. Hence the above estimate is obtained.

The estimate of the odds ratio is more useful when reported as an interval estimate or confidence interval (CI). A brief review of various methods for CI construction is given by Fleiss (1979). We use the result derived by Bishop et al. (1975), in which it is proved that $\log\hat{\Psi}$ is asymptotically normal with mean $\log\Psi$ and variance $(n_1 p_1 (1 - p_1))^{-1} + (n_2 p_2 (1 - p_2))^{-1}$. This result follows from an application of the delta method. An estimated variance of $\log\hat{\Psi}$ is $1/a + 1/b + 1/c + 1/d = SE(\log\hat{\Psi})^2$. So, a $100(1-\alpha)\%$ CI of $\Psi$ is

$$\exp\{\log\hat{\Psi} \pm z_{1-\alpha/2}\, SE(\log\hat{\Psi})\}$$

where $z_{\beta}$ is the $\beta$ quantile of the standard normal distribution.

In Table 7 to Table 12, the odds ratios for each outcome variable are given.

Statistically, only those variables whose 95% CI does not contain 1 are regarded as significantly related with the outcome. In Table 13, the risk factors identified by odds ratio are listed.

If the estimated odds ratio is larger than 1, we say the variable is positively related with the outcome; otherwise, we say it is negatively related. Nearly all binary explanatory variables (including the groups of other diseases and prior cardiac surgeries) are positively related with the 30 day mortality and complications, except CH2 and CH3. Variables CH2 and CH3 measure high cholesterol blood levels. High cholesterol may lead to increased risk of getting vascular disease in many organs. MI and LOMS are vascular related complications.
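As a small numerical illustration of the odds ratio estimate and its confidence interval discussed in this section, the following Python sketch computes the MLE $ad/(bc)$, the estimated standard error of its logarithm, and the resulting interval from the four cell counts of a 2x2 table. The function name `odds_ratio_ci` and the counts in the usage lines are hypothetical and are not taken from the MCR data; the normal quantile comes from Python's standard library.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, alpha=0.05):
    """Odds ratio MLE and large-sample CI from 2x2 cell counts.

    a, b: counts of outcome 1 / outcome 2 with the risk factor
    c, d: counts of outcome 1 / outcome 2 without the risk factor
    All four counts are assumed to be positive.
    """
    psi_hat = (a * d) / (b * c)                    # MLE ad/(bc)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # estimated SE of log(psi_hat)
    z = NormalDist().inv_cdf(1 - alpha / 2)        # standard normal quantile
    lower = math.exp(math.log(psi_hat) - z * se_log)
    upper = math.exp(math.log(psi_hat) + z * se_log)
    return psi_hat, se_log, lower, upper


# Hypothetical counts, for illustration only (not MCR data).
psi_hat, se_log, lower, upper = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {psi_hat:.2f}, SE(log OR) = {se_log:.3f}, "
      f"95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the interval is built on the log scale and then exponentiated, it is symmetric around the estimate in ratio terms rather than in absolute terms.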
Unfortunately, CH2 and CH3 have a negative association with these complications, so we have some doubt whether the measurement of cholesterol blood level is correct and, consequently, we did not include CH2 and CH3 in our model building procedure. From a medical point of view, cholesterol blood level may be a surrogate for nutritional level.

Table 7: Odds ratio for STA

Variable  Heading  $\hat{\Psi}$  SE($\log\hat{\Psi}$)  95% CI of $\Psi$  Remark
V2        SEX      1.17          0.24                  (0.73, 1.90)
V4        OBE      0.38          0.72                  (0.09, 1.58)
V5        COP      1.00          0.38                  (0.47, 2.12)
V6        DIA      0.31          0.72                  (0.63, 2.16)
V7        CH2      0.40          0.42                  (0.17, 0.94)      negatively related
V8        CH3      1.25          0.41                  (0.56, 2.80)
V9        REN      1.95          0.38                  (0.91, 4.19)
V10       HTN      1.32          0.23                  (0.84, 2.08)
V11       ETO      1.18          0.60                  (0.35, 3.86)
V16       CA       1.61          0.74                  (0.37, 6.95)
V19       CNS      3.05          0.45                  (1.24, 7.44)      positively related
V20       PCA      1.98          0.53                  (0.68, 5.69)
V21       RHE      1.26          0.61                  (0.38, 4.14)
V24       OTH      2.35          0.39                  (1.09, 5.10)      positively related
V26       CAB      3.53          0.27                  (2.07, 6.05)      positively related
V27       VAL      4.23          0.56                  (1.40, 12.8)      positively related

Table 8: Odds ratio for REMS

Variable  Heading  $\hat{\Psi}$  SE($\log\hat{\Psi}$)  95% CI of $\Psi$  Remark
V2        SEX      1.48          0.21                  (0.97, 2.26)
V4        OBE      1.13          0.37                  (0.54, 2.37)
V5        COP      1.86          0.26                  (1.11, 3.11)      positively related
V6        DIA      1.80          0.24                  (1.11, 2.91)      positively related
V7        CH2      0.67          0.37                  (0.32, 1.40)
V8        CH3      0.88          0.46                  (0.35, 2.20)
V9        REN      9.78          0.22                  (6.30, 15.2)      positively related
V10       HTN      1.75          0.20                  (1.18, 2.62)      positively related
V11       ETO      1.34          0.47                  (0.53, 3.95)
V18       LIV      3.58          0.62                  (1.04, 12.3)      positively related
V19       CNS      3.40          0.33                  (1.75, 6.61)      positively related
V20       PCA      1.33          0.60                  (0.40, 4.34)
V21       RHE      0.88          0.59                  (0.27, 2.85)
V24       OTH      0.90          0.52                  (0.32, 2.50)
V26       CAB      0.98          0.35                  (0.49, 1.98)
V27       VAL      0.64          1.01                  (0.08, 4.76)

Table 9: Odds ratio for NEMS

Variable  Heading  $\hat{\Psi}$  SE($\log\hat{\Psi}$)  95% CI of $\Psi$  Remark
V2        SEX      1.34          0.15                  (0.99, 1.81)
V4        OBE      2.11          0.21                  (1.37, 3.25)      positively related
V5        COP      1.22          0.21                  (0.80, 1.84)
V6        DIA      1.38          0.18                  (0.96, 1.98)
V7        CH2      0.69          0.25                  (0.42, 1.14)
V8        CH3      0.54          0.39                  (0.25, 1.17)
V9        REN      1.62          0.23                  (1.02, 2.57)      positively related
V10       HTN      1.49          0.14                  (1.12, 1.97)      positively related
V11       ETO      1.40          0.32                  (0.73, 2.68)
V16       CA       2.96          0.43                  (1.26, 6.93)      positively related
V18       LIV      2.20          0.55                  (0.74, 6.53)
V19       CNS      3.92          0.24                  (2.41, 6.39)      positively related
V20       PCA      2.02          0.37                  (0.97, 4.19)
V21       RHE      0.81          0.43                  (0.34, 1.88)
V24       OTH      0.94          0.35                  (0.47, 1.90)
V26       CAB      1.09          0.24                  (0.67, 1.75)
V27       VAL      0.58          0.73                  (0.14, 2.46)

Table 10: Odds ratio for PUMS

Variable  Heading  $\hat{\Psi}$  SE($\log\hat{\Psi}$)  95% CI of $\Psi$  Remark
V2        SEX      1.08          0.09                  (0.90, 1.30)
V4        OBE      3.38          0.16                  (2.45, 4.66)      positively related
V5        COP      3.45          0.13                  (2.66, 4.48)      positively related
V6        DIA      1.91          0.11                  (1.52, 2.40)      positively related
V7        CH2      0.36          0.15                  (0.26, 0.49)      negatively related
V8        CH3      0.49          0.20                  (0.33, 0.74)      negatively related
V9        REN      3.32          0.16                  (2.41, 4.58)      positively related
V10       HTN      1.86          0.08                  (1.57, 2.20)      positively related
V11       ETO      2.21          0.21                  (1.46, 3.36)      positively related
V16       CA       0.56          0.41                  (0.25, 1.25)
V19       CNS      2.12          0.21                  (1.39, 3.21)      positively related
V20       PCA      0.67          0.29                  (0.37, 0.91)
V21       RHE      0.97          0.23                  (0.61, 1.54)
V24       OTH      0.58          0.22                  (0.37, 0.91)
V26       CAB      1.22          0.14                  (0.92, 1.61)
V27       VAL      1.03          0.34                  (0.52, 2.01)

Table 11: Odds ratio for MI

Variable  Heading  $\hat{\Psi}$  SE($\log\hat{\Psi}$)  95% CI of $\Psi$  Remark
V2        SEX      0.76          0.19                  (0.52, 1.11)
V4        OBE      1.70          0.26                  (1.01, 2.85)      positively related
V5        COP      2.11          0.20                  (1.40, 3.16)      positively related
V6        DIA      1.13          0.22                  (0.73, 1.75)
V7        CH2      0.39          0.36                  (0.19, 0.80)      negatively related
V8        CH3      0.75          0.39                  (0.34, 1.63)
V9        REN      1.80          0.25                  (1.09, 2.99)      positively related
V10       HTN      1.58          0.16                  (1.14, 2.18)      positively related
V11       ETO      1.75          0.34                  (0.89, 3.44)
V18       LIV      6.43          0.46                  (2.60, 15.86)     positively related
V19       CNS      1.35          0.37                  (0.64, 2.83)
V20       PCA      2.06          0.41                  (0.92, 4.64)
V21       RHE      1.12          0.43                  (0.48, 2.61)
V24       OTH      0.99          0.39                  (0.45, 2.16)
V26       CAB      1.28          0.26                  (0.76, 2.13)
V27       VAL      1.73          0.53                  (0.60, 4.95)

Table 12: Odds ratio for LOMS

Variable  Heading  $\hat{\Psi}$  SE($\log\hat{\Psi}$)  95% CI of $\Psi$  Remark
V2        SEX      1.69          0.15                  (1.24, 2.32)      positively related
V4        OBE      1.08          0.28                  (0.61, 1.91)
V5        COP      1.06          0.23                  (0.66, 1.68)
V6        DIA      1.27          0.20                  (0.85, 1.89)
V7        CH2      1.20          0.22                  (0.77, 1.89)
V8        CH3      0.86          1.00                  (0.01, 0.62)      negatively related
V9        REN      1.99          0.23                  (1.24, 3.17)      positively related
V10       HTN      1.31          0.15                  (0.96, 1.78)
V11       ETO      1.01          0.40                  (0.46, 2.21)
V16       CA       2.33          0.49                  (0.88, 6.13)
V18       LIV      1.18          0.74                  (0.27, 5.09)
V19       CNS      1.17          0.37                  (0.56, 2.46)
V20       PCA      1.22          0.47                  (0.48, 3.10)
V21       RHE      2.20          0.32                  (1.17, 4.15)      positively related
V24       OTH      1.60          0.31                  (0.86, 2.99)
V26       CAB      1.84          0.22                  (1.19, 2.84)      positively related
V27       VAL      1.09          0.60                  (0.33, 3.60)

Table 13: Binary risk factors identified from odds ratio

Outcome  Binary Risk Factors
STA      VAL (4.23)  CAB (3.53)  CNS (3.05)  OTH (2.35)
REMS     REN (9.78)  LIV (3.58)  CNS (3.40)  COP (1.86)  DIA (1.80)  HTN (1.75)
NEMS     CNS (3.93)  CA (2.96)   OBE (2.11)  REN (1.62)  HTN (1.49)
PUMS     COP (3.45)  OBE (3.38)  ETO (2.23)  CNS (2.12)  DIA (1.91)  HTN (1.36)
MI       LIV (6.43)  COP (2.11)  REN (1.80)  OBE (1.70)  HTN (1.58)
LOMS     RHE (2.20)  REN (1.99)  CAB (1.84)  SEX (1.69)

PMI indicates the health of the heart muscle, while LVD measures the left ventricular function. So they essentially measure the same phenomenon, although PMI is much less specific. In Table 14, their association can be seen in a two-way table. The last column is the percent of YES among non-missing cases.

Table 14: Two-way table for PMI and LVD

                     prior MI
LVD        missing   no     yes    total   percent
missing    239       70     65     374     -
normal     59        413    183    655     30%
40-49%     3         61     78     142     56%
30-39%     40        24     40     104     63%
20-29%     0         5      14     19      74%
<20%       7         1      5      13      83%
total      384       574    385    1307    40%

As we see, compared with the percentage in the normal category, as the left ventricular function gets worse, the percentage of patients who had prior myocardial infarction increases. This confirms our knowledge about these two variables.

AGE and BSA are the only two continuous variables in the data set. Biologically, they may be related with many other variables, but the most interesting relations (based on previous studies) are the following: the association of diabetes with AGE and BSA; the relationship between BSA and gender as well as hypertension. In Figure 1, these relations are displayed by boxplots.

[Figure 1: Boxplots for continuous variables AGE and BSA; panels grouped by Diabetes, Diabetes, Sex and Hypertension.]

Boxplots have proven to be quite a good exploratory tool, especially when several boxplots are placed side by side for comparison as in the current cases. The most striking visual feature is the box which shows the limits of the middle half of the data (the white line inside the box represents the median and the ends of the box represent the lower and upper quartiles). The first horizontal lines beyond the box (which are called the whiskers) are drawn to the nearest value not beyond a standard span from the quartiles. Points beyond, which may be outliers, are drawn individually. The standard span is 1.5 times the difference of the upper and lower quartiles [Hoaglin et al., 1983].

There is little difference in the distribution of age between the populations of patients with or without diabetes. Similarly, this holds for the distribution of body surface area with respect to diabetes and hypertension. Only the relation between gender and body surface area appears to be significant. Female patients tend to have smaller body surface area. Although this is a general truth, notice from the odds ratio analysis that female patients have a higher risk of operative mortality. We need to investigate this relation further, as some cardiologists believe that gender is a poor surrogate for body surface area.

Chapter 4
Logistic Regression Analysis

The methodology of logistic regression analysis has become extremely popular among biostatisticians in recent years; see for example Lemeshow et al. (1988).

Let $y_i$, $i = 1, \ldots, n$, be independent binary random variables. Logistic regression is a method for assessing the dependence of $p_i = \Pr(y_i = 1)$ on explanatory variables $x_i$. The dependence is postulated as

$$p_i = \frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}}, \qquad 1 - p_i = \frac{1}{1 + e^{x_i^T \beta}}$$

for $i = 1, \ldots, n$, where $x_i^T = (x_{i1}, \ldots, x_{ip})$ is a row of known constants and $\beta = (\beta_1, \ldots, \beta_p)^T$ is a column of unknown parameters.
The equations above are equivalent to

$$g(p_i) = x_i^T \beta.$$

Here $g(p) = \log(p/(1 - p))$ is called the logistic transformation of the probability $p = (p_1, \ldots, p_n)$, and the above equation is called a linear logistic model.

There are several ways to estimate the logistic parameters $\beta$ [Hosmer and Lemeshow, 1989]. The maximum likelihood procedure is based on the conditional likelihood

$$L(p; y) = \prod_{i=1}^{n} f(y_i; p_i \mid x_i)$$

where $f(y; p \mid x) = p^{y} (1 - p)^{1 - y}$, $p = (p_1, \ldots, p_n)$ and $y = (y_1, \ldots, y_n)$. It is convenient to deal with the log-likelihood function. In our case, it is

$$l(\beta; y) = \sum_{i=1}^{n} \{ y_i x_i^T \beta - \log[1 + \exp(x_i^T \beta)] \}.$$

To compute the maximum likelihood estimates, it is necessary to solve the score equations $\partial l(\beta; y)/\partial \beta = 0$. Commonly, the Newton-Raphson method or the iteratively reweighted least squares method is used to calculate $\hat{\beta}$, the estimate of $\beta$ [Rao, 1973].

Our goal is to use logistic regression to develop an objective model for prediction of 30 day operative mortality and complications among patients. Typically, the first step in this modeling process is data reduction; from all available predictor variables, only those most associated with the outcome are selected for inclusion in the final model. If, after this step, there is still a large set of characteristics, a stepwise logistic regression analysis can be applied to reduce the number of predictor variables. An alternative method is the best subsets selection procedure, which provides several candidate models.

4.1 Univariate Analysis and Comparison of Models

For continuous variables, the test of association between the outcome and the independent variable can be carried out using a Student t-test analogous to that used in linear regression [Weisberg, 1980]. For categorical variables, we use the likelihood ratio test, which is defined as follows. The deviance function is defined as

$$D(\hat{p}; y) = 2 \log L(y; y) - 2 \log L(\hat{p}; y).$$

The difference in deviance between two models measures the contribution of the parameters by which they differ.
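To make the estimation step and the deviance concrete, the following Python sketch applies Newton-Raphson to the score equations of the logistic log-likelihood and then evaluates the deviance of the fitted model. It is only an illustrative sketch under stated assumptions, not the software used in this study: the helper names (`solve`, `fit_logistic`, `log_likelihood`, `deviance`) and the toy data are hypothetical, and the linear system is solved by plain Gaussian elimination to keep the example self-contained. For ungrouped binary responses the saturated model has log-likelihood 0, so the deviance reduces to $-2\, l(\hat{\beta}; y)$.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_logistic(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson for the logistic score equations (X includes a 1s column)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        eta = [sum(xj * bj for xj, bj in zip(xi, beta)) for xi in X]
        prob = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        # Score vector: sum_i (y_i - p_i) x_i
        score = [sum((yi - pi) * xi[j] for xi, yi, pi in zip(X, y, prob))
                 for j in range(p)]
        # Information matrix: sum_i p_i (1 - p_i) x_i x_i^T
        info = [[sum(pi * (1 - pi) * xi[j] * xi[k] for xi, pi in zip(X, prob))
                 for k in range(p)] for j in range(p)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


def log_likelihood(X, y, beta):
    """l(beta; y) = sum_i { y_i x_i'beta - log[1 + exp(x_i'beta)] }."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(xj * bj for xj, bj in zip(xi, beta))
        total += yi * eta - math.log1p(math.exp(eta))
    return total


def deviance(X, y, beta):
    """Deviance for binary data; the saturated log-likelihood is 0."""
    return -2.0 * log_likelihood(X, y, beta)


# Toy data (hypothetical): intercept plus one binary risk factor.
X = [[1.0, 1.0] for _ in range(100)] + [[1.0, 0.0] for _ in range(100)]
y = [1] * 20 + [0] * 80 + [1] * 10 + [0] * 90
beta_hat = fit_logistic(X, y)
print("beta_hat:", beta_hat, "deviance:", deviance(X, y, beta_hat))
```

With a single binary covariate, the fitted slope coincides with the logarithm of the odds ratio estimate $ad/(bc)$ from the corresponding 2x2 table, which is one way to see why logistic regression coefficients can be read as log-odds ratios.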
The distribution theory is asymptotic [McCullagh and Nelder, 1989]; for comparing two nested models with estimated means $\hat\mu_1$ and $\hat\mu_2$, the difference in deviance

$$D(\hat\mu_1, \hat\mu_2) = D(\hat\mu_1; y) - D(\hat\mu_2; y)$$

has an asymptotic $\chi^2$ distribution (under the null hypothesis that the smaller model is correct) with degrees of freedom $\nu = \nu_1 - \nu_2$ equal to the difference in the dimensions of the parameter spaces implicit in the models with means $\mu_1$ and $\mu_2$. Therefore, to test the association of a single variable $x$ with the outcome, we only need to compare the model

$$g(\mu_{1i}) = \beta_0$$

with the model

$$g(\mu_{2i}) = \beta_0 + \beta_1 x_i$$

and find out how much the variable $x$ improves the predictive value of the model.

4.2 Stepwise Logistic Regression

In stepwise logistic regression, models are built by adding new variables and seeing how much they improve the fit, and by dropping variables that do not improve the fit by a "significant" amount. Usually the procedure starts with an arbitrary model and stops when no step will decrease the value of a selection criterion. The selection criterion used here is AIC (Akaike's Information Criterion) [Akaike, 1973]

$$\mathrm{AIC} = D + 2p$$

where $D$ is the deviance of the current model and $p$ is the dimension (number of variables) of the model. The change in AIC due to augmenting or reducing a model by a given variable reflects both the change in deviance caused by the step and the change in the dimension of the model. The rationale of AIC is that the more parameters a model contains, the less accurately they can be estimated, and the predictive value of the model may get worse. AIC adjusts the deviance for the number of parameters estimated. Thus, the model with the minimum AIC gives the best fit to the data according to the AIC criterion. We therefore think of AIC as a useful tool for the quick comparison of parametric models, although it does not indicate that the better of two models is "significantly better".

Take STA as an example. The initial model contains the 14 variables obtained from univariate screening. The first variable deleted was SEX, leading to AIC=477.90; the second one deleted was OBE, leading to AIC=475.97; ...; the last one deleted was CA, leading to AIC=471.56. The procedure is summarized in Table 15.

Table 15: Stepwise regression procedure: a demonstration

Variables involved in the current model                           Operation   AIC
AGE, SEX, PMI, OBE, COP, ETO, CA, CNS, PCA, OTH, CAB, LVD, BSA                479.8
AGE, PMI, OBE, COP, ETO, CA, CNS, PCA, OTH, CAB, LVD, BSA         -SEX        477.9
AGE, PMI, COP, ETO, CA, CNS, PCA, OTH, CAB, LVD, BSA              -OBE        475.9
AGE, PMI, COP, CA, CNS, PCA, OTH, CAB, LVD, BSA                   -ETO        473.5
AGE, PMI, CA, CNS, PCA, OTH, CAB, LVD, BSA                        -COP        472.2
AGE, PMI, CNS, PCA, OTH, CAB, LVD, BSA                            -CA         471.5

In Table 16, the results of stepwise logistic regression for the various outcome variables are given with the corresponding best AIC values.

Table 16: Results of stepwise regression procedure for each outcome

Outcome   Variables involved                        AIC
STA       AGE, PMI, CNS, PCA, OTH, CAB, LVD, BSA    471.56
REMS      AGE, SEX, PMI, COP, DIA, REN, CNS, LVD    437.92
NEMS      AGE, PMI, OBE, ETO, CA, CNS               744.55
PUMS      PMI, DIA, HTN, CNS, BSA                   1105.37
MI        PMI, LIV, PCA                             684.66
LOMS      AGE, PMI, DIA, REN, RHE, CAB, LVD, BSA    784.16

4.3 Best Subsets Selection

Best subsets selection is an alternative to the stepwise procedure for model building. This approach has been available for linear regression for years and makes use of the branch and bound algorithm of Furnival and Wilson (1974).
Typical software implementing this method will identify a specified number of "best" models containing one, two, three variables, and so on, up to the single model containing all p variables. For the case of logistic regression, Hosmer and Lemeshow (1989) proposed a method which can be performed in a straightforward manner using any program for best subsets linear regression.

The best subsets selection procedure is regarded as a more reliable and informative method. This is because the stepwise procedure leads to a single subset of variables and does not suggest alternative good subsets. In this procedure, $C_p$ statistics are used for selecting the best subsets [Draper and Smith, 1981]; a model with a $C_p$ value close to the number of predictors is better.

In the logistic model, let $\hat\beta$ be the maximum likelihood estimate and $\hat\pi_i$ be the estimated logistic probability computed using $\hat\beta$ and the data for the $i$th case, $x_i$. We define two matrices $X$ and $V$:

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}
\quad\text{and}\quad
V = \begin{pmatrix} \hat\pi_1(1 - \hat\pi_1) & 0 & \cdots & 0 \\ 0 & \hat\pi_2(1 - \hat\pi_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat\pi_n(1 - \hat\pi_n) \end{pmatrix}$$

It may be shown [Pregibon, 1981] that $\hat\beta = (X'VX)^{-1}X'Vz$, where the vector $z$ contains pseudovalues, $z = X\hat\beta + V^{-1}r$, and $r$ is the vector of residuals, $r = (y - \hat\pi)$. A computation for the best subsets logistic regression model can therefore be performed using a best subsets linear regression program with dependent variable $z$, case weights $v_i$ equal to the diagonal elements of $V$, and the original covariates $x$.

Table 17: Models obtained from best subsets procedure for each outcome

Outcome   Model code   Variables included                             C_p
STA       S_1          AGE, PMI, CNS, PCA, OTH, CAB, LVD, BSA         7.0
          S_2          AGE, CA, CNS, PCA, OTH, CAB, LVD, BSA          6.4
          S_3          AGE, COP, CNS, PCA, OTH, CAB, LVD, BSA         8.0
REMS      R_1          AGE, SEX, PMI, COP, DIA, REN, LIV, CNS         7.1
          R_2          AGE, SEX, PMI, COP, DIA, REN, CNS, LVD         8.4
          R_3          AGE, SEX, PMI, COP, DIA, REN, HTN, CNS         8.6
NEMS      N_1          AGE, PMI, OBE, ETO, CA, CNS                    4.4
          N_2          AGE, PMI, ETO, CA, CNS, CAB                    4.6
          N_3          AGE, PMI, ETO, CA, CNS, PCA                    5.8
PUMS      P_1          AGE, PMI, REN, HTN, CNS, BSA                   6.7
          P_2          PMI, OBE, REN, HTN, CNS, BSA                   7.5
          P_3          AGE, PMI, OBE, HTN, CNS, BSA                   7.9
MI        M_1          PMI, COP, LIV, PCA                             3.9
          M_2          PMI, COP, CNS, PCA                             5.1
          M_3          PMI, LIV, CNS, PCA                             5.3
LOMS      L_1          AGE, PMI, DIA, REN, RHE, OTH, CAB, LVD, BSA    11.0
          L_2          AGE, PMI, DIA, HTN, RHE, OTH, CAB, LVD, BSA    11.5
          L_3          AGE, SEX, PMI, DIA, RHE, OTH, CAB, LVD, BSA    12.2

In this study, for each outcome, we provide three candidate models produced by the best subsets selection procedure. One interesting finding is that the model obtained by the stepwise procedure is among the three models.

4.4 Goodness-of-fit: Hosmer-Lemeshow Grouping Test

After the above procedures, we would like to know how effective the models are in describing the outcome variables. This is referred to as goodness-of-fit.

One test was proposed by Hosmer and Lemeshow (1980). The Hosmer-Lemeshow grouping test creates groups based on the values of the estimated probabilities. Suppose we have $n$ observations. With this method, use of $g = 10$ groups results in the first group containing the $n_1 = n/10$ subjects having the smallest estimated probabilities, and the last group containing the $n_{10} = n/10$ subjects having the largest estimated probabilities. The Hosmer-Lemeshow goodness-of-fit statistic $\hat C$ is obtained by calculating

$$\hat C = \sum_{k=1}^{g} \frac{(o_k - n_k\bar\pi_k)^2}{n_k\bar\pi_k(1 - \bar\pi_k)}$$

where

$$o_k = \sum_{i \in A_k} y_i \qquad\text{and}\qquad \bar\pi_k = \sum_{i \in A_k} \hat\pi_i / n_k,$$

with $A_k$ consisting of the subjects in the $k$th group. It can be shown that $\hat C$ is asymptotically well approximated by the chi-square distribution with $g - 2$ degrees of freedom, $\chi^2(g - 2)$, if the model is correct.

A small value of $\hat C$ indicates a good fit. From the prediction point of view, we used this statistic as the final criterion for model selection. That is, among the three candidate models obtained from stepwise logistic regression and the best subsets selection procedure, we chose the one with the smallest value of $\hat C$. In Table 18, the grouping tests for each selected model are shown.

Table 18: Hosmer-Lemeshow grouping test for selected models

                              Decile of risk
Model         g1    g2    g3    g4    g5    g6    g7    g8    g9   g10   Total     Ĉ    Prob
S_2   Obs      0     0     2     3     4     4     5    12    15    20      65   7.25   0.51
      Exp    0.7   1.4   2.0   2.6   3.3   4.3   5.7   7.7  11.4  25.8      65
R_1   Obs      1     2     1     1     3     3     7     6     8    29      61   3.75   0.88
      Exp    0.5   1.1   1.6   2.2   2.8   3.6   4.9   6.9   9.8  27.3      61
N_3   Obs      1     3     3     6    10     9    10    15    22    42     121   1.82   0.98
      Exp    1.2   2.6   4.1   5.8   7.7   9.8  12.2  15.9  21.1  40.7     121
P_2   Obs      9    10    11    12    18    19    21    18    34    44     196   3.76   0.87
      Exp    9.9  11.2  12.1  13.2  14.5  16.8  19.4  23.4  31.0  45.2     196
M_2   Obs      1     4     2     3    13     9     6    18    22    24     102   8.29   0.41
      Exp    3.1   3.1   3.1   3.2   7.5   9.1   9.0  16.3  20.4  27.2     102
L_2   Obs      8     4     4     4    11     7     9    14    21    35     117   13.9   0.08
      Exp    2.9   4.6   5.8   7.0   8.4  10.1  11.9  14.4  18.9  33.0     117

Judging from the p-values, all the selected models fit quite well except possibly the one with LOMS as outcome.

The final logistic regression models are given in Tables 19 to 24 together with the maximum estimated probability, which was calculated by setting all the risk factors in the model to their higher-risk value (for continuous variables, we use their mean values).
This number indicates the range of probability that a model can predict. Since OTH and CAB cannot be 1 at the same time, when these two appear together in the model, we use the one with the larger coefficient. All the results were obtained using version 3.1 of the statistical software S/Splus. When coding dummy variables, the treatment contrast was used [Chambers and Hastie, 1990].

As mentioned in section 3.2, the estimated coefficients here can be interpreted as log odds ratios. We simply calculate $\exp(\hat\beta)$ to give the odds ratio of each predictor with the other factors held fixed. For example, in the STA model, variable OTH has a coefficient of 1.77, which gives exp(1.77)=5.87. This means that the odds of operative mortality for patients who had another cardiac operation are 5.87 times those of patients who had not. Another example is that AGE leads to an odds ratio of exp(0.048)=1.049. This means an additional multiplicative risk of 1.049 for each increase in age of one year, all other variables held fixed. Since this number is larger than 1, we consider age to be a contributor to operative mortality; older patients tend to have higher risk. For a categorical variable, the odds ratios should be interpreted as a comparison with the first category. In both of the categorical variables PMI and LVD, the first category happens to represent the missing value, and such a comparison can provide some insight into the missing value category.

Table 19: Final logistic regression model for STA

Variable   Heading    Estimate   SE      Estimate/SE   exp(Estimate)
           Constant   -2.811     1.753   -1.603
V24        OTH         1.770     0.444    3.981        5.871
V26        CAB         1.327     0.344    3.856        3.770
V32        BSA        -2.061     0.654   -3.148        0.127
V1         AGE         0.048     0.015    3.108        1.049
V30        LVD0       -0.092     0.333   -0.277        0.911
           LVD1       -0.430     0.541   -0.794        0.650
           LVD2        0.973     0.420    2.314        2.648
           LVD3        1.995     0.665    2.997        7.357
           LVD4       -3.914     5.808   -0.673        0.019
V19        CNS         1.160     0.496    2.337        3.190
V20        PCA         1.227     0.608    2.017        3.411
V16        CA          0.868     0.745    1.164        2.382
Maximum prediction probability: 0.97

Table 20: Final logistic regression model for REMS

Variable   Heading    Estimate   SE      Estimate/SE   exp(Estimate)
           Constant   -9.004     1.186   -7.586
V9         REN         1.855     0.335    5.523        6.391
V1         AGE         0.065     0.016    4.020        1.067
V5         COP         1.032     0.345    2.988        2.806
V3         PMI1        0.597     0.422    1.415        1.817
           PMI2        1.146     0.407    2.813        3.146
V19        CNS         1.162     0.494    2.350        3.196
V2         SEX         0.577     0.288    2.000        1.780
V6         DIA         0.668     0.342    1.951        1.950
V18        LIV         0.836     0.935    0.893        2.307
Maximum prediction probability: 0.92

Table 21: Final logistic regression model for NEMS

Variable   Heading    Estimate   SE      Estimate/SE   exp(Estimate)
           Constant   -7.934     0.846   -9.377
V1         AGE         0.089     0.012    7.362        1.093
V19        CNS         1.292     0.356    3.625        3.640
V3         PMI1       -0.811     0.241   -3.360        0.444
           PMI2       -0.861     0.261   -3.301        0.422
V16        CA          1.662     0.553    3.002        5.270
V11        ETO         0.591     0.415    1.425        1.807
V20        PCA         0.606     0.522    1.161        1.833
Maximum prediction probability: 0.76

Table 22: Final logistic regression model for PUMS

Variable   Heading    Estimate   SE      Estimate/SE   exp(Estimate)
           Constant   -0.720     0.690   -1.043
V3         PMI1       -0.442     0.097   -4.530        0.642
           PMI2       -0.097     0.060   -1.620        0.907
V10        HTN         0.451     0.162    2.773        1.569
V32        BSA        -0.737     0.362   -2.037        0.478
V19        CNS         0.554     0.338    1.635        1.740
V9         REN         0.329     0.248    1.326        1.389
V4         OBE         0.286     0.269    1.063        1.331
Maximum prediction probability: 0.40

Table 23: Final logistic regression model for MI

Variable   Heading    Estimate   SE      Estimate/SE   exp(Estimate)
           Constant   -1.820     0.172   -10.53
V3         PMI1       -1.991     0.303   -6.567        0.136
           PMI2       -0.899     0.248   -3.619        0.406
V20        PCA         1.469     0.524    2.801        4.348
V19        CNS         0.428     0.412    1.039        1.534
V5         COP         0.285     0.276    1.029        1.329
Maximum prediction probability: 0.59

Table 24: Final logistic regression model for LOMS

Variable   Heading    Estimate   SE      Estimate/SE   exp(Estimate)
           Constant   -1.916     1.265   -1.514
V32        BSA        -1.603     0.485   -3.300        0.201
V3         PMI1        0.723     0.341    2.121        2.062
           PMI2        1.015     0.332    3.049        2.760
V26        CAB         0.838     0.293    2.855        2.312
V30        LVD0       -0.244     0.285   -0.854        0.783
           LVD1       -0.204     0.373   -0.548        0.814
           LVD2        0.608     0.355    1.711        1.838
           LVD3        1.214     0.553    2.193        3.367
           LVD4        1.338     0.817    1.638        3.813
V24        OTH         0.879     0.392    2.243        2.410
V21        RHE         1.000     0.454    2.200        2.720
V1         AGE         0.022     0.010    2.129        1.022
V10        HTN         0.315     0.209    1.503        1.370
V6         DIA         0.467     0.265    1.759        1.596
Maximum prediction probability: 0.93

Chapter 5
The Tree-based Model

The tree-based model has gradually become a popular tool in clinical and epidemiological studies because of its clinical interpretability. The technique was introduced by Morgan and Sonquist (1963); however, more ground-breaking ideas were introduced by Breiman et al. (1984), and the resulting computer program is named CART (Classification And Regression Trees).

The tree-based model procedure used in version 3.1 of S/Splus departs slightly from CART in using the recursive partitioning (RP) method proposed by Ciampi et al. (1987). Also, compared with CART, the procedure is far less automatic in tree building, as the unbundling of procedures for growing, displaying and challenging trees requires user initiation in all phases.

5.1 Recursive Partitioning: Growing a Classification Tree

In general, the tree-based model is fitted by creating a binary tree using an RP algorithm. The data have the form $(y^{(i)}, x^{(i)})$, $i = 1, \dots, N$, where $y$ is a multinomially distributed variable with $s$ categories and $x$ is assumed to be a vector of categorical variables $x = (x_1, \dots, x_k)$, and for each $j$, $x_j$ has a finite number of categories $l_1, \dots, l_{m_j}$.
The categories of $x_j$ can be either ordered or unordered.

In what follows, we refer to $y$ as the criterion variable and to the components of $x$ as predictor variables. Predictors contain background information used to define strata which are homogeneous according to a criterion variable; for each homogeneous stratum one can define a unique criterion quantity independent of the $x$ variables given the stratum.

[Figure 2: Binary tree: an example. Legend: non-terminal node; terminal node; branches labeled with split conditions.]

In our study, the criterion quantity is the vector of probabilities of being assigned to each outcome, i.e., $p = (p_1, \dots, p_s)$ such that

$$\sum_{j=1}^{s} p_j = 1.$$

Our aim is to grow a binary tree with nodes representing subsets of observations. In particular, the root of the tree represents the entire set of observations and the terminal nodes represent strata that are more homogeneous (see Figure 2).

The tree is constructed based on a set of Split Defining Statements (SDS) such as $x_j \in A_j$, where $A_j$ is a subset of the $m_j$ categories of $x_j$. For $x_j$ unordered, $A_j$ can be any of the $2^{m_j - 1} - 1$ nontrivial subsets of $l_1, \dots, l_{m_j}$; for $x_j$ ordered, $A_j$ can be any of the $m_j - 1$ subsets of the form $A_j = \{l_1\}, \{l_1, l_2\}, \dots, \{l_1, \dots, l_{m_j - 1}\}$.

In an RP tree, each nonterminal node is split by an SDS into two nodes which represent subsets as dissimilar as possible from the point of view of the criterion quantity.

Ciampi et al. (1987) applied the likelihood ratio statistic (LRS) as a natural measure of dissimilarity as follows. Let $P_1$, $P_2$ be disjoint sets and let $P$ denote their union. We shall assume that the criterion quantity is represented by a parameter $\theta$ which may take different values $\theta_1$, $\theta_2$ for $P_1$ and $P_2$, and that likelihood functions $L_1(\theta_1)$, $L_2(\theta_2)$ can be defined for $P_1$, $P_2$.
We shall also assume that the likelihood function for $P$ is of the form:

$$L(\theta_1, \theta_2) = L_1(\theta_1)L_2(\theta_2)$$

Consider now the hypothesis

$$H_0: \theta_1 = \theta_2 = \theta$$

and the alternative

$$H: \theta_1 \neq \theta_2$$

Then the log LRS of $H$ versus $H_0$ is defined as:

$$\rho(H \mid H_0) = 2\log\frac{L_1(\hat\theta_1)L_2(\hat\theta_2)}{L_1(\hat\theta)L_2(\hat\theta)}$$

where $\hat\theta_1$, $\hat\theta_2$ are the MLEs of $\theta_1$ and $\theta_2$ under $H$ and $\hat\theta$ is the MLE of $\theta$ under $H_0$. Clearly, the larger $\rho(H \mid H_0)$ is, the greater is the evidence in the data that $P_1$ and $P_2$ are heterogeneous with respect to the criterion quantity. It therefore provides a reasonable and general measure of dissimilarity:

$$d(P_1, P_2) = \rho(H \mid H_0)$$

In our case, the criterion quantity $p = (p_1, \dots, p_s)$ denotes the probability that $y$ falls into each of the possible categories.

We have for a given subset or node:

$$L(p; y) = \prod_{j=1}^{s} p_j^{n_j}$$

where $n_j$ is the number of individuals at the node that fall into the $j$th category and $y = (y_1, \dots, y_n)$.

At a given node, let $M$ be the number of observations in the node and $n_j$ be the number of $y$'s falling into the $j$th category; then the maximum likelihood estimate of $p$ is

$$\hat p = \left(\frac{n_1}{M}, \dots, \frac{n_s}{M}\right).$$

A description of the RP algorithm is:

Denote by $N = \{N_1, \dots, N_r\}$ the current collection of nodes.
(1) To initialize, set $r = 1$ and let $N_1$ represent the observations.
(2) For every $N_i \in N$ and every split $P_{1,i}$, $P_{2,i}$ defined by an element of the SDS, compute $d(P_{1,i}, P_{2,i})$.
(3) Among all nodes, choose the node $N_{i^*}$ corresponding to the split $P^*_{1,i^*}$, $P^*_{2,i^*}$ with the largest dissimilarity and replace $N_{i^*}$ by two nodes representing $P^*_{1,i^*}$ and $P^*_{2,i^*}$. Use the resulting collection of nodes as current and go to (2), where $r$ has increased by 1.

In the tree-based model in S/Splus, an intuitive way is used to implement the above algorithm. The deviance of a node is defined as

$$D(\hat p; y) = -2\,l(\hat p; y)$$

where $l(p; y) = \log L(p; y)$. It can be shown that the deviance is identically zero if all the $y$'s are the same, and increases as the $y$'s deviate from this case.
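The node deviance and the likelihood ratio dissimilarity can be sketched in a few lines. The function names below are invented for this illustration and are not S/Splus functions; the counts are the ones reported in Section 5.3 for the STA sample (65 deaths among 1307 patients at the root, and 16/672 and 49/635 after the first split on AGE). Because the deviance is $-2$ times the maximized log-likelihood, $d(P_1, P_2)$ reduces to the deviance of the parent minus the deviances of the two children.

```python
import math

def node_deviance(counts):
    """D = -2 * sum_j n_j * log(n_j / M); empty categories contribute 0."""
    total = sum(counts)
    return -2.0 * sum(n * math.log(n / total) for n in counts if n > 0)

def dissimilarity(left, right):
    """Log LRS of 'separate p in each subset' versus 'one common p'.

    Equals the parent deviance minus the sum of the child deviances.
    """
    parent = [a + b for a, b in zip(left, right)]
    return node_deviance(parent) - node_deviance(left) - node_deviance(right)

# (died, survived) counts for the two nodes produced by the first AGE split.
left = [16, 672 - 16]
right = [49, 635 - 49]

print("root deviance:", node_deviance([65, 1307 - 65]))
print("dissimilarity of the AGE split:", dissimilarity(left, right))
```

A pure node (all observations in one category) has deviance zero, and a split that leaves the class proportions unchanged in both children has dissimilarity zero, which matches the behaviour described above.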
The deviance $D_T(y)$ of a tree $T$ is defined as the sum of the deviances of all its terminal nodes, $\sum_{t \in T} D(\hat p_t; y)$, where $\hat p_t$ is the vector of the observed proportions of the $s$ categories for terminal node $t$. Splitting proceeds by comparing the deviance of the tree $T$ with that of larger trees $T'$ in which a terminal node of $T$ has been split into two. The split that maximizes the change in deviance

$$\Delta D = D_T(y) - D_{T'}(y)$$

is the next split that is chosen.

5.2 Getting the Right Size Tree: Pruning the Classification Tree

The above discussion implies that nodes become more and more pure (homogeneous) as splitting progresses. In the limit, a tree can have as many terminal nodes as there are observations. In S/Splus, two thresholds are introduced to stop the splitting process:
(a) the node deviance is less than some small fraction of the root node deviance (say 1%); and
(b) the node is smaller than some absolute minimum size (say 10).

This introduces another problem: if the threshold is set too high, good splits may be lost. There are two ways out of this dilemma: one is to use new (independent) data to guide the selection of the right size tree, and the other is to reuse the existing data by the method of cross-validation. For this purpose, S/Splus provides a function called "prune".

The idea of pruning is more easily described using tree terminology:

Notation 1. A binary tree is denoted by $T$.
A node $t$ on the tree $T$ is denoted by $t \in T$.

Definition 1: A branch $T_t$ of $T$ with node $t \in T$ consists of the node $t$ and all descendants of $t$ in $T$.

Definition 2: Pruning a branch $T_t$ from a tree $T$ involves cutting off $T_t$ just below the node $t$. The resulting tree is a subtree of $T$ denoted by $T - T_t$.

Definition 3: $T'$ is a pruned subtree of $T$ if $T'$ is obtained by successively pruning off the branches of $T$.

In S/Splus, the importance of a pruned subtree $T'$ is captured by the cost-complexity measure

$$D_\alpha(T') = D(T') + \alpha \cdot \mathrm{size}(T')$$

where $D(T')$ is the deviance of the subtree, $\mathrm{size}(T')$ is the number of terminal nodes of $T'$ and $\alpha$ is the cost-complexity parameter. For any specified $\alpha$, cost-complexity pruning determines the subtree $T'$ that minimizes $D_\alpha(T')$ over all subtrees of $T$.

[Figure 3: Original tree for STA.]

As is known from the RP algorithm, the deviance of a tree $T$ is smaller than that of its subtrees when $\alpha$ is set to zero. But when the size of the tree is taken into consideration, that is, $\alpha > 0$, pruning provides an upward way to snip off the least important branches. In the extreme case, only the root node is left if $\alpha$ is set sufficiently large. A sequence of subtrees $T = T_0 \succ T_1 \succ \cdots \succ T_k = \text{root}$ with decreasing size can be obtained by setting an increasing sequence of values of $\alpha$: $\alpha_0 = 0 < \alpha_1 < \cdots < \alpha_k$.

5.3 Applications and Results

For each outcome, the described recursive partitioning procedure was performed on a sample data set. The original tree then underwent cross-validation testing on a new data set by the pruning algorithm, and the right size of the tree was decided.

[Figure 4: Plots of deviance versus size for sequences of subtrees. (a): sequence obtained from sample data; (b): sequence evaluated on test data.]

Again take STA as an example.
We use the variables obtained from the initial data analysis as the predictor variables. Each of these variables is considered for splitting the root node of the sample data set (1307 patients, of which 65 died). In the first round, AGE is the variable leading to the two nodes that are the most different (with mortality rates 16/672 and 49/635 respectively; see Figure 5). We continue this procedure and use the same group of predictor variables to split each of the two nodes. For example, the winner for the left node is LVD, while OTH is the best one for the right node. This process is continued until each node contains fewer than 10 patients. See Figure 3 for the resulting tree. This tree has 63 terminal nodes and is obviously too large to use, so pruning is necessary.

Figure 4(a) displays the plot of deviance versus size (number of nodes) for the sequence of subtrees of the above tree. It should not be surprising that this sequence provides little guidance on what size tree is adequate, since the deviance evaluated on the sample data keeps decreasing as the tree grows. But we can use new data to guide the selection of the right size tree by using the pruning algorithm described in section 5.2. In S/Splus, this function provides a sequence of subtrees and the deviance evaluated on the test data. Figure 4(b) illustrates this functionality for the STA data. Usually this sequence will not be monotone and the turning point will suggest the right size; for example, for the STA data, a seven-node tree is suggested, with $\alpha = 1.125$.

The binary tree (see Figure 5) has three terminal nodes corresponding to low risk and four terminal nodes corresponding to high risk. The size of the risk of a node is defined relative to the sample population risk. Patients who are over 64.5 years old and have had some other previous cardiac operation appear to have a relatively high risk of mortality.
Those patients who are less than 64.5 years old and have normal left ventricular function seem to be at much lower risk than those in the same age group but with a worse condition on LVD. Body surface area also plays an important role here. As we can see, with the same condition on age and LVD referred to above, patients with a smaller body surface area tend to face a higher risk of mortality. Similar interpretations can be made for the tree models with the response variable being one of the complication variables; see Figures 6 to 10.

[Figure 5: Tree model for STA. The number under each terminal node is the observed proportion of the 30 day operative mortality; for example, in the leftmost node, which has low risk, 9 out of 606 patients in the node died after the operation.]

[Figure 6: Tree model for REMS]

[Figure 7: Tree model for NEMS]

[Figure 8: Tree model for PUMS]

[Figure 9: Tree model for MI]

[Figure 10: Tree model for LOMS]

Chapter 6
Discussion and Conclusion

Our aim is to set up risk stratification models for some response variables. Besides STA, we are also interested in other variables, i.e., the complications. At first, using odds ratio analysis, we obtained some idea of the association between binary risk factors and the response variable. After that, two methods were applied using S/Splus: the logistic regression model and the tree-based model.

In the logistic regression procedure, there are several steps of work before achieving the final model. First, each predictor variable's association with the response variable was tested with the likelihood ratio test. Only those which showed a potential relation with the response variable (p-value < 0.25) were selected. Linearity of the continuous variables with respect to the response variable was checked and confirmed. Secondly, if the dimension of the variable space was still large (more than 10), a stepwise procedure was used. This would usually reduce the dimension to fewer than 10. Thirdly, a best subsets program was run on the sample to give some alternative candidate models.
After that, the Hosmer-Lemeshow grouping test was applied to check the goodness-of-fit, and finally the best models were selected.

In the tree-based model, a classification tree was grown based on a recursive partitioning method in S/Splus. After that, a pruning algorithm was applied to get a tree with 4 to 6 levels.

When doing the statistical analysis, a sample subset was used instead of the entire data set. (This was done partly in order to reduce computational time.) While sampling, the 3000 data entries at the end of the data file were left untouched. This latter part was kept for validation testing.

The missing values in PMI and LVD cause some trouble in the analysis. Fortunately, these are categorical variables and we can code an extra category labeled "no information". Consequently, PMI and LVD become categorical variables with 3 and 6 categories respectively, and dummy variables are created to replace them in the logistic regression models, but not in the tree models.

A summary of important risk factors for the dependent status variable STA and the complication variables REMS, NEMS, PUMS, MI and LOMS is given in Table 25. The binary risk factors which have significant odds ratios, and the risk factors which are included in the final logistic model and the final pruned tree model, are listed (in order of importance). There is substantial overlap in the important risk factors from the 3 methods. The ranges of predicted risks are also summarized in Table 25 for each of the dependent variables. The range for the logistic regression model is wider than that of the tree model because the logistic regression separates out the cases more than the tree.

AGE, as we anticipated, is associated with all outcomes.

It seems that SEX is not strongly associated with operative mortality and complications, since it appears as a predictor variable only in the final logistic regression model for REMS and in none of the tree models.
This could be because the variable BSA accounts somewhat for the gender variable (that is, BSA is a partial surrogate). Male and female patients face the same level of risk for the same value of BSA and other variables.

What about body surface area? BSA has a strong association with operative mortality. But with the complications, it is not always as important. It is an important risk factor for LOMS and takes a middle position of importance in predicting PUMS and MI. Its importance in the models for REMS and NEMS is much less.

Prior cardiac operation plays a very important role in predicting the 30 day operative mortality. Two out of the three variables CAB, OTH and VAL appear in the logistic model and the tree model. It is not surprising that these variables are weighted most heavily by cardiologists. But they seem to have a weak association with some of the complications.

Table 25: List of risk factors and prediction range for the different methods and different outcomes

Odds ratio (binary risk factors, listed across the outcomes STA, REMS, NEMS, PUMS, MI, LOMS):
  VAL, CAB, CNS, OTH, REN, LIV, CNS, COP, DIA, HTN, CNS, CA, OBE, REN, HTN, COP, OBE, ETO, CNS, DIA, HTN, LIV, COP, REN, OBE, HTN, RHE, REN, CAB, SEX

Logistic regression
Outcome   Risk factors                                   Prediction range
STA       OTH, CAB, BSA, AGE, LVD, CNS, PCA, CA          0 to 0.97
REMS      REN, AGE, COP, PMI, CNS, SEX, DIA, LIV         0 to 0.92
NEMS      AGE, CNS, PMI, CA, ETO, PCA                    0 to 0.76
PUMS      PMI, PCA, CNS, REN, OBE                        0 to 0.40
MI        PMI, PCA, CNS, COP                             0 to 0.59
LOMS      BSA, PMI, CAB, LVD, OTH, RHE, AGE, HTN, DIA    0 to 0.93

Tree-based model
Outcome   Risk factors                                   Prediction range
STA       AGE, OTH, LVD, CAB, BSA                        0.0 to 0.4
REMS      REN, AGE, PMI, COP, LVD, CNS, BSA              0.01 to 0.80
NEMS      AGE, PMI, CNS, OBE, BSA, CA, ETO               0.02 to 0.6
PUMS      PMI, AGE, HTN, CNS, CAB                        0.0 to 0.625
MI        PMI, PCA, BSA, AGE, CNS, HTN                   0 to 0.60
LOMS      BSA, LVD, AGE, CAB, PMI, REN                   0.04 to 0.57

PMI and LVD are also important predictor variables. As we know, they measure roughly the same thing, namely damage to the heart muscle; they seldom appear together in the same model and play the role alternately.

As to the diseases, CNS and HTN deserve much attention. CNS appears in all the logistic models except the one with LOMS as outcome, although in every appearance its position in importance is around the middle.
HTN's function is revealed when analyzing its relation with the complications. For all complications, patients who have hypertension tend to have higher risk. Actually, all the diseases studied appear as important predictors in different models for predicting the various complications. But they tend to be associated with particular outcomes; for example, REN (renal failure) is the most important risk factor in the REMS model and PCA seems closely related to the MI complication.

One of the difficulties in this study was that the patient data were from several populations. The technique and experience may vary across different hospitals. One possibility is to separate the patients, develop the models within one population and then seek generalization. Unfortunately, the MCR database did not provide such information. Another approach which may be more feasible is to include the variables which describe the operation, such as X-clamp time, type of oxygenator used, etc., to capture the difference between populations, since the database did record this information.

As we know, logistic regression is more powerful for obtaining prediction probabilities in the range of 0.1 to 0.9. So, although the logistic regression and tree models identify nearly the same group of risk factors, when predicting we suggest the latter be used since its prediction range is narrower. Another suggestion, also proposed by cardiologists, is to separate the population according to the prior cardiac operation done. Some such subpopulations are:
a). patients who had a coronary bypass operation,
b). patients who had a valve operation,
c). patients who had both operations.
Hopefully, the analyses based on these subpopulations will lead to much more interesting and important findings.

Bibliography

Akaike, H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle, in Second International Symposium on Information Theory. Akademia Kiado, Budapest, 267-281.

Bishop, Y. M. M., Fienberg, S. E., and Holland, P. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Boston.

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification And Regression Trees. Wadsworth, Belmont, CA.

Chambers, John M. and Hastie, Trevor J. (1990). Statistical Models in S. Wadsworth, Belmont, CA.

Ciampi, A., Chang, C-H., Hogg, S. and McKinney, S. (1987). Recursive partitioning: a versatile method for exploratory data analysis in biostatistics, in Biostatistics (eds. I. B. MacNeill and G. J. Umphrey). D. Reidel Publishing, New York.

Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, Second Edition. Wiley, New York.

Fleiss, J. (1979). Confidence intervals for the odds ratio in case-control studies: state of art, Journal of Chronic Diseases, 32, 69-77.

Furnival, G. M., and Wilson, R. W. (1974). Regression by leaps and bounds, Technometrics, 16, 499-511.

Hoaglin, D. C., Mosteller, F., and Tukey, J. W., editors (1983). Understanding Robust and Exploratory Data Analysis. Wiley, New York.

Hosmer, D. W. and Lemeshow, S. (1980). A goodness-of-fit testing for the multiple logistic regression model. Communications in Statistics, A10, 1043-1069.

Hosmer, D. W. and Lemeshow, S. (1989). Applied Logistic Regression. Wiley, New York.

Lemeshow, S., Teres, D., Avrunin, J. S., Pastides, H. (1988). Predicting the outcome of intensive care unit patients. Journal of the American Statistical Association, 83, 348-356.

McCullagh, P., and Nelder, J. A. (1989). Generalized Linear Models. Second Edition. Chapman and Hall, London.

Morgan, J. N., and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58, 415-434.

Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9, 705-724.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Second Edition. Wiley, New York.

Rothman, K. J. (1986). Modern Epidemiology. Little Brown, Boston.

Walter, S. D. (1987). Point estimation of the odds ratio in sparse 2x2 contingency tables, in Biostatistics (eds. I. B. MacNeill and G. J. Umphrey). D. Reidel Publishing, New York.

Weisberg, S. (1980). Applied Linear Regression. Wiley, New York.

Appendix A
Merged Cardiac Registry

MERGED CARDIAC REGISTRY
AN INTERNATIONAL DATABASE
DENDRITE SYSTEMS, INC.

1: DEMOG: Entered by the system
2: PRIOR ENTRY REGISTRY MCR: Entered by the system
3: DATE OF SURGERY: Enter date MM/DD/YY
4: AGE: Enter age in years
5: SEX: Entered by the system
6: (Reserved for future use)
7: PRIOR MI: 1=No, 2=Yes
8: MOST RECENT MI: 1=0-6h, 2=6h-24h, 3=1d-7d, 4=1w-6w, 5= >6w
9: OTHER DISEASES: 0[ ]=Other, 1[ ]=Obesity, 2[ ]=COPD, 3[ ]=Diab, 4[ ]=Chol >200, 5[ ]=Chol >300, 6[ ]=Renal, 7[ ]=Htn, 8[ ]=ETOH, 9[ ]=Drug Abuse, 10[ ]=Marfans, 11[ ]=HIV+, 12[ ]=AIDS, 13[ ]=CA, 14[ ]=Blood, 15[ ]=Liver, 16[ ]=CNS, 17[ ]=Prior CVA, 18[ ]=RheumHD, 19[ ]=Pulm Htn, 20[ ]=Chronic Dialysis
10: SMOKING NOW: 0=No, 1=Q>2y, 2=Y<1pk/d, 3=Y>1pk/d
11: PRIOR CARD SURG: 0[ ]=Other, 1[ ]=None, 2[ ]=CABG, 3[ ]=Valve, 4[ ]=Cong, 5[ ]=Pacemaker
12: LV DYSFUNCTION: 1=Nor, 2=40-49%, 3=30-39%, 4=20-29%, 5= <20%
13: LVEF: Ejection Fraction, enter %
14: CAD > 70%: 1[ ]=No, 2[ ]=AD, 3[ ]=CX, 4[ ]=RC, 5[ ]=Branch, 6[ ]=L Main, 7[ ]=1 Vessel, 8[ ]=2 Vessel, 9[ ]=3 Vessel
15: OTHER CARD PATH: 0[ ]=Other, 1[ ]=Ao St, 2[ ]=Ao Insf, 3[ ]=Mitr St, 4[ ]=Mitr, 5[ ]=Tricusp, 6[ ]=Pulm, 7[ ]=Cong, 8[ ]=Acq VSD, 9[ ]=LV Aneur, 10[ ]=Ao Aneur, 11[ ]=Ascending diss, 12[ ]=Descending diss
16: DATE MOST RECENT PTCA: Enter date, leave blank if none
17: PTCA RESULT: 1[ ]=N/A, 2[ ]=Success, 3[ ]=Failed, 4[ ]=Had complication
18: NO. PTCA VESSELS: Enter number of vessels dilated
19: REASON FOR OP: 0[ ]=Other, 1[ ]=Ang, 2[ ]=Urgent, 3[ ]=Arrh, 4[ ]=Anat, 5[ ]=Fail PTCA, 6[ ]=Tumor, 7[ ]=Endocarditis, 8[ ]=Trauma, 9[ ]=Ao diss, 10[ ]=Ao Aneur
20: PRE-OP STATUS: 1=Elect, 2=Urgent, 3=Emerg, 4=Desperate.
21: HEMODYNAMIC STAT: 1=Stbl, 2=Stbl on meds, 3=Unstbl on meds, 4=Cardiogenic shock on meds/IABP
22: OXYGENATOR: 0=Other, 1=H1500, 2=Shiley, 3=TMO, 4=CMII, 5=Sc125, 6=Sc135, 7=Maxima, 8=BCM7, 9=Sarns, 10=Terumo, 11=SciUltra
23: OTHER OP DEVICES: 0[ ]=Other, 2[ ]=Cel Sav, 3[ ]=HemoConcen, 4[ ]=Dial Filter, 5[ ]=IABP, 6[ ]=BioMed Pump, 7[ ]=LHAD, 8[ ]=RHAD, 9[ ]=Art Filter, 10[ ]=Plasma Phor, 11[ ]=MyoTempProb, 12[ ]=Cooling Pad, 13[ ]=Delphin Pump
24: THROMBOLYTIC Rx: 1[ ]=tPA, 2[ ]=Strepto, 3[ ]=Urokin, 4[ ]=IntrCor, 5[ ]=IntrVein
25: CARDIOPLEGIA: 1[ ]=None, 2[ ]=Cryst, 3[ ]=Blood, 4[ ]=Retro (Cor Sinus), 5[ ]=Intermit Clamp
26: X-CLAMP TIME: Enter time in minutes
27: BODY SURFACE AREA: Enter in square meters (e.g., 2.3)
28: NO. CABGs: Enter number of distal anastomoses
29: VALVES REPLACED: 1[ ]=Ao, 2[ ]=Mitr, 3[ ]=Tri, 4[ ]=Ao combined c Ao graft
30: REPAIRS: 1[ ]=Ao, 2[ ]=Mitr, 3[ ]=Tri, 4[ ]=Cong, 5[ ]=Acq VSD, 6[ ]=LV Aneur
31: AORTIC PROSTHESIS: 0=Other, 1=SE, 2=BS, 3=St. J, 4=Ed Porc, 5=Hancock, 6=Froz Homo, 7=Medtronic
32: MITRAL PROSTHESIS: 0=Other, 1=SE, 2=BS, 3=St. J, 4=Ed Porc, 5=Hancock, 6=Froz Homo, 7=Ring, 8=Medtronic, 9=Omnisci
33: (Reserved for future use)
34: (Reserved for future use)
35: BLOOD PRODUCTS: 1[ ]=Fresh Froz Plasma, 2[ ]=Platelets, 3[ ]=Cryo
36: DONOR TRANSFUSIONS: Enter number of units
37: AUTOLOGOUS TRANSFUSIONS: Enter number of units
38: (Reserved for future use)
39: COMPLICATIONS: 0[ ]=Other, 1[ ]=Reop/Bleed, 2[ ]=Renal/Mild, 3[ ]=Renal/Sev, 4[ ]=Wound/Sev, 5[ ]=Neuro/Mild, 6[ ]=Neuro/Sev, 7[ ]=Pulm/Mild, 8[ ]=Pulm/Sev, 9[ ]=MI, 10[ ]=Low Out/Mild, 11[ ]=Low Out/Sev, 12[ ]=Clotting, 13[ ]=Sepsis, 14[ ]=GI/GB, 15[ ]=DIC
40: (Reserved for future use)
41: (Reserved for future use)
42: DAYS IN ICU: Enter number of days
43: DAYS SURG/DISCH: Enter number of days
44: PARSONNET RISK: Calculated & entered by the system
45: (Reserved for future use)
46: TRANSFER TO NEW ENTRY: Entered by the system
47: DISCHG/30 DAY STATUS: 0=UNK, 1=ALIVE, 2=DIED IN OR, 3=DIED IN HOSP/30D, 4=REOP, 5=DIED LATE CARD, 6=UNREL DEATH, 9=LOST TO FU

MCR Version 2, January 01, 1990

Appendix B
Expanded Definitions for Merged Cardiac Registry

MERGED CARDIAC REGISTRY
Expanded Definitions for Version 2

1. DEMOG #:
(There is no user correspondence file set up for this question.) The program will use your Demographic number. It is the only patient identification that is sent to Dendrite. At Dendrite an offset will be added which is group specific. The offsets will not be published.

2. PRIOR ENTRY REGISTRY MCR:
(There is no user correspondence file set up for this question.) This information is entered at the time of transfer. It follows reoperations for both valves and bypasses.

3. DATE OF SURGERY:
(There is no user correspondence file set up for this question.) This is the date for this procedure. If a patient is in more than one Source Registry on the same date, the entries will be merged. If the patient has two entries in the same Source Registry on the same date, they will also be merged.

4. AGE:
This is the patient's age in years at the time of operation. If you don't have this question in your Source Registry(ies), you should consider adding it. For a minimal fee, we can give you the ability to calculate a default answer for age if registry question "DATE OF SURGERY" and demographic question "DATE OF BIRTH" have been entered. Please call Dendrite for information on this feature.

5. SEX:
(There is no user correspondence file set up for this question.) The program gets this information from your demographic file automatically.

6. Reserved for future use.

7. PRIOR MI:
1=No (If no clinical MI.)
2=Yes (One or more clinical MIs.)
Do not include silent MIs diagnosed only on angiography.

8. MOST RECENT MI:
Select the answer that reflects the interval from the most recent MI to this operation. This could be important as a risk factor. If you don't have this question in your Source Registry(ies), you should consider adding it.

9. OTHER DISEASES:
0[ ]=Other (Use "Other" to record a disease not listed in the answers but that you feel is significant.)
1[ ]=Obesity (1.5x expected body weight.)
2[ ]=COPD (Patient with distinct limitations revealed at time of study or on treatment - bronchodilators, etc.)
3[ ]=Diab (Patient on oral meds or insulin.)
4[ ]=Chol >200 (Patients from 200 - 299.)
5[ ]=Chol >300 (Patients above 300.)
6[ ]=Renal (Patients with creatinines above 2.5 not on dialysis.)
7[ ]=Hypertension (History of treatment.)
8[ ]=ETOH (Patients who have undergone treatment or come in intoxicated.)
9[ ]=Drug Abuse (History or current use of cocaine, heroin, etc.)
10[ ]=Marfans (Patient with diagnosis or you diagnose.)
11[ ]=HIV+ (Positive test for AIDS. Not clinical disease.)
12[ ]=AIDS (Clinical disease.)
13[ ]=CA (History of malignant disease - cured or not.)
14[ ]=Blood (History of anemia not related to blood loss; e.g., sickle cell. Also, leukemia or lymphoma even if in remission.)
15[ ]=Liver (History of hepatitis, cholangitis, but not gall bladder disease.)
16[ ]=CNS (History of brain abscess, encephalitis, or clinical dementia.)
17[ ]=Prior CVA (History of stroke with or without residual.)
18[ ]=RheumHD (History of Rheumatic Heart Disease.)
19[ ]=Pulm Htn (PA pressures >60mmHG systolic.)
20[ ]=Chronic Dialysis (Not successful transplants.)

10. SMOKING NOW:
Smoking now is within ten (10) days or at the time of catheterization. Consider answer 2 to mean mild and answer 3 to mean heavy. Do not count pipe smoking or chewing tobacco.

11. PRIOR CARDIAC SURG:
Use "Other" for tumors, stab wounds, vinebergs, etc. We have added answer 5[ ]=Pacemaker.

12. LV DYSFUNCTION:
Select the answer that reflects the estimate from non-planimetry or echo, gated, etc.

13. LVEF:
This is the actual left ventricular ejection fraction. We only consider planimetry by angiography a valid means to answer this question. For other means (gated blood pool, echo), use question #12.

14. CAD >70%:
Since it is possible that the answer for this question could come from multiple Source Questions, a no answer will be considered the same as none.
1[ ]=No (None/no coronary disease)
2[ ]=LAD
3[ ]=Cx (Includes the large OM as well if >70%.)
4[ ]=RCA (Includes PDA.)
5[ ]=Branch (Includes intermediate, large diagonal but does not define which system.)
6[ ]=L. Main
7[ ]=1 Vessel Disease
8[ ]=2 Vessel Disease
9[ ]=3 Vessel Disease

15. OTHER CARDIAC PATHOLOGY:
0[ ]=Other (For dissections of the aorta, tumors of the heart.)
1[ ]=Ao St (Aortic stenosis with a gradient >60mmHG or valve area <.8CM.)
2[ ]=Ao Insuf (Aortic insufficiency moderate or great.)
3[ ]=Mitr St (Mitral stenosis with a gradient >60mmHG.)
4[ ]=Mitr Insuf (Significant mitral leak with V-waves.)
5[ ]=Tricusp (Either stenosis, leak, or both.)
6[ ]=Pulm (Valve stenosis.)
7[ ]=Cong (Any diagnosis of congenital heart disease.)
8[ ]=Acq VSD (VSD post MI or surgery.)
9[ ]=LV Aneur (Localized paradoxical segment.)
10[ ]=Ao Aneur (Ascending, arch, or descending aneurysm.)
11[ ]=Asc Diss (Ascending dissection of the aorta.)
12[ ]=Dsc Diss (Descending thoracic aortic dissection.)

16. DATE MOST RECENT PTCA:
Enter date. This new format will allow date arithmetic later. If you have data in the old format, we can help you transform it.

17. PTCA RESULT:
Enter the initial 5-day result judged by the surgeon. A complication would include MI, MI in progress, perforation, etc., within this 5-day period.

18. NUMBER OF VESSELS PTCA'd:
Enter the number of vessels dilated prior to this operation. (A triple PTCA would count as 3.)

19. REASON FOR OP:
0[ ]=Other (Use "Other" to record an answer not listed, but that you feel is significant.)
1[ ]=Ang (Angina uncontrollable with meds.)
2[ ]=CHF (Congestive heart failure - low output state.)
3[ ]=Arrh (Arrhythmia.)
4[ ]=Anat (Anatomy; left main, etc. in otherwise stable patient.)
5[ ]=Failed PTCA (PTCA that was performed within five (5) days if you are treating the same vessel.)
6[ ]=Tumor
7[ ]=Endocarditis (Patient has had positive cultures.)
8[ ]=Trauma
9[ ]=Ao Dissection
10[ ]=Ao Aneurysm

20. PREOP STATUS:
1=Elect (Elective scheduled case.)
2=Urgent (Case moved up on schedule.)
3=Emerg (Emergency case, do ASAP.)
4=Desperate (Case that has arrested, is very near death, or in severe low output.)

21. HEMODYNAMIC STAT:
1=Stbl (Stable patient.)
2=Stbl on meds (CI >2 on meds or IABP.)
3=Unstbl on meds (CI <2 on meds or IABP.)
4=Cardiogenic shock on meds/IABP (CI <2 and falling.)

22. OXYGENATOR:
0=Other (Use "Other" to record an answer that is not listed, but that you feel is significant.)
1=H1500 (Harvey bubbler includes H1300.)
2=Shiley (Shiley bubbler.)
3=TMO (Travenol membrane.)
4=CMII (Cobe membrane.)
5=Sc125 (SciMed SM25.)
6=Sc135 (SciMed SM35.)
7=Maxima (J&J (Medtronic) membrane.)
8=BCM7 (Bentley membrane.)
9=Sarns (Membrane.)
10=Terumo (Membrane.)
11=SciUltra (SciMed Ultrox I.)

23. OTHER OPERATIVE DEVICES:
0[ ]=Other (Use "Other" to record an answer that is not listed, but that you feel is significant.)
Answer 1 has been deleted.
2[ ]=Cell Saver (Any brand.)
3[ ]=HemoConcen (Ultra filtration device to remove H2O.)
4[ ]=Dial Filter (Renal dialysis filter in circuit.)
5[ ]=IABP (Intra or post op.)
6[ ]=BioMed Pump (Biomedicus pump rather than roller pump.)
7[ ]=LHAD (Any long term use of left heart assist device post bypass.)
8[ ]=RHAD (Any long term use of right heart assist device post bypass.)
9[ ]=Art Filter (Any filter in the arterial line.)
10[ ]=Plasma Phor (For the use of plasma phoresis for platelet rich plasma.)
11[ ]=MyoTmpProb (For the use of myocardial temps where monitored.)
12[ ]=Cooling Pad (If a cooling pad is placed under or on the heart during crossclamp.)
13[ ]=Delphin Pump (Sarns centrifugal pump rather than roller pump.)

24. THROMBOLYTIC Rx:
1[ ]=tPA (tPA used within 24 hours.)
2[ ]=Strepto (Streptokinase used within 36 hours.)
3[ ]=Urokin (Urokinase infused.)
4[ ]=IntraCor (Intracoronary infusion used.)
5[ ]=IntraVein (Intravenous.)

25. CARDIOPLEGIA:
This question has been changed to a type 7.
1[ ]=None (None or just slush.)
2[ ]=Cryst (For cold +/- high K+.)
3[ ]=Blood (For cardioplegia solutions containing blood.)
4[ ]=Retro-cor sinus (Any use of retrograde perfusion.)
5[ ]=Intermit Clamp (Can be combined with any of the above answers.)

26. X-CLAMP TIME:
Enter your answer in minutes.

27. BODY SURFACE AREA:
This is a new question. Enter your answer in square meters (e.g., 2.3)

28. NO CABGs:
Enter the total number of distal coronary anastomoses for this procedure.

29. VALVES REPLACED:
Enter those valves replaced with a prosthesis. If this information is in your subprocedure section, you may be required to set up one or more secondary questions to create the criteria for transfer.

30. REPAIRS:
This includes debridement, commissurotomy, partial resection.
May be combined with questions 31-32 if the repair fails or needs supplement.

31. AORTIC PROSTHESIS:
Enter the type of prosthesis used.

32. MITRAL PROSTHESIS:
Enter the type of prosthesis used.

33. Reserved for future use.

34. Reserved for future use.

35. BLOOD PRODUCTS:
Enter any use of the blood products listed on the MCR.

36. DONOR TRANSFUSIONS:
Enter the number of units of bank blood/packed cells on this admission.

37. AUTOLOGOUS TRANSFUSIONS:
Enter the number of units of blood drawn 5 to 30 days preop for elective use at surgery. Do not enter blood withdrawn at the time of surgery or plasma phoresis.

38. Reserved for future use.

39. COMPLICATIONS:
0[ ]=Other (Use "Other" to record an answer not listed, but that you feel is significant.)
1[ ]=Reop/Bleed (Reoperation for bleeding, suspected tamponade.)
2[ ]=Renal/Mild (Mild renal shutdown not requiring dialysis.)
3[ ]=Renal/Sev (Severe renal shutdown requiring dialysis.)
4[ ]=Wound/Sev (Dehiscence or infection - for sternal wounds only.)
5[ ]=Neuro/Mild (Peripheral nerve, brachial plexus, confusion or CNS defect that clears before discharge.)
6[ ]=Neuro/Sev (CNS defect that does not clear in 7 days.)
7[ ]=Pulm/Mild (Pneumothorax, hemothorax, atelectasis, air leak.)
8[ ]=Pulm/Sev (Prolonged respiratory support, ARDS or pneumonia requiring antibiotics.)
9[ ]=MI (Intra- or post-op MI by EKG or enzymes.)
10[ ]=Low Output/Mild (Low output syndrome postop requiring drugs for a short time.)
11[ ]=Low Output/Sev (Severe low output syndrome postop with prolonged use of drugs or IABP.)
12[ ]=Clotting (Prolonged bleeding problems, low platelets, etc.)
13[ ]=Sepsis (Septicemia, pneumonia, wound infection, etc.)
14[ ]=GI/GB (GI bleed, perforated ulcer, cholecystitis, hepatitis, etc.)
15[ ]=DIC (Diffuse intravascular coagulation.)

40. Reserved for future use.

41. Reserved for future use.

42. DAYS IN ICU:
Enter the number of days - round off (e.g., 27 hrs. = 1 day, 30 hrs. = 2 days).

43. DAYS SURG/DISCH:
Enter the number of days from surgery to discharge.

44. PARSONNET RISK:
(There is no user correspondence file set up for this question.) Use the "R" option to get risk calculations once you have transferred your data to the MCR.

45. Reserved for future use.

46. TRANSFER TO NEW ENTRY:
(There is no user correspondence file set up for this question.) This is a system-generated question to follow re-entries in your registry(ies).

47. DISCH/30D STATUS:
0=UNK (This is a historical answer in use before the addition of follow-up.) DO NOT USE THIS ANSWER WHEN CREATING YOUR CORRESPONDENCE FILE.
1=ALIVE
2=DIED IN OR (Died in the operating room.)
3=DIED IN HOSP/30D (Died in or out of hospital within 30 days of surgery.)
4=REOP (Your reoperations only.)
5=DIED LATE CARDIAC (Died after 30-day interval, cardiac-related.)
6=UNRELATED DEATH (Died after 30-day interval, non-cardiac-related.)
9=LOST TO FU (Patient who can no longer be followed because cannot be located.)
The only follow-up transferred at this time is survival status. This is moved automatically if your Source Registry(ies) have follow-up.
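
The coded answers to questions 39 (COMPLICATIONS) and 47 (DISCH/30D STATUS) can be turned into binary response variables of the kind modelled in this thesis. The following Python sketch shows one way this might be done. It is illustrative only: the function names are hypothetical, and the grouping of answer codes into outcomes (for example, treating codes 2 and 3 of question 47 as operative death, and pooling the mild and severe answers of question 39) is an assumption, not the definition used in the analyses.

```python
from typing import Dict, Iterable, Optional

# Question 47 answers counted as death within 30 days of surgery.
# Assumption: 2=DIED IN OR and 3=DIED IN HOSP/30D.
OPERATIVE_DEATH_CODES = {2, 3}

# Question 47 answers with no usable 30-day status.
# Assumption: 0=UNK and 9=LOST TO FU are treated as missing.
UNKNOWN_STATUS_CODES = {0, 9}

# Hypothetical grouping of question 39 answers into binary outcomes.
# Mild and severe answers are pooled here purely for illustration.
OUTCOME_CODES: Dict[str, set] = {
    "REMS": {2, 3},    # renal shutdown
    "NEMS": {5, 6},    # neurological / CNS complication
    "PUMS": {7, 8},    # pulmonary complication
    "MI": {9},         # intra- or post-operative MI
    "LOMS": {10, 11},  # low output syndrome
}


def operative_mortality(status_code: int) -> Optional[int]:
    """Return 1 for death within 30 days, 0 otherwise, None if unknown."""
    if status_code in UNKNOWN_STATUS_CODES:
        return None
    return 1 if status_code in OPERATIVE_DEATH_CODES else 0


def complication_flags(checked: Iterable[int]) -> Dict[str, int]:
    """Map the checked answers of question 39 to 0/1 outcome indicators."""
    checked_set = set(checked)
    return {
        name: int(bool(codes & checked_set))
        for name, codes in OUTCOME_CODES.items()
    }
```

For a patient with answers 3 and 9 checked on question 39 and status 1 on question 47, `complication_flags([3, 9])` flags REMS and MI only, and `operative_mortality(1)` returns 0.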
Risk prediction models for binary response variables for the coronary bypass operation Zhang, Hongbin 1993-08-28
Title | Risk prediction models for binary response variables for the coronary bypass operation |
Creator |
Zhang, Hongbin |
Date Issued | 1993 |
Description | The ability to predict 30 day operative mortality and complications following coronary artery bypass surgery in the individual patient has important implications clinically and for the design of clinical trials. This thesis focuses on setting up risk stratification algorithms. Utilizing the binary feature of the response variables, logistic regression analyses and classification trees (recursive partitioning) were used with the variables identified by the Health Data Research Institute in Portland, Oregon. The data set contains records for 18171 patients who had coronary artery bypass surgery in one of several hospitals between 1968 to 1991. Statistical models are setup, one from each method, for six outcome variables of the surgery: 30 day operative mortality, renal shutdown complication, central nervous system complication, pneumothorax complication, myocardial infarction complication and low output syndrome. The risk groups vary across different outcomes. The history of cardiac surgery has strong association with operative mortality and patients who suffer from a central nervous system disease tend to have higher risks for all the outcomes. Further study is necessary to consider the differences among hospitals and to divide the population according to the type of previous cardiac surgery. |
Extent | 3265612 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2008-08-28 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0086332 |
URI | http://hdl.handle.net/2429/1571 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1993-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |