A STATISTICAL CLASSIFICATION OF BREAST CANCER PATIENTS BY DEGREE OF NODAL METASTASES

by

SANDRA LEE WILSON
B.S., Stanford University, 1966

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
in the Department of Mathematics

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
May 1977

© Sandra Lee Wilson, 1977

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Mathematics
The University of British Columbia
2075 Wesbrook Place
Vancouver, Canada V6T 1W5

Date: [illegible]

ABSTRACT

Recently the traditional primary method of treatment for breast carcinoma, the Halsted radical mastectomy, has been challenged. It is felt by some people that other methods may be more appropriate for certain women. Quality of life and the patient's preferences are being considered in addition to the strictly medical aspects of the problem. One procedure that attempts to increase the quality of life for certain women is the selective biopsy. Women who are proven to have lymph node metastases at the biopsy are spared a mastectomy and treated by radiation since surgery cannot remove all of the cancer.
A study was undertaken at the British Columbia Cancer Institute of selective biopsy patients diagnosed between 1955 and 1963 in order to assess the procedure in British Columbia. After studying survival for selective biopsy patients and others, it was concluded that the procedure should continue to be recommended.

Since only 14% of the patients now referred to BCCI have had a selective biopsy, I decided to try to find a statistical method for assessing the probability of nodal metastases. The problem is one of statistical classification. The literature on the theory of several statistical models was reviewed. Two models were chosen for the problem: linear discriminant analysis and logistic regression.

The classification procedure most often used is discriminant analysis. However, the linear discriminant model assumes a normal distribution and common covariance matrix for the vector of observations. Medical data is often non-normal and even discrete. The logistic probability model works well with such data. Both models were then used to study the selective biopsy problem.

The patients of the BCCI study were used as a training set to estimate the parameters of the discriminant function and the logistic probability function. Then each estimated function was used to classify the patients as a measure of the goodness of fit of the models. The logistic regression correctly classified slightly more of the patients than the discriminant analysis did. Because of the iterative nature of the logistic regression, the execution time for the logistic regression was longer than for discriminant analysis, but not beyond practical limits.

The variables that were significant in the statistical analyses could be used to help the physician make a clinical assessment of the lymph nodes of a woman with breast carcinoma. The variables indicate areas where further research would be useful.
TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES AND ILLUSTRATIONS
ACKNOWLEDGEMENTS

Chapter
1  INTRODUCTION
2  MEDICAL HISTORY OF THE PROBLEM
3  REVIEW OF STATISTICAL MODELS
   3.1  Fisher's Linear Discriminant
   3.2  Contingency Tables
   3.3  Krzanowski Location Model
   3.4  Logistic Regression
   3.5  Comparison of Linear Discrimination and Logistic Regression
   3.6  Variable Reduction
   3.7  Conclusions
4  DATA COLLECTION
   4.1  Selecting Variables to be Observed
   4.2  Data Collection
   4.3  Variables Selected for Analysis
   4.4  Missing Data
   4.5  Sources of Error
5  DATA ANALYSIS
   5.1  Computer Programs
   5.2  Selecting Cases
   5.3  Classification of 503 Cases
   5.4  Subsets of Patients for Further Classification
   5.5  Classification of Postmenopausal Patients
   5.6  Classification of Premenopausal Patients
   5.7  Summary of Results
6  CONCLUSIONS

BIBLIOGRAPHY

APPENDICES
A  TREATMENT STUDY GROUPS
B  MANCHESTER STAGING OF BREAST CANCER
C  CODING INSTRUCTIONS AND DATA CODING FORM
D  VARIABLES NOT INCLUDED IN ANALYSIS

LIST OF TABLES

Table
I        Clinical Staging versus Pathological Staging
II       Survival of Selective Biopsy and All Other New Breast Cases by Clinical Stage
III      2x2 Contingency Tables for Survival
IV       Contingency Table for 10 Year Survival for BCCI Patients and Haagensen's Patients
V        Contingency Tables for 5 and 10 Year Survival for Treatment Groups A and C
VI       Contingency Tables for 5 and 10 Year Survival for Patients with Nodal Metastases and in Treatment Group A versus Treatment Group C
VII      T-test of Survival Differences for Positive Nodes in Group A and Group C
VIII     Asymptotic Relative Efficiency of Logistic Regression to Linear Discrimination
IX       Execution Times for Logistic Regression
X        Comparison of Factors by Histologic Type
XI       Variables Chosen by Linear Discrimination for 173 Cases
XII      Classification of 173 Cases by Linear Discrimination
XIII     Variables Chosen by Logistic Regression for 173 Cases
XIV      Classification of 173 Cases by Logistic Regression
XV       Variables Chosen by Linear Discrimination for 503 Cases
XVI      Classification of 503 Cases by Linear Discrimination
XVII     Variables Chosen by Logistic Regression for 503 Cases
XVIII    Classification of 503 Cases by Logistic Regression with 15 Variables
XIX      Classification of 503 Cases by Logistic Regression with 4 Variables
XX       Variables Chosen by Discriminant Analysis for 60 Postmenopausal Patients
XXI      Variables Chosen by Discriminant Analysis for 113 Premenopausal Patients
XXII     Classification of 60 Postmenopausal Patients by Linear Discrimination with 19 Variables
XXIII    Classification of 113 Premenopausal Patients by Linear Discrimination with 16 Variables
XXIV     Subset of Variables Chosen by Discriminant Analysis for 128 Postmenopausal Patients
XXV      Classification of 128 Postmenopausal Patients by Discriminant Analysis with 4 Variables
XXVI     Subset of Variables Chosen by Logistic Regression for 128 Postmenopausal Patients
XXVII    Classification of 128 Postmenopausal Patients by Logistic Regression with 16 Variables
XXVIII   Classification of 128 Postmenopausal Patients by Logistic Regression with 3 Variables
XXIX     Classification of 128 Postmenopausal Patients by Logistic Regression with 4 Variables
XXX      Subset of Variables Chosen by Discriminant Analysis for 253 Premenopausal Patients
XXXI     Classification of 253 Premenopausal Patients by Discriminant Analysis with 10 Variables
XXXII    Subset of Variables Chosen by Logistic Regression for 253 Premenopausal Patients
XXXIII   Classification of 253 Premenopausal Patients by Logistic Regression with 16 Variables
XXXIV    Classification of 253 Premenopausal Patients by Logistic Regression with 10 Variables
XXXV     Summary of Classification Results
XXXVI    Number of Premenopausal Patients Correctly and Incorrectly Classified by Decile of Posterior Probability
XXXVII   Number of Postmenopausal Patients Correctly and Incorrectly Classified by Decile of Posterior Probability

LIST OF FIGURES

Figure
2.1  Areas of lymph node involvement in carcinoma of the breast
2.2  Actuarial survival of all 535 cases
2.3  Actuarial survival of patients in treatment group A versus patients in treatment group C
2.4  Actuarial survival of patients with positive nodes versus patients with negative nodes
2.5  Actuarial survival of patients with positive nodes and in treatment group C versus patients with positive nodes and in treatment group A
3.1  Cochran and Hopkins categories for partitioning continuous variables
3.2  The geometry of the two-stage procedure for two groups
5.1  Histogram of age-incidence for all 535 cases
5.2  Histogram of age-incidence for premenopausal patients
5.3  Histogram of age-incidence for postmenopausal patients

ACKNOWLEDGEMENTS

The data for this work was provided by the British Columbia Cancer Institute. Particular thanks go to Dr. Glen Crawford of BCCI for arranging for me to work with the data and the Statistics Department of BCCI for help with the medical files.

Thanks also go to my committee members for their comments and help in the preparation of this work. Those committee members are

Dr. Stanley J. Nash, Department of Mathematics
Dr. Brenda J. Morrison, Department of Health Care and Epidemiology
Dr. S. James Press, Faculty of Commerce

Finally, thanks must go to my husband for his encouragement and support throughout my quest for a degree.
Chapter 1

INTRODUCTION

Cancer is one of the most universally feared diseases known to man. Not only does it too often kill, it kills slowly and usually with pain and suffering. The treatments for this dread disease sometimes seem worse than the disease itself: amputations of parts of the body, radiation to kill cells (both cancerous and normal), and chemicals that poison cells.

For a woman, breast cancer usually holds the greatest fear because, in addition to the physical damage done by the disease and treatments, there is often great emotional damage. The North American and European cultures have put such emphasis upon a woman's breasts in defining her worth as a woman that deformity or loss of a breast is an emotional blow that can cripple a woman. In addition, breast cancer "is the single largest cause of death from cancer among women in the United States and Canada" [34, p. 334].

In the treatment of cancer there are presently three types of treatment.^1 The oldest and most often used as an initial therapy is surgery. If the cancer is completely removed, then the disease is no longer a problem. However, cancer does not confine itself to neat, easily excised tumors. Single cells that break off from the main mass can travel via the blood and lymphatic systems to all parts of the body and establish new colonies of cancer cells called metastases. To remove as many as possible of these cells that have broken away, cancer surgery removes wide areas of presumably normal tissue in addition to the tumor itself. This can cause major physical deformities. Even such extensive surgery is often not enough to stop all the cancer cells.

^1 A new form of treatment called immunotherapy is being tried experimentally but is not generally available and so is not discussed here.
Because some cancers are inoperable (not amenable to surgery because of the size or location of the tumor), other methods of treatment are necessary. Radiation is known to kill cells, normal and abnormal alike. Radiation can reach places that surgery cannot and does not cause as much deformity. However, it, like surgery, cannot kill all the stray cells.

A systemic treatment was needed to kill the colonizing or metastatic cells. It has been found that certain drugs kill cancerous cells faster than normal cells because cancer cells have a more rapid rate of growth than normal cells. Thus, chemotherapy became another weapon in the treatment of cancer. While chemotherapy does not cause permanent physical deformities, it does cause temporary distressing side effects.

Treatment of a particular breast cancer patient can be by any one of these methods or by any combination. Too often the treatment is dictated by the physician's personal preference rather than by the circumstances of the case. Some doctors have tried methods of assessing the best treatment for the patient by taking into account the quality of life of the patient and other possible rewards under alternative treatments.

One such study was completed in the spring of 1976 at the British Columbia Cancer Institute (BCCI). A surgical procedure called a selective biopsy was done after an initial diagnosis of breast cancer. This procedure attempted to determine whether a patient had lymph node metastases or not. Depending on the status of the lymph nodes, a course of treatment was recommended. Between the years 1955 and 1963, 557 women had a selective biopsy done and were referred to BCCI for further treatment. The medical staff at BCCI undertook a study to compare the results of different treatment methods for these patients. Some results of that study will be reported in Chapter 2 as the background for the problem to be studied here.
Definitive conclusions are not always possible from the selective biopsy because of contamination, loss of material, or incomplete dissection. Also many patients do not have the selective biopsy done. A statistical model is proposed in this paper to augment, and possibly supplant for some patients, the surgical procedure. The patients that provide the data base for this work are the same ones that were used in the study conducted at BCCI.

The statistical problem of deciding whether there are nodal metastases or not is a two-group classification problem. Four models for classification of mixed (discrete and continuous) data will be discussed in Chapter 3. Two of these models, linear discrimination and logistic regression, will then be applied to the problem of classifying breast cancer patients by degree of nodal metastases.

Chapter 2

MEDICAL HISTORY OF THE PROBLEM

In 1882 Dr. William Halsted began performing the first true radical mastectomies in Baltimore, Maryland. A radical mastectomy involves the "removal of the breast, pectoral muscles, axillary lymph nodes, and associated skin and subcutaneous tissue" [17]. Surgeons quickly adopted his operation as the standard treatment for breast carcinoma. It remains the most widely used procedure today and is the standard against which other treatments are judged.

Other surgical treatments range from a lumpectomy (removal of only the tumor mass) through super-radical operations that remove even more tissues than the radical mastectomy does. These surgical procedures combined with various types of radiation and chemotherapy produce a large range of combinations of treatments. In the women to be studied here there were twelve different types of treatment combinations (see Appendix A).
Only women are considered in this study because of the different factors that are thought to affect the disease in men and women.^1

The variations in treatment reflect the preferences of the physician treating the woman in addition to the variations in the disease process. The treatment a woman gets depends more on the doctor she consults than on the state of her disease. To try to eliminate these differences caused by doctor variability, attempts have been made to set up standard treatment protocols. During the time of the study (1955-1963) radical mastectomy with post-operative radiation therapy was the treatment of choice for operable cases of breast carcinoma seen at the British Columbia Cancer Institute (BCCI). Radiation alone was considered the best treatment for inoperable cases.

The next step was to decide which cases were operable. This has been where most of the differences of opinion occurred. It is generally agreed that growth of the disease beyond the breast makes a case inoperable. Any type of metastases constitutes such growth. All researchers agree that the prognosis is poor, no matter what the treatment, if, as Haagensen says, "metastases had reached these lymph nodes at the periphery of the regional lymph node filter at the apex of the axilla and in the internal mammary chain" [22, p. 691]. Thus, it seems reasonable to consider patients with nodal metastases as having growth beyond the breast and thus being inoperable.

The method of assessment of these apical and internal mammary nodes was the next problem. Several clinical^2 assessment systems have been devised in order to try to predict the pathological findings.

^1 Breast carcinoma in men manifests itself in the same areas (the lymph nodes, muscles, and breast tissues); however, the hormonal influences are thought to be quite different.
The system used at BCCI is the Manchester staging of breast cancer (see Appendix B). There are four clinical stages that are supposed to correspond to four pathological stages. The stage I's are early disease while the stage IV's represent advanced disease for both clinical and pathological scales. Clinical III's and clinical IV's are generally considered inoperable "because the likelihood of cure by radical mastectomy is so poor that other methods will do as well or better" [22, p. 691]. Thus, we consider only clinical I's and II's as possible operable cases.

^2 Clinical findings are those obtained from physical examination of the patient without surgery. Pathological findings are those obtained from microscopic examination of surgically obtained tissue samples.

Unfortunately, the clinical staging systems have not done very well at predicting pathological staging. As Haagensen says, "clinical features alone upon which we relied for the selection of patients betrayed us..." [22, p. 691]. The results from BCCI, as presented in Table I, are typical. For clinical I's 50% have negative nodes, while for clinical II's 49% had pathologically involved apical or internal mammary nodes.

In order to permit pathological review of the nodes before a radical mastectomy was carried out, a procedure called a selective or triple biopsy was devised. Dr. C. D. Haagensen developed and used this procedure between 1951 and 1966. His results are the only published findings of large groups using selective biopsy and surgery. With a combination of clinical staging and selective biopsy, it was hoped that a better assessment of the state of the disease could be made before any treatment, including mastectomy, was begun.
Table I
Clinical Staging versus Pathological Staging (489 cases)

Pathological stage     Clinical I   Clinical II   Clinical III & IV
I                         50.2%        20.4%           11.1%
II                        21.4%        30.6%            2.8%
III                        0.8%         0.0%            8.3%
IV                        27.6%        49.0%           77.8%
Total                    100.0%       100.0%          100.0%
Number of patients         257          196              36

The selective biopsy is recommended for Clinical I's with inner half lesions, central lesions, or outer half masses with tumors larger than 3 cm and for all Clinical II's [11]. It begins with the original biopsy of the tumor mass in the breast. When the rush report^3 is positive for malignancy, the apical and internal mammary lymph nodes (areas III and IV in Figure 2.1) are also biopsied. In early use of the procedure, the tissues obtained in the second stage were also subject to a rush section and further surgery, if indicated, was undertaken immediately. However, results of the rush sections of the nodes were often inconclusive and the paraffin sections were necessary for accurate evaluation. Thus, the present procedure evolved in which the patient is returned to her room until the permanent sections are read. If the internal mammary and apical lymph nodes are negative, the woman is returned to the operating room for a radical mastectomy and later referred for radiation treatment to the supraclavicular (another name for apical) and internal mammary areas. When any of the nodes are positive, the patient has no further surgery and is referred for radiotherapy to the breast and all node areas [22, 11].

^3 A rush report is the report of the frozen section done while the patient is still in surgery. A permanent, or paraffin, section is done later because it is more accurate and shows greater detail.

Haagensen stopped doing selective biopsies in 1967 because he felt that he had learned all that he could from them [22].
The staff of the BCCI did not feel that was an adequate reason for discontinuing a practice that offered such advantages and so they continue to recommend its use for Clinical I's and II's. Since most patients are referred to BCCI after initial surgery has been performed, the decision to do the biopsy usually remains with the attending physician. The procedure was most popular during the late 1950's when a high of 43.5% of the patients referred to BCCI had had the biopsy done. It has declined in popularity until the present time when about 18% of the patients referred have had the selective biopsy performed. Because of the increasing incidence of breast cancer and increasing referral to BCCI, the number of patients having the biopsy each year has increased despite the percentage decrease.

[Figure 2.1. Areas of lymph node involvement in carcinoma of the breast.]

In order to assess the results of using the selective biopsy to select patients for surgery in British Columbia, a retrospective study was undertaken at BCCI of selective biopsy patients whose date of diagnosis was between 1955 and 1963. Since five and ten year survival rates are the standards for comparison in cancer therapy, the years to be studied were chosen to ensure availability of ten year survival data for all patients.

A total of 557 women were referred to the BCCI after selective biopsy in the specified years. Twenty-two patients were eliminated from the study because of previous breast malignancy (14) or other systemic malignancy (8). Skin cancer and carcinoma in situ of the uterus did not constitute cause for being removed from the study. The remaining 535 women were then put into treatment groups by the method of treatment they actually received. Ideally there would have been only two groups:

1. radical mastectomy with radiation to apical and internal mammary nodes, and
2. radiation to original lesion and axillary, apical, and internal mammary drainage areas.
However, because patients came from many different referring surgeons, there were twelve different treatment groups (see Appendix A for details of the groups). Only the two recommended groups (called C and A) had enough cases to give significant statistical results.

The first concern of the doctors was that the selective biopsy procedure did not harm the patient. It was known that there was little morbidity associated with the procedure. To judge whether it affected survival, all selective biopsy patients were compared to all other new breast cases for 1955 to 1963. The data are presented in Table II.

Table II
Survival of Selective Biopsy and All Other New Breast Cases by Clinical Stage (1955-1963)

Selective Biopsies
Clinical Stage   Number of Cases   Alive at 5 Years   Alive at 10 Years
I                      271               207                 154
II                     202               107                  75
III                     35                18                  14
IV                      13                 2                   0
Unknown                 14                10                   9
Total                  535               344                 252

Other New Cases
Clinical Stage   Number of Cases   Alive at 5 Years   Alive at 10 Years
I                      529               390                 304
II                     266               137                  88
III                    133                59                  34
IV                     235                29                   8
Unknown                 34                21                  14
Total                 1197               636                 448

To test whether the survival rates were worse for selective biopsy patients, a series of 2x2 contingency tables were formed for five and ten year survival. The contingency tables are presented in Table III, where the different treatments are selective biopsy or not selective biopsy.

Table III
2x2 Contingency Tables for Survival

                          5 years                  10 years
                   Alive   Dead   Total     Alive   Dead   Total
Clinical I
  Other             390     139     529      304     225     529
  Selective         207      64     271      154     117     271
  Total             597     203     800      458     342     800
Clinical II
  Other             137     129     266       88     178     266
  Selective         107      95     202       75     127     202
  Total             244     224     468      163     305     468
Clinical III
  Other              59      74     133       34      99     133
  Selective          18      17      35       14      21      35
  Total              77      91     168       48     120     168
Clinical IV
  Other              29     206     235        8     227     235
  Selective           2      11      13        0      13      13
  Total              31     217     248        8     240     248
Clinical Unknown
  Other              21      13      34       14      20      34
  Selective          10       4      14        9       5      14
  Total              31      17      48       23      25      48
Total
  Other             636     561    1197      448     749    1197
  Selective         344     191     535      252     283     535
  Total             980     752    1732      700    1032    1732

We now wish to test whether the proportions in the two treatment groups differ significantly for each contingency table. That is, we wish to test the hypothesis that survival is independent of the treatment group. Since we must estimate the parameters, the appropriate test is a chi-square with one degree of freedom. We calculate

\[ X^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(f_{ij} - F_{ij})^2}{F_{ij}} \tag{2.1} \]

where $f_{ij}$ is the (i,j)th observed cell frequency and $F_{ij}$ is the corresponding expected cell frequency. $F_{ij}$ is calculated as follows (a dot indicates summation over that index):

\[ F_{ij} = \frac{f_{i\cdot}\, f_{\cdot j}}{f_{\cdot\cdot}} \tag{2.2} \]

where $f_{i\cdot}$ is the i-th row marginal total, $f_{\cdot j}$ is the j-th column marginal total, and $f_{\cdot\cdot}$ is the grand total of all cases. Only $F_{11}$ needs to be calculated that way since all other $F_{ij}$ are uniquely determined by $F_{11}$ and the fixed row and column marginals. It is a property of 2x2 tables that $f_{ij} - F_{ij}$ is the same except for sign for all i, j = 1, 2. Thus, we get

\[ X^2 = (f_{11} - F_{11})^2 \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{1}{F_{ij}} \tag{2.3} \]

and $X^2$ is asymptotically distributed as a chi-square with one degree of freedom [44]. A correction for continuity should be added giving the final result

\[ X^2 = \left( |f_{11} - F_{11}| - 0.5 \right)^2 \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{1}{F_{ij}} \tag{2.4} \]

The above approximate procedure can be used when the numbers in the tables are large (all expected frequencies are greater than five). However, when the total number of cases is less than 20 or the smallest expected frequency is less than five, it should not be used [44]. In 1935 Fisher showed that an exact test of significance based on the hypergeometric probability distribution^4 could be made. Finney et al. [14] calculated these probabilities and published tables of the results for up to 40 total cases.
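As an illustration of equations (2.2) through (2.4) and of the exact hypergeometric calculation, the following minimal Python sketch evaluates the corrected chi-square statistic for the combined 10-year table of Table III and a one-sided exact probability for the Clinical IV 10-year table. The function names and the choice of a one-sided lower tail are assumptions made for illustration; they are not the procedure or tables actually used in the BCCI study.

```python
from math import comb

def expected_counts(table):
    """Expected frequencies F_ij = f_i. * f_.j / f_.. for a 2x2 table (eq. 2.2)."""
    rows = [sum(r) for r in table]
    cols = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(rows)
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

def chi_square_full(table):
    """Uncorrected statistic summed cell by cell (eq. 2.1)."""
    F = expected_counts(table)
    return sum((table[i][j] - F[i][j]) ** 2 / F[i][j]
               for i in range(2) for j in range(2))

def chi_square_2x2(table, continuity=True):
    """Shortcut form (eq. 2.3), with the continuity correction of eq. 2.4 by default."""
    F = expected_counts(table)
    dev = abs(table[0][0] - F[0][0])
    if continuity:
        dev = max(dev - 0.5, 0.0)
    return dev ** 2 * sum(1.0 / F[i][j] for i in range(2) for j in range(2))

def fisher_lower_tail(table):
    """Exact probability that the second-row 'alive' count is at most the
    observed value, with all margins held fixed (hypergeometric distribution)."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    lo = max(0, c1 - r1)  # smallest feasible count in the second-row 'alive' cell
    ways = sum(comb(r2, k) * comb(r1, c1 - k) for k in range(lo, c + 1))
    return ways / comb(r1 + r2, c1)

# Rows are (Other, Selective); columns are (Alive, Dead), as in Table III.
total_10yr = [[448, 749], [252, 283]]
clinical_iv_10yr = [[8, 227], [0, 13]]

print("corrected chi-square, all stages, 10 years:", chi_square_2x2(total_10yr))
print("exact lower-tail probability, Clinical IV, 10 years:",
      fisher_lower_tail(clinical_iv_10yr))
```

The corrected statistic for the combined table can be compared with 6.635, the upper 1% point of a chi-square with one degree of freedom. The Clinical IV table has an expected frequency below five, which is the situation in which the exact calculation is preferred over the approximation.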
When the contingency tables in Table III involved small numbers, these exact probabilities were used for testing for significant differences.

^4 The exact conditional probability is

\[ \frac{f_{1\cdot}!\; f_{2\cdot}!\; f_{\cdot 1}!\; f_{\cdot 2}!}{f_{\cdot\cdot}!\; f_{11}!\; f_{12}!\; f_{21}!\; f_{22}!} \]

All 2x2 tables for individual clinical stages at five and ten years showed differences that were not significant at the .05 level. Thus, we cannot reject the hypothesis that survival was the same. However, some tables had a great disparity between the numbers of patients in the two treatment groups (for example, Clinical IV: Other 235, Selective 13). For that reason it was decided to test five and ten year survival for all clinical stages combined. When the chi-square test was used, it showed a significant difference (P < .01). Thus, we would reject the hypothesis that total survival was the same. Since total survival was better for the selective biopsy group, we can conclude that selective biopsy at least did not decrease patient survival. Thus, we conclude that the selective biopsy did not harm the patients in this study.

The fact that selective biopsy patients demonstrated longer survival may be a result of being included in an "experimental" group. They may have been followed more closely and thus had metastases (local and distant) treated earlier. It is also possible that a larger proportion of these patients received radiotherapy as part of their treatment and consequently recurrences were delayed.

The next step in assessing the selective biopsy as used in British Columbia was to compare survival rates to published results and to compare survival rates between groups. It was concluded that the local survival rates were not significantly different (P > .05) from the published results of Haagensen [22] (Table IV). Five and ten year survivals were then compared for treatment groups A (no mastectomy, standard radiation) and C (radical mastectomy, standard radiation). The results are shown in Table V.
The differences were significant at the .05 level for both time periods. This was expected since the A cases were known to be advanced cases while the C cases were known to be early cases.

Table IV
Contingency Table for 10 Year Survival for BCCI Patients and Haagensen's Patients

              Alive   Dead   Total   Survival
Haagensen      550     526    1076     51.1%
BCCI           254     279     533     47.7%
Total          804     805    1609

Table V
Contingency Tables for 5 and 10 Year Survival for Treatment Groups A and C

5 years       Alive   Dead   Total   Survival
A               61      85     146     41.8%
C              241      75     316     76.3%
Total          302     160     462

10 years      Alive   Dead   Total   Survival
A               26     120     146     17.8%
C              198     118     316     62.7%
Total          224     238     462

The final comparison was for five and ten year survivals for patients known to have nodal metastases and treated by mastectomy (group C) versus those known to have nodal metastases and treated by irradiation (group A) (Table VI). The number of patients with nodal disease and treated by mastectomy was small (18), but not too small to attempt to draw some conclusions. It was found that at five years there was no significant difference at the .05 level of significance. However, at ten years we could reject the hypothesis of no significant differences (P < .05). It was not known what selection factors were used to select the cases with positive nodes for surgery (a typical problem with any retrospective study).

Table VI
Contingency Tables for 5 and 10 Year Survival for Patients with Nodal Metastases and in Treatment Group A versus Treatment Group C

5 years       Alive   Dead   Total
A               57      84     141
C               10       8      18
Total           67      92     159

10 years      Alive   Dead   Total
A               21     120     141
C                8      10      18
Total           29     130     159

The recommendation was made by the BCCI medical staff to continue use of the selective biopsy.

After the data were collected for the analysis that comprises the main body of this work, more information was available on survival. For some patients a 22 year history of survival was available and thirteen year survival information was available for all patients.
an actuarial survival study to see what had happened five and ten and in the years after ten. I decided to do in the years between The Biomed program BMDllS-Life Tables and Survival Rate was used to produce actuarial survival rates. The results are shown in Figure 2.2 through 2.5. overall survival for selective biopsy patients. Figure 2.2 shows the The results at five and ten years compare favorably with standard survival results [7]: Standard - 5 years 60% Selective biopsy - 5 years 64.6% Standard - 10 years 20% Selective biopsy - 10 years 47.7% There is also a 28% survival for 22 years which is encouraging. A comparison of Figures 2.3 and 2.4 shows the curves to be almost identical — the survival for group A and the survival for positive nodes are nearly the same and the survival negative nodes are the same. for group C and the survival for Again that is the expected result since A cases were supposed to be chosen because of the presence of positive nodes while C cases are supposed to have negative nodes. Figure 2.5 shows the 100 90 80 70 60 50 40 CO 30 20 10 10 15 YEARS Figure 2.2. Actuarial survival of all 535 cases. I 20 i 22 20 100 90 GROUP A (146) X GROUP C (317) • 80 70 60 50 40 • •• • 30 20 X X 10 I- X x X X X x Jl 10 15 20 22 YEARS Figure 2.3. Actuarial survival of patients in treatment group A versus patients in treatment group C. 21 100 X • 90 NEGATIVE (356) • POSITIVE (179) X 80 I 70 k 60 r 50 k | 40 h • • • • CO 30 k 20 X X X X 10 / X X 10 15 X X X X J I 20 22 YEARS Figure 2.4. Actuarial survival of patients with positive nodes versus patients with negative nodes. 22 100 GROUP A (141) 90 U GROUP C ( 18) 80 70 k 60 50 i-3 40 CO 30 20 X X 10 X X X X X X X -I 10 15 20 22 YEARS Figure 2 . 5 . A c t u a r i a l survival of patients with p o s i t i v e nodes and i n treatment group C versus patients with p o s i t i v e nodes and in treatment group A. 23 survival curves by treatment group f o r patients with p o s i t i v e nodes. 
The curves are close together through five years. Then between six and twelve years they are divergent. After twelve years they approach each other again. The Biomed survival program also calculates a t-test for the differences between groups each year. The results of these t-tests appear in Table VII. These results confirm that there is a significant difference (P < .05) in the years six through twelve only. It is not clear just what significance this has for the treatment decision problem. More cases with positive nodes and treatment by mastectomy need to be studied with that question in mind.

Another question of concern about the selective biopsy was the local recurrence rate, that is, recurrence of the disease in the breast and associated lymph drainage areas. Not all the data were available on recurrence for the BCCI study. It was known that local recurrences were more of a problem in study group A. However, most local recurrences were adequately controlled. More work remains to be done on the question of local recurrence.

To complete the assessment of the use of selective biopsy, the medical staff have been asking questions about the quality of life for the patients. They feel that sparing a woman with advanced breast carcinoma the mutilation of having a breast removed is giving her a better quality of life. However, the quality of life also has a time component to it.
Mueller and Jeffries studied the questions of rate of dying and causes of death in breast carcinoma and concluded

    Breast cancer treatment should:
    a) Treat the cancer only when and where it is known to exist;
    b) Not be proposed as a means of influencing either time of death or cause of death.
    Measurements of quality of life should be established and should constitute the only realistic objective of treatment [34, p. 339].

Table VII. T-test of Survival Differences for Positive Nodes in Group A and Group C

Year    T-statistic    Degrees of Freedom (df)    Tabled t(.975)
  1       -0.26               157                    1.9763
  2       -1.00               157                    1.9763
  3       -1.49               157                    1.9763
  4       -1.67               157                    1.9763
  5       -1.22               157                    1.9763
  6       -2.03*              157                    1.9763
  7       -2.28*              157                    1.9763
  8       -2.58*              157                    1.9763
  9       -2.83*              157                    1.9763
 10       -2.44*              157                    1.9763
 11       -2.08*              157                    1.9763
 12       -2.08*              157                    1.9763
 13       -1.79               157                    1.9763
 14       -1.60               155                    1.9765
 15       -1.60               155                    1.9765
 16       -0.86               118                    1.980
 17       -0.86               118                    1.980
 18       -0.86               118                    1.980
 19       -0.86               118                    1.980
 20       -0.86               118                    1.980
 21       -0.86               118                    1.980
 22       -0.86               118                    1.980

* significant

Thus, the conclusion of the BCCI study was that the selective biopsy should be recommended for patients with Clinical stage I or Clinical stage II disease [11]. Since BCCI is a referral agency for all of British Columbia and the Yukon, most of the range of stages were well represented in the study. The group of patients was deficient in very early cases of Clinical stage I which had received surgery and then were not referred for further treatment. Presumably these would all have had negative nodes since evidence of any nodal metastases in the surgical specimen would be cause for referral. The study could also be deficient in very advanced stages where the patient would be beyond any treatment. Since that stage of disease would never be recommended for selective biopsy, we need not worry about lack of such cases.
After completing the statistical analysis of the above study for BCCI, I became interested in trying to find a statistical model that could classify patients into positive or negative nodes when surgical results were not available. Since 82% of the patients now being referred to BCCI have not had a selective biopsy, it could be useful for helping to decide on the best treatment for these patients. It could also be used with those patients for whom the selective biopsy was inconclusive.

Chapter 3

REVIEW OF STATISTICAL MODELS

The medical diagnosis problem presented in the previous chapter can be considered as a statistical classification or prediction problem. Given a vector of observable variables for a patient, we wish to predict which group that patient belongs to (positive or negative nodes). Several different models have been suggested for this problem. Four of the models will be discussed here: Fisher's linear discriminant function, multiway contingency tables, Krzanowski's location model, and the logistic probability model. Discussion will include the assumptions of the models, parametric estimation methods, problems in using the models, and availability of computer routines.

3.1 Fisher's Linear Discriminant

In 1936, R.A. Fisher proposed a linear discriminant function to classify a p dimensional vector into one of two known multivariate normal populations, given that the observation was from one of the two and that they had the same covariance matrix. We assume that

$$X \sim N_p(\mu_1, \Sigma) \text{ with probability } q_1$$

and

$$X \sim N_p(\mu_0, \Sigma) \text{ with probability } q_0,$$

where $q_1 + q_0 = 1$ and $\Sigma$ is the common covariance matrix. The linear discriminant function is

$$U(X) = \beta_0 + \beta' X, \tag{3.1}$$

where

$$\beta_0 = \ln\frac{q_1}{q_0} - \tfrac{1}{2}\,\beta'(\mu_1 + \mu_0) \tag{3.2}$$

and

$$\beta' = (\mu_1 - \mu_0)'\,\Sigma^{-1} \tag{3.3}$$

so that

$$\beta_i = \sum_{j=1}^{p} (\mu_{1j} - \mu_{0j})\,\sigma^{ij} \tag{3.4}$$

with $\Sigma^{-1} = (\sigma^{ij})$.
If $U(X) > 0$, $X$ is assigned to population 1; otherwise, $X$ is assigned to population 0.¹

The goal of the parameter estimation procedure is to minimize expected total misclassification cost. Often the costs for misclassification are quite different for the two populations (death versus further testing in a medical diagnosis problem). One can include these costs in the model and then minimize the expected total cost of misclassification.

¹ In actual medical practice, those individuals for whom $U(X)$ is zero or near zero would not be classified without further investigation. A two-stage procedure which allows further observation of borderline cases is discussed later in this chapter in the section on variable reduction.

$C(h \mid k)$ is defined as the cost of misclassifying an individual to group $h$ when that individual is a member of group $k$. The expected misclassification cost for group $k$ is $q_k\,C(h \mid k)$ and the total expected misclassification cost is

$$\sum_{k} q_k\,C(h \mid k). \tag{3.5}$$

Replacing $q_k$ by $q_k\,C(h \mid k)$, the linear discriminant model becomes

$$U(X) = \beta_0 + \beta' X \tag{3.6}$$

with

$$\beta_0 = \ln\frac{q_1\,C(0 \mid 1)}{q_0\,C(1 \mid 0)} - \tfrac{1}{2}\,\beta'(\mu_1 + \mu_0), \tag{3.7, 3.8}$$

and

$$\beta' = (\mu_1 - \mu_0)'\,\Sigma^{-1} \tag{3.9}$$

so that

$$\beta_i = \sum_{j=1}^{p} (\mu_{1j} - \mu_{0j})\,\sigma^{ij}. \tag{3.10}$$

Again $X$ is classified into group 1 when $U(X) > 0$, and into group 0 otherwise.

When the parameters of the populations are unknown, they must be estimated. We shall assume that the sampling is random from the mixture of populations so that the sampling mixture approximates the population mixture. When there is a low incidence of one population, a two-sample procedure (separate samples for the different groups) may be more appropriate [1].
However, the parameter estimation would be different for that case. Since the patients in the selective biopsy study were sampled from the mixture of populations, we will not consider the separate sample situation.

Let $n_h$ = number of observations from group $h$, $h = 0, 1$, and $x_{iht}$ = the i-th characteristic of the t-th individual in the h-th group. Then

$$\bar{x}_{ih} = \frac{1}{n_h} \sum_{t=1}^{n_h} x_{iht}, \quad h = 0, 1, \tag{3.11}$$

$$S_{ij,h} = \frac{1}{n_h - 1} \sum_{t=1}^{n_h} (x_{iht} - \bar{x}_{ih})(x_{jht} - \bar{x}_{jh}), \quad h = 0, 1, \tag{3.12}$$

and

$$S_{ij} = \frac{(n_1 - 1)\,S_{ij,1} + (n_0 - 1)\,S_{ij,0}}{n_1 + n_0 - 2}. \tag{3.13}$$

To estimate the population proportions we use the sample proportions. Thus,

$$\hat{q}_h = \frac{n_h}{n_1 + n_0}, \quad h = 0, 1, \tag{3.14}$$

so that

$$\frac{\hat{q}_1\,C(0 \mid 1)}{\hat{q}_0\,C(1 \mid 0)} = \frac{n_1\,C(0 \mid 1)}{n_0\,C(1 \mid 0)}.$$

The population means are estimated by the sample means: $\hat{\mu}_h = \bar{x}_h$. Let $\hat{\sigma}_{ij} = S_{ij}$ be the (i,j)th element of $\hat{\Sigma}$ and $\hat{\sigma}^{ij}$ be the (i,j)th element of $\hat{\Sigma}^{-1}$. Thus, the estimated function is

$$\hat{U}(X) = \hat{\beta}_0 + \hat{\beta}' X \tag{3.15}$$

where

$$\hat{\beta}_0 = \ln\frac{n_1\,C(0 \mid 1)}{n_0\,C(1 \mid 0)} - \tfrac{1}{2} \sum_{i=1}^{p} \hat{\beta}_i\,(\bar{x}_{i1} + \bar{x}_{i0}) \tag{3.16}$$

and

$$\hat{\beta}_i = \sum_{j=1}^{p} (\bar{x}_{j1} - \bar{x}_{j0})\,\hat{\sigma}^{ij}. \tag{3.17}$$

If $\hat{U}(X) > 0$, $X$ is assigned to population 1; otherwise, $X$ is assigned to population 0. Rewriting $\hat{U}(X)$ to clarify the estimation problems we get

$$\hat{U}(X) = \ln\frac{n_1\,C(0 \mid 1)}{n_0\,C(1 \mid 0)} + (\bar{x}_1 - \bar{x}_0)'\,\hat{\Sigma}^{-1}\left[X - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_0)\right]. \tag{3.18}$$

We see from (3.18) that in order to estimate the linear discriminant function we must estimate $\mu_1$, $\mu_0$, and $\Sigma$. Unless some simplifying assumptions are made about $\Sigma$ (for example, independence of variates), the estimation problem can become quite substantial.

Departures of the data from normality are a cause for concern with this model. Although little has been done to show robustness of the linear discriminant functions, many practical applications proceed with linear discrimination after stating that the data are non-normal or even discrete [see for example, 47]. This problem will be of great concern here because it is known that the medical data are non-normal and often not even continuous.
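The estimation steps for the linear discriminant can be sketched in a few lines. The example below is a minimal illustration, not the Biomed routine used in this work: it pools the within-group cross-products, solves for the coefficient vector, and applies the rule "assign to population 1 when the estimated U(X) is positive". The two-variable data set, the function names, and the default equal costs are all hypothetical.

```python
import math

def mean_vector(rows):
    """Column means of a list of equal-length observation tuples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def pooled_covariance(group1, group0):
    """Pooled within-group covariance matrix (sums of cross-products / (n1 + n0 - 2))."""
    p = len(group1[0])
    total = [[0.0] * p for _ in range(p)]
    for rows in (group1, group0):
        m = mean_vector(rows)
        for x in rows:
            for i in range(p):
                for j in range(p):
                    total[i][j] += (x[i] - m[i]) * (x[j] - m[j])
    dof = len(group1) + len(group0) - 2
    return [[v / dof for v in row] for row in total]

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z

def fit_discriminant(group1, group0, cost_0_given_1=1.0, cost_1_given_0=1.0):
    """Return (beta0, beta) for the estimated discriminant function."""
    x1, x0 = mean_vector(group1), mean_vector(group0)
    s = pooled_covariance(group1, group0)
    beta = solve(s, [u - v for u, v in zip(x1, x0)])  # S^{-1} (xbar1 - xbar0)
    ratio = (len(group1) * cost_0_given_1) / (len(group0) * cost_1_given_0)
    beta0 = math.log(ratio) - 0.5 * sum(b * (u + v) for b, u, v in zip(beta, x1, x0))
    return beta0, beta

def classify(beta0, beta, x):
    """Assign to population 1 when the estimated U(x) is positive, else 0."""
    return 1 if beta0 + sum(b * xi for b, xi in zip(beta, x)) > 0 else 0

# Hypothetical training data: two measurements per individual.
group1 = [(2.0, 3.0), (3.0, 3.5), (2.5, 4.0), (3.5, 4.5)]
group0 = [(0.5, 1.0), (1.0, 0.5), (1.5, 1.5), (0.0, 1.0)]
beta0, beta = fit_discriminant(group1, group0)
print(classify(beta0, beta, (3.0, 4.0)), classify(beta0, beta, (0.5, 1.0)))
```

With equal group sizes and equal costs the logarithmic term vanishes, so each group's own mean vector is always assigned to its own population; unequal costs shift the cutoff through the logarithmic term only.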
An attractive feature of the linear discriminant model for applications is the widespread availability of computer routines for estimating the function and classifying observations. The discriminant analysis is based on a linear regression and so is easily accessible. The availability of the computer program encourages the user to ignore the departures from the model assumptions for ease of computation.

3.2 Contingency Tables

Each individual has a set of attributes describing him. When all the data are discrete, all individuals with the same set of attributes are counted and that count is put into the appropriate cell of a contingency table. The structure of the table for two variables is a rectangular array with columns corresponding to levels of one variable and rows corresponding to levels of the other variable.

The simplest contingency table is a 2 x 2 table. There are two levels of attribute A and two levels of attribute B. The model would appear as below:²

                 B
              1       2
A     1     p11     p12     p1.
      2     p21     p22     p2.
            p.1     p.2      1

where $p_{ij}$ is the (i,j)th cell probability,

$$p_{i\cdot} = \sum_{k=1}^{2} p_{ik}, \quad i = 1, 2, \tag{3.19}$$

$$p_{\cdot j} = \sum_{k=1}^{2} p_{kj}, \quad j = 1, 2, \tag{3.20}$$

and

$$\sum_{i=1}^{2} \sum_{j=1}^{2} p_{ij} = 1. \tag{3.21}$$

The $p_{i\cdot}$ and $p_{\cdot j}$ are the marginal probabilities. The model assumptions are that all categories or contingencies are included (the probabilities sum to one) and that all variables are discrete (or have been made discrete).

² The model could also be written in terms of the expected frequencies $m_{ij} = N p_{ij}$.

This model and equations (3.19), (3.20), and (3.21) easily generalize to higher dimensions. The general model assumptions remain the same: all variables are discrete and the probabilities in all tables sum to one. In higher way contingency tables, however, one usually makes some simplifying assumptions about interaction terms to make the problem more manageable.
One common simplification is to assume that bivariate interactions are allowed, but higher order interactions are not.

Cochran and Hopkins [8] suggested that when there is a mixture of continuous and qualitative variables, all the continuous variables should be made qualitative. They concluded that the optimal partition would be into six states as shown in Figure 3.1. When the variables are all qualitative, the problem has been reduced to analysing a p-way contingency table. The question with this approach is how much information is lost. Cochran and Hopkins felt the loss of information was not significant; however, many others have found the loss unacceptable and sought ways of utilizing the full information.

Estimation of the cell frequencies (or cell probabilities) in a contingency table is simple for small tables and large data sets, but can become quite complicated, if not impossible, for larger tables and moderate data sets. The number of individuals observed for each cell is enumerated and that count is the observed frequency or estimated frequency. The problems arise when there are many cells to be estimated and not very many data points. A small 3 x 5 x 7 table having fixed marginals has 48 parameters to be estimated. Another problem is empty cells. The frequency method can only estimate an empty cell as zero, while it is quite likely that a different sample would show the cell to be non-empty.

[Figure 3.1. Cochran and Hopkins categories for partitioning continuous variables. $U_1$ and $U_2$ are calculated from the data.]

It has been suggested by many authors [17, 24, and 35 for example] that log-linear models are appropriate for analysing contingency tables. Log-linear models fit the contingency table model assumptions, while solving the estimation problems discussed above. In many cases, empty cells can be estimated as non-zero with log-linear models.
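The empty-cell problem described above is easy to see in a small experiment. The sketch below builds the 105 cells of a 3 x 5 x 7 table, fills them from a made-up sample of 60 individuals, and counts how many cells the frequency method estimates as exactly zero. The sample is generated by an arbitrary arithmetic pattern and is purely hypothetical; it is not the BCCI data.

```python
from collections import Counter
from itertools import product

LEVELS = (3, 5, 7)  # a 3 x 5 x 7 table, as in the text

# Hypothetical sample of 60 individuals (made-up pattern, not study data).
sample = [(i % 3, (2 * i) % 5, (3 * i) % 7) for i in range(60)]

counts = Counter(sample)
cells = list(product(*(range(k) for k in LEVELS)))
n = len(sample)

# Frequency estimate of each cell probability: observed count / sample size.
estimates = {cell: counts[cell] / n for cell in cells}
empty = [cell for cell in cells if counts[cell] == 0]

print(len(cells), "cells;", len(empty), "estimated as exactly zero")
```

Even with more than half as many observations as cells, a large share of the table receives a zero estimate, which is the situation that motivates smoothing the table with a model rather than using raw frequencies.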
Also with a few simplifying assumptions (high order interactions are zero for example), there are many fewer parameters to estimate so that for a given data set size the estimated parameters of the log-linear model will be based on more observations per parameter. A more complete discussion of log-linear models will be deferred to Section 3.4.

Another problem with high dimension contingency table analysis has been the general unavailability of computer routines to do the analysis. Recently UCLA's Biomed package has included a program for analysis of a multiway table. Greater availability of this program will encourage more use of contingency table analysis with high dimension problems. The availability of computer routines will not alleviate the problems of large numbers of parameters to be estimated and loss of information with partitioned continuous variables.

3.3 Krzanowski Location Model

In order to use all the information available when there is a mixture of continuous and discrete data W.J. Krzanowski proposed a likelihood ratio classification rule based on the location model [27]. In the location model

$$X = \begin{pmatrix} Y \\ Z \end{pmatrix}$$

where $Y$: q x 1 is the vector of continuous variables and $Z$: p x 1 is the vector of binary variables.³ Thus, each distinct pattern of $Z$ defines a multinomial cell with $Z$ being in cell

$$m = 1 + \sum_{j=1}^{p} z_j\,2^{j-1}.$$

It is assumed that $Y \sim N_q(\mu_h^{(m)}, \Sigma)$ in cell $m$ of group $h$, where $\Sigma$ is the common covariance matrix. It is also assumed that the probability of an observation in cell $m$ is $p_{hm}$ for group $h$. The optimal allocation rule then becomes: allocate to group 1 if

$$U(Y) = (\mu_1^{(m)} - \mu_0^{(m)})'\,\Sigma^{-1}\left[Y - \tfrac{1}{2}(\mu_1^{(m)} + \mu_0^{(m)})\right] + \ln\frac{p_{1m}}{p_{0m}} \tag{3.22}$$

is > 0 and to group 0 otherwise.
    The optimum rule derived from the location model thus leads effectively to a different linear discriminant for each of the multinomial cells, with cutoff points determined in each case by the discrete component of the model [27, p. 783].

Thus, this is a model that acknowledges the different types of variables, but it is unduly complicated in the number of functions produced. The location model seems to be of theoretical interest but of little practical use at this time. Krzanowski does, however, suggest that if the data are not to be treated by his method but rather by fitting to a model containing only one type of variable (either continuous or discrete, but not both), then it is better to consider them all as continuous rather than to partition the continuous variables. Thus, for a mixture of continuous and discrete variables, he preferred Fisher's linear discriminant to p-way contingency table analysis.

³ A discrete variable with n levels can be transformed into n-1 indicator variables, so we consider only binary variables here.

3.4 Logistic Regression

A simpler model for continuous and discrete variables is the logistic probability model. It allows both continuous and discrete variables without loss of information and the estimators have several desirable properties. Let $P(X_i)$ be the posterior probability that an individual with explanatory variable values $X_i = (X_{i1}, \ldots, X_{im})$ has the disease (belongs to group 1). Then

$$P(X_i) = \frac{1}{1 + \exp(-\beta_0 - \beta' X_i)} \tag{3.23}$$

(a prime on a vector indicates the transpose of the vector) where $\beta$ and $\beta_0$ are the logistic coefficients.
The expression in (3.23) is the multivariate logistic function. In the medical context, (3.23) is a good formulation because "in the light of present medical knowledge a reasonable assumption is that P follows a symmetric sigmoid curve" [49, p. 168].

Cox [in 10] showed that the logistic function is appropriate for several different types of distributions of the explanatory variables: multivariate normal, binary, and mixed. In general, (3.23) holds for any variables whose distributions are in the exponential family; that is, those with density functions of the form

$$f(X) = g(\theta)\,h(X)\,\exp\{T(\theta)\,X\}. \tag{3.24}$$

Assuming that the distribution of $X$ is described by (3.24) is such a mild restriction that we can ignore it for all practical purposes.

A multivariate generalization of (3.23) can be made quite easily for problems with more than two groups. The generalization would allow division into k classes where P is a vector with k components.

The logistic probability model has several desirable properties in addition to its general applicability for different distributions. Cox [10] showed that (3.23) has associated with it the simple sufficient statistics

$$t = \sum_{i=1}^{n} X_i\,y_i \tag{3.25}$$

where the $y_i$ are indicator variables corresponding to group membership (0-1). The maximum likelihood estimators are functions of the sufficient statistics. Thus, from the Rao-Blackwell Theorem, we expect to get smaller mean squared error using the maximum likelihood estimators than using estimators that are not functions of sufficient statistics. In addition, the logistic model has asymptotically unbiased estimators associated with it [25].

Several different procedures have been suggested to estimate the parameters of the logistic function.
Unfortunately, all the procedures are, by necessity, iterative. Walker and Duncan [49] derive the normal equations through a least squares procedure with estimated weights. We let

$$P_i = P(X_i) = \frac{1}{1 + \exp(-\beta' X_i)} \tag{3.26}$$

be the probability of the i-th individual of the sample having the disease. Therefore,

$$y_i = P_i + e_i = P(\beta' X_i) + e_i. \tag{3.27}$$

Thus, the n x (m+1) matrix of independent variables for the sample is

$$X = \begin{pmatrix} x_{10} & x_{11} & \cdots & x_{1m} \\ x_{20} & x_{21} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{n0} & x_{n1} & \cdots & x_{nm} \end{pmatrix}$$

and $y' = (y_1, y_2, \ldots, y_n)$. By application of weighted iterative non-linear least-squares procedures to (3.27) with a diagonal weight matrix $W$ (the inverse of the covariance matrix of the vector $e$), they conclude that the normal equations are

$$X'\,W^{-1} X\,\beta = X'\,W^{-1} y. \tag{3.28}$$

Thus, they conclude that the estimators are

$$\hat{\beta} = (X'\,W^{-1} X)^{-1} X'\,W^{-1} y, \tag{3.29}$$

which we shall see later are the same as the maximum likelihood estimators. Walker and Duncan suggest an iterative solution of (3.29) by the Newton-Raphson method with initial estimates obtained by fitting a linear discriminant function.

Kalman [49] proposed another recursive estimation procedure that claimed more rapid convergence and because of the rapid convergence, the need for good initial estimates is relaxed. The procedure updates the estimates with the addition of each new individual. The estimate of $\beta$ based on the first k individuals is $\beta_k$. Then

$$V_k = \mathrm{Var}(\beta_k) = (X_k'\,W_k^{-1} X_k)^{-1} \tag{3.30}$$

where $X_k$ is the matrix of k individuals' observations and $W_k$ is the covariance matrix for k individuals. Let $X_{k+1}$ be the vector of observations for the (k+1)st individual,

$$w_{k+1} = P_{k+1|k}\,Q_{k+1|k}, \tag{3.31}$$

and

$$P_{k+1|k} = \frac{1}{1 + \exp(-\beta_k' X_{k+1})} = 1 - Q_{k+1|k}. \tag{3.32}$$
Therefore,

$$V_{k+1} = V_k - V_k X_{k+1}\,w_{k+1}\,c_{k+1}\,w_{k+1}\,X_{k+1}'\,V_k \tag{3.33}$$

and

$$c_{k+1} = \left(w_{k+1}\,X_{k+1}'\,V_k X_{k+1}\,w_{k+1} + w_{k+1}\right)^{-1}. \tag{3.34}$$

Finally, the recursive formula for the estimator of $\beta$ is

$$\beta_{k+1} = \beta_k + V_k X_{k+1}\,c_{k+1}\,w_{k+1}\,(y_{k+1} - P_{k+1|k}) \tag{3.35}$$

where $y_{k+1}$ takes on the values 1 and 0 as the (k+1)st individual does or does not have the disease.

The problem of initial estimates is quite simple now. Let $V_0$ and $b_0$ be any prior estimates of the variance and $\beta$. The estimates $V_k$ and $\beta_k$ are found using $V_0$, $b_0$, and the first k data items. Then $V_0$ and $b_0$ are eliminated from the formula by the following:

$$V_k^{*-1} = V_k^{-1} - V_0^{-1} \tag{3.36}$$

and

$$\beta_k^{*} = V_k^{*}\left(V_k^{-1}\beta_k - V_0^{-1} b_0\right). \tag{3.37}$$

The remaining m-k items, $V_k^{*}$, and $\beta_k^{*}$ are then used in the recursive process to get the final estimate of $\beta$.

A third method is maximum likelihood estimation. The maximum likelihood equations are fairly simple to derive for the logistic model. Let $P_s$ be the posterior probability of disease from equation (3.23) for the s-th individual. Also let $y_s = 0$ if the s-th individual does not have disease and $y_s = 1$ if that individual has disease. Then the likelihood is

$$L = \prod_{s=1}^{n} P_s^{\,y_s}\,(1 - P_s)^{1 - y_s} \tag{3.38}$$

and the natural logarithm of the likelihood is

$$\ln L = \sum_{s=1}^{n} y_s \ln P_s + \sum_{s=1}^{n} (1 - y_s) \ln(1 - P_s). \tag{3.39}$$

Taking partial derivatives of (3.39), we get the maximum likelihood equations⁴

$$\frac{\partial \ln L}{\partial \beta_0} = \sum_{s=1}^{n} (y_s - P_s) = 0 \tag{3.40}$$

and

$$\frac{\partial \ln L}{\partial \beta_i} = \sum_{s=1}^{n} X_{is}\,(y_s - P_s) = 0, \quad i = 1, \ldots, m. \tag{3.41}$$

Equation (3.40) assures us that the expected number of cases will be equal to the observed number of cases. That is another desirable property of the maximum likelihood estimates.

⁴ To take the partial derivatives we use the facts that $\partial \ln P_s / \partial \beta_0 = 1 - P_s$, $\partial \ln(1 - P_s) / \partial \beta_0 = -P_s$, $\partial \ln P_s / \partial \beta_i = X_{is}(1 - P_s)$, and $\partial \ln(1 - P_s) / \partial \beta_i = -X_{is} P_s$.
The maximum likelihood equations are most often fitted by the Newton-Raphson algorithm. It is an iterative gradient algorithm. If $\beta_k$ is an estimate of $\beta$, $k > 0$, and $f$ is the logistic function to be estimated, then the new estimate $\beta_{k+1}$ is

$$\beta_{k+1} = \beta_k - (d^2 f)^{-1}\,(df). \tag{3.42}$$

With this formulation there can be a problem of divergence when the initial estimates are not close to the true values. Thus, Haberman [in 24] added a factor $\alpha(k)$ to equation (3.42) to prevent divergence in such cases. If reasonable care is taken in choosing the starting values, divergence is not a large problem in most applications so we will not complicate (3.42) with the $\alpha$ term.

Several different types of starting estimators have been suggested in the literature. Linear discriminant function estimators have often been used as starting values. Other possible initial estimators are conditional estimators and reverse Taylor series approximations [35]. Conditional estimators are obtained by maximizing the conditional likelihood (conditional on the explanatory variables). Reverse Taylor series approximations arise from the logistic function, equation (3.23). Expanding about $X = \bar{X}$ in a Taylor series, one gets

$$P(X) = \left[\frac{1}{1 + \exp(-\beta_0 - \beta'\bar{X})} - \frac{\beta'\bar{X}\,\exp(-\beta_0 - \beta'\bar{X})}{[1 + \exp(-\beta_0 - \beta'\bar{X})]^2}\right] + \frac{\beta'\,\exp(-\beta_0 - \beta'\bar{X})}{[1 + \exp(-\beta_0 - \beta'\bar{X})]^2}\,X + R(X), \tag{3.43}$$

where $R(X)$ denotes a remainder containing terms of the order $O[(X - \bar{X})'(X - \bar{X})]$. Neglecting the remainder, we can interpret this as the linear function $A + B'X$ where

$$A = \frac{1}{1 + \exp(-\beta_0 - \beta'\bar{X})} - B'\bar{X} \tag{3.44}$$

and

$$B = \frac{\beta\,\exp(-\beta_0 - \beta'\bar{X})}{[1 + \exp(-\beta_0 - \beta'\bar{X})]^2}. \tag{3.45}$$

Solving one gets

$$\beta = \frac{B}{(A + B'\bar{X})(1 - A - B'\bar{X})} \tag{3.46}$$

and

$$\beta_0 = -\beta'\bar{X} - \ln\left[\frac{1}{A + B'\bar{X}} - 1\right] \tag{3.47}$$

as the reverse Taylor series approximations.

Computer routines to find the maximum likelihood estimators for the logistic model or logistic regression are not so readily available as those for linear discrimination.
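For readers without access to such a routine, the Newton-Raphson iteration for the maximum likelihood equations can be sketched briefly. The example below is a minimal illustration for a single explanatory variable plus a constant, starting from zero coefficients and omitting any step-length factor; it is not the Nerlove and Press program. The data and the function name `fit_logistic` are hypothetical.

```python
import math

def logistic(eta):
    """Logistic function 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(x, y, iterations=25, tol=1e-10):
    """Newton-Raphson for P = 1 / (1 + exp(-b0 - b1 * x)), one variable."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [logistic(b0 + b1 * xi) for xi in x]
        # Score vector: the left-hand sides of the likelihood equations.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative second derivatives), weights P(1 - P).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: inverse information matrix times the score.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    return b0, b1

# Hypothetical data: one measurement x and disease indicator y (not separable).
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(x, y)
fitted = [logistic(b0 + b1 * xi) for xi in x]
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, expected cases = {sum(fitted):.3f}")
```

At convergence the fitted probabilities sum to the observed number of cases, which is the property noted for equation (3.40). The iteration can fail when the two groups are perfectly separated by x, so real data should be checked for that before relying on a sketch like this.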
However, they are becoming more accessible. An example of one such program is listed in the work by Nerlove and Press [35]. Like most logistic regression programs, it uses the Newton-Raphson algorithm to find the MLE's. A disadvantage of this routine is the necessity for an additional user-written program to calculate the probabilities and classifications.

Because the logistic formulation provides a probability of being in group 1 rather than just a classification, one can tell which individuals are quite likely to be correctly classified (probability near 0 or 1) and which individuals are near the boundary (probability near 0.5) and thus are likely to be incorrect. The results could easily be used to form three groups: 1) those in group 0, 2) those in group 1, and 3) those in the middle region for whom more investigation should be carried out before classification. This is a particularly desirable characteristic for a medical diagnosis problem. Some tests are expensive, while others are inexpensive. If a patient can be classified (diagnosed) on the basis of inexpensive tests only, it is desirable for the patient and medical staff. However, if the first tests are inconclusive, the more costly tests are available to help resolve the question. Thus, we can think of the logistic regression as giving us a two-stage procedure.

3.5 Comparison of Linear Discrimination and Logistic Regression

For theoretical and practical considerations outlined above, I chose to concentrate on only two of the classification models: linear discrimination and logistic regression. Linear discrimination was chosen because of its widespread use in published studies despite violations of the model assumptions and because of easily accessible computer routines.
Logistic regression by use of the Newton-Raphson algorithm was selected because of the good fit of the medical data to the model assumptions, the desirable features of the logistic estimators, and the availability of a computer program.

In another work [41] S.J. Press and I compared logistic regression and discriminant analysis. Theoretical arguments were presented for and against the use of logistic regression as opposed to discriminant analysis for classification and regression of qualitative variables on explanatory variables. Empirical results for some non-normal classification problems were reported. A theoretical comparison of linear discrimination and logistic regression under different conditions is the next concern of this work.

When the data are multivariate normal with equal covariance matrices, the model assumptions of both models are satisfied. One would expect the linear discriminant to be better in this case because its model assumptions are satisfied, it has a closed form and it is non-iterative. Also for the normal case, both types of estimators are asymptotically unbiased [25]. Efron [12] investigated the asymptotic efficiency for each model under the assumption of normality of the data. He calculated the asymptotic relative efficiency (ARE) of logistic regression to linear discrimination. The ARE is given by

$$\mathrm{ARE} = (1 + \Delta^2 q_1 q_0)\,\frac{\exp(-\Delta^2/8)}{(2\pi)^{1/2}} \int_{-\infty}^{\infty} \frac{\exp(-x^2/2)}{q_1 e^{\Delta x/2} + q_0 e^{-\Delta x/2}}\,dx \tag{3.48}$$

where

$$\Delta = \left[(\mu_1 - \mu_0)'\,\Sigma^{-1}(\mu_1 - \mu_0)\right]^{1/2},$$

which is the square root of the Mahalanobis distance. Table VIII gives some sample results for the asymptotic relative efficiency when $q_1 = q_0 = 0.5$, the most favorable situation for logistic regression.

Table VIII. Asymptotic Relative Efficiency of Logistic Regression to Linear Discrimination ($q_1 = q_0 = 0.5$)

Δ       0       0.5     1       1.5     2       2.5     3       3.5     4
ARE     1.000   1.000   .995    .968    .899    .786    .641    .486    .343
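The entries of Table VIII can be reproduced numerically. The sketch below evaluates a single-integral form of the ARE by the trapezoidal rule: a factor (1 + Δ² q1 q0) exp(-Δ²/8) / sqrt(2π) multiplying the integral of exp(-x²/2) / (q1 exp(Δx/2) + q0 exp(-Δx/2)). This form is an assumption here, checked only against the tabled values for q1 = q0 = 0.5; its use for unequal proportions should be treated as unverified. The function name and integration settings are hypothetical.

```python
import math

def are_logistic_vs_discriminant(delta, q1=0.5, steps=4800, limit=12.0):
    """Asymptotic relative efficiency, evaluated by the trapezoidal rule.

    Assumed form (verified here only for q1 = q0 = 0.5):
    (1 + delta^2 q1 q0) * exp(-delta^2 / 8) / sqrt(2 pi) * integral.
    """
    q0 = 1.0 - q1
    h = 2.0 * limit / steps
    total = 0.0
    for i in range(steps + 1):
        x = -limit + i * h
        denom = q1 * math.exp(delta * x / 2.0) + q0 * math.exp(-delta * x / 2.0)
        value = math.exp(-x * x / 2.0) / denom
        total += value * (0.5 if i in (0, steps) else 1.0)
    integral = total * h
    factor = (1.0 + delta ** 2 * q1 * q0) * math.exp(-delta ** 2 / 8.0)
    return factor * integral / math.sqrt(2.0 * math.pi)

for delta in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0):
    print(f"delta = {delta:3.1f}  ARE = {are_logistic_vs_discriminant(delta):.3f}")
```

The efficiency stays close to one until the groups are about one and a half standard deviations apart and then falls off quickly, which is the pattern the table shows.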
The logistic regression is less efficient as n goes to infinity because the linear discriminant is based on the full maximum likelihood estimators for $\beta_0$ and $\beta$, while logistic regression is a conditional maximum likelihood estimation procedure. Thus, when the data are multivariate normal, the linear discriminant approach is to be preferred on theoretical grounds.

Since "the multivariate normal assumption is unlikely to be satisfied in applications, even approximately,..." [25, p. 125], the non-normal case seems the more important situation to consider. When the normality assumption is violated, logistic regression is theoretically more robust. The logistic probability model is valid for the exponential family, equation (3.24). The question of relative efficiencies in this case has not been investigated, but one would expect the logistic estimators to be more efficient. It was shown above that when (3.23) holds, there are sufficient statistics for $\beta$. The maximum likelihood estimators are functions of the sufficient statistics, but the discriminant function estimators are not. Thus, by the Rao-Blackwell theorem, the logistic estimators have a smaller expected mean square error than the discriminant function estimators.

Under non-normal conditions, the logistic maximum likelihood estimates are consistent (asymptotically unbiased). For the discriminant function procedure, Halperin, Blackwelder, and Verter showed that for large samples and one attribute variable

$$\hat{\beta} \to \frac{P_1 - P_0}{q_1 P_1 Q_1 + q_0 P_0 Q_0} \tag{3.49}$$

and

$$\hat{\beta}_0 \to \ln\frac{q_1 P_1 + q_0 P_0}{q_1 Q_1 + q_0 Q_0} - \frac{q_1}{2}\left[\frac{P_1}{q_1 P_1 + q_0 P_0} + \frac{Q_1}{q_1 Q_1 + q_0 Q_0}\right]\frac{P_1 - P_0}{q_1 P_1 Q_1 + q_0 P_0 Q_0} \tag{3.50}$$

where

$$P_1 = \frac{1}{1 + \exp(-\beta_0 - \beta)} = 1 - Q_1 \tag{3.51}$$

and

$$P_0 = \frac{1}{1 + \exp(-\beta_0)} = 1 - Q_0. \tag{3.52}$$

When $q_1 = 0.5$ the estimates are nearly unbiased, but for other values of $q_1$ and $q_0$ the discrepancies can be quite large. Equations (3.49), (3.50), (3.51), and (3.52) can be easily generalized to more than one attribute variable [25].
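The size of the large-sample discrepancy is easy to tabulate. The sketch below evaluates the limits in (3.49) and (3.50) for a single 0-1 attribute and, as a cross-check, recomputes them directly from the group means and pooled within-group variance of the attribute. It assumes that q1 and q0 denote the proportions of individuals with the attribute present and absent; that reading is an assumption, since this section does not define them for these equations. The true coefficients used are hypothetical.

```python
import math

def logistic(eta):
    """Logistic function 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def limits_from_formulas(beta0, beta, q1):
    """Large-sample discriminant limits from (3.49) and (3.50)."""
    q0 = 1.0 - q1
    p1, p0 = logistic(beta0 + beta), logistic(beta0)  # (3.51), (3.52)
    nq1, nq0 = 1.0 - p1, 1.0 - p0                     # Q1, Q0
    slope = (p1 - p0) / (q1 * p1 * nq1 + q0 * p0 * nq0)
    diseased = q1 * p1 + q0 * p0
    healthy = q1 * nq1 + q0 * nq0
    intercept = (math.log(diseased / healthy)
                 - 0.5 * q1 * (p1 / diseased + nq1 / healthy) * slope)
    return intercept, slope

def limits_from_moments(beta0, beta, q1):
    """Same limits from group means and pooled variance of a 0-1 attribute."""
    q0 = 1.0 - q1
    p1, p0 = logistic(beta0 + beta), logistic(beta0)
    diseased = q1 * p1 + q0 * p0
    healthy = 1.0 - diseased
    mean_d = q1 * p1 / diseased           # P(attribute present | disease)
    mean_h = q1 * (1.0 - p1) / healthy    # P(attribute present | no disease)
    within = (diseased * mean_d * (1.0 - mean_d)
              + healthy * mean_h * (1.0 - mean_h))
    slope = (mean_d - mean_h) / within
    intercept = math.log(diseased / healthy) - 0.5 * slope * (mean_d + mean_h)
    return intercept, slope

TRUE_BETA0, TRUE_BETA = -1.0, 1.0  # hypothetical logistic coefficients
for q1 in (0.1, 0.5, 0.9):
    b0_lim, b_lim = limits_from_formulas(TRUE_BETA0, TRUE_BETA, q1)
    print(f"q1 = {q1:.1f}: intercept limit {b0_lim:.3f}, slope limit {b_lim:.3f}")
```

Comparing the printed limits with the true coefficients for several values of q1 shows how the discriminant estimates drift as the attribute becomes rarer or more common, while the two routes to the limit agree with each other.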
The authors continued the analysis to show that

    a) $\beta_i$ which are zero will tend to be estimated as zero for large samples by the method of maximum likelihood, but not necessarily by the discrimination method;
    b) if any $\beta_i$ are non-zero they will tend to be estimated as non-zero by either method, but the discriminant function approach will give asymptotically biased estimates for those $\beta_i$ and for ($\beta_0$)... [25, p. 152].

Finally, if there are two or more $\beta_i$ that are non-zero, the discriminant function estimates of the $\beta_i$ that are zero will not converge to zero in general. Consequently, one could be led to believe that certain factors are significant when in reality they are not. Empirical studies have shown that it does indeed happen in non-trivial cases. Thus, when the data follow the exponential family but are not normal with equal covariance matrices, logistic regression is preferred on theoretical grounds.

In addition to the theoretical aspects of the comparison, some practical aspects must be considered. Since logistic regression is an iterative procedure, its estimated parameters are more complicated to calculate and thus computation is more time consuming. Halperin, Blackwelder, and Verter [25] found that logistic regression took longer by a factor ranging from 1.3 to 2. For a problem with 50 individuals and 5 independent variables, I found [in 41] that logistic regression took 1.4 times longer. The time problem gets worse as the number of variables and number of observations increase. Other things being equal, doubling the number of observations doubles the time (see Table IX for some sample times).
The time can increase even more quickly than indicated in the table because the time depends too on other factors such as starting values of the coefficients, divergence of the algorithm, and covariances between the various coefficients. One would conclude that execution times can be much longer for logistic regression, but not beyond acceptable limits for large-scale computer installations.

Table IX. Execution Times for Logistic Regression (using an IBM 360-65), from [35]

Independent Variables    Observations    CPU Time for Execution
         1                    89                   6
         6                   225                  36
         7                   886                  60

3.6 Variable Reduction

In an exploratory data analysis of retrospective medical studies, the statistician will often have a large number of variables available. In order to make the problem more manageable and make the underlying biologic process clearer, it is desirable to reduce the dimension of the problem. A number of criteria have been suggested to achieve the reduction. Three types will be discussed below: 1) checking all possible subsets, 2) a stepwise procedure, and 3) a two-stage non-iterative procedure.

McCabe [29] proposes that one check all possible subsets of the variables to find the optimal subset. This procedure has the advantage of considering all possible combinations and interactions of variables. Let the within sum of cross-products matrix be denoted by $W = (w_{ij})$ where

$$w_{ij} = \sum_{h=1}^{2} \sum_{s=1}^{n_h} (X_{ihs} - \bar{X}_{ih\cdot})(X_{jhs} - \bar{X}_{jh\cdot}) \tag{3.53}$$

(a subscript replaced by a dot indicates the variable was averaged over that index) and let the total sum of cross-products matrix be $T = (t_{ij})$ where

$$t_{ij} = \sum_{h=1}^{2} \sum_{s=1}^{n_h} (X_{ihs} - \bar{X}_{i\cdot\cdot})(X_{jhs} - \bar{X}_{j\cdot\cdot}). \tag{3.54}$$

One of the standard multivariate analysis of variance tests for equality of the means uses

$$U = \frac{|W|}{|T|}. \tag{3.55}$$

That is essentially the ratio of the estimated generalized variance within to the estimated generalized variance total.
Clearly, 0 < U < 1 and small values indicate good discrimination. McCabe uses U as a descriptive statistic to show discrimination potential. Given two subsets with an equal number of variables, the subset corresponding to the smaller U is the preferred subset. The process is similar to calculating squared multiple correlation coefficients in regression.

Theoretically, the examination of all possible subsets is the best procedure, but in practice it becomes quite lengthy for even moderate problems. To compare all subsets of p variables, U must be calculated from (3.53), (3.54), and (3.55) for $2^p - 1$ possible subsets. A CDC 6500 computer required five minutes of CPU time to find the subsets for 20 variables [29] and each added variable doubles the time required. Thus, while theoretically appealing, the examination of all subsets is not practical for a problem of the size considered in this work.

A second type of variable reduction scheme is a stepwise procedure. Individual variables are considered for inclusion or exclusion iteratively until a prespecified criterion has been achieved. The procedure adds at each step the variable which reduces the residual sum of squares as much as possible. That is, it uses a stepwise linear regression and the variable added is the one which maximally reduces the remaining unexplained variance about regression. An F-ratio is calculated from the within groups and total covariance matrices. Each variable is considered for inclusion depending on its relation to the already included variables only. Thus, all interactions are not studied as they were in the U-ratio scheme. An advantage of the stepwise procedure is the relatively large number of variables it can handle in a reasonable time. Computer routines such as the Biomed Stepwise Discriminant allow 80 variables and are quite efficient.
With the virtual memory of large-scale computers, the number of variables could be increased beyond 80 with ease and without unreasonable increases in time. Although theoretically less desirable than the previous procedure, the stepwise procedure is practically appealing for large numbers of variables.

A third approach, suggested by Zielezny [51], reduces the number of variables by formalizing the medical diagnosis process. The doctor looks at an inexpensive, easily collected set of variables; if that provides a clear-cut decision, then there is no need to observe further. If it does not provide a clear-cut decision, then the doctor continues his investigation. This is especially useful when there are sets of expensive and inexpensive variables. Let

$$ X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} $$

where $X_i$ has $k_i$ components and $X_1$ corresponds to the inexpensive variables. The

$$ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} $$

are partitioned similarly. The $U(X)$ of equation (3.1) is also partitioned as

$$ U(X) = \begin{pmatrix} U_1(X_1) \\ U_2(X_2) \end{pmatrix} $$

Then the two-stage rule is as follows: For $a, b$, $-\infty < a < b < \infty$, classify to group 0 if 1) $U_1(X_1) < a$ or 2) $a < U_1(X_1) < b$ and $U(X) < 0$; otherwise, classify to group 1. The geometry of the partition is presented in Figure 3.2. That is, if $U_1(X_1)$ is high or low an observation can be classified on the basis of $X_1$ only. If $U_1(X_1)$ is in the middle range, then measurements must be taken on the second subset. The parameters a and b are chosen to minimize the total cost.

Figure 3.2. The geometry of the two-stage procedure for two groups.

Although it has not been done, it seems that this procedure could easily be generalized to p-1 stages so that variables are observed singly. One would then have a sequential decision process. Further investigation along these lines would be interesting, but will not be attempted here. However, a variation of the two-stage decision process can be constructed from the logistic regression results.
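The two-stage rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Zielezny's implementation: the function name `two_stage_classify` is hypothetical, the second-stage score is passed as a callable so that the expensive variables are evaluated only when needed, and the treatment of the boundary values (a score exactly equal to a or b is sent to the second stage) is an assumption, since the rule as quoted does not settle it.

```python
def two_stage_classify(u1, a, b, full_score):
    """Apply the two-stage rule.

    u1         -- first-stage score U_1(X_1) from the inexpensive variables
    a, b       -- cut-offs with a < b
    full_score -- zero-argument callable returning U(X); it is called only
                  when u1 falls in the middle range
    Returns (group, used_second_stage).
    """
    if not a < b:
        raise ValueError("cut-offs must satisfy a < b")
    if u1 < a:
        return 0, False          # clearly low: group 0 on X_1 alone
    if u1 > b:
        return 1, False          # clearly high: group 1 on X_1 alone
    # Middle range (boundaries included by assumption): observe X_2 as well.
    return (0 if full_score() < 0 else 1), True

low = two_stage_classify(-2.0, -1.0, 1.0, lambda: 5.0)
middle_negative = two_stage_classify(0.0, -1.0, 1.0, lambda: -0.3)
middle_positive = two_stage_classify(0.0, -1.0, 1.0, lambda: 0.3)
high = two_stage_classify(2.0, -1.0, 1.0, lambda: -5.0)
```

Passing the second-stage score as a callable mirrors the motivation of the procedure: the expensive measurements are taken only for cases the first stage cannot settle.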
More will be said about such a procedure in Chapter 5.

The three types of variable reduction reviewed in this section all appear useful based on theoretical considerations. When practical aspects are considered, the stepwise discriminant analysis is superior. It has also been shown [25] that variables shown to be significant by stepwise discrimination include the ones that are significant in logistic regression. The subset of variables may include non-significant ones also as explained above. Thus, a subset of variables chosen by stepwise discrimination is appropriate as a reduced set of variables for logistic regression.

3.7 Conclusions

We can conclude from our review of statistical models for classification that logistic regression should discriminate better than linear discrimination for the medical problem of this work. The time should be greater for the logistic regression, but should not be a major problem. In order to reduce the dimension of the problem, stepwise discriminant analysis should be done. That is, stepwise discriminant analysis is appropriate for exploratory work and logistic regression is appropriate for final parameter estimation and classification with non-normal data.

Chapter 4

DATA COLLECTION

4.1 Selecting Variables to be Observed

The first problem in a statistical analysis is deciding what variables are to be observed. Because my prior knowledge of the problem was limited, external sources of information were consulted. The medical staff at BCCI were the first source of information about factors that influence the growth and spread of breast cancer. Their knowledge of the disease process included the general knowledge of breast cancer literature and their own personal experience treating the disease. The second source of facts about the disease process was the published medical reports on breast cancer. No previous work was available on predicting the spread of breast cancer to the regional lymph nodes.
Because of the lack of knowledge about the factors influencing lymph node metastases, the variables chosen were those known to affect the risk of originally developing breast cancer and those known to affect the prognosis of the disease.

Factors known to influence the risk of breast cancer^1 can be divided into four types: 1) endocrine, 2) genetic, 3) immunologic, and 4) other.

^1 The information about risk factors is generally agreed upon by medical authorities. The interested reader is referred to [9], [31], and [36] for more on the epidemiology of breast cancer.

Estrogens and ovarian activity have a great influence on breast cancer. Studies [see 36 for example] have shown that early menarche and late menopause (implying a longer than average period of estrogen production) increase the risk of breast cancer. Pregnancy and lactation are known to alter the hormonal balance of a woman's body, but studies have provided conflicting results about their influence on breast cancer risk. Early first parity^2 decreases the risk. Also the more children, the more the risk decreases. However, this may be accounted for by the fact that women with a large number of children would have begun having them at an early age. Finally, recent studies have produced conflicting results about the relationship between oral estrogen (birth control pill) use and breast cancer incidence. Thus, it was decided to observe all available information relating to the patients' ovarian activity, menstrual history, pregnancy and lactation history, and estrogen therapy.

Other areas of concern in endocrine function are the adrenal and thyroid glands. Adrenal dysfunction is known to contribute to carcinogenesis. One type of treatment for breast cancer metastases is an adrenalectomy. It is also known that there is a high incidence of thyroid disease in women with carcinoma of the breast.
After thyroidectomy there is a low incidence of breast cancer. Thus, endocrine dysfunction seems to have an important influence on breast cancer. It was decided to observe all variables associated with endocrine imbalance, including especially thyroid and adrenal dysfunction.

^2 Parity is the condition of a woman with respect to having produced viable children regardless of whether the child was alive at birth.

The second type of factors includes genetic or hereditary factors. It is well-known that there is a familial predisposition for breast cancer. The incidence is consistently higher for relatives of breast cancer patients and it is inherited through either parental line. Complete information on relatives with breast cancer was an important area to observe. In addition to familial differences in risk, epidemiological studies [31] have shown racial differences. The lowest risk is among Oriental women and the highest risk is among Caucasian women. The influence of genetic factors is, however, complicated by environmental factors. Oriental women who move to North America from Asia increase their risk until, several generations after immigration, their risk is close to Caucasian women's risk. On the other hand, non-white women generally have histologically more malignant tumors. Hence there is a greater likelihood of positive axillary nodes. Thus, it was decided to observe all genetic and environmental factors that were available.

The third set of factors involves the body's immune system. Although this may be the most important set of factors influencing the spread of the disease, it is the most difficult to observe. There is no good measure of the host-tumor relationship. It is known that various immune deficient states are associated with a very much greater risk of breast cancer.
Multicentricity and bilaterality of the disease are clear indications of the inadequacy of the body's defenses. Seasonal variations in incidence have been noted (May to September is a period of lower incidence) which correspond to observed seasonal variations in response to other diseases. Perhaps the closest one can come to measuring the immune status of the body is the lymphocytic count. Since lymphocytes are the body's main defense mechanism against disease, their activity indicates the level of resistance to disease processes. It is interesting to note that either a high or nil lymphocytic reaction is better than a moderate reaction. A high lymphocyte count means that the body has a good defense, while no lymphocytic reaction means that the response has not yet been triggered. A low to moderate level of lymphocytes indicates a reaction that has been triggered but is inadequate to the task. It was concluded that lymphocytic reaction and season were to be observed. Also a history of diseases indicating immune deficiency, such as herpes zoster, was to be observed.

Other factors that influence risk of breast carcinoma are quite varied. Trauma to the breast is present in about 11 percent of breast cancer patients and an unknown percentage of normal women. Socioeconomic level seems to play some part, although there is evidence that this is related to diet and lifestyle. Stomach cancer incidence is correlated with breast cancer incidence. Since stomach cancer is influenced by diet, doctors have suggested a link between diet and cancer of the breast. On the other hand, lower socioeconomic levels do not have as good access to medical care and so delay seeking treatment, which allows the cancer to infiltrate the body more. Other systemic illnesses and previous benign breast disease increase the risk of breast cancer, too. As with many other diseases, the psychological state of the patient can be crucial to treatment response and survival.
Thus, it was decided that all available information on the above factors was to be collected.

A final area that seemed promising was the pathological diagnosis. Breast cancer is more than one disease. Many different histological types are combined in the label of breast carcinoma. Different histological types were known to have varying incidences among different groups of women. Table X shows the results of one study [31]. Consequently, all the pathological findings available in the medical charts were to be included as variables.

In addition to the factors that influence risk of developing breast cancer, the clinical state of the disease at diagnosis is an important indicator of possible lymph node metastases. Different states have different prognoses. A short survival associated with a particular symptom indicates a more advanced stage of disease or more virulent disease. Both of these conditions have a greater chance of nodal metastases. Size of the primary tumor can be indicative of disease state. A large mass would be seen when there is a long delay between onset of symptoms and diagnosis. The longer the tumor has been growing, the better established it is in the body. Thus, metastases are more likely. Large tumor masses also are seen in fast-growing cancers. A carcinoma that grows quickly has a good chance of having spread elsewhere. Skin involvement and clinical involvement of the axilla also indicate spread of disease and increased likelihood of lymph node metastases. Consequently, all variables pertaining to the state of the disease were to be observed.

After the review of the literature on risk factors and prognostic factors, a sample of 50 charts of patients who had not had a selective biopsy but who were initially treated at BCCI during the years 1955 to 1963 were reviewed. The charts were read to see how much of the information outlined above was recorded in patient charts and in what format the information appeared.
A few additional variables that had no obvious connection to the lymph node metastases problem but were available on most charts were included in the list of variables to be observed.

Table X
Comparison of Factors by Histologic Type

Histologic Type         % of Total   Average Age (years)   Location                       Nodal Involvement
Infiltrating duct         78.1%            50.7            all                                  60%
Infiltrating lobular       8.7%            53.8            standard                             60%
Medullary                  4.3%            49.0            More in upper half of breast         44%
Colloid                    2.6%            49.7            standard                             32%
Comedo carcinomas          4.6%            48.6            Mostly subareolar
Papillary                  1.2%            51.9            More in lower half of breast

4.2 Data Collection

From the list of factors mentioned in the previous section, a data collection form and coding instructions were devised. The data collection form and accompanying instructions are in Appendix C. The format for data collection was suggested by previous studies undertaken at BCCI. Over a period of three months forms were completed by me for the 557 selective biopsy patients diagnosed between 1955 and 1963 and treated at BCCI. If some information was unavailable for a patient, a missing-data code was used.

After the data collection was completed, the twenty-two patients with previous breast or systemic malignancy were eliminated since the disease process could be obscured by the other primary tumors. Of the remaining 535 patients, three more were dropped from the study because of very poor information. The physicians recording the history reported that two of the women were bad historians or suspect reporters. The third woman removed from the study spoke only German and the interpreter was a young boy unfamiliar with medical terminology and biological processes. An additional three cases were discarded because expert pathological review was unable to provide a disease status for the lymph nodes. Thus, the final sample size was 529 patients.
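The exclusions above can be tallied as a short check of the arithmetic. The variable names are hypothetical and chosen only for this illustration; the counts are the ones stated in the text.

```python
# Counts taken from the text; names are illustrative only.
collected = 557                  # selective biopsy patients with completed forms
prior_malignancy = 22            # previous breast or systemic malignancy
poor_information = 3             # unreliable or unusable histories
no_nodal_status = 3              # pathological review could not settle node status

after_malignancy = collected - prior_malignancy
final_sample = after_malignancy - poor_information - no_nodal_status
```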
4.3 Variables Selected for Analysis

After the sample had been pared to its final size, the variables to be included in the analysis had to be selected. Since the exploratory analysis was to be carried out by stepwise discriminant analysis, no more than 80 variables (the program limit) could be included. The posterior knowledge after data collection became the prior knowledge used for variable selection.

Three types of variables had been observed: 1) continuous, 2) binary, and 3) categorical. The binary and continuous variables did not have to be transformed prior to analysis. Some categorical variables were changed after data collection because certain categories had too few observations. The variables were collapsed by combining categories until the remaining categories were large enough for analysis. An example was the categorical variable race. All categories other than Caucasian were very small. Categories were combined to form the binary variable for Caucasian and non-Caucasian.

If a categorical variable was an ordered categorical variable, it could be used without transformation. Unordered categorical variables had to be transformed into binary indicator variables. Each binary variable corresponds to one of the first k-1 categories. Thus, an individual in category i, 1 ≤ i ≤ k-1, would have zeroes for each of the indicator variables except variable i, which would be one. If an individual was in category k, the k-1 variables would all be zero. Only k-1 variables are used since k variables would be linearly dependent.

After the above transformations were made, there were 112 variables. To accommodate that number of variables, the discriminant analysis could have been used twice and the results combined. However, some variables were not included in the analysis for various reasons. After the survival analysis of Chapter 2 was completed, the survival data were eliminated. Some variables had zero variance and so provided no information; those were discarded.
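The k-1 indicator coding and the zero-variance screen described above can be sketched as follows. This is a minimal illustration, not the coding program used in the study; the function names, the category labels, and the small example arrays are all hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def indicator_code(categories, levels):
    """Code an unordered categorical variable as k-1 indicator columns.

    Column i is 1 for cases in the i-th of the first k-1 levels;
    cases in the last (k-th) level get zeros in every column.
    """
    levels = list(levels)
    coded = np.zeros((len(categories), len(levels) - 1), dtype=int)
    for row, category in enumerate(categories):
        position = levels.index(category)
        if position < len(levels) - 1:
            coded[row, position] = 1
    return coded

def drop_zero_variance(X):
    """Remove columns that take a single value and so carry no information."""
    keep = X.var(axis=0) > 0
    return X[:, keep], keep

# Hypothetical three-level variable coded for four cases.
coded = indicator_code(["A", "C", "B", "C"], levels=["A", "B", "C"])

# Hypothetical data matrix whose second column is constant.
X = np.array([[1, 5], [0, 5], [1, 5]])
X_reduced, kept = drop_zero_variance(X)
```

Using k-1 columns avoids the linear dependence mentioned in the text: with all k indicators, the columns would always sum to one for every case.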
Several variables that had seemed promising before work began had so few respondents that they also were discarded. Through these means, enough variables were dropped to leave 80 variables for analysis. A list of the variables observed but not used in the analysis appears in Appendix D.

The 80 variables chosen for analysis will be defined here. In the list that follows, a name is assigned to each variable and a definition of the variable appears.

General Information:

ID - BCCI number - a six digit number for patient identification (not used in analysis, but used for case identification).
DIAMON - Month of initial diagnosis or treatment (also called the anniversary).
AGE - Age (at last birthday) at diagnosis - in years.
SOCECN - Socioeconomic level - 1 if high socioeconomic level; 0 otherwise.
RACE - Racial origin - 1 if Caucasian; 0 otherwise.
MARRY - Patient married - 1 if patient is married at time of diagnosis; 0 otherwise.
BROKEN - Broken marriage - 1 if patient had been married previously but the marriage has been dissolved by death, separation or legal action; 0 if never married or married now.

Family History:

BRMOM - Breast cancer in patient's mother - 1 if mother had developed breast cancer; 0 otherwise.
BRDAUG - Breast cancer in patient's daughter - 1 if daughter had developed breast cancer; 0 otherwise.
BRSIS - Breast cancer in patient's sister - 1 if sister had developed breast cancer; 0 otherwise.
BROTH - Breast cancer in other female relatives - 1 if any other female relative had developed breast cancer; 0 otherwise.
CANCER - Blood relatives with cancer - number of relatives who have had cancer.

Patient's Personal History:

SMOKE - Cigarette smoker - 1 if patient has a history of cigarette smoking; 0 otherwise.

Patient's Reproductive History:

REG - Regular menstrual periods - 1 if patient had regular, uncomplicated menstrual periods; 0 otherwise.
DYSMEN - Dysmenorrhoea - 1 if patient experienced dysmenorrhoea; 0 otherwise.
HORMON - Hormone therapy - 1 if patient had a history of hormone therapy; 0 otherwise.
OTHDRG - Major drug therapy other than hormones - 1 if there is a history of other drug therapy; 0 otherwise.
STATUS - Menopausal status - 1 if patient is premenopausal or within 5 years of the menopause at the time of diagnosis; 0 if patient is postmenopausal.
AGEMEN - Age at menopause - age in years at menopause for postmenopausal patients; 88 for premenopausal patients.
AGEFRS - Age at first birth - patient's age in years at termination of first full-term pregnancy for para women; 0 for nullipara women.
BIRTHS - Parity - number of pregnancies carried to full term.
MISCAR - Miscarriages - number of pregnancies that did not carry to full term.
NUMNUR - Number nursed - number of lactation periods.
MONNUR - Number of months nursed - length in months of the longest lactation period.
BRFED - Patient breastfed - 1 if patient was breastfed; 0 otherwise.

Patient's Illnesses:

DIABET - Diabetes - 1 if patient had had diabetes; 0 otherwise.
HEART - Heart disease - 1 if patient had a history of heart disease; 0 otherwise.
HYPER - Hypertension - 1 if patient had a history of hypertension; 0 otherwise.
KIDNEY - Kidney disease - 1 if patient had a history of kidney disease; 0 otherwise.
TB - Tuberculosis - 1 if patient had a history of tuberculosis; 0 otherwise.
ANEMIA - Anemia - 1 if patient had a history of anemia; 0 otherwise.
PNEUM - Pneumonia - 1 if patient had a history of pneumonia or other serious lung diseases, excluding tuberculosis; 0 otherwise.
ALLERG - Allergy - 1 if patient had a history of allergies; 0 otherwise.
THYROD - Thyroid disease - 1 if patient had a history of thyroid dysfunction; 0 otherwise.
FIBROI - Uterine fibroids - 1 if patient had a history of uterine fibroids; 0 otherwise.
DISOTH - Other disease - 1 if patient had a history of other major diseases not listed above; 0 otherwise.

Patient's Surgical History (Before Present Illness):

OOPHOR - Oophorectomy - 1 if patient had an oophorectomy; 0 otherwise.
HYSTER - Hysterectomy - 1 if patient had a hysterectomy; 0 otherwise.
PELVIC - Other pelvic surgery - 1 if patient had pelvic surgery other than those specifically listed above, especially involving the reproductive system; 0 otherwise.
GALLB - Cholecystectomy - 1 if patient had gall bladder surgery; 0 otherwise.
THYSUR - Thyroidectomy - 1 if thyroid had been removed; 0 otherwise.
ADRENL - Adrenalectomy - 1 if the adrenal gland had been removed; 0 otherwise.
OTHSUR - Other surgery - 1 if patient had a history of other surgery; 0 otherwise.

Benign Breast Ailments:

MAZODY - Mazodynia - 1 if patient had mazodynia; 0 otherwise.
MASTIT - Mastitis - 1 if patient had had benign breast disease during lactation; 0 otherwise.
BENIGN - Benign breast disease other than during lactation - 1 if patient had a history of benign breast disease other than during lactation; 0 otherwise.

History of the Present Illness:

DURTON - Duration of symptoms - duration of symptom in months from onset of first symptom to date of diagnosis.
SYMPT1 - Thickening or lump - 1 if first symptom was a thickening or lump in the breast; 0 otherwise.
SYMPT2 - Pain - 1 if first symptom was pain in the breast or axilla; 0 otherwise.
SYMPT3 - Nipple changes - 1 if the first symptom was change in the contour of the nipple or discharge from the nipple; 0 otherwise.
SIZE - Size of the tumor - clinical size of the original tumor mass in cm.
LOC1 - Lower inner quadrant - 1 if tumor was located in the lower inner quadrant of the breast; 0 otherwise.
LOC2 - Lower outer quadrant - 1 if tumor was located in the lower outer quadrant of the breast; 0 otherwise.
LOC3 - Upper inner quadrant - 1 if tumor was located in the upper inner quadrant of the breast; 0 otherwise.
LOC4 - Upper outer quadrant - 1 if tumor was located in the upper outer quadrant of the breast; 0 otherwise.
LOC5 - Lymph node tumor - 1 if tumor was located in the axilla; 0 otherwise.
NODEPL - Palpable nodes - 1 if the lymph nodes were palpable in the axilla on the same side as the tumor; 0 otherwise.
SKIN - Skin involvement - 1 if there was skin involvement at the site of the primary tumor; 0 otherwise.
BREAST - Breast involved - 1 if the right breast was the primary site; 0 if the left breast was the primary site.
TRAUMA - Trauma to the breast - 1 if the patient had a history of trauma to the involved breast; 0 otherwise.

Patient's Present Condition:

BODYSZ - Overall body size - 3 ordered categories for body size.
BRSIZE - Breast size/shape - 4 ordered categories for breast size.
CONDTN - Patient's general physical condition - 1 if patient was in good physical condition at the time of diagnosis; 0 otherwise.
OTHILL - Other illnesses present - number of other illnesses present at the time of diagnosis.
EMOTON - Emotional problems - 1 if the patient had emotional problems (other than any related to the cancer) at the time of diagnosis or shortly before; 0 otherwise.
LYMPH - Lymphocytes - percentage of lymphocytes in the blood at diagnosis.

Pathology:

NODES - Grouping variable, lymph node status - 1 if apical or internal mammary lymph nodes were positive at diagnosis; 0 if lymph nodes were negative.
PATH1 - Paget's disease - 1 if histology was Paget's disease of the nipple; 0 otherwise.
PATH2 - Non-infiltrating papillary carcinoma - 1 if histology was non-infiltrating papillary carcinoma; 0 otherwise.
PATH3 - Infiltrating papillary carcinoma - 1 if histology was infiltrating papillary carcinoma; 0 otherwise.
PATH4 - Infiltrating duct carcinoma - 1 if histology was infiltrating duct carcinoma (scirrhous, adenocarcinoma); 0 otherwise.
PATH5 - Colloid carcinoma - 1 if histology was colloid carcinoma; 0 otherwise.
PATH6 - Medullary carcinoma - 1 if histology was medullary carcinoma; 0 otherwise.
PATH7 - In situ lobular carcinoma - 1 if histology was in situ lobular carcinoma; 0 otherwise.
PATH8 - Infiltrating lobular carcinoma - 1 if histology was infiltrating lobular carcinoma; 0 otherwise.
PATH9 - Inflammatory carcinoma - 1 if histology was inflammatory carcinoma; 0 otherwise.
PATH10 - Other carcinomas - 1 if histology was any other single type of carcinoma; 0 otherwise.
DIFF - Differentiation - 3 ordered categories for differentiation of the carcinoma.
FOCI - Foci of disease - 1 if disease is unicentric; 0 if disease is multicentric.
CELL - Size of cells - 1 if cancer cells are small cell; 0 if the cells are large cell.
INFIL - Infiltration - 3 ordered categories for the amount of lymphocytic infiltration of the primary tumor.

These 80 variables were the input data for exploratory work with the discriminant analysis to reduce further the dimension of the problem. The results of the analysis appear in Chapter 5.

4.4 Missing Data

The problem of missing data was acute in this study. History taking by the physician who first examined the patient at BCCI was uneven in quality. Some histories were quite complete, while others had only a few main points covered. Patients also varied in how much they could remember or wished to tell. Some older patients (in their sixties and seventies) could remember little of what had happened to them. Any incident that could be verified by hospital or physician records was checked by the medical records department at BCCI. Because of the length of time since diagnosis (up to 22 years) and the fact that 68 percent of the patients were dead at the time of the study, no further followup was attempted for missing data. When information about a variable was missing, a numerical missing-data code was used.

After the data were collected, an examination of the variables with many missing data entries was undertaken. A variable that had not been mentioned in the chart had been coded as missing. In some cases the variable was not mentioned because the patient did not have that attribute.
For example, if smoking was not mentioned in the history, it was much more likely that the patient did not smoke than that the patient smoked and the fact had not been reported. Six variables^3 were found to be of that type and so the missing-data code was changed to the code for absence of the attribute. Other variables that had missing data values were not changed. A patient with missing data for a particular variable was kept in the analysis when that variable was not included and was dropped from the analysis when the variable was included. More will be said about inclusion and exclusion of cases in Chapter 5.

^3 The six variables that were changed in this manner were SMOKE, DYSMEN, OTHDRG, HORMON, MAZODY, and MASTIT.

4.5 Sources of Error

Four main sources of error existed for the data. As stated in the previous section, the patients and doctors were the two primary sources of error. The errors in both cases are biased toward omitting data. A patient was unlikely to claim a disease or operation that had not occurred and most of the claims were confirmed by followup or removed from the history. The doctors had no reason to claim things that did not occur. However, it was quite easy for the patient to forget an incident occurring many years prior to diagnosis and for the physician to neglect to ask specifically for all possibilities.

A third source of error was in the data coding. Since all the data were collected by the same individual, any systematic errors in interpretation should be consistent for all patients. If an error in interpretation of the medical facts occurred, it was the same in each instance. It is assumed that the other data coding errors were random.

The final main source of error in data collection was the keypunching of the data. To minimize such errors, all data were proofread after keypunching. A program was written that printed the numbers from the cards in the same format as the data coding form.
The computer output was then compared to the original data coding forms to find errors. All errors found in the keypunching were corrected before the analysis was begun. Thus, there were few remaining errors because of the keypunching.

Chapter 5

DATA ANALYSIS

5.1 Computer Programs

For the theoretical and practical reasons discussed in Chapter 3, it was decided that stepwise discriminant analysis and logistic regression would be used for the classification of patients by nodal status. The linear discrimination was done using the Biomed Stepwise Discriminant Analysis program (BMD07M). It performed a two-group discriminant analysis wherein the variables were selected so as to maximally reduce the remaining variance. The classification functions which included those selected variables were then used to classify the cases.

Logistic regression was accomplished by the use of a log-linear model program developed by M. Nerlove and S. J. Press. A listing of that program appears in their report [35, p. 101 ff]. The coefficients of the logistic probability function were estimated by the Newton-Raphson algorithm, where the starting values were found by ordinary least squares. The program produced estimates of the logistic probability coefficients. A user-written program was then used to calculate the posterior probability of being in group 1 and thus the classification for each case.

5.2 Selecting Cases

In order to avoid using patients with missing data for the included variables, a procedure was devised for selecting cases. Instead of eliminating cases with missing values, group means could have been substituted for missing values. It was felt that that would not be appropriate here because of the nature of the data. The large number of binary variables caused problems for this approach. The mean of a 0-1 variable calculated for many patients would be between 0 and 1. A value between 0 and 1 for the binary variables was deemed unacceptable.
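The contrast between the two options can be illustrated with a small sketch. It is not the procedure's actual program: the missing-data code `MISSING = -9`, the function name `complete_cases`, and the five-case data matrix are assumptions made only for this example (the thesis does not state the value of its missing-data code), and NumPy is assumed to be available.

```python
import numpy as np

MISSING = -9  # hypothetical missing-data code, chosen for illustration only

# Hypothetical cases: two binary variables and one count.
data = np.array([
    [1, 0, 12],
    [0, MISSING, 3],
    [1, 1, MISSING],
    [0, 0, 6],
    [1, 0, 2],
])

def complete_cases(data, columns, missing=MISSING):
    """Boolean mask of cases with no missing code in the chosen columns."""
    return ~(data[:, columns] == missing).any(axis=1)

# Cases usable when all three variables are included in the analysis.
mask_all = complete_cases(data, [0, 1, 2])
# Cases usable when the second variable is left out of the analysis.
mask_subset = complete_cases(data, [0, 2])

# Mean substitution for the second (binary) variable gives a fractional value.
observed = data[data[:, 1] != MISSING, 1]
mean_fill = observed.mean()
```

Here the substituted mean is neither 0 nor 1, which is the objection raised in the text, while restricting the variable set raises the number of usable cases.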
All cases with complete information for all variables were chosen for a preliminary analysis. There were 173 such cases. Discriminant analysis was run on these 173 cases to select the variables that were of some significance. Significance was defined quite liberally (F probability to enter < .20) so that even variables of marginal significance would be included initially. For reasons presented in Chapter 3 the subset of variables chosen in this manner should have included all the variables for which the coefficients were significantly different from zero. It will also include some for which the coefficient should have been estimated as zero. The variables which had coefficients that were not significant were eliminated in the final analysis by the logistic regression. The results of the discriminant analysis are presented in Tables XI and XII. Fourteen variables were chosen by this process. One variable was entered in the third step and then removed in the twelfth step. It was decided to include that variable in further work in case it proved to be significant later.

Table XI
Variables Chosen by Linear Discrimination for 173 Cases

     Variables    F Probability to Enter
 1.  THYROD            .0038
 2.  BRSIS             .0121
 3.  DURTON            .0114
 4.  NODEPL            .0199
 5.  HEART             .0640
 6.  LOC1              .1020
 7.  SYMPT2            .0586
 8.  SYMPT3            .0265
 9.  KIDNEY            .0743
10.  BIRTHS            .1435
11.  HORMON            .0820
12.  MASTIT            .0970
13.  BRMOM             .1087
14.  OTHILL            .1746

Variable entered and then removed:
 1.  MONNUR

Table XII
Classification of 173 Cases by Linear Discrimination

                    Classified to group
                       0        1      Total
Actual group   0      90       17       107
               1      33       33        66
Total                123       50       173

CPU time: 11.292 sec.
Correct classification: 71.10%

Logistic regression was then done with the fifteen variables and 173 cases to compare to the discriminant analysis. The results are shown in Tables XIII and XIV. Nine of the fifteen variables were chosen as significant (P < .05). The CPU time was twice as long as for linear discrimination. However, the classification by logistic regression was better than the classification by discriminant analysis: 77 percent correct for logistic regression and 71 percent correct for discriminant analysis. Both procedures did poorly in classifying those known to have positive nodes. Discriminant analysis classified only 50 percent of such cases correctly, while logistic regression classified 59 percent of them correctly. For those known to have negative nodes the correct classification rates were 84 percent and 89 percent, respectively.

The fifteen variables chosen by linear discrimination (Table XI) were then used for a new case-selection procedure. Patients who had complete information for those variables were chosen for further analysis. A total of 503 patients had complete information on these fifteen variables. This group of patients was the sample for the final classification procedure.

5.3 Classification of 503 Cases

Discriminant analysis was done for the 503 selected cases and fifteen variables. Again the level of significance was defined liberally: F probability to enter less than or equal to .20. The results appear in Tables XV and XVI. Only four variables were significant enough to enter the discriminant function. The discriminating power of the function was even worse for those with positive nodes; only 12 percent were correctly classified into group 1. For negative nodes there was 96 percent correct classification.

Table XIII
Variables Chosen by Logistic Regression for 173 Cases

     Variables    Asymptotic Significance
 1.  DURTON            .03017
 2.  NODEPL            .03290
 3.  HEART             .01377
 4.  SYMPT3            .00338
 5.  KIDNEY            .04828
 6.  BIRTHS            .01239
 7.  HORMON            .08840
 8.  MASTIT            .01727
 9.  BRMOM             .12352

Table XIV
Classification of 173 Cases by Logistic Regression

                    Classified to group
                       0        1      Total
Actual group   0      95       12       107
               1      27       39        66
Total                122       51       173

CPU time: 25.652 sec.
Correct classification 77.46% Iterations 39 Table XV Variables Chosen by Linear Discrimination for 503 Cases Variables F Probability to Enter 1. NODEPL .0001: 2. SYMPT3 .0007 3. THYROD .0171 4. HEART .0432 Table XVI Classification of 503 Cases by Linear Discrimination Classified to group 0 1 Actual 0 321 12 333 group 1 149 21 170 470 33 503 CPU time 4.714 sec. Correct classification 67.99% Table XVII Variables Chosen by Logistic Regression for 503 Cases Variables Asymptotic Significance 1. NODEPL .00016 2. SYMPT3 .00086 3. THYR0D .02031 4. HEART .04636 Table XVIII Classification of 503 Cases by Logistic Regression with 15 Variables Classified to group 0 1 Actual 0 318 15 333 group 1 147 23 170 465 38 503 CPU time 18.251 sec. Correct classification 67.78% Iterations 10 Log of the 1ikelihood -301.774846 82 Logistic regression was run for the same fifteen variables and 503 cases. The results are shown in Tables XVII and XVIII. A comparison of Tables XV and XVII shows that the same four variables were significant and in the same order of significance. Again classification was much worse for positive nodes than negative nodes — 95 percent correct for negative nodes and 14 percent correct for positive nodes. Since only four of the variables were significant, i t was decided to rerun the logistic regression with only those four variables included. The previous run had forced all fifteen variables into the c l a s s i f i c a tion function regardless of significance. XIX. The results are given in Table All variables remained highly significant (P < .05). Using the Table XIX Classification of 503 Cases by Logistic Regression with with 4 Variables Classified to group 0 1 Actual 0 324 9 333 group 1 150 20 170 474 29 503 CPU time 9.810 sec. Correct classification 68.39% Iterations 8 Log of the likelihood -303.560423 83 log-likelihood test to determine whether the ten 3 . 
dropped from the second model should be zero, i t was found from Tables XVIII and XIX that U = -2 A = -2(-303.560423 + 301.774846) = 3.57114. 1 The .05 level of significance chi-square value for ten degrees of freedom is 18.307. Thus, we conclude that the reduced model is a good f i t when compared to the " f u l l " model with fourteen variables. Correct classification for positive nodes was 12 percent and correct classification for negative nodes was 97 percent. A comparison of Tables XVI and XIX shows that logistic regression was marginally better than discriminant analysis at classification and took twice as long. For both discriminant analysis and logistic regression there was a drop in discriminating power with the increase in number of cases. Linear discrimination went from 71.10 percent to 67.99 percent while logistic regression went from 77.46 percent to 68.39 percent. Part of 2 the drop can be explained by the increase in the number of cases. At least for logistic regression there appears to be some other factor influencing the classification. A likely contributing factor is less reliable data for the 330 added patients. The additional cases had some missing data values for variables not considered in the function and thus = log 2 e (likelihood). 2 When d o i n g a l i n e a r r e g r e s s i o n , o n e c a l c u l a t e s a n a d j u s t e d R t o a c c o u n t f o r d i f f e r e n c e s i n numbers o f o b s e r v a t i o n s . In t h a t c a s e 2 / 2\ N— 1 R .. = 1 - (1-R ) - — w h e r e N i s t h e number o f c a s e s a n d p i s t h e number adj N-p o f t e r m s i n t h e model [ 3 3 ] . F o r t h e numbers we w e r e c o n c e r n e d w i t h h e r e the factor Thus, o n l y of cases. w o u l d be 1.0813 f o r 173 c a s e s a n d 1.0060 f o r 503 c a s e s . N-p s m a l l d i f f e r e n c e s w o u l d be a t t r i b u t e d t o t h e i n c r e a s e i n number 84 their information may be less reliable even when a variable was observed. 
That is, a patient who admits to not knowing certain information may be unreliable for other answers that were given. Also a doctor who looks for general answers may not ask for elaboration of answers to ensure complete recording of data.

5.4 Subsets of Patients for Further Classification

There is much medical evidence that breast cancer runs a much different course in premenopausal and postmenopausal women. "The age-specific incidence and mortality curves of breast cancer have two components. The premenopausal component is steeper... . The postmenopausal slope is less steep than the premenopausal. . ." [9, p. 721]. It has been hypothesized that the premenopausal carcinomas are hormone dependent, while the postmenopausal tumors are not. Evidence for the hypothesis comes from treatment results. Oophorectomy (removal of the ovaries and consequent cessation of estrogen production) is a successful treatment adjunct for premenopausal women. It seems to make no difference in postmenopausal women since the ovaries have already ceased production of estrogen.

In the light of the medical evidence, it was surprising that neither age nor menopausal status appeared as a significant variable. Apparently, this factor was confounded by the other factors in the model. Consequently, it was decided to investigate the menopausal status further. A stratification of patients by menopausal status was a possible avenue of investigation.

The first step was to see if there were two distinct components of age incidence. A histogram of incidence versus age (Figure 5.1) showed the possibility of two components. Histograms were produced for the two subsets: premenopausal patients (Figure 5.2) and postmenopausal patients (Figure 5.3). Examination of Figure 5.2 and Figure 5.3 showed clear differences. The premenopausal patients had a graph with a very steep gradient while the postmenopausal patients had a graph that was more nearly flat for a twenty year period. The difference between the slopes may be because of different "censors" in the two groups: the premenopausal patients are censored by menopause while the postmenopausal patients are censored by death. However, other investigators [36,28] had found significant differences in the course of the disease between premenopausal and postmenopausal patients. Thus, it was decided to divide the patients on the basis of menopausal status and do the analyses on each subset of patients separately.

The 173 patients with full information were separated for preliminary work on the basis of menopausal status. Sixty of them were postmenopausal patients and 113 were premenopausal patients. Discriminant analysis was then done for the two subsets separately to reduce the dimension of the problem. Tables XX and XXI show the variables chosen for postmenopausal and premenopausal patients, respectively. Only five of the variables were chosen for both groups: DURTON, KIDNEY, BIRTHS, MASTIT, and HEART. Tables XXII and XXIII present the results of classification for the groups separately. The percentage classified correctly was greater for each group than when the classification was for the combined group (88 percent and 80 percent versus 71 percent for combined). Consequently, each group was investigated with a different set of variables.

[Figure 5.1. Histogram of age-incidence for all 535 cases. Horizontal axis: age at last birthday.]

[Figure 5.2. Histogram of age-incidence for premenopausal patients. Horizontal axis: age at last birthday.]
[Figure 5.3. Histogram of age-incidence for postmenopausal patients. Horizontal axis: age at last birthday.]

Table XX
Variables Chosen by Discriminant Analysis for 60 Postmenopausal Patients

    Variable        F Probability to Enter
     1. HEART           .0155
     2. NODEPL          .0054
     3. BIRTHS          .0373
     4. SKIN            .0529
     5. SMOKE           .0758
     6. TRAUMA          .0283
     7. OTHSUR          .0824
     8. BRFED           .0729
     9. DURTON          .0287
    10. BRSIS           .1019
    11. LYMPH           .0471
    12. OOPHOR          .0646
    13. KIDNEY          .0745
    14. AGE             .0738
    15. BENIGN          .0481
    16. SIZE            .0936
    17. MASTIT          .1441
    18. RACE            .0582
    19. CELL            .1760

    Variable entered and then removed:
     1. CANCER

Table XXI
Variables Chosen by Discriminant Analysis for 113 Premenopausal Patients

    Variable        F Probability to Enter
     1. SYMPT3          .0013
     2. THYROD          .0088
     3. BREAST          .0398
     4. DURTON          .0341
     5. PATH6           .0538
     6. PELVIC          .0539
     7. KIDNEY          .0538
     8. HORMON          .0326
     9. BIRTHS          .0395
    10. BROTH           .0713
    11. BRMOM           .0647
    12. MASTIT          .0125
    13. PATH2           .0860
    14. HEART           .0808
    15. REG             .1089
    16. ALLERG          .0744

Table XXII
Classification of 60 Postmenopausal Patients by Linear Discrimination with 19 Variables

                      Classified to group
                         0        1     Total
    Actual group 0      32        5        37
    Actual group 1       2       21        23
    Total               34       26        60

    CPU time                  7.14 sec.
    Correct classification    88.33%

Table XXIII
Classification of 113 Premenopausal Patients by Linear Discrimination with 16 Variables

                      Classified to group
                         0        1     Total
    Actual group 0      61        9        70
    Actual group 1      14       29        43
    Total               75       38       113

    CPU time                  6.00 sec.
    Correct classification    79.65%

5.5 Classification of Postmenopausal Patients

Nineteen variables were selected for the postmenopausal patients. Since the variables were chosen in the order of their explanatory power, it was decided to use only the first sixteen variables for the logistic regression. The discrimination programs could have been run twice on subsets of the nineteen variables to pick the best sixteen, but previous work had indicated that that was not necessary.
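The percentages quoted for Tables XXII and XXIII follow directly from the four cell counts of each classification table. The sketch below recomputes them; the function name and argument order are assumptions made for illustration, and only the counts come from the tables.

```python
def classification_rates(n00, n01, n10, n11):
    """Return percent correct for positive nodes, negative nodes, and overall.

    n00: actual group 0 classified to group 0 (negative nodes, correct)
    n01: actual group 0 classified to group 1
    n10: actual group 1 classified to group 0
    n11: actual group 1 classified to group 1 (positive nodes, correct)
    """
    positive = 100.0 * n11 / (n10 + n11)
    negative = 100.0 * n00 / (n00 + n01)
    overall = 100.0 * (n00 + n11) / (n00 + n01 + n10 + n11)
    return positive, negative, overall


# Cell counts taken from Tables XXII and XXIII.
postmenopausal = classification_rates(32, 5, 2, 21)
premenopausal = classification_rates(61, 9, 14, 29)

for label, rates in (("postmenopausal", postmenopausal),
                     ("premenopausal", premenopausal)):
    print(label, " ".join(f"{r:.2f}" for r in rates))
```

The overall rates agree with the "Correct classification" entries of the two tables (88.33 and 79.65 percent); the per-group rates make visible how much worse the premenopausal positive-node cases were classified.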
All postmenopausal patients were screened for full information on the nineteen variables, yielding a total of 128 cases. Discriminant analysis was run on the 128 patients with nineteen variables. Four variables were chosen for the analysis and 72.66 percent were correctly classified (Tables XXIV and XXV). Again negative nodes were classified better (80 percent correct) than positive nodes were (60 percent correct). Then logistic regression was performed with sixteen variables. The results appear in Tables XXVI and XXVII. Only three variables were found to be significant. Logistic regression was run again with the three significant variables (Table XXVIII). A comparison of Tables XXV and XXVIII shows that logistic regression was poorer at classifying than discriminant analysis was. Since different variables had been used it was decided to rerun logistic regression with the four variables of Table XXIV. The results of that run are in Table XXIX. A comparison of Tables XXV and XXIX shows that both methods classified the patients exactly the same.

Table XXIV
Subset of Variables Chosen by Discriminant Analysis for 128 Postmenopausal Patients

    Variable        F Probability to Enter
     1. NODEPL          .0001
     2. AGE             .0638
     3. BRSIS           .0682
     4. SMOKE           .1934

Table XXV
Classification of 128 Postmenopausal Patients by Discriminant Analysis with 4 Variables

                      Classified to group
                         0        1     Total
    Actual group 0      65       16        81
    Actual group 1      19       28        47
    Total               84       44       128

    CPU time                  2.5 sec.
    Correct classification    72.66%

Table XXVI
Subset of Variables Chosen by Logistic Regression for 128 Postmenopausal Patients

    Variable        Asymptotic Significance
     1. AGE             .05321
     2. NODEPL          .00011
     3. TRAUMA          .12377

Table XXVII
Classification of 128 Postmenopausal Patients by Logistic Regression with 16 Variables

                      Classified to group
                         0        1     Total
    Actual group 0      68       13        81
    Actual group 1      19       28        47
    Total               87       41       128

    CPU time                  32 sec.
    Correct classification    75.00%
    Iterations                37
    Log of the likelihood     -67.1124784

Table XXVIII
Classification of 128 Postmenopausal Patients by Logistic Regression with 3 Variables

                      Classified to group
                         0        1     Total
    Actual group 0      64       17        81
    Actual group 1      21       26        47
    Total               85       43       128

    CPU time                  3.077 sec.
    Correct classification    70.31%
    Iterations                7
    Log of the likelihood     -74.3288522

Table XXIX
Classification of 128 Postmenopausal Patients by Logistic Regression with 4 Variables

                      Classified to group
                         0        1     Total
    Actual group 0      65       16        81
    Actual group 1      19       28        47
    Total               84       44       128

    CPU time                  10.531 sec.
    Correct classification    72.66%
    Iterations                35
    Log of the likelihood     -70.2542850

To test whether certain $\beta_i$ were zero, the log-likelihood test was used. From Tables XXVII and XXVIII the statistic for testing the reduction of the model to three variables was found to be

$U = -2\Lambda = -2(-74.3288522 + 67.1124784) = 14.432748.$

The corresponding chi-square for twelve degrees of freedom and .05 significance level is 21.026. Thus, I concluded that the twelve $\beta_i$ that were estimated as zero were zero. Using the same test and Tables XXVII and XXIX, the statistic is

$U = -2(-70.2542850 + 67.1124784) = 6.283614.$

Since $\chi^2_{.05}(11) = 19.675$, it was concluded that the reduced model with four variables was also acceptable.

5.6 Classification of Premenopausal Patients

A similar procedure was used for the premenopausal patients. The sixteen variables of Table XXI chosen by discriminant analysis were used to select full information cases. A total of 253 premenopausal patients had full information on those variables.

When discriminant analysis was run with the 253 selected cases, ten variables were found to be significant. The results of the discrimination are shown in Tables XXX and XXXI. Table XXX shows the ten variables chosen as significant. From Table XXXI we see that again patients with negative nodes were classified better (95 percent correct) than patients with positive nodes (27 percent correct).
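The log-likelihood tests used in Sections 5.3 and 5.5 amount to simple arithmetic on the "Log of the likelihood" entries of the classification tables. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the critical values are copied from the text, not computed from a chi-square distribution.

```python
def log_likelihood_statistic(loglik_reduced, loglik_full):
    """U = -2 * (log-likelihood of reduced model - log-likelihood of full model)."""
    return -2.0 * (loglik_reduced - loglik_full)


# Each entry: description, reduced log-likelihood, full log-likelihood,
# and the .05-level chi-square critical value quoted in the text.
tests = [
    ("503 cases, 4 vs 15 variables", -303.560423, -301.774846, 18.307),
    ("128 postmenopausal, 3 vs 16 variables", -74.3288522, -67.1124784, 21.026),
    ("128 postmenopausal, 4 vs 16 variables", -70.2542850, -67.1124784, 19.675),
]

results = []
for description, reduced, full, critical in tests:
    u = log_likelihood_statistic(reduced, full)
    acceptable = u < critical  # reduced model is retained when U is below the critical value
    results.append((description, u, critical, acceptable))
    print(f"{description}: U = {u:.4f}, critical value = {critical}, "
          f"reduced model acceptable: {acceptable}")
```

In each of the three comparisons the statistic falls below the quoted critical value, which is the basis for retaining the reduced model.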
Logistic regression was then run with the 253 full information premenopausal patients and the same sixteen variables from Table XXI. The results of the logistic regression appear in Tables XXXII and XXXIII. Ten variables were significant in the logistic regression also. A comparison of Tables XXX and XXXII shows that nine of the variables are the same for both cases. The other variable in the logistic regression, DURTON, was the least significant of the ten variables. As with linear discrimination, classification of patients with positive nodes was poorer than classification of patients with negative nodes (33 percent versus 93 percent).

Table XXX
Subset of Variables Chosen by Discriminant Analysis for 253 Premenopausal Patients

    Variable        F Probability to Enter
     1. THYROD          .0013
     2. SYMPT3          .0033
     3. MASTIT          .0165
     4. PELVIC          .0521
     5. BRMOM           .0448
     6. PATH2           .1186
     7. KIDNEY          .1201
     8. BIRTHS          .1012
     9. HEART           .1132
    10. BROTH           .1979

Table XXXI
Classification of 253 Premenopausal Patients by Discriminant Analysis with 10 Variables

                      Classified to group
                         0        1     Total
    Actual group 0     159        8       167
    Actual group 1      63       23        86
    Total              222       31       253

    CPU time                  2.5 sec.
    Correct classification    71.94%

Table XXXII
Subset of Variables Chosen by Logistic Regression for 253 Premenopausal Patients

    Variable        Asymptotic Significance
     1. BRMOM           .02721
     2. BROTH           .13538
     3. BIRTHS          .08456
     4. HEART           .08576
     5. KIDNEY          .11729
     6. THYROD          .00284
     7. PELVIC          .01061
     8. MASTIT          .00059
     9. DURTON          .19050
    10. SYMPT3          .00726

Table XXXIII
Classification of 253 Premenopausal Patients by Logistic Regression with 16 Variables

                      Classified to group
                         0        1     Total
    Actual group 0     156       11       167
    Actual group 1      58       28        86
    Total              214       39       253

    CPU time                  35.603 sec.
    Correct classification    72.73%
    Iterations                40
    Log of the likelihood     -137.340218

In judging how well our models have worked for classification there are two types of tests. Goodness of fit tests are used to test whether certain $\beta_i$ are zero. They test how well the model has been estimated. How well the estimated model classified is the second measure of the classification scheme. Although one gets a better fit of the model with more variables, the classification may be better with fewer variables in the model.

Since the previous logistic classification was with sixteen variables and only ten were significant, it was decided to rerun the logistic regression with only the ten significant variables. It was expected that that would classify better and the results presented in Table XXXIV confirmed that expectation. Logistic regression with ten variables classified better than logistic regression with sixteen variables or discriminant analysis with ten variables (74 percent versus 73 percent and 72 percent). To test whether the ten variable logistic model was a good enough fit compared to the sixteen variable logistic model, the log-likelihood test was used again. It was found from Tables XXXIII and XXXIV that

$U = -2\Lambda = -2(-140.582389 + 137.340218) = 6.48434.$

The .05 significance level chi-square value for six degrees of freedom is 12.6. Thus, we can conclude that the reduced model provides a good fit.

Table XXXIV
Classification of 253 Premenopausal Patients by Logistic Regression with 10 Variables

                      Classified to group
                         0        1     Total
    Actual group 0     162        5       167
    Actual group 1      61       25        86
    Total              223       30       253

    CPU time                  6.908 sec.
    Correct classification    73.91%
    Iterations                9
    Log of the likelihood     -140.582389

5.7 Summary of Results

A summary of the classification results was prepared and appears in Table XXXV. The data analysis has confirmed the theoretical results of Chapter 3. Linear discrimination was an efficient method for the preliminary analyses. For the final analyses logistic regression provided more correct classifications with a significant time increase (in one case the classification was the same). However, classification in all cases was less than hoped for. Examination of the cases misclassified showed that many of such cases were near the boundary. That suggested the use of a two-stage procedure as described in Chapter 3. Patients with high or low posterior probabilities could be classified on the basis of this data. The patients with probabilities near .5 would need further data before classification could be done.

Table XXXV
Summary of Classification Results

    Model  Variables  Cases   % Correct        % Correct        % Correct
                              Positive Nodes   Negative Nodes   Overall

    Combined premenopausal and postmenopausal:
    DA        14       173       50.00            84.11           71.10
    LR         9       173       59.09            88.79           77.46
    DA         4       503       12.35            96.40           67.99
    LR        15       503       13.53            95.50           67.78
    LR         4       503       11.76            97.30           68.39

    Postmenopausal:
    DA        19        60       91.30            86.49           88.33
    DA         4       128       59.57            80.25           72.66
    LR        16       128       59.57            83.95           75.00
    LR         3       128       55.32            79.01           70.31
    LR         4       128       59.57            80.25           72.66

    Premenopausal:
    DA        16       113       67.44            87.14           79.65
    DA        10       253       26.74            95.21           71.94
    LR        16       253       32.56            93.41           72.73
    LR        10       253       29.07            97.01           73.91

    DA = Discriminant Analysis
    LR = Logistic Regression

Tables XXXVI and XXXVII were prepared to demonstrate some possible two-stage procedures. Table XXXVI used the logistic classification for 253 premenopausal patients with ten variables. Table XXXVII used the logistic classification for 128 postmenopausal patients with four variables. For each decile of posterior probability the numbers of patients correctly and incorrectly classified were tabulated. If an error rate of ten percent is acceptable, then the boundaries could be set at .4 and .7 for Table XXXVI.
That is, all patients with posterior probability less than .4 would be classified as having negative nodes. All patients with posterior probability greater than .7 would be classified as having positive nodes. The patients in the middle would need further investigation before classification. In this case, that would be 109 patients or 43 percent. A similar analysis of Table XXXVII produced boundary points of .3 and .7 for a nine percent error rate and 73 patients (57 percent) to have further investigation.

Table XXXVI
Number of Premenopausal Patients Correctly and Incorrectly Classified by Decile of Posterior Probability

    Decile         Correct   Incorrect
    .00 - .1          14         0
    .11 - .2          31         8
    .21 - .3          51        14
    .31 - .4          42        21
    .41 - .5          21        17
    .51 - .6           6         2
    .61 - .7           3         0
    .71 - .8          11         0
    .81 - .9           5         2
    .91 - 1.0          4         1

Table XXXVII
Number of Postmenopausal Patients Correctly and Incorrectly Classified by Decile of Posterior Probability

    Decile         Correct   Incorrect
    .00 - .1          10         0
    .11 - .2          18         5
    .21 - .3          19         4
    .31 - .4          10         7
    .41 - .5           8         4
    .51 - .6          12         9
    .61 - .7          14         5
    .71 - .8           1         1
    .81 - .9           0         1
    .91 - 1.0          0         0

Examination of the cases that were misclassified also showed runs of errors. For one group of patients 90 percent of the errors occurred in two consecutive years of the nine years studied and 95 percent of the errors were in three consecutive years of the nine years. That lends credence to the idea that certain history takers were better than others. It is probable that better classification could have been achieved with better data for those years.

The medical implications of the analysis will be discussed in the next chapter.

Chapter 6

CONCLUSIONS

For the medical classification problem of distinguishing between those breast cancer patients with supraclavicular or internal mammary lymph node metastases and those without such metastases, two models were selected: linear discrimination and logistic regression.
The empirical results verified the theoretical findings that linear discrimination was faster than logistic regression but logistic regression provided a greater proportion of correct classifications.

When the classification was done on 503 cases with full information, approximately 68 percent correct classification was achieved. While this is better than the clinical staging, it was not as good as had been hoped for. Dividing the patients by menopausal status and classifying the groups separately provided better classification because of the differing disease processes in the two groups. For 253 premenopausal patients the proportion of correct classifications was 74 percent. For 128 postmenopausal patients the correct proportion was 75 percent. Thus, separating the groups on the basis of menopausal status provided classification that was better than when the groups were combined.

A two-stage procedure was proposed to make the error rate smaller. Patients were classified on the basis of the data into three groups: those with negative nodes, those with positive nodes, and those for whom more data had to be collected before a final classification was made. With an error rate of 10 percent or less, 43 percent of the premenopausal and 57 percent of the postmenopausal patients required further observation.

It was concluded that for a medical diagnosis problem such as this with data that are clearly non-normal and often not even continuous, the use of linear discriminant analysis was adequate for the exploratory work and to reduce the dimension of the problem. In the preliminary analyses of the subgroups of premenopausal and postmenopausal patients with small numbers of patients and complete information, discriminant analysis provided more correct classifications for positive nodes.
Logistic regression was preferred for the final analyses since it provided better estimators and consequently more classifications that were correct when there were more patients and less complete information about the patients.

The medical conclusions are not so definitive. The classification procedures used here suggest areas where further investigation would be useful. For all patients combined the variables that entered the final analysis were NODEPL, SYMPT3, THYROD, and HEART. As expected, palpable lymph nodes were positively correlated with pathologically involved lymph nodes. The presence of changes in the nipple or discharge from the nipple as the first symptom was positively correlated with positive nodes. A history of thyroid disease was also positively correlated with positive nodes. A history of heart disease was negatively correlated with positive nodes. Thus, the physician might consider these factors when trying to evaluate the nodes clinically.

When the premenopausal and postmenopausal patients were considered separately, some other factors were suggested. For the premenopausal patients the variables that entered the final analyses were THYROD, SYMPT3, MASTIT, PELVIC, BRMOM, PATH2, KIDNEY, BIRTHS, HEART, and BROTH. Again a history of thyroid disease was positively correlated with positive nodes and nipple change or discharge as the first symptom was positively correlated with positive nodes. A history of benign breast disease during lactation was positively correlated with involved nodes. Previous pelvic surgery was negatively correlated with involved nodes. Breast cancer in the patient's mother and other relatives were both negatively correlated with involved nodes. This was probably a result of the patient's increased awareness of the disease and consequent earlier diagnosis. Noninfiltrating papillary carcinoma was highly negatively correlated with positive nodes.
That is, patients with that pathological type of carcinoma rarely had nodal metastases. A history of kidney disease was negatively correlated with positive nodes. The number of full term pregnancies was slightly negatively correlated with positive nodes. Heart disease was again negatively correlated with involved nodes.

For the postmenopausal patients the variables that entered the final analyses were NODEPL, AGE, BRSIS, and SMOKE. Again, as expected, palpable nodes were positively correlated with positive nodes. Age was very slightly negatively correlated with positive nodes. Breast cancer in the patient's sister was negatively correlated with positive nodes. Smoking was slightly negatively correlated with positive nodes. The disease in the postmenopausal patients was less variable than the disease in the premenopausal patients. The lesser degree of variability resulted in more accurate classifications for the postmenopausal patients.

The results presented above could be used to help the physician in his clinical diagnosis of the nodal status. They also suggest areas in which further research could be done.

BIBLIOGRAPHY

[1] Anderson, J.A., "Separate Sample Logistic Discrimination," Biometrika, 59 (1972), 19-35.

[2] Anderson, T.W., An Introduction to Multivariate Statistical Analysis, John Wiley & Sons, Inc., New York, 1958.

[3] Bishop, Y.M.M., S.E. Fienberg, and P.W. Holland, Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, Mass., 1975.

[4] Bock, R.D., "Estimating Multinomial Response Relations," Essays in Probability and Statistics, ed. by R.C. Bose, et al., University of North Carolina Press, Chapel Hill, North Carolina, 1970.

[5] Brinkley, D. and J.L. Haybittle, "A 15-year Follow-up Study of Patients Treated for Carcinoma of the Breast," British Journal of Radiology, 41 (1968), 215-221.

[6] Brinkley, D. and J.L. Haybittle, "The Curability of Breast Cancer," The Lancet, July 19, 1975, 95-97.
[7] Cancer Research Campaign, "Management of Early Cancer of the Breast," British Medical Journal, 1 (1976), 1035-1038.

[8] Cochran, W.G. and C.E. Hopkins, "Some Classification Methods with Multivariate Qualitative Data," Biometrics, 17 (1961), 10-32.

[9] Correa, P., "The Epidemiology of Cancer of the Breast," American Journal of Clinical Pathology, 64 (1975), 720-727.

[10] Cox, D.R., Analysis of Binary Data, Methuen and Co., Ltd., London, 1970.

[11] Crawford, G., "Breast Cancer: Selective Biopsy, Rationale and Results," Cancer Control Agency Seminar, 1976.

[12] Efron, B., "The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis," Journal of the American Statistical Association, 70 (1975), 892-898.

[13] Federer, W.T., "Procedures and Designs for Screening Material in Selection and Allocation with a Bibliography," Biometrics, 19 (1963), 553-587.

[14] Finney, D.J., R. Latscha, B.M. Bennett, and P. Hsu, Tables for Testing Significance in a 2x2 Contingency Table, Cambridge University Press, New York, 1963.

[15] Fisher, E.R., R.M. Gregorio, and B. Fisher, "The Pathology of Invasive Breast Cancer: A Syllabus Derived from Findings of the National Surgical Adjuvant Breast Project (Protocol Number 4)," Cancer, 36 (1975), 1-85.

[16] Fisher, E.R., R.M. Gregorio, C. Redmond, W.S. Kim, and B. Fisher, "The Significance of Extranodal Extension of Axillary Metastases," American Journal of Clinical Pathology, 65 (1976), 439-444.

[17] Friel, J.P. (ed.), Dorland's Illustrated Medical Dictionary, Twenty-fifth Edition, W.B. Saunders Co., Toronto, 1974.

[18] Gardner, M.J. and D.J.P. Barker, "Diagnosis of Hypothyroidism: A Comparison of Statistical Techniques," British Medical Journal, 2 (1975), 260-262.

[19] Goldstein, M. and M. Rabinowitz, "Selection of Variates for the Two-Group Multinomial Classification Problem," Journal of the American Statistical Association, 70 (1975), 776-781.
[20] Gordon, T., "Hazards in the Use of the Logistic Function with Special Reference to Data from Prospective Cardiovascular Studies," Journal of Chronic Diseases, 27 (1974), 97-102.
[21] Guseman, L.F., B.C. Peters, and H.F. Walker, "On Minimizing the Probability of Misclassification for Linear Feature Selection," The Annals of Statistics, 3 (1975), 661-668.
[22] Haagensen, D.C., "The Choice of Treatment for Operable Carcinoma of the Breast," Surgery, 76 (1974), 685-714.
[23] Haagensen, D.C., E. Cooley, et al., "Treatment of Early Mammary Carcinoma: A Cooperative International Study," Annals of Surgery, 170 (1969), 875-890.
[24] Haberman, S.J., The Analysis of Frequency Data, University of Chicago Press, Chicago, 1974.
[25] Halperin, M., W.C. Blackwelder, and J.I. Verter, "Estimation of the Multivariate Logistic Risk Function: A Comparison of the Discriminant Function and Maximum Likelihood Approaches," Journal of Chronic Diseases, 24 (1971), 125-158.
[26] Hartz, S.C., and L.A. Rosenberg, "Computation of MLE for the Multiple Logistic Risk Function for Use with Categorical Data," Journal of Chronic Diseases, 28 (1975), 421-430.
[27] Krzanowski, W.J., "Discrimination and Classification Using Both Binary and Continuous Variables," Journal of the American Statistical Association, 70 (1975), 782-790.
[28] Lewison, E.F., Breast Cancer and its Diagnosis and Treatment, The Williams and Wilkins Company, Baltimore, 1955.
[29] McCabe, G.P., "Computations for Variable Selection in Discriminant Analysis," Technometrics, 17 (1975), 103-109.
[30] McCabe, G.P. and R.J. Pohl, A Computer Program for Variable Selection in Discriminant Analysis, Purdue University Department of Statistics Mimeograph Series, Number 334, Purdue University, Lafayette, Indiana, 1973.
[31] McDivitt, R.W., F.W. Stewart, and J.W. Berg, Tumors of the Breast, Armed Forces Institute of Pathology, Bethesda, Maryland, 1968.
[32] Maehle, B.O. and F. Harviet, "Prognostic Typing in Breast Cancer," Journal of Clinical Pathology, 26 (1973), 784-791.
[33] Montgomery, D.B. and D.G. Morrison, "A Note on Adjusting R²," Journal of Finance, 1973, 1009-1013.
[34] Mueller, C. and W. Jeffries, "Cancer of the Breast: Its Outcome as Measured by the Rate of Dying and Causes of Death," Annals of Surgery, 182 (1975), 334-341.
[35] Nerlove, M. and S.J. Press, Univariate and Multivariate Log-Linear and Logistic Models, RAND Report R-1306-EDA/NIH, The RAND Corporation, Santa Monica, California, 1973.
[36] Papaioannou, A.N., The Etiology of Human Breast Cancer, Springer-Verlag, New York, 1974.
[37] Peters, M.V., "Cutting the 'Gordian Knot' in Early Breast Cancer," Annals of the Royal College of Physicians and Surgeons of Canada, 8 (1975), 186-192.
[38] Poser, C.M., et al., "Amino Acid Residues of Serum and CSF Protein in Multiple Sclerosis: Clinical Application of Statistical Discriminant Analysis," Archives of Neurology, 32 (1975), 308-314.
[39] Prentice, R., "Use of the Logistic Model in Retrospective Studies," unpublished manuscript, University of Washington.
[40] Press, S.J., Applied Multivariate Analysis, Holt, Rinehart and Winston, New York, 1972.
[41] Press, S.J. and S. Wilson, "Choosing Between Logistic Regression and Discriminant Analysis," to be published, manuscript at The University of British Columbia, 1977.
[42] Ramberg, J.S. and J.D. Broffitt, "Selecting the Best Set of Linear Discriminant Variates," Proceedings of Computer Science and Statistics: 8th Symposium on the Interface, 257-261.
[43] Smith, D.C., R. Prentice, D.J. Thompson, W.L. Herrmann, "Association of Exogenous Estrogen and Endometrial Carcinoma," The New England Journal of Medicine, 293, 1164-1167.
[44] Snedecor, G.W. and W.G. Cochran, Statistical Methods, Sixth Edition, The Iowa State University Press, Ames, Iowa, 1967.
[45] Tallis, G.M., P. Leppard, and G. Sarfaty, "A General Classification Model with Specific Application to Response to Adrenalectomy in Women with Breast Cancer," Computers and Biomedical Research, 8 (1975), 1-7.
[46] Tallis, G.M. and G. Sarfaty, "On the Distribution of the Time to Reporting Cancers with Application to Breast Cancer in Women," Mathematical Biosciences, 19 (1974), 371-376.
[47] Truett, J., J. Cornfield, and W.B. Kannel, "A Multivariate Analysis of the Risk of Coronary Heart Disease in Framingham," Journal of Chronic Diseases, 20 (1967), 511-524.
[48] VanNess, J.W. and C. Simpson, "On the Effects of Dimension in Discriminant Analysis," Technometrics, 18 (1976), 175-187.
[49] Walker, S.H. and D.B. Duncan, "Estimation of the Probability of an Event as a Function of Several Independent Variables," Biometrika, 54 (1967), 167-179.
[50] Ziel, H.K. and W.D. Finkle, "Increased Risk of Endometrial Carcinoma Among Users of Conjugated Estrogens," The New England Journal of Medicine, 293 (1975), 1167-1170.
[51] Zielezny, M. and O.J. Dunn, "Cost Evaluation of a Two-Stage Classification Procedure," Biometrics, 31 (1975), 37-47.

APPENDIX A
TREATMENT STUDY GROUPS

Group                                                                       Number of cases
A. No mastectomy — Standard radiation                                       148
B. Simple mastectomy — Standard radiation                                   13
C. Radical mastectomy — Standard radiation                                  316
D. Radical mastectomy — Radiation which did not include axilla              [illegible]
E. Radical mastectomy — with preoperative radiation                         [illegible]
F. Extended radical mastectomy — Radiation to supraclavicular only          [illegible]
G. Simple mastectomy — No radiation                                         1
H. Radical mastectomy — No radiation                                        20
I. Simple mastectomy — with radiation, did not include chest wall           12
J. Simple mastectomy — with radiation, did not include chest wall or axilla 1
K. Simple mastectomy — Radiation and chemotherapy                           1
L. Hormones only                                                            1
TOTAL                                                                       535

APPENDIX B
MANCHESTER STAGING OF BREAST CANCER

Clinical

I Primary freely movable on contracted pectoral muscle or chest wall.
Skin involvement, including ulceration, may be present but must be in direct continuity with the tumor and no extension wide of the tumor itself.
II As Stage I but there are palpable mobile lymph nodes in the axilla on the same side less than 2.5 cm.
III a) Either the skin invaded or fixed over an area wide of the tumor itself but still limited to the breast, or
    b) the tumor fixed to underlying muscle but not to chest wall.
    Axillary nodes, if present, must be mobile.
IV The growth has extended beyond the breast area as shown by:
    a) Axillary nodes not mobile or >2.5 cm.
    b) Tumor fixed to chest wall.
    c) Supraclavicular node involvement.
    d) Involvement of skin wide of breast.
    e) Opposite breast involved with metastatic disease.
    f) Distant metastases.
    g) Inflammatory carcinoma.
Paget's disease of nipple only is Stage I unless nodes present.

Pathological

I Disease confined to the breast.
II As in Stage I, plus metastatic disease confined to axillary lymph nodes below the level of the apex.
II? Level of axillary involvement unknown.
III Direct local spread from primary to:
    a) skin wide of tumor.
    b) underlying fascia or muscle.
IV a) Direct extension from breast primary to rib or cartilage of chest wall.
    b) Extension of disease beyond capsule of an axillary lymph node.
    c) Involvement of apical or internal mammary lymph node or tissues.
    d) Involvement of an axillary lymph node at any level which is found pathologically to be 2.5 cm. in size or larger.
    e) Distant metastases (including supraclavicular lymph nodes).

APPENDIX C
CODING INSTRUCTIONS AND DATA CODING FORM

Coding Instructions:

Occupation
    housewife = 1
    retired = 2
    technical & professional = 3
    clerical = 4
    laborer, outside = 5
    laborer, inside = 6
    other = 7
    unknown = 9

Racial origin
    Caucasian = 1
    Negro = 2
    Indian = 3
    Asian = 4
    Semitic = 5
    Other = 6
    Unknown = 9

Family history
    Yes = 1
    No = 0
    Unknown = 9

Menopausal state
    Premenopausal & up to 5 years after = 1
    Postmenopausal 5 years = 0

Age at menopause
    Premenopausal = 88
    Postmenopausal = Actual years

Illnesses & surgery
    Yes = 1
    No = 0
    Unknown = 9

First observation of symptom
    Patient = 1
    Medical professional = 0

Duration of symptom
    1-97 months = Actual
    98 months = 98
    Unknown = 99

Tumor size
    < 2 cm = 1
    2 to 5 cm = 2
    > 5 cm = 3
    No lump palpable = 4
    Size not stated = 9

Position of tumor
    Lower inner = 1
    Lower outer = 2
    Upper inner = 3
    Upper outer = 4
    Lymph node or tail = 5
    Nipple = 6
    Whole breast = 7
    Other = 8
    Unknown = 9

Nodes, skin, & trauma
    Yes = 1
    No = 0
    Unknown = 9

Breast
    Right = 1
    Left = 0
    Bilateral = 3

First symptom
    Thickening = 1
    Lump = 2
    Pain = 3
    Discharge from nipple = 4
    Nipple inverted = 5
    Skin changes = 6
    Change in breast size = 7
    Mammography, etc. = 8
    Other = 9

Overall body size
    Obese = 1
    Average = 2
    Slender = 3

Breast size/shape
    Pendulous, very large = 1
    Large, full = 2
    Average = 3
    Small = 4

General physical condition
    Good = 1
    Fair = 2
    Poor = 3

Nodal involvement
    Positive apical or i.m. nodes = 1
    Positive lower axillary nodes = 2
    No nodal involvement = 3

Histological differentiation
    Well-differentiated = 1
    Moderately differentiated = 2
    Poorly- or undifferentiated = 3
    Unknown = 9

Foci of disease
    Unicentric = 1
    Multicentric = 0
    Unknown = 9
Cell size
    Small = 1
    Large = 0
    Unknown = 9

Cause of death
    Alive = 0
    Breast cancer = 1
    Intercurrent disease = 2
    Lost to followup = 3

Histology type
    Paget's disease = 1
    Noninfiltrating papillary carcinoma = 2
    Infiltrating papillary carcinoma = 3
    Infiltrating duct carcinoma (scirrhus with productive fibrosis) = 4
    Adenocarcinoma = 5
    Colloid carcinoma (mucoid) = 6
    Medullary carcinoma = 7
    In situ lobular carcinoma = 8
    Infiltrating lobular carcinoma = 9
    Inflammatory carcinoma = 10
    Carcinoma, not otherwise specified = 11
    Other = 12
    Combinations of the above = 13

Lymphocyte infiltrations in tumor
    None
    Minimal
    Moderate or numerous
    Unknown

[Data coding form: CARCINOMA OF THE BREAST, SELECTIVE BIOPSY 1955 - 1963. Form sections: Family History, Personal History, Menstrual history, Previous breast ailment, Serious illnesses, Surgery (not breast), History of Present Illness, Patient's Present Condition, Pathology, Survival.]

APPENDIX D
VARIABLES NOT INCLUDED IN THE ANALYSIS

General:
    Year of diagnosis — year of initial diagnosis or treatment (anniversary)
    Date of birth — month and year of birth (were used to check reported age but do not appear explicitly in functions)
    Clinical stage — clinical stage (the factors used in staging appear as variables)
    Pathological stage — pathological stage
    Years in N.A. — number of years that patient has lived in North America
    Years in B.C. — number of years that patient has lived in British Columbia

Family History:
    Cancer other than breast — for father, mother, sister, brother, son, daughter, maternal relative, and paternal relative there was an indicator variable for occurrence of cancer and a variable for types of cancer
    Other diseases in family members — diabetes, tuberculosis, heart disease, and other for blood relatives

Menstrual history:
    Menarche — age at which menstruation began
    Length of periods — in days
    Interval between — days between periods

Illnesses and Surgery of patient:
    Childhood diseases — mumps, measles, etc.
    Typhoid
    Ulcer — stomach ulcer
    Tonsillectomy
    Appendectomy

History of present illnesses:
    Other symptoms — those appearing after the first
    First observation of symptom — whether patient or medical professional first observed symptom of disease

Survival data: used to increase knowledge of history of problem but not pertinent to the classification problem
    Date of Death
    Cause of Death
    Date lost to followup
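The coding instructions in Appendix C amount to a lookup table per variable, with certain codes (for example 9) reserved for unknown values. A minimal Python sketch of how such codes might be decoded before analysis is given below. It is not taken from the thesis: the names `CODEBOOK`, `MISSING_LABELS`, `decode`, and `decode_record` are hypothetical, only four of the Appendix C variables are transcribed, and treating "Unknown" and "Size not stated" as missing is an assumption.

```python
# Illustrative sketch only: CODEBOOK, MISSING_LABELS, decode and decode_record
# are hypothetical names, not part of the thesis.

# A few code tables transcribed from the Appendix C coding instructions.
CODEBOOK = {
    "family_history": {1: "Yes", 0: "No", 9: "Unknown"},
    "tumor_size": {
        1: "< 2 cm",
        2: "2 to 5 cm",
        3: "> 5 cm",
        4: "No lump palpable",
        9: "Size not stated",
    },
    "nodal_involvement": {
        1: "Positive apical or i.m. nodes",
        2: "Positive lower axillary nodes",
        3: "No nodal involvement",
    },
    "cause_of_death": {
        0: "Alive",
        1: "Breast cancer",
        2: "Intercurrent disease",
        3: "Lost to followup",
    },
}

# Assumption: these labels carry no information and are treated as missing.
MISSING_LABELS = {"Unknown", "Size not stated"}


def decode(variable, code):
    """Return the label for a coded value, or None if the code means 'unknown'."""
    table = CODEBOOK[variable]
    if code not in table:
        raise ValueError(f"code {code!r} is not defined for {variable!r}")
    label = table[code]
    return None if label in MISSING_LABELS else label


def decode_record(record):
    """Decode every coded field of one patient record (a dict of variable -> code)."""
    return {variable: decode(variable, code) for variable, code in record.items()}


if __name__ == "__main__":
    example = {"family_history": 9, "tumor_size": 2, "nodal_involvement": 3}
    print(decode_record(example))
```

Rejecting undefined codes with an error, rather than silently mapping them to missing, is one possible design choice; it makes keypunch errors visible before any model is fitted.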
A statistical classification of breast cancer patients by degree of nodal metastases Wilson, Sandra Lee 1977
Item Metadata
Title | A statistical classification of breast cancer patients by degree of nodal metastases |
Creator |
Wilson, Sandra Lee |
Date Issued | 1977 |
Description | Recently the traditional primary method of treatment for breast carcinoma — the Halsted radical mastectomy — has been challenged. It is felt by some people that other methods may be more appropriate for certain women. Quality of life and the patient's preferences are being considered in addition to the strictly medical aspects of the problem. One procedure that attempts to increase the quality of life for certain women is the selective biopsy. Women who are proven to have lymph node metastases at the biopsy are spared a mastectomy and treated by radiation since surgery cannot remove all of the cancer. A study was undertaken at the British Columbia Cancer Institute of selective biopsy patients diagnosed between 1955 and 1963 in order to assess the procedure in British Columbia. After studying survival for selective biopsy patients and others, it was concluded that the procedure should continue to be recommended. Since only 14% of the patients now referred to BCCI have had a selective biopsy, I decided to try to find a statistical method for assessing the probability of nodal metastases. The problem is one of statistical classification. The literature on the theory of several statistical models was reviewed. Two models were chosen for the problem: linear discriminant analysis and logistic regression. The classification procedure most often used is discriminant analysis. However, the linear discriminant model assumes a normal distribution and common covariance matrix for the vector of observations. Medical data is often non-normal and even discrete. The logistic probability model works well with such data. Both models were then used to study the selective biopsy problem. The patients of the BCCI study were used as a training set to estimate the parameters of the discriminant function and the logistic probability function. Then each estimated function was used to classify the patients as a measure of the goodness of fit of the models. 
The logistic regression correctly classified slightly more of the patients than the discriminant analysis did. Because of the iterative nature of the logistic regression, the execution time for the logistic regression was longer than for discriminant analysis, but not beyond practical limits. The variables that were significant in the statistical analyses could be used to help the physician make a clinical assessment of the lymph nodes of a woman with breast carcinoma. The variables indicate areas where further research would be useful. |
Subject |
Breast Cancer; Breast cancer -- Statistics |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-02-21 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0080131 |
URI | http://hdl.handle.net/2429/20722 |
Degree |
Master of Science - MSc |
Program |
Mathematics |
Affiliation |
Science, Faculty of; Mathematics, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |