Open Collections

UBC Theses and Dissertations


A statistical classification of breast cancer patients by degree of nodal metastases Wilson, Sandra Lee 1977



A STATISTICAL CLASSIFICATION OF BREAST CANCER PATIENTS BY DEGREE OF NODAL METASTASES

by

SANDRA LEE WILSON
B.S., Stanford University, 1966

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

in the Department of Mathematics

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
May 1977

© Sandra Lee Wilson, 1977

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Mathematics
The University of British Columbia
2075 Wesbrook Place
Vancouver, Canada V6T 1W5

Date: [illegible]

ABSTRACT

Recently the traditional primary method of treatment for breast carcinoma, the Halsted radical mastectomy, has been challenged. It is felt by some people that other methods may be more appropriate for certain women. Quality of life and the patient's preferences are being considered in addition to the strictly medical aspects of the problem. One procedure that attempts to increase the quality of life for certain women is the selective biopsy.
Women who are proven to have lymph node metastases at the biopsy are spared a mastectomy and treated by radiation since surgery cannot remove all of the cancer.

A study was undertaken at the British Columbia Cancer Institute of selective biopsy patients diagnosed between 1955 and 1963 in order to assess the procedure in British Columbia. After studying survival for selective biopsy patients and others, it was concluded that the procedure should continue to be recommended. Since only 14% of the patients now referred to BCCI have had a selective biopsy, I decided to try to find a statistical method for assessing the probability of nodal metastases.

The problem is one of statistical classification. The literature on the theory of several statistical models was reviewed. Two models were chosen for the problem: linear discriminant analysis and logistic regression. The classification procedure most often used is discriminant analysis. However, the linear discriminant model assumes a normal distribution and common covariance matrix for the vector of observations. Medical data is often non-normal and even discrete. The logistic probability model works well with such data. Both models were then used to study the selective biopsy problem.

The patients of the BCCI study were used as a training set to estimate the parameters of the discriminant function and the logistic probability function. Then each estimated function was used to classify the patients as a measure of the goodness of fit of the models. The logistic regression correctly classified slightly more of the patients than the discriminant analysis did. Because of the iterative nature of the logistic regression, the execution time for the logistic regression was longer than for discriminant analysis, but not beyond practical limits.
The variables that were significant in the statistical analyses could be used to help the physician make a clinical assessment of the lymph nodes of a woman with breast carcinoma. The variables indicate areas where further research would be useful.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES AND ILLUSTRATIONS
ACKNOWLEDGEMENTS

Chapter
1  INTRODUCTION
2  MEDICAL HISTORY OF THE PROBLEM
3  REVIEW OF STATISTICAL MODELS
   3.1  Fisher's Linear Discriminant
   3.2  Contingency Tables
   3.3  Krzanowski Location Model
   3.4  Logistic Regression
   3.5  Comparison of Linear Discrimination and Logistic Regression
   3.6  Variable Reduction
   3.7  Conclusions
4  DATA COLLECTION
   4.1  Selecting Variables to be Observed
   4.2  Data Collection
   4.3  Variables Selected for Analysis
   4.4  Missing Data
   4.5  Sources of Error
5  DATA ANALYSIS
   5.1  Computer Programs
   5.2  Selecting Cases
   5.3  Classification of 503 Cases
   5.4  Subsets of Patients for Further Classification
   5.5  Classification of Postmenopausal Patients
   5.6  Classification of Premenopausal Patients
   5.7  Summary of Results
6  CONCLUSIONS

BIBLIOGRAPHY

APPENDICES
A  TREATMENT STUDY GROUPS
B  MANCHESTER STAGING OF BREAST CANCER
C  CODING INSTRUCTIONS AND DATA CODING FORM
D  VARIABLES NOT INCLUDED IN ANALYSIS

LIST OF TABLES

Table
I        Clinical Staging versus Pathological Staging
II       Survival of Selective Biopsy and All Other New Breast Cases by Clinical Stage
III      2x2 Contingency Tables for Survival
IV       Contingency Table for 10 Year Survival for BCCI Patients and Haagensen's Patients
V        Contingency Tables for 5 and 10 Year Survival for Treatment Groups A and C
VI       Contingency Tables for 5 and 10 Year Survival for Patients with Nodal Metastases and in Treatment Group A versus Treatment Group C
VII      T-test of Survival Differences for Positive Nodes in Group A and Group C
VIII     Asymptotic Relative Efficiency of Logistic Regression to Linear Discrimination
IX       Execution Times for Logistic Regression
X        Comparison of Factors by Histologic Type
XI       Variables Chosen by Linear Discrimination for 173 Cases
XII      Classification of 173 Cases by Linear Discrimination
XIII     Variables Chosen by Logistic Regression for 173 Cases
XIV      Classification of 173 Cases by Logistic Regression
XV       Variables Chosen by Linear Discrimination for 503 Cases
XVI      Classification of 503 Cases by Linear Discrimination
XVII     Variables Chosen by Logistic Regression for 503 Cases
XVIII    Classification of 503 Cases by Logistic Regression with 15 Variables
XIX      Classification of 503 Cases by Logistic Regression with 4 Variables
XX       Variables Chosen by Discriminant Analysis for 60 Postmenopausal Patients
XXI      Variables Chosen by Discriminant Analysis for 113 Premenopausal Patients
XXII     Classification of 60 Postmenopausal Patients by Linear Discrimination with 19 Variables
XXIII    Classification of 113 Premenopausal Patients by Linear Discrimination with 16 Variables
XXIV     Subset of Variables Chosen by Discriminant Analysis for 128 Postmenopausal Patients
XXV      Classification of 128 Postmenopausal Patients by Discriminant Analysis with 4 Variables
XXVI     Subset of Variables Chosen by Logistic Regression for 128 Postmenopausal Patients
XXVII    Classification of 128 Postmenopausal Patients by Logistic Regression with 16 Variables
XXVIII   Classification of 128 Postmenopausal Patients by Logistic Regression with 3 Variables
XXIX     Classification of 128 Postmenopausal Patients by Logistic Regression with 4 Variables
XXX      Subset of Variables Chosen by Discriminant Analysis for 253 Premenopausal Patients
XXXI     Classification of 253 Premenopausal Patients by Discriminant Analysis with 10 Variables
XXXII    Subset of Variables Chosen by Logistic Regression for 253 Premenopausal Patients
XXXIII   Classification of 253 Premenopausal Patients by Logistic Regression with 16 Variables
XXXIV    Classification of 253 Premenopausal Patients by Logistic Regression with 10 Variables
XXXV     Summary of Classification Results
XXXVI    Number of Premenopausal Patients Correctly and Incorrectly Classified by Decile of Posterior Probability
XXXVII   Number of Postmenopausal Patients Correctly and Incorrectly Classified by Decile of Posterior Probability

LIST OF FIGURES

Figure
2.1  Areas of lymph node involvement in carcinoma of the breast
2.2  Actuarial survival of all 535 cases
2.3  Actuarial survival of patients in treatment group A versus patients in treatment group C
2.4  Actuarial survival of patients with positive nodes versus patients with negative nodes
2.5  Actuarial survival of patients with positive nodes and in treatment group C versus patients with positive nodes and in treatment group A
3.1  Cochran and Hopkins categories for partitioning continuous variables
3.2  The geometry of the two-stage procedure for two groups
5.1  Histogram of age-incidence for all 535 cases
5.2  Histogram of age-incidence for premenopausal patients
5.3  Histogram of age-incidence for postmenopausal patients

ACKNOWLEDGEMENTS

The data for
this work was provided by the British Columbia Cancer Institute. Particular thanks go to Dr. Glen Crawford of BCCI for arranging for me to work with the data and to the Statistics Department of BCCI for help with the medical files.

Thanks also go to my committee members for their comments and help in the preparation of this work. Those committee members are

Dr. Stanley J. Nash, Department of Mathematics
Dr. Brenda J. Morrison, Department of Health Care and Epidemiology
Dr. S. James Press, Faculty of Commerce

Finally, thanks must go to my husband for his encouragement and support throughout my quest for a degree.

Chapter 1

INTRODUCTION

Cancer is one of the most universally feared diseases known to man. Not only does it too often kill, it kills slowly and usually with pain and suffering. The treatments for this dread disease sometimes seem worse than the disease itself: amputations of parts of the body, radiation to kill cells (both cancerous and normal), and chemicals that poison cells. For a woman, breast cancer usually holds the greatest fear because, in addition to the physical damage done by the disease and treatments, there is often great emotional damage. The North American and European cultures have put such emphasis upon a woman's breasts in defining her worth as a woman that deformity or loss of a breast is an emotional blow that can cripple a woman. In addition, breast cancer "is the single largest cause of death from cancer among women in the United States and Canada" [34, p. 334].

In the treatment of cancer there are presently three types of treatment.^1 The oldest and most often used as an initial therapy is surgery. If the cancer is completely removed, then the disease is no longer a problem. However, cancer does not confine itself to neat, easily excised tumors. Single cells that break off from the main mass can travel via the blood and lymphatic systems to all parts of the body and establish new colonies of cancer cells called metastases. To remove as many as possible of these cells that have broken away, cancer surgery removes wide areas of presumably normal tissue in addition to the tumor itself. This can cause major physical deformities. Even such extensive surgery is often not enough to stop all the cancer cells.

^1 A new form of treatment called immunotherapy is being tried experimentally but is not generally available and so is not discussed here.

Because some cancers are inoperable (not amenable to surgery because of the size or location of the tumor), other methods of treatment are necessary. Radiation is known to kill cells, normal and abnormal alike. Radiation can reach places that surgery cannot and does not cause as much deformity. However, it, like surgery, cannot kill all the stray cells.

A systemic treatment was needed to kill the colonizing or metastatic cells. It has been found that certain drugs kill cancerous cells faster than normal cells because cancer cells have a more rapid rate of growth than normal cells. Thus, chemotherapy became another weapon in the treatment of cancer. While chemotherapy does not cause permanent physical deformities, it does cause temporary distressing side effects.

Treatment of a particular breast cancer patient can be by any one of these methods or by any combination. Too often the treatment is dictated by the physician's personal preference rather than by the circumstances of the case.
Some doctors have tried methods of assessing the best treatment for the patient by taking into account the quality of life of the patient and other possible rewards under alternative treatments. One such study was completed in the spring of 1976 at the British Columbia Cancer Institute (BCCI).

A surgical procedure called a selective biopsy was done after an initial diagnosis of breast cancer. This procedure attempted to determine whether a patient had lymph node metastases or not. Depending on the status of the lymph nodes, a course of treatment was recommended. Between the years 1955 and 1963, 557 women had a selective biopsy done and were referred to BCCI for further treatment. The medical staff at BCCI undertook a study to compare the results of different treatment methods for these patients. Some results of that study will be reported in Chapter 2 as the background for the problem to be studied here.

Definitive conclusions are not always possible from the selective biopsy because of contamination, loss of material, or incomplete dissection. Also, many patients do not have the selective biopsy done. A statistical model is proposed in this paper to augment, and possibly supplant for some patients, the surgical procedure. The patients that provide the data base for this work are the same ones that were used in the study conducted at BCCI. The statistical problem of deciding whether there are nodal metastases or not is a two-group classification problem. Four models for classification of mixed (discrete and continuous) data will be discussed in Chapter 3. Two of these models, linear discrimination and logistic regression, will then be applied to the problem of classifying breast cancer patients by degree of nodal metastases.

Chapter 2

MEDICAL HISTORY OF THE PROBLEM

In 1882 Dr. William Halsted began performing the first true radical mastectomies in Baltimore, Maryland.
A radical mastectomy involves the "removal of the breast, pectoral muscles, axillary lymph nodes, and associated skin and subcutaneous tissue" [17]. Surgeons quickly adopted his operation as the standard treatment for breast carcinoma. It remains the most widely used procedure today and is the standard against which other treatments are judged.

Other surgical treatments range from a lumpectomy (removal of only the tumor mass) through super-radical operations that remove even more tissues than the radical mastectomy does. These surgical procedures combined with various types of radiation and chemotherapy produce a large range of combinations of treatments. In the women to be studied here there were twelve different types of treatment combinations (see Appendix A). Only women are considered in this study because of the different factors that are thought to affect the disease in men and women.^1

^1 Breast carcinoma in men manifests itself in the same areas (the lymph nodes, muscles, and breast tissues); however, the hormonal influences are thought to be quite different.

The variations in treatment reflect the preferences of the physician treating the woman in addition to the variations in the disease process. The treatment a woman gets depends more on the doctor she consults than on the state of her disease. To try to eliminate these differences caused by doctor variability, attempts have been made to set up standard treatment protocols. During the time of the study (1955-1963) radical mastectomy with post-operative radiation therapy was the treatment of choice for operable cases of breast carcinoma seen at the British Columbia Cancer Institute (BCCI). Radiation alone was considered the best treatment for inoperable cases.

The next step was to decide which cases were operable. This has been where most of the differences of opinion occurred. It is generally agreed that growth of the disease beyond the breast makes a case inoperable. Any type of metastases constitutes such growth. All researchers agree that the prognosis is poor, no matter what the treatment, if, as Haagensen says, "metastases had reached these lymph nodes at the periphery of the regional lymph node filter at the apex of the axilla and in the internal mammary chain" [22, p. 691]. Thus, it seems reasonable to consider patients with nodal metastases as having growth beyond the breast and thus being inoperable. The method of assessment of these apical and internal mammary nodes was the next problem.

Several clinical^2 assessment systems have been devised in order to try to predict the pathological findings. The system used at BCCI is the Manchester staging of breast cancer (see Appendix B). There are four clinical stages that are supposed to correspond to four pathological stages. The stage I's are early disease while the stage IV's represent advanced disease for both clinical and pathological scales. Clinical III's and clinical IV's are generally considered inoperable "because the likelihood of cure by radical mastectomy is so poor that other methods will do as well or better" [22, p. 691]. Thus, we consider only clinical I's and II's as possible operable cases.

^2 Clinical findings are those obtained from physical examination of the patient without surgery. Pathological findings are those obtained from microscopic examination of surgically obtained tissue samples.

Unfortunately, the clinical staging systems have not done very well at predicting pathological staging. As Haagensen says, "clinical features alone upon which we relied for the selection of patients betrayed us. . ." [22, p. 691]. The results from BCCI, as presented in Table I, are typical. For clinical I's 50% had negative nodes, while for clinical II's 49% had pathologically involved apical or internal mammary nodes.

Table I
Clinical Staging versus Pathological Staging (489 cases)

Pathological Stage     Clinical I    Clinical II    Clinical III & IV
I                         50.2%         20.4%            11.1%
II                        21.4%         30.6%             2.8%
III                        0.8%          0.0%             8.3%
IV                        27.6%         49.0%            77.8%
Total                    100.0%        100.0%           100.0%
Number of Patients         257           196               36

In order to permit pathological review of the nodes before a radical mastectomy was carried out, a procedure called a selective or triple biopsy was devised. Dr. C. D. Haagensen developed and used this procedure between 1951 and 1966. His results are the only published findings of large groups using selective biopsy and surgery. With a combination of clinical staging and selective biopsy, it was hoped that a better assessment of the state of the disease could be made before any treatment, including mastectomy, was begun.

The selective biopsy is recommended for Clinical I's with inner half lesions, central lesions, or outer half masses with tumors larger than 3 cm and for all Clinical II's [11]. It begins with the original biopsy of the tumor mass in the breast. When the rush report^3 is positive for malignancy, the apical and internal mammary lymph nodes (areas III and IV in Figure 2.1) are also biopsied. In early use of the procedure, the tissues obtained in the second stage were also subject to a rush section and further surgery, if indicated, was undertaken immediately.

^3 A rush report is the report of the frozen section done while the patient is still in surgery. A permanent, or paraffin, section is done later because it is more accurate and shows greater detail.
However, results of the rush sections of the nodes were often inconclusive and the paraffin sections were necessary for accurate evaluation. Thus, the present procedure evolved in which the patient is returned to her room until the permanent sections are read. If the internal mammary and apical lymph nodes are negative, the woman is returned to the operating room for a radical mastectomy and later referred for radiation treatment to the supraclavicular (another name for apical) and internal mammary areas. When any of the nodes are positive, the patient has no further surgery and is referred for radiotherapy to the breast and all node areas [22, 11].

Figure 2.1. Areas of lymph node involvement in carcinoma of the breast.

Haagensen stopped doing selective biopsies in 1967 because he felt that he had learned all that he could from them [22]. The staff of the BCCI did not feel that was an adequate reason for discontinuing a practice that offered such advantages and so they continue to recommend its use for Clinical I's and II's. Since most patients are referred to BCCI after initial surgery has been performed, the decision to do the biopsy usually remains with the attending physician. The procedure was most popular during the late 1950's when a high of 43.5% of the patients referred to BCCI had had the biopsy done. It has declined in popularity until the present time when about 18% of the patients referred have had the selective biopsy performed. Because of the increasing incidence of breast cancer and increasing referral to BCCI, the number of patients having the biopsy each year has increased despite the percentage decrease.

In order to assess the results of using the selective biopsy to select patients for surgery in British Columbia, a retrospective study was undertaken at BCCI of selective biopsy patients whose date of diagnosis was between 1955 and 1963.
Since five and ten year survival rates are the standards for comparison in cancer therapy, the years to be studied were chosen to ensure availability of ten year survival data for all patients. A total of 557 women were referred to the BCCI after selective biopsy in the specified years. Twenty-two patients were eliminated from the study because of previous breast malignancy (14) or other systemic malignancy (8). Skin cancer and carcinoma in situ of the uterus did not constitute cause for being removed from the study. The remaining 535 women were then put into treatment groups by the method of treatment they actually received. Ideally there would have been only two groups:

1. radical mastectomy with radiation to apical and internal mammary nodes, and
2. radiation to original lesion and axillary, apical, and internal mammary drainage areas.

However, due to the fact that patients came from many different referring surgeons, there were twelve different treatment groups (see Appendix A for details of the groups). Only the two recommended groups (called C and A) had enough cases to give significant statistical results.

The first concern of the doctors was that the selective biopsy procedure did not harm the patient. It was known that there was little morbidity associated with the procedure. To judge whether it affected survival, all selective biopsy patients were compared to all other new breast cases for 1955 to 1963. The data are presented in Table II. To test whether the survival rates were worse for selective biopsy patients, a series of 2x2 contingency tables were formed for five and ten year survival. The contingency tables are presented in Table III, where the different treatments are selective biopsy or not selective biopsy.

Table II
Survival of Selective Biopsy and All Other New Breast Cases by Clinical Stage (1955-1963)

Selective Biopsies
Clinical Stage   Number of Cases   Alive at 5 Years   Alive at 10 Years
I                      271               207                 154
II                     202               107                  75
III                     35                18                  14
IV                      13                 2                   0
Unknown                 14                10                   9
Total                  535               344                 252

Other New Cases
Clinical Stage   Number of Cases   Alive at 5 Years   Alive at 10 Years
I                      529               390                 304
II                     266               137                  88
III                    133                59                  34
IV                     235                29                   8
Unknown                 34                21                  14
Total                 1197               636                 448

We now wish to test whether the proportions in the two treatment groups differ significantly for each contingency table. That is, we wish to test the hypothesis that survival is independent of the treatment group. Since we must estimate the parameters, the appropriate test is a chi-square with one degree of freedom. We calculate

$$X^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(f_{ij} - F_{ij})^2}{F_{ij}} \tag{2.1}$$

where $f_{ij}$ is the (i,j)th observed cell frequency and $F_{ij}$ is the corresponding expected cell frequency. $F_{ij}$ is calculated as follows:

$$F_{ij} = \frac{f_{i\cdot}\, f_{\cdot j}}{f_{\cdot\cdot}} \tag{2.2}$$

(a dot indicates summation over that index) where $f_{i\cdot}$ is the i-th row marginal total, $f_{\cdot j}$ is the j-th column marginal total, and $f_{\cdot\cdot}$ is the grand total of all cases. Only $F_{11}$ needs to be calculated that way since all other $F_{ij}$ are uniquely determined by $F_{11}$ and the fixed row and column marginals. It is a property of 2x2 tables that $f_{ij} - F_{ij}$ is the same except for sign for all i, j = 1, 2. Thus, we get

$$X^2 = (f_{11} - F_{11})^2 \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{1}{F_{ij}} \tag{2.3}$$

and $X^2$ is asymptotically distributed as a chi-square with one degree of freedom [44]. A correction for continuity should be added giving the final result

$$X^2 = \left(|f_{11} - F_{11}| - 0.5\right)^2 \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{1}{F_{ij}} \tag{2.4}$$

Table III
2x2 Contingency Tables for Survival

                          5 years                    10 years
                    Alive   Dead   Total       Alive   Dead   Total
Clinical I
  Other              390     139     529        304     225     529
  Selective          207      64     271        154     117     271
  Total              597     203     800        458     342     800
Clinical II
  Other              137     129     266         88     178     266
  Selective          107      95     202         75     127     202
  Total              244     224     468        163     305     468
Clinical III
  Other               59      74     133         34      99     133
  Selective           18      17      35         14      21      35
  Total               77      91     168         48     120     168
Clinical IV
  Other               29     206     235          8     227     235
  Selective            2      11      13          0      13      13
  Total               31     217     248          8     240     248
Clinical Unknown
  Other               21      13      34         14      20      34
  Selective           10       4      14          9       5      14
  Total               31      17      48         23      25      48
Total
  Other              636     561    1197        448     749    1197
  Selective          344     191     535        252     283     535
  Total              980     752    1732        700    1032    1732

The above approximate procedure can be used when the numbers in the tables are large (all expected frequencies are greater than five). However, when the total number of cases is less than 20 or the smallest expected frequency is less than five, it should not be used [44].

In 1935 Fisher showed that an exact test of significance based on the hypergeometric probability distribution could be made.^4 Finney et al. [14] calculated these probabilities and published tables of the results for up to 40 total cases. When the contingency tables in Table III involved small numbers, these exact probabilities were used for testing for significant differences.
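The calculations of equations (2.1) through (2.4) and the exact hypergeometric probability can be sketched in a few lines. This is only an illustration, assuming Python 3.8 or later and its standard library, neither of which was used in the original analysis. The function names are hypothetical, the clamp at zero in the continuity correction is an added guard, and the two-sided summation rule for the exact test is an assumption, since the text does not state which tail convention was applied.

```python
import math


def chi_square_2x2(table, continuity=True):
    """Return (X2, P) for a 2x2 table [[f11, f12], [f21, f22]]."""
    (f11, f12), (f21, f22) = table
    rows = (f11 + f12, f21 + f22)      # row marginal totals f_1. and f_2.
    cols = (f11 + f21, f12 + f22)      # column marginal totals f_.1 and f_.2
    total = rows[0] + rows[1]          # grand total f_..
    # Expected frequencies, equation (2.2): F_ij = f_i. * f_.j / f_..
    expected = [[r * c / total for c in cols] for r in rows]
    deviation = abs(f11 - expected[0][0])
    if continuity:
        # Equation (2.4); the clamp at zero is an added guard, not in the text.
        deviation = max(deviation - 0.5, 0.0)
    # Equations (2.3)/(2.4): squared deviation times the sum of 1 / F_ij.
    x2 = deviation ** 2 * sum(1.0 / F for row in expected for F in row)
    # Upper tail probability of a chi-square with one degree of freedom.
    p_value = math.erfc(math.sqrt(x2 / 2.0))
    return x2, p_value


def exact_table_probability(table):
    """Hypergeometric probability of one 2x2 table with fixed marginals."""
    (f11, f12), (f21, f22) = table
    total = f11 + f12 + f21 + f22
    return (math.comb(f11 + f12, f11) * math.comb(f21 + f22, f21)
            / math.comb(total, f11 + f21))


def exact_test_two_sided(table):
    """Sum the probabilities of all tables no more likely than the observed one.

    The two-sided rule used here is an assumption made for illustration.
    """
    (f11, f12), (f21, f22) = table
    row1, row2, col1 = f11 + f12, f21 + f22, f11 + f21
    observed = exact_table_probability(table)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = exact_table_probability([[x, row1 - x], [col1 - x, row2 - col1 + x]])
        if p <= observed * (1.0 + 1e-9):
            p_value += p
    return min(p_value, 1.0)


if __name__ == "__main__":
    # Rows are Other and Selective; columns are Alive and Dead (Table III).
    print(chi_square_2x2([[390, 139], [207, 64]]))     # Clinical I, 5 years
    print(chi_square_2x2([[636, 561], [344, 191]]))    # all stages, 5 years
    print(exact_test_two_sided([[8, 227], [0, 13]]))   # Clinical IV, 10 years
```

The Clinical IV ten year table is a natural case for the exact calculation: the expected number of selective biopsy patients alive is 13 x 8 / 248, or about 0.4, well below the limit of five required for the chi-square approximation.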
All 2x2 tables for individual clinical stages at five and ten years showed differences that were not significant at the .05 level. Thus, we cannot reject the hypothesis that survival was the same. However, some tables had a great disparity between the numbers of patients in the two treatment groups (for example, Clinical IV Other 235, Selective 13). For that reason it was decided to test five and ten year survival for all clinical stages combined. When the chi-square test was used, it showed a significant difference (P < .01). Thus, we would reject the hypothesis that total survival was the same. Since total survival was better for the selective biopsy group, we can conclude that selective biopsy at least did not decrease patient survival. Thus, we conclude that the selective biopsy did not harm the patients in this study.

^4 The exact conditional probability is

$$\frac{f_{1\cdot}!\; f_{2\cdot}!\; f_{\cdot 1}!\; f_{\cdot 2}!}{f_{\cdot\cdot}!\; f_{11}!\; f_{12}!\; f_{21}!\; f_{22}!}$$

The fact that selective biopsy patients demonstrated longer survival may be a result of being included in an "experimental" group. They may have been followed more closely and thus had metastases (local and distant) treated earlier. It is also possible that a larger proportion of these patients received radiotherapy as part of their treatment and consequently recurrences were delayed.

The next step in assessing the selective biopsy as used in British Columbia was to compare survival rates to published results and to compare survival rates between groups. It was concluded that the local survival rates were not significantly different (P > .05) from the published results of Haagensen [22] (Table IV). Five and ten year survivals were then compared for treatment groups A (no mastectomy, standard radiation) and C (radical mastectomy, standard radiation). The results are shown in Table V. The differences were significant at the .05 level for both time periods.
This was expected since the A cases were known to be advanced cases while the C cases were known to be early cases. The final comparison was for five and ten year survivals for patients known to have nodal metastases and treated by mastectomy (group C) versus those known to have nodal metastases and treated by irradiation (group A) (Table VI). The number of patients with nodal disease and treated by mastectomy was small (18), but not too small to attempt to draw some conclusions. It was found that at five years there was no significant difference at the .05 level of significance. However, at ten years we could reject the hypothesis of no significant differences (P < .05). It was not known what selection factors were used to select the cases with positive nodes for surgery (a typical problem with any retrospective study). The recommendation was made by the BCCI medical staff to continue use of the selective biopsy.

Table IV
Contingency Table for 10 Year Survival for BCCI Patients and Haagensen's Patients

              Alive   Dead   Total   Survival
Haagensen      550     526    1076     51.1%
BCCI           254     279     533     47.7%
Total          804     805    1609

Table V
Contingency Tables for 5 and 10 Year Survival for Treatment Groups A and C

5 years       Alive   Dead   Total   Survival
A               61      85     146     41.8%
C              241      75     316     76.3%
Total          302     160     462

10 years      Alive   Dead   Total   Survival
A               26     120     146     17.8%
C              198     118     316     62.7%
Total          224     238     462

Table VI
Contingency Tables for 5 and 10 Year Survival for Patients with Nodal Metastases and in Treatment Group A versus Treatment Group C

5 years       Alive   Dead   Total
A               57      84     141
C               10       8      18
Total           67      92     159

10 years      Alive   Dead   Total
A               21     120     141
C                8      10      18
Total           29     130     159

After the data were collected for the analysis that comprises the main body of this work, more information was available on survival. For some patients a 22 year history of survival was available and thirteen year survival information was available for all patients.
I decided to do an actuarial survival study to see what had happened in the years between five and ten and in the years after ten. The Biomed program BMD11S (Life Tables and Survival Rate) was used to produce actuarial survival rates. The results are shown in Figures 2.2 through 2.5. Figure 2.2 shows the overall survival for selective biopsy patients. The results at five and ten years compare favorably with standard survival results [7]:

                      5 years    10 years
Standard                60%        20%
Selective biopsy        64.6%      47.7%

There is also a 28% survival for 22 years which is encouraging. A comparison of Figures 2.3 and 2.4 shows the curves to be almost identical: the survival for group A and the survival for positive nodes are nearly the same, and the survival for group C and the survival for negative nodes are the same. Again that is the expected result since A cases were supposed to be chosen because of the presence of positive nodes while C cases are supposed to have negative nodes. Figure 2.5 shows the survival curves by treatment group for patients with positive nodes. The curves are close together through five years. Then between six and twelve years they are divergent. After twelve years they approach each other again.

[Figure 2.2. Actuarial survival of all 535 cases. Horizontal axis: years (0 to 22); vertical axis scaled 0 to 100.]

[Figure 2.3. Actuarial survival of patients in treatment group A versus patients in treatment group C. Legend: Group A (146), Group C (317). Horizontal axis: years (0 to 22); vertical axis scaled 0 to 100.]

[Figure 2.4. Actuarial survival of patients with positive nodes versus patients with negative nodes. Legend: Negative (356), Positive (179). Horizontal axis: years (0 to 22); vertical axis scaled 0 to 100.]

[Figure 2.5. Actuarial survival of patients with positive nodes and in treatment group C versus patients with positive nodes and in treatment group A. Legend: Group A (141), Group C (18). Horizontal axis: years (0 to 22); vertical axis scaled 0 to 100.]

The Biomed survival program also calculates a t-test for the differences between groups each year. The results of these t-tests appear in Table VII. These results confirm that there is a significant difference (P < .05) in the years six through twelve only. It is not clear just what significance this has for the treatment decision problem. More cases with positive nodes and treatment by mastectomy need to be studied with that question in mind.

Another question of concern about the selective biopsy was the local recurrence rate, that is, recurrence of the disease in the breast and associated lymph drainage areas. Not all the data were available on recurrence for the BCCI study. It was known that local recurrences were more of a problem in study group A. However, most local recurrences were adequately controlled. More work remains to be done on the question of local recurrence.

To complete the assessment of the use of selective biopsy, the medical staff have been asking questions about the quality of life for the patients. They feel that sparing a woman with advanced breast carcinoma the mutilation of having a breast removed is giving her a better quality of life. However, the quality of life also has a time component to it.
Mueller and Jeffries studied the questions of rate of dying and causes of death in breast carcinoma and concluded:

    Breast cancer treatment should:
    a) Treat the cancer only when and where it is known to exist;
    b) Not be proposed as a means of influencing either time of death or cause of death.
    Measurements of quality of life should be established and should constitute the only realistic objective of treatment [34, p. 339].

Thus, the conclusion of the BCCI study was that the selective biopsy should be recommended for patients with Clinical stage I or Clinical stage II disease [11].

Table VII
T-test of Survival Differences for Positive Nodes in Group A and Group C

Year    T-statistic    Degrees of Freedom (df)    Tabled t(.975, df)
  1       -0.26                157                     1.9763
  2       -1.00                157                     1.9763
  3       -1.49                157                     1.9763
  4       -1.67                157                     1.9763
  5       -1.22                157                     1.9763
  6       -2.03*               157                     1.9763
  7       -2.28*               157                     1.9763
  8       -2.58*               157                     1.9763
  9       -2.83*               157                     1.9763
 10       -2.44*               157                     1.9763
 11       -2.08*               157                     1.9763
 12       -2.08*               157                     1.9763
 13       -1.79                157                     1.9763
 14       -1.60                155                     1.9765
 15       -1.60                155                     1.9765
 16       -0.86                118                     1.980
 17       -0.86                118                     1.980
 18       -0.86                118                     1.980
 19       -0.86                118                     1.980
 20       -0.86                118                     1.980
 21       -0.86                118                     1.980
 22       -0.86                118                     1.980

* significant

Since BCCI is a referral agency for all of British Columbia and the Yukon, most of the range of stages were well represented in the study. The group of patients was deficient in very early cases of Clinical stage I which had received surgery and then were not referred for further treatment. Presumably these would all have had negative nodes since evidence of any nodal metastases in the surgical specimen would be cause for referral. The study could also be deficient in very advanced stages where the patient would be beyond any treatment.
Since that stage of disease would never be recommended for selective biopsy, we need not worry about lack of such cases.

After completing the statistical analysis of the above study for BCCI, I became interested in trying to find a statistical model that could classify patients into positive or negative nodes when surgical results were not available. Since 82% of the patients now being referred to BCCI have not had a selective biopsy, such a model could be useful for helping to decide on the best treatment for these patients. It could also be used with those patients for whom the selective biopsy was inconclusive.

Chapter 3

REVIEW OF STATISTICAL MODELS

The medical diagnosis problem presented in the previous chapter can be considered as a statistical classification or prediction problem. Given a vector of observable variables for a patient, we wish to predict which group that patient belongs to (positive or negative nodes). Several different models have been suggested for this problem. Four of the models will be discussed here: Fisher's linear discriminant function, multiway contingency tables, Krzanowski's location model, and the logistic probability model. Discussion will include the assumptions of the models, parametric estimation methods, problems in using the models, and availability of computer routines.

3.1  Fisher's Linear Discriminant

In 1936, R.A. Fisher proposed a linear discriminant function to classify a p dimensional vector $X$ into one of two known multivariate normal populations, given that the observation was from one of the two and that they had the same covariance matrix. We assume that

$$ X \sim N_p(\mu_1, \Sigma) \quad \text{with probability } q_1 $$

and

$$ X \sim N_p(\mu_0, \Sigma) \quad \text{with probability } q_0 , $$

where $q_1 + q_0 = 1$ and $\Sigma$ is the common covariance matrix. The linear
discriminant function is

$$ U(X) = \beta_0 + \beta' X, \tag{3.1} $$

where

$$ \beta_0 = \ln\frac{q_1}{q_0} - \tfrac{1}{2}\,\beta'(\mu_1 + \mu_0) \tag{3.2} $$

and

$$ \beta' = (\mu_1 - \mu_0)'\,\Sigma^{-1} \tag{3.3} $$

so that

$$ \beta_i = \sum_{j=1}^{p} (\mu_{j1} - \mu_{j0})\,\sigma^{ij} \tag{3.4} $$

with $\Sigma^{-1} = (\sigma^{ij})$. If $U(X) > 0$, $X$ is assigned to population 1; otherwise, $X$ is assigned to population 0. [Footnote 1: In actual medical practice, those individuals for whom $U(X)$ is zero or near zero would not be classified without further investigation. A two-stage procedure which allows further observation of borderline cases is discussed later in this chapter in the section on variable reduction.]

The goal of the parameter estimation procedure is to minimize expected total misclassification cost. Often the costs for misclassification are quite different for the two populations (death versus further testing in a medical diagnosis problem). One can include these costs in the model and then minimize the expected total cost of misclassification. $C(h|k)$ is defined as the cost of misclassifying an individual to group h when that individual is a member of group k. The expected misclassification cost for group k is $q_k C(h|k)$ and the total expected misclassification cost is

$$ \sum_{k} q_k\, C(h|k). \tag{3.5} $$

Replacing $q_k$ by $q_k C(h|k)$, the linear discriminant model becomes

$$ U(X) = \beta_0 + \beta' X \tag{3.6} $$

with

$$ \beta_0 = \ln\frac{q_1\, C(0|1)}{q_0\, C(1|0)} - \tfrac{1}{2}\,\beta'(\mu_1 + \mu_0), \tag{3.7, 3.8} $$

and

$$ \beta' = (\mu_1 - \mu_0)'\,\Sigma^{-1} \tag{3.9} $$

so that

$$ \beta_i = \sum_{j=1}^{p} (\mu_{j1} - \mu_{j0})\,\sigma^{ij}. \tag{3.10} $$

Again $X$ is classified into group 1 when $U(X) > 0$, and into group 0 otherwise.

When the parameters of the populations are unknown, they must be estimated. We shall assume that the sampling is random from the mixture of populations so that the sampling mixture approximates the population mixture.
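To make the role of the misclassification costs concrete, the following is a minimal sketch of the cost-weighted rule for a single explanatory variable, so that the covariance matrix reduces to a scalar variance. All parameter values are hypothetical and chosen only for illustration; the function names are not from any published routine.

```python
import math


def discriminant_coefficients(mu1, mu0, sigma2, q1, q0, c01=1.0, c10=1.0):
    """Return (beta0, beta) for U(x) = beta0 + beta * x with one variable.

    c01 plays the role of C(0|1): cost of assigning a group 1 member to group 0.
    c10 plays the role of C(1|0): cost of assigning a group 0 member to group 1.
    """
    beta = (mu1 - mu0) / sigma2
    beta0 = math.log((q1 * c01) / (q0 * c10)) - 0.5 * beta * (mu1 + mu0)
    return beta0, beta


def classify(x, beta0, beta):
    """Assign to population 1 when U(x) > 0, otherwise to population 0."""
    return 1 if beta0 + beta * x > 0 else 0


# Hypothetical population parameters (assumed known for this illustration).
MU1, MU0, SIGMA2 = 3.0, 1.0, 1.0
Q1, Q0 = 0.3, 0.7
x_new = 2.2

equal_cost = discriminant_coefficients(MU1, MU0, SIGMA2, Q1, Q0)
high_cost = discriminant_coefficients(MU1, MU0, SIGMA2, Q1, Q0, c01=10.0)

group_equal = classify(x_new, *equal_cost)   # costs equal
group_costly = classify(x_new, *high_cost)   # missing a group 1 case is 10x as costly

print(group_equal, group_costly)
```

With these made-up values the same observation moves from population 0 to population 1 once a missed group 1 case is treated as ten times more costly, which is the effect the cost ratio in the constant term is meant to produce.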
When there is a low incidence of one population, a two-sample procedure (separate samples for the different groups) may be more appropriate [1]. However, the parameter estimation would be different for that case. Since the patients in the selective biopsy study were sampled from the mixture of populations, we will not consider the separate sample situation.

Let $n_h$ = number of observations from group h, h = 0,1, and $x_{iht}$ = the i-th characteristic of the t-th individual in the h-th group. Then

$$ \bar{x}_{ih} = n_h^{-1} \sum_{t=1}^{n_h} x_{iht}, \qquad h = 0,1, \tag{3.11} $$

$$ S_{ij,h} = (n_h - 1)^{-1} \sum_{t=1}^{n_h} (x_{iht} - \bar{x}_{ih})(x_{jht} - \bar{x}_{jh}), \qquad h = 0,1, \tag{3.12} $$

and

$$ S_{ij} = \frac{(n_1 - 1)\,S_{ij,1} + (n_0 - 1)\,S_{ij,0}}{n_1 + n_0 - 2}. \tag{3.13} $$

To estimate the population proportions we use the sample proportions. Thus,

$$ \hat{q}_h = \frac{n_h}{n_1 + n_0}, \qquad h = 0,1, $$

so that

$$ \frac{\hat{q}_1\, C(0|1)}{\hat{q}_0\, C(1|0)} = \frac{n_1\, C(0|1)}{n_0\, C(1|0)}. $$

The population means are estimated by the sample means: $\hat{\mu}_{ih} = \bar{x}_{ih}$. Let $\hat{\sigma}_{ij} = S_{ij}$ be the (i,j)th element of $\hat{\Sigma}$ and $\hat{\sigma}^{ij}$ be the (i,j)th element of $\hat{\Sigma}^{-1}$. Thus, the estimated function is

$$ U(X) = \hat{\beta}_0 + \hat{\beta}' X \tag{3.15} $$

where

$$ \hat{\beta}_0 = \ln\frac{n_1\, C(0|1)}{n_0\, C(1|0)} - \tfrac{1}{2} \sum_{i=1}^{p} \hat{\beta}_i\,(\bar{x}_{i1} + \bar{x}_{i0}) \tag{3.16} $$

and

$$ \hat{\beta}_i = \sum_{j=1}^{p} (\bar{x}_{j1} - \bar{x}_{j0})\,\hat{\sigma}^{ij}. \tag{3.17} $$

If $U(X) > 0$, $X$ is assigned to population 1; otherwise, $X$ is assigned to population 0. Rewriting $U(X)$ to clarify the estimation problems we get

$$ U(X) = \ln\frac{n_1\, C(0|1)}{n_0\, C(1|0)} + (\bar{x}_1 - \bar{x}_0)'\, S^{-1} \left[ X - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_0) \right]. \tag{3.18} $$

We see from (3.18) that in order to estimate the linear discriminant function we must estimate $\mu_1$, $\mu_0$, and $\Sigma$. Unless some simplifying assumptions are made about $\Sigma$ (for example, independence of variates), the estimation problem can become quite substantial.

Departures of the data from normality are a cause for concern with this model. Although little has been done to show robustness of the linear discriminant functions, many practical applications proceed with linear discrimination after stating that the data are non-normal or even discrete [see for example, 47].
This problem will be of great concern here because it is known that the medical data are non-normal and often not even continuous.

An attractive feature of the linear discriminant model for applications is the widespread availability of computer routines for estimating the function and classifying observations. The discriminant analysis is based on a linear regression and so is easily accessible. The availability of the computer program encourages the user to ignore the departures from the model assumptions for ease of computation.

3.2  Contingency Tables

Each individual has a set of attributes describing him. When all the data are discrete, all individuals with the same set of attributes are counted and that count is put into the appropriate cell of a contingency table. The structure of the table for two variables is a rectangular array with columns corresponding to levels of one variable and rows corresponding to levels of the other variable.

The simplest contingency table is a 2 x 2 table. There are two levels of attribute A and two levels of attribute B. The model would appear as below [Footnote 2: The model could also be written in terms of the expected frequencies $m_{ij} = N p_{ij}$.]:

                 B
             1       2
A    1     p11     p12     p1.
     2     p21     p22     p2.
           p.1     p.2      1

where $p_{ij}$ is the (i,j)th cell probability,

$$ p_{i\cdot} = \sum_{k=1}^{2} p_{ik}, \qquad i = 1,2, \tag{3.19} $$

$$ p_{\cdot j} = \sum_{k=1}^{2} p_{kj}, \qquad j = 1,2, \tag{3.20} $$

and

$$ \sum_{i=1}^{2} \sum_{j=1}^{2} p_{ij} = 1. \tag{3.21} $$

The $p_{i\cdot}$ and $p_{\cdot j}$ are the marginal probabilities. The model assumptions are that all categories or contingencies are included (the probabilities sum to one) and that all variables are discrete (or have been made discrete). This model and equations (3.19), (3.20), and (3.21) easily generalize to higher dimensions.

The general model assumptions remain the same: all variables are discrete and the probabilities in all tables sum to one.
In higher-way contingency tables, however, one usually makes some simplifying assumptions about interaction terms to make the problem more manageable. One common simplification is to assume that bivariate interactions are allowed, but higher order interactions are not.

Cochran and Hopkins [8] suggested that when there is a mixture of continuous and qualitative variables, all the continuous variables should be made qualitative. They concluded that the optimal partition would be into six states as shown in Figure 3.1. When the variables are all qualitative, the problem has been reduced to analysing a p-way contingency table. The question with this approach is how much information is lost. Cochran and Hopkins felt the loss of information was not significant; however, many others have found the loss unacceptable and sought ways of utilizing the full information.

[Figure 3.1. Cochran and Hopkins categories for partitioning continuous variables. U1 and U2 are calculated from the data.]

Estimation of the cell frequencies (or cell probabilities) in a contingency table is simple for small tables and large data sets, but can become quite complicated, if not impossible, for larger tables and moderate data sets. The number of individuals observed for each cell is enumerated and that count is the observed frequency or estimated frequency. The problems arise when there are many cells to be estimated and not very many data points. A small 3 x 5 x 7 table having fixed marginals has 48 parameters to be estimated. Another problem is empty cells. The frequency method can only estimate an empty cell as zero, while it is quite likely that a different sample would show the cell to be non-empty.

It has been suggested by many authors [17, 24, and 35 for example] that log-linear models are appropriate for analysing contingency tables. Log-linear models fit the contingency table model assumptions, while solving the estimation problems discussed above.
In many cases, empty cells can be estimated as non-zero with log-linear models. Also with a few simplifying assumptions (high order interactions are zero for example), there are many fewer parameters to estimate so that for a given data set size the estimated parameters of the log-linear model will be based on more observations per parameter. A more complete discussion of log-linear models will be deferred to Section 3.4.

Another problem with high dimension contingency table analysis has been the general unavailability of computer routines to do the analysis. Recently UCLA's Biomed package has included a program for analysis of a multiway table. Greater availability of this program will encourage more use of contingency table analysis with high dimension problems. The availability of computer routines will not alleviate the problems of large numbers of parameters to be estimated and loss of information with partitioned continuous variables.

3.3  Krzanowski Location Model

In order to use all the information available when there is a mixture of continuous and discrete data, W.J. Krzanowski proposed a likelihood ratio classification rule based on the location model [27]. In the location model each observation consists of $Y$, the q x 1 vector of continuous variables, and $Z$, the p x 1 vector of binary variables [Footnote 3: A discrete variable with n levels can be transformed into n-1 indicator variables, so we consider only binary variables here.]. Thus, each distinct pattern of $Z$ defines a multinomial cell, with $Z$ being in cell

$$ m = 1 + \sum_{j=1}^{p} z_j\, 2^{\,j-1}. $$

It is assumed that $Y \sim N(\mu_h^{(m)}, \Sigma)$ in cell m for group h, where $\Sigma$ is the common covariance matrix. It is also assumed that the probability of an observation from group h falling in cell m is $p_{hm}$. The optimal allocation rule then becomes: allocate to group 1 if

$$ U(Y) = (\mu_1^{(m)} - \mu_0^{(m)})'\, \Sigma^{-1} \left[ Y - \tfrac{1}{2}(\mu_1^{(m)} + \mu_0^{(m)}) \right] + \ln\frac{p_{1m}}{p_{0m}} \tag{3.22} $$

is > 0 and to group 0 otherwise.
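The way the location model produces a separate discriminant for each multinomial cell can be sketched in a few lines. This is an illustration under stated assumptions, not Krzanowski's program: it assumes the usual binary coding of cells (m = 1 plus the sum of z_j times 2^(j-1)), a single continuous variable so that the covariance matrix is a scalar, and entirely hypothetical cell means and cell probabilities.

```python
import math


def cell_index(z):
    """Cell number m = 1 + sum_j z_j * 2**(j-1) for a tuple of 0/1 values."""
    return 1 + sum(zj * 2 ** j for j, zj in enumerate(z))


def allocate(y, z, cell_params, sigma2):
    """Allocate to group 1 if the cell-specific discriminant is positive.

    cell_params[m] = (mu1, mu0, p1, p0): hypothetical group means of the
    continuous variable in cell m and the probabilities of cell m in each group.
    """
    mu1, mu0, p1, p0 = cell_params[cell_index(z)]
    u = (mu1 - mu0) / sigma2 * (y - 0.5 * (mu1 + mu0)) + math.log(p1 / p0)
    return 1 if u > 0 else 0


# Two binary variables give four cells; every number below is made up.
CELL_PARAMS = {
    1: (3.0, 1.0, 0.1, 0.4),
    2: (3.0, 1.0, 0.2, 0.2),
    3: (4.0, 2.0, 0.3, 0.1),
    4: (4.0, 2.0, 0.4, 0.3),
}
SIGMA2 = 1.0

group_cell1 = allocate(2.1, (0, 0), CELL_PARAMS, SIGMA2)
group_cell2 = allocate(2.1, (1, 0), CELL_PARAMS, SIGMA2)
print(cell_index((0, 0)), group_cell1, cell_index((1, 0)), group_cell2)
```

In this sketch the same continuous value is allocated differently in two cells because the cutoff depends on the discrete pattern, which is also why the number of functions to estimate grows with the number of cells.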
    The optimum rule derived from the location model thus leads effectively to a different linear discriminant for each of the multinomial cells, with cutoff points determined in each case by the discrete component of the model [27, p. 783].

Thus, this is a model that acknowledges the different types of variables, but it is unduly complicated in the number of functions produced. The location model seems to be of theoretical interest but of little practical use at this time.

Krzanowski does, however, suggest that if the data are not to be treated by his method but rather by fitting to a model containing only one type of variable (either continuous or discrete, but not both), then it is better to consider them all as continuous rather than to partition the continuous variables. Thus, for a mixture of continuous and discrete variables, he preferred Fisher's linear discriminant to p-way contingency table analysis.

3.4  Logistic Regression

A simpler model for continuous and discrete variables is the logistic probability model. It allows both continuous and discrete variables without loss of information and the estimators have several desirable properties.

Let $P(X_i)$ be the posterior probability that an individual with explanatory variable values $X_i = (X_{i1}, \ldots, X_{im})$ has the disease (belongs to group 1). Then

$$ P(X_i) = \frac{1}{1 + \exp(-\beta_0 - \beta' X_i)} \tag{3.23} $$

(a prime on a vector indicates the transpose of the vector) where $\beta_0$ and $\beta$ are the logistic coefficients. The expression in (3.23) is the multivariate logistic function. In the medical context, (3.23) is a good formulation because "in the light of present medical knowledge a reasonable assumption is that P follows a symmetric sigmoid curve" [49, p. 168]. Cox [in 10] showed that the logistic function is appropriate for several different types of distributions of the explanatory variables: multivariate normal, binary, and mixed. In general, (3.23) holds for any variables whose distributions are in the exponential family; that is, those with density functions of the form

$$ f(X) = g(\theta)\, h(X)\, \exp\{T(\theta)\, X\}. \tag{3.24} $$

Assuming that the distribution of $X$ is described by (3.24) is such a mild restriction that we can ignore it for all practical purposes. A multivariate generalization of (3.23) can be made quite easily for problems with more than two groups. The generalization would allow division into $2^k$ classes where P is a vector with k components.

The logistic probability model has several desirable properties in addition to its general applicability for different distributions. Cox [10] showed that (3.23) has associated with it the simple sufficient statistics

$$ t = \sum_{i=1}^{n} X_i\, y_i \tag{3.25} $$

where the $y_i$ are indicator variables corresponding to group membership (0-1). The maximum likelihood estimators are functions of the sufficient statistics. Thus, from the Rao-Blackwell Theorem, we expect to get smaller mean squared error using the maximum likelihood estimators than using estimators that are not functions of sufficient statistics. In addition, the logistic model has asymptotically unbiased estimators associated with it [25].

Several different procedures have been suggested to estimate the parameters of the logistic function.
Unfortunately, all the procedures are, by necessity, iterative. Walker and Duncan [49] derive the normal equations through a least squares procedure with estimated weights. We let

$$ P_i = P(X_i) = \frac{1}{1 + \exp(-\beta' X_i)} \tag{3.26} $$

be the probability of the i-th individual of the sample having the disease. Therefore,

$$ y_i = P_i + \varepsilon_i = P(X_i) + \varepsilon_i. \tag{3.27} $$

Thus, the n x (m+1) matrix of independent variables for the sample is

$$ X = \begin{pmatrix} x_{10} & x_{11} & \cdots & x_{1m} \\ x_{20} & x_{21} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{n0} & x_{n1} & \cdots & x_{nm} \end{pmatrix} $$

and $y' = (y_1, y_2, \ldots, y_n)$. By application of weighted iterative non-linear least-squares procedures to (3.27) with a diagonal weight matrix W (the inverse of the covariance matrix of the vector $\varepsilon$), they conclude that the normal equations are

$$ X' W^{-1} X \beta = X' W^{-1} y. \tag{3.28} $$

Thus, they conclude that the estimators are

$$ \hat{\beta} = (X' W^{-1} X)^{-1} X' W^{-1} y, \tag{3.29} $$

which we shall see later are the same as the maximum likelihood estimators. Walker and Duncan suggest an iterative solution of (3.29) by the Newton-Raphson method with initial estimates obtained by fitting a linear discriminant function.

Kalman [49] proposed another recursive estimation procedure for which more rapid convergence is claimed; because of the rapid convergence, the need for good initial estimates is relaxed. The procedure updates the estimates with the addition of each new individual. The estimate of $\beta$ based on the first k individuals is $\hat{\beta}_k$. Then

$$ V_k = \mathrm{Var}(\hat{\beta}_k) = (X_k' W_k^{-1} X_k)^{-1} \tag{3.30} $$

where $X_k$ is the matrix of k individuals' observations and $W_k$ is the covariance matrix for k individuals. Let $X_{k+1}$ be the vector of observations for the (k+1)st individual,

$$ w_{k+1} = P_{k+1|k}\, Q_{k+1|k}, \tag{3.31} $$

and

$$ P_{k+1|k} = \frac{1}{1 + \exp(-\hat{\beta}_k' X_{k+1})} = 1 - Q_{k+1|k}. \tag{3.32} $$

Therefore,

$$ V_{k+1} = V_k - V_k X_{k+1}\, w_{k+1}\, c_{k+1}\, X_{k+1}' V_k \tag{3.33} $$

and

$$ c_{k+1} = (1 + w_{k+1}\, X_{k+1}' V_k X_{k+1})^{-1}. \tag{3.34} $$

Finally, the recursive formula for the estimator of $\beta$ is

$$ \hat{\beta}_{k+1} = \hat{\beta}_k + V_k X_{k+1}\, c_{k+1}\, (y_{k+1} - P_{k+1|k}) \tag{3.35} $$

where $y_{k+1}$ takes on the values 1 and 0 as the (k+1)st individual does or does not have the disease.

The problem of initial estimates is quite simple now. Let $V_0$ and $b_0$ be any prior estimates of the variance and $\beta$. The estimates $V_k$ and $\hat{\beta}_k$ are found using $V_0$, $b_0$, and the first k data items. Then $V_0$ and $b_0$ are eliminated from the formula by the following:

$$ V_k^{*-1} = V_k^{-1} - V_0^{-1} \tag{3.36} $$

and

$$ \hat{\beta}_k^{*} = V_k^{*}\,(V_k^{-1} \hat{\beta}_k - V_0^{-1} b_0). \tag{3.37} $$

The remaining n-k items, $V_k^{*}$, and $\hat{\beta}_k^{*}$ are then used in the recursive process to get the final estimate of $\beta$.

A third method is maximum likelihood estimation. The maximum likelihood equations are fairly simple to derive for the logistic model. Let $P_s$ be the posterior probability of disease from equation (3.23) for the s-th individual. Also let $y_s = 0$ if the s-th individual does not have disease and $y_s = 1$ if that individual has disease. Then the likelihood is

$$ L = \prod_{s=1}^{n} P_s^{\,y_s} (1 - P_s)^{1 - y_s} \tag{3.38} $$

and the natural logarithm of the likelihood is

$$ \ln L = \sum_{s=1}^{n} y_s \ln P_s + \sum_{s=1}^{n} (1 - y_s) \ln(1 - P_s). \tag{3.39} $$

Taking partial derivatives of (3.39), we get the maximum likelihood equations [Footnote 4: To take the partial derivatives we use the facts that $\partial \ln P_s / \partial \beta_0 = 1 - P_s$, $\partial \ln(1 - P_s) / \partial \beta_0 = -P_s$, $\partial \ln P_s / \partial \beta_i = X_{is}(1 - P_s)$, and $\partial \ln(1 - P_s) / \partial \beta_i = -X_{is} P_s$.]

$$ \frac{\partial \ln L}{\partial \beta_0} = \sum_{s=1}^{n} y_s - \sum_{s=1}^{n} P_s = 0 \tag{3.40} $$

and

$$ \frac{\partial \ln L}{\partial \beta_i} = \sum_{s=1}^{n} X_{is}\, y_s - \sum_{s=1}^{n} X_{is}\, P_s = 0, \qquad i = 1, \ldots, m. \tag{3.41} $$

Equation (3.40) assures us that the expected number of cases will be equal to the observed number of cases. That is another desirable property of the maximum likelihood estimates.

The maximum likelihood equations are most often fitted by the Newton-Raphson algorithm. It is an iterative gradient algorithm. If $p_k$ is an estimate of $p$, k > 0, and f is the logistic function to be estimated, then the new estimate $p_{k+1}$ is

$$ p_{k+1} = p_k - (d^2 f)^{-1} (df). \tag{3.42} $$

With this formulation there can be a problem of divergence when the initial estimates are not close to the true values. Thus, Haberman [in 24] added a factor $\alpha(k)$ to equation (3.42) to prevent divergence in such cases. If reasonable care is taken in choosing the starting values, divergence is not a large problem in most applications so we will not complicate (3.42) with the $\alpha$ term.

Several different types of starting estimators have been suggested in the literature. Linear discriminant function estimators have often been used as starting values. Other possible initial estimators are conditional estimators and reverse Taylor series approximations [35]. Conditional estimators are obtained by maximizing the conditional likelihood (conditional on the explanatory variables). Reverse Taylor series approximations arise from the logistic function, equation (3.23). Expanding about $X = \bar{X}$ in a Taylor series, one gets

$$ P(X) = \frac{1}{1 + \exp(-\beta_0 - \beta' \bar{X})} - \frac{\beta' \bar{X}\, \exp(-\beta_0 - \beta' \bar{X})}{[1 + \exp(-\beta_0 - \beta' \bar{X})]^2} + \frac{\beta' \exp(-\beta_0 - \beta' \bar{X})}{[1 + \exp(-\beta_0 - \beta' \bar{X})]^2}\, X + R(X), \tag{3.43} $$

where $R(X)$ denotes a remainder containing terms of the order $O[(X - \bar{X})'(X - \bar{X})]$. Neglecting the remainder, we can interpret this as the linear function $A + B'X$ where

$$ A = \frac{1}{1 + \exp(-\beta_0 - \beta' \bar{X})} - B' \bar{X} \tag{3.44} $$

and

$$ B = \frac{\beta \exp(-\beta_0 - \beta' \bar{X})}{[1 + \exp(-\beta_0 - \beta' \bar{X})]^2}. \tag{3.45} $$

Solving one gets

$$ \beta = \frac{B}{(A + B' \bar{X})(1 - A - B' \bar{X})} \tag{3.46} $$

and

$$ \beta_0 = -\beta' \bar{X} - \ln\left[ \frac{1}{A + B' \bar{X}} - 1 \right] \tag{3.47} $$

as the reverse Taylor series approximations.
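The Newton-Raphson fit of the maximum likelihood equations can be sketched for the simplest case of one explanatory variable plus a constant. This is a minimal illustration, not the Nerlove and Press routine: the data are made up, the starting values are simply zero rather than discriminant or reverse Taylor series estimates, and no step-size factor is used to guard against divergence.

```python
import math


def logistic(eta):
    """Logistic function 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic_newton(xs, ys, max_iter=25, tol=1e-10):
    """Maximum likelihood estimates (b0, b1) for P = logistic(b0 + b1 * x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        ps = [logistic(b0 + b1 * x) for x in xs]
        # First derivatives of ln L with respect to b0 and b1.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Negative second derivatives (2 x 2 information matrix).
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system by Cramer's rule.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical data: one explanatory variable and a 0-1 disease indicator.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0_hat, b1_hat = fit_logistic_newton(xs, ys)
fitted = [logistic(b0_hat + b1_hat * x) for x in xs]
print(round(b0_hat, 4), round(b1_hat, 4))
print(round(sum(fitted), 6), sum(ys))
```

At convergence the fitted probabilities sum to the observed number of cases, which is the property that the first maximum likelihood equation guarantees.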
Computer routines to find the maximum likelihood estimators for the logistic model or logistic regression are not so readily available as those for linear discrimination. However, they are becoming more accessible. An example of one such program is listed in the work by Nerlove and Press [35]. Like most logistic regression programs, it uses the Newton-Raphson algorithm to find the MLE's. A disadvantage of this routine is the necessity for an additional user-written program to calculate the probabilities and classifications.

Because the logistic formulation provides a probability of being in group 1 rather than just a classification, one can tell which individuals are quite likely to be correctly classified (probability near 0 or 1) and which individuals are near the boundary (probability near 0.5) and thus are likely to be incorrect. The results could easily be used to form three groups: 1) those in group 0, 2) those in group 1, and 3) those in the middle region for whom more investigation should be carried out before classification. This is a particularly desirable characteristic for a medical diagnosis problem. Some tests are expensive, while others are inexpensive. If a patient can be classified (diagnosed) on the basis of inexpensive tests only, it is desirable for the patient and medical staff. However, if the first tests are inconclusive, the more costly tests are available to help resolve the question. Thus, we can think of the logistic regression as giving us a two-stage procedure.

3.5  Comparison of Linear Discrimination and Logistic Regression

For theoretical and practical considerations outlined above, I chose to concentrate on only two of the classification models: linear discrimination and logistic regression. Linear discrimination was chosen because of its widespread use in published studies despite violations of the model assumptions and because of easily accessible computer routines.
Logistic regression by use of the Newton-Raphson algorithm was selected because of the good fit of the medical data to the model assumptions, the desirable features of the logistic estimators, and the availability of a computer program. In another work [41] S.J. Press and I compared logistic regression and discriminant analysis. Theoretical arguments were presented for and against the use of logistic regression as opposed to discriminant analysis for classification and regression of qualitative variables on explanatory variables. Empirical results for some non-normal classification problems were reported.

A theoretical comparison of linear discrimination and logistic regression under different conditions is the next concern of this work. When the data are multivariate normal with equal covariance matrices, the model assumptions of both models are satisfied. One would expect the linear discriminant to be better in this case because its model assumptions are satisfied, it has a closed form and it is non-iterative. Also for the normal case, both types of estimators are asymptotically unbiased [25]. Efron [12] investigated the asymptotic efficiency for each model under the assumption of normality of the data. He calculated the asymptotic relative efficiency (ARE) of logistic regression to linear discrimination. The ARE is given by

$$ \mathrm{ARE} = (1 + \Delta^2 q_1 q_0)\, \frac{\exp(-\Delta^2/8)}{(2\pi)^{1/2}} \int_{-\infty}^{\infty} \frac{\exp(-x^2/2)}{q_1 \exp(\Delta x/2) + q_0 \exp(-\Delta x/2)}\, dx \tag{3.48} $$

where

$$ \Delta = \left[ (\mu_1 - \mu_0)'\, \Sigma^{-1} (\mu_1 - \mu_0) \right]^{1/2} $$

which is the square root of the Mahalanobis distance. Table VIII gives some sample results for the asymptotic relative efficiency when $q_1 = q_0 = 0.5$, the most favorable situation for logistic regression.

Table VIII
Asymptotic Relative Efficiency of Logistic Regression to Linear Discrimination (q1 = q0 = 0.5)

Delta     0       .5      1       1.5     2       2.5     3       3.5     4
ARE     1.000   1.000   .995    .968    .899    .786    .641    .486    .343
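The asymptotic relative efficiency can be evaluated numerically as a check on the tabled values. The sketch below assumes Efron's expression in the form (1 + Delta^2 q1 q0) times exp(-Delta^2/8) times the integral of the standard normal density divided by q1 exp(Delta x/2) + q0 exp(-Delta x/2), and evaluates the integral by Simpson's rule over a truncated range. The function name, the truncation limit, and the number of steps are illustrative choices, not part of the original work.

```python
import math


def asymptotic_relative_efficiency(delta, q1=0.5, steps=4000, limit=10.0):
    """ARE of logistic regression to linear discrimination under normality."""
    q0 = 1.0 - q1

    def integrand(x):
        phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        return phi / (q1 * math.exp(0.5 * delta * x) + q0 * math.exp(-0.5 * delta * x))

    # Composite Simpson's rule on [-limit, limit]; steps must be even.
    h = 2.0 * limit / steps
    total = integrand(-limit) + integrand(limit)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(-limit + i * h)
    integral = total * h / 3.0

    return (1.0 + delta ** 2 * q1 * q0) * math.exp(-delta ** 2 / 8.0) * integral


are_values = {d: asymptotic_relative_efficiency(d) for d in (0.0, 1.0, 2.0, 3.0, 4.0)}
for d, value in are_values.items():
    print(d, round(value, 3))
```

Changing `q1` away from 0.5 in this sketch gives a feel for how the efficiency of logistic regression behaves when the two populations are unequally represented.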
The logistic regression is less efficient as n goes to infinity because the linear discriminant is based on the full maximum likelihood estimators for $\beta_0$ and $\beta$, while logistic regression is a conditional maximum likelihood estimation procedure. Thus, when the data are multivariate normal, the linear discriminant approach is to be preferred on theoretical grounds.

Since "the multivariate normal assumption is unlikely to be satisfied in applications, even approximately,..." [25, p. 125], the non-normal case seems the more important situation to consider. When the normality assumption is violated, logistic regression is theoretically more robust. The logistic probability model is valid for the exponential family, equation (3.24). The question of relative efficiencies in this case has not been investigated, but one would expect the logistic estimators to be more efficient. It was shown above that when (3.23) holds, there are sufficient statistics for $\beta$. The maximum likelihood estimators are functions of the sufficient statistics, but the discriminant function estimators are not. Thus, by the Rao-Blackwell theorem, the logistic estimators have a smaller expected mean square error than the discriminant function estimators.

Under non-normal conditions, the logistic maximum likelihood estimates are consistent (asymptotically unbiased). For the discriminant function procedure, Halperin, Blackwelder, and Verter showed that for large samples and one attribute variable

$$ \hat{\beta} \rightarrow \frac{P_1 - P_0}{q_1 P_1 Q_1 + q_0 P_0 Q_0} \tag{3.49} $$

and

$$ \hat{\beta}_0 \rightarrow \ln\frac{q_1 P_1 + q_0 P_0}{q_1 Q_1 + q_0 Q_0} - \frac{q_1}{2} \left[ \frac{P_1}{q_1 P_1 + q_0 P_0} + \frac{Q_1}{q_1 Q_1 + q_0 Q_0} \right] \frac{P_1 - P_0}{q_1 P_1 Q_1 + q_0 P_0 Q_0} \tag{3.50} $$

where

$$ P_1 = \frac{1}{1 + \exp(-\beta_0 - \beta)} = 1 - Q_1 \tag{3.51} $$

and

$$ P_0 = \frac{1}{1 + \exp(-\beta_0)} = 1 - Q_0. \tag{3.52} $$

When $q_1 = 0.5$ the estimates are nearly unbiased, but for other values of $q_1$ and $q_0$ the discrepancies can be quite large.
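The size of the large-sample discrepancy in the discriminant slope can be illustrated numerically. The sketch below evaluates the limit of the slope estimate for a single 0-1 attribute and compares it with the true logistic coefficient. It reads q1 as the proportion of individuals with the attribute present, which is the interpretation under which the limiting expression follows from the pooled within-group variance; the source does not spell this out, and the coefficient values used are hypothetical.

```python
import math


def logistic(eta):
    """Logistic function 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def discriminant_slope_limit(beta0, beta, q1):
    """Large-sample limit of the discriminant slope for one 0-1 attribute."""
    q0 = 1.0 - q1
    p1 = logistic(beta0 + beta)   # probability of disease, attribute present
    p0 = logistic(beta0)          # probability of disease, attribute absent
    return (p1 - p0) / (q1 * p1 * (1.0 - p1) + q0 * p0 * (1.0 - p0))


TRUE_BETA0, TRUE_BETA = -2.0, 1.5  # hypothetical logistic coefficients

limits = {q1: discriminant_slope_limit(TRUE_BETA0, TRUE_BETA, q1)
          for q1 in (0.1, 0.5, 0.9)}
for q1, value in limits.items():
    print(q1, round(value, 3), "true slope:", TRUE_BETA)
```

For these made-up coefficients the limit stays close to the true slope when q1 is 0.5 and drifts well away from it when q1 is 0.1 or 0.9.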
Equations (3.49), (3.50), (3.51), and (3.52) can be easily generalized to more than one attribute variable [25]. The authors continued the analysis to show that

a) $\beta_i$ which are zero will tend to be estimated as zero for large samples by the method of maximum likelihood, but not necessarily by the discrimination method;

b) if any $\beta_i$ are non-zero they will tend to be estimated as non-zero by either method, but the discriminant function approach will give asymptotically biased estimates for those $\beta_i$ and for ($\beta_0$) ... [25, p. 152].

Finally, if there are two or more $\beta_i$ that are non-zero, the discriminant function estimates of the $\beta_i$ that are zero will not converge to zero in general. Consequently, one could be led to believe that certain factors are significant when in reality they are not. Empirical studies have shown that it does indeed happen in non-trivial cases. Thus, when the data follow the exponential family but are not normal with equal covariance matrices, logistic regression is preferred on theoretical grounds.

In addition to the theoretical aspects of the comparison, some practical aspects must be considered. Since logistic regression is an iterative procedure, its estimated parameters are more complicated to calculate and thus computation is more time consuming. Halperin, Blackwelder, and Verter [25] found that logistic regression took longer by a factor ranging from 1.3 to 2. For a problem with 50 individuals and 5 independent variables, I found [in 41] that logistic regression took 1.4 times longer. The time problem gets worse as the number of variables and number of observations increase. Other things being equal, doubling the number of observations doubles the time (see Table IX for some sample times).
The time can increase even more quickly than indicated in the table because the time depends too on other factors such as starting values of the coefficients, divergence of the algorithm, and covariances between the various coefficients. One would conclude that execution times can be much longer for logistic regression, but not beyond acceptable limits for large-scale computer installations.

Table IX
Execution Times for Logistic Regression (using an IBM 360-65), from [35]

Independent Variables    Observations    CPU Time for Execution
1                        89              6
6                        225             36
7                        886             60

3.6  Variable Reduction

In an exploratory data analysis of retrospective medical studies, the statistician will often have a large number of variables available. In order to make the problem more manageable and make the underlying biologic process clearer, it is desirable to reduce the dimension of the problem. A number of criteria have been suggested to achieve the reduction. Three types will be discussed below: 1) checking all possible subsets, 2) a stepwise procedure, and 3) a two-stage non-iterative procedure.

McCabe [29] proposes that one check all possible subsets of the variables to find the optimal subset. This procedure has the advantage of considering all possible combinations and interactions of variables. Let the within sum of cross-products matrix be denoted by $W = (w_{ij})$ where

\[ w_{ij} = \sum_{h=1}^{2} \sum_{s=1}^{n_h} (X_{ihs} - X_{ih\cdot})(X_{jhs} - X_{jh\cdot}) \tag{3.53} \]

(a subscript replaced by a dot indicates the variable was averaged over that index) and let the total sum of cross-products matrix be $T = (t_{ij})$ where

\[ t_{ij} = \sum_{h=1}^{2} \sum_{s=1}^{n_h} (X_{ihs} - X_{i\cdot\cdot})(X_{jhs} - X_{j\cdot\cdot}) \tag{3.54} \]

One of the standard multivariate analysis of variance tests for equality of the means uses

\[ U = \frac{|W|}{|T|} \tag{3.55} \]

That is essentially the ratio of the estimated generalized variance within to the estimated generalized variance total. Clearly, $0 \le U \le 1$ and small values indicate good discrimination. McCabe uses U as a descriptive statistic to show discrimination potential. Given two subsets with an equal number of variables, the subset corresponding to the smaller U is the preferred subset. The process is similar to calculating squared multiple correlation coefficients in regression.

Theoretically, the examination of all possible subsets is the best procedure, but in practice it becomes quite lengthy for even moderate problems. To compare all subsets of p variables, U must be calculated from (3.53), (3.54), and (3.55) for $2^p - 1$ possible subsets. A CDC 6500 computer required five minutes of CPU time to find the subsets for 20 variables [29] and each added variable doubles the time required. Thus, while theoretically appealing, the examination of all subsets is not practical for a problem of the size considered in this work.

A second type of variable reduction scheme is a stepwise procedure. Individual variables are considered for inclusion or exclusion iteratively until a prespecified criterion has been achieved. The procedure adds at each step the variable which reduces the residual sum of squares as much as possible. That is, it uses a stepwise linear regression and the variable added is the one which maximally reduces the remaining unexplained variance about regression. An F-ratio is calculated from the within groups and total covariance matrices. Each variable is considered for inclusion depending on its relation to the already included variables only. Thus, not all interactions are studied, as they were in the U-ratio scheme. An advantage of the stepwise procedure is the relatively large number of variables it can handle in a reasonable time. Computer routines such as the Biomed Stepwise Discriminant allow 80 variables and are quite efficient. With the virtual memory of large-scale computers, the number of variables could be increased beyond 80 with ease and without unreasonable increases in time. Although theoretically less desirable than the previous procedure, the stepwise procedure is practically appealing for large numbers of variables.

A third approach, suggested by Zielezny [51], reduces the number of variables by formalizing the medical diagnosis process. The doctor looks at an inexpensive, easily collected set of variables; if that provides a clear-cut decision, then there is no need to observe further. If it does not provide a clear-cut decision, then the doctor continues his investigation. This is especially useful when there are sets of expensive and inexpensive variables. Let

\[ X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \]

where $X_i$ has $k_i$ components and $X_1$ corresponds to the inexpensive variables. The

\[ \mu_j = \begin{pmatrix} \mu_{j1} \\ \mu_{j2} \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \]

are partitioned similarly. The U(X) of equation (3.1) is also partitioned as

\[ U(X) = \begin{pmatrix} U_1(X_1) \\ U_2(X_2) \end{pmatrix} \]

Then the two-stage rule is as follows: For a, b, $-\infty < a < b < \infty$, classify to group 0 if 1) $U_1(X_1) < a$ or 2) $a \le U_1(X_1) \le b$ and $U(X) < 0$; otherwise, classify to group 1. The geometry of the partition is presented in Figure 3.2. That is, if $U_1(X_1)$ is high or low an observation can be classified on the basis of $X_1$ only. If $U_1(X_1)$ is in the middle range, then measurements must be taken on the second subset. The parameters a and b are chosen to minimize the total cost.

Although it has not been done, it seems that this procedure could easily be generalized to p-1 stages so that variables are observed singly.
One would then have a sequential decision process. Further investigation along these lines would be interesting, but will not be attempted here. However, a variation of the two-stage decision process can be constructed from the logistic regression results. More will be said about such a procedure in Chapter 5.

Figure 3.2. The geometry of the two-stage procedure for two groups.

The three types of variable reduction reviewed in this section all appear useful based on theoretical considerations. When practical aspects are considered, the stepwise discriminant analysis is superior. It has also been shown [25] that variables shown to be significant by stepwise discrimination include the ones that are significant in logistic regression. The subset of variables may include non-significant ones also as explained above. Thus, a subset of variables chosen by stepwise discrimination is appropriate as a reduced set of variables for logistic regression.

3.7  Conclusions

We can conclude from our review of statistical models for classification that logistic regression should discriminate better than linear discrimination for the medical problem of this work. The time should be greater for the logistic regression, but should not be a major problem. In order to reduce the dimension of the problem, stepwise discriminant analysis should be done. That is, stepwise discriminant analysis is appropriate for exploratory work and logistic regression is appropriate for final parameter estimation and classification with non-normal data.

Chapter 4

DATA COLLECTION

4.1  Selecting Variables to be Observed

The first problem in a statistical analysis is deciding what variables are to be observed. Because my prior knowledge of the problem was limited, external sources of information were consulted. The medical staff at BCCI were the first source of information about factors that influence the growth and spread of breast cancer.
Their knowledge of the disease process included the general knowledge of breast cancer literature and their own personal experience treating the disease. The second source of facts about the disease process was the published medical reports on breast cancer. No previous work was available on predicting the spread of breast cancer to the regional lymph nodes. Because of the lack of knowledge about the factors influencing lymph node metastases, the variables chosen were those known to affect the risk of originally developing breast cancer and those known to affect the prognosis of the disease.

Factors known to influence the risk of breast cancer¹ can be divided into four types: 1) endocrine, 2) genetic, 3) immunologic, and 4) other.

¹ The information about risk factors is generally agreed upon by medical authorities. The interested reader is referred to [9], [31], and [36] for more on the epidemiology of breast cancer.

Estrogens and ovarian activity have a great influence on breast cancer. Studies [see 36 for example] have shown that early menarche and late menopause (implying a longer than average period of estrogen production) increase the risk of breast cancer. Pregnancy and lactation are known to alter the hormonal balance of a woman's body, but studies have provided conflicting results about their influence on breast cancer risk. Early first parity² decreases the risk. Also the more children, the more the risk decreases. However, this may be accounted for by the fact that women with a large number of children would have begun having them at an early age. Finally, recent studies have produced conflicting results about the relationship between oral estrogen (birth control pill) use and breast cancer incidence.
Thus, it was decided to observe all available information relating to the patients' ovarian activity, menstrual history, pregnancy and lactation history, and estrogen therapy.

Other areas of concern in endocrine function are the adrenal and thyroid glands. Adrenal dysfunction is known to contribute to carcinogenesis. One type of treatment for breast cancer metastases is an adrenalectomy. It is also known that there is a high incidence of thyroid disease in women with carcinoma of the breast. After thyroidectomy there is a low incidence of breast cancer. Thus, endocrine dysfunction seems to have an important influence on breast cancer. It was decided to observe all variables associated with endocrine imbalance, including especially thyroid and adrenal dysfunction.

² Parity is the condition of a woman with respect to having produced viable children regardless of whether the child was alive at birth.

The second type of factors includes genetic or hereditary factors. It is well-known that there is a familial predisposition for breast cancer. The incidence is consistently higher for relatives of breast cancer patients and it is inherited through either parental line. Complete information on relatives with breast cancer was an important area to observe. In addition to familial differences in risk, epidemiological studies [31] have shown racial differences. The lowest risk is among Oriental women and the highest risk is among Caucasian women. The influence of genetic factors is, however, complicated by environmental factors. Oriental women who move to North America from Asia increase their risk until, several generations after immigration, their risk is close to Caucasian women's risk. On the other hand, non-white women generally have histologically more malignant tumors. Hence there is a greater likelihood of positive axillary nodes. Thus, it was decided to observe all genetic and environmental factors that were available.

The third set of factors involves the body's immune system. Although this may be the most important set of factors influencing the spread of the disease, it is the most difficult to observe. There is no good measure of the host-tumor relationship. It is known that various immune deficient states are associated with a very much greater risk of breast cancer. Multicentricity and bilaterality of the disease are clear indications of the inadequacy of the body's defenses. Seasonal variations in incidence have been noted (May to September is a period of lower incidence) which correspond to observed seasonal variations in response to other diseases. Perhaps the closest one can come to measuring the immune status of the body is the lymphocytic count. Since lymphocytes are the body's main defense mechanism against disease, their activity indicates the level of resistance to disease processes. It is interesting to note that either a high or nil lymphocytic reaction is better than a moderate reaction. A high lymphocyte count means that the body has a good defense, while no lymphocytic reaction means that the response has not yet been triggered. A low to moderate level of lymphocytes indicates a reaction that has been triggered but is inadequate to the task. It was concluded that lymphocytic reaction and season were to be observed. Also a history of diseases indicating immune deficiency, such as herpes zoster, was to be observed.

Other factors that influence risk of breast carcinoma are quite varied. Trauma to the breast is present in about 11 percent of breast cancer patients and an unknown percentage of normal women. Socioeconomic level seems to play some part, although there is evidence that this is related to diet and lifestyle.
Stomach cancer incidence is correlated with breast cancer incidence. Since stomach cancer is influenced by diet, doctors have suggested a link between diet and cancer of the breast. On the other hand, women at lower socioeconomic levels do not have as good access to medical care and so delay seeking treatment, which allows the cancer to infiltrate the body more. Other systemic illnesses and previous benign breast disease increase the risk of breast cancer, too. As with many other diseases, the psychological state of the patient can be crucial to treatment response and survival. Thus, it was decided that all available information on the above factors was to be collected.

A final area that seemed promising was the pathological diagnosis. Breast cancer is more than one disease. Many different histological types are combined in the label of breast carcinoma. Different histological types were known to have varying incidences among different groups of women. Table X shows the results of one study [31]. Consequently, all the pathological findings available in the medical charts were to be included as variables.

In addition to the factors that influence risk of developing breast cancer, the clinical state of the disease at diagnosis is an important indicator of possible lymph node metastases. Different states have different prognoses. A short survival associated with a particular symptom indicates a more advanced stage of disease or more virulent disease. Both of these conditions have a greater chance of nodal metastases. Size of the primary tumor can be indicative of disease state. A large mass would be seen when there is a long delay between onset of symptoms and diagnosis. The longer the tumor has been growing, the better established it is in the body. Thus, metastases are more likely. Large tumor masses also are seen in fast-growing cancers. A carcinoma that grows quickly has a good chance of having spread elsewhere.
Skin involvement and clinical involvement of the axilla also indicate spread of disease and increased likelihood of lymph node metastases. Consequently, all variables pertaining to the state of the disease were to be observed.

After the review of the literature on risk factors and prognostic factors, a sample of 50 charts of patients who had not had a selective biopsy but who were initially treated at BCCI during the years 1955 to 1963 was reviewed. The charts were read to see how much of the information outlined above was recorded in patient charts and in what format the information appeared. A few additional variables that had no obvious connection to the lymph node metastases problem but were available on most charts were included in the list of variables to be observed.

Table X
Comparison of Factors by Histologic Type

Histologic Type         % of total    Average Age (years)    Location                        Nodal involvement
Infiltrating duct       78.1%         50.7                   all                             60%
Infiltrating lobular    8.7%          53.8                   standard                        60%
Medullary               4.3%          49.0                   More in upper half of breast    44%
Colloid                 2.6%          49.7                   standard                        32%
Comedo carcinomas       4.6%          48.6                   Mostly subareolar
Papillary               1.2%          51.9                   More in lower half of breast

4.2  Data Collection

From the list of factors mentioned in the previous section, a data collection form and coding instructions were devised. The data collection form and accompanying instructions are in Appendix C. The format for data collection was suggested by previous studies undertaken at BCCI. Over a period of three months forms were completed by me for the 557 selective biopsy patients diagnosed between 1955 and 1963 and treated at BCCI. If some information was unavailable for a patient, a missing-data code was used.

After the data collection was completed, the twenty-two patients with previous breast or systemic malignancy were eliminated since the disease process could be obscured by the other primary tumors. Of the
Of the  remaining 535 patients, three more were dropped from the study because of very poor information.  The physicians recording the history reported  that two of the women were bad historians or suspect reporters.  The  third woman removed from the study spoke only German and the interpreter was a young boy unfamiliar with medical terminology and biological processes.  An additional three cases were discarded because expert patho-  logical review was unable to provide a disease status for the lymph nodes. Thus, the final sample size was 529 patients.  62  4.3  Variables Selected for Analysis After the sample had been pared to its final size, the variables  to be included in the analysis had to be selected.  Since the exploratory  analysis was to be carried out by stepwise discriminant analysis, no more than 80 variables (the program limit) could be included.  The posterior  knowledge after data collection became the prior knowledge used for variable selection. Three types of variables had been observed: 2) binary, and 3) categorical.  1) continuous,  The binary and continuous variables did not  have to be transformed prior to analysis.  Some categorical variables were  changed after data collection because certain categories had too few observations.  The variables were collapsed by combining categories until  the remaining categories were large enough for analysis. the categorical variable race. very small.  An example was  All categories other than Caucasian were  Categories were combined to form the binary variable for  Caucasian and non-Caucasian.  If a categorical variable was an ordered  categorical variable, i t could be used without transformation.  Unordered  categorical variables had to be transformed into binary indicator variables. Each binary variable corresponds to one of the f i r s t k-1 categories. an individual in category i ,  Thus,  1 < i < k-1, would have zeroes for each of  the indicator variables except variable i which would be one.  
If an individual was in category k, the k-1 variables would all be zero. Only k-1 variables are used since k variables would be linearly dependent.

After the above transformations were made, there were 112 variables. To accommodate that number of variables, the discriminant analysis could have been used twice and the results combined. However, some variables were not included in the analysis for various reasons. After the survival analysis of Chapter 2 was completed, the survival data were eliminated. Some variables had zero variance and so provided no information. Those were discarded. Several variables that had seemed promising before work began had so few respondents that they also were discarded. Through these means, enough variables were dropped to leave 80 variables for analysis. A list of the variables observed, but not used in the analysis appears in Appendix D.

The 80 variables chosen for analysis will be defined here. In the list that follows, a name is assigned to each variable and a definition of the variable appears.

General information:

ID - BCCI number - a six digit number for patient identification (not used in analysis, but used for case identification).
DIAMON - Month of initial diagnosis or treatment (also called the anniversary).
AGE - Age (at last birthday) at diagnosis - in years.
SOCECN - Socioeconomic level - 1 if high socioeconomic level; 0 otherwise.
RACE - Racial origin - 1 if Caucasian; 0 otherwise.
MARRY - Patient married - 1 if patient is married at time of diagnosis; 0 otherwise.
BROKEN - Broken marriage - 1 if patient had been married previously but the marriage has been dissolved by death, separation or legal action; 0 if never married or married now.

Family history:

BRMOM - Breast cancer in patient's mother - 1 if mother had developed breast cancer; 0 otherwise.
BRDAUG - Breast cancer in patient's daughter - 1 if daughter had developed breast cancer; 0 otherwise.
BRSIS - Breast cancer in patient's sister - 1 if sister had developed breast cancer; 0 otherwise.
BROTH - Breast cancer in other female relatives - 1 if any other female relative had developed breast cancer; 0 otherwise.
CANCER - Blood relatives with cancer - number of relatives who have had cancer.

Patient's Personal History:

SMOKE - Cigarette smoker - 1 if patient has a history of cigarette smoking; 0 otherwise.

Patient's Reproductive History:

REG - Regular menstrual periods - 1 if patient had regular, uncomplicated menstrual periods; 0 otherwise.
DYSMEN - Dysmenorrhoea - 1 if patient experienced dysmenorrhoea; 0 otherwise.
HORMON - Hormone therapy - 1 if patient had a history of hormone therapy; 0 otherwise.
OTHDRG - Major drug therapy other than hormones - 1 if there is a history of other drug therapy; 0 otherwise.
STATUS - Menopausal status - 1 if patient is premenopausal or within 5 years of the menopause at the time of diagnosis; 0 if patient is postmenopausal.
AGEMEN - Age at menopause - age in years at menopause for postmenopausal patients; 88 for premenopausal patients.
AGEFRS - Age at first birth - patient's age in years at termination of first full-term pregnancy for para women; 0 for nullipara women.
BIRTHS - Parity - number of pregnancies carried to full term.
MISCAR - Miscarriages - number of pregnancies that did not carry to full term.
NUMNUR - Number nursed - number of lactation periods.
MONNUR - Number of months nursed - length in months of the longest lactation period.
BRFED - Patient breastfed - 1 if patient was breastfed; 0 otherwise.

Patient's Illnesses:

DIABET - Diabetes - 1 if patient had had diabetes; 0 otherwise.
HEART - Heart disease - 1 if patient had a history of heart disease; 0 otherwise.
HYPER - Hypertension - 1 if patient had a history of hypertension; 0 otherwise.
KIDNEY - Kidney disease - 1 if patient had a history of kidney disease; 0 otherwise.
TB - Tuberculosis - 1 if patient had a history of tuberculosis; 0 otherwise.
ANEMIA - Anemia - 1 if patient had a history of anemia; 0 otherwise.
PNEUM - Pneumonia - 1 if patient had a history of pneumonia or other serious lung diseases, excluding tuberculosis; 0 otherwise.
ALLERG - Allergy - 1 if patient had a history of allergies; 0 otherwise.
THYROD - Thyroid disease - 1 if patient had a history of thyroid dysfunction; 0 otherwise.
FIBROI - Uterine fibroids - 1 if patient had a history of uterine fibroids; 0 otherwise.
DISOTH - Other disease - 1 if patient had a history of other major diseases not listed above; 0 otherwise.

Patient's Surgical History (Before Present Illness):

OOPHOR - Oophorectomy - 1 if patient had an oophorectomy; 0 otherwise.
HYSTER - Hysterectomy - 1 if patient had a hysterectomy; 0 otherwise.
PELVIC - Other pelvic surgery - 1 if patient had pelvic surgery other than those specifically listed above, especially involving the reproductive system; 0 otherwise.
GALLB - Cholecystectomy - 1 if patient had gall bladder surgery; 0 otherwise.
THYSUR - Thyroidectomy - 1 if thyroid had been removed; 0 otherwise.
ADRENL - Adrenalectomy - 1 if the adrenal gland had been removed; 0 otherwise.
OTHSUR - Other surgery - 1 if patient had a history of other surgery; 0 otherwise.

Benign Breast Ailments:

MAZODY - Mazodynia - 1 if patient had mazodynia; 0 otherwise.
MASTIT - Mastitis - 1 if patient had had benign breast disease during lactation; 0 otherwise.
BENIGN - Benign breast disease other than during lactation - 1 if patient had a history of benign breast disease other than during lactation; 0 otherwise.

History of the Present Illness:

DURTON - Duration of symptoms - duration of symptom in months from onset of first symptom to date of diagnosis.
SYMPT1 - Thickening or lump - 1 if first symptom was a thickening or lump in the breast; 0 otherwise.
SYMPT2 - Pain - 1 if first symptom was pain in the breast or axilla; 0 otherwise.
SYMPT3 - Nipple changes - 1 if the first symptom was change in the contour of the nipple or discharge from the nipple; 0 otherwise.
SIZE - Size of the tumor - clinical size of the original tumor mass in cm.
LOC1 - Lower inner quadrant - 1 if tumor was located in the lower inner quadrant of the breast; 0 otherwise.
LOC2 - Lower outer quadrant - 1 if tumor was located in the lower outer quadrant of the breast; 0 otherwise.
LOC3 - Upper inner quadrant - 1 if tumor was located in the upper inner quadrant of the breast; 0 otherwise.
LOC4 - Upper outer quadrant - 1 if tumor was located in the upper outer quadrant of the breast; 0 otherwise.
LOC5 - Lymph node tumor - 1 if tumor was located in the axilla; 0 otherwise.
NODEPL - Palpable nodes - 1 if the lymph nodes were palpable in the axilla on the same side as the tumor; 0 otherwise.
SKIN - Skin involvement - 1 if there was skin involvement at the site of the primary tumor; 0 otherwise.
BREAST - Breast involved - 1 if the right breast was the primary site; 0 if the left breast was the primary site.
TRAUMA - Trauma to the breast - 1 if the patient had a history of trauma to the involved breast; 0 otherwise.

Patient's Present Condition:

BODYSZ - Overall body size - 3 ordered categories for body size.
BRSIZE - Breast size/shape - 4 ordered categories for breast size.
CONDTN - Patient's general physical condition - 1 if patient was in good physical condition at the time of diagnosis; 0 otherwise.
OTHILL - Other illnesses present - number of other illnesses present at the time of diagnosis.
EMOTON - Emotional problems - 1 if the patient had emotional problems (other than any related to the cancer) at the time of diagnosis or shortly before; 0 otherwise.
LYMPH - Lymphocytes - percentage of lymphocytes in the blood at diagnosis.

Pathology:

NODES - Grouping variable, lymph node status - 1 if apical or internal mammary lymph nodes were positive at diagnosis; 0 if lymph nodes were negative.
PATH1 - Paget's disease - 1 if histology was Paget's disease of the nipple; 0 otherwise.
PATH2 - Noninfiltrating papillary carcinoma - 1 if histology was non-infiltrating papillary carcinoma; 0 otherwise.
PATH3 - Infiltrating papillary carcinoma - 1 if histology was infiltrating papillary carcinoma; 0 otherwise.
PATH4 - Infiltrating duct carcinoma - 1 if histology was infiltrating duct carcinoma (scirrhous, adenocarcinoma); 0 otherwise.
PATH5 - Colloid carcinoma - 1 if histology was colloid carcinoma; 0 otherwise.
PATH6 - Medullary carcinoma - 1 if histology was medullary carcinoma; 0 otherwise.
PATH7 - In situ lobular carcinoma - 1 if histology was in situ lobular carcinoma; 0 otherwise.
PATH8 - Infiltrating lobular carcinoma - 1 if histology was infiltrating lobular carcinoma; 0 otherwise.
PATH9 - Inflammatory carcinoma - 1 if histology was inflammatory carcinoma; 0 otherwise.
PATH10 - Other carcinomas - 1 if histology was any other single type of carcinoma; 0 otherwise.
DIFF - Differentiation - 3 ordered categories for differentiation of the carcinoma.
FOCI - Foci of disease - 1 if disease is unicentric; 0 if disease is multicentric.
CELL - Size of cells - 1 if cancer cells are small cell; 0 if the cells are large cell.
INFIL - Infiltration - 3 ordered categories for the amount of lymphocytic infiltration of the primary tumor.

These 80 variables were the input data for exploratory work with the discriminant analysis to reduce further the dimension of the problem. The results of the analysis appear in Chapter 5.

4.4  Missing Data

The problem of missing data was acute in this study. History taking by the physician who first examined the patient at BCCI was uneven in quality. Some histories were quite complete, while others had only a few main points covered. Patients also varied in how much they could remember or wished to tell. Some older patients (in their sixties and seventies) could remember little of what had happened to them.
Any incident that could be verified by hospital or physician records was checked by the medical records department at BCCI. Because of the length of time since diagnosis (up to 22 years) and the fact that 68 percent of the patients were dead at the time of the study, no further followup was attempted for missing data.

When information about a variable was missing, a numerical missing-data code was used. After the data were collected, an examination of the variables with many missing data entries was undertaken. A variable that had not been mentioned in the chart had been coded as missing. In some cases the variable was not mentioned because the patient did not have that attribute. For example, if smoking was not mentioned in the history, it was much more likely that the patient did not smoke than that the patient smoked and the fact had not been reported. Six variables³ were found to be of that type and so the missing-data code was changed to the code for absence of the attribute. Other variables that had missing data values were not changed. A patient with missing data for a particular variable was kept in the analysis when that variable was not included and was dropped from the analysis when the variable was included. More will be said about inclusion and exclusion of cases in Chapter 5.

³ The six variables that were changed in this manner were SMOKE, DYSMEN, OTHDRG, HORMON, MAZODY, and MASTIT.

4.5  Sources of Error

Four main sources of error existed for the data. As stated in the previous section, the patients and doctors were the two primary sources of error. The errors in both cases are biased toward omitting data. A patient was unlikely to claim a disease or operation that had not occurred and most of the claims were confirmed by followup or removed from the history. The doctors had no reason to claim things that did not occur.
However, it was quite easy for the patient to forget an incident occurring many years prior to diagnosis and for the physician to neglect to ask specifically for all possibilities.

A third source of error was in the data coding. Since all the data were collected by the same individual, any systematic errors in interpretation should be consistent for all patients. If an error in interpretation of the medical facts occurred, it was the same in each instance. It is assumed that the other data coding errors were random.

The final main source of error in data collection was the keypunching of the data. To minimize such errors, all data were proofread after keypunching. A program was written that printed the numbers from the cards in the same format as the data coding form. The computer output was then compared to the original data coding forms to find errors. All errors found in the keypunching were corrected before the analysis was begun. Thus, there were few remaining errors because of the keypunching.

Chapter 5

DATA ANALYSIS

5.1  Computer Programs

For the theoretical and practical reasons discussed in Chapter 3, it was decided that stepwise discriminant analysis and logistic regression would be used for the classification of patients by nodal status. The linear discrimination was done using the Biomed Stepwise Discriminant Analysis program (BMD07M). It performed a two-group discriminant analysis wherein the variables were selected so as to maximally reduce the remaining variance. The classification functions which included those selected variables were then used to classify the cases.

Logistic regression was accomplished by the use of a log-linear model program developed by M. Nerlove and S.J. Press. A listing of that program appears in their report [35, p. 101 ff]. The coefficients of the logistic probability function were estimated by the Newton-Raphson algorithm, where the starting values were found by ordinary least squares.
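The estimation scheme just described (Newton-Raphson iterations started from ordinary least squares values) can be illustrated with a minimal sketch. This is not the Nerlove and Press program: it handles only one predictor plus an intercept, the function names `sigmoid` and `fit_logistic` are hypothetical, the toy data are made up, and using the least squares fit of the 0/1 outcome directly as the starting point is an assumption about what "starting values found by ordinary least squares" means.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, tol=1e-10, max_iter=50):
    """Fit P(y=1 | x) = sigmoid(a + b*x) by Newton-Raphson.

    Starting values are the ordinary least squares coefficients from
    regressing the 0/1 outcome y on x (an assumed reading of the text).
    Returns (a, b, number_of_iterations).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x

    for iteration in range(1, max_iter + 1):
        p = [sigmoid(a + b * xi) for xi in x]
        # Gradient of the log-likelihood with respect to (a, b).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Negative Hessian (information matrix), a symmetric 2x2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve (negative Hessian) * step = gradient.
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        a += step_a
        b += step_b
        if max(abs(step_a), abs(step_b)) < tol:
            return a, b, iteration
    raise RuntimeError("Newton-Raphson did not converge")

# Made-up toy data: one binary predictor, one binary outcome.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]
a_hat, b_hat, n_iter = fit_logistic(x, y)
```

With a single binary predictor the model is saturated, so the fitted probabilities `sigmoid(a_hat)` and `sigmoid(a_hat + b_hat)` reproduce the observed proportions 1/4 and 3/4 in the toy data. A case would then be assigned to group 1 when its fitted probability is at least one half.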
The program produced estimates of the logistic probability coefficients. A user-written program was then used to calculate the posterior probability of being in group 1 and thus the classification for each case.

5.2  Selecting Cases

In order to avoid using patients with missing data for the included variables, a procedure was devised for selecting cases. Instead of eliminating cases with missing values, group means could have been substituted for missing values. It was felt that that would not be appropriate here because of the nature of the data. The large number of binary variables caused problems for this approach. The mean of a 0-1 variable calculated for many patients would be between 0 and 1. A value between 0 and 1 for the binary variables was deemed unacceptable.

All cases with complete information for all variables were chosen for a preliminary analysis. There were 173 such cases. Discriminant analysis was run on these 173 cases to select the variables that were of some significance. Significance was defined quite liberally (F probability to enter < .20) so that even variables of marginal significance would be included initially. For reasons presented in Chapter 3 the subset of variables chosen in this manner should have included all the variables for which the coefficients were significantly different from zero. It will also include some for which the coefficient should have been estimated as zero. The variables which had coefficients that were not significant were eliminated in the final analysis by the logistic regression. The results of the discriminant analysis are presented in Tables XI and XII. Fourteen variables were chosen by this process. One variable was entered in the third step and then removed in the twelfth step. It was decided to include that variable in further work in case it proved to be significant later.
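The case-selection idea in this section can be sketched in a few lines. The sketch below is illustrative only: the records are invented, `MISSING`, `complete_cases`, and `group_mean` are hypothetical names, and `None` stands in for the numerical missing-data code used in the study. It shows that restricting attention to a smaller set of variables admits more cases, and that mean substitution would place a fractional value in a 0-1 variable.

```python
MISSING = None  # hypothetical marker; the study used a numerical missing-data code

def complete_cases(records, variables):
    """Return the records that have an observed value for every listed variable."""
    return [r for r in records
            if all(r.get(v) is not MISSING for v in variables)]

def group_mean(records, variable):
    """Mean of the observed values of one variable, ignoring missing entries."""
    observed = [r[variable] for r in records if r.get(variable) is not MISSING]
    return sum(observed) / len(observed)

# Invented records using three of the variable names from Chapter 4.
patients = [
    {"SMOKE": 1, "HEART": 0, "BIRTHS": 2},
    {"SMOKE": 0, "HEART": MISSING, "BIRTHS": 3},
    {"SMOKE": 0, "HEART": 1, "BIRTHS": MISSING},
    {"SMOKE": 1, "HEART": 0, "BIRTHS": 0},
]

# Preliminary analysis: cases complete on every variable.
full_info = complete_cases(patients, ["SMOKE", "HEART", "BIRTHS"])
# Later analysis: cases complete on the selected subset only.
reduced = complete_cases(patients, ["SMOKE", "HEART"])
# Value that mean substitution would insert for a missing binary HEART entry.
heart_mean = group_mean(patients, "HEART")
```

Here `reduced` keeps one more patient than `full_info`, and `heart_mean` is a fraction strictly between 0 and 1, which is the kind of value the text rejects for a binary variable.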
Table XI
Variables Chosen by Linear Discrimination for 173 Cases

      Variable     F Probability to Enter
  1.  THYROD       .0038
  2.  BRSIS        .0121
  3.  DURTON       .0114
  4.  NODEPL       .0199
  5.  HEART        .0640
  6.  LOCI         .1020
  7.  SYMPT2       .0586
  8.  SYMPT3       .0265
  9.  KIDNEY       .0743
 10.  BIRTHS       .1435
 11.  HORMON       .0820
 12.  MASTIT       .0970
 13.  BRMOM        .1087
 14.  OTHILL       .1746

Variable entered and then removed:
  1.  MONNUR

Table XII
Classification of 173 Cases by Linear Discrimination

                     Classified to group
                       0       1     Total
Actual group   0      90      17      107
               1      33      33       66
Total                123      50      173

CPU time                 11.292 sec.
Correct classification   71.10%

Logistic regression was then done with the fifteen variables and 173 cases to compare to the discriminant analysis. The results are shown in Tables XIII and XIV. Nine of the fifteen variables were chosen as significant (P < .05). The CPU time was twice as long as for linear discrimination. However, the classification by logistic regression was better than the classification by discriminant analysis: 77 percent correct for logistic regression and 71 percent correct for discriminant analysis. Both procedures did poorly in classifying those known to have positive nodes. Discriminant analysis classified only 50 percent of such cases correctly, while logistic regression classified 59 percent of them correctly. For those known to have negative nodes the correct classification rates were 84 percent and 89 percent, respectively.

The fifteen variables chosen by linear discrimination (Table XI) were then used for a new case-selection procedure. Patients who had complete information for those variables were chosen for further analysis. A total of 503 patients had complete information on these fifteen variables. This group of patients was the sample for the final classification procedure.

5.3  Classification of 503 Cases

Discriminant analysis was done for the 503 selected cases and fifteen variables.
Again the level of significance was defined liberally: F probability to enter less than or equal to .20. The results appear in Tables XV and XVI. Only four variables were significant enough to enter the discriminant function. The discriminating power of the function was even worse for those with positive nodes; only 12 percent were correctly classified into group 1. For negative nodes there was 96 percent correct classification.

Table XIII
Variables Chosen by Logistic Regression for 173 Cases

      Variable     Asymptotic Significance
  1.  DURTON       .03017
  2.  NODEPL       .03290
  3.  HEART        .01377
  4.  SYMPT3       .00338
  5.  KIDNEY       .04828
  6.  BIRTHS       .01239
  7.  HORMON       .08840
  8.  MASTIT       .01727
  9.  BRMOM        .12352

Table XIV
Classification of 173 Cases by Logistic Regression

                     Classified to group
                       0       1     Total
Actual group   0      95      12      107
               1      27      39       66
Total                122      51      173

CPU time                 25.652 sec.
Correct classification   77.46%
Iterations               39

Table XV
Variables Chosen by Linear Discrimination for 503 Cases

      Variable     F Probability to Enter
  1.  NODEPL       .0001
  2.  SYMPT3       .0007
  3.  THYROD       .0171
  4.  HEART        .0432

Table XVI
Classification of 503 Cases by Linear Discrimination

                     Classified to group
                       0       1     Total
Actual group   0     321      12      333
               1     149      21      170
Total                470      33      503

CPU time                 4.714 sec.
Correct classification   67.99%

Table XVII
Variables Chosen by Logistic Regression for 503 Cases

      Variable     Asymptotic Significance
  1.  NODEPL       .00016
  2.  SYMPT3       .00086
  3.  THYROD       .02031
  4.  HEART        .04636

Table XVIII
Classification of 503 Cases by Logistic Regression with 15 Variables

                     Classified to group
                       0       1     Total
Actual group   0     318      15      333
               1     147      23      170
Total                465      38      503

CPU time                 18.251 sec.
Correct classification   67.78%
Iterations               10
Log of the likelihood    -301.774846

Logistic regression was run for the same fifteen variables and 503 cases. The results are shown in Tables XVII and XVIII.
A comparison of Tables XV and XVII shows that the same four variables were significant and in the same order of significance. Again classification was much worse for positive nodes than negative nodes: 95 percent correct for negative nodes and 14 percent correct for positive nodes.

Since only four of the variables were significant, it was decided to rerun the logistic regression with only those four variables included. The previous run had forced all fifteen variables into the classification function regardless of significance. The results are given in Table XIX. All variables remained highly significant (P < .05). Using the log-likelihood test to determine whether the ten $\beta_i$ dropped from the second model should be zero, it was found from Tables XVIII and XIX that $U = -2\Lambda = -2(-303.560423 + 301.774846) = 3.571154$.¹ The .05 level of significance chi-square value for ten degrees of freedom is 18.307. Thus, we conclude that the reduced model is a good fit when compared to the "full" model with fourteen variables. Correct classification for positive nodes was 12 percent and correct classification for negative nodes was 97 percent. A comparison of Tables XVI and XIX shows that logistic regression was marginally better than discriminant analysis at classification and took twice as long.

Table XIX
Classification of 503 Cases by Logistic Regression with 4 Variables

                     Classified to group
                       0       1     Total
Actual group   0     324       9      333
               1     150      20      170
Total                474      29      503

CPU time                 9.810 sec.
Correct classification   68.39%
Iterations               8
Log of the likelihood    -303.560423

For both discriminant analysis and logistic regression there was a drop in discriminating power with the increase in number of cases. Linear discrimination went from 71.10 percent to 67.99 percent while logistic regression went from 77.46 percent to 68.39 percent. Part of the drop can be explained by the increase in the number of cases.²
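The log-likelihood test applied above to the four-variable and fifteen-variable logistic models reduces to a short calculation, sketched here with the log-likelihoods reported in Tables XVIII and XIX. The function names are hypothetical, and the critical value 18.307 is simply the tabulated figure quoted in the text rather than something the code computes.

```python
def log_likelihood_statistic(loglik_reduced, loglik_full):
    """U = -2 * (log-likelihood of reduced model - log-likelihood of full model)."""
    return -2.0 * (loglik_reduced - loglik_full)

def reduced_model_acceptable(statistic, critical_value):
    """The reduced model is retained when U falls below the chi-square critical value."""
    return statistic < critical_value

# Log-likelihoods from Table XIX (4 variables) and Table XVIII (15 variables).
u = log_likelihood_statistic(-303.560423, -301.774846)

# Tabulated .05-level chi-square value for ten degrees of freedom, as quoted in the text.
acceptable = reduced_model_acceptable(u, 18.307)
```

The same two functions cover the later comparisons for the postmenopausal and premenopausal subsets once the corresponding log-likelihoods and critical values are substituted.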
At least for logistic regression there appears to be some other factor influencing the classification. A likely contributing factor is less reliable data for the 330 added patients. The additional cases had some missing data values for variables not considered in the function and thus their information may be less reliable even when a variable was observed. That is, a patient who admits to not knowing certain information may be unreliable for other answers that were given. Also a doctor who looks for general answers may not ask for elaboration of answers to ensure complete recording of data.

¹ $\Lambda = \log_e(\text{likelihood})$.

² When doing a linear regression, one calculates an adjusted $R^2$ to account for differences in numbers of observations. In that case $R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{N-1}{N-p}$, where $N$ is the number of cases and $p$ is the number of terms in the model [33]. For the numbers we were concerned with here the factor $\frac{N-1}{N-p}$ would be 1.0813 for 173 cases and 1.0060 for 503 cases. Thus, only small differences would be attributed to the increase in number of cases.

5.4  Subsets of Patients for Further Classification

There is much medical evidence that breast cancer runs a much different course in premenopausal and postmenopausal women. "The age-specific incidence and mortality curves of breast cancer have two components. The premenopausal component is steeper... . The postmenopausal slope is less steep than the premenopausal. . . " [9, p. 721]. It has been hypothesized that the premenopausal carcinomas are hormone dependent, while the postmenopausal tumors are not. Evidence for the hypothesis comes from treatment results. Oophorectomy (removal of the ovaries and consequent cessation of estrogen production) is a successful treatment adjunct for premenopausal women. It seems to make no difference in postmenopausal women since the ovaries have already ceased production of estrogen.

In the light of the medical evidence, it was surprising that neither age nor menopausal status appeared as a significant variable. Apparently, this factor was confounded by the other factors in the model. Consequently, it was decided to investigate the menopausal status further. A stratification of patients by menopausal status was a possible avenue of investigation. The first step was to see if there were two distinct components of age incidence. A histogram of incidence versus age (Figure 5.1) showed the possibility of two components. Histograms were produced for the two subsets: premenopausal patients (Figure 5.2) and postmenopausal patients (Figure 5.3). Examination of Figure 5.2 and Figure 5.3 showed clear differences. The premenopausal patients had a graph with a very steep gradient while the postmenopausal patients had a graph that was more nearly flat for a twenty year period. The difference between the slopes may be because of different "censors" in the two groups: the premenopausal patients are censored by menopause while the postmenopausal patients are censored by death. However, other investigators [36,28] had found significant differences in the course of the disease between premenopausal and postmenopausal patients. Thus, it was decided to divide the patients on the basis of menopausal status and do the analyses on each subset of patients separately.

The 173 patients with full information were separated for preliminary work on the basis of menopausal status. Sixty of them were postmenopausal patients and 113 were premenopausal patients.
Discriminant analysis was then done for the two subsets separately to reduce the dimension of the problem. Tables XX and XXI show the variables chosen for postmenopausal and premenopausal patients, respectively. Only five of the variables were chosen for both groups: DURTON, KIDNEY, BIRTHS, MASTIT, and HEART. Tables XXII and XXIII present the results of classification for the groups separately. The percentage classified correctly was greater for each group than when the classification was for the combined group (88 percent and 80 percent versus 71 percent for combined). Consequently, each group was investigated with a different set of variables.

[Figure 5.1. Histogram of age-incidence for all 535 cases. Horizontal axis: age at last birthday.]

[Figure 5.2. Histogram of age-incidence for premenopausal patients. Horizontal axis: age at last birthday.]

[Figure 5.3. Histogram of age-incidence for postmenopausal patients. Horizontal axis: age at last birthday.]

Table XX
Variables Chosen by Discriminant Analysis for 60 Postmenopausal Patients

      Variable     F Probability to Enter
  1.  HEART        .0155
  2.  NODEPL       .0054
  3.  BIRTHS       .0373
  4.  SKIN         .0529
  5.  SMOKE        .0758
  6.  TRAUMA       .0283
  7.  OTHSUR       .0824
  8.  BRFED        .0729
  9.  DURTON       .0287
 10.  BRSIS        .1019
 11.  LYMPH        .0471
 12.  OOPHOR       .0646
 13.  KIDNEY       .0745
 14.  AGE          .0738
 15.  BENIGN       .0481
 16.  SIZE         .0936
 17.  MASTIT       .1441
 18.  RACE         .0582
 19.  CELL         .1760

Variable entered and then removed:
  1.  CANCER

Table XXI
Variables Chosen by Discriminant Analysis for 113 Premenopausal Patients

      Variable     F Probability to Enter
  1.  SYMPT3       .0013
  2.  THYROD       .0088
  3.  BREAST       .0398
  4.  DURTON       .0341
  5.  PATH6        .0538
  6.  PELVIC       .0539
  7.  KIDNEY       .0538
  8.  HORMON       .0326
  9.  BIRTHS       .0395
 10.  BROTH        .0713
 11.  BRMOM        .0647
 12.  MASTIT       .0125
 13.  PATH2        .0860
 14.  HEART        .0808
 15.  REG          .1089
 16.  ALLERG       .0744

Table XXII
Classification of 60 Postmenopausal Patients by Linear Discrimination with 19 Variables

                     Classified to group
                       0       1     Total
Actual group   0      32       5       37
               1       2      21       23
Total                 34      26       60

CPU time                 7.14 sec.
Correct classification   88.33%

Table XXIII
Classification of 113 Premenopausal Patients by Linear Discrimination with 16 Variables

                     Classified to group
                       0       1     Total
Actual group   0      61       9       70
               1      14      29       43
Total                 75      38      113

CPU time                 6.00 sec.
Correct classification   79.65%

5.5  Classification of Postmenopausal Patients

Nineteen variables were selected for the postmenopausal patients. Since the variables were chosen in the order of their explanatory power, it was decided to use only the first sixteen variables for the logistic regression. The discrimination programs could have been run twice on subsets of the nineteen variables to pick the best sixteen, but previous work had indicated that that was not necessary. All postmenopausal patients were screened for full information on the nineteen variables, yielding a total of 128 cases.

Discriminant analysis was run on the 128 patients with nineteen variables. Four variables were chosen for the analysis and 72.66 percent were correctly classified (Tables XXIV and XXV). Again negative nodes were classified better (80 percent correct) than positive nodes were (60 percent correct). Then logistic regression was performed with sixteen variables. The results appear in Tables XXVI and XXVII. Only three variables were found to be significant. Logistic regression was run again with the three significant variables (Table XXVIII). A comparison of Tables XXV and XXVIII shows that logistic regression was poorer at classifying than discriminant analysis was.
Since different variables had been used, it was decided to rerun logistic regression with the four variables of Table XXIV. The results of that run are in Table XXIX. A comparison of Tables XXV and XXIX shows that both methods classified the patients exactly the same.

To test whether certain $\beta_i$ were zero, the log-likelihood test was used. From Tables XXVII and XXVIII the statistic for testing the reduction of the model to three variables was found to be $U = -2\Lambda = -2(-74.3288522 + 67.1124784) = 14.432748$. The corresponding chi-square for twelve degrees of freedom and .05 significance level is 21.026. Thus, I concluded that the twelve $\beta_i$ that were estimated as zero were zero. Using the same test and Tables XXVII and XXIX, the statistic is $U = -2(-70.2542850 + 67.1124784) = 6.283613$. Since $\chi^2_{.05}(11) = 19.675$, it was concluded that the reduced model with four variables was also acceptable.

Table XXIV
Subset of Variables Chosen by Discriminant Analysis for 128 Postmenopausal Patients

      Variable     F Probability to Enter
  1.  NODEPL       .0001
  2.  AGE          .0638
  3.  BRSIS        .0682
  4.  SMOKE        .1934

Table XXV
Classification of 128 Postmenopausal Patients by Discriminant Analysis with 4 Variables

                     Classified to group
                       0       1     Total
Actual group   0      65      16       81
               1      19      28       47
Total                 84      44      128

CPU time                 2.5 sec.
Correct classification   72.66%

Table XXVI
Subset of Variables Chosen by Logistic Regression for 128 Postmenopausal Patients

      Variable     Asymptotic Significance
  1.  AGE          .05321
  2.  NODEPL       .00011
  3.  TRAUMA       .12377

Table XXVII
Classification of 128 Postmenopausal Patients by Logistic Regression with 16 Variables

                     Classified to group
                       0       1     Total
Actual group   0      68      13       81
               1      19      28       47
Total                 87      41      128

CPU time                 32 sec.
Correct classification   75.00%
Iterations               37
Log of the likelihood    -67.1124784

Table XXVIII
Classification of 128 Postmenopausal Patients by Logistic Regression with 3 Variables

                     Classified to group
                       0       1     Total
Actual group   0      64      17       81
               1      21      26       47
Total                 85      43      128

CPU time                 3.077 sec.
Correct classification   70.31%
Iterations               7
Log of the likelihood    -74.3288522

Table XXIX
Classification of 128 Postmenopausal Patients by Logistic Regression with 4 Variables

                     Classified to group
                       0       1     Total
Actual group   0      65      16       81
               1      19      28       47
Total                 84      44      128

CPU time                 10.531 sec.
Correct classification   72.66%
Iterations               35
Log of the likelihood    -70.2542850

5.6  Classification of Premenopausal Patients

A similar procedure was used for the premenopausal patients. The sixteen variables of Table XXI chosen by discriminant analysis were used to select full information cases. A total of 253 premenopausal patients had full information on those variables. When discriminant analysis was run with the 253 selected cases, ten variables were found to be significant. The results of the discrimination are shown in Tables XXX and XXXI. Table XXX shows the ten variables chosen as significant. From Table XXXI we see that again patients with negative nodes were classified better (95 percent correct) than patients with positive nodes (27 percent correct).

Logistic regression was then run with the 253 full information premenopausal patients and the same sixteen variables from Table XXI. The results of the logistic regression appear in Tables XXXII and XXXIII. Ten variables were significant in the logistic regression also. A comparison of Tables XXX and XXXII shows that nine of the variables are the same for both cases. The other variable in the logistic regression, DURTON, was the least significant of the ten variables. As with linear discrimination, classification of patients with positive nodes was poorer than classification of patients with negative nodes (33 percent versus 93 percent).

Table XXX
Subset of Variables Chosen by Discriminant Analysis for 253 Premenopausal Patients

      Variable     F Probability to Enter
  1.  THYROD       .0013
  2.  SYMPT3       .0033
  3.  MASTIT       .0165
  4.  PELVIC       .0521
  5.  BRMOM        .0448
  6.  PATH2        .1186
  7.  KIDNEY       .1201
  8.  BIRTHS       .1012
  9.  HEART        .1132
 10.  BROTH        .1979

Table XXXI
Classification of 253 Premenopausal Patients by Discriminant Analysis with 10 Variables

                     Classified to group
                       0       1     Total
Actual group   0     159       8      167
               1      63      23       86
Total                222      31      253

CPU time                 2.5 sec.
Correct classification   71.94%

Table XXXII
Subset of Variables Chosen by Logistic Regression for 253 Premenopausal Patients

      Variable     Asymptotic Significance
  1.  BRMOM        .02721
  2.  BROTH        .13538
  3.  BIRTHS       .08456
  4.  HEART        .08576
  5.  KIDNEY       .11729
  6.  THYROD       .00284
  7.  PELVIC       .01061
  8.  MASTIT       .00059
  9.  DURTON       .19050
 10.  SYMPT3       .00726

Table XXXIII
Classification of 253 Premenopausal Patients by Logistic Regression with 16 Variables

                     Classified to group
                       0       1     Total
Actual group   0     156      11      167
               1      58      28       86
Total                214      39      253

CPU time                 35.603 sec.
Correct classification   72.73%
Iterations               40
Log of the likelihood    -137.340218

In judging how well our models have worked for classification there are two types of tests. Goodness of fit tests are used to test whether certain $\beta_i$ are zero. They test how well the model has been estimated. How well the estimated model classified is the second measure of the classification scheme. Although one gets a better fit of the model with more variables, the classification may be better with fewer variables in the model.

Since the previous logistic classification was with sixteen variables and only ten were significant, it was decided to rerun the logistic regression with only the ten significant variables. It was expected that that would classify better and the results presented in Table XXXIV confirmed that expectation.
Logistic regression with ten variables classified better than logistic regression with sixteen variables or discriminant analysis with ten variables (74 percent versus 73 percent and 72 percent). To test whether the ten variable logistic model was a good enough fit compared to the sixteen variable logistic model, the log-likelihood test was used again. It was found from Tables XXXIII and XXXIV that $U = -2\Lambda = -2(-140.582389 + 137.340218) = 6.48434$. The .05 significance level chi-square value for six degrees of freedom is 12.6. Thus, we can conclude that the reduced model provides a good fit.

Table XXXIV
Classification of 253 Premenopausal Patients by Logistic Regression with 10 Variables

                     Classified to group
                       0       1     Total
Actual group   0     162       5      167
               1      61      25       86
Total                223      30      253

CPU time                 6.908 sec.
Correct classification   73.91%
Iterations               9
Log of the likelihood    -140.582389

5.7  Summary of Results

A summary of the classification results was prepared and appears in Table XXXV.

Table XXXV
Summary of Classification Results

Model  Variables  # Cases  % Correct        % Correct        % Correct
                           Positive Nodes   Negative Nodes   Overall

Combined premenopausal and postmenopausal:
DA        14       173        50.00            84.11           71.10
LR         9       173        59.09            88.79           77.46
DA         4       503        12.35            96.40           67.99
LR        15       503        13.53            95.50           67.78
LR         4       503        11.76            97.30           68.39

Postmenopausal:
DA        19        60        91.30            86.49           88.33
DA         4       128        59.57            80.25           72.66
LR        16       128        59.57            83.95           75.00
LR         3       128        55.32            79.01           70.31
LR         4       128        59.57            80.25           72.66

Premenopausal:
DA        16       113        67.44            87.14           79.65
DA        10       253        26.74            95.21           71.94
LR        16       253        32.56            93.41           72.73
LR        10       253        29.07            97.01           73.91

DA = Discriminant Analysis
LR = Logistic Regression

The data analysis has confirmed the theoretical results of Chapter 3. Linear discrimination was an efficient method for the preliminary analyses.
For the final analyses logistic regression provided more correct classifications with a significant time increase (in one case the classification was the same). However, classification in all cases was less than hoped for. Examination of the cases misclassified showed that many of such cases were near the boundary. That suggested the use of a two-stage procedure as described in Chapter 3. Patients with high or low posterior probabilities could be classified on the basis of this data. The patients with probabilities near .5 would need further data before classification could be done.

Tables XXXVI and XXXVII were prepared to demonstrate some possible two-stage procedures. Table XXXVI used the logistic classification for 253 premenopausal patients with ten variables. Table XXXVII used the logistic classification for 128 postmenopausal patients with four variables. For each decile of posterior probability the numbers of patients correctly and incorrectly classified were tabulated. If an error rate of ten percent is acceptable, then the boundaries could be set at .4 and .7 for Table XXXVI. That is, all patients with posterior probability less than .4 would be classified as having negative nodes. All patients with posterior probability greater than .7 would be classified as having positive nodes. The patients in the middle would need further investigation before classification. In this case, that would be 109 patients or 43 percent. A similar analysis of Table XXXVII produced boundary points of .3 and .7 for a nine percent error rate and 73 patients (57 percent) to have further investigation.
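The two-stage rule described above can be written down directly. The following is a minimal sketch, assuming a list of posterior probabilities is already available; the function names and the example probabilities are invented for illustration and are not the study data. Only the cut-offs .4 and .7 come from the text.

```python
def two_stage_classify(posterior, lower, upper):
    """Classify one patient from the posterior probability of positive nodes.

    Below `lower` the patient is called node-negative, above `upper`
    node-positive; anything in between is deferred for further investigation.
    """
    if posterior < lower:
        return "negative"
    if posterior > upper:
        return "positive"
    return "further investigation"

def summarize(posteriors, lower, upper):
    """Count how many patients fall into each of the three groups."""
    counts = {"negative": 0, "positive": 0, "further investigation": 0}
    for p in posteriors:
        counts[two_stage_classify(p, lower, upper)] += 1
    return counts

# Invented posterior probabilities, used only to exercise the rule.
example_posteriors = [0.05, 0.35, 0.40, 0.55, 0.70, 0.85]
counts = summarize(example_posteriors, lower=0.4, upper=0.7)
```

Widening the interval between the two cut-offs lowers the error rate among classified patients at the price of deferring more of them, which is the trade-off the two tables are meant to expose.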
Table XXXVI
Number of Premenopausal Patients Correctly and Incorrectly Classified by Decile of Posterior Probability

Decile        Correct   Incorrect
.00 - .1         14         0
.11 - .2         31         8
.21 - .3         51        14
.31 - .4         42        21
.41 - .5         21        17
.51 - .6          6         2
.61 - .7          3         0
.71 - .8         11         0
.81 - .9          5         2
.91 - 1.0         4         1

Table XXXVII
Number of Postmenopausal Patients Correctly and Incorrectly Classified by Decile of Posterior Probability

Decile        Correct   Incorrect
.00 - .1         10         0
.11 - .2         18         5
.21 - .3         19         4
.31 - .4         10         7
.41 - .5          8         4
.51 - .6         12         9
.61 - .7         14         5
.71 - .8          1         1
.81 - .9          0         1
.91 - 1.0         0         0

Examination of the cases that were misclassified also showed runs of errors. For one group of patients 90 percent of the errors occurred in two consecutive years of the nine years studied and 95 percent of the errors were in three consecutive years of the nine years. That lends credence to the idea that certain history takers were better than others. It is probable that better classification could have been achieved with better data for those years.

The medical implications of the analysis will be discussed in the next chapter.

Chapter 6

CONCLUSIONS

For the medical classification problem of distinguishing between those breast cancer patients with supraclavicular or internal mammary lymph node metastases and those without such metastases, two models were selected: linear discrimination and logistic regression. The empirical results verified the theoretical findings that linear discrimination was faster than logistic regression but logistic regression provided a greater proportion of correct classifications.

When the classification was done on 503 cases with full information, approximately 68 percent correct classification was achieved. While this is better than the clinical staging, it was not as good as had been hoped for.
Dividing the patients by menopausal status and classifying the groups separately provided better classification because of the differing disease processes in the two groups. For 253 premenopausal patients the proportion of correct classifications was 74 percent. For 128 postmenopausal patients the correct proportion was 75 percent. Thus, separating the groups on the basis of menopausal status provided classification that was better than when the groups were combined.

A two-stage procedure was proposed to make the error rate smaller. Patients were classified on the basis of the data into three groups: those with negative nodes, those with positive nodes, and those for whom more data had to be collected before a final classification was made. With an error rate of 10 percent or less, 43 percent of the premenopausal and 57 percent of the postmenopausal patients required further observation.

It was concluded that for a medical diagnosis problem such as this with data that are clearly non-normal and often not even continuous, the use of linear discriminant analysis was adequate for the exploratory work and to reduce the dimension of the problem. In the preliminary analyses of the subgroups of premenopausal and postmenopausal patients with small numbers of patients and complete information, discriminant analysis provided more correct classifications for positive nodes. Logistic regression was preferred for the final analyses since it provided better estimators and consequently more classifications that were correct when there were more patients and less complete information about the patients.

The medical conclusions are not so definitive. The classification procedures used here suggest areas where further investigation would be useful. For all patients combined the variables that entered the final analysis were NODEPL, SYMPT3, THYROD, and HEART.
As expected, palpable lymph nodes were positively correlated with pathologically involved lymph nodes. The presence of changes in the nipple or discharge from the nipple as the first symptom was positively correlated with positive nodes. A history of thyroid disease was also positively correlated with positive nodes. A history of heart disease was negatively correlated with positive nodes. Thus, the physician might consider these factors when trying to evaluate the nodes clinically.

When the premenopausal and postmenopausal patients were considered separately, some other factors were suggested. For the premenopausal patients the variables that entered the final analyses were THYROD, SYMPT3, MASTIT, PELVIC, BRMOM, PATH2, KIDNEY, BIRTHS, HEART, and BROTH. Again a history of thyroid disease was positively correlated with positive nodes and nipple change or discharge as the first symptom was positively correlated with positive nodes. A history of benign breast disease during lactation was positively correlated with involved nodes. Previous pelvic surgery was negatively correlated with involved nodes. Breast cancer in the patient's mother and other relatives were both negatively correlated with involved nodes. This was probably a result of the patient's increased awareness of the disease and consequent earlier diagnosis. Noninfiltrating papillary carcinoma was highly negatively correlated with positive nodes. That is, patients with that pathological type of carcinoma rarely had nodal metastases. A history of kidney disease was negatively correlated with positive nodes. The number of full term pregnancies was slightly negatively correlated with positive nodes. Heart disease was again negatively correlated with involved nodes.

For the postmenopausal patients the variables that entered the final analyses were NODEPL, AGE, BRSIS, and SMOKE. Again as expected nodes palpable was positively correlated with positive nodes.
very slightly negatively correlated with positive nodes.  Age was  Breast cancer  in the patient's sister was negatively correlated with positive nodes. Smoking was slightly negatively correlated with positive nodes.  The  115  disease in the postmenopausal patients was less variable than the disease in the premenopausal patients.  The lesser degree of variability resulted  in more accurate classifications for the postmenopausal patients. The results presented above could be used to help the physician in his clinical diagnosis of the nodal status. in which further research could be done.  They also suggest areas  BIBLIOGRAPHY  [1]  Anderson, J.A., "Separate Sample Logistic Discrimination," Biometrika, 59 (1972), 19-35.  [2]  Anderson, T.W., An Introduction to Multivariate Statistical Analysis , . John Wiley & Sons, Inc., New York, 1958..  [3]  Bishop, Y.M.M., S.E. Fienberg, and P.W. Holland, Discrete variate Analysis: Theory and Practice, MIT Press, Cambridge, Mass., 1975.  [4]  Bock, R.D., "Estimating Multinomial Response Relations," Essays in Probability and Statistics, ed. by R.C. Bose, et al. University of North Carolina Press, Chapel H i l l , North Carolina, 1970.  Multi-  3  [5]  Brinkley, D. and J.L. Haybittle, "A 15-year Follow-up Study of Patients Treated for Carcinoma of the Breast," British Journal of Radiology, 41 (1968), 215-221.  [6]  Brinkley, D. and J.L. Haybittle, "The Curability of Breast Cancer," The Lancet, July 19, 1975, 95-97.  [7]  Cancer Research Campaign, "Management of Early Cancer of the Breast," British Medical Journal, 1 (1976),• 1035-1038.  [8]  Cochran, W.G. and C.E. Hopkins, "Some Classification Methods with Multivariate Qualitative Data," Biometrics, 17 (1961), 10-32.  [9]  Correa, P., "The Epidemiology of Cancer of the Breast," American Journal of Clinical Pathology, 64 (1975), 720-727.  [10]  Cox, D.R., Analysis of Binary Bata, Methuen and Co., Ltd., London, 1970. 
116  117  [11]  Crawford, G., "Breast Cancer: Selective Biopsy—Rationale and Results," Cancer Control Agency Seminar, 1976.  [12]  Efron, B., "The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis," Journal of the American Statistical Association, 70 (1975), 892-898.  [13]  Federer, W.T., "Procedures and Designs for Screening Material in Selection and Allocation with a Bibliography," Biometrics, 19 (1963), 553-587.  [14]  Finney, D.J., R. Latscha, B.M. Bennett, and P. Hsu, Tables for Testing Significance in a 2x2 Contingency Table, Cambridge University Press, New York, 1963.  [15]  Fisher, E.R., R.M. Gregorio, and B. Fisher, "The Pathology of Invasive Breast Cancer — A Syllabus Derived from Findings of the National Surgical Adjuvant Breast Project (Protocol Number 4)'," Cancer, 36 (1975), 1-85.  [16]  Fisher, E.R., R.M. Gregorio, C. Redmond, W.S. Kim, and B. Fisher, "The Significance of Extranodal Extension of Axillary Metastases," American Journal of Clinical Pathology, 65 (1976), 439-444.  [17]  F r i e l , J.P. (ed.), Borland's Illustrated Medical Dictionary, Twenty-fifth Edition, W.B. Saunders Co., Toronto, 1974.  [18]  Gardner, M.J. and D.J.P. Barker, "Diagnosis of Hypothyroidism: A Comparison of Statistical Techniques," British Medical Journal, 2 (1975), 260-262.  [19]  Goldstein, M. and M. Rabinowitz, "Selection of Variates for the Two-Group Multinomial Classification Problem," Journal of the American Statistical Association, 70 (1975), 776-781.  [20]  Gordon, T., "Hazards in the Use of the Logistic Function with Special Reference to Data from Prospective Cardiovascular Studies," Journal of Chronic Diseases, 27 (1974), 97-102.  [21]  Guzeman, L.F., B.C. Peters, and H.F. Walker, "On Minimizing the Probability of Misclassification for Linear Feature Selection," The Annals of Statistics, 3 (1975), 661-668.  [22]  Haagensen, D . C , "The Choice of Treatment for Operable Carcinoma of the Breast," Surgery, 76 (1974), 685-714.  
118  [23]  Haagensen, D . C , E. Cooley, et al., "Treatment of Early Mammary Carcinoma: A Cooperative International Study," Annals of Surgery, 170 (1969), 875-890.  [24]  Haberman, S . J . , The Analysis of Frequency Data, University of Chicago Press, Chicago, 1974.  [25]  Halperin, M., W.C. Blackwelder, and J.I. Verter, "Estimation of the Multivariate Logistic Risk Function: A Comparison of the Discriminant Function and Maximum Likelihood Approaches," Journal of Chronic Diseases, 24 (1971), 125-158.  [26]  Hartz, S . C . , and L.A. Rosenberg, "Computation of MLE for the Multiple Logistic Risk Function for Use with Categorical Data," Journal of Chronic Diseases, 28 (1975), 421-430.  [27]  Krzanowski, W.J., "Discrimination and Classification Using Both Binary and Continuous Variables," Journal of the American Statistical Association, 70 (1975), 782-790.  [28]  Lewison, E.F., Breast Cancer and its Diagnosis and Treatment, The Williams and Wi1ki ns Company, Baltimore, 1955.  [29]  McCabe, G.P., "Computations for Variable Selection in Discriminant Analysis," Technometrics, 17 (1975), 103-109.  [30]  McCabe, G.P. and R.J. Pohl, A Computer Program for Variable Selection in Discriminant Analysis, Purdue University Department of Statistics Mimeograph Series, Number 334, Purdue University, Lafayette, Indiana, 1973.  [31]  McDivitt, R.W., F.W. Stewart, and J.W. Berg, Tumors of the Breast, Armed Forces Institute of Pathology, Bethesda, Maryland, 1968.  [32]  Maehle, B.0. and F. Harviet, "Prognostic Typing in Breast Cancer," Journal of Clinical Pathology, 26 (1973), 784-791.  [33]  Montgomery, D.B. and D.G. Morrison, "A Note on Adjusting R ," Journal of finance, 1973, 1009-1013.  [34]  Mueller, C. and W. Jeffries, "Cancer of the Breast: Its Outcome as Measured by the Rate of Dying and Causes of Death," Annals of Surgery, 182 (1975), 334-341.  2  119  [35]  Nerlove, M. and S.J. 
Press, Univariate and Multivariate Log-Linear and Logistic Models, RAND Report R-l306-EDA/NIH, The RAND Corporation, Santa Monica, California, 1973.  [36]  Papaioannou, A.N., The Etiology Verlag, New York, 1974.  [37]  Peters, M.V., "Cutting the 'Gordian Knot in Early Breast Cancer," Annals of the Royal College of Physicians and Surgions of Canada, 8 (1975), 186-192.  [38]  Poser, C M . , et al., "Amino Acid Residues of Serum and CSF Protein in Multiple Sclerosis: Clinical Application of Statistical Discriminant Analysis," Archives of Neurology, 32 (1975), 308-314.  [39]  Prentice, R., "Use of the Logistic Model in Retrospective Studies," unpublished manuscript, University of Washington.  [40]  Press, S . J . , Applied Multivariate Winston, New York, 1972.  [41]  Press, S.J. and S. Wilson, "Choosing Between Logistic Regression and Discriminant Analysis," to be published, manuscript at The University of British Columbia, 1977.  [42]  Ramberg, J.S. and J.D. Broffitt, "Selecting the Best Set of Linear Discriminant Variates," Proceedings of Computer .' Science and Statistics: 8th Symposium on the Interface, 257-261.  [43]  Smith, D . C , R. Prentice, D.J. Thompson, W.L. Herrmann, "Association of Exogenous Estrogen and Endometrial Carcinoma," The New England Journal of Medicine, 293, 1164-1167.  [44]  Snedecor, G.W. and W.G. Cochran, Statistical Methods, Sixth Edition, The Iowa State University Press, Ames, Iowa, 1967.  [45]  T a l l i s , G.M., P. Leppard, and G. Sarfaty, "A General Classification Model with Specific Application to Response to Adrenalectomy in Women with Breast Cancer," Computers and Biomedical Research, 8 (1975), 1-7.  [46]  T a l l i s , G.M. and G. Sarfaty, "On the Distribution of the Time to Reporting Cancers with Application to Breast Cancer in Women," Mathematical Biosciences, 19 (1974), 371-376.  of Human Breast Cancer, Springer-  1  Analysis,  Holt, Rinehart and  120  Truett, J . , J. Cornfield, and W.B. 
Kannel, "A Multivariate Analysis of the Risk of Coronary Heart Disease in Framingham," Journal of Chronic Diseases, 20 (1967), 511-524. VanNess, J.W. and C. Simpson, "On the Effects of Dimension in Discriminant Analysis," Technometrics, 18 (1976), 175-187. Walker, S.H. and D.B. Duncan, "Estimation of.the Probability of ah Event as a Function of Several Independent Variables," Biometrika, 54 (1967), 167-179. Z i e l , H.K. and W.D. Finkle, "Increased Risk of Endometrial Carcinoma Among Users of Conjugated Estrogens," The New England Journal of Medicine, 293 (1975), 1167-1170. Zielezny, M. and O.J. Dunn, "Cost Evaluation of a Two-Stage Classification Procedure," Biometrics, 31 (1975), 37-47.  APPENDIX A  TREATMENT STUDY GROUPS  Number of cases A.  No mastectomy — Standard radiation  B.  Simple mastectomy — Standard radiation  C.  Radical mastectomy — Standard radiation  D.  Radical mastectomy — Radiation which did not include axilla  ^  E.  Radical mastectomy — with preoperative radiation  ^  F.  Extended radical mastectomy — Radiation to supraclavicular only  ,  G.  Simple mastectomy — No radiation  1  H.  Radical mastectomy — No radiation  20  I.  Simple mastectomy — w i t h radiation, did not include chest wall  12  J.  Simple mastectomy—with radiation, did not include chest wall or axilla  1  K.  Simple mastectomy — Radiation and chemotherapy  1  L.  Hormones only  1  148 13 316  TOTAL  121  535  APPENDIX B  MANCHESTER STAGING OF BREAST CANCER  CIinical I  Primary freely movable on contracted pectoral muscle or chest wall. Skin involvement, including ulceration, may be present but must be in direct continuity with the tumor and no extension wide of the tumor itself.  II  As Stage I but there are palpable mobile lymph nodes in the axilla on the same side less than 2.5 cm.  Ill  a)  Either  the skin invaded or fixed over an area wide of the  tumor itself but s t i l l or  b)  limited to the breast,  the tumor fixed to underlying muscle but not to  chest wal 1. 
     Axillary nodes, if present, must be mobile.

IV   The growth has extended beyond the breast area as shown by:
     a)  Axillary nodes not mobile or >2.5 cm.
     b)  Tumor fixed to chest wall.
     c)  Supraclavicular node involvement.
     d)  Involvement of skin wide of breast.
     e)  Opposite breast involved with metastatic disease.
     f)  Distant metastases.
     g)  Inflammatory carcinoma.

Paget's disease of nipple only is Stage I unless nodes present.

Pathological

I    Disease confined to the breast.

II   As in Stage I, plus metastatic disease confined to axillary lymph nodes below the level of the apex.

II?  Level of axillary involvement unknown.

III  Direct local spread from primary to:
     a)  skin wide of tumor.
     b)  underlying fascia or muscle.

IV   a)  Direct extension from breast primary to rib or cartilage of chest wall.
     b)  Extension of disease beyond capsule of an axillary lymph node.
     c)  Involvement of apical or internal mammary lymph node or tissues.
     d)  Involvement of an axillary lymph node at any level which is found pathologically to be 2.5 cm. in size or larger.
     e)  Distant metastases (including supraclavicular lymph nodes).

APPENDIX C

CODING INSTRUCTIONS AND DATA CODING FORM

Coding Instructions:

Occupation
     housewife                              1
     retired                                2
     technical & professional               3
     clerical                               4
     laborer, outside                       5
     laborer, inside                        6
     other                                  7
     unknown                                9

Racial origin
     Caucasian                              1
     Negro                                  2
     Indian                                 3
     Asian                                  4
     Semitic                                5
     Other                                  6
     Unknown                                9

Family history
     Yes                                    1
     No                                     0
     Unknown                                9

Menopausal state
     Premenopausal & up to 5 years after    1
     Postmenopausal > 5 years               0

Age at menopause
     Premenopausal                          88
     Postmenopausal                         Actual years

Illnesses & surgery
     Yes                                    1
     No                                     0
     Unknown                                9

First observation of symptom
     Patient                                1
     Medical professional                   0

Duration of symptom
     1-97 months                            Actual
     98 months                              98
     Unknown                                99

Tumor size
     < 2 cm                                 1
     2 to 5 cm                              2
     > 5 cm                                 3
     No lump palpable                       4
     Size not stated                        9

Position of tumor
     Lower inner                            1
     Lower outer                            2
     Upper inner                            3
     Upper outer                            4
     Lymph node or tail                     5
     Nipple                                 6
     Whole breast                           7
     Other                                  8
     Unknown                                9

Nodes, skin, & trauma
     Yes                                    1
     No                                     0
     Unknown                                9

Breast
     Right                                  1
     Left                                   0
     Bilateral                              3

First symptom
     Thickening                             1
     Lump                                   2
     Pain                                   3
     Discharge from nipple                  4
     Nipple inverted                        5
     Skin changes                           6
     Change in breast size                  7
     Mammography, etc.                      8
     Other                                  9

Overall body size
     Obese                                  1
     Average                                2
     Slender                                3

Breast size/shape
     Pendulous, very large                  1
     Large, full                            2
     Average                                3
     Small                                  4

General physical condition
     Good                                   1
     Fair                                   2
     Poor                                   3

Nodal involvement
     Positive apical or i.m. nodes          1
     Positive lower axillary nodes          2
     No nodal involvement                   3

Histological differentiation
     Well-differentiated                    1
     Moderately differentiated              2
     Poorly- or undifferentiated            3
     Unknown                                9

Foci of disease
     Unicentric                             1
     Multicentric                           0
     Unknown                                9
Cell size
     Small                                  1
     Large                                  0
     Unknown                                9

Cause of death
     Alive                                  0
     Breast cancer                          1
     Intercurrent disease                   2
     Lost to followup                       3

Histology type
     Paget's disease                                                    1
     Noninfiltrating papillary carcinoma                                2
     Infiltrating papillary carcinoma                                   3
     Infiltrating duct carcinoma (scirrhous with productive fibrosis)   4
     Adenocarcinoma                                                     5
     Colloid carcinoma (mucoid)                                         6
     Medullary carcinoma                                                7
     In situ lobular carcinoma                                          8
     Infiltrating lobular carcinoma                                     9
     Inflammatory carcinoma                                             10
     Carcinoma, not otherwise specified                                 11
     Other                                                              12
     Combinations of the above                                          13

Lymphocyte infiltrations in tumor
     None
     Minimal
     Moderate or numerous
     Unknown

Data Coding Form:

CARCINOMA OF THE BREAST - SELECTIVE BIOPSY 1955 - [year illegible]

Identification
     CCABC Number; Card Identity; Anniversary Date; Date of Birth; Age at Anniversary; Clinical stage; Pathological stage; Marital status (S-1, M-2, W-3, D-4); Occupation; Racial Origin; Number of Years in North America; Number of Years in B.C.

FAMILY HISTORY
     Cancer other than breast: Mother, Father, Sister, Brother, Son, Daughter, Mat. Rel., Pat. Rel.
     Breast Cancer: Mother, Sister, Daughter, Other
     Other diseases in family members: Diabetes, Heart Disease, Tuberculosis

PERSONAL HISTORY
     Smoker
     Menstrual history: Menarche; Periods (Regular, Length, Interval between); Dysmenorrhoea
     Drugs: Hormones, Other
     Previous breast ailment: Mastodynia during period, Mastitis in lactation, Benign breast disease not during lactation, Other
     Menopause: Status (pre 1, post 2), Age at menopause
     Pregnancies: Age at first, Live births, Miscarriages, Number nursed, Months nursed
     Patient breastfed
     Serious illnesses: Diabetes, Tuberculosis, Heart disease, Typhoid, Hypertension, Ulcer, Kidney, Anemia, Asthma, Pneumonia, Childhood, Other
     Surgery (not breast): Oophorectomy, Tonsils, Appendix, Hysterectomy, Other pelvic, Cholecystectomy, Thyroidectomy, Adrenalectomy, Other

History of Present Illness
     First symptom; First observation of symptom; Duration of symptom; Tumour size - clinical; Position of tumour; Nodes palpable; Skin involvement; Breast; Trauma to breast

Patient's Present Condition
     Overall body size; Breast size/shape; General condition; Other illnesses present

PATHOLOGY
     Nodal Involvement; Histology (Type, Other); Differentiation; Foci of disease; Cell size

SURVIVAL
     Date of death; Cause of death; Lost to followup

APPENDIX D

VARIABLES NOT INCLUDED IN THE ANALYSIS

General:
     Year of diagnosis: year of initial diagnosis or treatment (anniversary)
     Date of birth: month and year of birth (were used to check reported age but do not appear explicitly in functions)
     Clinical stage: clinical stage (the factors used in staging appear as variables)
     Pathological stage: pathological stage
     Years in N.A.: number of years that patient has lived in North America
     Years in B.C.: number of years that patient has lived in British Columbia

Family History:
     Cancer other than breast: for father, mother, sister, brother, son, daughter, maternal relative, and paternal relative there was an indicator variable for occurrence of cancer and a variable for types of cancer
     Other diseases in family members: diabetes, tuberculosis, heart disease, and other for blood relatives

Menstrual history:
     Menarche: age at which menstruation began
     Length of periods: in days
     Interval between: days between periods

Illnesses and Surgery of patient:
     Childhood diseases: mumps, measles, etc.
     Typhoid
     Ulcer: stomach ulcer
     Tonsillectomy
     Appendectomy

History of present illnesses:
     Other symptoms: those appearing after the first
     First observation of symptom: whether patient or medical professional first observed symptom of disease

Survival data: used to increase knowledge of history of problem but not pertinent to the classification problem
     Date of Death
     Cause of Death
     Date lost to followup
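The coding conventions of Appendix C and the exclusions of Appendix D can be illustrated with a minimal sketch of how one coded record might be prepared for analysis. This is not the procedure used in the thesis. The field names (nodes_palpable, duration_of_symptom, age_at_menopause, and the excluded names) are hypothetical, and only the code values (1/0/9 for yes/no/unknown, 88 for premenopausal, 98 and 99 for duration) are taken from the coding instructions above.

```python
from typing import Optional

# Codes shared by the yes/no items in Appendix C (family history,
# illnesses & surgery, nodes, skin, & trauma): 1 = yes, 0 = no, 9 = unknown.
YES_NO_UNKNOWN = {1: True, 0: False, 9: None}

# Hypothetical column names standing in for a few Appendix D exclusions.
EXCLUDED_VARIABLES = {"clinical_stage", "pathological_stage",
                      "years_in_na", "years_in_bc", "menarche"}


def decode_yes_no(code: int) -> Optional[bool]:
    """Map a 1/0/9 code to True/False/None (None means unknown)."""
    return YES_NO_UNKNOWN[code]


def decode_duration_months(code: int) -> Optional[int]:
    """Duration of symptom: 1-97 are actual months, 98 is the top code, 99 is unknown."""
    if code == 99:
        return None
    if 1 <= code <= 98:
        return code
    raise ValueError(f"unexpected duration code: {code}")


def decode_age_at_menopause(code: int) -> Optional[int]:
    """Age at menopause: 88 marks a premenopausal patient, otherwise actual years."""
    return None if code == 88 else code


def prepare_record(raw: dict) -> dict:
    """Drop excluded variables and decode the coded fields that are present."""
    record = {k: v for k, v in raw.items() if k not in EXCLUDED_VARIABLES}
    decoders = {
        "nodes_palpable": decode_yes_no,
        "duration_of_symptom": decode_duration_months,
        "age_at_menopause": decode_age_at_menopause,
    }
    for name, decode in decoders.items():
        if name in record:
            record[name] = decode(record[name])
    return record


# A made-up record, not data from the study.
example = prepare_record({
    "nodes_palpable": 1,
    "duration_of_symptom": 99,
    "age_at_menopause": 88,
    "clinical_stage": 2,
})
print(example)
```

Representing "unknown" as None rather than leaving the sentinel codes 9, 88, and 99 in place keeps them from being treated as ordinary numeric values in any later computation.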

