Statistical Modelling of Breast Cancer Risk for British Columbian Women by Tanja Hoegg A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE COLLEGE OF GRADUATE STUDIES (Interdisciplinary Studies) THE UNIVERSITY OF BRITISH COLUMBIA (Okanagan) June 2013 c Tanja Hoegg, 2013 Abstract Although there are many known factors associated with an increased risk of breast cancer, age remains the main eligibility criterion for the current Screening Mammography Program of British Columbia (SMP BC). In the light of recent controversy surrounding regular breast screening, the ability of targeted screening for high-risk women is of current interest as it has the potential to sustain high cancer detection rates, reduce the number of false positive results while controlling the overall costs. Multiple models have been proposed to estimate a woman’s risk of breast cancer given her current risk factor profile, the model introduced by Gail et al. in 1989 being particularly popular. Using five-year follow-up data of 223,399 SMP BC participants, we investigate whether the probability estimates of the Gail model adequately stratify the British Columbian population into groups of high and low-risk women and, hence, provide a basis for a personalized access criterion into the Screening Mammography Program. Further, we built a breast cancer risk model for British Columbia with the goal to include a stronger set of predictor variables and improve the outcome stratification of the Gail model. Neither the Gail model nor the new risk prediction model based on SMP BC participants showed adequate stratification properties. Overall, effect sizes of all covariates were too small to clearly separate risk estimates of future breast cancer cases and non-cases. It is questionable whether changes in screening policies should be based on breast cancer risk prediction models. ii Preface This thesis is original, unpublished, independent work by the author, Tanja Hoegg. The research reported in Chapters 2 - 4 was covered by UBC BCCA Ethics Certificate number H11-01617. iii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 The BC Screening Mammography Program . . . . . . . . . . . . . . . . . . . 1.3 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Notation and Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.1 Disease Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.2 Survival and Hazard . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.3 Survival Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.4 Censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 2 3 5 5 5 6 7 Chapter 2: Preliminary Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Study Population and Variables . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Data Cleaning and Initial Exclusion Criteria . . . . . . . . . . . . . . . . . . 2.3 Exploratory Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Population and SMP BC Comparison . . . . . . . . . . . . . . . . . . 2.3.2 Analysis of Survival Times . . . . . . . . . . . . . . . . . . . . . . . . 2.3.3 Risk Factors and Cancer Outcome . . . . . . . . . . . . . . . . . . . . 2.3.4 Variable Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 9 10 12 12 13 15 19 iv Table of Contents 2.4 2.5 Limitations of Follow-Up Evaluation of Study Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Chapter 3: Gail Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 The Gail Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Validation Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Validation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.1 Risk Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.2 Assessment of Model Calibration . . . . . . . . . . . . . . . . . . . . . 3.4.3 Discriminatory Ability . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.4 Comparison to Incidence Model . . . . . . . . . . . . . . . . . . . . . 3.5 Residual Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6 Optimization of Risk Threshold . . . . . . . . . . . . . . . . . . . . . . . . . 26 26 29 30 32 32 33 36 37 38 40 Chapter 4: Model Building . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Survival Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.1 Cox’s Proportional Hazards Model . . . . . . . . . . . . . . . . . . . . 4.1.2 Aalen’s Additive Model . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Modelling Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Estimation of Time to Diagnosis . . . . . . . . . . . . . . . . . . . . . 4.2.2 Exclusion of Individuals . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Analysis of Family History . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Modelling of Invasive Cancers . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.1 Single Covariate Modelling . . . . . . . . . . . . . . . . . . . . . . . . 4.4.2 Simultaneous Modelling of Covariates . . . . . . . . . . . . . . . . . . 4.4.3 Testing Proportional Hazards Assumptions . . . . . . . . . . . . . . . 4.4.4 Functional Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Modelling of In Situ Cancers . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.1 Single Covariate Modelling . . . . . . . . . . . . . . . . . . . . . . . . 4.5.2 Simultaneous Covariate Cox Model . . . . . . . . . . . . . . . . . . . 4.5.3 Model Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 Predictive Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 43 43 45 47 47 48 50 52 53 56 58 61 63 63 64 65 67 Chapter 5: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 v List of Tables 2.1 2.2 2.3 3.1 3.2 3.3 4.1 4.2 4.3 4.4 4.5 Variables extracted from the SMP BC database: Coding, description and values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Contingency table notation of absolute numbers and probabilities (in parenthesis) for exposure groups and cancer outcomes. . . . . . . . . . . . . . . . . 16 Number of missing values (NA), complete observations (n), absolute (nbc ) and relative (nbc /n) cancer frequencies and odds ratios θ with 95% confidence intervals for selected risk factor categories. . . . . . . . . . . . . . . . . . . . . 18 Gail model logistic regression coefficients for Caucasian, Black and Asian women as provided by the Gail model source code. . . . . . . . . . . . . . . . 27 Numbers of expected and observed cancers, E/O ratios and 95% confidence intervals for age and ethnic groups for (1) 176,978 women and (2) 78,106 women with complete information. . . . . . . . . . . . . . . . . . . . . . . . . 34 Person years of 176,978 SMP BC women, B.C. incidence rates, observed cancers (O) and expected cancers predicted by B.C. incidence (E) for five-year age categories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Parameter estimates, relative hazards, standard errors, Wald statistics and p-values for family history model. . . . . . . . . . . . . . . . . . . . . . . . . Coefficients and p-values of univariate Cox and Aalen models for hypothesis tests of significant and time-varying effects. (* refers to all variables levels, ** refers to at least one level) . . . . . . . . . . . . . . . . . . . . . . . . . . Regression coefficients, standard errors, Wald statistics and corresponding p-values under proportional hazards model. . . . . . . . . . . . . . . . . . . Parameters estimates, standard errors, Wald statistics and p-values for the stratified Cox model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Coefficients, standard errors, Wald statistics and p-values for multivariate Cox model of in situ cancers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 . 54 . 57 . 61 . 66 vi List of Figures 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 3.1 3.2 3.3 3.4 Age distribution by five-year age groups for 221,545 SMP BC participants and B.C. women aged 40 to 79. . . . . . . . . . . . . . . . . . . . . . . . . . Ethnic composition of 221,545 SMP BC participants undergoing screening in the year 2000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kaplan-Meier estimator and 95% confidence interval of breast cancer free survival times for 221,545 SMP BC participants, stratified by invasive (black) and in situ cancers (blue). . . . . . . . . . . . . . . . . . . . . . . . . . . . . Relative frequencies of responses (in parenthesis) for variables curr_estr, hysterec, menstr, educ, mammo_dens by five-year age groups. . . . . . . Age-specific mean breast density of 212,545 SMP BC women stratified by ethnicity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Histogram of follow-up length in years for 221,545 SMP BC participants screened in the year 2000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Remaining percentage of participants at the end of each follow-up year by five-year age categories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Age-specific breast cancer incidence rates per 100,000 person years for the B.C. population and the data set of 212,545 SMP BC women. . . . . . . . . Relative cancer frequency by five-year age category for 221,545 individuals when women lost to follow-up are included and removed from the data set. . Distribution of five-year breast cancer probabilities for 176,978 SMP BC participants as estimated by the Gail model. . . . . . . . . . . . . . . . . . . . Ratio of expected and observed cancers with 95% confidence interval in fiveyear age categories and ethnic groups for 78,106 individuals with complete Gail information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Estimated five-year cancer risk densities stratified by cancer outcome for 78,106 women with complete Gail model information (x-axis is truncated at x = 5). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 . 14 . 15 . 19 . 20 . 22 . 23 . 24 . 30 . 33 . 36 . 37 vii List of Figures 3.5 3.6 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 Boxplot of Pearson residuals from Gail model stratified by the number of affected relatives. The y-axis is restricted to display the model’s residuals of non-cases only. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Left: Cancer detection rate above threshold T (black) with lower 95% confidence bound (dashed) and prevalence (blue) for 78,106 women. Right: Proportion of women with risk estimates pi above risk threshold T . . . . . . . . . 42 Left: Histogram of time between last recorded SMP BC screening date and cancer diagnosis of 4078 women. Right: Histogram of time between the 2000 screening appointment and diagnosis date for all prevalent cancers. . . . . . Relative frequency of affected first degree relatives by type of relative, as a function of five-year age categories. . . . . . . . . . . . . . . . . . . . . . . . Cumulative regression coefficients B(t) of univariate Aalen models for variables age, abmam, menstr and age_lastmenstr. . . . . . . . . . . . . . . Scaled Schoenfeld residuals for variables age and abmam with smoothing spline as a function of event time t(k) . . . . . . . . . . . . . . . . . . . . . . Plot of all martingale residuals (left) and martingale residuals of non-cases (right) under multivariate Cox model versus age. . . . . . . . . . . . . . . . Cumulative coefficients Bj (t) as a function of time for single covariate Aalen models of age and abmam. . . . . . . . . . . . . . . . . . . . . . . . . . . . Scaled Schoenfeld residuals over time for variable age with locally weighted smoothing spline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Scatterplot of five-year breast cancer risk for 97,856 women as estimated under the Gail model and the stratified Cox model. . . . . . . . . . . . . . . . . . Five-year cancer risk densities stratified by cancer outcome for 97,786 women as estimated by the stratified Cox model (x-axis is truncated at x = 5). . . . . 49 . 51 . 55 . 60 . 62 . 64 . 65 . 68 . 69 viii Acknowledgements First and foremost, I would like to thank my supervisors, Dr. Sylvia Esterby and Dr. Paramjit Gill, who guided me through this project from near and far, whose doors were always open, providing advice and encouragement whenever it was needed. This thesis would not have been possible without your support. I would also like to thank my committee members, Dr. Rasika Rajapakshe and Dr. Cynthia Araujo, from the BC Cancer Agency for inviting me into their team and making this project possible. Your inexhaustible enthusiasm has left a lasting impression on my path. My sincere thanks go to my committee member Dr. Jason Loeppky for giving his valuable time to review my work and providing me with the latest gadgets like one of his own. Finally, many thanks to Dr. Anatol Sargin, who sent advice and support across the globe in many late night conversations. ix Chapter 1 Introduction 1.1 Motivation In Canada, over 20,000 women are diagnosed with breast cancer each year, making it the most common cancer among Canadian women (Statistics Canada, 2011). Hoping to detect cancer at an early stage and therefore improve chances of a successful treatment, many women consider early detection measures such as regular breast screening. In British Columbia, the Screening Mammography Program (SMP BC) offers annual mammograms free of charge for women aged 40 to 79. Although many personal and familial factors have been associated with an increased risk of developing breast cancer, age remains the main eligibility criterion for the current Screening Mammography Program. Other established risk factors for breast cancer include early age at menarche (Brinton et al., 1988), nulliparity (Brinton et al., 1983) and first degree relatives affected by breast cancer (Brinton et al., 1982). The ability to estimate a woman’s probability of developing breast cancer in a future time period given her risk factor profile is of current interest as it provides a basis for targeted screening practice for high-risk women. Based on the projected probability, women at high breast cancer risk could be identified in the population and accepted into the screening program before reaching the age of 40. To further improve the effectiveness of the early detection initiative, more sensitive screening techniques such as Magnetic Resonance Imaging (MRI) could also be implemented for high risk women. Providing more detailed breast imaging, MRI screening has a higher cancer detection rate compared to the currently used mammograms. Overall, targeted screening has the potential to increase cancer detection rates while controlling the overall costs. Several risk models have been proposed during the last 20 years, the model introduced by Gail et al. (1989) being particularly popular. Although many efforts have been made to 1 1.2. The BC Screening Mammography Program predict a woman’s risk of developing breast cancer (Barlow et al., 2006; Chen et al., 2006), most models showed unsatisfactory predictive performance when evaluated on independent data (Decarli et al., 2006; Rockhill et al., 2001). Therefore, the applicability of individualized risk estimates is still limited and needs improvement. Using five-year follow-up data of 223,399 SMP BC participants in the year 2000, the following thesis investigates whether the probability estimates of the Gail model sufficiently stratify the British Columbian population into groups of high and low-risk women and, hence, provide a basis for a personalized access criterion into the Screening Mammography Program. Further, we build a breast cancer risk model for British Columbia with the goal to include a stronger set of predictor variables and improve the outcome stratification of the Gail model. 1.2 The BC Screening Mammography Program Introduced in 1988 as the first breast cancer screening program in Canada, the Screening Mammography Program of B.C. is a population based early detection effort operated by the BC Cancer Agency and is accessible to British Columbian women aged 40 to 79 years (BC Cancer Agency, 2001). Participants are self-referred, invited by promotion activities and advertisement throughout the province which include newspaper ads and information brochures at doctor’s offices. In 1998, the program also introduced an invitation letter sent out to women turning 50 years of age. Age information is based on the B.C. Ministry of Health’s client registry. Women outside the age range of 40 to 79 may also become eligible for screening if referred by a primary healthcare provider and approved by a radiologist. Once enrolled in the program, participants receive recall letters for re-screening examinations. The frequency of the letters is based on SMP BC recall recommendations, which suggest biannual screening for women aged 50 to 79 and annual screening for women under the age of 50. At each screening appointment, background information on the participant’s status of established breast cancer risk factors is collected through an eight item questionnaire. Questions cover risk factors related to personal background, reproductive and familial breast cancer history, menopausal status and medical history. The questionnaire is self-administered and completed on a voluntary basis. 2 1.3. Previous Work The breast cancer status of participants is ascertained through a linkage with the B.C. Cancer Registry. All breast cancers detected within British Columbia, both within or outside the screening program, are also recorded within the SMP BC database. Once diagnosed with breast cancer, women become ineligible for screening mammograms within the SMP BC. In the year 2000, there were 38 screening centers throughout the province including four mobile stations serving remote areas of British Columbia. Province wide participation in a 24-month period was estimated at 49% for women between the ages of 50 to 79 (BC Cancer Agency, 2001). 1.3 Previous Work In 1989, Gail et al. proposed a model to project a woman’s absolute risk of developing invasive and in situ breast cancer in a specified age interval. The model is based on data collected from the Breast Cancer Detection Demonstration Project (BCDDP), a US cohort study following the breast cancer status of 284,780 predominately white women with up to five annual screening examinations from 1973 to 1979 (Baker, 1982). Six explanatory variables are considered in the model and include a woman’s age at menarche, the number of previous breast biopsies, her age at first live birth, the number of affected first degree relatives and a dichotomized age variable separating women below and above fifty years of age. Based on a matched case-control study nested within the BCDDP cohort, a woman’s breast cancer risk r(t|Xi ) relative to an unexposed woman of the same age is approximated by the odds ratio obtained from a logistic regression model. This relative risk is then combined with an age-specific breast cancer baseline hazard rate, h1 (t), estimated from the entire BCDDP cohort. Under consideration of risks of competing causes of death, h2 (t), the probability of breast cancer in the age interval [a, a + τ ] could then be estimated by integration of hazards over the length of the interval τ . This will be referred to as Gail model 1. The presence of atypical hyperplasia in a previous biopsy was added as an additional explanatory variable in a revised model. All variables were considered ordinal in the modelling process. The original Gail model 1 was later modified as part of the Breast Cancer Prevention Trial (BCPT) study design to predict the absolute risk of invasive breast cancer only. BCDDP 3 1.3. Previous Work baseline hazards were replaced with invasive breast cancer rates from the Surveillance, Epidemiology and End Result (SEER) Program and a woman’s race was added as an additional explanatory variable. This version was implemented in an online risk calculator accessible through the website of the National Cancer Institute (www.cancer.gov/bcrisktool/) where women have the ability to calculate their personalized breast cancer risks. In the following chapters, we will refer to this later version as the ‘Gail model’. The validity of the Gail model has been assessed on independent US and international data. Comparing the number of expected (E) and observed (O) cancers in predefined subgroups such as age categories, risk quintiles and categorical risk factor levels, the overall calibration performance of the Gail model has been found to be adequate in several studies. Rockhill et al. (2001) reported a value of E/O = 0.94 (95% CI: 0.89 - 0.99) for the ratio of observed and expected cancers, Costantino et al. (1999) a ratio E/O = 1.03 (95% CI: 0.88 - 1.21). However, it has been noted that the model overestimates breast cancer risk for women who do not participate in annual screenings (Spiegelman et al., 1994) and also shows a tendency to overestimate risk for women classified in the higher predicted risk quintiles (Costantino et al., 1999). Anothaisintawee et al. (2011) offer a systematic review of previous publications assessing the predictive performance of the Gail model. For all validation studies discussed, the Gail model performed rather unsatisfactory in the discrimination of cases and non-cases. The reported concordance statistics range from 0.58 to 0.67, indicating a considerable overlap in the risk distributions for cases and non-cases and therefore a limited applicability in the prediction of individual cancer outcomes. In an attempt to improve on calibration and discriminatory accuracy, modifications of the Gail model were proposed in several subsequent studies. Adjustments were based around a search for a stronger set of predictor variables in the logistic model. New variables like breast density and weight were added to the model by Chen et al. (2006) while other variables considered in the original model were removed. A change in coding of the explanatory variables from ordinal to categorical was proposed by Decarli et al. (2006). Overall, these modifications only lead to minor improvements over the discriminatory performance of the Gail model. Increased concordance statistic were reported (Chen et al., 2006; Decarli et al., 2006), however, the models’ ability to identify who will and will not develop breast cancer 4 1.4. Notation and Concepts remained limited. 1.4 1.4.1 Notation and Concepts Disease Frequency Occurrence of disease in a population is commonly described by two measures, disease prevalence and incidence. The prevalence of a disease describes the proportion of the population that are diagnosed with the disease, number of cases population size (1.1) and measures the likelihood of disease among a population at a specific point in time (Coggon et al., 2003). Disease incidence is defined as number of new cases person-years at risk (1.2) and represents the rate at which new cases occur in a population during a specified time interval (Coggon et al., 2003). Person-years at risk is the total length of time (in years) individuals in a population are at risk of the disease. An individual is considered to be at risk only during observation and until diagnosed with the disease. Hence, disease incidence allows for different follow-up periods between individuals. Throughout this thesis, a woman diagnosed with breast cancer will be referred to as a case while a woman not diagnosed with breast cancer will be denoted as a non-case. At a given point in time, a prevalent case defines an individual whose breast cancer has already developed while an incident case has yet to develop the disease (Krug and McNutt, 2007). 1.4.2 Survival and Hazard The following notation is consistent with Therneau and Grambsch (2000). Let T be a non-negative, random variable denoting the time until event occurrence. We assume that T has a density f and a cumulative distribution function F (t) = P (T ≤ t). 5 1.4. Notation and Concepts This probability will also be referred to as the breast cancer risk. The complement defines the time-dependent survival function as S(t) = P (T > t) = 1 − F (t) (1.3) and represents the probability of no event occurrence before time t. We assume S(0) = 1. Further, let λ(t) denote the hazard function defined as P [t ≤ T < t + ∆t|T ≥ t] dS(t)/dt =− . ∆t→0 ∆t S(t) λ(t) = lim (1.4) The hazard λ(t) represents the instantaneous probability of event given that the event has not occurred before time t. Although the hazard λ(t) is not a probability, it can be interpreted as a measure of risk as higher values λ(t) indicate higher likelihood of event occurrence. Integrating λ(t) over t gives t t λ(u)du = − 0 0 S (u) du = − log(S(t)). S(u) (1.5) and establishes the following relationship between hazard and survival function S(t) = exp(−Λ(t)) where Λ(t) = 1.4.3 t 0 λ(u)du (1.6) represents the cumulative hazard function. Survival Models For individual i, a binary outcome indicator is defined as 0 for Ti > t Ni (t) = 1 for T ≤ t i ,i = 1...n (1.7) and describes the survival status at time t ≥ 0. Ni (t) takes a value of 1 if the event has occurred before time t and a value of 0 if it has not. Further, let dNi (t) denote the ¯ (t) the total incremental change in Ni in the infinitesimal time interval [t, t + dt] and N number of events until time t. 6 1.4. Notation and Concepts The objective is to estimate the survival function for individual i based on current covariate information Si (t) = P (Ti > t|Xi ) (1.8) where Xi = (Xi1 , . . . Xip ) denotes the covariate vector containing information on p variables recorded at time t = 0. These variables are generally thought to affect the survival time Ti . Using the relationship between survival and hazard function given in Equation (1.6), survival probabilities can be estimated through modelling of the hazard λi (t) as λi (t) = g(Xi , β), where g is a function of the covariate vector Xi and a parameter vector β. The vector β is estimated in the model fitting process. 1.4.4 Censoring When following the survival status of a cohort over time, individuals are often able to leave the study before the end of the observational period. Reasons can include death from competing causes, relocation or discontinuation of screening. As a result, the survival status Ni (t) cannot be observed for the entire study period. Let Ci be a non-negative random variable describing the censoring time of individual i. An individual is said to be right censored if the event of interest has not occurred before the individual is lost to follow-up. The survival time Ti then exceeds the censoring time Ci , leading to Ti to be unknown. Instead, we only observe T˜i = min(Ti , Ci ). To indicate whether an individual is still at risk of the event at time t, i.e. T˜i > t, we define an indicator variable 1 for T˜i > t Yi (t) = 0 for T˜ ≤ t i ,i = 1...n (1.9) Right censoring is referred to as non-informative or random if the censoring time Ci is independent of the survival time Ti . Otherwise, censoring is called informative (Therneau 7 1.4. Notation and Concepts and Grambsch, 2000). 8 Chapter 2 Preliminary Analysis 2.1 Study Population and Variables The data set analyzed in this thesis consists of 223,399 women who underwent a screening mammogram as part of the Screening Mammography Program in the year 2000. The data set was extracted from the SMP BC database after approval from the UBC BCCA Research Ethics Board following a formal data request to the SMP BC. A confidentiality agreement was signed by all individuals involved in the project. All women included in the data set were anonymized through consecutive case ID identifiers by the SMP BC. This research project is covered by UBC BCCA Ethics Certificate number H11-01617. As a result of the linkage between the B.C. Cancer Registry and the Screening Mammography Program, all breast cancers diagnosed in participants before December 31, 2005, were recorded with the exact diagnosis date. This allows us to follow each woman’s cancer status for a minimum of 5 years from the initial screening date. December 31, 2005, marks the official end of the observational period for the study. For all participants, we obtained complete information on the women’s birthdate, the 2000 screening date, the 2000 screening result and the women’s breast density estimated from the year 2000 mammographic image. If a woman’s mammogram was found to be abnormal in 2000, the diagnoses and corresponding diagnosis dates of examinations following the abnormal mammogram were also available. Information on the participants’ year of entry into the screening program and the last recorded SMP BC screening date as of July 2012 were complete for all individuals. Based on the participants’ birthdate and the 2000 screening date, we derived an additional variable describing the age at the 2000 screening examination. Further, we obtained the participants’ personal background and risk factor information collected through the screening appointment questionnaire in the year 2000. As the questionnaire is completed on a voluntary basis and self-administered, the corresponding variables 9 2.2. Data Cleaning and Initial Exclusion Criteria contain missing values and may be subject to error. The items included in the questionnaire provide a woman’s personal background and the presence of well-established breast cancer risk factors. Beside a woman’s highest level of education and ethnic origin, her reproductive history is identified by information on her age at menarche, the occurrence of previous pregnancies, the number of full term deliveries and, if applicable, the age at first delivery. Risk factors related to medical history are assessed by questions about the current use of estrogen and the occurrence of a previous breast biopsy. For certain women the number of breast biopsies is also available. If a woman states her menstrual cycles has stopped, her menopausal history is further obtained by information collected on the woman’s age at the last menstruating cycle, the presence of a hysterectomy at the time her menstruation stopped and, if applicable, if both ovaries were removed in the process of the hysterectomy. Familial breast cancer history is ascertained by the number and relationship of first degree relatives diagnosed with breast cancer. If breast cancer has been present among first degree relatives, an additional question whether the cancer occurred before the age of 50 is completed. All participants were anonymized and consecutively numbered with case ID identifiers from 1 to 223,399. Table 2.1 gives a list of relevant variables in the data set, abbreviations used in the following chapters and variable values. 2.2 Data Cleaning and Initial Exclusion Criteria Several participants were excluded from the data set before the start of the analysis. The data set included 50 women diagnosed with bilateral breast cancer, leading them to be listed twice with identical entries for all variables. These duplicates were removed for each of the 50 women. Participant with case ID 60642 was deleted from the data set because of missing values on all risk factors and a missing birthdate. Due to the regulations of the SMP BC, we also decided to exclude participants outside the eligible age range of 40 to 79 at the time of the 2000 screening appointment. As these women were only admitted to the screening program as a result of special medical circumstances related to breast disease, they differ from the general SMP BC population and may introduce biased breast cancer risks when included. Therefore, 394 women under the age of 40 and 10 2.2. Data Cleaning and Initial Exclusion Criteria Table 2.1: Variables extracted from the SMP BC database: Coding, description and values. Code Description Values (format) yearscr1 lastsmp scrdate scr_res year of first SMP screen last recorded screening date 2000 screening date result of 2000 screen scr_diagdate scr_diag 2000 diagnosis date 2000 diagnosis mammo_dens bdate ethnic breast density birth date ethnic origin educ highest educational level age_men ever_preg num_del age_1stdel menstr age_lastmenstr hysterect both_ovrem fam_hist age at menarche ever pregnant number of deliveries age at first delivery currently mestruating age menstruation stopped hysterectomy both ovaries removed affected first degree relatives fam_hist50 curr_estr ever_biop num_biop ca_diagdate relatives affected before 50 currently using estrogen any breast biopsies number of breast biopsies breast cancer diagnosis date year (yy) date (dd/mm/yyyy) date (dd/mm/yyyy) 1 (Normal), A (Abnormal low), B (Abnormal moderate), C (Abnormal high) date (mm/dd/yyyy) B (Benign), M (Invasive), D (DCIS), E (Equivocal disease), L (LCIS) 1 (below 50%), 2 (above 50%) date (dd/mm/yyyy) AP (Aboriginal), BR (British), FR (French), EA (East Asian), SA (South Asian), WE (Western Europe), NE (Northern Europe), SE (Southern Europe), EE (Eastern Europe), BL (Black), OT (other), U (Unknown) N (None), S (Some high school), H (High school), C (College/ Some university), G (University graduate), U (Unknown) integer Y (yes), N (no) integer integer Y (yes), N (no) integer Y (yes), N (no) Y (yes), N (no) N (None), M (Mother), S (Sister(s)), D (Daughter(s)), F (Father or Brother), 1 (Mother & Sister(s)), 2 (Mother & Daugther(s)), 3 (Mother & Sister(s) & Daughter(s)), A (Adopted), O (Only Child), U (Unknown) Y (yes), N (no) Y (yes), N (no) Y (yes), N (no) integer date (mm/dd/yyyy) 11 2.3. Exploratory Analysis 1411 women over the age of 80 were not considered in the following analysis. Overall, 221,545 women remained in the study. Further, we adjusted the coding scheme of several variables to create consistent codes for missing, refused or non-applicable answers throughout the data set. Blank spaces and categories U (unknown) used to indicate unavailable values were replaced with missing value indicators NA . Categories R (refused) were also considered missing as this category was used in a sporadic pattern with a total of 502 entries in the entire data set. For the variable num_biop, the number of previous breast biopsies, missing values were originally assigned a value of zero, making it not possible to differentiate between women without prior biopsies and women with missing information. We adjusted this variable by replacing all values of zeros with NA for all women with missing or positive responses for the variable ever_biop. Complete information on all risk factors was available for 54.1% of the 221,545 individuals, while 45.9% had missing information for at least one variable. 2.3 2.3.1 Exploratory Analysis Population and SMP BC Comparison After excluding participants outside the eligible SMP BC age range in the initial steps, the remaining women’s ages vary between 40 to 79 at the time of the 2000 screening appointment. Figure 2.1 shows the age distribution by five-year age groups within the SMP BC data set and the population of B.C. women aged 40 to 79. Provincial population estimates for the year 2000 were obtained from Statistics B.C. (Statistics B.C., 2012). While most British Columbian women eligible for SMP BC screening are between the ages of 40 to 44, the age category of 45 - 49 showed the highest participation in the 2000 screening. This discrepancy is most likely to be explained by invitation letters sent annually to women turning 50 years of age and the resulting increase of awareness. With overall higher frequencies compared to the B.C. population, participants aged 45 to 69 are overrepresented in the data set, while fewer than expected women aged 70 to 79 decided to undergo screening in 2000. Due to the high sample size, a χ2 test assessing the global goodness of fit between the observed and expected frequency distribution results in a highly significant p value of < 0.0001, suggesting that the age representation in the data set is not consistent with the 12 2.3. Exploratory Analysis Rel. Frequency 0.20 0.15 sample B.C. SMP BC 0.10 0.05 40-44 45-49 50-54 55-59 60-64 Age Category 65-69 70-74 75-79 Figure 2.1: Age distribution by five-year age groups for 221,545 SMP BC participants and B.C. women aged 40 to 79. overall population of B.C. Looking at the ethnic distribution given in Figure 2.2, SMP BC participants in the year 2000 were predominately white, with British and European origins representing the largest groups of 32.1% and 20.7%, respectively. Asian ethnicities rank second with East Asian origins accounting for 11.5% and South Asian origins for 2.8%. First Nations and African heritage were the smallest groups in the data set with 1.0% and 0.2%, respectively. 2.3.2 Analysis of Survival Times Of 221,545 women, 4078 were diagnosed with breast cancer by the end of 2005, resulting in a relative cancer frequency of 1.85%. In situ cancers accounted for 18.6% of all cases. The breast cancer free survival time S(t) is estimated with a non-parametric estimator introduced by Kaplan and Meier (1958). For ordered event times t(1) < t(2) < . . . < t(k) , it is defined as ˆ = S(t) t(i) ni − di ni <t (2.1) 13 2.3. Exploratory Analysis Rel. Frequency 0.3 0.2 0.1 0.0 Black British East Asian Europe Ethnicity First Nations South Asian Figure 2.2: Ethnic composition of 221,545 SMP BC participants undergoing screening in the year 2000. where ni is the number of individuals at risk prior to event time t(i) and di the number of events occurring at time t(i) . The function survfit() available from the R package survival ˆ was used to calculate the survival function S(t). Figure 2.3 shows the Kaplan-Meier estimator of survival times stratified by in situ and invasive cancers. In comparison to in situ cancers, risks associated with invasive breast cancers are considerably higher as seen from clear separation in survival curves. A noticeable drop in survival takes place shortly after the 2000 screening appointment for both cancer types. It can be assumed that these cancers developed before the start of the observational period and were found in the 2000 screening examination. Especially for invasive cancers, a wavy pattern in the survival curves repeating with a frequency of about one year can be seen. There appears to be a particular increase in cancer detection rates after approximately two years, a result of women following the SMP BC screening recall recommendations. With the exception of the initial drop, the survival proportions for both cancer types follow a overall linear trend with a five-year survival rate of 98.4% and 99.6% for invasive and in situ cancers, respectively. 14 2.3. Exploratory Analysis Survival Proportion 1.000 0.995 0.990 0.985 0 2 4 Follow-up time in years 6 Figure 2.3: Kaplan-Meier estimator and 95% confidence interval of breast cancer free survival times for 221,545 SMP BC participants, stratified by invasive (black) and in situ cancers (blue). 2.3.3 Risk Factors and Cancer Outcome To investigate whether the development of breast cancer is associated with different levels of risk factor exposure, we calculated breast cancer odds relative to a baseline group defined as θ= p11 /p12 p21 /p22 (2.2) for each risk factor level (Agresti, 2002). Table 2.2 gives the contingency table notation introduced for the following calculations. Due to low cell counts in the cross classification of two risk factors, odds ratios θ were estimated using conditional maximum likelihood estimation. We assume a multinomial model for cell counts nij , thus, the distribution of nij is dependent on the cell probabilities pij , i, j = 1, 2. Conditioning on the marginal counts, i.e. the total numbers of exposed (ne ) and diagnosed (nbc ) women, cell count n11 determines the value of cell counts n12 , n21 and n22 . Restating Equation (2.2) in terms of p11 and the marginal probabilities, the odds ratio 15 2.3. Exploratory Analysis Table 2.2: Contingency table notation of absolute numbers and probabilities (in parenthesis) for exposure groups and cancer outcomes. breast cancer no breast cancer total n11 (p11 ) n21 (p21 ) nbc (pbc ) n12 (p12 ) n22 (p22 ) nbc ¯) ¯ (pbc ne (pe ) n0 (p0 ) n (1) exposed unexposed total θ is given by θ= p11 (1 − pe − pbc + p11 ) . (pe − p11 )(pbc − p11 ) (2.3) Hence, p11 can be expressed as a function of the odds ratio given the marginal probabilities. The conditional distribution of the number of cancers in the exposed group, n11 , depends only on the parameter θ and is given by a non-central hypergeometric distribution f (n11 |nbc , ne , θ) = nbc n11 n−nbc n11 ne −n11 θ nbc n−nbc u u u ne −u θ (2.4) with summation over the range of all possible integers u, max(0, ne −nbc ¯ ) ≤ u ≤ min(nbc , ne ) (Agresti, 2002). For an observed number n11 , the conditional maximum likelihood estimate θˆ is obtained by maximizing the likelihood function L(θ|n11 , nbc , ne ) with respect to the odds ratio θ. 95% confidence intervals were calculated using Fisher’s exact method. The lower bound of the confidence interval is the value θ0 such that the hypothesis test of H0 : θ = θ0 against Ha : θ > θ0 results in a p-value of 0.025. Similarly, the upper bound of the confidence interval is the value θ0 such that the hypothesis test of H0 : θ = θ0 against Ha : θ < θ0 results in a p-value of 0.025. The p-value can be calculated as j p= f (x|nbc , ne , θ) (2.5) x=i 16 2.3. Exploratory Analysis where i = max(0, ne −nbc ¯ ), j = n11 for the alternative hypothesis of Ha : θ < θ0 and i = n11 , j = min(nbc , ne ) for Ha : θ > θ0 (Agresti, 2002). Odds ratios and 95% confidence intervals were calculated using the function oddsratio() from the R package epitools (R Core Team, 2012). The function estimates θ by conditional maximum likelihood estimation and it’s 95% confidence interval using Fisher’s exact method. With a breast cancer prevalence of 1.85%, the rare disease assumption appears to be fulfilled and odds ratios can be assumed to closely approximate the relative breast cancer risk defined as (Collett, 2002) r = p11 /p21 . (2.6) For each risk factor level, Table 2.3 presents the number of missing values (NA), complete observations (n), absolute (nbc ) and relative (nbc /n) numbers of cancers and odds ratios (θ) with 95% confidence intervals. Categories with odds ratios of θ = 1 represent baseline exposure levels. For variable ethnic, European and British ethnicities were combined into a new category ‘Caucasian’. Odds ratios for all levels of age at first delivery, age at menarche, previous pregnancies and the number of full term deliveries were not significantly different from 1, suggesting that breast cancers are equally likely to occur in the exposed and baseline groups. Previous breast biopsies, current estrogen use, a hysterectomy and high breast density, however, showed significantly increased odds relative to the baseline groups and can therefore be associated with an increased probability of developing breast cancer. With a value of θ = 1.63, breast biopsies appear to have a particularly strong effect. The highest and lowest odds ratios are seen for the numbers of affected relatives and ethnicity groups. In comparison to Caucasian women, breast cancers are significantly less likely to occur in South and East Asian women, resulting in values of θ = 0.66 and θ = 0.81, respectively. Thus, ethnicity appears to be a important risk factor. Conversely, a familial history of breast cancer leads to the overall highest values in odds ratios. One, two and three affected relatives increase breast cancer odds by a factor of 1.64, 2.19 and 7.59, respectively, making it the strongest risk factor in the set of available variables. However, due to the limited number of women with two or more affected relatives, point estimates are subject to higher sampling error as seen from the noticeably wider confidence intervals. 17 2.3. Exploratory Analysis Table 2.3: Number of missing values (NA), complete observations (n), absolute (nbc ) and relative (nbc /n) cancer frequencies and odds ratios θ with 95% confidence intervals for selected risk factor categories. Risk Factor NA levels num_rel 12816 0 1 2 3 Caucasian South Asian Aboriginal Black East Asian no yes no yes ≤ 50 % > 50 % yes no no yes 0 1 >1 ≥ 14 12 - 13 ≤ 11 < 20 20-24 25-29 > 30 ethnic 59797 ever_biop 1525 curr_estr 2173 breast_dens 0 ever_preg 0 hysterec 14609 num_del 732 age_men age_1stdel 13971 11371 n 178948 28019 1736 26 117161 6186 2225 446 25483 186526 33494 159926 59446 163841 57704 175659 45886 152997 53939 45886 26311 148616 63973 111986 31615 25081 64502 48917 25840 nbc nbc /n θ 3023 768 63 3 2129 75 51 7 378 3144 908 2733 1310 2918 1160 3230 848 2741 1081 848 484 2737 1140 2107 616 453 1192 860 529 0.017 0.027 0.036 0.115 0.018 0.012 0.023 0.016 0.015 0.017 0.027 0.017 0.022 0.018 0.020 0.018 0.018 0.018 0.020 0.018 0.018 0.018 0.018 0.019 0.019 0.018 0.018 0.018 0.020 1.00 1.64 2.19 7.59 1.00 0.66 1.27 0.86 0.81 1.00 1.63 1.00 1.30 1.00 1.13 1.00 1.01 1.00 1.12 1.00 0.99 0.99 1.00 1.06 1.10 1.00 1.02 0.97 1.14 95% CI 1.51-1.78 1.67-2.83 1.46-25.15 0.52-0.84 0.94-1.68 0.34-1.79 0.73-0.91 1.51-1.75 1.21-1.39 1.06-1.21 0.93-1.09 1.04-1.20 0.89-1.12 0.92-1.08 0.98-1.14 0.99-1.21 0.92-1.14 0.87-1.09 0.99-1.29 18 2.3. Exploratory Analysis Relative Frequency 0.8 0.6 variable curr_estr(Y) educ(N) 0.4 hysterec(Y) mammo_dens(2) menstr(Y) 0.2 0.0 40-44 45-49 50-54 55-59 60-64 Age category 65-69 70-74 75-79 Figure 2.4: Relative frequencies of responses (in parenthesis) for variables curr_estr, hysterec, menstr, educ, mammo_dens by five-year age groups. 2.3.4 Variable Interactions Many of the risk factors given in the data set, in particular information about a woman’s menopausal status, are strongly associated with age. To control for potential confounding, it is important to differentiate whether differing cancer probabilities among risk factor levels are due to the risk factor itself or a result of it’s association with age. We found distinct age-related patterns within women’s responses on risk factors curr_estr, hysterec, menstr, educ and mammo_dens. The results are displayed in Figure 2.4. As to be expected, the relative frequency of women with response “Yes” on variable menstr decreases sharply with age and approaches zero by age category 60-64. In contrast, the percentage of menopausal women who underwent a hysterectomy or are currently using estrogen show a non-linear pattern, initially increasing with age and peaking for categories 60-64 and 55-59, respectively. Mammographic density decreases with progressing age as we see a decrease in frequencies for the high density category ‘2’. Changes in educational 19 2.3. Exploratory Analysis Mean Breast Density 1.6 ethnic AP BL 1.4 EA OT SA CA 1.2 1.0 40-44 45-49 50-54 55-59 60-64 Age Category 65-69 70-74 75-79 Figure 2.5: Age-specific mean breast density of 212,545 SMP BC women stratified by ethnicity. level, most notably for women with no education, were also visible over time. Frequencies of this category are highest for age categories 75-80 and are slowly decreasing as women get younger. Overall, there are clear response patterns as function of age, one of the strongest risk factors for breast cancer. A model which includes the risk factors simultaneously as covariates will show whether a risk factor’s effects on cancer-free survival time can be explained by it’s association with age. We also found associations between mammographic density and ethnic background. For five-year age categories, Figure 2.5 shows the mean mammographic density stratified by ethnic groups. It can be seen that East Asian (EA) women have consistently higher breast densities for all age categories compared to the remaining ethnicities. Women with African (BL), South Asian (SA) and Caucasian (CA) heritage as well as women classified among the category ‘Other’ (OT) showed similar values with intersecting trends as age categories are increasing. The overall lowest values for all age categories were found for Aboriginal (AP) 20 2.4. Limitations of Follow-Up women, for whom breast density was also most consistent as age is increasing. Overall, ethnicity related differences are decreasing with a progression of age. For age category 75-79, mean breast density for all ethnic groups reach similar values. 2.4 Limitations of Follow-Up To accurately predict the risk of developing breast cancer from the SMP BC study population, the reliable ascertainment of the population’s cancer status throughout the study is essential. Although participants’ breast cancer diagnoses before the end of the observational period are recorded in the data set, the SMP BC follow-up mechanism has several limitations as it both fails to identify women leaving the study and it also introduces informative censoring. Based on the available information, it is unclear whether women without a recorded diagnosis date could have left the study through death from competing causes or relocation outside of British Columbia. In both cases, a woman remains classified as undiagnosed by the follow-up mechanism throughout the entire study period. However, as the SMP BC does not account for cancers detected outside of British Columbia, the recorded cancer status may be incorrect. As a result, the cancer status of undiagnosed participants may initially be unreliable and needs further investigation. For each participant without a recorded diagnosis date, we defined the length of follow-up as the time between the 2000 screening date and the last recorded screening date within the SMP BC program. Figure 2.6 shows the distribution of follow-up times in years. While the majority of women undergo screening for a minimum of five years, 20.6% of all participants drop out of the program before the end of the observational period. The high percentage of participants never returning after their 2000 screening mammogram, a total of 8.5%, is particularly noticeable. As women become ineligible for SMP BC screening once diagnosed with breast cancer, a negative breast cancer status can be confirmed for all participants with a follow-up period greater than five years. Undiagnosed women with a follow-up period shorter than five years on the other hand may still include women lost to follow-up, making their cancer status subject to error. To adjust for these limitations, the cancer status of undiagnosed individuals will only be considered known until their last recorded SMP BC screening date, 21 2.4. Limitations of Follow-Up count 30000 end of study 20000 10000 0 0 5 Length of follow-up (years) 10 Figure 2.6: Histogram of follow-up length in years for 221,545 SMP BC participants screened in the year 2000. resulting in right censored survival times after this date. Further, the follow-up mechanism causes the censoring probabilities to be dependent on the breast cancer status. Consider two individuals deciding to leave the study before the end of the observational period. Also assume that one will be diagnosed with cancer within the five year period. Although both were initially lost to follow-up, the cancer status of the case is updated in the SMP BC data base once diagnosed, causing her to be included in the study again. Therefore, the probability of censoring is higher for a non-case than for a case, as cases are only able to leave the study through relocation or death, but not through discontinuation of screening. As a result, informative censoring is introduced into the data set. Censoring probabilities were also found to be dependent on risk factor levels. Figure 2.7 shows the remaining percentage of participants at the end of each calendar year. Women between the ages of 40 and 69 and above the age of 70 are split up into ten and five-year age categories, respectively. While women aged 40 to 69 appear to leave the study only in small numbers throughout the entire study period, the participation of women over the age of 70 22 2.5. Evaluation of Study Bias 1.0 Remaining Participants (%) 0.8 category 40−49 0.6 50−59 60−69 70−74 75−79 0.4 0.2 2000 2001 2002 2003 2004 2005 End of Follow Up Year Figure 2.7: Remaining percentage of participants at the end of each follow-up year by five-year age categories. declines rapidly. It is especially noticeable for the age category of 75-79 where participation drops below 10 % by the end of 2005. Therefore, older women are more likely to discontinue screening compared to younger women, creating associations between risk groups and followup time. This pattern can mostly be explained by SMP BC policies. Women aged 75 to 79 become ineligible for screening examinations in the course of the five-year follow-up period, forcing them to leave the study before the end of the observation. Overall, the cancer status of certain participants cannot be considered known for the length of the study. Instead, observations are right censored with differing follow-up periods. This will be taken into account for further analyses, where appropriate adjustments will be made to address the presence of informative censoring. 2.5 Evaluation of Study Bias Selection bias enters the study population in several ways. Active participants in cancer screening programs are subject to overdiagnosis (Zahl et al., 2004). Particularly in older 23 Incidence per 100,000 person years 2.5. Evaluation of Study Bias 750 population 500 B.C. SMP BC 250 0 40-44 45-49 50-54 55-59 60-64 Age Category 65-69 70-74 75-79 Figure 2.8: Age-specific breast cancer incidence rates per 100,000 person years for the B.C. population and the data set of 212,545 SMP BC women. women, screening mammography detects cancers that would never have progressed to cause symptoms or death during the women’s lifetime. In an unscreened population, these cancers are not diagnosed or treated, leading to a lower numbers of new cases overall. Therefore, incidence rates, defined as the number of new cases per 100,000 person years of observation, are higher in screened women than in the general population. A shift in cancer detection to earlier ages as a consequence of screening has also been noted (Gail and Benichou, 1994). Compared against an unscreened population, SMP BC participants receive a lead time in cancer detection and thus contribute more cases to lower age categories. As breast cancer incidence increases rapidly with age for younger women, this produces increased cancer rates in particular for lower age categories. Figure 2.8 shows age-specific incidence rates for 220,691 SMP BC women without prevalent breast cancers and the population of British Columbia. Provincial incidence rates were calculated as the mean incidence rate of the year 2000 to 2004 in British Columbia available from Statistics Canada (Statistics Canada, 2011). In situ cancers were included in 24 2.5. Evaluation of Study Bias provincial incidence estimates. Both populations show a positive, approximately linear relationship between incidence rates and age. For all age categories, rates are consistently higher in the screened population. Differences between rates are smallest for ages 40-44 and are increasing with age, a possible sign of overdiagnosis. A χ2 test comparing the incidence distributions resulted in a highly significant test statistic of 532.54 with a corresponding p-value of p < 0.001 and confirms biased incidence rates in the SMP BC sample. As noted by Gail et al. (1989), these results confirm that the risk estimates derived from the SMP BC population are most appropriate for women considering to undergo regular screening, not the general population of British Columbia. 25 Chapter 3 Gail Model Validation In the following chapter, we introduce the Gail model and assess it’s predictive and discriminatory performance to determine whether it provides a reliable breast cancer prediction tool for the B.C. population. Further, we compare it’s performance to that of a null model ignoring all covariate information. An analysis of the Gail model’s residuals is conducted to gain insight into potential improvements of the functional form. 3.1 The Gail Model The Gail model estimates a woman’s probability p(τ |Xi ) of developing invasive breast cancer within a predefined time interval [0, τ ], given her current constellation of the risk factor vector Xi . In an initial step, her risk ri relative to a woman’s at baseline risk factor exposure and of the same age is approximated by the odds ratio from the logistic regression model ri (t|Xi ) = exp(XiT β) (3.1) Parameters β are estimated from a matched case-control study within the BCDDP cohort. Variables contained in the covariate vector Xi include an indicator whether a woman is above 50 years of age, her age at menarche, age at first live birth, the number of affected first degree relatives and the number of previous benign breast biopsies. The model also includes an interaction term between the age indicator and the number of breast biopsies as well as an interaction between the age at first delivery and the number of affected relatives. The logistic regression coefficients for Caucasian, Black and Asian American women were extracted from the National Cancer Institute’s Gail model source code, version 3.0 (National Cancer Institute, 2012), and are presented in Table 3.1. For African and Asian 26 3.1. The Gail Model Table 3.1: Gail model logistic regression coefficients for Caucasian, Black and Asian women as provided by the Gail model source code. Ethnicity Caucasian Black Variable age ≥ 50 age_men num_biop age_1stdel num_rel num_biop : age ≥ 50 age_1stdel : num_rel 0.011 0.094 0.529 0.219 0.958 -0.288 -0.191 0.033 0.267 0.182 0 0.476 -0.112 0 Asian 0 0.075 0.553 0.276 0.792 0 0 ethnicities, variables with parameter values of 0 were not significant and are not included in the model. While significant coefficients for Asian and Caucasian women are rather similar, we note that there is a large difference for the model coefficients of Black women. Regression coefficients for Caucasian women are based on the original paper by Gail et al. (1989) while values for Black and Asian women were added as the result of subsequent studies by Gail et al. (2007) and Matsuno et al. (2011), respectively. A woman’s relative risk ri (t|Xi ) is then combined with the breast cancer baseline hazard h1 (t), an age and ethnicity-specific hazard rate for women without exposures to risk factors besides age. This baseline hazard rate h1 (t) is obtained from the breast cancer incidence rate h∗1 (t) within the composite SEER population using the following adjustment from Bruzzi et al. (1985). For I unique combinations of risk factor levels within the composite population, let Pi (t) and ri (t) be the proportion of women of age t and relative risk within risk group i, i = 1 . . . I, respectively. Then I h∗1 (t) = Pi (t)h1 (t)ri (t) i=1 For each risk group i, the proportion of cancer cases of age t, ρi (t), is given by ρi (t) = Pi (t)h1 (t)ri (t)/h∗1 (t) (3.2) 27 3.1. The Gail Model Dividing Equation (3.2) by ri (t), multiplying it by h∗1 (t) and summing on i, we obtain I h1 (t) = h∗1 (t) ρi (t)/ri (t) ≡ h∗1 (t)F (t) (3.3) i=1 where ρi (t) was estimated by the observed proportion of cases in risk group i within the SEER cohort and ri (t) is obtained from the logistic regression model given in Equation (3.1). The factor F (t) is equivalent to (1 − AR(t)), where AR(t) is the population attributable risk fraction for women of age t, the reduction in risk if no risk factors were present in the population. Under consideration of an age-specific hazard function for competing causes of death, h2 (t), the probability of developing breast cancer in the age interval [a, a + τ ] is given by a+τ t h1 (t) ri (t) exp − p (a, τ, ri ) = a h1 (u) ri (u) du a where S2 (t) dt S2 (a) (3.4) t S2 (t) = exp − h2 (u) du (3.5) 0 describes the probability of surviving competing risks up to age t. Dividing the age scale into m age categories [τj−1 , τj ), j = 1 . . . m with widths ∆j and assuming that relative risks ri (t|Xi ), baseline hazards h1 (t) and competing risks h2 (t) are constant within each category j, Equation (3.4) is evaluated as p (a, τ, ri (t)) = j S1 (τj−1 ) S2 (τj−1 ) h1j rij (h1j rij + h2j ) S1 (a) S2 (a) (3.6) × [1 − exp {−∆j (h1j rij + h2j )}] where ∆j is the length of the age interval j and h1j , h2j and rij denote the baseline hazard, competing hazard and relative risk in interval j, respectively. The smallest value of j in the summation is chosen such that τj−1 = a, the largest such that τj = a + τ . To simplify calculations, values a and a + τ are assumed to be an element of the set {τ0 , . . . τm }. 28 3.2. Validation Population 3.2 Validation Population Several individuals were removed from the data set for the following validation study. As the Gail model predicts the probability of breast cancer occurrence in a predefined time interval in the future, all individuals in the validation set were required to be without breast disease at the beginning of the study. Hence, 855 of 221,545 women with prevalent breast cancers detected at the 2000 screening appointment had to be excluded while all women with a negative screening result remained in the validation set. Further, 600 participants diagnosed with lobular and ductual carcinoma in situ after the 2000 screening appointment were removed from the data set as the Gail model predicts probabilities for invasive cancers only. The cancer status of all participants was ascertained through December 31, 2005, allowing each woman to be followed for a minimum of five years. Therefore, we calculated five-year breast cancer risks p(τ = 5) given in Equation (3.6) for all women with a follow-up time of at least five years. As described in Section 2.4, 20.6 % of all women left the study before the end of the observational period. Their cancer status cannot be considered known for the full five-year follow-up length, making it inappropriate to include them in a validation study of five-year cancer risks. Thus, these 41,997 women were removed from the validation set. Further adjustments to the remaining observations were required as a mere exclusion of these non-cases led to biased cancer rates in the population and could affect the calibration performance of the assessed model. Investigations showed that the highest discrepancy between cancer rates before and after exclusion of women lost to follow-up occurred within the variable age. Figure 3.1 shows the change in relative cancer frequencies within five-year age categories for the validation set when women lost to follow-up were removed. While rates remain similar for ages 40 to 69, age categories 70 to 79 show a considerable bias as a result of the exclusion. Cancer rates increase nearly seven fold for the age category 75-79. Other variables, in particular ethnic background, showed a nearly consistent increase in cancer rates for all variable levels, indicating that the drop out behaviour is mainly associated with age. This assumption is further affirmed as women over the age of 75 are forced to leave the screening program due to SMP BC eligibility policy. Based on these results, we decided to exclude all women over the age of 75 from the validation set in order to minimize 29 3.3. Validation Methods Rel. Cancer Frequency 0.125 0.100 Drop Outs 0.075 excluded included 0.050 0.025 40-44 45-49 50-54 55-59 60-64 Age Category 65-69 70-74 75-79 Figure 3.1: Relative cancer frequency by five-year age category for 221,545 individuals when women lost to follow-up are included and removed from the data set. the introduced bias. Overall, the validation set contained a total of 176,978 women, 2226 of which were diagnosed with invasive breast cancer in the five-year projection interval. Information on the risk factor ‘atypical hyperplasia’ required for the Gail model was not included in the data set and therefore set to NA for all individuals. Only 78,106 had complete information on all other variables. 3.3 Validation Methods For all women in the validation set, five-year breast cancer probabilities were calculated using a program developed by the BC Cancer Agency. The program is based on Version 3.0 (May 2011) of the Gail model source code available from the webpage of the National Cancer Institute (2012) and allowed us to process calculations for all women in batch mode. As an implementation of the Gail model, it returns a woman’s probability of developing invasive breast cancer when standard Gail risk factors age, age at menarche, age at first 30 3.3. Validation Methods live birth, number of affected first degree relatives, the number of previous benign breast biopsies, atypical hyperplasia and ethnicity are provided. To ensure the reproducibility of our risk assignments, the program’s calculated risks were compared to those of the online calculator (National Cancer Institute, 2012) for 50 randomly selected individuals. The predictive performance of the Gail model was assessed by comparing the model’s predicted number of breast cancers (E) with the observed number (O) in a given subgroup of women. Variables considered for the validation analysis are age (by five-year age groups) and ethnicity. The expected value Ej in category j is calculated as nj Ej = pij (τ = 5|Xi ) (3.7) i=1 where pij is the ith woman’s five-year cancer risk and nj the number of women in category j. To investigate whether the ratio between observed and expected cancers in each category is significantly different from 1, confidence intervals for E/O ratios were calculated following the steps in Costantino et al. (1999). Assuming expected values E to be constant and observed values to follow a Poisson distribution P ois(O) with expected value O, a 95% confidence interval for E/O is given by as 2E 2E , χ2 (0.975, 2(O + 1)) χ2 (0.025, 2O) (3.8) χ2 (0.025, 2O) χ2 (0.975, 2(O + 1)) , 2 2 (3.9) is the exact 95% confidence interval for the detected number of breast cancers O under the assumption of a Poisson distribution. Further, global goodness of fit over all age and ethnic categories j was assessed by the χ2 test statistic defined as k 2 X = j=1 (Ej − Oj )2 Ej (3.10) where k denotes the number of categories. Under the null hypothesis of no difference in frequency distributions, X 2 follows a χ2 distribution with k − 1 degrees of freedom. 31 3.4. Performance Evaluation The discriminatory performance of the model was evaluated by use of the concordance statistic c. The concordance is defined as c = Pr(p(τ |Xi ) > p(τ |Xj ) | Ni (τ ) = 1, Nj (τ ) = 0) (3.11) and describes the probability that the predicted breast cancer risk p(τ |Xi ) exceeds p(τ |Xj ) for two randomly chosen individuals i and j with cancer outcomes Ni (τ ) = 1 and Nj (τ ) = 0 (Harrell, 2001). The concordance statistic takes values in the interval [0, 1]. A value of c = 1 indicates that the risk intervals for cases and non-cases can be separated into two mutually exclusive categories by a risk threshold T . Hence, future cancer cases and non-cases can be uniquely identified based on their probability pi lying above or below T . A value of c = 0.5 on the other hand indicates that, based on the model’s prediction, a woman’s assignment into one of the two categories is as accurate as a random guess. The concordance statistic is independent of disease prevalence in the sample (Ash and Shwartz, 1999), hence, the assessment of discriminatory performance is not affected by the selection bias introduced through exclusion of women lost to follow-up. The function survConcordance() available from the R package survival offers calculation of concordance values and corresponding standard errors. 3.4 3.4.1 Performance Evaluation Risk Distribution For the 176,978 women included in the validation set, the Gail model predicted an average five-year breast cancer risk of 1.11% with minimum and maximum risks of 0.19 % and 12.2 %, respectively. 95% of all individuals had estimated risks below 2.29% while projected risks for the remaining 5% varied between 2.29% and 12.2%, resulting in a skewed distribution of breast cancer probabilities. Figure 3.2 shows the risk distribution of five-year cancer probabilities for the validation set. The majority of the data appear to be approximately normally distributed. The long, right tail of the histogram containing only a few individuals is particularly noticeable. Cancer 32 3.4. Performance Evaluation Relative Frequency 0.8 0.6 0.4 0.2 0.0 0 5 Gail Model Risk (%) 10 Figure 3.2: Distribution of five-year breast cancer probabilities for 176,978 SMP BC participants as estimated by the Gail model. rates seem to be inflated for these women and require further investigations into covariate combinations responsible for these high risk estimates. 3.4.2 Assessment of Model Calibration We investigated the model’s performance separately for the full validation set as defined in Section 3.2 and for a subset of 78,106 individuals with complete information on all available Gail model variables. Table 3.2 shows the number of individuals n, observed and expected breast cancers, E/O ratios and 95% confidence intervals for each validation set, overall and stratified by five-year age categories and ethnic groups. When all individuals are considered, the Gail model predicted a total number of E = 2110.25 breast cancers for the validation set, underestimating the detected number of O = 2226 by 5.2%. The difference between expected and observed values is marginally significantly with a 95% confidence interval of (0.91, 0.99) for the E/O ratio. Thus, the Gail model appears to be miscalibrated for the SMP BC population. Within five-year age categories, the model underestimated the observed number of breast cancers for all age categories with the exception of 40-44 and 60-64. However, E/O ratios 33 3.4. Performance Evaluation Table 3.2: Numbers of expected and observed cancers, E/O ratios and 95% confidence intervals for age and ethnic groups for (1) 176,978 women and (2) 78,106 women with complete information. Category (1) Age Ethnicity n O E E/O 95% CI 40-44 45-49 50-54 55-59 60-64 65-69 70-75 Caucasian South Asian Aboriginal Black East Asian 30813 40869 34447 25006 19757 16159 9927 94047 5048 1639 350 20678 176978 209 383 376 388 286 308 276 1148 48 32 3 183 2226 218.23 378.52 372.3 338.26 315.08 291.51 196.35 1171.25 26.98 18.46 3.99 119.24 2110.25 1.04 0.99 0.99 0.87 1.10 0.95 0.71 1.02 0.56 0.58 1.33 0.65 0.95 0.91-1.20 0.89-1.10 0.89-1.10 0.79-0.97 0.98-1.24 0.85-1.06 0.63-0.80 0.96-1.08 0.42-0.76 0.41-0.84 0.46-6.45 0.56-0.76 0.91-0.99 40-44 45-49 50-54 55-59 60-64 65-69 70-75 Caucasian South Asian Aboriginal Black East Asian 17415 19694 13970 9864 7698 6010 3455 56021 3371 1027 193 12640 78106 120 191 133 124 101 101 93 654 26 19 2 104 863 128.06 181.78 148.16 127.39 113.97 101.8 63.77 697.23 18.17 11.82 2.13 76.15 864.94 1.07 0.95 1.11 1.03 1.13 1.01 0.69 1.07 0.70 0.62 1.07 0.73 1.00 0.89-1.29 0.83-1.10 0.94-1.33 0.86-1.24 0.93-1.39 0.83-1.24 0.56-0.85 0.99-1.15 0.48-1.07 0.40-1.03 0.29-8.79 0.60-0.90 0.94-1.07 Total (2) Age Ethnicity Total 34 3.4. Performance Evaluation were only significantly different from 1 for age categories 55-59 and 70-75. The considerable lack of fit for ages 70-75 (O = 276, E = 196.35) is most likely caused by elevated cancer rates due to the high proportion of women lost to follow-up in this age group. The Gail model shows no significant miscalibration for Caucasian and African ethnicities, but underestimates breast cancers for South Asian, East Asian and Aboriginal ethnic groups significantly by 43.8%, 34.8% and 42.3%, respectively. Testing the overall goodness of fit between expected and observed breast cancers results in a χ2 test statistics of X 2 = 43.72 and X 2 = 61.11 for age and ethnic categories. Corresponding p-values were highly significant in both cases, indicating a significant difference in frequency distributions. With a prediction E = 864.94 versus O = 863 detected breast cancers, the overall fit of the Gail model improves considerably when all individuals with missing risk factor information are excluded. The model overpredicts the observed number by only 1.94 cancers, resulting in a E/O ratio of 1. As opposed to a ratio of E/O = 0.95 observed earlier, the Gail model appears to be well calibrated for the SMP BC population when all necessary risk factor information is known. Similar improvements can be seen for the assessment within five-year age categories. Figure 3.3 shows the E/O ratios for age and ethnic groups. With the exception of women aged 70-75, the model seems well calibrated for all age groups showing no E/O ratios significantly different from 1. Again, the underestimation of cancers seen in the age group of 70-75 can in part be explained by biased cancer rates and does not necessarily indicate a miscalibration of the model. When splitting the individuals into ethnic subgroups, the number of breast cancers remains above expectation for Asian and Aboriginal ethnicities, however, differences between observed and expected values were not significant for South Asian and Aboriginal groups. With E/O = 0.73 (CI: 0.60-0.90), a significant miscalibration is still present for East Asian ethnicities. Test statistics for the global χ2 hypothesis test decreased considerably to X 2 = 17.5 and X 2 = 20.6 for age and ethnicity, however, remain highly significant with p < 0.01. The distribution of observed cancers for both age and ethnicity differs significantly from the expected numbers. 35 3.4. Performance Evaluation 1.25 1.25 E/O Ratio E/O Ratio 1.00 1.00 0.75 0.75 0.50 0.50 0.25 40-44 45-49 50-54 55-59 60-64 65-69 70-74 Age Category Black Caucasian E.Asian First Nations S.Asian Ethnicity Figure 3.3: Ratio of expected and observed cancers with 95% confidence interval in five-year age categories and ethnic groups for 78,106 individuals with complete Gail information. 3.4.3 Discriminatory Ability The model’s ability to discriminate between future cases and non-cases using projected breast cancer probabilities p was comparably poor for both validation sets. Concordance values were c = 0.613 (95% CI: 0.601-0.625) and c = 0.611 (95% CI: 0.591-0.630) for the full and reduced validation set, respectively. With both confidence intervals lying well above the theoretical threshold value of 0.5, the Gail model significantly outperforms a random classification process, however, applicability in the prediction of breast cancer outcomes on an individual level remains limited. Figure 3.4 shows the substantial overlap between the estimated risk densities of cases and non-cases for 78,106 women with complete Gail model information. Distributions were obtained from the R function geom_density() using normal kernel density estimation. Although the mean five-year risk is higher for cases with p¯ = 0.015 versus p¯ = 0.011 for non-cases, there is no distinct separation between probability ranges for the two outcomes. If a risk threshold T was to be placed to separate risks into two outcome categories (cancer/no cancer), a considerable number of future cancer patients would be considered as low risk individuals. 36 3.4. Performance Evaluation Density 0.75 Diagnosis 0.50 no BC BC 0.25 0.00 0 1 2 3 Breast Cancer Risk (%) 4 5 Figure 3.4: Estimated five-year cancer risk densities stratified by cancer outcome for 78,106 women with complete Gail model information (x-axis is truncated at x = 5). Based on these results, the Gail model’s predicted breast cancer risk is not a reliable tool to determine who will and will not develop cancer in a future time period. Overall, the Gail Model appears to be well calibrated to predict the number of cancers in the SMP BC population and various subgroups, but fails to predict cancer outcomes for individual women. Thus, improvements on discriminatory accuracy are needed. 3.4.4 Comparison to Incidence Model To decide whether the calibration of the Gail model significantly outperforms that of a null model ignoring all covariate information, we compared the observed number of cancers in five-year age categories for 176,978 women with the numbers predicted by British Columbian incidence rates. Provincial incidence rates per 100,000 person-years were calculated as the mean incidence between 2000 and 2004 in British Columbia, available online from Statistics Canada (2011). For each five-year age category, Table 3.3 shows the person-years contributed by SMP BC participants, provincial incidence rates in British Columbia, the expected and observed 37 3.5. Residual Analysis Table 3.3: Person years of 176,978 SMP BC women, B.C. incidence rates, observed cancers (O) and expected cancers predicted by B.C. incidence (E) for five-year age categories. age person years B.C. E O E/O 95% CI 40-44 45-49 50-54 55-59 60-64 65-69 70-74 67914.1 185256.4 197207.42 142747.42 110130.34 89044.13 67925.92 99.2 162.8 204.6 247.0 285.4 303.0 328.0 67.37 301.6 403.49 352.59 314.31 269.8 222.8 61 321 403 344 360 285 289 1.1 0.94 1 1.02 0.87 0.95 0.77 0.86-1.44 0.84-1.05 0.91-1.11 0.92-1.14 0.79-0.97 0.84-1.07 0.69-0.87 number of cancers in the five-year follow-up period and E/O ratios with 95% confidence intervals. With the exception of age categories 60-64 and 70-74, the ratio of expected and observed cancers is not significantly different from 1. Thus, the incidence model is well calibrated for most age categories without consideration of any risk factor information. The number of breast cancers in category 70-74 was underestimated by 22.9%. Again, the miscalibration can partly be explained by biased cancer rates in this age category. The Gail model outperformed the null model only for age category 60-64 where the incidence model underestimated the observed number of cancers significantly by 12.7%. The Gail model, on the other hand, showed no significant miscalibration for this age range. As a result, the good calibration performance of the Gail model is not surprising. A reasonably accurate prediction of expected cancers can already be obtained by incidence rates alone. 3.5 Residual Analysis Initiated by the highly skewed risk distribution observed in Section 3.4.1, the adequacy of the linear predictor in Equation (3.1) needed investigation. For each individual with complete Gail model information, we calculated the standardized Pearson residuals defined 38 Pearson Residuals 3.5. Residual Analysis -0.2 -0.1 0.0 0 1 2 Number of affected relatives 3 Figure 3.5: Boxplot of Pearson residuals from Gail model stratified by the number of affected relatives. The y-axis is restricted to display the model’s residuals of non-cases only. as i = Ni − pi pi (1 − pi ) , i = 1...n (3.12) where Ni denotes the cancer status at time τ = 5 for individual i and pi the predicted probability by the model. Systematic patterns in residual plots against the explanatory variables indicate functional misspecifications of the linear predictor (Therneau and Grambsch, 2000). Figure 3.5 shows the Pearson residuals stratified by the number of affected first degree relatives in a box plot. Variable levels 0,1,2 and 3 contained 68292, 9306, 505 and 3 individuals, respectively. To improve visibility of the majority of residuals, the y-axis was restricted and excludes the residuals of all individuals with cancer outcome Ni (τ = 5) = 1. We see a non-linear, positive pattern in the residuals peaking at two affected relatives, suggesting that the increase in relative risk for women with one or two affected relatives is too high. A very similar pattern was found within the excluded points. There are two potential causes for the lack of fit in these family history categories. First, this pattern might be caused as the Gail model includes the number of affected relatives as an ordinal rather than a categorical variable. Assuming a log linear increase in relative 39 3.6. Optimization of Risk Threshold risk with increasing numbers of relatives might not be sensible and requires further analysis. This ordinal modelling of family history also appears to be the cause of inflated cancer risks for certain individuals as seen in the highly skewed risk distribution of Figure 3.2. The Gail model’s parameter associated with the variable num_rel is the overall highest with a value of 0.958 for Caucasian women, leading to an increase in relative risk by a factor of 2.61 and 6.79 for one and two affected relatives, respectively. In addition, exploratory analysis showed that individuals with the highest five-year Gail risk had a proportionally high number of affected relatives. Further, stratification of family history with respect to the number of relatives only might not be sufficient. Instead, the type of relative affected should also be taken into account as there is a intuitive difference in risk between a mother and a daughter as a result of the age difference alone. Both alternatives to the current use of family history information in the Gail model will be considered in the following model building process. 3.6 Optimization of Risk Threshold The lack in discriminatory performance of the Gail model is seen in the previous section, what follows is a more detailed analysis in the potential benefits of a population partition based on five-year Gail risks. An optimal risk threshold T ∗ defining the partition between a high and low-risk group should simultaneously satisfy two criteria. First, the proportion of women with cancer probabilities greater than threshold T ∗ cannot exceed a predefined upper bound to ensure affordability of women’s access to enhanced screening procedures. At the same time, this proportion should also be large enough to make a resulting change in screening policies sensible and worthwhile. Second, the difference in relative cancer frequencies fj , j = 1, 2 between the binary partition of the population based on threshold T ∗ should be maximized. To find such an optimal value T ∗ , we investigated relative cancer frequencies in the high and low-risk partitions of 78,106 women without missing covariate information as a function of T . For each threshold T , we evaluated f (T ) = = 1|pi ≥ T ) , i I(pi ≥ T ) i I(Ni (5) T = 0, . . . , max(pi ) (3.13) 40 3.6. Optimization of Risk Threshold where Ni denotes the cancer status for individual i and pi the five-year risk as predicted by the Gail model. The step size of T was chosen as 0.1%. As the number of individuals with risk estimates pi > T decreases quickly with T , 95% confidence intervals for relative cancer frequencies f (T ) were calculated using an exact method rather than a normal approximation. Define y as the absolute number of cancers within n individuals above threshold T . The confidence bounds are the smallest and largest probabilities for which the occurrence of the observed proportion y/n has a probability that is at least equal to α/2 (Collett, 2002). Hence, the lower bound pl (T ) is the value such that pl (T ) fulfills the equation n j=y n pl (T )j (1 − pl (T ))n−j = α/2 j and the upper bound pu (T ) is the value such that pu (T ) fulfills y j=0 n pu (T )j (1 − pu (T ))n−j = α/2. j Exact confidence bounds were calculated using the function binom.confint() available from the R package binom. The left hand side graph of Figure 3.6 shows the relative cancer frequency in the highrisk group as T varies between 0 and the maximum observed Gail risk in the sample of 78,106 SMP BC women (black), the 95% lower confidence bound (dashed) and the overall prevalence level (blue). The right hand side graph displays the percentage of women with estimated cancer risks p exceeding T . Relative cancer frequencies for women with pi > T are consistently greater than the observed prevalence of 0.011 with the exception of the right most bounds. For threshold values greater than 0.082, the relative frequency drops to 0, indicating that the women with the highest estimated risk were free of cancer by the end of the five-year study period. Due to the steep decrease in high risk women, the smoothness of the relative frequency curve decreases with increasing T . Values of f are predominately increasing until a local maximum is reached at risk threshold T = 0.031. A short drop is followed by a steep increase in f until the overall maximum 41 3.6. Optimization of Risk Threshold 1.00 Proportion of women above T Cancer Detection Rate f(T) 0.20 0.15 0.10 0.05 0.00 0.75 0.50 0.25 0.00 0.000 0.025 0.050 Risk Threshold T 0.075 0.000 0.025 0.050 Risk Threshold T 0.075 Figure 3.6: Left: Cancer detection rate above threshold T (black) with lower 95% confidence bound (dashed) and prevalence (blue) for 78,106 women. Right: Proportion of women with risk estimates pi above risk threshold T . of 0.20, however, the lower 95% confidence bound remains below or in the proximity of the prevalence level for all values T > 0.037. If one was to restrict the smallest possible proportion of high risk women to 1% of the population, the maximum achievable increase in cancer detection rates relative to prevalence is obtained at T = 0.031 where an approximately 2.5 fold increase from 0.011 to 0.026 is observed. For this choice, approximately 1.74% of the SMP BC population then becomes eligible for advanced screening. The value T = 0.031 is indicated on both plots of Figure 3.6 by a vertical dashed line. Comparing these numbers to the increases in odds ratios seen in Section 2.3.3 through population stratification on a single risk factor alone, a Gail model threshold offers a slim advantage over any of the available risk factor levels. However, increases in relative risk between high and low risk groups still remain too low to consider this stratification as a basis for improved screening policy. 42 Chapter 4 Model Building The following chapter introduces two common modelling approaches in the presence of right censored survival data. We investigate the effect of available risk factors on breast cancer free survival times using univariate and multivariate survival modelling and build a breast cancer risk prediction model for the SMP BC population. 4.1 Survival Models For the modelling of breast cancer free survival times, we consider the non-parametric additive regression model introduced by Aalen (1980) as well as the semi-parametric Proportional Hazards Model (Cox, 1972). The following notation is consistent with Martinussen and Scheike (2006). 4.1.1 Cox’s Proportional Hazards Model Introduced by D.R. Cox (1972), the proportional hazards model assumes that the hazard function λ(t) can be described as a multiplicative relationship between a time-dependent baseline hazard λ0 (t) and a log-linear relative risk component r. It is given by λi (t) = λ0 (t) exp(XiT β) (4.1) where Xi = (Xi1 , . . . Xip )T denotes a p-dimensional covariate vector and β represents the parameter vector. The baseline hazard λ0 is assumed to be independent of the covariates Xi and, hence, identical for all individuals. The model parameter βk represents the increase in log hazard rates associated with a unit increase of Xik when all other variables are held constant. The exponent exp(βk ) can be interpreted as an increase in risk relative to an individual with baseline exposure level of the risk factor Xik . 43 4.1. Survival Models A key assumption of the Cox model is that hazard ratios are constant over time t. Consider two individuals i and j with covariate vectors Xi and Xj , respectively. The hazard for individual i relative to j is exp(XiT β) λi (t) λ0 (t) exp(XiT β) = = T λj (t) λ0 (t) exp(Xj β) exp(XjT β) (4.2) and is independent of time t. The parameter vector β is estimated without specification of λ0 (t) by the method of partial likelihood (Cox, 1972). Using a counting process interpretation of the hazard function λi (t), the Cox model can be restated as λi (t) = Yi (t)λ0 (t) exp(XiT β), (4.3) where the risk indicator Yi (t) ensures that λi (t) = 0 when the individual is not at risk. For untied event times t(k) , the partial likelihood function is defined as n Yi (t) exp(XiT β) n T j=1 Yj (t) exp(Xj β) P L(β) = t i=1 dNi (t) . (4.4) where dNi (t) denotes the ith individual’s change in event status within the infinitesimal time interval [t, t + dt]. Intuitively, the partial likelihood function can be motivated as a product of conditional probabilities at each event time. Assume individual k experienced the event of interest at a given event time t(k) . Then, the partial likelihood represents the probability that, knowing an event has occurred at time tk among all individuals at risk, it was in fact experienced by individual k. The denominator of Equation (4.4) only considers individuals at risk before time t(k) and therefore excludes all previously censored observations. We estimate β through maximization of the partial likelihood function. Differentiating the log partial likelihood with respect to β gives the score vector n τ (Xi − x ¯(β, s))dNi (s) U (β) = i=1 (4.5) 0 44 4.1. Survival Models where x ¯(β, s) is the weighted mean of X for individuals still at risk at time s, x ¯(β, s) = T i Yi (s) exp(Xi β)Xi T i Yi (s) exp(Xi β) (4.6) ˆ of the equation The solution β ˆ =0 U (β) is a consistent estimator of β and is asymptotically normal with mean β and variance E(I(β))−1 , the inverse of the expected information matrix (Therneau and Grambsch, 2000). For tests of hypotheses about β, H0 : β = β (0) , the Wald statistic is given by ˆ − β (0) )T I( ˆ − β (0) ) ˆβ T = (β (4.7) ˆ Under the null hypothesis, where Iˆ is the the observed information matrix at the solution β. T follows a χ2 distribution with p degrees of freedom. The R package survival (R Core Team, 2012) offers the Cox model fitting algorithm coxph() and is used to estimate model parameters β and conduct tests of hypotheses. 4.1.2 Aalen’s Additive Model A linear regression approach in modelling of the hazard function λ(t) is taken by Aalen (1980). Covariates are assumed to act additively on the baseline hazard λ0 (t), giving the functional form λi (t) = λ0 (t) + XiT β(t) (4.8) where β(t) is a p-dimensional, locally integrable parameter vector. Unlike in the previously introduced Cox model, the model coefficients are no longer assumed to remain constant over time but may vary with t. Hence, no functional form is imposed, making Aalen’s model a non-parametric method. We define a time-dependent n × p design matrix, X(t) = (Y1 (t)X1 , . . . , Yn (t)Xn )T . The rows of X(t) contain the covariate vectors Xi , i = 1 . . . n if individual i is at risk at time 45 4.1. Survival Models t and a null vector (0, . . . , 0)T otherwise. For reasons of simplicity and faster convergence rates, model parameters β are estimated in cumulative form t β(s)ds B(t) = (4.9) 0 through linear regression models for each increment dB(t), leading to estimating equations ˆ = X − (t)dN (t) dB(t) (4.10) where X − is a generalized inverse of X X − = (X T X)−1 X T . (4.11) In integral form, we obtain the estimator of cumulative effects t ˆ = B(t) X − (s)dN (s). (4.12) 0 Martinussen and Scheike (2006) show that √ ˆ n(B(t) − B(t)) converges to a mean zero Gaussian process. To test the null hypothesis of no effect on survival times for covariate j, H0 : Bj (t) = 0, ∀t ∈ [0, τ ], we consider the maximum deviation from zero in the time interval [0, τ ], ˆj (t) . T1 = sup B (4.13) t∈[0,τ ] The percentiles of the test statistic T1 are estimated as follows. If model (4.8) holds, the √ ˆ asymptotic distribution of n(B(t) − B(t)) may be approximated by n n t 1/2 i=1 (X T X)−1 Xi dMi (s) (4.14) 0 where Mi (s) is a mean zero Gaussian martingale, i.e. a stochastic process with E[dMi (s)] = 46 4.2. Modelling Population 0. The percentiles of the process in Equation (4.14) can then be estimated by replacing dMi (s) with Gi dNi (s) and simulating N realizations of Gi where Gi ∼ N (0, 1). Further, effect constancy of βj (t) over time given by the null hypothesis H0 : Bj (t) = Bj t ˆj (t) is assessed using the maximum observed deviation between the estimated parameter B ˆj (τ ) t/τ , and the expected, linear trend under the null hypothesis, B ˆj (t) − B ˆj (τ ) t | T2 = sup |B τ t∈[0,τ ] (4.15) Again, the simulation approach described above is used to estimate the percentiles of the test statistic (Martinussen and Scheike, 2006). Aalen’s additive model provides a useful supplement to the Cox model in applications where effect constancy is questionable or needs initial investigation. The model’s parameters can be visualized as a function of time and provide a useful tool in assessment of a covariate’s effect over time. However, the additional flexibility of time-varying effects comes at the cost of statistical power, potentially allowing smaller variable effects to go undetected by tests of hypotheses. If the proportional hazards assumption holds for a particular set of covariates, the Cox model is also preferable in terms of computational time. In the following analysis, the function aalen() available from the R package timereg (R Core Team, 2012) was used to estimate model coefficients and conduct tests of hypothesis. 4.2 4.2.1 Modelling Population Estimation of Time to Diagnosis As described in Section 2.4, the follow-up mechanism of the SMP BC database introduces informative censoring into the sample. A woman’s censoring probability cannot be considered independent from her disease status as women initially lost to follow-up find their way back into the database once diagnosed with breast cancer. It follows that P (Ci < 5|Ti < 5) < P (Ci < 5|Ti > 5), (4.16) 47 4.2. Modelling Population implying a dependency between censoring time Ci and event time Ti . However, justification of the conditional probability argument of Cox’s partial likelihood Equation (4.4) requires random censoring. To obtain valid results, the modelling population is adjusted by exclusion of all women with cancers diagnosed outside the SMP BC screening program. It will be assumed that cancer is detected through SMP BC screening if the cancer diagnosis follows within a predefined time period after the last recorded screening date. What is needed is an estimate of this time difference. For 855 women with prevalent cancer found at the 2000 screening appointment, the data set provided both screening and diagnosis dates, allowing us to investigate the time between cancer detection and final diagnosis. The results are presented in the right hand side graph of Figure 4.1. It can be seen that most cancers are diagnosed within 50 days of the screening mammogram, however, the histogram is skewed to the right with some outlying observations diagnosed after more than 200 days. The left hand side graph shows the time differences between the last recorded screening date and cancer diagnosis for all non-prevalent cases in the data set of 221,545 women. Again, the histogram is highly skewed with a long right tail. The local minimum seen around 200 days appears to be a good threshold value, especially as it is consistent with the maximum diagnosis time observed in 2000. However, since only women with a negative result at the last screening (i.e. no cancer was detected) are subject to exclusion, this interval seems too short. Time for cancer development, detection and diagnosis needs to be taken into account. Finally, we decided to exclude all cancers found after one year of the last recorded screening date. This allowed us to reduce the presence of informative censoring, while assuring that enough cases remain in the data set to maintain power of hypothesis testing (Therneau and Grambsch, 2000). 4.2.2 Exclusion of Individuals The sample used for the model building process was restricted to individuals with complete information on all the risk factors, resulting in 101,790 of 221,545 women to be excluded from the analysis. As cancer rates between women with complete and incomplete variable information differed by 0.2%, missing information was assumed to be independent of the 48 4.2. Modelling Population 0.006 Relative Frequency Relative Frequency 0.02 0.004 0.002 0.000 0.01 0.00 0 500 1000 Time in days 1500 2000 0 50 100 Time in days 150 200 Figure 4.1: Left: Histogram of time between last recorded SMP BC screening date and cancer diagnosis of 4078 women. Right: Histogram of time between the 2000 screening appointment and diagnosis date for all prevalent cancers. survival status. Further, 8757 individuals without follow-up, i.e. whose 2000 screening date is also the last recorded SMP BC screening date, were excluded. To fit the survival framework, women entering the sample are also required to be without breast cancer at the beginning of the follow-up period. Therefore, 855 women with prevalent cancers detected at the 2000 screening appointment were removed from the data set. Based on the time intervals described above, we estimated that 847 invasive and 31 in situ cancers were detected outside the SMP BC program. These cases were excluded as well. Detected cancers were stratified by type to allow for separate risk modelling and comparisons of risk factor effects on survival status. Overall, the two resulting data sets obtained for the model building process include 108,921 undiagnosed individuals and 942 and 311 women diagnosed with invasive and in situ cancers, respectively. For non-cases with recorded SMP BC screening dates later than 2005, the censoring time T˜i was defined as the time between the 2000 screening mammogram and December 31, 2005. For non-cases not returning for SMP BC screening, the censoring time is marked as the last recorded SMP BC screening appointment. The time to event for all women diagnosed with 49 4.3. Analysis of Family History breast cancer is defined as the time between the 2000 screening date and the cancer diagnosis date. There were no tied event times. 4.3 Analysis of Family History Residual analysis in Section 3.5 showed the inadequacy of a woman’s family history use in the current Gail model. Ordinal modelling of family history and stratification based on the number of affected relatives alone resulted in systematic patterns within the model’s residuals and led to inflated cancer risks. To improve on these issues, associations between the cancer outcomes of first degree relatives are examined in the following. The analysis is based on 108,921 non-cases and 942 cases diagnosed with invasive cancer as defined in Section 4.2.2. The variable f am_hist was converted into four binary variables indicating the presence of an affected mother (mother), sister (sister), daughter (daughter) and father or brother (f ather). Overall, 7.99% of women had a mother diagnosed with breast cancer, 4.61% a sister and only 0.24% an affected daughter. None of the 24 women with an affected father or brother were diagnosed with breast cancer in the five-year follow-up period. Thus, variable f ather will not be considered in the following analysis. Figure 4.2 shows the age-specific relative frequency of affected relatives stratified by type. With the overall lowest values, relative frequencies of affected daughters within the SMP BC sample show a gradual increase with age. A similar but steeper incline can be seen for frequencies of affected sisters. Both trends are easily explained by the rising incidence associated with increasing age. However, unlike the trends for daughters and sisters, the presence of an affected mother becomes less likely with increasing age. In this case, the effect of age appears to be overwhelmed by introduction of early detection programs, leading to the increased detection of cancer in recent years. Estimates of the effect of relatives diagnosed with breast cancer on survival times were obtained by fitting a Cox proportional hazards model containing three indicator variables derived from the variable f am_hist. Due to the strong, age related patterns seen in Figure 4.2, the woman’s age was also included. The linear predictor of the model is given by ri = exp(β1 mother + β2 sister + β3 daughter + β4 age) (4.17) 50 Relative Frequency 4.3. Analysis of Family History 0.10 Relative Daughter Mother Sister 0.05 0.00 40-44 45-49 50-54 55-59 60-64 Age Category 65-69 70-74 75-79 Figure 4.2: Relative frequency of affected first degree relatives by type of relative, as a function of five-year age categories. where the parameters βi , i = 1, 2, 3, 4, represent the increase in hazard ratios associated with the four explanatory variables. Due to the limited number of cases with daughters diagnosed with breast cancer, five overall, the corresponding model effect was not significant and the variable daughter was excluded from the model. For the remaining variables, Table 4.1 shows the parameter estimates βˆk , relative risks, standard errors and p-values of the hypothesis test H0 : βk = 0. Both mothers and sisters diagnosed with breast cancer have a significant effect on the survival time with corresponding p-values < 0.0001. An affected sister leads to the overall highest Table 4.1: Parameter estimates, relative hazards, standard errors, Wald statistics and p-values for family history model. variable coef exp(coef ) s.e.(coef ) z p-value age mother:yes sister:yes 0.034 0.509 0.680 1.035 1.664 1.973 0.003 0.098 0.110 10.347 5.170 6.193 0.000 0.000 0.000 51 4.4. Modelling of Invasive Cancers increase in hazard rates by a factor of 1.973, while an affected mother increases the hazard rate by a factor of 1.664. Based on these results, we see that different types of relatives have different effects on survival times, making it important to differentiate not only by absolute numbers but also by the relationship. Ordinal representation also seems inappropriate. If the same model is fitted with stratification by numbers only, λ(t) = λ0 (t) exp(β num_rel) (4.18) the increase in hazard ratios associated with the variable num_rel is estimated as exp(β) = 1.73. Comparing models (4.17) and (4.18),the relative risk for a woman with an affected mother, daughter and sister increases by a factor 1.664 · 1.973 = 3.28 and 1.733 = 5.18, respectively, a considerable difference given the overall magnitude of the parameters. A particular problem with the collected family history information is that we have no information whether the participants have sisters or daughters in their family. The category ‘O’ indicating that the woman is the only child in the family seems to have been used in an inconsistent fashion with 51 entries in the original data set and therefore, cannot be considered a reliable indicator. While information on the breast cancer status of the women’s mother is reliable in this respect, sisters and daughters are always assumed to be free of cancer without knowing about their existence. Without adjustment, this leads to biased probability estimates in the risk model, as baseline exposure levels are always assumed for each of the unknown cases. For proper use of family history information, models need to take this into consideration, and more detailed data should to be collected by the SMP BC. 4.4 Modelling of Invasive Cancers We consider 17 risk factors given in Table 2.1 as the candidate covariates in the modelling of invasive breast cancer for 109,863 SMP BC participants. As a result of Section 4.3, the variable f am_hist is converted into three binary variables indicating the presence of an affected mother, sister and daughter. To reduce categories and hence, parameters to be estimated for variable ethnic, six European ethnicities were combined into a new category ‘Caucasian’. An additional explanatory variable is derived converting the year 2000 screening 52 4.4. Modelling of Invasive Cancers result into a binary indicator describing the presence of an abnormal result. As prevalent cancers were removed from the data set, the detected abnormalities for the remaining women were found to be non-cancerous in further examinations following the abnormal result. For individuals whose response to variables hysterec, both_ovrem, f am_hist50 is missing due to non-applicability, the corresponding covariates were recoded to baseline exposure levels. 4.4.1 Single Covariate Modelling In an initial univariate analysis, we fitted proportional hazards and Aalen models for each of the available explanatory variables and investigated their effects on breast cancer free survival time. For the Cox model, the significance of variable effects was established using likelihood ratio tests, for Aalen’s model using the supremum test described earlier. The model coefficients β and p-values of the hypothesis test H0 : β = 0 are given in Table 4.2. Parameter values βˆ for categorical variables with more than two levels are not listed. For a single covariate Cox model, significant effects on survival times with p < 0.01 were found for the presence of an abnormal mammogram, age, age at last menstrual cycle, current estrogen use, ethnicity, previous breast biopsies, a hysterectomy, menopausal status and an affected mother and sister. Mammographic density was marginally significant at p = 0.041. For education indicator educ (p = 0.787) and reproductive history variables age_men (p = 0.732), age_1stdel (p = 0.374), num_del (p = 0.357) and ever_preg (p = 0.343) the null hypothesis of H0 : β = 0 could not be rejected. As seen in Section 2.3.4, a woman’s response on variables curr_estr, hysterec, menstr is highly dependent on her age. As a result, the significant effects corresponding with these covariates in the Cox model could not only been caused by the variables themselves but their association with age, one of the strongest risk factors for breast cancer. A simultaneous covariate model containing these variables as well as age will show whether this is the case. Although only partly comparable due to different null hypotheses, the univariate results for Aalen’s additive model were overall very similar. Again, a woman’s reproductive history described by the variables age_men (p = 0.792), ever_preg (p = 0.754), age_1stdel (p = 0.719) and num_del (p = 0.34) did not show a significant effect over time. Unlike in the Cox model, variables mammo_dens (p = 0.138) and hysterec (p = 0.065) were non-significant. The category “No education” of variable educ was significant at p = 0.02. Recalling results 53 4.4. Modelling of Invasive Cancers Table 4.2: Coefficients and p-values of univariate Cox and Aalen models for hypothesis tests of significant and time-varying effects. (* refers to all variables levels, ** refers to at least one level) Cox Model Variable abmam age age_lastmenst age_men age_1stdel curr_estr educ ethnic ever_biop ever_preg hysterec mammo_dens menstr mother num_del sister Aalen Model βˆ p (β = 0) p (β(t) = 0) p (β(t) = β) 0.342 0.036 0.010 -0.007 0.003 0.226 0.469 0.081 0.050 0.046 -0.436 0.504 0.021 0.876 < 0.01 < 0.01 < 0.01 0.732 0.372 < 0.01 0.787 < 0.01 < 0.01 0.343 < 0.01 0.041 < 0.01 < 0.01 0.357 < 0.01 < 0.001 < 0.001 < 0.001 0.792 0.719 < 0.001 < 0.025** < 0.001** < 0.001 0.754 0.065 0.139 < 0.001 < 0.001 0.340 < 0.001 0.014 < 0.001 < 0.001 0.924 0.450 0.408 > 0.141* > 0.198* 0.887 0.497 0.768 0.376 < 0.001 0.233 0.746 0.344 from Section 2.3.4, it is most likely caused by the strong association between a woman’s age and educational level. The remaining variables were again important predictors of survival with highly significant p-values < 0.001. Tests about effect constancy over time revealed that the hypothesis of proportionality could be rejected for age (p < 0.001), the presence of an abnormal mammogram (p = 0.014), menopausal status (p < 0.001) and the age at last menarche (p < 0.001). For these variables, ˆj (t) and their 95% confidence bands plots of the estimated cumulative regression parameter B are presented in Figure 4.3. Constancy of the effect βj (t) over time corresponds to a linear trend of the function Bj (t). For the cumulative effect of age, it appears as if both time periods before and after about two years after the begin of the study follow an approximately linear pattern with the same slope, however, there is a steep increase in effect size shortly after the two years mark. 54 4.4. Modelling of Invasive Cancers Cumulative Effect (abmam) Cumulative Effect (age) 4e-04 3e-04 2e-04 1e-04 0e+00 1 2 3 Time in years 4 0.000 -0.001 -0.002 -0.003 -0.004 0 1 2 3 Time in years 4 0.004 0.002 0.000 5 Cumulative Effect (age_lastmenstr) Cumulative Effect (menstr) 0 0.006 5 0 1 0 1 2 3 4 5 2 3 4 5 Time in years 8e-05 4e-05 0e+00 Time in years Figure 4.3: Cumulative regression coefficients B(t) of univariate Aalen models for variables age, abmam, menstr and age_lastmenstr. This short-lived pattern change is doubtfully caused by age alone, but by screening recall recommendations leading to an increased detection of cancer in this time period. A similar abrupt increase or decrease shortly after year two can be seen in the cumulative effects of menstr and age_lastmenstr. This leads to the conclusion that their corresponding effects may not necessarily violate proportionality, but are affected by cancer detection clustered in time after the first screening recall. Overall, the importance of exact timelines until disease occurrence is clearly shown, a property that is not fulfilled due to the nature of the cancer detection process. For an abnormal mammogram, however, the time-varying effect on survival times appears to be justified. A steep increase in effect size for approximately one and a half years is followed by a nearly constant progression and further but less steep increases. Hence, the 55 4.4. Modelling of Invasive Cancers effect of an abnormal mammogram at the first screening decreases with time, indicating that the risk of abnormalities progressing into invasive cancers is highest in the 1.5 years following their detection, after this period this chance decreases. This difference could be caused by the type of cancer and whether it is slow growing or aggressive. 4.4.2 Simultaneous Modelling of Covariates Due to the particularly high p-values seen for univariate models, the variables age_men, ever_preg, and age_1stdel will not be considered for a multivariate model of survival times. Starting with an initial proportional hazards model including all remaining covariates, the model is reduced in a backward elimination fashion by exclusion of covariates with the highest p-values until only significant variables remain in the model. Variables were excluded in the following order: age_lastmenstr (p = 0.97), menstr (p = 0.62), educ (p = 0.59), num_del (p = 0.18), curr_estr (p = 0.11) and hysterec (p = 0.07). Once these variables were removed and a final model determined, all excluded variables were added individually once again, however, a significant effect could not be found. To confirm the exclusions, the backward selection function selectCox() from the R package pec was applied to the initial model. The selection method is based on Akaike’s Information Criterion defined as AIC = 2k − 2 log(L), (4.19) where k represents the number of model parameters and L the maximized value of the likelihood. Models with smaller values of AIC are preferred. At each step, the function compares the AIC of the proposed model with the AIC of all models resulting from the exclusion of one covariate. The model with the smallest AIC value is selected as the new candidate model until no reduction in AIC can be achieved through exclusion of covariates. The selection algorithm resulted in the same subset of predictor variables as described above. When added to the final model, interactions between ever_biop and age (p = 0.47) and num_rel and age_1stdel (p = 0.99) as included in the Gail model did not show a significant effect. An interaction between mammo_dens and ethnic could also be declined with p = 0.56. 56 4.4. Modelling of Invasive Cancers Table 4.3: Regression coefficients, standard errors, Wald statistics and corresponding p-values under proportional hazards model. variable coef exp(coef ) s.e.(coef ) z age ethnic:Aboriginal ethnic:Black ethnic:EastAsian ethnic:Other ethnic:SouthAsian abmam:abnormal mammo_dens:2 mother:yes sister:yes ever_biop:yes 0.036 0.401 -0.218 -0.297 0.065 -0.211 0.371 0.297 0.474 0.653 0.283 1.036 1.493 0.804 0.743 1.067 0.810 1.449 1.346 1.607 1.922 1.327 0.003 0.261 0.708 0.111 0.129 0.212 0.115 0.076 0.099 0.110 0.082 10.336 1.535 -0.308 -2.681 0.503 -0.994 3.221 3.924 4.790 5.944 3.448 p-value 0.000 0.125 0.758 0.007 0.615 0.320 0.001 0.000 0.000 0.000 0.001 For all variables found to have a significant effect on survival times, the coefficients, standard errors and p-values for the test of H0 : β = 0 are displayed in Table 4.3. As previously suspected, the associations between survival status and the variables curr_estr, hysterec, menstr and educ were introduced by a strong linkage with age. In a multivariate model including age as a main effect, these variables no longer significantly affect survival times. The variable age resulted in a highly significant model parameter of β = 0.036, suggesting that breast cancer hazards increase by a factor of 1.036 for each additional year of life. We notice non-significant model parameters for all ethnic categories with the exception of β = −0.297 for variable level “East Asian”. Breast cancer hazards for East Asian women are therefore significantly reduced by a factor of 0.743, while there is no significant difference in hazards between the baseline of Caucasian women and the remaining ethnicities. A parameter estimate of β = 0.06 for the variable level “Other” is not surprising as a result of the ethnic composition of this category. Ethnicities include various Caucasian, Asian and African backgrounds not available as answers on the SMP BC questionnaire and show no consistency. The category “Black” consists of only 287 women and thus shows comparably large standard errors for it’s effect on survival times. Medical history also plays an important role in the prediction of survival times. An 57 4.4. Modelling of Invasive Cancers abnormal mammogram and previous breast biopsies show highly significant coefficients of 0.371 and 0.283, respectively, leading to a rise in hazard by factors 1.449 and 1.327 relative to baseline exposure levels. Mammographic density, a variable not included in the current Gail model, is found to be a significant predictor in survival times. Women with predominately dense breasts are subject to increased breast cancer hazard by a factor of 1.346. Even with the limitation of the binary density classification provided in the data set, breast density appears to be an important risk factor and could help in the stratification of cancer outcomes. The highest increases in hazard ratios are found for women with a family history of breast cancer. For individuals whose mother is affected by breast cancer, hazards increase by a factor 1.607 relative to baseline levels. A even stronger effect can be seen for an affected sister where cancer hazards are 1.922 times higher compared to women without an affected sister. The difference in effects, a factor of 1.607 versus 1.922, confirms the choice of categorical inclusion of the variable f am_hist, stratified by relative type. It seems reasonable to combine ethnic categories without significant model parameters into category “Other” to reduce on the number of estimated parameters. Although statistical testing of this reduced model’s adequacy through the method of likelihood ratio is not possible, the maximized log likelihood decreased only slightly by a value of 0.44 from L = −15079.3 to L = −15079.74 when all women with African heritage were assigned to the category “OT”. 4.4.3 Testing Proportional Hazards Assumptions Hypothesis tests of effect constancy are conducted using a residual definition proposed by Schoenfeld (1980). Let 0 < t(1) < t(2) < . . . denote ordered event times and V (β, t) = i Yi (t)ri [Xi −x ¯(β, t)]T [Xi − x ¯(β, t))] i Yi (t)ri (4.20) a weighted variance of the covariate matrix X at time t. The Schoenfeld residual at event time t(k) is defined as ˆ t(k) ) sk = X(k) − x ¯(β, (4.21) 58 4.4. Modelling of Invasive Cancers where X(k) is the covariate vector of the individual experiencing the kth event and x ¯(β, t(k) ) = i Yi (t(k) )ri Xi i Yi (t(k) )ri (4.22) is the weighted mean of X over those individuals still at risk at time t(k) . Grambsch and Therneau (1994) showed that the time-varying effect βj (t) for variable j can be approximated by the sum of the expected scaled Schoenfeld residual s∗kj = ˆ t(k) )skj and the estimated parameter value under the Cox model V −1 (β, βj (t) ≈ E(s∗kj ) + βˆj . (4.23) The hypothesis test of a constant model parameter, H0 : βj (t) = βj , is motivated by the regression approach βj (t) = βj + θj (t − t¯) and translated into an equivalent hypothesis H0 : θj = 0 about the regression parameter θj . Using the approximation given in Equation (4.23) and the fact that βˆj is an estimator of βj , it follows that E(s∗kj ) ≈ θj (t − t¯) (4.24) Estimation of θ and it’s variance Q result in the test statistic ˆ T = θˆT Q−1 θ, (4.25) following a χ2 distribution with p degrees of freedom under the null hypothesis β(t) = β. Insights into the nature of deviations from the proportional hazards assumption can also be gained from a scatterplot of the scaled Schoenfeld residuals as a function of event times t(k) . For the Cox model as given in Table 4.3, tests of proportional hazards showed a significant p-value of 0.002 associated with the presence of an abnormal mammogram. The null hypothesis of constant effects over time, H0 : β(t) = β, can be rejected. For the remaining variables, the null hypothesis could not be rejected with p-values > 0.299. In addition, the global test was not significant with a p-value of 0.211. Plots of the scaled Schoenfeld residuals for variables abmam and age, the latter sus- 59 4.4. Modelling of Invasive Cancers 12.5 10.0 beta(t) abmam beta(t) age 0.2 0.0 7.5 5.0 2.5 0.0 -0.2 0 1 2 3 Time in years 4 5 0 1 2 3 Time in years 4 5 Figure 4.4: Scaled Schoenfeld residuals for variables age and abmam with smoothing spline as a function of event time t(k) . pected to be in violation of constant effects in the univariate Aalen model, are presented in Figure 4.4. To ease visualization of the time-varying progression of the expected Schoenfeld residuals E(s∗kj ), a locally weighted smoothing spline was added to the plot and can be seen by the blue line. If the proportional hazards assumption holds, it is an approximately horizontal line. The effect for age shows a wavy pattern without any strong increase or decreases in slope. Interestingly, small increases in effect size can be found shortly before two and four years, the screening recall recommendations of the SMP BC. As seen from the vertical clustering of points within these time periods, cancer detection rates are elevated, leading to the conclusion that the variations are not necessarily caused by a time-varying effect of age, but rather by screening patterns leading to comparably higher cancer detection. However, a slight, underlying trend over time can also be detected, a potential indicator of a wrongly specified functional form of age. Unlike for variable age, the effect associated with an abnormal mammogram at the first screening does show a dominant downward trend, but again briefly interrupted around the two year time point. It is clear that proportionality cannot be assumed in this case, as 60 4.4. Modelling of Invasive Cancers Table 4.4: Parameters estimates, standard errors, Wald statistics and p-values for the stratified Cox model. variable coef exp(coef ) s.e.(coef ) z p-value age ethnic:EastAsian ethnic:Other ethnic:SouthAsian mammo_dens:2 mother:yes sister:yes ever_biop:yes 0.036 -0.252 -0.044 0.052 0.262 0.491 0.655 0.293 1.036 0.777 0.957 1.053 1.300 1.633 1.926 1.341 0.003 0.105 0.123 0.181 0.074 0.099 0.110 0.082 10.377 -2.393 -0.354 0.288 3.523 4.974 5.966 3.584 0.000 0.017 0.723 0.773 0.000 0.000 0.000 0.000 hazard ratios decrease with increasing time t. Adjustments to the original Cox model to allow for a time-varying effect need to be made. As abmam is the only variable found to violate proportional hazards assumptions, we decided to replace the initial model with a stratified Cox model. The categorical nature of abmam provides a natural stratification by variable levels ‘≥ 50%’ and ‘< 50%’. As a result, an abnormal mammogram is not a part of the linear predictor exp(XiT β) but will be represented by a second baseline hazard λ0 (t) applicable to individuals with an abnormal mammogram only (Therneau and Grambsch, 2000). The coefficients, standard errors and p-values of the stratified Cox model are presented in Table 4.4. To reduce non-significant variable levels of ethnic, categories ‘Black’ and ‘Aboriginal’ were combined with the category ‘Other’. Only minor changes at the second decimal place of the model’s coefficients could be seen after these adjustments were made. The coefficient for South Asian ethnicities showed the highest change from -0.211 to 0.052 and is now closer to zero, an acceptable change considering that the effect was not significant in both models. 4.4.4 Functional Form The adequacy of the linear predictor r for the continuous variable age needs to be confirmed. As described in Therneau and Grambsch (2000), a plot of the martingale residuals against the covariate gives insight whether the functional form is specified in the correct 61 4.4. Modelling of Invasive Cancers Figure 4.5: Plot of all martingale residuals (left) and martingale residuals of non-cases (right) under multivariate Cox model versus age. way. If it is, no distinct patterns should be seen in the residual plot. The Martingale residual for individual i is defined as Mi (t) = Ni (t) − Ei (t) (4.26) where Ei (t) = Yi (t)Λ0 (t)r(Xi , β) is the cumulative hazard at follow-up time t. A plot of the martingale residuals from the fitted Cox model versus age is shown in Figure 4.5. As dense horizontal bands complicate visualization, a locally weighted smoothing spline is added to assist in the pattern detection. The left hand side graph includes the residuals for all individuals while the right hand side graph focused on the main concentration of points in the lower half of the first graph, the residuals for all non-cases. Due to the large number of individuals displayed, the points in the scatterplot were displayed transparent rather than in solid black. The smoothing spline is approximately horizontal with a slight wavy pattern around age 50, but there are no major deviations suggesting a misspecified model. The variable age appears to be included in the correct functional form. 62 4.5. Modelling of In Situ Cancers 4.5 Modelling of In Situ Cancers In the following paragraph, we investigate risk factor effects on event free survival times for in situ breast cancers only and build a multivariate proportional hazards model with the objective to compare between variable effects for in situ and invasive cancers. With only 311 in situ cases remaining in the modelling set, model parameters will be subject to considerably more variability than in the previous analysis of invasive cancers, impacting the ability to detect effect sizes as small as in the initial model. For risk factor effect comparisons between cancer types, this needs to be kept in mind. 4.5.1 Single Covariate Modelling Following the steps in the variable selection process of Section 4.4, we built univariate Cox and Aalen models for each available explanatory variable. For the proportional hazards model, non-significant effects in the time interval were found for age at menarche (p = 0.95), ethnicity (p = 0.6), previous pregnancies (p = 0.08), the number of deliveries (p = 0.17), age at first delivery (p = 0.42), hysterectomy (p = 0.40), current estrogen use (p = 0.16) and the presence of an affected sister (p = 0.38) or daughter (p = 0.65). Age (p = 0.001), an abnormal mammogram at the 2000 screening (p < 0.001), mammography density (p < 0.001) and an affected mother (p < 0.001) were all highly significant. Education (p = 0.04), menopausal status (p = 0.03), age at last menarche (p = 0.01) and a family member diagnosed before the age of 50 (p = 0.004) were also significant, however, most likely for reasons of confounding as noted earlier. For Aalen’s model, highly significant time varying effects with p < 0.001 in the time interval [0, 5] were found for age, but not for an abnormal mammogram as observed in the modelling of invasive cancers. The corresponding parameter estimates are displayed as functions over time in Figure 4.6. As seen from the first graph, the time-varying effect for age was again artificially created by biannual screening recalls. Excess risks increase steeply around the two and four year time points. Interestingly, there is virtually no increase in cumulative effect size between these points as lines remain nearly horizontal or are even slightly decreasing. In comparison to invasive cancers where a continuos increase in effect could be seen, this gives an indication about the non-aggressive nature of in situ cancers. These cancers 63 4.5. Modelling of In Situ Cancers Cumulative Effect (abmam) Cumulative Effect (age) 1.0e-04 7.5e-05 5.0e-05 2.5e-05 0.0e+00 0.006 0.004 0.002 0.000 -2.5e-05 0 1 2 3 Time in years 4 5 0 1 2 3 Time in years 4 5 Figure 4.6: Cumulative coefficients Bj (t) as a function of time for single covariate Aalen models of age and abmam. are predominately found through screening examinations and remain undetected in the time intervals between recall examinations. The progression of effect size for an abnormal mammogram leads to similar conclusions. After a longer first period with no diagnoses and a short but steep incline, cumulative effects follow an approximately linear trend until year five. A decrease in slope cannot be seen, suggesting that the abnormalities found in the 2000 screen are detected as cancers at the same rate as in the initial years after screening. For invasive cancers, comparably more cancers were detected in this first one and a half years, indicating a faster and more aggressive growth. 4.5.2 Simultaneous Covariate Cox Model Based on the high p-values found in the univariate analysis, variables age_men, ethnic, age_1stdel, hysterec and daughter will not be considered as candidate variables in the modelling of in situ cancers. Although an affected sister showed an equally high p-value, sister will be initially included with the remaining variables as it showed the highest risk increase for invasive cancers. Subsequent elimination of the variables with the highest p-values for the hypothesis of non significant effects resulted in the following order of exclusion: number of deliveries 64 4.5. Modelling of In Situ Cancers 0.3 beta(t) age 0.2 0.1 0.0 -0.1 0 1 2 3 Time in years 4 5 Figure 4.7: Scaled Schoenfeld residuals over time for variable age with locally weighted smoothing spline. (p = 0.697), current estrogen use (p = 0.686), an affected sister (p = 0.62), education level (p = 0.601), a family member diagnosed before the age of 50 (p = 0.249), menopausal status (p = 0.104), age at last menstrual cycle (p = 0.324) and previous pregnancies (p = 0.082). The remaining variables age, an abnormal mammogram, mammographic density, an affected mother and previous breast biopsies were all highly significant with p-values < 0.001. All excluded variables were again added individually to the final model, however, no significant variables were found. The number of deliveries showed the lowest p-value with 0.08. The backward selection algorithm of selectCox() arrived at the same variables with the exception for previous biopsies. 4.5.3 Model Checking Tests of time-varying effects using scaled Schoenfeld residuals resulted in non-significant p-values for all variables except age, where we found a marginally significant p-value of 0.046. A plot of the residuals over time can be seen in Figure 4.7. For the first one and a half years, the effect of age is below 0 and shows a slightly 65 4.5. Modelling of In Situ Cancers Table 4.5: Coefficients, standard errors, Wald statistics and p-values for multivariate Cox model of in situ cancers. variable coef exp(coef ) s.e.(coef ) z p-value age abmam:abnormal mammo_dens:2 mother:yes ever_biop:yes 0.025 1.048 0.535 0.694 0.497 1.026 2.852 1.707 2.002 1.644 0.006 0.153 0.122 0.160 0.137 4.107 6.828 4.390 4.336 3.643 0.000 0.000 0.000 0.000 0.000 increasing trend over time. This suggests that in the initial years after screening, older women could be less likely to be diagnosed with in situ breast cancer, most likely explained by slower tumour growth. Due to the low number of events seen in the first year, the trend of the smoothing spline is subject to higher variability. Table 4.5 displays the coefficients for the final model. Due to the small number of cases, standard errors are considerably higher for all parameter estimates. A two sample z-test showed that, with the exception of the effect for an abnormal mammogram (p < 0.001), the parameters for variables included in both invasive and in situ models are not significantly different. As the effect of an abnormal mammogram was found to be time-varying for invasive cancers, parameter estimates are thus only comparable as averages over the time interval [0, 5]. Some observations will be made based on the change in point estimates. The coefficient for mammography density, an affected mother and previous biopsies, each variable with a approximately constant effect for both models, increased and appears to lead to higher relative risks for in situ than for invasive cancers. The model coefficient for an abnormal mammogram increased from 0.371 to β = 1.048 for in situ cancers. Most notably, an affected sister, the most important risk factor for invasive cancers with a nearly two fold increase in relative risk, did not have a significant effect on event-free survival time from in situ breast cancer. When sister is added to the final model as given in Table 4.5, the corresponding model parameter was estimated as βˆ = 0.028 and was not significantly different from 0 with p = 0.91. The internal concordance statistic is higher for in situ cancers with c = 0.649 versus c = 0.633 for invasive cancers while containing two explanatory variables less. 66 4.6. Predictive Performance 4.6 Predictive Performance The predictive performance of the stratified Cox model of invasive cancers needs to be assessed and compared to the performance of the Gail model. We will restrict the validation analysis to the discriminatory accuracy of the new model as it is the most important property in the present context. Due to the already limited number of cases available for the model building process, we were unable to retain a subsample for external validation of the fitted model. Instead, a ‘leave-one-out’ cross validation will be performed on the modelling data set to evaluate the discriminatory performance on independent data. The steps are as follows. For each individual k, k = 1 . . . n, fit a proportional hazards model (k) λi (t) = λ0 (t) exp(Xi β (k) ) (4.27) ˆ(k) using all individuals with the exception of the kth. The and estimate model parameters β ˆ k for individual k is then predicted using the fitted model hazard λ ˆ k (t|Xk ) = λ0 (t) exp(Xk β ˆ(k) ). λ (4.28) This method ensures that each training subsample contains at least nbc −1 cases and standard errors remain comparably low. For all individuals i, five-year breast cancer probabilities P (Ti ≤ 5) = 1 − exp(−Λi (t = 5)) were projected following the 2000 screening appointment. Cumulative baseline hazards Λ0 (t = 5) were estimated by the Nelson-Aalen estimator t ˆ 0 (t) = Λ 0 ¯ (s) dN , n i=1 Yi (s)ri (4.29) ¯ (s) is the total number of failures up to time s and ri the relative risk estimate where N under the stratified Cox model (Therneau and Grambsch, 2000). Although an evaluation of the Cox model’s discriminatory performance can be performed using relative risk estimates ri only, this is not possible for a stratified model as one explanatory variable is represented 67 4.6. Predictive Performance Figure 4.8: Scatterplot of five-year breast cancer risk for 97,856 women as estimated under the Gail model and the stratified Cox model. by an additional baseline hazard. Thus, additional error associated with an estimation of the baseline hazards needs to accepted. For reasons explained in Section 3.2 as part of the Gail model validation, the following performance evaluation is restricted to individuals with at least five years of follow-up time. Thus, predicted probabilities of 12,077 individuals were excluded from the validation set, leaving a total of 97,856 individuals. The fitted model showed a concordance of 0.635 (95% CI: 0.617-0.652), only a slight improvement over the Gail model’s value of 0.611. However, including seven instead of nine explanatory variables, the new model’s set of predictors appears to have more explanatory power while containing two fewer variables. The overall mean risk decreased to 0.79% versus 1.1% predicted by the Gail model. Five-year mean risks for non-cases and cases were 0.79 and 1.02%, respectively. Due to categorical modelling of family history information, the maximum observed risk decreased from 12.2% to 8.3%, suggesting that cancer risks for individuals with multiple affected relatives are less inflated. Figure 4.8 shows a scatterplot of estimated risks for 97,856 women under the stratified 68 4.6. Predictive Performance Density 1.5 Diagnosis 1.0 no bc bc 0.5 0.0 0 1 2 3 Breast Cancer Risk (%) 4 5 Figure 4.9: Five-year cancer risk densities stratified by cancer outcome for 97,786 women as estimated by the stratified Cox model (x-axis is truncated at x = 5). Cox model and the Gail model. Risks are positively correlated with the majority of points below the diagonal. Probability values for individuals with the highest Gail risk were clearly reduced by the new model while remaining above average for the stratified Cox model. Probability separation between cases and non-cases remains poor as seen from Figure 4.9 where five-year risk densities estimated under the Cox model were stratified by cancer outcome. There is still a considerable overlap between the risk distributions of cases and non-cases, making the new model an inadequate tool for predictions about an individual’s future cancer status. Although the stratified Cox model did not meet the performance requirements to be considered a reliable tool for the SMP BC, a stronger set of predictors could be found. Two new variables abmam and mammo_dens were able to replace reproductive history variables age_men and age_1stdel and interaction terms while slightly improving the predictive performance. 69 Chapter 5 Conclusion The data set provided by the Screening Mammography Program of British Columbia presented many initial data cleaning challenges, mostly caused by misleading variable coding and improper survival ascertainment. The cancer status indicator did not differentiate between women not diagnosed with breast cancer and women lost to follow-up. In order to apply survival analysis methods, a censoring variable needed to be created using the last recorded SMP BC screening date. In turn, this definition of follow-up introduced informative censoring into the data set and lead to further adjustments. Variable coding for missing information was very inconsistent throughout the data set. Some numerical variables indicated missing covariate information with values of 0 and 99. For variable num_biop this was a particular problem as ‘0’ is also a plausible value for the number of breast biopsies. Cross-classification tables with other variables usually provided insights into the true meaning. For categorical variables, missing variables information was represented by several different indicators and required recoding to reduce variable levels. The first objective of this project was a validation study of the widely used Gail et al. model and in particular the question whether it can be used as a basis for targeted screening policies. Although the Gail model showed good calibration properties for the B.C. population, it’s ability to predict an individual’s five-year breast cancer status was inadequate. Probability estimates for cancer cases were found to overlap substantially with those of non-cases, a conclusion also drawn in the validation studies by Costantino et al. (1999) and Decarli et al. (2006). If a Gail risk criterion was to separate the population into high and low-risk groups, cancer detection rates would show a maximum increase of only twice the prevalence level. Once converted into absolute terms, cancer detection rates increase from approximately 1% to 2%, a difference that is clearly too small to provide a basis for a change in screening policies. Given the large percentage of East Asian women in British 70 Chapter 5. Conclusion Columbia, 11.5% of the SMP BC population, the Gail model’s significant underestimation of breast cancer risk for this ethnic group is particularly noteworthy. In the light of these results, the practice of presenting patients with individualized breast cancer risks also seems rather questionable. Providing good results for the ‘average’ patient is irrelevant when individualized probabilities do not reflect information about a woman’s future cancer status. This is a particular problem if online risk calculators are not accompanied by confidence intervals or statements about accuracy of risk estimates. The main objective of this research, a risk model with improved outcome stratification over the Gail model, was not met. The majority of available risk factors did not have enough explanatory power to predict who will and will not develop cancer in a five-year time interval. Although a number of predictor variables were found to have a significant effect on event-free survival time, effect sizes were generally too small to set probability values of future cases apart from those of non-cases. Significance was achieved not through effect size but as a result of the very large sample size. The overall strongest risk factor, a sister affected by breast cancer, increased a woman’s five-year probability significantly by a factor of 1.926. However, once the estimated probabilities are put into an absolute context, increases associated with the available risk factors are nearly irrelevant. With the majority of estimated risks no larger than 1%, even the strongest risk factor increases these five-year risks only to less than 2%. Generally, estimated probabilities for all cancer cases were too low and did not reflect the five-year outcome. If longer follow-up data was available, one could investigate whether the projection interval of five years is too short for cancers of presumed high-risk women to develop. Although the analysis of in situ cancers was certainly limited due to the low number of available cases, comparisons between the two cancer types gave initial insights into potential differences. Model parameters showed higher relative risks associated with mammographic density and an abnormal mammogram. The question whether the non-significant effect for a sister diagnosed with breast cancer was only the result of low case numbers could be investigated when more data are available. There were several shortcoming in the validation analysis and modelling fitting. Trying to avoid false assumptions about the cancer status of women lost to follow-up, many individuals and more importantly, cases, were excluded during the analysis. This led to biased cancer 71 Chapter 5. Conclusion rates in the validation of the Gail model. Although we attempted to minimize impacts on the predictive performance, observed breast cancers in the highest five-year age group were consistently underestimated. For model fitting, we tried to minimize the presence of informative censoring by identifying cases diagnosed outside the program. As a result, only 978 of the 4078 initially available cancer cases were eligible, making parameter estimates subject to considerably more variability than anticipated. For in situ cancer, the number was even lower with 311 eligible cases. In retrospect, there is a tradeoff between a potentially false cancer status of what might be a small percentage of women and higher standard errors for model parameters due to smaller sample size. One could investigate the differences between the two alternatives by comparing the results of Section 4.4.2 with a proportional hazards model assuming that all women were followed for the full five-year period. As seen from the poor model performances, breast cancer is too complex to be described by the available risk factors. Especially all information related to a woman’s reproductive history seemed to be of little importance. Although new, stronger risk factors are crucial to improve model performances, medical history and an accumulation of breast cancer in the family showed the most promising predictive abilities and should be the focus point for further risk modelling. This approach is also taken in a recent addition to the class of breast cancer prediction models proposed by Tyrer et al. (2004). At the basis of the Tyrer-Cuzick Model stands a genetic submodel predicting the presence of a hypothetical risk gene from breast cancer occurrences in first and second degree relatives. A recent study by Quante et al. (2012) suggests that it outperforms the Gail model in terms of population stratification and also shows a superior calibration performance. As family history information provided by the SMP BC data set was limited to first degree relatives only and gave no information about their existence, this approach could not be taken. As seen in a study by Cecchini et al. (2012), the importance of mammographic density as a breast cancer risk factor could also be confirmed in the analysis. Even with the binary density classification available from data set, a significant effect comparable that of a previous benign biopsy was found and should encourage the collection of breast density data in a four category format. If the Screening Mammography Program questionnaire were to be redesigned, it is there- 72 Chapter 5. Conclusion fore important to collect more detailed information about the family’s breast cancer history. Questions about the existence of sisters and daughters need to be added such that we are in a position to differentiate between women having no affected relatives because there are no cancer cases in the family and women with no family members beside their mothers. Women with large families are more likely to have a family member diagnosed with cancer based on chance alone than women with small families. Relative instead of absolute numbers for sisters and daughters could be collected. It seems that an answer ‘Only child’ was once included on the SMP BC questionnaire, however, the appearance of this category in the data set was too low to be consistent with population values of families with one child only. Currently, the collected information does not contain enough explanatory power to build a breast cancer risk prediction model. 73 Bibliography Aalen, O. (1980). A model for nonparametric regression analysis of counting processes. In W. Klonecki, A. Kozek, and J. Rosinski (Eds.), Mathematical Statistics and Probability Theory, Volume 2 of Lecture Notes in Statistics, pp. 1–25. Springer New York. Agresti, A. (2002). Categorical Data Analysis. Wiley Series in Probability and Statistics. Wiley. Anothaisintawee, T., Y. Teerawattananon, C. Wiratkapun, V. Kasamesup, and A. Thakkinstian (2011). Risk prediction models of breast cancer: a systematic review of model performances. Breast Cancer Research and Treatment 133 (1), 1–10. Ash, A. and M. Shwartz (1999). R2: a useful measure of model performance when predicting a dichotomous outcome. Statistics in Medicine 18 (4), 375–384. Baker, L. (1982). Breast cancer detection demonstration project: Five-year summary report. CA: A Cancer Journal for Clinicians 32 (4), 194–225. Barlow, W. E., E. White, R. Ballard-Barbash, P. M. Vacek, L. Titus-Ernstoff, P. A. Carney, J. A. Tice, D. S. M. Buist, B. M. Geller, and R. Rosenberg (2006). Prospective breast cancer risk prediction model for women undergoing screening mammography. Journal of the National Cancer Institute 98 (17), 1204–1214. BC Cancer Agency (2001). Screening Mammography Program of BC 2000/2001 Annual Report. Vancouver: BC Cancer Agency. Brinton, L. A., R. Hoover, and J. F. Fraumeni (1982). Interaction of familial and hormonal risk factors for breast cancer. Journal of the National Cancer Institute 69 (4), 817–822. Brinton, L. A., R. Hoover, and J. F. Fraumeni Jr (1983). Reproductive factors in the aetiology of breast cancer. British Journal of Cancer 47 (6), 757. 74 Bibliography Brinton, L. A., C. Schairer, R. N. Hoover, and J. F. Fraumeni (1988). Menstrual factors and risk of breast cancer. Cancer investigation 6 (3), 245–254. Bruzzi, P., S. B. Green, D. P. Byar, L. A. Brinton, and C. Schairer (1985). Estimating the population attributable risk for multiple risk factors using case-control data. American Journal of Epidemiology 122 (5), 904–914. Cecchini, R. S., J. P. Costantino, J. A. Cauley, W. M. Cronin, D. L. Wickerham, H. Bandos, J. L. Weissfeld, and N. Wolmark (2012). Baseline mammographic breast density and the risk of invasive breast cancer in postmenopausal women participating in the nsabp study of tamoxifen and raloxifene (star). Cancer Prevention Research 5 (11), 1321–1329. Chen, J., D. Pee, R. Ayyagari, B. Graubard, C. Schairer, C. Byrne, J. Benichou, and M. H. Gail (2006). Projecting absolute invasive breast cancer risk in white women with a model that includes mammographic density. Journal of the National Cancer Institute 98 (17), 1215–1226. Coggon, D., G. Rose, and D. J. P. Barker (2003). Epidemiology for the Uninitiated (5th Edition). London, GBR: BMJ Books. Collett, D. (2002). Modelling Binary Data, Second Edition. Texts in Statistical Science. Taylor & Francis. Costantino, J. P., M. H. Gail, D. Pee, S. Anderson, C. K. Redmond, J. Benichou, and H. S. Wieand (1999). Validation studies for models projecting the risk of invasive and total breast cancer incidence. Journal of the National Cancer Institute 91 (18), 1541–1548. Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society. Series B (Methodological) 34 (2), 187–220. Decarli, A., S. Calza, G. Masala, C. Specchia, D. Palli, and M. H. Gail (2006). Gail model for prediction of absolute risk of invasive breast cancer: independent evaluation in the florence–european prospective investigation into cancer and nutrition cohort. Journal of the National Cancer Institute 98 (23), 1686–1693. Gail, M. H. and J. Benichou (1994). Validation studies on a model for breast cancer risk. Journal of the National Cancer Institute 86 (8), 573. 75 Bibliography Gail, M. H., L. A. Brinton, D. P. Byar, D. K. Corle, S. B. Green, C. Schairer, and J. J. Mulvihill (1989). Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. Journal of the National Cancer Institute 81 (24), 1879–1886. Gail, M. H., J. P. Costantino, D. Pee, M. Bondy, L. Newman, M. Selvan, G. L. Anderson, K. E. Malone, P. A. Marchbanks, W. McCaskill-Stevens, et al. (2007). Projecting individualized absolute invasive breast cancer risk in african american women. Journal of the National Cancer Institute 99 (23), 1782–1792. Grambsch, P. M. and T. M. Therneau (1994). Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 81 (3), 515–526. Harrell, F. E. (2001). Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer Series in Statistics. Springer. Kaplan, E. L. and P. Meier (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53 (282), 457–481. Krug, A. and L. A. McNutt (2007). Incidence. In S. Boslaugh (Ed.), Encyclopedia of Epidemiology, Volume 1, pp. 529–530. Sage Publications, Incorporated. Martinussen, T. and T. H. Scheike (2006). Dynamic regression models for survival data, Volume 1. Springer New York:. Matsuno, R. K., J. P. Costantino, R. G. Ziegler, G. L. Anderson, H. Li, D. Pee, and M. H. Gail (2011). Projecting individualized absolute invasive breast cancer risk in asian and pacific islander american women. Journal of the National Cancer Institute 103 (12), 951– 961. National Cancer Institute (2012). Breast Cancer Risk Assessment Tool. http://www. cancer.gov/bcrisktool/. [Accessed: January-2012]. Quante, A. S., A. S. Whittemore, T. Shriver, K. Strauch, and M. B. Terry (2012). Breast cancer risk assessment across the risk continuum: genetic and nongenetic risk factors contributing to differential model performance. Breast Cancer Research 14 (6), 1–12. 76 Bibliography R Core Team (2012). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Rockhill, B., D. Spiegelman, C. Byrne, D. J. Hunter, and G. A. Colditz (2001). Validation of the gail et al. model of breast cancer risk prediction and implications for chemoprevention. Journal of the National Cancer Institute 93 (5), 358–366. Schoenfeld, D. (1980). Chi-squared goodness-of-fit tests for the proportional hazards regression model. Biometrika 67 (1), 145–153. Spiegelman, D., G. A. Colditz, D. Hunter, and E. Hertzmark (1994). Validation of the gail et al. model for predicting individual breast cancer risk. Journal of the National Cancer Institute 86 (8), 600–607. Statistics B.C. (2012). Population estimates. http://www.bcstats.gov.bc.ca/ StatisticsBySubject/Demography/PopulationEstimates.aspx. [Accessed: 20-July- 2012]. Statistics Canada (2011). Table 103-0550 - New cases for icd-o-3 primary sites of cancer by age group and sex, canada, provinces and territories, annual, CANSIM. http://www5. statcan.gc.ca/cansim/a01?lang=eng. [Accessed: 24-August-2012]. Therneau, T. M. and P. M. Grambsch (2000). Modeling Survival Data: Extending the Cox Model. Statistics for Biology and Health. Springer. Tyrer, J., S. W. Duffy, and J. Cuzick (2004). A breast cancer prediction model incorporating familial and personal risk factors. Statistics in Medicine 23 (7), 1111–1130. Zahl, P. H., B. H. Strand, and J. Mæhlen (2004). Incidence of breast cancer in norway and sweden during introduction of nationwide screening: prospective cohort study. BMJ 328 (7445), 921–924. 77
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Statistical modelling of breast cancer risk for British...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Statistical modelling of breast cancer risk for British Columbian women Hoegg, Tanja 2013
pdf
Page Metadata
Item Metadata
Title | Statistical modelling of breast cancer risk for British Columbian women |
Creator |
Hoegg, Tanja |
Publisher | University of British Columbia |
Date Issued | 2013 |
Description | Although there are many known factors associated with an increased risk of breast cancer, age remains the main eligibility criterion for the current Screening Mammography Program of British Columbia (SMP BC). In the light of recent controversy surrounding regular breast screening, the ability of targeted screening for high-risk women is of current interest as it has the potential to sustain high cancer detection rates, reduce the number of false positive results while controlling the overall costs. Multiple models have been proposed to estimate a woman’s risk of breast cancer given her current risk factor profile, the model introduced by Gail et al. in 1989 being particularly popular. Using five-year follow-up data of 223,399 SMP BC participants, we investigate whether the probability estimates of the Gail model adequately stratify the British Columbian population into groups of high and low-risk women and, hence, provide a basis for a personalized access criterion into the Screening Mammography Program. Further, we built a breast cancer risk model for British Columbia with the goal to include a stronger set of predictor variables and improve the outcome stratification of the Gail model. Neither the Gail model nor the new risk prediction model based on SMP BC participants showed adequate stratification properties. Overall, effect sizes of all covariates were too small to clearly separate risk estimates of future breast cancer cases and non-cases. It is questionable whether changes in screening policies should be based on breast cancer risk prediction models. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2013-06-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0073888 |
URI | http://hdl.handle.net/2429/44569 |
Degree |
Master of Science - MSc |
Program |
Interdisciplinary Studies |
Affiliation |
Graduate Studies, College of (Okanagan) |
Degree Grantor | University of British Columbia |
Graduation Date | 2013-11 |
Campus |
UBCO |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2013_fall_hoegg_tanja.pdf [ 3.72MB ]
- Metadata
- JSON: 24-1.0073888.json
- JSON-LD: 24-1.0073888-ld.json
- RDF/XML (Pretty): 24-1.0073888-rdf.xml
- RDF/JSON: 24-1.0073888-rdf.json
- Turtle: 24-1.0073888-turtle.txt
- N-Triples: 24-1.0073888-rdf-ntriples.txt
- Original Record: 24-1.0073888-source.json
- Full Text
- 24-1.0073888-fulltext.txt
- Citation
- 24-1.0073888.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0073888/manifest