M o d e l s for the Development of T u m o u r s i n Neurofibromatosis 2 by Ryan R. Woods B.Sc, University of Guelph 1998 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T O F T H E REQUIREMENTS FOR T H E DEGREE OF Master of Science in T H E F A C U L T Y O F G R A D U A T E STUDIES (Department of Statistics) we accept this thesis as conforming to the required standard T h e University of B r i t i s h C o l u m b i a June 2000 © Ryan R. Woods, 2000 In presenting degree freely at this the thesis in partial fulfilment University of British Columbia, I agree available for copying of department publication this or of reference and study. thesis by this for his thesis scholarly or for her Department of The University of British C o l u m b i a Vancouver, Canada Date DE-6 (2/88) I further purposes the requirements that the agree that may representatives. financial permission. of It gain shall not be is for Library an shall make permission for granted by understood advanced the that be allowed without it extensive head of my copying or my written Abstract Neurofibromatosis 2 (NF2) is a rare genetic disease that affects approximately 1 in 40000 people, some of the characteristic features of this disease include the onset of multiple tumours on the cranial and spinal nerves, juvenile cataracts and hearing loss. Almost all affected individuals develop bilateral tumours of the Schwann cells that line the vestibular nerves; these tumours are called as vestibular schwannomas (VS). Evidence from molecular genetic studies has suggested that a "2-hit" hypothesis is appropriate for the development of V S in patients with NF2; that is to say that a tumour cell develops from a normal Schwann cell after the cell sustains two mutations to its genetic material. Several authors have proposed probabilistic models for this process and have shown that such models are consistent with incidence data for several different types of cancer. We will discuss a selection of probabilistic models for a "2-hit" hypothesis and present the results from the fitting of such models to incidence data from NF2 patients. Molecular evidence does not exclude the possibility that additional hits are necessary for the development of VS; we will discuss a "3-hit" model and compare the model's fit to both the data and to the fit of the "2-hit" models. Genotype- phenotype correlations have been reported in patients with NF2 and thus a model that incorporates a patient's genotype is presented and discussed. Finally, a bivariate model is proposed to estimate the distributions of the ages at onset of both the first and second V S . ii Contents Abstract ii Contents iii List of Tables v List of Figures vii Acknowledgements x Dedication xi 1 Introduction 1 1.1 Neurofibromatosis 2 1 1.2 Models for Carcinogenesis 3 1.3 Study questions to address 5 1.4 Description of the data 8 2 Knudson's Early Model 15 3 Multi-hit Models and a Maximum Likelihood Approach 22 3.1 Notation 23 3.2 2-hit Models 24 3.2.1 24 Deterministic growth of tissue iii 3.2.2 3.3 Stochastic growth of tissue 26 3-hit Model 32 3.3.1 Choice of Tissue Growth Function 38 3.3.2 Likelihood Construction 40 3.3.3 Genotype Information •. 41 4 Estimating the Age at Onset of the Second V S 46 5 Results 50 6 5.1 Knudson's Model 50 5.2 2-hit Models 53 5.2.1 Deterministic Tissue Growth 53 5.2.2 Stochastic Tissue Growth . 60 5.2.3 Genotype 63 5.2.4 Estimating the Age at Onset of the Second V S 75 5.2.5 3-hit Models 78 5.2.6 Comparison of Several Models 85 Recommendations for Future Work 89 Bibliography 92 Appendix A 96 Appendix B Glossary 99 iv List of Tables 5.1 Estimates of the fraction of cell divisions d(t) at various time points from Knudson's model 5.2 Model fitting results for 2-hit model with deterministic tissue growth using M U K data 5.3 53 . 55 Model fitting results for 2-hit model with deterministic tissue growth using M U K data 5.4 56 Model fitting results for 2-hit model with deterministic tissue growth using FSS data 5.5 56 Model fitting results for 2-hit model with stochastic tissue growth using M U K data 5.6 . Model fitting results for 2-hit model with stochastic tissue growth using M U K data 5.7 61 Model fitting results for 2-hit model with stochastic tissue growth using FSS data 5.8 - 61 Model fitting results for 2-hit model incorporating genotype information (common mutation rates for both genotypes) 5.9 61 70 Model fitting results for 2-hit model incorporating genotype information (different mutation rates for both genotypes) 5.10 Model fitting results for bilateral model using M U K data v 70 76 I 5.11 Model fitting results for 3-hit model using age at onset of the first V S ( M U K Data); logistic(10 ,5,0.8) growth function 6 . . . 79 5.12 Model fitting results for 3-hit model using age at onset of the first V S ( M U K Data); logistic(10 ,5,0.8) growth function 7 .. 80 5.13 Model fitting results for 3-hit model using age at onset of the first V S ( M U K Data); logistic(10 ,8.5,1.3) growth function 6 80 5.14 Model fitting results for 3-hit model using age at onset of the first V S ( M U K Data); logistic(10 ,8.5,1.3) growth function 7 80 5.15 Comparison of log-likelihood values for different models; using age at onset of the first V S data ( M U K data) vi 86 List of Figures 1.1 2-hit model for NF2 patients (hereditary tumours) 7 1.2 3-hit model for NF2 patients (hereditary tumours) 1.3 Simple descriptive summaries of the three data sets 1.4 Histograms for the age at onset of hearing loss by mutation type (FSS " 7 12 data) 13 1.5 Histograms for the age at onset of the first V S 14 3.1 Three growth functions to be used in model fitting 3.2 Plots of relative risks versus age obtained from a 3-hit model with specifically chosen parameter values: 6\ = (a^\ : 39 , / i ^ ) = (1.73 x IO" ,1.93 x 10 ,4.10 x 10~ ); values for the components of 0 vary 1 _1 4 2 across the plots 5.1 44 Plots of empirical and estimated probabilities from Knudson's model for both NF2 patients and sporadic cases 5.2 52 Plots of empirical and model-predicted probabilities for 2-hit model with deterministic growth of tissue; using age at onset of first VS ( M U K data) 5.3 57 Plots of empirical and model-predicted probabilities for 2-hit model with deterministic growth of tissue; using age at onset of hearing loss ( M U K data) 58 vii 5.4 Plots of empirical and model-predicted probabilities for 2-hit model with deterministic growth of tissue; using age at onset of hearing loss (FSS data) 5.5 59 Plots of empirical and model-predicted probabilities from 2-hit model with stochastic tissue growth: Age at first V S data ( M U K data) 5.6 . . Plots of empirical and model-predicted probabilities from 2-hit model with stochastic tissue growth: Age at hearing loss data ( M U K data) 5.7 . 66 Histograms of data versus model-simulated values for Age at first V S ( M U K data); sample size of 163 patients 5.9 65 Plots of empirical and model-predicted probabilities from 2-hit model with stochastic tissue growth: Age at hearing loss data (FSS data) 5.8 64 67 Histograms of data versus model-simulated values for Age at Hearing loss ( M U K data); sample size of 144 patients 68 5.10 Histograms of data versus model-simulated values for Age at Hearing loss (FSS data); sample size of 167 patients 69 5.11 Plots of empirical and model-predicted probabilities from genotype model with common mutation rates: Age at hearing loss data (FSS data) 72 5.12 Plots of empirical and model-predicted probabilities from genotype model with different mutation rates: Age at hearing loss data (FSS data) 73 5.13 Empirical distribution functions for the ages at onset of the first and second V S ( M U K data) 77 5.14 Model-estimated distribution functions for the ages at onset of the first and second V S 77 5.15 Plots of empirical and model-predicted probabilities from 3-hit model assuming logistic(10 ,5,0.8) growth for the tissue; age at first V S data 6 ( M U K data) 81 viii 5.16 Plots of empirical and model-predicted probabilities from 3-hit model assuming logistic(10 ,5,0.8) growth for the tissue; age at first V S data 7 ( M U K data) 82 5.17 Plots of empirical and model-predicted probabilities from 3-hit model assuming logistic(10 ,8.5,1.3) growth for the tissue; age at first V S 6 data ( M U K data) . 83 5.18 Plots of empirical and model-predicted probabilities from 3-hit model assuming logistic(10 ,8.5,1.3) growth for the tissue; age at first V S 7 data ( M U K data) 84 5.19 Comparison of the estimated cumulative distribution functions for several models with the empirical distribution function for the age at onset of the first V S data ( M U K data) ix 88 Acknowledgements I would like very much to thank Professor Harry Joe for his much appreciated support and supervision on this, and other projects, over the past two years. I would also like to thank Dr. Jan Friedman for his support, guidance and for providing us with this project. Additionally, I thank the rest of the Friedman lab (in its many forms over the past two years): Jacek, Patricia, Yinshan, Dana, Ravi, and all of the others. I would like to thank Dr. Gareth Evans, Dr. Michael Baser and Dr. Frank Mirz for providing the datasets used in this thesis. I have had a wonderful two years in this department, owed largely to the Department's faculty and my fellow graduate students; thank you all for making this place what it is. Many thanks to my friend Zhu Rong for the use of his (very helpful) thesis proposal. A n d of course, a herculian "bao bao" to Lee Shean, my partner in crime, for her support throughout the writing of this. Shout outs to Bubby, Corky, Pete, Lambo and the rest of the posse in Ontario. RYAN REGINALD WOODS The University of British Columbia July 2000 For Hosh Pesotan xi Chapter 1 Introduction The central objective of this thesis is to outline a selection of probabilistic models for the development of tumour cells and provide an application of such models to < data from patients with the disease neurofibromatosis 2 (NF2). In this chapter we will provide some background information about the genetic disease NF2. We will outline some of the terminology and genetic concepts that will be used and discussed throughout this thesis. Section 1.2 will provide an overview of the major contributions to mathematical models for carcinogenesis. A selection of these models will be used in this thesis in an application to data on NF2 patients; Section 1.3 will discuss our specific study objectives and discuss the general features of the models we are interested in applying to our data. Finally, a description of the datasets to be used in our study is given in Section 1.4. 1.1 Neurofibromatosis 2 Neurofibromatosis 2, also called Bilateral Acoustic Neurofibromatosis, is a rare genetic disease that affects approximately 1 in 40000 people. A l l NF2 patients bear some form of a mutation to the NF2 gene (italics are conventionally used to denote the name of the gene and Roman type for the name of the disease); this mutation 1 is present at birth in all NF2 patients. NF2 is a dominant disease. This means that it is caused by one mutant copy of the NF2 gene. People normally have two copies of every gene (except those on the sex chromosomes in males). People with NF2 have one mutant and one normal copy of the NF2 gene in every cell of their bodies unless additional mutations have occurred in the process of tumorigenesis, as discussed below. The NF2 gene is located on chromosome 22q and includes 17 exons. Characteristic features of NF2 include the onset of multiple tumours on the cranial and spinal nerves, juvenile cataracts, headaches and facial weakness. We will refer to the various disease features exhibited by a patient as the patient's phenotype. Almost all affected individuals develop bilateral tumours of the Schwann cells that line the vestibular nerve; such tumours are known as vestibular schwannomas (VS). The presence of one or more VSs can cause loss of balance, hearing loss and a ringing in the ears called tinnitus. These are typically the early symptoms of NF2, which often occur in an individual's teenage years or during their twenties. Approximately 50% of NF2 cases are new mutations; meaning that the affected person's parents did not have NF2. Additionally, a person with NF2 will have a 50% chance of passing on their mutated NF2 gene to any of their children. When performing statistical analyses on data from NF2 patients it is important to be able to distinguish the family member who first sought medical attention for their condition from the other family members; this family member is called the proband. In most statistical analyses probands and non-probands will be analyzed separately, particularly if the analyses require the use of information related to the age at onset of various features of the disease. The rationale for this is that once a proband has been brought to medical attention, other members of their family will be examined for features of the disease as well; even if they have not previously shown any of the early symptoms of NF2. These other members of the family may be monitored more closely to detect the onset of various features of NF2. This may include giving these family members M R I scans to detect the onset of tumours before they be- 2 come symptomatic. Information recorded about the age at which features became apparent in the non-probands will tend to be biased towards earlier ages compared with this same information in the probands. Analyzing probands separately from the other members of the family is an attempt to prevent this potential bias from affecting the results of any analyses. There are several different general types of mutations of the NF2 gene that can occur. The three major classes of mutation types which seem to be well represented in our NF2 datasets are: protein truncating mutations; missense mutations; and splice-site mutations. These three different mutation types differ from one another in the effect that they produce on the production of protein. More specific details related to these mutation types will appear in the glossary. We will use the term genotype to refer to the type of mutation of the NF2 gene borne by an individual. This is somewhat different than the conventional definition of genotype found elsewhere in genetics; genotype more often refers to the entire genetic constitution of an organism. In statistical studies of NF2 patients it is of interest to examine the relationship between both the genotype and phenotype of the patients. Suggestions from both clinical and epidemiological studies, that certain types of mutations of the NF2 gene produce more severe disease features than others [5, 25], have motivated the study of genotype-phenotype relationships. Further details about NF2 can be found in the book by Friedman et al. [6]. 1.2 Models for Carcinogenesis There have been many contributions to the development of mathematical and probabilistic models for human carcinogenesis. Such models have been used in the past to explain incidence rates for different cancers in populations, as well as to validate hypotheses about the genetic mechanisms that are responsible for tumour development. The common theme that is incorporated into most of these models is that a tumour cell is assumed to be the outcome of a sequence of irreversible events; these 3 events progressively transform normal tissue cells by some mechanism into tumour cells. A presentation of the main ideas and concepts related to many of these models is provided by Chu [2]; this reference provides a clear non-mathematical formulation of these models. A brief outline of some of these models is provided below. Perhaps the earliest of these models was the multistage model presented by Armitage and Doll [1]. The Armitage-Doll (AD) model described the transformation process of a normal tissue cell into a tumour cell. The transformation process is represented by a sequence of irreversible changes of state. The change from one state to another represents a cell moving from its present state to a state of further malignancy, until it eventually reaches the final stage; a malignant tumour cell which divides until it becomes a detectable cancer. The A D model has been used to explain the incidence rates of many adult human cancers. Knudson [12, 13, 8] proposed a two-stage model for cancer initiation to describe the incidence of both sporadic and hereditary retinoblastoma. According to Knudson's model, a tissue cell is transformed into a tumour celf after sustaining two irreversible mutations. This model is often referred to as a "two-hit" model; where the term hit refers to the mutation of an allele in the cell. The first of these mutations is assumed to occur in one of two ways: in hereditary cases of the cancer, individuals inherit the first mutation; in non-hereditary, or sporadic cases, the first mutation occurs by chance. The second mutation is assumed to occur by chance in both groups. This two-mutation model also allows for the cell division of both normal tissue cells and cells that have already sustained a mutation; this is a necessary feature of a realistic model. Previously, the A D model had not taken into account the cellular kinetics of the tissue under study. Knudson's model was shown to be consistent with epidemiological data for retinoblastoma and subsequent genetic analyses have established the validity of this model in retinoblastoma and other forms of human cancers. Further details related to Knudson's model will be discussed in Chapter 2. 4 Moolgavkar and Venzon [18] introduced an important class of two-mutation models that have been used extensively in epidemiological research; we shall refer to these models as the M V models. These models, like Knudson's, also incorporated the cellular kinetics of the tissue into the model; the M V models however, allow for the division and death of both the normal tissue cells and cells that have already sustained mutations. A subclass of these models is able to incorporate an explicit functional form for the number of tissue cells present in the tissue as a function of age [19, 20, 21, 22]. This is a useful feature of the model as some tissues may grow rapidly during some stages of development and then remain at a fixed size thereafter; such growth patterns can be incorporated into the model by choosing an appropriate function for the the number of tissue cells as a function of age. These models have been shown to be consistent with epidemiological and experimental data for many different types of cancers [22]. The mathematical formulation of the model is also convenient to extend to a three-mutation model, and work has also been done to extend this model to a general number of mutations. Several authors have contributed to generalizing the M V models in various ways; many of these contributions are noted in the references. Details related to both the mathematical formulation and application of some of the M V models will appear in Chapter 3. In the subsequent section we will discuss our intent to use a selection of these models in an application to data collected on NF2 patients. A description.of the data that we intend to use will also follow. 1.3 S t u d y questions to address The central objective of this thesis is to explore the appropriateness of a twomutation hypothesis for the development of VSs in patients with NF2. Our motivation for this has come from molecular data that have suggested the appropriateness of such a model [6]. A two-mutation hypothesis for the development of VSs would imply that Schwann cells around the vestibular nerve must acquire two mutations in 5 some fashion in order to develop into tumour cells. Additionally, we are interested in exploring the suitability of a three-mutation hypothesis for the development of VSs. One reason for this is that molecular data do not rule out the possibility that a third mutation may contribute to the development of these tumours. As well, the development of certain cancers have previously been shown to be consistent with both two and three-mutation models. We intend to assess the appropriateness of the aforementioned hypotheses by fitting suitable probabilistic models to patient data; a more thorough discussion of these models appears below. Figure 1.1 is a depiction of a two-hit model for hereditary tumours; circles in this picture represent cells and the arrows between circles represent possible transitions from one type of cell to another. Greek letters along side the arrows represent the rates of transition between states; these will be described more thoroughly in subsequent sections. A cell that bears a single mutation is capable of division, death, or sustaining a second mutation; cells that have died or sustained the second mutation are not capable of returning to the single mutation state. Under such a model a tumour cell is any tissue cell that has sustained two mutations. This model is applicable to NF2 patients as all NF2 patients are born with a mutation to one NF2 allele and thus tumours associated with NF2 can be considered hereditary tumours. Figure 1.2 is a picture of a three-hit model for hereditary tumours. This model is identical to the two-hit model described above except that it contains an additional stage; cells that have sustained two mutations are now also capable of division, death, and acquiring a third mutation. A tumour cell is generated when a tissue cell acquires the third mutation. We will provide mathematical formulations for these models in Chapters 2 and 3 and describe the fitting of such models to NF2 data and the results in Chapter 5. As a starting point, we will provide an outline of Knudson's two-hit model and apply this model to data from both sporadic and hereditary V S , the latter in patients with NF2. Knudson's model however, does not allow the incorporation of 6 7 genotype information into the model. Associations between patients' genotype and phenotype have been described in NF2 [5, 25] and thus we would like to develop a model capable of incorporating genotype information. The M V models described in the previous section appear to be capable of including this information. We will provide an outline of the mathematical formulation of these models and describe our approach to incorporating the genotype effects into these models. These models will be fit to patient data to assess the suitability of the two-mutation hypothesis and to examine the role of genotype in predicting the onset of VS in NF2 patients. We will also provide the mathematical details of a 3-hit model for the development of tumours in NF2 patients. This model will also be fit to data to assess the suitability of a three-mutation hypothesis. Previously, both Knudson's model and the M V models have been used to predict the incidence of the first tumour in a tissue of interest. Many tissue however are bilateral, or paired, and there may be interest in predicting the age at which a person develops tumours in each of the paired tissues. The vestibular nerve is a bilateral tissue and NF2 patients develop tumours of the Schwann cells along both the left and right vestibular nerves. We will present a model that is able to predict the age at which a patient develops both VSs under the assumption that two mutations are necessary for the development of a V S . This model will be presented mathematically in Chapter 4 and the results from fitting this model to data will appear in Chapter 5. Finally, we will provide some comments about how the fitting of these models could be simplified or improved as well as some other potentially interesting study questions that could be addressed using similar models. 1.4 Description of the data It is quite difficult to acquire large NF2 data sets as a result of the low prevalence of the disease. This often necessitates combining data from many sources so that 8 enough patient data are available for statistical analyses. We are fortunate to have access to a reasonably large database of NF2 patient data; the information contained in this database is both clinical and molecular for many of the patients. Unfortu- nately, for some of our analyses we will still need to combine data from several published sources in order to have sufficient sample sizes for our analyses. Here we will provide a description of the data that will be used to fit our models. The large database available to us was provided by Dr. Gareth Evans, St. Mary's Hospital, Manchester, U.K.; we will henceforth refer to this database as the M U K (Manchester, U . K . ) data. As of March 1, 2000, this database contains detailed clinical information on 349 NF2 patients. Information contained in the patient records includes ages at onset for several characteristic features of NF2, age at presentation of the first feature of NF2, age at last examination, proband status (yes/no), presence or absence of several different types of tumours, laterality of V S , and many other variables. A l l of the aforementioned information relates to the phenotype of the patient; a subset of the patients also has genotype information recorded. 188 patients from this dataset had their D N A sequenced to determine the type of their germline NF2 mutation. To be eligible for our study, patients had to be probands with bilateral VS with the age at onset of the first V S recorded. Using probands for our model fitting is an attempt to remove biases that may result from analyzing data that contain both probands and non-probands; this point was discussed previously in Section 1.1. The NIH diagnostic criteria (1997) for NF2 state that a person can be diagnosed with NF2 only if they satisfy one or more of the following: • Bilateral V S • Family history of NF2 and either: 1) Unilateral V S at age less than 30 2) Any two of the following: meningioma, glioma, schwannoma, juvenile 9 posterior subscapsular lenticular opacities/juvenile cortical cataract Probands by definition will not have a family history of NF2 and thus a proband will only be diagnosed with NF2 following the onset of both VSs. To ensure that the patients we are including in our analyses do in fact meet the NIH crtieria for NF2, we require that they have bilateral V S . The age at onset of the first V S will typically be the response variable for our analyses and thus all patients used in our analyses must have this variable known. In some cases a surrogate measure for this can be used in place of the age at onset of the first tumour and this will be further discussed below. In total, the M U K data contained 163 probands with bilateral V S and a recorded age at onset of first VS variable. In order to acquire enough patients with known genotype information we combined the M U K data with data from several other published sources. This database has been compiled and maintained by Dr. Mike Baser from Los Angeles, U.S.A.; we will refer to this as the FSS (from several sources) data set. A complication with merging data from these various sources was that not all sources had recorded the age at onset of the first VS for the patients. Instead, several sources recorded the age at onset of hearing loss; this variable has often been used as a surrogate for the age at onset of the first V S as the presence of a V S typically results in a loss of hearing. Fortunately, all data sources had age at onset of hearing loss recorded for their patients and thus any analyses done using the FSS data will use this variable as the dependent variable. To see that age at onset of hearing loss is a suitable surrogate for the age at onset of the first V S one can refer to Figure 1.3. This figure shows both histograms and boxplots for the age at onset of the first V S and its proposed surrogate. Displays are provided for the age at onset of hearing loss from the patients from the M U K data, as well as for patients from the FSS data. The distributions displayed in this figure match one another very closely and suggest the appropriateness of our proposed surrogate measure for the age at onset of the first VS. The FSS data contained 167 patients that met our criteria for entrance 10 into the study. Of the 167 patients that were eligible for our study, 68 patients from the FSS data had identified mutation types; 40 of these patients had frameshift or nonsense mutations (which we will refer to as protein truncating mutations) and 28 patients had other types of mutations. Figure 1.4 is a display of the distribution of the age at onset of hearing loss by mutation type. The M U K data had 144 probands with bilateral V S and age at hearing loss recorded. Two patients had not yet developed hearing loss and were omitted from any analyses using age at onset of hearing loss as the dependent variable. These patients could have been regarded as censored in our analyses however, given the small censoring rate we have chosen to remove them to avoid the need to apply methodology for censored data. The influence of two censored observations on our model fitting results would quite surely be negligible given the size of the dataset. It is useful to note that the 144 patients described above are in fact a subset of the FSS dataset. Finally, we will also employ a dataset of sporadic V S patients. These are patients that do not have NF2, but do have a unilateral VS. The data were provided by Frank Mirz, a researcher from Denmark, and were published in a study on the natural history of VSs [17]. This dataset contains the age at onset of the first V S for 72 patients affected by a unilateral V S . We will henceforth refer to this dataset as the S P O R data. Figure 1.5 shows histograms for the age at onset of the first V S in sporadic and NF2 cases of VS. The distributions are clearly different for the two groups; the onset of the V S occurs much earlier in patients with NF2 than it does in sporadic cases. 11 Figure 1.3: Simple descriptive summaries of the three data sets (a) Histogram for the age at onset of the first VS (MUK data) (b) Histogram for the age at onset of hearing loss (MUK data) (c) Histogram for the age at onset of hearing loss (FSS data) (d) Boxplot for the age at onset of the first VS (MUK data) (e) Boxplot for the age at onset of hearing loss (MUK data) (f) Boxplot for the age at onset of hearing loss (FSS data) 12 Figure 1.4: Histograms for the age at onset of hearing loss by mutation type (FSS data) 40 age at onset (a) Protein-truncating mutations 40 age at onset (b) Other identified mutation types 13 Figure 1.5: Histograms for the age at onset of the first VS o 0 20 40 60 age at onset (a) Sporadic cases of VS 40 age at onset (b) NF2 patients (MUK data) 14 80 Chapter 2 Knudson's Early Model Let H(t) and N(t) represent the fraction of undiagnosed cases of hereditary and nonhereditary VS at age t respectively. Recall, that hereditary cases of VS represent NF2 patients and sporadic cases of VS represent patients from the general population who do not have NF2. Hereditary cases will always develop VSs on both the left and right sides of their head; we refer to this condition as bilateral VS. Sporadic cases have only a single VS on one side of their head; it is assumed that either side of the head is equally likely to develop the tumour. We also assume that the two sides of the head develop tumours independently of each other. We will begin with a mathematical formulation of the model for hereditary cases of VS. Let m(t) be the mean number of tumours, that have developed in the interval [0, t], per individual from the hereditary cases. Our assumption that the two sides of the head are equally likely to develop a tumour suggests that the expected number of tumours that have developed prior to time t, on one side of the head, is m(t)/2. We assume no delay between the time when a mutation occurs and the time that the tumour is detected; this justifies the equivalence of the events {patient has k mutations} and {patient has k tumours} for our probability calculations. If we assume that the number of tumours on an individual follows a Poisson distribution 15 with mean m(t) then it is possible to calculate the following probabilities: Pr(one side of the head has no mutation in [0, t]) Pr(one side of the head has at least 1 mutation in [0, £]) Pr(patient has no mutation in [0, t\) Pr(patient has at least 1 mutation in [0, t]) = exp{ w ^ j = 1 — exp{ = m ^j exp{—m(t)} = 1 — exp{—m(t)} Further, let us define M[t\, £2] to be the number of mutations that have occurred in the interval [£i,£2]; note that M [ £ i , £ ] = mfo) —m(tx). An 2 expression for H(t) can now be found using the aforementioned assumptions and definitions: H(t) — Pr(patient has M[0, t] = 0 | patient is eventually bilateral) _ Pr(both sides of head have M[0, t] = 0 and M[t, 00] > 1) Pr(both sides of head have M[0, 00] > 1) _ Pr(one side has M[0, t] = 0) Pr(bne side has M[t, 00] > l) Pr(one side of head has M[0, 00] > l) 2 2 2 exp{-m(£)}[l-exp{-^l + ^ } ] Jl-exp{-^ 2 2 } ] [ex {-^}-exp{-^)}] = 2 P [l-ex {-^}] . 2 P The formulation of the model for sporadic cases of VS is similar to that of the hereditary cases. We define q(t) to be the mean number of tumours developed per person, prior to time t, for individuals from the sporadic population. Assuming a Poisson distribution for the random number of tumours per individual, as above, we arrive at the following expression for N(t): ,Q N = exp{-g(*)> ~ exp{-g(oo)} ^ ^ _ q(t) 1 — exp{—g(oo)} 9(00)' where the approximate equality is justified by considering that q(t) is much smaller 16 than 1; thus, 1 — exp{<7(£)} can be approximated by its first order Maclaurin approximation q(t). In the formulation of the expressions for H(t) and N(t) we have allowed them to depend on t by allowing m(t) and q(t) to be age dependent; thus far we have not described how these functions will depend on age. We will now decribe Knudson's approach of relating the functions m(t) and q(t) to the number of cell divisions that have occurred in the tissue prior to age t. We must first make several assumptions about the tissue cells and their cellular kinetics. First, it is assumed that both normal tissue cells and cells that have already sustained a single mutation are capable of cell division. It is also necessary to assume that the mean number of cell divisions that have occurred prior to age t is equal in the hereditary and sporadic cases; we denote this quantity by a(t). We denote the number of tissue-specific cells present at time t to be b(t); where the initial number of cells present in the tissue is denoted by 6(0). Clearly, the number of tissue cells present at any time t would be equal to the number initially present in the tissue plus the number of cell divisions that have occurred prior to this time: b(t) = 6(0) + a(t). Letting t —> oo we obtain the number of tissue cells that will eventually be present in the tissue: 6(oo) = 6(0) + a(oo). We are now able to define a function d(t), which will represent the fraction of cell divisions that have occurred by time t, in the following fashion: . , { m ) = a(t) a(oo) = Kt) - 6(0) 6(oo)-6(0)' In hereditary cases, the functions m(t) and d(t) are related by assuming that tumour cells arise by transformation of an intermediate cell at a constant mutation rate denoted by Li2\ the units for this rate are given as mutations per cell division. Thus we can express the mean number of tumours developed prior to time t as: m(t) = JJ,2 -a(t). The assumption that the mutation rate is constant with respect to 17 time leads immediately to the following result: M2 = m(t) m(oo) a(t) ' a(oo) m(t) . m(oo) a(t) a(oo) = d(t) <?> m(t) = d{t)m(oo). This relationship can now be substituted into our expression for H(t) given previously to yield: H(t) = -exp{ m ( 0 °W} -exp{-^}] [l-exp{-^)}] 2 (2.1) 2 Recall that for a tumour to develop in an individual from the general population, two chance mutations are required: the first to transform a normal tissue cell into an intermediate cell, and the second to transform the intermediate cell into a tumour cell. The first and second mutations are assumed to occur at constant rates of LI\ and L12 respectively; again the units for the mutation rates are given in mutations per cell division. We define p(t) to be the mean number of mutations of normal tissue cells that have occurred prior to time t. Clearly p(t) can be expressed as: p(t) — [i\ • a(t). The quantity that we would like to relate to the number of cell divisions is in fact q(t), the number of mutations of intermediate cells that yield tumour cells. This quantity would simply be the product of the number of intermediate cell divisions that have occurred by time t and the mutation rate Li2- If the number of intermediate cells present in the tissue are represented by 7(i), then the number of intermediate cell divisions that have occurred by time t would simply be I(t) —p(t). Thus, we yield the following expression: q(t) = Li [I(t) 2 -p(t)]=ii [I(t)-liia(t)]. 2 We will now outline Knudson's suggestion for a method of estimating the mean number of intermediate cells present at time t. We begin by partitioning the 18 interval [0, t] into n subintervals of length h: n • [0,t] = \J[n,Ti + h]. i=l Recalling the definition of p(t), we note that the mean number of mutations that produce intermediate cells from normal tissue cells in the ith subinterval is simply p(Ti + h) —p{T{). These intermediate cells will also divide and increase in number; it is assumed that the proportional increase in the number of intermediate cells from age Ti to age t is equal to this proportional increase in the normal tissue cells. The proportional increase in the number of intermediate cells from age Tj to age t is given by: b(t) _ a(t) + 6(0) b(n) ~ a(n) + 6(0)" A final expression for I(t) can now be obtained by computing the following integral: = l(t) l h yP(n + a L = . p [ Jo T W ) ) a(T) y) d h)-p(n)b^l h T + b(0) = / i i [ a ( £ ) + 6 ( 0 ) ] l o g { a ( r ) + 6(0)} [ = /ix[a(t) + 6 ( 0 ) ] { l o g { a ( £ ) + 6(0)} - log{6(0)}}. Note that we have used the fact that a(0) = 0 in the final equality above; a reasonable assumption of course as a(0) represents the number of cell divisions that have occurred prior to age 0. Substituting this expression into our expression for q(t), and using the fact that for positive t, a(t) will be much larger than 6(0), we yield: q(t) = // {/x [a(£)+6(0)]{log{a(£) + 6 ( 0 ) } - l o g { 6 ( 0 ) } } - ~ /ii/i a(<)[log{^y} - 1 . 2 1 2 19 M l a(£)} Substituting this expression into our expression for N(t), and recalling that d(t) = a(i)/a(oo), leads to: , N(t) q(t) q(oo) log] 1 /ii/z a(oo) log' 2 )§}->] f a(oo) 1 1 (o) J & A J 1 - d(t) = l-d(t) rf(t) = h i w}log{<I(t)} 1 - d(t) lo 4w}-\ rf(t). (2.2) The result of the previous mathematical formulation is that both functions H(t) and N(t) are made to depend on t only through the function d(t). This function is a time-dependent parameter which can be estimated from the data at various time points. H(t) and N{t) also depend on two other quantities as well; values of m(oo) and a(oo)/6(0) can be selected prior to the analysis leaving the time-dependent parameter d(t) as the only unknown parameter to estimate from the data. Suppose we want to estimate the fraction of cell divisions that have occurred prior to k different times ti,...,tk- Suppose, additionally that we have and n s hereditary and sporadic cases respectively. We denote the observed ages at onset of VS in the hereditary group as if, these same ages in the sporadic group are denoted by if,...,** . To obtain an estimate of d(ti) (i = 1,...,/:), Hethcote et al. 20 suggested the minimization of the following function Q with respect to d(ti): O - WU^HlT) Q - w {HM where H(U) [e*p{-«n-exp{-^}]V jTT^pSjf kM ) and N(ti) are estimated by the empirical survival functions: H(U) = m) = — , nh j = l,...,n , h j = i,..., ., n n. s and Wh(ti) and W {ti) s are weights used in the minimization; the values chosen for the weights are the number of undiagnosed hereditary and sporadic cases at time ti repectively. Thus, w {u) = #{*J>*i}». i = i,...,«/,, W (U) = h a #{t j>ti}, s j = l,...,n . 8 The minimization described above can be quite easily be performed numerically. The fit of the model can be assessed both by a chi-square goodness of fit test and also by examining plots of empirical and model predicted incidence curves for their agreement. These topics will be addressed further in Chapter 5. 21 Chapter 3 Multi-hit Models and a Maximum Likelihood Approach For the maximum likelihood approach to multi-hit models we derive the hazard function (also known as the hazard rate function) for the time to the generation of the first tumour cell. We assume that our tumour onset times follow this distribution and construct a likelihood in the customary fashion. There are two two-hit models presented below; the first model has a single parameter and a closed form maximum likelihood estimate can be obtained. The second two-hit model has several parameters and a more complicated expression for the hazard and thus the maximum likelihood estimation is carried out numerically; we provide the hazard function necessary to construct the likelihood. Similarly, for the three mutation model we derive the hazard function for the time to the first tumour and estimates of the model parameters are found numerically. A general approach to constructing the likelihood from a hazard function is also provided. The models presented in this chapter are models for the development of tumours in patients with NF2; this is an important point as it implies that cells in the tissue at risk have already sustained a single mutation. 22 3.1 Notation Notation used throughout the remaining sections has been chosen to be consistent with the major references cited. Notation for variables, functions and parameters will be defined as these concepts are introduced. Figures 1.1 and 1.2 were depictions of the processes that we aim to model in this chapter and some simple notation was defined previously with respect to these figures; we will retain this notation throughout the thesis and we redefine it here for convenience. Some simple guidelines for notation are given below: • 6 will denote the parameter vector for the model under discussion. In most cases this vector will contain the growth, death, and mutation rates for the 2-hit or 3-hit model, e.g. 9 = (a,{3,Li). • a will denote a growth rate or cell division rate for the tissue cells. • f3 will denote a death rate for the tissue cells. • LI will denote a mutation rate for the tissue cells. This represents the rate at which cells with 1 mutation transform into cells with 2 mutations (or the rate by which cells with 2 mutations transform into cells with 3 mutations). • h(t\0) will denote the hazard function for the random variable T (representing the time to the first tumour): = v 1 ' l i m Pr( <r< t At->o i + A | T >*,<>) t Ai Additionally, f (t\0), S(t\6), and F(t\6) will be used to denote the density, survival, and distribution functions for T respectively. • ti will denote the realization of the random variable T for patient i. In our applications this will be the age at onset of the first V S or in some cases the age at onset of hearing loss. 23 In general all of the rate parameters could be subscripted to denote the stage of the model to which they belong (example: \i\ would represent the rate at which cells transform from normal tissue cells into cells with a single mutation; /4 would 2 represent the rate at which cells with a single mutation transform into cells with two mutations, etc.). In our applications however, we will assume that rates of the same kind from different stages of the model are equal (i.e. m = ^2 = r )1 3.2 2-hit Models There are two approaches to 2-hit models presented in subsequent sections. The two approaches differ in the assumptions about the growth and death of the tissue cells. The first model assumes that the tissue cells grow according to a deterministic process, and thus the number of cells present in the tissue at any given time t is given by a function X(t); additionally, it is assumed that there is a small chance that any of the cells mutate. The second model assumes that tissue cells divide, die and mutate according to a birth-death process and thus the number of tissue cells present at any time £ is a random variable X(t). I have used the same notation for both and have let the context of the discussion distinguish between the random variable X(t) and the deterministic function X(t). 3.2.1 Deterministic growth of tissue If the tissue cells are assumed to grow according to a deterministic process then the generation of mutated cells can be modelled as a nonhomogeneous Poisson process. Let X(s) represent the number of tissue cells at time 5; note that these tissue cells have already sustained a single mutation in patients with NF2. Further, let Z{s) represent the number of tumour cells in the tissue at time s. These cells have sustained 2 mutations; one which has been inherited and the other which has occurred by chance. The mutation rate of the second chance mutation is assumed to be constant and is denoted by The mean number of tumour cells that have 24 developed by time £, E(Z(t)), is simply The number of tumour cells /j,X(s)ds. can be thought of as arising from a nonhomogeneous Poisson process with intensity fiX{t). Let T be the time to the generation of the first tumour cell. The time to the generation of the first tumour cell can be shown to have the following distribution function: F(t\9) = Pr(T < t) = l-exp{-fj, = 1 - Pr(T > t) f X(s)ds}, Jo = 1 - Pr(Z(t) = 0) t>0. The probability density function for the random variable T is obtained by differentiating the distribution function: f(t\0) = fiX(t)exp{-n [ X(s)ds}, Jo t>0. Hence the hazard function for the time to the first tumour cell, h(t\6), is simply: exp{-/i J 0 X(s)ds} &\t\V) The parameter vector 0 for this model contains only a single parameter; namely the mutation rate parameter fi as the growth and death of cells in the tissue are accounted for in the deterministic growth function for the tissue. A likelihood for the data can be constructed based on the distribution of the time to the onset of the tumour and a maximum likelihood estimate for the rate parameter fj, can easily be obtained. If the data consist of n individuals with onset times t-\,...,t , n the likelihood, L(0), is given by: The maximum likelihood estimate of the rate parameter /J, has a simple closed form solution given by: A= T-^ E?=i — r (tfXMds) 25 (3-2) After having estimated this rate parameter it is possible to compute an estimate of the probability that an individual will develop a tumour before a given time t\ this is done by replacing \x by its estimate in the expression for the distribution function F(t\6) given above in equation (3.1). These estimates can be compared to an empirical distribution function for the data to assess the fit of the model. An asymptotic standard error for /}, based on the observed Fisher information, can be estimated by twice differentiating the log-likelihood 3.2.2 Stochastic growth of tissue For this model, we assume that the tissue cells divide, die and mutate according to a birth-death process. We derive the hazard function for the time to first tumour starting from a single tissue cell: h(t\0). The hazard function for the time to first tumour starting from a tissue consisting of N cells would just be N • h(t\0) because of the assumption that the cells mutate independently of one another. Additional model assumptions and a derivation of the hazard function h(t\0) are provided below. The growth, death and mutation of the tissue cells are assumed to follow a process similar to a birth-death process. In a small interval of time At, a tissue cell may divide into two tissue cells with probability f3At+o(At); o(At); is a At + o(At); die with probability divide into a normal tissue cell and a tumour cell with probability LiAt+ the probability of more than one such event occurring in this time interval o(At). Additionally, we assume that a tumour arises from a single progenitor tumour cell and that the tissue cells mutate independently of one another. Mutations are assumed to occur during cell division; such a cell division will produce both a cell identical to the original progenitor and a cell that has sustained an additional mutation. This assumption is used in the derivation of the Kolmogorov forward equation below. 26 As previously, we define X(t) and Z(t) to be the number of tissue and tumour cells present in the tissue at time i respectively. Let </>(x,z;t) be the probability generating function for the number of tissue and tumour cells at time i starting with a single tissue cell initially. Thus, oo 4>(x,z;t) = Piti,k);t)x z j k j,k=0 where p((j, k);t) = P r ( X ( i ) = j, Z(t) = k\X(0) = 1, Z(0) = 0). Note that we assume: p((j, k);t) = 0 for j < 0 or k < 0. Our claim is that the hazard function for the time to the generation of the first tumour cell T , is given by the expression h(t\6) = — c/>'(l, 0; t)/</)(l, 0; i). To see this is true we make the following calculations: oo 0(1,0; t) = J2 = 3, Z(t) = 0) = Pr(Z(t) = 0), (3.3) 3=0 and #'(1,0;<) = - d(p(x,z;t) 1 = 1,2=0 at E j=o dPi(X{t)=j,Z(t)=0) d P r ( Z ( i ) = 0) dt dt P r ( Z ( i + At) = 0) - P r ( Z ( i ) = 0) At->o . Ai PijZjt + A i ) = 0, Z(t) = 0) - Pr(Z(<) = 0) A«->o Ai f P r ( Z ( i + A i ) = 0 | Z(t) = 0) - l j P r ( Z ( i ) = 0) = lim i At-y0 At { - Pr(Z(t + At) = 1 | Z(t) = 0)} Pr(Z(t) = 0) = l i * m At->o - : Ai P r ( Z ( ) = 0) lim - P r ( Z ( ' + A«) = l | ( v w y At->o Z W= 0) Ai = ^ M M ^ " ^ ' ' " " " " ' , At-»o A i 27 (3.4) where this last line is justified using equation (3.3). Recall that T denotes the random time until the generation of the first tumour cell. Rearranging the result from equation (3.4) gives us the following result: *(l,0;t) Pr(Z(t + A t ) = 1 | Z(t) = 0 ) = lim At->0 = lim At Pr(t < T < t + A t | T > t) At At->0 = h(t\0) which is the hazard function for the random variable T . We will now derive a closed form solution for the hazard function for this model. As we have just shown, this will require a solution for the probability generating function cp(x, z\ t) defined above. We begin by defining: p - .)((m,n); At) = Pr(X(t + At) = m , Z ( t + A t ) = n\X(t) =j,Z{t) (j jA From the Chapman-Kolmogorov equations we yield: p{{j, k);t + At) = , ({j, k); At) • p{{j, k);t) P{j k) +P(j-i,fc)(0', k);At)-p({j-l,k);t) +P(j+i,fc)(O',*0;At) -p{(j+ l,k);t) +P(j,k-i){{J,k);At)-p{{j,k- l);t). We note that: P(j-\,k){{j\k); At) = (j - l ) A t + o(At), P( i,k)((j,k);At) = (j + 1)0 At + o(At), P(j,k-i)((j,k);At) = j/iAt j+ P{j,k){{j, k)\ At) a = + o(At), (1 - jaAt 28 - j/3At - jtiAt - o(At)). = k). The Kolmogorov forward differential equation can be obtained as: At->0 = At - j ( a + /3 + A»)-p((i,fe);*) + 0 ' - i ) « - p ( ( i - i , f c ) ; * ) + (j + l)/3 • p((j + 1, k); t) + 3li • p((j, k - 1); t). We use this result to yield the following differential equation: (j)'{x,z;t) = dc/)(x,z;t) dt oo = £p'((j,A0;*)^ j,k=0 f c oo = Yl [~j(a + i3 + ri)-p((j,k);t) j,k=0 +(j - l)a • p((j - 1, k); t) + (j + 1)0 • p((j + 1, k);t) +Jri-P((j,k-I);t) LIXZ xz j + ax +f3-(a k + (3 + n)x] 2 ^ (3.5) with initial condition (f>(x, z; 0) = x. It is obvious that: <f>'(l,0;t) = a + /3-(a + l3 + n) d(/){l,0;t) _ _ dx d<p{l,0;t) ^ dx The previous differential equations would be useful for finding a general case solution for cj)(x, z; t) and the computation of moments, however we are interested only in obtaining an expression for the hazard function h(t\6). Thus, we are interested ih computing only 0(1,0; t) and (j>'(l, 0; t). We have included these equations nevertheless for completeness. Finding an expression for the hazard function is simplified by using an observation of Moolgavkar et al. [18]; specifically that <f)(x,z;t) can be shown to satisfy the following Riccati differential equation: <t>'{x,z;t) = acj> {x,z;t) + [LIZ - (a + p + n)](f>(x,z;t) +[3. 2 (3.6) This equation can be obtained by considering the integral equation solution for 29 (j)(x,z;t) given by Moolgavkar et al. [18]: 4>(x, z;t) = zexp{ —K*} + J = ^a(j) (x, z\ t — u) + (3 + Liz(p(x, z; t — u) exp{—nu}du 2 z exp {—«;*} + J ^a(j) (x,z;u) + l3 + Lizcj)(x,z;u) exp{—n(t — u)}du 2 where K = (a + /3'•+ LI). Multiplying both sides of this equation by e x p { « ; i } and differentiating with respect to t yields: (j)'(x, z; t) e x p { « ; t } + c/)(x, z; t)K,exp{Kt} d_' r t dl + — {/ /3exp{«;(i - u)}du| = —j / j^Q!0 (:r, 2 ; u) + Lizcj)(x, z; u) = ^a<j) (x,z]t) + /j,z$(x,z\t)^exp{Kt} + 0exp{Kt}. 2 exp{Ku}chzj 2 (3.7) Rearranging equation (3.7) and multiplying both sides of the equation by e x p { - K i } yields the following: (fi'(x, z; t) e x p { « ; t } = — cp(x, z; t)n exp{«;*} + / 3 e x p { « ; t } + acj> (x,z;t) + fj,z<f)(x, z; £)J e x p { « ; i } 2 <=> <p'(x, z; t) <^> — -<j)(x, z; t)n + /3 + |a</> (3;, 2 z; *) + Liz<p(x, z; i)j 0'(z, 2 ; t) = a0 (a;, z; *) + [liz - (a + /? + /i)j 2 z; t) + l3 which is the Riccati equation given in equation (3.6) (more details can be found in the appendix). Thus, evaluating this equation at x = 1 and z = 0 yields: </»'(!, 0;t) = a 0 ( l , O ; t ) - ( a + /3 + ) 0 ( l , O ; i ) + / 3 . 2 M 30 Rearranging this equation yields: 3^(1,0;*) = , a0 (l,O;*) - (a + /3 + /j)0(l,O;<) + 0 2 c 90(1,0;*) (0(1,0;*) - Ci)(0(l,O;*) - C ) ^ 2 where C i < C are distinct roots of the polynomial in q: 2 a a The roots of this polynomial are easily obtained by applying the quadratic formula: Cl = 7r(a + 0 + n) y/a za Za - 2a0 + laii + 0 + 20n + n , 2 = —(a + /3 + M) + wVa ~ 2a0 + 2a/x + 0 + 20LI + \x . c 2 2 2 2 2 2 (3.9) The differential equation given above in equation (3.8) can be integrated directly using partial fractions: f 30(1,0;t) 7 (0(1,0;*)- 6-0(0(1,0;*) - C ) i 2 r /• 1 Ci-C \j 30(1,0;*) /- 0(1,0;*)-Ci 2 ^ /• ^ 30(1,0;*) 1 f 7 0(1,0;*) - C J y 2 B log{0(l,0;*)-d}-log{0(l,O;*)-C } = a ( d - C ) * + C 2 V ; 2 1 - [Cexp{a(Ci - C ) * } ] ' 2 where C is a constant to be determined by the initial condition of the differential equation 0(1,0; 0) = 1. Evaluating our solution for 0(1,0; *) above at * = 0 leads us to: 0(i,o o) = i ; nn=7r = i 31 * c = r=£ 2 Thus our final solution for 0(1,0; t) is given by: Ci 0(1,0,t) = — c 2 l5gexp{a(C 1 - "11-E § e x p { a ( C i 1 - C )t) 2 (3.10) - C2)t\ We can now substitute this into the Riccati equation given previously to obtain our final solution for the hazard function: h{t\0) = = 3.3 - 0'(1,0; t) - « 0 ( 1 , 0 ; t) + {a + (3 + /i)0(l, 0; t) - p 2 0(1,0;*) 0(1,0,*) -a0(l,O,*) + (a + ^ + ^ ) - 5 ( 0 ( l , O , £ ) ) - . 1 / 3-hit Model For this model, the tissue is assumed to grow deterministically. It is important to remember that tissue cells already bear a single mutation. Tissue cells mutate according to a random process to produce intermediate cells; an intermediate cell is a cell that has sustained two mutations. The growth, death and mutation of these intermediate cells are assumed to follow a birth-death process. Mutation of an intermediate cell results in a tumour cell; a tumour cell in this model is a cell that has sustained three mutations. A fully stochastic model could also be proposed here where the tissue cells are assumed to divide, mutate and die according to a birth-death process as well. The problem with such a model is that the mathematics become quite challenging and there is no solution in the literature for the hazard function from such a model. Several authors have proposed an approximate solution to a hazard function from a fully stochastic model; the assumptions necessary for the approximation however, are inappropriate for our application of the model to NF2 patients. The reason for this is that the approximation requires that the probability of tumour in the population of study be very small; in the study of many types of cancer in the general population such an assumption may be quite reasonable. The population 32 of NF2 patients however, have a very high probability of tumour and thus the application of the approximate hazard function would not be appropriate in the analysis of NF2 data. Again, we denote the number of tissue cells present at time s as X(s); note that X(s) is a deterministic function. In a small interval of time At, the probability that an intermediate cell is generated by the mutation of a tissue cell is LiX(s)At + o(At). The probability that more than one intermediate cell is gener- ated in this fashion is o(At). The growth, death and mutation of the intermediate cells are assumed to follow a process similar to a birth-death process. Again, iii a small interval of time At, an intermediate cell divides into two intermediate cells with probability a At + o(At); dies with probability (3 At + o(At); divides into an intermediate cell and a tumour cell with probability LiAt + o(At); the probability of more than one such event occurring in this time interval is o(At). Additionally, we assume that a tumour arises from a single progenitor tumour cell and that the tissue cells mutate independently of one another. Mutations are assumed to occur during cell division; such a cell division will produce both a cell identical to the original progenitor and a cell that has sustained an additional mutation. This assumption is used in the derivation of the Kolmogorov forward equation below. We define Y(t) and Z(t) to be the number of intermediate and tumour cells present in the tissue at time t respectively. Let \T/(y, z;t) be the probability generating function for the number of intermediate and tumour cells. Thus, oo j,k=0 where p((j, k);t) = Pr(r (t) = j, Z(t) = k | y(0) = 0, Z(0) = 0). Note that it is assumed that: p((j, k);t) = 0 for j < 0 or k < 0. 33 We will now derive a closed form solution for the hazard function for this model. A n argument identical to that provided in Section 3.2.2 can be used to show that the hazard function is given by the expression — ^'(1,0; 0; t). Again, it is necessary for us to find the solution for the probability generating function * ( l , 0 ; i ) . We must first define: P(j, )((m,n); A i ) = Pr(Y(t + At) =m,Z(t fc + At) = n | Y(t) = j,Z(t) = k). From the Chapman-Kolmogorov equations we obtain: p({j,k);t + At) = p {{j,k);At)-p([j,k);t) {jik) +P(j-i,fc)(0',*); At) -p((j - l,fc);t) /fc)(ti, k); At) -p{{j + l, k);t) +P(j,k-i)(U, ky, At) • ({j, k - 1); t). P We note that: P(i-i,fc)((j, At) = ( X ( t ) A t + o(At)) + ((j - l)aAt + o(At)), P(i+i,fc)((i, k); At) = (j + l)f3At + o(At), P(j,k-i)(U,k);At) - jfiAt + o(At), p ({j,k);At) {jjk) = M l-jaAt-j(3At-JnAt -fiX{t)At-o{At) The four equations given above arise as a result of the model assumptions. The first equality holds because if at the beginning of the short time interval there are j — 1 intermediate cells and k tumour cells, and at the end of the time interval there are j intermediate cells and k tumour cells, then a single intermediate cell has been produced in some fashion. This can occur by the mutation of a tissue cell or by the normal division of an existing intermediate cell. Consideration of these two events will lead directly to the stated probability. The second equality is obtained by considering that the reduction in the number of intermediate cells from j + 1 to 34 j can occur only if a current intermediate cell dies. The increase in the number of tumour cells from k — 1 to k occurs with the mutation of a current intermediate cell; this provides the motivation for the third equation. Recall that, such a mutation occurs during cell division and results in both an intermediate cell and a tumour cell; thus, the number of intermediate cells does not change with the occurrence of such an event. The final equality is justified by considering the probability that none of the aforementioned events takes place in the time interval. The Kolmogorov forward differential equation can be obtained as: AU,k);t) = i i m p ( ^ ^ ( ) + A Ai->0 = : ) - ^ f c ) ; t ) h + /3 + LI)- p((j, k);t) - LiX(t) • p((j, -j(a k);t) + (j - l)a • p((j - 1, k);t) + (j + \)P • p((j + 1, k);t) +LiX{t)-p((j-l,k);t)+JLi-p((j,k-iy,t). We use this result to yield the following differential equation: oo j,k=0 oo = J2 [-H j,k=0 +1 + AO • p(ti>*) a 3 +(j-l) .p((j-l,k);t) = • P((J> X + (j + l)p-p((j a +LiX(t)-p((j - » V) - l,k);t)+jfi-p((j,k- *) + l,ky,t) l);i)]y^ c f (y-l)iiX(tMy,z;t) + [pyz y 2 + with initial condition ^(y,z\0) a + L3-(a + p + rfy] (3.11) — 1; an outline of the computations necesary to justify the final equality are provided in the appendix (following the bibliography). 35 This leads directly to the following expression for \I/'(1,0;£): *'(l,0;t) = [a + P- (a + f3 + n) dy a*(i,o t) ; dy We can write the conditional expectation E [Y(t) \ Z(t) = 0] as: E [Y(t) | Z{t) = 0] = / * ( i , 0; ay Next, we easily derive: 9*(l,0;t) dV(y,z;t) 9y dy = j/=l,z=0 ^;V'- ^Pr(y(t)=i,Z(t)=A;) 1 j,fc=0 = y=l,*=0 £jPr(y(<)=i,Z(*)=0). 3=0 Now dividing both sides by ty(l,0;t) we get: a*(i,o <) ; 7 *(l,0;t) = £ j P r ( y ( * ) = j \ Z ( * ) = 0)/*(l,0;t) = £ j P r ( y ( t ) = j , Z ( < ) = 0)/Pr(Z(t) oo j'=o oo = Y,JPr(Y(t)=j\Z(t) = 0) 3=0 = E\Y(t)\Z(t) = 0] Therefore, we can express the hazard function as: h(t\9) = -*'(l,0',t)/*{l,0;t) A*' = A a*(l,0;t) 3y / *(l,0;t) i - E [ y ( t ) | Z ( < ) = 0]. Moolgavkar ei al. [20] showed that for this model: E[Y(t)\Z(t) = 0] = J*»X( )ex {£ S V [2a<f>(l,0;u) S -(a + f3 + n)]du^ds, 36 where 0(1,0; u) is defined as previously in Section 3.2.2. The second integral in this expression is readily integrated: ft—a / = [2a0(l,O;u) - (a + p + Li)]du Jo ft — S / 2a<f>(l,0;u)du - (a +p + n)(t-s) = o f " 2/ C i- ^exp{q(C -C ) }] j i-c, - - r. ^ < f a - ( a + £ + /x)(*-a) ^exp{a(Ci -C )u} f g 1 ./o 1- — rt—s S 2 M j 2 c = 2^ { C [l-[^exp{a(C -C ) }" a 1 1 2 U +«Ci [ j-=-^± exp{a(Ci - C )u}] - aC [\^r exp{a(Ci - C )u}] } 2 2 x {l 1 = 2, J-77; r=a Jo +P+ r Li)(t-s) 2 [_ _ a ( C l < aCi C 2 , 1 ) ] i ^ [ e X p { a ( C -<7 ) }K >du l 2 7— n U 1 - j E g exp{a(Ci - C )u) i 2 -(a + p + Li){t = -^ r}du-(a exp{a(Ci - C )u) > 1- rt-s 2 2aCiu q - s) '-21og{l-—^exp{a(Ci-C )u}} | *-(a + + 2 q M )(i - S) M ) t_ ) CI = 2aCi(t - a) - 21og{l - [—^exp{a(Ci - C )(* - «)}} C 2 2 +21og{l - \^~} ~(a + P + (i)(t - s) • 1- • ^ k L = 2aC (t-s) + 21og{ — - j ^ }_ 1exp{a(Ci - C ){t - s)} > 1 ( a + 3 / + ( s L 2 = 9(B\t-s). Thus, the conditional expectation is given by the expression: = 0] = f LiX(s)exp{g{0;tJo E[Y(t)\Z{t) s)}ds. Substituting this result into the previous expression for the hazard function we obtain our final expression for the hazard function: h(t\0) = LI [ X{s) exp{g{0; t - s)}ds, Jo 2 37 t > 0, where g(0;t - s) is given above and C\,C2 are defined as they were previously in equation (3.9). A deterministic function for the growth of the tissue can now be selected and the remaining integral in the expression for the hazard function can be calculated numerically. I have been using Romberg integration to integrate this function. 3.3.1 Choice of Tissue Growth Function Both scaled logistic and Gompertz distribution functions have been used in the literature to model the growth of other tissues [18, 20] and thus these have been used as my starting point. Each of these families of distribution functions has several constants that must be chosen in order to get a curve that seems to reasonably fit the pattern we expect for the Schwann cell growth. Using a logistic distribution function with three constants to model the tissue growth would give the following form for X(s): , , ' Kexpj^} *<•> - J {Vr 1+ P A four-constant Gompertz disribution function would yield: X(s) = tf(l-exp{a(l-exp{&s}) - (3 12) + cs}). Although both of these families seem to be capable of capturing the shape of tissue growth that is expected from the biological information available to us, I have chosen to use only the logistic family given above. The three constants for this family allow us adequate flexibility to capture any shape of tissue growth that we desire for our modelling purposes. Choosing the three constants for the logistic family is quite straightforward. The K constant represents the number of cells that the tissue contains in an average adult; in all of our models we assume that the number of tissue cells approaches this constant after a certain age. The remaining constants in the growth function govern the shape of the curve and influence the rate of growth and the age at which the 38 tissue reaches its maximum size. We will use the notation logistic^, a, b) to refer a logistic growth function of the form given in (3.12); thus logistic(10 , 5.0,0.8) would 7 refer to a logistic growth function with K = 10 , a = 5.0, and b = 0.8. 7 Figure 3.1: Three growth functions to be used in model fitting 0 10 20 30 40 Age (years) Figure 3.1 is a plot of three different logistic growth functions similar to those that will be used for our data analysis. A l l three of these functions as- sume that the maximum number of Schwann cells in an adult tissue is 200000. The differences between these functions are the age at which the tissue reaches its maximum size as well as the rate of which the growth occurs. From the figure it is clear that the logistic(2 x 10 ,5.0,0.8) growth function corresponds to a tis5 sue that achieves its maximum size when the individual is approximately age ten; the rate of growth is quite rapid which can be inferred by the steepness of the growth curve. The logistic(2 x 10 , 8.0,0.8) growth function behaves similarly to 5 the logistic(2 x 10 ,5.0,0.8) function in that one curve is essentially a horizontal 5 shift of the other. The biological implications of the logistic(2 x 10 ,8.0,0.8) growth 5 39 function are that the tissue experiences a period of very slow growth for the first few years of development. This period is then followed by a spurt of rapid growth with the tissue reaching its maximum size at approximately age 14. The third growth function depicted in figure 3.1 assumes a more gradual rate of tissue growth than the previous growth functions. The logistic(2 x 10 ,8.5,1.3) function assumes that 5 tissue growth is rather slow for the first few years of development, although at a slightly faster rate than the logistic(2 x 10 , 8.0,0.8) function, and approaches its 5 maximum size at about age 17. Unfortunately, our knowledge of the precise growth of the Schwann cells around the vestibular nerve is quite limited. From the limited information available to us, any of these functions could very well be a feasible growth function to model the growth of the tissue. The information that is perhaps most important for our modelling, namely the number of cells present in an adult tissue, is particularly difficult to obtain. We have estimated the upper bound on this quantity to be approximately 10 ; we are however unable to provide an assessment of the reliability 7 of this estimate. We have therefore decided to fit our models using a variety of different growth functions and assess the sensitivity of the model fit and parameter estimates to the growth function selected. 3.3.2 Likelihood Construction Given an expression for the hazard function h(t\6) we can express the density func- tion for the time to the onset of the first V S as: Jo > If our data consist of n individuals with onset times * i , and assuming that individuals are independent, the likelihood for the data can be written as a product 40 of density functions:] ) vr=i " T^i JO / The log-likelihood, 1(0), can then be written as: The log-likelihood is then maximized with respect to 6 to obtain the maximum likelihood estimate 0. Given this estimate of the parameter vector, it is possible to compute an estimate of the probability that a patient develops a tumour before a specific time t. A n estimate of this probability, F(t\0), is computed as follows: Jo 3.3.3 Genotype Information It may also be of interest to include information about the type of mutation that a patient has sustained to their NF2 gene into the model. We may wish to include this information in order to obtain probability predictions for individuals bearing a certain type of genetic mutation. The most obvious way of incorporating this information into the model is to allow the vector of parameters for the model to be different for patients with different types of mutations. Suppose a patient can have one of k different types of mutations and the vector of model parameters for an individual with mutation type j is denoted by Oj. Let mj denote the mutation type for patient i; thus, the data that are collected on patient i is the vector (ti, rrii) and patient z's contribution to the likelihood will be: o 41 U > 0. Additionally, let Dj be the set of patients in our data set with mutation type j. Then the likelihood for the data can be written as: k I = „i /»(ti|0j)exp{- j h{u\0j)du} { J] <-l \i€Dj ° J \ k n n A; e x p{- E E / Kv>\o )}du. 3 The log-likelihood can therefore be written as: k Wu k i- Ok) = E E iog{/»(*iiOj)} - E E j = i ieD^ j=i ieDj Again, estimates of the parameter vectors 0j / ' J o (j = 1, ...,k) can be obtained by maximizing the log-likelihood with respect to the parameters. A n obvious concern with this approach is that allowing the model parameters to differ for each mutation type greatly increases the number of parameters that must be estimated from the data; for large data sets this might not be a serious concern. However, since NF2 data sets are typically small and genotype information is often missing from the patient records, it is desirable to reduce the number of parameters to as few as are feasible. Constraints can be imposed to reduce the number of parameters and this is discussed in some detail below. Upon obtaining estimates of the model parameters Oj,j = 1 , k , sible to estimate quantities of interest. it is pos- The probability that an individual with mutation type j will develop a tumour by age i , F(t\0j), will be denoted by Fj(t) and can be estimated as: Fj(t) = 1 - e x p j - ^ h(u\dj)du}, t > 0. Additionally, we may also wish to compare different genotypes using these mutation models. Two genotypes could be compared by estimating the risk of developing a 42 tumour at a given age t for an individual with one genotype relative to the risk for an individual with another genotype. The risk of developing a tumour at age t for an individual with mutation type i relative to an individual with mutation type j is denoted by rij(t) and can be estimated as: r ^T)=am. (3.i3) Clearly the estimates for the relative risks given in equation (3.13) have a dependence on t; this is interpretable as a patient's risk of developing a V S will change with their age. In general, the hazard functions used in equation (3.13) to estimate the relative risk need not be evaluated at the same age. As an example, suppose there is interest in comparing the risk of developing a V S for an individual with genotype i at age t, relative to an individual with genotype j at age 10, then this relative risk could be estimated by h(t\9i)/h(10\0j). For our analyses we will estimate the relative risks according to equation (3.13) by estimating the risk of developing a V S for an individual with one genotype at age t, relative to the risk of developing a V S for an individual with another genotype at the same age. Suppose the model under consideration has three rate parameters for each mutation type; these would be the growth, death, and mutation rates for the intermediate cells. We can denote the parameter vector for the jth mutation type as Oj = (a^,L3^\n^) where l3^\ a,nd ii^ are the growth, death, and mutation rates respectively. The superscripts are used in place of subscripts to identify the mutation type as subscripts on these parameters are typically used in the literature to denote the stage of the model under discussion. A constraint that can be employed to reduce the number of model parameters is lS^ = \i (j = 1 , k ) \ this would imply that the mutation rates for the chance mutations are equal for patients with different mutations of the NF2 gene. This model assumes that the type of mutation of the NF2 gene affects only the rates at which intermediate cells in the tissue divide and die. The number of parameters in the full and reduced models are 3k and 2k + 1 respectively, where k is the number of different mutation types in 43 the data set. Clearly, the necessity to reduce the number of model parameters will depend on both the number of mutation types to be considered in the analysis and more importantly on the interpretability of the constraint given above. Figure 3.2: Plots of relative risks versus age obtained from a 3-hit model with specifically chosen parameter values: 6\ = (a^\ L3^ \ M ^ ) = (1.73 x 10 ,1.93 x 10 ,4.10 x 1 0 ) ; values for the components of 0 vary across the plots. 1 _1 _1 -4 2 ir (a) Assuming equal mutation and growth rates; plots vare labelled by the death rate used (b) Assuming equal mutation and death rates; plots are labelled by the growth rate used Figure 3.2 is an illustration of the estimated relative risks described above. The plots in this figure were produced by fixing the parameters of the 3-hit model to suitable values and plotting the relative risk as a function of age. For these plots the mutation rates for the two genotypes are assumed equal and the growth and death rates for the intermediate cells are allowed to differ across mutation types. In the left panel of the figure the growth rates for the two mutation types are assumed 44 to be equal and the relative risk is estimated, as a function of age, for a range of death rates. The plot suggests that individuals with mutation type 1 have a higher risk of developing a V S than do individuals with mutation type 2 when the death rate for their intermediate cells is lower than for individuals with the latter type of mutation; additionally, this relative risk increases as a function of age. This is quite reasonable as one would expect that when the intermediate cells die more quickly there would be fewer of them present in the tissue to sustain the final mutation and thus the likelihood of a tumour cell developing in the tissue would be smaller. The right panel of the figure shows a plot very similar to the plot from the left panel except that the death rates for the intermediate cells are assumed to be equal across the two genotypes and here the growth rates are allowed to vary across genotypes. Individuals with mutation type 1 clearly have a higher risk of developing a V S than individuals with mutation type 2 when the growth rate for their intermediate cells is higher than individuals bearing a mutation of the latter type; as previously this relative risk is an increasing function of age for the chosen parameter values. Again, this is quite reasonable as a more rapid growth of intermediate cells would imply that there are more cells present in the tissue that need only to acquire the final chance mutation to develop into tumour cells. It is possible to produce plots similar to these for various parameter vectors 6\ and 62; the plots included here were meant only as examples to illustrate the merits of using relative risks to summarize the models fit using genotype information. 45 Chapter 4 Estimating the Age at Onset of the Second VS Estimating the age at which a patient with NF2 will develop both the left and right VSs is also an interesting problem. Estimation of this quantity would require data related to both the onset of the left and right VSs; previously we have been concerned with only the age at onset of the first V S . A subset of the M U K data has information recorded on the onset of both VSs for 144 patients and are suitable for our modelling purposes. A n important note concerning these data is that the information recorded for each patient is the age at which these tumours were detected, not necessarily the age at onset for the two tumours; this point will be relevant in the interpretation of our estimates. The ages at which the right and left VSs develop are quite likely correlated on an individual and thus we choose to model these times as bivariate random variables. To allow for such dependence, we employ the the Kimeldorf and Sampson family of copulas [11, 10] and specify the margins using the appropriate multi-hit model. This model is sometimes called the Clayton-Oakes model [3, 24] and is often used to model bivariate survival data. (I) Let T± (r) and represent the ages at onset of the left and right VSs respec46 tively for individual i. We assume that these two random variables have marginal distribution functions Ft and F respectively; the marginal densities are denoted by r / ; and f . The notation C(u,v;S) is used to represent the Kimeldorf and Sampson r family of copulas, with dependence parameter 5. The specific form of this family is given as: C{u,v;S) = {u- + vs 0<J<oo, - 1)~ , s 1/6 where U and V are random variables with Uniform(0,l) marginal distributions. The joint density for U and V under such a model is given by: c{u,v;6) = (l + <J)[uu]~* ( « ~ * + « _1 - 0 < 8 < oo. * - l)- ~ \ 2 l l Thus, we replace u and v in these expressions by Fi(t^) and F (t( ^) for our applir r cation. The joint density function of ,, f 1 1 ( 0 ,(rh ' A and can now be written as: _ d C(Fi(t^),F (t^);S) = c i F ^ ^ r i t ^ S ) ^ ) ! ^ ) , 2 r dF dF j t r The marginal distributions for /dF (t^) \ V dtW J\ t and /dF (t^)\ r m<r) ) 0<5<oo. are assumed to be identical; we will denote the hazard, density, and distribution functions for these random variables as h(t\6), f(t\6), and F(t\0) respectively. This assumption is justifiable as there is no reason to assume that a V S on one side of the head is more likely to develop by a certain age than a V S on the other side of the head. This assumption simplifies the expression for the joint density of hM \t l { r ) ) = = and T ^ : c(F(tW\0),F(t^\e);6)f(tU\O)f(t^\e) (l + 5)(F(tW\0)~ + F(t^\0)- -iy ~ s 5 x \F(t^e)F{t^9)V ~ f(t^0)f(t^\e), S 1 2 1/S (4.1) 0<6<oo. In the previous sections we derived an expression for the hazard function for the time until the onset of the first V S . The hazard function derived was the hazard for the entire tissue at risk of developing the VS; this tissue consisted of all of the 47 Schwann cells surrounding both the left and right vestibular nerves. In this section, the hazard function given in our expressions will denote the hazard function for the time until the generation of the first tumour cell in a single vestibular nerve; this hazard function will specify the marginal distribution of both and T^ \ The r derivation of the hazard function proceeds identically to the derivation from the previous sections, except that the number of cells that are assumed to be initially present in the tissue are divided in half. Using previous results, and the relationships between the hazard, distribution, and density functions, we can reexpress the joint density from equation (4.1) in terms of this hazard function: rt fl,r(t \t ) {l = {r) , -, -5-1 (j) h(s\0)ds \\ h(s\0)ds}] U + <*) II [l-exp{-/ je{/,r} <- - - -<5-i - 2 - 1 / ( 5 t U ) - 1 + J2 ( l - e x p { - / Ml,r} -° h(s\0)ds}) J J xh(tW\0)h(tW\0)exp{- h(s\0)ds} exp{-J —5—1 pt^ = {1 + 6) JJ [l-exp{-/ x -1+ (l-exp{-/* je{i,r} h(s\0)ds} h(s\d)ds}] J /i(s|6>)ds}) _<5 ]" 2_1/<5 o je{i,r} Jo If our data consist of n patients, with observed onset times (t^\t^),(tn\tn^), then our log-likelihood is simply: 1(0) = \og(l n + S)-(l + 6)J2 log{l-exp{h(s\0)ds}} i=lje{l,r} ° J /*'•"'' n -(2 + l / a ) £ l o g { - l + J2 (l-exp{r +E E iog{Mi?'V)}-E E i=l je{l,r) *=ije{/,r}' 48 —5 h( \0)ds}) 8 T^°) h ds } (-) 42 The log-likelihood given in equation (4.2) can be maximized with respect to the parameters 0 and S to obtain parameter estimates. These estimates can be imputed into our expression for the bivariate distribution function of T^ \ r denoted by Fi (t^ \t^), and to compute estimates of the probability that an l >r individual will develop both tumours by a given age; note that Fi (t^ \ t^ ) l ir C(F(t^\0),F(t^\0);5). r> = A plot of F^(t~t) versus t can be constructed and com- pared to a plot of the empirical bivariate distribution function to assess the fit of the model. The bivariate empirical distribution function would be computed as follows: m a x ( T ^ , T ^ ) ) < i} r F (t,t) E = n n A n interesting feature of this model is that it allows us to estimate the association between the ages at onset of the left and right VSs. This association is characterized by the model parameter <5. Interpreting the magnitude of an estimate of this parameter will be discussed in Chapter 5 and further details can be found in [10]. 49 Chapter 5 Results 5.1 Knudson's Model Knudson's model was fit using the age at onset of the first V S variable from both the M U K and S P O R datasets. The fitting of the model is fairly simple to implement and requires little computational effort. There is a single time-dependent parameter that must be estimated from the data at several pre-selected time points. The estimation is performed using a non-linear weighted least-squares procedure described previously in Chapter 2. A l l computations were performed using programs written in the C programming language. For our fitting eight time points were selected at which the model parameter would be estimated. Recall that the model parameter represents the fraction of cell divisions that have occurred prior to time t and is denoted by d(t). There are two other quantities that must be specified prior to the fitting of the model. These are the expected number of tumours eventually acquired by an NF2 patient, denoted by m(oo), and the ratio of the expected number of cell divisions in an individual's life to the number of Schwann cells originally present in the tissue; this latter quantity will be denoted by a(oo)/6(0). The results presented here have been generated using m(oo) = 2 and a(oo)/6(0) = 2 x 10 . It is natural to specify the expected number of 6 50 tumours eventually acquired by an NF2 patient to be two as all NF2 patients used in our fitting have bilateral V S . The second quantity is more difficult to justify as we require some information about the number of cells present in the tissue and the number of cellular divisions that occur throughout an individual's life. The value chosen is identical to the value used by Hethcote and Knudson [8] in their application to retinoblastoma and is also consistent with the growth functions for the tissue that we assume in the next section. As well, the results obtained in the model fitting depend very little on the value chosen for this quantity. Hethcote and Knudson [8] report that for their application, estimates of d(t) change by less than 1% if any value between 10 and 10 is selected for a(oo)/6(0). Our experiences with perturbations 5 8 of this quantity are consistent with those described by the aforementioned authors. Table 5.1 shows the time points that were selected for the estimation of the fraction of cell divisions and the resulting estimates. Note that estimates of the standard errors are not provided as the method of estimation does not suggest an obvious method for computing standard errors. These estimates can be imputed into equations (2.1) and (2.2) to yield estimates of the cumulative probability that an individual will develop a V S prior to a certain age; these estimates are computed for both NF2 and sporadic patients. A plot of these model estimated probabilities is given in Figure 5.1; plots of the empirical distribution functions have also been added to assess the fit of the model. Overall, the model fit appears to be adequate despite a few deviations between the observed and fitted incidence functions. The model tends to under-predict the incidence for NF2 patients for a range of ages (ages 35-60); it also over-predicts the incidence for the sporadic cases on a range of ages (ages 25-50). A chi-square goodness of fit test was used to test the adequacy of the model fit and suggested that there is some evidence of departure between the model and the data ( x l = 10.346, p-value = 0.066). To compute the test statistic, observed and expected incidences of V S were compared in several age intervals; intervals with observed or expected counts of less than five were combined 51 Figure 5.1: Plots of empirical and estimated probabilities from Knudson's model for both NF2 patients and sporadic cases Table 5.1: Estimates of the fraction of cell divisions d(t) at various time points from Knudson's model (t) d(t) age 0 10 18 26 0 0.006 0.087 0.204 34 , 0.300 42 50 58 66 oo 0.371 0.445 0.572 0.794 1 for asymptotic considerations. These results have motivated us to further explore the two-mutation hypothesis with other models to see if results are consistent across different models. 5.2 5.2.1 2-hit M o d e l s Deterministic Tissue Growth The model fitting for the 2-hit model with deterministic growth of the tissue is quite simple. There is no need to employ a numerical routine to perform the maximization of the likelihood function given in equation (3.1), as a closed form expression for the mutation rate parameter exists. We have only to select a suitable growth function for the tissue and input this function into equation (3.2) to compute our estimate of the mutation rate. Three candidate growth functions were selected for the model fitting. These are the same three functions from the logistic family that are plotted in Figure 3.1. It is assumed in each case that the tissue originally contains 20 cells and after a certain age reaches a constant size of 10 cells; as was discussed previously, the three 7 growth functions differ from one another in the age at which the tissue reaches its maximum size and the rate at which the tissue grows. In computing the muta- tion rate estimate, there is a need to integrate the growth function several times, each time over a finite interval; these integrals have been computed numerically using Romberg integration [16]. A l l computations were implemented using programs written in the C programming language. 53 The models were fit to both the age at onset of the first V S and age at onset of hearing loss variables from the Manchester data set, as well as the age at onset of hearing loss variable from the multi-source data set. Parameter estimates and estimated standard errors from the model fittings are summarized in Tables 5.2, 5.3, and 5.4 for the age at onset of first V S ( M U K ) , age at onset of hearing loss (MUK) and age at onset of hearing loss (FSS) respectively. The estimates for the mutation rates obtained using the three different growth functions are ordered identically across the three different data sets; specifically, the logistic(10 , 5.0,0.8) growth function 7 consistently yields the smallest rate estimate, while the logistic(10 , 8.5,1.3) func7 tion produces the highest estimate of the mutation rate for all three data sets. It is important to recognize that the magnitude of the mutation rate estimate for this model depends heavily on the number of tissue cells that are assumed to be present in the adult tissue. This is quite clear from the expression for the mutation rate estimate. If the number of tissue cells in an adult tissue was rescaled by a multiplicative factor of k then the estimate of the mutation rate would consequently be rescaled by a factor of l/k. Model fitted cumulative distribution functions for the age at onset random variable will not however be affected by rescaling the tissue growth function by a multiplicative factor. For example, models fit using the logistic(10 ,8.5,1.3) 7 and logistic(10 ,8.5,1.3) growth functions will produce identical estimates of the 5 cumulative distribution function despite having different estimates for the mutation rate. Estimates of the cumulative distribution functions obtained using the three different data sets are presented in Figures 5.2, 5.3 and 5.4. In each of these figures, the three estimated distribution functions, each obtained using a different growth function for the tissue, are plotted against the empirical distribution function. For all three data sets it is evident that models fit assuming logistic(10 ,8.0,0.8) and 7 logistic(10 , 8.5,1.3) growth functions yield very similar estimates of the cumula7 tive distribution function. Additionally, these estimates are also more consistent 54 Table 5.2: Model fitting results for 2-hit model with deterministic tissue growth using M U K data Variable: Age at onset of the first VS Growth function: logistic(10 ,5.0,0.8) Parameter Estimate Estimated S E 7 4.052 x 1 0 0.317 x 10~ M value o f log-likelihood: -686.280 - 9 9 Growth function: logistic(10 ,8.0,0.8) 7 Parameter Estimate Estimated S E 4.609 x 10" 0.361 x 10" M value o f log-likelihood: -669.359 9 9 Growth function: logistic(10 ,8.5,1.3) 7 Parameter Estimate 4.715 x IO" Estimated S E 0.369 x 10~ M value o f log-likelihood: -665.474 9 9 with the empirical distribution function than the estimate obtained assuming a logistic(10 ,5.0,0.8) model for the growth of the tissue. 7 Overall, the fit of these simple one-parameter models to the empirical distribution functions is fair. In particular, all three of the models tend to predict an earlier onset of the tumours than is reflected in our data. The explanation for the lack of model fit could be any of several things. One explanation could be that our assumed growth functions are all incorrect; this possibility is difficult to assess given our limited information on the growth of the tissue. Uncertainty in our assumptions on the growth of the tissue is one motivation for choosing a model where the parameters governing the growth of the tissue are estimated from the data. The 2-hit model with stochastic growth of the tissue discussed previously is an example of such a model. Results from the fitting of such a model will be discussed in the next section. 55 Table 5.3: Model fitting results for 2-hit model with deterministic tissue growth using M U K data Variable: Age at onset of hearing loss Growth function: logistic(10 ,5.0,0.8) Parameter Estimate Estimated S E 7 4.524 x 10~ 0.377 x 10~ A* value o f log-likelihood: -590.127 9 9 Growth function: logistic(10 ,8.0,0.8) 7 Parameter Estimate Estimated S E 5.228 x 1 0 0.436 x 10~ A* value o f log-likelihood: -573.221 - 9 9 Growth function: logistic(10 ,8.5,1.3) 7 Parameter Estimate Estimated S E 5.364 x IO" 0.447 x 10" A* value o f log-likelihood: -569.611 9 9 Table 5.4: Model fitting results for 2-hit model with deterministic tissue growth using FSS data Variable: Age at onset of hearing loss Growth function: logistic(10 ,5.0,0.8) 7 Parameter Estimate 4.625 x 1 0 Estimated S E 0.358 x I O A* value o f log-likelihood: -685.683 - 9 - 9 Growth function: logistic(10 ,8.0,0.8) 7 Parameter Estimate Estimated S E 5.358 x IO" 0.415 x 10~ A* value o f log-likelihood: -669.483 9 9 Growth function: logistic(10 ,8.5,1.3) 7 Parameter Estimate 5.499 x 10~ Estimated S E 0.426 x IO" A* value o f log-likelihood: -662.696 56 9 9 Figure 5.2: Plots of empirical and model predicted probabilities for 2-hit model with deterministic growth of tissue; using age at onset of first V S ( M U K data) " o Age (t) 57 Figure 5.3: Plots of empirical and model predicted probabilities for 2-hit model with deterministic growth of tissue; using age at onset of hearing loss ( M U K data) Figure 5.4: Plots of empirical and model predicted probabilities for 2-hit model with deterministic growth of tissue; using age at onset of hearing loss (FSS data) 5.2.2 Stochastic Tissue Growth Fitting the 2-hit model with stochastic tissue growth is more complicated than fitting the model from the previous section. One of the most obvious differences between these two models with respect to fitting, is that the parameters from the fully stochastic model do not have closed form maximum likelihood estimates; all three parameters must be estimated by a numerical routine. For this model, a likelihood for the data was constructed according to the method described in section 3.3.2 by expressing the likelihood function in terms of the hazard function for the time to the first tumour cell. The log-likelihood was maximized using a Quasi-Newton routine [23] and all computations were implemented using programs written in C. All definite integrals in the expression for the likelihood were computed numerically using Romberg integration [16]. The growth and death of the tissue cells in this model are governed by a stochastic process and hence there is no need to select a growth function for the tissue. We do however need to select the number of cells initially present in the tissue, denoted previously by N. We have chosen to use N = 20 for our fitting; this is consistent with the number of tissue cells initially present for the model with deterministic growth of the tissue. Tables 5.5, 5.6 and 5.7 contain the parameter estimates and the estimated standard errors from the fitting of the model to the three data sets. Estimates of the model parameters are consistent in magnitude across the three data sets. The ordering of the growth, death and mutation rates is also consistent across the three data sets. Several different vectors of starting values were selected for the Quasi-Newton routine; most of these yielded consistent solutions. Several starting values resulted in the convergence to a local maximum; the value of the log-likelihood was used to discriminate between local and global maxima. The value of the log-likelihood evaluated at the maximum likelihood estimates of the model parameters is given in the aforementioned tables below the parameter estimates. 60 Table 5.5: Model fitting results for 2-hit model with stochastic tissue growth using M U K data Variable: Age at onset of the first V S Parameter a P Estimate 4.214 x K T Estimated S E 3.352 x K T 3.154 x IO" 1 1.010 x 10" 1 1.013 x 1 0 0.642 x 10~ 4 1 _ 1 4 A* value of log-likelihood -659.966 Table 5.6: Model fitting results for 2-hit model with stochastic tissue growth using M U K data Variable: Age at onset of hearing loss Parameter a P Estimate 4.999 x 1 0 4.009 x I O 3.250 x I O Estimated S E _ 1 - 1 - 4 1.209 x 1 0 1.112 x I O 0.714 x 10~ _1 - 1 4 value o f log-likelihood: -568.175 Table 5.7: Model fitting results for 2-hit model with stochastic tissue growth using FSS data Variable: Age at onset of hearing loss Parameter Estimate a 4.844 x 10" . 3.860 x IO" 3.406 x IO" Estimated S E 1 1 4 1.138 x IO" 1.049 x IO" 0.771 x 10~ A* value o f log-likelihood: -656.965 61 1 1 4 Upon obtaining estimates of the model parameters, it is possible to plot the estimate of the cumulative distribution function for the age at onset variable. A plot of the empirical distribution function can be compared to these model fitted cumulative distribution functions to assess the overall fit of the model. Figures 5.5, 5.6 and 5.7 are plots of the fitted cumulative distribution function versus the empirical distribution function for the three data sets. The fit of this model is clearly an improvement over the model that assumes a deterministic function for the growth of the tissue. For all three data sets the model tends to over-predict the incidence of tumours at young ages (ages less than 15 years); this does not appear to be a serious concern however, as the model seems to fit reasonably well overall. As an additional check on the fit of the model, it is possible to simulate onset times from the fitted model and compare their distribution to that of our data. Simulating onset times from the fitted model, which we will denote by T*, can be done according to the following simple algorithm: • Recall that under our model F(T\0) = 1 — expj — h(s\0)ds^ ~ Uni- form(0,l). • Generate a Uniform(0,l) random variable U*. • Set U* = 1 - expj- JQ h(s\0)ds^ and solve for the onset time T*. Although it would appear that simulating.onset times is quite a simple task, the final step in the algorithm does require the use of a numerical method to solve the given equality. I have used the bisection method for this and have found it to be quite successful. Using the bisection method for this requires that we bracket the onset time between two points; I chose to use the interval [0,100] and have encountered no difficulties withfindinga sensible solution thus far. Samples of onset times were simulated from each of the three fitted models. The number of onset times simulated from each model was identical to the sample size of patients used to fit the model. Histograms of the simulated onset times and 62 of the actual data are presented in Figures 5.8, 5.9 and 5.10 for the three data sets. The histograms for the simulated onset times are quite similar in their distribution to the actual data for onset times greater than 20; the two distributions do however differ for onset times less than 20. Simulations from the fitted models generate a higher number of early onset times (onset times less than 10) than are represented in our data sets. The quartiles for the simulated data sets match the quartiles of the actual data very closely; in fact the medians for the simulated and actual times are identical for all three of the data sets. 5.2.3 Genotype Patients used for the fitting of this model were stratified according to mutation type into one of two groups: patients with protein truncating mutations and patients with other known mutation type. The group of patients with protein truncating mutations includes the patients from our dataset with either frameshift or nonsense mutations; these types of mutations produce a similar effect on the protein product and thus have been grouped together. Our second stratum includes all other known mutation types from our dataset; this group will be less homogeneous than the protein truncating group with respect to the effects on protein product produced by different mutation types in this stratum. This stratification was chosen as a result of the small number of patients in our dataset with identified mutation type. For simplicity we will refer to the patients with truncating mutations as having genotype 1 and patients with other types of mutations as having genotype 2. Two models were fit to the data: the first assuming that the mutation rate parameters for the' two genotypes were identical; and the second model allowing these mutation rate parameters to be different across the two genotypes. Both of these models were fit to the FSS data set on the age at onset of hearing loss variable. Parameter estimates and estimated standard errors from the five-parameter model are given in Table 5.8; the value of the log-likelihood evaluated at the maxi- 63 Figure 5.5: Plots of empirical and model predicted probabilities from 2-hit model with stochastic tissue growth: Age at first V S data ( M U K data) o Age (t) 64 Figure 5.6: Plots of empirical and model predicted probabilities from 2-hit model with stochastic tissue growth: Age at hearing loss data ( M U K data) Figure 5.7: Plots of empirical and model predicted probabilities from 2-hit model with stochastic tissue growth: Age at hearing loss data (FSS data) Figure 5.8: Histograms of data versus model simulated values for Age at first VS (MUK data); sample size of 163 patients o 0 20 40 60 80 100 Age at onset (a) Histogram of actual onset times 0 20 40 60 80 Age at onset (b) Simulation from the fitted model 67 100 Figure 5.9: Histograms of data versus model simulated values for Age at Hearing loss (MUK data); sample size of 144 patients (a) Histogram of actual onset times (b) Simulation from the fitted model 68 1 Figure 5.10: Histograms of data versus model simulated values for Age at Hearing loss (FSS data); sample size of 167 patients 0 20 40 60 80 100 Age at onset (a) Histogram of actual onset times 0 20 40 60 80 Age at onset (b) Simulation from the fitted model 69 100 Table 5.8: Model fitting results for 2-hit model incorporating genotype information (common mutation rates for both genotypes) Variable: Age at onset of hearing loss Parameter Estimated S E Estimate 10.231 x 10" 1 0.829 x 10" 1 1.887 x IO" 1 a& 5.341 x IO" 1 0(0 8.336 x I O ' 1 0.816 x 10" 1 4.339 x 10" 2.272 x IO" 1 1.788 x 10" 0.702 x I O 1 4 A* value of log-likelihood: -253.747 - 4 Table 5.9: Model fitting results for 2-hit model incorporating genotype information (different mutation rates for both genotypes) Variable: Age at onset of hearing loss Parameter Estimate Estimated S E 10.112 x 10" 3.109 x IO" 1 1 0.723 x IO" 1 _1 2.906 x 10" 1 5.203 x 10" 8.248 x 1 0 0(2) 1 4.208 x IO" 1 0.711 x 1 0 2.397 x 10~ 4 1.057 x IO" 4 2.318 x 10~ 4 0.963 x 10~ 4 _ 1 value of log-likelihood: -253.755 mum likelihood estimates is provided as well. Table 5.9 presents a similar summary of the parameter estimates and estimated standard errors from the six-parameter model. Note that the estimated model parameters are very similar for both models. In particular, the estimates for the mutation rates do not differ significantly across the different models. This supports our original intuition that the mutation rates for the two genotype groups should be equal. For both models, there are differences in in the estimates of the cell division and death rates across the two different genotype groups; the implications of this will be discussed below. Figure 5.11 is a plot of the estimated and empirical cumulative distribution 70 functions for both genotypes; the estimated cumulative distribution functions have been computed using the fitted five-parameter model. The overall fit of the model is good; the most obvious concern would be that the model over-predicts the incidence of tumours for patients with genotype 2 for ages less than 15; this departure might be explained by the small sample sizes. Assessment of the fit is somewhat difficult given that the sample sizes from each of the genotype groups were quite small; recall that the sample sizes were 40 and 28 for genotype group 1 and genotype group 2 respectively. 95% confidence intervals for the empirical distribution function have been overlayed at ages 20 and 36 for both genotypes; these values were chosen only as examples. Figure 5.12 is an identical plot to that described above except that the estimated cumulative distribution functions have been computed using the fitted six-parameter model. This figure is almost indistinguishable from Figure 5.11; this is of course expected given the similarity in the parameter estimates given above. A likelihood ratio test confirmed that the addition of the extra mutation rate parameter did not result in a significantly better explanation of the data (xl = 0.016, p = 0.899). There is a suggestion from the plots in Figures 5.11 and 5.12 that a patient's genotype affects the age at which they develop hearing loss (a surrogate for the age at onset of the first VS). Patients with protein truncating mutations clearly develop their hearing loss at an earlier age than patients with other mutation types. This observation is consistent with data from a previous study that observed NF2 patients with protein truncating mutations developing characteristic disease features at an earlier age than patients with spilce-site or missense mutations [7]. Although the six-parameter model allows the mutation rate parameters to differ across genotype groups, the estimates for the mutation rates are very similar for the two groups. This would suggest that the difference observed in the age at onset of hearing loss may be attributed to differences in the rates at which the pre-tumour cells divide and die across the two groups; this is a reasonable hypothesis to explain the differences 71 Figure 5.11: Plots of empirical and model predicted probabilities from genotype model with common mutation rates: Age at hearing loss data (FSS data) Figure 5.12: Plots of empirical and model predicted probabilities from genotype model with different mutation rates: Age at hearing loss data (FSS data) in the age at onset of hearing loss across the two genotype groups. A possible means of quantifying the differences in the cell division and death rates across genotypes might be the use of what Moolgavkar and Knudson [19] refer to as the 'growth advantage' of the pre-tumour cells. These authors explain that when the difference between the division rate and the death rate, a — fl, is positive, the pre-tumour cells are said to have a growth advantage; when this quantity is negative the pre-tumour cells are said to have a growth disadvantage. For genotype 1 the difference between the division and death rates is estimated as 1.895 x 1 0 , _1 (based on the five-parameter model) indicating a growth advantage for the pretumour cells; this same estimate for genotype 2 is 1.002 x 1 0 , indicating a growth _1 advantage as well. The growth advantage for genotype 1 is larger than it is for genotype 2 suggesting that the number of pre-tumour cells at risk of sustaining a chance mutation will likely be larger in patients with genotype 1 than in patients with genotype 2. This could perhaps explain the earlier age at onset of tumours in patients with protein truncating mutations. Moolgavkar and Knudson provide some interesting results related to the risk of tumour development in groups with different growth advantages using an approximate two-mutation model [19]. 74 5.2.4 Estimating the Age at Onset of the Second V S To fit the model for the age at onset of the second V S from Chapter 4 we must first specify the marginal distributions for the ages at onset of the left and right VSs. We stated previously that these two random variables are assumed to be identically distributed and discussed the justifications for this assumption. We have chosen to use the two-mutation model with stochastic growth of the tissue to specify the marginal distribution for the ages at onset of the left and right tumours. A discussion on the derivation of the hazard function for these random variables was provided in Chapter 4. We have only to specify the number of Schwann cells initially present around a single vestibular nerve. We have assumed this quantity to be 10 cells; this is consistent with our previous assumption that there were 20 cells initially present around both the left and right nerves. The log-likelihood given in equation (4.2) was maximized with respect to the model parameters using a Quasi-Newton routine; this was implemented using programs written in the C programming language. The model was fit to a subset of the M U K data; all 144 patients used for the fitting were probands with the ages at onset of both VSs recorded. Estimates of the four model parameters and their estimated standard errors are provided in Table 5.10; the value of the log-likelihood is also provided. The most striking estimate is the estimate of the dependence parameter 8. A n estimate of this magnitude for 8 corresponds to a value for Kendall's Tau of greater than 0.90 [10]; this indicates an extremely strong dependence between the ages at onset of the two tumours. Interpretating this parameter is somewhat delicate; it is possible that the dependence between the onset times for the two tumours that we have estimated here is artificial. For many individuals in our dataset, the ages at onset for the two tumours are recorded as the same age. The reason for this might be that many patients had already developed both tumours prior to their first examination. In this case the actual ages at onset are not really known and have been recorded as the first age that the patient was observed. The parameter estimate for 8 is still meaningful here 75 perhaps, but it represents a different association. In this case it would represent the dependence between the ages at detection by physician of the left and right tumours; this may still be meaningful information for clinicians. Table 5.10: Model fitting results for bilateral model using M U K data Variables: Ages at onset of left and right V S Parameter Estimate a 9.698 x 10~ 3.820 x IO" 8.041 x 10" 21.457 P 6 Estimated S E 3.191 x 10~ 2.542 x IO" 2 2 1.633 x IO" 2.450 4 2 2 4 value of log-likelihood: -910.492 Figure 5.13 are plots of the empirical distribution functions for the ages at onset of the first and second VS; Figure 5.14 is a similar display for the model fitted distribution functions. It is clear from Figure 5.13 that the ages at onset for the left and right tumours are very similar for the patients in our data. A n inspection ot the data reveals that 122 of the 144 patients used in the fitting had both the left and right VSs present at the time of their first evaluation. There is slightly more separation between the model fitted distribution functions for the age at onset of the first and second tumours than is observed in the empirical distribution functions; the overall fit however, is quite reasonable. The distribution function for the age at onset of the second tumour estimated from the model is below the model fitted distribution function for the age at onset of the first tumour; this is the stochastic ordering that we would expect for these two distributions. Plots like the one shown in Figure 5.14 have an obvious value to physicians working with patients. Such plots would allow the clinician to explain the likely progression of disease for a patient with NF2 to the family of an affected individual. 76 Figure 5.13: Empirical distribution functions for the ages at onset of the first and second VS (MUK data) 5.2.5 3-hit Models In order to fit the 3-hit model to our patient data we must first choose a deterministic function for the growth of the tissue at risk, similar to what was necessary for the deterministic 2-hit model. We will fit our models assuming two of the three growth functions previously used in the fitting of the deterministic 2-hit model; both the logistic(10 ,5.0,0.8),and logistic(10 ,8.5,1.3) 7 7 will be employed. We will not display results from fitting the model assuming the third previously described growth function as our expectation is that they would be very similar to those obtained from assuming the logistic(10 ,8.5,1.3) growth pattern. 7 We have also chosen to display results from fitting the model to a single dataset, rather than all three datasets; we have chosen to use the age at onset of the first V S data for all 3-hit analyses. Again, it is our expectation that the results would be reasonably consistent across all three datasets. Results obtained from varying the first parameter in the growth functions from 10 to 10 will be presented below as well to demonstrate the 7 6 sensitivity of the results to the choice of this parameter. Tables 5.11 and 5.12 display the parameter estimates and the values of the log-likelihoods for the models fit assuming the logistic(10 ,5.0,0.8) and 7 logistic(10 ,5.0,0.8) 6 growth patterns respectively. Tables 5.13 and 5.14 are similar displays for the models fit assuming the logistic(10 , 8.5,1.3) and logistic(10 , 8.5,1.3) 7 6 growth functions for the tissue. The values of the log-likelihoods for models fit assuming the logistic(10 ,5.0,0.8) 7 and logistic(10 ,5.0,0.8) 6 are essentially identical; this despite the fact that the parameter estimates differ quite a lot across these two models. This is also true for the models fit assuming the logistic(10 , 8.5,1.3) 7 and logistic(10 ,8^5,1.3) growth functions, suggesting that the parameter estimates 6 themselves are perhaps sensitive to the choice of the first parameter in the logistic growth functions; the value of the log-likelihood at the maximum likelihood estimate was not sensitive to the choice for this parameter value in our model fitting. This suggests that models fit assuming different growth functions could fit the data 78 Table 5.11: Model fitting results for 3-hit model using age at onset of the first V S ( M U K Data); logistic(10 ,5,0.8) growth function 6 Variable: Age at onset of the first V S Growth Function: logistic(10 ,5,0.8) 6 Parameter Estimate a 2.991 x 10" 3.175 x IO" 5.594 x I O P Estimated S E 1 1 - 5 0.183 x 10" 0.0929 x 10~ 0.434 x 10~ M value o f log-likelihood: -643.300 1 x 5 equally well, however some models may produce parameter estimates that are more interpretable than those obtained from other models. This is somewhat of a concern for us, as we are not certain about the number of cells that are present in an adult tissue. Figures 5.15-5.18 are plots of the empirical and model estimated distribution functions from the fitting of the four aforementioned 3-hit models. The model fit is quite good for all four models; in particular the models fit assuming the logistic^, 8.5,1.3), with K = 10 or 10 , fit the data exceptionally well. 6 7 The plots also support the claim cited above that models fit assuming different growth functions, for example growth functions that differ only in the first parameter, could fit the data equally well despite yielding different parameter estimates. The model estimated distribution functions from the fitting of the logistic(10 ,5.0,0.8) and 7 logistic(10 , 5.0,0.8) are indistinguishable from one another; this can also be said 6 for the model estimated distribution functions produced assuming the logistic(10 , 8.5,1.3) and logistic(10 ,8.5,1.3) growth functions. 7 6 Fitting any of the 3-hit models described above to patient data is less straightforward than the fitting of the 2-hit models described previously. Our experience has shown that the log-likelihoods for these models are quite flat and thus a very large sample is desired to facilitate the search for the maximum. The fitting was quite sensitive to the starting values used for the Quasi-Newton algorithm and con- 79 Table 5.12: Model fitting results for 3-hit model using age at onset of the first V S ( M U K Data); logistic(10 ,5,0.8) growth function 7 Variable: Age at onset of the first V S Growth Function: logistic(10 ,5,0.8) Parameter Estimate Estimated S E 7 a P 6.815 x IO" 7.003 x 10" 1.773 x 10~ 0.00140 x 10" 0.0467 x I O 0.0615 x 10~ 2 1 1 - 1 5 5 A* value of log-likelihooc : -643.301 Table 5.13: Model fitting results for 3-hit model using age at onset of the first V S ( M U K Data); logistic(10 ,8.5,1.3) growth function 6 Variable: Age at onset of the first V S Growth Function: logistic(10 ,8.5,1.3) e Parameter Estimate a 3.825 x 10" 4.447 x 10" 7.575 x 10~ P Estimated S E 1 1 0.0433 x 10" 0.294 x I O 1 - 1 0.909 x 10~ A* value of log-likelihood -639.988 5 5 Table 5.14: Model fitting results for 3-hit model using age at onset of the first V S ( M U K Data); logistic(10 ,8.5,1.3) growth function 7 Variable: Age at onset of the first V S Growth Function: logistic(10 ,8.5,1.3) 7 Parameter a Estimate Estimated S E P 6.004 x I O 6.647 x I O V- 2.417 x 10~ - 1 - 1 5 0.947 x I O 0.744 x IO" - 1 0.241 x 10~ value of log-likelihood -639.990 80 1 5 Figure 5.15: Plots of empirical and model predicted probabilities from 3-hit model assuming logistic(10 ,5,0.8) growth for the tissue; age at first V S data ( M U K data) 6 o Age (t) 81 Figure 5.16: Plots of empirical and model predicted probabilities from 3-hit model assuming logistic(10 ,5,0.8) growth for the tissue; age at first V S data ( M U K data) 7 Figure 5.17: Plots of empirical and model predicted probabilities from 3-hit model assuming logistic(10 ,8.5,1.3) growth for the tissue; age at first V S data ( M U K data) 6 Figure 5.18: Plots of empirical and model predicted probabilities from 3-hit model assuming logistic(10 ,8.5,1.3) growth for the tissue; age at first VS data ( M U K data) 7 vergence to a local maximum occurred for several starting values. Moolgavkar and Luebeck [21] reported that they found estimates of the growth and death rates to be unstable for such a model and suggested that greater stability might be obtained by re-parameterizing the model. We have not attempted to estimate model parameters under any alternative parameterizations at this point. 5.2.6 Comparison of Several Models Several models have been presented in this thesis and results from the application of many of these models have appeared previously in this section. For a fixed dataset, it is natural to want to compare the fit of several models to the data to see if any one model provides a better fit than the others. In this section we will examine the fit of three of the models previously presented in this thesis, to the age at onset of the first V S variable from the M U K dataset; the three models will be the 2-hit model with deterministic tissue growth, 2-hit model with stochastic tissue growth and the 3-hit model. In particular, it is interesting to compare the fit of the 2-hit models with the 3-hit model, as this comparison is of biological importance. Such a comparison would allow us to examine which of the hypotheses for the development of tumour cells are most consistent with our data. Results for the 2-hit model with deterministic growth of the tissue and for the 3-hit model will be presented here for the models that assume a logistic(10 ,8.5,1.3) 7 growth function for the tissue. Table 5.15 is a summary of the log-likelihood values for the 3-hit model and for both the 2-hit model that assumes stochastic tissue growth and the 2-hit model that assumes deterministic tissue growth. The log- likelihood is largest for the 3-hit model and smallest for the 2-hit model that assumes deterministic growth of the tissue. The 2-hit model that assumes a stochastic growth of the tissue has a slightly larger log-likelihood value than the 2-hit model with deterministic tissue growth. One might have expected this to occur prior to the model fitting, as the fully stochastic model allows the parameters that govern the 85 Table 5.15: Comparison of log-likelihood values for different models; using age at onset of the first V S data ( M U K data) Variable: Age at onset of the first V S Model A log-likelihood 2-hit (Deterministic tissue growth)* 2-hit (Stochastic tissue growth) 3-hit* -665.474 • -659.966 -639.990 4.715 x 10" 3.154 x I O 2.417 x 10" 9 - 4 5 'Assuming a logistic(10 ,8.5,1.3) growth function for the tissue 7 growth and death of the tissue cells to be estimated from the data, whereas the deterministic model fixes these prior to the data analysis. It is interesting that the 3-hit model has the largest log-likelihood value of the three models. This indicates that this model provides a better fit to the data than either of the 2-hit models; this suggests that the 3-hit hypothesis might be more appropriate for the development of tumour cells than the 2-hit hypothesis (under the assumption that our models adequately represent these hypotheses). It is worth noting that the comparison of the 3-hit model with the 2-hit model that assumes deterministic growth of the tissue should be made using results that were obtained assuming a common growth function for the tissue across the two models. Figure 5.19 is a plot of the empirical distribution function for the data against the three model-estimated cumulative distribution functions. The estimated distribution function for the 3-hit model follows the empirical distribution function more closely than the distribution functions estimated from either of the 2-hit models. In particular, the 3-hit model fits very well for ages less than 20 years where both of the 2-hit models were observed to depart from the data. Again, there is suggestion from the plots of the estimated distribution functions that the 3-hit hypothesis for the development of tumour cells is most consistent with our data. Although it is possible to compare the fit of several models to the data using log-likelihood values and plots, determining which hypothesis is the most 86 appropriate for the development of the tumour cells is still quite difficult. Our greatest concern is that the true growth function for the tissue is unknown to us and the results from both the 2-hit model with deterministic tissue growth, and the 3-hit model, depend heavily on the function chosen for the analysis. The acquisition of more information about the growth of the Schwann cells would certainly improve our ability to compare the fit different models to the data. A n additional comparison to make might be to examine the magnitude of the estimated mutation rates from the three models described above. We have discussed previously that the mutation rate estimate for the 2-hit model with deterministic tissue growth depends directly on the value chosen for the number of cells in an adult tissue. The estimate for the mutation rate obtained from our fitting was 4.715 x IO" , which was considerably smaller than the estimate obtained from the 9 model that assumes stochastic growth of the tissue (3.154 x 1 0 ) . -4 We have not yet found a reference that suggests an appropriate value for this rate in patients with NF2 and thus comparing the rate estimates from the two models is difficult. Moolgavkar and Luebeck [21] reported mutation rate estimates in the range 3 x 1 0 - 8 to 4.5 x 10~ from their analysis of colon cancer data using a 2-hit model; there is 7 little reason however, to expect the rate estimates obtained from our fitting of the models to NF2 data to match their estimates. These same authors [21] also reported mutation rate estimates from an analysis of colon cancer data using a 3-hit model in the range 4.8 x 1 0 - 6 to 2.6 x 10~ . The estimate of this rate from our analysis 5 using the 3-hit model was 2.417 x 1 0 ; again, there is little reason for us to expect -5 a similarity in the estimates from our fitting and theirs. Discriminating between potential models for our data using the rate estimates as a criterion could only be done if information was available on the expected order of magnitude for these mutation rates under the different hypotheses for the development of the tumour cells. 87 Figure 5.19: Comparison of the estimated cumulative distribution functions for several models with the empirical distribution function for the age at onset of the first V S data ( M U K data) o T 0 20 40 Age (t) 88 60 80 Chapter 6 Recommendations for Future Work Future work could be surely improved with the addition of more precise information about the tissue at risk of developing the tumours. In particular, the number of Schwann cells present in an adult tissue would be especially useful. We contacted several sources in an attempt to acquire such information and were unable to find the necessary information. We thank the individuals who took the time to reply to our queries related to this information. Knowledge of the periods at which the tissue undergoes spurts of growth would be useful as well. From our experiences, this information would not be as influential on the model fitting as would the number of cells in an adult tissue. The most obvious suggestion for improving the fit of the models incorporating genotype information would be the acquisition of more patients with known mutation type. This would also allow us to make comparisons between more homogeneous genotype groups. The models fit in this thesis had to partition the patients into only two genotype groups; it would be interesting to explore these models with more than two genotype groups (e.g. splice-site mutations, protein-truncating mutations, missense mutations, other mutations). 89 The fitting of the model for the onset of the second V S could be improved with the acquisition of longitudinal data on NF2 patients. This would provide better information about the precise ages at onset of the left and right V S . Obviously, it is unrealistic to assume that this information could be found on probands. Probands are quite likely only going to be observed after being brought to medical attention for their condition. In our data, the presenting symptom for most patients was loss of hearing; this suggests that probands are monitored only after the onset of at least one V S making it very difficult to acquire accurate ages at onset for the two tumours. One suggestion here might be to use data on non-probands in the fitting; one non-proband per family could be randomly chosen to avoid issues of familial dependence. Non-probands, as a result of their family history, might be monitored more closely for the onset of the first and second V S . Our data did not contain enough non-probands with the necessary information to make the suggested fitting possible. In an unpublished manuscript [4], Evans et al. reported a mean difference between the ages at onset of the first and second V S of 5 years in non-probands; their study however, featured only 11 non-probands. It is still interesting to compare this mean difference in onset times to the mean difference of 1.05 years computed from our proband data. There is certainly a difference in sample sizes here that inhibits formal comparison, but still perhaps a suggestion that information on non-probands may be more informative for our modelling purposes. The model fitting for the 3-hit model could perhaps be improved with the acquisition of more data. The likelihoods for the 3-hit models were quite flat and finding the global maximum was considerably more diffcult than it was with the 2-hit models. This problem might be reduced if the sample size was sufficiently larger. As well, reparameterizations of the 2-hit and 3-hit models could be explored to see if there is an improvement in numerical stability of the estimates and their estimated standard errors. Moolgavkar and Luebeck [21] suggest parameterizing both the 2-hit and 3-hit models with the mutation rate //, the net cell division rate 90 a — /3, and the ratio of cell death to division /3/a. These authors claim that this parameterization yields estimates that are more stable than those obtained under the parameterization presented in this thesis. 91 Bibliography [1] Armitage, P., Doll, R., (1954). The age distribution of cancer and a multistage theory of carcinogenesis. British Journal of Cancer, 8, 1-12. [2] Chu, K.C., (1987). A nonmathematical view of mathematical models for cancer. Journal of Chronic Diseases, 40,163S-170S. [3] Clayton, D.G., (1978). A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika, 65, 141-151. [4] Evans, D.G.R., Lye, R., Neary, W., Black, G, Strachan, T., Wallace, A., Ramsden, R.T. (date unknown). The probability of bilateral disease in individuals presenting with a unilateral vestibular schwannoma. Unpublished Manuscript [5] Evans, D.G., Trueman, L., Wallace, A., Collins, S., Strachan, T. (1998) Genotype/phenotype correlations in type 2 neurofibromatosis (NF2): evidence for more severe disease associated with truncating mutations. J Med Genet, 35, 450-455. [6] Friedman, J.M., Gutmann, D.H., MacCollin, M., Riccardi, V.M., (1999). Neurofibromatosis: Phenotype, Natural History, and Pathogenesis, 3rd ed. Johns Hopkins University Press. [7] Friedman, J.M., Woods, R., Joe, H., Evans, D.G.R., Wallace, A., Mautner, V.F., Kluwe, L., Parry, D.M., Rouleau, G.A., Baser, .M.E. (1999). Germ-line 92 NF2 mutation type, location, and phenotype in neurofibromatosis 2. American Journal of Human Genetics, 65 (Supplement): pp. A126. [8] Hethcote, H . W . , Knudson, A.G.,(1978). Model for the incidence of embryonal cancers: application to retinoblastoma. Proc. Nat. Acad. Sci., 75, 2453-2457. [9] Ince, E . L . , (1926). Ordinary Differential Equations. London: Longmans, Green and Co. L t d . . [10] Joe, H . , (1997). Multivariate Models and Dependence Concepts. New York: Chapman & Hall, 139-149. [11] Kimeldorf, G . , Sampson, A . R . , (1975). Uniform representations of bivariate distributions. Comm. Statist., 4, 617-627. [12] Knudson, A.G.,(1971). Mutation and cancer: Statistical Study of retinoblas- toma. Proc. Nat. Acad. Sci., 68, 820-823. [13] Knudson, A . G . , Hethcote, H.W., Brown, B . W . , (1975). Mutation and childhood cancer: a probabilistic model for the incidence of retinoblastoma. Proc. Nat. Acad. Sci., 72, 5116-5120. [14] Little, M.P., (1995). Are two mutations sufficient to cause cancer ? Some generalizations of the two-mutation model of carcinogenesis of Moolgavkar, Venzon and Knudson, and of the multistage model of Armitage and Doll. Biometrics, 51, 1278-1291. [15] Little, M . R , Muirhead, C . R . , Stiller, C . A . , (1996). Modelling lymphocytic leukemia incidence in England and Wales using generalizations of the twomutation model of carcinogenesis of Moolgavkar, Venzon and Knudson. Statis- tics in Medicine, 15, 1003-1022. [16] Mathews, J . H . , (1992). Numerical Methods for Mathematics, Science, and En- gineering, 2nd edition. Englewood Cliffs, New Jersey: Prentice Hall. 93 [17] Mirz, F., Jorgenseri, B., Fiirgaard, B., Lundorf, E., Pedersen, C.B., (1999). Investigations into the natural history of vestibular schwannomas. laryngology, 24, Clinical Oto- 13-18. [18] Moolgavkar, S.H., Venzon, D.J., (1979). Two-event models for carcinogenesis: incidence curves for childhood and adult tumors. Mathematical Biosciences, 47, 55-77. [19] Moolgavkar, S.H., Knudson, A.G., (1981). Mutation and cancer: a model for human carcinogenesis. Journal of the National Cancer Institute, 66, 1037-1051. [20] Moolgavkar, S.H., Dewanji, A., Venzon, D.J., (1988). A stochastic two-stage model for cancer risk assessment. I. The hazard function and the probability of tumor. Risk Analysis, 8, 383-392. [21] Moolgavkar, S.H., Luebeck, G., (1990). Two-event model for carcinogenesis: biological, mathematical, and statistical considerations. Risk Analysis, 10, 323341. [22] Moolgavkar, S.H., Luebeck, G., (1992). Multistage carcinogenesis: populationbased model for colon cancer. Journal of the National Cancer Institute, 84, 610-617. [23] Nash, J.C., (1990). Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation, 2nd edition. New York: Hilger. [24] Oakes, D., (1982). A model for association in bivariate survival data. of the Royal Statistical Society, Series B, 44 Journal , 414-428. [25] Parry, D.M., MacCollin, M.M., Kaiser-Kupfer, M.L, Pulaski, K., Nicholson, H.S., Bolesta, M., Eldridge, R., Gusella, J.F.,(1996). Germ-line mutations in the neurofibromatosis 2 gene: correlations with disease severity and retinal abnormalities. Am J Hum Genet, 59, 529-539. 94 [26] Parzen, E . , (196.2). Stochastic Processes. San Francisco: Holden-Day. [27] Ross, S . M . , (1983). Stochastic Processes. New York: Wiley. [28] Serio, G . , (1984). Two-stage model for carcinogenesis with time-dependent parameters. Statistics and Probability Letters, 2, 95-103. 95 Appendix A G e n e r a l S o l u t i o n o f t h e R i c c a t i E q u a t i o n ( e q u a t i o n (3.6)) Equation (3.5) is a partial differential equation of the form: which has general solution of the form g(u,v) = 0, where u(t, x,(f>) = a is the solution to Pdx = Qdt u(t, x, 0) = b is the solution to Pd<j) = Rdt. In our problem P — 1, R — 0 and Q{x, z) = —[fxxz + ax 2 + /3 — (a + L3 + ii)x\- Let s be the antiderivative of 1/Q with respect to x. Then Pdx = Q(x)dt becomes dx = Q(x)dt or s(x) = t + a. Pdcj) = Rdt becomes dcj) — 0 or <f> — b. Thus, the general solution is of the form g(s(x) —t,<f>) = 0. Applying the boundary condition (f)(x,z,0) = x at t = 0 we get g(s{x),x) = 0 or g(u,v) = u — s(v) or s{x) - t - s((j)) = 0. Differentiating this last expression with respect to t yields: -i =/ , „ £ = [«,),-'!. Rearranging this leads directly to ^ = —Q((j>) which is the Riccati equation (equation (3.6)). 96 A d d i t i o n a l J u s t i f i c a t i o n f o r E q u a t i o n (3.11) ty'(y,z;t) Here we justify the expression for used in the derivation of the three-hit hazard function. Recall that: oo Ylp'^,k);t)y^z = j,k=0 oo k = E [-•*(« + 0 + /*) • • P(U, k);t) *)•; *) - j,k=0 + (j - l)a-p((j - l,k);t) + (j + 1)(3 -p((j + l,k);t) +LiX(t) • p((j - 1, k);t) +jfi • p((j, k - l);t) } y*z oo oo k = E l-J(a j,k=0 oo + + L3 + »)-p((j,k);t)]yiz - £ LiX(t) j,k=0 k J2(j-l) .p((j-l,k);t)yiz a j,k=o + ^ ( j + l)/3-p((i + l , A ; ) ; ^z • p((j,k);t)y>z k k k j,k=0 oo vX(t)-p((j-l,k);t)yiz + £ j,k=0 oo + Y 3' -P(U,k-l);t)^z j,k=0 i l k (A.l) k Now we examine each term from the previous expression in more detail: oo E oo [-3(a + P + »)-p((J,k);t)}yiz k = -(a + j3 + ) y £ jp^k);^- M j,k=0 j,k=0 = Y l*X(t) • p((j, k);t)yiz j,k=o -(a k 97 + j3 + Lt)y = nX{t)V{y, d*(y,z',t) dy z; t) 1 z k J](,'-l)a.p((i-U);t)^ y a J3(j-l)p((j-l,*);t)^-V = 2 j,fc=0 j,fc=0 #y j,fc=0 j,fc=0 <9y j,fc=0 j,fc=0 = j,fc=0 X(t)V(y,z;t) W j,fc=0 ^(y,g;*) Substituting these 6 expressions above into equation (A.l) we obtain the following: *'(y, z; t) = (y - l)A*X(t)tt(y, *; t) + [/iyz + ay + /3 - (a + 0 + 2 which is the expression given previously in the derivation of the three-hit hazard function. 98 Appendix B Glossary allele alternative forms of a genetic locus; a single allele for each is inherited separately from each parent. autosome a chromosome not involved in sex determination. The human genome consists of 22 pairs of autosomes and the sex chromosomes. dominant allele the allele in a heterozygous state that determines the phenotype. exon the coding region of a gene. gene location on chromosome p=short arm, q=long arm; the location of a gene is often given by specifying the chromosome number and the letter representing the arm of the chromosome on which it is found. genotype the entire genetic constitution of an organism. Our definition will more specifically refer to alleles at a given locus. intron the area between coding regions of the gene [i.e. the regions between exons] locus position of a gene on a chromosome. meningioma a tumour on the coverings of the brain and spinal cord. 99 protein truncating mutation examples of such mutations include nonsense [premature stop codon from base-pair change], frameshift [bases involved not a multiple of three] and deletions; these mutations affect the D N A coding at a locus in that protein is not fully produced. splice site mutation a mutation between the exon and intron. missense mutation base-pair change such that there is a substitution of one amino acid for another. phenotype any genetically controlled, observable property of an organism. proband the first family member to be brought to medical attention for their condition or genetic disease. retinoblastoma a malignant tumour of the eye that arises in the retinal cells, usually occurring in children, with a frequency of 1 in 20000. Associated with a deletion on the long arm of chromosome 13 (13q). tinnitus a ringing of the ears. vestibular schwannoma a tumour of the Schwann cells that line the vestibular nerves. 100
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Models for the development of tumours in neurofibromatosis...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Models for the development of tumours in neurofibromatosis 2 Woods, Ryan R. 2000
pdf
Page Metadata
Item Metadata
Title | Models for the development of tumours in neurofibromatosis 2 |
Creator |
Woods, Ryan R. |
Date Issued | 2000 |
Description | Neurofibromatosis 2 (NF2) is a rare genetic disease that affects approximately 1 in 40000 people, some of the characteristic features of this disease include the onset of multiple tumours on the cranial and spinal nerves, juvenile cataracts and hearing loss. Almost all affected individuals develop bilateral tumours of the Schwann cells that line the vestibular nerves; these tumours are called as vestibular schwannomas (VS). Evidence from molecular genetic studies has suggested that a "2-hit" hypothesis is appropriate for the development of VS in patients with NF2; that is to say that a tumour cell develops from a normal Schwann cell after the cell sustains two mutations to its genetic material. Several authors have proposed probabilistic models for this process and have shown that such models are consistent with incidence data for several different types of cancer. We will discuss a selection of probabilistic models for a "2-hit" hypothesis and present the results from the fitting of such models to incidence data from NF2 patients. Molecular evidence does not exclude the possibility that additional hits are necessary for the development of VS; we will discuss a "3-hit" model and compare the model's fit to both the data and to the fit of the "2-hit" models. Genotypephenotype correlations have been reported in patients with NF2 and thus a model that incorporates a patient's genotype is presented and discussed. Finally, a bivariate model is proposed to estimate the distributions of the ages at onset of both the first and second VS. |
Extent | 3933463 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-07-20 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0089685 |
URI | http://hdl.handle.net/2429/11019 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2000-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_2000-0616.pdf [ 3.75MB ]
- Metadata
- JSON: 831-1.0089685.json
- JSON-LD: 831-1.0089685-ld.json
- RDF/XML (Pretty): 831-1.0089685-rdf.xml
- RDF/JSON: 831-1.0089685-rdf.json
- Turtle: 831-1.0089685-turtle.txt
- N-Triples: 831-1.0089685-rdf-ntriples.txt
- Original Record: 831-1.0089685-source.json
- Full Text
- 831-1.0089685-fulltext.txt
- Citation
- 831-1.0089685.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0089685/manifest