HIDDEN MARKOV MODELS: MULTIPLE PROCESSES AND MODEL SELECTION

By RACHEL J. MACKAY

B.A.Sc., University of Waterloo, 1996
M.S., Cornell University, 1999

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Department of Statistics)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA

June 19, 2003

© Rachel J. MacKay, 2003

Abstract

This thesis considers two broad topics in the theory and application of hidden Markov models (HMMs): modelling multiple time series and model selection. Of particular interest is the application of these ideas to data collected on multiple sclerosis patients. Our results are, however, directly applicable to many different contexts in which HMMs are used. One model selection issue that we address is the problem of estimating the number of hidden states in a HMM. We exploit the relationship between finite mixture models and HMMs to develop a method of consistently estimating the number of hidden states in a stationary HMM. This method involves the minimization of a penalized distance function. Another such issue that we discuss is that of assessing the goodness-of-fit of a stationary HMM.
We suggest a graphical technique that compares the empirical and estimated distribution functions, and show that, if the model is misspecified, the proposed plots will signal this lack of fit with high probability when the sample size is large. A unique feature of our technique is the plotting of both the univariate and multivariate distribution functions.

HMMs for multiple processes have not been widely studied. In this context, random effects may be a natural choice for capturing differences among processes. Building on the framework of generalized linear mixed models, we develop the theory required for implementing and interpreting HMMs with random effects and covariates. We consider the case where the random effects appear only in the conditional model for the observed data, as well as the more difficult setting where the random effects appear in the model for the hidden process. We discuss two methods of parameter estimation: direct maximum likelihood estimation and the EM algorithm. Finally, to determine whether the additional complexity introduced by the random effects is warranted, we develop a procedure for testing the significance of their variance components. We conclude with a discussion of future work, with special attention to the problem of the design and analysis of multiple sclerosis clinical trials.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction

2 Hidden Markov Models for a Single Process
  2.1 Definition of a Hidden Markov Model
  2.2 Maximum Likelihood Estimation
  2.3 Asymptotic Properties of the MLEs
  2.4 Application to MS/MRI Data
    2.4.1 Albert's Model
    2.4.2 Generalization of the Transition Probabilities
    2.4.3 Generalization of the Conditional Mean Structure
    2.4.4 Generalization of the Conditional Mean Structure and of the Transition Probabilities
    2.4.5 Addition of a Third Hidden State
  2.5 Summary

3 Estimating the Number of States of a HMM
  3.1 Notation
  3.2 Identifiability
    3.2.1 Parameter Identifiability
    3.2.2 Sufficient Conditions for CK's Identifiability Criterion
  3.3 Parameter Estimation
  3.4 Application to MS/MRI Data
  3.5 Performance of the Penalized Minimum-Distance Method
  3.6 Discussion

4 Assessment of Goodness-of-Fit
  4.1 Convergence Conditions
  4.2 Other Models for Count Data
    4.2.1 Markov Models
    4.2.2 m-Dependent Time Series
    4.2.3 Parameter-Driven Processes
  4.3 Application to MS/MRI Data
    4.3.1 Albert's Data
    4.3.2 Vancouver PRISMS Data
  4.4 Formal Assessment of the GOF Plots

5 Hidden Markov Models for Multiple Processes
  5.1 Notation and Assumptions
  5.2 Model I: HMMs with Random Effects in the Conditional Model for the Observed Process
  5.3 Moments Associated with Model I
  5.4 Model II: HMMs with Random Effects in the Model for the Hidden Process
  5.5 Moments Associated with Model II
  5.6 Summary

6 Hypothesis Testing
  6.1 Identifiability of Models I and II
  6.2 Asymptotic Properties of the MLEs of Models I and II
  6.3 Variance Component Testing
  6.4 Applications
    6.4.1 Computing the Test Statistic
    6.4.2 MS/MRI Data
    6.4.3 Faecal Coliform Data
  6.5 Performance of the Variance Component Test

7 Future Work

A The EM Algorithm
  A.1 EM Algorithm for HMMs for Single Processes
  A.2 EM Algorithm for HMMs with Random Effects in the Observed Process
  A.3 EM Algorithm for HMMs with Random Effects in the Hidden Process

B Proofs
  B.1 Proof of Lemma 3.1
  B.2 Proof of Theorem 6.1
  B.3 Proof of Theorem 6.2

List of Tables

2.1 Parameter estimates and standard errors for Albert's model
2.2 Parameter estimates and standard errors for the model with general transition probabilities
2.3 Parameter estimates and standard errors for the model with a general conditional mean structure
2.4 Parameter estimates and standard errors for the model with general transition probabilities and conditional mean structure
2.5 Parameter estimates and standard errors for the 3-state model
3.1 Penalized minimum-distances for different numbers of hidden states
3.2 Parameter values used in the simulation study
6.1 Parameter estimates and standard errors for Vancouver PRISMS data
6.2 Parameter estimates and standard errors for faecal coliform data
6.3 Parameter values used in the simulation study
6.4 Results of the simulation study

List of Figures

2.1 MS/MRI data analyzed in Albert et al. (1994)
2.2 Simulated data with a trend
3.1 Distribution of K̂ when K° = 1
3.2 Distribution of K̂ when K° = 2, n = 30
3.3 Distribution of K̂ when K° = 2, n = 100
3.4 Distribution of K̂ when K° = 3, n = 30
3.5 Distribution of K̂ when K° = 3, n = 100
4.1 Comparison of the Estimated and Empirical Univariate Distributions (Albert's Data)
4.2 Comparison of the Estimated and Empirical Bivariate Distributions (Albert's Data)
4.3 Comparison of the Estimated and Empirical Univariate Distributions (Vancouver Data)
4.4 Comparison of the Estimated and Empirical Bivariate Distributions (Vancouver Data)

Acknowledgements

There are so many people to thank! But, I will try to make this shorter than an Academy Awards speech. First of all, I thank my advisor, John Petkau, for his support, both academic and financial. John has truly been a mentor - as a researcher, consultant, and administrator - and I feel honoured to have had the chance to work with him.
The faculty in the Statistics Department is outstanding; their enthusiasm and approachability create a very special environment for learning and developing new ideas. Thanks especially to Bertrand Clarke for many helpful and interesting conversations, as well as to my committee members, Jim Zidek, Nancy Heckman, and Paul Gustafson, for their feedback on my work. I am also indebted to the office staff, Christine Graham, Rhoda Morgan, and Elaine Salameh, for all their help with administrative matters. Finally, muchisimas gracias to Ruben Zamar for stepping in at the last minute to act as a University Examiner at my defense.

I have greatly appreciated the opportunities that the department has offered me, particularly the chance to be involved with SCARL. Jim Zidek, Chuck Paltiel, and Rick White have been patient and inspirational teachers. My fellow graduate students have contributed to my time here in many ways. Special thanks to the members of the Journal Club (Steven Wang, Yinshan Zhao, Rong Zhu, Weiliang Qiu, and Jochen Brumm), Jérôme "DJ" Asselin, Fatemah Alqallaf, and to everyone who came out to English Lunches!

Outside the department, I am grateful to Ya'acov Ritov of the Hebrew University of Jerusalem and to Jiahua Chen of the University of Waterloo for helpful correspondence. Thank you also to Paul Albert of NIH and Drs. Don Paty and David Li of the UBC MS/MRI Research Group for sharing their MS/MRI data. Finally, I thank Rolf Turner of the University of New Brunswick for his generous assistance with the faecal coliform data (provided by Geoff Coade of the NSW Environmental Protection Authority).

On a more personal note, I cannot express how grateful I am to my family and friends for their continual love and support. My dad has been an amazing role model for me, both academically and personally. Our "father-daughter" talks have helped me through many a difficult moment: I wouldn't have graduated before the age of 70 without him!
My mom has put up with countless hours of "math talk", and has had an unwavering faith in me, even at those times when I lost faith in myself. I also need to thank my sister, "Vermin", for always understanding me, and, of course, for her Markov jokes. The friends that have helped me are too numerous to mention here. But, most importantly, My Dang and Cindy Rejwan have been with me every step of the way. And, my Waterloo engineering buddies have always been sources of encouragement and much needed humour. Finally, I thank my partner and future husband, Yevgeni Altman, with all my heart for his tireless love, patience, and support. I look forward to spending the rest of our lives together, thesis-free.

RACHEL MACKAY
The University of British Columbia
June, 2003

To my grandmother, Denilde Gertrude Brodie (1924-2002), whose love is with me always.

Chapter 1

Introduction

Hidden Markov models (HMMs) describe the relationship between two stochastic processes: an observed process and an underlying "hidden" (unobserved) process. These models form a class of mixture models where, given the hidden state at time t, the distribution of the observation at this time is fully specified. However, HMMs are more general than classical mixture models in that the hidden states are not assumed to be independent, but rather to have a Markovian structure. One consequence of this assumption is that the observed data are also correlated, with dependence between observations decreasing to zero as the distance between them increases to infinity. This correlation is long-range, in the sense that HMMs are not Markov chains. In general, these models are used for two purposes. The first is to make inferences or predictions about an unobserved process based on the observed process. For example, HMMs have been used successfully for the purposes of prediction in the field of speech recognition (e.g. Levinson et al. 1983).
In this context, the observed data - the acoustic signal - may be modelled as a function of unobserved articulatory configurations such as vocal tract shape or tongue movement. For each word from a vocabulary of size V, V < ∞, an acoustic signal is generated, and the parameters of the associated HMM estimated. Then, given a signal generated from an unknown word in this vocabulary, these V models can be used to predict which word was uttered. Some advanced systems based on HMMs now perform as accurately as their human counterparts (Juang & Rabiner 1991). Similarly, HMMs have been used in molecular biology for gene recognition (see, e.g., Krogh 1998). Here, sequenced strands of DNA are treated as functions of the underlying signals that comprise the structure of a gene. The models estimated from sequences with known genetic structure are then used to predict the location of genes in new sequences.

A second reason for using HMMs is to explain variation in the observed process based on variation in a postulated hidden process. In this paradigm, a HMM captures over-dispersion (relative to a standard distribution) in the observed data. In particular, a HMM attributes this over-dispersion to the key model feature that observations come from one of several different marginal distributions, each associated with a different latent state. When physical meaning can be attributed to these states, a HMM provides a natural model for such data. As an illustration, Leroux & Puterman (1992) use a HMM to model the number of foetal lamb movements in consecutive 5-second intervals. The distribution of each observation is assumed to depend on whether the lamb is in a relaxed or excited state. As another example, Albert (1991) models the distribution of epileptic seizure frequencies according to whether the patient is in a high or low seizure activity state.
Magnetic resonance imaging (MRI) scans of relapsing-remitting multiple sclerosis (MS) patients are another source of data that may be appropriately modelled by HMMs. Patients afflicted with this type of MS experience lesions on the brain stem, with symptoms worsening and then improving in alternating periods of relapse and remission. Typical data amassed during clinical trials consist of lesion counts at regular time intervals for a collection of patients. It is now believed that exacerbations are associated with increased numbers of lesions on the brain stem. Thus, it may be reasonable to assume that the distribution of the lesion counts depends on the patient's (unobserved) disease state, i.e. whether the patient is in relapse or remission. Additionally, we might expect to see autocorrelation in this sequence of disease states. Indeed, Albert et al. (1994) use this idea in the development of a HMM for individual relapsing-remitting MS patients.

The study of HMM theory began in the late 1960's. One of the key papers, Baum et al. (1970), provides a method of obtaining the maximum likelihood estimates (MLEs). The asymptotic properties of the MLEs have subsequently been established (Leroux 1992a; Bickel et al. 1998; Douc & Matias 2001). Likelihood ratio tests for HMMs have been studied by Giudici et al. (2000), and Bickel et al. (2002) have determined bounds on the expectations of the HMM log-likelihood and its derivatives. However, theoretical gaps remain. This thesis addresses two such topics that have not been adequately studied in the literature: modelling multiple time series, and model selection techniques. We will use the MS/MRI context described above to illustrate many of our ideas. In particular, we will consider two MS/MRI data sets: that used by Albert et al. (1994), and another similar data set involving the 13 placebo patients from the Vancouver cohort of the PRISMS study (PRISMS Study Group 1998).
Most work to date on HMM theory has concentrated on models for a single observed process. In Chapter 2, we introduce some basic definitions and concepts in this setting. We then explore the model considered by Albert et al. (1994), as well as some simple extensions. This exploration, which includes a discussion of the limitations of the theory surrounding these models, will serve to clarify the fundamental ideas behind HMMs, and will elucidate our questions of interest. In Chapters 3 and 4, working in the context of a single, stationary HMM, we address two questions of critical importance in the application of HMMs: estimation of the number of hidden states and assessment of goodness-of-fit (GOF).

In Chapter 3, we develop a method of consistently estimating the number of hidden states. Our method extends the work of Chen & Kalbfleisch (1996), who consider the use of a penalized distance function for estimating the number of components in a finite mixture model. We apply our procedure to the MS/MRI data collected by Albert et al. (1994), and carry out a small simulation study that suggests the method performs reasonably well for finite samples. This work was published in The Canadian Journal of Statistics (MacKay 2002). In Chapter 4, we propose a graphical technique for assessing the GOF of a HMM. Specifically, we show that plots comparing the empirical and estimated distribution functions - both univariate and multivariate - allow the detection of lack of fit in the model with high probability as the sample size grows. We use this technique to study the appropriateness of various HMMs for the two MS/MRI data sets.

Chapter 2 will also motivate the need for HMMs for multiple processes, which is the focus of Chapter 5. In that chapter, we propose the incorporation of random effects as a means of linking the different processes.
Random effects provide an efficient way of modelling commonalities among patients while allowing for some inter-patient variability. Furthermore, including random effects permits greater flexibility in the modelling of the correlation structure of the observed data. Using the generalized linear mixed model (GLMM) framework, we consider the incorporation of random effects and covariates in both the conditional model for the observed process and the model for the hidden process. We discuss the estimation of these models, as well as their interpretation, with special attention to the impact of the random effects on the marginal moments of the observed data.

In Chapter 6, we address the issue of hypothesis tests for the parameters of the class of models developed in Chapter 5. We comment on the asymptotic properties of the MLEs, and suggest settings where standard test procedures may be appropriate. We then present a method for testing the significance of the variance components, which is a more challenging problem since the null hypothesis puts at least one parameter on the boundary of the parameter space. Our method has its inspiration in the score test proposed by Jacqmin-Gadda & Commenges (1995) in the GLMM context. We provide two illustrations of the theory in Chapters 5 and 6 to demonstrate the practicality of using our class of models in applications. The first involves the MS/MRI lesion count data from the Vancouver PRISMS study; the second considers the analysis of repeated measurements of faecal coliform counts at several oceanic sites. We end this chapter with a modest simulation study to investigate the power of this method for finite samples.

We conclude with Chapter 7, where we summarize the work in this thesis and present ideas for future research in the field of HMMs. Of particular interest is the application of our theory to the design and analysis of MS/MRI clinical trials.
Our discussion focuses on the issues that we anticipate will arise in this work.

Chapter 2

Hidden Markov Models for a Single Process

In this chapter, we provide the formal definition of a HMM for a single process, as well as some theory relevant to parameter estimation and hypothesis testing in this setting. We then illustrate these ideas with an application to a MS/MRI data set. The primary purposes of this chapter are to introduce basic concepts and to highlight our research questions of interest.

Throughout this thesis, we use the generic notation f(x) to denote the density (or probability mass function) of a random variable (or vector), X. Usually, f will be a member of a parametric family with parameters ψ, in which case we will write f(x; ψ). We will use bold face to indicate a vector, such as Y and Z to denote the vectors of observed responses and hidden states, respectively.

2.1 Definition of a Hidden Markov Model

Let Y_t be the observed response at time t, and let Z_t be the hidden state at time t, t = 1, ..., n. The process {Y_t} is a discrete-time HMM if

• {Z_t} is a Markov chain with transition probabilities {P_kℓ^t} and initial probabilities {π_k}.

• Y_t | Z_t is independent of Y_1, ..., Y_{t-1}, Y_{t+1}, ..., Y_n and Z_1, ..., Z_{t-1}, Z_{t+1}, ..., Z_n.

Typically, the following assumptions are also made:

1. The density (or probability mass function) of Y_t | Z_t is h(· ; θ_{Z_t}, φ), where h is a parametric family indexed by the parameters (θ_{Z_t}, φ) ∈ Θ.

2. Z_t ∈ {1, ..., K}, where K is known and finite.

3. The values of {θ_k} are distinct.

4. The time points t = 1, ..., n are equally spaced.

5. P_kℓ^t = P_kℓ, k, ℓ = 1, ..., K, i.e. the transition probabilities are homogeneous.

6. {Z_t} is stationary.

REMARK. Assumption 1 implies that the distribution of Y_t | Z_t depends on t only through Z_t. Our notation indicates that some parameters (θ_{Z_t}) may vary with Z_t, whereas others (φ) are common across the hidden states.
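To make Assumptions 1-6 concrete, a model satisfying them is straightforward to simulate. The following is our own illustrative sketch (not code from the thesis), taking h to be Poisson with state-specific means θ_k; the helper `stationary_dist` starts the chain from its stationary distribution, which is also the computation needed if the initial probabilities {π_k} are treated as functions of the transition probabilities:

```python
import numpy as np

def stationary_dist(P):
    """Stationary distribution of transition matrix P: solves pi = pi P
    together with sum(pi) = 1 as an overdetermined linear system."""
    K = P.shape[0]
    A = np.vstack([np.eye(K) - P.T, np.ones((1, K))])
    b = np.append(np.zeros(K), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def simulate_poisson_hmm(n, P, theta, seed=None):
    """Simulate a stationary Poisson HMM: homogeneous transitions
    (Assumption 5), stationary start (Assumption 6), and
    Y_t | Z_t ~ Poisson(theta_{Z_t}) (Assumption 1)."""
    rng = np.random.default_rng(seed)
    K = P.shape[0]
    z = np.empty(n, dtype=int)
    z[0] = rng.choice(K, p=stationary_dist(P))
    for t in range(1, n):
        z[t] = rng.choice(K, p=P[z[t - 1]])
    return rng.poisson(theta[z]), z

# Illustrative parameter values only.
P = np.array([[0.9, 0.1], [0.3, 0.7]])
y, z = simulate_poisson_hmm(200, P, theta=np.array([1.0, 6.0]), seed=42)
```

For the transition matrix above, `stationary_dist` returns (0.75, 0.25), and the simulated counts exhibit exactly the serial dependence described in Chapter 1: runs of low counts punctuated by runs of high counts.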
We relax Assumption 2 in Chapter 3, where we address the issue of estimating K. Assumption 3 is made in most applications of HMMs, with the notable exception of ion channel modelling (e.g. Chung et al. 1990). We discuss this issue in more detail in Section 3.2. Assumption 4 allows us to specify the model in terms of the 1-step transition probabilities, and hence is a useful simplification, but it is not strictly necessary (see, e.g., Section 6.4.3). Assumption 5 is also standard, though Hughes & Guttorp (1994) have considered non-homogeneous HMMs in applications. One advantageous consequence of Assumption 6 is that the random variables {Y_t} are identically distributed - a feature that sometimes permits the extension of existing theory for iid random variables to the HMM setting (see, e.g., Chapters 3 and 4).

It is interesting to note that the marginal distribution of Y_t is a finite mixture:

f(y_t; ψ) = Σ_{k=1}^K π_k h(y_t; θ_k, φ).

However, the sequence of hidden states, {Z_t}, is allowed to have a Markovian, rather than independent, structure. Thus, we see that stationary HMMs are a generalization of traditional finite mixture models.

2.2 Maximum Likelihood Estimation

The likelihood associated with the model described in Section 2.1 is not a simple product of marginal distributions. Define ψ = (θ_1, ..., θ_K, φ, P_11, P_12, ..., P_KK, π_1, ..., π_K), and denote the likelihood by L(ψ). Then, using the assumption that {Y_t} are independent given {Z_t},

L(ψ) = f(y; ψ) = Σ_z f(y | z; ψ) f(z; ψ)
     = Σ_z [ Π_{t=1}^n h(y_t; θ_{z_t}, φ) ] [ π_{z_1} Π_{t=2}^n P_{z_{t-1} z_t} ].   (2.1)

Thus, we see that the likelihood involves a summation over the K^n possible values of z, and hence is quite complicated. We can simplify (2.1) somewhat by recognizing that for each t, the variable z_t appears in only a few factors. So

L(ψ) = Σ_{z_1} π_{z_1} h(y_1; θ_{z_1}, φ) Σ_{z_2} P_{z_1 z_2} h(y_2; θ_{z_2}, φ) · · · Σ_{z_n} P_{z_{n-1} z_n} h(y_n; θ_{z_n}, φ).

This expression can then be written as a product of matrices (MacDonald & Zucchini 1997, Chapter 2).
In particular, let A^1 be the vector with elements A_k^1 = π_k h(y_1; θ_k, φ), and let A^t be the matrix with elements A_kℓ^t = P_kℓ h(y_t; θ_ℓ, φ), t > 1. Let 1 be the K-dimensional vector of 1's. Then

L(ψ) = A^1 A^2 · · · A^n 1,   (2.2)

which is a very simple expression to compute. This form of the likelihood illustrates that the number of hidden states, K, has a far greater impact on the computational effort associated with maximum likelihood estimation than the number of observations, n.

Traditionally, the EM algorithm (Dempster et al. 1977) has been used to maximize HMM likelihoods. There are two likely reasons for the popularity of this algorithm. Firstly, for homogeneous HMMs (see Assumption 5 in Section 2.1) taking on only a finite number of values and with unknown initial probabilities, this algorithm reduces to an iterative procedure with simple, closed-form expressions for the parameter estimates at each iteration. In this context, the EM algorithm is often called the Forward-Backward algorithm and is credited to Baum et al. (1970). Details are provided in Appendix A.1. In terms of estimation of the parameters of a HMM, this case is the simplest possible, and will be a useful reference point when assessing the difficulty of estimating the parameters of the more complicated models we consider in Chapter 5. A second reason is that derivatives of HMM likelihoods are somewhat difficult to compute, requiring iterative methods (e.g. Rynkiewicz 2001). The EM algorithm, unlike methods such as Newton-Raphson, does not require that derivatives be supplied. However, in general, the steps of the EM algorithm do not involve closed-form expressions. Furthermore, this algorithm is notoriously slow to converge. Thus, we prefer direct numerical maximization of the likelihood, which is typically much more efficient (MacDonald and Zucchini 1997, Chapter 2).
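The product in (2.2) can be accumulated left to right as a forward recursion. A minimal sketch for the Poisson case (our own illustration; the renormalization of each forward vector is our addition, since the raw product of (2.2) underflows numerically for long series, and the parameter values are illustrative only):

```python
import numpy as np
from math import lgamma

def poisson_pmf(y, mu):
    """Poisson probability mass function, computed on the log scale."""
    return np.exp(y * np.log(mu) - mu - lgamma(y + 1))

def hmm_loglik(y, pi, P, theta):
    """log L(psi) for a Poisson HMM via the product A^1 A^2 ... A^n 1 of
    (2.2), renormalizing the forward vector at each step for stability."""
    alpha = pi * np.array([poisson_pmf(y[0], m) for m in theta])  # A^1
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for t in range(1, len(y)):
        h = np.array([poisson_pmf(y[t], m) for m in theta])
        alpha = (alpha @ P) * h   # right-multiply by A^t, A^t_kl = P_kl h(y_t; theta_l)
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return loglik

pi = np.array([0.5, 0.5])
P = np.array([[0.9, 0.1], [0.3, 0.7]])
theta = np.array([0.5, 3.0])
ll = hmm_loglik([0, 2, 1], pi, P, theta)
```

For a short series the result can be checked against the brute-force sum over all K^n hidden paths in (2.1); the recursion costs O(nK²) rather than O(nK^n), which is the point made in the text about K versus n.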
In particular, we have found that the quasi-Newton routine (Nash 1979) tends to locate maximum likelihood estimates (MLEs) more accurately and with far less computational effort than the EM algorithm. Even if the EM algorithm performs better than direct maximization under some circumstances (such as when we have a large number of parameters or poor starting values), repeating the direct maximization procedure using a variety of starting values still seems to be the most efficient means of parameter estimation. Starting values are of critical importance since HMM likelihoods tend to have many local maxima. These values may be selected using, for example, the method suggested by Leroux & Puterman (1992). In addition, we recommend doing a grid search (over a variety of reasonable starting values) to improve our chances of locating the global maximum.

Another implementation issue concerns the parameters {π_k}, which are normally considered to be nuisance parameters. Three options exist for dealing with these parameters. Firstly, we may assume values for {π_k}. In the absence of prior information, this option may not be reasonable. On the other hand, Leroux (1992a) shows that the consistency of the MLEs does not depend on the choice of initial distribution. Thus, for processes observed at a large number of time points, this option may be appealing. Secondly, we may estimate {π_k} from the data. This option is also undesirable for relatively small data sets since it would require the estimation of K − 1 additional parameters, risking an increase in the standard errors of the parameters of interest. Finally, if the hidden process can be assumed to be stationary, we can treat {π_k} as functions of the transition probabilities. In this way, we can reduce the number of parameters to estimate.
This option, while requiring the solution of a system of K linear equations at each iteration of the maximization procedure, seems to be the most attractive as long as the assumption of stationarity is appropriate. We thus use this approach in the examples we consider in this thesis.

2.3 Asymptotic Properties of the MLEs

Results on the properties of the MLEs require that the model (2.1) is identifiable, i.e. f(y; ψ_1) = f(y; ψ_2) if and only if ψ_1 = ψ_2. Strictly speaking, a HMM is never identifiable, in the sense that we can always permute the labels of the hidden states without changing the likelihood. However, it is easy to overcome this obstacle by imposing appropriate restrictions on the parameters (such as an ordering of {θ_k} in the case where these values are distinct). Setting this point aside, determining sufficient conditions for identifiability is still a difficult problem, except when the HMM is stationary with distinct values of {θ_k}. See Section 3.2 for details.

In the case of a stationary HMM where the hidden process takes on only a finite number of values, Leroux (1992a) and Bickel et al. (1998) establish the consistency and asymptotic normality, respectively, of the MLEs under quite general conditions. In addition to assuming model identifiability, these authors impose mild conditions on the transition probabilities and on the distribution h. These components of the model are usually quite simple (in contrast with the full likelihood), and hence these conditions are relatively easy to verify (and hold for most models). Douc & Matias (2001) show that the MLEs are also consistent and asymptotically normal in the case where {Z_t} belongs to a compact set and is possibly non-stationary. Again, these authors impose conditions only on the Markov transition kernel and the distribution h. We are not aware of any results in the literature regarding the asymptotic properties of non-homogeneous HMMs.
With respect to inference about the unknown parameters, Bickel et al. (1998) and Douc & Matias (2001) show that the observed information converges in probability to the Fisher information matrix. Thus, if the HMM satisfies the conditions imposed by these authors, we can conduct Wald tests in the standard way, using the observed information to estimate the variance-covariance matrix of the MLEs. In addition, Giudici et al. (2000) show that, in the comparison of nested stationary HMMs with a common, known value of K, the likelihood ratio test statistic has the usual asymptotic χ² distribution. Their theory is applicable, for example, to test whether the hidden states have an independent, rather than Markovian, structure.

2.4 Application to MS/MRI Data

In this section, we discuss an interesting and unusual HMM developed by Albert et al. (1994) for MS/MRI data. We fit this model to their data and give the results in Section 2.4.1. We then develop several extensions, which we present in Sections 2.4.2-2.4.5. One purpose of this discussion is to solidify the concepts in Sections 2.1-2.3. Furthermore, with an eye towards future work on the design and analysis of MS/MRI clinical trials (see Chapter 7), this section will illustrate the type of questions that might be asked in this setting. Most importantly, some of these questions will reveal gaps in existing theory for HMMs, which will motivate the research presented in this thesis.

The HMM proposed by Albert et al. (1994), to which we will henceforth refer as Albert's model, describes lesion counts on repeated MRI scans for a single relapsing-remitting MS patient. The authors apply the model individually to three patients, each of whom had monthly MRI scans for a period of approximately 30 months. The observed lesion counts range from 0 to 19, with a mean of 4.5 and a median of 4 lesions per scan. The data are displayed in Figure 2.1.
Albert's model is based on the idea that a patient is in an (unknown) state of deterioration or improvement at any time point. This underlying state will affect the mean number of lesions observed at that time. Specifically, it is assumed that if the patient's condition is deteriorating at time t, the mean lesion count at this time will be greater than the mean lesion count at time t − 1 by a factor of θ. Similarly, if the patient's condition is improving at time t, the mean lesion count at this time will be less than the mean lesion count at time t − 1 by a factor of θ. Mathematically, the assumptions of the model can be stated as follows:

1. The hidden state, Z_t, is −1 if the patient's condition is improving at time t, and +1 if the patient's condition is deteriorating at time t. This process is modelled as a stationary Markov chain.

2. The transition probabilities are assumed to be homogeneous, with the probability of moving from deterioration to improvement equal to the probability of moving from improvement to deterioration. This common probability is denoted by γ.

3. Given Z_t, the lesion count, Y_t, is assumed to be independent of Y_1, ..., Y_{t-1}, Y_{t+1}, ..., Y_n, and distributed as Poisson(μ_t), where

μ_t = θ μ_{t-1} if the patient is deteriorating at time t,
μ_t = (1/θ) μ_{t-1} if the patient is improving at time t.

This assumption can be rewritten as

μ_t = μ_0 θ^{S_t},

where S_t = Σ_{i=1}^t Z_i and μ_0 is the baseline mean lesion count.

REMARK. Assumptions 1 and 2 imply that the initial probabilities are P(Z_1 = −1) = P(Z_1 = +1) = 0.5. Under Assumption 3, when θ = 1, the model reduces to that for independent observations distributed as Poisson(μ_0). Assumption 3 also leads to an identifiability problem, since the model with θ is equivalent to the model with 1/θ if we reverse the labelling of the hidden states. To remedy this problem, we assume that θ > 1.

[Figure 2.1: MS/MRI data analyzed in Albert et al. (1994); panel shown: Patient 1.]
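Assumptions 1-3 translate directly into a simulator, which can help build intuition about the kinds of lesion-count trajectories the model produces. A hedged sketch (our own illustration; the parameter values below are purely illustrative, not the estimates reported in Section 2.4.1):

```python
import numpy as np

def simulate_albert(n, mu0, theta, gamma, seed=None):
    """Simulate lesion counts under Albert's model: Z_t is +1
    (deteriorating) or -1 (improving), flips with probability gamma,
    P(Z_1 = -1) = P(Z_1 = +1) = 0.5, and
    Y_t | S_t ~ Poisson(mu0 * theta ** S_t), S_t = Z_1 + ... + Z_t."""
    rng = np.random.default_rng(seed)
    z = np.empty(n, dtype=int)
    z[0] = rng.choice([-1, 1])
    for t in range(1, n):
        z[t] = -z[t - 1] if rng.random() < gamma else z[t - 1]
    s = np.cumsum(z)
    return rng.poisson(mu0 * theta ** s.astype(float)), z

# Roughly 30 monthly scans, as in the data of Albert et al. (1994).
y, z = simulate_albert(30, mu0=4.0, theta=1.3, gamma=0.2, seed=1)
```

Note how the mean lesion count random-walks on the log scale (evenly spaced multiples of log θ), the observation used later in this section to relate Albert's model to a stationary Poisson HMM.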
On first glance, this model does not appear to be a HMM (according to the definition given in Section 2.1), since the distribution of Y_t depends on all previous hidden states, not just Z_t. However, by defining the hidden process as (S_{t-1}, S_t) with state space {(i, j) : i = -n, ..., n; j = -n, ..., n}, we see that Albert's model does, in fact, conform to the definition of a non-homogeneous HMM with countable state space. Thus, the model can be fit in the manner described in Section 2.2. Albert et al. use the EM algorithm for this purpose, but we will maximize the likelihood directly. We have two reasons for this choice. First, as discussed in Section 2.2, direct maximization appears to be the more efficient of the two methods. Second, in our experience with this model, unlike the direct maximization method, the EM algorithm tends to converge to values other than the MLEs.

To make inferences about Albert's model and its extensions, we will use the methods outlined in Section 2.3. Because we lack theoretical results about the properties of these methods in the case of non-homogeneous HMMs, our conclusions should be considered only informal. These conclusions are nonetheless useful, as they will help to isolate and illustrate the issues of interest in this thesis. Furthermore, Albert's model is very similar to a stationary Poisson HMM. To see this, note that the mean lesion count at time t is restricted to a discrete number of values (evenly spaced on the log scale). If we assume that the observed process is stationary, it is reasonable to use a finite approximation to these mean values, i.e. to assume that the mean at time t is one of K values. This new model is simply a stationary Poisson HMM with K hidden states and with some restrictions on the transition probabilities. Hence, if more formal conclusions were desired, one could fit a stationary Poisson HMM with an appropriate value of K to the data.
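Direct maximization requires repeated evaluation of the likelihood. For a homogeneous Poisson HMM (the stationary approximation just described), this is done with the standard scaled forward recursion; the sketch below is a generic illustration of that recursion, not the thesis code:

```python
import math

def poisson_pmf(y, mu):
    return math.exp(-mu) * mu ** y / math.factorial(y)

def hmm_loglik(y, init, trans, means):
    """Log-likelihood of a Poisson HMM via the scaled forward recursion.

    init[k]    : P(Z_1 = k)
    trans[j][k]: P(Z_{t+1} = k | Z_t = j)
    means[k]   : Poisson mean in hidden state k
    """
    K = len(init)
    alpha = [init[k] * poisson_pmf(y[0], means[k]) for k in range(K)]
    ll = math.log(sum(alpha))
    alpha = [a / sum(alpha) for a in alpha]   # rescale to avoid underflow
    for t in range(1, len(y)):
        alpha = [sum(alpha[j] * trans[j][k] for j in range(K))
                 * poisson_pmf(y[t], means[k]) for k in range(K)]
        c = sum(alpha)
        ll += math.log(c)
        alpha = [a / c for a in alpha]
    return ll
```

With one hidden state the recursion reduces to the independent-Poisson log-likelihood, which provides a convenient check of the implementation.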
Then, the standard inference results discussed in Section 2.3 would certainly apply.

2.4.1 Albert's Model

We can facilitate the maximization of the likelihood by transforming the parameters so that their range is the entire real line. To this end, we use the following reparameterizations: μ_0^* = log μ_0, θ^* = log(θ - 1), and γ^* = log(γ/(1 - γ)).

Table 2.1 gives the parameter estimates and approximate standard errors resulting from fitting Albert's model to the three patients' data.

Table 2.1: Parameter estimates and standard errors for Albert's model

Parameter  Transformation   Patient 1         Patient 2         Patient 3
                            Estimate   SE     Estimate   SE     Estimate   SE
μ_0^*      log μ_0          1.070      0.091  1.362      0.178  2.223      0.244
θ^*        log(θ - 1)       -11.083    NA     -0.282     0.245  -1.128     0.560
γ^*        log(γ/(1 - γ))   NA         NA     1.241      0.575  1.336      0.847
log L                       -66.814           -72.029           -65.363

For Patient 1, θ is estimated as 1.000, i.e. the model reduces to that for independent Poisson counts. In this case, the estimate for γ given by the quasi-Newton routine is, in fact, arbitrary. In addition, since θ = 1 is on the boundary of the parameter space, there is no guarantee that the usual standard error for the estimate of θ^* is even approximately correct. For these reasons, we do not provide an estimate of γ^*, or a standard error for the estimate of θ^*.

One question of interest is whether the complexity of the HMM is warranted, or whether the simple model with independent Poisson counts is sufficient to describe the variability and correlation in the data. In principle, we should be cautious about making such inferences, since the test of θ = 1 is a boundary problem. Informally, though, in the case of Patient 1, there is no evidence to suggest that the simpler model is inadequate. In the case of Patients 2 and 3, if we believe the 95% confidence intervals for θ ([1.467, 2.219] and [1.108, 1.970], respectively), then there is evidence against the null hypothesis that θ = 1.
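The reparameterizations, and the back-transformation of a Wald interval for θ^* into an interval for θ, can be checked numerically. The sketch below (not the fitting code) reproduces the Patient 2 interval from the θ^* estimate and standard error in Table 2.1:

```python
import math

def to_natural(mu_star, theta_star, gamma_star):
    """Invert mu0* = log(mu0), theta* = log(theta - 1),
    gamma* = log(gamma / (1 - gamma))."""
    mu0 = math.exp(mu_star)
    theta = 1.0 + math.exp(theta_star)            # enforces theta > 1
    gamma = 1.0 / (1.0 + math.exp(-gamma_star))   # logistic: 0 < gamma < 1
    return mu0, theta, gamma

# round trip through the transformations
mu0, theta, gamma = to_natural(math.log(4.5), math.log(0.5), 0.0)

# 95% CI for theta from the Wald interval for theta* (Patient 2, Table 2.1)
est, se = -0.282, 0.245
lo = 1.0 + math.exp(est - 1.96 * se)
hi = 1.0 + math.exp(est + 1.96 * se)
```

The endpoints (lo, hi) agree with the interval [1.467, 2.219] quoted above.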
Thus, the HMM structure seems to be more appropriate for these patients than the simpler model. These conclusions are consistent with those arrived at by Albert et al.

The results in Table 2.1 are essentially the same as those obtained by Albert et al. The primary differences are that we have omitted the estimate of γ^* in the case of Patient 1, and have achieved a higher value of the likelihood in the case of Patients 2 and 3. The latter emphasizes the importance of choosing both good starting values and an estimation method whose convergence properties are well-behaved in practice as well as in theory.

2.4.2 Generalization of the Transition Probabilities

The first extension we consider (for Patients 2 and 3) is the use of general transition probabilities. We do not apply this new model to the data from Patient 1 since the analysis in Section 2.4.1 indicates that the transition probabilities for this patient are arbitrary. In particular, we model the transition probabilities as

           -1       +1
    -1    1 - γ     γ
    +1    β         1 - β

This generalization may be of interest because Albert's model assumes that patients spend 50% of their time in a state of deterioration, and 50% in a state of improvement. This assumption seems too strong to make a priori.

Table 2.2: Parameter estimates and standard errors for the model with general transition probabilities

Parameter  Transformation   Patient 2         Patient 3
                            Estimate   SE     Estimate   SE
μ_0^*      log μ_0          1.360      0.173  2.186      0.143
θ^*        log(θ - 1)       -0.281     0.244  -1.416     0.273
γ^*        log(γ/(1 - γ))   1.489      0.801  0.975      0.515
β^*        log(β/(1 - β))   1.039      0.690  18.570     NA
log L                       -71.921           -64.222

The parameter estimates for this model are given in Table 2.2. In the case of Patient 3, β is estimated as 1.000, which is on the boundary of the parameter space. We do not include a standard error for the estimate of β^* for this reason. To test the validity of the assumption that γ = β, we note that Albert's model is nested within the more general model.
We then use the likelihood ratio test (LRT) to compare the two models, assuming that the LRT statistic has an asymptotic χ²_1 distribution. The p-values for these tests are

    Patient 2: p-value = 0.642
    Patient 3: p-value = 0.131

Surprisingly, the more general model does not fit substantially better for either of the two patients. We have two possible explanations for these results. Firstly, the standard errors of the estimates of γ^* given in Table 2.1 are quite large relative to the estimates themselves, and relative to the standard errors of the estimates of μ_0^* and θ^*. The same is true of the standard errors of the estimates of γ^* and β^* in Table 2.2. These examples show that making inferences about the hidden process is usually a difficult problem. A second explanation may lie in the structure of μ_t. Note that the proportional increase in the mean when the patient is deteriorating is, by assumption, equal to the proportional decrease in the mean when the patient is improving. In the case where there is no overall trend in the data (as is true for these particular patients, as well as for relapsing-remitting patients in general when observed over a short time period), the number of transitions from decreasing to increasing mean is forced to equal approximately the number of transitions from increasing to decreasing mean. This statement is equivalent to Albert's assumption that the patients spend equal proportions of time in the states of deterioration and improvement. These proportions can be expressed as

    π_{+1} = γ/(β + γ)  and  π_{-1} = β/(β + γ),                    (2.3)

so γ/(β + γ) = β/(β + γ) = 0.5 implies that β = γ. It is perhaps for this reason that the model with general transition probabilities does not provide an improved fit to these data. Under some circumstances, we would expect to see a trend in the lesion counts, in which case the more general model might be appropriate. For example, patients with secondary progressive MS have lesion counts which may steadily increase over a given time period.
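The p-values above can be reproduced from the maximized log-likelihoods in Tables 2.1 and 2.2, using the χ²_1 tail probability P(χ²_1 > x) = 2{1 - Φ(√x)} = erfc(√(x/2)). A sketch (the log-likelihood values are taken from the tables):

```python
import math

def lrt_pvalue_df1(ll_restricted, ll_general):
    """P-value of a likelihood ratio test on 1 degree of freedom."""
    x = max(0.0, 2.0 * (ll_general - ll_restricted))
    return math.erfc(math.sqrt(x / 2.0))

p2 = lrt_pvalue_df1(-72.029, -71.921)   # Patient 2: Albert's model vs. general transitions
p3 = lrt_pvalue_df1(-65.363, -64.222)   # Patient 3
```

The computed values agree with the p-values 0.642 and 0.131 quoted above.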
Similarly, in a clinical trial setting where patients may be chosen for their relatively high level of disease activity, an initial downward trend ("regression to the mean") may be observed, even in the placebo patients. The simulated data in Figure 2.2 show a clear upward trend. These data were generated from this HMM with μ_0 = 1.5, θ = 1.2, γ = 0.8, and β = 0.2. In this case, the maximum value of the log-likelihood is -73.345 for the restricted model and -69.446 for the general model. The LRT leads to a p-value of 0.005, indicating that the general model provides a significantly improved fit to these data.

[Figure 2.2: Simulated data with a trend.]

2.4.3 Generalization of the Conditional Mean Structure

In light of the discussion in Section 2.4.2, we might consider modelling μ_t more generally while leaving the transition probabilities as in Section 2.4.1. For example, we could fit Albert's model with the modification

    μ_t = θ_0 μ_{t-1},        if the patient is deteriorating at time t,
    μ_t = (1/θ_1) μ_{t-1},    if the patient is improving at time t.

The parameter estimates and approximate standard errors associated with fitting this model are given in Table 2.3. When θ_0 = 1/θ_1 = θ, the model implies that {Y_t} are independent with Y_t distributed as Poisson(μ_0 θ^t), in which case γ is arbitrary. Thus, for Patient 1, we omit the estimate of γ^*.

Table 2.3: Parameter estimates and standard errors for the model with a general conditional mean structure

Parameter  Transformation   Patient 1         Patient 2         Patient 3
                            Estimate   SE     Estimate   SE     Estimate   SE
μ_0^*      log μ_0          1.168      0.200  0.987      0.256  2.446      0.247
θ_0^*      log θ_0          0.006      0.089  0.484      0.621  0.466      0.166
θ_1^*      log θ_1          -0.006     0.089  0.621      0.120  0.398      0.158
γ^*        log(γ/(1 - γ))   NA         NA     0.974      0.708  2.384      0.813
log L                       -66.652           -70.401           -63.904

This modification does not significantly improve the fit for Patient 1, but we observe some evidence of an improved fit for Patients 2 and 3.
The p-values associated with these tests are

    Patient 1: p-value = 0.569
    Patient 2: p-value = 0.071
    Patient 3: p-value = 0.088

2.4.4 Generalization of the Conditional Mean Structure and of the Transition Probabilities

It was anticipated that, in the case of Patients 2 and 3, the fit of the model might be further improved by incorporating general transition probabilities (i.e. by combining the models proposed in Sections 2.4.2 and 2.4.3). In fact, the LRTs comparing this model to the model with γ = β (i.e. the model in Section 2.4.3) yield the following p-values:

    Patient 2: p-value = 0.065
    Patient 3: p-value = 1.000

Thus, there is some support for the expanded model in the case of Patient 2. The estimates of the transformed parameters and approximate standard errors are given in Table 2.4.

Table 2.4: Parameter estimates and standard errors for the model with general transition probabilities and conditional mean structure

Parameter  Transformation   Patient 2         Patient 3
                            Estimate   SE     Estimate   SE
μ_0^*      log μ_0          1.889      0.235  2.445      0.233
θ_0^*      log θ_0          0.811      0.169  0.466      0.154
θ_1^*      log θ_1          0.383      0.071  0.398      0.149
γ^*        log(γ/(1 - γ))   2.022      1.134  2.372      1.045
β^*        log(β/(1 - β))   -0.399     0.505  2.396      1.052
log L                       -68.695           -63.904

2.4.5 Addition of a Third Hidden State

Our final question of interest regarding Albert's model involves the choice of the number of hidden states. Albert's model is quite restrictive, in the sense that it forces the mean lesion count to either increase or decrease from time t - 1 to time t. One would imagine that, especially during periods of remission, the mean lesion count would remain stable. Thus, we consider the addition of a third hidden state, state 0, where the patient's condition is neither deteriorating nor improving.
Mathematically, this modification can be expressed as

    μ_t = θ μ_{t-1},        if the patient is deteriorating at time t,
    μ_t = μ_{t-1},          if the patient is stable at time t,
    μ_t = (1/θ) μ_{t-1},    if the patient is improving at time t.

We represent the transition probabilities as follows:

           -1      0       +1
    -1     p_1     p_2     1 - p_1 - p_2
     0     p_3     p_4     1 - p_3 - p_4
    +1     p_5     p_6     1 - p_5 - p_6

One disadvantage of such an extension is the introduction of the problem of computing the stationary probabilities, which are used as initial probabilities for the hidden Markov chain. In the two-dimensional case, we have the simple, closed form (2.3) for the stationary distribution. In order to compute the stationary distribution in the three-dimensional case, however, a system of three linear equations must be solved at each iteration of the quasi-Newton algorithm. Another disadvantage of this model is the large number of unknown parameters. However, we can reduce this number if we are willing to place restrictions on the transition probabilities (as in Albert's model), for example by assuming that the transition probability matrix is symmetric.

The parameter estimates and standard errors are given in Table 2.5. In this case, the likelihood functions are quite flat (likely due to the large number of parameters and relatively small sample sizes) and hence difficult to maximize. The parameter estimates are not entirely reliable, and may correspond to a local maximum. Turning our attention to the likelihood, when we compare the results in Table 2.5 with those in Table 2.1, we see that substantial decreases occur for Patient 2 in particular. Thus, we might surmise that this 3-state model is more appropriate than Albert's model. It would be a mistake, however, to use the χ² distribution to gauge the extremity of the LRT statistic. The test comparing models with differing numbers of hidden states amounts to the hypothesis that some of the transition probabilities are zero.
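The stationary distribution in the three-state case solves π P = π together with Σ_k π_k = 1. A small Gaussian-elimination sketch (the transition matrix below is illustrative, not a fitted one):

```python
def stationary(P):
    """Stationary distribution of a 3-state chain: solve pi P = pi with
    sum(pi) = 1, replacing one balance equation by the normalization."""
    n = 3
    # Rows of (P^T - I) encode the balance equations sum_j pi_j P[j][i] = pi_i.
    A = [[P[j][i] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    b = [0.0, 0.0, 1.0]
    A[n - 1] = [1.0] * n              # normalization row
    # Gaussian elimination with partial pivoting
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        pi[r] = (b[r] - sum(A[r][k] * pi[k] for k in range(r + 1, n))) / A[r][r]
    return pi

# Rows: from states -1, 0, +1 (illustrative values)
P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]]
pi = stationary(P)
```

Within the quasi-Newton routine, this solve would be repeated at every evaluation of the likelihood, which is the computational cost noted above.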
Thus, this test is a boundary problem and hence does not satisfy the conditions required for the usual results on LRTs. Moreover, we cannot assume that methods such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) provide consistent estimates of the number of hidden states. Formal hypothesis testing and estimation of the number of hidden states is still an open problem.

Table 2.5: Parameter estimates and standard errors for the 3-state model

Parameter  Transformation
μ_0^*      log μ_0
θ^*        log(θ - 1)
p_1^*      log(p_1/(1 - p_1 - p_2))
p_2^*      log(p_2/(1 - p_1 - p_2))
p_3^*      log(p_3/(1 - p_3 - p_4))
p_4^*      log(p_4/(1 - p_3 - p_4))
p_5^*      log(p_5/(1 - p_5 - p_6))
p_6^*      log(p_6/(1 - p_5 - p_6))
log L

[The per-patient estimate and standard error columns are not legible in the source; the maximized log-likelihoods include -65.931 for Patient 2 and -63.514 for Patient 3.]

2.5 Summary

The analyses in this chapter illustrate some of the questions that still remain in our understanding and application of HMMs. We have seen the difficulties involved in comparing the fit of models with differing numbers of hidden states in Section 2.4.5. Likewise, determining the number of hidden states requires non-standard results. We address this issue in Chapter 3, where we develop a method for consistently estimating the number of hidden states in a single, stationary HMM. The analyses in this chapter also reveal the challenges of selecting an appropriate HMM for a given data set. Tools for examining the fit of both the conditional model for the observed data and the model for the hidden process are needed. We discuss the problem of assessing the goodness-of-fit of stationary HMMs in Chapter 4.
One important common element of our analyses is the relatively large standard errors associated with the estimates of the parameters of the hidden process. Assuming the same model for each patient is one means of reducing this uncertainty. However, the behaviour of lesion counts in MS patients is often highly variable. It thus seems more reasonable to allow at least some model parameters to vary across patients. Random effects are a useful means of capturing between-patient differences while still borrowing strength across patients. Their incorporation in HMMs is discussed in Chapters 5 and 6.

Chapter 3

Estimating the Number of States of a HMM

Consider the case where we have a single, stationary HMM, with observed data {Y_t}_{t=1}^n and hidden states {Z_t}_{t=1}^n. We will assume that Z_t takes values in the set {1, ..., K^0}, and will denote the stationary probabilities by {π_k^0} and the transition probabilities by {P_{kℓ}^0}, k, ℓ = 1, ..., K^0. We assume that P(Y_t ≤ y | Z_t = k) = H(y; θ_k^0, φ^0), where (θ_k^0, φ^0) ∈ Θ.

As discussed in Section 2.3, estimation of the model parameters in the case where K^0 is known has already been studied extensively. However, the problem of consistently estimating K^0 has not yet been satisfactorily resolved. Maximum likelihood estimation cannot be used because the likelihood is non-decreasing in the number of hidden states. Most authors applying HMMs, including Leroux & Puterman (1992), Hughes & Guttorp (1994), Albert et al. (1994), and Wang & Puterman (1999), simply use the AIC or BIC, but these methods have not been justified in the context of HMMs (MacDonald & Zucchini 1997).

Several authors have attempted to address this problem using penalized likelihood methods. Included in this group are Baras & Finesso (1992), who develop a consistent estimator of K^0 when the observed process takes on only finitely many values. Rydén (1995) relaxes this assumption, but at the expense of consistency.
He shows that a class of penalized likelihood estimators provides, in the limit, an upper bound on K^0. Dortet-Bernadet (2001) proves that Rydén's method in fact leads to a consistent estimator of K^0, but under fairly restrictive conditions: he assumes the existence of a known, non-zero lower bound on the transition probabilities. His method of proof appears to require only that this bound apply to the stationary transition probabilities, so perhaps this weaker assumption would suffice.

Other authors have used an information-theoretic approach to estimate K^0. Kieffer (1993) proposes a method involving maximum likelihood codes to find a consistent estimator of K^0. Liu & Narayan (1994) also give a consistent estimator by describing the observed data in terms of a uniquely decodable code. However, both of these estimators rely on the assumption that the observed process takes on only finitely many values. Poskitt & Chung (1996) assume a specific form for the HMM, namely that Y_t = Z_t + e_t, where {Z_t} is a finite-state Markov chain, and {e_t} is a white noise process. Under these conditions, they suggest an algorithm based on least-squares type calculations that provides an efficient means of consistently estimating K^0.

Robert, Rydén & Titterington (2000) use reversible jump Markov chain Monte Carlo techniques to estimate K^0 in a Bayesian setting. It appears, however, that no frequentist method currently exists for consistently estimating K^0 in the general setting where {Y_t} is a stationary, identifiable HMM. In this paper, we approach this problem by extending the ideas of Chen & Kalbfleisch (1996), henceforth called CK. These authors develop a penalized minimum-distance method that gives, under certain conditions, a consistent estimate of the number of components in a finite mixture model.
We will use the fact that the marginal distribution of Y_t is also a finite mixture in order to show that a variation of CK's method is applicable to stationary HMMs as well.

3.1 Notation

Under our assumptions, Y_1, ..., Y_n are identically distributed with common distribution function

    F_0(y) = F(y, G_0) = Σ_{k=1}^{K^0} π_k^0 H(y; θ_k^0, φ^0) = ∫ H(y; θ, φ) dG_0(θ, φ),     (3.1)

where the mixing distribution G_0 is defined by

    G_0(θ, φ) = Σ_{k=1}^{K^0} π_k^0 I(θ_k^0 ≤ θ, φ^0 ≤ φ).

For ease of exposition, we will assume that the {θ_k^0} and φ^0 are scalars so that θ_k^0 ≤ θ and φ^0 ≤ φ have the usual interpretations. However, the theory we present easily extends to the more general setting where these parameters are multidimensional. In addition, we treat the pairs (θ_k^0, φ^0) as the support points of G_0, even though φ^0 is common across states and so would normally be excluded from the definition of the mixing distribution. The treatment of φ^0 in this manner will facilitate our discussion of both identifiability and the consistency of the estimator of this parameter.

Similarly, using the notation y_1^m = (y_1, ..., y_m) and θ_1^m = (θ_1, ..., θ_m), we will express the m-dimensional distributions of (Y_t) as

    F_m(y_1^m) = F_m(y_1^m, G_0^m) = Σ_{z_1=1}^{K^0} ··· Σ_{z_m=1}^{K^0} [ Π_{t=1}^m H(y_t; θ_{z_t}^0, φ^0) ] π_{z_1}^0 P_{z_1 z_2}^0 ··· P_{z_{m-1} z_m}^0,

with

    G_0^m(θ_1^m, φ) = Σ_{z_1=1}^{K^0} ··· Σ_{z_m=1}^{K^0} π_{z_1}^0 P_{z_1 z_2}^0 ··· P_{z_{m-1} z_m}^0 I(θ_{z_1}^0 ≤ θ_1, ..., θ_{z_m}^0 ≤ θ_m, φ^0 ≤ φ).

3.2 Identifiability

Before presenting the proposed method, we address the issue of model identifiability. First, we need to define K^0 more carefully. We have stated above that K^0 is the number of hidden states. This value is not, in general, equal to the number of values of θ_1^0, ..., θ_{K^0}^0, since these values may not be distinct. Indeed, we can always construct a HMM with K^0 + 1 hidden states and distribution F_0 by choosing an additional state with θ_{K^0+1}^0 ∈ {θ_1^0, ..., θ_{K^0}^0} and an appropriate lumpable underlying Markov chain. (For a discussion of lumpability, see White, Mahony & Brushe 2000.)
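Equation (3.1) can be illustrated directly: for a two-state Poisson HMM, the marginal of Y_t is the π^0-weighted mixture of the state-conditional distributions. A minimal sketch with illustrative parameter values:

```python
import math

def poisson_pmf(y, mu):
    return math.exp(-mu) * mu ** y / math.factorial(y)

# Two-state chain [[1-g, g], [b, 1-b]]: stationary distribution in closed form
g, b = 0.3, 0.2
pi = [b / (b + g), g / (b + g)]
means = [1.0, 5.0]                 # Poisson means theta_k (illustrative)

def marginal_pmf(y):
    """Marginal pmf of Y_t: the pi-weighted finite mixture of the
    state-conditional Poisson distributions, as in (3.1)."""
    return sum(pi[k] * poisson_pmf(y, means[k]) for k in range(2))

total = sum(marginal_pmf(y) for y in range(60))
```

The mixture probabilities sum to 1 (up to a negligible truncation of the Poisson tails), which is the property the penalized minimum-distance method exploits.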
For this reason, we define K^0, the order of the HMM, as the minimum number of hidden states such that {Y_t} is a HMM. We will denote the number of distinct values of the {θ_k^0} by K', where necessarily K' ≤ K^0.

We now state the mild regularity conditions that we assume.

Condition 1. The transition probability matrix of {Z_t} is irreducible and aperiodic.

Condition 2. The parameter space Θ is compact.

Condition 3. H(y; θ, φ) is continuous in θ and φ.

Condition 4. Given ε > 0, there exists A > 0 such that for all (θ, φ) ∈ Θ, H(A; θ, φ) - H(-A; θ, φ) > 1 - ε.

Condition 5. The family of finite mixtures of {H(y; θ, φ)} is identifiable, i.e.,

    F(y, G_1) = F(y, G_2)  ⟹  G_1 = G_2.

Condition 6. Either we know that {θ_k^0} are distinct, or we know an upper bound, M, on the number of hidden states.

REMARK. Conditions 1 and 2 are also assumed by Dortet-Bernadet (2001). Condition 1 implies that the stationary distribution of {Z_t} is unique, and that π_k^0 > 0 for all k. Conditions 3 and 4 are satisfied by commonly used distributions including the normal, Poisson, exponential, binomial, and gamma distributions. Condition 5 is also assumed by Leroux (1992a), Rydén (1995), and Dortet-Bernadet (2001), and is satisfied by many common distributions. Prakasa Rao (1992) provides a good discussion. Condition 6 is weaker than that assumed by Rydén (1995) and Dortet-Bernadet (2001), who postulate a known upper bound on the number of hidden states regardless of the distinctness of {θ_k^0}.

3.2.1 Parameter Identifiability

To obtain well-behaved parameter estimates - using either the maximum likelihood or penalized minimum-distance method - model identifiability is required. In the usual case where the values of {θ_k^0} are distinct and Condition 5 is satisfied, the model parameters are identifiable up to permutations of the labels of the hidden states (Wang & Puterman 1999). Petrie (1969) discusses identifiability in the case where Y_t takes on only a finite set of values.
However, we are unaware of the existence of sufficient conditions for parameter identifiability when {Y_t} is a general HMM with possibly non-distinct values of {θ_k^0}. Determining such conditions appears to be quite a difficult problem. Following Wang & Puterman (1999), we may use the series of m-dimensional distributions, m ≥ 1, to obtain a sequence of equations involving the model parameters. For example, let {θ_{(q)}^0} be the set of K' distinct values among {θ_k^0}, with θ_{(1)}^0 < ··· < θ_{(K')}^0. Let S_q = {k : θ_k^0 = θ_{(q)}^0}. The one-dimensional distributions of {Y_t} are given by

    F(y, G_0) = Σ_{k=1}^{K^0} π_k^0 H(y; θ_k^0, φ^0) = Σ_{q=1}^{K'} H(y; θ_{(q)}^0, φ^0) Σ_{k∈S_q} π_k^0.

Condition 5 allows us to identify G_0 from this equation. By Condition 1, all support points of G_0 have positive mass, and hence {θ_{(q)}^0}, φ^0, and {Σ_{k∈S_q} π_k^0} may also be identified. Similarly, the two-dimensional distributions of (Y_t) are given by

    F_2(y_1^2, G_0^2) = Σ_{k=1}^{K^0} Σ_{ℓ=1}^{K^0} H(y_1; θ_k^0, φ^0) H(y_2; θ_ℓ^0, φ^0) π_k^0 P_{kℓ}^0
                      = Σ_{q=1}^{K'} Σ_{r=1}^{K'} H(y_1; θ_{(q)}^0, φ^0) H(y_2; θ_{(r)}^0, φ^0) Σ_{k∈S_q} Σ_{ℓ∈S_r} π_k^0 P_{kℓ}^0.

Teicher (1967) shows that mixtures of products of distributions from a given family are identifiable if this family satisfies Condition 5. Using this result, we may identify G_0^2, and hence {Σ_{k∈S_q} Σ_{ℓ∈S_r} π_k^0 P_{kℓ}^0}, from this equation. In particular, if {θ_k^0} are distinct, we see that the two-dimensional distributions are sufficient to allow the identification of the parameters up to permutations of the labels of the hidden states.

In the case where {θ_k^0} are not distinct, parameter identifiability can be explored by applying Teicher's result in this manner to the higher-dimensional distributions. However, the equations obtained in this fashion are highly non-linear - in part due to the complicated relationship between {P_{kℓ}^0} and {π_k^0} - and difficult to analyze. Hence, at this time, we lack a means of assessing, in general, whether the parameters of a given model are identifiable.
In addition, some of these equations are redundant, so it is unclear even how to specify the minimum number of dimensions that must be considered in order to determine parameter identifiability.

Fortunately, Rydén (1995) shows that the finite-dimensional distributions of {Y_t} are determined by the 2(K^0 - K' + 1)-dimensional distribution when Condition 5 is satisfied. Thus, under the assumption that K^0 is minimal, if M is an upper bound on the number of hidden states, then 2M is an upper bound on the required number of dimensions. Letting ψ^0 = (θ_1^0, ..., θ_{K^0}^0, φ^0, P_{11}^0, P_{12}^0, ..., P_{K^0 K^0}^0, π_1^0, ..., π_{K^0}^0) as before, we will use Rydén's equivalence relation: ψ̄^0 denotes the equivalence class of ψ^0, with ψ ∈ ψ̄^0 if and only if ψ induces the same law for {Y_t} as ψ^0. In conclusion, if Condition 5 holds, then ψ̄^0 is determined by the 2M-dimensional distributions, where we take M = 1 if the values of {θ_k^0} are distinct. We will use this result in the development of the penalized minimum-distance method for HMMs.

3.2.2 Sufficient Conditions for CK's Identifiability Criterion

When K^0 is minimal, Condition 5 is the standard notion of identifiability of a mixture model. CK, however, assume an identifiability criterion of a different form. In particular, let d be a distance measure on the space of probability distributions, and let G_n be a sequence of one-dimensional mixing distributions. CK assume that

    lim_{n→∞} d{F(y, G_n), F(y, G_0)} = 0  ⟹  lim_{n→∞} d(G_n, G_0) = 0.

This criterion should in fact read: for distance functions d_1 and d_2, where G_n converges to G_0 weakly when d_2(G_n, G_0) → 0 as n → ∞,

    lim_{n→∞} d_1{F(y, G_n), F(y, G_0)} = 0  ⟹  lim_{n→∞} d_2(G_n, G_0) = 0     (3.3)

(Chen, personal communication). It is of interest to develop sufficient conditions for CK's criterion, both as an extension to CK's work, and for use in our procedure to estimate the parameters of a HMM.
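For small K and m, the m-dimensional distributions discussed above can be computed by brute-force enumeration of hidden paths, which is convenient for checking identifiability arguments numerically. A sketch with Poisson state-conditional distributions (the parameter values are illustrative):

```python
import itertools
import math

def poisson_cdf(y, mu):
    """P(Y <= y) for Y ~ Poisson(mu), computed iteratively."""
    term = math.exp(-mu)
    total = term
    for k in range(1, int(y) + 1):
        term *= mu / k
        total += term
    return min(total, 1.0)

def F_m(y, pi, P, thetas):
    """m-dimensional distribution of (Y_1,...,Y_m): sum over hidden
    paths z of prod_t H(y_t; theta_{z_t}) * pi_{z_1} * prod_t P_{z_t z_{t+1}}."""
    K, m = len(pi), len(y)
    total = 0.0
    for z in itertools.product(range(K), repeat=m):
        w = pi[z[0]]
        for t in range(m - 1):
            w *= P[z[t]][z[t + 1]]
        h = 1.0
        for t in range(m):
            h *= poisson_cdf(y[t], thetas[z[t]])
        total += w * h
    return total

pi = [0.4, 0.6]
P = [[0.7, 0.3], [0.2, 0.8]]
thetas = [1.0, 5.0]
val = F_m([2, 3], pi, P, thetas)
```

For m = 1 the enumeration reduces to the marginal mixture CDF of (3.1), and as all coordinates grow the value tends to 1, both of which serve as sanity checks.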
The following lemma, which will be proved in Appendix B.1, states that, for specific choices of d_1 and d_2, this criterion is satisfied when Conditions 2-5 hold. Normally, it will be easier to verify these conditions than the original CK criterion. In particular, we assume that d_2 corresponds to weak convergence, and that d_1 = d_KS is the Kolmogorov-Smirnov distance, i.e., for distribution functions F_1 and F_2,

    d_KS(F_1, F_2) = sup_y |F_1(y) - F_2(y)|.

Lemma 3.1 Let (G_n^m) be a sequence of m-dimensional mixing distributions with associated parameters {(θ_k^n, φ^n)} ∈ Θ. Then, under Conditions 2-5, if d_KS{F_m(y_1^m, G_n^m), F_m(y_1^m, G_0^m)} = o(1), then G_n^m converges weakly to G_0^m.

3.3 Parameter Estimation

Let {c_n} be a sequence of positive constants with c_n = o(1), and let F_n be the empirical distribution function of Y_t. If {Y_t} are independent observations from a finite mixture model, F_0, then CK estimate the parameters of F_0 (including the number of components) by minimizing the penalized distance function

    D(F_n, F) = d_1(F_n, F) - c_n Σ_{k=1}^K log π_k

over all F, where F is a finite mixture with K components and mixing probabilities {π_k}, and d_1 is any distance function satisfying (3.3). The authors prove the consistency of the parameter estimates obtained in this fashion.

The novelty of this approach is that the penalty term is a function of Σ_{k=1}^K log π_k. Models with large values of K are penalized since the requirement that π_1 + ··· + π_K = 1 forces π_k to 0 for some values of k as K → ∞. In addition, models for which some states have small values of π_k are penalized. In these two ways, the estimated number of components is indirectly controlled. In contrast, most other penalized methods, including the AIC and BIC, attempt to control K directly.
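CK's penalized distance is simple to compute for candidate mixtures. The sketch below evaluates D(F_n, F) = d_KS(F_n, F) - c_n Σ_k log π_k for Poisson mixtures against the empirical CDF of a toy sample; for discrete data we take the supremum over the observed support points, which is adequate for illustration:

```python
import math

def poisson_cdf(y, mu):
    """P(Y <= y) for Y ~ Poisson(mu), computed iteratively."""
    term = math.exp(-mu)
    total = term
    for k in range(1, int(y) + 1):
        term *= mu / k
        total += term
    return min(total, 1.0)

def penalized_distance(data, weights, means, c_n):
    """D(F_n, F) = d_KS(F_n, F) - c_n * sum_k log(pi_k), where F is the
    finite Poisson mixture with the given weights and means."""
    n = len(data)
    pts = sorted(set(data))
    d_ks = max(abs(sum(1 for x in data if x <= y) / n
                   - sum(w * poisson_cdf(y, m) for w, m in zip(weights, means)))
               for y in pts)
    return d_ks - c_n * sum(math.log(w) for w in weights)

data = [0, 1, 1, 2, 4, 5, 6, 7, 8, 10]   # toy counts, mean 4.4
c_n = 0.05
d1 = penalized_distance(data, [1.0], [4.4], c_n)            # one component
d2 = penalized_distance(data, [0.5, 0.5], [1.0, 7.0], c_n)  # two components
```

For the one-component model the penalty vanishes (log 1 = 0), while each additional component with weight 1/2 adds c_n log 2, so larger or more unbalanced models are penalized exactly as described above.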
From the discussion in Section 3.2, it is clear that CK's estimation procedure, which is based on one-dimensional distributions, is not sufficient to estimate all the parameters of a HMM. However, by modifying the method to incorporate multiple dimensions, we may obtain a procedure that is appropriate to our problem. In particular, for n > m, we consider the m-dimensional process {Y_t^{t+m-1}}_{t=1}^{n-m+1}, where Y_t^{t+m-1} = (Y_t, ..., Y_{t+m-1}). We then define the penalized distance based on the distribution of this process as

    D(F_n^m, F^m) = d_1(F_n^m, F^m) - c_n Σ_{k=1}^K log π_k,     (3.4)

where F_n^m is the m-dimensional empirical distribution function. The dimension m should be chosen based on identifiability considerations, as discussed in Section 3.2. In addition, it is desirable to choose m as small as possible to minimize the computational burden. Thus, we will use m = 2M.

CK's proof of the consistency of the parameter estimates in the case of finite mixture models does not require the independence of the observed data. In fact, this proof depends only on their identifiability criterion (see Section 3.2.2) and the assumption that d_1(F_n, F_0) = o(c_n) a.s. We will show that, if d_1 is chosen such that d_1(F_n^m, F_0^m) = o(c_n) a.s. and Conditions 1-6 are satisfied, then the theorems developed by CK also hold when the underlying model is a HMM and the parameter estimates minimize the penalized distance function (3.4). Hence, we will obtain consistent estimators of the model parameters, including K^0.

Before stating our first theorem we require the definition of the concept of an α-mixing multidimensional process (see, for instance, Ould-Saïd 1994).

Definition 3.1 A stationary sequence (Y_t) of m-dimensional vectors is α-mixing or strongly mixing if, for any s,

    α_ℓ = sup_{A ∈ F_1^s, B ∈ F_{s+ℓ}^∞} |P(AB) - P(A)P(B)| → 0  as ℓ → ∞,

where F_a^b = σ(Y_t, a ≤ t ≤ b). The α_ℓ's are called the mixing coefficients.

Theorem 3.1 For a stationary HMM, d_KS(F_n^m, F_0^m) = O((n^{-1} log log n)^{1/2}) a.s.

Proof.
Ould-Saïd (1994) proves that for a stationary, m-dimensional, α-mixing process with mixing coefficients α_ℓ = O(ℓ^{-v}) for some v > 2m + 1,

    limsup_{n→∞} (n / (2 log log n))^{1/2} sup_{y_1^m} |F_n^m(y_1^m) - F_0^m(y_1^m)| = 1  a.s.

Thus, to prove that the empirical distribution of a HMM converges at this rate, it is sufficient to show that the HMM satisfies the condition α_ℓ = O(ℓ^{-v}) for v > 4M + 1.

First, the mixing coefficients α_ℓ^Z of the Markov chain {Z_t} satisfy the inequality α_ℓ^Z ≤ c ρ^ℓ, where c is a finite, positive constant, and 0 < ρ < 1 (see, for instance, Doukhan 1994). Now, without loss of generality, take ℓ > m. Define F_a^b = σ(Y_a, ..., Y_{b+m-1}). Following Lindgren (1978), for stationary HMMs we have that

    α_ℓ = sup_{A ∈ F_1^s, B ∈ F_{s+ℓ}^∞} |P(AB) - P(A)P(B)|
        = sup_{A, B} |E{P(A | Z_1^{s+m-1}) P(B | Z_{s+ℓ}^∞)} - E{P(A | Z_1^{s+m-1})} E{P(B | Z_{s+ℓ}^∞)}|.

Since P(A | Z_1^{s+m-1}) and P(B | Z_{s+ℓ}^∞) are bounded and measurable with respect to σ(Z_1^{s+m-1}) and σ(Z_{s+ℓ}^∞), respectively, we may apply Theorem 17.2.1 of Ibragimov & Linnik (1971) to obtain

    α_ℓ ≤ 4 α_{ℓ-m+1}^Z ≤ 4c ρ^{ℓ-m+1}.

We have thus proved that α_ℓ = o(ℓ^{-v}) for all v < ∞. □

COROLLARY. Let F̂^m be the function that minimizes (3.4), where the minimization is over all parameters, including K. Under Conditions 1-6, if we choose d_1 = d_KS, then d_KS(F̂^m, F_0^m) → 0 a.s. and Ĝ^m converges weakly to G_0^m a.s. Furthermore, φ̂ → φ^0 a.s., and θ̂_k → θ* a.s., where θ* is one of the support points of G_0^m.

Proof. In light of Theorem 3.1 and Lemma 3.1, the technique of proof used in Theorems 1 and 3 of CK also applies in the HMM case. Although the class of mixture models considered by CK does not incorporate parameters that are common to all components (i.e., φ^0), their Theorem 3 is also sufficient to show that φ̂ → φ^0 a.s. □

We now investigate the asymptotic behaviour of K̂ using a method inspired by the work of Leroux (1992b) in the context of penalized maximum likelihood estimation of mixture models.
In particular, we study properties of the estimated m-dimensional distribution when we fix the number of hidden states at some known value K, while the other parameters are estimated by minimizing (3.4). We denote this distribution by F̂_K^m, and the associated parameter estimates by ψ̂, so that

F̂_K^m(y) = Σ_{z_1=1}^K ··· Σ_{z_m=1}^K π̂_{z_1} P̂_{z_1 z_2} ··· P̂_{z_{m-1} z_m} H(y_1; θ̂_{z_1}, φ̂) ··· H(y_m; θ̂_{z_m}, φ̂).

We will need the following two technical results.

Lemma 3.2 If we know the value of K⁰ and estimate the remaining parameters by minimizing (3.4), then we have that

d_KS(F̂_{K⁰}^m, F_n^m) = O(c_n) a.s. and liminf Σ_{k=1}^{K⁰} log π̂_k > -∞ a.s.

Proof. This lemma is similar to Theorems 1 and 2 of CK, and can be proved in the same way. □

Lemma 3.3 If K is chosen such that K < K⁰ and the remaining parameters are estimated by minimizing (3.4), then liminf d_KS(F̂_K^m, F_n^m) > 0 a.s.

Proof. By the triangle inequality,

d_KS(F̂_K^m, F_n^m) ≥ d_KS(F̂_K^m, F_0^m) - d_KS(F_n^m, F_0^m).

By Theorem 3.1, d_KS(F_n^m, F_0^m) = o(1) a.s. For this reason, it is enough to show that liminf d_KS(F̂_K^m, F_0^m) > 0. Choose a realization of the sequence {F̂_K^m}, indexed by n. Towards a contradiction, assume that liminf d_KS(F̂_K^m, F_0^m) = 0 for this realization. Then there exists a subsequence such that lim d_KS(F̂_K^m, F_0^m) = 0. For this subsequence, F̂_K^m converges pointwise to F_0^m, and there exists a further subsequence such that each parameter estimate converges to some limiting value, ψ̃. Since H is continuous in θ and φ, we have that

lim F̂_K^m(y) = Σ_{z_1=1}^K ··· Σ_{z_m=1}^K π̃_{z_1} P̃_{z_1 z_2} ··· P̃_{z_{m-1} z_m} H(y_1; θ̃_{z_1}, φ̃) ··· H(y_m; θ̃_{z_m}, φ̃).

Let S* = {θ̃_k}. From our discussion of identifiability in Section 3.2, the set of distinct elements of S* must be equal to {θ_k⁰}, and we must have that φ̃ = φ⁰. So, defining

S_k = {l : θ̃_l = θ⁰_{(k)}}, k = 1, ..., K', K ≥ K',

and collecting the states within each S_k, the limit may be rewritten as

Σ_{q_1=1}^{K'} ··· Σ_{q_m=1}^{K'} π*_{q_1} P*_{q_1 q_2} ··· P*_{q_{m-1} q_m} H(y_1; θ⁰_{(q_1)}, φ⁰) ··· H(y_m; θ⁰_{(q_m)}, φ⁰).

This limiting distribution is the m-dimensional distribution of a HMM with K' hidden states. But then lim F̂_K^m ≠ F_0^m, since K⁰ was defined as minimal.
This contradicts our hypothesis that lim d_KS(F̂_K^m, F_0^m) = 0. □

Theorem 3.2 Assume that c_n = o(1) and Conditions 1-6 are satisfied. Then liminf K̂ ≥ K⁰ a.s. If {θ_k⁰} are distinct, then we also have that limsup K̂ ≤ K⁰ a.s.

Proof. If {θ_k⁰} are distinct, then Theorem 4 of CK applies. Hence, K̂ → K⁰ a.s. If {θ_k⁰} are not distinct, we require a different technique of proof. It is clear that liminf K̂ ≥ K' a.s. However, some work is required to prove that liminf K̂ ≥ K⁰ a.s. The key idea is that if the limiting value of inf_{F^m} D(F_n^m, F^m) is smaller when we fix K = K⁰ than when we fix K < K⁰, we will know that D(F_n^m, F^m) is minimized, in the limit, by a HMM with at least K⁰ hidden states.

By Lemmas 3.2 and 3.3, for K' ≤ K < K⁰, we have that

[d_KS(F̂_{K⁰}^m, F_n^m) - d_KS(F̂_K^m, F_n^m)] / c_n → -∞ a.s.

Also by Lemma 3.2, Σ_{k=1}^{K⁰} log π̂_k is bounded away from -∞ a.s. Thus, with probability 1, for all n sufficiently large,

d_KS(F̂_{K⁰}^m, F_n^m) - d_KS(F̂_K^m, F_n^m) < c_n (Σ_{k=1}^{K⁰} log π̂_k - Σ_{k=1}^{K} log π̂_k),

i.e., D(F_n^m, F̂_{K⁰}^m) < D(F_n^m, F̂_K^m). In other words, for n sufficiently large, the estimated distribution function will have at least K⁰ hidden states with probability 1. □

Our final theorem demonstrates the consistency of the parameter estimates in the quotient topology generated by our equivalence relation. In particular, if A is an open set containing ψ⁰, then ψ̂ ∈ A for large n a.s. We will use the notation ψ̂ → ψ⁰ a.s. This notion was also used by Leroux (1992a) and Rydén (1995).

Theorem 3.3 Assume that c_n = o(1) and Conditions 1-6 are satisfied. Then ψ̂ → ψ⁰ a.s. If {θ_k⁰} are distinct, then we also have that P̂_kℓ → P_kℓ⁰ a.s. (ignoring possible permutations of the labels of the hidden states).

Proof. If {θ_k⁰} are distinct, the model parameters are identifiable up to permutations of the labels of the hidden states.
Thus, by Theorem 3.2 and the corollary to Theorem 3.1, P̂_kℓ → P_kℓ⁰ a.s. for all k, ℓ (ignoring possible permutations). Otherwise, from the corollary to Theorem 3.1, we know that F̂^m converges pointwise to F_0^m a.s. In addition, from the corollary to CK's Theorem 2, we know that limsup K̂ < ∞ a.s. The remainder of the proof will be restricted to the event of probability 1 where these limits hold.

Consider a realization of the sequence {K̂} indexed by n. Since K̂ takes on only integer values, we can find neighbourhoods around each limit point of {K̂} such that there is only one limit point in each neighbourhood, and all points of {K̂} are in at least one neighbourhood. Since limsup K̂ < ∞, there are only a finite number of these neighbourhoods. Let {n_ij} be the subsequence defined by the set of points in the ith neighbourhood, and let K^i be the limit point of {K̂} associated with this neighbourhood. For this subsequence, and for all j sufficiently large, K̂ = K^i and hence F̂^m = F̂_{K^i}^m. We can then interchange the limit and the summation in the expression for lim F̂^m (as in the proof of Lemma 3.3) and use the fact that F̂^m = F̂_{K^i}^m → F_0^m to conclude that, for this subsequence, ψ̂ → ψ⁰. However, since the union of all the subsequences is the original sequence, it is clear that, in fact, ψ̂ → ψ⁰ a.s. for the original sequence. □

3.4 Application to MS/MRI Data

In this section, under the assumption that a stationary HMM adequately captures the behaviour of the lesion counts observed on relapsing-remitting MS patients, we estimate the number of underlying disease states using the data of Albert et al. (1994) described in Section 2.4. For the purposes of illustrating our technique, we assume the same model for each patient. Specifically, we assume that Y_it | Z_it ~ Poisson(μ_{Z_it}). Here Y_it is the number of lesions recorded for patient i at time t, and Z_it is the associated disease state.
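For this Poisson conditional model, the estimated bivariate distribution F̂_K²(y₁, y₂) = Σ_{z₁,z₂} π̂_{z₁} P̂_{z₁z₂} H(y₁; μ̂_{z₁}) H(y₂; μ̂_{z₂}) can be written out directly. The sketch below uses hypothetical names and a pure-Python Poisson cdf to stay self-contained:

```python
import math

def poisson_cdf(y, mu):
    """H(y; mu) = P(Y <= y) for Y ~ Poisson(mu)."""
    return sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(int(y) + 1))

def hmm_bivariate_cdf(y1, y2, pi, P, mus):
    """F_K^2(y1, y2) for a stationary HMM with Poisson conditional
    distributions: a sum over the K^2 hidden-state pairs (z1, z2)."""
    K = len(pi)
    return sum(pi[z1] * P[z1][z2] * poisson_cdf(y1, mus[z1]) * poisson_cdf(y2, mus[z2])
               for z1 in range(K) for z2 in range(K))
```

With K = 1 the expression collapses to a product of two Poisson cdfs, which is a convenient sanity check; for K > 1 the hidden chain induces the dependence between Y_t and Y_{t+1}.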
Table 3.1: Penalized minimum-distances for different numbers of hidden states

Number of states   Estimated Poisson means            Minimum distance
1                  4.03                               0.1306
2                  2.48, 6.25                         0.0608
3                  2.62, 2.77, 7.10                   0.0639
4                  2.05, 2.96, 3.53, 7.75             0.0774
5                  1.83, 3.21, 3.40, 3.58, 8.35       0.0959

We then use the method of penalized minimum-distance to estimate K⁰, the number of states in the hidden process. We are willing to assume that the values of {μ_k⁰} are distinct, and thus use the bivariate distributions to compute the distance function. As suggested in CK, we use c_N = 0.01 N^{-1/2} log N, where N is the total number of observations across all patients. Using a variety of starting values, we calculate the penalized minimum-distance for K = 1, ..., 5 using a quasi-Newton minimization routine (Nash 1979). These distances are displayed in Table 3.1. Note that the overall penalized minimum-distance occurs at K = 2, which becomes our estimate for the number of hidden states. These two states may correspond to relapse and remission, which is consistent with qualitative observations of the behaviour of this disease. The estimates of the parameters of the hidden process were

π̂ = [0.594, 0.406],    P̂ = [0.619 0.381; 0.558 0.442],

indicating that, on average, patients spent a higher proportion of their time in remission (59.4%) than in relapse (40.6%).

3.5 Performance of the Penalized Minimum-Distance Method

The appropriateness of the penalized minimum-distance estimator for finite sample sizes is another topic of interest. To address this issue, we generated 100 individual time series of two different lengths (30 and 100) from various Poisson HMMs. We then estimated K⁰ using three different methods: the AIC, BIC, and penalized minimum-distance method.
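The reported stationary vector can be checked against the estimated transition matrix, since the stationary distribution solves πP = π together with Σ_k π_k = 1. A quick sketch (the function name is hypothetical):

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi @ P = pi together with sum(pi) = 1 as a least-squares system."""
    K = P.shape[0]
    A = np.vstack([P.T - np.eye(K), np.ones(K)])
    b = np.append(np.zeros(K), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P_hat = np.array([[0.619, 0.381],
                  [0.558, 0.442]])      # estimated transition matrix, Section 3.4
print(stationary_distribution(P_hat))   # close to the reported [0.594, 0.406]
```

The system is overdetermined but consistent, so least squares recovers the exact stationary vector; the output agrees with the reported remission/relapse proportions to three decimals.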
Table 3.2: Parameter values used in the simulation study

K = 1:  θ = (4);  P = [1]
K = 2:  θ = (1, 7) (well-separated) or (1, 2) (close together)
        P = [0.5 0.5; 0.5 0.5] (balanced) or [0.269 0.731; 0.119 0.881] (unbalanced)
K = 3:  θ = (1, 5, 9) (well-separated) or (1, 2, 3) (close together)
        P = [0.333 0.333 0.333; 0.333 0.333 0.333; 0.333 0.333 0.333] (balanced)
        or [0.140 0.231 0.629; 0.212 0.212 0.576; 0.030 0.366 0.604] (unbalanced)

The simulation was conducted in the spirit of an experimental design with four factors: number of components (1, 2, or 3), sample size (30 or 100), separation of components (well-separated or close together), and proportion of time in each state (balanced or unbalanced among states). The sample sizes of 30 and 100 were chosen to reflect the sizes of typical and large MS/MRI data sets, respectively. The parameters selected for each case are given in Table 3.2. As in the previous section, the models were fit using a quasi-Newton minimization routine (Nash 1979). For the penalized minimum-distance method, we again chose c_n = 0.01 n^{-1/2} log n. To reduce the computational burden, when the true value of K⁰ was 1 or 2, we fit only models with 1, 2, and 3 components. For K⁰ = 3, we fit models with 1, 2, 3, and 4 components. Histograms of the resulting estimated values of K̂ appear in Figures 3.1-3.5. The legend is the same for all plots, and is included only in the first.

[Figure 3.1: Distribution of K̂ when K⁰ = 1. Panels: a) n = 30; b) n = 100. Horizontal axis: estimated no. of components.]

The histograms show that the penalized minimum-distance method seems to perform well relative to the AIC and BIC, especially when the problem is "hard," i.e., K⁰ = 2 or K⁰ = 3, the components are not well-separated, or the proportion of time in each state is not balanced. For K⁰ = 1, the method tends to overestimate slightly the number of components. It is interesting that the performances of the methods do not improve substantially when the sample size is increased from 30 to 100.
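The data-generating step of this simulation study can be sketched as follows: draw a stationary hidden chain from one of the transition matrices in Table 3.2, then draw conditionally Poisson observations. The function name and seed handling below are hypothetical.

```python
import numpy as np

def simulate_poisson_hmm(n, P, means, seed=None):
    """Simulate n observations from a stationary Poisson HMM: the hidden
    chain starts from its stationary distribution and evolves by P; each
    Y_t is Poisson with mean determined by the current hidden state."""
    rng = np.random.default_rng(seed)
    K = P.shape[0]
    evals, evecs = np.linalg.eig(P.T)                         # stationary dist. =
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])    # left eigenvector of P
    pi = pi / pi.sum()                                        # for eigenvalue 1
    z = np.empty(n, dtype=int)
    z[0] = rng.choice(K, p=pi)
    for t in range(1, n):
        z[t] = rng.choice(K, p=P[z[t - 1]])
    return rng.poisson(np.asarray(means)[z])
```

For example, simulate_poisson_hmm(30, np.array([[0.5, 0.5], [0.5, 0.5]]), [1, 7]) mirrors one replicate of the "K = 2, well-separated, balanced, n = 30" cell of the design.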
This may not be surprising in the cases of the AIC and BIC (which have not been proved to be consistent methods of estimation). However, in the case of the penalized minimum-distance method, we might expect to see a greater improvement. This result may be due to the fact that K⁰ is equal to the number of non-zero stationary transition probabilities. Since these probabilities are parameters of the unobserved process, large data sets are required for their precise estimation. Consequently, estimating K⁰ precisely is also a difficult problem.

3.6 Discussion

One of the practical difficulties with the method of penalized minimum-distance is that the resulting objective function has many local minima at which the algorithm tends to converge. Using a variety of different starting values can help, but this increases the required computational effort, and we can never be certain that we have located the global minimum.

[Figures 3.2-3.5: Distributions of K̂ for the remaining simulation settings. Panels: Well Separated, Balanced; Not Well Separated, Balanced; Well Separated, Unbalanced; Not Well Separated, Unbalanced. Horizontal axes: estimated no. of components.]

Although, in general, locating the global minimum of the penalized distance function
is critical for obtaining good parameter estimates, it may not be necessary for determining the number of hidden states. In the examples we have considered, the local minima found for each value of K were typically both close to one another and reasonably well-separated from the local minima found for other values of K. For this reason, a two-stage estimation procedure may be appropriate. First, K⁰ could be estimated using the method of penalized minimum-distance. Then, assuming this value of K⁰, the remaining parameters could be estimated using maximum likelihood estimation. The benefit of this approach is that, based on our experience, it seems easier to locate the maximum value of the likelihood function than the minimum value of the distance function. In addition, MLEs have appealing properties, and approximate standard errors are readily available. In contrast, we do not have this information for the estimates obtained using the penalized minimum-distance method.
The disadvantage of the two-stage method is that the standard errors for our MLEs do not take into account our uncertainty about K⁰. Further exploration of the implications of this estimation procedure is in order.

We may be able to improve the performance of the penalized minimum-distance method through different choices of the distance function, c_n, or the penalty term. For example, our method might also be valid for other distance functions. The Cramér-von Mises distance or Kullback-Leibler information are options (CK), as is the Hellinger distance, which has the advantage of robustness (Hettmansperger & Thomas 2000). These distance functions - and others - may be valid choices in the HMM setting, and may result in improved performance. In addition, as discussed in CK, there are many possibilities for the form of the penalty term. Finally, the choice of c_n is critical. When c_n = C n^{-1/2} log n, for some positive constant C, the penalty term converges to zero at approximately the same rate as the distance function. This seems reasonable. However, in simulation studies similar to those discussed in Section 3.5, we found that the resulting estimate of K⁰ was very sensitive to the value of C. Of the possible values C = 0.1, 0.01, 0.001, the optimal value of C, as judged by the frequency with which the correct value of K⁰ was predicted, tended to decrease with K⁰. In other words, in the cases considered, the (subjective) choice of C appeared to have a considerable influence on the estimate of K⁰. Clearly, the form of c_n is a topic requiring further investigation.

Chapter 4

Assessment of Goodness-of-Fit

In this chapter, we will again be working in the setting where we have a single, stationary HMM, and will use the same notation as in Chapter 3. As with most data analysis problems, it is desirable to find methods for assessing the goodness-of-fit (GOF) of a given HMM. As discussed in Section 2.3, Giudici et al.
(2000) show that the likelihood ratio test can be used to compare nested stationary HMMs with a common, known value of K⁰. However, the comparison of non-nested models, including both HMMs (with possibly unknown values of K⁰) and models outside the class of HMMs, is more challenging, and this is the problem we consider here.

Lystig (2001) provides a comprehensive overview of existing literature on GOF techniques for HMMs. Turner et al. (1998), working with Poisson HMMs, create a diagnostic plot by overlaying the predicted mean responses at each time point on the observed data. A limitation of this method is its focus on means rather than on distributions: it is not suitable for detecting violations of the Poisson assumption. Zucchini & Guttorp (1991) also look at plots of the predicted mean responses, but for binary data. In the case where the observed data take on only a finite number of values, Albert (1991) suggests qualitatively comparing the observed and expected frequencies of each value. However, neither of these methods allows the investigation of deviations from the assumed model for the hidden process. Hughes & Guttorp (1994), working with a non-homogeneous HMM with finite state space, consider the comparison of the observed frequency of each response, y, to

(1/n) Σ_{t=1}^n P̂(Y_t = y),

where P̂ is the estimated probability under the fitted HMM. In a similar way, these authors compare the observed and estimated survival functions (i.e., the probability that the observed process is in state r for at least k days), as well as the observed and estimated correlations. However, since P(Y_t = y) depends on t in this setting, the advantage of averaging over the n observations is unclear. Finally, Lystig (2001) develops a formal test based on the score process for use in the context where there are n responses (from a finite state space) on each of N independent individuals, and N is large.
In addition to providing guidance about choosing among models, it is desirable that a GOF technique will, with high probability, detect a lack of fit as n gets large - when either the marginal distribution or the correlation structure of the observed data is misspecified. However, none of the methods described above has been shown to have this property.

With these goals in mind, we consider an alternative method of assessing GOF in this chapter. Our method is similar to that of Hughes & Guttorp (1994), but we exploit the fact that, in the stationary case, we have identically distributed observations. In particular, we propose a graphical approach to analyzing the fit of the HMM: we plot the estimated cumulative distribution function (cdf), F(·, Ĝ_n), given by Equation 3.1, against the empirical cdf, F_n(·). Recall that the empirical cdf is based solely on the observed data:

F_n(y) = (1/n) Σ_{t=1}^n I(Y_t ≤ y).

If {Y_t} is discrete, we might also consider plotting the estimated probability distribution function (pdf) against the empirical pdf. However, we will restrict our discussion to the general case, and henceforth, the word "distribution" will refer to the cdf. Under regularity conditions, if we have correctly specified the model, the empirical and estimated distributions will both be consistent estimates of the true distribution, so as n increases, this plot will converge to a 45° line through the origin.

By plotting the univariate distributions in this way, we will be able to assess the fit of the assumed marginal distribution for Y_t, i.e., the mixture distribution given by Equation 3.1. However, this plot provides no information about the fit of the assumed correlation structure. In light of the comment by Hughes & Guttorp (1994) that, at least in their setting, "it is generally not difficult to get a good fit to the empirical marginal probabilities", checking the correlation structure may be of primary interest.
Thus, making use of the ideas in Section 3.2, we propose the construction of an additional plot: the estimated bivariate distribution, F²(·, Ĝ_n²) (see Equation 3.2), against the empirical bivariate distribution, F_n²(·) (given by Equation 3.5). If the values of {θ_k⁰} are not distinct, we may also wish to make plots of the higher-dimensional distributions. Again, we would expect these plots to converge to a straight line if the assumed model is correct. In this way, as n increases, we will be able to make a better assessment of the fit of both the marginal model and the correlation structure of the observed data. Moreover, we will be able to compare the fit of several proposed models by overlaying plots constructed by fitting these models to the same data set.

When Y_t is discrete, we plot F^m(y, Ĝ_n^m) versus F_n^m(y) for a finite number of points, focusing on values of y over which these functions tend to concentrate. When Y_t is continuous, we plot F^m(y, Ĝ_n^m) versus F_n^m(y) over the entire range of y. In the case of the plot of the univariate distributions, the functions are monotonic in y, and hence their values are necessarily ordered with respect to the values of y. The points of the multidimensional distributions, however, are not ordered in this way.

The requirements that we impose to ensure that the plot of the m-dimensional distributions has the above convergence property are as follows:

Requirement 1. {Y_t} is strictly stationary.

Requirement 2. F^m(·, Ĝ_n^m) converges to F_0^m(·).

Requirement 3. F_n^m(·) converges to F_0^m(·).

Remark. Requirement 1 implies that the joint distribution of (Y_t, ..., Y_{t+ℓ}) is the same for all t. Requirement 2 will be satisfied (in the sense of pointwise convergence) if F^m(·) is continuous in the parameters and the parameter estimates are consistent.
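In practice, the proposed plot amounts to computing, over a grid of points, the pair (empirical cdf value, estimated cdf value) and scattering one against the other with a 45° reference line. A minimal univariate sketch (hypothetical names; the bivariate version would replace the indicator with one on consecutive pairs):

```python
import numpy as np

def gof_plot_points(y, fitted_cdf, support):
    """Return (F_n(s), F_hat(s)) for each s in `support`: F_n is the
    empirical cdf of the observations, and `fitted_cdf` evaluates the
    estimated mixture cdf of the fitted model at a point."""
    emp = np.array([np.mean(y <= s) for s in support])
    est = np.array([fitted_cdf(s) for s in support])
    return emp, est
```

Scattering est against emp together with the diagonal, e.g. plt.scatter(emp, est); plt.plot([0, 1], [0, 1]), gives the proposed diagnostic; under a correctly specified model the points settle onto the diagonal as n grows.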
We may obtain consistent parameter estimates by using either the method of maximum likelihood (when K⁰ is known) or the penalized minimum-distance method described in Chapter 3 (when K⁰ is known or unknown). We discuss Requirement 3 in more detail in Section 4.1, and present two alternative sets of sufficient conditions for this requirement. We use these to show that our proposed graphical method is valid for stationary HMMs. Thus, we will be able to graphically compare different (including non-nested) HMMs for the observed data by examining how close each estimated distribution is to the empirical distribution.

More generally, we would like to know that the empirical distribution is converging to the true distribution regardless of whether the true distribution is a HMM. Since the examples in this thesis have focused on count data, in Section 4.2 we discuss other models for stationary series of count data. It turns out that these models meet at least one of our conditions. Thus, if the true underlying model is a member of the broad class that we consider, our method will allow us to determine whether the HMM in question is a reasonable model for our data. As an additional advantage, if consistent estimates of these alternative distributions are available, we will also be able to use our method to compare the fit of the HMM with that of the other models by overlaying the appropriate plots.

We apply our method to our two MS/MRI data sets in Section 4.3. These examples illustrate the type of deviations that we might see when a HMM does not represent the data well, or when our choice of the conditional model for the observed data is not appropriate.

4.1 Convergence Conditions

The conditions for Requirement 3 that we develop are based on the concept of α-mixing sequences of random variables (see Definition 3.1). The idea is that, for the empirical distribution to converge to the true distribution, α_ℓ must converge to 0 quickly enough.
The two theorems that we present give sufficient rates of convergence. The first is due to Ould-Saïd (1994), and is applicable to plots of the multidimensional distributions. We cited this result in the proof of Theorem 3.1, but we repeat it here in the particular form that we require.

Theorem 4.1 For a stationary, m-dimensional, α-mixing process with mixing coefficients α_ℓ = O(ℓ^{-v}) for some v > 2m + 1,

lim sup_y |F_n^m(y) - F_0^m(y)| = 0    (4.1)

almost surely.

In the proof of Theorem 3.1, we showed that the mixing coefficients of stationary HMMs satisfy the condition in Theorem 4.1. Thus, our graphical method (in any dimension) is valid for these models.

If we consider the pointwise, rather than uniform, convergence of the empirical distribution to the true distribution, Requirement 3 amounts to a law of large numbers (LLN) for dependent variables. Lin & Lu (1996) provide general information in this context for processes satisfying various mixing conditions (e.g., α-mixing, ρ-mixing, φ-mixing, and others). Theorem 4.2 below is an example of one such LLN. We focus on this particular result because of its relative simplicity in our context. In particular, for some models the conditions of Theorem 4.2 may be easier to verify than those of Lin & Lu (1996) (or of Theorem 4.1) because the calculation of the mixing coefficients is not required.

Theorem 4.2 Assuming that {Y_t} is stationary, let

p_ℓ(y) = |P(Y_t ≤ y, Y_{t+ℓ} ≤ y) - P(Y_t ≤ y) P(Y_{t+ℓ} ≤ y)|.

Then, for each y, F_n(y) converges in probability to F_0(y) if

Σ_{ℓ=1}^{n-1} (n - ℓ) p_ℓ(y) = o(n²).    (4.2)

Proof. The proof follows from Chebyshev's inequality. Let N(y) = Σ_{t=1}^n I(Y_t ≤ y). Then, for a given value of ε,
P(|F_n(y) - F_0(y)| > ε) ≤ ε^{-2} E[{F_n(y) - F_0(y)}²] = ε^{-2} Var{F_n(y)},

since, by stationarity, E{N(y)} = n F_0(y). Now

Var{F_n(y)} = n^{-2} [ n F_0(y){1 - F_0(y)} + 2 Σ_{ℓ=1}^{n-1} (n - ℓ) Cov{I(Y_t ≤ y), I(Y_{t+ℓ} ≤ y)} ]
            ≤ F_0(y)/n + (2/n²) Σ_{ℓ=1}^{n-1} (n - ℓ) p_ℓ(y),

since |Cov{I(Y_t ≤ y), I(Y_{t+ℓ} ≤ y)}| = p_ℓ(y). The first term is O(n^{-1}), and the second is o(1) by (4.2), so the probability above tends to 0. □

Although Theorem 4.2 is stated in terms of the univariate distributions, it can easily be extended to the multidimensional case. For example, to show the pointwise convergence of the empirical bivariate distribution, we would define

p_{t-s}(x, y) = |P(Y_s ≤ x, Y_{s+1} ≤ y, Y_t ≤ x, Y_{t+1} ≤ y) - P(Y_s ≤ x, Y_{s+1} ≤ y) P(Y_t ≤ x, Y_{t+1} ≤ y)|

and replace the condition (4.2) by the requirement that

Σ_{ℓ=1}^{n-2} (n - ℓ - 1) p_ℓ(x, y) = o(n²)

for all x and y.

4.2 Other Models for Count Data

Work to date on models for stationary time series of count data is nicely summarized in MacDonald and Zucchini (1997). In addition to HMMs, other possible models are:

1. Markov models, including
   • Integer-valued autoregressive (INAR) models (e.g., Alzaid & Al-Osh 1993; McKenzie 1988)
   • "Observation-driven processes" of the form Y_t ~ Poisson(μ_t), where μ_t is modelled as log μ_t = g(Y_{t-p}, ..., Y_{t-1}), and g is some function (e.g., Albert et al. 1994)

2. m-dependent time series, including
   • Models of the form Y_t = g(γ_t), where g is some function, {γ_t} is an MA(m) process, and Y_t | γ_t is independent of Y_1, ..., Y_{t-1}, Y_{t+1}, ..., Y_n
   • INMA(m) (integer-valued moving average) models (e.g., Alzaid & Al-Osh 1993; McKenzie 1988)

3. "Parameter-driven processes" of the form Y_t ~ Poisson(μ_t), where log μ_t = g(x) + ε_t, and ε_t is a stationary, autocorrelated error process (e.g., Chen & Ibrahim 2000)

We now show that all of these models, in fact, satisfy the convergence criterion given in Theorem 4.1, with the exception of that described by Albert et al. (1994), since this model is not stationary and is thus beyond the scope of this chapter.
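For a stationary Markov chain observed directly, p_ℓ(y) of Theorem 4.2 can be computed exactly from the ℓ-step transition matrix, so condition (4.2) can be checked numerically. The chain below is illustrative only, and identifying the event {Y_t ≤ y} with the chain being in state 0 is an assumption made for the sketch.

```python
import numpy as np

def p_ell(P, pi, ell):
    """p_ell(y) = |P(Y_t <= y, Y_{t+ell} <= y) - P(Y_t <= y)^2| for a
    stationary chain, with {Y <= y} taken to be the event {state = 0}."""
    P_ell = np.linalg.matrix_power(P, ell)
    return abs(pi[0] * P_ell[0, 0] - pi[0] ** 2)

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])              # illustrative two-state chain
pi = np.array([2.0 / 3.0, 1.0 / 3.0])   # its stationary distribution
for n in (100, 1000, 10000):
    lhs = sum((n - ell) * p_ell(P, pi, ell) for ell in range(1, n))
    print(n, lhs / n**2)                # the ratio shrinks toward 0
```

Here p_ℓ decays geometrically (the second eigenvalue of P is 0.7), so the sum on the left of (4.2) grows only linearly in n and the ratio to n² vanishes, as the printed values suggest.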
4.2.1 Markov Models

In the proof of Theorem 3.1, we cite the result that the mixing coefficients of a Markov chain satisfy α_ℓ ≤ cρ^ℓ, where c is a positive constant and 0 < ρ < 1 (see, e.g., Doukhan 1994). Thus, it is clear that stationary Markov chains satisfy the condition in Theorem 4.1, and hence our graphical method is valid for these processes. The INAR(p) process is a (p-order) example of such a process. Observation-driven processes, such as Poisson regression models with lagged dependent variables, are also p-order Markov processes, since in this case, the distribution of Y_t | Y_{t-p}, ..., Y_{t-1} is independent of Y_1, ..., Y_{t-p-1}.

4.2.2 m-Dependent Time Series

Since, for an m-dependent time series, α_ℓ = 0 for ℓ > m, time series of this type clearly satisfy the condition of Theorem 4.1. Included in this class are MA(m) and INMA(m) processes.

4.2.3 Parameter-Driven Processes

The next model we consider assumes that Y_t ~ Poisson(μ_t), with

log μ_t = g(x) + ε_t,

where g is some function of the covariates, x. These covariates are assumed to be constant over time so as not to violate the requirement that {Y_t} be stationary. The error terms, {ε_t}, are modelled as an ARMA(p, q) process, and Y_t | ε_t is assumed to be independent of Y_1, ..., Y_{t-1}, Y_{t+1}, ..., Y_n.

Blais et al. (2000) show that if {ε_t} is an α-mixing sequence with mixing coefficients α_ℓ, and {Y_t} is a process such that Y_t | ε_t is independent of Y_1, ..., Y_{t-1}, Y_{t+1}, ..., Y_n, then the process {Y_t} is also α-mixing, with mixing coefficients 4α_ℓ. The proof is similar to that of Theorem 3.1. From Liebscher (1996), we have that an ARMA(p, q) process is α-mixing with exponential rate, i.e., the mixing coefficients satisfy α_ℓ ≤ cρ^ℓ for some ρ, 0 < ρ < 1, and some c, 0 < c < ∞. It is now obvious that the condition of Theorem 4.1 holds for models of this type.
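A parameter-driven series of this type is straightforward to simulate; the sketch below uses a stationary AR(1) error process standing in for the general ARMA(p, q) errors, with all names and parameter values hypothetical.

```python
import numpy as np

def simulate_parameter_driven(n, g0, phi, sigma, rng=None):
    """Parameter-driven Poisson series: Y_t ~ Poisson(mu_t) with
    log mu_t = g0 + eps_t, where {eps_t} is a stationary AR(1) process
    eps_t = phi * eps_{t-1} + u_t, u_t ~ N(0, sigma^2), |phi| < 1."""
    rng = np.random.default_rng(rng)
    eps = np.empty(n)
    # draw eps_0 from the AR(1) stationary law N(0, sigma^2 / (1 - phi^2))
    eps[0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2))
    for t in range(1, n):
        eps[t] = phi * eps[t - 1] + rng.normal(0.0, sigma)
    return rng.poisson(np.exp(g0 + eps))
```

Because the covariate part g0 is constant and the errors are started from their stationary law, the resulting count series is stationary, matching the setting of this subsection.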
4.3 Application to MS/MRI Data

In this section, assuming the same model for each patient, we compare the fit of five different stationary HMMs to both Albert's data and to data on the placebo patients from the Vancouver cohort of the PRISMS study (PRISMS Study Group 1998). Based on the results from Section 3.4, we assume that each HMM has two hidden states and use the method of maximum likelihood to obtain estimates of the other parameters. We model the conditional distribution of Y_t given Z_t as one of the following five distributions:

1. Poisson: P(Y_t = y | Z_t = k) = e^{-λ_k} λ_k^y / y!, λ_k > 0

2. Negative binomial: P(Y_t = y | Z_t = k) = C(y + p_k - 1, y) α_k^{p_k} (1 - α_k)^y, 0 < α_k < 1, p_k > 0

3. Logarithmic: P(Y_t = y | Z_t = k) = -θ_k^y / {y log(1 - θ_k)}, 0 < θ_k < 1

4. Generalized Poisson: P(Y_t = y | Z_t = k) = λ_k (λ_k + yθ_k)^{y-1} e^{-λ_k - yθ_k} / y!, λ_k > 0, θ_k > 0

5. Zero-extended Poisson: P(Y_t = y | Z_t = k) = w_k δ_{y=0} + (1 - w_k) e^{-λ_k} λ_k^y / y!, 0 < w_k < 1, λ_k > 0

4.3.1 Albert's Data

Figures 4.1 and 4.2 show the fit of these models to Albert's data. In Figure 4.1, we plot the estimated univariate distribution of Y_t under each model versus the empirical distribution of Y_t over the range 0, ..., 20. Figure 4.2 is the corresponding plot of the bivariate distributions over the range (0,0), (0,1), ..., (20,20). Note that the HMM involving the zero-extended Poisson distribution was not fit to Albert's data. Since there were, in fact, no zeroes in this data set, the parameter w could not be estimated for any of the components of the HMM.

Figure 4.1 shows that these models seem to capture the univariate behaviour of Albert's data quite well, with the exception of the logarithmic model. Since the Poisson model seems to be reasonable, it is not surprising that the negative binomial and generalized Poisson models also provide good fits, since these are generalizations of the Poisson model.
In contrast, Figure 4.2 shows that none of the models is a good choice for representing the bivariate behaviour of the data. In particular, the estimated probabilities tend to be lower than the empirical probabilities throughout almost the entire range. Thus, it would appear that a 2-state HMM cannot fully capture the correlation structure of the data, and hence is not an adequate model in this case.

4.3.2 Vancouver PRISMS Data

The Vancouver data have the same format as Albert's, but consist of 13 rather than 3 patients. Each patient has between 2 and 26 observations. The lesion counts are most frequently zero, but range up to 15. Figures 4.3 and 4.4 give the plots of the univariate and bivariate distributions for the Vancouver data, with the functions plotted over the ranges 0, ..., 15 and (0,0), (0,1), ..., (15,15), respectively. These figures show that the Poisson model seems to provide quite a good fit to these data, both for the univariate and bivariate distributions, although it may somewhat underestimate the probability of seeing pairs of small lesion counts. Again, the logarithmic model seems inappropriate for these data.

Figure 4.1: Comparison of the Estimated and Empirical Univariate Distributions (Albert's Data)
Figure 4.2: Comparison of the Estimated and Empirical Bivariate Distributions (Albert's Data)
Figure 4.3: Comparison of the Estimated and Empirical Univariate Distributions (Vancouver Data)
Figure 4.4: Comparison of the Estimated and Empirical Bivariate Distributions (Vancouver Data)

4.4 Formal Assessment of the GOF Plots

The examples in Section 4.3 show that our GOF method is useful both for comparing different models and for detecting when a proposed HMM is not appropriate for the data. We have also proved in Section 4.1 that, if we have correctly specified the model, then the plot will converge to a 45° line through the origin as n → ∞.
It is also of interest to develop a formal method of assessing the degree of variability in the observed plot. In other words, it would be desirable to have a theoretical means of determining whether the observed scatter around the 45° line is "acceptable" for a given sample size, $n$. One way in which other authors have assessed this variability is by computing the correlation coefficient of the two plotted variables, and then deriving the distribution of a test statistic based on this coefficient under the null hypothesis that the model fits. This derivation is simplified considerably if one of the variables is fixed rather than random. Lockhart & Stephens (1998) provide a good example in this setting. They investigate the use of probability plots, where the $n$ observations are ordered and plotted against the values $F^{-1}\{k/(n+1)\}$, $k = 1, \ldots, n$. Here $F$ is an arbitrary distribution in the proposed family of distributions. Under the assumptions that $F$ is in the location-scale family (usually with the values of the location and scale parameters chosen as 0 and 1, respectively) and that the observations are iid, the asymptotic distribution of their test statistic has a nice form. However, typically, the exact distribution of this statistic is not easily derived. Furthermore, in our setting, where we plot two different estimates of the cdf and our observations are not independent, computing the distribution of the associated correlation coefficient, even asymptotically, seems like a very challenging problem. Alternatively, in the context where the proposed distribution is completely specified under the null hypothesis, Raubertas (1992) considers envelope plots as a formal GOF test. He suggests simulating $s$ independent samples of size $n$ from this distribution, and preparing a plot of the estimated distribution against the true distribution for each. These plots are then superimposed and summarized by displaying only their upper and lower envelopes.
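The envelope construction is easy to sketch in the simplified setting of iid draws from a completely specified null distribution. The following Python sketch uses a Poisson null rather than an HMM, and all sample sizes, parameter values, and the seed are illustrative only.

```python
import math
import random

def empirical_cdf(sample, grid):
    # Proportion of the sample at or below each grid point
    n = len(sample)
    return [sum(x <= g for x in sample) / n for g in grid]

def poisson_draw(lam, rng):
    # Simple Poisson draw via Knuth's multiplicative method
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def envelope(lam, n, s, grid, rng):
    # Simulate s samples of size n from the null distribution and keep,
    # at each grid point, the lowest and highest empirical cdf values.
    curves = [empirical_cdf([poisson_draw(lam, rng) for _ in range(n)], grid)
              for _ in range(s)]
    lower = [min(c[j] for c in curves) for j in range(len(grid))]
    upper = [max(c[j] for c in curves) for j in range(len(grid))]
    return lower, upper

rng = random.Random(1)
grid = list(range(11))
lower, upper = envelope(3.0, 50, 99, grid, rng)
# Observed points falling outside [lower, upper] would indicate lack of fit.
```

In our setting the null samples would instead have to be simulated from the fitted HMM, which is what introduces the additional variability and computational burden discussed below.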
Points corresponding to the observed (as opposed to simulated) data that fall outside this envelope indicate lack of fit in the model. The advantage of this method is that the true distribution need not be limited in its complexity. For example, we could easily simulate observations from an HMM. However, Raubertas (1992) points out that the power of this test may be undesirably low, and does not recommend it when other options are available. An envelope plot in our case would have even more inherent variability, since we would need to sample from the estimated, rather than true, distribution. For the same reason, the computational burden would also be quite heavy. Given these concerns, we have not attempted to conduct this test on our data sets. In conclusion, a feasible, theoretical method of assessing the variability in our GOF plots is not yet available. Further research on this topic is required.

Chapter 5
Hidden Markov Models for Multiple Processes

In Chapters 3 and 4, we discussed the problems of estimating the number of hidden states and assessing the goodness-of-fit of a single HMM. We now move on to extensions of the basic model, in particular, the use of the HMM framework for modelling multiple processes. Few such HMMs currently exist. Most have been developed in the context of specific applications, and hence have not been posed in their full generality. Not surprisingly, little is known about the theory surrounding these models. Hughes & Guttorp (1994) present one example of such a model: a multivariate HMM for data consisting of daily binary rainfall observations (rain or no rain) at four different stations. These time series are assumed to be functions of the same underlying process. Given the hidden state at time $t$, the observed measurements at this time are considered to be independent of all measurements taken at other time points. They may, however, depend on one another. Turner et al.
(1998) and Wang & Puterman (1999), working in the setting of Poisson count data, develop models for independent processes, each of which depends on a different underlying process. To account for between-subject differences, these authors incorporate covariates into the conditional model for the observed process. MacDonald & Zucchini (1997, Chapter 3) also provide a brief discussion of this idea. Between-subject differences can also occur in the transition probabilities, and Hughes & Guttorp (1994) address this issue in the context of precipitation modelling by including covariates in the model for the hidden process. The addition of random effects is a natural extension of these models. In the subject-area literature, HMMs with random effects have appeared in a limited way. For instance, Humphreys (1997, 1998) suggests such a model for representing labour force transition data. He works with binary observations on employment status (employed or unemployed) which are assumed to contain reporting errors. The misclassification probabilities, as well as the transition probabilities, depend on a subject-specific random effect. Seltman (2002) proposes a complex biological model for describing cortisol levels over time in a group of patients. The initial concentration of cortisol in each patient is modelled as a random effect. The goal of this chapter is to develop a new class of models, HMMs with random effects, which will unify existing HMMs for multiple processes and will provide a general framework for working in this context. We will build on the ideas of generalized linear mixed models (GLMMs), and will address theoretical questions regarding interpretation, estimation, and hypothesis testing. For concreteness, we will present our ideas in the setting where the observed processes are repeated measurements on a group of patients. This choice will not, however, limit the generality of our models. The advantages of HMMs with random effects are numerous.
Most importantly, modelling multiple processes simultaneously permits the estimation of population-level effects. For example, to study the impact of a new MS drug, the individual models proposed by Albert et al. (1994) would not be sufficient. Estimating the treatment effect would require the assumption of some commonality among the responses of patients in the same treatment group while allowing for differences between patients in different treatment groups. A second advantage of these models is that they allow more efficient estimation of patient-level effects while recognizing inter-patient differences. This feature is particularly important in view of our examples and discussion in Section 2.4. Finally, HMMs with random effects permit greater flexibility in modelling the correlation structure of the data. Standard HMMs assume that the serial dependence induced by the hidden process completely characterizes the correlation structure of the observed data. While this may be reasonable for a single process, with multiple processes the correlation structure could be more complex. In our presentation of these models, we first consider the case where the random effects appear only in the conditional model for the observed data. We then explore the more difficult case where the random effects also appear in the model for the transition probabilities. We will show that interpretation and estimation of HMMs with random effects in the conditional model for the observed data is quite straightforward, but is more challenging when there are random effects in the model for the hidden process. We defer the study of the properties of these estimators and hypothesis testing to Chapter 6.

5.1 Notation and Assumptions

We will denote the observation on patient $i$, $i = 1, \ldots, N$, at time $t$, $t = 1, \ldots, n_i$, by $Y_{it}$, and the corresponding hidden state by $Z_{it}$. As in Section 2.1, we will assume that the time points are equally spaced.
Furthermore, we will assume that $Z_{it}$ takes on values from a finite set, $\{1, 2, \ldots, K\}$, where $K$ is known. Let $\sum_{i=1}^N n_i = n_T$. Then, we denote by $Y_i$ the $n_i$-dimensional vector of observations on patient $i$, and by $Y$ the $n_T$-dimensional vector of all observations. Similarly, let $Z_i$ be the $n_i$-dimensional vector of hidden states for patient $i$, and let $Z$ be the $n_T$-dimensional vector of all hidden states. We assume that, conditional on the random effects, $\{Z_{it}\}_{t=1}^{n_i}$ is a Markov chain. These processes may or may not be stationary. If $\{Z_{it}\}$ are conditionally stationary with unique stationary distributions, we will use these distributions for the initial probabilities. In other words, we will compute the initial distributions based on the transition probabilities, so that these probabilities may vary among patients. Otherwise, we will assume that the initial probabilities are fixed parameters and are the same for all patients. We generally have very little information with which to estimate the initial probabilities, so allowing for further complexity does not seem necessary. Finally, we assume that, conditional on the random effects, the $i$th process, $\{Y_{it}\}_{t=1}^{n_i}$, is an HMM, and that these HMMs are independent of one another.

5.2 Model I: HMMs with Random Effects in the Conditional Model for the Observed Process

In this section, we focus on the addition of random effects to the conditional model for the observed data, and assume that random effects do not appear in the model for the hidden processes. In particular, we assume that the hidden processes are homogeneous with transition probabilities, $\{P_{k\ell}\}$, and initial probabilities, $\{\pi_k\}$, common to all subjects. Borrowing from the theory of GLMMs (see, e.g., McCulloch & Searle 2001) and again using $\psi$ to represent the vector of all model parameters, we assume that, conditional on the random effects, $u$, and the hidden states, $Z$, $\{Y_{it}\}$ are independent with distribution in the exponential family, i.e.
$$f(y_{it} \mid Z_{it} = k, u, \psi) = \exp\{(y_{it} \eta_{itk} - c(\eta_{itk}))/a(\phi) + d(y_{it}, \phi)\}. \quad (5.1)$$

Here

$$\eta_{itk} = \tau_k + x_{it}' \beta_k + w_{itk}' u, \quad (5.2)$$

where $\tau_k$ is the fixed effect when $Z_{it} = k$, $x_{it}'$ are the covariates for patient $i$ at time $t$, and $w_{itk}'$ is the row of the model matrix for the random effects for patient $i$ at time $t$ in state $k$. We will denote the distribution of the random effects by $f(u; \psi)$, and will assume that the random effects are independent of the hidden states. The notation $u$ (as opposed to $u_i$) indicates that a random effect could be common to more than one patient. The form of the exponential family (5.1) assumes that the link function is canonical, an assumption that is not strictly necessary, but one that will facilitate our exposition. We will henceforth refer to this model as Model I. The notation in (5.2) was selected for its generality, but in most applications we would expect significant simplification. For example, consider the MS/MRI setting where we have two treatment groups, with group membership indicated by the explanatory variable $x_i$. We might postulate the existence of two hidden states, one associated with relapse (state 1) and the other associated with remission (state 2). We prefer this interpretation of the hidden states to that of Albert et al. (1994) since it implies that patients' mean lesion counts can remain stable (at either a high or low level of activity) while assuming only two hidden states (in contrast with the model in Section 2.4.5). If we believe that the mean lesion count in state $k$ varies randomly among patients, we might choose the model where $u_{ik}$ is independent of $u_{jk}$, $i \ne j$, and $(u_{i1}, u_{i2})$ has some bivariate distribution. This model allows the mean lesion count in state $k$ to vary not only with patient, but also with treatment group and with time. The model (5.2) could be made even more general by allowing the parameter $\phi$ to vary among patients.
We do not consider this case here, but provide a brief discussion in Chapter 7. The likelihood for Model I is

$$L(\psi) = \int_u \sum_z \left\{ \prod_{i=1}^N \pi_{z_{i1}} f(y_{i1} \mid z_{i1}, u, \psi) \prod_{t=2}^{n_i} P_{z_{i,t-1} z_{it}} f(y_{it} \mid z_{it}, u, \psi) \right\} f(u; \psi) \, du. \quad (5.3)$$

As in (2.2), we may simplify this expression by writing the summation as a product of matrices. For a given value of $u$, let $A_i^1$ be the vector with elements $A_{ik}^1 = \pi_k f(y_{i1} \mid Z_{i1} = k, u)$, and let $A_i^t$ be the matrix with elements $A_{ik\ell}^t = P_{k\ell} f(y_{it} \mid Z_{it} = \ell, u)$, $t > 1$. Again, let $\mathbf{1}$ be the $K$-dimensional vector of 1's. Then

$$L(\psi) = \int_u \prod_{i=1}^N \left\{ (A_i^1)' \left( \prod_{t=2}^{n_i} A_i^t \right) \mathbf{1} \right\} f(u; \psi) \, du. \quad (5.4)$$

From (5.4) it becomes clear that the only impact of the random effects in (5.2) on the computation of the likelihood is the introduction of the integration over the distribution of the random effects. In other words, it is only the complexity of this integral that makes evaluating (5.4) more difficult than the likelihood for standard HMMs. In most applications, (5.4) reduces to a very simple form. We consider two special cases.

EXAMPLE 1 (Single, patient-specific random effect). We assume that $u_i$ is a random effect associated with the $i$th patient, $i = 1, \ldots, N$, and that $\{u_i\}$ are iid. Under this model, observations on different patients are independent. Furthermore, our model specifies that, given $u_i$ and the sequence of hidden states, the observations collected on patient $i$ are independent. In this case, (5.3) and (5.4) simplify to a one-dimensional integral:

$$L(\psi) = \prod_{i=1}^N \int_{u_i} \sum_{z_i} \left\{ \pi_{z_{i1}} f(y_{i1} \mid z_{i1}, u_i, \psi) \prod_{t=2}^{n_i} P_{z_{i,t-1} z_{it}} f(y_{it} \mid z_{it}, u_i, \psi) \right\} f(u_i; \psi) \, du_i = \prod_{i=1}^N \int_{u_i} \left\{ (A_i^1)' \left( \prod_{t=2}^{n_i} A_i^t \right) \mathbf{1} \right\} f(u_i; \psi) \, du_i. \quad (5.5)$$

EXAMPLE 2 (Patient-specific and centre-specific random effects). We incorporate patient-specific random effects as in Example 1. However, we also allow multiple treatment centres, with a random effect associated with each (assumed independent of the patient-specific random effects).
Consequently, observations on patients from different centres are independent, and, given the centre-specific random effects, observations on different patients at the same centre are independent. We use the same notation as before, but with an additional index, $c$, $c = 1, 2, \ldots, C$, to indicate treatment centre. We assume that treatment centre $c$ has $m_c$ patients. Then (5.3) and (5.4) reduce to a two-dimensional integral:

$$L(\psi) = \prod_{c=1}^C \int_{v_c} \prod_{i=1}^{m_c} \int_{u_{ci}} \sum_{z_{ci}} \left\{ \pi_{z_{ci1}} f(y_{ci1} \mid z_{ci1}, u_{ci}, v_c, \psi) \prod_{t=2}^{n_{ci}} P_{z_{ci,t-1} z_{cit}} f(y_{cit} \mid z_{cit}, u_{ci}, v_c, \psi) \right\} f(u_{ci}; \psi) \, du_{ci} \, f(v_c; \psi) \, dv_c = \prod_{c=1}^C \int_{v_c} \prod_{i=1}^{m_c} \int_{u_{ci}} \left\{ (A_{ci}^1)' \left( \prod_{t=2}^{n_{ci}} A_{ci}^t \right) \mathbf{1} \right\} f(u_{ci}; \psi) \, du_{ci} \, f(v_c; \psi) \, dv_c. \quad (5.6)$$

In the usual case where the integral in (5.4) is of low dimension, numerical methods of evaluation seem to work well. For common choices of distribution for $u$, Gaussian quadrature methods offer both accuracy and efficiency. For example, the Gauss-Laguerre formula gives good approximations to integrals of the form $\int f(x) \, x^{\alpha} e^{-x} \, dx$. The case where there is one patient-specific random effect, $u_i$, with $f(u_i)$ in the gamma family of distributions, fits into this framework. In the special cases considered by Humphreys (1997, 1998), (5.4) can actually be evaluated analytically. The EM algorithm may seem like a natural choice for parameter estimation in this setting if we think of the random effects as "missing" along with the values of $\{Z_{it}\}$. However, again, we do not recommend this algorithm because of efficiency considerations (see Section 2.2). Nonetheless, for completeness, we include the details of the procedure in Appendix A.2 for the usual case where the random effects are patient-specific. We show that, when random effects are included in this way, the increase in difficulty in the implementation of this algorithm (compared to the simple case described in Appendix A.1) essentially depends only on the complexity of the integral.
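To make the one-dimensional integral of Example 1 concrete, the sketch below evaluates the forward recursion $(A_i^1)'(\prod_t A_i^t)\mathbf{1}$ for one patient conditional on $u_i$, then integrates over a normal random effect with a simple midpoint rule. The two-state Poisson model and all parameter values are invented for illustration; Gaussian quadrature would be the more efficient choice in practice.

```python
import math

def poisson_pmf(y, lam):
    return math.exp(-lam) * lam**y / math.factorial(y)

def hmm_likelihood_given_u(y, pi, P, tau, u):
    # Forward pass: (A^1)' (prod_t A^t) 1 for one patient, conditional on u,
    # with conditional means log mu_t = tau[k] + u (a log-linear Poisson model).
    K = len(pi)
    alpha = [pi[k] * poisson_pmf(y[0], math.exp(tau[k] + u)) for k in range(K)]
    for t in range(1, len(y)):
        alpha = [sum(alpha[k] * P[k][l] for k in range(K))
                 * poisson_pmf(y[t], math.exp(tau[l] + u)) for l in range(K)]
    return sum(alpha)

def patient_likelihood(y, pi, P, tau, sigma, m=200, half_width=6.0):
    # Approximate the one-dimensional integral with a midpoint rule over
    # [-6*sigma, 6*sigma], where u ~ N(0, sigma^2).
    h = 2 * half_width * sigma / m
    total = 0.0
    for j in range(m):
        u = -half_width * sigma + (j + 0.5) * h
        dens = math.exp(-u * u / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
        total += hmm_likelihood_given_u(y, pi, P, tau, u) * dens * h
    return total

# Hypothetical two-state model and a short count sequence
pi = [0.5, 0.5]
P = [[0.9, 0.1], [0.2, 0.8]]
tau = [math.log(1.0), math.log(5.0)]
L = patient_likelihood([0, 1, 6, 4, 0], pi, P, tau, sigma=0.5)
```

Multiplying such per-patient contributions over $i$ gives the full likelihood, which can then be maximized numerically over the model parameters.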
Monte Carlo (MC) methods allow the circumvention of the evaluation of the integral, and hence may also prove useful for either maximizing the likelihood directly or implementing the EM algorithm. For example, the MC Newton-Raphson and MCEM algorithms presented by McCulloch (1997) in the GLMM context may be applicable to our models as well.

5.3 Moments Associated with Model I

It is easy to create complex models by postulating the existence of latent variables. However, caution must be exercised in order to avoid unwanted implications of our modelling choices. In particular, it is important to be able to interpret both the fixed and random effects. One way of understanding their impact on the model is to examine the resulting marginal moments of the observed process. We use the property of exponential families that $E[Y_{it} \mid Z_{it} = k, u] = c'(\eta_{itk})$ and $\mathrm{Var}[Y_{it} \mid Z_{it} = k, u] = c''(\eta_{itk}) a(\phi)$ (see, e.g., McCullagh & Nelder 1989). In addition, we use our assumption that $\mathrm{Cov}[Y_{it}, Y_{js} \mid Z_{it} = k, Z_{js} = \ell, u] = 0$. Then, under Model I,

$$E[Y_{it}] = E[c'(\eta_{itk})] \quad (5.7)$$

$$\mathrm{Var}[Y_{it}] = E[c''(\eta_{itk}) a(\phi)] + \mathrm{Var}[c'(\eta_{itk})] \quad (5.8)$$

$$\mathrm{Cov}[Y_{it}, Y_{js}] = \mathrm{Cov}[c'(\eta_{itk}), c'(\eta_{js\ell})], \quad s < t. \quad (5.9)$$

Here the expectations are over both $Z$ and $u$. In general, the moments (5.7)-(5.9) do not have closed forms. Even if $\{Z_{it}\}_{t=1}^{n_i}$ is stationary, we cannot always evaluate these expressions analytically. However, the Laplace method (e.g. Evans & Swartz 1995) can provide good approximations when the variance components of the random effects are reasonably small. And, for certain common distributions of $Y_{it} \mid Z_{it}, u$ (e.g., Poisson, normal) and of the random effects (e.g. multivariate normal), closed forms do exist. For instance, consider the case where $u = (u_1, \ldots, u_N)$ is a vector of iid patient-specific random effects, as in Example 1. Let $Y_{it} \mid Z_{it}, u_i \sim \mathrm{Poisson}(\mu_{it})$ with

$$\log \mu_{it} = \tau_{Z_{it}} + u_i, \quad (5.10)$$

where $u_i \sim N(0, \sigma^2)$, and $\{Z_{it}\}_{t=1}^{n_i}$ is stationary with stationary distribution $\{\pi_k\}$.
Since $a(\phi) = 1$ for the Poisson distribution, and since $\mathrm{Var}[c'(\eta_{itk})] > 0$, we can see by comparing (5.7) and (5.8) that this model allows for overdispersion in the Poisson counts. In this case, we can evaluate (5.7)-(5.9) as

$$E[Y_{it}] = e^{\sigma^2/2} \sum_k \pi_k e^{\tau_k}$$

$$\mathrm{Var}[Y_{it}] = e^{\sigma^2/2} \sum_k \pi_k e^{\tau_k} + e^{2\sigma^2} \sum_k \pi_k e^{2\tau_k} - e^{\sigma^2} \Big( \sum_k \pi_k e^{\tau_k} \Big)^2$$

$$\mathrm{Cov}[Y_{it}, Y_{js}] = \begin{cases} 0, & i \ne j \\ e^{2\sigma^2} \sum_{k,\ell} e^{\tau_k + \tau_\ell} P_{k\ell}(t-s) \pi_k - e^{\sigma^2} \big( \sum_k \pi_k e^{\tau_k} \big)^2, & i = j, \ s < t, \end{cases}$$

where $P_{k\ell}(t-s)$ denotes the $(t-s)$-step transition probability, $P(Z_{it} = \ell \mid Z_{is} = k, \psi)$. One feature of the moments (5.7)-(5.9) is that they depend on the covariates, $x_{it}$, in a very specific way. These relationships should be carefully considered in terms of their scientific justifiability before applying the proposed model. Thinking of $s$ as fixed, it is interesting to look at the covariance (5.9) as $t \to \infty$ when $i = j$. Consider the case where $\{Z_{it}\}$ is stationary and $x_{it} = x_i$ and $w_{itk} = w_i$ are independent of $t$ and $k$. We have that

$$\mathrm{Cov}[Y_{it}, Y_{is}] = E\Big[ \sum_{k,\ell} c'(\tau_k + x_i'\beta_k + w_i'u) \, c'(\tau_\ell + x_i'\beta_\ell + w_i'u) \, P_{k\ell}(t-s) \, \pi_k \Big] - \Big\{ E\Big[ \sum_k c'(\tau_k + x_i'\beta_k + w_i'u) \, \pi_k \Big] \Big\}^2.$$

Now, for each $k$ and $\ell$, $P_{k\ell}(t-s) \to \pi_\ell$ as $t \to \infty$. Setting

$$X_t = \sum_{k,\ell} c'(\tau_k + x_i'\beta_k + w_i'u) \, c'(\tau_\ell + x_i'\beta_\ell + w_i'u) \, P_{k\ell}(t-s) \, \pi_k$$

and

$$X = \sum_k c'(\tau_k + x_i'\beta_k + w_i'u) \, \pi_k,$$

we have $X_t \to X^2$ as $t \to \infty$. Assuming that the dominated convergence theorem (DCT) holds, we see that

$$\mathrm{Cov}[Y_{it}, Y_{is}] \to E[X^2] - \{E[X]\}^2 = \mathrm{Var}[X] \ge 0,$$

with equality occurring if and only if the distribution of $w_i'u$ is degenerate. Since

$$|X_t| \le \sum_{k,\ell} |c'(\tau_k + x_i'\beta_k + w_i'u) \, c'(\tau_\ell + x_i'\beta_\ell + w_i'u)|, \quad (5.11)$$

the DCT will hold if $c(\cdot)$ and $f(u; \psi)$ are chosen such that the right-hand side of (5.11) is integrable. This is true, for example, in the case where $f(y_{it} \mid z_{it}, u, \psi)$ is in the Poisson or normal family and $E[u] < \infty$. Thus, referring again to our example in the Poisson context (5.10), we have that

$$\mathrm{Cov}[Y_{it}, Y_{is}] \to \big( e^{2\sigma^2} - e^{\sigma^2} \big) \Big( \sum_k \pi_k e^{\tau_k} \Big)^2 \ge 0,$$

with equality occurring if and only if $\sigma = 0$.
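The closed forms in the Poisson example are easy to evaluate numerically. The sketch below (with invented parameter values, and with the initial distribution taken to be the stationary distribution of the invented transition matrix) computes the marginal mean, variance, and within-patient lag covariance, and can be checked against the overdispersion property and the limiting covariance $(e^{2\sigma^2} - e^{\sigma^2})(\sum_k \pi_k e^{\tau_k})^2$.

```python
import math

def marginal_mean_var(pi, tau, sigma2):
    # E[Y]   = e^{sigma2/2} sum_k pi_k e^{tau_k}
    # Var[Y] = E[Y] + e^{2 sigma2} sum_k pi_k e^{2 tau_k}
    #                - e^{sigma2} (sum_k pi_k e^{tau_k})^2
    s1 = sum(p * math.exp(t) for p, t in zip(pi, tau))
    s2 = sum(p * math.exp(2 * t) for p, t in zip(pi, tau))
    mean = math.exp(sigma2 / 2) * s1
    var = mean + math.exp(2 * sigma2) * s2 - math.exp(sigma2) * s1 ** 2
    return mean, var

def matpow(P, n):
    # n-step transition probabilities by repeated multiplication
    K = len(P)
    R = [[float(i == j) for j in range(K)] for i in range(K)]
    for _ in range(n):
        R = [[sum(R[i][k] * P[k][j] for k in range(K)) for j in range(K)]
             for i in range(K)]
    return R

def within_patient_cov(pi, tau, sigma2, P, lag):
    # Cov[Y_it, Y_is] = e^{2 sigma2} sum_{k,l} e^{tau_k + tau_l} P_kl(lag) pi_k
    #                   - e^{sigma2} (sum_k pi_k e^{tau_k})^2,  lag = t - s
    Pl = matpow(P, lag)
    K = len(pi)
    s1 = sum(p * math.exp(t) for p, t in zip(pi, tau))
    cross = sum(pi[k] * math.exp(tau[k] + tau[l]) * Pl[k][l]
                for k in range(K) for l in range(K))
    return math.exp(2 * sigma2) * cross - math.exp(sigma2) * s1 ** 2

P = [[0.9, 0.1], [0.2, 0.8]]
pi = [2.0 / 3.0, 1.0 / 3.0]          # stationary distribution of P
tau, sigma2 = [0.0, 1.0], 0.25
mean, var = marginal_mean_var(pi, tau, sigma2)
limit = (math.exp(2 * sigma2) - math.exp(sigma2)) * \
        sum(p * math.exp(t) for p, t in zip(pi, tau)) ** 2
```

As the lag grows, the within-patient covariance decays from its lag-one value toward the strictly positive limit, rather than toward zero.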
In contrast, when we remove the random effects from this model (i.e., we assume the same model for each patient), $\mathrm{Cov}[Y_{it}, Y_{is}] \to 0$ as $t \to \infty$. Thus, the random effects impact the correlation structure of the observed data in the expected way. In particular, they allow a long-range, positive dependence in each patient's observations, as we observe in the linear mixed model setting. This is an additional layer of dependence, superimposed on the usual correlation structure of an HMM. Considering the heterogeneity of MS patients, this model may be more realistic for that context than models where the same HMM is applied to all patients.

5.4 Model II: HMMs with Random Effects in the Model for the Hidden Process

It may also be desirable to allow the hidden Markov chain to vary randomly among patients. For example, in the MS/MRI context, patients may spend differing proportions of time in relapse and remission. However, allowing random effects in the hidden process of the HMM is a challenging problem, regardless of whether random effects also appear in the conditional model for the observed data. To explore this class of models, we will again specify the conditional model for the observed data by (5.1) and (5.2), but we will now allow the model for the hidden process to vary among patients. In particular, we will assume that $\{Z_{it} \mid u\}_{t=1}^{n_i}$ is a Markov chain and that $Z_{it} \mid u$ is independent of $Z_{js} \mid u$ for $i \ne j$. Naturally, any model for these Markov chains must satisfy the constraints that the transition probabilities lie between 0 and 1, and that the rows of the transition probability matrices sum to 1. Thus, we propose modelling the transition probabilities as

$$P(Z_{it} = \ell \mid Z_{i,t-1} = k, u, \psi) = \frac{\exp\{\tau_{k\ell}^* + x_{itk\ell}^{*\prime} \beta_{k\ell}^* + w_{itk\ell}^{*\prime} u\}}{\sum_{h=1}^K \exp\{\tau_{kh}^* + x_{itkh}^{*\prime} \beta_{kh}^* + w_{itkh}^{*\prime} u\}}. \quad (5.12)$$

We use asterisks here to distinguish the model matrices and parameters from those in (5.2).
The vector $u$ now contains the random effects associated with the hidden process as well as those associated with the conditional model for the observed data. To prevent overparameterization, we define $\tau_{kK}^* = \beta_{kK}^* = 0$ for all $k$, and set $w_{itkK}^{*\prime}$ to be a row of 0's for all $i$, $t$, $k$. We will refer to this model as Model II. The likelihood associated with Model II is very similar to (5.4). Again, we may write

$$L(\psi) = \int_u \prod_{i=1}^N \left\{ (A_i^1)' \left( \prod_{t=2}^{n_i} A_i^t \right) \mathbf{1} \right\} f(u; \psi) \, du, \quad (5.13)$$

where now we define the quantities $A_{ik}^1$ and $A_{ik\ell}^t$ as $A_{ik}^1 = \pi_k f(y_{i1} \mid Z_{i1} = k, u, \psi)$ and $A_{ik\ell}^t = P(Z_{it} = \ell \mid Z_{i,t-1} = k, u, \psi) f(y_{it} \mid Z_{it} = \ell, u, \psi)$, $t > 1$. At first glance, (5.4) and (5.13) look the same. However, in the case of (5.13), the integral will typically be quite complicated, even in simple situations such as that presented in Example 3 below.

EXAMPLE 3 (Patient-specific random effects). We assume that the random effects are patient-specific, so that observations on different patients are independent. In particular, for patient $i$, we model the transition probabilities as

$$P(Z_{it} = \ell \mid Z_{i,t-1} = k, u_i^*, \psi) = \frac{\exp\{\tau_{k\ell}^* + u_{ik\ell}^*\}}{\sum_{h=1}^K \exp\{\tau_{kh}^* + u_{ikh}^*\}},$$

where $\tau_{kK}^* = u_{ikK}^* = 0$ for all $i$, $k$. The likelihood associated with this model can be simplified as in (5.5). However, instead of requiring only one random patient effect, we will need $K(K-1)$ (possibly correlated) random effects for each patient. There is no obvious way to reduce this number; simplifications such as $u_{ik\ell}^* = u_i^*$ do not have sensible interpretations in terms of the transition probabilities for patient $i$. The restriction $u_{ik\ell}^* = u_{i\ell k}^*$ for all $k$ and $\ell$ may be appropriate in some cases, but still results in quite a large number of random effects. Moreover, adding a random effect to the conditional model for the observed data (as in Example 1) would further increase the dimension of the integral; if this random effect were correlated with those in the model for the hidden process, the resulting integral could be quite complex.
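One virtue of the multinomial-logit form is that it produces a valid transition matrix for any real-valued random effects. A minimal sketch (intercepts only, with hypothetical parameter values) for building a patient-specific transition matrix:

```python
import math

def transition_matrix(tau_star, u_star):
    # Row k: P(Z_t = l | Z_{t-1} = k) = exp(tau*_kl + u*_kl) / sum_h exp(...),
    # where tau_star[k][l] and u_star[k][l] run over l = 0, ..., K-2 and the
    # last category's parameters are fixed at 0 to prevent overparameterization.
    K = len(tau_star)
    P = []
    for k in range(K):
        logits = [tau_star[k][l] + u_star[k][l] for l in range(K - 1)] + [0.0]
        m = max(logits)                      # subtract max for numerical stability
        w = [math.exp(x - m) for x in logits]
        s = sum(w)
        P.append([x / s for x in w])
    return P

# Hypothetical 2-state example: one random effect per row for this patient
P_i = transition_matrix(tau_star=[[1.0], [-1.5]], u_star=[[0.3], [-0.2]])
```

Each row is a softmax over $K$ categories, so the entries automatically lie in $(0,1)$ and each row sums to 1.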
We see that the estimation of the parameters in this setting is a difficult problem, even in simple cases such as Example 3. The high-dimensional integrals in these models not only create a potentially prohibitive computational burden, but also raise questions about the accuracy with which we can evaluate (5.13). For similar reasons, computational difficulties may arise in the implementation of the EM algorithm as well. See Appendix A.3 for details in the case where the random effects are patient-specific. As a final note, the number of parameters associated with the distribution of $u$ may be undesirably large. For example, consider the model with $K = 2$ where there are two random effects associated with the transition probabilities, and one associated with the conditional model for the observed data, and all are correlated (as in Example 3). If we model the distribution of these three random effects as multivariate normal with mean 0 and variance-covariance matrix $Q$, this distribution will have 6 unknown parameters. We could reduce this number by assuming that $\{u_i\}$ are independent with distribution $N(0, \sigma_i^2)$, $i = 1, 2, 3$, or even that $Q = \sigma^2 I$ (i.e., that the random effects are iid), but these assumptions would not be reasonable in most applications.

5.5 Moments Associated with Model II

Another problem with adding random effects to the model for the hidden process is that it becomes difficult to assess their impact on the model in general, and on the marginal moments in particular. The expressions for the moments are the same as in (5.7)-(5.9). However, the integration is considerably more difficult in this setting. For instance, consider (5.7). Under Model I, this equation becomes

$$E[Y_{it}] = \sum_{k=1}^K E[c'(\eta_{itk})] \, P(Z_{it} = k \mid \psi).$$

At least in the case where $\{Z_{it}\}$ is stationary, this expression can be easily interpreted. In contrast, under Model II (assuming the random effects appear only in the model for the hidden process), (5.7) becomes

$$E[Y_{it}] = \sum_{k=1}^K c'(\eta_{itk}) \int_u P(Z_{it} = k \mid u, \psi) f(u; \psi) \, du.$$
Even if $\{Z_{it} \mid u\}$ is stationary, it will be difficult to compute (or approximate) this integral because $P(Z_{it} = k \mid u, \psi)$ is a complex function of the transition probabilities when $K > 2$. This is also true of the integral in (5.8). The expression (5.9) is even harder both to evaluate and interpret when we have random effects in the model for the hidden process. Specifically, we need to integrate the function

$$P(Z_{it} = \ell, Z_{js} = k \mid u, \psi) = \begin{cases} P(Z_{it} = \ell \mid Z_{is} = k, u, \psi) \, P(Z_{is} = k \mid u, \psi), & i = j \\ P(Z_{it} = \ell \mid u, \psi) \, P(Z_{js} = k \mid u, \psi), & i \ne j. \end{cases}$$

The $(t-s)$-step transition probabilities, $P(Z_{it} = \ell \mid Z_{is} = k, u, \psi)$, are, like the stationary transition probabilities, complicated functions of the transition probabilities. Thus, evaluating these integrals - even by approximation methods - does not seem feasible. In the case where $\{Z_{it} \mid u\}$ is homogeneous with transition probabilities $\{P_{ik\ell}\}$, $k, \ell = 1, 2$, and stationary, the stationary distribution has the simple form

$$\pi_{i1} = \frac{P_{i21}}{P_{i12} + P_{i21}}, \qquad \pi_{i2} = \frac{P_{i12}}{P_{i12} + P_{i21}}.$$

Then we may compute approximations to (5.7) and (5.8). However, evaluating (5.9), even approximately, is still problematic because the $(t-s)$-step transition probabilities do not have a simple form. Interestingly, we are nonetheless able to make statements about $\lim_{t \to \infty} \mathrm{Cov}[Y_{it}, Y_{is}]$ when the hidden Markov chains are homogeneous and stationary. By the same argument given in Section 5.3, we have that

$$\lim_{t \to \infty} \mathrm{Cov}[Y_{it}, Y_{is}] \ge 0$$

if condition (5.11) holds. Again, this will be true if $f(y_{it} \mid z_{it}, u, \psi)$ is in the Poisson or normal family and $E[u] < \infty$. Thus, under these conditions, we see that adding random effects to the model for the hidden process also affects the correlation structure in the expected way (see the discussion in Section 5.3).
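For reference, the two-state stationary form is trivial to verify numerically; a minimal sketch with arbitrary illustrative transition probabilities:

```python
def stationary_two_state(p12, p21):
    # For a 2-state chain with off-diagonal transition probabilities
    # p12 = P(1 -> 2) and p21 = P(2 -> 1):
    #   pi_1 = p21 / (p12 + p21),  pi_2 = p12 / (p12 + p21)
    return p21 / (p12 + p21), p12 / (p12 + p21)

pi1, pi2 = stationary_two_state(0.1, 0.2)
# Invariance check: pi' P = pi' for P = [[0.9, 0.1], [0.2, 0.8]]
resid = abs(pi1 * 0.9 + pi2 * 0.2 - pi1)
```

For $K > 2$ no comparably simple closed form is available, which is exactly why the approximations discussed above become difficult.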
As a way of circumventing the problems associated with adding random effects to the transition probabilities, if the underlying Markov chains are homogeneous and stationary, we might instead consider incorporating the random effects into the stationary transition probabilities. We would then be able to compute $E[Y_{it}]$, $\mathrm{Var}[Y_{it}]$, and $\mathrm{Cov}[Y_{it}, Y_{js}]$, $i \ne j$, explicitly. A simpler (though not explicit) expression for $\mathrm{Cov}[Y_{it}, Y_{is}]$ would also result. However, it is important to consider the implications of these random effects in terms of the distribution of the transition probabilities. For example, in the simple case where $K = 2$, it is easy to derive the constraint that the transition probabilities for this model must satisfy, namely

$$\frac{P_{i21}}{P_{i12}} = \frac{\pi_{i1}}{\pi_{i2}} \quad (5.14)$$

for each $i$. It is reasonable to insist that, in the absence of prior information, all the transition probabilities have the same distribution. But it is unclear whether there exists a distribution for $u$ such that this is possible, given the constraint (5.14). Even if an appropriate distribution does exist, extending this method to the case where $K > 2$ seems unrealistic because the system of equations relating the transition probabilities to the stationary transition probabilities becomes increasingly complex with $K$. In conclusion, we may add random effects to the transition probabilities, but are unlikely to be able to evaluate any moments. Adding random effects to the stationary transition probabilities results in tractable expressions in (5.7), (5.8), and (5.9) for $i \ne j$, but may not be sensible in terms of the resulting distributions of the transition probabilities. Of course, these problems are not unique to HMMs: incorporating random effects in Markov chains is equally difficult.

5.6 Summary

In summary, the addition of random effects to the conditional model for the observed data results in a tractable, interpretable model.
Marginal moments can be evaluated or approximated, and parameter estimation is not much more complicated than in the case of a standard HMM. The primary change resulting from the inclusion of random effects is the need for integration (in most cases, numerical) of the likelihood. In addition, we will need to estimate the extra parameters associated with the distributions of the random effects, but this task should be straightforward as long as we have a suitable number of patients. On the other hand, including random effects in the model for the hidden process is hard from the perspective of both interpretation and estimation. Typically, marginal moments cannot even be approximated because of the complex nature of the integrands involved. Parameter estimation may also be problematic due to the high-dimensional integrals, the dependence structure of the random effects, and the large number of unknown parameters. However, the limitations of Model II are perhaps not as serious as one might imagine. As discussed in Section 2.5, we have relatively little information about the parameters of the hidden process. Extending the model to allow patient-to-patient differences on this level may explain very little additional variation in the observed data, and hence may not be worthwhile from a statistical standpoint. Moreover, capturing inter-patient heterogeneity in the hidden processes is still possible by incorporating covariates in this part of the model (though, of course, the issue of the reasonableness of the resulting marginal moments remains).

Chapter 6
Hypothesis Testing

In Chapter 5, we presented two models for multiple processes, Models I and II, as well as some theory regarding their interpretation and estimation. We did not, however, discuss the properties of these estimators. In the present chapter, we address this issue, along with the problem of hypothesis testing.
We begin, in Sections 6.1 and 6.2, by considering model identifiability and the asymptotic properties of the MLEs, respectively. Although we do not provide formal results, our discussion will suggest that standard tests are appropriate for most hypotheses concerning the model parameters. We then move on to the focus of this chapter: tests of the significance of the variance components of the random effects. In particular, we wish to develop a test of whether the additional complexity associated with introducing random effects into HMMs is warranted. Hypotheses of this nature do not lend themselves to standard testing procedures, and pose quite different challenges. Specifically, the difficulty surrounding tests of the variance components is that the null hypothesis puts at least one parameter on the boundary of the parameter space. Without loss of generality, we can assume that the variance-covariance matrix of the random effects, $Q$, is a function of some $q$-dimensional parameter $D$, $D \ge 0$, and that $Q(D)$ is a matrix of zeroes if $D = 0$. One of the conditions for the validity of the usual Wald or likelihood ratio test (LRT) is that the true parameter be in the interior of the parameter space. In the test with null hypothesis $D = 0$, this condition clearly does not hold. There is a substantial literature on the asymptotic distribution of the LRT statistic when the true parameter is on the boundary of the parameter space and when the observed data are iid. In this case, this distribution can often be determined analytically. For example, when the observed data are iid and normally distributed, Self & Liang (1987) show that, for a variety of boundary problems, the LRT statistic is asymptotically a mixture of $\chi^2$ distributions. Stram & Lee (1994) apply these results to the LRT of the variance components in a linear mixed model.
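In the simplest boundary case, testing a single variance component, the Self & Liang mixture is $0.5\chi^2_0 + 0.5\chi^2_1$, so the asymptotic p-value is half the usual $\chi^2_1$ tail probability. A minimal sketch of this calculation, using only the standard-normal cdf from the Python standard library:

```python
import math
from statistics import NormalDist

def boundary_lrt_pvalue(lrt_stat):
    # Asymptotic p-value for testing one variance component on the boundary,
    # using the 0.5*chi2_0 + 0.5*chi2_1 mixture of Self & Liang (1987):
    #   P(LRT >= c) = 0.5 * P(chi2_1 >= c),  c > 0,
    # where P(chi2_1 >= c) = 2 * (1 - Phi(sqrt(c))).
    if lrt_stat <= 0.0:
        return 1.0
    return 1.0 - NormalDist().cdf(math.sqrt(lrt_stat))

# Referring the statistic to the naive chi2_1 distribution would double
# this p-value, making the test conservative.
p = boundary_lrt_pvalue(2.706)
```

This mixture result applies in the iid settings discussed above; as noted below, the distribution is far more complicated for dependent data, which is what motivates the score-based alternatives.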
Feng & McCulloch (1992) extend this work to the general iid case, and show that a variation of the LRT statistic (based on an enlarged parameter space) has the usual asymptotic χ² distribution. However, in other settings, including that where the observed data are not independent, the distribution of the LRT statistic when some parameters are on the boundary is far more complicated. For this reason, other tests are used more commonly in the GLMM context. In particular, Jacqmin-Gadda & Commenges (1995), Lin (1997), and Hall & Praestgaard (2001) have proposed tests based on the score vector

∂/∂D ∑_{i=1}^N log f(y_i; ξ, D) |_{D=0}.    (6.1)

This test has also been suggested for detecting overdispersion (as captured by a random effect) in Poisson and binomial regression models (e.g. Dean 1992). In both cases, conditional on the random effects, the observations are independent with distributions in the exponential family. It turns out that this conditional independence leads to a nice form for (6.1), and hence its asymptotic distribution is easily derived. This is not true in the HMM setting, but we can still use this framework to conduct tests of the significance of the variance components. We provide a detailed explanation in Section 6.3. We apply the theory developed in the previous and current chapters to two data sets: lesion count data collected on the Vancouver PRISMS patients (PRISMS Study Group 1998), and faecal coliform counts in sea water (Turner et al. 1998). These analyses are presented in Section 6.4. We include the results of a small simulation study that investigates the performance of our variance component test in Section 6.5.

6.1 Identifiability of Models I and II

In Sections 2.3 and 3.2, we raised the issue of the identifiability of a HMM for a single process, along with the fact that this topic has not been adequately addressed in the literature.
Nonetheless, identifiability is a critical concern, since it is a required condition for the usual asymptotic properties of the parameter estimates to hold (e.g. Leroux 1992a; Bickel et al. 1998; Douc & Matias 2001). Likewise, the identifiability of Models I and II is also needed in order to study the asymptotic properties of their MLEs. However, the necessary theory is even less well-developed in the multiple process setting. In fact, we are aware of only one reference on this subject. Wang & Puterman (1999) show that Poisson HMMs with (possibly time-varying) covariates in the conditional model for the observed data are identifiable if, for each patient i:

i. the model matrix for the covariates is of full rank;

ii. the Poisson means associated with each hidden state at each time point are distinct.

The method of proof used to establish this result follows Leroux (1992a), and relies on the fact that the marginal distribution of each observation is an identifiable mixture distribution. This method easily extends to the case where the conditional distribution of the observed data is in the exponential family, (i) holds, and the support points of the mixing distributions are distinct (see the discussion in Section 3.1). In contrast, in the case where the model contains random effects, the approach of Wang & Puterman (1999) does not readily apply. Consider Model I. The marginal distribution of Y_it is

f(y_it) = ∑_{k=1}^K π_k ∫ f(y_it | Z_it = k, u, ψ) f(u; ψ) du,

which is a finite mixture. Prakasa Rao (1992) provides a summary of sufficient conditions for the identifiability of both finite and arbitrary mixture models. However, it may be difficult to verify these conditions for Model I because the kernel ∫ f(y_it | Z_it = k, u, ψ) f(u; ψ) du will typically not be a standard distribution. Similarly, the marginal distribution of Y_it under Model II is

f(y_it) = ∑_{k=1}^K ∫ f(y_it | Z_it = k, u, ψ) P(Z_it = k | u, ψ) f(u; ψ) du.
This distribution is also technically a mixture (with both discrete and continuous mixing distributions), but its complicated form makes verifying identifiability even more challenging. Clearly, further research is required to establish sufficient conditions for the identifiability of this new class of models. In the meantime, for the purposes of the discussion in this section, we will follow Bickel et al. (1998) and Douc & Matias (2001) and simply assume that the models in question are identifiable.

6.2 Asymptotic Properties of the MLEs of Models I and II

At this time, we have not formally established the asymptotic properties of the MLEs of Models I and II. However, Bradley & Gart (1962) develop conditions for the consistency and asymptotic normality of MLEs in a very general setting. We describe these conditions in this section, since they illustrate the work that may be involved in showing that these properties hold in the context of HMMs with random effects. In addition to model identifiability, these authors require that the data be sampled from independent populations. They consider two broad classes of problems:

i. The number of populations is finite, with the number of observations on each approaching infinity.

ii. The number of populations approaches infinity, with a finite number of observations on each.

In the context of MS/MRI clinical trials, we would expect to follow each patient for a limited time period, but to collect data on large numbers of patients. Thus, we are primarily interested in the second scenario. The conditions that Bradley & Gart (1962) impose are chosen to ensure that, for all k, ℓ, and m, as N → ∞,

(1/N) ∑_{i=1}^N ∂/∂ψ_k log f(y_i; ψ) = o_p(1),    (6.2)

and that there exists a positive definite matrix with (k,ℓ)th entry I_{kℓ}(ψ) and with finite determinant, as well as a constant C < ∞, such that

(1/N) ∑_{i=1}^N ∂²/∂ψ_k∂ψ_ℓ log f(y_i; ψ) + I_{kℓ}(ψ) = o_p(1)    (6.3)

and

lim_{N→∞} P( | (1/N) ∑_{i=1}^N ∂³/∂ψ_k∂ψ_ℓ∂ψ_m log f(y_i; ψ) | > C ) = 0.    (6.4)
Unfortunately, the conditions required for (6.2)-(6.4) are hard to verify in the HMM setting due to the complicated form of the likelihood and its derivatives, especially when random effects are involved. However, it is very likely that the MLEs have the usual asymptotic properties in the case where we have N independent patients with n_i ≤ n < ∞ for all i, and N → ∞. For this reason, in this context, we feel comfortable conducting standard Wald tests, using the observed information to approximate the standard errors of the parameter estimates.

6.3 Variance Component Testing

In this section, we develop a test, based on the score function, for testing the significance of the variance components of the random effects in our proposed models for multiple processes. For our purposes in this section, we specify the values of the variance components under the null hypothesis, but the other model parameters remain unknown. Thus, we treat the other model parameters as nuisance parameters. Little is known about the behaviour of the score function in the multiple HMM setting. For single HMMs, asymptotic properties of the score function have been established (Bickel et al. 1998; Douc & Matias 2001), but not in the case when the true parameters lie on the boundary, nor when there are nuisance parameters. Jacqmin-Gadda & Commenges (1995) provide a general result about the asymptotic distribution of a statistic based on the function S_(N)(ξ) = ∑_{i=1}^N S_i(ξ), where the {S_i(ξ)} are independent, and ξ is a vector of nuisance parameters. In their derivation of this distribution, the conditions that they impose ensure the validity of the central limit theorem and law of large numbers that these authors apply to certain key sums. When S_(N)(ξ) has the form (6.1) and the observed data arise from a GLMM with a single, patient-specific random effect, they show that these conditions are easily verified.
Perhaps not surprisingly given our discussion in Section 6.2, the same is not true of these conditions in the HMM setting. Thus, while we follow the approach of Jacqmin-Gadda & Commenges (1995) in the formulation of our test procedure, we will require different results in order to verify the necessary regularity conditions. In particular, we will need bounds on the expectation of the product of derivatives of log f(y_i | u_i, ψ). To this end, we use the methods of Bickel et al. (2002), henceforth called BRR. These authors consider the case where {Y_t} is a single, stationary HMM, and establish bounds on the expectation of the derivatives of log f(y; ψ) for a given sample size n. We show that the techniques used in the development of these bounds are also applicable to our problem. The beauty of BRR's work is that their conditions depend only on the conditional distribution of the observed data and the model for the hidden process, not on the full likelihood. For this reason, verifying these conditions is reasonably easy. BRR's bounds are derived with an eye towards determining the properties of the likelihood and its derivatives as n → ∞, and are quite refined. We are working in a much simpler context (i.e. where observations on different patients are independent, and, for each i, n_i remains bounded while N → ∞) and so presumably far cruder results would suffice. However, we are unaware of the existence of such results at this time. Our test is valid for a variety of models, and we discuss these at the end of this section. For the purposes of developing the relevant theory, we present only the simple example of Model I with no covariates, but with one patient-specific random effect, u_i, in the conditional model for the observed data (see, e.g., (5.5)). In other words, we assume that the {u_i} are iid with mean zero, and that

f(y_it | Z_it = k, u_i, ψ) = exp{ (y_it η_itk − c(η_itk))/a(φ) + d(y_it, φ) },

with η_itk = τ_k + u_i.
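Under this specification, f(y_i | u_i, ψ) is, for each fixed value of u_i, an ordinary HMM likelihood, computable by the forward algorithm; the derivatives with respect to u_i that enter the score contribution of Section 6.3 (of the form ½[second derivative + (first derivative)²] at u_i = 0) can then be approximated numerically. The sketch below assumes a Poisson conditional model with log-means τ_k + u; the helper names and the finite-difference step h are illustrative conveniences, not part of the formal development:

```python
import math

def log_lik_given_u(y, u, tau, P, pi):
    """Scaled forward algorithm: log f(y | u) for a K-state Poisson HMM
    with conditional means exp(tau[k] + u) (Model I, one random effect).
    P is the transition matrix, pi the stationary initial distribution."""
    K = len(tau)
    def emit(t, k):
        lam = math.exp(tau[k] + u)
        return math.exp(-lam) * lam ** y[t] / math.factorial(y[t])
    alpha = [pi[k] * emit(0, k) for k in range(K)]
    ll = 0.0
    for t in range(1, len(y)):
        c = sum(alpha)
        ll += math.log(c)
        alpha = [a / c for a in alpha]          # normalize to avoid underflow
        alpha = [emit(t, k) * sum(alpha[j] * P[j][k] for j in range(K))
                 for k in range(K)]
    return ll + math.log(sum(alpha))

def score_contribution(y, tau, P, pi, h=1e-4):
    """0.5 * [d2/du2 log f(y|u) + (d/du log f(y|u))^2] at u = 0, with the
    derivatives approximated by central differences (step h illustrative)."""
    f0 = log_lik_given_u(y, 0.0, tau, P, pi)
    fp = log_lik_given_u(y, h, tau, P, pi)
    fm = log_lik_given_u(y, -h, tau, P, pi)
    d1 = (fp - fm) / (2 * h)
    d2 = (fp - 2 * f0 + fm) / h ** 2
    return 0.5 * (d2 + d1 ** 2)
```

For K = 1 the HMM collapses to iid Poisson counts, which provides a simple check: with τ = 0, the first derivative at u = 0 is ∑(y_t − 1) and the second is −n, so the score contribution can be verified by hand.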
Recall that, for Model I, the hidden Markov chains are assumed homogeneous and stationary with transition probabilities common to all patients. Without loss of generality, we model the random effect as u_i = √D v_i, where the {v_i} are independent and identically distributed with zero mean and unit variance. One advantage of our test is that we do not need to specify the distribution of the {v_i}. Let ξ be the p-dimensional vector of all the model parameters except D, the variance of the random effect. Then, we will form our test statistic based on S_(N)(ξ) = ∑_{i=1}^N S_i(ξ), where

S_i(ξ) = ½ [ ∂²/∂u_i² log f(y_i | u_i, ψ) |_{u_i=0} + ( ∂/∂u_i log f(y_i | u_i, ψ) |_{u_i=0} )² ],    (6.5)

i = 1, …, N. Other authors have claimed that S_i(ξ) is equal to ∂/∂D log f(y_i; ξ, D) |_{D=0}, using either L'Hôpital's rule (e.g. Chesher 1984; Jacqmin-Gadda & Commenges 1995) or a Taylor series expansion of f(y_i | u_i, ψ) about u_i = 0 (e.g. Dean 1992; Lin 1997) as justification. Both methods depend on (unstated) regularity conditions for their validity. Specifically, these conditions permit, in the case of the former, two interchanges of derivatives and integrals, and in the case of the latter, the interchange of a derivative and an infinite summation. It is not clear that these conditions hold; as usual, they are particularly difficult to verify in our setting. For this reason, we will not make this claim, but simply note that it may be true in some cases.

We now introduce some notation, mostly borrowed from BRR. All expectations and probabilities are calculated under the null hypothesis that D = 0. First, we use the operator D^m to indicate a partial derivative of order m, taken with respect to any combination of u_i and the parameters ψ. The bounds that we develop on the expectation of these derivatives will not depend on the particular combination of these variables, hence the generality of our notation. Second, let Z_+ be the set of positive integers. We define

𝒥 ≡ { (J_1, …, J_k) : k > 0, J_j ∈ Z_+, j = 1, …, k }

and

𝒥_m^n(k) = { (I_1, …, I_k) : I_j ∈ Z_+, m ≤ I_j ≤ n, j = 1, …, k }.

For J ∈ 𝒥, let |J| be the length of J, and let |J|_+ = ∑_j J_j. In addition, let 𝒥_+(k) = { J ∈ 𝒥 : |J|_+ = k }. Also, we define U(I) as the ordered set of unique elements of the vector I. Finally, we define quantities C_m(y_it, ψ) and B^i_{m_1⋯m_d}(ψ). These quantities relate to the conditional distribution h_i(t) = log f(y_it, z_it | y_i^(t−1), z_i^(t−1), u_i, ψ), where y_i^(t) = (y_i1, …, y_it), and are critical in determining our bounds of stated interest. They are somewhat difficult to understand in isolation, but their purpose should become clear in the context of the proof of Theorem 6.1. In particular, we define

C_m(y_it, ψ) = max_k max_{D^m} { |D^m log P_{k z_it}| + |D^m log f(y_it | Z_it = k, u_i, ψ)|_{u_i=0} + |D^m log π_k| },

where max_{D^m} should be interpreted as the maximum over all derivatives D^m. Our model assumptions imply that h_i(t) = log P_{z_{i,t−1} z_it} + log f(y_it | z_it, u_i, ψ), so C_m(y_it, ψ) is a bound for the mth order derivatives of h_i(t). Letting the empty product be 1, we define

B_m^i(ψ) = max { E[ ∏_{j=1}^{|J|} C_{J_j}(y_{i I_j}, ψ) ] : J ∈ 𝒥_+(m), I ∈ 𝒥_1^{n_i}(|J|), r = |U(I)|, (t_1, …, t_r) = U(I), z_{i t_1}, …, z_{i t_r} ∈ {1, …, K} }.

The value B_m^i(ψ) will be used to bound the expectation of the product of derivatives of h_i(t). We also need the quantities B^i_{m_1 m_2}(ψ), B^i_{m_1 m_2 m_3}(ψ), …, where

B^i_{m_1 m_2}(ψ) = max { E[ ∏_{j=1}^{|J_1|} C_{J_{1j}}(y_{i I_{1j}}, ψ) ∏_{j=1}^{|J_2|} C_{J_{2j}}(y_{i I_{2j}}, ψ) ] : J_1 ∈ 𝒥_+(m_1), J_2 ∈ 𝒥_+(m_2), I_1 ∈ 𝒥_1^{n_i}(|J_1|), I_2 ∈ 𝒥_1^{n_i}(|J_2|), r = |U(I_1 ∪ I_2)|, (t_1, …, t_r) = U(I_1 ∪ I_2), z_{i t_s} ∈ {1, …, K}, s = 1, …, r },

and B^i_{m_1⋯m_d}(ψ), d > 2, is defined similarly. We require the following conditions for the remainder of this chapter.

Condition 1. The transition probabilities satisfy P_{kℓ} > 0 for all k, ℓ = 1, …, K.

Condition 2. The function log f(y_it | z_it, u_i, ψ) has M continuous derivatives, M ≥ 4, with respect to u_i and ψ at the true value of these parameters.

Condition 3. For all m_1, …,
m_d ≤ M, d < ∞, and for all i, B^i_{m_1⋯m_d}(ψ) < ∞, with ∑_{i=1}^∞ B^i_{m_1⋯m_d}(ψ)/i² < ∞.

Condition 4. There exists a fixed value n < ∞ such that n_i ≤ n for all i = 1, 2, ….

REMARK. Conditions 1-3 are essentially those assumed by BRR, with a slight change in Condition 2 to account for the incorporation of the random effects in our model, and in Condition 3 to bound terms involving B^i_{m_1⋯m_d}(ψ) in a specific way. Conditions 1, 2, and 4 are easy to verify. Theorem 6.2 below provides sufficient conditions for Condition 3 when the conditional distribution for the observed data is one of several common choices. Before stating our main theorem, we present some preliminary results. Our first, which we prove in Appendix B.2, extends BRR's Theorem 2.1 to establish bounds on the expectation of the product of derivatives of log f(y_i | u_i, ψ). We define

D_m = D^m log f(y_i | u_i, ψ) |_{u_i=0}.

Theorem 6.1 Under Conditions 1 and 2, for all i, and all m_1, …, m_d ≤ M, d < ∞,

E| D_{m_1} × ⋯ × D_{m_d} | ≤ A_{m_1⋯m_d} n_i^{m_1} × ⋯ × n_i^{m_d} B^i_{m_1⋯m_d}(ψ),

where A_{m_1⋯m_d} is a finite constant depending only on the transition probabilities.

Given the complicated form of the expression B^i_{m_1⋯m_d}(ψ), it is of interest to identify models where Condition 3 holds. This builds on the work of BRR, who simply assume that B^1_{m_1}(ψ) ≤ ζ(ψ) < ∞. We also require conditions such that ∑_{i=1}^∞ B^i_{m_1⋯m_d}(ψ)/i² < ∞ for finite values of d. Theorem 6.2, which is proved in Appendix B.3, shows that this property holds for four common distributions for the conditional model for the observed data. The same line of argument also allows the investigation of the properties of B^i_{m_1⋯m_d}(ψ) for other choices of distribution for the conditional model.

Theorem 6.2 Assume that Conditions 1 and 4 hold. In addition, assume that ξ belongs to a bounded parameter space. Then for all i, and all m_1, …, m_d < ∞, d < ∞,

∑_{i=1}^∞ B^i_{m_1⋯m_d}(ψ)/i² < ∞

when f(y_it | z_it, u_i, ψ) is in the Poisson, binomial, normal, or gamma family of distributions.

We now present a series of technical results that will be required in the proof of our principal theorem, Theorem 6.3. These results will allow us to apply the law of large numbers and the central limit theorem in the same manner as Jacqmin-Gadda & Commenges (1995), and hence to determine the asymptotic distribution of our test statistic.

Lemma 6.1 Under Conditions 1-4, for all k, ℓ = 1, …, p,

E | ∂S_i(ξ)/∂ξ_k | < ∞ for i = 1, …, N    (6.6)

E | ∂²S_i(ξ)/∂ξ_k∂ξ_ℓ | < ∞ for i = 1, …, N    (6.7)

∑_{i=1}^∞ Var{ ∂S_i(ξ)/∂ξ_k } / i² < ∞    (6.8)

∑_{i=1}^∞ Var{ ∂²S_i(ξ)/∂ξ_k∂ξ_ℓ } / i² < ∞    (6.9)

Proof. Consider (6.6). We have that

E | ∂S_i(ξ)/∂ξ_k | = E | ∂/∂ξ_k { ½ ∂²/∂u_i² log f(y_i | u_i, ψ) |_{u_i=0} + ½ ( ∂/∂u_i log f(y_i | u_i, ψ) |_{u_i=0} )² } | ≤ ½ E| D_3 | + E| D_1 D_2 |.

Thus, by Theorem 6.1 and Condition 3, we see that E | ∂S_i(ξ)/∂ξ_k | < ∞ for all i and k. In the same way, we can show that E | ∂²S_i(ξ)/∂ξ_k∂ξ_ℓ |, Var{ ∂S_i(ξ)/∂ξ_k }, and Var{ ∂²S_i(ξ)/∂ξ_k∂ξ_ℓ } are expectations of sums of products of { D_m }, m ≤ 4, and hence are finite for all i, k, and ℓ by Theorem 6.1. Now, for fixed values of k and ℓ, these moments vary with i only because n_i may be different for each i. Since n_i can take on only values from the finite set {1, …, n}, we see that Var{ ∂S_i(ξ)/∂ξ_k } and Var{ ∂²S_i(ξ)/∂ξ_k∂ξ_ℓ } can take on at most n distinct values. Thus, there exist finite constants L_k and M_{kℓ} (independent of N) such that, for all i = 1, …, N,

Var{ ∂S_i(ξ)/∂ξ_k } ≤ L_k and Var{ ∂²S_i(ξ)/∂ξ_k∂ξ_ℓ } ≤ M_{kℓ}.

Thus, it is clear that (6.8) and (6.9) hold for all k and ℓ. □

In the proof of the next results, we will require a law of large numbers for non-identically distributed random variables. We will use the law that if X_1, X_2, … are independent with mean zero, {c_i} are non-negative constants with c_N → ∞, and ∑_{i=1}^∞ E[X_i²]/c_i² < ∞, then

(1/c_N) ∑_{i=1}^N X_i → 0 a.s.

(See, e.g., Breiman 1968, Theorem 3.27.)
For our purposes, we will take c_i = i. We will henceforth refer to this law as (LI).

Lemma 6.2 Assume that Conditions 1-4 hold, and let ξ̂ be a MLE of ξ under the null hypothesis that D = 0. Then ξ̂ converges in probability to ξ as N → ∞.

Proof. Let f_0(y_i; ξ) be the density of patient i's observations under the null hypothesis, i.e.

f_0(y_i; ξ) = ∑_{z_i} π_{z_i1} f(y_i1 | z_i1, u_i = 0, ψ) ∏_{t=2}^{n_i} P_{z_{i,t−1} z_it} f(y_it | z_it, u_i = 0, ψ).

This is just the distribution of a stationary HMM. Now, using the method of Bradley & Gart (1962), the following requirements are sufficient to ensure the properties (6.2)-(6.4):

1. For almost all y_i and for all ξ, the derivatives

∂f_0(y_i; ξ)/∂ξ_j, ∂²f_0(y_i; ξ)/∂ξ_j∂ξ_k, ∂³f_0(y_i; ξ)/∂ξ_j∂ξ_k∂ξ_ℓ

exist for all j, k, ℓ = 1, …, p, and for all i = 1, …, N.

2. As N → ∞:

(1/N) ∑_{i=1}^N ∂/∂ξ_j log f_0(y_i; ξ) = o_p(1)    (6.10)

(1/N) ∑_{i=1}^N { ∂²/∂ξ_j∂ξ_k log f_0(y_i; ξ) − E[ ∂²/∂ξ_j∂ξ_k log f_0(y_i; ξ) ] } = o_p(1)    (6.11)

lim_{N→∞} P( | (1/N) ∑_{i=1}^N ∂³/∂ξ_j∂ξ_k∂ξ_ℓ log f_0(y_i; ξ) | > C ) = 0,    (6.12)

where C is some finite constant.

From Conditions 1 and 2, it is clear that the first requirement is satisfied. For the second requirement, we use (LI) to show all three results. This law will, in fact, give us convergence a.s., which is a stronger result than necessary. In particular, for (6.10), the fact that E|D_1| < ∞ allows us to interchange the expectation and derivative to compute E[ ∂/∂ξ_j log f_0(y_i; ξ) ] = 0 for all i. Conditions 1-4 and Theorem 6.1 imply that E[(D_1)²] has a finite bound, independent of i. Denote this bound by C. Then

∑_{i=1}^∞ (1/i²) E[ ( ∂/∂ξ_j log f_0(y_i; ξ) )² ] ≤ C ∑_{i=1}^∞ 1/i² < ∞,

so the conditions of (LI) are satisfied, and (6.10) holds. Similarly, (LI) applies in the case of (6.11), since the expected values of the summands are zero, and E[(D_2)²] − (E[D_2])² has a finite bound, independent of i. For (6.12), we first consider the expression

(1/N) ∑_{i=1}^N { ∂³/∂ξ_j∂ξ_k∂ξ_ℓ log f_0(y_i; ξ) − E[ ∂³/∂ξ_j∂ξ_k∂ξ_ℓ log f_0(y_i; ξ) ] }.    (6.13)

Clearly, the expected values of the summands are zero. Since E[(D_3)²] − (E[D_3])² has a finite bound, independent of i, (LI) implies that (6.13) converges to zero a.s. Since, for fixed values of j, k, and ℓ, the terms E[ ∂³/∂ξ_j∂ξ_k∂ξ_ℓ log f_0(y_i; ξ) ], i = 1, 2, …, take on only a finite number of values (as argued in the proof of Lemma 6.1), the second sum in (6.13) is bounded. Therefore, the first term must also be bounded a.s. These convergence results guarantee that the arguments of Bradley & Gart (1962) apply, and thus we have verified the consistency of ξ̂. □

Lemma 6.3 Assume that Conditions 1-4 hold, and let ξ̂ be a MLE of ξ under the null hypothesis that D = 0. Then √N(ξ̂ − ξ) has the form

√N(ξ̂ − ξ) = √N I_{ξξ(N)}^{−1} ∑_{i=1}^N U_i + o_p(1),    (6.14)

where each U_i is a random vector with kth element ∂/∂ξ_k log f_0(y_i; ξ), k = 1, …, p, and variance-covariance matrix I_{ξξi}. We write I_{ξξ(N)} = ∑_{i=1}^N I_{ξξi}.

Proof. Let H_(N)(ξ) be the matrix with entries ∑_{i=1}^N ∂²/∂ξ_k∂ξ_ℓ log f_0(y_i; ξ). Under Conditions 1 and 2, we can do a first-order Taylor series expansion of ∑_{i=1}^N ∂/∂ξ_k log f_0(y_i; ξ̂) about ξ, for k = 1, …, p, yielding

ξ̂ − ξ = [ −H_(N)(ξ†) ]^{−1} ∑_{i=1}^N U_i,

where |ξ† − ξ| ≤ |ξ̂ − ξ|. By Lemma 6.2, ξ̂ = ξ + o_p(1), so that ξ† = ξ + o_p(1) as well. By the continuity of ∂²/∂ξ_k∂ξ_ℓ log f_0(y_i; ξ),

(1/N) H_(N)(ξ†) = (1/N) H_(N)(ξ) + o_p(1).    (6.15)

As shown in the proof of (6.11),

(1/N) ∑_{i=1}^N { ∂²/∂ξ_k∂ξ_ℓ log f_0(y_i; ξ) − E[ ∂²/∂ξ_k∂ξ_ℓ log f_0(y_i; ξ) ] } → 0 a.s.

By Conditions 3 and 4 and Theorem 6.1, we can interchange the expectation and derivative to obtain

E[ ∂²/∂ξ_k∂ξ_ℓ log f_0(y_i; ξ) ] = −E[ ∂/∂ξ_k log f_0(y_i; ξ) · ∂/∂ξ_ℓ log f_0(y_i; ξ) ],

so (1/N)( H_(N)(ξ) + I_{ξξ(N)} ) = o_p(1). Substituting this result into (6.15), we obtain

√N(ξ̂ − ξ) = √N I_{ξξ(N)}^{−1} ∑_{i=1}^N U_i + o_p(1) (1/√N) ∑_{i=1}^N U_i.    (6.16)

We now demonstrate that the second term of (6.16) is o_p(1). It suffices to show that ∑_{i=1}^N U_i/√N = O_p(1). Let V_ij be the jth element of U_i, and fix a value of ε > 0 and C > 0. By Chebyshev's inequality,

P( | (1/√N) ∑_{i=1}^N V_ij | > C ) ≤ Var[ (1/√N) ∑_{i=1}^N V_ij ] / C² = ∑_{i=1}^N Var[V_ij] / (N C²).

As argued in the proof of (6.10), E[V_ij] = 0, and Var[V_ij] has a finite bound independent of i, say γ. Then,

P( | (1/√N) ∑_{i=1}^N V_ij | > C ) ≤ Nγ / (N C²) = γ / C².

Since C is arbitrary, we can choose a value large enough that P( | (1/√N) ∑_{i=1}^N V_ij | > C ) < ε. This statement implies that ∑_{i=1}^N U_i/√N = O_p(1), as required. □
We will need the following quantities for our next theorem. Define S_(N)(ξ̂) = ∑_{i=1}^N S_i(ξ̂) as in (6.5). Let I_i = Var[S_i(ξ)], and let I_(N) = ∑_{i=1}^N I_i. Let J_(N) and K_(N) be column vectors defined by J_(N) = ∑_{i=1}^N E[ ∂S_i(ξ)/∂ξ ] and K_(N) = ∑_{i=1}^N E[ S_i(ξ) U_i ]. Finally, define

I_c(N) = I_(N) + J_(N)' I_{ξξ(N)}^{−1} J_(N) + 2 K_(N)' I_{ξξ(N)}^{−1} J_(N).    (6.17)

Theorem 6.3 With the above definitions, under Conditions 1-4,

S_(N)(ξ̂) / √I_c(N)    (6.18)

has an asymptotically standard normal distribution as N → ∞.

Proof. Following Jacqmin-Gadda & Commenges (1995), we expand S_(N)(ξ̂) in a Taylor series about ξ to obtain

S_(N)(ξ̂) = ∑_{i=1}^N S_i(ξ) + N(ξ̂ − ξ)' [ N^{−1} ∑_{i=1}^N ∂S_i(ξ)/∂ξ + e_N ],

where e_N = (ξ* − ξ)' N^{−1} ∑_{i=1}^N ∂²S_i(ξ*)/∂ξ∂ξ', and |ξ* − ξ| ≤ |ξ̂ − ξ|. Using (6.7), (6.9), and (LI), we have that

N^{−1} ∑_{i=1}^N ∂²S_i(ξ)/∂ξ∂ξ' = N^{−1} ∑_{i=1}^N E[ ∂²S_i(ξ)/∂ξ∂ξ' ] + o_p(1).

By Lemma 6.2, ξ̂, and hence ξ*, is consistent. In addition, by (6.7) and the arguments in the proof of Lemma 6.1, N^{−1} ∑_{i=1}^N E[ ∂²S_i(ξ*)/∂ξ∂ξ' ] is bounded away from infinity. So, e_N = o_p(1). Likewise, we can use (6.6), (6.8), and (LI) to show that

N^{−1} ∑_{i=1}^N ∂S_i(ξ)/∂ξ = N^{−1} J_(N) + o_p(1).

Using Lemma 6.3, we now have that

S_(N)(ξ̂) = ∑_{i=1}^N S_i(ξ) + [ √N I_{ξξ(N)}^{−1} ∑_{i=1}^N U_i + o_p(1) ]' [ N^{−1/2} J_(N) + o_p(1) √N ]
= ∑_{i=1}^N S_i(ξ) + ∑_{i=1}^N U_i' I_{ξξ(N)}^{−1} J_(N) + o_p(1) √N ( N^{−1} J_(N) ) + o_p(1) ( N I_{ξξ(N)}^{−1} ) ( N^{−1/2} ∑_{i=1}^N U_i ) + o_p(1) √N.

By (6.6), (6.7), and Condition 4, we have that N^{−1} J_(N) is bounded away from infinity. As argued in the proof of Lemma 6.3, ∑_{i=1}^N U_i/√N = O_p(1). Thus, to show that

S_(N)(ξ̂) = ∑_{i=1}^N { S_i(ξ) + U_i' I_{ξξ(N)}^{−1} J_(N) } + o_p(1) √N,    (6.19)

we need only prove that N I_{ξξ(N)}^{−1} = O(1).

Using the idea discussed in the proof of Lemma 6.1, the matrices { I_{ξξi} }_{i=1}^N belong to a set of at most n matrices. Denote this set by { 𝓘_j }_{j=1}^n, where 𝓘_j is the matrix I_{ξξi} for n_i = j. Let {d_j} be the set of minimum eigenvalues of { 𝓘_j }, and let {d'_j} be the set of maximum eigenvalues of { 𝓘_j }. Define d = min{d_j} and define d' = max{d'_j}. Since the matrices are positive definite, d and d' are strictly positive. Let N_j be the number of times that n_i = j, i = 1, …, N. Then

I_{ξξ(N)} = ∑_{j=1}^n N_j 𝓘_j.

The minimum eigenvalue of I_{ξξ(N)} is given by

min_v ( v' I_{ξξ(N)} v / v'v ),

where this minimum is computed over all p-dimensional vectors v. Thus,

min_v ( v' I_{ξξ(N)} v / v'v ) = min_v ∑_{i=1}^N ( v' I_{ξξi} v / v'v ) ≥ ∑_{i=1}^N min_v ( v' I_{ξξi} v / v'v ) ≥ N d.

By a similar argument, the maximum eigenvalue of I_{ξξ(N)} is at most N d'. Thus, the eigenvalues of N I_{ξξ(N)}^{−1} must lie in the interval [1/d', 1/d]. This is sufficient to guarantee that N I_{ξξ(N)}^{−1} = O(1), and hence that (6.19) holds.

Now, let R_{i(N)} = S_i(ξ) + U_i' I_{ξξ(N)}^{−1} J_(N). As demonstrated in the proof of (6.10), E[U_i] = 0. Similarly, by Condition 3 and Theorem 6.1, we can show that E[S_i(ξ)] = 0. So, E[R_{i(N)}] = 0. Defining σ_i² = Var[R_{i(N)}], Jacqmin-Gadda & Commenges (1995) prove that s_N² ≡ ∑_{i=1}^N σ_i² = I_c(N). We now show that ( ∑_{i=1}^N R_{i(N)} ) / s_N is asymptotically distributed as N(0,1), using Lyapounov's condition. In particular, we show that σ_i² < ∞ for all i, and that

lim_{N→∞} (1/s_N^{2+δ}) ∑_{i=1}^N E[ |R_{i(N)}|^{2+δ} ] = 0    (6.20)

for some δ > 0. Now,

σ_i² = E[S_i²(ξ)] + J_(N)' I_{ξξ(N)}^{−1} E[U_i U_i'] I_{ξξ(N)}^{−1} J_(N) + 2 E[S_i(ξ) U_i'] I_{ξξ(N)}^{−1} J_(N).

Using the same argument as in the proof of Lemma 6.1, σ_i² can take on at most n possible values. So, σ² = min σ_i² > 0. Similarly, using Conditions 3 and 4 and Theorem 6.1, E[ |R_{i(N)}|³ ] ≤ γ, where γ is a finite constant. So, letting δ = 1 in (6.20),

lim_{N→∞} (1/s_N³) ∑_{i=1}^N E[ |R_{i(N)}|³ ] ≤ lim_{N→∞} Nγ / (Nσ²)^{3/2} = lim_{N→∞} γ / (σ³ √N) = 0,

verifying that ∑_{i=1}^N R_{i(N)} / s_N is asymptotically distributed as N(0,1). Since s_N² = I_c(N), we see that

S_(N)(ξ̂) / √I_c(N) = ( ∑_{i=1}^N R_{i(N)} + o_p(1) √N ) / s_N = ∑_{i=1}^N R_{i(N)} / s_N + o_p(1),

where the last equality follows from the fact that √N/s_N ≤ 1/σ < ∞. Then, by Slutsky's theorem, S_(N)(ξ̂)/√I_c(N) is also asymptotically distributed as N(0,1). □

We conclude this section with a discussion of other models to which this type of test may apply.
Adding time-independent covariates to the conditional model for the observed data is the easiest of these; by assuming that the coefficients are bounded, all of the results we have presented are valid in this case as well. Likewise, incorporating a patient-specific random effect in the model for the hidden processes should be fairly simple, as long as the associated parameters are bounded, and the processes {Z_it | u_i}_{t=1}^{n_i}, i = 1, …, N, are homogeneous and stationary. It should be reasonably straightforward to extend our theory to the case where there are multiple, patient-specific random effects, u_i, in either the conditional model for the observed data or the model for the hidden process or both, and we wish to simultaneously test the significance of all of the variance components. In this case, S_i(ξ) would be a vector, obtained by taking derivatives of log f(y_i | u_i, ψ) with respect to each random effect and then evaluating these derivatives at 0. Consequently, we would require a multivariate central limit theorem in the derivation of the asymptotic distribution of ∑_{i=1}^N S_i(ξ). Similar theory should apply to the case where we have a large number of independent clusters of processes, where the number of processes in each cluster is bounded. For instance, in Example 2 of Section 5.2, we assumed both patient- and centre-specific random effects. Presumably, a multivariate version of our test could be used to assess the significance of both variance components, assuming that the number of patients in each centre is bounded, and that the number of centres is large. Some models, however, would present more substantial challenges. For instance, BRR's theory applies only to homogeneous, stationary HMMs. It is not immediately apparent how to extend this theory to a more general setting. Even more difficult is the case where the random effects are crossed, so that observations on different patients (or clusters) are not necessarily independent.
Our approach relies heavily on the assumption of independence among patients; for example, the law of large numbers and central limit theorem that we use apply only to independent observations. Finally, in the case where there are multiple random effects, it is not clear how to use our procedure to test the significance of a subset of the variance components. In this case, the test statistic (6.5) would involve a (possibly multidimensional) integral, and the results of BRR would not apply. Further research is required to address the issue of hypothesis testing in these contexts.

6.4 Applications

In this section we illustrate the use of HMMs with random effects in applied settings, both in terms of fitting the model and in conducting the test of the significance of the variance components. First, we address the issue of estimating the quantity I_c(N) in (6.18). We then analyze two different data sets: MRI data from a group of MS patients, and faecal coliform count data originally presented by Turner et al. (1998).

6.4.1 Computing the Test Statistic

The quantity (6.18) contains unknown parameters, and hence is not a test statistic. In particular, I_c(N) (defined by (6.17)) is a function of parameter values which are not known a priori. However, as long as we have a consistent estimate Î_c(N) of I_c(N), Conditions 1-4 and Slutsky's theorem guarantee that S_(N)(ξ̂)/√Î_c(N) has an asymptotically standard normal distribution as N → ∞. In the GLMM setting, I_c(N) can be expressed analytically in terms of ξ. In this case, the usual practice is to estimate I_c(N) by replacing ξ with ξ̂ (e.g. Dean 1992; Jacqmin-Gadda & Commenges 1995; Lin 1997). In contrast, in the HMM context, I_c(N) does not have a closed form, and it is unclear how to evaluate this function at ξ̂. For this reason, we use a different method to obtain Î_c(N): the parametric bootstrap (e.g. Efron & Tibshirani 1993).
Specifically, under the null hypothesis that the variance components are 0, by Lemma 6.2 and Conditions 1 and 2, f_0(y_i; ξ̂) is a consistent estimate of f_0(y_i; ξ). Using this fact, we estimate the unknown expected values I_(N), J_(N), and K_(N) in I_c(N) by generating V samples from f_0(y_i; ξ̂) and computing the mean of appropriate functions of these. For example, to estimate J_(N), for each v = 1, …, V and each i = 1, …, N, we generate a sample of n_i observations from f_0(y_i; ξ̂). Denote the vth sample by y_i^(v), and the associated value of S_i(ξ̂) by S_i(y_i^(v); ξ̂). We then approximate (∂/∂ξ) S_i(y_i^(v); ξ̂) by taking differences of S_i(y_i^(v); ξ) in the neighbourhood of ξ̂. Repeating this procedure V times, we compute our estimate as

Ĵ_(N) = (1/V) ∑_{v=1}^V ∑_{i=1}^N (∂/∂ξ) S_i(y_i^(v); ξ̂).

Rather than using the parametric bootstrap to estimate the quantity I_{ξξ(N)} in I_c(N), we instead use the matrix of second derivatives of the log-likelihood evaluated at ξ̂. This matrix is a by-product of the quasi-Newton minimization routine, and hence is readily available. Sometimes this method leads to a negative value for Î_c(N). We discard these cases, which presumably results in an estimate of I_c(N) that is positively biased. However, the fact that this bias is positive makes the test more conservative, and hence is not a concern. It is difficult to assess the accuracy of this estimation procedure in our setting. However, we have investigated the simpler case where there is only one observed lesion count on each patient and we wish to test for overdispersion relative to the Poisson distribution (e.g. Dean 1992). We are able to compare the estimates of I_c(N) obtained using the parametric bootstrap with V = 100 with those obtained by replacing ξ with ξ̂ in I_c(N) (since, in this case, I_c(N) can be computed analytically). The estimates were very similar for the data sets we considered. Thus, the parametric bootstrap seems to be well-behaved, at least in this simpler context.
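The simpler case just described — one Poisson count per patient, testing for overdispersion — is small enough to sketch end to end. Conditional on a random effect u on the log mean, the score contribution at u = 0 reduces to ½[(y_i − μ)² − μ], and the null variance of the score is estimated by the parametric bootstrap in the same way as for the HMM: simulate from the fitted null model, recompute the statistic, and take the empirical variance. The sample data, V, and seed below are arbitrary illustrative choices:

```python
import math, random

def poisson_sample(mu, rng):
    # Knuth's multiplication method (adequate for moderate mu)
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def score_stat(y, mu):
    """S = sum_i 0.5*[(y_i - mu)^2 - mu]: the score for a variance
    component on the log mean, evaluated at u = 0 (equivalent to Dean's
    overdispersion statistic when mu is the sample mean)."""
    return sum(0.5 * ((yi - mu) ** 2 - mu) for yi in y)

def bootstrap_test(y, V=200, seed=1):
    """Standardized statistic S / sqrt(I_c), with I_c estimated by the
    parametric bootstrap: simulate V datasets from Poisson(mu-hat),
    re-estimate mu on each, and take the empirical variance of S."""
    rng = random.Random(seed)
    mu_hat = sum(y) / len(y)
    boot = []
    for _ in range(V):
        yb = [poisson_sample(mu_hat, rng) for _ in y]
        boot.append(score_stat(yb, sum(yb) / len(yb)))
    m = sum(boot) / V
    I_c = sum((b - m) ** 2 for b in boot) / (V - 1)
    return score_stat(y, mu_hat) / math.sqrt(I_c)
```

A large positive standardized value signals overdispersion; refitting μ on each bootstrap sample mimics plugging ξ̂ into the null density before resampling.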
If n_i = n for all i and there are no covariates, the vectors of patients' observations are iid. In this case, two other options are available for estimating I_c(N): the nonparametric bootstrap and the jackknife (Efron & Tibshirani 1993). As with the parametric bootstrap, these methods require the generation of realizations of the random variables Y_1, …, Y_N. The former involves resampling with replacement from the observed vectors of data; the latter focuses on the samples that omit the ith patient's observations, i = 1, …, N, one at a time. In our simulation study in Section 6.5, we compare the parametric and nonparametric bootstrap estimates of I_c(N) for some of the cases that we consider.

6.4.2 MS/MRI Data

In Section 2.4, we discussed the model for individual MS patients developed by Albert et al. (1994). These authors comment that MS is a very heterogeneous disease, in the sense that the degree and behaviour of the disease is expected to vary considerably among patients. Although modelling each patient's data separately certainly allows for between-patient differences, the cost of adding so many parameters to the model is increased uncertainty about all parameter estimates. In this section, we propose using random effects as a means of capturing between-patient differences while maintaining a parsimonious model. We fit our model to the Vancouver PRISMS data on 13 placebo patients described in Section 4.3.2. Let Y_it be the lesion count for patient i at time t, and let Z_it be the associated hidden state. We fit Model I of Chapter 5 with one patient-specific random effect, u_i, in the conditional model for the observed data, and no covariates. In particular, we assume that, given Z_it = k and u_i, Y_it is distributed as Poisson(μ_itk) with

log μ_itk = τ_k + u_i,

where τ_k is the fixed effect of being in state k, and the {u_i} are iid, each with a N(0, σ²) distribution.
Furthermore, we assume that the transition probabilities are homogeneous, non-zero, and common to all patients, and that {Z_it}_{t=1}^{n_i} is stationary for all i. Our results in Section 3.4 suggested that two hidden states are appropriate for this type of data, and thus we use K = 2 here. For the purpose of comparison, we also fit the same model with no random effect. In other words, we consider the case where Y_it | Z_it = k is distributed as Poisson(μ_itk) with

log μ_itk = τ_k.

We refer to this model as the fixed model, and to the model with a random effect as the mixed model. The advantage of the mixed model is that it allows different mean lesion counts for different patients with the addition of only one extra unknown parameter (σ). Although we have assumed a restricted form for the distribution of these means without any prior information about its appropriateness, we could test this assumption by fitting the fixed model to each patient individually, and then estimating the distribution of the values of τ_1 and τ_2 that we obtain. To avoid the need to use constrained optimization methods, we reparameterize the model so that each parameter has as its range the whole real line. These transformations are provided in Table 6.1. We then fit both models by directly maximizing the likelihood. In the case of the mixed model, we use the Gauss-Hermite quadrature formula to approximate the integral that appears in the likelihood. We would expect this formula to provide a high level of accuracy in this case, where we are integrating with respect to the normal distribution. Table 6.1 gives the parameter estimates and standard errors that result from fitting both models. By including the random effect, we observe what appears to be quite a large increase in the log-likelihood. In addition, we see substantial changes in the estimates of some of the fixed effects, τ_2 and P̂_22 in particular.
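The Gauss-Hermite approximation of the mixed-model likelihood contribution can be sketched directly. For one patient with a fixed state sequence contribution collapsed to a single state effect τ, the marginal likelihood requires ∫ Π_t Poisson(y_t; exp(τ + u)) N(u; 0, σ²) du; substituting u = √2 σ x reduces this to the Gauss-Hermite weight function e^{-x²}. The parameter values below are illustrative only.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from math import lgamma

def poisson_logpmf(y, mu):
    return y * np.log(mu) - mu - lgamma(y + 1)

def marginal_likelihood(y, tau, sigma, n_nodes=20):
    # Approximate \int prod_t Poisson(y_t; exp(tau + u)) N(u; 0, sigma^2) du
    # via Gauss-Hermite: nodes u_j = sqrt(2)*sigma*x_j, weights w_j/sqrt(pi).
    x, w = hermgauss(n_nodes)
    u = np.sqrt(2.0) * sigma * x
    loglik_given_u = np.array(
        [sum(poisson_logpmf(yt, np.exp(tau + uj)) for yt in y) for uj in u])
    return float(np.sum(w / np.sqrt(np.pi) * np.exp(loglik_given_u)))

L = marginal_likelihood([2, 3, 1], tau=0.5, sigma=0.7)
```

As σ → 0 the approximation recovers the fixed-model (pure Poisson) likelihood, which is a convenient internal check on the quadrature.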
Table 6.1: Parameter estimates and standard errors for Vancouver PRISMS data

Parameter   Transformation   Fixed Model            Mixed Model
                             Estimate    S.E.       Estimate    S.E.
τ_1         NA               -0.907      0.115      -0.807      0.289
τ_2         NA               1.657       0.107      1.344       0.173
P*_11       logit(P_11)      3.700       0.456      2.876       0.594
P*_22       logit(P_22)      -1.097      0.517      0.752       0.188
σ*          log σ            NA          NA         -1.488      0.578
log L                        -268.41                -249.57

More formally, the mixed model satisfies the conditions of Theorem 6.3, so we can use our variance component test to test the hypothesis that σ = 0. This test is equivalent to a comparison of the fixed and mixed models. For these data, we compute Σ_{i=1}^{13} S_i(ξ̂) = 215.1, and use the parametric bootstrap to obtain Î_c(13) = 3396.0. The value of our test statistic is then 215.1/√3396.0 = 3.69, which results in a p-value of < 0.001 when compared to the standard normal distribution. Thus, we have strong evidence in favour of the mixed model.

6.4.3 Faecal Coliform Data

Another application of HMMs with random effects is to the faecal coliform data first analyzed by Turner et al. (1998). In this study, sea water samples were collected at seven sites near Sydney, Australia, over a four year period. At each site, samples were taken from four depths (0, 20, 40, and 60 m) and the associated coliform counts recorded. Turner et al. (1998) analyze these data using a Poisson HMM with depth and site as covariates in the conditional model for the observed data. Observations from each depth and site combination are treated as independent time series. The hidden states (0 or 1) are presumed to correspond to whether the sample was taken from above or below the thermocline. These authors treat site as a fixed effect, but certainly in some contexts this effect might be more appropriately treated as random. For example, repeated measurements at each site (i.e. at the four depths) may be correlated. Moreover, it may be desirable to think of the sites as a random sample, in which case we would want to generalize our results to all "sites" in the area.
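The test-statistic computation reported above is a one-line calculation; the sketch below reproduces it from the two numbers quoted in the text (summed score and bootstrap variance), using a two-sided normal p-value, which matches the p-values quoted in Sections 6.4.2 and 6.4.3.

```python
from math import erfc, sqrt

def vc_test(score_sum, bootstrap_var):
    # Standardize the summed score by the bootstrap estimate of its null
    # variance and refer to the standard normal (two-sided here).
    z = score_sum / sqrt(bootstrap_var)
    p = erfc(abs(z) / sqrt(2.0))
    return z, p

# Vancouver PRISMS data: sum_i S_i = 215.1, bootstrap variance 3396.0
z1, p1 = vc_test(215.1, 3396.0)
# Faecal coliform data (Section 6.4.3): 390.3 and 79240.5
z2, p2 = vc_test(390.3, 79240.5)
```

The first call gives z ≈ 3.69 with p < 0.001, and the second gives z ≈ 1.39 with p ≈ 0.166, in agreement with the values reported in the text.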
We thus re-analyze the data by incorporating a random site effect, using the framework of Model I. Let Y_ijt be the coliform count at site i, depth j, and time t. Given the hidden state Z_ijt = k and the site effect, u_i, we model the distribution of Y_ijt as Poisson(μ_ijtk), where

log μ_ijtk = τ_k + b_j + u_i.

Following Turner et al. (1998), we assume that k = 0 or 1, i.e. that K = 2. To prevent over-parameterization of the model, we impose the constraint Σ_{j=1}^4 b_j = 0. We model the site effects, u_i, as independent, each with a N(0, σ²) distribution. Finally, we assume that the transition probabilities are homogeneous, non-zero, and common to all sites, and that {Z_ijt}_t is stationary for all i and j. We also fit a model with no site effect, i.e. we propose that Y_ijt | Z_ijt = k is distributed as Poisson(μ_ijtk), where

log μ_ijtk = τ_k + b_j.

We again refer to these models as the fixed and mixed models, respectively. The results of these analyses are presented in Table 6.2.

Table 6.2: Parameter estimates and standard errors for faecal coliform data

Parameter   Transformation   Fixed Model            Mixed Model
                             Estimate    S.E.       Estimate    S.E.
τ_0         NA               -2.209      0.129      -2.182      0.135
τ_1         NA               1.537       0.034      1.515       0.055
b_1         NA               0.012       0.061      0.027       0.065
b_2         NA               -0.054      0.060      -0.062      0.063
b_3         NA               0.016       0.055      0.018       0.056
P*_00       logit(P_00)      1.993       0.148      1.976       0.149
P*_11       logit(P_11)      -0.218      0.223      -0.194      0.227
σ*          log σ            NA          NA         -2.246      0.561
log L                        -1534.4                -1533.3

In this case, it appears that the random site effect is not necessary. The log-likelihoods and the parameter estimates for the fixed and mixed models are very similar, indicating that there is likely very little site-to-site variation. This example differs somewhat from the paradigm presented in Section 6.3 in that there are four sequences (one for each depth) associated with each site.
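The sum-to-zero identifiability constraint on the depth effects can be handled by parameterizing three effects freely and deriving the fourth, as sketched below. The numerical values are illustrative (taken from the mixed-model column of Table 6.2 for concreteness); the function name is hypothetical.

```python
import numpy as np

# Depth effects b_1..b_4 with the constraint sum_j b_j = 0:
# parameterize the first three freely and set b_4 = -(b_1 + b_2 + b_3).
b_free = np.array([0.027, -0.062, 0.018])   # illustrative values
b = np.append(b_free, -b_free.sum())

tau = np.array([-2.182, 1.515])             # state effects tau_0, tau_1

def log_mu(k, j, u_i):
    # log mu_{ijtk} = tau_k + b_j + u_i for the mixed model.
    return tau[k] + b[j] + u_i
```

This keeps the optimization unconstrained while the full vector of four depth effects always satisfies the constraint exactly.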
In particular, letting Y_ij represent the observations from site i, depth j, (6.5) becomes

S_i(ξ) = [ Σ_{j=1}^4 (∂/∂u_i) log f(y_ij | u_i; ψ) |_{u_i=0} ]² + Σ_{j=1}^4 (∂²/∂u_i²) log f(y_ij | u_i; ψ) |_{u_i=0}.

However, assuming that the depth effects are bounded, it is easy to show that the mixed model satisfies the conditions of Theorem 6.3. Our variance component test is thus statistically valid for assessing the difference in fit between the mixed and fixed models. We compute Σ_{i=1}^7 S_i(ξ̂) = 390.3, and estimate Î_c(7) = 79240.5 using the parametric bootstrap. These estimates give a value of 390.3/√79240.5 = 1.39 for our test statistic, and a corresponding p-value of 0.166. Therefore, as expected, we do not reject the null hypothesis that the fixed model adequately represents these data. Interestingly, in an analysis of a subset of these data using fixed site effects, Turner et al. (1998) find evidence of site-to-site variation. This difference in conclusions leads naturally to the question of the sensitivity of our test, an issue which we address in Section 6.5. As an aside, this example is also a nice illustration of the ease with which HMMs can accommodate missing values. In order to satisfy the condition that the observations be equally spaced in time, Turner et al. (1998) round the observation times to the nearest week, and use missing values for weeks with no associated observation. For instance, say there are s missing values between times t_0 and t_1 at site i. Then, in the piece of the likelihood corresponding to the observation at t_1, P_kℓ f(y_{it_1} | Z_{it_1} = ℓ, u_i) (see (5.4)), we simply replace P_kℓ with the (s + 1)-step transition probability, P_kℓ^(s+1).

6.5 Performance of the Variance Component Test

In Section 6.4, we applied our variance component test to two data sets with unknown generating mechanisms. It is also of interest to apply the test to data generated from known models so that we can assess its performance.
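The (s + 1)-step transition probabilities used to skip over missing weeks are simply entries of the matrix power P^(s+1); a minimal sketch with an illustrative 2-state matrix:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])  # illustrative 2-state transition matrix

def multistep(P, s):
    # With s missing observations between two consecutive recorded times,
    # the one-step probability P_{kl} is replaced by the (s+1)-step
    # probability, i.e. the (k,l) entry of P^(s+1).
    return np.linalg.matrix_power(P, s + 1)

P2 = multistep(P, 1)  # one missing week -> 2-step transition probabilities
```

Each row of P^(s+1) remains a probability distribution, so the likelihood contribution stays properly normalized no matter how many scans are missing.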
In particular, we wish to make statements about the probability of Type I and Type II error for selected cases. As in Section 3.5, we use the ideas of experimental design to choose the parameter values for the cases we consider. For simplicity, we assume that each process has the same number of observations (n). There are many factors that could potentially impact the performance of the test, including n, the number of processes (N), the number of hidden states (K), the conditional distribution for the observed data, the size of the variance component (σ²), the spacing of the components, and the proportion of time spent in each hidden state (see Section 3.5). It is expected that of these factors, N, K, and σ² are the most influential. Thus, in our study, we examine only this subset. We look at the levels N = 10, 30, 50, K = 1, 2, 3, and σ = 0, 1.11, 1.22, which reflect "small", "medium", and "large" values of these factors. Since the applications in this thesis have focused on lesion counts observed during MS/MRI clinical trials, we consider only the model (5.5) with f(y_it | Z_it = k, u_i, ψ) corresponding to the Poisson(μ_itk) distribution, log μ_itk = τ_k + u_i, and fix n = 30 (a typical value in this setting). Furthermore, we investigate only the case of well-spaced components with an equal proportion of observations from each. The values of {τ_k} and of the transition probabilities that we use are provided in Table 6.3.

Table 6.3: Parameter values used in the simulation study

K   State Effects      Transition Probabilities
1   τ = [0]            P = [1]
2   τ = [-1, 1]        P = [0.69 0.31]
                           [0.31 0.69]
3   τ = [-1, 0, 1]     P = [0.71 0.19 0.10]
                           [0.19 0.10 0.71]
                           [0.10 0.71 0.19]

For each model, we generate 400 data sets. Then, for each data set, we compute the test statistic (using the parametric bootstrap with V = 100 to estimate I_c(N)) and record whether the null hypothesis is rejected at the 5% level. For some models, we also estimate I_c(N) using the nonparametric bootstrap and 100 of the data sets.
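Generating one simulated process under this design can be sketched as follows: draw the initial state from the stationary distribution of P, evolve the chain, draw a patient-level intercept, and sample Poisson counts. The function name is hypothetical; the τ and P values are the K = 2 settings from Table 6.3.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_poisson_hmm(tau, P, sigma, n, rng):
    # One patient's series: stationary hidden chain, random intercept u,
    # and Y_t | Z_t = k, u ~ Poisson(exp(tau_k + u)).
    K = len(tau)
    vals, vecs = np.linalg.eig(P.T)          # stationary distribution of P
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi = pi / pi.sum()
    u = rng.normal(0.0, sigma)
    z = np.empty(n, dtype=int)
    z[0] = rng.choice(K, p=pi)
    for t in range(1, n):
        z[t] = rng.choice(K, p=P[z[t - 1]])
    y = rng.poisson(np.exp(tau[z] + u))
    return y, z

tau = np.array([-1.0, 1.0])
P = np.array([[0.69, 0.31], [0.31, 0.69]])
y, z = simulate_poisson_hmm(tau, P, sigma=1.11, n=30, rng=rng)
```

Repeating this for N patients, fitting both models, and recording rejections of H0: σ = 0 gives one cell of Table 6.4.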
In these cases, we resample with replacement 100 times from the simulated data set and compute Î_c(N) using these samples. We then recompute the test statistic using this value of Î_c(N), and again record whether the null hypothesis is rejected at the 5% level. The results of our simulations are given in Table 6.4. In the case of the parametric bootstrap, Table 6.4 shows that the observed rejection rate when σ = 0 is less than 5% for all values of N and for K = 1, 2. When K = 3, this rate is a little higher, ranging between 6 and 12%. We conclude that, overall, the test seems to control for Type I error reasonably well, even in finite samples. In terms of the power of the test, it appears that σ and N are quite influential. For the "large" values of these factors, the test may have at least 88% power. The power seems to be quite low for small sample sizes, though may be as much as 49% when σ = 1.22 and N = 10. Similarly, the power appears to be relatively low when σ = 1.11, but may be up to 29% when N = 50. With respect to the value of K, the power seems to decrease somewhat with increasing K, but the test does not seem as sensitive to K as to σ and N. As expected, since the parametric and nonparametric bootstrap methods both give consistent estimates of I_c(N) under the null hypothesis, both methods lead to good control of the Type I error when N is large.

Table 6.4: Results of the simulation study

                      Percentage Rejected
K   N    σ       Parametric Bootstrap   Nonparametric Bootstrap
1   10   0       4.25                   18
1   30   0       3.75                   NA
1   50   0       4.25                   6
1   10   1.11    10.25                  6
1   30   1.11    19.00                  NA
1   50   1.11    29.00                  4
1   10   1.22    49.25                  5
1   30   1.22    87.50                  NA
1   50   1.22    98.25                  81
2   10   0       3.25                   19
2   30   0       3.50                   NA
2   50   0       2.75                   7
2   10   1.11    10.50                  5
2   30   1.11    13.50                  NA
2   50   1.11    24.50                  3
2   10   1.22    30.50                  1
2   30   1.22    75.25                  NA
2   50   1.22    90.25                  32
3   10   0       6.00                   NA
3   30   0       12.25                  NA
3   50   0       9.00                   NA
3   10   1.11    9.50                   NA
3   30   1.11    22.50                  NA
3   50   1.11    23.75                  NA
3   10   1.22    23.75                  NA
3   30   1.22    69.50                  NA
3   50   1.22    87.75                  NA
However, it is not clear that the use of the nonparametric bootstrap results in a valid test for smaller values of N. In particular, the rejection rates when σ = 0 and N = 10 are up to 19% in this case. The power of the test when we use the nonparametric bootstrap appears to be very low in general (except perhaps when K = 1, σ = 1.22, and N = 50), especially relative to the power when we use the parametric bootstrap. Thus, based on these preliminary results, we recommend using the parametric rather than the nonparametric bootstrap to estimate I_c(N). Further investigation is, nonetheless, warranted, especially in the case where the model has been specified incorrectly.

Chapter 7

Future Work

In our final chapter, we review the contents of this thesis, and discuss possible extensions to our work. We present these ideas in the context of the design and analysis of MS/MRI clinical trials, one of our main areas of interest for future research. In Chapter 2, through our exploration of some extensions to the model of Albert et al. (1994), we illustrated some issues surrounding the selection of a HMM. Although we presented our discussion in the context of a single time series, these issues are relevant to multiple time series as well. In particular, the application of a HMM requires decisions concerning the type of model (homogeneous or non-homogeneous, stationary or non-stationary), the conditional distribution of the observed data, the number of lags in the hidden Markov chain, and the number of hidden states. These features of the model must be carefully considered when developing a HMM for a given data set. Chapter 2 also reveals that, if the model structure proposed by Albert et al. (1994) is indeed appropriate for relapsing-remitting MS/MRI data, further research regarding the asymptotic properties of non-homogeneous HMMs will be necessary.
In Chapter 3, we presented a method for consistently estimating the number of hidden states in a single, stationary HMM. As outlined in Section 3.6, there are a number of issues related to this method that are worthy of further study, such as the proposed two-stage estimation procedure, the choice of distance function and c_n, and the form of the penalty term. In addition, estimating the number of hidden states in a HMM for multiple processes is still an open question. In Chapter 4, we developed a graphical method for investigating the GOF of a single, stationary HMM. We demonstrated that, in the limit, this method will detect deviations from the proposed model with probability 1. Our GOF plots will provide a simple, useful tool for assessing the quality of the fit of models for the relapsing-remitting MS/MRI data. Outstanding topics for research include the formal assessment of the variability in the plots, and GOF techniques for HMMs for multiple processes. In Chapter 5, we proposed a general class of HMMs for multiple processes. We showed how both covariates and random effects can be used to account for inter-patient variability while maintaining a parsimonious model. Several extensions to these models are possible. First, we could allow φ to vary among patients, or among treatment arms, in (5.2). In most mixture models and HMMs, the mixing occurs over the means of the components. Nonetheless, other possibilities exist, including mixing over the variances of the components. The cost of this generalization would presumably be a loss of efficiency in estimating the other model parameters. Second, methods of estimation other than maximum likelihood may be available. For example, in the GLMM setting, penalized and marginal quasi-likelihood (Breslow & Clayton 1993), maximum hierarchical likelihood (Lee & Nelder 1996), and Monte Carlo Newton-Raphson (McCulloch 1997) have been used.
In the HMM context, these procedures may also give good asymptotic properties and be easier to implement than maximum likelihood estimation. Finally, in Chapter 6, we discussed the asymptotic properties of the maximum likelihood estimates of the models presented in Chapter 5. Formal proof of the consistency and asymptotic normality of these estimates (e.g. by verification of the regularity conditions of Bradley & Gart 1962), and hence justification for using standard hypothesis tests, is still pending. These results will be required for power calculations in the context of the design of experiments. Our primary contribution in this chapter, however, was the development of a test for the variance component in our models. We studied the performance of our test for finite samples in Section 6.5, and showed that the test seems to control for Type I error quite well, even in small samples. The power of the test seems reasonable for moderate-sized samples and variance components, and does not vary substantially with the number of hidden states. Lin (1997) and Hall & Praestgaard (2001) proved optimality properties of this test in the GLMM setting, and these properties may apply in our context as well. This test will be important in the design of clinical trials, since, if the model without random effects is adequate, then the computation of the required sample sizes will be substantially simplified. Our preliminary results suggest that HMMs may be useful models for capturing the behaviour of relapsing-remitting MS/MRI lesion count data. However, other models may provide an equally good - or better - representation of these data. For example, GLMMs are frequently employed in the analysis of longitudinal data.
Some features that are common to HMMs and GLMMs are immediately apparent: both postulate the existence of a latent (unobserved) process, both assume the conditional independence of the observed data given the values of this latent process, and both control the correlation structure of the observed data through specification of the latent process. Despite the similarities between the two models, little is known about the connections between their theoretical properties, or about the issues involved in choosing between models in applications. Thus, one area of interest for future research is the establishment of a formal link between HMMs and GLMMs. Assuming that we have identified a reasonable class of models for the MS/MRI data, two natural questions arise: how do we incorporate a treatment effect in a realistic, but interpretable, way, and which design is best for clinical trials in this setting? With respect to the first question, a treatment could be beneficial in two ways: either by reducing the mean lesion count, or by extending the time between relapses. Choosing between these perspectives, and deciding on the particular form of the treatment effect, are issues of critical importance. The second question relates to the selection of a sample size (both number of patients and number of scans per patient) and frequency of sampling that would yield a desired level of power. Ideally, we would be able to plan for the following types of studies:

1. Studies of untreated patients where lesion counts can be considered stationary (e.g. in Phase II trials, which are relatively short-term, and where MRI responses tend to be used as primary outcomes)

2. Studies of untreated patients where lesion counts may not be stationary (e.g. when patients are selected for their high level of disease activity at the commencement of the study, or in longer-term studies)

3. Studies of treated patients, whose lesion counts, we would hope, would not be stationary.
The work in this thesis provides a starting point for addressing these questions.

Appendix A

The EM Algorithm

This appendix details the implementation of the EM algorithm for three different models. This algorithm consists of an Expectation (E-) step followed by a Maximization (M-) step. In the E-step, we form the "complete" log-likelihood, which is the log-likelihood we would have if we were able to observe the hidden process and random effects as well as realizations of the process {Y_t}. We then take the expectation of the complete log-likelihood conditional on {Y_t}, and evaluate the resulting expression at the initial parameter estimates. In the M-step, we maximize this expectation over the space of the unknown parameters to get updated parameter estimates. We then use these new estimates in place of the initial parameter estimates in the next iteration. This procedure is repeated until convergence of the parameter estimates is achieved.

A.1 EM Algorithm for HMMs for Single Processes

In this section, we outline the steps of the EM algorithm for a single, homogeneous HMM with unknown initial probabilities, and with observations taking on only a finite number of values. The parameter estimates given at each iteration of the EM algorithm take on a special form in this context, and, for this reason, are usually referred to as the Forward-Backward Algorithm. These estimates will converge to the maximum likelihood estimates (Baum et al. 1970); in other words, the parameter estimates will, in the limit, maximize (2.2). We use the additional notation γ_{y,k} = P(Y_t = y | Z_t = k) for t = 1, ..., n, k = 1, ..., K, and y ∈ Ω, for some finite set Ω = {y_1, ..., y_d}.

E-Step

Thinking of the hidden states as "missing", we can write the complete likelihood, L_c(ψ), as

L_c(ψ) = f(y | z, ψ) f(z; ψ) = Π_{t=1}^n γ_{y_t, z_t} · π_{z_1} Π_{t=2}^n P_{z_{t-1}, z_t}.

Let ψ^p = (γ^p_{y_1,1}, ..., γ^p_{y_d,K}, π^p_1, ..., π^p_K, P^p_{1,1}, ..., P^p_{K,K}) be the estimates of all unknown parameters at iteration p. Then the E-step is
E[log L_c(ψ) | Y, ψ^p] = Σ_{t=1}^n E[log γ_{y_t, Z_t} | Y, ψ^p] + E[log π_{Z_1} | Y, ψ^p] + Σ_{t=2}^n E[log P_{Z_{t-1}, Z_t} | Y, ψ^p]

= Σ_{t=1}^n Σ_{z_t=1}^K f(z_t | y; ψ^p) log γ_{y_t, z_t}   (A.1)

+ Σ_{z_1=1}^K f(z_1 | y; ψ^p) log π_{z_1}   (A.2)

+ Σ_{t=2}^n Σ_{z_{t-1}=1}^K Σ_{z_t=1}^K f(z_{t-1}, z_t | y; ψ^p) log P_{z_{t-1}, z_t}.   (A.3)

M-Step

We now need to maximize this expectation over all of the unknown parameters in the model. Since {γ_{y,k}}, {π_k}, and {P_{kℓ}} appear in separate terms, we can maximize each term individually. Let Y_s^t = (Y_s, ..., Y_t). Define the forward probabilities at the pth iteration as

W_t^p(k) = P(Y_1^t = y_1^t, Z_t = k; ψ^p)

and the backward probabilities as

X_t^p(k) = P(Y_{t+1}^n = y_{t+1}^n | Z_t = k; ψ^p).

These probabilities can be computed recursively as

W_{t+1}^p(k) = ( Σ_{ℓ=1}^K W_t^p(ℓ) P^p_{ℓk} ) γ^p_{y_{t+1}, k}   (A.4)

and

X_t^p(k) = Σ_{ℓ=1}^K γ^p_{y_{t+1}, ℓ} X_{t+1}^p(ℓ) P^p_{kℓ},   (A.5)

with W_1^p(k) = π^p_k γ^p_{y_1, k} and X_n^p(k) = 1 (by convention). First we consider the maximization of (A.2), and hence the derivation of π_k^{p+1}, k = 1, ..., K. Since we have the constraint Σ_{k=1}^K π_k = 1, we use the method of Lagrange multipliers and maximize the function

g_1 = Σ_{z_1=1}^K f(z_1 | y; ψ^p) log π_{z_1} − λ( Σ_{k=1}^K π_k − 1 ).

Then

∂g_1/∂π_k = (1/π_k) P(Z_1 = k | Y, ψ^p) − λ,

and setting these derivatives equal to 0 gives the equations

π_k^{p+1} = (1/λ) P(Z_1 = k | Y, ψ^p),   Σ_{k=1}^K π_k^{p+1} = 1,

which we can solve to get

π_k^{p+1} = W_1^p(k) X_1^p(k) / Σ_{ℓ=1}^K W_1^p(ℓ) X_1^p(ℓ).   (A.6)

Next consider the maximization of (A.3) and the derivation of P_{kℓ}^{p+1}. We have the constraint that Σ_{ℓ=1}^K P_{kℓ} = 1 for each k. Thus, we see that we need to maximize the function

g_2 = Σ_{t=2}^n Σ_{z_{t-1}=1}^K Σ_{z_t=1}^K f(z_{t-1}, z_t | y; ψ^p) log P_{z_{t-1}, z_t} − Σ_{k=1}^K λ_k ( Σ_{ℓ=1}^K P_{kℓ} − 1 ).

Then

∂g_2/∂P_{kℓ} = (1/P_{kℓ}) Σ_{t=2}^n P(Z_{t-1} = k, Z_t = ℓ | Y, ψ^p) − λ_k,

and setting these derivatives equal to 0 gives the equations

P_{kℓ}^{p+1} = (1/λ_k) Σ_{t=2}^n P(Z_{t-1} = k, Z_t = ℓ | Y, ψ^p),   Σ_{ℓ=1}^K P_{kℓ}^{p+1} = 1,

which we can solve to get

P_{kℓ}^{p+1} = Σ_{t=2}^n W_{t-1}^p(k) P^p_{kℓ} γ^p_{y_t, ℓ} X_t^p(ℓ) / Σ_{t=2}^n W_{t-1}^p(k) X_{t-1}^p(k).   (A.7)

Finally, we maximize (A.1) and derive γ_{y,k}^{p+1}. We have the constraint that Σ_{y∈Ω} γ_{y,k} = 1 for each k.
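The recursions (A.4)-(A.5) and the update (A.6) can be sketched numerically. This is a minimal sketch using unscaled probabilities (so it is only suitable for short sequences); the emission matrix B, the observation sequence, and the parameter values are illustrative.

```python
import numpy as np

def forward_backward(pi, P, B, obs):
    # B[k, y] = P(Y_t = y | Z_t = k); obs is a sequence of symbol indices.
    # Forward:  W_t(k) = P(Y_1..Y_t = y_1..y_t, Z_t = k)      -- (A.4)
    # Backward: X_t(k) = P(Y_{t+1}..Y_n = y_{t+1}..y_n | Z_t = k) -- (A.5)
    n, K = len(obs), len(pi)
    W = np.zeros((n, K))
    X = np.ones((n, K))      # X_n(k) = 1 by convention
    W[0] = pi * B[:, obs[0]]
    for t in range(1, n):
        W[t] = (W[t - 1] @ P) * B[:, obs[t]]
    for t in range(n - 2, -1, -1):
        X[t] = P @ (B[:, obs[t + 1]] * X[t + 1])
    return W, X

pi = np.array([0.5, 0.5])
P = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 0, 1, 0]
W, X = forward_backward(pi, P, B, obs)
lik = W[-1].sum()                 # likelihood of the observed sequence
post0 = W[0] * X[0] / lik         # update (A.6): new initial probabilities
```

A useful internal check is that Σ_k W_t(k) X_t(k) equals the likelihood for every t, which the test below verifies.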
Thus the function to be maximized is

g_3 = Σ_{t=1}^n Σ_{z_t=1}^K f(z_t | y; ψ^p) log γ_{y_t, z_t} − Σ_{k=1}^K λ_k ( Σ_{y∈Ω} γ_{y,k} − 1 ).

Then

∂g_3/∂γ_{y,k} = (1/γ_{y,k}) Σ_{t: y_t = y} P(Z_t = k | Y, ψ^p) − λ_k,

and setting these derivatives equal to 0 gives the equations

γ_{y,k}^{p+1} = (1/λ_k) Σ_{t: y_t = y} P(Z_t = k | Y, ψ^p),   Σ_{y∈Ω} γ_{y,k}^{p+1} = 1,

which we can solve to get

γ_{y,k}^{p+1} = Σ_{t: y_t = y} W_t^p(k) X_t^p(k) / Σ_{t=1}^n W_t^p(k) X_t^p(k).   (A.8)

A.2 EM Algorithm for HMMs with Random Effects in the Observed Process

In this section, we give the steps of the EM algorithm required to estimate the parameters of the model (5.4) in the case where the random effects are patient-specific. We assume that {π_k} are unknown parameters to be estimated. The convergence properties of the sequence of estimators given by the EM algorithm have not been studied in the context of HMMs for multiple processes, but Wu (1983) provides sufficient conditions in the general case. One of these conditions is that the true parameters be in the interior of the parameter space. In other words, we require that the variance components be non-zero, and that 0 < P_{kℓ} < 1 for each k and ℓ where P_{kℓ} is unknown.

E-Step

Recall that, for the model (5.4), the hidden states are assumed to be independent of the random effects. Now, thinking of both the hidden states and the random effects as "missing", the complete likelihood for this model is

L_c(ψ) = f(y | z, u, ψ) f(z; ψ) f(u; ψ) = Π_{i=1}^N { Π_{t=1}^{n_i} f(y_it | z_it, u_i, ψ) · π_{z_{i1}} Π_{t=2}^{n_i} P_{z_{i,t-1}, z_it} · f(u_i; ψ) }.

So the contribution to the complete log-likelihood from patient i, log L_c^i(ψ), is

log L_c^i(ψ) = Σ_{t=1}^{n_i} log f(y_it | z_it, u_i, ψ) + log π_{z_{i1}} + Σ_{t=2}^{n_i} log P_{z_{i,t-1}, z_it} + log f(u_i; ψ).

Using the fact that observations on different patients are independent (so that Z_it and u_i are independent of Y_j for j ≠ i), the E-step is
E[log L_c(ψ) | Y, ψ^p] = Σ_{i=1}^N E[log L_c^i(ψ) | Y_i, ψ^p]

= Σ_{i=1}^N Σ_{t=1}^{n_i} Σ_{k=1}^K ∫ log f(y_it | Z_it = k, u_i, ψ) P(Z_it = k | y_i, u_i, ψ^p) f(u_i | y_i, ψ^p) du_i   (A.9)

+ Σ_{i=1}^N Σ_{k=1}^K log π_k P(Z_{i1} = k | y_i, ψ^p)   (A.10)

+ Σ_{i=1}^N Σ_{t=2}^{n_i} Σ_{k=1}^K Σ_{ℓ=1}^K log P_{kℓ} P(Z_{i,t-1} = k, Z_it = ℓ | y_i, ψ^p)   (A.11)

+ Σ_{i=1}^N ∫ log f(u_i; ψ) f(u_i | y_i; ψ^p) du_i.   (A.12)

M-Step

Again, since each unknown parameter appears in exactly one of the terms (A.9)-(A.12), we can maximize each of these terms individually. Now define the forward probabilities for patient i at the pth iteration as

W_it^p(k, u_i) = f(y_i1, ..., y_it | Z_it = k, u_i, ψ^p) P(Z_it = k | ψ^p)

and the backward probabilities as

X_it^p(k, u_i) = f(y_{i,t+1}, ..., y_{i,n_i} | Z_it = k, u_i, ψ^p).

With a slight change of notation, the recursions (A.4) and (A.5) also apply to these new definitions:

W_{i,t+1}^p(k, u_i) = ( Σ_{ℓ=1}^K W_it^p(ℓ, u_i) P^p_{ℓk} ) f(y_{i,t+1} | Z_{i,t+1} = k, u_i, ψ^p)

and

X_it^p(k, u_i) = Σ_{ℓ=1}^K f(y_{i,t+1} | Z_{i,t+1} = ℓ, u_i, ψ^p) X_{i,t+1}^p(ℓ, u_i) P^p_{kℓ},

with W_i1^p(k, u_i) = π^p_k f(y_i1 | Z_i1 = k, u_i, ψ^p) and X_{i,n_i}^p(k, u_i) = 1 (again, by convention). Applying the same method used to derive (A.6), we can now show that the maximum value of (A.10) occurs at

π_k^{p+1} = (1/N) Σ_{i=1}^N [ ∫ W_i1^p(k, u_i) X_i1^p(k, u_i) f(u_i; ψ^p) du_i / Σ_{ℓ=1}^K ∫ W_i1^p(ℓ, u_i) X_i1^p(ℓ, u_i) f(u_i; ψ^p) du_i ].   (A.13)

Similarly, applying the method used to derive (A.7), it is clear that the maximum value of (A.11) occurs at

P_{kℓ}^{p+1} = Σ_{i=1}^N Σ_{t=2}^{n_i} ∫ W_{i,t-1}^p(k, u_i) P^p_{kℓ} f(y_it | Z_it = ℓ, u_i, ψ^p) X_it^p(ℓ, u_i) f(u_i; ψ^p) du_i / Σ_{i=1}^N Σ_{t=2}^{n_i} ∫ W_{i,t-1}^p(k, u_i) X_{i,t-1}^p(k, u_i) f(u_i; ψ^p) du_i.

If {Z_it} is not stationary, numerical maximization will be required in order to compute P_{kℓ}^{p+1}. In general, the terms (A.9) and (A.12) must also be maximized numerically, using, for example, a Gaussian quadrature technique. However, we can simplify the computations by noting that (A.9) can be written as

Σ_{i=1}^N Σ_{t=1}^{n_i} Σ_{k=1}^K ∫ log f(y_it | Z_it = k, u_i, ψ) · W_it^p(k, u_i) X_it^p(k, u_i) f(u_i; ψ^p) / [ ∫ Σ_{ℓ=1}^K W_it^p(ℓ, u_i) X_it^p(ℓ, u_i) f(u_i; ψ^p) du_i ] du_i   (A.14)

and (A.12) as

Σ_{i=1}^N ∫ log f(u_i; ψ) · Σ_{ℓ=1}^K W_{i,n_i}^p(ℓ, u_i) X_{i,n_i}^p(ℓ, u_i) f(u_i; ψ^p) / [ ∫ Σ_{ℓ=1}^K W_{i,n_i}^p(ℓ, u_i) X_{i,n_i}^p(ℓ, u_i) f(u_i; ψ^p) du_i ] du_i.   (A.15)

Since f(y_it | z_it, u_i, ψ) is in the exponential family, log f(y_it | z_it, u_i, ψ) will have a nice form.
Likewise, if f(u_i; ψ) is in the exponential family, log f(u_i; ψ) will also have a nice form. Thus, for certain choices of these functions, we would expect that the estimates of the parameters associated with these distributions would be quite easy to compute. Thus, we see that, aside from the question of integration, the EM algorithm for the model with patient-specific random effects in the conditional model for the observed data is not much different from the case where the algorithm is applied to a single HMM. If the random effects are not patient-specific, this algorithm can still be used, but the expressions (A.9)-(A.12) will be more complicated, since we will need to take the expectations conditional on the full data set, Y, rather than on each patient's data individually.

A.3 EM Algorithm for HMMs with Random Effects in the Hidden Process

In this section, we give the steps of the EM algorithm required to estimate the parameters of the model (5.13), assuming again that the random effects are patient-specific. We further assume that π_ik = π_k, i.e. that the initial probabilities are fixed, unknown parameters common to all patients. As mentioned in Section A.2, the convergence properties of the EM algorithm are unknown in the context of HMMs for multiple processes. Wu (1983), however, provides sufficient conditions for convergence in a very general setting. In particular, the variance components must be strictly positive.

E-Step

For this model, thinking of the hidden states and the random effects as "missing" data, the complete likelihood is

L_c(ψ) = f(y | u, z, ψ) f(z | u, ψ) f(u; ψ) = Π_{i=1}^N { Π_{t=1}^{n_i} f(y_it | z_it, u_i, ψ) · π_{z_{i1}} Π_{t=2}^{n_i} f(z_it | z_{i,t-1}, u_i, ψ) · f(u_i; ψ) }.

So

log L_c^i(ψ) = Σ_{t=1}^{n_i} log f(y_it | z_it, u_i, ψ) + log π_{z_{i1}} + Σ_{t=2}^{n_i} log f(z_it | z_{i,t-1}, u_i, ψ) + log f(u_i; ψ).

Then, using the assumption that the random effects are patient-specific,

E[log L_c(ψ) | Y, ψ^p] = Σ_{i=1}^N E[log L_c^i(ψ) | Y_i, ψ^p]
= Σ_{i=1}^N Σ_{t=1}^{n_i} Σ_{k=1}^K ∫ log f(y_it | Z_it = k, u_i, ψ) P(Z_it = k | y_i, u_i, ψ^p) f(u_i | y_i, ψ^p) du_i   (A.16)

+ Σ_{i=1}^N Σ_{k=1}^K log π_k P(Z_{i1} = k | y_i, ψ^p)   (A.17)

+ Σ_{i=1}^N Σ_{t=2}^{n_i} Σ_{k=1}^K Σ_{ℓ=1}^K ∫ log P(Z_it = ℓ | Z_{i,t-1} = k, u_i, ψ) P(Z_{i,t-1} = k, Z_it = ℓ | y_i, u_i, ψ^p) f(u_i | y_i, ψ^p) du_i   (A.18)

+ Σ_{i=1}^N ∫ log f(u_i; ψ) f(u_i | y_i; ψ^p) du_i.   (A.19)

M-Step

Again, we see that each unknown parameter appears in exactly one of the terms (A.16)-(A.19), so we can maximize each of these terms individually. For this model, define the forward probabilities for patient i as

W_it^p(k, u_i) = f(y_i1, ..., y_it | Z_it = k, u_i, ψ^p) P(Z_it = k | u_i, ψ^p)

and the backward probabilities as

X_it^p(k, u_i) = f(y_{i,t+1}, ..., y_{i,n_i} | Z_it = k, u_i, ψ^p).

We then have the recursions

W_{i,t+1}^p(k, u_i) = ( Σ_{ℓ=1}^K W_it^p(ℓ, u_i) P(Z_{i,t+1} = k | Z_it = ℓ, u_i, ψ^p) ) f(y_{i,t+1} | Z_{i,t+1} = k, u_i, ψ^p)

and

X_it^p(k, u_i) = Σ_{ℓ=1}^K f(y_{i,t+1} | Z_{i,t+1} = ℓ, u_i, ψ^p) X_{i,t+1}^p(ℓ, u_i) P(Z_{i,t+1} = ℓ | Z_it = k, u_i, ψ^p),

with W_i1^p(k, u_i) = π^p_k f(y_i1 | Z_i1 = k, u_i, ψ^p) and X_{i,n_i}^p(k, u_i) = 1 (again, by convention). Since the parameters π_ik = π_k are fixed, the maximum value of (A.17) occurs at (A.13), as for Model I (but using the above definitions of W_it^p(k, u_i) and X_it^p(k, u_i)). However, the evaluation of the integrals will be much more complicated in this case, for the reasons discussed in Section 5.4. Similarly, the expressions (A.16) and (A.19) are equivalent to (A.14) and (A.15), respectively, but will be much more difficult to compute. For this model, we will also need to numerically maximize (A.18). However, by writing this expression in terms of the forward and backward probabilities, we can obtain the simpler form

Σ_{i=1}^N Σ_{t=2}^{n_i} Σ_{k=1}^K Σ_{ℓ=1}^K ∫ log P(Z_it = ℓ | Z_{i,t-1} = k, u_i, ψ) q_it^p(k, ℓ, u_i) f(u_i; ψ^p) / [ ∫ Σ_{ℓ=1}^K W_{i,n_i}^p(ℓ, u_i) X_{i,n_i}^p(ℓ, u_i) f(u_i; ψ^p) du_i ] du_i,

where

q_it^p(k, ℓ, u_i) = P(Z_it = ℓ | Z_{i,t-1} = k, u_i, ψ^p) W_{i,t-1}^p(k, u_i) X_it^p(ℓ, u_i) f(y_it | Z_it = ℓ, u_i, ψ^p).

Again, evaluating these integrals may prove difficult.
In summary, we see that the EM algorithm may be of only limited use in the estimation of the parameters of Model II because of the complex nature of the integrals involved.

Appendix B

Proofs

B.1 Proof of Lemma 3.1

Here we prove Lemma 3.1. For ease of exposition, we assume m = 1. However, the theory is valid for all finite values of m. Since the measures μ_n are tight (as a result of Condition 2), there exists a subsequence {G_{n_j}} such that G_{n_j} converges weakly to G, where G is a distribution function (Billingsley 1995, Theorem 29.3). We have that

d_KS{F(y, G), F(y, G_0)} ≤ d_KS{F(y, G), F(y, G_{n_j})} + d_KS{F(y, G_{n_j}), F(y, G_0)}.   (B.1)

By our hypothesis, the second term on the right-hand side is o(1). We claim that the first term is also o(1). Taking limits in (B.1) will imply that d_KS{F(y, G), F(y, G_0)} = 0, and hence, by Condition 5, that G = G_0. Therefore, G_{n_j} converges weakly to G_0. Since the measures μ_n are tight, and since every subsequence of {G_n} that converges weakly at all converges weakly to G_0, we may conclude that G_n converges weakly to G_0 (Billingsley 1995, Theorem 29.3). We now show that d_KS{F(y, G), F(y, G_{n_j})} = o(1). By Condition 4, for all ε there exists A > 0 such that for all (θ, φ) ∈ Θ,

H(A; θ, φ) − H(−A; θ, φ) > 1 − ε.

In particular, for y < −A, H(y; θ, φ) < ε. So, for all y < −A,

F(y, G_{n_j}) = ∫_Θ H(y; θ, φ) dG_{n_j}(θ, φ) ≤ ε ∫_Θ dG_{n_j}(θ, φ) = ε.   (B.2)

Likewise, for y < −A, F(y, G) ≤ ε. Similarly, for y > A, 1 − F(y, G_{n_j}) and 1 − F(y, G) are also bounded by ε. Since G_{n_j} converges weakly to G and H(y; θ, φ) is continuous in θ and φ, we have that

F(y, G) − F(y, G_{n_j}) = ∫ H(y; θ, φ) d{G(θ, φ) − G_{n_j}(θ, φ)} → 0   (B.3)

for each y (Billingsley 1995, Theorem 29.1). We claim that F(y, G_{n_j}) converges not only pointwise but uniformly to F(y, G) on the interval [−A, A]. It is enough to show that F(y_j, G_{n_j}) converges to F(y, G) for all sequences y_j ↓ y, with y_j, y ∈ [−A, A].
(See Strichartz 1995, Theorem 7.3.5, with a slight modification to account for the fact that the $\{F(\cdot, G_{n_j})\}$ are right-continuous rather than continuous.)

Fix $\epsilon$, $y$, and a sequence $\{y_j\}$ such that $y_j \downarrow y$. Define $F_j(y) = F(y, G_{n_j})$ and $F(y) = F(y, G)$. Now

$$|F_j(y_j) - F(y)| \le |F_j(y_j) - F_j(y)| + |F_j(y) - F(y)|. \qquad (B.4)$$

By the right-continuity of $F$, there exists $\delta > 0$ such that

$$|F(y + \delta) - F(y)| < \epsilon. \qquad (B.5)$$

By (B.3), there exists $N$ such that for $j > N$,

$$|F_j(y) - F(y)| < \epsilon, \qquad (B.6)$$

and there exists $N_\delta$ such that for $j > N_\delta$,

$$|F_j(y + \delta) - F(y + \delta)| < \epsilon. \qquad (B.7)$$

Since $y_j \downarrow y$ and $F_j$ is non-decreasing, there exists $N_1$ such that for $j > N_1$,

$$|F_j(y_j) - F_j(y)| \le |F_j(y + \delta) - F_j(y)|. \qquad (B.8)$$

Combining results (B.4)-(B.8), we have that for $j > \max(N, N_\delta, N_1)$,

$$|F_j(y_j) - F(y)| \le |F_j(y + \delta) - F_j(y)| + |F_j(y) - F(y)| \le |F_j(y + \delta) - F(y + \delta)| + |F(y + \delta) - F(y)| + |F(y) - F_j(y)| + |F_j(y) - F(y)| < 4\epsilon.$$

Thus, $F_j$ converges uniformly to $F$ on the interval $[-A, A]$, implying that for all $\epsilon$ there exists $N^*$ such that for all $j > N^*$ and for all $y \in [-A, A]$, $|F(y, G) - F(y, G_{n_j})| < \epsilon$. Using (B.2), we see that this result holds for all $y$, and hence

$$d_{KS}\{F(y, G), F(y, G_{n_j})\} = o(1),$$

as desired. □

Although Lemma 3.1 is stated in terms of deterministic mixing distributions, we in fact require that, for a sequence $\{G_n\}$ of estimated mixing distributions, $d_{KS}\{F(y, G_n), F(y, G_0)\} = o(1)$ a.s. implies that $G_n$ converges weakly to $G_0$ a.s. However, it is easy to see that this requirement is met when Conditions 2-5 are satisfied. Denote by $G_n(\omega)$ the terms of the estimated sequence of mixing distributions associated with the sample point $\omega$. Since $(\theta_j, \phi_j) \in \Theta$ for all $j$, Lemma 3.1 applies to each sequence $\{G_n(\omega)\}$ for which $d_{KS}\{F(y, G_n(\omega)), F(y, G_0)\} = o(1)$.

B.2 Proof of Theorem 6.1

To develop the necessary bounds, we begin by examining terms of the form $E|D^*_{m_1} D^*_{m_2}|$, where $m_1, m_2 \le M$.
The extension to the case where we have a product of more than two such derivatives is straightforward. Our proof relies on the fact that $D^*_m$ is a derivative of $\log f(\mathbf{y}_i \mid u_i, \psi)$, and that we compute the expectation of products of these derivatives with respect to the null distribution, $f(\mathbf{y}_i \mid u_i = 0, \psi)$. For our model, $f(\mathbf{y}_i \mid u_i, \psi)$ is the likelihood of a stationary HMM. In this way, our problem effectively reduces to that considered by BRR.

To avoid having to work with mixed partial derivatives, we will use the claim of BRR that, without loss of generality, $\xi$ can be taken to be unidimensional. (Recall that $\psi = (\xi, D)$, so that $\xi$ does not include parameters associated with the random effect.) Also following BRR, any operation between two sequences is understood to be performed termwise: for example, for sequences $I = (I_1, \ldots, I_m)$ and $J = (J_1, \ldots, J_m)$, and functions $\{f_l\}_{l=1}^{m}$, we would have $f_J(I) = (f_{J_1}(I_1), \ldots, f_{J_m}(I_m))$.

Define $Y_{is}^{t} = (Y_{is}, \ldots, Y_{it})$. We let $h_i(1) = \log f(Y_{i1}, Z_{i1} \mid u_i, \psi)$ and, for $t > 1$, we define $h_i(t) = \log f(Y_{it}, Z_{it} \mid Y_{i1}^{t-1}, Z_{i1}^{t-1}, u_i, \psi)$. As a result of these definitions, we have that $\log f(\mathbf{Y}_i, \mathbf{Z}_i \mid u_i) = \sum_{t=1}^{n_i} h_i(t)$. Let $h_i^{(m)}(t)$ represent the $m$th derivative of $h_i(t)$ with respect to $\xi$, evaluated at $u_i = 0$. By Conditions 1 and 2, these derivatives exist for $m \le M$.

Using the notation of BRR, we denote the cumulant of a random vector $X = (X_1, \ldots, X_m)$ by

$$\Gamma(X) = \Gamma(X_1, \ldots, X_m) \equiv \frac{\partial^m}{\partial \lambda_1 \cdots \partial \lambda_m} \log \Big( E\Big[ e^{\iota (\lambda_1 X_1 + \cdots + \lambda_m X_m)} \Big] \Big) \Big|_{\lambda_1 = \cdots = \lambda_m = 0},$$

where $\iota = \sqrt{-1}$. The cumulant of the conditional distribution of $X$ given $Y$ will be denoted by $\Gamma^Y(X)$. Finally, let $\chi'(X_1) = X_1$, and let $\chi(X_1) = E[X_1]$. We define the centred moment function (Saulis & Statulevicius 1991), $\chi$, recursively by

$$\chi'(X_1, X_2, \ldots, X_m) = X_1 \big( \chi'(X_2, \ldots, X_m) - \chi(X_2, \ldots, X_m) \big),$$
$$\chi(X_1, \ldots, X_m) = E[\chi'(X_1, \ldots, X_m)], \qquad m = 2, 3, \ldots$$

We denote the centred moment of the conditional distribution of $X$ given $Y$ by $\chi^Y(X)$.
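The recursive definition of the centred moment function translates directly into code. The following sketch is an illustration only, with invented data and with expectations replaced by Monte Carlo averages; it implements $\chi'$ and $\chi$, and checks that for $m = 2$ the recursion reduces to $\chi(X_1, X_2) = E[X_1 (X_2 - E X_2)]$, the covariance of $X_1$ and $X_2$.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def chi_prime(samples):
    """chi'(X_1,...,X_m) evaluated samplewise:
    chi'(X_1) = X_1;
    chi'(X_1,...,X_m) = X_1 * (chi'(X_2,...,X_m) - chi(X_2,...,X_m))."""
    if len(samples) == 1:
        return samples[0][:]
    tail = chi_prime(samples[1:])
    c = mean(tail)                     # chi(X_2,...,X_m) = E[chi'(X_2,...,X_m)]
    return [x1 * (t - c) for x1, t in zip(samples[0], tail)]

def chi(samples):
    """Centred moment chi(X_1,...,X_m) = E[chi'(X_1,...,X_m)]."""
    return mean(chi_prime(samples))

random.seed(0)
n = 100_000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [0.5 * xi + random.gauss(0.0, 1.0) for xi in x]   # correlated with x; Cov = 0.5

# For m = 2 the recursion gives chi(X, Y) = E[X (Y - E Y)] = Cov(X, Y).
mx, my = mean(x), mean(y)
cov = mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])
print(chi([x, y]), cov)
```

The samplewise recursion extends unchanged to higher orders $m$, which is what makes the cumulant bounds below computable in principle.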
Y In their Proposition 3.1, B R R show how to write the derivatives of the log-likelihood of a single, stationary H M M as a linear combination of cumulants. It turns out that their result also holds for derivatives of l o g / ( y ; I uui)) if we replace the distribution of each random variable in their Equation 4 with its distribution conditional on u,;. In particular, D! JeJ+M l l J < V ! J ! ml E JeJ+(m) (B.9) ÎJÎ! B R R provide bounds on terms of the form T ' . Specifically, from their h\ \tfj Yi Ui 3 Lemma 3.3(i), Tli r <"(5>i (t) Y J) < E E E M (Ki,K ) E V V (B.10) • I I ^ ' " " ^ " ^ , ) ) ) q=l where 1+J K = { 1 , . . . , | J|} denotes the set of all ?;-block partitions of the set { 1 , . . . , | J|}, M Q are non-negative combinatorial constants, J(K ) Q = {Ji< , Jh' . qi q 2 ,•••)> l I(-^g) a n ( = (^K , qi V Ix , • q2 Then, letting A (I) = max{/j} — min{/,-}., Equation 11 of B R R gives the following relationship: fi | x Y - {h[ J{Ki,) \l(K ))) Q \ <2 ^ - v E ^ W , ) ) g q=l Cj.ÇYj^ip). j=l where /? = 1 — min j m i n P ^ , mmP A, with P *^ = k fc 7r P /7r . f t t f c Under Condition 1, B R R show that 0 < p < 1. We can now use these results to establish bounds on E From the multilinearity m i m-2 of the cumulant function and (B.9), we can write E m,]l < E k E j»ej7-+(m,) n, r Y ^ ( E ^ J l ) W m ! (B.ll) 2 Using the bound (B.10), the right-hand side of ( B . l l ) becomes a linear combination of terms of the form V-2 X ,«71=1 Let the values t 92= I < t x < ••• < t 2 denote the distinct elements of the union of I i and I . r 2 Using the technique in Equation 12 of B R R and the definition of E X^ n (hf ^\i\K ))) Y (I mxrn2 (ip), • n i qi [qi = l B l 92 = 1 7 K maxL J E < X s=l 2 |J' I>: 2 2 1 1 P < 2 1 P n max L T É z s=l < .»2e{l 1 1 |J i-i EI =iA(ii(A' ! )). u i-i E;:=,A(i2(A'4)). 1 = J €{1,...,|J |}: '-'il s i ^ i - i ^ U i (Y,u.O) 2 J ! 2 ./ e{i,...,U |}: jr 1 2 C) n 7* '' j e{i,...,|ji|}'.,=/] JI JJ JMJJ J ! 2 ">2 2 A(ii«)). 
a u ' i - i p E ^ A(i2(At )). r j jM 2 nJ 2 ! • B m i m 2 (V0- Then, from (B.10) and Lemma 3.3(h) of B R R , it is clear that iii E r »- (£hf\t) Y \t=i < |J l-i UM-i 2 U |! TiylJ !! 1 riJ ' 2 n j 1 2 ' ^ 2 w Substituting into ( B . l l ) and using Lemma 4.1(h) of B R R , we conclude that E B D* <n m \m \B M[l 1 mi 2 i + t 1 2 mim m-2 8 1-p, mi +7712—2 and, more generally, that 8 E Dm, x • • • x This proves our theorem. • <nS!x--'Xm R F !5; ( I ...^) 1+ \£?=i"v- d B.3 Proof of Theorem 6.2 To prove Theorem 6.2, it suffices to show that B*..(ip) is finite and independent of i. To this end, we first study the random variable C {Yiu As an aside, B R R ' s definition l m of C^yu, ip) also involves a supremum over all values of the parameters in the neighbourhood of the true values of {&}. These authors require this generalization since they apply their results to problems relating to the MLEs, but it is unnecessary for our purposes. Under Condition 1, 0 < Pu < 1, and hence 0 < 7r < 1, for all k,£. We then have k max max < D k,£ D,„ [ T O log P < oo D lbg7r/fc u m for all m. Furthermore, this quantity is independent of i and {Yn). Thus," for our purposes, we ignore these terms, and focus our attention on the term max max Dm log fiVu I Z = k,Ui,lJ))u ; = 0 D k it for m < M. m When a((j>) is a constant (e.g. when f(y | z , u,, ip) is in the Poisson or binomial family), it it we need only be concerned with derivatives with respect to {r } and Uj. Since for our model fc Vuk = T~k + Ui, we can equivalently consider derivatives with respect to r] k- We have that it d c(Vitk) Vu log/(r/« I Z = k,Ui,ip) drjitk Uj=0 it a(0) (B.12) u =0 t d and m _ log/(yi4 I Z = k,v,i,ip) d -q,'itk m ' c(r)uk) d' riitk d n n it a(0). u =0 z 2 < m < M. When f(y (B.13) Ui=0 \ z , u^, ip) is in the Poisson family, c(r/) = e , and when v it it f(Uit I z , Ui, ip) is in the binomial family, c(rf) = log(l + e ). 
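The forms (B.12) and (B.13) can be verified numerically for the Poisson case, where $a(\phi) = 1$ and $c(\eta) = e^{\eta}$, so that $\log f(y \mid \eta) = y\eta - e^{\eta} - \log y!$. The sketch below uses invented values of $y$ and $\eta$ and compares the analytic derivatives $y - c'(\eta)$ and $-c''(\eta)$ with central finite differences.

```python
import math

def logf(y, eta):
    """Poisson log-density in canonical form: y*eta - c(eta) - log(y!),
    with c(eta) = exp(eta)."""
    return y * eta - math.exp(eta) - math.lgamma(y + 1)

def num_deriv(f, x, h=1e-5):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

y, eta = 3, 0.7
# (B.12): d/d eta log f = [y - c'(eta)] / a(phi), with a(phi) = 1.
score = y - math.exp(eta)
# (B.13), m = 2: d^2/d eta^2 log f = -c''(eta) / a(phi) = -exp(eta).
hess = -math.exp(eta)

print(score, num_deriv(lambda e: logf(y, e), eta))
print(hess, num_deriv(lambda e: y - math.exp(e), eta))
```

Because every derivative of $c(\eta) = e^{\eta}$ is again $e^{\eta}$, the boundedness of $\{\tau_k\}$ immediately bounds all the terms (B.13), which is the point of the argument that follows.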
Since, by assumption, $\{\tau_k\}$ are bounded, in both of these cases all derivatives of $c(\eta_{itk})$ (and hence expectations of products of derivatives of the form (B.13)) will be bounded and independent of $i$. When derivatives of the form (B.12) appear as factors in the product of interest, we note that the absolute moments of $Y_{it} \mid Z_{it}, u_i$ are finite and independent of $i$ in both the Poisson and binomial families. Thus, $B_{m_1 \cdots m_d}(\psi)$ is also finite and independent of $i$, which proves Theorem 6.2 for these cases.

When $a(\phi)$ is not constant, we need to consider partial derivatives with respect to $\phi$ as well as $\{\tau_k\}$ and $u_i$. When these derivatives include at least one derivative with respect to $\eta_{itk}$, we have

$$\frac{\partial^{\ell+1}}{\partial \eta_{itk}\, \partial \phi^{\ell}} \log f(y_{it} \mid Z_{it} = k, u_i, \psi) \Big|_{u_i = 0} = \Big( y_{it} - \frac{d}{d\eta_{itk}} c(\eta_{itk}) \Big|_{u_i = 0} \Big) \frac{d^{\ell}}{d\phi^{\ell}} a^{-1}(\phi), \qquad \ell = 0, 1, \ldots, M - 1, \qquad (B.14)$$

and

$$\frac{\partial^{\ell+m}}{\partial \eta_{itk}^m\, \partial \phi^{\ell}} \log f(y_{it} \mid Z_{it} = k, u_i, \psi) \Big|_{u_i = 0} = -\frac{d^m}{d\eta_{itk}^m} c(\eta_{itk}) \Big|_{u_i = 0}\, \frac{d^{\ell}}{d\phi^{\ell}} a^{-1}(\phi), \qquad \ell + m \le M,\; m \ge 2. \qquad (B.15)$$

Since $\{\tau_k\}$ and $\phi$ are bounded, as long as the derivatives of $a^{-1}(\phi)$ are well-behaved and the absolute moments of $Y_{it} \mid Z_{it}$ and the derivatives of $c(\eta_{itk})$ are finite and independent of $i$, the expected values of products of such terms are finite and independent of $i$. In particular, when $f(y_{it} \mid z_{it}, u_i)$ is in the normal family with parameterization

$$\log f(y_{it} \mid Z_{it} = k, u_i, \psi) = \frac{\eta_{itk} y_{it} - \frac{1}{2}\eta_{itk}^2}{\phi} - \frac{y_{it}^2}{2\phi} - \frac{1}{2}\log(2\pi\phi), \qquad (B.16)$$

then all absolute moments are finite, and the derivatives of $a^{-1}(\phi) = \frac{1}{\phi}$ and $c(\eta_{itk}) = \frac{1}{2}\eta_{itk}^2$ are bounded since $\{\tau_k\}$ and $\phi$ are bounded. Likewise, when $f(y_{it} \mid z_{it}, u_i)$ is in the gamma family with parameterization

$$\log f(y_{it} \mid Z_{it} = k, u_i, \psi) = \big( \eta_{itk} y_{it} + \log(-\eta_{itk}) \big) \phi + \phi \log \phi + (\phi - 1) \log y_{it} - \log \Gamma(\phi), \qquad (B.17)$$

then again we see that all absolute moments are finite, and that the derivatives of $a^{-1}(\phi) = \phi$ and $c(\eta_{itk}) = -\log(-\eta_{itk})$ are bounded since $\{\tau_k\}$ and $\phi$ are bounded.

Derivatives with respect to $\phi$ only have the form

$$\frac{\partial^m}{\partial \phi^m} \log f(y_{it} \mid Z_{it} = k, u_i, \psi) \Big|_{u_i = 0} = \big( y_{it} \tau_k - c(\tau_k) \big) \frac{d^m}{d\phi^m} a^{-1}(\phi) + \frac{\partial^m}{\partial \phi^m} d(y_{it}, \phi).$$

Verifying that the expected values of products of such terms are finite and independent of $i$ is more complicated, since $\frac{\partial^m}{\partial \phi^m} d(y_{it}, \phi)$ is a function of $y_{it}$. However, we can do so for the special cases we consider. For example, when $f(y_{it} \mid z_{it}, u_i, \psi)$ is the normal distribution parameterized as in (B.16),

$$\frac{\partial^m}{\partial \phi^m} \log f(y_{it} \mid Z_{it} = k, u_i, \psi) \Big|_{u_i = 0} = \Big( \big[ \tau_k y_{it} - \tfrac{1}{2}\tau_k^2 \big] - \tfrac{1}{2} y_{it}^2 \Big) (-1)^m m!\, \phi^{-m-1} - \tfrac{1}{2} (-1)^{m-1} (m-1)!\, \phi^{-m}. \qquad (B.18)$$

Since $\phi$ and $\{\tau_k\}$ are bounded and the absolute moments of the normal distribution are finite, the expected values of products of terms such as (B.14), (B.15), and (B.18) are finite and independent of $i$. Similarly, in the case of the gamma distribution parameterized as in (B.17),

$$\frac{\partial^m}{\partial \phi^m} \log f(y_{it} \mid Z_{it} = k, u_i, \psi) \Big|_{u_i = 0} = \begin{cases} \tau_k y_{it} + \log(-\tau_k) + \log\phi + 1 + \log y_{it} - \frac{d}{d\phi} \log \Gamma(\phi), & m = 1 \\ (-1)^m (m-2)!\, \phi^{-m+1} - \frac{d^m}{d\phi^m} \log \Gamma(\phi), & m \ge 2. \end{cases} \qquad (B.19)$$

Since $\phi$ and $\{\tau_k\}$ are bounded, using the facts that $\log y_{it} \le y_{it}$ and that the absolute moments of the gamma distribution are finite, the expected values of products of terms such as (B.14), (B.15), and (B.19) are also finite and independent of $i$. Thus, we have shown that $B_{m_1 \cdots m_d}(\psi)$ is finite and independent of $i$ in the case of the normal and gamma distributions. This completes our proof of Theorem 6.2. □

Bibliography

[1] Albert, P.S. (1991). A two-state Markov mixture model for a time series of epileptic seizure counts. Biometrics, 47, 1371-1381.

[2] Albert, P.S., McFarland, H.F., Smith, M.E., and Frank, J.A. (1994). Time series for modelling counts from a relapsing-remitting disease: application to modelling disease activity in multiple sclerosis. Statistics in Medicine, 13, 453-466.

[3] Alzaid, A.A. and Al-Osh, M.A. (1993). Some autoregressive moving average processes with generalized Poisson marginal distributions. Annals of the Institute of Statistical Mathematics, 45, 223-232.

[4] Baras, J.S. and Finesso, L. (1992). Consistent estimation of the order of hidden Markov chains.
In Stochastic Theory and Adaptive Control (Lawrence, KS, 1991), 26-39. Springer-Verlag, Berlin.

[5] Baum, L.E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164-171.

[6] Bickel, P.J., Ritov, Y., and Rydén, T. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. Annals of Statistics, 26, 1614-1635.

[7] Bickel, P.J., Ritov, Y., and Rydén, T. (2002). Hidden Markov model likelihoods and their derivatives behave like IID ones. Annales de l'Institut Henri Poincaré - Probabilités et Statistiques, 38, 825-846.

[8] Billingsley, P. (1995). Probability and Measure. John Wiley & Sons, New York.

[9] Blais, M., MacGibbon, B., and Roy, R. (2000). Limit theorems for regression models of time series of counts. Statistics & Probability Letters, 46, 161-168.

[10] Bradley, R.A. and Gart, J.J. (1962). The asymptotic properties of ML estimators when sampling from associated populations. Biometrika, 49, 205-214.

[11] Breiman, L. (1968). Probability. Addison-Wesley Publishing Company, Reading, Massachusetts.

[12] Breslow, N.E. and Clayton, D.G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9-25.

[13] Chen, J. and Kalbfleisch, J.D. (1996). Penalized minimum-distance estimates in finite mixture models. The Canadian Journal of Statistics, 24, 167-175.

[14] Chen, M.-H. and Ibrahim, J.G. (2000). Bayesian predictive inference for time series count data. Biometrics, 56, 678-685.

[15] Chesher, A. (1984). Testing for neglected heterogeneity. Econometrica, 52, 865-872.

[16] Chung, S.H., Moore, J.B., Xia, L., Premkumar, L.S., and Gage, P.W. (1990). Characterization of single channel currents using digital signal processing techniques based on hidden Markov models.
Philosophical Transactions of the Royal Society of London B, 329, 265-285.

[17] Dean, C.B. (1992). Testing for overdispersion in Poisson and binomial regression models. Journal of the American Statistical Association, 87, 451-457.

[18] Dempster, A., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1-38.

[19] Doukhan, P. (1994). Mixing: Properties and Examples. Springer-Verlag, New York.

[20] Dortet-Bernadet, V. (2001). Choix de modèle pour des chaînes de Markov cachées. Comptes Rendus des Séances de l'Académie des Sciences, Série I, 332, 469-472.

[21] Douc, R. and Matias, C. (2001). Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli, 7, 381-420.

[22] Efron, B. and Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.

[23] Evans, M. and Swartz, T. (1995). Methods for approximating integrals in statistics with special emphasis on Bayesian integration problems. Statistical Science, 10, 254-272.

[24] Feng, Z. and McCulloch, C.E. (1992). Statistical inference using maximum likelihood estimation and the generalized likelihood ratio when the true parameter is on the boundary of the parameter space. Statistics & Probability Letters, 13, 325-332.

[25] Giudici, P., Rydén, T., and Vandekerkhove, P. (2000). Likelihood-ratio tests for hidden Markov models. Biometrics, 56, 742-747.

[26] Hall, D.B. and Praestgaard, J.T. (2001). Order-restricted score tests for homogeneity in generalised linear and nonlinear mixed models. Biometrika, 88, 739-751.

[27] Hettmansperger, T.P. and Thomas, H. (2000). Almost nonparametric inference for repeated measures in mixture models. Journal of the Royal Statistical Society B, 62, 811-825.

[28] Hughes, J.P. and Guttorp, P. (1994). A class of stochastic models for relating synoptic atmospheric patterns to regional hydrologic phenomena.
Water Resources Research, 30, 1535-1546.

[29] Humphreys, K. (1997). Classification error adjustments for female labour force transitions using a latent Markov chain with random effects. In Applications of Latent Class and Latent Trait Models in the Social Sciences (ed. J. Rost and R. Langeheine), 370-380. Waxmann, Münster.

[30] Humphreys, K. (1998). The latent Markov chain with multivariate random effects: an evaluation of instruments measuring labour market status in the British Household Panel Study. Sociological Methods and Research, 26, 269-299.

[31] Ibragimov, I.A. and Linnik, Y.V. (1971). Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff Publishing, The Netherlands.

[32] Jacqmin-Gadda, H. and Commenges, D. (1995). Tests of homogeneity for generalized linear models. Journal of the American Statistical Association, 90, 1237-1246.

[33] Juang, B.H. and Rabiner, L.R. (1991). Hidden Markov models for speech recognition. Technometrics, 33, 251-272.

[34] Kieffer, J.C. (1993). Strongly consistent code-based identification and order estimation for constrained finite-state model classes. IEEE Transactions on Information Theory, 39, 893-902.

[35] Krogh, A. (1998). An introduction to hidden Markov models for biological sequences. In Computational Methods in Molecular Biology (ed. S.L. Salzberg, D.B. Searls, and S. Kasif), 45-63. Elsevier, Amsterdam.

[36] Lee, Y. and Nelder, J.A. (1996). Hierarchical generalized linear models. Journal of the Royal Statistical Society B, 58, 619-678.

[37] Leroux, B.G. (1992a). Maximum-likelihood estimation for hidden Markov models. Stochastic Processes and their Applications, 40, 127-143.

[38] Leroux, B.G. (1992b). Consistent estimation of a mixing distribution. Annals of Statistics, 20, 1350-1360.

[39] Leroux, B.G. and Puterman, M.L. (1992). Maximum-penalized likelihood estimation for independent and Markov-dependent mixture models. Biometrics, 48, 545-558.
[40] Levinson, S.E., Rabiner, L.R., and Sondhi, M.M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. The Bell System Technical Journal, 62, 1035-1074.

[41] Liebscher, E. (1996). Strong convergence of sums of α-mixing random variables with applications to density estimation. Stochastic Processes and their Applications, 65, 69-80.

[42] Lin, X. (1997). Variance component testing in generalised linear models with random effects. Biometrika, 84, 309-326.

[43] Lin, Z. and Lu, C. (1996). Limit Theory for Mixing Dependent Random Variables. Kluwer Academic Publishers, Boston.

[44] Lindgren, G. (1978). Markov regime models for mixed distributions and switching regressions. Scandinavian Journal of Statistics, 5, 81-91.

[45] Liu, C.-C. and Narayan, P. (1994). Order estimation and sequential universal data compression of a hidden Markov source by the method of mixtures. IEEE Transactions on Information Theory, 40, 1167-1180.

[46] Lockhart, R.A. and Stephens, M.A. (1998). The probability plot: tests of fit based on the correlation coefficient. In Handbook of Statistics 17, 453-473. Elsevier Science, Amsterdam.

[47] Lystig, T.C. (2001). Evaluation of Hidden Markov Models. Ph.D. thesis, University of Washington, Seattle.

[48] MacDonald, I.L. and Zucchini, W. (1997). Hidden Markov Models and Other Models for Discrete-Valued Time Series. Chapman & Hall, London.

[49] MacKay, R.J. (2002). Estimating the order of a hidden Markov model. The Canadian Journal of Statistics, 30, 573-589.

[50] McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models. Chapman and Hall, London.

[51] McCulloch, C.E. (1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association, 92, 162-170.

[52] McCulloch, C.E. and Searle, S.R. (2001). Generalized, Linear, and Mixed Models. John Wiley & Sons, New York.
[53] McKenzie, E. (1988). Some ARMA models for dependent sequences of Poisson counts. Advances in Applied Probability, 20, 822-835.

[54] Nash, J.C. (1979). Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation. John Wiley & Sons, Inc., New York.

[55] Ould-Saïd, E. (1994). Loi du log itéré pour la fonction de répartition empirique dans le cas multidimensionnel et α-mélangeant. Comptes Rendus des Séances de l'Académie des Sciences, Série I, 318, 759-763.

[56] Petrie, T. (1969). Probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 40, 97-115.

[57] Poskitt, D.S. and Chung, S.-H. (1996). Markov chain models, time series analysis and extreme value theory. Advances in Applied Probability, 28, 405-425.

[58] Prakasa Rao, B.L.S. (1992). Identifiability in Stochastic Models: Characterization of Probability Distributions. Academic Press, Inc., London.

[59] PRISMS (Prevention of Relapses and Disability by Interferon β-1a Subcutaneously in Multiple Sclerosis) Study Group (1998). Randomised double-blind placebo-controlled study of interferon β-1a in relapsing/remitting multiple sclerosis. The Lancet, 352, 1498-1504.

[60] Raubertas, R.F. (1992). The envelope probability plot as a goodness-of-fit test. Communications in Statistics - Simulation and Computation, 21, 189-202.

[61] Robert, C.P., Rydén, T., and Titterington, D.M. (2000). Bayesian inference in hidden Markov models through the reversible jump Markov chain Monte Carlo method. Journal of the Royal Statistical Society B, 62, 57-75.

[62] Rydén, T. (1995). Estimating the order of hidden Markov models. Statistics, 26, 345-354.

[63] Rynkiewicz, J. (2001). Estimation of hybrid HMM/MLP models. In Proceedings of the European Symposium on Artificial Neural Networks, Bruges, Belgium, 383-389.

[64] Saulis, L. and Statulevicius, V.A. (1991). Limit Theorems for Large Deviations. Kluwer, Dordrecht.

[65] Self, S.G. and Liang, K.-Y.
(1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association, 82, 605-610.

[66] Seltman, H.J. (2002). Hidden Markov models for analysis of biological rhythm data. In Case Studies in Bayesian Statistics V (ed. C. Gatsonis, R.E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West), 398-406. Springer, New York.

[67] Stram, D.O. and Lee, J.W. (1994). Variance components testing in the longitudinal mixed effects model. Biometrics, 50, 1171-1177.

[68] Strichartz, R.S. (1995). The Way of Analysis. Jones and Bartlett Publishers, Boston.

[69] Teicher, H. (1967). Identifiability of mixtures of product measures. Annals of Mathematical Statistics, 38, 1300-1302.

[70] Turner, T.R., Cameron, M.A., and Thomson, P.J. (1998). Hidden Markov chains in generalized linear models. The Canadian Journal of Statistics, 26, 107-125.

[71] Wang, P. and Puterman, M.L. (1999). Markov Poisson regression models for discrete time series. Journal of Applied Statistics, 26, 855-869.

[72] White, L.B., Mahony, R., and Brushe, G.D. (2000). Lumpable hidden Markov models - model reduction and reduced complexity filtering. IEEE Transactions on Automatic Control, 45, 2297-2306.

[73] Wu, C.F.J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics, 11, 95-103.

[74] Zucchini, W. and Guttorp, P. (1991). A hidden Markov model for space-time precipitation. Water Resources Research, 27, 1917-1923.
Item Metadata
Title | Hidden Markov models : multiple processes and model selection
Creator | MacKay, Rachel J.
Date Issued | 2003
Genre | Thesis/Dissertation
Type | Text
Language | eng
Date Available | 2009-12-16
Provider | Vancouver : University of British Columbia Library
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI | 10.14288/1.0092248
URI | http://hdl.handle.net/2429/16862
Degree | Doctor of Philosophy - PhD
Program | Statistics
Affiliation | Science, Faculty of; Statistics, Department of
Degree Grantor | University of British Columbia
Graduation Date | 2003-11
Campus | UBCV
Scholarly Level | Graduate
Aggregated Source Repository | DSpace