ASYMPTOTIC INFERENCE FOR SEGMENTED REGRESSION MODELS

By SHIYING WU
B.Sc., Beijing University, 1983
M.Sc., The University of British Columbia, 1988

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF STATISTICS

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October 1992
© Shiying Wu, 1992

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Statistics
The University of British Columbia
Vancouver, Canada

Abstract

This thesis deals with the estimation of segmented multivariate regression models. A segmented regression model is a regression model which has different analytical forms in different regions of the domain of the independent variables. Without knowing the number of these regions or their boundaries, we first estimate the number of regions by using a modified Schwarz' criterion. Under fairly general conditions, the estimated number of regions is shown to be weakly consistent. We then estimate the change points or "thresholds" where the boundaries lie, and the regression coefficients given the (estimated) number of regions, by minimizing the sum of squares of the residuals. It is shown that the estimates of the thresholds converge at the rate of $O_p(\ln^2 n / n)$ if the model is discontinuous at the thresholds, and $O_p(n^{-1/2})$ if the model is continuous.
In both cases, the estimated regression coefficients and residual variances are shown to be asymptotically normal. It is worth noting that the condition required of the error distribution is local exponential boundedness, which is satisfied by any distribution with zero mean and a moment generating function, provided its second derivative is bounded near zero. As an illustration, a segmented bivariate regression model is fitted to real data and the relevance of the asymptotic results is examined through simulation studies.

The identifiability of the segmentation variable is also discussed. Under different conditions, two consistent estimation procedures for the segmentation variable are given.

The results are then generalized to the case where the noises are heteroscedastic and autocorrelated. The noises are modeled as moving averages of an infinite number of independently, identically distributed random variables multiplied by different constants in different regions. It is shown that with a slight modification of our assumptions, the estimated number of regions is still consistent, and the threshold estimates retain the convergence rate of $O_p(\ln^2 n / n)$ when the segmented regression model is discontinuous at the thresholds. The estimation procedures also give consistent estimates of the residual variances for each region. These estimates and the estimates of the regression coefficients are shown to be asymptotically normal. A consistent estimate of the segmentation variable is also given. Simulations are carried out for different model specifications to examine the performance of the procedures for different sample sizes.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgement
Chapter 1. Prologue
1.1 Introduction
1.2 A review of segmented regression and related problems
1.3 New contributions and their relationship to previous work
1.4 Outline of the following chapters
Chapter 2. Estimation of segmented regression models
2.1 Identifiability of the segmentation variable
2.2 Estimation procedures
2.3 General remarks
Chapter 3. Asymptotic results of the estimators for segmented regression models
3.1 Asymptotic results when the segmentation variable is known
3.2 Consistency of the estimated segmentation variable
3.3 A simulation study
3.4 General remarks
3.5 Appendix: A discussion of the continuous model
Chapter 4. Segmented regression models with heteroscedastic noise
4.1 Estimation procedures
4.2 Asymptotic properties of the parameter estimates
4.3 A simulation study
4.4 General remarks
4.5 Appendix: Proofs
Chapter 5. Summary and future research
5.1 A brief summary of previous chapters
5.2 Future research on the current model
5.3 Further generalizations
References

List of Tables

Table 3.1 Frequency of correct identification of $l^0$ in 100 repetitions and the estimated thresholds for segmented regression models
Table 3.2 Estimated regression coefficients and variances of noise and their standard errors with n = 200
Table 3.3 The empirical distribution of $\hat{l}$ in 100 repetitions by MIC, SC and YC for piecewise constant model
Table 3.4 The estimated thresholds and their standard errors for piecewise constant model
Table 4.1 Frequency of correct identification of $l^0$ in 100 repetitions and the estimated thresholds for segmented regression models with two regimes
Table 4.2 Estimated regression coefficients and variances of noise and their standard errors with n = 200
Table 4.3 Frequency of correct identification of $l^0$ in 100 repetitions and the estimated threshold for a segmented regression model with three regimes
Table 4.4 Estimated regression coefficients and noise variances and their standard errors with n = 200

List of Figures

Figure 2.1 $(x_1, x_2)$ uniformly distributed over the shaded area
Figure 2.2 $(x_1, x_2)$ uniformly distributed over the eight points
Figure 2.3 Mile per gallon vs. weight for 38 cars
Figure 4.1 $(x_1, x_2)$ uniformly distributed over each of six regions with indicated mass

Acknowledgements

I thank my supervisor, Dr. Jian Liu, for his inspiration, guidance, support and advice throughout the course of the work reported in this thesis. I wish to express my deep gratitude to Professor James V. Zidek, for his guidance, encouragement, patience and valuable advice. This thesis benefitted from the helpful comments of Professor Piet De Jong, to whom I am indebted. Professor John Petkau and Mr. Feifang Hu also made valuable comments. Many thanks go to Dr. Harry Joe and Nancy E. Heckman for their encouragement and support during my stay at UBC. Special thanks to Professor James V. Zidek, who provided boundless support throughout my graduate career at UBC.

The financial support from the Department of Statistics, University of British Columbia is acknowledged with great appreciation. I also acknowledge the support of the University of British Columbia through a University Graduate Fellowship.

Chapter 1
PROLOGUE

1.1 Introduction

This thesis deals with asymptotic estimation for segmented multivariate regression models. A segmented regression model is a regression model which has different analytical forms in different regions of the domain of the independent variables.
This model may be useful when a response variable depends on the independent variables through a function whose form cannot be uniformly well approximated by a single finite Taylor expansion, so that the usual linear regression models are not applicable. In such a situation, the simplicity of the Taylor expansion can be regained, and modeling flexibility added, by allowing the response to depend on these variables differently in different subregions of the domains of certain independent variables.

For example, Yeh et al. (1983) discuss the idea of an "anaerobic threshold". It is hypothesized that if a person's workload exceeds a certain threshold at which his muscles cannot get enough oxygen, then the aerobic metabolic processes become anaerobic processes. This threshold is called the "anaerobic threshold". In this case a model with two segments is suggested by the subject-oriented theory. McGee and Carleton (1970) discuss another example, in which the dependence of the selling volume of a regional stock exchange on that of the New York Stock Exchange and the American Stock Exchange is thought to be changed by a change in government regulation. A model with four segments is considered appropriate in their analysis. Examples of this kind in various contexts are given by Sprent (1961), Dunicz (1969), Schulze (1984) and many others.

In some situations, although a segmented model is considered suitable, the appropriate number of segments may not be known, as in the example mentioned above and the exchange rate problem we shall discuss in Chapter 5. Furthermore, in the case of multivariate regression, it may not be clear which of the independent variables relates to the change of the dependence structure, or which independent variable can best be used as the segmentation variable.
In some problems where the independent variables are of low dimension, graphical approaches may be effective in determining the number of segments and which independent variable can best be chosen as the segmentation variable. However, if the independent variables are of high dimension, the interrelations of the independent variables may thwart such an approach. Therefore, an objective and automated approach is in order.

In this thesis, we develop procedures to estimate the model parameters, including the segmentation variable, the number of segments, the location of the thresholds, and other parameters in the model. Note that the word "threshold" is used to emphasize that the dependence structure changes when the segmentation variable exceeds certain values. The estimation procedures are based on least squares estimation and a modified version of Schwarz' (1978) criterion. These estimators are shown to be consistent under fairly mild conditions. In addition, asymptotic distributions are derived for the estimated regression coefficients and the estimated variance of the noises. The procedures are then generalized to accommodate situations in which the noise levels differ from segment to segment and the noise is autocorrelated. It is shown that the consistency of these estimators is retained.

Simulated data sets are analyzed by the proposed procedures to show their performance for finite sample sizes, and the results seem satisfactory.

1.2 A review of segmented regression and related problems

One problem closely related to segmented regression is the change-point problem. A segmented regression problem reduces to a change-point problem if the regression functions are unknown constants and the boundaries of the segments are to be estimated. In general, a change-point problem refers to the problem of making inferences about the point in a sequence of random variables at which the law governing evolution of the process changes.
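The reduction just described can be made concrete with a minimal sketch. It assumes the simplest possible setting, which is narrower than anything treated in this thesis: exactly one change in the mean of a sequence, with a constant mean on each side. The function name `fit_single_change_point` and the toy data are hypothetical and chosen only for illustration.

```python
def fit_single_change_point(y):
    """Least-squares estimate of a single change in the mean of a sequence.

    Returns (k, rss): y[:k] is fitted by one constant, y[k:] by another,
    and rss is the resulting residual sum of squares.
    """
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(y)
    total_sq = sum(v * v for v in y)
    best_k, best_rss = None, float("inf")
    left_sum = 0.0
    for k in range(1, n):  # k = number of observations before the change
        left_sum += y[k - 1]
        right_sum = total - left_sum
        # With a separate mean on each side,
        # RSS = sum(y^2) - left_sum^2 / k - right_sum^2 / (n - k).
        rss = total_sq - left_sum ** 2 / k - right_sum ** 2 / (n - k)
        if rss < best_rss:
            best_k, best_rss = k, rss
    return best_k, best_rss


# Noise-free toy sequence (an assumption for illustration):
# mean 0 for five points, then mean 3 for four points.
y = [0.0] * 5 + [3.0] * 4
k_hat, rss = fit_single_change_point(y)
```

On this toy sequence the search places the change after the fifth observation. With noisy data the same search applies, but the estimate is then subject to the sampling behaviour discussed in the literature reviewed in this section.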
As a matter of fact, part of the work in this thesis is greatly inspired by Yao's (1988) work on the change-point problem.

The segmented regression problem and the change-point problem have attracted much attention since the 1950's. Shaban (1980) gives a rather complete list of references from the 1950's to the 1970's. Among other authors, Quandt (1958) postulates a two-phase linear regression model in which the regime switches at an unknown time $t^*$. Under the assumption that the $e_t$'s are independent normal random variables, he obtains the maximum likelihood estimates for the parameters, including $t^*$. Robison (1964) considers a two-phase polynomial regression problem of the form

$$y_i = \beta_0^{(j)} + \beta_1^{(j)} x_i + \beta_2^{(j)} x_i^2 + \cdots + \beta_k^{(j)} x_i^k + e_i, \qquad j = \begin{cases} 1, & i \le \tau, \\ 2, & i > \tau. \end{cases}$$

Also assuming that the noises are independent normal variables, he obtains the maximum likelihood estimate and a confidence interval for the change-point. Adding to the model of Quandt (1958) the assumption that the model is everywhere continuous and the variances of the $\{e_t\}$ are identical, Hudson (1966) gives a concise method for calculating the overall least squares estimator of the intersection point of two intersecting regression lines. For the same problem, Hinkley (1969) derives an asymptotic distribution for the maximum likelihood estimate of the intersection, which is claimed to be a better approximation to the finite sample distribution than the asymptotic normal distribution of Feder and Sylwester (1968).

For the change-point problem, Hinkley (1970) derives the asymptotic distribution of the maximum likelihood estimate of the change-point. He assumes that exactly one change occurs and that the means of the two submodels are known. He also gives the asymptotic distribution when these means are unknown and the noises are assumed to be identically, independently distributed normal random variables ("iid normal" hereafter). As Hinkley notes, the maximum likelihood estimate is not consistent, and the asymptotic result is not good for small samples when the two means are unknown.
In all of these problems, the number of change-points is assumed to be exactly one. For problems where the number of change-points may be more than one, Quandt (1958, p. 880) concludes that "The exact number of switches must be assumed to be known".

McGee and Carleton (1970) treat the estimation problem for cases where more than one change may occur. Their model is

$$y_t = \beta_0^{(j)} + \beta_1^{(j)} x_{1t} + \cdots + \beta_k^{(j)} x_{kt} + e_t, \qquad \text{if } t \in [\tau_{j-1}, \tau_j),$$

where $1 < \tau_1 < \cdots < \tau_L < \tau_{L+1} = N$ and the $\{e_t\}$ are iid $N(0, \sigma^2)$. Note that $L$ and the $\tau_j$'s are unknown. Constrained by the computing power available at that time (1970), they propose an estimation method which essentially combines least squares estimation with hierarchical clustering. While computationally efficient, their method is suboptimal (a result of the use of hierarchical clustering), subjective (in terms of the choice of $L$) and lacking in theoretical justification.

Goldfeld and Quandt (1972, 1973a) discuss the so-called switching regression model, specified as follows:

$$y_t = \begin{cases} x_t'\beta_1 + u_{1t}, & \text{if } \pi' z_t \le 0; \\ x_t'\beta_2 + u_{2t}, & \text{if } \pi' z_t > 0. \end{cases}$$

Here $z_t = (z_{1t}, \ldots, z_{kt})'$ are the observations on some exogenous variables (including, possibly, some or all of the regressors), $\pi = (\pi_1, \ldots, \pi_k)'$ is an unknown parameter, and the $\{u_{it}\}$ are independent normal random variables with zero means and variances $\sigma_i^2$, $i = 1, 2$. The parameters $\beta_1$, $\beta_2$, $\sigma_1^2$, $\sigma_2^2$ and $\pi$ are to be estimated. They define $d(z_t) = 1_{(\pi' z_t > 0)}$ and reexpress the model as

$$y_t = x_t'[(1 - d(z_t))\beta_1 + d(z_t)\beta_2] + (1 - d(z_t))u_{1t} + d(z_t)u_{2t}.$$

For estimation the "D-method" is proposed: $d(z_t)$ is replaced by

$$\int_{-\infty}^{\pi' z_t} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\xi^2}{2\sigma^2}\right) d\xi,$$

and the maximum likelihood estimates for the parameters are obtained. As they point out, the D-method can be extended to the case of more than two regimes.

Gallant and Fuller (1973) consider the problem of estimating the parameters in a piecewise polynomial model with continuous derivative, where the join points are unknown.
They reparametrize the model so that the Gauss-Newton method can be applied to obtain the least squares estimates. An F statistic is suggested for model selection (including the number of regimes) without theoretical justification.

Poirier (1973) relates spline models and piecewise regression models. Assuming the change points known, he develops tests to detect structural changes in the model and to decide whether certain of the model coefficients vanish.

Ertel and Fowlkes (1976) also point out that the regression models for linear splines and piecewise linear regression have many common elements. The primary difference between them is that in the linear spline case, adjacent regression lines are required to intersect at the change-points, while in the piecewise linear case, adjacent regression lines are fitted separately. They develop some efficient algorithms to obtain least squares estimates for these models.

Feder (1975a) considers a one-dimensional segmented regression problem; it is assumed that the function is continuous over the entire range of the covariate and that the number of segments is known. Under certain additional assumptions, he shows that the least squares estimates of the regression coefficients of the model are asymptotically normally distributed. Note that the two assumptions, that the function is continuous and that the number of segments is known, are essential for his results.

For the simplest two-segment regression problem with the continuity assumption, Miao (1988) proposes a hypothesis test procedure for the existence of a change-point, together with a confidence interval for the change-point, based on the theory of Gaussian processes.

Statistical hypothesis tests for segmented regression models are studied by many authors, among them Quandt (1960), Sprent (1961), Hinkley (1969), Feder (1975b) and Worsley (1983).
Bayesian methods for the problem are considered by Farley and Hinich (1970), Bacon and Watts (1971), Broemeling (1974), Ferreira (1975), Holbert and Broemeling (1977) and Salazar, Broemeling and Chi (1981). Quandt (1972), Goldfeld and Quandt (1972, 1973b) and Quandt and Ramsey (1978) treat the problem as a random mixture of two regression lines.

Closely related to the problem studied in this thesis, Yao (1988) studies the following change-point problem: a sequence of independent normally distributed random variables have a common variance, but their means change $l$ times along the sequence, with $l$ unknown. He adopts the Schwarz criterion for estimating $l$ and proves that such an estimator is consistent. Yao notes that consistency need not hold without the normality assumption.

Yao and Au (1989) consider the problem of estimating a step function, $g(t)$, over $t \in [0, 1]$ in the presence of additive noise. They assume that $t_i = i/n$ ($i = 1, \ldots, n$) are fixed points and the noise has a sixth or higher moment, and derive limiting distributions for the least squares estimators of the locations and sizes of the jumps when the number of jumps is either known or bounded. The discontinuity of $g(t)$ at each change point makes the estimated locations of the jumps converge rapidly to their true values.

This thesis is primarily about situations like those described above, where the segmented regression model may be viewed as a partial explanation model that tries to capture our impression of an abrupt change in the mechanism underlying the process. It is linked to other paradigms in modern regression theory as well. Much of this theory (see the references below, for example) is concerned with regression functions of, say, $y$ on $x$, which cannot be well approximated globally by the leading terms of their Taylor expansion, and hence by a global linear model. This has led to various approaches to "nonparametric regression" (see Friedman, 1991, for a recent survey).
One such approach is that of Cleveland (1979) when the dimension of $x$ is 1; his results, which use a linear model in a moving local window, are extended to higher dimensions by Cleveland and Devlin (1988). Weerahandi and Zidek (1988) use a Taylor expansion explicitly to construct a locally weighted smoother, also when the dimension of $x$ is 1; a different expansion is used at each $x$-value, thereby avoiding the shortcomings of using a single global expansion.

However, difficulties confront local weighting methodologies like those described above, as well as kernel smoothers and splines, because of the "curse of dimensionality", which becomes progressively more serious as the dimension of $x$ grows beyond 2. These difficulties are well described by Friedman (1991), who presents an alternative methodology called "multivariate adaptive regression splines," or "MARS."

MARS avoids the curse of dimensionality by partitioning $x$'s domain into a data-determined, but moderate, number of subdomains within which spline functions of low dimensional subvectors of $x$ are fitted. By using splines of order exceeding 0, MARS can lead to continuous smoothers. In contrast, its forerunner, called "recursive partitioning" by Friedman, must be discontinuous, because a different constant is fitted in different subdomains. But, like MARS, it avoids the curse of dimensionality because it depends locally on a small number (in fact, none) of the coordinates of $x$.

Friedman (1991) attributes to Breiman and Meisel (1976) a natural extension of recursive partitioning wherein a linear function of $x$ is fitted within each subdomain. However, it can encounter the curse of dimensionality when these subdomains are small, and Friedman (1991) ascribes the lack of popularity of this extension to this feature.

However, the curse of dimensionality is relative. If the subdomains of $x$ are large, the "curse" becomes less problematical.
And within such subdomains, the Taylor expansion leads to linear models, like those used by Breiman and Meisel (1976) and here, as natural approximants; in contrast, splines seem somewhat ad hoc. And linear models have a long history of application in statistics.

1.3 New contributions and their relationship to previous work

In this thesis, we address the problem of making asymptotic inference for the following model:

$$y_t = \beta_{i0} + \sum_{j=1}^{p} \beta_{ij} x_{tj} + \sigma_i \epsilon_t, \qquad \text{if } x_{td} \in (\tau_{i-1}, \tau_i], \quad i = 1, \ldots, l+1, \tag{1.1}$$

where $x_t = (x_{t1}, \ldots, x_{tp})'$ is an observed random variable; $\epsilon_t$ is assumed to have zero mean and unit variance, while $\tau_i$, $\beta_{ij}$, $\sigma_i$ ($i = 1, \ldots, l+1$, $j = 0, 1, \ldots, p$), $l$ and $d$ are unknown parameters.

Our main contributions are as follows. A sequence of procedures is proposed to estimate all these parameters, based on least squares estimation and our modified Schwarz' criterion. It is shown that under mild conditions, the estimator, $\hat{l}$, of $l$ is consistent. Furthermore, a bound on the rate of convergence of $\hat{\tau}_i$ and the asymptotic normality of the estimators of $\beta_{ij}$, $\sigma_i$ ($i = 1, \ldots, l+1$, $j = 0, 1, \ldots, p$) are obtained under certain additional assumptions.

When the segmentation is related to a few highly correlated covariates, it may not be clear which covariate can best be chosen as the segmentation variable. In such a case, $d$ will be treated as an unknown parameter to be estimated. A new concept of identifiability of $d$ is introduced to formulate the problem precisely. We prove that the least squares estimate of $d$ is consistent. In addition, we propose another consistent and computationally efficient estimate of $d$. All of these are achieved without the Gaussian assumption on the noises.

In many practical situations, it is necessary to assume that the noises are heteroscedastic and serially correlated. Our estimation procedures and the asymptotic results are generalized to such situations. Asymptotic theory for stationary processes is developed to establish consistency and asymptotic normality of the estimates.
Note that in Model (1.1), if $\beta_{ij} = 0$ for all $i = 1, \ldots, l+1$ and $j = 1, \ldots, p$, equation (1.1) reduces to the change-point problem discussed by Yao (1988), $x_d$ being the explanatory variable controlling the allocation of measurements associated with different dependence structures. Although our formulation is somewhat different from that of Yao (1988) in that we introduce an explanatory variable to allocate response measurements, both formulations are essentially the same from the point of view of an experimental design. If the other covariates are all known functionals of $x_d$, as in segmented polynomial regressions, and $l$ is known, (1.1) reduces to the case discussed by Feder (1975a).

Unlike all the above mentioned work on segmented regression except McGee and Carleton (1970), we assume that the number of segments is unknown and that the noise may be dependent. In terms of estimating $l$, we generalize Yao's (1988) work on the change-point problem to a multiple segmented regression set-up. Furthermore, his conditions on the noises are relaxed in the sense that the $\epsilon_t$'s do not have to be (a) normally distributed (rather, they could follow any of the many distributions commonly used for noise); (b) identically distributed; and (c) independent. In terms of making asymptotic inference on the regression coefficients and the change points, we do not assume continuity of the underlying function, which is essential for Feder's (1975a) results. We find that without the continuity assumption, the estimated change points converge to the true ones at a much faster rate than the rate given by Feder. Finally, a consistent estimator is obtained for $d$, an additional parameter not found in any of the previous work.

Our results also relate to MARS. In fact, our estimation procedure can be viewed as adaptive regression using a different method of partitioning than Breiman and Meisel (1976).
By placing an upper bound on the number of partitions, we can avoid the difficulties, caused by the curse of dimensionality, of fitting our model to data in high dimensional space (but we recognize that there are trade-offs involved). And we have adopted a different stopping criterion in partitioning $x$-space; it is based on ideas of model selection rather than testing, and seems more appealing to us. Finally, and most importantly, we are able to provide a large sample theory for our methodology. This feature of our work seems important to us. Although the MARS methodology appears to be supported by the empirical studies of Friedman (1991), there is an inevitable concern about the general merits of any procedure when it lacks a theoretical foundation.

Interestingly enough, it can be shown that in some very special cases, our estimation procedures coincide with those of MARS in estimating the change points, if our stopping criterion were adopted in MARS. This seems to indicate that, with our techniques, MARS may be modified to possess certain optimalities (e.g. consistency) or suboptimalities for more general cases.

So in summary, with the estimation procedures proposed in this thesis we regain some of the simplicity of the (piecewise) Taylor expansion and attendant linear models, while retaining some of the virtues of added modeling flexibility possessed by nonparametric approaches. Our large sample theory gives us precise conditions under which our methodology would work well, given sufficiently large samples. And by restricting the number of $x$-subdomains sufficiently, we avoid the curse of dimensionality. Partitioning for our methodology is data-based, like that of MARS.

1.4 Outline of the following chapters

This dissertation is organized as follows. In Chapter 2, the identifiability of the segmentation variable in the segmented regression model is discussed first. We introduce a concept of identifiability and demonstrate how the concept naturally arises from the problem.
Then we give an equivalent condition which is crucial in establishing the consistency. Finally, we give a sequence of procedures to estimate all the parameters involved in a "basic" segmented regression model with uncorrelated and homoscedastic noise. These procedures are illustrated with an example.

The consistency of the estimates given in Chapter 2 is proved in Chapter 3. Conditions under which the procedures give consistent estimates are also discussed. For technical reasons, the consistency of estimates other than that of the segmentation variable is established first. The estimation problem is treated as a problem of model selection, with the models represented by the possible number of segments, assuming the segmentation variable is known. Schwarz' criterion is tuned to an order of magnitude that can distinguish systematic bias from random noise and is used to select models. Then, with the established theories, the consistency of the estimated segmentation variable is proved. Simulations with various model specifications are carried out to demonstrate the finite sample behavior of the estimators, which proves to be satisfactory.

Results given in Chapter 2 and Chapter 3 are generalized to the case where the noise levels in different segments are different. The noise often derives from factors that cannot be clearly specified and about which little is known. In many practical situations, like that of the economic example mentioned above, the noise may represent a variety of factors of different magnitudes over different segments. Therefore a heteroscedastic specification of the noise is often necessary. To meet practical needs further, the noise term in the model is assumed to be autocorrelated. The estimation procedures given in Chapter 2 are modified to accommodate these necessities and presented in Chapter 4. It is shown that under a moving average specification of the noise, the estimates given by the procedures are consistent.
Further, the parameters specified in the moving average model of the noise term can be estimated from the estimated residuals. Simulation results are given to shed light on the finite sample behavior of the estimates.

A summary of the results established in this thesis is given in Chapter 5. Future research is also discussed. One line of future research comes from the similarity between segmented regression and spline techniques. Our model can first be generalized to the case where there is more than one segmentation variable. Then an "oblique" threshold model can be considered. An oblique threshold is one made by a linear combination of explanatory variables. This is reasonable because often there is no reason to believe that the threshold has to be parallel to any of the axes. Finally, by partitioning the domain of the explanatory variables into polygons, an adaptive regression spline method could be developed. This could serve as an alternative to Friedman's (1988) multivariate adaptive regression spline method, or MARS.

Chapter 2
ESTIMATION OF SEGMENTED REGRESSION MODELS

In this chapter, we consider a special case of model (1.1) where the $\{\sigma_i\}$ are all equal and the $\{e_t\}$ are independent and identically distributed. In this case, the model can be reformulated as follows.

Let $(y_1, x_{11}, \ldots, x_{1p}), \ldots, (y_n, x_{n1}, \ldots, x_{np})$ be the independent observations of the response, $y$, and the covariates, $x_1, \ldots, x_p$. Let $x_t = (1, x_{t1}, \ldots, x_{tp})'$ for $t = 1, \ldots, n$ and $\beta_i = (\beta_{i0}, \beta_{i1}, \ldots, \beta_{ip})'$, $i = 1, \ldots, l+1$. Then,

$$y_t = x_t'\beta_i + e_t, \qquad \text{if } x_{td} \in (\tau_{i-1}, \tau_i], \quad i = 1, \ldots, l+1, \quad t = 1, \ldots, n, \tag{2.1}$$

where the $\{e_t\}$ are iid with mean zero and variance $\sigma^2$ and are independent of $\{x_t\}$, and $-\infty = \tau_0 < \tau_1 < \cdots < \tau_{l+1} = \infty$. The $\beta_i$, $\tau_i$ ($i = 1, \ldots, l+1$), $l$, $d$ and $\sigma^2$ are unknown parameters. When the coefficients of $x_{td}$ vanish ($\beta_{id} = 0$ for all $i$), the segmentation variable $x_{td}$ becomes an exogenous variable as considered by Goldfeld and Quandt (1972, 1973a).

A sequence of estimation procedures is given to estimate the parameters in model (2.1).
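To make model (2.1) concrete, here is a minimal sketch of a least squares fit under strong simplifying assumptions that are not those of the thesis: a single covariate that also serves as the segmentation variable ($p = 1$, $d = 1$), exactly one threshold ($l = 1$, taken as known), and candidate thresholds restricted to the observed covariate values. The helper names `ols_line` and `fit_one_threshold`, the `min_size` argument and the toy data are hypothetical. The selection of $l$ by the modified Schwarz' criterion is not shown.

```python
def ols_line(xs, ys):
    """Simple linear regression of ys on xs; returns (intercept, slope, rss)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, rss


def fit_one_threshold(xs, ys, min_size=3):
    """Grid search for one threshold tau; a separate line is fitted on each side.

    Observations with x <= tau form the first regime, the rest the second.
    min_size (an illustrative safeguard) is the smallest regime size allowed.
    """
    pairs = sorted(zip(xs, ys))
    sx = [p[0] for p in pairs]
    sy = [p[1] for p in pairs]
    n = len(sx)
    best = None
    for k in range(min_size, n - min_size + 1):
        if sx[k - 1] == sx[k]:
            continue  # a threshold must separate two distinct x values
        lo = ols_line(sx[:k], sy[:k])
        hi = ols_line(sx[k:], sy[k:])
        rss = lo[2] + hi[2]
        if best is None or rss < best["rss"]:
            best = {"tau": sx[k - 1], "beta_lo": lo[:2], "beta_hi": hi[:2],
                    "rss": rss, "sigma2": rss / n}
    return best


# Noise-free toy data (an assumption for illustration):
# y = 1 + 2x for x <= 4 and y = 10 - x for x > 4, discontinuous at 4.
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x if x <= 4.0 else 10.0 - x for x in xs]
fit = fit_one_threshold(xs, ys)
```

The returned dictionary holds the estimated threshold, the intercept and slope in each regime, the residual sum of squares, and the residual variance estimate `rss / n`. With several covariates, each regime fit would be a multiple regression rather than a simple line.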
The estimation is done in three steps. First, the segmentation variable, or the parameter $d$, is estimated if it is not known a priori. Then, with $d$ known or, if estimated, supposed known, the number of structural changes $l$ and the locations of the structural changes, the $\tau_i$'s, are estimated by a modified Schwarz' criterion. Finally, based on the estimated $d$, $l$ and $\tau_i$'s, the $\beta_i$'s and $\sigma^2$ are estimated by ordinary least squares. It will be shown in the next chapter that all these estimators are consistent under certain conditions.

It is obvious that to estimate $d$ consistently, it has to be identifiable. In Section 2.1, we discuss the identifiability of $d$. Specifically, we introduce a concept of identifiability and give equivalent conditions, all illustrated by examples. These conditions will be used in the next chapter to prove the consistency of the estimator of $d$.

Our estimation procedures are given in Section 2.2. In particular, two procedures are given to estimate $d$ under different conditions. The first one assumes less prior knowledge, while the second one requires less computational effort. Based on the estimated $d$, the estimation procedures for the other parameters are then given. Finally, all the procedures are illustrated by an example in which the dependence of gas consumption on the weight and horsepower of different cars is examined. Some general remarks are made in Section 2.3.

In the sequel, either a superscript or a subscript 0 will be used to denote the true parameter values.

2.1 Identifiability of the segmentation variable

Although in some applications the parameter $d$ can be determined a priori from background knowledge about the problem of concern, it can be hard to determine $d$ with reasonable certainty when background information is lacking. For instance, if the segmentation is related to a few highly correlated covariates, it may not be clear which one can best be chosen as the segmentation variable.
Therefore, there is a need for a defensible choice of $d$ based on the data. When the vector of covariates is of high dimension and $d$ cannot be identified by graphical methods, a computational procedure is required. However, when some of the covariates are highly correlated, it may not be clear whether $d$ can be uniquely identified. In the following, we discuss the exact meaning of being "identified" and give a set of conditions under which $d$ can be uniquely identified.

To simplify notation, let $\mathbf{x}$ have the same distribution as that of $\mathbf{x}_1$ and $R_j^0 = \{\mathbf{x} : x_{d^0} \in (\tau_{j-1}^0, \tau_j^0]\}$, $j = 1, \ldots, l^0 + 1$. And for any $d$, let $\{R_j^d\}_{j=1}^{L+1}$ be a partition of $R^p$, where $R_j^d = \{\mathbf{x} : x_d \in (\tau_{j-1}, \tau_j]\}$, $-\infty = \tau_0 < \tau_1 < \cdots < \tau_L < \tau_{L+1} = \infty$. Let $L$ be a known upper bound on the number of thresholds. Intuitively speaking, $d^0$ is identifiable if for any $d \ne d^0$, and any partition $\{R_j^d\}_{j=1}^{L+1}$, there is at least one region, say $R_r^d$, on which the model exhibits clear nonlinearity. Note that $L$ is involved. Indeed, the identifiability of $d^0$ does depend on $L$ when the domain of $\mathbf{x}$ takes a certain special form. This can be easily seen in the following two examples.

Example 1 $\mathbf{x}$ is uniformly distributed over the shaded area in Figure 2.1, and $y = 1_{(x_1 > 1)} + \epsilon$, where $1_{(\cdot)}$ is an indicator function. Here $R_1^0 = \{\mathbf{x} : x_1 \in (-\infty, 1]\}$ and $R_2^0 = \{\mathbf{x} : x_1 \in (1, \infty)\}$. For $L = 1$, no threshold on $x_2$ can make the model piecewise linear over its domain. The only possible threshold which makes the model piecewise linear is $\tau_1 = 1$, as defined in the model. For $L = 2$, however, $\tau_1 = -1$, $\tau_2 = 1$ also make the model piecewise linear over its domain. Hence either $x_1$ or $x_2$ can be used as the threshold variable.

The same phenomenon can also be seen in the next example.

Example 2 $\mathbf{x}$ is uniformly distributed with probabilities concentrated at the 8 points specified in Figure 2.2, and $y = 1_{(x_1 > 0)} \cdot x_2 + \epsilon$. For $L = 1$, no threshold on $x_2$ can make the model piecewise linear over its domain.
For $L = 2$, however, $\tau_1 = -1/2$, $\tau_2 = 1/2$ make the model piecewise linear over its domain. Hence either $x_1$ or $x_2$ can be used as the threshold variable.

Sometimes, but not always, one cannot determine whether or not the model is linear on $R_r^d$ unless the model can be uniquely determined on both $R_r^d \cap R_k^0$ and $R_r^d \cap R_{k+1}^0$ for a pair of adjacent regions $R_k^0$ and $R_{k+1}^0$. In Example 2, if $R_r^d = \{\mathbf{x} : x_2 \le 0\}$, dropping the point $(-1, -1)$ makes the model linear on $R_r^d$. Furthermore, since in model (2.1) we did not exclude the possibility of $\beta_i = \beta_j$ for nonadjacent $i$ and $j$, to ensure the detection of nonlinearity on $R_r^d$, the model has to be uniquely determined on $R_r^d \cap R_k^0$ and $R_r^d \cap R_{k+1}^0$ for at least one pair of adjacent regions. To this end, we need

$$\frac{1}{n} \sum_{t=1}^{n} \mathbf{x}_t \mathbf{x}_t' 1_{(\mathbf{x}_t \in R_r^d \cap R_{k+i}^0)} \qquad (2.2)$$

to be positive definite for $i = 1, 2$ and some $k \in \{0, \ldots, l^0 - 1\}$. Asymptotically, we need (2.2) to hold with probability approaching 1 as $n$ becomes large, and its LHS should not gradually degenerate to a singular matrix. This in turn can be stated as follows. For any set $A$, let $\lambda(A)$ be the smallest eigenvalue of $E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A)}]$. Define

$$\lambda(\{R_j^d\}_{j=1}^{L+1}) = \max_{r, k} \min_{i=1,2} \{\lambda(R_r^d \cap R_{k+i}^0)\}.$$

We will need $d^0$ to be identifiable, defined as follows:

Definition 2.1 $d^0$ is identifiable w.r.t. $L$ if for every $d \ne d^0$,

$$\lambda = \inf \lambda(\{R_j^d\}_{j=1}^{L+1}) > 0, \qquad (2.3)$$

where the inf is taken over all possible partitions of the form $\{R_j^d\}_{j=1}^{L+1}$.

If $l^0 = 1$, then $k = 0$ and $\lambda(\{R_j^d\}_{j=1}^{L+1}) = \max_r \min_{i=1,2} \{\lambda(R_r^d \cap R_i^0)\}$.

Now, let us examine the identifiability of $d^0$ in the two examples given above.

Example 1 (continued) $d^0$ is not identifiable w.r.t. $L = 2$, since for $d = 2$ and $(\tau_1, \tau_2) = (-1, 1)$, either $P(R_j^d \cap R_1^0) = 0$ or $P(R_j^d \cap R_2^0) = 0$ for all $j = 1, 2, 3$. $d^0$ is identifiable w.r.t. $L = 1$, since for any $\tau_1$, there exists $r \in \{1, 2\}$ such that $E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in R_r^d \cap R_i^0)}]$ is positive definite for $i = 1, 2$.

Example 2 (continued) $d^0$ is not identifiable w.r.t. $L = 2$. Let $d = 2$. If $(\tau_1, \tau_2) = (-0.5, 0.5)$ then each of the $R_j^d \cap R_i^0$ will contain no more than two points with positive masses, $i = 1, 2$, $j = 1, 2, 3$.
Hence $E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in R_j^d \cap R_i^0)}]$ will be degenerate for all $i$ and $j$. $d^0$ is identifiable w.r.t. $L = 1$, since for any $\tau_1$ and $i = 1, 2$, there exists $r \in \{1, 2\}$ such that $R_r^d \cap R_i^0$ contains at least 3 points, with positive masses, which are not collinear. Hence $E\{\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in R_r^d \cap R_i^0)}\}$ is positive definite. Because we have effectively just 4 choices of $\tau_1$, the smallest eigenvalues of $E\{\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in R_r^d \cap R_i^0)}\}$, $i = 1, 2$, are positive.

In more complicated cases, the identifiability condition may not be easy to verify. An equivalent condition is given in the theorem below. This theorem is essential in showing that the two methods of estimating $d^0$ given in the next section are consistent.

Theorem 2.1 The following conditions are equivalent:
(i) $d^0$ is identifiable w.r.t. $L$;
(ii) for any $d \ne d^0$, there exist sets $\{A_j^d\}_{j=1}^{L+1}$ of the form $A_j^d = \{\mathbf{x} : a_j \le x_d \le b_j\}$ such that
(a) $\lambda(A_s^d \cap R_{k+i}^0) > 0$ for some $0 \le k \le l^0 - 1$ and all $i = 1, 2$, $s = 1, \ldots, L+1$, and
(b) for any partition $\{R_j^d\}_{j=1}^{L+1}$, $A_s^d \subset R_r^d$ for some $r, s \in \{1, \ldots, L+1\}$.

Before proving the theorem, let us find the $A_j^d$'s in the two examples given above. Assume, arbitrarily, $d = 2$. In Example 1, let $A_1^d = \{\mathbf{x} : -2 \le x_2 \le -0.5\}$ and $A_2^d = \{\mathbf{x} : 0.5 \le x_2 \le 2\}$. Then $A_1^d$ and $A_2^d$ satisfy (ii) in Theorem 2.1. In Example 2, let $A_1^d = \{\mathbf{x} : -1 \le x_2 \le 0\}$ and $A_2^d = \{\mathbf{x} : 0 \le x_2 \le 1\}$. Note that in this case $A_1^d \cap A_2^d = \{\mathbf{x} : x_2 = 0\}$; the sets overlap.

For any measurable set $C$ in $R$, let $\lambda^d(C) = \min_{i=1,2} \lambda(\{\mathbf{x} : x_d \in C\} \cap R_i^0)$.

Lemma 2.1 $\lambda^d([a, u])$ is right continuous in $u$. $\lambda^d([u, b])$ is left continuous in $u$. Also, $\lim_{b \to -\infty} \lambda^d((-\infty, b]) = 0$, $\lim_{a \to \infty} \lambda^d([a, +\infty)) = 0$ and $\lambda^d(\{a\}) = 0$.

Proof Let $A = \{\mathbf{x} : a \le x_d \le u\} \cap R_1^0$, $A_\delta = \{\mathbf{x} : u < x_d \le u + \delta\} \cap R_1^0$ and $A_+ = \{\mathbf{x} : a \le x_d \le u + \delta\} \cap R_1^0$. Then $A_+ = A \cup A_\delta$.
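Let $\mathbf{a}$ be the normalized eigenvector corresponding to $\lambda(A)$, the smallest eigenvalue of $E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A)}]$. Then

$$\lambda(A) = \mathbf{a}' E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A)}] \mathbf{a} = \mathbf{a}' E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A_+)}] \mathbf{a} - \mathbf{a}' E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A_\delta)}] \mathbf{a} \ge \lambda(A_+) - \mathbf{a}' E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A_\delta)}] \mathbf{a} \ge \lambda(A_+) - \mathrm{tr}(E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A_\delta)}]) = \lambda(A_+) - E[\mathbf{x}'\mathbf{x} 1_{(\mathbf{x} \in A_\delta)}].$$

By the dominated convergence theorem, $E[\mathbf{x}'\mathbf{x} 1_{(\mathbf{x} \in A_\delta)}] = E[\mathbf{x}'\mathbf{x} 1_{(\{\mathbf{x} : u < x_d \le u + \delta\} \cap R_1^0)}]$ converges to 0 as $\delta \to 0^+$. Therefore, $\lambda(A) \le \lambda(A_+) \le \lambda(A) + o(1)$, and $\lambda(\{\mathbf{x} : a \le x_d \le u\} \cap R_1^0)$ is right continuous in $u$. Replacing $R_1^0$ by $R_2^0$ in the above argument, we have that $\lambda(\{\mathbf{x} : a \le x_d \le u\} \cap R_2^0)$ is right continuous in $u$. Since $\lambda^d([a, u])$ is the minimum of the two right continuous functions, it is also right continuous.

Now, let $A = \{\mathbf{x} : u \le x_d \le b\} \cap R_1^0$, $A_\delta = \{\mathbf{x} : u - \delta \le x_d < u\} \cap R_1^0$ and $A_- = \{\mathbf{x} : u - \delta \le x_d \le b\} \cap R_1^0$. Then $A_- = A \cup A_\delta$. Let $\mathbf{a}$ be the normalized eigenvector corresponding to $\lambda(A)$, the smallest eigenvalue of $E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A)}]$. Then

$$\lambda(A) = \mathbf{a}' E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A)}] \mathbf{a} = \mathbf{a}' E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A_-)}] \mathbf{a} - \mathbf{a}' E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A_\delta)}] \mathbf{a} \ge \lambda(A_-) - \mathbf{a}' E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A_\delta)}] \mathbf{a} \ge \lambda(A_-) - \mathrm{tr}(E[\mathbf{x}\mathbf{x}' 1_{(\mathbf{x} \in A_\delta)}]) = \lambda(A_-) - E[\mathbf{x}'\mathbf{x} 1_{(\mathbf{x} \in A_\delta)}].$$

By the dominated convergence theorem, $E[\mathbf{x}'\mathbf{x} 1_{(\mathbf{x} \in A_\delta)}] = E[\mathbf{x}'\mathbf{x} 1_{(\{\mathbf{x} : u - \delta \le x_d < u\} \cap R_1^0)}]$ converges to 0 as $\delta \to 0^+$. Therefore, $\lambda(A) \le \lambda(A_-) \le \lambda(A) + o(1)$, and $\lambda(\{\mathbf{x} : u \le x_d \le b\} \cap R_1^0)$ is left continuous in $u$. Replacing $R_1^0$ by $R_2^0$ in the above argument, we have that $\lambda(\{\mathbf{x} : u \le x_d \le b\} \cap R_2^0)$ is left continuous in $u$. Since $\lambda^d([u, b])$ is the minimum of the two left continuous functions, it is also left continuous.

Observe that

$$0 \le \lambda^d([a, +\infty)) \le \mathrm{tr}(E[\mathbf{x}\mathbf{x}' 1_{(\{\mathbf{x} : a \le x_d < \infty\} \cap R_1^0)}]) \le E[\mathbf{x}'\mathbf{x} 1_{(\{\mathbf{x} : a \le x_d < \infty\} \cap R_1^0)}].$$

By the dominated convergence theorem, the RHS converges to 0 as $a \to \infty$. Thus $\lim_{a \to \infty} \lambda^d([a, +\infty)) = 0$. Similarly,

$$0 \le \lambda^d((-\infty, b]) \le \mathrm{tr}(E[\mathbf{x}\mathbf{x}' 1_{(\{\mathbf{x} : -\infty < x_d \le b\} \cap R_1^0)}]) \le E[\mathbf{x}'\mathbf{x} 1_{(\{\mathbf{x} : -\infty < x_d \le b\} \cap R_1^0)}].$$

By the dominated convergence theorem again, the RHS converges to 0 as $b \to -\infty$. Thus $\lim_{b \to -\infty} \lambda^d((-\infty, b]) = 0$.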
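Since the $(d+1)$th row of the matrix $E[\mathbf{x}\mathbf{x}' 1_{(\{\mathbf{x} : x_d = a\} \cap R_1^0)}]$ is its first row multiplied by $a$, its rank is less than or equal to $p$ and hence it is degenerate. The same holds for $E[\mathbf{x}\mathbf{x}' 1_{(\{\mathbf{x} : x_d = a\} \cap R_2^0)}]$. Hence $\lambda^d(\{a\}) = 0$.

Let $b_L^* = \sup\{b : \lambda^d([b, +\infty)) \ge \lambda\}$, where $\lambda > 0$ is given by Definition 2.1, $b_{L+1}^* = \infty$, and, recursively, $b_{j-1}^* = \sup\{b < b_j^* : \lambda^d([b, b_j^*]) \ge \lambda\}$, $j = 2, \ldots, L$, where, by convention, $b_{j-1}^* = -\infty$ if $\{b < b_j^* : \lambda^d([b, b_j^*]) \ge \lambda\} = \emptyset$.

Lemma 2.2 Suppose $d^0$ is identifiable w.r.t. $L$. Let $b_0^* = -\infty$. Then
(i) $-\infty = b_0^* < b_1^* < \cdots < b_L^* < b_{L+1}^* = \infty$; and
(ii) $\lambda^d((-\infty, b_1^*]) \ge \lambda$.

Proof (i) Lemma 2.1 implies $\lim_{a \to \infty} \lambda^d([a, \infty)) = 0$, so $b_L^* < \infty$. And $b_L^* > -\infty$. For if it were not, i.e., $b_L^* = -\infty$, then since $\lim_{b \to -\infty} \lambda^d((-\infty, b]) = 0$, there exists $\tau_1 \in (-\infty, \infty)$ such that $\lambda^d((-\infty, \tau_1]) < \lambda$. In view of the definition of $b_L^*$ and the assumption that $b_L^* = -\infty$, we have that $\lambda^d((\tau_1, \infty)) < \lambda$. For any $\tau_2, \ldots, \tau_L$ such that $-\infty = \tau_0 < \tau_1 < \tau_2 < \cdots < \tau_L < \tau_{L+1} = \infty$, we have $\lambda^d((\tau_{j-1}, \tau_j]) < \lambda$, $j = 1, \ldots, L+1$. This contradicts the definition of $\lambda$. So, $-\infty < b_L^* < \infty$.

Assume that $b_j^*, \ldots, b_L^*$ have been well defined and satisfy $-\infty < b_j^* < \cdots < b_L^* < \infty$. We will now show that $-\infty < b_{j-1}^* < b_j^*$. By Lemma 2.1, $\lambda^d(\{a\}) = 0$ and $\lambda^d([u, b])$ is left continuous in $u$. Hence, $b_{j-1}^* < b_j^*$. Suppose $b_{j-1}^* = -\infty$. Since $\lim_{b \to -\infty} \lambda^d((-\infty, b]) = 0$, there exists $\tau_{j-1} \in (-\infty, b_j^*)$ such that $\lambda^d((-\infty, \tau_{j-1}]) < \lambda$. For this $\tau_{j-1}$, let $\tau_0 = -\infty$ and choose $\tau_1, \ldots, \tau_{j-2}$ such that $-\infty = \tau_0 < \tau_1 < \cdots < \tau_{j-2} < \tau_{j-1}$. Then $\lambda^d((\tau_{k-1}, \tau_k]) \le \lambda^d((-\infty, \tau_{j-1}]) < \lambda$, $k = 1, \ldots, j-1$. Since $b_{j-1}^* = -\infty$, $\lambda^d([\tau_{j-1}, b_j^*]) < \lambda$. By the right continuity of $\lambda^d([a, \cdot])$, there exists $\delta_j > 0$ such that $\tau_j = b_j^* + \delta_j \in (b_j^*, b_{j+1}^*)$ and $\lambda^d([\tau_{j-1}, \tau_j]) < \lambda$. Repeating this argument, we can see that there exists $\delta_k > 0$ such that $\tau_k = b_k^* + \delta_k \in (b_k^*, b_{k+1}^*)$ and $\lambda^d([\tau_{k-1}, \tau_k]) < \lambda$, where $k = j, \ldots, L$. By the definition of $b_L^*$, $\lambda^d([\tau_L, \infty)) < \lambda$. In summary, we have $\lambda^d((\tau_{k-1}, \tau_k]) \le \lambda^d([\tau_{k-1}, \tau_k]) < \lambda$, $k = 1, \ldots, L$, and $\lambda^d((\tau_L, \infty)) < \lambda$.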
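That is, the partition $\{R_j^d\}_{j=1}^{L+1}$, where $R_j^d = \{\mathbf{x} : x_d \in (\tau_{j-1}, \tau_j]\}$, satisfies $\min_{i=1,2} \lambda(R_j^d \cap R_i^0) = \lambda^d((\tau_{j-1}, \tau_j]) < \lambda$, $j = 1, \ldots, L+1$. This again contradicts the definition of $\lambda$. By induction, $-\infty < b_{j-1}^* < b_j^*$ for $j = 2, \ldots, L+1$. Thus, (i) is verified.

(ii) If not, $\lambda^d((-\infty, b_1^*]) < \lambda$. Then, by the right continuity of $\lambda^d([a, \cdot])$, there exists $\delta_1 > 0$ such that $\tau_1 = b_1^* + \delta_1 < b_2^*$ and $\lambda^d((-\infty, \tau_1]) < \lambda$. By the definition of $b_1^*$, $\lambda^d([\tau_1, b_2^*]) < \lambda$ and hence there exists $\delta_2 > 0$ such that $\tau_2 = b_2^* + \delta_2 < b_3^*$ and $\lambda^d([\tau_1, \tau_2]) < \lambda$. By repeating this process we shall see that there exist $-\infty = \tau_0 < \tau_1 < \cdots < \tau_{L-1} < b_L^* < \tau_L = b_L^* + \delta_L < \tau_{L+1} = \infty$ such that $\lambda^d((\tau_{j-1}, \tau_j]) < \lambda$, $j = 1, \ldots, L+1$. This leads again to a contradiction to the definition of $\lambda$.

Proof of Theorem 2.1 Without loss of generality, $l^0 = 1$ is assumed. Suppose (ii) holds. The condition $\lambda(A_s^d \cap R_i^0) > 0$ for all $s$ and $i$ implies $\min_{i,s} \{\lambda(A_s^d \cap R_i^0)\} > 0$. Then, $\lambda(\{R_j^d\}_{j=1}^{L+1}) \ge \min_{i=1,2} \lambda(R_r^d \cap R_i^0) \ge \min_{i=1,2} \lambda(A_s^d \cap R_i^0) \ge \min_{i,s} \{\lambda(A_s^d \cap R_i^0)\}$. We conclude that $d^0$ is identifiable w.r.t. $L$ by taking the infimum in the last inequality.

Now assume (i) holds. Let $A_j^d = \{\mathbf{x} : x_d \in [b_{j-1}^*, b_j^*]\}$, where $b_j^*$ is defined in Lemma 2.2, $j = 1, \ldots, L+1$. By Lemma 2.2, $-\infty = b_0^* < b_1^* < \cdots < b_L^* < b_{L+1}^* = \infty$ and $\lambda^d((-\infty, b_1^*]) \ge \lambda$. By the definition of the $b_j^*$'s, $\lambda^d([u, b_j^*]) \ge \lambda$ for all $u < b_{j-1}^*$, $j = 2, \ldots, L+1$. By Lemma 2.1, $\lambda^d([u, b])$ is left continuous in $u$. Hence, $\lambda^d([b_{j-1}^*, b_j^*]) \ge \lambda$, $j = 2, \ldots, L+1$. By the definition of $\lambda^d(\cdot)$, $\lambda(A_1^d \cap R_i^0) = \lambda(\{\mathbf{x} : x_d \in (-\infty, b_1^*]\} \cap R_i^0) \ge \lambda^d((-\infty, b_1^*]) \ge \lambda$, and $\lambda(A_s^d \cap R_i^0) = \lambda(\{\mathbf{x} : x_d \in [b_{s-1}^*, b_s^*]\} \cap R_i^0) \ge \lambda^d([b_{s-1}^*, b_s^*]) \ge \lambda$, $s = 2, \ldots, L+1$. That is, $\{A_s^d\}_{s=1}^{L+1}$ satisfy (a) in Theorem 2.1 (ii). It remains to show that for any $\{R_j^d\}_{j=1}^{L+1}$, where $R_j^d = \{\mathbf{x} : x_d \in (\tau_{j-1}, \tau_j]\}$, there exist $r, s \in \{1, \ldots, L+1\}$ such that $R_r^d \supset A_s^d$. We shall show it by a sequential exhaustive argument. If $R_1^d \not\supset A_1^d$, then $\tau_1 < b_1^*$. If $R_i^d \not\supset A_i^d$, $i = 1, 2$, then $\tau_2 < b_2^*$. If $R_i^d \not\supset A_i^d$, $i = 1, 2, 3$, then $\tau_3 < b_3^*$.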
Continuing in this way, if $R_i^d \not\supset A_i^d$, $i = 1, \ldots, L$, then $\tau_L < b_L^*$ and hence $R_{L+1}^d \supset A_{L+1}^d$. This completes the proof of Theorem 2.1.

Corollary 2.2 Suppose the distribution of $\mathbf{z}_1 = (x_{11}, \ldots, x_{1p})'$ has support $(a_1, b_1) \times \cdots \times (a_p, b_p)$, where $-\infty < a_i < b_i < \infty$, $i = 1, \ldots, p$. Then for any integer $L \ge l^0$, $d^0$ is identifiable w.r.t. $L$.

Proof For any $d \ne d^0$, any $L+1$ mutually exclusive subsets of the form $\{\mathbf{x} : x_d \in [\alpha, \eta]\}$, where $\alpha < \eta$ and $[\alpha, \eta] \subset (a_d, b_d)$, will serve as the $\{A_j^d\}_{j=1}^{L+1}$ in Theorem 2.1. Hence the identifiability of $d^0$ follows.

Corollary 2.3 Suppose the support of the distribution of $\mathbf{z}_1 = (x_{11}, \ldots, x_{1p})'$ is a convex subset of $R^p$. Then for any integer $L \ge l^0$, $d^0$ is identifiable w.r.t. $L$.

Proof Since the support of the distribution of $\mathbf{z}_1$ is convex, it contains a subset of the form $(a_1, b_1) \times \cdots \times (a_p, b_p)$, where $-\infty < a_i < b_i < \infty$, $i = 1, \ldots, p$. For any $d \ne d^0$, any $L+1$ mutually exclusive subsets of the form $\{\mathbf{x} : x_d \in [\alpha, \eta]\}$, where $\alpha < \eta$ and $[\alpha, \eta] \subset (a_d, b_d)$, will serve as the $\{A_j^d\}_{j=1}^{L+1}$ in Theorem 2.1.

2.2 Estimation procedures

The least squares criterion is used to select $d$. The idea is simple. Suppose that $d^0$ is identifiable and that a wrong $d$ were chosen as the threshold variable. Then for sufficiently large $n$, on at least one of the $R_j^d$'s, say $R_r^d$, the model exhibits nonlinearity, resulting in a large sum of squared errors on $R_r^d$. Hence, the total sum of squared residuals is large. In contrast, if $d^0$ were chosen, then by adjusting the $\hat\tau_j$'s, the model on each $\{\mathbf{x} : \hat\tau_{j-1} < x_{d^0} \le \hat\tau_j\}$ would be roughly linear, resulting in a smaller total sum of squared errors. Therefore, $d$ should be chosen as the one resulting in the smallest total sum of squared errors.

To simplify the implementation of this idea, let $Y_n := (y_1, \ldots, y_n)'$, $X_n := (\mathbf{x}_1, \ldots, \mathbf{x}_n)'$, $\epsilon_n := (\epsilon_1, \ldots, \epsilon_n)'$, and for $A \subset R^{p+1}$,

$$I_n(A) := \mathrm{diag}(1_{(\mathbf{x}_1 \in A)}, \ldots, 1_{(\mathbf{x}_n \in A)}),$$
$$X_n(A) := I_n(A) X_n,$$
$$H_n(A) := X_n(A) [X_n'(A) X_n(A)]^- X_n'(A),$$
$$S_n(A) := Y_n'(I_n(A) - H_n(A)) Y_n,$$
$$T_n(A) := \epsilon_n' H_n(A) \epsilon_n,$$

where in general for any matrix $M$, $M^-$ denotes a generalized inverse.
Note that $X_n(A)$, $H_n(A)$ and $S_n(A)$ are, respectively, the covariates, the "hat matrix" and the sum of squared residual errors from fitting a linear model based on just the observations in $A$. Finally, for any $\{R_j^d\}_{j=1}^{l+1}$ define the total sum of squares over the different regions as

$$S_n^d(\tau_1, \ldots, \tau_l) := \sum_{j=1}^{l+1} S_n(R_j^d).$$

The first method for estimating $d^0$ is given below.

Method 1 Suppose $d^0$ is identifiable w.r.t. $L$. Choose $d$ to minimize the sum of squared errors. More precisely, let $S_n^d := S_n^d(\hat\tau_1^d, \ldots, \hat\tau_L^d)$, where $\hat\tau_1^d < \cdots < \hat\tau_L^d$ minimize $S_n^d(\tau_1, \ldots, \tau_L)$ over all $(\tau_1, \ldots, \tau_L)$. Select $\hat d$ such that $S_n^{\hat d} \le S_n^d$ for $d = 1, \ldots, p$. Should multiple minimizers occur, we define $\hat d$ to be the smallest of them.

Remark When calculating $S_n(R_j^d)$, at least $p$ data points must be in $R_j^d$ to ensure that the regression coefficients on that segment are uniquely determined.

This method requires intensive computation. As Feder (1975a) and other authors note, $S_n^d(\tau_1, \ldots, \tau_L)$ may not be differentiable at the true change points. So to minimize $S_n^d(\tau_1, \ldots, \tau_L)$, one has to search over all $(\tau_1, \ldots, \tau_L)$. Fortunately, we can do this by restricting ourselves to the finite set $\{x_{1d}, \ldots, x_{nd}\}$, without loss of generality. Even so, exhausting all $(\tau_1, \ldots, \tau_L)$ for any $d$ needs $\binom{n}{L} \times (L+1)$ linear fits. Although a method more efficient than actually doing the $\binom{n}{L} \times (L+1)$ fits exists, there is still a lot of work for any $L > 3$ and large $n$. So, under stronger conditions, we give another, more efficient method.

This method is based on the following idea. Suppose $\mathbf{z}_1 = (x_{11}, \ldots, x_{1p})'$ is a continuous random vector and the support of its distribution is $(a_1, b_1) \times \cdots \times (a_p, b_p)$, where $-\infty < a_i < b_i < \infty$, $i = 1, \ldots, p$. Then for any $d$ we can partition $(a_d, b_d)$ into $2L+2$ disjoint intervals such that there are an equal number of observations in each of the intervals. For any $d \ne d^0$, on all these intervals the model will exhibit nonlinearity and hence the linear fits will result in larger sums of squared errors.
If $d = d^0$, then there are at least $L+1$ intervals that are entirely embedded in one of the $(\tau_{j-1}^0, \tau_j^0]$'s. Hence, on those intervals, the model is linear and the sums of squared errors from the linear fits are smaller. Thus, the total of the smallest $L+1$ sums of squared errors for $d = d^0$ is expected to be smaller than that for $d \ne d^0$. It is easy to see that the above argument holds as long as the number of partitions is no less than $L+2$. The practical advantages of choosing a number larger than $L+2$ will be discussed in Section 3.2. We summarize the above discussion as follows:

Method 2 Suppose $\mathbf{z}_1 = (x_{11}, \ldots, x_{1p})'$ is a continuous random vector and the support of its distribution is $(a_1, b_1) \times \cdots \times (a_p, b_p)$, where $-\infty < a_i < b_i < \infty$, $i = 1, \ldots, p$. Let $\hat\tau_j^d$ be the $[100 \times j/(2L+2)]$th percentile of the $x_{td}$'s and $\hat R_j^d = \{\mathbf{x}_t : x_{td} \in (\hat\tau_{j-1}^d, \hat\tau_j^d]\}$, $j = 1, \ldots, 2L+2$. Select $\hat d$ so that $\tilde S_n^{\hat d} \le \tilde S_n^d$ for all $d = 1, \ldots, p$, where

$$\tilde S_n^d = \sum_{i=1}^{L+1} S_n(\hat R_{(i)}^d)$$

and $S_n(\hat R_{(i)}^d)$ is the $i$th smallest of $S_n(\hat R_1^d), \ldots, S_n(\hat R_{2L+2}^d)$.

Remark For any $d$, Method 2 requires only $2L+2$ linear fits (independent of $n$). The computational effort is significantly reduced compared with Method 1.

Now, with $d^0$ estimated as above, we can treat $d^0$ as known and estimate the other parameters. For simplicity, we shall drop the superscript $d$ on the $R_j^d$'s and $\tau_j^d$'s in the rest of this section. First we estimate $l^0$ and the thresholds, $\tau_1^0, \ldots, \tau_{l^0}^0$, by minimizing the modified Schwarz' criterion (Schwarz, 1978),

$$MIC(l) := \ln[S(\hat\tau_1, \ldots, \hat\tau_l)/(n - p^*)] + p^* \frac{c_0 (\ln n)^{2+\delta_0}}{n} \qquad (2.4)$$

for some constants $c_0 > 0$, $\delta_0 > 0$. In equation (2.4), $p^* = (l+1)p + l \le (l+1)(p+1)$ is the total number of fitted parameters, and for any fixed $l$, $\hat\tau_1, \ldots, \hat\tau_l$ are the least squares estimates which minimize $S_n(\tau_1, \ldots, \tau_l)$ subject to $-\infty = \tau_0 < \tau_1 < \cdots < \tau_{l+1} = \infty$. Recall that Schwarz' criterion (SC) is defined by

$$SC(l) = \ln[S(\hat\tau_1, \ldots, \hat\tau_l)/(n - p^*)] + p^* \frac{\ln n}{n}. \qquad (2.5)$$

We can see that the distinction between $MIC(l)$ and $SC(l)$ lies in the severity of the penalty for overspecification.
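As a rough illustration of Method 2, the sketch below bins each candidate covariate at its sample $j/(2L+2)$ quantiles, fits a linear model in every bin, and sums the $L+1$ smallest residual sums of squares. The function names (`sse_linear_fit`, `method2_score`, `select_d_method2`) and the simulated data are hypothetical and serve only as an example; the sketch also assumes every bin contains enough observations for the fit to be meaningful.

```python
import numpy as np

def sse_linear_fit(X, y):
    """Residual sum of squares from an ordinary least squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return float(r @ r)

def method2_score(X, y, d, L):
    """Sum of the L+1 smallest bin-wise SSEs for candidate covariate d (1-based)."""
    K = 2 * L + 2
    cuts = np.quantile(X[:, d], np.arange(1, K) / K)    # j/(2L+2) sample quantiles
    bins = np.searchsorted(cuts, X[:, d], side="left")  # bin j holds (cut_{j-1}, cut_j]
    sse = [sse_linear_fit(X[bins == j], y[bins == j]) for j in range(K)]
    return float(np.sum(np.sort(sse)[: L + 1]))

def select_d_method2(X, y, L):
    """Return the candidate d with the smallest score (smallest d on ties)."""
    p = X.shape[1] - 1
    scores = [method2_score(X, y, d, L) for d in range(1, p + 1)]
    return 1 + int(np.argmin(scores)), scores

# Illustrative data: x_1 is the true segmentation variable, threshold at 0.5.
rng = np.random.default_rng(1)
n = 600
z = rng.uniform(-2.0, 2.0, size=(n, 2))
X = np.column_stack([np.ones(n), z])
y = np.where(z[:, 0] <= 0.5, 1.0 + 2.0 * z[:, 0], 4.0 - z[:, 0] + z[:, 1])
y = y + rng.normal(0.0, 0.3, size=n)
d_hat, scores = select_d_method2(X, y, L=1)
```

With $L = 1$ the covariate range is cut into four bins; when the bins are formed on the true segmentation variable, at most one of them straddles the threshold, so the two smallest residual sums of squares come from bins on which the model is linear.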
A more severe penalty is essential for the correct specification of a non-Gaussian segmented regression model, since $SC(l)$ is derived under a Gaussian assumption (cf. Yao, 1988). Both criteria are sometimes referred to as penalized least squares.

With estimates $\hat l$ of $l^0$ and $\hat\tau_i$ of $\tau_i^0$, $i = 1, \ldots, \hat l$, available, we then estimate the other regression parameters and the residual variance by the ordinary least squares estimates,

$$\hat\beta_i = [X_n'(\hat R_i) X_n(\hat R_i)]^- X_n'(\hat R_i) Y_n, \quad i = 1, \ldots, \hat l + 1,$$

and

$$\hat\sigma^2 = S_n(\hat\tau_1, \ldots, \hat\tau_{\hat l})/(n - p^*),$$

where $\hat R_i = \{\mathbf{x} : \hat\tau_{i-1} < x_{d^0} \le \hat\tau_i\}$ and $p^* = (\hat l + 1)p + \hat l$. Under regularity conditions essential for the identifiability of the regression parameters, we shall see in Chapter 3 that the ordinary least squares estimates $\hat\beta_j$ will be unique with probability approaching 1, for $j = 1, \ldots, l+1$, as $n \to \infty$.

While for a really large sample size we do not expect the choice of $\delta_0$ and $c_0$ to be crucial, for small to moderate sample sizes this choice does influence the model selection. Below, we briefly discuss the choice of $c_0$ and $\delta_0$.

In general, when selecting models, a relatively large penalty term would be preferable for models that can be easily identified. This is because a larger penalty will greatly reduce the probability of overestimation while not risking underestimation too much. However, if the model is difficult to identify (e.g., a continuous model with $\|\beta_{j+1} - \beta_j\|$ small), the penalty should not be too large, since the risk of underestimation is now high.

Another factor influencing the choice of the penalty is the error distribution. A distribution with heavy tails is likely to generate extreme values, making it look as though a change in response has occurred. To counter this effect, one needs a heavier penalty. In fact, if $\epsilon_t$ has only finite order moments, a heavier penalty, of an order depending on some $\alpha > 0$, is needed to make the estimation of $l^0$ consistent.
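The selection of the number of thresholds can be sketched in code for the simplest comparison, $l = 0$ against $l = 1$, with the segmentation variable treated as known. The sketch assumes the penalty in (2.4) has the form $p^* c_0 (\ln n)^{2+\delta_0}/n$ and counts $p^* = (l+1)p + l$ as stated in the text. The helper names (`sse`, `best_single_threshold`, `mic`), the minimum segment size `min_pts` and the simulated data are hypothetical choices for illustration.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an ordinary least squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return float(r @ r)

def best_single_threshold(X, y, d, min_pts):
    """Least squares estimate of one threshold on covariate d,
    searching only over the observed values of that covariate."""
    xs = np.sort(X[:, d])
    best_S, best_tau = np.inf, None
    for tau in xs[min_pts - 1 : len(xs) - min_pts]:   # keep >= min_pts points per side
        left = X[:, d] <= tau
        S = sse(X[left], y[left]) + sse(X[~left], y[~left])
        if S < best_S:
            best_S, best_tau = S, float(tau)
    return best_S, best_tau

def mic(S, n, p_star, c0=0.299, delta0=0.1):
    """MIC value, assuming the penalty form p* c0 (ln n)^(2+delta0) / n."""
    return float(np.log(S / (n - p_star)) + p_star * c0 * np.log(n) ** (2 + delta0) / n)

# Illustrative data: one covariate (p = 1), a discontinuous change at 0.5.
rng = np.random.default_rng(2)
n, p = 200, 1
x = rng.uniform(-2.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])
y = np.where(x <= 0.5, 1.0 + x, 3.0 - x) + rng.normal(0.0, 0.2, size=n)

S0 = sse(X, y)                                        # l = 0: one linear fit
S1, tau_hat = best_single_threshold(X, y, d=1, min_pts=5)
mic_values = [mic(S0, n, (0 + 1) * p + 0), mic(S1, n, (1 + 1) * p + 1)]
l_hat = int(np.argmin(mic_values))
```

Restricting the search to observed covariate values loses nothing, because the residual sum of squares only changes when a threshold crosses a data point. Extending the search to $l \ge 2$ requires looping over ordered tuples of thresholds, which is where the computational cost discussed above arises.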
Given that the best criterion is model dependent and no uniformly optimal choice can be made, the following considerations guide us to a reasonable choice of $\delta_0$ and $c_0$:

(1) From the proof of Lemma 3.2 in Section 3.1, we shall see that it is possible that the exponent $2 + \delta_0$ in the penalty term of MIC may be further reduced, while keeping the model selection procedure consistent. And since Schwarz' criterion (where the exponent is 1) is obtained by maximizing the posterior likelihood in a model selection paradigm and is widely used in model selection problems, it may be used as a baseline reference. Adopting such a view, $\delta_0$ should be small to reduce the potential risk of underestimation when the noise is normal and $n$ is not large.

(2) For a small sample, it is practically difficult to distinguish normal from double exponential noise, or from $t$ distributed noise. Hence, one would not expect the choice of SC or any other reasonable criterion to make a drastic difference.

(3) As Yao (1988) noted for large samples, SC tends to overestimate $l^0$ if the noise is not normal. We observe such overestimation in our simulations under different model specifications when $n = 50$ (see Section 3.3).

Based on (1), we should choose a small $\delta_0$. By (2), with $\delta_0$ chosen, we can choose some moderate $n_0$ and solve for $c_0$ by forcing MIC to equal SC at $n_0$. By (3), $n_0 < 50$ seems desirable. In the simulation reported in the next section, we (arbitrarily) choose $\delta_0$ to be 0.1 (which is considered to be small). With such a $\delta_0$, we arbitrarily choose $n_0 = 20$ and solve for $c_0$. We get $c_0 = 0.299$.

In summary, since the "best" selection of the penalty is model dependent for finite samples, no optimal pair $(c_0, \delta_0)$ can be recommended. On the other hand, our choice of $\delta_0 = 0.1$ and $c_0 = 0.299$ performs reasonably well for most of the cases we experimented with in our simulation. The simulation results are reported in Section 3.3. Further study is needed on the choice of $\delta_0$ and $c_0$ under different assumptions.
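The calibration of $c_0$ described above can be checked numerically. Assuming the MIC penalty has the form $p^* c_0 (\ln n)^{2+\delta_0}/n$ and the SC penalty the form $p^* \ln n / n$, equating the two at $n = n_0$ gives $c_0 = (\ln n_0)^{-(1+\delta_0)}$. The function name `calibrate_c0` below is a hypothetical helper.

```python
import math

def calibrate_c0(n0, delta0):
    """Solve c0 * (ln n0)^(2 + delta0) = ln n0 for c0."""
    return math.log(n0) ** (-(1.0 + delta0))

c0 = calibrate_c0(20, 0.1)
print(round(c0, 3))  # 0.299
```

With $n_0 = 20$ and $\delta_0 = 0.1$ this reproduces the value $c_0 = 0.299$ quoted in the text, which supports the assumed form of the two penalties.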
A data set used in Henderson and Velleman (1981) is analyzed below to illustrate the estimation procedures proposed above. The data consist of measurements of three variables, miles per gallon ($y$), weight ($x_1$) and horsepower ($x_2$), on thirty-eight 1978-79 model automobiles. The dependence of $y$ on $x_1$ and $x_2$ is of interest. Graphs of the data show a certain nonlinear dependence structure between $y$ and $x_1$ (see Figure 2.3). Suppose we want to fit a model of the form (2.1). In this case, it becomes

$$y_t = \beta_{i0} + \beta_{i1} x_{t1} + \beta_{i2} x_{t2} + \epsilon_t, \quad \text{if } x_{td} \in (\tau_{i-1}, \tau_i], \quad i = 1, \ldots, l+1, \qquad (2.6)$$

where $\epsilon_t$ is assumed to have zero mean and variance $\sigma^2$.

To demonstrate the use of the two methods of estimating $d^0$, let us ignore the information given by Figure 2.3 (which suggests $d^0 = 1$ and $l^0 = 1$) and estimate $d^0$ by calculation. First, we (arbitrarily) choose $L = 2$ and apply Method 1. We get $S_n^1 = 120.0$ and $S_n^2 = 136.0$. Hence $\hat d = 1$ is chosen by Method 1. With $L = 2$ we get, on applying Method 2, $\tilde S_n^1 = 14.6$ and $\tilde S_n^2 = 15.3$. Thus, $\hat d = 1$ is also chosen by Method 2. Both methods agree with the casual observation made above about Figure 2.3.

Next, with $\hat d = 1$, we calculate and compare $MIC(l)$ for $l = 0, 1, 2$ to estimate $l^0$. For illustrative purposes, the constants $c_0$ and $\delta_0$ in the penalty term of MIC are chosen as 0.2 and 0.05 respectively, to enable the piecewise model to remain competitive for this small sample example. The MIC values for $l = 0, 1, 2$ are 2.28, 2.11 and 2.31 respectively. Thus $\hat l = 1$ is chosen by the criterion. Then with $\hat l = 1$, $\hat\tau_1 = 2.7$ is obtained. With these estimates, the estimated coefficients are $(\hat\beta_{10}, \hat\beta_{11}, \hat\beta_{12}) = (48.82, -5.23, -0.08)$, $(\hat\beta_{20}, \hat\beta_{21}, \hat\beta_{22}) = (30.76, -1.84, -0.05)$ and $\hat\sigma^2 = 4.90$.

Finally, treating the MIC as a general model selection criterion rather than a tool for finding $l^0$, two more competing models are fitted to the data. These are

$$y_t = \beta_0 + \beta_1 x_{t1} + \epsilon_t, \qquad (2.7)$$

and

$$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t1}^2 + \beta_3 x_{t2} + \epsilon_t. \qquad (2.8)$$

From Figure 2.3, both models seem appealing.
The MIC values for these two models are 2.24 and 2.12. Thus, the segmented model is chosen as the "best". Needless to say, it is only the "best" among the few models considered; further model reduction may be possible.

2.3 General remarks

In Section 2.1, we have discussed the identifiability of $d^0$. It can be seen from Corollary 2.3 that in many regression problems, $d^0$ can be treated as identifiable w.r.t. any $L \ge l^0$. But it is important to realize that $d^0$ is not always uniquely identifiable, and to know when it is not uniquely identifiable, in an asymptotic sense. It is also important to bear in mind the question of identifiability in a design problem. The results in Section 2.1 have provided an answer to these questions. Moreover, these results not only provide a foundation for estimating $d^0$ in model (2.1) for continuous covariates, but they also address the same problem when the covariates are discrete or ordered categorical. For example, one may want to know which of two covariates, the dose of a certain drug or the age group, alters the dependence structure of blood pressure on the two. In this case, the identification of $d^0$ is important even when the change point is not uniquely defined.

As in the example of automobiles, the MIC we proposed in the last section should be treated as a method of model selection, and not merely as a tool for estimating $l^0$. In fact, in the case when $d^0$ is only identifiable w.r.t. some number less than the known $L$, $d^0$ and $l^0$ can be jointly estimated by minimizing MIC over all the combinations of $d$ ($\le p$) and $l$ ($\le L$). In the next chapter, the consistency of these estimates, under certain conditions, will be shown.

From a much broader perspective, our estimation procedures can be seen as a general adaptive model fitting technique.
The upper bound $L$ on the number of segments is imposed to ensure computational feasibility and to avoid the "curse of dimensionality"; in other words, $L$ ensures there are sufficient data to enable each piece of the model to be well estimated even when the covariate is a vector of high dimension. With this upper bound, the number of segments and the boundaries of each segment are selected by the data. It will be shown in the next chapter that these estimates are also consistent.

Chapter 3

ASYMPTOTIC RESULTS FOR ESTIMATORS OF SEGMENTED REGRESSION MODELS

In this chapter, asymptotic results for the estimators given in the last chapter are proved. The exact conditions under which these results hold are stated and explained. It will be seen that these conditions seem realistic for many practical problems. More importantly, the techniques we use in this chapter constitute a foundation for the generalizations in Chapter 4 of Model (2.1).

In some cases the parameter $d^0$ is known a priori. In such cases the notation required for presenting the proofs of our results is relatively simple, and so we first prove the results for these cases. In Section 3.1 we establish the consistency of the estimated number of segments, the estimated thresholds and the estimated regression coefficients. Then, for the discontinuous model, an upper bound is given for the rate of convergence of the estimated change points. The asymptotic normality of the estimated regression coefficients and of the estimated variance of the noise is also established. In Section 3.2 we move to the case of unknown $d^0$ and prove the consistency of the two estimators of $d^0$ given in Section 2.2. It will be easy to see that the results proved in Section 3.1 still hold if $d^0$ is replaced by its consistent estimate. In Section 3.3, the finite sample behavior of these estimators is investigated by simulation for various models and noise distributions. Some general remarks are made in Section 3.4.
The asymptotic normality of the various estimates for the continuous model is established in Section 3.5.

3.1 Asymptotic results when the segmentation variable is known

In this section, the parameter $d$ in model (2.1) is assumed known. Consequently, we can simplify the notation introduced at the beginning of Section 2.2. For any $-\infty \le \alpha < \eta \le \infty$, let

$$I_n(\alpha, \eta) := \mathrm{diag}(1_{(x_{1d} \in (\alpha, \eta])}, \ldots, 1_{(x_{nd} \in (\alpha, \eta])}),$$
$$X_n(\alpha, \eta) := I_n(\alpha, \eta) X_n,$$

and

$$H_n(\alpha, \eta) := X_n(\alpha, \eta) [X_n'(\alpha, \eta) X_n(\alpha, \eta)]^- X_n'(\alpha, \eta),$$

where in general for any matrix $A$, $A^-$ will denote a generalized inverse, while $1_{(\cdot)}$ represents the indicator function. Similarly, let

$$Y_n(\alpha, \eta) := I_n(\alpha, \eta) Y_n,$$
$$\epsilon_n(\alpha, \eta) := I_n(\alpha, \eta) \epsilon_n,$$
$$S_n(\alpha, \eta) := Y_n' [I_n(\alpha, \eta) - H_n(\alpha, \eta)] Y_n,$$
$$S_n(\tau_1, \ldots, \tau_l) := \sum_{i=1}^{l+1} S_n(\tau_{i-1}, \tau_i), \quad \tau_0 := -\infty, \ \tau_{l+1} := \infty,$$

and

$$T_n(\alpha, \eta) := \epsilon_n' H_n(\alpha, \eta) \epsilon_n.$$

Observe that $S_n(\alpha, \eta)$ is just the error sum of squares when a linear model is fitted over the "threshold" interval $(\alpha, \eta]$. Also, let the forecast of $Y_n$ on the interval $(\alpha, \eta]$, $\hat Y_n(\alpha, \eta)$, be defined by $\hat Y_n(\alpha, \eta) := H_n(\alpha, \eta) Y_n$. Then, in terms of the true parameters, (2.1) can be rewritten in the vector form

$$Y_n = \sum_{i=1}^{l^0+1} X_n(\tau_{i-1}^0, \tau_i^0) \beta_i^0 + \epsilon_n. \qquad (3.1)$$

To establish the asymptotic theory for the estimation problems of Model (3.1), some assumptions have to be made. First, we assume an upper bound, $L$, of $l^0$ can be specified. This is because in practice the sample size $n$ is always finite and hence any $l^0$ that can be effectively identified is always bounded. We also assume the segmentation does occur at every true threshold, i.e., $\beta_j^0 \ne \beta_{j+1}^0$, $j = 1, \ldots, l^0$, so that these parameters are uniquely defined. The covariates $\{\mathbf{x}_t\}$ are assumed to be a strictly stationary, ergodic random sequence. Further, $\{\mathbf{x}_t\}$ and the error sequence $\{\epsilon_t\}$ are assumed independent. These are the basic assumptions underlying our analysis.

To simplify the problem further, we assume in this chapter that the errors $\{\epsilon_t\}$ are iid random variables with mean zero and variance $\sigma_0^2$.
In addition, a local exponential boundedness condition is placed on the distribution of the errors $\{\epsilon_t\}$. A random variable $Z$ is said to be locally exponentially bounded if there exist two positive constants, $c_0$ and $T_0$, such that

$$E(e^{uZ}) \le e^{c_0 u^2}, \quad \text{for every } |u| \le T_0. \qquad (3.2)$$

The above assumptions are summarized in

Assumption 3.0: The covariates $\{\mathbf{x}_t\}$ and the errors $\{\epsilon_t\}$ are independent, where the $\{\mathbf{x}_t\}$ are strictly stationary and ergodic with $E(\mathbf{x}_1'\mathbf{x}_1) < \infty$, and the $\{\epsilon_t\}$ are iid with a locally exponentially bounded distribution having mean zero and variance $\sigma_0^2$. For the number of thresholds $l^0$, there exists a known $L$ such that $l^0 \le L$. Also, for any $j = 1, \ldots, l^0$, $\beta_j^0 \ne \beta_{j+1}^0$.

Remark The local exponential boundedness condition is satisfied by any distribution with zero mean and a moment generating function with second derivative bounded around zero. Many distributions commonly used as error distributions, such as those in the symmetrized exponential family, are of this type, and hence all the theorems in this chapter will commonly apply.

The next assumption is required to identify the number of thresholds $l^0$ consistently.

Assumption 3.1 There exists $\delta \in (0, \min_{1 \le j \le l^0}(\tau_{j+1}^0 - \tau_j^0)/2)$ such that both $E\{\mathbf{x}_1\mathbf{x}_1' 1_{(x_{1d} \in (\tau_i^0 - \delta, \tau_i^0])}\}$ and $E\{\mathbf{x}_1\mathbf{x}_1' 1_{(x_{1d} \in (\tau_i^0, \tau_i^0 + \delta])}\}$ are positive definite for each of the true thresholds $\tau_1^0, \ldots, \tau_{l^0}^0$.

Under Assumption 3.1, the design matrix $X_n(\alpha, \eta)$ has full column rank a.s. as $n \to \infty$ for every open interval $(\alpha, \eta)$ containing at least one of the $(\tau_i^0 - \delta, \tau_i^0 + \delta]$, $i = 1, \ldots, l^0$. So $\hat\beta(\alpha, \eta) = [X_n'(\alpha, \eta) X_n(\alpha, \eta)]^- X_n'(\alpha, \eta) Y_n$ will be unique with probability tending to 1 as $n \to \infty$. It is easy to see that Assumption 3.1 is satisfied if and only if the conditional covariances of $\mathbf{z}_1 = (x_{11}, \ldots, x_{1p})'$, $Cov(\mathbf{z}_1 \mid x_{1d} \in (\tau_i^0 - \delta, \tau_i^0])$ and $Cov(\mathbf{z}_1 \mid x_{1d} \in (\tau_i^0, \tau_i^0 + \delta])$, $i = 1, \ldots, l^0$, are both positive definite.
Assumption 3.1 means that the model can be uniquely determined over each of $\{\mathbf{x}_1 : x_{1d} \in (\tau_i^0 - \delta, \tau_i^0]\}$ and $\{\mathbf{x}_1 : x_{1d} \in (\tau_i^0, \tau_i^0 + \delta]\}$, $i = 1, \ldots, l^0$. The remark immediately after the proof of Theorem 3.1 will show that this assumption can be slightly relaxed.

To estimate the thresholds consistently, we need

Assumption 3.2 For any sufficiently small $\delta > 0$, $E\{\mathbf{x}_1\mathbf{x}_1' 1_{(x_{1d} \in (\tau_i^0 - \delta, \tau_i^0])}\}$ and $E\{\mathbf{x}_1\mathbf{x}_1' 1_{(x_{1d} \in (\tau_i^0, \tau_i^0 + \delta])}\}$ are positive definite, $i = 1, \ldots, l^0$. Also, $E(\mathbf{x}_1'\mathbf{x}_1)^u < \infty$ for some $u > 1$.

Obviously, Assumption 3.2 implies Assumption 3.1.

If Model (3.1) is discontinuous at $\tau_j^0$ for some $j = 1, \ldots, l^0$, it will be shown that the least squares estimate $\hat\tau_j$ converges to $\tau_j^0$ at a rate no slower than $O_p(\ln^2 n/n)$, under the following assumption:

Assumption 3.3
(A.3.3.1) The covariates $\{\mathbf{x}_t\}$ are iid random variables. Also, $E(\mathbf{x}_1'\mathbf{x}_1)^u < \infty$ for some $u > 2$.
(A.3.3.2) Within some small neighborhoods of the true thresholds, $x_{1d}$ has a positive and continuous probability density function $f_d(\cdot)$ with respect to the one dimensional Lebesgue measure.
(A.3.3.3) There exists one version of $E[\mathbf{x}_1\mathbf{x}_1' \mid x_{1d} = x]$ which is continuous within some neighborhoods of the true thresholds, and that version has been adopted.

Remark Assumptions (A.3.3.2)-(A.3.3.3) are satisfied if $\mathbf{z}_1 = (x_{11}, \ldots, x_{1p})'$ has a joint distribution in canonical form from the exponential family.

Note that Assumptions 3.1-3.3 are made on the distribution of the $\{\mathbf{x}_t\}$. When the $\{\mathbf{x}_t\}$ are non-random, one may assume that the empirical distribution function of the $\{\mathbf{x}_t\}$ converges to a distribution function satisfying these assumptions.

Now, the main results of this section are presented in the next five theorems. Their proofs are given in the sequel.

Theorem 3.1 Assume for the segmented linear regression model (3.1) that Assumptions 3.0 and 3.1 are satisfied. Then $\hat l$, the minimizer of (2.4), converges to $l^0$ in probability as $n \to \infty$.

Remark In the nonlinear minimization of $S(\tau_1, \ldots, \tau_l)$, the possible values of $\tau_1 < \cdots < \tau_l$ may be limited to $\{x_{1d}, \ldots, x_{nd}\}$.
This restriction induces no loss of generality.

Theorems 3.2 and 3.3 show that the estimates $\hat\tau$, $\hat\beta_j$'s and $\hat\sigma^2$ are consistent.

Theorem 3.2 Assume for the segmented linear regression model (3.1) that Assumptions 3.0 and 3.2 are satisfied. Then
\[ \hat\tau - \tau^0 = o_p(1), \]
where $\tau^0 = (\tau_1^0,\ldots,\tau_{l^0}^0)$ and $\hat\tau = (\hat\tau_1,\ldots,\hat\tau_{\hat l})$ is the least squares estimate of $\tau^0$ based on $l = \hat l$, and $\hat l$ is a minimizer of $MIC(l)$ subject to $l \le L$.

Theorem 3.3 If the marginal cdf $F_d$ of $x_{1d}$ satisfies the Lipschitz condition $|F_d(x')-F_d(x'')| \le C|x'-x''|$ for some constant $C$ in a small neighborhood of $x_{1d}=\tau_j^0$ for every $j$, then under the conditions of Theorem 3.2, the least squares estimates $(\hat\beta_j,\ j=1,\ldots,\hat l+1)$ based on the estimates $\hat l$ and $\hat\tau_j$'s as defined in Section 2.2 are consistent.

The next two theorems show that if Model (3.1) is discontinuous at $\tau_j^0$ for some $j=1,\ldots,l^0$, then the threshold estimate $\hat\tau_j$ converges to the true threshold $\tau_j^0$ at the rate of $O_p(\ln^2 n/n)$, and the least squares estimates of $\beta_j^0$ and $\sigma_0^2$ based on the estimated thresholds are asymptotically normal.

Theorem 3.4 Suppose for the segmented linear regression model (3.1) that Assumptions 3.0, 3.2 and 3.3 are satisfied. For any $j\in\{1,\ldots,l^0\}$ such that $P(x_1'(\beta_{j+1}^0-\beta_j^0)\ne 0 \mid x_d=\tau_j^0) > 0$,
\[ \hat\tau_j - \tau_j^0 = O_p\!\left(\frac{\ln^2 n}{n}\right). \]

Let $\hat\beta_j$ and $\hat\sigma^2$ be the least squares estimates of $\beta_j^0$ and $\sigma_0^2$ based on the estimates $\hat l$ and $\hat\tau_j$'s as defined in Section 2.2, $j=1,\ldots,\hat l+1$.

Theorem 3.5 Suppose for the segmented linear regression model (3.1) that Assumptions 3.0, 3.2 and 3.3 are satisfied. If $P(x_1'(\beta_{j+1}^0-\beta_j^0)\ne 0 \mid x_d=\tau_j^0) > 0$ for all $j=1,\ldots,l^0$, then $\sqrt n(\hat\beta_j-\beta_j^0)$ and $\sqrt n[\hat\sigma^2-\sigma_0^2]$ converge in distribution to normal distributions with finite variances, $j=1,\ldots,l^0+1$.
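The two-stage procedure behind Theorems 3.1 and 3.2 (choose the number of thresholds by a penalized criterion, then locate the thresholds by least squares over the observed values of the segmentation variable) can be sketched as follows. This is only an illustration under stated assumptions, not the exact criterion (2.4): the penalty is taken to be `c0 * p_star * (ln n)**(2 + delta0) / n`, and the parameter count `p_star`, the constants `c0 = 0.3` and `delta0 = 0.1`, the minimum segment size `min_obs` and all function names are hypothetical.

```python
import itertools
import numpy as np

def ols_ssr(X, y):
    """Residual sum of squares of an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def fit_thresholds(Xs, ys, l, min_obs):
    """Exhaustive least squares search for l split points.

    Xs and ys must already be sorted by the segmentation variable, so each
    segment is a contiguous slice. Returns (smallest SSR, split indices).
    """
    n = len(ys)
    best_ssr, best_cuts = np.inf, ()
    for cuts in itertools.combinations(range(min_obs, n - min_obs + 1), l):
        edges = (0, *cuts, n)
        pairs = list(zip(edges[:-1], edges[1:]))
        if any(b - a < min_obs for a, b in pairs):
            continue  # skip partitions with a segment that is too short
        ssr = sum(ols_ssr(Xs[a:b], ys[a:b]) for a, b in pairs)
        if ssr < best_ssr:
            best_ssr, best_cuts = ssr, cuts
    return best_ssr, best_cuts

def select_by_mic(X, y, xd, L, c0=0.3, delta0=0.1, min_obs=15):
    """Pick the number of thresholds (at most L) by the assumed penalized criterion."""
    order = np.argsort(xd)
    Xs, ys, xs = X[order], y[order], xd[order]
    n, k = X.shape
    mic, taus = {}, {}
    for l in range(L + 1):
        ssr, cuts = fit_thresholds(Xs, ys, l, min_obs)
        p_star = (l + 1) * k + l  # assumed count: coefficients plus thresholds
        mic[l] = np.log(ssr / n) + c0 * p_star * np.log(n) ** (2 + delta0) / n
        # each threshold is the largest observed x_d in the segment to its left
        taus[l] = tuple(xs[c - 1] for c in cuts)
    l_hat = min(mic, key=mic.get)
    return l_hat, taus[l_hat], mic

# Simulated example: one threshold at 0, with a jump there.
rng = np.random.default_rng(1)
n = 150
x = rng.uniform(-2.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])
y = np.where(x <= 0.0, 1.0 + x, 3.0 - x) + 0.5 * rng.normal(size=n)

l_hat, tau_hat, mic = select_by_mic(X, y, x, L=2)
print(l_hat, tau_hat)
```

The exhaustive search costs on the order of $n^l$ regressions, so the sketch is only practical for small $l$ and moderate $n$; it is meant to make the objects $\hat l$ and $\hat\tau$ concrete rather than to be efficient.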
Remark The asymptotic variances can be computed by first treating $l^0$ and $\tau_j^0$, $(j=1,\ldots,l^0)$, as known, so that the usual "estimates" of the variances of the estimates of the regression coefficients and residual variance can be written down explicitly, and then substituting $\hat l$ and $\hat\tau_j$ for $l^0$ and $\tau_j^0$, $(j=1,\ldots,l^0)$, in these variance "estimates". For example, the asymptotic covariance matrix for $\hat\beta_j$ is $\sigma_0^2G_j^{-1}$, where $G_j = E[x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_{j-1}^0,\,\tau_j^0])}]$.

The proof of Theorem 3.1 is motivated by the following idea. If the model is overfitted ($l^0 < l \le L$), the reduction in the mean square error will be bounded in probability by a positive sequence tending to zero. In fact, this turns out to be $O_p(\ln^2 n/n)$. On the other hand, if the model is underfitted ($l < l^0$), the inflation in the mean square error will be of order $O_p(1)$. Hence, by setting the penalty term in MIC equal to a quantity of order bigger than $O_p(\ln^2 n/n)$ but still tending to 0, we can avoid both overfitting and underfitting. This idea is formulated in a series of lemmas.

The result of Lemma 3.1 is a consequence of the local exponential boundedness assumption, which gives the added flexibility of modeling with non-Gaussian noises. Using the properties of the hat matrix $H_n(x_{sd},x_{td})$, Lemma 3.2 establishes a uniform bound of $T_n(\alpha,\eta)$ for all $\alpha < \eta$. With this lemma, we show in Proposition 3.1 that the mean squared residuals differ from the mean squared pure errors only by $O_p(\ln^2 n/n)$, which in turn motivates the choice of the penalty term in our MIC. Given Lemma 3.2 and Proposition 3.1, the results of Lemmas 3.3 and 3.4 are more or less expected.

Lemma 3.1 Let $Z_1,\ldots,Z_k$ be i.i.d. locally exponentially bounded random variables, i.e., $E(e^{uZ_1}) \le e^{c_0u^2}$ for $|u| \le T_0$, where $T_0$ and $c_0 \in (0,\infty)$. Let $S_k = \sum_{i=1}^k a_iZ_i$ where the $a_i$'s are constants. Then for any $x > 0$ and $t_0 > 0$ satisfying $|t_0a_i| \le T_0$, $i \le k$,
\[ P\{|S_k| \ge x\} \le 2e^{-t_0x + c_0t_0^2\sum_{i=1}^k a_i^2}. \]
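(3.3)

Proof It follows from Markov's inequality that for the hypothesized $t_0$,
\[ P\{S_k \ge x\} = P\{e^{t_0S_k} \ge e^{t_0x}\} \le e^{-t_0x}E(e^{t_0S_k}) = e^{-t_0x}E\big(e^{t_0\sum_{i=1}^k a_iZ_i}\big) \le e^{-t_0x}e^{c_0t_0^2\sum_{i=1}^k a_i^2}, \]
and to conclude the proof of (3.3),
\[ P\{S_k \le -x\} = P\{-S_k \ge x\} \le e^{-t_0x}e^{c_0t_0^2\sum_{i=1}^k a_i^2}. \]
∎

Lemma 3.2 Assume for the segmented linear regression model (3.1) that Assumption 3.0 is satisfied. Let $T_n(\alpha,\eta)$, $-\infty \le \alpha < \eta \le \infty$, be defined as in the beginning of this section. Then
\[ P\Big\{\sup_{\alpha<\eta}T_n(\alpha,\eta) \ge \frac{9p_0^3}{T_0^2}\ln^2 n\Big\} \to 0, \quad \text{as } n\to\infty, \tag{3.4} \]
where $p_0$ is the true order of the model and $T_0$ is the constant associated with the local exponential boundedness condition for the $\{\epsilon_t\}$.

Proof Conditioning on $X_n$, we have
\[ P\Big\{\sup_{\alpha<\eta}T_n(\alpha,\eta) \ge \frac{9p_0^3}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\} = P\Big\{\max_{x_{sd}<x_{td}}\tilde\epsilon_n'H_n(x_{sd},x_{td})\tilde\epsilon_n \ge \frac{9p_0^3}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\} \le \sum_{x_{sd}<x_{td}}P\Big\{\tilde\epsilon_n'H_n(x_{sd},x_{td})\tilde\epsilon_n \ge \frac{9p_0^3}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\}. \]
Since $H_n(x_{sd},x_{td})$ is nonnegative definite and idempotent, it can be decomposed as $H_n(x_{sd},x_{td}) = W'\Lambda W$, where $W$ is orthogonal and $\Lambda = diag(1,\ldots,1,0,\ldots,0)$ with $p := rank(H_n(x_{sd},x_{td})) = rank(\Lambda) \le p_0$. Set $Q = (I_p,0)W$. Then $Q$ has full row rank $p$. Let $Q' = (q_1,\ldots,q_p)$ and $U_l = q_l'\tilde\epsilon_n$, $l=1,\ldots,p$. Then
\[ \tilde\epsilon_n'H_n(x_{sd},x_{td})\tilde\epsilon_n = \sum_{l=1}^p U_l^2. \]
Since $p \le p_0$ and
\[ P\Big\{\sum_{l=1}^p U_l^2 \ge \frac{9p_0^3}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\} \le P\Big\{\sum_{l=1}^p U_l^2 \ge p\,\frac{9p_0^2}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\} \le P\Big\{U_l^2 \ge \frac{9p_0^2}{T_0^2}\ln^2 n \text{ for some } l \,\Big|\, X_n\Big\} \le \sum_{l=1}^p P\Big\{U_l^2 \ge \frac{9p_0^2}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\}, \]
it suffices to show, for any $l$, that
\[ \sum_{x_{sd}<x_{td}}P\Big\{U_l^2 \ge \frac{9p_0^2}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\} \to 0, \quad \text{as } n\to\infty. \]
Noting that $p = trace(H_n(x_{sd},x_{td})) = \sum_{l=1}^p\|q_l\|^2$, we have $\|q_l\|^2 = q_l'q_l \le p \le p_0$, $l=1,\ldots,p$. By Lemma 3.1, with $t_0 = T_0/p_0$ we have
\[ \sum_{x_{sd}<x_{td}}P\{|U_l| \ge 3p_0\ln n/T_0 \mid X_n\} \le \sum_{x_{sd}<x_{td}}2\exp\Big(-\frac{T_0}{p_0}\cdot\frac{3p_0}{T_0}\ln n\Big)\exp\big(c_0(T_0/p_0)^2p_0\big) \le \frac{n(n-1)}{n^3}\exp(c_0T_0^2/p_0) \to 0, \]
as $n\to\infty$, where $c_0$ is the constant specified in Lemma 3.1. Finally, by appealing to the dominated convergence theorem we obtain the desired result without conditioning. ∎

Proposition 3.1 Consider the segmented regression model (3.1).
(i) For any $j$ and $(\alpha,\eta] \subset (\tau_{j-1}^0,\tau_j^0]$,
\[ S_n(\alpha,\eta) = \tilde\epsilon_n'(\alpha,\eta)\tilde\epsilon_n(\alpha,\eta) - T_n(\alpha,\eta). \]
(ii) Suppose Assumption 3.0 is satisfied. Let $m \ge 1$.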
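Then uniformly for all $(\alpha_1,\ldots,\alpha_m)$ such that $-\infty < \alpha_1 < \cdots < \alpha_m < \infty$,
\[ \sum_{i=1}^{m+l^0+1}S_n(\xi_{i-1},\xi_i) = \tilde\epsilon_n'\tilde\epsilon_n + O_p(\ln^2 n), \]
where $\xi_0 = -\infty$, $\xi_{m+l^0+1} = \infty$, and $\{\xi_1,\ldots,\xi_{m+l^0}\}$ is the set $\{\tau_1^0,\ldots,\tau_{l^0}^0,\alpha_1,\ldots,\alpha_m\}$ after ordering its elements.

Proof: (i) Observe that
\begin{align*}
S_n(\alpha,\eta) &= Y_n'(I_n(\alpha,\eta)-H_n(\alpha,\eta))Y_n \\
&= (X_n(\alpha,\eta)\beta_j^0+\tilde\epsilon_n(\alpha,\eta))'(X_n(\alpha,\eta)\beta_j^0+\tilde\epsilon_n(\alpha,\eta)) - (X_n(\alpha,\eta)\beta_j^0+\tilde\epsilon_n(\alpha,\eta))'H_n(\alpha,\eta)(X_n(\alpha,\eta)\beta_j^0+\tilde\epsilon_n(\alpha,\eta)) \\
&= \beta_j^{0\prime}X_n'(\alpha,\eta)X_n(\alpha,\eta)\beta_j^0 + 2\tilde\epsilon_n'(\alpha,\eta)X_n(\alpha,\eta)\beta_j^0 + \tilde\epsilon_n'(\alpha,\eta)\tilde\epsilon_n(\alpha,\eta) \\
&\quad - [\beta_j^{0\prime}X_n'(\alpha,\eta)H_n(\alpha,\eta)X_n(\alpha,\eta)\beta_j^0 + 2\tilde\epsilon_n'(\alpha,\eta)H_n(\alpha,\eta)X_n(\alpha,\eta)\beta_j^0 + \tilde\epsilon_n'(\alpha,\eta)H_n(\alpha,\eta)\tilde\epsilon_n(\alpha,\eta)].
\end{align*}
Noting that $H_n(\alpha,\eta)$ is idempotent and $X_n'(\alpha,\eta)H_n(\alpha,\eta)X_n(\alpha,\eta) = X_n'(\alpha,\eta)X_n(\alpha,\eta)$, we have
\begin{align*}
(X_n(\alpha,\eta)-H_n(\alpha,\eta)X_n(\alpha,\eta))'(X_n(\alpha,\eta)-H_n(\alpha,\eta)X_n(\alpha,\eta)) &= X_n'(\alpha,\eta)(I_n(\alpha,\eta)-H_n(\alpha,\eta))X_n(\alpha,\eta) \\
&= X_n'(\alpha,\eta)X_n(\alpha,\eta) - X_n'(\alpha,\eta)X_n(\alpha,\eta) = 0
\end{align*}
and hence $X_n(\alpha,\eta) = H_n(\alpha,\eta)X_n(\alpha,\eta)$. Therefore
\[ S_n(\alpha,\eta) = \tilde\epsilon_n'(\alpha,\eta)\tilde\epsilon_n(\alpha,\eta) - \tilde\epsilon_n'(\alpha,\eta)H_n(\alpha,\eta)\tilde\epsilon_n(\alpha,\eta) = \tilde\epsilon_n'(\alpha,\eta)\tilde\epsilon_n(\alpha,\eta) - T_n(\alpha,\eta). \]
(ii) By (i),
\[ \sum_{i=1}^{m+l^0+1}S_n(\xi_{i-1},\xi_i) = \sum_{i=1}^{m+l^0+1}[\tilde\epsilon_n'(\xi_{i-1},\xi_i)\tilde\epsilon_n(\xi_{i-1},\xi_i) - T_n(\xi_{i-1},\xi_i)] = \tilde\epsilon_n'\tilde\epsilon_n - \sum_{i=1}^{m+l^0+1}T_n(\xi_{i-1},\xi_i). \]
Note that each of $(\xi_{i-1},\xi_i]$ is contained in one of $(\tau_{j-1}^0,\tau_j^0]$, $j=1,\ldots,l^0+1$. By Lemma 3.2, $\sum_{i=1}^{m+l^0+1}T_n(\xi_{i-1},\xi_i) \le (m+l^0+1)\sup_{\alpha<\eta}T_n(\alpha,\eta) = O_p(\ln^2 n)$. ∎

Lemma 3.3 Under the condition of Theorem 3.1, there exists $\delta \in (0,\min_{1\le j\le l^0}(\tau_{j+1}^0-\tau_j^0)/2)$ such that for $r=1,\ldots,l^0$,
\[ [S_n(\tau_r^0-\delta,\tau_r^0+\delta) - S_n(\tau_r^0-\delta,\tau_r^0) - S_n(\tau_r^0,\tau_r^0+\delta)]/n \xrightarrow{a.s.} C_r \tag{3.5} \]
for some $C_r > 0$ as $n\to\infty$.

Proof It suffices to prove the result when $l^0=1$. For notational simplicity, we omit the subscripts and superscripts 0 in this proof. For the $\delta$ in Assumption 3.1, let $X_1^* = X_n(\tau_1-\delta,\tau_1)$, $X_2^* = X_n(\tau_1,\tau_1+\delta)$, $X^* = X_n(\tau_1-\delta,\tau_1+\delta) = X_1^*+X_2^*$, $\tilde\epsilon_1^* = \tilde\epsilon_n(\tau_1-\delta,\tau_1)$, $\tilde\epsilon_2^* = \tilde\epsilon_n(\tau_1,\tau_1+\delta)$, $\tilde\epsilon^* = \tilde\epsilon_1^*+\tilde\epsilon_2^*$ and $\hat\beta^* = (X^{*\prime}X^*)^{-}X^{*\prime}Y_n$. As in ordinary regression, we have
\begin{align*}
S_n(\tau_1-\delta,\tau_1+\delta) &= \|X_1^*\beta_1 + X_2^*\beta_2 + \tilde\epsilon^* - X^*\hat\beta^*\|^2 \\
&= \|X_1^*(\beta_1-\hat\beta^*) + X_2^*(\beta_2-\hat\beta^*) + \tilde\epsilon^*\|^2 \\
&= \|X_1^*(\beta_1-\hat\beta^*)\|^2 + \|X_2^*(\beta_2-\hat\beta^*)\|^2 + \|\tilde\epsilon^*\|^2 + 2\tilde\epsilon^{*\prime}X_1^*(\beta_1-\hat\beta^*) + 2\tilde\epsilon^{*\prime}X_2^*(\beta_2-\hat\beta^*).
\end{align*}
It then follows from the strong law of large numbers for stationary ergodic stochastic processes that as $n\to\infty$,
\[ \frac1n X_j^{*\prime}X_j^* \xrightarrow{a.s.} \begin{cases} E\{x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_1-\delta,\,\tau_1])}\} > 0, & \text{if } j=1, \\ E\{x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_1,\,\tau_1+\delta])}\} > 0, & \text{if } j=2, \end{cases} \]
and
\[ \frac1n X^{*\prime}Y_n \xrightarrow{a.s.} E\{y_1x_1\mathbf{1}_{(x_{1d}\in(\tau_1-\delta,\,\tau_1+\delta])}\}. \]
Therefore, since $X^{*\prime}X^* = X_1^{*\prime}X_1^* + X_2^{*\prime}X_2^*$,
\[ \hat\beta^* \xrightarrow{a.s.} \{E[x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_1-\delta,\,\tau_1+\delta])}]\}^{-1}E\{y_1x_1\mathbf{1}_{(x_{1d}\in(\tau_1-\delta,\,\tau_1+\delta])}\} =: \beta^*. \]
Similarly, it can be shown that
\[ \frac1n\|X_j^*(\beta_j-\hat\beta^*)\|^2 \xrightarrow{a.s.} \begin{cases} (\beta_1-\beta^*)'E(x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_1-\delta,\,\tau_1])})(\beta_1-\beta^*), & \text{if } j=1, \\ (\beta_2-\beta^*)'E(x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_1,\,\tau_1+\delta])})(\beta_2-\beta^*), & \text{if } j=2, \end{cases} \]
\[ \frac1n\tilde\epsilon^{*\prime}X_j^*(\beta_j-\hat\beta^*) \xrightarrow{a.s.} 0, \quad \text{for } j=1,2, \]
and
\[ \frac1n\tilde\epsilon^{*\prime}\tilde\epsilon^* \xrightarrow{a.s.} \sigma^2P\{x_{1d}\in(\tau_1-\delta,\tau_1+\delta]\}. \]
Thus as $n\to\infty$, $\frac1n S_n(\tau_1-\delta,\tau_1+\delta)$ has a finite limit, this limit being given by
\begin{align*}
\lim_{n\to\infty}\frac1n S_n(\tau_1-\delta,\tau_1+\delta) &= (\beta_1-\beta^*)'E(x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_1-\delta,\,\tau_1])})(\beta_1-\beta^*) \\
&\quad + (\beta_2-\beta^*)'E(x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_1,\,\tau_1+\delta])})(\beta_2-\beta^*) + \sigma^2P\{x_{1d}\in(\tau_1-\delta,\tau_1+\delta]\}.
\end{align*}
It remains to show that $\frac1n S_n(\tau_1-\delta,\tau_1)$ and $\frac1n S_n(\tau_1,\tau_1+\delta)$ converge to $\sigma^2P\{x_{1d}\in(\tau_1-\delta,\tau_1]\}$ and $\sigma^2P\{x_{1d}\in(\tau_1,\tau_1+\delta]\}$, respectively, and that either $(\beta_1-\beta^*)'E(x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_1-\delta,\,\tau_1])})(\beta_1-\beta^*) > 0$ or $(\beta_2-\beta^*)'E(x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_1,\,\tau_1+\delta])})(\beta_2-\beta^*) > 0$. The latter is a direct consequence of the assumed conditions while the former can be shown again by the law of large numbers. To this end, we first write $S_n(\tau_1-\delta,\tau_1)$ in the following form (bearing in mind that $l^0$ is assumed to be 1 in the proof),
\[ S_n(\tau_1-\delta,\tau_1) = \tilde\epsilon_1^{*\prime}\tilde\epsilon_1^* - T_n(\tau_1-\delta,\tau_1), \]
using Proposition 3.1 (i). By the strong law of large numbers,
\[ \frac1n\tilde\epsilon_1^{*\prime}\tilde\epsilon_1^* \xrightarrow{a.s.} E[\epsilon_1^2\mathbf{1}_{(x_{1d}\in(\tau_1-\delta,\,\tau_1])}] = \sigma^2P\{x_{1d}\in(\tau_1-\delta,\tau_1]\}, \qquad \frac1n\tilde\epsilon_1^{*\prime}X_1^* \xrightarrow{a.s.} E[\epsilon_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_1-\delta,\,\tau_1])}] = 0, \]
and $W = \lim_{n\to\infty}\frac1n X_1^{*\prime}X_1^*$ is positive definite under the assumption. Therefore,
\[ \frac1n T_n(\tau_1-\delta,\tau_1) = \Big(\frac1n\tilde\epsilon_1^{*\prime}X_1^*\Big)\Big(\frac1n X_1^{*\prime}X_1^*\Big)^{-}\Big(\frac1n X_1^{*\prime}\tilde\epsilon_1^*\Big) \xrightarrow{a.s.} 0\,W^{-1}\,0 = 0 \]
and hence $\frac1n S_n(\tau_1-\delta,\tau_1) \xrightarrow{a.s.} \sigma^2P\{x_{1d}\in(\tau_1-\delta,\tau_1]\}$. The same argument can also be used to show that $\frac1n S_n(\tau_1,\tau_1+\delta) \xrightarrow{a.s.} \sigma^2P\{x_{1d}\in(\tau_1,\tau_1+\delta]\}$. This completes the proof. ∎

Lemma 3.4 Under the condition of Theorem 3.1, we have
(i) for every $l < l^0$, $P\{\hat\sigma_l^2 > \sigma_0^2 + C\} \to 1$, as $n\to\infty$, for some $C > 0$, and
(ii) for every $l$ such that $l^0 \le l \le L$, where $L$ is an upper bound of $l^0$,
\[ 0 \le \frac1n\tilde\epsilon_n'\tilde\epsilon_n - \hat\sigma_l^2 = O_p(\ln^2(n)/n), \tag{3.6} \]
where $\hat\sigma_l^2 = \frac1n S_n(\hat\tau_1,\ldots,\hat\tau_l)$ is the estimated $\sigma_0^2$ when the number of true thresholds is assumed to be $l$.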
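The quantity in (3.5) can be computed directly from data, which makes the role of Lemma 3.3 in Lemma 3.4 (i) easier to see: pooling a window that straddles a true threshold inflates the residual sum of squares by an amount of order $n$, whereas pooling a window inside a single regime costs almost nothing. The sketch below is an illustration only; the simulated model, the window width and the names `window_ssr` and `split_gain` are assumptions introduced here.

```python
import numpy as np

def window_ssr(X, y, xd, lo, hi):
    """S_n(lo, hi): OLS residual sum of squares on observations with lo < x_d <= hi."""
    mask = (xd > lo) & (xd <= hi)
    beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    resid = y[mask] - X[mask] @ beta
    return float(resid @ resid)

def split_gain(X, y, xd, tau, delta):
    """[S_n(tau-d, tau+d) - S_n(tau-d, tau) - S_n(tau, tau+d)] / n, as in (3.5)."""
    pooled = window_ssr(X, y, xd, tau - delta, tau + delta)
    left = window_ssr(X, y, xd, tau - delta, tau)
    right = window_ssr(X, y, xd, tau, tau + delta)
    return (pooled - left - right) / len(y)

# Hypothetical model with a single true threshold at 0 and a jump there.
rng = np.random.default_rng(2)
n = 20000
x = rng.uniform(-2.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])
y = np.where(x <= 0.0, 1.0 + x, 3.0 - x) + 0.5 * rng.normal(size=n)

gain_at_threshold = split_gain(X, y, x, tau=0.0, delta=0.5)  # window straddles the break
gain_elsewhere = split_gain(X, y, x, tau=1.0, delta=0.5)     # window inside one regime
print(gain_at_threshold, gain_elsewhere)
```

The second value is nonnegative by construction, since a pooled fit can never beat two separate fits, and by Proposition 3.1 (i) it equals a difference of $T_n$ terms divided by $n$, which Lemma 3.2 bounds by $O_p(\ln^2 n/n)$.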
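Proof (i) Since $l < l^0$, for the $\delta \in (0,\min_{1\le j\le l^0}(\tau_{j+1}^0-\tau_j^0)/2)$ in Assumption 3.1, there exists $1 \le r \le l^0$ such that $(\hat\tau_1,\ldots,\hat\tau_l) \in A_r := \{(\tau_1,\ldots,\tau_l) : |\tau_s-\tau_r^0| > \delta, \text{ for all } s=1,\ldots,l\}$. Hence, if we can show that for each $r$, $1 \le r \le l^0$, with probability approaching 1,
\[ \min_{(\tau_1,\ldots,\tau_l)\in A_r}S_n(\tau_1,\ldots,\tau_l)/n \ge \sigma_0^2 + C_r, \]
for some $C_r > 0$, then by choosing $C := \min_{1\le r\le l^0}\{C_r\}$, we will have proved the desired result.

For any $(\tau_1,\ldots,\tau_l) \in A_r$, let $\xi_1 \le \cdots \le \xi_{l+l^0+1}$ be the ordered set $\{\tau_1,\ldots,\tau_l,\tau_1^0,\ldots,\tau_{r-1}^0,\tau_r^0-\delta,\tau_r^0+\delta,\tau_{r+1}^0,\ldots,\tau_{l^0}^0\}$ and let $\xi_0 = -\infty$, $\xi_{l+l^0+2} = \infty$. Then it follows from Proposition 3.1 (ii) that uniformly in $A_r$,
\begin{align*}
\frac1n S_n(\tau_1,\ldots,\tau_l) &\ge \frac1n\sum_{i=1}^{l+l^0+2}S_n(\xi_{i-1},\xi_i) \tag{3.7} \\
&= \frac1n\Big[\sum_{i:\,\xi_i\ne\tau_r^0+\delta}S_n(\xi_{i-1},\xi_i) + S_n(\tau_r^0-\delta,\tau_r^0) + S_n(\tau_r^0,\tau_r^0+\delta)\Big] \\
&\quad + \frac1n[S_n(\tau_r^0-\delta,\tau_r^0+\delta) - S_n(\tau_r^0-\delta,\tau_r^0) - S_n(\tau_r^0,\tau_r^0+\delta)] \\
&= \frac1n\tilde\epsilon_n'\tilde\epsilon_n + O_p(\ln^2(n)/n) + \frac1n[S_n(\tau_r^0-\delta,\tau_r^0+\delta) - S_n(\tau_r^0-\delta,\tau_r^0) - S_n(\tau_r^0,\tau_r^0+\delta)].
\end{align*}
By the strong law of large numbers the first term on the RHS is $\sigma_0^2 + o(1)$ a.s. By Lemma 3.3, the third term on the RHS is $C_r + o(1)$ a.s. Thus
\[ \frac1n S_n(\tau_1,\ldots,\tau_l) \ge \sigma_0^2 + C_r + o_p(1), \]
where $C_r$ is defined in (3.5).

(ii) Let $\xi_1 \le \cdots \le \xi_{l+l^0}$ be the ordered set $\{\hat\tau_1,\ldots,\hat\tau_l,\tau_1^0,\ldots,\tau_{l^0}^0\}$, $\xi_0 = \tau_0^0 = -\infty$ and $\xi_{l+l^0+1} = \tau_{l^0+1}^0 = \infty$. Since $l \ge l^0$, by Proposition 3.1 (ii) again,
\[ \tilde\epsilon_n'\tilde\epsilon_n \ge S_n(\tau_1^0,\ldots,\tau_{l^0}^0) \ge n\hat\sigma_l^2 = S_n(\hat\tau_1,\ldots,\hat\tau_l) \ge \sum_{i=1}^{l+l^0+1}S_n(\xi_{i-1},\xi_i) = \tilde\epsilon_n'\tilde\epsilon_n + O_p(\ln^2(n)). \]
This proves (ii). ∎

Proof of Theorem 3.1 By Lemma 3.4 (i), for $l < l^0$ and sufficiently large $n$, there exists $C > 0$ such that
\[ MIC(l) = \ln(\hat\sigma_l^2) + p^*(\ln n)^{2+\delta_0}/n > \ln(\sigma_0^2 + C/2) = \ln(\sigma_0^2) + \ln(1 + C/(2\sigma_0^2)) \]
with probability approaching 1. By Lemma 3.4 (ii), for $l \ge l^0$,
\[ MIC(l) = \ln(\hat\sigma_l^2) + p^*(\ln n)^{2+\delta_0}/n \xrightarrow{p} \ln\sigma_0^2. \]
Thus, $P\{\hat l \ge l^0\} \to 1$ as $n\to\infty$. By Lemma 3.4 (ii) and the strong law of large numbers, for $l^0 < l \le L$,
\[ 0 \ge [\hat\sigma_l^2 - \tfrac1n\tilde\epsilon_n'\tilde\epsilon_n] - [\hat\sigma_{l^0}^2 - \tfrac1n\tilde\epsilon_n'\tilde\epsilon_n] = O_p(\ln^2 n/n), \]
and
\[ [\hat\sigma_{l^0}^2 - \sigma_0^2] = [\hat\sigma_{l^0}^2 - \tfrac1n\tilde\epsilon_n'\tilde\epsilon_n] + [\tfrac1n\tilde\epsilon_n'\tilde\epsilon_n - \sigma_0^2] = O_p(\ln^2 n/n) + o_p(1) = o_p(1). \]
Hence $0 \le (\hat\sigma_{l^0}^2 - \hat\sigma_l^2)/\hat\sigma_{l^0}^2 = O_p(\ln^2 n/n)$. Note that for $0 \le x \le 1/2$, $\ln(1-x) \ge -2x$.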
Therefore,
\begin{align*}
MIC(l) - MIC(l^0) &= \ln(\hat\sigma_l^2) - \ln(\hat\sigma_{l^0}^2) + c_0(l-l^0)(\ln n)^{2+\delta_0}/n \\
&= \ln(1 - (\hat\sigma_{l^0}^2-\hat\sigma_l^2)/\hat\sigma_{l^0}^2) + c_0(l-l^0)(\ln n)^{2+\delta_0}/n \\
&\ge -2O_p(\ln^2(n)/n) + c_0(l-l^0)(\ln n)^{2+\delta_0}/n \\
&> 0
\end{align*}
for sufficiently large $n$. Whence $\hat l \to l^0$ in probability as $n\to\infty$. ∎

Remark: From the proof of Theorem 3.1 it can be seen that if the term $c_0l(\ln n)^{2+\delta_0}/n$ is replaced by $l\cdot cn^{-\alpha}$, where $\alpha \in (0,1)$ and $c$ is a constant, the model selection procedure is still consistent. In fact, such a penalty is proposed by Yao (1989) for a one-dimensional piecewise constant model.

Remark If the assumed $\delta$ in Assumption 3.1 is replaced by assumed sequences $\{a_j\}$, $\{b_j\}$ such that $-\infty < a_1 < \tau_1^0 < b_1 < \cdots < a_{l^0} < \tau_{l^0}^0 < b_{l^0} < \infty$, and such that both $E\{x_1x_1'\mathbf{1}_{(x_{1d}\in(a_j,\,\tau_j^0])}\}$ and $E\{x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_j^0,\,b_j])}\}$ are positive definite for $j=1,\ldots,l^0$, then the conclusion of Lemma 3.3 still holds with $\delta$ replaced by $a_j$ and $b_j$, respectively. Therefore, the conclusion of Theorem 3.1 still holds.

To prove Theorem 3.2, we need the following lemma.

Lemma 3.5 Under the assumptions of Theorem 3.2, for any sufficiently small $\delta \in (0,\min_{1\le j\le l^0}(\tau_{j+1}^0-\tau_j^0)/2)$, there exists a constant $C_r > 0$ such that
\[ \frac1n[S_n(\tau_r^0-\delta,\tau_r^0+\delta) - S_n(\tau_r^0-\delta,\tau_r^0) - S_n(\tau_r^0,\tau_r^0+\delta)] \to C_r, \quad \text{as } n\to\infty, \]
where $r=1,\ldots,l^0$.

Proof It suffices to prove the result for the case when $l^0=1$. For any small $\delta > 0$, all the arguments in the proof of Lemma 3.3 apply, under Assumption 3.2. Hence the result holds. ∎

Remark: Although the proofs of Lemma 3.3 and Lemma 3.5 are essentially the same, the assumptions, and hence the conclusions, of these lemmas are different. In Lemma 3.3, $C_r$ is fixed for the existing $\delta$, while Lemma 3.5 implies that for any sequence $\{\delta_m\}$ such that $\delta_m > 0$ and $\delta_m \to 0$ as $m\to\infty$, there exist $\{C_r(m)\}$ such that the conclusion of Lemma 3.5 holds for all $m$.

Proof of Theorem 3.2 By Theorem 3.1, the problem can be restricted to $\{\hat l = l^0\}$.
For any sufficiently small $\delta' > 0$, substituting $\delta'$ for the $\delta$ in (3.7) in the proof of Lemma 3.4 (i), we have the following inequality
\[ \frac1n S_n(\tau_1,\ldots,\tau_{l^0}) \ge \frac1n\tilde\epsilon_n'\tilde\epsilon_n + O_p(\ln^2(n)/n) + \frac1n[S_n(\tau_r^0-\delta',\tau_r^0+\delta') - S_n(\tau_r^0-\delta',\tau_r^0) - S_n(\tau_r^0,\tau_r^0+\delta')], \]
uniformly in $(\tau_1,\ldots,\tau_{l^0}) \in A_r := \{(\tau_1,\ldots,\tau_{l^0}) : |\tau_s-\tau_r^0| > \delta',\ 1 \le s \le l^0\}$. By Lemma 3.5, the last term on the RHS converges to a positive $C_r$. For sufficiently large $n$, this $C_r$ will dominate the term $O_p(\ln^2 n/n)$. Thus, uniformly in $A_r$, $r=1,\ldots,l^0$, and with probability tending to 1,
\[ \frac1n S_n(\tau_1,\ldots,\tau_{l^0}) > \frac1n\tilde\epsilon_n'\tilde\epsilon_n + \frac{C_r}{2}. \]
This implies that with probability approaching 1 no $\tau$ in $A_r$ is qualified as a candidate for the role of $\hat\tau$, where $\hat\tau = (\hat\tau_1,\ldots,\hat\tau_{l^0})$. In other words, $P\{\hat\tau \in A_r^c\} \to 1$ as $n\to\infty$. Since this is true for all $r$, $P\{\hat\tau \in \bigcap_{r=1}^{l^0}A_r^c\} \to 1$, $n\to\infty$. Note that for $\delta' < \min_{0\le i\le l^0}\{(\tau_{i+1}^0-\tau_i^0)/2\}$,
\[ \bigcap_{r=1}^{l^0}A_r^c = \bigcap_{r=1}^{l^0}\{\tau : |\tau_s-\tau_r^0| \le \delta' \text{ for some } 1 \le s \le l^0\} = \{\tau : |\tau_r-\tau_r^0| \le \delta',\ r=1,\ldots,l^0\}. \]
Thus we have
\[ P\{|\hat\tau_r-\tau_r^0| \le \delta' \text{ for } r=1,\ldots,l^0\} = P\Big\{\hat\tau \in \bigcap_{r=1}^{l^0}A_r^c\Big\} \to 1, \quad \text{as } n\to\infty, \]
which completes the proof. ∎

The proof of Theorem 3.3 requires a series of preliminary results. The key step is to establish Lemma 3.6, which implies that the estimation errors of the regression coefficients are controlled by the estimation errors of the thresholds.

Proposition 3.2 Let $\{x_n\}$ be a sequence of random variables. If $x_n = o_p(1)$, then there exists a positive sequence $\{a_n\}$ such that $a_n \to 0$ as $n\to\infty$ and $x_n = O_p(a_n)$.

Proof Let $\epsilon_k = \delta_k = 1/2^k$, $k=1,2,\ldots$. Since $x_n = o_p(1)$, for $\epsilon_1$ and $\delta_1$, there exists $N_1 > 0$ such that for all $n > N_1$, $P(|x_n| > \delta_1) < \epsilon_1$. And for each pair of $\epsilon_k$ and $\delta_k$, there exists $N_k > N_{k-1}$ such that for all $n > N_k$, $P(|x_n| > \delta_k) < \epsilon_k$. Let $a_n = 1$ if $n \le N_1$ and $a_n = \delta_k$ if $N_k < n \le N_{k+1}$, $k=1,2,\ldots$. Then $a_n \to 0$ as $n\to\infty$. Also, for any $\epsilon > 0$, there exists $k_0$ such that $0 < \epsilon_{k_0} < \epsilon$. Thus for any $n > N_{k_0}$, $N_k < n \le N_{k+1}$ for some $k \ge k_0$, and $P(|x_n| > a_n) = P(|x_n| > \delta_k) < \epsilon_k \le \epsilon_{k_0} < \epsilon$.
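Again by $x_n = o_p(1)$, there exists $M > 1$ such that $P(|x_n| > M) < \epsilon$ for all $n \le N_{k_0}$. This completes the proof. ∎

Lemma 3.6 Let $R_j^0 = (\tau_{j-1}^0,\tau_j^0]$, $\hat R_j = (\hat\tau_{j-1},\hat\tau_j]$, $\tau_0^0 = \hat\tau_0 = -\infty$, $\tau_{l^0+1}^0 = \hat\tau_{l^0+1} = \infty$, and $\Delta_{n,j} = |\hat\tau_j-\tau_j^0| = O_p(a_n)$, $j=1,\ldots,l^0+1$, where $\{a_n\}$ is a sequence of positive numbers. Suppose that $\{(z_t,x_{td})\}$ is a strictly stationary and ergodic sequence and that the marginal cdf, $F_d$, of $x_{1d}$ satisfies the Lipschitz condition, $|F_d(x')-F_d(x'')| \le C|x'-x''|$, for some constant $C$ in a small neighborhood of $x_{1d}=\tau_j^0$ for every $j$. If for some $u > 1$, $E|z_1|^u < \infty$, then
\[ \frac1n\sum_{t=1}^n z_t(\mathbf{1}_{(x_{td}\in\hat R_j)} - \mathbf{1}_{(x_{td}\in R_j^0)}) = O_p((a_n)^{1/v}), \quad j=1,\ldots,l^0+1, \]
where $1/v = 1 - 1/u$.

Proof It suffices to show that
\[ \frac1n\sum_{t=1}^n|z_t|\,|\mathbf{1}_{(x_{td}\in\hat R_j)} - \mathbf{1}_{(x_{td}\in R_j^0)}| = O_p((a_n)^{1/v}). \]
Since, for every $j=1,\ldots,l^0$,
\[ |\mathbf{1}_{(x_{td}\in\hat R_j)} - \mathbf{1}_{(x_{td}\in R_j^0)}| \le \mathbf{1}_{(|x_{td}-\tau_{j-1}^0| \le \Delta_{n,j-1})} + \mathbf{1}_{(|x_{td}-\tau_j^0| \le \Delta_{n,j})}, \]
where for $j=1$ the first term is defined as 0, it suffices to show that
\[ \frac1n\sum_{t=1}^n|z_t|\mathbf{1}_{(|x_{td}-\tau_j^0| \le \Delta_{n,j})} = O_p((a_n)^{1/v}) \]
for every $j$. By assumption, $\Delta_{n,j} = O_p(a_n)$. So for all $\epsilon > 0$ there exists $M > 0$ such that $P(\Delta_{n,j} > a_nM) < \epsilon$ for all $n$. Thus
\[ P\Big(\frac1n\sum_{t=1}^n|z_t|\mathbf{1}_{(|x_{td}-\tau_j^0| \le \Delta_{n,j})} > a_n^{1/v}M\Big) \le P\Big(\frac1n\sum_{t=1}^n|z_t|\mathbf{1}_{(|x_{td}-\tau_j^0| \le a_nM)} > a_n^{1/v}M\Big) + \epsilon. \]
Hence it remains to show that $a_n^{-1/v}\frac1n\sum_{t=1}^n|z_t|\mathbf{1}_{(|x_{td}-\tau_j^0| \le a_nM)}$ is bounded in probability. However, in view of Hölder's inequality and the assumptions, the expected value of this last quantity is bounded above by $(E|z_1|^u)^{1/u}a_n^{-1/v}(Ca_nM)^{1/v}$ for some constant $C$. This shows that
\[ \frac{1}{a_n^{1/v}}\cdot\frac1n\sum_{t=1}^n|z_t|\mathbf{1}_{(|x_{td}-\tau_j^0| \le a_nM)} \]
is bounded in $L_1$ and hence in probability. ∎

Proof of Theorem 3.3 Let $\hat\beta_j^*$ be the "least squares estimates" of $\beta_j^0$, $j=1,\ldots,l^0+1$, when $l^0$ and $(\tau_1^0,\ldots,\tau_{l^0}^0)$ are assumed known. Then by the law of large numbers, $\hat\beta_j^*-\beta_j^0 = o_p(1)$, $j=1,\ldots,l^0+1$. So it suffices to show that $\hat\beta_j-\hat\beta_j^* = o_p(1)$ for each $j$. Set $X_j^* = I_n(\tau_{j-1}^0,\tau_j^0)X_n$ and $\hat X_j = I_n(\hat\tau_{j-1},\hat\tau_j)X_n$. Then,
\begin{align*}
\hat\beta_j-\hat\beta_j^* &= [(\tfrac1n\hat X_j'\hat X_j)^{-} - (\tfrac1n X_j^{*\prime}X_j^*)^{-}][\tfrac1n\hat X_j'Y_n] + [(\tfrac1n X_j^{*\prime}X_j^*)^{-}][\tfrac1n(\hat X_j-X_j^*)'Y_n] \\
&= [(\tfrac1n\hat X_j'\hat X_j)^{-} - (\tfrac1n X_j^{*\prime}X_j^*)^{-}]\{\tfrac1n(\hat X_j-X_j^*)'Y_n + \tfrac1n X_j^{*\prime}Y_n\} + [(\tfrac1n X_j^{*\prime}X_j^*)^{-}][\tfrac1n(\hat X_j-X_j^*)'Y_n] \\
&=: (I)\{(II)+(III)\} + (IV)(II),
\end{align*}
where $(I) = [(\tfrac1n\hat X_j'\hat X_j)^{-} - (\tfrac1n X_j^{*\prime}X_j^*)^{-}]$, $(II) = \tfrac1n(\hat X_j-X_j^*)'Y_n$, $(III) = \tfrac1n X_j^{*\prime}Y_n$ and $(IV) = [(\tfrac1n X_j^{*\prime}X_j^*)^{-}]$. By the strong law of large numbers, both $(III)$ and $(IV)$ are $O_p(1)$. By Theorem 3.2, $\hat\tau-\tau^0 = o_p(1)$. Proposition 3.2 implies that there exists a sequence $\{a_n\}$, $a_n \to 0$ as $n\to\infty$, such that $\hat\tau-\tau^0 = O_p(a_n)$.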
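Note that $(II) = \frac1n\sum_{t=1}^n x_ty_t(\mathbf{1}_{(x_{td}\in\hat R_j)} - \mathbf{1}_{(x_{td}\in R_j^0)})$, where $\hat R_j = (\hat\tau_{j-1},\hat\tau_j]$, $R_j^0 = (\tau_{j-1}^0,\tau_j^0]$. Taking $u > 1$ and $z_t = a'x_ty_t$ for any real vector $a$, it follows from Lemma 3.6 that $(II) = o_p(1)$. If $(I) = o_p(1)$, then $\hat\beta_j-\hat\beta_j^* = o_p(1)$, $j=1,\ldots,l^0+1$. So, it remains only to show that $(I) = o_p(1)$. By the strong law of large numbers, $\frac1n X_j^{*\prime}X_j^* \xrightarrow{a.s.} E\{x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_{j-1}^0,\,\tau_j^0])}\} > 0$. If we can show that $\frac1n\hat X_j'\hat X_j - \frac1n X_j^{*\prime}X_j^* = o_p(1)$, then for sufficiently large $n$, $(\frac1n\hat X_j'\hat X_j)^{-1}$ and $(\frac1n X_j^{*\prime}X_j^*)^{-1}$ exist with probability approaching 1, and $(\frac1n\hat X_j'\hat X_j)^{-} - (\frac1n X_j^{*\prime}X_j^*)^{-} = o_p(1)$. So, it suffices to show that $\frac1n\hat X_j'\hat X_j - \frac1n X_j^{*\prime}X_j^* = o_p(1)$. Let $a \ne 0$ be a constant vector and $z_t = (a'x_t)^2$. Then
\[ a'(\tfrac1n\hat X_j'\hat X_j - \tfrac1n X_j^{*\prime}X_j^*)a = \frac1n\sum_{t=1}^n a'x_tx_t'a(\mathbf{1}_{(x_{td}\in\hat R_j)} - \mathbf{1}_{(x_{td}\in R_j^0)}) = \frac1n\sum_{t=1}^n z_t(\mathbf{1}_{(x_{td}\in\hat R_j)} - \mathbf{1}_{(x_{td}\in R_j^0)}). \]
Taking the sequence $\{a_n\}$ in the last paragraph and $u > 1$, it follows from Lemma 3.6 that $a'(\frac1n\hat X_j'\hat X_j - \frac1n X_j^{*\prime}X_j^*)a = o_p(1)$ and hence $\frac1n\hat X_j'\hat X_j - \frac1n X_j^{*\prime}X_j^* = o_p(1)$. This completes the proof. ∎

The proof of Theorem 3.4 depends on the following results.

Proposition 3.3 (Serfling, 1980, p32) Let $\{y_{nt},\ 1 \le t \le K_n,\ n=1,2,\ldots\}$ be a double array with independent random variables within rows. Suppose, for some $v > 2$,
\[ \sum_{t=1}^{K_n}E|y_{nt}-\mu_{nt}|^v/B_n^v \to 0, \quad \text{as } n\to\infty. \]
Then
\[ B_n^{-1}\Big[\sum_{t=1}^{K_n}y_{nt} - A_n\Big] \xrightarrow{d} N(0,1), \quad \text{as } n\to\infty, \]
where $\mu_{nt} = E(y_{nt})$, $A_n = \sum_{t=1}^{K_n}\mu_{nt}$ and $B_n^2 = \sum_{t=1}^{K_n}Var(y_{nt})$.

Lemma 3.7 Let $\{k_n\}$ be a sequence of positive numbers such that $k_n \to 0$ and $nk_n \to \infty$. Assumptions 3.0 and 3.3 imply that for any $j=1,\ldots,l^0$,
(i) $\frac{1}{nk_n}X_n'(\tau_j^0-k_n,\tau_j^0)X_n(\tau_j^0-k_n,\tau_j^0) \xrightarrow{p} E(x_1x_1' \mid x_{1d}=\tau_j^0)f_d(\tau_j^0)$, $\quad \frac{1}{nk_n}X_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n) \xrightarrow{p} E(x_1x_1' \mid x_{1d}=\tau_j^0)f_d(\tau_j^0)$,
(ii) $\frac{1}{nk_n}\tilde\epsilon_n'(\tau_j^0-k_n,\tau_j^0)\tilde\epsilon_n(\tau_j^0-k_n,\tau_j^0) \xrightarrow{p} \sigma_0^2f_d(\tau_j^0)$, $\quad \frac{1}{nk_n}\tilde\epsilon_n'(\tau_j^0,\tau_j^0+k_n)\tilde\epsilon_n(\tau_j^0,\tau_j^0+k_n) \xrightarrow{p} \sigma_0^2f_d(\tau_j^0)$,
(iii) $\frac{1}{nk_n}\tilde\epsilon_n'(\tau_j^0-k_n,\tau_j^0)X_n(\tau_j^0-k_n,\tau_j^0) \xrightarrow{p} 0$, $\quad \frac{1}{nk_n}\tilde\epsilon_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n) \xrightarrow{p} 0$.

Proof It suffices to show the second equation in each of (i), (ii) and (iii), the proofs of the first differing only in a formalistic sense.
(i) Note that $X_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n) = \sum_{t=1}^n x_tx_t'\mathbf{1}_{(x_{td}\in(\tau_j^0,\,\tau_j^0+k_n])}$. Let $a \ne 0$ be a constant vector, $y_{nt} = a'x_tx_t'a\,\mathbf{1}_{(x_{td}\in(\tau_j^0,\,\tau_j^0+k_n])}$, $\mu_n = E(y_{nt})$, and $\sigma_n^2 = Var(y_{nt})$.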
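If $E[(a'x_1)^2 \mid x_{1d}=\tau_j^0] > 0$, then $E[(a'x_1)^4 \mid x_{1d}=\tau_j^0] > 0$ and
\begin{align*}
\mu_n &= E\{\mathbf{1}_{(x_{td}\in(\tau_j^0,\,\tau_j^0+k_n])}E[(a'x_1)^2 \mid x_{td}]\} \\
&= E[(a'x_1)^2 \mid x_{1d}=\theta_n]f_d(\theta_n)k_n \\
&= E[(a'x_1)^2 \mid x_{1d}=\tau_j^0]f_d(\tau_j^0)k_n + o(k_n),
\end{align*}
where $\theta_n \in (\tau_j^0,\tau_j^0+k_n]$ and $f_d(\cdot)$ is the marginal density function of $x_{td}$. Similarly,
\[ \sigma_n^2 = Ey_{nt}^2 - \mu_n^2 = E[(a'x_1)^4 \mid x_{1d}=\eta_n]f_d(\eta_n)k_n - \mu_n^2 = E[(a'x_1)^4 \mid x_{1d}=\tau_j^0]f_d(\tau_j^0)k_n + o(k_n), \]
where $\eta_n \in (\tau_j^0,\tau_j^0+k_n]$, and for sufficiently large $n$, $\sigma_n^2 > 0$. By Minkowski's inequality, for $v > 2$,
\begin{align*}
E|y_{n1}-\mu_n|^v &\le 2^{v-1}(E|y_{n1}|^v + \mu_n^v) \\
&= 2^{v-1}\{E[(a'x_1)^{2v} \mid x_{1d}=\xi_n]f_d(\xi_n)k_n + (E[(a'x_1)^2 \mid x_{1d}=\theta_n]f_d(\theta_n)k_n)^v\} \\
&= 2^{v-1}E[(a'x_1)^{2v} \mid x_{1d}=\tau_j^0]f_d(\tau_j^0)k_n + o(k_n),
\end{align*}
where $\xi_n \in (\tau_j^0,\tau_j^0+k_n]$. So by setting $A_n = n\mu_n$ and $B_n^2 = n\sigma_n^2$, we have
\[ \sum_{t=1}^n E|y_{nt}-\mu_n|^v/B_n^v = n^{-(v/2-1)}\frac{E|y_{n1}-\mu_n|^v}{(\sigma_n^2)^{v/2}} \le n^{-(v/2-1)}\frac{2^{v-1}E[(a'x_1)^{2v} \mid x_{1d}=\tau_j^0]f_d(\tau_j^0)k_n + o(k_n)}{(E[(a'x_1)^4 \mid x_{1d}=\tau_j^0]f_d(\tau_j^0)k_n + o(k_n))^{v/2}} \to 0, \]
as $n\to\infty$ since $v > 2$. Hence by Proposition 3.3,
\[ B_n^{-1}\Big[\sum_{t=1}^n y_{nt} - A_n\Big] \xrightarrow{d} N(0,1), \quad \text{as } n\to\infty. \]
Now, since $B_n^2/(nk_n)^2 = O(\ln^2 n)/\ln^4 n = O(\ln^{-2}n)$, we obtain
\[ \frac{1}{nk_n}a'X_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n)a = \frac{1}{nk_n}\sum_{t=1}^n y_{nt} \xrightarrow{p} a'E(x_1x_1' \mid x_{1d}=\tau_j^0)a\,f_d(\tau_j^0), \quad \text{as } n\to\infty. \]
If $E[(a'x_1)^2 \mid x_{1d}=\tau_j^0] = 0$, it suffices to show that $\frac{1}{nk_n}a'X_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n)a$ converges to 0 in $L_1$:
\begin{align*}
E\Big(\frac{1}{nk_n}a'X_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n)a\Big) &= E[(a'x_t)^2\mathbf{1}_{(x_{td}\in(\tau_j^0,\,\tau_j^0+k_n])}]/k_n \\
&= E\{\mathbf{1}_{(x_{td}\in(\tau_j^0,\,\tau_j^0+k_n])}E[(a'x_t)^2 \mid x_{td}]\}/k_n \\
&= E[(a'x_1)^2 \mid x_{1d}=\theta_n]f_d(\theta_n) \\
&= E[(a'x_1)^2 \mid x_{1d}=\tau_j^0]f_d(\tau_j^0) + o(1) = o(1),
\end{align*}
as $n\to\infty$, where $\theta_n \in (\tau_j^0,\tau_j^0+k_n]$. This completes the proof of (i).

(ii) Similarly to (i), let $y_{nt} = \epsilon_t^2\mathbf{1}_{(x_{td}\in(\tau_j^0,\,\tau_j^0+k_n])}$, $\mu_n = E(y_{nt})$, and $\sigma_n^2 = Var(y_{nt})$. Then
\[ \mu_n = E[\epsilon_t^2\mathbf{1}_{(x_{td}\in(\tau_j^0,\,\tau_j^0+k_n])}] = \sigma_0^2[f_d(\tau_j^0)k_n + o(k_n)], \]
\[ \sigma_n^2 = E(y_{nt}^2) - \mu_n^2 = E(\epsilon_t^4)P(x_{td}\in(\tau_j^0,\tau_j^0+k_n]) - \mu_n^2 = E(\epsilon_t^4)f_d(\tau_j^0)k_n + o(k_n) - \mu_n^2 = E(\epsilon_t^4)f_d(\tau_j^0)k_n + o(k_n). \]
By Minkowski's inequality, for $v > 2$,
\[ E|y_{n1}-\mu_n|^v \le 2^{v-1}(E|y_{n1}|^v + \mu_n^v) = 2^{v-1}E(\epsilon_1^{2v})f_d(\tau_j^0)k_n + o(k_n). \]
So by setting $A_n = n\mu_n$ and $B_n^2 = n\sigma_n^2$, we have
\[ \sum_{t=1}^n E|y_{nt}-\mu_n|^v/B_n^v = n^{-(v/2-1)}\frac{E|y_{nt}-\mu_n|^v}{(E|y_{nt}-\mu_n|^2)^{v/2}} \le n^{-(v/2-1)}\frac{2^{v-1}[E(\epsilon_1^{2v})f_d(\tau_j^0)k_n + o(k_n)]}{\{E(\epsilon_1^4)f_d(\tau_j^0)k_n + o(k_n)\}^{v/2}} \to 0, \]
as $n\to\infty$.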
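Hence by Proposition 3.3,
\[ B_n^{-1}\Big[\sum_{t=1}^n y_{nt} - A_n\Big] \xrightarrow{d} N(0,1), \quad \text{as } n\to\infty. \]
By the fact that $B_n^2/(nk_n)^2 = O(\ln^2 n)/\ln^4 n = O(\ln^{-2}n)$, we obtain
\[ \frac{1}{nk_n}\tilde\epsilon_n'(\tau_j^0,\tau_j^0+k_n)\tilde\epsilon_n(\tau_j^0,\tau_j^0+k_n) \xrightarrow{p} \sigma_0^2f_d(\tau_j^0). \]
(iii) For any $a \ne 0$,
\[ E\Big\{\frac{1}{nk_n}\tilde\epsilon_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n)a\Big\}^2 = \frac{1}{(nk_n)^2}\sum_{t=1}^n E[\epsilon_t^2(a'x_t)^2\mathbf{1}_{(x_{td}\in(\tau_j^0,\,\tau_j^0+k_n])}] = \frac{1}{nk_n}\sigma_0^2(E[(a'x_1)^2 \mid x_{1d}=\tau_j^0]f_d(\tau_j^0) + o(1)) \to 0 \]
as $n\to\infty$. ∎

The approach of the following proof is to show that, uniformly for all $\tau_j$ such that $|\tau_j-\tau_j^0| > K\ln^2 n/n$ for a suitably large constant $K$, $S_n(\tau_1,\ldots,\tau_{l^0}) > S_n(\tau_1^0,\ldots,\tau_{l^0}^0)$ for sufficiently large $n$. We shall achieve this by showing
\[ S_n(\tau_{j-1}^0+\delta,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta) - [S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta)] + O_p(\ln^2 n) > 0 \]
for sufficiently large $n$.

Proof of Theorem 3.4 By Theorem 3.1, the problem can be restricted to $\{\hat l = l^0\}$. Suppose for some $j$, $P(x_1'(\beta_{j+1}^0-\beta_j^0) \ne 0 \mid x_d=\tau_j^0) > 0$. Hence $\Delta = E[(x_1'(\beta_{j+1}^0-\beta_j^0))^2 \mid x_d=\tau_j^0] > 0$. Let $\hat\beta(\alpha,\eta)$ be the minimizer of $\|Y_n(\alpha,\eta) - X_n(\alpha,\eta)\beta\|^2$. Set $k_n = K\ln^2 n/n$ for $n=1,2,\ldots$, where $K$ will be chosen later. The proofs of Lemma 3.6 and Theorem 3.3 show that if $\alpha_n \to \alpha$ and $\eta_n \to \eta$, then $\hat\beta(\alpha_n,\eta_n) - \hat\beta(\alpha,\eta) \to 0$ as $n\to\infty$. Hence, since $\tau_j^0+k_n \to \tau_j^0$ as $n\to\infty$, $\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \hat\beta(\tau_{j-1}^0+\delta,\tau_j^0) \to 0$ as $n\to\infty$. By Assumption 3.2, for any sufficiently small $\delta \in (0,\tau_j^0-\tau_{j-1}^0)$, $E\{x_1x_1'\mathbf{1}_{(x_{1d}\in(\tau_{j-1}^0+\delta,\,\tau_j^0])}\}$ is positive definite, hence $\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0) \to \beta_j^0$ as $n\to\infty$. Therefore $\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) \to \beta_j^0$. So, there exists a sufficiently small $\delta > 0$ such that for all sufficiently large $n$, $\|\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \beta_j^0\| \le \|\beta_j^0-\beta_{j+1}^0\|$ and
\[ (\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \beta_{j+1}^0)'E(x_1x_1' \mid x_{1d}=\tau_j^0)(\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \beta_{j+1}^0) > \Delta/2 \]
with probability approaching 1. Hence by Theorem 3.2, for any $\epsilon > 0$, there exists $N_1$ such that for $n > N_1$, with probability larger than $1-\epsilon$, we have
(i) $|\hat\tau_i-\tau_i^0| < \delta$, $i=1,\ldots,l^0$,
(ii) $\|\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \beta_{j+1}^0\| \le 2\|\beta_j^0-\beta_{j+1}^0\|$, and
(iii) $(\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \beta_{j+1}^0)'E(x_1x_1' \mid x_{1d}=\tau_j^0)(\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \beta_{j+1}^0) > \Delta/2$.
Let $A_j = \{(\tau_1,\ldots,\tau_{l^0}) : |\tau_i-\tau_i^0| < \delta,\ i=1,\ldots,l^0,\ |\tau_j-\tau_j^0| > k_n\}$, $j=1,\ldots,l^0$.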
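Since for the least squares estimates $\hat\tau_1,\ldots,\hat\tau_{l^0}$, $S_n(\hat\tau_1,\ldots,\hat\tau_{l^0}) \le S_n(\tau_1^0,\ldots,\tau_{l^0}^0)$,
\[ \inf_{(\tau_1,\ldots,\tau_{l^0})\in A_j}\{S_n(\tau_1,\ldots,\tau_{l^0}) - S_n(\tau_1^0,\ldots,\tau_{l^0}^0)\} > 0 \]
implies $(\hat\tau_1,\ldots,\hat\tau_{l^0}) \notin A_j$, or, $|\hat\tau_j-\tau_j^0| \le k_n = K\ln^2 n/n$ when (i) holds. By (i), if we show that for each $j$ there exists $N \ge N_1$ such that for all $n \ge N$, with probability larger than $1-2\epsilon$, $\inf_{(\tau_1,\ldots,\tau_{l^0})\in A_j}\{S_n(\tau_1,\ldots,\tau_{l^0}) - S_n(\tau_1^0,\ldots,\tau_{l^0}^0)\} > 0$, we will have proved the desired result. Furthermore, by symmetry, we can consider the case when $\tau_j > \tau_j^0$ only. Hence $A_j$ may be replaced by $A_j' = \{(\tau_1,\ldots,\tau_{l^0}) : |\tau_i-\tau_i^0| < \delta,\ i=1,\ldots,l^0,\ \tau_j-\tau_j^0 > k_n\}$.

For any $(\tau_1,\ldots,\tau_{l^0}) \in A_j'$, let $\xi_1 \le \cdots \le \xi_{2l^0+1}$ be the set $\{\tau_1,\ldots,\tau_{l^0},\tau_1^0,\ldots,\tau_{j-1}^0,\tau_{j-1}^0+\delta,\tau_{j+1}^0-\delta,\tau_{j+1}^0,\ldots,\tau_{l^0}^0\}$ after ordering its elements and let $\xi_0 = -\infty$, $\xi_{2l^0+2} = \infty$. Here and below, $\sum'$ denotes summation over those $i$ with $\xi_i \notin \{\tau_j,\ \tau_{j+1}^0-\delta\}$. Using Proposition 3.1 (ii) twice, we have
\begin{align*}
{\sum}'S_n(\xi_{i-1},\xi_i) + S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta) &= \tilde\epsilon_n'\tilde\epsilon_n + O_p(\ln^2 n) \\
&= [S_n(\tau_1^0,\ldots,\tau_{l^0}^0) + O_p(\ln^2 n)] + O_p(\ln^2 n) \\
&= S_n(\tau_1^0,\ldots,\tau_{l^0}^0) + O_p(\ln^2 n).
\end{align*}
Thus,
\begin{align*}
S_n(\tau_1,\ldots,\tau_{l^0}) &\ge S_n(\xi_1,\ldots,\xi_{2l^0+1}) = \sum_{i=1}^{2l^0+2}S_n(\xi_{i-1},\xi_i) \\
&= {\sum}'S_n(\xi_{i-1},\xi_i) + S_n(\tau_{j-1}^0+\delta,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta) \\
&= {\sum}'S_n(\xi_{i-1},\xi_i) + S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta) \\
&\quad + [S_n(\tau_{j-1}^0+\delta,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta)] - [S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta)] \\
&= S_n(\tau_1^0,\ldots,\tau_{l^0}^0) + O_p(\ln^2 n) + [S_n(\tau_{j-1}^0+\delta,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta)] - [S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta)],
\end{align*}
where $O_p(\ln^2 n)$ is independent of $(\tau_1,\ldots,\tau_{l^0}) \in A_j'$. It suffices to show that for $B_n = \{\tau_j : \tau_j \in (\tau_j^0+k_n,\tau_j^0+\delta)\}$ and sufficiently large $n$,
\[ \inf_{\tau_j\in B_n}\{S_n(\tau_{j-1}^0+\delta,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta) - [S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta)]\} \ge M'\ln^2 n \tag{3.8} \]
with probability larger than $1-2\epsilon$ for some fixed $M' > 0$.

Let $S_n(\alpha,\eta;\beta) = \|Y_n(\alpha,\eta) - X_n(\alpha,\eta)\beta\|^2 = \sum_{t=1}^n(y_t-x_t'\beta)^2\mathbf{1}_{(x_{td}\in(\alpha,\eta])}$. Since $S_n(\alpha,\eta) = S_n(\alpha,\eta;\hat\beta(\alpha,\eta))$, we have
\begin{align*}
S_n(\tau_{j-1}^0+\delta,\tau_j) &\ge S_n(\tau_{j-1}^0+\delta,\tau_j^0+k_n) + S_n(\tau_j^0+k_n,\tau_j) \\
&= S_n(\tau_{j-1}^0+\delta,\tau_j^0;\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)) + S_n(\tau_j^0,\tau_j^0+k_n;\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)) + S_n(\tau_j^0+k_n,\tau_j) \tag{3.9} \\
&\ge S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_j^0+k_n;\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)) + S_n(\tau_j^0+k_n,\tau_j).
\end{align*}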
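And since $(\tau_j^0+k_n,\tau_{j+1}^0-\delta] \subset (\tau_j^0,\tau_{j+1}^0]$ for sufficiently large $n$,
\[ S_n(\tau_j^0+k_n,\tau_{j+1}^0-\delta;\beta_{j+1}^0) = \tilde\epsilon_n'(\tau_j^0+k_n,\tau_{j+1}^0-\delta)\tilde\epsilon_n(\tau_j^0+k_n,\tau_{j+1}^0-\delta). \]
Applying Proposition 3.1 (i), we have
\[ 0 \le S_n(\tau_j^0+k_n,\tau_{j+1}^0-\delta;\beta_{j+1}^0) - [S_n(\tau_j^0+k_n,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta)] = T_n(\tau_j^0+k_n,\tau_j) + T_n(\tau_j,\tau_{j+1}^0-\delta). \]
By Lemma 3.2, the RHS is $O_p(\ln^2 n)$. Thus,
\begin{align*}
S_n(\tau_j^0,\tau_{j+1}^0-\delta) &\le S_n(\tau_j^0,\tau_{j+1}^0-\delta;\beta_{j+1}^0) \\
&= S_n(\tau_j^0,\tau_j^0+k_n;\beta_{j+1}^0) + S_n(\tau_j^0+k_n,\tau_{j+1}^0-\delta;\beta_{j+1}^0) \\
&\le S_n(\tau_j^0,\tau_j^0+k_n;\beta_{j+1}^0) + S_n(\tau_j^0+k_n,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta) + O_p(\ln^2 n),
\end{align*}
where $O_p(\ln^2 n)$ is independent of $\tau_j$. Hence
\[ S_n(\tau_j,\tau_{j+1}^0-\delta) \ge S_n(\tau_j^0,\tau_{j+1}^0-\delta) - S_n(\tau_j^0,\tau_j^0+k_n;\beta_{j+1}^0) - S_n(\tau_j^0+k_n,\tau_j) + O_p(\ln^2 n). \tag{3.10} \]
Therefore, by (3.9) and (3.10),
\begin{align*}
&[S_n(\tau_{j-1}^0+\delta,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta)] - [S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta)] \\
&\quad \ge S_n(\tau_j^0,\tau_j^0+k_n;\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)) - S_n(\tau_j^0,\tau_j^0+k_n;\beta_{j+1}^0) + O_p(\ln^2 n).
\end{align*}
Let $M > 0$ be such that the term $|O_p(\ln^2 n)| \le M\ln^2 n$ with probability larger than $1-\epsilon$ for all $n > N_1$. To show (3.8), it suffices to show that for sufficiently large $n$,
\[ S_n(\tau_j^0,\tau_j^0+k_n;\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)) - S_n(\tau_j^0,\tau_j^0+k_n;\beta_{j+1}^0) - M\ln^2 n \ge M'\ln^2 n, \]
or
\[ S_n(\tau_j^0,\tau_j^0+k_n;\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)) - S_n(\tau_j^0,\tau_j^0+k_n;\beta_{j+1}^0) \ge (M'+M)\ln^2 n \tag{3.11} \]
with large probability. Recall $S_n(\alpha,\eta;\beta) = \|Y_n(\alpha,\eta) - X_n(\alpha,\eta)\beta\|^2$ and $Y_n(\tau_j^0,\tau_j^0+k_n) = X_n(\tau_j^0,\tau_j^0+k_n)\beta_{j+1}^0 + \tilde\epsilon_n(\tau_j^0,\tau_j^0+k_n)$. Write $X_n^{(k)} = X_n(\tau_j^0,\tau_j^0+k_n)$, $Y_n^{(k)} = Y_n(\tau_j^0,\tau_j^0+k_n)$, $\tilde\epsilon_n^{(k)} = \tilde\epsilon_n(\tau_j^0,\tau_j^0+k_n)$ and $\hat\beta = \hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)$ for brevity. Taking $K$ sufficiently large and applying (ii), (iii) and Lemma 3.7 (i), (iii), we can see that there exists $N \ge N_1$ such that for any $n \ge N$,
\begin{align*}
\frac{1}{nk_n}[S_n(\tau_j^0,\tau_j^0+k_n;\hat\beta) - S_n(\tau_j^0,\tau_j^0+k_n;\beta_{j+1}^0)] &= \frac{1}{nk_n}[\|Y_n^{(k)} - X_n^{(k)}\hat\beta\|^2 - \|Y_n^{(k)} - X_n^{(k)}\beta_{j+1}^0\|^2] \\
&= \frac{1}{nk_n}[\|X_n^{(k)}(\beta_{j+1}^0-\hat\beta) + \tilde\epsilon_n^{(k)}\|^2 - \|\tilde\epsilon_n^{(k)}\|^2] \\
&= (\beta_{j+1}^0-\hat\beta)'\frac{1}{nk_n}X_n^{(k)\prime}X_n^{(k)}(\beta_{j+1}^0-\hat\beta) + \frac{2}{nk_n}\tilde\epsilon_n^{(k)\prime}X_n^{(k)}(\beta_{j+1}^0-\hat\beta) \\
&\ge \Delta/4 - \Delta/8 \ge (M'+M)/K
\end{align*}
with probability larger than $1-2\epsilon$. Since $k_n = K\ln^2 n/n$, the above implies (3.11). ∎

Proof of Theorem 3.5 By Lemma 3.4 (ii), $\hat\sigma^2 - \frac1n\sum_{t=1}^n\epsilon_t^2 = O_p(\ln^2 n/n)$. So, $\sqrt n(\hat\sigma^2-\sigma_0^2)$ and $n^{-1/2}\sum_{t=1}^n(\epsilon_t^2-\sigma_0^2)$ share the same asymptotic distribution. Applying the central limit theorem to $\{\epsilon_t^2\}$, we conclude that the asymptotic distribution of $n^{-1/2}\sum_{t=1}^n(\epsilon_t^2-\sigma_0^2)$ is normal.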
Let $(\hat\beta_1^*,\ldots,\hat\beta_{l^0+1}^*)$ be the "least squares estimates" of $(\beta_1^0,\ldots,\beta_{l^0+1}^0)$ when $l^0$ and $\tau_i^0$, $(i=1,\ldots,l^0)$, are assumed known. Then it is clear that $\sqrt n[(\hat\beta_1^{*\prime},\ldots,\hat\beta_{l^0+1}^{*\prime})' - (\beta_1^{0\prime},\ldots,\beta_{l^0+1}^{0\prime})']$ converges in distribution to a normal distribution. So it suffices to show that $\hat\beta_j-\hat\beta_j^* = o_p(n^{-1/2})$. Set $X_j^* = I_n(\tau_{j-1}^0,\tau_j^0)X_n$ and $\hat X_j = I_n(\hat\tau_{j-1},\hat\tau_j)X_n$. Then,
\begin{align*}
\hat\beta_j-\hat\beta_j^* &= [(\tfrac1n\hat X_j'\hat X_j)^{-} - (\tfrac1n X_j^{*\prime}X_j^*)^{-}][\tfrac1n\hat X_j'Y_n] + [(\tfrac1n X_j^{*\prime}X_j^*)^{-}][\tfrac1n(\hat X_j-X_j^*)'Y_n] \\
&= [(\tfrac1n\hat X_j'\hat X_j)^{-} - (\tfrac1n X_j^{*\prime}X_j^*)^{-}]\{\tfrac1n(\hat X_j-X_j^*)'Y_n + \tfrac1n X_j^{*\prime}Y_n\} + [(\tfrac1n X_j^{*\prime}X_j^*)^{-}][\tfrac1n(\hat X_j-X_j^*)'Y_n] \\
&=: (I)\{(II)+(III)\} + (IV)(II),
\end{align*}
where $(I) = [(\tfrac1n\hat X_j'\hat X_j)^{-} - (\tfrac1n X_j^{*\prime}X_j^*)^{-}]$, $(II) = \tfrac1n(\hat X_j-X_j^*)'Y_n$, $(III) = \tfrac1n X_j^{*\prime}Y_n$ and $(IV) = [(\tfrac1n X_j^{*\prime}X_j^*)^{-}]$. As in the proof of Theorem 3.3, both $(III)$ and $(IV)$ are $O_p(1)$. By Theorem 3.4, $\hat\tau-\tau^0 = O_p(\ln^2 n/n)$. The order of $o_p(n^{-1/2})$ of $(I)$ and $(II)$ follows from Lemma 3.6 by taking $a_n = \ln^2 n/n$, $z_t = (a'x_t)^2$ and $z_t = a'x_ty_t$ respectively, for any real vector $a$ and $u > 2$. This completes the proof. ∎

3.2 Consistency of the estimated segmentation variable

Since $d^0$ is assumed unknown in this section, we will use the notation such as $S_n(A)$, $T_n(A)$ introduced in Section 2.2. The two theorems in this section show that the two methods of estimating $d^0$ given in Section 2.2 produce consistent estimates, respectively.

Theorem 3.6 If $d^0$ is asymptotically identifiable w.r.t. $L$, then under the conditions of Theorem 3.1, $\hat d$ given in Method 1 satisfies $P\{\hat d = d^0\} \to 1$ as $n\to\infty$.

Theorem 3.7 Assume $\{x_t\}$ are iid random vectors. If $z_1 = (x_{11},\ldots,x_{1p})'$ is a continuous random vector and the support of its distribution is $(a_1,b_1)\times\cdots\times(a_p,b_p)$, where $-\infty \le a_i < b_i \le \infty$, $i=1,\ldots,p$, and for any $a \in R^p$, $E[(z_1'z_1)^u] < \infty$, for some $u > 1$, then $\hat d$ given by Method 2 satisfies $P\{\hat d = d^0\} \to 1$ as $n\to\infty$.

To prove Theorem 3.6, some results similar to those presented in the last section are needed. Lemmas 3.2'-3.3' and Proposition 3.1' below are generalizations of Lemmas 3.2-3.3 and Proposition 3.1 respectively.

Lemma 3.2' Assume for the segmented linear regression model (3.1) that Assumption 3.0 is satisfied.
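For any $d \ne d^0$ and $j=1,\ldots,l^0+1$, let $R_j^d(\alpha,\eta) = \{x_1 : \alpha < x_{1d} \le \eta\} \cap R_j^0$, $-\infty \le \alpha < \eta \le \infty$. Then
\[ P\Big\{\sup_{\alpha<\eta}T_n(R_j^d(\alpha,\eta)) \ge \frac{9p_0^3}{T_0^2}\ln^2 n\Big\} \to 0, \quad \text{as } n\to\infty, \]
where $p_0$ is the true order of the model and $T_0$ is the constant associated with the local exponential boundedness condition for the $\{\epsilon_t\}$.

Proof Conditioning on $X_n$, we have for any $j$ and $d \ne d^0$ that
\[ P\Big\{\sup_{\alpha<\eta}T_n(R_j^d(\alpha,\eta)) \ge \frac{9p_0^3}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\} = P\Big\{\max_{x_{sd}<x_{td}}\tilde\epsilon_n'H_n(R_j^d(x_{sd},x_{td}))\tilde\epsilon_n \ge \frac{9p_0^3}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\} \le \sum_{x_{sd}<x_{td}}P\Big\{\tilde\epsilon_n'H_n(R_j^d(x_{sd},x_{td}))\tilde\epsilon_n \ge \frac{9p_0^3}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\}. \]
Since $H_n(R_j^d(x_{sd},x_{td}))$ is nonnegative definite and idempotent, it can be decomposed as $H_n(R_j^d(x_{sd},x_{td})) = W'\Lambda W$, where $W$ is orthogonal and $\Lambda = diag(1,\ldots,1,0,\ldots,0)$ with $p := rank(H_n(R_j^d(x_{sd},x_{td}))) = rank(\Lambda) \le p_0$. Set $Q = (I_p,0)W$. Then $Q$ has full row rank $p$. Let $Q' = (q_1,\ldots,q_p)$ and $U_l = q_l'\tilde\epsilon_n$, $l=1,\ldots,p$. Then
\[ \tilde\epsilon_n'H_n(R_j^d(x_{sd},x_{td}))\tilde\epsilon_n = \sum_{l=1}^p U_l^2. \]
Since $p \le p_0$ and
\[ P\Big\{\sum_{l=1}^p U_l^2 \ge \frac{9p_0^3}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\} \le P\Big\{\sum_{l=1}^p U_l^2 \ge p\,\frac{9p_0^2}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\} \le P\Big\{U_l^2 \ge \frac{9p_0^2}{T_0^2}\ln^2 n \text{ for some } l \,\Big|\, X_n\Big\} \le \sum_{l=1}^p P\Big\{U_l^2 \ge \frac{9p_0^2}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\}, \]
it suffices to show, for any $l$, that
\[ \sum_{x_{sd}<x_{td}}P\Big\{U_l^2 \ge \frac{9p_0^2}{T_0^2}\ln^2 n \,\Big|\, X_n\Big\} \to 0, \quad \text{as } n\to\infty. \]
Noting that $p = trace(H_n(R_j^d(x_{sd},x_{td}))) = \sum_{l=1}^p\|q_l\|^2$, we have $\|q_l\|^2 = q_l'q_l \le p \le p_0$, $l=1,\ldots,p$. By Lemma 3.1, with $t_0 = T_0/p_0$ we have
\[ \sum_{x_{sd}<x_{td}}P\{|U_l| \ge 3p_0\ln n/T_0 \mid X_n\} \le \sum_{x_{sd}<x_{td}}2\exp\Big(-\frac{T_0}{p_0}\cdot\frac{3p_0}{T_0}\ln n\Big)\exp\big(c_0(T_0/p_0)^2p_0\big) \le \frac{n(n-1)}{n^3}\exp(c_0T_0^2/p_0) \to 0, \]
as $n\to\infty$, where $c_0$ is the constant specified in Lemma 3.1. Finally, by appealing to the dominated convergence theorem we obtain the desired result without conditioning. ∎

Proposition 3.1' Consider the segmented regression model (3.1).
(i) For any subset $B$ of the domain of $x_1$ and any $j$,
\[ S_n(B\cap R_j^0) = \tilde\epsilon_n'(B\cap R_j^0)\tilde\epsilon_n(B\cap R_j^0) - T_n(B\cap R_j^0). \]
(ii) Let $\{B_i\}_{i=1}^{m+1}$ be a partition of the domain of $x_1$, where $m$ is a finite positive integer. Then,
\[ \sum_{i=1}^{m+1}S_n(B_i\cap R_j^0) = \tilde\epsilon_n'(R_j^0)\tilde\epsilon_n(R_j^0) - \sum_{i=1}^{m+1}T_n(B_i\cap R_j^0) \]
for all $j$. Further, if $B_i = \{x_1 : \tau_{i-1} < x_{1d} \le \tau_i\}$ for $d \ne d^0$, then Assumption 3.0 implies
\[ \sum_{i=1}^{m+1}S_n(B_i\cap R_j^0) = \tilde\epsilon_n'(R_j^0)\tilde\epsilon_n(R_j^0) + O_p(\ln^2 n) \]
uniformly for all $\tau_1,\ldots,\tau_m$ such that $-\infty = \tau_0 < \tau_1 < \cdots < \tau_m < \tau_{m+1} = \infty$.

Proof: (i) Denote $A = B\cap R_j^0$.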
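\begin{align*}
S_n(A) &= Y_n'(I_n(A)-H_n(A))Y_n = (X_n(A)\beta_j^0+\tilde\epsilon_n(A))'(I_n(A)-H_n(A))(X_n(A)\beta_j^0+\tilde\epsilon_n(A)) \\
&= \beta_j^{0\prime}X_n'(A)X_n(A)\beta_j^0 + 2\tilde\epsilon_n'(A)X_n(A)\beta_j^0 + \tilde\epsilon_n'(A)\tilde\epsilon_n(A) \\
&\quad - [\beta_j^{0\prime}X_n'(A)H_n(A)X_n(A)\beta_j^0 + 2\tilde\epsilon_n'(A)H_n(A)X_n(A)\beta_j^0 + \tilde\epsilon_n'(A)H_n(A)\tilde\epsilon_n(A)].
\end{align*}
Since $X_n'(A)H_n(A)X_n(A) = X_n'(A)X_n(A)$ and $H_n(A)$ is idempotent, we have
\[ [X_n(A)-H_n(A)X_n(A)]'[X_n(A)-H_n(A)X_n(A)] = 0 \]
and hence $H_n(A)X_n(A) = X_n(A)$. Thus,
\[ S_n(A) = \tilde\epsilon_n'(A)\tilde\epsilon_n(A) - \tilde\epsilon_n'(A)H_n(A)\tilde\epsilon_n(A) = \tilde\epsilon_n'(A)\tilde\epsilon_n(A) - T_n(A). \]
(ii) By (i),
\[ \sum_{i=1}^{m+1}S_n(B_i\cap R_j^0) = \sum_{i=1}^{m+1}[\tilde\epsilon_n'(B_i\cap R_j^0)\tilde\epsilon_n(B_i\cap R_j^0) - T_n(B_i\cap R_j^0)] = \tilde\epsilon_n'(R_j^0)\tilde\epsilon_n(R_j^0) - \sum_{i=1}^{m+1}T_n(B_i\cap R_j^0). \]
If $B_i = \{x_1 : \tau_{i-1} < x_{1d} \le \tau_i\}$, denote $B_i\cap R_j^0$ by $R_j^d(\tau_{i-1},\tau_i)$ for all $i$. Lemma 3.2' implies
\[ \sum_{i=1}^{m+1}T_n(B_i\cap R_j^0) = \sum_{i=1}^{m+1}T_n(R_j^d(\tau_{i-1},\tau_i)) \le (m+1)\sup_{\alpha<\eta}T_n(R_j^d(\alpha,\eta)) = O_p(\ln^2 n) \]
uniformly for all $-\infty < \tau_1 < \cdots < \tau_m < \infty$. ∎

Lemma 3.3' Let $A$ be a subset of the domain of $x_1$. If both $E[x_1x_1'\mathbf{1}_{(x_1\in A\cap R_r^0)}]$ and $E[x_1x_1'\mathbf{1}_{(x_1\in A\cap R_{r+1}^0)}]$ are positive definite, then under Assumption 3.0,
\[ [S_n(A) - S_n(A\cap R_r^0) - S_n(A\cap R_{r+1}^0)]/n \xrightarrow{a.s.} C_r \]
for some $C_r > 0$ as $n\to\infty$, $r=1,\ldots,l^0$.

Proof It suffices to prove the result when $l^0=1$. For notational simplicity, we omit the subscripts and superscripts 0 in this proof. Let $X_j^* = X_n(A\cap R_j)$, $\tilde\epsilon_j^* = \tilde\epsilon_n(A\cap R_j)$, $j=1,2$, $X^* = X_1^*+X_2^*$, $\tilde\epsilon^* = \tilde\epsilon_1^*+\tilde\epsilon_2^*$ and $\hat\beta = (X^{*\prime}X^*)^{-}X^{*\prime}Y_n$. As in ordinary regression, we have
\begin{align*}
S_n(A) &= \|X_1^*(\beta_1-\hat\beta) + X_2^*(\beta_2-\hat\beta) + \tilde\epsilon^*\|^2 \\
&= \|X_1^*(\beta_1-\hat\beta)\|^2 + \|X_2^*(\beta_2-\hat\beta)\|^2 + \|\tilde\epsilon^*\|^2 + 2\tilde\epsilon^{*\prime}X_1^*(\beta_1-\hat\beta) + 2\tilde\epsilon^{*\prime}X_2^*(\beta_2-\hat\beta).
\end{align*}
It then follows from the strong law of large numbers for stationary ergodic stochastic processes that as $n\to\infty$,
\[ \frac1n X^{*\prime}X^* = \frac1n\sum_{t=1}^n x_tx_t'\mathbf{1}_{(x_t\in A)} \xrightarrow{a.s.} E\{x_1x_1'\mathbf{1}_{(x_1\in A)}\} > 0, \qquad \frac1n X_j^{*\prime}X_j^* \xrightarrow{a.s.} E\{x_1x_1'\mathbf{1}_{(x_1\in A\cap R_j)}\} > 0, \quad j=1,2, \]
and
\[ \frac1n X^{*\prime}Y_n \xrightarrow{a.s.} E\{y_1x_1\mathbf{1}_{(x_1\in A)}\}. \]
Therefore,
\[ \hat\beta \xrightarrow{a.s.} \{E\{x_1x_1'\mathbf{1}_{(x_1\in A)}\}\}^{-1}E\{y_1x_1\mathbf{1}_{(x_1\in A)}\} =: \beta^*. \]
Similarly, it can be shown that
\[ \frac1n\|X_j^*(\beta_j-\hat\beta)\|^2 \xrightarrow{a.s.} (\beta_j-\beta^*)'E(x_1x_1'\mathbf{1}_{(x_1\in A\cap R_j)})(\beta_j-\beta^*), \qquad \frac1n\tilde\epsilon^{*\prime}X_j^*(\beta_j-\hat\beta) \xrightarrow{a.s.} 0, \]
for $j=1,2$, and
\[ \frac1n\tilde\epsilon^{*\prime}\tilde\epsilon^* \xrightarrow{a.s.} \sigma^2P\{x_1\in A\}. \]
Thus as $n\to\infty$, $\frac1n S_n(A)$ has a finite limit, this limit being given by
\[ \lim_{n\to\infty}\frac1n S_n(A) = (\beta_1-\beta^*)'E(x_1x_1'\mathbf{1}_{(x_1\in A\cap R_1)})(\beta_1-\beta^*) + (\beta_2-\beta^*)'E(x_1x_1'\mathbf{1}_{(x_1\in A\cap R_2)})(\beta_2-\beta^*) + \sigma^2P\{x_1\in A\}. \]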
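It remains to show that $\frac1n S_n(A\cap R_j)$ converges to $\sigma^2P\{x_1\in A\cap R_j\}$, $j=1,2$, and at least one of $(\beta_1-\beta^*)'E(x_1x_1'\mathbf{1}_{(x_1\in A\cap R_1)})(\beta_1-\beta^*)$ and $(\beta_2-\beta^*)'E(x_1x_1'\mathbf{1}_{(x_1\in A\cap R_2)})(\beta_2-\beta^*)$ is positive. The latter is a direct consequence of the assumed conditions while the former can be shown again by the strong law of large numbers. By Proposition 3.1' (i),
\[ S_n(A\cap R_1) = \tilde\epsilon_n'(A\cap R_1)\tilde\epsilon_n(A\cap R_1) - T_n(A\cap R_1) = \tilde\epsilon_1^{*\prime}\tilde\epsilon_1^* - T_n(A\cap R_1). \]
The strong law of large numbers implies
\[ \frac1n\tilde\epsilon_1^{*\prime}\tilde\epsilon_1^* \xrightarrow{a.s.} E[\epsilon_1^2\mathbf{1}_{(x_1\in A\cap R_1)}] = \sigma^2P(x_1\in A\cap R_1), \qquad \frac1n\tilde\epsilon_1^{*\prime}X_1^* \xrightarrow{a.s.} E[\epsilon_1x_1'\mathbf{1}_{(x_1\in A\cap R_1)}] = 0, \]
as $n\to\infty$, and $W = \lim_{n\to\infty}\frac1n X_1^{*\prime}X_1^*$ is positive definite. Therefore,
\[ \frac1n T_n(A\cap R_1) = \Big(\frac1n\tilde\epsilon_1^{*\prime}X_1^*\Big)\Big(\frac1n X_1^{*\prime}X_1^*\Big)^{-}\Big(\frac1n X_1^{*\prime}\tilde\epsilon_1^*\Big) \xrightarrow{a.s.} 0\,W^{-1}\,0 = 0 \]
and hence $\frac1n S_n(A\cap R_1) \xrightarrow{a.s.} \sigma^2P\{x_1\in A\cap R_1\}$. The same argument can also be used to show that $\frac1n S_n(A\cap R_2) \xrightarrow{a.s.} \sigma^2P\{x_1\in A\cap R_2\}$. This completes the proof. ∎

Proof of Theorem 3.6 For $d = d^0$, by Lemma 3.4 (ii),
\[ \frac1n S_n^{d^0} = \sigma_0^2 + o_p(1). \]
Thus, it suffices to show for $d \ne d^0$ that $\frac1n S_n^d > \sigma_0^2 + C$ for some constant $C > 0$ with probability approaching 1. Again, $l^0=1$ is assumed for simplicity. If $d \ne d^0$, by the identifiability of $d^0$ and Theorem 2.1, for any $\{R_j^d\}_{j=1}^{L+1}$, there exist $r, s \in \{1,\ldots,L+1\}$ such that $R_r^d \supset A_s^d$, where $A_s^d = \{x_1 : x_{1d}\in[a_s,b_s]\}$ is defined in Theorem 2.1. Let $B_s = \{(\tau_1,\ldots,\tau_L) : R_r^d \supset A_s^d \text{ for some } r\}$. Then for any $(\tau_1,\ldots,\tau_L)$, $(\tau_1,\ldots,\tau_L) \in B_s$ for at least one $s \in \{1,\ldots,L+1\}$. Since $\hat d$ is chosen such that $S_n^{\hat d} \le S_n^d$ for all $d$, it suffices to show that for $d \ne d^0$ and each $s$, there exists $C_s > 0$ such that
\[ \inf_{(\tau_1,\ldots,\tau_L)\in B_s}\frac1n S_n^d(\tau_1,\ldots,\tau_L) \ge \sigma_0^2 + C_s \tag{3.12} \]
with probability approaching 1 as $n\to\infty$.

For any $(\tau_1,\ldots,\tau_L) \in B_s$, let $R_{L+2}^d = \{x : x_d\in(\tau_{r-1},a_s)\}$, $R_{L+3}^d = \{x : x_d\in(b_s,\tau_r]\}$. Then $R_r^d = A_s^d \cup R_{L+2}^d \cup R_{L+3}^d$. Note that the total sum of squared errors decreases as the partition becomes finer.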
By Proposition 3.1' and the strong law of large numbers,

$$\frac{1}{n}S_n^d(\tau_1, \dots, \tau_L) = \frac{1}{n}\sum_{j=1}^{L+1}S_n(R_j^d) \ge \frac{1}{n}\Big[\sum_{j \ne r}S_n(R_j^d) + S_n(A_s^d)\Big]$$
$$\ge \frac{1}{n}\Big\{\sum_{j \ne r}[S_n(R_j^d \cap R_1^0) + S_n(R_j^d \cap R_2^0)] + [S_n(A_s^d \cap R_1^0) + S_n(A_s^d \cap R_2^0)]\Big\} + \frac{1}{n}[S_n(A_s^d) - S_n(A_s^d \cap R_1^0) - S_n(A_s^d \cap R_2^0)] \qquad (3.13)$$
$$= \frac{1}{n}\{\tilde\epsilon_n'(R_1^0)\tilde\epsilon_n(R_1^0) + \tilde\epsilon_n'(R_2^0)\tilde\epsilon_n(R_2^0) + O_p(\ln^2 n)\} + \frac{1}{n}[S_n(A_s^d) - S_n(A_s^d \cap R_1^0) - S_n(A_s^d \cap R_2^0)]$$
$$= \frac{1}{n}\{\tilde\epsilon_n'\tilde\epsilon_n + O_p(\ln^2 n)\} + \frac{1}{n}[S_n(A_s^d) - S_n(A_s^d \cap R_1^0) - S_n(A_s^d \cap R_2^0)]$$
$$= \sigma_0^2 + o_p(1) + \frac{1}{n}[S_n(A_s^d) - S_n(A_s^d \cap R_1^0) - S_n(A_s^d \cap R_2^0)],$$

where the sums over $j \ne r$ run over $j = 1, \dots, L+3$. Now it remains to show that $\frac{1}{n}[S_n(A_s^d) - S_n(A_s^d \cap R_1^0) - S_n(A_s^d \cap R_2^0)] \ge C_s$ for some $C_s > 0$, with probability approaching 1. By Theorem 2.1, $E[x_1x_1'1_{(x_1 \in A_s^d \cap R_i^0)}]$, $i = 1, 2$, are positive definite. Applying Lemma 3.3' we obtain the desired result. ¶

To prove Theorem 3.7, we first define the $k$th percentile of a distribution function $F$ as $p_k := \inf\{t : F(t) \ge k/100\}$. Let $p_j^d$ and $r_j^d$ be the $j \cdot 100/(2L+2)$th percentile of $F^d$ and $F_n^d$ respectively, where $F^d$ is the distribution function of $x_{1d}$ and $F_n^d$ is the empirical distribution function of $\{x_{td}\}$, $j = 1, \dots, 2L+2$. If $x_{1d}$ has positive density function over a neighborhood of $p_j^d$ for each $j$, then by Theorem 2.3.1 of Serfling (1980, p75), $r_j^d$ converges to $p_j^d$ almost surely for any $j$. Now, we are ready to introduce three lemmas required by the proof of Theorem 3.7. In these three lemmas, we shall omit "$d$" in $p_j^d$ and $r_j^d$ for notational simplicity.

Lemma 3.8 Suppose $(z_t, x_{td})$ is a strictly stationary process and the marginal cdf of $x_{td}$ has bounded derivative at $p_j$ for all $j$. If $r_j - p_j = o_p(1)$, $j = 1, \dots, 2L+2$, and for some $u > 1$, $E|z_t|^u < \infty$, then

$$\frac{1}{n}\sum_{t=1}^n z_t\big(1_{(x_{td} \in (r_{j-1}, r_j])} - 1_{(x_{td} \in (p_{j-1}, p_j])}\big) = o_p(1).$$

Proof By the assumption, the marginal cdf, $F_d$, of $x_{1d}$ satisfies the Lipschitz condition in a small neighborhood of $x_{1d} = p_j$ for every $j$. By Proposition 3.2, $r_j - p_j = o_p(1)$ implies that there exists a positive sequence $\{a_n\}$ such that $a_n \to 0$ as $n \to \infty$ and $r_j - p_j = O_p(a_n)$. Applying Lemma 3.6 with $\tau_j^0$ and $\hat\tau_j$ replaced by $p_j$ and $r_j$ respectively, we obtain the desired result. ¶

For any $j \in \{1, \dots, 2L+2\}$, let $R_j = \{x_1 : p_{j-1} < x_{1d} \le p_j\}$ and $\hat R_j = \{x_1 : r_{j-1} < x_{1d} \le r_j\}$.
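Also let $X_{ip}^* = X_n(R_j \cap R_i^0)$, $X_p^* = X_n(R_j)$, $\tilde\epsilon_p^* = \tilde\epsilon_n(R_j)$, and $X_{ir}^* = X_n(\hat R_j \cap R_i^0)$, $X_r^* = X_n(\hat R_j)$, $\tilde\epsilon_r^* = \tilde\epsilon_n(\hat R_j)$, where $i = 1, 2$. Under the conditions of Theorem 3.7, the support of the distribution of $z_1$ is $(a_1, b_1) \times \cdots \times (a_p, b_p)$. Hence, for $d \ne d^0$, $E[x_1x_1'1_{(x_1 \in R_j \cap R_i^0)}]$ is of full rank, $i = 1, 2$.

Lemma 3.9 Under the conditions of Theorem 3.7,
(i) $\frac{1}{n}X_{ir}^{*\prime}X_{ir}^* = \frac{1}{n}X_{ip}^{*\prime}X_{ip}^* + o_p(1)$, $i = 1, 2$;
(ii) $\frac{1}{n}[\tilde\epsilon_r^{*\prime}\tilde\epsilon_r^* - \tilde\epsilon_p^{*\prime}\tilde\epsilon_p^*] = o_p(1)$; and
(iii) $\frac{1}{n}X_{ip}^{*\prime}\tilde\epsilon_p^* = O_p(n^{-1/2})$, $\frac{1}{n}X_{ir}^{*\prime}\tilde\epsilon_r^* = o_p(1)$, $i = 1, 2$.

Proof: Without loss of generality, we can assume $P(R_j \cap R_i^0) > 0$, $i = 1, 2$.
(i) For any $a \ne 0$,

$$\frac{1}{n}a'X_{ir}^{*\prime}X_{ir}^*a = \frac{1}{n}\sum_{t=1}^n (a'x_t)^21_{(x_t \in R_i^0)}1_{(x_{td} \in (r_{j-1}, r_j])}.$$

Taking $z_t = (a'x_t)^21_{(x_t \in R_i^0)}$ and applying Lemma 3.8, we have $\frac{1}{n}X_{ir}^{*\prime}X_{ir}^* = \frac{1}{n}X_{ip}^{*\prime}X_{ip}^* + o_p(1)$, $i = 1, 2$.
(ii) Take $z_t = \epsilon_t^21_{(x_t \in R_i^0)}$. Lemma 3.8 implies the desired result.
(iii) Take $z_t = a'x_t\epsilon_t1_{(x_t \in R_i^0)}$ for any $a$. Lemma 3.8 implies $\frac{1}{n}[X_{ir}^{*\prime}\tilde\epsilon_r^* - X_{ip}^{*\prime}\tilde\epsilon_p^*] = o_p(1)$. So, it suffices to show that $\frac{1}{n}X_{ip}^{*\prime}\tilde\epsilon_p^* = O_p(n^{-1/2})$. For any $a \ne 0$,

$$\frac{1}{n}a'X_{ip}^{*\prime}\tilde\epsilon_p^* = \frac{1}{n}\sum_{t=1}^n a'x_t\epsilon_t1_{(x_t \in R_i^0 \cap R_j)},$$

where $\{a'x_t\epsilon_t1_{(x_t \in R_i^0 \cap R_j)}\}$ is a martingale difference sequence. By the central limit theorem for a martingale difference sequence (Billingsley, 1968), $a'(\frac{1}{n}X_{ip}^{*\prime}\tilde\epsilon_p^*) = O_p(n^{-1/2})$. ¶

Lemma 3.10 Let $n(A) = \sum_{t=1}^n 1_{(x_t \in A)}$ for any set $A$ in the domain of $x_1$. Then under the conditions of Theorem 3.7, for $j = 1, \dots, 2L+2$,
(i) $\frac{1}{n}n(\hat R_j) = \frac{1}{n}n(R_j) + o_p(1) = \frac{1}{2L+2} + o_p(1)$,
(ii) $\hat\beta_r = \hat\beta_p + o_p(1) = \beta_p + o_p(1)$, where $\hat\beta_r = (X_r^{*\prime}X_r^*)^{-}X_r^{*\prime}Y_n$, $\hat\beta_p = (X_p^{*\prime}X_p^*)^{-}X_p^{*\prime}Y_n$, $\beta_p = \{E[x_1x_1'1_{(x_1 \in R_j)}]\}^{-1}E[y_1x_11_{(x_1 \in R_j)}]$,
(iii) $\frac{1}{n}[S_n(\hat R_j) - S_n(R_j)] = o_p(1)$, and
(iv) $S_n(\hat R_j)/n(\hat R_j) - S_n(R_j)/n(R_j) = o_p(1)$.

Proof Without loss of generality, we can assume $P(R_j \cap R_i^0) > 0$, $i = 1, 2$.
(i) Note that $\frac{1}{n}n(\hat R_j) - \frac{1}{n}n(R_j) = \frac{1}{n}\sum_{t=1}^n(1_{(x_t \in \hat R_j)} - 1_{(x_t \in R_j)})$. By applying Lemma 3.8 with $z_t = 1$, we get $\frac{1}{n}n(\hat R_j) = \frac{1}{n}n(R_j) + o_p(1)$. By the strong law of large numbers for ergodic processes,

$$\frac{1}{n}n(R_j) = \frac{1}{n}\sum_{t=1}^n 1_{(x_t \in R_j)} = E[1_{(x_1 \in R_j)}] + o_p(1) = P(x_1 \in R_j) + o_p(1) = \frac{1}{2L+2} + o_p(1).$$

(ii) By the strong law of large numbers for ergodic sequences, $\frac{1}{n}X_p^{*\prime}X_p^* \to E[x_1x_1'1_{(x_1 \in R_j)}] > 0$ and $\frac{1}{n}X_p^{*\prime}Y_n \to E[x_1y_11_{(x_1 \in R_j)}]$. Hence, $\hat\beta_p \to \beta_p$ as $n \to \infty$.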
Since x;'F„ = xrp'xrpA° + x;;x;^p', + x;'ê; and X*'Yn = Xi^' XirPi + X^r' X2rP2 + X^/êl, Lemma 3.9 (i) and (iii) imply (^x.;'x.*.)--(ix.vx.";)- = op(i), Tl Tt i = 1,2 and -x:'Yn - -x;'y„ =èxi'x:,. - ixr;xrp)/3? + C-x;;x;^ - lx;;x;Xpl + hx;',; - x;'e;) Tt Tt Tt 71 Tt =Op(l). This implies ^X;'Yn = Op(l) since ^X;'y„ = Op(l). Thus, K-K = {x:'x:rx:'Yn - (x;'x;)-x;'y„ =[(ix;'x;)- - (ix;'x;)-]ix;'r„ + (ix;'x;)-[ix;'r„ - ix;'y„] 71 7i Ti 7Z 71 71 =Op(l)Op(l) + Op(l)op(l) = Op(l). Tl Tl =hxxAl - P\)+xiXPr - P\) + Tl =(|,-^?)'(ix,Vxrj(|,-^?) +(^,-/3°)'(ix;/x;,)(,i-^2°) +ie;'e; + \e*'\xiXPr - Pi) + xuhr - m-Tl Tl By (ii) and Lemma 3.9 (iii), 'Pr = Pp + Op{l) and ^e'^'Xf^ = Op(l), i = 1,2. Thus, ={h - ^m^xi'xiM, - /3?) + (P, - »°.)'èx;;x;M, - 0°) + U',; + 0,(1). Tl Tl Tl Similarly, =W, - /3f)'(ix,-/x,;)(ft - ^«) + 0, - p°,y{^x;;x;,)0, - (fi,) + ^.-j,; + 0,(1). ft Tl ll Hence, by Lemma 3.9 (i) and (ii), ^SniRj) - ^SniR,) Th TL =CPP - mlx'jx;, - \XI'X',XPP - pi) Tl Tl HPp - p'2)'[^x;/xi - lx;;x;^m - P') + - ^e;'e;] + 0^(1) Tl Tl Tl Tl = Op(l). (iv) By (i) and (iii), n(Rj) n{Rj) n n{Rj) n n{R,) Lemma 3.10 sets down the fundation for Theorem 3.7 and will be used repetedly in its proof. Proof of Theorem 3.7 Let d ^ dP. Suppose a hnear model is fitted on _ff^ = {xi : xu, € with the mean squared error à'j{d) = Sn{RJ)/n{R'j). Under the assumed conditions, Lemma 3.3'and Lemma 3.10 (i) imply -;^Sn{RJ)- ^^^[Sn{RJr\RVl + Sn{RJ^Rl)] ^ Cj for some Cj > 0. Proposition 3.1' (i) and Lemma 3.2' imply the second term on the LHS, 1 —[SniRjnR°,) + SniR^nR'2)] = ;^E'n(Rl ^R°yn(Rj n R'i) + Op{ln' n)] = ^/niRl)èniR^) + Op(ln'n/n), which converges to (TQ by the strong law of large numbers. Thus, P(àj{d) > (TQ + Cj/2) 1 as n oo. Since this holds for every by Lemma 3.10 (iv) > E (^0+Cfc/2)1(H^^.^^A^)+Op(l) >al + C + Op{l) for some C > 0. 
By Lemma (3.10) (i) n 2(^+1) rtd 2(L+1) ^ Thus, 1 - 1 ^"^^ 1 2 = 2^o + y+ «p(l)-If = <f°, there are at least Z + 1 E^'s, say, , i = 1, • • •, Z + 1, which are entirely embedded in one of the P^'s. By Proposition 3.1 and Lemma 3.2, 1 ^ ^^[4(4.)fn(4)-r„(4)] [-6'„(4)ê„(E,^.) + OpOn^ n/n)], i=l,...,i+l. By Lemma 3.10 (i) and the strong law of large numbers, the RHS is al + Op(ln" n/n). This and Lemma 3.10 (i), (iv) imply. 1 1 ^+1 " i=i L+1 , L+1 = E(^(7q:Tj + «p(i)K^o + ''p(i)) = ^^0+Op(l)-So, with probabihty approaching 1, 5^° < ioi d ^ (f. ^ Remark The number 2{L + 1) in Theorem 3.7 is not necessary. Actually, all we need is a number larger than (i + l). SoX-h2 will do. And with probabihty approaching 1, 5„(Ê^°^), the smallest of the {Sn{Rf)} will be one of those obtained from the data entirely contained in one regime. Hence, if we let = SniRf-j^^), with probability approaching 1, < for di^dP. However, by changing Z, + 2 and Sn{Rfiy) to 2(X + 1) and SniRf^j) respectively, we expect that the chance of < for any d ^ dP will be reduced for small sample size. In fact, this was shown by a simulation study we performed but have not included in this thesis for the sake of brevity. The rate of correct identification is significantly higher when ^f^^ ^niR^j-^) is used. If the number of regimes is chosen to be too large, then the number of observations in each regime will be small and the variance of 5^ will increase. Hence, it will undermine our selection of d. Through our simulation, we found that 2(X + 1) is a reasonable choice. In addition, with small sample size, one of R^^-^ n R'- (z = 1,2) may have very few observations for some d ^ cJ". In such a case SniÈfi^^) is hkely to be smaller than SniAfl^^) by chance. Using "^^=1 '^n{Rfj)) may average out this effect. 3.3 A simulation study In this section, simulations of model (3.1) are carried out to examine the performance of the proposed procedure under various conditions. 
Constrained by our computing power, we study only moderate sample sizes under the segmented regression setup with two to three dependence structures, that is, $l^0 = 1$ and 2, respectively. Let $\{\epsilon_t\}$ be iid with mean 0 and variance $\sigma_0^2$ and $z_t = (x_{t1}, \dots, x_{tp})'$ so that $x_t' = (1, z_t')$, where $\{x_{tj}\}$ are iid $N(0, 4)$. Let $DE(0, \lambda)$ denote the double exponential distribution with mean 0 and variance $2\lambda^2$. For $d = 1$ and $\tau_1^0 = 1$, the following 5 sets of specifications of the model are used for reasons given below:
(a) $p = 2$, $\beta_1 = (0, 1, 1)'$, $\beta_2 = (1.5, 0, 1)'$, $\epsilon_t \sim N(0, 1)$;
(b) $p = 2$, $\beta_1 = (0, 1, 1)'$, $\beta_2 = (1.5, 0, 1)'$, $\epsilon_t \sim DE(0, 1/\sqrt{2})$;
(c) $p = 2$, $\beta_1 = (0, 1, 0)'$, $\beta_2 = (1, 1, 0.5)'$, $\epsilon_t \sim DE(0, 1/\sqrt{2})$;
(d) $p = 3$, $\beta_1 = (0, 1, 0, 1)'$, $\beta_2 = (1, 0, 0.5, 1)'$, $\epsilon_t \sim DE(0, 1/\sqrt{2})$;
(e) $p = 3$, $\beta_1 = (0, 1, 1, 1)'$, $\beta_2 = (1, 0, 1, 1)'$, $\epsilon_t \sim DE(0, 1/\sqrt{2})$.

From the theory in Section 3.1 we know that the least squares estimate, $\hat\tau_1$, is appropriate if the model is discontinuous at $\tau_1^0$. To explore the behavior of $\hat\tau_1$ for moderate sized samples, Models (a)-(d) are chosen to be discontinuous. The noise term in Model (a) is chosen to be normal as a reference, normal noise being widely used in practice. However, our emphasis is on more general noise distributions. Because the double exponential distribution is commonly used in regression modeling and it has heavier tails than the normal distribution, it is used as the distribution of the noise in all other models. The deterministic part of Model (b) is chosen to be the same as that of Model (a) to make them comparable. Note that Models (a) and (b) have a jump of size 0.5 at $x_1 = \tau_1$ while $Var(\epsilon_1) = 1$, which is twice the jump size. Except for the parameter $\tau_1$, our model selection method and estimation procedures work for both continuous and discontinuous models. Model (e) is chosen to be a continuous model to demonstrate the behavior of the estimates for this type of model. In all, 100 replications are simulated with different sample sizes, 30, 50, 100 and 200.
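The data-generating step of these experiments can be sketched as follows. This is an illustrative reconstruction, not the program used for the reported results: the function name `simulate_two_regime` and its arguments are hypothetical, and regime 1 is assumed to apply when $x_{td} \le \tau_1^0$. NumPy's Laplace scale parameter plays the role of $\lambda$ in $DE(0, \lambda)$, so a scale of $1/\sqrt{2}$ gives unit variance.

```python
import numpy as np

def simulate_two_regime(n, beta1, beta2, tau=1.0, d=1, noise="normal", rng=None):
    """Draw n observations from a two-regime segmented regression (l0 = 1).

    Covariates x_tj are iid N(0, 4) and x_t = (1, z_t')'.
    Regime 1 is assumed to apply when x_td <= tau, regime 2 otherwise.
    Returns the design matrix, the responses and the regime-1 indicator.
    """
    rng = np.random.default_rng() if rng is None else rng
    beta1 = np.asarray(beta1, dtype=float)
    beta2 = np.asarray(beta2, dtype=float)
    p = beta1.size - 1
    z = rng.normal(0.0, 2.0, size=(n, p))          # standard deviation 2, variance 4
    x = np.column_stack([np.ones(n), z])           # column 0 is the intercept
    if noise == "normal":
        eps = rng.normal(0.0, 1.0, size=n)         # N(0, 1)
    elif noise == "de":
        eps = rng.laplace(0.0, 1.0 / np.sqrt(2.0), size=n)  # DE(0, 1/sqrt(2)), variance 1
    else:
        raise ValueError("noise must be 'normal' or 'de'")
    in_regime1 = x[:, d] <= tau                    # segmentation on covariate d
    mean = np.where(in_regime1, x @ beta1, x @ beta2)
    return x, mean + eps, in_regime1

# Model (b): p = 2, double exponential noise, n = 200.
x, y, regime1 = simulate_two_regime(200, (0, 1, 1), (1.5, 0, 1), noise="de",
                                    rng=np.random.default_rng(0))
```

Models (c)-(e) are obtained by changing the two coefficient vectors; the segmentation variable and threshold stay at `d=1` and `tau=1.0`.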
Although in some experiments, $L = 3$ was tried, the number of under- and over-estimated $l^0$ are the same as those obtained by setting $L = 2$. The number of cases where $\hat l = 3$ is only 1 or 2, out of 100 replications. This agrees with our intuition that, given a two-piece model, if a two-piece model is selected over a three-piece one, it is unlikely that a four-piece model will be selected over a two-piece one. Based on this experience, the results reported in Tables 3.1 and 3.2 are obtained by setting $L = 2$ to save some computational effort. The two constants $\delta_0$ and $c_0$ in MIC are chosen as 0.1 and 0.299 respectively, as explained in Section 3.1. The results are summarized in Tables 3.1 and 3.2.

Table 3.1 contains the estimates of $l^0$, $\tau_1^0$ and the standard error of the estimate of $\tau_1^0$, $\hat\tau_1$, based on the MIC. A number of observations may be made about the results in the table.

(i) For sample sizes greater than 30, the MIC correctly identifies $l^0$ in most of the cases. Hence, for estimating $l^0$, the result seems satisfactory. Comparing Models (a) and (b), it seems that the distribution of the noise has a significant influence on the estimation of $l^0$ for sample sizes of 50 or less.

(ii) For smaller sample sizes, the bias of $\hat\tau_1$ is related to the shape of the underlying model. It is seen that the biases are positive for Models (a) and (b), and negative for the others. In an experiment where Models (a) and (b) are changed so that the jump size at $x_1 = \tau_1$ is -0.5, instead of 0.5, negative biases are observed for every sample size. These biases decrease as the sample size becomes larger.

(iii) The standard error of $\hat\tau_1$ is relatively large in all the cases considered and, as expected, it decreases as the sample size increases. This suggests that a large sample size is needed for a reliable estimate of $\tau_1^0$. An experiment with sample size of 400 for a model similar to Model (e) is reported in Section 4.3. In that experiment the standard error of $\hat\tau_1$ is significantly reduced.
(iv) The choice of $\delta_0 = 0.1$ seems adequate for most of the models we experimented with, since it does not generate a pattern such as always overestimating $l$ for $n = 30$ and underestimating $l$ for $n = 50$, or vice versa. By the continuity of Model (e), its identification is expected to be the most difficult of all the cases considered. The $c_0$ chosen above seems too big for this case, since the tendency toward underestimating $l$ is obvious when the sample size is small. However, a more plausible explanation for this is that with the small sample size and the noise level, there is simply not enough information to reveal the underlying model. Therefore, choosing a lower dimensional model with positive probability may be appropriate by the principle of parsimony. In summary, since the optimal selection of the penalty is model dependent for samples of moderate size, no optimal pair of $(c_0, \delta_0)$ can be recommended. On the other hand, our choice of $\delta_0$ and $c_0$ shows a reasonable performance for the models we experimented with.

Table 3.2 shows the estimated values of the other parameters for the models in Table 3.1 for a sample size of 200. The results indicate that, in general, the estimated $\beta_j$'s and $\sigma_0^2$ are quite close to their true values even when $\hat\tau_1$ is inaccurate. So, for the purpose of estimating the $\beta_j$'s and $\sigma_0^2$, and for interpolation when the model is continuous, a moderate sized sample, say of size 200, may be sufficient. When the model is discontinuous, interpolation near the threshold may not be accurate due to the inaccurate $\hat\tau_1$. A careful comparison of the estimates obtained from Models (a) and (b) shows that the estimation errors are generally smaller with normally distributed errors. The estimates of $\beta_{20}$ have relatively larger standard errors.
This is due to the fact that a small error in $\hat\beta_{21}$ would result in a relatively large error in $\hat\beta_{20}$.

To assess the performance of the MIC when $l^0 = 2$, and to compare it with the Schwarz Criterion (SC) as well as a criterion proposed by Yao (1989), simulations were done for a much simpler model with sample sizes up to $n = 450$. Here we adopt Yao's (1989) setup where a univariate piecewise constant model is to be estimated. Note that such a model is a special case of Model (3.1). Specifically, Yao's model is

$$y_t = \beta_{j0}^0 + \epsilon_t, \quad \text{if } x_t \in (\tau_{j-1}^0, \tau_j^0], \quad j = 1, \dots, l^0 + 1,$$

where $x_t$ is set to be $t/n$ for $t = 1, \dots, n$, and $\epsilon_t$ is iid with mean zero and finite $2m$th moment for some positive integer $m$. Yao shows that with $m \ge 3$, the minimizer of $\log\hat\sigma_l^2 + l \cdot C_n/n$ is a consistent estimate of $l^0$ for $l \le L$, the known upper bound of $l^0$, where $\{C_n\}$ is any sequence satisfying $C_nn^{-2/m} \to \infty$ and $C_n/n \to 0$ as $n \to \infty$. Four sets of specifications of this model are experimented with:
(f) $\tau_1^0 = 1/3$, $\tau_2^0 = 2/3$, $\beta_{10}^0 = 0$, $\beta_{20}^0 = 2$, $\beta_{30}^0 = 4$, $\epsilon_t \sim DE(0, 1/\sqrt{2})$;
(g) $\tau_1^0 = 1/3$, $\tau_2^0 = 2/3$, $\beta_{10}^0 = 0$, $\beta_{20}^0 = 2$, $\beta_{30}^0 = 4$, $\epsilon_t \sim t_7/\sqrt{1.4}$;
(h) $\tau_1^0 = 1/3$, $\tau_2^0 = 2/3$, $\beta_{10}^0 = 0$, $\beta_{20}^0 = 1$, $\beta_{30}^0 = -1$, $\epsilon_t \sim DE(0, 1/\sqrt{2})$; and
(i) $\tau_1^0 = 1/3$, $\tau_2^0 = 2/3$, $\beta_{10}^0 = 0$, $\beta_{20}^0 = 1$, $\beta_{30}^0 = -1$, $\epsilon_t \sim t_7/\sqrt{1.4}$,
where $t_7$ refers to the Student-t distribution with 7 degrees of freedom. In each of these cases the variances of $\epsilon_t$ are scaled to 1 so the noise levels are comparable. Note that for $\epsilon_t \sim t_7/\sqrt{1.4}$, $E(\epsilon_t^6) < \infty$ and $E|\epsilon_t|^7 = \infty$. It barely satisfies Yao's (1989) condition with $m = 3$ and does not satisfy our exponential boundedness condition.

In Yao's (1989) paper, $\{C_n\}$ is not specified, so we have to choose a $\{C_n\}$ satisfying the conditions. The simplest $\{C_n\}$ is $c_1n^\alpha$. With $m = 3$, we need $n^{\alpha - 2/3} \to \infty$, implying $\alpha > 2/3$. (We shall call the criterion with such a $C_n$ YC hereafter.) To reduce the potential risk of underestimating $l^0$, we round 2/3 up to 0.7 as our choice of $\alpha$. The $\delta_0$ and $c_0$ in MIC are chosen as 0.1 and 0.299 respectively, for the reasons previously mentioned.
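The calibration behind these penalty constants can be checked numerically. The sketch below assumes that the MIC penalty is proportional to $(\ln n)^{2+\delta_0}$ (an assumption here, since the definition is given in Section 3.1 and not restated), that the YC penalty is $c_1n^\alpha$ as above, and that both constants are fixed by matching $\log n_0$ at $n_0 = 20$. The helper name `calibrate` is hypothetical.

```python
import math

def calibrate(n0, penalty_shape):
    """Return the constant c solving log(n0) = c * penalty_shape(n0)."""
    return math.log(n0) / penalty_shape(n0)

delta0, alpha, n0 = 0.1, 0.7, 20

# MIC-type penalty, assumed to have the form c0 * (ln n)^(2 + delta0).
c0 = calibrate(n0, lambda n: math.log(n) ** (2 + delta0))

# YC penalty C_n = c1 * n^alpha.
c1 = calibrate(n0, lambda n: n ** alpha)

print(round(c0, 3), round(c1, 3))  # prints: 0.299 0.368
```

Under these assumptions the two criteria impose the same penalty as $\log n$ at $n = 20$ and diverge from it as $n$ grows.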
$c_1$ is chosen by the same method as we used to choose $c_0$, that is, forcing $\log n_0 = c_1n_0^\alpha$ and solving for $c_1$. With $n_0 = 20$ and $\alpha = 0.7$, we get $c_1 = 0.368$. The results for model selection are reported in Tables 3.3-3.4.

Table 3.3 tabulates the empirical distributions of the estimated $l^0$ for different sample sizes. From the table, it is seen that for most cases, MIC and YC perform significantly better than SC. With a sample size of 450, MIC and YC correctly identify $l^0$ in more than 90% of the cases. For Models (f) and (g), which are more easily identified, YC makes more correct identifications than MIC. But for Models (h) and (i), which are harder to identify, MIC makes more correct identifications. From Theorem 3.1 and the remark after its proof, it is known that both MIC and YC are consistent for the models with double exponential noise. This theory seems to be confirmed by our simulation. The effect on model selection of varying the noise distribution does not seem significant. This may be due to the scaling of the noises by their variances, since variance is more sensitive to tail probabilities compared to quantiles or mean absolute deviation. Because most people are familiar with the use of variance as an index of dispersion, we adopt it, although other measures may reveal the tail effect on model identification better for our moderate sample sizes.

Table 3.4 shows the estimated thresholds and their standard deviations for Models (f), (g), (h), (i), conditional on $\hat l = l^0$. Overall, they are quite accurate, even when the sample size is 50. For Models (h) and (i), the accuracy of $\hat\tau_2$ is much better than that of $\hat\tau_1$, since $\tau_2$ is much easier to identify by the model specification. In general, for models which are more difficult to identify, a larger sample size is needed to achieve the same accuracy.

Finally, the small sample performance of the two methods given in Section 2.2 for the identification of the segmentation variable is examined.
The experiment is carried out for Models (b), (d) and (e). Among Models (a)-(e), Models (b) and (e) seem to be the most difficult in terms of identifying $l^0$, and are also expected to be difficult for identifying $d$. Note that for all the models considered, $d$ is asymptotically identifiable w.r.t. any $L > 1$ by Corollary 2.2. For $L = 2$, 100 replications are simulated with sample sizes of 50, 100 and 200. With sample sizes of 100 and 200, both methods identify $d^0$ correctly in every case. With a sample size of 50, the correct identification rate of Method 1 is 100% for Models (b) and (d), and 96% for Model (e); for Method 2 the rates are 98%, 94% and 88% for Models (b), (d) and (e), respectively. From these results, we observe that for sample sizes of 100 or more, the two methods perform very well, and that for a sample size of 50, Method 1 performs better than Method 2. This suggests that if the sample size is small, Method 1 may be more reliable. Otherwise, Method 2 gives a good estimate with a high computational efficiency.

3.4 General remarks

In this chapter, we proved the consistency of the estimators given in Chapter 2. In addition, when the model is discontinuous at the thresholds, we proved that the estimated thresholds converge rapidly to their true values at the rate of $\ln^2 n/n$. Consequently, the estimated regression coefficients and the estimated variance of the noise are shown to have the same asymptotic distributions as in the case where the thresholds are known, under the specified conditions.

We put emphasis on the case where the model is discontinuous for the following two reasons. First, if the model is continuous at the thresholds, then we have for any $z \in R^p$ and $x' = (1, z')$, $x'\beta_j^0 = x'\beta_{j+1}^0$ if $x_d = \tau_j^0$, $j = 1, \dots, l^0$. This implies for all $j$,

$$\sum_{i \ne 0, d}(\beta_{(j+1)i}^0 - \beta_{ji}^0)x_i = \beta_{j0}^0 - \beta_{(j+1)0}^0 + (\beta_{jd}^0 - \beta_{(j+1)d}^0)\tau_j^0.$$

Since this holds for any $x$ such that $x_d = \tau_j^0$, we can conclude that $\beta_{(j+1)i}^0 = \beta_{ji}^0$ for $i \ne 0, d$ and all $j$.
By aggregating the data over $x_d$, we obtain an ordinary linear regression problem and, hence, $\beta_{ji}^0$ ($i \ne 0, d$, $j = 1, \dots, l^0 + 1$) can be estimated by least squares estimates with all the properties given by the classical theory. The residuals can then be used to fit a one-dimensional continuous piecewise linear model to estimate $\beta_{ji}^0$ ($i = 0, d$, $j = 1, \dots, l^0 + 1$). For this one-dimensional continuous problem, Feder (1975a) shows that the restricted (by continuity) least squares estimates of the thresholds and the regression coefficients are asymptotically normally distributed when the covariates are viewed as nonrandom. So the problem is essentially solved except for a few technical points. In the Appendix of this chapter, we shall use Feder's idea to show that for a multidimensional continuous model with random covariates, the unrestricted least squares estimates possess similar properties. That is, the $\{\hat\beta_j\}$ are asymptotically normally distributed, and so are the threshold estimates given by the $\{\hat\beta_j\}$ instead of least squares. Second, noting that continuity requires $\beta_{(j+1)i}^0 = \beta_{ji}^0$ for $i \ne 0, d$ and all $j$, it would seem that a response surface over a multidimensional space will rarely be well approximated by such a continuous piecewise model.

Problems where the models are either continuous at all thresholds or discontinuous at all thresholds have now been solved. The next question is what to do if the model is continuous at some thresholds and discontinuous at others. This problem can be treated as follows. First, decide if the model is continuous at each threshold. This can be done by comparing $\hat\tau_j$, the least squares estimate of $\tau_j^0$, with $\tilde\tau_j$, the solution of $\hat\beta_{j0} - \hat\beta_{(j+1)0} = (\hat\beta_{(j+1)d} - \hat\beta_{jd})\tau_j$. By the established convergence of the $\hat\beta_j$'s and the $\hat\tau_j$'s, if the model were discontinuous at $\tau_j^0$, then $\hat\tau_j$ would converge to $\tau_j^0$. Meanwhile, $\hat\beta_{ji}$ and $\hat\beta_{(j+1)i}$ would converge to different values for some $i \ne 0, d$, or $\tilde\tau_j$ would converge to some point different from $\tau_j^0$, or both.
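The two quantities entering this comparison can be sketched as follows. The functions `implied_threshold` and `other_coefficient_gap` are hypothetical names introduced here, and for illustration they are evaluated at the true coefficients of the simulation models of Section 3.3 (with $d = 1$ and $\tau_1^0 = 1$) rather than at estimates.

```python
import numpy as np

def implied_threshold(beta_j, beta_next, d):
    """Solve beta_j[0] - beta_next[0] = (beta_next[d] - beta_j[d]) * tau for tau."""
    beta_j = np.asarray(beta_j, dtype=float)
    beta_next = np.asarray(beta_next, dtype=float)
    slope_gap = beta_next[d] - beta_j[d]
    if slope_gap == 0.0:
        raise ValueError("coefficients of x_d coincide; no threshold is implied")
    return (beta_j[0] - beta_next[0]) / slope_gap

def other_coefficient_gap(beta_j, beta_next, d):
    """Largest |beta_next[i] - beta_j[i]| over i not in {0, d}."""
    beta_j = np.asarray(beta_j, dtype=float)
    beta_next = np.asarray(beta_next, dtype=float)
    idx = [i for i in range(beta_j.size) if i not in (0, d)]
    return max(abs(beta_next[i] - beta_j[i]) for i in idx) if idx else 0.0

# Model (e), continuous: implied threshold equals 1 and the other coefficients agree.
tau_e = implied_threshold((0, 1, 1, 1), (1, 0, 1, 1), d=1)
gap_e = other_coefficient_gap((0, 1, 1, 1), (1, 0, 1, 1), d=1)

# Model (a), discontinuous: implied threshold differs from the true threshold 1.
tau_a = implied_threshold((0, 1, 1), (1.5, 0, 1), d=1)

# Model (d), discontinuous: implied threshold is 1, but a coefficient other than 0, d differs.
tau_d = implied_threshold((0, 1, 0, 1), (1, 0, 0.5, 1), d=1)
gap_d = other_coefficient_gap((0, 1, 0, 1), (1, 0, 0.5, 1), d=1)
```

Models (a) and (d) illustrate the two ways in which discontinuity can show up: through the implied threshold or through the remaining coefficients. In practice the same functions would be applied to the least squares estimates, and what counts as a large difference is left open here.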
Thus, a large difference between $\hat\tau_j$ and $\tilde\tau_j$ or between $\hat\beta_{ji}$ and $\hat\beta_{(j+1)i}$ for some $i \ne 0, d$ would indicate discontinuity. Then, by noting that Theorem 3.4 does not assume the model is discontinuous at all $\tau_j^0$'s, we see that $\hat\tau_j - \tau_j^0 = O_p(\ln^2 n/n)$ for all $\tau_j^0$'s which are thresholds of model discontinuity. By the proof of Theorem 3.5, it is seen that these $\hat\tau_j$'s can replace the corresponding $\tau_j^0$'s without changing the asymptotic distributions of the other parameters. So, between each successive pair of thresholds at which the model is discontinuous, the asymptotic results for a continuous model can be applied. In summary, regardless of whether the model is continuous or not, we can always obtain estimates of the $\tau_j^0$'s which converge to their true values no slower than $O_p(1/\sqrt{n})$, and the estimated regression coefficients always have asymptotically normal distributions.

Note that most results given in this chapter do not require that $x_1$ have a joint density which is everywhere positive over its domain. Hence, one component of $x_1$ could be a function of other components, as long as they are not collinear. In particular, $x_1$ could be a basis of $p$th order polynomials.

Since our estimation procedure is computationally intensive, one may worry about its computational feasibility. However, we do not think this is a serious problem, especially with the ever growing speed of modern computers. The simulations reported in the last section are done with a Sparc 2 workstation. Even with our inefficient program, which repeatedly inverts $(p+1) \times (p+1)$ matrices, 100 runs for Model (a) consume only about 9 minutes of CPU time with a sample size of $n = 50$ and only about 35 minutes with $n = 100$. Hence, each run would consume approximately 0.35 minutes of CPU time if $n = 100$. A more efficient program is under development; it uses an iterative method to avoid matrix inversion.
A preliminary test shows that, with the same problems mentioned above, the CPU time consumed by this program is about 15 and 40 seconds for $n = 50$ and 100, respectively. Hence, each run would only take a few seconds of CPU time. Unfortunately, further modifications are needed for the new program to counter the problem of error evolution for large sample size. Nevertheless, even with our inefficient program, we believe our procedure is computationally feasible if $L$ is small and $n$ is not too large (say, $L \le 5$, $n \le 1000$). And with a better program and a faster computer, the computation time could be substantially reduced, making much more complicated model fitting computationally feasible.

Finally, as we mentioned in Section 3.1, the choice of $\delta_0$ and $c_0$ in MIC needs further study.

3.5 Appendix: A discussion of the continuous model

In Section 3.1, we established the asymptotic normality of coefficient estimators for Model (3.1) when it is discontinuous at the thresholds. In this section, we shall establish the corresponding result for Model (3.1) when it is everywhere continuous. If Assumptions 3.0-3.1 are assumed, then by Theorem 3.1 attention can be restricted to $\{\hat l = l^0\}$. First, we shall show that the $\hat\beta_j$'s converge at a rate no slower than $O_p(n^{-1/2}\ln n)$ by a method similar to that of Feder (1975a).

Now let
$\theta = (\beta_1', \dots, \beta_{l^0+1}')'$;
$\theta^0 = (\beta_1^{0\prime}, \dots, \beta_{l^0+1}^{0\prime})'$;
$\xi = (\theta', \tau_1, \dots, \tau_{l^0})'$;
$\xi^0 = (\theta^{0\prime}, \tau_1^0, \dots, \tau_{l^0}^0)'$;
$\Xi = \{\xi : \beta_j \ne \beta_{j+1}, j = 1, \dots, l^0, -\infty < \tau_1 < \cdots < \tau_{l^0} < \infty\}$;
$\mu(\xi; x) = x'[\sum_{j=1}^{l^0+1} 1_{(x_d \in (\tau_{j-1}, \tau_j])}\beta_j]$; and
$\mu(\xi; X_k) = (\mu(\xi; x_1), \dots, \mu(\xi; x_k))'$, where $X_k = (x_1, \dots, x_k)'$.

Assuming no measurement errors, Feder (1975a) seeks the values at which the response must be observed to uniquely determine the model over the domain of the covariate. To find these values, he introduces a concept of identifiability. We adapt his concept to our problem.
Definition For any $\xi^* = (\theta^{*\prime}, \tau_1^*, \dots, \tau_{l^0}^*)' \in \Xi$, the parameter $\theta = (\beta_1', \dots, \beta_{l^0+1}')'$ is identified at $\mu^* = \mu(\xi^*; X_k)$ by $X_k$ if the equation $\mu(\xi; X_k) = \mu^*$ uniquely determines $\theta = \theta^*$.

Next we prove a lemma adapted from Feder (1975a). The proof follows that of Feder (1975a).

Lemma A3.1 If $\theta$ is identified at $\mu^0 = \mu(\xi^0; X_k)$ by $X_k = (x_1, \dots, x_k)$, then there exist neighborhoods, $M$ of $\mu(\xi^0; X_k)$ and $T$ of $X_k$, such that
(a) for all ($k$-dimensional) vectors $\tilde\mu = (\tilde\mu_1, \dots, \tilde\mu_k)' \in M$ and $(p+1) \times k$ matrices $X_k^* \in T$ such that $\tilde\mu$ can be represented as $\tilde\mu = \mu(\xi; X_k^*)$ for some $\xi \in \Xi$, $\theta$ is identified at $\tilde\mu$ by $X_k^*$; and
(b) the induced transformation $\theta = \theta(\tilde\mu; X_k^*)$ satisfies the Lipschitz condition $\|\theta_1 - \theta_2\| \le C\|\tilde\mu_1 - \tilde\mu_2\|$ for some constant $C > 0$, whenever $X_k^* \in T$ and $\tilde\mu_1 = \mu(\xi_1; X_k^*)$, $\tilde\mu_2 = \mu(\xi_2; X_k^*) \in M$.

Proof: Since $\theta$ is identified at $\mu^0$ by $X_k$, it follows that for any possible choice of parameters $\tau_1, \dots, \tau_{l^0}$ consistent with $\theta^0$, for each $j$ there must exist $p + 1$ components of $X_k$, $x_{j_1}, \dots, x_{j_{p+1}}$, such that $x_{j_id} \in (\tau_{j-1}, \tau_j] \cap (\tau_{j-1}^0, \tau_j^0]$, $i = 1, \dots, p+1$, and the matrix $(x_{j_1}, \dots, x_{j_{p+1}})$ is nonsingular. By continuity, the $x_{j_i}$'s may be perturbed slightly without disturbing the nonsingularity of $(x_{j_1}, \dots, x_{j_{p+1}})$. Assertions (a) and (b) follow directly from the properties of nonsingular linear transformations. (Recall that if $\mu = X\theta$ for a nonsingular $X$, then $\theta = X^{-1}\mu$, and hence $\|\theta\|^2 \le \mathrm{tr}(X^{-1\prime}X^{-1})\|\mu\|^2$.) ¶

Remark It is clear from the proof that for a continuous model, it is necessary and sufficient for identifying $\theta^0$ that within each $\tau^0$-partition there are $p + 1$ observations $(x_{j_1}, \dots, x_{j_{p+1}})$ such that the matrix $X = (x_{j_1}, \dots, x_{j_{p+1}})$ is of full rank. In particular, if $z$ has a positive density over a neighborhood of $\tau_j^0$ for each $j$, then with large $n$, an $X_k$ exists such that $\theta$ is identified at $\mu(\xi^0; X_k)$ by $X_k$.

Another concept introduced by Feder (1975a) is called the center of observations. This concept is modified in the next definition to fit our multivariate setup.
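Definition Let $z = (x_1, \dots, x_p)'$. $z^0 = (x_1^0, \dots, x_p^0)'$ is a center of observations if for any $\delta > 0$, both $P(\{z : \|z - z^0\| < \delta, x_d < x_d^0\})$ and $P(\{z : \|z - z^0\| < \delta, x_d > x_d^0\})$ are positive.

Remark For any $\alpha < \eta$, if constant vectors $z_1, \dots, z_{p+1}$ are centers of observations such that $x_{td} \in (\alpha, \eta)$, $t = 1, \dots, p+1$, and the matrix $X_{p+1} = (x_1, \dots, x_{p+1})$ is of full rank, where $x_t = (1, z_t')'$, then by Lemma A3.1 there exists a neighborhood, $T$, of $X_{p+1}$ such that $T \subset \{x : \alpha < x_{td} < \eta\}$, $P(T) > 0$ and $X_{p+1}^*$ is of full rank if $X_{p+1}^* \in T$. Hence, for any $a \ne 0$ and random vector $x$, $E[(a'x)^21_{(x_d \in (\alpha, \eta))}] \ge E[(a'x)^21_{(x \in T)}] > 0$, implying that $E[xx'1_{(x_d \in (\alpha, \eta))}]$ is positive definite. Therefore, a sufficient condition for Assumption 3.1 to hold is that for some $\delta \in (0, \min_{1 \le j < l^0}(\tau_{j+1}^0 - \tau_j^0)/2)$, within each of $\{x : x_d \in (\tau_j^0 - \delta, \tau_j^0)\}$ and $\{x : x_d \in (\tau_j^0, \tau_j^0 + \delta)\}$ there are $p + 1$ centers of observations forming a full rank matrix for every $j$. In particular, ordinal categorical covariates are allowed in this assumption.

Lemma A3.2 (Feder, 1975a) Let $V$ be an inner product space and $X$, $Y$ subspaces of $V$. Suppose $x \in X$, $y \in Y$, and $x^*$, $y^*$ are the orthogonal projections of $x + y$ onto $X$, $Y$ respectively. If there exists an $\alpha < 1$ such that $x \in X$, $y \in Y$ implies $|x'y| \le \alpha\|x\|\|y\|$, then $\|x + y\| \le (\|x^*\| + \|y^*\|)/(1 - \alpha)$.

Lemma A3.3 For any real $\tau_1 < \tau_1^0$, let $\mathcal{F}$ be the random linear space spanned by the $2(p+1)$ column vectors of $(X_n(-\infty, \tau_1), X_n(\tau_1, \infty))$, and let $\zeta = X_n(\tau_1, \tau_1^0)\Delta\beta^0$, where $\Delta\beta^0 = \beta_1^0 - \beta_2^0$. Then under Assumptions 3.0-3.1, there exists $\alpha < 1$ such that for sufficiently large $n$, $|\zeta'g| \le \alpha\|\zeta\|\|g\|$ uniformly in $\tau_1 < \tau_1^0$ and $g \in \mathcal{F}$ with probability approaching 1.

Proof: It suffices to show that with large probability, for all $\tau_1 < \tau_1^0$ and $g \in \mathcal{F}$, $(\zeta'g)^2 \le \alpha^2\|\zeta\|^2\|g\|^2$. Define $X_1 = X_n(-\infty, \tau_1)$, $X_2 = X_n(\tau_1, \infty)$, $X_1^* = X_n(-\infty, \tau_1^0)$, $X_2^* = X_n(\tau_1^0, \infty)$. For any $g \in \mathcal{F}$, there exist $\beta_1, \beta_2 \in R^{p+1}$ such that $g = X_1\beta_1 + X_2\beta_2$. Noting that $\|X_n(\tau_1, \tau_1^0)\beta_2\|^2 \le \|X_2\beta_2\|^2$, we have

$$\frac{\|X_n(\tau_1, \tau_1^0)\beta_2\|^2}{\|g\|^2} = \frac{\|X_n(\tau_1, \tau_1^0)\beta_2\|^2}{\|X_1\beta_1\|^2 + \|X_2\beta_2\|^2} \le \frac{\|X_n(\tau_1, \tau_1^0)\beta_2\|^2}{\|X_2\beta_2\|^2} = \frac{\|X_n(\tau_1, \tau_1^0)\beta_2\|^2}{\|X_n(\tau_1, \tau_1^0)\beta_2\|^2 + \|X_2^*\beta_2\|^2} \le \frac{\|X_1^*\beta_2\|^2}{\|X_2^*\beta_2\|^2 + \|X_1^*\beta_2\|^2} = \frac{\|X_1^*\beta_2\|^2}{\|X_n\beta_2\|^2}. \qquad (A3.1)$$

Suppose $A$, $B$ are positive definite matrices and $\lambda(M)$ denotes the largest eigenvalue of any symmetric matrix $M$. Then for any $\beta \ne 0$,

$$\frac{\beta'A\beta}{\beta'(A + B)\beta} = \frac{(B^{1/2}\beta)'(B^{-1/2}AB^{-1/2})(B^{1/2}\beta)}{(B^{1/2}\beta)'(B^{-1/2}AB^{-1/2})(B^{1/2}\beta) + (B^{1/2}\beta)'(B^{1/2}\beta)} \le \frac{\lambda(B^{-1/2}AB^{-1/2})}{\lambda(B^{-1/2}AB^{-1/2}) + 1}. \qquad (A3.2)$$

This result can be applied to the RHS of (A3.1) since $X_n'X_n = X_1^{*\prime}X_1^* + X_2^{*\prime}X_2^*$ and, with probability approaching 1, $X_1^{*\prime}X_1^*$ and $X_2^{*\prime}X_2^*$ are positive definite. Thus,

$$\frac{\|X_1^*\beta_2\|^2}{\|X_n\beta_2\|^2} = \frac{\beta_2'(\frac{1}{n}X_1^{*\prime}X_1^*)\beta_2}{\beta_2'(\frac{1}{n}X_1^{*\prime}X_1^* + \frac{1}{n}X_2^{*\prime}X_2^*)\beta_2} \le \frac{\lambda_1}{\lambda_1 + 1},$$

where

$$\lambda_1 = \lambda\Big(\big(\tfrac{1}{n}X_2^{*\prime}X_2^*\big)^{-1/2}\big(\tfrac{1}{n}X_1^{*\prime}X_1^*\big)\big(\tfrac{1}{n}X_2^{*\prime}X_2^*\big)^{-1/2}\Big)$$

is bounded in probability since both $\frac{1}{n}X_1^{*\prime}X_1^*$ and $\frac{1}{n}X_2^{*\prime}X_2^*$ converge to positive definite matrices. Therefore, by (A3.1) and (A3.2) there exists $0 < \alpha < 1$ such that with probability approaching 1,

$$\frac{\|X_n(\tau_1, \tau_1^0)\beta_2\|^2}{\|g\|^2} \le \alpha^2$$

for all $\tau_1 < \tau_1^0$ and $g \in \mathcal{F}$. Thus, with probability approaching 1,

$$(\zeta'g)^2 = \Big[\sum_{t=1}^n(\Delta\beta^{0\prime}x_t)(x_t'\beta_2)1_{(x_{td} \in (\tau_1, \tau_1^0])}\Big]^2 \le \Big[\sum_{t=1}^n(\Delta\beta^{0\prime}x_t)^21_{(x_{td} \in (\tau_1, \tau_1^0])}\Big]\Big[\sum_{t=1}^n(x_t'\beta_2)^21_{(x_{td} \in (\tau_1, \tau_1^0])}\Big]$$
$$= \|\zeta\|^2\|X_n(\tau_1, \tau_1^0)\beta_2\|^2 = \|\zeta\|^2\|g\|^2\frac{\|X_n(\tau_1, \tau_1^0)\beta_2\|^2}{\|g\|^2} \le \alpha^2\|\zeta\|^2\|g\|^2$$

for all $\tau_1 < \tau_1^0$ and $g \in \mathcal{F}$. This completes the proof. ¶

Lemma A3.4 Suppose Assumptions 3.0-3.1 are satisfied. Let $W$ be a subset of $R^p$ such that $P(W) > 0$. Then $\min_{x_t \in W}|\hat\nu(x_t)| = O_p(\ln n/\sqrt{n})$, where $\hat\nu(x_t) = \mu(\hat\xi; x_t) - \mu(\xi^0; x_t)$.

Proof Without loss of generality, we can assume $l^0 = 1$. If we can show that $\sum_{t=1}^n\hat\nu^2(x_t) = O_p(\ln^2 n)$, then for any $W \subset R^p$ such that $P(W) > 0$, $\min_{x_t \in W}|\hat\nu(x_t)| = O_p(\ln n/\sqrt{n})$. Let $\hat{\mathcal{F}}$ be the linear space spanned by the $2(p+1)$ column vectors of $(X_n(-\infty, \hat\tau_1), X_n(\hat\tau_1, \infty))$, $\mathcal{F}^0$ be the linear space spanned by $\mu(\xi^0; X_n)$, and $\mathcal{F}^+ = \hat{\mathcal{F}} \oplus \mathcal{F}^0$ be the direct sum of the two vector spaces. Let $Q^+$, $\hat Q$ denote the orthogonal projections onto $\mathcal{F}^+$, $\hat{\mathcal{F}}$ respectively. Let $\hat\nu(X_n) = (\hat\nu(x_1), \dots, \hat\nu(x_n))'$. Then $\|\hat\nu(X_n) - \tilde\epsilon_n\|^2 = S_n(\hat\tau_1) \le \|\tilde\epsilon_n\|^2$.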
Since botli /i(f°,X„) and /x(f;X„) belong to T"^, by orthogonality, lKl;x„)-g+ynii' + iiê+5^n-F„iP =IKl;Xn)-F„||2 <lk-n|P =IKe°;X„) - Q+y„|p + ||Q+y„ - YX-Subtracting HQ'^yn — yn|P from both sides, we have that <IKe°;X„)-Q+n|p Therefore, <||/z(|;x„) - + i|Q+y„ - Me";^n)li <llO+è„|| + ||Q+ê„|| =2iig+ëni|. Since YJt=\ ^li'^t) = \\i>(Xn)\\'^, it remains to show that ||(5"'"ên|| = Op{lnn). Without loss of generahty, we can assume that n < T^. Let /3° = and A/3° ^ -0^. Note that Kf°,A„) = (X„(-oo,r{'),X„(r{',oo))4° =(X„(-œ, fi) + X„(fi, r°), X„(fi, oo) - X„(fi, rO))/3° = [(X„(-cx),f,),X„(fi,oo)) + (X„(fi,r°),-X„(fa,r«))]^« =(X„(-^, fi), X„(fi, oo))^° + X„(fa, r°)A/3°. This imphes that T'^ is also generated by the direct sum of T and vector C, where C X„(fi,rf)A/3°. By Lemma A3.3, there exists a < 1 such that for sufficiently large n, IC'^I < allClllkll for ah f\ < r° and g Ci P with probability approaching 1. Since Q{Q^èn) — Q^n and C'(Q^fn)/IICI| = C'ên/IICII) it follows from Lemma A3.2 that with probability approaching 1, Therefore, if it is shown that ||Qên|| = Op(lnn) and C'?n/l|C|| = Op(lnra), the desired result obtains. Define X = (Â:i,Â'2). Then =è'^X{X'X)-X'X{X'X)-X'èn =è'nX{X'X)-X'ln = ~e'nMX[Xi)-X[èn + è'MX'2X2)-XUn =r„(-oo,fi) + r„(fi,(X)). Therefore by Lemma 3.2, ||Qên|| = Op(lnra) uniformly for all fx. We next show that uniformly in n < rJ", C'ên/||CII = Op(lnn) for ||C|| 7^ 0, where C -^{M-^i) and ^ = (X„(-oo,rf)-X„(-oo,fi)). Let yt = x'^A/3°. Conditional on X„, we have that AQ%\ 31nn IICII - To 1^") <P( l^if^^-<-^--^;,'l > ^|A„) < pJEr=iytl(x„j<x.^<r„j)gt| 3 In 71 where To is specified in Lemma 3.1. Since |2/.l(x„,<x„<x„,)/(Er=i 2/i l(:<:.d<x„<x.,))^/-| < 1 and n n for any x^d, by Lemma 3.1, < Y 2exp(-To.^)exp(coro^) <n{n - l)/n^exp{coT^) 0, as —>• oo, where CQ is the constant specified in Lemma 3.1. Finally, by appealing to the dominated convergence theorem we obtain the desired result without conditioning. This completes the proof. 
^ Theorem A3.1 Suppose Assumptions 3.0 and 3.1 are satisfied. Let X° = (x°, • • - jx"). If 6 is identified at X°) by and xj, • • •, x° are centers of observations, then Proof Lemma A3.4 implies that with probability approaching 1, within any small neighbor-= Op(lnn/A). hood of x°, there exists a xj^ such that i = 1, - •• ,k. Lemma A3.1 imphes the conclusion of the theorem. If Corollary A3.1 Under the conditions of Theorem A3.1, f - r° = Op(lnn/y/n) where f = (^1, • • •, 'fio )', fj = 0 - Pj+xfi)l0i+i,d - hd), i = 1, • • •, Proof For any j = 1, • • •, /°, by continuity of the model at the end points x^ = r^, for all {xi, i ^ d}. Then by choosing the {x^, i ^ d} so that they are not collinear, we deduce that = for ah i ^ 0,d. By assumption, /9°^ ^ Therefore, TJ can be reestimated by solving and hence, fj — r° has the same order as — 1 Next we shall establish the asymptotic normahty of ^, and f when the model is continuous. The idea is to form a pseudo problem by deleting all the observations in a small neighborhood of each r° so that classical techniques can be apphed, and then to show that the problem of concern is "close" to the pseudo problem. The term "pseudo problem" is used because in practice the r^'s are unknown and so are the observations to be deleted. This idea is due to Sylwester (1965) and is used by Feder (1975a). Assume xj, has positive density function fd{xd) over a neighborhood of r°, j = l,---,/°. Our pseudo problem is formed by deleting all the observations in {x : r° — d„ < < r° + rf„} where dn = 1/ln^ n. Intuitively speaking, the number of observations deleted will be Op{ndn). This will be confirmed later in Lemma A3.6. Adopting Feder's (1975a) notation, we define n* as the sample size in the pseudo problem, and let n** = n - n*, 9* he the least squares estimate in the pseudo problem, the summation over the n* terms of the pseudo problem, and = Yl't=i " E*- Generally, a single asterisk refers to the pseudo problem. 
Theorem A3.1 and Corollary A3.1 carry over directly to the pseudo problem. Thus, Theorem A3.2 If the conditions of Theorem A3.1 is satisfied in the pseudo problem, then 9' -9° = Op{lnn/V^). Further, if Model (3.1) is continuous, f — r° = Op{\n n/y/n). Lemma A3.5 Suppose {xt} is an iid sequence. Under the conditions of Theorem A3.2 where Gj = £;[xx'l(^_^ç(^<^^,^o])], j = 1, • • •,/° + 1. Proof Let 5*(f) = ^ ^'(yt - //(^Xt))^. Theorem A3.2 imphes that f* £ (r° - dn,Tf + dn] with probability approaching 1. Since there are no observations within this region, it follows that 5*(f) computed within this region does not depend on r and is a paraboloid in 9. In particular, it is twice differentiable in 6. For the reminder of the proof, denote S*(Ç) by S*{d). Thus, with probability approaching 1, 6* may be obtained by setting the derivative of S*(9) to 0: t=i j=i n = ^ ^x,(x',(/3, - - fOl(x..e(rO_,+.„,.o_.„])-- * Hence, ^T,ti^t^tM..,e(rO_^+d„,rf-d„]))0j - P'j) = 7[T.7=i^tetl(.,,ç(rO_^+d^,rf-d„])- By Lemma 3.6 and the strong law of large numbers, 1 " - Y ^t^tMx,deir°_^ + d„,r°-d„])) 1 " =G, + Op(l), where Gj = ^fxix'^l^j-^^çf^o J,T°])]- Under the assumptions of the pseudo problem, Gj is positive definite. Thus, ^0] - P'j) = [Gj + èx,C,l(,.,e(.;^, + <i„,rO-.„l)-The Lindeberg-Feller central limit theorem for double sequences implies the assertion of the lemma. f It now remains to sliow tliat 9 in the original problem and 9* in the pseudo problem do not differ by too much. In fact, we shall show that 9 — 9* = Op{n~^/'^) and hence that the two have the same asymptotic distribution. Lemma A3.6 Suppose Assumptions 3.0, 3.1 and 3.3 are satisfied. Then under the conditions of Theorem A3.2, 9 - 9* = Op(n-i/2). Proof The hypotheses imply that 9 is identified at X^) both in original problem and in the pseudo problem, by some X° = (x^, • • •, x°), where xJ, • • •, x° are centers of observations. 
It follows from Theorems A3.1 and A3.2 that ^-é»" = Opin''^/^ In n), a.nd 9'-9'^ = Op{n-'^/^'Inn). Let an = (ln7i)5/4 and = : \0 - 9''\ < <x„/V^, |r,-- r^l < j = l,---,/°}. Then ^ and ^* both lie in J/„ with probability approaching 1. Note that function S*(^) depends only on (9 for f € so that S'{0 = S*{9). RecaU that S(0=^f^(^i + K^t))\ and (A3.3) S*{0 = li^i^t + '^(^t))'. Tl Thus, SiO =S*{0+lf^{et + t^{xt))' Without loss of generality, we can assume that z is bounded. It follows from the definition of Un and the boundedness of z, that sup max |i/(^;xt)| = 0(a„/v^). Note that n** is the (1, l)th component of J2'j=i XI,(T° - d^, + dn)Xn{T° - d^, rf + d„). By Lemma 3.7 (i), n'* = Op(ndn). Thus, 1 ** sup l-^i^'i^t)] <{alln)n'*ln ^Opialdnln) Also, for any (5 > 0 and ^ Ç.lin <§E[f:.\C,Xt)] <§(sup max K^;xOlfi?K*) ieu^x,ue[j.(T°-d,.,rf+d„] <^0i^)0p{ndn) for some M > 0, where 0{a\ln) and Op{ndn) are independent of ^ € ZYn. Since a\dn —>• Q as n -* oo, ^ ^ti^i^y^t) = Op{l/n) uniformly for all ^ G ZY„. Thus, by (A3.3) S{0 = S*{0 + ^f^^l + Op{h (A3.4) where Op(l/n) is uniformly small for ^ £lin-Since ^ and ^* are least squares estimates for the original and the pseudo problem respec tively, Sii) < Sit), S'it) < S'ii). (A3.5) (A3.4) and (A3.5) imply 0 < Sit) - 5(0 = S'it) - S*ii) + Op{-) < opi-). (A3.6) Tt Tl Therefore, S*(i) - S*{i') Taylor's expansion yields = Op(^). Since dS*(i*)/d9 = 0 and 5*(f) is a paraboloid in 6, s'ii) = s*in+l{ê - - r)'. (^3.7) Equations (A3.6) and (A3.7) imply Ô - 9* = Op(7i-§). If Lemma A3.6 implies that ^/n{9 — 9^) and y/n{9* — 9^) have the same asymptotic distribu tion. Thus, by Lemma A3.5 we have Theorem A3.3 Suppose the conditions of Lemma A3.6 are satisfied. Then, ^A^(/3, - -i N{0, alGf), j = 1, • • •, /° + 1 where Gj is defined in Lemma A3.5. For any j = + 1, let and AÂ = $j,o - A/3, = pj^d - Pj+i,d-Then = fj = hence. V(A/3o - A/3S) - -M^(A/3° - A/3,) -A/3/ A/3,A/32 = -i^(A/3o - A/3°) + -^(A^, - A/33). 
$$\sqrt{n}(\hat\tau_j - \tau_j^0) = -\frac{1}{\Delta\beta_d^0}\sqrt{n}(\Delta\hat\beta_0 - \Delta\beta_0^0) + \frac{\Delta\beta_0^0}{(\Delta\beta_d^0)^2}\sqrt{n}(\Delta\hat\beta_d - \Delta\beta_d^0) + o_p(1).$$

So we have

Theorem A3.4 Under the conditions of Theorem A3.3, if Model (3.1) is continuous, then $\sqrt{n}(\hat\tau_j - \tau_j^0)$ and $-\frac{1}{\Delta\beta_d^0}\sqrt{n}(\Delta\hat\beta_0 - \Delta\beta_0^0) + \frac{\Delta\beta_0^0}{(\Delta\beta_d^0)^2}\sqrt{n}(\Delta\hat\beta_d - \Delta\beta_d^0)$ have the same asymptotic distribution.

Chapter 4

SEGMENTED REGRESSION MODELS WITH HETEROSCEDASTIC AUTOCORRELATED NOISE

In this chapter, we consider the situation where the noise is autocorrelated and the noise levels are different in different regimes. Specifically, consider the model

$$y_t = x_t'\beta_j + \sigma_j\epsilon_t, \quad \text{if } x_{td} \in (\tau_{j-1}, \tau_j], \quad j = 1, \ldots, l+1, \quad t = 1, \ldots, n, \qquad (4.1)$$

where $\epsilon_t = \sum_{i=0}^{\infty} \psi_i \zeta_{t-i}$, with $\sum_{i=0}^{\infty} \psi_i^2 < \infty$. The $\{\zeta_t\}$ are iid, have mean zero, have variance $\sigma_\zeta^2$, and are independent of the $\{x_t\}$, $x_t = (1, x_{t1}, \ldots, x_{tp})'$. And $-\infty = \tau_0 < \tau_1 < \cdots < \tau_{l+1} = \infty$, while the $\sigma_j$ ($j = 1, \ldots, l+1$) are positive parameters. We adopt the parametrization which forces $\sigma_\zeta^2 = 1/\sum_{i=0}^{\infty} \psi_i^2$, so that the $\{\epsilon_t\}$ have unit variances. Further, we assume that there exist $\delta > 3/2$ and $k_0 > 0$ such that $|\psi_i| \le k_0/(i+1)^{\delta}$ for all $i$. Note that this implies $\{\epsilon_t\}$ is a stationary ergodic process.

Estimation procedures are given in Section 4.1. In Section 4.2, it is shown that the asymptotic results obtained in Chapter 3 remain valid. Since a major part of the proofs formally resemble those in Chapter 3, all the proofs are put in Section 4.5 as an appendix. Simulation results are reported in Section 4.3. Section 4.4 contains some remarks.

4.1 Estimation procedures

With the notation introduced in Chapter 3, the model can be rewritten in the vector form,

$$Y_n = \sum_{j=1}^{l^0+1} X_n(\tau_{j-1}^0, \tau_j^0)\beta_j + \tilde\epsilon_n^*, \qquad (4.2)$$

where $\tilde\epsilon_n^* := [\sum_{j=1}^{l^0+1} \sigma_j I_n(\tau_{j-1}^0, \tau_j^0)]\tilde\epsilon_n$. All the parameters are estimated as in Chapter 2 except for the variances $\{\sigma_1^2, \ldots, \sigma_{l^0+1}^2\}$. These are estimated by

$$\hat\sigma_j^2 = S_n(\hat\tau_{j-1}, \hat\tau_j)/\hat n_j, \quad j = 1, \ldots, \hat l + 1,$$

where $\hat n_j$ is the number of observations falling in the $j$th estimated regime and $\hat l$ is the estimate of $l^0$ produced by the estimation procedure in Section 2.2.
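The regime-wise variance estimate $\hat\sigma_j^2 = S_n(\hat\tau_{j-1}, \hat\tau_j)/\hat n_j$ can be illustrated with a minimal Python sketch for the simplest case of a single threshold. This is only an illustration under stated assumptions, not the procedure of Section 2.2: the number of thresholds is taken as known and equal to one, the threshold is found by a plain grid search over the observed values of the segmentation variable, and the names `rss`, `fit_one_threshold` and the guard `min_obs` are hypothetical.

```python
import numpy as np


def rss(X, y):
    """Least squares fit of y on X; returns (residual sum of squares, coefficients)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid), beta


def fit_one_threshold(X, y, d, min_obs=None):
    """Two-regime least squares fit, splitting on column d of X.

    X is assumed to carry a leading column of ones. The threshold is chosen
    by minimizing the total residual sum of squares over the observed values
    of X[:, d]. Returns (tau_hat, (beta_1, beta_2), (sigma2_1, sigma2_2)),
    where sigma2_j = S_n(regime j) / n_j.
    """
    n, k = X.shape
    # Assumption: require a few more observations than coefficients per regime.
    min_obs = k + 1 if min_obs is None else min_obs
    xd = X[:, d]
    best = None
    for tau in np.unique(xd)[:-1]:        # drop the maximum so regime 2 is non-empty
        left = xd <= tau                  # regime 1: x_d in (-inf, tau]
        n1 = int(left.sum())
        n2 = n - n1
        if n1 < min_obs or n2 < min_obs:
            continue
        s1, b1 = rss(X[left], y[left])
        s2, b2 = rss(X[~left], y[~left])
        if best is None or s1 + s2 < best[0]:
            best = (s1 + s2, float(tau), (b1, b2), (s1 / n1, s2 / n2))
    if best is None:
        raise ValueError("not enough observations to fit two regimes")
    _, tau_hat, betas, sigma2 = best
    return tau_hat, betas, sigma2
```

Dividing each regime's residual sum of squares by its own observation count is what lets the two regimes carry different noise levels; a pooled estimate would average them.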
We shall see in the next section that the asymptotic results in Section 3.2 are essentially unchanged for this modification of the model. After estimating $\beta_j$ and $\sigma_j$ we may use the estimated residuals, $\hat\epsilon_t = (y_t - x_t'\hat\beta_j)/\hat\sigma_j$, if $x_{td} \in (\hat\tau_{j-1}, \hat\tau_j]$, to estimate the parameters in the moving average model for the $\epsilon_t$'s.

4.2 Asymptotic properties of the parameter estimates

To establish the asymptotic theory, we need to make some assumptions for Model (4.2). Below is a basic assumption which is assumed to hold throughout this section.

Assumption 4.0: The $\{x_t\}$ is a strictly stationary ergodic process with $E(x_1'x_1) < \infty$. The $\epsilon_t$ are given by $\epsilon_t = \sum_{i=0}^{\infty} \psi_i \zeta_{t-i}$, where $|\psi_i| \le k_0/(i+1)^{\delta}$ for some $k_0 > 0$, $\delta > 3/2$ and all $i$; the $\{\zeta_t\}$ are iid, locally exponentially bounded random variables with mean zero and variance $\sigma_\zeta^2 = 1/\sum_{i=0}^{\infty} \psi_i^2$, and are independent of the $\{x_t\}$. For the number of thresholds $l^0$, there exists a specified $L$ such that $l^0 \le L$. Also, for any $j = 1, \ldots, l^0$, $\beta_j^0 \ne \beta_{j+1}^0$.

Note that $\{\epsilon_t\}$ is a stationary ergodic process and each $\epsilon_t$ has unit variance. Additional assumptions analogous to those in Section 3.1 are also needed to establish the consistency of the estimates. For convenience, we restate Assumptions 3.1-3.2 as Assumptions 4.1-4.2, respectively.

Assumption 4.1 There exists $\delta \in (0, \min_{1 \le j \le l^0}(\tau_{j+1}^0 - \tau_j^0)/2)$ such that both $E\{x_1x_1' 1_{(x_{1d} \in (\tau_j^0 - \delta, \tau_j^0])}\}$ and $E\{x_1x_1' 1_{(x_{1d} \in (\tau_j^0, \tau_j^0 + \delta])}\}$ are positive definite for each of the true thresholds $\tau_1^0, \ldots, \tau_{l^0}^0$.

Assumption 4.2 For any sufficiently small $\delta > 0$, $E\{x_1x_1' 1_{(x_{1d} \in (\tau_j^0 - \delta, \tau_j^0])}\}$ and $E\{x_1x_1' 1_{(x_{1d} \in (\tau_j^0, \tau_j^0 + \delta])}\}$ are positive definite, $j = 1, \ldots, l^0$. Also, $E(x_1'x_1)^u < \infty$ for some $u > 1$.

To establish the asymptotic normality for the $\hat\beta_j$'s and $\hat\sigma_j^2$'s, we need to establish it for the least squares estimates of the $\beta_j$'s and $\sigma_j^2$'s with $l^0$ and $\tau_1^0, \ldots, \tau_{l^0}^0$ known. To this end, we specify the probability structure of $\{x_t\}$ and $\{\zeta_t\}$ explicitly.
If $(\Omega, \mathcal{F}, P)$ is a probability space, a measurable transformation $T : \Omega \to \Omega$ is said to be measure-preserving if $P(T^{-1}A) = P(A)$ for all $A \in \mathcal{F}$. If $T$ is measure-preserving, a set $A \in \mathcal{F}$ is called invariant if $T^{-1}(A) = A$. The class $\mathcal{I}$ of all invariant sets is a sub-$\sigma$-field of $\mathcal{F}$, called the invariant $\sigma$-field, and $T$ is said to be ergodic if all the sets in $\mathcal{I}$ have probability zero or one (cf. Hall and Heyde, 1980, p. 281).

As Hall and Heyde point out (1980, p. 281): "Any stationary process $\{x_n\}$ may be thought of as being generated by a measure-preserving transformation, in the sense that there exists a variable $x$ defined on a probability space $(\Omega, \mathcal{F}, P)$, and a measure-preserving map $T : \Omega \to \Omega$, such that the sequence $\{x_n'\}$ defined by $x_0' = x$ and $x_n'(\omega) = x(T^n\omega)$, $n \ge 1$, $\omega \in \Omega$, has the same distribution as $\{x_n\}$." Therefore, we can assume without loss of generality that the stationary and ergodic sequence $\{x_t, \zeta_t\}$ is generated by a measure-preserving transformation $T$ on a probability space.

Assumption 4.3

(A.4.3.1) Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $\{x_t, \zeta_t\}_{t=-\infty}^{\infty}$ be an iid random sequence such that (i) $\{x_t\}$ and $\{\zeta_t\}$ are independent; (ii) $(x_t, \zeta_t) = (x(T^t\omega), \zeta(T^t\omega))$, $\omega \in \Omega$, $t = 0, \pm 1, \ldots$, where $T$ is an ergodic measure-preserving transformation and $(x, \zeta)$ is a random variable defined on the probability space $(\Omega, \mathcal{F}, P)$; and (iii) $E(x_1'x_1)^u < \infty$ for some $u > 2$.

(A.4.3.2) Within some small neighborhoods of the true thresholds, $x_{1d}$ has a positive and continuous probability density function $f_d(\cdot)$ with respect to the one-dimensional Lebesgue measure.

(A.4.3.3) There exists one version of $E[x_1x_1' \mid x_{1d} = x]$ which is continuous within some neighborhoods of the true thresholds, and that version has been adopted.

Consider the segmented linear regression model (4.2) of the previous section. Let $\hat l$ be the minimizer of $MIC(l)$.

Theorem 4.1 For the segmented linear regression model (4.2) suppose Assumptions 4.0 and 4.1 are satisfied. Then $\hat l$ converges to $l^0$ in probability as $n \to \infty$.
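How the minimizer $\hat l$ of $MIC(l)$ might be selected, once the residual variance estimates $\hat\sigma_l^2$ are available for $l = 0, \ldots, L$, can be sketched as follows. The criterion is written as $\ln(\hat\sigma_l^2)$ plus a penalty proportional to $(\ln n)^{2+\delta_0}/n$, the form that appears in the proof of Theorem 4.1, with $c_0 = 0.299$ and $\delta_0 = 0.1$ as used in the simulations. The parameter count in `n_params_default` is an assumption made for illustration only, and all function names are hypothetical.

```python
import math


def mic(sigma2_hat, n_params, n, c0=0.299, delta0=0.1):
    """ln(sigma2_hat) plus a penalty n_params * c0 * (ln n)^(2 + delta0) / n."""
    return math.log(sigma2_hat) + n_params * c0 * math.log(n) ** (2.0 + delta0) / n


def n_params_default(l, p):
    """Assumed parameter count for l thresholds and p covariates:
    (l + 1) coefficient vectors of length p + 1, plus l thresholds.
    This count is an illustrative assumption, not taken from the text."""
    return (l + 1) * (p + 1) + l


def select_num_thresholds(sigma2_by_l, p, n, c0=0.299, delta0=0.1,
                          n_params=n_params_default):
    """sigma2_by_l maps each candidate l (0..L) to its estimated residual variance.
    Returns (l_hat, scores), where scores[l] is the criterion value for l."""
    scores = {l: mic(s2, n_params(l, p), n, c0, delta0)
              for l, s2 in sigma2_by_l.items()}
    l_hat = min(scores, key=scores.get)
    return l_hat, scores
```

The penalty grows faster in $n$ than the usual $\ln n$ of Schwarz' criterion, so a small reduction in $\hat\sigma_l^2$ from an extra threshold is not enough to lower the criterion.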
The next two theorems show that the estimates f, 0j and aj are consistent, under As sumptions 4.0 and 4-2. Theorem 4.2 Assume for the segmented linear regression model (4-Sj Assumptions 4-0 and 4.2 are satisfied. Then f-r° = Op(l), where r° = (rf,..., r^o) and f — (fi,..., fj) is the least squares estimate of r° based on I — I, and I is a minimizer of MIC {I) subject to I < L. Theorem 4.3 If the marginal cdf Fj, of xn satisfies Lipschitz Condition \Fd{x') - Fd{x")\ < C\x' — x"\ for some constant C at a small neighborhood of X\d = rj" for every j, then under the conditions of Theorem 4-2, the least squares estimates Pj and aj, j = 1,... ,1 + 1, based on the estimates I and fj's as defined in Section 2.2, are consistent. Next, we show that if Model (4.2) is discontinuous at r° for some j = 1, • • •,/°, then the threshold estimates, fj, converge to the true thresholds, r°, at the rate of Op(ln' n/n), and the least squares estimates of Pj and <7| based on the estimated thresholds are asymptotically normally distributed. Theorem 4.4 Suppose for the segmented linear regression model (4-2) that Assumptions 4-0, 4.2 and 4.3 are satisfied. IfP{x[{Pj+i - Pj) / 0\xd = r?) > 0 for some j = 1,---,P, then For j = 1, • • •, /° + 1, let Pj be the least squares estimates of Pj based on the estimates / and fj's as defined in Section 2.2, and aj be as defined in Section 4.1. Define Gj = Z;(xix'il(^^_^ç(^o_^_^o])), 00 E,- = aj[G-' + 2Y,l{i)Gj'E{xil^,^^^^rO_^,rO^^^^^ i=l Pj = P{TU < < r'j) and oo vj=pjil-pj)Eiet) + p'j[iv-3h\0) + 2 ^ 7^(0], »=-oo where 7(1) = £'(ei€i+,), 77 = cryE(<^f) and j = + Then, we have the following result. Theorem 4.5 Suppose for the segmented linear regression model (4-2) Assumptions 4-0, 4-2 and 4.3 are satisfied. If P{x.\{Pj^x - Pj) 7^ O^d = r?) > 0 for all j = 1, • • •then V^CPJ - Pj) N{0, S,) and ^Pj{à] - u]) iV(0, v^a)), as n ->• 00, j = 1, - • • ,f + 1. Note that if 7(1) = 0, i > 0, then Ylj — <^o^7^ as shown in Section 3.1. 
The next theorem shows that Method 1 of Section 2.2 for estimating dP produces a consistent estimate. Theorem 4.6 If d° is asymptotically identifiable w.r.t. L, then under the conditions of Theo rem 4-1, d given in Method 1 of Section 2.2 satisfies P(d = d^) —>^ 1 as TI —»• 00. Remark: Although the result of Theorem 3.7 is expected to carry over if aj = a for all j, it does not carry over in general. Hence, Method 2 given in Section 2.2 is not generally consistent. Below is a counterexample. Example 4.1. Let x = (1,2:1,X2)' where (xi,X2) is a random vector with domain [0,6] x [0,6]. Divide the domain into six parts as shown in Figure 4.1. On each part, (xi,X2) is uniformly distributed with mass indicated in the figure. Let d = 1, Z*' = 2, L = 2 and (ri,r2) = (0.5,1). Hence, i?? = {x : 0 < xi < 0.5}, ii:^ = {x : 0.5 < xi < 1} and i?^ = {x : 1 < xi < 6}. The model is yt = ^^ l(x,eK«) + <^j(t: if Xt G R'j, 102 where the {xJ are independent samples from the distribution of x, the {et} are iid iV(0,1) and independent of {xt}. Let o-^ = 1 and = cr^ = 10. Define Rj = {x : Xi 6 (j — 1, j]}, i = 1,2, J = 1, • • - ,6. It is easy to see that on each Rj, the mass is 1/6 = 1/(2X + 2). Suppose we fit a constant on each of Rj. Let us calculate AMSE{R^j), the asymptotic mean squared error on R). For j > 1, AMSE(R]) = a| = 10. And AMSE{Rl) = ^2 ^ i + al X i + 5f = ^ + BJ, where Bi is the asymptotic mean bias. Observe that the marginal distribution of Xi on (0,1] is uniform and symmetric about n = 0.5; hence Bi = 1 and AMSE{R\) = 13/2 < 10. Therefore, with probabihty approaching 1 as n —»^ oo, the MSE on Rl wiU be chosen as the smaUest MSE among those on 72], j = 1, • • •, 6. For i = 2 and j > 1, where B2 represents the asymptotic mean bias on each of Rj, j > 1. 
The asymptotic mean squared error on Rl should be no larger than the asymptotic mean squared error obtained by setting the model to 0: \ ij - 1 20 2 20 ^ 20 20 20 100 Thus, with large probability as n ^ 00, the MSE on Rl will be chosen as the smallest MSE among those on Rj, j = l,---,6. Since AMSE{R\) > AMSE{R\), X2, rather than xi, wih be chosen by Method 2 as the segmentation variable with probability approaching 1 as n —>^ 00. f 4.3 A simulation study In this section, simulation experiments involving model (4.2) are carried out to examine the small sample performance of our proposed procedures under various conditions. As in Section 3.3, segmented regression models with two to three regimes are investigated. Let 4 = 0.7eJ_i - 0.1e;_2 + Ct, where the {0} are iid with a locally exponentially bounded distribution having zero means and unit variances. Note that the {e^} can alternatively be defined by (l-ei-^5)(l-C2-^5)e', = Ct, where B is the backward shift operator defined by Bh'^ = e[_j, j = 0, ±1, ±2, • • -, and (6,6) = (2,5). Since |6| > 1 for i = 1,2, {ej} is a causal AR(2) process. Hence, it can be written as = Sjlo where is the coefficient of in the polynomial, V>(2) = l/[{l — ^z){l-^z)]. Expanding tp(z), we get t=0 fc=0 .=0 it=0 Let j = i + k, then «=0 j=» j=0 i=0 So t=0 t=0 Thus for any S > 3/2, taking ko > 0 sufficiently large, we have < ko/(j + 1)*. Let €t — e'Jy/Var{€[), so that Var{et) = 1 for all t. Then the {et} satisfy the condition of Model (4.2) [In this case ^yVar{e't) = 1.33 (c.f Example 3.3.5, Brockweh and Davis, 1987)]. Let Zt = {xti, - • • ,xtp)' and xJ = (l,z'J, where {xtj} are nd iV(0,4). Let DE{Q,\) denote the double exponential distribution with mean 0 and variance 2A^. 
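The noise construction described above can be sketched in Python as follows. The sketch generates the AR(2) series with coefficients 0.7 and -0.1, computes the moving average weights by the equivalent recursion $\psi_j = 0.7\psi_{j-1} - 0.1\psi_{j-2}$ (which reproduces $\psi_j = \sum_{i=0}^{j} 2^{-i}5^{-(j-i)}$ for $(\xi_1, \xi_2) = (2, 5)$), and rescales the series to unit variance. Standard normal innovations, the burn-in length and the truncation point of the weights are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

PHI1, PHI2 = 0.7, -0.1    # AR(2) coefficients from the text


def psi_weights(m):
    """First m weights of the MA(infinity) representation of the AR(2) noise."""
    psi = np.zeros(m)
    psi[0] = 1.0
    if m > 1:
        psi[1] = PHI1
    for j in range(2, m):
        psi[j] = PHI1 * psi[j - 1] + PHI2 * psi[j - 2]
    return psi


def simulate_noise(n, rng, burn_in=500, m=200):
    """Simulate n values of the unit-variance noise e_t = e'_t / sd(e'_t).

    Innovations are standard normal here (an assumption); for the double
    exponential case one could use rng.laplace(scale=1 / np.sqrt(2)) instead.
    """
    total = n + burn_in
    zeta = rng.standard_normal(total)
    e = np.zeros(total)
    for t in range(total):
        e[t] = zeta[t]
        if t >= 1:
            e[t] += PHI1 * e[t - 1]
        if t >= 2:
            e[t] += PHI2 * e[t - 2]
    # sd(e'_t) = sqrt(sum of psi_j^2) for unit-variance innovations;
    # the sum is truncated at m terms, which is harmless since psi_j decays geometrically.
    sd = np.sqrt(np.sum(psi_weights(m) ** 2))
    return e[burn_in:] / sd
```

Computing the normalizing constant from the weights, rather than hard-coding it, keeps the sketch valid if the AR coefficients are changed.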
For $d = 1$ and $\tau_1^0 = 1$, the following 3 sets of model specifications are used:

(a') $p = 2$, $\beta_1 = (0, 1, 1)'$, $\beta_2 = (1.5, 0, 1)'$, $\sigma_1 = 0.8$, $\sigma_2 = 1$, $\zeta_t \sim N(0, 1)$,

(d') $p = 3$, $\beta_1 = (0, 1, 0, 1)'$, $\beta_2 = (1, 0, 0.5, 1)'$, $\sigma_1 = 0.8$, $\sigma_2 = 1$, $\zeta_t \sim DE(0, 1/\sqrt{2})$,

(e') $p = 3$, $\beta_1 = (0, 1, 1, 1)'$, $\beta_2 = (1, 0, 1, 1)'$, $\sigma_1 = 0.8$, $\sigma_2 = 1$, $\zeta_t \sim DE(0, 1/\sqrt{2})$.

Note that the regression coefficients in Models (a'), (d') and (e') are the same as those in Models (a), (d) and (e). Beyond the reasons given in Section 3.3, these models are selected so that the results in this section will be comparable to those in Section 3.3. In all, 100 replications are simulated with different sample sizes, 50, 100 and 200. For the reason given in Section 3.3, the results reported in Tables 4.1 and 4.2 are obtained by setting $L = 2$ to save some computational effort. The two constants, $\delta_0$ and $c_0$ in MIC, are chosen as 0.1 and 0.299 respectively, as explained in Section 3.1.

Table 4.1 shows the estimates $\hat l$, $\hat\tau_1$ and its standard error, based on the MIC. The following observations derive from the table.

(i) For all models, in more than 90% of the cases $l^0$ is correctly identified. Hence, for estimating $l^0$ our results seem satisfactory. Comparing these results to those in Table 3.1, it seems that Models (a'), (d') and (e') are more difficult to identify than Models (a), (d) and (e).

(ii) As in Section 3.3, $\hat\tau_1$ seems biased for small sample size. This bias is related to the shape of the model. Note that the biases for Model (a') are all positive and those for Model (d') are all negative. These biases decrease as the sample size becomes larger.

(iii) The standard error of $\hat\tau_1$ is relatively large in all the cases considered. And, as expected, the standard error decreases as the sample size increases. This suggests that a large sample size is needed for reliable estimation of $\tau_1^0$. An experiment of $n = 400$ is carried out for Model (e'). We again obtained correct identification in 99% of the cases.
But the standard error of $\hat\tau_1$ reduces from 1.111 for $n = 200$ to 0.707 when $n = 400$.

(iv) A larger penalty constant may perform better in these cases, since there seems to be a tendency to overestimate $l^0$, especially as $n$ becomes large. Because in practice the model structure is unknown and one cannot choose the best $(\delta_0, c_0)$, we adopt the same values for these parameters as in Section 3.3.

Table 4.2 shows the estimated values of the other parameters for the models in Table 4.1 only for a sample size of 200. The results indicate that, except for $\beta_{20}$, the estimated $\beta_j$'s are quite close to their true values even when $\hat\tau_1$ is inaccurate. So, for the purpose of estimating the $\beta_j$'s, and interpolation when the model is continuous, a moderate sample size such as 200 may be sufficient. When the model is discontinuous, interpolation near the threshold may not be accurate due to the inaccurate $\hat\tau_1$. As we saw in Section 3.3, the estimates of $\beta_{20}$ have relatively large standard errors. This is due to the fact that a small error in $\hat\beta_{21}$ would result in a relatively large error in $\hat\beta_{20}$. The relatively large error for $\hat\beta_{20}$ may also be due to the inaccurate $\hat\tau_1$.

Simulations have also been carried out for a model with $l^0 = 2$. Specifically, the model is:

(j) $p = 2$, $\beta_1 = (1, 1, 0)'$, $\beta_2 = (0, 0, 1)'$, $\beta_3 = (0.5, 0, 0.5)'$, $\sigma_1 = 0.7$, $\sigma_2 = 0.8$, $\sigma_3 = 1$, $\tau_1^0 = -1$, $\tau_2^0 = 1$, $\zeta_t \sim DE(0, 1/\sqrt{2})$.

The results are reported in Tables 4.3-4.4. Table 4.3 tabulates the empirical distributions of $\hat l$ for different sample sizes. With $n = 200$, $l^0$ is correctly identified in 95 out of 100 replications. The standard errors of $\hat\tau_j$ ($j = 1, 2$) are relatively small, indicating that the thresholds in this model are easier to identify. The $\hat\beta_j$'s and the $\hat\sigma_j^2$'s are given in Table 4.4. The results are similar to those in Table 4.2.

4.4 General remarks

In this chapter, we generalized the results in Chapter 3 to the case where the noise is heteroscedastic and autocorrelated.
Although the ideas used in this generalization are the same as those of Chapter 3, it can be seen in Section 4.5 that a more technical analysis is required to prove these results. The simulation results given in the last section indicate that this model is in general more difficult to identify than the model discussed in the last chapter.

There are several questions which need further investigation. First, can the residuals be used to estimate the $\psi_i$'s in the moving average specification of the noise once the estimates of the regression coefficients are obtained? If so, what procedure should be used to reduce the impact of the bias in the estimated $\tau_j^0$'s? Once the $\psi_i$'s are estimated, can the information obtained be used to reestimate the other parameters of the model to obtain better estimates?

Second, the asymptotic distributions of the estimates given in this chapter are for discontinuous models. If the model were continuous, one could aggregate the data over the segmentation variable regions to obtain a linear regression problem. The $\beta_{ji}$'s ($i \ne 0, d$) can be estimated by least squares. The residuals can then be used to estimate $\beta_{j0}$, $\beta_{jd}$ and $\sigma_j$ ($j = 1, \ldots, l^0 + 1$) by least squares again in a one-dimensional segmented regression problem. A number of questions remain to be answered: Are these estimates consistent? What are their asymptotic distributions? If the parameters are estimated directly by least squares, are the estimates, unrestricted by continuity, consistent? What are their asymptotic distributions? Some of these problems will be discussed further in the next chapter as future research topics.

4.5 Appendix: Proofs

Although a major part of the proofs appears to resemble those in Chapter 3, there are some extra difficulties resulting from the correlated errors. First, we have to show that the result of Lemma 3.2 still holds under dependent assumptions. This is accomplished in Lemmas 4.1 and 4.2.
Second, the results of Lemma 3.7 have to be re-established by calculating the limits of sample moments. Third, we have to establish the asymptotic normality of the estimated regression coefficients and the variances of the errors for known thresholds. This is done in Lemmas 4.9 and 4.10 by using a central hmit theorem for stationary processes. The proof of Theorem 4.1 will be given after a series of related lemmas. Lemma 4.1 (Susko, 1991) Suppose \ai\ < ko/i^ for some Â;o > 0, ^ > 3/2. Then YlZiŒZi |a,+,|)2 < oo. Proof: By assumption, \ai\ < ko/i^ for some ko > 0, S > 3/2. Therefore, oo oo oo oo ^ ;=1 /=1 .=1 /=1 ^ ^ Now, oo ^ oo 1=1 ^ ^ ^ j=,+i oo j = E V/ / dt oo .j = E / min < E / roo -I. dt dt So, oo oo L,2 °° DEi-'+^d^st^E'/.^"-"-(4.3) By assumption, S > 3/2, so 2(6 — 1) > 1, and hence f;(f;ia,+,i)2<oo. The next Lemma is slightly modified version of Lemma 1 of Susko (1991). Lemma 4.2 Let {Ct} be iid, locally exponentially bounded random variables. Let €t = Si^o '^iCt-i, and assume there exists 6 > 3/2, ko > 0 such that < ko/{i + 1)'' for all i. Let Sk = Yii=i ^i^i> where the a'^s are constants. Then there exists 0 < ci < oo and Ti > 0, such that for any x >Q, k > 1 and t satisfying 0 < /||a|| < Ti, P{\Sk\ >x}< 2e-*^+=i*'ll''ll'. Proof The assumption of locally exponentially boundedness means that for some TQ > 0 and 0 < Co < oo, f;(e*^i) < e''"*^ for \t\ < To. Now it follows from Markov's inequality that for sufficiently small t > 0, And where Hence, P{Sk >x} = P{e*^* > e'^} < e-*^X;(e'^*). fc k oo Sk = Y = E E = ^(^)+^(^)' 1 i=l j=0 fc-1 t ^(^) = E'^'^-'E^'t-j^'-i' :=0 j=0 ^w = Ec-.E«iV'i+.-. i=0 i=l if |^Etia/V'/+i| < To for aU i. Let Mi = ESoCEti Note that we can assume y/M^ > 0 without loss of generality (since otherwise Cj = 0 a.s.). Since iV»,! < ko/{i+ 1)^, from the previous lemma Afi < oo. Observe that for all i, (E«'V'/+.)^<(è«?)(E^'+.) 
/=i 1=1 1=1 <iwP(Ei^'+'i)'^iHi'(Ei^'+«i)'-/=i /=i Hence if t is such that |i|||a|| < TQ/^/M^, then for aU i k oo l*E"'^'+«l ^ MIHI(El^'+.l) < \t\\HVM'i<To. 1=1 1=1 Therefore, for any t such that |t|||a|| < To/y/M^ and c = coMi, Also, if I Z)}=o '^k-j'>Pi-j\ < To for all i. Let n = i- j, m = i - I, then i=0 j=0 k-1 i i j-1 = EE ^l-j^i-j + 2 E E ak-jQk-irpi-ji^i-i] 1=0 i=0 j=l /=0 fc-1 0 fc-1 i-1 n+1 = E E ^l-i+n'^l + 2 E E E afc-(i-n)afc-(.-m) V'n^m t"=0 n=t" j=l n=0 m=: fc-1 t fc-2 fc-1 «• = E E'^fc-'+n'^" + 2 5^ J2 E flfc+n-iafc+m-.V'nV'^ t=0 n=0 n=0 t=n+l m=n+l fc-1 fc-1 fc-2 fc-1 fc-1 ^ E ^" E+ 2| ^ Y^k+n-iak+m-iMm n=0 i=n n=Om=n+lz=m fc-1 fc-2 fc-1 fc-1 i /V A A. A A, J. <E^nHi'+2E E iV'.^'-iiE Ck+n-iak+m-i\ n=0 n=0 m=n+l »=m <E^nNl'+2E E l^n^'mlNI^ n=0 n=0 m=n+l =ii«iP(Ei^'^i)' n=l Therefore, for any t such that |<|||a|| < To/y/M^ and the c = CQMI, we have « «• ItY^k-j^i-jl < |f||^afc_jV.-il j=o j=0 fc-1 i t=0 j=0 <7o. and hence Since A(A;) and are independent we get that for Ti = To/y/Ml and any A;, P{Sk >x}< e-'^Eie*''^''^)E{e'^^'^) < e-«-e2ct^||.||^ ^ ^-tx^c,t^\\a\\^^ where ci = 2c and |f|||a|| < Tj. Finally, to conclude the proof, we note that P{Sk < -x} = P{-Sk > x}. f Lemma 4.3 Assume for the segmented linear regression model (4-2) that Assumption 4-0 is satisfied. Define (Tmax := rnaxj <7i and redefine Tn(a,T]) := ê^'^„(a, 77)6^, -00 < a < 77 < 00. Then Qfj2 „3 P{sup Tnia, 77) > In^ TI} 0, as n0, a<Ti ±1 where po is the true order of the model and T, is the constant specified in Lemma 4-2. Proof Conditioning on X„, we have P{supT„(a,77) > £4f^ln2 7i I X„} a<r] J-i =P{ max êr^„(x,rf,x,,K > ^-^In^n I X„} < E PK'Hn{x,d,xu)èl>^^\n'n\Xn]. X,d<X,d 1 Since Hnixad, Xtd) is nonnegative definite and idempotent, it can be decomposed as Hnixsd, Xtd) = W'AW, where W is orthogonal and A = diag{l, • • •, 1,0, • • •, 0) with p := rank{Hn(xsd, Xtd)) = rank{A) < po- Set Q = {Ip,0)W. Then Q has fuh row rank p. 
Let Q' = (qi, • • •,qp) and Ui = q^el = qJEllV ^i/n(rP.i, rP)]c„, / = 1,... ,p. Then, 1=1 Since p < po, as in the proof of Lemma 3.2, it suffices to show, for any /, that E m'>^%^ln'n|X,}->0, asn^O. Noting that p = trace{Hn{x,d,Xtd)) = Ef=i II qi IP> we have || q, ||2= qjq, < p < po and II q;E!lV^i^n(r?.i,rf) r< aL. || q/ |P< ^LxJ^o < crLxPg, where / = l,...,p. By Lemma 112 4.2, with ^0 = Tx/umaxPo we have T2 V, i_2 ^ E 2exp(--^.^^hin)exp(ci(-^)VL.Fo) <n(n - l)/n3exp(ciToVPo) -> 0, as ra -> oo, where c\ is the constant specified in Lemma 4.2. Finally, by appealing to the dominated convergence theorem we obtain the desired result without conditioning. % Corollary 4.1 Consider the segmented regression model 4-1 • (i) For any j and (a, /?] C (r^.i, r]>], 5„(a, 7/) = a]è'n{a, r])€n(a, rj) - Tn{a, rj). (ii) Suppose Assumption 4-0 is satisfied. Let m > 1. Then uniformly for all (oi, • • •, a^) such that -oo < cx < • • • < < oo, m+l°+l 5„(6,---,W)= Y SniÇi.x,^i) = rn'ë^n + Op{ln'n), i=i where 6 = -oo, fm+zo+i = oo, and {^i,-• • ,^m+i°} is the set {ri°, • • •, r°o, ai, • • •, a„} after ordering its elements. Proof: (i) Replace ë„(a, rj) in the proof of Proposition 3.1 (i) by c^(a, rj) = /„(a, r])ê^ and note €^(0,77) = ajën(a,rj) when (a,77) C {TJ_I,T^]. The result obtains immediately. (") By (i), SniÙ, • ••,Çm+l°) «=1 m+l°+l 1=1 m+l° + l «=1 Note that each of (^j_i,^j] is contained in one of (r°_i, rj*], j = 1, • • •, /° + 1. By Lemma 4.3, E.'lt'""'' Tn{ii-x,ii) < (m + /« + 1) sup,<, r„(a < T?) = O^Cln^ n). 1[ Lemma 4.4 Under the condition of Theorem 4-1, there exists S G (0, mini<j</o(TJ^-, — TJ')/2) such that for r = 1,..., /°, [5„(r° - 6,r° + S)- Snir"^ - é,r^,) - 5„(r°,r° + <5)]/n ^ (4.4) /or some Cr > 0 as n —>^ oo, r = 1,...,/° + 1. Proof It suffices to prove the result when /" = 1. For notational simplicity, we omit the subscripts and superscripts 0 in this proof. 
For the 6 in Assumption 4.I, denote = X„(ri -S,Ti),X^ = XniTi,Ti+S),X* = Xn(Ti-S,Ti+S),ël = <7i/„(rin)ë„, = Cr2ln(Tl,Ti+6)ën, = + and /3 = {X*'X*)~X*'Yn. As in ordinary regression, we have =\\x{Pi+x;h + ê*-x'k? = ||Xr(Â-^) + ^2*(^2-^) + 6lP =\\x*{h - h' + \\x;02 - h' + + 2e*'xr(Â -h + ^i-'x^i^ - h Note that {xJ and {j/J in Model (4.2) are strictly stationary and ergodic. It then follows from the strong law of large numbers for stationary ergodic stochastic processes that as n —»• oo, 1 ' 1 " as -X* X" = - VxiX'il(^.^e(^j_5,Ti + 5]) ^{xix'il(^j^ç(.,j_6,^j + 6])} > 0, 71 . ^ -xfx; and «•=1 ' i;{xix'il(^,^e(ri-5,Ti])} > 0, if j=l, £{xixil(^,,G(^,,^,+5])} > 0, if j=2, -X* Y„ ^ E{yiXil(xue{Ti-s,Ti+i])}, Th where E{yiXil^^^^ç(^r,-s,T,+s])} = -E{xixil(^j^e(rj-5,ri])}Â + £^{xixil(^^^6(^i,^,+5])}^2-Therefore, P ^ {X;{xixil(^„e(^j_5,^,+5])}}'^^{î/iXil(xi,e(n-5,ri+5])} =: P'-Similarly, it can be shown that f iP, - ^•)'E(xix'il(x..e(n-5,n]))(^i - if J=l, 02 - ^•)'i;(x:xil(,,,e(.,,,,+^]))(/32 - ^S*), if j=2. 7t -c*'x;(^,-^)^0, for j = 1,2, Th and n where pi = P{xid € (n - 6,TI]} and p2 = P{xid € (ri,ri + S]}. Thus, as n -> oo, ^5„(ri -6, Tl + (5) has a finite limit, given by lim -5„(ri - 6,TI + S) =(Â - /3-)'i;(xixil(,,,e(.,_5,.,]))(;3i - n + 02 - ^*)'£(xix;i(.,,e(,,,,,+5]))(/32 - PI + (Tlpi+alp2. It remains to show that ~Sn{Ti - S,TI) and ^5'„(ri,ri + ^) converge to ajpi and cr^p2 respectively, and either (Â - P*yEixix[l(,,,^^r,-s,n]))0i - P*) > 0 or (^2 - P*yE{xix[ 1(xide(Ti,Ti+s])) • (02 — p*) > 0. The latter is a direct consequence of the assumed conditions while the former can be shown again by the strong law of large numbers. To this end, we first write 5n(Ti — 6,TI) in the following form, Sniri - 6, n) = êl'êl - Tn{ri - 6, n) using Corollary 4.1 (i). Bearing in mind Eel = 1» by the strong law of large numbers, iê-'ëî ^ <rlE[ell^^^,^^r,-s,r.])] = <TlP{xid e in -Tl Tl and W = lim„_»oo ^X^'X^ is positive definite under the assumption. 
Therefore, Tn{ri - è,T,) = {--el'Xl){-XrX*,)-{-Xl'ël) ^ OW-'O = 0. n n n Thus, ^Sn{Ti—è,Ti) cr^pi. The same argument can also be used to show that ^5„(ri,ri-|-6) o'2P2. This completes the proof. f Now define al — Y^j^i PJ(TJ, where Pj = P{xid G 7"°]}. Applying the strong law of large numbers to {efl(x,<,e(TO_i,Tp])} for all j, we obtain ^è^'è^ ^ al. Lemma 4.5 Under the condition of Theorem 4-i, we have (i) for every I < 1°, P{âf > al + C} —>• 1, as n oo for some C > 0, and (ii) for every I such that P < I < L, where L is an upper bound of P, 0 < i^'c'^ _ âf = Op(ln\n)/n), Tt where aj = ^5'„(fi,..., f;) is the estimated al when the true number of thresholds is assumed to he I. Proof (i) Since / < /°, for 6 € (0, mini<j<;o (rj*^! - T^)/2) in Assumption 4.I, there exists 1 < r < /o, such that {h,...,fi)€ A^ := {(ri,..., r,) : |r, - r"! >(5, s = 1,..., /}. Hence, if we can show that for each r, 1 < r < with probability approaching 1, min Snin,---,Ti)/n> al + Cr, for some Cr > 0, then by choosing C := mini<r<(o {Cr}, we prove the desired result. For any (ri,---,r/) G Ar, let 6 < ••• < be the ordered set {ri,..., r,, rf,..., , r° - 6, + ^, T°^i,. ..,Tfo} and let ^0 = -0°, ^i+i°+2 = oo- Then it follows from Corollary 4.1 (ii) that uniformly in Ar, -SniTi,---,Ti) n n . 1+1°+2 = - E "^"(0-1,6) (4.5) = -[ E SMJ-U^J) + 5„(r° - S,r°) + 5„(r°,r° + ,5)] + ^[5„(r° - r° + ,5) - 5„(r° - ^, r°) - 5„(r°, r° + 6)] = ie-'6- + Op(ln2(n)/n) + i[5„(r° - ^, r° + S) - 5„(r° - ^, r°) - 5„(r°, r° + ^)]. 7i Tt By the strong law of large numbers the first term on the RHS is CTQ + o(l) a.s.. By Lemma 4.4, the third term on the RHS is Cr + o(l) a.s.. Thus -SniTl,---,Ti)>al+Cr + Opil), Tt where Cr is defined in (4.4). (u) Let 6 < ••• < (1+1° be the ordered set, {h-,-• • ,TI,T^,• • • ,Tfo}, - = -00 and Çi+io^i = T°o^, = OO. Since / > /°, by Corollary 4.1 (ii) again, >5n(r°,.-.,rfo) =naï = E '^n(6-l,6) j=l =ël'rn + Op{ln\n)). This proves (ii). 
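Proof of Theorem 4.1 By Lemma 4.5 (i), for $l < l^0$ and sufficiently large $n$, there exists $C > 0$ such that

$$MIC(l) = \ln(\hat\sigma_l^2) + p^*(\ln n)^{2+\delta_0}/n > \ln(\sigma_0^2 + C/2) = \ln(\sigma_0^2) + \ln(1 + C/(2\sigma_0^2))$$

with probability approaching 1. By Lemma 4.5 (ii), for $l\ge l^0$,

$$MIC(l) = \ln(\hat\sigma_l^2) + p^*(\ln n)^{2+\delta_0}/n \xrightarrow{p} \ln\sigma_0^2.$$

Thus, $P\{\hat l\ge l^0\}\to 1$ as $n\to\infty$. By Lemma 4.5 (ii) and the strong law of large numbers, for $l^0 < l\le L$,

$$0 \ge \hat\sigma_l^2 - \hat\sigma_{l^0}^2 = \Big[\hat\sigma_l^2 - \frac{1}{n}\tilde e_n'\tilde e_n\Big] - \Big[\hat\sigma_{l^0}^2 - \frac{1}{n}\tilde e_n'\tilde e_n\Big] = O_p(\ln^2 n/n),$$

and

$$[\hat\sigma_{l^0}^2 - \sigma_0^2] = \Big[\hat\sigma_{l^0}^2 - \frac{1}{n}\tilde e_n'\tilde e_n\Big] + \Big[\frac{1}{n}\tilde e_n'\tilde e_n - \sigma_0^2\Big] = O_p(\ln^2 n/n) + o_p(1) = o_p(1).$$

Hence $0\le(\hat\sigma_{l^0}^2 - \hat\sigma_l^2)/\hat\sigma_{l^0}^2 = O_p(\ln^2(n)/n)$. Note that for $0 < x < 1/2$, $\ln(1-x) > -2x$. Therefore,

$$\begin{aligned}
MIC(l) - MIC(l^0) &= \ln(\hat\sigma_l^2) - \ln(\hat\sigma_{l^0}^2) + c_0(l-l^0)(\ln n)^{2+\delta_0}/n \\
&= \ln(1 - (\hat\sigma_{l^0}^2 - \hat\sigma_l^2)/\hat\sigma_{l^0}^2) + c_0(l-l^0)(\ln n)^{2+\delta_0}/n \\
&\ge -2O_p(\ln^2(n)/n) + c_0(l-l^0)(\ln n)^{2+\delta_0}/n \\
&> 0
\end{aligned}$$

for sufficiently large $n$. Whence $\hat l\xrightarrow{p} l^0$ as $n\to\infty$. ∎

To prove Theorem 4.2, we need the following lemma.

Lemma 4.6 Under the assumptions of Theorem 4.2, for any sufficiently small $\delta\in(0,\min_{1\le j\le l^0}(\tau_{j+1}^0-\tau_j^0)/2)$, there exists a constant $C_r > 0$ such that

$$\frac{1}{n}[S_n(\tau_r^0-\delta,\tau_r^0+\delta) - S_n(\tau_r^0-\delta,\tau_r^0) - S_n(\tau_r^0,\tau_r^0+\delta)] \xrightarrow{a.s.} C_r, \quad \text{as } n\to\infty,$$

where $r = 1,\dots,l^0$.

Proof It suffices to prove the result for the case when $l^0 = 1$. For any small $\delta > 0$, all the arguments in the proof of Lemma 4.4 apply under Assumption 4.2. Hence, the result holds. ∎

Proof of Theorem 4.2 By Theorem 4.1, the problem can be restricted to $\{\hat l = l^0\}$. For any sufficiently small $\delta' > 0$, substituting $\delta'$ for the $\delta$ in (4.5) in the proof of Lemma 4.5 (i), we have the following inequality:

$$\frac{1}{n}S_n(\tau_1,\dots,\tau_{l^0}) \ge \frac{1}{n}\tilde e_n'\tilde e_n + O_p(\ln^2(n)/n) + \frac{1}{n}[S_n(\tau_r^0-\delta',\tau_r^0+\delta') - S_n(\tau_r^0-\delta',\tau_r^0) - S_n(\tau_r^0,\tau_r^0+\delta')],$$

uniformly in $(\tau_1,\dots,\tau_{l^0})\in A_r := \{(\tau_1,\dots,\tau_{l^0}) : |\tau_s-\tau_r^0| > \delta',\ 1\le s\le l^0\}$. By Lemma 4.6, the last term on the RHS converges to a positive $C_r$ for every $r$. And for sufficiently large $n$, the $O_p(\ln^2(n)/n)$ term is smaller than $\frac{1}{2}\min_{1\le r\le l^0}(C_r)$. Thus, uniformly in $A_r$, $r = 1,\dots,l^0$, and with probability tending to 1,

$$\frac{1}{n}S_n(\tau_1,\dots,\tau_{l^0}) > \frac{1}{n}\tilde e_n'\tilde e_n + \frac{C_r}{2}.$$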
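This implies that with probability approaching 1 no $\tau$ in $A_r$ is qualified as a candidate of $\hat\tau$, where $\hat\tau = (\hat\tau_1,\dots,\hat\tau_{l^0})$. In other words, $P(\hat\tau\in A_r^c)\to 1$ as $n\to\infty$. Since this is true for all $r$, $P\{\hat\tau\in\bigcap_{r=1}^{l^0}A_r^c\}\to 1$ as $n\to\infty$. Note that for $\delta' < \min_{0\le i\le l^0}\{(\tau_{i+1}^0-\tau_i^0)/2\}$,

$$\bigcap_{r=1}^{l^0}\{|\hat\tau_r-\tau_r^0|\le\delta'\} = \bigcap_{r=1}^{l^0}\{|\hat\tau_{i_r}-\tau_r^0|\le\delta' \text{ for some } 1\le i_r\le l^0\} = \Big\{\hat\tau\in\bigcap_{r=1}^{l^0}A_r^c\Big\}.$$

Thus we have

$$P(|\hat\tau_r-\tau_r^0|\le\delta' \text{ for } r = 1,\dots,l^0) = P\Big(\hat\tau\in\bigcap_{r=1}^{l^0}A_r^c\Big)\to 1, \quad \text{as } n\to\infty,$$

which completes the proof. ∎

Proof of Theorem 4.3 Let $\hat\sigma_j^{*2}$ and $\hat\beta_j^*$ be the "least squares estimates" of $\sigma_j^2$ and $\tilde\beta_j$, $j = 1,\dots,l^0+1$, when $l^0$ and $(\tau_1^0,\dots,\tau_{l^0}^0)$ are assumed known. First, we shall show that the $\hat\beta_j$'s are consistent. By the strong law of large numbers for ergodic sequences, $\hat\beta_j^*-\tilde\beta_j = o_p(1)$, $j = 1,\dots,l^0+1$. So it suffices to show that $\hat\beta_j-\hat\beta_j^* = o_p(1)$ for each $j$. Set $X_j^* = I_n(\tau_{j-1}^0,\tau_j^0)X_n$ and $\hat X_j = I_n(\hat\tau_{j-1},\hat\tau_j)X_n$. Then,

$$\begin{aligned}
\hat\beta_j-\hat\beta_j^* &= \Big[\Big(\frac{1}{n}\hat X_j'\hat X_j\Big)^{-} - \Big(\frac{1}{n}X_j^{*\prime}X_j^*\Big)^{-}\Big]\Big[\frac{1}{n}\hat X_j'Y_n\Big] + \Big[\Big(\frac{1}{n}X_j^{*\prime}X_j^*\Big)^{-}\Big]\Big[\frac{1}{n}(\hat X_j-X_j^*)'Y_n\Big] \\
&= \Big[\Big(\frac{1}{n}\hat X_j'\hat X_j\Big)^{-} - \Big(\frac{1}{n}X_j^{*\prime}X_j^*\Big)^{-}\Big]\Big\{\frac{1}{n}(\hat X_j-X_j^*)'Y_n + \frac{1}{n}X_j^{*\prime}Y_n\Big\} + \Big[\Big(\frac{1}{n}X_j^{*\prime}X_j^*\Big)^{-}\Big]\Big[\frac{1}{n}(\hat X_j-X_j^*)'Y_n\Big] \\
&=: (I)\{(II)+(III)\} + (IV)(II),
\end{aligned}$$

where $(I) = [(\frac{1}{n}\hat X_j'\hat X_j)^{-} - (\frac{1}{n}X_j^{*\prime}X_j^*)^{-}]$, $(II) = \frac{1}{n}(\hat X_j-X_j^*)'Y_n$, $(III) = \frac{1}{n}X_j^{*\prime}Y_n$ and $(IV) = [(\frac{1}{n}X_j^{*\prime}X_j^*)^{-}]$. By the strong law of large numbers, both (III) and (IV) are $O_p(1)$. By Theorem 4.2, $\hat\tau-\tau^0 = o_p(1)$. Proposition 3.2 implies that there exists a sequence $\{a_n\}$, $a_n\to 0$ as $n\to\infty$, such that $\hat\tau-\tau^0 = O_p(a_n)$. Note that

$$(II) = \frac{1}{n}\sum_{t=1}^n x_ty_t(1_{(x_{td}\in\hat R_j)} - 1_{(x_{td}\in R_j^0)}),$$

where $\hat R_j = (\hat\tau_{j-1},\hat\tau_j]$ and $R_j^0 = (\tau_{j-1}^0,\tau_j^0]$. Taking $u > 1$ and $z_t = a'x_ty_t$ for any real vector $a$, it follows from Lemma 3.6 that $(II) = o_p(1)$. It is shown in the proof of Theorem 3.3 that $(I) = o_p(1)$. Thus, $\hat\beta_j-\hat\beta_j^* = o_p(1)$, $j = 1,\dots,l^0+1$.

Next, we shall show that the $\hat\sigma_j^2$'s are consistent. When $l^0$ and $(\tau_1^0,\dots,\tau_{l^0}^0)$ are known, the least squares estimates $\hat\sigma_j^{*2}$ are obtained from each regime separately. Hence within each regime, applying Corollary 4.1 (i) and Lemma 4.3, we obtain that

$$n_j\hat\sigma_j^{*2} = \sum_{t=1}^n\sigma_j^2\epsilon_t^21_{(x_{td}\in R_j^0)} + O_p(\ln^2 n), \qquad (4.6)$$

where $n_j = \sum_{t=1}^n 1_{(x_{td}\in R_j^0)}$ is the number of observations in the $j$th regime.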
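By the strong law of large numbers and Lemma 4.3, $n_j/n\xrightarrow{a.s.}p_j$ as $n\to\infty$, and

$$\hat\sigma_j^{*2} = \frac{n}{n_j}\cdot\frac{\sigma_j^2}{n}\sum_{t=1}^n\epsilon_t^21_{(x_{td}\in R_j^0)} + O_p\Big(\frac{\ln^2 n}{n}\Big) = \sigma_j^2 + o_p(1).$$

Therefore, it remains to show that $\hat\sigma_j^2-\hat\sigma_j^{*2} = o_p(1)$. Recall $\hat n_j = \sum_{t=1}^n 1_{(x_{td}\in\hat R_j)}$. Applying Lemma 3.6 to $z_t = 1$ we obtain $\frac{1}{n}\hat n_j = \frac{1}{n}n_j + o_p(1) = p_j + o_p(1)$. Thus, it suffices to show that $\frac{1}{n}[S_n(\hat\tau_{j-1},\hat\tau_j) - S_n(\tau_{j-1}^0,\tau_j^0)] = o_p(1)$. Since

$$S_n(\hat\tau_{j-1},\hat\tau_j) = Y_n'(I_n(\hat\tau_{j-1},\hat\tau_j) - H_n(\hat\tau_{j-1},\hat\tau_j))Y_n$$

and

$$S_n(\tau_{j-1}^0,\tau_j^0) = Y_n'(I_n(\tau_{j-1}^0,\tau_j^0) - H_n(\tau_{j-1}^0,\tau_j^0))Y_n,$$

we have that

$$\begin{aligned}
\frac{1}{n}[S_n(\hat\tau_{j-1},\hat\tau_j) - S_n(\tau_{j-1}^0,\tau_j^0)] &= \frac{1}{n}\sum_{t=1}^n y_t^2(1_{(x_{td}\in\hat R_j)} - 1_{(x_{td}\in R_j^0)}) - \frac{1}{n}\{Y_n'\hat X_j(\hat X_j'\hat X_j)^{-}\hat X_j'Y_n - Y_n'X_j^*(X_j^{*\prime}X_j^*)^{-}X_j^{*\prime}Y_n\} \\
&= \frac{1}{n}\sum_{t=1}^n y_t^2(1_{(x_{td}\in\hat R_j)} - 1_{(x_{td}\in R_j^0)}) - \frac{1}{n}\{Y_n'\hat X_j(\hat X_j'\hat X_j)^{-}(\hat X_j-X_j^*)'Y_n \qquad (4.7)\\
&\qquad + Y_n'\hat X_j[(\hat X_j'\hat X_j)^{-} - (X_j^{*\prime}X_j^*)^{-}]X_j^{*\prime}Y_n + Y_n'(\hat X_j-X_j^*)(X_j^{*\prime}X_j^*)^{-}X_j^{*\prime}Y_n\} \\
&= \frac{1}{n}\sum_{t=1}^n y_t^2(1_{(x_{td}\in\hat R_j)} - 1_{(x_{td}\in R_j^0)}) - \frac{1}{n}\{Y_n'\hat X_j[(\hat X_j'\hat X_j)^{-} - (X_j^{*\prime}X_j^*)^{-} + (X_j^{*\prime}X_j^*)^{-}](\hat X_j-X_j^*)'Y_n \\
&\qquad + Y_n'\hat X_j[(\hat X_j'\hat X_j)^{-} - (X_j^{*\prime}X_j^*)^{-}]X_j^{*\prime}Y_n + Y_n'(\hat X_j-X_j^*)(X_j^{*\prime}X_j^*)^{-}X_j^{*\prime}Y_n\} \\
&= \frac{1}{n}\sum_{t=1}^n y_t^2(1_{(x_{td}\in\hat R_j)} - 1_{(x_{td}\in R_j^0)}) \\
&\qquad - \{((II)+(III))'[(I)+(IV)](II) + ((II)+(III))'[(I)](III) + (II)'(IV)(III)\}.
\end{aligned}$$

Taking $u > 1$ and $z_t = y_t^2$, it follows from Theorem 4.2 and Lemma 3.6 that $\frac{1}{n}\sum_{t=1}^n y_t^2(1_{(x_{td}\in\hat R_j)} - 1_{(x_{td}\in R_j^0)}) = o_p(1)$. As we have previously shown, $(I) = o_p(1)$, $(II) = o_p(1)$, $(III) = O_p(1)$ and $(IV) = O_p(1)$. Hence

$$\frac{1}{n}[S_n(\hat\tau_{j-1},\hat\tau_j) - S_n(\tau_{j-1}^0,\tau_j^0)] = o_p(1) - \{(o_p(1)+O_p(1))[o_p(1)+O_p(1)]o_p(1) + (o_p(1)+O_p(1))[o_p(1)]O_p(1) + o_p(1)O_p(1)O_p(1)\} = o_p(1).$$ ∎

Proposition 4.1 (Brockwell and Davis, 1987, p219-220) Let

$$\epsilon_t = \sum_{j=-\infty}^{\infty}\psi_j\zeta_{t-j},$$

where $\{\zeta_t\}$ is iid with mean zero and variance $\sigma^2$, $E\zeta_t^4 = \eta\sigma^4$ and $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$. Then,

$$E(\epsilon_t^4) = 3\gamma^2(0) + (\eta-3)\sigma^4\sum_i\psi_i^4, \qquad (4.8)$$

and

$$\lim_{n\to\infty}n\,Var\Big(\frac{1}{n}\sum_{t=1}^n\epsilon_t^2\Big) = (\eta-3)\gamma^2(0) + 2\sum_{j=-\infty}^{\infty}\gamma^2(j), \qquad (4.9)$$

where $\gamma(\cdot)$ is the autocovariance function of $\{\epsilon_t\}$.

We would remark that under Assumption 4.0, $\gamma(j) = \sigma_\zeta^2\sum_{i=0}^{\infty}\psi_i\psi_{i+j}$. In particular, $\gamma(0) = E(\epsilon_t^2) = 1$. Now, we restate Lemma 3.7 with appropriately modified hypotheses.

Lemma 4.7 Let $\{k_n\}$ be a sequence of positive numbers such that $k_n\to 0$ and $nk_n\to\infty$. Suppose Assumptions 4.0 and 4.3 are satisfied.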
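Then for any $j = 1,\dots,l^0$,

(i) $$\frac{1}{nk_n}X_n'(\tau_j^0-k_n,\tau_j^0)X_n(\tau_j^0-k_n,\tau_j^0)\xrightarrow{p}E(x_1x_1'|x_{1d}=\tau_j^0)f_d(\tau_j^0),$$
$$\frac{1}{nk_n}X_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n)\xrightarrow{p}E(x_1x_1'|x_{1d}=\tau_j^0)f_d(\tau_j^0),$$

(ii) $$\frac{1}{nk_n}\tilde e_n'(\tau_j^0-k_n,\tau_j^0)\tilde e_n(\tau_j^0-k_n,\tau_j^0)\xrightarrow{p}\sigma_j^2f_d(\tau_j^0),$$
$$\frac{1}{nk_n}\tilde e_n'(\tau_j^0,\tau_j^0+k_n)\tilde e_n(\tau_j^0,\tau_j^0+k_n)\xrightarrow{p}\sigma_{j+1}^2f_d(\tau_j^0),$$

(iii) $$\frac{1}{nk_n}\tilde e_n'(\tau_j^0-k_n,\tau_j^0)X_n(\tau_j^0-k_n,\tau_j^0)\xrightarrow{p}0,$$
$$\frac{1}{nk_n}\tilde e_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n)\xrightarrow{p}0.$$

Proof (i) is the same as in Lemma 3.7; hence, it suffices to show the second equation in each of (ii) and (iii).

(ii) Noting that for sufficiently large $n$, $\tilde e_n(\tau_j^0,\tau_j^0+k_n) = \sigma_{j+1}\tilde\epsilon_n(\tau_j^0,\tau_j^0+k_n)$, it suffices to show that $\frac{1}{nk_n}\tilde\epsilon_n'(\tau_j^0,\tau_j^0+k_n)\tilde\epsilon_n(\tau_j^0,\tau_j^0+k_n)\xrightarrow{p}f_d(\tau_j^0)$ as $n\to\infty$. Let $y_{nt} = 1_{(x_{td}\in(\tau_j^0,\tau_j^0+k_n])}$, $\mu_n = E(y_{nt})$ and $\sigma_n^2 = Var(y_{nt})$. Then,

$$\mu_n = P(x_{td}\in(\tau_j^0,\tau_j^0+k_n]) = (f_d(\tau_j^0)+o(1))k_n,$$
$$\sigma_n^2 = E(1^2_{(x_{td}\in(\tau_j^0,\tau_j^0+k_n])}) - [E(1_{(x_{td}\in(\tau_j^0,\tau_j^0+k_n])})]^2 = \mu_n-\mu_n^2 = (f_d(\tau_j^0)+o(1))k_n.$$

In particular, $\mu_n/k_n\to f_d(\tau_j^0)$ as $n\to\infty$. It therefore suffices to show that

$$\frac{1}{nk_n}\sum_{t=1}^n\epsilon_t^21_{(x_{td}\in(\tau_j^0,\tau_j^0+k_n])} - \mu_n/k_n\xrightarrow{p}0, \quad n\to\infty,$$

or

$$\frac{1}{nk_n}\sum_{t=1}^n(\epsilon_t^2y_{nt}-\mu_n)\xrightarrow{p}0, \quad n\to\infty.$$

Since $E(\epsilon_t^2) = 1$ and hence $E(\epsilon_t^2y_{nt}) = E(\epsilon_t^2)E(y_{nt}) = \mu_n$, this last result would be implied by

$$\frac{1}{(nk_n)^2}Var\Big(\sum_{t=1}^n\epsilon_t^2y_{nt}\Big)\to 0.$$

Note that, conditioning on the noise sequence,

$$\begin{aligned}
\frac{1}{(nk_n)^2}Var\Big(\sum_{t=1}^n\epsilon_t^2y_{nt}\Big) &= \frac{1}{(nk_n)^2}\Big\{Var\Big[\sum_{t=1}^n\epsilon_t^2\mu_n\Big] + E\Big[\sum_{t=1}^n\epsilon_t^4\sigma_n^2\Big]\Big\} \\
&= O(1)\,Var\Big(\frac{1}{n}\sum_{t=1}^n\epsilon_t^2\Big) + O(1)\frac{1}{nk_n}E(\epsilon_t^4) \\
&= O(1)\,Var\Big(\frac{1}{n}\sum_{t=1}^n\epsilon_t^2\Big) + o(1)E(\epsilon_t^4).
\end{aligned}$$

It remains to show that $Var(\frac{1}{n}\sum_{t=1}^n\epsilon_t^2) = o(1)$ and $E(\epsilon_t^4) = O(1)$. To this end observe that $\sum_{i=0}^{\infty}\psi_i^4 < \infty$ and hence, by equation (4.8), that $E(\epsilon_t^4) = O(1)$. Now,

$$\sum_{j=0}^{\infty}\gamma^2(j) = \sum_{j=0}^{\infty}\Big(\sigma_\zeta^2\sum_{i=0}^{\infty}\psi_i\psi_{i+j}\Big)^2 \le \sigma_\zeta^4\sum_{j=0}^{\infty}\Big(\sum_{i=0}^{\infty}|\psi_i\psi_{i+j}|\Big)^2 \le \sigma_\zeta^4\sum_{j=0}^{\infty}\Big(\sum_{i=0}^{\infty}\frac{k_\psi}{(i+1)^\delta}|\psi_{i+j}|\Big)^2 \le \sigma_\zeta^4k_\psi^2\sum_{j=0}^{\infty}\Big(\sum_{i=0}^{\infty}|\psi_{i+j}|\Big)^2 < \infty.$$

Consequently, $\sum_{j=-\infty}^{\infty}\gamma^2(j) = 2\sum_{j=0}^{\infty}\gamma^2(j) - \gamma^2(0) < \infty$, and hence, by equation (4.9),

$$Var\Big(\frac{1}{n}\sum_{t=1}^n\epsilon_t^2\Big) = o(1).$$

(iii) Since $\tilde e_n(\tau_j^0,\tau_j^0+k_n) = \sigma_{j+1}\tilde\epsilon_n(\tau_j^0,\tau_j^0+k_n)$, it suffices to show that

$$\frac{1}{nk_n}\tilde\epsilon_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n)\xrightarrow{p}0, \quad n\to\infty,$$

or, for any $a\ne 0$,

$$E\Big[\frac{1}{nk_n}\tilde\epsilon_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n)a\Big]^2 = o(1).$$

But

$$E[a'x_11_{(x_{1d}\in(\tau_j^0,\tau_j^0+k_n])}] = (E[a'x_1|x_{1d}=\tau_j^0]f_d(\tau_j^0)+o(1))k_n$$

and

$$E[(a'x_1)^21_{(x_{1d}\in(\tau_j^0,\tau_j^0+k_n])}] = (E[(a'x_1)^2|x_{1d}=\tau_j^0]f_d(\tau_j^0)+o(1))k_n.$$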
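Consequently,

$$\begin{aligned}
E\Big[\frac{1}{nk_n}\tilde\epsilon_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n)a\Big]^2 &= \frac{1}{(nk_n)^2}E\Big[\sum_{t=1}^n\epsilon_ta'x_t1_{(x_{td}\in(\tau_j^0,\tau_j^0+k_n])}\Big]^2 \\
&= \frac{1}{(nk_n)^2}\Big\{\sum_{t=1}^nE(\epsilon_t^2)E[(a'x_t)^21_{(x_{td}\in(\tau_j^0,\tau_j^0+k_n])}] + 2\sum_{t>s}E(\epsilon_t\epsilon_s)E[a'x_t1_{(x_{td}\in(\tau_j^0,\tau_j^0+k_n])}]E[a'x_s1_{(x_{sd}\in(\tau_j^0,\tau_j^0+k_n])}]\Big\} \\
&\le o(1) + O(1)\frac{1}{n^2}\sum_{t>s}\sum_{i,j:\,i-j=t-s}|\psi_i\psi_j| \\
&= o(1) + O(1)\frac{1}{n^2}\sum_{k=1}^{n-1}\sum_{s=1}^{n-k}\sum_{i,j:\,i-j=k}|\psi_i\psi_j| \\
&= o(1) + O(1)\frac{1}{n^2}\sum_{k=1}^{n-1}(n-k)\sum_{i=0}^{\infty}|\psi_{i+k}\psi_i| \\
&\le o(1) + O(1)\frac{1}{n}\sum_{i=0}^{\infty}\sum_{k=1}^{n-1}|\psi_{i+k}\psi_i| \\
&\le o(1) + O\Big(\frac{1}{n}\Big)\sum_{i=0}^{\infty}\sum_{k=1}^{\infty}|\psi_{i+k}\psi_i| \\
&\le o(1) + O\Big(\frac{1}{n}\Big)\sum_{i=0}^{\infty}\Big(\sum_{k=0}^{\infty}|\psi_{i+k}|\Big)^2 = o(1).
\end{aligned}$$

This completes the proof. ∎

With Lemmas 3.6, 4.3, 4.7 and Theorems 4.2, 4.3, the proof of Theorem 4.4 is analogous to that of Theorem 3.4.

Proof of Theorem 4.4 By Theorem 4.1, the problem can be restricted to $\{\hat l = l^0\}$. Suppose for some $j$, $P(x_1'(\tilde\beta_{j+1}-\tilde\beta_j)\ne 0\,|\,x_d = \tau_j^0) > 0$. Hence $\Delta = E[(x_1'(\tilde\beta_{j+1}-\tilde\beta_j))^2|x_d = \tau_j^0] > 0$. Let $\hat\beta(\alpha,\eta)$ be the minimizer of $\|Y_n(\alpha,\eta) - X_n(\alpha,\eta)\beta\|^2$. Set $k_n = K\ln^2 n/n$ for $n = 1,2,\dots$, where $K$ will be chosen later. The proofs of Lemma 3.6 and Theorem 4.3 show that if $\alpha_n\to\alpha$ and $\eta_n\to\eta$, then $\hat\beta(\alpha_n,\eta_n) - \hat\beta(\alpha,\eta)\xrightarrow{p}0$ as $n\to\infty$. Hence, since $\tau_j^0+k_n\to\tau_j^0$, $\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \hat\beta(\tau_{j-1}^0+\delta,\tau_j^0)\xrightarrow{p}0$ as $n\to\infty$. By Assumption 4.2, for any sufficiently small $\delta\in(0,\tau_j^0-\tau_{j-1}^0)$, $E\{x_1x_1'1_{(x_{1d}\in(\tau_{j-1}^0+\delta,\tau_j^0])}\}$ is positive definite; hence, by the strong law of large numbers, $\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0)\xrightarrow{a.s.}\tilde\beta_j$ as $n\to\infty$. Therefore $\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)\xrightarrow{p}\tilde\beta_j$. So, there exists a sufficiently small $\delta > 0$ such that for all sufficiently large $n$, $\|\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \tilde\beta_j\|\le\|\tilde\beta_j-\tilde\beta_{j+1}\|$ and $(\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \tilde\beta_{j+1})'E(x_1x_1'|x_{1d}=\tau_j^0)(\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \tilde\beta_{j+1}) > \Delta/2$ with probability approaching 1. Hence by Theorem 4.2, for any $\varepsilon > 0$, there exists $N_1$ such that for $n > N_1$, with probability larger than $1-\varepsilon$, we have
(i) $|\hat\tau_i-\tau_i^0| < \delta$, $i = 1,\dots,l^0$,
(ii) $\|\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \tilde\beta_{j+1}\|\le 2\|\tilde\beta_j-\tilde\beta_{j+1}\|$, and
(iii) $(\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \tilde\beta_{j+1})'E(x_1x_1'|x_{1d}=\tau_j^0)(\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n) - \tilde\beta_{j+1}) > \Delta/2$.

Let $A_j = \{(\tau_1,\dots,\tau_{l^0}) : |\tau_i-\tau_i^0| < \delta,\ i = 1,\dots,l^0,\ |\tau_j-\tau_j^0| > k_n\}$, $j = 1,\dots,l^0$. Since for the least squares estimates $\hat\tau_1,\dots,\hat\tau_{l^0}$, $S_n(\hat\tau_1,\dots,\hat\tau_{l^0})\le S_n(\tau_1^0,\dots,\tau_{l^0}^0)$,

$$\inf_{(\tau_1,\dots,\tau_{l^0})\in A_j}\{S_n(\tau_1,\dots,\tau_{l^0}) - S_n(\tau_1^0,\dots,\tau_{l^0}^0)\} > 0$$

implies $(\hat\tau_1,\dots,\hat\tau_{l^0})\notin A_j$, or, $|\hat\tau_j-\tau_j^0|\le k_n = K\ln^2 n/n$ when (i) holds.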
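By (i), if we show that for each $j$, there exists $N\ge N_1$ such that for all $n\ge N$, with probability larger than $1-2\varepsilon$, $\inf_{(\tau_1,\dots,\tau_{l^0})\in A_j}\{S_n(\tau_1,\dots,\tau_{l^0}) - S_n(\tau_1^0,\dots,\tau_{l^0}^0)\} > 0$, we will have proved the desired result. Furthermore, by symmetry, we can consider the case when $\tau_j > \tau_j^0$ only. Hence $A_j$ may be replaced by $A_j' = \{(\tau_1,\dots,\tau_{l^0}) : |\tau_i-\tau_i^0| < \delta,\ i = 1,\dots,l^0,\ \tau_j-\tau_j^0 > k_n\}$. For any $(\tau_1,\dots,\tau_{l^0})\in A_j'$, let $\xi_1\le\dots\le\xi_{2l^0+1}$ be the set $\{\tau_1,\dots,\tau_{l^0},\tau_1^0,\dots,\tau_{j-1}^0,\tau_{j-1}^0+\delta,\tau_{j+1}^0-\delta,\tau_{j+1}^0,\dots,\tau_{l^0}^0\}$ after ordering its elements and let $\xi_0 = -\infty$, $\xi_{2l^0+2} = \infty$. Using Corollary 4.1 (ii) twice, we have

$$\begin{aligned}
\sum_{i:\,\xi_i\ne\tau_j,\,\tau_{j+1}^0-\delta}S_n(\xi_{i-1},\xi_i) + S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta) &= [S_n(\tau_1^0,\dots,\tau_{l^0}^0) + O_p(\ln^2 n)] + O_p(\ln^2 n) \\
&= S_n(\tau_1^0,\dots,\tau_{l^0}^0) + O_p(\ln^2 n).
\end{aligned}$$

Thus,

$$\begin{aligned}
S_n(\tau_1,\dots,\tau_{l^0}) &\ge S_n(\xi_1,\dots,\xi_{2l^0+1}) = \sum_{i=1}^{2l^0+2}S_n(\xi_{i-1},\xi_i) \\
&= \sum_{i:\,\xi_i\ne\tau_j,\,\tau_{j+1}^0-\delta}S_n(\xi_{i-1},\xi_i) + S_n(\tau_{j-1}^0+\delta,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta) \\
&= \Big\{\sum_{i:\,\xi_i\ne\tau_j,\,\tau_{j+1}^0-\delta}S_n(\xi_{i-1},\xi_i) + S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta)\Big\} \\
&\quad + [S_n(\tau_{j-1}^0+\delta,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta)] - [S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta)] \\
&= S_n(\tau_1^0,\dots,\tau_{l^0}^0) + O_p(\ln^2 n) + [S_n(\tau_{j-1}^0+\delta,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta)] - [S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta)],
\end{aligned}$$

where $O_p(\ln^2 n)$ is independent of $(\tau_1,\dots,\tau_{l^0})\in A_j'$. It suffices to show that for $B_n = \{\tau_j : \tau_j\in(\tau_j^0+k_n,\tau_j^0+\delta)\}$ and sufficiently large $n$,

$$\inf_{\tau_j\in B_n}\{S_n(\tau_{j-1}^0+\delta,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta) - [S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta)]\}\ge M'\ln^2 n \qquad (4.10)$$

with probability larger than $1-2\varepsilon$ for some fixed $M' > 0$.

Let $S_n(\alpha,\eta;\beta) = \|Y_n(\alpha,\eta) - X_n(\alpha,\eta)\beta\|^2 = \sum_{t=1}^n(y_t-x_t'\beta)^21_{(x_{td}\in(\alpha,\eta])}$. Since $S_n(\alpha,\eta) = S_n(\alpha,\eta;\hat\beta(\alpha,\eta))$, we have

$$\begin{aligned}
S_n(\tau_{j-1}^0+\delta,\tau_j) &\ge S_n(\tau_{j-1}^0+\delta,\tau_j^0+k_n) + S_n(\tau_j^0+k_n,\tau_j) \\
&= S_n(\tau_{j-1}^0+\delta,\tau_j^0;\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)) + S_n(\tau_j^0,\tau_j^0+k_n;\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)) + S_n(\tau_j^0+k_n,\tau_j) \qquad (4.11)\\
&\ge S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_j^0+k_n;\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)) + S_n(\tau_j^0+k_n,\tau_j).
\end{aligned}$$

And since $(\tau_j^0+k_n,\tau_{j+1}^0-\delta]\subset(\tau_j^0,\tau_{j+1}^0]$ for sufficiently large $n$,

$$S_n(\tau_j^0+k_n,\tau_{j+1}^0-\delta;\tilde\beta_{j+1}) = \sigma_{j+1}^2\tilde\epsilon_n'(\tau_j^0+k_n,\tau_{j+1}^0-\delta)\tilde\epsilon_n(\tau_j^0+k_n,\tau_{j+1}^0-\delta).$$

Applying Corollary 4.1 (i), we have

$$0\le S_n(\tau_j^0+k_n,\tau_{j+1}^0-\delta;\tilde\beta_{j+1}) - [S_n(\tau_j^0+k_n,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta)] = T_n(\tau_j^0+k_n,\tau_j) + T_n(\tau_j,\tau_{j+1}^0-\delta).$$

By Lemma 4.3, the RHS is $O_p(\ln^2 n)$.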
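Thus,

$$\begin{aligned}
S_n(\tau_j^0,\tau_{j+1}^0-\delta) &\le S_n(\tau_j^0,\tau_{j+1}^0-\delta;\tilde\beta_{j+1}) \\
&= S_n(\tau_j^0,\tau_j^0+k_n;\tilde\beta_{j+1}) + S_n(\tau_j^0+k_n,\tau_{j+1}^0-\delta;\tilde\beta_{j+1}) \\
&\le S_n(\tau_j^0,\tau_j^0+k_n;\tilde\beta_{j+1}) + S_n(\tau_j^0+k_n,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta) + O_p(\ln^2 n),
\end{aligned}$$

where $O_p(\ln^2 n)$ is independent of $\tau_j$. Hence

$$S_n(\tau_j,\tau_{j+1}^0-\delta)\ge S_n(\tau_j^0,\tau_{j+1}^0-\delta) - S_n(\tau_j^0,\tau_j^0+k_n;\tilde\beta_{j+1}) - S_n(\tau_j^0+k_n,\tau_j) + O_p(\ln^2 n). \qquad (4.12)$$

Therefore, by (4.11) and (4.12),

$$\begin{aligned}
&[S_n(\tau_{j-1}^0+\delta,\tau_j) + S_n(\tau_j,\tau_{j+1}^0-\delta)] - [S_n(\tau_{j-1}^0+\delta,\tau_j^0) + S_n(\tau_j^0,\tau_{j+1}^0-\delta)] \\
&\qquad\ge S_n(\tau_j^0,\tau_j^0+k_n;\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)) - S_n(\tau_j^0,\tau_j^0+k_n;\tilde\beta_{j+1}) + O_p(\ln^2 n).
\end{aligned}$$

Let $M > 0$ be such that the term $|O_p(\ln^2 n)|\le M\ln^2 n$ with probability larger than $1-\varepsilon$ for all $n > N_1$. To show (4.10), it suffices to show that for sufficiently large $n$,

$$S_n(\tau_j^0,\tau_j^0+k_n;\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)) - S_n(\tau_j^0,\tau_j^0+k_n;\tilde\beta_{j+1}) - M\ln^2 n\ge M'\ln^2 n,$$

or

$$S_n(\tau_j^0,\tau_j^0+k_n;\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)) - S_n(\tau_j^0,\tau_j^0+k_n;\tilde\beta_{j+1})\ge(M'+M)\ln^2 n \qquad (4.13)$$

with large probability. Recall $S_n(\alpha,\eta;\beta) = \|Y_n(\alpha,\eta) - X_n(\alpha,\eta)\beta\|^2$ and $Y_n(\tau_j^0,\tau_j^0+k_n) = X_n(\tau_j^0,\tau_j^0+k_n)\tilde\beta_{j+1} + \sigma_{j+1}\tilde\epsilon_n(\tau_j^0,\tau_j^0+k_n)$. Taking $K$ sufficiently large and applying (ii), (iii) and Lemma 4.7 (i), (iii), we can see that there exists $N\ge N_1$ such that for any $n\ge N$, writing $\hat\beta$ for $\hat\beta(\tau_{j-1}^0+\delta,\tau_j^0+k_n)$,

$$\begin{aligned}
&\frac{1}{nk_n}[S_n(\tau_j^0,\tau_j^0+k_n;\hat\beta) - S_n(\tau_j^0,\tau_j^0+k_n;\tilde\beta_{j+1})] \\
&\quad= \frac{1}{nk_n}[\|Y_n(\tau_j^0,\tau_j^0+k_n) - X_n(\tau_j^0,\tau_j^0+k_n)\hat\beta\|^2 - \|Y_n(\tau_j^0,\tau_j^0+k_n) - X_n(\tau_j^0,\tau_j^0+k_n)\tilde\beta_{j+1}\|^2] \\
&\quad= \frac{1}{nk_n}[\|X_n(\tau_j^0,\tau_j^0+k_n)(\tilde\beta_{j+1}-\hat\beta) + \sigma_{j+1}\tilde\epsilon_n(\tau_j^0,\tau_j^0+k_n)\|^2 - \|\sigma_{j+1}\tilde\epsilon_n(\tau_j^0,\tau_j^0+k_n)\|^2] \\
&\quad= \frac{1}{nk_n}\|X_n(\tau_j^0,\tau_j^0+k_n)(\tilde\beta_{j+1}-\hat\beta)\|^2 + \frac{2\sigma_{j+1}}{nk_n}\tilde\epsilon_n'(\tau_j^0,\tau_j^0+k_n)X_n(\tau_j^0,\tau_j^0+k_n)(\tilde\beta_{j+1}-\hat\beta) \\
&\quad\ge\Delta/4 - \Delta/8\ge(M'+M)/K
\end{aligned}$$

with probability larger than $1-2\varepsilon$. Since $k_n = K\ln^2 n/n$, the above implies (4.13). ∎

The following lemma (cf. Hall and Heyde, 1980, Liu 1991) plays an important role in establishing the central limit theorem for the sample moments involving the $\{\epsilon_t\}$. Before we state the lemma, we need to introduce some notation. Let $T$ be an ergodic one-to-one measure-preserving transformation on the probability space $(\Omega,\mathcal F,P)$. Suppose $\mathcal U_0$ is a sub-$\sigma$-field of $\mathcal F$ satisfying $\mathcal U_0\subseteq T^{-1}(\mathcal U_0)$. Also suppose that $Z_0$ is a square integrable r.v. defined on $(\Omega,\mathcal F,P)$ with $E(Z_0) = 0$, and that $\{Z_t\}$ is a sequence of r.v.'s defined by $Z_t = Z_0(T^t\omega)$, $\omega\in\Omega$.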
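Let $\mathcal U_k = T^{-k}(\mathcal U_0)$, $k = 0,\pm 1,\dots$

Lemma 4.8 Suppose that $\mathcal U_0\subseteq T^{-1}(\mathcal U_0)$ and put $\mathcal U_k = T^{-k}(\mathcal U_0)$. Let $E(Z_0^2) < \infty$ and $E(Z_0) = 0$. If

$$\sum_{m=1}^{\infty}\{(E[E(Z_0|\mathcal U_{-m})]^2)^{1/2} + (E[Z_0-E(Z_0|\mathcal U_m)]^2)^{1/2}\} < \infty,$$

then $\sigma^{*2} := \lim_{n\to\infty}E(S_n^2)/n$ exists, where $S_n := \sum_{t=1}^nZ_t$. Further,

$$\frac{S_n}{\sqrt n}\xrightarrow{d}N(0,\sigma^{*2}).$$

Proof The proof is obtained from Hall and Heyde (1980, Theorem 5.5 and Corollary 5.4) or Liu (1991, Theorem 4.1). ∎

Proposition 4.2 (Brockwell and Davis, 1987, Remark 2, p212) Let

$$\epsilon_t = \sum_{j=-\infty}^{\infty}\psi_j\zeta_{t-j},$$

where $\{\zeta_t\}$ is an iid sequence of random variables each with mean zero and variance $\sigma^2$. If $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$, then $\sum_{h=-\infty}^{\infty}|\gamma(h)| < \infty$ and

$$\lim_{n\to\infty}n\,Var\Big(\frac{1}{n}\sum_{t=1}^n\epsilon_t\Big) = \sum_{h=-\infty}^{\infty}\gamma(h) = \sigma^2\Big(\sum_{j=-\infty}^{\infty}\psi_j\Big)^2.$$

To facilitate the statement of the next result let $G_j = E(x_1x_1'1_{(x_{1d}\in(\tau_{j-1}^0,\tau_j^0])})$, $\Gamma_j = G_j + 2\sum_{i=1}^{\infty}\gamma(i)E(x_11_{(x_{1d}\in(\tau_{j-1}^0,\tau_j^0])})E(x_1'1_{(x_{1d}\in(\tau_{j-1}^0,\tau_j^0])})$ and $\Sigma_j = \sigma_j^2G_j^{-1}\Gamma_jG_j^{-1}$, where $\gamma(i) = E(\epsilon_1\epsilon_{1+i})$ and $j = 1,\dots,l^0+1$. Also recall that for each $j = 1,\dots,l^0+1$, $\hat\beta_j^*$ is the least squares estimate of $\tilde\beta_j$ given the $\tau_j^0$'s.

Lemma 4.9 Under Assumptions 4.0, 4.1 and 4.3,

$$\sqrt n(\hat\beta_j^*-\tilde\beta_j)\xrightarrow{d}N(0,\Sigma_j), \quad j = 1,\dots,l^0+1.$$

Proof First, we shall show that

$$\frac{1}{\sqrt n}X_n'(\tau_{j-1}^0,\tau_j^0)\tilde\epsilon_n\xrightarrow{d}N(0,\Gamma_j).$$

It suffices to show that for any constant vector $a$,

$$\frac{1}{\sqrt n}a'X_n'(\tau_{j-1}^0,\tau_j^0)\tilde\epsilon_n\xrightarrow{d}N(0,\sigma_a^2),$$

where $\sigma_a^2 = a'\Gamma_ja$. By Assumption 4.3, $\{x_t\}_{-\infty}^{\infty}$ is an iid sequence of random variables. Let $\mathcal F_t = \sigma(\zeta_s,x_s,\ s\le t)$ denote the $\sigma$-field generated by $\{\zeta_s,x_s,\ s\le t\}$, and $Z_t = a'x_t\epsilon_t1_{(x_{td}\in(\tau_{j-1}^0,\tau_j^0])}$ for a given constant vector $a$. To show that $n^{-1/2}\sum_{t=1}^nZ_t$ has an asymptotic normal distribution, one needs to verify the conditions of Lemma 4.8. Thus, it suffices to show that $EZ_0 = 0$, $EZ_0^2 < \infty$, $\sum_{m=1}^{\infty}(E[E(Z_0|\mathcal F_{-m})]^2)^{1/2} < \infty$, and

$$\sum_{m=1}^{\infty}\{E[Z_0-E(Z_0|\mathcal F_m)]^2\}^{1/2} < \infty. \qquad (4.14)$$

Observe that $EZ_0 = a'E(x_01_{(x_{0d}\in(\tau_{j-1}^0,\tau_j^0])})E\epsilon_0 = 0$ and $EZ_0^2 = a'E(x_0x_0'1_{(x_{0d}\in(\tau_{j-1}^0,\tau_j^0])})a < \infty$. Also, for $m\ge 1$, $Z_0 = a'x_0\epsilon_01_{(x_{0d}\in(\tau_{j-1}^0,\tau_j^0])}$ is $\mathcal F_m$-measurable; hence $Z_0-E(Z_0|\mathcal F_m) = Z_0-Z_0 = 0$. So (4.14) is trivial. It remains to show that $\sum_{m=1}^{\infty}(E[E(Z_0|\mathcal F_{-m})]^2)^{1/2} < \infty$. Now, note that $E[E(Z_0|\mathcal F_{-m})]^2$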
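$$\begin{aligned}
&= E\Big[E\Big(a'x_01_{(x_{0d}\in(\tau_{j-1}^0,\tau_j^0])}\sum_{i=0}^{\infty}\psi_i\zeta_{-i}\,\Big|\,\mathcal F_{-m}\Big)\Big]^2 \\
&= [E(a'x_01_{(x_{0d}\in(\tau_{j-1}^0,\tau_j^0])})]^2E\Big[\sum_{i=m}^{\infty}\psi_i\zeta_{-i}\Big]^2 \\
&= [E(a'x_01_{(x_{0d}\in(\tau_{j-1}^0,\tau_j^0])})]^2\sigma_\zeta^2\sum_{i=m}^{\infty}\psi_i^2 \\
&= c_a\sum_{i=m}^{\infty}\psi_i^2,
\end{aligned}$$

where $c_a = [E(a'x_01_{(x_{0d}\in(\tau_{j-1}^0,\tau_j^0])})]^2\sigma_\zeta^2$. Thus

$$\sum_{m=1}^{\infty}\{E[E(Z_0|\mathcal F_{-m})]^2\}^{1/2} = c_a^{1/2}\sum_{m=1}^{\infty}\Big(\sum_{i=m}^{\infty}\psi_i^2\Big)^{1/2}\le c_a^{1/2}k_\psi\sum_{m=1}^{\infty}\Big(\sum_{i=m}^{\infty}\frac{1}{(i+1)^{2\delta}}\Big)^{1/2},$$

under our assumption that $|\psi_i|\le k_\psi/(i+1)^\delta$ for all $i$. Replacing the $\delta$ in equation (4.3) with $2\delta$, we obtain that

$$\sum_{i=m}^{\infty}\frac{1}{(i+1)^{2\delta}} = \sum_{i=m+1}^{\infty}\frac{1}{i^{2\delta}}\le\frac{1}{(2\delta-1)m^{2\delta-1}}. \qquad (4.15)$$

Since $(2\delta-1)/2 > 1$,

$$\sum_{m=1}^{\infty}\{E[E(Z_0|\mathcal F_{-m})]^2\}^{1/2}\le\frac{c_a^{1/2}k_\psi}{(2\delta-1)^{1/2}}\sum_{m=1}^{\infty}\frac{1}{m^{(2\delta-1)/2}} < \infty.$$

This shows that $n^{-1/2}\sum_{t=1}^nZ_t$ has an asymptotic normal distribution. We next calculate the asymptotic variance of $n^{-1/2}\sum_{t=1}^nZ_t$. By Lemma 4.8, it is

$$\begin{aligned}
\lim_{n\to\infty}\frac{1}{n}E\Big(\sum_{t=1}^nZ_t\Big)^2 &= E[(a'x_1)^21_{(x_{1d}\in(\tau_{j-1}^0,\tau_j^0])}] + [E(a'x_11_{(x_{1d}\in(\tau_{j-1}^0,\tau_j^0])})]^2\lim_{n\to\infty}\frac{1}{n}\sum_{t\ne s}E\epsilon_t\epsilon_s \\
&= a'G_ja + [E(a'x_11_{(x_{1d}\in(\tau_{j-1}^0,\tau_j^0])})]^2\lim_{n\to\infty}\frac{1}{n}\Big[E\Big(\sum_{t=1}^n\epsilon_t\Big)^2 - E\sum_{t=1}^n\epsilon_t^2\Big] \\
&= a'G_ja + a'[E(x_11_{(x_{1d}\in(\tau_{j-1}^0,\tau_j^0])})E(x_1'1_{(x_{1d}\in(\tau_{j-1}^0,\tau_j^0])})]a\Big[\lim_{n\to\infty}n\,Var\Big(\frac{1}{n}\sum_{t=1}^n\epsilon_t\Big) - \lim_{n\to\infty}\frac{1}{n}E\Big(\sum_{t=1}^n\epsilon_t^2\Big)\Big],
\end{aligned}$$

where $\lim_{n\to\infty}\frac{1}{n}E(\sum_{t=1}^n\epsilon_t^2) = E\epsilon_1^2 = 1$ by our assumption. By Proposition 4.2,

$$\lim_{n\to\infty}n\,Var\Big(\frac{1}{n}\sum_{t=1}^n\epsilon_t\Big) = \sum_{i=-\infty}^{\infty}\gamma(i).$$

Hence, $\lim_{n\to\infty}n\,Var(\frac{1}{n}\sum_{t=1}^n\epsilon_t) - 1 = \sum_{i=-\infty}^{\infty}\gamma(i) - \gamma(0) = 2\sum_{i=1}^{\infty}\gamma(i)$, and

$$\lim_{n\to\infty}\frac{1}{n}E\Big(\sum_{t=1}^nZ_t\Big)^2 = a'\Gamma_ja,$$

which is $\sigma_a^2$. By the strong law of large numbers for ergodic sequences,

$$\frac{1}{n}X_n'(\tau_{j-1}^0,\tau_j^0)X_n(\tau_{j-1}^0,\tau_j^0)\xrightarrow{a.s.}G_j > 0$$

as $n\to\infty$. With sufficiently large $n$, $(X_n'(\tau_{j-1}^0,\tau_j^0)X_n(\tau_{j-1}^0,\tau_j^0))^{-1}$ exists a.s., and

$$n(X_n'(\tau_{j-1}^0,\tau_j^0)X_n(\tau_{j-1}^0,\tau_j^0))^{-1}\xrightarrow{a.s.}G_j^{-1}$$

as $n\to\infty$. Hence,

$$\begin{aligned}
\hat\beta_j^* &= (X_n'(\tau_{j-1}^0,\tau_j^0)X_n(\tau_{j-1}^0,\tau_j^0))^{-1}(X_n'(\tau_{j-1}^0,\tau_j^0)X_n(\tau_{j-1}^0,\tau_j^0)\tilde\beta_j + \sigma_jX_n'(\tau_{j-1}^0,\tau_j^0)\tilde\epsilon_n) \\
&= \tilde\beta_j + \sigma_j(X_n'(\tau_{j-1}^0,\tau_j^0)X_n(\tau_{j-1}^0,\tau_j^0))^{-1}X_n'(\tau_{j-1}^0,\tau_j^0)\tilde\epsilon_n.
\end{aligned}$$

Since $\sigma_j^2G_j^{-1}[G_j + 2\sum_{i=1}^{\infty}\gamma(i)E(x_11_{(x_{1d}\in(\tau_{j-1}^0,\tau_j^0])})E(x_1'1_{(x_{1d}\in(\tau_{j-1}^0,\tau_j^0])})]G_j^{-1} = \Sigma_j$,

$$\sqrt n(\hat\beta_j^*-\tilde\beta_j)\xrightarrow{d}N(0,\Sigma_j).$$

This completes the proof. ∎

Lemma 4.10 Under the condition of Lemma 4.9,

$$\frac{1}{\sqrt n}\sum_{t=1}^n(\epsilon_t^21_{(x_{td}\in(\tau_{j-1}^0,\tau_j^0])} - p_j)\xrightarrow{d}N(0,v_j)$$

as $n\to\infty$, where $v_j = p_j(1-p_j)E(\epsilon_1^4) + p_j^2[(\eta-3)\gamma^2(0) + 2\sum_{i=-\infty}^{\infty}\gamma^2(i)]$ and $p_j = P(\tau_{j-1}^0 < x_{1d}\le\tau_j^0)$.

Proof Let $\mathcal F_t = \sigma(\zeta_s,x_s,\ s\le t)$ be the $\sigma$-field generated by $\{\zeta_s,x_s,\ s\le t\}$ and $Z_t = \epsilon_t^21_{(x_{td}\in(\tau_{j-1}^0,\tau_j^0])} - p_j$. It suffices to show that $n^{-1/2}\sum_{t=1}^nZ_t\xrightarrow{d}N(0,v_j)$. To show that $n^{-1/2}\sum_{t=1}^nZ_t$ has the asymptotic normal distribution, one needs to verify that the conditions of Lemma 4.8 obtain. That is, it must be shown that $EZ_0 = 0$, $EZ_0^2 < \infty$,

$$\sum_{m=1}^{\infty}(E[E(Z_0|\mathcal F_{-m})]^2)^{1/2} < \infty,$$

and

$$\sum_{m=1}^{\infty}\{E[Z_0-E(Z_0|\mathcal F_m)]^2\}^{1/2} < \infty,$$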
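the latter having the appearance of (4.14). We obtain $EZ_0 = E\epsilon_0^2E1_{(x_{0d}\in(\tau_{j-1}^0,\tau_j^0])} - p_j = 1\cdot p_j - p_j = 0$, and

$$EZ_0^2 = E(\epsilon_0^21_{(x_{0d}\in(\tau_{j-1}^0,\tau_j^0])} - p_j)^2 = E(\epsilon_0^41_{(x_{0d}\in(\tau_{j-1}^0,\tau_j^0])}) + p_j^2 - 2p_jE(\epsilon_0^21_{(x_{0d}\in(\tau_{j-1}^0,\tau_j^0])}) = p_jE\epsilon_0^4 - p_j^2 < \infty.$$

Also, for $m\ge 1$, $Z_0$ is $\mathcal F_m$-measurable. Hence, $Z_0-E(Z_0|\mathcal F_m) = Z_0-Z_0 = 0$. So (4.14) is trivial. It remains only to show that $\sum_{m=1}^{\infty}(E[E(Z_0|\mathcal F_{-m})]^2)^{1/2} < \infty$. Recall that $E(\epsilon_t^2) = \sigma_\zeta^2\sum_{i=0}^{\infty}\psi_i^2$ is assumed to be 1. Hence,

$$\begin{aligned}
E[E(Z_0|\mathcal F_{-m})]^2 &= E[E(\epsilon_0^21_{(x_{0d}\in(\tau_{j-1}^0,\tau_j^0])} - p_j|\mathcal F_{-m})]^2 \\
&= E[p_jE(\epsilon_0^2|\mathcal F_{-m}) - p_j]^2 \\
&= p_j^2E\Big[E\Big(\Big(\sum_{i=0}^{\infty}\psi_i\zeta_{-i}\Big)^2\Big|\mathcal F_{-m}\Big) - 1\Big]^2 \\
&= p_j^2E\Big[\sigma_\zeta^2\sum_{i=0}^{m-1}\psi_i^2 + \Big(\sum_{i=m}^{\infty}\psi_i\zeta_{-i}\Big)^2 - 1\Big]^2 \\
&= p_j^2E\Big[\Big(\sum_{i=m}^{\infty}\psi_i\zeta_{-i}\Big)^2 - \sigma_\zeta^2\sum_{i=m}^{\infty}\psi_i^2\Big]^2 \\
&= p_j^2\Big[E\Big(\sum_{i=m}^{\infty}\psi_i\zeta_{-i}\Big)^4 - \Big(\sigma_\zeta^2\sum_{i=m}^{\infty}\psi_i^2\Big)^2\Big].
\end{aligned}$$

Using equation (4.8) by setting $\psi_i = 0$ for $i < m$, we have

$$\begin{aligned}
E\Big(\sum_{i=m}^{\infty}\psi_i\zeta_{-i}\Big)^4 - \Big(\sigma_\zeta^2\sum_{i=m}^{\infty}\psi_i^2\Big)^2 &= 3\Big(\sigma_\zeta^2\sum_{i=m}^{\infty}\psi_i^2\Big)^2 + (\eta-3)\sigma_\zeta^4\sum_{i=m}^{\infty}\psi_i^4 - \Big(\sigma_\zeta^2\sum_{i=m}^{\infty}\psi_i^2\Big)^2 \\
&\le(\eta-1)\sigma_\zeta^4\Big(\sum_{i=m}^{\infty}\psi_i^2\Big)^2 \\
&\le(\eta-1)\sigma_\zeta^4k_\psi^4\Big(\sum_{i=m}^{\infty}\frac{1}{(i+1)^{2\delta}}\Big)^2.
\end{aligned}$$

By (4.15), $\sum_{i=m}^{\infty}(i+1)^{-2\delta}\le 1/((2\delta-1)m^{2\delta-1})$. Thus,

$$\sum_{m=1}^{\infty}\{E[E(Z_0|\mathcal F_{-m})]^2\}^{1/2}\le\sum_{m=1}^{\infty}p_j(\eta-1)^{1/2}\sigma_\zeta^2k_\psi^2\sum_{i=m}^{\infty}\frac{1}{(i+1)^{2\delta}}\le\sum_{m=1}^{\infty}\frac{p_j(\eta-1)^{1/2}\sigma_\zeta^2k_\psi^2}{(2\delta-1)m^{2\delta-1}} < \infty.$$

Finally,

$$\begin{aligned}
v_j &= \lim_{n\to\infty}\frac{E(S_n^2)}{n} = \lim_{n\to\infty}\frac{1}{n}E\Big(\sum_{t=1}^n(\epsilon_t^21_{(x_{td}\in(\tau_{j-1}^0,\tau_j^0])} - p_j)\Big)^2 \\
&= \lim_{n\to\infty}\frac{1}{n}\sum_{s,t}E[\epsilon_t^2\epsilon_s^21_{(x_{td}\in(\tau_{j-1}^0,\tau_j^0])}1_{(x_{sd}\in(\tau_{j-1}^0,\tau_j^0])} + p_j^2 - p_j(\epsilon_t^21_{(x_{td}\in(\tau_{j-1}^0,\tau_j^0])} + \epsilon_s^21_{(x_{sd}\in(\tau_{j-1}^0,\tau_j^0])})] \\
&= \lim_{n\to\infty}\frac{1}{n}\sum_{t}E(\epsilon_t^4)p_j + \lim_{n\to\infty}\frac{1}{n}\sum_{s,t}[E(\epsilon_t^2\epsilon_s^2)p_j^2 + p_j^2 - p_j^2E(\epsilon_t^2) - p_j^2E(\epsilon_s^2)] - \lim_{n\to\infty}\frac{1}{n}\sum_{t}E(\epsilon_t^4)p_j^2 \\
&= p_jE(\epsilon_1^4) + p_j^2\lim_{n\to\infty}\frac{1}{n}\sum_{s,t}E[(\epsilon_t^2-1)(\epsilon_s^2-1)] - p_j^2E(\epsilon_1^4) \\
&= p_j(1-p_j)E(\epsilon_1^4) + p_j^2\lim_{n\to\infty}n\,Var\Big(\frac{1}{n}\sum_{t=1}^n\epsilon_t^2\Big).
\end{aligned}$$

By equation (4.9), $\lim_{n\to\infty}n\,Var(\frac{1}{n}\sum_{t=1}^n\epsilon_t^2) = (\eta-3)\gamma^2(0) + 2\sum_{i=-\infty}^{\infty}\gamma^2(i)$. This completes the proof. ∎

Proof of Theorem 4.5 We shall show the conclusion for the $\hat\beta_j$'s first. Let $\hat\beta_j^*$ denote the least squares estimate of $\tilde\beta_j$ when $(\tau_1^0,\dots,\tau_{l^0}^0)$ is known, $j = 1,\dots,l^0+1$. By Lemma 4.9, it suffices to show that $\hat\beta_j$ and $\hat\beta_j^*$ share the same asymptotic distribution, for all $j$. In turn, it suffices to show that $\hat\beta_j-\hat\beta_j^* = o_p(n^{-1/2})$. Set $X_j^* = I_n(\tau_{j-1}^0,\tau_j^0)X_n$ and $\hat X_j = I_n(\hat\tau_{j-1},\hat\tau_j)X_n$. Then,

$$\begin{aligned}
\hat\beta_j-\hat\beta_j^* &= \Big[\Big(\frac{1}{n}\hat X_j'\hat X_j\Big)^{-} - \Big(\frac{1}{n}X_j^{*\prime}X_j^*\Big)^{-}\Big]\Big[\frac{1}{n}\hat X_j'Y_n\Big] + \Big[\Big(\frac{1}{n}X_j^{*\prime}X_j^*\Big)^{-}\Big]\Big[\frac{1}{n}(\hat X_j-X_j^*)'Y_n\Big] \\
&= \Big[\Big(\frac{1}{n}\hat X_j'\hat X_j\Big)^{-} - \Big(\frac{1}{n}X_j^{*\prime}X_j^*\Big)^{-}\Big]\Big\{\frac{1}{n}(\hat X_j-X_j^*)'Y_n + \frac{1}{n}X_j^{*\prime}Y_n\Big\} + \Big[\Big(\frac{1}{n}X_j^{*\prime}X_j^*\Big)^{-}\Big]\Big[\frac{1}{n}(\hat X_j-X_j^*)'Y_n\Big] \\
&=: (I)\{(II)+(III)\} + (IV)(II),
\end{aligned}$$

where $(I) = [(\frac{1}{n}\hat X_j'\hat X_j)^{-} - (\frac{1}{n}X_j^{*\prime}X_j^*)^{-}]$, $(II) = \frac{1}{n}(\hat X_j-X_j^*)'Y_n$, $(III) = \frac{1}{n}X_j^{*\prime}Y_n$ and $(IV) = [(\frac{1}{n}X_j^{*\prime}X_j^*)^{-}]$. As in the proof of Theorem 4.3, both (III) and (IV) are $O_p(1)$.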
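And the order $o_p(n^{-1/2})$ of (I) and (II) follows from Lemma 3.6 by taking $a_n = \ln^2 n/n$, $z_t = (a'x_t)^2$ and $z_t = a'x_ty_t$ respectively, for any real vector $a$ and $u > 2$. Thus, $\hat\beta_j-\hat\beta_j^* = o_p(n^{-1/2})$.

Next, we prove the conclusion for the $\hat\sigma_j^2$'s. Let $\hat\sigma_j^{*2}$ denote the least squares estimate of $\sigma_j^2$ when $(\tau_1^0,\dots,\tau_{l^0}^0)$ is known, $j = 1,\dots,l^0+1$. By Lemma 4.3, $T_n(\tau_{j-1}^0,\tau_j^0) = O_p(\ln^2 n)$. Hence,

$$\frac{1}{n}S_n(\tau_{j-1}^0,\tau_j^0) = \frac{1}{n}\sigma_j^2\sum_{t=1}^n\epsilon_t^21_{(x_{td}\in(\tau_{j-1}^0,\tau_j^0])} + O_p(\ln^2 n/n).$$

By Lemma 4.10,

$$\frac{1}{\sqrt n}\sum_{t=1}^n(\epsilon_t^21_{(x_{td}\in(\tau_{j-1}^0,\tau_j^0])} - p_j)\xrightarrow{d}N(0,v_j).$$

Therefore $\frac{1}{\sqrt n}(S_n(\tau_{j-1}^0,\tau_j^0) - np_j\sigma_j^2)\xrightarrow{d}N(0,v_j\sigma_j^4)$, and hence

$$\sqrt n\,p_j(\hat\sigma_j^{*2}-\sigma_j^2)\xrightarrow{d}N(0,v_j\sigma_j^4).$$

It remains to show that $\hat\sigma_j^2-\hat\sigma_j^{*2} = o_p(n^{-1/2})$. As in the proof of Theorem 4.3, it suffices to show that $\frac{1}{n}[S_n(\hat\tau_{j-1},\hat\tau_j) - S_n(\tau_{j-1}^0,\tau_j^0)] = o_p(n^{-1/2})$. By equation (4.7),

$$\frac{1}{n}[S_n(\hat\tau_{j-1},\hat\tau_j) - S_n(\tau_{j-1}^0,\tau_j^0)] = \frac{1}{n}\sum_{t=1}^ny_t^2(1_{(x_{td}\in\hat R_j)} - 1_{(x_{td}\in R_j^0)}) - \{((II)+(III))'[(I)+(IV)](II) + ((II)+(III))'[(I)](III) + (II)'(IV)(III)\}.$$

Taking $a_n = \ln^2 n/n$, $u > 2$ and $z_t = y_t^2$, it follows from Lemma 3.6 that $\frac{1}{n}\sum_{t=1}^ny_t^2(1_{(x_{td}\in\hat R_j)} - 1_{(x_{td}\in R_j^0)}) = o_p(n^{-1/2})$. Also, it is shown in the proof of Theorem 4.3 that both (III) and (IV) are $O_p(1)$. The order $o_p(n^{-1/2})$ of (I) and (II) follows from Lemma 3.6 by taking $a_n = \ln^2 n/n$, $z_t = (a'x_t)^2$ and $z_t = a'x_ty_t$ respectively, for any real vector $a$ and $u > 2$. This shows that $\hat\sigma_j^2-\hat\sigma_j^{*2} = o_p(n^{-1/2})$. ∎

Proof of Theorem 4.6 For $d = d^0$, by Lemma 4.5 (ii), $\frac{1}{n}S_n^{d^0}\xrightarrow{p}\sigma_0^2$. For $d\ne d^0$, we shall show that $\frac{1}{n}S_n^d\ge\sigma_0^2 + C$ for some constant $C > 0$ with probability approaching 1. Again, $l^0 = 1$ is assumed for simplicity. If $d\ne d^0$, by the identifiability of $d^0$, for any $\{R_j^d\}_{j=1}^{L+1}$ there exist $r,s\in\{1,\dots,L+1\}$ such that $R_r^d\supset A_s^d$, where $A_s^d$ is defined in Theorem 2.1. Let $B_s = \{(\tau_1,\dots,\tau_L) : R_r^d\supset A_s^d \text{ for some } r\}$. Then any $(\tau_1,\dots,\tau_L)$ belongs to $B_s$ for at least one $s\in\{1,\dots,L+1\}$. Since $\hat d$ is chosen such that $S_n^{\hat d}\le S_n^d$ for all $d$, it suffices to show that for $d\ne d^0$ and each $s$, there exists $C_s > 0$ such that

$$\inf_{(\tau_1,\dots,\tau_L)\in B_s}\frac{1}{n}S_n^d(\tau_1,\dots,\tau_L)\ge\sigma_0^2 + C_s, \qquad (4.16)$$

with probability approaching 1 as $n\to\infty$. For any $(\tau_1,\dots,\tau_L)\in B_s$, let $R_{L+2}^d = \{x : x_d\in(\tau_{r-1},a_s]\}$ and $R_{L+3}^d = \{x : x_d\in(b_s,\tau_r]\}$.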
Then $R_r^d = A_s^d\cup R_{L+2}^d\cup R_{L+3}^d$. From Lemma 4.3 and the proof of Lemma 3.2', we can see that the conclusion of Lemma 3.2' still holds under the current assumptions. Hence, the conclusions of Proposition 3.1' and Lemma 3.3' also hold. Therefore, by (3.13),

$$\frac{1}{n}S_n^d(\tau_1,\dots,\tau_L) = \sigma_0^2 + o_p(1) + \frac{1}{n}[S_n(A_s^d) - S_n(A_s^d\cap R_1^0) - S_n(A_s^d\cap R_2^0)].$$

Now it remains to show that

$$\frac{1}{n}[S_n(A_s^d) - S_n(A_s^d\cap R_1^0) - S_n(A_s^d\cap R_2^0)]\ge C_s$$

for some $C_s > 0$, with probability approaching 1. By Theorem 2.1, $E[x_1x_1'1_{(x_1\in A_s^d\cap R_i^0)}]$, $i = 1,2$, are positive definite. Applying Lemma 3.3' we obtain the desired result. ∎

Chapter 5

SUMMARY AND FUTURE RESEARCH

5.1 A brief summary of previous chapters

In this thesis, we propose a set of procedures for estimating the parameters of a segmented regression model. The consistency of the estimators is established under fairly general conditions. For the "basic" model where the noise is an iid sequence and locally exponentially bounded, it is shown that if the model is discontinuous at a threshold, then the least squares estimate of the threshold converges at the rate of $O_p(\ln^2 n/n)$. For both continuous and discontinuous models, the asymptotic normality of the estimated regression coefficients and the noise variance is established. The least squares "identifier" of the segmentation variable is shown to be consistent, if the segmentation variable is asymptotically identifiable. A more efficient method of identifying the segmentation variable is given under stronger conditions. Most of these results are generalized to the case where the noise is heteroscedastic and autocorrelated.

A simulation study is carried out to demonstrate the small sample behavior of the proposed estimators. The proposed procedures perform reasonably well in identifying the models, but indicate the need for large sample sizes for estimating the thresholds.

5.2 Future research on the current model

First, further work on choosing $\delta_0$ and $c_0$ in the MIC is needed.
One way to reduce the risk of mis-specifying the model is to try different $(\delta_0, c_0)$ values over a certain range. If several $(\delta_0, c_0)$ pairs produce the same $\hat l$, we would be more confident of our choice. Otherwise, different models can be fitted, and the estimated regression coefficients and noise variance may then indicate which $(\delta_0, c_0)$ is more appropriate. In particular, when the noise is autocorrelated, recursive estimation procedures need to be investigated.

Second, the asymptotic normality of the estimated regression coefficients for continuous models needs to be generalized to the case where the noise is heteroscedastic and autocorrelated. The techniques used in Sections 3.5 and 4.5 are useful but additional tools are needed, such as the central limit theorem for a double array of martingale sequences.

Third, the local exponential boundedness assumption made on the noise may be relaxed. Note that this assumption implies that $\epsilon_1$ has moments of any order. If $\epsilon_1$ is assumed to have moments only up to a finite order, a model selection criterion with a penalty term of the form $Cn^{\alpha}$ $(0 < \alpha < 1)$ may well be consistent. This has been shown by Yao (1989) for a one-dimensional step function with fixed covariates and iid noise.

5.3 Further generalizations

Further generalization of the segmented regression model will enable its broader application. First, there may be more than one segmentation variable. For example, changes in economic policy may be triggered by simultaneous extremes in a number of key economic indices. The results in this thesis may be generalized to the case where more than one segmentation variable is present. Further, since sometimes there is no reason to believe that segmentation has to be parallel to any of the axes, a threshold defined in terms of a linear combination of explanatory variables may be appropriate. A least squares approach or that of Goldfeld and Quandt (1972, 1973a) can be applied.
Large sample properties of the estimators given by these approaches would need to be investigated.

In many economic problems, the explanatory variables exhibit certain kinds of dependence over time. The explanatory variables and the noise may also be dependent. Our results can be generalized in this direction, since the iid assumption on $\{x_t\}$ is not essential. Once such generalizations are accomplished, we expect this model to be useful for many economic problems, since many economic policies and business decisions are threshold-based, at least to some extent. In fact, the segmented regression model has been applied to a foreign exchange rate problem by Liu and Susko (1992) with significantly better results than other approaches reported in the literature. The need for a theoretical justification of this approach is obvious.

If $y_t$ and $x_{ti}$ in Model (2.1) are replaced by $x_t$ and $x_{t-i}$ respectively ($i = 1,\dots,p$), where $\{x_t\}$ is a time series, then the model becomes a threshold autoregressive model. This interesting nonlinear time series model has been studied by many authors. See, for example, Tong (1987) for a review of some recent work on nonlinear time series analysis. Because this model is very similar to ours in its structure, the approaches used in this thesis may also shed some light on its model selection problem and the large sample properties of its least squares estimates. In particular, we expect that a criterion similar to MIC can be used to select the number of thresholds for the threshold autoregressive model.

REFERENCES

Bacon, D.W. and Watts, D.G. (1971). Estimating the transition between two intersecting straight lines. Biometrika, 58, 525-543.
Bellman, R. (1969). Curve fitting by segmented straight lines. J. Amer. Statist. Assoc., 64, 1079-1084.
Billingsley, P. (1968). Convergence of Probability Measures. Wiley, N.Y.
Breiman, L. and Meisel, W.S. (1976). General estimates of the intrinsic variability of data in nonlinear regression models. J. Amer. Statist. Assoc., 71, 301-307.
Brockwell, P.J. and Davis, R.A. (1987). Time Series: Theory and Methods. Springer-Verlag, N.Y.
Broemeling, L.D. (1974). Bayesian inferences about a changing sequence of random variables. Commun. Statist., 3, 234-255.
Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc., 74, 829-836.
Cleveland, W.S. and Devlin, S.J. (1988). Locally weighted regression: an approach to regression analysis by local fitting. J. Amer. Statist. Assoc., 83, 596-610.
Dunicz, B.L. (1969). Discontinuities in the surface structure of alcohol-water mixtures. Kolloid-Zeitschr. u. Zeitschrift f. Polymere, 230, 346-357.
Ertel, J.E. and Fowlkes, E.B. (1976). Some algorithms for linear spline and piecewise multiple linear regression. J. Amer. Statist. Assoc., 71, 640-648.
Farley, J.U. and Hinich, M.J. (1970). A test for a shifting slope coefficient in a linear model. J. Amer. Statist. Assoc., 65, 1320-1329.
Feder, P.I. and Sylwester, D.L. (1968). On the asymptotic theory of least squares estimation in segmented regression: identified case (preliminary report). Abstracted in Ann. Math. Statist., 39, 1362.
Feder, P.I. (1975a). On asymptotic distribution theory in segmented regression problems: identified case. Ann. Statist., 3, 49-83.
Friedman, J.H. (1988). Multivariate Adaptive Regression Splines. Report 102, Department of Statistics, Stanford University.
Friedman, J.H. (1991). Multivariate Adaptive Regression Splines. Ann. Statist., 19, 1-141.
Feder, P.I. (1975b). The log likelihood ratio in segmented regression. Ann. Statist., 3, 84-97.
Ferreira, P.E. (1975). A Bayesian analysis of switching regression model: Known number of regimes. J. Amer. Statist. Assoc., 70, 730-734.
Gallant, A.R. and Fuller, W.A. (1973). Fitting segmented polynomial regression models whose join points have to be estimated. J. Amer. Statist. Assoc., 68, 144-147.
Goldfeld, S.M. and Quandt, R.E. (1972). Nonlinear Methods in Econometrics. North-Holland Publishing Co.
Goldfeld, S.M. and Quandt, R.E. (1973a). The estimation of structural shifts by switching regressions. Ann. Econ. Soc. Measurement, 2, 475-485.
Goldfeld, S.M. and Quandt, R.E. (1973b). A Markov model for switching regressions. Journal of Econometrics, 1, 3-16.
Hall, P. and Heyde, C. (1980). Martingale Limit Theory and Its Application. Academic Press.
Hawkins, D.M. (1980). A note on continuous and discontinuous segmented regressions. Technometrics, 22, 443-444.
Henderson, H.V. and Velleman, P.F. (1981). Building regression model interactively. Biometrics, 37, 391-411.
Henderson, R. (1986). Change-point problem with correlated observations, with an application in material accountancy. Technometrics, 28, 381-389.
Hinkley, D.V. (1969). Inference about the intersection in two-phase regression. Biometrika, 56, 495-504.
Hinkley, D.V. (1970). Inference about the change-point in a sequence of random variables. Biometrika, 57, 1-17.
Holbert, D. and Broemeling, L. (1977). Bayesian inferences related to shifting sequences and two-phase regression. Commun. Statist. Theor. Meth., A6(3), 265-275.
Jennrich, R.J. (1969). Asymptotic properties of non-linear least squares estimators. Ann. Math. Statist., 40, 633-643.
Hudson, D.J. (1966). Fitting segmented curves whose join points have to be estimated. J. Amer. Statist. Assoc., 61, 1097-1129.
Liu, J. and Liu, Z. (1991). Higher order moments and limit theory of a general bilinear time series. Unpublished manuscript.
Liu, J. and Susko, E.A. (1992). Forecasting exchange rates using segmented time series regression model - a nonlinear multi-country model. Unpublished manuscript.
MacNeill, I.B. (1978). Properties of sequences of partial sums of polynomial regression residuals with applications to test for change of regression at unknown times. Ann. Statist., 6, 422-433.
McGee, V.E. and Carleton, W.T. (1970). Piecewise regression. J. Amer. Statist. Assoc., 65, 1109-1124.
Miao, B.Q. (1988). Inference in a model with at most one slope-change point. Journal of Multivariate Analysis, 27, 375-391.
Müller, H.G. and Stadtmüller, U. (1987). Estimation of heteroscedasticity in regression analysis. Ann. Statist., 15, 610-625.
Poirier, D.J. (1973). Piecewise regression using cubic splines. J. Amer. Statist. Assoc., 68, 515-524.
Quandt, R.E. (1958). The estimation of the parameters of a linear regression system obeying two separate regimes. J. Amer. Statist. Assoc., 53, 873-880.
Quandt, R.E. (1960). Tests of the hypothesis that a linear regression system obeys two separate regimes. J. Amer. Statist. Assoc., 55, 324-330.
Quandt, R.E. (1972). A new approach to estimating switching regression. J. Amer. Statist. Assoc., 67, 306-310.
Quandt, R.E. and Ramsey, J.B. (1978). Estimating mixtures of normal distributions and switching regression (with discussion). J. Amer. Statist. Assoc., 73, 730-752.
Robison, D.E. (1964). Estimates for the points of intersection of two polynomial regressions. J. Amer. Statist. Assoc., 59, 214-224.
Sacks, J. and Ylvisaker, D. (1978). Linear estimation for approximately linear models. Ann. Statist., 6, 1122-1137.
Schulze, U. (1984). A method of estimation of change points in multiphasic growth models. Biometrical Journal, 26, 495-504.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6, 461-464.
Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
Shaban, S.A. (1980). Change point problem and two-phase regression: an annotated bibliography. International Statistical Review, 48, 83-93.
Shao, J. (1990). Asymptotic theory in heteroscedastic nonlinear models. Statistics & Probability Letters, 10, 77-85.
Shumway, R.H. and Stoffer, D.S. (1991). Dynamic linear models with switching. J. Amer. Statist. Assoc., 86, 763-769.
Sylwester, D.L. (1965). On the maximum likelihood estimation for two-phase linear regression. Technical Report No. 11, Department of Statistics, Stanford Univ.
Sprent, P. (1961). Some hypotheses concerning two phase regression lines. Biometrics, 17, 634-645.
Susko, E.A. (1991). Segmented regression modelling with an application to German exchange rate data. M.Sc. thesis, Department of Statistics, University of British Columbia.
Tong, H. (1987). Non-linear time series models of regularly sampled data: A review. Proc. First World Congress of the Bernoulli Society, Tashkent, USSR, 2, 355-367, The Netherlands, VNU Science Press.
Weerahandi, W. and Zidek, J.V. (1988). Bayesian nonparametric smoothers for regular processes. The Canadian Journal of Statistics, 16, 61-73.
Worsley, K.J. (1983). Testing for a two-phase multiple regression. Technometrics, 25, 35-42.
Yao, Y. (1988). Estimating the number of change-points via Schwarz' criterion. Statistics & Probability Letters, 6, 181-189.
Wu, C.F.J. (1981). Asymptotic theory of nonlinear least squares estimation. Ann. Statist., 9, 501-513.
Yao, Y. and Au, S.T. (1989). Least-squares estimation of a step function. Sankhya: The Indian Journal of Statistics, A, 51, 370-381.
Yeh, M.P., Gardner, R.M., Adams, T.D., Yanowitz, F.G., and Crapo, R.O. (1983). "Anaerobic threshold": Problems of determination and validation. J. Appl. Physiol.: Respirat. Environ. Exercise Physiol., 55, 1178-1186.
Zwiers, F. and Storch, H.V. (1990). Regime-dependent autoregressive time series modeling of the Southern Oscillation. Journal of Climate, 3, 1347-1363.
Table 3.1: Frequency of correct identification of $l^0$ in 100 repetitions and the estimated thresholds for segmented regression models ($m$, $m_u$, $m_o$ are the frequencies of correct, under- and over-estimations of $l^0$)

| Model | Statistic | n = 30 | n = 50 | n = 100 | n = 200 |
|---|---|---|---|---|---|
| (a) | MIC: $m\,(m_u, m_o)$ | 79 (18, 3) | 95 (4, 1) | 100 (0, 0) | 100 (0, 0) |
| | $\hat{\tau}_1$ (SE) | 1.168 (1.500) | 1.033 (1.353) | 1.410 (0.984) | 1.259 (0.665) |
| (b) | MIC: $m\,(m_u, m_o)$ | 70 (21, 9) | 86 (8, 6) | 99 (0, 1) | 100 (0, 0) |
| | $\hat{\tau}_1$ (SE) | 1.022 (1.546) | 1.220 (1.407) | 1.432 (0.908) | 1.245 (0.692) |
| (c) | MIC: $m\,(m_u, m_o)$ | 80 (6, 14) | 97 (1, 2) | 100 (0, 0) | 100 (0, 0) |
| | $\hat{\tau}_1$ (SE) | 0.890 (0.737) | 0.761 (0.502) | 0.901 (0.221) | 0.932 (0.151) |
| (d) | MIC: $m\,(m_u, m_o)$ | 85 (8, 7) | 99 (0, 1) | 100 (0, 0) | 100 (0, 0) |
| | $\hat{\tau}_1$ (SE) | 0.791 (1.009) | 0.860 (0.665) | 0.971 (0.232) | 0.963 (0.169) |
| (e) | MIC: $m\,(m_u, m_o)$ | 68 (23, 9) | 87 (12, 1) | 100 (0, 0) | 100 (0, 0) |
| | $\hat{\tau}_1$ (SE) | 0.463 (1.735) | 0.708 (1.332) | 0.989 (0.923) | 0.940 (0.707) |

Table 3.2: Estimated regression coefficients and variances of noise and their standard errors with n = 200 (conditional on $\hat{l} = 1$). Entries are estimate (SE).

| Parameter | Model (a) | Model (b) | Model (c) | Model (d) | Model (e) |
|---|---|---|---|---|---|
| $\hat{\beta}_{10}$ | -0.003 (0.145) | -0.018 (0.146) | 0.004 (0.143) | -0.008 (0.154) | -0.059 (0.177) |
| $\hat{\beta}_{11}$ | 1.001 (0.038) | 0.995 (0.037) | 1.000 (0.035) | 0.995 (0.041) | 0.985 (0.045) |
| $\hat{\beta}_{12}$ | 1.000 (0.024) | 0.996 (0.025) | -0.004 (0.025) | 0.000 (0.024) | 1.000 (0.025) |
| $\hat{\beta}_{20}$ | 1.485 (0.345) | 1.388 (0.332) | 0.962 (0.243) | 1.009 (0.225) | 0.960 (0.283) |
| $\hat{\beta}_{21}$ | 0.005 (0.063) | 0.019 (0.067) | 0.008 (0.055) | 0.000 (0.049) | 0.008 (0.057) |
| $\hat{\beta}_{22}$ | 1.006 (0.034) | 0.998 (0.034) | 0.495 (0.032) | 0.498 (0.032) | 0.998 (0.036) |
| $\hat{\sigma}^2$ | 0.948 (0.108) | 0.950 (0.154) | 0.956 (0.156) | 0.953 (0.160) | 0.944 (0.158) |

Reported for two of the five models only (the column positions are not recoverable from the extracted text):
$\hat{\beta}_{13}$: 0.994 (0.023), 0.995 (0.025)
$\hat{\beta}_{23}$: 0.997 (0.034), 0.996 (0.036)

Table 3.3: The empirical distribution of $\hat{l}$ in 100 repetitions by MIC, SC and YC for piecewise constant model ($n_0, n_1, n_2, n_3$ are the frequencies of $\hat{l} = 0, 1, 2, 3$ respectively). Entries are $n_0, n_1, n_2, n_3$.

| Model | Criterion | n = 50 | n = 150 | n = 450 |
|---|---|---|---|---|
| (f) | MIC | 5, 30, 48, 17 | 0, 18, 79, 3 | 0, 0, 98, 2 |
| | YC | 5, 36, 45, 14 | 0, 36, 64, 0 | 0, 9, 91, 0 |
| | SC | 0, 17, 52, 31 | 0, 1, 64, 35 | 0, 0, 83, 17 |
| (g) | MIC | 5, 38, 51, 6 | 0, 23, 72, 5 | 0, 0, 99, 1 |
| | YC | 7, 41, 48, 4 | 0, 46, 53, 1 | 0, 7, 93, 0 |
| | SC | 3, 18, 56, 23 | 0, 2, 79, 19 | 0, 0, 87, 13 |
| (h) | MIC | 0, 3, 81, 16 | 0, 0, 96, 4 | 0, 0, 98, 2 |
| | YC | 0, 3, 84, 13 | 0, 0, 100, 0 | 0, 0, 100, 0 |
| | SC | 0, 0, 63, 37 | 0, 0, 82, 18 | 0, 0, 87, 13 |
| (i) | MIC | 0, 5, 85, 10 | 0, 0, 97, 3 | 0, 0, 100, 0 |
| | YC | 0, 7, 86, 7 | 0, 0, 100, 0 | 0, 0, 100, 0 |
| | SC | 0, 1, 73, 26 | 0, 0, 83, 17 | 0, 0, 93, 7 |

Table 3.4: The estimated thresholds and their standard errors for piecewise constant model (conditional on $\hat{l} = 2$)

| Model | Statistic | n = 50 | n = 150 | n = 450 |
|---|---|---|---|---|
| (f) | $\hat{\tau}_1$ (SE) | 0.335 (0.078) | 0.338 (0.039) | 0.334 (0.012) |
| | $\hat{\tau}_2$ (SE) | 0.660 (0.032) | 0.666 (0.008) | 0.667 (0.003) |
| (g) | $\hat{\tau}_1$ (SE) | 0.313 (0.076) | 0.332 (0.032) | 0.334 (0.013) |
| | $\hat{\tau}_2$ (SE) | 0.656 (0.015) | 0.669 (0.009) | 0.667 (0.002) |
| (h) | $\hat{\tau}_1$ (SE) | 0.316 (0.027) | 0.334 (0.007) | 0.333 (0.002) |
| | $\hat{\tau}_2$ (SE) | 0.662 (0.030) | 0.667 (0.006) | 0.667 (0.003) |
| (i) | $\hat{\tau}_1$ (SE) | 0.323 (0.023) | 0.332 (0.010) | 0.334 (0.004) |
| | $\hat{\tau}_2$ (SE) | 0.661 (0.030) | 0.666 (0.007) | 0.667 (0.003) |

Table 4.1: Frequency of correct identification of $l^0$ in 100 repetitions and the estimated thresholds for segmented regression models with two regimes ($m$, $m_u$, $m_o$ are the frequencies of correct, under- and over-estimations of $l^0$)

| Model | Statistic | n = 50 | n = 100 | n = 200 |
|---|---|---|---|---|
| (a') | MIC: $m\,(m_u, m_o)$ | 95 (3, 2) | 98 (0, 2) | 99 (0, 1) |
| | $\hat{\tau}_1$ (SE) | 1.322 (1.681) | 1.412 (1.293) | 1.223 (1.060) |
| (d') | MIC: $m\,(m_u, m_o)$ | 91 (1, 8) | 95 (0, 5) | 99 (0, 1) |
| | $\hat{\tau}_1$ (SE) | 0.808 (0.545) | 0.936 (0.256) | 0.960 (0.109) |
| (e') | MIC: $m\,(m_u, m_o)$ | 94 (3, 3) | 98 (0, 2) | 99 (0, 1) |
| | $\hat{\tau}_1$ (SE) | 0.693 (1.583) | 1.088 (1.470) | 1.175 (1.111) |

Table 4.2: Estimated regression coefficients and variances of noise and their standard errors with n = 200 (conditional on $\hat{l} = 1$). Entries are estimate (SE).

| Parameter | Model (a') | Model (d') | Model (e') |
|---|---|---|---|
| $\hat{\beta}_{10}$ | -0.049 (0.247) | 0.007 (0.190) | -0.056 (0.227) |
| $\hat{\beta}_{11}$ | 0.993 (0.066) | 0.998 (0.059) | 0.985 (0.065) |
| $\hat{\beta}_{12}$ | 1.003 (0.017) | -0.001 (0.020) | 0.999 (0.019) |
| $\hat{\beta}_{20}$ | 1.258 (0.730) | 0.957 (0.461) | 0.749 (0.596) |
| $\hat{\beta}_{21}$ | 0.033 (0.129) | 0.013 (0.107) | 0.045 (0.126) |
| $\hat{\beta}_{22}$ | 0.998 (0.033) | 0.503 (0.029) | 1.002 (0.030) |
| $\hat{\sigma}_1^2$ | 0.656 (0.117) | 0.639 (0.167) | 0.634 (0.166) |
| $\hat{\sigma}_2^2$ | 0.929 (0.271) | 1.050 (0.391) | 0.963 (0.361) |

Reported for two of the three models only (the column positions are not recoverable from the extracted text):
$\hat{\beta}_{13}$: 0.998 (0.018), 0.997 (0.018)
$\hat{\beta}_{23}$: 0.998 (0.026), 0.999 (0.029)

Table 4.3: Frequency of correct identification of $l^0$ in 100 repetitions and the estimated thresholds for a segmented regression model with three regimes ($m$, $m_u$, $m_o$ are the frequencies of correct, under- and over-estimations of $l^0$)

| Model | Statistic | n = 50 | n = 100 | n = 200 |
|---|---|---|---|---|
| (j) | MIC: $m\,(m_u, m_o)$ | 62 (26, 12) | 86 (6, 8) | 95 (0, 5) |
| | $\hat{\tau}_1$ (SE) | -1.211 (0.251) | -1.051 (0.151) | -1.034 (0.078) |
| | $\hat{\tau}_2$ (SE) | 1.046 (0.493) | 1.060 (0.388) | 0.974 (0.096) |

Table 4.4: Estimated regression coefficients and noise variances and their standard errors with n = 200 (conditional on $\hat{l} = 2$), Model (j)

| Parameter | i = 1 | i = 2 | i = 3 |
|---|---|---|---|
| $\hat{\beta}_{i0}$ (SE) | 0.987 (0.290) | -0.029 (0.212) | 0.454 (0.413) |
| $\hat{\beta}_{i1}$ (SE) | 0.996 (0.062) | 0.097 (0.480) | 0.011 (0.092) |
| $\hat{\beta}_{i2}$ (SE) | -0.001 (0.017) | 1.000 (0.032) | 0.499 (0.028) |
| $\hat{\sigma}_i^2$ (SE) | 0.511 (0.165) | 0.681 (0.269) | 1.002 (0.294) |

Figure 2.1: $(x_1, x_2)$ uniformly distributed over the shaded area
Figure 2.2: $(x_1, x_2)$ uniformly distributed over the eight points
Figure 2.3: Miles per gallon vs. weight for 38 cars
Figure 4.1: $(x_1, x_2)$ uniformly distributed over each of six regions with indicated mass
Asymptotic inference for segmented regression models Wu, Shiying 1992-12-18
Item Metadata
Title | Asymptotic inference for segmented regression models |
Creator | Wu, Shiying |
Date Issued | 1992 |
Description | This thesis deals with the estimation of segmented multivariate regression models. A segmented regression model is a regression model which has different analytical forms in different regions of the domain of the independent variables. Without knowing the number of these regions and their boundaries, we first estimate the number of these regions by using a modified Schwarz' criterion. Under fairly general conditions, the estimated number of regions is shown to be weakly consistent. We then estimate the change points or "thresholds" where the boundaries lie and the regression coefficients given the (estimated) number of regions by minimizing the sum of squares of the residuals. It is shown that the estimates of the thresholds converge at the rate of $O_p(\ln^2 n / n)$ if the model is discontinuous at the thresholds, and $O_p(n^{-1/2})$ if the model is continuous. In both cases, the estimated regression coefficients and residual variances are shown to be asymptotically normal. It is worth noting that the condition required of the error distribution is local exponential boundedness, which is satisfied by any distribution with zero mean and a moment generating function provided its second derivative is bounded near zero. As an illustration, a segmented bivariate regression model is fitted to real data and the relevance of the asymptotic results is examined through simulation studies. The identifiability of the segmentation variable is also discussed. Under different conditions, two consistent estimation procedures of the segmentation variable are given. The results are then generalized to the case where the noises are heteroscedastic and autocorrelated. The noises are modeled as moving averages of an infinite number of independently, identically distributed random variables multiplied by different constants in different regions. It is shown that with a slight modification of our assumptions, the estimated number of regions is still consistent, and the threshold estimates retain the convergence rate of $O_p(\ln^2 n / n)$ when the segmented regression model is discontinuous at the thresholds. The estimation procedures also give consistent estimates of the residual variances for each region. These estimates and the estimates of the regression coefficients are shown to be asymptotically normal. The consistent estimate of the segmentation variable is also given. Simulations are carried out for different model specifications to examine the performance of the procedures for different sample sizes. |
Extent | 5371601 bytes |
Genre | Thesis/Dissertation |
Type | Text |
File Format | application/pdf |
Language | eng |
Date Available | 2008-12-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0086617 |
URI | http://hdl.handle.net/2429/3141 |
Degree | Doctor of Philosophy - PhD |
Program | Statistics |
Affiliation | Science, Faculty of; Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1992-11 |
Campus | UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |