UBC Theses and Dissertations

Estimation of heteroskedastic limited dependent variable models. Donald, Stephen Geoffrey (1990)

Full Text
ESTIMATION OF HETEROSKEDASTIC LIMITED DEPENDENT VARIABLE MODELS

By STEPHEN GEOFFREY DONALD
B.Ec. (Honours), University of Sydney

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF ECONOMICS

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October 1990
© STEPHEN GEOFFREY DONALD, 1990

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Economics
The University of British Columbia
Vancouver, Canada

Abstract

This thesis considers the problem of estimating limited dependent variable models when the latent residuals are heteroskedastic normally distributed random variables. Commonly used estimators are generally inconsistent in such situations. Two estimation methods that allow consistent estimation of the parameters of interest are presented and shown to be consistent when the latent residuals are heteroskedastic of unknown form. Both estimators use recent advances in the econometric literature on nonparametric estimation and deal with the unknown form of heteroskedasticity by approximating the variance using a Fourier series type approximation. The first estimator is based on the method of maximum likelihood and involves maximising the likelihood function by choice of the parameters of the variance function approximation and the other parameters of interest.
Consistency is shown to hold in the three most popular limited dependent variable models — the Probit, Tobit, and sample selection models — provided that the number of terms in the approximation increases with the sample size. The second estimator, which can be used to estimate the Tobit and sample selection models, is based on a two-step procedure, using Fourier series approximations in both steps. Consistency and asymptotic normality are proven under restrictions on the rate of increase of the number of parameters in the approximating functions. Finally, a small Monte Carlo experiment is conducted to examine the small sample properties of the estimators. The results show that the estimators perform quite well and in many cases reduce the bias, relative to the commonly used estimators, with little or no efficiency loss.

Table of Contents

Abstract
List of Tables
Acknowledgement
1 Introduction
2 Literature Survey
3 Maximum Likelihood Estimation Methods
  3.1 Introduction
  3.2 Heuristics
  3.3 Models of Interest
  3.4 Approximating the Variance
  3.5 Uniform Convergence
  3.6 Identification
  3.7 Consistency and Practical Implications
  3.8 Conclusions
4 Two Step Estimation Methods
  4.1 Introduction
  4.2 The Model
    4.2.1 Intuition Behind the Estimator
  4.3 Nonparametric Series Regression Results
  4.4 Estimation with Known λt
  4.5 Estimation of the Discrete Choice Model
    4.5.1 Heteroskedastic u1t
    4.5.2 Homoskedastic u1t
  4.6 Asymptotic Normality with Estimated λt
    4.6.1 Estimating the Covariance Matrices
  4.7 Specification Tests
    4.7.1 A Test for Sample Selectivity Bias
    4.7.2 A Test for Heteroskedasticity
  4.8 Conclusions
5 Small Sample Properties
  5.1 Introduction
  5.2 Probit Model
  5.3 The Tobit Model
  5.4 The Sample Selection Model
  5.5 Conclusions
6 Conclusions
Appendices
A Proofs for Chapter 3
B Proofs for Chapter 4
Bibliography

List of Tables
5.1 Approximate Specification Errors - Probit
5.2 Probit experiment 1
5.3 Probit experiment 2
5.4 Probit experiment 3
5.5 Probit experiment 4
5.6 Probit experiment 5
5.7 Approximate Specification Errors - Tobit, σ
5.8 Approximate Specification Errors - Tobit, log σ
5.9 Tobit experiment 1
5.10 Tobit experiment 2
5.11 Tobit experiment 3
5.12 Tobit experiment 4
5.13 Tobit experiment 5
5.14 Approximate Specification Errors - Sample Selection, h
5.15 Sample selection experiment 1
5.16 Sample selection experiment 2
5.17 Sample selection experiment 3
5.18 Sample selection experiment 4
5.19 Sample selection experiment 5

Acknowledgement

I would like to thank the members of my committee, John Cragg, Harry Paarsch and Brendan McCabe, for comments on previous versions of this thesis. I would also like to thank John Cragg and Harry Paarsch for support and encouragement during my time at UBC. The comments of Terry Wales, Ken White, Margaret Slade, Bill Schworm, Jeremy Rudin and classmates are also appreciated. I would also like to thank my parents for their financial and moral support over the last four years.

Chapter 1 Introduction

The use of limited dependent variables in economics has increased dramatically in recent years. Such models often arise in micro-econometric contexts where data sets contain variables which may be qualitative or limited in nature. This has necessitated the explicit introduction of such possibilities into structural economic models and the development of statistical techniques for the estimation of such models. This approach often gives rise to a latent model and a rule determining what will be observed in the data. From this and an assumption governing the distribution of the unobservables in the model one can then proceed to estimate the parameters. A common assumption is that such unobservables are independent and identically distributed (i.i.d.) as normal random variables.
Because of the nature of limited dependent variable models this assumption is not harmless, and so the situation differs markedly from other more standard situations such as the linear regression model. In particular, violation of either the normality assumption or the identical-distribution assumption is known to cause estimator inconsistency (the nature of which is still not well understood). While recent attention has been focussed on the normality assumption and, in particular, on obtaining consistent estimators without explicit knowledge of the distribution of the error terms, the identical-distribution assumption has been relatively neglected. This thesis examines this assumption while maintaining the normality assumption, and proposes estimation techniques for some popular limited dependent variable models which yield consistent estimators in situations where there is quite general heteroskedasticity in the unobserved components of the model.

Chapter 2 provides a brief survey of the literature on the estimation of limited dependent variable models (LDVMs) and other theoretical work related to that considered in this paper. It will be apparent that while there have been important advances in our understanding of LDVMs and in estimation technology, much remains to be done before our understanding of LDVMs approaches that of the regression model. It is hoped that this thesis will remedy some of the deficiencies in this literature. LDVMs are typically estimated (under normality) using either maximum likelihood or some two-step method (which is computationally simpler). Chapter 3 considers the problem of estimating the three workhorse LDVMs: the Probit, Tobit, and sample selection models, using a maximum likelihood type estimator. The estimator proposed in this case allows for heteroskedasticity of unknown form in the latent residual by approximating the variance using a Fourier series type approximation.
Consistency results for these three models are proven under the assumption that the number of terms in the approximation increases with the sample size. This technique has been developed by Gallant and Nychka (1987) and is referred to as semi-non-parametric estimation. Similar estimation techniques are referred to in the statistics literature as the "method of sieves"; see Geman and Hwang (1982). One attractive feature of the estimation technique is ease of computation relative to some other estimators proposed in the literature. In Chapter 4 we consider two-step type estimators for the Tobit and sample selection models, which are similar to those proposed by Heckman (1979), but which explicitly allow for the fact that there is heteroskedasticity of unknown form in the latent errors. It is shown that heteroskedasticity shows up in the conditional mean of the dependent variable in two places and lends a structure similar to that of the semiparametric regression model. The added complication arises from a first step estimation error. The estimator proposed is similar to estimators belonging to a class of semiparametric estimators recently proposed by Andrews (1989a), and again uses series approximations to the unknown functions. Results regarding the √N-consistent estimation and asymptotic normality of the proposed two-step estimator are proven under fairly weak assumptions, and explicit maximal rates of increase of the number of parameters are derived. In addition, consistent estimators for the covariance matrix are proposed and Durbin-Wu-Hausman type specification test statistics are developed. One nice feature of the estimators proposed in this thesis is that they are relatively easy to compute and may in fact be computed using standard packages such as SHAZAM or TSP.
This makes it relatively simple to perform a small Monte Carlo experiment to examine how the estimators perform relative to the commonly used estimators in small samples. In addition, some evidence on the required number of terms for good performance can also be provided, although the evidence cannot cover all possible situations. As is the case with other nonparametric type estimators, it is found that the bias can be reduced even with a small number of terms in the approximating functions. In the case of the two-step estimators, this usually comes at the cost of a loss in efficiency, but in the case of the MLE, allowing for heteroskedasticity may even improve the efficiency (in small samples) of the estimates. Chapter 6 concludes the thesis with a brief summary of the contributions and points to areas of future research.

Chapter 2 Literature Survey

In this chapter we survey briefly some of the literature concerning the estimation of limited dependent variable models, identifying important contributions and deficiencies in this area. We then identify some of the contributions to this literature made in this thesis. More complete surveys on LDVMs, which also cover some of the applications of these types of models, may be found in Maddala (1983), McFadden (1985), Amemiya (1985), Dhrymes (1986), and Maddala (1986). In addition, recent contributions to the theoretical literature are contained in Duncan (1986a) and Blundell (1987), and a survey of some of the specification testing literature is contained in Pagan and Vella (1988). This survey is by no means complete and concentrates on the econometric literature in this area, recognising that some of these models have a longer history in statistics. The need for LDVMs in economics is quite obvious. Often, especially in microeconomic situations, one is confronted with data that are qualitative or limited in nature.
Examples include: car or electrical appliance ownership; hours worked by an individual; expenditure on durables; and so on. The first example of empirical work in econometrics using a model which takes into account this type of situation is Tobin (1958), in his analysis of household expenditure on durable goods. In this situation, one is confronted with a non-negligible portion of households with zero expenditure in a given year. Tobin noted that the usual least squares estimator in this situation is inconsistent, even if the random errors are iid normally distributed, because a negative expenditure is impossible (at least in observed data). Tobin proposed a maximum likelihood (MLE) estimator by constructing the logarithm of the likelihood function under the assumption of iid normal latent errors. His model has become known as the "Tobit" model. The next paper to consider this type of situation in economics appears to be Cragg (1971), who presented a number of more general models, some of which contained bivariate random variables and which were estimated under the same assumptions used by Tobin (1958). These can be considered as forerunners to the now popular 'sample selection model' attributed to Heckman (1974, 1976, 1979) and Gronau (1973). All of these models have been termed 'generalised' Tobit models by Amemiya (1985). An alternative method of estimation for these models, based on a two-step procedure, was proposed in Heckman (1976, 1979). Amemiya (1973) proved the consistency and asymptotic normality of the MLE of the Tobit model under the assumption of iid normal random errors. Amemiya (1985) provides a useful survey of Tobit type models and various applications. In the case of discrete dependent variables, the first papers to appear in the econometric literature appear to be Theil (1969) and McFadden (1974) — the latter relating such models to choice decisions of individuals and hence incorporating utility functions.
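Tobin's point above, that least squares is inconsistent when expenditure is censored at zero, can be seen in a small simulation. This is a toy sketch with made-up parameter values, not an example from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Latent "expenditure": y* = -0.5 + x + u, with u iid standard normal.
x = rng.uniform(0.0, 2.0, size=n)
y_star = -0.5 + 1.0 * x + rng.standard_normal(n)

# Observed expenditure cannot be negative, so it is censored at zero.
y_obs = np.maximum(y_star, 0.0)

X = np.column_stack([np.ones(n), x])
slope_latent = np.linalg.lstsq(X, y_star, rcond=None)[0][1]
slope_censored = np.linalg.lstsq(X, y_obs, rcond=None)[0][1]

# OLS on the latent data recovers the slope of 1; OLS on the censored
# data is attenuated toward zero, which is the inconsistency Tobin noted.
print(slope_latent, slope_censored)
```

The attenuation does not vanish as n grows, which is why a likelihood that models the censoring explicitly is needed.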
As in the case of the Tobit type models, the early contributions appear to be directed at generalisations of the simple zero-one case. Although in the case of discrete choice models distributions apart from the normal one have been used to construct the likelihood functions (notably the logistic, giving rise to the Logit model), little attention has been paid to the possible mis-specification bias resulting from invalid distributional assumptions. McFadden (1985) provides a useful summary of the literature on discrete choice models. Since these early contributions, researchers have begun to worry more about the extreme distributional assumptions needed for consistency and have examined to what extent they can be relaxed. An important paper by Robinson (1982) showed that for the Tobit model the usual MLE remains consistent when observations are dependent. The only
Also Paarsch (1984) has shown that in finite samples the usual M L E may perform quite poorly when the true error distributions have fat tails. Given that our understanding is so limited, it is unsurprising that much attention has been devoted to the problem of estimating LDVMs without making explicit distributional assumptions. Perhaps the first papers to do this are Manski (1975, 1985) who developed the Maximum Score (MS) estimator for the discrete choice model. This estimator chooses the parameter vector that maximises the number of correct predictions (regarding choice) and is consistent under quite general conditions, only requiring a median restriction. The estimator is robust not only to nonnormality, but also heteroskedasticity. One drawback of the estimator is that it is difficult to compute and that it is not asymptotically normally distributed. Moreover, it is less than \fN consistent (and is in fact only T V 1 / 3 consistent). A smoothed version of the estimator has recently been proposed by Horowitz (1989) which is asymptotically normally distributed but is still less than \/N consistent. Similar work for Tobit type models has been done by Powell (1984, 1986). The first Chapter 2. Literature Survey 7 estimator proposed is similar to the MS estimator in that it is consistent under a restric-tion on the median of the latent errors. This estimator, known as the Censored Least Absolute Deviations (CLAD) estimator, works because the median is invariant to cen-soring provided that it is positive (assuming that censoring is at zero). A fundamental identification condition requires that a non-negligible part of the data comes from con-ditional densities that where this is true. Like the MS estimator the C L A D is robust to heteroskedasticity and the only impediment to its use is computation which is quite difficult. The second estimator proposed is known as the Symmetrically Censored Least Squares (SCLS) estimator. 
It is based on the observation that if one could censor the ob-servations symmetrically about the mean (assuming sjmimetric distribution of the latent errors) then ordinary least squares applied to the symmetrically censored data would be consistent. An objective function which does this (asymptotically) is set up and estima-tion is based on a sort of reweighted least squares type algorithm. A similar approach works in the truncated regression model where trimming takes the place of censoring. As with the C L A D estimator these estimators are robust against heteroskedasticity and only require a symmetric and unimodal distribution. Despite the attractiveness of the estimators of Manski and Powell they have not been used much in practice. This appears to be in part due to the computational difficulties and also the fact that they tend to be inefficient. Thus based on an M S E criteria they may not be much better than the usual estimators (see Paarsch (1984) and Powell (1986)). This has led to some other work which has been concentrated on obtaining estimators based on maximum likelihood type criteria. Work in this direction has been done by Cosslett (1983) for the discrete choice model and by Duncan (1986b) and Fernandez (1986) for the Tobit model. In these estimators, the likelihood function is maximised by choice of the parameters in the model and the density function of the latent errors. To make this a meaningful exercise some restrictions must be made in finite samples, Chapter 2. Literature Survey 8 but consistency results obtain due to the fact that the restrictions are relaxed as the sample size grows. Duncan (1986b) uses splines to approximate the density of the errors, while Fernandez (1986) uses Fourier series approximations and estimation is performed by choice of the parameters of the approximation and the parameters in the model. Unfortunately, for these estimators only consistency results have been obtained thus far. 
Another drawback is that while the estimators are robust to nonnormality they maintain an identicality assumption and so are not robust to heteroskedasticity. The estimation method proposed by Matzkin (1988) which maximises by choice of both the distribution function and the regression function, may circumvent this problem, but the estimator seems difficult to employ and will give estimates that may be somewhat difficult to interpret. Similar work to this has been done by Ichimura (1988) as well as Klein and Spady (1987) for the discrete choice model and Horowitz (1986, 1988) for the Tobit model. In these cases, however, the estimators have been shown to be consistent and asymptotically normal and in the case of Klein and Spady the estimator attains the semiparametric effi-ciency bound. These estimators use kernel estimation and exploit the single index nature of the moments of the dependent variable. Note too, AndreAvs (1988, 1989d) has shown recently that similar results will hold if series are used instead. As with the estimation methods based on the likelihood function, these estimators require the assumption that the latent error terms are identically distributed, or the restrictive assumption that the distribution depends on the indpendent variables only through the index (which is also the mean of the latent variable). Thus far we have neglected any discussion of the sample selection model. This is because it differs in an important way from the other models in that it contains a bi-variate random variable. This makes it impossible to employ estimators analogous to those of Manski and Powell since quantile restrictions in bivariate models are somewhat Chapter 2. Literature Survey 9 problematic. This means that the likelihood based methods or some alternative methods have to be used. 
Gallant and Nychka (1987) have proposed a likelihood based method similar to the methods of Duncan and Fernandez, but approximate a bivariate density using hermite polynomial approximation. The}' also give quite general results for this type of situation. The other recently proposed method for the sample selection model is based on two-step methods similar to the Heckman (1979) estimator but which allow for an arbitrary density of the unobserved latent errors. The first paper to do this is Cosslett (1984) who derived consistency results. Also in this vein is the work of Powell (1987) and Newey (1988b) who use kernel estimation and polynomial approximation re-spectively. Both estimators have been shown to be \/rN consistent and asymptotically normally distributed under certain conditions. In the first stage one must use a method such as that of Ichimura or Klein and Spady, which are \/N consistent and asymptotically normal. The distribution of the second step estimator will depend on the distribution of the first step estimator. The results in Andrews (1989d) also cover this situation as does the semiparametric regression results of Robinson (1988) although the identification conditions in his case are quite restrictive. An application of these methods by Newey, Powell, and Walker (1990) has shown that in the case of the Heckman (1974) model of labour supply (and the subsequent work of Mroz (1987)) the results do not differ much from those obtained under the assumption of normality. Common in much of this work is the assumption that the latent errors are identi-cally distributed and it seems likely that the estimators (apart from those of Powell and Manski) will be inconsistent in the presence of heteroskedasticity. As is the case with the normality assumption, little is known about the properties of the usual estimators when the homoskedasticity assumption is violated. 
Only Hurd (1979) and Arabmazar and Schmidt (1981) have considered the nature of the asymptotic bias caused by heteroskedasticity in very simple situations. Powell (1986), in his Monte Carlo experiments, has provided some finite sample evidence on the nature of the bias in the context of the Tobit model and was led to the conclusion that "...failure of the homoskedasticity assumption may have more serious consequences than failure of normality in censored regression models." Moreover, since these models are typically estimated using cross-sectional data where there is often considerable heterogeneity, it is somewhat surprising that more work on LDVMs with heteroskedasticity has not been done. The only contributions appear to be those of Powell (1984, 1986) and Manski (1975, 1985), which are limited to univariate models and have seen little application. In fact, it appears that no remedies for the problem of heteroskedasticity in the sample selection model have yet been proposed. The contribution of this thesis is an attempt to remedy this deficiency in the literature. In particular, while maintaining the normality assumption (in which case the heteroskedasticity shows up as changing variance across observations) we propose estimation methods for the three main LDVMs which yield consistent, and in some cases asymptotically normal, estimators when the errors are heteroskedastic of unknown form. One nice feature of the estimators proposed is that they are relatively easy to compute, as evidenced by the ease with which we can perform a Monte Carlo experiment. As a by-product we present in Chapter 5 some small sample evidence on the effect of heteroskedasticity on these models. We should also briefly mention some of the technical work that is used in this paper.
The main tool used to overcome the unknown nature of the variance function is the Fourier series approximation method proposed in the econometric literature by Gallant (1981) for estimation of regression models. The basic idea is to substitute the Fourier series approximation for the variance and to choose the parameters based on the maximisation of some objective function. In Chapter 3 the objective will be the likelihood function, while in Chapter 4 the objective will be the sum of squared deviations of the dependent variable from its expected value. The proof of consistency in the first case is similar to that of Gallant (1987) and Gallant and Nychka (1987), while in the second case we employ some of the results of Robinson (1988), Andrews (1988, 1989d), and Andrews and Whang (1989), although the situation is somewhat different from any of these. The methods used in the second case differ from those proposed in Andrews (1989a,b,c), which yield consistency and asymptotic normality for situations similar to that considered in this thesis.

Chapter 3 Maximum Likelihood Estimation Methods

3.1 Introduction

Likelihood based methods are now quite common in applications of LDVMs. This is especially the case in the Probit and Tobit models. Recent contributions by Cosslett (1983), Duncan (1986b) and Gallant and Nychka (1987) have examined generalising these methods to situations where the density of the latent errors in LDVMs is not restricted to a particular parametric family. Estimation proceeds by maximising the likelihood by choice of a density and the usual parameters of interest. The density is usually approximated by a function which is capable (in the limit) of approximating an arbitrary density function; hence the terminology nonparametric maximum likelihood estimation. A common assumption is that the latent errors are identically distributed.
The robustness of these types of estimators to heteroskedasticity is as yet unknown, but from a theoretical viewpoint it appears inconsistency will result. In this chapter, we propose an estimator which, under the assumption of normality, provides consistent estimates of the parameters of interest in the presence of heteroskedasticity of arbitrary unknown form. We deal with this case because when there is normality, heterogeneity takes on a form which can clearly be examined. The particular type of heterogeneity to be examined is a dependence of the scale parameters on the exogenous variables in the model. This is much like heteroskedasticity in the regression model, but unlike the regression model it produces inconsistent estimates (although little is known about the nature of this inconsistency). The basic idea of the estimator is to approximate the scale parameters with the Fourier Flexible Functional (FFF) form, and then to maximise the likelihood function with respect to the parameters of interest and the parameters of the approximating function. Consistency obtains because as the sample size increases, the number of terms in the approximation increases and the approximation error reduces. The advantage of this approach is that the estimates can be obtained using standard maximum likelihood methods. This approach is similar to recent work by Gallant and Nychka (1987) (hereafter GN), who approximate the shape of the density function with a Hermite polynomial series. The proof of consistency in this chapter and in GN (and most papers considering consistency of optimization type estimators) is essentially adapted from the classical proof of the consistency of the Maximum Likelihood Estimator (MLE) in the independent and identically distributed (iid) case by Wald (1949). The major drawback of the approach is the assumption of normality, which is maintained.
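The flavour of the FFF idea, approximating an unknown smooth function by a trigonometric polynomial whose fit improves as terms are added, can be sketched as follows. The log-variance function here is invented for illustration, and Gallant's FFF also includes linear and quadratic terms, which are omitted:

```python
import numpy as np

def log_sigma2(x):
    # A made-up smooth log-variance function to be approximated;
    # it is not itself a finite trigonometric polynomial.
    return 1.0 / (1.5 + np.cos(x))

x = np.linspace(0.0, 2.0 * np.pi, 400)
y = log_sigma2(x)

def trig_design(x, K):
    # Columns: 1, cos(kx), sin(kx) for k = 1..K  (2K + 1 parameters).
    cols = [np.ones_like(x)]
    for k in range(1, K + 1):
        cols.append(np.cos(k * x))
        cols.append(np.sin(k * x))
    return np.column_stack(cols)

rmse = {}
for K in (1, 2, 3):
    X = trig_design(x, K)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rmse[K] = float(np.sqrt(np.mean((X @ coef - y) ** 2)))

print(rmse)   # the approximation error falls as K grows
```

Approximating the log of the variance, and exponentiating, is one convenient way to keep the fitted variance positive.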
Future work could examine the possibility of consistent estimation with arbitrary distributional shape and heterogeneity. The estimation technique does, however, appear to be useful in situations where there is an infinite dimensional nuisance parameter which must be taken into account before consistent estimates of the parameters of interest can be obtained. The remainder of the chapter is organised as follows. Section 2 gives a heuristic discussion of the estimator and gives a general result on consistency which is adapted from GN. The remaining sections present the specific models of interest and provide assumptions on these models which guarantee that the conditions of the theorem are satisfied. The chapter concludes with a discussion of the practical implications of the chapter and suggests areas where further work is needed.

3.2 Heuristics

The type of situation we have in mind is the following. The underlying unobservables in the model are generated according to the density function f(u | σ(x)), where f is a known density function and σ(x) is an unknown function of a vector of exogenous variables x belonging to some set X, which is a subset of Euclidean space. Typically f will be chosen to be normal, so that σ²(x) is the variance parameter. This would give rise to the familiar heteroskedastic regression model if u were an additive disturbance in the usual regression model. Observations of the dependent variable z, which is possibly a vector, are presumed to come from some parametrically specified model with parameter vector β, but are observed with error because of some, possibly discontinuous, dependence on the unobserved errors u. It is also presumed that we are dealing with a situation in which the MLE of β, assuming σ(x) is a constant, is inconsistent (unlike the case of the linear regression model). This is generally the case with the LDVMs considered in this chapter.
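As a concrete instance of this setup, one can simulate a latent model whose normal errors have a scale depending on x. The particular σ(x) and parameter values below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

x = rng.uniform(-1.0, 1.0, size=n)
beta0, beta1 = 0.5, 1.0          # illustrative parameter values

# Heteroskedastic scale: sigma(x) = exp(0.75 * x), unknown to the econometrician.
sigma_x = np.exp(0.75 * x)
u = sigma_x * rng.standard_normal(n)

# Latent variable and the Probit observation rule z = I(y > 0).
y_latent = beta0 + beta1 * x + u
z = (y_latent > 0).astype(int)

print(z.mean())   # fraction of ones observed
```

Only (x, z) would be observed by the econometrician; y_latent, u, and sigma_x are the unobservables the estimator must contend with.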
The parametric assumptions and the density f give rise to a density of the observed variables z (conditional on x), which we denote by p(z | x, β, σ(x)), and a sample average logarithm of the likelihood function (assuming independence of observations), the negative of which is given by

s_n(β, σ) = -(1/n) Σ_t log p(z_t | x_t, β, σ).

The true values of β and σ will be denoted β* and σ*. The objective is to estimate β* consistently by minimising this function by choice of β, belonging to some set B, and the function σ, assuming that σ belongs to some set of functions denoted Σ. The arguments used to achieve this end follow the classical lines of the proof of consistency of the MLE by Wald (1949), which have been used repeatedly since. Our particular problem bears some resemblance to recent work by Gallant and Nychka (1987), who also
In our case this will represent an approximation to the σ function. The consistency result for the estimators of β* and σ*, which denote the true values, is obtained by letting the number of parameters K increase with the sample size, provided that the set Σ_K becomes "close" to Σ in some sense. The sense in which they must be close is that, as K increases, there should exist functions in Σ_K which are capable of approximating the true function in Σ as measured by some norm denoted ||.||. This corresponds to the denseness condition in the theorem presented below, which is a slight modification of Theorem 0 of GN. Note that since K will be required to increase with n, we index it by n as K_n.

Theorem 1: Let β_n and σ_n minimise s_n(β, σ) over B × Σ_{K_n}, where B ⊂ R^k is compact in the Euclidean norm ||.||_E and Σ_{K_n} is a subset of Σ on which is defined a norm ||σ||. Suppose the following conditions hold:

(a) Compactness: The closure of Σ w.r.t. ||.|| is compact in the relative topology generated by ||.||;

(b) Denseness: ∪_{n=1}^∞ Σ_{K_n} is a dense subset of Σ w.r.t. ||.||;

(c) Uniform Convergence: There are points (β*, σ*) in B × Σ and there is a function s(β, σ, β*, σ*) that is continuous in (β, σ) w.r.t. ||β||_E + ||σ|| and

lim_{n→∞} sup_{B×Σ} |s_n(β, σ) − s(β, σ, β*, σ*)| = 0;

(d) Identification: Any point (β⁰, σ⁰) in B × Σ with s(β⁰, σ⁰, β*, σ*) ≤ s(β*, σ*, β*, σ*) has ||β⁰ − β*||_E = 0 and ||σ⁰ − σ*|| = 0.

Then lim_{n→∞} ||β_n − β*||_E = 0 a.s. and lim_{n→∞} ||σ_n − σ*|| = 0 a.s., provided that lim_{n→∞} K_n = ∞.

Proof: For the proof of this result and all others in Chapter 3, see Appendix A.

Our approach is somewhat similar to GN, in that we obtain consistency by using an approximation of the unknown function. In our case we are concerned with the scale parameter σ, whereas GN were concerned with the density function parameters.
In addition, the nature of the approximation in our case will be very different from that used in GN. The conditions (a), (c), and (d) are fairly standard conditions for consistency results involving extremum type estimators. Condition (b) is the major modification and, as mentioned earlier, essentially requires that the approximation for σ be capable of becoming arbitrarily "good" as the number of terms increases.

In the next section, we present the three models to be examined in this chapter. The subsequent sections then make primitive assumptions on these models that guarantee that the above conditions for Theorem 1 are satisfied. The first two conditions can be examined in fairly general terms without reference to any particular model. The other two conditions must be verified specifically for each model. It will become apparent that one set of primitive assumptions will guarantee conditions (c) and (d) for almost any LDV model that may arise in practice (at least when normality is maintained).

3.3 Models of Interest

The structures of the Probit and Tobit models are fairly similar, as is the way in which they fit into the framework of Section 2. Consider some latent variable y generated by the following (where t subscripts have been suppressed for notational convenience),

y = xβ + u.

In the Probit model, one observes the endogenous variable

z = I(y > 0)

(where I(A) is the indicator function for the event A), while in the Tobit model one observes

z₁ = I(y > 0),  z₂ = y·I(y > 0).

The densities of the observables are respectively

p(z|x, β, σ) = F(xβ|σ)^z [1 − F(xβ|σ)]^{1−z}

p(z₁, z₂|x, β, σ) = f(z₂ − xβ|σ)^{z₁} [1 − F(xβ|σ)]^{1−z₁}

where

F(xβ|σ) = ∫_{−∞}^{xβ} f(u|σ(x)) du.

In both these models f is typically chosen to be a normal density with variance σ²(x), so we can write

f(u|σ(x)) = (1/σ(x)) φ(u/σ(x)),  F(xβ|σ) = Φ(xβ/σ(x)),

where φ and Φ are the standard normal density and distribution functions.
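These two observation densities are simple to evaluate under the normal choice of f. The sketch below (illustrative function names, using SciPy's standard normal φ and Φ) writes them out:

```python
import numpy as np
from scipy.stats import norm

def probit_density(z, xb, sigma):
    """p(z|x) = Phi(xb/sigma)^z * [1 - Phi(xb/sigma)]^(1-z)."""
    F = norm.cdf(xb / sigma)
    return F**z * (1 - F)**(1 - z)

def tobit_density(z1, z2, xb, sigma):
    """p(z1, z2|x) = [phi((z2-xb)/sigma)/sigma]^z1 * [1 - Phi(xb/sigma)]^(1-z1)."""
    f = norm.pdf((z2 - xb) / sigma) / sigma
    F = norm.cdf(xb / sigma)
    return f**z1 * (1 - F)**(1 - z1)
```

With σ(x) replaced by its series approximation, these are exactly the factors entering s_n.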
In both these cases, we will be minimising s_n(β, σ) by choice of β and the R-valued function σ, where σ > 0. More details on the required properties of this function will be given in Section 4.

The sample selection type model involves a bivariate normal density function and the two latent variables given below,

y₁ = xβ₁ + u₁
y₂ = xβ₂ + u₂.

The observed endogenous variables in this case are

z₁ = I(y₁ > 0),  z₂ = z₁·y₂.

In this case, the vector (u₁, u₂) has the bivariate normal density with covariance matrix Ω, which can be written as follows,

Ω(x) = [ σ₁²(x)   σ₁₂(x) ]
       [ σ₁₂(x)   σ₂²(x) ].

Because this must be positive definite for all the observations, it will be convenient to impose some structure on it and to redefine the covariance matrix. In particular, we assume that it can be written as follows,

Ω(x) = [ σ₁² q₁(x)²        σ₁₂ q₁(x) q₂(x) ]
       [ σ₁₂ q₁(x) q₂(x)   σ₂² q₂(x)²      ]

where the two functions q₁ and q₂ will be restricted to be positive. Positive definiteness can then be imposed by making sure that the constant matrix

[ σ₁²   σ₁₂ ]
[ σ₁₂   σ₂² ]

is positive definite. Note that this structure allows covariance terms to be heteroskedastic but imposes constancy on the correlation coefficient. Also, no restriction is placed on the sign of this correlation. Such a structure would arise if

u₁ = q₁(x)ε₁,  u₂ = q₂(x)ε₂

with (ε₁, ε₂) having a homoskedastic bivariate normal density with the constant covariance matrix above.

The density in this case is complicated and is given below, where the above assumptions have been imposed. The parameter ρ₁₂ is the constant correlation coefficient. For z₁ = 1,

p(z|Ω(x), x, β) = (1/(σ₂q₂(x))) φ((z₂ − xβ₂)/(σ₂q₂(x))) × Φ( [xβ₁/(σ₁q₁(x)) + ρ₁₂(z₂ − xβ₂)/(σ₂q₂(x))] / (1 − ρ₁₂²)^{1/2} )

while for z₁ = 0,

p(z|Ω(x), x, β) = 1 − Φ( xβ₁/(σ₁q₁(x)) ).

Note that it is impossible to identify σ₁ and σ₂ unless q₁ and q₂ are normalised, although this is unnecessary since the main concern is with the β coefficients. Also note that in this case, and in the Probit case, some normalisation of β₁ will be required to help identify the other parameters.
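The two likelihood contributions of the selection model can be sketched in code. This follows the reconstructed density above and is illustrative only; σ₁q₁(x) and σ₂q₂(x) enter only through the products passed as s1q1 and s2q2.

```python
import numpy as np
from scipy.stats import norm

def selection_density(z1, z2, xb1, xb2, s1q1, s2q2, rho):
    """Observation density for the sample selection model.
    z1 = 0 contributes 1 - Phi(xb1 / s1q1); z1 = 1 contributes the
    normal density of z2 times the conditional selection probability."""
    a = xb1 / s1q1
    if z1 == 0:
        return 1.0 - norm.cdf(a)
    e2 = (z2 - xb2) / s2q2
    return (norm.pdf(e2) / s2q2) * norm.cdf((a + rho * e2) / np.sqrt(1.0 - rho**2))
```

When ρ₁₂ = 0 the z₁ = 1 contribution factors into a marginal normal density for z₂ times the selection probability, which provides a quick sanity check on the formula.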
This is also obviously the case in the homoskedastic models, where it is typically assumed that σ₁ = 1. In the sample selection case, we minimise the objective function by choice of the two q functions, the two sets of coefficients β₁ and β₂, and the correlation coefficient ρ₁₂. The restrictions will be that the q functions are positive and that |ρ₁₂| ≤ 1 − ε for some ε > 0. The second of these restrictions is easy to impose. The problem of imposing positivity on the approximating functions in all these models will be discussed in Section 7, where we look more closely at the practical implications of this work. The next section considers in detail the technical side of conditions (a) and (b) with regard to these models and discusses the nature of the approximating functions.

3.4 Approximating the Variance

In this section, we discuss the way in which the unknown variance and covariance functions will be approximated. The functions we wish to approximate are σ for the Probit and Tobit models and q₁ and q₂ for the sample selection model. All the discussion will apply to each of these functions separately. Before deciding on a method of approximation, we must first decide what minimal set of conditions these functions must satisfy. We will refer to the function of interest generically as g. Then we must decide the sense in which the approximating function approximates g. This is done with reference to some norm, which will be termed the consistency norm for the function g. We denote this norm by ||.|| and for the moment we define it by

||g|| = sup_{x∈X} |g(x)|

where g is defined on the set X. We will assume that X is an open bounded subset of k-dimensional Euclidean space where, without loss of generality, we assume that the closure of X is contained in the k-dimensional open cube defined by the intervals (0, 2π).
This assumption is motivated by the fact that the type of approximation to be used is based on Fourier series, which are periodic of period 2π. Since g will not be assumed to be periodic, this type of approximation will only be able to approximate g on X if cl(X) is in the open cube; see Gallant (1981).

The following notation will be useful in what follows. Let

D^λ g(x) = ∂^{|λ|} g(x) / (∂x₁^{λ₁} ··· ∂x_k^{λ_k})

where |λ| = ∑_{i=1}^k λ_i and the λ_i are nonnegative integers. If λ = 0 then D⁰g(x) = g(x). We make use of the following pseudonorms (which are discussed fully in Adams (1975)),

||g||_{m,∞,μ} = max_{|λ|≤m} ess sup_x |D^λ g(x)|

where the essential supremum is with respect to the probability measure μ defined on X. That is, we assume that x is a random vector on X, with probability measure given by μ. These norms are termed Sobolev norms and give rise to Sobolev spaces, which are subspaces of the more familiar L_p spaces.

In discussing the issues of compactness and denseness, we follow the same arguments as in Elbadawi, Gallant, and Souza (1983) (hereafter EGS). We define the set of g functions as G. Assumption 1 gives the conditions we wish the g functions to satisfy.

Assumption 1: G consists of all functions g defined on X such that
i) D^λ g(x) is continuous on X for all λ such that |λ| ≤ 3;
ii) sup_x |D^λ g(x)| < ∞ for |λ| = 3;
iii) ||g||_{2,∞,μ} ≤ b < ∞;
iv) inf_x g(x) ≥ δ > 0.

Note that these assumptions imply that up to third order partial derivatives are continuous (i), and that certain smoothness conditions are satisfied by the function. The last assumption (iv) ensures that the variance terms are bounded away from 0 uniformly in x.

The type of approximating function that we use is a variant of a multivariate Fourier series function. It may contain linear and quadratic terms in x, but the remaining terms in the function will be trigonometric.
In talking about the number of terms of the approximating function going to infinity, we will be referring to the trigonometric terms. If the function were to contain quadratic terms, then it would coincide with the Fourier Flexible Functional (FFF) form of Gallant (1980). The addition of quadratic terms to the Fourier terms is meant to improve the approximation so that fewer trigonometric terms are needed to obtain a "good" approximation. These additional terms, however, will not affect any of the theory regarding the approximation of a function by Fourier series. The FFF is given by

g_K(x) = a + b′x + (1/2) x′Cx + ∑_{α=1}^A ( u_{0α} + ∑_{j=1}^J ( u_{jα} cos(j k_α′ x) − v_{jα} sin(j k_α′ x) ) )

with a, b, C, and the u_{jα}, v_{jα} being parameters, and where the k_α are a sequence of multi-indices (which are integer valued k-dimensional vectors) given in Gallant (1981), and

C = −∑_{α=1}^A u_{0α} k_α k_α′.

The degree of the trigonometric polynomial will be denoted by K and satisfies

j ∑_{i=1}^k |k_{iα}| ≤ K

for all α ∈ {1, 2, ..., A} and j ∈ {1, 2, ..., J}. As noted by EGS, A and J will depend on K. For a given sample size n, we let the degree of the polynomial be K_n. In order to obtain a consistent estimate of the function, we will need the number of parameters in the approximating function to increase with the sample size. That is, we need that K_n → ∞. As noted by Gallant (1980), the rule determining the degree K may be adaptive (data dependent) or fixed.

A convenient way of writing the approximating function is

g_K(x|θ) = ∑_{j=1}^K θ_j ψ_j(x).

Here the θ_j are coefficients and the ψ_j are the elementary functions used (a constant will be included). The attractive feature of these types of approximations is that as K goes to infinity the above function becomes capable of approximating g in a sense that will now be discussed.
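In the univariate case (k = 1) the elementary functions ψ_j can be generated as follows. This is an illustrative sketch only, since the multivariate FFF uses the multi-index sequence k_α of Gallant (1981); the function name is hypothetical.

```python
import numpy as np

def fff_basis(x, K):
    """Fourier flexible form basis for scalar x in (0, 2*pi):
    a constant, linear and quadratic terms, plus cosine/sine
    pairs up to degree K."""
    cols = [np.ones_like(x), x, 0.5 * x**2]
    for j in range(1, K + 1):
        cols.append(np.cos(j * x))
        cols.append(np.sin(j * x))
    return np.column_stack(cols)
```

Regressing a smooth function on this basis by least squares illustrates the denseness property: because the bases are nested, enlarging K can only improve the fit.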
In particular, letting θ be an infinite dimensional vector of parameters, and given Assumption 1 and the corollary to Theorem 1 in Gallant (1981), there exists a vector θ* such that

lim_{K→∞} ||g − g_K(·|θ*)||_{2,∞,μ} = 0.    (3.1)

In practice, it would appear that there would be little problem in just substituting the approximating function into the objective functions defined in Section 4 and minimising over β and the θ parameters using standard maximum likelihood type procedures. To obtain consistency, however, we must overcome a number of technical difficulties to ensure that the conditions of Theorem 1 are satisfied. We follow the line of argument of EGS with some modification.

From above, we know that there is a vector θ* such that the corresponding approximating function represents g in the sense defined above. We will define the pointwise limit of this approximating function by g_∞. We would like to be able to just replace the g function with the corresponding g_∞ function and proceed from there. There are, however, a few problems we must overcome. First, we want to make sure that g_∞ is at least continuous. Second, we would like the set of these g_∞ functions, corresponding to the g functions satisfying Assumption 1, to be contained in a set which is compact with respect to ||.||.

We introduce some more notation. Let ||g_∞(·|θ)||_{2,∞,μ} = lim_{K→∞} ||g_K(·|θ)||_{2,∞,μ}. Define the set Θ by

Θ = {θ : lim_{K→∞} max_{|λ|≤2} sup_x |∑_{j=1}^K θ_j D^λ ψ_j(x)| < ∞}

and the set Θ₀ as the subset of Θ whose limit functions satisfy the bounds of Assumption 1,

Θ₀ = {θ ∈ Θ : max_{|λ|≤2} sup_x |∑_j θ_j D^λ ψ_j(x)| ≤ b and inf_x ∑_j θ_j ψ_j(x) ≥ δ}.

Now by (3.1), for every g in G there is a θ in Θ₀ which corresponds to it in the sense of ||.||_{2,∞,μ}. The following lemma is the first part of Lemma 2 of EGS and is useful in obtaining the needed compactness and continuity discussed above.

Lemma 1: There is a set Θ″ such that for all θ in Θ″, g_∞(x|θ) is continuously differentiable to order 1, and there is a continuous mapping from the weak* closure of Θ₀, denoted by Θ̄₀, onto Θ″.
Moreover, the set Θ″ is compact in the relative topology generated by ||.||_{1,∞,μ}.

Proof: See Elbadawi, Gallant, and Souza (1983).

This lemma provides us with the desired continuity and compactness. We can now represent the g function by some θ in Θ″ as g_∞(x|θ). This function will be continuous and so is identical to the g function in the consistency norm. Next, since we have that all the g_∞ are continuous, then

||g_∞|| = ||g_∞||_{0,∞,μ} ≤ ||g_∞||_{1,∞,μ}

where the weak inequality is obvious from the definition of the Sobolev norm given above. This then gives the result that convergence in ||.||_{1,∞,μ} implies convergence in ||.||, so that Θ″ will also be compact in the topology generated by the consistency norm.

Next, we discuss the estimation space. Since it is clearly infeasible in finite samples to minimise the objective function by choice of the complete vector θ, we must truncate the vector at some point, say K_n, which may depend on the sample size n. In terms of the above, we represent this as minimising over the subset of θ vectors in Θ″ ∩ Θ which have the form

(θ₁, ..., θ_{K_n}, 0, 0, ...).

There will clearly not be any problem of emptiness here, since if b > δ then the element which has all zeroes except the first, with that first element between b and δ, is clearly of the appropriate form and is also in Θ″ ∩ Θ. We denote the set of vectors with K_n nonzero elements by Θ_{K_n}. Clearly, we will then have that ∪_{n=1}^∞ Θ_{K_n} = Θ″ ∩ Θ provided that K_n → ∞, so that condition (b) of Theorem 1 is satisfied. In actual practice, when performing the optimisation we will have two sets of constraints that will force the function to be in Θ₀ ∩ Θ ⊂ Θ″ ∩ Θ. These will be that

sup_x |D^λ g_{K_n}(x|θ)| ≤ b

for all |λ| ≤ 2, and

inf_x g_{K_n}(x|θ) ≥ δ > 0.

The required continuity will clearly be satisfied. EGS note that the fact that we are optimising over a subset of the function space is not likely to be of much importance
in practice. As far as imposing the above restrictions goes, the first does not seem to pose any practical problems since we can always make b quite large. The non-negativity constraint, however, does appear to pose some problem in practice. More on this will be taken up later.

So far we have only considered the case of a single variance parameter. In the case of a bivariate model, we must estimate two functions. This is easily handled by the above discussion, since we can treat each function as a separate parameter and consider compactness and denseness separately. The norm of the variance covariance matrix can just be defined as the sum of the norms of the two functions. The product space (Θ″ ∩ Θ) × (Θ″ ∩ Θ) will then be compact in this norm. Minimisation will then be done over two sets of θ parameters.

3.5 Uniform Convergence

In this section we provide conditions which guarantee that (c) of Theorem 1 is satisfied for the three models of interest. That is, we verify that there exists a function s(β, σ, β*, σ*) that is continuous in (β, σ) with respect to the norm ||(β, σ)|| and

lim_{n→∞} sup_{B×Σ} |s_n(β, σ) − s(β, σ, β*, σ*)| = 0.

(Note that in the sample selection model the ρ₁₂ parameter should be included with the β parameters and will be restricted to lie in [−1 + ε, 1 − ε].) In addition to our Assumption 1, we make the following assumption on the exogenous variables.

Assumption 2: x_t is a Cesàro sum generator with respect to μ(x) on X, and (u_t, x_t) is a Cesàro sum generator with respect to f(u|σ*(x)) du dμ(x), where σ* is the true variance parameter(s).

The meaning of this assumption is that limits of Cesàro sums of the random variables of interest can be computed as integrals. Moreover, the same applies to any continuous function which is dominated by an integrable function. This allows for completely exogenous variables (GN),
so no identity of distributions is needed. Essentially, all that is needed is that some law of large numbers is applicable. The following three lemmata give the required results for the three models of interest. In addition to Assumptions 1 and 2, we use the fact that B is compact. This, along with Assumption 1, puts bounds on the regression function part xβ and keeps σ away from 0 and ∞, and these facts are used heavily in the proofs of the lemmata. Also, convergence in the metric ||.|| implies pointwise convergence of the function, and this is exploited.

Lemma 2: For the Probit model, given Assumptions 1 and 2 and compactness of B, (c) of Theorem 1 is satisfied.

Lemma 3: For the Tobit model, given Assumptions 1 and 2 and compactness of B, (c) of Theorem 1 is satisfied.

Lemma 4: For the sample selection model, given Assumptions 1 and 2 and compactness of B, (c) of Theorem 1 is satisfied.

The proofs are somewhat simpler than the proof of the result in GN, which is for the sample selection model. That proof is complicated by the fact that they were interested in approximating the shape of the density function. Note that in the proofs heavy use is made of the fact that the x variables are bounded. This comes about because of the type of approximation we have used, which will only work for a function defined on a bounded set. If we were able to choose an approximating function that was able to approximate the variance terms over the whole real plane, then some assumptions about the integrability of certain functions of x would be required.

3.6 Identification

In this section, we examine the identification of the three models of interest. This is done by verifying that, in the class of functions C(X) and the set B, any two parameters which differ from the true parameters in the sense of the norm give rise to different probability models in a sense that will be defined below.
Then it will be shown that this implies condition (d) of Theorem 1. Instead of examining the narrow class of functions Θ″, we concentrate here on the class C(X), which contains Θ″. Identification in C(X) will imply identification in Θ″. Note that two elements are considered different if the norm of their difference is nonzero. We make use of the following definition of identifiable uniqueness of a probability model in a set of semi-non-parametric probability models. It is a generalisation of the definition typically used in the literature (see Wald (1949), Assumption 4, for example).

Definition: The parameter vector (β*, σ*) in B × C(X) is identifiably unique in B × C(X) if for any (β, σ) not equal to (β*, σ*) there is a set D in X with μ(D) > 0 and a set E in the support of z, such that

Pr(z ∈ E | x; β, σ) ≠ Pr(z ∈ E | x; β*, σ*)

for all x in D. Note that E must have positive probability under at least one of the two parameter vectors.

In examining whether the models satisfy this, we make a number of assumptions which are typical of this type of work.

Assumption 3: The probability measure μ on X has the property that at least one coordinate of x has everywhere (on its domain in (0, 2π)) positive Lebesgue density conditional on the other components.

This assumption is common in work on nonparametric approaches to estimation and identification (see Matzkin (1989) for example). It allows one to obtain an open ball around a given point in X which has positive probability with respect to μ and on which the densities of z differ for some nonnull set in the support of z.

Assumption 4: For the Probit model and the Tobit model, σ is in C(X). For the sample selection model, the functions q₁ and q₂ are in C(X).

Assumption 5: x in R^k satisfies a full rank condition: μ(x : x(β − β*) = 0) = 1 implies that β = β*.
Assumption 6: In the Probit model and the dichotomous part of the sample selection model (when only the variable I(y > 0) is observed), it is known that the coefficients in the corresponding equation satisfy ||β||_E = 1.

Assumption 4 is a minimal smoothness condition which is satisfied by the set of functions of interest discussed in the previous sections. Assumption 5 rules out perfect multicollinearity among the x variables and allows identification of β given that xβ is identified. Assumption 6 takes care of the fact that when the discrete variable is observed the contribution to the likelihood function is invariant to monotonic transformations of the β and the σ coefficients. Since it can be shown that xβ/σ is identified, and hence the sign of xβ is identified, we can then identify the β up to scale. This is similar to the situation of Manski (1975, 1985). To be able to obtain similar results we make the following assumption, where we have x partitioned as x = (x̃, x_k), with x_k being the continuously distributed variable, and the domain of x_k being (e, f) (which is strictly contained in (0, 2π)).

Assumption 7: It is known that in the equation for the variable y for which I(y > 0) is observed, the β corresponding to the continuously distributed x, β_k*, is nonzero. In addition, if β_k* > 0 then

e β_k* + x̃ β̃* < 0 < f β_k* + x̃ β̃*

for a (measurable) set of x̃ ∈ C with P(C) > 0 and P(x_k ∈ (c, d) | C) > 0 (where these probability measures are derived from the measure μ) for any (c, d) ⊂ (e, f). If β_k* < 0 then the inequality above is reversed.

This provides a sufficient condition for the identification of the parameters of the equation generating the dichotomous variable. The conditions are slightly stronger than those of Manski (1985) and are necessitated by the fact that in our case the x variables are bounded, which implies that the values of xβ will be bounded. Hence, more stringent restrictions on the β vector are required.
Essentially, the above assumption guarantees that there is a non-trivial subset of X on which the sign of xβ changes. This then allows identification of the β* parameters. The simplest example of the type of restrictions being imposed is the case where x̃ is a constant. Then we require that the intercept term and the slope coefficient β_k* differ in sign (assuming that the x variables have been scaled to be positive, as is required for use of the Fourier series approximation) and be of magnitudes which allow the above inequality to hold.

Lemma 5: Given Assumptions 3-7, the parameter vector (β*, σ*) is identifiably unique for the Probit model.

For the Tobit model we are able to identify β and σ without the use of normalisations. This is the same as in the usual case without heteroskedasticity. It results because, at least for some observations, we observe the value of y. Moreover, for any x value there is a positive probability that y is nonzero. Identification then follows from the properties of the normal density and the continuity built into the model.

Lemma 6: Under Assumptions 3-5, (β*, σ*) is identifiably unique for the Tobit model.

The sample selection model contains elements of both the Probit and Tobit models, and hence it is unsurprising that β₁ is only identified up to scale. It is important, however, to note that ρ₁₂ is identified although σ₁ and σ₂ are not, their influences on the density being mixed into the functions q₁ and q₂, which are identified up to scale.

Lemma 7: Given Assumptions 3-7, (β₁*, σ₁*q₁*, β₂*, σ₂*q₂*, ρ₁₂*) is identifiably unique.

We can now verify condition (d) of Theorem 1. To do this we use the approach of Wald (1949), which exploits Jensen's inequality, identifiable uniqueness, and the integrability of functions of x.
We first note that the densities in all cases may be written as

s(u, x, β⁰, σ⁰, β*, σ*) = −log p(z|x, β⁰, σ⁰)

and that

∫∫ s(u, x, β⁰, σ⁰, β*, σ*) f(u|σ*(x)) du dμ(x) = ∫_X ∫_Z −log p(z|x, β⁰, σ⁰) p(z|x, β*, σ*) dν(z) dμ(x)

where ν(z) is the measure with respect to which p(z|x, β⁰, σ⁰) is a density. (To get this we use a change of variables argument.) For example, in the Probit model

ν(z) = δ₀(z) + δ₁(z)

where δ_i gives measure to the set where z = i, and in this case Z = {0, 1}. Using this notation, we have that for all three models the following lemma holds.

Lemma 8: The identification condition (d) of Theorem 1 is satisfied in all three models.

We have now verified that the four conditions of Theorem 1 are satisfied by the three models. Although in this section we only used C(X), it should be noted that this will imply identification in the actual function space used. This is because the function space (Σ in Theorem 1 and Θ″ in Section 4) is contained in C(X).

3.7 Consistency and Practical Implications

We can now state the main results of the chapter relating to consistency of the maximum likelihood estimators for the three models. These are presented below.

Proposition 1: Given Assumptions 1-7, the maximum likelihood estimators β̂ and σ̂ of the Probit model converge almost surely to the true values.

Proposition 2: Given Assumptions 1-5, the MLE of the Tobit model converges almost surely to the true values.

Proposition 3: Given Assumptions 1-7, the MLE of the sample selection model converges almost surely to the true values.

These propositions are direct applications of Theorem 1, where the preceding lemmata have verified the conditions of the Theorem. In practice, we could face a number of problems. The first is related to the use of the FFF. As can be seen from the representation given earlier, there are likely to be a large number of parameters, especially if x contains a large number of variables.
This could lead to difficulties when trying to maximise the likelihood function. In principle, the maximisation can be done, but in practice difficulties may arise. One solution may be to limit the number of x variables appearing in the function. For example, if one had prior information on the variables most likely to be related to heterogeneity, then this could be used. One such example could be that, in the case of durable expenditure, one would expect there to be more expenditure variability at high income levels. Information of this sort may be useful in simplifying the estimation.

A second problem concerns the imposition of the non-negativity constraint on the variance approximations. As the method has been presented, this is likely to be nontrivial. One way to get around this problem is to approximate some transformation of the variance, with the transformation ensuring that the variance is nonnegative. For example, if one notes that

g(x) = exp(log(g(x)))

it would be possible to impose non-negativity on g by approximating log(g(x)) and ensuring that this is bounded away from −∞ and ∞. This will keep g bounded away from 0 and ∞, as is desired. All of the previous analysis can be applied to the parameter defined by log σ(x). Since exp is a uniformly continuous function on bounded intervals, all the identification results will trivially be satisfied.

It was noted earlier that we have assumed that X is a bounded open set. Amemiya (1973) and Dhrymes (1987) both prove consistency in the iid case for models of the type discussed in this chapter under the assumption of bounded x variables. In our case, this was necessitated by the fact that we chose to use the FFF. In practice, this is not likely to cause problems, since all actual data sets can be contained in some bounded open set, and hence can be scaled so that all x variables are in (0, 2π).
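This reparameterisation is straightforward to implement. The following sketch uses hypothetical names, with the bounds ±10 standing in for the constants that keep log g away from −∞ and ∞:

```python
import numpy as np

def sigma_from_log(theta, psi):
    """sigma(x) = exp(g_K(x|theta)), where g_K approximates log(sigma(x)).
    Clipping the log keeps sigma bounded away from 0 and infinity, so the
    non-negativity constraint never binds during optimisation."""
    log_sigma = np.clip(psi @ theta, -10.0, 10.0)  # illustrative bounds
    return np.exp(log_sigma)
```

Whatever values the approximation takes, the implied variance function is strictly positive and bounded, which is exactly the restriction the theory requires.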
An alternative to this boundedness assumption would be to use an alternative type of approximating function and some other metric for measuring the goodness of the approximation to the true function. This would require the imposition of restrictions on the moments of the x variables, and the integrability of certain functions of x. Moreover, the type of convergence one could achieve would likely be weaker than the intuitively appealing uniform convergence achieved in this chapter. Further work could examine the possibility of using other conditions and assumptions.

3.8 Conclusions

This chapter has considered the consistent estimation of heteroskedastic LDV models with particular reference to the Probit, Tobit, and sample selection models. Although interest in these models is typically centered on the parameters of the means of the latent variables (the β's), it was noted that mis-specification of the variance parameters, and in particular assuming homoskedastic latent residuals when there is in fact heteroskedasticity, will lead to inconsistent estimates of the parameters of interest (unlike the case of the linear regression model). The method proposed in this chapter allows one to obtain consistent estimates of both the variance parameters and the parameters of interest. The basic idea was to approximate the unknown variance functions using a Fourier series type approximation and to let the number of terms in the approximation increase with the sample size. Given certain smoothness conditions on the variance functions, the approximation becomes capable of being arbitrarily good (in the sense defined above). In addition, and of more interest, the corresponding estimates of the β's are consistent. Since the method of estimation is based on maximum likelihood, the estimation should be fairly straightforward to perform in practice and can be done using any of the standard optimisation algorithms.
The only difference lies in the fact that one must optimise with respect to both the β's and the parameters of the approximation to the variance parameter. This makes the method attractive relative to the computationally burdensome Maximum Score and Least Absolute Deviations estimators for the discrete choice and censored regression models respectively. The other advantage of the method is that it is more generally applicable, unlike the techniques mentioned above, which are based on quantile restrictions and do not generalise in an obvious way to multivariate contexts. In particular, for the sample selection model or any other bivariate model, no other technique is as yet known which is consistent in the presence of heteroskedasticity of an unknown form (at least without substantial identification assumptions, as would be required in a semi-parametric regression type technique). One could conjecture that the method may also be used in other multivariate models such as those listed in Amemiya (1985).

This chapter has been concerned only with consistency. Therefore, a useful area for further work is the asymptotic distribution theory for the estimator. There are two possible avenues to explore in this context. The first is to view the approximation as a parametric assumption and to consider the problem of making inferences on the resulting Quasi MLE. In this case, the methods of White (1982) and others may be of use. The second approach is to attempt to find the distribution of the β̂ and/or ρ̂₁₂ (note that we are most interested in inferences on β and ρ₁₂ rather than the σ terms) using recent results on asymptotic distribution theory for these types of estimators (see Andrews (1989) for example).
Chapter 4

Two Step Estimation Methods

4.1 Introduction

In this chapter we consider the problem of using two-step methods to estimate the sample selection and Tobit models when there is heteroskedasticity of an unknown form in the latent residuals. Such methods are quite common in practice since they are somewhat less burdensome computationally than the maximum likelihood estimator (MLE). These methods were first suggested by Heckman (1976, 1979) for these models, and one rarely finds examples where the maximum likelihood estimator has been computed (see Donald (1985)). Another reason for considering these methods is that the regression structure they possess is easier to handle than the MLE, which is usually defined as the implicit solution of a set of nonlinear equations, so that when one is concerned with the distribution of the estimators the analysis may be somewhat simpler. In addition, the analysis may aid in the development of an asymptotic normality result for the MLEs considered in the preceding chapter.

The work in this chapter differs in an important way from the work of Newey (1988), Powell (1988), and others who have considered two-step estimators which are consistent and asymptotically normal when the latent residuals are independent and identically distributed (iid) but with unknown density function. The models in this case have a double index form which in general will be lost if there is heteroskedasticity which is not a function of the indexes. In fact, the robustness of these methods to heteroskedasticity is unknown and is an important topic for future research, since most data used to estimate these models are heteroskedastic in nature (being typically large cross sections with heterogeneity a common problem; see Phillips (1988)).
Also, in a recent paper Newey, Powell, and Walker (1990) have estimated the Heckman (1974) model using these methods and found that the results do not differ much from those obtained under Gaussian assumptions, so at least for that particular problem distributional assumptions regarding the shape of the density do not seem to matter too much. They argue that other forms of mis-specification may be more important (at least with regard to that model and data). Heteroskedasticity is one form of mis-specification that has been relatively neglected, and may actually be important. In addition, the results of this chapter cover the case of the Tobit model and so offer an alternative estimator to the censored least absolute deviations and symmetrically censored least squares estimators of Powell (1984, 1986).

This chapter first considers the nature of the problem we are dealing with and, in particular, shows that the conditional means which are typically considered bear some resemblance to some other models that are currently being used in econometrics, namely semiparametric regression models. We then go on to present a two-stage estimation procedure similar to the standard Heckman (1979) procedure, but which allows for fairly general forms of heteroskedasticity in the latent error terms. Under the various quite general assumptions presented, the estimator is shown to be consistent and asymptotically normally distributed with the usual rate of convergence.

The semiparametric method we use is based on series type nonparametric estimators, which are quite easy to use and quite generally applicable. In particular, since the conditional mean regression function is likely to have heteroskedastic error terms, as noted in Robinson (1988), (and since we would like to be able to cope with heterogeneous observations in any case) we would like a nonparametric technique which copes with quite general heteroskedasticity.
The series approaches cope with this situation quite easily, whereas other techniques such as kernel regressions appear not to (see White and Stinchcombe (1989)). For this reason the results of Robinson (1988) are not directly applicable, since he maintains the assumption of independent and identically distributed (iid) errors (although he claims that heteroskedasticity can be dealt with using his technique). We therefore first provide results analogous to those of Robinson (1988) but using series estimators, and then go on to adapt this framework to the problem at hand. The main complication that arises, and which causes the model to differ from a pure semiparametric regression model, is that the usual inverse Mills ratio is estimated (consistently and asymptotically normally), so there will be some estimation error involved which will complicate the nonparametric estimation and the covariance matrix. This chapter deals with these issues by applying a result on the √N asymptotic normality of averages of nonparametric point estimates.

Obtaining a consistent and asymptotically normal estimator with the usual rate of convergence (√N) is quite useful. First, if one wishes to obtain an estimator that is best among all estimators which allow for arbitrary heteroskedasticity, then one would like to at least know that an estimator exists that is consistent at this rate, since it must then follow that the best possible estimator is also consistent at this rate. That is, the semiparametric efficiency bound is no larger than the variance-covariance matrix of the estimator obtained in this chapter.
Second, the result allows one to provide a basis for a specification test for heteroskedasticity in these types of models, based on the difference between the estimator obtained under homoskedasticity and that obtained under general heteroskedasticity in this chapter, measured in the metric provided by the variance-covariance matrix of this difference. Without both estimators being asymptotically normal at the same rate it is difficult to see how such a test could be constructed.

The chapter is structured as follows. Section 2 discusses the model of interest, shows how it is related to a semiparametric regression model, and also shows what complicates it relative to a semiparametric regression model. Some issues regarding identification of the model are also discussed. Section 3 presents in fairly general terms the type of nonparametric regression technique to be used throughout the chapter and presents a few results regarding convergence rates of the mean squared error and asymptotic normality of weighted averages. Section 4 presents results analogous to Robinson (1988) but allows for heteroskedasticity and uses the series regression estimator. This allows the derivation of a √N consistent, asymptotically normal estimator in the unrealistic situation in which the inverse Mills ratio is known. Section 5 discusses two possible ways of estimating the discrete choice model, depending on the assumptions one makes regarding the error term in the discrete part of the model (which will only be relevant in the case of the sample selection model), and of obtaining an estimator of the inverse Mills ratio. A linearisation analogous to the δ method is used to put the estimator in a form that can be handled when the distribution of the second stage estimator is examined. Section 6 presents the main result regarding √N consistent and asymptotically normal (RNCAN) estimation of the full two stage estimator.
Section 7 discusses the estimation of the variance-covariance matrix of the estimator. Section 8 presents a Hausman (1978) type specification test for heteroskedasticity which is shown to be asymptotically chi-squared. Section 9 concludes the chapter.

4.2 The Model

The type of model we consider involves a bivariate normal density function and the two latent variables given below

y*_{1t} = x_{1t}β_1 + u_{1t}  (4.2)

y*_{2t} = x_{2t}β_2 + u_{2t}  (4.3)

for t = 1, 2, ..., N. The observed endogenous variables in this case are

y_{1t} = 1(y*_{1t} > 0)  (4.4)

y_{2t} = y_{1t} y*_{2t}.  (4.5)

That is, we observe a discrete variable y_1, and conditional on this being unity we observe the value of the continuous latent variable y*_2. An example of this situation would be that we observe whether or not someone works, and then conditional on them working we observe the number of hours worked. Also we observe the x variables for all the observations. Since we will be assuming that the observations are independent, we can order the data so that the first N_1 observations have y_1 = 1 and the last N_0 have y_1 = 0, with N = N_1 + N_0. The vector (u_1, u_2) has the bivariate normal density with covariance matrix Ω, which can be written as

Ω(x) = [ σ_1²(x)   σ_{12}(x) ; σ_{12}(x)   σ_2²(x) ]  (4.6)

where x = (x_1, x_2). Note that if β_1 = β_2, x_1 = x_2 and u_1 = u_2, then the model reduces to the standard Tobit or censored model. The estimation problem arises because the unobservables determining the value of the discrete variable (i.e., u_1) may be correlated with those determining the value of the continuous variable. Thus employing OLS to estimate β_2 using only observations for which y_1 = 1 may be inconsistent. This is best seen by taking conditional expectations of y_2 given that y_1 = 1:

E(y_{2t} | y_{1t} = 1, x_t) = x_{2t}β_2 + (σ_{12}(x_t)/σ_1(x_t)) λ(x_{1t}β_1/σ_1(x_t))  (4.7)

where

λ(c) = φ(c)/Φ(c)  (4.8)

with φ and Φ being the standard normal density and distribution functions respectively.
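The inverse Mills ratio in (4.8) and the correction term in (4.7) are easy to compute directly. The following sketch (homoskedastic case, illustrative parameter values chosen here purely for the check) verifies the formula for the conditional mean against a Monte Carlo truncated mean.

```python
import numpy as np
from scipy.stats import norm

def mills(c):
    # inverse Mills ratio of (4.8): lambda(c) = phi(c) / Phi(c)
    return norm.pdf(c) / norm.cdf(c)

# Monte Carlo check of (4.7), homoskedastic case with illustrative values:
# E[y2 | y1 = 1] = x2*beta2 + (sigma12 / sigma1) * lambda(x1*beta1 / sigma1)
rng = np.random.default_rng(0)
sigma1, sigma2, sigma12 = 1.0, 1.0, 0.6
x1b1, x2b2 = 0.3, 2.0
cov = [[sigma1**2, sigma12], [sigma12, sigma2**2]]
u = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000)
selected = u[:, 0] > -x1b1                  # y1 = 1 iff x1*beta1 + u1 > 0
mc_mean = x2b2 + u[selected, 1].mean()      # simulated E[y2 | y1 = 1]
formula = x2b2 + (sigma12 / sigma1) * mills(x1b1 / sigma1)
```

With a nonzero covariance the correction term shifts the selected mean away from x_2β_2, which is exactly the omitted-regressor bias that OLS on the selected sample suffers from.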
When the covariance term is nonzero, the correlation between the second term and the first will in general lead to inconsistent estimates of β_2.

The above representation also provides motivation for a two-step estimation procedure in which the first stage estimates β_1 by Probit maximum likelihood on the first equation of the two equation model and uses it to obtain a correction term λ. The second stage then consists of performing OLS on the N_1 observations with y_1 = 1, with the correction term included as a regressor. This is almost always done under the assumption that the covariance matrix is constant across all the observations. If this is false, then problems may arise both in the estimation of λ and in the second stage regression, where the coefficient of λ may in fact be a function of x. In general, this will result in the estimator being inconsistent (the degree of this inconsistency has not to my knowledge been considered in the literature). This chapter will discuss a method for dealing with these sources of inconsistency.

First, since the main concern here is in estimating the β_2 parameter vector, the argument of λ, x_{1t}β_1/σ_1(x_{1t}), can be treated as some unknown function h(x_{1t}). We are implicitly assuming that the variance of the first latent residual u_1 is a function of only those x variables appearing in the mean (i.e., x_1). This assumption greatly simplifies notation and does not seem too extreme. Writing it in this way does not affect any of the main results of the chapter, since for the asymptotics the whole function must be consistently estimated. Note also that we can write the coefficient function σ_{12}(x)/σ_1(x) as g(x). Using these adjustments the model can be written as

y_{2t} = x_{2t}β_2 + g(x_t)λ(h(x_{1t})) + ξ_t  (4.9)

where ξ_t is a heteroskedastic error term with mean zero. In fact, the variance of ξ_t is given by

V(ξ_t | x_t) = σ_2²(x_t) − g(x_t)²( h(x_{1t})λ(h(x_{1t})) + λ(h(x_{1t}))² ).
(4.10)

The model above bears some resemblance to models currently being examined in econometrics, namely semiparametric regression models; see Robinson (1988). Suppose for the moment that the h function is known and is bounded on its domain, so that λ is bounded away from 0 and ∞. Then one can transform the model into the semiparametric regression form given below

w_t = z_tβ_2 + g(x_t) + η_t  (4.11)

where w_t = y_{2t}/λ_t, z_t = x_{2t}/λ_t, η_t = ξ_t/λ_t and λ_t = λ(h(x_{1t})). This is now in the familiar semiparametric regression form with g being an unknown function. So given that h is known we can apply recent results, for example Andrews (1988, 1989d), to obtain √N consistent estimates of β_2 for this model. Note, however, that since the error term is heteroskedastic the results of Robinson (1988) are not directly applicable. Section 4 presents a method of dealing with this.

In practice, however, h will be unknown and must be estimated. The model will then have the form

w_t = z_tβ_2 + g(x_t) + η_t + ϑ_t  (4.12)

with ϑ_t being related to the estimation error inherent in estimating the λ_t, and where estimates are used in place of the true values in the transformed variables. In analysing this model, we must therefore also worry about the estimation error inherent in the estimation of the λ_t. In addition, one must worry about the identification of the model, since the unrestricted nature of the g function may cause problems unless there are certain exclusion restrictions (referring to exclusion of certain of the x variables from the different functions involved).

4.2.1 Intuition Behind the Estimator

Before we discuss the intuition behind the estimation method to be used, we outline some of the basic assumptions on the model. For ease of exposition, denote the vector of x variables included as arguments of the covariance matrix elements, and hence arguments of g, by x_3, and remember that λ_t is a function of x_{1t}. The complete set of exogenous variables will be referred to as x.
Then, as noted above, we can write the conditional mean of y_{2t} as

y_{2t} = x_{2t}β_2 + g(x_{3t})λ_t + ξ_t  (4.13)

with E(ξ_t) = 0 and V(ξ_t) = σ_t² for all t, and it will be assumed that the model is such that

0 < inf_t σ_t² ≤ sup_t σ_t² < ∞  (4.14)

and

0 < inf_t λ_t ≤ sup_t λ_t < ∞.  (4.15)

As mentioned above, this allows us to transform the model so that

w_t = z_tβ_2 + g(x_{3t}) + η_t  (4.16)

with the transformed residual η_t having properties similar to those of ξ_t.

The population motivation for the type of estimator to be used in this chapter is the following. Take expectations of (4.16) conditional on x_{3t}, which gives

E(w_t | x_{3t}) = E(z_t | x_{3t})β_2 + g(x_{3t})  (4.17)

and subtract this from (4.16) to give

w_t − E(w_t | x_{3t}) = (z_t − E(z_t | x_{3t}))β_2 + η_t.  (4.18)

Note that we are implicitly conditioning on the fact that y_{1t} = 1. Since this is common to all the observations, the inclusion of this as an argument is redundant provided that x_3 contains a constant. This will be important further on in this chapter, so one should remember that y_{1t} is, in fact, an argument of these conditional expectations, but that it is omitted for notational simplicity. Note also that we are assuming that g is a common function for all the observations and that it does not depend on y_{1t}. This is no different from the usual case with homoskedasticity.

It would follow that if we knew the expectations in (4.16) then we could potentially estimate β_2. There is, however, the question of identification to be dealt with, since if z_t = E(z_t | x_{3t}) then (4.18) would no longer contain information on β_2 due to the cancellation. To avoid problems of non-identification in this regression model we make the following assumptions.

Assumption 2.1: For each element of the vector z_t we have that z_{it} ≠ E(z_{it} | x_{3t}).

Since in our case we have that z_t = x_{2t}/λ_t, we will require that x_1 contain at least one variable that is not contained in the vector x_3.
Moreover, we require that this variable is not measurable with respect to the σ-field generated by x_3. If all the elements of the variance-covariance matrix are functions of x_3, then this would require that there be one variable in x_1 that is not in x_3. Our next assumption is a standard full rank assumption on the regressors. Define the matrix of the z_t vectors as Z and the matrix of conditional expectations as T_3.

Assumption 2.2: Assume that

plim (1/N_1)(Z − T_3)'(Z − T_3) = A_1

which is a finite, positive definite matrix.

Note that we avoid complications such as in White (1984), where the limit may not exist. This simplifies the notation appreciably without diminishing the worth of the results. In addition to the preceding assumptions, we make the assumption of boundedness of all the exogenous variables. This is obviously not a desirable assumption, but it is quite common in theoretical work on limited dependent variables and in nonparametrics involving series type estimators (see White and Stinchcombe (1989), for example). Finally, it will be assumed that the errors possess sufficient moments that we can apply a central limit theorem to the estimators considered in this chapter.

Assumption 2.3: The data x_t are uniformly bounded independent random variables and any needed probability limits are assumed to exist. In addition E(ξ_t⁴ | x) ≤ Δ < ∞ for all t.

Note that this allows the data to be heterogeneous but does not allow any sort of dependence. This independence assumption may not be needed, as recent work by White and Stinchcombe (1989) has shown. Define the population regression estimator of β_2 by

β̃_2 = ((Z − T_3)'(Z − T_3))^{-1}(Z − T_3)'(W − E(W | x_3)).  (4.19)

We can now prove the following simple result on the asymptotic normality and consistency of the above estimator.

Theorem 1: Given Assumptions 2.1-2.3 we have

√N_1(β̃_2 − β_2) → N(0, V_2)

where

V_2 = A_1^{-1} plim (1/N_1) E((Z − T_3)'D_3^{-1}ΓD_3^{-1}(Z − T_3)) A_1^{-1}.
Proof: For the proof of this and all other results in this chapter, see Appendix B.

The matrix Γ is the diagonal variance-covariance matrix of the error terms ξ_t, and the D_3 matrix is given by D_3 = diag(λ_t). The result in Theorem 1 is not operational and the estimator β̃_2 is infeasible, but it does provide motivation for the approach in this chapter, which is essentially to replace the unknown expectations by consistent estimates.

4.3 Nonparametric Series Regression Results

In this section, we examine the series approach to the estimation of conditional expectations, as used to make the estimator considered above feasible. In doing this it is useful to refer to a generic situation given by the following nonparametric regression model

y_t = g(x_t) + u_t  (4.20)

where g is the function to be estimated (note that these variables are not necessarily the same as those included in the preceding analysis). The error term has expectation zero conditional on x_t, so that the regression function has the conditional mean interpretation. A series type estimator for g is obtained by replacing g with a truncated series approximation, which involves parameters and basis functions, denoted ψ_i(x_t), which may be polynomials or trigonometric functions. The type of approximation that will be used is based on the Fourier Flexible Functional Form (FFF) developed by Gallant (1981). In that case the basis functions include quadratic terms in x and trigonometric terms of the form cos(jk_α'x), sin(jk_α'x), where the k_α are multi-indices, some of which are presented in Gallant (1981), and the j are constant integers. The number of terms in the FFF is increased by adding more trigonometric terms. For a given number of terms, one is then working with the following regression

y_t = Σ_{i=1}^K θ_iψ_i(x_t) + u_t + ( g(x_t) − Σ_{i=1}^K θ_iψ_i(x_t) )  (4.21)

where the object is to estimate the parameters, K denoting the number of terms in the approximation.
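The truncated-series regression in (4.21) can be sketched as follows for a univariate x (the regression function and the heteroskedastic noise below are illustrative, not from the thesis; x is drawn inside (0, 2π) as the FFF scaling convention requires).

```python
import numpy as np

def trig_basis(x, K):
    # constant plus K sine/cosine pairs: a univariate analogue of the FFF terms
    cols = [np.ones_like(x)]
    for j in range(1, K + 1):
        cols.append(np.cos(j * x))
        cols.append(np.sin(j * x))
    return np.column_stack(cols)

rng = np.random.default_rng(1)
N = 2000
x = rng.uniform(0.5, 2 * np.pi - 0.5, N)
g = np.exp(np.sin(x))                        # smooth "unknown" function (illustrative)
u = rng.normal(0, 1, N) * (0.3 + 0.2 * np.abs(np.cos(x)))  # heteroskedastic error
y = g + u

Psi = trig_basis(x, K=6)                     # K(N) terms in the approximation
theta, *_ = np.linalg.lstsq(Psi, y, rcond=None)
g_hat = Psi @ theta                          # g_hat = P_x y
mse = np.mean((g_hat - g) ** 2)
```

Letting K grow with N drives both components of the mean squared error (sampling error and truncation bias) to zero, which is the content of the results that follow.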
Note that there are now two sources of error: one is due to sampling error (the u_t) and the other stems from the fact that the approximation need not equal the true function. Typically, one can eliminate this specification error asymptotically by choosing basis functions that are capable of approximating the unknown g function and by increasing the number of terms K with the sample size. Results on the consistency (see Gallant (1987)) and asymptotic normality (see Andrews (1988, 1989d)) are now available for these types of estimators. Our interest will be in the rates of convergence of the MSE for these estimators in situations where the sampling error is possibly heteroskedastic, and also in the asymptotic normality of weighted averages of these estimates. For this purpose the estimator for g can be written

ĝ = Ψ(x)(Ψ(x)'Ψ(x))^{-1}Ψ(x)'y = P_x y

where Ψ(x) is an N × K(N) matrix of basis functions and K(N) is the number of terms in the approximation, which will depend on the sample size N. P_x will be used to denote the projection matrix formed by the basis functions of the FFF constructed using the vector x. When the inverse does not exist, some generalised inverse may be used, or else some terms may be omitted. Interest here will centre on the behaviour of MSE(ĝ, g) given by

MSE(ĝ, g) = E (1/N) ||g − P_x y||²  (4.22)

where ||.|| refers to the Euclidean norm of the argument. The following assumptions will be used.

Assumption 3.1: For the regression equation (4.20) the observations are independent and

E(u_t | x_t) = 0,  E(u_t² | x_t) = σ_t²

where 0 < inf_t σ_t² ≤ sup_t σ_t² < ∞.

Next define the following Sobolev norm (see Adams (1975)) of the function h by

||h||_{q,∞,X} = max_{|λ|≤q} sup_{x∈X} |D^λ h(x)|  (4.23)

where D denotes the generalised partial derivative as defined in Chapter 2, and the Sobolev smoothness index (see Andrews (1988)) of the function h by S(h) = max{q
: ||h||_{q,∞,X} < ∞}  (4.24)

where X is the set to which x belongs, X ⊂ R^d, and d is the dimension of the x_t vector. The following assumption concerns the possibility of approximating the g function of interest.

Assumption 3.2: For any g there is a sequence θ_m = (θ_{m1}, ..., θ_{mm})' ∈ R^m such that

m^a || Σ_{i=1}^m θ_{mi}ψ_i(x) − g(x) ||_{0,∞,X} → 0

as m → ∞, for all a such that 0 ≤ a < S(g)/d.

This assumption says that it is possible to find a sequence of uniformly convergent approximating functions. Moreover, the approximation error goes to zero at a rate that is faster the greater is the smoothness of the g function relative to the number of variables in x. There are a number of circumstances in which this assumption will be satisfied. The one of interest to us is given below.

Assumption 3.3: The density function of the x_t is everywhere bounded below by some small positive number. In addition, it is assumed that X is open and bounded, with closure(X) ⊂ ×_{i=1}^d (0, 2π), and the basis functions are given by the trigonometric functions as used in a Fourier series.

Note that any bounded set X can be made to satisfy this assumption by suitable linear transformations, without affecting the results in any meaningful way. Many of the results of this chapter will also hold for polynomial series estimators under a similar condition. Note that the results also cover the case where the x are fixed, provided that the empirical distribution converges to one like that above. The following result, which is similar to that of Andrews and Whang (1989), can now be proved, where we condition on the x vector to find the MSE and determine its rate of convergence.

Lemma 1: Given Assumptions 3.1 and 3.2:

(i) If K(N)/N → 0 and K(N) → ∞ then MSE(ĝ, g) → 0.
(ii) If K(N) = O(N^r) then MSE(ĝ, g) = O(N^{−b}) where b = min{1 − r, 2ar} for any a with 0 ≤ a < S(g)/d.

(iii) If S(g)/d ≥ 1/(2q) for some q < 1 and K(N) = O(N^{q−γ}) with 0 < γ < q − d/(2S(g)), then || Σ_{i=1}^K θ̂_iψ_i(x_t) − g ||_{0,∞,X} → 0 in probability for some sequence of parameter estimates, and if q ≤ 1/2 then in addition b in (ii) is such that b ≥ (1 − q).

These results show that the rate of convergence depends on three parameters relating to the smoothness of the unknown function, the dimension of its argument, and the rate of increase of the number of terms in the approximation. Generally, the rate improves when the function is smoother, when the dimension of x is reduced, and when K increases more slowly. The final part concerns bias reduction and will be crucial in the results of the chapter. In particular, it appears that √N consistency will not obtain unless the bias can be reduced at a rate faster than √N (as one may anticipate). This is achieved by conditions on both S(g) and K; the rate of increase of K used in (iii) will also be needed in the main results.

As it stands, the result does not allow for discrete x variables. However, if the discrete x variables have finite support then g may be written as a finite sum of g_i functions of continuous x variables, one for each point in the support of the discrete x's. The result would then hold provided that each g_i function satisfied the conditions of the Lemma. The remainder of the chapter avoids this complication without loss of generality.

A second result that will prove useful in the remainder of this chapter concerns the asymptotic distribution of weighted averages of estimated g functions, i.e. (1/N)c'ĝ for some vector of constants c. In particular, it is possible to show that under certain conditions such averages are √N consistent and asymptotically normal.
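A quick numerical illustration of why such averages behave well: when the basis contains a constant, P_x ι = ι, so the average of the fitted values, ι'P_x y/N, equals the raw sample average of y exactly, and the latter is an ordinary √N-consistent mean (the design and noise below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
x = rng.uniform(0.5, 5.5, N)
y = np.sin(x) + rng.normal(0, 0.4, N)

# series projection whose column space contains the constant
Psi = np.column_stack([np.ones(N), np.cos(x), np.sin(x), np.cos(2 * x), np.sin(2 * x)])
coef, *_ = np.linalg.lstsq(Psi, y, rcond=None)
fitted = Psi @ coef                          # P_x y
# P_x iota = iota, hence mean(P_x y) = mean(y) up to rounding error
gap = abs(fitted.mean() - y.mean())
```

The exact equality reflects the orthogonality of least squares residuals to every basis function, including the constant; for general weights c the same logic holds only asymptotically, which is what the following theorem formalises.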
Theorem 2: Under Assumptions 3.1-3.2 and

(i) S(g) > d + 1,

(ii) K(N) = O(N^{1/2−γ}) for some γ > 0, and

(iii) c_t = c(x_t), a bounded sequence of known functions of x_t with S(c) > 0,

we have

√N((1/N)c'P_x y − (1/N)c'g) = (1/√N)c'u + o_p(1)

and hence

√N((1/N)c'P_x y − (1/N)c'g) → N(0, lim (1/N)E(c'Φ_u c))

where Φ_u = E(uu' | x) is the diagonal covariance matrix of u.

This result will prove useful in Section 6 when we deal with the estimation error inherent in using an estimator of the λ_t. It will allow us to obtain the usual √N convergence even though the individual ĝ (or in our case λ̂_t) elements converge more slowly than √N (see Andrews (1988, 1989d)). This type of result has been proved for kernel type estimators by Rilstone (1989) and Powell, Stock and Stoker (1989). A simple corollary to the above result arises when the c vector is just a vector of ones and a constant is included in the series functions (as is usually done). Denoting such a vector by ι, then since P_x ι = ι (because P_x is a projection matrix), the result is immediate.

Corollary 1: Given Assumptions 3.1-3.2 and hypotheses (i) and (ii) of Theorem 2,
The analysis of this estimator is useful as an intermediate step toward the result for the estimator considered later, in which Xt is estimated. In fact as will be shown, the completely general estimator can be split up into two parts. The first will have the same distribution as the estimator of this section, and the second will have a non-degenerate distribution that will depend on the preliminary estimator of Xt. Since the expectations in this case are conditional on x3 we denote the relevant pro-jection matrix by PX3 and the truncation parameter by K3(Ni) or simply as A ' 3 where it will be understood that it increases with N and or Nx. The estimator can then be written as 1 - l x (4.25) ( 1 (4.26) Chapter 4. Two Step Estimation Methods 51 (-^)(Z - PU3Z)'(Z - PaaZ) -l {-^=)(Z-PxzZ)>(g + r,) (4.27) using the expression for w and the fact that (Z-PXsZyPX3 = 0 (4.28) since PX3 is a projection matrix. The dimension of Xjt wil l be denoted dj, so that the dimension of zt is d2 (in the case of the Tobit model dx — d 2 ) and each element of zt wi l l be referred to as z^. The conditional expectations are denoted by E{zit\x3t) = Ti{x3t) (4.29) and the matrix of these values by T3 = E(Z\x3) which wil l be of the same dimension as Z. The symbol 8 will denote a generic small strictly positive constant while A and C wil l be used as generic large finite positive constants. The following assumptions wil l be required to prove that the. estimator /32 is R N C A N . Assumption 4.1: Assume E(zit\x3t) = Ti(x3t) for each i with x3t £ A ' 3 C Rd:i and assume Assumption 3.3 applies to the distribution of x3t and the set A " 3 . Assumption 4.2: Defining un = zu — Ti(x3t) then uu satisfies Assumption 3.1 for all i. We are now in a position to prove the main result of this section concerning the R N C A N of the estimator (32. 
Theorem 3: Given Assumptions 2.1-2.3 and 4.1-4.2, if (i) S(g)/d_3 ≥ 1/(2q) for some q < 1, (ii) S(τ_i) ≥ 1 for all i, and (iii) K_3(N_1) = O(N_1^{q−γ}) with 0 < γ < q − d_3/(2S(g)), then

√N_1(β̂_2 − β_2) → N(0, V_2)

where V_2 is given in Theorem 1.

Note that it is required that the unknown g function (which arises due to the heteroskedasticity in the latent model) be smooth, and increasingly so the larger is the dimension of its argument and the slower the rate of increase of the K_3 parameter. This, along with the third hypothesis, guarantees that the bias term disappears at the required √N rate, as proved in Lemma 1. The τ_i functions related to the expectations are required only to possess one derivative in order for the bias in estimating them to disappear sufficiently fast as K_3 → ∞. Obviously, this result is generally inapplicable since λ_t is generally unknown; however, it does provide a useful result en route to the more general result. Moreover, this result is of independent interest since it provides a proof of RNCAN for semiparametric regression models which allows for more general data generating processes than does Robinson (1988) (although he claims his result does generalise to this case). It also provides alternative conditions to those of Andrews (1989a), who uses empirical process methods to obtain a similar result under a different set of assumptions.

4.5 Estimation of the Discrete Choice Model

As noted above, the preceding analysis is generally inapplicable because λ_t is generally unknown. This section details two approaches to the estimation of λ_t which are appropriate under different conditions, depending on whether one believes there is heteroskedasticity in the discrete part of the model or not. In the case of the Tobit model, both latent equations are the same, so only the method allowing for heteroskedasticity is of interest.
Therefore, the method based on a homoskedastic discrete choice model assumption is only of interest in the context of the bivariate sample selection model. The equation of interest is rewritten below

y*_{1t} = x_{1t}β_1 + u_{1t}  (4.30)

where E(u_{1t}) = 0 and V(u_{1t}) = σ_1²(x_{1t}), and the u_{1t} are independent normal random variables. The variable y_{1t} = 1(y*_{1t} > 0) is all that is observed. As noted earlier, λ_t is related to these variables by

λ_t = φ(x_{1t}β_1/σ_1(x_{1t})) / Φ(x_{1t}β_1/σ_1(x_{1t}))  (4.31)

so, as in Heckman (1979), one can obtain a consistent estimator of λ_t by replacing the β_1 and σ_1 parameters with consistent estimators. There are two possibilities that will be considered here. The first is to assume that σ_1 is, in fact, a function of the x variables, remembering that we assumed that only the variables in x_1 may be included as arguments of the function. In this case there is bivariate heteroskedasticity (the only interesting case for the Tobit model), so the problem can be written as one of obtaining consistent estimates of x_{1t}β_1/σ_1(x_{1t}) = h(x_{1t}), where the structure of h can be ignored without affecting the results (the convergence properties of an estimator of h will depend on both the estimators for β_1 and σ_1). Alternatively, one could treat σ_1 as a constant, in which case one could without loss of generality normalise it to 1 and use standard Probit maximum likelihood estimation. It should be re-emphasised that this second case is only of interest in the case of the bivariate sample selection model.

4.5.1 Heteroskedastic u_{1t}

The work of Heckman (1979) and the survey of Amemiya (1985) are useful in pointing out that the use of an estimated λ_t in a second stage regression will have some effect on the variance-covariance matrix of the parameters due to the pre-estimation error. In fact, the distribution of the estimator will depend on the distribution of the λ_t estimator.
This makes it imperative that we choose a method for estimating the $\lambda_t$ that allows us to get at the distribution of the estimator. Since the estimated $\lambda_t$ enter the estimator as a weighted linear average, the results of Section 3 will be of use with some adjustments. (Note that we are excluding the possibility of finding a super-efficient estimator for $\lambda_t$, in which case the pre-estimation error would disappear at a sufficiently fast rate that the distribution of the second stage estimator would not depend on the distribution of the first stage estimator.)

Unfortunately, as it stands, the work in Chapter 3 is inapplicable due to the difficulty in obtaining appropriate expansions and distribution theory. In contrast, the regression based methods discussed in Section 3 appear more useful with some adjustment. In fact, we will show that the use of the $\delta$-method, along with the results for series regression estimators, allows us to obtain a consistent estimator for $\lambda_t$ and an expansion or linearisation that will help in the derivation of the distribution of the second stage estimator.

To see why a regression based technique may work, consider the following expectation:

$E(y_{1t}|x_{1t},x_{2t},x_{3t}) = E(y_{1t}|x_{1t}) = \Phi(h(x_{1t}))$  (4.32)

so that $y_{1t}$ can be written in a regression form as

$y_{1t} = \Phi(h(x_{1t})) + \epsilon_{1t}$  (4.33)

with $E(\epsilon_{1t}|x_{1t}) = 0$ and $V(\epsilon_{1t}|x_{1t}) = \Phi(h(x_{1t}))(1 - \Phi(h(x_{1t})))$, which has a familiar regression type structure. Since $h$ is an unknown function we can let $l(x_{1t}) = \Phi(h(x_{1t}))$, and then the regression equation can be written as

$y_{1t} = l(x_{1t}) + \epsilon_{1t}$  (4.34)

and $h$ can be recovered by the inverse transform $h = \Phi^{-1}(l)$, using the inverse of the normal cumulative distribution function. The following assumptions regarding $h$ and $l$ will be made.

Assumption 5.1: $h : X_1 \to [-\Delta, \Delta]$ and hence $l : X_1 \to [\delta, 1-\delta]$, where $\delta = \Phi(-\Delta)$ and $1 - \delta = \Phi(\Delta)$, for some small $\delta > 0$.
Also, $X_1$ and the distribution of the $x_1$ satisfy Assumption 3.3.

Again, we rule out discrete exogenous variables here, but as mentioned earlier they can be dealt with without much trouble. Note that the conditional probabilities are kept away from 0 and 1. This also ensures that the conditional variances are bounded away from 0 and $\infty$, as is required for the results of Section 3 to be useful, and it avoids problems when we transform the variables in the second stage regression by division.

The technique used to estimate $\lambda_t$ is based first on estimation of the heteroskedastic regression model (4.34), which gives us estimates of $l_t = l(x_{1t})$; these are then used to construct the $\hat{\lambda}_t$ using the fact that $l$ and $\lambda_t$ are related by

$\lambda_t = \phi(\Phi^{-1}(l_t))/l_t$  (4.35)

The estimator for $\lambda_t$ is obtained by replacing the $l_t$ in this relation with the $\hat{l}_t$ obtained by the series regression of $y_{1t}$ on $x_{1t}$. Two important facts about the transformation are that the first two derivatives with respect to $l$, given by

$\frac{d\lambda}{dl} = -(\Phi^{-1}(l) + \lambda(l))\,l^{-1} = d_1(l)$  (4.36)

$\frac{d^2\lambda}{dl^2} = -(\phi(\Phi^{-1}(l))^{-1} + d_1(l))\,l^{-1} + (\Phi^{-1}(l) + \lambda(l))\,l^{-2} = d_2(l)$  (4.37)

are both bounded functions of $l$ on $[\delta, 1-\delta]$ for any small $\delta$. This will prove useful in the expansions of $\hat{\lambda}_t$ about its true value $\lambda_t$. Note also that $\lambda_t$ itself is bounded away from 0 and $\infty$.

The estimator for $l$ that we use is given by

$\hat{l} = P_{x_1} y_1$  (4.38)

where $P_{x_1}$ is the projection matrix formed by the basis functions of $x_1$, as described in Section 3, analogously to $P_{x_3}$, and the number of terms in the approximation (the truncation parameter) is denoted by $K_1(N)$ or $K_1$. Note that all $N$ observations are used in the computation of $\hat{l}$, unlike the second stage estimator, which uses only the $N_1$ observations for which $y_{1t} = 1$.
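The transformation (4.35) and its derivative (4.36) are straightforward to compute. The following is a minimal sketch, assuming nothing beyond the Python standard library: $\Phi$ is built from `math.erf`, and $\Phi^{-1}$ is obtained by bisection purely for illustration (in practice a statistical library's quantile function would be used).

```python
import math

def Phi(z):          # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):          # standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi_inv(p, lo=-8.0, hi=8.0):
    # bisection inverse of Phi; adequate for an illustration
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lam(l):          # equation (4.35): lambda(l) = phi(Phi^{-1}(l)) / l
    return phi(Phi_inv(l)) / l

def d1(l):           # equation (4.36): first derivative of lambda w.r.t. l
    return -(Phi_inv(l) + lam(l)) / l
```

A quick numerical check of (4.36) against a central finite difference of `lam` confirms the derivative formula on the interior of $[\delta, 1-\delta]$.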
One complication that arises when one computes the $\hat{\lambda}_t$ estimator is that it requires computing $\Phi^{-1}(\hat{l})$, which is defined only if $\hat{l}$ lies between 0 and 1. Unfortunately, the series regression estimator does not guarantee this, so one must have some rule for dealing with such observations. Note that this is analogous to the problem in the linear regression model with heteroskedasticity, where one estimates the variance function using the squared residuals and performs weighted least squares using the estimated squared-residual regression function for weights (as in Cragg (1988), for example). Depending on the method used, one can obtain negative predicted variances, which are usually dealt with by setting them to some small positive value. The solution proposed here is similar. In particular, the following rule is used to define the estimator $\tilde{l}(x_{1t})$:

$\tilde{l}(x_{1t}) = \hat{l}(x_{1t})$ if $\delta/2 < \hat{l}(x_{1t}) < 1 - \delta/2$
$\tilde{l}(x_{1t}) = \delta/2$ if $\hat{l}(x_{1t}) \le \delta/2$
$\tilde{l}(x_{1t}) = 1 - \delta/2$ if $\hat{l}(x_{1t}) \ge 1 - \delta/2$

where $\delta$ is as defined in Assumption 5.1. In practice, the choice of $\delta$ is at the discretion of the econometrician, and typically some small number will be chosen.

The following lemma allows us to write linear combinations of $\hat{\lambda}_t - \lambda_t$ as linear combinations of $\hat{l}(x_{1t}) - l(x_{1t})$ plus a stochastic error that converges to zero as $N \to \infty$. Since such linear combinations of $\hat{\lambda}_t - \lambda_t$ appear in the second stage estimator, and since Lemma 2 allows us to find the distribution of linear combinations of $\hat{l}(x_{1t}) - l(x_{1t})$, this will allow us to show how the distribution of the second stage estimator depends on the distribution of this first stage estimator.

Lemma 2: Given Assumption 5.1 and (i) $S(l) > d_1 + 1$, (ii) $K_1(N) = O(N^{1/2-\gamma})$ for $0 < \gamma < \frac{1}{2} - \frac{d_1}{2S(l)}$, and (iii) $c(x_{1t})$ a sequence of bounded constants, then

$\frac{1}{\sqrt{N}} \sum_{t=1}^{N} c(x_{1t})(\hat{\lambda}_t - \lambda_t) = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} c(x_{1t})\, d_1(l(x_{1t}))(\hat{l}(x_{1t}) - l(x_{1t})) + o_p(1)$

Note that in this result all $N$ observations are used.
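The trimming rule can be transcribed directly. The helper below is an illustrative sketch (the function name and interface are ours, not the thesis's), with $\delta$ supplied by the user.

```python
def trim(l_hat, delta):
    """Trimming rule for the series estimate of l(x_1t): keep it inside
    [delta/2, 1 - delta/2] so that Phi^{-1}(l) and lambda(l) are defined."""
    lo, hi = delta / 2.0, 1.0 - delta / 2.0
    if l_hat <= lo:
        return lo
    if l_hat >= hi:
        return hi
    return l_hat
```

Applying `trim` observation by observation yields the $\tilde{l}(x_{1t})$ used in the remainder of the section.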
In the application of this result all $N$ observations will also be used, so some condition on the relation between $N_1$ and $N$ will be needed. The hypotheses of the result are slightly stronger than in previous results since here we need the MSE of $\hat{l}$ to be $o_p(N^{-1/2})$ for the result to follow.

Combined with Lemma 2, this allows us to prove the following result concerning the $\sqrt{N}$-CAN of weighted averages of $\hat{\lambda}_t - \lambda_t$, which, as noted earlier, is one of the key steps in proving the $\sqrt{N}$-CAN of the second stage estimator.

Lemma 3: Given Assumption 5.1, $S(c) > 0$, and conditions (i), (ii) and (iii) of Lemma 2, then

$\sqrt{N}(\frac{1}{N}c'\hat{\lambda} - \frac{1}{N}c'\lambda) = \frac{1}{\sqrt{N}}c'D_1\epsilon + o_p(1)$

where $D_1 = \mathrm{diag}\{d_1(l(x_{1t}))\}$, and hence

$\sqrt{N}(\frac{1}{N}c'\hat{\lambda} - \frac{1}{N}c'\lambda) \to N(0, \lim E(\frac{1}{N}c'D_1\Omega D_1 c))$

where $\Omega = \mathrm{diag}\{l(x_{1t})(1 - l(x_{1t}))\}$ is the conditional variance matrix of $\epsilon$.

This result bears some resemblance to a $\delta$-method type of result. In fact, it is a combination of the $\delta$-method, which is applied to linearise $\hat{\lambda}_t - \lambda_t$ for each $t$, and the result in Lemma 2 on the distribution of weighted averages of point estimates, which proves $\sqrt{N}$-CAN for combinations of point estimates that are individually less than $\sqrt{N}$-CAN, as noted earlier.

4.5.2 Homoskedastic $u_{1t}$

As mentioned earlier, in this case, due to the assumed homoskedastic normality of the latent error $u_{1t}$, one may proceed with the Probit MLE to estimate $\beta_1$, with $\sigma_1 = 1$ imposed as a normalisation. The estimator for $\lambda_t$ may then be computed as

$\hat{\lambda}_t = \phi(x_{1t}\hat{\beta}_1)/\Phi(x_{1t}\hat{\beta}_1)$  (4.39)

Using standard results, as in Amemiya (1985) for example, one can then perform the same sort of linearisation of this about its true value, so that

$\hat{\lambda}_t - \lambda_t = \frac{\partial \lambda_t}{\partial \beta_1}(\hat{\beta}_1 - \beta_1) + O_p(N^{-1})$  (4.40)

which can be written in vector notation as

$\hat{\lambda} - \lambda = -D_4 X_1(\hat{\beta}_1 - \beta_1) + O_p(N^{-1})$  (4.41)

where $D_4 = \mathrm{diag}(x_{1t}\beta_1\lambda_t + \lambda_t^2)$ and $X_1$ is the $N \times d_1$ matrix of the $x_{1t}$ variables, which appears after evaluating the derivative.
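As a hedged illustration of the homoskedastic first stage (4.39), the sketch below fits a one-regressor Probit without intercept by bisection on the score (valid because the Probit log-likelihood is concave, so the score is monotone in the scalar parameter) and then forms the inverse Mills ratios. The data generating process and the true coefficient value are assumptions of the example, not part of the thesis.

```python
import math, random

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def probit_score(beta, x, y):
    # derivative of the Probit log-likelihood w.r.t. a scalar beta
    s = 0.0
    for xi, yi in zip(x, y):
        p = min(max(Phi(xi * beta), 1e-10), 1.0 - 1e-10)
        s += xi * phi(xi * beta) * (yi - p) / (p * (1.0 - p))
    return s

def probit_mle(x, y, lo=-10.0, hi=10.0):
    # concave log-likelihood => decreasing score; bisection finds the MLE
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if probit_score(mid, x, y) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def inv_mills(z):
    # equation (4.39): lambda_hat_t = phi(x_{1t} b) / Phi(x_{1t} b)
    return phi(z) / Phi(z)

random.seed(0)
beta_true = 1.0                       # assumed value for the illustration
x = [random.gauss(0.0, 1.0) for _ in range(4000)]
y = [1 if xi * beta_true + random.gauss(0.0, 1.0) > 0 else 0 for xi in x]
b_hat = probit_mle(x, y)
lam_hat = [inv_mills(xi * b_hat) for xi in x]
```

With a few thousand observations the estimate lands close to the assumed coefficient, and the fitted $\hat{\lambda}_t$ are strictly positive, as the theory requires.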
Standard mean value expansions for the MLE $\hat{\beta}_1$, as in Amemiya (1985) (which are valid given our assumptions), yield

$\sqrt{N}(\hat{\beta}_1 - \beta_1) \to N(0, \mathrm{plim}\, N(X_1'D_2X_1)^{-1})$  (4.42)

where

$D_2 = \mathrm{diag}(\Phi(x_{1t}\beta_1)^{-1}(1 - \Phi(x_{1t}\beta_1))^{-1}\phi(x_{1t}\beta_1)^2)$  (4.43)

Using this notation, it then follows that

$\sqrt{N}(\frac{1}{N}c'\hat{\lambda} - \frac{1}{N}c'\lambda) \to N(0, \mathrm{plim}\,\frac{1}{N}c'D_4X_1(X_1'D_2X_1)^{-1}X_1'D_4c)$  (4.44)

which is the usable equivalent of the result provided in Lemma 5. Much of this is standard and can be found in Amemiya (1985) or any other text dealing with this problem. The use of the standard Probit MLE avoids the need for trimming or truncating the observations in practice, as is needed in the heteroskedastic case.

An interesting question to consider in the finite sample analysis (using simulations) is whether the effect of the trimming is such that, even in the presence of a mis-specified variance (i.e., assuming homoskedasticity when it is false), the second approach is superior. In the case of heteroskedastic regression models, Cragg (1989) has found that with variance estimators where trimming (or, more correctly, truncation) is required, one may in finite samples obtain more efficient estimates using a mis-specified variance function than the true one. The situation here may differ somewhat, however, since in the regression model unbiasedness and consistency obtain no matter what weighting function is used, whereas here mis-specification causes inconsistency (although the nature of this is still not very well understood).

4.6 Asymptotic Normality with Estimated $\lambda_t$

The preceding results allow us to derive the distribution of the estimator given in Section 4, where $\lambda_t$ is replaced by either of the two estimators developed in Section 5. We denote this estimator by $\hat{\beta}_2$ and note that it can be written as follows.
$\hat{\beta}_2 = ((\tilde{Z} - P_{x_3}\tilde{Z})'(\tilde{Z} - P_{x_3}\tilde{Z}))^{-1}(\tilde{Z} - P_{x_3}\tilde{Z})'(\tilde{w} - P_{x_3}\tilde{w})$  (4.45)

Noting that

$\tilde{w}_t = \tilde{z}_t\beta_2 + g(x_{3t}) + g(x_{3t})\frac{\lambda_t - \hat{\lambda}_t}{\hat{\lambda}_t} + \frac{\xi_t}{\hat{\lambda}_t}$  (4.46)

this estimator can be rewritten as

$\hat{\beta}_2 = ((\tilde{Z} - P_{x_3}\tilde{Z})'(\tilde{Z} - P_{x_3}\tilde{Z}))^{-1} \times$  (4.47)

$(\tilde{Z} - P_{x_3}\tilde{Z})'\left[(\tilde{Z} - P_{x_3}\tilde{Z})\beta_2 + g - P_{x_3}g + g\frac{\lambda - \hat{\lambda}}{\hat{\lambda}} - P_{x_3}\,g\frac{\lambda - \hat{\lambda}}{\hat{\lambda}} + \frac{\xi}{\hat{\lambda}} - P_{x_3}\frac{\xi}{\hat{\lambda}}\right]$  (4.48)

and, because $P_{x_3}$ is a projection matrix, we have

$\hat{\beta}_2 - \beta_2 = ((\tilde{Z} - P_{x_3}\tilde{Z})'(\tilde{Z} - P_{x_3}\tilde{Z}))^{-1}(\tilde{Z} - P_{x_3}\tilde{Z})'(I - P_{x_3})\left(g + g\frac{\lambda - \hat{\lambda}}{\hat{\lambda}} + \frac{\xi}{\hat{\lambda}}\right)$  (4.49)

which is now in a form that can be analysed. It should be noted that this estimator can be split into two parts. The first part involves $g$ and $\xi$ and is similar to the estimator analysed in Section 4.4; the only difference is that $\tilde{Z}$ is used in place of $Z$. It will be shown in Theorem 4 below that this term and the estimator in Section 4.4 are asymptotically equivalent. The second term involves the pre-estimation error $\hat{\lambda}_t - \lambda_t$. Using the linearisation results of Section 4.5, it will be shown in Theorem 4 that this term has an asymptotic distribution similar to the one in Lemma 3. Finally, since the two terms will be shown to be asymptotically uncorrelated, the asymptotic distribution of the sum is normal, with covariance matrix given by the sum of the covariance matrices of the two terms involved.

One technical complication that arises when we consider the asymptotic distribution of the term involving the estimation error is that all $N$ observations are used to estimate the $\lambda_t$, whereas only the estimates of $\lambda_t$ for observations with $y_{1t} = 1$ appear in the estimator. After linearisation of the term corresponding to the pre-estimation error, we end up with a sum of the form

$\frac{1}{\sqrt{N_1}} \sum_{t=1}^{N} (z_t - \tau(x_{3t}))'\, y_{1t}\, \frac{g(x_{3t})}{\lambda_t}\, d_1(l(x_{1t}))(\hat{l}(x_{1t}) - l(x_{1t}))$

The problem arises because the constants that appear in the sum involve $y_{1t}$, while the results of Lemmas 2 and 3 require the constants to be smooth functions of $x_{1t}$ only. The
results in those Lemmata are obtained because of the properties of the projection matrix formed by the basis functions (of $x_{1t}$) and the fact that $c$ is a smooth function of $x_{1t}$ (see the proof of Theorem 2). To make this type of result apply, we must break the constants up into a part which is the conditional expectation of the constants (given $x_{1t}$) and another part with conditional mean zero, and then make assumptions regarding these terms so that the previous results may apply. To do this we introduce the following notation. Define the variable $v_{it}$ by

$v_{it} = E((z_{it} - \tau_i(x_{3t}))y_{1t}|x_{1t})$  (4.50)

which is defined for all $N$ observations, and where $z_{it} - \tau_i(x_{3t})$ appears as part of the constant in the sum above. It should be noted that, as mentioned earlier, $\tau(x_{3t})$ is actually a function of $y_{1t}$, but the argument is suppressed for convenience. Note that we have only considered estimating this for observations with $y_{1t} = 1$. Since the $v$ terms appear in the covariance matrix, it would seem that we need estimates of $\tau$ for all the observations. Fortunately this is not the case: when we estimate $v$ by projecting $(z_{it} - \tau_i(x_{3t}))y_{1t}$ onto the space spanned by the basis functions of $x_{1t}$, the terms corresponding to $y_{1t} = 0$ are zero, so $\tau$ need not be estimated for those observations. This avoids a complication when we come to estimate the covariance matrix.

Assumption 6.1: The error term defined by $(z_{it} - \tau_i(x_{3t}))y_{1t} - v_{it}$ satisfies Assumption 3.3 and, in addition, $\mathrm{plim}\,\frac{1}{N}v'v$ is a finite positive definite matrix.

More useful notation includes the matrices

$D_3 = \mathrm{diag}(\lambda_t)$  (4.51)

$D_5 = \mathrm{diag}(d_1(l(x_{1t})))$  (4.52)

$L = \mathrm{diag}(l(x_{1t}))$  (4.53)

and

$G = \mathrm{diag}(g(x_{3t}))$  (4.54)

The main result of this section is the following.
Theorem 4: Given Assumptions 2.1–2.3, 4.1–4.2, 5.1 and 6.1, and (i) $K_1(N) = O(N^{1/2-\gamma_1})$ for $0 < \gamma_1 < \frac{1}{2} - \frac{d_1}{2S(l)}$ and $S(l) > d_1 + 1$, (ii) $K_3(N_1) = O(N_1^{1/2-\gamma_3})$ for $0 < \gamma_3 < \frac{1}{2} - \frac{d_3}{2S(g)}$ and $S(g) > d_3 + 1$, (iii) $S(\tau_i) > d_3 + 1$ for all $i$, (iv) $N_1 = O(N)$, $\frac{N_1}{N} \to p > 0$ and $p \ge \delta > 0$, (v) $x_3 \subset x_1$, a proper subset, and $E(z_i|x_1)$ has strictly positive Sobolev smoothness index for all $i$, (vi) $S(v_i) > 0$ for all $i$, then

$\sqrt{N_1}(\hat{\beta}_2 - \beta_2) \to N(0, V_2)$

where $V_2 = A_1^{-1}(S_1 + S_2)A_1^{-1}$ with

$S_1 = \mathrm{plim}\,\frac{1}{N_1}(Z - T_3)'D_3^{-1}\Upsilon D_3^{-1}(Z - T_3)$

and

$S_2 = \frac{1}{p}\,\mathrm{plim}\,\frac{1}{N}\,v'GD_3^{-1}D_5L\Omega LD_5D_3^{-1}Gv$

where $\Omega = \mathrm{diag}(l(x_{1t})(1 - l(x_{1t})))$.

This result provides the basis for performing linear and non-linear Wald tests on the parameters of both sample selection and Tobit models under heteroskedasticity. One should note that in this result we have strengthened the smoothness conditions on the $\tau_i$ functions, which are required for convergence, because only estimated $\lambda_t$ values are used, which potentially reduces the rate of convergence of the estimated $\tau_i$ functions to their true values. This rate needs to be increased for $\sqrt{N}$-CAN to hold, and this is achieved by requiring a stronger smoothness condition. Assumption (v) is required for the application of Theorem 2 to be valid. It seems likely that it may be unnecessary, but the proof, which is already quite complicated, would become even more complicated without it. The implication is that the variance-covariance matrix is a function of a subset of the variables in the first latent equation (the discrete part); in that case it just requires that not all of those explanatory variables appear in the variance function. Conditions (i) and (ii) are slightly stronger than in Theorem 3 due to the fact that there is estimation error, and a faster rate of convergence of the MSE of the estimated expectations is required.
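The algebra behind the second stage estimator (4.45) is ordinary partialled-out least squares: residualise both $\tilde{w}$ and $\tilde{Z}$ on the series basis for $x_3$, then regress residual on residual. The sketch below is a deliberately simple illustration, not the thesis's design: one regressor, a quadratic basis, no noise, and a $g$ chosen to lie exactly in the span of the basis so that the coefficient is recovered exactly.

```python
import random

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def residualise(y, basis):
    # residuals from the least-squares projection of y on the basis columns
    k, n = len(basis), len(y)
    A = [[sum(basis[i][t] * basis[j][t] for t in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(basis[i][t] * y[t] for t in range(n)) for i in range(k)]
    coef = solve(A, b)
    return [y[t] - sum(coef[i] * basis[i][t] for i in range(k)) for t in range(n)]

random.seed(1)
n = 200
x3 = [random.uniform(-1.0, 1.0) for _ in range(n)]
z = [random.gauss(0.0, 1.0) for _ in range(n)]
beta2 = 2.0                                   # assumed true coefficient
g = [1.0 + x ** 2 for x in x3]                # g(x3) lies in the basis span
w = [z[t] * beta2 + g[t] for t in range(n)]
basis = [[1.0] * n, x3, [x ** 2 for x in x3]]  # truncated series for x3
z_r = residualise(z, basis)
w_r = residualise(w, basis)
b2_hat = sum(a * b for a, b in zip(z_r, w_r)) / sum(a * a for a in z_r)
```

Because the projection annihilates $g$ exactly here, `b2_hat` equals the assumed coefficient up to floating-point error; with noise and an approximating basis, the error terms analysed in Theorem 4 reappear.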
Condition (iv) ensures that the observations with $y_{1t} = 1$ always make up a nontrivial proportion of the sample and seems fairly mild, while (vi) allows the application of Lemma 5.

Although the result does offer some guidance concerning the rates of increase of the truncation parameters, the actual choice of these in practice is somewhat more difficult to advocate. The results may suggest a rule of thumb where, say, for 100 observations one would pick $K < 10$, and so on; in practice, however, there is no reason to expect that such a rule will be optimal in any sense. The results of a simulation exercise may aid in this regard. Alternatively, one could use cross validation type procedures or some testing procedures. More on this will be discussed in the following chapter.

One advantage of the estimator is that it is relatively simple to compute, requiring only regressions and transformations. In fact, in some sense it may even appear to be simpler than the usual Heckman procedure, which involves the first stage maximum likelihood Probit estimator. In principle, any regression package, such as SHAZAM, LIMDEP or TSP, should be able to compute the estimator, while the use of matrix commands should enable the computation of the covariance matrix (more on this will be taken up in the following section).

A corollary to the above result arises under the assumption that the discrete part of the model is homoskedastic, in which case the Probit MLE can be computed in the first stage to obtain the $\hat{\lambda}_t$ for the second stage. The result corresponding to Theorem 4 is given for this case.
Corollary 2: Given Assumptions 2.1–2.3 and 4.1–4.2, and (i) $K_3(N_1) = O(N_1^{1/2-\gamma_3})$ for $0 < \gamma_3 < \frac{1}{2} - \frac{d_3}{2S(g)}$ and $S(g) > d_3 + 1$, (ii) $S(\tau_i) > d_3 + 1$ for all $i$, (iii) $N_1 = O(N)$, $\frac{N_1}{N} \to p > 0$ and $p \ge \delta > 0$, (iv) the $u_{1t}$ are homoskedastic, then

$\sqrt{N_1}(\hat{\beta}_2 - \beta_2) \to N(0, V_2)$

where $V_2 = A_1^{-1}(S_1 + S_3)A_1^{-1}$ with

$S_1 = \mathrm{plim}\,\frac{1}{N_1}(Z - T_3)'D_3^{-1}\Upsilon D_3^{-1}(Z - T_3)$

and

$S_3 = \mathrm{plim}\,\frac{1}{N_1}(Z - T_3)'GD_3^{-1}D_4X_1(X_1'D_2X_1)^{-1}X_1'D_4D_3^{-1}G(Z - T_3)$

The estimator in Corollary 2 is slightly more difficult to compute, but not overly so, and most statistical packages should be able to do it. Clearly, the only difference between the two estimators lies in the part of the variance-covariance matrix related to the first stage estimator. In the case of the Probit estimator there is no need for trimming, since all predicted probabilities are restricted to lie between 0 and 1. It will be interesting to see in the simulations whether the trimming has an adverse effect on the first estimator and whether the second performs better even in the presence of mis-specification.

4.6.1 Estimating the Covariance Matrices

The only thing that remains to make the results in Theorem 4 and Corollary 2 operational is to find ways of estimating (consistently) the variance-covariance matrices. This is needed to construct Wald type tests of restrictions on the $\beta_2$ parameters. A natural approach is to replace the unknown values appearing in the $A_1$ and $S_i$ matrices with consistent estimates and sample analogues. This requires a decision on how to estimate each unknown.

We first deal with $S_1$. In the case of the $\Upsilon$ matrix, a natural estimator takes the squares of the estimated residuals given by

$\hat{\xi}_t = y_{2t} - x_{2t}\hat{\beta}_2 - \hat{g}(x_{3t})\hat{\lambda}_t$  (4.55)

A natural estimator of $\Upsilon$ is

$\hat{\Upsilon} = \mathrm{diag}\{\hat{\xi}_t^2\}$  (4.56)

and, given (4.16), a natural estimator of the vector $g$ is

$\hat{g} = P_{x_3}(\tilde{w} - \tilde{Z}\hat{\beta}_2)$  (4.57)

while the obvious estimator for $Z - T_3$ is $\tilde{Z} - P_{x_3}\tilde{Z}$.
Finally, $D_3$ can be estimated by $\hat{D}_3 = \mathrm{diag}\{\hat{\lambda}_t\}$. Define the variance estimator so constructed by

$\hat{S}_1 = \frac{1}{N_1}(\tilde{Z} - P_{x_3}\tilde{Z})'\hat{D}_3^{-1}\hat{\Upsilon}\hat{D}_3^{-1}(\tilde{Z} - P_{x_3}\tilde{Z})$  (4.58)

Similarly, in the case of $S_2$, define

$\hat{\Omega} = \mathrm{diag}\{\tilde{l}(x_{1t})(1 - \tilde{l}(x_{1t}))\}$  (4.59)

$\hat{G} = \mathrm{diag}\{\hat{g}(x_{3t})\}$  (4.60)

$\hat{D}_5 = \mathrm{diag}\{d_1(\tilde{l}(x_{1t}))\}$  (4.61)

$\hat{L} = \mathrm{diag}\{\tilde{l}(x_{1t})\}$  (4.62)

and the natural estimator of $v$ can be constructed by

$\hat{v} = P_{x_1}[(\tilde{Z} - P_{x_3}\tilde{Z})' \;\; 0']'$  (4.63)

where $0$ here denotes an $N_0 \times d_2$ matrix of zeroes, which arises due to the definition of $v_i$ and the fact that for these observations $y_{1t} = 0$. A natural estimator for $S_2$ is then given by

$\hat{S}_2 = \frac{N}{N_1}\frac{1}{N}\,\hat{v}'\hat{G}\hat{D}_3^{-1}\hat{D}_5\hat{L}\hat{\Omega}\hat{L}\hat{D}_5\hat{D}_3^{-1}\hat{G}\hat{v}$  (4.64)

Finally, define the estimator of $A_1$ by

$\hat{A}_1 = \frac{1}{N_1}(\tilde{Z} - P_{x_3}\tilde{Z})'(\tilde{Z} - P_{x_3}\tilde{Z})$  (4.65)

Note that $\hat{G}$ contains estimates of $g$ for all the observations, and for observations with $y_{1t} = 0$ one must take the estimated $g$ function, recover the estimated parameters, and evaluate the function at the values of $x_{3t}$ for those observations. Defining the basis functions for these $N_0$ observations as $\Phi_{30}$ and the basis functions for the other observations as $\Phi_3$, the estimated vector of $g$ values for these observations is given by

$\hat{g}_0 = \Phi_{30}(\Phi_3'\Phi_3)^{-1}\Phi_3'(\tilde{w} - \tilde{Z}\hat{\beta}_2)$  (4.66)

Unfortunately, due to technical difficulties, it was necessary to alter some of the conditions of Theorem 4 to obtain consistency of these estimators. In particular, thus far we have avoided worrying about the rank of the matrix formed by the basis functions and have simply used a generalised inverse. To prove consistency of the covariance matrix estimator we need the inverse to exist. This necessitates an additional assumption and stronger smoothness conditions on the functions of interest. The assumption is based on the work of Andrews (1988, 1989d) and ensures that, provided the number of parameters does not increase too fast (the explicit rate must be no faster than $N^{1/4}$) and the distribution of the $x$ satisfies the assumptions made earlier, the matrix of basis functions will have full rank.
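In the scalar case ($d_2 = 1$) the sample analogue (4.58) and the resulting sandwich variance reduce to simple sums. The sketch below is illustrative only (the function names and interfaces are ours): it takes the partialled-out regressor residuals, the estimated residuals $\hat{\xi}_t$, and the estimated $\hat{\lambda}_t$, and forms the White-type robust variance.

```python
def s1_hat(z_resid, xi_hat, lam_hat):
    """Sample analogue of S1 in (4.58) for a single regressor (d2 = 1):
    (1/N1) * sum_t  z_resid_t^2 * xi_hat_t^2 / lam_hat_t^2."""
    n1 = len(z_resid)
    return sum((zr * xi / lam) ** 2
               for zr, xi, lam in zip(z_resid, xi_hat, lam_hat)) / n1

def var_beta2(z_resid, xi_hat, lam_hat):
    """Scalar version of A1^{-1} S1 A1^{-1} / N1, a White-type sandwich."""
    n1 = len(z_resid)
    a1 = sum(zr * zr for zr in z_resid) / n1
    return s1_hat(z_resid, xi_hat, lam_hat) / (a1 * a1) / n1
```

The squared $\hat{\xi}_t$ in the middle matrix play the role of the White (1980) heteroskedasticity-robust weights discussed below.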
In addition, a condition on the projection matrices which ensures that their diagonal elements go to zero will follow. The following assumption, in conjunction with those made previously, allows the application of Theorem B-1 of Andrews (1988), which in turn allows the circumvention of the technical difficulties.

Assumption 6.2: The smallest eigenvalues of the matrices

$\frac{1}{N}\Phi_1'\Phi_1$ and $\frac{1}{N_1}\Phi_3'\Phi_3$

have strictly positive limit infimum.

This, along with a slower rate of increase of the $K_1(N)$ parameters, allows the use of the results of Andrews, which give the following fact, used in the proof of the consistency of the covariance matrix estimators:

$\max_{t \le N} [P_{x_i}]_{tt} \to 0$  (4.67)

for both $i = 1$ and $3$. The following theorem provides the necessary result for the consistency of the estimated covariance matrix.

Theorem 5: Given Assumptions 2.1–2.3, 4.1–4.2, 5.1, 6.1 and 6.2, and given (i) $K_1(N) = O(N^{1/4-\gamma_1})$ for $0 < \gamma_1 < \frac{1}{4} - \frac{d_1}{2S(l)}$ and $S(l) > 2d_1 + 1$, (ii) $K_3(N_1) = O(N_1^{1/4-\gamma_3})$ for $0 < \gamma_3 < \frac{1}{4} - \frac{d_3}{2S(g)}$ and $S(g) > 2d_3 + 1$, (iii) $S(\tau_i) > 2d_3 + 1$ for all $i$, (iv) $N_1 = O(N)$, $\frac{N_1}{N} \to p > 0$ and $p \ge \delta > 0$, (v) $x_3 \subset x_1$, a proper subset, and $E(z_i|x_1)$ has strictly positive Sobolev smoothness index for all $i$, (vi) $S(v_i) > 2d_1 + 1$ for all $i$, then the result of Theorem 4 holds and

$\hat{S}_1 - S_1 = o_p(1)$, $\hat{S}_2 - S_2 = o_p(1)$, $\hat{A}_1 = A_1 + o_p(1)$

and hence the estimator of $V_2$ defined by $\hat{V}_1 = (\hat{A}_1)^{-1}(\hat{S}_1 + \hat{S}_2)(\hat{A}_1)^{-1}$ satisfies $\hat{V}_1 = V_2 + o_p(1)$.

The work of Andrews (1988, 1989d) indicates that in the case of a polynomial series estimator a slower maximal rate of increase of the parameters may be necessary. Also, in the case of the FFF series approximation, the fact that the functions are not orthonormal may cause problems with Assumption 6.2; Andrews suggests that some of the functions may have to be deleted to ensure that the condition holds. In practice, there is unlikely to be any problem of invertibility.
The $N^{1/4}$ rate should therefore be interpreted as a maximal rate.

Note that in constructing the matrix $\hat{\Upsilon}$ we use an estimator similar to that of White (1980) for heteroskedastic regression models. As in the regression case, this estimator may be improved upon by use of the jackknife estimator proposed by MacKinnon and White (1985); this does not affect the consistency of the estimator but is likely to improve its finite sample performance.

The equivalent result for the model with a homoskedastic discrete component is given in Corollary 3 below, where the only difference lies in the estimation of the $S_3$ matrix. Define the estimator of the $S_3$ matrix by

$\hat{S}_3 = \frac{1}{N_1}(\tilde{Z} - P_{x_3}\tilde{Z})'\hat{G}\hat{D}_3^{-1}\hat{D}_4X_1(X_1'\hat{D}_2X_1)^{-1}X_1'\hat{D}_4\hat{D}_3^{-1}\hat{G}(\tilde{Z} - P_{x_3}\tilde{Z})$  (4.68)

and define

$\hat{V}_2 = (\hat{A}_1)^{-1}(\hat{S}_1 + \hat{S}_3)(\hat{A}_1)^{-1}$  (4.69)

Corollary 3: Under the conditions of Theorem 5, and with the assumption of a homoskedastic discrete part, the result of Corollary 2 holds and $\hat{V}_2 = V_2 + o_p(1)$.

4.7 Specification Tests

This section considers two forms of specification tests based on the estimator developed in this chapter. Both tests are of the Durbin-Wu-Hausman form and compare the estimator of this chapter to other estimators which are generally preferable under certain restrictions due to improved efficiency. The metric of comparison is the variance-covariance matrix of the difference of the estimators, so the tests have a $\chi^2$ limiting distribution under the respective null hypotheses. The value in performing the test is that under the null the two estimators should have the same limit (the true value), while under the alternative the estimator developed in this chapter will converge to the true value while the alternative estimator in general will not. Thus the tests may be considered consistency tests on the alternative estimators, where the alternative estimator is the one generally used in practice.
Since in our case the alternative estimator need not be the MLE (and may include the two-step estimation method of Heckman (1979), which is not efficient), the simplification of the variance-covariance matrix noted by Hausman (1978) is not generally applicable. That simplification relies on one estimator being asymptotically efficient (in the sense that it attains the Cramer-Rao lower bound), in which case Hausman (1978) has shown that it must be uncorrelated with any other consistent and asymptotically normal estimator, which greatly simplifies the computation of the covariance matrix of the difference between the two estimators. This will not generally be the case here, so one must deal with the correlation between the two estimators in forming the variance-covariance matrix. This is not overly difficult, however, as can be seen from the various applications of this testing framework proposed by White (1980a), for example. In particular, if one can show that two estimators, say $b_1$ and $b_2$, satisfy

$b_1 = W_1'u_1 + o_p(1)$, $b_2 = W_2'u_2 + o_p(1)$

where the $W_i$ are non-random matrices and the $u_i$ are random error terms, then the variance-covariance matrix of the difference between $b_1$ and $b_2$ can be found provided that one can find the covariance between $u_1$ and $u_2$. This will be the case with the estimators considered in this chapter. In addition, there may be a number of alternative estimators, and the tests will be developed only for the most commonly used ones. The two forms of mis-specification dealt with are sample selectivity bias (which is relevant only in the case of the bivariate sample selection model) and heteroskedasticity in the latent model (a test which can be performed for both types of models considered in this chapter).

4.7.1 A Test for Sample Selectivity Bias

A common test in empirical applications of the bivariate sample selection model is for selectivity bias.
In the homoskedastic case, this is done by performing a t-test on the estimated coefficient of $\lambda_t$ in the second step regression. Since in this case the coefficient is $\sigma_{12}/\sigma_1$, the test corresponds to the null hypothesis that $\sigma_{12} = 0$; if this is the case, one may use OLS without the correction term, since it will then yield consistent and efficient estimates of the $\beta_2$ parameters. There are two reasons why this form of test (a Wald type test of $\sigma_{12} = 0$) will not be proposed. The first relates more generally to why such a test is of interest at all; the second has to do with the nature of this type of test in the heteroskedastic case.

In both the heteroskedastic and homoskedastic cases, $\sigma_{12} \neq 0$ is insufficient for the parameters $\beta_2$ to be biased by the neglect of the $\lambda_t$ regressor in the second stage regression. In particular, if $\lambda_t$ is uncorrelated with the other regressors then no bias will be evident, so if one is mainly interested in obtaining "good" estimates of $\beta_2$ then the OLS estimator may be preferred; even if there is a small correlation, OLS may be preferable on an MSE criterion. Of course, if the main concern is to obtain estimates of $E(y_2|y_1 = 1)$, then the correction term will be included in any case. This suggests that a Durbin-Wu-Hausman type test may be of use, since it examines directly whether the exclusion of the correction term causes a significant bias (or change) in the parameter estimates for the $\beta_2$ vector. Such a test would be based on testing the significance of the difference between the estimates obtained when the correction term is included and when it is not.

The second reason why a Wald type test may not be advisable has to do more specifically with the nature of the model in the heteroskedastic case. The null in this case would be $\sigma_{12}(x_3) = 0$ or, in the context of the notation and form of the model as developed in this chapter,

$H_0: g(x_3) = 0$
The null, therefore, requires that the function be identically 0. In the instance where we know the parametric form of $g$ up to a finite number of parameters, one could write the null as a test of the restrictions on those parameters that ensure $g$ is 0. Standard Wald type tests would be applicable, as there would be a finite number of restrictions, and the parameter estimates themselves would be asymptotically normal with the usual rate of convergence. The more general nonparametric case is more problematic. First, the number of parameters in the $g$ function is not known, and the approximation involves an increasing number of parameters. So even if one could ascertain the distribution of each parameter (and the results of this chapter do not deal with this problem), one faces the problem of testing an increasing number of restrictions, with no bound on their number. To consider this difficulty more closely, suppose that one were able to observe the true value of $\beta_2$. Then a natural estimator for $g$ (where the restriction $g = 0$ has been imposed) would be given by

$\hat{g} = P_{x_3}(\tilde{w} - \tilde{Z}\beta_2) = P_{x_3}\xi$  (4.70)

Using the nature of the $P_{x_3}$ matrix, one may expect that a test statistic of the quadratic form

$\xi'P_{x_3}\xi$  (4.71)

could provide the appropriate form of the test. The degrees of freedom, however, are unbounded, since the dimension of $\Phi_3$ increases with the sample size.

One alternative to this type of procedure would be to consider a finite number of the functions in $\Phi_3$. In this case, a Lagrange Multiplier (LM) type test could be constructed where the $\xi$ are replaced by the estimated values from OLS, which are consistent under the null hypothesis. Thus one is testing whether the OLS residuals are uncorrelated with a set of candidate functions of $x_3$. This type of conditional moment testing procedure has been surveyed and developed by Pagan and Vella (1989).
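A finite-dimensional version of this conditional moment idea can be sketched as a generic score-type statistic: project the residuals on a small fixed set of candidate functions of $x_3$ and scale the explained quadratic form by the residual variance. This is an illustration of the LM principle only, assuming homoskedastic residuals under the null; it is not the thesis's formal test, and the function name is ours.

```python
def lm_stat(resid, basis):
    """Score-type statistic resid' P resid / sigma2_hat for a small, fixed
    set of candidate functions of x3 (rows of `basis`)."""
    n, k = len(resid), len(basis)
    # normal equations B'B c = B'e solved by Gaussian elimination
    A = [[sum(basis[i][t] * basis[j][t] for t in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(basis[i][t] * resid[t] for t in range(n)) for i in range(k)]
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    coef = [0.0] * k
    for c in range(k - 1, -1, -1):
        coef[c] = (M[c][k] - sum(M[c][j] * coef[j]
                                 for j in range(c + 1, k))) / M[c][c]
    fitted = [sum(coef[i] * basis[i][t] for i in range(k)) for t in range(n)]
    sigma2 = sum(e * e for e in resid) / n
    return sum(f * e for f, e in zip(fitted, resid)) / sigma2
```

Residuals orthogonal to the candidate functions give a statistic of zero; correlation with any candidate inflates it, which is exactly the moment condition being tested.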
The situation here, however, involves an infinite number of conditional moment restrictions. Alternatively, one could pick a finite number of points in $X_3$ space at which to evaluate the $g$ function and use the results of Andrews (1988, 1989d) to obtain an appropriate test statistic. These procedures are likely to possess some power, but they are not entirely satisfactory, since they do not test the entirety of the null hypothesis and there is likely to be some degree of arbitrariness in the selection of moment conditions.

These two factors make it seem preferable to construct a test of the null $H_0: g = 0$ based on the consequences for the $\beta_2$ estimators of violations of the null. A natural comparison is between the $\beta_2$ estimator developed in this chapter and an OLS (or, more generally, some weighted least squares (WLS)) estimator, denoted $\tilde{\beta}_2$. Although we have allowed quite general heteroskedasticity, we do not rule out the possibility of obtaining an efficient estimator; the main test result, however, will not be developed under this assumption. Note that recent work by Robinson (1988) and White and Stinchcombe (1989) has shown that efficient estimation in the heteroskedastic regression model is possible under quite general conditions. In our case, since the residuals are maintained to be normally distributed, such an estimator attains the Cramer-Rao lower bound, which makes the test statistic developed in this section somewhat simpler than it appears, due to the simplification noted by Hausman (1978). We can write the two estimators for comparison
The rationale for considering a test statistic based on the difference between these two estimators is that if g = 0 then the estimators have the same probability limit (β2), whereas if g ≠ 0 then the estimators will have different limits. A significant difference will be interpreted as evidence that the exclusion of the correction term results in selectivity bias or some other form of mis-specification. The following result provides the basis for such a test.

Theorem 6: Under H0: g = 0 and given the conditions of Theorem 5,

N1 (β̃2 − β̂2)' V̂12⁻¹ (β̃2 − β̂2) → χ²(d2)

where V̂12 is a consistent estimator of V12, which is given by

plim [ A1 B1' Γ B1 A1 + A2 B2' Γ B2 A2 − A1 B1' Γ B2 A2 − A2 B2' Γ B1 A1 ]

(Γ being the variance matrix of ξ) and which is assumed to be a finite positive definite matrix.

A consistent estimator for V12 can easily be obtained by replacing unknowns with estimates and population moments with averages, as was done previously. There are also a number of alternatives that can be used. One which is guaranteed to be positive definite uses, for the first estimator,

A1 = ((1/N1) Z'WZ)⁻¹ (4.74)
B1' = (1/√N1) Z'W (4.75)

where W are possibly estimated weights, and, for the second estimator,

A2 = ((1/N1) (Z − P_X3 Z)'(Z − P_X3 Z))⁻¹ (4.76)
B2' = (1/√N1) (Z − P_X3 Z)' D3 (4.77)

while the variance-covariance matrix of ξ can be estimated by

Γ̂1 = diag{ξ̃t²} (4.78)
Γ̂2 = diag{ξ̂t²} (4.79)
Γ̂12 = diag{ξ̃t ξ̂t} (4.80)

where these matrices are used respectively in the first, second and last two terms of V̂12. (We ignore the need for consistency results, as they follow from Theorem 5 and work in White (1984).)
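Given the two estimates and an estimate of V12, the statistic in Theorem 6 is just a quadratic form. A minimal sketch (Python/numpy, with placeholder inputs; the function name is ours):

```python
import numpy as np

def hausman_stat(b_tilde, b_hat, v12_hat, n1):
    """Hausman-type statistic N1 * (b_tilde - b_hat)' V12^{-1} (b_tilde - b_hat),
    to be compared with a chi-squared critical value with dim(beta2)
    degrees of freedom."""
    d = b_tilde - b_hat
    return n1 * d @ np.linalg.solve(v12_hat, d)
```

For example, with estimates differing by one unit in the first coordinate, an identity V12 estimate, and N1 = 100, the statistic is 100.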
The weights appearing in the first estimator may be any weights (any quasi-Aitken estimator); however, in the situation where one is able to compute the asymptotically efficient estimator of White and Stinchcombe (1989), the variance-covariance matrix appearing in the test may be written as

V12 = plim A2 B2' Γ B2 A2 − AV (4.81)

where AV is the Cramer-Rao lower bound. This matrix is positive definite, although it is difficult to see how one could ensure that an estimator of it was positive definite in finite samples without using the more complicated form above. The test based on the efficient estimator may be expected to be more powerful (asymptotically) than the more general WLS test, although the relative performance of any of these tests in finite samples is open to question.

4.7.2 A Test for Heteroskedasticity

Similar principles point to the possibility of a test for heteroskedasticity in both sample selection and Tobit models, based on the estimator developed in this chapter. In this case, the null hypothesis of homoskedasticity not only imposes a restriction on the g function, but also imposes a restriction on the λt function: under the null it depends on x1t only through x1t'β1. This makes a Wald type test problematic. The same principle of a conditional moment test, as discussed in Pagan and Vella (1989), is applicable in this case. Since violation of the hypothesis of homoskedasticity causes mis-specification of the conditional mean of y2, one would expect that a test could be constructed using the residuals from the model estimated under homoskedasticity. In particular, the null of homoskedasticity would require that these residuals be uncorrelated with any function of xt. As in the case of a test for selectivity bias, there is some degree of arbitrariness in the selection of these conditions, and it seems unlikely that such a test would be powerful against all forms of heteroskedasticity.
As mentioned in the previous section, a fully robust procedure would be based on an increasing number of such moments (although it is difficult to see how such a procedure could be very powerful in practice). This suggests that a Hausman type test may have some merit, since it is based on the premise that the null is rejected only when allowing for heteroskedasticity makes a difference in practice, so that one is testing whether the consistency of the estimator is questionable. Such a test would be based on the difference between the preferred estimator under the null and the heteroskedasticity robust estimator developed in this chapter. The preferred estimator may be either the two-step estimator or the MLE. As noted earlier, this will generally mean the MLE for the Tobit model and the two-step estimator for the sample selection model. As noted previously, when the preferred estimator is the MLE, the test statistic may be somewhat easier to compute. Here we develop the test for the case where the preferred estimator is the two-step estimator. In this case, the second stage regression will have the form

y2t = x2t β2 + (σ12/σ2) λ̂t + ξt + (σ12/σ2)(λt − λ̂t) (4.82)

where λ̂t is estimated using the homoskedastic estimator developed in Section 5. Since the test will be based on the β2 estimates, we are only interested in this part of the parameter vector, and since under the null any weighted least squares estimator of the above is consistent, a useful weight is 1/λ̂t, in which case the regression equation becomes

wt = zt β2 + σ12/σ2 + ξt/λ̂t + (σ12/σ2)(λt − λ̂t)/λ̂t (4.83)

which is similar in form to the heteroskedastic model. The estimator in this case may be written as

β̄2 = ((Z − Ē(Z))'(Z − Ē(Z)))⁻¹ (Z − Ē(Z))'(w − Ē(w)) (4.84)

where

Ē(Z) = (1/N1) Σt zt (4.85)

the averages of each of the regressors. This is similar in form to β̂2.
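The estimator in (4.84) is simply OLS on deviations from sample means, which sweeps out the intercept-like σ12/σ2 term in (4.83). A minimal sketch (Python/numpy, hypothetical data):

```python
import numpy as np

def demeaned_ols(Z, w):
    """Estimator of the form (4.84): regress demeaned w on demeaned Z,
    which removes any constant term from the regression."""
    Zc = Z - Z.mean(axis=0)
    wc = w - w.mean()
    return np.linalg.solve(Zc.T @ Zc, Zc.T @ wc)
```

With a noise-free construction the slope coefficients are recovered exactly while the constant is swept out, as the demeaning intends.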
From Theorem 4 we know that these estimators can be written under the null as

√N1 (β̂2 − β2) = ((1/N1)(Z − T3)'(Z − T3))⁻¹ (4.86)
                × (1/√N1) (Z − T3)'(D3 ξ + G D4 X1 (X1' D5 X1)⁻¹ X1' D5 D6 ε) + op(1)
              = A1 (B21' ξ + B22' ε) + op(1) (4.87)

where T3 = Z in the case of the Tobit model, and

√N1 (β̄2 − β2) = ((1/N1)(Z − Ē(Z))'(Z − Ē(Z)))⁻¹ (4.88)
                × (1/√N1) (Z − Ē(Z))'(ξ + G D4 X1 (X1' D5 X1)⁻¹ X1' D5 D6 ε) + op(1)
              = A3 (B31' ξ + B32' ε) + op(1) (4.89)

using the expansion for β̂1 given in Section 5. Note that the values in these representations are the true values under the null, so that G is proportional to the identity matrix. Using the fact that ξ and ε are uncorrelated vectors, we now have the following test of mis-specification.

Theorem 7: Under the null that the underlying model is homoskedastic, and given the conditions of Theorem 4,

N1 (β̂2 − β̄2)' V̂13⁻¹ (β̂2 − β̄2) → χ²(d2)

where V̂13 is a consistent estimator of the variance-covariance matrix V13 given by

plim [ A1 (B21' Γ B21 + B22' Ω B22) A1 + A3 (B31' Γ B31 + B32' Ω B32) A3
      − A1 (B21' Γ B31 + B22' Ω B32) A3 − A3 (B31' Γ B21 + B32' Ω B22) A1 ]

which is assumed to be positive definite.

As in the previous case, all one is required to do is obtain a consistent estimator for the variance-covariance matrix, which can easily be done by replacing unknowns with consistent estimates. Also, one may obtain an estimator that is bound to be positive definite. Again, if the preferred estimator were the MLE, then the test would have the form

N1 (β̂2 − β̂2,MLE)' (V̂ − V̂MLE)⁻¹ (β̂2 − β̂2,MLE) (4.90)

where the covariance matrix is somewhat easier to compute, although it may not be positive definite in finite samples. This may be the desired form of the test in the case of the Tobit model. As noted above, this form of the test is likely to have power against quite arbitrary forms of heteroskedasticity, at least to the extent that such heteroskedasticity causes inconsistency. In addition, this form of the test may be powerful against other forms of mis-specification.
As noted earlier, one of the advantages of obtaining root-N consistent, asymptotically normal (RNCAN) estimators for these models is that they allow the construction of tests of mis-specification of the sort considered here. Not only does one have a relatively simple estimator to use in the event of model failure, but one also has a relatively convenient test of model adequacy based directly on the consequences of mis-specification for the parameters of interest.

4.8 Conclusions

In this chapter, we have developed two-step estimation procedures for both the sample selection model and the Tobit model when there is heteroskedasticity of unknown form in the latent errors. The estimators were shown to be consistent and asymptotically normal under fairly mild regularity conditions. In addition, estimation of the covariance matrix was considered, along with specification tests for heteroskedasticity and selectivity bias. One of the nice features of the estimator is its ease of computation, being based solely on regressions. A useful result regarding the distribution of averages of nonparametric estimates was developed and used to find the distribution of the second stage estimator, where the problem of pre-estimation error must be dealt with. Compared to the MLE developed in Chapter 3, the estimator differs in a couple of ways. First, identification of the model requires that some restrictions be satisfied by the data generating process. In the case of the Tobit model, this was simply that at least one of the variables in the mean of the latent variable not appear as an argument of the variance function. Such restrictions were not required to identify the model in the case of the MLE. Another difference is that in the case of the two-step estimator we do not have to worry about the variance-covariance matrix being positive definite, which is required for the MLE method to be well defined.
Also, the two step estimation method is likely to be somewhat easier to compute in practice. This is evidenced in the Monte Carlo experimentation that appears in Chapter 5.

Chapter 5

Small Sample Properties

5.1 Introduction

In this chapter, we examine the small sample properties of the estimators developed in the preceding chapters. There are a number of reasons for doing this. First, the arguments used to justify the use of the estimators have been asymptotic in nature and in no way guarantee that the estimators will have any merit in samples of the size typically used in practice. Second, the consistency results rely on being able to increase the number of terms in the approximating functions with the sample size. Although in practice a sample size will be forced on an applied econometrician, the number of terms in the approximating functions is a choice variable, so it will be useful to see what number of terms may be required in some simple situations to make the estimators perform well. As in most simulation work, simple models will be employed, and it should be stressed that we make no attempt to examine every possible aspect of the performance of the estimators. Nor do we attempt to choose the optimal number of terms in the approximations. The experiments will be conducted using a few different numbers of terms (truncation parameters), which will be chosen a priori and could be related loosely to the asymptotics. One interesting question is how well a simple quadratic approximation performs and whether there is much need to add the Fourier terms. The simulation work also gives some guide as to the parameterisation of the variance function, which is obviously useful for practical purposes. In particular, in the case of
the MLE estimator, it is required that the variance function be restricted to be positive, and the parameterisation should be chosen so that this will be true; it should also be such that the estimator is relatively easily computed (i.e., convergence is quick and no numerical problems arise). In addition, the ease with which the estimators can be computed will be brought out. For example, standard packages that have nonlinear optimisation routines will be sufficient to compute the estimators. There is in fact no need to program the derivatives (provided that the package has reasonable facilities for computing the derivatives, as in SHAZAM and TSP). The chapter has four remaining sections. In Section 2 we consider the Probit model and compare the usual MLE with the heteroskedasticity adjusted MLE of Chapter 3. Section 3 deals with the Tobit model and compares the usual Tobit MLE with the heteroskedastic MLE of Chapter 3 and the two-step estimator of Chapter 4. Section 4 considers the sample selection model and compares the usual Heckman two-step estimator with variants on it developed in Chapter 4. We do not examine the performance of the MLE in this case due to the fact that it is somewhat more expensive to compute than the other MLEs. In addition, as noted by Newey, Powell, and Walker (1990), the likelihood function in this case may be ill-conditioned, making it somewhat difficult to perform a simulation exercise using it. Finally, in Section 5 we summarise the main findings.

5.2 Probit Model

In this section, we consider the Probit model and compare the performance of the usual (homoskedastic) MLE with the heteroskedastic MLE proposed in Chapter 3. The model we consider is based on the following equation for the latent variable

y*t = a + b xt + ut (5.91)
where we assume that the xt are independent and uniformly distributed on (0.1, 6.1), which ensures that they are contained in the (0, 2π) interval required for the use of the Fourier series approximation. We assume that a = −3 and b = 1, so that approximately one half of the observations on the latent variable y*t will be positive. The discrete dependent variable is determined according to the rule yt = I(y*t > 0). The latent residuals will be generated as independent normal random variables with mean zero and variance given by σ²(xt), where σ varies across the different experiments. Assuming that we proceed to estimation taking the standard deviation of ut to be f(xt), the contribution to the likelihood for observation t is

lt = yt log Φ((a + b xt)/f(xt)) + (1 − yt) log[1 − Φ((a + b xt)/f(xt))] (5.92)

Obviously, in the standard case f is assumed constant. In the heteroskedastic case, we must substitute in an approximation based on the arguments of Chapter 3. When doing this, the approximating function must be kept away from zero. One way that seems to work well computationally is to let

f(xt) = (Σj=1..K θj ψj(xt))⁻² (5.93)

where the θj are parameters that will be estimated along with a and b, and the ψj(xt) are the basis functions, that is, the functions that comprise the Fourier Flexible Functional Form (FFF) discussed in Chapter 3. Since the xt are scalars, these functions are (in order)

1, x, x², sin(x), cos(x), sin(2x), cos(2x), .... (5.94)

As noted in Chapter 3, some normalisation of the coefficients a and b will be required to identify the parameters. The normalisation used in Chapter 3 may not be easy to impose computationally, but an equivalent normalisation can be imposed by restricting one of the coefficients in the likelihood to a particular value. There are obviously many alternatives, but the one chosen, which resulted in the quickest convergence, was to restrict b = 1 and to maximise over a and the θj parameters.
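The scalar basis in (5.94) and the positive variance parameterisation in (5.93) can be sketched as follows (a Python/numpy illustration rather than the thesis code; the function names are ours):

```python
import numpy as np

def fff_basis(x, K):
    """First K scalar FFF basis functions: 1, x, x^2, sin(x), cos(x),
    sin(2x), cos(2x), ...; x is assumed to lie in (0, 2*pi)."""
    cols = [np.ones_like(x), x, x ** 2]
    j = 1
    while len(cols) < K:
        cols.append(np.sin(j * x))
        cols.append(np.cos(j * x))
        j += 1
    return np.column_stack(cols[:K])

def f_std(x, theta):
    """Approximate standard deviation as in (5.93): f(x) = (theta' psi(x))^{-2},
    which is positive by construction."""
    s = fff_basis(x, theta.shape[0]) @ theta
    return s ** -2.0
```

For example, with theta = (2, 0, 0) the implied standard deviation at any x with a unit leading basis term is 2^{-2} = 0.25.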
To make the results comparable across the different experiments, we normalise the resultant estimates so that they lie on the unit circle (i.e., a² + b² = 1 holds), and then compare these for the different estimators. Note that the true values of a and b correspond to normalised values of −0.94868 and 0.31623. The exponential parameterisation suggested in Chapter 3 was also tried; although it gave sensible results most of the time, there were instances when the optimisation procedure failed due to exponential overflow. The parameterisation chosen here was better behaved. In all the experiments, apart from the above data generating process, we also keep constant the sample size, which is 200, and maintain the same set of xt variables across all experiments. There are 500 replications in each experiment. We conduct five different experiments, one for each assumed σ(xt), and compare the bias, standard deviation and quartiles of the different estimators. The variance functions will be chosen so that the average variance is constant across the experiments. The variance functions that we use are:

(1) σ²(x) = 1
(2) σ²(x) = c2 x
(3) σ²(x) = c3 exp(0.1x) exp(exp(0.1x))
(4) σ²(x) = c4 exp(−x) exp(exp(−x))
(5) σ²(x) = c5 (5(x − 3)⁴ + 1)

where the constants are chosen so that the average variance in all experiments is 1. This is done because in LDVMs the performance of the estimators depends on the magnitude of the scale parameter (the σ). This enables one to attribute whatever behaviour is observed for the different estimators to the changing variance function, as opposed to changing scale. These functions are chosen because it is easy to impose the
constant average variance assumption and because they correspond to different types of heteroskedasticity. The first is homoskedasticity, which is used to give some idea of the cost of using the estimators when the usual Probit estimator is valid. The second and third correspond to increasing functions of x, the fourth is a decreasing function of x, while the fifth is not monotonic. To gain some idea of the extent to which the FFF is capable of approximating these functions, we provide, in Table 5.1, different measures of fit for the four heteroskedastic experiments and the three different approximations used. Since the normalisation is such that σ(x)^(−1/2) is approximated by the FFF, we regress these values on the terms in the FFF to see how well the FFF fits. Two measures of fit are provided. The first, denoted L2, is the mean squared error, while the second, denoted L∞, is the maximum difference in the sample between the function and the corresponding FFF approximation. These measures are computed for the three cases considered in the experiments below. The rows corresponding to one term can be loosely interpreted as measuring the degree of heteroskedasticity, while the other rows give some indication of how well the FFF fits and hence the degree of mis-specification. As one would expect, the FFF fit improves, on either measure, as more terms are added.

No. of Terms  Measure  Exp2   Exp3     Exp4     Exp5
1             L2       2.6    0.01     0.59     0.65
              L∞       7.4    0.19     1.6      1.5
3             L2       0.5    0.00     0.0005   0.147
              L∞       4.0    0.0001   0.066    0.723
5             L2       0.17   0.00     0.00001  0.023
              L∞       2.4    0.00002  0.013    0.371

Table 5.1: Approximate Specification Errors - Probit

In each of the five experiments, we consider the properties of the following three likelihood based estimators, which are all contained in the general likelihood above. Accordingly, we identify each estimator with a value of K. The first is the usual Probit estimator, denoted by K = 1.
The second is based on a quadratic approximation and is denoted K = 3. The third corresponds to K = 5. Because the optimisation becomes increasingly time consuming as more terms are included, we did not consider any further estimators. The estimators are computed using the Newton-Raphson algorithm in TSP, which uses analytic derivatives. The number of iterations to convergence is generally two to three times the number usually required in ordinary Probit estimation. The results are contained in Tables 5.2 to 5.6.

Estimator  Variable  Bias      SD       LQ        Median    UQ
MLE        one        0.00012  0.00450  -0.95151  -0.94832  -0.94535
           x          0.00006  0.01341   0.30762   0.31732   0.32606
HMLE3      one        0.00045  0.00738  -0.95255  -0.94859  -0.94375
           x          0.00053  0.02190   0.30440   0.31652   0.33066
HMLE5      one        0.00101  0.00765  -0.95310  -0.94820  -0.94234
           x          0.00230  0.02271   0.30266   0.31768   0.33466

Table 5.2: Probit experiment 1

Estimator  Variable  Bias      SD       LQ        Median    UQ
MLE        one       -0.00069  0.00252  -0.95114  -0.94934  -0.94776
           x         -0.00216  0.00761   0.30876   0.31425   0.31899
HMLE3      one        0.00013  0.00446  -0.95244  -0.94824  -0.94491
           x          0.00009  0.01340   0.30471   0.31756   0.32732
HMLE5      one       -0.00017  0.00473  -0.95302  -0.94896  -0.94523
           x         -0.00089  0.01423   0.30292   0.31540   0.32640

Table 5.3: Probit experiment 2

Estimator  Variable  Bias      SD       LQ        Median    UQ
MLE        one       -0.00041  0.00395  -0.95557  -0.95312  -0.95021
           x         -0.00130  0.01230   0.29477   0.30258   0.31160
HMLE3      one       -0.00005  0.00630  -0.95184  -0.94782  -0.94240
           x         -0.00050  0.01848   0.30611   0.31882   0.33449
HMLE5      one       -0.00024  0.00704  -0.95437  -0.94931  -0.94529
           x         -0.00110  0.02104   0.29863   0.31433   0.32624

Table 5.4: Probit experiment 3
Estimator  Variable  Bias      SD       LQ        Median    UQ
MLE        one        0.01260  0.00459  -0.93922  -0.93643  -0.93312
           x          0.03500  0.01215   0.34331   0.35086   0.35956
HMLE3      one        0.00099  0.00333  -0.95013  -0.94796  -0.94574
           x          0.00280  0.00985   0.31860   0.31840   0.32493
HMLE5      one        0.00034  0.00460  -0.95277  -0.94969  -0.94659
           x          0.00130  0.01345   0.30525   0.31318   0.32243

Table 5.5: Probit experiment 4

Estimator  Variable  Bias      SD       LQ        Median    UQ
MLE        one       -0.00420  0.00343  -0.95521  -0.95309  -0.95074
           x         -0.01300  0.01076   0.29594   0.30268   0.31000
HMLE3      one       -0.00112  0.00223  -0.95100  -0.94975  -0.94854
           x         -0.00346  0.00700   0.30920   0.31301   0.31666
HMLE5      one       -0.00102  0.00424  -0.95028  -0.94884  -0.94721
           x         -0.00337  0.01331   0.31139   0.31576   0.32061

Table 5.6: Probit experiment 5

The results indicate that the estimators that correct for heteroskedasticity do reduce the bias relative to that of the usual Probit estimator. In all the heteroskedastic cases (experiments 2 to 5) the bias is significantly (at the 0.01 level) smaller for both heteroskedastic MLEs. In experiments 3 and 4 the estimator based on five terms is significantly (at the 0.01 level) less biased than the quadratic approximation. In experiments 2 and 5, HMLE3 and HMLE5 do not have significantly different biases. In experiments 2 and 3, the bias reduction comes at the cost of a loss of efficiency, with the standard deviation being slightly less than twice as big as in the ordinary Probit case. In experiments 4 and 5, however, the estimator based on a quadratic approximation is less biased and more efficient than the usual Probit estimator, and the estimator based on five terms is only slightly less efficient than the Probit MLE. These results indicate that some method of choosing the number of approximating terms may be useful.
With regard to the effect of heteroskedasticity on the Probit estimator, it seems that the increasing variance functions (experiments 2 and 3) result in downward biases in both coefficients. The decreasing variance function results in biases in the opposite direction. The use of HMLE3 and HMLE5 in homoskedastic situations may be costly, as one might expect, although only estimates based on HMLE5 are significantly biased. The biases are small in the experiments but are statistically significant (at the 0.01 level) in all cases except the first. The small order of the biases is obviously related to the experimental design, but it is noticeable that the degree of bias did not seem to be related to the degree of mis-specification as measured in Table 5.1.

5.3 The Tobit Model

In this section, we compare the performance of the usual (homoskedastic) MLE with the heteroskedastic MLE of Chapter 3 and the two step estimator of Chapter 4. In light of the identification conditions needed for the two step estimator, the simplest specification of the model is given by

y*t = a + b xt + c zt + ut (5.95)

where the ut are independent normal random variables with mean zero and variance given by some function of xt, σ(xt)². The xt and zt are independent uniform random variables on (0.1, 6.1), and the parameters will be fixed at (−6, 1, 1), so that approximately one half of the observations will be censored, with the observed dependent variable given by the rule yt = max{0, y*t}. Writing the mean of the latent variable as μt, the contribution to the logarithm of the likelihood function, when the square root of the variance function is approximated by the function f(xt), is

lt = (1 − I(yt > 0)) log Φ(−μt/f(xt)) + I(yt > 0) log[(1/f(xt)) φ((yt − μt)/f(xt))] (5.96)
In the homoskedastic MLE case the function f is a constant, and in the heteroskedastic case we use the approximation

f(xt) = exp(Σj=1..K θj ψj(xt)) (5.97)

where the ψj functions are as in the Probit case. As in the Probit case, we distinguish the different MLEs by the number of terms used to approximate the variance function. For the experiments we use the shorthand HMLEK, where HMLE denotes the fact that it is the heteroskedastic MLE and K denotes the number of terms in the approximation. We perform the experiments for K = 3 and K = 7, due to the cost of computing the MLE in a Monte Carlo experiment. The homoskedastic MLE corresponds to K = 1, but is denoted MLE. To compute the two-step estimator we need to define the qualitative variable vt = I(y*t > 0). The first step in the computation of the two step estimator regresses vt on the FFF constructed from the variables x and z, as in

vt = Σi=1..K1 θ1i ψi(xt, zt) + error (5.98)

where the ψ functions are constructed using the algorithm described in Gallant (1981): the first six consist of 1, x, z, x², z², xz and correspond to the quadratic part of the FFF; next there are the trigonometric functions of the form sin(kx x + kz z), cos(kx x + kz z), where the indices (kx, kz) are constructed using Gallant (1981), the first few being (1,0), (0,1), (2,0), (0,2), (1,−1), .... The terms are ordered by the index formed by |kx| + |kz|, and since this gives groups of increasing size, we ensure that the estimators are computed using all the terms up to a specific group. Denote the predicted values from this regression by v̂t. As noted in Chapter 4, we must censor these values to ensure that they lie between 0 and 1. This is done using the rule in Chapter 4, with the value of δ set to 0.01. As noted in Chapter 4, this is somewhat arbitrary; as this value gave estimators that performed fairly well, no experimentation with different values was done.
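Two pieces of the first step just described can be sketched in Python (numpy plus the standard library normal distribution). The within-group ordering of the indices and the helper names are our own assumptions, not Gallant's or the thesis's:

```python
import numpy as np
from statistics import NormalDist

_N = NormalDist()

def multi_indices(max_norm):
    """Candidate (kx, kz) index pairs for sin(kx*x + kz*z) and cos(kx*x + kz*z),
    grouped by the norm |kx| + |kz|; only one of each +/- pair is kept, since
    the negated index gives the same functions up to sign."""
    out = []
    for norm in range(1, max_norm + 1):
        out.append((norm, 0))
        out.append((0, norm))
        for kx in range(1, norm):
            kz = norm - kx
            out.append((kx, -kz))
            out.append((kx, kz))
    return out

def mills_analogue(v_hat, delta=0.01):
    """Censor the first-stage fitted values into [delta, 1 - delta]
    (delta = 0.01 in the experiments) and form phi(Phi^{-1}(v)) / v,
    the analogue of the inverse Mills ratio used in the second stage."""
    v = np.clip(v_hat, delta, 1.0 - delta)
    q = np.array([_N.inv_cdf(p) for p in v])          # Phi^{-1}(v)
    return np.exp(-0.5 * q ** 2) / (np.sqrt(2.0 * np.pi) * v)
```

For a fitted probability of 0.5 the ratio is phi(0)/0.5, about 0.798, and fitted values outside (0, 1) are handled by the censoring rule rather than producing an undefined inverse cdf.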
Denote the censored values by ṽt. In the second stage, we first transform these values to create the analogue of the inverse Mills ratio,

λ̂t = φ(Φ⁻¹(ṽt)) / ṽt (5.99)

and then perform the following regression, using the observations for which vt > 0:

yt/λ̂t = a (1/λ̂t) + b (xt/λ̂t) + c (zt/λ̂t) + Σk=1..K2 θ2k ψk(xt) + error (5.100)

where the ψk functions are as above. We denote the different estimators in this case by TK1,K2, where T signifies that it is a two step estimator and K1 and K2 give the number of terms used in the approximating functions. In all experiments, we use 200 observations and perform 250 replications, due to the cost of computation. As before, the variance functions used in the five experiments are given by:

(1) σ²(x) = 10
(2) σ²(x) = c2 x
(3) σ²(x) = c3 exp(0.3x) exp(exp(0.3x))
(4) σ²(x) = c4 exp(−x) exp(exp(−x))
(5) σ²(x) = c5 (5(x − 3)⁴ + 1)

The constants are chosen so that the average variance in all the experiments is about 10. Again, to gain some idea of the specification errors involved and the ability of the FFF to approximate these functions, we compute measures of fit analogous to those in Table 5.1. We do this for two normalisations. The first, which is used for the two-step estimators, recognises that σ appears in the regression function in the second step. The second, relevant for the MLE based estimators, recognises that we are implicitly trying to approximate log σ.

No. of Terms  Measure  Exp2    Exp3    Exp4    Exp5
1             L2       9.9     48.7    47.2    37.9
              L∞       7.7     7.9     22.6    15.3
3             L2       0.05    0.33    1.7     0.05
              L∞       0.95    2.5     5.1     0.56
7             L2       0.0013  0.003   0.020   0.007
              L∞       0.17    0.21    0.48    0.21
11            L2       0.0001  0.0001  0.009   0.0004
              L∞       0.07    0.044   0.12    0.045

Table 5.7: Approximate Specification Errors - Tobit, σ
No. of Terms  Measure  Exp2     Exp3     Exp4     Exp5
1             L2       0.17     0.93     0.88     1.02
              L∞       1.5      2.01     1.9      1.5
3             L2       0.0083   0.00056  0.00067  0.109
              L∞       0.47     0.076    0.091    0.79
7             L2       0.00047  0.0      0.0      0.00083
              L∞       0.12     0.003    0.006    0.08

Table 5.8: Approximate Specification Errors - Tobit, log σ

All estimators were computed using SHAZAM, which uses numeric derivatives and the Davidon-Fletcher-Powell algorithm to compute the MLEs. The main stumbling block in computing the two-step estimator was obtaining the inverse of the normal cdf. SHAZAM was used because it allowed easy calculation of the inverse, while TSP did not. Convergence in the MLE computations generally took two to three times the number of iterations usually required for the homoskedastic Tobit MLE. Note that we also computed the Heckman two-step estimator to enable some sort of comparison between the MLE and this estimator. It will be denoted Heck in the tables. The results are contained in Tables 5.9 to 5.13.
Estimator  Variable  Bias     SD       LQ        Median   UQ
OLS        one        5.772    0.773   -0.771    -0.155    0.332
           x         -0.524    0.142    0.368     0.471    0.577
           z         -0.519    0.153    0.379     0.489    0.591
MLE        one       -0.088    0.887   -6.707    -6.05    -5.455
           x          0.017    0.156    0.904     1.018    1.130
           z          0.008    0.179    0.889     1.005    1.131
HMLE3      one       -0.592    1.343   -7.464    -6.590   -5.585
           x         -0.113    0.204    0.768     0.903    1.030
           z          0.233    0.337    1.035     1.288    1.453
HMLE7      one        0.088    1.026   -6.631    -5.795   -5.135
           x         -0.019    0.186    0.854     0.985    1.094
           z          0.001    0.187    0.875     0.990    1.137
Heck       one        0.282    7.787  -10.228    -5.341   -0.816
           x         -0.026    0.745    0.506     0.902    1.423
           z         -0.032    0.721    0.524     0.939    1.329
T6,3       one       -1.216   43.565  -25.447    -6.374   11.453
           x          0.236    4.488   -0.424     1.021    2.810
           z          0.031    4.014   -0.731     1.033    3.255
T10,7      one        1.851   31.565  -19.878    -2.434   12.150
           x         -0.023    3.667   -0.755     0.831    2.771
           z         -0.232    2.706   -0.533     0.579    1.998
T18,11     one        3.027   19.174  -11.216    -1.765    6.974
           x         -0.102    2.442   -0.337     0.779    1.917
           z         -0.356    1.750   -0.298     0.577    1.560

Table 5.9: Tobit experiment 1

Estimator  Variable  Bias     SD       LQ        Median   UQ
OLS        one        4.564    0.937   -2.074    -1.401   -0.766
           x         -0.152    0.166    0.738     0.843    0.973
           z         -0.544    0.169    0.348     0.450    0.565
MLE        one       -2.186    1.117   -8.847    -8.115   -7.432
           x          0.409    0.193    1.283     1.411    1.544
           z          0.095    0.200    0.950     1.095    1.236
HMLE3      one       -1.668    1.473   -8.559    -7.609   -6.792
           x          0.066    0.217    0.923     1.084    1.215
           z          0.314    0.327    1.155     1.316    1.519
HMLE7      one       -0.099    1.096   -6.555    -5.838   -5.160
           x         -0.041    0.192    0.819     0.948    1.094
           z          0.007    0.195    0.881     0.990    1.134
Heck       one        1.772    9.238  -11.279    -4.942    0.765
           x          0.197    0.871    0.635     1.190    1.755
           z         -0.213    0.849    0.218     0.753    1.293
T6,3       one        2.052   48.138  -22.590    -2.595   14.500
           x         -0.163    2.748   -0.252     1.027    2.530
           z         -0.269    5.639   -1.617     0.514    2.555
T10,7      one        2.248   33.259  -16.602    -3.742   11.184
           x          0.093    3.092   -0.039     0.908    2.456
           z         -0.321    3.979   -0.819     0.469    1.933
T18,11     one        4.023   21.732   -9.286    -1.614    8.479
           x         -0.163    2.748   -0.150     0.805    1.899
           z         -0.408    2.328   -0.430     0.552    1.421

Table 5.10: Tobit experiment 2
Estimator  Variable  Bias     SD       LQ        Median   UQ
OLS        one        0.238    1.373   -6.722    -5.819   -4.882
           x          0.663    0.255    1.491     1.651    1.832
           z         -0.396    0.198    0.465     0.600    0.731
MLE        one       -5.312    1.599  -12.238   -11.195  -10.169
           x          0.876    0.283    1.672     1.847    2.069
           z          0.396    0.205    1.256     1.385    1.521
HMLE3      one       -0.327    0.768   -6.769    -6.291   -5.823
           x          0.023    0.114    0.936     1.026    1.108
           z          0.055    0.128    0.977     1.042    1.127
HMLE7      one       -0.071    0.843   -6.627    -5.974   -5.507
           x          0.002    0.150    0.899     1.003    1.107
           z          0.016    0.130    0.932     0.998    1.097
Heck       one        1.555    8.363   -9.779    -4.756    1.035
           x          0.557    0.700    1.148     1.548    1.988
           z         -0.541    0.966   -0.179     0.500    1.081
T6,3       one       -1.987   23.413  -19.509    -5.909    4.858
           x          0.678    1.453    0.860     1.513    2.503
           z         -0.132    3.627   -1.000     0.367    2.806
T10,7      one       -0.295   17.690  -16.315    -6.073    5.300
           x          0.856    1.408    0.855     1.771    2.717
           z         -0.526    2.850   -1.259     0.373    2.160
T18,11     one        0.529    7.140   -9.777    -4.952   -1.229
           x          0.346    0.802    0.774     1.295    1.794
           z         -0.282    1.100    0.116     0.667    1.380

Table 5.11: Tobit experiment 3

Estimator  Variable  Bias     SD       LQ        Median   UQ
OLS        one        6.776    1.009    0.099     0.836    1.482
           x         -1.107    0.177   -0.212    -0.107    0.012
           z         -0.396    0.103    0.536     0.612    0.677
MLE        one        1.938    0.615   -4.454    -4.020   -3.636
           x         -0.493    0.124    0.415     0.507    0.589
           z          0.017    0.115    0.932     1.009    1.089
HMLE3      one       -0.174    0.435   -6.441    -6.154   -5.889
           x          0.008    0.078    0.944     1.009    1.060
           z          0.029    0.043    0.997     1.030    1.056
HMLE7      one        0.004    0.373   -6.228    -5.998   -5.729
           x          0.004    0.067    0.954     0.994    1.040
           z          0.002    0.041    0.971     1.003    1.030
Heck       one       -3.423    2.609  -11.150    -9.253   -7.509
           x          0.005    0.328    0.789     1.000    1.216
           z          0.623    0.239    1.437     1.614    1.758
T6,3       one        0.111    2.022   -7.472    -5.960   -4.476
           x         -0.023    0.291    0.772     0.989    1.216
           z         -0.002    0.123    0.908     0.994    1.092
T10,7      one       -0.013    1.913   -7.389    -6.106   -4.752
           x          0.000    0.295    0.817     1.024    1.223
           z          0.002    0.121    0.922     0.993    1.076
T18,11     one        1.540    1.760   -6.500    -5.643   -4.621
           x         -0.110    0.321    0.752     0.931    1.065
           z         -0.005    0.079    0.943     0.994    1.043

Table 5.12: Tobit experiment 4
Estimator  Variable     Bias       SD        LQ    Median        UQ
OLS        one         4.850    1.27     -1.916    -1.020    -0.204
           x          -0.260    0.216     0.532     0.709     0.867
           z          -0.630    0.178     0.250     0.366     0.467
MLE        one        -1.250    1.187    -8.111    -7.178    -6.340
           x           0.012    0.227     0.854     1.022     1.170
           z           0.274    0.183     1.149     1.277     1.385
HMLE3      one        -0.05     1.008    -6.648    -5.918    -5.373
           x          -0.466    0.228     0.355     0.524     0.700
           z           0.354    0.186     1.230     1.343     1.470
HMLE7      one         0.037    0.478    -6.244    -5.916    -5.632
           x          -0.019    0.112     0.893     0.988     1.052
           z           0.006    0.063     0.962     1.004     1.056
Heck       one        -6.79     6.296   -16.378   -11.939    -8.450
           x           0.489    0.489     1.166     1.472     1.764
           z           0.848    0.802     1.296     1.721     2.335
T6,3       one        -1.162   28.173   -19.244    -5.313     6.319
           x           0.676    1.845     0.674     1.387     2.428
           z          -0.320    3.874    -0.993     0.438     2.536
T10,7      one        -1.870   10.36    -12.306    -7.488    -2.325
           x           0.680    1.269     0.938     1.536     2.320
           z          -0.185    1.429     0.016     0.825     1.630
T18,11     one        -0.140    3.516    -8.307    -5.826    -3.866
           x           0.170    0.732     0.709     1.117     1.604
           z          -0.082    0.518     0.597     0.889     1.224

Table 5.13: Tobit experiment 5

The results indicate that the heteroskedastic MLEs perform quite well under all forms of heteroskedasticity. In particular, the MLE based on K = 7 is generally significantly (at the 0.01 level) less biased and more efficient than the usual MLE under heteroskedasticity, and is almost unbiased in most of the cases. The MLE based on a quadratic approximation generally performs well, but is usually inferior to the other heteroskedastic MLE. In the homoskedastic case, there is some cost in terms of bias and efficiency loss in using HMLE3 and, to a lesser extent, HMLE7. With regard to the two-step estimators, the efficiency loss is quite marked. Moreover, in the cases where there is not substantial heteroskedasticity, the heteroskedastic two-step estimators perform quite poorly, as in experiments 1 and 2. The estimates of the intercepts are particularly biased and variable, although T6,3 and T10,7 are not significantly biased.
The performance also varies quite a lot depending on the number of terms used in the approximation. However, it seems that increasing the number of terms reduces the standard deviation, at least for the estimators considered here. One would hope that the performance in such situations would improve in larger samples and with more terms added to the approximations. In the experiments with considerable heteroskedasticity (3, 4 and 5), the heteroskedastic estimators generally improve the bias significantly over the Heck estimator (especially in experiments 4 and 5). Curiously, however, the bias reduction is more pronounced for the constant and the coefficient of z, both of which are poorly estimated by Heck. In experiments 3, 4 and 5, the T18,11 estimator is just as efficient as Heck. The T18,11 estimator is not always the least biased of the two-step estimators, so in practice some criterion for choosing the number of terms may be useful. The performance of both the heteroskedastic MLE and the two-step estimators is encouraging. With regard to the nature of the bias in the MLE for the Tobit model, it appears that, as in the Probit case, an increasing variance function causes a downward bias in the intercept and an upward bias in the slope coefficients, with the bias being larger for the coefficient of x (which is the variable in the variance function). A decreasing variance function has the opposite effect on the intercept and coefficient of x and little effect on the coefficient of z, while the nonmonotonic variance function seems to influence the intercept and coefficient of z most severely. The bias in using Heck is not always of the same sign as that in the Tobit MLE, especially with respect to the effects on the intercept and coefficient of z. Unlike the Probit case, the degree of bias did appear to be related to the degree of mis-specification as measured in Tables 5.7 and 5.8.
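The role played by the number of approximating terms can be made concrete with a small numerical sketch. The code below (Python, rather than the SHAZAM code used for the experiments; the rescaling of x into (0, 2*pi) and the exact ordering of the basis terms are assumptions of the sketch, not details taken from the thesis) fits quadratic-plus-Fourier least squares approximations of increasing order to a nonmonotonic variance function of the kind used in the experiments, and reports L2 and L-infinity fit measures:

```python
import numpy as np

def fff_design(x, n_fourier):
    # Quadratic terms plus sin/cos pairs, in the spirit of a Fourier
    # flexible form; x is rescaled to (0, 2*pi) (an assumption here).
    s = 2 * np.pi * (x - x.min()) / (x.max() - x.min())
    cols = [np.ones_like(s), s, s ** 2]
    for j in range(1, n_fourier + 1):
        cols += [np.sin(j * s), np.cos(j * s)]
    return np.column_stack(cols)

def fit_measures(hvals, x, n_fourier):
    # Least-squares fit of the variance function values on the basis;
    # returns mean squared error (L2) and maximum absolute error (Linf).
    X = fff_design(x, n_fourier)
    coef, *_ = np.linalg.lstsq(X, hvals, rcond=None)
    resid = hvals - X @ coef
    return float(np.mean(resid ** 2)), float(np.max(np.abs(resid)))

x = np.linspace(0.1, 6.1, 200)
h = (5 * (x - 3) ** 4 + 1) ** 0.5   # a nonmonotonic variance function
for k in (0, 2, 4):                 # 3, 7 and 11 terms in total
    l2, linf = fit_measures(h, x, k)
    print(3 + 2 * k, round(l2, 4), round(linf, 4))
```

Because the bases are nested, the L2 error is nonincreasing in the number of terms, which mirrors the pattern in the approximate specification error tables: the quadratic alone struggles with nonmonotonic shapes, while a few Fourier pairs close most of the gap.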
5.4 The Sample Selection Model

The simplest model that we can consider in which all of the identification conditions of section 4 are satisfied is given by

$y_{1t}^{*} = a + b x_t + u_{1t}$   (5.101)
$y_{2t}^{*} = c + d x_t + e z_t + u_{2t}$   (5.102)

where the dependent variables that are observed are given by the relations

$y_{1t} = I(y_{2t}^{*} > 0)\, y_{1t}^{*}$   (5.103)
$y_{2t} = I(y_{2t}^{*} > 0)$   (5.104)

and where the latent errors $u_{1t}$ and $u_{2t}$ are bivariate normal random variables with a covariance matrix that depends on some function of $x_t$. The coefficients in the model are set to {0, 1, -6, 1, 1}, so that approximately one half of the observations will be censored and hence one half "selected". The correlation between the latent residuals will be set at 0.75 for all the experiments. As in the previous models, the $x_t$ and $z_t$ variables are uniformly distributed on (0.1, 6.1), and are held fixed throughout the different experiments. We compare the performance of the two-step estimator with that of the homoskedastic two-step estimator proposed by Heckman (1979). The homoskedastic two-step estimator is computed in the usual way, by performing Probit estimation in the first stage using $y_{2t}$, $x_t$ and $z_t$. Then the inverse Mills ratio is constructed and used as a regressor in the second-step regression, where for observations with $y_{2t} = 1$ we regress $y_{1t}$ on a constant, $x_t$ and the inverse Mills ratio. We also compute a weighted version of this estimator, where the weights are the inverse of the inverse Mills ratio. This is because the heteroskedastic two-step estimator is based on a similar weighting. These estimators will be denoted by H1 and H2 respectively. The heteroskedastic two-step estimators are computed in a way that is very similar to the Tobit case, except that in the second-step regression the term corresponding to the $z_t$ regressor is omitted. We compute three different heteroskedastic two-step estimators.
The first, denoted by T6,3, is based on the use of quadratic approximations only. The next two, denoted by T10,7 and T18,11, add the indicated number of Fourier terms (i.e., T10,7 has four Fourier terms in addition to the quadratic terms in each of the two approximations). In addition we compute an estimator which uses the second step of the heteroskedastic estimator, but uses the homoskedastic inverse Mills ratio. This estimator will be denoted by TP,11, and is computed to see to what extent the bias is due to the bias in the estimation of the first step.

No. of Terms  Measure    Exp 2     Exp 3     Exp 4     Exp 5
1             L2         9.9       49.6      38.7      37.8
              L∞         7.7       18.4      16.8      15.3
3             L2         0.045     0.362     0.270     0.049
              L∞         0.948     2.07      1.56      0.56
7             L2         0.0013    0.0008    0.0006    0.007
              L∞         0.17      0.06      0.073     0.207
11            L2         0.0001    0.00002   0.00002   0.0004
              L∞         0.069     0.017     0.015     0.045

Table 5.14: Approximate Specification Errors - Sample Selection, h

The number of observations is set at 200 and there are 500 replications. The errors are generated by taking i.i.d. normal random pairs with correlation 0.75 and multiplying them by a function of $x_t$, denoted by $h(x_t)$, which is constructed so that the average variance is roughly constant (as close to 100 as possible) across the different experiments. The five h functions used are given by

(1) $h(x) = 10$
(2) $h(x) = c_1 \sqrt{x}$
(3) $h(x) = c_2 \exp(\tfrac{1}{2}x)$
(4) $h(x) = c_3 \exp(-\tfrac{1}{2}x)$
(5) $h(x) = c_4 (5(x-3)^4 + 1)^{1/2}$.

The constants are chosen so that the mean of the function (over x) is approximately 100. Again, we provide measures of fit for the FFF approximations, by fitting the various FFF approximations to the data on h generated using these functions. The estimators are computed using SHAZAM. Various summary statistics for the estimators (in addition to the least squares estimator) are contained in Tables 5.15 to 5.19.
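Before turning to the results, the data generating process and the logic of the Mills ratio correction can be sketched compactly. The code below (Python, not the SHAZAM code used in the thesis) simulates the homoskedastic design of experiment 1, taking h(x) = 10 as the latent standard deviation — an assumption of the sketch — and contrasts OLS on the selected subsample with a Mills-ratio-corrected second step. For brevity it evaluates the inverse Mills ratio at the true first-step index rather than at a first-step Probit estimate, so it is the infeasible version of the Heckman (1979) correction:

```python
import numpy as np
from math import erf, sqrt, pi

rng = np.random.default_rng(0)
N = 20000
a, b = 0.0, 1.0             # outcome equation:  y1* = a + b*x + u1
c, d, g = -6.0, 1.0, 1.0    # selection equation: y2* = c + d*x + g*z + u2
sigma, rho = 10.0, 0.75     # h(x) = 10 as latent s.d.; corr(u1, u2) = 0.75

x = rng.uniform(0.1, 6.1, N)
z = rng.uniform(0.1, 6.1, N)
e1, e2 = rng.standard_normal(N), rng.standard_normal(N)
u2 = sigma * e2
u1 = sigma * (rho * e2 + sqrt(1.0 - rho ** 2) * e1)

y1_star = a + b * x + u1
sel = (c + d * x + g * z + u2) > 0          # observed iff y2* > 0

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
w = (c + d * x[sel] + g * z[sel]) / sigma   # true index; a Probit fit in practice
mills = np.exp(-0.5 * w ** 2) / sqrt(2.0 * pi) / Phi(w)   # inverse Mills ratio

ones = np.ones(int(sel.sum()))
ols = np.linalg.lstsq(np.column_stack([ones, x[sel]]),
                      y1_star[sel], rcond=None)[0]
heck = np.linalg.lstsq(np.column_stack([ones, x[sel], mills]),
                       y1_star[sel], rcond=None)[0]
# heck[0], heck[1] estimate a and b; heck[2] estimates rho*sigma = 7.5
```

With the true index the correction removes the selection bias in the slope that OLS suffers; the feasible H1 estimator replaces w with a fitted Probit index, and the heteroskedastic two-step estimators replace the Probit step with a Fourier series regression.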
Estimator  Variable     Bias       SD        LQ    Median        UQ
OLS        one         7.478    1.842     6.351     7.416     8.669
           x          -0.489    0.481     0.207     0.516     0.849
H1         one        -1.234   25.021    -6.629     1.087     6.434
           x           0.096    1.789     0.451     0.915     1.547
H2         one        -1.349   25.587    -6.832     0.764     5.783
           x           0.101    1.831     0.411     0.977     1.596
TP,11      one         0.653   63.613   -11.487     2.301    13.650
           x          -0.104   12.844    -2.570     0.417     3.503
T6,3       one         1.707   20.276    -7.943     3.421    12.911
           x          -0.153    4.870    -1.932     0.609     3.621
T10,7      one         3.508   18.782    -4.413     4.125    14.060
           x          -0.163    4.999    -1.698     0.775     3.141
T18,11     one         5.736   22.911    -3.797     6.304    14.164
           x          -0.369    4.982    -1.437     0.668     2.861

Table 5.15: Sample selection experiment 1

Estimator  Variable     Bias       SD        LQ    Median        UQ
OLS        one         4.016    1.610     3.013     3.972     5.072
           x           0.556    0.510     1.189     1.586     1.897
H1         one        -5.086   17.550    -9.967    -2.538     2.668
           x           1.287    1.596     1.518     2.095     2.796
H2         one        -5.48    18.324   -10.844    -2.877     3.534
           x           1.281    1.626     1.363     2.062     2.762
TP,11      one         1.234   66.014   -14.453     0.582    11.509
           x          -0.376   16.529    -2.006     1.184     4.784
T6,3       one        -0.374   19.123    -8.876     0.767     9.748
           x           0.575    5.250    -1.355     1.351     4.495
T10,7      one         0.661   18.602    -7.345     0.837     9.467
           x           0.699    5.031    -1.025     1.672     4.491
T18,11     one         2.475   18.394    -4.943     2.818    10.365
           x           0.518    4.629    -0.630     1.269     3.689

Table 5.16: Sample selection experiment 2

Estimator  Variable     Bias       SD        LQ    Median        UQ
OLS        one        -2.500    2.330    -3.979    -2.378    -0.938
           x           2.310    0.843     2.753     3.308     3.876
H1         one        -7.903   10.475   -12.656    -6.668    -2.289
           x           2.831    1.353     2.991     3.797     4.562
H2         one       -11.036   14.687   -17.978   -10.015    -2.188
           x           3.377    1.939     3.189     4.181     5.509
TP,11      one        -8.993   31.024   -23.343    -7.758     6.870
           x           2.941    8.074    -0.539     3.791     8.044
T6,3       one        -2.506   17.541   -10.896    -2.275     5.972
           x           1.138    5.810    -0.800     2.235     5.543
T10,7      one        -4.700   18.557   -12.877    -3.367     4.823
           x           2.087    5.921    -0.483     3.124     6.354
T18,11     one        -4.913   28.367   -11.486    -2.938     4.261
           x           2.359    6.729     0.097     3.019     6.104

Table 5.17: Sample selection experiment 3

Estimator  Variable     Bias       SD        LQ    Median        UQ
OLS        one        11.967    2.788    10.06     11.904    13.839
           x          -2.325    0.603    -1.706    -1.327    -0.931
H1         one         6.564    6.451     2.794     7.217    10.725
           x          -1.625    0.949    -1.247    -0.720    -0.071
H2         one         4.230    4.095     1.869     4.374     6.980
           x          -1.081    0.630    -0.512    -0.112     0.269
TP,11      one         4.163    8.109    -0.988     4.365     9.504
           x          -0.935    1.604    -1.007     0.062     1.512
T6,3       one         0.358    7.168    -4.300     0.529     5.251
           x          -0.068    1.267     0.086     0.913     1.773
T10,7      one         0.853    8.431    -4.170     0.851     5.961
           x          -0.157    1.532    -0.066     0.804     1.741
T18,11     one         3.279    8.975    -2.913     2.943     8.249
           x          -0.607    1.654    -0.517     0.446     1.512

Table 5.18: Sample selection experiment 4

Estimator  Variable     Bias       SD        LQ    Median        UQ
OLS        one         4.490    2.538     2.783     4.515     6.265
           x          -0.094    0.709     0.444     0.912     1.419
H1         one        -0.842    4.061    -3.151    -0.475     1.821
           x           0.250    0.798     0.667     1.215     1.786
H2         one        -3.744    4.762    -6.410    -3.419    -0.100
           x           0.937    1.035     1.208     1.904     2.619
TP,11      one        -0.262    8.72     -5.523     0.493     4.954
           x           0.529    2.476     0.020     1.372     3.055
T6,3       one        -2.376   24.7      -5.992     1.076     7.382
           x           0.891    5.700    -0.582     1.164     3.277
T10,7      one        -1.434   20.824    -7.345     1.145     7.38
           x           0.645    4.889    -0.846     1.165     3.513
T18,11     one        -2.044   15.383    -7.303    -0.216     5.682
           x           0.738    4.050    -0.451     1.258     3.405

Table 5.19: Sample selection experiment 5

The results indicate that the heteroskedastic estimators perform fairly well. The bias reduction is significant (at the 0.01 level) in experiments 3 and 4 with respect to both coefficients, in experiment 5 for the constant, and in experiment 2 for the slope.
As in the Tobit experiments, the heteroskedastic two-step estimators do not estimate the intercept very well when there is little in the way of heteroskedasticity, as in experiments 1 and 2, although the estimates of the slope are not badly biased in these cases. Although there is substantial bias reduction in using the heteroskedastic two-step estimators, the estimators are in most cases (with the quadratic approximation estimator being slightly better) still significantly biased. The quadratic approximation based estimator appears to be somewhat better than the other heteroskedastic estimators in estimating the intercept. It is noteworthy that the usual homoskedastic estimators, H1 and H2, are significantly (at the 0.01 level) more biased here than Heck was in the Tobit case. This appears to be due to the fact that the average variance is higher in this case. The estimator based on standard Probit in the first step performs well in some cases, but is very inefficient in the first three experiments. Even in the first experiment, where the Probit estimator is valid in the first step, the TP,11 estimator is very inefficient relative to the two-step estimators that take account of heteroskedasticity (when they need not) in both steps. The conjecture made in Chapter 4 regarding possible efficiency gains from using a mis-specified first-step estimator appears to be untrue, at least in the experiments conducted. This seems to indicate that both sources of mis-specification may have serious consequences for the properties of the usual estimators. Generally, the heteroskedastic estimators are less efficient than the homoskedastic ones, although in the estimation of b in experiment 4 the difference is not that great. Among the different heteroskedastic estimators the relative variances depend on the experiment, once again suggesting the need for some criteria for choosing an "optimal" estimator.
Although the quadratic approximation does quite well, it is evident that when the variance function is not monotonic some of the Fourier terms may improve the estimator, as evidenced in Table 5.19 where the quadratic approximation does worst among the three heteroskedastic estimators. The nature of the biases is as one might anticipate, with the increasing variance function biasing the coefficients in a way that indicates a rotation of the regression line in an anti-clockwise direction: the slope increases and the intercept decreases. This is quite marked in experiment 3, where only the quadratic estimator seems to succeed in reducing the slope appreciably. In experiment 4 precisely the opposite occurs, and in this case the heteroskedastic estimators are quite successful at reducing the bias without sacrificing much in the way of efficiency. In experiments 2 and 3 the bias in the use of the homoskedastic estimators is so bad that even the OLS estimator appears to be preferred. As in the Tobit case, there did seem to be some relationship between the degree of mis-specification measures and the biases.

5.5 Conclusions

In this chapter, we have considered the performance of the estimators developed in Chapters 3 and 4. The findings may be summarized as follows. Both the MLE based estimators of Chapter 3 and the two-step estimators of Chapter 4 seem to be successful in reducing the bias when there is moderate to extreme heteroskedasticity. In the case of the Probit and Tobit experiments, the bias reduction in using the MLE based estimators may even be accompanied by improved efficiency, although this only occurs in one case considered for the Probit model. The two-step estimators are generally much less efficient for the Tobit and sample selection models, but relative to their homoskedastic counterparts they are also successful in reducing bias, in some cases with little if any loss in efficiency. It was also noted that when there was little or no heteroskedasticity the two-step estimators performed relatively poorly, so there may be some cost in using them when they are not needed. This preliminary examination of these estimators on the whole gives quite encouraging results with regard to the performance of the estimators. It was quite apparent that substantial biases may occur in the usual estimators, and a distinct pattern emerged (at least in the experiments performed in this chapter): if the variance is related positively to one of the regressors, then this generally results in an upward bias in the regression coefficient of that variable and a downward bias in the intercept. The opposite seemed to hold for the case where the variance is negatively related to the regressor. One thing that was noticeable was that the performance among the various versions (corresponding to different numbers of approximating terms) of the estimators proposed in this thesis differed, so some criterion for choosing an optimal estimator in a given instance may be desirable. The work of Andrews (1989e) and references therein, relating to cross-validation, may be of some use here, and it would be interesting to see how the chosen optimal estimator performs relative to the above.

Chapter 6

Conclusions

This thesis has considered the problem of estimating limited dependent variable models when there is heteroskedasticity of unknown form in the latent residuals, under a maintained assumption of normality. As noted, this situation differs from the usual regression model, since the commonly applied estimators (obtained under a homoskedastic assumption) are inconsistent as well as inefficient. The aim of the thesis was to propose estimation methods which yield estimators that are consistent, and two such methods were considered.
The first, proposed in Chapter 3, is based on maximum likelihood estimation, and deals with the heteroskedasticity by approximating the variance with a Fourier series approximation. Estimation proceeds by finding the parameters of the approximation and the parameters of interest that maximise the likelihood function. Given various regularity conditions (including smoothness of the variance function) and an assumption that the number of terms in the approximation increases with the sample size, such estimators were shown to be consistent in the three most commonly used LDVMs: the Probit, Tobit, and sample selection models. The second type of estimation technique, proposed in Chapter 4, is based on a two-step estimation strategy and was developed for the Tobit and sample selection models. The method takes into account the possibility of heteroskedasticity in both stages using Fourier series type regression estimators. In the first stage, a nonparametric regression is performed to estimate the discrete part of the model, and then in the second stage these nonparametric estimates are used to construct a correction term. This correction term is used to transform the second-stage regression so that in the second stage the model resembles a semi-parametric regression. The estimators were shown to be consistent and asymptotically normal at the usual $N^{1/2}$ rate for the two models under fairly general conditions. One of the interesting features of the estimator is that although the first-stage estimates are consistent at a slower than $N^{1/2}$ rate, and appear as pre-estimation error in the second stage, the (weighted) averaging of these over the observations restores the $N^{1/2}$ rate for the estimates from the second stage. As a by-product of the results, we were able to construct specification tests for heteroskedasticity and selectivity bias based on the work of Hausman (1978) and White (1980b).
Since the arguments used to justify the estimators are largely asymptotic, a small sampling experiment was conducted in Chapter 5 to see whether the estimators could perform well in small samples, and to examine the feasibility of computing the maximum likelihood estimators. The results were encouraging. Both estimators seemed to be successful in reducing the bias in the usual estimators, and in some situations this came at a small efficiency loss. In fact, in some instances the heteroskedasticity robust estimators appeared to be more efficient than the usual estimators. This was particularly the case with the likelihood based Tobit estimators. In addition, fairly small order approximations seemed to perform quite well. As far as computation goes, the estimators were relatively simple to compute. The two-step estimators needed only regressions and a nonlinear transformation, while the maximum likelihood estimators could be computed with packages which possess nonlinear optimisation routines. One of the obvious avenues where further research is needed is in obtaining distributional results for the likelihood based methods of Chapter 3. This appears to be in sight, as methods for regressions with increasing dimension are now available (as in Andrews (1988, 1989d)). Also, Portnoy (1988) has considered the distribution of parameter estimates in the maximum likelihood context when the number of parameters increases. He does, however, assume that the data generating process changes with the sample size, so that the model is correctly specified at each stage. The situation we have is one where the data generating process is fixed and the number of parameters increases, so that in the limit the model is correctly specified.
It would seem that some headway could be made by adapting the results of Portnoy to the mis-specification situation, as done for the standard case by White (1982), and then showing that the mis-specification does in fact disappear at a sufficiently fast rate, so that the asymptotic distributions are appropriately centered. One of the drawbacks of the results obtained in this thesis is that they are obtained under a maintained assumption of normality, and it would be desirable to relax this. The work of Powell (1984, 1986) and Manski (1975, 1985) is one approach, but as mentioned earlier this approach appears to be limited to univariate models. If this is indeed the case, then one would be required to generalise the estimators developed in this thesis to a heteroskedastic non-normal situation. One obstacle to such a development is a possible identification problem. This seems to be particularly true for the two-step estimation methods, where the structure possessed when the model is either non-normal (in which case it possesses a multiple index structure) or normal and heteroskedastic will be lost, and one may end up with an almost pure semi-parametric regression model. As the work of Robinson (1988) shows, certain exclusion restrictions are needed to identify the usual parameters of interest, and these are likely to be much stronger than those needed in any of Newey (1988), Powell (1988) or the work of Chapter 4. Nevertheless, further work is needed on the identification and estimation of LDVMs under general data generating processes.

Appendix A

Proofs for Chapter 3

Proof of Theorem 1: This is essentially the same as the proof of Theorem 0 in Gallant and Nychka (1987) but is included for completeness. We consider those points $\omega$ on the abstract probability space $(\Omega, \mathcal{S}, P)$ for which $\lim_{n\to\infty} K_n = \infty$ and for which condition (c) of the theorem holds; almost all $\omega$ satisfy this. Now since $(\hat\beta_n, \hat\sigma_n) \in B \times \Sigma$ and $B \times \Sigma$ is compact in $(\|\beta\|_E^2 + \|\sigma\|^2)^{1/2}$, we must have a subsequence $(\hat\beta_{n_j}, \hat\sigma_{n_j})$ converging to some $(\beta^0, \sigma^0) \in B \times \Sigma$ in this metric. By condition (b), however, there is a sequence $\{\sigma^0_{n_j}\}_{j=1}^{\infty}$ with $\sigma^0_{n_j} \in \Sigma_{K_{n_j}}$ such that $\lim_{j\to\infty} \|\sigma^0_{n_j} - \sigma^*\| = 0$. (Here we have corrected an error in the proof of Theorem 0 in Gallant and Nychka (1987).) Since the sequence $(\hat\beta_{n_j}, \hat\sigma_{n_j})$ minimises the objective function, whereas $(\beta^*, \sigma^0_{n_j})$ need not,

$s_{n_j}(\hat\beta_{n_j}, \hat\sigma_{n_j}) \le s_{n_j}(\beta^*, \sigma^0_{n_j})$

for all $j$. By uniform convergence (condition (c)) we have

$\lim_{j\to\infty} s_{n_j}(\hat\beta_{n_j}, \hat\sigma_{n_j}) = s(\beta^0, \sigma^0, \beta^*, \sigma^*)$

and

$\lim_{j\to\infty} s_{n_j}(\beta^*, \sigma^0_{n_j}) = s(\beta^*, \sigma^*, \beta^*, \sigma^*).$

Since the inequality must be preserved in the limit, we get

$s(\beta^0, \sigma^0, \beta^*, \sigma^*) \le s(\beta^*, \sigma^*, \beta^*, \sigma^*),$

which implies, using condition (d), that $\|\beta^0 - \beta^*\|_E = 0$ and $\|\sigma^0 - \sigma^*\| = 0$. Since every subsequence of $(\hat\beta_n, \hat\sigma_n)$ has a further subsequence satisfying this, the sequence itself must have only one limit point, and it must be $(\beta^*, \sigma^*)$. Q.E.D.

Proof of Lemma 2: Let

$s_n(\beta, \sigma) = (1/n) \sum_{t=1}^{n} s(u_t, x_t, \beta, \sigma, \beta^*)$

where

$s(u, x, \beta, \sigma, \beta^*) = -I(x\beta^* + u > 0)\, g^1(x, \beta, \sigma) - I(x\beta^* + u \le 0)\, g^2(x, \beta, \sigma)$

with

$g^1(x, \beta, \sigma) = \log \Phi\!\left(\frac{x\beta}{\sigma(x)}\right)$ and $g^2(x, \beta, \sigma) = \log\!\left(1 - \Phi\!\left(\frac{x\beta}{\sigma(x)}\right)\right).$

Following GN we will argue that the required $s$ function is given by

$s(\beta, \sigma, \beta^*, \sigma^*) = \int\!\!\int s(u, x, \beta, \sigma, \beta^*)\, h^*(u)\, du\, d\mu(x),$

where $h^*(u)$ is the true (normal) density of $u$, and show that it is continuous in $(\beta, \sigma)$. First, we will show continuity. (In referring to the space of $\sigma$ functions we will use $\Sigma$.) Define

$A_1 = \sup_{X \times B \times \Sigma} \left|\frac{x\beta}{\sigma(x)}\right| < \infty.$

First, consider

$s^1(\beta, \sigma, \beta^*, \sigma^*) = -\int\!\!\int I(x\beta^* + u > 0)\, g^1\, h^*(u)\, du\, d\mu(x).$

Now note that $g^1$ is continuous in $(x, \beta, \sigma)$ in $\|(x, \beta, \sigma)\| = (\|x\|_E^2 + \|\beta\|_E^2 + \|\sigma\|^2)^{1/2}$, since convergence for $\sigma$ implies uniform and hence pointwise convergence, so that the argument of $\log \Phi$ is continuous; since $\log \Phi$ is a continuous function, $g^1$ is continuous in $(x, \beta, \sigma)$. Now suppose we have a sequence $(\beta_n, \sigma_n)$ converging to some point $(\beta_0, \sigma_0)$ in the sense of the metric above. Then clearly

$I(x\beta^* + u > 0)\, g^1(x, \beta_n, \sigma_n) \to I(x\beta^* + u > 0)\, g^1(x, \beta_0, \sigma_0)$

pointwise. Since $|I(x\beta^* + u > 0)\, g^1| \le |\log \Phi(-A_1)| < \infty$ for any admissible sequence, and since

$\int_X \int_R |\log \Phi(-A_1)|\, h^*(u)\, du\, d\mu = |\log \Phi(-A_1)| < \infty,$

we have by the dominated convergence theorem (Royden (1987)) that $s^1$ is continuous in $(\beta, \sigma)$. A similar argument establishes continuity for the term corresponding to $g^2$.

Next we verify the uniform convergence part. First, note that the continuity result holds when we replace $I$ with any function $\chi$ which is continuous and bounded between 0 and 1. Hence, define the function $\chi_\gamma(z)$ by $\chi_\gamma(z) = 1$ if $z \ge \gamma$, $\chi_\gamma(z) = 0$ if $z \le 0$, and $\chi_\gamma$ continuous on $[0, \gamma]$. Define the function $s^1_\gamma$ to be the same as $s^1$ but with $I$ replaced by $\chi_\gamma$. This function will also be continuous, and dominated by the function

$-\chi_\gamma(x\beta^* + u)\log \Phi(-A_1) - \chi_\gamma(x\beta^* + u)\log(1 - \Phi(A_1)).$

Also, define the functions $s_{n,\gamma}$ and $s_\gamma$ in a similar way, i.e., replace $I$ with $\chi_\gamma$ in the definitions. Now

$\sup_{B \times \Sigma} |s_n(\beta, \sigma) - s(\beta, \sigma, \beta^*, \sigma^*)| \le \sup_{B \times \Sigma} |s_{n,\gamma}(\beta, \sigma) - s_\gamma(\beta, \sigma, \beta^*, \sigma^*)| + \sup_{B \times \Sigma} |s_{n,\gamma}(\beta, \sigma) - s_n(\beta, \sigma)| + \sup_{B \times \Sigma} |s_\gamma(\beta, \sigma, \beta^*, \sigma^*) - s(\beta, \sigma, \beta^*, \sigma^*)|.$

Now consider each of these terms in turn. By Theorem 1 of Burguette, Gallant, and Souza (1980) (hereafter BGS) the first converges to 0 as $n \to \infty$, since the $s_\gamma$ functions are continuous. For the next, as in GN define the function $\alpha_\gamma$ by $0 \le \alpha_\gamma \le 1$, $\alpha_\gamma(z) = 0$ for $|z| \ge 2\gamma$, $\alpha_\gamma(z) = 1$ for $|z| \le \gamma$, and $\alpha_\gamma$ continuous. Then

$|s_{n,\gamma}(\beta, \sigma) - s_n(\beta, \sigma)| \le (1/n)\sum_{t=1}^{n} |\chi_\gamma(x_t\beta^* + u_t) - I(x_t\beta^* + u_t > 0)|\, |g^1 + g^2| \le (1/n)\sum_{t=1}^{n} \alpha_\gamma(x_t\beta^* + u_t)\, |\xi|,$

where $\xi = \log \Phi(-A_1) + \log(1 - \Phi(A_1))$. Again applying Theorem 1 of BGS we obtain

$\limsup_{n} \sup_{B \times \Sigma} |s_{n,\gamma}(\beta, \sigma) - s_n(\beta, \sigma)| \le \int\!\!\int \alpha_\gamma(x\beta^* + u)\, |\xi|\, h^*(u)\, du\, d\mu.$

A similar bound can be obtained for the third term directly, since it does not depend on $n$. But both bounds can be made arbitrarily small by picking $\gamma$ small, since the function $\alpha_\gamma$ converges to zero except on the set of points $\{(x, u) : x\beta^* + u = 0\}$, which has probability zero with respect to $h^*(u)\, du\, d\mu$. Thus all three terms are zero in the limit. Q.E.D.

Proof of Lemma 3: The proof follows along the same lines as Lemma 2 with minor modifications. In this case, we have

$g^1(u, x, \beta, \sigma, \beta^*) = \log\!\left(\frac{1}{\sigma(x)}\,\phi\!\left(\frac{x(\beta^* - \beta) + u}{\sigma(x)}\right)\right), \qquad g^2(x, \beta, \sigma) = \log\!\left(1 - \Phi\!\left(\frac{x\beta}{\sigma(x)}\right)\right).$

Clearly, both $g^1$ and $g^2$ are continuous in their arguments. To obtain the desired continuity of the $s$ function, all that is required is that we find dominating functions for the $g$ functions and show that they are integrable (with respect to $h^*(u)\, du\, d\mu$). Clearly $|g^2(\beta, \sigma)| \le |\log(1 - \Phi(A_1))|$ as before. For $g^1$ we have

$g^1 = -\log \sigma(x) - \tfrac{1}{2}\log(2\pi) - \frac{1}{2\sigma(x)^2}\left((x(\beta^* - \beta))^2 + u^2 + 2x(\beta^* - \beta)u\right).$

Let

$A_2 = \sup_{B \times X} |x(\beta^* - \beta)| < \infty$ and $c = \max(|\log \delta|, \log b) + \tfrac{1}{2}\log(2\pi) < \infty,$

where $\delta$ and $b$ are the lower and upper bounds on $\sigma(x)$. Then we have

$|g^1| \le c + \frac{1}{2\delta^2}\left(A_2^2 + u^2 + 2A_2|u|\right),$

which is integrable since $u$ is normally distributed. The proof then follows along similar lines to that for Lemma 2. Q.E.D.

Proof of Lemma 4: As before we require the bounding function which enables the method used in the proof of Lemma 2 to be used. Here

$g^1 = \log\!\left(\frac{1}{\sigma_2(x)}\,\phi\!\left(\frac{y_2 - x\beta_2}{\sigma_2(x)}\right)\Phi\!\left(\frac{x\beta_1/\sigma_1(x) + \rho(y_2 - x\beta_2)/\sigma_2(x)}{(1 - \rho^2)^{1/2}}\right)\right), \qquad g^2 = \log\!\left(1 - \Phi\!\left(\frac{x\beta_1}{\sigma_1(x)}\right)\right).$

These are clearly continuous in any argument. The bounding function for $g^2$ is as before, and the second part of the $g^1$ function can be bounded as in the Tobit case. Therefore we are left with only the $\log \Phi(\cdot)$ component. Writing (the arguments of the $\sigma$ functions are suppressed for convenience)

$\ell = \frac{x\beta_1}{\sigma_1(1 - \rho^2)^{1/2}}$ and $j = \frac{\rho(y_2 - x\beta_2)}{\sigma_2(1 - \rho^2)^{1/2}},$

the $\Phi$ part is just the integral

$\Phi(\ell + j) = \int_{-\infty}^{\ell} \phi(u_1 + j)\, du_1 = \exp(-j^2/2)\,\Phi(\ell)\, E[\exp(-j u_1) \mid u_1 \le \ell] \ge \exp(-j^2/2)\,\Phi(\ell)\,\exp(-j E[u_1 \mid u_1 \le \ell]),$

where the inequality follows from the use of Jensen's inequality for the convex function $\exp(x)$. Taking logarithms and evaluating the expectation (the mean of a standard normal truncated above at $\ell$ is $-\phi(\ell)/\Phi(\ell)$) we get

$0 \ge g^* \ge -\frac{j^2}{2} + \log \Phi(\ell) + j\,\frac{\phi(\ell)}{\Phi(\ell)}.$

Observing the usual bounds on the parameters and $x$, we can bound each of these functions from below. In particular,

$-\frac{j^2}{2} \ge -\frac{1}{2\delta^2\epsilon^2}\left(A_3^2 + u_2^2 + 2A_3|u_2|\right),$

where $A_3 = \sup_{X \times B} |x(\beta_2^* - \beta_2)| < \infty$ and $\epsilon^2$ is the lower bound on $1 - \rho^2$. This function is integrable since $u_2$ is normally distributed. Next,

$\log \Phi(\ell) \ge \log \Phi(-A_1/\epsilon) > -\infty.$

Similarly, the last term is bounded, since $\phi(\ell)/\Phi(\ell) \le \phi(0)/\Phi(-A_1/\epsilon)$ while $|j|$ is bounded by a linear function of $|u_2|$; it is therefore integrable, since each part is finite and $|u_2|$ is integrable because $u_2$ is normal. Q.E.D.

Proof of Lemma 5: Step 1. First we show that $x\beta^*/\sigma^*(x)$ is identifiable. Suppose $x\beta/\sigma(x) \ne x\beta^*/\sigma^*(x)$ at some point $x_0$. Then there is an open ball $B(x_0, \vartheta)$ such that (assume without loss of generality)

$\frac{x\beta}{\sigma(x)} > \frac{x\beta^*}{\sigma^*(x)}$

for all $x$ contained in the ball. This implies that $\Pr(y = 0; x, \beta, \sigma) \ne \Pr(y = 0; x, \beta^*, \sigma^*)$ for all such $x$. Since $\mu(B) > 0$ by Assumption 5, we have that $x\beta^*/\sigma^*(x)$ is identifiably unique.

Step 2. Given the result of step 1 and the fact that $\sigma^*(x) > 0$, it is clear that $\mathrm{sign}(x\beta^*)$ is identified. Suppose that $\mathrm{sign}(xb) = \mathrm{sign}(x\beta^*)$ a.e. $[\mu]$ and that $\|b\|_E = \|\beta^*\|_E = 1$, but that $b \ne \beta^*$. Then it must also be the case on the set $\{x_k \in (e, f),\, x_l \in C\}$. Assume without loss of generality that $\beta_k^* > 0$. Now consider the set $L$, a subset of $C$, containing all $x_l$ for which the sign equation is violated for some $x_k$. We want to show that for the sign condition to hold a.e., $L$ must have zero measure; it will then follow that $b = \beta^*$, which will provide us with a contradiction. Now for each $x_l \in L$ there is an $x_k$ such that the sign equation is violated. Since $xb$ is continuous in $x_k$ for any $x_l$, this must also be true on $B(x_k, \vartheta)$ for some $\vartheta > 0$. Denote the everywhere positive (on $(e, f)$) conditional density of $x_k$ given $x_l$ by $g(x_k|x_l)$. We have then that the probability that the sign condition is violated is given by

$\int_L \int_{B(x_k, \vartheta)} g(x_k|x_l)\, dx_k\, dP(x_l).$

But

$\int_{B(x_k, \vartheta)} g(x_k|x_l)\, dx_k > 0$

for all $x_l \in L$ by the above argument and Assumption 7. Thus this probability is only zero if $P(L) = 0$. Hence almost every $x_l \in C$ has the sign condition satisfied for all $x_k \in (e, f)$. Restricting attention to such $x_l$, we get that $\mathrm{sign}(x_k b_k + x_l b_l) = \mathrm{sign}(x_k \beta_k^* + x_l \beta_l^*)$ for all $x_k \in (e, f)$. Clearly $b_k > 0$ must be true. Since this is true, there must be an $x_k(x_l) \in (e, f)$ such that

$x_k(x_l) b_k + x_l b_l = 0 \Rightarrow x_k(x_l) = -\frac{x_l b_l}{b_k}$ and $x_k(x_l) \beta_k^* + x_l \beta_l^* = 0 \Rightarrow x_k(x_l) = -\frac{x_l \beta_l^*}{\beta_k^*}.$

This implies that

$\frac{x_l b_l}{b_k} = \frac{x_l \beta_l^*}{\beta_k^*}$

for almost all $x_l \in C$. By the full rank condition we must have that

$\frac{b_l}{b_k} = \frac{\beta_l^*}{\beta_k^*}.$

The condition $\|b\|_E = \|\beta^*\|_E = 1$ implies that

$b_k^2 + \frac{b_k^2}{\beta_k^{*2}}\|\beta_l^*\|_E^2 = 1,$

which implies that $b_k = \beta_k^*$ and hence that $b = \beta^*$, which contradicts the assumption that they differ. Q.E.D.

Proof of Lemma 6: We can use the same argument as in Step 1 of Lemma 5 to show that $x\beta^*/\sigma^*(x)$ is identified. Now suppose that there is a point $x_0$ such that $\sigma(x_0) \ne \sigma^*(x_0)$; then it must also be true on $B(x_0, \vartheta)$ for some $\vartheta > 0$. Assume it is greater (without loss of generality); then for all $x \in B$, $\sigma(x) > \sigma^*(x) + \epsilon$ for some $\epsilon > 0$. Note that $\mu(B) > 0$. This implies that

$\frac{1}{\sigma(x)}\,\phi\!\left(\frac{y - xb}{\sigma(x)}\right) = \frac{1}{\sigma^*(x)}\,\phi\!\left(\frac{y - x\beta^*}{\sigma^*(x)}\right)$

at at most two points (using properties of the normal densities). Thus they differ almost everywhere for $y > 0$, and hence there are clearly events regarding $y$ which have different probabilities under $\sigma$ compared to those under $\sigma^*$. By a similar argument we obtain $xb = x\beta^*$ for almost all $x$. By the full rank condition we must have that $b = \beta^*$. Q.E.D.

Proof of Lemma 7: Using a similar argument to that in Lemma 5, we can show that $x\beta_1^*/\sigma_1^*(x)$ is identified and that $\beta_1^*$ is identified up to scale; thus clearly $\sigma_1^*(x)$ is identified also. (This uses only the observations for which $y_1 = 0$.) Next, consider the other part of the density function, given by

$\frac{1}{\sigma_2(x)}\,\phi\!\left(\frac{y_2 - x\beta_2}{\sigma_2(x)}\right)\Phi\!\left(\frac{x\beta_1/\sigma_1(x) + \rho(y_2 - x\beta_2)/\sigma_2(x)}{(1 - \rho^2)^{1/2}}\right).$

Both parts are clearly continuous in $y_2$; however, the first part is not monotone in $y_2$, unlike the second part (again using properties of the normal density function). Now suppose that $\sigma_2(x_0) \ne \sigma_2^*(x_0)$ at some point $x_0 \in X$; then it must also be the case that they differ on an open ball $B$ around the point, and that $\mu(B) > 0$. By a similar argument to that used in Lemma 6, the normal density parts under these different $\sigma_2$ functions will be equal at two different points (note that $y_2$ is not restricted to be positive — the addition of this restriction would turn the model into Cragg's (1971) double hurdle model). However, the $\Phi$ part is monotone in $y_2$, so evaluated at different $\sigma_2$ functions it will cross only once. Together these two facts ensure that the densities evaluated at the different $\sigma_2$ functions will differ on a nontrivial set of $y_2$. Hence $\sigma_2$ is identified. A similar argument can be used to ensure that the other parameters, $\rho^*$ and $\beta_2^*$, are identified. Q.E.D.

Proof of Lemma 8: This follows from the use of Jensen's inequality and Lemmata 5-7. In particular, suppose we have

$s(\beta^0, \sigma^0, \beta^*, \sigma^*) \le s(\beta^*, \sigma^*, \beta^*, \sigma^*).$

Now by Jensen's inequality we have

$\int_X \int_{R^2} \left(s(u, x, \beta^0, \sigma^0, \beta^*) - s(u, x, \beta^*, \sigma^*, \beta^*)\right) h^*(u)\, du\, d\mu(x)$
$\ge -\log \int_X \int_{R^2} \exp\!\left(s(u, x, \beta^*, \sigma^*, \beta^*) - s(u, x, \beta^0, \sigma^0, \beta^*)\right) h^*(u)\, du\, d\mu(x)$
$= -\log \int_X \int_Z \exp\!\left(-\log p(z|x, \beta^*, \sigma^*) + \log p(z|x, \beta^0, \sigma^0)\right) p(z|x, \beta^*, \sigma^*)\, d\nu(z)\, d\mu(x)$
$= -\log \int_X \int_Z p(z|x, \beta^0, \sigma^0)\, d\nu(z)\, d\mu(x) = -\log 1 = 0.$

Hence we must have that

$s(\beta^0, \sigma^0, \beta^*, \sigma^*) \ge s(\beta^*, \sigma^*, \beta^*, \sigma^*),$

with the equality holding only if

$s(u, x, \beta^0, \sigma^0, \beta^*) - s(u, x, \beta^*, \sigma^*, \beta^*)$

is constant a.e. $h^*(u)\, du\, d\mu(x)$. Together we have then that $s(u, x, \beta^0, \sigma^0, \beta^*) = s(u, x, \beta^*, \sigma^*, \beta^*)$ a.e. $h^*(u)\, du\, d\mu(x)$. But by Lemmata 5-7 this is only true if

$\|\beta^0 - \beta^*\|_E = 0$ and $\|\sigma^0 - \sigma^*\| = 0.$

Q.E.D.

Appendix B

Proofs for Chapter 4

Proof of Theorem 1: By Assumption 2.1 and Assumption 2.2 we have

$\frac{1}{N}(Z - \Gamma_3)'(Z - \Gamma_3) = A_1 + o_p(1),$

and using Assumption 2.3 we can apply the Liapunov Central Limit Theorem for independent non-identically distributed (i.n.i.d.) random variables, so that

$\frac{1}{\sqrt{N}}(Z - \Gamma_3)' D_3 \zeta \to N\!\left(0, \operatorname{plim} \frac{1}{N}(Z - \Gamma_3)' D_3 \Gamma D_3 (Z - \Gamma_3)\right),$

and the result follows. Q.E.D.

Proof of Lemma 1: The proof is similar to that of Andrews and Whang (1989). We condition on $x$ throughout, so all expectations are conditional. Write

$MSE(\hat g, g) = \frac{1}{N} E\|\hat g - g\|^2 = \frac{1}{N} E\|P_X y - g\|^2 \le \frac{2}{N}\left(E\|P_X u\|^2 + \|g - P_X g\|^2\right),$

using the parallelogram law in a Hilbert space. Now,

$\frac{1}{N} E\|P_X u\|^2 = \frac{1}{N} E(\operatorname{tr} P_X u u' P_X) = \frac{1}{N} E(\operatorname{tr} P_X u u') = \frac{1}{N} \operatorname{tr} P_X E(u u') \le \frac{\sup_t \sigma_t^2\, K(N)}{N},$

by Assumption 3.1 and since $P_X$ is a projection matrix with $K(N)$ regressors. Next, denoting the remainder term of the difference between the approximation in Assumption 3.2, denoted $g_K$, and the true function $g$ by $g_K^r$, and noting that $P_X g_K = g_K$, we have

$\frac{1}{N}\|g - P_X g\|^2 = \frac{1}{N}\|g - P_X g_K + P_X g_K - P_X g\|^2 = \frac{1}{N}\|(I - P_X) g_K^r\|^2 = \frac{1}{N}\|g_K^r\|^2 \le \|g_K^r\|_{0,\infty,X}^2,$

using facts about projections on Hilbert spaces. Therefore, we have that

$MSE(\hat g, g) \le \sup_t \sigma_t^2\,\frac{K(N)}{N} + \|g_K^r\|_{0,\infty,X}^2.$

Then if the conditions of (i) hold, $MSE(\hat g, g) \to 0$ by Assumption 3.2. If $K(N) = O(N^{\tau})$ as in (ii), then

$MSE(\hat g, g) = O(N^{\tau - 1}) + O\!\left(K(N)^{-2a}\left[K(N)^{a}\|g_K^r\|_{0,\infty,X}\right]^2\right) = O(N^{\tau - 1}) + O(N^{-2a\tau})$

by Assumption 3.2, so the result holds. For (iii), first note that by Assumption 3.2 we can choose an $a$ such that $K(N)^{a}\|g_K^r\|_{0,\infty,X} \to 0$, and since $K(N) = O(N^{q - \gamma})$, then $N^{a(q - \gamma)}\|g_K^r\|_{0,\infty,X} \to 0.$ Appendix B.
Proofs for Chapter 4 126 and the exponent can be made bigger than 1/2 provided that > ^ and 0 < 7 < q — 2S(g) • Fmally, we must also have that if q < 1/2 then 1 - r = 1 - ( 0 - 7 ) > 1/2 and 2ar can be made bigger than 1 using the above mentioned a and 7 . Q.E.D. Proof of Theorem 2: We can write VN(^c'Pxy - \-c'g) = (^—c'Pxg + -^=c'Pxu - -$=c'g) 1 :C>pxU--Lc'(I-Px)g VN VN Consider the second term first and condition on x so that the results will hold for every possible sequence of x ^ | c ' ( / - P x ) 3 | < y m ^ Y ^ U I - Pm)g\\^ <.0(l)v/JV:||^||oloo,A- -> 0 using the fact that ct is a bounded sequence and (i) and (ii) of the Theorem which allow the application of Lemma l(iii). Next, consider cu —c'Pxu = —=zc\I — Px)u 'N VN VN Suppose we condition on x again (and hence on c) and find the variance of this, then S((-4c'(/ - Px)uf) = lc'(J - Pm)9u(I - Px)c < SUp <T2t^-- Px)c\\2 < S U p c r t 2 | | C ^ | | 2 A . - * 0 t iV t using (iii) and (ii) by Lemma 1, so by Chebyshev's Inequality 1 c'{I - Px)u = op(l). Appendix B. Proofs for Chapter 4 127 Finally, 1 c '* c ? c ' u - > t f ( 0 , p H m ( i ^ ) ) fN " i r " N using Liapunov's Central Limit Theorem, which is applicable since Assumption 3.1 im-plies that E{\ut\2+S) < A < co and ct are bounded. Q.E.D. Proof of Corollary 1: Follows simply from the proof of Theorem 2 and the fact that PXL = L. Q.E.D. Proof of Theorem 3: The result will follow from the following two results I A7 _ TA' where and and = A f + o P ( l ) A " = ± ( Z - PBaZ)'(Z - PmaZ) A " =±{Z-T3)'(Z-T3) 1 (Z - PX3Z)\g + v) =-±-(Z - T3)'v + op(l) with rj = | = D3l£- For the first write z - p a , 3 z = z -T3 + r 3 - p a 3 z so that ^ = A ? + ~ T ^ T * - P - » ) + ^ ( R 3 " P X > Z ) L [ Z ~ T * ] +Yi(T3-PX3Z)'(T3-PX3Z) Appendix B. Proofs for Chapter 4 128 Using the Cauchy Inequality, it suffices to show that for each i 1 \Px2Zi-Ti\\ = O p ( l ) and — \\z, - Ti\\* = Op(l) which follow using (ii) (iii), by Lemma 1, and Assumption 2.3. 
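Two facts about series projections are used repeatedly in Lemma 1 and the surrounding results: $\mathrm{tr}\,P_x$ equals the number $K(N)$ of included basis functions, and $E\|P_xu\|^2/N \le \sup_t\sigma_t^2\,K(N)/N$ under heteroskedasticity of unknown form. The short sketch below checks both numerically; it is illustrative only, and the basis, sample size and variance function are hypothetical choices, not taken from the thesis:

```python
# Numeric check (illustrative, not from the thesis) of two projection facts:
# tr(P_x) = K(N), and E||P_x u||^2 / N <= sup_t sigma_t^2 * K(N) / N.
import numpy as np

rng = np.random.default_rng(0)
N, K = 500, 8

# Series design: polynomial basis in a scalar x (any basis would do).
x = rng.uniform(-1, 1, N)
Psi = np.vander(x, K, increasing=True)          # N x K basis matrix
P = Psi @ np.linalg.solve(Psi.T @ Psi, Psi.T)   # projection onto its span

assert np.isclose(np.trace(P), K)               # tr(P_x) = K(N)

# Heteroskedastic errors with variance bounded above.
sigma = 0.5 + np.abs(x)                          # sigma_t depends on x_t
reps = 2000
m = np.mean([np.sum((P @ (sigma * rng.standard_normal(N))) ** 2)
             for _ in range(reps)]) / N          # Monte Carlo E||P_x u||^2 / N

bound = sigma.max() ** 2 * K / N                 # sup sigma_t^2 K(N) / N
print(m <= bound + 0.01, round(m, 4), round(bound, 4))
```

The bound holds because $E\|P_xu\|^2=\mathrm{tr}(P_x\Sigma_u)\le\sup_t\sigma_t^2\,\mathrm{tr}\,P_x$, which is exactly the step used in the Lemma 1 proof.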
Next, consider
$$ \frac{1}{\sqrt N}(Z-P_{x_3}Z)'g=\frac{1}{\sqrt N}Z'(g-P_{x_3}g). $$
By the Cauchy Inequality
$$ \frac{1}{\sqrt N}|z_i'(g-P_{x_3}g)|\ \le\ \Big(\frac{1}{N}\|z_i\|^2\Big)^{1/2}\sqrt N\Big(\frac{1}{N}\|g-P_{x_3}g\|^2\Big)^{1/2}\ \le\ O_p(1)\,\sqrt N\,\|g_{K_3}^r\|_{0,\infty,X}\ \to\ 0, $$
by Assumption 2.3, (ii) and (iii), and Lemma 1(iii). Finally, consider
$$ \frac{1}{\sqrt N}(Z-P_{x_3}Z)'\eta=\frac{1}{\sqrt N}(Z-T_3)'\eta+\frac{1}{\sqrt N}(T_3-P_{x_3}Z)'\eta . $$
Thus, the result will follow provided that for each $i$, $\frac{1}{\sqrt N}(\tau_i-P_{x_3}z_i)'\eta=o_p(1)$. To show this it will suffice that
$$ E\Big(\Big(\frac{1}{\sqrt N}(\tau_i-P_{x_3}z_i)'\eta\Big)^2\Big)\ \to\ 0, $$
by Chebyshev's Inequality. To compute this expectation, first condition on $x$, so that
$$ E\Big(\Big(\frac{1}{\sqrt N}(\tau_i-P_{x_3}z_i)'\eta\Big)^2\Big)=E\Big(\frac{1}{N}(\tau_i-P_{x_3}z_i)'E(\eta\eta'|x)(\tau_i-P_{x_3}z_i)\Big)=E\Big(\frac{1}{N}(\tau_i-P_{x_3}z_i)'D_3^{-1}\Upsilon D_3^{-1}(\tau_i-P_{x_3}z_i)\Big) $$
$$ \le\ \sup_t V(\eta_t|x_t)\,E\,\frac{1}{N}\|\tau_i-P_{x_3}z_i\|^2\ \to\ 0, $$
given Assumption 2.3, (ii) and (iii) (which allow the application of Lemma 1). Therefore, by Theorem 1 the result holds. Q.E.D.

Proof of Lemma 2:

Using a second-order mean value expansion of $\hat\lambda_t$ about $\lambda(l(x_{1t}))$, which is valid for observations with $\hat l\in[\delta/2,\,1-\delta/2]$, and defining
$$ \hat I_t=I\big(\hat l(x_{1t})<\delta/2\ \text{or}\ \hat l(x_{1t})>1-\delta/2\big), $$
we have
$$ \hat\lambda_t-\lambda_t=d_1(l(x_{1t}))(\hat l(x_{1t})-l(x_{1t}))+\tfrac12 d_2(l^*)(\hat l(x_{1t})-l(x_{1t}))^2 $$
for some $l^*$ lying between $\hat l$ and $l$. Using this we then have
$$ \frac{1}{\sqrt N}\sum_{t=1}^N c(x_{1t})(\hat\lambda_t-\lambda_t)=\frac{1}{\sqrt N}\sum_{t=1}^N c(x_{1t})d_1(l(x_{1t}))(\hat l(x_{1t})-l(x_{1t}))+\frac{1}{\sqrt N}\sum_{t=1}^N c(x_{1t})\hat I_t(\hat\lambda_t-\lambda_t) $$
$$ -\frac{1}{\sqrt N}\sum_{t=1}^N c(x_{1t})d_1(l(x_{1t}))\hat I_t(\hat l(x_{1t})-l(x_{1t}))+\frac{1}{\sqrt N}\sum_{t=1}^N c(x_{1t})\,\tfrac12 d_2(l^*(x_{1t}))(1-\hat I_t)(\hat l(x_{1t})-l(x_{1t}))^2 . $$
Since $c(x_{1t})$ and $\lambda_t$ are both bounded sequences, the second term on the right of this satisfies
$$ E\Big|\frac{1}{\sqrt N}\sum_{t=1}^N c(x_{1t})\hat I_t(\hat\lambda_t-\lambda_t)\Big|\ \le\ C\frac{1}{\sqrt N}\sum_{t=1}^N P\big(|\hat l(x_{1t})-l(x_{1t})|>\delta/2\big)\ \le\ \frac{4C}{\delta^2}\,\sqrt N\,E\,\frac{1}{N}\|\hat l-l\|^2\ \to\ 0, $$
where convergence follows from the application of Lemma 1(iii), given (i) and (ii), which imply that $\sqrt N\,E\frac{1}{N}\|\hat l-l\|^2\to0$. Next, using the Cauchy and Chebyshev Inequalities, and the boundedness of $c$ and $d_1$,
$$ E\Big|\frac{1}{\sqrt N}\sum_{t=1}^N c(x_{1t})d_1(l(x_{1t}))\hat I_t(\hat l(x_{1t})-l(x_{1t}))\Big|\ \le\ C\sqrt N\Big(E\,\frac{1}{N}\sum_t\hat I_t\Big)^{1/2}\Big(E\,\frac{1}{N}\|\hat l-l\|^2\Big)^{1/2}\ \le\ \frac{2C}{\delta}\,\sqrt N\,E\,\frac{1}{N}\|\hat l-l\|^2\ \to\ 0 $$
as above. Similarly, since $d_2$ is bounded, we have for the final term
$$ E\Big|\frac{1}{\sqrt N}\sum_{t=1}^N c(x_{1t})d_2(l^*(x_{1t}))(1-\hat I_t)(\hat l(x_{1t})-l(x_{1t}))^2\Big|\ \le\ C\sqrt N\,E\,\frac{1}{N}\|\hat l-l\|^2\ \to\ 0, $$
and the result follows by applying Markov's Inequality to each of these terms to show that all but the first term are $o_p(1)$. Q.E.D.

Proof of Lemma 3:

This result follows from Lemma 2 and Theorem 2, since $d_1$ is a smooth function and $c(x_{1t})$ satisfies $S(c)>0$, so the product will also have at least one derivative bounded and also be bounded. Q.E.D.

Proof of Theorem 4:

Consider first
$$ \bar A_1^{N_1}=\frac{1}{N_1}(\hat Z-P_{x_3}\hat Z)'(\hat Z-P_{x_3}\hat Z) $$
and show that
$$ \bar A_1^{N_1}=A_1^{N_1}+o_p(1). \qquad\text{(B.105)} $$
Since we can write
$$ \hat Z-P_{x_3}\hat Z=Z-T_3+T_3-P_{x_3}T_3+P_{x_3}T_3-P_{x_3}Z+P_{x_3}Z-P_{x_3}\hat Z+\hat Z-Z, \qquad\text{(B.106)} $$
then by the Cauchy Inequality applied element by element to the product, it suffices to show that for each $i$
$$ \frac{1}{N_1}\|\tau_i-P_{x_3}\tau_i\|^2=o_p(1), \qquad\text{(B.107)} $$
$$ \frac{1}{N_1}\|P_{x_3}(z_i-\tau_i)\|^2=o_p(1), \qquad\text{(B.108)} $$
$$ \frac{1}{N_1}\|z_i-\tau_i\|^2=O_p(1), \qquad\text{(B.109)} $$
$$ \frac{1}{N_1}\|P_{x_3}(z_i-\hat z_i)\|^2=o_p(1), \qquad\text{(B.110)} $$
$$ \frac{1}{N_1}\|\hat z_i-z_i\|^2=o_p(1). \qquad\text{(B.111)} $$
For (B.107), for any sequence of $x$,
$$ \frac{1}{N_1}\|\tau_i-P_{x_3}\tau_i\|^2\ \le\ \|\tau_{K_3}^r\|_{0,\infty,X}^2\ \to\ 0 $$
given (ii) and (iii), by Lemma 1. For (B.108), taking expectations first conditional on $x$, we get
$$ E\,\frac{1}{N_1}\|P_{x_3}(z_i-\tau_i)\|^2=E\,\frac{1}{N_1}\mathrm{tr}\big(P_{x_3}E((z_i-\tau_i)(z_i-\tau_i)'|x)\big)\ \le\ C\,\frac{K_3(N_1)}{N_1}\ \to\ 0, $$
using Assumption 4.2 and the fact that $P_{x_3}$ is a projection matrix with $K_3(N_1)$ regressors, so (B.108) follows from Markov's Inequality. Next, given Assumption 3.2 and (iii), (B.109) is true. For (B.110) we have
$$ \frac{1}{N_1}\|P_{x_3}(z_i-\hat z_i)\|^2\ \le\ \frac{1}{N_1}(\hat z_i-z_i)'(\hat z_i-z_i) $$
due to the fact that $P_{x_3}$ is a projection matrix, so that (B.110) will follow from (B.111). For (B.111),
$$ \frac{1}{N_1}\|\hat z_i-z_i\|^2\ \le\ C\,\frac{1}{N_1}\sum_t\|x_{it}\|^2\,\frac{(\hat\lambda_t-\lambda_t)^2}{\hat\lambda_t^2\lambda_t^2}, $$
since $\hat\lambda_t$ and $\lambda_t$ are bounded variables for all $t$. By the mean value theorem applied as in the proof of Lemma 2,
$$ \hat\lambda_t-\lambda_t=d_1(l^*(x_{1t}))(\hat l(x_{1t})-l(x_{1t})) $$
for observations where $\hat I_t=0$ (where $\hat I_t$ is defined in the proof of Lemma 2), so that
$$ \frac{1}{N_1}\|\hat\lambda-\lambda\|^2\ \le\ \frac{N}{N_1}\,\frac{1}{N}\sum_{t=1}^N d_1(l^*(x_{1t}))^2(1-\hat I_t)(\hat l(x_{1t})-l(x_{1t}))^2+\frac{N}{N_1}\,\frac{1}{N}\sum_{t=1}^N\hat I_t(\hat\lambda_t-\lambda_t)^2, $$
and since $d_1$ is bounded and by (iv) $N/N_1=O_p(1)$, the first term on the right converges to 0 in probability provided that $\hat l$ converges to $l$ in MSE, as does the second given the proof in Lemma 2 (using Chebyshev's Inequality). But this is guaranteed by (i). So (B.105) follows, since $\frac{1}{N_1}\|z_i\|^2=O_p(1)$. Next, consider
$$ \frac{1}{\sqrt{N_1}}(\hat Z-P_{x_3}\hat Z)'g . $$
To show that this is $o_p(1)$, it suffices that for each $i$
$$ \frac{1}{\sqrt{N_1}}\hat z_i'(g-P_{x_3}g)=o_p(1). \qquad\text{(B.112)} $$
By the Cauchy inequality
$$ \frac{1}{\sqrt{N_1}}|\hat z_i'(g-P_{x_3}g)|\ \le\ \Big(\frac{1}{N_1}\|\hat z_i\|^2\Big)^{1/2}\sqrt{N_1}\Big(\frac{1}{N_1}\|(I-P_{x_3})g\|^2\Big)^{1/2}\ \le\ O_p(1)\,\sqrt{N_1}\,\|g_{K_3}^r\|_{0,\infty,X}\ \to\ 0, $$
using Assumption 2.3 and Lemma 1, which applies given (iii). So (B.112) is $o_p(1)$. Next consider
$$ \frac{1}{\sqrt{N_1}}(\hat Z-P_{x_3}\hat Z)'D_3^{-1}\xi . $$
Using (B.106) we show the following hold for each $i$:
$$ \frac{1}{\sqrt{N_1}}(\hat z_i-\tau_i)'D_3^{-1}\xi=\frac{1}{\sqrt{N_1}}(z_i-\tau_i)'D_3^{-1}\xi+o_p(1), \qquad\text{(B.113)} $$
$$ \frac{1}{\sqrt{N_1}}(\tau_i-P_{x_3}\tau_i)'D_3^{-1}\xi=o_p(1), \qquad\text{(B.114)} $$
$$ \frac{1}{\sqrt{N_1}}(P_{x_3}(\tau_i-z_i))'D_3^{-1}\xi=o_p(1), \qquad\text{(B.115)} $$
$$ \frac{1}{\sqrt{N_1}}(P_{x_3}(z_i-\hat z_i))'D_3^{-1}\xi=o_p(1), \qquad\text{(B.116)} $$
$$ \frac{1}{\sqrt{N_1}}(\hat z_i-z_i)'D_3^{-1}\xi=o_p(1). \qquad\text{(B.117)} $$
To show all but (B.113), we consider the expectation of the term squared, conditional on $y_1$ and $x$, and show that each of these converges to zero; then by Chebyshev's Inequality the results follow. First, consider (B.114); taking the expectation first conditional on $y_1$ and $x$ and then over these, we get
$$ E\Big(\Big(\frac{1}{\sqrt{N_1}}(\tau_i-P_{x_3}\tau_i)'D_3^{-1}\xi\Big)^2\Big)=E\Big(\frac{1}{N_1}(\tau_i-P_{x_3}\tau_i)'D_3^{-1}E(\xi\xi'|x)D_3^{-1}(\tau_i-P_{x_3}\tau_i)\Big) $$
$$ =E\Big(\frac{1}{N_1}(\tau_i-P_{x_3}\tau_i)'D_3^{-1}\Upsilon D_3^{-1}(\tau_i-P_{x_3}\tau_i)\Big)\ \le\ C\,E\,\frac{1}{N_1}\|\tau_i-P_{x_3}\tau_i\|^2\ \to\ 0, $$
using Assumption 2.3, (ii), (iii) and the result of Lemma 1. Similarly, for (B.115) we have
$$ E\Big(\Big(\frac{1}{\sqrt{N_1}}(P_{x_3}(\tau_i-z_i))'D_3^{-1}\xi\Big)^2\Big)\ \to\ 0 $$
by Assumptions 2.3 and 4.2, (iii) and (ii). For (B.116),
$$ E\Big(\Big(\frac{1}{\sqrt{N_1}}(P_{x_3}(z_i-\hat z_i))'D_3^{-1}\xi\Big)^2\Big)\ \le\ C\,E\Big(\frac{1}{N_1}(\hat z_i-z_i)'P_{x_3}(\hat z_i-z_i)\Big)\ \le\ C\,E\Big(\frac{1}{N_1}\|\hat z_i-z_i\|^2\Big)\ \to\ 0, $$
using Assumption 2.3 and a previous result. For (B.117), the argument is the same as before. Finally, for (B.113),
$$ E\Big(\Big(\frac{1}{\sqrt{N_1}}(\hat z_i-z_i)'D_3^{-1}\xi\Big)^2\Big)\ \le\ C\,E\Big(\frac{1}{N_1}\|\hat z_i-z_i\|^2\Big)\ \le\ C\,E\Big(\frac{1}{N_1}\sum_{t=1}^{N_1}(z_{it}-\tau_i(x_{3t}))^2\,\frac{(\hat\lambda_t-\lambda_t)^2}{\hat\lambda_t^2\lambda_t^2}\Big)\ \to\ 0, $$
using the boundedness of $z_{it}-\tau_i(x_{3t})$ and the MSE convergence of $\hat\lambda$ to $\lambda$, which is shown above. Therefore (B.113) holds, and we have from Theorem 1 that, letting $B_1^{N_1}=\frac{1}{\sqrt{N_1}}(Z-T_3)'D_3^{-1}$, since
$$ \frac{1}{\sqrt{N_1}}(\hat Z-P_{x_3}\hat Z)'D_3^{-1}\xi=B_1^{N_1}\xi+o_p(1), $$
it is also true that
$$ \frac{1}{\sqrt{N_1}}(\hat Z-P_{x_3}\hat Z)'D_3^{-1}\xi\ \to\ N\big(0,\ \mathrm{plim}(B_1^{N_1}\Upsilon B_1^{N_1\prime})\big). $$
Finally, we show that
$$ \frac{1}{\sqrt{N_1}}(\hat Z-P_{x_3}\hat Z)'\hat g\,\frac{\hat\lambda-\lambda}{\hat\lambda}\ \to\ N(0,S_2). $$
We do this by breaking up the term using (B.106) and showing the following:
$$ \frac{1}{\sqrt{N_1}}(z_i-\tau_i)'g\Big(\frac{\hat\lambda-\lambda}{\hat\lambda}-\frac{\hat\lambda-\lambda}{\lambda}\Big)=o_p(1), \qquad\text{(B.118)} $$
$$ \frac{1}{\sqrt{N_1}}(\tau_i-P_{x_3}\tau_i)'\hat g\,\frac{\hat\lambda-\lambda}{\lambda}=o_p(1). \qquad\text{(B.119)} $$
$$ \frac{1}{\sqrt{N_1}}(P_{x_3}(\tau_i-z_i))'\hat g\,\frac{\hat\lambda-\lambda}{\lambda}=o_p(1), \qquad\text{(B.120)} $$
$$ \frac{1}{\sqrt{N_1}}(P_{x_3}(z_i-\hat z_i))'\hat g\,\frac{\hat\lambda-\lambda}{\lambda}=o_p(1), \qquad\text{(B.121)} $$
$$ \frac{1}{\sqrt{N_1}}(\hat z_i-z_i)'\hat g\,\frac{\hat\lambda-\lambda}{\lambda}=o_p(1). \qquad\text{(B.122)} $$
Consider first (B.119):
$$ E\Big|\frac{1}{\sqrt{N_1}}(\tau_i-P_{x_3}\tau_i)'\hat g\,\frac{\hat\lambda-\lambda}{\lambda}\Big|\ \le\ C\Big(\sqrt{N_1}\,E\,\frac{1}{N_1}\|\tau_i-P_{x_3}\tau_i\|^2\Big)^{1/2}\Big(\sqrt{N_1}\,E\,\frac{1}{N_1}\|\hat\lambda-\lambda\|^2\Big)^{1/2}\ \to\ 0, $$
using (i), (ii), (iii), (iv) by Lemma 1, and the boundedness of $\hat\lambda_t$. For (B.120),
$$ E\Big|\frac{1}{\sqrt{N_1}}(P_{x_3}(\tau_i-z_i))'\hat g\,\frac{\hat\lambda-\lambda}{\lambda}\Big|\ \le\ C\Big(\frac{K_3(N_1)}{\sqrt{N_1}}\Big)^{1/2}\Big(\sqrt{N_1}\,E\,\frac{1}{N_1}\|\hat\lambda-\lambda\|^2\Big)^{1/2}\ \to\ 0 $$
by (ii), (i) and from above. Next, for (B.121),
$$ E\Big|\frac{1}{\sqrt{N_1}}(P_{x_3}(z_i-\hat z_i))'\hat g\,\frac{\hat\lambda-\lambda}{\lambda}\Big|\ \le\ C\Big(\sqrt{N_1}\,E\,\frac{1}{N_1}(\hat z_i-z_i)'P_{x_3}(\hat z_i-z_i)\Big)^{1/2}\Big(\sqrt{N_1}\,E\,\frac{1}{N_1}\|\hat\lambda-\lambda\|^2\Big)^{1/2} $$
$$ \le\ C\Big(\sqrt{N_1}\,E\,\frac{1}{N_1}\|\hat z_i-z_i\|^2\Big)^{1/2}\Big(\sqrt{N_1}\,E\,\frac{1}{N_1}\|\hat\lambda-\lambda\|^2\Big)^{1/2}\ \le\ C\,\sqrt{N_1}\,E\,\frac{1}{N_1}\|\hat\lambda-\lambda\|^2\ \to\ 0, $$
using the boundedness of $\hat z_{it}$ and $\lambda_t$ and the above result. Similarly (B.122) holds. Finally, for (B.118) we have
$$ E\Big|\frac{1}{\sqrt{N_1}}(z_i-\tau_i)'g\Big(\frac{\hat\lambda-\lambda}{\hat\lambda}-\frac{\hat\lambda-\lambda}{\lambda}\Big)\Big|=E\Big|\frac{1}{\sqrt{N_1}}\sum_{t=1}^{N_1}(z_{it}-\tau_i(x_{3t}))\,g(x_{3t})\,\frac{(\hat\lambda_t-\lambda_t)^2}{\lambda_t\hat\lambda_t}\Big|\ \le\ \frac{C}{\sqrt{N_1}}\,E\|\hat\lambda-\lambda\|^2\ \to\ 0 $$
as before. Therefore
$$ \frac{1}{\sqrt{N_1}}(\hat Z-P_{x_3}\hat Z)'\hat g\,\frac{\hat\lambda-\lambda}{\hat\lambda}=\frac{1}{\sqrt{N_1}}(Z-T_3)'g\,\frac{\hat\lambda-\lambda}{\lambda}+o_p(1). $$
Using the definition of $y_1$ we can write
$$ \frac{1}{\sqrt{N_1}}(z_i-\tau_i)'g\,\frac{\hat\lambda-\lambda}{\lambda}=\frac{1}{\sqrt{N_1}}\sum_{t=1}^N y_{1t}(z_{it}-\tau_i(x_{3t}))\,\frac{g(x_{3t})}{\lambda_t}(\hat\lambda_t-\lambda_t) $$
$$ =\frac{1}{\sqrt{N_1}}\sum_{t=1}^N y_{1t}(z_{it}-\tau_i(x_{3t}))\,\frac{g(x_{3t})}{\lambda_t}\,d_1(l(x_{1t}))(\hat l(x_{1t})-l(x_{1t}))+o_p(1), $$
using the result of Lemma 2. Letting $Y_1=\mathrm{diag}\{y_{1t}\}$, this can be written as
$$ \sqrt{\frac{N}{N_1}}\,\frac{1}{\sqrt N}(z_i-\tau_i)'GD_\lambda^{-1}D_1Y_1(\hat l-l)=\sqrt{\frac{N}{N_1}}\Big[\frac{1}{\sqrt N}(z_i-\tau_i)'GD_\lambda^{-1}D_1Y_1P_{x_1}e+\frac{1}{\sqrt N}(z_i-\tau_i)'GD_\lambda^{-1}D_1Y_1(P_{x_1}l-l)\Big]. $$
Since (iv) holds, we ignore the factor $\sqrt{N/N_1}$ for the moment to simplify the notation. Consider now the second term on the right:
$$ E\Big|\frac{1}{\sqrt N}(z_i-\tau_i)'GD_\lambda^{-1}D_1Y_1(P_{x_1}l-l)\Big|\ \le\ C\,\sqrt N\,\|l_{K_1(N)}^r\|_{0,\infty,X}\ \to\ 0, $$
using boundedness assumptions and (i), which allows application of Lemma 1(iii). Next, define the vector $(z_i-\tau_i)_{y_1}$ by $(z_i-\tau_i)_{y_1}=Y_1(z_i-\tau_i)$, and noting that $(z_i-\tau_i)_{y_1}=\bar\nu_i(x_1)+\nu_i$, we can write the first term as
$$ \frac{1}{\sqrt N}(z_i-\tau_i)'GD_\lambda^{-1}D_1Y_1P_{x_1}e=\frac{1}{\sqrt N}\bar\nu_i(x_1)'GD_\lambda^{-1}D_1P_{x_1}e+\frac{1}{\sqrt N}\nu_i'GD_\lambda^{-1}D_1P_{x_1}e. $$
Defining $\zeta_{it}$ by
$$ \zeta_{it}=\nu_{it}\,\frac{g(x_{3t})}{\lambda_t}\,d_1(l(x_{1t})), $$
we have that $E(\zeta_{it}|x_1)=0$ and $V(\zeta_{it}|x_1)$ is bounded above, given Assumption 6.2 and the boundedness of the other functions. Hence
$$ E\Big|\frac{1}{\sqrt N}\nu_i'GD_\lambda^{-1}D_1P_{x_1}e\Big|\ \to\ 0 $$
by Lemma A2.1, since $e$ has conditional mean zero and bounded conditional variance and since (i) holds. Thus, we need only consider
$$ \frac{1}{\sqrt N}\bar\nu_i(x_1)'GD_\lambda^{-1}D_1P_{x_1}e=\frac{1}{\sqrt N}B_2^NP_{x_1}e $$
and show that for any vector $b$ such that $\|b\|=1$,
$$ \frac{1}{\sqrt N}b'B_2^NP_{x_1}e $$
is normally distributed. But by Theorem 2 we have that, since $b'B_2^N$ is a function of $x_1$ which has positive Sobolev smoothness index,
$$ \frac{1}{\sqrt N}b'B_2^NP_{x_1}e=\frac{1}{\sqrt N}b'B_2^Ne+o_p(1), $$
and this is asymptotically normal using the Liapunov Central Limit Theorem, so that
$$ \frac{1}{\sqrt N}b'B_2^NP_{x_1}e\ \to\ N(0,S_2). $$
Finally, we have that
$$ \frac{1}{\sqrt{N_1}}(\hat Z-P_{x_3}\hat Z)'\Big(D_3^{-1}\xi+\hat g\,\frac{\hat\lambda-\lambda}{\hat\lambda}\Big)=\big(B_1^{N_1}\xi+B_2^Ne\big)+o_p(1), $$
and since $\xi$ and $e$ have zero correlation, the result holds. Q.E.D.
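The Liapunov central limit theorem for independent but non-identically distributed observations, invoked in Theorems 1, 2 and 4, can be illustrated with a small Monte Carlo. Everything below (the weights, the variance function, and the non-normal error law) is a hypothetical example rather than the thesis's design; it only shows that the standardized weighted sum is approximately N(0,1) when the weights are bounded and the variances are bounded and non-constant:

```python
# Monte Carlo sketch (illustrative, not from the thesis) of the Liapunov CLT
# for independent, heteroskedastic, non-normal errors.
import numpy as np

rng = np.random.default_rng(1)
N, reps = 2000, 4000

x = rng.uniform(-1, 1, N)
c = 1.0 + 0.5 * x                    # bounded weights c_t
sigma = 0.5 + np.abs(x)              # heteroskedastic standard deviations

s2 = np.sum((c * sigma) ** 2) / N    # (1/N) c' Sigma c, the limiting variance

def stat():
    # centered exponential errors: skewed, E(u_t) = 0, V(u_t) = sigma_t^2
    u = sigma * (rng.exponential(size=N) - 1.0)
    return np.sum(c * u) / np.sqrt(N * s2)

draws = np.array([stat() for _ in range(reps)])
print(round(draws.mean(), 2), round(draws.var(), 2))  # close to 0 and 1
```

The $2+\delta$ moment condition of Assumption 3.1 holds here because the exponential distribution has all moments, which is what licenses the Liapunov (rather than Lindeberg-Levy) version of the CLT.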
Lemma A2.1: Let $\zeta_1$ and $\zeta_2$ be vectors of dimension $N$ and let $P_x$ be an $N\times N$ projection matrix formed by $R$ functions of $x$, and suppose that the observations are independent and
$$ E(\zeta_{it}|x_t)=0, \qquad 0<\sup_t E(\zeta_{it}^2|x_t)<\infty; $$
then
$$ E\Big|\frac{1}{N}\zeta_1'P_x\zeta_2\Big|\ \le\ C\,\frac{R}{N} $$
for some finite $C$.

Proof:
$$ E\Big|\frac{1}{N}\zeta_1'P_x\zeta_2\Big|\ \le\ \frac{1}{N}\big(E\,\mathrm{tr}\,P_xE(\zeta_1\zeta_1'|x)\big)^{1/2}\big(E\,\mathrm{tr}\,P_xE(\zeta_2\zeta_2'|x)\big)^{1/2}\ \le\ C\,\frac{1}{N}\,E(\mathrm{tr}\,P_x)\ \le\ C\,\frac{R}{N}, $$
since the variances are bounded by some finite constant $C$ and since $\mathrm{tr}\,P_x=R$ because it is a projection matrix. Q.E.D.

Proof of Theorem 5:

It is easy to show that under the conditions of this result the conclusions of Theorem 4 hold. In fact, instead of having the MSE of the various functions being $o_p(N^{-1/2})$, it is instead $o_p(N^{-3/4})$, although this fact is only used to prove part of this result; the convergence at rate $o_p(N^{-1/2})$ suffices for most of the proof. Clearly, we have
$$ \bar A_1^N=A_1^N+o_p(1) $$
as in the proof of Theorem 3, so it remains to consider the remaining two matrices. This will be done by showing convergence element by element. In the case of $\hat S_1$ the typical element will be written in shorthand notation as
$$ \frac{1}{N}\sum_{t=1}^N\hat a_{it}\hat a_{jt}\hat\xi_t^2, $$
where $\hat a_{it}=\hat z_{it}-P_{x_3,t}\hat z_i$, with $P_{x_3,t}$ denoting the $t$-th row of $P_{x_3}$. To show that
$$ \hat S_1=S_1+o_p(1), \qquad\text{(B.123)} $$
we show that for all $i$ and $j$
$$ \frac{1}{N}\sum_{t=1}^N\hat a_{it}\hat a_{jt}\hat\xi_t^2-\frac{1}{N}\sum_{t=1}^Na_{it}a_{jt}\xi_t^2=o_p(1). $$
After some rearranging, it is easy to show that it will be sufficient to show the following results:
$$ \frac{1}{N}\sum_t\hat a_{it}\hat a_{jt}(\hat\xi_t^2-\xi_t^2)=o_p(1), \qquad\text{(B.124)} $$
$$ \frac{1}{N}\sum_t\hat a_{it}(\hat a_{jt}-a_{jt})\xi_t^2=o_p(1), \qquad\text{(B.125)} $$
$$ \frac{1}{N}\sum_ta_{jt}(\hat a_{it}-a_{it})\xi_t^2=o_p(1), \qquad\text{(B.126)} $$
$$ \frac{1}{N}\sum_t(\hat a_{it}-a_{it})(\hat a_{jt}-a_{jt})\xi_t^2=o_p(1). \qquad\text{(B.127)} $$
The following inequalities will prove useful in verifying these four results. For all $i$ we have
$$ \max_t|\hat a_{it}|\ \le\ \Delta+\max_t|\psi_t'(\Psi'\Psi)^{-1}\Psi'\hat z_i|. $$
Now we have
$$ \max_t|\psi_t'(\Psi'\Psi)^{-1}\Psi'\hat z_i|\ \le\ \sup|\psi(x_3)'\theta_{K_3}|+\max_t(\psi_t'(\Psi'\Psi)^{-1}\psi_t)^{1/2}\big((\hat z_i-z_i)'(\hat z_i-z_i)\big)^{1/2} $$
$$ +\max_t(\psi_t'(\Psi'\Psi)^{-1}\psi_t)^{1/2}\big(u_i'P_{x_3}u_i\big)^{1/2}+\max_t(\psi_t'(\Psi'\Psi)^{-1}\psi_t)^{1/2}\sqrt N\,\|\tau_{K_3}^r\|_{0,\infty,X} $$
$$ \le\ \Delta+o_p(N^{1/4})+o_p(N^{1/8})+o_p(1), $$
using Assumption 6.2 and, respectively, (ii) and Lemma 1; Assumption 2.3 and MSE convergence for $\hat\lambda_t$ of $o_p(N^{-1/2})$; Lemma A2.1 and (ii); and finally (ii), (iii) and Lemma 1(iii). Thus we have that
$$ \max_t|\hat a_{it}|\ \le\ \Delta+o_p(N^{1/4}) \qquad\text{(B.128)} $$
for some finite $\Delta$. Next, consider the MSE convergence of the estimated $\hat g$ for the observations with $y_1=1$. We have that
$$ \Big(\frac{1}{N_1}\|\hat g_1-g_1\|^2\Big)^{1/2}\ \le\ \frac{1}{\sqrt{N_1}}\Big(\|P_{x_3}\hat Z(\hat\beta_2-\beta_2)\|+\|g_1-P_{x_3}g_1\|+\Big\|P_{x_3}\hat g\,\frac{\hat\lambda-\lambda}{\hat\lambda}\Big\|+\|P_{x_3}\xi\|\Big). $$
The following four results give the desired MSE convergence result for $\hat g_1$:
$$ \frac{1}{N_1}\|P_{x_3}\hat Z(\hat\beta_2-\beta_2)\|^2\le\Delta\,o_p(N^{-1/2}), \qquad \Big(\frac{1}{N_1}\|g_1-P_{x_3}g_1\|^2\Big)^{1/2}\le\|g_{K_3}^r\|_{0,\infty,X}=o_p(N^{-1/4}), $$
$$ \Big(\frac{1}{N_1}\Big\|P_{x_3}\hat g\,\frac{\hat\lambda-\lambda}{\hat\lambda}\Big\|^2\Big)^{1/2}\le\Delta\,o_p(N^{-3/8}), \qquad \Big(\frac{1}{N_1}\|P_{x_3}\xi\|^2\Big)^{1/2}\le\Delta\,o_p(N^{-3/8}), $$
using, respectively, root-$N$ convergence of $\hat\beta_2$; (ii) and Lemma 1; MSE convergence for $\hat\lambda_t$; and (i). Therefore we have that
$$ \Big(\frac{1}{N_1}\|\hat g_1-g_1\|^2\Big)^{1/2}\ \le\ \Delta\,o_p(N^{-1/4}). \qquad\text{(B.129)} $$
Finally, noting that
$$ \hat\xi_t=x_{2t}(\beta_2-\hat\beta_2)+g(x_{3t})(\lambda_t-\hat\lambda_t)+\hat\lambda_t(g(x_{3t})-\hat g(x_{3t}))+\xi_t, $$
it is easy to show that
$$ \frac{1}{N}\sum_t(\hat\xi_t-\xi_t)^2\ \le\ \Delta\,o_p(N^{-1/2}), \qquad\text{(B.130)} $$
using the boundedness of $g$ and $\lambda$ and the aforementioned results regarding $\hat\beta_2$, $\hat\lambda_t$ and $\hat g$. Using this and (B.128), we can then show that (B.124) holds. For (B.125), expand $\hat a_{jt}-a_{jt}$ into terms involving $(\hat\lambda_t-\lambda_t)/\hat\lambda_t\lambda_t$; then by (B.128), the boundedness of the $a_{it}$, the convergence in MSE at $o_p(N^{-1/2})$ for $\hat a_j$ and $\hat\lambda$, and the boundedness of the fourth moments of $\xi_t$, we can show that each of these terms converges to zero by the use of the Cauchy and Markov Inequalities. Note that we have that
$$ \frac{1}{N}\|\hat a_i-a_i\|^2=o_p(N^{-3/4}), \qquad\text{(B.131)} $$
although this is faster than is needed in this part of the proof. Finally, the same sort of arguments verify that (B.126) and (B.127) hold, and so we have verified that (B.123) holds. We next must show that
$$ \hat S_2=S_2+o_p(1), \qquad\text{(B.132)} $$
and this will be done in a similar fashion to the previous result.
First, denote the $i$-th element of $\hat\nu$ by $\hat\nu_i=P_{x_1}(\hat z_i-P_{x_3}\hat z_i)$, and denote the population values for each $t$ by $\nu_{it}$. Then we must show that for all $i$ and $j$
$$ \frac{1}{N}\sum_{t=1}^N\hat\nu_{it}\hat\nu_{jt}\,\hat g(x_{3t})^2\,\hat c_t-\frac{1}{N}\sum_{t=1}^N\nu_{it}\nu_{jt}\,g(x_{3t})^2\,c_t=o_p(1), $$
where
$$ c_t=\lambda_t^{-2}\,d_1(l(x_{1t}))^2\,l_t(1-l_t) $$
and $\hat c_t$ denotes the estimate of this. Since $l_t$, $d_1$ and $\lambda_t$ and the estimates of these are all bounded, and the estimators converge in MSE to the true values at $o_p(N^{-3/4})$, by Lemma 1 the main concern will be with $\hat g$ and the $\hat\nu_{it}$. Since in this case we must estimate $g$ for all the observations, we must verify that a result similar to (B.130) holds; inequalities similar to that in (B.128) will also be required. First, we deal with $\hat g$, which is given by
$$ \hat g=\Psi_{30}(\Psi_3'\Psi_3)^{-1}\Psi_3'(w-\hat Z\hat\beta_2), $$
where $\Psi_{30}$ is the matrix of basis functions evaluated at the $x_{3t}$ for the observations for which $y_1=0$. For any $t$ we can write the estimate as $\hat g(x_{3t})=\psi_t'(\Psi_3'\Psi_3)^{-1}\Psi_3'(w-\hat Z\hat\beta_2)$, so that, as we did above to yield inequality (B.128), we have that
$$ |\hat g(x_{3t})|\ \le\ (\psi_t'(\Psi_3'\Psi_3)^{-1}\psi_t)^{1/2}\big((\hat\beta_2-\beta_2)'\hat Z'\hat Z(\hat\beta_2-\beta_2)\big)^{1/2}+|\psi_t'\theta_{K_3}|+(\psi_t'(\Psi_3'\Psi_3)^{-1}\psi_t)^{1/2}\sqrt N\,\|g_{K_3}^r\|_{0,\infty,X} $$
$$ +(\psi_t'(\Psi_3'\Psi_3)^{-1}\psi_t)^{1/2}\,\Delta\,\|\hat\lambda-\lambda\|+(\psi_t'(\Psi_3'\Psi_3)^{-1}\psi_t)^{1/2}\,\Delta\,\|\xi'P_{x_3}\|\ \le\ \Delta\,o_p(N^{1/8}), $$
using arguments similar to the above, the fact that the convergence of $\hat\lambda$ is $o_p(N^{-3/4})$ (using (i) and (ii)), and Assumption 6.2, which by Andrews (1989) shows that
$$ \max_t(\psi_t'(\Psi_3'\Psi_3)^{-1}\psi_t)^{1/2}=o_p(1). \qquad\text{(B.133)} $$
It is clear that, since (B.129) holds, to show MSE convergence for $\hat g$, which is based on all the observations, we need only consider the convergence for the observations with $y_1=0$.
Therefore, we note that we can write the appropriate expression as
$$ \hat g_0-g_0=\Psi_{30}(\Psi_3'\Psi_3)^{-1}\Psi_3'\hat Z(\beta_2-\hat\beta_2)+\big(\Psi_{30}(\Psi_3'\Psi_3)^{-1}\Psi_3'g-g_0\big)+\Psi_{30}(\Psi_3'\Psi_3)^{-1}\Psi_3'\xi+\Psi_{30}(\Psi_3'\Psi_3)^{-1}\Psi_3'G(\lambda-\hat\lambda), $$
where $G=\mathrm{diag}\{g(x_{3t})\}$. Since we have (B.133), then as before we can show that, given (ii), (v) and the result proved in Lemma 1(iii),
$$ \Big(\frac{1}{N}\|\hat g_0-g_0\|^2\Big)^{1/2}=o_p(N^{-1/4}), $$
and so we have that
$$ \frac{1}{N}\|\hat g-g\|^2=o_p(N^{-1/2}). \qquad\text{(B.134)} $$
Regarding $\hat\nu_{it}$ we have that
$$ \max_t|\hat\nu_{it}|\ \le\ \max_t|\nu_{it}|+\max_t|\psi_t'(\Psi_1'\Psi_1)^{-1}\Psi_1'(\hat a_i-a_i)|\ \le\ \Delta\,o_p(N^{1/8}), \qquad\text{(B.135)} $$
using similar arguments to the above. Also, it is the case that
$$ \frac{1}{N}\|\hat\nu_i-\nu_i\|^2=o_p(N^{-1/2}). \qquad\text{(B.136)} $$
In fact this, like the other rates, is slower than the actual rate that obtains under the assumptions, but it is sufficient for the proof. Now to show (B.132) the following will suffice:
$$ \Big|\frac{1}{N}\sum_t\hat g(x_{3t})^2\,\hat\nu_{it}(\hat\nu_{jt}-\nu_{jt})\Big|+\Big|\frac{1}{N}\sum_t\hat g(x_{3t})^2(\hat\nu_{it}-\nu_{it})(\hat\nu_{jt}-\nu_{jt})\Big| $$
$$ \le\ \Delta\,o_p(N^{1/4})\Big(\frac{1}{N}\|\hat\nu_i-\nu_i\|^2\Big)^{1/2}\Big(\frac{1}{N}\|\hat\nu_j-\nu_j\|^2\Big)^{1/2}+\Delta\,o_p(N^{1/4})\Big(\frac{1}{N}\|\hat\nu_j-\nu_j\|^2\Big)^{1/2}\ \le\ \Delta\,o_p(1), $$
using (B.136), (B.135) and (B.134). Next,
$$ \frac{1}{N}\sum_tg(x_{3t})^2\,\nu_{jt}(\hat\nu_{it}-\nu_{it})\ \le\ o_p(N^{1/4})\,\Delta\Big(\frac{1}{N}\|\hat\nu_i-\nu_i\|^2\Big)^{1/2}=o_p(1) $$
and
$$ \frac{1}{N}\sum_t\hat\nu_{it}\hat\nu_{jt}\big(\hat g(x_{3t})+g(x_{3t})\big)\big(\hat g(x_{3t})-g(x_{3t})\big)\ \le\ \Delta\,o_p(1)\Big(\frac{1}{N}\|\hat g-g\|^2\Big)^{1/2}=o_p(1). $$
It remains then to show that
$$ \frac{1}{N}\sum_t\nu_{it}\nu_{jt}\,g(x_{3t})^2(\hat c_t-c_t)=o_p(1), $$
and this follows from the boundedness of all of the elements and the fact that each converges in MSE to the true value. Therefore, we have that (B.132) holds, and so the result of the Theorem follows. Q.E.D.

Proof of Theorems 6 and 7:

The result follows from Lemma 3.3 of White (1980a). Q.E.D.

Bibliography

[1] Adams, R.A.: Sobolev Spaces, New York: Academic Press (1975).
[2] Amemiya, T.: "Regression Analysis When the Dependent Variable is Truncated Normal", Econometrica, 41 (1973), 997-1016.
[3] Amemiya, T.: Advanced Econometrics, Cambridge, Massachusetts: Harvard University Press (1985).
[4] Andrews, D.W.K.: "Asymptotic Normality for Series Estimators for Various Nonparametric and Semiparametric Models", Cowles Discussion Paper 874, Yale University, (1988).
[5] Andrews, D.W.K.: "Asymptotics for Semiparametric Econometric Models I: Estimation", Cowles Discussion Paper 908, Yale University, (1989a).
[6] Andrews, D.W.K.: "Asymptotics for Semiparametric Econometric Models II: Stochastic Equicontinuity", Cowles Discussion Paper 909, Yale University, (1989b).
[7] Andrews, D.W.K.: "Asymptotics for Semiparametric Econometric Models III: Testing and Examples", Cowles Discussion Paper 910, Yale University, (1989c).
[8] Andrews, D.W.K.: "Asymptotic Normality for Series Estimators for Various Nonparametric and Semiparametric Models", Cowles Discussion Paper 874R, Yale University, (1989d).
[9] Andrews, D.W.K.: "Asymptotic Optimality of Generalized CL, Cross Validation, and GCV for Models with Heteroskedastic Errors", Cowles Discussion Paper 906, Yale University, (1989e).
[10] Andrews, D.W.K. and Yoon-Jae Whang: "Additive Interactive Regression Models: Circumvention of the Curse of Dimensionality", Cowles Discussion Paper 925, Yale University, (1989).
[11] Arabmazar, A. and P. Schmidt: "Further Evidence on the Robustness of the Tobit Estimator to Heteroskedasticity", Journal of Econometrics, 17 (1981), 253-258.
[12] Arabmazar, A. and P. Schmidt: "An Investigation of the Robustness of the Tobit Estimator to Non-Normality", Econometrica, 50 (1982), 1055-1063.
[13] Blundell, R. (ed.): "Specification Testing in Limited and Discrete Dependent Variable Models", Journal of Econometrics, 34 (1987).
[14] Burguette, J.F., A.R. Gallant, and G. Souza: "On Unification of the Asymptotic Theory of Nonlinear Econometric Models", Econometric Reviews, 1 (1982), 151-190.
[15] Cosslett, S.R.: "Distribution Free Maximum Likelihood Estimation of the Binary Choice Model", Econometrica, 51 (1983), 765-782.
[16] Cosslett, S.R.: "Distribution-Free Estimator of Regression Model with Sample Selectivity", University of Florida, unpublished manuscript, (1984).
[17] Cragg, J.G.: "Some Statistical Models for Limited Dependent Variables with Application to the Demand for Durable Goods", Econometrica, 39 (1971), 829-844.
[18] Cragg, J.G.: "Quasi-Aitken Estimation for Heteroskedasticity of Unknown Form", U.B.C. Discussion Paper 88-34, (1988).
[19] Dagenais, M.G.: "The Tobit Model with Serial Correlation", Economics Letters, 10 (1983), 263-267.
[20] Dhrymes, P.J.: "Limited Dependent Variables", in Handbook of Econometrics, Volume 3, edited by Z. Griliches and M.D. Intriligator, New York: North Holland, (1986).
[21] Donald, S.G.: "An Empirical Analysis of the Relationship Between Appliance Choice and Electricity Demand", unpublished B.Ec. thesis, University of Sydney, (1985).
[22] Duncan, G.M. (ed.): "Continuous / Discrete Econometric Models with Unspecified Error Distributions", Journal of Econometrics, 32 (1986a).
[23] Duncan, G.M.: "A Semi-Parametric Censored Regression Estimator", Journal of Econometrics, 32 (1986b), 5-34.
[24] Elbadawi, I.A., A.R. Gallant, and G. Souza: "An Elasticity Can be Estimated Consistently Without A Priori Knowledge of Functional Form", Econometrica, 51 (1983), 1731-1751.
[25] Fernandez, L.: "Nonparametric Maximum Likelihood Estimation of Censored Regression Models", Journal of Econometrics, 32 (1986), 32-57.
[26] Gallant, A.R.: "On the Bias in Flexible Functional Forms and an Essentially Unbiased Form", Journal of Econometrics, 15 (1981), 211-245.
[27] Gallant, A.R. and D. Nychka: "Semi-nonparametric Maximum Likelihood Estimation", Econometrica, 55 (1987), 363-390.
[28] Gallant, A.R.: "Identification and Consistency in Seminonparametric Regression", in Advances in Econometrics, Fifth World Congress of the Econometric Society, Vol. I, edited by T.F. Bewley, Cambridge: Cambridge University Press (1987).
[29] Geman, S. and C. Hwang: "Nonparametric Maximum Likelihood Estimation by the Method of Sieves", Annals of Statistics, 10 (1982), 401-414.
[30] Gourieroux, C., A. Monfort, and A. Trognon: "Estimation and Test in Probit Models with Serial Correlation", in Alternative Approaches to Time Series Analysis, Florens, J.P. and C. Simor (eds.), Brussels: Publications des Facultes Universitaires Saint Louis (1984).
[31] Gronau, R.: "The Effect of Children on the Housewife's Value of Time", Journal of Political Economy, 81 (1973), 1119-1143.
[32] Hausman, J.A.: "Specification Tests in Econometrics", Econometrica, 46 (1978), 1251-1271.
[33] Heckman, J.: "Shadow Prices, Market Wages and Labor Supply", Econometrica, 42 (1974), 679-693.
[34] Heckman, J.: "The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models", Annals of Economic and Social Measurement, 5 (1976), 475-492.
[35] Heckman, J.: "Sample Selection Bias as a Specification Error", Econometrica, 47 (1979), 153-161.
[36] Horowitz, J.L.: "A Distribution Free Least Squares Estimator for Censored Regression Models", Journal of Econometrics, 32 (1986), 59-84.
[37] Horowitz, J.L.: "Semiparametric M-Estimation of Censored Linear Regression Models", in Advances in Econometrics: Robust and Nonparametric Inference, edited by G.F. Rhodes and T.B. Fomby, Greenwich: JAI Press, 7 (1988).
[38] Horowitz, J.L.: "A Smoothed Maximum Score Estimator for the Binary Response Model", Working Paper 89-30, Department of Economics, University of Iowa, (1989).
[39] Hurd, M.: "Estimation in Truncated Samples When There is Heteroskedasticity", Journal of Econometrics, 11 (1979), 247-258.
[40] Ichimura, H.: "Estimation of Single Index Models", unpublished Ph.D. dissertation, Department of Economics, MIT (1988).
[41] Klein, R.W. and R.H. Spady: "An Efficient Semiparametric Estimator for Discrete Choice Models", Economics Research Group, Bell Communications Research, Morristown, New Jersey, (1987).
[42] McFadden, D.: "Conditional Logit Analysis of Qualitative Choice Behaviour", in Frontiers in Econometrics, edited by P. Zarembka, New York: Academic Press, (1974).
[43] McFadden, D.L.: "Econometric Analysis of Qualitative Response Models", in Handbook of Econometrics, Volume 2, edited by Z. Griliches and M.D. Intriligator, New York: North Holland, (1984).
[44] MacKinnon, J.G. and H. White: "Some Heteroskedasticity Consistent Covariance Matrix Estimators with Improved Finite Sample Properties", Journal of Econometrics, 29 (1985), 305-325.
[45] Maddala, G.S.: Limited Dependent and Qualitative Variables in Econometrics, Cambridge, England: Cambridge University Press, (1983).
[46] Maddala, G.S.: "Disequilibrium, Self Selection and Switching Models", in Handbook of Econometrics, Volume 3, edited by Z. Griliches and M.D. Intriligator, New York: North Holland, (1987).
[47] Manski, C.: "Maximum Score Estimation of the Stochastic Model of Discrete Choice", Journal of Econometrics, 3 (1975), 205-228.
[48] Manski, C.: "Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator", Journal of Econometrics, 27 (1985), 313-333.
[49] Matzkin, R.L.: "Nonparametric and Distribution Free Estimation of Binary Choice and Threshold Crossing Models", Cowles Discussion Paper 889, Yale University, (1988).
[50] Mroz, T.A.: "The Sensitivity of an Empirical Model of Married Women's Hours of Work to Economic and Statistical Assumptions", Econometrica, 55 (1987), 765-799.
[51] Newey, W.K.: "Adaptive Estimation of Regression Models via Moment Conditions", Journal of Econometrics, 38 (1988), 301-339.
[52] Newey, W.K.: "Two Step Estimation of Sample Selection Models", unpublished manuscript, Department of Economics, Princeton University (1988).
[53] Newey, W.K., J.L. Powell, and J.R. Walker: "Semiparametric Estimation of Selection Models: Some Empirical Results", American Economic Review, Papers and Proceedings, 80 (1990), 324-328.
[54] Paarsch, H.J.: "Monte-Carlo Comparison of Estimators for the Censored Regression Model", Journal of Econometrics, 24 (1984), 197-213.
[55] Pagan, A.R. and F. Vella: "Diagnostic Tests for Models Based on Individual Data: A Survey", University of Rochester Working Paper 162, (1988).
[56] Phillips, P.C.B.: "Reflections on Econometric Methodology", The Economic Record, (1988).
[57] Poirier, D.J. and P.A. Ruud: "Probit with Dependent Observations", Review of Economic Studies, LV (1988), 593-614.
[58] Portnoy, S.: "Asymptotic Behavior of Likelihood Methods for Exponential Families when the Number of Parameters Tends to Infinity", Annals of Statistics, 16 (1988), 356-366.
[59] Powell, J.L.: "Least Absolute Deviations Estimation for the Censored Regression Model", Journal of Econometrics, 25 (1984), 303-325.
[60] Powell, J.L.: "Symmetrically Trimmed Least Squares Estimation of Tobit Models", Econometrica, 54 (1986), 1435-1460.
[61] Powell, J.L.: "Semiparametric Estimation of Bivariate Latent Variable Models", SSRI Working Paper 8704, University of Wisconsin, Madison, (1987).
[62] Powell, J.L., J.H. Stock and T.M. Stoker: "Semiparametric Estimation of Index Coefficients", Econometrica, 57 (1989), 1403-1430.
[63] Rilstone, P.: "Nonparametric Hypothesis Testing for Realistically Sized Samples", Universite Laval Cahier 8823, (1988).
[64] Robinson, P.M.: "On the Asymptotic Properties of Estimators of Models Containing Limited Dependent Variables", Econometrica, 50 (1982), 27-41.
[65] Robinson, P.M.: "Root-N-Consistent Semiparametric Regression", Econometrica, 56 (1988), 931-954.
[66] Royden, H.L.: Real Analysis, Second Edition, New York: Macmillan, (1987).
[67] Ruud, P.: "Sufficient Conditions for Consistency of the Maximum Likelihood Estimator Despite Misspecification of Distribution", Econometrica, 51 (1983), 225-228.
[68] Theil, H.: "A Multinomial Extension of the Linear Logit Model", International Economic Review, 10 (1969), 251-259.
[69] Tobin, J.: "Estimation of Relationships for Limited Dependent Variables", Econometrica, 26 (1958), 24-36.
[70] Wald, A.: "A Note on the Consistency of the Maximum Likelihood Estimator", Annals of Mathematical Statistics, 20 (1949), 595-601.
[71] White, H.: "Nonlinear Regression on Cross-Section Data", Econometrica, 48 (1980a), 721-746.
[72] White, H.: "A Heteroskedasticity Consistent Covariance Matrix and a Direct Test for Heteroskedasticity", Econometrica, 48 (1980b), 817-838.
[73] White, H.: Asymptotic Theory for Econometricians, Orlando: Academic Press, (1984).
[74] White, H.: "Maximum Likelihood Estimation of Misspecified Models", Econometrica, 50 (1982), 1-25.
[75] White, H. and M. Stinchcombe: "Adaptive Generalized Least Squares with Dependent Observations", University of California, San Diego Discussion Paper 89-45, (1989).

