Generalized Method of Moments
Theoretical, Econometric and Simulation Studies

by

Yitian Liang

B.Sc. (Honor), Jinan University, 2004
M.Sc. (Honor), City University of Hong Kong, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
in
The Faculty of Graduate Studies
(Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

© Yitian Liang 2011

Abstract

The GMM estimator is widely used in the econometrics literature. This thesis focuses mainly on three aspects of the GMM technique. First, I derive proofs of the asymptotic properties of the GMM estimator under certain conditions. To the best of my knowledge, the original complete proofs by Hansen (1982) are not easily available. In this thesis, I provide complete proofs of consistency and asymptotic normality of the GMM estimator under somewhat stronger assumptions than those in Hansen (1982). Second, I illustrate the application of the GMM estimator in linear models. Specifically, I emphasize the economic reasoning underlying the linear statistical models in which the GMM estimator (also referred to as the Instrumental Variable estimator) is widely used. Third, I perform several simulation studies to investigate the performance of the GMM estimator in different situations.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction of GMM
    1.1 Method of Moment
    1.2 Maximum Likelihood Estimator
    1.3 An Example of Moment Conditions
    1.4 Definition of GMM Estimator
2 Theoretical Development of GMM
    2.1 Consistency
    2.2 Asymptotic Normality
    2.3 Efficiency
3 Application of GMM in Linear Models
    3.1 Framework and Motivations
    3.2 GMM Estimation
    3.3 Example
4 Simulation Studies
    4.1 Study 1 - Robustness to Mixed Populations
    4.2 Study 2 - Robustness to Outliers
    4.3 Study 3 - Weak Instruments
    4.4 Study 4 - Mixed Strong and Weak Instruments
5 Summary
Bibliography

List of Tables

4.1 Distributional parameter values
4.2 Parameter values for study 1
4.3 Results of study 1
4.4 Parameter values of study 2
4.5 Results of study 2
4.6 Parameter values of study 3
4.7 Results of Study 3
4.8 Parameter values of study 4
4.9 Results of $\hat{\beta}_1$ for study 4
4.10 Results of $\hat{\beta}_2$ for study 4

List of Figures

4.1 Graphical comparison for study 1
4.2 Graphical comparison for study 2
4.3 Comparison of OLS and GMM estimators
4.4 Comparison of GMM estimators under different correlations for varied sample size
4.5 Comparison of GMM estimators under different correlations for a given sample size
4.6 Comparisons of GMM estimator with and without a weak quadratic instrument

Acknowledgments

I would like to express my sincere gratitude to my supervisor, Professor Jiahua Chen, and my thesis second reader, Professor Paul Gustafson, whose support and suggestions have been an immeasurable help throughout my master's study. I would also like to thank Professor John Petkau for his support throughout my master's program and Eugenia Yu for her support of my TA work. I am also grateful to my fellow graduate students Corrine and Eric, who introduced me to real Canadian culture and helped to reshape my concept of value. I would also like to express my gratitude to Mike Danilov, who helped me a lot during the first year and whom I always enjoy talking to. I would also like to thank all graduate students. Last but not least, thanks to all the secretarial staff, who made my life much easier.

To my parents, aunts and Ling Ip.

Chapter 1
Introduction of GMM

In statistics, we often wish to learn about some aspects of the world from a data set. For example, marketing researchers are often interested in analyzing what factors affect a consumer's decision to buy one brand rather than another within a product category. Typical consumer scanner data from stores provide a basis for such studies.
In many cases, we assume an underlying model which potentially generates the observed data. The model could come from our prior experience or from some theoretical arguments. Following the previous example, marketing researchers usually model consumers' decisions based on the theory of utility maximization. Specifically, they assume consumers make their decisions by maximizing some underlying utility function which depends on unknown parameters (usually called preference parameters or structural parameters in the econometrics literature). Hence, the observed decisions are ultimately functions of observed covariates and a set of unknown parameters of interest.

One main goal of statistical inference is to estimate the unknown parameters by effectively using the observed data. This domain of statistical inference is usually called estimation. In addition, we often wish to use the data to test some of our beliefs, or some implications drawn from theoretical models. In other words, we hope to see whether the observed data provide evidence against some statements. This domain of statistical inference is usually called hypothesis testing. Both domains acknowledge the inherent randomness embedded in the observed data.

Among the many statistical techniques, the Method of Moment (MM) and Maximum Likelihood Estimation (MLE) are two popular candidates used in statistical inference. Both methods represent our knowledge or assumptions about the mechanism that generates the observed data. For instance, the MM method reflects our knowledge or assumptions about the moment conditions satisfied by the mechanism that generates the data. On the other hand, the use of the MLE method requires more knowledge about the mechanism, namely the joint distribution of the observed data. The Generalized Method of Moment, from some point of view, lies between the MM and MLE methods.
It generally requires more information about the data than MM does, yet leaves the complete joint distribution of the data unspecified.

As its name suggests, the cornerstone of GMM is a set of population moment conditions. These conditions could come from assumptions made by researchers, or from implications drawn from some theoretical models. From this point of view, there is clearly a strong connection between GMM and the Method of Moment (MM). Section 1.1 briefly reviews the idea of the Method of Moment. It also discusses the directions in which GMM extends the idea of MM. It is also interesting to compare GMM with another popular estimation method, MLE. Section 1.2 briefly reviews the idea of MLE and provides some motivation for the fact that GMM is preferred to MLE in some cases. Since Hansen's 1982 paper, GMM has become a widely used estimation method in econometrics, especially in macroeconomics and finance. Section 1.3 provides an example to illustrate how moment conditions can be used in econometric problems.

1.1 Method of Moment

In statistical analysis, the population moments of a random variable are often functions of the unknown parameters of interest. The idea of MM is simply to equate the population moments to the analogous sample moments and define the estimates of the unknown parameters to be the solutions of the resulting equations. To illustrate the idea, consider a simple example. Suppose we have a random sample of annual incomes in 2010 in a city (e.g. Vancouver). Denote $X_i$ as the annual income of the $i$th individual in the sample. Further assume all individual annual incomes in the sample come from a common population distribution and are independent of each other. Notationally, we have

$$X_i \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2), \quad i = 1, 2, \ldots, n,$$

where $\mu$ and $\sigma^2$ are the population mean and variance, respectively. In this simple example, the parameters of interest are $\mu$ and $\sigma^2$.
Due to the i.i.d. structure, by the classical Law of Large Numbers, when the sample size $n$ is large, we have

$$n^{-1} \sum_{i=1}^{n} x_i \approx E(X_1), \qquad n^{-1} \sum_{i=1}^{n} x_i^2 \approx E(X_1^2),$$

where $x_i$ denotes the observed value of $X_i$. The Method of Moment involves estimating $(\mu, \sigma^2)$ by the values $(\hat{\mu}, \hat{\sigma}^2)$ defined as the solution to the analogous sample moment conditions

$$n^{-1} \sum_{i=1}^{n} x_i - \hat{\mu} = 0, \qquad n^{-1} \sum_{i=1}^{n} x_i^2 - \hat{\mu}^2 - \hat{\sigma}^2 = 0.$$

It follows that

$$\hat{\mu} = n^{-1} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = n^{-1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2. \tag{1.1}$$

This approach is very intuitive but not without its weaknesses. For example, all the higher moments of the normal distribution are also functions of $(\mu, \sigma^2)$. Therefore, this technique could have been applied just as effectively to the third and fourth moments of the distribution. Yet the resulting estimators of $(\mu, \sigma^2)$ would be different from those given by (1.1). This raises the question of which estimators should be used within the MM framework.

There is another potential weakness inherent in the MM framework. Suppose a researcher would like to base the estimation of $(\mu, \sigma^2)$ on the first three moments of $X_1$, that is, the first two moment equations introduced previously plus

$$E(X_1^3) - 3E(X_1^2)\mu + 3E(X_1)\mu^2 - \mu^3 = 0.$$

In this case, the three moment conditions form a system of three equations with only two parameters. The corresponding system of sample moment equations typically has no solution. Therefore, the Method of Moment is infeasible in this case. This is one motivation for the development of GMM, which is able to use the information contained in a large number of moment equations, often larger than the number of parameters.

In addition, historically speaking, the moment conditions in MM usually refer to the expectations of polynomial powers of a random variable that is directly observed. This serves as another motivation for GMM, in which the moment conditions usually refer to the expectations of some general functions of the (observed) random variables.
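The estimates in (1.1) and the over-identification problem described above can be illustrated numerically. The following is a minimal sketch, assuming simulated normal data with hypothetical values $\mu = 50$ and $\sigma = 10$ that are not taken from the thesis; the function names are illustrative only. It computes the MM estimates from the first two sample moments and then evaluates the sample analogue of the third moment equation at those estimates.

```python
import random


def mm_normal(xs):
    """Method-of-moment estimates (mu_hat, sigma2_hat) as in (1.1)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat


def third_moment_residual(xs, mu):
    """Sample analogue of E(X^3) - 3 E(X^2) mu + 3 E(X) mu^2 - mu^3."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x ** 2 for x in xs) / n
    m3 = sum(x ** 3 for x in xs) / n
    return m3 - 3 * m2 * mu + 3 * m1 * mu ** 2 - mu ** 3


random.seed(0)
# Hypothetical population values (an assumption made for illustration).
sample = [random.gauss(50.0, 10.0) for _ in range(5000)]

mu_hat, sigma2_hat = mm_normal(sample)
residual = third_moment_residual(sample, mu_hat)
print(mu_hat, sigma2_hat, residual)
```

The first two sample equations are solved exactly by construction, while the third is generally not zero at the same values in a finite sample. With three equations and only two unknowns, no pair of values typically sets all three sample equations to zero at once.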
1.2 Maximum Likelihood Estimator

Although GMM is widely used in many econometric studies, MLE remains the main statistical tool in some areas. I will first present an example of demand analysis, which is a classical topic in marketing. Some discussion of the potential weaknesses of MLE is presented next.

1.2.1 Demand Analysis

In marketing, the market share of a brand within a product category is of great interest to managers and researchers. They are usually interested in the following questions: how does my brand's market share change if we raise our price by one percent? How does my brand's market share change if one of our competitors decreases its price by one percent? How do my brand's sales change if there is a promotion? Many questions of this type can be summarized as follows: how does our marketing strategy affect profit? To answer these questions, it is crucial to understand how consumers' purchase decisions are made. In other words, we hope to understand which variables affect consumers' demand and how they do so. The random utility model is usually employed for such analysis.

To illustrate the model, denote $J$ as the number of brands in a product category. Suppose a marketing researcher collects a random sample of $N$ consumers at a specific time. Denote $W_i$ as the information collected from the $i$th consumer, $Y_i$ as a $J$-dimensional vector with the $j$th component being one if the $i$th consumer purchased the $j$th brand and zero otherwise, $p$ as a $J$-dimensional vector containing the prices of all brands, and $X_i$ as a $Q$-dimensional vector containing the demographic information of the $i$th consumer. Notationally, we have $\{W_i = (Y_i, p, X_i),\ i = 1, 2, \ldots, N\}$, where the $\{W_i\}$ are i.i.d. random vectors. Further suppose a consumer only purchases one brand at a time.
The random utility model postulates that, for the $i$th consumer, his/her utility from consuming the $j$th brand ($u_{ij}$) can be expressed as follows¹

$$u_{ij} = \alpha_j + \beta \cdot p_j + \delta_j' X_i + \varepsilon_{ij}, \quad j = 1, 2, \ldots, J,$$

where $\alpha_j$ is the brand-specific parameter, $\beta$ is usually called the price sensitivity, $\delta_j$ is a vector containing the parameters of the demographic variables for the $j$th brand, and $\varepsilon_{ij}$ contains all other factors that determine the utility. In the econometrics literature, $\varepsilon_{ij}$ is often referred to as the "demand shock", which is assumed to be known by the consumer but unobserved by the researcher. Hence, from the perspective of the researcher, $\varepsilon_{ij}$ is assumed to be random, which is the reason for the name "random utility model".

¹ The detailed specification varies from study to study. The specification presented here is just for illustration.

It is assumed that a consumer chooses to purchase brand $j$ if it gives him/her the highest utility. Hence, the probability (from the perspective of the researcher) of observing that the $i$th consumer purchased the $j$th brand is

$$P_{ij} = P\left(u_{ij} \ge u_{ik},\ \text{for any } k \ne j\right).$$

If $\{\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iJ}\}$ are i.i.d. random variables across brands following the Type I Extreme Value distribution, the above probability has a closed-form expression

$$P_{ij} = \frac{\exp\left(\alpha_j + \beta \cdot p_j + \delta_j' X_i\right)}{\sum_k \exp\left(\alpha_k + \beta \cdot p_k + \delta_k' X_i\right)}. \tag{1.2}$$

Further assume the demand shocks are independent across consumers. Hence, the probability of observing the data set is

$$L = \prod_i \prod_j P_{ij}^{\,y_{ij}}.$$

When $L$ is viewed as a function of the unknown parameters, we call it the likelihood function. MLE is a statistical method in which we estimate $(\alpha_j, \beta, \delta_j),\ j = 1, 2, \ldots, J$, by the values $(\hat{\alpha}_j, \hat{\beta}, \hat{\delta}_j),\ j = 1, 2, \ldots, J$, at which the likelihood function is maximized.

There are two points worth noting in this example. First, the randomness in the above model framework is from the perspective of the researcher, not the consumer.
Since the demand shock ($\varepsilon_{ij}$) is assumed to be known by the consumer, there is no uncertainty when the consumer is making a decision. Second, based on the observed data, we can analyze the problem using a standard GLM multinomial logit model and obtain the same likelihood function. Specifically, the multinomial response in this case is $Y_i$ and the covariates are $p$ and $X_i$. If we assume the following link function (choosing the first brand as the baseline)

$$\log \frac{P_{ij}}{P_{i1}} = \alpha_j + \beta \cdot p_j + \delta_j' X_i,$$

the choice probability $P_{ij}$ can also be expressed as (1.2). So, from the perspective of the final likelihood function, the random utility model introduced above is equivalent to the multinomial logit model introduced in many GLM textbooks. However, the construction of the random utility model reflects theoretical arguments from the economics literature and will make a difference when the underlying economic problem becomes more complicated.

1.2.2 Potential Weakness of MLE

In general, the data set is assumed to come from a distribution family which is indexed by a vector of unknown parameters of interest. The idea of MLE is to pick as the estimate the point in the parameter space that maximizes the likelihood function. Under some regularity conditions, the MLE is optimal in the sense that its asymptotic variance attains the Cramér-Rao lower bound. Despite its optimality, MLE has some potential weaknesses compared to GMM.

1. The optimality of MLE stems from its basis in the joint probability distribution of the data. Under certain conditions, we know that MLE is the most efficient (asymptotically) estimator, provided the population distribution is correctly specified. However, in some circumstances, this dependence on the probability distribution can become a weakness. The desirable statistical properties of MLE might not be realized if the distribution is not correctly specified.
Yet economic theory rarely provides complete knowledge of the probability distribution of the data. On the other hand, the GMM method does not require full specification of the distribution of the data. The cornerstone of GMM is a set of moment conditions, which might be deduced from economic theory or from the assumed econometric model. Hence, the GMM method is often regarded as more robust than the MLE method to misspecification of the joint distribution.

2. In many econometric applications, the computation of the MLE can be very cumbersome. Two types of problems tend to occur. In some cases, the econometric model implied by the economic theory reasonably coincides with the specified joint distribution, but the likelihood function is extremely difficult to evaluate numerically with current computer technology. In other cases, the economic theory only implies some aspects of the joint distribution, and the complete specification of the joint distribution involves additional parameters which must also be estimated. Often in these cases, the likelihood function must be maximized subject to a set of nonlinear constraints implied by the economic model, resulting in a heavier computational burden. In contrast, in many econometric settings, the GMM framework provides a computationally convenient method of conducting statistical inference without completely specifying the likelihood function.

Besides these two potential reasons, there are others that motivate the use of GMM instead of MLE in many econometric applications. I will provide an example in the next section.

1.3 An Example of Moment Conditions

The method of Instrumental Variables (IV) is probably the most popular empirical tool in econometrics. In fact, the method of IV is a special case of GMM. In this section, I will present a simple example to illustrate the reasons for the popularity of IV in econometrics.
In addition, I will briefly discuss the difficulty encountered by MLE in this example.

The estimation of the relationship between demand and price is a classical problem in econometrics. In many cases, the problem can be described in one sentence: does a higher price cause lower demand? The keyword of the problem is "cause". The definition of causation varies across subject areas, and even within the same academic area. Due to this ambiguity, I do not aim to give a precise and rigorous definition of causation in this essay. Instead, I will present one loose definition which is accepted by many scholars in the field. In the demand and price example, the causal effect of price in the economics literature is usually (loosely) defined as the number of units by which demand changes in response to a unit change in price, holding all other relevant factors fixed. In the economics literature, such a causal effect is also referred to as "the ceteris paribus² effect of price". With this in mind, the example proceeds as follows.

Suppose a researcher collects a random sample of quantity and price pairs $\{q_i, p_i\}$ ($i = 1, 2, \ldots, N$) across $N$ geographical markets for a specific commodity (e.g. a specific brand of coffee). Further assume demand and price follow a linear relationship

$$q_i = \alpha + \beta \cdot p_i + \varepsilon_i, \tag{1.3}$$

where $\alpha$ is the intercept, $\beta$ is the price sensitivity, and $\varepsilon_i$ contains all other demand-relevant factors in the $i$th market which are not observed by the researcher. Without loss of generality, further assume $E(\varepsilon_i) = 0$. The causal parameter of interest is $\beta$, which captures the effect of price on demand while holding other factors constant.

Intuitively, we expect the sign of the coefficient $\beta$ to be negative. However, the OLS estimate of $\beta$ based on many real data sets is often positive. The underlying reason is that prices are not randomly assigned to different geographical areas.
Instead, firms in different areas set their prices according to the demand-relevant factors. In other words, firms set their prices according to $\varepsilon_i$, which is not observed by the researcher. For example, the wealth level of a city certainly affects demand, and it is also a factor that determines firms' pricing strategies. In fact, we expect that the wealthier a city is, the higher both the price and the demand will be. In such a case, the positive OLS estimate of $\beta$ is largely driven by the positive correlations between price and $\varepsilon_i$, as well as between demand and $\varepsilon_i$. In statistics, we often attribute such a problem to the existence of unobserved confounders. In econometrics, researchers often refer to it as "endogeneity". To be more specific, the endogeneity issue explained above can be mathematically expressed as³

$$E(p_i \varepsilon_i) \ne 0. \tag{1.4}$$

² Ceteris paribus is a Latin phrase, literally translated as "with other things the same" or "all other things being equal or held constant".

³ Note that this condition is equivalent to $\mathrm{corr}(p_i, \varepsilon_i) \ne 0$ under the assumption $E(\varepsilon_i) = 0$.

One way to identify the parameter of interest $\beta$ is to use instrumental variables. Specifically, if we can find an observed variable $z_i$ which is correlated with price but uncorrelated with $\varepsilon_i$, then (1.3) implies the following moment condition

$$\mathrm{Cov}(z_i, q_i) - \beta \cdot \mathrm{Cov}(z_i, p_i) = 0.$$

If we can consistently estimate $\mathrm{Cov}(z_i, q_i)$ and $\mathrm{Cov}(z_i, p_i)$, we can also consistently estimate $\beta$. Suppose we can find $K$ ($K > 2$) such observed variables, denoted by a vector $z_i = (z_{i1}, z_{i2}, \ldots, z_{iK})$. Since $E(z_i \varepsilon_i) = 0$, from (1.3) we have

$$E(z_i q_i) - \alpha E(z_i) - \beta E(z_i p_i) = 0. \tag{1.5}$$

Equation (1.5) implies a system of $K$ ($K > 2$) equations with only two parameters, $\alpha$ and $\beta$. The GMM technique introduced in later chapters can be applied to consistently estimate the parameters.

The economic meaning of an instrumental variable is as follows.
Since $z_i$ is correlated with price but uncorrelated with $\varepsilon_i$, $z_i$ affects demand only through price. For example, suppose the commodity is coffee; then $z_i$ could be the price of the raw material used in producing coffee. In such a case, the price of coffee is correlated with $z_i$, but most economists believe $z_i$ is uncorrelated with any other factors that affect demand⁴. In econometric terminology, $z_i$ is said to satisfy the "exclusion restriction".

⁴ Whether this assumption is realistic or not is beyond the scope of this essay.

The intuition for why the instrumental variable $z$ can help to identify the causal effect of $p$ on $q$ is as follows. Suppose we observe that when $z$ increases by one unit, $q$ and $p$ increase by four and two units, respectively. Then it is as if we hypothetically "observe" that when $p$ increases by one unit, $q$ increases by two units. Since $z$ is uncorrelated with all other factors that affect $q$, all other relevant factors can be treated as fixed in this hypothetical observation. In summary, with the help of the instrumental variable, we can hypothetically observe that when $p$ increases by one unit with all other variables held constant, $q$ increases by two units. The causal effect of $p$ on $q$ is then identified.

To illustrate the difficulty of applying MLE to the above problem, notice that (1.4) implies $E(\varepsilon_i \mid p_i) \ne 0$. In general, due to the endogeneity issue, $E(\varepsilon_i \mid p_i)$ is a highly nonlinear function of $p_i$. However, in order to apply MLE, we need to specify the functional form of this conditional expectation, which is a controversial task. Therefore, MLE is generally not applied to the above problem.

1.4 Definition of GMM Estimator

In this section, I will formally define the GMM estimator. In the next chapter, I will study its asymptotic properties.

Assume the researcher has collected data on realized values of $\{X_n\},\ n = 1, 2, \ldots, N$, where $\{X_n\}$ is a set of $p \times 1$ i.i.d. random vectors.
Denote $S$ as the parameter space, which is a subset of $R^q$. Consider a function $f: R^p \times R^q \to R^r$ for some $r$; we are most interested in the case $r \ge q$. The population model is defined through the following moment condition

$$E\{f(X_1, \beta)\} = 0, \quad \text{for some } \beta \in R^q.$$

Denote $a_N$ as a non-singular $r \times r$ matrix (possibly depending on the data), which satisfies

$$a_N \overset{a.s.}{\to} a_0,$$

where $a_0$ is a constant $r \times r$ non-singular matrix. Denote

$$d(b) = \left[a_N N^{-1} \sum_{i=1}^{N} f(X_i, b)\right]' \left[a_N N^{-1} \sum_{i=1}^{N} f(X_i, b)\right]. \tag{1.6}$$

DEFINITION 1.1: The function $d(b)$ in (1.6) is defined as the GMM objective function.

DEFINITION 1.2: The GMM estimator is defined as $\hat{\beta}_N^{GMM} = \arg\min_{b \in S} d(b)$.

Note that in the above definition, the population moment condition implies a system of $r$ equations containing only $q$ ($r \ge q$) parameters. If we directly form the analogous sample moment equations, a solution generally does not exist. To illustrate the rationale of the GMM estimator, let us consider a simple example. Suppose the population moment conditions are

$$E(X_1) - \beta_1 = 0,$$
$$E(X_1^2) - \beta_2 = 0,$$
$$E(X_1^3) - 3\beta_1\beta_2 - 2\beta_1^3 = 0.$$

Hence we have

$$\left[E(X_1) - \beta_1\right]^2 + \left[E(X_1^2) - \beta_2\right]^2 + \left[E(X_1^3) - 3\beta_1\beta_2 - 2\beta_1^3\right]^2 = 0.$$

In general, there does not exist a pair $(\hat{\beta}_1, \hat{\beta}_2)$ such that

$$d(\hat{\beta}_1, \hat{\beta}_2) = \left[N^{-1} \sum_{i=1}^{N} X_i - \hat{\beta}_1\right]^2 + \left[N^{-1} \sum_{i=1}^{N} X_i^2 - \hat{\beta}_2\right]^2 + \left[N^{-1} \sum_{i=1}^{N} X_i^3 - 3\hat{\beta}_1\hat{\beta}_2 - 2\hat{\beta}_1^3\right]^2 = 0.$$

Instead, the idea of the GMM estimator is to minimize the above quadratic distance:

$$\left(\hat{\beta}_{1,GMM}, \hat{\beta}_{2,GMM}\right) = \arg\min_{(\beta_1, \beta_2)} d(\beta_1, \beta_2),$$

in which the weighting matrix $a_N$ is the $3 \times 3$ identity matrix.

Chapter 2
Theoretical Development of GMM

The Generalized Method of Moment was first proposed by Professor Hansen in Econometrica in 1982. In the original paper, Hansen studies the large sample properties of GMM estimators under a setup in which the observed data are assumed to be a realization of some stochastic process. There are four main results in the paper. First, the author shows the GMM estimator is strongly consistent under some conditions.
Second, the GMM estimator is asymptotically normally distributed under certain conditions. Third, there is a lower bound for the asymptotic variance of GMM estimators. Fourth, the author proposes a statistical test for the validity of some economic modeling specifications based on the GMM framework⁵.

⁵ The material on this issue is beyond the scope of this essay.

However, the author did not provide detailed proofs in the paper. He only presented the main assumptions and outlined the idea of the proof. Despite my best efforts, I could not easily find the promised details. For instance, they are not in the technical report, as one might have expected. In this chapter, I provide a proof of my own. For simplicity, I will work with the situation where the observed data are realizations of some i.i.d. random variables. In fact, Hansen's outline of the proof is very similar to Wald's, Wilks's and Cramér's proofs of consistency and asymptotic normality of the MLE.

The rest of this chapter is organized as follows. Section 2.1 provides the proof of consistency of the GMM estimator in the i.i.d. situation. Section 2.2 derives the proof of asymptotic normality of the GMM estimator in the i.i.d. situation. Section 2.3 introduces the notion of efficient GMM.

2.1 Consistency

In this section, I discuss the proof of consistency of the GMM estimator in the i.i.d. situation. Before going into technical details, it is worthwhile to sketch the basic idea of the proof. As in Wald's proof of consistency of the MLE, the idea can be summarized in one sentence: the GMM estimator always picks the "winner", which coincides in the limit with the true parameter that we wish to estimate. The main assumptions used in the proof are listed below.

ASSUMPTION 2.1.1: The model introduced in Section 1.4 is identifiable. That is, the moment condition $E\{f(X_1, \beta)\} = 0$ is satisfied if and only if $\beta = \beta_0$, where $\beta_0 \in R^q$ is often referred to as the "true" parameter.
Under Assumption 2.1.1, together with the assumption that $a_0$ is non-singular, $a_0 E\{f(X_1, \beta)\} = 0$ holds if and only if $E\{f(X_1, \beta)\} = 0$, which in turn holds if and only if $\beta = \beta_0$. Therefore, as a function of $\beta$, the function

$$D(\beta) = \left(a_0 E\{f(X_1, \beta)\}\right)' \left(a_0 E\{f(X_1, \beta)\}\right)$$

attains its lower bound zero if and only if $\beta = \beta_0$. Since the GMM estimator always picks the "winner", which is the minimizer of the sample analogue of $D(\beta)$, the winner will eventually fall into an arbitrarily small neighborhood of the true parameter in the limit. Denote the sample analogue of $D(\beta)$ as

$$d(\beta) = \left[a_N N^{-1} \sum_{i=1}^{N} f(X_i, \beta)\right]' \left[a_N N^{-1} \sum_{i=1}^{N} f(X_i, \beta)\right].$$

In fact, Hansen (1982) only proves that $d(\beta)$ converges almost surely to $D(\beta)$ in the situation where the data are a realization of some stochastic process. Now we continue to list more assumptions.

ASSUMPTION 2.1.2: The parameter space $S$ is compact.

ASSUMPTION 2.1.3: $f(\cdot, \beta)$ is Borel measurable for each $\beta$ in $S$.

ASSUMPTION 2.1.4: $f(x, \cdot)$ is continuous on $S$ for each $x$ in $R^p$.

ASSUMPTION 2.1.5: $\lim_{\delta \to 0} E\{\varepsilon(X_1, \beta, \delta)\} = 0$ for every $\beta \in S$, where

$$\varepsilon(X_1, \beta, \delta) = \sup\left\{\left|f(X_1, \beta) - f(X_1, \alpha)\right| : \alpha \in S,\ \|\beta - \alpha\| < \delta\right\}.$$

Assumption 2.1.2 implies that if $S$ is covered by a collection of open sets, then it is also covered by a finite sub-collection of them. Assumption 2.1.3 is a technical requirement. Assumption 2.1.4 means that the transformation of the observed data, when viewed as a function of the parameters, is continuous. Assumption 2.1.5 means that if $\alpha$ and $\beta$ are sufficiently close, the value of $E\{f(X_1, \alpha)\}$ can be made arbitrarily close to $E\{f(X_1, \beta)\}$.

Let $B$ be a measurable subset of $S$. Define

$$g(x, B) = \inf_{\beta \in B} \left(a_0 f(x, \beta)\right)' \left(a_0 f(x, \beta)\right),$$

and

$$d(B) = \inf_{b \in B} \left[a_N N^{-1} \sum_{n=1}^{N} f(X_n, b)\right]' \left[a_N N^{-1} \sum_{n=1}^{N} f(X_n, b)\right].$$

For any $b \in S$, by the SLLN, we have

$$N^{-1} \sum_{n=1}^{N} f(X_n, b) \overset{a.s.}{\to} E\{f(X_1, b)\}.$$

According to the GMM definition in Section 1.4, we have $a_N \overset{a.s.}{\to} a_0$.
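Therefore, for every $b \in S$, we have

$$\left[a_N N^{-1} \sum_{n=1}^{N} f(X_n, b)\right]' \left[a_N N^{-1} \sum_{n=1}^{N} f(X_n, b)\right] \overset{a.s.}{\to} \left(a_0 E\{f(X_1, b)\}\right)' \left(a_0 E\{f(X_1, b)\}\right).$$

Since this holds for all $b \in S$, and by the continuity of $f$ in $\beta$ (Assumption 2.1.4), we have

$$d(B) \overset{a.s.}{\to} E\{g(X_1, B)\}.$$

The following lemma plays an important role in the proof of consistency later.

LEMMA 2.1.1: Suppose:
(i) The parameter space $S = \cup_{j=0}^{k} B_j$;
(ii) The true parameter is in $B_0$ and, for all $j > 0$, $E\{g(X_1, B_j)\} > E\{g(X_1, \beta_0)\} = 0$.
Then as $N \to \infty$,

$$\Pr\left(\hat{\beta}_N^{GMM} \in B_0\right) \to 1.$$

Proof: Denote $\{\hat{\beta}_N^{GMM} \in B_j,\ \text{i.o.}\}$ as $\cap_{N=1}^{\infty} \cup_{n=N}^{\infty} \{\hat{\beta}_n^{GMM} \in B_j\}$. We need only show

$$\Pr\left(\hat{\beta}_N^{GMM} \in B_j,\ \text{i.o.}\right) = 0$$

for each $j = 1, 2, \ldots, k$. Without loss of generality, assume $k = 1$ and hence $j = 1$. We have

$$\Pr\left(\hat{\beta}_N^{GMM} \in B_1,\ \text{i.o.}\right) \le \Pr\left(d(B_1) \le d(B_0),\ \text{i.o.}\right).$$

Since

$$d(B_1) \overset{a.s.}{\to} E\{g(X_1, B_1)\}, \qquad d(B_0) \overset{a.s.}{\to} E\{g(X_1, B_0)\}, \qquad E\{g(X_1, B_1)\} > E\{g(X_1, B_0)\},$$

the right-hand side is zero, so

$$\Pr\left(\hat{\beta}_N^{GMM} \in B_1,\ \text{i.o.}\right) = 0.$$

This is equivalent to

$$\Pr\left(\hat{\beta}_N^{GMM} \in B_0,\ \text{i.o.}\right) = 1.$$

Therefore

$$\Pr\left(\hat{\beta}_N^{GMM} \in B_0\right) \to 1.$$

Q.E.D.

THEOREM 2.1: Suppose Assumptions 2.1.1-2.1.5 are satisfied. Then the GMM estimator defined in Section 1.4 converges almost surely to $\beta_0$.

Proof: For a sufficiently large $r$, denote $B_1 = \{\beta : \|\beta - \beta_0\| > r\}$. We have

$$E\{g(X_1, B_1)\} > E\{g(X_1, \beta_0)\} = 0.$$

Hence, by Lemma 2.1.1,

$$\Pr\left(\hat{\beta}_N^{GMM} \in B_1,\ \text{i.o.}\right) = 0.$$

For an arbitrary $\varepsilon > 0$, let $B_2 = \{\beta : \varepsilon \le \|\beta - \beta_0\| \le r\}$. For every $\beta \in B_2$, by Assumption 2.1.1 (identification), Assumption 2.1.4 (continuity) and Assumption 2.1.5, we can find a small enough $\delta_\beta$ such that

$$E\left\{\left(a_0 f(X_1, \tilde{\beta})\right)' \left(a_0 f(X_1, \tilde{\beta})\right)\right\} > E\left\{\left(a_0 f(X_1, \beta_0)\right)' \left(a_0 f(X_1, \beta_0)\right)\right\} = 0$$

for all $\tilde{\beta}$ with $\|\tilde{\beta} - \beta\| \le \delta_\beta$. Denote $A_\beta = \{\tilde{\beta} : \|\tilde{\beta} - \beta\| \le \delta_\beta\}$. By Assumption 2.1.2 (compactness), $B_2$ is covered by finitely many such $A_\beta$, say $A_j,\ j = 1, 2, \ldots, m$. Then by Lemma 2.1.1,

$$\Pr\left(\hat{\beta}_N^{GMM} \in A_j,\ \text{i.o.}\right) = 0$$

for $j = 1, 2, \ldots, m$. So we have shown that, for every $\varepsilon > 0$,

$$\Pr\left(\|\hat{\beta}_N^{GMM} - \beta_0\| > \varepsilon,\ \text{i.o.}\right) = 0.$$

This implies $\hat{\beta}_N^{GMM} \to \beta_0$ almost surely.

Q.E.D.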
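The consistency result of Theorem 2.1 can be illustrated by a small simulation. The following is a sketch under assumptions that are not part of the thesis: the data are i.i.d. exponential with a hypothetical mean $\theta_0 = 2$, the moment function is $f(x, b) = (x - b,\ x^2 - 2b^2)'$ (so $r = 2 > q = 1$), the weighting matrix $a_N$ is the identity, and the compact parameter space is $S = [0.01, 10]$. The objective is minimized with a simple golden-section search, chosen only to keep the example free of external dependencies; all function names are illustrative.

```python
import math
import random


def gmm_objective(b, m1, m2):
    """d(b) with identity weighting: squared norm of the sample moments."""
    g1 = m1 - b            # sample analogue of E(X) - b
    g2 = m2 - 2.0 * b * b  # sample analogue of E(X^2) - 2 b^2
    return g1 * g1 + g2 * g2


def golden_section_min(func, lo, hi, tol=1e-10):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - ratio * (hi - lo)
    d = lo + ratio * (hi - lo)
    while hi - lo > tol:
        if func(c) < func(d):
            hi = d
        else:
            lo = c
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
    return (lo + hi) / 2.0


def gmm_exponential(xs, lo=0.01, hi=10.0):
    """GMM estimate of the exponential mean over S = [lo, hi]."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    return golden_section_min(lambda b: gmm_objective(b, m1, m2), lo, hi)


random.seed(1)
theta0 = 2.0  # hypothetical true mean (an assumption for illustration)
estimates = {}
for n in (100, 10_000, 100_000):
    data = [random.expovariate(1.0 / theta0) for _ in range(n)]
    estimates[n] = gmm_exponential(data)
    print(n, estimates[n])
```

As $N$ grows, the printed estimates should settle near $\theta_0 = 2$; the exact values depend on the random seed. The identity weighting is used only for simplicity and is generally not the efficient choice.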
2.2 Asymptotic Normality

In this section, I derive the asymptotic distribution of the properly scaled GMM estimator under the i.i.d. situation. Similarly to the proof of asymptotic normality of the MLE, the original proof in the paper re-defines the GMM estimator to be a sequence of solutions to the first order condition of the GMM objective function. In my proof here, however, I simply assume the GMM estimator defined in Section 1.4 satisfies the first order condition of the GMM objective function, which means

$$\left(N^{-1}\sum_{n=1}^{N} \left.\frac{\partial f(X_n,b)}{\partial b}\right|_{b=\hat\beta_N^{GMM}}\right)' a_N' a_N\, N^{-1}\sum_{n=1}^{N} f\left(X_n,\hat\beta_N^{GMM}\right) = 0. \tag{2.1}$$

Recall that the GMM estimator is defined as the point that minimizes the GMM objective function. Equation (2.1) indicates that the GMM estimator studied in this section satisfies the necessary condition for a minimum point. In other words, the GMM estimator is further assumed to be the solution that equates the first derivative of the GMM objective function to zero. Before going to the proof, the important assumptions are listed below.

ASSUMPTION 2.2.1: $S$ is an open subset of $R^q$ that contains $\beta_0$, which is an interior point of $S$.

ASSUMPTION 2.2.2: $\partial f(X_n,b)/\partial b$ $(n = 1, 2, \ldots, N)$ is Borel measurable for each $b \in S$.

ASSUMPTION 2.2.3: $N^{-1}\sum_{n=1}^{N} \partial f(X_n,\beta)/\partial b \xrightarrow{p} E\{\partial f(X_1,\beta_0)/\partial b\}$ uniformly in a small neighborhood of $\beta_0$. $E\{\partial f(X_1,\beta_0)/\partial b\}$ is finite, and has full rank.

ASSUMPTION 2.2.4: $E\{f(X_1,\beta_0) f(X_1,\beta_0)'\}$ exists, is finite, and has full rank.

ASSUMPTION 2.2.5: The GMM estimator defined in Section 1.4, which is also assumed to satisfy the first order condition of the GMM objective function as stated in (2.1), is strongly consistent.

Assumption 2.2.1 rules out the possibility of a boundary solution, which simplifies the proof. Assumption 2.2.2 is a technical requirement. In the original paper, Assumption 2.2.3 is obtained through a lemma under some conditions. Here, I simply state it as an assumption.
Since the proof of consistency of the GMM estimator in the previous section is based on the definition in Section 1.4, Assumption 2.2.5 indicates that the consistency result carries over to the case where (2.1) is also satisfied.

THEOREM 2.2: Suppose Assumptions 2.2.1-2.2.5 are satisfied. Then the GMM estimator defined in Section 1.4, which is also assumed to satisfy the first order condition of the GMM objective function as stated in (2.1), converges in distribution to a normally distributed random vector after proper scaling. Specifically, denote

$$V = \left(Q' a_0' a_0 Q\right)^{-1} Q' a_0' a_0\, \Omega\, a_0' a_0 Q \left(Q' a_0' a_0 Q\right)^{-1},$$

$$Q = E\left\{\frac{\partial f(X_1,\beta_0)}{\partial b}\right\}, \qquad \Omega = E\left\{f(X_1,\beta_0) f(X_1,\beta_0)'\right\},$$

then

$$\sqrt{N}\left(\hat\beta_N^{GMM} - \beta_0\right) \xrightarrow{d} N(0, V).$$

Proof: By Assumption 2.2.5, for sufficiently large $N$, we can expand $f(X_n,\hat\beta_N^{GMM})$ around $f(X_n,\beta_0)$. We obtain

$$f\left(X_n,\hat\beta_N^{GMM}\right) = f(X_n,\beta_0) + \frac{\partial f(X_n,\beta^*)}{\partial b}\left(\hat\beta_N^{GMM} - \beta_0\right),$$

where $\beta^*$ is between $\hat\beta_N^{GMM}$ and $\beta_0$. Substituting this expansion into the first order condition (2.1) yields

$$0 = \left(N^{-1}\sum_{n=1}^{N}\frac{\partial f(X_n,\hat\beta_N^{GMM})}{\partial b}\right)' a_N' a_N\, N^{-1}\sum_{n=1}^{N} f(X_n,\beta_0) + \left(N^{-1}\sum_{n=1}^{N}\frac{\partial f(X_n,\hat\beta_N^{GMM})}{\partial b}\right)' a_N' a_N \left(N^{-1}\sum_{n=1}^{N}\frac{\partial f(X_n,\beta^*)}{\partial b}\right)\left(\hat\beta_N^{GMM} - \beta_0\right),$$

from which we obtain

$$\sqrt{N}\left(\hat\beta_N^{GMM} - \beta_0\right) = -\left[\left(N^{-1}\sum_{n=1}^{N}\frac{\partial f(X_n,\hat\beta_N^{GMM})}{\partial b}\right)' a_N' a_N\, N^{-1}\sum_{n=1}^{N}\frac{\partial f(X_n,\beta^*)}{\partial b}\right]^{-1} \times \left(N^{-1}\sum_{n=1}^{N}\frac{\partial f(X_n,\hat\beta_N^{GMM})}{\partial b}\right)' a_N' a_N\, \frac{1}{\sqrt{N}}\sum_{n=1}^{N} f(X_n,\beta_0).$$

Since $E\{f(X_n,\beta_0)\} = 0$ and $\{X_n\}$ is i.i.d., together with Assumption 2.2.4, applying the CLT gives

$$\frac{1}{\sqrt{N}}\sum_{n=1}^{N} f(X_n,\beta_0) \xrightarrow{d} N(0,\Omega).$$

Under Assumption 2.2.3, when $N$ is large enough, the consistency of $\hat\beta_N^{GMM}$ gives

$$N^{-1}\sum_{n=1}^{N}\frac{\partial f(X_n,\hat\beta_N^{GMM})}{\partial b} \xrightarrow{p} Q.$$

Since $\hat\beta_N^{GMM}$ is consistent, $\beta^* \xrightarrow{p} \beta_0$ as well. Hence, we have

$$N^{-1}\sum_{n=1}^{N}\frac{\partial f(X_n,\beta^*)}{\partial b} \xrightarrow{p} Q.$$
Since $a_N \xrightarrow{p} a_0$ by definition, and since $Q' a_0' a_0 Q$ is invertible because $Q$ and $a_0$ are of full rank, we have

$$\left[\left(N^{-1}\sum_{n=1}^{N}\frac{\partial f(X_n,\hat\beta_N^{GMM})}{\partial b}\right)' a_N' a_N\, N^{-1}\sum_{n=1}^{N}\frac{\partial f(X_n,\beta^*)}{\partial b}\right]^{-1} \xrightarrow{p} \left(Q' a_0' a_0 Q\right)^{-1},$$

$$\left(N^{-1}\sum_{n=1}^{N}\frac{\partial f(X_n,\hat\beta_N^{GMM})}{\partial b}\right)' a_N' a_N \xrightarrow{p} Q' a_0' a_0.$$

Then, by Slutsky's Theorem, we have

$$\sqrt{N}\left(\hat\beta_N^{GMM} - \beta_0\right) \xrightarrow{d} N(0, V).$$

Q.E.D.

2.3 Efficiency

From the previous sections, we can see that the asymptotic variance of the properly scaled GMM estimator depends on the weight matrix $a_N$. The following theorem gives a lower bound and identifies a situation where it is attained. I will assume $\Omega$ (defined in Theorem 2.2) is positive definite.

THEOREM 2.3: The lower bound for the asymptotic variance of the class of GMM estimators indexed by $a_N$ is given by $\left(Q'\Omega^{-1}Q\right)^{-1}$. The lower bound is achieved if $a_N' a_N \xrightarrow{p} \Omega^{-1}$.

Proof: The first part amounts to saying that

$$\left(Q'\Omega^{-1}Q\right)^{-1} - \left(Q' a_0' a_0 Q\right)^{-1} Q' a_0' a_0\, \Omega\, a_0' a_0 Q \left(Q' a_0' a_0 Q\right)^{-1}$$

is negative semi-definite for any $a_0$ that has full rank. This claim is equivalent to

$$Q'\Omega^{-1}Q - Q' a_0' a_0 Q \left(Q' a_0' a_0\, \Omega\, a_0' a_0 Q\right)^{-1} Q' a_0' a_0 Q \ge 0. \tag{2.2}$$

Since $\Omega$ is positive definite, we can write $\Omega^{-1} = C'C$ for some invertible $C$, so that $\Omega = C^{-1}(C')^{-1}$. Hence, we can re-write the left-hand side of (2.2) as

$$Q'C'CQ - Q' a_0' a_0 Q \left(Q' a_0' a_0 C^{-1}(C')^{-1} a_0' a_0 Q\right)^{-1} Q' a_0' a_0 Q = Q'C'\left[I - (C')^{-1} a_0' a_0 Q \left(Q' a_0' a_0 C^{-1}(C')^{-1} a_0' a_0 Q\right)^{-1} Q' a_0' a_0 C^{-1}\right] CQ. \tag{2.3}$$

Define

$$H = (C')^{-1} a_0' a_0 Q.$$

Hence, (2.3) can be re-written as

$$Q'C'\left(I - H(H'H)^{-1}H'\right)CQ.$$

The above matrix is positive semi-definite if $I - H(H'H)^{-1}H'$ is positive semi-definite. $I - H(H'H)^{-1}H'$ is symmetric and idempotent and, consequently, positive semi-definite. Thus, the first part of the theorem is proved. If $a_N' a_N \xrightarrow{p} a_0' a_0 = \Omega^{-1}$, then the asymptotic variance becomes

$$\left(Q'\Omega^{-1}Q\right)^{-1} Q'\Omega^{-1}\Omega\,\Omega^{-1}Q \left(Q'\Omega^{-1}Q\right)^{-1} = \left(Q'\Omega^{-1}Q\right)^{-1}.$$

Hence, the second part of the theorem is proved. Q.E.D.

The intuition of this theorem is as follows. Note that

$$\Omega = E\left\{f(X_1,\beta_0) f(X_1,\beta_0)'\right\},$$

which represents the variation of each moment condition used in estimation.
Hence, the efficient GMM estimator is attained when each moment condition in the objective function is weighted by the inverse of its variance. Consequently, the more information embedded in a moment condition, the higher the weight assigned to it.

Chapter 3 Application of GMM in Linear Models

In the last chapter, I established certain asymptotic properties of the GMM estimator based on i.i.d. observations. In this chapter, I focus on the application of the GMM estimator to linear models. The linear regression model is probably the most widely used tool in many empirical fields. In addition, the GMM estimation technique in linear regression models is probably the most widely used tool in econometrics. In fact, when researchers apply the GMM estimator in linear regression models, it is often called the Instrumental Variable Estimator (or IV Estimator). The rest of this chapter is organized as follows. Section 3.1 lays out the basic linear model framework and explains why the GMM estimator is widely used in econometrics. Section 3.2 derives the efficient GMM estimator in a linear model. Section 3.3 gives an example of its application.

3.1 Framework and Motivations

Suppose a researcher is interested in analyzing the relationship between a response variable $Y$ and a vector of covariates $X$. He/she has collected a random sample of $N$ i.i.d. observations. More specifically, suppose the data are observed by the researcher but not collected through some experiment. The information of the $i$th observation is denoted by $W_i = (Y_i, X_i')'$ $(i = 1, 2, \ldots, N)$, where $Y_i$ is a random scalar (response) and $X_i = (X_{1i}, X_{2i}, \ldots, X_{ki})'$ is a random $k \times 1$ vector (covariates). In many economic applications, $Y_i$ often represents some decision of an economic agent, while $X_i$ often contains demographic information about the agent and other relevant variables, which may also be decision outcomes of the same agent or even of other economic agents.
For example, $Y_i$ can be the number of stores a brand is running in a geographical area, and $X_i$ contains the unit price set by the brand and its other characteristics. In addition, $X_i$ could also contain the number of stores run by other brands in the same geographical area. Suppose the researcher believes $Y_i$ can be represented by a linear function of $X_i$ as follows

$$Y_i = X_i'\beta + U_i, \tag{3.1}$$

where $\beta$ is a $k \times 1$ constant parameter vector of interest and $U_i$ is a random scalar, containing all other factors that affect $Y_i$ but are not observed by the researcher. Suppose $X_i$ contains an intercept term; then the researcher can make the following assumption without any cost

$$E(U_i) = 0. \tag{3.2}$$

In addition, due to the nature of observational data, the researcher also believes $X_i$ and $U_i$ are correlated. Under (3.2), this implies

$$E(X_i U_i) \ne 0. \tag{3.3}$$

Condition (3.3) indicates that at least one covariate in $X_i$ is correlated with $U_i$. Suppose only $X_{2i}$ is correlated with $U_i$. In the econometrics literature, $X_{2i}$ is called endogenous; since $E(X_{ji} U_i) = 0$ for $j \ne 2$, those covariates are usually called weakly exogenous. In contrast, strong exogeneity requires $E(U_i \mid X_i) = 0$. In summary, (3.1) to (3.3) together characterize a linear model that is widely used in econometrics. The key characteristic of the above model is the endogeneity issue indicated by (3.3). In the following, I will discuss why this modeling framework (especially (3.3)) is particularly important and popular in econometrics and how the condition in (3.3) emerges in applications. I will refer to condition (3.3) as "endogeneity" throughout.

3.1.1 Motivations of the Framework

The motivation of such a model framework can be summarized in one word: causation. Recall that in Section 1.3, I mentioned that the main goal in many classical econometric analyses is causal inference. The key characteristic of causal inference is to control for all other relevant factors.
In the above model framework, from the economic agent's perspective, $Y_i$, $X_i$ and $U_i$ are real-valued variables instead of random variables. Hence, (3.1) implies $Y_i$ is a deterministic function of $X_i$ and $U_i$. Following the example given at the beginning of Section 3.1, from the perspective of the brand, the number of stores is determined once the brand knows its $X_i$ and $U_i$. This economic theoretical argument underneath the above statistical model is largely neglected in the field. To the best of my knowledge, few researchers explicitly state this argument in their work. However, as shown below, it is crucial in understanding the above model framework. For ease of illustration, suppose $X_i$ is not a vector but a scalar. Hence the coefficient $\beta$ can be represented (from the perspective of the agent) as

$$\beta = \frac{\partial Y_i(X_i, U_i)}{\partial X_i}.$$

Loosely speaking, $\beta$ measures the number of units by which $Y_i$ changes in response to a unit change in $X_i$, holding all other relevant factors constant. Therefore, from the agent's perspective, the coefficient $\beta$ is the "causal effect" of $X_i$ on $Y_i$. Note that from the agent's perspective, there is no randomness in the above model. However, from the researcher's perspective, $\{U_i, i = 1, 2, \ldots, N\}$ are nearly always assumed to be random variables coming from a common distribution because they are not observed. Hence, it is fair to say that all randomness arguments in the above model are drawn from the perspective of the researcher, due to his/her inability to observe $U_i$. Yet, how do these arguments lead to the formation of condition (3.3)? I will discuss this issue in the next sub-section. In summary, the above model framework is suitable for causal inference in many economic problems.

Here, I briefly present another popular modeling framework in statistics and econometrics, and compare it to the above framework. As shown later, this framework is more suitable for non-causal inference in many cases.
Suppose a researcher is still interested in studying the relationship between $Y_i$ and $X_i$. But the main focus is no longer the causal effect of $X_i$ on $Y_i$; instead, he/she only focuses on the correlation between $Y_i$ and $X_i$. In other words, the researcher is now interested in the following question: if $X_i$ is observed to increase by one unit, by how many units will $Y_i$ be observed to change on average? In this case, the following model might be useful.

$$Y_i = E(Y_i \mid X_i) + e_i. \tag{3.4}$$

In (3.4), by construction, we have

$$E(e_i \mid X_i) = 0.$$

Intuitively, the relationship between $X_i$ and $Y_i$ is completely captured by the conditional expectation. The "leftover" $e_i$ represents any other random shocks to $Y_i$ that are uncorrelated with $X_i$. Furthermore, let us assume

$$E(Y_i \mid X_i) = X_i'\delta. \tag{3.5}$$

The above model framework, (3.4) to (3.5), is a popular linear regression model in many areas of statistics and some areas of econometrics. The OLS estimator of $\delta$ is consistent. The fundamental difference between the two modeling frameworks lies in the decomposition of the response variable. In the model of (3.4) to (3.5), $Y_i$ is decomposed into the complete relationship with $X_i$ and a pure random shock that is uncorrelated with $X_i$. However, in model (3.1) to (3.3), $Y_i$ is decomposed into the causal relationship with $X_i$ and other relevant factors which might be correlated with $X_i$. Mathematically,

$$E(X_i U_i) \ne 0 \implies E(U_i \mid X_i) \ne 0,$$

therefore

$$E(Y_i \mid X_i) \ne X_i'\beta.$$

Suppose $U_i$ is positively correlated with $X_i$ and $Y_i$. Then, intuitively speaking, the complete effect (represented by $\delta$) of a unit change of $X_i$ on $Y_i$ comprises two components: (1) the causal effect of $X_i$ on $Y_i$, represented by $\beta$; (2) the indirect effect of $X_i$ on $Y_i$ through $U_i$, which is positive by assumption. Hence, in this case we have $\delta > \beta$. Therefore, if we are interested in $\beta$, the estimate obtained from running the OLS regression on the data will overestimate the causal effect.
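The inequality $\delta > \beta$ can be checked numerically. The following is a minimal Python sketch, assuming NumPy and a hypothetical data generating process in which $X_i$ is a scalar, $\beta = 1$, and $X_i = 0.8\,U_i + \text{noise}$, so that $X_i$ and $U_i$ are positively correlated. The helper name `ols_slope` and all numerical values are illustrative assumptions.

```python
import numpy as np


def ols_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


rng = np.random.default_rng(1)   # fixed seed, chosen arbitrarily
n = 100_000
beta = 1.0                       # assumed causal effect of X on Y

u = rng.normal(size=n)                  # unobserved factor U, with E(U) = 0
x = 0.8 * u + rng.normal(size=n)        # X positively correlated with U
y = beta * x + u                        # model (3.1) with a scalar X

delta_hat = ols_slope(x, y)
# Population value under this design:
# delta = beta + Cov(X, U) / Var(X) = 1 + 0.8 / 1.64
print(delta_hat)
```

Under this assumed design the OLS slope estimates $\delta = \beta + \mathrm{Cov}(X_i, U_i)/\mathrm{Var}(X_i)$ rather than $\beta$, so it exceeds the causal effect whenever the covariance is positive.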
In addition, the model framework (3.1) to (3.3) is often backed up by some economic argument made from the perspective of some economic agent.

3.1.2 Potential Reasons for Endogeneity

1. Missing Covariates. In many economic applications, the response variable $Y_i$ (or decision variable) is often determined by some observed variables $X_i$ and other unobserved variables. For example, a salesperson's sales performance is affected by the salary paid to him and by his effort. In general, we cannot observe a salesperson's effort (denoted by $U_i$). In many cases, we cannot even measure the effort. However, it is generally believed that effort is positively correlated with salary (reflected by $E(X_i U_i) \ne 0$). If we want to investigate whether an increase in salary causes higher sales performance, the correlation between salary and effort needs to be considered. Otherwise one may overestimate the causal effect of salary. In an extreme case, if the causal effect of salary on sales performance is nearly zero while effort is highly positively correlated with salary and sales performance, running an OLS regression of sales performance on salary yields a significantly positive salary coefficient. In such a case, the significantly positive salary coefficient is largely driven by the fact that effort is highly positively correlated with salary and sales performance. As a result, one may be misled into the impression that higher salary causes higher sales performance. Hence, based on the misleading OLS regression result, a firm's manager may raise salary in order to increase sales. However, if one can use the IV regression (introduced later) to estimate the true causal effect of salary on sales performance, the manager may apply another, lower-cost policy to stimulate sales.
In fact, if the manager also realizes the driving force for sales performance is effort, he may seek to re-design a contract that pays the same amount of salary as before but is more effective in stimulating an employee's effort, e.g. through an appropriate design of bonus and commission. This research area in economics is often referred to as "contract theory".

2. Simultaneity. This is a piece of jargon heavily used in the economics and econometrics literature. In short, simultaneity means the covariate $X_i$ is chosen optimally by some economic agent (partially) according to $Y_i$. For example, let $Y_i$ represent the salary level of automobile production workers in a city and $X_i$ denote the number of automobile firms in the city. An important economic question is whether higher competition (reflected by a larger number of firms) leads to a higher salary level. Intuitively, and as also suggested by many economic models, higher competition should lead to a higher salary level for the following reason. Employees have larger bargaining power against the firm when competition is more severe because they have more alternatives to choose from. From the perspective of the local government, if this economic intuition is true, it may implement some policy to attract more investors to the city. Suppose a policy maker collects a random sample of $\{Y_i, X_i, i = 1, 2, \ldots, N\}$ from $N$ different cities. If he runs an OLS regression of $Y_i$ on $X_i$, it is very likely that the estimated coefficient of $X_i$ is not significantly positive (and even negative in many cases). Based on such a result, the policy maker may be misled to believe that encouraging more investors is unable to increase the local salary level. However, the policy maker may neglect the fact that firms' choices of whether to enter a city largely depend on the salary level of the city. Intuitively, a firm is more likely to enter a city if the local salary level is low. How does this relate to the unobservable $U_i$?
It is intuitive, and suggested by some economic theories, that the larger the population of a city, the cheaper its labor cost is. Hence, a firm is more likely to enter a city if the local population is larger. Suppose the policy maker does not collect the population data; then the unobserved $U_i$ contains the $i$th city's population. In such a case, $X_i$ and $U_i$ are positively correlated, i.e. $E(X_i U_i) > 0$. However, $Y_i$ and $U_i$ are negatively correlated. Thus, a negative OLS estimated coefficient of $X_i$ on $Y_i$ may be largely driven by the fact that $U_i$ is positively correlated with $X_i$ but negatively correlated with $Y_i$. Based on the above argument, one can treat the simultaneity issue as a special case of missing covariates. For this example, some researchers argue that the negative OLS estimated coefficient of $X_i$ is generated by reverse causality, which states that it is actually a lower salary that causes a higher number of firms in a city. In my opinion, the arguments of reverse causality are the same as those of simultaneity in this case.

3. Measurement Error. The endogeneity issue characterized by (3.3) can also be generated by measurement error. Suppose the researcher is interested in analyzing the causal effect of $X_i$ on $Y_i$ and the true model is

$$Y_i = \alpha + X_i\beta + \varepsilon_i.$$

Further assume $X_i$ is independent of $\varepsilon_i$. However, the researcher cannot observe $X_i$ directly but only a proxy for it,

$$\tilde X_i = X_i + \xi_i.$$

Hence the regression model faced by the researcher is

$$Y_i = \alpha + \tilde X_i\beta + U_i, \quad \text{where } U_i = -\xi_i\beta + \varepsilon_i.$$

Therefore $\tilde X_i$ is endogenous, because it is correlated with $U_i$ through $\xi_i$.

3.1.3 Potential Difficulty of Applying MLE

For the linear regression model characterized by (3.1) to (3.3), the potential difficulty of applying MLE to estimate $\beta$ arises from the specification of the conditional distribution of $U_i$ given $X_i$. For ease of illustration, assume the conditional distribution depends on $X_i$ only through $E(U_i \mid X_i)$. Further assume $X_i$ is a scalar.
If we assume

$$E(U_i \mid X_i) = X_i\gamma,$$

then we have

$$E(Y_i \mid X_i) = X_i(\beta + \gamma).$$

Therefore, the parameters $\beta$ and $\gamma$ cannot be separately identified from the observed data. However, if we assume

$$E(U_i \mid X_i) = X_i^2\tau,$$

it is easy to verify that the parameters $\beta$ and $\tau$ can be separately identified from the data. However, such identification is often criticized because it is due only to a specific functional form assumption. From this simple example, we can see that it is not easy to specify a convincing conditional distribution of $U_i$ given $X_i$ in a real world application.

3.2 GMM Estimation

As discussed above, the information embedded in $\{Y_i, X_i\}$ is usually not enough to consistently estimate the causal parameters. Some additional information may be crucial in identifying the causal parameters. The instrumental variables discussed below play such a role. Suppose $Z_i$ is an $l \times 1$ vector and satisfies

$$E(Z_i U_i) = 0. \tag{3.6}$$

Condition (3.6) indicates the instruments $Z_i$ are weakly exogenous. Now the observed data set is augmented to be $\{Y_i, X_i, Z_i, i = 1, 2, \ldots, N\}$. Based on the moment conditions in (3.6), according to the definition of the GMM estimator in Section 1.4, the population model can be defined through

$$E\left\{Z_i\left(Y_i - X_i'\beta\right)\right\} = 0. \tag{3.7}$$

Note that (3.7) implies a system of $l$ equations with $k$ unknown parameters. The rest of the section discusses the identification issue and introduces the GMM estimator based on (3.7). In addition, the efficient GMM estimator is also discussed.

3.2.1 Identification

According to Assumption 2.1.1, the population model (3.7) is identifiable if and only if there exists a unique $\beta_0 \in R^k$ such that

$$E\left\{Z_i\left(Y_i - X_i'\beta_0\right)\right\} = 0.$$

Re-arranging (3.7), we have

$$E\left(Z_i X_i'\right)\beta = E(Z_i Y_i). \tag{3.8}$$

Note that (3.8) is an $l$-equation linear system with $k$ unknown parameters. The identification condition requires that there is a unique solution to (3.8). By linear algebra, this is equivalent to the following condition

$$\mathrm{rank}\, E\left(Z_i X_i'\right) = k.$$
(3.9)

Condition (3.9) is often referred to as the "rank condition" in the econometrics literature. Hence the population model defined by (3.7) is identified if and only if the rank condition (3.9) is satisfied. Accordingly, a vector $Z_i$ is said to be a vector of "valid instruments" if it satisfies the exogeneity condition (3.7) and the rank condition (3.9). An immediate implication of (3.9) is $l \ge k$, since $\mathrm{rank}\{E(Z_i X_i')\} \le \min(l, k)$. In the econometrics literature, the necessary condition $l \ge k$ is often referred to as the "order condition". It means one needs to have at least as many exogenous instruments as endogenous regressors in order to identify the causal parameters. Given that the rank condition is satisfied, when $l = k$ the model is said to be exactly identified, while if $l > k$ it is said to be overidentified.

In order to gain some insight into the rank condition, consider the following example. Suppose $X_i = (1, X_{1i})'$, where $X_{1i}$ is a scalar random variable. Suppose $Z_i = (1, Z_{1i})'$, where $Z_{1i}$ is also a scalar random variable. Note that the regressor "1" is exogenous due to (3.2). Therefore, there is only one endogenous variable $X_{1i}$ and one exogenous instrument $Z_{1i}$. The rank condition implies

$$\det E\left(Z_i X_i'\right) = \det\begin{pmatrix} 1 & E(X_{1i}) \\ E(Z_{1i}) & E(Z_{1i}X_{1i}) \end{pmatrix} = E(Z_{1i}X_{1i}) - E(X_{1i})E(Z_{1i}) \ne 0,$$

which means the exogenous instrument needs to be correlated with the endogenous regressor.

3.2.2 GMM Estimator

Based on the population model (3.7), the GMM objective function (according to Definition 1.1) in the linear model situation ((3.1) to (3.3)) is

$$d(b) = \left(a_N N^{-1}\sum_{i=1}^{N} Z_i\left(Y_i - X_i'b\right)\right)' \left(a_N N^{-1}\sum_{i=1}^{N} Z_i\left(Y_i - X_i'b\right)\right), \tag{3.10}$$

where $a_N$ is an $l \times l$ weighting matrix satisfying the conditions listed in Section 1.4. The GMM estimator is the solution to the following equation

$$\frac{\partial d(b)}{\partial b} = 0.$$

In this case, it is given by

$$\hat\beta_N^{GMM} = \left[\left(\sum_{i=1}^{N} X_i Z_i'\right) a_N' a_N \left(\sum_{i=1}^{N} Z_i X_i'\right)\right]^{-1} \left(\sum_{i=1}^{N} X_i Z_i'\right) a_N' a_N \left(\sum_{i=1}^{N} Z_i Y_i\right).$$
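(3.11)

According to Theorem 2.2, the asymptotic variance of $\hat\beta_N^{GMM}$, after proper scaling, is given by

$$V = \left(Q' a_0' a_0 Q\right)^{-1} Q' a_0' a_0\, \Omega\, a_0' a_0 Q \left(Q' a_0' a_0 Q\right)^{-1},$$

where

$$Q = E\left\{\left.\frac{\partial Z_i\left(Y_i - X_i'b\right)}{\partial b}\right|_{b=\beta_0}\right\}, \qquad \Omega = E\left\{Z_i\left(Y_i - X_i'\beta_0\right)\left(Y_i - X_i'\beta_0\right)' Z_i'\right\} = E\left(Z_i U_i U_i' Z_i'\right), \qquad a_N \xrightarrow{p} a_0.$$

In the linear model case we have, up to the sign of $Q$ (which cancels in $V$),

$$Q = E\left(Z_i X_i'\right), \qquad \Omega = E\left(U_i^2 Z_i Z_i'\right).$$

Their natural consistent estimators are given by

$$\hat Q = N^{-1}\sum_{i=1}^{N} Z_i X_i', \qquad \hat\Omega = N^{-1}\sum_{i=1}^{N} \hat U_i^2 Z_i Z_i', \quad \text{where } \hat U_i = Y_i - X_i'\hat\beta_N^{GMM}.$$

3.2.3 Efficient GMM Estimator (General Case)

According to Theorem 2.3, the lower bound of the asymptotic variance for $\hat\beta_N^{GMM}$ under the linear model is achieved when

$$a_N' a_N \xrightarrow{p} \Omega^{-1} = \left\{E\left(U_i^2 Z_i Z_i'\right)\right\}^{-1}.$$

The following two-step procedure illustrates how to obtain an efficient GMM estimator under the linear model setup.

Step 1. Set $a_N' a_N = I_l$, where $I_l$ is the $l \times l$ identity matrix. Obtain the corresponding GMM estimate, say $\tilde\beta_N^{GMM}$, which is inefficient but consistent. Use $\tilde\beta_N^{GMM}$ to construct a consistent estimator of $\Omega$, which is given by

$$\hat\Omega = N^{-1}\sum_{i=1}^{N} \hat U_i^2 Z_i Z_i', \quad \text{where } \hat U_i = Y_i - X_i'\tilde\beta_N^{GMM}.$$

Step 2. Set $a_N' a_N = \hat\Omega^{-1}$ (obtained from Step 1) and substitute back into (3.11), yielding the efficient GMM estimator

$$\hat\beta_N^{GMM*} = \left[\left(\sum_{i=1}^{N} X_i Z_i'\right)\left(\sum_{i=1}^{N} \hat U_i^2 Z_i Z_i'\right)^{-1}\left(\sum_{i=1}^{N} Z_i X_i'\right)\right]^{-1} \times \left(\sum_{i=1}^{N} X_i Z_i'\right)\left(\sum_{i=1}^{N} \hat U_i^2 Z_i Z_i'\right)^{-1}\left(\sum_{i=1}^{N} Z_i Y_i\right).$$

3.2.4 Efficient GMM Estimator (Conditional Homoscedasticity)

Suppose we assume the conditional variance of $U_i$ given $Z_i$ is constant (which is also called conditional homoscedasticity):

$$E\left(U_i^2 \mid Z_i\right) = \sigma^2.$$

Then we have

$$\Omega = E\left(U_i^2 Z_i Z_i'\right) = E\left\{E\left(U_i^2 Z_i Z_i' \mid Z_i\right)\right\} = E\left\{E\left(U_i^2 \mid Z_i\right) Z_i Z_i'\right\} = \sigma^2 E\left(Z_i Z_i'\right).$$

A natural consistent estimator of $E(Z_i Z_i')$ is $N^{-1}\sum_{i=1}^{N} Z_i Z_i'$. Note that $\hat\beta_N^{GMM}$ defined in (3.11) is unchanged when $a_N' a_N$ is multiplied by a positive constant. Hence, in order to obtain the efficient GMM estimator in this case, we can set

$$a_N' a_N = \left(\sum_{i=1}^{N} Z_i Z_i'\right)^{-1}.$$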
(3.12)

Substituting (3.12) into (3.11), the efficient GMM estimator, also known as the Two-Stage-Least-Square (2SLS) estimator, is given by

$$\hat\beta_N^{2SLS} = \left[\left(\sum_{i=1}^{N} X_i Z_i'\right)\left(\sum_{i=1}^{N} Z_i Z_i'\right)^{-1}\left(\sum_{i=1}^{N} Z_i X_i'\right)\right]^{-1} \left(\sum_{i=1}^{N} X_i Z_i'\right)\left(\sum_{i=1}^{N} Z_i Z_i'\right)^{-1}\left(\sum_{i=1}^{N} Z_i Y_i\right).$$

Denote $X = (X_1, X_2, \ldots, X_N)'$, $Z = (Z_1, Z_2, \ldots, Z_N)'$ and $Y = (Y_1, Y_2, \ldots, Y_N)'$. The 2SLS estimator can be expressed as

$$\hat\beta_N^{2SLS} = \left[X'Z\left(Z'Z\right)^{-1}Z'X\right]^{-1} X'Z\left(Z'Z\right)^{-1}Z'Y. \tag{3.13}$$

The reason for the name "Two-Stage-Least-Square" lies in the fact that (3.13) can be obtained through the following two regression steps.

Step 1. Regress $X$ on $Z$. In other words, project the matrix of regressors $X$ orthogonally onto the space spanned by the instruments $Z$ to obtain

$$\tilde X = Z\left(Z'Z\right)^{-1}Z'X.$$

Step 2. Obtain $\hat\beta_N^{2SLS}$ by running an OLS regression of $Y$ on $\tilde X$, which gives

$$\hat\beta_N^{2SLS} = \left(\tilde X'\tilde X\right)^{-1}\tilde X'Y. \tag{3.14}$$

It is easy to verify that (3.14) is equivalent to (3.13), because $Z(Z'Z)^{-1}Z'$ is symmetric and idempotent.

3.3 Example

In this section, an example is given to illustrate the use of the GMM estimator in the linear model. Based on the example, I also discuss the potential difficulties in applying GMM estimators in the above framework. Suppose a researcher is interested in whether education really helps to improve a person's wage level, after controlling for the effects of ability. Denote by $Y_i$ the annual wage of the $i$th person in the researcher's random sample, by $X_i$ education measured by the number of years in school, and by $D_i$ ability, which is unobserved by the researcher. Suppose we may explain the wage based on the following linear model

$$Y_i = \alpha + \beta \cdot X_i + \delta \cdot D_i + u_i,$$

where $E(u_i) = 0$, $u_i$ is uncorrelated with both $X_i$ and $D_i$, and $\beta$ is the parameter of interest. Since the researcher does not observe $D_i$, the linear model he/she is facing should be

$$Y_i = \tilde\alpha + \beta \cdot X_i + \varepsilon_i, \tag{3.15}$$

where $\tilde\alpha = \alpha + \delta \cdot E(D_i)$ and $\varepsilon_i = \delta \cdot D_i - \delta \cdot E(D_i) + u_i$. Hence

$$E(\varepsilon_i) = 0. \tag{3.16}$$

Since the number of years in school is generally correlated with a person's ability, we have

$$E(X_i \varepsilon_i) \ne 0.$$
(3.17)

Therefore, the linear model ((3.15) to (3.17)) faced by the researcher belongs to the class of linear models characterized by (3.1) to (3.3). Hence running an OLS regression of $Y_i$ on $X_i$ does not provide a consistent estimate of $\beta$. In fact, since the level of education is likely to be positively correlated with ability, and ability is also probably positively correlated with wage, the usual OLS estimate of $\beta$ should be inflated. This is a classical problem in labor economics. Historically, researchers have proposed many instruments under different situations in order to identify the education effect. Here I will discuss two instruments: the quarter of birth of the $i$th person and the education levels of the $i$th person's parents. The reason I choose to discuss these two instruments is that they reflect two different kinds of difficulties when applying the GMM technique developed in this chapter.

To be a valid instrument, the quarter of birth (or the parents' education) should be correlated with a person's education, but not correlated with a person's ability or any other unobservable factors that affect the wage level. For the quarter of birth instrument, it is easy to rationalize that it is uncorrelated with a person's ability and other unobservable factors. However, it is not easy to rationalize why it is correlated with a person's number of years in school. Fortunately, the data can provide evidence on whether quarter of birth is correlated with the number of years in school. In fact, a side-by-side boxplot of the number of years in school against the four quarters can help verification. Indeed, Angrist and Krueger (1991) show that, to some extent, these two variables are correlated, though not strongly. In contrast, for the parents' education instrument, it is easy to rationalize that it should be correlated (or even strongly correlated) with a person's education.
However, it is hard to rationalize that the parents' education is not correlated with a person's ability and any other unobserved factors that influence wage. Unfortunately, this problem is not testable from the data. If one wants to apply such an instrument, he/she can only assume the parents' education is exogenous.

The above discussion shows the dilemma faced when applying IV regressions. It is relatively easy to find an instrument that is correlated with the endogenous regressor while its exogeneity is hard to rationalize. Moreover, the exogeneity requirement of the instrument is not testable from the data. In contrast, it is relatively easy to find an instrument that is exogenous (uncorrelated with the unobserved term) while its correlation with the endogenous regressor is usually small. Fortunately, the correlation between an instrument and the endogenous regressors can be estimated from the data. In summary, finding a good instrument in reality is not an easy task.

Chapter 4 Simulation Studies

In this chapter, a series of simulation studies is conducted to investigate the performance of the GMM estimator under various conditions. All the simulation studies in this chapter are based on the linear model introduced in Chapter 3. Specifically, consider the following linear model:

$$Y = \alpha + \beta_1 \cdot X_1 + \beta_2 \cdot X_2 + \varepsilon, \qquad Z = \delta \cdot X_1 + u, \qquad W = \gamma \cdot X_1 + e,$$

where

$$\begin{pmatrix} X_1 \\ X_2 \\ \varepsilon \\ u \\ e \end{pmatrix} \sim N(\mu, \Sigma), \qquad \mu = \begin{pmatrix} \mu_{x_1} \\ \mu_{x_2} \\ 0 \\ 0 \\ 0 \end{pmatrix},$$

and $\Sigma$ is the symmetric matrix whose lower triangle is

$$\Sigma = \begin{pmatrix} 1 & & & & \\ \rho_{x_1 x_2} & 1 & & & \\ \sigma_\varepsilon\rho_{x_1\varepsilon} & \sigma_\varepsilon\rho_{x_2\varepsilon} & \sigma_\varepsilon^2 & & \\ \rho_{x_1 u} & \rho_{x_2 u} & \sigma_\varepsilon\rho_{\varepsilon u} & 1 & \\ \rho_{x_1 e} & \rho_{x_2 e} & \sigma_\varepsilon\rho_{\varepsilon e} & \rho_{ue} & 1 \end{pmatrix}.$$

$Y$, $X_1$, $X_2$, $W$ and $Z$ are random scalars which are assumed to be observable, while $\varepsilon$, $u$ and $e$ are random scalars which are assumed to be unobservable. All parameters are pre-specified and data sets are simulated based on their values. The parameters $\alpha$, $\beta_1$ and $\beta_2$ will be estimated by GMM using the simulated data sets. In addition, $X_1$ is assumed to be endogenous and $X_2$ is assumed to be exogenous. Both $Z$ and $W$ are valid instruments for $X_1$.
In other words, the parameters $\delta$, $\gamma$, $\mu$ and $\Sigma$ are chosen such that the following moment conditions are satisfied:

$$E(X_1\varepsilon) \ne 0, \quad E(X_2\varepsilon) = 0, \quad E(Z\varepsilon) = 0, \quad E(W\varepsilon) = 0 \quad \text{(Instrument Exogeneity)}, \tag{4.1}$$

$$E(WX_1) \ne 0, \quad E(ZX_1) \ne 0 \quad \text{(Instrument Relevance)}. \tag{4.2}$$

Such moment conditions imply we are not "free" to choose the values of all parameters. The restrictions imposed on the parameters are listed below.

$$\rho_{x_1\varepsilon} \ne 0, \quad \rho_{x_2\varepsilon} = 0, \quad \rho_{\varepsilon u} = -\delta\rho_{x_1\varepsilon}, \quad \rho_{\varepsilon e} = -\gamma\rho_{x_1\varepsilon},$$

$$\delta\left(1 + \mu_{x_1}^2\right) + \rho_{x_1 u} \ne 0, \quad \gamma\left(1 + \mu_{x_1}^2\right) + \rho_{x_1 e} \ne 0.$$

In some studies described later, I will use models with more than two instruments. In those cases, $Z$, $W$, $u$ and $e$ can be viewed as random vectors, while $\delta$ and $\gamma$ are correspondingly vectors of fixed constants. Hence, in all simulation studies, a population distribution is indexed by the following collection of parameters

$$\{\mu, \Sigma, \delta, \gamma, \alpha, \beta_1, \beta_2\}. \tag{4.3}$$

In addition, I assume conditional homoscedasticity of $\varepsilon$. Specifically,

$$E\left(\varepsilon^2 \mid Z, W\right) = \sigma_\varepsilon^2.$$

Hence, in all simulation studies below, the GMM estimator is the 2SLS estimator, which has a closed-form expression and hence can be easily obtained.

4.1 Study 1 - Robustness to Mixed Populations

The cornerstone of the GMM estimator is a set of moment conditions that characterize the data generating mechanism. The moment conditions do not completely specify the joint distribution. Under some regularity conditions, the desired properties of the GMM estimator, like consistency and asymptotic normality, hold as long as the moment conditions are correctly specified. Study 1 is designed to investigate the performance of the GMM estimator when the observed data come from a mixture of two different underlying distributions. Hence, there are two "parts" of the observed data in this study. For example, 90% of the data is generated by distribution A and the rest of the data is generated by another distribution B.
Specifically, the variables $\{X_1, X_2, \varepsilon, u, e\}$ are generated from distributions A and B respectively, and the variable $Y$ is then generated from them together with $\alpha$, $\beta_1$ and $\beta_2$. The mixture of the observed data is characterized in two dimensions. The first is the degree of mixture, represented by the relative proportions of distributions A and B in generating the data. The second is the "difference" between distributions A and B, represented by the difference in the means of the regressors between the two distributions. The objective of this study is to investigate how the bias and efficiency of the GMM estimator change in response to changes in these two dimensions.

The distributions generating parts A and B are the same for all components in (4.3) except for $\mu_{x_1}$ and $\mu_{x_2}$. The values of the parameters are chosen as follows:

| Para. | Value | Para. | Value | Para. | Value | Para. | Value |
|---|---|---|---|---|---|---|---|
| $\alpha$ | 1 | $\rho_{x_1 x_2}$ | 0.1 | $\rho_{x_2 u}$ | 0.2 | $\mu^A_{x_1}$ | 1 |
| $\beta_1$ | 2 | $\rho_{x_1 \varepsilon}$ | 0.5 | $\rho_{x_2 e}$ | 0.2 | $\mu^A_{x_2}$ | 1 |
| $\beta_2$ | 3 | $\rho_{x_1 u}$ | 0.2 | $\rho_{\varepsilon u}$ | -0.5 | $\mu^B_{x_1}$ | $\mu^A_{x_1} + \mu_0$ |
| $\delta$ | 1 | $\rho_{x_1 e}$ | 0.2 | $\rho_{\varepsilon e}$ | -0.5 | $\mu^B_{x_2}$ | $\mu^A_{x_2} + \mu_0$ |
| $\gamma$ | 1 | $\rho_{x_2 \varepsilon}$ | 0 | $\rho_{ue}$ | 0.2 | $\sigma^A_\varepsilon$ | 1 |

Table 4.1: Distributional parameter values

where $\mu_0$ varies across the simulated scenarios and captures the difference between distributions A and B. In addition, I set $\sigma^B_\varepsilon = \mu_0 + 1$. By doing so, the structural part of $Y$ (represented by $\alpha + \beta_1 X_1 + \beta_2 X_2$) does not dominate the random component $\varepsilon$.

As discussed before, the simulation study is designed mainly in two dimensions: (i) the degree of mixture, characterized by a parameter $p_B$, which is the proportion of data coming from distribution B; (ii) the difference in the means of $X_1$ and $X_2$ between distributions A and B, denoted by $\mu_0$. In addition, I also vary the sample size $n$. In other words, the objective of this study is to examine the bias and efficiency of the GMM estimator when $p_B$ and $\mu_0$ change.
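The design above can be made concrete with a small sketch. The following Python code is an illustration only, not the code used for the thesis: the function names are hypothetical, the split between parts A and B is assumed to be a fixed count of round(n * p_B) observations, $\mu^A_{x_2} = 1$ is read off Table 4.1, and the 2SLS estimator is assumed to use the instrument set $(1, X_2, Z, W)$ in the standard two-stage form, which is presumed to agree with (3.14).

```python
import numpy as np

def covariance(sigma_eps):
    """Covariance matrix of (X1, X2, eps, u, e), correlations as in Table 4.1."""
    r_x1x2, r_x1eps, r_x2eps = 0.1, 0.5, 0.0
    r_x1u, r_x2u, r_x1e, r_x2e = 0.2, 0.2, 0.2, 0.2
    r_epsu, r_epse, r_ue = -0.5, -0.5, 0.2
    s = sigma_eps
    return np.array([
        [1.0,         r_x1x2,      s * r_x1eps, r_x1u,      r_x1e],
        [r_x1x2,      1.0,         s * r_x2eps, r_x2u,      r_x2e],
        [s * r_x1eps, s * r_x2eps, s ** 2,      s * r_epsu, s * r_epse],
        [r_x1u,       r_x2u,       s * r_epsu,  1.0,        r_ue],
        [r_x1e,       r_x2e,       s * r_epse,  r_ue,       1.0],
    ])

def simulate_mixture(n, p_b, mu0, rng, alpha=1.0, beta1=2.0, beta2=3.0,
                     delta=1.0, gamma=1.0):
    """Draw n observations, a share p_b of which come from distribution B."""
    n_b = int(round(n * p_b))  # assumption: fixed (non-random) split
    mean_a = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
    mean_b = mean_a + np.array([mu0, mu0, 0.0, 0.0, 0.0])
    part_a = rng.multivariate_normal(mean_a, covariance(1.0), size=n - n_b)
    # In part B the error standard deviation is sigma_eps^B = mu0 + 1.
    part_b = rng.multivariate_normal(mean_b, covariance(mu0 + 1.0), size=n_b)
    x1, x2, eps, u, e = np.vstack([part_a, part_b]).T
    y = alpha + beta1 * x1 + beta2 * x2 + eps
    z = delta * x1 + u
    w = gamma * x1 + e
    return y, x1, x2, z, w

def two_sls(y, x1, x2, z, w):
    """2SLS estimate of (alpha, beta1, beta2) with instruments (1, X2, Z, W)."""
    ones = np.ones_like(y)
    regressors = np.column_stack([ones, x1, x2])
    instruments = np.column_stack([ones, x2, z, w])
    # First stage: project the regressors onto the instrument space.
    fitted = instruments @ np.linalg.lstsq(instruments, regressors, rcond=None)[0]
    # Second stage: regress y on the fitted regressors.
    return np.linalg.lstsq(fitted, y, rcond=None)[0]

# 1000 Monte Carlo replications for one scenario (n = 100, p_B = 0.3, mu_0 = 5).
rng = np.random.default_rng(0)
estimates = np.array([two_sls(*simulate_mixture(100, 0.3, 5.0, rng))
                      for _ in range(1000)])
print("mean:", estimates.mean(axis=0))
print("sd:  ", estimates.std(axis=0, ddof=1))
```

Because each part satisfies the exogeneity conditions in (4.1), the pooled instruments remain valid under this sketch. The printed means and standard deviations can be compared with the matching scenario in Table 4.3, although they should not be expected to agree digit for digit, since the seed and implementation details differ.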
Each of the three dimensions ($n$, $p_B$ and $\mu_0$) takes three distinct values, resulting in 27 different scenarios in total. For each scenario, I performed 1000 Monte Carlo simulations. In each simulation, for the given combination of parameter values, I first generated $n$ vectors of $\{X_1, X_2, \varepsilon, u, e\}$, from which I then generated $n$ vectors of $\{Y, Z, W\}$. The GMM estimate is then computed from the formula of the 2SLS estimator in (3.14). The table below shows the values these three dimensions take.

| Dimension | Values | | |
|---|---|---|---|
| $n$ | 100 | 500 | 1000 |
| $p_B$ | 0.1 | 0.2 | 0.3 |
| $\mu_0$ ($\mu^B_{x_1} - \mu^A_{x_1}$) | 3 | 5 | 10 |

Table 4.2: Parameter values for study 1

Although there are 27 different scenarios, I only present the results for the following six situations, together with the unmixed baseline ($p_B = 0$).

| $n$ | $p_B$ | $\mu_0$ | $\hat\beta_1$ Mean ($\beta_1 = 2$) | $\hat\beta_1$ SD | $\hat\beta_2$ Mean ($\beta_2 = 3$) | $\hat\beta_2$ SD |
|---|---|---|---|---|---|---|
| 100 | 0 | - | 1.9987 | 0.1156 | 2.9962 | 0.1016 |
| 100 | 0.3 | 10 | 1.9906 | 1.5075 | 3.0483 | 5.7763 |
| 100 | 0.3 | 5 | 2.0100 | 0.5839 | 3.0167 | 1.8389 |
| 1000 | 0.3 | 10 | 1.9776 | 0.4914 | 3.0370 | 1.9044 |
| 1000 | 0.3 | 5 | 2.0001 | 0.1738 | 2.9929 | 0.5570 |
| 100 | 0.1 | 10 | 2.0134 | 0.9291 | 2.9537 | 3.8323 |
| 100 | 0.1 | 5 | 2.0128 | 0.3839 | 2.9804 | 1.3637 |

Table 4.3: Results of study 1

The means of $\hat\beta_1$ and $\hat\beta_2$ in all scenarios are very close to their respective true values. This indicates that the bias of the GMM estimator is small in this case even when the sample size is small (e.g. 100). Hence, the first conclusion from this simulation study is that a mixed population (as characterized in this study) has little effect on the bias of the GMM estimator.

However, the table above shows that the efficiency of the GMM estimator depends heavily on how the population is mixed. Specifically, holding all other aspects constant, the higher the degree of mixture (reflected by $p_B$), the higher the variation of the GMM estimator. For example, when $n = 100$ and $\mu_0 = 10$, changing $p_B$ from 0.1 to 0.3 increases the standard deviations of $\hat\beta_1$ (coefficient of the endogenous regressor) and $\hat\beta_2$ (coefficient of the exogenous regressor) by 62% and 51% respectively.
In addition, holding all other aspects fixed, the larger the difference between the mixed populations (reflected by $\mu_0$), the higher the variation of the GMM estimator. For instance, when $n = 100$ and $p_B = 0.1$, changing $\mu_0$ from 5 to 10 increases the standard deviations of $\hat\beta_1$ and $\hat\beta_2$ by 142% and 181% respectively. The following graph visualizes the results.

[Figure 4.1: Graphical comparison for study 1. Panels: density of $\hat\beta_1$ for $n = 100$ and $p_B = 0.3$ (curves: $\mu_0 = 5$ and $\mu_0 = 10$); density of $\hat\beta_1$ for $n = 100$ and $\mu_0 = 10$ (curves: $p_B = 0.1$ and $p_B = 0.3$). Horizontal axis: $\hat\beta_1$; vertical axis: density.]

Therefore, the second conclusion of this study is that when the degree of mixture is higher or the difference between the mixed populations is larger, the efficiency of the GMM estimator declines significantly.

The practical implication of this study is non-trivial. Suppose a researcher is interested in the causal effect of advertisement on sales. Then $X_1$ in the study can represent a firm's annual investment in advertisement, while $X_2$ represents all other sales-relevant factors. Investment in advertisement varies a lot across firms in the market: large firms, particularly internationally reputed firms, spend millions of dollars on advertising, while medium or small firms spend significantly less. Hence, a random sample of firms collected by the researcher is likely to contain a mix of large, medium and small firms. Suppose, in an extreme case, that for some reason the researcher intends to collect a sample of size 100 that consists of 70% small firms (reflected by distribution A in the study) and 30% large firms (reflected by distribution B). Suppose $\mu_0$ is 10 (in millions of dollars) and the true causal effect $\beta_1$ is 2.
As the simulation results above suggest, before collecting the sample and performing the analysis, the researcher has roughly a 25% chance of underestimating the advertisement effect by at least 50% ($\hat\beta_1 \le 1$) and roughly a 25% chance of overestimating it by at least 50% ($\hat\beta_1 \ge 3$, i.e. about 75% of the estimates fall below 3). Worse still, the researcher has about a 10% chance of obtaining a negative advertisement effect ($\hat\beta_1 < 0$).

In summary, this simulation study shows that when the sample is a mixture of two parts coming from different populations, the bias of the GMM estimator is nearly unaffected but its efficiency decreases significantly.

4.2 Study 2 - Robustness to Outliers

In this study, I investigate the performance of the GMM estimator when a small proportion of the data is contaminated. Contamination means that a small proportion of the data is generated from a mechanism different from the one that generates the majority of the data. Contaminated data arise for different reasons in practice, e.g. recording errors. I define contaminated data as follows: a random disturbance term is added to the response after the response has been generated by the true underlying model. For example, suppose I generate 100 observations from a specified distribution and then add a random disturbance term to the last response. As a result, the last observation is contaminated.

From the perspective of data composition, Study 1 and Study 2 are similar in that the observed data are not generated from a common mechanism. The difference between the two studies lies in the source of the mixture. In Study 1, the mixture comes from the question under investigation itself: some firms are large while many other firms are medium or small, which is reflected by the difference in the population means of $X_1$. As a result, larger firms tend to have a higher value of $Y$.
In Study 2, however, the source of the mixture lies outside the question under investigation. In other words, it can often be characterized as "mistakes", made either by humans or by machines. Since the mixture comes from a different source, its characteristics in Study 2 also differ from those in Study 1. The contamination proportion $p_C$ is usually very low (e.g. 1%), while the strength of contamination is usually not small, i.e. the mean $\mu_C$ of the random contamination disturbance is non-trivial. For example, the data collector may accidentally input the value "1000" instead of "100". The difference between the two studies can also be seen mathematically. In Study 1, an exceptionally large value of $Y$ is attributed to the correspondingly large value of $X_1$; in Study 2, it is attributed to a large value of the intercept $\alpha$.

The objective of this study is to investigate the performance of the GMM estimator under different values of $p_C$ and $\mu_C$. The parameter values of distribution A in Study 1 are used to generate the majority (uncontaminated) part of the data in this study. The random contamination disturbance is specified as a normal random variable with mean $\mu_C$ and unit variance. The simulation study is designed in two dimensions: (i) the proportion of contaminated data, $p_C$; (ii) the mean of the random contamination disturbance, $\mu_C$. In addition, I also vary the sample size $n$. For each scenario, I performed 1000 Monte Carlo simulations. The table below shows the values these three dimensions take.

| Dimension | Values | | |
|---|---|---|---|
| Sample Size ($n$) | 100 | 500 | 1000 |
| Contamination Proportion ($p_C$) | 0.01 | 0.05 | 0.1 |
| Mean of Contamination ($\mu_C$) | ±5 | ±10 | ±50 |

Table 4.4: Parameter values of study 2

So there are 54 (3 × 3 × 6) scenarios in total. Some of the results are reported in Table 4.5, which gives the mean and standard deviation of $\hat\beta_1$ and $\hat\beta_2$ for seven combinations of the parameter values.
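The contamination mechanism described above can be sketched in a few lines. This is a hypothetical helper rather than the thesis code: the function name `contaminate` is an assumption, and the contaminated responses are taken to be the last round(n * p_C) observations, generalizing the "last response" example in the text.

```python
import numpy as np

def contaminate(y, p_c, mu_c, rng):
    """Return a copy of y whose last round(n * p_c) responses are shifted
    by independent N(mu_c, 1) disturbances; the other responses are untouched."""
    y = np.asarray(y, dtype=float).copy()
    k = int(round(len(y) * p_c))
    if k > 0:  # guard: y[-0:] would select the whole array
        y[-k:] += rng.normal(loc=mu_c, scale=1.0, size=k)
    return y

# Example: 5% contamination with mean 50 on a placeholder response vector.
rng = np.random.default_rng(1)
y_clean = np.zeros(100)
y_dirty = contaminate(y_clean, p_c=0.05, mu_c=50.0, rng=rng)
print(np.count_nonzero(y_dirty != y_clean), "responses were contaminated")
```

Only the response is altered, so the regressors and instruments keep their original joint distribution; the disturbance acts like a shift in the intercept for the affected observations.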
| $n$ | $p_C$ | $\mu_C$ | $\hat\beta_1$ Mean ($\beta_1 = 2$) | $\hat\beta_1$ SD | $\hat\beta_2$ Mean ($\beta_2 = 3$) | $\hat\beta_2$ SD |
|---|---|---|---|---|---|---|
| 100 | 0 | 0 | 1.9987 | 0.1156 | 2.9962 | 0.1016 |
| 100 | 0.01 | 50 | 2.0029 | 0.6263 | 2.9968 | 0.5023 |
| 100 | 0.01 | -50 | 2.0349 | 0.6188 | 2.9842 | 0.5197 |
| 100 | 0.05 | 50 | 1.9402 | 1.2865 | 3.0668 | 1.1255 |
| 100 | 0.05 | -50 | 2.0119 | 1.3095 | 2.9652 | 1.1198 |
| 100 | 0.05 | 10 | 2.0006 | 0.2942 | 3.0105 | 0.2452 |
| 100 | 0.05 | -10 | 1.9965 | 0.2832 | 3.0011 | 0.2623 |

Table 4.5: Results of study 2

The results for $\mu_C = \pm 50$ and $n = 100$ are reported because, intuitively, any distortion of the parameter estimates should be largest in these cases. From the table above, the bias of the GMM estimator is small in all scenarios, as reflected by the closeness between the mean of the estimates and the true value. In terms of the variation of the estimates, when the contamination proportion $p_C$ is low (e.g. 1%) or the contamination strength $\mu_C$ is low (e.g. 10), the standard deviations of $\hat\beta_1$ and $\hat\beta_2$ are relatively small. However, when both the contamination proportion and strength are high (e.g. 5% and 50 respectively), the standard deviations of the estimators are high compared to their means. For example, when $n = 100$, $p_C = 5\%$ and $\mu_C = 50$, the standard deviation of $\hat\beta_1$ is about 66% of its mean. The results therefore suggest that the efficiency of the GMM estimator decreases significantly when both the contamination proportion and strength are high. In addition, I found no systematic effect of the sign of the contamination strength. The following graph also illustrates the results.

[Figure 4.2: Graphical comparison for study 2. Panels: density of $\hat\beta_1$ for $n = 100$ and $\mu_C = 50$ (curves: $p_C = 0.01$ and $p_C = 0.05$); density of $\hat\beta_1$ for $n = 100$ and $p_C = 0.05$ (curves: $\mu_C = 10$ and $\mu_C = 50$). Horizontal axis: $\hat\beta_1$; vertical axis: density.]

In summary, the conclusions of this simulation study are as follows. In terms of bias, the GMM estimator is robust to both contamination proportion and strength.
In terms of efficiency, the GMM estimator is relatively robust when either the contamination proportion or the contamination strength alone is high. However, its efficiency decreases significantly when both the contamination proportion and strength are high.

4.3 Study 3 - Weak Instruments

Recall that in Section 3.3, I discussed the potential difficulties of applying the GMM estimator in practice through an example. The two requirements for an instrument to be valid are hard to satisfy simultaneously, especially given that the exogeneity requirement (4.1) is untestable from the data. In this study, I examine the performance of the GMM estimator when the exogenous instruments are only weakly correlated with the endogenous regressor.

Recall that to be valid instruments, $Z$ and $W$ must satisfy (4.2):

$$E(WX_1) \neq 0, \quad E(ZX_1) \neq 0.$$

Under our model setup, we have

$$E(ZX_1) = E[(\delta X_1 + u)X_1] = \delta\left(1 + \mu_{x_1}^2\right) + \rho_{x_1 u},$$

$$E(WX_1) = E[(\gamma X_1 + e)X_1] = \gamma\left(1 + \mu_{x_1}^2\right) + \rho_{x_1 e}.$$

In this study, I chose $\rho_{x_1 u} = \rho_{x_1 e} = 0$ and $\mu_{x_1} = 1$. Hence, by varying $\delta$ and $\gamma$, I can make the correlation between the instruments and $X_1$ arbitrarily small. For simplicity, I let $\delta = \gamma$ in this study. The other parameters that govern the joint distribution of the data are the same as those in Study 1. Therefore, the objective of this simulation study is to investigate the performance of the GMM estimator, both bias and efficiency, in response to $\delta$, which determines the correlation between the instruments and the endogenous regressor.

The simulation study is designed in two dimensions: $\delta$ and the sample size $n$. The table below shows the values these two dimensions take.

| Dimension | Values | | | | |
|---|---|---|---|---|---|
| Sample Size ($n$) | 50 | 100 | 500 | 1000 | - |
| $\delta$ | 0.01 | 0.05 | 0.1 | 0.15 | 0.2 |

Table 4.6: Parameter values of study 3

The results for $\hat\beta_1$ and $\hat\beta_2$ are summarized in the following table.

| $n$ | $\delta$ | $\hat\beta_1$ GMM Mean | $\hat\beta_1$ GMM SD | $\hat\beta_1$ OLS Mean | $\hat\beta_1$ OLS SD | $\hat\beta_2$ GMM Mean | $\hat\beta_2$ GMM SD | $\hat\beta_2$ OLS Mean | $\hat\beta_2$ OLS SD |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 1 | 2.0067 | 0.1329 | 2.5048 | 0.0899 | 2.9957 | 0.1033 | 2.9471 | 0.0901 |
| 1000 | 1 | 1.9979 | 0.0398 | 2.5039 | 0.0261 | 2.9988 | 0.0317 | 2.9488 | 0.0269 |
| 100 | 0.01 | 2.5271 | 1.3493 | 2.5081 | 0.0892 | 2.9449 | 0.2197 | 2.9462 | 0.0887 |
| 100 | 0.05 | 2.4368 | 1.2778 | 2.5036 | 0.0932 | 2.9666 | 0.2103 | 2.9551 | 0.0893 |
| 100 | 0.1 | 2.2980 | 1.3055 | 2.5082 | 0.0896 | 2.9746 | 0.1912 | 2.9511 | 0.0857 |
| 100 | 0.2 | 2.0352 | 1.7101 | 2.5028 | 0.0895 | 2.9968 | 0.2458 | 2.9522 | 0.0886 |
| 1000 | 0.05 | 2.2068 | 1.4474 | 2.5054 | 0.0275 | 2.9801 | 0.1829 | 2.9507 | 0.0281 |
| 1000 | 0.1 | 1.9894 | 0.3497 | 2.5052 | 0.0277 | 3.0007 | 0.0514 | 2.9499 | 0.0286 |

Table 4.7: Results of study 3 (true values: $\beta_1 = 2$, $\beta_2 = 3$)

Not surprisingly, the standard deviations of the GMM estimator in all scenarios are larger than those of the OLS estimator. In fact, this is a general result, although I did not provide a proof in earlier chapters. The intuition is as follows. Recall that the GMM estimator in these simulation studies is the 2SLS estimator. Hence, the regressors $X_1$ and $X_2$ are projected onto the instruments $Z$ and $W$ in the first stage. The projections are then used as regressors to estimate the parameters $\beta_1$ and $\beta_2$ in the second stage. Therefore, not all of the information (or variation) embedded in the regressors is used in the estimation. Intuitively, only the information that can be absorbed by the instruments is used in estimating the parameters. In contrast, the OLS estimator simply regresses the response on the original regressors $X_1$ and $X_2$. Hence, it is not hard to see why the variation of the estimator is larger for GMM than for OLS.

When the instruments are not weak (e.g. $\delta = 1$), the bias of the GMM estimator of both $\beta_1$ and $\beta_2$ is negligible. In contrast, in that case the bias of the OLS estimator is non-trivial for $\beta_1$ (the mean is roughly 25% larger than the true value) but very small for $\beta_2$ (the mean is roughly 1.7% smaller than the true value). The above table also shows that when the sample size is small (e.g. 100) and the instruments are weakly correlated with $X_1$ (e.g. $\delta = 0.01$ or 0.05), the bias of the GMM estimator is roughly the same as that of the OLS estimator. In addition, the effect of weak correlation between the instruments and the endogenous regressor spills over to the estimate of $\beta_2$ (the coefficient of the exogenous regressor $X_2$), making its GMM estimator slightly biased. The following graph visualizes the comparison.

[Figure 4.3: Comparison of OLS and GMM estimators. Panels: density of $\hat\beta_1$ for $n = 100$ and $\delta = 0.05$; density of $\hat\beta_2$ for $n = 100$ and $\delta = 0.05$ (curves in each panel: OLS Est. and GMM Est.). Vertical axis: density.]

Therefore, when the sample size is not large and the instruments are only weakly correlated with the endogenous regressors, the GMM estimator is moderately biased and has a relatively large standard deviation. In such cases, the OLS estimator is not much worse in terms of bias.

The results also show that when $\delta$ is very small, e.g. 0.05, increasing the sample size from 100 to 1000 helps to reduce the bias but not the standard deviation. Even with a sample size of 1000 (and $\delta = 0.05$), the GMM estimator still has non-trivial bias: its mean is about 10% higher than the true value. On the other hand, when $\delta$ is only moderately small, e.g. 0.1, increasing the sample size from 100 to 1000 significantly reduces both the bias and the standard deviation. In fact, the bias of the GMM estimator when $\delta = 0.1$ and $n = 1000$ is nearly as small as that when $\delta = 1$ and $n = 1000$. The following graph depicts these results.

[Figure 4.4: Comparison of GMM estimators under different correlations for varied sample size. Panels: density of $\hat\beta_1$ for $\delta = 0.1$; density of $\hat\beta_1$ for $\delta = 0.05$ (curves in each panel: $n = 100$ and $n = 1000$). Horizontal axis: $\hat\beta_1$; vertical axis: density.]

In addition, the results show that for a given sample size, e.g. 100, the bias decreases as the instruments become more correlated with $X_1$.
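The weak-instrument design of this study can be sketched as follows. The code is illustrative and makes several assumptions that are not stated in the text: the function names are hypothetical, $\sigma_\varepsilon = 1$ and $\mu_{x_2} = 1$ are carried over from distribution A of Study 1, $\rho_{\varepsilon u}$ and $\rho_{\varepsilon e}$ are reset to $-\delta\rho_{x_1\varepsilon}$ for each $\delta$ so that (4.1) keeps holding, and the 2SLS instrument set is taken to be $(1, X_2, Z, W)$.

```python
import numpy as np

def draw_study3(n, delta, rng, rho_x1eps=0.5, alpha=1.0, beta1=2.0, beta2=3.0):
    """One sample from the Study 3 design with delta = gamma and sigma_eps = 1."""
    r = -delta * rho_x1eps  # rho_eps_u = rho_eps_e, so E(Z eps) = E(W eps) = 0
    cov = np.array([
        [1.0,       0.1, rho_x1eps, 0.0, 0.0],  # X1 (rho_x1u = rho_x1e = 0)
        [0.1,       1.0, 0.0,       0.2, 0.2],  # X2
        [rho_x1eps, 0.0, 1.0,       r,   r],    # eps
        [0.0,       0.2, r,         1.0, 0.2],  # u
        [0.0,       0.2, r,         0.2, 1.0],  # e
    ])
    mean = [1.0, 1.0, 0.0, 0.0, 0.0]
    x1, x2, eps, u, e = rng.multivariate_normal(mean, cov, size=n).T
    y = alpha + beta1 * x1 + beta2 * x2 + eps
    return y, x1, x2, delta * x1 + u, delta * x1 + e

def ols_and_2sls(y, x1, x2, z, w):
    """Return the OLS and 2SLS estimates of (alpha, beta1, beta2)."""
    ones = np.ones_like(y)
    regressors = np.column_stack([ones, x1, x2])
    instruments = np.column_stack([ones, x2, z, w])
    ols = np.linalg.lstsq(regressors, y, rcond=None)[0]
    fitted = instruments @ np.linalg.lstsq(instruments, regressors, rcond=None)[0]
    tsls = np.linalg.lstsq(fitted, y, rcond=None)[0]
    return ols, tsls

# Compare a strong (delta = 1) and a weak (delta = 0.05) design at n = 100.
rng = np.random.default_rng(0)
for delta in (1.0, 0.05):
    draws = [ols_and_2sls(*draw_study3(100, delta, rng)) for _ in range(1000)]
    ols_b1 = np.array([d[0][1] for d in draws])
    tsls_b1 = np.array([d[1][1] for d in draws])
    print(f"delta={delta}: OLS mean {ols_b1.mean():.3f}, "
          f"2SLS mean {tsls_b1.mean():.3f}, 2SLS sd {tsls_b1.std(ddof=1):.3f}")
```

Under these assumptions the probability limit of the OLS coefficient on $X_1$ is $\beta_1 + \rho_{x_1\varepsilon}/(1 - \rho_{x_1x_2}^2) = 2 + 0.5/0.99 \approx 2.505$, which is in line with the OLS column of Table 4.7. With weak instruments the 2SLS draws are heavy-tailed, so the simulated standard deviation can vary noticeably from one seed to another.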
When the sample size is 100, however, the standard deviation of the GMM estimator stays basically the same as $\delta$ increases. The following plot shows this result.

[Figure 4.5: Comparison of GMM estimators under different correlations for a given sample size. Panel: density of $\hat\beta_1$ for $n = 100$ (curves: $\delta = 0.2$ and $\delta = 0.01$). Horizontal axis: $\hat\beta_1$; vertical axis: density.]

In summary, when the instrument is only weakly correlated with the endogenous regressor and the sample size is not large, the GMM estimator of the endogenous regressor's coefficient has non-trivial bias and large variation. In such cases, the GMM estimator of the exogenous regressor's coefficient is also slightly affected in terms of bias. In addition, when the instruments are weak, a very large sample size, i.e. larger than 1000, is required to reduce the bias and variation of the GMM estimator of the endogenous regressor's coefficient.

4.4 Study 4 - Mixed Strong and Weak Instruments

In the previous study, I investigated the performance of the GMM estimator when both instruments are weak and are linear functions of the endogenous regressor $X_1$. In this study, I examine the performance of the GMM estimator when there are both strong and weak instruments. Specifically, there is one strong instrument ($Z$) and two weak instruments. One of the weak instruments ($W$) is still a linear function of $X_1$, but the other one ($Q$) is a quadratic function of $X_1$.

In the previous study, I showed that a weak relationship between the endogenous regressors and the instruments may yield undesirable results for GMM estimators, e.g. non-trivial bias. In this study, the objective is twofold: (1) to examine whether weak instruments still bias the estimator in the presence of a strong instrument; (2) to examine how an additional weak instrument affects the GMM estimator's performance in the presence of strong and weak instruments.

4.4.1 Specification of Quadratic Instrument

As discussed before, I construct an instrument $Q$ that is quadratically related to $X_1$:

$$Q = aX_1^2 + bX_1 + \xi, \qquad \text{where } E(\xi) = 0, \ \mathrm{Var}(\xi) = 1.$$

In order to ensure that $Q$ is exogenous, which means

$$E(Q\varepsilon) = E\left[\left(aX_1^2 + bX_1 + \xi\right)\varepsilon\right] = 0,$$

the following restrictions are imposed:

$$E(\xi\varepsilon) = -\left[aE\left(X_1^2\varepsilon\right) + bE(X_1\varepsilon)\right] \ \Rightarrow\ \rho_{\varepsilon\xi} = -\left[a\left(1 + \mu_{x_1}^2\right) + b\rho_{x_1\varepsilon}\right].$$

I also set $E(\xi X_1) = 0$. Hence another restriction (instrument relevance) for $Q$ to be a valid instrument is

$$E(QX_1) = aE\left(X_1^3\right) + bE\left(X_1^2\right) \neq 0.$$

Since $\mu_{x_1} = 1$ and $\sigma_{x_1}^2 = 1$, we have $E(X_1^3) = 4$ and $E(X_1^2) = 2$, which leads to the requirement

$$E(QX_1) = 4a + 2b \neq 0.$$

In this study, I assume $a = 0.5$. Hence, the strength of the relationship between $Q$ and $X_1$, defined as $E(QX_1)$, is affected only through $b$. For better illustration, we can re-express it as

$$E(QX_1) = 2(1 + b) = 2\tilde{b}, \qquad \text{where } \tilde{b} = 1 + b.$$

4.4.2 Simulation Design and Results

In this case, the measure of the relationship between the linear instrument and the endogenous variable is the same as before. The study is designed in three dimensions: (i) the sample size, $n$; (ii) $\delta_W$, which describes the weak correlation between the linear instrument ($W$) and $X_1$, while the strong instrument ($Z$) takes the value $\delta_Z = 1$ in this study; (iii) $\tilde{b}$, which describes the weak relation between the quadratic instrument $Q$ and $X_1$. The table below shows the values these three dimensions take.

| Dimension | Values | | |
|---|---|---|---|
| $n$ | 100 | 500 | 1000 |
| $\delta_W$ | 0.01 | 0.05 | 0.1 |
| $\tilde{b}$ | 0.01 | 0.05 | 0.1 |

Table 4.8: Parameter values of study 4

The following table reports some of the results for $\hat\beta_1$.

| $n$ | $\delta_W$ | $\tilde{b}$ | Only Strong Linear Inst. ($Z$) Mean | Strong & Weak Linear Inst. ($Z$ & $W$) Mean | All Inst. ($Z$, $W$ & $Q$) Mean | OLS Mean |
|---|---|---|---|---|---|---|
| 100 | 0.01 | 0.1 | 1.9975 (0.1460) | 1.9997 (0.1449) | 1.9425 (0.2062) | 2.1016 (0.1019) |
| 100 | 0.05 | 0.05 | 1.9986 (0.1520) | 2.0005 (0.1503) | 1.9699 (0.2035) | 2.1010 (0.1023) |
| 100 | 0.05 | 0.1 | 2.0021 (0.1456) | 2.0039 (0.1446) | 1.9468 (0.1855) | 2.1029 (0.1009) |
| 100 | 0.1 | 0.01 | 1.9983 (0.1467) | 2.0005 (0.1453) | 1.9945 (0.1965) | 2.0999 (0.1050) |
| 1000 | 0.01 | 0.1 | 1.9987 (0.0449) | 1.9990 (0.0449) | 1.9396 (0.0613) | 2.1015 (0.0340) |
| 1000 | 0.05 | 0.05 | 1.9962 (0.0465) | 1.9965 (0.0464) | 1.9664 (0.0654) | 2.0996 (0.0327) |
| 1000 | 0.1 | 0.01 | 2.0014 (0.0446) | 2.0017 (0.0446) | 1.9927 (0.0625) | 2.1008 (0.0319) |

Table 4.9: Results of $\hat\beta_1$ ($\beta_1 = 2$) for study 4. Standard errors are reported in parentheses.

The above table shows that the GMM estimators are basically unbiased and have relatively small standard deviations (compared to their means) when only the strong linear instrument is used. It also shows a general pattern: in the presence of a strong linear instrument, adding a weak linear instrument has virtually no effect on the bias and variation of the GMM estimator. However, regardless of the sample size, when the weak linear instrument $W$ is weakly correlated with $X_1$ (e.g. $\delta_W = 0.05$ or 0.01), adding a weak quadratic instrument $Q$ increases the bias and standard deviation by roughly 2.5% and 30% respectively. The following graph visualizes the comparison.

[Figure 4.6: Comparisons of GMM estimator with and without a weak quadratic instrument. Panels: density of $\hat\beta_1$ for $n = 100$, $\delta_W = 0.01$ and $\tilde{b} = 0.1$; density of $\hat\beta_1$ for $n = 1000$, $\delta_W = 0.01$ and $\tilde{b} = 0.1$ (curves in each panel: No Weak Quadratic Inst. and With Weak Quadratic Inst.). Horizontal axis: $\hat\beta_1$; vertical axis: density.]

The first conclusion of this study is that, in the presence of a strong linear instrument, adding a weak linear instrument does not affect the GMM estimator's performance, but adding a non-linear weak instrument increases the bias and reduces accuracy.
What is interesting in Table 4.9 is that when the added quadratic instrument is extremely weak (e.g. $\tilde{b} = 0.01$), the variation of the GMM estimator increases but the bias is virtually unaffected. Another pattern in Table 4.9 is that the standard deviations in all cases are small compared to those in Table 4.7, where there are only two weak linear instruments. Hence, the second conclusion is that the presence of a strong instrument reduces the variation significantly. The following table reports some of the results for $\hat\beta_2$.

| $n$ | $\delta_W$ | $\tilde{b}$ | Only Strong Linear Inst. ($Z$) Mean | Strong & Weak Linear Inst. ($Z$ & $W$) Mean | All Inst. ($Z$, $W$ & $Q$) Mean | OLS Mean |
|---|---|---|---|---|---|---|
| 100 | 0.01 | 0.1 | 2.9984 (0.1100) | 2.9982 (0.1100) | 3.0037 (0.1131) | 2.9876 (0.1087) |
| 100 | 0.05 | 0.05 | 3.0039 (0.1068) | 3.0037 (0.1067) | 3.0063 (0.1084) | 2.9950 (0.1053) |
| 100 | 0.05 | 0.1 | 2.9972 (0.1029) | 2.9970 (0.1029) | 3.0030 (0.1045) | 2.9871 (0.1018) |
| 100 | 0.1 | 0.01 | 3.0031 (0.1085) | 3.0028 (0.1084) | 3.0031 (0.1107) | 2.9927 (0.1074) |
| 1000 | 0.01 | 0.1 | 3.0007 (0.0309) | 3.0007 (0.0309) | 3.0067 (0.0315) | 2.9903 (0.0306) |
| 1000 | 0.05 | 0.05 | 3.0007 (0.0313) | 3.0006 (0.0313) | 3.0036 (0.0317) | 2.9903 (0.0311) |
| 1000 | 0.1 | 0.01 | 3.0009 (0.0319) | 3.0009 (0.0319) | 3.0018 (0.0326) | 2.9908 (0.0316) |

Table 4.10: Results of $\hat\beta_2$ ($\beta_2 = 3$) for study 4. Standard errors are reported in parentheses.

Table 4.10 shows that the bias and variation of $\hat\beta_2$ are very small in all cases. This contrasts with the results in Table 4.7, where there are only weak instruments and the bias of $\hat\beta_2$ is slightly larger. Therefore, the last conclusion of this study is that the GMM estimator of the exogenous regressor's coefficient has little bias and small variation when a strong instrument is available.

Chapter 5 Summary

Since Hansen's (1982) original paper, the GMM estimator has rapidly gained popularity, not only in econometrics but also in other literatures (e.g. epidemiology).
A large number of published papers in economics are related to applications of the GMM estimator. Its popularity is partially attributable to its attractive asymptotic properties and its ease of application in practice. Another reason is that many economic models ultimately imply a set of moment conditions that can serve as the foundation of GMM estimation. However, the GMM estimator has its own weaknesses. A major concern is the so-called weak instrument, which is only weakly correlated with the endogenous regressors. In addition, the validity of the moment conditions used in estimation is usually not testable from the observed data. Another concern that bothers many practitioners is its performance when the sample size is small.

This thesis reviews the theoretical development of the GMM estimator and conducts several simulation studies to examine its performance under different situations. Specifically, I derive complete proofs of consistency and asymptotic normality for the GMM estimator under an i.i.d. data structure. I then review the application of the GMM estimator in linear models. In particular, I emphasize the motivations for model formulations in econometrics, on the basis of which the suitability of the GMM estimator in such a framework is illustrated. On the other hand, I also explain the potential difficulties of applying the GMM estimator through an example. The use of the GMM estimator is, however, not restricted to linear models. In fact, it is also widely used in discrete choice models (described in Section 1.2.1); for example, Berry et al. (1995) provide an application of the GMM estimator in discrete choice models with an endogenous regressor and aggregate-level data. Johnston et al. (2008) provide an application of instrumental variables in generalized linear models in the context of epidemiological studies. In fact, the GMM technique is heavily used in econometrics whenever endogeneity is a key feature of the model.
In recognition of these potential weaknesses, the simulation studies are designed to investigate the performance of the GMM estimator under different scenarios. The first two studies relate to the common situation in which the observed data do not come from a single population. A general conclusion from these two studies is that under mixed populations the bias of the GMM estimator is trivial, but its efficiency is sensitive to various aspects of the mixture. The last two studies involve the presence of weak instruments. A general conclusion is that weak instruments dramatically increase the bias and variation of the GMM estimator even when the sample size is moderately large. In summary, the GMM estimator has desirable theoretical properties, but caution is warranted in its practical applications.

Bibliography

Amemiya, T. (1985). Advanced Econometrics. Cambridge, MA: Harvard University Press.

Angrist, J. D., and Krueger, A. B. (1991). Does Compulsory School Attendance Affect Schooling and Earnings? Quarterly Journal of Economics, 106: 979-1014.

Berry, S., Levinsohn, J., and Pakes, A. (1995). Automobile Prices in Market Equilibrium. Econometrica, 63: 841-890.

Bound, J., and Jaeger, D. A. (2000). Do Compulsory School Attendance Laws Alone Explain the Association between Quarter of Birth and Earnings? Research in Labor Economics, 19: 83-108.

Bound, J., Jaeger, D. A., and Baker, R. (1995). Problems With Instrumental Variables Estimation When the Correlation Between the Instruments and the Endogenous Explanatory Variables Is Weak. Journal of the American Statistical Association, 90: 443-450.

Buse, A. (1992). The Bias of Instrumental Variable Estimators. Econometrica, 60: 173-180.

Card, D. E. (1999). The Causal Effect of Education on Earnings. Handbook of Labor Economics, 3: 1801-1863.

Donald, S. G., and Newey, W. K. (2001). Choosing the Number of Instruments. Econometrica, 69: 1161-1191.

Hall, A. R., Rudebusch, G. D., and Wilcox, D. W. (1996). Judging Instrument Relevance in Instrumental Variables Estimation. International Economic Review, 37: 283-289.

Hansen, L. P. (1982). Large Sample Properties of Generalized Method of Moments Estimators. Econometrica, 50: 1029-1054.

Heckman, J. J. (2000). Causal Parameters and Policy Analysis in Economics: A Twentieth Century Retrospective. Quarterly Journal of Economics, 115: 45-97.

Johnston, K. M., Gustafson, P., Levy, A. R., and Grootendorst, P. (2008). Use of Instrumental Variables in the Analysis of Generalized Linear Models in the Presence of Unmeasured Confounding with Applications to Epidemiological Research. Statistics in Medicine, 27: 1539-1556.

Staiger, D., and Stock, J. H. (1997). Instrumental Variables Regression With Weak Instruments. Econometrica, 65: 557-586.

Stock, J. H., and Wright, J. H. (2000). GMM With Weak Identification. Econometrica, 68: 1055-1096.

Stock, J. H., Wright, J. H., and Yogo, M. (2002). A Survey of Weak Instruments and Weak Identification in Generalized Method of Moments. Journal of Business & Economic Statistics, 20: 518-529.

Wang, J., and Zivot, E. (1998). Inference on Structural Parameters in Instrumental Variables Regression With Weak Instruments. Econometrica, 66: 1389-1404.
Item Metadata
Title | Generalized method of moments : theoretical, econometric and simulation studies |
Creator |
Liang, Yitian |
Publisher | University of British Columbia |
Date Issued | 2011 |
Description | The GMM estimator is widely used in the econometrics literature. This thesis mainly focuses on three aspects of the GMM technique. First, I derive the proofs to study the asymptotic properties of the GMM estimator under certain conditions. To the best of my knowledge, the original complete proofs proposed by Hansen (1982) are not easily available. In this thesis, I provide complete proofs of consistency and asymptotic normality of the GMM estimator under some stronger assumptions than those in Hansen (1982). Second, I illustrate the application of the GMM estimator in linear models. Specifically, I emphasize the economic reasons underlying the linear statistical models where the GMM estimator (also referred to as the Instrumental Variable estimator) is widely used. Third, I perform several simulation studies to investigate the performance of the GMM estimator under different situations. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2011-08-24 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0072092 |
URI | http://hdl.handle.net/2429/36866 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2011-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |