Three Essays on Applied Econometrics

by

Jinwen Xu

B.A., Shanghai Jiao Tong University, 2005
M.A., The University of British Columbia, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Doctor of Philosophy
in
THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Economics)

The University of British Columbia (Vancouver)

August 2014

© Jinwen Xu, 2014

Abstract

This dissertation consists of three chapters. Chapter 1 investigates how returns to education are related to occupation choices. Specifically, I investigate the returns to attending a two-year college and a four-year college and how these returns differ between a blue-collar occupation and a white-collar occupation. To address the endogenous education and occupation choices, I use a finite mixture model. I show how the finite mixture model can be nonparametrically identified by using test scores and variations in wages across occupations over time. Using data taken from the National Longitudinal Survey of Youth (NLSY) 1979, I estimate a parametrically specified model and find that returns to education are occupation specific. Specifically, two-year college attendance enhances blue-collar wages by 24% and white-collar wages by 17%, while four-year college attendance increases blue-collar wages by 23% and white-collar wages by 30%. Chapter 2 and Chapter 3 study how to perform econometric analysis with complex survey data, which are widely used in large-scale surveys. Although complex survey sampling is attractive in terms of sampling costs, it introduces complications in statistical analysis when compared with simple random sampling. In Chapter 2, I study the properties of M-estimators when they are used with complex survey data. To undo the over- and under-representation effects of the complex survey design, it is typically necessary to use the survey weights in M-estimation. I establish the consistency and asymptotic normality of the weighted M-estimators. I also discuss how to estimate the asymptotic covariance matrix of the M-estimators. Further, I demonstrate serious consequences of ignoring the survey design in M-estimation and in inference based on it. In Chapter 3, I consider specification testing with complex survey data. Specifically, I modify the standard m-testing framework to propose a new method to test whether a given model is correct for a subpopulation. The proposed test has advantages over standard m-testing because it takes account of likely heterogeneity of subpopulation distributions. All three chapters deal with heterogeneity of subpopulation distributions, whether the subpopulation identity is known (Chapter 2 and Chapter 3) or unknown (Chapter 1).

Preface

Chapter 2 and Chapter 3 are joint work with Professor Shinichi Sakata. As part of my training and under the guidance of Professor Sakata, I was involved in all stages of the research and other aspects of the analysis.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Returns to Education and Occupation Choices
1.1 Introduction
1.2 Data
1.3 Empirical Specification
1.3.1 A Model for Postsecondary Education, Occupation Choice and Wages
1.3.2 A Finite Mixture Model
1.4 Nonparametric Identification of The Finite Mixture Model
1.4.1 Nonparametric Identification of the Occupation Abilities
1.4.2 Nonparametric Identification of the Education Psychic Costs
1.5 Empirical Results
1.5.1 Occupation-specific Returns to Education, Education and Occupation Choices
1.5.2 Conditional Independence of Wages and Test Scores
1.5.3 The Occupation-Specific Returns to A Bachelor's Degree
1.5.4 The Expected Returns to Education
1.5.5 Test Scores and Returns to Education
1.6 Conclusion
2 M-Estimation
2.1 Introduction
2.2 M-Estimators
2.3 Consistency
2.4 Asymptotic Normality
2.5 Estimation of the Asymptotic Covariance Matrix
3 m-Testing of Stratum-Wise Model Specification in Complex Survey Data
3.1 Introduction
3.2 Problem Setup
3.3 A Test with Estimated Nuisance Parameters
3.4 Tests without Estimation of Nuisance Parameters
Bibliography
A
A.1 Summary Statistics, College Dropouts vs. College Graduates
A.2 Simplification of the Type-Specific Joint Distribution
A.3 Likelihood Contributions
A.4 Assumptions and Proofs of Propositions
A.4.1 Assumptions and proof of Proposition 1.1
A.4.2 Assumptions and proof of Proposition 1.2
A.4.3 Assumptions and proof of Proposition 1.3
A.4.4 Assumptions and proof of Proposition 1.4
A.5 EM Algorithm
A.6 Choice of Initial Values
B
B.1 Proofs of the Results in Section 2.2
B.2 Proofs of the Results in Section 2.3
B.3 Proofs of the Results in Section 2.4
B.4 Proofs of the Results in Section 2.5
C
C.1 Proof of the Results in Section 3.2
C.2 Proof of the Results in Section 3.3

List of Tables

Table 1.1 Descriptive Statistics
Table 1.2 Test Scores by Education and Initial Occupation
Table 1.3 Estimated Test Scores Parameters (Equation (1.4))
Table 1.4 Estimated Wage Parameters (Equation (1.1))
Table 1.5 Estimated Average Partial Effects, Occupation Choice
Table 1.6 Estimated Average Partial Effects, Educational Choice
Table 1.7 Occupation-Specific Returns
Table 1.8 Testing Conditional Independence of Wages and Test Scores
Table 1.9 Estimated Wage Parameters, with Three Test Scores in the Wage Equation
Table 1.10 Estimated Wage Parameters in the Wage Equation (6 types)
Table 1.11 The Occupation-Specific Returns to A Bachelor's Degree
Table 1.12 Expected Returns to Education
Table 1.13 Posterior Probabilities of Types Conditional on Test Scores
Table 1.14 Expected Returns to Education, By Test Scores
Table A.1 Descriptive Statistics (More Education Categories)

List of Figures

Figure 1.1 Sequential Education and Occupation Choices

Acknowledgments

I would like to express my sincere gratitude to all those who supported me during the PhD program at the University of British Columbia. Without their help, I would not have been able to complete this dissertation. First, I would like to thank my thesis supervisor, Professor Hiroyuki Kasahara, for his exceptional guidance throughout the research process. Without him, I would not have been able to get through the most difficult days, when my previous thesis supervisor left UBC and when I met great frustrations in my research. I also would like to thank my thesis committee members, Professor W. Craig Riddell and Professor Thomas Lemieux. Professor W. Craig Riddell opened the door towards academic research for me. It was his course "Policy Evaluation and Research Design" that inspired my passion for economics and motivated me to pursue the doctoral degree in economics. He provided not only great help in research but also invaluable emotional support during my PhD training. Professor Thomas Lemieux was always there to answer my questions and concerns. His constructive advice and feedback greatly improved the quality of my work. I am very grateful for his precious support when I felt helpless and depressed under the great pressure of job searching. I thank Professor Shinichi Sakata for his meticulous supervision of my research when he was at UBC. I have learnt a lot about academic writing from working together with him on the second and third chapters of my dissertation. Another person to whom I would like to extend my sincere appreciation is Professor Lang Wu. Discussions with him were of great help in deepening my understanding of sophisticated statistical models. His encouragement relieved my anxiety and helped to keep my chin up. I would also like to thank Professor Henry Siu for his kind help as the Graduate Director when I faced difficulties in both research and personal life in my third year in the PhD program. Big thanks to Professor Kevin Song, Professor Vadim Marmer, Professor Paul Schrimpf, Professor David Green, Professor Kevin Milligan, Professor Nicole Fortin, Professor Florian Hoffmann, and all the other participants in the econometrics lunch seminars and the empirical lunch seminars for their helpful suggestions and feedback on my research and presentations.
I thank the Canadian Labour Market and Skills Researcher Network (CLSRN) for its financial support. Thanks to Maureen Chin and Kathy Meredith for diligently taking care of all of the administrative aspects of the PhD and the job market. Special thanks to Nishant Chadha, Xiaodan Gao, and Haimin Zhang for sharing laughter and tears together, and for witnessing each other become a more mature and better self. Lastly, I would like to thank my parents for their unconditional love, for their endless support, and for always making me feel at home on our phone calls even though we are a whole Pacific Ocean apart.

Chapter 1
Returns to Education and Occupation Choices

1.1 Introduction

The association between education and earnings is perhaps the most well-documented and studied subject in social science. Much recent work by economists has investigated the extent to which this correlation is causal in nature (see Card, 2001, for a recent survey of this literature; also see Heckman et al., 2006a). However, how returns to education are related to the choice of occupations has received less attention.

It is plausible that returns to education are occupation specific. Returns to education are found to be lower in secondary-sector occupations (Blaug, 1985; Dickens and Lang, 1985) and in occupations which do not require the education that one obtains (Duncan and Hoffman, 1981; Sicherman, 1991). In this chapter, I examine how the returns to attending a two-year college and a four-year college differ between a blue-collar occupation and a white-collar occupation, using the National Longitudinal Survey of Youth (NLSY) 1979. Intuitively, the wage premium for high school graduates attending a two-year college may be higher in a blue-collar occupation such as that of a machinist than in a white-collar occupation such as that of a manager, while the wage premium for high school graduates attending a four-year college may be higher in a white-collar occupation than in a blue-collar occupation.

The main complication in estimating the occupation-specific returns to education comes from the endogenous education and occupation choices. As in Roy's model (Roy, 1951), individuals are endowed with different abilities to work in a blue-collar occupation or a white-collar occupation. They tend to work in the occupation in which they have a comparative advantage. Moreover, occupation abilities can also influence the education choice. For example, individuals who know that they are more likely to work in a white-collar occupation are more likely to attend a four-year college, which would increase their white-collar wages more than attending a two-year college would. In addition, individuals vary in their education psychic costs. Those with lower education psychic costs may obtain more education than those with higher education psychic costs (Willis and Rosen, 1979; Willis, 1986; Carneiro et al., 2003). While the occupation abilities and the education psychic costs are known to individuals making education and occupation decisions, these abilities and costs are unobserved by the econometrician. In the presence of self-selection in both education and occupation, the Ordinary Least Squares (OLS) estimates of occupation-specific returns to education are biased. One traditional way of dealing with the endogeneity issue in the returns to education literature is to use compelling instruments for education, such as institutional rules or natural experiments (see Card, 2001, for a survey of papers using the IV approach in this literature).
However, the standard IV approach is hard to implement here because it is difficult to find good instruments for both education and occupation choices.

I address the issue of endogeneity in education and occupation by explicitly modelling the sequential education and occupation choices. The unobserved occupation abilities and education psychic costs are specified with a flexible multinomial distribution in a finite mixture model. Departing from previous papers that use a finite mixture model to tackle the endogeneity issue in the education literature, I achieve nonparametric identification of the finite mixture model without imposing parametric assumptions on the joint distribution of wages, education, and occupation choices. Based on Kasahara and Shimotsu (2009) and Kasahara and Shimotsu (2012), I rigorously show how to nonparametrically identify the occupation abilities using the variations in wages across occupations over time. Since the information from the panel data alone is not enough to identify the unobserved education psychic costs, I bring in additional data. Specifically, I use scores from four tests (math skills, verbal skills, coding speed, and mechanical comprehension) from the Armed Services Vocational Aptitude Battery (ASVAB), together with the Rotter Locus of Control test score and the Rosenberg Self-Esteem Scale. I show that, conditional on occupation abilities and education psychic costs, the education psychic costs can be nonparametrically identified under the assumption that the test scores do not directly affect wages, education, or occupation choices. My identification strategy allows the unobserved occupation abilities and education psychic costs to be freely correlated. Carneiro et al. (2003), Hansen et al. (2004), Heckman et al. (2006b), and Cunha and Heckman (2008) also use test scores to identify their mixture models. However, they assume the unobserved variables to be mutually independent. Cunha et al. (2010) relax the strong independence assumption, but their identification relies on the assumption that certain distributions satisfy bounded completeness. My identification strategy does not require this strong rank condition.

While I show that the finite mixture model can be nonparametrically identified, estimating the high-dimensional model nonparametrically is nearly impossible given the relatively small sample size of the NLSY 1979. Therefore, I impose parametric forms on the wage, education, occupation, and test score equations to facilitate the estimation. I find that attendance of a two-year college enhances blue-collar wages by 24% and white-collar wages by 17%. Therefore, attendance of a two-year college helps accumulate more blue-collar skills than white-collar skills. The reverse holds true for attendance of a four-year college, which increases blue-collar wages by 23% and white-collar wages by 30%.

This chapter is the first to quantify the occupation-specific returns to attending a two-year and a four-year college. Although many papers have estimated returns to a two-year college and a four-year college (Kane and Rouse, 1995; Grubb, 1997; Light and Strayer, 2004; Marcotte et al., 2005), these papers assumed that returns to education are homogeneous across occupations.
The occupation-specific returns to education suggest that analyzing the potential impact of an education policy, such as a tuition subsidy, requires consideration of individuals' possible occupation choices after they have finished school, because returns to education depend on their subsequent occupation choices.

Moreover, this paper helps us understand the choice between attending a two-year college and attending a four-year college. I find that individuals make their post-secondary education choices based on both occupation abilities and education psychic costs. The idea that individuals invest in education based on their occupation abilities was first raised in a seminal paper by Willis and Rosen (1979). Willis and Rosen studied the choice made by high school graduates between entering the labour market and attending college; they suggest that individuals who are more suited to the college labour market are more likely to attend college. Keane and Wolpin (1997) extend Willis and Rosen (1979) by taking into account the sequential choices of education and occupation. They studied how individuals with different occupation abilities make year-by-year decisions as to whether to further their education. My paper departs from that of Keane and Wolpin by bringing in additional data, the test scores, to achieve nonparametric identification of the education psychic costs. I find that the education psychic costs play an important role in post-secondary decisions. This is consistent with the findings in Carneiro et al. (2003), who extend the model of Willis and Rosen (1979) to account for the education psychic costs and use the ASVAB scores to identify them. They find that individuals decide whether to attend college taking into account education psychic costs. In addition, I show that without considering the selection based on the unobserved education psychic costs, the returns to attending a two-year college are biased upward.

The rest of the paper proceeds as follows. Section 2 describes the data. Section 3 discusses the empirical specifications. Section 4 shows the nonparametric identification of the finite mixture model. Section 5 reports the empirical results, and Section 6 concludes.

1.2 Data

This paper uses data taken from the NLSY79. The NLSY79 is a U.S. national survey of 12,686 young men and women who were 14-22 years old in 1979. It consists of a core random sample of civilian youths, a supplemental sample of minority and economically disadvantaged youths, and a sample of youths in the military. The analysis is based on the 2,439 male respondents in the core random sample. The individuals were interviewed annually through 1994 and are currently interviewed on a biennial basis. I use the observations from 1979 to 1994.

The NLSY79 collects information on individuals' educational attainment and the type of post-secondary education in which individuals were enrolled. I assign individuals to three educational categories: high school graduates, two-year college attendants, and four-year college attendants. [Footnote 1: Although the main analysis of this paper is based on these three education groups, I examine the occupation-specific returns to a bachelor's degree because college graduates usually earn more than college dropouts (Jaeger and Page, 1996). However, I do not investigate the occupation-specific returns to an associate degree because the sample of associate degree earners is too small to give any reasonable estimates.] High school graduates are those who are reported to have completed at least 12 years of education and have never attended either a two-year college or a four-year college. Two-year college attendants are those who are reported to have enrolled in a two-year college and have never attended a four-year college. Four-year college attendants are those who are reported to have enrolled in a four-year college.
I distinguish between two-year college and four-year college education because a two-year college provides more technical and vocational programs while a four-year college offers more academic and professional programs.

The NLSY79 asks individuals about their occupations and the associated hourly pay in each survey year. I assign individuals to a blue-collar occupation or a white-collar occupation according to the occupation in which they work the most during the survey year, based on one-digit census codes. [Footnote 2: Although a finer aggregation is possible, I focus on two occupation categories to emphasize the importance of the role of occupational choices in returns to education.] Blue-collar occupations are (1) craftsmen, foremen, and kindred; (2) operatives and kindred; (3) laborers, except farm; (4) farm laborers and foremen; and (5) service workers. White-collar occupations are (1) professional, technical, and kindred; (2) managers, officials, and proprietors; (3) sales workers; (4) farmers and farm managers; and (5) clerical and kindred.

One advantage of the NLSY79 is that many of the respondents were in school when they were first interviewed. Therefore, information about their first jobs is available. Such information about initial conditions is especially useful because it is important to take into account the persistent shocks in wages and occupation choices, as pointed out by Hoffmann (2011).

To identify the individual unobserved occupation abilities and education psychic costs, I use the ASVAB, which was administered in 1979, to construct four test scores: math skill, verbal skill, coding speed, and mechanical comprehension. Higher scores indicate higher skills. In addition, I use the Rotter Locus of Control Scale, which was administered in 1979, and the Rosenberg Self-Esteem Scale, which was administered in 1980. The Rotter Locus of Control Scale measures whether individuals believe that events in their life derive primarily from their own actions. It is normalized so that a higher score indicates a higher degree of control individuals feel they possess over their life. The Rosenberg Self-Esteem Scale measures perceptions of self-worth.
A higher score indicates higher self-esteem. [Footnote 3: All measures are standardized to mean zero and variance one.]

Table 1.1: Descriptive Statistics
(Cells report Mean (S.D.); Overall N = 934, High School N = 318, 2-yr College N = 163, 4-yr College N = 453.)

Variable | Overall | High School | 2-yr College | 4-yr College
2-yr college attendant | 0.175 (0.380) | 0.000 (0.000) | 1.000 (0.000) | 0.000 (0.000)
4-yr college attendant | 0.485 (0.500) | 0.000 (0.000) | 0.000 (0.000) | 1.000 (0.000)
Highest grade completed | 13.943 (2.249) | 11.880 (0.567) | 13.074 (1.034) | 15.700 (1.864)
Age in 1979 | 17.217 (2.082) | 16.447 (1.628) | 17.362 (2.033) | 17.706 (2.223)
Initial job (white collar) | 0.394 (0.489) | 0.110 (0.313) | 0.264 (0.442) | 0.640 (0.480)
Initial wage | 11.603 (15.738) | 9.879 (22.438) | 10.795 (5.004) | 13.103 (12.024)
Initial wage (blue collar) | 10.249 (17.129) | 10.060 (23.754) | 10.910 (5.234) | 10.089 (4.535)
Initial wage (white collar) | 13.684 (13.068) | 8.410 (3.431) | 10.472 (4.340) | 14.797 (14.374)
Mother education | 12.269 (2.086) | 11.355 (1.835) | 12.202 (1.919) | 12.934 (2.066)
Father education | 12.726 (3.038) | 11.226 (2.469) | 12.748 (2.604) | 13.770 (3.111)
Number of siblings | 2.767 (1.748) | 3.110 (1.850) | 2.755 (1.757) | 2.530 (1.631)
Broken family at age 14 | 0.127 (0.334) | 0.151 (0.359) | 0.147 (0.355) | 0.104 (0.305)
South at age 14 | 0.239 (0.427) | 0.239 (0.427) | 0.221 (0.416) | 0.245 (0.431)
Urban at age 14 | 0.730 (0.444) | 0.619 (0.486) | 0.791 (0.408) | 0.786 (0.411)

Table 1.1 presents the sample summary statistics by the three education groups: high school graduates, two-year college attendants, and four-year college attendants. [Footnote 4: Table A.1 gives the summary statistics for two-year college dropouts, those with an associate degree, four-year college dropouts, and those with a bachelor's degree.] The sample consists of 934 individuals, of which 34% are high school graduates, 17.5% are two-year college attendants, and 48.5% are four-year college attendants. On average, the high school graduates complete 11.9 years of schooling. The two-year college attendants finish 13.1 years of school. The completed years of schooling of the two-year attendants suggest that a large fraction of them do not graduate. [Footnote 5: Around 75% of the two-year attendants do not have an associate degree.] The four-year college attendants complete 15.7 years of schooling, which suggests that a large proportion of them obtain a bachelor's degree. [Footnote 6: Around 70% of the four-year college attendants obtain a bachelor's degree.] The comparison across the three education groups of the fraction of individuals working in a white-collar occupation in their first jobs [Footnote 7: I look at individuals' first jobs to remove the impact of work experience on the probability of working in a white-collar occupation.] suggests that the probability of the initial job being in a white-collar occupation increases with education: around 11% of the high school graduates, 26.4% of the two-year college attendants, and 64% of the four-year college attendants initially work in a white-collar occupation. The average wages associated with the first jobs, as presented in table 1.1, suggest that the higher the education, the higher the wages: on average, the high school graduates earn $9.88, the two-year college attendants earn $10.80, and the four-year college attendants earn $13.10. Further, I look at the blue-collar wages and white-collar wages associated with individuals' first jobs. For those who initially work in a white-collar occupation, I find that higher education is associated with higher wages: the high school graduates earn around $8.41, the two-year attendants earn around $10.47, and the four-year attendants earn around $14.80. However, the relationship between wages and education is different for those who initially work in a blue-collar occupation: the two-year college attendants earn the most among the three education groups, and the high school graduates and the four-year college attendants earn almost the same.
The average blue-collar wages of the high school graduates are $10.06, those of the two-year college attendants are $10.91, and those of the four-year college attendants are $10.09. Table 1.1 also shows that the three education groups have quite different family backgrounds. Individuals whose parents have more education, who have fewer siblings, who grew up in a two-parent family, and who lived in an urban area at age 14 tend to obtain more education.

Table 1.2 presents the average scores on the six tests across education groups and occupation groups. Table 1.2a shows that individuals who initially work in a white-collar occupation perform better on all six tests than those who initially work in a blue-collar occupation, and therefore the six test scores may be informative about individuals' occupation abilities. Table 1.2b shows that the six test scores increase with education. Further, tables 1.2c and 1.2d show that the six test scores increase with education conditional on initial occupation. The positive correlation between educational attainment and the test scores suggests that the six test scores may be informative about individuals' education psychic costs.

Table 1.2: Test Scores by Education and Initial Occupation
(Cells report Mean (S.D.).)

(a) By Initial Occupation (Blue Collar N = 566, White Collar N = 368)
Variable | Blue Collar | White Collar
Math skill | -0.323 (0.961) | 0.497 (0.844)
Verbal skill | -0.279 (1.066) | 0.428 (0.700)
Coding speed | -0.231 (0.966) | 0.356 (0.947)
Mechanical | -0.065 (1.044) | 0.101 (0.920)
Locus of control | -0.099 (0.982) | 0.153 (1.009)
Self-esteem | -0.110 (1.000) | 0.169 (0.978)

(b) By Education (High School N = 318, 2-yr College N = 163, 4-yr College N = 453)
Variable | High School | 2-yr College | 4-yr College
Math skill | -0.684 (0.814) | -0.236 (0.870) | 0.565 (0.812)
Verbal skill | -0.672 (1.107) | -0.068 (0.867) | 0.497 (0.607)
Coding speed | -0.430 (0.911) | -0.099 (0.955) | 0.338 (0.953)
Mechanical | -0.248 (1.096) | 0.049 (0.969) | 0.156 (0.903)
Locus of control | -0.256 (0.956) | -0.017 (1.026) | 0.186 (0.982)
Self-esteem | -0.325 (0.947) | 0.058 (0.943) | 0.207 (0.999)

(c) By Education, Blue-Collar (High School N = 283, 2-yr College N = 120, 4-yr College N = 163)
Variable | High School | 2-yr College | 4-yr College
Math skill | -0.708 (0.801) | -0.255 (0.925) | 0.294 (0.910)
Verbal skill | -0.734 (1.105) | -0.071 (0.855) | 0.359 (0.705)
Coding speed | -0.476 (0.916) | -0.126 (0.965) | 0.116 (0.934)
Mechanical | -0.257 (1.097) | 0.063 (0.959) | 0.294 (0.910)
Locus of control | -0.263 (0.971) | 0.006 (1.001) | 0.108 (0.943)
Self-esteem | -0.348 (0.943) | 0.071 (0.969) | 0.171 (1.024)

(d) By Education, White-Collar (High School N = 35, 2-yr College N = 43, 4-yr College N = 290)
Variable | High School | 2-yr College | 4-yr College
Math skill | -0.495 (0.899) | -0.181 (0.698) | 0.718 (0.708)
Verbal skill | -0.178 (1.004) | -0.061 (0.908) | 0.574 (0.531)
Coding speed | -0.061 (0.783) | -0.022 (0.933) | 0.462 (0.942)
Mechanical | -0.178 (1.108) | 0.011 (1.009) | 0.718 (0.708)
Locus of control | -0.204 (0.834) | -0.079 (1.101) | 0.230 (1.003)
Self-esteem | -0.139 (0.975) | 0.021 (0.877) | 0.228 (0.986)

1.3 Empirical Specification

In this section I specify the wage regression in which returns to education are occupation-specific, explicitly model how individuals make their post-secondary education choices and subsequent occupation choices based on their unobserved occupation abilities and education psychic costs, and present the test score regression specification, which is essential for the identification of the finite mixture model.

To control for the selection in education and occupation, I specify the joint distribution of occupation abilities and education psychic costs by a multinomial distribution in a finite mixture model. A finite mixture model assumes that the overall population consists of M types of people. Each type shares the same occupation abilities and education psychic costs, and different types differ in occupation abilities and/or education psychic costs. I assume that the unobserved types affect the intercepts of the wage regression, the test score regressions, and the expected utilities in the choice of postsecondary education and occupations. The superscript m in the following equations represents the mth type-specific parameters. The finite mixture model is discussed in more detail in Section 1.3.2.
1.3.1 A Model for Postsecondary Education, Occupation Choice and Wages

The Model for Wages

Different from the conventional Mincer-type wage specification, I allow the returns to attending a two-year and a four-year college to depend on the occupation choice. The log wage, W_{it}, for individual i at time t is as follows:

W_{it} = α^m_{W,1} + α^m_{W,2} O_{it} + β_1 2YR_i + β_2 4YR_i + β_3 2YR_i O_{it} + β_4 4YR_i O_{it} + X'_{it} β_5 + ε_{W,it},   (1.1)

where 2YR_i is a dummy variable which equals 1 if individual i is a two-year college attendant, 4YR_i is a dummy variable which equals 1 if individual i is a four-year college attendant, and O_{it} is a dummy variable which equals 1 if individual i works in a white-collar occupation at time t. The occupation-specific work experience and its squared terms are collected in X_{it}. Since different occupations reward occupation-specific work experience differently, X_{it} also includes the interaction terms of the occupation-specific work experience with the occupation dummy variable O_{it}, and of the occupation-specific work experience squared with O_{it}.

The returns to attending a two-year college and a four-year college in a blue-collar occupation are represented by β_1 and β_2 respectively, and the returns to attending a two-year college and a four-year college in a white-collar occupation are β_1 + β_3 and β_2 + β_4 respectively. Since a two-year college focuses on technical and vocational programs while a four-year college provides academic and professional programs, we would expect the returns to attending a two-year college to be higher in a blue-collar occupation than in a white-collar occupation, i.e. β_3 < 0, and the returns to attending a four-year college to be higher in a white-collar occupation than in a blue-collar occupation, i.e. β_4 > 0.
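To make explicit how the coefficients in Equation (1.1) translate into the four occupation-specific returns, the short sketch below (not part of the thesis; the data frame, column names, and coefficient values are hypothetical, with magnitudes only loosely echoing Table 1.4) builds the education-by-occupation interaction regressors and reads the returns off a set of fitted coefficients.

```python
import pandas as pd

# Hypothetical panel rows: education dummies (2YR_i, 4YR_i) and the
# white-collar indicator O_it from Equation (1.1).
df = pd.DataFrame({
    "twoyr":  [1, 1, 0, 0],
    "fouryr": [0, 0, 1, 1],
    "white":  [0, 1, 0, 1],
})

# Interaction regressors that beta_3 and beta_4 multiply in Equation (1.1).
df["twoyr_x_white"] = df["twoyr"] * df["white"]
df["fouryr_x_white"] = df["fouryr"] * df["white"]

# Illustrative fitted coefficient values (made up for this sketch).
b1, b2, b3, b4 = 0.24, 0.23, -0.07, 0.07

returns = {
    "2-yr college, blue-collar":  b1,       # beta_1
    "2-yr college, white-collar": b1 + b3,  # beta_1 + beta_3
    "4-yr college, blue-collar":  b2,       # beta_2
    "4-yr college, white-collar": b2 + b4,  # beta_2 + beta_4
}
print(returns)
```

The point of the interaction terms is simply that the education dummies are allowed a different slope when O_{it} = 1, so each education-occupation pair gets its own return.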
In other words, if type m has a comparative advantage in awhite-collar occupation than a blue-collar occupation, we would expect αmW,2 > 0.11I assume that productivity shocks εW,it follow a first-order Markov process8:εW,it = ρεW,it−1 +ζit ,where εW,i1iid∼ N(0,σW,1) and ζit |εW,it−1iid∼ N(0,σW,2).The Model for Occupation ChoicesIn each period, individuals choose to work in either in a blue-collar or a white-collar occupation to maximize life-time income. Let IO,it denote the latent utilityassociated with a white-collar occupation relative to a blue-collar occupation at timet:IO,it = αmO +λ12YRi +λ24YRi +λ3Oit−1 +X ′itλ4 + εO,it , (1.2)where εO,itiid∼N(0,1). Since the latent utility, Iit , depends on wages, all the regressorsin Equation (1.1) are included. In addition, the occupation choice at time t−1 mayaffect the occupation choice at time t because job switching costs may preventindividuals from moving from one occupation to another. Such a relationshipbetween the occupation choices at time t−1 and time t are captured by the dummyvariable, Oit−1, which equals to 1 if the job at time t−1 is a white-collar occupation.The type-specific intercept, αmO , reflects that the latent utility, IO,it , depends onoccupation abilities. In other words, holding everything else the same, an individualwith a comparative advantage in a white-collar occupation is more likely to workin a white-collar occupation than an individual with a comparative advantage in ablue-collar occupation.8I assume the same productivity shock for a blue-collar and a white-collar occupation. It isbecause occupation choice in current period is influenced by current wages in blue- and white-collaroccupations. The current occupation-specific wages depend on the blue- and white-collar productivityshocks in the last period. However, we only observe the wage associated with the last period occupationan individual worked in. The wage associated with the other occupation is unobserved. So if weconsider occupation-specific productivity shocks, we have to integrate out the unobserved productivityshock associated the other occupation. This significantly increase the computation burden.12Figure 1.1: Sequential Education and Occupation Choices13As illustrated in figure 1.1, occupation abilities drive both wages and occupationchoices. Therefore, the occupation choice in Equation (1.1) is endogenous and OLSestimates of occupation-specific returns to education are biased in general.The Model for The Education ChoiceA high school graduate faces three options: attending a two-year college, attending afour-year college, and entering the labour market without pursuing more education.She makes the postsecondary education decision to maximize the life-time utility.Let IS,i j 9represent the net benefit associated with education level j ( j ∈ {1,2,3})relative to the benefit associated with education level 1:IS,i j ={εS,i j if j = 1αmS, j +Z′S,iδ j + εS,i j if j = 2,3(1.3)where {εS,i j}3j=1 are mutually independent and follows type I extreme value dis-tribution while ZS,i includes family background variables. The intercept, αmS =(αS,2,αS,3)′, is different across types. The reasons are twofold. First, future oc-cupations and wages depend on occupation abilities. For instance, an individualwith a comparative advantage in a white-collar occupation would expect herself tobe more likely to work in a white-collar occupation and tend to attend a four-yearcollege because a four-year college helps accumulating more white-collar skillsthan blue-collar skills. 
Second, education psychic costs also play an important role. For example, an individual with a comparative advantage in a white-collar occupation may choose to attend a two-year college rather than a four-year college if her psychic costs of attending a four-year college are high, even though a four-year college enhances white-collar skills more than a two-year college does.

As shown in figure 1.1, occupation abilities are related to both wages and the postsecondary education choice. Education psychic costs, which affect the education choice, may be correlated with occupation abilities and may therefore also induce a correlation between wages and the education choice. Hence, the education dummy variables are endogenous in Equation (1.1), and the OLS estimates of the occupation-specific returns to education are biased in general.

To sum up, the education dummy variables and the occupation choice in Equation (1.1) are endogenous because the unobserved types connect wages, occupation choices, and the postsecondary education choice. I address the endogeneity issue using a finite mixture model in which the distribution of types is specified by a flexible multinomial distribution. Since individuals of the same type have the same occupation abilities and education psychic costs, the variations in education and occupation choices within a type, holding the observables constant, are purely random. Once the finite mixture model is nonparametrically identified, we can obtain unbiased and consistent estimates of the occupation-specific returns to attending a two-year and a four-year college.

The Model for The Six Test Scores

As I will discuss in detail in Section 1.4, to achieve nonparametric identification of the finite mixture model, I bring in additional information. Specifically, I use four test scores constructed from the ASVAB: tests of math skills, verbal skills, coding speed, and mechanical comprehension. I also use the Rotter Locus of Control and the Rosenberg Self-Esteem Scale.

In the following specification for test scores, I take into account the possibility that the test scores are influenced by the education level at the date of the tests. Since the tests were administered to all respondents in the sample in 1979 and 1980, when they were between 14 and 22 years of age and many had finished their schooling, the tests may not be fully informative about the occupation abilities and education psychic costs (Hansen et al., 2004; Heckman et al., 2006b). Let Q_{ir} denote the score of individual i on test r:

Q_{ir} = α^m_{Q,r} + θ_{r,1} 2YR_{ir} + θ_{r,2} 4YR_{ir} + Z'_{i,r} θ_{r,3} + ε_{Q,ir},   for r = 1, ..., 6,   (1.4)

where 2YR_{ir} is a dummy variable which equals 1 if individual i was a two-year college attendant at the time test r was administered, and 4YR_{ir} is a dummy variable which equals 1 if individual i was a four-year college attendant at the time test r was administered. Other observables which influence test score r, such as family background variables and the age when test r was administered, are collected in Z_{i,r}.

The intercept α^m_{Q,r} is subpopulation-specific because the test scores reflect the occupation abilities and education psychic costs. For example, mechanical comprehension is important for a blue-collar occupation. The Rotter Locus of Control, which measures people's belief in their ability to control their life, may be important for a management job.
Math and verbal skills can reflect education psychic costs.

I assume that the test scores are mutually independent conditional on occupation abilities, education psychic costs, and the observables, i.e. ε_{Q,ir} ⊥ ε_{Q,ir'} for r ≠ r', and ε_{Q,ir} ~ N(0, σ_{Q,r}) for r ∈ {1, ..., 6}. Further, I assume that the test scores do not directly affect wages, occupation choices, or education choices conditional on occupation abilities, education psychic costs, and the observables, i.e. ε_{Q,ir} ⊥ ε_{W,it}, ε_{Q,ir} ⊥ ε_{O,it}, and ε_{Q,ir} ⊥ ε_{S,ij}. These two assumptions are the key to the nonparametric identification of the finite mixture model, which is discussed in Section 1.4.

1.3.2 A Finite Mixture Model

In the finite mixture model, the conditional joint distribution of wages {W_{it}}_{t=1}^T, occupations {O_{it}}_{t=1}^T, education S_i, and test scores {Q_{ir}}_{r=1}^6 in the overall population is a weighted average of the type-specific conditional joint distributions. The weight π_m is the proportion of type m. Formally,

f({W_{it}, O_{it}}_{t=1}^T, S_i, {Q_{ir}}_{r=1}^6 | {X_{it}}_{t=1}^T, Z_{S,i}, {Z_{ir}}_{r=1}^6)
  = Σ_{m=1}^M π_m f^m({W_{it}, O_{it}}_{t=1}^T, S_i, {Q_{ir}}_{r=1}^6 | {X_{it}}_{t=1}^T, Z_{S,i}, {Z_{ir}}_{r=1}^6),   (1.5)

where {X_{it}}_{t=1}^T, Z_{S,i}, and {Z_{ir}}_{r=1}^6 are the observables in Equations (1.1), (1.2), (1.3), and (1.4). With the assumptions that (i) test scores do not directly affect wages, occupations, and education conditional on type, (ii) the error terms in the test scores are mutually independent, (iii) the error term in wages follows a first-order Markov process, (iv) the occupation choice is only affected by the previous occupation, not the whole occupation history, and (v) the regressors and the error terms in Equations (1.1), (1.2), (1.3), and (1.4) are independent, I simplify the type-specific conditional joint distribution of wages, occupations, education, and test scores, and express the population conditional joint distribution as follows [Footnote 10: Please refer to Appendix A.2 for more details about how these assumptions simplify the type-specific conditional joint distribution of wages, occupations, education, and test scores.]:

f({W_{it}, O_{it}}_{t=1}^T, S_i, {Q_{ir}}_{r=1}^6 | {X_{it}}_{t=1}^T, Z_{S,i}, {Z_{ir}}_{r=1}^6)
  = Σ_{m=1}^M π_m f^m(W_{i1} | O_{i1}, S_i) Π_{t=2}^T f^m(W_{it} | O_{it}, S_i, X_{it}, W_{it-1}, O_{it-1}, X_{it-1})
    × f^m(O_{i1} | S_i) Π_{t=2}^T f^m(O_{it} | O_{it-1}, S_i, X_{it}) f^m(S_i | Z_{S,i}) Π_{r=1}^6 f^m(Q_{ir} | Z_{ir}).   (1.6)

In Section 1.4, I rigorously show how this finite mixture model is nonparametrically identified, i.e. how to recover the unknowns on the right-hand side of Equation (1.6) from the observed population joint distribution on the left-hand side. Many papers that use a finite mixture model in the returns to education literature do not show nonparametric identification. In other words, their finite mixture models may rely on restrictive parametric assumptions, which can lead to biased estimates of occupation-specific returns.

Once the nonparametric identification of the finite mixture model is established, I use the Maximum Likelihood Estimator (MLE) to estimate the occupation-specific returns to attending a two-year and a four-year college. Although the finite mixture model can be nonparametrically identified, estimating a high-dimensional nonparametric statistical model requires very heavy computation and is nearly impossible given the relatively small sample size of the NLSY 1979. Therefore, I estimate a statistical model with the parametric assumptions of Section 1.3.1. Let Y_i = ({W_{it}, O_{it}, X_{it}}_{t=1}^T, S_i, Z_i, {Q_{ir}, Z_{ir}}_{r=1}^6).
The log-likelihood contribution of a particular individual is as follows:

L(Y_i; α_W, α_O, α_S, α_Q, β, λ, δ, θ, σ_W, σ_Q, ρ)
  = log( Σ_{m=1}^M π_m L^m_W(Y_i; α_W, β, σ_W, ρ) L^m_O(Y_i; α_O, λ) L^m_S(Y_i; α_S, δ) L^m_Q(Y_i; α_Q, θ, σ_Q) ),   (1.7)

where α_W = {α^m_{W,1}, α^m_{W,2}}_{m=1}^M, α_O = {α^m_O}_{m=1}^M, α_S = {{α^m_{S,j}}_{j=2}^3}_{m=1}^M, α_Q = {{α^m_{Q,r}}_{r=1}^6}_{m=1}^M, β = {β_1, β_2, β_3, β_4, β_5}, λ = {λ_1, λ_2, λ_3, λ_4}, σ_W = {σ_{W,1}, σ_{W,2}}, and σ_Q = {σ_{Q,r}}_{r=1}^6.

Note that the likelihood contribution of a particular individual who belongs to subpopulation m consists of four pieces: L^m_W(Y_i; α_W, β, σ_W, ρ), the likelihood contribution of wages; L^m_O(Y_i; α_O, λ), the likelihood contribution of occupations; L^m_S(Y_i; α_S, δ), the likelihood contribution of education; and L^m_Q(Y_i; α_Q, θ, σ_Q), the likelihood contribution of test scores. The detailed expressions for each of the four likelihood contributions are collected in Appendix A.3.
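As a purely illustrative sketch (not taken from the thesis), the snippet below shows one way Equation (1.7) can be evaluated for a single individual: work with the logs of the type-specific pieces and combine them with a log-sum-exp so the weighted sum over types does not underflow when the likelihood pieces are tiny. All function names, argument names, and numbers are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood_contribution(log_pi, log_LW, log_LO, log_LS, log_LQ):
    """Equation (1.7) for one individual, evaluated in logs for stability.

    Each argument is a length-M array holding, for every type m, the log type
    probability or the log of the corresponding likelihood piece (wages,
    occupations, education, test scores).
    """
    log_components = log_pi + log_LW + log_LO + log_LS + log_LQ
    return logsumexp(log_components)  # log of the weighted sum over types

# Illustrative call with M = 4 types (all inputs are made up).
M = 4
log_pi = np.log(np.full(M, 1.0 / M))
rng = np.random.default_rng(0)
pieces = [rng.normal(-5.0, 1.0, M) for _ in range(4)]
print(log_likelihood_contribution(log_pi, *pieces))
```

Summing such contributions over individuals gives the sample log-likelihood that the estimation maximizes.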
1.4 Nonparametric Identification of The Finite Mixture Model

In this section, I discuss the nonparametric identification of the finite mixture model using the results in Kasahara and Shimotsu (2009) and Kasahara and Shimotsu (2012). Nonparametric identification means that the unknown proportions of types and the type-specific joint distributions of wages, occupations, education, and test scores can be recovered from the observed empirical population joint distribution of wages, occupations, education, and test scores. I use two sources of information to achieve the nonparametric identification: variations in wages across occupations over time, and test scores. I show that the wage history is helpful for identifying the occupation abilities, yet test scores are essential for identifying the education psychic costs.

1.4.1 Nonparametric Identification of the Occupation Abilities

I use variations in wages across occupations over time to identify the occupation abilities. Intuitively, individuals with a comparative advantage in a white-collar occupation may have high white-collar wages, and hence the fraction of individuals with high white-collar wages can be informative about the fraction of individuals with a comparative advantage in a white-collar occupation. In other words, the fractions of individuals with high white-collar wages over time impose restrictions on the unknown type probabilities and type-specific distributions.

Three elements are the important determinants of identification: (1) the time dimension of the panel data, (2) the variation in the occupation-specific work experience, and (3) the heterogeneity in wages and occupational choices of individuals with different occupation abilities, conditional on the occupation-specific work experience. The number of observed restrictions depends on the first two elements. The third element says that variations in wages are informative about the occupation abilities.

Let us start with a simple case in which the wage and occupation distribution functions are stationary and there is no serial correlation.

Proposition 1.1. Suppose Assumption A.1 and Assumption A.2 hold. With T ≥ 3, π_m, f^m(S_i | Z_i), f^m(W_{i1} | O_{i1}, S_i), f^m(O_{i1} | S_i), f^m(W_{it} | O_{it}, S_i, X_{it}, W_{it-1}, O_{it-1}, X_{it-1}), and f^m(O_{it} | S_i, X_{it}, O_{it-1}) for t ≥ 2 can be identified up to M types.

The assumptions and the proof of Proposition 1.1 are collected in Appendix A.4. The number of types M that can be identified depends on the number of values {X_{it}}_{t=1}^T can take and on its changes over time. The key insight is that each different value of {X_{it}}_{t=1}^T imposes different restrictions on the type probabilities and type-specific distributions.

The assumption that the current wage and occupational choice are not influenced by their lagged values is restrictive. The productivity shocks in the wage equation can be serially correlated, and the occupation in the last period can affect the occupation search cost in the current period. The next proposition relaxes this strong assumption by allowing the current wage and occupation to depend on those in the last period.

Proposition 1.2. Suppose Assumption A.3 and Assumption A.4 hold, and assume T ≥ 6. Then π_m, f^m(S_i | Z_i), f^m(W_{i1} | O_{i1}, S_i), f^m(O_{i1} | S_i), f^m(W_{it} | O_{it}, S_i, X_{it}, W_{it-1}, O_{it-1}, X_{it-1}), and f^m(O_{it} | S_i, X_{it}, O_{it-1}) for t ≥ 2 can be nonparametrically identified up to M types.

The assumptions and the proof of Proposition 1.2 are in Appendix A.4. If there is longer dependence in either the wage or the occupational choice, a longer panel is required. For example, if the current wage is affected by the wage two periods before, then at least 9 periods of observations are needed for identification.

The education psychic costs cannot be nonparametrically identified with panel data. The reason is that postsecondary education is a one-period choice in my model, so there is no information over time that can distinguish the unobserved noise from the unobserved education psychic costs in the education equations. Although Keane and Wolpin (1997) consider year-by-year schooling decisions, education is not an option in every time period, due to the fact that the probability of going back to school after working is very low. Therefore, education psychic costs are not nonparametrically identified in Keane and Wolpin (1997).

1.4.2 Nonparametric Identification of the Education Psychic Costs

In order to nonparametrically identify the education psychic costs, I use six test scores: the math, verbal, coding, and mechanical tests in the ASVAB, the Rotter Locus of Control, and the Rosenberg Self-Esteem Scale. The nonparametric identification using test scores is intuitive. For example, individuals with low education psychic costs may have good math and verbal test scores, so the fraction of individuals with good math and verbal test scores is informative about the fraction of individuals with low education psychic costs.

Assume that the test scores do not directly affect postsecondary choices conditional on type and some observables. In other words, test scores do not affect postsecondary education application and admission once type and the other observables are known. In addition, assume that there are three test scores which are independent of each other conditional on type and some observables. These two assumptions lead to the nonparametric identification of the education psychic costs.

Proposition 1.3. Suppose the econometrician has access to three test scores (Q_1, Q_2, Q_3) and Assumption A.5 holds. Then π_m, f^m(S_i | Z_i), and {f^m(Q_{ir} | S_{ir}, Z_{ir})}_{r=1}^3 can be nonparametrically identified up to M types.

Assumption A.5 and the proof of Proposition 1.3 are collected in Appendix A.4.

The nonparametric identification of the education psychic costs using test scores does not require the six test scores to be perfect proxies. In other words, the nonparametric identification does not need the assumption that education does not affect test scores, as it would if the test scores were used as proxies. Such an assumption is restrictive because some respondents had already finished schooling when the tests were administered. The finding in Heckman et al. (2006b) that education does influence the ASVAB scores, the Rotter Locus of Control, and the Rosenberg Self-Esteem Scale further shows the importance of relaxing this assumption.
The finding that education does influence the ASVAB scores, Rotter20Locus of Control, and Rosenberg Self-Esteem Scale in Heckman et al. (2006b)further show the importance to relax such assumption.Not only can test scores identify the education psychic costs, they can identifythe occupation abilities as well. For instance, the Rotter Locus of Control whichmeasures people’s belief in their ability to control life may be important to amanagement job. So the fraction of individuals who perform well in this test isinformative about the fraction of individuals with a comparative advantage in awhite-collar occupation. It implies that the nonparametric identification of thefinite mixture model can be achieve without the panel data, although the additionalinformation from the variations in wages across occupations over time are helpfulto increase the efficiency. Different from previous papers such as Keane and Wolpin(1997), Belzil and Hansen (2002), and Belzil and Hansen (2007), which reply on along panel data to identify a finite mixture model, test scores allow me to apply afinite mixture model to data with limited periods of observations.Assume that the test scores do not directly affect wages, education and occupa-tion choices conditional on type and some observables. Further, assume that thereare three test scores which are independent from each other conditional on type andsome observables. These two assumptions lead to the nonparametric identificationof both occupation abilities and education psychic costs.Proposition 1.4. With access to three test scores (Q1,Q2,Q3), Suppose AssumptionA.6 holds. Then pim, f m(Si|Zi), { f m(Qir|Sir,Zir)}2r=1, f m(Wi1|Oi1,Si), f m(Oi1|Si),f m(Wit |Oit ,Si,Xit ,Wit−1,Oit−1,Xit−1), and f m(Oit |Si,Xit ,Oit−1) for t ≥ 2 can benonparametrically identified up to M types.Assumption A.6 and the proof of Proposition 1.4 are collected in Appendix A.4.The exclusion condition that test scores do not directly affect wage, educationand occupation choices conditional on type and some observables is the key asump-tion to nonparametrically identify occupation abilities and education psychic costsusing test scores. Intuitively, the exclusion of test scores from occupation and wagemeans that once employers know an individual’s type, addition information abouttest scores would not influence the their decision on hiring and salary. The exclusionof test scores from education means that test scores do not affect postsecondaryeducation application and admission once type is known. The assumption that test21scores are exclusive from wage, education, and occupation is different from the ex-clusion condition in the IV approach and Heckman’s two-step. While the exclusionvariable in the IV approach and Heckman’s two-step must not be correlated withthe unobserved type, the exclusion variable in the finite mixture model has to becorrelated with the unobserved type.1.5 Empirical ResultsEM algorithm (Dempster et al. , 1977) is applied in this paper to facilitate thecomputation of finding the maximum likelihood estimates. As is well known, directmaximization of the likelihood function based on Newton-Raphson type algorithm isdifficult for a finite mixture model because of the possibility of many local maxima.EM algorithm is a method for finding maximum likelihood estimates by iterating anexpectation (E) step and a maximization (M) step. In E step, the expectation of thelog-likelihood is calculated given the estimated the type proportions and parametersin the previous iteration. 
The empirical results are presented below. I assume that the overall population consists of four types. There are two occupation-ability types: type 1 and type 2 have the same occupation abilities, and so do type 3 and type 4. Within each occupation-ability type, I consider two types of education psychic costs: type 1 and type 2 differ in education psychic costs, although they have the same occupation abilities. Similarly, type 3 and type 4 have different education psychic costs but share the same occupation abilities.

1.5.1 Occupation-specific Returns to Education, Education and Occupation Choices

As illustrated in Section 1.4, the nonparametric identification of the finite mixture model relies heavily on the informativeness of the test scores about the unobserved types. If test scores are reflective of the unobserved occupation abilities and education psychic costs, we would expect different types to have different test scores.

Table 1.3: Estimated Test Scores Parameters (Equation (1.4))
(Cells report coefficient (standard error); columns are the Math, Verbal, Coding, Mechanical, Locus of control, and Self-esteem scores.)

Variable | Math | Verbal | Coding | Mechanical | Locus of control | Self-esteem
Constant, type 1 | -0.021 (0.315) | -1.122*** (0.285) | -1.315*** (0.404) | -2.430*** (0.367) | -1.081*** (0.446) | -1.548*** (0.474)
Deviation of type 2 from type 1 | -1.271*** (0.074) | -0.868*** (0.078) | -0.742*** (0.113) | -0.734*** (0.099) | -0.380*** (0.122) | -0.140*** (0.120)
Deviation of type 3 from type 1 | -2.052*** (0.096) | -2.187*** (0.076) | -1.445*** (0.104) | -1.661*** (0.110) | -0.588*** (0.128) | -0.535** (0.129)
Deviation of type 4 from type 1 | -0.622*** (0.067) | -0.271*** (0.076) | -0.528*** (0.089) | -0.226*** (0.086) | -0.410*** (0.100) | -0.257*** (0.096)
2-year college | 0.019*** (0.014) | 0.038*** (0.014) | 0.019 (0.017) | 0.030 (0.017) | 0.005 (0.021) | 0.005 (0.022)
4-year college | 0.041*** (0.009) | 0.024*** (0.010) | 0.018*** (0.012) | 0.017** (0.012) | 0.006** (0.015) | 0.020** (0.014)
Mother education | 0.012** (0.013) | -0.027*** (0.013) | -0.012 (0.019) | -0.010* (0.018) | -0.010 (0.020) | -0.033* (0.020)
Father education | -0.064*** (0.068) | -0.090*** (0.069) | 0.115*** (0.102) | -0.092 (0.093) | -0.025 (0.100) | 0.003 (0.113)
Number of siblings | -0.034 (0.055) | 0.033*** (0.056) | 0.042 (0.074) | -0.146 (0.067) | 0.048 (0.082) | 0.003 (0.083)
Broken family at age 14 | 0.008* (0.051) | 0.013 (0.053) | -0.005 (0.075) | -0.186* (0.070) | -0.035 (0.087) | 0.029 (0.079)
South at age 14 | -0.012 (0.013) | 0.044* (0.013) | 0.067 (0.019) | 0.134** (0.017) | 0.072 (0.023) | 0.083 (0.024)
Urban at age 14 | 0.278 (0.076) | 0.168 (0.082) | 0.129 (0.110) | -0.052*** (0.097) | 0.224 (0.150) | 0.104 (0.156)
Age | 0.542** (0.066) | 0.305*** (0.070) | 0.210*** (0.091) | -0.286*** (0.086) | 0.145*** (0.121) | 0.210*** (0.122)

Notes: Dependent variables are the math, verbal, coding speed, mechanical comprehension, locus of control, and self-esteem scores. Type 1 and type 2 have the same occupation abilities; so do type 3 and type 4. Type 1 and type 2 have different education psychic costs; so do type 3 and type 4. Standard errors are in parentheses. *** p < 0.01, ** p < 0.05, * p < 0.1.

Table 1.3 reports the estimated parameters in Equation (1.4).
It shows thatthe test scores vary across the four types and confirms that test scores are helpfulto nonparametric identification of the finite mixture model. In addition, table 1.3suggests that it is important to take into account the impact of education on testscores. A two-year college attendance significantly increases math and verbaltest scores and a four-year college attendance significantly improves all the sixtest scores. The finding that education improves test scores implies that the testscores are not perfect proxies and using them as proxies would not help addressingthe endogeneity issue to give unbiased estimates of occupation-specific returns toattending a two-year college and a four-year college.Table 1.4 reports the estimated parameters in Equation (1.1). It shows that re-turns to education are occupation specific. The return to attending a two-year collegeis significantly higher in a blue-collar occupation than a white-collar occupation.A two-year college attendance increases blue-collar hourly payment by 24% andwhite-collar hourly payment by 17%. Regarding the returns to attending a four-yearcollege, a four-year college attendance significantly increases more white-collarhourly wages than blue-collar wages. A four-year college attendant’s hourly wageis 23% higher in a blue-collar occupation and 30% higher in a white-collar occu-pation than a high school graduate. Comparing a two-year college and a four-yearcollege, these two kinds of postsecondary education institutions increase blue-collarwages similarly while a four-year college attendance is significantly more helpful toenhancing white-collar wages than a two-year college attendance does.Converting the returns to attending a two-year college and a four-year collegeinto annual returns, the corresponding annual return12 to two-year college educationis 20% in a blue-collar occupation, and 14% in a white-collar occupation. Thecorresponding annual return to four-year college education is 6% in blue-collaroccupation, and 8% in white-collar occupation.Among the people with post-secondary education in the sample, 27% aretwo-year college attendants and 73% are four-year college attendants. Hence,on average one year post-secondary education increases blue-collar wages by 10%12According to Table 1.1, two-year college attendants have 1.20 year more schooling than highschool graduates and four-year college attendants have 3.82 year more schooling than high schoolgraduates on average.25Table 1.4: Estimated Wage Parameters (Equation (1.1))Blue Collar White CollarConstantsType 1 and 2 6.932 ∗∗∗ 7.044 ∗∗∗(0.021) (0.033)Type 3 and 4 6.485 ∗∗∗ 6.457 ∗∗∗(0.021) (0.038)2-year college 0.243 ∗∗∗ 0.170 ∗∗∗(0.024) (0.036)4-year college 0.231 ∗∗∗ 0.296 ∗∗∗(0.022) (0.031)Blue-collar experience 0.078 ∗∗∗ 0.033 ∗∗∗(0.008) (0.009)Blue-collar experience squared -0.285 ∗∗∗ -0.076(0.069) (0.075)White-collar experience 0.040 ∗∗∗ 0.079 ∗∗∗(0.012) (0.008)White-collar experience squared -0.044 -0.191 ∗∗(0.169) (0.084)Hypothesis testing p-valueblue-collar return=white-collar return, 2-year college 0.016blue-collar return=white-collar return, 4-year college 0.0182-year college return=4-year college return, blue-collar 0.3112-year college return=4-year college return, white-collar 0.000Dependent variable: log hourly salaryType 1 and type 2 have the same occupation abilities. So do type 3 and type 4.Type 1 and type 2 have different education psychic costs. 
So do type 3 and type 4.Standard errors are in parenthesis.*** p < 0.01, ** p < 0.05, * p < 0.1(20%× 27%+ 6%× 73%) and white-collar wages by 10%(14%× 27%+ 8%×73%). Keane and Wolpin (1997) find that the annual return to postsecondary educa-tion is 2.4% in a blue-collar occupation and 7% in a white-collar occupation. Theannual return to postsecondary education in a white-collar occupation in this paperis similar to that reported in Keane and Wolpin (1997). However, the annual returnto postsecondary education in a blue-collar occupation is higher in this paper thanin Keane and Wolpin (1997) where they do not distinguish a two-year college and afour-year college.The type-specific constants reported in table 1.4 suggest that individuals areendowed with different occupation abilities. Among the four types of individuals,26Table 1.5: Estimated Average Partial Effects, Occupation ChoiceInitial job Subsequent jobsConstantType 1 and 2 -0.246 ∗∗∗ -0.137 ∗∗(0.057) (0.071)Deviation of type 3 and 4 from type 1 and 2 -0.162 ∗∗∗ -0.056 ∗∗(0.063) (0.027)2-year college 0.159 ∗∗∗ 0.042 ∗∗(0.044) (0.019)4-year college 0.503 ∗∗∗ 0.122 ∗∗∗(0.028) (0.046)Blue-collar experience -0.044(0.193)Blue-collar experience squared 0.262 ∗∗(0.128)White-collar experience 0.054(0.120)White-collar experience squared -0.266 ∗∗(0.137)White-collar job in the last period 0.287 ∗∗∗(0.079)The average partial effect is calculated as the average of the partial effect of each individual.Type 1 and type 2 have the same occupation abilities. So do type 3 and type 4.Type 1 and type 2 have different education psychic costs. So do type 3 and type 4.Standard errors are in parenthesis.*** p < 0.01, ** p < 0.05, * p < 0.1type 1 and type 2 share the same occupation abilities, but are different in theeducation psychic costs. Type 3 and type 4 have the same occupation abilities,but different education psychic costs. Although the occupation abilities and theeducation psychic costs may be correlated, the education psychic costs do notdirectly affect wages as assumed. So type 1 and type 2 earn the same, and so do type3 and type 4. As reported in table 1.4, type 1 and type 2 earn more in a white-collaroccupation than a blue-collar occupation. In other words, type 1 and type 2 havea comparative advantage in a white-collar occupation. On the other hand, type 3and type 4 are similarly productive in a blue-collar and a white-collar occupation,because they earn similar wages in a white-collar and a blue-collar occupation.Table 1.5 shows the estimated average partial effects in the occupation choice.Column 1 in table 1.5 reports the estimated average partial effects in making theoccupation choice in the first job. It indicates that individuals with a comparative27advantage in a white-collar occupation (type 1 and type 2) are 16% more likely tochoose to work in white-collar jobs than those with a comparative advantage ina blue-collar occupation (type 3 and type 4). Moreover, education increases theprobability of being employed in a white-collar occupation. Comparing to highschool graduates, two-year college attendants are 16% more likely to work in awhite-collar occupation and four-year college attendants are 50% more likely towork in a white-collar occupation.Column 2 in table 1.5 reports the estimated average partial effects in makingthe occupation choice in the sequential jobs. 
It shows that individuals with acomparative advantage in a white-collar occupation (type 1 and type 2) are 6% morelikely to work in white-collar jobs than those with a comparative advantage in a blue-collar occupation (type 3 and type 4) in the sequential jobs. Education has smallerinfluence on occupation in subsequent jobs than initial jobs. Attending a two-yearcolleges and a four-year college increase the probability of being employed by awhite-collar occupation by 4% and 12% respectively. One important factor whichaffects the occupation choice is the occupation in the previous period. An individualwho worked in a white-collar occupation in the previous period is 29% more likelyto work in a white-collar occupation in the current period than an individual whoworked in a blue-collar occupation in the previous period does.13Regarding the postsecondary education choice, if individuals consider theirfuture occupations when making their education decisions, we would expect indi-viduals with a comparative advantage in a white-collar occupation (type 1 and type2) to be more likely to attend a four-year college than individuals earn similarly in ablue-collar occupation and a white-collar occupation (type 3 and type 4). Table 1.6reports the average partial effects in the postsecondary education choice. It showsthat type 1 and type 2 are 53% more likely to attend a four-year college than type 3and type 4. This finding confirms that the occupation abilities affect the education13I have consider the case that individuals may have different occupation tastes. For example, thosewho enjoy working outdoors may prefer a construction worker position to an economist position.To copy with the potential heterogeneity in occupation taste, I estimate a finite mixture model with8 types. Specifically, I consider two occupation abilities types. Within each occupation abilitiestype, there are two education psychic costs type. In addition, I look at two occupation taste typesfor individuals with the same occupation abilities and education psychic costs. I do not find thatindividuals with the same occupation abilities and education psychic costs behave differently in theoccupation choices.28Table 1.6: Estimated Average Partial Effects, Educational Choice2-year College 4-year CollegeConstantType 1 -0.028 -0.293(0.179) (0.283)Deviation of type 2 from type 1 0.083 -0.416 ∗∗∗(0.828) (0.11)Deviation of type 3 from type 1 0.092 -0.525 ∗∗∗(0.112) (0.116)Deviation of type 4 from type 1 0.094 -0.204 ∗∗∗(0.792) (0.085)Mother education -0.004 0.038(0.645) (0.078)Father education 0.004 0.031(0.735) (0.087)#siblings -0.001 -0.021(0.103) (0.086)Broken family at age 14 0.032 -0.066(0.063) (0.126)South at age 14 -0.013 0.077 ∗∗(0.059) (0.033)Urban at age 14 0.059 0.044(0.070) (0.043)The average partial effect is calculated as the average of the partial effect of each individual.Type 1 and type 2 have the same occupation abilities. So do type 3 and type 4.Type 1 and type 2 have different education psychic costs. So do type 3 and type 4.Standard errors are in parenthesis.*** p < 0.01, ** p < 0.05, * p < 0.1choice. Further, the results in table 1.6 suggest that individuals take into account theeducation psychic costs when making their education decisions. Although type 1and type 2 share the same occupation abilities, type 1 is 42% more likely to attend afour-year college than type 2 is, which indicates that type 1 has lower psychic coststo attend a four-year college than type 2 does. 
Similarly, type 4 are more likely toattend a four-year college than type 3 does, which suggests that type 4 has lowerpsychic costs to attend a four-year college than type 3 does. Regarding the decisionto attend a two-year college, the similarity across the four types in the decision toattend a two-year college suggests no self-selection in attending a two-year college.To sum up, individuals self select into different education groups based on theiroccupation abilities and education psychic costs. Their occupation choices are alsoinfluenced by their occupation abilities. Failure to address the endogenous education29Table 1.7: Occupation-Specific Returns(1) (2) (3) (4)2-year college × blue-collar 0.224 ∗∗∗ 0.188 ∗∗∗ 0.277 ∗∗∗ 0.243 ∗∗∗(0.024) (0.023) (0.024) (0.024)2-year college × white-collar 0.136 ∗∗∗ 0.105 ∗∗∗ 0.192 ∗∗∗ 0.170 ∗∗∗(0.034) (0.031) (0.035) (0.036)4-year college × blue-collar 0.246 ∗∗∗ 0.172 ∗∗∗ 0.237 ∗∗∗ 0.231 ∗∗∗(0.022) (0.023) (0.022) (0.022)4-year college × white-collar 0.295 ∗∗∗ 0.227 ∗∗∗ 0.293 ∗∗∗ 0.296 ∗∗∗(0.029) (0.027) (0.030) (0.031)Column (1): OLS estimates of the occupation-specific returns to educationColumn (2): OLS estimates of the occupation-specific returns to education when six test scores are includedas proxies for occupation abilitiesColumn (3): estimates of the occupation-specific returns to education when controlling for occupationabilities onlyColumn (4): estimates of the occupation-specific returns to education when controlling for both occupationabilities and education psychic costsAll the regressors in Equation (1.1) are included.Standard errors are in parenthesis.*** p < 0.01, ** p < 0.05, * p < 0.1and occupation choices can result in biased estimates of occupation-specific returns.Due to the complication of the self-selection problem here, it is hard to tell thedirection of the possible bias. I compare the estimates of occupation-specificreturns to education when controlling or not controlling for occupation abilitiesand/or education psychic costs in table 1.7. Column (4) presents the estimates ofoccupation-specific returns to attending a two-year college and a four-year collegecontrolling both the occupation abilities and education psychic costs (the sameestimates as those reported in table 1.4). Column (1) gives the OLS estimates ofthe occupation-specific returns to education. The OLS estimates of the occupation-specific returns to attending a two-year college are slightly lower than those incolumn (4) and the OLS estimates of the occupation-specific returns to attending afour-year college are comparable to those in column (3). Column (2) in table 1.7shows the OLS estimates of the occupation-specific returns to education when sixtest scores are included as proxies for occupation abilities and education psychiccosts. Using test scores as proxies requires that education does not affect testscores. However, the results in table 1.3 suggest that education helps to improvethe performance in all the six tests. The estimated returns to attending a two-year30college and a four-year college in column (2) are around 6 percentage points lowerthan those reported in column (4). Column (3) in table 1.7 gives the estimates ofthe occupation-specific returns to education only controlling for the occupationabilities14. The estimates of the occupation-specific returns to attending a two-yearcollege in column (3) are larger than both the OLS estimates in column (1) and thosein column (4). 
The estimates of the occupation-specific returns to attending a four-year college in column (3) are comparable to those in column (1) and column (4).The results in table 1.7 suggest that the possible biases are of different direction andcancel out each other although education and occupation choices are endogenous.1.5.2 Conditional Independence of Wages and Test ScoresOne of the key assumptions of the nonparametric identification of the finite mixturemodel is that conditional on type and some other observables the six test scoresdo not directly affect wages. I test the validation of this conditional independenceassumption by including the six test scores one by one into the wage equation(Equation (1.1)). The idea is that assuming the finite mixture model is nonparamet-rically identified using the other test scores, the coefficient on test score r, whichis included in the wage equation, should be zero when test score r and wages areconditionally independent.14Here, I estimate a finite mixture model in which there are two occupation abilities types andeveryone has the same education psychic costs. Since the variations in wages across occupations overtime are sufficient to nonparametrically identify occupation abilities, I do not use test scores as anadditional source of nonparametric identification31Table 1.8: Testing Conditional Independence of Wages and Test Scores(1) (2) (3) (4) (5) (6)2-year college × blue-collar 0.203 ∗∗∗ 0.275 ∗∗∗ 0.250 ∗∗∗ 0.235 ∗∗∗ 0.243 ∗∗∗ 0.240 ∗∗∗(0.024) (0.025) (0.025) (0.024) (0.024) (0.024)2-year college × white-collar 0.123 ∗∗∗ 0.198 ∗∗∗ 0.176 ∗∗∗ 0.163 ∗∗∗ 0.170 ∗∗∗ 0.165 ∗∗∗(0.037) (0.034) (0.035) (0.037) (0.036) (0.036)4-year college × blue-collar 0.185 ∗∗∗ 0.291 ∗∗∗ 0.239 ∗∗∗ 0.205 ∗∗∗ 0.231 ∗∗∗ 0.227 ∗∗∗(0.024) (0.023) (0.023) (0.022) (0.022) (0.022)4-year college × white-collar 0.241 ∗∗∗ 0.357 ∗∗∗ 0.305 ∗∗∗ 0.267 ∗∗∗ 0.296 ∗∗∗ 0.291 ∗∗∗(0.034) (0.031) (0.032) (0.032) (0.031) (0.031)Math 0.027 ∗∗∗(0.010)Verbal -0.044 ∗∗∗(0.008)Coding -0.009(0.008)Mechanic 0.025 ∗∗∗(0.007)Locus of control 0.0003(0.008)Self-esteem 0.008(0.008)All the regressors in Equation (1.1) are included.Standard errors are in parenthesis.*** p< 0.01, ** p< 0.05, * p< 0.132Table 1.8 presents the estimated coefficients on the six test scores. The estimatedcoefficients on coding speed, Rotter locus of control, and Rosenberg self-esteemscale are not significantly different from zero. According to Proposition 1.4, thefinite mixture model can be nonparametrically identified because we have threetests satisfied the conditional independence assumption. However, we reject thehypothesis that the conditional independence assumption holds for math skill, verbalskill, and mechanical comprehension. The reason of the finding that math, verbal,mechanical comprehension scores affect wages is that I only consider a small numberof types (two occupation abilities types and two education psychic costs types). It ispossible that there are heterogeneity in occupation abilities and education psychiccosts within each of the four types and math skill, verbal skill, and mechanicalcomprehension scores are informative about these within type heterogeneity. Icheck the sensitivity of the estimates in two ways. First, I include the math, verbal,and mechanical comprehension scores together into the wage equation and checkwhether the estimates of the occupation-specific returns to education are differentfrom those reported in table 1.4. 
Table 1.9 shows that the estimates of the occupation-specific returns to education are around 2 to 5 percentage points smaller than thosein table 1.4. Second, I increase the number of types from 4 (two types of occupationabilities and two types of education psychic costs) to 6 (three types of occupationabilities and two types of education psychic costs) and the corresponding estimatesin the wage equation are presented in table 1.10. Table 1.10 shows that the estimatesof occupation-specific returns to education are around 1 to 4 percentage pointssmaller than those reported in table 1.4.1.5.3 The Occupation-Specific Returns to A Bachelor’s DegreeThe wage gap between college dropouts and college graduates are documented inthe literature (Jaeger and Page, 1996). It is interesting to examine the occupation-specific returns to college graduate besides the occupation-specific returns to collegeattendants. Due to the small sample size of the two-year college graduates, I focuson investigating the occupation-specific returns for those obtained a bachelor’sdegree.Among the four-year college attendants in my sample, around 70% obtained33Table 1.9: Estimated Wage Parameters, with Three Test Scores in the WageEquationBlue Collar White CollarConstantsType 1 and 2 6.914 ∗∗∗ 7.048 ∗∗∗(0.021) (0.032)Type 3 and 4 6.461 ∗∗∗ 6.443 ∗∗∗(0.023) (0.038)2-year college 0.222 ∗∗∗ 0.121 ∗∗∗(0.023) (0.034)4-year college 0.228 ∗∗∗ 0.273 ∗∗∗(0.024) (0.032)Blue-collar experience 0.075 ∗∗∗ 0.034 ∗∗∗(0.008) (0.009)Blue-collar experience squared -0.278 ∗∗∗ -0.125 ∗∗(0.069) (0.074)White-collar experience 0.050 ∗∗∗ 0.080 ∗∗∗(0.012) (0.009)White-collar experience squared -0.158 -0.205 ∗∗∗(0.161) (0.084)Test ScoresMath 0.085 ∗∗∗(0.013)Verbal -0.137 ∗∗∗(0.012)Mechanical comprehension 0.068 ∗∗∗(0.009)Dependent variable: log hourly salaryType 1 and type 2 have the same occupation abilities. So do type 3 and type 4.Type 1 and type 2 have different education psychic costs. So do type 3 and type 4.The coefficients on the three test scores are restricted to be the same acrossoccupations.Standard errors are in parenthesis.*** p < 0.01, ** p < 0.05, * p < 0.1a bachelor’s degree. A simple comparison of the first year wage of the four-yearcollege dropouts and the four-year college graduates shows that the four-yearcollege dropouts and the four-year college graduates earn similarly in a blue-collaroccupation, yet the four-year college graduates earn 30% more than the four-yearcollege dropouts in a white-collar occupation15.Table 1.11 presents the estimated parameters of the wage equation (Equation15For more summary statistics of the college dropouts and college attendants, please refer to tableA.1.34Table 1.10: Estimated Wage Parameters in the Wage Equation (6 types)Blue Collar White CollarConstantsType 1 and 2 6.998 ∗∗∗ 7.116 ∗∗∗(0.027) (0.034)Type 3 and 4 6.501 ∗∗∗ 6.439 ∗∗∗(0.026) (0.041)Type 5 and 6 6.609 ∗∗∗ 6.679 ∗∗∗(0.021) (0.037)2-year college 0.235 ∗∗∗ 0.126 ∗∗∗(0.023) (0.036)4-year college 0.238 ∗∗∗ 0.283 ∗∗∗(0.021) (0.032)Blue-collar experience 0.076 ∗∗∗ 0.03 ∗∗∗(0.008) (0.009)Blue-collar experience squared -0.284 ∗∗∗ -0.063(0.07) (0.078)White-collar experience 0.042 ∗∗∗ 0.078 ∗∗∗(0.011) (0.008)White-collar experience squared -0.075 -0.211 ∗∗∗(0.162) (0.086)Dependent variable: log hourly salaryType 1 and type 2 have the same occupation abilities. So do type 3 and type 4as well as type 5 and type 6.Type 1 and type 2 have different education psychic costs. 
So do type 3 andtype 4 as well as type 5 and type 6.Standard errors are in parenthesis.*** p < 0.01, ** p < 0.05, * p < 0.1(1.1)) using the sample where the four-year college dropouts are eliminated. Itshows that a bachelor’s degree increases blue-collar wages by 26% and white-collarwages by 33% for a high school graduate. Comparing to the returns to attending afour-year college as reported in table 1.4, ie. 23% and 30% respectively for a blue-collar occupation and a white-collar occupatio, the estimated occupation-specificreturns to a bachelor’s degree are not much higher.1.5.4 The Expected Returns to EducationIndividuals make their education choices taking into account their future occupations.Yet, they do not know exactly their occupations because of the uncertainty in thelabour market. Therefore, their education choices are based on the expected returnsto attending a two-year and a four-year college. Below, I calculated the expected35Table 1.11: The Occupation-Specific Returns to A Bachelor’s DegreeBlue Collar White CollarConstantsType 1 and 2 6.914 ∗∗∗ 6.983 ∗∗∗(0.024) (0.035)Type 3 and 4 6.465 ∗∗∗ 6.425 ∗∗∗(0.023) (0.042)2-year college 0.257 ∗∗∗ 0.191 ∗∗∗(0.025) (0.036)Bachelor’s degree 0.257 ∗∗∗ 0.328 ∗∗∗(0.035) (0.033)Blue-collar experience 0.084 ∗∗∗ 0.047 ∗∗∗(0.009) (0.01)Blue-collar experience squared -0.326 ∗∗∗ -0.162 ∗∗(0.076) (0.08)White-collar experience 0.038 ∗∗∗ 0.083 ∗∗∗(0.014) (0.01)White-collar experience squared -0.049 -0.214 ∗∗(0.189) (0.094)Dependent variable: log hourly salaryType 1 and type 2 have the same occupation abilities. So do type 3 and type 4.Type 1 and type 2 have different education psychic costs. So do type 3 and type 4.The four-year college dropouts are eliminated from the sample.Standard errors are in parenthesis.*** p < 0.01, ** p < 0.05, * p < 0.1returns to education by simulating a sample of 10000 observations.Panel A of table 1.12 shows that the expected returns to attending a two-yearcollege are around 23% for type 1 and type 2 (individuals with a comparativeadvantage in a white-collar occupation) over time16, and they are around 22%for type 3 and type 4 (individuals with a comparative advantage in a blue-collaroccupation) over time. Returns to attending a two-year college are similar to alltypes, which explains that individuals do not select to attend a two-year collegebased on their occupation abilities as suggested by the results in table 1.6. Regardinga four-year college, the expected returns to attending a four-year college for type 1and type 2 are 34% in the first year and increase to 40% nine years later. Returns16Since the occupation choice depend on the occupation abilities and not influenced by educationpsychic costs directly, the expected returns to attending a two-year college are the same for individualswith same occupation abilities and different education psychic costs. 
That is to say that type 1 andtype 3 have the same expected returns to attending a two-year college, and type 2 and type 4 gain thesame in earnings from attending a two-year college36Table 1.12: Expected Returns to Education2-Year College 4-Year CollegeType 1 and 2 Type 3 and 4 Type 1 and 2 Type 3 and 4Panel A: Total1st year 0.238 0.225 0.337 0.2515th year 0.225 0.212 0.349 0.25510th year 0.239 0.22 0.397 0.287Panel B: Occupation-Specific Skills Accumulation1st year 0.216 0.229 0.277 0.2645th year 0.206 0.221 0.283 0.27110th year 0.201 0.221 0.288 0.275Panel C: Better Occupation Match1st year 0.022 -0.004 0.06 -0.0135th year 0.019 -0.009 0.065 -0.01610th year 0.038 -0.001 0.109 0.011Calculation is based on the simulation of 10000 observations.Panel A: total expected returns to attending a two-year college and a four-year collegePanel B: expected returns to education from enhancing the occupation-specific skillsPanel C: expected returns to education from increasing the probability of being employed ina white-collar occupationTotal expected returns to education (Panel A) is the sum of the expected returns from enhancingthe occupation-specific skills (Panel B) and the expected returns from increasing the probabilityof being employed in a white-collar occupation (Panel C).Type 1 and type 2 have the same occupation abilities. So do type 3 and type 4.Type 1 and type 2 have different education psychic costs. So do type 3 and type 4.to attending a four-year college for type 3 and type 4 are 25% in the first yearand increase to 29% in the tenth year. Returns to attending a four-year college arearound 9 percentage points higher for type 1 and type 2 than type 3 and type 4 inthe first year and the difference increases to 11 percentage points in the 10th year.Therefore, individuals with a comparative advantage in a white-collar occupation(type 1 and type 2) are more likely to attend a four-year college than those with acomparative advantage in a blue-collar occupation (type 3 and type 4) as shown intable 1.6. Comparing returns to attending a two-year and a four-year college, returnsto attending a four-year college are 10 percentage points higher than returns to atwo-year college in the beginning and the discrepancy is enlarged to 16 percentagepoints after nine years for type 1 and type 2. For type 3 and type 4, returns toattending a four-year college are 3 percentage points higher than returns to attending37a two-year college in the beginning and the difference increases to 7 percentagepoints after nine years. The difference between returns to attending a two-yearcollege and a four-year college echoes the findings in Belzil and Hansen (2002) thatreturns can be education-level-specific.The expected returns to attending a two-year and a four-year college are differentacross types of people with different occupation abilities. The relationship betweenreturns to education and innate abilities are well documented in the literature (Belziland Hansen, 2007; Carneiro et al. , 2003). The reason of the correlation between theexpected returns to education and the occupation abilities is that the expected returnsto education are related to the probability of working in a white-collar occupation,which depend on occupation abilities. The expected returns to education and theprobability of working in a white-collar occupation can be related in two ways. First,education helps accumulating white-collar and blue-collar skills differently. 
For example, attending a four-year college increases white-collar skills more than blue-collar skills. Therefore, individuals with a comparative advantage in a white-collar occupation, who are more likely to work in a white-collar occupation, have higher returns to attending a four-year college on average. Second, education enhances the probability of working in a white-collar occupation. For individuals with a comparative advantage in a white-collar occupation, the reward to their occupation abilities is higher in a white-collar occupation than in a blue-collar occupation. Attending a four-year college increases the probability of working in a white-collar occupation and thereby leads to a higher reward to their occupation abilities. I decompose the returns to education into these two parts: enhancing occupation-specific skills and increasing the probability of working in a white-collar occupation.

Panel B of table 1.12 shows the part of returns to education from enhancing occupation-specific skills, and panel C of table 1.12 presents the part of returns to education from increasing the probability of being employed in a white-collar occupation. Let us first look at the decomposition of the expected returns to attending a two-year college. For individuals with a comparative advantage in a white-collar occupation (type 1 and type 2), a two-year college attendance increases wages by 22% through enhancing occupation-specific skills. The part of expected returns from increasing the probability of working in a white-collar occupation is 2% at the beginning and 4% at the end. As type 1 and type 2 become more likely to work in a white-collar occupation, the latter part of their expected returns to attending a two-year college increases. For individuals with a comparative advantage in a blue-collar occupation (type 3 and type 4), the expected returns from enhancing occupation-specific skills are around 22% over the ten years, which is comparable to those of type 1 and type 2 in magnitude. The part of expected returns from increasing the probability of being employed in a white-collar occupation is close to zero over time. This is because type 3 and type 4 are rewarded similarly for their occupation abilities in both occupations. Next, let us look at the decomposition of the expected returns to attending a four-year college. For type 1 and type 2, the part of expected returns to attending a four-year college due to occupation-specific skills accumulation is around 28%. The part of expected returns from increasing the probability of working in a white-collar occupation increases from 6% to 14% as the probability of working in a white-collar occupation rises from 71% to 83% over time. After ten years in the labour market, 65% of the total expected returns to attending a four-year college are from its impact on occupation-specific skills accumulation for type 1 and type 2. For type 3 and type 4, the part of expected returns to attending a four-year college due to occupation-specific skills accumulation is around 26%, which is comparable to that for type 1 and type 2 in magnitude.
The part of expected returns to attending a four-year college from its influence on occupation affiliation is close to zero, because type 3 and type 4 are rewarded similarly for their occupation abilities in a blue-collar and a white-collar occupation.

The increasing expected returns to attending a four-year college for individuals with a comparative advantage in a white-collar occupation (type 1 and type 2) imply a faster wage growth rate for four-year college attendants than for high school graduates. This finding is consistent with Willis and Rosen (1979), who find that a college attendant's wage grows faster than a high school graduate's. This chapter suggests that one important reason for the faster wage growth of four-year college attendants is that, over time, they switch into the occupation in which they have a comparative advantage.

1.5.5 Test Scores and Returns to Education

As shown in table 1.3, test scores are informative about individuals' occupation abilities and education psychic costs. Once the type-specific joint distributions of the six test scores are identified, we can obtain the probabilities of the types conditional on the six test scores using Bayes' rule. In other words, we are able to tell which type an individual is most likely to be given her six test scores and demographic information. Further, we can infer her expected returns to attending a two-year college and a four-year college.

I simulate the six test scores for 10,000 high school graduates whose parents are high school graduates, who have three siblings, were raised in a two-parent family, lived in the northern urban area of the U.S. at age 14, and took the six tests at age 18. For simplicity, I divide the six test scores into two groups: cognitive tests (math skills, verbal skills, coding speed, and mechanical comprehension) and noncognitive tests (the Rotter Locus of Control and the Rosenberg Self-Esteem Scale). I then calculate the average scores, Q̄c and Q̄nc, for the two groups. For further simplicity, each of the two average scores is partitioned into four intervals. The proportion of each type conditional on the test scores is presented in table 1.13. For example, consider an individual with high-school-graduate parents and three siblings who was raised in a two-parent family, lived in the northern urban area of the U.S. at age 14, and took the cognitive and noncognitive tests at age 18. If all her test scores are at the 10th percentile, her average scores fall in the cell in row 1 and column 1, and she is 0.1% likely to be type 1, 61.5% likely to be type 2, 29.8% likely to be type 3, and 8.6% likely to be type 4. If all her test scores are at the 50th percentile, her average scores fall in the cell in row 2 and column 3, and she is 15% likely to be type 1, 0.1% likely to be type 2, 32.5% likely to be type 3, and 52.3% likely to be type 4. If all her test scores are at the 90th percentile, her average scores fall in the cell in row 4 and column 4, and she is 90.9% likely to be type 1, 0% likely to be type 2, 0% likely to be type 3, and 9.1% likely to be type 4. Once the conditional proportions of the types are known, her returns to education can be calculated accordingly.
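To make the Bayes'-rule step explicit, the posterior type probabilities can be written, in the notation of Proposition 1.4 and treating the type proportions as covariate-free for simplicity, as
\[
P(\text{type}_i = m \mid Q_i, S_i, Z_i)
 \;=\;
 \frac{\pi^{m}\, f^{m}(Q_i \mid S_i, Z_i)}
      {\sum_{m'=1}^{M} \pi^{m'}\, f^{m'}(Q_i \mid S_i, Z_i)},
 \qquad m = 1, \dots, M,
\]
where $Q_i$ collects the six test scores, $S_i$ and $Z_i$ are the schooling and demographic variables, $\pi^{m}$ is the population proportion of type $m$, and $f^{m}$ is the type-$m$ joint density of the scores. An individual's expected returns can then be computed as the average of the type-specific expected returns weighted by these posterior probabilities, which is one natural way to read "calculated accordingly" above.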
Table1.14 shows the expected returns to attending a two-year college and a four-yearcollege for such an individual with test scores at the 10th, 50th, and 90th percentile.Since returns to two-year college are similar to all types, the expected returns to40Table 1.13: Posterior Probabilities of Types Conditional on Test Scores(a) Type 1Q¯nc ≤−0.5 −0.5 < Q¯nc ≤ 0 0 < Q¯nc ≤ 0.5 Q¯nc ≥ 0.5Q¯c ≤−0.5 0.001 0 0.013 0.007−0.5 < Q¯c ≤ 0 0.065 0.127 0.150 0.1910 < Q¯c ≤ 0.5 0.328 0.459 0.565 0.680Q¯c ≥ 0.5 0.782 0.777 0.873 0.909(b) Type 2Q¯nc ≤−0.5 −0.5 < Q¯nc ≤ 0 0 < Q¯nc ≤ 0.5 Q¯nc ≥ 0.5Q¯c ≤−0.5 0.615 0.54 0.445 0.357−0.5 < Q¯c ≤ 0 0.000 0.001 0.001 0.0020 < Q¯c ≤ 0.5 0.000 0.000 0.000 0.000Q¯c ≥ 0.5 0.000 0.000 0.000 0.000(c) Type 3Q¯nc ≤−0.5 −0.5 < Q¯nc ≤ 0 0 < Q¯nc ≤ 0.5 Q¯nc ≥ 0.5Q¯c ≤−0.5 0.298 0.353 0.432 0.529−0.5 < Q¯c ≤ 0 0.313 0.343 0.325 0.3270 < Q¯c ≤ 0.5 0.056 0.043 0.033 0.031Q¯c ≥ 0.5 0.000 0.006 0.000 0.000(d) Type 4Q¯nc ≤−0.5 −0.5 < Q¯nc ≤ 0 0 < Q¯nc ≤ 0.5 Q¯nc ≥ 0.5Q¯c ≤−0.5 0.086 0.106 0.111 0.107−0.5 < Q¯c ≤ 0 0.623 0.529 0.523 0.4810 < Q¯c ≤ 0.5 0.616 0.498 0.402 0.289Q¯c ≥ 0.5 0.218 0.217 0.127 0.091Calculation is based on the simulation of 10000 high school graduates whose parents are high school graduates,who have three siblings, were raised in a two-parent family, lived in the northern urban area of U.S. at age 14,and took the six test scores at age 18.Q¯c is the average (standardized) math, verbal, coding speed, mechanical comprehension scores.Q¯nc is the average (standardized) Rotter Locus of Control and Rosenberg Self-Esteem Scale.Each cell shows the probability of belonging to a specific type given that Q¯c and Q¯nc fall in a specific region.41Table 1.14: Expected Returns to Education, By Test Scores2-Year College 4-Year College10th 50th 90th 10th 50th 90thPercentile Percentile Percentile Percentile Percentile Percentile1st year 0.230 0.232 0.238 0.280 0.293 0.3335th year 0.217 0.219 0.225 0.286 0.300 0.34210th year 0.225 0.228 0.236 0.323 0.339 0.388Calculation is based on the simulation of 10000 observations.two-year college are almost the same for different test scores. Regarding a four-year college, the expected returns to attending a four-year college increase as testscores increase. The reason is that high test scores imply a high probability of acomparative advantage in a white-collar occupation, and a comparative advantagewith a white-collar occupation are associated with a high returns to attending afour-year college.1.6 ConclusionIn this paper, I examine the returns to attending a two-year college and a four-year college and how the returns to education differ from those of a white-collaroccupation to those of a blue-collar occupation. Despite a vast literature on returnsto education, the existing research on how returns to attending a two-year collegeand a four-year college depend on the occupation choice is limited. The reasonfor this limitation is that it is difficult to estimate the occupation-specific returns inthe presence of endogenous education and occupation choices. On the one hand,individuals are endowed with different abilities to work in a blue-collar occupationor a white-collar occupation. They tend to work in the occupation in which theyhave a comparative advantage. Moreover, they are more likely to choose the typeof postsecondary education that intensively accumulates the skills needed in theoccupations they would like to work in when they finish schooling. Therefore,occupation abilities drive wages, education, and occupation. 
On the other hand, individuals vary in their education psychic costs, which may be correlated with occupation abilities. While the occupation abilities and the education psychic costs are known to the individuals making education and occupation decisions, these abilities and costs are unobserved by the econometrician, leading to an omitted variable problem. The instrumental variables (IV) approach, conventionally used to deal with the endogeneity issue in the returns-to-education literature, is difficult to implement here simply because good instruments for both education and occupation are difficult to find.

I address the endogeneity of education and occupation by explicitly modeling the sequential education and occupation choices, specifying the unobserved occupation abilities and education psychic costs with a flexible multinomial distribution in a finite mixture model. I show how to nonparametrically identify the occupation abilities using the variations in wages across occupations over time. However, the information from the panel data alone is not enough to identify the education psychic costs. In order to achieve nonparametric identification of the education psychic costs, I use test scores such as those of the ASVAB, the Rotter Locus of Control, and the Rosenberg Self-Esteem Scale. I show that the education psychic costs can be nonparametrically identified under the assumption that, conditional on occupation abilities and education psychic costs, the test scores do not directly affect wages, education, or occupation choices.

Using data taken from the National Longitudinal Survey of Youth (NLSY) 1979, I estimate a parametrically specified finite mixture model for joint wages, education, occupation, and test scores and find that returns to education are occupation-specific. Specifically, I find that attendance of a two-year college enhances blue-collar wages by 24% and white-collar wages by 17%, while attendance of a four-year college increases blue-collar wages by 23% and white-collar wages by 30%.

Chapter 2

M-Estimation with Complex Survey Data

2.1 Introduction

The complex survey design, also known as the stratified multistage clustered sample design, is widely used in large-scale surveys. For example, the Monthly Current Population Survey (CPS, US), the Panel Study of Income Dynamics (PSID, US), the Labour Force Survey (LFS, Canada), and the Survey of Labour and Income Dynamics (SLID, Canada) employ the complex survey design.

In complex survey sampling, the population is partitioned into mutually exclusive and exhaustive subpopulations called strata. Each stratum is then partitioned into primary sampling units (psu), and each of the psu's is partitioned into secondary sampling units (ssu). In m-stage cluster sampling, this process is repeated m times to form finer and finer partitions. The sampling units created by the last stage of partitioning are called ultimate sampling units (usu). Though a usu is typically a group of individuals (e.g., individuals in a contiguous dwelling segment), it is possible that a usu is an individual. The number of stages of the recursive partitioning may differ from one stratum to another.

The complex survey method makes use of the recursive structure of sampling units described above for randomly selecting individuals into the sample in each stratum.
The selection of individuals starts with a few draws of psu's, followed by a few draws of ssu's within the selected psu's, which is further followed by a few draws of tertiary sampling units within the selected ssu's, and so on. Once usu's are selected in this manner, every individual in the selected usu's is included in the sample. The selection of sampling units is statistically independent across strata.

While the complex survey method is attractive in terms of sampling cost, it complicates statistical analysis relative to the simple random sampling method. First, the number of observations in each stratum is random. Second, conditional on the random number of observations, the observations are dependent in a complicated manner because of the recursive selection of sampling units. It is therefore important to study how such features of complex survey data should be incorporated into estimation and statistical inference in econometric analysis. For this reason, this chapter investigates the properties of M-estimators when they are used with complex survey data.

In the survey sampling literature, there are two prevailing views of the population; the principal difference between them is the source of randomness used to give stochastic structure to the inference. One is called a design base, which assumes that each individual in the population carries a constant (i.e., nonrandom) character vector. Following Neyman's (1934) seminal paper, the design-based approach regards the probability ascribed by the sampling design to the various subsets of the finite population as the primary source of randomness. For example, Krewski and Rao (1981), Binder (1983), and Sakata (2000) are design-based. The other is called a model base. It assumes that each individual draws a characteristic vector from a distribution (superpopulation). For example, Fuller (1984), Hung (1990), Chambless and Boyle (1985), and Breckling et al. (1994) are model-based. Cassel et al. (1977) and Sarndal et al. (1992) examine both design-based and model-based inference in detail. Sarndal et al. (1978) provide a comprehensive comparison of the design-based and model-based approaches. In this chapter, we provide a unified framework of which both a design base and a model base are special cases. Within this framework, we study the behavior of M-estimators, which include the widely used least squares and maximum likelihood estimators, and we consider how statistical inferences should be made based on M-estimation.

Specifically, we consider two stages of correction (relative to simple random sampling) to account for the features of complex survey data. First, we follow Horvitz and Thompson (1952) to deal with the fact that individuals are usually under- or over-represented in complex survey data. Horvitz and Thompson (1952) proposed estimating the total and the mean of a superpopulation in a stratified sample by weighting observations by the reciprocal of the inclusion probability, the probability of becoming part of the sample. We show that the resulting survey-weighted objective function, in which each observation is weighted in this way, unbiasedly estimates its population counterpart, and we define the weighted M-estimator as its maximizer. Second, standard errors of estimates computed from complex survey data differ from those computed from simple random samples: clustering typically increases standard errors while stratification reduces them, and the two effects do not cancel out in general. We derive the asymptotic distribution of the estimates and compute standard errors that are robust to the sample-design effects.
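As a concrete illustration of the first correction, the following sketch (illustrative only: the strata, sample sizes, and means are made up, and the formal setting below with psu's and ssu's is richer) contrasts an unweighted sample mean with a mean weighted by the reciprocal of the inclusion probability when one stratum is heavily over-sampled.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two strata of very different sizes, with different means of the variable of interest.
stratum_sizes = {"A": 9000, "B": 1000}
stratum_means = {"A": 1.0, "B": 5.0}
n_per_stratum = 500  # equal sample sizes, hence unequal inclusion probabilities

y_parts, w_parts = [], []
for h in stratum_sizes:
    population_h = rng.normal(stratum_means[h], 1.0, stratum_sizes[h])
    sample_h = rng.choice(population_h, size=n_per_stratum, replace=False)
    inclusion_prob = n_per_stratum / stratum_sizes[h]
    y_parts.append(sample_h)
    w_parts.append(np.full(n_per_stratum, 1.0 / inclusion_prob))  # survey weights

y = np.concatenate(y_parts)
w = np.concatenate(w_parts)

expected_pop_mean = sum(stratum_sizes[h] * stratum_means[h] for h in stratum_sizes) / sum(stratum_sizes.values())
print("population mean (expected):", expected_pop_mean)           # 1.4
print("unweighted sample mean:    ", y.mean())                    # biased toward the over-sampled stratum B
print("survey-weighted mean:      ", np.sum(w * y) / np.sum(w))   # close to 1.4
```

The same reweighting idea carries over to M-estimation: in Section 2.2 each term of the sample objective function is multiplied by the corresponding survey weight.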
Our analysis employs the standardasymptotic framework in the complex survey literature, in which the number ofstrata grows to infinity 1.The rest of this chapter is organized as follows. Section 2.2 describes ourassumptions about the data generating process and introduces the weighted M-estimator. Sections 2.3 and 2.4 establish consistency and asymptotic normality ofthe weighted M-estimator, respectively. Section 2.5 then discusses how to estimatethe asymptotic covariance matrix of the weighted M-estimator.2.2 M-EstimatorsLike many studies in the literature of estimation with complex survey data, thischapter employs the asymptotics along which the number of strata grows to infinity.Consider a sequence of populations, {PL}L∈N, such that PL consists of NL indi-viduals, each of whom carries numerical values for a set of v characteristics, X . Thepopulation PL is split into L strata. Let NLh denote the number of individuals inthe hth stratum in population PL. The value of X for the kth individual in the hth1Chen and Rao (2007) discuss large sample properties of statistical inferences in i.i.d. case takinginto account both the characteristics of the finite population and the sampling design employed.They show that the law of large numbers which assumes that both sample size and population sizeincrease to infinity does not hold in the context of finite population. However, asymptotic theory forindependent but not identically distributed (i.n.i.d) observations in such context, which is essential forcomplex survey data, is unavailable and not straightforward to derive. We will study large sampleproperties of the proposed weighted M-estimator with correction for finite population size in ourfuture work.46stratum in population PL, which is denoted X˜hk, is drawn from a v-variate distribu-tion Phk, where the draws of X˜hk are independent across individuals and strata. Inthis setup, the product measure ⊗Lh=1⊗NLhk=1 Phk can be viewed as the superpopulationdistribution of X’s behind the population PL. Note that Phk does not depend on L.We assume:Assumption 2.1. The double array, X˜ ≡ {X˜hk : (k,h) ∈ N2}, of v×1 random vec-tors on a probability space (Ω,F ,P) is independently (but not identically) dis-tributed. For each L ∈ N, the population PL consists of L stratum, and for eachh ∈ {1,2, . . . ,L}, the hth stratum contains NLh individuals (NLh ∈ N). For each(k,h,L) ∈ K≡ {(k˜, h˜, L˜) ∈ N3 : k˜ ≤ NLh, h˜≤ L˜}, the kth individual in the hth stra-tum in the Lth population PL carries X˜hk. The array {NLh : (h,L) ∈ H} satisfiesthatsup{NLh/(NL/L) : (h,L) ∈ H}< ∞, (2.1)where NL ≡ ∑Lh=1 NLh, L ∈ N.In Assumption 2.1, we impose independence on X˜ for simplicity. Even if weassume that X˜ is weakly dependent in each stratum instead, the results of thischapter would stay essentially the same. Because NL/L is the average stratum size,(2.1) requires that none of the stratum asymptotically dominates the others in size,reflecting that the sizes of strata are typically comparable to each other in practice.A sample design is a mechanism to select individuals in the population into thesample in each of the strata. Suppose there are nLh psu’s selected in stratum h andnLhi ultimate sampling units selected in the ith selected psu in stratum h of PL. LetCLhk denote the (random) number of times individual k in stratum h is selected intothe data set under population PL. We assume that the sample design satisfies:Assumption 2.2. (a) For each L ∈ N, the L collections of nonnegative integer-valued random variables,{CL1k : k ∈ {1, . 
. . ,NL1}}, . . . ,{CLLk : k ∈ {1, . . . ,NLL}}which are defined on (Ω,F ,P), are independent.47(b) The collection of random variables {CLhk : (k,h,L) ∈ K} is independent from{X˜hk : (k,h) ∈ N2}.(c) There exists C¯ ∈ N such that for each (k,h,L) ∈ K, the support of CLhk iscontained in the interval [0,C¯].(d) It holds thatp¯l ≡ inf{P[CLhk > 0]/NLh−1 : (k,h,L) ∈ K}> 0and thatp¯u ≡ sup{P[CLhk > 0]/NLh−1 : (k,h,L) ∈ K}< ∞.Assumption 2.2(a) means that the sampling is independent across strata, as isthe case virtually in all complex survey design. Assumption 2.2(b) reflects the factthat selection of individuals into the data set does not depend on the values of Xdrawn by the individuals. Assumption 2.2(c) requires that there be a maximumnumber of times an individual can be possibly selected into the sample, as is the casein typical sample designs (in fact, many of the designs allows an individual to beincluded in the sample at most once). Assumption 2.2(d) specifies that individuals’probabilities of selection into the sample proportional and the stratum populationsize are in inverse proportion in our asymptotics, loosely speaking.The mean number of times the kth individual in stratum h in population PL isselected into the sample is equal to E[CLhk]. We can view E[CLhk] as showing howwell each individual is represented in the sample design; the higher it is, the betterthe individual is represented. The reciprocal of E[CLhk], wLhk ≡ E[CLhk]−1, is calledthe survey weight of the kth individual in stratum h under the Lth population. Thesurvey designer can compute the survey weight, precisely knowing the distributionof CLhk. We let WLhi j denote the survey weight of the jth selected individual in theith selected psu in stratum h under the Lth populationPL. As Proposition 2.1 showsbelow, the survey weight can be used for unbiasedly estimating the population meanof variables in each stratum. In this chapter, the mixture of Ph1, . . . , PhNLh with equalweights 1/NLh is called the distribution of X in the hth stratum in the Lth populationand denoted P¯Lh. Also, the mixture of P¯Lh, . . . , P¯LL with the corresponding weights48NL1/NL, . . . , NLL/NL is called the distribution of X in the entire Lth population anddenoted P¯L.Proposition 2.1. Suppose that Assumptions 2.1 and 2.2 hold. Also, let {φLh :(h,L) ∈ H ≡ {(h˜, L˜) ∈ N2 : h˜ ≤ L˜}} be an array of Borel-measurable functionsfrom Rv to R such that for each k ∈ N and each (h,L) ∈ H,∫|φLh(x)|Phk(dx)< ∞.Then:(a) For each (h,L) ∈ H,E[N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j)∣∣∣ X˜]= N−1LhNLh∑k=1φLh(X˜hk).(b) For each (h,L) ∈ H,E[N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j)]=∫φLh(x)P¯Lh(dx).(c) For each L ∈ N,E[N−1LL∑h=1nh∑i=1nhi∑j=1WLhi jφLh(XLhi j)]=L∑h=1(NLh/NL)∫φLh(x) P¯Lh(dx).(d) If, in addition, φL1 = φL2 = · · ·= φLL for each L ∈ N, it holds that for eachL ∈ N,E[N−1LL∑h=1nh∑i=1nhi∑j=1WLhi jφLh(XLhi j)]=∫φL1(x) P¯L(dx).In Proposition 2.1, (i) shows that the sample average weighted by WLhi j/NLhcorrects the over- and under-representation of individuals and unbiasedly estimatethe population mean (given the individuals’ draws of X from the superpopulation).This property of the weighted average is called the design unbiasedness. The designunbiasedness immediately implies the unbiasedness of the weighted average, as(ii) claims. Given the unbiasedness of the weighted average, we can also estimate49superpopulation means unbiasedly, taking the average of the unbiased stratum meanestimator weighted by NLh/NL, as (iii) and (iv) states.We now introduce the parameter of interest. 
Assume:Assumption 2.3. The set Θ is nonempty compact subset of Rp (p ∈ N). Thefunction q : Rv×Θ→ R is Borel-measurable, and for each x ∈ Rv, q(x, ·) : Θ→ Ris continuous.Our parameter of interest is characterized as the maximizer of Q¯L : Θ→ Rdefined byQ¯L(θ)≡∫q(x,θ)P¯L(dx) =L∑h=1NLhNL∫q(x,θ)P¯Lh(dx), θ ∈Θ.For each θ ∈Θ, N−1Lh ∑nhi=1∑nhij=1WLhi jq(XLhi j,θ) estimates∫q(x,θ)P¯Lh(dx) unbias-edly. A natural estimator of the parameter is a (weighted) M-estimator constructedbased on this fact. For each L ∈ N, define a function QL : Ω×Θ→ R byQL(ω,θ)≡1NLL∑h=1nh∑i=1nhi∑j=1WLhi j(ω)q(XLhi j(ω),θ),and write QL(θ)≡ QL(·,θ), θ ∈Θ. The M-estimator is the maximizer of QL on Θ.We can easily verify the existence of the M-estimator by using the standard result onthe existence of the measurable maximum (Gallant and White, 1988, Lemma 2.1).Theorem 2.2. Suppose that Assumptions 2.1–2.3 hold. Then for each L ∈ N, thereexists a Θ-valued random vector θˆL such that QL(θˆL) = supθ∈Θ QL(θ).2.3 ConsistencyIn studying the large sample behavior of the M-estimators in our setup, the followingresult is useful.Proposition 2.1. Suppose that Assumptions 2.1 and 2.2 hold. Let Γ a set, a anonnegative real number, and {φLh : (h,L) ∈ H, } an array of measurable functionfrom Rv×Γ to R such that for each γ ∈ Γ and each (h,L) ∈ H, φLh(·,γ) : Rv→ R is50Borel measurable. Ifsupγ∈Γsup(h,L)∈H∫|φLh(x,γ)|a Pkh(dx)< ∞,then it holds that{∥∥∥N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j,γ)∥∥∥a: (h,L) ∈ H, γ ∈ Γ}is uniformly La-bounded.Proposition 2.1 says that if the stratum mean of superpopulation is uniformlyLa-bounded, its unbiased estimator is also La-bounded.For compactly stating the assumptions we employ to establish the consistencyand asymptotic normality results, we introduce a terminology.Definition 1. Given Assumption 2.1, let Γ be a finite-dimensional Euclidean spaceand {φh}h∈N a sequence of measurable functions from (Rv×Γ,Bv⊗B(Γ)) to(Rl1×l2 ,Bl1×l2). We say that {φh} is LB(a) on Γ, where a ∈ [1,∞), if the followingconditions are satisfied.(a) For each γ ∈ Γ, {φh(X˜hk,γ) : (k,h) ∈ N2} is uniformly La-bounded.(b) There exist a continuous function g : R→ [0,∞) and a sequence of Borel-measurable functions, {dh : Rv → [0,∞)}h∈N such that g(y) ↓ 0 as y ↓ 0,sup(k,h)∈N2∫dh(x)Phk(dx) < ∞, and for each (γ1,γ2) ∈ Γ2 and each x ∈ Rv,|φh(x,γ2)−φh(x,γ1)| ≤ dh(x)g(|γ2− γ1|).In Definition 1, the term “LB” comes from “Lipschitz and bounded”.In establishing our consistency result, we use the following concept.Definition 2. Given Assumption 2.1 and 2.2, let Γ be a finite-dimensional Euclideanspace and {ΦLh : (h,L) ∈ H} an array of measurable functions from (Ω×Γ,F ⊗B(Γ)) to (Rl1×l2 ,Bl1×l2). We say that {ΦLh} is SLB(a) on Γ, where a ∈ [1,∞), ifthe following conditions are satisfied (“S” stands for “stratum-wise”).(a) For each γ ∈ Γ, {ΦLh(·,γ) : (h,L) ∈ H} is uniformly La-bounded.51(b) There exists a continuous function g : R→ [0,∞) such that g(y)→ 0 asy ↓ 0, and an array of nonnegative random variables {DLh : (h,L) ∈ H} on(Ω,F ,P) that satisfies thatsupL∈NL−1L∑h=1E[DLh]< ∞,and for each (γ1,γ2) ∈ Γ2 and each (h,L) ∈ H, |ΦLh(·,γ2)−ΦLh(·,γ1)| ≤DLh g(|γ2− γ1|).The SLB property is useful in our analysis, being closely related to the LBproperty used in describing some of the assumptions imposed in the main text.Lemma 2.2. Suppose that Assumptions 2.1 and 2.2 hold. Let Γ be a finite-dimensional Euclidean space, a a positive real number, and {φh}h∈N a sequenceof measurable functions from (Rv×Γ,Bv⊗B(Γ)) to (Rl1×l2 ,Bl1×l2) that is LB(a)on Γ. 
Then the array of functions from (Rv×Γ,Bv⊗B(Γ)) to (Rl1×l2 ,Bl1×l2),{ΦLh}L∈N, defined byΦLh(ω,γ)≡ N−1Lhnh∑i=1nhi∑j=1WLhi j(ω)φh(XLhi j(ω),γ), ω ∈Ω, γ ∈ Γ, (h,L) ∈ His SLB(a) on Γ.In establishing the consistency result, we require:Assumption 2.4. For some real number δ > 0, the function q is LB(1+δ ) on Θ.Assumption 2.4 along with the preceding assumptions are sufficient for theuniform convergence of {QL(θ)− Q¯L(θ)}L∈N in probability to zero over θ ∈ Θ.For each L ∈ N, let Θ∗L denote the set of maxima of Q¯L on Θ. To turn uniformconvergence of the estimation objective function into the consistency of {θˆL}L∈N,we need identifiability of {Θ∗L}L∈N, namely:Assumption 2.5. For any real number ε > 0liminfL→∞inf{Q¯L(θ)− Q¯∗L : d(θ ,Θ∗L)≥ ε, θ ∈Θ}> 0,52where Q¯∗L ≡ supθ∈Θ Q¯L(θ), and d is the Euclidean metric on Θ, so thatd(θ ,Θ∗L) = infθ ∗∈Θ∗Ld(θ ,θ ∗), θ ∈Θ, L ∈ N.Assumption 2.5 rules out the possibility that as L goes to infinity, the Q¯L(θ)function gets flatter around maxima.We are now ready state our consistency result.Theorem 2.3. Suppose that Assumption 2.1–2.5 hold. Then {d(θˆL,Θ∗L)}L∈N con-verges in probability to zero.An interesting special case of our setup is the case in which the model is correctlyspecified for each stratum. For each (h,L) ∈ H, define the function Q¯Lh : Θ→ R byQ¯Lh(θ)≡∫q(x,θ)P¯Lh(dx).We mean by the stratum-wise correct specification:Assumption 2.6. There exists θ0 ∈ Θ such that for each (h,L) ∈ H, Q¯Lh is maxi-mized at θ0 over Θ. Also, Θ∗L is a singleton for almost all L ∈ N.When all strata share the same true parameter value, this parameter value is alsothe true parameter value of the superpopulation, i.e. θ ∗L = θ0. As θˆL is consistentfor θ ∗L (Theorem 2.3), it is also consistent for θ0.Corollary 2.4. Suppose that Assumption 2.1–2.6 hold. Then {d(θˆL,θ0)}L∈N con-verges in probability to zero.2.4 Asymptotic NormalityWe apply the standard linearization approach with smooth (generalized) scores toachieve the asymptotic normality.Assumption 2.7. For almost all L∈N, Θ∗L is a singleton, and there exists a sequence{θ ∗L ∈Θ∗L}L∈N and a compact set Θ0 ⊂ intΘ, to which {θ ∗L} is uniformly interior.Assumption 2.7 rules out the case that {θ ∗L} is on the boundary of Θ, so that∇Q¯∗L = 0.53Assumption 2.8. (a) For each x ∈ Rv, q(x, ·) is twice continuously differentiableon intΘ.(b) For some real number δ > 0, ∇2q is LB(1+δ ) on Θ0.(c) The sequence {A∗L ≡ AL(θ ∗L )}L∈N is asymptotically uniformly nonsingular,where for each L ∈ N, AL : intΘ→ Rp×p is defined byAL(θ)≡∫∇2q(x,θ)P¯L(dx), θ ∈ intΘ.(d) For some real number δ > 0,supθ∈Θ0sup(k,h)∈N2∫|∇q(x,θ)|2+δ Phk(dx)< ∞.(e) For each (k,h) ∈ N2 and each θ ∈Θ0,∇∫q(x,θ)Phk(dx) =∫∇q(x,θ)Phk(dx).(f) The sequence {B∗L ≡ BL(θ ∗L )}L∈N is asymptotically uniformly nonsingular,where for each θ ∈ Θ, BL(θ)≡ L−1∑Lh=1 BLh(θ); for each θ ∈ Θ and each(h,L) ∈ H, BLh(θ)≡ var[SLh(·,θ)] andSLh(ω,θ)≡ (L/NL)nh∑i=1nhi∑j=1WLhi j(ω)∇q(Xlhi j(ω),θ)= (LNLh/NL)N−1Lhnh∑i=1nhi∑j=1WLhi j(ω)∇q(Xlhi j(ω),θ), ω ∈Ω, θ ∈ intΘ.Assumption 2.8 allows us to derive the standard asymptotic linear representationof {θˆL}L∈N and establish the asymptotic normality of θˆL based on the asymptoticnormality of the score evaluated at θ ∗L . Most M-estimators used in econometricssatisfy the condition (a), though it rules out some estimators such as Koenker andBassett (1978) quantile regression estimator. The conditions (b), (d), and (e) aremild, requiring that the gradient and Hessian of q does not have too fat tails under54the distributions in {Phk : (k,h) ∈ N2}. 
The asymptotic uniform nonsingularity of{A∗L}L∈N and {B∗L} imposed in (c) and (f) are also easily satisfied, as long as P¯Ldoes not change too much as L grows to infinity.We now state the asymptotic normality result.Theorem 2.1. Suppose that Assumption 2.1–2.5, 2.7, and 2.8 hold. Then thesequence {D∗L ≡ A∗−1L B∗LA∗−1L }L∈N is bounded and uniformly nonsingular. Also, itholds that D∗−1/2L L1/2(θˆL−θ ∗L )A∼ N(0, Ip) as L→ ∞.2.5 Estimation of the Asymptotic Covariance MatrixConsistent estimation of D∗L in Theorem 2.1 can be performed by consistently esti-mating A∗L and B∗L. The proof of the asymptotic normality shows that {∇2QL}L∈N isuniformly consistent for {AL}L∈N over Θ0. Let {AˆL : Ω→ Rp×p}L∈N be a sequenceof random matrices such that AˆL = ∇2QL(θˆL) whenever θˆL ∈ intΘ. Given the con-sistency of {θˆL}L∈N for {θ ∗L}L∈N and the above mentioned uniform consistencyof {∇2QL}L∈N for {AL}, it is straightforward to verify consistency of {AˆL}L∈N for{A∗L}L∈N.We next consider estimation of B∗L. For a wide range of multistage clustersample designs, the literature offers design-unbiased estimators of the covariancematrix of the total estimator, where the design unbiasedness means the conditionalunbiasedness given X˜ . (Wolter, 1985, pp. 11–16), for example, lists many of suchestimators. We here assume that such an estimator is available for estimation of thevariance of ∑nhi=1∑nhij=1WLhi j∇q(Xlhi j,θ) in each stratum and each θ ∈Θ0.Assumption 2.9. An array of measurable functions from (Ω×Θ,F ⊗B(Θ)) to(Rp×p,Bp×p), {K˜Lh : Ω×Θ→ Rp×p}L∈N, satisfies that for each θ ∈ intΘ andeach (h,L) ∈ H,E[K˜Lh(·,θ) | X˜ ] = var[ nh∑i=1nhi∑j=1WLhi j(ω)∇q(Xlhi j(ω),θ)],and for some real number δ > 0, {N−2Lh K˜Lh : (h,L) ∈ H} is SLB(1+δ ) on Θ0.The requirement that {N−2Lh K˜Lh : (h,L) ∈ H} be SLB(1+δ ) is satisfied by the55prevailing estimators of the covariance matrix of the total estimators, given theassumptions already imposed in this chapter.Define {B˜Lh : Ω×Θ→ Rp×p : (h,L) ∈ H} byB˜Lh(ω,θ)≡ (L/NL)2K˜Lh(ω,θ) (2.2)= (LNLh/NL)2N−2Lh K˜Lh(ω,θ), ω ∈Ω, θ ∈Θ,(h,L) ∈ H.Under Assumption 2.9, (L/NL)2K˜Lh(·,θ) is a design-unbiased estimator of the(conditional) covariance matrix of SLh(·,θ) for each θ ∈Θ0, i.e.,E[(L/NL)2K˜Lh(·,θ) | X˜ ] = var[SLh(·,θ) | X˜ ], θ ∈Θ0, (h,L) ∈ H.Nevertheless, its unconditional mean isE[B˜Lh(·,θ)] = E[var[SLh(·,θ) | X˜ ]]= var[SLh(·,θ)]−var[E[SLh(·,θ) | X˜ ]], θ ∈Θ0, (h,L) ∈ H,where the first equality follows by the law of iterated expectations, and the secondequality follows from the fact that the mean conditional variance and the varianceof the conditional mean sums up to the unconditional mean. BecauseE[SLh(·,θ) | X˜ ] = (LNLh/NL)N−1LhNLh∑k=1∇q(X˜hk,θ), θ ∈Θ0, (h,L) ∈ Hby Proposition 2.1, it holds that E[B˜Lh(·,θ)] is biased for var[SLh(·,θ)] byvar[E[SLh(·,θ) | X˜ ]]= (LNLh/NL)2N−2LhNLh∑k=1var[∇q(X˜hk,θ)], θ ∈Θ0, (h,L) ∈ H.(2.3)This bias is zero if ∇q(X˜hk,θ) is degenerate (as is the case in the finite populationframework). If ∇q(X˜hk,θ) is not degenerate, it is negative definite. Under ourcurrent assumption, however, {LNLh/NL : (h,L) ∈ H} is bounded (Assumption 2.1),56and it holds thatsup{∣∣var[∇q(X˜hk,θ)]∣∣ : θ ∈Θ0, (k,h) ∈ N2}≤ sup{∣∣E[∇q(X˜hk,θ)∇′q(X˜hk,θ)]∣∣ : θ ∈Θ0, (k,h) ∈ N2}+ sup{∣∣E[∇q(X˜hk,θ)]E[∇′q(X˜hk,θ)]′∣∣ : θ ∈Θ0, (k,h) ∈ N2}< ∞(2.4)(Assumption 2.8), the size of the bias is asymptotically governed by NLh. In sum, thebias of B˜Lh(·,θ) in estimation of var[SLh(·,θ)] is zero or asymptotically negligibleuniformly in all strata if:Assumption 2.10. 
(a) All random vectors of X˜ are degenerate; or(b) it holds that liminfL→∞ infh∈{1,2,...,L}NLh = ∞.Intuitively, in finite population framework, there is no random draws from thesuperpopulation. Hence, conditional variance equals unconditional variance. Given(b), stratum size of the population is large enough. So that the population convergesto superpopulation in each stratum. Hence, the conditional variance converges tothe unconditional variance.We now define {B˜L : Ω×Θ→ Rp×p}L∈N byB˜L(ω,θ)≡ L−1L∑h=1B˜Lh(ω,θ), ω ∈Ω, θ ∈Θ.Application of a uniform law of large numbers to the triangle array {B˜Lh :(h,L) ∈ H} establishes the uniform consistency of {B˜L} for {BL} over Θ0, whichleads to the following result along with the consistency of {θˆL}L∈N for {θ ∗L}L∈N.Theorem 2.1. Under Assumptions 2.1–2.5 and 2.7–2.10, {BˆL ≡ B˜L(·, θˆL)}L∈N isconsistent for {BL}L∈N.57Chapter 3m-Testing of Stratum-Wise ModelSpecification in Complex SurveyData3.1 IntroductionAn economic model is often estimated by using complex survey data obtained froma population. The estimated model is then sometimes used to analyze phenomenain a subpopulation, which is typically a stratum in the complex survey design or aunion of strata.In order for stratum-specific analysis using a model estimated for the entirepopulation to be valid, it is sufficient that (a) the model is correctly specified foreach stratum, and (b) all strata share the same true parameter value. We refer tothe combination of (a) and (b) as stratum-wise correct specification of the model.It is worthwhile to test the specification of the model in strata, before conductingstratum-specific analyses. For example, the Current Population Survey (CPS), oneof the principal source of data on U.S. labour market, is collected using complexsurvey sampling method. It provides comprehensive information about individuals’employment status and their demographic characteristics, and is used by state andlocal government, which is usually a stratum or a union of a small number of strata58in the CPS, for planning and budgeting purposes and to determine the need for thelocal employment and training services. Due to the features of complex surveysampling of the CPS 1, it is not possible to estimate a model for only one or a fewstrata, and the estimation requires using the sample of the whole population or alarge number of strata (Please refer to Sakata and Xu (2010) for a discussion aboutestimation with complex survey data). The application of the estimated model tolocal labour markets requests that the estimated model also holds for local labourmarkets.In this chapter, we consider how to test the stratum-wise specification of amodel within the m-testing framework, which was pioneered by Newey (1985) andTauchen (1985) and extended by White (1987). Our approach is closely related tothe one taken in Sakata (2009), in which the data are assumed to be collected bysimple random sampling in each of many subpopulations.Suppose that there are L strata in the population, and nh primary sampling units(psu) are drawn from stratum h ∈ {1,2, . . . ,L}. In the ith psu in stratum h, nhiobservations are drawn according to the sample design (nhi is in general random).Let Xhi j denote the jth observation of Rv-valued variable X from the ith psu instratum h, and Whi j the survey weight for the observation.Also, suppose that a model with a parameter space Θ⊂ Rp is given. The modelmay be designed to capture some conditional probability, conditional expectation,conditional variance, or something else. 
Also, suppose that it is known that whenthe given model is correctly specified for stratum h, for each pi in a set Π⊂ Rr,∫mh(x,θ0h,pi)P¯h(dx) = 0,where mh : Rv×Θ×Π→ Rq is a function known to the researcher, θ0h is the trueparameter value for stratum h (the parameter pi is usually referred to as a nuisanceparameter), and P¯h is the probability distribution of X in stratum h. By using the1The CPS a monthly survey which interviews a representative sample of the civilian noninstitu-tional population 16 years and older. It splits the U.S. population into 792 strata, where each stratumis a subregion of a state or in some cases an entire state. On average 2.5 primary sample units (PSU)are randomly selected for each stratum. Within the PSUs, the CPS directly samples ultimate samplingunits (USU), a geographically compact group of approximately four addresses.59survey weights Whi j’s, we can rewrite the above equality asE[ nh∑i=1nhi∑j=1Whi jmh(Xhi j,θ0h,pi)]= 0, h = 1,2, . . . ,L(see Proposition 3.1(iv) below). Under the stratum-wise correct specification, thesingle true parameter θ0 shared by all strata satisfies that for each pi in a set Π⊂ Rr,E[ nh∑i=1nhi∑j=1Whi jmh(Xhi j,θ0,pi)]= 0, h = 1,2, . . . ,L. (3.1)In many applications, θ0 can be accurately estimated by an estimator θˆ using theentire sample under the stratum-wise correct specification.Just for the sake of illustration, suppose that Π is a singleton whose only elementis pi∗. Then a usual m-test would employ the statistic LMˆ′CˆMˆ, whereMˆ ≡L∑h=1nh∑i=1nhi∑j=1Whi jmˆhi j,mˆhi j ≡ mh(Xhi j, θˆ ,pi∗),and Cˆ is a suitably chosen q×q weighting matrix, which is typically the inverse ofan estimated covariance matrix of Mˆ. Because Mˆ approximatesM¯∗ ≡L∑h=1nh∑i=1nhi∑j=1E[Whi jmh(Xhi j,θ ∗,pi∗)]the m-test would detect the violation of the null hypothesis only through the devi-ation of M¯∗ from the origin. When all strata have the same distribution, (3.1) isequivalent to zeroness of M¯∗. When the stratum have heterogeneous distributions,on the other hand, it is possible that M¯∗ is nearly equal to zero, even when (3.1)is grossly violated in some stratum. It is therefore preferable to design a test thatdirectly checks (3.1), instead of the zeroness of M¯∗ in out current setup.A natural way to formulate a test of (3.1) within the standard m-testing frame-60work is to consider moment conditions for each stratum separately, by takingmˆ†hi j ≡ (0,0, . . . ,0︸ ︷︷ ︸(h−1)q, mˆ′hi j,0,0, . . . ,0︸ ︷︷ ︸(L−h)q)′for the moment function of the jth observation in the ith psu in the hth stratum.One might find such approach related to the growing literature on use of manymoment conditions: Koenker and Machado (1999) Carrasco and Florens (2000),Donald et al. (2003), Doran and Schmidt (2006), Han and Phillips (2006), Carrascoet al. (2007), and Anatolyev (2008), just to mention a few chapters in the area.Nevertheless, it is important to note that all elements of mˆ†hi j are zeros, except for theq elements in mˆhi j. Such moment functions are ruled out by the regularity conditionsin the existing literature at our best knowledge. There are two major problems inthe approach using mˆ†. First, the usual normal approximation to the distribution ofMˆ† ≡ ∑Lh=1∑nhi=1∑nhij=1 mˆ†hi j would not work well in typical complex survey designs,in which only a few psu’s are selected in each stratum. 
Second, the weighting matrixCˆ† would have a serious problem, because the estimated asymptotic covariancematrix of Mˆ† is either nearly or exactly singular given the small number of psu’sper stratum. Thus, the behavior of Mˆ†′Cˆ†Mˆ† would be very different from what thestandard asymptotics suggests. The use of the usual m-test based on the momentconditions separately set up for each stratum is not appealing for this reason.In this chapter, we propose an implementation of m-testing that overcomes theabove-mentioned difficulties, being hinted by Sakata (2009). In our approach, weunbiasedly estimate a quadratic form of the moment vector in question for eachstratum, where the quadratic form has a positive definite weighting matrix, and thentake the average of the estimated quadratic form over the strata to obtain a statistic.We consider both the situation in which the nuisance parameter pi is estimated (i.e.,picked based on data) and the situation in which a researcher desires to base thetest on the whole moment conditions indexed by the nuisance parameter. Our testrejects the null hypothesis when the statistic described above is largely positive,loosely speaking.The rest of the chapter is organized as follows. In Section 3.2, we formalizeour problem setup. In Section 3.3, we then propose a method to test the null61hypothesis and derive its large sample properties in the case in which an estimatednuisance parameter is used. In Section 3.4, we remove the estimation of thenuisance parameter from the setup and consider the test based on the whole momentconditions indexed by the nuisance parameter. The proofs of the theorems arecollected in the Appendices.We employ the following convention and symbols throughout this chapter.Limits are taken along the sequence of numbers of stratum (denoted L) growing toinfinity, unless otherwise indicated. For each matrix A, |A| denotes the Frobeniusnorm of A, i.e., |A| ≡√tr(A′A), and A+ the Moore-Penrose (MP) inverse of A. Byapplying the MP inverse in division by scalars, we rule that division by zero equalszero. We use the MP inverses of random matrices instead of the regular inversesto avoid technical problems caused by the singularity of the random matrices thatoccurs with a small probability. The reader can safely replace the MP inverses withthe regular inverses, when applying the formulas in practice. Also, for each randommatrix Z and positive real number a, ‖Z‖a denotes the La-norm of |Z|.3.2 Problem SetupWe now formalize the problem setup described in Section 3.1. Like many studiesin the literature of estimation with complex survey data, this chapter employs theasymptotics along which the number of strata grows to infinity. Consider a sequenceof populations, {PL}L∈N, such that PL consists of NL individuals, each of whomcarries numerical values for a set of v characteristics, X . The population PL issplit into L strata. Let NLh denote the number of individuals in the hth stratum inpopulationPL. The value of X for the kth individual in the hth stratum in populationPL, which is denoted X˜hk, is drawn from a v-variate distribution Phk, where thedraws of X˜hk are independent across individuals and strata. In this setup, the productmeasure ⊗Lh=1⊗NLhk=1 Phk can be viewed as the superpopulation distribution of X’sbehind the population PL. Note that Phk does not depend on L. We assume:Assumption 3.1. The double array, X˜ ≡ {X˜hk : (k,h) ∈ N2}, of v×1 random vec-tors on a probability space (Ω,F ,P) is independently (but not identically) dis-tributed. 
For each L ∈ N, the population PL consists of L stratum, and for eachh ∈ {1,2, . . . ,L}, the hth stratum contains NLh individuals (NLh ∈ N). For each62(k,h,L) ∈ K≡ {(k˜, h˜, L˜) ∈ N3 : k˜ ≤ NLh, h˜≤ L˜}, the kth individual in the hth stra-tum in the Lth population PL carries X˜hk. The array {NLh : (h,L) ∈ H} satisfiesthatsup{NLh/(NL/L) : (h,L) ∈ H}< ∞, (3.2)where NL ≡ ∑Lh=1 NLh, L ∈ N.In Assumption 3.1, we impose independence on X˜ for simplicity. Even if weassume that X˜ is weakly dependent in each stratum instead, the results of thischapter would stay essentially the same. Because NL/L is the average stratum size,(3.2) requires that none of the stratum asymptotically dominates the others in size,reflecting that the sizes of strata are typically comparable to each other in practice.A sample design is a mechanism to select individuals in the population into thesample in each of the strata. Suppose there are nLh psu’s selected in stratum h andnLhi ultimate sampling units selected in the ith selected psu in stratum h of PL. LetCLhk denote the (random) number of times individual k in stratum h is selected intothe data set under population PL. We assume that the sample design satisfies:Assumption 3.2. (a) For each L ∈ N, the L collections of nonnegative integer-valued random variables,{CL1k : k ∈ {1, . . . ,NL1}}, . . . ,{CLLk : k ∈ {1, . . . ,NLL}}which are defined on (Ω,F ,P), are independent.(b) The collection of random variables {CLhk : (k,h,L) ∈ K} is independent from{X˜hk : (k,h) ∈ N2}.(c) There exists C¯ ∈ N such that for each (k,h,L) ∈ K, the support of CLhk iscontained in the interval [0,C¯].(d) It holds thatp¯l ≡ inf{P[CLhk > 0]/NLh−1 : (k,h,L) ∈ K}> 063and thatp¯u ≡ sup{P[CLhk > 0]/NLh−1 : (k,h,L) ∈ K}< ∞.Assumption 3.2(a) means that the sampling is independent across strata, as is thecase virtually in all complex survey design. Assumption 3.2(b) reflects the fact thatselection of individuals into the data set does not depend on the values of X drawnby the individuals (this effectively requires that the variables in X constitutes thesampling frames are fixed be constant under distribution Phk for each (k,h) ∈ N2).Assumption 3.2(c) requires that there be a maximum number of times an individualcan be possibly selected into the sample, as is the case in typical sample designs (infact, many of the designs allows an individual to be included in the sample at mostonce). Assumption 3.2(d) specifies that individuals’ probabilities of selection intothe sample proportional and the stratum population size are in inverse proportion inour asymptotics, loosely speaking.The mean number of times the kth individual in stratum h in population PL isselected into the sample is equal to E[CLhk]. We can view E[CLhk] as indicating howwell each individual is represented in the sample design; the higher it is, the betterthe individual is represented. The reciprocal of E[CLhk], wLhk ≡ E[CLhk]−1, is calledthe survey weight of the kth individual in stratum h under the Lth population. Thesurvey designer can compute the survey weight, precisely knowing the distributionof CLhk. We let WLhi j denote the survey weight of the jth selected individual in theith selected psu in stratum h under the Lth populationPL. As Proposition 3.1 showsbelow, the survey weight can be used for unbiasedly estimating the population meanof variables in each stratum. In this chapter, the mixture of Ph1, . . . , PhNLh with equalweights 1/NLh is called the distribution of X in the hth stratum in the Lth populationand denoted P¯Lh. 
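As a quick numerical check of the role played by the survey weight (the property Proposition 3.1 below formalizes), consider the following sketch. It uses an entirely hypothetical single-stage design for one stratum in which individual k enters the sample independently with probability p_k, so that C_Lhk is Bernoulli(p_k) and the survey weight is 1/p_k; the population values, selection probabilities, and sizes are assumptions made only for the illustration.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single-stage design for one stratum: individual k is selected
# independently with probability p_k, so C_k is Bernoulli(p_k) and the survey
# weight is w_k = 1 / E[C_k] = 1 / p_k.
N = 2000                                        # stratum population size N_Lh
x = rng.gamma(shape=2.0, scale=1.5, size=N)     # fixed population values of X
p = rng.uniform(0.02, 0.10, size=N)             # unequal selection probabilities
w = 1.0 / p                                     # survey weights

pop_mean = x.mean()                             # target: stratum population mean

reps = 5000
est = np.empty(reps)
for r in range(reps):
    selected = rng.random(N) < p                # realized C_k in {0, 1}
    est[r] = (w[selected] * x[selected]).sum() / N   # N_Lh^{-1} * sum of W * phi(X)

print(pop_mean, est.mean())                     # the two numbers should be close

Averaged over the replications, the weighted estimator matches the stratum population mean, which is the design unbiasedness established in Proposition 3.1 below.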
Also, the mixture of P¯Lh, . . . , P¯LL with the corresponding weightsNL1/NL, . . . , NLL/NL is called the distribution of X in the entire Lth population anddenoted P¯L.Proposition 3.1. Suppose that Assumptions 3.1 and 3.2 hold. Also, let {φLh :(h,L) ∈ H ≡ {(h˜, L˜) ∈ N2 : h˜ ≤ L˜}} be an array of Borel-measurable functionsfrom Rv to R such that for each k ∈ N and each (h,L) ∈ H,∫|φLh(x)|Phk(dx)< ∞.Then:64(a) For each (h,L) ∈ H,E[N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j)∣∣∣ X˜]= NLhNLh∑k=1φLh(X˜hk).(b) For each (h,L) ∈ H,E[N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j)]=∫φLh(x)P¯Lh(dx).(c) For each L ∈ N,E[N−1LL∑h=1nh∑i=1nhi∑j=1WLhi jφLh(XLhi j)]=L∑h=1(NLh/NL)∫φLh(x) P¯Lh(dx).(d) If, in addition, φL1 = φL2 = · · ·= φLL for each L ∈ N, it holds that for eachL ∈ N,E[N−1LL∑h=1nh∑i=1nhi∑j=1WLhi jφLh(XLhi j)]=∫φL1(x) P¯L(dx).In Proposition 3.1, (i) shows that the sample average weighted by WLhi j/NLhcorrects the over- and under-representation of individuals and unbiasedly estimatethe stratum population mean (given the individuals’ draws of X from the superpop-ulation). This property of the weighted average is called the design unbiasedness.The design unbiasedness immediately implies the unbiasedness of the weightedaverage for the stratum mean, as (ii) claims. Given the unbiasedness of the weightedaverage, we can also estimate superpopulation means unbiasedly, taking the averageof the unbiased stratum mean estimator weighted by NLh/NL, as (iii) and (iv) states.In our large-L asymptotics, we often check the moment conditions for averagesweighted by the survey weights. The result result is handy in such tasks.Proposition 3.2. Suppose that Assumptions 3.1 and 3.2 hold. Let Γ a set, a anonnegative real number, and {φLh : (h,L) ∈ H, } an array of measurable functionfrom Rv×Γ to R such that for each γ ∈ Γ and each (h,L) ∈ H, φLh(·,γ) : Rv→ R is65Borel measurable. Ifsupγ∈Γsup(h,L)∈H∫|φLh(x,γ)|a Pkh(dx)< ∞,then it holds that{∥∥∥N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j,γ)∥∥∥a: (h,L) ∈ H, γ ∈ Γ}is uniformly La-bounded.By this proposition, we see that the La-boundedness of a function of X˜ uni-form both in individuals and the index γ nicely translates into the uniform La-boundedness of its unbiased estimator under Assumptions 3.1 and 3.2.Following Section 3.1, we now assume that the stratum-wise correct specifica-tion implies a set of moment conditions that there exists θ0 in the parameter spaceΘ such that for each h ∈ N and for each pi in a space Π, m¯Lh(θ ,pi) = 0, wherem¯Lh(θ ,pi)≡∫mh(x,θ ,pi) P¯Lh(dx), (θ ,pi) ∈Θ×Π.and the sets Θ and Π and the function mh : Rv×Π×Ω→ Rq satisfy:Assumption 3.3. The sets Θ and Π are nonempty Borel-measurable subsets of thep- and r-dimensional Euclidean spaces (p<∞, r <∞), respectively, and {mh : Rv×Θ×Π→Rq}h∈N is a sequence of functions measurable-(Bv⊗B(Θ)⊗B(Π))/Bqsuch that for each (h,k) ∈N2 and each (θ ,pi) ∈Θ×Π,∫|mh(x,θ ,pi)|Phk(dx)<∞.Like the usual m-tests, the tests proposed in this chapter requires an estimatorθˆL that is consistent for the true parameter θ0 under the stratum-wise correct specifi-cation. The estimator may be an M-estimator, a GMM-estimator, or others, as longas it satisfies the next assumption regardless of whether or not the specification isstratum-wise correct.Assumption 3.4. The set Θ0 is a compact subset of Θ. 
A Θ-valued estimator{θˆL}L∈N on (Ω,F ,P) is consistent for a sequence {θ ∗L ∈ ΘL}L∈N, i.e., {|θˆL−θ ∗L |}L∈N converges to zero in probability-P.66Under the stratum-wise correct specification, the “pseudo-true parameter” θ ∗Lcoincides with θ0, so that it holds thatfor each pi ∈Π and each (h,L) ∈ H, m¯Lh(θ ∗L ,pi) = 0. (3.3)The test we develop in this chapter rejects stratum-wise correct specification if dataexhibit evidence against (3.3).A few examples of situations that fulfill the requirements in Assumptions 3.3and 3.4 follow. e.g., Assume that {X˜hk : (k,h) ∈ N2} is uniformly L2+2δ -boundedfor some positive real number δ . Let Y denote the first characteristic in X , and Zthe remaining v−1 characteristics in X (v≥ 2). Partition X˜hk as X˜hk = (Y˜hk, Z˜′hk)′,where Y˜hk is a random variable, and Z˜hk is a (v− 1)× 1 random vector, and setΘ= Rv−1. We say that the linear regression of Y on Z is stratum-wise correct for theconditional mean of Y given Z, if there exists θ0 ∈Θ such that the conditional meanof Y −Z′θ0 given Z is zero under each of the stratum distributions P¯Lh ((h,L) ∈ H).By using the law of large numbers of (Sen, 1970, Theorem 3) and Proposition 3.2,we can easily verify that the weighted least squares estimator {θˆL}L∈N in regressionof YLhi j on ZLhi j using the survey weight is consistent for{θ ∗L ≡(∫zz′ P¯L(dy,dz))−1 ∫zy P¯L(dy,dz)}L∈N,provided that {∫zz′ P¯L(dy,dz)}L∈N is uniformly nonsingular. Also, it holds thatθ ∗L = θ0 for each L ∈ N under the stratum-wise correct specification.Let g be a Borel-measurable function from Rv−1 to Rq such that∫g(z)′g(z)Phk(dy,dz) < ∞ for each (h,k) ∈ N2. Then the stratum-wise correctspecification of the linear regression implies that∫g(z)(y− z′θ ∗L ) P¯Lh(dy,dz) = 0, (h,L) ∈ H. (3.4)Though this condition involves no nuisance parameter, we can create one artificiallyto put this example squarely in the framework described above. Set Π= R andmh(x,θ ,pi) = g(z)(y− z′θ), x = (y,z) ∈ R×Rv−1, (θ ,pi) ∈Θ×Π, h ∈ N.67Then (3.4) is written in the form of (3.3). e.g., In the setup of Example 3.2, assumein addition that∫(z′z)3 Phk(dy,dz) < ∞ for each (h,k) ∈ N2. Also, let Π = Rv−1.Then the stratum-wise correct specification implies that∫(z′pi) j (y− z′θ ∗L ) P¯Lh(dy,dz) = 0, pi ∈Π, j ∈ {2,3}, (h,L) ∈ H. (3.5)If we setmh(x,θ ,pi)= (z′pi, (z′pi)2)′ (y−z′θ), x=(y,z′)∈R×Rv−1, (θ ,pi) ∈Θ×Π, h ∈ N.Then (3.5) is written in the form of (3.3). e.g., In the setup of Example 3.2, letΦ : Rv−1→ Rv−1 be a bounded, Borel-measurable, bounded one-to-one functionand Π a nonempty compact subset of Rv−1 with a positive Lebesgue measure. Thenthe stratum-wise correct specification is equivalent to that there exists θ0 ∈Θ suchthat ∫exp(Φ(z)′pi)(y− z′θ0) P¯Lh(dy,dz) = 0, pi ∈Π, (h,L) ∈ H, (3.6)as can be verified using (Bierens, 1990, Theorem 1). If we setmh(x,θ ,pi)= exp(Φ(z)′pi)(y−z′θ), x=(y,z′)′ ∈R×Rv−1, (θ ,pi) ∈Θ×Π, h ∈ N,(3.6) is written in the form of (3.3).3.3 A Test with Estimated Nuisance ParametersLet {pi∗L}L∈N be an arbitrary sequence in Π. Then the condition (3.3) apparentlyimplies thatfor each (h,L) ∈ H, m¯Lh(θ ∗L ,pi∗L) = 0, (3.7)which further implies thatfor each L ∈ N,L∑h=1βLhm¯Lh(θ ∗L ,pi∗L) = 0, (3.8)where βLh’s are positive real numbers to weight strata. If we take the usual m-testingapproach, (3.8) would be the basis for our test. 
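A tiny numerical illustration shows why basing the test only on (3.8) can be uninformative: with hypothetical stratum moments of comparable size but opposite signs, the weighted aggregate in (3.8) is exactly zero even though (3.7) fails in every stratum, whereas a stratum-wise quadratic criterion of the kind introduced below does not cancel. The numbers are invented purely for this illustration.

import numpy as np

# Invented stratum-level moments m_bar_Lh (scalar case, q = 1) for L = 4 strata
# with equal weights beta_Lh = 1/4: every stratum violates (3.7), yet the
# aggregate moment appearing in (3.8) is exactly zero.
m_bar = np.array([2.0, -2.0, 1.5, -1.5])
beta = np.full(4, 0.25)

aggregate = beta @ m_bar            # the quantity a usual m-test examines: 0.0
stratum_wise = beta @ m_bar**2      # a beta-weighted quadratic form: 3.125
print(aggregate, stratum_wise)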
In Example 3.2, for example one68might set pi∗L = θ ∗L and obtain a variant of Ramsey (1969) regression specificationerror test (RESET).Nevertheless, the usual implementation of the m-testing based on (3.8) can beproblematic for our purpose, as we discussed in Section 3.1. To overcome thedifficulty, we here formulate a test based on the average of the quadratic form ofm¯∗Lh ≡ m¯Lh(θ ∗L ,pi∗L), taken over the L strata with some weights βLh, wherem¯Lh(θ ,pi)≡∫mh(x,θ ,pi) P¯Lh(dx), (θ ,pi) ∈Θ×Π, (h,L) ∈ H,and the quadratic form has a positive definite weighting matrix. We assume thatAssumption 3.5. The array of positive real numbers, {βLh : (h,L) ∈ H}, satisfiesthat for each L ∈ N, ∑Lh=1βLh = 1, and thatlimsupL→∞Lsup{βLh : h ∈ {1,2, . . . ,L}}< ∞.For example, one may set NLh/NL or N2Lh/∑Lh′=1 N2Lh′ to βLh. It is straightforwardto verify that the required conditions are satisfied by these choices under (3.2) inAssumption 3.1.Let Sq denote the set of all q×q positive semidefinite symmetric matrices, andΛL ∈ Sq a positive definite matrix possibly dependent on L. Note that for each(h,L) ∈ H, m¯∗Lh′ΛLm¯∗Lh ≥ 0; and m¯∗Lh′ΛLm¯∗Lh = 0 if and only if m¯∗Lh = 0. DefineαL : Θ×Π×Sq→ R byαL(θ ,pi,Λ)≡L∑h=1βLhm¯Lh(θ ,pi)′Λm¯Lh(θ ,pi)=L∑h=1βLhtr(Λm¯Lh(θ ,pi)m¯Lh(θ ,pi)′) (θ ,pi,Λ) ∈Θ×Π×Sq, L ∈ N.(3.9)Then α∗L ≡ αL(θ ∗L ,pi∗L ,ΛL) is nonnegative for each L ∈ N; and α∗L = 0 for a givenL ∈ N if and only if m¯Lh = 0 for each h ∈ {1,2, . . . ,L}. Thus, it holds that α∗L = 0for each L ∈ N if and only if (3.7) holds. We base our test of stratum-wise correctspecification on this simple fact. Namely, our test rejects the stratum-wise correct69specification if an estimate of α∗L is positive and far from zero.Before formulating an estimator of α∗L , we introduce two terminologies, whichlet us below state our assumptions compactly.Definition 3. Given Assumption 3.1, let Γ be a finite-dimensional Euclidean spaceand {φh}h∈N a sequence of measurable functions from (Rv×Γ,Bv⊗B(Γ)) to(Rl1×l2 ,Bl1×l2). We say that {φh} is LB(a) on Γ, where a ∈ [1,∞), if the followingconditions are satisfied.(a) For each γ ∈ Γ, {φh(X˜hk,γ) : (k,h) ∈ N2} is uniformly La-bounded.(b) There exist a continuous function g : R→ [0,∞) and a sequence of Borel-measurable functions, {dh : Rv → [0,∞)}h∈N such that g(y) ↓ 0 as y ↓ 0,sup(k,h)∈N2∫dh(x)Phk(dx) < ∞, and for each (γ1,γ2) ∈ Γ2 and each x ∈ Rv,|φh(x,γ2)−φh(x,γ1)| ≤ dh(x)g(|γ2− γ1|).Definition 4. Given Assumption 3.1 and 3.2, let Γ be a finite-dimensional Euclideanspace and {ΦLh : (h,L) ∈ H} an array of measurable functions from (Ω×Γ,F ⊗B(Γ)) to (Rl1×l2 ,Bl1×l2). We say that {ΦLh} is SLB(a) on Γ, where a ∈ [1,∞), ifthe following conditions are satisfied (“S” stands for “stratum-wise”).(a) For each γ ∈ Γ, {ΦLh(·,γ) : (h,L) ∈ H} is uniformly La-bounded.(b) There exists a continuous function g : R→ [0,∞) such that g(y)→ 0 asy ↓ 0, and an array of nonnegative random variables {DLh : (h,L) ∈ H} on(Ω,F ,P) that satisfies thatsupL∈NL−1L∑h=1E[DLh]< ∞,and for each (γ1,γ2) ∈ Γ2 and each (h,L) ∈ H, |ΦLh(·,γ2)−ΦLh(·,γ1)| ≤DLh g(|γ2− γ1|).The term “LB” comes from “Lipschitz and bounded”. The SLB property isuseful in our analysis, being closely related to the LB property used in describingsome of the assumptions imposed in the main text.70Lemma 3.1. Suppose that Assumptions 3.1 and 3.2 hold. Let Γ be a finite-dimensional Euclidean space, a a positive real number, and {φh}h∈N a sequenceof measurable functions from (Rv×Γ,Bv⊗B(Γ)) to (Rl1×l2 ,Bl1×l2) that is LB(a)on Γ. 
Then the array of functions from (Rv×Γ,Bv⊗B(Γ)) to (Rl1×l2 ,Bl1×l2),{ΦLh}L∈N, defined byΦLh(ω,γ)≡ N−1Lhnh∑i=1nhi∑j=1WLhi j(ω)φh(XLhi j(ω),γ), ω ∈Ω, γ ∈ Γ, (h,L) ∈ His SLB(a) on Γ.We now formulate an estimator of α∗L and establishing its consistency. Assume:Assumption 3.6. (a) The sequence {θ ∗L}(h,L)∈H is uniformly interior to Θ0, acompact subset of Θ.(b) The Π-valued estimator {pˆiL}L∈N on (Ω,F ,P) is consistent for a sequence{pi∗L ∈Π}L∈N that is uniformly interior to Π0, a compact subset of Π.(c) The sequence {mh}h∈N is LB(2+2δ ) on Θ0×Π0 for some real number δ > 0.(d) The sequence {ΛL ∈ Sq}L∈N is bounded.Assumptions 3.6 is mild. For example, we consider how the required conditionscan be satisfied in Example 3.2. For each (k,h) ∈ N2, let Y˜hk denote the firstcomponent of X˜hk, and Z˜hk a vector consisting of the second to last component ofX˜hk. Assume that {Y˜kh : (k,h) ∈ N2} and {Z˜kh : (k,h) ∈ N2} are uniformly L4+4δ -and L12+12δ -bounded, respectively, where δ is some positive real number. Setpi∗L = θ ∗L and pˆiL = θˆL. Then {θ ∗L}L∈N is uniformly bounded, so that we can choosefor Θ0 a compact subset of Rv−1, to which {θ ∗L}L∈N is uniformly interior. We thenset Π0 =Θ0. Assumptions 3.6(a) and (b) are now satisfied. We can also easily verifythat Assumptions 3.6(c) is fulfilled. Of course, the uniform L12+12δ -boundednessof {Y˜hk} may look too restrictive in some applications. In such a case, we shouldchoose a different {mh : h ∈ N} that results in less stringent moment requirements.In formulating an estimator of α∗L , we first consider how to estimate αL(θ ,pi,Λ)with fixed θ ∈Θ0, pi ∈Π0, and Λ ∈ Sq. Note that αL(θ ,pi,Λ) is a weighted average71oftr(Λ m¯Lh(θ ,pi) m¯Lh(θ ,pi)′) (3.10)taken over the L strata, with weights βLn. If we can estimate (3.10) with an asymp-totically negligible bias for each h ∈ {1, . . . ,L}, plugging the asymptotically un-biased estimator into (3.9) generates an estimator that converges to αL(θ ,pi,W )in probability-P by the law of large numbers for independent but not identicallydistributed processes.For each (h,L) ∈ H, define m˜Lh : Ω×Θ×Π→ Rq bym˜Lh(ω,θ ,pi)≡NLh−1nh∑i=1nhi∑j=1WLhi j(ω)mh(XLhi j(ω),θ ,pi), (ω,θ ,pi)∈Ω×Θ×Π.Then m˜Lh(·,θ ,pi) is an unbiased estimator of m¯Lh(θ ,pi) by Proposition 3.1.Nevertheless, m˜Lh(·,θ ,pi)m˜Lh(·,θ ,pi)′ is biased for m¯Lh(θ ,pi)m¯Lh(θ ,pi)′ byvar[m˜Lh(·,θ ,pi)]. To correct the bias, we need to estimate it. For a wide rangeof multistage cluster sample designs, the literature offers design-unbiased esti-mators of the covariance matrix of the stratum total estimator NLhm˜Lh(·,θ ,pi) =∑nhi=1∑nhij=1WLhi jmh(XLhi j,θ ,pi), where the design unbiasedness means the unbiased-ness for the conditional covariance matrix of NLhm˜Lh(·,θ ,pi) given X˜ . (Wolter,1985, pp. 11–16), for example, lists many of such estimators. We here assume theavailability of such an estimator in each stratum and each (θ ,pi) ∈Θ×Π.Assumption 3.7. The array {Σ˜Lh : Ω×Θ×Π→ Sq}(h,L)∈H satisfies that for each(θ ,pi) ∈Θ×Π and each (h,L) ∈ H, Σ˜Lh(·,θ ,pi) is measurable-F/B(Sq). Also,the array {N−2Lh Σ˜Lh} is SLB(1+δ ) on Θ0×Π0 for some real number δ > 0. Further,it holds thatE[Σ˜Lh(·,θ ,pi) | X˜ ] = var[NLhm˜Lh(·,θ ,pi) | X˜ ].Given the estimator Σ˜Lh(·,θ ,pi), a design-unbiased estimator of var[m˜Lh(·,θ ,pi)]is N−2Lh Σ˜Lh(·,θ ,pi). 
Our estimator αˇL(·,θ ,pi,Λ) : Ω→ R of αL(θ ,pi,Λ) is obtainedby replacing m¯Lh(θ ,pi)m¯Lh(θ ,pi)′ with m˜h(·,θ ,pi)m˜′h(·,θ ,pi)−N−2Lh Σ˜Lh(·,θ ,pi) in72(3.9):αˇL(ω,θ ,pi,Λ)≡L∑h=1βLh tr(Λ(m˜Lh(ω,θ ,pi)m˜Lh(ω,θ ,pi)′−N−2Lh Σ˜Lh(ω,θ ,pi))), θ ∈Θ, Λ ∈ Sq, L ∈ N.Nevertheless, this estimator is not exactly unbiased, because the unconditional meanof N−2Lh Σ˜Lh(·,θ ,pi) isE[N−2Lh Σ˜Lh(·,θ ,pi)] = E[var[m˜Lh(·,θ ,pi) | X˜ ] ]= var[m˜Lh(·,θ ,pi)]−var[E[m˜Lh(·,θ ,pi) | X˜ ]], (θ ,pi) ∈Θ×Π, (h,L) ∈ H,where the first equality follows by the law of iterated expectations, and the secondequality follows from the fact that the mean conditional variance and the varianceof the conditional mean sums up to the unconditional mean. BecauseE[m˜Lh(·,θ ,pi) | X˜ ] = N−1LhNLh∑k=1mh(X˜hk,θ ,pi), (θ ,pi) ∈Θ×Π, (h,L) ∈ Hby Proposition 3.1, N−2Lh Σ˜Lh(·,θ ,pi) is biased for var[m˜Lh(·,θ ,pi)] byvar[E[m˜Lh(·,θ ,pi) | X˜ ]]= N−2LhNLh∑k=1var[mh(X˜hk,θ ,pi)], (θ ,pi) ∈Θ×Π, (h,L) ∈ H.Thus, αˇL(·,θ ,pi,Λ) is biased for αL(θ ,pi,Λ) byE[αˇL(·,θ ,pi,Λ)]−αL(θ ,pi,Λ) =−L∑h=1βLhtr(Λvar[E[m˜Lh(·,θ ,pi) | X˜ ]])=−L∑h=1βLhN−2LhNLh∑k=1tr(Λvar[mh(X˜hk,θ ,pi)), (θ ,pi) ∈Θ×Π, L ∈ N.The bias is zero if mh(X˜hk,θ ,pi) is degenerate (as is the case in the finite population73setup). If mh(X˜hk,θ) is not degenerate, it holds under our current assumption thatsup{∣∣var[mh(X˜hk,θ ,pi)]∣∣ : (θ ,pi) ∈Θ0×Π0, (k,h) ∈ N2}≤ sup{∣∣E[mh(X˜hk,θ ,pi)mh(X˜hk,θ ,pi)]∣∣ : (θ ,pi) ∈Θ0×Π0, (k,h) ∈ N2}+ sup{∣∣E[mh(X˜hk,θ ,pi)]E[mh(X˜hk,θ ,pi)]′∣∣ : (θ ,pi) ∈Θ0×Π0, (k,h) ∈ N2}< ∞,(3.11)so that the size of the bias of αˇL(·,θ ,pi,Λ) for αL(θ ,pi,Λ) is O(1/min{NLh : h ∈{1,2, . . . ,L}}) uniformly in (θ ,pi,Λ)∈Θ0×Π0×, where  is any compact subsetof Sq. Thus, we can conclude that the bias is zero or asymptotically negligible if:Assumption 3.8. (a) All random vectors in X˜ are degenerate; or(b) it holds that liminfL→∞ minh∈{1,2,...,L}NLh = ∞.Applying a suitable uniform law of large numbers to the array{βLh tr(Λ(m˜Lh(ω,θ ,pi)m˜Lh(ω,θ ,pi)′−N−2Lh Σ˜Lh(ω,θ ,pi))): (h,L) ∈ H},(θ ,pi) ∈Θ0×Π0, Λ ∈ Sqafter demeaning it establishes that {αˇL(·,θ ,pi,Λ)−E[αˇL(·,θ ,pi,Λ)]}L∈N convergesto zero in probability-P uniformly in (θ ,pi,Λ) ∈Θ0×Π0×, where  is a compactsubset of Sq, to which {ΛL}L∈N is uniformly interior. Thus, if Assumption 3.8holds, {αˇL}L∈N is uniformly consistent for {αL}L∈N on Θ0×Π0×. In estimationof α∗L , we use αˇL(·,θ ,pi,Λ), replacing θ with the estimator θˆL of θ ∗L , pi with anestimator pˆiL of pi∗L , and Λ with ΛL to obtain an estimator of α∗L . We can also allowfor use of a data-dependent weighting matrix ΛˆL instead of ΛL, provided:Assumption 3.9. The sequence of random matrices, {ΛˆL : Ω→ Sq}L∈N, satisfiesthat {|ΛˆL−ΛL|}L∈N converges to zero in probability-P.Our estimator {αˆL ≡ αˇL(·, θˆL, pˆiL, ΛˆL)}L∈N is consistent for {α∗L}L∈N, given theconsistency of {(θˆL, pˆiL, ΛˆL)}L∈N for {(θ ∗L ,pi∗L ,ΛL)}L∈N and the uniform consistencyof {αˇL}L∈N for {αL}L∈N over Θ0×Π0×.74Theorem 3.2. Under Assumptions 3.1–3.9, {α∗L}L∈N is bounded, and {αˆL −α∗L}L∈N converges to zero in probability-P.For the weighting matrix, one may desire to take the inverse of ∑Lh=1βLhvar[m˜∗Lh]for the weighting matrix, where m˜∗Lh ≡ m˜Lh(·,θ ∗,pi∗), because it allows the test totake into account the “average” noisiness of m˜∗Lh. Given that ∑Lh=1βLhN−2Lh var[m˜∗Lh]is unknown, we would use the inverse of ∑Lh=1βLhN−2Lh ΣˆLh, for the weighting matrix,where ΣˆLh ≡ Σ˜Lh(·, θˆL, pˆiL). 
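To make the construction concrete, the sketch below assembles the statistic αˆL from stratum-level pieces. The stratum moment estimates m˜Lh(θˆL, pˆiL), the design-based covariance estimates N−2Lh ΣˆLh, and the weights βLh are taken as given, and are simulated here only so that the sketch runs; the weighting matrix is the inverse of ∑h βLh N−2Lh ΣˆLh just described.

import numpy as np

def alpha_hat(m_tilde, v_hat, beta):
    """Bias-corrected statistic alpha_hat_L.

    m_tilde : (L, q) array, stratum moment estimates m_tilde_Lh(theta_hat, pi_hat)
    v_hat   : (L, q, q) array, design-based estimates N_Lh^{-2} Sigma_hat_Lh of
              var[m_tilde_Lh], used both for the bias correction and for the
              common weighting matrix
    beta    : (L,) array of stratum weights beta_Lh summing to one
    """
    # Common weighting matrix: inverse of the beta-weighted average of v_hat.
    Lam = np.linalg.inv(np.einsum('h,hij->ij', beta, v_hat))
    # Stratum-level bias-corrected quadratic forms tr(Lam (m m' - v_hat)).
    outer = np.einsum('hi,hj->hij', m_tilde, m_tilde) - v_hat
    terms = np.einsum('ij,hji->h', Lam, outer)   # trace(Lam @ outer_h) for each h
    return float(beta @ terms)

# Hypothetical inputs for L = 3 strata and q = 2 moment conditions.
rng = np.random.default_rng(2)
m_tilde = rng.normal(scale=0.05, size=(3, 2))
v_hat = np.stack([np.eye(2) * 0.01 for _ in range(3)])
beta = np.full(3, 1.0 / 3.0)
print(alpha_hat(m_tilde, v_hat, beta))

In an application, m_tilde and v_hat would be computed from the weighted stratum sums and from the design-based covariance estimator ΣˆLh evaluated at the estimates, rather than simulated.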
The estimated weighting matrix can be shown to beconsistent for the desired one under our current assumptions.Remark. It would certainly be more desirable to use a stratum-specific weight-ing matrix to reflect the noisiness of m˜∗Lh in each stratum h. Nevertheless, suchgroup-specific weighting matrix cannot be estimated accurately. We thus requirethat the same weighting matrix be used in every stratum.We can also derive the large-L distribution of {αˆL}L∈N under the stratum-wise correct specification, imposing a few additional assumptions. Let ∇, ∇θ ,and ∇pi denote the gradient operator with respect to (θ ′,pi ′)′, only θ , and only pi ,respectively.Assumption 3.10. (a) There exists a bounded sequence of p× p matrices,{J∗L}L∈N, and a sequence of Borel-measurable functions from Rv×Θ to Rp,{ψh}h∈N, such that {ψh}h∈N is LB(2+ 2δ ) on Θ0, {∑nhi=1∑nhij=1WLhi jψ∗Lhi j :(h,L) ∈ H} is a zero-mean array, where ψ∗Lhi j ≡ ψh(XLhi j,θ ∗L ), andL1/2(θˆL−θ ∗L ) = J∗LL−1/2L∑h=1NLhNL/LNLh−1nh∑i=1nhi∑j=1WLhi jψ∗Lhi j +oP(1). (3.12)(b) pˆiL−pi∗L = OP(L−1/2).(c) For each h ∈ N and each x ∈ Rv, mh(x, ·, ·) : Θ×Π→ Rv is continuouslydifferentiable, and {∇mh :Rv×Θ×Π→Rp×q}h∈N is LB(2+2δ ) on Θ0×Π0for some real number δ > 0.(d) For each (h,L) ∈ H and each x ∈ Rv, each element of Σ˜Lh(x, ·, ·) : Θ×Π→Rv is continuously differentiable, and {N−2Lh ∇(vech(Σ˜Lh)) : (h,L) ∈ H} isSLB(1+δ ) on Θ0×Π0 for some real number δ > 0.75(e) The sequence {mh}h∈N is LB(4+4δ ) on Θ0×Π0 for some real number δ > 0.(f) The array {N−2Lh Σ˜Lh}(h,L)∈H is SLB(2+2δ ) on Θ0×Π0 for some real numberδ > 0.(g) The array {var[ξ ∗Lh] : (h,L) ∈ H} is uniformly positive, whereξ ∗Lh ≡LβLhtr(ΛL(m˜∗Lhm˜∗Lh′−N−2Lh Σ˜∗Lh))+G∗1,L′J∗L(LNLh/NL)N−1Lhnh∑i=1nhi∑j=1WLhi jψ∗Lhi j, (h,L) ∈ H,m˜∗Lh ≡ m˜Lh(·,θ ∗L ,pi∗L), Σ˜∗Lh ≡ Σ˜(·,θ ∗L ,pi∗L), G∗1L ≡ G1L(θ ∗L ,pi∗L ,ΛL), andG1L(θ ,pi,Λ)≡ E[∇θ α˜L(·,θ ,pi,Λ)]=L∑h=1βLh(2N−1Lhnh∑i=1nhi∑j=1E[WLhi j∇θmh(XLhi j,θ ,pi)Λm˜Lh(·,θ ,pi)]−N−2Lh E[∇θ (vecΣ˜Lh(·,θ ,pi))]vecΛ),(θ ,pi) ∈Θ×Π, Λ ∈ Sq, L ∈ N.(h) If all random vectors in X˜ are not degenerate, it holds thatliminfL→∞minh∈{1,2,...,L}NLhL1/2= ∞.Condition (a) of Assumption 3.10 requires that {θˆL}L∈N admits the asymp-totic linear representation. Conditions (b), (c), and (e) are mild like the momentconditions in Assumption 3.6, though moment requirements are tightened. Inthe RESET approach in Example 3.2 setting pi∗L = θ ∗L and pˆiL = θˆL, for instance,the uniform L4+4δ -boundedness of {Y˜hk : (k,h) ∈ N2} and the uniform L24+24δ -boundedness of {Z˜hk : (k,h) ∈ N2} with some positive real constant δ are suffi-cient for conditions (a)–(c) and (e), where we set ψh(x,θ) = z(y− z′θ), wherex = (y,z′)′ ∈ R×Rv−1, and J∗L =∫zz′ P¯L(dy,dz). Under the same conditions, con-ditions (d) and (f) are also satisfied, when a typical estimator is used for Σ˜Lh.The uniformly positiveness of {var[ξ ∗Lh] : (h,L) ∈ H} imposed in (g) is innocuous,76though it is a high-level assumption. Finally, condition (h) adds the strengthen theuniform divergence of the stratum sizes imposed in Assumption 3.8. It ensuresthat the bias of Σ˜Lh is asymptotically negligible in our derivation of the asymptoticnormality result. It is again consistent with the common view that the stratumpopulation size is large enough that the most characteristics of P¯Lh can be wellcaptured by those of the stratum population.We are now ready to state the asymptotic normality of αˆL under the stratum-wisecorrect specification.Theorem 3.3. Suppose that Assumptions 3.1–3.10 hold. 
If (3.3) holds (i.e., whenthe model is stratum-wise correctly specified),L1/2αˆL = L−1/2L∑h=1ξ ∗Lh +oP(1) and (3.13)V−1/2L L1/2αˆLA∼ N(0,1), where VL ≡ L−1∑Lh=1 var[ξ ∗Lh], L ∈ N.To formulate a useful test of the stratum-wise correct specification, we needan estimator of VL to standardize αˆL. Given the definition of {ξ ∗Lh} in Assump-tion 3.10(g), estimation of VL requires estimation of J∗L . An estimator of J∗L istypically constructed based on the formula of J∗L , which is sometimes only validunder the stratum-wise correct specification. To reflect this reality, we assume:Assumption 3.11. There exists a bounded sequence of constant p× p matrices{J¯L}L∈N such that JˆL− J¯L→ 0 in probability-P.Under the stratum-wise correct specification, it should hold that J¯L = J∗L , whileJ¯L may not coincide with J∗L under the alternatives. In the example discussed above,we can set JˆL = N−1L ∑Lh=1∑nhi=1∑nhij=1WLhi jZLhi jZ′Lhi j and J¯L =∫zz′ P¯L(dy,dz) (whichcoincides with J∗L) for each L ∈ N, to satisfy Assumption 3.11.To obtain an estimator of VL, we approximate ξ ∗Lh byξˆLh ≡LβLhtr(ΛˆL(m˜Lh(·, θˆL, pˆiL)m˜Lh(·, θˆL, pˆiL)′−N−2Lh ΣˆLh)+ Gˆ′1,LJˆL(LNLh/NL)N−1Lhnh∑i=1nhi∑j=1WLhi jψˆLhi j, (h,L) ∈ H, where77Gˆ1L ≡ G˜1L(·,θ ,pi),G˜1L(ω,θ ,pi)≡ ∇θ αˇL(ω,θ ,piL)=L∑h=1βLh(2N−1Lhnh∑i=1nhi∑j=1WLhi j(ω)∇θmh(XLhi j(ω),θ ,pi)Λm˜Lh(ω,θ ,pi)−N−2Lh ∇θ (vecΣ˜Lh(ω,θ ,pi))vecΛ),(θ ,pi) ∈Θ×Π, Λ ∈ Sq, L ∈ N, andψˆLhi j ≡ ψLh(XLhi j, θˆL, pˆiL), j ∈ {1, . . . ,nhi}, i ∈ {1, . . . ,nh}, (h,L) ∈ H.With this approximation, we now estimate VL byVˆL ≡ L−1L∑h=1ξˆ 2Lh, L ∈ N.Our test statistic is thusTL ≡L1/2αˆLVˆ 1/2L, L ∈ N.The large-L behavior of this statistic is described in the next theorem. Let ξ¯Lh denotethe expression obtained by replacing J∗L with J¯L in the definition of ξ ∗Lh.Theorem 3.4. (a) Under Assumptions 3.1–3.9, 3.10(b)–(f), and 3.11, the se-quence{V¯L ≡ L−1L∑h=1var[ξ¯Lh]+L−1L∑h=1(LβLh)2(m¯∗Lh′ΛLm¯∗Lh)2}L∈Nis bounded, and {VˆL−V¯L}L∈N converges in probability-P to zero.(b) Suppose that Assumptions 3.1–3.11 hold. If (3.3) holds, and J¯L = J∗L foreach L ∈ N (i.e., when the model is stratum-wise correctly specified), thenTLA∼ N(0,1).(c) Suppose that Assumptions 3.1–3.9, 3.10(b)–(f), (h), and 3.11 hold. If {α∗L}L∈Nis uniformly positive, then for each c ∈ R, P[TL > c]→ 1.78Theorem 3.4 shows that we can perform a level-p test of the stratum-wisecorrect specification by rejecting the null hypothesis when TL is greater than the(1− p)-quantile of the standard normal distribution. It also shows that the test has apower approaching 1, if {α∗L} is uniformly positive. When {ΛL}L∈N is uniformlypositive definite, the imposed uniform positivity of {α∗L} can be interpreted asrequiring that the average of the squared length of m¯∗Lh taken over all strata in thepopulation does not shrink, as the second result in the next proposition states. Theuniform positive definiteness of {ΛL} is a mild requirement.Proposition 3.5. Suppose that Assumptions 3.1–3.5 and 3.6(d) hold.(a) If {α∗L}L∈N is uniformly positive, so is {∑Lh=1βLh | m¯∗Lh |2}L∈N.(b) If, in addition, {ΛL}L∈N is uniformly positive definite, the converse of (a)holds.3.4 Tests without Estimation of Nuisance ParametersThe specification test of Bierens (1990) is an m-test based the moment conditiondescribed in Example 3.2. In practice, it is not clear what value we should choosefor the nuisance parameter pi . 
A way to overcome this difficulty is to calculateTL for each of the possible values of the nuisance parameter and summarize theresults in the form of a scalar statistic. For each pi ∈Π, let TL(pi) denote statisticTL obtained by taking pi for pˆiL and ΛˆL(pi) for ΛˆL, where ΛˆL(pi) is a weight matrixthat is possibly dependent on pi . In this section, we consider tests based on statisticsthat can be written as functions ϕ of the random function pi 7→TL(pi) on Π such assuppi∈ΠTL(pi).We continue imposing the conditions employed in Section 3.3 that are notdirectly related to pi∗L , though we now require Π=Π0. That is, we impose Assump-tions 3.1–3.5, 3.7, 3.8, and 3.11 with no changes, while we modify Assumptions 3.6,3.9, and 3.10 to accommodate the current approach. The modified conditions ofAssumption 3.6 and 3.9 are:Assumption 3.12. (a) Assumptions 3.6(a) and (c) hold, with Π0 =Π nonemptyand compact.79(b) The sequence {ΛL}L∈N of uniformly equicontinuous functions from Π to Sqsatisfies that sup{|ΛL(pi)| : pi ∈Π, L ∈ N}< ∞.(c) The sequence {ΛˆL : Ω × Π → Sq}L∈N of functions measurable-F ⊗B(Π)/B(Sq) satisfies that for each ω ∈ Ω, ΛˆL(ω, ·) : Π→ Sq is contin-uous and that suppi∈Π |ΛˆL(pi)−ΛL(pi)| → 0 in probability-P.Under this assumption combined with Assumptions 3.1–3.4, {αˆL(pi) ≡αˇL(·,θ ∗L ,pi, ΛˆL(pi)}L∈N is consistent for {α∗L(pi)≡ αL(θ ∗L ,pi,ΛL(pi))}L∈N uniformlyin pi ∈Π, corresponding to the result of Theorem 3.2 in Section 3.3.Theorem 3.1. Under Assumptions 3.1–3.5, 3.7, 3.8, 3.11, and 3.12, {α∗L(pi) :pi ∈Π,L ∈ N} is bounded, and suppi∈Π |αˆL(pi)−α∗L(pi)| → 0 in probability-P.We now state the modified version of Assumption 3.10. For convenience, weintroduce the LBP and SLBP properties, which are slightly stronger versions of theLB and LBP properties.Definition 5. Given Assumption 3.1, let Γ be a finite-dimensional Euclidean spaceand {φh}h∈N a sequence of measurable functions from (Rv×Γ,Bv⊗B(Γ)) to(Rl1×l2 ,Bl1×l2). We say that {φh} is LBP(a) on Γ, where a ∈ [1,∞), if it is LB(a)fulfilling the conditions required in Definition 3 with h that satisfies that h(y)/ys→ 0as y ↓ 0 for some real number s > 0.Definition 6. Given Assumption 3.1 and 3.2, let Γ be a finite-dimensional Euclideanspace and {ΦLh : (h,L) ∈ H} an array of measurable functions from (Ω×Γ,F ⊗B(Γ)) to (Rl1×l2 ,Bl1×l2). We say that {ΦLh} is SLBP(a) on Γ, where a ∈ [1,∞), ifit is SLB(a) fulfilling the conditions required in Definition 4 with h that satisfies thath(y)/ys→ 0 as y ↓ 0 for some real number s > 0.Remark. The P in the terms “LBP” and “SLBP” stands for the requirement thath is dominated by a power function in the neighborhood of the origin.The SLBP property is useful in our analysis, being closely related to the LBPproperty in the manner parallel to the relationship between the LB and SLB propri-eties.Lemma 3.2. The assertion of Lemma 3.1 holds even if LB and SLB are replacedwith LBP and SLBP, receptively, in the lemma.80Writem¯∗Lh(pi)≡ m¯Lh(θ ∗L ,pi), pi ∈Π, i ∈ Ig, (h,L) ∈ H,Σ˜∗Lh(pi)≡ Σ˜Lh(·,θ ∗L ,pi), pi ∈Π, (h,L) ∈ H,m˜∗Lh(pi)≡ m˜Lh(ω,θ ∗L ,pi), pi ∈Π, (h,L) ∈ H,G∗1L(pi)≡ G1L(θ ∗L ,pi,ΛL(pi)), pi ∈Π, L ∈ N andξ ∗Lh(pi) = LβLhtr(ΛL(m˜∗Lh(pi)m˜∗Lh(pi)′−N−2Lh Σ˜∗Lh(pi)))+G∗1,L(pi)′J∗L(LNLh/NL)N−1Lhnh∑i=1nhi∑j=1WLhi jψ∗Lhi j, (h,L) ∈ H,The modified assumption is:Assumption 3.13. 
(a) Assumptions 3.10(a), (c)–(f), and (h) hold, with LB andSLB replaced by LBP and SLBP, respectively.(b) The array {var[ξ ∗Lh(pi)] : pi ∈Π,(h,L) ∈ H} is uniformly positive.Imposing Assumption 3.13 along with Assumptions 3.1–3.5, 3.7, 3.8, 3.11, and3.12, we derive the large-L distribution of {pi 7→ L1/2αˆL(pi)}L∈N, a sequence ofrandom functions from Π to R, under the null. In so doing, we employ a functionalcentral limit theorem derived from (Pollard, 1990, Theorem 10.6), which requiresan additional assumption. LetVL(pi)≡ L−1L∑h=1var[ξ ∗Lh(pi)], pi ∈Π, L ∈ N.The additional assumption is:Assumption 3.14. There exists a function K : Π2→ R such that for each (pi1,pi2) ∈Π2,KL(pi1,pi2)≡ L−1L∑h=1cov[VL(pi1)−1/2ξ ∗Lh(pi1),VL(pi2)−1/2ξ ∗Lh(pi2)]→ K(pi1,pi2).81Note that KL(pi1,pi2) is the coefficient of correlation between L−1∑Lh=1 ξ ∗Lh(pi1)and L−1∑Lh=1 ξ ∗Lh(pi2). Assumption 3.14 requires that the correlation coefficientconverges as L grows large.We now state the asymptotic Gaussianity of {pi 7→ L1/2αˆL(pi)}L∈N on Π underthe stratum-wise correct specification.Theorem 3.3. Suppose that Assumptions 3.1–3.5, 3.7, 3.11, 3.12, and 3.13(a) hold.If (3.3) holds (i.e., when the model is stratum-wise correctly specified), it holds thatsuppi∈Π∣∣∣∣L1/2αˆL(pi)−L−1/2L∑h=1ξ ∗Lh(pi)∣∣∣∣→ 0 in probability-P. (3.14)If, in addition, Assumptions 3.13(b), and 3.14 hold, the sequence of random functionsfrom Π to R{pi 7→VL(pi)−1/2L1/2αˆL(pi)}L∈N (3.15)converges in distribution to the zero-mean Gaussian process with covariance kernelK concentrated on U(Π), the set of all uniformly continuous R-valued functions onΠ.WriteξˆLh(pi)≡LβLhtr(ΛˆL(pi)(m˜Lh(·, θˆL,pi)m˜Lh(·, θˆL,pi)′−N−2Lh Σ˜Lh(·, θˆL,pi))+ Gˆ1,L(pi)′JˆL(LNLh/NL)N−1Lhnh∑i=1nhi∑j=1WLhi jψˆLhi j, (h,L) ∈ H, whereGˆ1L(pi)≡ ∇θ αˇL(·, θˆL,pi), L ∈ N.We then estimate VL(pi) byVˆL(pi)≡ L−1L∑h=1ξˆ 2Lh(pi), pi ∈Π, L ∈ N.Our test is based on the random functionpi 7→TL(pi)≡L1/2αˆL(pi)VˆL(pi)1/2, L ∈ N.82The large-L behavior of {pi 7→ VˆL(pi)}L∈N and {pi 7→TL(pi)}L∈N are describedin the next theorem. Let ξ¯Lh(pi) denote the expression obtained by replacing J∗L withJ¯L in the definition of ξ ∗Lh(pi).Theorem 3.4. (a) Under Assumptions 3.1–3.3, 3.11, 3.12, and 3.13(a), the set{V¯L(pi)≡ L−1L∑h=1var[(ξ¯Lh(pi)]+L−1L∑h=1(LβLh)2(m¯∗Lh(pi)′ΛL(pi)m¯∗Lh(pi))2pi ∈Π, L ∈ N}is bounded, {V¯L : Π→ R}L∈N is uniformly equicontinuous, and {VˆL(pi)−V¯L(pi)}L∈N converges in probability-P to zero uniformly in pi ∈Π.(b) Suppose that Assumptions 3.1–3.5, 3.7, and 3.11–3.14 hold. If (3.3) holds,and J¯L = J∗L for each L ∈ N (i.e., when the model is stratum-wise correctlyspecified), then the sequence of random functions {pi 7→TL(pi)}L∈N convergesin distribution to the zero-mean Gaussian process with covariance kernel Kconcentrated on U(Π).We now apply a real-valued map ϕ to the random function pi 7→ TL(pi), toobtain a scalar statistic suitable for use in testing our null hypothesis. Given anypossible realization of the data, the set {pi ∈Π : VˆL(pi) = 0} is Borel-measurablefor each L ∈ N, because VˆL(pi) is continuous in pi . Also, αˆL(pi) is continuous in pi .It follows that pi 7→TL(pi) is a Borel-measurable functional on Π for each possiblerealization of the data. We thus pick a map ϕ defined on M (Π), the set of allBorel-measurable functions from Π to R. The map ϕ given byϕ( f )≡ suppi∈Πf (pi), f ∈M (Π) (3.16)is such a map. 
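In practice the functional ϕ is applied to TL(·) evaluated on a finite grid of nuisance-parameter values. The sketch below does this for the supremum map in (3.16); the array t_grid is only a stand-in for the computed values TL(pi_1), ..., TL(pi_G) and is simulated so that the sketch runs. Other choices of ϕ, including the one defined next, can be approximated on the same grid.

import numpy as np

rng = np.random.default_rng(3)

# Grid over a (here one-dimensional) compact nuisance-parameter set Pi.
pi_grid = np.linspace(-1.0, 1.0, 41)
# Placeholder for T_L(pi_g) at each grid point; in an application these values
# would come from alpha_hat_L(pi_g) and V_hat_L(pi_g) as defined above.
t_grid = rng.normal(size=pi_grid.size)

phi_sup = t_grid.max()     # phi(f) = sup over pi of f(pi), cf. (3.16)
print(phi_sup)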
The one written asϕ( f )≡∫Πmax{ f (pi),0}dpi, f ∈M (Π), (3.17)where the integral is taken with respect to the Lebesgue measure, is another.In (3.17), the effect of negative values of f (pi) is suppressed before the integral83is taken. To appreciate benefits of this mechanism, recall that α∗L(pi) is known tobe nonnegative for every pi ∈ Π. A negatively large realized value of TL(pi) canbe viewed as a reflection of an error in estimation of α∗L(pi). By suppressing thenegative part of the integrand in (3.17), we prevent such estimation errors for somepi’s from canceling out the effect of positive values of TL(pi) for other pi’s.Like most other maps discussed in similar contexts in the literature, the maps ϕin (3.16) and (3.17) satisfy the conditions imposed in the next assumption.Assumption 3.15. The function ϕ from M (Π) endowed with the uniform metricto the one-dimensional Euclidean space is continuous. It is also monotonic in thesense that whenever a pair f1 and f2 in M (Π) satisfies that f1 ≥ f2, it holds thatϕ( f1) ≥ ϕ( f2). Further, it satisfies that whenever a sequence { f j ∈M (Π)} j∈Nsatisfies that for each pi in some Borel-measurable subset Π¯ of Π with a nonzeroLebesgue measure, lim j→∞ f j(pi) = ∞, it holds that lim j→∞ϕ( f j) = ∞.Given such a map ϕ , our test should reject the null hypothesis if ϕ(TL) exceedsa suitably chosen critical value.To pick the critical value in the test, we use the result of Theorem 3.4(ii). Letη be a zero-mean Gaussian process with kernel K concentrated on U(Π). Then itfollows from the theorem by the continuous mapping theorem (van der Vaart andWellner, 1996, Lemma 1.3.6) that {ϕ(TL)}L∈N converges in distribution to ϕ(η)under the null. To utilize this fact in formulation of a test, we need to estimate K,on which the distribution of ϕ(η) depends. A natural estimator of K is {KˆL}L∈Ndefined byKˆL(pi1,pi2)≡ (VˆL(pi1)VˆL(pi2))−1/2L−1L∑h=1ξˆLh(pi1)ξˆLh(pi2), (pi1,pi2) ∈Π20, L ∈ N.Under the current assumptions, KˆL(pi1,pi2) is consistent for K(pi1,pi2) uniformly in(pi1,pi2) ∈ Π2 if {α∗L(pi)}L∈N converges to zero uniformly in pi ∈Π, in particular,under the null hypothesis.Having a consistent estimator KˆL of K, it is easy to generate a zero-meanGaussian process with covariance kernel KˆL, as Hansen (1996) explains. Let{νh}h∈N be an i.i.d. sequence of standard normal random variables independent84from the data. It then holds that for each L ∈ N,TˆL(pi)≡ VˆL(pi)−1/2L−1/2L∑h=1ξˆLh(pi)νhis a zero-mean Gaussian process with covariance kernel KˆL conditionally given thedata. When for each pi ∈Π and each L ∈ N, α∗L(pi) = 0, the conditional distributionof pi 7→ TˆL(pi) weakly converges to a Gaussian process with covariance kernelK concentrated on U(Π) in probability-P, as one might expect from the uniformconsistency of {KˆL} for K. see Gine and Zinn (1990) for the concept of weakconvergence in probability). It follows that we can take for the critical value in ourtest the (1− p)-quantile of the distribution of ϕ(TL) conditional on the data. Inaddition, we can show that suppi∈Π |TˆL(pi)|= OP(1), regardless of whether or notthe null hypothesis is true. This means that the critical value would stay bounded asL→ ∞, even under alternatives.In practice, the test described above can be conveniently implemented by usingthe p-value transformation explained in Hansen (1996). The next theorem providesinformation essential for the test in such form.Theorem 3.5. 
(a) Suppose that Assumptions 3.1–3.5, 3.7, and 3.11–3.15 hold.Also suppose that (3.3) holds, and J¯L = J∗L for each L ∈ N (these conditionshold if the model is stratum-wise correctly specified). Let FˆL be the conditionaldistribution function of ϕˆL ≡ ϕ(TˆL) given the data. Also, let F denotethe distribution function of ϕ0 ≡ ϕ(η), where η is a zero-mean Gaussianprocess with covariance kernel K concentrated on U(Π). Write ϕL ≡ ϕ(TL),L ∈ N. Then {pˆL ≡ 1− FˆL(ϕL)}L∈N and {pL ≡ 1−F(ϕL)}L∈N satisfy thatpˆL− pL→ 0 in probability-P. If, in addition, F is continuous and increasingon the support of ϕ0, { pˆL}L∈N is asymptotically distributed with the uniformdistribution on [0,1].(b) Suppose that Assumptions 3.1–3.5, 3.7, 3.11, 3.12, and 3.13(a) hold. If thereexists a Borel-measurable subset Π¯ of Π with a nonzero Lebesgue measuresuch that liminfL→∞ infpi∈Π¯α∗L(pi)> 0, then pˆL→ 0 in probability-P.Our test rejects the null hypothesis, if and only if pˆL is lower than the specified85level. Theorem 3.5 confirms that the test has asymptotically correct size under thenull and a power approaching to one if for some Π¯⊂Π with a non-zero Lebesguemeasure, infpi∈Π¯α∗L(pi) is bounded away from zero for almost all L ∈ N.86BibliographyAnatolyev, Stanislav. 2008 (Sept.). Inference in Regression Models with ManyRegressors. Working Paper 125. Center for Economic and Financial Research,New Economic School. → pages 61Belzil, Christian, and Hansen, Jorgen. 2002. Unobserved ability and the return toschooling. Econometrica, 70, 2075–2091. → pages 21, 38Belzil, Christian, and Hansen, Jorgen. 2007. A structural analysis of correlatedrandom coefficent wage regression model. Journal of Econometrics, 140,827–848. → pages 21, 38Bierens, Herman J. 1990. A Consistent Conditional Moment Test of FunctionalForm. Econometrica, 58(6), 1443–1458. → pages 68, 79Binder, David A. 1983. On the Variance of Asymptotically Normal Estimators fromComplex Surveys. International Statistical Review, 51(3), 279–292. → pages 45Blaug, Mark. 1985. Where are we now in the economics of education. Economicsof Education Review, 4(1), 17–28. → pages 1Breckling, J. U., et al. . 1994. Maximum Likelihood Inference from Sample SurveyData. International Statistical Review, 62(3), 349–363. → pages 45Card, David. 2001. Estimating the Return to Schooling: Progress on SomePersistent Econometric Problems. Econometrica, 69(5), 1127–1160. → pages 1,2Carneiro, Pedro, Hansen, Karsten T., and Heckman, James J. 2003. Estimatingdistributions of treatment effects with an application to the returns to schoolingand measurement of the effects of uncertainty on college choice. InternationalEconomic Review, 44, 361–422. → pages 2, 3, 4, 3887Carrasco, Marine, and Florens, Jean-Pierre. 2000. Generalization of GMM to aContinuum of Moment Conditions. Econometric Theory, 16(6), 797–834. →pages 61Carrasco, Marine, Chernov, Mikhail, and Florens, Jean-Pierre Ghysels, Eric. 2007.Efficient Estimation of General Dynamic Models with a Continuum of MomentConditions. Journal of Econometrics, 140(2), 529–573. → pages 61Cassel, C. M., Sarndal, Carl-Erik, , and Wretman, Jan. 1977. Foundations ofinference in survey sampling. New York: Wiley. → pages 45Chamblessa, Lloyd E., and Boyle, Kerrie E. 1985. Maximum Likelihood Methodsfor Complex Sample Data: Logistic Regression And Discrete ProportionalHazards Models. Communications in Statistics-Theory and Methods, 14(6),1377–1392. → pages 45Chen, Jiahua, and Rao, J. N. K. 2007. Asymptotic normality under twophasesampling designs. Statist. 
Sinica, 1047–1064. → pages 46Cunha, Flavio, and Heckman, James J. 2008. Formulating, identifying andestimating the technology of cognitive and noncognitive skill formation. Journalof Human Resources, 43(4), 738–782. → pages 3Cunha, Flavio, Heckman, James J., and Schennach, Susanne M. 2010. Estimatingthe technology of cognitive and noncognitive skill formation. Econometrica,78(3), 883–931. → pages 3Davidson, J. 1994. Stochastic Limit Theory-An Introduction for Econometricians,Advanced Textbooks in Econometrics. → pages 150, 152Dempster, A.P., Laird, N.M., and Rubin, D.B. 1977. Maximum Likelihood fromIncomplete Data via the EM Algorithm. Journal of the Royal StatisticalSociety.Series B (Methodological), 39(1), 746–773. → pages 22Dickens, William T., and Lang, Kevin. 1985. A test of dual labor market theory.The American Economic Review, 75(4), 792–805. → pages 1Donald, Stephen G., Imbens, Guido W., and Newey, Whitney K. 2003. EmpiricalLikelihood Estimation and Consistent Tests with Conditional MomentRestrictions. Journal of Econometrics, 117(1), 55–93. → pages 61Doran, Howard E., and Schmidt, Peter. 2006. GMM Estimators with ImprovedFinite Sample Properties Using Principal Components of the Weighting Matrix,with an Application to the Dynamic Panel Data Model. Journal of Econometrics,133(1), 387–409. → pages 6188Duncan, G., and Homan, S. 1981. The incidence and wage effects of overeducation.Economics of Education Review, 1(1), 75–86. → pages 1Folland, G. 1984. Real analysis: Modern techniques and their applications. →pages 111, 112, 122, 123Fuller, Wayne A. 1984. Least Squares and Related Analyses for Complex SurveyDesigns. Survey Methodology, 10(1), 97–118. → pages 45Gallant, A. Ronald, and White, Halbert. 1988. A Unified Theory of Estimation andInference for Nonlinear Dynamic Models. New York: Basil Blackwell. → pages50, 112Gine, Evarist, and Zinn, Joel. 1990. Bootstrapping General Empirical Measures.The Annals of Probability, 18(2), 851–869. → pages 85Grubb, W. Norton. 1997. The Returns to Education in the Sub-Baccalaureate LaborMarket, 1984-1990. Economics of Education Review, 16(3), 231–245. → pages3Han, Chirok, and Phillips, Peter C. B. 2006. GMM with Many Moment Conditions.Econometrica, 74(1), 147–192. → pages 61Hansen, Bruce E. 1996. Inference When a Nuisance Parameter Is Not Identifiedunder the Null Hypothesis. Econometrica, 64(2), 413–430. → pages 84, 85Hansen, Karsten, Heckman, James J., and Mullen, Kathleen J. 2004. The Effects ofSchooling and ability on Achievement test scores. Journal of Econometrics,121(1-2), 39–98. → pages 3, 15Heckman, James J., and Singer, Burton. 1984. A method for minimizing the impactof distributional assumptions in econometric models for duration data.Econometrica, 52, 271–320. → pages 22Heckman, James J., Lochner, Lance J., and Todd, Petra E. 2006a. EarningsFunctions, Rates of Return and Treatment Effects: The Mincer Equation andBeyond. I, 307–458. → pages 1Heckman, James J., Stixrud, Jora, and Urzua, Sergio. 2006b. The Effects ofCognitive and Noncognitive Abilities on Labor Market Outcomes and SocialBehavior. Journal of Labour Economics, 24(3), 746–773. → pages 3, 15, 21Hoffmann, Florian. 2011. An empirical model of life-cycle earnings and mobilitydynamics. working paper. → pages 589Horvitz, D. G., and Thompson, D. J. 1952. A generalization of sampling withoutreplacement from a finite universe. Journal of the American StatisticalAssociation, 47(260), 663–685. → pages 46Hung, Hsien-Ming. 1990. 
Nonlinear Regression Analysis for Complex Surveys.Communications in Statistics-Theory and Methods, 19(9), 3447–3468. → pages45Jaeger, David A., and Page, Marianne E. 1996. Degrees Matter: New Evidence onSheepskin Effects in the Returns to Education. The Review of Economics andStatistics, 78(4), 733–740. → pages 5, 33Kane, Thomas J., and Rouse, Cecilia Elena. 1995. Labor Market Returns to Two-and Four-Year College. American Economic Review, 85(3), 600–614. → pages 3Kasahara, Hiroyuki, and Shimotsu, Katsumi. 2009. Nonparametric identification offinite mixture models of dynamic discrete choices. Econometrica, 77, 135–175.→ pages 2, 18Kasahara, Hiroyuki, and Shimotsu, Katsumi. 2012. Nonparametric identification ofmultivariate mixtures. Discussion papers 2010-09, Graduate School ofEconomics, Hitotsubashi University. → pages 2, 18Keane, Michael P., and Wolpin, Kenneth I. 1997. The career decisions of youngmen. Journal of Political Economy, 105, 473–522. → pages 4, 20, 21, 26Koenker, Roger, and Machado, Jose´ A. F. 1999. GMM Inference When the Numberof Moment Conditions is Large. Journal of Econometrics, 93(2), 327–344. →pages 61Koenker, Roger W., and Bassett, Jr., Gilbert W. 1978. Regression Quantiles.Econometrica, 46(1), 33–50. → pages 54Krewski, D., and Rao, J. N. K. 1981. Inference from Stratified Samples: Propertiesof the Linearization, Jacknife and Balanced Repeated Replication Methods. TheAnnals of Statistics, 9(5), 1010–1019. → pages 45Light, Audrey, and Strayer, Wayne. 2004. Who Receives the College WagePremium? Assessing the Labor Market Returns to Degrees and College TransferPatterns. The Journal of Human Resources, 39(3), 411–482. → pages 3Marcotte, Dave E., Bailey, Thomas, Borkoski, Carey, and Kienzl, Greg S. 2005.The Returns of a Community College Education: Evidence From the NationalEducation Longitudinal Survey. Educational Evaluation and Policy Analysis,27(2), 157–175. → pages 390Newey, Whitney K. 1985. Maximum Likelihood Specification Testing andConditional Moment Tests. Econometrica, 53(5), 1047–1070. → pages 59Neyman, Jerzy. 1934. On the two different aspects of the representative method:the method of stratified sampling and the method of purposive selection. Journalof the Royal Statistical Society, 97(4), 558–625. → pages 45Pollard, David. 1990. Empirical Processes: Theory and Applications. CBMS-NSFRegional Conference Series in Applied Mathematics, vol. 2. Hayward,California: Institute of Mathematical Statistics. → pages 81, 136Ramsey, J. B. 1969. Tests for Specification Errors in Classical LinearLeast-Squares Analysis. Journal of the Royal Statistical Society, Series B, 71,351–371. → pages 69Roy, A. D. 1951. Some Thoughts on the Distribution of Earnings. OxfordEconomic Papers, 3(2), 135–146. → pages 2Sakata, Shinichi. 2000. Quasi-Maximum Likelihood Estimation with ComplexSurvey Data. Mimeo., University of Michigan. → pages 45Sakata, Shinichi. 2009 (Sept.). m-Testing of Model Specification in Many Groups.Mimeo., University of British Columbia. → pages 59, 61Sakata, Shinichi, and Xu, Jinwen. 2010 (Oct.). M-Estimation with Complex SurveyData. Mimeo., University of British Columbia. → pages 59Sarndal, Carl-Erik, Thomsen, Ib, Hoem, Jan M., Lindley, D. V., Barndorff-Nielsen,O., and Dalenius, Tore. 1978. Design-Based and Model-Based Inference inSurvey Sampling. Scandinavian Journal of Statistics, 5(1), 27–52. → pages 45Sarndal, Carl-Erik, Swensson, Bengt, and Wretman, Jan. 1992. Model AssistedSurvey Sampling. New York: Springer-Verlag. → pages 45Sen, Pranab Kumar. 1970. 
On Some Convergence Properties of One-Sample RankOrder Statistics. Annals of Mathematical Statistics, 41(6), 2140–2143. → pages67Sicherman, N. 1991. ”Overeducation” in the labor market. Journal of LaborEconomics, 9(2), 101–122. → pages 1Tauchen, G. 1985. Diagnostic Testing and Evaluation of Maximum LikelihoodModels. Journal of Econometrics, 30(1/2), 415–444. → pages 5991van der Vaart, Aad W., and Wellner, Jon A. 1996. Weak Convergence andEmpirical Processes. Springer Series in Statistics. New York: Springer. →pages 84, 148, 151White, H. 1984. Asymptotic theory for econometricians. → pages 117White, Halbert. 1987. Specification Testing in Dynamic Models. Chap. 1, pages1–58 of: Truman, F. Bewley (ed), Advances in Econometrics—Fifth WoldCongress. Econometric Society Monographs, vol. 1. New York: CambridgeUniversity Press. → pages 59White, Halbert. 1994. Estimation, Inference and Specification Analysis. → pages117, 118Willis, Robert J. 1986. Wage Determinants: A Survey and Reinterpretation ofHuman Capital Earnings Functions. I, 525–602. → pages 2Willis, Robert J., and Rosen, Sherwin. 1979. Education and self-selection. Journalof Political Economy, 87, S7–S36. → pages 2, 4, 39Wolter, Kirk M. 1985. Introduction to Variance Estimation. New York: Springer.→ pages 55, 7292Appendix AA.1 Summary Statistics, College Dropouts vs. CollegeGraduates93Table A.1: Discriptive Statistics (More Education Categories)2-yr Dropouts Associate Degree 4-yr Dropouts Bachelor’s DegreeVariables Obs Mean S.D Obs Mean S.D Obs Mean S.D Obs Mean S.DHighest grade completed 119 12.697 0.859 40 14.200 0.608 131 13.573 1.336 307 16.619 1.180Age in 1979 119 17.277 2.004 40 17.525 2.075 131 17.55 2.146 307 17.759 2.246Initial job(white collar) 119 0.244 0.431 40 0.275 0.452 131 0.374 0.486 307 0.752 0.432Initial wage 119 10.19 4.579 40 12.185 5.605 131 10.669 5.18 307 14.106 13.981Initial wage(blue collar) 119 10.524 4.926 40 12.274 6.006 131 10.099 4.513 307 9.935 4.603Initial wage(white collar) 119 9.155 3.128 40 11.95 4.629 131 11.624 6.067 307 15.478 15.669Mother education 119 12.126 1.964 40 12.4 1.751 131 12.328 1.854 307 13.166 2.096Father education 119 12.689 2.626 40 12.9 2.479 131 13.038 3.134 307 14.055 3.072Number of siblings 119 2.798 1.754 40 2.45 1.694 131 2.748 1.729 307 2.43 1.596Broken family at age 14 119 0.176 0.383 40 0.075 0.267 131 0.115 0.32 307 0.094 0.293South at age 14 119 0.244 0.431 40 0.175 0.385 131 0.244 0.431 307 0.248 0.432Urban at age 14 119 0.84 0.368 40 0.65 0.483 131 0.725 0.448 307 0.814 0.38994In table A.1, I divide two-year college attendants into those with and withoutan associate degree, and divide four-year attendants to those with and without abachelor’s degree. Among the two-year college attendants, 75% do not obtain anassociate degree. The average schooling years of the two-year college dropouts are12.70 years and those of the two-year college graduates are 14.2 years. Among thefour-year college attendants, around 30% drop out of four-year college while themajority obtain a bachelor’s degree. The average schooling years of the four-yearcollege dropouts are 13.6 years and those of the four-year college graduates arearound 16.6 years. Regarding the the first job after schooling, those who obtainan associate degree are slightly more likely to work in a white-collar occupationthan those drop out of a two-year college. The probability of initially workingin a white-collar position of the two-year college dropouts is 24.4% and that ofindividuals with an associate degree is 27.5%. 
Interestingly, although the four-year college dropouts have more schooling years than those with an associate degree, the former are more likely to work in a white-collar occupation upon entering the labour market than the latter. On average, around 37.4% of the four-year college dropouts initially work in a white-collar occupation. Those with a bachelor's degree are much more likely to start with a white-collar occupation than the others. Around 75.2% of those with a bachelor's degree work in a white-collar occupation as their first jobs. Regarding wages, the two-year college dropouts and the four-year college dropouts earn almost the same. The average hourly payments for the two-year college dropouts and the four-year college dropouts are $10.19 and $10.67, respectively. Those with an associate degree earn around $12.19 per hour for their first jobs. The hourly payment of those with an associate degree is higher than that of the two-year college dropouts and the four-year college dropouts. Those with a bachelor's degree earn the most among the post-secondary attendants. The average hourly payment of those with a bachelor's degree is around $14.11. When we take a closer look at wages by separating individuals into two occupation groups, those who initially work in a blue-collar occupation and those who initially work in a white-collar occupation, the story is different. The two-year college dropouts, the four-year college dropouts, and those with a bachelor's degree earn around $10 per hour if their first job is blue-collar, while those with an associate degree earn $12.27 per hour if their first job is blue-collar. For those whose first job is white-collar, the two-year college dropouts earn $9.15 per hour. Those with an associate degree and the four-year college dropouts earn around $12 per hour. Those with a bachelor's degree earn more than the two-year college dropouts, those with an associate degree, and the four-year college dropouts: the hourly payment of those with a bachelor's degree is $15.48 per hour. Although this paper mainly examines the impact of attendance of a two-year college and a four-year college on wages, I also provide estimates of the occupation-specific wage gains from obtaining a bachelor's degree for a high school graduate by eliminating the four-year college dropouts from the sample. I do not study the occupation-specific returns to an associate degree because the sample size of those with an associate degree is too small for reasonable results.

A.2 Simplification of the Type-Specific Joint Distribution

Below, I show how to simplify the type-specific joint distribution of wages, occupations, education, and the test scores:
\begin{align*}
& f^m(\{W_{it},O_{it}\}_{t=1}^{T}, S_i, \{Q_{ir}\}_{r=1}^{6} \mid \{X_{it}\}_{t=1}^{T}, Z_{S,i}, \{Z_{ir}\}_{r=1}^{6}) \\
&= f^m(\{W_{it},O_{it}\}_{t=1}^{T}, S_i \mid \{Q_{ir}\}_{r=1}^{6}, \{X_{it}\}_{t=1}^{T}, Z_{S,i}, \{Z_{ir}\}_{r=1}^{6}) \times f^m(\{Q_{ir}\}_{r=1}^{6} \mid \{X_{it}\}_{t=1}^{T}, Z_{S,i}, \{Z_{ir}\}_{r=1}^{6}) \\
&= f^m(\{W_{it},O_{it}\}_{t=1}^{T}, S_i \mid \{Q_{ir}\}_{r=1}^{6}, \{X_{it}\}_{t=1}^{T}, Z_{S,i}) \, f^m(\{Q_{ir}\}_{r=1}^{6} \mid \{Z_{ir}\}_{r=1}^{6}) \\
&= f^m(\{W_{it},O_{it}\}_{t=1}^{T}, S_i \mid \{Q_{ir}\}_{r=1}^{6}, \{X_{it}\}_{t=1}^{T}, Z_{S,i}) \prod_{r=1}^{6} f^m(Q_{ir} \mid Z_{ir}) \\
&= f^m(W_{i1} \mid O_{i1}, S_i) \prod_{t=2}^{T} f^m(W_{it} \mid O_{it}, S_i, X_{it}, W_{it-1}, O_{it-1}, X_{it-1}) \\
&\quad \times f^m(O_{i1} \mid S_i) \prod_{t=2}^{T} f^m(O_{it} \mid O_{it-1}, S_i, X_{it}) \, f^m(S_i \mid Z_{S,i}) \prod_{r=1}^{6} f^m(Q_{ir} \mid Z_{ir}).
\end{align*}
The first equality holds under the assumption that the six test scores do not directly affect wages, occupations, and education conditional on type.
The secondequality holds under the assumption that the regressors and the error terms in96Equation (1.1), Equation (1.2), Equation (1.3), and Equation (1.4) are independent.The third equality holds under the assumption that the error terms in test scores aremutually independent (εQ,ir ⊥ εQ,ir′ for r 6= r′). The fourth equality holds underthe assumptions that the error terms in wage follows a first order Markov process(εW,it = ρεW,it−1 +ζit) and the occupation choice is only affected by the previousoccupation, not the whole occupation history.A.3 Likelihood Contributions(a) The likelihood contribution of wages:L mW (Yi;αW ,β ,σW ,ρ) = φ(Wi1−µW,i1σW,1)T∏t=2φ(Wit −µW,i2σW,2).The wage density functions follow a normal distribution according to theassumptions in Equation 1.1. Specifically,µW,i1 =αmW1+αmW,2Oit +β12YRi+β24YRi+β32YRiOit +β44YRiOit +X ′itβ5,andµW,i2 =αmW1 +αmW,2Oit +β12YRi +β24YRi +β32YRiOit +β44YRiOit +X ′itβ5−ρ(Wit−1− (αmW1 +αmW,2Oit−1 +β12YRi +β24YRi +β32YRiOit−1+β44YRiOit−1 +X ′it−1β5))where DO,i jt is a dummy variable, which equals 1 if individual i works inoccupation j at time t(b) The likelihood contribution of occupations:L mO (Yi;αO,λ ) =Φ(αmO +λ12YRi +λ24YRi)×T∏t=2Φ(αmO +λ12YRi +λ24YRi +λ3Oit−1 +X ′itλ4).97(c) The likelihood contribution of education:L mS (Yi;αS,δ ) =exp(αmS, j +Z′iδ j)1+∑3j′=2 exp(αmS, j′+Z′iδ j′).(d) The likelihood contribution of test scores:L mQ (Yi;αQ,θ ,σQ) =6∏r=1φ(Qir−µQ,irσQ,r).The density functions of test scores follow a normal distribution accordingto the assumptions in Equation 1.4, and µQ,ir = αmr +θr,12YRir +θr,24YRir +Z′i,rθr,3.A.4 Assumptions and Proofs of PropositionsA.4.1 Assumptions and proof of Proposition 1.1Assumption A.1. For m=1,. . . ,M and t ≥ 2,(a)f mt (Wt |Ot ,Xt ,S,{Wτ ,Oτ ,Xτ}t−1τ=2) = fm(Wt |Ot ,Xt ,S,{Wτ ,Oτ ,Xτ}t−1τ=2),andf mt (Ot |Xt ,S,{Wτ ,Oτ ,Xτ}t−1τ=2) = fm(Ot |Xt ,S,{Wτ ,Oτ ,Xτ}t−1τ=2).(b)f m(Wt |Ot ,Xt ,S,{Wτ ,Oτ ,Xτ}t−1τ=2) = fm(Wt |Ot ,Xt ,S),andf m(Ot |Xt ,S,{Wτ ,Oτ ,Xτ}t−1τ=2) = fm(Ot |Xt ,S).Assumption A.1 reduces the number of unknown type-specific distributions andthe conditional type-specific joint distributions of wages, occupations, and education98can be simplified as follows:f m({Wt ,Ot}Tt=1,S|{Xit}Tt=1,ZS)= f m(W1|O1,S)T∏t=2f m(Wt |Ot ,S,Xt)× f m(O1|S)T∏t=2f m(Ot |S,Xt) fm(S|ZS).For the sake of clarity, assume the support of Xt (t=2,. . . ,T) is discrete andknown. Let (ηt,1,ηt,2, . . . ,ηt,M−1) be elements of Xt for t=1,. . . ,T. Fix S = s anddefine, for (ηt ,η1,zS) ∈Xt ×X1×ZS ,λ ∗mO,η1 = Pm(O1 = 1|(X1,S) = (η1,s)),λmO,ηt = Pm(Ot = 1|(Xt ,S) = (ηt ,s)),˜mzS = pimPm(S = s|ZS = zS)Construct a matrix of type-specific distribution functions and type probabilitiesasLt =1 λ 1O,ηt,1 · · · λ1O,ηt,M−1· · · · · ·. . . · · ·1 λMO,ηt,1 · · · λMO,ηt,M−1 , f or t = 2, . . . ,TDOη1 = diag(λ∗1O,η1 , . . . ,λ∗MO,η1), and VzS = diag(˜1zS , . . . , ˜MzS ).The elements of Lt ,DOη1 , and VzS are parameters of the underlying mixture modelto be identified. Consider we have data for three time periods i.e. T = 3. Fix Ot = 1for all t and defineFO∗ZS,X1,X2,X3 =M∑m=1˜mZSλ∗mO,X1λmO,X2λmO,X3.Now fix O2 = O3 = 1 and defineFOZS,X2,X3 =M∑m=1˜mZSλmO,X2λmO,X3 .99Similarly, define the following functionsFO∗ZS,X1,X2 =M∑m=1˜mZSλ∗mO,X1λmO,X2 ,FO∗ZS,X1,X3 =M∑m=1˜mZSλ∗mO,X1λmO,X3 ,FO∗ZS,X1 =M∑m=1˜mZSλ∗mO,X1 ,FOZS,X2 =M∑m=1˜mZSλmO,X2 ,FOZS,X3 =M∑m=1˜mZSλmO,X3 .Arrange these into two M×M matrices:POzS =1 FOzS,η3,1 . . . FOzS,η3,M−1FOzS,η2,1 FOzS,η2,1,η3,1 . . . FOzS,η2,1,η3,M−1....... . ....FOzS,η2,M−1 FOzS,η2,M−1,η3,1 . . . 
FOzS,η2,M−1,η3,M−1,andPO∗zS,η1 =FO∗zS,η1 FO∗zS,η1,η3,1 . . . FO∗zS,η1,η3,M−1FO∗zS,η1,η2,1 FO∗zS,η1,η2,1,η3,1 . . . FO∗zS,η1,η2,1,η3,M−1....... . ....FO∗zS,η1,η2,M−1 FO∗zS,η1,η2,M−1,η3,1 . . . FO∗zS,η1,η2,M−1,η3,M−1.To achieve identification, further assume:Assumption A.2. There exist some {ηt,1, . . . ,ηt,M−1}Tt=2 such that POzS is of full rankand that all the eigenvalues of (POzS )−1PO∗zS,η1 take distinct values.Proof of Proposition 1.1. POzS and PO∗zS,η1 can be expressed as the follows:POzS = L′2VzSL3, and PO∗zS,η1 = L′2VzSDOη1L3.100Because POzS is full rank, it follows that L2 and L3 are full rank. We can con-struct a matrix AzS = (POzS )−1PO∗zS,η1 = L−13 DOη1L3. Because AzSL−13 = L−13 DOη1 and theeigenvalues of AzS are distinct, the eigenvalues of AzS determines the elements ofDOη1 .Moreover, the right eigenvectors of AzS are the columns of L−13 up to multi-plicative constants. Denote L−13 K to be the right eigenvectors of AzS where K issome diagonal matrix. Now we can determine VzSK from the first row of POzS L−13 Kbecause POzS L−13 K = L′2VzSK and the first row of L′2 is a vector of ones. Then L′2 isdetermined uniquely by L′2 = (POzS L−13 K)(VzSK)−1. Similarly, by construct a matrixBzS = (PO′zS )−1(PO∗′zS,η1), we can uniquely determine L′3.We can determine VzS from the first row of POzS L−13 K because POzS L−13 K = L′2VzSKand the first row of L′2 is a vector of ones. Till now we have identified {˜mzS}Mm=1,{λmO,η1}Mm=1 and {λmO,ηt, j}M−1j=1 }Mm=1 for t = 2,3.Next I show how to identify DOx1 for any x1 ∈X1. Let’s construct PO∗zS,x1in thesame way as PO∗zS,η1 . It follows that DOx1 = (L′2Vx1)−1P∗O,x1L−13 . So {λ ∗mO,x1}Mm=1 for anyx1 ∈X1 is identified.To identify {λmO,x2}Mm=1 for any x2 ∈X2, construct the following matrices:Lx2 =1 λ 1O,x2......1 λMO,x2 ,andPx2 =(1 FOz,η3,1 . . . FOz,η3,M−1FOz,x2 FOz,x2,η3,1 . . . FOz,x2,η3,M−1).Px2 can be expressed as Px2 = (Lx2)′VzSL3. So (Lx2)′ = Px2(VzSL3)−1. So{λmO,x2}Mm=1 is identified. With similar approach {λmO,x3}Mm=1 for any x3 ∈ X3 can alsobe identified.To identify Vz′S for any z′S ∈ ZS , construct POz′Sby replacing zS with z′S in POzS .POz′Scan be expressed as POz′S= L′2Vz′SL3. Then Vz′S = (L′2)−1POz′SL−13 and {˜mz′S}Mm=1for any z′S ∈ ZS is identified. By integrating out S, we can get {pim}Mm=1 andf m(S|ZS) = ˜mz′S/pim.I have shown the identification of pim, f m(S|ZS), f m(O1|X1,S) and f m(Ot |Xt ,S)101for any ({Xt}3t=1,S,ZS)∈∏3t=1Xt×S ×ZS . The rest is to show the identificationof the type-specific wage marginal distributions. Defineλ ∗mW,(w1,x1) = fm((W1,O1) = (w1,1)|(X1,S) = (η1,s)),DWw,η1 = diag(λ∗1W,(w,η1), . . . ,λ∗MW,(w,η1)).Fix W1 = w1, Ot = 1 for t = 1,2,3, and define the following functions:FW∗ZS,X1,X2,X3 =M∑m=1˜mZSλ∗mW,(w1,X1)λmO,X2λmO,X3 ,FW∗ZS,X1,X2 =M∑m=1˜mZSλ∗mW,(w1,X1)λmO,X2 ,FW∗Z,X1,X3 =M∑m=1˜mZSλ∗mW,(w1,X1)λmO,X3 ,FW∗Z,X1 =M∑m=1˜mZSλ∗mW,(w1,X1).Arrange these to an M×M matrix:PW∗zS,η1 =FW∗zS,η1 FW∗zS,η1,η3,1 . . . FW∗zS,η1,η3,M−1FW∗zS,η1,η2,1 FW∗zS,η1,η2,1,η3,1 . . . FW∗zS,η1,η2,1,η2,M−1....... . ....FW∗zS,η1,η2,M−1 FW∗zS,η1,η2,M−1,η3,1 . . . FW∗zS,η1,η2,M−1,η3,M−1.PW∗zS,η1 = L′2VzSDWw1,η1L3. Then DWw1,η1 = (L′2VzS)−1PW∗zS,η1L−13 and fm(W1,O1|X1,S)is identified. 
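The eigendecomposition argument used above is constructive, so it can be checked numerically. Below is a minimal sketch in Python/NumPy for a hypothetical two-type example; the matrices L2, L3, V, and D are made-up illustrative values, not estimates from the paper. The sketch builds the observable arrays P = L2'V L3 and P* = L2'V D L3, recovers the diagonal of D from the eigenvalues of P^{-1}P*, and then recovers L2 and V up to the ordering of types.

```python
import numpy as np

# Hypothetical two-type example (M = 2); rows of L2, L3 index types, first column is ones.
L2 = np.array([[1.0, 0.3],
               [1.0, 0.7]])
L3 = np.array([[1.0, 0.4],
               [1.0, 0.8]])
V = np.diag([0.6, 0.4])   # type probabilities times P(S = s | Z_S = z_S)
D = np.diag([0.2, 0.5])   # type-specific probabilities at eta_1 (must be distinct)

# Objects identified from the data: P = L2' V L3 and P* = L2' V D L3.
P = L2.T @ V @ L3
P_star = L2.T @ V @ D @ L3

# Step 1: A = P^{-1} P* = L3^{-1} D L3, so its eigenvalues recover the diagonal of D.
A = np.linalg.solve(P, P_star)
eigvals, eigvecs = np.linalg.eig(A)   # eigvecs equals L3^{-1} K for some diagonal K
print(np.sort(eigvals))               # -> [0.2, 0.5]

# Step 2: P @ eigvecs = L2' V K, and the first row of L2' is ones,
# so its first row reveals diag(V K); dividing it out recovers L2'.
VK = np.diag((P @ eigvecs)[0, :])
L2_recovered = (P @ eigvecs) @ np.linalg.inv(VK)
print(L2_recovered.T)                 # equals L2 up to the ordering of types
```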
Further f m((W1|O1,X1,S) = f m(W1,O1|X1),S)/ f m((O1|X1),S).To identify f m(Wt |Ot ,Xt ,S) for t = 2, defineλmW,(w2,η2) = fm((W2,O2) = (w2,1)|(X2,S) = (η2,s)).Fix W2 = w2, Ot = 1 for t = 2,3, and define the following functions:FWZS,X2,X3 =M∑m=1˜mZSλmW,(w2,X2)λmO,X3 ,102FWZS,X2 =M∑m=1˜mZSλmW,(w2,X2).Then construct the following matrices:Lw2 =1 λ 1W,(w2,η2)......1 λMW,(2,η2) ,andPw2 =(1 FOzS,η3,1 . . . FOzS,η3,M−1FWzS,x2 FWzS,η2,η3,1 . . . FWzS,η2,η3,M−1).Pw2 can be expressed as Pw2 = (Lw2)′VzSL3. Then (Lw2)′ = Pw2(VzSL3)−1and { f m(W2,O2|X2,S)}Mm=1 is identified and fm(W2|O2,X2,S) =f m(W2,O2|X2,S)/ f m(O2|X2,S) for m = 1, . . . ,M. With similar approach,f m(W3|O3,X3,S) can be identified for m = 1, . . . ,M. This completes the proof ofProposition 1.1.A.4.2 Assumptions and proof of Proposition 1.2Assumption A.3. For m=1,. . . ,M and t ≥ 2,(a)f mt (Wt |Ot ,Xt ,S,{Wτ ,Oτ ,Xτ}t−1τ=2) = fm(Wt |Ot ,Xt ,S,{Wτ ,Oτ ,Xτ}t−1τ=2),andf mt (Ot |Xt ,S,{Wτ ,Oτ ,Xτ}t−1τ=2) = fm(Ot |Xt ,S,{Wτ ,Oτ ,Xτ}t−1τ=2).(b)f m(Wt |Ot ,Xt ,S,{Wτ ,Oτ ,Xτ}t−1τ=2) = fm(Wt |Ot ,S,Xt ,Wt−1,Ot−1,Xt−1),andf m(Ot |Xt ,S,{Wτ ,Oτ ,Xτ}t−1τ=2) = fm(Ot |Ot−1,S,Xt).103Under Assumption A.3, the conditional joint distribution of wages, occupations,and education can be simplified as follows:f m({Wt ,Ot}Tt=1,S|{Xt}Tt=1,ZS)= f m(W1|O1,S)T∏t=2f m(Wt |Ot ,S,Xt ,Wt−1,Ot−1,Xt−1)× f m(O1|S)T∏t=2f m(Ot |Ot−1,S,Xt) fm(S|ZS).The transition process of (Wt ,Ot ,Xt) becomes a stationary first-order Markovprocess. Define Yt = (Wt ,Ot ,Xt). The variation of Yt affects both the type-specificconditional joint distribution at period t and that at period t + 1. This makes itdifficult to construct factorization equations as before. To solve this problem, welook at every other period. Fix Yt to be y¯t for odd t and define˜my¯,zS = pim f m(y¯1,s|zS),λmy¯ (Yt) = f m(y¯t+1|Yt ,s) f m(Yt |y¯t−1,s),λ ∗my¯ (YT ) = f m(YT |y¯T−1,s).Let ξt be element of Yt and defineLt,y¯ =1 λ 1y¯ (ξt,1) . . . λ 1y¯ (ξt,M−1)....... . ....1 λMy¯ (ξt,1) . . . λMy¯ (ξt,M−1) ,Vy¯ = diag(˜1y¯,zS , . . . , ˜My¯,zS), and DOYT |y¯= diag(λ ∗1y¯ (YT ), . . . ,λ ∗My¯ (YT )).Then constructPOy¯ = L′2,y¯Vy¯L4,y¯,PO∗y¯ = L′2,y¯DOYT |y¯Vy¯L4,y¯.Further, assumeAssumption A.4. There exist some {ξt,1, . . . ,ξt,M−1}Tt=1 such that POy¯ is of full rankand that all the eigenvalues of (POy¯ )−1PO∗y¯ take distinct values.104Proof of Proposition 1.2. Without loss of generality, set T = 6. Fix (Y1,Y3,Y5) =(y1,y2,y5) and defineF∗OY2,Y4,Y6 =M∑m=1˜y¯,ZSλmy¯ (Y2)λmy¯ (Y4)λ ∗my¯ (YT ),F∗OY2,Y6 =M∑m=1˜y¯,zSλmy¯ (Y2)λ ∗my¯ (YT ),F∗OY6 =M∑m=1˜y¯,zSλ ∗my¯ (YT ),FOY2,Y4 =M∑m=1˜y¯,zSλmy¯ (Y2)λmy¯ (Y4),FOY2 =M∑m=1˜¯,zSλmy¯ (Y2),FO =M∑m=1˜y¯,zS .And construct matrices as follows:POy¯ =FO FOξ4,1 . . . FOξ4,M−1FOξ2,1 FOξ2,1,ξ4,1 . . . FOξ2,1,ξ4,M−1....... . ....FOξ2,M−1 FOξ2,M−1,ξ4,1 . . . FOξ2,M−1,ξ4,M−1,andPO∗y¯ =FO∗ξ6 FO∗ξ4,1,ξ6 . . . FO∗ξ4,M−1,ξ6FO∗ξ2,1,ξ6 FO∗ξ2,1,ξ4,1,ξ6 . . . FO∗ξ2,1,ξ4,M−1,ξ6....... . ....FO∗ξ2,M−1,ξ6 FO∗ξ2,M−1,ξ4,1,ξ6 . . . FO∗ξ2,M−1,ξ4,M−1,ξ6.Then repeat the argument of the proof of Proposition 1.1 and we achieve theidentification of ˜my¯,zS , λmy¯ (ξt), and λ ∗my¯ (YT ). Then integrate out the other elementsand apply Bayes’ rule, we can get pim, f m(W1|O1,X1,S), f m(O1|X1,S), f m(S|ZS),f m(Wt |Ot ,S,Xt ,Wt−1,Ot−1,Xt−1), and f m(Ot |Ot−1,S,Xt).105A.4.3 Assumptions and proof of Proposition 1.3Denote the support of Q1, Q2, and Q3 by Q1, Q2, and Q3 respectively. 
Partition Q1into M mutually exclusive and exhaustive subsets and denote the partitions as4Q1 ={δ 1Q1 , . . . ,δMQ1}. Similarly denote the partitions of Q2 as4Q2 = {δ1Q2 , . . . ,δMQ2}. Let4=4Q1×4Q2 . Also partitionQ3 into 2 mutually exclusive and exhaustive subsetsas4Q3 = {δ 1Q3 ,δ2Q3}.Let’s definepmQ1 = (Pm(Q1 ∈ δ 1Q1 |s,zs), . . . ,Pm(Q1 ∈ δMQ1)|s,zs)′,pmQ2 = (Pm(Q2 ∈ δ 1Q2 |s,zs), . . . ,Pm(Q2 ∈ δMQ2)|s,zs)′,pmQ3(h) = Pm(Q3 ∈ δ hQ3 |s,zs),˜m = pim f m(s,zs).Collect the type-specific distributions into following matricesLQ1 = (p1Q1 , . . . , pMQ1),LQ2 = (p1Q2 , . . . , pMQ2),V = diag(˜1, . . . , ˜M),andDh = diag(p1Q3(h), . . . , pMQ3(h)).Let Ps(Q1 ∈ δmQ1 ,Q2 ∈ δm′Q2) be the probability that Q1 ∈ δmQ1 and Q2 ∈ δm′Q2for S = s and Ps(Q1 ∈ δmQ1 ,Q2 ∈ δm′Q2,Q3 ∈ δ hQ3) be the probability that Q1 ∈ δmQ1,Q2 ∈ δm′Q2, and Q3 ∈ δ hQ3 for S = s. Define two M×M matrices as follows:P4 =Ps(Q1 ∈ δ 1Q1 ,Q2 ∈ δ1Q2) . . . Ps(Q1 ∈ δ1Q1 ,Q2 ∈ δMQ2)... . . ....Ps(Q1 ∈ δMQ1 ,Q2 ∈ δ1Q2) . . . Ps(Q1 ∈ δMQ1 ,Q2 ∈ δMQ2) ,106P4,h =Ps(Q1 ∈ δ 1Q1 ,Q2 ∈ δ1Q2 ,Q3 ∈ δhQ3) . . . Ps(Q1 ∈ δ 1Q1 ,Q2 ∈ δMQ2 ,Q3 ∈ δhQ3)... . . ....Ps(Q1 ∈ δMQ1 ,Q2 ∈ δ1Q2 ,Q3 ∈ δhQ3) . . . Ps(Q1 ∈ δMQ1 ,Q2 ∈ δMQ2 ,Q3 ∈ δhQ3) .Assume:Assumption A.5. There exists a partition4×4Q3 on the variables (Q1,Q2,Q3)for which the matrix P4 is nonsingular and the eigenvalues of P4,hP−14 are distinctfor partition level h = 1 of the variable Q3.Proof of Proposition 1.3. P4 and P∗4,h can be expressed as the follows:P4 = LQ1V (L′Q2), and P4,h = LQ1DhV (L′Q2).Since P4 is nonsingular, both LQ1 and LQ2 are nonsingular. Construct Ah =P4,hP−14 = LQ1DhL−1Q1, and we have AhLQ1 = LQ1Dh. The distinct eigenvalues of Ahdetermines the elements of Dh, and its eigenvectors determine the columns of LQ1uniquely up to a multiplicative constant. Then LQ1 is uniquely determined since theelements of each column of LQ1 must sum to one. Construct Bh = (P′4,h)(P′4)−1 =LQ2DhL−1Q2, and LQ2 is determined using the similar argument. Once LQ1 and LQ2 aredetermined, V is uniquely determined by V = (LQ1)−1P4(L′Q2)−1. Then {pim}Mm=1is identified by integrating out S and ZS, and f m(S|ZS) = ˜m/(pim f (ZS)).For any q1 ∈ Q1, denote pq1 = (P1Q1(q1), . . . ,PMQ1(q1)) and define Pq1,4Q2 =pq1V (LQ2)′. Then pq1 = Pq1,4Q2(V (LQ2)′)−1, and {PMQ1(q1)}Mm=1 is identi-fied. Define P4Q1,q2 and P4Q1,q3 analogously and apply the same argument,{PMQ2(q2),PMQ3(q3)}Mm=1 are identified.A.4.4 Assumptions and proof of Proposition 1.4Denote pmOt =Pm(Ot = 1|(S,ZS) = (s,zS)), and DOt = diag(p1Ot , . . . , pMOt ). Constructan M×M matrixP4,Ot =P(Q1 ∈ δ 1Q1 ,Q2 ∈ δ1Q2 ,Ot = 1) . . . P(Q1 ∈ δ1Q1 ,Q2 ∈ δMQ2 ,Ot = 1)... . . ....P(Q1 ∈ δMQ1 ,Q2 ∈ δ1Q2 ,Ot = 1) . . . P(Q1 ∈ δMQ1 ,Q2 ∈ δMQ2 ,Ot = 1) ,107AssumeAssumption A.6. The eigenvalues of P4,Ot P−14 are distinct.Proof of Proposition 1.4. The proof of the nonparametric identification of educa-tion psychic costs using test scores is already shown in the proof of Proposition 1.3.Below, I prove the nonparametric identification of occupation abilities using testscores.Express P4,Ot as P4,Ot = LQ1DOtV L′Q2 . Replacing P4,h in the proof of Propo-sition 1.3 by P4,Ot , and repeating the proof, pim, f m(S|ZS), and f m(Ot |Xt ,S) areidentified.Next, denote pmWt = Fm((Wt ,Ot) = (wt ,1)|(S,ZS) = (s,zS)) and DWt =diag(p1Wt , . . . , pMWt ). Let P(Q1 ∈ δmQ1,Q2 ∈ δm′Q2,(ωt ,1)) be the probability thatQ1 ∈ δmQ1 , Q2 ∈ δm′Q2, Wt = ωt , and Ot = 1 for S = s. The corresponding M×Mmatrix isP4,wt =PS(Q1 ∈ δ 1Q1 ,Q2 ∈ δ1Q2 ,(ωt ,1)) . . . 
PS(Q1 ∈ δ1Q1 ,Q2 ∈ δMQ2 ,(ωt ,1))... . . ....PS(Q1 ∈ δMQ1 ,Q2 ∈ δ1Q2 ,(ωt ,1)) . . . PS(Q1 ∈ δMQ1 ,Q2 ∈ δMQ2 ,(ωt ,1)) .P4,wt = LQ1DwtV L′Q2 . Then Dwt = L−1Q1P4,wt (V L′Q2)−1, and f m(Wt ,Ot |Xt ,S) isidentified. By Bayes’ rule, f m(Wt |Ot ,Xt ,S) = f m(Wt ,Ot |Xt ,S)/ f m(Ot |Xt ,S).A.5 EM AlgorithmConsider (k+1)th iteration. In E step, calculate the expectated log-likelihood φbased on the estimates from the kth iteration:φ (k) =n∑i=1M∑m=1µm(k)i (logpim + logL mW + logLmO + logLmS + logLmQ ),whereµm(k)i =pim(k)L m(k)W Lm(k)O Lm(k)S Lm(k)Q∑Mm=1pim(k)Lm(k)W Lm(k)O Lm(k)S Lm(k)Q.108In M step, compute the parameters by maximizing the expected log-likelihoodφ :pim(k+1) satisfies ∂φk∂pim(k+1) = 0. Correspondingly,pim(k+1) = ∑ni=1 µipim(k)n.βm(k+1)W satisfies∂φ k∂βm(k+1)W= 0. And it can be simplified to∂ ∑ni=1∑Tt=1Lm(k+1)W∂βm(k+1)W= 0,which is an OLS regression.γm(k+1)O satisfies∂φ k∂γm(k+1)O= 0, and it can be simplified to∂ ∑ni=1∑Tt=1Lm(k+1)O∂γm(k+1)O= 0.(θm(k+1)R satisfies∂φ k∂θm(k+1)R= 0. And it can be simplified to∂ ∑ni=1∑Rr=1Lm(k+1)Q∂θm(k+1)R= 0,which is a probit.δm(k+1)S satisfies∂φ k∂δm(k+1)S= 0, and it can be simplified to∂ ∑ni=1Lm(k+1)S∂δm(k+1)S= 0,which is a multinomial logit.109A.6 Choice of Initial ValuesThe strategy is to start with estimating the parameters in Equation (1.7) whenthe population is homogenous (M=1) and then add one more type at a time andre-estimate the parameters. Let L mi denote the likelihood for individual i and defineµm =n∑i=1(1−L mi∑m−1k=1 Lki pik)The estimation follows the algorithm as below:(a) Set M = 1 and pi1 = 1. Choose initial values for αW , αO, αS, αR, β , λ , δ , θ ,σW , σQ, and ρ in Equation (1.7).(b) Given the current value of M, maximize the likelihood over αW , αO, αS, αR,β , λ , δ , θ , σW , σQ, ρ , and pim.(c) Evaluate µM+1 for a grid of values of the type-specific parameters.(d) Set the type-specific parameters to the values that yield the smallest value forµM+1.(e) Maximize the likelihood. Increase the value of M by 1. Return to Step (b).110Appendix BB.1 Proofs of the Results in Section 2.2Proof of PROPOSITION 2.1: Because the second result follows from the first bythe law of the iterated expectations, and the third and fourth results can be easilyderived from the second result, we only prove the first result below. Let (h,L) be anarbitrary member in H. Then we have thatN−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j) = N−1LhNLh∑k=1wLhkCLhkφLh(X˜hk). (B.1)For each k ∈ N, CLhk has a bounded support by Assumption 2.2(c), and φLh(X˜hk) hasa finite absolute moment by hypothesis, so that CLhkφLh(X˜hk) has a finite absolutemoment. From this fact and the independence of CLhk and X˜hk (Assumption 2.2(b)),it follows by Fubini’s theorem (Folland, 1984, Theorem 2.37, pp. 65–66) that foreach k ∈ N,E[CLhkφLh(X˜hk) | X˜ ] = E[CLhk]φLh(X˜hk).Also, wLhk is the reciprocal of E[CLhk] by definition. Taking the conditional expecta-tion of both side in (B.1) given X˜ and applying these facts yields thatE[N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j)∣∣ X˜]= N−1LhNLh∑k=1wLhkE[CLhk]φLh(X˜hk)= N−1LhNLh∑k=1φLh(X˜hk).111Thus, the desired result follows. Proof of THEOREM 2.2: Let n be an arbitrary natural number. For each x ∈Rv, q(x, ·) : Θ→ R is continuous on Θ, given Assumption 2.3. Thus, for eachω ∈ Ω, QL(ω, ·) : Θ→ R is a continuous function on the compact set Θ underAssumptions 2.1 and 2.2. The desired result therefore follows by (Gallant andWhite, 1988, Lemma 2.1). 
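Proposition 2.1 is easy to visualize in a small Monte Carlo experiment. The sketch below (Python/NumPy, with made-up population values and Bernoulli inclusion indicators as a simple stand-in for the sampling scheme of Assumption 2.2) checks that the survey-weighted average $N^{-1}\sum_k w_k C_k \phi(\tilde{X}_k)$ is unbiased for the finite-population average $N^{-1}\sum_k \phi(\tilde{X}_k)$ conditional on the population values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stratum: N population units with known characteristics x_tilde.
N = 500
x_tilde = rng.lognormal(mean=1.0, sigma=0.5, size=N)

# Unequal inclusion probabilities; the survey weight is w_k = 1 / E[C_k] = 1 / p_k.
p = rng.uniform(0.05, 0.30, size=N)
w = 1.0 / p

def weighted_average():
    C = rng.binomial(1, p)              # inclusion indicators C_Lhk
    return (w * C * x_tilde).sum() / N  # N^{-1} sum_k w_k C_k phi(x_k)

draws = np.array([weighted_average() for _ in range(20_000)])
print(draws.mean())    # close to the finite-population average ...
print(x_tilde.mean())  # ... so the weighted average is conditionally unbiased
```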
B.2 Proofs of the Results in Section 2.3Proof of PROPOSITION 2.1: Because for each (h,L) ∈ H and each γ ∈ Γ,N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j,γ) = N−1LhNLh∑k=1wLhkCLhkφLh(X˜hk,γ),we have that for each (h,L) ∈ H and each γ ∈ Γ,∥∥∥∥N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j,γ)∥∥∥∥a≤ N−1LhNLh∑k=1wLhk∥∥CLhkφLh(X˜hk,γ)∥∥a.Given the independence between {CLhk : (k,h,L) ∈ K} and {X˜hk : (k,h) ∈ N2} (As-sumption 2.2(b)), it follows by Fubini’s theorem (Folland, 1984, Theorem 2.37,pp. 65–66) that for each (h,L) ∈ H and each γ ∈ Γ,∥∥∥∥N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j,γ)∥∥∥∥a≤ N−1LhNLh∑k=1wLhk‖CLhk‖a · ‖φLh(X˜hk,γ)‖a.Under Assumption 2.2(b)–(d), we have that for each k ∈ N and each (h,L) ∈ H,wLhk =E[CLhk]−1≤(0·P[CLhk = 0]+1·P[CLhk > 0])−1=P[CLhk > 0]−1≤ (N−1Lh p¯l)−1and‖CLhk‖a· ≤ C¯ P[CLhk > 0]≤ N−1Lh C¯ p¯u.112Also, ∆≡ supγ∈Γ sup(k,h,L)∈K ‖φLh(X˜hk)‖a < ∞ by hypothesis. It follows that∥∥∥∥N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j)∥∥∥∥a≤ (p¯u/ p¯l)C¯ ·N−1LhNLh∑k=1wLhk‖φLh(X˜hk,γ)‖a≤ (p¯u/ p¯l)C¯∆, (h,L) ∈ H.Because the right-hand side of the above inequality depends on neither h nor L, thedesired therefore result follows. Proof of LEMMA 2.2: Because {φh}h∈N satisfies condition (a) in Definition 1,{ΦLh}L∈N is uniformly La-bounded by Proposition 2.1, so that it satisfies condi-tion (a) of Definition 2. Due to condition (b) in Definition 1, {φh} satisfies that foreach (γ1,γ2) ∈ Γ2 and each (h,L) ∈ H,|ΦLh(·,γ2)−ΦLh(·,γ1)| ≤ NLhnh∑i=1nhi∑j=1WLhi j(ω)∣∣φh(XLhi j(ω),θ2)−φh(XLhi j(ω),θ1)∣∣≤ DLh g(|θ2−θ1|),whereDLh ≡ N−1Lhnh∑i=1nhi∑j=1WLhi j(ω)dh(XLhi j(ω)).Because sup(k,h)∈N2∫dh(x)Phk(dx)< ∞, it follows from Proposition 2.1 that {DLh :(h,L) ∈ H} is uniformly L1 bounded, so that ∆≡ sup(h,L)∈HE[DLh]< ∞. Using thisfact, we obtain thatsupL∈NL−1L∑h=1E[DLh]≤ ∆,as required by condition (b) in Definition 2. Thus, {ΦLh : (h,L) ∈ H} is SLB(a). In proving Theorem 2.3, we employ a uniform law of large numbers stated inthe next lemma, in which the SLB property takes an important role.Lemma B.1. Given Assumptions 2.1 and 2.2, let Γ be a finite-dimensional Eu-clidean space, and {FLh : (h,L) ∈ H} an array of measurable functions from (Ω×Γ,F ⊗B(Γ)) to (Rl1×l2 ,Bl1×l2) such that for each γ ∈ Γ and each L ∈ N, FL1(·,γ),. . . , FLL(·,γ) are independent, and {FLh : (h,L) ∈ H} is SLB(1+δ ) on Γ for someδ ∈ (0,∞). Then {γ 7→ L−1∑Lh=1 E[FLh(·,γ)] : Γ→ R}L∈N is uniformly bounded113and uniformly equicontinuous, and {|L−1∑Lh=1 FLh(·,γ)− L−1∑Lh=1 E[FLh(·,γ)]|}converges in probability-P to zero uniformly in γ ∈ Γ.Proof of LEMMA B.1: The result can be established essentially in the same way asLemma A.3 of ?. We now prove Theorem 2.3.Proof of THEOREM 2.3: To establish the desired result, we apply the standard con-sistency result, e.g., (?, Lemma 4.2). Given the compactness of Θ (Assumption 2.3)and the identifiability of {Θ∗L}L∈N (Assumption 2.5), it suffices to show the uniformconvergence of {QL(·,θ)− Q¯L(θ)}L∈N to zero over θ ∈Θ in probability-PTo prove the above-mentioned uniform convergence, define {FLh : (h,L) ∈ H} ,an array of functions from Ω×Θ to R byFLh(ω,θ)≡ (L/NL)nh∑i=1nhi∑j=1WLhi j(ω)q(XLhi j(ω),θ), ω ∈Ω, θ ∈Θ, (h,L) ∈ H.Then QL can be written asQL(θ) = L−1L∑h=1FLh(·,θ), L ∈ N.Also, by Proposition 2.1, it follows from the definition of QL that for each θ ∈Θ,E[QL(θ)] = Q¯L(θ). 
Thus, if {FLh : (h,L) ∈ H} obeys the uniform law of largenumbers on Θ, the desired uniform convergence of {QL(·,θ)− Q¯L(θ)}L∈N holds.To verify that {FLh : (h,L) ∈ H} obeys the uniform law of large numbers, weuse Lemma B.1, which states that if (A) for each θ ∈Θ and each L ∈ N, FL1(·,θ),. . . , FLL(·,θ) are independent, and (B) for some positive real number δ , {FLh} isSLB(1+δ ), then {FLh : (h,L) ∈ H} obeys the uniform law of large numbers. UnderAssumptions 2.1 and 2.2(a), (b), (A) is clearly satisfied. For (B), rewrite FLh(·,θ) asFLh(·,θ) = (LNLh/NL)N−1Lhnh∑i=1nhi∑j=1WLhi j q(XLhi j,θ), θ ∈Θ, (h,L) ∈ H.114Because q, which is common for every stratum, is LB(1+δ ), the array{N−1Lhnh∑i=1nhi∑j=1WLhi j q(XLhi j,θ) : (h,L) ∈ H}.is SLB(1+δ ) by Lemma 2.2. The array {LNLh/NL = NLh/(NL/NLh) : (h,L) ∈ H}is also bounded (Assumption 2.1). It is straight forward to verify that {FLh :(h,L) ∈ H} is SLB(1+ δ ) using these facts. Thus, {FLh : (h,L) ∈ H} obeys theuniform law of large numbers, and the desired result follows. B.3 Proofs of the Results in Section 2.4In proving Theorem 2.1, we employ a central limited theorem stated in the nextlemma.Lemma B.1. Let {ULh : (h,L) ∈ H} be an array of uniformly L2+δ -bounded, zero-mean v×1 random vectors for some δ ∈ [0,∞) such that for each L ∈ N, UL1,. . . ,ULL are independent. Then:(a) L−1/2∑Lh=1ULh = OP(1).(b) Suppose in addition that δ > 0, and {VL ≡ L−1∑Lh=1 var[ULh]}L∈N is uni-formly positive definite. Then V−1/2L L−1/2∑Lh=1ULhA∼ N(0,1).Proof of LEMMA B.1: The result can be established essentially in the same way asLemma A.4 of ?. The following lemma is also useful in proving Theorem 2.1.Lemma B.2. Define {A˜L : Ω× intΘ→ Rp×p}L∈NA˜L(ω,θ)≡∇2QL(·,θ)=N−1LL∑h=1nh∑i=1nhi∑j=1WLhi j(ω)∇2q(XLhi j(ω),θ), ω ∈Ω, L ∈ N.Then {AL : intΘ→ Rp×p}L∈N is uniformly bounded and uniformly equicontinuouson Θ0, and {A˜L(·,θ)−AL(θ)}L∈N converges in probability-P to zero uniformly inθ ∈Θ0.115Proof of LEMMA B.2: Note that for each L ∈ N,A˜L(·,θ) = L−1L∑h=1(LNLh/N−1L )N−1Lhnh∑i=1nhi∑j=1WLhi j∇2q(XLhi j,θ), θ ∈Θ0.Because ∇2q is LB(1+δ ),{(ω,θ) 7→ N−1Lhnh∑i=1nhi∑j=1WLhi j(ω)∇2q(XLhi j(ω),θ) : (Ω,Θ0)→ Rp×p : (h,L) ∈ H}is SLB(1+ δ ) by Lemma 2.2. Also, {LNLh/NL = NLh/(NL/NLh) : (h,L) ∈ H} isbounded (Assumption 2.1). It is straight forward to verify that{(ω,θ) 7→ (LNLh/N−1L )N−1Lhnh∑i=1nhi∑j=1WLhi j(ω)∇2q(XLhi j(ω),θ) : (Ω,Θ0)→ Rp×p: (h,L) ∈ H}(B.2)is SLB(1+δ ). Further, the array (B.2) is row-wise independent (i.e., independentacross strata for each L ∈ N). Thus, application of Lemma B.1 to the array (B.2)yields that {θ 7→ E[A˜L(·,θ)] : Θ0→ Rp×p}L∈N is uniformly bounded and uniformlyequicontinuous, andA˜L(·,θ)−E[A˜L(·,θ)]→ 0 in probability-P as L→ ∞.The desired result follows from this fact, becauseE[A˜L(·,θ)] = AL(θ), θ ∈Θ0, L ∈ Nby Lemma B.1. We now prove Theorem 2.1.Proof of THEOREM 2.1: By Proposition 2.1, it follows from Assumption 2.8(d)that {N−1Lhnh∑i=1nhi∑j=1WLhi j∇q(Xlhi j,θ ∗) : (h,L) ∈ H}is uniformly L2+δ -bounded. Because {LNLh/NL = NLh/(NL/NLh) : (h,L) ∈ H} is116bounded under Assumption 2.1, it follows that {S∗Lh : (h,L) ∈ H} is uniformly L2+δ -bounded, so that {BLh = var[S∗Lh] : (h,L) ∈ H} is bounded. Also, {A∗−1L }L∈N isbounded, due to the uniform positive definiteness of {A∗L}L∈N (Assumption 2.8).The first claim follows from these facts.To verify the second claim, we use the standard linearization approach withsmooth (generalized) scores. More concretely, we employ Theorem 6.10 of White(1994). 
Nevertheless, the setup of the theorem is slightly different from ours; inparticular, the theorem requires that the score is continuously differentiable on theentire parameter space, while our setup does not require differentiability of q on theboundary of the parameter space. To fill the discrepancies between our problemsetup and that of the theorem, we introduce {θ˜L : Ω→Θ0}L∈N defined byθ˜L ≡θˆ if θˆ ∈Θ0,θ ∗L otherwise.(White, 1994, Theorem 6.10) applies to {θ˜L}L∈N well, as it lives in the compactspace Θ0 on which the score is continuously differentiable. Also, θ˜L coincides withθˆL with a probability approaching one as L grows to infinity, because {θˆL}L∈N isconsistent for {θ ∗L}L∈N, which is uniformly interior to Θ0. It follows that for eachreal number δ > 0,P[∣∣L1/2(D∗−1/2L (θ˜L−θ∗L )−L1/2D∗−1/2L (θˆL−θ∗L )∣∣< δ]= P[∣∣D∗−1/2L L1/2(θ˜L− θˆL)∣∣< δ]≥ P[θ˜L = θˆL]→ 1 as L→ ∞,i.e., θ˜L − θˆL = oP(L−1/2) as L → ∞. Thus, by the asymptotic equivalencelemma (White, 1984, Lemma 4.7), {L1/2(θˆL−θ ∗L )}L∈N has the same asymptoticdistributions as {L1/2(θ˜L−θ ∗L )}L∈N, if the latter is convergent in distribution. Itthus suffices to prove thatD∗−1/2L L1/2(θ˜L−θ ∗L )A∼ N(0, I) as L→ ∞. (B.3)117Define {ΨL : Ω×Θ0→ Rp}L∈N byΨL(ω,θ)≡ L−1L∑h=1SLh(ω,θ), ω ∈Ω, θ ∈Θ0, L ∈ N.Because Θ0 ⊂ intΘ, it holds that ψL(·, θ˜L) = 0, whenever θˆL ∈ Θ0, satisfying thefirst order condition for the maximization of QL. Because {θ˜L}L∈N is consistent for{θ ∗L}L∈N, which is uniformly interior to Θ0, we have thatP[∣∣L1/2ΨL(·, θ˜L)∣∣< δ]≥ P[θˆL ∈Θ0]→ 1 as L→ ∞,i.e., L1/2Ψ(·, θ˜L) = oP(1) as L→ ∞. Given this property, it follows from (White,1994, Theorem 6.10) that (B.3) holds if (A) {B∗L}L∈N is bounded and uniformlypositive definite, (B) B∗−1/2L L1/2ΨL(·,θ ∗L )A∼ N(0, I) as L→ ∞, (C) {AL : Θ0 →Rp×p}L∈N is uniformly equicontinuous, (D) {A˜L(·,θ)−AL(θ)}L∈N converges inprobability-P to zero uniformly in θ ∈Θ0 as L→∞ (note that A˜L(·,θ) is the Hessianof QL at θ ), and (E) {A∗L}L∈N is uniformly nonsingular. Because we have verifiedabove that {BLh : (h,L) ∈ H} is bounded, the average of it, {BL}L∈N is also bounded,and (A) holds. Also, Lemma B.2 has verified that the conditions (C) and (D) hold.Further, (E) holds by Assumption 2.8(c).To verify the remaining condition, (B), define the array {ULh : (h,L) ∈ H} byULh ≡ S∗Lh−E[S∗Lh], (h,L) ∈ H.Now, by Proposition 2.1, we have thatE[S∗Lh] = (LNLh/NL)∫∇q(x,θ ∗L ) P¯Lh(dx), (h,L) ∈ H.Under Assumption 2.8(e), we can rewrite the right-hand side of this equation toobtain thatE[S∗Lh] = (LNLh/NL)∇∫q(x,θ ∗L ) P¯Lh(dx).118It follows that for each L ∈ N,L−1L∑h=1E[S∗Lh] = ∇∫q(x,θ ∗L ) P¯L(dx) = ∇Q¯L(θ ∗L ) = 0,where the last equality follows by the first order condition for maximization of Q¯L.We thus have thatΨL(·,θ ∗L ) = L−1L∑h=1S∗Lh = L−1L∑h=1ULh, L ∈ N.If the array {ULh} obeys the central limit theorem, condition (B) is satisfied. Because{SLh(·,θ) :,(h,L) ∈ H, θ ∈Θ0} is uniformly L2+δ -bounded, {ULh : (h,L) ∈ H} isuniformly L2+δ -bounded. Also, UL1, . . . , ULL are independent for each L ∈ N.Further, {L−1∑Lh=1 var[ULh] = L−1∑Lh=1 BLh}L∈N is uniformly nonsingular by As-sumption f. The condition (B) thus follows by Lemma B.1. B.4 Proofs of the Results in Section 2.5Lemma B.1. Under Assumptions 2.1–2.8, {AˆL}L∈N is consistent for {A∗L}L∈N.Proof of LEMMA B.1: We have that for each L ∈ N,|AˆL−A∗L| ≤ |AˆL−AL(θˆL)|+ |AL(θˆL)−A∗L|. 
(B.4)It suffices to show that each of the two terms on the right-hand side of this inequalityconverges to zero in probability-P.To show the convergence of the first term, let ε be an arbitrary positive realnumber. Then we have thatP[|AˆL−AL(θˆL)| ≥ ε]= P[|AˆL−AL(θˆL)| ≥ ε and θˆL ∈Θ0]+P[|AˆL−AL(θˆL)| ≥ ε and θˆL 6∈Θ0]≤ P[supθ∈Θ0|A˜L(θ)−AL(θ)| ≥ ε]+P[θˆL 6∈Θ0](B.5)where A˜L is as in Lemma B.2. The first term on the right-hand of this inequality119converges to zero by Lemma B.2. For the second term, there exists a real numberc> 0 such that for each L ∈ N, the open ball with radius c centered at θ ∗L is containedin Θ0, because {θ ∗L}L∈N is uniformly interior to Θ0. Thus, P[θˆL 6∈Θ0]is dominatedby P[d(θˆL,θ ∗L ) > c], which converges to zero by the consistency of {θˆL}L∈N for{θ ∗}L∈N. It follows that both terms on the right-hand side of (B.5) converge to zero.Because ε is an arbitrary positive real number, this verifies that the first term on theright-hand side of (B.4) converges to zero in probability-P.We now turn to the second term on the right-hand side of (B.4). Pick a positivereal number ε arbitrarily. Because {AL}L∈N is uniformly equicontinuous on Θ0,there exists a real number c > 0 such thatsup{|AL(θ1)−A∗L(θ2)| : (θ1,θ2) ∈Θ0, d(θ1,θ2)< c}< ε.With such a c, we have thatP[|AL(θˆL)−A∗L| ≥ ε]= P[|AL(θˆL)−AL(θ ∗L )| ≥ ε]≤ P[d(θˆL,θ ∗L )≥ c and θˆL ∈Θ0]+P[θˆL 6∈Θ0]≤ P[d(θˆL,θ ∗L )≥ c]+P[θˆL 6∈Θ0]. (B.6)On the right-hand side of this inequality, the first term converges to zero by theconsistency of {θˆL}L∈N for {θ ∗}L∈N, while the second term converges to zero aswe have already verified above. Thus, the left-hand side of (B.6) converges to zeroas L→ ∞. Because ε was chosen arbitrarily, this establishes that the second termon the right-hand side of (B.4) converges to zero in probability-P and completes theproof. Proof of THEOREM 2.1: Because {LNLh/NL : (h,L) ∈ H} is bounded under As-sumption 2.1, it is straightforward to verify that {B˜Lh : (h,L) ∈ H} defined in (2.2)is SLB(1+δ ) on the compact set Θ0 under Assumption 2.9. Thus, application ofLemma B.1 to {B˜Lh : (h,L) ∈ H} establishes thatsupθ∈Θ0|B˜L(·,θ)−E[B˜L(·,θ)]|= supθ∈Θ0∣∣∣∣L−1L∑h=1B˜Lh(·,θ)−L−1L∑h=1E[B˜Lh(·,θ)]∣∣∣∣→ 0in probability-P as L→ ∞. (B.7)120If condition (a) of Assumption 2.10 holds, we have that E[B˜L(·,θ)] coincides withBL(θ), so thatsupθ∈Θ0∣∣B˜L(·,θ)−BL(θ)∣∣→ 0 in probability-P as L→ ∞.If, instead, condition (b) holds, it follows from (2.3) that∣∣∣∣E[B˜L(·,θ)]−BL(θ)∣∣∣∣≤ L−1L∑h=1(LNLh/NL)2N−2LhNLh∑k=1∣∣var[∇q(X˜hk,θ)]∣∣.Because∆1 ≡ sup{∣∣var[∇q(X˜hk,θ)]∣∣ : θ ∈Θ0, (k,h) ∈ N2}< ∞(see (2.4)) and ∆2 ≡ sup{LNLh/NL : (h,L) ∈ H}< ∞, it follows thatsupθ∈Θ0∣∣∣∣E[B˜L(·,θ)]−BL(θ)∣∣∣∣≤ ∆1∆2L−1L∑h=1N−1Lh ≤ ∆1∆2(infh∈{1,...,L}NLh)−1→ 0as L→ ∞. (B.8)It follows from (B.7) and (B.8) thatsupθ∈Θ0∣∣B˜L(·,θ)−BL(θ)∣∣≤ supθ∈Θ0∣∣B˜L(·,θ)−E[B˜L(·,θ)]∣∣+ supθ∈Θ0∣∣E[B˜L(·,θ)]−BL(θ)∣∣→ 0 in probability-P as L→ ∞.Thus, {B˜L}L∈N is uniformly consistent for {BL}L∈N over Θ0 under both of theconditions in Assumption 2.10. Given the uniform consistency of {B˜L} over Θ0 andthe consistency {θˆL}L∈N for {θ ∗L}L∈N (which is uniformly interior to Θ0), we canverify that {BˆL = B˜L(θˆL)}L∈N is consistent for {B∗L = BL(θ ∗L )}L∈N in the same wayas we proved Lemma B.2. 121Appendix CC.1 Proof of the Results in Section 3.2Proof of PROPOSITION 3.1: Because the second result follows from the first bythe law of the iterated expectations, and the third and fourth results can be easilyderived from the second result, we only prove the first result below. 
Let (h,L) be anarbitrary member in H. Then we have thatN−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j) = N−1LhNLh∑k=1wLhkCLhkφLh(X˜hk). (C.1)For each k ∈ N, CLhk has a bounded support by Assumption 3.2(c), and φLh(X˜hk) hasa finite absolute moment by hypothesis, so that CLhkφLh(X˜hk) has a finite absolutemoment. From this fact and the independence of CLhk and X˜hk (Assumption 3.2(b)),it follows by Fubini’s theorem (Folland, 1984, Theorem 2.37, pp. 65–66) that foreach k ∈ N,E[CLhkφLh(X˜hk) | X˜ ] = E[CLhk]φLh(X˜hk).Also, wLhk is the reciprocal of E[CLhk] by definition. Taking the conditional expecta-tion of both side in (C.1) given X˜ and applying these facts yields thatE[N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j)∣∣ X˜]= N−1LhNLh∑k=1wLhkE[CLhk]φLh(X˜hk)= N−1LhNLh∑k=1φLh(X˜hk).122Thus, the desired result follows. Proof of PROPOSITION 3.2: Because for each (h,L) ∈ H and each γ ∈ Γ,N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j,γ) = N−1LhNLh∑k=1wLhkCLhkφLh(X˜hk,γ),we have that for each (h,L) ∈ H and each γ ∈ Γ,∥∥∥∥N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j,γ)∥∥∥∥a≤ N−1LhNLh∑k=1wLhk∥∥CLhkφLh(X˜hk,γ)∥∥a.Given the independence between {CLhk : (k,h,L) ∈ K} and {X˜hk : (k,h) ∈ N2} (As-sumption 3.2(b)), it follows by Fubini’s theorem (Folland, 1984, Theorem 2.37,pp. 65–66) that for each (h,L) ∈ H and each γ ∈ Γ,∥∥∥∥N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j,γ)∥∥∥∥a≤ N−1LhNLh∑k=1wLhk‖CLhk‖a · ‖φLh(X˜hk,γ)‖a.Under Assumption 3.2(b)–(d), we have that for each k ∈ N and each (h,L) ∈ H,wLhk =E[CLhk]−1≤(0·P[CLhk = 0]+1·P[CLhk > 0])−1=P[CLhk > 0]−1≤ (N−1Lh p¯l)−1and‖CLhk‖a· ≤ C¯ P[CLhk > 0]≤ N−1Lh C¯ p¯u.Also, ∆≡ supγ∈Γ sup(k,h,L)∈K ‖φLh(X˜hk)‖a < ∞ by hypothesis. It follows that∥∥∥∥N−1Lhnh∑i=1nhi∑j=1WLhi jφLh(XLhi j)∥∥∥∥a≤ (p¯u/ p¯l)C¯ ·N−1LhNLh∑k=1wLhk‖φLh(X˜hk,γ)‖a≤ (p¯u/ p¯l)C¯∆, (h,L) ∈ H.Because the right-hand side of the above inequality depends on neither h nor L, thedesired therefore result follows. 123C.2 Proof of the Results in Section 3.3Proof of LEMMA 3.1: Suppose that Assumptions 3.1 and 3.2 hold. Let Γ be a finite-dimensional Euclidean space, a a positive real number, and {φh}h∈N a sequenceof measurable functions from (Rv×Γ,Bv⊗B(Γ)) to (Rl1×l2 ,Bl1×l2) that is LB(a)on Γ. Then the array of functions from (Rv×Γ,Bv⊗B(Γ)) to (Rl1×l2 ,Bl1×l2),{ΦLh}L∈N, defined byΦLh(ω,γ)≡ N−1Lhnh∑i=1nhi∑j=1WLhi j(ω)φh(XLhi j(ω),γ), ω ∈Ω, γ ∈ Γ, (h,L) ∈ His SLB(a) on Γ. In proving Theorem 3.2, we employ a uniform law of large numbers stated inthe next lemma, in which the SLB property takes an important role.Lemma C.1. Given Assumptions 3.1 and 3.2, let Γ be a finite-dimensional Eu-clidean space, and {FLh : (h,L) ∈ H} an array of measurable functions from (Ω×Γ,F ⊗B(Γ)) to (Rl1×l2 ,Bl1×l2) such that for each γ ∈ Γ and each L ∈ N, FL1(·,γ),. . . , FLL(·,γ) are independent, and {FLh : (h,L) ∈ H} is SLB(1+δ ) on Γ for someδ ∈ (0,∞). Then {γ 7→ L−1∑Lh=1 E[FLh(·,γ)] : Γ→ R}L∈N is uniformly boundedand uniformly equicontinuous, and {|L−1∑Lh=1 FLh(·,γ)− L−1∑Lh=1 E[FLh(·,γ)]|}converges in probability-P to zero uniformly in γ ∈ Γ.Proof of LEMMA C.1: The result can be established essentially in the same way asLemma A.3 of ?. Also, the next result is useful in proving Theorem 3.2.Lemma C.2. Let r1 and r2 be positive real numbers and Γ a finite-dimensionalEuclidean space. Write r0 ≡min{r1,r2}. 
Under Assumptions 3.1 and 3.2:(a) If measurable functions φ1 and φ2 from (R×Γ,Bv⊗B(Γ)) to (Rl1×l2 ,Bl1×l2)are LB(r1) and LB(r2) on Γ, respectively, then φ1 +φ2 and φ1φ2 are LB(r0)and LB(r0/2) on Γ, respectively.(b) The claim of (a) holds even if each occurrence of LB in the claim is replacedwith LBP.124(c) If arrays {F1Lh : (h,L) ∈ H} and {F2Lh : (h,L) ∈ H} of measurable functionsfrom (Ω× Γ,F ⊗B(Γ)) to (Rl1×l2 ,Bl1×l2) are SLB(r1) and SLB(r2) onΓ, respectively, then {F1Lh +F2Lh : (h,L) ∈ H} and {F1LhF2Lh : (h,L) ∈ H} areSLB(r0) and SLB(r0/2) on Γ, respectively.(d) The claim of (c) holds even if each occurrence of SLB in the claim is replacedwith SLBP.Proof of LEMMA C.2: The proof is essentially the same as the one of (?,Lemma A.2). We now prove Theorem 3.2.Proof of THEOREM 3.2: Because {ΛL}L∈N is bounded, there exists a compactsubset  of Sq, to which {ΛL}L∈N is uniformly interior. Because {θˆL}L∈N, {pˆiL}L∈N,and {ΛˆL}L∈N are consistent for {θ ∗L}L∈N, {pi∗L}L∈N, and {ΛL}L∈N, respectively, itsuffices to prove that {αˇL}L∈N is consistent for {αL}L∈N uniformly on Θ0×Π0×and that {αL} is uniformly continuous on Θ0×Π0.As demonstrated in Section 3.3, αˇL(·,θ ,pi) is unbiased for αL(θ ,pi) for each(θ ,pi) ∈Θ0×Π0, if X˜ is degenerate. Suppose that X˜ is not degenerate. Let ∆ denotethe left-hand side of (3.11). Then we have that0≤ tr(var[mh(X˜hk,θ ,pi)])≤ q∆, (k,h) ∈ N2.As demonstrated in Section 3.3, we also have thatE[αˇL(·,θ ,pi)]−αL(θ ,pi) =L∑h=1βLhN−2LhNLh∑k=1tr(var[mh(X˜hk,θ ,pi)]), (h,L) ∈ H.It follows thatE[αˇL(·,θ ,pi)]−αL(θ ,pi)≤ q∆L∑h=1βLhN−1Lh , (θ ,pi) ∈Θ0×Π0, L ∈ N.The right-hand side of this inequality, which does not depend on θ and pi , converges125to zero by Assumptions 3.5 and 3.8. It follows thatsup(θ ,pi)∈Θ0×Π0∣∣E[αˇL(·,θ ,pi)]−αL(θ ,pi)∣∣→ 0.Thus, to establish the result of Theorem 3.2, it suffices to prove thatsup(θ ,pi)∈Θ0×Π0∣∣αˇL(·,θ ,pi)−E[αˇL(·,θ ,pi)]∣∣→ 0 in probability-P (C.2)and {(θ ,pi) 7→ E[αˇL(·,θ ,pi)] : Θ×Π→ R}L∈N is uniformly continuous on Θ0×Π0.Define an array {FLh : Ω×Θ×Π→ Rq×q : (h,L) ∈ H} byFLh(ω,θ ,pi)≡ LβLh tr(Λ(m˜Lh(ω,θ ,pi)m˜Lh(ω,θ ,pi)′−N−2Lh Σ˜Lh(ω,θ ,pi))).Under Assumptions 3.1 and 3.2, FL1(·,θ ,pi), . . . , FLL(·,θ ,pi) are independent foreach (θ ,pi) ∈Θ0×Π0 and each L ∈ N. Because {LβLh : (h,L) ∈ H} is bounded(Assumption 3.5), it also follows by Lemma C.2 from Assumption 3.6(c) and 3.7that {FLh : (h,L) ∈ H} is SLB(1+δ ). Applying Lemma C.1 to {FLh : (h,L) ∈ H}establishes the desired result, becauseαˇL(·,θ ,pi) = L−1L∑h=1FLh(·,θ ,pi), (θ ,pi) ∈Θ0×Π0, (h,L) ∈ H.We now turn to the proof of Theorem 3.3. Our proof uses the following double-array central limit theorem. To establish the asymptotic normality of {αˆL}L∈N,we first derive the asymptotic distribution of {L1/2αˇL(·, θˆL, pˆiL,ΛL)}L∈N. and thenshow that {L1/2αˆL−L1/2(αˇL(·, θˆL, pˆiL,ΛL)}L∈N converges in probability-P to zero.In establishing the first result, we employ the following double-array central limittheorem.Lemma C.3. Let {ULh : (h,L) ∈ H} be an array of uniformly L2+δ -bounded, zero-mean v×1 random vectors for some δ ∈ [0,∞) such that for each L ∈ N, UL1,. . . ,ULL are independent. Then:(a) L−1/2∑Lh=1ULh = OP(1).126(b) Suppose in addition that δ > 0, and {VL ≡ L−1∑Lh=1 var[ULh]}L∈N is uni-formly positive. Then V−1/2L L−1/2∑Lh=1ULhA∼ N(0,1).Proof of LEMMA C.3: See Lemma A.4 of ?. 
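The statistic $\check{\alpha}_L$ that appears throughout these proofs is a weighted sum over strata of bias-corrected quadratic forms in the stratum-level moment averages. The sketch below (Python/NumPy) spells out that arithmetic; the array shapes and numbers are hypothetical, and Sigma_tilde stands for the within-stratum correction term defined in Section 3.3, whose exact form is not reproduced here.

```python
import numpy as np

def alpha_check(m_bar, Sigma_tilde, beta, N, Lam):
    """alpha_check = sum_h beta_h * tr(Lam (m_bar_h m_bar_h' - Sigma_tilde_h / N_h^2)).

    m_bar:       (L, q) stratum-level weighted moment averages
    Sigma_tilde: (L, q, q) within-stratum correction terms (see Section 3.3)
    beta:        (L,) stratum weights beta_Lh
    N:           (L,) stratum population sizes N_Lh
    Lam:         (q, q) symmetric weighting matrix
    """
    total = 0.0
    for h in range(m_bar.shape[0]):
        outer = np.outer(m_bar[h], m_bar[h]) - Sigma_tilde[h] / N[h] ** 2
        total += beta[h] * np.trace(Lam @ outer)
    return total

# Hypothetical inputs: L = 50 strata, q = 3 moment restrictions.
rng = np.random.default_rng(1)
L, q = 50, 3
m_bar = rng.normal(scale=0.05, size=(L, q))
Sigma_tilde = np.stack([np.eye(q) for _ in range(L)])
beta = np.full(L, 1.0 / L)
N = np.full(L, 400)
print(alpha_check(m_bar, Sigma_tilde, beta, N, np.eye(q)))
```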
WriteG2L(θ ,pi,Λ)≡ E[∇pi αˇL(·,θ ,pi,Λ)]=L∑h=1βLh(2N−1Lhnh∑i=1nhi∑j=1E[WLhi j∇pimh(XLhi j,θ ,pi)Λm˜Lh(·,θ ,pi)]−N−2Lh E[∇pi(vecΣ˜Lh(·,θ ,pi))]vecΛ),GL(θ ,pi,Λ)≡ (G1L(θ ,pi,Λ)′,G2L(θ ,pi,Λ)′)′, G∗L ≡ GL(θ ∗,pi∗L ,ΛL),(θ ,pi) ∈Θ×Π, Λ ∈ Sq, L ∈ N.Lemma C.4. (a) Suppose that Assumptions 3.1–3.3, 3.5, 3.6(a)–(c), 3.7, 3.8,and 3.10(c), (d) hold. Let  be an arbitrary nonempty compact subset ofSq. Then {GL : Θ0×Π0×→ Rq+r}L∈N is uniformly bounded and uni-formly equicontinuous, and {∇αˇL(·,θ ,pi,Λ)−GL(θ ,pi,Λ)}L∈N converges inprobability-P to zero uniformly in (θ ,pi,Λ) ∈ Θ0×Π0×. If, in addition,(3.3) holds, then for each L ∈ N and each Λ ∈ , G∗2L ≡ G2(θ ∗L ,pi∗L ,ΛL) = 0.(b) Suppose that Assumptions 3.1–3.8 and 3.10(a)–(f), (h) hold. If (3.3) holds,thenL1/2αˇL(·, θˆL, pˆiL,ΛL) = L−1/2L∑h=1ξ ∗Lh +oP(1) = OP(1). (C.3)If, in addition, Assumption 3.10(g) holds, then V−1/2L L1/2αˇL(·, θˆL, pˆiL,ΛL)A∼N(0,1).Proof of LEMMA C.4: To prove (i), note that∇αˇL(·,θ ,pi,Λ) = L−1L∑h=1gˇLh(·,θ ,pi,Λ),127where the array {gˇLh : Ω×Θ×Π×Sq→ Rq+r : (h,L) ∈ H} is defined bygˇLh(·,θ ,pi,Λ) =LβLh(2N−1Lhnh∑i=1nhi∑j=1WLhi j∇θmh(XLhi j,θ ,pi)Λm˜Lh(·,θ ,pi)−N−2Lh ∇θ (vecΣ˜Lh(·,θ ,pi))vecΛ), (θ ,pi) ∈Θ×Π, Λ ∈ Sq, L ∈ N.By using Lemmas 3.1 and C.2, we can easily show that {gˇLh : (h,L) ∈ H} is SLB(1+δ ) on Θ0×Π0×. Application of Lemma C.1 to {gˇLh} thus establishes the firstresult. For the second result, note that given each fixed Λ ∈ Sq, each component of∇αˇL(·,θ ,pi,Λ) is bounded by anL1+δ -bounded random variable in a neighborhoodof (θ ∗L ,pi∗L), so thatG∗L =E[∇αˇL(·,θ ∗L ,pi∗L ,ΛL)]=∇E[αˇL(·,θ ∗L ,pi∗L ,ΛL)]=∇α(θ ∗L ,pi∗L ,ΛL), Λ∈, L ∈ N.Because α(θ ∗L ,pi,ΛL) = 0 for each pi ∈Π under (C.3), it follows thatG∗2L = ∇piα(θ ∗L ,pi∗L ,ΛL) = 0, L ∈ N.For (ii), note that there exists a real number ε > 0 such that for each L ∈ N,the open ball in Θ with radius ε centered at θ ∗L is contained in intΘ0, and theopen ball in Π with radius ε centered at pi∗L is contained in intΠ0, because {θ ∗L}and {pi∗L} are uniformly interior to Θ0 and Π0, respectively. By the mean valuetheorem for random functions (?, Lemma 3), there exists sequences of randomvectors {θ¨L : Ω→Θ}L∈N and {p¨iL : Ω→Π}L∈N such that for each L ∈ N, θ¨L is onthe line segment connecting θˆL and θ ∗L , p¨iL is on the line segment connecting pˆiL andpi∗L , andαˇL(·, θˆL, pˆiL,ΛL)− αˇL(·,θ ∗L ,pi∗L ,ΛL)= ∇θ αˇL(·, θ¨L, p¨iL,ΛL)′(θˆL−θ ∗L )+∇pi αˇL(·, θ¨L, p¨iL,ΛL)′(pˆiL−pi∗L),whenever |θˆL−θ ∗L |< ε and |pˆiL−pi∗L |< ε (where ε is as described above). By theresult of (i) established above, {GˇL(·,θ ,pi,ΛL)−GL(θ ,pi,ΛL)}L∈N converges tozero uniformly in (θ ,pi) ∈ Θ0×Π0 in probability-P, and {GL(·, ·,ΛL) : Θ×Π→128Rp}L∈N is equicontinuous uniformly on Θ0×Π0. Also, we have that|θ¨L−θ ∗L | ≤ |θˆL−θ ∗L |= OP(L−1/2)by Assumption 3.10(a) and Lemma C.3, and that|p¨iL−pi∗L | ≤ |pˆiL−pi∗L |= OP(L−1/2)by Assumption 3.10(b). 
It follows that |∇αˇL(·, θ¨L, p¨iL,ΛL)−G∗L|= oP(1), andαˇL(·, θˆL, pˆiL,ΛL)− αˇL(·,θ ∗L ,pi∗L ,ΛL) = G∗1L(θˆL−θ ∗L )+G∗2L(pˆiL−pi∗L)+oP(L−1/2).Applying Assumption 3.10(a) along with the second result of (i) of the currentlemma in this equality establishes thatαˇL(·, θˆL, pˆiL,ΛL)− αˇL(·,θ ∗L ,pi∗L ,ΛL)= G∗1LL−1L∑h=1(LNLh/NL)N−1Lhnh∑i=1nhi∑j=1WLhi jψ∗Lhi +oP(L−1/2).Substituting this into the right-hand side of the identity:αˇL(·, θˆL, pˆiL,ΛL)= (αˇL(·, θˆL, pˆiL,ΛL)−αˇL(·,θ ∗L ,pi∗L ,ΛL))+αˇL(·,θ ∗L ,pi∗L ,ΛL), L ∈ N,applying the definition of αˇL, and then rewriting the resulting expression using ξ ∗Lhyields the first equality in (C.3).To show the second equality in (C.3), we rewrite the first equality in (C.3) asL1/2αˇL(·, θˆL, pˆiL,ΛL) = L−1/2L∑h=1(ξ ∗Lh−E[ξ ∗Lh])+L−1/2L∑h=1E[ξ ∗Lh], L ∈ N. (C.4)Using Proposition 3.2 and Lemmas 3.1 and C.2, we can verify that {ξ ∗Lh} is L2+2δ -bounded array. It follows by Lemma C.3(a) that the first term on the right-handside of (C.4) is OP(1). Because {∑nhi=1∑nhij=1WLhi jψ∗Lhi j : (h,L) ∈ H} is a zero-mean129array (Assumption 3.10(a)), the second term is equal toL−1/2L∑h=1E[LβLhtr(ΛL(m˜∗Lhm˜∗Lh′−N−2Lh Σ˜∗Lh))]= L1/2E[αˇL(·,θ ∗L ,pi∗L)].Note that E[αˇL(·,θ ∗L ,pi∗L)] is the bias of αˇL(·,θ ∗L ,pi∗L) for α∗L (which is zero by hy-pothesis). As discussed in Section 3.3, the bias of αˇL(·,θ ∗L ,pi∗L) for α∗L (= 0) isO(1/min{NL1, . . . ,NLL}), so that it is o(1/L1/2) under Assumption 3.10 (h). Thus,the second term on the right-hand side of (C.4) is o(1). This establishes the secondequality in (C.3).To verify the last claim of the current lemma, we substitute (C.3) into theright-hand side of (C.3) and multiply the resulting equality by V−1/2L to obtain thatV−1/2L L1/2αˇL(·, θˆL, pˆiL,ΛL) =V−1/2L L−1/2L∑h=1(ξ ∗Lh−E[ξ ∗Lh])+oP(1). (C.5)By applying Lemma C.3(b) to L2+δ bounded, zero-mean array {ξ ∗Lh−E[ξ ∗Lh]′ :(h,L) ∈ H} and applying the asymptotic equivalence lemma delivers the desiredresult. To show the asymptotic equivalence between {L1/2αˆL}L∈N and{L1/2αˇL(·, θˆL, pˆiL,ΛL)}L∈N, we use the following fact.Lemma C.5. Suppose that Assumptions 3.1–3.5, 3.6 (a)–(c), 3.7, and 3.10(a)–(f),(h) hold. If (3.3) holds, thenL∑h=1βLh(m˜Lh(·, θˆL, pˆiL)m˜Lh(·, θˆL, pˆiL)′−N−2Lh ΣˆLh)= OP(L−1/2). (C.6)Proof of LEMMA C.5: For each (i, j)∈{1,2, . . . ,q}2, let Ei j denote the q×q matrix,all of whose elements are zeros except the (i, j)-element set equal to one. Then foreach i ∈ {1,2, . . . ,q}, Eii is a symmetric matrix, and it holds that αˇL(·, θˆL, pˆiL,Eii) isequal to the ith diagonal element of the random matrix in question. It follows byLemma C.4(ii) that each diagonal element of the random matrix is OP(L−1/2). Also,for each distinct i and j in {1,2, . . . ,q}, Ei j +E ji is a symmetric matrix, and it holdsthat αˇL(·, θˆL, pˆiL,Ei j +E ji) is equal to two times the (i, j)-element of the random130matrix in question. Application Lemma C.4(ii), verifies that each off-diagonalelement of the random matrix is OP(L−1/2). We now show the asymptotic equivalence between {L1/2αˆL}L∈N and{L1/2αˇL(·, θˆL, pˆiL,ΛL)}L∈N.Lemma C.6. Suppose that 3.1–3.7, 3.9 and 3.10(a)–(f), (h) hold. If (3.3) holds,then L1/2αˆL−L1/2αˇ(·, θˆL, pˆiL,ΛL) = oP(1).Proof of LEMMA C.6: It follows from the definitions of {αˆL}L∈N and {αˇL}L∈Nthat for each L ∈ N,L1/2(αˆL− αˇ(·, θˆL, pˆiL,ΛL))= tr((ΛˆL−ΛL)L1/2L∑h=1βLh(m˜Lh(·, θˆL, pˆiL)m˜Lh(·, θˆL, pˆiL)′−N−2Lh ΣˆLh)).The right-hand side of this equality converges in probability-P to zero by Assump-tion 3.9 and Lemma C.5. 
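Theorems 3.3 and 3.4, proved next, are what justify the studentized statistic in practice: once $\hat{\alpha}_L$ and a consistent variance estimate $\hat{V}_L$ are in hand, the test reduces to a one-sided normal comparison. A minimal sketch, with $\hat{\alpha}_L$, $\hat{V}_L$, and the number of strata taken as given (hypothetical numbers):

```python
import math

def m_test(alpha_hat, V_hat, L, level=0.05):
    """One-sided test based on T_L = sqrt(L) * alpha_hat / sqrt(V_hat),
    which is approximately N(0, 1) under the null."""
    T = math.sqrt(L) * alpha_hat / math.sqrt(V_hat)
    p_value = 0.5 * math.erfc(T / math.sqrt(2.0))  # 1 - Phi(T)
    return T, p_value, p_value < level

# Hypothetical inputs: 400 strata, estimated alpha_hat and V_hat.
print(m_test(alpha_hat=0.012, V_hat=0.9, L=400))
```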
Proof of THEOREM 3.3: By Lemma C.6, we have thatL1/2αˆL = L1/2αˇL(·, θˆL, pˆiL,ΛL)+L1/2(αˆL− αˇL(·, θˆL, pˆiL,ΛL))= L1/2αˇL(·, θˆL, pˆiL,ΛL)+oP(1). (C.7)The equality (3.13) follows from this equality. The asymptotic normality resultsalso follows from (3.13) and Lemma C.4(ii) by the asymptotic equivalence lemma.For convenience in proving Theorem 3.4, define {ξˇLh :Ω×Θ×Π×Sq×Rp×p×Rq→ R : (h,L) ∈ H} byξˇLh(·,θ ,pi,Λ,J,G1)≡ LβLhtr(Λ(m˜Lh(·,θ ,pi)m˜Lh(·,θ ,pi)′−N−2Lh Σ˜Lh(·,θ ,pi)))+G′1J(LNLh/NL)nh∑i=1nhi∑j=1WLhi jψLh(XLhi j,θ),(θ ,pi,Λ,J,G1) ∈Θ×Π×Sq×Rp×p×Rq, (h,L) ∈ H,with which we have that for each (h,L) ∈ H, ξ ∗Lh = ξˇLh(·,θ ∗L ,pi∗L ,ΛL,J∗L,G∗L1).131Proof of THEOREM 3.4: Because {ΛL}L∈N is bounded, there exists a compactsubset  of Sq to which {ΛL} is uniformly interior. Analogously, there exists acompact subset J of Rp×p and a compact subset G1 of Rp×1 such that {J∗L}L∈Nand {G∗1L}L∈N are uniformly interior to J and G1, respectively. Application ofLemma C.2 verifies that {ξˇ 2Lh} is SLB(1+δ ) on Θ0×Π0××J×G1. It followsby Lemma C.1 that{(θ ,pi,Λ,J,G1) 7→E[ξˇLh(·,θ ,pi,Λ,J,G1)2] : Θ0×Π0××J×G1→R, (h,L) ∈ H}is uniformly bounded and uniformly equicontinuous, and thatsup(θ ,pi,Λ,J,G1)∈Θ0×Π0××J×G1∣∣∣∣L−1L∑h=1ξˇLh(·,θ ,pi,Λ,J,G1)2−L−1L∑h=1E[ξˇLh(·,θ ,pi,Λ,J,G1)2]∣∣∣∣→ 0 in probability-P.BecauseVˆL = L−1L∑h=1ξˇLh(·, θˆL, pˆiL, ΛˆL, JˆL, Gˆ1L)2, L ∈ N,it follows from these facts and the consistency of {θˆL}L∈N, {pˆiL}L∈N, {ΛˆL}L∈N,{JˆL}L∈N, and {Gˆ1L}L∈N for {θ ∗L}L∈N, {pi∗L}L∈N, {ΛL}L∈N, {J¯L}L∈N, and {G∗1L}L∈NthatVˆL−L−1L∑h=1E[ξˇLh(·,θ ∗L ,pi∗L ,ΛL, J¯L,G∗1L)2]→ 0 in probability-P.Also, we have that for each (h,L) ∈ H,E[ξˇLh(·,θ ∗L ,pi∗L ,ΛL, J¯L,G∗1L)]=LβLhm¯∗L′ΛLm¯∗L+LβLhtr(ΛLN−2LhNLh∑k=1var[mh(X˜hk,θ ,pi)]),where the second term on the right-hand side is convergent to zero uniformly in h(see the discussion on the bias of Σ˜L in Section 3.3). It follows thatE[ξˇLh(·,θ ∗L ,pi∗L ,ΛL, J¯L,G∗1L)2] = V¯L +o(1).This verifies claim (i).132For claim (ii), note thatTL−L1/2αˆLV 1/2L= (Vˆ−1/2L −V−1/2L )L1/2αˆL, L ∈ N.Because V¯L =VL by hypothesis, it follows from (i) established above and Assump-tion 3.10(g) that Vˆ−1/2L −V−1/2L = oP(1). Also, we know that L1/2αˆL = OP(1) under(3.3), as {V−1/2L L1/2αˆL}L∈N is convergent in distribution (Theorem 3.3), and {VL =V¯L}L∈N is bounded as shown above. It follows that TL = L1/2αˆL/V 1/2L +oP(1). Thedesired result follows from this equality by the asymptotic equivalence lemma andTheorem 3.3.To prove (iii), recall that {VˆL}L∈N is consistent for {V¯L}L∈N that is bounded(Theorem 3.3). Also, note that for each L ∈ N,V¯L≥L−1L∑h=1(LβLhm¯∗Lh′ΛLm¯∗Lh)2+o(1)≥(L−1L∑h=1LβLhm¯∗Lh′ΛLm¯∗Lh)2+o(1)=α∗L2+o(1),where the second inequality follows by Jensen’s inequality. It follows that when{α∗L}L∈N is uniformly positive, so is {V¯L}L∈N.Given the consistency of {VˆL} for the bounded and uniformly positive se-quence {V¯L} and the consistency {αˆL}L∈N for the bounded sequence {αL}L∈N(Theorem 3.2), we have thatL−1/2TL−V¯−1/2L α∗L = V¯−1/2L αˆL−V¯1/2L α∗L → 0 in probability-P (C.8)(Theorems 3.2 and 3.4(i)). Let c be an arbitrary real number. Then we have that foreach L ∈ N,P[TL > c] = P[L−1/2TL−V¯−1/2L α∗L > L−1/2c−V¯−1/2L α∗L ]≥ P[L−1/2TL−V¯−1/2L α∗L > L−1/2c− τ],where τ ≡ inf{V¯−1/2L α∗L : L ∈ N}> 0, because {α∗L}L∈N is assumed to be uniformlypositive, and {V¯L}L∈N is bounded. 
Because L−1/2c < τ/2 for almost all L ∈ N, it133holds that for almost all L ∈ N,P[TL > c]≥ P[L−1/2TL−V¯−1/2L α∗L >−τ/2]≥ P[|L−1/2TL−V¯−1/2L α∗L |< τ/2].Because the right-hand side of this equality converges to one by (C.8), the desiredresult follows. Proof of PROPOSITION 3.5: Because {ΛL}L∈N is bounded by Assumption 3.6(d),there exists a positive real number c1 that is no smaller than the maximum eigenvalueof ΛL for each L ∈ N. With this c1, we have thatα∗L =L∑h=1βLhm¯∗Lh′ΛLm¯∗Lh ≤ c1L∑h=1βLh|E[m∗Lh] |2, L ∈ N.Claim (a) therefore follows. When, in addition, {ΛL} is uniformly positive definite,there exists a real number c2 > 0 that is no larger than the minimum eigenvalue ofΛL for each L ∈ N, so thatα∗L ≥ c2∑βLh|E[m∗Lh] |2, L ∈ N.Claim (b) therefore follows. Proof of the Results in Section 3.4Proof of THEOREM 3.1: Under Assumption a, {mh(X˜hk,θ ,pi) :(θ ,pi) ∈Θ0×Π0, (k,h) ∈ N2} is uniformly L2+2δ -bounded for some real δ > 0.By Proposition 3.2 it follows that {m˜Lh(·,θ ,pi) : (θ ,pi) ∈Θ0×Π0, (h,L) ∈ H} isuniformly L2+2δ -bounded, and so is {m˜Lh(·,θ ∗L ,pi) : pi ∈Π0, (h,L) ∈ H}. Thus,{m¯Lh(θ ∗L ,pi) : pi ∈Π0, (h,L) ∈ H} is bounded. It follows from this fact andAssumptions 3.12a, b that {α∗L(pi) : pi ∈Π0, L ∈ N} is bounded.For the consistency result, define c≡ sup{|ΛL(pi)| : pi ∈Π, L ∈ N}+1. Thenc is finite by Assumption 3.12(b). Write  ≡ {Λ ∈ Sq : |Λ| ≤ c}, and let ε be an134arbitrary positive real number. Then we have thatP[suppi∈Π∣∣∣∣αˆL(pi)−α∗L(pi)∣∣∣∣≥ ε](C.9)≤ P[suppi∈Π∣∣∣∣αˆL(pi)−α∗L(pi)∣∣∣∣≥ ε, θˆL ∈Θ0, and suppi∈Π|ΛˆL(pi)| ≤ c]+P[θˆL 6∈Θ0]+P[suppi∈Π|ΛˆL(pi)|> c], L ∈ N. (C.10)The second and third terms on the right-hand side of this equality converge to zeroby Assumptions 3.4 and 3.12(a), (c). It thus suffices to prove that the first term alsoconverges to zero to establish the desired result.Note that for each pi ∈Π and each L ∈ N,αˆL(pi)−α∗L(pi) = αˇL(·, θˆL,pi, ΛˆL(pi))−αL(θ ∗L ,pi,ΛL(pi))=(αˇL(·, θˆL,pi, ΛˆL(pi))−αL(θˆL,pi, ΛˆL(pi)))+(αL(θˆL,pi, ΛˆL(pi))−αL(θ ∗L ,pi,ΛL(pi))).It follows that if θˆL ∈Θ0 and suppi∈Π |ΛˆL(pi)| ≤ c,suppi∈Π|αˆL(pi)−α∗L(pi)| ≤ sup(θ ,pi,Λ)∈Θ0×Π×|αˇL(·,θ ,pi,Λ)−αL(θ ,pi,Λ)|+ suppi∈Π|αL(θˆL,pi, ΛˆL(pi))−αL(θ ∗L ,pi,ΛL(pi))|, L ∈ N.(C.11)The first term on the right-hand side of this equality converge in probability-P tozero, because {αˇL}L∈N is uniformly consistent for {αL}L∈N on Θ0×Π0× asverified in the proof of Theorem 3.2. In the second term, {αL}L∈N is uniformlycontinuous on Θ0×Π0×, as established in the proof of Theorem 3.2. It followsby Assumptions 3.4 and 3.12(a), (c) that the second term converges to zero inprobability-P. Thus, given that the inequality (C.11) holds with a probabilityapproaching to one, the first term on the right-hand side of (C.10) converges to zero,and the desired result follows. In establishing the claims in Theorem 3.3, we use the following functionalcentral limit theorem.135Lemma C.7. Given Assumptions 3.1 and 3.2, let Γ a nonempty compact subsetof the r-dimensional Euclidean space, and {ULh : (h,L) ∈ H}, a double array ofmeasurable functions from (Ω×Γ,F/B⊗B(Γ)) to (R,B) such that {ULh} isSLBP(2+ δ ) on Γ for some δ ∈ (0,∞), and for each (h,L) ∈ H and each γ ∈ Γ,E[ULh(·,γ)] = 0. 
Then:(a) The sequence {supγ∈Γ |L−1/2∑Lh=1ULh(·,γ)}L∈N is uniformlyL2+δ -bounded;hence, it is OP(1).(b)supL∈NE[sup{∣∣∣∣L−1/2L∑h=1ULh(·,γ2)−L−1/2L∑h=1ULh(·,γ1)∣∣∣∣: |γ1− γ2|< κ, (γ1,γ2) ∈ Γ2}]→ 0 as κ ↓ 0.(c) The sequence of random functions {γ 7→ L−1/2∑Lh=1ULh(·,γ)}L∈N is stochas-tically equicontinuous uniformly on Γ, i.e., for each pair of real numbersε > 0 and κ > 0, there exists a real number β > 0 such thatlimsupP[sup{∣∣∣∣L−1/2L∑h=1ULh(·,γ2)−L−1/2L∑h=1ULh(·,γ1)∣∣∣∣: |γ1− γ2|< β , (γ1,γ2) ∈ Γ2}> κ]< ε.(d) Suppose in addition that there exists a function K : Γ2→ R such that for each(γ1,γ2) ∈ Γ2KL(γ1,γ2)≡ L−1L∑h=1cov[ULh(·,γ1),ULh(·,γ2)]→ K(γ1,γ2).Then the process {γ 7→ L−1/2∑Lh=1ULh(·,γ)}L∈N converges in distribution toa zero-mean Gaussian process with covariance kernel K concentrated onU(Γ), the set of all uniformly continuous R-valued functions.Proof of LEMMA C.7: The lemma can be proved by using (Pollard, 1990, The-136orem 10.2). The detailed proof of Lemma C.7 is available from the author uponrequest. Proof of LEMMA 3.2: The result immediately follows from the proof of Lemma 3.1.To prove Theorem 3.3, we first derive the asymptotic linear representation of thesequence of random functions from Π to R, {pi 7→ L1/2αˇL(·, θˆL,pi,ΛL(pi))}L∈Nand then show that {pi 7→ L1/2(αˇL(·, θˆL,pi,ΛL(pi))− αˆL(pi))}L∈N converges inprobability-P to zero (function).Lemma C.8. Suppose that 3.1–3.5, 3.7, 3.12(a), (b), and 3.13(a) hold. If (3.3)holds, then:(a) suppi∈Π |L1/2αˇL(·, θˆL,pi,ΛL(pi))−L−1/2∑Lh=1 ξ ∗Lh(pi)| → 0 in probability-P.(b) suppi∈Π |L1/2αˇL(·, θˆL,pi,ΛL(pi))|= OP(1).(c) If in addition Assumption 3.13(b) and 3.14 hold, then the sequence of randomfunctions {pi 7→VL(pi)−1/2L1/2αˇL(·, θˆL,pi,ΛL(pi))}L∈Nconverges in distribution to a zero-mean Gaussian process with covariancekernel K concentrated on U(Π).Proof of LEMMA C.8: To prove (i), note that there exists a real number ε > 0 suchthat for each L ∈ N, the open ball in Θ with radius ε centered at θ ∗L is containedin intΘ0. Fix pi ∈ Π arbitrarily. Then, by the mean value theorem for randomfunctions (?, Lemma 3), there exists a sequence of random vectors {θ¨L(pi) : Ω→Θ}L∈N such that for each L ∈ N, θ¨L(pi) is on the line segment connecting θˆL andθ ∗L , andαˇL(·, θˆL,pi,ΛL(pi))− αˇL(·,θ ∗L ,pi,ΛL(pi)) = ∇θ αˇL(·, θ¨L(pi),pi,ΛL(pi))′(θˆL−θ ∗L )whenever |θˆL− θ ∗L | < ε . Subtracting G∗1L(pi)(θˆL− θ ∗L ) from both sides of thisequality and taking the absolute value of both sides of the resulting equality yields137that∣∣αˇL(·, θˆL,pi,ΛL(pi))− αˇL(·,θ ∗L ,pi,ΛL(pi))−G∗1L(pi)(θˆL−θ ∗L )∣∣=∣∣(∇θ αˇL(·, θ¨L(pi),pi,ΛL(pi))′−G∗1L)(pi)(θˆL−θ ∗L )∣∣≤ |∇θ αˇL(·, θ¨L(pi),pi,ΛL(pi))−G∗1L(pi)| |θˆL−θ ∗L |, pi ∈Π, L ∈ N,so thatsuppi∈Π∣∣αˇL(·, θˆL,pi,ΛL(pi))− αˇL(·,θ ∗L ,pi,ΛL(pi))−G∗1L(pi)(θˆL−θ ∗L )∣∣≤ suppi∈Π|∇θ αˇL(·, θ¨L(pi),pi,ΛL(pi))−G∗1L(pi)| |θˆL−θ ∗L |, L ∈ N, (C.12)whenever |θˆL− θ ∗L | < ε . For the first factor on the right-hand side of the aboveinequality, we have that|∇θ αˇL(·, θ¨L(pi),pi,ΛL(pi))−G∗1L(pi)|≤|∇θ αˇL(·, θ¨L(pi),pi,ΛL(pi))−G1L(θ¨L(pi),pi,ΛL(pi))|+ |G1L(θ¨L(pi),pi,ΛL(pi))−G1L(θ ∗L ,pi,ΛL(pi))|, pi ∈Π, L ∈ N.Let  be as in the proof of Theorem 3.1. Then we have thatsuppi∈Π|∇θ αˇL(·, θ¨L(pi),pi,ΛL(pi))−G∗1L(pi)|≤ suppi∈Π|∇θ αˇL(·, θ¨L(pi),pi,ΛL(pi))−G1L(θ¨L(pi),pi,ΛL(pi))|+ suppi∈Π|G1L(θ¨L(pi),pi,ΛL(pi))−G1L(θ ∗L ,pi,ΛL(pi))|≤ sup(θ ,pi,W )∈Θ0×Π×|∇θ αˇL(·,θ ,pi,W )−G1L(θ ,pi,W )|+ suppi∈Π|G1L(θ¨L(pi),pi,ΛL(pi))−G1L(θ ∗L ,pi,ΛL(pi))|, (C.13)whenever |θˆL− θ ∗L | < ε . 
On the right-hand side of this inequality, the first termconverges in probability-P to zero by Lemma C.4(i). The second term also converges138to zero, because {L1L} is uniformly equicontinuous on Θ0×Π×, andsuppi∈Π|θ¨L(pi)−θ ∗L | ≤ |θˆL−θ ∗L | → 0 in probabiity-Pby Assumption 3.4. Further, inequality (C.13) holds with a probability approachingone, as the probability that |θˆL−θ ∗L |< ε converges to one by Assumption 3.4. Itfollows thatsuppi∈Π|∇θ αˇL(·, θ¨L(pi),pi,ΛL(pi))−G∗1L(pi)| → 0 in probability-P.Applying this result in inequality (C.12), which holds with a probability approachingone, establishes that∣∣αˇL(·, θˆL,pi,ΛL(pi))− αˇL(·,θ ∗L ,pi,ΛL(pi))−G∗1L(pi)(θˆL−θ ∗L )∣∣= oP(|θˆL−θ ∗L |) = oP(L−1/2),where the second equality follows from Assumption 3.10(a) by Lemma C.3. Finally,multiplying both sides of this equality by L1/2 and applying Assumption 3.10(a) to(θˆL−θ ∗L ) on the left-hand side of the resulting inequality yields thatsuppi∈Π∣∣∣∣L1/2(αˇL(·, θˆL, pˆiL,ΛL(pi))− αˇL(·,θ ∗L ,pi∗L ,ΛL(pi)))−G∗1L(pi)J∗LL−1/2L∑h=1(LNLh/NL)N−1Lhnh∑i=1nhi∑j=1WLhi jψ∗Lhi j∣∣∣∣= oP(1),because the remainder term in the equality in Assumption 3.10(a) does not depend139on pi , and {G∗1L(pi) : pi ∈Π, L ∈ N} is bounded. Given this equality, we have thatsuppi∈Π∣∣∣∣L1/2αˇL(·, θˆL,pi,ΛL(pi))−L−1/2L∑h=1ξ ∗Lh(pi)∣∣∣∣= suppi∈Π∣∣∣∣L1/2(αˇL(·, θˆL,pi,ΛL(pi))− αˇL(·,θ ∗L ,pi,ΛL(pi)))+L−1/2αˇL(·,θ ∗L ,pi,ΛL(pi))−L−1/2L∑h=1ξ ∗Lh(pi)∣∣∣∣= suppi∈Π∣∣∣∣G∗1G(pi)J∗LL−1/2L∑h=1(LNLh/NL)N−1Lhnh∑i=1nhi∑j=1WLhi jψ∗Lhi j+L−1/2αˇL(·,θ ∗L ,pi,ΛL(pi))−L−1/2L∑h=1ξ ∗Lh(pi)∣∣∣∣+oP(1).We can easily verify that the first-term on the right-hand side of this equality is zeroby using the definition of αˇL and ξ ∗L . The result therefore follows.For claim (ii), we apply Lemma C.2 to verify that {ξ ∗Lh : (h,L) ∈ H} is SLBP(2+2δ ) on Π. The desired result follows from this fact and (i) by Lemma C.7(i).We now turn to claim (iii). Because {VL(pi) : pi ∈Π, L ∈ N} is uniformlypositive by Assumption 3.13(b), VL(pi)−1/2 is uniformly bounded. It follows fromthis fact and claim (i) of the current lemma thatsuppi∈Π∣∣∣∣VL(pi)−1/2L1/2αˇL(·, θˆL,pi,ΛL(pi))−VL(pi)−1/2L−1/2L∑h=1ξ ∗Lh(pi)∣∣∣∣→ 0 in probability-P.It thus suffices to show that {VL(pi)−1/2L−1/2∑Lh=1 ξ ∗Lh(pi)}L∈N converges in dis-tribution to a zero-mean Gaussian process with covariance kernel K (see (?,Theorem 4.1)). We verify the desired by using Lemma C.7, by taking {pi 7→V−1/2L (pi)ξ ∗Lh(pi)} for {ULh}.Because {ξ ∗Lh : (h,L) ∈ H} is SLBP(2 + δ ) on Π, and {V−1/2L (pi) :pi ∈Π, L ∈ N} is bounded, {pi 7→VL(pi)−1/2ξ ∗Lh(pi) : (h,L) ∈ H} is SLBP(2+δ ) onΠ. In addition, the convergence of KL to K is directly imposed in Assumption 3.14.Thus, conditions imposed in Lemma C.7 are satisfied in our current problem. Thedesired result therefore follows. 140Lemma C.9. Suppose that 3.1–3.5, 3.7, 3.12(a), and 3.13(a) hold. If (3.3) holds,thenL∑h=1βLh(m˜Lh(·, θˆL,pi)m˜Lh(·, θˆL,pi)′−N−2Lh Σ˜Lh(·, θˆL,pi))= OP(L−1/2). (C.14)Proof of LEMMA C.9: The result can be obtained by essentially repeating theargument in the proof of Lemma C.5, using Lemma C.8(ii) instead of Lemma C.4(ii).Lemma C.10. Suppose that Assumptions 3.1–3.5, 3.7, 3.12, and 3.13(a) hold. 
If(3.3) holds, then suppi∈Π |L1/2(αˆL(pi)− αˇ(·, θˆL,pi,ΛL(pi)))|= oP(1).Proof of LEMMA C.10: By the definitions of {αˆL}L∈N and {αˇL}L∈N, we have thatfor each pi ∈Π and each L ∈ N,L1/2(αˆL(pi)− αˇL(·, θˆL,pi,ΛL(pi)))= tr((ΛˆL(pi)−ΛL(pi))L1/2L∑h=1βLh(m˜Lh(·, θˆL,pi)m˜Lh(·, θˆL,pi)′−N−2Lh Σ˜Lh(·, θˆL,pi))).The right-hand side of this equality converges in probability-P to zero uniformly inpi ∈Π by Lemma C.5 and Assumption 3.12(c). The desired result therefore follows.Now, Theorem 3.3 follows from Lemmas C.8 and C.10. We are also ready toprove Theorem 3.4.Proof of THEOREM 3.3: The equality (3.14) immediately follows fromLemma C.8(i) and Lemma C.10. To establish the convergence of (3.15) in distribu-tion, we use (3.14) together with the boundedness of {VL(pi)−1/2 : pi ∈Π, L ∈ N}to obtain thatsuppi∈Π∣∣∣∣V−1/2L (pi)L1/2αˆL(pi)−V−1/2L (pi)L1/2αˇ(·, θˆL, pˆiL,ΛL(pi))∣∣∣∣= oP(1).141The desired convergence in distribution follows from Lemma C.8(iii) by this fact(see (?, Theorem 4.1)). Proof of THEOREM 3.4: Let  be as in Theorem 3.1. Also, let J and G1 be compactsubsets of Rp×p and Rp×1, respectively, such that the bounded arrays {J¯L}L∈N and{G∗1L(pi) : pi ∈Π : L ∈ N} are uniformly interior to J and G1, respectively. Then{(θ ,pi,Λ,J,G1) 7→E[ξˇLh(·,θ ,pi,W,J,G1)2] : Θ0×Π0××J×G1→R, (h,L) ∈ H}is uniformly bounded and uniformly equicontinuous, and it holds thatsup(θ ,pi,W,J,G1)∈Θ0×Π0××J×G1∣∣∣∣L−1L∑h=1ξˇLh(·,θ ,pi,W,J,G1)2−L−1L∑h=1E[ξˇLh(·,θ ,pi,W,J,G1)2]∣∣∣∣→ 0 in probability-P,as verified in the proof of Theorem 3.4.Note thatVˆL(pi) = G−1L∑h=1ξˇLh(·, θˆL,pi, ΛˆL(pi), JˆL, Gˆ1L(pi))2, pi ∈Π, L ∈ N, andV¯L(pi) = L−1L∑h=1E[ξˇLh(·,θ ∗L ,pi,ΛL(pi), J¯L,G∗1L(pi))2, pi ∈Π, L ∈ N.The claim (i) follows from these facts, given that {θˆL}L∈N and {JˆL}L∈N are consis-tent for {θ ∗L}L∈N and {J¯L}L∈N, respectively, and that {ΛˆL(pi)}L∈N and {Gˆ1L(pi)}L∈Nare consistent for {ΛL(pi)}L∈N and {G∗1L(pi)}L∈N, respectively, uniformly in pi ∈Π.We now turn to claim (ii). By the definition of TL, we have thatsuppi∈Π∣∣∣∣TL(pi)−L1/2αˆL(pi)V 1/2L (pi)∣∣∣∣= suppi∈Π∣∣∣∣(VˆL(pi)−1/2−VL(pi)−1/2)L1/2αˆL(pi)∣∣∣∣≤ suppi∈Π|VˆL(pi)−1/2−VL(pi)−1/2| suppi∈Π|G1/2αˆL(pi)|, L ∈ N.The first factor on the right-hand side converges in probability-P to zero by claim (i)of the current theorem and the assumption that V¯L =VL for each L ∈ N. The second142factor is OP(1) by the second claim of Theorem 3.3 and the boundedness of {VL(pi) :pi ∈Π, L ∈ N}. Thus, we have that suppi∈Π |TL(pi)−L1/2αˆL(pi)/V1/2L (pi)|= oP(1).The result follows from this equality and Theorem 3.3 by (?, Theorem 4.1). To prove Theorem 3.5, we first prove a few lemmas.Lemma C.11. Suppose that Assumptions 3.1–3.5, 3.7, 3.11–3.14. Thensup(pi1,pi2)∈Π2 |KˆL(pi1,pi2)− K¯L(pi1,pi2)| → 0 in probability-P, where {K¯L : Π2 →R}L∈N defined byK¯L(pi1,pi2)≡ (V¯L(pi1)V¯L(pi2))−1/2(L−1L∑h=1cov[ξ¯L(pi1), ξ¯L(pi2)]+L−1L∑h=1(LβLh)2m¯∗Lh(pi1)′ΛL(pi)m¯∗Lh(pi1)m¯∗Lh(pi2)′ΛL(pi)m¯∗Lh(pi2))+o(1),(pi1,pi2) ∈Π2, L ∈ N. (C.15)Proof of LEMMA C.11: Let , J, and G1 be as in the proof of Theorem 3.3. Byusing Lemma C.2, we can show that{(θ ,pi1,pi2,Λ1,Λ2,J,G1,1,G1,2) 7→ ξˇLh(·,θ ,pi1,Λ1,J,G1,1)ξˇLh(·,θ ,pi2,λ2,J,G1,2) :Θ0×Π×Π×××J×G1×G1→ R : (h,L) ∈ H}is SLBP(1+δ ) on Θ0×Π×Π×××J×G1×G1. 
It follows by Lemma C.1that{(θ ,pi1,pi2,Λ1,Λ2,J,G1,1,G1,2)7→ E[ξˇLh(·,θ ,pi1,Λ1,J,G1,1)ξˇLh(·,θ ,pi2,Λ2,J,G1,2) : (h,L) ∈ H}]is bounded and equicontinuous uniformly on Θ0×Π×Π×××J×G1×G1,143and thatsup(θ ,pi1,pi2,Λ1,Λ2,J,G1,1,G1,2)∈Θ0×Π0××J×G1∣∣∣∣L−1L∑h=1ξˇLh(·,θ ,pi1,W1,J,L1,1) (C.16)× ξˇLh(·,θ ,pi2,W2,J,L1,2)−L−1L∑h=1E[ξˇLh(·,θ ,pi1,Λ1,J,G1,1)× ξˇLh(·,θ ,pi2,Λ2,J,G1,2)]∣∣∣∣→ 0 in probability-P. (C.17)Note that for each (pi1,pi2) ∈Π2 and each L ∈ N,KˆL(pi1,pi2) = (VˆL(pi1)VˆL(pi2))−1/2×L−1L∑h=1ξˇLh(·, θˆL,pi1, ΛˆL(pi1), JˆL, Gˆ1,L(pi1))ξˇLh(·, θˆL,pi2, ΛˆL(pi2), JˆL, Gˆ1,L(pi2)).Given (C.17), Assumption 3.13(b), and Theorem 3.4(i), it follows that{KˆL(pi1,pi2)}L∈N is uniformly consistent for{(V¯L(pi1)V¯L(pi2))−1/2×L−1L∑h=1E[ξˇLh(·,θ ∗L ,pi1,ΛL(pi1),J∗L,G∗1,L(pi1))ξˇLh(·,θ ∗L ,pi2,ΛL(pi2),J∗L,G∗1,L(pi2))]}L∈N,which is uniformly equicontinuous in (pi1,pi2) ∈Π2. BecauseL−1L∑h=1E[ξˇLh(·,θ ∗L ,pi1,ΛL(pi1),J∗L,G∗1,L(pi1))ξˇLh(·,θ ∗L ,pi2,ΛL(pi2),J∗L,G∗1,L(pi2))]= L−1L∑h=1cov[ξ ∗Lh(pi1),ξ ∗Lh(pi2)]+L−1L∑h=1(LβLh)2m¯∗Lh(pi1)′ΛL(pi)m¯∗Lh(pi1)× m¯∗Lh(pi2)′ΛL(pi)m¯∗Lh(pi2), (pi1,pi2) ∈Π2, (h,L) ∈ H, (C.18)the first result follows. 144The next lemma is necessary in establishing Lemma C.13 stated below, whichis essential in proving Theorem 3.5.Lemma C.12. Suppose that Assumptions 3.1–3.5, 3.7, and 3.11–3.14. If (3.3) holdsand J¯L = J∗L for each L ∈ N, then {pi 7→ TˆL(pi)}L∈N converges in distribution to azero-mean Gaussian process in Π with covariance kernel K concentrated on U(Π).Proof of LEMMA C.12: Let , J, and G1 be as in the proof of Theorem 3.4. Thenthe array of random functions {ξˇLhνh : (h,L) ∈ H} is stochastically equicontinuouson Θ0×Π××J×G1 by Lemma C.7(iii). Pick a real numberε > 0 arbitrarily. When for a real number β > 0, |θˆL − θ ∗L | < β/4,suppi∈Π |ΛˆL(pi)−ΛL(pi)|< β/4, |JˆL−J∗L|< β/4, and suppi∈Π |Gˆ1L(pi)−G∗1L(pi)|<β/4, we have that(|θˆL−θ ∗L |2+ suppi∈Π|ΛˆL(pi)−ΛL(pi)|2+ |JˆL−J∗L|2+ suppi∈Π|Gˆ1L(pi)−G∗1L(pi)|2)1/2 < β .Given this fact, for each real number β > 0,P[suppi∈Π∣∣∣∣L−1/2L∑h=1ξˆLh(pi)νh−L−1/2L∑h=1ξ ∗Lh(pi)νh∣∣∣∣> ε]= P[suppi∈Π∣∣∣∣L−1/2L∑h=1ξˇLh(·, θˆL,pi, ΛˆL(pi), JˆL, Gˆ1L)νh−L−1/2L∑h=1ξˇLh(·,θ ∗L ,pi,ΛL(pi),J∗L,G∗1L(pi))νh∣∣∣∣> ε](C.19)145is dominated byP[θˆL 6∈Θ0]+P[ΛˆL(pi) 6∈  for some pi ∈Π]+P[JˆL 6∈ J]+P[Gˆ1L(pi) 6∈ G1 for some pi ∈Π]+P[|θˆL−θ ∗L | ≥ b/4]+P[suppi∈Π|ΛˆL(pi)−ΛL(pi)| ≥ b/4]+P[|JˆL− J∗L| ≥ b/4]+P[suppi∈Π|Gˆ1L(pi)−G∗1L(pi)| ≥ b/4]+P[sup{∣∣∣∣L−1/2L∑h=1ξˇLh(·,θ2,pi,Λ2,J2,G1,2)νh−L−1/2L∑h=1ξˇLh(·,θ1,pi,Λ1,J1,G1,1)νh∣∣∣∣ :(|θ2−θ1|2 + |W2−W1|2 + |J2− J1|2 + |G2−G1|2)1/2 < b,(W1,W2) ∈ 2, (θ1,θ2) ∈Θ20, (J1,J2) ∈ J2, (G1,2,G1,1) ∈ G21}> ε],L ∈ N.In the above expression, we can choose b to make the last term arbitrarily smalluniformly in L ∈ N, while given the chosen b, all other terms converge to zero asL→ ∞. It follows that (C.19) converges to zero for each ε > 0, i.e.,suppi∈Π∣∣∣∣L−1/2L∑h=1ξˆLh(pi)νh−L−1/2L∑h=1ξ ∗Lh(pi)νh∣∣∣∣= oP(1).Further, {suppi∈Π |VˆL(pi)|−1}L∈N is OP(1), because {VˆL(pi)}L∈N is consistent for{VL(pi)}L∈N uniformly in pi ∈Π, and {VL(pi) : pi ∈Π, L ∈ N} is uniformly positive.It follows thatsuppi∈Π∣∣∣∣VˆL(pi)−1/2L−1/2L∑h=1ξˆLh(pi)νh−VL(pi)−1/2L−1/2L∑h=1ξ ∗Lh(pi)νh∣∣∣∣= oP(1).(C.20)Now, it is straightforward to verify that {VL(pi)−1/2ξ ∗Lh(pi) : pi ∈Π, (h,L) ∈ H}satisfies all conditions imposed on {ULh} in Lemma C.7, by using Lemma C.2 andthe independence between {ξ ∗Lh(pi) : pi ∈Π, (h,L) ∈ H} together with the normalityof νh. Thus, {pi 7→VL(pi)−1/2L−1/2∑Lh=1 ξ ∗Lh(pi)νh}L∈N converges in distribution to146the stated Gaussian limit. 
Given (C.20),{pi 7→ TˆL = VˆL(pi)−1/2L−1/2L∑h=1ξˆLh(pi)νh}L∈Nalso converges in distribution to the same limit (?, Theorem 4.1), and the resultfollows. We now prove an important lemma about the conditional distribution of {pi 7→TˆL(pi)}L∈N given X˜ and {CLhk : (k,h,L) ∈ K}, using the result on the correspondingunconditional distribution established in Lemma C.12.Lemma C.13. (a) Suppose that Assumptions 3.1–3.5, 3.7, 3.11, 3.12, and 3.13hold. Then suppi∈Π |TˆL(pi)|= OP(1).(b) Suppose that Assumptions 3.1–3.5, 3.7, 3.11–3.14 hold. If (3.3) holds andJ¯L = J∗L for each L ∈ N, then the distribution of {pi 7→ TˆL(pi)} conditionalon X˜ and {CLhk : (k,h,L) ∈ K} weakly converges to a zero-mean Gaussianprocess with covariance kernel K concentrated on U(Π) in probability-P.Proof of LEMMA C.13: To prove claim (i), recall thatξˆLh(pi) = ξˇLh(·, θˆL,pi, ΛˆL(pi), JˆL, Gˆ1L(pi)), pi ∈Π, L ∈ N.Let , J, and G1 be as in the proof of Theorem 3.3. In Lemma C.7(i), takeΘ0×Π××J×G1 for Γ and {ξˇLhνh : (h,L) ∈ H} for {ULh : (h,L) ∈ H}. Thenwe can easily verify that all assumptions imposed in the lemma are satisfied, byusing Lemma C.2. Thus,sup{∣∣∣∣L−1/2L∑h=1ξˇLh(·,θ ,pi,Λ,J,G1)νh∣∣∣∣: (θ ,pi,Λ,J,G1) ∈Θ0×Π××J×G1}= OP(1).Because the probability that θˆL ∈Θ0, JˆL ∈J, and for each pi ∈Π, ΛˆL(pi) and GˆL1(pi)147belong to  and G1, respectively, approaches one as L→ ∞, it follows thatsuppi∈Π∣∣∣∣L−1/2L∑h=1ξˆLh(pi)νh∣∣∣∣= OP(1).Further, {VˆL(pi)}L∈N is consistent for {V¯L(pi)}L∈N uniformly in pi , which is positiveuniformly in pi ∈Π and L ∈ N. Thus,suppi∈Π|TˆL(pi)|= suppi∈Π∣∣∣∣VˆL(pi)−1/2L−1/2L∑h=1ξˆLh(pi)νh∣∣∣∣= OP(1).We now turn to claim (ii). Our proof follows the strategy taken by ?. Let `∞(Π)denote the set of all bounded continuous functions on Π and BL1(`∞(Π)) the set ofall real-valued Lipschitz functions on `∞(Π) with a uniform norm and a Lipschitzconstant both bounded by one. Then it suffices to show thatsupφ∈BL1(`∞(Π))|Eν [φ(TˆL))]−E[φ(η)]| → 0 in probability-P,where η is a zero-mean Gaussian process with covariance kernel K concentratedon U(Π), and Eν denotes the expectation taken with respect to ν ≡ {νh}h∈N (see(van der Vaart and Wellner, 1996, pages 72–73)).For each real number b> 0, Π can be covered by a finite number of open b-balls,because Π is compact. Let Mβ : Π→ Π be a function that returns a closest pointamong the centers of such b-balls. Because η is concentrated on U(Π), it holds thatsuppi∈Π |η(Mb(pi))−η(pi)| → 0 as b ↓ 0 a.s. We thus havesupφ∈BL1(`∞(Π))∣∣E[φ(η(Mb(·)))]−E[φ(η)]∣∣≤ E[min{2, suppi∈Π|η(Mb(pi))−η(pi)|}]≤ E[min{2, sup{|η(pi2))−η(pi1)| : |pi1−pi2|< b, (pi1,pi2) ∈Π2}}]→ 0 as β ↓ 0, (C.21)where the convergence follows by the dominated convergence theorem.Let ε be an arbitrary positive real number. Then there exists a real numberb1 > 0 such that for each b ∈ (0,b1), the left-hand side of (C.21) is less than ε/3. 
It148follows that for each β ∈ (0,b1),supφ∈BL1(`∞(Π))|Eν [φ(TˆL)]−E[φ(η)]| ≤ supφ∈BL1(`∞(Π))∣∣Eν [φ(TˆL)]−Eν [φ(TˆL(Mb(·)))]∣∣+ supφ∈BL1(`∞(Π))∣∣Eν [φ(TˆL(Mb(·)))]−E[φ(η(Mb(·)))]∣∣+ supφ∈BL1(`∞(Π))∣∣E[φ(η(Mb(·)))]−E[φ(η)]∣∣≤ supφ∈BL1(`∞(Π))∣∣Eν [φ(TˆL)]−Eν [φ(TˆL(Mb(·)))]∣∣+ supφ∈BL1(`∞(Π))∣∣Eν [φ(TˆL(Mb(·)))]−E[φ(η(Mb(·)))]∣∣+ ε/3 L ∈ N.It further follows that for each b ∈ (0,b1),P[supφ∈BL1(`∞(Π))|Eν [φ(TˆL)]−E[φ(η)]|> ε]≤ P[supφ∈BL1(`∞(Π))∣∣Eν [φ(TˆL)]−Eν [φ(TˆL(Mb(·)))]∣∣> ε/3]+P[supφ∈BL1(`∞(Π))∣∣Eν [φ(TˆL(Mb(·)))]−E[φ(η(Mb(·)))]∣∣> ε/3], L ∈ N.(C.22)For the first term on the right-hand side of this inequality, we have thatsupφ∈BL1(`∞(Π))∣∣Eν [φ(TˆL)]−Eν [φ(TˆL(Mb(·)))]∣∣≤ supφ∈BL1(`∞(Π))Eν [|φ(TˆL)−φ(TˆL(Mb(·)))|]≤ Eν[supφ∈BL1(`∞(Π))|φ(TˆL)−φ(TˆL(Mb(·)))|]≤ Eν[min{2, suppi∈Π|TˆL(Mb(pi))− TˆL(pi)|}]≤ Eν[min{2, sup{|TˆL(pi2)− TˆL(pi1)| : |pi1−pi2|< b, (pi1,pi2) ∈Π2}}].Taking the expectation of both sides of this inequality and applying the law of149iterated expectations to the right-hand side yields thatE[supφ∈BL1(`∞(Π))∣∣Eν [φ(TˆL)]−Eν [φ(TˆL(Mb(·)))]∣∣]≤ E[min{2, sup{|TˆL(pi2)− TˆL(pi1)| : |pi1−pi2|< b, (pi1,pi2) ∈Π2}}]→ E[min{2, sup{|η(pi2)−η(pi1)| : |pi1−pi2|< b, (pi1,pi2) ∈Π2}}],where the convergence follows by Lemma C.12. By applying the Markov inequal-ity (Davidson, 1994, p. 132) to this result, we obtain thatlimsupP[supφ∈BL1(`∞(Π))∣∣Eν [φ(TˆL)]−Eν [φ(TˆL(Mb(·)))]∣∣> ε/3]≤ (ε/3)−1 E[min{2, sup{|η(pi2)−η(pi1)| : |pi1−pi2|< b, (pi1,pi2) ∈Π2}}].(C.23)Now, let κ be an arbitrary positive real number. As is the case in (C.21), theright-hand side of (C.23) can be made smaller than κ by setting b equal to a suitablevalue in (0,b1). Given such β , we have thatE[supφ∈BL1(`∞(Π))∣∣Eν [φ(TˆL(Mb(·)))]−E[φ(η(Mb(·)))]∣∣]≤ E[Eν[supφ∈BL1(`∞(Π))|φ(TˆL(Mb(·)))−φ(η(Mb(·)))|]]≤ E[min{2, suppi∈Π|TˆL(Mb(pi))−η(Mb(pi))|}]≤E[min{2, suppi∈Π|TˆL(pi)−η(pi)|}]→ 0,where the convergence follows by Lemma C.12. It follows by the Markov inequal-ity (Davidson, 1994, p. 132) thatlimsupP[supφ∈BL1(`∞(Π))∣∣Eν [φ(TˆL(Mb(·)))]−E[φ(η(Mb(·)))]∣∣> ε/3]= 0.150Applying the above results in (C.22) yields thatlimsupP[supφ∈BL1(`∞(Π))|Eν [φ(TˆL)]−E[φ(η)]|> ε]≤ κ.Because κ is an arbitrary positive real number, it follows thatlimsupP[supφ∈BL1(`∞(Π))|Eν [φ(TˆL)]−E[φ(η)]|> ε]→ 0.As this holds for every ε > 0, the second claim of the current lemma follows. We finally prove Theorem 3.5.Proof of THEOREM 3.5: For claim (i), it follows from Theorem 3.4(ii) by sub-sequence theorem (van der Vaart and Wellner, 1996, Lemma 1.9.2(ii)) and thecontinuous mapping theorem (van der Vaart and Wellner, 1996, Lemma 1.3.6) that{ϕˆL = ϕ(TˆL)}L∈N converges in distribution to ϕ0 = ϕ(η) in probability. We thushave that supx∈R |FˆL(x)−F(x)| → 0 in probability-P. The first result follows fromthis, because|pˆL− pL|= |FˆL(ϕL)−F(ϕL)| ≤ supx∈R|FˆL(x)−F(x)|.To prove the second result, note that if {pL}L∈N is convergent in distribution,{ pˆL}L∈N converges to the same limiting distribution as {pL} does, by the asymptoticequivalence lemma. It thus suffices to show that {pL} converges in distribution tothe uniform distribution on [0,1].Recall that {TL}L∈N converges in distribution to η if (??) holds (Theorem 3.4).Because F is assumed to be continuous and increasing on the support of ϕ0, itfollows by the continuous mapping theorem that {pL = F(ϕL) = F(ϕ(TL))}L∈Nconverges in distribution to p0 = F(ϕ0) = F(ϕ(η)). 
The random variable p0 is distributed uniformly on [0,1], since for each y ∈ (0,1),

P[p0 ≤ y] = P[F(ϕ0) ≤ y] = P[ϕ0 ≤ F−1(y)] = F(F−1(y)) = y,

where for each y ∈ (0,1), F−1(y) ≡ inf{x ∈ R : F(x) > y}. This completes the proof of claim (i).

We now turn to claim (ii). Pick real numbers ε > 0 and b > 0 arbitrarily. Then it suffices to show that for almost all L ∈ N, P[pˆL > ε] < b. To establish this inequality, let F¯L denote the unconditional distribution function of ϕˆL = ϕ(TˆL) for each L ∈ N. Because suppi∈Π |TˆL(pi)| = OP(1) (Lemma C.13(i)), and ϕ is monotonic (Assumption 3.15), we have that ϕˆL = OP(1). It follows that there exists a real number ∆ such that

1 − F¯L(∆) < bε/2, L ∈ N.

Note that whenever 1 − FˆL(∆) ≤ ε and ϕL ≥ ∆, it holds that 1 − FˆL(ϕL) ≤ ε. It follows that

P[pˆL > ε] = P[1 − FˆL(ϕL) > ε] ≤ P[1 − FˆL(∆) > ε] + P[ϕL < ∆].

It thus suffices to show that (A) P[1 − FˆL(∆) > ε] < b/2 for each L ∈ N and that (B) P[ϕL < ∆] < b/2 for almost all L ∈ N.

To show (A), we use the fact that for each L ∈ N, E[FˆL(∆)] = F¯L(∆). Using this fact, we obtain that

E[1 − FˆL(∆)] = 1 − F¯L(∆) < bε/2, L ∈ N.

Given that 1 − FˆL(∆) is a nonnegative random variable, condition (A) follows from this inequality by the Markov inequality (Davidson, 1994, p. 132).

For condition (B), recall that {αˆL(pi)}L∈N and {VˆL(pi)}L∈N are respectively consistent for {α∗L(pi)}L∈N and {V¯L(pi)}L∈N uniformly in pi ∈ Π (Theorem 3.1 and Theorem 3.4(i)). Also, we know that {V¯L(pi) : pi ∈ Π, L ∈ N} is uniformly bounded and uniformly positive (Assumption 3.13(b) and Theorem 3.4(i)). Thus, we have that

suppi∈Π |L−1/2 TL(pi) − V¯L(pi)−1/2 α∗L(pi)| → 0 in probability-P. (C.24)

By hypothesis, we also have that liminf infpi∈Π¯ α∗L(pi) > 0, so that

c ≡ liminf infpi∈Π¯ V¯L(pi)−1/2 α∗L(pi) > 0.

It follows that

P[infpi∈Π¯ L−1/2 TL(pi) < c/2] → 0.

Note that when infpi∈Π¯ L−1/2 TL(pi) ≥ c/2, it holds that

ϕL ≥ ϕ(1{pi ∈ Π¯} · L1/2 c/2),

where 1{C} is the indicator function for the condition C. Because the right-hand side of the above inequality, which diverges to infinity by Assumption 3.15, is no smaller than ∆ for almost all L ∈ N, we have that for almost all L ∈ N,

P[ϕL < ∆] ≤ P[ϕL < ϕ(1{pi ∈ Π¯} · L1/2 c/2)] ≤ P[infpi∈Π¯ L−1/2 TL(pi) < c/2].

The right-hand side of this inequality converges to zero by (C.24). Condition (B) therefore follows, and so does claim (ii). □
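As a computational companion to Lemmas C.12–C.13 and Theorem 3.5, the following sketch indicates how the simulated p-value might be obtained in practice. Only the multiplier construction TˆL(pi) = VˆL(pi)−1/2 L−1/2 ∑_{h=1}^{L} ξˆLh(pi) νh with independent standard normal multipliers νh is taken from the text; everything else here is illustrative and not prescribed by the chapter: the names xi_hat, V_hat, and T_obs, the use of a finite grid of pi values, the choice of the supremum over the grid as the functional ϕ, and the number of multiplier draws B.

```python
import numpy as np

def multiplier_pvalue(xi_hat, V_hat, T_obs, B=999, rng=None):
    """Simulated p-value for a sup-type statistic via Gaussian multipliers.

    xi_hat : (L, P) array of xi-hat_{Lh}(pi_j) over strata h and grid points pi_j.
    V_hat  : (P,) array of variance estimates V-hat_L(pi_j) on the same grid.
    T_obs  : (P,) array of the observed studentized statistics T_L(pi_j).
    B      : number of multiplier draws used to approximate the conditional law.
    """
    rng = np.random.default_rng() if rng is None else rng
    L, P = xi_hat.shape

    # Observed value of the functional; here phi is taken to be the supremum over the grid.
    phi_obs = np.max(T_obs)

    phi_sim = np.empty(B)
    for b in range(B):
        nu = rng.standard_normal(L)  # one N(0,1) multiplier per stratum
        T_hat = (xi_hat * nu[:, None]).sum(axis=0) / (np.sqrt(L) * np.sqrt(V_hat))
        phi_sim[b] = np.max(T_hat)   # phi applied to the multiplier process

    # Fraction of simulated values at least as large as the observed one.
    return (1 + np.sum(phi_sim >= phi_obs)) / (B + 1)
```

The (1 + count)/(B + 1) adjustment is a common Monte Carlo convention that avoids an exactly zero p-value with finitely many draws; it is not something required by the theory above.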
