MODEL SELECTION CRITERIA IN ECONOMIC CONTEXTS

By Kevin J. Fox
B.Comm., University of Canterbury, 1989
M.Comm., University of Canterbury, 1990
M.A., University of British Columbia, 1992

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF ECONOMICS

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
February 1995
© Kevin J. Fox, 1995

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Economics
The University of British Columbia
Vancouver, Canada

Abstract

Model selection criteria are used in many contexts in economics. The issue of determining an appropriate criterion, or alternative method, for model selection is a topic of much interest for applied econometricians. These criteria are used when formal testing methods are difficult due to a large number of models being compared, or when a sequential modelling strategy is being used. In econometrics, we are familiar with the use of model selection criteria for determining the order of an ARMA process and the number of dependent variable lags in Augmented Dickey-Fuller equations. The latter application is examined as an interesting example of the sensitivity of results to the choice of criterion.
An application of model selection criteria to spline fitting is also considered, introducing a new, flexible, modelling strategy for technical progress in a production economy and for returns to scale in a resource economics context. In this latter context we have a system of estimating equations. Two of the criteria which are compared are the Cross-Validation score (CV) and the Generalized Cross-Validation Criterion (GCV), which until now have only had single equation context expressions. Multiple equation expressions for these criteria are introduced, and are used in the two applications. Comparison of the models selected by the different criteria in each context reveals that results can differ greatly with the choice of criterion. In the unit root test application, the choice of criterion influences the number of times the false hypothesis is not rejected. In the production economy and resource applications, measures of technical progress and returns to scale differ greatly, as do own and cross price elasticities, depending on which criterion is used for selecting the appropriate spline structure. An overview of the literature on model selection is given, with new expressions and interpretations for some model selection criteria, and historical notes.
Table of Contents

Abstract
List of Figures
List of Tables
Acknowledgement
Dedication

1 Objectives and Overview
  1.1 Objectives
  1.2 Overview

2 White Noise Experiments on Augmented Dickey-Fuller Tests
  2.1 Introduction
  2.2 Augmented Dickey-Fuller Tests
    2.2.1 Unit Root Hypotheses
  2.3 Selection of Lag Length
    2.3.1 Autocorrelation Function—AUTO
    2.3.2 Akaike Information Criterion—AIC
    2.3.3 Schwarz Criterion—SC
    2.3.4 Generalized Cross-Validation Criterion—GCV
    2.3.5 Cross-Validation Score—CV
    2.3.6 Testing Criterion—TC
    2.3.7 Relative Marginal Penalties
  2.4 Monte Carlo Experiments
    2.4.1 A Simple Experiment
    2.4.2 AR(1) Errors
    2.4.3 Results: Simple Experiment
    2.4.4 Results: AR(1) Errors
    2.4.5 AUTO and Phillips-Perron Tests
  2.5 Directions for Further Research
  2.6 Overview

3 Non-Parametric Estimation of Technical Progress
  3.1 Introduction
  3.2 The NQ Variable Profit Function
  3.3 The NQLS Model
  3.4 Model Selection
    3.4.1 Selection Criteria: Single Equation Context
    3.4.2 Selection Criteria: Multiple Equation Context
    3.4.3 A Single Equation Monte Carlo Experiment
  3.5 Index Number Measures of Technical Progress
  3.6 Comments
    3.6.1 Why Splines?
    3.6.2 Why Linear Splines?
    3.6.3 Historical Notes
  3.7 Empirical Results
  3.8 Overview

4 Non-Parametric Estimation of Returns to Scale
  4.1 Introduction
  4.2 The NQLSK Model
  4.3 Model Selection
    4.3.1 Selection Criteria
  4.4 Fishery and Data Description
  4.5 Empirical Results
  4.6 Overview

5 Model Selection Criteria: A Reference Source
  5.1 Introduction
  5.2 Model Selection Criteria
    5.2.1 Some Historical Notes
    5.2.2 Theil's Adjusted R²
    5.2.3 Mallows' Criterion
    5.2.4 A Much Suggested Criterion: Jp, FPE, PC
    5.2.5 Hocking's Criterion
    5.2.6 Hannan and Quinn Criterion
  5.3 Comparisons of Criteria
    5.3.1 Penalty Term Approximations
    5.3.2 Log Forms and Marginal Penalties
    5.3.3 Interpretation of Criteria as χ²-Statistics
    5.3.4 Representations as R²-Type Values
  5.4 Data Re-Sampling Methods
  5.5 Selection Algorithms
  5.6 References for Further Reading
  5.7 Other Considerations in Model Choice
  5.8 Overview

6 Summary and Conclusions
  6.1 Summary of Contributions
  6.2 Directions for Further Research

Bibliography

Appendices
  A Cross-Validation, GCV, etc.
  B Determining M(T, θ)

List of Figures

3.1 Monte Carlo experiment: Average absolute error from fitting structure to white noise, y ~ N(0,1), T = 20, using the AIC, SC and PGCV.
3.2 Monte Carlo experiment: Average absolute error from fitting structure to white noise, y ~ N(0,1), T = 20, using the GCV, CV and PGCV.
3.3 Monte Carlo experiment: Average absolute error from fitting structure to white noise, y ~ N(0,.1), T = 20, using the AIC, SC and PGCV.
3.4 Monte Carlo experiment: Average absolute error from fitting structure to white noise, y ~ N(0,.1), T = 20, using the GCV, CV and PGCV.
3.5 A comparison of productivity measures from three models: NQLScv, NQLSFGCV and NQ.
3.6 A comparison of productivity measures from three models: NQLSAIC, NQLSPFGCV and NQ.
3.7 A comparison of productivity measures: NQ vs. Fisher's Ideal Quantity Index.
3.8 A comparison of productivity measures: NQLSPFGCV vs. Fisher's Ideal Quantity Index.
3.9 A comparison of productivity measures: NQLScv vs. Fisher's Ideal Quantity Index.
3.10 A comparison of productivity measures: NQLSFGCV vs. Fisher's Ideal Quantity Index.
3.11 A comparison of productivity measures: NQLSAIC vs. Fisher's Ideal Quantity Index.
4.1 A comparison of returns to scale measures: NQLSKcv, NQLSKAIC and NQK.

List of Tables

2.1 Marginal penalties of the AIC, SC and GCV for each lag order (p) and sample size considered: Constant, no trend model (n = 2 + p), effective sample size = T − p − 1.
2.2 Number of times lag order p selected by each of the methods for each sample size considered: Constant, no trend model, y ~ N(0,1), correct lag length is zero.
2.3 Number of times lag order p selected by each of the methods for each sample size considered: Constant, trend model, y ~ N(0,1), correct lag length is zero.
2.4 Number of times out of 100 that each ADF null hypothesis is not rejected. Sample size is 30, y ~ N(0,1). Significance levels are 1%, 5% and 10%.
2.5 Number of times out of 100 that each ADF null hypothesis is not rejected. Sample size is 60, y ~ N(0,1). Significance levels are 1%, 5% and 10%.
2.6 Number of times out of 100 that each ADF null hypothesis is not rejected. Sample size is 120, y ~ N(0,1). Significance levels are 1%, 5% and 10%.
2.7 Number of times lag order p selected by each of the methods for each sample size considered: Constant, no trend model, y_t = .85y_{t-1} − .15y_{t-2} + η_t, correct lag length is one.
2.8 Number of times lag order p selected by each of the methods for each sample size considered: Constant, trend model, y_t = .85y_{t-1} − .15y_{t-2} + η_t, correct lag length is one.
2.9 Number of times out of 100 that each ADF null hypothesis is not rejected. Sample size is 30, y_t = .85y_{t-1} − .15y_{t-2} + η_t. Significance levels are 1%, 5% and 10%.
2.10 Number of times out of 100 that each ADF null hypothesis is not rejected. Sample size is 60, y_t = .85y_{t-1} − .15y_{t-2} + η_t. Significance levels are 1%, 5% and 10%.
2.11 Number of times out of 100 that each ADF null hypothesis is not rejected. Sample size is 120, y_t = .85y_{t-1} − .15y_{t-2} + η_t. Significance levels are 1%, 5% and 10%.
2.12 Number of times out of 100 that each ADF null hypothesis is not rejected. y_t = y_{t-1} − .25y_{t-2} + η_t, a nonstationary process. Correct lag order is p = 1. Significance levels are 1%, 5% and 10%.
2.13 Number of times out of 100 that each ADF null hypothesis is not rejected. Phillips-Perron test, y_t = .85y_{t-1} − .15y_{t-2} + η_t, AUTO used to determine the truncation lag parameter. Significance levels are 1%, 5% and 10%.
3.1 Selection Criteria
3.2 NQ Own Price Elasticities
3.3 NQLSPFGCV Own Price Elasticities
3.4 NQLScv Own Price Elasticities
3.5 NQLSFGCV Own Price Elasticities
3.6 NQLSAIC Own Price Elasticities
3.7 NQ Cross Price Elasticities, 1963
3.8 NQLSPFGCV Cross Price Elasticities, 1963
3.9 NQLScv Cross Price Elasticities, 1963
3.10 NQLSFGCV Cross Price Elasticities, 1963
3.11 NQLSAIC Cross Price Elasticities, 1963
3.12 Selected M(T, θ) Values
4.1 Marginal Penalties
4.2 Selection Criteria
4.3 NQK Own Price Elasticities
4.4 NQLSKcv Own Price Elasticities
4.5 NQLSKAIC Own Price Elasticities
4.6 NQK Cross Price Elasticities for Observation 29
4.7 NQKcv Cross Price Elasticities for Observation 29
4.8 NQLSKAIC Cross Price Elasticities for Observation 29
4.9 Summary Statistics: Price Data on the British Columbia Sablefish Fishery
4.10 Summary Statistics: Quantity Data on the British Columbia Sablefish Fishery

Acknowledgement

Many thanks to my thesis committee of Erwin Diewert (chairman), John Cragg and Ken White.
I particularly wish to express my gratitude to Erwin Diewert for his encouragement of this struggling student. Thanks also to Bob Allen, Ashok Kotwal, Chuck Blackorby, Bill Schworm and Terry Wales, who all played a role in making the writing of this thesis possible. Dorian Owen, formerly of the University of Canterbury, was instrumental in stimulating my interest in applied economics, for which I am truly grateful. The computing support provided by Ken White and Diana Whistler ensured that I had some results to talk about, and is greatly appreciated. The support of fellow students and friends over the years is acknowledged with appreciation. Special thanks go to Kevin Albertson, David Allen, Caroline Betts, Steve Berger, Quentin Grafton, John and Masayo Hickey, Neal Hooker, the Ishizuka family, Patrick Kenny, Ross McKitrick, Sachiko Okawara, the Shishido family, the Steel family, Wayne Thomas, and the Funky Armidillo Saturday Brunch Club. Finally, my greatest debt is to Masami, who tolerated the life of a "university widow" with humour and understanding, despite protestations to the contrary.

Dedication

Dedicated to the memory of Douglas Takeshi Fox.

Chapter 1

Objectives and Overview

There is no single criterion which will play the role of a panacea in model selection problems.
Hamparsum Bozdogan (1987; 368).

1.1 Objectives

There are many instances when it is useful to use model selection criteria in economic contexts. Any situation where model comparison through formal hypothesis testing is difficult, due to having a large number of candidate models or due to a sequential model specification procedure, is a potential case for the use of model selection criteria. However, there is controversy over which model selection criterion we should use. This thesis looks at the use of model selection criteria in some applications and notes how results change depending on the choice of criterion.
Multiple equation context forms for the Cross-Validation score and Generalized Cross-Validation Criterion are introduced and used in the system of demand equations context. There are many available criteria, each with an asymptotic justification of some kind. The problem is that we have fixed, finite sample sizes, so arguing over which criterion has better properties as the sample size increases to infinity may not be too informative. Investigating the relative properties of the selection criteria in some finite sample applications is the approach taken here. It is found that results, and hence conclusions, can differ greatly depending on which criterion is deemed appropriate for identifying the underlying "true" model. The specific applications chosen are of direct interest, as well as being testing grounds for the relative properties of the considered selection criteria. The first application is to unit root testing. Augmented Dickey-Fuller tests add lags of the dependent variable as regressors in equations which are used in testing the null hypothesis of a unit root, or non-stationarity, in the stochastic process of the dependent variable. The lags are added to remove the effects of serial correlation in the error terms of the regression equations. As the appropriate number of lags is unknown in practice, model selection criteria (such as the Akaike Information Criterion) are often used to guide the choice of regression equation specification. One would expect that the number of lag terms included in the regression equations has a marked influence on the outcome of the unit root tests, which are essentially tests of restrictions placed on the (non-lag) parameters in the regression equations. The influence of the choice of different lag lengths by different model selection criteria on the outcome of the unit root tests was investigated by the construction of a simple Monte Carlo experiment.
The objective here is to determine which criterion results in the rejection of a false null hypothesis most often in repeated trials. It was found that in small samples the use of all the criteria resulted in many non-rejections of the false null hypothesis. In larger samples, the performance of the tests improved, but the use of some methods of determining lag length still appears inappropriate. The data used were samples of white noise drawn from the N(0,1) and N(0,.1) distributions. The second application is to the context of modelling technical progress in a production economy. The objective here is to find a method which more adequately takes into account the effects of productivity increases on an economy. Demand systems, with each equation representing the demand or supply of aggregated commodities, are usually estimated with the addition of a linear time trend to take account of the effects of technical progress over time on the economy. The parameters of the estimated system can be used to derive own and cross price elasticities and a measure of technical progress. An alternative to using a linear time trend term is to spline the time trend; i.e., to make it piecewise linear. The problem then is to determine how many spline segments, or break points, should be allowed and where they should be placed. A method of solving this problem is presented for the first time in this context. This method uses a model selection criterion to aid in the selection of the appropriate spline structure. It was found that depending on which criterion is used, the spline structure can differ greatly, as can the resulting measure of technical progress and elasticity estimates. The data used were U.S. aggregate production data for the years 1947-1981. The third application considers a similar situation to the second application, but with a twist.
In the situation of cross-sectional data we may be interested in the flexible estimation of returns to scale. This can be achieved through splining in the capital variable, rather than restricting it to enter in a linear fashion. The objective here is then to more adequately model the effects of returns to scale, both to get accurate estimates of returns to scale and to get reliable elasticity estimates. The same splining strategy as used in the second application was re-applied in this context, and again it was found that the choice of selection criterion has a large impact on the results, and hence conclusions. The data used were two pooled cross-sections (1988 and 1990) on the British Columbia sablefish fishery. Finally, a review of the literature on model selection criteria is presented with new expressions and interpretations for some criteria, and historical notes. The objective was to create a resource for applied researchers. It is hoped that the historical notes are of general interest, and that the resource is helpful in clarifying the relationships between criteria.

1.2 Overview

This section presents the organization of the thesis. Chapter 2 conducts Augmented Dickey-Fuller tests for unit roots on samples of white noise, and considers which model selection criterion is most appropriate in the sense of rejecting the false null hypothesis most often. Chapter 3 looks at a splining method allowing the flexible modelling of technical progress in a production economy. Chapter 4 is similar in nature to Chapter 3, except the splining is performed in the capital variable, allowing the flexible modelling of returns to scale in a resource application. Chapter 5 presents a reference source for applied users of model selection criteria. Chapter 6 concludes with a summary of the main findings of the thesis, and outlines some areas for further research. Some technical details are relegated to the appendices.
In particular, details on the generalizations of two model selection criteria to the multiple equation context can be found in Appendix A, along with the proofs of some theorems. As there is some overlap in the methods used in the different chapters, there is occasionally reiteration of previously presented material for ease of the reader's reference. Therefore, although each chapter is an integral part of the whole, it is possible to read one chapter independently of the others without too much loss in overall perspective.

Chapter 2

White Noise Experiments on Augmented Dickey-Fuller Tests

It is interesting that Humans try to find Meaningful Patterns in things that are essentially random.
Commander Data, Star Trek (1992).

2.1 Introduction

Various methods have been suggested for choosing the appropriate lag length for Augmented Dickey-Fuller (ADF) tests (Fuller, 1976; Dickey and Fuller, 1979). This chapter presents some "white noise" and related experiments on the effectiveness of several model selection approaches to this problem. A commonly used method is found to do very poorly, in that the null hypothesis of a unit root is not rejected when it should be rejected. Performance of these tests in small samples is found to be particularly poor. This casts doubt on the conclusions of the many papers which use these testing procedures. Specifically, a stationary process is generated by sampling some white noise, then the various tests associated with the unit root hypothesis are performed. In another experiment, a stationary AR(2) process is generated and the same unit root tests are performed. The lag length in these ADF equations is selected by using (i) the
autocorrelation function of the first differenced series (AUTO), (ii) the Akaike Information Criterion (AIC), (iii) the Schwarz Criterion (SC), (iv) the Generalized Cross-Validation Criterion (GCV), (v) the Cross-Validation score (CV) and (vi) a Testing Criterion (TC) approach based on sequential Lagrange Multiplier tests. Because one replication contains little information, one hundred replications of each experiment were performed. Sample sizes considered were 30, 60 and 120. The distribution from which the noise (y) was drawn was N(0,1). As we are testing whether or not white noise is a stationary process, the null hypothesis of a unit root should be rejected. Several authors have described the difficulties of testing for unit roots in economic time series data (e.g. Blough, 1988; Schwert, 1989). The approach taken here is straightforward. We ask how many times out of one hundred the null hypothesis is not rejected, when it should be. Recalling that the Type II error of a test is the probability that it does not reject the null when the null is false, we obtain an approximation to the Type II error of each testing procedure through our simulations. In the "white noise" simulations the appropriate choice of lag length in the ADF equations is zero. As some of the criteria considered may have a tendency to choose models which are too parsimonious, the second experiment design generated data as a stationary second order autoregressive process with standard normal errors. The correct lag length for the ADF equations is one in this case. Again, the null hypothesis of a unit root should be rejected. We find that there can be great variability in the performance of the ADF test depending on how the lag length is determined. Section 2.2 describes the various permutations of ADF tests. Section 2.3 presents the strategies for determining the appropriate lag length for the ADF regression.
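The two experiment designs just described can be sketched in a few lines of code. This is a minimal illustration of the data generating processes, not the code used in the thesis; the function names, seed, and burn-in length are my own choices.

```python
import numpy as np

def white_noise(T, sigma=1.0, seed=0):
    """First design: y_t is pure white noise drawn from N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    return sigma * rng.standard_normal(T)

def ar2(T, seed=0, burn=200):
    """Second design: y_t = .85 y_{t-1} - .15 y_{t-2} + eta_t, eta_t ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal(T + burn)
    y = np.zeros(T + burn)
    for t in range(2, T + burn):
        y[t] = 0.85 * y[t - 1] - 0.15 * y[t - 2] + eta[t]
    return y[burn:]  # discard the burn-in so start-up values do not matter

# The AR(2) design is stationary: the characteristic roots of
# z^2 - .85 z + .15 = 0 are 0.6 and 0.25, both inside the unit circle.
roots = np.roots([1.0, -0.85, 0.15])
```

Verifying that both roots lie inside the unit circle confirms that the second design is stationary, so the unit root null is indeed false in both experiments.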
Section 2.4 presents the results of a range of simulations. Results on the Phillips-Perron tests (Phillips, 1987; Perron, 1988) are also reported. Some directions for further research are described in Section 2.5. Section 2.6 concludes.

2.2 Augmented Dickey-Fuller Tests

Finding a unit root in time series data implies that the series is nonstationary. There has been much interest in determining whether or not economic time series are stationary (e.g. Nelson and Plosser, 1982), with decisions for modelling and theoretic conclusions being drawn from the results of the tests employed. The Dickey-Fuller tests have the null hypothesis of a unit root (i.e. nonstationarity), with or without drift, and alternative hypotheses of simple stationarity, or stationarity around a linear trend. The Augmented Dickey-Fuller (ADF) regression equations upon which the ADF tests are based include the Dickey-Fuller tests as a special case (p = 0), and can be written as follows:

Δy_t = α_0 + α_1 y_{t-1} + Σ_{j=1}^{p} γ_j Δy_{t-j} + e_t   (2.1)

Δy_t = α_0 + α_1 y_{t-1} + α_2 t + Σ_{j=1}^{p} γ_j Δy_{t-j} + e_t   (2.2)

where Δy_t = y_t − y_{t-1}, t is a linear trend, and e_t is an error term. The ADF unit root tests are not valid if the errors from these regressions are serially correlated. The inclusion of lags of the dependent variable is an attempt to "whiten" the error terms: serially dependent error terms are replaced by serially independent error terms through the addition of an appropriate number of dependent variable lags as regressors (Dickey and Fuller, 1979; Said and Dickey, 1984; Phillips, 1987; Perron, 1988). To see this, consider the following simplified version of the ADF regression equations (without loss of generality), where there is neither constant nor trend, but errors
which are first order autocorrelated:

Δy_t = α_1 y_{t-1} + u_t   (2.3)
u_t = ρ u_{t-1} + η_t   (2.4)

We proceed by writing α_1 as α − 1, substituting equation (2.4) into equation (2.3) and re-arranging as follows:

Δy_t = (α − 1) y_{t-1} + η_t / (1 − ρL)   (2.5)
     = (1 − ρL)(α − 1) y_{t-1} + ρ Δy_{t-1} + η_t
     = (α − 1) y_{t-1} − ρ(α − 1) y_{t-2} + ρ y_{t-1} − ρ y_{t-2} + η_t
     = ((α − 1) + ρ) y_{t-1} − ρα y_{t-2} + η_t   (2.6)

Adding and subtracting ρα y_{t-1} from the right hand side yields:

Δy_t = (α − 1 + ρ − ρα) y_{t-1} + ρα Δy_{t-1} + η_t
     = (α − 1)(1 − ρ) y_{t-1} + ρα Δy_{t-1} + η_t.   (2.7)

We see that adding a single lagged term of the dependent variable as a regressor whitens the formerly autocorrelated error term. It follows trivially that if u_t was an AR(p) process, the addition of p lagged terms of the dependent variable as regressors would whiten the error term. Note that if ρ = 0 we no longer have AR(1) errors and so need no lag of Δy_t in the ADF regressions. Note also that if α = 0, then

y_t = ρ y_{t-1} + η_t   (2.8)

and again there is no role for a lag of Δy_t. See Said and Dickey (1984) for the case where the error term follows an MA or ARMA process.

2.2.1 Unit Root Hypotheses

The performance of the following hypotheses is examined under the alternative procedures for selecting the lag length p in equations (2.1) and (2.2):

Equation (2.1): No Trend

α_1 = 0   (2.9)
α_0 = α_1 = 0   (2.10)

The first null hypothesis, equation (2.9), is tested by what is commonly called a τ-test. It is a t-type test, but the distribution of the statistic is not Student's t even asymptotically. This is because when α_1 = 0 we have a nonstationary series, implying that the usual asymptotic theory cannot be used. Acceptance of the null hypothesis implies the series has a unit root, with drift (i.e. α_0 ≠ 0). The second null hypothesis, equation (2.10), implies an F-type unit root test with zero drift.
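The whitening algebra of equations (2.3) through (2.7) can be checked numerically. The sketch below is mine, not from the thesis; the parameter values are arbitrary choices for illustration. It generates y_t from equations (2.3) and (2.4) and confirms that the rewritten form (2.7), with a single lag of Δy_t added as a regressor and a serially independent error η_t, reproduces Δy_t exactly.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration.
alpha, rho, T = 0.7, 0.5, 50
rng = np.random.default_rng(1)
eta = rng.standard_normal(T)

# Generate y_t = alpha*y_{t-1} + u_t with AR(1) errors u_t = rho*u_{t-1} + eta_t,
# i.e. equations (2.3) and (2.4) with alpha_1 = alpha - 1.
y = np.zeros(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = rho * u[t - 1] + eta[t]
    y[t] = alpha * y[t - 1] + u[t]

# Equation (2.7): Delta y_t = (alpha-1)(1-rho) y_{t-1} + rho*alpha*Delta y_{t-1} + eta_t
dy = np.diff(y)
lhs = dy[1:]                                  # Delta y_t for t = 2, ..., T-1
rhs = ((alpha - 1) * (1 - rho) * y[1:-1]      # (alpha-1)(1-rho) * y_{t-1}
       + rho * alpha * dy[:-1]                # rho*alpha * Delta y_{t-1}
       + eta[2:])                             # the whitened error eta_t
assert np.allclose(lhs, rhs)                  # the identity holds exactly
```

The identity is exact, not approximate: the error in the augmented regression is η_t itself, which is white by construction.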
Equation (2.2): Trend Included

α_1 = 0   (2.11)
α_0 = α_1 = α_2 = 0   (2.12)
α_1 = α_2 = 0   (2.13)

Equation (2.11) is tested by a τ-test, equation (2.12) implies an F-type test of a unit root without drift, and equation (2.13) implies another F-type test of a unit root with drift.1 The appropriate critical values for performing these tests were taken to be the asymptotic values of Davidson and MacKinnon (1993; 708). These values differ only slightly from those of Fuller (1976). For more detailed summaries of these tests, the reader may consult Dickey, Bell and Miller (1986), Maddala (1992; Chapter 14) and Davidson and MacKinnon (1993; Chapter 20).

1 There is another test which has not been discussed. That is the z-test, which has a test statistic of T·α̂_1 when p = 0, where T is the sample size. The reason for its exclusion is that the test statistic depends on p when p ≠ 0. See Davidson and MacKinnon (1993; 711).

2.3 Selection of Lag Length

This section briefly describes the methods for choosing the lag length p in the ADF regressions, equation (2.1) and equation (2.2).

2.3.1 Autocorrelation Function—AUTO

The autocorrelation and partial autocorrelation functions of the first differenced series, Δy_t, are considered. Using an approximate 95% confidence interval, the highest significant lag length from either of these functions is selected. The general econometrics program SHAZAM (1993; White, 1978) employs this method when no lag length is specified by the user, taking the square root of the sample size to determine the maximum permissible lag length.2 Note that this method does not consider the ADF regression equations in determining the lag length, but only the series Δy_t. As the purpose of including the lags is to try to whiten the error terms of the ADF regressions, rather than to find an appropriate AR representation for Δy_t, we may expect this method to select an inappropriate lag length. Indeed, we find that this method does poorly in the ADF context, but performs well in our brief examination of the Phillips-Perron tests, where it has a better justification for its use (see Section 2.4.5).

2 SHAZAM User's Reference Manual Version 7.0, 1993; Chapter 13.
Indeed, we find that this method does poorly in the ADF context, but performs well in our brief examination of the Phillips-Perron tests, where it has a better justification for its use (see Section 2.4.5). 2 SHAZAM User's Reference Manual Version 7.0, 1993; Chapter 13. Chapter 2. White Noise Experiments on Augmented Dickey-Fuller Tests 2.3.2 11 Akaike Information Criterion—AIC The AIC (Akaike, 1973; see also Amemiya, 1980) has the following form: AIC = -logL($) + n (2.14) where logL0) is the log likelihood function and n is the dimension of the parameter vector, p. Hence, it is an adjusted R2- type criterion, where the log likelihood is penalized with a marginal penalty of one for each additional regressor. This is currently the most popular "information criterion" in the economics literature. Other selection criteria, with different marginal penalties, are discussed below. They are all negative objective functions, the purpose being to find the model which gives the smallest value of the function.3 This approach to model selection is used in contexts where model specification is sequential in nature and/or a large set of candidate models makes the use of formal testing difficult. Determining the lag length for the ADF equations is such a context. 2.3.3 Schwarz Criterion—SC The SC (Schwarz, 1978) has the following form:4 SC = -logL0) + nlog(T)/2 (2.15) where logL{(3) is the log likelihood function, n is the dimension of the parameter vector, /?, and T is the sample size. The marginal penalty for the addition of another regressor is larger than that of the AIC for T > 8, (see Section 2.3.7). Hence, it selects a model more parsimoniously parameterized than that selected by the AIC. 3 Notice that we have not said the smallest absolute value of the function. If the criterion returns a series of negative values, the model with the largest negative value is selected. 4 The Schwarz Criterion is sometimes also called the Bayes Information Criterion, or BIC. 
As other criteria have been given the same label, BIC, we try to avoid confusion by using the notation SC.

2.3.4 Generalized Cross-Validation Criterion—GCV

The GCV (Craven and Wahba, 1979) can be written in the following logarithmic form (see Appendix A):5

GCV = −log L(β̂) − T log(1 − n/T)   (2.16)

where log L(β̂) is the log likelihood function, n is the dimension of the parameter vector, β, and T is the sample size. The marginal penalty for an additional regressor is always greater than for the AIC (see Section 2.3.7), and may be greater or smaller than for the SC, depending on the relative sizes of T and n. An interesting point to note is that, for fixed T, the AIC and SC penalize each additional regressor with the same constant, regardless of whether the regressor is using up the first or last degree of freedom. This is not so for the GCV, which imposes a much higher penalty on additional regressors when there are few remaining degrees of freedom than when there are many.

2.3.5 Cross-Validation Score—CV

The single equation regression context cross-validation score is written as follows:6

CV = (1/T) Σ_{i=1}^{T} [y_i − f̂_{−i}(x_i)]²   (2.17)

where f̂_{−i}(x_i) denotes the estimate for the ith observation calculated from the parameters of the model fitted to all the data except observation i. By dropping each observation in turn, estimating the function and its corresponding prediction residual over the dropped observation, we get T cross-validation residuals. Squaring, summing and averaging these residuals yields the above CV. This is an approximation to the

5 The GCV is actually not a generalization of cross-validation in any sense, but rather an approximation to the cross-validation score, which is also considered.
6 Cross-validation as a statistical tool was suggested independently by Schmidt (1971), Allen (1971), Stone (1974) and Geisser (1974).
unobservable predictive-squared error. The CV is burdensome to calculate in the fashion just described. Fortunately, through the use of the Sherman-Morrison matrix downdating formula,7 the CV can be written in the following more convenient form (see Appendix A):

CV = (1/T) Σ_{i=1}^{T} [y_i − f̂(x_i)]² / [1 − c_i]²   (2.18)

where we now have the full sample estimator, f̂(x_i), and c_i is the ith diagonal element of the projection matrix X(X′X)⁻¹X′. The use of the CV in the current context is perhaps a little dubious due to the presence of lagged values in the ADF equations; it is hard to interpret "dropping an observation" in this case. The CV was, however, used for purposes of comparison. The GCV is derived from the CV by assuming that the c_i are (nearly) constant.

2.3.6 Testing Criterion—TC

This approach uses a sequential Lagrange Multiplier (LM) test strategy of the type suggested by Pötscher (1983) in the context of order estimation in ARMA models. Pötscher showed that, under certain conditions, his approach has the property of strong consistency.8 The lag length p in the ADF regressions of equations (2.1) and (2.2) is determined as the solution to the problem of selecting the smallest p for which a χ²-test would not reject the hypothesis that the lag length was that under consideration rather than a larger value, at some significance level α(T). For our purposes, α is taken to be 0.05, i.e. the 5% significance level.9 We can write the

7 Schmidt (1971) provides an alternative derivation. See Appendix A, and Chapter 3 for more on the CV.
8 See also Cragg and Donald (1993).
9 That the significance level is dependent on the sample size is important for the asymptotic properties of the procedure, but in our finite sample context we prefer to take the approach that an applied researcher is likely to follow in practice: given a fixed sample of some arbitrary size, the level of significance is generally taken to be the 5% level.
⁹ That the significance level is dependent on the sample size is important for the asymptotic properties of the procedure, but in our finite sample context we prefer to take the approach that an applied researcher is likely to follow in practice—given a fixed sample of some arbitrary size, the level of significance is generally taken to be the 5% level.

choice decision as follows:

p* = min{p ∈ 0, …, P : LM(p) < χ²(.05)₍P₋p₎},   (2.19)

where P is the maximum lag length considered, LM(p) is the Lagrange Multiplier test statistic for the model with lag length p, and χ²(.05)₍P₋p₎ is the 5% level χ² critical value with P − p degrees of freedom.

2.3.7 Relative Marginal Penalties

Before proceeding with a presentation of the results, a few additional comments are made on the penalty terms of the AIC, SC and GCV. These criteria trade off the increase in the log likelihood due to an additional parameter with the increase in the penalty term. A couple of observations on the relative properties of the penalty terms are now made.

Observation 1 Only the GCV has a marginal penalty that is increasing as the number of parameters approaches T, for fixed T. Both the AIC and SC penalize every marginal parameter with the same constant, no matter if that parameter is using up the first degree of freedom or the last. This is seen easily by differentiating the respective penalty terms in the above logarithmic expressions for the criteria with respect to n, the number of parameters:

AIC_MP = dn/dn = 1   (2.20)

SC_MP = d(n log T / 2)/dn = log T / 2   (2.21)

GCV_MP = d(−T log(1 − n/T))/dn = T/(T − n)   (2.22)

where the subscript denotes the marginal penalty for the respective criteria.
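These marginal penalties are simple enough to tabulate directly. The following Python/NumPy sketch (illustrative, not part of the thesis computations, which used SHAZAM) reproduces entries of Table 2.1 using the effective sample sizes described in Section 2.4 (e.g. 24 for T = 30 and 109 for T = 120):

```python
import numpy as np

def marginal_penalties(T_eff, p):
    """Marginal penalties (2.20)-(2.22); the no-trend model has n = 2 + p
    parameters. Returns (AIC_MP, SC_MP, GCV_MP)."""
    n = 2 + p
    return 1.0, np.log(T_eff) / 2.0, T_eff / (T_eff - n)

# Effective sample size 24 corresponds to T = 30 (see Section 2.4):
aic, sc, gcv = marginal_penalties(24, 1)   # sc ~ 1.589, gcv ~ 1.143, as in Table 2.1
```

Only GCV_MP depends on p, which is what makes it explode as the remaining degrees of freedom are used up.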
Observation 2 Consider the relative sizes of the marginal penalties.

(i) For T < 8:  SC_MP < AIC_MP < GCV_MP   (2.23)

(ii) For T > 8:  SC_MP is greater than AIC_MP, and may be less than or greater than GCV_MP, depending on the size of T and n.   (2.24)

The ordering in (i) can be established trivially from the above expressions for the marginal penalties. The ordering in (ii) with respect to SC_MP is not so straightforward, due to SC_MP being non-linear in T. However, the ordering is trivial to calculate given any T and n. Table 2.1 contains the marginal penalties for the AIC, SC and GCV for the no trend model (equation (2.1)) in the experiments considered below (Section 2.4). The values confirm these observations concerning the properties of these marginal penalties. The AIC and SC marginal penalties are constant, and bound the GCV marginal penalty for each sample size and parameterization tabled. These marginal penalties are discussed further in the next section.

2.4 Monte Carlo Experiments

Two experiment designs were examined. In the first, simple case, the performance of the ADF tests was examined by seeing how many times out of one hundred they identify a series of white noise as a nonstationary process. In this case the model selection criteria should choose a lag length of zero in equations (2.1) and (2.2). The second experiment design considered stationary data generated so that the absence of a lagged dependent variable in the ADF regressions would result in autocorrelated errors. In this case the model selection criteria should choose a lag length of one. In each case, sample sizes of 30, 60 and 120 were considered, with one hundred replications of the experiments. The different methods for determining the number of lag terms to be included are presented in Section 2.3.
As we are testing the stationarity of white noise, we would anticipate the rejection of the null hypotheses of a unit root.¹⁰ The different null hypotheses which were considered are described in Section 2.2.1. When the sample size was set to 30, the 30 observations were taken as the first 30 of a sample of size 120. Following this procedure allows us to investigate the effect on the ADF tests of our sample size increasing through more data becoming available. The introduction of an extra lag term can be treated in two ways. First, an observation at the start of the sample can be dropped. Second, an observation at the start of the period can be set to zero. The first method was employed; the results may change if the second were employed. The maximum number of lags permitted for each sample size was determined by √T, the method used by SHAZAM's AUTO procedure. While the AUTO procedure determines the number of observations to be dropped from the sample and then estimates the ADF regressions, the other methods must compare different parameterizations based on the estimation of the ADF regressions. Therefore, to ensure that the comparisons among the parameterizations are made using the same data set, the sample size must be reduced by the maximum number of lags by dropping the initial observations from the sample. For example, for the methods other than AUTO, if the sample size were 30 then the maximum number of lags permitted is 5, and the sample size used to compare the different parameterizations and perform the ADF tests is (30 − 5 − 1) = 24, where one is subtracted due to the lagged value of y being a regressor.

¹⁰ It is possible for a sample drawn from the distributions to resemble a nonstationary series by chance. However, we would not anticipate this happening often.
Restricting the maximum number of lags will aid the performance of methods which insist on selecting inappropriately long lag lengths. The critical values used were those of Davidson and MacKinnon (1993; 708), as given by SHAZAM, as these values are probably used, or are similar to those used, by applied researchers.¹¹ Note that the unit root tests are only valid if the error terms in the ADF regressions are serially uncorrelated. The incorrect selection of the augmenting lag length results in invalid tests. In practice, we are ignorant as to the correct lag length, and hence we are interested in how often we are led to an incorrect conclusion concerning the stationarity of the data. The experiment designs are now described in more detail.

2.4.1 A Simple Experiment

The first experiment design was to generate samples of white noise drawn from the standard normal distribution: y ~ N(0,1).¹² The Augmented Dickey-Fuller (ADF) tests were performed, using different criteria for determining the appropriate lag lengths in the ADF regression equations, equations (2.1) and (2.2) above.

2.4.2 AR(1) Errors

The second experiment design entailed generating stationary data which would have a first order autocorrelated error term if the ADF regressions in equations (2.1) and (2.2) were estimated without a single lag of the dependent variable included as a regressor.

¹¹ All computations were performed using the general econometrics computer program SHAZAM (1993; White, 1978) on a Sun SPARCstation 2.
¹² The ADF testing procedure should be scale invariant, so using samples from two different distributions would act as a check of both the computer programs used and the random number generator. Taking y ~ N(0, .1) led to essentially identical results.
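The comparison machinery shared by the two experiment designs can be sketched as follows, assuming Gaussian errors so that minimising each criterion reduces to minimising (T/2) log(RSS/T) plus the penalty term. This is an illustrative Python/NumPy reimplementation for the no-trend equation (2.1) only, not the SHAZAM code actually used:

```python
import numpy as np

def adf_design(y, p, max_p):
    """Regressors for the no-trend ADF equation (2.1): a constant, y_{t-1}, and
    p lags of dy, on the common sample that drops the first max_p differenced
    observations, so that every p is compared on the same data set."""
    dy = np.diff(y)
    m = len(dy)
    X = np.column_stack(
        [np.ones(m - max_p), y[max_p:m]]
        + [dy[max_p - j:m - j] for j in range(1, p + 1)])
    return X, dy[max_p:]

def select_lag(y, max_p, penalty):
    """Minimise (T/2) log(RSS/T) + penalty(T, n) over p = 0, ..., max_p."""
    scores = []
    for p in range(max_p + 1):
        X, d = adf_design(y, p, max_p)
        T, n = X.shape
        b = np.linalg.lstsq(X, d, rcond=None)[0]
        rss = np.sum((d - X @ b) ** 2)
        scores.append(0.5 * T * np.log(rss / T) + penalty(T, n))
    return int(np.argmin(scores))

# Penalty terms in the logarithmic forms used above.
penalties = {'AIC': lambda T, n: float(n),
             'SC':  lambda T, n: n * np.log(T) / 2,
             'GCV': lambda T, n: -T * np.log(1 - n / T)}

rng = np.random.default_rng(1995)
y = rng.standard_normal(120)          # white noise: the correct lag length is zero
picks = {name: select_lag(y, 10, pen) for name, pen in penalties.items()}
```

Since the SC and GCV penalties exceed the AIC penalty at every margin for these sample sizes, they can never select a longer lag than the AIC on the same data.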
Re-arranging equation (2.6), we get the following AR(2) process for yₜ:

yₜ = (α + ρ)yₜ₋₁ − ρα yₜ₋₂ + ηₜ   (2.25)

For yₜ to be a stationary AR(2) process we require the following conditions to be satisfied:

• |α + ρ| < 1;
• (α + ρ) − ρα < 1;
• −ρα − (α + ρ) < 1.

Given the above observations, the following values for α and ρ were chosen:

α = 0.25,  ρ = 0.6   (2.26)

Equation (2.25) then becomes

yₜ = 0.85yₜ₋₁ − 0.15yₜ₋₂ + ηₜ   (2.27)

Working backwards, one can verify that, given this data generating process, one needs a lag of Δyₜ as a regressor to avoid having AR(1) errors in the ADF regressions. The ηₜ were drawn from a standard normal distribution, as were the initial values of yₜ₋₁ and yₜ₋₂.

2.4.3 Results: Simple Experiment

This section presents the results from the experiment described in Section 2.4.1, while the next section presents the results from the experiment described in Section 2.4.2. The results from the simple experiment design of Section 2.4.1 are presented in Tables 2.2 to 2.6. Tables 2.2 and 2.3 present the number of times each lag length was selected for each sample size and for each of the ADF regression equations (equations (2.1) and (2.2)). The TC and SC do the best at all sample sizes at selecting the correct lag length of zero. The AIC, CV and GCV also select the correct lag length more times than any other. The selection performance of each of the criteria improves with an increase in the sample size, except for SHAZAM's default method, AUTO. For each sample size an incorrect lag order is selected more often than the correct order, and the performance of AUTO becomes worse as the sample size increases. For T = 60 and T = 120, AUTO never selects the correct lag order of zero in one hundred replications. The similar behaviour in terms of lag length selection by the AIC and GCV in larger samples is suggested by their respective marginal penalties.
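Returning to the design of Section 2.4.2: equation (2.25) factors as (1 − αL)(1 − ρL)yₜ = ηₜ, so the characteristic roots of z² − (α + ρ)z + ρα are exactly α and ρ, and the chosen process (2.27) has roots 0.25 and 0.6, both inside the unit circle. A quick Python/NumPy sketch (illustrative only, not the thesis's SHAZAM code):

```python
import numpy as np

def simulate_ar2(alpha, rho, T, seed=0):
    """Simulate y_t = (alpha + rho) y_{t-1} - rho*alpha*y_{t-2} + eta_t (eq. 2.25),
    with standard normal innovations and initial values, as in the text."""
    rng = np.random.default_rng(seed)
    y = list(rng.standard_normal(2))            # initial y_{t-2}, y_{t-1}
    for eta in rng.standard_normal(T):
        y.append((alpha + rho) * y[-1] - rho * alpha * y[-2] + eta)
    return np.array(y[2:])

# z^2 - 0.85 z + 0.15 = (z - 0.25)(z - 0.6): both roots inside the unit circle.
roots = np.roots([1.0, -0.85, 0.15])
```

Checking the roots this way makes the choice of (2.26) transparent: the AR(1) error component ρ = 0.6 is what forces a lag of Δyₜ into the ADF regression.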
The marginal penalties for the no trend model (equation (2.1)) are tabulated in Table 2.1 for ease of reference. The marginal penalty of the AIC is constant at one for all sample sizes and values of p. The marginal penalty of the SC changes with T only, while that of the GCV changes with both T and p (see Section 2.3.7 above). GCV_MP is greater than one, but approaches one as the sample size increases. SC_MP becomes larger as T increases, and it always has the largest value of the three criteria. It is interesting to note how the SC penalizes the additional parameter more heavily in large samples than in small samples, but is indifferent to whether that parameter is using up the last or first degree of freedom. The GCV, on the other hand, penalizes the additional parameter more lightly in large samples, and dislikes intensely giving up a degree of freedom when there are few remaining. This behaviour of the GCV is more intuitively appealing than that of the SC, as the GCV behaves as an applied researcher may be expected to—adding more parameters when T is large and p small, but reluctant to add more when there is a scarcity of degrees of freedom. As noted above, as T increases so does SC_MP; hence we see the SC selecting p = 0 most often when T = 120. However, the AIC and GCV also select more parsimonious models when T = 120, indicating that the improvement in the log likelihood resulting from the addition of a lag term is falling as T increases. For the GCV, this implies that its marginal penalty is falling more slowly than the marginal improvement in the log likelihood, as T increases. Although the ADF tests are invalid when we have misspecified the lag length, it is interesting to see how many times out of one hundred we would be misled in identifying a stationary series as nonstationary, given that in practice we are unaware of our specification error.
Tables 2.4 to 2.6 present the results from performing the ADF tests on white noise given the lag lengths selected by each criterion for each sample size, at the 1%, 5% and 10% levels of significance. As these tables present Type II errors, the smaller the number reported the less often we are being misled into identifying a stationary series as nonstationary. The last column of these tables gives the performance of the ADF tests when the correct lag length is specified. We can identify a couple of patterns in these results. First, we are misled more often when the correct lag length is not selected. Second, inappropriately including a trend variable as a regressor in the ADF regressions results in more frequent non-rejections of the false null hypothesis of a unit root. Given that we have created data series from random draws of white noise, there should be no trend in our series. Specifying a model with a trend is clearly not helping the ADF tests. However, in most practical applications using macroeconomic data, one would want to include a trend of some sort. The use of a quadratic or piecewise linear trend would not seem to help, as in the current case we have merely samples of white noise which should not exhibit a trend of any sort (at least not often in many replications). Note that the use of AUTO can lead us to the wrong conclusion about the nature of a series more times than the correct conclusion, especially in small samples. "Bad" performance is defined by the false null hypothesis of a unit root not being rejected in a large number of cases. For T = 30 (Table 2.4), at the 5% level of significance AUTO accepts the false null hypothesis more often than it rejects it. For example, the null of α₀ = α₁ = α₂ = 0 is not rejected 73 times. For T = 120 (Table 2.6), AUTO is still doing poorly (34 failures to reject the null of α₀ = α₁ = α₂ = 0).
When we are reaching the wrong conclusion more than 50 times out of one hundred at the 5% level of significance, we must regard this as a serious problem. These results are very discouraging for the use of the AUTO procedure in combination with the ADF tests. Note also that the other means of model selection result in the right conclusion more times than not, especially in larger samples. Although the results may favour the use of the SC, GCV or TC over the AIC or CV in smaller samples, for T = 120 the use of the AIC leads to the same conclusions as the GCV. The behaviour of the CV as T increases leads to an improvement in attaining the right conclusion when T = 60, but a deterioration when T = 120. These results are perhaps indicative of the inappropriateness of this method in the current context. The results in the purely "white noise" experiment that we have considered so far indicate that AUTO is a totally inappropriate procedure for lag length selection in the context of the ADF regression equations. In small samples we are better off using the TC, SC or GCV, while the AIC can be considered as an appropriate selection criterion, particularly in larger samples. However, in this experiment it is clear that the criterion with the largest penalty for an additional parameter is always going to do best in selecting the lag length of zero. Hence, in the next section results are reported for an experiment where the appropriate lag length is one. The selection criteria which did well in selecting the most parsimonious model when it was in fact the correct model may not do so well when the most parsimonious model is not the correct model.

2.4.4 Results: AR(1) Errors

This section considers the experiment described in Section 2.4.2, where the correct lag length is designed to be one, in contrast to the experiment described in Section 2.4.1, where the correct lag length was zero.
The results are presented in Tables 2.7 to 2.11. Tables 2.7 and 2.8 present the number of times each lag length was selected for each sample size and for each of the ADF regression equations (equations (2.1) and (2.2)). We see that for the smaller sample sizes of T = 30 and T = 60 the inappropriate lag length of zero is favoured by all of the selection methods. For T = 120, the AIC, GCV and CV prefer a lag length of one to any other. The SC and TC, which did well in selecting the most parsimonious model when it was the correct model (Section 2.4.3), continue to select the most parsimonious model when it is now inappropriate to do so. In fact, the TC never selects the correct lag length for each of the three sample sizes considered. For T = 60 and T = 120, AUTO also never selects the correct lag order of one. It seems then that AUTO is the least favourable method for lag length selection in the ADF context, as it has done poorly in both of the experiments considered. Again, the ADF tests are invalid when the incorrect lag length has been selected, but we are still interested in how many times out of one hundred we would be misled into not rejecting the false null hypothesis. Tables 2.9 to 2.11 present the Type II errors resulting from this experiment, for sample sizes 30, 60 and 120, respectively. From Table 2.9 we see that the results from using AUTO are shockingly bad. For example, at the 5% level of significance we fail to reject the null hypothesis of α₀ = α₁ = α₂ = 0 a total of 97 times when we use AUTO to select the lag length. The other methods do slightly better. The most interesting result is that the other methods do better in terms of not leading us astray in our conclusions than if we specified the correct lag length of one. This suggests that the tests are biased in the appropriate direction for this context.
It is, however, disconcerting that the specification of the correct lag length would lead us to the wrong conclusion so often with a small sample. As the sample size increases, we can see from Tables 2.10 and 2.11 that the improvement in selecting the correct lag order of one by the methods other than AUTO and the TC results in an improvement in terms of rejecting the false null hypothesis. Also, selecting the correct lag length of one a priori does better than most methods. However, the SC and TC still do best in terms of having the smallest number of Type II errors. Again, the bias in the tests when the lag order is incorrectly specified to be zero seems to be going the right way, as the false null is being rejected when it should be. However, this bias may be going the wrong way if the series is actually nonstationary. Setting α = 0.5 and ρ = 0.5 in equation (2.25), we get a violation of the stationarity conditions listed in Section 2.4.2. The series generated using these parameter values are then nonstationary. The ADF regressions would still have AR(1) errors if a lag of the dependent variable were not included as a regressor. In this case we want non-rejections of the null hypothesis of nonstationarity, and are interested in finding whether there are more non-rejections when the lag length is correctly specified to be one, or when the lag length is incorrectly specified to be zero. It was found that setting the lag length to zero a priori resulted in more non-rejections than setting the lag length to the correct value of one (Table 2.12). The results from this experiment suggest that we should be skeptical of conclusions drawn from ADF tests when using small samples, that the AUTO procedure is inappropriate in the ADF test context, and that a procedure of setting the lag order in the ADF regressions to zero, even when it should be one (or perhaps larger), is a preferred approach.
2.4.5 AUTO and Phillips-Perron Tests

The Phillips-Perron method of testing for nonstationarity (Phillips, 1987; Perron, 1988) uses a non-parametric adjustment for serial correlation as an alternative to the inclusion of lag terms, as in the ADF test procedure. The idea is to use the regression equations (2.1) and (2.2) with p = 0 (i.e. no lag terms), and to adjust the test statistics for the effect of serial correlation. Perron (1988; 308-9) presents the formulae for the transformed test statistics. The appropriate critical values are the same as for the Dickey-Fuller tests. The Phillips-Perron tests were performed using the default set-up in SHAZAM.¹³ This method employs the AUTO procedure (Section 2.3.1) to select the "truncation lag parameter" used in the Newey-West (1987) method for estimating the error variance from the estimated residuals. The choice of the truncation lag parameter is as important in this context as the choice of lag order was to the ADF tests, so we are interested in how well AUTO does considering its poor performance in the previous context. In this context the use of AUTO has some justification, as it is used to examine the autocorrelation function of the errors resulting from the estimation of equations (2.1) and (2.2) (with p = 0). Hence AUTO directly examines the error terms for evidence of serial correlation and selects the lag truncation parameter suggested by the autocorrelation function of these estimated errors. Considering y ~ N(0,1), and sample sizes of 30, 60 and 120, the false null hypothesis was rejected in all one hundred replications. Considering yₜ = .85yₜ₋₁ − .15yₜ₋₂ + ηₜ,

¹³ The error variance estimate suggested by Newey and West (1987) was used: (1/T) Σₜ₌₁ᵀ êₜ² + (2/T) Σₛ₌₁ˡ w(s, l) Σₜ₌ₛ₊₁ᵀ êₜ êₜ₋ₛ, where the êₜ are the estimated residuals, l is the truncation lag parameter, and w(s, l) is a window. SHAZAM takes the window as having the form w(s, l) = 1 − s/(l + 1).
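The variance estimate in footnote 13 is straightforward to compute from a residual series. A Python/NumPy sketch (the residuals used below are placeholders, and this is not the SHAZAM implementation):

```python
import numpy as np

def newey_west_variance(e, l):
    """Error variance estimate of footnote 13: (1/T) sum e_t^2
    + (2/T) sum_{s=1}^{l} w(s, l) sum_{t=s+1}^{T} e_t e_{t-s},
    with the Bartlett window w(s, l) = 1 - s/(l + 1)."""
    e = np.asarray(e, dtype=float)
    T = len(e)
    s2 = np.sum(e ** 2) / T
    for s in range(1, l + 1):
        s2 += (2.0 / T) * (1.0 - s / (l + 1.0)) * np.sum(e[s:] * e[:-s])
    return s2
```

With l = 0 the estimate collapses to the ordinary residual variance, which is why the choice of the truncation lag parameter plays the role here that the augmenting lag length played in the ADF tests.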
there were many non-rejections for T = 30, but performance was better for T = 60 and perfect for T = 120 (Table 2.13). Even in the T = 30 case, the correct lag length of one was selected every time, indicating that the problem with a large number of non-rejections is due to the small sample properties of the Phillips-Perron tests rather than the selection of an incorrect lag order by the AUTO procedure. This suggests that while AUTO is a method to be avoided in the ADF test context, it is appropriate for the Phillips-Perron context.

2.5 Directions for Further Research

An alternative procedure for testing for stationarity is that of Campbell and Mankiw (1987). They considered taking the null hypothesis to be that of stationarity. Their procedure uses a model selection criterion to determine the order of an ARMA model. The criteria they suggest are the AIC and SC.¹⁴ Naturally, we are interested in the appropriateness of these criteria. Campbell and Mankiw considered a large number of alternative ARMA specifications, and hence their suggestion of the use of the AIC or SC had no impact on their results. However, the applied user of their method may be less rigorous in investigating many different parameterizations. Hence an investigation of the performance of this stationarity test when different model selection procedures are employed would be of interest.

2.6 Overview

We have performed Augmented Dickey-Fuller tests on white noise and a stationary AR(2) process. The results indicate that certain methods for selecting the lag order for

¹⁴ They, as many others, use the notation BIC to denote the Schwarz Criterion. Note also that there are a couple of typographical errors in their paper. Their expressions for the AIC and SC should be adding, rather than subtracting, the penalty terms (Campbell and Mankiw, 1987; 862). The empirical work does not suffer from this error.
the ADF regressions do worse than others. In particular, the use of the autocorrelation and partial autocorrelation functions of the first differenced series (AUTO) results in a poor ability of the ADF tests to reject the false null hypothesis of nonstationarity, i.e. the tests seem to have very low power. The other selection criteria considered perform comparably well, especially in the largest sample considered (T = 120). Generally, there is evidence to suggest that one should qualify carefully one's conclusions from ADF tests in small samples, regardless of which criterion has been used for lag length selection. In larger samples, any of the model selection criteria considered, besides AUTO, seem to produce similar results, making us indifferent as to which criterion to use. There is also evidence to suggest that setting the lag length to zero a priori may be a preferred procedure, even when this is the inappropriate lag length. The use of AUTO in the Phillips-Perron unit root test context was found to be appropriate. The use of inappropriate methods for the selection of lag order in the ADF regressions can lead to disastrous performance of the ADF tests. Performance is particularly bad in small samples. The use and comparison of various selection methods for lag length is recommended for all sample sizes.

Table 2.1: Marginal Penalties of the AIC, SC and GCV for each lag order (p) and sample size considered: Constant, no trend model (n = 2 + p), effective sample size = T − √T − 1.
Lag Order        T = 30                 T = 60                 T = 120
             AIC    SC    GCV       AIC    SC    GCV       AIC    SC    GCV
p = 1         1   1.589  1.143       1   1.976  1.061       1   2.346  1.028
p = 2         1   1.589  1.200       1   1.976  1.083       1   2.346  1.038
p = 3         1   1.589  1.263       1   1.976  1.106       1   2.346  1.048
p = 4         1   1.589  1.333       1   1.976  1.130       1   2.346  1.058
p = 5         1   1.589  1.412       1   1.976  1.156       1   2.346  1.068
p = 6         1   1.589  1.500       1   1.976  1.182       1   2.346  1.079
p = 7         1   1.589  1.600       1   1.976  1.209       1   2.346  1.090
p = 8         1   1.589  1.714       1   1.976  1.238       1   2.346  1.101
p = 9         1   1.589  1.846       1   1.976  1.268       1   2.346  1.112
p = 10        1   1.589  2.000       1   1.976  1.300       1   2.346  1.124

Table 2.2: Number of times lag order p selected by each of the methods for each sample size considered: Constant, no trend model, y ~ N(0,1), correct lag length is zero.

Lag Order 0-10; for each of T = 30, 60 and 120 the methods are AUTO, AIC, SC, GCV, CV and TC.
0 7 68 85 74 63 95 0 72 95 75 72 97 0 74 95 79 76 100 1 2 3 4 39 27 13 8 14 7 1 4 0 7 5 1 1 3 13 6 19 7 2 4 1 0 0 2 6 22 24 19 1 9 7 6 0 0 2 3 5 1 11 5 1 12 7 3 0 0 0 2 0 0 13 22 6 7 5 3 3 2 0 0 6 5 5 3 3 9 7 2 0 0 0 0 5 6 6 2 3 5 2 8 4 0 2 3 0 10 3 0 2 3 0 6 7 9 12 0 1 0 0 0 1 1 1 0 1 8 16 0 1 0 0 0 0 0 0 0 0 8 9 10 16 1 0 0 0 0 8 0 0 0 0 0 7 0 0 0 0 0

Table 2.3: Number of times lag order p selected by each of the methods for each sample size considered: Constant, trend model, y ~ N(0,1), correct lag length is zero.

Lag Order 0-10; for each of T = 30, 60 and 120 the methods are AUTO, AIC, SC, GCV, CV and TC.
0 7 62 78 69 60 93 0 69 93 73 70 94 0 74 95 79 75 100 1 2 3 4 39 27 13 8 3 6 15 8 1 1 11 7 1 4 15 8 18 9 2 7 2 0 0 3 6 22 24 19 9 6 7 1 4 3 0 0 11 3 1 5 11 9 3 1 2 3 0 0 0 0 13 22 4 8 6 3 3 2 0 0 6 4 5 3 4 9 6 2 0 0 0 0 5 6 6 2 3 4 2 8 5 0 4 5 0 10 3 0 3 3 0 6 7 9 12 2 1 0 0 2 1 0 1 0 1 8 16 0 1 0 0 0 0 0 0 0 0 9 10 16 8 1 0 0 0 0 0 0 0 0 0 7 0 0 0 0 0 8
Table 2.4: Number of times out of 100 that each ADF null hypothesis is not rejected. Sample size is 30, y ~ N(0,1). Significance levels are 1%, 5% and 10%.

T = 30, y ~ N(0,1)               AUTO  AIC   SC  GCV   CV   TC  p=0
1% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         63   10    6    8   14    5    0
    α₀ = α₁ = 0                    72   12    8    9   16    7    1
(2) CONSTANT, TREND
    α₁ = 0                         83   23   15   17   27   14    4
    α₀ = α₁ = α₂ = 0               89   26   18   21   28   19    9
    α₁ = α₂ = 0                    87   23   15   17   27   15    5
5% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         45    5    2    3    8    2    0
    α₀ = α₁ = 0                    53    6    2    3    9    2    0
(2) CONSTANT, TREND
    α₁ = 0                         62    7    3    3   10    4    0
    α₀ = α₁ = α₂ = 0               73   13    7    7   15    8    1
    α₁ = α₂ = 0                    66    9    5    5   12    5    0
10% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         30    3    2    2    5    1    0
    α₀ = α₁ = 0                    37    4    2    2    6    1    0
(2) CONSTANT, TREND
    α₁ = 0                         49    5    1    1    7    2    0
    α₀ = α₁ = α₂ = 0               62    8    4    4   11    4    0
    α₁ = α₂ = 0                    54    5    1    1    7    2    0

Table 2.5: Number of times out of 100 that each ADF null hypothesis is not rejected. Sample size is 60, y ~ N(0,1). Significance levels are 1%, 5% and 10%.

T = 60, y ~ N(0,1)               AUTO  AIC   SC  GCV   CV   TC  p=0
1% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         43    2    0    2    4    0    0
    α₀ = α₁ = 0                    57    2    0    2    4    0    0
(2) CONSTANT, TREND
    α₁ = 0                         69    6    0    5    4    0    0
    α₀ = α₁ = α₂ = 0               84    9    0    8    6    0    0
    α₁ = α₂ = 0                    77    8    0    7    6    0    0
5% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         18    1    0    1    1    0    0
    α₀ = α₁ = 0                    25    1    0    1    1    0    0
(2) CONSTANT, TREND
    α₁ = 0                         38    2    0    2    0    0    0
    α₀ = α₁ = α₂ = 0               53    3    0    3    2    0    0
    α₁ = α₂ = 0                    43    2    0    2    1    0    0
10% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         13    1    0    1    1    0    0
    α₀ = α₁ = 0                    15    1    0    1    1    0    0
(2) CONSTANT, TREND
    α₁ = 0                         25    1    0    1    0    0    0
    α₀ = α₁ = α₂ = 0               37    2    0    2    0    0    0
    α₁ = α₂ = 0                    29    1    0    1    0    0    0

Table 2.6: Number of times out of 100 that each ADF null hypothesis is not rejected. Sample size is 120, y ~ N(0,1). Significance levels are 1%, 5% and 10%.
T = 120, y ~ N(0,1)              AUTO  AIC   SC  GCV   CV   TC  p=0
1% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         29    1    0    1    3    0    0
    α₀ = α₁ = 0                    34    1    0    1    5    0    0
(2) CONSTANT, TREND
    α₁ = 0                         39    1    0    1   11    0    0
    α₀ = α₁ = α₂ = 0               53    3    0    3   13    0    0
    α₁ = α₂ = 0                    44    2    0    2   12    0    0
5% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                          6    1    0    1    0    0    0
    α₀ = α₁ = 0                    18    1    0    1    2    0    0
(2) CONSTANT, TREND
    α₁ = 0                         25    1    0    1    3    0    0
    α₀ = α₁ = α₂ = 0               34    1    0    1    8    0    0
    α₁ = α₂ = 0                    29    1    0    1    6    0    0
10% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                          3    0    0    0    0    0    0
    α₀ = α₁ = 0                     5    1    0    1    0    0    0
(2) CONSTANT, TREND
    α₁ = 0                         18    1    0    1    1    0    0
    α₀ = α₁ = α₂ = 0               26    1    0    1    4    0    0
    α₁ = α₂ = 0                    18    1    0    1    3    0    0

Table 2.7: Number of times lag order p selected by each of the methods for each sample size considered: Constant, no trend model, yₜ = .85yₜ₋₁ − .15yₜ₋₂ + ηₜ, correct lag length is one.

Lag Order 0-10; for each of T = 30, 60 and 120 the methods are AUTO, AIC, SC, GCV, CV and TC.
0 73 54 79 62 61 93 52 52 74 56 54 96 12 33 64 34 33 99 1 3 22 13 22 18 2 0 28 19 29 22 1 0 48 33 49 45 1 1 2 3 4 9 7 4 6 5 5 2 3 3 5 5 2 6 5 6 3 1 0 7 15 5 7 5 3 5 1 1 9 2 2 8 6 4 1 1 0 5 14 6 7 6 3 2 1 0 7 5 3 9 7 4 0 0 0 5 6 7 5 8 0 4 4 1 3 13 5 2 2 1 0 0 0 0 1 1 2 4 0 0 0 0 11 14 7 2 1 0 0 0 0 2 0 0 2 0 0 0 0 0 8 9 10 10 7 0 0 0 0 0 0 0 0 0 0 14 0 0 0 0 0

Table 2.8: Number of times lag order p selected by each of the methods for each sample size considered: Constant, trend model, yₜ = .85yₜ₋₁ − .15yₜ₋₂ + ηₜ, correct lag length is one.
Lag Order 0-10; for each of T = 30, 60 and 120 the methods are AUTO, AIC, SC, GCV, CV and TC.
0 73 44 65 53 61 89 52 48 68 51 50 94 12 31 61 32 32 99 1 2 3 4 5 6 7 8 9 10 3 9 7 4 5 20 13 4 6 13 21 7 3 1 3 25 8 4 5 5 18 6 5 6 4 3 3 2 1 2 0 7 15 5 3 13 5 25 10 6 5 4 2 0 25 5 0 2 0 0 0 29 8 5 4 2 1 0 23 10 6 6 2 2 1 2 2 0 0 0 1 1 0 5 14 6 11 14 7 10 7 14 43 10 6 4 2 1 0 2 1 0 37 2 0 0 0 0 0 0 0 0 46 8 6 4 2 0 0 2 0 0 43 9 7 4 2 0 0 1 2 0 1 0 0 0 0 0 0 0 0 0

Table 2.9: Number of times out of 100 that each ADF null hypothesis is not rejected. Sample size is 30, yₜ = .85yₜ₋₁ − .15yₜ₋₂ + ηₜ. Significance levels are 1%, 5% and 10%.

T = 30, y = AR(2)                AUTO  AIC   SC  GCV   CV   TC  p=1
1% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         97   88   88   87   88   63   91
    α₀ = α₁ = 0                    97   89   89   88   89   69   92
(2) CONSTANT, TREND
    α₁ = 0                         99   86   86   84   87   76   94
    α₀ = α₁ = α₂ = 0               99   94   93   93   94   80   97
    α₁ = α₂ = 0                    99   88   90   87   90   78   96
5% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         85   64   71   69   67   30   70
    α₀ = α₁ = 0                    89   69   76   74   69   39   75
(2) CONSTANT, TREND
    α₁ = 0                         94   67   70   66   72   56   86
    α₀ = α₁ = α₂ = 0               97   78   78   75   78   66   90
    α₁ = α₂ = 0                    94   68   72   68   73   60   85
10% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         80   57   61   59   57   17   57
    α₀ = α₁ = 0                    89   59   68   64   61   21   64
(2) CONSTANT, TREND
    α₁ = 0                         94   55   65   57   62   37   73
    α₀ = α₁ = α₂ = 0               96   65   69   65   70   57   83
    α₁ = α₂ = 0                    93   57   67   59   63   45   77

Table 2.10: Number of times out of 100 that each ADF null hypothesis is not rejected. Sample size is 60, yₜ = .85yₜ₋₁ − .15yₜ₋₂ + ηₜ. Significance levels are 1%, 5% and 10%.
T = 60, y = AR(2)                AUTO  AIC   SC  GCV   CV   TC  p=1
1% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         87   57   60   53   58    2   57
    α₀ = α₁ = 0                    93   63   70   62   65    5   71
(2) CONSTANT, TREND
    α₁ = 0                         93   67   77   70   72   18   73
    α₀ = α₁ = α₂ = 0               95   82   83   82   83   26   85
    α₁ = α₂ = 0                    95   74   79   75   76   19   77
5% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         54   30   29   27   33    0   21
    α₀ = α₁ = 0                    67   40   42   38   42    0   31
(2) CONSTANT, TREND
    α₁ = 0                         81   41   45   39   42    2   46
    α₀ = α₁ = α₂ = 0               89   55   64   57   59    8   66
    α₁ = α₂ = 0                    84   44   48   42   46    4   56
10% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         40   17   15   14   15    0   10
    α₀ = α₁ = 0                    49   25   24   22   26    0   12
(2) CONSTANT, TREND
    α₁ = 0                         62   32   34   30   31    0   25
    α₀ = α₁ = α₂ = 0               81   42   46   40   44    3   47
    α₁ = α₂ = 0                    66   38   40   37   38    0   32

Table 2.11: Number of times out of 100 that each ADF null hypothesis is not rejected. Sample size is 120, yₜ = .85yₜ₋₁ − .15yₜ₋₂ + ηₜ. Significance levels are 1%, 5% and 10%.

T = 120, y = AR(2)               AUTO  AIC   SC  GCV   CV   TC  p=1
1% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         57    4    3    3    6    0    4
    α₀ = α₁ = 0                    64    7    7    6    8    0    5
(2) CONSTANT, TREND
    α₁ = 0                         77   13   11   12   11    0    8
    α₀ = α₁ = α₂ = 0               85   26   25   25   24    0   17
    α₁ = α₂ = 0                    78   17   15   16   14    0   12
5% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         28    2    1    1    1    0    0
    α₀ = α₁ = 0                    38    2    2    1    1    0    0
(2) CONSTANT, TREND
    α₁ = 0                         51    5    2    3    5    0    1
    α₀ = α₁ = α₂ = 0               67    6    3    4    5    0    4
    α₁ = α₂ = 0                    57    6    2    4    5    0    2
10% LEVEL
(1) CONSTANT, NO TREND
    α₀ = 0                         16    1    1    1    1    0    0
    α₀ = α₁ = 0                    21    2    1    1    1    0    0
(2) CONSTANT, TREND
    α₁ = 0                         35    2    1    0    0    0    1
    α₀ = α₁ = α₂ = 0               54    4    2    2    3    0    1
    α₁ = α₂ = 0                    42    3    1    1    1    0    1

Table 2.12: Number of times out of 100 that each ADF null hypothesis is not rejected. yₜ = yₜ₋₁ − .25yₜ₋₂ + ηₜ, a nonstationary process. Correct lag order is p = 1. Significance levels are 1%, 5% and 10%.
[Table body for the nonstationary process: non-rejection counts for T = 30, 60 and 120, each with p = 0 and p = 1, for the null hypotheses α0 = 0; α0 = α1 = 0 (constant, no trend) and α1 = 0; α0 = α1 = α2 = 0; α1 = α2 = 0 (constant, trend), at the 1%, 5% and 10% levels. The cell values did not survive extraction.]

Table 2.13: Number of times out of 100 that each ADF null hypothesis is not rejected. Phillips-Perron test, y_t = .85y_{t-1} − .15y_{t-2} + η_t, AUTO used to determine the truncation lag parameter. Significance levels are 1%, 5% and 10%.

[Table body for the Phillips-Perron test: non-rejection counts for T = 30, 60 and 120 under the same null hypotheses and significance levels as above. The cell values did not survive extraction.]

Chapter 3

Non-Parametric Estimation of Technical Progress

An important difference between parametric and nonparametric regression methodologies is their respective degree of reliance on the information about [the functional form] obtained from the experimenter and from the data.

Randall L. Eubank (1988; 3).

3.1 Introduction

The problem of modelling technical progress in a production economy is of great interest, as is the problem of model selection. This chapter presents some results on the second problem while offering an answer to the first.
Specifically, a logarithmic, multiple equation context form of the Generalized Cross-Validation Criterion (FGCV) for model selection is derived, and used in choosing the best representation of technical progress employing U.S. aggregate production data. Comparisons are made with other selection criteria using the same data, and through the use of a Monte Carlo experiment.

Technical progress is typically modelled in the empirical estimation of factor demand/supply systems using time series data by adding a linear time trend to each estimating equation in the system. However, there is no reason to expect, a priori, that technical progress enters each equation in a linear fashion. The evidence from index number approaches to calculating rates of technical progress does not support the use of a linear time trend—rates of technical progress are subject to large fluctuations throughout the sample, and average rates differ over sub-periods.

A recent paper by Diewert and Wales (1992) introduced splines in an independent variable into a flexible functional form. They looked at the case of allowing non-linear responses with respect to time in a normalized quadratic profit function.¹ The current chapter demonstrates a method for determining the number of splines to be included and the placement of these splines. In this way, we consider the non-parametric estimation of technical progress.

The two models which have been proposed use the normalized quadratic profit function as the basic functional form. The first model uses linear splines in the time trend variable, thus allowing separate time trends across the sample period. These splines are joined at break points,² meaning that the profit function will be continuous with respect to time, but not continuously differentiable.
Therefore, the second proposed model introduces quadratic splines—the lowest order needed for the profit function to be continuously differentiable with respect to the time variable. Although Diewert and Wales seemed to favour the quadratic spline model, an argument is made here for the use of linear splines.

In their empirical demonstration (using Japanese production data, 1955-1987) they avoided the issue of how to determine the number of break points by deciding on two, a priori. The problem is then simplified to finding the optimal positions for these break points. They did a grid search over all possible combinations of potential break points and selected the model with the largest log likelihood value. Both linear and quadratic spline models were found to be considerable improvements over the basic linear time trend model. Also, the results of the linear and quadratic spline models were similar.

¹The normalized quadratic form is due to Diewert and Wales (1987), while the profit function formulation is due to Diewert and Ostensoe (1988), and Kohli (1991). The idea of modelling the time trend with splines in economics seems to have been first suggested by Fuller (1969).
²The terms "break point" and "knot" are used interchangeably.

However, to make this modelling strategy operational for applied users, it is necessary to have some means of determining the number of break points to be included in the model. This is the point at which we encounter the issue of model selection. The problem of determining the number and position of break points (i.e. determining the appropriate model) is an issue in the recent statistics literature dealing with the estimation of functional relationships by using combinations of splines in the independent variables (e.g. Friedman and Silverman, 1989, Friedman, 1991, Breiman, 1991). The work of Smith (1982) motivated Friedman and Silverman (1989).
The work of Friedman and Silverman has, in turn, motivated the approach presented here to the problem of fitting splines in the time trend of our normalized quadratic model. This approach has the very desirable property of allowing the data to decide how the time trend should enter the model. The spline fitting algorithm of Friedman and Silverman is taken as given, except for the model selection step, where several criteria are used and compared.

To keep this exposition simple, and because this is the model the researcher would probably investigate first, the normalized quadratic linear spline (NQLS) model is used below. The proposed method of fitting splines can be used with the quadratic spline model in exactly the same fashion, although it is argued that the use of linear splines may be more appropriate.

Section 3.2 describes the normalized quadratic variable profit function, and the estimating equations. Section 3.3 briefly outlines the normalized quadratic unit profit function with linear splines. Section 3.4 describes the proposed approach to fitting splines. It includes the introduction of logarithmic, multiple equation forms for the cross-validation score and the Generalized Cross-Validation Criterion. The results of a Monte Carlo experiment on the relative properties of the selection criteria are reported. An index number approach to measuring technical progress is presented in Section 3.5. Section 3.6 makes some comments about issues relating to the appropriateness of the proposed procedure, and includes some historical notes. Section 3.7 gives the results of an empirical implementation of this modelling strategy, using U.S. production data, 1947-1981, and compares the models selected by different criteria. Section 3.8 concludes.

3.2 The NQ Variable Profit Function

Before introducing the NQLS, its parent functional form, the normalized quadratic, is presented.
First, some properties of a neo-classical profit function. The variable profit function can be expressed as follows:

V(p, k) = max_y {p'y : k = F(y)}     (3.1)

where p' denotes the transpose of the price vector, F(y) is the factor requirements function and k is capital. It is assumed that:

• p_i > 0 for i = 1, ..., N, where N is the number of variable goods.
• k > 0.
• y_i > 0 if i denotes an output, and y_i < 0 if i denotes an input (i.e. inputs are treated as negative outputs).

The neo-classical conditions on the variable profit function V are:

• V is a nonnegative function.
• V is positive homogeneous of degree one in p.
• V is convex and continuous in p for every fixed k.
• V is positive homogeneous of degree one in k.
• V is nondecreasing in k for every fixed p.
• V is concave and continuous in k for every fixed p.

(See Diewert, 1973, for proofs.) The profit maximizing supply and demand functions can then be derived by using Hotelling's Lemma—the derivative of V(p, k) with respect to the price of the ith good gives the supply or demand function (alternatively, "net supply" function) for this good. The convexity of V in prices implies that the matrix of second derivatives of V with respect to prices (the first partial derivatives of the net supply functions with respect to prices) is positive semidefinite.

The normalized quadratic variable profit function is defined as follows:

V(p, k, t) = a·p + (b·p)k + (c·p)kt + (1/2) p'Bp k/(α·p)     (3.2)

where B = [b_ij] = B' and t is a time trend, t = 1, ..., T. The α vector is a predetermined vector of parameters. Kohli (1991) and Diewert and Wales (1992) experimented with different values for α, but for simplicity it can be set to a vector of ones, i.e. α = 1_N, where N is the number of variable goods. It is required that α > 0_N. Setting the elements of the a vector to zero implies imposing constant returns to scale.
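As a quick numerical sanity check, the functional form in equation (3.2) can be verified to be positively homogeneous of degree one in prices. This is a sketch only: all parameter values below are arbitrary illustrations, not estimates from the thesis.

```python
import numpy as np

# Sketch: verify that the normalized quadratic of equation (3.2) is
# positively homogeneous of degree one in p. All parameter values are
# arbitrary illustrations, not estimates.
rng = np.random.default_rng(0)
N = 4
a, b, c = rng.normal(size=N), rng.normal(size=N), rng.normal(size=N)
alpha = np.ones(N)                      # predetermined vector, set to 1_N
A = np.tril(rng.normal(size=(N, N)))
B = A @ A.T                             # any symmetric B will do here

def V(p, k, t):
    return a @ p + (b @ p) * k + (c @ p) * k * t \
        + 0.5 * k * (p @ B @ p) / (alpha @ p)

p = np.array([1.0, 1.2, 0.8, 1.1])
print(np.isclose(V(2.5 * p, 2.0, 3.0), 2.5 * V(p, 2.0, 3.0)))  # True
```

Note that the a·p term is the only one not multiplied by k, which is why setting a = 0_N makes V homogeneous of degree one in k, i.e. imposes constant returns to scale.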
Differentiating with respect to p gives the system of net supply equations, through Hotelling's Lemma, which are used in estimation:

y(p, k, t) = ∇_p V(p, k, t) = a + bk + ckt + Bwk − (1/2)(w'Bw)kα     (3.3)

w = p/(α·p).     (3.4)

The following restrictions are placed on the elements of B to avoid potential problems with multicollinearity between the elements of b and B:

B1_N = 0_N,     (3.5)

i.e. the rows of B are restricted to add to zero. Therefore, we have one parameter in each row which is not independently determined. Using these restrictions, we can, for example, solve for the diagonal elements of the B matrix once the other parameters have been obtained by estimation. Consider the case of a five good model (say one output, three variable inputs and a fixed input, which is taken to be capital), so that N = 4:

b11 = −b12 − b13 − b14     (3.6)
b22 = −b12 − b23 − b24     (3.7)
b33 = −b13 − b23 − b34     (3.8)
b44 = −b14 − b24 − b34.     (3.9)

Remember that b_ij = b_ji. Defining w_ij = w_i − w_j, for i, j = 1, ..., N, we can then write:

Bw = [ −b12 w12 − b13 w13 − b14 w14 ]
     [  b12 w12 − b23 w23 − b24 w24 ]
     [  b13 w13 + b23 w23 − b34 w34 ]
     [  b14 w14 + b24 w24 + b34 w34 ]     (3.10)

Premultiplying Bw in equation (3.10) by w' gives us:

w'Bw = −[b12(w12)² + b13(w13)² + b14(w14)² + b23(w23)² + b24(w24)² + b34(w34)²].
(3.11)

We can now write our system of supply and demand equations of equation (3.3) in the form of the following estimating equations for our five good example:

y1 = a1 + b1k + c1kt − b12 w12 k − b13 w13 k − b14 w14 k − (1/2)(w'Bw)kα1 + e1     (3.12)
y2 = a2 + b2k + c2kt + b12 w12 k − b23 w23 k − b24 w24 k − (1/2)(w'Bw)kα2 + e2     (3.13)
y3 = a3 + b3k + c3kt + b13 w13 k + b23 w23 k − b34 w34 k − (1/2)(w'Bw)kα3 + e3     (3.14)
y4 = a4 + b4k + c4kt + b14 w14 k + b24 w24 k + b34 w34 k − (1/2)(w'Bw)kα4 + e4     (3.15)

for each t = 1, ..., T, where w'Bw is as expressed in equation (3.11) and the e_i, i = 1, ..., 4, are error terms added in estimation.

The matrix of price derivatives can be obtained by differentiating the supply and demand equations with respect to their respective prices. Note that these derivatives are the second-order derivatives of the profit function:

∇_p y(p, k, t) = [B − (Bw)α' − α(w'B) + (w'Bw)αα']k/(α·p).     (3.16)

The price elasticities can then be calculated as follows:

ε_ij = [∂y_i(p, k, t)/∂p_j]·p_j/y_i(p, k, t)     (3.17)

for i, j = 1, 2, 3, 4 and for each t. Once we have estimated our supply and demand equations, we can multiply the fitted quantities by their respective prices and add them together to derive the fitted variable profit function, for each t. Differentiating the log of the fitted profit function, V̂, with respect to t, we get an index of technical progress, for each t:

Tech = (∂V̂/∂t)/V̂.     (3.18)

To get a measure of technical progress that is comparable with multifactor index number measures, such as the Fisher quantity index approach (Diewert, 1992; Section 3.5), we set the denominator in the above to be equal to the value of estimated output(s). Recall that a requirement for the profit function to be neo-classical is that it satisfies convexity in prices. This requirement is satisfied by the normalized quadratic if the estimated B matrix of parameters is found to be positive semidefinite.
This holds if the following conditions are true:

b11 ≥ 0,   b11b22 − (b12)² ≥ 0,
b11b22b33 + 2b12b13b23 − b22(b13)² − b33(b12)² − b11(b23)² ≥ 0.     (3.19)

In the model described so far, there is no guarantee that these conditions will hold, and in practice they will usually be violated. But one of the main advantages of the normalized quadratic form over other flexible functional forms (such as the translog) is that correct curvature can be imposed globally without sacrificing flexibility, or any of the neo-classical theoretical requirements. Curvature is imposed using the following theorem:

Theorem 1 Let B be a symmetric N × N matrix. Then B is positive semidefinite if and only if B = AA' for some lower triangular matrix A (i.e. A = [a_ij], a_ij = 0 for i < j, i, j = 1, ..., N), (Wiley, Schmidt and Bramble, 1973; Diewert and Wales, 1987).

The restriction on the elements of the B matrix that we imposed above, B1_N = 0_N, can now be written as:

A'1_N = 0_N,     (3.20)

i.e. we now have the columns of A adding to zero. In our five good example, we can solve for the diagonal elements of A, remembering that A is a lower triangular matrix:

a11 = −a21 − a31 − a41     (3.21)
a22 = −a32 − a42     (3.22)
a33 = −a43     (3.23)
a44 = 0.     (3.24)

Using these restrictions, we can express the elements of the B matrix in terms of the elements of the A matrix (remembering that B = AA'):

b11 = (a11)² = (a21 + a31 + a41)²     (3.25)
b12 = a11a21 = −(a21 + a31 + a41)a21     (3.26)
b13 = a11a31 = −(a21 + a31 + a41)a31     (3.27)
b14 = a11a41 = −(a21 + a31 + a41)a41     (3.28)
b22 = (a21)² + (a22)² = (a21)² + (a32 + a42)²     (3.29)
b23 = a21a31 + a22a32 = a21a31 − (a32 + a42)a32     (3.30)
b24 = a21a41 + a22a42 = a21a41 − (a32 + a42)a42     (3.31)
b33 = (a31)² + (a32)² + (a33)² = (a31)² + (a32)² + (a43)²     (3.32)
b34 = a31a41 + a32a42 + a33a43 = a31a41 + a32a42 − (a43)²     (3.33)
b44 = (a41)² + (a42)² + (a43)².     (3.34)

Substituting these expressions for the b_ij into the estimating equations gives us a system of net supply equations whose parent function, the normalized quadratic profit function, has assured globally correct curvature. Once the system has been estimated, the b_ij can be derived using the estimated a_ij. Technical progress and the price elasticities can then be calculated as described previously.

An additional point to consider is the interpretation of the restrictions on the parameters that have been displayed above. A (Diewert-) flexible functional form, such as the normalized quadratic, can theoretically approximate an arbitrary neoclassical function to the second order at a point. Consider some arbitrary reference price vector p*, say the sample mid-point price vector, and normalize prices and quantities so that each price in this vector is equal to one. We can then interpret the restrictions on the parameters as follows:

A'p* = 0_N     (3.35)

and the predetermined parameter vector α satisfies:

α·p* = N,     (3.36)

where α > 0_N, and we take α = 1_N. In other words, we are imposing these restrictions at the point of theoretical approximation. So, as one can see from the above, using the normalized quadratic is not as cumbersome as one might initially think, and is little harder than using a translog functional form (Christensen, Jorgenson and Lau, 1971), with higher returns in terms of theoretical consistency.

3.3 The NQLS Model

The normalized quadratic linear spline model of Diewert and Wales (1992) has the desirable properties of globally correct curvature, Diewert-flexibility³ and additional approximation properties with respect to the splined time variable.
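Since the NQLS inherits its correct curvature from the device of Theorem 1, the construction of B = AA' under the column restriction (3.20) is easy to sketch numerically. The values chosen below for the free elements of A are arbitrary illustrations; the diagonal elements are solved out exactly as in equations (3.21)-(3.24).

```python
import numpy as np

# Sketch of Theorem 1 with restriction (3.20) for the N = 4 case: a
# lower-triangular A whose columns sum to zero gives B = AA' that is
# automatically positive semidefinite with B 1_N = 0_N.
A = np.zeros((4, 4))
A[1, 0], A[2, 0], A[3, 0] = 0.3, -0.1, 0.2      # free elements a21, a31, a41
A[2, 1], A[3, 1] = 0.4, -0.15                   # a32, a42
A[3, 2] = 0.25                                  # a43
for j in range(4):                              # solve diagonals from A'1_N = 0_N
    A[j, j] = -A[:, j].sum()

B = A @ A.T
print(np.allclose(B.sum(axis=1), 0))            # rows of B add to zero: True
print(np.all(np.linalg.eigvalsh(B) >= -1e-12))  # B positive semidefinite: True
```

Note that the loop reproduces a44 = 0, as required by equation (3.24).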
The NQLS unit profit function is defined as follows:

π(p, t) = h(p) + d(p, t)     (3.37)

where p is the vector of prices, d(p, t) is defined below, and

h(p) = a·p + (1/2)(α·p)⁻¹p'AA'p     (3.38)

where A is a lower triangular matrix. The following restrictions are assumed to hold at the reference prices p* ≫ 0_N, where there are N variables,

A'p* = 0_N,     (3.39)

and the predetermined parameter vector α satisfies:

α·p* = N,     (3.40)
α > 0_N.     (3.41)

Although Kohli (1991) and Diewert and Wales (1992) discussed various options for determining α, we followed Diewert and Wales (1988a, 1988b) in setting it to be a vector of ones. The prices and quantities used are typically normalized so that the reference prices p* are also a vector of ones in the arbitrary reference year. This makes α·p* = N, but this constant is absorbed into the matrix AA'.

³A function is Diewert- or second order-flexible if it has enough free parameters to theoretically approximate an arbitrary twice continuously differentiable, linearly homogeneous function to the second order at a point (Diewert, 1974, p. 113).

The use of AA' as the interaction matrix ensures that correct curvature is attained (Theorem 1). The unit profit function can be interpreted as the profit accruing to a unit of capital, the fixed factor.⁴ In the three spline (two break point) case, d(p, t) is defined as:

d(p, t) = d¹(p, t) for 0 ≤ t ≤ t1,
d(p, t) = d²(p, t) for t1 < t ≤ t2,     (3.42)
d(p, t) = d³(p, t) for t2 < t.

The functions dⁱ(p, t), i = 1, 2, 3, can then be defined to be linear, quadratic, or of some higher order. In the empirical results of Section 3.7 the linear spline model is used. In this case we have:

d¹(p, t) = (b⁽¹⁾·p)t;     (3.43)
d²(p, t) = d¹(p, t1) + (b⁽²⁾·p)(t − t1);     (3.44)
d³(p, t) = d²(p, t2) + (b⁽³⁾·p)(t − t2).     (3.45)

Note that if the N dimensional vectors of parameters b⁽ⁱ⁾ are all equal, then we are back to the basic normalized quadratic profit function of the previous section.
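The piecewise definition in equations (3.42)-(3.45) can be sketched directly. The break points and b-vectors below are arbitrary illustrations; the check confirms that d(p, t) is continuous at the break points even though its slope in t changes there.

```python
import numpy as np

# Sketch of the linear-spline term d(p, t) of equations (3.42)-(3.45).
# Break points t1, t2 and the b-vectors are made-up illustrations.
def d(p, t, t1, t2, b1, b2, b3):
    d1 = lambda s: (b1 @ p) * s
    d2 = lambda s: d1(t1) + (b2 @ p) * (s - t1)
    d3 = lambda s: d2(t2) + (b3 @ p) * (s - t2)
    if t <= t1:
        return d1(t)
    elif t <= t2:
        return d2(t)
    return d3(t)

p = np.array([1.0, 1.1, 0.9, 1.2])
b1 = np.array([0.01, 0.01, 0.01, 0.01])
b2 = np.array([0.03, 0.00, 0.01, 0.02])
b3 = np.array([-0.01, -0.01, -0.01, -0.01])
t1, t2 = 10, 25

# Continuous at the break points, but the rate of technical change
# jumps from b1.p to b2.p at t1 (and from b2.p to b3.p at t2).
print(np.isclose(d(p, t1, t1, t2, b1, b2, b3),
                 d(p, t1 + 1e-9, t1, t2, b1, b2, b3)))  # True
```

Setting b⁽¹⁾ = b⁽²⁾ = b⁽³⁾ in this sketch collapses d to a single linear trend, the case just described.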
If this is not the case (as we would generally expect) then we have the NQLS unit profit function. This functional form is not only flexible,⁵ but also curvature correct (convex), and it has the additional flexibility property of being able to attain arbitrary rates of technical change at each break point tᵢ. In other words, technical progress is no longer restricted to enter the model in a linear fashion, but rather it can take on arbitrary rates within each time segment.

⁴Treating capital as a fixed factor is an advantage of working with a profit function—if we use a cost function we must hold an output fixed, which is less intuitively appealing.
⁵Actually, it is what Diewert and Wales (1992) call TP-flexible. To be fully flexible, the term (1/2)c(α·p)t² (where c is an additional unknown scalar parameter) would have to be added to equation (3.37), so that there are enough parameters for the approximating function to attain coincidence of second derivatives with respect to t with some arbitrary neo-classical function.

The estimating equations are derived from the unit profit function by taking the first price derivatives (Hotelling's Lemma), as shown in the previous section. On the left hand side, note that we have the demand/supply variable divided by capital.⁶ The demand (input) variables are taken to be negative in estimation, while the supply (output) variables are taken to be positive. The issue addressed now is how to determine the position and number of break points to be allowed.

3.4 Model Selection

We have t, the time trend variable, running from 1 to T, and each t is a potential break point. Every t could be treated as a break point, but the problem would be overfitting—a lot of noise could be introduced into the model. Therefore, the number of splines to be included must be restricted in some fashion.
Friedman and Silverman (1989) suggested increments determined by:

M(T, θ) = −log₂[−(1/T)ln(1 − θ)]/2.5     (3.46)

with 0.05 ≤ θ ≤ 0.1, where T is the sample size (see Appendix B). This minimum span concept avoids the potential "packing" problem. If the candidate models are too tightly packed together then we have to choose between them in terms of small local properties. This can lead to models with low local lack-of-fit bias, but high variance. The idea is to proceed with a stepwise strategy:

1. The first knot (K = 1), or break point, is chosen from the T/M potential positions by trying each within the NQLS model and selecting the model with the best performance (measures of model performance are described below).

2. Each additional knot position is then selected from the remaining T/M − K + 1 potential positions by placing it in the position that optimizes the model's performance, given it and the K − 1 knots that have already been selected.

3. This strategy produces a sequence of models, each with one break point more than the preceding model. From among these models the best performer is chosen on the basis of criteria discussed below.

4. The chosen model, with K* knots (0 ≤ K* ≤ K_max), is subjected to a backward stepwise deletion procedure to see if all the knots placed in the forward procedure are needed, given the K* − 1 other knots—the break point that was selected second, for example, may no longer be significant to the model now that the third, fourth and fifth breaks have been added. Each knot is deleted in turn and is removed permanently if the performance of the model is improved by its deletion. This process is continued until the deletion of another knot fails to improve the model.

⁶This is an advantage of working with a unit rather than a variable profit function—dividing by capital may reduce heteroscedasticity.

There are many alternative procedures to that given above.
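The minimum span of equation (3.46) is easily made concrete. The sketch below assumes θ = 0.05, a value within the range suggested above; T = 35 corresponds to the annual sample of Section 3.7.

```python
import math

# Minimum-span increment of Friedman and Silverman, equation (3.46).
# theta = 0.05 is an assumed value within the recommended range.
def min_span(T, theta):
    return -math.log2(-(1.0 / T) * math.log(1.0 - theta)) / 2.5

# The span grows only slowly with T, keeping candidate knots from
# packing together as the sample lengthens.
for T in (35, 60, 120):
    print(T, round(min_span(T, 0.05), 2))
```

With T = 35, candidate break points are tried only every three to four observations, rather than at every t.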
Some benefits of the proposed method are relative computational simplicity, a justification for the minimum span employed, and nice intuition—the first knot or two may mainly help the global fit, while later placed knots help the local approximation properties, but at the same time may help the global fit to the extent that the first placed knot(s) are no longer needed.

3.4.1 Selection Criteria: Single Equation Context

In the context that we will be considering, the use of formal statistical testing is difficult, due to the sequential nature of the procedure and the need to compare across many models with differing numbers of parameters. The Generalized Cross-Validation criterion (GCV) has been used as a measure of model performance in the case of fitting splines by many authors (Craven and Wahba, 1979, Friedman and Silverman, 1989, Friedman, 1991 and Breiman, 1991). It is usually expressed as follows:

GCV = (1/T) Σᵢ₌₁ᵀ [yᵢ − f(xᵢ)]²/[1 − n/T]²     (3.47)

where yᵢ and xᵢ are the observation i dependent and independent variables respectively, T is the number of observations and f(·) is a function which depends on n unknown parameters. For our purposes, n = q + (K + 1), where q represents the number of non-spline related parameters and K is the number of knots; 1 is added since K knots implies K + 1 spline segments, each of which requires a parameter.

Friedman and Silverman noted the results of Hinkley (1969, 1970) and Feder (1975), concerning an additional parameter using up more than one degree of freedom in adaptive modelling. Therefore they replaced n with d(K) = q + (3K + 1). Note that we still have 1 rather than 3 added. This is because time is always allowed to enter in at least a linear fashion. We followed the Friedman and Silverman suggestion in the empirical work, and denote the GCV with this increased penalty as PGCV.
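The difference between the naive parameter count n and the Friedman-Silverman adjusted count d(K) is easy to tabulate. The value q = 10 below is an arbitrary illustration.

```python
# Effective parameter counts in the GCV denominator: the naive count
# n = q + (K + 1) versus the adjusted count d(K) = q + (3K + 1).
def n_naive(q, K):
    return q + K + 1

def d_adjusted(q, K):
    return q + 3 * K + 1

# With q = 10 non-spline parameters (illustrative), each extra knot costs
# three effective degrees of freedom under the adjusted count, not one.
for K in range(4):
    print(K, n_naive(10, K), d_adjusted(10, K))
```

The two counts agree at K = 0, reflecting the fact that time always enters at least linearly.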
Taking logs of the GCV and substituting in the log likelihood function, we get the following selection criterion function:⁷

GCV = −log L(β̂) − T log(1 − n/T).     (3.48)

⁷See Appendix A for the derivation of this expression for the GCV.

We can see that this is very similar to the common expressions for two widely used criterion functions—the Akaike Information Criterion (AIC) and the Schwarz Criterion (SC):

AIC = −log L(β̂) + n     (3.49)
SC = −log L(β̂) + n log(T)/2     (3.50)

where n is the dimension of the parameter vector β (Akaike, 1973; Schwarz, 1978). These are all negative objective functions, the objective being to find the model which minimizes the GCV, AIC or SC. The essential difference between these criteria is the penalty paid for the addition of more parameters to the model.⁸ These criteria trade off the increase in the log likelihood due to an additional parameter against the increase in the penalty term.

Observations 1 and 2 of Section 2.3.7 pointed out the relative properties of the penalty terms by comparing the behaviour of the marginal penalty paid for the marginal parameter. These observations found that the GCV has a marginal penalty that is increasing as the number of parameters approaches T, for fixed T, while both the AIC and SC penalize every marginal parameter with the same constant, no matter whether that parameter is using up the first degree of freedom or the last. The relative sizes of the marginal penalties determine which criterion selects the highest parameterization and which selects the lowest. From Observation 2 we have seen that, for T > 8, the marginal penalty of the AIC is the lowest, while the relative size of the GCV and SC marginal penalties depends on the sample size and the total number of parameters present. There is much interest in the model selection literature over what is an appropriate penalty.
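The marginal-penalty comparison just described can be reproduced numerically. This sketch uses T = 35, matching the annual sample of Section 3.7.

```python
import math

# Penalty terms of the criteria in equations (3.48)-(3.50), as functions
# of the parameter count n and the sample size T.
def pen_gcv(n, T):
    return -T * math.log(1.0 - n / T)

def pen_aic(n, T):
    return float(n)

def pen_sc(n, T):
    return n * math.log(T) / 2.0

# Marginal penalty for the n-th parameter: constant for AIC (1) and for
# SC (log(T)/2), but increasing without bound for the GCV as n -> T.
T = 35
for n in (5, 20, 30):
    print(n,
          round(pen_gcv(n, T) - pen_gcv(n - 1, T), 2),
          pen_aic(n, T) - pen_aic(n - 1, T),
          round(pen_sc(n, T) - pen_sc(n - 1, T), 2))
```

At this sample size the SC's constant marginal penalty (log(35)/2 ≈ 1.78) lies between the AIC's constant of 1 and the GCV's marginal penalty once a moderate number of parameters is in the model, consistent with the discussion above.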
All proposed model selection criteria come with justification of some sort, yet there is of course no guarantee that they will select the "true" model in empirical situations. Asymptotic justifications in terms of consistency can be misleading, as we more often than not work with small data sets, and the criteria may not have any optimality properties in practice (Cragg and Donald, 1993). Moreover, with the typically complex, high noise data that we work with, there can be many parameterizations which can give equally good stories about the underlying data generating process. It would seem wise, therefore, to consider the models selected by a range of criteria. To this end, multiple equation forms of the above three criteria (see the next section), plus the actual cross-validation score, are used in the empirical work reported below. The GCV is used as an approximation to the cross-validation score, which is more computationally burdensome.

⁸The penalty terms of the AIC and SC could also be adjusted to take into account the Feder/Hinkley results, but this issue is not pursued here.

Cross-validation is not a new procedure (see Stone, 1974, for some background), but one that has recently been receiving more attention. Efron (1982; 49) states that cross-validation "is an old idea whose time seems to have come again with the advent of modern computers."⁹ The idea is simple. Leave some of the data out and run the regression for a certain parameterization. Then calculate the implied residuals for the left out observations using the parameters of the estimated model. Leaving out each observation in turn, calculating the "left out" residuals, squaring, summing and averaging them gives the cross-validation score:

CV = (1/T) Σᵢ₌₁ᵀ [yᵢ − f₋ᵢ(xᵢ)]²
(3.51)

where f₋ᵢ(xᵢ) denotes the estimate for the ith observation calculated from the parameters of the model fitted to all the data except observation i.¹⁰ This gives us an approximation to the unobservable predictive-squared error:

PSE = E[y_new − f(x_new)]²     (3.52)

where y_new and x_new denote new data, over which we want our model f(·) to minimize the PSE. What cross-validation does, then, is simulate prediction.

⁹It does, however, have its detractors. See, for example, the discussion following Stone (1974), Efron (1983) and Leamer (1983).
¹⁰A less computationally intensive procedure is v-fold cross-validation, where the data is split into v equal groups. For example, 5-fold cross-validation involves leaving out 20% of the data each time. Various permutations of this kind of "multi-fold" cross-validation are possible (Geisser, 1975; Zhang, 1993).

The PSE has two components—the squared bias of our model and the variance of our model. In minimizing the PSE we have to consider the prediction variance-bias trade off. Fitting the available data too closely (i.e. overfitting) results in low bias, but high variance, which implies poor prediction performance. The PSE is an attractive alternative measure of model performance to the residual sum of squares (RSS). As long as the X'X matrix of the regressors is non-singular, the addition of extra parameters will reduce the RSS, but will not necessarily reduce the PSE. The GCV, AIC and SC can all be interpreted as adjusted coefficients of multiple correlation (R²). It is interesting to note that the model which minimizes the RSS may not be the model which minimizes the PSE. Even when considering two models with the same number of parameters, the RSS criterion may select a different model from the PSE. This is because minimizing our expression for the PSE implies a variance-bias trade off, whereas the RSS is minimized by reducing only the bias.
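The variance-bias trade off in the PSE can be illustrated with a small Monte Carlo sketch. The data generating process, noise level and polynomial orders below are all arbitrary choices for illustration, not anything estimated in this thesis.

```python
import numpy as np

# Monte Carlo sketch of the two PSE components at a "new" point x_new:
# squared bias of the fitted model and variance of the fitted model.
rng = np.random.default_rng(1)
T, reps = 25, 500
x = np.linspace(0.0, 1.0, T)
truth = np.sin(2 * np.pi * x)
x_new, f_new = 0.37, np.sin(2 * np.pi * 0.37)

results = {}
for order in (1, 3, 8):
    preds = np.empty(reps)
    for r in range(reps):
        y = truth + rng.normal(scale=0.3, size=T)      # fresh sample each rep
        preds[r] = np.polyval(np.polyfit(x, y, order), x_new)
    results[order] = ((preds.mean() - f_new) ** 2, preds.var())
    print(order, round(results[order][0], 4), round(results[order][1], 4))
# A rigid fit (order 1) has high squared bias and low variance; a very
# flexible fit (order 8) has low bias but higher variance.
```

Both fits reduce their in-sample RSS as the order grows, yet only the bias component of the PSE falls with it, which is exactly the distinction drawn above.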
An implicit assumption in using the GCV is that the same model will be selected by the RSS and the PSE. So, the GCV is an approximation to the CV,¹¹ which is in turn an approximation to the unobservable PSE.

Note that we have described the cross-validation score, rather than what is known as "complete cross-validation," or "true cross-validation." This latter concept employs a different selection algorithm which requires finding the best model for each sample of size T − 1. Doing so results in T selection procedures and a larger class of models from which to choose the best model. Due to the computational burden of this "complete" algorithm, and the further complication of selecting the best selection algorithm, only the "fixed path," or main sequence of models, was considered in the selection process used below—see Assumption 1 of Appendix A and Section 5.5. Taking this approach allows the cross-validation residuals to be calculated from the full sample ordinary least squares residuals in the single equation case.

Theorem 2 (Schmidt, 1971; Allen, 1971) The cross-validation score can be calculated directly from the full sample residuals using the following expression:

CV = (1/T) Σᵢ₌₁ᵀ [yᵢ − f(xᵢ)]²/[1 − cᵢᵢ]²     (3.53)

where cᵢᵢ is the ith diagonal element of the projection matrix X(X'X)⁻¹X'.

Proof: The conclusion follows simply from the Sherman-Morrison matrix downdating formula—see Appendix A.

¹¹See Appendix A for a derivation of the GCV from the CV.

This theorem greatly facilitates the calculation of the CV score—we do not have to re-estimate each model for every observation dropped. We will see in Section 3.4.2 that this result does not carry over directly to the multiple equation/general variance-covariance matrix context.
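Theorem 2 is easy to verify on simulated data. The following sketch compares the shortcut of equation (3.53) with the brute-force leave-one-out calculation of equation (3.51); the regression itself is a made-up illustration.

```python
import numpy as np

# Check of Theorem 2: the CV score computed from full-sample OLS
# residuals and the projection-matrix diagonal equals the explicit
# leave-one-out calculation. Data are simulated for illustration.
rng = np.random.default_rng(0)
T = 30
X = np.column_stack([np.ones(T), np.arange(T, dtype=float)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=T)

H = X @ np.linalg.inv(X.T @ X) @ X.T          # projection matrix X(X'X)^{-1}X'
resid = y - H @ y                              # full-sample OLS residuals
cv_short = np.mean((resid / (1.0 - np.diag(H))) ** 2)   # equation (3.53)

cv_loo = 0.0
for i in range(T):                             # brute-force leave-one-out
    keep = np.arange(T) != i
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    cv_loo += (y[i] - X[i] @ beta) ** 2
cv_loo /= T                                    # equation (3.51)

print(np.isclose(cv_short, cv_loo))  # True
```

The shortcut replaces T re-estimations with a single full-sample fit, which is precisely the computational saving the theorem delivers.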
3.4.2 Selection Criteria: Multiple Equation Context

In the case of a system of demand equations (the Seemingly Unrelated Regression (SUR) context), such as we will be considering, we do not have a single value for the variance; instead we have a variance-covariance matrix. In the SUR context, where we have G equations, the standard estimator of the coefficient vector is:

β̂ = (X'(Σ ⊗ I)^{-1}X)^{-1} X'(Σ ⊗ I)^{-1}Y    (3.54)

where Σ is a variance-covariance matrix of dimension G × G, I is a T × T identity matrix, X is a G(T × h) data matrix, and the prime denotes the transpose. Note that h denotes the number of variables in each equation. Σ is typically estimated by E'E/T, where E is a T × G matrix of the residual vectors for the G equations. In this case we choose to define the CV as follows:

CV = |Σ̂_CV|^{1/G}    (3.55)

where Σ̂_CV is the variance-covariance matrix derived from the cross-validation residuals, and G is the number of equations in the system. The exponent, 1/G, in equation (3.55) is included as we want an approximation to the average predictive squared error for each model considered.

The CV in equation (3.55) is straightforward to calculate the long way. In the SUR framework we have GT observations, or T observations for each equation in the system. Deleting the ith observation implies we are left with G(T - 1) observations. We therefore calculate one prediction error for each equation, for a total of G, for every i. Thus, we get a vector of cross-validation errors for each equation. Forming a T × G matrix with each of these vectors as columns, then taking the inner product of this matrix with itself and dividing by T, yields Σ̂_CV. We have dropped the superfluous Kronecker product term in specifying equation (3.55).
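Once the cross-validation residuals have been assembled "the long way" just described, equation (3.55) itself is a one-line computation. A minimal sketch (illustrative only; the residual matrix is assumed to have already been computed by re-estimation):

```python
import numpy as np

def system_cv(E_cv):
    """Equation (3.55): CV = |Sigma_cv|^(1/G), with Sigma_cv = E_cv'E_cv / T,
    where E_cv is the T x G matrix of cross-validation residuals."""
    T, G = E_cv.shape
    return np.linalg.det(E_cv.T @ E_cv / T) ** (1.0 / G)
```

With G = 1 the expression collapses to the mean squared cross-validation residual, as it should.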
In the single equation case we could use a matrix downdating formula to derive an expression for the CV which allows us to obtain the cross-validation residuals from the full sample residuals (Appendix A). This result does not carry over directly to the current context unless we make an assumption on the Σ matrix used to calculate the parameter vector for each G(T - 1) sample, when Σ is estimated from the regression residuals. In other words, because of the interaction between the error estimates and the typical (maximum likelihood) estimate of Σ, re-estimation of the system for each G(T - 1) sample is required to obtain the true cross-validation errors. This is computationally burdensome. However, the following theorem provides an alternative. As this is a fast way of obtaining an approximation to the CV, we call it Fast CV (FCV).

Theorem 3 (FCV) Specifying an appropriate Σ matrix to be used in estimating the parameter vector for each G(T - 1) sample allows the single equation matrix downdating to be utilized, and hence an approximation to the CV of equation (3.55) to be derived in terms of transformed full sample residuals.

Proof: First we make an assumption about the matrix Σ_p, the variance-covariance matrix used to estimate the parameter vector for each G(T - 1) sample.

Assumption 1 An appropriate estimate of Σ_p is the full sample estimate, Σ̂.

There are many possible alternatives to this choice. However, this assumption does not imply any further computations to get an acceptable matrix estimate, and allows us to prove the theorem. Let Γ denote an upper triangular matrix such that Γ'Γ = (Σ̂ ⊗ I)^{-1}. Transforming the X and Y matrices by premultiplying by Γ, and estimating the system as one large OLS regression problem, yields the same parameter estimates as if we had directly estimated the system, i.e. equation (3.54) is simply re-expressed as:

β̂ = (X'Γ'ΓX)^{-1} X'Γ'ΓY
(3.56)

The variance-covariance matrix of this OLS regression has the form σ^2 I, so we can now estimate the system of transformed variables by OLS. This allows us to use the matrix downdating described in Appendix A in the proof of Theorem 2, i.e. we can consider one row of the data matrices at a time for all GT observations. The result is a GT × 1 vector of transformed cross-validation residuals, Γv, calculated as follows:

Γv = (Γe) / diag(I - ΓX(X'Γ'ΓX)^{-1}X'Γ')    (3.57)

where Γe denotes the transformed error vector from the full sample estimation. We get an estimate of v by multiplying Γv by Γ^{-1}. We in turn get an estimate of the CV of equation (3.55) by rearranging v into a T × G matrix and forming the corresponding variance-covariance matrix, Σ̃:

FCV = |Σ̃|^{1/G}    (3.58)

where Σ̃ has been derived simply from a transformation of the full sample regression errors.

We can now derive a GCV-type expression in a similar fashion; we will denote it by FGCV due to its derivation from the FCV.

Theorem 4 (FGCV) Given Assumption 1, we can derive a GCV-type criterion from the FCV:

FGCV = |Σ̂|^{1/G} / (1 - h/T)^2    (3.59)

where Σ̂ is the variance-covariance matrix derived based on all the observations, and h is the number of variables in each equation.

Proof: As in the single equation case, we use the assumption that the diagonal elements of the projection matrix are approximately constant (Appendix A, Assumption 2). We then divide each element of the GT × 1 vector of OLS residuals, Γe, by the average of the diagonal elements of I - ΓX(X'Γ'ΓX)^{-1}X'Γ':

Γv = (Γe) / [(1/GT) Trace(I - ΓX(X'Γ'ΓX)^{-1}X'Γ')]
   = Γe / (1 - h/T).    (3.60)

Then an estimate of equation (3.55) can be formed by reasoning as above for the FCV:

FGCV = |E'E / (T(1 - h/T)^2)|^{1/G} = |Σ̂|^{1/G} / (1 - h/T)^2.    (3.61)

Note that although we have derived the FCV and FGCV with reference to the SUR context, they are applicable for any arbitrary variance-covariance matrix context.
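For concreteness, the following sketch implements the FCV and FGCV for the special case of a SUR system in which every equation shares the same T × h regressor matrix. That restriction, the function name, and the toy data are all mine, adopted purely to keep the illustration short; the derivation above is more general.

```python
import numpy as np

def fcv_fgcv(X, Y):
    """FCV and FGCV (equations 3.57-3.61) for a SUR system whose G equations
    share one T x h regressor matrix X; Y is T x G. Uses the full-sample
    Sigma (Assumption 1) to transform the stacked system to OLS form."""
    T, G = Y.shape
    h = X.shape[1]
    # Full-sample residuals and Sigma-hat, estimated equation by equation.
    B = np.linalg.lstsq(X, Y, rcond=None)[0]
    E = Y - X @ B
    Sigma = E.T @ E / T
    # Gamma with Gamma'Gamma = (Sigma kron I)^{-1}: take P upper triangular
    # with P'P = Sigma^{-1}, then Gamma = P kron I_T.
    P = np.linalg.cholesky(np.linalg.inv(Sigma)).T
    Gamma = np.kron(P, np.eye(T))
    Xs = np.kron(np.eye(G), X)      # stacked regressors, GT x Gh
    ys = Y.T.reshape(-1)            # stack observations equation by equation
    Xt, yt = Gamma @ Xs, Gamma @ ys
    beta = np.linalg.lstsq(Xt, yt, rcond=None)[0]
    et = yt - Xt @ beta             # transformed full-sample residuals, Gamma*e
    # FCV: single-equation downdating on the transformed system (eq. 3.57-3.58).
    c = np.einsum('ij,jk,ik->i', Xt, np.linalg.inv(Xt.T @ Xt), Xt)
    v = np.linalg.solve(Gamma, et / (1.0 - c))     # untransform the CV residuals
    V = v.reshape(G, T).T
    fcv = np.linalg.det(V.T @ V / T) ** (1.0 / G)
    # FGCV: replace each c_i by the average h/T (eq. 3.60-3.61).
    w = np.linalg.solve(Gamma, et / (1.0 - h / T))
    W = w.reshape(G, T).T
    fgcv = np.linalg.det(W.T @ W / T) ** (1.0 / G)
    return fcv, fgcv, Sigma

# Example: a three-equation system with an intercept plus two regressors.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(12), rng.standard_normal((12, 2))])
Y = rng.standard_normal((12, 3))
fcv, fgcv, Sigma = fcv_fgcv(X, Y)
```

The FGCV computed from the rescaled residuals reproduces the closed form |Σ̂|^{1/G}/(1 - h/T)^2 of equation (3.61) exactly, which is a useful check on the algebra.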
Also note that we have assumed that the same number of variables are in each equation to get the above expression for the FGCV. When this is not the case, h/T should be replaced by the appropriate average. We now derive a logarithmic expression for the FGCV by taking the logs of equation (3.61):

FGCV = (1/G) log|Σ̂| - 2 log(1 - h/T).    (3.62)

In order to get a form similar to that given in the single equation case, we multiply through by GT/2, and re-write the expression in terms of the log likelihood, ignoring the additive constants:12

FGCV = -log L(β̂) - GT log(1 - h/T).    (3.63)

Comparable expressions for the AIC, SC and CV are as follows:

AIC = -log L(β̂) + n    (3.64)
SC = -log L(β̂) + n log(GT)/2    (3.65)
CV = (T/2) log|Σ̂_CV|    (3.66)

12 Recall that the concentrated log likelihood in the system of equations case is -(GT/2)log(2π) - (T/2)log|Σ̂| - (GT/2). See Appendix A.

where n = G × h. These expressions were used in the empirical work below, along with the PFGCV, which differs from the FGCV by the Feder/Hinkley adjustment of the penalty term discussed in Section 3.1, i.e. h = q + (K + 1) is replaced by d(K) = q + 3K + 1, where q represents the number of non-spline related parameters and K is the number of knots. Note that we used the actual CV rather than the FCV approximation; this entailed far more computation, but hopefully more accurate results.

3.4.3 A Single Equation Monte Carlo Experiment

In this section we present the results of a simple Monte Carlo experiment of fitting structure to white noise. Employing the same spline fitting algorithm, we investigated the use of different model selection criteria in a single equation context; the less structure a criterion suggests is present in white noise, the better. The design of the experiment is as follows. A sample of twenty observations was drawn from the standard normal distribution to be the dependent variable, i.e. y ~ N(0,1). Time was taken to be the independent variable.
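A stripped-down version of this experiment can be run with a fixed knot grid and exhaustive search in place of the Friedman and Silverman stepwise algorithm (a simplification for illustration only; knot positions, spans and the selection path differ from the thesis's setup). It uses the -log L + penalty forms of the AIC and SC from equations (3.64)-(3.65) with G = 1. Because the SC's per-parameter penalty, log(T)/2, exceeds the AIC's penalty of 1 whenever T > e^2, the SC can never select more knots than the AIC on the same sample, which is the qualitative pattern reported below.

```python
import numpy as np

def neg_logl(y, X):
    """Concentrated Gaussian -log-likelihood, ignoring constants: (T/2) log(RSS/T)."""
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return (len(y) / 2.0) * np.log(e @ e / len(y))

def n_knots_selected(y, t, knots, penalty_per_param):
    """Fit linear splines using the first k grid knots, k = 0,...,len(knots),
    and return the k minimising -log L + penalty * (number of parameters)."""
    best_k, best_val = 0, np.inf
    for k in range(len(knots) + 1):
        X = np.column_stack([np.ones_like(t), t] +
                            [np.maximum(t - c, 0.0) for c in knots[:k]])
        val = neg_logl(y, X) + penalty_per_param * X.shape[1]
        if val < best_val:
            best_k, best_val = k, val
    return best_k

rng = np.random.default_rng(1995)
T, knots = 20, [4.0, 8.0, 12.0, 16.0]
t = np.arange(T, dtype=float)
results = []
for _ in range(100):
    y = rng.standard_normal(T)  # pure white noise: the ideal answer is k = 0
    k_aic = n_knots_selected(y, t, knots, penalty_per_param=1.0)             # AIC
    k_sc = n_knots_selected(y, t, knots, penalty_per_param=np.log(T) / 2.0)  # SC
    results.append((k_aic, k_sc))
```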
The Friedman and Silverman (1989) spline fitting algorithm described earlier was used in combination with the various model selection criteria. The minimum span was taken to be three. No structure should be fitted to the data, as we are attempting to model white noise. One hundred replications of this experiment were performed for each of the criteria, and the average absolute error (deviation from zero) for each t over the hundred replications was calculated. The results of this experiment are summarized in Figure 3.1 and Figure 3.2. The horizontal line in each figure is the average absolute error of the sample mean.

From our observations on the sizes of the relative marginal penalties in Section 2.3.7, we could have predicted the outcome in Figure 3.1. The AIC tries to fit the most structure to the noise, the SC less so, and Friedman and Silverman's PGCV the least. What is interesting is the quantitative result of just how large the average absolute error is when the AIC is used. This suggests that in actual empirical applications the AIC could select a model which is modelling more than the underlying relationships in which we are interested. The PGCV, which is the GCV with an increased penalty, clearly does the best in avoiding the modelling of noise, while the SC performs in the range between the extremes of the AIC and PGCV.

Figure 3.2 compares the performance of three cross-validation based criteria: CV, GCV and PGCV. The relationship between the GCV and PGCV is clear, as the latter is just a more highly penalized version of the GCV. In our experiment, the CV and GCV are quite close in their model selection, with the GCV giving a slightly higher average absolute error for most of the sample. This need not be the case in a more general modelling context, as the relative properties of the CV will vary with the variability of the diagonal elements of the projection matrix.
However, it suggests that in terms of modelling noise, there is little difference between the GCV and CV. Comparing Figure 3.1 and Figure 3.2, we note that the results for the GCV and CV lie between those for the SC and AIC.

The same experiment was performed with the white noise drawn from a distribution with lower variance, specifically y ~ N(0, 0.1). The results are summarized in Figures 3.3 and 3.4, which show that although the measures of average absolute error are lower by one decimal place for all criteria, the relative performance of the criteria remains largely unchanged. Similar results to those presented in the figures were also obtained using an independent variable drawn from a uniform distribution.

A phenomenon which is observed in the figures is the "end point problem." It is clearly seen that for every criterion the average absolute errors are larger near the beginning and end of the sample. This is due to there being less data constraining the fit in these regions. Section 3.6.2 makes some more comments on this problem.

3.5 Index Number Measures of Technical Progress

A commonly used method of calculating productivity, or technical progress, is to divide an index of the growth rate of outputs by an index of the growth rate of inputs used. This gives us a measure of how much the growth rate of outputs has changed relative to the growth rate of inputs, or in other words, how much of the growth in outputs is not caused by the growth in inputs. The changes in the output index not due to changes in the input index are identified as the mysterious thing called technical progress. It is a moot point how the output and input indexes should be created. This literature has a long history and will not be reviewed here.
However, it is probably worth mentioning, for the reader's reference, two classics which continue to influence the directions of research in index number theory: Fisher (1922) and Diewert (1976). In the current context, an often used index is the Divisia quantity index (Divisia, 1926; Solow, 1957; Jorgenson and Griliches, 1967). Consider a Divisia index of inputs. This has the following form:

Ẏ^t/Y^t = Σ_j s_j^t (ẏ_j^t / y_j^t)    (3.67)

where y_j^t denotes the tth observation on the jth input, s_j^t is the jth input's share of output, and ẏ_j^t = dy_j^t/dt. Due to the continuous nature of this formula and the discrete nature of economic data, a commonly used discrete approximation to the Divisia index is the Tornqvist (1936) index. This particular approximation is so common that it is synonymous with the Divisia index. In chained (moving base period) index form, the Tornqvist quantity index measure of the growth rate of inputs is written as follows:

Σ_j [½(s_j^t + s_j^{t-1})] log(y_j^t / y_j^{t-1})    (3.68)

where s_j^t = p_j^t y_j^t / Σ_j p_j^t y_j^t, and y_j^t is the tth observation on the jth good. One alternative to the Divisia/Tornqvist index is the Fisher index (Fisher, 1922). It is formed by taking the geometric average of the well known Laspeyres and Paasche indexes, and is written as follows:

Q^t = [ (Σ_j p_j^{t-1} y_j^t / Σ_j p_j^{t-1} y_j^{t-1}) (Σ_j p_j^t y_j^t / Σ_j p_j^t y_j^{t-1}) ]^{1/2}    (3.69)

There are two ways in which the properties of index numbers are evaluated: the axiomatic approach and the economic approach. Diewert (1992) finds that there are strong arguments for the Fisher index using either approach.

To create a measure of productivity, we proceed as follows. With more than one output, a similar index for the output variables would have to be created. In the case of there being one output, such as in the application below, the growth rate is given by G^t = y^t/y^{t-1}, where y^t denotes the output variable. Productivity for period t is then defined as:

IndexTech^t = (G^t / Q^t) - 1    (3.70)

where Q^t denotes the input quantity index.
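The index formulas above translate directly into code. The sketch below is an illustration (the function names are mine): it computes one link of the chained Tornqvist and Fisher quantity indexes and the productivity measure of equation (3.70).

```python
import numpy as np

def fisher_quantity_index(p0, y0, p1, y1):
    """One link of the chained Fisher quantity index (equation 3.69):
    the geometric mean of the Laspeyres and Paasche quantity indexes."""
    laspeyres = np.dot(p0, y1) / np.dot(p0, y0)
    paasche = np.dot(p1, y1) / np.dot(p1, y0)
    return np.sqrt(laspeyres * paasche)

def tornqvist_quantity_index(p0, y0, p1, y1):
    """One link of the chained Tornqvist index: the exponential of the
    share-weighted log growth rates of equation (3.68)."""
    s0 = p0 * y0 / np.dot(p0, y0)
    s1 = p1 * y1 / np.dot(p1, y1)
    return np.exp(np.sum(0.5 * (s0 + s1) * np.log(y1 / y0)))

def productivity(output_growth, input_index):
    """Equation (3.70): IndexTech = (G / Q) - 1."""
    return output_growth / input_index - 1.0
```

As a sanity check, if all input quantities grow by 5% at constant prices, both indexes equal 1.05, and if output also grows by 5% the measured productivity is zero.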
Subtracting one centers the productivity series around zero, rather than the arbitrary value of one. Given the arguments in favour of the Fisher productivity index presented in Diewert (1992), this approach was selected to create a measure of productivity for comparison with the demand system estimates of productivity.

3.6 Comments

Before proceeding with the empirical results, we anticipate some questions that the reader may have and make some comments on the general approach taken.

3.6.1 Why Splines?

One can think of two possible alternatives: the use of dummy variables and the use of higher order polynomials. In the case of the dummy variables, we would have to consider fitting dummies to blocks of observations to avoid the problem of over-fitting that would occur if we used a dummy for each observation. Some very sophisticated work has been done on this subject (Breiman, Friedman, Olshen and Stone, 1984). There are two problems with fitting dummy blocks. The first is that if the model is actually continuous with respect to the variable which is being modelled this way, the piecewise constant and sharply discontinuous nature of the method produces a poor approximation. The second is that simple linear and additive functions are approximated poorly (Friedman, 1991; 14). Therefore, as we consider our model to be continuous with respect to time, we favour time entering in a continuous fashion.

In the case of polynomials, they are constrained by having to always consider the global fit. One should note that spline functions are nth order polynomials which are continuously differentiable to the order of n - 1, rather than to the order of n like usual polynomials. Consider this quote from De Boor (1978) on polynomials:

If the function to be approximated is badly behaved anywhere in the interval of approximation, then the approximation is poor everywhere.
This global dependence on local properties can be avoided when using piecewise polynomial approximants.

Spline functions can therefore respond to locally sharp areas of the function without adversely affecting the global fit. J.W. Tukey (cited in Friedman, 1991; 137) makes the argument against polynomials in a more dramatic fashion by stating "polynomials cut people's throats"!

3.6.2 Why Linear Splines?

The model that is suggested here uses linear splines, whereas often higher order spline functions (quadratic, cubic) are favoured. The higher order functions have the property of pushing the discontinuity of the derivatives to a higher order, allowing accurate modelling of smooth functions.13 However, this implies that the higher order spline functions cannot respond rapidly to a threshold type effect. Consider the case when the underlying function is approximately (locally) piecewise linear: the function is initially flat, then drops suddenly (a threshold effect), then is flat again. A linear spline function requires only two break points to capture this effect, one at the start of the drop and one at the end of the drop. A quadratic requires three break points: one at the start of the drop producing a concave to the origin segment, one in the middle producing a convex segment, and one at the end of the drop.

There is also the end point problem. As there are fewer data points constraining the fit near the end points of the approximation, there is the possibility that the spline function may try too hard to fit the data in these areas, resulting in low bias but unacceptably high variance. This problem is particularly acute when higher order spline functions are used. As a partial remedy, it is becoming common to force the splines at the ends of the function to be linear.14 If one uses linear splines from the outset then this adjustment is unnecessary.

13 That is,
functions without locally high second order derivatives; see Friedman (1991; 134).
14 For example, Fuller (1969), Stone and Koo (1985), Friedman and Silverman (1989), Breiman (1991) and Friedman (1991).

As we have no strong prior reason for requiring time (our splined variable) to enter into the model in a continuously differentiable fashion, we use linear splines.

3.6.3 Historical Notes

Splining is an old technique used by draftsmen (Ahlberg, Nilson and Walsh, 1967). Pinning down a flexible piece of material (a thin strip of wood, xylonite, etc.) at separated points allows a line of particular curvature to be drawn between the points. Splines, the actual tool, seem to have been also widely used by actuaries to draw a curve through data. Watson (1937), in his discussion of Orloff (1937), describes the usefulness of the spline for such a task and includes a diagram of a spline and the weights used to pin down the spline. The introduction of spline functions into the mathematics literature is generally credited to Schoenberg (1946). Fuller (1969) seems to be the first to recommend the use of splining to approximate production surfaces and time-series trends, although he called spline functions "grafted polynomials," which is perhaps more informative terminology. He also suggested linearizing at the end points to avoid the end point problem. The use of splining to model structural change in the economics literature was promoted by Poirier (1976). See Chapter 5 for some historical notes on model selection criteria.

3.7 Empirical Results

The data used was the KLEM15 annual U.S. production data of Berndt and Wood (1986a), for the period 1947-1981.16

15 Capital, labour, energy, and materials as inputs, and a single output aggregate.

The reference year was taken to be 1972, i.e. this is the year for which prices were normalized to be equal to one, and quantities
were adjusted accordingly to maintain the price × quantity = value equality.

16 I thank Terry Wales for making this data available to me. This data has been used by Berndt and Wood (1986b, 1987) and Diewert and Wales (1991), among others.

The tables and figures referred to in this section are ordered in increasing order of parameterization. This implies that for each category of results reported (own and cross price elasticities, productivity measures), the results from the no-spline model are reported first, and those from the model with the largest number of splines are reported last.

We now define some notation that is used below. The NQ is the basic normalized quadratic model with a linear time trend. The NQLS_PFGCV is the normalized quadratic model with a piecewise linear time trend structure chosen by the PFGCV. The NQLS_CV is the same as the NQLS_PFGCV, but it has had its time trend structure chosen by the CV. The time trend structures of the NQLS_FGCV and NQLS_AIC were similarly selected by the FGCV (and SC) and AIC, respectively.

The increment at which potential knot positions were considered was determined by M(T, θ), defined by equation (3.46), with T = 35 and θ = 0.05. This gave a span of about four (see Table 3.12), so that we examined placing knots at t = 0, 4, 8, 12, ..., 32. In practice, we place the "knot" at t = 0 a priori, as we wish time to enter the model in at least a linear fashion. The forward placement procedure gave us a model with (an additional) three knots, using the FGCV and the SC. The positions of the knots were, in the order of placement, 1950, 1978 and 1962. The backward deletion strategy did not lead to any improvement, so all three knots were kept. An interesting observation is that the AIC selected a model with an additional four knots, at 1966, 1970, 1958 and 1954.
The concern that the AIC may select a model with too high a parameterization is the motivation for many authors to propose criteria with larger penalty terms. Certainly in this example the AIC has chosen a much higher parameterization than the FGCV and SC. On the other hand, the PFGCV selected the lowest parameterization, with one knot placed at 1950. The CV selected the second lowest parameterization, with two knots placed at 1950 and 1954, the first two possible positions. As the CV is expected to be the most accurate approximation to the PSE, and given the low parameterization chosen by the other criteria, the highly parameterized AIC model is somewhat of an "outlier."

In our application the function d(K) in the penalty term of the PFGCV has the following form: d(K) = 10 + 3K + 1. We have ten non-spline related variables in each equation (including the constant), and K is the number of break points. The 1 on the end of the expression is for the linear time trend; we always allow time to enter at least linearly. The PFGCV selected a model with only one break, at 1950.

Table 3.1 reports the values of the criteria for each model. -LLF is the negative of the log likelihood function, ignoring the constant terms. We see from Table 3.1 that the CV favours its model, the NQLS_CV, clearly over the alternatives. The FCV was also calculated for these models, and it favoured the NQLS_FGCV over the NQLS_CV. This indicates that, although it entails less computation, the FCV is not a good approximation to the CV. The PFGCV actually prefers the NQ to the NQLS_CV. Comparing the criteria values for the NQ, NQLS_CV and NQLS_FGCV, we see that the NQLS_FGCV is the choice of all the adjusted R^2 type criteria (the FGCV, SC, PFGCV and AIC).
This highlights the difference between the RSS and PSE criteria, as different models have been selected by each principle, with the FGCV (and PFGCV) voting with the RSS school. All of the criteria selected 1950 as the first break point, while the FGCV, PFGCV, AIC and SC selected 1978 as the second break point. This makes one nervous of having the end effect problem discussed in Section 3.6.2. The CV, however, selected 1954 as the only other break point. If 1950 is excluded as a potential break point, then the CV prefers the NQ to any one break model, i.e. it has a strong desire to place a knot at 1950. We get some insight into what is going on by examining the measures of productivity.

Figure 3.5 plots the productivity, or technical progress, measures calculated from the NQ, NQLS_CV, NQLS_PFGCV and NQLS_FGCV. Figure 3.6 plots the productivity measures from the NQ and NQLS_AIC. Productivity was calculated by multiplying the derivative of the unit profit function with respect to time by capital (the fixed factor), and dividing by estimated GNP (i.e. the estimate of the output variable times its price). This gives a measure of productivity that is comparable to index number measures.

Figures 3.7 through 3.11 plot the productivity of each model against a raw Fisher's quantity index number approach to calculating multifactor productivity, and a smoothed index number series. To calculate the (chained) index series we dropped the first observation, so it is set to zero for 1947. The smoothed series is different for each model, as the smoothing was done by regressing the raw index series on the relevant time structure, e.g. for the NQ model (Figure 3.7) the raw series was regressed on a linear time trend.
We see that the NQLS_PFGCV, NQLS_CV and NQLS_FGCV generated productivity that is very close to the respective smoothed series, indicating that if all one is interested in is a measure of productivity, then the index number approach is adequate. These models tell us that technical progress in the U.S. has been, on average, quite constant and under 1% in the post-War period. The NQLS_AIC, on the other hand, yields a varying measure of technical progress (Figures 3.6 and 3.11). The correspondence with the smoothed index is not as close as for the other models, but the productivity measure does generally follow the broad movements of the raw index series. Again, this suggests that if the researcher is interested only in obtaining a measure of technical progress, the Fisher's index number approach seems to perform well. So well, in fact, that it can give the current application of optimal spline fitting the interpretation of selecting the appropriate degree of smoothing for the Fisher's index number measure of productivity!

In Figures 3.10 and 3.11 we see that the NQLS_FGCV and NQLS_AIC do indeed seem to have the end point problem at the end of the sample: observe that productivity rises unusually sharply after the 1978 knot, in contrast to the respective smoothed index series, and corresponding to a sharp little increase in the raw index series. The 1950 knot may also be an end point effect, but consider what the raw index series tells us about productivity between 1950 and 1954. We see that this is the only four year period considered for a spline segment where the fall in productivity is not balanced by sharp increases; hence productivity declines almost continuously over this period. This is the sort of effect that we want our spline strategy to catch. As mentioned earlier, on the basis of all the suggested criteria, the 1950 break improves model performance greatly.
Note that if we use the CV for model selection, we do not have the end effect problem at the end of the sample (Figure 3.9).

The NQ exhibits productivity which is monotonically increasing, while its smoothed index series is monotonically decreasing (Figure 3.7). This is indicative of a misspecification problem and is similar to a result found by Diewert and Wales (1992). Using Japanese production data, they found productivity to be monotonically decreasing using the NQ. This strongly suggests that the simple linear time trend is inadequate.

Perhaps the most interesting comparison between the models is in terms of elasticity measures. Own price elasticities are reported for selected years in Tables 3.2 through 3.6, and cross price elasticities are reported for the sample mid-point, 1963, in Tables 3.7 through 3.11. Standard errors for the elasticity estimates are given in brackets. Comparing the own price elasticities between models, we see that the use of a splined time trend results in smaller absolute elasticities (except for labour). This indicates that before the introduction of the spline terms the parameters of the basic model were doing too much work, in that they were capturing the effects of technical progress on the model. The flexible modelling of technical progress through the introduction of splining takes this burden off these parameters, perhaps allowing them to more accurately reflect the underlying elasticities. Looking at the own price elasticities across time, we see that every model gives higher (absolute) elasticity values for each variable at the beginning of the sample than at the end, except for labour. Increasing own price elasticity for labour over time seems to be a result deserving further research.17 The cross price elasticities in Tables 3.7 to 3.11 should be read as follows.
The element in the ith row, jth column of each table is calculated as E_ij = (∂y_i/∂p_j)(p_j/y_i), where y_i is the estimated demand for the ith good and p_j is the price of the jth good. Comparing the cross price elasticities between models, we see that again the estimates are generally reduced in absolute value by the inclusion of splined time trend terms. The most startling observation is that materials and labour are complements in production in the NQ, NQLS_PFGCV, NQLS_FGCV and NQLS_AIC, but substitutes in the NQLS_CV. This sign difference is of great interest. Policy recommendations drawn from the respective models would differ greatly. However, we must be cautious with our conclusions about the relationship between materials and labour. First, we have rather large standard errors for the elasticity estimates.18 Also, our results are not robust to a change in model selection criteria, and may not be robust to a change in the minimum span and forward/backward knot placement strategy. This is good reason to compare the models selected by various criteria, and to use the more computationally burdensome cross-validation. As an example of the burden, it took 4525 seconds of CPU time on a Sun SPARCstation 2 to place the first knot, using the econometrics computer program SHAZAM (1993; White, 1978). If we get a change in the sign of an elasticity by using cross-validation, then the burden is well justified.

17 Diewert and Wales (1992) found the same result using their Japanese data.
18 Although the standard errors may suggest that many of the elasticity estimates are insignificantly different from zero, it is interesting to note that demand is significantly inelastic across models.

3.8 Overview

A strategy for modelling technical progress through adding splines, as determined by the data, to a flexible functional form has been discussed and demonstrated.
It has been found that the Normalized Quadratic Linear Spline model, with adaptive spline fitting, out-performs its parent form, the Normalized Quadratic. The same spline fitting procedure can easily be used for the quadratic spline case (although we made an argument against this case), which has the property of continuity of first derivatives with respect to time. Also, it can be used for splining with respect to other variables, such as utility in the consumer context (allowing non-linear income responses; Diewert and Wales, 1993), and splining in capital to allow for non-constant returns to scale with cross-section data (Chapter 4).

Due to the large difference in price elasticities between the models examined, and the dominance of the NQ model by the NQLS model (based on model selection criteria), it must be concluded that the NQ model is badly misspecified. Policy recommendations drawn from the results of the respective models would differ greatly, especially due to the possibility of cross price elasticities changing signs between models.

It was found that although the Fast Cross-Validation and Fast Generalized Cross-Validation criteria are supposed to be approximations to actual cross-validation, the approximations are poor, as quite different models are selected. It is suggested that it is worthwhile to use cross-validation, despite the added computation that this implies. The choice of model selection criteria can affect the obtained results in a dramatic fashion, making it wise to compare the models selected by various criteria, including the probably more accurate cross-validation.

One conclusion to be drawn from the results may be that more "flexibility" in modelling needs to be introduced, perhaps by having splines in the price variables, and hence making the analytic functional form more dependent on the data.
Table 3.1: Selection Criteria

Criterion   NQ         NQLS_PFGCV   NQLS_CV    NQLS_FGCV   NQLS_AIC
PFGCV       -220.83    -230.32      -215.55    -224.24      -20.04
SC          -164.93    -183.23      -180.17    -204.50     -186.08
CV            92.52      27.03      -169.00       8.42      -10.23
FGCV        -220.83    -243.05      -243.65    -271.35     -262.88
AIC         -229.65    -253.82      -256.65    -286.86     -291.98
-LLF        -273.65    -301.83      -308.65    -342.86     -363.98

Table 3.2: NQ Own Price Elasticities (standard errors in parentheses)

Variable     1950          1960          1970          1980
Output        .632 (.513)   .630 (.505)   .625 (.494)   .455 (.345)
Materials    -.441 (.563)  -.437 (.558)  -.420 (.547)  -.289 (.399)
Energy       -.434 (.124)  -.324 (.086)  -.233 (.054)  -.222 (.048)
Labour       -.254 (.114)  -.337 (.151)  -.452 (.201)  -.331 (.159)

Table 3.3: NQLS_PFGCV Own Price Elasticities

Variable     1950          1960          1970          1980
Output        .163 (.181)   .159 (.176)   .155 (.170)   .122 (.125)
Materials    -.186 (.222)  -.194 (.223)  -.201 (.223)  -.138 (.157)
Energy       -.336 (.106)  -.263 (.072)  -.199 (.046)  -.200 (.042)
Labour       -.138 (.082)  -.183 (.108)  -.243 (.143)  -.174 (.106)

Table 3.4: NQLS_CV Own Price Elasticities

Variable     1950          1960          1970          1980
Output        .167 (.161)   .162 (.146)   .157 (.146)   .127 (.110)
Materials    -.105 (.180)  -.106 (.173)  -.115 (.178)  -.077 (.123)
Energy       -.409 (.111)  -.280 (.069)  -.206 (.045)  -.208 (.042)
Labour       -.265 (.103)  -.331 (.128)  -.446 (.172)  -.330 (.130)
Table 3.5: NQLS_FGCV Own Price Elasticities

Variable     1950          1960          1970          1980
Output        .198 (.158)   .188 (.144)   .198 (.145)   .156 (.113)
Materials    -.046 (.165)  -.041 (.156)  -.040 (.159)  -.028 (.122)
Energy       -.204 (.069)  -.111 (.042)  -.059 (.028)  -.062 (.028)
Labour       -.275 (.093)  -.340 (.115)  -.457 (.154)  -.361 (.126)

Table 3.6: NQLS_AIC Own Price Elasticities

Variable     1950          1960          1970          1980
Output        .131 (.114)   .128 (.106)   .132 (.103)   .105 (.079)
Materials    -.110 (.205)  -.117 (.207)  -.127 (.211)  -.092 (.156)
Energy       -.289 (.069)  -.193 (.048)  -.142 (.038)  -.150 (.039)
Labour       -.270 (.127)  -.347 (.162)  -.458 (.213)  -.357 (.177)

Table 3.7: NQ Cross Price Elasticities, 1963

Variable     Output        Materials     Energy        Labour
Output        .624 (.497)  -.358 (.403)  -.067 (.029)  -.199 (.116)
Materials     .605 (.680)  -.424 (.548)  -.055 (.034)  -.127 (.147)
Energy       1.692 (.731)  -.825 (.514)  -.230 (.007)  -.568 (.233)
Labour        .720 (.420)  -.272 (.317)  -.081 (.033)  -.368 (.165)

Table 3.8: NQLS_PFGCV Cross Price Elasticities, 1963

Variable     Output        Materials     Energy        Labour
Output        .157 (.173)  -.089 (.142)  -.037 (.023)  -.032 (.055)
Materials     .150 (.241)  -.194 (.221)  -.019 (.027)  -.063 (.057)
Energy        .918 (.584)  -.288 (.396)  -.246 (.065)  -.384 (.203)
Labour        .117 (.202)   .137 (.122)  -.056 (.030)  -.198 (.118)

Table 3.9: NQLS_CV Cross Price Elasticities, 1963

Variable     Output        Materials     Energy        Labour
Output        .158 (.146)  -.036 (.108)  -.040 (.023)  -.082 (.069)
Materials     .061 (.182)  -.108 (.173)  -.016 (.026)   .063 (.061)
Energy       1.017 (.573)  -.236 (.382)  -.260 (.063)  -.521 (.204)
Labour        .301 (.253)   .136 (.132)  -.076 (.030)  -.361 (.140)

Table 3.10: NQLS_FGCV Cross Price Elasticities, 1963

Variable     Output        Materials     Energy        Labour
Output        .187 (.141)  -.037 (.100)  -.026 (.014)  -.124 (.068)
Materials     .062 (.170)  -.039 (.154)  -.009 (.016)  -.014 (.066)
Energy        .658 (.357)  -.140 (.238)  -.093 (.038)  -.426 (.133)
Labour        .458 (.252)  -.029 (.142)  -.063 (.020)  -.366 (.124)

Table 3.11: NQLS_AIC Cross Price Elasticities, 1963

Variable     Output        Materials     Energy        Labour
Output        .118 (.095)  -.025 (.101)  -.028 (.013)  -.065 (.069)
Materials     .043 (.173)  -.110 (.191)  -.011 (.017)  -.077 (.058)
Energy        .711 (.340)  -.156 (.245)  -.169 (.043)  -.386 (.138)
Labour        .240 (.255)   .167 (.125)  -.056 (.020)  -.350 (.164)

Table 3.12: Selected M(T, δ) Values

δ \ T       35     65     100    500
0.05        3.77   4.12   4.37   5.30
0.065       3.60   3.97   4.22   5.14
0.085       3.45   3.81   4.05   4.98
0.1         3.35   3.71   3.96   4.88

[Figure 3.1: Monte Carlo #1, N(0,1); average absolute error by parameterization. AIC top line, SC middle line, PGCV bottom line, sample mean horizontal.]

[Subsequent Monte Carlo average absolute error panels: only axis fragments survive in the source.]
[Further Monte Carlo average absolute error panels: only axis fragments survive in the source.]

[Figure 3.6: Productivity over time; NQLS_PFGCV solid line, NQLS_AIC broken line, NQ dotted line.]

[Figure 3.7: NQ vs. Index Measures.]

[Figure 3.8: NQLS_PFGCV vs. Index Measures; NQLS_PFGCV solid line, smoothed index broken line, raw index dotted line.]

[Figure 3.9: productivity/index comparison; caption garbled in source.]

[Figure 3.10: NQLS_FGCV vs. Index Measures.]

[Figure 3.11: NQLS_AIC vs. Index Measures; NQLS_AIC solid line, smoothed index broken line, raw index dotted line.]

Chapter 4

Non-Parametric Estimation of Returns to Scale

Since true models do give the best predictions, and since prediction is presumably what any science is about, it is important to be able to decide which are true models and which are not.
Peter Schmidt (1971; 1)
4.1 Introduction

In resource economics we are often faced with sparse cross-section data, where non-constant returns to scale may be present. The usual practice in this context is to use a functional form in which the capital variable enters in a linear fashion. A flexible functional form model for factor demands, where the addition of splines in the fixed variable, capital, allows non-constant returns to be captured in a flexible fashion, seems an attractive alternative. The model is then not only flexible in the usual sense, but also has additional flexibility properties with respect to capital. Specifically, arbitrary returns to scale can be attained between the spline break points. Combined with a spline fitting strategy, such as that presented in Chapter 3, its potential applications are numerous. This chapter considers an application to data on the British Columbia sablefish fishery.[1]

[1] I thank Quentin Grafton for making this data available. More details than are given in the current paper on both the fishery and data can be found in Grafton (1992, 1994). These studies used only one cross-section of data for 1988 to investigate rent capture issues. The current paper uses an additional cross-section for 1989.

Although suggested by Diewert and Wales (1990), this is the first application of the normalized quadratic net revenue function with linear splines in capital (NQLSK). The model is similar to that of Chapter 3, the major difference being that in the current context we spline in the capital variable (which is vessel length in this application), rather than the time variable. The adaptive spline fitting strategy, which allows the non-parametric estimation of returns to scale, is the same as that of Chapter 3. The most commonly used flexible functional form in resource economics is the translog of Christensen, Jorgenson and Lau (1971).
Applications in the literatures of fisheries (Squires, 1987), agriculture (Brown and Christensen, 1981), energy (Berndt and Wood, 1975), water (Renzetti, 1992) and exhaustible resources (Moroney and Trapani, 1981) are common. It is suggested that the NQLSK model, with the proposed spline fitting strategy, is a strong competitor for the choice of functional form in such applications. Not only is the NQLSK model flexible, but correct curvature can be imposed globally and arbitrary returns to scale can be attained between the spline break points.[2]

[2] The translog form can also be amended to allow for splining in the capital (or time) variable, but the global imposition of correct curvature is not possible while maintaining the flexibility property; see Jorgenson and Fraumeni (1981), and Diewert and Wales (1987; 48).

As in Chapter 3, the use of various model selection methods is investigated for the choice of spline structure, and the resulting models are compared. It is found that, of the methods considered, only Akaike's Information Criterion (AIC) and the cross-validation score (CV) prefer the NQLSK model over the no-spline model (NQK). The AIC selected a three break point model, while the CV selected a one break point model, with this point not being one of those selected by the AIC. Section 4.2 describes the NQLSK model. Section 4.3 presents the model selection methods used and compares the values of the relative marginal penalties of the model selection criteria in the current context. These values are the same as for the application in Chapter 3. Section 4.4 describes the sablefish fishery and the data. The empirical results are in Section 4.5, and Section 4.6 concludes.

4.2 The NQLSK Model

This section briefly presents the normalized quadratic net revenue function with linear splines in capital.
This functional form allows the model to capture the effects of non-constant returns to scale, while retaining the desirable properties described in the previous chapter for the profit function. The net revenue function shares the same neo-classical properties as the profit function. Diewert and Wales (1990) represent the net revenue function as follows:

    r(p, k) = h(p, k) + d(p, k)                                    (4.1)

where p is the vector of prices, and

    h(p, k) = b·p + (1/2)(α·p)^(-1) p'AA'p k                       (4.2)

where A is a lower triangular matrix. The following restrictions are assumed to hold at the reference prices p* >> 0_N, where there are N variables:

    A'p* = 0_N,                                                    (4.3)

and the predetermined parameter vector α satisfies:

    α·p* = 1,                                                      (4.4)
    α > 0_N.                                                       (4.5)

The use of AA' as the interaction matrix ensures that correct curvature is attained globally (Theorem 1, page 48). We now consider how the splines in capital are introduced. In the three spline (two break point, or two "knot") case, d(p, k) is defined as:

    d(p, k) = d^1(p, k)   for 0 <= k <= k_1,
              d^2(p, k)   for k_1 < k <= k_2,                      (4.6)
              d^3(p, k)   for k_2 < k.

The functions d^i(p, k), i = 1, 2, 3, can then be defined to be linear, quadratic, or of some higher order. In the linear spline case we have:

    d^1(p, k) = (b^(1)·p)k;                                        (4.7)
    d^2(p, k) = d^1(p, k_1) + (b^(2)·p)(k - k_1);                  (4.8)
    d^3(p, k) = d^2(p, k_2) + (b^(3)·p)(k - k_2).                  (4.9)

Note that if the b^(i) (N-dimensional vectors of parameters) are all equal, then we are back to the basic normalized quadratic form. If this is not the case (as we would generally expect), then we have the NQLSK net revenue function. Also note that, in empirical applications, it is necessary to order the data so that it is in ascending order with respect to capital. Diewert and Wales (1990) showed that this functional form is not only flexible,[3] but also curvature correct (convex), and it has the additional flexibility property of being able to attain arbitrary returns to scale between each break point k_i.
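Equations (4.6)-(4.9) define a continuous piecewise-linear function of k whose slope with respect to k changes at each knot. A minimal sketch of the evaluation follows; the parameter vectors b^(i) and the knot locations are hypothetical placeholders, not estimates.

```python
import numpy as np

def d_spline(p, k, b, knots):
    """Linear-spline term d(p, k) of equations (4.6)-(4.9).

    b     : K+1 parameter vectors b^(1), ..., b^(K+1)
    knots : break points k_1 < ... < k_K
    """
    p = np.asarray(p, dtype=float)
    val, lo = 0.0, 0.0
    for i, bi in enumerate(b):
        hi = knots[i] if i < len(knots) else np.inf
        seg = min(k, hi) - lo               # portion of [lo, hi] actually used
        if seg <= 0:
            break
        val += (np.asarray(bi) @ p) * seg   # slope b^(i).p on this segment
        lo = hi
    return val

# With all b^(i) equal this collapses to (b.p)k, the basic NQ case.
```

Continuity at each knot holds by construction, since each segment starts from the accumulated value of the previous one, mirroring the d^i(p, k_i) terms in (4.8) and (4.9).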
[3] Actually, it is what they call RS flexible. To be fully flexible, the term (1/2)c(α·p)k^2 (where c is an additional unknown scalar parameter) would have to be added to equation (4.1), so that there are enough parameters for the approximating function to attain coincidence of second derivatives with respect to k with some arbitrary neo-classical function.

In other words, returns to scale are no longer restricted to enter the model in a linear fashion; rather, they can take arbitrary values within each capital segment. The estimating equations are derived by differentiating the net revenue function with respect to prices and substituting in the restrictions, as described in Section 3.2. Local returns to scale at each point (p, k) can be thought of as the profit maximizing increase in net revenue accruing from a one percent increase in capital utilisation. We can then define returns to scale (RS) as:

    RS(p, k) = [∂r(p, k)/∂k][k/r(p, k)].                           (4.10)

If we have constant returns to scale, so that r(p, k) = r(p, 1)k, then RS(p, k) = 1. If, however, RS(p, k) > 1 then we have increasing returns to scale, and reversing the inequality indicates decreasing returns to scale. The next section addresses the problem of how many break points there should be and where they should be placed.

4.3 Model Selection

This section reiterates the spline fitting strategy presented in Chapter 3, with adjustments for the altered context. In Chapter 3 the variable which was splined was the time trend, whereas now we are considering splining the capital (fixed factor) variable, which is vessel length (in meters) in our application. We have T observations on the capital variable, k, with each observation being a potential break point. Every k could be treated as a break point, but the problem would be overfitting: a lot of noise could be introduced into the model. Therefore, the number of splines to be included must be restricted in some fashion.
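The measure in (4.10) can be computed for any estimated net revenue function by numerical differentiation. In the sketch below, r is a hypothetical constant-returns function, so RS should come out as one.

```python
def returns_to_scale(r, p, k, h=1e-6):
    """RS(p, k) = [dr(p, k)/dk] * k / r(p, k), equation (4.10)."""
    drdk = (r(p, k + h) - r(p, k - h)) / (2.0 * h)   # central difference
    return drdk * k / r(p, k)

# Hypothetical CRS net revenue: r(p, k) = r(p, 1) * k, so RS(p, k) = 1.
r_crs = lambda p, k: (2.0 * p[0] + 3.0 * p[1]) * k
rs = returns_to_scale(r_crs, (1.0, 1.0), 4.0)
```

Values of rs above one would indicate increasing returns to scale, and values below one decreasing returns, exactly as in the discussion of (4.10).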
Friedman and Silverman (1989) suggested increments determined by:

    M(T, δ) = -log_2[-(1/T) ln(1 - δ)]/2.5                         (4.11)

with 0.05 <= δ <= 0.1, where T is the sample size (see Appendix B). This minimum span concept avoids the potential "packing" problem. If the candidate models are too tightly packed together, then we have to choose between them in terms of small local properties. This can lead to models with low local lack-of-fit bias, but high variance. As in Chapter 3, K denotes the number of knots, while the traditional notation of capital with the lowercase k is maintained.

1. The first knot (K = 1), or break point, is chosen from the T/M potential positions by trying each within the NQLSK model and selecting the model with the best performance.

2. Then each additional knot position is selected from the remaining T/M - K + 1 potential positions by placing it in the position that optimizes the model's performance, given it and the K - 1 knots that have already been selected.

3. This strategy produces a sequence of models, each with one break point more than the preceding model. From among these models the best performer is chosen on the basis of criteria discussed below.

4. The chosen model, with K* knots (0 <= K* <= K_max), is subjected to a backward stepwise deletion procedure to see if all the knots placed in the forward procedure are needed, given the K* - 1 other knots; the break point that was selected 2nd, for example, may no longer be significant to the model now that the 3rd, 4th and 5th breaks have been added. Each knot is deleted in turn and is removed permanently if the performance of the model is improved by its deletion. This process is continued until the deletion of another knot fails to improve the model.

4.3.1 Selection Criteria

The criteria used to determine model performance were the same as those in Chapter 3.
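The minimum span formula (4.11) can be checked directly against the entries of Table 3.12; the sketch below assumes a natural log inside and a base-2 log outside, which reproduces those tabulated values.

```python
import math

def minimum_span(T, delta):
    """M(T, delta) = -log2(-(1/T) ln(1 - delta)) / 2.5, equation (4.11)."""
    return -math.log2(-(1.0 / T) * math.log(1.0 - delta)) / 2.5
```

For example, minimum_span(35, 0.05) is about 3.77 and minimum_span(500, 0.1) about 4.88, matching Table 3.12; with T = 35 the span is rounded to the four observations used in both applications.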
Specifically, the Fast Generalized Cross-Validation Criterion (FGCV), the Fast Penalized Generalized Cross-Validation Criterion (PFGCV), the Akaike Information Criterion (AIC), the Schwarz Criterion (SC), and a multiple equation context formulation for the cross-validation score (CV):

    FGCV  = -log L(β̂) - GT log(1 - h/T)                            (4.12)
    PFGCV = -log L(β̂) - GT log(1 - d(K)/T)                         (4.13)
    AIC   = -log L(β̂) + n                                          (4.14)
    SC    = -log L(β̂) + n log(GT)/2                                (4.15)
    CV    = (T/2) log|Σ_CV|                                        (4.16)

where log L(β̂) is the maximized log likelihood, T is the sample size, G is the number of equations in the system, h is the number of variables in each equation, h = q + (K + 1), q is the number of non-spline related parameters in each equation, K is the number of spline knots, d(K) = q + (3K + 1), n = G × h, and Σ_CV is the variance-covariance matrix derived from the cross-validation residuals. These are all negative objective functions, the goal being to find the model which minimizes the criterion. In Chapter 2 and Chapter 3 we have discussed the marginal penalties implied by the FGCV, PFGCV, AIC and SC. The marginal penalty of a criterion is the increase in the penalty term of the criterion due to the addition of a parameter, which is traded off against the increase in the log likelihood. If the penalty increases by more than the increase in the log likelihood, then the inclusion of the additional parameter is rejected.[4] In the application considered in this chapter, we have the same number of observations and parameters in our model as in the application of the previous chapter. Therefore, for the sake of reference and elaboration, the marginal penalty terms for each selection criterion in these applications are presented in Table 4.1.

[4] Actually, it is wise to consider models past the parameterization where the additional parameter is not accepted, as the criteria can increase/decrease locally.
The entries for the AIC and SC are constant as the number of parameters is increased, indicating that the last degree of freedom is worth the same as the first for these criteria. This is not the case for the FGCV and PFGCV, which penalize the addition of parameters with rapidly growing marginal penalties as degrees of freedom disappear. In fact, the PFGCV places such a high penalty on the additional parameter that, by the last break, there are no remaining degrees of freedom according to this criterion. The differences between these criteria are obviously large as the number of parameters increases for fixed T, and the models selected would be anticipated to also be quite different.

4.4 Fishery and Data Description

The British Columbia sablefish fishery had a gross landed value of around CDN$18 million in 1990. The main customer for the harvest is the Japanese market, where sablefish are sold primarily for use in traditional winter meals (nabemono) due to their high fat content. Access to the fishery has been controlled since 1981. The empirical work below uses data from the years 1988 and 1990. The number of vessels with fishing rights was 48 in 1988, and 30 in 1990. There are two types of fishing technologies being utilized: longline and trap. The longline technology entails laying baited hooks, while the trap technology uses baited traps. Typically, the type of technology used is related to vessel length, with the larger vessels using traps.

The two cross-sections of data available for 1988 and 1990 contained information on 27 and 8 vessels for the two years, respectively. This data was pooled to create a sample of 35 observations, with 21 vessels using the trap technology. Each vessel participated in other fisheries. Summary statistics for the price and quantity data used in this study are tabulated in Tables 4.9 and 4.10, respectively.
Confidentiality requirements prevent the reporting of the data in more detail. The original sources for the data are now given. Data on total expenditures on fuel and labour, total revenue from sablefish and other species, vessel length and home port identification came from the Canadian Department of Fisheries and Oceans' 1988 and 1990 cost and earnings surveys. Prices for the two outputs, sablefish and other species, were derived from landings data from the same agency.[5] The corresponding quantities were calculated by dividing the respective revenues by the prices. Port diesel prices from Chevron Canada and home port information were combined to obtain fuel prices per vessel. See Grafton (1992, 1994) for further details.

[5] For species other than sablefish, an aggregate price series was calculated using the Fisher price index with the mean price taken as the base: p_F = [(Σ_j p_j ȳ_j / Σ_j p̄_j ȳ_j)(Σ_j p_j y_j / Σ_j p̄_j y_j)]^(1/2), where the bar denotes the mean for the jth species. This aggregation entails the assumption of weak separability (Sono, 1945/1961; Leontief, 1947a,b) between sablefish and other species.

4.5 Empirical Results

The five goods in the estimated model included two outputs (sablefish, other species), two variable inputs (labour, fuel), and a fixed input (vessel length). The variable inputs were treated as negative outputs in estimation, and splines were introduced in vessel length. Cross-section data from 1988 and 1990 was pooled to create a sample of 35 observations. The data was ordered by increasing vessel length, prices were normalized to be one for the first observation, and quantities adjusted appropriately. Vessel length was normalized by the first observation to create a vessel length index with a starting value of one. This series is plotted in Figure 4.1 and summary statistics are tabulated in Table 4.10.
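The aggregator in footnote 5 is the Fisher ideal index, the geometric mean of Laspeyres and Paasche indexes with the sample means taken as the base. Since the printed formula is partly garbled, the sketch below is a standard reconstruction rather than a transcription.

```python
import numpy as np

def fisher_price_index(p, y, p_base, y_base):
    """Fisher ideal price index: sqrt(Laspeyres * Paasche).

    p, y           : current prices and quantities by species
    p_base, y_base : base (here, sample mean) prices and quantities
    """
    p, y = np.asarray(p, float), np.asarray(y, float)
    p0, y0 = np.asarray(p_base, float), np.asarray(y_base, float)
    laspeyres = (p @ y0) / (p0 @ y0)   # base-quantity weights
    paasche = (p @ y) / (p0 @ y)       # current-quantity weights
    return np.sqrt(laspeyres * paasche)
```

At the base point the index equals one, and doubling all prices doubles it, as any sensible price index should.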
To give readers interested in fishery details an idea of actual vessel size in meters, the first observation (shortest vessel) had a length of 12.75 meters (Table 4.10). The NQLSK model, augmented with the spline fitting strategy described in Section 4.3, was used. The different model selection criteria of Section 4.3.1 were used in combination with this strategy, and results for the NQLSK models selected by the differing criteria are compared below. Linear splines were used, as the technology employed can change as vessel length increases (see Section 4.4), implying the possibility of threshold effects. Returns to scale could change suddenly in this case. These sudden changes are better captured by linear splines (see Section 3.6.2). Noting that, by pure coincidence, there are 35 observations in the current sample, as there were in the application of Chapter 3, we have the same minimum span as previously, i.e. four. The potential positions for break placement were then observations 0, 4, 8, 12, ..., 32, with the "break" at 0 imposed a priori to allow capital to enter in at least a linear fashion. The only two criteria to select a NQLSK model over the NQK (no splines) model were the AIC and CV. The AIC selected a spline structure with three breaks: at observations 32, 28 and 24, in order of placement. The CV selected a model with a single break, placed at observation 8. The other model selection criteria considered (the FGCV, PFGCV and SC) all preferred the no-spline option. The values for these criteria were calculated for the AIC selected model (NQLSK_AIC) and the CV selected model (NQLSK_CV), and are compared with their values for the estimated linear capital model, NQK, in Table 4.2. The negative of the maximized log likelihood value, ignoring the constant terms, is denoted by -LLF in this table. The smaller the value of the criterion, the better the criterion evaluates the corresponding model.
The fact that most of the criteria clearly prefer the more restrictive model comes as somewhat of a surprise, as we may have expected some degree of flexible modelling to be favoured. Given that the AIC is generally thought to select too high a parameterization in empirical applications, we have to discount the finding that this criterion favours a spline model, especially as the other, possibly more reasonable, criteria do not. The CV also selected a spline model, but with a lower parameterization than the AIC. It is interesting to note the difference in relative model parameterization between this and the application in Chapter 3. In Chapter 3, the FGCV and SC both selected a higher parameterization (the same model) than the CV, whereas here the FGCV and SC have both selected a lower parameterization than the CV. This indicates that the relationships between the CV and these criteria are not fixed in different applications. However, it is the case that the AIC selected a (much) higher parameterization than the CV in both applications. The Fast Cross-Validation (FCV) criterion that was introduced in Chapter 3 (equation (3.58)) as an approximation to the CV was calculated for the three competing models. Although it selected the same model as the FGCV in the application of Chapter 3, here it preferred the same model as the CV. This is encouraging for its use as a fast way of performing cross-validation. Tables 4.3 to 4.5 present some own price elasticity estimates for the selected models, NQLSK_CV, NQLSK_AIC and NQK. The particular values presented were chosen so as to represent observations in each spline segment in the NQLSK_AIC, i.e. observations 10, 25, 29, 33. The elasticities do not differ as much between models as those of Chapter 3. Once again, we have rather high standard errors for the elasticity estimates. Fuel is the exception, having small standard errors relative to the elasticities, but having some very large elasticity estimates, such as -17.37 for observation 10 in
Fuel is the exception, having small standard errors relative to the elasticities, but having some very large elasticity estimates such as -17.37 for observation 10 in Chapter 4. Non-Parametric Estimation of Returns to Scale the NQK. 105 In fact, the first three own price elasticities for fuel are positive (with small standard errors) in this model. The same occurs for the NQLSKcv where the reported own price elasticity of fuel for observation 10 (and another couple in the sample) takes a positive value of 32.19. This is due to the estimated demand for fuel being negative for these observations. This is a little strange, and does not occur in the NQLSKAIC model. The incorrect sign for some of the fuel elasticities could result for the following reasons. First, it may be due to a lack of flexibility in the model. Second, it suggests that one or more of the estimated demand equations is fitting very well at the expense of the other demand equations— in this case fuel demand is the victim. However, the values of B2 between the observed and predicted values for each equation do not indicate that any one equation is fitting much better than any other. Third, there could be a problem with the data. Summary statistics for the price and quantity data, are given in Tables 4.9 and 4.10.6 The quantity of fuel used by observation 34 is more than twice that used by any other vessel. The next largest fuel quantity is for observation 33, and this value is also large relative to the other observations. As both of these observations are contained in the last considered spline segment, any vessel over the length of that of observation 32 was dropped and the models were estimated using the truncated sample. The same break points were selected by each criterion, except that 32 was no longer a valid break point for selection, i.e. the chosen break points were 28 and 24 for the AIC (in order of selection) and 8 for the CV. 
The NQLSK_CV own price elasticities all have the correct signs using this truncated sample. The NQK and NQLSK_AIC also yield more sensible results than before with respect to the fuel price elasticities. It should be noted that the procedure of selecting only the observations that one likes the look of is statistically dubious, and resembles a process of torturing the data until it validates our favoured model, or gives us the results we find most appealing. The quantity series for fuel also looks strange at points other than at observations 33 and 34. For example, comparing observations 21 and 23, which face the same fuel prices, the fuel quantity for observation 23 is fourteen times larger than for observation 21. This suggests that there may be problems with this series which will not be cured by the deletion of a few observations. Moreover, the own price elasticities for the NQLSK_AIC, the alternative model to the NQK and NQLSK_CV, do not have the problem of incorrectly signed elasticities, suggesting that we should not be selective about the data that we use. Hence the full sample results are reported. Generally, all the own price elasticities are smaller the longer the vessel. This implies that the larger the vessel, the less sensitivity to output/input price changes. This makes sense, as we might expect the smaller fishing operations to be more easily influenced by fluctuations in prices (but perhaps not by as much as the fuel elasticities suggest!). Tables 4.6 to 4.8 present the cross-price elasticities for observation 29, selected as this is the observation for which we seem to have the largest divergence between the NQLSK_AIC and NQK, as we shall see later when we investigate returns to scale measures.
Indeed, a cursory view of the two tables reveals quite large differences in values from the two models. For example, the cross-price elasticity between fuel and labour is .195 for the NQK, but .627 for the NQLSK_AIC, with admittedly large standard errors. The values in these tables are read as follows. The element in the ith row, jth column of each table is calculated by E_ij = (∂y_i/∂p_j)(p_j/y_i), where y_i is the estimated demand for the ith good and p_j is the price for the jth good. Note that there is a difference in the effects of the additional splines in this context compared to that of the previous chapter. In Chapter 3 we saw that the addition of splines resulted in the reduction of the absolute values of the elasticity estimates. It was noted that this was probably due to the parameters in the basic model, before the introduction of splines, doing too much work, in the sense of capturing effects that should be taken into account by another variable (the splined time trend). In the current application, the absolute values of the elasticities are often increased by the addition of spline terms in the capital variable, i.e. some of the common parameters between the models seem to be doing more work once the splines have been introduced. Perhaps this is because parameters were previously able to delegate work to other parameters, but now, with more flexible modelling, they have to do the work themselves. However, as some elasticities increase and some decrease with the addition of spline terms, and standard errors are generally quite large, we cannot read too much into this result. Possible evidence for the inappropriateness of the NQLSK_AIC can be found by examining Figure 4.1. This graphs the returns to scale measures from the three competing models, as well as the vessel length index.
Increasing returns to scale is found if the returns to scale measure (RS) in equation (4.10) is greater than one. We see that the NQLSKAIC suggests that there are very large increasing returns to scale for some of the larger vessels. This seems strange, especially as returns to scale take a value of almost 12 for observation 29. If returns to an additional unit of capital were truly so high, then it is surprising that there are vessels in this range of vessel length, especially as RS drops off rapidly towards the end of the sample. A similarly strange result is found for the NQLSKcv model's measure of RS: for smaller vessels it takes on negative values. In contrast to the fluctuations in RS found by the NQLSKAIC and NQLSKcv, the NQK finds RS to be almost a mirror image of vessel length: as vessel length increases gradually, RS decreases gradually.

In Figure 4.1, we can see clearly the arguments about the approximation properties of spline functions, discussed in Section 3.6. When not allowing for some local effect through the use of an additional spline segment, there is a close correspondence between the measures of returns to scale from the spline and no-spline models.

4.6 Overview

The first application of the normalized quadratic net revenue function with linear splines in capital (NQLSK) was presented, in combination with a spline fitting strategy which determines simultaneously the number and position of breaks to be included in the spline function. With the introduction of this adaptive fitting of splines in capital, we allow the non-parametric estimation of returns to scale. The spline fitting algorithm employs a model selection criterion to determine the best performing spline structure. The various model selection criteria used in Chapter 3 were also used here, and the resulting models selected by the different criteria were compared.
The application was to data from the British Columbia sablefish fishery. Rent capture is currently an issue in this fishery, and studies by Grafton (1992, 1994) on this issue have used a version of the no-spline, NQK model. In using an extended data set (two cross-sections, rather than one), and considering the impact of allowing for the non-parametric estimation of returns to scale, this chapter has greatly expanded previous work on estimating demand systems in this particular context. Also, the chapter has answered some applied econometric questions (such as the effects of using different selection criteria), which are of general interest.

It was found that most of the selection criteria selected a model with no break points. The only two criteria that selected a spline model over the no-spline model were Akaike's Information Criterion and the Cross-Validation score. The elasticity estimates did not differ greatly between the three competing models, and the returns to scale measures from the spline models fluctuated markedly over the sample. The degree to which the returns to scale measures varied depending on the model selection criterion used again indicates the importance of considering a variety of criteria.

Chapter 4. Non-Parametric Estimation of Returns to Scale

Table 4.1: Marginal Penalties

  Break Number    AIC    SC     FGCV   PFGCV
  1               4.00   9.88   5.96   18.69
  2               4.00   9.88   6.22   21.58
  3               4.00   9.88   6.51   25.53
  4               4.00   9.88   6.83   31.24
  5               4.00   9.88   7.18   40.28
  6               4.00   9.88   7.57   56.77
  7               4.00   9.88   8.00   97.04
  8               4.00   9.88   8.49

Table 4.2: Selection Criteria

           NQK       NQLSKcv   NQLSKAIC
  CV       1577.41   1201.30   1567.70
  PFGCV    1648.63   1665.62   1701.68
  FGCV     1648.63   1652.88   1654.58
  SC       1704.53   1712.70   1721.43
  AIC      1639.81   1642.10   1639.06
  -LLF     1595.81   1594.10   1583.06
Table 4.3: NQK Own Price Elasticities
(Standard errors in parentheses.)

  Observation      10        25        29        33
  Sablefish        .248      .167      .147      .151
                   (.643)    (.440)    (.412)    (.393)
  Other Species    .021      .013      .012      .016
                   (.037)    (.029)    (.047)    (.035)
  Fuel             -17.37    -2.08     -.844     -1.95
                   (6.25)    (.733)    (.282)    (.668)
  Labour           -.222     -.196     -.176     -.174
                   (.619)    (.557)    (.544)    (.494)

Table 4.4: NQLSKcv Own Price Elasticities

  Observation      10        25        29        33
  Sablefish        .275      .161      .142      .143
                   (.768)    (.456)    (.425)    (.401)
  Other Species    .026      .016      .014      .019
                   (.040)    (.030)    (.047)    (.036)
  Fuel             32.19     -2.24     -.873     -2.04
                   (11.15)   (.765)    (.283)    (.678)
  Labour           -.248     -.212     -.188     -.186
                   (.651)    (.568)    (.558)    (.501)

Table 4.5: NQLSKAIC Own Price Elasticities

  Observation      10        25        29        33
  Sablefish        .194      .137      .200      .129
                   (.548)    (.389)    (.588)    (.366)
  Other Species    .020      .013      .004      .008
                   (.028)    (.022)    (.014)    (.013)
  Fuel             -13.83    -2.94     -1.19     -1.39
                   (4.89)    (1.01)    (.396)    (.470)
  Labour           -.332     -.292     -.370     -.215
                   (.669)    (.604)    (.858)    (.445)

Table 4.6: NQK Cross Price Elasticities for Observation 29
(Rows: estimated demands; columns: prices. Standard errors in parentheses.)

                   Sablefish  Other Species  Fuel      Labour
  Sablefish        .147       -.019          -.042     -.089
                   (.412)     (.097)         (.064)    (.313)
  Other Species    -.036      .012           -.018     .041
                   (.181)     (.047)         (.039)    (.127)
  Fuel             .558       .130           -.844     .195
                   (.856)     (.273)         (.282)    (.785)
  Labour           .192       -.048          .032      -.176
                   (.677)     (.147)         (.128)    (.544)
Table 4.7: NQLSKcv Cross Price Elasticities for Observation 29

                   Sablefish  Other Species  Fuel      Labour
  Sablefish        .142       -.018          -.037     -.090
                   (.425)     (.098)         (.063)    (.321)
  Other Species    -.034      .014           -.025     .044
                   (.182)     (.047)         (.037)    (.130)
  Fuel             .481       .173           -.873     .263
                   (.832)     (.262)         (.283)    (.763)
  Labour           .195       -.051          .044      -.188
                   (.697)     (.152)         (.126)    (.556)

Table 4.8: NQLSKAIC Cross Price Elasticities for Observation 29

                   Sablefish  Other Species  Fuel      Labour
  Sablefish        .200       -.011          -.034     -.160
                   (.588)     (.095)         (.102)    (.500)
  Other Species    -.010      .004           -.018     .023
                   (.088)     (.014)         (.028)    (.075)
  Fuel             .398       .228           -1.19     .627
                   (1.19)     (.359)         (.396)    (1.13)
  Labour           .314       -.048          .105      -.370
                   (.976)     (.160)         (.189)    (.858)

Table 4.9: Summary Statistics: Price Data on the British Columbia Sablefish Fishery

                   Mean      Standard Deviation   Minimum   Maximum
  Labour           1.1033    0.18507              1.00000   1.4411
  Fuel             1.0461    0.06028              0.96071   1.1454
  Other Species    1.0815    0.30466              0.27864   1.8587
  Sablefish        1.1196    0.23106              0.90163   1.8030
  Capital          1.2673    0.56661              0.34029   3.2043

Table 4.10: Summary Statistics: Quantity Data on the British Columbia Sablefish Fishery

                       Labour    Fuel     Other Species   Sablefish   Capital
  Mean                 169450    40406    232290          310650      5.39
  Standard Deviation   56697     4161     36958           122000      12.75
  Minimum              771050    228760   952150          1415200     35.67
  Maximum              271840    30062    303720          418470      19.54

[Fig. 4.1: Returns to Scale. NQK: solid line; NQLSKcv: broken line; NQLSKAIC: dots and dashes; vessel length: dotted line. Horizontal axis: observation number.]

Chapter 5

Model Selection Criteria: A Reference Source

It is not my intention to recommend any single criterion as a definitely superior criterion or as a panacea of the problem of selection.
On the contrary, the general picture that has emerged from this paper is that all of the criteria considered are based on a somewhat arbitrary assumption which cannot be fully justified, and that by slightly varying the loss function and the decision strategy one can indefinitely go on inventing new criteria. This is what one would expect, for there is no simple solution to a complex problem.

Takeshi Amemiya (1980; 352)

5.1 Introduction

In this chapter, model selection criteria are described and compared. Most of the criteria discussed here have already been introduced in the previous chapters. This chapter could then be viewed as an appendix to the above. A few details are repeated for ease of reference. However, the overall content is substantially different compared with the previous chapters, as the emphasis here is on creating a reference source for applied researchers rather than on presenting an application which uses the criteria. The previous applications aid in the understanding of the relative properties of the criteria, while the general survey/reference source nature of this chapter perhaps justifies its status as a chapter rather than an appendix.

Section 5.2 reviews the selection criteria used in this thesis, and presents a few more well known criteria. Historical notes are also included for the reader's interest. Section 5.3 outlines some of the approximation relationships between the criteria, expresses the criteria as penalized log likelihood functions, ranks them in terms of model selection properties, interprets them as χ²-statistics and derives adjusted R² expressions for each criterion. Section 5.4 describes two data re-sampling approaches to model selection. There are possible alternatives to the usual selection algorithm, the essential difference being in how the class of models allowed for consideration is restricted. Some selection algorithms are discussed in Section 5.5.
References for further reading are given in Section 5.6, with brief summaries. As not all methods for model selection are considered, references to some other, less commonly used methods are also given. Section 5.7 contains a discussion of the place of model selection criteria in the broad context of determining relative model performance, while Section 5.8 concludes.

5.2 Model Selection Criteria

The common forms of some common criteria are reviewed before introducing some new forms and interpretations for these criteria. The Cross-Validation score is derived through the data re-sampling technique of cross-validation. It is written as:

    CV = (1/T) Σ_{i=1}^T [y_i - f(x_i)]² / [1 - c_i]²    (5.1)

where T is the sample size, y_i and x_i are the i-th dependent and independent variables respectively, f(x_i) denotes the estimate of y_i, and the c_i in the denominator denotes the i-th diagonal element of the projection matrix X(X'X)⁻¹X'. The Generalized Cross-Validation Criterion is derived by assuming that the c_i are approximately constant, yielding the following criterion:

    GCV = (1/T) Σ_{i=1}^T [y_i - f(x_i)]² / [1 - n/T]²    (5.2)

where n is the number of independent variables in the model. A logarithmic form for the GCV is derived in Appendix A:

    GCV = -logL(β̂) - T log(1 - n/T)    (5.3)

where logL(β̂) is the maximized log likelihood. The commonly used Akaike Information Criterion and Schwarz Criterion are written as follows:

    AIC = -logL(β̂) + n    (5.4)

    SC = -logL(β̂) + n log(T)/2.    (5.5)

A multiple equation context expression for CV was defined in Chapter 3:

    CV = |Σ_cv|^(1/G)    (5.6)

where |Σ_cv| is the determinant of the variance-covariance matrix derived from the cross-validation residuals. An alternative to taking the determinant of the matrix would be to take the trace. Both methods give a measure of the size of the matrix, and while the determinant was chosen for use here, the use of the trace may be worth considering.
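The single-equation criteria reviewed above can be sketched in a few lines. The following is a minimal illustration for ordinary least squares with an intercept and one regressor (so n = 2); the data and function names are invented, not from the thesis.

```python
import math

# Illustrative sketch of CV, GCV, AIC and SC for simple OLS (n = 2).
def selection_scores(x, y):
    T, n = len(x), 2
    xbar, ybar = sum(x) / T, sum(y) / T
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]     # residuals y_i - f(x_i)
    c = [1.0 / T + (xi - xbar) ** 2 / sxx for xi in x]  # diagonal of X(X'X)^{-1}X'
    ssr = sum(ei ** 2 for ei in e)
    cv = sum((ei / (1.0 - ci)) ** 2 for ei, ci in zip(e, c)) / T  # eq. (5.1)
    gcv = (ssr / T) / (1.0 - n / T) ** 2                          # eq. (5.2)
    negloglik = (T / 2.0) * math.log(ssr / T)  # -logL, additive constants dropped
    aic = negloglik + n                                           # eq. (5.4)
    sc = negloglik + n * math.log(T) / 2.0                        # eq. (5.5)
    return cv, gcv, aic, sc
```

In practice each candidate model would be scored this way and the model with the smallest value of the chosen criterion selected.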
The problem with using the determinant is that one well-fitting equation (small residuals) may suggest that the entire system is fitting/predicting well, whereas this does not have to be the case. However, the use of the determinant here allows the derivation of a multiple equation context expression for the GCV:

    FGCV = -logL(β̂) - GT log(1 - n̄/T).    (5.7)

The step between the above definition of the CV and the expression for the FGCV requires the derivation of another criterion, Fast CV (see Chapter 3):

    FCV = (1/T) ε̃'ε̃    (5.8)

where ε̃ is a transformed vector of full sample least squares residuals.

5.2.1 Some Historical Notes

Akaike (1973) suggested the AIC (the "A" was originally meant to denote Information Criterion A, rather than the author's name). The Schwarz Criterion is due to Schwarz (1978). This is sometimes referred to as the BIC ("B" for Bayes), but as this notation has been used to refer to other criteria, we call it the SC. Although Craven and Wahba (1978) are justifiably credited with the GCV, it seems that Tukey (1967) suggested the same criterion (see Section 5.2.5 on Hocking's criterion below). Schmidt (1971, 1974, 1975), Allen (1971, 1974), Stone (1974), and Geisser (1974, 1975) seem to all have suggested the cross-validation approach as a statistical tool.(1) Although a pioneer in this field, Schmidt's work seems to be largely unknown. Schmidt called the cross-validation score SSPE (Sum of Squared Predictive Errors), while Allen called it PRESS (Prediction Sum of Squares), and this latter notation is often used in the literature.

5.2.2 Theil's Adjusted R²

An enduring choice criterion is the R², defined as follows:

    R² = 1 - Σ_{i=1}^T [y_i - f(x_i)]² / Σ_{i=1}^T (y_i - ȳ)²    (5.9)

where the bar denotes the mean value, and models which return a value close to one explain much of the variation in the data.
However, it is well known that adding more regressors always increases the value of R², as long as the matrix of regressors (X) remains non-singular. Therefore, Theil (1961) suggested the adjusted R² criterion for model comparison, which is derived by replacing the sum of squared residuals in the expression for R² with T/(T - n) times the sum of squared residuals. This yields a criterion which is sensitive to the number of remaining degrees of freedom, and can be expressed as:

    R̄² = 1 - [T/(T - n)] Σ_{i=1}^T [y_i - f(x_i)]² / Σ_{i=1}^T (y_i - ȳ)².    (5.10)

By noting that the only part of this criterion that is relevant for model selection is the Ordinary Least Squares (OLS) error variance estimate, (Σ_{i=1}^T [y_i - f(x_i)]²)/(T - n), we see that this criterion suggests the choice of the model with the smallest error variance. Theil showed that this criterion leads to the choice of the true model at least as often as any other model (Theil, 1961; 213). However, as shown by the results of Schmidt (1973, 1975), this criterion will not help us in selecting the true model when a regression contains both the variables of the true model plus some extra, irrelevant, regressors.

(1) Stone cites some early work which used the general concept in various applications.

5.2.3 Mallows' Criterion

This is a criterion that has enjoyed much popularity in many sciences, but decreasingly so now that there exists a large range of alternatives. It was first suggested by Mallows (1964), and common references include Gorman and Toman (1966) and Mallows (1973). It is often written as follows:

    Cp = Σ_{i=1}^T [y_i - f(x_i)]²/σ̂² + 2n - T    (5.11)

where σ̂² is a variance estimate. It also often appears in the following equivalent (for model selection) form:

    Cp = Σ_{i=1}^T [y_i - f(x_i)]² + 2nσ̂².    (5.12)

An issue that limits the usefulness of this criterion is the need to find an appropriate estimate of σ².
The usual approach is to use the variance from a regression that includes the entire set of potential regressors. Denoting the dimension of this set as n*, we can write the error variance estimate as:

    σ̂² = Σ_{i=1}^T [y_i - f*(x_i)]² / (T - n*)    (5.13)

where f*(x_i) denotes the fitted value from the regression on the full set of potential regressors. An alternative to this estimate results in the derivation of the next criterion.

5.2.4 A Much Suggested Criterion: Jp, FPE, PC

An alternative estimate of σ² in the Cp criterion has been suggested, which results in a different criterion. Specifically, taking an appropriate estimate of σ² to be the variance for each parameterization considered, σ̂² = (Σ_{i=1}^T [y_i - f(x_i)]²)/(T - n), and substituting this into equation (5.12), we get the following criterion:

    Jp = FPE = PC = [(T + n)/(T - n)] Σ_{i=1}^T [y_i - f(x_i)]².    (5.14)

We can see this by defining C̃p as the criterion resulting when the error variance in equation (5.12) is taken to be σ̂² = (Σ_{i=1}^T [y_i - f(x_i)]²)/(T - n), where n is the number of parameters in each model considered. Then we can write:

    C̃p = Σ_{i=1}^T [y_i - f(x_i)]² + 2nσ̂²    (5.15)
       = Σ_{i=1}^T [y_i - f(x_i)]² + 2n (Σ_{i=1}^T [y_i - f(x_i)]²)/(T - n)    (5.16)
       = {1 + 2n/(T - n)} Σ_{i=1}^T [y_i - f(x_i)]²    (5.17)
       = [(T - n + 2n)/(T - n)] Σ_{i=1}^T [y_i - f(x_i)]²    (5.18)
       = [(T + n)/(T - n)] Σ_{i=1}^T [y_i - f(x_i)]²    (5.19)
       = Jp.    (5.20)

Rothman (1968), Akaike (1969) and Amemiya (1972, 1980) have all suggested this criterion, Rothman calling it Jp, Akaike calling it Final Prediction Error (FPE) and Amemiya calling it Prediction Criterion (PC).

5.2.5 Hocking's Criterion

Hocking's Sp criterion was suggested by Hocking (1976); it is reviewed thoroughly by Thompson (1978), and is given an alternative justification by Breiman and Freedman (1983). It is written as follows:

    Sp = Σ_{i=1}^T [y_i - f(x_i)]² / [(T - n)(T - n - 1)].    (5.21)

One can easily note that this criterion closely resembles the Generalized Cross-Validation Criterion. In fact, although Hocking cites Tukey (1967) as the source for this criterion, it seems that Tukey actually had a denominator of (T - n)², which results in a criterion which is just a monotonic transformation of the GCV in equation (5.2)!
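The derivation (5.15)-(5.20) above can be checked numerically: substituting the model's own variance estimate into equation (5.12) collapses Mallows' criterion to Jp. The sample size, parameter count and sum of squared residuals below are arbitrary numbers chosen for illustration.

```python
# Numerical check of (5.15)-(5.20): with the own-model variance estimate,
# Mallows' criterion equals Jp = ((T + n)/(T - n)) * SSR. Numbers are arbitrary.
T, n, ssr = 50, 4, 12.3
sigma2 = ssr / (T - n)            # own-model variance estimate
cp_tilde = ssr + 2 * n * sigma2   # eq. (5.12) with that estimate
jp = (T + n) / (T - n) * ssr      # eq. (5.14)
```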
5.2.6 Hannan and Quinn Criterion

This criterion is due to Hannan and Quinn (1979), and can be expressed as:

    HQ = Σ_{i=1}^T [y_i - f(x_i)]² (log T)^(2n/T).    (5.22)

The relationships and finite sample properties of the criteria are discussed in some detail below, but here we can note that HQ shares a property of the AIC and SC in that its marginal penalty is constant as n increases, for fixed T.

The criteria introduced above can all be considered as approximations to one another. Some of the relationships between the criteria are now noted.

5.3 Comparisons of Criteria

This section consists of two parts. The first looks at approximation properties between the penalties of some criteria. The second looks at what is the more important issue for the model selection properties of the criteria: how much of a penalty is paid for an additional parameter, and how that marginal penalty behaves as the number of parameters approaches the sample size.

5.3.1 Penalty Term Approximations

This section compares the relative sizes of the respective penalty terms for some of the above criteria. First, we examine the relationship between the GCV and AIC. Define f(x) = log(1 + x), where x = -n/T. Taking a Taylor series expansion around x = 0 (i.e. a Maclaurin series) we get:

    f(x) = x - (1/2)x² + (1/3)x³ - (1/4)x⁴ + ...    (5.23)

Therefore, we can write the penalty term of the GCV in equation (5.2) as:

    -T log(1 - n/T) = -T[-(n/T) - (1/2)(n/T)² - (1/3)(n/T)³ - (1/4)(n/T)⁴ - ...]
                    = n + (T/2)(n/T)² + (T/3)(n/T)³ + (T/4)(n/T)⁴ + ...    (5.24)

From the above expressions for the GCV and AIC, we see that we can write:

    GCV = AIC + (T/2)(n/T)² + (T/3)(n/T)³ + (T/4)(n/T)⁴ + ...    (5.25)

We see that the GCV and the AIC approximate each other to the first order, but it is a very biased approximation, in that the error term always goes in one direction. Also note that Stone (1977) proved the asymptotic equivalence, in terms of selecting the same model, of the AIC and CV.
Now consider the criterion suggested by Rothman (1968) (Jp). We will compare its expression in equation (5.14) with that of the GCV in equation (5.2). Consider the penalty term of Jp:

    (T + n)/(T - n) = (1 + n/T)/(1 - n/T) ≈ (1 - n/T)⁻²    (5.26)

where we get the approximation to the GCV penalty through the use of the geometric series. To make explicit the relationship between the penalty terms of the GCV and the Sp, consider the penalty term for the GCV in equation (5.2):

    (1 - n/T)⁻² = T²(T - n)⁻² ≈ T²/[(T - n)(T - n - 1)]    (5.27)

which is the penalty for the Sp monotonically transformed by multiplication by T². From the approximations given above, further approximation properties can easily be derived between pairs of criteria, e.g. the relationship between the AIC and Sp can be derived through their relationships with the GCV.

5.3.2 Log Forms and Marginal Penalties

The considered model selection criteria can all be put into penalized log likelihood form, as was done in Chapter 3 for the GCV (Appendix A). This section re-expresses in logarithmic form the criteria not already presented in such a form. It is hoped that these new expressions for the criteria may be of use to applied researchers. Once in this form, the marginal penalties of the criteria are examined, allowing easy comparison of the relative model selection properties of the criteria, e.g. the criterion with the largest penalty for an additional parameter will select the most parsimonious model. Now, we make an observation which will be used below.

Observation 3 A monotonic transformation of the criteria does not affect the model selection properties in any way.

Proof: Let λ denote some positive constant, z(0) the value of some criterion for model 0, and z(1) the value of the criterion for model 1, which has more parameters than model 0.
To determine which model is selected by the criterion we examine D = λ(z(1) - z(0)). As we want to find the model which minimizes our criterion, D negative implies that the more highly parameterized model is preferred. In other words, it is only the sign of D, rather than its magnitude, which is of importance in determining model choice. In what follows, we use this fact by replacing the sum of squared residuals in the above criteria with the average squared residuals, i.e. λ = 1/T.

Claim 1 The criterion implied by Theil's (1961) R̄² can be expressed in the following penalized log likelihood form:

    RBAR = -logL(β̂) - (T/2) log(T - n).    (5.28)

Proof: R̄² in equation (5.10) implies a criterion of [T/(T - n)] Σ_{i=1}^T [y_i - f(x_i)]², or an unbiased estimate of the error variance scaled by T. Using the above observation on the invariance of the model selection properties of the criteria with respect to monotonic transformations, we can write this criterion as [1/(T - n)] Σ_{i=1}^T [y_i - f(x_i)]². Now note that, ignoring the additive constants in the concentrated log likelihood function, we have:

    logL(β̂) = -(T/2) log(σ̂²)    (5.29)
             = -(T/2) log((1/T) Σ_{i=1}^T [y_i - f(x_i)]²)    (5.30)

where logL(β̂) is the maximized log likelihood and σ̂² = (1/T) Σ_{i=1}^T [y_i - f(x_i)]² is the maximum likelihood estimate of the variance. We can re-arrange equation (5.30) as follows:

    -(2/T) logL(β̂) = log((1/T) Σ_{i=1}^T [y_i - f(x_i)]²).    (5.31)

Let the criterion implied by R̄² be denoted by RBAR:

    RBAR = [1/(T - n)] Σ_{i=1}^T [y_i - f(x_i)]².    (5.32)

Taking the logarithm of this expression and using equation (5.31) we get:

    log(RBAR) = log((1/T) Σ_{i=1}^T [y_i - f(x_i)]²) + log(T) - log(T - n)    (5.33)
              = -(2/T) logL(β̂) + log(T) - log(T - n).    (5.34)

Now, we multiply through by T/2 and re-define RBAR:

    RBAR = -logL(β̂) + (T/2) log(T) - (T/2) log(T - n).    (5.35)

As (T/2) log(T) is a common term for each model considered, it is irrelevant for model selection purposes and can be dropped. Then we have the expression for RBAR given in the claim.
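Claim 1 and Observation 3 can be illustrated numerically: ranking candidate models by the raw criterion SSR/(T - n) matches the ranking by its penalized log likelihood form. The (n, SSR) pairs below are made-up values, not estimates from the thesis.

```python
import math

# Illustration of Claim 1: ranking by RBAR = SSR/(T - n) matches ranking by
# -logL(beta-hat) - (T/2)log(T - n). The two model summaries are invented.
def rbar_raw(T, n, ssr):
    return ssr / (T - n)                                   # eq. (5.32)

def rbar_loglik(T, n, ssr):
    negloglik = (T / 2.0) * math.log(ssr / T)  # -logL, additive constants dropped
    return negloglik - (T / 2.0) * math.log(T - n)         # eq. (5.28)

T = 40
models = [(3, 20.0), (6, 17.5)]                            # (n, SSR) pairs
pick_raw = min(models, key=lambda m: rbar_raw(T, *m))
pick_log = min(models, key=lambda m: rbar_loglik(T, *m))
```

The agreement is exact for any inputs, since (T/2) log(RBAR) differs from the penalized log likelihood form only by the common constant (T/2) log(T).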
Claim 2 Mallows' criterion, (1 + 2n/(T - n*)) Σ_{i=1}^T [y_i - f(x_i)]², can be expressed in the following penalized log likelihood form:

    Cp = -logL(β̂) + (T/2) log(1 + 2n/(T - n*)).    (5.36)

Proof: Using Observation 3 we write Mallows' criterion as:

    Cp = (1 + 2n/(T - n*)) (1/T) Σ_{i=1}^T [y_i - f(x_i)]².    (5.37)

Taking logs of equation (5.37) and using equation (5.31), we get:

    log(Cp) = log((1/T) Σ_{i=1}^T [y_i - f(x_i)]²) + log(1 + 2n/(T - n*))    (5.38)
            = -(2/T) logL(β̂) + log(1 + 2n/(T - n*)).    (5.39)

Now, we multiply through by T/2 and re-define Cp:

    Cp = -logL(β̂) + (T/2) log(1 + 2n/(T - n*)),    (5.40)

which proves the claim.

Claim 3 Rothman's Jp criterion, [(T + n)/(T - n)] Σ_{i=1}^T [y_i - f(x_i)]², can be expressed in the following penalized log likelihood form:

    Jp = -logL(β̂) + (T/2) log(T + n) - (T/2) log(T - n).    (5.41)

Proof: Using Observation 3 we write Jp as:

    Jp = [(T + n)/(T - n)] (1/T) Σ_{i=1}^T [y_i - f(x_i)]².    (5.42)

Taking logs of equation (5.42) and using equation (5.31), we get:

    log(Jp) = log((1/T) Σ_{i=1}^T [y_i - f(x_i)]²) + log(T + n) - log(T - n)    (5.43)
            = -(2/T) logL(β̂) + log(T + n) - log(T - n).    (5.44)

Now, we multiply through by T/2 and re-define Jp:

    Jp = -logL(β̂) + (T/2) log(T + n) - (T/2) log(T - n),    (5.45)

which proves the claim.

Claim 4 Hocking's Sp criterion, Σ_{i=1}^T [y_i - f(x_i)]²/[(T - n)(T - n - 1)], can be expressed in the following penalized log likelihood form:

    Sp = -logL(β̂) - (T/2) log(T - n) - (T/2) log(T - n - 1).    (5.46)

Proof: Using Observation 3 we write Sp as:

    Sp = (1/T) Σ_{i=1}^T [y_i - f(x_i)]² / [(T - n)(T - n - 1)].    (5.47)

Taking logs of equation (5.47) and using equation (5.31), we get:

    log(Sp) = log((1/T) Σ_{i=1}^T [y_i - f(x_i)]²) - log(T - n) - log(T - n - 1)    (5.48)
            = -(2/T) logL(β̂) - log(T - n) - log(T - n - 1).    (5.49)

Now, we multiply through by T/2 and re-define Sp:

    Sp = -logL(β̂) - (T/2) log(T - n) - (T/2) log(T - n - 1),    (5.50)

which proves the claim.
Claim 5 The Hannan and Quinn criterion, HQ = Σ_{i=1}^T [y_i - f(x_i)]² (log T)^(2n/T), can be expressed in the following penalized log likelihood form:

    HQ = -logL(β̂) + n log(log T).    (5.51)

Proof: Using Observation 3 we write HQ as:

    HQ = (1/T) Σ_{i=1}^T [y_i - f(x_i)]² (log T)^(2n/T).    (5.52)

Taking logs of equation (5.52) and using equation (5.31), we get:

    log(HQ) = log((1/T) Σ_{i=1}^T [y_i - f(x_i)]²) + (2n/T) log(log T)    (5.53)
            = -(2/T) logL(β̂) + (2n/T) log(log T).    (5.54)

Now, we multiply through by T/2 and re-define HQ:

    HQ = -logL(β̂) + n log(log T),    (5.55)

which proves the claim.

Now we have penalized log likelihood forms for the criteria which we are considering. We can easily compare the relative sizes of the marginal penalties, which are derived by differentiating the penalty terms of the respective criteria with respect to n, the number of parameters. This gives the penalty incurred by the marginal parameter. Letting the superscript MP denote the marginal penalty of the respective criteria, we get the following expressions:

    GCV^MP = d(-T log(1 - n/T))/dn = T/(T - n)    (5.56)
    AIC^MP = d(n)/dn = 1    (5.57)
    SC^MP = d(n log(T)/2)/dn = log(T)/2    (5.58)
    RBAR^MP = d(-(T/2) log(T - n))/dn = (1/2)(T/(T - n))    (5.59)
    Cp^MP = d((T/2) log(1 + 2n/(T - n*)))/dn = T/(T - n* + 2n)    (5.60)
    Jp^MP = d((T/2) log(T + n) - (T/2) log(T - n))/dn = (1/2)(T/(T + n) + T/(T - n))    (5.61)
    Sp^MP = d(-(T/2) log(T - n) - (T/2) log(T - n - 1))/dn = (1/2)(T/(T - n) + T/(T - n - 1))    (5.62)
    HQ^MP = d(n log(log T))/dn = log(log T)    (5.63)

We consider now the relative sizes of these marginal penalties. From Observation 2 (Chapter 2) we already know that (i) for T < 8, SC^MP < AIC^MP < GCV^MP; and (ii) for T > 8, AIC^MP < GCV^MP, where SC^MP is greater than AIC^MP, and may be less than or greater than GCV^MP depending on the size of T and n. The ranking of the marginal penalties in (i) is determined by noting that for T < 8, (log T)/2 < 1, and T/(T - n) > 1 for any positive value of n. The ranking in (ii) with respect to SC^MP is difficult to determine due to SC^MP being non-linear in T, but it is trivial
to determine the relative sizes of the marginal penalties given any T and n. Now consider some further rankings.

Claim 6 RBAR^MP has the following relative size properties: (i) RBAR^MP < GCV^MP; and (ii) for n < T/2, RBAR^MP < AIC^MP; for n = T/2, RBAR^MP = AIC^MP; for n > T/2, RBAR^MP > AIC^MP.

Proof: We prove (i) by noting that RBAR^MP = (1/2)(T/(T - n)) = (1/2) GCV^MP. We prove (ii) by substituting n = T/2 into RBAR^MP:

    RBAR^MP = (1/2)(T/(T - T/2)) = 1 = AIC^MP.    (5.64)

Then, from the fact that a larger value of n implies a smaller value of T - n and hence a larger value for RBAR^MP, it follows that RBAR^MP < 1 for n < T/2 and RBAR^MP > 1 for n > T/2.

Claim 7 Jp^MP has the following relative size properties, for n > 0: (i) Jp^MP < GCV^MP; (ii) Jp^MP > AIC^MP; (iii) Jp^MP > RBAR^MP.

Proof: We prove (i) by re-expressing GCV^MP and comparing it to Jp^MP:

    GCV^MP = T/(T - n)
           = (1/2)(T/(T - n)) + (1/2)(T/(T - n))
           > (1/2)(T/(T - n)) + (1/2)(T/(T + n)) = Jp^MP    (5.65)

where the inequality follows from T/(T - n) > T/(T + n), for n > 0. To prove (ii), we note that for n = 0, Jp^MP = 1, and take the derivatives of the two terms in the marginal penalty with respect to n (dropping the superfluous 1/2 at this point):

    d(T/(T + n))/dn = -T/(T + n)²,    (5.66)
    d(T/(T - n))/dn = T/(T - n)².    (5.67)

Then the term T/(T + n) in Jp^MP is decreasing at rate T/(T + n)², and the term T/(T - n) is increasing at rate T/(T - n)² as n increases. Using the following inequality,

    T/(T + n)² < T/(T - n)²,    (5.68)

we know that T/(T + n) decreases more slowly than T/(T - n) increases with n, and with Jp^MP = 1 for n = 0, we have Jp^MP > 1 = AIC^MP for n > 0. To prove (iii), we re-write Jp^MP:

    Jp^MP = (1/2)(T/(T + n) + T/(T - n))
          > (1/2)(T/(T - n)) = RBAR^MP.    (5.69)

Claim 8 Sp^MP > GCV^MP.

Proof: We prove the claim by re-writing GCV^MP and comparing it to Sp^MP:

    GCV^MP = T/(T - n)
           = (1/2)(T/(T - n)) + (1/2)(T/(T - n))
           < (1/2)(T/(T - n)) + (1/2)(T/(T - n - 1)) = Sp^MP    (5.70)
where the inequality follows from the fact that T/(T - n - 1) > T/(T - n).

We have now established the following rankings for the marginal penalties of the considered criteria.

(i) For n < T/2:

    RBAR^MP < AIC^MP < Jp^MP < GCV^MP < Sp^MP.    (5.71)

(ii) For n = T/2:

    RBAR^MP = AIC^MP < Jp^MP < GCV^MP < Sp^MP.    (5.72)

(iii) For n > T/2:

    AIC^MP < RBAR^MP < Jp^MP < GCV^MP < Sp^MP.    (5.73)

It is more difficult to rank the marginal penalties of the Cp, SC and HQ, unless we are given values for T and n. However, there are some relationships and properties of these criteria which we can determine without knowing these values, which we examine now.

Claim 9 HQ^MP has the following relative size properties: (i) HQ^MP < SC^MP; (ii) for T < 16, HQ^MP < AIC^MP, and for T > 16, HQ^MP > AIC^MP.

Proof: We know that (i) is true as SC^MP = (1/2) log T, HQ^MP = log(log T), and taking the log of a number always returns a smaller value than multiplying it by (1/2). We can prove (ii) by setting HQ^MP = 1 and proceeding as follows:

    log(log T) = 1    (5.74)
    log T = e    (5.75)
    T = e^e ≈ 15.154.    (5.76)

As HQ^MP is increasing in T, we find that HQ^MP > 1 = AIC^MP for T > 16 and HQ^MP < 1 = AIC^MP for T < 16.

Claim 10 Cp^MP can be greater or smaller than Jp^MP depending on the size of n*, the dimension of the set of all potential regressors.

Proof: We want to prove the above claimed relationship between Cp^MP = T/(T - n* + 2n) and Jp^MP = (1/2)(T/(T + n) + T/(T - n)). Define n* = n + γ, which lets us write Cp^MP = T/(T + n - γ). We note the following inequalities:

    T/(T + n - γ) > T/(T + n)    (5.77)
    T/(T + n - γ) > T/(T - n)  if γ > 2n    (5.78)
    T/(T + n - γ) ≤ T/(T - n)  if γ ≤ 2n.    (5.79)

It follows simply from the first two inequalities that for γ > 2n we have:

    Cp^MP = T/(T + n - γ) > (1/2)(T/(T - n) + T/(T + n)) = Jp^MP.    (5.80)

Then, from equations (5.78) and (5.79), for γ < 2n we have the following:

    If  T/(T + n - γ) - T/(T + n) > T/(T - n) - T/(T + n - γ),  then Cp^MP > Jp^MP.    (5.81)
    If  T/(T + n - γ) - T/(T + n) < T/(T - n) - T/(T + n - γ),  then Cp^MP < Jp^MP.    (5.82)

The relationship between Cp^MP and Jp^MP illustrates an interesting aspect of Cp: the larger is γ = n* - n, the larger the marginal penalty. Consider the first order derivative of Cp^MP with respect to γ:

    dCp^MP/dγ = T/(T + n - γ)² > 0.    (5.83)

This implies that if we can arbitrarily determine n*, as may be the case in many applications, we get a larger penalty for an additional parameter simply by re-defining the dimensionality of the set of potential regressors. In other words, by increasing n* we increase γ = n* - n and hence increase Cp^MP for every n, where it is always the case that n* > n by definition.

It remains to determine how the marginal penalties change as we lose degrees of freedom, i.e. we want to know the rate at which the marginal penalties increase as n approaches T. By taking the second derivatives of the penalty terms for the respective criteria, or the first derivatives of the marginal penalties, with respect to n, we get the following:

    dAIC^MP/dn = d(1)/dn = 0    (5.84)
    dSC^MP/dn = d(log(T)/2)/dn = 0    (5.85)
    dHQ^MP/dn = d(log(log T))/dn = 0    (5.86)
    dGCV^MP/dn = d(T/(T - n))/dn = T/(T - n)² > 0    (5.87)
    dRBAR^MP/dn = d(T/(2(T - n)))/dn = (1/2)(T/(T - n)²) > 0    (5.88)
    dJp^MP/dn = d((1/2)(T/(T - n) + T/(T + n)))/dn = (1/2)(T/(T - n)² - T/(T + n)²) > 0    (5.89)
    dSp^MP/dn = d((1/2)(T/(T - n) + T/(T - n - 1)))/dn = (1/2)(T/(T - n)² + T/(T - n - 1)²) > 0    (5.90)
    dCp^MP/dn = d(T/(T - n* + 2n))/dn = -2T/(T - n* + 2n)² < 0    (5.91)

The derivatives for AIC^MP, SC^MP and HQ^MP above have returned values of zero, implying that the penalty paid for an additional parameter is constant and insensitive to the remaining number of degrees of freedom. For GCV^MP, RBAR^MP, Jp^MP and
S_p^{MP} we see that the respective marginal penalties are sensitive to the remaining degrees of freedom, in that additional parameters are penalized more harshly as the degrees of freedom approach zero. Surprisingly, the C_p has a marginal penalty which exhibits the opposite property to this—as the remaining degrees of freedom approach zero, an additional parameter incurs a lighter penalty.

5.3.3 Interpretation of Criteria as χ²-Statistics

In their penalized log likelihood forms we can consider the relationships between model selection by these criteria and formal testing using the Likelihood Ratio (LR) test. Recall that the LR statistic can be written as

    LR = 2(log L(β̂_1) - log L(β̂_0)) ~ χ²_r    (5.92)

where log L(β̂_1) is the maximized log likelihood value for the more highly parameterized model (model 1) and log L(β̂_0) is the same, but for the restricted model (model 0). The LR statistic is asymptotically distributed as χ²_r under the null hypothesis, where r is the number of restrictions which nests model 0 in model 1. The null hypothesis is that the r restrictions are valid. Define χ²_c as the critical value for some significance level with r degrees of freedom. Then if

    LR > χ²_c    (5.93)

we reject the null hypothesis and conclude that our r restrictions are not valid.

Now recall that the marginal penalties described earlier are the penalties for the additional parameter in the respective criteria. Comparing model 0 and model 1, model 1 has r more parameters than model 0; i.e., model 0 is model 1 with r restrictions imposed. Consider first the case where the marginal penalty does not depend on the number of parameters in the model. Denoting the relevant marginal penalty by MP, we can write the selection problem as follows: If

    log L(β̂_1) - log L(β̂_0) > r MP    (5.94)

we reject the null hypothesis and conclude that our r restrictions are not valid.
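For the AIC the marginal penalty is MP = 1, so the rule in (5.94), once doubled, is just "reject when LR > 2r". A minimal sketch of this equivalence (the log likelihood values and parameter counts below are arbitrary illustrative inputs, not taken from the thesis):

```python
def aic(loglik, n, T):
    # Penalized log likelihood form used in this chapter: AIC = -(2/T) log L + 2n/T
    return -2.0 * loglik / T + 2.0 * n / T

def select_by_aic(loglik0, n0, loglik1, n1, T):
    # Pick the larger model (1) iff its AIC is smaller
    return 1 if aic(loglik1, n1, T) < aic(loglik0, n0, T) else 0

def select_by_lr_rule(loglik0, n0, loglik1, n1):
    # The equivalent LR-type rule: reject the r restrictions iff LR > 2 r MP,
    # with the AIC's constant marginal penalty MP = 1
    r = n1 - n0
    lr = 2.0 * (loglik1 - loglik0)
    return 1 if lr > 2.0 * r else 0
```

Both rules always agree, since AIC(model 1) < AIC(model 0) rearranges exactly to LR > 2r.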
Multiplying both sides of equation (5.94) by 2, we can write the selection problem as follows: If

    LR > 2r MP    (5.95)

we reject the null hypothesis and conclude that our r restrictions are not valid. The model selection properties of the respective criteria are unaffected by the scaling, as we saw in Observation 3.

Consider now the case where the marginal penalty does depend on the number of parameters in the model. Equation (5.95) then becomes

    LR > 2 Σ_{i=1}^{r} MP_i.    (5.96)

In other words, we sum the marginal penalties incurred for the addition of each parameter as we add a total of r parameters to model 0 to attain model 1. When the inequality is true, we reject the null hypothesis and conclude that our r restrictions are not valid.

We can see now that by using model selection criteria we are effectively replacing χ²_c with an appropriately scaled marginal penalty function in an LR-type test. We can consider the scaled marginal penalty functions for each selection criterion that we have expressed as a penalized log likelihood function, and compare this value with, say, the 5% level of significance χ² critical values. Also, we can see which approximate χ² significance level is implied if we consider these scaled marginal penalties as drawn from a χ²-type distribution.

The scaled marginal penalty function (SMP) for each respective criterion is as follows:

    SMP(GCV) = 2 Σ_{i=1}^{r} T/(T - n_i)    (5.97)

    SMP(AIC) = 2r    (5.98)

    SMP(SC) = r log T    (5.99)

    SMP(C_p) = 2 Σ_{i=1}^{r} T/(T - n* + 2n_i)    (5.100)

    SMP(J_p) = Σ_{i=1}^{r} (T/(T + n_i) + T/(T - n_i))    (5.101)

    SMP(RBAR) = Σ_{i=1}^{r} T/(T - n_i)    (5.102)

    SMP(S_p) = Σ_{i=1}^{r} (T/(T - n_i) + T/(T - n_i - 1))    (5.103)

    SMP(HQ) = 2r log(log T).    (5.104)

SMP(AIC) is greater than the 5% χ² critical value for more than seven restrictions on model 1. This means that for seven or fewer restrictions the AIC selects a smaller model than the LR test, but a more highly parameterized model for more than seven restrictions.
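The crossover at seven restrictions can be checked against the standard tabulated 5% χ² critical values (a quick numeric check; the table values are the usual ones for r = 1, ..., 10):

```python
# Tabulated 5% chi-squared critical values for r = 1..10 degrees of freedom
chi2_crit_05 = [3.841, 5.991, 7.815, 9.488, 11.070,
                12.592, 14.067, 15.507, 16.919, 18.307]

# Compare SMP(AIC) = 2r with the critical value at each r
crossover = [(r, 2 * r, chi2_crit_05[r - 1], 2 * r > chi2_crit_05[r - 1])
             for r in range(1, 11)]
# 2r stays below the critical value up to r = 7 (14 < 14.067) and
# exceeds it from r = 8 onwards (16 > 15.507)
```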
Another interpretation of this observation is that for r > 7 the significance level implied by the AIC is greater than that for the LR test. Although the other SMPs considered depend on T and n (and n* in the case of the C_p), making comparisons with the χ² critical values difficult,² the MP rankings established earlier (Section 5.3.2) indicate the relative model selection properties with respect to the LR test, given the observed relative properties of the AIC and LR.

Note that an important difference between the LR type of formal testing approach to model selection and the criterion approach is that in the LR case a "saturated" model must be selected and the models under comparison must be tested as restricted versions of this saturated model. Changing the highest parameterized model against which to do the formal testing of the restrictions may change the model of choice. The criterion approach does not have this property. Adding another, more highly parameterized, model for consideration will not change the relative rankings of the models already considered using a model selection criterion.³

Due to the Lindley "paradox" (Lindley, 1957), we may be interested in how the significance level implied by each criterion changes as T increases. This is because with a sufficiently large sample, every null hypothesis can be rejected. It is enough to note here that as T increases it becomes more likely that a null hypothesis will be rejected. Hence it may be desirable to have the significance level decrease as T increases. This implies larger critical values, or in our case, larger SMP values. To see which, if any, of the considered criteria have this property, it is sufficient to sign the derivatives of the marginal penalties of the criteria with respect to T.

² Unless given values of T and n*, in which case comparison is straightforward.
A positive sign means that the SMP of the criterion is increasing in T, and hence the implied significance level is decreasing.

Claim 11 The derivatives of the marginal penalties of the considered criteria have the following signs:

    dAIC^{MP}/dT = 0    (5.105)

    dGCV^{MP}/dT = d(T/(T-n))/dT = (T-n)^{-1}(1 - T(T-n)^{-1}) < 0    (5.106)

    dSC^{MP}/dT = d((log T)/2)/dT = 1/(2T) > 0    (5.107)

    dRBAR^{MP}/dT = d(T/(2(T-n)))/dT = (1/2)(T-n)^{-1}(1 - T(T-n)^{-1}) < 0    (5.108)

    dC_p^{MP}/dT = d(T/(T-n*+2n))/dT = (T-n*+2n)^{-1} - T(T-n*+2n)^{-2} ≷ 0    (5.109)

    dJ_p^{MP}/dT = d((1/2)(T/(T+n) + T/(T-n)))/dT = (1/2)[(T+n)^{-1} - T(T+n)^{-2} + (T-n)^{-1} - T(T-n)^{-2}] < 0    (5.110)

    dS_p^{MP}/dT = d((1/2)(T/(T-n) + T/(T-n-1)))/dT = (1/2)[(T-n)^{-1} - T(T-n)^{-2} + (T-n-1)^{-1} - T(T-n-1)^{-2}] < 0    (5.111)

    dHQ^{MP}/dT = d(log(log T))/dT = 1/(T log T) > 0.    (5.112)

Proofs: As T(T-n)^{-1} is never less than one, dGCV^{MP}/dT < 0. The sign of dRBAR^{MP}/dT = (1/2)(dGCV^{MP}/dT) follows from the sign of dGCV^{MP}/dT. The sign of dC_p^{MP}/dT depends on the relative sizes of n* and n. As earlier, define γ = n* - n and re-express equation (5.109) as follows:

    dC_p^{MP}/dT = (T+n-γ)^{-1} - T(T+n-γ)^{-2}    (5.113)

                 = (T+n-γ)^{-1}(1 - T/(T+n-γ)).    (5.114)

We see that if γ > n, then (1 - T/(T+n-γ)) < 0, and if γ < n, then (1 - T/(T+n-γ)) > 0. Therefore,

    dC_p^{MP}/dT < 0 for γ > n, and dC_p^{MP}/dT > 0 for γ < n.    (5.115)

The sign of dJ_p^{MP}/dT can be determined directly, or from the sign of dGCV^{MP}/dT and equation (5.65), which gave the ranking GCV^{MP} > J_p^{MP}. The sign of dS_p^{MP}/dT can be determined through the following re-expression of equation (5.111):

    dS_p^{MP}/dT = (1/2)[(T-n)^{-1}(1 - T/(T-n)) + (T-n-1)^{-1}(1 - T/(T-n-1))].    (5.116)

The negative sign is determined by noting that T/(T-n) and T/(T-n-1) are never less than one.

³ Except for the C_p, which has an MP that depends on n*, the highest parameterization considered.
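The rankings in (5.71) and the signs in Claim 11 are easy to verify numerically from the marginal-penalty expressions collected in this section; the sample sizes and parameter count below are arbitrary illustrative choices:

```python
import math

def marginal_penalties(T, n):
    # Marginal penalties (MPs) of the criteria, in the penalized
    # log-likelihood scaling used in this section
    return {
        "AIC":  1.0,
        "SC":   0.5 * math.log(T),
        "HQ":   math.log(math.log(T)),
        "GCV":  T / (T - n),
        "RBAR": T / (2 * (T - n)),
        "JP":   0.5 * (T / (T + n) + T / (T - n)),
        "SP":   0.5 * (T / (T - n) + T / (T - n - 1)),
    }

# The ranking (5.71) for n < T/2 can be verified at, say, T = 50, n = 5
mp = marginal_penalties(50, 5)

# Direction of change as T grows with n fixed
small, large = marginal_penalties(50, 5), marginal_penalties(500, 5)
increasing_in_T = sorted(k for k in small if large[k] > small[k])
```

Only SC and HQ grow with T, matching the observation that only those two criteria have an implied significance level that falls as the sample size rises.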
From the signs of these derivatives, we observe that only the SC and HQ always have the property of an implied level of significance which is a decreasing function of sample size.

5.3.4 Representations as R̄²-Type Values

The relationship between the adjusted R² and other selection criteria has been examined above, where all the criteria were expressed in terms of penalized log likelihood functions. It is possible to express these criteria as adjusted R² values. The model selection properties of these re-expressed criteria remain unchanged from those already determined above. Some scaling is necessary for some of the criteria in order for their R² values to approach one, rather than some other number, as the fit improves. From Observation 3 we know that this scaling has no effect on model selection. The notation is as before: T is sample size, n is the number of parameters in the model and n* is the maximum number of parameters considered for inclusion (which is of relevance to the C_p).

First, consider the expression for Theil's adjusted R² given in Section 5.2.2:

    R̄² = 1 - [T/(T-n)] Σ_{i=1}^{T} [y_i - f(x_i)]² / Σ_{i=1}^{T} (y_i - ȳ)².    (5.117)

This criterion compares (Σ_{i=1}^{T} [y_i - f(x_i)]²)/(T-n) between models (small is good). Scaling by T and subtracting the resulting expression from one gives the above expression for R̄², which allows R̄² to be close to one when the fit of the model under consideration is good. The relationship between R̄² and R² can be seen by writing R̄² as follows:

    1 - R̄² = [T/(T-n)](1 - R²),    (5.118)

or,

    R̄² = 1 - [T/(T-n)](1 - R²).    (5.119)

This relationship is easily derived by noting that

    1 - R² = Σ_{i=1}^{T} [y_i - f(x_i)]² / Σ_{i=1}^{T} (y_i - ȳ)².    (5.120)

We now consider R̄²-type expressions for the other criteria under consideration, and expressions for their relationships with R².

Definition 1 We define the adjusted R²-type expression for the AIC as follows:

    R²(AIC) = 1 - exp(2n/T) Σ_{i=1}^{T} [y_i - f(x_i)]² / Σ_{i=1}^{T} (y_i - ȳ)².    (5.121)

This definition was derived by manipulating an expression for the AIC in the following manner:

    AIC = -(2/T) log L(β̂) + 2n/T    (5.122)

    exp(AIC) = exp(2n/T) Σ_{i=1}^{T} [y_i - f(x_i)]²/T,    (5.123)

as (-2/T) log L(β̂) = log(Σ_{i=1}^{T} [y_i - f(x_i)]²/T), if we ignore the constant terms in the log likelihood function. Scaling exp(AIC) by T, dividing by Σ_{i=1}^{T} (y_i - ȳ)² and subtracting the resulting expression from one yields the defined expression for R²(AIC). The relationship of R²(AIC) with R² can be trivially derived from equation (5.120):

    R²(AIC) = 1 - exp(2n/T)(1 - R²).    (5.124)

Definition 2 We define the adjusted R²-type expression for the SC as follows:

    R²(SC) = 1 - T^{n/T} Σ_{i=1}^{T} [y_i - f(x_i)]² / Σ_{i=1}^{T} (y_i - ȳ)².    (5.125)

This definition was derived by manipulating an expression for the SC in the following manner:

    SC = -(2/T) log L(β̂) + (n log T)/T    (5.126)

    exp(SC) = T^{n/T} Σ_{i=1}^{T} [y_i - f(x_i)]²/T.    (5.127)

Scaling by T, dividing by Σ_{i=1}^{T} (y_i - ȳ)² and subtracting the resulting expression from one yields the defined expression for R²(SC). The relationship of R²(SC) with R² can be trivially derived from equation (5.120):

    R²(SC) = 1 - T^{n/T}(1 - R²).    (5.128)

Definition 3 We define the adjusted R²-type expression for the GCV as follows:

    R²(GCV) = 1 - (1 - n/T)^{-2} Σ_{i=1}^{T} [y_i - f(x_i)]² / Σ_{i=1}^{T} (y_i - ȳ)².    (5.129)

This definition was derived by manipulating an expression for the GCV in the following manner:

    GCV = (1/T) Σ_{i=1}^{T} [y_i - f(x_i)]² / (1 - n/T)².    (5.130)

Scaling by T, dividing by Σ_{i=1}^{T} (y_i - ȳ)² and subtracting the resulting expression from one yields the defined expression for R²(GCV). The relationship of R²(GCV) with R² can be trivially derived from equation (5.120):

    R²(GCV) = 1 - (1 - n/T)^{-2}(1 - R²).    (5.131)

Definition 4 We define the adjusted R²-type expression for the S_p as follows:

    R²(S_p) = 1 - [T²/((T-n)(T-n-1))] Σ_{i=1}^{T} [y_i - f(x_i)]² / Σ_{i=1}^{T} (y_i - ȳ)².    (5.132)

This definition was derived by manipulating an expression for the S_p in the following manner:

    S_p = Σ_{i=1}^{T} [y_i - f(x_i)]² / ((T-n)(T-n-1)).    (5.133)

Scaling by T², dividing by Σ_{i=1}^{T} (y_i - ȳ)² and subtracting the resulting expression from one yields the defined expression for R²(S_p). We can see that scaling by T² is appropriate to get a comparable adjusted R² form, through re-expressing R²(GCV) in equation (5.129) as follows:

    R²(GCV) = 1 - [T²/(T-n)²] Σ_{i=1}^{T} [y_i - f(x_i)]² / Σ_{i=1}^{T} (y_i - ȳ)².    (5.134)

Comparing these expressions for R²(S_p) and R²(GCV), it is obvious that scaling by T² is appropriate in order to get a comparable adjusted R² value.⁴ The relationship of R²(S_p) with R² can be trivially derived from equation (5.120):

    R²(S_p) = 1 - [T²/((T-n)(T-n-1))](1 - R²).    (5.135)

⁴ This scaling makes no difference to the model selection properties of R²(S_p); see Observation 3.

Definition 5 We define the adjusted R²-type expression for the C_p as follows:

    R²(C_p) = 1 - (1 + 2n/(T-n*)) Σ_{i=1}^{T} [y_i - f(x_i)]² / Σ_{i=1}^{T} (y_i - ȳ)².    (5.136)

This definition was derived by manipulating an expression for the C_p in the following manner:

    C_p = (1/T)(1 + 2n/(T-n*)) Σ_{i=1}^{T} [y_i - f(x_i)]².    (5.137)

Scaling by T, dividing by Σ_{i=1}^{T} (y_i - ȳ)² and subtracting the resulting expression from one yields the defined expression for R²(C_p). The relationship of R²(C_p) with R² can be trivially derived from equation (5.120):

    R²(C_p) = 1 - (1 + 2n/(T-n*))(1 - R²).    (5.138)

Definition 6 We define the adjusted R²-type expression for the J_p as follows:

    R²(J_p) = 1 - [(T+n)/(T-n)] Σ_{i=1}^{T} [y_i - f(x_i)]² / Σ_{i=1}^{T} (y_i - ȳ)².    (5.139)

This definition was derived by manipulating an expression for the J_p in the following manner:

    J_p = (1/T)[(T+n)/(T-n)] Σ_{i=1}^{T} [y_i - f(x_i)]².    (5.140)

Scaling by T, dividing by Σ_{i=1}^{T} (y_i - ȳ)² and subtracting the resulting expression from one yields the defined expression for R²(J_p). The relationship of R²(J_p) with R² can be trivially derived from equation (5.120):

    R²(J_p) = 1 - [(T+n)/(T-n)](1 - R²).    (5.141)

Definition 7 We define the adjusted R²-type expression for the HQ as follows:
    R²(HQ) = 1 - (log T)^{2n/T} Σ_{i=1}^{T} [y_i - f(x_i)]² / Σ_{i=1}^{T} (y_i - ȳ)².    (5.142)

This definition was derived by manipulating an expression for the HQ in the following manner:

    HQ = (log T)^{2n/T} (1/T) Σ_{i=1}^{T} [y_i - f(x_i)]².    (5.143)

Scaling by T, dividing by Σ_{i=1}^{T} (y_i - ȳ)² and subtracting the resulting expression from one yields the defined expression for R²(HQ). The relationship of R²(HQ) with R² can be trivially derived from equation (5.120):

    R²(HQ) = 1 - (log T)^{2n/T}(1 - R²).    (5.144)

5.4 Data Re-Sampling Methods

    Unfortunately, since the eight equations are each of the tenth degree, reducing to the ninth degree when coordinates of two of the points are given numerical values, a straightforward elimination would seem to lead to an equation of degree 9^8 = 43,046,721. The number of algebraic operations in performing the elimination, solving for the others, would be a large multiple of this number, and would doubtless be sufficient to occupy a large and efficient computing project for many millenniums.

        Harold Hotelling (1941; 42).

This quote, although not directly relating to the topic of data re-sampling, indicates well the kinds of computational difficulties faced by applied econometricians of an earlier generation, and a sense of despair that accompanied these difficulties. Although future generations may find our claims to powerful computing capabilities to be as quaint as the above quote may seem to us now, it is true that the current state of computing allows us to use techniques which would have formerly been impossible due to computational burden. One such technique in model selection is cross-validation, which has already been discussed and used in this thesis. Another re-sampling technique is the bootstrap. References for this technique are numerous, but Efron (1982, 1983) are two works well worth noting.
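For a concrete sense of the two re-sampling schemes, both can be used to estimate the predictive squared error of a simple bivariate regression; the data-generating process, sample size and replication count below are illustrative assumptions only:

```python
import random
import statistics

def fit_line(xs, ys):
    # OLS intercept and slope for y = a + b*x
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

def loo_cv(xs, ys):
    # Cross-validation: each observation is left out exactly once
    errs = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:])
        errs.append((ys[i] - (a + b * xs[i])) ** 2)
    return statistics.fmean(errs)

def bootstrap_pe(xs, ys, reps=200, seed=1):
    # Bootstrap: resample *with replacement*, so an observation can be
    # omitted (or repeated) any number of times; score each fit on the
    # observations its resample missed
    rng = random.Random(seed)
    T, errs = len(xs), []
    for _ in range(reps):
        idx = [rng.randrange(T) for _ in range(T)]
        out = set(range(T)) - set(idx)
        if not out:
            continue
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        errs.extend((ys[i] - (a + b * xs[i])) ** 2 for i in out)
    return statistics.fmean(errs)

# Illustrative data: y = 1 + 2x + noise
rng = random.Random(0)
xs = [i / 2 for i in range(20)]
ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
cv_est, boot_est = loo_cv(xs, ys), bootstrap_pe(xs, ys)
```

Both estimates should sit near the noise variance of the simulated data; the bootstrap version scores each fit on the observations its resample happened to omit.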
The essential difference between cross-validation and the bootstrap for model selection is that the bootstrap samples "with replacement," which means that the same observation could be dropped twice (or more times) in calculating an estimate of the predictive squared error. Breiman and Spector (1992) find that the bootstrap and cross-validation do comparably well in a range of Monte Carlo experiments on model selection. However, Breiman and Spector use a different "selection algorithm" to that usually used. This is discussed in the next section.

5.5 Selection Algorithms

The method used by Breiman and Spector (1992) is what they call "complete" cross-validation, whereas the type of cross-validation that has been used in this thesis they call "partial" cross-validation. The difference is in the use of Assumption 1 of Appendix A. This assumption implies that the models selected using only T-1 observations are the same as the models selected using T observations. The complete cross-validation alternative is to select the best model for each sample of size T-1. This will yield T sequences of models, each sequence being generated on the way to determining the best model when the ith observation has been dropped. Then the best model is selected from all the models in the T sequences. This obviously results in many more models being admitted into the set of models for consideration as the best.

It can be noted that the procedure has nothing that ties it to the use of cross-validation for model selection at each stage. If one thought that the AIC gave the best guide to model selection, then that could be used to create the T sequences for final consideration, and indeed to determine the best model. So we see that this algorithm is independent of the criterion used, and it would be interesting to investigate the performance of different criteria in combination with this algorithm.
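A minimal sketch of this algorithm, using the AIC to generate the per-sample selections as contemplated above; the candidate class (polynomial degrees fitted by least squares) and the data are illustrative assumptions, not the models of this thesis:

```python
import math

def polyfit(xs, ys, deg):
    # Least-squares polynomial coefficients via the normal equations
    m = deg + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    for c in range(m):  # Gaussian elimination with partial pivoting
        p = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for r in range(c + 1, m):
            f = A[r][c] / A[c][c]
            A[r] = [arc - f * acc for arc, acc in zip(A[r], A[c])]
            b[r] -= f * b[c]
    coef = [0.0] * m
    for r in range(m - 1, -1, -1):
        coef[r] = (b[r] - sum(A[r][k] * coef[k] for k in range(r + 1, m))) / A[r][r]
    return coef

def predict(coef, x):
    return sum(c * x ** i for i, c in enumerate(coef))

def aic(xs, ys, deg):
    # Penalized log likelihood form of the AIC for a degree-`deg` fit
    coef = polyfit(xs, ys, deg)
    T = len(xs)
    sse = sum((y - predict(coef, x)) ** 2 for x, y in zip(xs, ys))
    return math.log(sse / T) + 2 * (deg + 1) / T

def complete_cv_select(xs, ys, max_deg=2):
    # For each dropped observation, rerun the AIC selection on the remaining
    # T-1 points; admit every per-sample winner, then rank the admitted
    # models by their accumulated held-out squared errors
    T = len(xs)
    winners, heldout = set(), {d: 0.0 for d in range(max_deg + 1)}
    for i in range(T):
        xtr, ytr = xs[:i] + xs[i+1:], ys[:i] + ys[i+1:]
        winners.add(min(range(max_deg + 1), key=lambda d: aic(xtr, ytr, d)))
        for d in range(max_deg + 1):
            coef = polyfit(xtr, ytr, d)
            heldout[d] += (ys[i] - predict(coef, xs[i])) ** 2
    return min(winners, key=lambda d: heldout[d])
```

Each leave-one-out sample gets its own AIC winner; the union of winners is the admitted class, and the final choice is the admitted model with the smallest accumulated held-out squared error.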
Breiman and Spector do not allow other criteria the use of this option. If a larger class of models for final consideration helps the performance of model selection, then perhaps there are alternative ways of generating this class than that considered by Breiman and Spector. For example, one may try drawing a random sample from the set of all possible models. The more models drawn for consideration, the higher the probability that the best model is contained in this sample. Although there remains the problem of how to determine the size of the sample of models to be drawn, this new algorithm could be a useful alternative procedure.

5.6 References for Further Reading

Some excellent reviews of the model selection problem and suggested solutions are: Amemiya (1980), Anscombe (1967), Eubank (1988; Chapter 2), Judge et al. (1985; Chapter 21), Hocking (1976), Maddala (1992; Chapter 12), Miller (1990) and Thompson (1978). These reviews all contain both general introductions to the problem, as well as some depth on the relative properties of selection criteria. Stoica, Eykhoff, Janssen and Soderstrom (1986) give a massive bibliography and summary of many papers on model selection criteria. However, their paper is also notable for its lack of reference to the GCV.

Special issues of two journals have been dedicated to papers on model selection. The Journal of Econometrics (1981, Vol. 16, No. 1, edited by G.S. Maddala) carries papers by H. Akaike, A.C. Atkinson, G.C. Chow, and others. These papers deal with a broad range of issues relating to model selection. Other issues of the same journal also carry papers which address the general problem (1982, Vol. 20, 1-157 and 1983, Vol. 21, 1-160, both edited by H. White). Psychometrika (1987, Vol. 52, No. 3, edited by Y. Takane and H. Bozdogan) carries articles by H. Akaike, S.L. Sclove, H. Bozdogan, and others.
The articles are quite general in their specification of the problem of model selection, but have special reference to techniques common in psychometrics, such as factor analysis.

References for some less commonly employed methods for model selection than those already mentioned are now given. Potscher (1983) suggested a stepwise testing procedure using Lagrange multiplier tests in the context of determining the order of an ARMA process, which is a common application of selection criteria. Pollak and Wales (1991) introduced a "likelihood dominance" criterion to compare different models. They used this method to compare the performance of different functional form parameterizations in a consumer demand system example. Rissanen (1986, 1987, 1988) has described new criteria based on the principle of stochastic complexity. Mills and Prasad (1992) find Monte Carlo evidence against Rissanen's criteria and support for the Schwarz Criterion (Schwarz, 1978) and the related Bayesian Estimation Criterion proposed by Geweke and Meese (1981). A review of F-to-enter and F-to-delete tests and a range of other methods for model selection can be found in the monograph by Miller (1990).

5.7 Other Considerations in Model Choice

This thesis has concentrated on the use of a range of selection criteria for model choice. All these criteria have the same underlying principle, which is to be the one criterion for use in determining the appropriate choice out of a large set of competing models. The fact that there are a large number of models to be compared may necessitate the use of simple criteria. However, if we are comparing a small set of potential model specifications, it would seem wise not to exclude other considerations in assessing model performance.
Issues such as the presence of heteroscedasticity or serial correlation in the error terms, sensitivity to alternative parameterizations, and multicollinearity should also be taken into account. An excellent review of a wide range of considerations in model selection is given in Nakamura, Nakamura and Duleep (1990).

It is worth stressing that economic theory should be at the forefront when a decision is made to accept or reject a particular model. For example, if a large sequential selection process using a single model selection criterion is performed, then the resulting model should be examined to see if it makes economic sense. Despite its limitations, and perhaps a long and tedious road to finding a model which minimizes some negative objective function, economic theory should have a final right to veto.

5.8 Overview

References to a wide variety of procedures for model selection have been given. References to many more methods can be found in the works cited above. It has been seen that there are many possible paths to follow when faced with the problem of determining an appropriate model. However, the over-ruling factor must be the appropriateness of a model to the particular context. This chapter began with a quote from Amemiya, and will end with another from the same source (Amemiya, 1980; 352):

    ...the selection of regressors should be primarily based on one's knowledge of the underlying economic theory and one's intuition which cannot be easily quantified.
Chapter 6

Summary and Conclusions

    But the more I studied economic science, the smaller appeared the knowledge which I had of it, in proportion to the knowledge I needed; and now, at the end of nearly half a century of almost exclusive study of it, I am conscious of more ignorance of it than I was at the beginning of the study.

        Alfred Marshall (1842-1924), quoted in J.M. Keynes (1933; 167).

6.1 Summary of Contributions

This thesis has addressed several important problems in applied econometrics. Each of these problems was addressed with reference to the use of model selection criteria. The issue of determining appropriate unit root testing procedures is of continuing interest, and the performance of one such testing procedure (Augmented Dickey-Fuller) was examined with particular reference to the choice of model selection criteria in determining the relevant regression equations. The problem of modelling technical progress in a production economy is of great interest. A non-parametric approach was introduced to allow the flexible approximation of technical progress, and the properties of the models resulting from the use of various model selection criteria in combination with this approach were examined. Multiple equation context forms were introduced for two criteria which previously only had single equation context forms.

The same flexible modelling strategy was applied to a resource economics context, but this time investigating the non-parametric estimation of returns to scale. In each application it was found that the results were influenced greatly by the choice of the model selection criterion.

In the Augmented Dickey-Fuller unit root testing context, it was found that the tests performed particularly poorly in small samples, and performance in general depended on how the parameterizations of the regression equations used in the tests were determined.
Poor performance was defined as the inability to reject a false null hypothesis a large number of times in repeated trials. Observations on the relative properties of the criteria considered were made, from both a theoretical and an empirical perspective.

In the context of modelling technical progress in a production economy, it was found that the choice of criterion greatly affected the estimation of productivity, as well as the own and cross price elasticities of the aggregate goods considered in the economy. The employed strategy of adaptively fitting spline functions to the time trend variable was found to be successful in returning more sensible measures of technical progress than the use of a single linear time trend. It was also found that there was quite a close correspondence between the measures of technical progress from the spline-inclusive models and from a (smoothed) Fisher index number approach. This indicates that if one is only interested in obtaining a measure of technical progress then the use of the Fisher index approach returns an adequate measure, without the computational burden of estimating a system of equations with adaptively fitted splines. If, however, we are interested in obtaining reliable elasticity estimates, then the computational burden of appending, and adaptively fitting, splines in the time trend variable to a system of demand equations seems to be well justified by the results.

In the context of modelling returns to scale using cross-section data, it was found that allowing splines in the capital variable in a demand system again had an impact on the estimated own and cross price elasticities, as well as the returns to scale measure. The application was to data on the British Columbia sablefish fishery, which is of recent interest due to rent capture debates.
The different parameter estimates obtained through using a non-parametric approach to modelling returns to scale, compared to the use of a model with capital entering in a predetermined linear fashion, could lead to differing policy recommendations.

The thesis is concluded with a survey-style review of some methods for model selection, and their relative properties, along with some historical notes.

6.2 Directions for Further Research

The thesis has reported some important results in applied econometrics. However, there remain areas for useful extensions. Some of these areas are discussed in this section.

A particularly pressing unsolved problem is the determination of a "best" criterion for model selection. Although this thesis has compared the use of some criteria, and has repeatedly stressed the importance of using and comparing a range of models selected by the criteria, it would be a boon to the applied researcher to have a single criterion which could be relied upon. The literature on deriving appropriate criteria has typically focused on asymptotic properties or, in the case of data re-sampling techniques, on the appropriate way to get an accurate calculation of a model's predictive ability. Despite the rigorous pursuit of an ideal criterion, there seems to be no agreement on whether such a criterion exists. For example, while the Akaike Information Criterion is the most widely used criterion in the economics literature, the Generalized Cross-Validation Criterion is the choice of many statisticians.

An alternative direction for the research on this issue to follow may be to take an axiomatic approach. This would imply deriving a set of reasonable axioms that a criterion should be able to satisfy. One may be that the penalty term for additional parameters should depend on how many remaining degrees of freedom exist—a larger penalty associated with an additional parameter when degrees of freedom are scarce.
Such a set of axioms may help us in choosing an appropriate criterion. Such a set of axioms could also aid in the construction of a "new" criterion. Given the large number of available model selection criteria, it may be interesting to take a weighted average of a number of these to use in model selection. The determination of the weights could be through reference to the axioms—the criterion which satisfies the most axioms gets the largest weight.

Some possible extensions to the investigation of model selection criteria in the context of unit root testing have been discussed in Chapter 2. That chapter looked at the properties of the Augmented Dickey-Fuller (ADF) tests under a range of different circumstances, but all relating to the properties when faced with a false null hypothesis. It would be of interest to follow up this study by examining different designs for the Monte Carlo experiments. For example, an investigation of the application of various criteria when dependent variable lags are required due to a known form of serial dependence in the ADF regression equation error terms. Another context of interest would be when the null hypothesis of a unit root is true. An alternative form of unit root testing (Campbell and Mankiw, 1987) could also be investigated for its reliance on the choice of model selection criterion.

The studies on splining to allow for the flexible modelling of technical progress and returns to scale could be extended to consider the use of alternative spline fitting strategies. The parameter estimates obtained could differ greatly with a change in strategy, as we have seen them change with the application of different selection criteria.

There are also further, more tangential, directions for future research, too numerous to discuss here. Both the results presented in this thesis, and those of the suggested extensions, should be of interest to applied econometricians working in many fields.
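The weighted-average suggestion made earlier in this section can be sketched concretely; the component criteria, their penalized log likelihood forms, and the weights below are purely illustrative assumptions (in the proposal above, the weights would come from counting satisfied axioms):

```python
import math

def weighted_criterion(sse, T, n, weights=None):
    # Hypothetical combined criterion: a weighted average of penalized
    # log-likelihood criteria; the weights here are arbitrary placeholders
    if weights is None:
        weights = {"AIC": 0.5, "SC": 0.3, "HQ": 0.2}
    base = math.log(sse / T)  # goodness-of-fit term common to all components
    scores = {
        "AIC": base + 2.0 * n / T,
        "SC":  base + n * math.log(T) / T,
        "HQ":  base + 2.0 * n * math.log(math.log(T)) / T,
    }
    return sum(w * scores[k] for k, w in weights.items())
```

Because every component shares the same goodness-of-fit term, the combination behaves like a single criterion whose penalty is the weighted average of the component penalties.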
Bibliography [1] Akaike, H. (1969), "Fitting Autoregressive Models for Prediction," Annals of the Institute of Statistical Mathematics, 21, 243-247. [2] (1973), "Information Theory and an Extension of the Maximum Likelihood Principle" in: B.N. Petrov and F. Caski, eds., Second International Symposium on Information Theory, Akademiai Kiado, Budapest, 267-281. [3] Alhberg, J.H., E.N. Nilson and J.L. Walsh (1967), The Theory of Splines and their Applications, Academic Press, New York. [4] Allen, D.M. (1971), "The Prediction Sum of Squares as a Criterion for Selecting Predictor Variables," Department of Statistics, University of Kentucky, Technical Report 23. [5] (1974), "The Relationship Between Variable Selection and Data Augmentation and a method for Prediction," Technometrics, 16, 125-127. [6] Amemiya, T. (1980), "Selection of Regressors," International Economic Review, Vol.21, 331-354. [7] Anscombe, F.J. (1967), "Topics in the Investigation of Linear Relations fitted by the Method of Least Squares," Journal of the Royal Statistical Society, Series B, with discussion, 29, 1-52. [8] Berndt, E.R. and D.O. Wood (1975), "Technology, Prices, and the Derived Demand for Energy," Review of Economics and Statistics, 3, 259-268. [9] (1986a) "U.S. Manufacturing Output and Factor Input Price and Quantity Series, 1908-47 and 1947-81," M.I.T. Energy Laboratory Working Paper No. 86-010 WP. [10] (1986b), "Energy Price Shocks and the Slowdown in U.S. and U.K. Manufacturing," Oxford Review of Economic Policy, Vol. 2, 1- 31. [11] (1987), "Energy Price Shocks and Productivity Growth: A Survey," in: R.L. Gordon, H.D. Jacoby and M.B. Zimmerman, Energy Markets and Regtdation, Essays in Honor of M.A. Adelman, The M.I.T. Press, Cambridge, 305-342. 155 Bibliography 156 [12] Bjorndal, T. and D.V. Gordon (1988), "Price Response and Optimal Vessel Size in a Multi-Output Fishery," in Rights Based Fishing, (P.A. Neher, R. Arnason, and N. Mollett Eds.) 
Dordrecht, Holland: Kluwer Academic Publishers, 389-411.
[13] Blough, S. (1988), "On the Impossibility of Testing for Unit Roots and Cointegration in Finite Samples," Working Paper No. 211, Johns Hopkins University.
[14] Bozdogan, H. (1987), "Model Selection and Akaike's Information Criterion (AIC): The General Theory and its Analytical Extensions," Psychometrika, 52, 345-370.
[15] Breiman, L. (1991), "The Π Method for Estimating Multivariate Functions from Noisy Data," Technometrics, Vol. 33, 125-160.
[16] Breiman, L. and D. Freedman (1983), "How Many Variables Should be Entered in a Regression Equation?" Journal of the American Statistical Association, 78, 131-136.
[17] Breiman, L., J.H. Friedman, R.A. Olshen and C.J. Stone (1984), Classification and Regression Trees, Wadsworth, Belmont, California.
[18] Breiman, L. and P. Spector (1992), "Submodel Selection and Evaluation in Regression: The X-Random Case," International Statistical Review, 60, 291-319.
[19] Brown, R.S. and R.L. Christensen (1981), "Estimates of Elasticities of Substitution in a Model of Partial Static Equilibrium: An Application to U.S. Agriculture," in: E.R. Berndt and B.C. Fields, eds., Modeling and Measuring Natural Resource Substitution, MIT Press, Cambridge, Mass., 209-229.
[20] Campbell, J.Y. and N.G. Mankiw (1987), "Are Output Fluctuations Transitory?" Quarterly Journal of Economics, 102, 857-880.
[21] Christensen, L.R., D.W. Jorgenson and L.J. Lau (1971), "Conjugate Duality and the Transcendental Logarithmic Production Function," Econometrica, 39, 255-256.
[22] (1973), "Transcendental Logarithmic Production Frontiers," Review of Economics and Statistics, 55, 28-45.
[23] Cragg, J.G. and S.G. Donald (1993), "On Several Criteria for Estimating the Rank of a Matrix," manuscript, University of British Columbia.
[24] Craven, P. and G.
Wahba (1979), "Smoothing Noisy Data with Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377-403.
[25] Davidson, R. and J.G. MacKinnon (1993), Estimation and Inference in Econometrics, Oxford University Press, New York.
[26] De Boor, C. (1979), A Practical Guide to Splines, Springer-Verlag, New York.
[27] Dickey, D.A., W.R. Bell and R.B. Miller (1986), "Unit Roots in Time Series Models: Tests and Implications," The American Statistician, 40, 12-26.
[28] Dickey, D.A. and W.A. Fuller (1979), "Distribution of the Estimators for Autoregressive Time Series with a Unit Root," Journal of the American Statistical Association, 74, 427-431.
[29] Diewert, W.E. (1973), "Functional Forms for Profit and Transformation Functions," Journal of Economic Theory, 6, 284-316.
[30] (1974), "Applications of Duality Theory," in: M.D. Intriligator and D.A. Kendrick, eds., Frontiers of Quantitative Economics, Vol. II, North-Holland, Amsterdam, 106-171.
[31] (1976), "Exact and Superlative Index Numbers," Journal of Econometrics, 4, 115-145.
[32] (1992), "Fisher Ideal Output, Input and Productivity Indexes Revisited," Journal of Productivity Analysis, 3, 211-248.
[33] Diewert, W.E. and L. Ostensoe (1988), "Flexible Functional Forms for Profit Functions and Global Curvature Conditions," in: W.A. Barnett, E.R. Berndt and H. White, eds., Dynamic Econometric Modelling, Chapter 3, 43-51.
[34] Diewert, W.E. and T.J. Wales (1987), "Flexible Functional Forms and Global Curvature Conditions," Econometrica, Vol. 55, 43-68.
[35] (1990), "Quadratic Spline Models for Producer Supply and Demand Functions," University of British Columbia Department of Economics Discussion Paper No. 90-08.
[36] (1992), "Quadratic Spline Models for Producer's Supply and Demand Functions," International Economic Review, Vol. 33, 705-722.
[37] (1993), "Linear and Quadratic Spline Models for Consumer Demand Functions," Canadian Journal of Economics, No. 1, 77-106.
[38] Divisia, F. (1926), L'indice monétaire et la théorie de la monnaie, Société Anonyme du Recueil Sirey, Paris.
[39] Dupont, D.P. (1990), "Rent Dissipation in Restricted Access Fisheries," Journal of Environmental Economics and Management, 19(1), 24-44.
[40] Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans, Society for Industrial and Applied Mathematics, Philadelphia.
[41] Efron, B. (1983), "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation," Journal of the American Statistical Association, 78, 316-331.
[42] Eubank, R.L. (1988), Spline Smoothing and Nonparametric Regression, Marcel Dekker, New York.
[43] Feder, P.I. (1975), "The Log Likelihood Ratio in Segmented Regression," The Annals of Statistics, Vol. 3, 84-97.
[44] Fisher, I. (1922), The Making of Index Numbers, Houghton Mifflin, Boston.
[45] Friedman, J.H. (1991), "Multivariate Adaptive Regression Splines," with discussion, The Annals of Statistics, Vol. 19, 1-67.
[46] Friedman, J.H. and B.W. Silverman (1989), "Flexible Parsimonious Smoothing and Additive Modeling," with discussion, Technometrics, Vol. 31, 3-21.
[47] Fuller, W.A. (1969), "Grafted Polynomials as Approximating Functions," Australian Journal of Agricultural Economics, 13, 35-46.
[48] (1976), Introduction to Statistical Time Series, John Wiley & Sons, New York.
[49] Geisser, S. (1974), "A Predictive Approach to the Random Effect Model," Biometrika, Vol. 61, 101-107.
[50] (1975), "The Predictive Sample Reuse Method with Applications," Journal of the American Statistical Association, Vol. 70, 320-328.
[51] Geweke, J. and R. Meese (1981), "Estimating Regression Models of Finite but Unknown Order," International Economic Review, 22, 55-70.
[52] Gorman, J.W. and R.J.
Toman (1966), "Selection of Variables for Fitting Equations to Data," Technometrics, 8, 27-51.
[53] Grafton, R.Q. (1992), "Rent Capture in Rights Based Fisheries," unpublished Ph.D. thesis, Department of Economics, University of British Columbia.
[54] (1994), "Rent Capture in a Rights Based Fishery," Journal of Environmental Economics and Management, forthcoming.
[55] Hinkley, D.V. (1969), "Inference About the Intersection in Two-Phase Regression," Biometrika, Vol. 56, 495-504.
[56] (1970), "Inference in Two-Phase Regression," Journal of the American Statistical Association, Vol. 66, 736-743.
[57] Hocking, R.R. (1976), "The Analysis and Selection of Variables in Linear Regression," Biometrics, 32, 1-49.
[58] Hotelling, H. (1941), "Experimental Determination of the Maximum of a Function," Annals of Mathematical Statistics, 12, 20-45.
[59] Jorgenson, D.W. and Z. Griliches (1972), "The Explanation of Productivity Change," Review of Economic Studies, 34, 249-283.
[60] Judge, G., W. Griffiths, R. Hill, H. Lütkepohl and T. Lee (1985), The Theory and Practice of Econometrics, 2nd edition, John Wiley & Sons, New York.
[61] Keynes, J.M. (1933), Essays in Biography, Harcourt, Brace and Company, New York.
[62] Kohli, U. (1991), Technology, Duality and Foreign Trade, Harvester Wheatsheaf, Hertfordshire.
[63] Leamer, E.E. (1983), "Model Choice and Specification Analysis," in: Z. Griliches and M.D. Intriligator, eds., Handbook of Econometrics, Vol. I, Chapter 5, North-Holland, New York.
[64] Leontief, W.W. (1947a), "A Note on the Interrelation of Subsets of Independent Variables of a Continuous Function with Continuous First Derivatives," Bulletin of the American Mathematical Society, 53, 343-350.
(1947b), "Introduction to a Theory of the Internal Structure of Functional Relationships," Econometrica, 15, 361-373.
[65] Maddala, G.S. (1981), editor, special issue of the Journal of Econometrics, Vol. 16, No. 1.
[66] (1992), Econometrics, Second Edition, McGraw-Hill.
[67] Mallows, C.L. (1964), "Choosing Variables in a Linear Regression: A Graphical Aid," presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, Kansas (May).
[68] (1973), "Some Comments on Cp," Technometrics, 15, 661-676.
[69] Miller, A.J. (1990), Subset Selection in Regression, Chapman and Hall, London.
[70] Mills, J.A. and K. Prasad (1992), "A Comparison of Model Selection Criteria," Econometric Reviews, 11, 201-233.
[71] Moroney, J.R. and J.M. Trapani (1981), "Alternative Models of Substitution and Technical Change in Natural Resource Intensive Industries," in: E.R. Berndt and B.C. Fields, eds., Modeling and Measuring Natural Resource Substitution, MIT Press, Cambridge, Mass., 48-69.
[72] Nakamura, A., M. Nakamura and H. Duleep (1990), "Alternative Approaches to Model Choice," Journal of Economic Behavior and Organization, 14, 97-125.
[73] Nelson, C.R. and C.I. Plosser (1982), "Trends and Random Walks in Macroeconomic Time Series: Some Evidence and Implications," Journal of Monetary Economics, 10, 139-162.
[74] Orloff, G.A. (1937), "Actuarial Note: A Guide to Graphic Graduation," Actuarial Society of America, Transactions, Vol. XXXVIII, 12-13.
[75] Perron, P. (1988), "Trends and Random Walks in Macro-Economic Time Series: Further Evidence from a New Approach," Journal of Economic Dynamics and Control, 12, 297-332.
[76] Phillips, P.C.B. (1987), "Time Series Regression with a Unit Root," Econometrica, 55, 277-301.
[77] Poirier, D.J. (1976), The Econometrics of Structural Change, North-Holland, Amsterdam.
[78] Pötscher, B.M. (1983), "Order Estimation in ARMA-Models by Lagrangian Multiplier Tests," Annals of Statistics, 11, 872-885.
[79] Renzetti, S. (1992), "Structure of Industrial Water Demands," Land Economics, 68(4), 396-404.
[80] Rissanen, J. (1986), "Stochastic Complexity and Modeling," The Annals of Statistics, 14, 1080-1100.
[81] (1987), "Stochastic Complexity," Journal of the Royal Statistical Society, Series B, with discussion, 49, 223-265.
[82] (1988), "Stochastic Complexity and the MDL Principle," Econometric Reviews, 85-102.
[83] Rothman, D. (1968), Letter to the Editor, Technometrics, 10, 432.
[84] Said, S.E. and D.A. Dickey (1984), "Testing for Unit Roots in Autoregressive-Moving Average Models of Unknown Order," Biometrika, 71, 599-607.
[85] Schmidt, P. (1971), "Methods of Choosing Among Alternative Models," Michigan State University Econometrics Workshop Paper No. 7004.
[86] (1973), "Calculating the Power of the Minimum Standard Error Choice Criterion," International Economic Review, 14, 253-255.
[87] (1974), "Choosing Among Alternative Linear Regression Models," Atlantic Economic Journal, Vol. II, 7-13.
[88] (1975), "Choosing Among Alternative Linear Regression Models: A Correction and Some Further Results," Atlantic Economic Journal, Vol. III, 61-63.
[89] Schoenberg, I.J. (1946), "Contributions to the Problem of Approximation of Equidistant Data by Analytic Functions," Parts I and II, Quarterly of Applied Mathematics, 4, 45-99 and 112-141.
[90] Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461-464.
[91] Schwert, G.W. (1989), "Testing for Unit Roots: A Monte Carlo Investigation," Journal of Business and Economic Statistics, 7, 147-159.
[92] SHAZAM (1993), User's Reference Manual Version 7.0, McGraw-Hill.
[93] Smith, P.L. (1982), "Curve Fitting and Modeling with Splines Using Statistical Variable Selection Techniques," NASA Report 166034, Langley Research Center, Hampton, VA.
[94] Sono, M. (1945), "The Effect of Price Changes on the Demand and Supply of Separable Goods," Kokumin Keizai Zasshi, 74, 1-51. Translated in the International Economic Review (1961), 2, 239-271.
[95] Solow, R.M. (1957), "Technical Change and the Aggregate Production Function," Review of Economics and Statistics, 39, 312-320.
[96] Squires, D. (1987), "Public Regulation and the Structure of Production in Multiproduct Industries: An Application to the New England Otter Trawl Industry," Rand Journal of Economics, 18(2), 232-248.
[97] Stoica, P., P. Eykhoff, P. Janssen and T. Söderström (1986), "Model-Structure Selection by Cross-Validation," International Journal of Control, 43, 1841-1878.
[98] Stone, M. (1974), "Cross-Validatory Choice and Assessment of Statistical Predictions," with discussion, Journal of the Royal Statistical Society, Series B, 36, 111-147.
[99] (1977), "An Asymptotic Equivalence of Choice of Model by Cross-validation and Akaike's Criterion," Journal of the Royal Statistical Society, Series B, 39, 44-47.
[100] Takane, Y. and H. Bozdogan (1987), editors, special issue of Psychometrika, Vol. 52, No. 3.
[101] Theil, H. (1961), Economic Forecasts and Policy, 2nd edition, North-Holland, Amsterdam.
[102] Thompson, M.L. (1978), "Selection of Variables in Multiple Regression," Part I and Part II, International Statistical Review, 46, 1-21 and 129-146.
[103] Törnqvist, L. (1936), "The Bank of Finland's Consumption Price Index," Bank of Finland Monthly Bulletin, 10, 1-8.
[104] Tukey, J.W. (1967), Discussion of Anscombe (1967), Journal of the Royal Statistical Society, Series B, 29, 47-48.
[105] Watson, A.D. (1937), Written Discussion of Orloff (1937), Actuarial Society of America, Transactions, Vol. XXXVIII, 521-524.
[106] Wiley, D.E., W.H. Schmidt and W.J. Bramble (1973), "Studies of a Class of Covariance Structure Models," Journal of the American Statistical Association, 68, 317-323.
[107] White, H. (1982), editor, special issue of the Journal of Econometrics, Vol. 20, 1-157.
[108] White, H. (1983), editor, special issue of the Journal of Econometrics, Vol. 21, 1-160.
[109] White, K.J. (1978), "A General Computer Program for Econometric Methods - SHAZAM," Econometrica, January, 239-240.
[110] Zhang, P.
(1993), "Model Selection Via Multifold Cross Validation," Annals of Statistics, Vol. 21, 299-313.

Appendix A

Cross-Validation, GCV, etc.

Cross-validation involves deleting each observation in turn from the sample period, estimating the model, and then calculating the average of the squared residuals for the left-out observations. The cross-validation score (CV) is therefore given by:

CV = (1/T) \sum_{i=1}^{T} [y_i - f_{-i}(x_i)]^2.    (A.1)

In our application, the variables that we are considering deleting are additional spline segments. Therefore, the CV is a function of the number of knots in our model, and gives us a measure of future prediction error. A direct calculation of the CV is computationally burdensome. Fortunately, we can transform the full sample errors to obtain the cross-validation residuals, as expressed in Theorem 2. Two proofs of the theorem exist. The lesser known proof, due to Schmidt (Spring, 1971), is presented first, then the more commonly known proof due to Allen (August, 1971). Both proofs implicitly use the following assumption:

Assumption 1 The sequence of models that is selected using all T observations is the same as the sequence that would be selected using T - 1 observations, where the ith observation has been dropped, i = 1, 2, ..., T.

This allows us to take the sequence of the X matrix as fixed for each observation that is dropped, and just consider how our coefficient vector changes when each model has one less observation.

Proof A of Theorem 2: Schmidt. Let X_{-i} be the X matrix of regressors with the ith row deleted, and similarly for Y_{-i}, the ((T-1) x 1) dependent variable vector. If the full sample estimate of \beta is given by \hat{\beta} = (X'X)^{-1}X'Y, then the T - 1 sample estimate of \beta is:

\hat{\beta}_{-i} = (X'_{-i}X_{-i})^{-1}X'_{-i}Y_{-i}.

A prediction of y_i is obtained from:

f_{-i}(x_i) = x_i'\hat{\beta}_{-i}.    (A.2)
The prediction error v_i is then defined as:

v_i = y_i - x_i'\hat{\beta}_{-i},    (A.3)

while the ith full sample least squares residual is defined as:

e_i = y_i - x_i'\hat{\beta}.    (A.4)

Substituting for \hat{\beta} in equation (A.4) we get the following:

e_i = y_i - x_i'(X'X)^{-1}X'Y.

Now, using the equalities

X'_{-i}Y_{-i} = X'_{-i}X_{-i}\hat{\beta}_{-i}    (A.5)

and x_i y_i = x_i x_i'\hat{\beta}_{-i} + x_i v_i, we can re-express e_i as follows:

e_i = y_i - x_i'(X'X)^{-1}(X'_{-i}X_{-i}\hat{\beta}_{-i} + x_i x_i'\hat{\beta}_{-i}) - x_i'(X'X)^{-1}x_i v_i
    = y_i - x_i'(X'X)^{-1}(X'X)\hat{\beta}_{-i} - x_i'(X'X)^{-1}x_i v_i
    = y_i - x_i'\hat{\beta}_{-i} - x_i'(X'X)^{-1}x_i v_i
    = (1 - x_i'(X'X)^{-1}x_i) v_i.    (A.6)

Defining c_i = x_i'(X'X)^{-1}x_i and re-arranging equation (A.6) we get:

v_i = y_i - f_{-i}(x_i) = (y_i - x_i'\hat{\beta})/(1 - c_i).    (A.7)

We have expressed the prediction, or cross-validation, errors in terms of the full sample errors. Squaring, summing and averaging the transformed full sample errors yields:

CV = (1/T) \sum_{i=1}^{T} [y_i - f(x_i)]^2 / [1 - c_i]^2.    (A.8)

Proof B of Theorem 2: Allen. We use the Sherman-Morrison formula for a rank one downdate of a nonsingular inverse matrix (X'X)^{-1}, where X is the (T x n) matrix of regressors, x' is the (1 x n) data vector being removed, and we assume that 1 - x'(X'X)^{-1}x \neq 0:

(X'X - xx')^{-1} = (X'X)^{-1} + (X'X)^{-1}x(1 - x'(X'X)^{-1}x)^{-1}x'(X'X)^{-1}.    (A.9)

Define \hat{\beta} as the least squares linear regression coefficient vector of dimension n from the regression involving all the observations:

\hat{\beta} = (X'X)^{-1}X'Y.    (A.10)

Let \hat{\beta}_{-i} denote the parameter vector derived using all observations except the ith. Noting that x_i' is a (1 x n) row vector and y_i a scalar, we can then write:

x_i'\hat{\beta}_{-i} = x_i'(X'X - x_i x_i')^{-1}(X'Y - x_i y_i)
                     = [x_i'(X'X)^{-1} + x_i'(X'X)^{-1}x_i (1 - c_i)^{-1} x_i'(X'X)^{-1}](X'Y - x_i y_i),    (A.11)

where the scalar c_i = x_i'(X'X)^{-1}x_i. Note that the c_i are the diagonal elements of the projection matrix X(X'X)^{-1}X'. Now consider the per observation deviation of predicted values from the true values of the dependent variable:

y_i - f_{-i}(x_i) = y_i - (1 - c_i)^{-1} x_i'(X'X)^{-1}(X'Y - x_i y_i)
                  = [(1 - c_i) y_i - x_i'\hat{\beta} + c_i y_i]/(1 - c_i)
                  = (y_i - x_i'\hat{\beta})/(1 - c_i).    (A.12)

We have expressed the cross-validation errors in terms of the full sample errors. Squaring, summing and averaging the adjusted full sample errors yields:

CV = (1/T) \sum_{i=1}^{T} [y_i - f(x_i)]^2 / [1 - c_i]^2.    (A.13)

To derive the GCV of Craven and Wahba (1979) we make our second assumption:

Assumption 2 The diagonal elements of the matrix (I - X(X'X)^{-1}X') are (nearly) constant.

This allows us to use the average of these diagonal elements in what follows. The trace of (I - X(X'X)^{-1}X') is T - n in the single equation case, where n is the dimension of the parameter vector. The average of the diagonal elements is then (1 - n/T). Squaring, summing and averaging equation (A.12) over all T observations, and using Assumption 2, yields the following insightful expression for the GCV:

GCV = (1/T) || (I - X(X'X)^{-1}X')Y ||^2 / [(1/T) Trace(I - X(X'X)^{-1}X')]^2.    (A.14)

Equivalently, we can write the GCV in the form given in Section 3.1:

GCV = (1/T) \sum_{i=1}^{T} [y_i - f(x_i)]^2 / [1 - n/T]^2.    (A.15)

The log GCV form was derived as follows. Normality was assumed so that the log likelihood function can be written in the usual way:

log L(\beta, \sigma^2) = -(T/2) log(2\pi) - (T/2) log(\sigma^2) - (1/(2\sigma^2)) \sum_{i=1}^{T} (y_i - x_i'\beta)^2.    (A.16)

The maximum likelihood estimator of \sigma^2 is:

\hat{\sigma}^2 = (1/T) \sum_{i=1}^{T} (y_i - x_i'\hat{\beta})^2.    (A.17)

Substituting \hat{\sigma}^2 into the log likelihood, we have:

-(2/T) log L(\hat{\beta}) = log((1/T) \sum_{i=1}^{T} (y_i - x_i'\hat{\beta})^2),    (A.18)

ignoring the additive constants. Taking the log of the GCV and substituting from equation (A.18), we have the log GCV form:

log GCV = -(2/T) log L(\hat{\beta}) - 2 log(1 - n/T).    (A.19)

Multiplying through by T/2 we get the following more attractive form that appears in the text above:

GCV = -log L(\hat{\beta}) - T log(1 - n/T).    (A.20)

Now consider the multiple equation context described in Section 3.4.2.
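The algebra of Theorem 2 and of the GCV forms above can be checked numerically. The following sketch (the data and variable names are invented for illustration, not taken from the thesis) compares the brute-force leave-one-out CV of equation (A.1) with the transformed-residual form (A.8), and confirms that the level form (A.15) and the log form (A.20) of the GCV rank models identically up to additive constants.

```python
import numpy as np

# Illustrative data; the design and coefficients are made up for this check.
rng = np.random.default_rng(0)
T, n = 50, 3
X = rng.normal(size=(T, n))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=T)

# Full-sample least squares, residuals e_i, and the leverages c_i
# (the diagonal of the projection matrix X(X'X)^{-1}X').
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
c = np.einsum("ij,ij->i", X @ np.linalg.inv(X.T @ X), X)

# Theorem 2 shortcut, equation (A.8): CV from full-sample residuals.
cv_short = np.mean((e / (1 - c)) ** 2)

# Direct CV, equation (A.1): delete each observation, re-estimate, predict it.
v = np.empty(T)
for i in range(T):
    keep = np.arange(T) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    v[i] = y[i] - X[i] @ b_i
cv_direct = np.mean(v ** 2)
assert np.isclose(cv_short, cv_direct)

# GCV, equation (A.15), and the log form (A.20) ignoring additive constants:
# (T/2) * log(GCV) = (T/2) * log(RSS/T) - T * log(1 - n/T).
rss = np.sum(e ** 2)
gcv = (rss / T) / (1 - n / T) ** 2
log_form = (T / 2) * np.log(rss / T) - T * np.log(1 - n / T)
assert np.isclose((T / 2) * np.log(gcv), log_form)
```

The single matrix inversion in the shortcut replaces T separate regressions, which is the computational point of Theorem 2.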
The matrix of regressors X and the identity matrix I are "stacked" matrices of dimension G(T x h), where G is the number of equations in the system, T is the sample size and h is the number of variables in each equation. We define the CV in this multiple equation context as follows:

CV = |\hat{\Sigma}_{CV}|^{1/G},    (A.21)

where \hat{\Sigma}_{CV} is the variance-covariance matrix derived from the cross-validation residuals. The exponent (1/G) is included as we want an approximation to the average predictive squared error for each model. We derived a GCV-type criterion from an approximation to this definition of the CV (FCV), which we called the FGCV:

FGCV = |\hat{\Sigma}|^{1/G} / (1 - h/T)^2,    (A.22)

where \hat{\Sigma} is the variance-covariance matrix, using all observations. We take the log of this expression:

log FGCV = (1/G) log|\hat{\Sigma}| - 2 log(1 - h/T).    (A.23)

Recall that in the multiple equation context, the concentrated log likelihood function is:

log L(\hat{\beta}) = -(GT/2) log(2\pi) - (T/2) log|\hat{\Sigma}| - (GT/2).    (A.24)

Therefore, ignoring the constants in the log likelihood, we can write the FGCV as:

log FGCV = (2/(GT))(-log L(\hat{\beta})) - 2 log(1 - h/T).    (A.25)

Then multiplying through by GT/2, we get our form for the FGCV which was given in the text above:

FGCV = -log L(\hat{\beta}) - GT log(1 - h/T).    (A.26)

We want to find the model which minimizes the FGCV.

Appendix B

Determining M(T, \theta)

Friedman and Silverman (1989) solved the potential knot placement location decision problem by using the following argument. We want to avoid responding completely to runs in the signs of the error terms, as doing so would lead to high model variance. If a piecewise linear spline fitting model can place three knots within the length, L, of a run, then it can respond completely to the run without adversely affecting the fit of the model in other regions. This implies a minimum knot increment of M > L_{max}/3, where L_{max} is defined as the longest run that can be expected in T binomial trials.
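The equivalence of the level form (A.22) and the log form (A.23) of the FGCV can be verified with a small numerical sketch. The data are invented, and equation-by-equation OLS is used here only as a simple stand-in for the system estimation of the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
G, T, h = 2, 40, 3  # equations, observations, parameters per equation

# Fit each equation by OLS and collect the (T x G) matrix of residuals.
E = np.empty((T, G))
for g in range(G):
    X = rng.normal(size=(T, h))
    y = X @ rng.normal(size=h) + rng.normal(size=T)
    E[:, g] = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

Sigma = (E.T @ E) / T  # variance-covariance matrix of the residuals

# FGCV in levels, equation (A.22), and in logs, equation (A.23).
fgcv_level = np.linalg.det(Sigma) ** (1 / G) / (1 - h / T) ** 2
fgcv_log = (1 / G) * np.log(np.linalg.det(Sigma)) - 2 * np.log(1 - h / T)
assert np.isclose(np.log(fgcv_level), fgcv_log)
```

Because the log is monotone, minimizing either form selects the same spline structure.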
Consider the probability of having a run of at least length L in heads or tails from a fair coin tossing experiment, and denote that probability by Pr(L); it is given exactly by an alternating sum of binomial coefficients over the possible run configurations (B.27). By choosing a value for Pr(L), say 0.05 < \theta < 0.1, we can solve for the length L(\theta) that derives from our choice. For a conservative knot increment, we would set M = L(\theta)/2.5. An approximation to L(\theta), which is accurate for \theta < 0.1 and T > 15, allows us to write:

M(T, \theta) = -log_2[-(1/T) ln(1 - \theta)]/2.5.    (B.28)

We then have an increment which is resistant to runs in the signs of the errors with probability \theta. (Recall that log_x y = (ln y)/(ln x).)

Table 3.12 presents some values of M(T, \theta) for selected sample sizes and \theta values. We see that, with rounding, we will have a minimum span of three or four for many time series data sets encountered in economics.
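Equation (B.28) is straightforward to evaluate directly. The short sketch below (the function name is ours) tabulates the minimum span for a few sample sizes and \theta values of the kind summarized in Table 3.12.

```python
import math

def min_span(T, theta):
    """Minimum knot increment M(T, theta) of equation (B.28)."""
    # log2(x) = ln(x)/ln(2); the argument -(1/T)*ln(1-theta) lies in (0, 1),
    # so its base-2 logarithm is negative and M comes out positive.
    return -math.log2(-(1.0 / T) * math.log(1.0 - theta)) / 2.5

# Selected sample sizes and theta values.
for T in (25, 50, 100):
    for theta in (0.05, 0.1):
        print(f"T={T:4d}  theta={theta:.2f}  M={min_span(T, theta):.2f}")
```

For T = 50 and \theta = 0.05 this gives M of roughly 3.97, consistent with the observation that rounding yields a minimum span of three or four for typical economic time series sample sizes.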