MULTICOLLINEARITY, AUTOCORRELATION, AND RIDGE REGRESSION

by

JACKIE JEN-CHY HSU
B.A. in Econ., The National Taiwan University, 1977

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE (BUSINESS ADMINISTRATION)

in

THE FACULTY OF GRADUATE STUDIES
THE FACULTY OF COMMERCE AND BUSINESS ADMINISTRATION

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
February 1980
(c) Jackie Jen-Chy Hsu, 1980

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Commerce & Business Admin.
The University of British Columbia
2075 Wesbrook Place
Vancouver, Canada V6T 1W5

Date: Feb. 8, 1980

ABSTRACT

The presence of multicollinearity can induce large variances in the ordinary least-squares estimates of regression coefficients. It has been shown that ridge regression can reduce this adverse effect on estimation. The presence of serially correlated error terms can also cause serious estimation problems. Various two-stage methods have been proposed to obtain good estimates of the regression coefficients in this case.
Although the multicollinearity and autocorrelation problems have long been recognized in regression analysis, they are usually dealt with separately. This thesis explores the joint effects of these two conditions on the mean square error properties of the ordinary ridge estimator as well as the ordinary least-squares estimator. We show that ridge regression is doubly advantageous when multicollinearity is accompanied by autocorrelation in both the errors and the principal components. We then derive a new ridge type estimator that is adjusted for autocorrelation. Finally, using simulation experiments with different degrees of multicollinearity and autocorrelation, we compare the mean square error properties of various estimators.

TABLE OF CONTENTS

1. INTRODUCTION
2. NOTATION AND PRELIMINARIES
3. MULTICOLLINEARITY
   3.1 Sources
   3.2 Effects
   3.3 Detection
4. AUTOCORRELATION
   4.1 Sources
   4.2 Effects
   4.3 Detection
5. JOINT EFFECTS OF MULTICOLLINEARITY AND AUTOCORRELATION
   5.1 Mean Square Error of the OLS Estimates of $\beta$
   5.2 Mean Square Error of the Ridge Estimates of $\beta$
   5.3 When will Ridge estimates be better than the OLS estimates?
   5.4 Use of the "Ridge Trace"
6. RIDGE REGRESSION: ESTIMATES, MEAN SQUARE ERROR AND PREDICTION
   6.1 Derivation of Ridge Estimator for a CLR Model
   6.2 Derivation of Ridge Estimator for an ALR Model
   6.3 Mean Square Error of the "Generalized Estimates"
   6.4 Estimation
   6.5 Prediction
7. THE MONTE CARLO STUDY
   7.1 Design of the Experiments
   7.2 Sampling Results
       7.2a. Results assuming $\rho_\varepsilon$ is known
       7.2b. Results assuming $\rho_\varepsilon$ is unknown
       7.2c. Forecasting
8. CONCLUSIONS
REFERENCES

1. INTRODUCTION

Multicollinearity and autocorrelation are two very common problems in regression analysis.
As is well known, the presence of some degree of multicollinearity results in estimation instability and model misspecification, while the presence of serially correlated errors leads to underestimation of the variances of parameter estimates and inefficient prediction. Because these two conditions have adverse effects on estimation and prediction, a wide range of tests has been developed to reduce their impact. Invariably, the multicollinearity and autocorrelation problems are dealt with separately in most, if not all, of the preceding studies. In this thesis we address the question "What are the joint effects of multicollinearity and autocorrelation on estimation and prediction?" Thereafter we shall study analytically the possible changes in the effectiveness of various estimation methods in the joint presence of these two conditions. As a result of these findings, a new ridge estimator adjusted for autocorrelation is then proposed and its properties are investigated by conducting a simulation study.

We briefly outline this thesis. Section 2 provides the setting for our analysis. Sections 3 and 4 give a general discussion of the problems of multicollinearity and autocorrelation. In addition, we comment on the validity of various existing diagnostic tests. The analytical study of the joint effects of multicollinearity and autocorrelation is presented in Section 5. In Section 6, a new ridge estimator adjusted for autocorrelation is derived and its mean square error properties are analyzed. Also, we discuss how these new estimates can be obtained in practice. The methodology and the results of sampling experiments appear in Section 7.
The thesis concludes with the presentation of several two-stage methods that can be used with the new ridge rule and that, we hope, will achieve better estimates and predictions.

2. NOTATION AND PRELIMINARIES

The Classical Linear Regression (CLR) model can be represented by the equation

(2.1)  $Y = X\beta + \varepsilon$

where $Y$ is an $n \times 1$ vector of observations on the dependent variable, $X$ is an $n \times p$ matrix of observations on the explanatory variables, $\beta$ is a $p \times 1$ vector of regression coefficients to be estimated and $\varepsilon$ is an $n \times 1$ vector of true error terms. The standard assumptions of the linear regression model are:

(1) $E(\varepsilon) = 0$, where $0$ is the zero vector.
(2) $E(\varepsilon\varepsilon^T) = \sigma_\varepsilon^2 I$, where $I$ is the identity matrix.
(3) The explanatory variables are non-stochastic, hence they are independent of the error terms.
(4) $\mathrm{Rank}(X) = p < n$.

The Ordinary Least-squares (OLS) estimator of $\beta$ is given by

(2.2)  $\hat{\beta}_{OLS} = (X^T X)^{-1} X^T Y$

with variance-covariance matrix

(2.3)  $\mathrm{Var}(\hat{\beta}_{OLS}) = \sigma_\varepsilon^2 (X^T X)^{-1}$.

For simplicity, we will assume that $(X^T X)$ is in correlation form. Let $P$ be the $p \times p$ orthogonal matrix such that $P X^T X P^T = \Lambda$, where $\Lambda$ is a diagonal matrix with the eigenvalues of $(X^T X)$, $\lambda_1, \ldots, \lambda_p$, displayed on the diagonal of $\Lambda$. We assume further that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$.

After applying an orthogonal rotation, $P$, it follows from (2.1) that

(2.4)  $E(Y) = X P^T P \beta = X^* \alpha$

where $X^* = X P^T$ is the data matrix represented in the rotated coordinates and the columns of $X^*$ are linearly independent. The vector $\alpha = P\beta$ is the vector of regression coefficients of the principal components.
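As a small illustration of (2.3) and of the rotation $P X^T X P^T = \Lambda$, the sketch below works through the two-regressor case, where the correlation matrix is $[[1, r], [r, 1]]$ and the eigenvalues are known in closed form. The function names and the value $r = 0.9$ are hypothetical choices made for this example only; they do not come from the thesis.

```python
import math


def correlation_eigen(r):
    """Eigenvalues and rotation P for the 2x2 correlation matrix [[1, r], [r, 1]].

    The rows of P are unit eigenvectors, so that P C P^T = diag(lambda_1, lambda_2)
    with lambda_1 >= lambda_2.
    """
    lam = [1.0 + abs(r), 1.0 - abs(r)]
    s = 1.0 / math.sqrt(2.0)
    sign = 1.0 if r >= 0 else -1.0
    P = [[s, sign * s], [s, -sign * s]]
    return lam, P


def rotate_matrix(P, C):
    """Return P C P^T for 2x2 matrices stored as nested lists."""
    PC = [[sum(P[i][m] * C[m][j] for m in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(PC[i][m] * P[j][m] for m in range(2)) for j in range(2)]
            for i in range(2)]


def ols_variances(r, sigma2=1.0):
    """Diagonal of sigma^2 (X^T X)^{-1} from (2.3) for two regressors."""
    det = 1.0 - r * r
    return [sigma2 / det, sigma2 / det]


if __name__ == "__main__":
    r = 0.9  # illustrative pairwise correlation, not taken from the thesis
    lam, P = correlation_eigen(r)
    print("eigenvalues:", lam)
    print("P C P^T:", rotate_matrix(P, [[1.0, r], [r, 1.0]]))
    # The total OLS variance equals sigma^2 times the sum of 1 / lambda_i.
    print("total variance:", sum(ols_variances(r)),
          "sum of 1/lambda:", sum(1.0 / l for l in lam))
```

The last line shows why a small $\lambda_p$ matters: the total variance of the OLS coefficients is the sum of the reciprocal eigenvalues, so one eigenvalue near zero dominates it.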
It follows that the OLS estimator of $\alpha$ is given by

(2.5)  $\hat{\alpha}_{OLS} = (X^{*T} X^*)^{-1} X^{*T} Y$.

We will consider ridge estimators for $\beta$ of the form

(2.6)  $\hat{\beta}_R(k) = (X^T X + kI)^{-1} (X^T X)\, \hat{\beta}_{OLS}$,  $0 < k < 1$,

where $k$ is independent of $\hat{\beta}_{OLS}$. When $k$ is a function of $\hat{\beta}_{OLS}$, $\hat{\beta}_R(k)$ is called an "adaptive ridge estimator" [12]. If $kI$ is replaced by a symmetric nonnegative definite matrix $K$, then the estimator is said to be a "generalized ridge estimator" [11:p.63]. Expressed in the rotated coordinates, the ridge estimator for $\alpha$ is given by

(2.7)  $\hat{\alpha}_R(k) = Z \hat{\alpha}_{OLS}$

where $Z = (\Lambda + kI)^{-1} \Lambda$. By substituting $X^* P = X$ in (2.6), we find

$\hat{\beta}_R(k) = (P^T \Lambda P + kI)^{-1} P^T \Lambda P\, \hat{\beta}_{OLS} = P^T (\Lambda + kI)^{-1} P P^T \Lambda P\, \hat{\beta}_{OLS} = P^T (\Lambda + kI)^{-1} \Lambda\, \hat{\alpha}_{OLS}$.

It follows from (2.7) that

(2.8)  $\hat{\alpha}_R(k) = P \hat{\beta}_R(k)$.

For the CLR model, assumption (2) that the errors are uncorrelated is often violated in practice. This leads to the formulation of an Autoregressive Linear Regression (ALR) model. Mathematically, the ALR model is given by replacing assumptions (2) and (3) by assumptions (2') and (3') below.

(2') $E(\varepsilon\varepsilon^T) = \sigma_\varepsilon^2 \Omega$, where $\Omega$ is a nondiagonal positive definite matrix.
(3') $X_t = [x_{t1}, x_{t2}, \ldots, x_{tp}]$, the $t$-th observation on the explanatory variables, is independent of the contemporaneous and succeeding errors $\varepsilon_t, \varepsilon_{t+1}, \ldots$

We assume that the error term $\varepsilon$ follows a first-order autoregressive scheme, that is

(2.9)  $\varepsilon_t = \rho_\varepsilon \varepsilon_{t-1} + u_t$

where $\rho_\varepsilon$ is the autocorrelation coefficient.
We require that $|\rho_\varepsilon| < 1$ and that $u_t$ satisfies the following for all $t$:

(2.10)  $E(u_t) = 0$;  $E(u_t u_{t+s}) = \sigma_u^2$ for $s = 0$;  $E(u_t u_{t+s}) = 0$ for $s \neq 0$

and

(2.11)  $E(\varepsilon\varepsilon^T) = \sigma_u^2 V$

where $V$ is the $n \times n$ symmetric matrix whose $(j, l)$ element is $\rho_\varepsilon^{|j-l|}$.

For an ALR model, "Generalized Least-squares" (GLS) will give the "Best Linear Unbiased Estimator" (BLUE) of $\beta$, denoted as $\hat{\beta}_{GLS}$. The matrix $\Omega$ can be written

$\Omega = Q Q^T$

where $Q$ is nonsingular. Hence

$Q^{-1} \Omega (Q^{-1})^T = I$  and  $(Q^{-1})^T Q^{-1} = \Omega^{-1}$.

$\hat{\beta}_{GLS}$ is obtained by making the following substitution in the ALR model:

$Y_* = Q^{-1} Y$,  $X_* = Q^{-1} X$,  $\varepsilon_* = Q^{-1} \varepsilon$.

Then it follows that

(2.12)  $Y_* = X_* \beta + \varepsilon_*$.

Since (2.12) satisfies all the assumptions of a CLR model, OLS will give the BLUE of $\beta$. Hence it follows that

(2.13)  $\hat{\beta}_{GLS} = (X_*^T X_*)^{-1} X_*^T Y_* = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} Y$.

For prediction, formula (2.15) gives the "Best Linear Unbiased Predictor" (BLUP) in a first-order ALR model:

(2.15)  $\hat{Y}_{t+1} = X_{t+1} \hat{\beta}_{GLS} + \rho_\varepsilon e_t$

where $e_t$ is the $t$-th GLS residual.

3. MULTICOLLINEARITY

In applying multiple regression models, some degree of interdependence among explanatory variables can be expected. As this interdependence grows and the correlation matrix $(X^T X)$ approaches singularity, multicollinearity constitutes a problem. Therefore it is preferable to think of multicollinearity in terms of its "severity" rather than its "existence" or "nonexistence".

3.1 Sources

In general, multicollinearity can be considered to be a symptom of poor experimental design. The sources of severe multicollinearity may be classified as follows [20:p.99-101].

(i) Not enough data or too many variables
In many cases large data sets only contain a few basic factors.
As the number of variables extracted from the data increases, each variable tends to measure different nuances of the same basic factors and each highly collinear variable has only a little information content. In this case, deleting some variables or collecting more data can usually solve the problem.

(ii) Physical or structural singularity
Sometimes highly collinear variables, due to mathematical or physical constraints, are inadvertently included in the model.

(iii) Sampling singularity
Due to expense, accident or mistake, sampling was only conducted in a small region of the design space.

3.2 Effects

The major effects of serious multicollinearity are:

(i) Estimation instability
As the correlation matrix $(X^T X)$ becomes ill-conditioned, the elements of the inverse matrix $(X^T X)^{-1}$ explode. (2.3) shows that the variances of the OLS estimates for $\beta$ are affected by the diagonal elements of the inverse matrix $(X^T X)^{-1}$. As a result of the instability of the inverse matrix $(X^T X)^{-1}$, the OLS estimates of the regression coefficients of the collinear variables might be numerically impossible to obtain. In any case they have large variances and are quite sensitive to small changes in the data set.

(ii) Structure misspecification
The increase in the size of the variable set of $X$ decreases the information content of each explanatory variable, thereby decreasing the sample significance of each variable's contribution to the explained variance of $Y$. Therefore, even though $Y$ really depends on each member of a relatively large variable set of $X$, erroneous deletion of variables may happen. As asserted by many authors [6:p.94][13:p.160][15], in the process of model-building, data limitation rather than theoretical limitation is responsible for the tendency to underspecify models.

(iii) Forecast inaccuracy
If an important variable is omitted because it is highly collinear, but later, in the prediction period, this omitted variable changes its behavior and moves independently of the other variables, then any forecasting under this oversimplified model will be very inaccurate.

(iv) Numerical problems
The correlation matrix $(X^T X)$ is not invertible if the columns of $X$ are linearly dependent. With the matrix $(X^T X)$ being singular, the OLS estimates of $\beta$, represented by (2.2), are completely indeterminate. In the case of an almost singular set of variables, the numerical instability in calculating the inverse matrix $(X^T X)^{-1}$ still remains.

3.3 Detection

Tests for the presence and location of serious multicollinearity are briefly outlined and followed by comments.

(i) Tests based on various correlation coefficients
Here, harmful multicollinearity is generally recognized by rules of thumb. For instance, a commonly accepted rule of thumb requires simple pair-wise correlation coefficients of explanatory variables to be less than 0.8. Certainly, more extended and sophisticated rules of thumb, with prudent use of various correlation coefficients, will give more satisfactory results. The following rule of thumb is generally considered to be superior to other rules: a variable is said to be highly multicollinear if its coefficient of multiple correlation, $R_i^2$, with the remaining $(p-1)$ variables is greater than the coefficient of multiple correlation, $R^2$, of $Y$ with all the explanatory variables [14:p.101]. The variance of the estimate of $\beta_i$ can be expressed as follows [9]:

(3.1)  $\mathrm{Var}(\hat{\beta}_i) = \dfrac{1}{n-p-1} \cdot \dfrac{\sigma_y^2}{\sigma_{x_i}^2} \cdot \dfrac{1 - R^2}{1 - R_i^2}$

where $\sigma_y^2$ is the variance of the dependent variable $Y$ and $\sigma_{x_i}^2$ is the variance of the explanatory variable $X_i$. From (3.1), it is obvious that multicollinearity constitutes a problem only when $R_i^2$ is high relative to $R^2$. Unfortunately, the geometric interpretation of this rule of thumb is apparent only when there are two explanatory variables [6:p.98].

(ii) Three-stage hierarchy test
This is proposed by Farrar and Glauber [6].
At the first stage, if the null hypothesis $H_0: |X^T X| = 1$ is rejected based on the Wilks-Bartlett test, we may assert that multicollinearity is severe and move toward the second stage. The $F$ statistic is then computed for each $R_i^2$:

$F_i = \dfrac{R_i^2 / (p-1)}{(1 - R_i^2)/(n-p)}$,  $i = 1, \ldots, p$.

A statistically significant $F_i$ implies that $X_i$ is collinear. At the third stage, inspection of the partial correlation coefficients between $X_i$ and the remaining $(p-1)$ variables and the associated t-ratios can show the pattern of interdependency among the explanatory variables. Farrar and Glauber claimed that detecting severe multicollinearity, localizing it, and learning the pattern of interdependence among explanatory variables can be achieved respectively at the three different stages of their test.

(iii) Haitovsky Chi Square test
In 1969, Haitovsky [9] proposed a heuristic statistic for the hypothesis test of severe multicollinearity. This heuristic statistic is a function of the determinant of the correlation matrix $(X^T X)$, and is approximately distributed as Chi Square. Applications to Farrar and Glauber's data show that this test gives more satisfactory results than the Wilks-Bartlett test that is adopted at the first stage of the Farrar and Glauber three-stage test.
Therefore Haitovsky claimed the superiority of his test and suggested replacing the Wilks-Bartlett test by his test in the Farrar and Glauber three-stage test. However, any test based on the determinant of the correlation matrix has some built-in deficiencies. As will be shown later, the mean square error properties depend only on the eigenvalues of the matrix $(X^T X)$. Only when $(X^T X)$ has a broad eigenvalue spectrum, that is to say when the ratio of the largest eigenvalue to the smallest one, $\lambda_1/\lambda_p$, is large, may the performance of the OLS estimates deteriorate. Since the determinant of the correlation matrix is equal to the product of all the eigenvalues, this test will treat a matrix having a broad eigenvalue spectrum equivalently to those having relatively narrow eigenvalue spectra, so long as they have the same or nearly the same determinants. The relative magnitude of the eigenvalues is difficult if not impossible to infer from the results of any test that is based on the determinant of the correlation matrix. However, the Haitovsky test gives a fairly good indication of the presence of severe multicollinearity in our simulation study.

(iv) Examining the spectrum of the matrix $(X^T X)$
If the matrix $(X^T X)$ has a broad eigenvalue spectrum, that is, $\lambda_1/\lambda_p$ is large, then the mean square error of the OLS estimates of $\beta$ becomes very large. Since the trace of the correlation matrix $(X^T X)$ is equal to the number of explanatory variables $p$, an arbitrary rule of thumb may consider $\lambda_1/\lambda_p$ to be large if $\lambda_1/\lambda_p > p$. Besides, the minimax index

$\mathrm{MMI} = \sum_{i=1}^{p} \lambda_i^{-2} \big/ \lambda_p^{-2}$

is a useful indicator too. A small MMI, say, less than two, implies the presence of multicollinearity [21:p.13-14].
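The spectrum-based checks in (iv) are easy to compute once the eigenvalues are available. The sketch below is only an illustration: the function name and the two example spectra are hypothetical, and it assumes the minimax index is computed as $\sum_i \lambda_i^{-2} / \lambda_p^{-2}$. It reports the ratio $\lambda_1/\lambda_p$, the rule of thumb $\lambda_1/\lambda_p > p$, the minimax index, and the determinant used by the determinant-based tests.

```python
def spectrum_diagnostics(eigenvalues):
    """Spectrum-based indicators for a correlation matrix with given eigenvalues.

    Assumes all eigenvalues are positive (the matrix has full rank).
    """
    lam = sorted(eigenvalues, reverse=True)   # lambda_1 >= ... >= lambda_p
    p = len(lam)
    ratio = lam[0] / lam[-1]
    # Minimax index, assumed here to be sum(lambda_i^-2) / lambda_p^-2.
    mmi = sum(l ** -2 for l in lam) / lam[-1] ** -2
    det = 1.0
    for l in lam:
        det *= l                              # determinant = product of eigenvalues
    return {
        "ratio": ratio,
        "broad_spectrum": ratio > p,          # rule of thumb: lambda_1/lambda_p > p
        "mmi": mmi,
        "small_mmi": mmi < 2.0,               # "say, less than two"
        "determinant": det,
    }


# Two hypothetical spectra for p = 4; each sums to p, as a correlation matrix requires.
ILL_CONDITIONED = [2.5, 1.0, 0.45, 0.05]
WELL_CONDITIONED = [1.3, 1.1, 0.9, 0.7]

if __name__ == "__main__":
    for name, lam in (("ill-conditioned", ILL_CONDITIONED),
                      ("well-conditioned", WELL_CONDITIONED)):
        print(name, spectrum_diagnostics(lam))
```

Both indicators need only the eigenvalues, which is the basis for the remark that this approach carries a light computational burden.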

Among all these tests and methods proposed, examining the eigenvalue spectrum of the matrix $(X^T X)$ provides not only a sound theoretical basis but also the lightest computational burden.

4. AUTOCORRELATION

One of the basic assumptions of the CLR model is that the error terms are independent of each other. However, when regression analysis is applied to time series data, the residuals are often found to be serially correlated. Like multicollinearity, autocorrelation is another widespread problem in applying regression models. For simplicity, first-order autocorrelation is assumed in our study.

4.1 Sources

The sources are mainly the following:

(i) Omission of variables
The time-ordered effects of the omitted variables will be included in the error terms. This prevents the errors from displaying random behavior. In this case, finding the missing variables and identifying the correct relationship can solve the problem.

(ii) Systematic measurement error in the dependent variable
Again, the error terms absorb the systematic measurement error in the dependent variable and then display non-random behavior.

(iii) Error structure is time dependent
The great impacts of some random events or shocks, such as war, strikes, floods, etc., are spread over several periods of time, causing the error terms to be serially correlated. This is so-called "true autocorrelation".

4.2 Effects

When the OLS technique is still used for estimation, the major effects are:

(i) Unbiased but inefficient estimator of $\beta$
GLS provides the BLUE of $\beta$ when the dispersion matrix of $\varepsilon$, $\sigma_\varepsilon^2 \Omega$, is nondiagonal. That is to say, on the average the sampling variances of the GLS estimates of $\beta$ are less than those of the OLS estimates of $\beta$, hence OLS is inefficient compared with GLS.

(ii) Underestimation of the variances of the estimates of $\beta$
As an illustration, consider the very simple model

$y_t = \beta x_t + \varepsilon_t$,  $\varepsilon_t = \rho_\varepsilon \varepsilon_{t-1} + u_t$

where $u_t$ satisfies assumptions (2.10). It has been shown that the variance of the OLS estimate of $\beta$ is [13:p.247]

(4.1)  $\mathrm{Var}(\hat{\beta}_{OLS}) = \dfrac{\sigma_\varepsilon^2}{\sum_{i=1}^{n} x_i^2} \left[ 1 + 2\rho_\varepsilon \dfrac{\sum_{i=1}^{n-1} x_i x_{i+1}}{\sum_{i=1}^{n} x_i^2} + 2\rho_\varepsilon^2 \dfrac{\sum_{i=1}^{n-2} x_i x_{i+2}}{\sum_{i=1}^{n} x_i^2} + \cdots + 2\rho_\varepsilon^{n-1} \dfrac{x_1 x_n}{\sum_{i=1}^{n} x_i^2} \right]$.

The OLS formula (2.3) ignores the term in brackets in (4.1) and gives the variance of the estimate of $\beta$ as $\sigma_\varepsilon^2 / \sum_{i=1}^{n} x_i^2$. If both $\varepsilon$ and $x$ are positively autocorrelated, the expression in brackets is almost certainly greater than unity, therefore the OLS formula will underestimate the true variance of $\hat{\beta}_{OLS}$.

(iii) Inefficient predictor of $Y$
When autocorrelation is present, the error made at one point in time gives information about the error made at a subsequent point in time. The OLS predictor fails to take this information into account, hence it is not the BLUP of $Y$ [13:p.265-266].

4.3 Detection

The tests which are commonly used to recognize the existence of first-order autocorrelation are the following.

(i) Eye-ball tests
The plot of the OLS residuals $e_t$ against time $t$ can be informative. Any nonrandom behavior of $e_t$ can be considered as an indication of autocorrelation. We may also plot the OLS residual $e_t$ against its lagged value $e_{t-1}$.
If the observations are not evenly spread over the four quadrants, we may conclude that first-order autocorrelation is present. These eye-ball tests are quite effective; however, they are imprecise and do not lend themselves to classical inferential methods.

(ii) von Neumann ratio
In 1941, the ratio of the mean square successive difference to the variance was proposed by von Neumann as a test statistic for the existence of first-order autocorrelation [22]. Though various applications have proven the usefulness of the von Neumann ratio, we emphasize that this test is applicable only when the $e$ values are independently distributed and the sample size is large. In practice, the OLS residuals used to compute the von Neumann ratio usually are not independently distributed even when the true error terms are.

(iii) Durbin-Watson test
This test, named after its originators Durbin and Watson, is widely used for small sample sizes [4][5]. There are some shortcomings of the Durbin-Watson test. First, there exist two regions of indeterminacy. Though an exact test was suggested by Henshaw in 1966, its heavy computational burden prevents the test from wide application [10]. Secondly, the Durbin-Watson test is derived for non-stochastic explanatory variables only. It has been shown that if lagged dependent variables are present, either in single regression equation models or in systems of simultaneous regression equations, the Durbin-Watson statistic $d$ is biased towards its value for a random error, that is, towards 2, thereby giving very misleading information [17]. It is both necessary and important to test for serial correlation in models containing lagged dependent variables, since autocorrelated models are usually repaired by inserting lagged $Y$ values into the right-hand side of the regression equation.
To this end, Durbin developed a test based on the $h$ statistic in 1970 [3]. $h$ is defined as

$h = r \sqrt{\dfrac{n}{1 - n \hat{V}(b_1)}}$

where $r$ is the estimated first-order autocorrelation coefficient of the residuals and $\hat{V}(b_1)$ is the estimated variance of $b_1$, the coefficient of $Y_{t-1}$. This test is computationally cheap but only applicable for large sample sizes. The small sample properties of the $h$ statistic are still unknown.

5. JOINT EFFECTS OF MULTICOLLINEARITY AND AUTOCORRELATION

In statistical analysis, a point estimate is usually of little use unless accompanied by an estimate of its accuracy. In this connection, Mean Square Error (MSE) is widely used as a measure of accuracy. Since accurate parameter estimates constitute an effective model, MSE can be used to determine a model's effectiveness when the underlying objective is simply to obtain good parameter estimates. In their 1970 paper, Hoerl and Kennard presented the MSE properties for the OLS and ridge estimates of $\beta$ [11]. Thereafter the results of various studies have confirmed that ridge regression will improve the MSE of estimation and prediction in the presence of severe multicollinearity. In this section, we will present expressions for the MSE of $\hat{\beta}_{OLS}$ and $\hat{\beta}_R(k)$ when the error terms follow the first-order autocorrelated pattern (2.9). These expressions will enable us to examine the effect of these two conditions on the ridge and the OLS estimates. Our analysis can be reduced to that of Hoerl and Kennard by setting $\rho_\varepsilon = 0$.

5.1 Mean Square Error of the OLS Estimates of $\beta$

We begin with the analysis of the OLS estimates for a first-order ALR model. Let

$L_1$ = distance from $\hat{\beta}_{OLS}$ to $\beta$,  $L_1^2 = (\hat{\beta}_{OLS} - \beta)^T (\hat{\beta}_{OLS} - \beta)$.

We define the MSE of $\hat{\beta}_{OLS}$ to be $E(L_1^2)$.

Proposition 5.1

(5.1)  $E(L_1^2) = \sigma_u^2 \sum_{j=1}^{n} \sum_{l=1}^{n} D_{jl}\, \rho_\varepsilon^{|j-l|}$

where $D = X (X^T X)^{-2} X^T$.

Proof: From (2.1) and (2.2),

(5.2)  $\hat{\beta}_{OLS} - \beta = (X^T X)^{-1} X^T Y - \beta = (X^T X)^{-1} X^T (X\beta + \varepsilon) - \beta = (X^T X)^{-1} X^T \varepsilon$.

By definition and (5.2) it follows that

$E(L_1^2) = E[(\hat{\beta}_{OLS} - \beta)^T (\hat{\beta}_{OLS} - \beta)] = E[\varepsilon^T X (X^T X)^{-2} X^T \varepsilon]$.

Noting that $E(\varepsilon) = 0$, it follows from Theorem 4.6.1 of Graybill [7:p.139] that

(5.3)  $E(L_1^2) = \sigma_u^2\, \mathrm{tr}[X (X^T X)^{-2} X^T V]$.

Here, for a matrix $A$, we use the notation $\mathrm{tr}(A)$ to denote the trace of $A$. From the definition of $V$ and $D$, (5.1) follows.

(5.1) does not give much insight into the effect of multicollinearity and autocorrelation on the MSE of $\hat{\beta}_{OLS}$. By rotating axes (using principal components) the effect can be more clearly demonstrated.

Proposition 5.2

(5.4)  $E(L_1^2) = \sigma_u^2 \sum_{i=1}^{p} \dfrac{1}{\lambda_i} + 2\sigma_u^2 \sum_{j>l}^{n} \sum_{l=1}^{n-1} \sum_{i=1}^{p} \dfrac{x^*_{ji} x^*_{li}}{\lambda_i^2}\, \rho_\varepsilon^{\,j-l}$

where $x^*_{ji}$ is the $j$-th observation on the $i$-th principal component.

Proof: From (2.4) and (2.5),

(5.5)  $\hat{\alpha}_{OLS} - \alpha = (X^{*T} X^*)^{-1} X^{*T} Y - \alpha = (X^{*T} X^*)^{-1} X^{*T} (X^* \alpha + \varepsilon) - \alpha = (X^{*T} X^*)^{-1} X^{*T} \varepsilon$.

By definition and (5.5) it follows that

$E(L_1^2) = E[(\hat{\beta}_{OLS} - \beta)^T P^T P (\hat{\beta}_{OLS} - \beta)] = E[(\hat{\alpha}_{OLS} - \alpha)^T (\hat{\alpha}_{OLS} - \alpha)] = E[\varepsilon^T X^* (X^{*T} X^*)^{-2} X^{*T} \varepsilon]$.

Hence by the same argument used in proving Proposition 5.1,

$E(L_1^2) = \sigma_u^2\, \mathrm{tr}[X^* \Lambda^{-2} X^{*T} V] = \sigma_u^2\, \mathrm{tr}(M V)$,

where $M = X^* \Lambda^{-2} X^{*T}$ is the $n \times n$ matrix with $(j, l)$ element $\sum_{i=1}^{p} x^*_{ji} x^*_{li} / \lambda_i^2$. By definition of $\Lambda$ and $V$, (5.4) follows.

After orthogonal rotation, the effect of multicollinearity and autocorrelation becomes apparent from (5.4). First, if $\rho_\varepsilon$ is positive and most of the principal components are also positively autocorrelated, almost certainly the second term in (5.4) will be positive. That is to say, the MSE of $\hat{\beta}_{OLS}$ will be larger than when these effects are not present; moreover the difference will be in proportion to the magnitude of $\rho_\varepsilon$. Secondly, we obtain a cross term of the eigenvalues, $\lambda_i$, and the autocorrelation coefficient, $\rho_\varepsilon$.
If the matrix $(X^T X)$ is ill-conditioned, that is, $\lambda_p$ is close to zero, and there is a high degree of positive autocorrelation both in the $p$-th component and the error terms, then the second term in (5.4) dominates and the MSE of $\hat{\beta}_{OLS}$ can be very large. It is then extremely dangerous to apply OLS to data with the above characteristics. However, the problem will not be that serious if $\rho_\varepsilon$ is negative or the principal components, especially the weak components, are not autocorrelated. Finally, from (5.4) we are able to tell by how much the MSE of $\hat{\beta}_{OLS}$ changes because of the existence of first-order autocorrelated errors in general regression models containing $p$ explanatory variables. Note that when $\rho_\varepsilon = 0$, (5.4) reduces to

$E(L_1^2) = \sigma_u^2\, \mathrm{tr}(\Lambda^{-1}) = \sigma_u^2 \sum_{i=1}^{p} \dfrac{1}{\lambda_i}$.

5.2 Mean Square Error of the Ridge Estimates of $\beta$

In parallel with 5.1, we define

$L_1(k)$ = distance from $\hat{\beta}_R(k)$ to $\beta$.

The MSE of $\hat{\beta}_R(k)$ is given by $E[L_1^2(k)] = E[(\hat{\beta}_R(k) - \beta)^T (\hat{\beta}_R(k) - \beta)]$.

Proposition 5.3

(5.6)  $E[L_1^2(k)] = \gamma_1(k) + \gamma_2(k) + \gamma_3(k)$

where

$\gamma_1(k) = \sigma_u^2 \sum_{i=1}^{p} \dfrac{\lambda_i}{(\lambda_i + k)^2}$

$\gamma_2(k) = k^2 \sum_{i=1}^{p} \dfrac{\alpha_i^2}{(\lambda_i + k)^2}$

$\gamma_3(k) = 2\sigma_u^2 \sum_{j>l}^{n} \sum_{l=1}^{n-1} \sum_{i=1}^{p} \dfrac{x^*_{ji} x^*_{li}}{(\lambda_i + k)^2}\, \rho_\varepsilon^{\,j-l}$.

Proof: From (2.7) and (2.8), the MSE of $\hat{\beta}_R(k)$ can be written as

(5.7)  $E[L_1^2(k)] = E[(\hat{\beta}_R(k) - \beta)^T P^T P (\hat{\beta}_R(k) - \beta)] = E[(Z\hat{\alpha}_{OLS} - \alpha)^T (Z\hat{\alpha}_{OLS} - \alpha)] = E[(\hat{\alpha}_{OLS} - \alpha)^T Z^T Z (\hat{\alpha}_{OLS} - \alpha)] + (Z\alpha - \alpha)^T (Z\alpha - \alpha)$.

Since the first term in (5.7) is a scalar, from (2.7) and Proposition 5.2 it follows that

(5.8)  $E[(\hat{\alpha}_{OLS} - \alpha)^T Z^T Z (\hat{\alpha}_{OLS} - \alpha)] = \sigma_u^2\, \mathrm{tr}[X^* \Lambda^{-1} Z^T Z \Lambda^{-1} X^{*T} V] = \sigma_u^2\, \mathrm{tr}[X^* (\Lambda + kI)^{-2} X^{*T} V]$
$\qquad = \sigma_u^2 \sum_{i=1}^{p} \dfrac{\lambda_i}{(\lambda_i + k)^2} + 2\sigma_u^2 \sum_{j>l}^{n} \sum_{l=1}^{n-1} \sum_{i=1}^{p} \dfrac{x^*_{ji} x^*_{li}}{(\lambda_i + k)^2}\, \rho_\varepsilon^{\,j-l} = \gamma_1(k) + \gamma_3(k)$.

The matrix $(Z - I)$ can be written as

(5.9)  $Z - I = Z(I - Z^{-1}) = Z(-k\Lambda^{-1}) = -k(\Lambda + kI)^{-1}$.

From (5.9), the second term in (5.7) can be expressed as follows:

$(Z\alpha - \alpha)^T (Z\alpha - \alpha) = \alpha^T (Z - I)^T (Z - I) \alpha = k^2 \alpha^T (\Lambda + kI)^{-2} \alpha = k^2 \sum_{i=1}^{p} \dfrac{\alpha_i^2}{(\lambda_i + k)^2} = \gamma_2(k)$,

completing the proof.

The MSE of $\hat{\beta}_R(k)$ consists of three parts, $\gamma_1(k)$, $\gamma_2(k)$ and $\gamma_3(k)$. $\gamma_1(k)$ can be considered to be the total variance of the parameter estimates and is a monotonically decreasing function of $k$; $\gamma_2(k)$ is the square of the bias brought by the augmented matrix $kI$ and is a monotonically increasing function of $k$; while $\gamma_3(k)$ is related to the autocorrelation in the error terms. Hoerl and Kennard claim that in the presence of severe multicollinearity, it is possible to reduce MSE substantially by taking a little bias, that is, choosing $k > 0$. This is because in the neighborhood of the origin, $\gamma_1(k)$ will drop sharply while $\gamma_2(k)$ will only increase slightly as $k$ increases [11:p.60-61]. After incorporating autocorrelation in the context of ridge regression analysis, their assertion will still be true only if certain conditions are satisfied. From (5.6) we see that the effects of multicollinearity and autocorrelation are the following.

(i) If $\rho_\varepsilon$ is positive and the principal components, especially the weak components, are also positively autocorrelated, then the ridge method will be even more desirable than OLS. This is because a substantial decrease in both $\gamma_1(k)$ and $\gamma_3(k)$ can be achieved by choosing $k > 0$, while the increase in $\gamma_2(k)$ is relatively small when moving to $k > 0$.
(ii) If $\rho_\varepsilon$ is negative or almost all of the principal components are not autocorrelated, then on average $\gamma_3(k)$ is close to zero, hence the ridge and the OLS estimates will perform relatively the same as in the uncorrelated case.

(iii) Since ridge regression is similar to shrinking the model by dropping the least important component [21: p.24-28], (5.6) gives a theoretical justification for shrinking the model if both the last component and the error terms are positively autocorrelated. From the point of view of estimation stability, the ridge method will be helpful when severe multicollinearity is accompanied by a high degree of positive autocorrelation both in the weakest component and in the error terms.

5.3 When will Ridge estimates be better than the OLS estimates?

Taking the derivatives of $\gamma_1(k)$ and $\gamma_2(k)$, Hoerl and Kennard found a condition on $k$ such that ridge regression gives better parameter estimates than OLS in terms of MSE. That is, when $k$ is smaller than $\sigma^2/\alpha_{\max}^2$, where $\sigma^2$ is the error variance and $\alpha_{\max}$ is the largest regression coefficient in magnitude, the MSE of $\hat{\beta}_R(k)$ will be less than that of $\hat{\beta}_{OLS}$ [11]. When autocorrelation is present, the conditions on $k$ under which ridge regression performs better than OLS regression are described below. Consider the derivatives of $\gamma_1(k)$, $\gamma_2(k)$ and $\gamma_3(k)$:

$$\frac{d\gamma_1}{dk} = -2\sigma_\varepsilon^2 \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + k)^3},$$

$$\frac{d\gamma_2}{dk} = 2k \sum_{i=1}^{p} \frac{\lambda_i\alpha_i^2}{(\lambda_i + k)^3},$$

$$\frac{d\gamma_3}{dk} = -4\sigma_\varepsilon^2 \sum_{i=1}^{p} \frac{1}{(\lambda_i + k)^3} \sum_{j=1}^{n-1} \sum_{l=1}^{n-j} x^*_{li}\, x^*_{l+j,i}\, \rho_\varepsilon^{\,j}. \tag{5.10}$$

When $(X^TX)$ approaches singularity, which implies that $\lambda_p \to 0$, the values of the first two derivatives in the neighborhood of the origin are given by

$$\lim_{\lambda_p \to 0}\ \lim_{k \to 0^+} \frac{d\gamma_1}{dk} = -\infty, \tag{5.11}$$

$$\lim_{\lambda_p \to 0}\ \lim_{k \to 0^+} \frac{d\gamma_2}{dk} = 0. \tag{5.12}$$

As $k$ increases, a huge drop in $\gamma_1$ with only a slight increase in $\gamma_2$ may be expected.
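The behaviour described by (5.11) and (5.12) can be checked numerically. The following is a minimal sketch, not part of the original study: the eigenvalues, component coefficients and error variance are made-up illustrative values, and only the two uncorrelated terms $\gamma_1(k)$ and $\gamma_2(k)$ are evaluated.

```python
# Illustrative values only (assumptions, not taken from the thesis).
SIGMA2 = 1.0             # error variance
LAMBDAS = [2.0, 0.001]   # eigenvalues of X'X; the last one is nearly zero
ALPHAS = [1.0, 1.0]      # coefficients of the principal components


def gamma1(k):
    """Total variance term: sigma^2 * sum(lambda_i / (lambda_i + k)^2)."""
    return SIGMA2 * sum(lam / (lam + k) ** 2 for lam in LAMBDAS)


def gamma2(k):
    """Squared bias term: k^2 * sum(alpha_i^2 / (lambda_i + k)^2)."""
    return k ** 2 * sum(a ** 2 / (lam + k) ** 2 for lam, a in zip(LAMBDAS, ALPHAS))


def hoerl_kennard_bound():
    """Upper limit sigma^2 / alpha_max^2 on k in the uncorrelated case."""
    return SIGMA2 / max(a ** 2 for a in ALPHAS)


if __name__ == "__main__":
    for k in (0.0, 0.01, 0.05, 0.1):
        print(f"k={k:<5} gamma1={gamma1(k):10.4f} gamma2={gamma2(k):8.4f}")
    print("Hoerl-Kennard bound on k:", hoerl_kennard_bound())
```

With one eigenvalue close to zero, the variance term collapses as soon as $k$ leaves the origin while the bias term grows slowly, which is the pattern the two limits describe.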
However, (5.10) shows that the behavior of $\gamma_3$ depends on the degree of autocorrelation both in the principal components and in the error terms. Therefore $\gamma_3$ may increase or decrease at various rates as $k$ increases. The use of ridge regression is most favourable when there is a high degree of positive autocorrelation both in the components and in the error terms. We now formalize these arguments and present a condition on $k$ such that ridge regression is better than OLS regression under the MSE criterion.

Let

$$F(k) = E(L_1^2) - E[L_2^2(k)] = \sigma_\varepsilon^2 \sum_{i=1}^{p} \left[\frac{1}{\lambda_i^2} - \frac{1}{(\lambda_i + k)^2}\right]\left[\lambda_i + 2\sum_{j=1}^{n-1} \sum_{l=1}^{n-j} x^*_{li}\, x^*_{l+j,i}\, \rho_\varepsilon^{\,j}\right] - k^2 \sum_{i=1}^{p} \frac{\alpha_i^2}{(\lambda_i + k)^2}.$$

Then

$$\frac{dF}{dk} = 2\sigma_\varepsilon^2 \sum_{i=1}^{p} \frac{1}{(\lambda_i + k)^3}\left[\lambda_i + 2\sum_{j=1}^{n-1} \sum_{l=1}^{n-j} x^*_{li}\, x^*_{l+j,i}\, \rho_\varepsilon^{\,j}\right] - 2k \sum_{i=1}^{p} \frac{\lambda_i\alpha_i^2}{(\lambda_i + k)^3}. \tag{5.13}$$

Assume that $\gamma_3(k)$ is a non-increasing function of $k$ in the neighborhood of the origin. From (5.11) and (5.12) we may expect $F(k)$ to increase when moving towards $k > 0$, i.e. $\lim_{k \to 0^+}(dF/dk) > 0$. In other words, there exists $k > 0$ such that the OLS estimates have higher MSE than the ridge estimates. The condition $dF/dk > 0$ implies

Theorem 5.1. If, for each component $i$,

$$k < \frac{\sigma_\varepsilon^2}{\alpha_{\max}^2} + \frac{2\sigma_\varepsilon^2}{\lambda_i\,\alpha_{\max}^2} \sum_{j=1}^{n-1} \sum_{l=1}^{n-j} x^*_{li}\, x^*_{l+j,i}\, \rho_\varepsilon^{\,j}, \tag{5.14}$$

then $E(L_1^2) - E[L_2^2(k)] > 0$.

Again (5.14) reduces to Hoerl and Kennard's result if $\rho_\varepsilon = 0$. When positive autocorrelation exists in the error terms and in the principal components, the second term in (5.14) may well be positive, hence the range of $k$ for which ridge estimates are better than OLS estimates under the MSE criterion will be larger than what Hoerl and Kennard asserted for the uncorrelated case. (5.14) shows that the extension in the range of $k$ is positively related to the magnitude of $\rho_\varepsilon$. However, (5.14) is only a sufficient condition on $k$ for $E(L_1^2)$ to be greater than $E[L_2^2(k)]$, not a necessary one, since it is derived from requiring $F(k)$ to be increasing in $k$ over the range shown by (5.14).
It is possible that for some values of $k$, $F(k)$ is decreasing in $k$ while the function value is still positive, that is, $E(L_1^2)$ is still greater than $E[L_2^2(k)]$. Therefore, we may consider (5.14) a stringent condition on $k$ for ridge estimates to be better than OLS estimates under the MSE criterion. If either $\rho_\varepsilon$ is negative or the principal components, especially the weak ones, are not autocorrelated, the behavior of $\gamma_3(k)$, and thereby of $F(k)$, as $k$ increases will be hard to predict. The effect of autocorrelation on the range of $k$ depends on the data set gathered. In practice the true parameters are unknown, so the range of $k$ shown by (5.14) can be approximated by conducting a principal component analysis and substituting the estimates for the parameters.

5.4 Use of the "Ridge Trace"

In ridge regression the augmented matrix $(kI)$ is used to give the system the general characteristics of an orthogonal system. Hoerl and Kennard claimed that at a certain value of $k$ the system will stabilize [11: p.65]. They proposed the use of a "Ridge Trace" as a diagnostic tool to select a single value of $k$ and a unique ridge estimate of $\beta$ in practice. The "Ridge Trace" portrays the behavior of all the parameter estimates as $k$ varies. Therefore, instead of suppressing dimensions either by deleting collinear variables or by dropping the principal components of small importance, the Ridge Trace shows how singularity is causing instability, over- or under-estimation and incorrect signs. In connection with autocorrelation, where ridge regression is even more desirable, the "Ridge Trace" will certainly be of great help in getting better point estimates and thereby better predictions.
Even when $\rho_\varepsilon$ is negative or the principal components are not autocorrelated, the merits and usefulness of the "Ridge Trace" and of "Ridge Regression" are still preserved in dealing with the problem of multicollinearity.

6. RIDGE REGRESSION: ESTIMATES, MEAN SQUARE ERROR AND PREDICTION

The MSE of the OLS estimates of $\beta$ can be written as the difference between the squared lengths of two vectors, $\hat{\beta}_{OLS}$ and $\beta$ [11: p.56]:

$$E(L_1^2) = E(\hat{\beta}_{OLS}^T\hat{\beta}_{OLS}) - \beta^T\beta. \tag{6.1}$$

(6.1) shows that in the presence of severe multicollinearity the MSE can be improved by shortening the OLS estimates of $\beta$. In this section we show that this reasoning is compatible with the derivation of the ridge estimator of $\beta$. Hence ridge regression can be expected to be better in terms of MSE.

6.1 Derivation of Ridge Estimator for a CLR Model

Let $B$ be any estimate of $\beta$. Its residual sum of squares, $\phi$, can be written as the minimum sum of squares, $\phi_{\min}$, plus the distance from $B$ to $\hat{\beta}_{OLS}$ weighted through $(X^TX)$:

$$\begin{aligned} \phi &= (Y - XB)^T(Y - XB) \\ &= (Y - X\hat{\beta}_{OLS})^T(Y - X\hat{\beta}_{OLS}) + (B - \hat{\beta}_{OLS})^TX^TX(B - \hat{\beta}_{OLS}) \\ &= \phi_{\min} + \phi(B). \end{aligned} \tag{6.2}$$

For a specific value of $\phi(B)$, $\phi_0$, the ridge estimator is found by choosing a $B$ to

Minimize $B^TB$

$$\text{Subject to } (B - \hat{\beta}_{OLS})^TX^TX(B - \hat{\beta}_{OLS}) = \phi_0. \tag{6.3}$$

This problem can be solved by the Lagrange multiplier technique, where $(1/k)$ is the multiplier corresponding to the constraint (6.3). The problem is to minimize

$$F = B^TB + \frac{1}{k}\left[(B - \hat{\beta}_{OLS})^TX^TX(B - \hat{\beta}_{OLS}) - \phi_0\right]. \tag{6.4}$$

A necessary condition for $B$ to minimize (6.4) is that

$$\frac{\partial F}{\partial B} = 2B + \frac{1}{k}\left[2(X^TX)B - 2(X^TX)\hat{\beta}_{OLS}\right] = 0.$$

Hence

$$\left[I + \frac{1}{k}(X^TX)\right]B = \frac{1}{k}(X^TX)\hat{\beta}_{OLS}$$

and

$$B^* = \hat{\beta}_R(k) = (X^TX + kI)^{-1}X^TY,$$

where $k$ is chosen to satisfy constraint (6.3). In practice we usually work the other way round, since it is easier to choose a $k > 0$ and then compute the additional residual sum of squares, $\phi_0$.
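Working "the other way round" can be sketched in a few lines. This is a minimal illustration in Python, assuming two regressors and no intercept so that $(X^TX + kI)$ is a 2 x 2 matrix with a closed-form inverse; the data values and function names are hypothetical and are not taken from the thesis.

```python
# Hypothetical, nearly collinear data (illustrative only).
X1 = [1.0, 2.0, 3.0, 4.0, 5.0]
X2 = [1.1, 1.9, 3.2, 3.9, 5.1]
Y = [2.0, 4.1, 5.9, 8.2, 9.9]


def ridge_2var(x1, x2, y, k):
    """Solve (X'X + kI) b = X'y for two regressors without an intercept."""
    s11 = sum(a * a for a in x1) + k
    s22 = sum(b * b for b in x2) + k
    s12 = sum(a * b for a, b in zip(x1, x2))
    g1 = sum(a * c for a, c in zip(x1, y))
    g2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * g1 - s12 * g2) / det, (s11 * g2 - s12 * g1) / det)


def squared_length(b):
    """B'B, the quantity the ridge estimator minimizes."""
    return b[0] ** 2 + b[1] ** 2


def rss(x1, x2, y, b):
    """Residual sum of squares (Y - XB)'(Y - XB)."""
    return sum((c - b[0] * a1 - b[1] * a2) ** 2 for a1, a2, c in zip(x1, x2, y))


def extra_rss(x1, x2, b, b_ols):
    """phi_0 = (B - b_ols)' X'X (B - b_ols), the increment in (6.2)."""
    d1, d2 = b[0] - b_ols[0], b[1] - b_ols[1]
    return sum((d1 * a1 + d2 * a2) ** 2 for a1, a2 in zip(x1, x2))


if __name__ == "__main__":
    b_ols = ridge_2var(X1, X2, Y, 0.0)  # k = 0 gives the OLS estimate
    for k in (0.0, 0.1, 0.5, 1.0):
        b = ridge_2var(X1, X2, Y, k)
        print(k, b, squared_length(b), extra_rss(X1, X2, b, b_ols))
```

Each chosen $k$ yields a shorter coefficient vector together with the increment $\phi_0$ it costs in residual sum of squares, which is the trade-off expressed by (6.2) and (6.3).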
It is clear that for a fixed increment $\phi_0$ there is a continuum of values of $B_0$ that satisfy the relationship $\phi = \phi_{\min} + \phi_0$, and the ridge estimate so derived is the one with the minimum length. Therefore, we may well expect the ridge estimates to yield a smaller MSE in the presence of multicollinearity, since they are derived by minimizing the length of the regression vector. It is true to a certain extent that minimizing the length of the regression vector is equivalent to reducing the MSE of the parameter estimates. In addition, (6.2) shows that it is possible to move further away from $\hat{\beta}_{OLS}$ without an appreciable increase in the residual sum of squares as $(X^TX)$ approaches singularity. That is to say, ridge regression may achieve a large reduction in MSE at virtually no cost in terms of the residual sum of squares if the conditioning of $(X^TX)$ is poor enough. In 1971, Newhouse and Oman [18] used MSE as the evaluation criterion in their Monte Carlo studies of ridge regression. Since then it has become the standard way to evaluate proposals for ridge estimators. From the above derivation of the ridge estimator, we see that ridge estimates are designed to be better under the MSE criterion.

We now study the implications of the constraint used in deriving the ridge estimator. Since orthogonalization can ease interpretation, we represent (6.2) in the rotated axes. Let

$$A = PB.$$

Then

$$\begin{aligned} \phi &= (Y - X^*\hat{\alpha}_{OLS})^T(Y - X^*\hat{\alpha}_{OLS}) + (A - \hat{\alpha}_{OLS})^TX^{*T}X^*(A - \hat{\alpha}_{OLS}) \\ &= \phi_{\min} + \sum_{i=1}^{p} (A_i - \hat{\alpha}_{OLS,i})^2\lambda_i, \end{aligned}$$

where $A_i$ is the estimate of the regression coefficient for the $i$th component and $\hat{\alpha}_{OLS,i}$ is the OLS estimate of the regression coefficient for the $i$th component. The problem is

Minimize $A^TA$

$$\text{Subject to } (A - \hat{\alpha}_{OLS})^T\Lambda(A - \hat{\alpha}_{OLS}) = \phi_0 \tag{6.5}$$

or equivalently

$$\text{Subject to } \sum_{i=1}^{p} (A_i - \hat{\alpha}_{OLS,i})^2\lambda_i = \phi_0. \tag{6.6}$$

(6.5) shows that the vector $[A - \hat{\alpha}_{OLS}]$ is normed through $\Lambda$ to have length equal to $\phi_0$. Since the eigenvalue $\lambda_i$ can be considered an indicator of the information content and explanatory power of the $i$th principal component, we may conclude that the derivation of the ridge estimator already takes the relative information content and explanatory power of the explanatory variables into account. (6.6) shows that the constraint incorporates the concept of a squared-error loss function as well. The weighted length of $[A - \hat{\alpha}_{OLS}]$ increases the most when the parameter estimates of the important components deviate from their OLS estimates, since it is formed by squaring the deviations and multiplying them by the corresponding eigenvalues. This implies that it is best to shrink the estimates for those components that have small eigenvalues, i.e. the ones most subject to instability.

6.2 Derivation of Ridge Estimator for an ALR Model

In the presence of autocorrelated error terms, the OLS estimator of $\beta$ no longer has the minimum-variance property; the GLS type estimator is the BLUE of $\beta$. Our derivation of a new ridge estimator adjusted for autocorrelation, $\hat{\beta}_{GR}(k)$, parallels the derivation of $\hat{\beta}_R(k)$ in the previous section. Again let $B$ be any estimate of $\beta$. Its residual sum of squares, $\phi$, can be written as the minimum sum of squares, $\phi_{\min}$, plus the distance from $B$ to $\hat{\beta}_{GLS}$ weighted through $(X_*^TX_*)$. (See (2.10) for notation.) We have

$$\begin{aligned} \phi &= (Y_* - X_*B)^T(Y_* - X_*B) \\ &= (Y_* - X_*\hat{\beta}_{GLS})^T(Y_* - X_*\hat{\beta}_{GLS}) + (B - \hat{\beta}_{GLS})^TX_*^TX_*(B - \hat{\beta}_{GLS}). \end{aligned}$$

For a specific value $\phi(B) = \phi_0$, $\hat{\beta}_{GR}$ is derived to

minimize $B^TB$

$$\text{subject to } (B - \hat{\beta}_{GLS})^TX_*^TX_*(B - \hat{\beta}_{GLS}) = \phi_0. \tag{6.7}$$

The Lagrangian is given by

$$F = B^TB + \frac{1}{k}\left[(B - \hat{\beta}_{GLS})^TX_*^TX_*(B - \hat{\beta}_{GLS}) - \phi_0\right].$$

A necessary condition for a minimum is that

$$\frac{\partial F}{\partial B} = 2B + \frac{1}{k}\left[2X_*^TX_*B - 2(X_*^TX_*)\hat{\beta}_{GLS}\right] = 0.$$
This reduces to

$$B^* = \hat{\beta}_{GR}(k) = (X_*^TX_* + kI)^{-1}X_*^TY_* = (X^T\Omega^{-1}X + kI)^{-1}X^T\Omega^{-1}Y, \tag{6.8}$$

where $k$ is chosen to satisfy (6.7). The characterization of the ridge trace of $\hat{\beta}_{GR}(k)$ is essentially the same as that of $\hat{\beta}_R(k)$. That is, for a specific increment $\phi_0$, the $\hat{\beta}_{GR}$ so derived is the regression vector with the minimum length among a continuum of values of $B_0$ that satisfy the relationship $\phi = \phi_{\min} + \phi_0$. However, in some rare cases multicollinearity may no longer be a substantial problem after transforming $X$ into $X_*$. For instance, this may happen in time series studies where multicollinearity is a result of the explanatory variables increasing together over time. It is then possible that the transformed variables are not close to being collinear with each other. If that is the case, the reduction in MSE cannot be obtained with only a slight increase in the residual sum of squares. This is because of the low MSE of $\hat{\beta}_{GLS}$ already achieved and the non-singularity of $(X_*^TX_*)$. In most cases, if not all, the matrix $(X^T\Omega^{-1}X)$ is very likely to have a broad eigenvalue spectrum if $(X^TX)$ does. Then the previous discussion of the motivation for minimizing the length of the regression vector, and of the interpretation and implications of the constraint in the derivation of the ridge estimator of $\beta$ for a CLR model, applies equally to the derivation of $\hat{\beta}_{GR}$.

6.3 Mean Square Error of the "Generalized Estimators"

The MSE of $\hat{\beta}_{GLS}$ and $\hat{\beta}_{GR}(k)$ are readily established. Since $Y_* = X_*\beta + u$ satisfies all the assumptions of a CLR model, (5.3) with $\rho_\varepsilon = 0$ gives the MSE of $\hat{\beta}_{GLS}$ as follows.

Let $L_3$ = Distance from $\hat{\beta}_{GLS}$ to $\beta$.

$$E(L_3^2) = \sigma_u^2\,\mathrm{tr}(X_*^TX_*)^{-1} = \sigma_u^2\,\mathrm{tr}(X^T\Omega^{-1}X)^{-1}. \tag{6.9}$$

Setting $\rho_\varepsilon = 0$ in (5.8), (6.10) gives the MSE of $\hat{\beta}_{GR}(k)$.

Let $L_4(k)$ = Distance from $\hat{\beta}_{GR}(k)$ to $\beta$.

$$E[L_4^2(k)] = \sigma_u^2\,\mathrm{tr}\left[(X^T\Omega^{-1}X + kI)^{-2}(X^T\Omega^{-1}X)\right] + k^2\beta^T(X^T\Omega^{-1}X + kI)^{-2}\beta. \tag{6.10}$$

The effect of autocorrelation is difficult to infer from (6.9) and (6.10), since $\Omega$ is not a diagonal matrix; normally, however, we may expect $E(L_3^2)$ and $E[L_4^2(k)]$ to be less than $E(L_1^2)$ and $E[L_2^2(k)]$ respectively.

6.4 Estimation

Theoretically, GLS gives the BLUE of $\beta$ for an ALR model. In practice, however, usually neither the order of the autocorrelation structure nor the value of the parameter $\rho_\varepsilon$ is known. Hence the GLS or GR estimates cannot be computed directly. Many two-stage methods have been proposed to approximate the GLS estimates and have proven to be quite effective. These include the Cochrane-Orcutt iterative process [1] and Durbin's two-step method [2]. In the joint presence of multicollinearity and autocorrelation, it is quite straightforward to combine ridge regression with one of the two-stage regression methods in the hope of achieving better estimates of $\beta$. We illustrate how ridge regression can be incorporated in Durbin's two-step method for a simple model with only two collinear explanatory variables:

$$Y_t = \beta_0 + \beta_1X_{t1} + \beta_2X_{t2} + \varepsilon_t, \qquad t = 1, 2, \ldots, n, \tag{6.11}$$

$$\varepsilon_t = \rho_\varepsilon\varepsilon_{t-1} + u_t, \qquad |\rho_\varepsilon| < 1,$$

and for all $t$

$$E(u_t) = 0, \qquad E(u_tu_{t+s}) = \begin{cases} \sigma_u^2 & \text{for } s = 0, \\ 0 & \text{for } s \neq 0. \end{cases}$$

The transformed relation is given by

$$Y_t - \rho_\varepsilon Y_{t-1} = \beta_0(1 - \rho_\varepsilon) + \beta_1(X_{t1} - \rho_\varepsilon X_{t-1,1}) + \beta_2(X_{t2} - \rho_\varepsilon X_{t-1,2}) + u_t. \tag{6.12}$$
Combining (6.11) and (6.12) gives

$$Y_t = \beta_0(1 - \rho_\varepsilon) + \beta_1X_{t1} - \beta_1\rho_\varepsilon X_{t-1,1} + \beta_2X_{t2} - \beta_2\rho_\varepsilon X_{t-1,2} + \rho_\varepsilon Y_{t-1} + u_t. \tag{6.13}$$

The first step is to estimate the parameters of (6.13) using OLS. Then the estimated coefficient of $Y_{t-1}$, $\hat{\rho}_\varepsilon$, is used to compute the transformed variables $(Y_t - \hat{\rho}_\varepsilon Y_{t-1})$, $(X_{t1} - \hat{\rho}_\varepsilon X_{t-1,1})$ and $(X_{t2} - \hat{\rho}_\varepsilon X_{t-1,2})$. At the second step, it is highly recommended that ridge regression be used in place of OLS and applied to relationship (6.12) containing those transformed variables. The coefficient estimate of $(X_{ti} - \hat{\rho}_\varepsilon X_{t-1,i})$ is our approximation of $\hat{\beta}_{GR,i}$, and the intercept term divided by $(1 - \hat{\rho}_\varepsilon)$ is our approximation of $\hat{\beta}_{GR,0}$.

It might seem reasonable to apply ridge regression at the first step of Durbin's method as well, since $X_{t1}$ and $X_{t2}$ are collinear. As stated earlier in Section 3, however, a high pair-wise correlation coefficient between explanatory variables does not necessarily result in estimation instability. Besides, the lagged values of $X_{t1}$ and $X_{t2}$ are inserted into the explanatory variable set. If $X_{t1}$ and $X_{t2}$ are not autocorrelated, the conditioning of the enlarged $(X^TX)$ may be satisfactory. Moreover, $u_t$ in (6.12) has a scalar dispersion matrix, therefore OLS gives consistent estimates of the regression coefficients. Also, among these estimates, only the coefficient estimate of $Y_{t-1}$ is used to compute the transformed variables. Hence the OLS technique is recommended at the first step even when $X_{t1}$ and $X_{t2}$ are collinear. This combination of ridge regression and Durbin's two-step method can easily be extended to a $p$-variable model with a higher order of autocorrelation.

6.5 Prediction

Consider a first-order ALR model; (2.15) gives the minimum variance predictor (BLUP). In practice, both $\beta_{GLS}$ and $\rho_\varepsilon$ are replaced by their estimated values.
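The plug-in steps of Sections 6.4 and 6.5 can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the estimate of $\rho_\varepsilon$ is taken as already available from the first step, the forecast is assumed to have the usual first-order form "regression part plus $\hat{\rho}_\varepsilon$ times the last residual" (the form of (6.14)), and all function names and numbers are hypothetical.

```python
def quasi_difference(series, rho):
    """Return s_t - rho * s_{t-1} for t = 2, ..., n, as in (6.12)."""
    return [cur - rho * prev for prev, cur in zip(series, series[1:])]


def original_intercept(transformed_intercept, rho):
    """The intercept of (6.12) is beta_0 * (1 - rho); undo the scaling."""
    return transformed_intercept / (1.0 - rho)


def predict_next(x_next, beta_hat, rho_hat, last_residual):
    """One-step-ahead forecast: x'beta_hat plus rho_hat times the last residual."""
    return sum(x * b for x, b in zip(x_next, beta_hat)) + rho_hat * last_residual


if __name__ == "__main__":
    rho_hat = 0.5  # hypothetical first-step estimate of the autocorrelation
    y = [20.0, 23.0, 21.5, 25.0]  # hypothetical observations
    print(quasi_difference(y, rho_hat))
    print(original_intercept(2.5, rho_hat))
    # The leading 1.0 in x_next multiplies the intercept (an assumption
    # about how the explanatory vector is laid out).
    print(predict_next([1.0, 10.0, 8.0], [5.0, 1.1, 1.0], rho_hat, 2.0))
```

The same `quasi_difference` call would be applied to $Y$, $X_{t1}$ and $X_{t2}$ before the second-step ridge regression; one observation is lost because no lagged value exists for $t = 1$.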
If ridge regression is used in conjunction with some other method to cope with the joint problem of multicollinearity and autocorrelation, the prediction is given by

$$\hat{Y}_{t+1} = X_{t+1}\hat{\beta}_{GR}(k) + \hat{\rho}_\varepsilon e_t, \tag{6.14}$$

where $X_{t+1}$ is a $1 \times p$ vector of the $(t+1)$th observation on the explanatory variables, $\hat{\beta}_{GR}(k)$ is a $p \times 1$ vector of approximated regression coefficients, $\hat{\rho}_\varepsilon$ is an estimate of the autocorrelation coefficient and $e_t$ is the ridge residual at time $t$.

7. THE MONTE CARLO STUDY

Consider a first-order ALR model with two explanatory variables. The error terms in the transformed relation, as shown by (6.12), have a scalar dispersion matrix. The residual sum of squares from (6.12) is given by

$$\sum_t u_t^2 = \sum_t \left[Y_t - \rho_\varepsilon Y_{t-1} - \beta_0(1 - \rho_\varepsilon) - \beta_1(X_{t1} - \rho_\varepsilon X_{t-1,1}) - \beta_2(X_{t2} - \rho_\varepsilon X_{t-1,2})\right]^2. \tag{7.1}$$

If $X_0$ and $Y_0$ are given, the summation can run from 1 to $n$; otherwise it can only run from 2 to $n$. The direct minimization of (7.1) with respect to $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$ and $\hat{\rho}_\varepsilon$ leads to non-linear equations, so analytic expressions for $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$ and $\hat{\rho}_\varepsilon$ cannot be obtained. As mentioned before, many two-stage methods have been proposed to approximate these parameters. In practice, usually not only the parameters and the true error terms but also the order of the autocorrelation structure is unknown. As indicated previously, the joint presence of autocorrelation and multicollinearity further complicates the situation. Under these circumstances, the relative effectiveness of the two-stage methods can best be studied by Monte Carlo experiments [19].

7.1 Design of the Experiments

The main purpose of the experiments is to give empirical support to the inferences drawn from our analytic studies. The sampling experiments are conducted in the following manner. They comprise nine different experiments with different degrees of multicollinearity and autocorrelation. They are summarized in Table 1.
Table 1

Experiment    r12     rho_eps
1             .05     .05
2             .05     .50
3             .05     .90
4             .50     .05
5             .50     .50
6             .50     .90
7             .95     .05
8             .95     .50
9             .95     .90

In our experiments, $r_{12}$ and $\rho_\varepsilon$ are used to indicate the severity of multicollinearity and autocorrelation respectively. In practice, multicollinearity usually constitutes a problem only when $r_{12}$ is as high as 0.8 or 0.9. In addition, the error terms are considered to be independent, moderately autocorrelated and highly autocorrelated when $\rho_\varepsilon$ = .05, .50 and .90 respectively. As shown by Table 1, the experiments are set up to have different characteristics. Through this design, we can study the effects of autocorrelation on estimation and prediction for a given degree of multicollinearity. Moreover, we can observe how these effects of autocorrelation change as the degree of multicollinearity varies.

The data are generated as follows. First, values are assigned to $\beta_0$, $\beta_1$, $\beta_2$ and to the probability characteristics of the error terms $u_t$ in (2.9). Three series of $\varepsilon$ in (2.9) are subsequently generated, given the value of $\varepsilon_0$ and different values of $\rho_\varepsilon$. The probability characteristics of the joint distribution of $X_{t1}$ and $X_{t2}$ are chosen to generate series of $X_{t1}$ and $X_{t2}$ that are suitable for the first three experiments. By varying the correlation coefficient of $X_{t1}$ and $X_{t2}$, another two series of $X_{t1}$ and $X_{t2}$ are generated for the remaining six experiments. We have also ensured that there is no significant first-order autocorrelation in $X_{t1}$ and $X_{t2}$, so that the error structure is first-order. Solving for $Y$ based on these data, nine different data sets are generated for the experiments. For each experiment, ten samples of forty observations are generated. In each sample, the thirty observations on the $Y$'s from $t = 1$ to $t = 30$ are employed to estimate the equation by appropriate methods.
Observations 31 to 40 are used to study the prediction properties of the estimators. The BLUP is used in the presence of significant autocorrelation in the error terms.

Special care has to be exercised in controlling the serial correlation properties of the error terms. In this connection, OLS regression has to be run on

$$\varepsilon_{jt} = \rho_\varepsilon\varepsilon_{j,t-1} + u_{jt}, \qquad j = 1, 2, \ldots, 10; \quad t = 1, 2, \ldots, 40 \tag{7.2}$$

to determine whether the estimated regression coefficient $\hat{\rho}_\varepsilon$ is consistent with the $\rho_\varepsilon$ used to generate the series. However, as is well known, the OLS estimates of parameters for small samples may be badly biased if some of the regressors are lagged dependent variables [23]. This is because the error terms $u_{jt}$ are no longer independent of the regressors $\varepsilon_{jt}, \varepsilon_{j,t+1}, \ldots, \varepsilon_{j,40}$. In (7.2), $E(u_{jt}\varepsilon_{j,t+s}) \neq 0$ for $s \geq 0$ and all $t$, hence the OLS estimate of the coefficient of $\varepsilon_{j,t-1}$ is biased. The usual $t$ test on the estimate of the regression coefficient may be quite misleading, therefore we can only ascertain that the desired serial correlation properties are obtained by ensuring that the $u_{jt}$ are randomly distributed. For each sample, we first test whether the series of $u$ is consistent with the probability characteristics chosen to generate it, then we use a run test to determine whether $u_t$ is randomly distributed. Only those series of $u$ that pass all the tests are adopted in our simulation study.

We are now ready to estimate the regression equation. First, for each experiment, the OLS principle is applied to estimate the parameters. The Durbin-Watson statistic is used as a filter to test for the existence of autocorrelation.
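The error-generation step and the Durbin-Watson filter can be sketched as follows. This is a minimal illustration in Python rather than the procedure actually used in the study: the function names are hypothetical, the random seed is arbitrary, and the starting value and the innovation variance of 6 are assumptions taken from the design values quoted in Section 7.2.

```python
import math
import random


def ar1_errors(n, rho, eps0, sigma_u, rng):
    """Generate eps_t = rho * eps_{t-1} + u_t with u_t ~ N(0, sigma_u^2)."""
    eps, prev = [], eps0
    for _ in range(n):
        prev = rho * prev + rng.gauss(0.0, sigma_u)
        eps.append(prev)
    return eps


def durbin_watson(residuals):
    """d = sum_{t>=2} (e_t - e_{t-1})^2 / sum_t e_t^2."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


if __name__ == "__main__":
    rng = random.Random(1980)  # arbitrary seed, for reproducibility only
    for rho in (0.05, 0.50, 0.90):
        eps = ar1_errors(40, rho, 3.0, math.sqrt(6.0), rng)
        print(f"rho={rho:.2f}  d={durbin_watson(eps):.3f}")
```

Values of d well below 2 point to positive first-order autocorrelation, which is why the statistic is compared with the upper critical value before a correction method is chosen.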
Whenever the Durbin-Watson statistic computed from the fitted model is less than the corresponding upper critical value $d_U$ ($\alpha = 0.05$), autocorrelation is assumed to be present in the error terms, and Durbin's two-step method is used in conjunction with ridge regression for estimation as described in Section 6.4; otherwise, autocorrelation is assumed to be absent, and only OLS and ridge regression techniques are employed for estimation. In addition, whenever the existence of autocorrelation is recognized, $\hat{\beta}_{GLS}$ and $\hat{\beta}_{GR}(k)$ are computed for comparison purposes. Since the true value of the autocorrelation coefficient is known for each experiment, the calculation of $\hat{\beta}_{GLS}$ and $\hat{\beta}_{GR}(k)$ is simply the straightforward multiplication of matrices shown by (2.13) and (6.8). The methods adopted for estimation in each experiment are recorded in Table 2.

Table 2

Experiment    Method
1             OLS, RR
2             Durb., Durb.+RR, GLS, GR
3             Durb., Durb.+RR, GLS, GR
4             OLS, RR
5             Durb., Durb.+RR, GLS, GR
6             Durb., Durb.+RR, GLS, GR
7             OLS, RR
8             Durb., Durb.+RR, GLS, GR
9             Durb., Durb.+RR, GLS, GR

OLS: Ordinary Least-squares Regression
Durb.: Durbin's Two-step Method
Durb.+RR: Durbin's Two-step Method in conjunction with Ridge Regression
GLS: Generalized Least-squares Regression
GR: Ridge Regression adjusted for Autocorrelation

As expected, no correction for autocorrelation is necessary for experiments 1, 4 and 7. Whenever the ridge method is applied, seven or eight values of $k$ have been used in our study. In order to minimize the effects of subjectivity in selecting the value of $k$ in ridge regressions, we compute the mean $\hat{\beta}_R$ of the samples for every specific value of $k$ in each experiment. Then the value of $k$ is selected based on a "Mean Ridge Trace".
That is to say, the unique value of $k$ selected will generally be the best for all ten samples. Obviously, the value of $k$ so selected may well not be the best for every individual sample. Therefore, the minimum of the MSE of the ridge estimates of $\beta$ achieved for each experiment is slightly upward biased. Certainly this way of selecting the value of $k$ cannot be used in practice.

7.2 Sampling Results

For each method in each experiment, the MSE of the estimates of $\beta$, the adjusted $R^2$, the residual sum of squares, the MSE of the forecast and the Durbin-Watson statistic are averaged over ten samples. In addition, the mean estimate of $\rho_\varepsilon$ and the mean Haitovsky heuristic statistic are also computed. $u_t$ is assumed to follow a normal distribution with mean zero and variance equal to 6, i.e. $u_t \sim N(0, 6)$. The true mean of $X_{t1}$ is 10 and that of $X_{t2}$ is 8. The respective variances of $X_{t1}$ and $X_{t2}$ are 18 and 15. $\varepsilon_0$ is chosen to be 3 for each sample. The true value of $\beta_0$ is 5, $\beta_1$ is 1.1 and $\beta_2$ is 1.

7.2a Results assuming $\rho_\varepsilon$ is known

First we assume $\rho_\varepsilon$ is known. The results here indicate whether the methods described in Section 6.4 can show promise in the best of situations. Table 3 contains the average MSE of $\hat{\beta}_{GLS}$ and $\hat{\beta}_{GR}$ for experiments 2, 3, 5, 6, 8 and 9.

Table 3

Experiment        2          3          5          6          8          9
r12              .05        .05        .50        .50        .95        .95
rho_eps          .50        .90        .50        .90        .50        .90
k
0.0*            .4824     3.2913      .0834     2.3940      .1011     2.1681
.025            .0591     1.8084      .0018     1.4991       -          -
.05             .0363      .8073      .1215      .8337      .0561      .9405
.075            .3603      .2274      .3033      .4263       -          -
.1              .9849      .0156      .8871      .1092      .4929      .2901
.2             5.7786     2.0100     4.0851      .6153     2.4426      .4242
.3             2.9771     7.0896     9.3743     4.0521     5.4924     1.0113
.5            30.1494    21.7887    21.1581    11.4165    13.7628     4.8285
.7            47.5367    36.6241    35.2063    22.2489    23.7003    10.9524
1.0           75.6732    63.4542    56.5869    39.9762    38.5830

* GLS regression can be considered a special case of ridge regression adjusted for autocorrelation, with $k = 0$.

In Section 5 it has been shown that the MSE of $\hat{\beta}_{OLS}$ increases rapidly if significantly positive autocorrelation exists both in the disturbances and in the principal components. Correction for autocorrelation is then necessary in estimating the regression equation. Though GLS regression yields the BLUE of $\beta$, the behavior of the MSE of $\hat{\beta}_{GLS}$ is very difficult to infer from (6.10). From the MSE of experiments 3, 6 and 9 when $k = 0$, we observe that the MSE of $\hat{\beta}_{GLS}$ decreases as the degree of multicollinearity increases for a sufficiently high degree of autocorrelation. On the other hand, for a given degree of multicollinearity, the MSE of $\hat{\beta}_{GLS}$ increases as the degree of autocorrelation increases. But the magnitude of the increase in the MSE of $\hat{\beta}_{GLS}$ decreases as the relation among the explanatory variables becomes stronger. For instance, the difference in the MSE of $\hat{\beta}_{GLS}$ between experiments 8 and 9 is less than that between experiments 5 and 6. Moreover, Table 3 shows that there exists at least one value of $k$ for each experiment such that the MSE of $\hat{\beta}_{GR}$ is less than that of $\hat{\beta}_{GLS}$. Note that when $\rho_\varepsilon = .9$, $k = .1$ obtains the minimum MSE of the estimates of $\beta$ in experiment 9. This also implies that the transformed matrix $(X_*^TX_*)$ is still ill-conditioned. In Section 5.3 we have shown that the range of $k$ such that the MSE of $\hat{\beta}_R$ is less than that of $\hat{\beta}_{OLS}$ will be larger if multicollinearity is accompanied by a high degree of autocorrelation.
Now, with parameter estimates fully adjusted for autocorrelation (since $\rho_\varepsilon$ is known), experiment 9 still has the largest admissible range of $k$. That is, the range of $k$ such that the MSE of $\hat{\beta}_{GR}$ is less than that of $\hat{\beta}_{GLS}$ is still larger if multicollinearity is accompanied by a high degree of autocorrelation, even when autocorrelation has been fully adjusted for. We also observe that as the degree of autocorrelation increases, a larger reduction in the MSE of the estimates of $\beta$ can be obtained by replacing $\hat{\beta}_{GLS}$ with $\hat{\beta}_{GR}$. For instance, the difference between the MSE of $\hat{\beta}_{GLS}$ and that of $\hat{\beta}_{GR}(.05)$ in experiment 8 is less than that in experiment 9. However, the behavior of the MSE of $\hat{\beta}_{GR}$ is very difficult, if not impossible, to predict. (6.10) shows that the MSE of $\hat{\beta}_{GR}$ is comprised of two terms. How each term behaves depends not only on the data matrix $X$ and the degree of autocorrelation but also on the way the matrix $X$ is linked with the matrix $\Omega^{-1}$.

7.2b Results assuming $\rho_\varepsilon$ is unknown

In practice $\rho_\varepsilon$ is unknown. We assume that the true autocorrelation coefficient is unknown and we try to fit the equation using heuristic techniques akin to Durbin's two-step method, which has been shown by Griliches and Rao [8] to perform well when there is autocorrelation. We apply these techniques as described in Section 7.1.

Tables 4, 5 and 6 report the mean adjusted $R^2$ and the mean Durbin-Watson statistic for each experiment. ($d_U(\alpha = 5\%) = 1.57$ for experiments 1, 4 and 7; $d_U(\alpha = 5\%) = 1.56$ for the remaining six experiments.)

Table 4 (r12 = .05)

Experiment          1                    2                    3
rho_eps            .05                  .50                  .90
k              R2a*     d**         R2a      d           R2a      d
0.0           .8640   2.0791       .8979   1.8766       .9149   1.8380
.025          .8640   2.0903       .8884   1.8713       .9144   1.8269
.05           .8620   2.1001       .8868   1.8728       .9128   1.8286
.075          .8579   2.1083       .8844   1.8805       .9104   1.8409
.1            .8567   2.1154       .8813   1.8916       .9071   1.8613
.2            .8401   2.1353       .8619   1.9619       .8888   1.9809
.3            .8167   2.1463       .8399   2.0383       .8648   2.1030
.5            .7651   2.1536       .7865   2.1563       .8102   2.2789
1.0           .6394   2.1527       .6571   2.2960       .6780   2.4740

* R2a: the mean adjusted $R^2$
** d: the mean Durbin-Watson statistic

Table 5 (r12 = .50)

Experiment          4                    5                    6
rho_eps            .05                  .50                  .90
k              R2a      d           R2a      d           R2a      d
0.0           .8973   2.0984       .9178   1.8958       .9475   2.0754
.025          .8970   2.1040       .9175   1.9054       .9472   2.0865
.05           .8962   2.1095       .9168   1.9194       .9464   2.1059
.075          .8950   2.1147       .9155   1.9371       .9451   2.1317
.1            .8933   2.1198       .9138   1.9576       .9434   2.1622
.2            .8832   2.1381       .9032   2.0540       .9336   2.3024
.3            .8702   2.1592       .8895   2.1504       .9184   2.4532
.5            .8351   2.1721       .8546   2.2988       .8834   2.6086
.7            .7970   2.1826       .8187   2.4203       .8440   2.7084
1.0           .7412   2.1900       .7572   2.4732       .7846   2.7887

Table 6 (r12 = .95)

Experiment          7                    8                    9
rho_eps            .05                  .50                  .90
k              R2a      d           R2a      d           R2a      d
0.0           .9208   2.0511       .9391   1.8785       .9575   1.8707
.05           .9201   2.0873       .9380   1.9058       .9565   1.8883
.1            .9181   2.1128       .9359   1.9437       .9544   1.9335
.2            .9116   2.1473       .9293   2.0280       .9677   2.0562
.3            .9024   2.1659       .9199   2.1084       .9382   2.1660
.5            .8786   2.1768       .8957   2.2312       .9136   2.8381
.7            .8505   2.1725       .8673   2.3082       .8847   2.4417
1.0           .8056   2.1594       .8217   2.3739       .8399   2.5259

From Tables 4-6, we observe that the adjusted $R^2$ increases as the degree of autocorrelation increases for a given value of $k$ and a given degree of multicollinearity.
This is intuitively plausible, since autocorrelation can account for part of the variation in the errors, thereby decreasing the residual sum of squares and increasing the adjusted $R^2$. For each experiment, the adjusted $R^2$ decreases as $k$ increases. The reason is clear from the derivation of ridge estimators: the least-squares fit ($k = 0$) minimizes the residual sum of squares, so any $k > 0$ can only increase it. Besides, the best $R_a^2$ achieved for each experiment is quite high; that is, the estimated model can explain most of the variation in $Y$. This also implies that the estimation methods adopted in our experiments are fairly efficient and powerful. The mean Durbin-Watson statistic computed for each method for each experiment is high enough to ascertain that the fitted model has successfully removed the problem of autocorrelation. Since the model is reasonably well fitted, simulation comparisons of the experimental results should be meaningful as well as informative.

The average MSE of the estimates of $\beta$ is computed for each method for each experiment and recorded in Table 7.

Table 7

                                        Experiment
k       1        2        3        4        5        6        7        8        9
0.0     .1101    .4824    .9594    .0030    .0342    .3996    .0180    .0720    .6951
.025    .0192    .0570    .2691    .1104    .0210    .0945    -        -        -
.05     .2865    .0390    .0087    .4158    .1965    .0024    .1833    .0719    .0939
.075    .8820    .3744    .1200    .8973    .5778    .1026    -        -        -
.1      1.7430   1.0167   .5559    1.5366   1.1097   .3765    .8307    .5643    .0041
.2      7.3539   5.9115   4.7694   5.3823   4.5690   2.7540   3.2001   1.2549   .5822
.3      15.1383  13.2456  11.5962  10.8432  9.5949   6.8219   6.6531   5.7624   3.3012
.5      33.5067  31.1559  28.8369  23.9694  22.3449  18.6255  15.6804  14.2377  10.0251
.7      -        -        -        38.6334  36.9396  32.0214  26.3049  24.3789  18.5115
1.0     79.3134  76.8708  73.8630  60.7620  50.3176  52.8276  43.3464  40.2351  32.3991

As in the known autocorrelation case, when $k = 0$ the MSE of the estimates of $\beta$ increases as the degree of autocorrelation increases, given the degree of multicollinearity.
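The two summary statistics used in this section, the Durbin-Watson statistic and the average MSE of the coefficient estimates, can be sketched as follows. This is only an illustration: the function names are hypothetical, and the MSE is assumed here to be the squared Euclidean distance between the estimated and true coefficient vectors, averaged over the simulated samples.

```python
import numpy as np

def durbin_watson(residuals):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 suggest no AR(1)."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def average_mse(estimates, beta_true):
    """Mean over replications of the squared distance ||b - beta||^2 (assumed definition)."""
    estimates = np.asarray(estimates, dtype=float)   # shape (n_samples, p)
    beta_true = np.asarray(beta_true, dtype=float)   # shape (p,)
    squared_errors = np.sum((estimates - beta_true) ** 2, axis=1)
    return float(squared_errors.mean())

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))                    # 12 / 4 = 3.0
print(average_mse([[1.0, 2.0], [1.5, 2.0]], [1.0, 2.0]))        # (0 + 0.25) / 2 = 0.125
```

In a Monte Carlo study of this kind, each row of `estimates` would hold the coefficient vector obtained from one simulated sample for a fixed method and a fixed value of $k$.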
However, unlike the known autocorrelation case, the MSE of the estimates of $\beta$ first decreases and then increases as the degree of multicollinearity increases, for $k = 0$ and a given degree of autocorrelation. Table 7 shows that, except for experiments 4 and 7, better estimates of $\beta$ in the MSE criterion can be obtained if Durbin's two-step method is combined with ridge regression for estimation. Surprisingly, we have also found that we can sometimes obtain better estimates of $\beta$ in terms of MSE when the true autocorrelation coefficient $\rho_\varepsilon$ is unknown than when it is known. For clarity, we shall compare only the minimum of the average MSE of the estimates of $\beta$ achieved for each experiment in the $\rho_\varepsilon$ unknown case with that in the $\rho_\varepsilon$ known case. Table 8 reports the minima of the average MSE of the estimates of $\beta$ achieved for each experiment in both the known and unknown cases. In addition, the estimation method and the characteristics of each experiment are also tabulated.

Table 8

                                                    ($\rho_\varepsilon$ unknown)        ($\rho_\varepsilon$ known)
Experiment  ($r_{12}$, $\rho_\varepsilon$)  Estimation method   k      Min. MSE of $\hat{\beta}$   k      Min. MSE of $\hat{\beta}_{GR}$
1           (.05, .05)                      RR                  .025   .0192                       -      -
2           (.05, .50)                      Durb.+RR            .05    .0390                       .05    .0363
3           (.05, .90)                      Durb.+RR            .05    .0087                       .1     .0156
4           (.50, .05)                      OLS                 0.0    .0030                       -      -
5           (.50, .50)                      Durb.+RR            .025   .0210                       .025   .0018
6           (.50, .90)                      Durb.+RR            .05    .0024                       .1     .1092
7           (.95, .05)                      OLS                 0.0    .0180                       -      -
8           (.95, .50)                      Durb.+RR            .05    .0719                       .05    .0561
9           (.95, .90)                      Durb.+RR            .1     .0441                       .1     .2901

A couple of interesting observations can be made from Table 8. First, if the degree of multicollinearity is held constant, the minimum of the average MSE of the parameter estimates will first increase and then decrease as the degree of autocorrelation increases.
On the other hand, given the degree of autocorrelation, the minimum of the average MSE of the parameter estimates will first decrease and then increase as the degree of multicollinearity increases. These patterns are intuitively plausible, since a sufficiently high degree of autocorrelation should lead to more stable parameter estimates, while a sufficiently high degree of multicollinearity usually results in very unstable parameter estimates. Secondly, we observe that the value of the ridge parameter $k$ used to achieve the minimum MSE of the estimates of $\beta$ increases with the degree of multicollinearity and autocorrelation. This is consistent with our analytic findings shown in Section 5.3. Moreover, we have found that knowing $\rho_\varepsilon$ does not give better estimates of $\beta$ for a sufficiently high degree of autocorrelation. This may result from the sample sizes being small.

Table 9 contains the mean estimates of $\rho_\varepsilon$ obtained in the first step of Durbin's method and the mean Haitovsky heuristic statistic for each experiment.

Table 9

Experiment                              1      2      3      4      5      6      7      8      9
$\hat{\rho}_\varepsilon$                -      .3581  .7182  -      .3586  .7231  -      .3849  .7498
Bias in $\hat{\rho}_\varepsilon$        -      .1419  .1818  -      .1414  .1769  -      .1151  .1502
H ($\chi^2$, df = 3)                    125.7  123.1  111.7  39.1   38.7   37.4   2.78   2.53   2.40

In all cases, Durbin's two-step method tends to underestimate the true autocorrelation coefficient. This results from the presence of the lagged $Y$ values among the explanatory variables [16]. If the degree of multicollinearity is held constant, the bias of the estimate of $\rho_\varepsilon$ increases as the degree of autocorrelation increases, while, given the degree of autocorrelation, the bias decreases as the degree of multicollinearity increases.

In our simulation study, the Haitovsky heuristic statistic can recognize the existence of severe multicollinearity in experiments 7, 8 and 9. However, it does not give any warning when there exists a fairly high degree of multicollinearity; that is, based on the Haitovsky test, multicollinearity is insignificant in experiments 4, 5 and 6. Since the Haitovsky test is based on the determinant of the correlation matrix, it has some built-in deficiencies (see Section 3.3 for details). Our experiments have disclosed these deficiencies to a certain extent; hence we suggest that special care be exercised in applying this test.

7.2c Forecasting

Tables 10, 11 and 12 report the average residual sums of squares and the mean square error of prediction from the given values for the forecast period of each experiment, under the assumption that $\rho_\varepsilon$ is unknown.

Table 10 ($r_{12} = .05$, $\sigma_u^2 = 6$)

Experiment            1                          2                          3
k       $\hat{\sigma}_u^2$*  MSE_F/C**    $\hat{\sigma}_u^2$  MSE_F/C      $\hat{\sigma}_u^2$  MSE_F/C
0.0     5.9700     8.1132        5.7055     9.1966        5.6351     10.939
.025    5.9924     8.0343        5.7332     9.6623        5.6721     10.952
.05     6.0554     7.9961        5.8113     9.0620        5.7762     11.034
.075    6.1536     7.9986        5.9328     9.1072        5.9382     11.173
.1      6.2824     8.0343        6.0918     9.1913        6.1501     11.360
.2      7.0250     8.4293        7.0074     9.8181        7.3716     12.470
.3      8.0022     9.0877        8.2087     10.739        8.9754     13.932
.5      10.245     10.755        10.957     12.944        12.6521    17.249
1.0     15.733     15.075        17.656     18.419        21.649     25.164

* $\hat{\sigma}_u^2$: the average of the residual sums of squares over ten samples.
** MSE_F/C: the average MSE of predictions from the given values for the forecast period.

Table 11 ($r_{12} = .50$, $\sigma_u^2 = 6$)

Experiment            4                          5                          6
k       $\hat{\sigma}_u^2$  MSE_F/C      $\hat{\sigma}_u^2$  MSE_F/C      $\hat{\sigma}_u^2$  MSE_F/C
0.0     6.0169     8.2093        5.8625     9.3838        5.7757     10.733
.025    6.0331     8.1743        5.8828     9.3691        5.8038     10.731
.05     6.0797     8.1690        5.9409     9.3874        5.8838     10.672
.075    6.1541     8.1905        6.0330     9.4351        6.0109     10.850
.1      6.2518     8.2360        6.1559     9.5093        6.1803     10.961
.2      6.8444     8.6142        6.8849     10.021        7.1976     11.679
.3      7.2030     9.2476        7.9231     10.783        8.6992     12.482
.5      9.7190     10.851        10.470     12.735        12.128     15.327
.7      11.933     12.726        13.070     14.851        15.759     18.018
1.0     15.429     15.620        17.566     18.301        21.929     22.680

Table 12 ($r_{12} = .95$, $\sigma_u^2 = 6$)

Experiment            7                          8                          9
k       $\hat{\sigma}_u^2$  MSE_F/C      $\hat{\sigma}_u^2$  MSE_F/C      $\hat{\sigma}_u^2$  MSE_F/C
0.0     6.0443     8.3699        5.6725     9.8589        5.4033     11.335
.05     6.1186     8.1165        5.7759     9.4141        5.5390     10.758
.1      6.2753     8.1220        6.0473     9.3568        5.8110     10.754
.2      6.7890     8.4120        6.6160     9.5945        6.7433     11.172
.3      7.5180     8.9408        7.5161     10.116        7.9187     12.001
.5      9.4101     10.445        9.8466     11.683        11.120     14.301
.7      11.643     12.290        12.584     13.651        14.881     17.123
1.0     15.198     15.343        16.973     16.918        20.713     21.591

Though the BLUP is adopted for forecast purposes, the MSE of prediction still increases as the degree of autocorrelation increases. The main point, however, is that the presence of multicollinearity will adversely affect the predictive performance if the disturbances are highly serially correlated. The commonly held belief that the predictive power of the model is not affected by the existence of multicollinearity is only true if the problem of autocorrelation is not serious. In the 9th experiment, the model fitted by Durbin's method gives satisfactory results on various diagnostic tests, yet the prediction of the BLUP leaves much to be desired.
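As a rough illustration of the forecasting step, the sketch below computes a BLUP-style forecast under first-order autoregressive errors: the regression forecast plus a correction that carries the last in-sample residual forward and decays with the horizon. The AR(1) form of the correction and all names (`blup_forecast`, `last_residual`) are assumptions made for this example; the predictor actually used is the one defined in Section 6.5.

```python
import numpy as np

def blup_forecast(X_future, beta_hat, last_residual, rho):
    """Forecast h = 1..H steps ahead as x'b + rho**h * e_T (AR(1) errors assumed)."""
    X_future = np.asarray(X_future, dtype=float)       # shape (H, p)
    beta_hat = np.asarray(beta_hat, dtype=float)       # shape (p,)
    horizons = np.arange(1, X_future.shape[0] + 1)     # 1, 2, ..., H
    correction = rho ** horizons * last_residual       # decays geometrically in h
    return X_future @ beta_hat + correction

# Hypothetical numbers: two forecast periods, two regressors.
forecasts = blup_forecast([[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0],
                          last_residual=1.0, rho=0.5)
print(forecasts)   # regression parts 2 and 3, corrections 0.5 and 0.25
```

Because the correction depends on $\rho$, an underestimated autocorrelation coefficient shrinks it, which is one way a biased first-step estimate can feed into the prediction error.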
Fortunately, with Durbin's method combined with ridge regression, we are able to determine a model that yields a smaller MSE of prediction and performs well on various criteria and tests in the joint presence of multicollinearity and autocorrelation. We also observed that the value of $k$ that yields the best estimate of the residual sum of squares will not usually give the minimum MSE of prediction. However, the value of $k$ giving the best estimates of $\beta$ in the MSE criterion tends to yield a smaller MSE of prediction for each experiment. Hence, we may conclude that the MSE criterion is valid for evaluating parameter estimates.

To avoid confusion, we have not reported the MSE of prediction based on $\hat{\beta}_{GLS}$. However, we found that the MSE of prediction based on $\hat{\beta}_{GLS}$ and the true $\rho_\varepsilon$ is less than that based on the parameter estimates obtained by Durbin's two-step method in conjunction with ridge regression. Though Durbin's two-step method combined with ridge regression gives better estimates of $\beta$, in general it underestimates $\rho_\varepsilon$. Therefore, the BLUP based on $\hat{\beta}_{GLS}$ and the true $\rho_\varepsilon$ gives the minimal MSE of prediction in each of the experiments 2, 3, 5, 6, 8 and 9.

CONCLUSIONS

It has been shown that, in the presence of multicollinearity with sufficiently high degrees of autocorrelation, the OLS estimates of regression coefficients can be highly inaccurate. Improving the estimation procedure is obviously necessary. Combining GLS and ridge regression, we derived a new estimator

$$\hat{\beta}_{GR}(k) = (X^T \Omega^{-1} X + kI)^{-1} X^T \Omega^{-1} Y$$

where $0 < k < 1$ and $\Omega$ is defined in (2'). $\hat{\beta}_{GR}(k)$, though biased, is expected to perform well in the joint presence of multicollinearity and autocorrelation.
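The estimator $\hat{\beta}_{GR}(k)$ can be sketched numerically as follows. This is a minimal illustration, not the thesis's implementation: it assumes first-order autoregressive errors, so that $\Omega$ is taken to be the AR(1) correlation matrix with entries $\rho^{|i-j|}$ (the definition in (2') is not reproduced here, and any scale factor on $\Omega$ would change the effective size of $k$). The function names are hypothetical.

```python
import numpy as np

def ar1_omega(n, rho):
    """AR(1) correlation matrix with entries rho**|i-j| (scale factor omitted)."""
    idx = np.arange(n)
    return float(rho) ** np.abs(idx[:, None] - idx[None, :])

def gls_ridge(X, y, rho, k):
    """Compute (X' Omega^-1 X + k I)^-1 X' Omega^-1 y for an assumed AR(1) Omega."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    omega_inv = np.linalg.inv(ar1_omega(len(y), rho))
    A = X.T @ omega_inv @ X + k * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ omega_inv @ y)

# Hypothetical data: two nearly collinear regressors and known rho.
rng = np.random.default_rng(1)
x1 = rng.normal(size=30)
X = np.column_stack([x1, x1 + 0.1 * rng.normal(size=30)])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=30)
b_gls = gls_ridge(X, y, rho=0.9, k=0.0)   # k = 0 reduces to GLS
b_gr = gls_ridge(X, y, rho=0.9, k=0.1)    # k > 0 shrinks the coefficient vector
```

Setting `rho=0` and `k=0` recovers ordinary least squares, and `rho=0` with `k>0` recovers ordinary ridge regression, which makes the two special cases easy to check.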
However, since $\Omega$ is unknown, parameter estimates based on the biased estimator $\hat{\beta}_{GR}(k)$ cannot be obtained in practice. Therefore, we combined Durbin's two-step method with ordinary ridge regression to approximate those parameters. The effectiveness of our approximation can then best be examined by Monte Carlo simulation.

Our study has confirmed that, for a given degree of multicollinearity, the MSE of the GLS estimates of $\beta$ increases with the degree of autocorrelation. This agrees with conventional wisdom. Unexpectedly, we found that the MSE of the GLS estimates of $\beta$ decreases as the degree of multicollinearity increases, for a sufficiently high degree of autocorrelation. This implies that, in the application of the GLS technique, the symptoms of multicollinearity may be disguised. However, since in practice neither the true error terms nor the autocorrelation coefficient is known, no GLS estimates can possibly be obtained. We were pleased to find that, in the joint presence of multicollinearity and autocorrelation, whatever the degree is, Durbin's two-step method in conjunction with ridge regression ($\rho_\varepsilon$ unknown) yields even better estimates of $\beta$ in the MSE criterion than the GLS technique ($\rho_\varepsilon$ known) does. Though the value of $k$ giving better estimates of $\beta$ tends to yield a smaller MSE of prediction, the GLS still gives the minimal MSE of prediction in all cases.
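The combination of Durbin's two-step method with ordinary ridge regression can be sketched as below. This is an illustrative outline under stated assumptions, not the procedure of Section 7.1: the model has no intercept, the regressors are not standardized to correlation form, and the function name `durbin_ridge` and the simulated data are hypothetical.

```python
import numpy as np

def durbin_ridge(X, y, k):
    """Durbin-style two-step estimate of rho, then ridge on quasi-differenced data."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step 1: regress y_t on y_{t-1}, x_t and x_{t-1}; the coefficient on
    # y_{t-1} is taken as the estimate of rho.
    Z = np.column_stack([y[:-1], X[1:], X[:-1]])
    rho_hat = np.linalg.lstsq(Z, y[1:], rcond=None)[0][0]
    # Step 2: quasi-difference the data and apply ordinary ridge regression.
    y_star = y[1:] - rho_hat * y[:-1]
    X_star = X[1:] - rho_hat * X[:-1]
    A = X_star.T @ X_star + k * np.eye(X.shape[1])
    return rho_hat, np.linalg.solve(A, X_star.T @ y_star)

# Simulated illustration with AR(1) errors (all values hypothetical).
rng = np.random.default_rng(0)
T, rho, beta = 2000, 0.7, np.array([1.0, 2.0])
X = rng.normal(size=(T, 2))
e = np.zeros(T)
for t in range(1, T):
    e[t] = rho * e[t - 1] + rng.normal()
y = X @ beta + e
rho_hat, beta_hat = durbin_ridge(X, y, k=0.05)
```

The long simulated series is used only so that the sketch behaves predictably; in small samples such as those of the experiments, the first-step estimate of $\rho_\varepsilon$ is biased downward, as Table 9 shows.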
Besides, our experimental results have shown that the Durbin-Watson test for detecting the existence of first-order autocorrelation remains powerful in the presence of multicollinearity, while the Haitovsky heuristic statistic gives relatively limited information about the existence of multicollinearity, either with or without the presence of autocorrelated error terms.

Our results also suggest that it might be possible to find an "optimal estimation package" that deals with the joint problem of multicollinearity and autocorrelation. Empirical research has hitherto been confined to the search for optimal estimation techniques in dealing with multicollinearity and autocorrelated errors as separate and independent phenomena. Ordinary ridge regression, i.e., adding a constant $k$ to the diagonal of the correlation matrix $(X^T X)$, and Durbin's two-step method have been shown to be very powerful techniques in handling multicollinearity and autocorrelation problems. Even though satisfactory estimation and prediction are obtained by the combination of Durbin's method and ordinary ridge regression, there may still exist other, even more efficient approaches to the joint problem of multicollinearity and autocorrelation. For instance, the combination of the Cochrane-Orcutt procedure with generalized ridge regression is a more flexible estimation technique and thereby should lead to better estimation and prediction. Allowing for higher order and mixed order autocorrelation will be a good direction to pursue as well.

BIBLIOGRAPHY

[1] Cochrane, D. and Orcutt, G. H. (1949). Application of least-squares regressions to relationships containing autocorrelated error terms. J. Am. Statist. Assoc., 44, 32-61.
[2] Durbin, J. (1960). Estimation of parameters in time-series regression models. J. Royal Statist. Soc., Series B, 139-153.
[3] Durbin, J. (1970). Testing for serial correlation in least-squares regression when some of the regressors are lagged dependent variables. Econometrica, 38, 410-421.
[4] Durbin, J. and Watson, G. S. (1950). Testing for serial correlation in least-squares regression (part 1). Biometrika, 37, 409-428.
[5] Durbin, J. and Watson, G. S. (1951). Testing for serial correlation in least-squares regression (part 2). Biometrika, 38, 159-178.
[6] Farrar, D. E. and Glauber, R. R. (1967). Multicollinearity in regression analysis: the problem revisited. Rev. Economics Statistics, 49, 92-107.
[7] Graybill, F. A. (1976). Theory and Application of the Linear Model. Duxbury Press, North Scituate, Mass.
[8] Griliches, Z. and Rao, P. (1969). Small-sample properties of several two-stage regression methods in the context of autocorrelated errors. JASA, 64, 253-272.
[9] Haitovsky, Y. (1969). Multicollinearity in regression analysis: comment. Rev. Economics Statistics, 486-489.
[10] Henshaw, R. C., Jr. (1966). Testing single-equation least squares regression models for autocorrelated disturbances. Econometrica, 34, 646-660.
[11] Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12, 55-67.
[12] Hoerl, A. E., Kennard, R. W., and Baldwin, K. F. (1975). Ridge regression: some simulations. Comm. Stat., 4, 105-123.
[13] Johnston, J. (1972). Econometric Methods. 2nd edn., McGraw-Hill.
[14] Klein, L. R. (1962). An Introduction to Econometrics. Prentice-Hall.
[15] Liu, T. C. (1960). Underidentification, structural estimation and forecasting. Econometrica, 28, 856.
[16] Marriott, F. H. C. and Pope, J. A. (1954). Bias in the estimation of autocorrelation. Biometrika, 41, 390-402.
[17] Nerlove, M. and Wallis, K. F. (1966). Use of the Durbin-Watson statistic in inappropriate situations. Econometrica, 34, 235-238.
[18] Newhouse, J. P. and Oman, S. D. (1971). An evaluation of ridge estimators. Rand report No. R-716-PR.
[19] Smith, V. K. (1973). Monte Carlo Methods. Lexington, Mass.
[20] Thisted, R. A. (1976). Ridge regression, minimax estimation and empirical Bayes method. Ph.D. thesis, Tech. Report 28, Biostatistics Dept., Stanford University.
[21] Thisted, R. A. (1978). Multicollinearity, information and ridge regression. Statistics Dept., University of Chicago.
[22] von Neumann, J. (1941). Distribution of the ratio of the mean square successive difference to the variance. Ann. Math. Stat., 12, 367-395.
[23] White, J. S. (1961). Asymptotic expansions for the mean and variance of the serial correlation coefficient. Biometrika, 48, 85-94.
Item Metadata
Title | Multicollinearity, autocorrelation, and ridge regression |
Creator |
Hsu, Jackie Jen-Chy |
Publisher | University of British Columbia |
Date Issued | 1979 |
Description | The presence of multicollinearity can induce large variances in the ordinary least-squares estimates of regression coefficients. It has been shown that ridge regression can reduce this adverse effect on estimation. The presence of serially correlated error terms can also cause serious estimation problems. Various two-stage methods have been proposed to obtain good estimates of the regression coefficients in this case. Although the multicollinearity and autocorrelation problems have long been recognized in regression analysis, they are usually dealt with separately. This thesis explores the joint effects of these two conditions on the mean square error properties of the ordinary ridge estimator as well as the ordinary least-squares estimator. We show that ridge regression is doubly advantageous when multicollinearity is accompanied by autocorrelation in both the errors and the principal components. We then derive a new ridge type estimator that is adjusted for autocorrelation. Finally, using simulation experiments with different degrees of multicollinearity and autocorrelation, we compare the mean square error properties of various estimators. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-03-13 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0094775 |
URI | http://hdl.handle.net/2429/21865 |
Degree |
Master of Science in Business - MScB |
Program |
Business Administration |
Affiliation |
Business, Sauder School of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |