UBC Theses and Dissertations

Examining how missing data affect approximate fit indices in structural equation modelling under different… Zhang, Xijuan 2020

Examining How Missing Data Affect Approximate Fit Indices in Structural Equation Modelling Under Different Estimation Methods

by

Xijuan Zhang

B.A., University of British Columbia, 2012
M.A., University of British Columbia, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Psychology)

The University of British Columbia
(Vancouver)

December 2020

© Xijuan Zhang, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Examining How Missing Data Affect Approximate Fit Indices in Structural Equation Modelling Under Different Estimation Methods

submitted by Xijuan Zhang in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Psychology.

Examining Committee:

Victoria Savalei, Professor, Psychology, UBC
Supervisor

Jeremy Biesanz, Associate Professor, Psychology, UBC
Supervisory Committee Member

Harry Joe, Professor, Statistics, UBC
Supervisory Committee Member

Lang Wu, Professor, Statistics, UBC
University Examiner

Brian O’Connor, Professor, Psychology, UBC
University Examiner

Abstract

Full-information maximum likelihood (FIML) is a popular estimation method for missing data in structural equation modeling (SEM). However, it is not commonly known that approximate fit indices (AFIs) can be distorted, relative to their complete data counterparts, when FIML is used to handle missing data. In the first part of the dissertation work, we show that the two most popular AFIs, the root mean square error of approximation (RMSEA) and the comparative fit index (CFI), often approach different population values under FIML estimation when missing data are present. By deriving the FIML fit function for incomplete data and showing that it is different from the usual maximum likelihood (ML) fit function for complete data, we provide a mathematical explanation for this phenomenon.
We also present several analytic examples as well as the results of two large sample simulation studies to illustrate how AFIs change with missing data under FIML. In the second part of the dissertation work, we propose and examine an alternative approach for computing AFIs following FIML estimation, which we refer to as the FIML-Corrected or FIML-C approach. We also examine another existing estimation method, the two-stage (TS) approach, for computing AFIs in the presence of missing data. For both the FIML-C and TS approaches, we also propose a series of small sample corrections to improve the estimates of AFIs. In two simulation studies, we find that the FIML-C and TS approaches, when implemented with small sample corrections, can estimate the complete data population AFIs with little bias across a variety of conditions, although the FIML-C approach can fail in a small number of conditions with a high percentage of missing data and a high degree of model misspecification. In contrast, the FIML AFIs as currently computed often perform poorly. We recommend the FIML-C and TS approaches for computing AFIs in SEM.

Lay Summary

In a survey study, participants often leave some questions blank due to carelessness or unwillingness to answer certain questions. This creates missing data, which can distort the results of statistical analyses. Modern missing data techniques, such as the full-information maximum likelihood (FIML) method, are designed to address this problem of missing data. However, the FIML method corrects the distorted results in some circumstances but not in all.

In this dissertation, we focus on structural equation modelling (SEM), which is an advanced statistical analysis commonly used in social sciences. We explain that in SEM, the FIML method may distort the results regarding the degree of model fit as measured by the approximate fit indices (AFIs). We propose two alternative methods called the FIML-Corrected (FIML-C) and the two-stage (TS) methods.
Through computer simulations, we show that these two new methods can correctly compute the AFIs; therefore, we recommend these two new methods.

Preface

I am the primary author of this PhD dissertation. I was the primary individual responsible for conducting the simulation studies involved in this research. The theoretical work in this research is joint work with Dr. Victoria Savalei. A version of Chapters 2 to 4 has been published in the journal Structural Equation Modeling with the following bibliographic details:

Zhang, X., & Savalei, V. (2019). Examining the effect of missing data on RMSEA and CFI under the normal theory full information maximum likelihood. Structural Equation Modeling, 27, 219-239.

A version of Chapters 5 and 6 has been submitted for publication.

Contents

Abstract
Lay Summary
Preface
Contents
List of Tables
List of Figures
List of Supplementary Materials
List of Abbreviations
Acknowledgments
Dedication
1 Introduction
    1.1 Missing Data Mechanisms
    1.2 Missing Data Patterns
    1.3 Missing Data Techniques
    1.4 SEM and SEM AFIs
    1.5 Past Research on the Effect of SEM AFIs under FIML Estimation
    1.6 Dissertation Organization
2 SEM AFIs under FIML Estimation: Technical Details
    2.1 Fit Function Minimum for Complete Data
    2.2 Fit Function Minimum for Incomplete Data
        2.2.1 Sample Fit Function Minimum for Incomplete Data
        2.2.2 Population Fit Function Minimum for Incomplete Data
    2.3 RMSEA and CFI for Complete and Incomplete Data
3 SEM AFIs under FIML Estimation: Analytical Examples
    3.1 Change in RMSEA due to Differences in the Equations of the Fit Function Minimum
        3.1.1 Case 1: Complete data
        3.1.2 Case 2: MCAR data; misspecification does not involve variables with missing values
        3.1.3 Case 3: MCAR data; misspecification involves variables with missing data
        3.1.4 Case 4: MAR data; misspecification involves variables with missing values
    3.2 Change in RMSEA due to Differences in Parameter Values
        3.2.1 Case 1: Pseudo-parameter values stay the same with missing data
        3.2.2 Case 2: Pseudo-parameter values change with missing data
4 SEM AFIs under FIML Estimation: Simulation Studies
    4.1 Design
    4.2 Results
        4.2.1 Study 1
        4.2.2 Study 2
    4.3 Discussion
5 Alternative Approaches for Computing AFIs
    5.1 Alternative AFIs following FIML Estimation
        5.1.1 Population Limits for FIML-C AFIs
        5.1.2 Analytical Example for FIML-C Estimation
    5.2 Alternative AFIs following TS Estimation
        5.2.1 Population Values for TS AFIs
        5.2.2 Analytical Example for TS Estimation
        5.2.3 Derivation of Small Sample Correction in FIML-C
        5.2.4 Derivation of Small Sample Correction in TS
6 SEM AFIs under FIML-C and TS Estimations: Simulation Studies
    6.1 Design
    6.2 Results
        6.2.1 Population Behavior
        6.2.2 Finite Sample Behavior
    6.3 Discussion
7 Conclusion and Overall Discussion
    7.1 Limitations and Future Directions
Bibliography

List of Tables

Table 2.1 Notation for mean vectors and covariance matrices with incomplete data under an incorrect hypothesized model.
Table 4.1 Conditions in the Simulation Studies
Table 4.2 Complete data RMSEA and CFI for all conditions in Studies 1 and 2
Table 4.3 Variables in the Regression Analyses
Table 4.4 Results of the Regression Analyses
Table 5.1 Equations for k and kB for FIML-C versions
Table 5.2 Equations for c and cB for TS versions
Table 6.1 “Pseudo-Parameters” for Complete and Incomplete Data under FIML
Table 6.2 Additional Variables in the Regression Analyses
Table 6.3 Results of the Regression Analyses for Bias in the Population
Table 6.4 Results of the Regression Analyses for Bias in the Finite Samples

List of Figures

Figure 1.1 An example of an SEM model. In an SEM diagram, circles represent latent variables, rectangles represent observed variables, and the arrows represent relationships between variables. The relationships between the observed variables and the latent factor form the measurement model part of the SEM model (see the blue box for an example); the relationships between different latent variables form the structural part of the SEM model (see the red box for an example).

Figure 1.2 The population model for the simulation example in Section 1.5.

Figure 4.1 Differences between DF and SF conditions.

Figure 4.2 RMSEA and CFI for Study 1 conditions varying in the locations of misfit and number of variables with missing data. For the conditions in this figure, the missing mechanism is MCAR, the population factor correlation is 0, and the number of correlated residuals is 2.

Figure 4.3 Fit function minima of the hypothesized and baseline models for selected conditions in Study 1. For the conditions in this figure, the missing mechanism is MCAR, the population factor correlation is zero, and the number of correlated residuals is two.

Figure 4.4 RMSEA and CFI for selected conditions in Study 1. For the conditions in this figure, the missing mechanism is MCAR. There is a single correlated residual in the population model, and the number of variables with missing data is four.

Figure 4.5 RMSEA and CFI for selected conditions in Study 1. For the conditions in this figure, the population factor correlation is 0.4, the number of correlated residuals is two, and the number of variables with missing data is four.

Figure 4.6 RMSEA and CFI for selected conditions in Study 2. For the conditions shown in this figure, the number of variables with missing data is six.

Figure 6.1 Population RMSEA and CFI (estimated from n = 1000000) for selected conditions in Study 1 comparing FIML, FIML-C and TS approaches. Complete data population RMSEA and CFI are also included for comparison. In these selected conditions, the number of variables with missing data is four, the number of correlated residuals is two, and the population factor correlation is zero. The population model is a two-factor model with varying sizes for the correlated residuals shown on the x-axis. The hypothesized model is a two-factor model without any correlated residuals.

Figure 6.2 Population RMSEA and CFI (estimated from n = 1000000) for selected conditions in Study 2 comparing FIML, FIML-C and TS approaches. Complete data population RMSEA and CFI are also included for comparison. In these selected conditions, there are six variables that have missing data. The population model is a two-factor model with varying sizes for the factor correlation shown on the x-axis. The hypothesized model is a one-factor model.

Figure 6.3 Bias in the sample RMSEA and CFI estimates for selected conditions in Study 1 comparing FIML, FIML-C and TS approaches. In these conditions, the number of variables with missing data is four, the number of correlated residuals is two, the population factor correlation is zero, the percentage of missing data is 50%, and the location of misfit is the same as the location of missing data. The population model is a two-factor model with varying sizes for the correlated residuals shown on the x-axis. The hypothesized model is a two-factor model without any correlated residuals.

Figure 6.4 Root mean square error (RMSE) in the sample RMSEA and CFI estimates for selected conditions in Study 2 comparing FIML, FIML-C and TS approaches. In these conditions, the number of variables with missing data is six, the percentage of missing data is 50%, and the number of patterns is large. The population model is a two-factor model with varying sizes for the factor correlation shown on the x-axis. The hypothesized model is a one-factor model.

Figure 6.5 Bias in the sample RMSEA and CFI estimates for selected conditions in Study 1 comparing the best performing FIML-C and TS methods. In these conditions, the number of variables with missing data is four, the number of correlated residuals is two, the population factor correlation is zero, and the missing data mechanism is weak MAR. The population model is a two-factor model with varying sizes for the correlated residuals shown on the x-axis. The hypothesized model is a two-factor model without any correlated residuals.

Figure 6.6 Bias in the sample RMSEA and CFI estimates for selected conditions in Study 2 comparing the best performing FIML-C and TS methods. In these conditions, the number of variables with missing data is six and the missing mechanism is strong MAR. The population model is a two-factor model with varying sizes for the factor correlation shown on the x-axis. The hypothesized model is a one-factor model.

Figure 6.7 Root mean square error (RMSE) in sample RMSEA and CFI for selected conditions in Study 1 comparing the best performing FIML-C and TS methods. In these conditions, the number of variables with missing data is four, the number of correlated residuals is two, the population factor correlation is zero, and the missing data mechanism is weak MAR. The population model is a two-factor model with varying sizes for the correlated residuals shown on the x-axis. The hypothesized model is a two-factor model without any correlated residuals.

Figure 6.8 Root mean square error (RMSE) in sample RMSEA and CFI for selected conditions in Study 2 comparing the best performing FIML-C and TS methods. In these conditions, the number of variables with missing data is six and the missing mechanism is strong MAR. The population model is a two-factor model with varying sizes for the factor correlation shown on the x-axis. The hypothesized model is a one-factor model.

Figure 7.1 Population RMSEA and CFI (estimated from n = 1000000) for selected conditions in Study 1 comparing FIML, FIML-C, TS and MI approaches. In these selected conditions, the number of variables with missing data is four, the number of correlated residuals is two, and the population factor correlation is zero. The population model is a two-factor model with varying sizes for the correlated residuals shown on the x-axis. The hypothesized model is a two-factor model without any correlated residuals.

Figure 7.2 Bias in the sample RMSEA and CFI estimates for selected conditions in Study 1 comparing FIML, FIML-C, TS and MI methods. In these conditions, the number of variables with missing data is four, the number of correlated residuals is two, the population factor correlation is zero, and the missing data mechanism is weak MAR.
The population model is a two-factor model with varying sizes for the correlated residuals shown on the x-axis. The hypothesized model is a two-factor model without any correlated residuals.

List of Supplementary Materials

• Tables for the results of simulation studies.
• Sample code for generating missing data for the simulation studies.
• Sample code for computing FIML-C RMSEA and CFI.
• Sample code for computing TS RMSEA and CFI.

For online access to the Supplementary Materials, please visit the following webpage:
https://osf.io/rtp38/?view_only=15d7262e78ca4f018f4deb8b47307e5a

List of Abbreviations

• AFI: Approximate Fit Indices
• CFI: Comparative Fit Index
• EM: Expectation-Maximization
• FIML: Full Information Maximum Likelihood
• FIML-C: FIML-Corrected
• LR: Likelihood Ratio
• MAR: Missing At Random
• MCAR: Missing Completely At Random
• MCMC: Markov Chain Monte Carlo
• MI: Multiple Imputation
• MNAR: Missing Not At Random
• RMSEA: Root Mean Square Error of Approximation
• RMSE: Root Mean Square Error
• SEM: Structural Equation Modelling
• TS: Two-Stage

Acknowledgments

First and foremost, I thank my supervisor, Dr. Victoria Savalei. She has helped me in every aspect of my academic career. This dissertation work would not have been remotely possible without her support and dedication. Her advice, guidance and commitment have been invaluable to me. I am also very grateful for the financial support she has granted me throughout my graduate studies at UBC, which has allowed me to pursue my dream in academia while living a comfortable life. Beyond vocational or academic matters, I thank Dr. Savalei for her understanding and support during some of the difficult times in my personal life. All in all, I consider myself extremely lucky to have Dr. Savalei as my graduate supervisor.

Secondly, I would also like to thank Dr. Jeremy Biesanz, Dr. Harry Joe and Dr. Ke-Hai Yuan. Specifically, I thank Dr.
Jeremy Biesanz for introducing me to quantitative psychology in PSYC 359, for helping me with various issues I encountered during my graduate school years, and for his time and effort in serving on my master's and PhD committees. I thank Dr. Harry Joe for teaching me STAT 306, for being on my PhD committee, and for writing me a research note after my dissertation proposal defense. Dr. Joe’s research note and course package for STAT 306 helped me understand many key concepts in statistics. I thank Dr. Ke-Hai Yuan for giving me the opportunity to visit him at the University of Notre Dame; Dr. Yuan’s advice on my dissertation work helped me design the simulation studies in my research.

I also express my gratitude towards Dr. Oscar Olvera, who is both my friend and my mentor. He has helped me both personally and professionally. Based on his suggestions, I took a course on mathematical proofs and another course on real analysis, both of which greatly improved my mathematical proficiency. His humor and support made some of the difficult times in my life more bearable.

Finally, I would like to thank my course professors, Dr. Lang Wu (STAT 300), Dr. William Welch (STAT 305), Dr. Matias Salibian-Barrera (STAT 406), Dr. Gordon Slade (MATH 320), Dr. Brett Kolesnik (MATH 220) and Dr. Anna Levit (MATH 302), for teaching me statistics and mathematics. The courses I took with these professors gave me a lot of insight and inspiration for my own research.

This research was supported by the Social Sciences and Humanities Research Council’s Doctoral Fellowship to Xijuan Zhang, the University of British Columbia’s Four Year Doctoral Fellowship to Xijuan Zhang, and the Natural Sciences and Engineering Research Council’s research grant to Dr. Victoria Savalei.

Dedication

I dedicate my dissertation work to the most important teachers I have met in my life: Mr. Bill Morphett (high school science teacher), Dr. Victoria Savalei (graduate school supervisor), Dr.
Corey Hamm (piano teacher), and Dr. Ross Salvosa (piano teacher).

Mr. Bill Morphett’s encouragement has given me the courage to pursue my dream. His kindness and compassion as a teacher have made me also want to become a teacher.

Dr. Victoria Savalei is the most influential female role model in my life. Her intelligence, independence, kindness and dedication to research have inspired me to become a female figure like her.

Dr. Corey Hamm and Dr. Ross Salvosa have not only taught me how to play the piano but, more importantly, how to understand and appreciate music. Music holds a very special place in my heart and so do Dr. Hamm and Dr. Salvosa.

Chapter 1

Introduction

Missing data are a real bane to researchers across all social science disciplines. For most of our scientific history, we have approached missing data much like a doctor from the ancient world who might use bloodletting to cure disease or amputation to stem infection (e.g., removing the infected parts of one’s data by using list-wise or pair-wise deletion).

Craig K. Enders

Missing data, also known as incomplete data, are prevalent in psychological and educational research, particularly when repeated measures or longitudinal studies are involved. Osborne [29] reported that around 40% of papers in APA journals in the year 2009 described dealing with missing data. Jelicic et al. [18] examined the prevalence of missing data in longitudinal studies over six years in three developmental psychology journals, and found that about half of the studies had missing data.

Historically, statistical analysis methods were developed assuming that data are complete, and proper methodology for dealing with incomplete data was hard to implement due to intensive computations. However, with the advance of computing technology, beginning in the late 1980s, more and more researchers began to study the problem of missing
According to Google Scholar, the number of articles with titles including the wordsmissing data or incomplete data were 1024 in years 1990 to 1999, grew to 3505 in theyears 2000–2009, and is 5400 in the past 8 years.In addition, with the improvement in computing technology, more advanced modelingmethods are made available for researchers to use. Structural equation modeling (SEM)is one of these advanced methods; it allows researchers to test complex theories involvingmultiple observed and latent variables. With increasing number of SEM software pack-ages, SEM has become very popular in psychology and other social sciences.In this dissertation, we aim to expand the research on both missing data and SEM byexamining how SEM approximate fit indices (AFIs) are affected by missing data under dif-ferent missing data techniques. More specifically, the research demonstrates that popularSEM AFIs can be distorted when being estimated using one of the most popular missingdata techniques, the full information maximum likelihood (FIML), and such distortion canbe corrected through alternative missing data techniques. In this introductory chapter, wefirst introduce the key topics related to the current research, including missing data mech-anisms, missing data patterns, missing data techniques, SEM, and SEM AFIs. Then wereview the past research on how missing data affect SEM AFIs. We end the chapter withan outline for the rest of the dissertation paper.1.1 Missing Data MechanismsThe most common classification of missing data is by missing data mechanism, whichis first proposed by Rubin [31]. Missing data mechanism can be thought as a kind ofmissing data generation rule that describes the statistical relationship between variablesand the probability of missing data at the population level [28, 31]. There are generallythree types of missing data [32]: 1) missing completely at random (MCAR), 2) missing atrandom (MAR), and 3) missing not at random (MNAR). 
In this section, we review these2three types of missing data mechanisms in both informal and formal terms.Let us consider a dataset with n subjects and p variables denoted as X1, . . . ,Xp. Whenwe do not have missing data, our dataset should look like a matrix with n rows and pcolumns. When we have missing data, we can consider the missing data as unobservedvalues that create holes in the data matrix. Suppose only X1 has missing values. If X1 isMCAR, then the probability of a subject having a missing value of X1 does not dependon its unobserved value in X1 nor its observed values of other variables. This means thatknowing the subject’s values on any of the variables does not give you any informationabout its probability of being missing. An example of MCAR data is when the paper-formquestionnaire data are missing because a house cat spilled coffee on the table. In this case,there are no observed nor missing data that can predict the probability of being missing. IfX1 is MAR, then the probability of a subject being missing depends on its observed valuesof other variables but does not depend on its value of X1. In other words, MAR means“conditionally missing at random”: conditional on the observed values of other variables,the probability of being missing does not depend on the value of X1. An example of MARdata is when shy participants tend to have more missing values on the questionnaire itemsabout their sexual orientation. In this case, we can use the items that measure the shynessof participants to predict the probability of missing data about sexual orientation. If X1is neither MCAR nor MAR, then X1 is MNAR, where the probability of a subject havinga missing value on X1 depends on its value of X1. A classical example of MNAR data iswhen participants with high income avoid answering questions about income. 
In this case,the probability of missing the income data is related to participants’ own income.To define the types of missing data mechanisms formally, let X = (X1, . . . ,Xp)T be arandom vector representing the p variables in the dataset and x = (x1, . . . ,xp)T representthe realizations of X . Same as above, suppose X1 is the only random variable with missingdata. Let M be a random indicator variable with M = 1 representing a missing value in X1;3we call M the missing data indicator. MCAR occurs when the distribution of M does notdepend on x:P(M = 1|x) = P(M = 1)P(M = 0|x) = 1−P(M = 1|x) = 1−P(M = 1) = P(M = 0).To define MAR and MNAR, we have to break down x into the observed (xobs) and theunobserved or missing (xmis) parts of x; that is x = (xmis,xobs)T . In this case, since x1 isthe only variable with missing data, xmis = x1 and xobs = (x2, . . . ,xp)T . MAR occurs whenthe distribution of M depends on xobs but not xmis:P(M = 1|(xmis,xobs)T ) = P(M = 1|xobs)P(M = 0|(xmis,xobs)T ) = P(M = 0|xobs).Notice that MAR data becomes MCAR data when M’s dependence on xobs is zero. Lastly,MNAR occurs when the distribution ofM depends on xmis; that is when P(M= 1|(xmis,xobs)T )and P(M = 0|(xmis,xobs)T ) cannot be simplified further.An important concept related to the types of missing data mechanisms is ignorability.Ignorable data are the types of missing data that can be handled with the likelihood-basedanalysis such as the full information maximum likelihood (FIML) estimation method. Inother words, with ignorable missing data, we can obtain consistent parameter estimateswithout explicitly modelling the underlying missing data mechanism. Ignorable missingdata needs to satisfy the following two conditions: 1) the data are either MCAR or MAR;2) parameters associated with the specific missing data rule are distinct from the param-eters associated with the distribution of the variables in the dataset [31]. 
In the above example, the second condition means that the parameters associated with the distribution of $M$ are distinct from the parameters associated with the distribution of $X$. To explain why these conditions are needed, let $\theta$ and $\phi$ be the parameters associated with $X$ and $M$, respectively, and let $f(x, m; \theta, \phi)$ denote the joint density of $X$ and $M$. Because $\theta$ and $\phi$ are distinct, when the data are incomplete, the observed data likelihood can be obtained by integrating the joint density over $x_{mis}$ as follows:

\[ f(x_{obs}, m; \theta, \phi) = \int f(x_{obs}, x_{mis}; \theta)\, f(m \mid x_{obs}, x_{mis}; \phi)\, dx_{mis}. \tag{1.1} \]

When the data are MCAR, $f(m \mid x_{obs}, x_{mis}; \phi) = f(m; \phi)$; when the data are MAR, $f(m \mid x_{obs}, x_{mis}; \phi) = f(m \mid x_{obs}; \phi)$. Since neither $f(m; \phi)$ nor $f(m \mid x_{obs}; \phi)$ involves $x_{mis}$, we can take $f(m; \phi)$ or $f(m \mid x_{obs}; \phi)$ out of the integral, so that the observed data likelihood factors into a term that involves only $\phi$ and a term that involves only $\theta$. In other words, for MCAR or MAR data, it is sufficient to maximize $\int f(x_{obs}, x_{mis}; \theta)\, dx_{mis}$ with respect to $\theta$ if we only want to estimate $\theta$. There are MAR data that violate the second condition for ignorable missing data (i.e., $\theta$ and $\phi$ are not distinct); in such cases, statistical methods assuming ignorability are not optimal but are generally still good for obtaining consistent estimates. Therefore, in practice, ignorable missing data imply MCAR or MAR data, and non-ignorable missing data imply MNAR data.

In addition, it is worth noting that the fact that ignorable missing data can be handled by the FIML method (i.e., producing consistent estimates) rests on the assumption that the model is correctly specified. In the case of SEM, if the hypothesized model is the same as the population model, then the FIML method is able to produce consistent model parameter estimates and model fit.
On the other hand, if the hypothesized model is misspecified, the FIML method estimates the model parameters so that the “distance” between the hypothesized probability distribution and the true probability distribution is as small as possible.1 In this dissertation, we call the parameters obtained under a misspecified model the “pseudo-parameters”. As we will show later, even with ignorable missing data, the FIML method does not produce consistent estimates for the “pseudo-parameters” of complete data.

1 The “distance” between two probability distributions is known as the Kullback-Leibler divergence [41].

Due to the important property of ignorability for MCAR and MAR data, the missing data mechanism is by far the most important feature of missing data studied in simulation studies involving missing data. Most SEM research on missing data focuses on studying missing data techniques that can handle ignorable missing data; in other words, researchers mainly focus on MCAR and MAR missing data in their simulation studies [e.g., 36, 37, 45]. Our research is no exception. It mainly focuses on missing data approaches that can be used to handle ignorable missing data; therefore, as you will see in Chapters 4 and 6, examining different types of MCAR and MAR missing data is one of the main focuses of our missing data simulation studies.

1.2 Missing Data Patterns

Missing data pattern is another way to categorize missing data. A missing data pattern refers to the arrangement of observed and missing values in a dataset [15]. It is often confused with the missing data mechanism [e.g., 16].
The distinction is that a specific missing data mechanism is a missing data generation rule that describes the relationship between variables and the probability of missingness, whereas a specific missing data pattern is a data configuration that describes the location of the missing values in the data.

Although missing data pattern and missing data mechanism are distinct concepts, they do affect each other. Given a specific missing data generation rule with a certain type of missing data mechanism, the number and the types of missing data patterns are determined. For example, suppose a dataset has variables $X_1, \ldots, X_p$. If the missing data rule is that each subject has a 20% probability of being missing on the variable $X_1$, then the missing data pattern is univariate, implying two missing data patterns: one in which all variables are observed and one in which only $X_1$ is missing.

When it comes to studying missing data techniques such as FIML, missing data patterns are often considered less important than missing data mechanisms, probably because missing data patterns are not directly related to the ignorability property of missing data. Nonetheless, missing data patterns have several important implications for missing data techniques. First, when missing data patterns have variables that are never observed together, some parameters, such as those measuring the correlations between these variables, may not be estimable from the observed data (see Example 1.7 in [23]). Second, the number of missing data patterns may affect the performance of missing data techniques [35, 36]. For example, Savalei and Bentler [35] found that the number of missing data patterns may affect the efficiency of an estimation method, and this effect can be as strong as that of the missing data mechanism. Based on these previous findings, we have varied both the types of missing data mechanisms and the number of missing data patterns when designing our simulation studies.
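To make the notion of a missing data pattern concrete, here is a minimal Python sketch that enumerates the patterns in a small dataset and their relative frequencies. The toy data and the function name `missing_patterns` are hypothetical and serve only as an illustration; missing values are coded as `None`, and only the first variable is ever missing, mirroring the univariate example above.

```python
from collections import Counter

def missing_patterns(rows):
    """Return {pattern: proportion of cases}, where a pattern is a tuple of
    booleans (True = observed, False = missing), one entry per variable."""
    counts = Counter(tuple(value is not None for value in row) for row in rows)
    n = len(rows)
    return {pattern: count / n for pattern, count in counts.items()}

# Hypothetical toy data: 5 cases, 3 variables; only X1 has missing values.
data = [
    [1.2, 0.4, 3.1],
    [None, 0.9, 2.8],
    [0.7, 1.1, 3.0],
    [None, 0.2, 2.5],
    [1.5, 0.8, 3.3],
]

patterns = missing_patterns(data)
for pattern, proportion in sorted(patterns.items(), reverse=True):
    print(pattern, proportion)
```

With missingness confined to one variable, only two patterns appear, however many cases are affected. The pattern proportions computed here play the role of the weights that appear when a log-likelihood is written as a sum over patterns.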
Indeed, as we will explain in detail, the missing data pattern can interact with the missing data mechanism in their effects on the estimation of SEM AFIs.

Finally, missing data patterns are also important because the log-likelihood function for missing data can be written as a sum over the missing data patterns in the dataset, weighted by the proportion of each pattern. For example, the log-likelihood corresponding to Equation 1.1 can be written as

$$\log L(\theta; x_{obs}) = \sum_{j=1}^{J} \hat{q}_j \log L(\theta; x_{obs,j}), \tag{1.2}$$

where $J$ is the number of missing data patterns in the population, and $\hat{q}_j$ is the proportion of cases that fall in pattern $j$, where $j = 1, \ldots, J$. As we will explain later, we can also write the SEM FIML fit function as a sum over the missing data patterns; doing so allows us to see how missing data affect the estimation of AFIs under the FIML estimation.

1.3 Missing Data Techniques

In the past, when computing power was limited, the most common techniques for handling missing data included listwise deletion, pairwise deletion and mean substitution. The main goal of these techniques was to get rid of the missing data so that some data analysis could be done. This is in contrast with the modern missing data techniques, whose main goal is to deal with the missing data effectively so that the data analysis yields unbiased, consistent and efficient estimates of the population parameters. Most modern missing data techniques are designed to handle ignorable missing data (i.e., MCAR and MAR data). MNAR data are almost impossible to handle unless the researchers can effectively model the underlying missing data generation rule [1].

In SEM, the most common modern approach to handling ignorable missing data is the normal theory FIML estimation [1, 2, 43]. The FIML method is available in almost all SEM software, and it is usually the go-to estimation method for missing data.
As explained earlier, the FIML approach involves maximizing the observed data likelihood in order to obtain the parameter estimates. Because the likelihood function for ignorable data does not involve the parameters associated with the missing data mechanism, FIML is able to produce consistent parameter estimates and standard errors under a correctly specified model.

The other modern missing data techniques used in SEM are the multiple imputation (MI) and two-stage (TS) approaches. These two approaches are well-researched but less commonly used in the SEM literature. The MI approach consists of three steps: 1) imputation, 2) analysis, and 3) pooling [31]. In the imputation step, multiple copies of the dataset are created, each of which contains different estimates of the missing values (i.e., each dataset has the missing data filled in). This step involves an iterative process based on the Markov chain Monte Carlo (MCMC) algorithm, which has been implemented in common SEM software such as Mplus and the lavaan package in R [1]. In the analysis step, the hypothesized model is fit to each filled-in dataset as if there were no missing data, and then the statistic of interest (e.g., parameter estimates) is computed for each dataset. In the pooling step, the results across the imputed datasets are combined into a single result. In a way, the MI approach is similar to the older regression-based imputation technique, where the imputed values are based on the regression model built with cases with no missing data. However, the older regression-based imputation technique underestimates the standard errors because the imputed values always fall right on the regression line/plane; the MI approach solves this problem by incorporating simulated random draws from the population in the imputation step (see [15] for details).

The TS approach involves a two-stage procedure for obtaining parameter estimates [43, 46].
The first stage involves fitting a saturated model, which is an unrestricted model with zero degrees of freedom, to the incomplete data in order to estimate the saturated model’s mean vector and covariance matrix. The saturated model’s mean vector and covariance matrix essentially estimate what the mean vector and covariance matrix would have been if the data had been complete. This stage is analogous to the imputation stage of the MI method; however, instead of imputing missing data to create a “complete” dataset, the first stage of the TS method directly estimates the mean vector and covariance matrix of the “complete” dataset. Then in the second stage, the saturated model’s estimated mean vector and covariance matrix are used to minimize the complete data fit function in order to obtain consistent estimates of the model parameters. Unbiased standard errors for the parameter estimates can be obtained by using a sandwich-type covariance matrix developed based on likelihood theory (see [36, 48] for details).

It is not hard to see that the FIML approach is the “simplest” modern missing data approach in terms of computational complexity, which helps explain its popularity. One unique advantage of the MI and TS approaches over the FIML approach is that they allow the incorporation of auxiliary variables, which are variables that the researchers are not interested in studying but whose inclusion may improve the estimates of model parameters or standard errors [36, 48]. Past research showed that with these auxiliary variables, the TS approach can produce more stable estimates in smaller samples [36, 48].

Most of the previous SEM research comparing the FIML, MI and TS approaches focused on model parameter estimates, standard errors and confidence intervals of the estimates [e.g. 12, 35–37, 48]. Only a small number of studies, which we will review in a later section, have compared these methods in terms of estimating AFIs. Our research aims to address this gap.
In this dissertation, we focus mainly on the FIML and TS approaches; the MI approach will be discussed in the final chapter.

1.4 SEM and SEM AFIs

What is structural equation modelling? This may not be an easy question to answer even for researchers who are familiar with SEM. Indeed, the term “structural equation modeling” or “SEM” describes a diverse set of mathematical models, computer algorithms and statistical methods that involve fitting a network of constructs to data. Historically, SEM comes from three different streams of research: 1) path analysis, 2) measurement models, and 3) general estimation algorithms for statistical models [5].

Despite the varied origins of SEM, one important theme in SEM is the modelling of theoretical constructs that cannot be directly observed in a dataset. With SEM, researchers can represent these underlying theoretical constructs by latent variables, and they can estimate these latent factors via several observed variables that serve as “indicators” of the latent variables. The indicators for a latent variable can be selected based on prior knowledge or based on exploratory factor analyses that can measure the degree to which the indicators “tap into” the latent factor. The main advantage of SEM is its flexibility in incorporating both the relationships between several observed variables and one latent variable (via the measurement model part of SEM) and the relationships between several different latent variables (via the structural model part of SEM) (see Figure 1.1).

Figure 1.1: An example of an SEM model. In an SEM diagram, circles represent latent variables, rectangles represent observed variables, and the arrows represent relationships between variables. The relationships between the observed variables and the latent factor form the measurement model part of the SEM model (see the blue box for an example); the relationships between different latent variables form the structural part of the SEM model (see the red box for an example).

When conducting SEM analysis, researchers can specify a hypothesized model that includes the structural part, the measurement part, or both. Through fitting the hypothesized model to the data, researchers can obtain the estimates of the model parameters as well as the model-implied covariance matrix (a.k.a. model-based covariance matrix), which is the covariance matrix computed based on the estimates of the model parameters.

Another important theme in SEM is the measure of the overall model fit; that is, the measure of the extent to which the relationships between variables as specified in the hypothesized model are representative of the true relationships found in the population. Traditionally, researchers use the chi-square test of fit to make a binary decision on whether the model fits the data sufficiently well. However, “all models are wrong but some are useful”; in a sense, all hypothesized models can be rejected given a large enough sample, thus defeating the purpose of the chi-square test. Therefore, in recent decades, researchers have proposed approximate fit indices (AFIs), which measure the degree to which the hypothesized model fits the data [3, 39]. In other words, an AFI is a continuous metric along which to evaluate the hypothesized model’s appropriateness for the data.

In our research, we will focus on the two most popular AFIs in SEM: the root mean square error of approximation (RMSEA) and the comparative fit index (CFI).
RMSEA measures the amount of misfit in the hypothesized model per degree of freedom. CFI measures the amount of improvement in fit for the hypothesized model relative to the fit of the baseline model (a.k.a. the independence model), which is a null model where all variables are uncorrelated. The RMSEA value is equal to or greater than zero, with a lower value indicating better fit (i.e., zero indicating perfect fit), whereas the CFI value ranges from zero to one, with a higher value indicating better fit (i.e., one indicating perfect fit). Detailed equations for these AFIs will be provided in the later chapters. Ironically, although AFIs are supposed to measure fit on a continuum, cut-off points are still commonly used to help researchers categorize the amount of misfit. For RMSEA, a value less than 0.08 indicates good fit [8]; for CFI, a value greater than 0.9 indicates good fit [17].

Finally, we explain the relationship of RMSEA and CFI to other types of AFIs. As we will show in the later chapters, both RMSEA and CFI are defined in terms of the fit function minimum values. There are other AFIs that are also defined in terms of the fit function minimum (e.g., the Normed Fit Index (NFI) and the Tucker-Lewis Index (TLI) [3]); the patterns of results in this dissertation should also apply to these AFIs. However, for AFIs that are not defined in terms of the fit function minimum (e.g., the standardized root mean square residual (SRMR) [4] and the goodness of fit index (GFI) [25]), our results may not apply. Many of these other AFIs have fallen out of popularity for a variety of reasons. For example, SRMR can be very biased in smaller samples, and NFI does not account for the complexity of the hypothesized model well [3, 17].
Due to the unpopularity of these AFIs, we did not include them in our study.

1.5 Past Research on the Effect of SEM AFIs under FIML Estimation

The first main goal of this dissertation is to point out the potentially problematic performance of AFIs when computed following FIML estimation. It does not appear to be well-known that when AFIs are computed following FIML estimation, the resulting population values are distorted relative to their complete data counterparts. This means that the approximate fit of the same model to data drawn from the same population may be different depending on whether the data are complete or incomplete. To illustrate, we generated a sample of n = 1,000,000 from a population that follows a correlated (a.k.a. oblique) two-factor model with standardized loadings of 0.7 and a factor correlation of 0.5 (see Figure 1.2). We fit a one-factor model to this sample. With complete data, the RMSEA and CFI are 0.203 and 0.747, respectively. However, when we randomly delete 50% of the data for each of the three variables loading on factor 1, the RMSEA and CFI become 0.148 and 0.816, respectively.

Figure 1.2: The population model for the simulation example in section 1.5.

We are aware of only three studies that have examined the behavior of AFIs with incomplete data under the FIML estimation; none of them noted this phenomenon. Davey et al. [10] conducted a simulation study examining the effects of incomplete data on AFIs in sample data. They found that with a misspecified model, sample AFIs following FIML estimation indicated better fit with a higher percentage of missing data, but they did not provide an explanation for this finding. Enders and Mansolf [13] conducted a simulation study comparing AFIs computed under the FIML and MI approaches. They found that both approaches produced similar AFIs. More specifically, under both approaches, sample CFI stayed relatively the same with more missing data but RMSEA decreased slightly with missing data.
One drawback of Enders and Mansolf’s [13] simulation study is that the model they used was only slightly misspecified (RMSEA = 0.041 and CFI = 0.981 for complete data). The AFIs cannot show much improvement in fit with the addition of missing data when the model misspecification is already minor with complete data. Finally, Li and Lomax [22] conducted a simulation study in which they examined the effects of incomplete and nonnormal data on RMSEA. They found that sample RMSEA following FIML estimation had relatively little bias; however, the authors did not report any population RMSEA values, so it is unclear how the sample bias was computed.

The second main goal of this dissertation is to examine alternative methods for estimating AFIs so that the AFIs are not distorted by missing data. We are aware of only one recently published research paper that also examined alternative approaches for estimating AFIs. Lai [21] studied the TS approach for computing RMSEA under missing data. In addition, he proposed a small sample correction to improve the TS estimation of RMSEA in finite samples with missing data. He found that across a wide variety of conditions, the TS approach with the small sample correction consistently produced RMSEA estimates that were closer to the complete data population RMSEA values than those of the FIML approach. However, Lai [21] did not explain how the small sample correction should be computed. This dissertation addresses this gap and proposes two computational versions of the small sample correction under the TS estimation.

1.6 Dissertation Organization

This dissertation has two main goals. The first goal is to examine why and how AFIs are distorted by missing data under the FIML estimation. The second goal is to propose and investigate alternative computations of AFIs that can produce consistent and unbiased estimation in the presence of missing data.

Chapters 2 to 4 address the first goal of our dissertation.
Summaries of these chapters are as follows:

• Chapter 2 provides the technical details that help explain why AFIs can be affected by missing data under the FIML estimation method. We first show how we can rewrite the minimum of the FIML fit function in terms of the missing data patterns. Then we obtain the minimum of this fit function at the population level by working out the population limits of each component in the fit function minimum. Here, we show that the population limits vary across different types of missing data. Finally, we show how these fit function minima directly affect the estimates of AFIs.

• Chapter 3 provides a few analytical examples that demonstrate how AFIs are affected by increasing missing data under FIML. Here, we give examples where AFIs change with missing data as well as examples where AFIs stay the same with more missing data. We show examples where AFIs change solely because of the differences in equations between complete and incomplete data; we also show examples where AFIs change due to both the differences in equations and the differences in the population parameter values.

• Chapter 4 presents the results from two large sample simulation studies that examine the effect of missing data on AFIs under more realistic models. We focus on large samples in order to study the behaviour of AFIs without the presence of sampling error. Across the two simulation studies, we mainly manipulated the amount of missing data, the missing data mechanism, the missing data pattern and the location of missing data relative to the location of misfit. Each of these factors turned out to be important in the effect of missing data on AFIs under FIML.

Chapters 5 and 6 address the second goal of the dissertation. Summaries of these two chapters are as follows:

• Chapter 5 proposes two alternative approaches that can address the problems of AFIs under FIML.
One approach involves implementing a correction step following the FIML estimation; we call it the FIML-corrected or FIML-C approach. The second approach involves the use of the TS method. We lay out all the technical details for the two approaches and provide two analytical examples that demonstrate how these methods should be used.

• Chapter 6 presents the results of two simulation studies that compare the AFIs under the FIML-C and TS approaches relative to the original FIML approach. The design of the two simulation studies is the same as that of the simulation studies in Chapter 4 except that the studies in this chapter include the FIML-C and TS approaches as well as simulated data with more varied sample sizes. Overall, the results from the simulation studies support the use of the alternative methods.

Finally, in Chapter 7, we conclude the dissertation by summarizing the main results, providing recommendations for applied researchers, discussing the limitations of the current research, and suggesting a few future research directions.

Chapter 2
SEM AFIs under FIML Estimation: Technical Details

I argued that full information maximum likelihood (FIML) has several advantages over multiple imputation (MI) for handling missing data: 1) FIML is simpler to implement (if you have the right software); 2) unlike multiple imputation, FIML has no potential incompatibility between an imputation model and an analysis model; 3) FIML produces a deterministic result rather than a different result every time.
Paul Allison, 2012

Although a few past research studies have shown that missing data affect AFIs such as RMSEA and CFI under the FIML estimation [10, 13, 22], none of them has provided a mathematical explanation for such a phenomenon.
In this chapter, we provide the technical details to show how RMSEA and CFI are affected by missing data under FIML. Since RMSEA and CFI are functions of the fit function minimum, we start this chapter by explaining how the fit function minimum changes with missing data. We first show the derivations of the fit function minimum for complete data at both the sample and the population levels. We then show the derivations for incomplete data at the sample and population levels. Finally, we explain how the change in the fit function minimum affects RMSEA and CFI.

2.1 Fit Function Minimum for Complete Data

Let $x_1, \ldots, x_n$ be a random sample from the $p$-variate normal distribution $N(\mu, \Sigma)$. We want to test the null hypothesis that the data come from $N(\mu(\theta), \Sigma(\theta))$, where $\theta$ is a $q \times 1$ vector of model parameters. The normal-theory maximum likelihood (ML) estimator $\hat{\theta}$ maximizes the observed data log-likelihood

$$l(\theta \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} l_i(\theta) = C - \frac{n}{2}\Big(\log|\Sigma(\theta)| + \mathrm{tr}\big(S\Sigma^{-1}(\theta)\big) + (\bar{x} - \mu(\theta))'\Sigma^{-1}(\theta)(\bar{x} - \mu(\theta))\Big), \tag{2.1}$$

where $\bar{x}$ and $S$ are the sample means and covariance matrix, and $C$ does not depend on $\theta$. We denote the maximized log-likelihood for the structured (hypothesized) model as $\hat{l}$. The model-implied means and covariance matrix are $\hat{\mu} = \mu(\hat{\theta})$ and $\hat{\Sigma} = \Sigma(\hat{\theta})$. We can also maximize Equation 2.1 under the saturated model, which includes all the unique elements in $\mu$ and $\Sigma$ as model parameters. We denote the maximized log-likelihood for the saturated model as $\tilde{l}$. The estimates of the means and covariance matrix under the saturated model are $\tilde{\mu} = \mu(\tilde{\theta}) = \bar{x}$ and $\tilde{\Sigma} = \Sigma(\tilde{\theta}) = S$. With complete data, the saturated model estimates are just the familiar sample means and sample covariance matrix.

Maximizing the log-likelihood in Equation 2.1 is equivalent to minimizing the familiar ML fit function,¹ whose minimum is given by:

$$F_c(\hat{\mu}, \hat{\Sigma} \mid \bar{x}, S) = \log|\hat{\Sigma}S^{-1}| + \mathrm{tr}(S\hat{\Sigma}^{-1}) + (\bar{x} - \hat{\mu})'\hat{\Sigma}^{-1}(\bar{x} - \hat{\mu}) - p, \tag{2.2}$$

where the subscript $c$ represents complete data.
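As a rough numerical illustration of Equation 2.2, the following Python sketch evaluates the complete data fit function for given model-implied and sample moments. The function name `ml_fit_complete` and the toy matrices are assumptions made for this example; in particular, the model-implied values are supplied directly rather than obtained by fitting an SEM.

```python
import numpy as np

def ml_fit_complete(mu_hat, sigma_hat, xbar, S):
    """Value of the complete-data ML fit function in the form of Equation 2.2."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    xbar = np.asarray(xbar, dtype=float)
    S = np.asarray(S, dtype=float)
    p = S.shape[0]
    sigma_inv = np.linalg.inv(sigma_hat)
    # log|Sigma_hat S^{-1}| = log|Sigma_hat| - log|S|
    _, logdet_sigma = np.linalg.slogdet(sigma_hat)
    _, logdet_s = np.linalg.slogdet(S)
    diff = xbar - mu_hat
    return (logdet_sigma - logdet_s
            + np.trace(S @ sigma_inv)
            + diff @ sigma_inv @ diff
            - p)

# Hypothetical sample moments for p = 2 variables.
xbar = np.zeros(2)
S = np.array([[1.0, 0.5],
              [0.5, 1.0]])

# Saturated model: implied moments equal the sample moments, so F_c = 0.
f_saturated = ml_fit_complete(xbar, S, xbar, S)

# Independence model: implied covariance is diag(S), here the identity.
f_independence = ml_fit_complete(xbar, np.eye(2), xbar, S)

print(f_saturated, f_independence)
```

In this toy case the independence model gives $F_c = -\log(0.75)$, because the trace term equals $p$ and the mean term vanishes, leaving only the log-determinant ratio.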
The likelihood ratio (LR) test statistic is a scaled difference between the structured and the saturated log-likelihoods, and it can also be expressed in terms of the fit function minimum, as follows:

$$T_c = -2\big(l(\hat{\theta}) - l(\tilde{\theta})\big) = nF_c(\hat{\mu}, \hat{\Sigma} \mid \bar{x}, S), \tag{2.3}$$

where $T_c$ denotes the LR test statistic for complete data. In order to derive population values of AFIs, it is necessary to determine the population limit of $F_c(\hat{\mu}, \hat{\Sigma} \mid \bar{x}, S)$. When the hypothesized model is true, this limit is zero. In this dissertation, we are primarily interested in the case when the hypothesized model is false, as this is when the AFIs become relevant for evaluating the degree of misfit. Denote the population limit of the sample parameter estimates under the structured model by $\hat{\theta} \to \theta_0$, and the corresponding limits of the model-implied means and covariances by $\hat{\mu} \to \mu_0$ and $\hat{\Sigma} \to \Sigma_0$. Under the saturated model, the sample estimates of means and covariances, $\bar{x}$ and $S$, will converge to $\mu$ and $\Sigma$, respectively. When the structured model is wrong, it is generally the case that $\Sigma \neq \Sigma_0$, and it is sometimes the case that $\mu \neq \mu_0$ (in the presence of a mean structure). Therefore, when the data are complete, the fit function minimum at the population level is given by

$$F_c(\mu_0, \Sigma_0 \mid \mu, \Sigma) = \log|\Sigma_0\Sigma^{-1}| + \mathrm{tr}(\Sigma\Sigma_0^{-1}) + (\mu - \mu_0)'\Sigma_0^{-1}(\mu - \mu_0) - p. \tag{2.4}$$

We refer to the values $\theta_0$ as “pseudo-parameters”,² because they are population parameters for an incorrect model.

¹ The ML fit function used in SEM is equivalent to the Kullback-Leibler divergence (see footnote 1 in Chapter 1).

² In the statistics literature, the “pseudo-parameters” are also known as the “pseudo-true values” [40], the “least false parameter values minimizing Kullback-Leibler divergence” [9], or the “parameter vector that minimizes the Kullback-Leibler divergence” [19].

2.2 Fit Function Minimum for Incomplete Data

2.2.1 Sample Fit Function Minimum for Incomplete Data

Let $x_1, \ldots, x_n$ again be a random sample from the $p$-variate normal distribution $N(\mu, \Sigma)$. If the sample contains missing data, for each $i = 1, \ldots, n$, the corresponding observed vector $x_{obs,i}$ is of dimension $p_i \times 1$. Under an ignorable missing data mechanism (i.e., MCAR or MAR), the FIML estimator $\hat{\theta}$ can be obtained by maximizing the observed data log-likelihood

$$l(\theta \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} l_i(\theta) = C - \frac{1}{2}\sum_i \log|\Sigma_i(\theta)| - \frac{1}{2}\sum_i (x_{obs,i} - \mu_i(\theta))'\Sigma_i^{-1}(\theta)(x_{obs,i} - \mu_i(\theta)), \tag{2.5}$$

where $\mu_i(\theta)$ is the relevant $p_i \times 1$ subvector of $\mu(\theta)$, $\Sigma_i(\theta)$ is the relevant $p_i \times p_i$ submatrix of $\Sigma(\theta)$, and $C$ does not depend on $\theta$ [e.g. 24] (see Table 2.1 for a summary of the notation used in this section). As with complete data, we can obtain the structured and saturated log-likelihoods (denoted by $\hat{l}$ and $\tilde{l}$, respectively):

$$\hat{l} = \sum_{i=1}^{n} \hat{l}_i = C - \frac{1}{2}\sum_i \log|\hat{\Sigma}_i| - \frac{1}{2}\sum_i (x_{obs,i} - \hat{\mu}_i)'\hat{\Sigma}_i^{-1}(x_{obs,i} - \hat{\mu}_i),$$

$$\tilde{l} = \sum_{i=1}^{n} \tilde{l}_i = C - \frac{1}{2}\sum_i \log|\tilde{\Sigma}_i| - \frac{1}{2}\sum_i (x_{obs,i} - \tilde{\mu}_i)'\tilde{\Sigma}_i^{-1}(x_{obs,i} - \tilde{\mu}_i).$$

Table 2.1: Notation for mean vectors and covariance matrices with incomplete data under an incorrect hypothesized model.

| Description | Population Quantities | Consistent Sample Estimates |
|---|---|---|
| True Means and Covariance Matrix | $\mu, \Sigma$ | $\tilde{\mu}, \tilde{\Sigma}$ |
| Model-implied Means and Covariance Matrix (pseudo-parameters) | $\mu_{0m} = \mu(\theta_{0m}), \Sigma_{0m} = \Sigma(\theta_{0m})$ | $\hat{\mu} = \mu(\hat{\theta}), \hat{\Sigma} = \Sigma(\hat{\theta})$ |
| True Means and Covariance Matrix (sub-components for pattern $j$) | $\mu_j, \Sigma_j$ | $\tilde{\mu}_j, \tilde{\Sigma}_j$ |
| Model-implied Means and Covariance Matrix (sub-components for pattern $j$) | $\mu_{0m,j}, \Sigma_{0m,j}$ | $\hat{\mu}_j, \hat{\Sigma}_j$ |
| Pattern-specific Means and Covariance Matrix | $\mu^*_j, \Sigma^*_j$ | $\bar{x}_j, S_j$ |

Note: The subscript $0m$ indicates that the population limits for missing data are different from those for complete data, which are denoted by the subscript 0.

With the structured model, we obtain the model-implied mean vector $\hat{\mu} = \mu(\hat{\theta})$ and covariance matrix $\hat{\Sigma} = \Sigma(\hat{\theta})$; with the saturated model, we obtain the saturated model estimates $\tilde{\mu} = \mu(\tilde{\theta})$ and $\tilde{\Sigma} = \Sigma(\tilde{\theta})$, which represent the incomplete data analogues of $\bar{x}$ and $S$.
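The casewise structure of Equation 2.5 can be sketched in a few lines of Python. This is only an illustrative evaluation of the observed data log-likelihood at given values of the mean vector and covariance matrix, not an estimation routine; the function name `fiml_loglik`, the coding of missing values as `np.nan`, and the toy data are assumptions of this example. The constant is taken to be the usual normal constant, $C = -\frac{1}{2}\sum_i p_i \log(2\pi)$.

```python
import numpy as np

def fiml_loglik(data, mu, sigma):
    """Observed-data normal log-likelihood in the form of Equation 2.5.

    Each case contributes through the subvector of mu and the submatrix of
    sigma that correspond to the variables it actually observed.
    """
    data = np.asarray(data, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    total = 0.0
    for row in data:
        obs = ~np.isnan(row)               # variables observed for this case
        p_i = int(obs.sum())
        if p_i == 0:
            continue                        # fully missing case: no contribution
        resid = row[obs] - mu[obs]          # x_obs,i - mu_i
        sigma_i = sigma[np.ix_(obs, obs)]   # p_i x p_i submatrix
        _, logdet = np.linalg.slogdet(sigma_i)
        quad = resid @ np.linalg.solve(sigma_i, resid)
        total += -0.5 * (p_i * np.log(2 * np.pi) + logdet + quad)
    return total

# Hypothetical toy data: two cases, the second missing the first variable.
toy = np.array([[0.0, 0.0],
                [np.nan, 0.0]])
ll = fiml_loglik(toy, mu=np.zeros(2), sigma=np.eye(2))
print(ll)
```

Evaluating this function at the structured estimates and at the saturated estimates would give the two maximized log-likelihoods $\hat{l}$ and $\tilde{l}$; obtaining those estimates requires an optimizer or the EM algorithm, which this sketch does not attempt.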
However, in the case of incomplete data, these saturated model estimates generally do not reduce to any closed-form sample quantities. These saturated model estimates are also sometimes known as the “EM” [after the Expectation-Maximization (EM) algorithm; 11] means and covariances [e.g., 14].

The LR statistic is again the rescaled difference between the two log-likelihoods; however, with incomplete data, this statistic is typically not expressed as the sample size times the minimum of a fit function. In fact, the concept of a “fit function” does not seem to have been defined for incomplete data. In this dissertation, we introduce this concept and infer the form of this function by taking the difference of the two maximized log-likelihoods. That is, we write the LR test statistic for missing data as follows:

$$
\begin{aligned}
T_m &= -2\big(l(\hat{\theta}) - l(\tilde{\theta})\big) \\
&= \sum_i \log|\hat{\Sigma}_i| + \sum_i (x_{obs,i} - \hat{\mu}_i)'\hat{\Sigma}_i^{-1}(x_{obs,i} - \hat{\mu}_i) - \sum_i \log|\tilde{\Sigma}_i| - \sum_i (x_{obs,i} - \tilde{\mu}_i)'\tilde{\Sigma}_i^{-1}(x_{obs,i} - \tilde{\mu}_i) \\
&= \sum_i \log|\hat{\Sigma}_i\tilde{\Sigma}_i^{-1}| + \sum_i (x_{obs,i} - \hat{\mu}_i)'\hat{\Sigma}_i^{-1}(x_{obs,i} - \hat{\mu}_i) - \sum_i (x_{obs,i} - \tilde{\mu}_i)'\tilde{\Sigma}_i^{-1}(x_{obs,i} - \tilde{\mu}_i) \\
&= nF_m(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}),
\end{aligned}
\tag{2.6}
$$

where the general form of the minimized FIML fit function for incomplete data is

$$F_m(\mu(\theta), \Sigma(\theta) \mid \tilde{\mu}, \tilde{\Sigma}, \tilde{\phi}) = \frac{1}{n}\Big(\sum_i \log|\Sigma_i(\theta)\tilde{\Sigma}_i^{-1}| + \sum_i (x_{obs,i} - \mu_i(\theta))'\Sigma_i^{-1}(\theta)(x_{obs,i} - \mu_i(\theta)) - \sum_i (x_{obs,i} - \tilde{\mu}_i)'\tilde{\Sigma}_i^{-1}(x_{obs,i} - \tilde{\mu}_i)\Big), \tag{2.7}$$

where $\tilde{\phi}$ is the missing mechanism parameter vector. When the mechanism is MCAR, the vector $\tilde{\phi}$ contains only the population probabilities and the specification of each missing data pattern. When the mechanism is MAR, the vector $\tilde{\phi}$ contains additional parameters associated with the relationships between the probability of being missing in one variable and the observed value in the other variable.
Comparing Equations 2.2 and 2.7 reveals that the equations of the fit function minima for complete and incomplete data are different. In addition, when the hypothesized model is misspecified, the model-implied mean vector and covariance matrix will differ (i.e., $\hat{\mu}$ and $\hat{\Sigma}$ will not be the same for complete and incomplete data) because the model parameters depend on the missing mechanism parameters (see the next section for a detailed explanation).³

To figure out the corresponding population limit of Equation 2.7, it is necessary to re-write Equation 2.7 in terms of the missing data patterns. Let $\hat{q}_j = n_j/n$ be the observed proportion of cases in missing data pattern $j$, where $j = 1, \ldots, J$ and $\sum_j n_j = n$. Then Equation 2.7 can be rewritten as follows:

$$F_m(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}, \tilde{\phi}) = \sum_{j=1}^{J} \hat{q}_j \Big(\log|\hat{\Sigma}_j\tilde{\Sigma}_j^{-1}| + \frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \hat{\mu}_j)'\hat{\Sigma}_j^{-1}(x_{obs,i(j)} - \hat{\mu}_j) - \frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \tilde{\mu}_j)'\tilde{\Sigma}_j^{-1}(x_{obs,i(j)} - \tilde{\mu}_j)\Big). \tag{2.8}$$

In the above equation, the summations over all $n$ cases have been replaced with a summation over the $J$ missing data patterns and summations over the $n_j$ observations within each pattern; $x_{obs,i}$ has been replaced with $x_{obs,i(j)}$, so that raw observations are now enumerated within each pattern $j$, $i(j) = 1, \ldots, n_j$. In addition, $\hat{\Sigma}_j$, $\tilde{\Sigma}_j$, $\hat{\mu}_j$, and $\tilde{\mu}_j$ represent the appropriate sub-matrices of $\hat{\Sigma}$ and $\tilde{\Sigma}$ and sub-vectors of $\hat{\mu}$ and $\tilde{\mu}$, with only the rows and columns corresponding to variables observed within pattern $j$.

We note that in the missing data literature, a similar version of Equation 2.8 has been provided by Muthén and Muthén [27] in the Mplus technical appendices (see Appendix 6, Equation 133 in Muthén and Muthén [27]). However, their equation did not write out all terms associated with the saturated model; instead, they expressed the part of the equation associated with the saturated model as one constant term.
Equation 2.8 is also different from the equations presented in some of the older missing data papers [e.g., 26], which relied on the multiple-group (MG) setup to handle missing data.⁴

³ Technically, we should include a subscript $\tilde{\phi}$ for $\mu_i(\theta)$ and $\Sigma_i(\theta)$ in Equation 2.7 to denote their dependency on $\tilde{\phi}$; here, we omit this for simplicity of notation.

⁴ To use the MG fit function for handling missing data, pseudo-values corresponding to cases with missing data have to be inserted in the covariance matrices of the missing data patterns, and the degrees of freedom need to be adjusted for these pseudo-values after fitting the model. See Chapter 8 of Bollen [5] for a detailed explanation.

We now re-write Equation 2.8 in terms of the sample covariance matrices, which will later help us find the population limit of $F_m(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}, \tilde{\phi})$. To do this, we need to define the following three “sample covariance matrices” that can be computed within each missing data pattern:

$$S_j = \frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \bar{x}_j)(x_{obs,i(j)} - \bar{x}_j)',$$

$$S_{\hat{\mu},j} = \frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \hat{\mu}_j)(x_{obs,i(j)} - \hat{\mu}_j)',$$

$$S_{\tilde{\mu},j} = \frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \tilde{\mu}_j)(x_{obs,i(j)} - \tilde{\mu}_j)'.$$

Here, the first matrix $S_j$ is the usual sample covariance matrix within pattern $j$ computed using the within-pattern sample mean $\bar{x}_j$; the next two matrices $S_{\hat{\mu},j}$ and $S_{\tilde{\mu},j}$ are computed using model-estimated means, either under the structured model or under the saturated model.
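A quick numerical sketch may help show how these matrices relate to one another: centring at any vector other than the within-pattern sample mean adds a rank-one term to $S_j$. The Python snippet below checks this on simulated data for a single pattern; the data, the vector `mu_hat_j`, and the helper `cov_about` are hypothetical stand-ins, not quantities from any fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))            # hypothetical cases within one pattern j
mu_hat_j = np.array([0.1, -0.2, 0.3])   # hypothetical model-implied means

def cov_about(x, center):
    """Average outer product of deviations from `center` (divisor n_j)."""
    d = x - center
    return d.T @ d / x.shape[0]

xbar_j = x.mean(axis=0)
S_j = cov_about(x, xbar_j)              # usual within-pattern covariance matrix
S_muhat_j = cov_about(x, mu_hat_j)      # centred at the model-implied means

# Rank-one correction built from the gap between the two centres.
shift = np.outer(xbar_j - mu_hat_j, xbar_j - mu_hat_j)
identity_holds = np.allclose(S_muhat_j, S_j + shift)
print(identity_holds)
```

The same check works with the saturated model means in place of `mu_hat_j`, since the identity does not depend on where the centring vector comes from.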
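Using these three matrices, it follows that:

$$
\begin{aligned}
S_j &= \frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \bar{x}_j)(x_{obs,i(j)} - \bar{x}_j)' \\
&= \frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \hat{\mu}_j)(x_{obs,i(j)} - \hat{\mu}_j)' - (\hat{\mu}_j - \bar{x}_j)(\hat{\mu}_j - \bar{x}_j)' \\
&= S_{\hat{\mu},j} - (\bar{x}_j - \hat{\mu}_j)(\bar{x}_j - \hat{\mu}_j)';
\end{aligned}
$$

$$
\begin{aligned}
S_j &= \frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \bar{x}_j)(x_{obs,i(j)} - \bar{x}_j)' \\
&= \frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \tilde{\mu}_j)(x_{obs,i(j)} - \tilde{\mu}_j)' - (\tilde{\mu}_j - \bar{x}_j)(\tilde{\mu}_j - \bar{x}_j)' \\
&= S_{\tilde{\mu},j} - (\bar{x}_j - \tilde{\mu}_j)(\bar{x}_j - \tilde{\mu}_j)'.
\end{aligned}
$$

We can also write $S_{\hat{\mu},j}$ and $S_{\tilde{\mu},j}$ in terms of $S_j$:

$$S_{\hat{\mu},j} = S_j + (\bar{x}_j - \hat{\mu}_j)(\bar{x}_j - \hat{\mu}_j)';$$

$$S_{\tilde{\mu},j} = S_j + (\bar{x}_j - \tilde{\mu}_j)(\bar{x}_j - \tilde{\mu}_j)'.$$

Using these expressions and the rules of trace, starting with Equation 2.8, we can write:

$$
\begin{aligned}
F_m(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}, \tilde{\phi})
&= \sum_{j=1}^{J} \hat{q}_j \Big(\log|\hat{\Sigma}_j\tilde{\Sigma}_j^{-1}| + \frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \hat{\mu}_j)'\hat{\Sigma}_j^{-1}(x_{obs,i(j)} - \hat{\mu}_j) - \frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \tilde{\mu}_j)'\tilde{\Sigma}_j^{-1}(x_{obs,i(j)} - \tilde{\mu}_j)\Big) \\
&= \sum_{j=1}^{J} \hat{q}_j \Big(\log|\hat{\Sigma}_j\tilde{\Sigma}_j^{-1}| + \mathrm{tr}\Big(\frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \hat{\mu}_j)'\hat{\Sigma}_j^{-1}(x_{obs,i(j)} - \hat{\mu}_j)\Big) - \mathrm{tr}\Big(\frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \tilde{\mu}_j)'\tilde{\Sigma}_j^{-1}(x_{obs,i(j)} - \tilde{\mu}_j)\Big)\Big) \\
&= \sum_{j=1}^{J} \hat{q}_j \Big(\log|\hat{\Sigma}_j\tilde{\Sigma}_j^{-1}| + \mathrm{tr}\Big(\Big(\frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \hat{\mu}_j)(x_{obs,i(j)} - \hat{\mu}_j)'\Big)\hat{\Sigma}_j^{-1}\Big) - \mathrm{tr}\Big(\Big(\frac{1}{n_j}\sum_{i=1}^{n_j} (x_{obs,i(j)} - \tilde{\mu}_j)(x_{obs,i(j)} - \tilde{\mu}_j)'\Big)\tilde{\Sigma}_j^{-1}\Big)\Big) \\
&= \sum_{j=1}^{J} \hat{q}_j \Big(\log|\hat{\Sigma}_j\tilde{\Sigma}_j^{-1}| + \mathrm{tr}(S_{\hat{\mu},j}\hat{\Sigma}_j^{-1}) - \mathrm{tr}(S_{\tilde{\mu},j}\tilde{\Sigma}_j^{-1})\Big) \\
&= \sum_{j=1}^{J} \hat{q}_j \Big(\log|\hat{\Sigma}_j\tilde{\Sigma}_j^{-1}| + \mathrm{tr}\big((S_j + (\bar{x}_j - \hat{\mu}_j)(\bar{x}_j - \hat{\mu}_j)')\hat{\Sigma}_j^{-1}\big) - \mathrm{tr}\big((S_j + (\bar{x}_j - \tilde{\mu}_j)(\bar{x}_j - \tilde{\mu}_j)')\tilde{\Sigma}_j^{-1}\big)\Big).
\end{aligned}
\tag{2.9}
$$

2.2.2 Population Fit Function Minimum for Incomplete Data

Before obtaining the population limit of the fit function minimum for incomplete data, we elaborate on the concept of the incomplete data population. If the process that created the current sample with incomplete data is allowed to go on indefinitely so that a larger and larger sample is generated, we would eventually sample the entire population. In this way, the current observed sample with incomplete data can be viewed as a random sample from this incomplete data population. The observed percentage of missing values in the sample is a consistent estimate of the population percentage of missing values.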
Further, the observed incomplete data patterns and their relative frequencies are assumed to accurately reflect the underlying incomplete data population. Of course, in smaller samples not all missing data patterns that are possible in the population may be realized.

We can now proceed to obtain the population limit of the fit function for incomplete data. To obtain the population limit of Equation 2.9, we assume that the index $J$ enumerates all of the missing data patterns that exist in the population. This means either that the sample size is large enough that all missing data patterns that exist in the population have been realized in the sample, or alternatively, that in Equation 2.9 some $\hat{q}_j$ values are zero in the sample but will approach non-zero population values; in other words, the percentage of any missing data pattern in the sample is a consistent estimate of the population probability of that pattern.

In addition, we need to determine the limits of all sample quantities in Equation 2.9 to obtain the population limit of $F_m(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}, \tilde{\phi})$. Under an ignorable missing data mechanism (i.e., MCAR or MAR), the saturated model estimates $\tilde{\mu}$ and $\tilde{\Sigma}$ are consistent for $\mu$ and $\Sigma$. Therefore, for any missing data pattern, it is the case that $\tilde{\mu}_j \to \mu_j$ and $\tilde{\Sigma}_j \to \Sigma_j$. We also define the population "pseudo-parameters" as the limits of the corresponding sample quantities, $\hat{\theta} \to \theta_{0m}$, $\hat{\mu} \to \mu_{0m}$, and $\hat{\Sigma} \to \Sigma_{0m}$, where the subscript $0m$ indicates that the population limits for missing data may be different from those for complete data, which are denoted by the subscript $0$. Indeed, when the hypothesized model is wrong, it is generally the case that $\Sigma \neq \Sigma_{0m}$ and $\mu \neq \mu_{0m}$.
With incomplete data, the FIML estimates of means can be different under the structured model even when the mean structure is saturated. In addition, when the model is misspecified, the "pseudo-parameters" for complete and incomplete data will usually differ from each other, resulting in different model-implied mean and covariance matrix estimates even in the population (i.e., $\Sigma_0 \neq \Sigma_{0m}$ and $\mu_0 \neq \mu_{0m}$), unless the hypothesized model has no free parameters (see Section 3.2 for an example).

We first state the population limit of $F_m(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}, \tilde{\phi})$ in the special case when the assumption of homogeneity of means and covariances holds. This assumption is always met when the data are MCAR [e.g., 20], and it is usually not met when the data are MAR or MNAR, although it is possible to construct a counter-example [44]. Under the homogeneity of means and covariances assumption, the estimates of pattern-specific means and covariances converge to the corresponding subsets of the overall population mean and covariance matrix; that is, $\bar{x}_j \to \mu_j$ and $S_j \to \Sigma_j$ for all $j$, where $\mu_j$ is the $p_j \times 1$ sub-vector of $\mu$ and $\Sigma_j$ is the $p_j \times p_j$ sub-matrix of $\Sigma$ corresponding to the variables observed in the $j$th missing data pattern. Therefore, the population value of Equation 2.9 in the case of MCAR data is given by:

$$
\begin{aligned}
F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)
&= \sum_{j=1}^{J} q_j \Big( \log|\Sigma_{0m,j} \Sigma_j^{-1}|
 + \mathrm{tr}\big( (\Sigma_j + (\mu_j - \mu_{0m,j})(\mu_j - \mu_{0m,j})') \Sigma_{0m,j}^{-1} \big) \\
&\qquad\qquad - \mathrm{tr}\big( (\Sigma_j + (\mu_j - \mu_j)(\mu_j - \mu_j)') \Sigma_j^{-1} \big) \Big) \\
&= \sum_{j=1}^{J} q_j \Big( \log|\Sigma_{0m,j} \Sigma_j^{-1}| + \mathrm{tr}(\Sigma_j \Sigma_{0m,j}^{-1})
 + (\mu_j - \mu_{0m,j})' \Sigma_{0m,j}^{-1} (\mu_j - \mu_{0m,j}) - \mathrm{tr}(\Sigma_j \Sigma_j^{-1}) \Big) \\
&= \sum_{j=1}^{J} q_j \Big( \log|\Sigma_{0m,j} \Sigma_j^{-1}| + \mathrm{tr}(\Sigma_j \Sigma_{0m,j}^{-1})
 + (\mu_j - \mu_{0m,j})' \Sigma_{0m,j}^{-1} (\mu_j - \mu_{0m,j}) - p_j \Big),
\end{aligned}
\tag{2.10}
$$

where $J$ is the number of missing data patterns that are possible in the population, $q_j$ is the population probability of the $j$th pattern, and $p_j$ is the number of variables in the $j$th missing data pattern (see Table 2.1 for notation).
Note that, in its functional form, this equation is a weighted average, by pattern probabilities, of the complete data fit function given in Equation 2.4. However, the population limits of the model-implied estimates of the means and covariances will generally differ for complete and incomplete data.

In the more general case when data are MAR, the homogeneity of means and covariances assumption is typically violated. In this case, the limits of the within-pattern estimates of means and covariances are not necessarily equal to the corresponding sub-components of the overall population means and covariance matrix. For example, consider the simplest case of two variables, $X$ and $Y$, both $N(0,1)$, where $Y$ is missing with probability one whenever $X > 0$. Even though the population means of $X$ and $Y$ are both zero, the pattern-specific means will be different. In the missing data pattern where $X$ is observed but $Y$ is missing, all sample realizations of $X$ are positive, and thus the estimated mean of $X$ using only the cases with this pattern will approach the mean of a standard normal distribution truncated from below at zero, which is $\sqrt{2/\pi} \approx 0.80$ rather than zero. In the general case of MAR data, let $\bar{x}_j \to \mu_j^*$ and $S_j \to \Sigma_j^*$ be the pattern-specific limits of the means and covariance matrix for the variables within the $j$th pattern. The population value of the fit function minimum is given by

$$
\begin{aligned}
F_{MAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)
= \sum_{j=1}^{J} q_j \Big( & \log|\Sigma_{0m,j} \Sigma_j^{-1}|
 + \mathrm{tr}\big( (\Sigma_j^* + (\mu_j^* - \mu_{0m,j})(\mu_j^* - \mu_{0m,j})') \Sigma_{0m,j}^{-1} \big) \\
& - \mathrm{tr}\big( (\Sigma_j^* + (\mu_j^* - \mu_j)(\mu_j^* - \mu_j)') \Sigma_j^{-1} \big) \Big).
\end{aligned}
\tag{2.11}
$$

Thus, in the general case of ignorable incomplete data, the fit function minimum depends on: 1) the true means and covariances ($\mu$ and $\Sigma$, where $\mu_j$ and $\Sigma_j$ are the corresponding sub-components for pattern $j$, $j = 1, \ldots, J$); 2) the model-implied means and covariances ($\mu_{0m}$ and $\Sigma_{0m}$, with $\mu_{0m,j}$ and $\Sigma_{0m,j}$ indicating the relevant sub-components for pattern $j$); and 3) the pattern-specific population means and covariances ($\mu_j^*$ and $\Sigma_j^*$, for $j = 1, \ldots, J$; see Table 2.1).
The model-implied means and covariances in 2) will be different for complete data and for different types of incomplete data. The pattern-specific population means and covariances in 3) also depend on the missing data mechanism: when the data are MCAR, they are equal to the corresponding subsets of the population means and covariances (i.e., $\mu_j^* = \mu_j$ and $\Sigma_j^* = \Sigma_j$); when the data are MAR, they will usually differ across patterns and cannot be viewed as subsets of a single vector and matrix (i.e., $\mu_j^* \neq \mu_j$ and $\Sigma_j^* \neq \Sigma_j$).

We briefly note what happens when the data are MNAR. In this case, the saturated FIML estimates of the mean vector and covariance matrix, $\tilde{\mu}$ and $\tilde{\Sigma}$, are no longer consistent for $\mu$ and $\Sigma$, so the general Equation 2.11 will feature the population limits of $\tilde{\mu}$ and $\tilde{\Sigma}$ instead of $\mu$ and $\Sigma$. We assume an ignorable missing data mechanism (MCAR or MAR data) for the remainder of this dissertation.

2.3 RMSEA and CFI for Complete and Incomplete Data

In all current SEM software, the RMSEA and CFI are computed using the same equations regardless of whether the data contain missing values. For complete normal data, under ML estimation, the LR test statistic in Equation 2.3 is used to define RMSEA and CFI as follows:

$$
\begin{aligned}
\widehat{\mathrm{RMSEA}}_{ML} &= \sqrt{\frac{\max(T_c - df, 0)}{df \cdot n}}
 = \sqrt{\max\left( \frac{F_c(\hat{\mu}, \hat{\Sigma} \mid \bar{x}, S) - \frac{df}{n}}{df}, 0 \right)}; \\
\widehat{\mathrm{CFI}}_{ML} &= 1 - \frac{\max(T_c - df, 0)}{\max(T_c - df, T_{c,B} - df_B, 0)}
 = 1 - \frac{\max\left( F_c(\hat{\mu}, \hat{\Sigma} \mid \bar{x}, S) - \frac{df}{n}, 0 \right)}
 {\max\left( F_c(\hat{\mu}_B, \hat{\Sigma}_B \mid \bar{x}, S) - \frac{df_B}{n},\; F_c(\hat{\mu}, \hat{\Sigma} \mid \bar{x}, S) - \frac{df}{n},\; 0 \right)},
\end{aligned}
\tag{2.12}
$$

where the subscript $B$ stands for the baseline model, which assumes all variables are uncorrelated with each other; that is, $df_B$, $\hat{\mu}_B$, $\hat{\Sigma}_B$ and $T_{c,B}$ are the baseline model's degrees of freedom, model-implied means, model-implied covariance matrix, and LR test statistic, respectively. As mentioned in Section 1.4, as model fit improves, RMSEA gets closer to zero and CFI gets closer to one.
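As a small illustration of Equation 2.12, the sketch below computes the sample RMSEA and CFI from a test statistic, its degrees of freedom, and the sample size. The function names and the numerical inputs are hypothetical and are used only to show the arithmetic; the sketch follows the formulas in the text, including the use of $n$ (rather than $n - 1$) in the denominator.

```python
import math

def rmsea(t, df, n):
    """Sample RMSEA from an LR statistic t with df degrees of freedom."""
    return math.sqrt(max(t - df, 0.0) / (df * n))

def rmsea_from_fit(f_min, df, n):
    """Same quantity written in terms of the fit function minimum (t = n * f_min)."""
    return math.sqrt(max((f_min - df / n) / df, 0.0))

def cfi(t, df, t_base, df_base):
    """Sample CFI from the hypothesized-model and baseline-model statistics."""
    numerator = max(t - df, 0.0)
    denominator = max(t - df, t_base - df_base, 0.0)
    if denominator == 0.0:
        # Numerator is then zero as well; by convention CFI is reported as one.
        return 1.0
    return 1.0 - numerator / denominator

# Hypothetical values, for illustration only.
t_model, df_model = 85.3, 27
t_baseline, df_baseline = 900.0, 36
n_obs = 500

rmsea_value = rmsea(t_model, df_model, n_obs)
cfi_value = cfi(t_model, df_model, t_baseline, df_baseline)
print(round(rmsea_value, 4), round(cfi_value, 4))
```

Because both indices depend on the data only through the test statistic, any change in the fit function minimum carries over directly to the RMSEA and the CFI.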
In the rare case when both the numerator and the denominator in the CFI computation are zero, the convention is to set CFI to one.

In the presence of missing data, under FIML estimation, RMSEA and CFI are computed in the same way as in the above equations, except that we use the corresponding LR test statistic for missing data in Equation 2.6, as follows:

$$
\begin{aligned}
\widehat{\mathrm{RMSEA}}_{FIML} &= \sqrt{\frac{\max(T_m - df, 0)}{df \cdot n}}
 = \sqrt{\max\left( \frac{F_m(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}, \tilde{\phi}) - \frac{df}{n}}{df}, 0 \right)}; \\
\widehat{\mathrm{CFI}}_{FIML} &= 1 - \frac{\max(T_m - df, 0)}{\max(T_m - df, T_{m,B} - df_B, 0)}
 = 1 - \frac{\max\left( F_m(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}, \tilde{\phi}) - \frac{df}{n}, 0 \right)}
 {\max\left( F_m(\hat{\mu}_B, \hat{\Sigma}_B \mid \tilde{\mu}, \tilde{\Sigma}, \tilde{\phi}) - \frac{df_B}{n},\; F_m(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}, \tilde{\phi}) - \frac{df}{n},\; 0 \right)}.
\end{aligned}
\tag{2.13}
$$

We now show the population limits of RMSEA and CFI under complete and incomplete data. For complete data, we can find the population limits of RMSEA and CFI by using the population fit function minima for complete data in Equation 2.4, as follows:

$$
\mathrm{RMSEA}_{ML} = \sqrt{\frac{F_c(\mu_0, \Sigma_0 \mid \mu, \Sigma)}{df}}; \qquad
\mathrm{CFI}_{ML} = 1 - \frac{F_c(\mu_0, \Sigma_0 \mid \mu, \Sigma)}{F_c(\mu_{B,0}, \Sigma_{B,0} \mid \mu, \Sigma)},
\tag{2.14}
$$

where $\mu_{B,0}$ and $\Sigma_{B,0}$ are the population limits of the model-implied means and covariances under the baseline model. For incomplete data, we use the corresponding population fit function minima for incomplete data, as follows:

$$
\mathrm{RMSEA}_{FIML} = \sqrt{\frac{F_m(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)}{df}}; \qquad
\mathrm{CFI}_{FIML} = 1 - \frac{F_m(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)}{F_m(\mu_{B,0m}, \Sigma_{B,0m} \mid \mu, \Sigma, \phi)}.
\tag{2.15}
$$

In summary, these equations show that the AFIs depend on the fit function minima, and that they may estimate different population values depending on the presence (and type) of missing data. In the special case when the model is exactly correct, all fit function minima will be zero in the population, and AFIs from complete and any type of incomplete data will agree asymptotically. However, it is a safe assumption that the model is never exactly correct in the population, so the complete and incomplete data AFIs will generally converge to different population values. For complete data, the fit function minimum in the population is given by Equation 2.4; for MCAR data, it is given by Equation 2.10; and for MAR data, it is given by the most general equation, Equation 2.11.
It is worth emphasizing that even this categorization is incomplete: there is a separate population value of the RMSEA and CFI for each specific type of MCAR or MAR data, depending on the missing data proportion, location, patterns, and (in the case of MAR) conditioning rules.

Chapter 3
SEM AFIs under FIML Estimation: Analytical Examples

When attempting to assess how well a model fits a particular dataset, one must realize at the outset that the classic hypothesis-testing approach is inappropriate.
James H. Steiger, 1980

In this chapter, we demonstrate, with a few analytical examples, how RMSEA under FIML estimation changes with the presence and type of missing data. As shown in Chapter 2, the equations of the fit function minimum differ under complete and incomplete data and under different types of incomplete data; consequently, AFIs can also differ because of the differences in the equations of the fit function minimum. In the first section of this chapter, we present examples where RMSEA stays the same and examples where RMSEA changes with missing data solely due to the differences in the equations of the fit function minimum. In addition, under FIML, AFIs may also change with missing data due to differences in the parameter values. In the second section, we present an example where RMSEA changes due to both the differences in the equations and the differences in the parameter values.

3.1 Change in RMSEA due to Differences in the Equations of the Fit Function Minimum

In this example, the hypothesized model is fully constrained (i.e., it has no free parameters), so the model-implied means and covariances do not differ for complete and different types of incomplete data, greatly simplifying computations. All observed differences are therefore only due to the different forms of the fit function.

Let $X_1, \ldots, X_6$ follow a multivariate normal distribution $N(\mu, \Sigma)$ with the following population covariance matrix and vector of means:

$$
\Sigma = \begin{pmatrix}
1.00 & & & & & \\
0.89 & 1.00 & & & & \\
0.49 & 0.49 & 1.00 & & & \\
0.00 & 0.00 & 0.00 & 1.00 & & \\
0.00 & 0.00 & 0.00 & 0.49 & 1.00 & \\
0.00 & 0.00 & 0.00 & 0.49 & 0.49 & 1.00
\end{pmatrix},
\qquad \mu = (0, 0, 0, 0, 0, 0)'.
\tag{3.1}
$$

This covariance structure is consistent with that of a two-factor model with orthogonal factors and three indicators per factor (with loadings of 0.7), plus a correlated residual (of size 0.4) between $X_1$ and $X_2$. The model fit to data is always a fully constrained model, which is the same as the population model but without the correlated residual. The model-implied covariance matrix and mean vector are

$$
\Sigma_0 = \begin{pmatrix}
1.00 & & & & & \\
0.49 & 1.00 & & & & \\
0.49 & 0.49 & 1.00 & & & \\
0.00 & 0.00 & 0.00 & 1.00 & & \\
0.00 & 0.00 & 0.00 & 0.49 & 1.00 & \\
0.00 & 0.00 & 0.00 & 0.49 & 0.49 & 1.00
\end{pmatrix},
\qquad \mu_0 = (0, 0, 0, 0, 0, 0)'.
\tag{3.2}
$$

Because the hypothesized model has no free parameters, no fit function minimization is required to obtain "parameter estimates": all values are fixed a priori. Thus, the model-implied estimates $\mu_0$ and $\Sigma_0$ will be the same in all of the examples considered below (given by Equation 3.2). However, one can still evaluate the fit function at these estimates to obtain the "fit function minimum" for the purposes of computing AFIs. It is important to note that the misfit is caused by the correlated residual between $X_1$ and $X_2$; therefore, the deviation of the fit function minimum from zero will always be due to the difference in the value of the covariance between $X_1$ and $X_2$ in $\Sigma$ versus $\Sigma_0$.
Because the fit function has a different form for complete and incomplete data, the numeric values of the AFIs can still differ even in this simplified example.

3.1.1 Case 1: Complete data

When there are no missing values, the fit function "minimum" in Equation 2.4 is given by

$$
F_c(\mu_0, \Sigma_0 \mid \mu, \Sigma) = \log|\Sigma_0 \Sigma^{-1}| + \mathrm{tr}(\Sigma \Sigma_0^{-1}) + (\mu - \mu_0)' \Sigma_0^{-1} (\mu - \mu_0) - p
= 1.200 + 5.612 + 0 - 6 = 0.812.
$$

The corresponding population RMSEA calculated using Equation 2.14 is

$$
\mathrm{RMSEA}_{ML} = \sqrt{\frac{F_c(\mu_0, \Sigma_0 \mid \mu, \Sigma)}{df}} = \sqrt{\frac{0.812}{27}} = 0.173.
$$

We will use this RMSEA value obtained under complete data as a benchmark to compare with values obtained under incomplete data.^1

3.1.2 Case 2: MCAR data; misspecification does not involve variables with missing values

Now suppose that 20% of the values on $X_6$ are missing completely at random. In this case, there are $J = 2$ missing data patterns, with $q_1 = 0.8$ (probability of the complete data pattern) and $q_2 = 0.2$ (probability of the incomplete data pattern), and with $p_1 = 6$ and $p_2 = 5$ (number of observed variables in each pattern). Then, Equation 2.10 yields

$$
\begin{aligned}
F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)
&= q_1 \big( \log|\Sigma_{0m,1} \Sigma_1^{-1}| + \mathrm{tr}(\Sigma_1 \Sigma_{0m,1}^{-1}) + (\mu_1 - \mu_{0m,1})' \Sigma_{0m,1}^{-1} (\mu_1 - \mu_{0m,1}) - p_1 \big) \\
&\quad + q_2 \big( \log|\Sigma_{0m,2} \Sigma_2^{-1}| + \mathrm{tr}(\Sigma_2 \Sigma_{0m,2}^{-1}) + (\mu_2 - \mu_{0m,2})' \Sigma_{0m,2}^{-1} (\mu_2 - \mu_{0m,2}) - p_2 \big) \\
&= (0.8)(0.812) + (0.2)(0.812) = 0.812.
\end{aligned}
$$

Because the hypothesized model is fully constrained, $\Sigma_{0m} = \Sigma_0$ is given by Equation 3.2. The first pattern is the complete data pattern, so that $\Sigma_{0m,1} = \Sigma_0$, $\Sigma_1 = \Sigma$, $\mu_{0m,1} = \mu_0$, and $\mu_1 = \mu$; consequently, the component of $F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)$ for the first pattern is the same as $F_c(\mu_0, \Sigma_0 \mid \mu, \Sigma)$ (i.e., it equals 0.812).

For the incomplete data pattern, $\Sigma_{0m,2}$ and $\Sigma_2$ are the $5 \times 5$ sub-matrices of $\Sigma_0$ and $\Sigma$ with the last row and column deleted, and $\mu_{0m,2}$ and $\mu_2$ are the $5 \times 1$ sub-vectors of $\mu_0$ and $\mu$ with the last element deleted.
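The Case 1 and Case 2 values can be checked numerically. The sketch below is one way to do so with NumPy, assuming the matrices in Equations 3.1 and 3.2; the helper names `f_component` and `f_mcar` are hypothetical. Each missing data pattern is described by its probability and the indices of its observed variables, and the MCAR fit function of Equation 2.10 is the probability-weighted sum of the pattern components.

```python
import numpy as np

def f_component(sigma_model, sigma_pop, mu_model, mu_pop):
    """One pattern's term in the MCAR fit function (form of Equation 2.10)."""
    p = sigma_pop.shape[0]
    inv_model = np.linalg.inv(sigma_model)
    diff = mu_pop - mu_model
    _, logdet_model = np.linalg.slogdet(sigma_model)
    _, logdet_pop = np.linalg.slogdet(sigma_pop)
    return (logdet_model - logdet_pop
            + np.trace(sigma_pop @ inv_model)
            + diff @ inv_model @ diff
            - p)

def f_mcar(patterns, sigma_model, sigma_pop, mu_model, mu_pop):
    """Weighted sum over patterns given as (probability, observed indices)."""
    total = 0.0
    for q, idx in patterns:
        sub = np.ix_(idx, idx)
        total += q * f_component(sigma_model[sub], sigma_pop[sub],
                                 mu_model[idx], mu_pop[idx])
    return total

# Model-implied matrix: two orthogonal factors, loadings 0.7 (Equation 3.2).
block = np.full((3, 3), 0.49) + 0.51 * np.eye(3)
zeros = np.zeros((3, 3))
sigma0 = np.block([[block, zeros], [zeros, block]])

# Population matrix: adds the correlated residual of 0.4 between X1 and X2.
sigma = sigma0.copy()
sigma[0, 1] = sigma[1, 0] = 0.89
mu = np.zeros(6)

# Case 1: complete data (a single pattern with all six variables).
f_case1 = f_mcar([(1.0, [0, 1, 2, 3, 4, 5])], sigma0, sigma, mu, mu)
rmsea_case1 = np.sqrt(f_case1 / 27)

# Case 2: 20% of X6 missing completely at random.
f_case2 = f_mcar([(0.8, [0, 1, 2, 3, 4, 5]), (0.2, [0, 1, 2, 3, 4])],
                 sigma0, sigma, mu, mu)

print(round(f_case1, 3), round(rmsea_case1, 3), round(f_case2, 3))
```

Other MCAR scenarios for this model can be explored by changing only the list of patterns passed to `f_mcar`.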
Because $\Sigma$ and $\Sigma_0$ are block-diagonal (the factors are orthogonal), model misfit caused by the correlated residual between $X_1$ and $X_2$ does not propagate to variables loading on the second factor. Since the variable with missing values ($X_6$) only loads on the second factor, the component of $F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)$ for the second pattern turns out to be the same as $F_c(\mu_0, \Sigma_0 \mid \mu, \Sigma)$, and the entire $F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)$ is the same as $F_c(\mu_0, \Sigma_0 \mid \mu, \Sigma)$.

Footnote 1. We cannot compute the traditionally defined CFI because the traditional independence model is not nested within the highly restrictive hypothesized model used in this example.

In sum, compared to the complete data case (Case 1), in Case 2 the hypothesized model fit function minimum and consequently the RMSEA stay the same. The reason the fit function minimum for the hypothesized model stays the same is that the variable with missing data (i.e., $X_6$) contributes no information about the amount of misfit (which involves the covariance between $X_1$ and $X_2$), due to the nature of the model.

3.1.3 Case 3: MCAR data; misspecification involves variables with missing data

Next, we consider the case where the location of misfit in the covariance structure involves covariances among variables with missing data. Suppose that 20% of the values on $X_1$ (rather than $X_6$, as in Case 2) are missing completely at random. As in Case 2, $J = 2$, $p_1 = 6$, $q_1 = 0.8$, $p_2 = 5$, and $q_2 = 0.2$. However, the fit function minimum value is now given by

$$
\begin{aligned}
F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)
&= q_1 \big( \log|\Sigma_{0m,1} \Sigma_1^{-1}| + \mathrm{tr}(\Sigma_1 \Sigma_{0m,1}^{-1}) + (\mu_1 - \mu_{0m,1})' \Sigma_{0m,1}^{-1} (\mu_1 - \mu_{0m,1}) - p_1 \big) \\
&\quad + q_2 \big( \log|\Sigma_{0m,2} \Sigma_2^{-1}| + \mathrm{tr}(\Sigma_2 \Sigma_{0m,2}^{-1}) + (\mu_2 - \mu_{0m,2})' \Sigma_{0m,2}^{-1} (\mu_2 - \mu_{0m,2}) - p_2 \big) \\
&= (0.8)(0.812) + (0.2)(0) = 0.650.
\end{aligned}
$$

The first pattern is the complete data pattern, and the corresponding component of $F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)$ has the same value as before (0.812). However, the second pattern now omits $X_1$: $\Sigma_{0m,2}$ and $\Sigma_2$ are the $5 \times 5$ sub-matrices of $\Sigma_0$ and $\Sigma$ with the first row and column deleted, and $\mu_{0m,2}$ and $\mu_2$ are the $5 \times 1$ sub-vectors of $\mu_0$ and $\mu$ with the first element deleted.
In this case, $\Sigma_{0m,2}$ and $\Sigma_2$ no longer contain the covariance between $X_1$ and $X_2$, which is the one covariance that is misspecified. Thus, $\Sigma_{0m,2} = \Sigma_2$, and the second component of $F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)$ is zero. As a result, the fit function minimum for the hypothesized model is smaller. This fit function minimum directly affects RMSEA:

$$
\mathrm{RMSEA}_{FIML} = \sqrt{\frac{F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)}{df}} = \sqrt{\frac{0.650}{27}} = 0.155.
$$

This example illustrates that when the variables with missing data are also involved in the misspecification, the fit function minimum for the hypothesized model and, consequently, the RMSEA generally decrease relative to their complete data counterparts. We note that for a hypothesized model (such as the one in our example) where only part of the model is severely misspecified, an interaction between the location of misfit (relative to the location of missing data) and the effect of missing data on RMSEA is expected. If variables corresponding to the part of the model that is severely misspecified are missing, then the model will show better fit when assessed by the RMSEA.

3.1.4 Case 4: MAR data; misspecification involves variables with missing values

The last example involves MAR data. Suppose that $X_1$ is missing with probability one whenever $X_2 > 0.842$, which is the z-score corresponding to the 80th percentile of a standard normal distribution. This implies that $X_1$ is missing with 20% probability. As before, $J = 2$, $p_1 = 6$, $p_2 = 5$, $q_1 = 0.8$ and $q_2 = 0.2$.

The correct equation for the fit function minimum is now given by Equation 2.11 instead of Equation 2.10. To compute Equation 2.11, we require pattern-specific population means and covariance matrices, that is, $\mu_j^*$ and $\Sigma_j^*$ for $j = 1, 2$. In this example, even for the complete data pattern, $\mu_1^* \neq \mu$ and $\Sigma_1^* \neq \Sigma$.
The reason is that in the complete data pattern, $X_2$ is distributed as a standard normal variable truncated from above at 0.842; in addition, $X_1$ and $X_3$ will no longer have normal distributions, as they will tend to have more negative than positive values observed (by virtue of being correlated with $X_2$). We have used the tmvtnorm package in R to obtain the population covariance matrices and means of the truncated multivariate normal distributions corresponding to this example, yielding:

$$
\Sigma_1^* = \begin{pmatrix}
0.670 & & & & & \\
0.519 & 0.583 & & & & \\
0.308 & 0.286 & 0.900 & & & \\
0.00 & 0.00 & 0.00 & 1.00 & & \\
0.00 & 0.00 & 0.00 & 0.49 & 1.00 & \\
0.00 & 0.00 & 0.00 & 0.49 & 0.49 & 1.00
\end{pmatrix},
\qquad \mu_1^* = (-0.311, -0.350, -0.171, 0, 0, 0)',
$$

$$
\Sigma_2^* = \begin{pmatrix}
0.219 & & & & \\
0.107 & 0.812 & & & \\
0.00 & 0.00 & 1.00 & & \\
0.00 & 0.00 & 0.49 & 1.00 & \\
0.00 & 0.00 & 0.49 & 0.49 & 1.00
\end{pmatrix},
\qquad \mu_2^* = (1.400, 0.686, 0, 0, 0)'.
$$

Substituting all these components into Equation 2.11, we get

$$
\begin{aligned}
F_{MAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)
&= q_1 \Big( \log|\Sigma_{0m,1} \Sigma_1^{-1}| + \mathrm{tr}\big( (\Sigma_1^* + (\mu_1^* - \mu_{0m,1})(\mu_1^* - \mu_{0m,1})') \Sigma_{0m,1}^{-1} \big) \\
&\qquad\quad - \mathrm{tr}\big( (\Sigma_1^* + (\mu_1^* - \mu_1)(\mu_1^* - \mu_1)') \Sigma_1^{-1} \big) \Big) \\
&\quad + q_2 \Big( \log|\Sigma_{0m,2} \Sigma_2^{-1}| + \mathrm{tr}\big( (\Sigma_2^* + (\mu_2^* - \mu_{0m,2})(\mu_2^* - \mu_{0m,2})') \Sigma_{0m,2}^{-1} \big) \\
&\qquad\quad - \mathrm{tr}\big( (\Sigma_2^* + (\mu_2^* - \mu_2)(\mu_2^* - \mu_2)') \Sigma_2^{-1} \big) \Big) \\
&= (0.8)(1.200 + 5.248 - 5.706) + (0.2)(0 + 6.178 - 6.178) \\
&= 0.594.
\end{aligned}
$$

As before, because the first pattern is the complete data pattern, $\Sigma_{0m,1} = \Sigma_0$, $\Sigma_1 = \Sigma$, $\mu_{0m,1} = \mu_0$, and $\mu_1 = \mu$. However, the addition of the truncated covariances and means (i.e., $\Sigma_1^*$ and $\mu_1^*$) into this equation means that the component of $F_{MAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)$ corresponding to the first pattern is different from that of $F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)$ in Cases 2 and 3. For the second pattern, as in Case 3, $\Sigma_{0m,2}$ and $\Sigma_2$ are the $5 \times 5$ sub-matrices of $\Sigma_0$ and $\Sigma$ with the first row and column deleted, and $\mu_{0m,2}$ and $\mu_2$ are the $5 \times 1$ sub-vectors of $\mu_0$ and $\mu$ with the first element deleted.
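The Case 4 value can be approximated without a truncated multivariate normal routine, because only one variable ($X_2$) is truncated. The sketch below is not the tmvtnorm computation described in the text; it assumes the standard closed-form moments of a truncated standard normal variable and the usual selection formulas for the remaining jointly normal variables. All function names are hypothetical.

```python
import math
import numpy as np

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def selected_moments(sigma, k, m, v):
    """Mean and covariance of a zero-mean normal vector after selection on
    variable k (unit variance), whose selected mean and variance are m and v."""
    s = sigma[:, k]
    return s * m, sigma - (1.0 - v) * np.outer(s, s)

def f_mar_component(sigma_model, sigma_pop, mu_model, mu_pop, mu_star, sigma_star):
    """One pattern's term in the MAR fit function (form of Equation 2.11)."""
    d_model = mu_star - mu_model
    d_pop = mu_star - mu_pop
    _, logdet_model = np.linalg.slogdet(sigma_model)
    _, logdet_pop = np.linalg.slogdet(sigma_pop)
    tr_model = np.trace((sigma_star + np.outer(d_model, d_model)) @ np.linalg.inv(sigma_model))
    tr_pop = np.trace((sigma_star + np.outer(d_pop, d_pop)) @ np.linalg.inv(sigma_pop))
    return logdet_model - logdet_pop + tr_model - tr_pop

# Matrices from Equations 3.1 and 3.2.
block = np.full((3, 3), 0.49) + 0.51 * np.eye(3)
zeros = np.zeros((3, 3))
sigma0 = np.block([[block, zeros], [zeros, block]])
sigma = sigma0.copy()
sigma[0, 1] = sigma[1, 0] = 0.89
mu = np.zeros(6)

c = 0.842                      # cutoff on X2 (index 1)
q1, q2 = Phi(c), 1.0 - Phi(c)  # pattern probabilities, close to 0.8 and 0.2

# Moments of X2 below the cutoff (complete pattern) and above it (X1 missing).
m_low = -phi(c) / Phi(c)
v_low = 1.0 - c * phi(c) / Phi(c) - m_low ** 2
m_high = phi(c) / (1.0 - Phi(c))
v_high = 1.0 + c * m_high - m_high ** 2

mu_star1, sigma_star1 = selected_moments(sigma, 1, m_low, v_low)
mu_star2_full, sigma_star2_full = selected_moments(sigma, 1, m_high, v_high)
idx2 = [1, 2, 3, 4, 5]         # variables observed when X1 is missing
sub2 = np.ix_(idx2, idx2)

f_mar = (q1 * f_mar_component(sigma0, sigma, mu, mu, mu_star1, sigma_star1)
         + q2 * f_mar_component(sigma0[sub2], sigma[sub2], mu[idx2], mu[idx2],
                                mu_star2_full[idx2], sigma_star2_full[sub2]))
rmsea_mar = math.sqrt(f_mar / 27)
print(round(f_mar, 3), round(rmsea_mar, 3))
```

Under these assumptions the pattern-specific moments agree with the reported $\mu_j^*$ and $\Sigma_j^*$ to about three decimals, and the second pattern contributes zero because its model-implied and population sub-matrices coincide.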
Because the misspecified covariance between $X_1$ and $X_2$ is eliminated, $\Sigma_{0m,2} = \Sigma_2$, so that the component of $F_{MAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)$ corresponding to the second pattern is zero.

The population RMSEA is given by:

$$
\mathrm{RMSEA}_{FIML} = \sqrt{\frac{F_{MAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)}{df}} = \sqrt{\frac{0.594}{27}} = 0.148.
$$

Overall, this set of examples shows that different missing data mechanisms can yield different fit function minima and AFIs. We have not yet shown how the missing data percentage, the number of missing data patterns, and the strength of the missing data mechanism (in the case of MAR) can also affect the fit function minimum and the AFIs. These variables will be considered in the simulation study described in the next chapter.

3.2 Change in RMSEA due to Differences in Parameter Values

In this section, we show that when data change from complete to incomplete, models with pseudo-parameters may have the same or a different model-implied covariance matrix, which in turn affects the fit function minimum and AFIs.

3.2.1 Case 1: Pseudo-parameter values stay the same with missing data

We first consider the case where the model-implied covariance matrix stays the same. Let $X$ and $Y$ be two random variables that follow a bivariate normal distribution with the population covariance matrix and mean vector given by

$$
\Sigma = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix}
\quad \text{and} \quad \mu = (0, 0)'.
\tag{3.3}
$$

Let the hypothesized model be a special case of the simple regression model: $Y = X + \zeta$, where the regression coefficient is fixed to one. This model is misspecified (since $X$ and $Y$ are orthogonal in the population). Let $\mathrm{var}(X) = \eta$ and $\mathrm{var}(\zeta) = \psi$. We assume a saturated mean structure and a zero correlation between $X$ and $\zeta$.
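The model-implied covariance matrix and mean vector are then

$$
\Sigma(\theta_0) = \begin{pmatrix} \eta & \eta \\ \eta & \eta + \psi \end{pmatrix}
\quad \text{and} \quad \mu_0 = (0, 0)',
\tag{3.4}
$$

where the model parameters are given by $\theta_0 = (\eta, \psi)'$.

To obtain the "pseudo-parameters" $\theta_0$ with complete data, we minimize

$$
\begin{aligned}
F_c(\mu_0, \Sigma(\theta_0) \mid \mu, \Sigma)
&= \log|\Sigma(\theta_0) \Sigma^{-1}| + \mathrm{tr}(\Sigma \Sigma^{-1}(\theta_0)) + (\mu - \mu_0)' \Sigma^{-1}(\theta_0) (\mu - \mu_0) - p \\
&= \log(4\psi\eta) + \frac{1}{\psi} + \frac{1}{2\eta} - 2,
\end{aligned}
\tag{3.5}
$$

where the second expression has been obtained by substituting Equations 3.3 and 3.4 into the first expression and then simplifying. The partial derivatives of $F_c(\mu_0, \Sigma(\theta_0) \mid \mu, \Sigma)$ are

$$
\frac{\partial F}{\partial \eta} = \frac{1}{\eta} - \frac{1}{2\eta^2}
\quad \text{and} \quad
\frac{\partial F}{\partial \psi} = \frac{1}{\psi} - \frac{1}{\psi^2}.
$$

Setting them to 0, we obtain $\theta_0 = (\eta, \psi)' = (\tfrac{1}{2}, 1)'$. Substituting these values into Equations 3.4 and 3.5, we obtain

$$
\Sigma_0 = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 1.5 \end{pmatrix}
\quad \text{and} \quad F_c(\mu_0, \Sigma_0 \mid \mu, \Sigma) = 0.693.
\tag{3.6}
$$

The population RMSEA can then be computed using Equation 2.14, with $df = 1$:

$$
\mathrm{RMSEA}_{ML} = \sqrt{\frac{F_c(\mu_0, \Sigma_0 \mid \mu, \Sigma)}{df}} = \sqrt{\frac{0.693}{1}} = 0.833.
$$

Now suppose there are 50% missing data on $Y$, missing completely at random (MCAR). In this case, $J = 2$, $p_1 = 2$, $p_2 = 1$, and $q_1 = q_2 = 0.5$. The fit function becomes

$$
\begin{aligned}
F_{MCAR}(\mu(\theta_{0m}), \Sigma(\theta_{0m}) \mid \mu, \Sigma, \phi)
&= q_1 \big( \log|\Sigma_1(\theta_{0m}) \Sigma_1^{-1}| + \mathrm{tr}(\Sigma_1 \Sigma_1^{-1}(\theta_{0m})) + (\mu_1 - \mu_{0,1})' \Sigma_1^{-1}(\theta_{0m}) (\mu_1 - \mu_{0,1}) - p_1 \big) \\
&\quad + q_2 \big( \log|\Sigma_2(\theta_{0m}) \Sigma_2^{-1}| + \mathrm{tr}(\Sigma_2 \Sigma_2^{-1}(\theta_{0m})) + (\mu_2 - \mu_{0,2})' \Sigma_2^{-1}(\theta_{0m}) (\mu_2 - \mu_{0,2}) - p_2 \big) \\
&= \frac{1}{2} \Big( \log(4\psi\eta) + \frac{1}{\psi} + \frac{1}{2\eta} - 2 \Big) + \frac{1}{2} \Big( \log(2\eta) + \frac{1}{2\eta} - 1 \Big),
\end{aligned}
\tag{3.7}
$$

where $\Sigma_1 = \Sigma$, $\mu_1 = \mu$, $\Sigma_1(\theta_{0m}) = \Sigma(\theta_{0m})$, $\mu_{0,1} = \mu_0$, $\Sigma_2 = 0.5$, $\mu_2 = 0$, $\Sigma_2(\theta_{0m}) = \eta$, and $\mu_{0,2} = 0$. The partial derivatives of this fit function vanish at the same values as those for complete data (the derivative with respect to $\eta$ is unchanged, and the derivative with respect to $\psi$ is simply halved). Therefore, the "pseudo-parameters" and the model-implied matrix for these incomplete data are the same as those for complete data, shown in Equation 3.6. Substituting the parameters into Equation 3.7, we find that $F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi) = 0.3466$, which makes $\mathrm{RMSEA}_{FIML} = 0.5887$. Therefore, in this case, the fit function minima and RMSEAs for complete and incomplete data are different solely due to the differences in their equations.

3.2.2 Case 2: Pseudo-parameter values change with missing data

We now extend this example to the situation where the model-implied covariance matrix changes when there are missing data.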
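The population covariance matrix and mean vector are again given by Equation 3.3. The hypothesized model is again $Y = X + \zeta$, but we now fix $\mathrm{var}(X) = 0.5$. The model-implied covariance matrix and mean vector are

$$
\Sigma(\theta_0) = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & \psi + 0.5 \end{pmatrix}
\quad \text{and} \quad \mu_0 = (0, 0)',
\tag{3.8}
$$

where the only parameter is $\theta_0 = \psi$.

When data are complete, the fit function we need to minimize is the same as Equation 3.5 except that we substitute 0.5 for $\eta$, so that

$$
F_c(\mu_0, \Sigma(\theta_0) \mid \mu, \Sigma) = \log(2\psi) + \frac{1}{\psi} - 1.
\tag{3.9}
$$

The derivative with respect to $\psi$ is again

$$
\frac{dF}{d\psi} = \frac{1}{\psi} - \frac{1}{\psi^2},
\tag{3.10}
$$

yielding $\theta_0 = \psi = 1$ and $F_c(\mu_0, \Sigma_0 \mid \mu, \Sigma) = 0.6931$, which are the same as in the previous example. The population RMSEA is different from the previous complete data RMSEA because the $df$ now equals two:

$$
\mathrm{RMSEA}_{ML} = \sqrt{\frac{F_c(\mu_0, \Sigma_0 \mid \mu, \Sigma)}{df}} = \sqrt{\frac{0.6931}{2}} = 0.5887.
$$

Next, suppose that there are 50% MCAR missing data on $X$. Therefore, again, $J = 2$, $p_1 = 2$, $p_2 = 1$, and $q_1 = q_2 = 0.5$. The fit function becomes

$$
\begin{aligned}
F_{MCAR}(\mu(\theta_{0m}), \Sigma(\theta_{0m}) \mid \mu, \Sigma, \phi)
&= q_1 \big( \log|\Sigma_1(\theta_{0m}) \Sigma_1^{-1}| + \mathrm{tr}(\Sigma_1 \Sigma_1^{-1}(\theta_{0m})) + (\mu_1 - \mu_{0,1})' \Sigma_1^{-1}(\theta_{0m}) (\mu_1 - \mu_{0,1}) - p_1 \big) \\
&\quad + q_2 \big( \log|\Sigma_2(\theta_{0m}) \Sigma_2^{-1}| + \mathrm{tr}(\Sigma_2 \Sigma_2^{-1}(\theta_{0m})) + (\mu_2 - \mu_{0,2})' \Sigma_2^{-1}(\theta_{0m}) (\mu_2 - \mu_{0,2}) - p_2 \big) \\
&= \frac{1}{2} \Big( \log(2\psi) + \frac{1}{\psi} - 1 \Big) + \frac{1}{2} \Big( \log(2\psi + 1) + \frac{1}{2\psi + 1} - 1 \Big),
\end{aligned}
\tag{3.11}
$$

where $\Sigma_1 = \Sigma$, $\mu_1 = \mu$, $\Sigma_1(\theta_{0m}) = \Sigma(\theta_{0m})$, $\mu_{0,1} = \mu_0$, $\Sigma_2 = 0.5$, $\mu_2 = 0$, $\Sigma_2(\theta_{0m}) = \psi + 0.5$, and $\mu_{0,2} = 0$. Notice that this fit function differs from Equation 3.9. Therefore, the derivative is also different from Equation 3.10:

$$
\frac{dF}{d\psi} = \frac{1}{2} \Big( \frac{1}{\psi} - \frac{1}{\psi^2} \Big) + \frac{1}{2} \Big( \frac{2}{2\psi + 1} - \frac{2}{(2\psi + 1)^2} \Big).
\tag{3.12}
$$

Setting the derivative to 0, we find that $\theta_{0m} = \psi = 0.7378$,^2 which yields

$$
\Sigma_{0m} = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 1.2378 \end{pmatrix}
\quad \text{and} \quad F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi) = 0.5274.
\tag{3.13}
$$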
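The stationary point of Equation 3.12 can also be located numerically. The following sketch uses a plain bisection search on the derivative, which is only one of several possible approaches; the function names and the search interval are choices made for this illustration. The complete data derivative of Equation 3.10 is included for comparison.

```python
import math

def f_complete(psi):
    """Complete data fit function, Equation 3.9."""
    return math.log(2 * psi) + 1 / psi - 1

def d_f_complete(psi):
    """Derivative of the complete data fit function, Equation 3.10."""
    return 1 / psi - 1 / psi ** 2

def f_mcar(psi):
    """Fit function with 50% MCAR missing data on X, Equation 3.11."""
    y_only = math.log(2 * psi + 1) + 1 / (2 * psi + 1) - 1
    return 0.5 * f_complete(psi) + 0.5 * y_only

def d_f_mcar(psi):
    """Derivative of the incomplete data fit function, Equation 3.12."""
    return (0.5 * (1 / psi - 1 / psi ** 2)
            + 0.5 * (2 / (2 * psi + 1) - 2 / (2 * psi + 1) ** 2))

def bisect(func, lo, hi, tol=1e-10):
    """Root of func on [lo, hi], assuming a single sign change."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_lo < 0) == (f_mid < 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Both derivatives are negative at 0.1 and positive at 2.0, so each root
# found on this interval is a minimum of the corresponding fit function.
psi_complete = bisect(d_f_complete, 0.1, 2.0)
psi_missing = bisect(d_f_mcar, 0.1, 2.0)

print(round(psi_complete, 4), round(f_complete(psi_complete), 4))
print(round(psi_missing, 4), round(f_mcar(psi_missing), 4))
```

Clearing denominators in Equation 3.12 gives the cubic $8\psi^3 - 3\psi - 1 = 0$, which has a single positive root, so the bisection result does not depend on the particular interval as long as it brackets that root.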
Footnote 2. Using the rational root theorem, we can show that the function in Equation 3.12 has no rational root. We solved for $\psi$ by graphing the function.

With this fit function minimum, we obtain the population RMSEA:

$$
\mathrm{RMSEA}_{FIML} = \sqrt{\frac{F_{MCAR}(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma, \phi)}{df}} = \sqrt{\frac{0.5274}{2}} = 0.5135.
$$

Therefore, in this case, the fit function minima and the RMSEA values are different between complete and incomplete data not only due to differences in equations but also due to the differences in the "pseudo-parameters" and the model-implied covariance matrices. In summary, we have shown in this section that when the data change from complete to incomplete, the "pseudo-parameters" may change, which in turn affects the fit function minimum and RMSEA.

Chapter 4
SEM AFIs under FIML Estimation: Simulation Studies

A simulation is the imitation of the operation of a real-world process or system over time. Whether done by hand or on a computer, simulation involves the generation of an artificial history of a system and the observation of that artificial history to draw inferences concerning the operating characteristics of the real system.
Jerry Banks, John S. Carson II, Barry L. Nelson, David M. Nicol, 2010

In this chapter, we describe two large-sample simulation studies designed to demonstrate, with more realistic models, how factors such as the location of missing data relative to the location of misfit, the percentage of missing data, and the type of missing data mechanism affect the RMSEA and the CFI computed using FIML estimation. We focus on large samples to mimic what happens at the population level. It is important to first establish the differences in the population so that they are not obfuscated by the presence of sampling error.

4.1 Design

Table 4.1 shows the design of the two simulation studies. In both studies, data were generated from confirmatory factor analysis (CFA) models.
The population model was always a two-factor model with six indicators per factor, all loadings of 0.7, and unit variances for all observed and latent variables. The population model varied in the value of the factor correlation and the number and size of correlated residuals (if any) across studies and study conditions. For each population model, we generated n = 1,000,000 normally distributed observations using the simulateData() function in the lavaan package [30] in R (see Supplementary Materials for sample code).

The two studies differed in the type and location of misfit in the hypothesized model. In Study 1, we varied the number of correlated residuals (1 or 2), the size of the correlated residuals (0, 0.1, 0.2, 0.3, or 0.4) and the strength of the factor correlation (0, 0.4, or 0.8) in the population model. The hypothesized model was always a two-factor model without correlated residuals. Thus, misfit in the hypothesized model was most directly related to the indicators of the factor where correlated residuals appeared in the population model, although misfit can propagate throughout the model via the factor correlation (when it is not zero). In addition, the location of correlated residuals (i.e., location of misfit) was varied relative to the location of missing data: 1) in the "Same Factor" (SF) conditions, variables with correlated residuals and variables with missing data loaded on the same factor;^1 2) in the "Different Factor" (DF) conditions, variables with correlated residuals and variables with missing data loaded on different factors (see Figure 4.1). In Study 2, the population model did not have any correlated residuals; instead, it had a factor correlation of varying size (0.2 to 1, see Table 4.1). The hypothesized model was always a one-factor model.
Thus, in Study 2, model misfit increased as the factor correlation in the population model decreased. This type of misfit affects the entire covariance structure, but it particularly affects the covariances among indicators of different factors.

Footnote 1. In most SF conditions, the variables with missing data had correlated residuals in the population model. However, in the SF conditions where four variables have missing data but only two variables have a correlated residual, two of the variables with missing data will not include a correlated residual.

Table 4.1: Conditions in the Simulation Studies

Study 1
  Number of Variables with Missing Data (2 levels): 2, 4
  Percentage of Missing Data in Each Variable with Missing Data (3 levels): 0%, 20%, 50%
  Location of Misfit (2 levels):
    Same factor (SF) conditions: Variables involving misfit and those involving missing data load on the same factor.
    Different factor (DF) conditions: Variables involving misfit and those involving missing data load on different factors.
  Missing Mechanism (3 levels): MCAR, Weak MAR, Strong MAR
  Model (2 × 5 × 3 = 30 levels): The population model is a two-factor model (six indicators loading on each factor) that varies in the following features:
    Number of correlated residuals: 1, 2
    Size of correlated residuals: 0, 0.1, 0.2, 0.3, 0.4
    Factor correlation: 0, 0.4, 0.8
    The hypothesized model is always a correlated two-factor model without any correlated residuals.

Study 2
  Number of Variables with Missing Data (3 levels): 2, 4, 6
  Percentage of Missing Data in Each Variable with Missing Data (3 levels): 0%, 20%, 50%
  Number of Missing Data Patterns (2 levels):
    Minimum: Always 2 patterns
    Maximum: 4, 16 and 64 patterns when 2, 4 and 6 variables have missing data, respectively.
  Missing Mechanism (3 levels): MCAR, Weak MAR, Strong MAR
  Model (9 levels): The population model is a two-factor model (six indicators loading on each factor) that varies in the factor correlation: 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2. The hypothesized model is always a one-factor model.

Note: Both studies have factorial designs. In total, there are 2×3×2×3×30 = 1080 conditions in Study 1 and 3×3×2×3×9 = 486 conditions in Study 2.

Figure 4.1: Differences between DF and SF conditions.

In both studies, we varied the percentage of missing data by deleting 0%, 20% or 50% of values in each variable designated to contain missing data. The number of such variables varied within and across studies (described shortly). In both studies, we studied three missing data mechanisms: one MCAR mechanism and two types of MAR mechanism (weak and strong). To create MCAR data, we randomly selected rows for deletion. To generate MAR data, we first specified a cutoff point for a conditioning variable without missing data that loaded on the same factor. For 20% missing data, the cutoff point for the conditioning variable was 0.842 (i.e., the 80th percentile of the standard normal distribution); for 50% missing data, the cutoff point was 0. In the strong MAR conditions, the probability of missing data was 1 if the conditioning variable exceeded the specified cutoff, and 0 otherwise. In the weak MAR conditions, the probability of missing data was 0.75 if the conditioning variable exceeded the specified cutoff, and 0.25 otherwise.

The two studies differed in the number of variables with missing data and the number of missing data patterns. In Study 1, the number of variables with missing data was either 2 or 4; in Study 2, this number was 2, 4 or 6. In Study 1, the variables with missing values were jointly missing, creating the minimum number of missing data patterns (i.e., two patterns). In Study 2, we also added the conditions where the variables were not jointly missing, resulting in the maximum number of possible patterns (see Table 4.1 for exact numbers).
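The deletion rules described above can be sketched in a few lines. The example below is a simplified illustration in Python rather than the R code used for the studies (which is in the Supplementary Materials); the three-variable data set, the seed, and the function names are assumptions made for this sketch. Missingness is imposed jointly on the target variables, as in the two-pattern conditions.

```python
import numpy as np

def mcar_rows(n, rate, rng):
    """Boolean mask selecting round(rate * n) rows completely at random."""
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(round(rate * n)), replace=False)] = True
    return mask

def mar_rows(cond, cutoff, p_above, p_below, rng):
    """Boolean mask where the deletion probability depends on whether the
    fully observed conditioning variable exceeds the cutoff."""
    p_missing = np.where(cond > cutoff, p_above, p_below)
    return rng.random(cond.shape[0]) < p_missing

def delete_jointly(data, rows, cols):
    """Return a copy of data with the given columns set to NaN in the given rows."""
    out = data.copy()
    for col in cols:
        out[rows, col] = np.nan
    return out

rng = np.random.default_rng(2020)
n = 200_000

# Hypothetical indicators of one factor (loadings 0.7, so covariances 0.49).
cov = np.full((3, 3), 0.49) + 0.51 * np.eye(3)
data = rng.multivariate_normal(np.zeros(3), cov, size=n)
target_cols = [0, 1]   # variables that receive missing values
cond = data[:, 2]      # conditioning variable, never deleted

# MCAR, 20% missing.
mcar = delete_jointly(data, mcar_rows(n, 0.20, rng), target_cols)

# Strong MAR, 20% missing: deleted whenever the conditioning variable > 0.842.
strong_rows = mar_rows(cond, 0.842, 1.0, 0.0, rng)
strong = delete_jointly(data, strong_rows, target_cols)

# Weak MAR, 50% missing: probability 0.75 above the cutoff of 0, 0.25 below.
weak_rows = mar_rows(cond, 0.0, 0.75, 0.25, rng)
weak = delete_jointly(data, weak_rows, target_cols)

print(np.isnan(mcar[:, 0]).mean(), np.isnan(strong[:, 0]).mean(),
      np.isnan(weak[:, 0]).mean())
```

Because the same row mask is applied to every target column, each incomplete data set produced this way contains exactly two missing data patterns.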
For MCAR conditions, data with the maximum number of possible patterns were created by creating missingness for each variable independently. For MAR conditions, such data were created by using a different conditioning variable for each variable with missing data.

In summary, both studies have a factorial design, with 1080 conditions in Study 1 and 486 conditions in Study 2 (see Table 4.1). In both studies, we manipulated the degree of misfit, the amount of missing data, and the missing data mechanism. The main difference between the two studies is that in Study 1, we manipulated the location of misfit relative to the location of missing data; in Study 2, we fixed the location of misfit but manipulated the number of missing data patterns. The population values of the RMSEA and the CFI with complete data across different study conditions are given in Table 4.2. As this table illustrates, the complete data population RMSEA varies from 0 to 0.175 in Study 1 and from 0 to 0.192 in Study 2, while the complete data population CFI varies from 0.769 to 1 in Study 1 and from 0.538 to 1 in Study 2. The values in this table are taken to be benchmarks to which the incomplete data RMSEA and CFI will be compared. Our main question of interest is how much these values change with the introduction of incomplete data.

Table 4.2: Complete data RMSEA and CFI for all conditions in Studies 1 and 2

Study 1
                               FC=0             FC=0.4           FC=0.8
Number of CRs   Size of CR     RMSEA   CFI      RMSEA   CFI      RMSEA   CFI
One CR          0.0            0.000   1.000    0.000   1.000    0.000   1.000
                0.1            0.022   0.994    0.022   0.994    0.023   0.994
                0.2            0.044   0.977    0.045   0.977    0.048   0.976
                0.3            0.067   0.950    0.069   0.948    0.076   0.943
                0.4            0.086   0.927    0.089   0.923    0.105   0.902
Two CRs         0.0            0.000   1.000    0.000   1.000    0.000   1.000
                0.1            0.033   0.987    0.033   0.987    0.034   0.988
                0.2            0.069   0.947    0.069   0.948    0.071   0.951
                0.3            0.113   0.874    0.113   0.876    0.115   0.882
                0.4            0.166   0.773    0.168   0.773    0.175   0.769

Study 2
FC      RMSEA   CFI
1.0     0.000   1.000
0.9     0.045   0.979
0.8     0.078   0.932
0.7     0.106   0.872
0.6     0.129   0.803
0.5     0.149   0.730
0.4     0.167   0.654
0.3     0.182   0.586
0.2     0.192   0.538

Note: FC=Factor Correlation; CR=Correlated Residual; MCAR=Missing Completely At Random; MAR=Missing At Random.

4.2 Results

Below we summarize the major patterns of results using a series of figures and regression analyses. In all figures, the corresponding population quantities (i.e., RMSEA, CFI, or population fit function minimum) with complete data are shown by the red (solid) lines. The AFIs' values for complete data are also shown in Table 4.2; these values serve as the benchmark against which we compare the incomplete data AFIs. For the regression analyses, we computed the absolute bias for the population AFIs by finding the absolute differences between the complete data population values (i.e., $\mathrm{RMSEA}_{\mathrm{ML}}$ or $\mathrm{CFI}_{\mathrm{ML}}$ from Equation 2.14) and the corresponding incomplete data population values (estimated from n = 1000000), and then we used the features of the missing data (shown in Table 4.1) to predict the absolute bias. In addition, we provide the full simulation results in the Supplementary Materials (footnote 2).

4.2.1 Study 1

Figures 4.2-4.5 present selected results from Study 1.
Figure 4.2 shows the RMSEA and CFI values for the conditions with MCAR data, population factor correlation of zero, and two correlated residuals of varying size (shown on the x-axis). Both DF ("Different Factors") and SF ("Same Factor") conditions are shown. The maximum discrepancy between complete and incomplete data AFIs occurs in the SF conditions with 50% missing data on four variables, when the correlated residuals are of size 0.4: the complete data RMSEA and CFI are 0.166 and 0.773, respectively, while the incomplete data RMSEA and CFI are 0.119 and 0.831, respectively.

Although the AFIs measure model fit on a continuum, cutoff points are commonly used to help researchers categorize the amount of misfit. For example, some methodologists have suggested that an RMSEA less than 0.08 indicates good fit [8], and that a CFI greater than 0.9 indicates good fit [17]. Figure 4.2 illustrates several conditions where missing data cause the AFIs to cross these recommended cutoff points. For example, in the SF conditions where four variables had missing data and the correlated residuals were of size 0.3, RMSEA decreased from 0.113 to 0.080 and CFI increased from 0.874 to 0.912 as the percentage of missing data increased from 0% to 50%. Thus, researchers may arrive at different conclusions about model fit depending on whether missing data are present.

Footnote 2: The tables in the Supplementary Materials combine the results from these two simulation studies with those from the simulation studies in Chapter 6. For the simulation studies' results in this chapter, please refer to the row under FIML and n = 1000000 in each table in the Supplementary Materials.

Figure 4.2: RMSEA and CFI for Study 1 conditions varying in the locations of misfit and number of variables with missing data. For the conditions in this figure, the missing mechanism is MCAR, the population factor correlation is 0, and the number of correlated residuals is 2.

Figure 4.2 also shows that the pattern of results for RMSEA is different from that for CFI.
Because the factor correlation was zero, misfit associated with the covariance structure of indicators of one factor did not propagate to affect the covariance structure of the indicators of the other factor. Therefore, in the DF conditions where the indicators containing correlated residuals in the population model were different from indicators with missing data, the values of the RMSEA did not visibly change with missing data. In contrast, in SF conditions, the RMSEA values generally decreased, indicating better fit, as missing data increased (higher percentage of missing data or more variables with missing data), and the rate of decrease was higher for higher levels of misfit (i.e., larger size of correlated residuals).

The pattern was more complex for the CFI (see the second panel of Figure 4.2). In DF conditions, CFI decreased, indicating worse fit with more missing data. This pattern was opposite of that for the RMSEA. However, in the SF conditions, the CFI values increased with more missing data, indicating better fit. To explain this pattern of results, we examined the fit function minima for the hypothesized and baseline models separately. Figure 4.3 shows these values for the same conditions as in Figure 4.2. In the DF conditions, the fit function minimum for the hypothesized model stayed approximately the same with more missing data; however, it decreased for the baseline model with more missing data, especially for greater levels of misfit. The reason is that in the baseline model, which hypothesizes uncorrelated variables, the misspecification affected every part of the model, and was thus always entangled with the location of missing data. In the SF conditions, the fit function minima for both the hypothesized and the baseline models decreased with more missing data, but the rate of decrease for the hypothesized model was larger, especially for models with greater misfit.
As a result, in the SF conditions, CFI increased with more missing data.

Figure 4.3: Fit function minima of the hypothesized and baseline models for selected conditions in Study 1. For the conditions in this figure, the missing mechanism is MCAR, the population factor correlation is zero, and the number of correlated residuals is two.

Figure 4.4 examines whether the patterns found in Figure 4.2 extend to the case when the factor correlation is not zero. In the conditions shown in Figure 4.4, the missing data are still MCAR; the population model has one correlated residual and four variables with missing data; three values of the factor correlation (0, 0.4, and 0.8) are shown in separate panels. Note that the overall misfit is smaller in this figure compared to Figure 4.2, but the range of the y-axis is kept the same to ensure comparability of figures. Interestingly, when the population factor correlation was non-zero, the RMSEA values still barely changed with the amount of missing data in the DF conditions: even when the correlated residual was 0.4 and the factor correlation was 0.8, the RMSEA changed from 0.105 to 0.102 as the percentage of missing data changed from 0% to 50%. The finding that the value of the factor correlation in the population model had a small effect on the distortion in the incomplete data AFIs also generalized to other conditions of the study (see Supplementary Materials). Thus, even though misfit in the indicators of one factor can theoretically propagate across the factor correlation to affect the indicators of the other factor, this did not occur to a degree that would noticeably affect the AFIs.

Finally, Figure 4.5 shows the impact of different missing data mechanisms on the AFIs in selected conditions (two correlated residuals, four variables with missing data, and factor correlation of 0.4).
In these conditions, the largest change in the AFIs due to missing data occurred when the percentage of missing data was 50%, the missing mechanism was strong MAR, and the size of the correlated residuals was 0.4: the complete data RMSEA and CFI were 0.168 and 0.773, respectively, while the incomplete data RMSEA and CFI were 0.116 and 0.832, respectively. Overall, the patterns of change in the AFIs with more missing data were similar to those in Figure 4.2 and consistent across different missing mechanisms. However, the missing mechanism moderated the rate of change in the AFIs with the increasing percentage of missing data, although this effect was not always the same. For example, in the SF conditions, as the proportion of missing data increased from 0% to 20%, the RMSEA values in the weak MAR condition decreased at a faster rate than those in the MCAR and strong MAR conditions, but when the proportion of missing data increased from 20% to 50%, the RMSEA values for the weak MAR data decreased at a slower rate than those in the other two conditions. This effect of missing mechanism on the rate of change of the AFIs was similar in other study conditions not shown in Figure 4.5.

Figure 4.4: RMSEA and CFI for selected conditions in Study 1. For the conditions in this figure, the missing mechanism is MCAR. There is a single correlated residual in the population model, and the number of variables with missing data is four.

Figure 4.5: RMSEA and CFI for selected conditions in Study 1. For conditions in this figure, the population factor correlation is 0.4; the number of correlated residuals is two, and the number of variables with missing data is four.

Regression Analyses

We also conducted regression analyses to examine whether missing data percentage, missing data mechanism, factor correlation in the population model, degree of misfit (measured by the size of the correlated residuals), location of misfit, and the interaction between the missing data percentage and location of misfit can predict the absolute bias of the AFIs. To simplify the analyses, we held the number of correlated residuals at two and the number of variables with missing data at four (footnote 3). We coded the features of the missing data into factor or numeric variables, and describe these variables in Table 4.3.

Table 4.4 shows the full results of the regression analyses. Consistent with the results shown in Figures 4.4 and 4.5, missing data mechanism and factor correlation had very small effects on the absolute bias of the AFIs. Consistent with the results in Figures 4.2 and 4.3, there was an interaction between the percentage of missing data and the location of missing data. For RMSEA, when the location of missing data was DF, holding all other variables constant, the absolute bias did not change as missing data increased; however, when the location was SF, as missing data increased from 20% to 50%, the absolute bias, on average, increased by 0.012 units. For CFI, when the location was DF, holding all other variables constant, the absolute bias increased by 0.008 units on average; but when the location was SF, the absolute bias increased by 0.015 units.
In addition to the location and percentage of missing data, the degree of misfit can have an effect on the bias, although its effect was relatively small.

Footnote 3: The number of correlated residuals and the size of the correlated residuals both measure the degree of misfit; to simplify the analyses, we held the number of correlated residuals constant and included the size of the correlated residuals in the regression. Similarly, because the number of variables with missing data and the percentage of missing data both measure the amount of missing data, we only included the percentage of missing data in the analyses.

Table 4.3: Variables in the Regression Analyses

Study 1
  Missing: percentage of missing data in each variable with missing data. Missing is considered a categorical variable that equals 20 or 50 when the percentage missing is 20% or 50%, respectively; 20 is the reference group.
  Location: location of misfit relative to location of missing data. Location is considered a categorical variable that is either SF or DF, representing the SF or DF condition.
  Mechanism: missing data mechanism. Mechanism is considered a numerical variable that equals 0, 1 or 2 when the missing data mechanism is MCAR, weak MAR or strong MAR, respectively.
  FactorCor: factor correlation in the population model. FactorCor is considered a numerical variable that equals 0, 1 or 2 when the factor correlation is 0, 0.4 or 0.8, respectively.
  Misfit: size of the correlated residuals. Misfit is considered a numerical variable that equals 0, 1, 2, 3, or 4 when the size of the correlated residuals is 0, 0.1, 0.2, 0.3, or 0.4, respectively.

Study 2
  Missing: percentage of missing data in each variable with missing data. Missing is considered a categorical variable that equals 20 or 50 when the percentage missing is 20% or 50%, respectively; 20 is the reference group.
  Pattern: whether the missing data contain the minimum or maximum number of missing data patterns. Pattern is considered a categorical variable that is either min or max, representing the conditions with either the minimum or maximum number of missing data patterns; min is the reference group.
  Mechanism: missing data mechanism. Mechanism is considered a numerical variable that equals 0, 1 or 2 when the missing data mechanism is MCAR, weak MAR or strong MAR, respectively.
  Misfit: size of the factor correlation. Misfit is considered a numerical variable that equals 0, 1, 2, 3, 4, 5, 6, 7 or 8 when the factor correlation is 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3 or 0.2, respectively.

Table 4.4: Results of the Regression Analyses

Study 1

DF as the reference group for the Location variable:
  BIAS_RMSEA = −0.010 + 0.000 Missing + 0.010 Location + 0.000 Mechanism + 0.000 FactorCor + 0.005 Misfit + 0.012 (Location)(Missing)
  BIAS_CFI = −0.010 + 0.008 Missing + 0.001 Location + 0.000 Mechanism + 0.000 FactorCor + 0.009 Misfit + 0.006 (Location)(Missing)

SF as the reference group for the Location variable:
  BIAS_RMSEA = −0.001 + 0.012 Missing + 0.010 Location + 0.002 Mechanism + 0.001 FactorCor + 0.005 Misfit + 0.012 (Location)(Missing)
  BIAS_CFI = −0.011 + 0.015 Missing + 0.001 Location + 0.000 Mechanism + 0.000 FactorCor + 0.009 Misfit − 0.006 (Location)(Missing)

Study 2
  BIAS_RMSEA = −0.010 + 0.020 Missing + 0.004 Pattern + 0.001 Mechanism + 0.005 Misfit
  BIAS_CFI = −0.011 + 0.040 Missing + 0.008 Pattern + 0.001 Mechanism + 0.012 Misfit

Note: The coding of the variables is explained in Table 4.3.

Overall, Study 1 shows that the biggest distortion in the incomplete data AFIs relative to complete data AFIs occurred when the misspecification was large and when the amount of missing data was high (e.g., greater number of variables with missing data, greater proportion of missing data on those variables). Another important variable was the location of misfit relative to the location of missing data: in the DF conditions, hardly any change was observed in the RMSEA.
Finally, the missing data mechanism can also be an important factor that determines how missing data affect the AFIs, but its effects were more subtle and more complicated.

4.2.2 Study 2

Figure 4.6 shows the RMSEA and CFI values for conditions with six variables with missing data. Because Study 2 includes conditions with greater misfit than Study 1 (see Table 4.2), the y-axis range in Figure 4.6 is greater than that in Figures 4.2-4.5. In this study, the hypothesized model was always the one-factor model, whereas the population model was a two-factor model without correlated residuals but with a factor correlation of varying size. In this case, the amount of misfit was directly related to the factor correlation in the population model; the x-axis in Figure 4.6 shows the population factor correlation varying from least (correlation of 1) to most misfit (correlation of 0.2).

Figure 4.6 illustrates that the impact of missing data on AFIs can be quite large. In all panels of this figure, the curves corresponding to complete versus incomplete data RMSEA and CFI are much more widely separated than those in Study 1. This is due both to more severe model misspecification and to the inclusion of a condition where half (6 out of 12) of the variables contain missing data. In the most extreme case, in the weak MAR conditions with the maximum number of missing data patterns and the factor correlation of 0.2, the RMSEA changed from 0.192 to 0.114 (a 40% decrease) and CFI changed from 0.538 to 0.758 (a 41% increase) as the percentage of missing data per variable increased from 0% to 50%. In several of the shown conditions, the RMSEA crossed the recommended cutoff of 0.08 and CFI crossed the recommended cutoff of 0.9 as the percentage of missing data increased.
For example, in the strong MAR conditions with the maximum number of missing data patterns and the factor correlation of 0.7, RMSEA decreased from 0.106 to 0.070 and CFI increased from 0.872 to 0.913 as missing data increased from 0% to 50%.

Figure 4.6: RMSEA and CFI for selected conditions in Study 2. For the conditions shown in this figure, the number of variables with missing data is six.

Several other patterns of results in Study 2 are noteworthy. First, the RMSEA decreased and CFI increased with missing data in all conditions. This pattern can be explained by the overlap between the location of misfit and the location of missing data. The misfit in Study 2 always affected all variables. This pattern was consistent with the results in Study 1, where RMSEA always decreased and CFI always increased with missing data in the SF conditions. Second, and consistent with Study 1, the missing mechanism affected the rate of change in the AFIs as the percent of missing data increased, and this effect of missing mechanism was different for different missing data percentage changes. Finally, as the percent of missing data increased, the RMSEA decreased and CFI increased at a faster rate in the conditions where the number of missing patterns was maximum. For example, for the weak MAR conditions with 0.2 factor correlation and six variables containing missing data (see Figure 4.6), when the number of missing patterns was maximum, the RMSEA decreased from 0.192 to 0.139 (a 27% decrease) and CFI increased from 0.538 to 0.679 (a 26% increase) as the missing data percentage changed from 0% to 20%. However, when the number of missing patterns was minimum, the RMSEA decreased from 0.192 to 0.154 (a 19% decrease) and CFI increased from 0.538 to 0.637 (an 18% increase).
These patterns of results also hold in the conditions not shown in Figure 4.6; however, as the omitted conditions contain fewer variables (two or four) with missing data, the effects were smaller (see Supplementary Materials).

Regression Analyses

For the regression analyses, we examined whether missing data percentage, missing data mechanism, missing data pattern and degree of misfit (measured by the size of the factor correlation) can affect the absolute bias of the AFIs. To simplify the analyses, we held the number of variables with missing data at six (footnote 4). Table 4.3 shows how we coded the features of the missing data in Study 2. Table 4.4 shows the results of the analyses. The regression analyses showed that the missing data mechanism and pattern had very little effect on the AFIs' bias. On the other hand, the percentage of missing data had the largest effect on the bias of the AFIs. For RMSEA, holding all other variables constant, the absolute bias, on average, increased by 0.021 units as missing data increased from 20% to 50%; for CFI, the bias increased by 0.040 units as missing data increased from 20% to 50%. The degree of misfit also had an effect on the absolute bias. For RMSEA, holding all other variables constant, the absolute bias increased by 0.005 units as the degree of misfit increased by one unit (i.e., as the factor correlation decreased by 0.1 units; see Table 4.3); for CFI, the bias increased by 0.012 units as the degree of misfit increased by one unit. These results imply that as the factor correlation decreased from 1 to 0.2, the RMSEA and CFI bias, on average, increased by 0.005×8 = 0.04 and 0.012×8 = 0.096 units, respectively; these changes were very substantial when considered on the metrics for RMSEA and CFI.

Footnote 4: Consistent with the regression analyses for Study 1, because the number of variables with missing data and the percentage of missing data both measure the amount of missing data, we only included the percentage of missing data in the analyses.
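The implied change in bias across the full range of misfit can be checked against the Study 2 equations in Table 4.4. The sketch below assumes that the categorical predictors enter as 0/1 dummies (Missing = 1 for 50%, Pattern = 1 for the maximum number of patterns), which Table 4.3 suggests but does not state explicitly; the function name `study2_bias` is hypothetical.

```python
def study2_bias(missing50, pattern_max, mechanism, misfit):
    """Predicted absolute bias from the Study 2 equations in Table 4.4.

    missing50   : 1 if 50% missing, 0 if 20% (reference); assumed dummy coding
    pattern_max : 1 if maximum number of patterns, 0 if minimum; assumed dummy coding
    mechanism   : 0 = MCAR, 1 = weak MAR, 2 = strong MAR
    misfit      : 0..8 for factor correlations 1, 0.9, ..., 0.2
    """
    rmsea = (-0.010 + 0.020 * missing50 + 0.004 * pattern_max
             + 0.001 * mechanism + 0.005 * misfit)
    cfi = (-0.011 + 0.040 * missing50 + 0.008 * pattern_max
           + 0.001 * mechanism + 0.012 * misfit)
    return rmsea, cfi


# Change in predicted bias as the factor correlation goes from 1 to 0.2
# (misfit from 0 to 8), holding the other predictors fixed.
low = study2_bias(1, 1, 1, 0)
high = study2_bias(1, 1, 1, 8)
delta_rmsea = high[0] - low[0]
delta_cfi = high[1] - low[1]
```

Because the model has no interaction terms, the two differences do not depend on the values chosen for the other predictors.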
In short, the patterns of results from the regression analyses were consistent with those shown in Figure 4.6.

Overall, Study 2 further illustrated that the incomplete data AFIs are affected by many characteristics of missing data, such as the amount of missing data (in terms of the number of variables and the percentage of missing data per variable), missing data mechanism, and, new in this study, the number of missing data patterns. Because the misspecification was greater in this study, and because the type of misspecification (wrong number of factors) was such as to affect all variables, the patterns of results illustrating differences between complete and incomplete data AFIs were also more dramatic than they were in Study 1.

4.3 Discussion

The results from our simulation studies show that the impact of missing data on the values of the AFIs varies from trivial to quite dramatic. If the misfit in the hypothesized model is highly localized (e.g., an omitted correlated residual) and pertains to variables that are fully observed, the impact on the RMSEA can be almost zero. On the other hand, in the most extreme case when the hypothesized model is misspecified globally (a one-factor model is fit to two-factor data), the number of variables with missing data is high, and the missing mechanism is MAR, we have found that in some conditions the AFIs changed by as much as 40% when the missing data increased from 0% to 50%. Across all conditions, the minimum of the fit function for the hypothesized model either stayed the same or decreased with the presence of missing data. While we are not yet able to offer an analytical proof that this pattern of results holds in general, the main pattern we have observed is the following: with more missing data, more information is lost about the misfit contained in the data unless the model is correctly specified or the variables involved in the misspecification are distinct from variables with missing data.
Since RMSEA is a direct function of the fit function minimum for the hypothesized model, RMSEA will generally stay the same or decrease with more missing data. We are not certain whether it is possible for RMSEA to increase with more missing data, but we have not been able to create an example where it does so.

The pattern for the CFI, which involves a comparison to the fit of the baseline model, was more complex. We found that CFI tended to indicate worse fit (decreased) when the location of misfit was localized to a few indicators of one factor, and variables with missing data loaded on a different factor. In this case, with more missing data, the fit function minimum for the baseline model decreased more relative to the fit function minimum for the hypothesized model, leading to an overall slight decrease in the CFI (see Figures 4.2 and 4.3). When misfit was entangled with the location of missing data, CFI, like RMSEA, always indicated better fit with missing data.

The percentage of missing data and the number of variables with missing data had a predictable effect on the AFIs: as either increased, the differences relative to complete data AFIs became more dramatic. Increasing the number of missing data patterns tended to further distort the AFIs. The impact of the missing data mechanism was more nuanced: it seemed to primarily affect the initial distortion in the AFIs when moving from complete data to 20% missing data.

Our findings have direct implications for researchers who use FIML to handle missing data. While evaluating approximate model fit using AFIs is always nuanced and subjective, existing tentative cutoffs have all been developed with complete data [8, 17]. When researchers use FIML estimation to handle missing data, they should be cautious when using any firm cutoffs for AFIs.
AFIs computed by the current software following FIML estimation do not necessarily reflect what researchers probably want to know: what the amount of misfit would have been had the data been complete; in other words, the incomplete data AFIs are not consistent estimates of the population values of the complete data AFIs. Therefore, when compared to the existing cutoffs, the incomplete data AFIs may be inaccurate indicators of the amount of model misfit. In particular, the RMSEA, arguably the most popular index of approximate fit, may underestimate the amount of misfit with missing data. As shown in our simulation studies, the AFIs can cross the recommended cutoff points as missing data increase, and thus researchers could draw opposite conclusions about model fit depending on the amount of missing data present.

Chapter 5
Alternative Approaches for Computing AFIs

An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.
John Wilder Tukey

As seen in the previous three chapters, the popular FIML estimation method can produce very distorted AFIs. Based on our simulation studies, in some cases, the population AFIs changed by as much as 40% when the percentage of missing data increased from 0% to 50%. In this chapter, we propose an alternative approach for computing AFIs following FIML estimation, which we refer to as the FIML-corrected or FIML-C approach. The FIML-C approach involves modifying the current computations of RMSEA and CFI so that they estimate what these AFIs would have been had the data been complete.

In addition to the FIML-C approach, we study another AFI computation approach, which involves the use of the two-stage (TS) estimation.
In the typical TS procedure, the saturated model is fit to incomplete data in the first stage, and then the complete data fit function is minimized in the second stage, with the saturated model's estimates of means and covariances replacing the sample means and covariances in this fit function [43, 46]. Because the complete data fit function is used to compute the AFIs, if the missing mechanism is ignorable, they should approach the same population values as if the data had been complete.

For both FIML-C and TS approaches, we also propose a series of small sample corrections that are meant to improve performance in small samples. These developments are based on the earlier work that proposed similar corrections to AFIs in the context of nonnormal data [6, 7] and in the context of categorical data [34]. All of these corrections are derived through finding the expected value of the estimate of the population fit function minimum under a correctly specified model. In the last section of the chapter, we show the derivations of the small sample corrections for the FIML-C and TS approaches.

5.1 Alternative AFIs following FIML Estimation

The FIML-C approach allows us to approximate what the model fit would have been had there been no missing data; that is, FIML-C RMSEA and CFI approximate $\mathrm{RMSEA}_{\mathrm{ML}}$ and $\mathrm{CFI}_{\mathrm{ML}}$ (Equation 2.14) rather than $\mathrm{RMSEA}_{\mathrm{FIML}}$ and $\mathrm{CFI}_{\mathrm{FIML}}$ (Equation 2.15). This approach to computing AFIs is an extension of the approach proposed by Savalei [34] for AFIs for categorical data. While it is an approximation rather than an exact approach, Savalei [34] found that the approximation works well for mild and moderate model misspecifications.
We therefore also expect FIML-C AFIs to work well unless the model misspecifications are severe.

The FIML-C AFIs are computed according to the following equations:

$$\widehat{\mathrm{RMSEA}}_{\mathrm{FIML\text{-}C}} = \sqrt{\max\left(\frac{F_c(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}) - df/n}{df},\; 0\right)},$$

$$\widehat{\mathrm{CFI}}_{\mathrm{FIML\text{-}C}} = 1 - \frac{\max\left(F_c(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}) - df/n,\; 0\right)}{\max\left(F_c(\hat{\mu}_B, \hat{\Sigma}_B \mid \tilde{\mu}, \tilde{\Sigma}) - df_B/n,\; F_c(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}) - df/n,\; 0\right)}, \qquad (5.1)$$

where the subscript B stands for the baseline model, and $F_c(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma})$ is the complete data ML fit function in Equation 2.2, with the saturated FIML estimates $\tilde{\mu}$ and $\tilde{\Sigma}$ used in place of sample means and sample covariance matrix and evaluated at the structured FIML estimates $\hat{\mu}$ and $\hat{\Sigma}$. By evaluating the complete data ML fit function at the FIML parameter estimates, we approximate what the complete data ML fit function minimum would have been had there been no missing data.

However, Equation 5.1 may not suffice in small samples. Because the FIML parameter estimates are obtained using a different fit function than $F_c$, the degrees of freedom may no longer be the best estimate of the expected value of $F_c$ evaluated at the FIML estimates. To correct for this bias, we can incorporate small sample corrections as follows:

$$\widehat{\mathrm{RMSEA}}_{\mathrm{FIML\text{-}C},s} = \sqrt{\max\left(\frac{F_c(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}) - k/n}{df},\; 0\right)};$$

$$\widehat{\mathrm{CFI}}_{\mathrm{FIML\text{-}C},s} = 1 - \frac{\max\left(F_c(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}) - k/n,\; 0\right)}{\max\left(F_c(\hat{\mu}_B, \hat{\Sigma}_B \mid \tilde{\mu}, \tilde{\Sigma}) - k_B/n,\; F_c(\hat{\mu}, \hat{\Sigma} \mid \tilde{\mu}, \tilde{\Sigma}) - k/n,\; 0\right)}, \qquad (5.2)$$

where $k$ and $k_B$ are the FIML-C correction terms for the hypothesized and baseline models, respectively [6, 7, 34]. In the most general form,

$$k = \mathrm{tr}\left(U_m W_m^{-1} W_c W_m^{-1} U_m \Gamma\right), \qquad (5.3)$$

where the subscript m indicates the matrix is related to the FIML fit function for the missing data and the subscript c indicates the matrix is related to the fit function for the complete data but evaluated at FIML estimates. Section 5.2.3 provides the derivation for this general equation for $k$. For each unique component in Equation 5.3, there are different options for estimating the matrix with sample data, resulting in different ways of estimating $k$. In this chapter, we examine six ways of estimating $k$.
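As a rough illustration of Equations 5.1 and 5.2, the following Python sketch computes the FIML-C RMSEA and CFI from quantities that are assumed to be already available: the complete data fit function values for the hypothesized and baseline models, their degrees of freedom, n, and (optionally) the correction terms. The function names are hypothetical, and returning a CFI of 1 when the denominator is zero is a convention assumed here rather than something stated in the text.

```python
import math

def rmsea_fimlc(fc, df, n, k=None):
    """FIML-C RMSEA. With k=None the degrees of freedom are subtracted
    (Equation 5.1); otherwise the correction term k is used (Equation 5.2)."""
    c = df if k is None else k
    return math.sqrt(max((fc - c / n) / df, 0.0))

def cfi_fimlc(fc, df, fc_b, df_b, n, k=None, k_b=None):
    """FIML-C CFI for the hypothesized model (fc, df) against the
    baseline model (fc_b, df_b), with optional corrections k and k_b."""
    c = df if k is None else k
    c_b = df_b if k_b is None else k_b
    num = max(fc - c / n, 0.0)
    den = max(fc_b - c_b / n, fc - c / n, 0.0)
    # Assumed convention: no estimated misfit in either model gives CFI = 1.
    return 1.0 if den == 0.0 else 1.0 - num / den
```

For instance, `rmsea_fimlc(0.5, 50, 1000)` evaluates the uncorrected index for a fit function value of 0.5 with 50 degrees of freedom and n = 1000; passing `k=...` switches to the small sample version.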
Table 5.1 lists the equations for these six versions of computations of $k$, and explains the components in the equations for $k$. In the following paragraphs, we explain each component in Equation 5.3 as well as the six versions of $k$ listed in Table 5.1. To make it easier to refer to the different FIML-C versions, we use FIML-C V0 to denote the FIML-C version without small sample correction, and use FIML-C V1-V6 to denote the six FIML-C versions with small sample corrections.

The matrix $W_m$ in Equation 5.3 is the population weight matrix used in the FIML estimation, or the "FIML weight matrix." The FIML weight matrix can be thought of as the information matrix for the saturated model. This information matrix can be observed or expected. However, for MAR data, the observed information matrix is the only asymptotically unbiased estimate [33]. Therefore, the observed information matrix is used for all versions of $W_m$ (see "Estimate of FIML Weight Matrix" in Table 5.1). In addition, at the sample level, $W_m$ can be evaluated at different sample estimates depending on the software used. Some software, such as EQS, evaluates the weight matrix at the hypothesized model estimates ($\hat{\Sigma}$, $\hat{\mu}$), whereas other software, such as Mplus, evaluates it at the saturated model estimates ($\tilde{\Sigma}$, $\tilde{\mu}$) [42]. In lavaan, which is the software of our choice, by default, the weight matrix is evaluated at the hypothesized model estimates, but it can be changed to the saturated model estimates by setting mimic="Mplus". Although some researchers suggest the use of hypothesized model estimates over the saturated model estimates [42], there is no consensus on which one is better; therefore, we vary this option across the versions.
For FIML-C V1, V2, V4 and V5 in Table 5.1, the weight matrix is evaluated at the hypothesized model estimates (denoted as $\hat{W}_m$); for FIML-C V3 and V6, the weight matrix is evaluated at the saturated model estimates (denoted as $\tilde{W}_m$).

The matrix $U_m$ is the residual weight matrix:

$$
U_m = W_m - W_m \hat{\Delta} (\hat{\Delta}' W_m \hat{\Delta})^{-1} \hat{\Delta}' W_m, \tag{5.4}
$$

where $\hat{\Delta}$ is the matrix of model derivatives, always evaluated at the hypothesized model estimates. Since $U_m$ is a function of $W_m$, we use $\hat{U}_m$ to denote the estimate of the residual weight matrix when $\hat{W}_m$ is substituted in Equation 5.4 (as in FIML-C V1, V2, V4, and V5), and use $\tilde{U}_m$ when $\tilde{W}_m$ is substituted (as in FIML-C V3 and V6; see "Estimate of the Residual Weight Matrix" in Table 5.1).

The matrix $\Gamma$ is the asymptotic covariance matrix of the saturated FIML estimates. Without the normality assumption, $\Gamma$ can be estimated using the fourth-order moments of the data, or it can be estimated using the sandwich method involving a triple product: $\Gamma = W_m^{-1} V_m W_m^{-1}$, where $V_m$ is the FIML first order information matrix, and $W_m$ and $V_m$ are both evaluated at the saturated model estimates. In the special case when we assume normality and the hypothesized model is correctly specified, $\Gamma_0 = W_{m,0}^{-1}$ (i.e., asymptotically), and the correction simplifies to

$$
k = \operatorname{tr}(W_c W_m^{-1} U_m W_m^{-1}). \tag{5.5}
$$

Because we do not assume the model is correct when evaluating fit indices, this simplification is only an approximation even if we have normal data. We include this variation of $\Gamma$ or $k$ in our research. FIML-C V4-V6 in Table 5.1 assume normality and a correctly specified model, and use Equation 5.5 for $k$, whereas FIML-C V1-V3 assume neither normality nor a correctly specified model, and use Equation 5.3.

Lastly, the matrix $W_c$ is the complete data weight matrix, which is the information matrix based on the complete data fit function. Similar to $W_m$, we can evaluate $W_c$ at either the FIML hypothesized or the saturated model estimates.
Unlike $W_m$, both the observed and the expected information versions of this matrix are asymptotically unbiased estimates of $W_c$. When the saturated model estimates are used for $W_c$, the observed and expected versions are the same; therefore, there are only three options. For notation, we use $\hat{W}_c^{OBS}$ (as in FIML-C V1 and V4 in Table 5.1) to denote the observed information matrix evaluated at the hypothesized model estimates, $\hat{W}_c^{EXP}$ (as in FIML-C V2 and V5) to denote the expected information matrix evaluated at the hypothesized model estimates, and $\tilde{W}_c$ for the observed or expected information matrix evaluated at the saturated model estimates (as in FIML-C V3 and V6; see "Estimate of the Complete Data Weight Matrix" in Table 5.1).

In the above paragraphs, we explained in detail the different computation versions for $k$. However, in order to compute the CFI, we also need to compute $k_B$, the correction term for the baseline model. The variations for $k_B$ are the same as the variations for $k$ except that in the case of $k_B$, the "hypothesized model" is the baseline model (see Table 5.1).¹

5.1.1 Population Limits for FIML-C AFIs

Because the correction terms $k$ and $k_B$ stay relatively the same as $n$ increases, the population limits of these AFIs are the same regardless of whether small sample corrections are incorporated in the computation of the sample FIML-C AFIs. The population limits of Equations 5.1 and 5.2 are given by:

$$
\mathrm{RMSEA}_{\text{FIML-C}} = \sqrt{\frac{F_c(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma)}{df}}; \qquad
\mathrm{CFI}_{\text{FIML-C}} = 1 - \frac{F_c(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma)}{F_c(\mu_{B,0m}, \Sigma_{B,0m} \mid \mu, \Sigma)}, \tag{5.6}
$$

where $\mu_{0m}$, $\Sigma_{0m}$, $\mu_{B,0m}$, and $\Sigma_{B,0m}$ are the population limits of the corresponding FIML model-implied means and covariance matrices. Comparing Equations 2.14 and 5.6, we can understand why the FIML-C approach is an approximation.
In general, $\mu_0 \neq \mu_{0m}$ and $\Sigma_0 \neq \Sigma_{0m}$; that is, when the model is misspecified, the FIML parameter estimates may have different population limits from the corresponding ML estimates for the complete data.

To put it another way, recall that the original FIML approach has two problems that result in AFIs changing with missing data: 1) the population fit function minima for complete and incomplete data are different, and 2) the "pseudo-parameters" for complete and incomplete data are different. The FIML-C approach can adjust for the differences in the fit function equations between incomplete and complete data, but it cannot adjust for the differences in the "pseudo-parameters" between incomplete and complete data. However, when the model is only slightly misspecified, the "pseudo-parameters" for complete and incomplete data should be similar: $\mu_0 \approx \mu_{0m}$ and $\Sigma_0 \approx \Sigma_{0m}$. Therefore, the FIML-C approximation should work well in situations where the degree of model misfit is low.

Table 5.1: Equations for $k$ and $k_B$ for FIML-C versions

Versions without the normality assumption

| Version | Structured model | Baseline model |
|---|---|---|
| FIML-C V1 | $k = \operatorname{tr}(\hat{U}_m \hat{W}_m^{-1} \hat{W}_c^{OBS} \hat{W}_m^{-1} \hat{U}_m \tilde{\Gamma})$ | $k_B = \operatorname{tr}(\hat{U}_{m,B} \hat{W}_{m,B}^{-1} \hat{W}_{c,B}^{OBS} \hat{W}_{m,B}^{-1} \hat{U}_{m,B} \tilde{\Gamma})$ |
| FIML-C V2 | $k = \operatorname{tr}(\hat{U}_m \hat{W}_m^{-1} \hat{W}_c^{EXP} \hat{W}_m^{-1} \hat{U}_m \tilde{\Gamma})$ | $k_B = \operatorname{tr}(\hat{U}_{m,B} \hat{W}_{m,B}^{-1} \hat{W}_{c,B}^{EXP} \hat{W}_{m,B}^{-1} \hat{U}_{m,B} \tilde{\Gamma})$ |
| FIML-C V3 | $k = \operatorname{tr}(\tilde{U}_m \tilde{W}_m^{-1} \tilde{W}_c \tilde{W}_m^{-1} \tilde{U}_m \tilde{\Gamma})$ | $k_B = \operatorname{tr}(\tilde{U}_{m,B} \tilde{W}_m^{-1} \tilde{W}_c \tilde{W}_m^{-1} \tilde{U}_{m,B} \tilde{\Gamma})$ |

Versions with the normality assumption

| Version | Structured model | Baseline model |
|---|---|---|
| FIML-C V4 | $k = \operatorname{tr}(\hat{W}_c^{OBS} \hat{W}_m^{-1} \hat{U}_m \hat{W}_m^{-1})$ | $k_B = \operatorname{tr}(\hat{W}_{c,B}^{OBS} \hat{W}_{m,B}^{-1} \hat{U}_{m,B} \hat{W}_{m,B}^{-1})$ |
| FIML-C V5 | $k = \operatorname{tr}(\hat{W}_c^{EXP} \hat{W}_m^{-1} \hat{U}_m \hat{W}_m^{-1})$ | $k_B = \operatorname{tr}(\hat{W}_{c,B}^{EXP} \hat{W}_{m,B}^{-1} \hat{U}_{m,B} \hat{W}_{m,B}^{-1})$ |
| FIML-C V6 | $k = \operatorname{tr}(\tilde{W}_c \tilde{W}_m^{-1} \tilde{U}_m \tilde{W}_m^{-1})$ | $k_B = \operatorname{tr}(\tilde{W}_c \tilde{W}_m^{-1} \tilde{U}_{m,B} \tilde{W}_m^{-1})$ |

Components of the equations

| Symbol | Definition |
|---|---|
| Estimate of the complete data weight matrix | |
| $\hat{W}_c^{OBS}$ | Observed information matrix, evaluated at hypothesized model estimates. |
| $\hat{W}_c^{EXP}$ | Expected information matrix, evaluated at hypothesized model estimates. |
| $\tilde{W}_c$ | Observed or expected information matrix, evaluated at saturated model estimates. |
| $\hat{W}_{c,B}^{OBS}$ | Observed information matrix, evaluated at baseline model estimates. |
| $\hat{W}_{c,B}^{EXP}$ | Expected information matrix, evaluated at baseline model estimates. |
| Estimate of the FIML weight matrix | |
| $\hat{W}_m$ | Observed information matrix, evaluated at hypothesized model estimates. |
| $\tilde{W}_m$ | Observed information matrix, evaluated at saturated model estimates. |
| $\hat{W}_{m,B}$ | Observed information matrix, evaluated at baseline model estimates. |
| Estimate of the residual weight matrix | |
| $\hat{U}_m$ | $\hat{U}_m = \hat{W}_m - \hat{W}_m \hat{\Delta} (\hat{\Delta}' \hat{W}_m \hat{\Delta})^{-1} \hat{\Delta}' \hat{W}_m$, where $\hat{\Delta}$ is the matrix of hypothesized model derivatives, evaluated at the hypothesized model estimates. |
| $\tilde{U}_m$ | $\tilde{U}_m = \tilde{W}_m - \tilde{W}_m \hat{\Delta} (\hat{\Delta}' \tilde{W}_m \hat{\Delta})^{-1} \hat{\Delta}' \tilde{W}_m$. |
| $\hat{U}_{m,B}$ | $\hat{U}_{m,B} = \hat{W}_{m,B} - \hat{W}_{m,B} \hat{\Delta}_B (\hat{\Delta}_B' \hat{W}_{m,B} \hat{\Delta}_B)^{-1} \hat{\Delta}_B' \hat{W}_{m,B}$, where $\hat{\Delta}_B$ is the matrix of baseline model derivatives, evaluated at the baseline model estimates. |
| $\tilde{U}_{m,B}$ | $\tilde{U}_{m,B} = \tilde{W}_m - \tilde{W}_m \hat{\Delta}_B (\hat{\Delta}_B' \tilde{W}_m \hat{\Delta}_B)^{-1} \hat{\Delta}_B' \tilde{W}_m$. |
| Estimate of the asymptotic covariance matrix of saturated model estimates | |
| $\tilde{\Gamma}$ | Without the normality and correct model assumptions, as in V1-V3, $\tilde{\Gamma} = \tilde{W}_m^{-1} \tilde{V}_m \tilde{W}_m^{-1}$, where $\tilde{V}_m$ is the FIML first order information matrix evaluated at the saturated model estimates. With these assumptions, $\tilde{\Gamma} = \tilde{W}_m^{-1}$, and thus V1-V3 are simplified to V4-V6. |

¹ We note that for FIML, regardless of what the hypothesized model is, the saturated model estimates stay the same. This explains why, for $k_B$, when $W_m$ is evaluated at the saturated model estimates, there is no subscript $B$ in Table 5.1.

5.1.2 Analytical Example for FIML-C Estimation

In this section, we use the same analytical examples presented in Chapter 3, Section 3.2 to show how RMSEA is computed under FIML-C. Recall that in the example in Section 3.2.1, the parameter values for incomplete data are the same as those for the complete data; that is, the fit function minima and RMSEAs for complete and incomplete data are different solely due to the differences in their equations.
As explained in the previous section, in this case, the $\mathrm{RMSEA}_{\text{FIML-C}}$ value obtained with the FIML-C approach exactly equals the $\mathrm{RMSEA}_{\text{ML}}$ for the complete data.

Now we explain in more detail the example in Section 3.2.2, where the parameter value changes with missing data. Recall that this example has one model parameter, $\psi$, and that under FIML estimation, $\psi = 0.7378$. In addition, the complete data fit function is $F_c(\mu_0, \Sigma(\theta_0) \mid \mu, \Sigma) = \log(2\psi) + \frac{1}{\psi} - 1$ (see Equation 3.9). Therefore, by substituting $\psi = 0.7378$ into the complete data fit function, we can obtain the FIML-C RMSEA as follows:

$$
\mathrm{RMSEA}_{\text{FIML-C}} = \sqrt{\frac{F_c(\mu_{0m}, \Sigma_{0m} \mid \mu, \Sigma)}{df}} = \sqrt{\frac{\log(2(0.7378)) + \frac{1}{0.7378} - 1}{2}} = 0.6101. \tag{5.7}
$$

Notice that this FIML-C value is closer to the complete data $\mathrm{RMSEA}_{\text{ML}} = 0.5887$ than is the FIML $\mathrm{RMSEA}_{\text{FIML}} = 0.5135$.

5.2 Alternative AFIs following TS Estimation

The TS approach is an alternative estimation method to FIML for incomplete data. The TS approach obtains the saturated model estimates $\tilde{\mu}$, $\tilde{\Sigma}$ in the first stage, and then, in the second stage, minimizes $F_c(\mu(\theta), \Sigma(\theta) \mid \tilde{\mu}, \tilde{\Sigma})$, the complete data fit function with the "EM" estimates replacing $\bar{x}$ and $S$. Under the saturated model, the TS and FIML approaches obtain the same parameter estimates, but under the structured model, the two approaches obtain different parameter estimates.

While the TS approach is theoretically less efficient than FIML [43], it has been shown in simulation studies to perform very similarly to FIML [14, 36, 43], and it has some advantages. Here, we focus on one such advantage: the TS approach naturally yields AFIs that approach desirable population values (i.e., those given by Equation 2.14) asymptotically. We will denote the TS estimates by $\breve{\theta}$, and the model-implied vector of means and covariance matrix obtained under this approach by $\breve{\mu}$ and $\breve{\Sigma}$.
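For the analytical example of Section 5.1.2, the arithmetic in Equation 5.7 can be checked directly. The sketch below is in Python rather than the R code used in this work, and the helper names are hypothetical. It evaluates the complete data fit function of Equation 3.9 at the FIML value $\psi = 0.7378$ with $df = 2$, and also at $\psi = 1$, where the derivative $1/\psi - 1/\psi^2$ of that function vanishes, to recover the complete data RMSEA.

```python
import math


def fc_example(psi):
    """Complete data fit function from Equation 3.9: log(2*psi) + 1/psi - 1."""
    return math.log(2.0 * psi) + 1.0 / psi - 1.0


def population_rmsea(f_value, df):
    """Population RMSEA: square root of the fit function value over df."""
    return math.sqrt(f_value / df)


DF = 2  # degrees of freedom appearing in Equation 5.7

# FIML-C: evaluate the complete data fit function at the FIML value of psi.
rmsea_fiml_c = population_rmsea(fc_example(0.7378), DF)

# Complete data ML: psi = 1 minimizes fc_example.
rmsea_ml = population_rmsea(fc_example(1.0), DF)

print(f"{rmsea_fiml_c:.4f}")  # 0.6101
print(f"{rmsea_ml:.4f}")      # 0.5887
```

Both printed values agree with the figures reported for this example.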
The TS AFIs without a small sample correction (denoted as TS V0) are as follows:

$$
\widehat{\mathrm{RMSEA}}_{\text{TS}} = \sqrt{\max\left(\frac{F_c(\breve{\Sigma}, \breve{\mu} \mid \tilde{\Sigma}, \tilde{\mu}) - \frac{df}{n}}{df},\; 0\right)}
$$

$$
\widehat{\mathrm{CFI}}_{\text{TS}} = 1 - \frac{\max\left(F_c(\breve{\Sigma}, \breve{\mu} \mid \tilde{\Sigma}, \tilde{\mu}) - \frac{df}{n},\; 0\right)}{\max\left(F_c(\breve{\Sigma}_B, \breve{\mu}_B \mid \tilde{\Sigma}, \tilde{\mu}) - \frac{df_B}{n},\; F_c(\breve{\Sigma}, \breve{\mu} \mid \tilde{\Sigma}, \tilde{\mu}) - \frac{df}{n},\; 0\right)}. \tag{5.8}
$$

The main difference between the FIML-C AFIs in Equation 5.1 and the TS AFIs in Equation 5.8 is that the TS approach involves minimizing the complete data fit function (which is minimized by $\breve{\theta}$), whereas the FIML-C approach uses an approximate minimum, obtained by evaluating the complete data fit function at the model-implied FIML estimates $\hat{\theta}$. This distinction has important theoretical consequences: as long as the missing data mechanism is ignorable, $\breve{\mu} \to \mu_0$ and $\breve{\Sigma} \to \Sigma_0$, so that the TS AFIs in Equation 5.8 have the population values given by Equation 2.14. Therefore, the TS AFIs naturally estimate what the fit would have been had the data been complete. In other words, at the population level, the TS approach provides an exact solution that is parallel to the complete data solution.

Although the TS approach provides an exact solution at the population level, at the finite sample level it still needs small sample corrections to improve the estimates of AFIs. The corrections below parallel the corrections proposed for nonnormal data [6, 7]. We define small sample corrected TS AFIs as follows:

$$
\widehat{\mathrm{RMSEA}}_{\text{TS},s} = \sqrt{\max\left(\frac{F_c(\breve{\Sigma}, \breve{\mu} \mid \tilde{\Sigma}, \tilde{\mu}) - \frac{c}{n}}{df},\; 0\right)}
$$

$$
\widehat{\mathrm{CFI}}_{\text{TS},s} = 1 - \frac{\max\left(F_c(\breve{\Sigma}, \breve{\mu} \mid \tilde{\Sigma}, \tilde{\mu}) - \frac{c}{n},\; 0\right)}{\max\left(F_c(\breve{\Sigma}_B, \breve{\mu}_B \mid \tilde{\Sigma}, \tilde{\mu}) - \frac{c_B}{n},\; F_c(\breve{\Sigma}, \breve{\mu} \mid \tilde{\Sigma}, \tilde{\mu}) - \frac{c}{n},\; 0\right)}, \tag{5.9}
$$

where $c$ and $c_B$ are the TS correction terms for the hypothesized and the baseline models, respectively. Here, $c = \operatorname{tr}(U_c \Gamma)$, where $U_c$ is the residual weight matrix obtained in the second stage and $\Gamma$ is the asymptotic covariance matrix of the saturated model estimates (see Section 5.2.4 for the detailed derivation).
More specifically, $U_c = W_c - W_c \breve{\Delta} (\breve{\Delta}' W_c \breve{\Delta})^{-1} \breve{\Delta}' W_c$, where $W_c$ and $\breve{\Delta}$ are the complete data weight matrix and model derivatives, respectively; $\Gamma = W_m^{-1} V_m W_m^{-1}$, where $W_m$ and $V_m$ are the FIML observed and first order information matrices, respectively, both obtained in the first stage of the TS method and evaluated at the saturated model estimates (denoted as $\tilde{W}_m$ and $\tilde{V}_m$; see Table 5.2).

Similar to the FIML-C correction terms, the TS correction terms can also be estimated in different ways. In our research, we examine two different computational versions of these corrections, which are shown in Table 5.2. In both versions, $\Gamma$ is evaluated at the saturated model estimates. The difference between the two versions is whether $U_c$ is evaluated at the hypothesized or the saturated model estimates. In TS V1, $U_c$ is evaluated at the hypothesized model estimates (denoted as $\hat{U}_c$), whereas in TS V2, $U_c$ is evaluated at the saturated model estimates (denoted as $\tilde{U}_c$; see Table 5.2).

Finally, for CFI estimates, we need to compute $c_B$, the correction term for the baseline model. The two variations for the computation of $c_B$ are the same as the variations for $c$ except that for $c_B$, the "hypothesized model" is the baseline model (see Table 5.2).

5.2.1 Population Values for TS AFIs

Because the correction terms $c$ and $c_B$ stay relatively the same as $n$ increases, the population limits of these AFIs are the same regardless of whether small sample corrections are incorporated in the computation of the sample TS AFIs. As $n \to \infty$ in Equations 5.8 and 5.9, the population limits of RMSEA and CFI are the following:

$$
\mathrm{RMSEA}_{\text{TS}} = \sqrt{\frac{F_c(\mu_{0TS}, \Sigma_{0TS} \mid \mu, \Sigma)}{df}}; \qquad
\mathrm{CFI}_{\text{TS}} = 1 - \frac{F_c(\mu_{0TS}, \Sigma_{0TS} \mid \mu, \Sigma)}{F_c(\mu_{B,0TS}, \Sigma_{B,0TS} \mid \mu, \Sigma)}, \tag{5.10}
$$

where $\mu_{0TS}$, $\Sigma_{0TS}$, $\mu_{B,0TS}$ and $\Sigma_{B,0TS}$ are the population limits for $\breve{\mu}$, $\breve{\Sigma}$, $\breve{\mu}_B$ and $\breve{\Sigma}_B$, respectively.
Theoretically, $\mu_{0TS}$, $\Sigma_{0TS}$, $\mu_{B,0TS}$ and $\Sigma_{B,0TS}$ equal $\mu_0$, $\Sigma_0$, $\mu_{B,0}$ and $\Sigma_{B,0}$ for the complete data; therefore, $\mathrm{RMSEA}_{\text{TS}}$ and $\mathrm{CFI}_{\text{TS}}$ should equal $\mathrm{RMSEA}_{\text{ML}}$ and $\mathrm{CFI}_{\text{ML}}$ for the complete data in Equation 2.14.

5.2.2 Analytical Example for TS Estimation

In this section, we use the same analytical examples in Chapter 3, Section 3.2 to demonstrate why the TS parameter value, the fit function minimum and the RMSEA for incomplete data are the same as those for the complete data. Recall that, in the two examples in Section 3.2, the population covariance matrix and mean vector for the two random variables $X$ and $Y$ are given by (see Equation 3.3):

$$
\Sigma = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix} \quad \text{and} \quad \mu = (0, 0)'.
$$

Now suppose there are 50% MCAR missing data on $Y$, the same as in the example in Section 3.2.1. In the first stage of the TS approach, we need to find parameter values for the saturated model under FIML estimation. Since the saturated model is always the correct model, the parameter values for the saturated model equal the true parameter values (as shown in the above equation) for ignorable missing data under FIML estimation. We can verify this by "manually" fitting a saturated model. One way to specify a saturated model in the covariance structure is the following:

$$
\Sigma_{sat}(\theta) = \begin{pmatrix} \alpha & \beta \\ \beta & \gamma \end{pmatrix} \quad \text{and} \quad \mu = (0, 0)',
$$

where the parameter vector is $\theta = (\alpha, \beta, \gamma)'$. In this case, the fit function we want to minimize becomes:

$$
\begin{aligned}
F_{MCAR}(\mu(\theta_{0m}), \Sigma(\theta_{0m}) \mid \mu, \Sigma, \phi)
&= q_1\left(\log\left|\Sigma_1(\theta_{0m}) \Sigma_1^{-1}\right| + \operatorname{tr}\left(\Sigma_1 \Sigma_1^{-1}(\theta_{0m})\right) + (\mu_1 - \mu_{0,1})' \Sigma_1^{-1}(\theta_{0m}) (\mu_1 - \mu_{0,1}) - p_1\right) \\
&\quad + q_2\left(\log\left|\Sigma_2(\theta_{0m}) \Sigma_2^{-1}\right| + \operatorname{tr}\left(\Sigma_2 \Sigma_2^{-1}(\theta_{0m})\right) + (\mu_2 - \mu_{0,2})' \Sigma_2^{-1}(\theta_{0m}) (\mu_2 - \mu_{0,2}) - p_2\right) \\
&= \frac{1}{2}\left(\log(4\alpha\gamma - 4\beta^2) + \frac{\gamma}{2\alpha\gamma - 2\beta^2} + \frac{\alpha}{2\alpha\gamma - 2\beta^2} - 2\right) + \frac{1}{2}\left(\log(2\gamma) + \frac{0.5}{\gamma} - 1\right).
\end{aligned}
$$

From here, it can be seen that if $\theta = (\alpha, \beta, \gamma)' = (0.5, 0, 0.5)'$, which are the true parameter values in Equation 3.3, then $F_{MCAR}(\mu(\theta_{0m}), \Sigma(\theta_{0m}) \mid \mu, \Sigma, \phi) = 0$. Therefore, in the first stage of the TS approach, we can obtain the true population covariance matrix for the complete data through fitting the saturated model.
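The claim that the saturated model fit function is zero at the true parameter values can also be checked numerically. The sketch below, in Python with a hypothetical function name `f_mcar`, codes the final expression for $F_{MCAR}$ given above term by term, then evaluates it at $\theta = (0.5, 0, 0.5)'$ and at three arbitrarily chosen nearby values.

```python
import math


def f_mcar(alpha, beta, gamma):
    """Population FIML fit function for the saturated model, as given above.

    First term: the pattern with both variables observed (weight 1/2).
    Second term: the pattern with one variable observed (weight 1/2).
    """
    det2 = 2.0 * alpha * gamma - 2.0 * beta ** 2
    both = math.log(2.0 * det2) + gamma / det2 + alpha / det2 - 2.0
    one = math.log(2.0 * gamma) + 0.5 / gamma - 1.0
    return 0.5 * both + 0.5 * one


# At the true values the function is zero.
print(f_mcar(0.5, 0.0, 0.5))

# At nearby values (chosen arbitrarily) the function is positive.
for theta in [(0.6, 0.0, 0.5), (0.5, 0.1, 0.5), (0.5, 0.0, 0.4)]:
    print(theta, f_mcar(*theta) > 0.0)
```

This only probes a few points and is not a proof that the true values are the unique minimizer, but it is consistent with the argument in the text.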
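In the second stage, we use the true population covariance matrix to minimize the complete data fit function. It is obvious that by doing so, we obtain the complete data fit function minimum and thus the complete data RMSEA: $\mathrm{RMSEA}_{\text{TS}} = \mathrm{RMSEA}_{\text{ML}} = 0.5887$.

The TS RMSEA for the second example in Section 3.2 can be shown to be the same as the complete data RMSEA using the same logic. In short, for ignorable missing data, the TS approach allows us to first estimate the true population covariance matrix for the complete data, which can then be used to estimate the complete data fit function minima and AFIs.

Table 5.2: Equations for $c$ and $c_B$ for TS versions

| Version | Structured model | Baseline model |
|---|---|---|
| TS V1 | $c = \operatorname{tr}(\hat{U}_c \tilde{\Gamma})$ | $c_B = \operatorname{tr}(\hat{U}_{c,B} \tilde{\Gamma})$ |
| TS V2 | $c = \operatorname{tr}(\tilde{U}_c \tilde{\Gamma})$ | $c_B = \operatorname{tr}(\tilde{U}_{c,B} \tilde{\Gamma})$ |

Components of the equations

| Symbol | Definition |
|---|---|
| Estimate of the residual weight matrix | |
| $\hat{U}_c$ | $\hat{U}_c = \hat{W}_c - \hat{W}_c \breve{\Delta} (\breve{\Delta}' \hat{W}_c \breve{\Delta})^{-1} \breve{\Delta}' \hat{W}_c$, where $\hat{W}_c$ is the complete data observed information matrix obtained in Stage 2, evaluated at the hypothesized model estimates from Stage 2, and $\breve{\Delta}$ is the matrix of the hypothesized model derivatives, also evaluated at the hypothesized model estimates. |
| $\tilde{U}_c$ | $\tilde{U}_c = \tilde{W}_c - \tilde{W}_c \breve{\Delta} (\breve{\Delta}' \tilde{W}_c \breve{\Delta})^{-1} \breve{\Delta}' \tilde{W}_c$, where $\tilde{W}_c$ is the complete data observed information matrix obtained in Stage 2, evaluated at the saturated model estimates. |
| $\hat{U}_{c,B}$ | $\hat{U}_{c,B} = \hat{W}_{c,B} - \hat{W}_{c,B} \breve{\Delta}_B (\breve{\Delta}_B' \hat{W}_{c,B} \breve{\Delta}_B)^{-1} \breve{\Delta}_B' \hat{W}_{c,B}$, where $\hat{W}_{c,B}$ is the complete data observed information matrix obtained in Stage 2, evaluated at the baseline model estimates from Stage 2, and $\breve{\Delta}_B$ is the matrix of the baseline model derivatives, also evaluated at the baseline model estimates. |
| $\tilde{U}_{c,B}$ | $\tilde{U}_{c,B} = \tilde{W}_c - \tilde{W}_c \breve{\Delta}_B (\breve{\Delta}_B' \tilde{W}_c \breve{\Delta}_B)^{-1} \breve{\Delta}_B' \tilde{W}_c$. |
| Estimate of the asymptotic covariance matrix of saturated model estimates | |
| $\tilde{\Gamma}$ | $\tilde{\Gamma} = \tilde{W}_m^{-1} \tilde{V}_m \tilde{W}_m^{-1}$, where $\tilde{W}_m$ and $\tilde{V}_m$ are the FIML observed and first order information matrices, respectively, both obtained in Stage 1 and evaluated at the saturated model estimates. |

5.2.3 Derivation of Small Sample Correction in FIML-C

Let $\tilde{\beta} = (\operatorname{vech}(\tilde{\Sigma})', \tilde{\mu}')'$ and let $\hat{\beta} = \beta(\hat{\theta}) = (\operatorname{vech}(\hat{\Sigma})', \hat{\mu}')'$.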
When we assume that the hypothesized model is true in the population (i.e., there exists a $\theta_0$ such that $\beta_0 = \beta(\theta_0)$), the following approximation holds (e.g., Shapiro [38], Yuan and Bentler [43]):

$$
\sqrt{n}(\tilde{\beta} - \hat{\beta}) = \sqrt{n}\, W_{m,0}^{-1} U_{m,0} (\tilde{\beta} - \beta_0) + o_p(1),
$$

where $W_{m,0}$ is the FIML information matrix and $U_{m,0} = W_{m,0} - W_{m,0} \Delta (\Delta' W_{m,0} \Delta)^{-1} \Delta' W_{m,0}$ is the FIML residual weight matrix, with $\Delta$ the matrix of model derivatives. We also have $\sqrt{n}(\tilde{\beta} - \beta_0) \to N(0, \Gamma_0)$ in distribution. When we further assume normality, $\Gamma_0 = W_{m,0}^{-1}$. Even though $\hat{\theta}$ does not minimize $F_c$, when the hypothesized model is correct, we can approximate

$$
\hat{F}_c = (\tilde{\beta} - \hat{\beta})' W_{c,0} (\tilde{\beta} - \hat{\beta}) + o_p(1) \approx (\tilde{\beta} - \beta_0)' U_{m,0} W_{m,0}^{-1} W_{c,0} W_{m,0}^{-1} U_{m,0} (\tilde{\beta} - \beta_0),
$$

where $W_{c,0}$ is the complete data information matrix, evaluated at the FIML parameter values. The distribution of $n\hat{F}_c$ can then be approximated by a mixture of independent one degree of freedom chi-square variates with weights given by the eigenvalues of $U_{m,0} W_{m,0}^{-1} W_{c,0} W_{m,0}^{-1} U_{m,0} \Gamma_0$, and the approximate expected value of $n\hat{F}_c$ is given by the trace of this matrix product, which is $k$ in Equation 5.3 (see also Table 5.1).

5.2.4 Derivation of Small Sample Correction in TS

Assuming the null hypothesis model is true, the chi-square test statistic in the second stage of the TS method is $T_{TS} = nF_c \approx n(\tilde{\beta} - \beta_0)' U_c (\tilde{\beta} - \beta_0)$, where $F_c$ is the minimum of the complete data normal theory fit function in the second stage, and $U_c$ is the residual weight matrix obtained in the second stage; that is, $U_c = W_c - W_c \breve{\Delta} (\breve{\Delta}' W_c \breve{\Delta})^{-1} \breve{\Delta}' W_c$, where $W_c$ and $\breve{\Delta}$ are the complete data normal theory weight matrix and model derivatives, respectively. Asymptotically, $\sqrt{n}(\tilde{\beta} - \beta_0) \sim N(0, \Gamma_0)$, where $\Gamma_0$ is the asymptotic covariance matrix of the parameter estimates for the saturated model obtained in the first stage.
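The step from asymptotic normality to a trace rests on a standard identity for quadratic forms. It is sketched here for a generic symmetric matrix $A$, a placeholder symbol introduced only for this illustration, which stands for $U_{c,0}$ in the TS case and for $U_{m,0} W_{m,0}^{-1} W_{c,0} W_{m,0}^{-1} U_{m,0}$ in the FIML-C case. Writing $z = \sqrt{n}(\tilde{\beta} - \beta_0)$, with $E(z) = 0$ and $\operatorname{Cov}(z) = \Gamma_0$ in the limit:

```latex
E(z' A z) = E\{\operatorname{tr}(A z z')\}
          = \operatorname{tr}\{A \, E(z z')\}
          = \operatorname{tr}(A \Gamma_0)
          = \sum_i \lambda_i ,
```

where the $\lambda_i$ are the eigenvalues of $A \Gamma_0$. Because each one degree of freedom chi-square variate has expected value one, a mixture of such variates weighted by the $\lambda_i$ has expected value $\sum_i \lambda_i$, which is the trace of the matrix product.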
Therefore, the distribution of $n\hat{F}_c$ can be approximated by a mixture of independent one degree of freedom chi-square variates with weights given by the eigenvalues of $U_{c,0}\Gamma_0$, and the approximate expected value of $n\hat{F}_c$ is given by the trace of this matrix product. Therefore, $c = \operatorname{tr}(U_{c,0}\Gamma_0)$.

Chapter 6

SEM AFIs under FIML-C and TS Estimations: Simulation Studies

"Bias and efficiency are the two most important statistical concepts when considering a parameter estimator. These two concepts provide the rationale for comparing different estimation methods. If method A generates less biased and more efficient parameter estimates than method B, then A is better than B. Because there is usually a reason for a statistical method to be developed in the first place, it might be hard to have a best method under all circumstances. But we might prefer method A if it yields better estimators than method B in most circumstances."
Ke-Hai Yuan, Xin Tong, Zhiyong Zhang, 2015

In this chapter, we present the results of two simulation studies that investigate the performance of the newly proposed FIML-C and TS AFIs and compare them with the FIML AFIs currently in use. These simulation studies are a follow-up to the simulations conducted in Chapter 4. Through these simulation studies, we aim to show that, relative to the complete data AFIs, FIML-C and TS AFIs are unbiased or less biased than FIML AFIs.

6.1 Design

The design of the two simulation studies is the same as that of the simulation studies in Chapter 4, except that more sample sizes are studied (see Table 4.1). There are four levels of sample size: n = 200, 500, 1000, 1000000. The conditions with n = 200, 500, 1000 are meant to simulate sample data with small, medium or large sample size; for these sample data conditions, we generated 1000 replications of normally distributed observations using the simulateData() function in the lavaan package [30] in R.
The conditions with n = 1000000 are meant to mimic the population so that FIML, FIML-C, and TS AFIs can be compared without the influence of sampling fluctuations; for these population conditions, we generated a single dataset of normally distributed observations.

In total, Studies 1 and 2 had 4320 and 1944 conditions, respectively.¹ For each condition, we computed the FIML RMSEA and CFI that are currently in use, the seven versions of the FIML-C AFIs, and the three versions of the TS AFIs.² For the population data conditions, we computed population bias by subtracting the complete data population values (i.e., $\mathrm{RMSEA}_{\text{ML}}$ or $\mathrm{CFI}_{\text{ML}}$ from Equation 2.14) from the corresponding incomplete data population values. For the sample data conditions (n = 200, 500, 1000), we computed the empirical bias and empirical standard error of the RMSEA or CFI estimates across the replications.

¹ Table 4.1 shows that for the simulation studies in Chapter 4, there were 1080 and 486 conditions for Studies 1 and 2, respectively. In the current simulation studies, we added 4 levels for sample size; therefore, there were 1080 × 4 = 4320 and 486 × 4 = 1944 conditions for Studies 1 and 2, respectively.

² The TS V0 AFIs were computed by setting estimator="two-stage" inside the cfa() function in lavaan and then inspecting the model fit to get the AFIs. All versions of FIML-C and small sample corrections to TS AFIs were implemented using custom R code that made use of lavaan's internal functions. Sample code for implementing different versions of the FIML-C and TS approaches can be found in the Supplementary Materials.

In addition, we calculated the root mean square error (RMSE) using the following equation:

$$
\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{1000} (\mathrm{AFI}_{i,simu} - \mathrm{AFI}_{\text{ML}})^2}{1000}},
$$

where $\mathrm{AFI}_{i,simu}$ is the $i$th simulated RMSEA or CFI value, and $\mathrm{AFI}_{\text{ML}}$ is either $\mathrm{RMSEA}_{\text{ML}}$ or $\mathrm{CFI}_{\text{ML}}$ from Equation 2.14.
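A minimal sketch of these three summaries for a single condition is given below, in Python rather than the R code used for the simulations. The function name `summarize` and the five replicate values are made up for illustration, and using the number of replications as the divisor for the standard error is an assumption of this sketch. Under that choice, the squared RMSE equals the squared bias plus the squared standard error exactly.

```python
import math
import statistics


def summarize(estimates, target):
    """Empirical bias, standard error and RMSE of AFI estimates around a target."""
    bias = statistics.fmean(estimates) - target
    # Assumed convention: divisor equals the number of replications.
    se = statistics.pstdev(estimates)
    rmse = math.sqrt(statistics.fmean([(x - target) ** 2 for x in estimates]))
    return bias, se, rmse


# Hypothetical RMSEA estimates from five replications, and a hypothetical
# complete data population value to compare against.
replicates = [0.052, 0.061, 0.047, 0.058, 0.066]
bias, se, rmse = summarize(replicates, target=0.050)

print(round(bias, 4), round(se, 4), round(rmse, 4))
print(math.isclose(rmse ** 2, bias ** 2 + se ** 2))
```

The final line checks the decomposition numerically; with a divisor of one less than the number of replications, the identity would hold only approximately.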
RMSE is a joint measure of the bias and efficiency of an estimator and provides information regarding the bias-variance trade-off associated with an estimator.

6.2 Results

We summarized the results using figures and regression analyses to demonstrate the main patterns observed across the two studies. Since the patterns of results for FIML had been discussed extensively in Chapter 4, we mainly focused on the patterns of results for FIML-C and TS in this chapter. We also provided the full simulation results for all conditions across the two studies in table format in the Supplementary Materials.

6.2.1 Population Behavior

Figures for Population AFIs

Figures 6.1 and 6.2 show the population RMSEA and CFI values (estimated from n = 1000000) under the FIML, FIML-C (V0) and TS (V0) approaches in selected conditions of Studies 1 and 2. Small sample corrections disappear asymptotically, and thus the results for FIML-C V1-V6 and TS V1-V2 are not shown here because they are equal to those for FIML-C (V0) and TS (V0).

As shown in Figures 6.1 and 6.2, at the population level, the FIML AFIs that are currently implemented in popular software tended to exhibit a relatively large bias relative to the corresponding complete data AFIs unless the model had perfect fit. An exception was the RMSEA values in the conditions where the location of misfit was different from that of missing data (i.e., in the DF conditions). In contrast, the TS AFIs were the same as the complete data AFIs in all conditions; in other words, the TS AFIs asymptotically approached the complete data population AFI values, as would be theoretically expected.

[Figure 6.1. Panels: (a) 20% Missing Data, (b) 20% Missing Data, (c) 50% Missing Data, (d) 50% Missing Data]

Figure 6.1: Population RMSEA and CFI (estimated from n = 1000000) for selected conditions in Study 1 comparing FIML, FIML-C and TS approaches. Complete data population RMSEA and CFI are also included for comparison. In these selected conditions, the number of variables with missing data is four, the number of correlated residuals is two, and the population factor correlation is zero. The population model is a two-factor model with varying sizes for the correlated residuals shown on the x-axis. The hypothesized model is a two-factor model without any correlated residuals.

[Figure 6.2. Panels: (a) 20% Missing Data, (b) 20% Missing Data, (c) 50% Missing Data, (d) 50% Missing Data]

Figure 6.2: Population RMSEA and CFI (estimated from n = 1000000) for selected conditions in Study 2 comparing FIML, FIML-C and TS approaches. Complete data population RMSEA and CFI are also included for comparison. In these selected conditions, there are six variables that have missing data. The population model is a two-factor model with varying sizes for the factor correlation shown on the x-axis. The hypothesized model is a one-factor model.

The FIML-C AFIs had very little bias in conditions with a smaller percentage of missing data (see Figures 6.1ab and 6.2ab). In conditions with a large percentage of missing data (i.e., 50% missing data), the population FIML-C AFIs were generally still a good approximation to the complete data AFIs, except when the mechanism was strong MAR and the degree of model misfit was high (see Figures 6.1cd and 6.2cd). There also appeared to be some dependence on the number of missing data patterns. For example, as shown in Figure 6.2cd, when the number of missing data patterns was small and the missing data were strong MAR, at a high degree of model misfit (i.e., when the complete data RMSEA and CFI were 0.192 and 0.538, respectively), the FIML-C RMSEA was 0.057 higher than the complete data RMSEA and the FIML-C CFI was 0.313 lower than the complete data CFI. In other words, in these conditions, the FIML-C AFIs indicated worse model fit relative to the corresponding complete data AFIs.
The poor performance of these FIML-C AFIs is due to the fact that, in these conditions, the "pseudo-parameters" (estimated from n = 1000000) under FIML were very different from those for the corresponding complete data (see Table 6.1 for an example). Recall that the FIML-C approach uses parameter estimates from FIML and therefore cannot correct for the differences in the FIML "pseudo-parameters" between complete and incomplete data.

Table 6.1: "Pseudo-Parameters" for Complete and Incomplete Data under FIML

| Factor Loadings for Complete Data | Factor Loadings for Incomplete Data |
|---|---|
| 0.243 | 0.662 |
| 0.287 | 0.616 |
| 0.154 | 0.590 |
| 0.416 | 0.589 |
| 0.200 | 0.615 |
| 0.407 | 0.642 |
| 0.749 | 1.372 |
| 0.634 | 0.921 |
| 0.805 | 1.142 |
| 0.845 | 1.220 |
| 0.693 | 0.890 |
| 0.772 | 1.309 |

Note: The condition presented in the table involves strong MAR data with the minimum number of missing data patterns. The population model is a two-factor model (6 indicators loading on each factor) with a factor correlation of 0.2. The hypothesized model is a one-factor model with 12 indicators.

Regression Analyses for Population AFIs

For the regression analyses of the population AFIs, we used the features of the missing data to predict the absolute bias of AFIs.³ We coded the features of the missing data according to Tables 4.3 and 6.2. For Study 1, the predictors included the estimation method, percentage of missing data, missing data mechanism, factor correlation in the population model and degree of misfit; for Study 2, the predictors included the estimation method, percentage of missing data, missing data pattern, missing data mechanism and degree of misfit.⁴

³ We used the absolute bias instead of the raw bias because, with the raw bias, a negative regression coefficient may mean either a decrease in the magnitude of the bias or an increase in the magnitude of the bias in the negative direction.

⁴ To simplify the analyses in Study 1, we held the number of correlated residuals at two, the number of variables with missing data at two and the location of missing data at SF. To simplify the analyses in Study 2, we held the number of variables at six.

Table 6.2: Additional Variables in the Regression Analyses

| Analysis | Variable | Description |
|---|---|---|
| Population AFIs | Est | Estimation method for AFIs. Est is considered a categorical variable that equals FIML, FIML-C or TS; FIML is the reference group. |
| Sample AFIs | Est | Estimation method for AFIs. Est is considered a categorical variable that equals FIML, FIML-C V3 or TS V2; FIML is the reference group. |
| Sample AFIs | Sample | Sample size. Sample is considered a categorical variable that equals 200 or 500; 200 is the reference group. |

Table 6.3: Results of the Regression Analyses for Bias in the Population

Study 1

$$
\mathrm{BIAS}_{\mathrm{RMSEA}} = 0.007 - 0.016\,\mathrm{Est}_{\text{FIML-C}} - 0.016\,\mathrm{Est}_{\text{TS}} + 0.004\,\mathrm{Missing} + 0.001\,\mathrm{Mechanism} + 0.000\,\mathrm{FactorCor} + 0.003\,\mathrm{Misfit}
$$

$$
\mathrm{BIAS}_{\mathrm{CFI}} = 0.005 - 0.014\,\mathrm{Est}_{\text{FIML-C}} - 0.016\,\mathrm{Est}_{\text{TS}} + 0.006\,\mathrm{Missing} + 0.000\,\mathrm{Mechanism} + 0.000\,\mathrm{FactorCor} + 0.004\,\mathrm{Misfit}
$$

Study 2

$$
\mathrm{BIAS}_{\mathrm{RMSEA}} = -0.004 - 0.016\,\mathrm{Est}_{\text{FIML-C}} - 0.019\,\mathrm{Est}_{\text{TS}} + 0.008\,\mathrm{Missing} + 0.001\,\mathrm{Pattern} + 0.001\,\mathrm{Mechanism} + 0.002\,\mathrm{Misfit}
$$

$$
\mathrm{BIAS}_{\mathrm{CFI}} = -0.001 - 0.029\,\mathrm{Est}_{\text{FIML-C}} - 0.038\,\mathrm{Est}_{\text{TS}} + 0.017\,\mathrm{Missing} + 0.000\,\mathrm{Pattern} + 0.003\,\mathrm{Mechanism} + 0.005\,\mathrm{Misfit}
$$

Note: The coding of the variables is explained in Tables 4.3 and 6.2.

The results of the regression analyses, presented in Table 6.3, showed that both the FIML-C and TS methods had less absolute bias relative to FIML, but the TS method showed a greater reduction in bias. Specifically, for RMSEA, holding all other variables constant, the bias of FIML-C was, on average, 0.016 units less than that of FIML in both studies; the bias of TS was, on average, 0.016 and 0.019 units less than that of FIML in Studies 1 and 2, respectively. For CFI, the bias of FIML-C decreased by 0.014 and 0.029 units in Studies 1 and 2, respectively; the bias of TS decreased by 0.016 and 0.038 units in Studies 1 and 2, respectively. In addition, holding other variables constant, the percentage of missing data and the degree of misfit had effects on the bias of AFIs, although their effects were relatively small.
All of these results were consistent with those observed in Figures 6.1 and 6.2.

Summary for Population AFIs

In summary, at the population level, the TS approach performed very well in all conditions. The FIML-C approach performed similarly well in most conditions, but for conditions with a large percentage of strong MAR data and a small number of patterns, the FIML-C approximation began to fail, producing population AFIs with values that departed from the corresponding complete data AFI values. We note that in the conditions where FIML-C tended to fail, the complete data RMSEA ranged from 0.149 to 0.191, and the complete data CFI ranged from 0.730 to 0.538.

6.2.2 Finite Sample Behavior

Figures for Sample AFIs

Figures 6.3-6.8 show selected results for the sample data conditions (i.e., n = 200, 500 and 1000). Figures 6.3-6.4 show the results for all computational versions of the proposed AFIs in selected conditions. Figures 6.5-6.8 provide additional results for the best performing versions. In order to present results from a variety of conditions, for each of Figures 6.3-6.8, we select different sets of conditions from either Study 1 or Study 2. The patterns of results presented in these figures generalize to those in the other conditions (see Supplementary Materials).

Figure 6.3 shows the empirical bias in selected sample data conditions for Study 1. Figure 6.4 shows the RMSE in selected sample data conditions in Study 2. Due to the large variability in both bias and RMSE values across the studied AFIs, the range of values on the y-axes in Figures 6.3 and 6.4 is very large. In these figures, we use color to indicate the versions of FIML-C and TS with large bias or RMSE in order to highlight the poorly performing versions; the well-performing versions are shown in grey (differences among the well-performing versions will be inspected more closely in Figures 6.5-6.8).

One noticeable pattern of results is that FIML-C V5 was one of the worst performing methods.
The FIML-C V5 CFI estimates tended to be negatively biased (see Figure 6.3), and both RMSEA and CFI estimates tended to have large RMSE values, especially at small sample sizes (see Figure 6.4). In particular, the bias of CFI under FIML-C V5 can be more than three times higher than that of other approaches, and the RMSE of both RMSEA and CFI can be more than four times higher than those under other approaches. The large RMSE values were mainly due to the large standard errors of the RMSEA and CFI estimates under FIML-C V5, which can be three times higher than those under other approaches (see Supplementary Materials).

Figure 6.3: Bias in the sample RMSEA and CFI estimates for selected conditions in Study 1 comparing FIML, FIML-C and TS approaches. In these conditions, the number of variables with missing data is four, the number of correlated residuals is two, the population factor correlation is zero, the percentage of missing data is 50%, and the location of misfit is the same as the location of missing data. The population model is a two-factor model with varying sizes for the correlated residuals shown on the x-axis. The hypothesized model is a two-factor model without any correlated residuals.

Figure 6.4: Root mean square error (RMSE) in the sample RMSEA and CFI estimates for selected conditions in Study 2 comparing FIML, FIML-C and TS approaches. In these conditions, the number of variables with missing data is six, the percentage of missing data is 50%, and the number of patterns is large. The population model is a two-factor model with varying sizes for the factor correlation shown on the x-axis. The hypothesized model is a one-factor model.

In addition to FIML-C V5, there were several other AFIs that did not perform very well. FIML-C V4 also produced AFIs with large RMSE values in some conditions. Although the RMSE values of FIML-C V4 were not as large as those of FIML-C V5, they were often larger than those of FIML AFIs, especially with small samples (n = 200; see Figure 6.4).
Further, both FIML-C V0 and TS V0 tended to produce large and similar bias values in some conditions. Specifically, the AFIs under both FIML-C V0 and TS V0 had relatively large bias in the direction of indicating worse fit when the sample size was small (i.e., n = 200) and the degree of model misfit was low (i.e., when complete data RMSEA was less than 0.08 and CFI greater than 0.95; see Figure 6.3). Both FIML-C V0 and TS V0 are estimation methods without small sample corrections, so their poor performance in small samples was expected. Finally, the AFIs under FIML-C V1 and V2 had noticeably large bias values in conditions where the degree of model misfit was relatively high. For example, in the strong MAR conditions where the complete data population RMSEA and CFI were around 0.12 and 0.87, respectively, the bias of the CFI and RMSEA estimates under FIML-C V1 and V2 was around 0.05, nearly as large as the bias under the original FIML approach. In conclusion, Figures 6.3 and 6.4 reveal that FIML-C V0, V1, V2, V4 and V5 as well as TS V0 did not perform well, and the top performing methods across all conditions were FIML-C V3, FIML-C V6, TS V1 and TS V2. FIML-C V3 and V6 were based on saturated model estimates of all matrices involved, either assuming normality (V6) or not assuming normality (V3; see Table 5.1).

The remaining figures compare bias and RMSE values among the four best-performing AFIs in a variety of selected conditions in Studies 1 and 2. Figures 6.5 and 6.6 show the bias values for the four best-performing AFIs in selected conditions, whereas Figures 6.7 and 6.8 show the RMSE values in the same conditions. The y-axis range in these figures is smaller to allow for better discrimination among the AFI estimates.

Regression Analyses for Sample AFIs

For the regression analyses, we picked the best performing FIML-C and TS versions (i.e., FIML-C V3 and TS V2) and compared them with FIML in terms of the absolute bias and the RMSE values.
The predictors are the same as those for the population AFIs (see Section 6.2.1), except that sample size was added as another predictor. Table 6.4 shows the results of the regression analyses. The results indicated that, relative to FIML, AFIs under FIML-C V3 and TS V2 had, on average, lower bias and RMSE; the decreases in bias and RMSE for FIML-C V3 and TS V2 were very similar, with TS V2 showing slightly larger decreases than FIML-C V3. For example, in Study 2, holding other variables constant, the RMSEA bias (RMSEA RMSE) for FIML-C V3 and TS V2 decreased by 0.025 (0.010) and 0.026 (0.011), respectively; the CFI bias (CFI RMSE) for FIML-C V3 and TS V2 decreased by 0.043 (0.012) and 0.050 (0.018), respectively. In addition, holding other variables constant, sample size and percentage of missing data had effects on the RMSE of AFIs, but their effects were in opposite directions. For example, in Study 2, as sample size increased from n = 200 to n = 500, the RMSE of RMSEA and CFI decreased by 0.007 and 0.016, respectively; on the other hand, as the percentage of missing data increased from 20% to 50%, the RMSE of RMSEA and CFI increased by 0.008 and 0.018, respectively. Overall, the patterns of results in the regression analyses were very consistent with those shown in the figures.

Table 6.4: Results of the Regression Analyses for Bias in the Finite Samples

Study 1
$\mathrm{BIAS}_{\mathrm{RMSEA}} = 0.017 - 0.015\,\mathrm{Est}_{\text{FIML-C-V3}} - 0.015\,\mathrm{Est}_{\text{TS-V2}} + 0.005\,\mathrm{Missing} + 0.000\,\mathrm{Mechanism} + 0.000\,\mathrm{FactorCor} + 0.001\,\mathrm{Misfit} - 0.002\,\mathrm{Sample}$
$\mathrm{BIAS}_{\mathrm{CFI}} = 0.009 - 0.013\,\mathrm{Est}_{\text{FIML-C-V3}} - 0.013\,\mathrm{Est}_{\text{TS-V2}} + 0.006\,\mathrm{Missing} + 0.000\,\mathrm{Mechanism} + 0.000\,\mathrm{FactorCor} + 0.004\,\mathrm{Misfit} - 0.003\,\mathrm{Sample}$
$\mathrm{RMSE}_{\mathrm{RMSEA}} = 0.021 - 0.006\,\mathrm{Est}_{\text{FIML-C-V3}} - 0.007\,\mathrm{Est}_{\text{TS-V2}} + 0.005\,\mathrm{Missing} + 0.000\,\mathrm{Mechanism} + 0.000\,\mathrm{FactorCor} - 0.001\,\mathrm{Misfit} - 0.006\,\mathrm{Sample}$
$\mathrm{RMSE}_{\mathrm{CFI}} = 0.009 - 0.003\,\mathrm{Est}_{\text{FIML-C-V3}} - 0.005\,\mathrm{Est}_{\text{TS-V2}} + 0.005\,\mathrm{Missing} + 0.000\,\mathrm{Mechanism} + 0.000\,\mathrm{FactorCor} + 0.006\,\mathrm{Misfit} - 0.009\,\mathrm{Sample}$

Study 2
$\mathrm{BIAS}_{\mathrm{RMSEA}} = 0.020 - 0.025\,\mathrm{Est}_{\text{FIML-C-V3}} - 0.026\,\mathrm{Est}_{\text{TS-V2}} + 0.009\,\mathrm{Missing} + 0.001\,\mathrm{Pattern} + 0.001\,\mathrm{Mechanism} + 0.002\,\mathrm{Misfit} - 0.002\,\mathrm{Sample}$
$\mathrm{BIAS}_{\mathrm{CFI}} = 0.016 - 0.043\,\mathrm{Est}_{\text{FIML-C-V3}} - 0.050\,\mathrm{Est}_{\text{TS-V2}} + 0.021\,\mathrm{Missing} + 0.000\,\mathrm{Pattern} + 0.004\,\mathrm{Mechanism} + 0.007\,\mathrm{Misfit} - 0.003\,\mathrm{Sample}$
$\mathrm{RMSE}_{\mathrm{RMSEA}} = -0.019 - 0.010\,\mathrm{Est}_{\text{FIML-C-V3}} - 0.011\,\mathrm{Est}_{\text{TS-V2}} + 0.008\,\mathrm{Missing} + 0.001\,\mathrm{Pattern} + 0.001\,\mathrm{Mechanism} + 0.001\,\mathrm{Misfit} - 0.007\,\mathrm{Sample}$
$\mathrm{RMSE}_{\mathrm{CFI}} = 0.006 - 0.012\,\mathrm{Est}_{\text{FIML-C-V3}} - 0.018\,\mathrm{Est}_{\text{TS-V2}} + 0.018\,\mathrm{Missing} - 0.001\,\mathrm{Pattern} + 0.004\,\mathrm{Mechanism} + 0.009\,\mathrm{Misfit} - 0.016\,\mathrm{Sample}$

Note: The coding of the variables is explained in Tables 4.3 and 6.2.

Figure 6.5: Bias in the sample RMSEA and CFI estimates for selected conditions in Study 1 comparing the best performing FIML-C and TS methods. Panels: (a) and (b) Different Factors Conditions; (c) and (d) Same Factor Conditions. In these conditions, the number of variables with missing data is four, the number of correlated residuals is two, the population factor correlation is zero, and the missing data mechanism is weak MAR. The population model is a two-factor model with varying sizes for the correlated residuals shown on the x-axis. The hypothesized model is a two-factor model without any correlated residuals.

Figure 6.6: Bias in the sample RMSEA and CFI estimates for selected conditions in Study 2 comparing the best performing FIML-C and TS methods. Panels: (a) and (b) Small Number of Patterns; (c) and (d) Large Number of Patterns. In these conditions, the number of variables with missing data is six and the missing data mechanism is strong MAR. The population model is a two-factor model with varying sizes for the factor correlation shown on the x-axis. The hypothesized model is a one-factor model.

Figure 6.7: Root mean square error (RMSE) in sample RMSEA and CFI for selected conditions in Study 1 comparing the best performing FIML-C and TS methods. Panels: (a) and (b) Different Factors Conditions; (c) and (d) Same Factor Conditions. In these conditions, the number of variables with missing data is four, the number of correlated residuals is two, the population factor correlation is zero, and the missing data mechanism is weak MAR. The population model is a two-factor model with varying sizes for the correlated residuals shown on the x-axis. The hypothesized model is a two-factor model without any correlated residuals.

Figure 6.8: Root mean square error (RMSE) in sample RMSEA and CFI for selected conditions in Study 2 comparing the best performing FIML-C and TS methods. Panels: (a) and (b) Small Number of Patterns; (c) and (d) Large Number of Patterns. In these conditions, the number of variables with missing data is six and the missing data mechanism is strong MAR. The population model is a two-factor model with varying sizes for the factor correlation shown on the x-axis. The hypothesized model is a one-factor model.

Summary for Sample AFIs

Overall, Figures 6.4-6.8 as well as the regression analyses showed that the best-performing AFI estimation methods had similar bias and RMSE values across most conditions. The two FIML-C AFIs (V3 and V6) produced almost identical bias and RMSE values across all conditions. However, FIML-C V3 and V6 performed considerably worse than TS V1 and V2 in some conditions. This pattern is most noticeable in the Study 2 conditions with a small number of patterns and 50% strong MAR missing data (see Figures 6.6ab and 6.8ab). Recall that in these conditions, at the population level, the FIML-C AFIs had large bias values, especially when the complete data AFIs had a high degree of misfit (see the bottom left panels in Figure 6.2cd); therefore, it is not surprising that at the sample level, the FIML-C AFIs also had large bias (see the right panels in Figure 6.6ab).
Finally, TS V2, which is based on the saturated model estimates (see Table 5.2), tended to outperform TS V1 in finite samples. For example, in Study 2 conditions with a small sample size (n = 200; see Figures 6.6 and 6.8), the TS V1 estimates of AFIs tended to have larger bias and RMSE values than the TS V2 estimates of AFIs; in particular, the bias differences between TS V1 and V2 can be as large as 0.08 and the RMSE differences can be as large as 0.12.

6.3 Discussion

In these two simulation studies, we have compared the FIML, FIML-C and TS approaches for computing AFIs. Recall that the FIML-C AFIs are an approximation where the complete data fit function, with saturated FIML estimates as input "data", is evaluated at the FIML parameter estimates. This fit function value is then used in the AFI equations as if it were the minimum. On the other hand, the TS AFIs are based on the actual complete data fit function minimum, with saturated FIML estimates as input. The advantage of TS AFIs is that their population values are the same as they would have been had there been no missing data, whereas the FIML-C AFIs approach these values only approximately, and can break down when misspecifications are large. Thus, the TS AFIs are theoretically superior. However, FIML is by far the most common estimation method for incomplete data, and having AFIs that are based on FIML estimates and work well is practically important.

At the population level, we found that in most conditions, both FIML-C and TS approaches performed very well, producing population AFIs with little or no bias relative to the complete data population AFIs.
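The distinction drawn above between evaluating the complete data fit function at the FIML estimates (FIML-C) and actually minimizing it (TS) can be made concrete with a toy calculation. The sketch below is only an illustration under strong simplifying assumptions: a two-variable model with a common variance and zero covariance, no mean structure, and made-up input values. It is not the lavaan-based implementation used in the studies, and the names `ml_fit_toy`, `theta_ts` and `theta_fiml` are hypothetical.

```python
import math


def ml_fit_toy(s11, s12, s22, theta):
    """Complete-data ML fit function F(S, Sigma(theta)) for a toy model.

    Toy model (an assumption for illustration): two variables with a common
    variance theta and zero covariance, so Sigma(theta) = theta * I.
    The mean structure is ignored to keep the sketch short.
    """
    p = 2
    log_det_s = math.log(s11 * s22 - s12 ** 2)
    log_det_sigma = p * math.log(theta)
    trace_term = (s11 + s22) / theta  # tr(S Sigma^{-1}) when Sigma = theta * I
    return log_det_sigma + trace_term - log_det_s - p


# Hypothetical saturated FIML estimates of the covariance matrix (the input "data").
s11, s12, s22 = 1.0, 0.4, 1.2

# TS-style value: the actual minimum of the complete-data fit function.
# For this toy model the minimizer has a closed form: theta = tr(S) / 2.
theta_ts = (s11 + s22) / 2
f_ts = ml_fit_toy(s11, s12, s22, theta_ts)

# FIML-C-style value: the same function evaluated at a FIML estimate, which
# was obtained by minimizing a different (incomplete-data) fit function.
theta_fiml = 1.25  # hypothetical FIML estimate
f_fimlc = ml_fit_toy(s11, s12, s22, theta_fiml)

print(f"TS-style minimum:   {f_ts:.4f}")
print(f"FIML-C-style value: {f_fimlc:.4f}")
```

Because both values come from the same function and the same input, the FIML-C-style value cannot fall below the TS-style minimum; the two coincide only when the FIML estimates happen to minimize the complete data fit function as well.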
However, in conditions with a large percentage of strong MAR data and a small number of missing data patterns, when the complete data AFIs showed a high degree of model misfit (i.e., RMSEA was greater than 0.15 and CFI less than 0.75), the FIML-C approach produced very biased AFIs whereas the TS approach still produced AFIs with no bias, as would be theoretically expected.

At the sample level, we can estimate FIML-C and TS AFIs with or without small sample corrections. Overall, we found that the FIML-C and TS AFIs without such corrections tended to have large bias in the direction of indicating worse fit when the sample size was small (i.e., n = 200) and the degree of model misfit was low (i.e., complete data RMSEA lower than 0.08 and CFI higher than 0.95). We evaluated several different computational versions of the small sample corrections. For the FIML-C approach, we examined six different corrections; for the TS approach, we examined two different corrections. Four corrected versions of the FIML-C AFIs (V1, V2, V4 and V5) and one version of the TS AFIs (V1) evaluated the relevant matrices (such as the normal-theory weight matrix and the residual weight matrix; see Tables 5.1 and 5.2) at structured (model-implied) estimates of µ and Σ, whereas the remaining versions (FIML-C V3 and V6, and TS V2) evaluated these matrices at the saturated model estimates. We found that versions using the saturated model estimates (FIML-C V3 and V6, and TS V2) greatly outperformed the versions using the structured model estimates. A possible explanation for this pattern of results is that, when the model is misspecified, structured model estimates can be far off and may result in negative definite estimates of the relevant matrices. In some replications, the negative definite matrices can result in negative correction terms, creating bias in the resulting corrected AFIs.
For example, based on our additional analyses, across replications, the baseline weight matrix $W_{m,B}$, when evaluated at the structured model estimates (i.e., $\hat{W}_{m,B}$), was negative definite in most conditions. As a result, the baseline correction term $k_B$ under FIML-C V1, V2, V4 and V5 was negative in some replications. For FIML-C V5 in particular, approximately 50% of $k_B$ values across replications were negative in conditions with a small sample size (i.e., n = 200).

It is interesting to compare FIML-C (with the best performing corrections, V3 or V6) and TS (with the best performing correction, V2). While these methods performed very similarly in most conditions, there were a small number of conditions where TS AFIs outperformed FIML-C AFIs. The reason why TS sometimes outperforms FIML-C is that FIML-C is an approximate method, and this approximation fails in a small number of conditions where the percentage of missing data is large (i.e., at least 50% missing data) and the degree of model misfit is high (i.e., when the complete data population RMSEA was greater than 0.15 and CFI less than 0.75). However, in such conditions, model fit is already very bad, and arguably it is less important that the FIML-C AFIs are approximately unbiased, as long as they also reflect very poor model fit, so that researchers' conclusions about the model remain unchanged.

Chapter 7
Conclusion and Overall Discussion

All generalizations are false, including this one.
Mark Twain

Older missing data techniques such as listwise deletion and mean substitution mainly aim to get past the missing data so that at least some analyses can be done. This is in sharp contrast with modern missing data techniques, whose goal is to deal with missing data effectively so that the results of data analyses are generally not affected by the missing data.
In other words, when researchers use modern techniques to handle missing data, they usually expect that their statistical analysis can estimate what the results would have been had there been no missing data. However, the first part of our dissertation work shows that this may not be the case under certain circumstances.

In the first part of our dissertation (i.e., Chapters 2 and 4), we showed that under one of the most popular modern missing data techniques, FIML estimation (which is usually the default method in current software for handling missing data), SEM AFIs such as the RMSEA and the CFI computed from incomplete data are often different from those for complete data. This discrepancy is not due to sampling fluctuations; that is, it occurs at the population level. We have provided a mathematical explanation for this phenomenon. Specifically, we have shown that, as with complete data, maximizing the log-likelihood with incomplete data is equivalent to minimizing a certain "incomplete data ML fit function", but this function turns out to be different from the complete data ML fit function. Because the RMSEA and CFI rely on the fit function minimum in their equations, they approach different population values when data are complete versus incomplete. Furthermore, the incomplete data fit function, and hence the RMSEA and CFI values, differ with characteristics of missing data, such as missing data percentages, missing data patterns, and the exact missing data mechanism.

In addition to deriving the population fit function minima for different types of missing data, we have provided several small analytical examples and conducted two large sample simulation studies to illustrate how AFIs change with more missing data.
From the analytical examples and simulation studies, we found that in addition to missing data percentage and mechanism, another characteristic of missing data that can strongly affect AFIs is the location of missing data relative to the location of misfit (i.e., whether the variables with missing data are those that are associated with model misspecifications). The general pattern of results was the following: when the location of misfit is relatively separate from the location of missing data, the fit function minimum does not change with more missing data, whereas when the location of misfit largely overlaps with the location of missing data, the fit function minimum decreases with missing data. Because RMSEA is a direct function of the fit function minimum for the hypothesized model, it stays the same when the location of misfit is separate from the location of missing data, and decreases (indicating better fit) when the two locations overlap. For CFI, how the location of missing data affects its value depends on the change in the fit function minimum for the hypothesized model versus that for the baseline model. When the locations of misfit and missing data are separate, the fit function minimum for the baseline model decreases while the fit function minimum for the hypothesized model stays the same, leading to an overall decrease in CFI (i.e., indicating worse fit). On the other hand, when the two locations overlap, the fit function minima for both the hypothesized and baseline models tend to decrease, but the minimum for the hypothesized model tends to decrease at a faster rate than that for the baseline model, leading to an overall increase in CFI (i.e., indicating better fit).

In short, the finding of the first part of the dissertation work is somewhat troubling because it means that researchers using FIML AFIs as currently computed by popular software would tend to find better fit to the extent that there is missing data.
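The directions described above follow from simple arithmetic on the fit function minima. The Python sketch below assumes the standard population definitions, RMSEA as the square root of the hypothesized model's minimum divided by its degrees of freedom and CFI as one minus the ratio of the hypothesized to the baseline minimum, and uses entirely hypothetical minima and degrees of freedom chosen only to mimic the two patterns.

```python
import math


def population_rmsea(f_h, df_h):
    """Population RMSEA from the hypothesized model's fit function minimum."""
    return math.sqrt(max(f_h, 0.0) / df_h)


def population_cfi(f_h, f_b):
    """Population CFI from the hypothesized (f_h) and baseline (f_b) minima."""
    return 1.0 - f_h / f_b


DF_H = 8  # hypothetical degrees of freedom of the hypothesized model

# Hypothetical fit function minima: (hypothesized model, baseline model).
scenarios = {
    "complete data": (0.10, 1.00),
    "missing data, misfit elsewhere": (0.10, 0.80),  # only the baseline minimum drops
    "missing data, misfit overlaps": (0.05, 0.90),   # hypothesized minimum drops faster
}

results = {
    name: (population_rmsea(f_h, DF_H), population_cfi(f_h, f_b))
    for name, (f_h, f_b) in scenarios.items()
}

for name, (rmsea, cfi) in results.items():
    print(f"{name:32s} RMSEA = {rmsea:.3f}  CFI = {cfi:.3f}")
```

With these made-up numbers, RMSEA is unchanged and CFI drops when misfit and missingness are separate, while RMSEA falls and CFI rises when they overlap, matching the pattern summarized in the text.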
However, most researchers expect the FIML AFIs to estimate the same model fit (or misfit) as if there were no missing data, and then continue using the same cut-off guidelines developed for complete data AFIs.

To address the problem of how missing data affect AFIs under FIML, in the second part of the dissertation (i.e., Chapters 5 and 6), we have proposed and examined new estimates of AFIs following FIML and TS estimation with incomplete data. The new estimates following FIML are called the FIML-C estimates. Theoretically, the FIML-C and TS approaches for computing AFIs allow researchers to estimate what the AFIs would have been had there been no missing data, thus placing the incomplete data AFI estimates on the same metric as the complete data AFI estimates. For each of these new approaches, we have proposed several versions for calculating sample AFIs, one with no small sample correction and others with different ways of computing small sample corrections.

For the second part of the dissertation work, we conducted simulation studies to compare the original FIML approach with different versions of the FIML-C and TS approaches. We found that at the population level, FIML-C and TS approaches, in most conditions, produced AFIs that are almost the same as complete data AFIs. However, in a small number of conditions with a high degree of misfit and a large percentage of missing data, the FIML-C approach produced AFIs that are very different from those for the complete data, whereas TS AFIs in the same conditions are the same as the complete data AFIs. At the sample level, we found that relative to the complete data AFIs, the FIML-C and TS AFIs without small sample corrections tend to produce large bias in the direction of indicating worse fit when the sample size is small and the degree of misfit is low.
For the FIML-C and TS AFIs with small sample corrections, the best performing versions (i.e., FIML-C V3 and V6, and TS V2) are those whose relevant weight matrices are evaluated at the saturated model estimates rather than at the structured model estimates. Among these best performing versions, the TS approach outperforms the FIML-C approach in a small number of cases with a large percentage of missing data and a high degree of misfit. The reason for the better performance of TS over FIML-C is that FIML-C is an approximation method. FIML-C can only adjust for the differences in the fit function equations between incomplete and complete data; however, in cases where the model-implied means and covariances are very different between incomplete and complete data, FIML-C may fail to correct the FIML AFIs. In summary, the findings of the second part of the dissertation work show that the FIML-C and TS approaches can accurately estimate complete data AFIs. However, exceptions occur when there is a large amount of missing data and the hypothesized model is severely misspecified; in these cases, the FIML-C AFIs are no longer consistent estimates of the complete data population AFIs, and may deviate greatly from the complete data or TS AFIs.

Based on our study results, we recommend that substantive researchers use FIML-C V3 and TS V2 AFIs when estimating model fit for incomplete data. Both FIML-C V3 and TS V2 performed very well in our simulation studies, and they theoretically work for both normal and non-normal data, although their performance under non-normal data needs to be examined in future research (see the following section). Whether researchers should use FIML-C V3 or TS V2 for estimating AFIs depends on whether they use FIML or TS to estimate the model parameters and standard errors. We recommend that researchers use FIML-C AFIs if they use FIML to obtain parameter estimates, and use TS AFIs if they use TS for parameter estimates.
In other words, despite the superior performance of the TS AFIs, we would not recommend that researchers use TS AFIs just to evaluate model fit while using FIML to obtain parameter estimates and standard errors. The choice of estimation method should come first, and all computations, including AFIs, should be based on the same parameter estimates.

The next natural question that substantive researchers may ask is whether they should use FIML or TS for estimating parameters and standard errors. Under the correct model, both FIML and TS produce consistent estimates of the model parameters and standard errors; in terms of efficiency, FIML tends to have higher efficiency than TS but the differences are small [35]. Therefore, under the correct model, FIML and TS methods will produce similar results. However, under a misspecified model, FIML and TS will estimate different "pseudo-parameters" in the population, with the TS "pseudo-parameters" corresponding to the complete data "pseudo-parameters." In this case, there is no clear answer as to which "pseudo-parameters" are better. If the researchers are interested in the complete data "pseudo-parameters", then we would recommend the TS approach. On the other hand, if the researchers are interested in "pseudo-parameters" that are closest to the distribution of the population incomplete data, then we would recommend the FIML approach.

7.1 Limitations and Future Directions

As with all research, the current dissertation work is not without limitations. First, our simulation studies only examined CFA models. Future studies should examine the effect of missing data on the AFIs using other SEM models. Second, the implementation of the FIML-C and TS methods requires the use of internal functions in the lavaan package in R (see Supplementary Materials for sample code). The use of internal functions may make it harder for applied researchers to understand and implement the code.
Therefore, in the future, we hope to work with the developer of the lavaan package so that the FIML-C and TS methods can be added to the functions of the package. Third, in our simulation studies, we only examined normally distributed data; this may explain why the FIML-C versions that make the normality assumption performed similarly to the versions without the assumption. Future studies should include non-normal data conditions (e.g., data with large skewness and kurtosis) in order to confirm whether the FIML-C version without the normality assumption can outperform the version with the assumption in the non-normal data conditions. Fourth, as shown in our simulation studies (see Table 6.1), with misspecified models, the "pseudo-parameters" under FIML and TS can be very different; such patterns of results were rarely discussed in previous research comparing FIML with TS because previous studies [e.g., 35, 47, 48] mainly focused on comparing the parameter estimates under FIML and TS for the correctly specified model. In a future study, we can conduct a simulation study that systematically examines the differences between "pseudo-parameters" under FIML and TS methods.

Another future direction for this line of research is to examine the MI approach for estimating AFIs. As mentioned in Section 1.3, MI is another popular missing data technique. In MI, we first create multiple "complete" datasets by using imputation under either the saturated or the structured model, fit the structured model to each dataset, and then pool the AFIs across these "complete" datasets.¹ When imputation is done under the saturated normal model, this MI method is conceptually equivalent to TS and should also produce AFIs that estimate complete data AFIs.
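The pooling step of MI mentioned above can be carried out in more than one way: AFIs can be computed in each imputed dataset and then averaged, or the fit function minima can be averaged first and the AFIs computed from the pooled minima. The following Python sketch contrasts the two on made-up fit function minima from three imputed datasets. It assumes uncorrected AFI formulas (no small sample correction), and every number and name in it is hypothetical.

```python
import math
from statistics import mean


def rmsea(f_h, df_h):
    """Uncorrected RMSEA from a hypothesized-model fit function minimum."""
    return math.sqrt(max(f_h, 0.0) / df_h)


def cfi(f_h, f_b):
    """Uncorrected CFI from hypothesized (f_h) and baseline (f_b) minima."""
    return 1.0 - f_h / f_b


DF_H = 8  # hypothetical degrees of freedom of the hypothesized model

# Hypothetical fit function minima from m = 3 imputed "complete" datasets.
f_h_imputed = [0.08, 0.10, 0.12]  # hypothesized model
f_b_imputed = [0.95, 1.00, 1.05]  # baseline model

# Option 1: compute the AFIs in each dataset, then average the AFIs.
rmsea_avg = mean(rmsea(f, DF_H) for f in f_h_imputed)
cfi_avg = mean(cfi(h, b) for h, b in zip(f_h_imputed, f_b_imputed))

# Option 2: average the fit function minima, then compute the AFIs once.
rmsea_pooled = rmsea(mean(f_h_imputed), DF_H)
cfi_pooled = cfi(mean(f_h_imputed), mean(f_b_imputed))

print(f"average of AFIs:     RMSEA = {rmsea_avg:.4f}  CFI = {cfi_avg:.4f}")
print(f"AFIs of pooled min.: RMSEA = {rmsea_pooled:.4f}  CFI = {cfi_pooled:.4f}")
```

Because the square root is concave, averaging per-dataset RMSEA values can never give a larger result than computing RMSEA from the averaged minimum, so the two pooling options generally differ whenever the minima vary across imputations.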
It is interesting to note that FIML and MI are the two most common methods for treating incomplete data in SEM, yet they produce AFIs with different population values, a problem that has not so far been noticed in the literature. We have done some preliminary simulation studies to confirm our expectation. Figure 7.1 shows the population AFIs under the MI method. Consistent with our expectation, at the population level, the AFIs under MI are exactly the same as those under TS, which are the same as those under complete data (see Figure 6.1).

¹ The pooling stage of MI can be done in different ways. One way is to compute AFIs in each "complete" dataset, and then average across the computed AFIs. Another way is to compute the fit function minimum in each "complete" dataset, average across the fit function minima, and then compute the AFIs based on the pooled fit function minima. These different pooling methods also need to be studied in future research.

At the sample level, similar to the FIML-C and TS methods, small sample corrections may be needed to improve upon the MI method for estimating AFIs. Without a small sample correction, the MI AFIs should perform similarly to the FIML-C and TS AFIs without small sample corrections. Our preliminary simulation results show that for RMSEA, MI without a small sample correction indeed performed similarly to the FIML-C and TS versions without small sample corrections, but for CFI, the results for MI were quite different from those for FIML-C and TS (see Figure 7.2). In our future work, we will attempt to find an explanation for this phenomenon.

In addition, Enders and Mansolf [13] have attempted to develop a small sample correction for computing AFIs under MI. However, the main purpose of their work was to come up with a correction for the MI chi-square test statistic to account for missing data.
They then used this corrected chi-square statistic in the equations for sample AFIs, but doing so distorted their population values (see Brosseau-Liard and Savalei [6], Brosseau-Liard et al. [7] for explanations). In our future research, we plan to develop appropriate small sample corrections for MI AFIs that do not distort their population values.

Finally, our finding that small sample corrections computed with weight matrices evaluated at the saturated model estimates were better than those computed at the structured model estimates has further implications. Within the current SEM literature, there are very few research studies that investigate how computational options for estimating the weight matrices affect the estimates of other quantities, such as the estimates of the small sample correction for non-normal data and the estimate of missing information. In fact, many SEM users may not be aware of these computational options. In future studies, researchers should systematically examine these computational options.

Figure 7.1: Population RMSEA and CFI (estimated from n = 1000000) for selected conditions in Study 1 comparing FIML, FIML-C, TS and MI approaches. Panels: (a) and (b) 20% Missing Data; (c) and (d) 50% Missing Data. In these selected conditions, the number of variables with missing data is four, the number of correlated residuals is two, and the population factor correlation is zero. The population model is a two-factor model with varying sizes for the correlated residuals shown on the x-axis. The hypothesized model is a two-factor model without any correlated residuals.

Figure 7.2: Bias in the sample RMSEA and CFI estimates for selected conditions in Study 1 comparing FIML, FIML-C, TS and MI methods. Panels: (a) and (b) 20% Missing Data; (c) and (d) 50% Missing Data.
In these conditions, the number of variables with missing data is four, the number of correlated residuals is two, the population factor correlation is zero, and the missing data mechanism is weak MAR. The population model is a two-factor model with varying sizes for the correlated residuals shown on the x-axis. The hypothesized model is a two-factor model without any correlated residuals.

In conclusion, this dissertation makes a meaningful contribution to our understanding of how different missing data techniques affect the estimation of AFIs in SEM. Based on the results of this dissertation, we have offered practical advice to applied researchers who use SEM in their data analyses. Future research should continue investigating properties of different missing data techniques and exploring better methods for handling missing data in a variety of settings.

Bibliography

[1] Allison, P. D. (2003). Missing data techniques for structural equation modeling. Journal of Abnormal Psychology, 112:545–557.
[2] Arbuckle (1999). Full information estimation in the presence of incomplete data. Lawrence Erlbaum, Mahwah, NJ.
[3] Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107:238–46.
[4] Bentler, P. M. and Wu, E. J. (1995). EQS for Windows user's guide: [version 5]. Multivariate Software, Encino, CA.
[5] Bollen, K. A. (1989). Structural equations with latent variables. John Wiley & Sons, New York.
[6] Brosseau-Liard, P. E. and Savalei, V. (2014). Adjusting incremental fit indices for nonnormality. Multivariate Behavioral Research, 49(5):460–470.
[7] Brosseau-Liard, P. E., Savalei, V., and Li, L. (2012). An investigation of the sample performance of two nonnormality corrections for RMSEA. Multivariate Behavioral Research, 47(6):904–930.
[8] Browne, M. W. and Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological Methods & Research, 21:230–258.
[9] Claeskens, G. and Hjort, N.
L. (2008). Model selection and model averaging. Cambridge University Press, Cambridge.
[10] Davey, A., Savla, J., and Luo, Z. (2005). Issues in evaluating model fit with missing data. Structural Equation Modeling, 12:578–597.
[11] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B-Methodological, 39:1–38.
[12] Enders, C. K. and Gottschall, A. C. (2011). Multiple imputation strategies for multiple group structural equation models. Structural Equation Modeling, 18(1):35–54.
[13] Enders, C. K. and Mansolf, M. (2018). Assessing the fit of structural equation models with multiply imputed data. Psychological Methods, 23:76–93.
[14] Enders, C. K. and Peugh, J. L. (2004). Using an EM covariance matrix to estimate structural equation models with missing data: Choosing an adjusted sample size to improve the accuracy of inferences. Structural Equation Modeling, 11:1–19.
[15] Graham, J. W. (2010). Missing data: Analysis and design. Springer, New York.
[16] Grigsby, T. J. and McLawhorn, J. (2019). Missing data techniques and the statistical conclusion validity of survey-based alcohol and drug use research studies: A review and comment on reproducibility. Journal of Drug Issues, 49(1):44–56.
[17] Hu, L. T. and Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6:1–55.
[18] Jelicic, H., Phelps, E., and Lerner, R. (2009). Use of missing data methods in longitudinal studies: The persistence of bad practices in developmental psychology. Developmental Psychology, 45:1195–1199.
[19] Joe, H. (2014). Dependence modeling with copulas. CRC Press, Boca Raton.
[20] Kim, K. H. and Bentler, P. M. (2002). Tests of homogeneity of means and covariance matrices for multivariate incomplete data. Psychometrika, 67:609–24.
27[21] Lai, K. (2020). Correct estimation methods for RMSEA under missing data.Structural Equation Modeling, Advanced publication. 14[22] Li, J. and Lomax, R. G. (2017). Effects of missing data methods in SEM underconditions of incomplete and nonnormal data. The Journal of ExperimentalEducation, 85:231–58. 14, 17[23] Little, R. J. and Rubin, D. B. (2019). Statistical analysis with missing data, volume793. John Wiley & Sons. 7[24] Little, R. J. A. and Rubin, D. B. (2002). Statistical analysis with missing data. JohnWiley & Sons, New York. 20[25] Maiti, S. S. and Mukherjee, B. N. (1990). A note on distributional properties of thejo¨reskog-so¨rbom fit indices. Psychometrika, 55(4):721–726. 12[26] Muthe´n, B. O., Kaplan, D., and Hollis, M. (1987). On structural equation modelingwith data that are not missing completely at random. Psychometrika, 52(3):431–462.23[27] Muthe´n, B. O. and Muthe´n, L. K. (2010). Technical appendices. Los Angeles, CA:Muthe´n & Muthe´n. 23115[28] Nakagawa, S. (2015). Missing data: Mechanisms, methods, and messages. In Fox,G. A., Negrete-Yankelevich, S., and Sosa, V. J., editors, Ecologicalstatistics:Contemporary theory and application, pages 81–105. Oxford ScholarshipOnline. 2[29] Osborne, J. W. (2013). Is data cleaning and the testing of assumptions relevant inthe 21st century? Frontiers in Psychology, 4:1–3. 1[30] Rosseel, Y. (2012). Lavaan: An r package for structural equation modeling. Journalof Statistical Software, 48:1–36. 46, 84[31] Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys, volume 81.John Wiley & Sons, New York. 2, 4, 8[32] Rubin, J. B. (1976). Inference and missing data. Biometrika, 63:581–592. 2[33] Savalei, V. (2010). Expected versus observed information in SEM with incompletenormal and nonnormal data. Psychological methods, 15(4):352. 70[34] Savalei, V. (2020). Improving fit indices in structural equation modeling withcategorical data. Multivariate Behavioral Research, pages 1–18. 68, 69[35] Savalei, V. 
and Bentler, P. M. (2005). A statistically justified pairwise ML methodfor incomplete nonnormal data: A comparison with direct ML and pairwise ADF.Structural Equation Modeling, 12:183–214. 7, 10, 108, 109[36] Savalei, V. and Bentler, P. M. (2009). A two-stage approach to missing data: theoryand application to auxiliary variables. Structural Equation Modeling, 16:477–497. 6,7, 9, 10, 75[37] Savalei, V. and Rhemtulla, M. (2017). Normal theory two-stage ML estimator whendata are missing at the item level. Journal of Educational and Behavioral Statistics,42(4):405–431. 6, 10[38] Shapiro, A. (2007). Statistical inference of moment structures. In Lee, S.-Y. L.,editor, Handbook of latent variable and related models, pages 229–260. Elsevier,Amsterdam, Netherlands. 81[39] Steiger, J. H. (1980). Statistically based tests for the number of factors. Paperpresented at the annual meeting of Psychometric Society, Iowa City, IA. 12[40] Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nestedhypotheses. Econometrica: Journal of the Econometric Society, pages 307–333. 19[41] White, H. (1982). Maximum likelihood estimation of misspecified models.Econometrica, pages 1–25. 5116[42] Xia, Y., Yung, Y.-F., and Zhang, W. (2016). Evaluating the selection ofnormal-theory weight matrices in the satorra–bentler correction of chi-square andstandard errors. Structural Equation Modeling, 23(4):585–594. 70[43] Yuan, K. H. and Bentler, P. M. (2000). Three likelihood-based methods for meanand covariance structure analysis with nonnormal missing data. SociologicalMethodology, 30:165–200. 8, 9, 68, 75, 81[44] Yuan, K. H., Jamshidian, M., and Kano, Y. (2018). Missing data mechanisms andhomogeneity of means and variances-covariances. Psychometika, 83:425–42. 27[45] Yuan, K. H. and Lu, L. (2008a). SEM with missing data and unknown populationdistributions using two-stage ML: Theory and its application. Multivariate BehavioralResearch, 43:621–52. 6[46] Yuan, K.-H. and Lu, L. (2008b). 
SEM with missing data and unknown populationdistributions using two-stage ML: Theory and its application. Multivariate BehavioralResearch, 43(4):621–652. 9, 68[47] Yuan, K.-H., Tong, X., and Zhang, Z. (2015). Bias and efficiency for SEM withmissing data and auxiliary variables: Two-stage robust method versus two-stage ML.Structural Equation Modeling, 22(2):178–192. 109[48] Yuan, K.-H. and Zhang, Z. (2012). Robust structural equation modeling withmissing data and auxiliary variables. Psychometrika, 77(4):803–826. 9, 10, 109117
