Missing data are ubiquitous in psychological research. They may come about as an unwanted result of coding or computer error or of participants' non-response or absence, or missing values may be intentional, as in planned missing designs. We discuss the effects of missing data on χ²-based goodness-of-fit indices in Structural Equation Modeling (SEM), specifically on the Root Mean Squared Error of Approximation (RMSEA). We use simulations to show that naive implementations of the RMSEA have a downward bias in the presence of missing data and thus overestimate model goodness-of-fit. Unfortunately, many state-of-the-art software packages report the biased form of the RMSEA. As a consequence, the scientific community may have been accepting a much larger fraction of models with unacceptable fit. We propose a bias correction for the RMSEA based on information-theoretic considerations that takes into account the expected misfit of a participant with fully observed data. The corrected RMSEA is asymptotically independent of the proportion of missing data for misspecified models. Importantly, the corrected RMSEA is identical to the naive RMSEA if there are no missing data.

Structural Equation Modeling (SEM;

Since its early days as a modeling framework of variances and covariances, the theory surrounding SEM has greatly expanded to include multi-level modeling

While the methods by which we fit SEM have changed, the methods by which we assess the goodness-of-fit of these models largely have not. Most of the common fit indices used to assess model fit make certain assumptions about the models being compared and the data that were generated from them, namely that the models contain full-rank covariance matrices with single values for sample size and degrees of freedom. To the extent that a particular model uses raw data under full information maximum likelihood (FIML) and is affected by missingness, these assumptions are decidedly false, and thus bias is introduced into fit indices. For example, the RMSEA quantifies the amount of noncentrality per the product of sample size and degrees of freedom. When the data are complete, the product of adjusted sample size and degrees of freedom is one measure of the total amount of information in the sample. When some portion of the data are missing, the

Despite this bias, using fit statistics in the presence of missing data remains common practice. A Google Scholar search indicates that approximately 47% of articles (37,400 of 79,931) that cited one of two popular articles with recommended “rules of thumb” for the RMSEA ^{1}

As of January 28, 2021.

While this search is imperfect and nowhere near exhaustive, it gives some indication of the widespread use of fit statistics in SEM and, more importantly, of their widespread use in SEM with missing data.

In this article, we will first discuss the problem that missing data introduce into the realm of model fit indices in SEM, focusing on the RMSEA. We then introduce the mathematical basis for a bias correction of the RMSEA. Next, we present data simulations and plots to illustrate the problem and demonstrate to what extent the proposed solutions are better suited to capture model misfit under missing data than the uncorrected RMSEA. We close by discussing our motivation for a correction, broader implications of the problem of missing data in SEM, limitations of our correction, and future directions in the research of model misfit indices.

The RMSEA is an estimate based on the quantity

For complete data, the RMSEA for a proposed model is defined as
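The equation itself was lost in extraction; for orientation, the textbook complete-data formula can be sketched in code. This is a generic sketch of the standard definition, with the usual truncation at zero; note that some implementations divide by N rather than N − 1:

```python
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    """Complete-data RMSEA: estimated noncentrality (chi2 - df),
    truncated at zero, per the product of degrees of freedom and
    adjusted sample size (n - 1)."""
    noncentrality = max(chi2 - df, 0.0)
    return math.sqrt(noncentrality / (df * (n - 1)))

# A chi-square of 40 on 20 degrees of freedom with n = 201:
# sqrt(20 / (20 * 200)) = sqrt(1 / 200), about 0.0707.
print(round(rmsea(40.0, 20, 201), 4))
# A chi-square at or below its degrees of freedom yields 0.
print(rmsea(15.0, 20, 201))
```

The truncation at zero is what makes the RMSEA behave well under the null hypothesis of perfect fit, where the χ² statistic fluctuates around its degrees of freedom.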

As a data set has an increasing proportion of missing data (e.g., assuming data are missing completely at random; MCAR), the minus-two log-likelihoods of both the proposed and the saturated model will, in expectation, decrease in proportion to that amount. In consequence, the
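The mechanics behind this proportional decrease can be seen in the casewise (FIML) likelihood: each participant contributes the marginal multivariate-normal likelihood of only their observed variables, so missing entries contribute nothing. A minimal sketch, where the function name and toy values are ours, not the article's:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def casewise_m2ll(row, mu, sigma):
    """-2 log-likelihood of one case under FIML: evaluate the
    marginal multivariate normal over the non-missing entries only."""
    obs = ~np.isnan(row)
    return -2.0 * multivariate_normal(mu[obs], sigma[np.ix_(obs, obs)]).logpdf(row[obs])

mu = np.zeros(2)
sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

full = np.array([0.3, -0.2])       # fully observed case
partial = np.array([0.3, np.nan])  # second variable missing

# The partial case contributes exactly the univariate N(0, 1)
# -2 log-likelihood of its single observed value.
print(casewise_m2ll(full, mu, sigma))
print(casewise_m2ll(partial, mu, sigma))
print(-2.0 * norm(0.0, 1.0).logpdf(0.3))
```

Because each missing entry removes a term from every case's contribution, the expected total −2 log-likelihood shrinks roughly with the number of observed data points, for the proposed and the saturated model alike.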

How can we best correct the computation of the RMSEA such that it is invariant under the proportion of missing data for misspecified models? First, we require that the bias correction leave the RMSEA unchanged if there are no missing data. Second, we require the bias correction to yield asymptotically identical RMSEA values under misspecification across different levels of missing data, under the assumption of MCAR or MAR. In the following, we propose a method to correct the bias incurred in the RMSEA by missing data.

To account for the downward bias in the

Under MCAR or MAR missingness, the model-implied covariance matrix from a model with missing data is an estimate of the population covariance matrix, and by extension, of the model-implied covariance matrix if no missing data were present. Using the Kullback-Leibler divergence
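For multivariate normal distributions, the Kullback-Leibler divergence has a closed form, which makes such comparisons between a model-implied and a reference covariance matrix straightforward to compute. A minimal sketch under generic names (not the article's notation):

```python
import numpy as np

def kl_mvn(mu0, S0, mu1, S1):
    """Kullback-Leibler divergence D( N(mu0, S0) || N(mu1, S1) )
    between two p-dimensional multivariate normal distributions."""
    p = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0)        # covariance mismatch
                  + diff @ S1_inv @ diff       # mean mismatch
                  - p                          # dimensionality offset
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Identical distributions have zero divergence.
print(kl_mvn(np.zeros(2), np.eye(2), np.zeros(2), np.eye(2)))
```

The divergence is zero if and only if the two distributions coincide, and it grows with the discrepancy between the implied and reference moments, which is what makes it a natural yardstick for population-level misfit.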

Note that

To demonstrate the effect of the proposed bias correction, we performed two simulations. The first uses a small SEM to demonstrate the effect best without interference, while the second uses a larger Latent Growth Curve Model taken from a real substantive study.

In the first simulation, the generating model was a bivariate normal distribution with zero mean, variance one, and a correlation which we varied in three steps: no correlation (

For a larger substantive example, we simulated misfit in a linear Latent Growth Curve Model (LGCM;

In both simulations, we then created missing data by removing all values except the first variable in some participants. Participants with missing data were selected by either an MCAR or a MAR process: For the MCAR process, each participant had the same probability
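The two selection mechanisms can be sketched as follows. This is an illustrative reconstruction: the correlation, the missingness proportion, and in particular the MAR selection rule (a cutoff on the always-observed first variable) are our assumptions, not necessarily the exact settings of the study:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
# Bivariate standard normal with a modest correlation.
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.125], [0.125, 1.0]], size=n)

p_miss = 0.2  # target proportion of participants with missing data

# MCAR: each participant is selected with the same probability,
# independently of any data value; all variables but the first are removed.
mcar = data.copy()
mcar[rng.random(n) < p_miss, 1:] = np.nan

# MAR (illustrative rule): selection depends only on the always-observed
# first variable, e.g. the lowest-scoring 20% lose their remaining values.
mar = data.copy()
mar[data[:, 0] < np.quantile(data[:, 0], p_miss), 1:] = np.nan
```

Under both mechanisms the first variable stays fully observed, so the MAR selection depends only on observed information, which is what distinguishes MAR from missingness that is not at random.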

Overall, we simulated 1000 trials in both simulations and recorded the RMSEA with and without the correction. We used Onyx

Results are presented visually by condition, with one plot per type of missingness (MCAR vs. MAR) and observed covariance between variables. The horizontal (

Simulation results revealed consistent patterns for RMSEA calculations. RMSEA values were most strongly related to model misspecification (i.e., the strength of the correlation between variables that was constrained to zero in the misspecified model), but also showed effects of missingness patterns (MCAR vs. MAR). In the bivariate model, for both covariance values of 0.0 and 0.125, MAR and MCAR data generally yielded similar patterns.

For data MAR with a covariance of 0.125 between variables (

For data MCAR with a covariance value of 0.125 between variables (

As model misspecification increases, the resulting patterns become more pronounced. In the condition with a covariance of 0.9, we observe a similar trend, with uncorrected values trending downward as the percentage of missingness increases, and

For models with no misspecification (i.e., those with no covariance between variables), we observe no decline for any of the RMSEA variants. Instead, uncorrected RMSEA values remain constant at a level of almost zero, even as missingness increases. Our proposed bias-corrected RMSEA shows slight increases as missingness increases; that is, it has a slight tendency toward overpessimism in judging model fit. When missing percentage values reach 20%, both MAR and MCAR

Results of the LGCM simulation are shown in

Among models with some misspecification, we see a downward bias with the uncorrected RMSEA calculation and largely constant values for the

For the models having no misspecification, a pessimistic trend for

The RMSEA, as defined by

The RMSEA was created to detect model misspecification. In fact, if we only wanted to test whether a model is perfectly specified (to be precise, testing and possibly rejecting the null hypothesis of perfect fit), we could directly use the likelihood-ratio test. The logic of the RMSEA is that models are never perfectly specified and, in fact, would be useless if they were, since simplification is an integral part of statistical modeling. The RMSEA quantifies the degree of misspecification and thus allows us to work with “slightly” misspecified models that are still deemed to work as desired in most cases. We showed that, under misspecification, the uncorrected RMSEA is no longer guaranteed to be constant over differing levels of missingness. This is evident from our simulations, where the uncorrected RMSEA decreases with higher proportions of missing data. We proposed a bias correction for the RMSEA adjusting the estimate of

We conclude that, under the null hypothesis of no misfit, the uncorrected RMSEA is unbiased. We also find that the uncorrected RMSEA underestimates misfit and is thus an overoptimistic measure of goodness-of-fit when missing data are present. This overoptimism may severely bias our conclusions about model misspecification. Because we rarely operate with correctly specified models, we believe that, for the advancement of knowledge, it is more beneficial to risk slight pessimism in the rare case of correctly specified models than overoptimism with misspecified models when missing data are present. The mathematical derivation, backed up by the simulations, shows that the suggested

Although the simulations strongly support the notion that

A possible objection to implementing the

The original formulation of the RMSEA was derived under the assumption of complete observations. We found that a naive computation of the RMSEA under missing data results in overoptimistic model fit (i.e., RMSEA values that are too low). Hence, we must conclude that most reported model fits using the RMSEA in the presence of missing data are too optimistic. We propose that the RMSEA can be corrected by replacing the

This paradigm can and should be extended to other fit indices as well, and be incorporated into the updating of SEM fit statistics to account for novel estimation approaches. Other fit indices are affected by missing data, and the

Informing researchers about the benefits of an RMSEA correction and its ability to handle missing data may encourage the use of SEM in the realistic scenario of missing data. To this end, it is important to provide the scientific community with easy access to the correction. A feasible solution is an R package that includes wrapper tools to extract the required quantities from existing model-fitting functions and calculate corrected RMSEA values. Along these lines, we attach a small R script to this article that derives

For this article R-code for computing the corrected RMSEA is available via PsychArchives (for access see

The authors have no funding to report.

The authors have declared that no competing interests exist.

The authors have no additional (i.e., non-financial) support to report.

A preprint of this article has been published under