Correcting the Bias of the Root Mean Squared Error of Approximation Under Missing Data

Missing data are ubiquitous in psychological research. They may come about as an unwanted result of coding or computer error, participants' non-response or absence, or missing values may be intentional, as in planned missing designs. We discuss the effects of missing data on χ2-based goodness-of-fit indices in Structural Equation Modeling (SEM), specifically on the Root Mean Squared Error of Approximation (RMSEA). We use simulations to show that naive implementations of the RMSEA have a downward bias in the presence of missing data and, thus, overestimate model goodness-of-fit. Unfortunately, many state-of-the-art software packages report the biased form of RMSEA. As a consequence, the scientific community may have been accepting a much larger fraction of models with non-acceptable model fit. We propose a bias-correction for the RMSEA based on information-theoretic considerations that take into account the expected misfit of a person with fully observed data. The corrected RMSEA is asymptotically independent of the proportion of missing data for misspecified models. Importantly, results of the corrected RMSEA computation are identical to naive RMSEA if there are no missing data.

Structural Equation Modeling (SEM; Bollen, 1989) is a widely used and powerful techni que for the analysis of multivariate data. A host of common models can be specified and estimated within the SEM paradigm, including models based on the general linear model (GLM) like regression and ANOVA (see e.g., Miller, 1997), and more complex techniques ranging from factor analysis (Mulaik, 1972) growth mixture models with non-linear growth components (Grimm et al., 2010) to general multi-level models (Bauer, 2003;Curran, 2003). Before substantial interpretations are warranted, models are usually assessed using fit indices describing how well the model fits the data. Fit indices typically compare the proposed model to either a saturated model or both a saturated and an inde pendence model. Saturated models include all possible means, variances and covariances, and represent the best possible fit of a covariance structure to the data, while independ ence models include only means and variances (i.e., assume all covariances are zero) and represent the worst reasonable fit of a covariance structure. Fit indices like Root Mean Square Error of Approximation (RMSEA; Steiger & Lind, 1980) describe absolute misfit of a model relative to the saturated model, while indices like the Comparative Fit Index and Tucker Lewis Index (Tucker & Lewis, 1973) place the proposed model on a continuum between the saturated and independence models.
Since its early days as a modeling framework of variances and covariances, the theory surrounding SEM has greatly expanded to include multi-level modeling (Heck, 2001), mixture modeling (Muthén & Shedden, 1999), non-linear growth modeling (Grimm & Ram, 2009), and even extensions to statistical learning approaches (Brandmaier et al., 2013). One important extension is the implementation of Full-Information Maximum Likelihood (FIML; cf. Finkbeiner, 1979) estimation based on raw data instead of cova riance matrices. Early SEM packages relied solely on observed moment matrices (e.g., covariance matrices, mean vectors, SSCP matrices) as input, requiring researchers to convert their data into the appropriate matrices and handle missing data, ordinal data, and other complexities prior to model fitting. Modern SEM packages can use FIML to handle raw data, eschewing user-generated data reduction in favor of fitting the model to each row of the data individually. This allows SEM to handle missing values by applying the model to whatever values are observed for a particular row of data. FIML is robust to missingness under both the missing completely at random (MCAR) and missing at random (MAR) mechanisms (Little & Rubin, 1987;Rubin, 2002), yielding unbiased parameter estimates under ignorable missingness conditions and performs better than other missing data methods, such as similar response pattern imputation or listwise deletion (Enders & Bandalos, 2001).
While the methods by which we fit SEM have changed, the methods by which we assess the goodness-of-fit of these models largely have not. Most of the common fit indi ces used to assess model fit make certain assumptions about the models being compared and the data that were generated from them, namely that the models contain full rank covariance matrices with single values for sample size and degrees of freedom. To the extent that a particular model uses raw data under FIML and is affected by missingness, these assumptions are decidedly false and thus bias is introduced into fit indices. For example, the RMSEA quantifies the amount of noncentrality per the product of sample size and degrees of freedom. When the data are complete, the product of adjusted sample size and degrees of freedom contain one measure of the total amount of information in the sample. When some portion of the data are missing, the χ 2 will decrease while the sample size and degrees of freedom values stay constant, which reduces the absolute value of the RMSEA statistic and make fit appear to be better than it would be with complete data. This artificial improvement in model fit has been noted in previous research (i.e., Davey, 2005;Hoyle, 2012) with no known solution to the problem. Zhang and Savalei (2020) have recently investigated this issue more closely and demonstrated that factors such as the type and degree of misfit in the hypothesized model, the type of missing data mechanism, and the number of missing data variables and patterns play an important role in this distortion.
Despite the bias in fit statistics, using fit statistics in the presence of missing data remains common practice. Google Scholar search indicates approximately 47% of articles (or 37,400 of 79,931) that cited one of two popular articles with recommended "rules of thumb" for RMSEA (Browne & Cudeck, 1993;Hu & Bentler, 1999) also include either the terms "missing" or "full information". 1 While this search is imperfect and nowhere near exhaustive, it gives some background toward the widespread use of fit statistics in SEM and, more importantly, the widespread use of fit statistics in SEM with missing data.

Summary of This Article
In this article, we will first discuss the problem that missing data introduces into the realm of model fit indices in SEM, focusing on the RMSEA. We then introduce the mathe matical basis for a bias correction of the RMESA. Next, we present data simulations and plots to illustrate the problem and demonstrate to what extent the proposed solutions are better suited to capture model misfit under missing data than the uncorrected RMSEA. We discuss our motivation for a correction, broader implications of the problem of missing data in SEM, limitations of our correction, and future directions in the research of model misfit indices. 1) As of January 28, 2021.

Method Mathematical Statement of the Problem
The RMSEA is an estimate based on the quantity F df , where df represents the differ ence of degrees of freedom between a fitted model and a saturated model, and F is a population quantity of the discrepancy between a normal population distribution and the best-fitting normal distribution under a given model. F is thus independent of sample size and, conventionally, χ 2 N is used as estimator of this discrepancy. The resulting population value, F df , is thus a measure of absolute misfit per person and per degree of freedom. The fraction χ 2 N ⋅ df is an estimate of this population quantity for model misfit, which forms the basis of RMSEA.
For complete data, the RMSEA for a proposed model is defined as where df is the difference in degrees of freedom between the proposed model and a saturated normal model, N the number of participants, and χ 2 the difference in the minus two log-likelihood between the proposed and saturated models for the given data set. Without model misspecification, the χ 2 index will be distributed as a central χ 2 distribu tion with df degrees of freedom; therefore, the fraction will have an expected value of zero, with decreasing sampling variance as N increases. If the model is misspecified, χ 2 follows a non-central χ 2 distribution with a non-centrality parameter proportional to N , and thus RMSEA will converge towards a fixed non-zero number in expectation. The degree to which RMSEA is larger than zero is therefore a measure of model misfit. As a data set has an increasing proportion of missing data (e.g., assuming missing completely at random; MCAR), the minus two log likelihoods of both the proposed and saturated models will in expectation decrease proportionally to that amount. In conse quence, the χ 2 -the difference between the two -will also decrease proportionally to the proportion of missing data, while df and N remain constant. Hence, RMSEA values will become smaller in value and, hence, more optimistic for increasing proportions of missing data. Note that when there is no model misspecification, the proportion of missing data does not bias RMSEA.

Defining the Correction
How can we best correct the computation of RMSEA such that it is invariant under the proportion of missing data for misspecified models? We require that the bias-correction must leave RMSEA unchanged if there are no missing data. Second, we require the bias-correction to yield asymptotically identical RMSEA values under misspecification and different levels of missing data under the assumption of MCAR and MAR. In the following, we propose a method to correct the bias incurred in the RMSEA by missing data.
To account for the downward bias in the χ 2 as estimator of model misfit, the χ 2 value can be replaced by an estimate of its expected value if no missing data were present. We obtain this bias correction by computing the expected divergence of a single person with complete observations and the best-fitting model and multiply this by the effective sample size, that is, the sample size multiplied by the proportion of observed values, to obtain the expected discrepancy if there had been no missing data.
Under MCAR or MAR missingness, the model-implied covariance matrix from a model with missing data is an estimate of the population covariance matrix, and by extension, of the model-implied covariance matrix if no missing data were present. Using the Kullback-Leibler divergence (Kullback & Leibler, 1951), also known as relative entropy, between the model-implied distribution and the saturated model's distribution, we compute the expected value of the χ 2 value of a single, fully observed person. Hence, the negative two log-likelihood of the complete data can be estimated as 2N times the Kullback-Leibler (Kullback & Leibler, 1951) divergence between the two distributions, just as 2N is multiplied by the fit function in covariance modeling to generate χ 2 values. 2KL is given by where Σ and μ are the model distribution covariance matrices and mean, S and m the saturated model distribution covariance matrix and mean, and k the number of variables (i.e., the dimensionality of the distributions). Note that 2 ⋅ KL is identical to χ 2 N for complete data sets and, thus, the bias correction will not change RMSEA for complete data sets. To ensure an unbiased estimation of the average discrepancy for missingness with rate 1 − p, we need to discount the sample size, N , by the proportion of observed cases, p. The resulting term 2pN ⋅ KL has the same expectation as χ 2 for all rates of missingness if S = Σ, that is, if the model is correctly specified. So we obtain a bias correction if we replace our estimator of F with 2KL. For S ≠ Σ in the population, 2pN ⋅ KL − df has a non-zero expectation, that is, asymptotically proportional to pN . Adding the maximum operator to avoid negative values under the square root, we obtained the bias-corrected RMSEA: Fitzgerald, Estabrook, Martin et al.
Note that even though the fraction term is guaranteed to have a constant expectation for different missingness rates, the expression under the square root (which includes the max operator) may show an increasing expectation with p if the distribution of the fraction term includes negative results. However, the actual center of the distribution is not changed; the effect is just due to the negative values being moved to zero. The median and mode of the distribution stay constant.

Simulation Design
To demonstrate the effect of the proposed bias correction, we performed two simulations. The first uses a small SEM to demonstrate the effect best without interference, while the second uses a larger Latent Growth Curve Model taken from a real substantive study.
In the first simulation, the generating model was a bivariate normal distribution with zero mean, variance one, and a correlation which we varied in three steps, no correlation (r = 0.0), a typical weak association one may encounter in psychological research (r = 0.125), and a very strong correlation (r = 0.9) to investigate the behavior close to the limit. N = 1000 participants were simulated. We then analyzed the data with a model in which we fixed the covariance to be zero, using a saturated mean structure (this is akin to a covariance-only model). Note that this miss-specified model is identical to the null model if one would test against a zero covariance between the two manifest variables.
For a larger substantive example, we simulated misfit in a linear Latent Growth Curve Model (LGCM; McArdle, 1988). The linear LGCM models change across repeated observations by specifying an intercept and a slope component describing the overall level and change as well as individual differences in each. Our simulation was inspired by estimates that are typical for cognitive ageing research (Ghisletta et al., 2020). The model contained five manifest variables. We set the intercept mean to 20 and the intercept variance to 30, the slope mean to -4 and the slope variance to 5 and the residual error to 14. This corresponds to a moderate growth curve reliability (Figure 1). We used the same model for data generation, but to add some typical misspecification, we added a quadratic growth component with mean = -5 and variance = 5. N = 9000 participants were generated with this model.
In both simulations, we then created missing data by removing all values except for the first variable in some participants. The choice of participants with missing data was done either by an MCAR or MAR process: For the MCAR process, each participant had the same probability p miss to miss all values but the first. For the MAR process, the probability p miss; i of the i th participant to show missingness depended on the first value x i which was never missing and the overall missingness rate p miss by a squashed sigmoid function: Participants were chosen with this probability until exactly p miss of the participant showed missingness. Throughout the simulations, a single data set was created in each condition, and missing data were then simulated in the same data set for each missing ness scenario. This data was then analyzed with the misspecified model. Overall, we simulated 1000 trials in both simulations and recorded the RMSEA with and without the correction. We used Onyx (von Oertzen, Brandmaier, & Tsang, 2015) and lavaan (Rosseel, 2012) for the simulations.

Description of Plots
Results are presented visually by condition, with one plot per type of missingness (MCAR vs. MAR) and observed covariance between variables. The horizontal (x) axes of each plot represent the percent of data missing, while the vertical (y) axes of each plot represent RMSEA. To aid in the interpretation of these simulations, we follow general guidelines in the literature to define RMSEA values less than 0.05 to indicate good fit, greater than 0.1 to indicate poor fit, and values in between 0.05 and 0.1 to indicate mediocre fit (Browne & Cudeck, 1993). A line for RMSEA of 0.05 is included where necessary. Simulation results revealed consistent patterns for RMSEA calculations. RMSEA val ues were most strongly related to model misspecification (i.e., the strength of the corre lation between variables that was constrained to zero in the misspecified model), but also showed effects of missingness patterns (MCAR vs. MAR). In the bivariate model, for both covariance values of 0.0 and 0.125, MAR and MCAR data generally yielded similar patterns.

Bivariate Normal Model Under Miss-Specification
For data MAR with a covariance of 0.125 between variables (Figure 2), the uncorrected RMSEA values scale linearly with missingness. For the data simulation conducted here, a missingness percentage of zero results in an RMSEA value of approximately 0.12, which indicates that the model has unacceptable fit. With a little more than 20% missingness, however, the RMSEA value drops below 0.1, indicating "poor" fit, and with about 70% missingness, it indicates "good" fit. KL corrected RMSEA values remain constant across all levels of missingness, with a slight upward bias starting at around 50% missingness.

Figure 2
Simulation Results for Data Generated With a Covariance of 0.125 Note. RMSEA values are given for data completely missing at random (MCAR; blue) and data missing at random (MAR; orange). While KL-corrected RMSEA values (shown as triangles) remain mostly constant across levels of missingness (except extreme conditions of missingness), uncorrected RMSEA (shown as circles) yield artificially improved model fit.
For data MCAR with a covariance value of 0.125 between variables (Figure 2), we see a similar pattern as was observed with the MAR data condition above. Uncorrected RMSEA values continue to decline linearly as missingness percentage increases. Values for the KL corrected RMSEA stay mostly constant in the simulation, as is expected from the mathematical results above.
As model misspecification increases, the resulting patterns observed become more pronounced. In the condition with a covariance of 0.9, we observe a similar trend, with uncorrected values scaling downward as the percentage of missingness increases, and KL-corrected values being almost completely independent of missingness. Figure 3 display results for data simulation when model covariance cov = 0 . 9 under MAR and MCAR data conditions, respectively. The mean of the KL corrected RMSEA increases slightly for high missingness values. This is due to the floor effect in which KL-corrected values are found using the maximum operator, because the sample error increases as missingness increases creates higher variance in the RMSEA values. This effect causes more RMSEA values to become zero by chance. In addition, high missingness also decreases the asymptotic properties of the parameter estimates, which also explains a part of the increase in the RMSEA values.

Bivariate Normal Model With No Miss-Specification
For models with no model misspecification (i.e., those with no covariance between variables), we observe no decline in RMSEA values for any of the RMSEA values. Instead, uncorrected RMSEA values remain constant at a level of almost zero, even as missingness increase. Our proposed bias-corrected RMSEA, shows slight increases as missingness increases, that is, it has a slight tendency to overpessism in adjuging model fit. When missing percentage values reach 20%, both MAR and MCAR KL corrected values show an increase in terms of RMSEA value. See Figure 4 for these results.

Figure 4
Simulation Results for Data Generated With no Model Misspecification Note. RMSEA values are given for data completely missing at random (MCAR; blue) and data missing at random (MAR; orange). As expected, results stay close to zero for both corrected (shown as triangles) and uncorrected (shown as circles) RMSEA values. Corrected RMSEA values start increasing for high missingness rates. When evaluating RMSEA values in this condition, keep note of the small scale of the y-axis.

Results in the Latent Growth Curve Model
Results of the LGCM simulation are shown in Figure 5. The corrected RMSEA value stays mostly constant up to a rate where 80% of the participants show missingness both under MAR and under MCAR. In contrast, both for MAR and MCAR the uncorrected RMSEA reduces from 0.32 initially to 0.15 for very high missingness rates.

Results Summary
Among models with some misspecification, we see a a downward bias with the uncorrec ted RMSEA calculation and largely constant values for the KL corrected RMSEA. This shows up among both MCAR and MAR data simulations among models with both low (cov = 0.125) and high (cov = 0.9) model misspecification.
For the models having no misspecification, a pessimistic trend for KL-corrected values is observed when the amount of missing data is larger but not when it is small. In this case, the uncorrected RMSEA is unbiased and the KL-correction shows a slight pessimistic bias.

Discussion
The RMSEA, as defined by Steiger and Lind (1980) may be regarded as standardization of the χ 2 index of model misfit. Under the null hypothesis of no misfit, that is, the population distribution can be represented by the hypothesized model, the RMSEA fol lows a χ 2 distribution for any proportion of missing data (MAR and MCAR). Hence, we find in our simulation that the uncorrected RMSEA is constant in expectation for every Fitzgerald, Estabrook, Martin et al. proportion of missing data when there is no model misspecification. We conclude that, under the null hypothesis, the uncorrected RMSEA behaves correct and according to our expectations. We also find that both the uncorrected RMSEA underestimates misfit and is thus an overoptimistic measure of goodness-of-fit when missing data are present. The overoptimism may severly bias our conclusions about correct model misspecification. In our simulations, the suggested bias correction decreases this overoptimism such that in many instances, models would be rejected as not well fitting even though the uncorrec ted RMSEA remains below typically used cut-off criteria for "good" model fit.
RMSEA was created to detect model misspecification. In fact, if we want a test whether a model is perfectly specified (to be precise, testing and possibly rejecting the null hypothesis of perfect fit), we could directly use the likelihood ratio test. The logic of the RMSEA is that models are never perfectly specified, and in fact would be useless if they were, since simplification is an integral part of statistical modeling. The RMSEA is a quantification of the degree of misspecification that allows us to work with "slightly" misspecified models that still are deemed to work as desired in most cases. We showed that with misspecification, the uncorrected RMSEA is no longer guaranteed to be constant over differing levels of missingness. This is evident from our simulations, where the uncorrected RMSEA decreases with higher proportions of missing data. We proposed a bias correction for the RMSEA adjusting the estimate of F via the Kullback-Leibler divergence such that RMSEA remains an appropriate fit index when data is only partially observed.
We conclude that, under the null hypothesis of no misfit, the uncorrected RMSEA is unbiased. We also find the uncorrected RMSEA underestimates misfit and is thus an overoptimistic measure of goodness-of-fit when missing data are present. The overop timism may severely bias our conclusions about correct model misspecification. Because we rarely operate with correctly-specified models, we believe that for the advancement of knowledge, it is more beneficial to risk a slight pessimism in the rare case of correct ly-specified models than overoptimism with miss-specified models when missing data are present. The mathematical derivation, backed up by the simulations, show that the suggested, show that the KL corrected RMSEA index provides a better indicator of model fit in the presence of missingness.

Limitations
Although the simulations strongly support the notion that KL-corrected RMSEA values are unaffected by missing data, further investigation is needed for more general models. The model used in our simulation was minimal and was designed as a proof of concept that demonstrates the behavior of our proposed corrections In most cases, such models are oversimplified and do not adequately represent more complicated datasets. Increas ing the complexity of data simulation models and reassessing RMSEA behavior will allow for broader confirmation of the effectiveness of the KL correction as a modification for the RMSEA fit index.
A possible objection to implementing the KL correction for RMSEA lies in the reali zation that, in presence of missing data, corrected RMSEA values indicate worse model fit than uncorrected values. Though our rationale for an RMSEA correction is rooted in theory and practice, it may be hard to accept that RMSEA values that have been reported thus far have been impling artificially inflated goodness-of-fit for models estimated from missing data. Accepting the KL correction will pressure researchers to reassess model fit, and there may be resistance if KL corrected RMSEA values point to a different conclusion than uncorrected RMSEA values had indicated previously. Considering that the typically used cut-off values are based on rules-of-thumb derived from observed RMSEA values without using the correction, it may be a viable strategy to re-think the typically employed cut-off values. Changing the RMSEA calculation under missing data at large, with higher RMSEA values considered acceptable, may reduce resistance to such a change without necessarily accepting considerably worse models as acceptable but rather restore fairness among models derived from completely and partially observed data. RMSEA values may then become comparable across different levels of missingness, and preference will no longer be given to models where missing data rates are high.

Summary and Future Directions
The original formulation of RMSEA was derived under the assumption of complete observations. We found that a naive computation of the RMSEA under missing data results in overoptimistic model fit (i.e., RMSEA values that are too low). Hence, we must conclude that most reported model fits using RMSEA are too optimistic. We propose that the RMSEA can be corrected by replacing the χ 2 with an estimator based on the KL divergence in the RMSEA equation. This leads to an intuitive extension of the original idea behind the RMSEA to settings of missing data.
This paradigm can and should be extended to other fit indices as well, and be incorporated into the updating of SEM fit statistics to account for novel estimation approaches. Other fit indices are affected by missing data, and the KL correction should be explored and applied to CFI, TLI, and other indices. As the KL correction is an estimate, estimation accuracy needs to be considered when creating confidence intervals around KL-corrected RMSEA. Beyond these corrections, we must accept and further explore ways of evaluating SEM model fit that do not depend on the assumption of a complete saturated model, especially as we extend SEM into planned missingness, design and definition variables, and other important advances that take SEM from a covariance method to flexible GLM-based modeling.
Informing researchers about the benefits of an RMSEA correction and its ability to handle missing data may encourage the use of SEM in the realistic scenario of missing data. In order to do this, it is important to provide the scientific community easy access Fitzgerald, Estabrook, Martin et al. to the correction. A feasible solution is writing a package for R which includes wrapper tools that will extract data from existing functions and calculate RMSEA corrected values. In this line, we attach a small R script to this article that derives KL-corrected RMSEA values for a given estimation result. The script is available in the Supplementary Materials. Ω nyx is currently the only software providing both uncorrected and correc ted RMSEA. We hope that eventually other software developer teams will follow and integrate the KL correction into existing data software tools to encourage and support widespread use of corrected RMSEA estimates.

Funding:
The authors have no funding to report.