Multiple Imputation to Balance Unbalanced Designs for Two-Way Analysis of Variance

A balanced ANOVA design provides an unambiguous interpretation of the F-tests, and has more power than an unbalanced design. In earlier literature, multiple imputation was proposed to create balance in unbalanced designs, as an alternative to Type-III sum of squares. In the current simulation study we studied four pooled statistics for multiple imputation, namely D0, D1, D2, and D3 in unbalanced data, and compared them with Type-III sum of squares. Statistics D1 and D2 generally performed best regarding Type-I error rates, and had power rates closest to that of TypeIII sum of squares. Additionally, for the interaction, D1 produced power rates higher than Type-III sum of squares. For multiply imputed datasets D1 and D2 may be the best methods for pooling the results in multiply imputed datasets, and for unbalanced data, D1 might be a good alternative to Type-III sum of squares regarding the interaction.

other types of sum of squares, such as Type-I and Type-II. For example, in Type-III sum of squares, the F-values are not influenced by the order in which each effect is entered in the ANOVA, as in Type-I sum of squares. For a discussion of the advantages of Type-III sum of squares over the other types, see Maxwell & Delaney (2004). However, despite these advantages, the power of the F-tests in Type-III sum of squares is still lower than in balanced designs.
According to Schafer (1997, p. 21) an unbalanced design can be considered a miss ing-data problem by imagining a number of additional cases with missing data on the outcome variable, which in the appropriate conditions would result in a balanced design (also, see Shaw & Mitchell-Olds, 1993, p. 1640, and Winer, Brown, & Michels, 1991. Considering an unbalanced design a missing-data problem creates new ways to handle the problems that come with unbalanced designs by using methods for deal ing with missing data. Shaw andMitchell-Olds (1993, p. 1641) discuss imputation (i.e., estimation) of the missing data to balance the design. They point out that imputation will give unbiased parameter estimates of the ANOVA model, but biased significance tests. This bias is caused by the fact that the imputed values are treated as observed values, and do not incorporate uncertainty of the missing values in the analysis (e.g., Schafer, 1997, p. 2). Thus, a method for estimating the missing values is needed that does take this uncertainty into account. One such method is multiple imputation (Rubin, 2004;Van Buuren, 2012). In multiple imputation, 1) missing values are estimated multiple times (M), resulting in M different complete versions of the incomplete dataset. Next, 2) these complete datasets are analyzed using the statistical analysis of interest, resulting in M outcomes of the same analysis, and 3) the results are combined into one pooled analysis, such that the uncertainty about the missing values is taken into account. For this pooling, formulas are used which will henceforth be called combination rules.
For balancing unbalanced data, multiple imputation may be used as follows. First, by adding a number of additional cases to specific groups such that all groups have equal size, the dataset is now balanced where in some cells cases have missing data on the outcome variable. These missing data are then multiply imputed using factors A and B as categorical predictors of the missing data on the outcome variable. Procedures for how to generate multiply imputed values for the missing data are described in, for example, Van Buuren (2012, Chapter 3) and Raghunathan, Lepkowski, Van Hoewyk, and Solenberger (2001).
Once multiply imputed datasets have been obtained, the ANOVAs can be applied to these datasets, and the results can be combined using specific combination rules. However, Van Ginkel and Kroonenberg (2014) argue that the M F-tests of an effect in ANOVA in M multiply imputed datasets cannot be combined directly, because for the combination rules that are suited for ANOVA (to be discussed shortly) one needs M sets of regression coefficients and M covariance matrices of these regression coefficients. See, Rubin (2004, pp. 79-81) and Van Ginkel (2019) who show how to calculate a pooled F-sta tistic testing several regression coefficients for significance simultaneously for multiply imputed datasets.
However, one can reformulate the two-way ANOVA model as a regression model, so that the combination rules can be applied to the regression coefficients. Van Ginkel and Kroonenberg (2014) show that this can be done by recoding the categorical variables and their interaction into effect-coded variables, and using these effect-coded variables as predictors in the regression model (also, see Winer et al., 1991, pp. 959-963). Once estimates of the regression coefficients and their covariances have been obtained for each of the M imputed datasets, the results may be combined.
For pooling the results of two-way ANOVA Rubin (2004) discusses four methods that may be good candidates. Schafer (1997) called three of these methods D 1 , D 2 , and D 3 . The fourth method was not given a name, probably because Schafer noted some problems of this method, which made him set it aside altogether. However, for the current application of these combination rules, the statistic that was not given a name may still be useful, for reasons discussed later on. Because the method with no name was chronologically the first method in Schafer's book, we will call this statistic D 0 . Each of these methods are discussed next.
which has an approximate F-distribution with p model degrees of freedom, and 1 + p −1 ν/2 error degrees of freedom, where an approximation for ν by Reiter (2007) may be used. The formula for ν is rather long and complex, so we will not give it here. For computational details, see Reiter (2007).
A pooled F-test for the effect of factor A may be computed by filling in the pooled regression coefficients of the effect coded variables that together form factor A in Q m in Equation 1 and Equation 3, filling in the covariances of these coefficients in Equation 2, and use the resulting Q, U and B in Equation 4 and Equation 5. Similarly, F-tests for factor B and the interaction may be computed. Computational details on how to obtain the covariance matrix U m for the effects of factors A, B and the interaction are given in Van Ginkel and Kroonenberg (2014).

Statistic D₁
Because B is a noisy estimate of the between-imputation variance and does not even have full rank if M < p, Schafer (1997, p. 113) advises against using statistic D 0 . Rather than using T as an estimate for the total covariance matrix, a more stable total covariance matrix is: r 1 = 1 + M −1 tr BU −1 /p where r 1 is the relative increase in variance due to nonresponse. The resulting F-value testing all elements in vector Q for significance simultaneously is: which, under the assumption that r 1 is equal across all elements of Q, has an approximate F-distribution with p numerator degrees of freedom, and ν denominator degrees of freedom.
Despite the assumption of equal r 1 across all elements of Q, Li, Raghunathan and Rubin (1991b) showed that violation of this assumption does not substantially influence the Type-I error rate of D 1 . Because of this finding, and because of the disadvantages of statistic D 0 , D 1 has been widely implemented in different statistical packages, such as SAS (SAS, 2013;Yuan, 2011), Stata (StataCorp, 2017, the miceadds package in R (Robitzsch, Grund, & Henke, 2017), and an SPSS (IBM SPSS, 2017) macro by Van Ginkel (2010). Using the procedure outlined by Van Ginkel and Kroonenberg (2014), Grund et al. (2016) showed that D 1 generally produces Type-I error rates close to the theoretical α in both one-way and two-way ANOVAs.

Statistic D₂
as the Wald statistic of imputed dataset m, as the average Wald statistic across imputed datasets, and as an alternative estimate of the relative increase in variance due to nonresponse. Statis tic D 2 is given by: As a reference distribution for D 2 an F-distribution with p numerator degrees of freedom and ν 2 = p − 3 m m − 1 1 + r 2 −1 2 denominator degrees of freedom is used. Applied to two way ANOVA, significance tests for factors A, B, and the interaction can be obtained by substituting the relevant coefficients for Q m , and by substituting the covariance matrices of the relevant coefficients for U m , and substitute these quantities in Equation 8.
An advantage of D 2 is that it is easier to compute than D 0 and D 1 as it can be calculated from M separate Wald statistics alone and no separate combining of the covariance matrices is needed (Schafer, 1997, p. 115). Li, Meng, Raghunathan, and Rubin (1991a) however, show that D 2 occasionally produces Type-I error rates as high as 8% when using α = .05. Li et al. (1991a) therefore argue that D 2 only be used as a rough guide. However, Grund et al. (2016) found Type-I error rates for D 2 that were closer to the theoretical 5% than found by Li et al. (1991a). Thus, how well D 2 performs regarding Type-I error rates in which situations, remains unclear.

van Ginkel & Kroonenberg
Statistic D 3 (Meng & Rubin, 1992) is a combined likelihood-ratio test of M likelihood-ratio statistics from M imputed datasets. Define d L, m as a likelihood-ratio test of imputed dataset m, and as the average d L, m across imputed datasets. Next, define d L, m * as a likelihood-ratio test of imputed dataset m but now evaluated at the average parameter estimates of the model across M imputed datasets. The average d L, m * is computed as Statistic D 3 is computed as The reference distribution that is used for D 3 is an F-distribution with p numerator degrees of freedom and a denominator degrees of freedom given by: For a specific effect in the model (factor A, factor B, interaction) D 3 is computed by letting d L, m and d L, m * be the likelihood-ratio tests that test the full model against the full model, minus the terms that together form the specific effect.

Goal of the Current Study
Several authors (Schafer, 1997;Shaw & Mitchell-Olds, 1993;Winer et al., 1991) have pointed out that unbalanced designs can be considered a missing-data problem. Shaw and Mitchell-Olds (1993) discussed imputation as a possible solution for imbalance and at the same time noted some problems of the imputation approach regarding p-values. However, they did not mention the use of multiple imputation to overcome the problems of single imputation in unbalanced designs. Although Schafer (1997) never explicitly suggested to balance an unbalanced design using multiple imputation, the suggestion implicitly followed from the fact that Schafer's book is about multiple imputation. Van Ginkel and Kroonenberg (2014, p. 79) made the suggestion explicit, and used it in an em pirical data example (pp. 87-88). The reasoning behind their suggestion was that balanced designs generally have more power than unbalanced designs, and that consequently multiple imputation in unbalanced designs could increase power.
However, Van Ginkel and Kroonenberg (2014) did not carry out any simulations to back up this claim. Additionally, one aspect that Van Ginkel and Kroonenberg did not include in their reasoning is that multiple imputation does not actually add information that could increase power. Indeed multiple imputation simulates additional cases, but imputing M values of the outcome variable for each additional case creates additional variation in the parameter estimates of the ANOVA model as well, represented by covariance matrix B or the relative increase in variance due to nonresponse (r 1 , r 2 , and r 3 ). The statistics D 0 to D 3 include these quantities in their calculation such it lowers their power, which may consequently undo the increased power as a result of creating additional cases. Also, Van Buuren (2012, p. 48) and Von Hippel (2007) argue that when missing data only occur on the outcome variable, analyzing only the complete cases may be preferred over multiple imputation. Since unbalanced data can be considered a situation with missings on only the outcome variable, and that Type-III sum of squares in this context is equivalent to analyzing only complete cases, it would follow that in unbalanced data Type-III sum of squares is preferred over multiple imputation.
In short, both valid arguments for balancing unbalanced data using multiple imputa tion prior to two-way ANOVA, and simulation studies that confirm its usefulness seem to be lacking. However, the fact that this suggestion has been made in the literature or even just the fact that unbalanced data are often described as a missing-data problem and that multiple imputation is a highly recommended procedures for dealing with missing data, calls for a simulation study to investigate the usefulness of this suggestion. In the current paper we will carry out such a simulation study. Consequently, the first research question is whether there is some benefit in using multiple imputation for balancing an unbalanced design prior to a two-way ANOVA after all.
Furthermore, Grund et al. (2016) already investigated statistics D 1 , D 2 , and D 3 in a wide variety of situations. In their study they found that in most situations all statistics produced Type-I error rates close to the theoretical 5%. However, in more extreme situations (e.g., small sample size, high percentages of missing data), D 2 , and D 3 were somewhat conservative and had lower power than D 1 . Based on this study, D 1 is the preferred statistic for combining the results of multiply imputed data in ANOVA.
However, the question is to what extent the results by Grund et al. (2016) generalize to a situation where unbalanced data are balanced using multiple imputation. D 1 showed the most promising results but has one potential weak point, and which may especially be problematic in unbalanced data: it assumes that the fraction of missing information van Ginkel & Kroonenberg is equal across the p parameter estimates in Q. Unequal fractions of missing information across parameter estimates in Q are inherently related to unbalanced designs as each regression coefficient of an effect-coded variable of a specific factor represents a specific group of this factor. Unequal group sizes imply that for smaller groups more additional values of the dependent variable have to be multiply imputed than for larger groups, resulting in unequal fractions of missing information across parameter estimates.
In a situation which inherently has unequal fractions of missing information across parameter estimates, a statistic assuming equal fractions of missing information across parameter estimates (D 1 ) is not the most convenient choice, especially when an alterna tive version of this statistic exists (D 0 ) that does not make this assumption (when p = 1, or when the fractions of missing information across parameter estimates are equal, both statistics are equivalent). Given that D 1 generally gives the best results regarding power and Type-I error rates, and that D 0 is equivalent to D 1 but without the assumption of equal fractions of missingness across parameters, D 0 might be the ideal candidate for calculating pooled F-tests in unbalanced data when imbalance is handled using multiple imputation. Although Schafer (1997) and Li et al. (1991b) advised against the use of D 0 , the argument against its use was mainly that for small M it may not perform well. However, this problem may easily be resolved by increasing M.
Furthermore, although Li et al. (1991b) showed that violation of equal fractions of missing information is not necessarily problematic for D 1 , in their study the unequal fractions of missing information randomly varied across parameters of the simulation model and across replicated datasets. However, in unbalanced data unequal fractions of missing information may not always randomly vary across parameters. For example, in a field experiment it may be more difficult to collect data for one experimental condition than for the other, or in a clinical trial more people may drop out in one condition than in the other. Also, in non-experimental studies some categorical variables may have unequal groups in the population, for example, ethnicity. When drawing a random sample from this population, the different ethnic groups will have unequal sizes in the sample as well.
When fractions of missing information randomly vary across parameters, the frac tions of missing information may not be equal within one replication, but the average fractions of missing information across replicated datasets are. Consequently, the nega tive effect of unequal fractions of missing information may cancel itself out across repli cations. However, in situations where the differences in fractions of missing information across parameters do not vary across replicated datasets, a statistic might be needed that allows for different fractions of missing data across parameter estimates.
Thus, a second research question is how the different pooling statistics from Grund et al. (2016) will behave in unbalanced data where the unequal fractions of missing information do not vary across replications, now also including statistic D 0 . To this end, rejection rates of statistics D 0 to D 3 were studied along with the rejection rates for Type-III sum of squares under the null-and alternative hypothesis. For comparison, rejection rates were also studied for balanced data with the same total sample size.
In the next section, the setup of the simulation study is described. In the section that follows, results of the simulation study are shown. Finally, in the discussion section conclusions will be drawn about the usefulness of multiple imputation for balancing unbalanced designs, and implications for which statistic to use will be discussed.

Method
Data were simulated according to a two-way ANOVA model in the form of a regression model with effect coded predictors. Some of the properties of the data were held constant while some were varied (discussed next). The properties that were varied resulted in several design cells. Within each design cell 2500 replications were drawn (based on studies by Harel, 2009, andVan Ginkel, 2019).
The simulations were programmed in R (R Core Team, 2018). Multiple imputation was performed using Fully Conditional Specification (FCS; e.g., Van Buuren, 2012) in the mice package (Van Buuren & Groothuis-Oudshoorn, 2011). For continuous variables there are two different versions of FCS, namely regression imputation (FCS-R; e.g., Little & Schenker, 1995, p. 60;Van Buuren, 2012, p. 13) and Predictive Mean Matching (FCS-PMM; Rubin, 1986;Van Buuren, 2012, pp. 68-74). The default method in mice is FCS-PMM because it generally preserves distributional shapes better when data are not normally distributed than FCS-R Marshall, Altman, Royston, & Holder, 2010). However, for imputing an outcome variable with a normally distributed error (as in our simulation model) FCS-R may be a better alternative as it explicitly imputes values according to a normal linear model. Thus, the method for multiple imputation was FCS-R. For the imputation of outcome variable Y, factors A and B plus their interaction were used as predictors. Type-III sum of squares were computed using the car package (Fox & Weisberg, 2019). Statistic D 3 was calculated using the mitml package (Grund et al., 2016). The other statistics D 0 , D 1 , and D 2 were programmed by the first author.
Like in many other simulation studies, decisions regarding properties of the simula tion design were to some extent arbitrary. However, prior to running the simulations, some test runs were done to see what properties would make the effects of imbalance and the differences between the different statistics most clearly visible, and which were also likely to occur in practice. The properties of the simulation design that are going to be discussed next, are mostly the result of these test runs.

Fixed Design Characteristics
The number of levels of factor A was I = 2. The error term of the ANOVA model was normally distributed with μ ε = 0 and σ ε = 18.37.

Number of Levels of Factor B
The number of levels of factor B was J = 3, 4, 5.

Sample Size
Small, medium, and large sample sizes were studied. Because J also affects the number of design cells of the two-way ANOVA, sample size also partly depended on the size of J. For small N, the average cell size was 10, for medium N it was 20, and for large samples it was 30. Given these average cell sizes, sample sizes were N = 60, 120, 180 for J = 3, N = 80, 160, 240 for J = 4, and N = 100, 200, 300 for J = 5.

Degree and Structure of Imbalance
Four different degrees of imbalance were simulated, along with balanced data, for com parison. The degree of imbalance was varied as follows: for a specific design cell the cell size was either increased or decreased by each time adding the same number to, or subtracting the same number from the original cell size in the balanced case. The increasing and decreasing of cell sizes was done such that the total sample size remained the same.
Additionally, to study whether it mattered which cells increased or decreased in size, an additional situation of imbalance was created where the cell sizes of the most severe case of imbalance were randomly redistributed across design cells. The cell sizes for each degree of imbalance are displayed for small N in Table 1. For medium and large N the entries must be multiplied with 2 and 3 respectively.

Dependent Variables
For each of the F-tests in the two-way ANOVA, the number of times the null hypothesis was rejected was studied, denoted the rejection rate. As already noted, when p = 1, D 0 and D 1 are equivalent. Thus, for factor A, methods D 0,M = 5 , and D 0,M = 100 will not be displayed.

Results
To get a rough impression of how close the Type-I error rates were to α = .05 under the null hypothesis, it was tested whether the empirical rejection rates differed significantly from 0.05, using an Agresti and Coull (1998) confidence interval. Under the alternative hypothesis, it was tested whether the empirical rejection rates of the multiple-imputation methods differed significantly from the rejection rates of Type-III, assuming that the latter represent the "real" power. A method based on multiple imputation may be consid ered successful when under the null hypothesis it gives rejection rates not significantly different from 0.05, and when under the alternative hypothesis it gives rejection rates significantly higher than the rejection rates of Type-III. To avoid redundancy in discus sing the results, we will mainly focus on the results of factor B and less on the results of factor A and the interaction. Eighteen tables were needed to report all the results. Because results showed similar patterns across different J and different N, results of the remaining independent variables are only reported for J = 3 and a small N. Results for J > 3 and larger N are provided in Supplementary Materials. Table 2 shows the results for all levels of imbalance, all methods for handling imbalance, and all effects in the ANOVA, under the null model. Table 3 shows the results for all levels of imbalance, all methods for handling imbalance, and all effects in the ANOVA, under the alternative model.  Under the null hypothesis (Table 2), methods D 3,M = 5 , and D 3,M = 100 tend to underestimate the Type-I error rate for factor B somewhat, which becomes worse as the degree of imbalance increases. This undercoverage is worst for D 3,M = 100 . Method D 0,M = 5 tends to overestimate the Type-I error rate, which becomes worse as the degree of imbalance increases. Type-III, D 1,M = 5 , D 1,M = 100 , D 0,M = 100 , D 2,M = 5 , and D 2,M = 100 produce Type-I error rates that do not differ significantly from 5% in most cases. All methods based on M = 5 and method D 3,M = 100 have lower power than Type-III sum of squares (Table 3). The difference in power between these methods and Type-III sum of squares becomes larger as the degree of imbalance increases. Methods D 0,M = 100 and D 2,M = 100 generally have power values close to that of Type-III sum of squares. Like methods D 0,M = 100 and D 2,M = 100 , method D 1,M = 100 has power values similar to that of Type-III sum for the main effects. However, for the interaction effect, D 1,M = 100 has significantly higher power than Type-III sum of squares.
Finally, when the order of the cell sizes is shuffled, we only see changes in results for the interaction in the alternative model (Table 2): For methods D 1,M = 5 and D 1,M = 100 the power drops below the power level of Type-III sum of squares; for D 2,M = 100 it raises up to a higher level than that of Type-III sum of squares. Van Ginkel and Kroonenberg (2014) suggested multiple imputation to increase the power of a two-way ANOVA in unbalanced designs compared to Type-III sum of squares. In the current study it was studied whether multiple imputation would indeed do this. For the main effects, methods D 0,M = 100 , D 1,M = 100 and D 2,M = 100 had power similar to those of Type-III sum of squares, but not higher power than Type-III sum of squares. However, for the interaction effect, method D 1,M = 100 had higher power than Type-III sum of squares for all degrees of imbalance, except when the order of cell sizes was shuffled. In the latter case, only method D 2,M = 100 had higher power than Type-III sum of squares. The other methods based on multiple imputation had lower power than Type-III sum of squares in all situations. Additionally, D 0,M = 5 overestimated the Type-I error rate.

Discussion
The main conclusion of this study is that there may be some benefit in doing multiple imputation for handling unbalanced data in two-way ANOVA after all. When using the appropriate statistics (D 1 and D 2 ) with a large number of imputations (M = 100), the test for the interaction may have higher power than Type-III sum of squares. However, the specific structure of imbalance (i.e. which cells have a high number of observations and which ones have a small number of observations) also seems to matter. Under all four degrees of imbalance D 1,M = 100 was the methods that had highest power for the interaction. However, when the sizes of the cells were shuffled, D 2,M = 100 had the highest power. Thus, it remains unclear how and when multiple imputation leads to higher power rates for the interaction in unbalanced two-way ANOVA.

van Ginkel & Kroonenberg
Although there seems to be some benefit in multiple imputation over Type-III sum of squares, it may be wondered whether this benefit outweighs the costs. Multiple imputation is more work, the benefit seems to only concern the interaction, and it is not even entirely clear when it has higher power rates than Type-III sum of squares.
However, although the benefit of multiple imputation over Type-III sum of squares is relatively small, the results of this study are still important for other reasons. Previously it was only assumed that it was better to use D 1 than D 0 because D 0 used a noisy estimate of the total covariance matrix in its computation, which would especially be problematic for small M, but never demonstrated. The current study showed that for M = 5, D 0 indeed performed poorly regarding Type-I error rates (overestimation of the Type-I error rate).
Furthermore, Li et al. (1991b) showed that D 1 was robust to violation of the assump tion of equal fractions of missing information across parameter estimates when the unequal fractions randomly varied across replications. The current study showed that this result also seems to hold when the unequal fractions were fixed across replications. Thus, based on the current results there does not seem to be a need for a version of D 1 that allows for unequal fractions of missing information, but which performs poorly for small M (D 0 ).
As for statistics D 2 and D 3 , although Li et al. (1991a) showed that D 2 could sometimes be too liberal, both Grund et al. (2016) and the current study show satisfactory Type-I error rates of this method. Additionally, the current study also shows that the power of D 2 is comparable to that of Type-III sum of squares when M = 100. Considering these results and the fact that D 2 is relatively easy to compute, it may be a good candidate for pooling F-tests from ANOVA in multiply imputed data. However, as long as it is not clear when D 2 may overestimate the Type-I error rate, we must be cautious concluding that D 2 is generally the best alternative for pooling the results of ANOVA in multiply imputed data.
The results of D 3 were disappointing: in unbalanced data this method (severely) underestimated the Type-I error rates. Additionally, of all statistics, D 3 generally had the lowest power. An additional disadvantage of D 3 is that is not easily computed as it requires the likelihood of imputed dataset m evaluated at the average set of model parameters. The latter makes the implementation in software packages complex. Consid ering the disappointing results for D 3 and its complex implementation, this statistic is not the most convenient method for pooling results of ANOVA.
The results of the current study imply that software packages do not need to replace D 1 with D 0 , not even in case of unequal fractions of missing information that are inherently related to the parameters in question. While it was previously shown that D 1 produces accurate Type-I error rates when fractions of missing information are on average equal across replications (Li et al, 1991b), the current study shows that D 1 also produces accurate Type-I error rates for unequal average fractions of missing informa tion across replications.
In conclusion, it may be a bit premature to conclude that multiple imputation is a good alternative to Type-III sum of squares in unbalanced data, given the extra amount of work and the fact that its benefits only seem to show in the interaction. Finally, as most other studies have already indicated, we recommend using either D 1 or D 2 in multiply imputed data, with a slight preference of D 1 .

Funding:
The authors have no funding to report.