In an experiment where twoway analysis of variance is the intended analysis, unforeseen circumstances may occur which may cause the design to be unbalanced. Unbalanced data may also occur in nonexperimental research when group sizes are unequal by themselves. One important consequence of imbalance is that due to the resulting multicollinearity Ftests in ANOVA have less power than in balanced designs (e.g., Shaw & MitchellOlds, 1993, p. 1643). One of the most widely used methods for calculating Ftests in unbalanced data is TypeIII sum of squares. This method has several advantages over other types of sum of squares, such as Type I and Type II. For example, in Type III sum of squares, the Fvalues are not influenced by the order in which each effect is entered in the ANOVA, as in TypeI sum of squares. For a discussion of the advantages of Type III sum of squares over the other types, see Maxwell & Delaney (2004). However, despite these advantages, the power of the Ftests in Type III sum of squares is still lower than in balanced designs.
According to Schafer (1997, p. 21) an unbalanced design can be considered a missingdata problem by imagining a number of additional cases with missing data on the outcome variable, which in the appropriate conditions would result in a balanced design (also, see Shaw & MitchellOlds, 1993, p. 1640, and Winer, Brown, & Michels, 1991, pp. 479481). Considering an unbalanced design a missingdata problem creates new ways to handle the problems that come with unbalanced designs by using methods for dealing with missing data. Shaw and MitchellOlds (1993, p. 1641) discuss imputation (i.e., estimation) of the missing data to balance the design. They point out that imputation will give unbiased parameter estimates of the ANOVA model, but biased significance tests. This bias is caused by the fact that the imputed values are treated as observed values, and do not incorporate uncertainty of the missing values in the analysis (e.g., Schafer, 1997, p. 2). Thus, a method for estimating the missing values is needed that does take this uncertainty into account. One such method is multiple imputation (Rubin, 2004; Van Buuren, 2012). In multiple imputation, 1) missing values are estimated multiple times (M), resulting in M different complete versions of the incomplete dataset. Next, 2) these complete datasets are analyzed using the statistical analysis of interest, resulting in M outcomes of the same analysis, and 3) the results are combined into one pooled analysis, such that the uncertainty about the missing values is taken into account. For this pooling, formulas are used which will henceforth be called combination rules.
For balancing unbalanced data, multiple imputation may be used as follows. First, by adding a number of additional cases to specific groups such that all groups have equal size, the dataset is now balanced where in some cells cases have missing data on the outcome variable. These missing data are then multiply imputed using factors A and B as categorical predictors of the missing data on the outcome variable. Procedures for how to generate multiply imputed values for the missing data are described in, for example, Van Buuren (2012, Chapter 3) and Raghunathan, Lepkowski, Van Hoewyk, and Solenberger (2001).
Once multiply imputed datasets have been obtained, the ANOVAs can be applied to these datasets, and the results can be combined using specific combination rules. However, Van Ginkel and Kroonenberg (2014) argue that the M Ftests of an effect in ANOVA in M multiply imputed datasets cannot be combined directly, because for the combination rules that are suited for ANOVA (to be discussed shortly) one needs M sets of regression coefficients and M covariance matrices of these regression coefficients. See, Rubin (2004, pp. 7981) and Van Ginkel (2019) who show how to calculate a pooled Fstatistic testing several regression coefficients for significance simultaneously for multiply imputed datasets.
However, one can reformulate the twoway ANOVA model as a regression model, so that the combination rules can be applied to the regression coefficients. Van Ginkel and Kroonenberg (2014) show that this can be done by recoding the categorical variables and their interaction into effectcoded variables, and using these effectcoded variables as predictors in the regression model (also, see Winer et al., 1991, pp. 959963). Once estimates of the regression coefficients and their covariances have been obtained for each of the M imputed datasets, the results may be combined.
For pooling the results of twoway ANOVA Rubin (2004) discusses four methods that may be good candidates. Schafer (1997) called three of these methods ${D}_{1}$, ${D}_{2}$, and ${D}_{3}$. The fourth method was not given a name, probably because Schafer noted some problems of this method, which made him set it aside altogether. However, for the current application of these combination rules, the statistic that was not given a name may still be useful, for reasons discussed later on. Because the method with no name was chronologically the first method in Schafer’s book, we will call this statistic ${D}_{0}$. Each of these methods are discussed next.
Statistic D₀
Let ${\widehat{Q}}_{m}$ be a vector of length p, containing p parameter estimates of a statistical model, and let ${U}_{m}$ be a covariance matrix of the p parameter estimates, in imputed dataset m. A pooled set of parameter estimates across M imputed datasets is computed as:
1
$\overline{Q}=\frac{1}{M}{\displaystyle \sum}_{m=1}^{M}{\widehat{Q}}_{m}.$The pooled covariance matrix has two components, namely a withinimputation covariance matrix $\overline{U}$, and a betweenimputation covariance matrix $B$:
2
$\overline{U}=\frac{1}{M}{\displaystyle \sum}_{m=1}^{M}{U}_{m},$3
$B=\frac{1}{M}{\displaystyle \sum}_{m=1}^{M}\left({\widehat{Q}}_{m}\overline{Q}\right)\left({\widehat{Q}}_{m}\overline{Q}\right)\text{'}.$The total covariance matrix of the estimate $\overline{Q}$ is computed as
4
$T=\overline{U}+\left(1+{M}^{1}\right)B.$To test all p parameter estimates in $\overline{Q}$, statistic ${D}_{0}$ (Rubin, 2004, pp. 7778) is computed as:
5
${D}_{0}=\overline{Q}\text{'}{T}^{1}\overline{Q}/p,$which has an approximate Fdistribution with p model degrees of freedom, and $\left(1+{p}^{1}\right)\widehat{\nu}/2$ error degrees of freedom, where an approximation for $\widehat{\nu}$ by Reiter (2007) may be used. The formula for $\widehat{\nu}$ is rather long and complex, so we will not give it here. For computational details, see Reiter (2007).
A pooled Ftest for the effect of factor A may be computed by filling in the pooled regression coefficients of the effect coded variables that together form factor A in ${\widehat{Q}}_{m}$ in Equation 1 and Equation 3, filling in the covariances of these coefficients in Equation 2, and use the resulting $\overline{Q}$, $\overline{U}$ and $B$ in Equation 4 and Equation 5. Similarly, Ftests for factor B and the interaction may be computed. Computational details on how to obtain the covariance matrix ${U}_{m}$ for the effects of factors A, B and the interaction are given in Van Ginkel and Kroonenberg (2014).
Statistic D₁
Because $B$ is a noisy estimate of the betweenimputation variance and does not even have full rank if M < p, Schafer (1997, p. 113) advises against using statistic ${D}_{0}$. Rather than using $T$ as an estimate for the total covariance matrix, a more stable total covariance matrix is:
6
$\tilde{T}=\left(1+{r}_{1}\right)\overline{U}$where ${r}_{1}$ is the relative increase in variance due to nonresponse. The resulting Fvalue testing all elements in vector $\overline{Q}$ for significance simultaneously is:
7
${D}_{1}=\overline{Q}\text{'}{\tilde{T}}^{1}\overline{Q}/p,$which, under the assumption that ${r}_{1}$ is equal across all elements of $\overline{Q}$, has an approximate Fdistribution with p numerator degrees of freedom, and $\widehat{\nu}$ denominator degrees of freedom.
Despite the assumption of equal ${r}_{1}$ across all elements of $\overline{Q}$, Li, Raghunathan and Rubin (1991b) showed that violation of this assumption does not substantially influence the TypeI error rate of ${D}_{1}$. Because of this finding, and because of the disadvantages of statistic ${D}_{0}$, ${D}_{1}$has been widely implemented in different statistical packages, such as SAS (SAS, 2013; Yuan, 2011), Stata (StataCorp, 2017), the miceadds package in R (Robitzsch, Grund, & Henke, 2017), and an SPSS (IBM SPSS, 2017) macro by Van Ginkel (2010). Using the procedure outlined by Van Ginkel and Kroonenberg (2014), Grund et al. (2016) showed that D_{1} generally produces TypeI error rates close to the theoretical α in both oneway and twoway ANOVAs.
Statistic D₂
Define
8
${d}_{W,m}={\widehat{Q}}_{m}\text{'}{U}_{m}^{1}{\widehat{Q}}_{m}$as the Wald statistic of imputed dataset m,
9
${\overline{d}}_{W}=\frac{1}{M}{\displaystyle \sum}_{m=1}^{M}{d}_{W,m}$as the average Wald statistic across imputed datasets, and
10
${r}_{2}=\left(1+{M}^{1}\right)\left[\frac{1}{M1}{\displaystyle \sum}_{m=1}^{M}{\left(\sqrt{{d}_{W,m}}\overline{\sqrt{{d}_{W}}}\right)}^{2}\right]$as an alternative estimate of the relative increase in variance due to nonresponse. Statistic ${D}_{2}$ is given by:
11
${D}_{2}=\frac{{\overline{d}}_{W}{p}^{1}\left(M+1\right){\left(M1\right)}^{1}{r}_{2}}{1+{r}_{2}}$As a reference distribution for ${D}_{2}$ an Fdistribution with p numerator degrees of freedom and ${\nu}_{2}={p}^{\frac{3}{m}}\left(m1\right){\left(1+{r}_{2}^{1}\right)}^{2}$ denominator degrees of freedom is used. Applied to twoway ANOVA, significance tests for factors A, B, and the interaction can be obtained by substituting the relevant coefficients for ${\widehat{Q}}_{m}$, and by substituting the covariance matrices of the relevant coefficients for ${U}_{m}$, and substitute these quantities in Equation 8.
An advantage of ${D}_{2}$ is that it is easier to compute than ${D}_{0}$ and ${D}_{1}$ as it can be calculated from M separate Wald statistics alone and no separate combining of the covariance matrices is needed (Schafer, 1997, p. 115). Li, Meng, Raghunathan, and Rubin (1991a) however, show that ${D}_{2}$ occasionally produces TypeI error rates as high as 8% when using α = .05. Li et al. (1991a) therefore argue that ${D}_{2}$ only be used as a rough guide. However, Grund et al. (2016) found TypeI error rates for ${D}_{2}$ that were closer to the theoretical 5% than found by Li et al. (1991a). Thus, how well ${D}_{2}$ performs regarding TypeI error rates in which situations, remains unclear.
Statistic D₃
Statistic ${D}_{3}$ (Meng & Rubin, 1992) is a combined likelihoodratio test of M likelihoodratio statistics from M imputed datasets. Define ${d}_{L,m}$ as a likelihoodratio test of imputed dataset m, and
12
${\overline{d}}_{L}=\frac{1}{M}{\displaystyle \sum}_{m=1}^{M}{d}_{L,m}$as the average ${d}_{L,m}$ across imputed datasets. Next, define ${d}_{L,m}^{*}$ as a likelihoodratio test of imputed dataset m but now evaluated at the average parameter estimates of the model across M imputed datasets. The average ${d}_{L,m}^{*}$ is computed as
13
${\tilde{d}}_{L}=\frac{1}{M}{\displaystyle \sum}_{m=1}^{M}{d}_{L,m}^{*}$Statistic ${D}_{3}$ is computed as
14
${D}_{3}=\frac{{\overline{d}}_{L}}{p\left(1+{r}_{3}\right)}$The reference distribution that is used for ${D}_{3}$ is an Fdistribution with p numerator degrees of freedom and a denominator degrees of freedom given by:
15
$\begin{array}{cc}{\nu}_{3}=4+\left(t4\right){\left[1+\left(12{t}^{1}\right){r}_{3}^{1}\right]}^{2}\hfill & \text{if}t=p\left(M1\right)4\hfill \\ {\nu}_{3}=t\left(1+{p}^{1}\right){\left(1+{r}_{3}^{1}\right)}^{2}/2\hfill & \text{otherwise.}\hfill \end{array}$For a specific effect in the model (factor A, factor B, interaction) ${D}_{3}$ is computed by letting ${d}_{L,m}$ and ${d}_{L,m}^{*}$ be the likelihoodratio tests that test the full model against the full model, minus the terms that together form the specific effect.
Goal of the Current Study
Several authors (Schafer, 1997; Shaw & MitchellOlds, 1993; Winer et al., 1991) have pointed out that unbalanced designs can be considered a missingdata problem. Shaw and MitchellOlds (1993) discussed imputation as a possible solution for imbalance and at the same time noted some problems of the imputation approach regarding pvalues. However, they did not mention the use of multiple imputation to overcome the problems of single imputation in unbalanced designs. Although Schafer (1997) never explicitly suggested to balance an unbalanced design using multiple imputation, the suggestion implicitly followed from the fact that Schafer’s book is about multiple imputation. Van Ginkel and Kroonenberg (2014, p. 79) made the suggestion explicit, and used it in an empirical data example (pp. 8788). The reasoning behind their suggestion was that balanced designs generally have more power than unbalanced designs, and that consequently multiple imputation in unbalanced designs could increase power.
However, Van Ginkel and Kroonenberg (2014) did not carry out any simulations to back up this claim. Additionally, one aspect that Van Ginkel and Kroonenberg did not include in their reasoning is that multiple imputation does not actually add information that could increase power. Indeed multiple imputation simulates additional cases, but imputing M values of the outcome variable for each additional case creates additional variation in the parameter estimates of the ANOVA model as well, represented by covariance matrix $B$ or the relative increase in variance due to nonresponse (r_{1}, r_{2}, and r_{3}). The statistics ${D}_{0}$ to ${D}_{3}$ include these quantities in their calculation such that it lowers their power, which may consequently undo the increased power as a result of creating additional cases. Also, Van Buuren (2012, p. 48) and Von Hippel (2007) argue that when missing data only occur on the outcome variable, analyzing only the complete cases may be preferred over multiple imputation. Since unbalanced data can be considered a situation with missings on only the outcome variable, and that TypeIII sum of squares in this context is equivalent to analyzing only complete cases, it would follow that in unbalanced data TypeIII sum of squares is preferred over multiple imputation.
In short, both valid arguments for balancing unbalanced data using multiple imputation prior to twoway ANOVA, and simulation studies that confirm its usefulness seem to be lacking. However, the fact that this suggestion has been made in the literature or even just the fact that unbalanced data are often considered a missingdata problem and that multiple imputation is a highly recommended procedure for dealing with missing data, suggests that at first sight it may not be obvious to all statisticians that this suggestion may not be useful. Hence, a simulation study is needed to demonstrate its actual usefulness. In the current paper we will carry out such a study. Consequently, the first research question is whether there is some benefit in using multiple imputation for balancing unbalanced data after all.
Furthermore, Grund et al. (2016) already investigated statistics ${D}_{1}$, ${D}_{2}$, and ${D}_{3}$ in various situations. In their study they found that in most situations all statistics produced TypeI error rates close to 5%. However, in more extreme situations (e.g., small sample size, high percentages of missing data), ${D}_{2}$, and ${D}_{3}$ were somewhat conservative and had lower power than ${D}_{1}$. Based on this study, ${D}_{1}$ is the preferred statistic for combining the results of multiply imputed data in ANOVA.
However, the question is to what extent the results by Grund et al. (2016) generalize to a situation where unbalanced data are balanced using multiple imputation. ${D}_{1}$ showed the most promising results but has one potential weak point, and which may especially be problematic in unbalanced data: it assumes that the fraction of missing information is equal across the p parameter estimates in $\overline{Q}$. Unequal fractions of missing information across parameter estimates in $\overline{Q}$ are inherently related to unbalanced designs as each regression coefficient of an effectcoded variable of a specific factor represents a specific group of this factor. Unequal group sizes imply that for smaller groups more additional values of the dependent variable have to be multiply imputed than for larger groups, resulting in unequal fractions of missing information across parameter estimates.
In a situation which inherently has unequal fractions of missing information across parameter estimates, a statistic assuming equal fractions of missing information across parameter estimates ( ${D}_{1})$ is not the most convenient choice, especially when an alternative version of this statistic exists ( ${D}_{0}$) that does not make this assumption (when p = 1, or when the fractions of missing information across parameter estimates are equal, both statistics are equivalent). Given that ${D}_{1}$ generally gives the best results regarding power and TypeI error rates, and that ${D}_{0}$ is equivalent to ${D}_{1}$ but without the assumption of equal fractions of missingness across parameters, ${D}_{0}$ might be the ideal candidate for calculating pooled Ftests in unbalanced data when imbalance is handled using multiple imputation. Although Schafer (1997) and Li et al. (1991b) advised against the use of ${D}_{0}$, the argument against its use was mainly that for small M it may not perform well. However, this problem may easily be resolved by increasing M.
Furthermore, although Li et al. (1991b) showed that violation of equal fractions of missing information is not necessarily problematic for ${D}_{1}$, in their study the unequal fractions of missing information randomly varied across parameters of the simulation model and across replicated datasets. However, in unbalanced data unequal fractions of missing information may not always randomly vary across parameters. For example, in a field experiment it may be more difficult to collect data for one experimental condition than for the other, or in a clinical trial more people may drop out in one condition than in the other. Also, in nonexperimental studies some categorical variables may have unequal groups in the population, for example, ethnicity. When drawing a random sample from this population, the different ethnic groups will have unequal sizes in the sample as well.
When fractions of missing information randomly vary across parameters, the fractions of missing information may not be equal within one replication, but the average fractions of missing information across replicated datasets are. Consequently, the negative effect of unequal fractions of missing information may cancel itself out across replications. However, in situations where the differences in fractions of missing information across parameters do not vary across replicated datasets, a statistic might be needed that allows for different fractions of missing data across parameter estimates.
Thus, a second research question is how the different pooling statistics from Grund et al. (2016) will behave in unbalanced data where the unequal fractions of missing information do not vary across replications, now also including statistic ${D}_{0}$. To this end, rejection rates of statistics ${D}_{0}\text{}$to ${D}_{3}$ were studied along with the rejection rates for Type III sum of squares under the null and alternative hypothesis. For comparison, rejection rates were also studied for balanced data with the same total sample size.
In the next section, the setup of the simulation study is described. In the section that follows, results of the simulation study are shown. Finally, in the discussion section conclusions will be drawn about the usefulness of multiple imputation for balancing unbalanced designs, and implications for which statistic to use will be discussed.
Method
Data were simulated according to a twoway ANOVA model in the form of a regression model with effect coded predictors. Some of the properties of the data were held constant while some were varied (discussed next). The properties that were varied resulted in several design cells. Within each design cell 2500 replications were drawn (based on studies by Harel, 2009, and Van Ginkel, 2019).
The simulations were programmed in R (R Core Team, 2018). Multiple imputation was performed using Fully Conditional Specification (FCS; e.g., Van Buuren, 2012) in the mice package (Van Buuren & GroothuisOudshoorn, 2011). For continuous variables there are two different versions of FCS, namely regression imputation (FCSR; e.g., Little & Schenker, 1995, p. 60; Van Buuren, 2012, p. 13) and Predictive Mean Matching (FCSPMM; Rubin, 1986; Van Buuren, 2012, pp. 6874). The default method in mice is FCSPMM because it generally preserves distributional shapes better when data are not normally distributed than FCSR (Marshall, Altman, & Holder, 2010; Marshall, Altman, Royston, & Holder, 2010). However, for imputing an outcome variable with a normally distributed error (as in our simulation model) FCSR may be a better alternative as it explicitly imputes values according to a normal linear model. Thus, the method for multiple imputation was FCSR. For the imputation of outcome variable Y, factors A and B plus their interaction were used as predictors. Type III sum of squares were computed using the car package (Fox & Weisberg, 2019). Statistic ${D}_{3}$ was calculated using the mitml package (Grund et al., 2016). The other statistics ${D}_{0}$, ${D}_{1}$, and ${D}_{2}$ were programmed by the first author.
Like in many other simulation studies, decisions regarding properties of the simulation design were to some extent arbitrary. However, prior to running the simulations, some test runs were done to see what properties would make the effects of imbalance and the differences between the different statistics most clearly visible, and which were also likely to occur in practice. The properties of the simulation design that are going to be discussed next, are mostly the result of these test runs.
Fixed Design Characteristics
The number of levels of factor A was I = 2. The error term of the ANOVA model was normally distributed with ${\mu}_{\epsilon}=0$ and ${\sigma}_{\epsilon}=18.37$.
Independent Variables
Number of Levels of Factor B
The number of levels of factor B was J = 3, 4, 5.
Parameters of the TwoWay ANOVA Model
For each J, two sets of parameters of the twoway ANOVA model were studied: one in which there were no effects of factor A, factor B, and the interaction in the population (the null model), and one in which there were effects (the alternative model). For J = 3 the two sets of parameter estimates were $\beta =\left({\beta}_{0},{\beta}_{1\left(\text{A}\right)},{\beta}_{2\left(\text{B}\right)},{\beta}_{3\left(\text{B}\right)},{\beta}_{4\left(\text{AB}\right)},{\beta}_{5\left(\text{AB}\right)}\right)$ = (27,0,0,0,0,0) and $\beta =\left({\beta}_{0},{\beta}_{1\left(\text{A}\right)},{\beta}_{2\left(\text{B}\right)},{\beta}_{3\left(\text{B}\right)},{\beta}_{4\left(\text{AB}\right)},{\beta}_{5\left(\text{AB}\right)}\right)$ = (27,1.5,3,0,1,0.5).
For J = 4 the two sets were: $\beta =\left({\beta}_{0},{\beta}_{1\left(\text{A}\right)},{\beta}_{2\left(\text{B}\right)},{\beta}_{3\left(\text{B}\right)},{\beta}_{4\left(\text{B}\right)},{\beta}_{5\left(\text{AB}\right)},{\beta}_{6\left(\text{AB}\right)},{\beta}_{7\left(\text{AB}\right)}\right)$ = (27,0,0,0,0,0,0,0) and $\beta =\left({\beta}_{0},{\beta}_{1\left(\text{A}\right)},{\beta}_{2\left(\text{B}\right)},{\beta}_{3\left(\text{B}\right)},{\beta}_{4\left(\text{B}\right)},{\beta}_{5\left(\text{AB}\right)},{\beta}_{6\left(\text{AB}\right)},{\beta}_{7\left(\text{AB}\right)}\right)$ = (27,1.5,3,1,1,1,0,0.5).
Finally, for J = 5 the two sets were: $\beta =\left({\beta}_{0},{\beta}_{1\left(\text{A}\right)},{\beta}_{2\left(\text{B}\right)},{\beta}_{3\left(\text{B}\right)},{\beta}_{4\left(\text{B}\right)},{\beta}_{5\left(\text{B}\right)},{\beta}_{6\left(\text{AB}\right)},{\beta}_{7\left(\text{AB}\right)},{\beta}_{8\left(\text{AB}\right)},{\beta}_{9\left(\text{AB}\right)}\right)$ = (27,0,0,0,0,0,0,0,0,0) and $\beta =\left({\beta}_{0},{\beta}_{1\left(\text{A}\right)},{\beta}_{2\left(\text{B}\right)},{\beta}_{3\left(\text{B}\right)},{\beta}_{4\left(\text{B}\right)},{\beta}_{5\left(\text{B}\right)},{\beta}_{6\left(\text{AB}\right)},{\beta}_{7\left(\text{AB}\right)},{\beta}_{8\left(\text{AB}\right)},{\beta}_{9\left(\text{AB}\right)}\right)$ = (27,1.5,3,1,0,1,1,0.25,0.25,0.5).
Sample Size
Small, medium, and large sample sizes were studied. Because J also affects the number of design cells of the twoway ANOVA, sample size also partly depended on the size of J. For small N, the average cell size was 10, for medium N it was 20, and for large samples it was 30. Given these average cell sizes, sample sizes were N = 60, 120, 180 for J = 3, N = 80, 160, 240 for J = 4, and N = 100, 200, 300 for J = 5.
Degree and Structure of Imbalance
Four different degrees of imbalance were simulated, along with balanced data, for comparison. The degree of imbalance was varied as follows: for a specific design cell the cell size was either increased or decreased by each time adding the same number to, or subtracting the same number from the original cell size in the balanced case. The increasing and decreasing of cell sizes was done such that the total sample size remained the same.
Additionally, to study whether it mattered which cells increased or decreased in size, an additional situation of imbalance was created where the cell sizes of the most severe case of imbalance were randomly redistributed across design cells. The cell sizes for each degree of imbalance are displayed for small N in Table 1. For medium and large N the entries must be multiplied with 2 and 3 respectively.
Table 1
Cell size  Balanced  Imbalance



Small  Medium  Severe  Extra severe  Extra severe, order shuffled  
No. levels factor B: 3  
n_{11}  10  8  6  4  2  18 
n_{12}  10  10  10  10  10  10 
n_{13}  10  12  14  16  18  2 
n_{21}  10  11  12  13  14  6 
n_{22}  10  10  10  10  10  10 
n_{23}  10  9  8  7  6  14 
No. levels factor B: 4  
n_{11}  10  8  6  4  2  10 
n_{12}  10  10  10  10  10  18 
n_{13}  10  10  10  10  10  10 
n_{14}  10  12  14  16  18  2 
n_{21}  10  11  12  13  14  10 
n_{22}  10  10  10  10  10  6 
n_{23}  10  10  10  10  10  10 
n_{24}  10  9  8  7  6  14 
No. levels factor B: 5  
n_{11}  10  8  6  4  2  10 
n_{12}  10  10  10  10  10  18 
n_{13}  10  10  10  10  10  10 
n_{14}  10  10  10  10  10  10 
n_{15}  10  12  14  16  18  2 
n_{21}  10  11  12  13  14  10 
n_{22}  10  10  10  10  10  6 
n_{23}  10  10  10  10  10  10 
n_{24}  10  10  10  10  10  10 
n_{25}  10  9  8  7  6  14 
Method for Handling Imbalance
Nine methods for handling imbalance were used: Type III sum of squares, and two versions of each of the statistics ${D}_{0}$, ${D}_{1}$, ${D}_{2}$, and ${D}_{3}$. Because ${D}_{0}$ is argued to be sensitive to a small number of imputations, two different sizes of M were studied within each of the statistics, namely M = 5 and M = 100. This defines the eight procedures based on multiple imputation: ${D}_{\mathrm{0,}M=5}$, ${D}_{\mathrm{0,}M=100}$, ${D}_{\mathrm{1,}M=5}$, ${D}_{\mathrm{1,}M=100}$, ${D}_{\mathrm{2,}M=5}$, ${D}_{\mathrm{2,}M=100}$, ${D}_{\mathrm{3,}M=5}$, and ${D}_{\mathrm{3,}M=100}$.
Dependent Variables
For each of the Ftests in the twoway ANOVA, the number of times the null hypothesis was rejected was studied, denoted the rejection rate. As already noted, when p = 1, ${D}_{0}$ and ${D}_{1}$ are equivalent. Thus, for factor A, methods ${D}_{\mathrm{0,}M=5}$, and ${D}_{\mathrm{0,}M=100}$ will not be displayed.
Results
To get a rough impression of how close the Type I error rates were to α = .05 under the null hypothesis, it was tested whether the empirical rejection rates differed significantly from 0.05, using an Agresti and Coull (1998) confidence interval. Under the alternative hypothesis, it was tested whether the empirical rejection rates of the multipleimputation methods differed significantly from the rejection rates of Type III, assuming that the latter represent the “real” power. A method based on multiple imputation may be considered successful when under the null hypothesis it gives rejection rates not significantly different from 0.05, and when under the alternative hypothesis it gives rejection rates significantly higher than the rejection rates of Type III. To avoid redundancy in discussing the results, we will mainly focus on the results of factor B and less on the results of factor A and the interaction.
Eighteen tables were needed to report all the results. Because results showed similar patterns across different J and different N, results of the remaining independent variables are only reported for J = 3 and a small N. Results for J > 3 and larger N are provided in Supplementary Materials. Table 2 shows the results for all levels of imbalance, all methods for handling imbalance, and all effects in the ANOVA, under the null model. Table 3 shows the results for all levels of imbalance, all methods for handling imbalance, and all effects in the ANOVA, under the alternative model.
Table 2
Method  Balanced  Imbalance



Small  Medium  Severe  Extra severe  Extra severe, order shuffled  
Effect A  
Type III  .052  .049  .051  .054  .055  .055 
${D}_{\mathrm{1,}M=5}$  .049  .056  .059^{a}  .061^{a}  .058  
${D}_{\mathrm{1,}M=100}$  .051  .051  .054  .054  .053  
${D}_{\mathrm{2,}M=5}$  .047  .052  .052  .052  .054  
${D}_{\mathrm{2,}M=100}$  .052  .053  .056  .060^{a}  .058  
${D}_{\mathrm{3,}M=5}$  .045  .034^{a}  .027^{a}  .019^{a}  .016^{a}  
${D}_{\mathrm{3,}M=100}$  .044  .036^{a}  .024^{a}  .006^{a}  .005^{a}  
Effect B  
Type III  .047  .049  .054  .049  .048  .048 
${D}_{\mathrm{0,}M=5}$  .052  .059^{a}  .064^{a}  .073^{a}  .081^{a}  
${D}_{\mathrm{0,}M=100}$  .053  .059^{a}  .052  .052  .053  
${D}_{\mathrm{1,}M=5}$  .038^{a}  .035^{a}  .034^{a}  .033^{a}  .052  
${D}_{\mathrm{1,}M=100}$  .043  .042  .030^{a}  .023^{a}  .056  
${D}_{\mathrm{2,}M=5}$  .050  .055  .058  .054  .057  
${D}_{\mathrm{2,}M=100}$  .055  .060^{a}  .054  .050  .051  
${D}_{\mathrm{3,}M=5}$  .042  .034^{a}  .026^{a}  .024^{a}  .023^{a}  
${D}_{\mathrm{3,}M=100}$  .048  .042  .026^{a}  .014^{a}  .018^{a}  
Effect A × B  
Type III  .058  .048  .054  .056  .045  .045 
${D}_{\mathrm{0,}M=5}$  .052  .055  .066^{a}  .079^{a}  .084^{a}  
${D}_{\mathrm{0,}M=100}$  .056  .054  .058  .050  .048  
${D}_{\mathrm{1,}M=5}$  .038^{a}  .037^{a}  .032^{a}  .035^{a}  .046  
${D}_{\mathrm{1,}M=100}$  .044  .035^{a}  .034^{a}  .022^{a}  .055  
${D}_{\mathrm{2,}M=5}$  .048  .050  .052  .056  .055  
${D}_{\mathrm{2,}M=100}$  .058  .056  .059^{a}  .046  .048  
${D}_{\mathrm{3,}M=5}$  .041^{a}  .031^{a}  .028^{a}  .021^{a}  .022^{a}  
${D}_{\mathrm{3,}M=100}$  .048  .037^{a}  .031^{a}  .012^{a}  .015^{a} 
^{a}Significantly different from theoretical significance level of α = .05.
Table 3
Method  Balanced  Imbalance



Small  Medium  Severe  Extra severe  Extra severe, order shuffled  
Effect A  
Type III  .764  .746  .728  .686  .549  .549 
${D}_{\mathrm{1,}M=5}$  .734  .676^{a}  .585^{a}  .415^{a}  .406^{a}  
${D}_{\mathrm{1,}M=100}$  .753  .731  .680  .543  .541  
${D}_{\mathrm{2,}M=5}$  .718^{a}  .646^{a}  .547^{a}  .388^{a}  .370^{a}  
${D}_{\mathrm{2,}M=100}$  .755  .735  .687  .558  .556  
${D}_{\mathrm{3,}M=5}$  .702^{a}  .551^{a}  .370^{a}  .140^{a}  .142^{a}  
${D}_{\mathrm{3,}M=100}$  .736  .676^{a}  .555^{a}  .261^{a}  .253^{a}  
Effect B  
Type III  .976  .972  .961  .922  .826  .836 
${D}_{\mathrm{0,}M=5}$  .966  .934^{a}  .863^{a}  .719^{a}  .746^{a}  
${D}_{\mathrm{0,}M=100}$  .976  .963  .925  .829  .838  
${D}_{\mathrm{1,}M=5}$  .947^{a}  .873^{a}  .750^{a}  .505^{a}  .676^{a}  
${D}_{\mathrm{1,}M=100}$  .970  .947^{a}  .892^{a}  .720^{a}  .845  
${D}_{\mathrm{2,}M=5}$  .946^{a}  .852^{a}  .687^{a}  .426^{a}  .426^{a}  
${D}_{\mathrm{2,}M=100}$  .976  .964  .924  .807^{a}  .818^{a}  
${D}_{\mathrm{3,}M=5}$  .949^{a}  .869^{a}  .706^{a}  .384^{a}  .398^{a}  
${D}_{\mathrm{3,}M=100}$  .973  .950^{a}  .886^{a}  .643^{a}  .642^{a}  
Effect A × B  
Type III  .179  .182  .169  .150  .101  .148 
${D}_{\mathrm{0,}M=5}$  .172  .153^{a}  .151  .127^{a}  .174^{a}  
${D}_{\mathrm{0,}M=100}$  .194  .182  .151  .111  .157  
${D}_{\mathrm{1,}M=5}$  .146^{a}  .118^{a}  .091^{a}  .063^{a}  .092^{a}  
${D}_{\mathrm{1,}M=100}$  .174  .149^{a}  .111^{a}  .075^{a}  .126^{a}  
${D}_{\mathrm{2,}M=5}$  .156^{a}  .128^{a}  .115^{a}  .075^{a}  .117^{a}  
${D}_{\mathrm{2,}M=100}$  .197  .183  .149  .098  .164^{a}  
${D}_{\mathrm{3,}M=5}$  .152^{a}  .111^{a}  .077^{a}  .050^{a}  .042^{a}  
${D}_{\mathrm{3,}M=100}$  .186  .154^{a}  .107^{a}  .049^{a}  .036^{a} 
^{a}Significantly different from TypeIII, assuming TypeIII is the “true” power.
Under the null hypothesis (Table 2), methods ${D}_{\mathrm{1,}M=5}$, ${D}_{\mathrm{1,}M=100}$, ${D}_{\mathrm{3,}M=5}$, and ${D}_{\mathrm{3,}M=100}$ tend to underestimate the TypeI error rate for factor B somewhat, which becomes worse as the degree of imbalance increases. This undercoverage is worst for ${D}_{\mathrm{3,}M=100}$. Method ${D}_{\mathrm{0,}M=5}$tends to overestimate the TypeI error rate, which becomes worse as the degree of imbalance increases. Type III, ${D}_{\mathrm{0,}M=100}$, ${D}_{\mathrm{2,}M=5}$ ${D}_{\mathrm{2,}M=100}$ produce TypeI error rates that do not differ significantly from 5% in most cases.
All methods based on M = 5 and some of the methods based on M = 100 ( ${D}_{\mathrm{1,}M=100}$, ${D}_{\mathrm{3,}M=100}$) have lower power than Type III sum of squares (Table 3). The difference in power between these methods and Type III sum of squares becomes larger as the degree of imbalance increases. Methods ${D}_{\mathrm{0,}M=100}$ and ${D}_{\mathrm{2,}M=100}$ generally have power values close to that of TypeIII sum of squares.
Finally, when the order of the cell sizes is shuffled, only the results for methods based on ${D}_{1}$ change substantially; the results for methods ${D}_{\mathrm{1,}M=5}$ and ${D}_{\mathrm{1,}M=100}$ improve to acceptable Type I error rates and higher power.
Discussion
Van Ginkel and Kroonenberg (2014) suggested multiple imputation to increase the power of a twoway ANOVA in unbalanced designs compared to Type III sum of squares. In the current study it was studied whether multiple imputation would indeed do this. Only ${D}_{\mathrm{0,}M=100}$ and ${D}_{\mathrm{2,}M=100}$ had power similar to that of Type III sum of squares, but not higher power than Type III sum of squares. The other methods based on multiple imputation had lower power than Type III sum of squares. Additionally, the earlier recommended ${D}_{1}$statistic underestimated the TypeI error rate under the null hypothesis, both with M = 5, and M = 100 imputations, while statistic D_{0} which was not recommended earlier produced TypeI error rates close to 0.05 when M = 100. However, when M = 5 ${D}_{0}$ overestimated the TypeI error rate.
The main conclusion of this study is that multiple imputation is not an improvement of Type III sum of squares. At best, multiple imputation performs as well as Type III sum of squares when using the appropriate statistics ( ${D}_{0}$ and ${D}_{2}$) with a large number of imputations (M = 100). Given that Type III sum of squares is easier than doing multiple imputation, Type III sum of squares is still the preferred method in unbalanced designs.
Even though multiple imputation does not seem to be a useful alternative to Type III sum of squares, the results of this study are still important, and may in fact have implications for calculating significance tests for multiparameter estimates in multiply imputed datasets. Previously it was assumed that it was better to use ${D}_{1}$ than ${D}_{0}$ because ${D}_{0}$ used a noisy estimate of the total covariance matrix in its computation, which would especially be problematic for small M. Secondly, Li et al. (1991b) showed that ${D}_{1}$ was robust to violation of the assumption of equal fractions of missing information across parameter estimates. When looking at the results of the current study, however, one would come to a different conclusion. Although for M = 5, ${D}_{0}$ usually overestimated the TypeI error rates, for M = 100, ${D}_{0}$ gave satisfactory TypeI error rates. For ${D}_{1}$ on the other hand, the TypeI error rates for both M = 5 and M = 100 were too low. Additionally, power rates of ${D}_{0}$ were higher than those of ${D}_{1}$. Thus, the claim that ${D}_{1}$ is robust to violation of the assumption of equal fractions of missing information may only hold when the unequal fractions of missing information vary across replications, as was the case in the study by Li et al. (1991b). The problems that come with ${D}_{0}$ on the other hand, may be less serious than was previously thought. It should be noted however, that when the order of the cells sizes was shuffled the results improved for the methods based on ${D}_{1}$. Thus, the specific structure of the imbalance may also influence the performance of ${D}_{1}$, although it is unclear how.
It may be recommendable to use ${D}_{0}$ rather than ${D}_{1}$ in case of highly unequal fractions of missing information, and when there is reason to believe that unequal fractions of missing data across parameter estimates are not random but systematic. Although the results for ${D}_{0}$ are not that good when M = 5, for M = 100 its results are comparable with Type III sum of squares. Since nowadays computers are fast enough to perform M = 100 imputations in a short time, this required high number of imputations for ${D}_{0}$ should not be an issue.
As for statistics ${D}_{2}$ and ${D}_{3}$, although Li et al. (1991a) showed that ${D}_{2}$ could sometimes be too liberal, both Grund et al. (2016) and the current study show satisfactory TypeI error rates of this method. Additionally, the current study also shows that the power of ${D}_{2}$ is comparable to that of TypeIII sum of squares when M = 100. Considering these results and the fact that ${D}_{2}$ is relatively easy to compute, it may be a good candidate for pooling Ftests from ANOVA in multiply imputed data. However, as long as it is not clear when ${D}_{2}$ may overestimate the TypeI error rate, we must be cautious concluding that ${D}_{2}$ is generally the best alternative for pooling the results of ANOVA in multiply imputed data.
The results of ${D}_{3}$ were disappointing: in unbalanced data this method (severely) underestimated the TypeI error rates. Additionally, of all statistics, ${D}_{3}$ generally had the lowest power. An additional disadvantage of ${D}_{3}$ is that is not easily computed as it requires the likelihood of imputed dataset m evaluated at the average set of model parameters. The latter makes the implementation in software packages complex. Considering the disappointing results for ${D}_{3}$ and its complex implementation, this statistic is not the most convenient method for pooling results of ANOVA.
The results of the current study imply that software packages in which ${D}_{1}$ is implemented may also need to include ${D}_{0}$ in case of unequal fractions of missing information that are inherently related to the parameters in question. While ${D}_{1}$ only produces accurate TypeI error rates when fractions of missing information are on average equal across replications (Li et al, 1991b), ${D}_{0}$ also produces accurate TypeI error rates for unequal average fractions of missing information across replications, and has higher power than ${D}_{1}$ when M = 100. For M = 5 however, we must be cautious using both ${D}_{0}$ and ${D}_{1}$ with highly unequal fractions of missing data, as the former overestimates the TypeI error rate and the latter underestimates it.
To conclude, based on the current findings we recommend Type III sum of squares in unbalanced data, and we recommend using either ${D}_{0}$ or ${D}_{2}$ in multiply imputed data with a high (i.e., M = 100) number of imputations.