Original Article

# Much Ado About Nothing: Multiple Imputation to Balance Unbalanced Designs for Two-Way Analysis of Variance

Joost R. van Ginkel*a, Pieter M. Kroonenbergb

Methodology, 2020, Vol. 16(4), 335–353, https://doi.org/10.5964/meth.4327

Received: 2019-06-09. Accepted: 2020-08-06. Published (VoR): 2020-12-22.

*Corresponding author at: Methodology and Statistics, Department of Psychology, Leiden University, PO Box 9500, 2300 RB Leiden, The Netherlands. Tel.: +31(0)71 527 3620, E-mail: jginkel@fsw.leidenuniv.nl

This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

In earlier literature, multiple imputation was proposed to create balance in unbalanced designs, as an alternative to Type III sum of squares in two-way ANOVA. In the current simulation study we studied four pooled statistics for multiple imputation, namely D₀, D₁, D₂, and D₃, in unbalanced data, and compared these statistics with Type III sum of squares. Statistics D₀ and D₂ generally performed best regarding Type-I error rates, and had power rates closest to those of Type III sum of squares. However, none of the statistics produced power rates higher than Type III sum of squares. The results lead to the conclusion that D₀ and D₂ may be the best methods for pooling multiparameter estimates in multiply imputed datasets, and that for unbalanced data, Type III sum of squares is to be preferred over using multiple imputation in obtaining ANOVA results.

Keywords: unbalanced designs, multiple imputation, two-way analysis of variance, missing data, type III sum of squares

In an experiment where two-way analysis of variance is the intended analysis, unforeseen circumstances may cause the design to be unbalanced. Unbalanced data may also occur in non-experimental research when group sizes are unequal by themselves. One important consequence of imbalance is that, due to the resulting multicollinearity, F-tests in ANOVA have less power than in balanced designs (e.g., Shaw & Mitchell-Olds, 1993, p. 1643). One of the most widely used methods for calculating F-tests in unbalanced data is Type III sum of squares. This method has several advantages over other types of sums of squares, such as Type I and Type II. For example, unlike in Type I sum of squares, in Type III sum of squares the F-values are not influenced by the order in which the effects are entered into the ANOVA. For a discussion of the advantages of Type III sum of squares over the other types, see Maxwell and Delaney (2004). However, despite these advantages, the power of the F-tests in Type III sum of squares is still lower than in balanced designs.

According to Schafer (1997, p. 21) an unbalanced design can be considered a missing-data problem by imagining a number of additional cases with missing data on the outcome variable, which in the appropriate conditions would result in a balanced design (also, see Shaw & Mitchell-Olds, 1993, p. 1640, and Winer, Brown, & Michels, 1991, pp. 479-481). Considering an unbalanced design a missing-data problem creates new ways to handle the problems that come with unbalanced designs by using methods for dealing with missing data. Shaw and Mitchell-Olds (1993, p. 1641) discuss imputation (i.e., estimation) of the missing data to balance the design. They point out that imputation will give unbiased parameter estimates of the ANOVA model, but biased significance tests. This bias is caused by the fact that the imputed values are treated as observed values, and do not incorporate uncertainty of the missing values in the analysis (e.g., Schafer, 1997, p. 2). Thus, a method for estimating the missing values is needed that does take this uncertainty into account. One such method is multiple imputation (Rubin, 2004; Van Buuren, 2012). In multiple imputation, 1) missing values are estimated multiple times (M), resulting in M different complete versions of the incomplete dataset. Next, 2) these complete datasets are analyzed using the statistical analysis of interest, resulting in M outcomes of the same analysis, and 3) the results are combined into one pooled analysis, such that the uncertainty about the missing values is taken into account. For this pooling, formulas are used which will henceforth be called combination rules.

For balancing unbalanced data, multiple imputation may be used as follows. First, by adding a number of additional cases to specific groups such that all groups have equal size, the dataset is now balanced where in some cells cases have missing data on the outcome variable. These missing data are then multiply imputed using factors A and B as categorical predictors of the missing data on the outcome variable. Procedures for how to generate multiply imputed values for the missing data are described in, for example, Van Buuren (2012, Chapter 3) and Raghunathan, Lepkowski, Van Hoewyk, and Solenberger (2001).
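The balancing step described above can be sketched as follows. The study itself used the mice package in R; the sketch below is a simplified numpy illustration in which the imputation model is a separate normal model per design cell, with the cell mean and variance drawn from their posterior before each imputation so that the imputed values carry the required uncertainty. All function names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def balance_and_impute(y, cells, M=5):
    """Pad every cell of a one-outcome design up to the largest cell size,
    then multiply impute the padded (missing) outcomes M times."""
    y = np.asarray(y, float)
    cells = np.asarray(cells)
    labels, counts = np.unique(cells, return_counts=True)
    n_target = counts.max()
    imputed = []
    for _ in range(M):
        ys, cs = [y], [cells]
        for lab, n_obs in zip(labels, counts):
            n_mis = n_target - n_obs
            if n_mis == 0:
                continue
            obs = y[cells == lab]
            # posterior draws for the cell variance and mean (normal model,
            # noninformative prior), then the imputations themselves
            df = obs.size - 1
            sigma2 = df * obs.var(ddof=1) / rng.chisquare(df)
            mu = rng.normal(obs.mean(), np.sqrt(sigma2 / obs.size))
            ys.append(rng.normal(mu, np.sqrt(sigma2), n_mis))
            cs.append(np.repeat(lab, n_mis))
        imputed.append((np.concatenate(ys), np.concatenate(cs)))
    return imputed

# toy unbalanced design: 4 cases in one cell, 8 in the other
y = np.r_[rng.normal(27.0, 2.0, 4), rng.normal(24.0, 2.0, 8)]
cells = np.array(["a1b1"] * 4 + ["a1b2"] * 8)
datasets = balance_and_impute(y, cells, M=3)
```

Each element of `datasets` is one completed, balanced dataset; the observed values are identical across the M datasets while the padded values differ, which is what later allows the between-imputation variation to be estimated.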

Once multiply imputed datasets have been obtained, the ANOVAs can be applied to these datasets, and the results can be combined using specific combination rules. However, Van Ginkel and Kroonenberg (2014) argue that the M F-tests of an effect in ANOVA in M multiply imputed datasets cannot be combined directly, because the combination rules that are suited for ANOVA (to be discussed shortly) require M sets of regression coefficients and M covariance matrices of these regression coefficients. See Rubin (2004, pp. 79-81) and Van Ginkel (2019), who show how to calculate a pooled F-statistic that tests several regression coefficients for significance simultaneously in multiply imputed datasets.

However, one can reformulate the two-way ANOVA model as a regression model, so that the combination rules can be applied to the regression coefficients. Van Ginkel and Kroonenberg (2014) show that this can be done by recoding the categorical variables and their interaction into effect-coded variables, and using these effect-coded variables as predictors in the regression model (also, see Winer et al., 1991, pp. 959-963). Once estimates of the regression coefficients and their covariances have been obtained for each of the M imputed datasets, the results may be combined.
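To make this recoding concrete, the following numpy sketch (ours, not the authors' code) builds the effect-coded design matrix for an I × J design and fits the regression per imputed dataset, yielding the coefficient vector and its covariance matrix that the combination rules need:

```python
import numpy as np

def effect_codes(levels, n_levels):
    """Effect coding: level j (0-based) gets a 1 in column j; the last
    level gets -1 in every column, so each factor's effects sum to zero."""
    X = np.zeros((len(levels), n_levels - 1))
    for row, lev in enumerate(levels):
        if lev == n_levels - 1:
            X[row, :] = -1.0
        else:
            X[row, lev] = 1.0
    return X

def anova_design(a, b, I, J):
    """Two-way ANOVA as regression: intercept, effect codes for A and B,
    and products of the A and B codes for the interaction."""
    A, B = effect_codes(a, I), effect_codes(b, J)
    AB = np.hstack([A[:, [i]] * B[:, [j]]
                    for i in range(I - 1) for j in range(J - 1)])
    return np.hstack([np.ones((len(a), 1)), A, B, AB])

def fit_ols(X, y):
    """OLS fit: returns the coefficient vector and its covariance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    return beta, s2 * XtX_inv

# I = 2, J = 3 gives 1 + 1 + 2 + 2 = 6 columns, i.e., a 6-coefficient model
rng = np.random.default_rng(0)
a = np.repeat([0, 1], 30)
b = np.tile(np.repeat([0, 1, 2], 10), 2)
X = anova_design(a, b, 2, 3)
y = 27 + rng.normal(0, 5, 60)
beta, cov = fit_ols(X, y)
```

Running `fit_ols` on each of the M imputed datasets produces the M sets of regression coefficients and covariance matrices to which the combination rules below apply.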

For pooling the results of two-way ANOVA, Rubin (2004) discusses four methods that may be good candidates. Schafer (1997) called three of these methods $D_1$, $D_2$, and $D_3$. The fourth method was not given a name, probably because Schafer noted some problems with this method, which made him set it aside altogether. However, for the current application of these combination rules, the unnamed statistic may still be useful, for reasons discussed later on. Because the method with no name was chronologically the first method in Schafer’s book, we will call this statistic $D_0$. Each of these methods is discussed next.

### Statistic D₀

Let $\hat{Q}_m$ be a vector of length p, containing the p parameter estimates of a statistical model, and let $U_m$ be the covariance matrix of these p parameter estimates, in imputed dataset m. A pooled set of parameter estimates across M imputed datasets is computed as:

##### 1
$\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m.$

The pooled covariance matrix has two components, namely a within-imputation covariance matrix $\bar{U}$ and a between-imputation covariance matrix $B$:

##### 2
$\bar{U} = \frac{1}{M}\sum_{m=1}^{M} U_m,$
##### 3
$B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{Q}_m - \bar{Q}\right)\left(\hat{Q}_m - \bar{Q}\right)'.$

The total covariance matrix of the estimate $\bar{Q}$ is computed as

##### 4
$T = \bar{U} + \left(1 + M^{-1}\right)B.$

To test all p parameter estimates in $\bar{Q}$, statistic $D_0$ (Rubin, 2004, pp. 77-78) is computed as:

##### 5
$D_0 = \bar{Q}'\,T^{-1}\,\bar{Q}\,/\,p,$

which has an approximate F-distribution with p model degrees of freedom, and $\left(1 + p^{-1}\right)\hat{\nu}/2$ error degrees of freedom, where an approximation for $\hat{\nu}$ by Reiter (2007) may be used. The formula for $\hat{\nu}$ is rather long and complex, so we will not give it here. For computational details, see Reiter (2007).

A pooled F-test for the effect of factor A may be computed by filling in the pooled regression coefficients of the effect-coded variables that together form factor A in $\hat{Q}_m$ in Equation 1 and Equation 3, filling in the covariances of these coefficients in Equation 2, and using the resulting $\bar{Q}$, $\bar{U}$, and $B$ in Equation 4 and Equation 5. Similarly, F-tests for factor B and the interaction may be computed. Computational details on how to obtain the covariance matrix $U_m$ for the effects of factors A, B, and the interaction are given in Van Ginkel and Kroonenberg (2014).
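Equations 1 to 5 translate directly into code. The following numpy sketch (our illustration; the function name is ours) pools M coefficient vectors and covariance matrices into $D_0$:

```python
import numpy as np

def pool_D0(Qs, Us):
    """Qs: list of M coefficient vectors (length p); Us: list of M p-by-p
    covariance matrices. Returns the pooled statistic D0."""
    M = len(Qs)
    Q_bar = np.mean(Qs, axis=0)               # Eq. 1: pooled estimates
    U_bar = np.mean(Us, axis=0)               # Eq. 2: within-imputation cov
    dev = np.asarray(Qs) - Q_bar
    B = dev.T @ dev / (M - 1)                 # Eq. 3: between-imputation cov
    T = U_bar + (1 + 1 / M) * B               # Eq. 4: total covariance
    p = Q_bar.size
    return float(Q_bar @ np.linalg.solve(T, Q_bar)) / p   # Eq. 5
```

For an F-test of factor A, `Qs` and `Us` would hold only the coefficients of factor A's effect-coded variables and their covariances, extracted from each imputed dataset's regression fit.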

### Statistic D₁

Because $B$ is a noisy estimate of the between-imputation variance and does not even have full rank if M < p, Schafer (1997, p. 113) advises against using statistic $D_0$. Rather than using $T$ as an estimate for the total covariance matrix, a more stable total covariance matrix is:

##### 6
$\tilde{T} = \left(1 + r_1\right)\bar{U},$
$r_1 = \left(1 + M^{-1}\right)\mathrm{tr}\!\left(B\bar{U}^{-1}\right)/\,p,$

where $r_1$ is the relative increase in variance due to nonresponse. The resulting F-value testing all elements in vector $\bar{Q}$ for significance simultaneously is:

##### 7
$D_1 = \bar{Q}'\,\tilde{T}^{-1}\,\bar{Q}\,/\,p,$

which, under the assumption that $r_1$ is equal across all elements of $\bar{Q}$, has an approximate F-distribution with p numerator degrees of freedom and $\hat{\nu}$ denominator degrees of freedom.
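A numpy sketch of Equations 6 and 7 (ours, mirroring the $D_0$ computation but with the stabilised total covariance):

```python
import numpy as np

def pool_D1(Qs, Us):
    """Pooled statistic D1: a single relative increase in variance r1 is
    assumed for all p parameters, giving a stabler total covariance than
    the raw between-imputation matrix B used by D0."""
    M = len(Qs)
    Q_bar = np.mean(Qs, axis=0)
    U_bar = np.mean(Us, axis=0)
    dev = np.asarray(Qs) - Q_bar
    B = dev.T @ dev / (M - 1)
    p = Q_bar.size
    r1 = (1 + 1 / M) * np.trace(B @ np.linalg.inv(U_bar)) / p   # Eq. 6
    T_tilde = (1 + r1) * U_bar
    return float(Q_bar @ np.linalg.solve(T_tilde, Q_bar)) / p   # Eq. 7
```

When the between-imputation variation is zero (identical estimates across datasets), $r_1 = 0$ and $D_0$ and $D_1$ coincide; they also coincide when p = 1.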

Despite the assumption of equal $r_1$ across all elements of $\bar{Q}$, Li, Raghunathan, and Rubin (1991b) showed that violation of this assumption does not substantially influence the Type-I error rate of $D_1$. Because of this finding, and because of the disadvantages of statistic $D_0$, $D_1$ has been widely implemented in different statistical packages, such as SAS (SAS, 2013; Yuan, 2011), Stata (StataCorp, 2017), the miceadds package in R (Robitzsch, Grund, & Henke, 2017), and an SPSS (IBM SPSS, 2017) macro by Van Ginkel (2010). Using the procedure outlined by Van Ginkel and Kroonenberg (2014), Grund et al. (2016) showed that $D_1$ generally produces Type-I error rates close to the theoretical α in both one-way and two-way ANOVAs.

### Statistic D₂

Define

##### 8
$d_{W,m} = \hat{Q}_m'\,U_m^{-1}\,\hat{Q}_m$

as the Wald statistic of imputed dataset m,

##### 9
$\bar{d}_W = \frac{1}{M}\sum_{m=1}^{M} d_{W,m}$

as the average Wald statistic across imputed datasets, and

##### 10
$r_2 = \left(1 + M^{-1}\right)\frac{1}{M-1}\sum_{m=1}^{M}\left(\sqrt{d_{W,m}} - \overline{\sqrt{d_W}}\right)^2, \quad \overline{\sqrt{d_W}} = \frac{1}{M}\sum_{m=1}^{M}\sqrt{d_{W,m}},$

as an alternative estimate of the relative increase in variance due to nonresponse. Statistic $D_2$ is given by:

##### 11
$D_2 = \frac{\bar{d}_W\,p^{-1} - \frac{M+1}{M-1}\,r_2}{1 + r_2}.$

As a reference distribution for $D_2$, an F-distribution with p numerator degrees of freedom and $\nu_2 = p^{-3/M}\left(M-1\right)\left(1 + r_2^{-1}\right)^2$ denominator degrees of freedom is used. Applied to two-way ANOVA, significance tests for factors A, B, and the interaction can be obtained by substituting the relevant coefficients for $\hat{Q}_m$ and the covariance matrices of these coefficients for $U_m$ in Equation 8.

An advantage of $D_2$ is that it is easier to compute than $D_0$ and $D_1$: it can be calculated from the M separate Wald statistics alone, and no separate pooling of the covariance matrices is needed (Schafer, 1997, p. 115). Li, Meng, Raghunathan, and Rubin (1991a), however, show that $D_2$ occasionally produces Type-I error rates as high as 8% when using α = .05. Li et al. (1991a) therefore argue that $D_2$ should be used only as a rough guide. However, Grund et al. (2016) found Type-I error rates for $D_2$ that were closer to the theoretical 5% than those found by Li et al. (1991a). Thus, it remains unclear in which situations $D_2$ performs well regarding Type-I error rates.
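A numpy sketch of Equations 8 to 11 (ours). Following Li et al. (1991a), $r_2$ is computed from the variability of the square roots of the M Wald statistics:

```python
import numpy as np

def pool_D2(Qs, Us):
    """Pooled statistic D2 and its denominator degrees of freedom; needs
    only the M Wald statistics, not the pooled covariance matrices."""
    M, p = len(Qs), len(Qs[0])
    # Eq. 8: Wald statistic per imputed dataset
    d = np.array([q @ np.linalg.solve(U, q) for q, U in zip(Qs, Us)])
    d_bar = d.mean()                                       # Eq. 9
    r2 = (1 + 1 / M) * np.sqrt(d).var(ddof=1)              # Eq. 10
    D2 = (d_bar / p - (M + 1) / (M - 1) * r2) / (1 + r2)   # Eq. 11
    nu2 = p ** (-3 / M) * (M - 1) * (1 + 1 / r2) ** 2
    return D2, nu2
```

Note that the sketch assumes some between-imputation variation in the Wald statistics; if they were all identical, $r_2$ would be zero and $\nu_2$ undefined.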

### Statistic D₃

Statistic $D_3$ (Meng & Rubin, 1992) is a combined likelihood-ratio test of M likelihood-ratio statistics from M imputed datasets. Define $d_{L,m}$ as a likelihood-ratio test of imputed dataset m, and

##### 12
$\bar{d}_L = \frac{1}{M}\sum_{m=1}^{M} d_{L,m}$

as the average $d_{L,m}$ across imputed datasets. Next, define $d^*_{L,m}$ as a likelihood-ratio test of imputed dataset m, but now evaluated at the average parameter estimates of the model across the M imputed datasets. The average $d^*_{L,m}$ is computed as

##### 13
$\tilde{d}_L = \frac{1}{M}\sum_{m=1}^{M} d^*_{L,m}.$

Statistic $D_3$ is computed as

##### 14
$D_3 = \frac{\tilde{d}_L}{p\left(1 + r_3\right)},$
$r_3 = \frac{M+1}{p\left(M-1\right)}\left(\bar{d}_L - \tilde{d}_L\right).$

The reference distribution that is used for $D_3$ is an F-distribution with p numerator degrees of freedom and a denominator degrees of freedom given by (Meng & Rubin, 1992):

##### 15
$\nu_3 = \begin{cases} 4 + (t-4)\left[1 + \left(1 - 2t^{-1}\right)r_3^{-1}\right]^2 & \text{if } t = p(M-1) > 4, \\ t\left(1 + p^{-1}\right)\left(1 + r_3^{-1}\right)^2/2 & \text{otherwise.} \end{cases}$
For a specific effect in the model (factor A, factor B, or the interaction), $D_3$ is computed by letting $d_{L,m}$ and $d^*_{L,m}$ be the likelihood-ratio tests that test the full model against the full model minus the terms that together form that effect.
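For a normal linear model, the likelihood-ratio quantities in Equations 12 to 14 can be computed explicitly. The sketch below (ours; the paper itself used the mitml package) evaluates each imputed dataset's likelihood ratio both at its own maximum-likelihood estimates and at the estimates averaged across datasets:

```python
import numpy as np

def loglik(y, X, beta, s2):
    """Normal linear-model log-likelihood at coefficients beta, variance s2."""
    r = y - X @ beta
    return -0.5 * len(y) * np.log(2 * np.pi * s2) - r @ r / (2 * s2)

def mle(y, X):
    """ML estimates: OLS coefficients and the ML residual variance."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return beta, r @ r / len(y)

def pool_D3(datasets, full_cols, reduced_cols):
    """Pooled likelihood-ratio statistic D3. datasets: list of M (y, X)
    imputed datasets; the effect tested consists of the columns of X in
    full_cols that are absent from reduced_cols."""
    M = len(datasets)
    p = len(full_cols) - len(reduced_cols)
    fits_f = [mle(y, X[:, full_cols]) for y, X in datasets]
    fits_r = [mle(y, X[:, reduced_cols]) for y, X in datasets]
    # Eq. 12: average of the ordinary likelihood-ratio tests d_L,m
    d = [2 * (loglik(y, X[:, full_cols], bf, sf)
              - loglik(y, X[:, reduced_cols], br, sr))
         for (y, X), (bf, sf), (br, sr) in zip(datasets, fits_f, fits_r)]
    d_bar = np.mean(d)
    # Eq. 13: the same tests evaluated at the parameter estimates averaged
    # across the M imputed datasets (d*_L,m)
    bf_bar = np.mean([f[0] for f in fits_f], axis=0)
    sf_bar = np.mean([f[1] for f in fits_f])
    br_bar = np.mean([f[0] for f in fits_r], axis=0)
    sr_bar = np.mean([f[1] for f in fits_r])
    d_tilde = np.mean([2 * (loglik(y, X[:, full_cols], bf_bar, sf_bar)
                            - loglik(y, X[:, reduced_cols], br_bar, sr_bar))
                       for y, X in datasets])
    r3 = (M + 1) / (p * (M - 1)) * (d_bar - d_tilde)   # Eq. 14
    return d_tilde / (p * (1 + r3)), r3
```

When the M datasets are identical, $\bar{d}_L = \tilde{d}_L$, so $r_3 = 0$ and $D_3$ reduces to the ordinary likelihood-ratio statistic divided by p.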

### Goal of the Current Study

Several authors (Schafer, 1997; Shaw & Mitchell-Olds, 1993; Winer et al., 1991) have pointed out that unbalanced designs can be considered a missing-data problem. Shaw and Mitchell-Olds (1993) discussed imputation as a possible solution for imbalance and at the same time noted some problems of the imputation approach regarding p-values. However, they did not mention the use of multiple imputation to overcome the problems of single imputation in unbalanced designs. Although Schafer (1997) never explicitly suggested balancing an unbalanced design using multiple imputation, the suggestion implicitly followed from the fact that Schafer’s book is about multiple imputation. Van Ginkel and Kroonenberg (2014, p. 79) made the suggestion explicit, and used it in an empirical data example (pp. 87-88). The reasoning behind their suggestion was that balanced designs generally have more power than unbalanced designs, and that consequently multiple imputation in unbalanced designs could increase power.

However, Van Ginkel and Kroonenberg (2014) did not carry out any simulations to back up this claim. Additionally, one aspect that Van Ginkel and Kroonenberg did not include in their reasoning is that multiple imputation does not actually add information that could increase power. Multiple imputation does simulate additional cases, but imputing M values of the outcome variable for each additional case also creates additional variation in the parameter estimates of the ANOVA model, represented by the covariance matrix $B$ or by the relative increase in variance due to nonresponse ($r_1$, $r_2$, and $r_3$). The statistics $D_0$ to $D_3$ include these quantities in their calculation in a way that lowers their power, which may consequently undo the increased power resulting from creating additional cases. Also, Van Buuren (2012, p. 48) and Von Hippel (2007) argue that when missing data occur only on the outcome variable, analyzing only the complete cases may be preferred over multiple imputation. Since unbalanced data can be considered a situation with missing data on only the outcome variable, and since Type III sum of squares in this context is equivalent to analyzing only the complete cases, it would follow that in unbalanced data Type III sum of squares is preferred over multiple imputation.

In short, both valid arguments for balancing unbalanced data using multiple imputation prior to two-way ANOVA and simulation studies that confirm its usefulness seem to be lacking. However, the fact that this suggestion has been made in the literature, or even just the fact that unbalanced data are often considered a missing-data problem and that multiple imputation is a highly recommended procedure for dealing with missing data, suggests that at first sight it may not be obvious to all statisticians that this suggestion may not be useful. Hence, a simulation study is needed to establish its actual usefulness, and in the current paper we carry out such a study. The first research question is therefore whether there is some benefit in using multiple imputation for balancing unbalanced data after all.

Furthermore, Grund et al. (2016) already investigated statistics $D_1$, $D_2$, and $D_3$ in various situations. In their study they found that in most situations all statistics produced Type-I error rates close to 5%. However, in more extreme situations (e.g., small sample sizes, high percentages of missing data), $D_2$ and $D_3$ were somewhat conservative and had lower power than $D_1$. Based on this study, $D_1$ is the preferred statistic for combining the results of multiply imputed data in ANOVA.

However, the question is to what extent the results by Grund et al. (2016) generalize to a situation where unbalanced data are balanced using multiple imputation. $D_1$ showed the most promising results but has one potential weak point, which may be especially problematic in unbalanced data: it assumes that the fraction of missing information is equal across the p parameter estimates in $\bar{Q}$. Unequal fractions of missing information across parameter estimates in $\bar{Q}$ are inherently related to unbalanced designs, because each regression coefficient of an effect-coded variable of a specific factor represents a specific group of this factor. Unequal group sizes imply that for smaller groups more additional values of the dependent variable have to be multiply imputed than for larger groups, resulting in unequal fractions of missing information across parameter estimates.

In a situation which inherently has unequal fractions of missing information across parameter estimates, a statistic assuming equal fractions of missing information across parameter estimates ($D_1$) is not the most convenient choice, especially when an alternative version of this statistic exists ($D_0$) that does not make this assumption (when p = 1, or when the fractions of missing information are equal across parameter estimates, both statistics are equivalent). Given that $D_1$ generally gives the best results regarding power and Type-I error rates, and that $D_0$ is equivalent to $D_1$ but without the assumption of equal fractions of missing information across parameters, $D_0$ might be the ideal candidate for calculating pooled F-tests in unbalanced data when imbalance is handled using multiple imputation. Although Schafer (1997) and Li et al. (1991b) advised against the use of $D_0$, the argument against its use was mainly that it may not perform well for small M. This problem may easily be resolved by increasing M.

Furthermore, although Li et al. (1991b) showed that violation of equal fractions of missing information is not necessarily problematic for $D 1$, in their study the unequal fractions of missing information randomly varied across parameters of the simulation model and across replicated datasets. However, in unbalanced data unequal fractions of missing information may not always randomly vary across parameters. For example, in a field experiment it may be more difficult to collect data for one experimental condition than for the other, or in a clinical trial more people may drop out in one condition than in the other. Also, in non-experimental studies some categorical variables may have unequal groups in the population, for example, ethnicity. When drawing a random sample from this population, the different ethnic groups will have unequal sizes in the sample as well.

When fractions of missing information randomly vary across parameters, the fractions of missing information may not be equal within one replication, but the average fractions of missing information across replicated datasets are equal. Consequently, the negative effect of unequal fractions of missing information may cancel itself out across replications. However, in situations where the differences in fractions of missing information across parameters do not vary across replicated datasets, a statistic might be needed that allows for different fractions of missing information across parameter estimates.

Thus, a second research question is how the different pooling statistics from Grund et al. (2016) behave in unbalanced data where the unequal fractions of missing information do not vary across replications, now also including statistic $D_0$. To this end, rejection rates of statistics $D_0$ to $D_3$ were studied, along with the rejection rates for Type III sum of squares, under the null and alternative hypotheses. For comparison, rejection rates were also studied for balanced data with the same total sample size.

In the next section, the setup of the simulation study is described. In the section that follows, results of the simulation study are shown. Finally, in the discussion section conclusions will be drawn about the usefulness of multiple imputation for balancing unbalanced designs, and implications for which statistic to use will be discussed.

## Method

Data were simulated according to a two-way ANOVA model in the form of a regression model with effect coded predictors. Some of the properties of the data were held constant while some were varied (discussed next). The properties that were varied resulted in several design cells. Within each design cell 2500 replications were drawn (based on studies by Harel, 2009, and Van Ginkel, 2019).

The simulations were programmed in R (R Core Team, 2018). Multiple imputation was performed using Fully Conditional Specification (FCS; e.g., Van Buuren, 2012) in the mice package (Van Buuren & Groothuis-Oudshoorn, 2011). For continuous variables there are two different versions of FCS, namely regression imputation (FCS-R; e.g., Little & Schenker, 1995, p. 60; Van Buuren, 2012, p. 13) and Predictive Mean Matching (FCS-PMM; Rubin, 1986; Van Buuren, 2012, pp. 68-74). The default method in mice is FCS-PMM because, when data are not normally distributed, it generally preserves distributional shapes better than FCS-R (Marshall, Altman, & Holder, 2010; Marshall, Altman, Royston, & Holder, 2010). However, for imputing an outcome variable with a normally distributed error (as in our simulation model) FCS-R may be a better alternative, as it explicitly imputes values according to a normal linear model. Thus, the method for multiple imputation was FCS-R. For the imputation of outcome variable Y, factors A and B plus their interaction were used as predictors. Type III sums of squares were computed using the car package (Fox & Weisberg, 2019). Statistic $D_3$ was calculated using the mitml package (Grund et al., 2016). The other statistics, $D_0$, $D_1$, and $D_2$, were programmed by the first author.

As in many other simulation studies, decisions regarding the properties of the simulation design were to some extent arbitrary. However, prior to running the simulations, some test runs were done to see which properties would make the effects of imbalance and the differences between the statistics most clearly visible, while remaining likely to occur in practice. The properties of the simulation design discussed next are mostly the result of these test runs.

### Fixed Design Characteristics

The number of levels of factor A was I = 2. The error term of the ANOVA model was normally distributed with $\mu_\varepsilon = 0$ and $\sigma_\varepsilon = 18.37$.

### Independent Variables

#### Number of Levels of Factor B

The number of levels of factor B was J = 3, 4, 5.

#### Parameters of the Two-Way ANOVA Model

For each J, two sets of parameters of the two-way ANOVA model were studied: one in which there were no effects of factor A, factor B, and the interaction in the population (the null model), and one in which there were effects (the alternative model). For J = 3 the two sets were $\boldsymbol{\beta} = \left(\beta_0, \beta^A_1, \beta^B_2, \beta^B_3, \beta^{AB}_4, \beta^{AB}_5\right)$ = (27, 0, 0, 0, 0, 0) and $\boldsymbol{\beta}$ = (27, -1.5, -3, 0, 1, -0.5).

For J = 4 the two sets were $\boldsymbol{\beta} = \left(\beta_0, \beta^A_1, \beta^B_2, \beta^B_3, \beta^B_4, \beta^{AB}_5, \beta^{AB}_6, \beta^{AB}_7\right)$ = (27, 0, 0, 0, 0, 0, 0, 0) and $\boldsymbol{\beta}$ = (27, -1.5, -3, -1, 1, 1, 0, -0.5).

Finally, for J = 5 the two sets were $\boldsymbol{\beta} = \left(\beta_0, \beta^A_1, \beta^B_2, \beta^B_3, \beta^B_4, \beta^B_5, \beta^{AB}_6, \beta^{AB}_7, \beta^{AB}_8, \beta^{AB}_9\right)$ = (27, 0, 0, 0, 0, 0, 0, 0, 0, 0) and $\boldsymbol{\beta}$ = (27, -1.5, -3, -1, 0, 1, 1, 0.25, -0.25, -0.5).

#### Sample Size

Small, medium, and large sample sizes were studied. Because J also affects the number of design cells of the two-way ANOVA, sample size also partly depended on the size of J. For small N, the average cell size was 10, for medium N it was 20, and for large samples it was 30. Given these average cell sizes, sample sizes were N = 60, 120, 180 for J = 3, N = 80, 160, 240 for J = 4, and N = 100, 200, 300 for J = 5.

#### Degree and Structure of Imbalance

Four different degrees of imbalance were simulated, along with balanced data for comparison. The degree of imbalance was varied as follows: the size of a specific design cell was either increased or decreased by repeatedly adding the same number to, or subtracting it from, the original cell size in the balanced case. The increasing and decreasing of cell sizes was done such that the total sample size remained the same.

Additionally, to study whether it mattered which cells increased or decreased in size, an additional situation of imbalance was created where the cell sizes of the most severe case of imbalance were randomly redistributed across design cells. The cell sizes for each degree of imbalance are displayed for small N in Table 1. For medium and large N the entries must be multiplied by 2 and 3, respectively.

##### Table 1

Cell Sizes for Different Degrees of Imbalance Under a Small Sample Size

| Cell size | Balanced | Small | Medium | Severe | Extra severe | Extra severe, order shuffled |
|---|---|---|---|---|---|---|
| *No. levels factor B: 3* | | | | | | |
| n11 | 10 | 8 | 6 | 4 | 2 | 18 |
| n12 | 10 | 10 | 10 | 10 | 10 | 10 |
| n13 | 10 | 12 | 14 | 16 | 18 | 2 |
| n21 | 10 | 11 | 12 | 13 | 14 | 6 |
| n22 | 10 | 10 | 10 | 10 | 10 | 10 |
| n23 | 10 | 9 | 8 | 7 | 6 | 14 |
| *No. levels factor B: 4* | | | | | | |
| n11 | 10 | 8 | 6 | 4 | 2 | 10 |
| n12 | 10 | 10 | 10 | 10 | 10 | 18 |
| n13 | 10 | 10 | 10 | 10 | 10 | 10 |
| n14 | 10 | 12 | 14 | 16 | 18 | 2 |
| n21 | 10 | 11 | 12 | 13 | 14 | 10 |
| n22 | 10 | 10 | 10 | 10 | 10 | 6 |
| n23 | 10 | 10 | 10 | 10 | 10 | 10 |
| n24 | 10 | 9 | 8 | 7 | 6 | 14 |
| *No. levels factor B: 5* | | | | | | |
| n11 | 10 | 8 | 6 | 4 | 2 | 10 |
| n12 | 10 | 10 | 10 | 10 | 10 | 18 |
| n13 | 10 | 10 | 10 | 10 | 10 | 10 |
| n14 | 10 | 10 | 10 | 10 | 10 | 10 |
| n15 | 10 | 12 | 14 | 16 | 18 | 2 |
| n21 | 10 | 11 | 12 | 13 | 14 | 10 |
| n22 | 10 | 10 | 10 | 10 | 10 | 6 |
| n23 | 10 | 10 | 10 | 10 | 10 | 10 |
| n24 | 10 | 10 | 10 | 10 | 10 | 10 |
| n25 | 10 | 9 | 8 | 7 | 6 | 14 |

#### Method for Handling Imbalance

Nine methods for handling imbalance were used: Type III sum of squares, and two versions of each of the statistics $D_0$, $D_1$, $D_2$, and $D_3$. Because $D_0$ is argued to be sensitive to a small number of imputations, two different sizes of M were studied within each of the statistics, namely M = 5 and M = 100. This defines the eight procedures based on multiple imputation: $D_{0, M=5}$, $D_{0, M=100}$, $D_{1, M=5}$, $D_{1, M=100}$, $D_{2, M=5}$, $D_{2, M=100}$, $D_{3, M=5}$, and $D_{3, M=100}$.

### Dependent Variables

For each of the F-tests in the two-way ANOVA, the proportion of times the null hypothesis was rejected was studied, denoted the rejection rate. As already noted, when p = 1, $D_0$ and $D_1$ are equivalent. Thus, for factor A (for which p = 1 because I = 2), methods $D_{0, M=5}$ and $D_{0, M=100}$ will not be displayed.

## Results

To get a rough impression of how close the Type I error rates were to α = .05 under the null hypothesis, it was tested whether the empirical rejection rates differed significantly from 0.05, using an Agresti and Coull (1998) confidence interval. Under the alternative hypothesis, it was tested whether the empirical rejection rates of the multiple-imputation methods differed significantly from the rejection rates of Type III, assuming that the latter represent the “real” power. A method based on multiple imputation may be considered successful when under the null hypothesis it gives rejection rates not significantly different from 0.05, and when under the alternative hypothesis it gives rejection rates significantly higher than the rejection rates of Type III. To avoid redundancy in discussing the results, we will mainly focus on the results of factor B and less on the results of factor A and the interaction.
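The interval check under the null hypothesis can be sketched as follows (our illustration of the Agresti-Coull interval; the helper names are ours):

```python
import math

def agresti_coull(successes, n, z=1.96):
    """Agresti-Coull 95% interval for a proportion: add z^2/2 pseudo-successes
    and z^2 pseudo-trials, then build a Wald interval around the adjusted
    proportion."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half

def differs_from(rate, n, p0=0.05):
    """Flag an empirical rejection rate as significantly different from p0,
    i.e., the kind of check behind the 'a' superscripts in Table 2."""
    lo, hi = agresti_coull(round(rate * n), n)
    return not (lo <= p0 <= hi)
```

With 2500 replications per design cell, a rate such as .081 falls outside the interval around .05 while a rate such as .052 does not, which matches the pattern of flagged entries.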

Eighteen tables were needed to report all the results. Because results showed similar patterns across different J and different N, results of the remaining independent variables are only reported for J = 3 and a small N. Results for J > 3 and larger N are provided in Supplementary Materials. Table 2 shows the results for all levels of imbalance, all methods for handling imbalance, and all effects in the ANOVA, under the null model. Table 3 shows the results for all levels of imbalance, all methods for handling imbalance, and all effects in the ANOVA, under the alternative model.

##### Table 2

Rejection Rates for Each Effect Under the Null Model, a Small Sample Size, Three Levels of Factor B, for Different Methods for Handling Imbalance, and Different Degrees of Imbalance

| Method | Balanced | Small | Medium | Severe | Extra severe | Extra severe, order shuffled |
|---|---|---|---|---|---|---|
| *Effect A* | | | | | | |
| Type III | .052 | .049 | .051 | .054 | .055 | .055 |
| $D_{1, M=5}$ | | .049 | .056 | .059a | .061a | .058 |
| $D_{1, M=100}$ | | .051 | .051 | .054 | .054 | .053 |
| $D_{2, M=5}$ | | .047 | .052 | .052 | .052 | .054 |
| $D_{2, M=100}$ | | .052 | .053 | .056 | .060a | .058 |
| $D_{3, M=5}$ | | .045 | .034a | .027a | .019a | .016a |
| $D_{3, M=100}$ | | .044 | .036a | .024a | .006a | .005a |
| *Effect B* | | | | | | |
| Type III | .047 | .049 | .054 | .049 | .048 | .048 |
| $D_{0, M=5}$ | | .052 | .059a | .064a | .073a | .081a |
| $D_{0, M=100}$ | | .053 | .059a | .052 | .052 | .053 |
| $D_{1, M=5}$ | | .038a | .035a | .034a | .033a | .052 |
| $D_{1, M=100}$ | | .043 | .042 | .030a | .023a | .056 |
| $D_{2, M=5}$ | | .050 | .055 | .058 | .054 | .057 |
| $D_{2, M=100}$ | | .055 | .060a | .054 | .050 | .051 |
| $D_{3, M=5}$ | | .042 | .034a | .026a | .024a | .023a |
| $D_{3, M=100}$ | | .048 | .042 | .026a | .014a | .018a |
| *Effect A × B* | | | | | | |
| Type III | .058 | .048 | .054 | .056 | .045 | .045 |
| $D_{0, M=5}$ | | .052 | .055 | .066a | .079a | .084a |
| $D_{0, M=100}$ | | .056 | .054 | .058 | .050 | .048 |
| $D_{1, M=5}$ | | .038a | .037a | .032a | .035a | .046 |
| $D_{1, M=100}$ | | .044 | .035a | .034a | .022a | .055 |
| $D_{2, M=5}$ | | .048 | .050 | .052 | .056 | .055 |
| $D_{2, M=100}$ | | .058 | .056 | .059a | .046 | .048 |
| $D_{3, M=5}$ | | .041a | .031a | .028a | .021a | .022a |
| $D_{3, M=100}$ | | .048 | .037a | .031a | .012a | .015a |

a Significantly different from the theoretical significance level of α = .05.

##### Table 3

Rejection Rates for Each Effect Under the Alternative Model, a Small Sample Size, Three Levels of Factor B, for Different Methods for Handling Imbalance, and Different Degrees of Imbalance

| Method | Balanced | Small | Medium | Severe | Extra severe | Extra severe, order shuffled |
|---|---|---|---|---|---|---|
| *Effect A* | | | | | | |
| Type III | .764 | .746 | .728 | .686 | .549 | .549 |
| $D_{1, M=5}$ | | .734 | .676a | .585a | .415a | .406a |
| $D_{1, M=100}$ | | .753 | .731 | .680 | .543 | .541 |
| $D_{2, M=5}$ | | .718a | .646a | .547a | .388a | .370a |
| $D_{2, M=100}$ | | .755 | .735 | .687 | .558 | .556 |
| $D_{3, M=5}$ | | .702a | .551a | .370a | .140a | .142a |
| $D_{3, M=100}$ | | .736 | .676a | .555a | .261a | .253a |
| *Effect B* | | | | | | |
| Type III | .976 | .972 | .961 | .922 | .826 | .836 |
| $D_{0, M=5}$ | | .966 | .934a | .863a | .719a | .746a |
| $D_{0, M=100}$ | | .976 | .963 | .925 | .829 | .838 |
| $D_{1, M=5}$ | | .947a | .873a | .750a | .505a | .676a |
| $D_{1, M=100}$ | | .970 | .947a | .892a | .720a | .845 |
| $D_{2, M=5}$ | | .946a | .852a | .687a | .426a | .426a |
| $D_{2, M=100}$ | | .976 | .964 | .924 | .807a | .818a |
| $D_{3, M=5}$ | | .949a | .869a | .706a | .384a | .398a |
| $D_{3, M=100}$ | | .973 | .950a | .886a | .643a | .642a |
| *Effect A × B* | | | | | | |
| Type III | .179 | .182 | .169 | .150 | .101 | .148 |
| $D_{0, M=5}$ | | .172 | .153a | .151 | .127a | .174a |
| $D_{0, M=100}$ | | .194 | .182 | .151 | .111 | .157 |
| $D_{1, M=5}$ | | .146a | .118a | .091a | .063a | .092a |
| $D_{1, M=100}$ | | .174 | .149a | .111a | .075a | .126a |
| $D_{2, M=5}$ | | .156a | .128a | .115a | .075a | .117a |
| $D_{2, M=100}$ | | .197 | .183 | .149 | .098 | .164a |
| $D_{3, M=5}$ | | .152a | .111a | .077a | .050a | .042a |
| $D_{3, M=100}$ | | .186 | .154a | .107a | .049a | .036a |

a Significantly different from Type III, assuming Type III is the “true” power.

Under the null hypothesis (Table 2), methods $D_{1, M=5}$, $D_{1, M=100}$, $D_{3, M=5}$, and $D_{3, M=100}$ tend to underestimate the Type-I error rate for factor B somewhat, which becomes worse as the degree of imbalance increases. This underestimation is worst for $D_{3, M=100}$. Method $D_{0, M=5}$ tends to overestimate the Type-I error rate, which also becomes worse as the degree of imbalance increases. Type III, $D_{0, M=100}$, and $D_{2, M=100}$ produce Type-I error rates that do not differ significantly from 5% in most cases.

All methods based on M = 5 and some of the methods based on M = 100 (D₁ with M = 100, D₃ with M = 100) have lower power than Type III sum of squares (Table 3). The difference in power between these methods and Type III sum of squares becomes larger as the degree of imbalance increases. Methods D₀ with M = 100 and D₂ with M = 100 generally have power values close to that of Type III sum of squares.

Finally, when the order of the cell sizes is shuffled, only the results for methods based on D₁ change substantially: the results for D₁ with both M = 5 and M = 100 improve towards acceptable Type-I error rates and higher power.

## Discussion

Van Ginkel and Kroonenberg (2014) suggested multiple imputation as a way to increase the power of a two-way ANOVA in unbalanced designs relative to Type III sum of squares. The current study examined whether multiple imputation indeed does so. Only D₀ and D₂ with M = 100 had power similar to that of Type III sum of squares; none of the methods had higher power, and the other methods based on multiple imputation had lower power than Type III sum of squares. Additionally, the earlier recommended statistic D₁ underestimated the Type-I error rate under the null hypothesis with both M = 5 and M = 100 imputations, while statistic D₀, which was not recommended earlier, produced Type-I error rates close to 0.05 when M = 100. When M = 5, however, D₀ overestimated the Type-I error rate.

The main conclusion of this study is that multiple imputation does not improve on Type III sum of squares. At best, multiple imputation performs as well as Type III sum of squares when the appropriate statistics (D₀ and D₂) are used with a large number of imputations (M = 100). Given that Type III sum of squares is easier to apply than multiple imputation, Type III sum of squares remains the preferred method in unbalanced designs.

Even though multiple imputation does not seem to be a useful alternative to Type III sum of squares, the results of this study are still important, and may in fact have implications for calculating significance tests for multi-parameter estimates in multiply imputed datasets. Previously it was assumed that D₁ was preferable to D₀ because D₀ uses a noisy estimate of the total covariance matrix in its computation, which would be especially problematic for small M. Secondly, Li et al. (1991b) showed that D₁ was robust to violation of the assumption of equal fractions of missing information across parameter estimates. The results of the current study, however, point to a different conclusion. Although for M = 5 D₀ usually overestimated the Type-I error rates, for M = 100 it gave satisfactory Type-I error rates. For D₁, on the other hand, the Type-I error rates were too low for both M = 5 and M = 100. Additionally, the power rates of D₀ were higher than those of D₁. Thus, the claim that D₁ is robust to violation of the assumption of equal fractions of missing information may only hold when the unequal fractions of missing information vary across replications, as was the case in the study by Li et al. (1991b). The problems that come with D₀, on the other hand, may be less serious than was previously thought. It should be noted, however, that when the order of the cell sizes was shuffled, the results improved for the methods based on D₁. Thus, the specific structure of the imbalance may also influence the performance of D₁, although it is unclear how.
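The contrast between the two statistics can be made concrete. The sketch below is not the authors' code, and the input values are made up for illustration; it follows the standard formulations in which D₀ refers the pooled Wald statistic to the total covariance matrix T = Ū + (1 + 1/M)B, whereas D₁ uses only the average within-imputation covariance Ū together with a single average relative increase in variance r (Rubin, 2004; Li et al., 1991b):

```python
# Sketch of the two pooled Wald statistics for a k-dimensional estimate
# from M imputed datasets (illustrative, not the authors' implementation):
#   D0 = Qbar' T^{-1} Qbar / k,             with T = Ubar + (1 + 1/M) B
#   D1 = Qbar' Ubar^{-1} Qbar / (k (1+r)),  with r = (1 + 1/M) tr(B Ubar^{-1}) / k
# Both are referred to F distributions with k numerator degrees of freedom.

def mat_inv2(A):
    """Inverse of a 2x2 matrix (keeps the sketch dependency-free; k = 2 here)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad_form(Q, Ainv):
    """Quadratic form Q' Ainv Q for a 2-vector Q."""
    y = [Ainv[0][0] * Q[0] + Ainv[0][1] * Q[1],
         Ainv[1][0] * Q[0] + Ainv[1][1] * Q[1]]
    return Q[0] * y[0] + Q[1] * y[1]

def pool_d0_d1(Qs, Us):
    """Qs: M parameter-estimate vectors; Us: M within-imputation covariances."""
    M, k = len(Qs), len(Qs[0])
    Qbar = [sum(q[i] for q in Qs) / M for i in range(k)]
    Ubar = [[sum(U[i][j] for U in Us) / M for j in range(k)] for i in range(k)]
    # Between-imputation covariance B
    B = [[sum((q[i] - Qbar[i]) * (q[j] - Qbar[j]) for q in Qs) / (M - 1)
          for j in range(k)] for i in range(k)]
    T = [[Ubar[i][j] + (1 + 1 / M) * B[i][j] for j in range(k)]
         for i in range(k)]
    D0 = quad_form(Qbar, mat_inv2(T)) / k
    Ubar_inv = mat_inv2(Ubar)
    r = (1 + 1 / M) * sum(sum(B[i][j] * Ubar_inv[j][i] for j in range(k))
                          for i in range(k)) / k
    D1 = quad_form(Qbar, Ubar_inv) / (k * (1 + r))
    return D0, D1

# Hypothetical estimates and within-imputation covariances (M = 3, k = 2)
Qs = [[1.0, 0.5], [1.2, 0.4], [0.8, 0.6]]
Us = [[[0.04, 0.0], [0.0, 0.04]]] * 3
D0, D1 = pool_d0_d1(Qs, Us)
```

In this made-up example the between-imputation spread is large relative to Ū, so D₀ (which folds that spread into T per parameter) and D₁ (which averages it into a single r) give visibly different values.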

It may be advisable to use D₀ rather than D₁ in the case of highly unequal fractions of missing information, and when there is reason to believe that unequal fractions of missing information across parameter estimates are not random but systematic. Although the results for D₀ are not that good when M = 5, for M = 100 its results are comparable with those of Type III sum of squares. Since computers nowadays are fast enough to perform M = 100 imputations in a short time, this required high number of imputations for D₀ should not be an issue.

As for statistics D₂ and D₃: although Li et al. (1991a) showed that D₂ can sometimes be too liberal, both Grund et al. (2016) and the current study show satisfactory Type-I error rates for this statistic. Additionally, the current study shows that the power of D₂ is comparable to that of Type III sum of squares when M = 100. Considering these results and the fact that D₂ is relatively easy to compute, it may be a good candidate for pooling F-tests from ANOVA in multiply imputed data. However, as long as it is not clear when D₂ may overestimate the Type-I error rate, we must be cautious in concluding that D₂ is generally the best alternative for pooling the results of ANOVA in multiply imputed data.
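Part of D₂'s computational appeal is that it needs only the M complete-data chi-square statistics (or k times the F statistics), not the estimates and covariance matrices themselves. A minimal sketch of its computation, following the formulation in Li et al. (1991a), with hypothetical input values:

```python
import math
from statistics import mean, variance

def pool_d2(d, k):
    """Pool chi-square statistics d (one per imputed dataset, each with k df)
    into a single F statistic D2, following Li et al. (1991a)."""
    M = len(d)
    dbar = mean(d)
    # Relative increase in variance, estimated from the spread of sqrt(d_m)
    r = (1 + 1 / M) * variance([math.sqrt(x) for x in d])
    D2 = (dbar / k - (M + 1) / (M - 1) * r) / (1 + r)
    # Denominator df of the F(k, nu) reference distribution
    nu = k ** (-3 / M) * (M - 1) * (1 + 1 / r) ** 2
    return D2, nu

# Hypothetical chi-square statistics from M = 5 imputed datasets, k = 3 df
D2, nu = pool_d2([10.0, 12.0, 9.5, 11.0, 10.5], k=3)
```

The resulting D₂ would be referred to an F distribution with k numerator and ν denominator degrees of freedom; because only the test statistics enter the computation, D₂ can be applied even when software reports nothing but the complete-data F-tests.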

The results for D₃ were disappointing: in unbalanced data this statistic (severely) underestimated the Type-I error rates. Additionally, of all statistics, D₃ generally had the lowest power. An additional disadvantage of D₃ is that it is not easily computed, as it requires the likelihood of each imputed dataset to be evaluated at the averaged model parameters, which makes implementation in software packages complex. Considering these disappointing results and its complex implementation, D₃ is not the most convenient statistic for pooling the results of ANOVA.

The results of the current study imply that software packages in which D₁ is implemented may also need to include D₀ for cases where unequal fractions of missing information are inherently related to the parameters in question. While D₁ only produces accurate Type-I error rates when fractions of missing information are on average equal across replications (Li et al., 1991b), D₀ also produces accurate Type-I error rates for unequal average fractions of missing information across replications, and it has higher power than D₁ when M = 100. For M = 5, however, we must be cautious using both D₀ and D₁ with highly unequal fractions of missing information, as the former overestimates the Type-I error rate and the latter underestimates it.

To conclude, based on the current findings we recommend Type III sum of squares for unbalanced data, and either D₀ or D₂ with a high number of imputations (i.e., M = 100) for multiply imputed data.

## Funding

The authors have no funding to report.

## Competing Interests

The authors have declared that no competing interests exist.

## Acknowledgments

The authors have no support to report.

## Supplementary Materials

For this article the following supplementary materials are available (Van Ginkel & Kroonenberg, 2020):

• Tables providing results for J > 3 and larger N.

• Programming code of the simulation study.

### Index of Supplementary Materials

• Van Ginkel, J. R., & Kroonenberg, P. M. (2020). Supplementary materials to: Much ado about nothing: Multiple imputation to balance unbalanced designs for two-way analysis of variance [Tables, Code]. PsychOpen. https://doi.org/10.23668/psycharchives.4422

## References

• Agresti, A., & Coull, B. A. (1998). Approximate is better than ‘exact’ for interval estimation of binomial proportions. The American Statistician, 52, 119-126. https://doi.org/10.2307/2685469

• Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Thousand Oaks, CA, USA: SAGE.

• Grund, S., Lüdtke, O., & Robitzsch, A. (2016). Pooling ANOVA results from multiply imputed datasets: A simulation study. Methodology, 12(3), 75-88. https://doi.org/10.1027/1614-2241/a000111

• Harel, O. (2009). The estimation of R2 and adjusted R2 in incomplete datasets using multiple imputation. Journal of Applied Statistics, 36(10), 1109-1118. https://doi.org/10.1080/02664760802553000

• IBM SPSS. (2017). SPSS (Version 25.0 for Windows) [Computer software]. Author.

• Li, K. H., Meng, X. L., Raghunathan, T. E., & Rubin, D. B. (1991a). Significance levels from repeated p-values with multiply-imputed data. Statistica Sinica, 1, 65-92.

• Li, K. H., Raghunathan, T. E., & Rubin, D. B. (1991b). Large-sample significance levels from multiply imputed data using moment based statistics and an F reference distribution. Journal of the American Statistical Association, 86, 1065-1073. https://doi.org/10.2307/2290525

• Little, R. J. A., & Schenker, N. (1995). Missing data. In G. Arminger, C. C. Clogg, & M. E. Sobel (Eds.), Handbook of statistical modeling for the social and behavioral sciences (pp. 39-75). New York, NY, USA: Plenum Press.

• Marshall, A., Altman, D. G., & Holder, R. L. (2010). Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: A resampling study. BMC Medical Research Methodology, 10, Article 112. https://doi.org/10.1186/1471-2288-10-112

• Marshall, A., Altman, D. G., Royston, P., & Holder, R. L. (2010). Comparison of techniques for handling missing covariate data with prognostic modelling studies: A simulation study. BMC Medical Research Methodology, 10, Article 7. https://doi.org/10.1186/1471-2288-10-7

• Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data (2nd ed.). Mahwah, NJ, USA: Erlbaum.

• Meng, X. L., & Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data sets. Biometrika, 79(1), 103-111. https://doi.org/10.1093/biomet/79.1.103

• Raghunathan, T. E., Lepkowski, J. M., Van Hoewyk, J., & Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology, 27, 85-95.

• R Core Team. (2018). R: A language and environment for statistical computing. Retrieved from https://www.R-project.org/

• Reiter, J. P. (2007). Small-sample degrees of freedom for multi-component significance tests with multiple imputation for missing data. Biometrika, 94(2), 502-508. https://doi.org/10.1093/biomet/asm028

• Rubin, D. B. (1986). Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business & Economic Statistics, 4(1), 87-94. https://doi.org/10.1080/07350015.1986.10509497

• Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys (2nd ed.). New York, NY, USA: Wiley.

• SAS. (2013). SAS (Version 9.4) [Computer software]. Author.

• Schafer, J. L. (1997). Analysis of incomplete multivariate data. London, United Kingdom: Chapman & Hall.

• Shaw, R. G., & Mitchell-Olds, T. (1993). Anova for unbalanced data: An overview. Ecology, 74(6), 1638-1645. https://doi.org/10.2307/1939922

• StataCorp. (2017). Stata Statistical Software (Version 15.0) [Computer software]. Author.

• Van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton, FL, USA: Chapman & Hall/CRC Press.

• Van Buuren, S., & Groothuis-Oudshoorn, C. G. M. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1-67. https://doi.org/10.18637/jss.v045.i03

• Van Ginkel, J. R. (2010). MI-MUL2.SPS [Computer code]. Retrieved from https://www.universiteitleiden.nl/en/staffmembers/joost-van-ginkel#tab-1

• Van Ginkel, J. R. (2019). Significance tests and estimates for R2 for multiple regression in multiply imputed datasets: A cautionary note on earlier findings, and alternative solutions. Multivariate Behavioral Research, 54(4), 514-529. https://doi.org/10.1080/00273171.2018.1540967

• Van Ginkel, J. R., & Kroonenberg, P. M. (2014). Analysis of variance of multiply imputed data. Multivariate Behavioral Research, 49(1), 78-91. https://doi.org/10.1080/00273171.2013.855890

• Von Hippel, P. T. (2007). Regression with missing Ys: An improved strategy for analyzing multiply imputed data. Sociological Methodology, 37(1), 83-117. https://doi.org/10.1111/j.1467-9531.2007.00180.x

• Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental designs (3rd ed.). New York, NY, USA: McGraw-Hill.

• Yuan, Y. C. (2011). Multiple Imputation using SAS Software. Journal of Statistical Software, 45(6), 1-25. https://doi.org/10.18637/jss.v045.i06