Methodology (PsychOpen) · Original Article · doi:10.5964/meth.4327

Much Ado About Nothing: Multiple Imputation to Balance Unbalanced Designs for Two-Way Analysis of Variance

Joost R. van Ginkel (Department of Psychology, Leiden University, Leiden, The Netherlands)
Pieter M. Kroonenberg (Department of Child and Family Studies, Leiden University, Leiden, The Netherlands)

Correspondence: Methodology and Statistics, Department of Psychology, Leiden University, PO Box 9500, 2300 RB Leiden, The Netherlands. Tel.: +31 (0)71 527 3620, jginkel@fsw.leidenuniv.nl

© Van Ginkel & Kroonenberg. This is an open-access article distributed under the terms of the Creative Commons Attribution (CC BY) 4.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In earlier literature, multiple imputation was proposed as an alternative to Type III sum of squares for creating balance in unbalanced designs in two-way ANOVA. In the current simulation study we examined four pooled statistics for multiple imputation, namely D₀, D₁, D₂, and D₃, in unbalanced data, and compared these statistics with Type III sum of squares. Statistics D₀ and D₂ generally performed best regarding Type-I error rates and had power rates closest to those of Type III sum of squares. However, none of the statistics produced power rates higher than Type III sum of squares. We conclude that D₀ and D₂ may be the best methods for pooling multiparameter estimates in multiply imputed datasets, and that for unbalanced data, Type III sum of squares is to be preferred over multiple imputation for obtaining ANOVA results.
Keywords: unbalanced designs · multiple imputation · two-way analysis of variance · missing data · Type III sum of squares
In an experiment where two-way analysis of variance is the intended analysis, unforeseen circumstances may occur which may cause the design to be unbalanced. Unbalanced data may also occur in non-experimental research when group sizes are unequal by themselves. One important consequence of imbalance is that, due to the resulting multicollinearity, F-tests in ANOVA have less power than in balanced designs (e.g., Shaw & Mitchell-Olds, 1993, p. 1643). One of the most widely used methods for calculating F-tests in unbalanced data is Type III sum of squares. This method has several advantages over other types of sum of squares, such as Type I and Type II. For example, in Type III sum of squares the F-values are not influenced by the order in which each effect is entered in the ANOVA, as they are in Type I sum of squares. For a discussion of the advantages of Type III sum of squares over the other types, see Maxwell and Delaney (2004). However, despite these advantages, the power of the F-tests in Type III sum of squares is still lower than in balanced designs.
According to Schafer (1997, p. 21) an unbalanced design can be considered a missing-data problem by imagining a number of additional cases with missing data on the outcome variable, which in the appropriate conditions would result in a balanced design (also, see Shaw & Mitchell-Olds, 1993, p. 1640, and Winer, Brown, & Michels, 1991, pp. 479-481). Considering an unbalanced design a missing-data problem creates new ways to handle the problems that come with unbalanced designs by using methods for dealing with missing data. Shaw and Mitchell-Olds (1993, p. 1641) discuss imputation (i.e., estimation) of the missing data to balance the design. They point out that imputation will give unbiased parameter estimates of the ANOVA model, but biased significance tests. This bias is caused by the fact that the imputed values are treated as observed values, and do not incorporate uncertainty of the missing values in the analysis (e.g., Schafer, 1997, p. 2). Thus, a method for estimating the missing values is needed that does take this uncertainty into account. One such method is multiple imputation (Rubin, 2004; Van Buuren, 2012). In multiple imputation, 1) missing values are estimated multiple times (M), resulting in M different complete versions of the incomplete dataset. Next, 2) these complete datasets are analyzed using the statistical analysis of interest, resulting in M outcomes of the same analysis, and 3) the results are combined into one pooled analysis, such that the uncertainty about the missing values is taken into account. For this pooling, formulas are used which will henceforth be called combination rules.
For balancing unbalanced data, multiple imputation may be used as follows. First, by adding a number of additional cases to specific groups such that all groups have equal size, the dataset is now balanced where in some cells cases have missing data on the outcome variable. These missing data are then multiply imputed using factors A and B as categorical predictors of the missing data on the outcome variable. Procedures for how to generate multiply imputed values for the missing data are described in, for example, Van Buuren (2012, Chapter 3) and Raghunathan, Lepkowski, Van Hoewyk, and Solenberger (2001).
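The balancing step described above can be sketched in code. The following is a minimal illustration, not the authors' implementation (the paper's own analyses were run in R; `balance_design` and the toy data are hypothetical names): every cell of the A × B layout is padded with cases whose outcome is missing (here `None`) up to the size of the largest cell, and these padded rows are what would subsequently be multiply imputed with A and B as predictors.

```python
from collections import Counter

def balance_design(rows):
    """rows: list of (a, b, y) tuples; returns a padded copy in which every
    A x B cell reaches the size of the largest cell, filler cases having
    a missing outcome (y = None)."""
    counts = Counter((a, b) for a, b, _ in rows)
    target = max(counts.values())
    padded = list(rows)
    for (a, b), n in counts.items():
        padded += [(a, b, None)] * (target - n)
    return padded

# Toy unbalanced 2 x 2 layout: cell (2, 2) has 3 cases, the others fewer.
data = [(1, 1, 26.0), (1, 1, 28.5), (1, 2, 27.1),
        (2, 1, 25.0), (2, 2, 29.3), (2, 2, 30.1), (2, 2, 27.8)]
balanced = balance_design(data)
# Every cell now has 3 cases; e.g., cell (1, 2) gained two rows with y = None.
```

The rows with `y = None` play the role of the "additional cases" in the text; in practice they would be imputed M times rather than analyzed directly.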
Once multiply imputed datasets have been obtained, the ANOVAs can be applied to these datasets, and the results can be combined using specific combination rules. However, Van Ginkel and Kroonenberg (2014) argue that the M F-tests of an effect in ANOVA in M multiply imputed datasets cannot be combined directly, because the combination rules that are suited for ANOVA (to be discussed shortly) require M sets of regression coefficients and M covariance matrices of these regression coefficients. See Rubin (2004, pp. 79-81) and Van Ginkel (2019), who show how to calculate a pooled F-statistic testing several regression coefficients for significance simultaneously in multiply imputed datasets.
However, one can reformulate the two-way ANOVA model as a regression model, so that the combination rules can be applied to the regression coefficients. Van Ginkel and Kroonenberg (2014) show that this can be done by recoding the categorical variables and their interaction into effect-coded variables, and using these effect-coded variables as predictors in the regression model (also, see Winer et al., 1991, pp. 959-963). Once estimates of the regression coefficients and their covariances have been obtained for each of the M imputed datasets, the results may be combined.
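To make the recoding concrete, the following sketch (hypothetical helper names; the paper's analyses were run in R) builds the effect-coded design row for one observation in a 2 × 3 design, using the usual convention that the last level of each factor receives −1 on all of that factor's columns, with interaction columns formed as products.

```python
def effect_codes(level, n_levels):
    """Effect codes for one factor: n_levels - 1 columns; the last level
    gets -1 on every column."""
    if level == n_levels - 1:
        return [-1] * (n_levels - 1)
    return [1 if k == level else 0 for k in range(n_levels - 1)]

def design_row(a, b, I=2, J=3):
    """Design-matrix row for observation in cell (a, b) of an I x J design:
    intercept, A codes, B codes, then A x B interaction products."""
    ca, cb = effect_codes(a, I), effect_codes(b, J)
    inter = [x * y for x in ca for y in cb]
    return [1] + ca + cb + inter
```

With I = 2 and J = 3 this yields 1 + 1 + 2 + 2 = 6 columns, matching the six regression coefficients of the 2 × 3 ANOVA model; each effect-coded column sums to zero over the full layout, which is what makes the intercept the grand mean.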
For pooling the results of two-way ANOVA, Rubin (2004) discusses four methods that may be good candidates. Schafer (1997) called three of these methods D1, D2, and D3. The fourth method was not given a name, probably because Schafer noted some problems with it, which made him set it aside altogether. However, for the current application of these combination rules this unnamed statistic may still be useful, for reasons discussed later on. Because it was chronologically the first method in Schafer's book, we will call this statistic D0. Each of these methods is discussed next.
Statistic D₀
Let $\hat{Q}_m$ be a vector of length p containing the p parameter estimates of a statistical model, and let $U_m$ be the covariance matrix of these p parameter estimates, in imputed dataset m. The pooled parameter estimates across the M imputed datasets are computed as:

$$\bar{Q} = \frac{1}{M}\sum_{m=1}^{M} \hat{Q}_m. \qquad (1)$$
The pooled covariance matrix has two components, namely the within-imputation covariance matrix $\bar{U}$ and the between-imputation covariance matrix $B$:

$$\bar{U} = \frac{1}{M}\sum_{m=1}^{M} U_m, \qquad (2)$$

$$B = \frac{1}{M-1}\sum_{m=1}^{M} \left(\hat{Q}_m - \bar{Q}\right)\left(\hat{Q}_m - \bar{Q}\right)'. \qquad (3)$$
The total covariance matrix of the estimate $\bar{Q}$ is computed as

$$T = \bar{U} + \left(1 + M^{-1}\right)B. \qquad (4)$$
To test all p parameter estimates in $\bar{Q}$ simultaneously, statistic $D_0$ (Rubin, 2004, pp. 77-78) is computed as:

$$D_0 = \bar{Q}'\,T^{-1}\bar{Q}\,/\,p, \qquad (5)$$
which has an approximate F-distribution with p numerator degrees of freedom and $(1 + p^{-1})\hat{\nu}/2$ denominator degrees of freedom, where the approximation of $\hat{\nu}$ by Reiter (2007) may be used. The formula for $\hat{\nu}$ is rather long and complex, so we do not reproduce it here; for computational details, see Reiter (2007).
A pooled F-test for the effect of factor A may be computed by filling in the regression coefficients of the effect-coded variables that together form factor A for $\hat{Q}_m$ in Equation 1 and Equation 3, filling in the covariances of these coefficients in Equation 2, and using the resulting $\bar{Q}$, $\bar{U}$, and $B$ in Equation 4 and Equation 5. F-tests for factor B and the interaction may be computed similarly. Computational details on how to obtain the covariance matrix $U_m$ for the effects of factors A, B, and the interaction are given in Van Ginkel and Kroonenberg (2014).
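The pooling steps in Equations 1 to 5 can be sketched numerically. This is an illustrative implementation only (the paper's simulations were run in R; `pool_D0` and `mat_solve` are hypothetical names), using pure-Python matrix helpers to stay self-contained; B is computed with an M − 1 denominator.

```python
def mat_solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting
    (A: list of lists, b: list). Sufficient for the small p used here."""
    n = len(A)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def pool_D0(Q_list, U_list):
    """Q_list: M coefficient vectors of length p; U_list: M (p x p)
    covariance matrices. Returns the D0 statistic of Equation 5."""
    M, p = len(Q_list), len(Q_list[0])
    Qbar = [sum(q[j] for q in Q_list) / M for j in range(p)]          # Eq. 1
    Ubar = [[sum(U[i][j] for U in U_list) / M for j in range(p)]      # Eq. 2
            for i in range(p)]
    B = [[sum((q[i] - Qbar[i]) * (q[j] - Qbar[j]) for q in Q_list) / (M - 1)
          for j in range(p)] for i in range(p)]                       # Eq. 3
    T = [[Ubar[i][j] + (1 + 1 / M) * B[i][j] for j in range(p)]       # Eq. 4
         for i in range(p)]
    x = mat_solve(T, Qbar)                                            # T^{-1} Qbar
    return sum(Qbar[j] * x[j] for j in range(p)) / p                  # Eq. 5
```

When all M imputations coincide, B vanishes and D0 reduces to the complete-data Wald statistic divided by p, which is a quick sanity check on the implementation.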
Statistic D₁
Because B is a noisy estimate of the between-imputation covariance, and does not even have full rank if M < p, Schafer (1997, p. 113) advises against using statistic D0. Rather than using T as the estimate of the total covariance matrix, a more stable alternative is:
$$\tilde{T} = \left(1 + r_1\right)\bar{U}, \quad r_1 = \left(1 + M^{-1}\right)\operatorname{tr}\!\left(B\bar{U}^{-1}\right)/\,p, \qquad (6)$$
where $r_1$ is the relative increase in variance due to nonresponse. The resulting F-statistic, testing all elements of the vector $\bar{Q}$ for significance simultaneously, is:
$$D_1 = \bar{Q}'\,\tilde{T}^{-1}\bar{Q}\,/\,p, \qquad (7)$$
which, under the assumption that r1 is equal across all elements of Q¯, has an approximate F-distribution with p numerator degrees of freedom, and ν^ denominator degrees of freedom.
Despite the assumption of equal $r_1$ across all elements of $\bar{Q}$, Li, Raghunathan, and Rubin (1991b) showed that violating this assumption does not substantially influence the Type-I error rate of D1. Because of this finding, and because of the disadvantages of statistic D0, D1 has been widely implemented in statistical packages, such as SAS (SAS, 2013; Yuan, 2011), Stata (StataCorp, 2017), the miceadds package in R (Robitzsch, Grund, & Henke, 2017), and an SPSS (IBM SPSS, 2017) macro by Van Ginkel (2010). Using the procedure outlined by Van Ginkel and Kroonenberg (2014), Grund et al. (2016) showed that D1 generally produces Type-I error rates close to the theoretical α in both one-way and two-way ANOVA.
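For intuition, the D1 computation (Equations 6 and 7) can be sketched under the simplifying assumption that the within-imputation covariance matrix is diagonal, so no general matrix inversion is needed. This is a hypothetical illustration, not any of the package implementations named above; `pool_D1` is an assumed name.

```python
def pool_D1(Q_list, u_list):
    """Q_list: M coefficient vectors of length p; u_list: M vectors of
    within-imputation variances (diagonal of U_m, a simplifying
    assumption of this sketch). Returns (D1, r1)."""
    M, p = len(Q_list), len(Q_list[0])
    Qbar = [sum(q[j] for q in Q_list) / M for j in range(p)]
    ubar = [sum(u[j] for u in u_list) / M for j in range(p)]
    # Diagonal of B, with M - 1 denominator.
    Bdiag = [sum((q[j] - Qbar[j]) ** 2 for q in Q_list) / (M - 1)
             for j in range(p)]
    # r1 = (1 + 1/M) tr(B Ubar^{-1}) / p; for diagonal Ubar the trace is
    # a sum of variance ratios.
    r1 = (1 + 1 / M) * sum(b / u for b, u in zip(Bdiag, ubar)) / p
    # T~ = (1 + r1) Ubar, so the quadratic form is a weighted sum of squares.
    D1 = sum(Qbar[j] ** 2 / ((1 + r1) * ubar[j]) for j in range(p)) / p
    return D1, r1
```

With p = 1 this reproduces D0 exactly (the single r1 then equals the one relative increase in variance), which matches the equivalence of D0 and D1 for a single parameter noted later in the text.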
Statistic D₂
Define
$$d_{W,m} = \hat{Q}_m'\,U_m^{-1}\hat{Q}_m \qquad (8)$$
as the Wald statistic of imputed dataset m,
$$\bar{d}_W = \frac{1}{M}\sum_{m=1}^{M} d_{W,m} \qquad (9)$$
as the average Wald statistic across imputed datasets, and
$$r_2 = \left(1 + M^{-1}\right)\frac{1}{M-1}\sum_{m=1}^{M}\left(\sqrt{d_{W,m}} - \overline{\sqrt{d_W}}\right)^{2} \qquad (10)$$

(with $\overline{\sqrt{d_W}}$ the average of the $\sqrt{d_{W,m}}$ across imputed datasets)
as an alternative estimate of the relative increase in variance due to nonresponse. Statistic D2 is given by:
$$D_2 = \frac{\bar{d}_W\,p^{-1} - \left(M+1\right)\left(M-1\right)^{-1} r_2}{1 + r_2}. \qquad (11)$$
As a reference distribution for $D_2$, an F-distribution with p numerator degrees of freedom and $\nu_2 = p^{-3/M}\left(M-1\right)\left(1 + r_2^{-1}\right)^{2}$ denominator degrees of freedom is used. Applied to two-way ANOVA, significance tests for factors A, B, and the interaction are obtained by substituting the relevant coefficients for $\hat{Q}_m$ and the covariance matrices of these coefficients for $U_m$ in Equation 8.
An advantage of D2 is that it is easier to compute than D0 and D1: it can be calculated from the M separate Wald statistics alone, and no separate pooling of the covariance matrices is needed (Schafer, 1997, p. 115). Li, Meng, Raghunathan, and Rubin (1991a), however, show that D2 occasionally produces Type-I error rates as high as 8% when using α = .05. Li et al. (1991a) therefore argue that D2 should only be used as a rough guide. However, Grund et al. (2016) found Type-I error rates for D2 that were closer to the theoretical 5% than those found by Li et al. (1991a). Thus, how well D2 performs regarding Type-I error rates, and in which situations, remains unclear.
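Because D2 needs only the M Wald statistics, its computation fits in a few lines. The sketch below (a hypothetical illustration with an assumed name `pool_D2`, not the authors' code) follows Equations 9 to 11 and the ν₂ expression above, and assumes the Wald statistics are not all identical (so that r₂ > 0).

```python
import math
import statistics

def pool_D2(d_wald, p):
    """d_wald: the M complete-data Wald statistics; p: number of
    parameters tested. Returns (D2, nu2). Assumes r2 > 0."""
    M = len(d_wald)
    dbar = statistics.mean(d_wald)                       # Eq. 9
    roots = [math.sqrt(d) for d in d_wald]
    # Eq. 10: sample variance (M - 1 denominator) of the root-Wald values.
    r2 = (1 + 1 / M) * statistics.variance(roots)
    D2 = (dbar / p - (M + 1) / (M - 1) * r2) / (1 + r2)  # Eq. 11
    nu2 = p ** (-3 / M) * (M - 1) * (1 + 1 / r2) ** 2    # denominator df
    return D2, nu2
```

The p-value would then come from an F(p, ν₂) reference distribution; with only the M Wald statistics as input, no covariance matrices need to be stored, which is the computational advantage noted in the text.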
Statistic D₃
Statistic D3 (Meng & Rubin, 1992) is a combined likelihood-ratio test based on the M likelihood-ratio statistics from the M imputed datasets. Define $d_{L,m}$ as the likelihood-ratio statistic of imputed dataset m, and
$$\bar{d}_L = \frac{1}{M}\sum_{m=1}^{M} d_{L,m} \qquad (12)$$
as the average $d_{L,m}$ across imputed datasets. Next, define $d^{*}_{L,m}$ as the likelihood-ratio statistic of imputed dataset m, but now evaluated at the parameter estimates of the model averaged across the M imputed datasets. The average of the $d^{*}_{L,m}$ is computed as
$$\tilde{d}_L = \frac{1}{M}\sum_{m=1}^{M} d^{*}_{L,m}. \qquad (13)$$
Statistic D3 is computed as
$$D_3 = \frac{\tilde{d}_L}{p\left(1 + r_3\right)}, \quad r_3 = \frac{M+1}{p\left(M-1\right)}\left(\bar{d}_L - \tilde{d}_L\right). \qquad (14)$$
The reference distribution used for D3 is an F-distribution with p numerator degrees of freedom and denominator degrees of freedom given by:
$$\nu_3 = \begin{cases} 4 + \left(t - 4\right)\left[1 + \left(1 - 2t^{-1}\right) r_3^{-1}\right]^{2} & \text{if } t = p\left(M-1\right) > 4, \\ t\left(1 + p^{-1}\right)\left(1 + r_3^{-1}\right)^{2}/\,2 & \text{otherwise.} \end{cases} \qquad (15)$$
For a specific effect in the model (factor A, factor B, interaction) D3 is computed by letting dL,m and dL,m* be the likelihood-ratio tests that test the full model against the full model, minus the terms that together form the specific effect.
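Given the two averaged likelihood-ratio quantities defined above, the remaining arithmetic of D3 is simple. The sketch below (an assumed name `pool_D3`, an illustration rather than the mitml implementation) follows Equations 14 and 15 and assumes r₃ > 0, i.e., that the averaged statistic exceeds the constrained one.

```python
def pool_D3(dbar_L, dtilde_L, p, M):
    """dbar_L: average of the M likelihood-ratio statistics (Eq. 12);
    dtilde_L: average of the statistics re-evaluated at the pooled
    estimates (Eq. 13); p: parameters tested; M: imputations.
    Returns (D3, nu3). Assumes r3 > 0."""
    r3 = (M + 1) / (p * (M - 1)) * (dbar_L - dtilde_L)   # Eq. 14
    D3 = dtilde_L / (p * (1 + r3))
    t = p * (M - 1)
    if t > 4:                                            # Eq. 15, first case
        nu3 = 4 + (t - 4) * (1 + (1 - 2 / t) / r3) ** 2
    else:                                                # Eq. 15, second case
        nu3 = t * (1 + 1 / p) * (1 + 1 / r3) ** 2 / 2
    return D3, nu3
```

For the specific effects, one would first compute each dataset's likelihood-ratio statistic comparing the full model against the model without the effect, as described in the text, and feed the two averages into this routine.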
Goal of the Current Study
Several authors (Schafer, 1997; Shaw & Mitchell-Olds, 1993; Winer et al., 1991) have pointed out that unbalanced designs can be considered a missing-data problem. Shaw and Mitchell-Olds (1993) discussed imputation as a possible solution for imbalance and at the same time noted some problems of the imputation approach regarding p-values. However, they did not mention the use of multiple imputation to overcome the problems of single imputation in unbalanced designs. Although Schafer (1997) never explicitly suggested to balance an unbalanced design using multiple imputation, the suggestion implicitly followed from the fact that Schafer’s book is about multiple imputation. Van Ginkel and Kroonenberg (2014, p. 79) made the suggestion explicit, and used it in an empirical data example (pp. 87-88). The reasoning behind their suggestion was that balanced designs generally have more power than unbalanced designs, and that consequently multiple imputation in unbalanced designs could increase power.
However, Van Ginkel and Kroonenberg (2014) did not carry out any simulations to back up this claim. Additionally, one aspect that Van Ginkel and Kroonenberg did not include in their reasoning is that multiple imputation does not actually add information that could increase power. Multiple imputation does simulate additional cases, but imputing M values of the outcome variable for each additional case also creates additional variation in the parameter estimates of the ANOVA model, represented by the covariance matrix B or the relative increase in variance due to nonresponse (r1, r2, and r3). The statistics D0 to D3 include these quantities in their calculation in a way that lowers their power, which may consequently undo the power gained by creating additional cases. Also, Van Buuren (2012, p. 48) and Von Hippel (2007) argue that when missing data occur only on the outcome variable, analyzing only the complete cases may be preferred over multiple imputation. Since unbalanced data can be considered a situation with missing data on only the outcome variable, and Type III sum of squares in this context is equivalent to analyzing only the complete cases, it would follow that in unbalanced data Type III sum of squares is preferred over multiple imputation.
In short, both valid arguments for balancing unbalanced data using multiple imputation prior to two-way ANOVA and simulation studies that confirm its usefulness seem to be lacking. However, the fact that this suggestion has been made in the literature, together with the facts that unbalanced data are often considered a missing-data problem and that multiple imputation is a highly recommended procedure for dealing with missing data, suggests that it may not be obvious to all statisticians that this suggestion could be unhelpful. Hence, a simulation study is needed to establish its actual usefulness, and in the current paper we carry out such a study. The first research question is thus whether there is any benefit in using multiple imputation for balancing unbalanced data after all.
Furthermore, Grund et al. (2016) already investigated statistics D1, D2, and D3 in various situations. In their study they found that in most situations all statistics produced Type-I error rates close to 5%. However, in more extreme situations (e.g., small sample size, high percentages of missing data), D2, and D3 were somewhat conservative and had lower power than D1. Based on this study, D1 is the preferred statistic for combining the results of multiply imputed data in ANOVA.
However, the question is to what extent the results by Grund et al. (2016) generalize to a situation where unbalanced data are balanced using multiple imputation. D1 showed the most promising results, but it has one potential weak point, which may be especially problematic in unbalanced data: it assumes that the fraction of missing information is equal across the p parameter estimates in $\bar{Q}$. Unequal fractions of missing information across parameter estimates in $\bar{Q}$ are inherently related to unbalanced designs, as each regression coefficient of an effect-coded variable of a specific factor represents a specific group of this factor. Unequal group sizes imply that for smaller groups more additional values of the dependent variable have to be multiply imputed than for larger groups, resulting in unequal fractions of missing information across parameter estimates.
In a situation which inherently has unequal fractions of missing information across parameter estimates, a statistic assuming equal fractions of missing information across parameter estimates (D1) is not the most convenient choice, especially when an alternative version of this statistic exists (D0) that does not make this assumption (when p = 1, or when the fractions of missing information across parameter estimates are equal, both statistics are equivalent). Given that D1 generally gives the best results regarding power and Type-I error rates, and that D0 is equivalent to D1 but without the assumption of equal fractions of missingness across parameters, D0 might be the ideal candidate for calculating pooled F-tests in unbalanced data when imbalance is handled using multiple imputation. Although Schafer (1997) and Li et al. (1991b) advised against the use of D0, the argument against its use was mainly that for small M it may not perform well. However, this problem may easily be resolved by increasing M.
Furthermore, although Li et al. (1991b) showed that violation of equal fractions of missing information is not necessarily problematic for D1, in their study the unequal fractions of missing information randomly varied across parameters of the simulation model and across replicated datasets. However, in unbalanced data unequal fractions of missing information may not always randomly vary across parameters. For example, in a field experiment it may be more difficult to collect data for one experimental condition than for the other, or in a clinical trial more people may drop out in one condition than in the other. Also, in non-experimental studies some categorical variables may have unequal groups in the population, for example, ethnicity. When drawing a random sample from this population, the different ethnic groups will have unequal sizes in the sample as well.
When fractions of missing information randomly vary across parameters, the fractions of missing information may not be equal within one replication, but the average fractions of missing information across replicated datasets are. Consequently, the negative effect of unequal fractions of missing information may cancel itself out across replications. However, in situations where the differences in fractions of missing information across parameters do not vary across replicated datasets, a statistic might be needed that allows for different fractions of missing data across parameter estimates.
Thus, a second research question is how the different pooling statistics from Grund et al. (2016) behave in unbalanced data where the unequal fractions of missing information do not vary across replications, now also including statistic D0. To this end, rejection rates of statistics D0 to D3 were studied under the null and alternative hypotheses, along with the rejection rates for Type III sum of squares. For comparison, rejection rates were also studied for balanced data with the same total sample size.
In the next section, the setup of the simulation study is described. In the section that follows, results of the simulation study are shown. Finally, in the discussion section conclusions will be drawn about the usefulness of multiple imputation for balancing unbalanced designs, and implications for which statistic to use will be discussed.
Method
Data were simulated according to a two-way ANOVA model in the form of a regression model with effect coded predictors. Some of the properties of the data were held constant while some were varied (discussed next). The properties that were varied resulted in several design cells. Within each design cell 2500 replications were drawn (based on studies by Harel, 2009, and Van Ginkel, 2019).
The simulations were programmed in R (R Core Team, 2018). Multiple imputation was performed using Fully Conditional Specification (FCS; e.g., Van Buuren, 2012) in the mice package (Van Buuren & Groothuis-Oudshoorn, 2011). For continuous variables there are two different versions of FCS, namely regression imputation (FCS-R; e.g., Little & Schenker, 1995, p. 60; Van Buuren, 2012, p. 13) and Predictive Mean Matching (FCS-PMM; Rubin, 1986; Van Buuren, 2012, pp. 68-74). The default method in mice is FCS-PMM because it generally preserves distributional shapes better when data are not normally distributed than FCS-R (Marshall, Altman, & Holder, 2010; Marshall, Altman, Royston, & Holder, 2010). However, for imputing an outcome variable with a normally distributed error (as in our simulation model) FCS-R may be a better alternative as it explicitly imputes values according to a normal linear model. Thus, the method for multiple imputation was FCS-R. For the imputation of outcome variable Y, factors A and B plus their interaction were used as predictors. Type III sum of squares were computed using the car package (Fox & Weisberg, 2019). Statistic D3 was calculated using the mitml package (Grund et al., 2016). The other statistics D0, D1, and D2 were programmed by the first author.
As in many other simulation studies, decisions regarding properties of the simulation design were to some extent arbitrary. However, prior to running the simulations, some test runs were done to see which properties would make the effects of imbalance and the differences between the statistics most clearly visible, and which were also likely to occur in practice. The properties of the simulation design discussed next are mostly the result of these test runs.
Fixed Design Characteristics
The number of levels of factor A was I = 2. The error term of the ANOVA model was normally distributed with $\mu_\varepsilon = 0$ and $\sigma_\varepsilon = 18.37$.
Independent Variables

Number of Levels of Factor B
The number of levels of factor B was J = 3, 4, 5.
Parameters of the Two-Way ANOVA Model
For each J, two sets of parameters of the two-way ANOVA model were studied: one in which there were no effects of factor A, factor B, and the interaction in the population (the null model), and one in which there were effects (the alternative model). For J = 3 the two sets of parameter estimates were
$$\boldsymbol{\beta} = \left(\beta_0, \beta_1^{A}, \beta_2^{B}, \beta_3^{B}, \beta_4^{AB}, \beta_5^{AB}\right) = \left(27, 0, 0, 0, 0, 0\right)$$ and
$$\boldsymbol{\beta} = \left(\beta_0, \beta_1^{A}, \beta_2^{B}, \beta_3^{B}, \beta_4^{AB}, \beta_5^{AB}\right) = \left(27, -1.5, -3, 0, 1, -0.5\right).$$
For J = 4 the two sets were:
$$\boldsymbol{\beta} = \left(\beta_0, \beta_1^{A}, \beta_2^{B}, \beta_3^{B}, \beta_4^{B}, \beta_5^{AB}, \beta_6^{AB}, \beta_7^{AB}\right) = \left(27, 0, 0, 0, 0, 0, 0, 0\right)$$ and
$$\boldsymbol{\beta} = \left(\beta_0, \beta_1^{A}, \beta_2^{B}, \beta_3^{B}, \beta_4^{B}, \beta_5^{AB}, \beta_6^{AB}, \beta_7^{AB}\right) = \left(27, -1.5, -3, -1, 1, 1, 0, -0.5\right).$$
Finally, for J = 5 the two sets were:
$$\boldsymbol{\beta} = \left(\beta_0, \beta_1^{A}, \beta_2^{B}, \beta_3^{B}, \beta_4^{B}, \beta_5^{B}, \beta_6^{AB}, \beta_7^{AB}, \beta_8^{AB}, \beta_9^{AB}\right) = \left(27, 0, 0, 0, 0, 0, 0, 0, 0, 0\right)$$ and
$$\boldsymbol{\beta} = \left(\beta_0, \beta_1^{A}, \beta_2^{B}, \beta_3^{B}, \beta_4^{B}, \beta_5^{B}, \beta_6^{AB}, \beta_7^{AB}, \beta_8^{AB}, \beta_9^{AB}\right) = \left(27, -1.5, -3, -1, 0, 1, 1, 0.25, -0.25, -0.5\right).$$
Sample Size
Small, medium, and large sample sizes were studied. Because J also affects the number of design cells of the two-way ANOVA, sample size also partly depended on the size of J. For small N, the average cell size was 10, for medium N it was 20, and for large samples it was 30. Given these average cell sizes, sample sizes were N = 60, 120, 180 for J = 3, N = 80, 160, 240 for J = 4, and N = 100, 200, 300 for J = 5.
Degree and Structure of Imbalance
Four different degrees of imbalance were simulated, along with balanced data for comparison. The degree of imbalance was varied as follows: for a specific design cell, the cell size was either increased or decreased by repeatedly adding the same number to, or subtracting the same number from, the original cell size in the balanced case. Cell sizes were increased and decreased such that the total sample size remained the same.
Additionally, to study whether it mattered which cells increased or decreased in size, an additional situation of imbalance was created in which the cell sizes of the most severe case of imbalance were randomly redistributed across design cells. The cell sizes for each degree of imbalance are displayed for small N in Table 1. For medium and large N, the entries must be multiplied by 2 and 3, respectively.
Table 1

Cell Sizes for Different Degrees of Imbalance Under a Small Sample Size

| Cell size | Balanced | Small | Medium | Severe | Extra severe | Extra severe, order shuffled |
|---|---|---|---|---|---|---|
| *No. levels factor B: 3* | | | | | | |
| n11 | 10 | 8 | 6 | 4 | 2 | 18 |
| n12 | 10 | 10 | 10 | 10 | 10 | 10 |
| n13 | 10 | 12 | 14 | 16 | 18 | 2 |
| n21 | 10 | 11 | 12 | 13 | 14 | 6 |
| n22 | 10 | 10 | 10 | 10 | 10 | 10 |
| n23 | 10 | 9 | 8 | 7 | 6 | 14 |
| *No. levels factor B: 4* | | | | | | |
| n11 | 10 | 8 | 6 | 4 | 2 | 10 |
| n12 | 10 | 10 | 10 | 10 | 10 | 18 |
| n13 | 10 | 10 | 10 | 10 | 10 | 10 |
| n14 | 10 | 12 | 14 | 16 | 18 | 2 |
| n21 | 10 | 11 | 12 | 13 | 14 | 10 |
| n22 | 10 | 10 | 10 | 10 | 10 | 6 |
| n23 | 10 | 10 | 10 | 10 | 10 | 10 |
| n24 | 10 | 9 | 8 | 7 | 6 | 14 |
| *No. levels factor B: 5* | | | | | | |
| n11 | 10 | 8 | 6 | 4 | 2 | 10 |
| n12 | 10 | 10 | 10 | 10 | 10 | 18 |
| n13 | 10 | 10 | 10 | 10 | 10 | 10 |
| n14 | 10 | 10 | 10 | 10 | 10 | 10 |
| n15 | 10 | 12 | 14 | 16 | 18 | 2 |
| n21 | 10 | 11 | 12 | 13 | 14 | 10 |
| n22 | 10 | 10 | 10 | 10 | 10 | 6 |
| n23 | 10 | 10 | 10 | 10 | 10 | 10 |
| n24 | 10 | 10 | 10 | 10 | 10 | 10 |
| n25 | 10 | 9 | 8 | 7 | 6 | 14 |

Note. The columns Small through Extra severe, order shuffled, are the degrees of imbalance.
Method for Handling Imbalance
Nine methods for handling imbalance were used: Type III sum of squares, and two versions of each of the statistics D0, D1, D2, and D3. Because D0 is argued to be sensitive to a small number of imputations, two sizes of M were studied for each statistic, namely M = 5 and M = 100. This defines the eight procedures based on multiple imputation: D0,M=5, D0,M=100, D1,M=5, D1,M=100, D2,M=5, D2,M=100, D3,M=5, and D3,M=100.
Dependent Variables
For each of the F-tests in the two-way ANOVA, the number of times the null hypothesis was rejected was studied, denoted the rejection rate. As already noted, when p = 1, D0 and D1 are equivalent. Thus, for factor A, methods D0,M=5 and D0,M=100 are not displayed separately.
Results
To get a rough impression of how close the Type I error rates were to α = .05 under the null hypothesis, it was tested whether the empirical rejection rates differed significantly from 0.05, using an Agresti and Coull (1998) confidence interval. Under the alternative hypothesis, it was tested whether the empirical rejection rates of the multiple-imputation methods differed significantly from the rejection rates of Type III, assuming that the latter represent the “real” power. A method based on multiple imputation may be considered successful when under the null hypothesis it gives rejection rates not significantly different from 0.05, and when under the alternative hypothesis it gives rejection rates significantly higher than the rejection rates of Type III. To avoid redundancy in discussing the results, we will mainly focus on the results of factor B and less on the results of factor A and the interaction.
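The check described above can be sketched as follows. The exact implementation used in the study is not given; this sketch (hypothetical name `agresti_coull_contains`) assumes a standard 95% Agresti-Coull interval around the empirical rejection rate based on the 2500 replications per cell, with 0.05 as the reference value.

```python
import math

def agresti_coull_contains(rate, n=2500, p0=0.05, z=1.96):
    """True if p0 lies inside the Agresti-Coull interval for an empirical
    proportion `rate` based on n replications (95% level by default)."""
    x = rate * n + z ** 2 / 2          # adjusted successes
    n_t = n + z ** 2                   # adjusted trials
    p_t = x / n_t                      # adjusted proportion
    half = z * math.sqrt(p_t * (1 - p_t) / n_t)
    return p_t - half <= p0 <= p_t + half
```

Under this sketch, a rejection rate such as .052 would not be flagged, whereas rates such as .081 or .023 would fall outside the interval and be flagged as significantly different from .05, consistent with the kind of entries footnoted in the tables below.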
Eighteen tables were needed to report all the results. Because results showed similar patterns across different J and different N, results of the remaining independent variables are only reported for J = 3 and a small N. Results for J > 3 and larger N are provided in Supplementary Materials. Table 2 shows the results for all levels of imbalance, all methods for handling imbalance, and all effects in the ANOVA, under the null model. Table 3 shows the results for all levels of imbalance, all methods for handling imbalance, and all effects in the ANOVA, under the alternative model.
Table 2

Rejection Rates for Each Effect Under the Null Model, a Small Sample Size, Three Levels of Factor B, for Different Methods for Handling Imbalance, and Different Degrees of Imbalance

| Method | Balanced | Small | Medium | Severe | Extra severe | Extra severe, order shuffled |
|---|---|---|---|---|---|---|
| *Effect A* | | | | | | |
| Type III | .052 | .049 | .051 | .054 | .055 | .055 |
| D1,M=5 | | .049 | .056 | .059ᵃ | .061ᵃ | .058 |
| D1,M=100 | | .051 | .051 | .054 | .054 | .053 |
| D2,M=5 | | .047 | .052 | .052 | .052 | .054 |
| D2,M=100 | | .052 | .053 | .056 | .060ᵃ | .058 |
| D3,M=5 | | .045 | .034ᵃ | .027ᵃ | .019ᵃ | .016ᵃ |
| D3,M=100 | | .044 | .036ᵃ | .024ᵃ | .006ᵃ | .005ᵃ |
| *Effect B* | | | | | | |
| Type III | .047 | .049 | .054 | .049 | .048 | .048 |
| D0,M=5 | | .052 | .059ᵃ | .064ᵃ | .073ᵃ | .081ᵃ |
| D0,M=100 | | .053 | .059ᵃ | .052 | .052 | .053 |
| D1,M=5 | | .038ᵃ | .035ᵃ | .034ᵃ | .033ᵃ | .052 |
| D1,M=100 | | .043 | .042 | .030ᵃ | .023ᵃ | .056 |
| D2,M=5 | | .050 | .055 | .058 | .054 | .057 |
| D2,M=100 | | .055 | .060ᵃ | .054 | .050 | .051 |
| D3,M=5 | | .042 | .034ᵃ | .026ᵃ | .024ᵃ | .023ᵃ |
| D3,M=100 | | .048 | .042 | .026ᵃ | .014ᵃ | .018ᵃ |
| *Effect A × B* | | | | | | |
| Type III | .058 | .048 | .054 | .056 | .045 | .045 |
| D0,M=5 | | .052 | .055 | .066ᵃ | .079ᵃ | .084ᵃ |
| D0,M=100 | | .056 | .054 | .058 | .050 | .048 |
| D1,M=5 | | .038ᵃ | .037ᵃ | .032ᵃ | .035ᵃ | .046 |
| D1,M=100 | | .044 | .035ᵃ | .034ᵃ | .022ᵃ | .055 |
| D2,M=5 | | .048 | .050 | .052 | .056 | .055 |
| D2,M=100 | | .058 | .056 | .059ᵃ | .046 | .048 |
| D3,M=5 | | .041ᵃ | .031ᵃ | .028ᵃ | .021ᵃ | .022ᵃ |
| D3,M=100 | | .048 | .037ᵃ | .031ᵃ | .012ᵃ | .015ᵃ |

ᵃ Significantly different from the theoretical significance level of α = .05.
Table 3

Rejection Rates for Each Effect Under the Alternative Model, a Small Sample Size, Three Levels of Factor B, for Different Methods for Handling Imbalance, and Different Degrees of Imbalance

| Method | Balanced | Small | Medium | Severe | Extra severe | Extra severe, order shuffled |
|---|---|---|---|---|---|---|
| *Effect A* | | | | | | |
| Type III | .764 | .746 | .728 | .686 | .549 | .549 |
| D1,M=5 | | .734 | .676ᵃ | .585ᵃ | .415ᵃ | .406ᵃ |
| D1,M=100 | | .753 | .731 | .680 | .543 | .541 |
| D2,M=5 | | .718ᵃ | .646ᵃ | .547ᵃ | .388ᵃ | .370ᵃ |
| D2,M=100 | | .755 | .735 | .687 | .558 | .556 |
| D3,M=5 | | .702ᵃ | .551ᵃ | .370ᵃ | .140ᵃ | .142ᵃ |
| D3,M=100 | | .736 | .676ᵃ | .555ᵃ | .261ᵃ | .253ᵃ |
| *Effect B* | | | | | | |
| Type III | .976 | .972 | .961 | .922 | .826 | .836 |
| D0,M=5 | | .966 | .934ᵃ | .863ᵃ | .719ᵃ | .746ᵃ |
| D0,M=100 | | .976 | .963 | .925 | .829 | .838 |
| D1,M=5 | | .947ᵃ | .873ᵃ | .750ᵃ | .505ᵃ | .676ᵃ |
| D1,M=100 | | .970 | .947ᵃ | .892ᵃ | .720ᵃ | .845 |
| D2,M=5 | | .946ᵃ | .852ᵃ | .687ᵃ | .426ᵃ | .426ᵃ |
| D2,M=100 | | .976 | .964 | .924 | .807ᵃ | .818ᵃ |
| D3,M=5 | | .949ᵃ | .869ᵃ | .706ᵃ | .384ᵃ | .398ᵃ |
| D3,M=100 | | .973 | .950ᵃ | .886ᵃ | .643ᵃ | .642ᵃ |
| *Effect A × B* | | | | | | |
| Type III | .179 | .182 | .169 | .150 | .101 | .148 |
| D0,M=5 | | .172 | .153ᵃ | .151 | .127ᵃ | .174ᵃ |
| D0,M=100 | | .194 | .182 | .151 | .111 | .157 |
| D1,M=5 | | .146ᵃ | .118ᵃ | .091ᵃ | .063ᵃ | .092ᵃ |
| D1,M=100 | | .174 | .149ᵃ | .111ᵃ | .075ᵃ | .126ᵃ |
| D2,M=5 | | .156ᵃ | .128ᵃ | .115ᵃ | .075ᵃ | .117ᵃ |
| D2,M=100 | | .197 | .183 | .149 | .098 | .164ᵃ |
| D3,M=5 | | .152ᵃ | .111ᵃ | .077ᵃ | .050ᵃ | .042ᵃ |
| D3,M=100 | | .186 | .154ᵃ | .107ᵃ | .049ᵃ | .036ᵃ |

ᵃ Significantly different from Type III, assuming Type III represents the "true" power.
Under the null hypothesis (Table 2), methods D1,M=5, D1,M=100, D3,M=5, and D3,M=100 tend to underestimate the Type-I error rate for factor B somewhat, which becomes worse as the degree of imbalance increases. This underestimation is worst for D3,M=100. Method D0,M=5 tends to overestimate the Type-I error rate, which also becomes worse as the degree of imbalance increases. Type III, D0,M=100, D2,M=5, and D2,M=100 produce Type-I error rates that do not differ significantly from 5% in most cases.
All methods based on M = 5 and some of the methods based on M = 100 (D1,M=100, D3,M=100) have lower power than Type III sum of squares (Table 3). The difference in power between these methods and Type III sum of squares becomes larger as the degree of imbalance increases. Methods D0,M=100 and D2,M=100 generally have power values close to that of Type-III sum of squares.
Finally, when the order of the cell sizes is shuffled, only the results for methods based on D1 change substantially; the results for methods D1,M=5 and D1,M=100 improve to acceptable Type I error rates and higher power.
Discussion
Van Ginkel and Kroonenberg (2014) suggested multiple imputation as a way to increase the power of a two-way ANOVA in unbalanced designs compared to Type III sum of squares. The current study examined whether multiple imputation indeed does so. Only D0,M=100 and D2,M=100 had power similar to that of Type III sum of squares, and no method had higher power than Type III sum of squares. The other methods based on multiple imputation had lower power than Type III sum of squares. Additionally, the earlier recommended D1 statistic underestimated the Type-I error rate under the null hypothesis with both M = 5 and M = 100 imputations, while statistic D0, which was not recommended earlier, produced Type-I error rates close to 0.05 when M = 100. When M = 5, however, D0 overestimated the Type-I error rate.
The main conclusion of this study is that multiple imputation is not an improvement over Type III sum of squares. At best, multiple imputation performs as well as Type III sum of squares when using the appropriate statistics (D0 and D2) with a large number of imputations (M = 100). Given that Type III sum of squares is easier to apply than multiple imputation, Type III sum of squares remains the preferred method in unbalanced designs.
Even though multiple imputation does not seem to be a useful alternative to Type III sum of squares, the results of this study are still important, and may in fact have implications for calculating significance tests for multiparameter estimates in multiply imputed datasets. Previously, two reasons were given for preferring D1 over D0: first, D0 uses a noisy estimate of the total covariance matrix in its computation, which is especially problematic for small M; second, Li et al. (1991b) showed that D1 was robust to violation of the assumption of equal fractions of missing information across parameter estimates. The results of the current study, however, point to a different conclusion. Although for M = 5, D0 usually overestimated the Type-I error rates, for M = 100, D0 gave satisfactory Type-I error rates. For D1, on the other hand, the Type-I error rates were too low for both M = 5 and M = 100. Additionally, the power rates of D0 were higher than those of D1. Thus, the claim that D1 is robust to violation of the assumption of equal fractions of missing information may only hold when the unequal fractions of missing information vary across replications, as was the case in the study by Li et al. (1991b). The problems that come with D0, on the other hand, may be less serious than was previously thought. It should be noted, however, that when the order of the cell sizes was shuffled, the results improved for the methods based on D1. Thus, the specific structure of the imbalance may also influence the performance of D1, although it is unclear how.
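The difference between the two statistics can be made concrete in a small sketch. Assuming, for simplicity only, a diagonal covariance matrix of the parameter estimates (so the matrix algebra reduces to per-parameter sums; the function name and example values below are ours, not taken from the simulation code, and the paper's own simulations were run in R rather than Python), D0 uses the total variance T = Ubar + (1 + 1/M)B directly, whereas D1 replaces it by (1 + r1)Ubar with a single averaged relative increase in variance r1:

```python
from statistics import mean, variance

def pool_d0_d1(estimates, variances, q0):
    """Compare the D0 and D1 multiparameter Wald statistics, assuming
    (for this sketch only) a diagonal covariance matrix of the estimates.

    estimates: M x k list, parameter estimates per imputed dataset
    variances: M x k list, squared standard errors per imputed dataset
    q0:        length-k list, parameter values under the null hypothesis
    """
    M, k = len(estimates), len(estimates[0])
    qbar = [mean(col) for col in zip(*estimates)]   # pooled estimates
    ubar = [mean(col) for col in zip(*variances)]   # within-imputation variance
    b = [variance(col) for col in zip(*estimates)]  # between-imputation variance
    # D0: the (noisy) total variance T = Ubar + (1 + 1/M) B enters directly
    t = [u + (1 + 1 / M) * bi for u, bi in zip(ubar, b)]
    d0 = sum((q - q0j) ** 2 / tj for q, q0j, tj in zip(qbar, q0, t)) / k
    # D1: T is replaced by (1 + r1) Ubar, with one averaged relative
    # increase in variance r1 = (1 + 1/M) tr(B Ubar^-1) / k
    r1 = (1 + 1 / M) * sum(bi / u for bi, u in zip(b, ubar)) / k
    d1 = sum((q - q0j) ** 2 / u for q, q0j, u in zip(qbar, q0, ubar)) / (k * (1 + r1))
    return d0, d1
```

In this diagonal sketch the two statistics coincide exactly when the relative increase in variance is the same for every parameter; they diverge precisely when the fractions of missing information are unequal across parameters, which is the situation discussed above.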
It may be recommendable to use D0 rather than D1 in case of highly unequal fractions of missing information, and when there is reason to believe that unequal fractions of missing data across parameter estimates are not random but systematic. Although the results for D0 are not that good when M = 5, for M = 100 its results are comparable with Type III sum of squares. Since nowadays computers are fast enough to perform M = 100 imputations in a short time, this required high number of imputations for D0 should not be an issue.
As for statistics D2 and D3, although Li et al. (1991a) showed that D2 could sometimes be too liberal, both Grund et al. (2016) and the current study show satisfactory Type-I error rates for this method. Additionally, the current study shows that the power of D2 is comparable to that of Type III sum of squares when M = 100. Considering these results and the fact that D2 is relatively easy to compute, it may be a good candidate for pooling F-tests from ANOVA in multiply imputed data. However, as long as it is not clear when D2 may overestimate the Type-I error rate, we must be cautious in concluding that D2 is generally the best alternative for pooling the results of ANOVA in multiply imputed data.
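Because D2 only needs the M chi-square statistics themselves, it is indeed easy to compute. A minimal sketch of the Li et al. (1991a) computation (in Python; the paper's simulations used R, and the function name and example values are ours):

```python
import math
from statistics import mean, variance

def pool_chi2_d2(chi2_values, k):
    """Pool chi-square statistics from M imputed datasets with the D2
    statistic of Li et al. (1991a).

    chi2_values: the k-df chi-square statistic from each imputed dataset
    Returns (d2, nu): the pooled statistic, to be referred to F(k, nu).
    """
    M = len(chi2_values)
    sqrt_d = [math.sqrt(d) for d in chi2_values]
    # Relative increase in variance, estimated from the spread of sqrt(d_m)
    r = (1 + 1 / M) * variance(sqrt_d)
    d2 = (mean(chi2_values) / k - (M + 1) / (M - 1) * r) / (1 + r)
    nu = k ** (-3 / M) * (M - 1) * (1 + 1 / r) ** 2  # denominator df
    return d2, nu
```

With, say, chi-square values 10.0, 12.0, and 11.0 on k = 2 degrees of freedom, the pooled p-value follows from the survival function of the F(k, nu) distribution evaluated at d2.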
The results of D3 were disappointing: in unbalanced data this method (severely) underestimated the Type-I error rates. Additionally, of all statistics, D3 generally had the lowest power. An additional disadvantage of D3 is that it is not easily computed, as it requires the likelihood of each imputed dataset m evaluated at the set of model parameters averaged across the imputed datasets. The latter makes implementation in software packages complex. Considering the disappointing results for D3 and its complex implementation, this statistic is not the most convenient method for pooling the results of ANOVA.
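To see where this complexity arises, D3 can be sketched from the published formulas of Meng and Rubin (1992) (this notation is ours, not the paper's own derivation). Let $d_m$ be the likelihood-ratio statistic computed in imputed dataset $m$ with that dataset's own parameter estimates, and let $\tilde{d}_m$ be the same statistic re-evaluated at the parameter estimates averaged over all $M$ datasets. With

```latex
\bar{d} = \frac{1}{M}\sum_{m=1}^{M} d_m, \qquad
\tilde{d} = \frac{1}{M}\sum_{m=1}^{M} \tilde{d}_m,
```

the pooled statistic is

```latex
r_3 = \frac{M+1}{k(M-1)}\bigl(\bar{d} - \tilde{d}\bigr), \qquad
D_3 = \frac{\tilde{d}}{k(1+r_3)}.
```

The term $\tilde{d}$ is what makes D3 awkward: standard ANOVA routines report $d_m$, but re-evaluating each dataset's likelihood at the averaged parameters requires access to the likelihood function itself.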
The results of the current study imply that software packages in which D1 is implemented may also need to include D0 in case of unequal fractions of missing information that are inherently related to the parameters in question. While D1 only produces accurate Type-I error rates when fractions of missing information are on average equal across replications (Li et al., 1991b), D0 also produces accurate Type-I error rates when the average fractions of missing information are unequal across replications, and it has higher power than D1 when M = 100. For M = 5, however, we must be cautious using both D0 and D1 with highly unequal fractions of missing data, as the former overestimates the Type-I error rate and the latter underestimates it.
To conclude, based on the current findings we recommend Type III sum of squares in unbalanced data, and we recommend using either D0 or D2 in multiply imputed data with a high (i.e., M = 100) number of imputations.
For this article the following supplementary materials are available (Van Ginkel & Kroonenberg, 2020):
Tables providing results for J > 3 and larger N.
Programming code of the simulation study.
Van Ginkel, J. R., & Kroonenberg, P. M. (2020). [Tables, Code]. PsychOpen. https://doi.org/10.23668/psycharchives.4422

References

Agresti, A., & Coull, B. A. (1998). Approximate is better than ‘exact’ for interval estimation of binomial proportions. The American Statistician, 52, 119–126. https://doi.org/10.2307/2685469
Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Thousand Oaks, CA, USA: SAGE.
Grund, S., Lüdtke, O., & Robitzsch, A. (2016). Pooling ANOVA results from multiply imputed datasets: A simulation study. Methodology, 12(3), 75–88. https://doi.org/10.1027/1614-2241/a000111
Harel, O. (2009). The estimation of R2 and adjusted R2 in incomplete datasets using multiple imputation. Journal of Applied Statistics, 36(10), 1109–1118. https://doi.org/10.1080/02664760802553000
IBM SPSS. (2017). SPSS (Version 25.0 for Windows) [Computer software]. Author.
Li, K. H., Meng, X. L., Raghunathan, T. E., & Rubin, D. B. (1991a). Significance levels from repeated p-values with multiply-imputed data. Statistica Sinica, 1, 65–92.
Li, K. H., Raghunathan, T. E., & Rubin, D. B. (1991b). Large-sample significance levels from multiply imputed data using moment based statistics and an F reference distribution. Journal of the American Statistical Association, 86, 1065–1073. https://doi.org/10.2307/2290525
Little, R. J. A., & Schenker, N. (1995). Missing data. In G. Arminger, C. C. Clogg, & M. E. Sobel (Eds.), Handbook of statistical modeling for the social and behavioral sciences (pp. 39–75). New York, NY, USA: Plenum Press.
Marshall, A., Altman, D. G., & Holder, R. L. (2010). Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: A resampling study. BMC Medical Research Methodology, 10, 112. https://doi.org/10.1186/1471-2288-10-112
Marshall, A., Altman, D. G., Royston, P., & Holder, R. L. (2010). Comparison of techniques for handling missing covariate data with prognostic modelling studies: A simulation study. BMC Medical Research Methodology, 10, 7. https://doi.org/10.1186/1471-2288-10-7
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data (2nd ed.). Mahwah, NJ, USA: Erlbaum.
Meng, X. L., & Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data sets. Biometrika, 79(1), 103–111. https://doi.org/10.1093/biomet/79.1.103
Raghunathan, T. E., Lepkowski, J. M., Van Hoewyk, J., & Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology, 27, 85–95.
R Core Team. (2018). R: A language and environment for statistical computing. Retrieved from https://www.R-project.org/
Reiter, J. P. (2007). Small-sample degrees of freedom for multi-component significance tests with multiple imputation for missing data. Biometrika, 94(2), 502–508. https://doi.org/10.1093/biomet/asm028
Robitzsch, A., Grund, S., & Henke, T. (2017). Package ‘miceadds’. Retrieved from https://cran.r-project.org/web/packages/miceadds/miceadds.pdf
Rubin, D. B. (1986). Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business & Economic Statistics, 4(1), 87–94. https://doi.org/10.1080/07350015.1986.10509497
Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys (2nd ed.). New York, NY, USA: Wiley.
SAS. (2013). SAS (Version 9.4) [Computer software]. Author.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London, United Kingdom: Chapman & Hall.
Shaw, R. G., & Mitchell-Olds, T. (1993). Anova for unbalanced data: An overview. Ecology, 74(6), 1638–1645. https://doi.org/10.2307/1939922
StataCorp. (2017). Stata Statistical Software (Version 15.0) [Computer software]. Author.
Van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton, FL, USA: Chapman & Hall/CRC Press.
Van Buuren, S., & Groothuis-Oudshoorn, C. G. M. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03
Van Ginkel, J. R. (2010). MI-MUL2.SPS [Computer code]. Retrieved from https://www.universiteitleiden.nl/en/staffmembers/joost-van-ginkel#tab-1
Van Ginkel, J. R. (2019). Significance tests and estimates for R2 for multiple regression in multiply imputed datasets: A cautionary note on earlier findings, and alternative solutions. Multivariate Behavioral Research, 54(4), 514–529. https://doi.org/10.1080/00273171.2018.1540967
Van Ginkel, J. R., & Kroonenberg, P. M. (2014). Analysis of variance of multiply imputed data. Multivariate Behavioral Research, 49(1), 78–91. https://doi.org/10.1080/00273171.2013.855890
Von Hippel, P. T. (2007). Regression with missing Ys: An improved strategy for analyzing multiply imputed data. Sociological Methodology, 37(1), 83–117. https://doi.org/10.1111/j.1467-9531.2007.00180.x
Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental designs (3rd ed.). New York, NY, USA: McGraw-Hill.
Yuan, Y. C. (2011). Multiple imputation using SAS software. Journal of Statistical Software, 45(6), 1–25. https://doi.org/10.18637/jss.v045.i06
The authors have no funding to report.
The authors have declared that no competing interests exist.