In earlier literature, multiple imputation was proposed as a way to create balance in unbalanced designs, as an alternative to Type III sum of squares in two-way ANOVA. In the current simulation study we studied four pooled statistics for multiple imputation, namely D₀, D₁, D₂, and D₃, in unbalanced data, and compared these statistics with Type III sum of squares. Statistics D₀ and D₂ generally performed best regarding Type-I error rates, and had power rates closest to those of Type III sum of squares. However, none of the statistics produced power rates higher than Type III sum of squares. The results lead to the conclusion that D₀ and D₂ may be the best methods for pooling the results of multiparameter estimates in multiply imputed datasets, and that for unbalanced data, Type III sum of squares is to be preferred over multiple imputation for obtaining ANOVA results.

In an experiment where two-way analysis of variance is the intended analysis, unforeseen circumstances may cause the design to be unbalanced. Unbalanced data may also occur in non-experimental research when group sizes are inherently unequal. One important consequence of imbalance is that due to the resulting multicollinearity

According to

For balancing unbalanced data, multiple imputation may be used as follows. First, additional cases are added to specific groups such that all groups have equal size. The dataset is then balanced, but in some cells cases have missing data on the outcome variable. These missing data are then multiply imputed using factors A and B as categorical predictors of the missing data on the outcome variable. Procedures for how to generate multiply imputed values for the missing data are described in, for example,

Once multiply imputed datasets have been obtained, the ANOVAs can be applied to these datasets, and the results can be combined using specific combination rules. However,

However, one can reformulate the two-way ANOVA model as a regression model, so that the combination rules can be applied to the regression coefficients.
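As an illustration of this reformulation, the following minimal Python sketch builds an effect-coded design matrix for a two-way layout. The function names, the 0-based integer level coding, and the choice of the last level as the level coded −1 are assumptions made for the example; the source does not specify them.

```python
import numpy as np

def effect_code(levels, n_levels):
    """Effect-code integer levels 0..n_levels-1 into n_levels-1 columns.

    Assumption for this sketch: the last level is coded -1 on every column.
    """
    levels = np.asarray(levels)
    codes = np.zeros((levels.size, n_levels - 1))
    for j in range(n_levels - 1):
        codes[levels == j, j] = 1.0
    codes[levels == n_levels - 1, :] = -1.0
    return codes

def two_way_design(a, b, n_a, n_b):
    """Design matrix with intercept, A columns, B columns, and A x B products."""
    xa = effect_code(a, n_a)
    xb = effect_code(b, n_b)
    # Interaction columns are all pairwise products of an A column and a B column.
    xab = np.column_stack([xa[:, i] * xb[:, j]
                           for i in range(n_a - 1)
                           for j in range(n_b - 1)])
    return np.column_stack([np.ones(len(a)), xa, xb, xab])

# Example: a 2 x 3 design with one case per cell.
a = np.repeat([0, 1], 3)
b = np.tile([0, 1, 2], 2)
X = two_way_design(a, b, n_a=2, n_b=3)
```

Regressing the outcome on the columns of `X` gives one block of coefficients per effect (factor A, factor B, interaction), which is what allows combination rules for regression coefficients to be applied to each effect separately.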

For pooling the results of two-way ANOVA

Let

The pooled covariance matrix has two components, namely a within-imputation covariance matrix

The total covariance matrix of the estimate

To test all

which has an approximate

A pooled

Because

where

which, under the assumption that

Despite the assumption of equal fractions of missing information across parameter estimates, D₁ generally produces Type-I error rates close to the theoretical α in both one-way and two-way ANOVAs.
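For readers who want to see how such a pooled statistic is computed, the sketch below implements D₁ in its standard form from the multiple-imputation literature, for the null hypothesis that all k parameters are zero. The function name and the variable names (`q_bar`, `u_bar`, `b`, `r1`) are assumptions chosen for this example, and the reference distribution needed for a p-value is deliberately left out.

```python
import numpy as np

def pool_d1(estimates, covariances):
    """Pooled statistic D1 for testing that all k parameters equal zero.

    estimates   : m parameter vectors of length k, one per imputed dataset
    covariances : m (k x k) covariance matrices of those estimates
    Returns (d1, r1), where r1 is the average relative increase in variance.
    """
    q = np.asarray(estimates, dtype=float)      # shape (m, k)
    u = np.asarray(covariances, dtype=float)    # shape (m, k, k)
    m, k = q.shape
    q_bar = q.mean(axis=0)                      # pooled estimate
    u_bar = u.mean(axis=0)                      # within-imputation covariance
    dev = q - q_bar
    b = dev.T @ dev / (m - 1)                   # between-imputation covariance
    u_inv = np.linalg.inv(u_bar)
    # One common relative increase in variance is used for all parameters,
    # which is where the equal-fractions assumption enters.
    r1 = (1 + 1 / m) * np.trace(b @ u_inv) / k
    d1 = q_bar @ u_inv @ q_bar / (k * (1 + r1))
    return d1, r1

# Example: two imputed datasets, one parameter.
d1, r1 = pool_d1([[1.0], [3.0]], [[[1.0]], [[1.0]]])
```

In this small example the pooled estimate is 2, the between-imputation variance is 2, so `r1` is 1.5 × 2 = 3 and `d1` is 4 / (1 + 3) = 1.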

Define

as the Wald statistic of imputed dataset

as the average Wald statistic across imputed datasets, and

as an alternative estimate of the relative increase in variance due to nonresponse. Statistic

As a reference distribution for

An advantage of

Statistic

as the average

Statistic

The reference distribution that is used for

For a specific effect in the model (factor A, factor B, interaction)

Several authors (

However, D₁, D₂, and D₃). The statistics

In short, both valid arguments for balancing unbalanced data using multiple imputation prior to two-way ANOVA and simulation studies that confirm its usefulness seem to be lacking. However, the suggestion has been made in the literature, unbalanced data are often considered a missing-data problem, and multiple imputation is a highly recommended procedure for dealing with missing data. Together, these facts suggest that it may not be obvious to all statisticians that the suggestion may not be useful. Hence, a simulation study is needed to evaluate its actual usefulness. In the current paper we carry out such a study. Consequently, the first research question is whether there is any benefit in using multiple imputation for balancing unbalanced data.

Furthermore,

However, the question is to what extent the results by

In a situation which inherently has unequal fractions of missing information across parameter estimates, a statistic assuming equal fractions of missing information across parameter estimates (

Furthermore, although

When fractions of missing information vary randomly across parameters, the fractions of missing information may not be equal within one replication, but the average fractions of missing information across replicated datasets are. Consequently, the negative effect of unequal fractions of missing information may cancel out across replications. However, in situations where the differences in fractions of missing information across parameters do not vary across replicated datasets, a statistic might be needed that allows for different fractions of missing information across parameter estimates.

Thus, a second research question is how the different pooling statistics from

In the next section, the setup of the simulation study is described. In the section that follows, results of the simulation study are shown. Finally, in the discussion section conclusions will be drawn about the usefulness of multiple imputation for balancing unbalanced designs, and implications for which statistic to use will be discussed.

Data were simulated according to a two-way ANOVA model in the form of a regression model with effect coded predictors. Some of the properties of the data were held constant while some were varied (discussed next). The properties that were varied resulted in several design cells. Within each design cell 2500 replications were drawn (based on studies by

The simulations were programmed in R (

As in many other simulation studies, decisions regarding properties of the simulation design were to some extent arbitrary. However, prior to running the simulations, some test runs were done to see which properties would make the effects of imbalance and the differences between the statistics most clearly visible, while still being likely to occur in practice. The properties of the simulation design discussed next are mostly the result of these test runs.

The number of levels of factor A was

The number of levels of factor B was

For each

For

Finally, for

Small, medium, and large sample sizes were studied. Because

Four different degrees of imbalance were simulated, along with balanced data for comparison. The degree of imbalance was varied as follows: for a specific design cell, the cell size was increased or decreased by repeatedly adding the same number to, or subtracting the same number from, the cell size in the balanced case. Cell sizes were increased and decreased such that the total sample size remained the same.
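The scheme can be sketched in a few lines of Python. The per-cell steps below are read off the cell sizes listed for the design with three levels of factor B and a balanced cell size of 10; the array layout and the function name are assumptions made for this illustration.

```python
import numpy as np

# Change in cell size per degree of imbalance for the 2 x 3 design.
# Rows are levels of factor A, columns are levels of factor B.
# The steps sum to zero, so the total sample size stays constant.
step = np.array([[-2, 0, 2],
                 [1, 0, -1]])

def cell_sizes(base, degree):
    """Cell sizes for a given degree of imbalance (0 = balanced, 4 = extra severe)."""
    return base + degree * step

for degree in range(5):
    sizes = cell_sizes(10, degree)
    print(degree, sizes.ravel().tolist(), "total:", int(sizes.sum()))
```

Degree 0 reproduces the balanced design and degree 4 the most severe case, with cell sizes 2, 10, 18, 14, 10, and 6.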

Additionally, to study whether it mattered which cells increased or decreased in size, an additional situation of imbalance was created where the cell sizes of the most severe case of imbalance were randomly redistributed across design cells. The cell sizes for each degree of imbalance are displayed for small

| Cell size | Balanced | Small imbalance | Medium imbalance | Severe imbalance | Extra severe imbalance | Extra severe imbalance, order shuffled |
|---|---|---|---|---|---|---|
| No. levels factor B: 3 | | | | | | |
| n_{11} | 10 | 8 | 6 | 4 | 2 | 18 |
| n_{12} | 10 | 10 | 10 | 10 | 10 | 10 |
| n_{13} | 10 | 12 | 14 | 16 | 18 | 2 |
| n_{21} | 10 | 11 | 12 | 13 | 14 | 6 |
| n_{22} | 10 | 10 | 10 | 10 | 10 | 10 |
| n_{23} | 10 | 9 | 8 | 7 | 6 | 14 |
| No. levels factor B: 4 | | | | | | |
| n_{11} | 10 | 8 | 6 | 4 | 2 | 10 |
| n_{12} | 10 | 10 | 10 | 10 | 10 | 18 |
| n_{13} | 10 | 10 | 10 | 10 | 10 | 10 |
| n_{14} | 10 | 12 | 14 | 16 | 18 | 2 |
| n_{21} | 10 | 11 | 12 | 13 | 14 | 10 |
| n_{22} | 10 | 10 | 10 | 10 | 10 | 6 |
| n_{23} | 10 | 10 | 10 | 10 | 10 | 10 |
| n_{24} | 10 | 9 | 8 | 7 | 6 | 14 |
| No. levels factor B: 5 | | | | | | |
| n_{11} | 10 | 8 | 6 | 4 | 2 | 10 |
| n_{12} | 10 | 10 | 10 | 10 | 10 | 18 |
| n_{13} | 10 | 10 | 10 | 10 | 10 | 10 |
| n_{14} | 10 | 10 | 10 | 10 | 10 | 10 |
| n_{15} | 10 | 12 | 14 | 16 | 18 | 2 |
| n_{21} | 10 | 11 | 12 | 13 | 14 | 10 |
| n_{22} | 10 | 10 | 10 | 10 | 10 | 6 |
| n_{23} | 10 | 10 | 10 | 10 | 10 | 10 |
| n_{24} | 10 | 10 | 10 | 10 | 10 | 10 |
| n_{25} | 10 | 9 | 8 | 7 | 6 | 14 |
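To make the balancing step concrete, the sketch below pads an unbalanced 2 × 3 dataset with cases whose outcome is missing until every cell is as large as the largest cell. It uses the extra severe cell sizes for three levels of factor B. The function name, the 0-based level coding, and the randomly generated outcome values are assumptions for illustration only, and the imputation of the added missing values is not shown.

```python
import numpy as np

def pad_to_balance(a, b, y, n_a, n_b):
    """Append cases with a missing outcome until all cells match the largest cell."""
    a = np.asarray(a, dtype=int)
    b = np.asarray(b, dtype=int)
    y = np.asarray(y, dtype=float)
    counts = np.zeros((n_a, n_b), dtype=int)
    np.add.at(counts, (a, b), 1)              # observed cell sizes
    target = int(counts.max())
    add_a, add_b = [], []
    for i in range(n_a):
        for j in range(n_b):
            short = target - int(counts[i, j])
            add_a += [i] * short
            add_b += [j] * short
    a_new = np.concatenate([a, np.array(add_a, dtype=int)])
    b_new = np.concatenate([b, np.array(add_b, dtype=int)])
    # Added cases carry NaN on the outcome; these are the values to be imputed.
    y_new = np.concatenate([y, np.full(len(add_a), np.nan)])
    return a_new, b_new, y_new

# Extra severe imbalance for the 2 x 3 design (total N = 60).
sizes = np.array([[2, 10, 18], [14, 10, 6]])
rng = np.random.default_rng(1)
a = np.repeat([0, 0, 0, 1, 1, 1], sizes.ravel())
b = np.repeat([0, 1, 2, 0, 1, 2], sizes.ravel())
y = rng.normal(size=a.size)                   # placeholder outcome values
a_bal, b_bal, y_bal = pad_to_balance(a, b, y, n_a=2, n_b=3)
```

In this example 48 cases with a missing outcome are added to the 60 observed cases, so that each of the six cells contains 18 cases.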

Nine methods for handling imbalance were used: Type III sum of squares, and two versions of each of the statistics

For each of the

To get a rough impression of how close the Type I error rates were to α = .05 under the null hypothesis, it was tested whether the empirical rejection rates differed significantly from 0.05, using an

Eighteen tables were needed to report all the results. Because results showed similar patterns across different

| Method | Balanced | Small imbalance | Medium imbalance | Severe imbalance | Extra severe imbalance | Extra severe imbalance, order shuffled |
|---|---|---|---|---|---|---|
| Effect A | | | | | | |
| Type III | .052 | .049 | .051 | .054 | .055 | .055 |
| | | .049 | .056 | .059^{a} | .061^{a} | .058 |
| | | .051 | .051 | .054 | .054 | .053 |
| | | .047 | .052 | .052 | .052 | .054 |
| | | .052 | .053 | .056 | .060^{a} | .058 |
| | | .045 | .034^{a} | .027^{a} | .019^{a} | .016^{a} |
| | | .044 | .036^{a} | .024^{a} | .006^{a} | .005^{a} |
| Effect B | | | | | | |
| Type III | .047 | .049 | .054 | .049 | .048 | .048 |
| | | .052 | .059^{a} | .064^{a} | .073^{a} | .081^{a} |
| | | .053 | .059^{a} | .052 | .052 | .053 |
| | | .038^{a} | .035^{a} | .034^{a} | .033^{a} | .052 |
| | | .043 | .042 | .030^{a} | .023^{a} | .056 |
| | | .050 | .055 | .058 | .054 | .057 |
| | | .055 | .060^{a} | .054 | .050 | .051 |
| | | .042 | .034^{a} | .026^{a} | .024^{a} | .023^{a} |
| | | .048 | .042 | .026^{a} | .014^{a} | .018^{a} |
| Effect A × B | | | | | | |
| Type III | .058 | .048 | .054 | .056 | .045 | .045 |
| | | .052 | .055 | .066^{a} | .079^{a} | .084^{a} |
| | | .056 | .054 | .058 | .050 | .048 |
| | | .038^{a} | .037^{a} | .032^{a} | .035^{a} | .046 |
| | | .044 | .035^{a} | .034^{a} | .022^{a} | .055 |
| | | .048 | .050 | .052 | .056 | .055 |
| | | .058 | .056 | .059^{a} | .046 | .048 |
| | | .041^{a} | .031^{a} | .028^{a} | .021^{a} | .022^{a} |
| | | .048 | .037^{a} | .031^{a} | .012^{a} | .015^{a} |

^{a}Significantly different from theoretical significance level of α = .05.
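As a rough guide to which empirical rejection rates would count as significantly different from .05 with 2500 replications, the following Python sketch uses a two-sided normal-approximation test for a proportion. This choice of test is an assumption made for illustration, as is the function name; it is not necessarily the test used in the study.

```python
import math

def differs_from_nominal(rate, n_rep=2500, alpha=0.05, z_crit=1.96):
    """Two-sided z-test of an empirical rejection rate against the nominal alpha.

    Assumption: normal approximation to the binomial, tested at the 5% level.
    """
    se = math.sqrt(alpha * (1 - alpha) / n_rep)   # standard error under H0
    return abs(rate - alpha) / se > z_crit

for rate in (0.041, 0.042, 0.058, 0.059):
    print(rate, differs_from_nominal(rate))
```

Under this approximation, rates below about .0415 or above about .0585 are flagged, which is consistent with the pattern of flagged entries in the table (for example, .059 is marked and .058 is not).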

| Method | Balanced | Small imbalance | Medium imbalance | Severe imbalance | Extra severe imbalance | Extra severe imbalance, order shuffled |
|---|---|---|---|---|---|---|
| Effect A | | | | | | |
| Type III | .764 | .746 | .728 | .686 | .549 | .549 |
| | | .734 | .676^{a} | .585^{a} | .415^{a} | .406^{a} |
| | | .753 | .731 | .680 | .543 | .541 |
| | | .718^{a} | .646^{a} | .547^{a} | .388^{a} | .370^{a} |
| | | .755 | .735 | .687 | .558 | .556 |
| | | .702^{a} | .551^{a} | .370^{a} | .140^{a} | .142^{a} |
| | | .736 | .676^{a} | .555^{a} | .261^{a} | .253^{a} |
| Effect B | | | | | | |
| Type III | .976 | .972 | .961 | .922 | .826 | .836 |
| | | .966 | .934^{a} | .863^{a} | .719^{a} | .746^{a} |
| | | .976 | .963 | .925 | .829 | .838 |
| | | .947^{a} | .873^{a} | .750^{a} | .505^{a} | .676^{a} |
| | | .970 | .947^{a} | .892^{a} | .720^{a} | .845 |
| | | .946^{a} | .852^{a} | .687^{a} | .426^{a} | .426^{a} |
| | | .976 | .964 | .924 | .807^{a} | .818^{a} |
| | | .949^{a} | .869^{a} | .706^{a} | .384^{a} | .398^{a} |
| | | .973 | .950^{a} | .886^{a} | .643^{a} | .642^{a} |
| Effect A × B | | | | | | |
| Type III | .179 | .182 | .169 | .150 | .101 | .148 |
| | | .172 | .153^{a} | .151 | .127^{a} | .174^{a} |
| | | .194 | .182 | .151 | .111 | .157 |
| | | .146^{a} | .118^{a} | .091^{a} | .063^{a} | .092^{a} |
| | | .174 | .149^{a} | .111^{a} | .075^{a} | .126^{a} |
| | | .156^{a} | .128^{a} | .115^{a} | .075^{a} | .117^{a} |
| | | .197 | .183 | .149 | .098 | .164^{a} |
| | | .152^{a} | .111^{a} | .077^{a} | .050^{a} | .042^{a} |
| | | .186 | .154^{a} | .107^{a} | .049^{a} | .036^{a} |

^{a}Significantly different from Type-III, assuming Type-III is the “true” power.

Under the null hypothesis (

All methods based on

Finally, when the order of the cell sizes is shuffled, only the results for methods based on

D₀, which was not recommended earlier, produced Type-I error rates close to 0.05 when

The main conclusion of this study is that multiple imputation is not an improvement over Type III sum of squares. At best, multiple imputation performs as well as Type III sum of squares when using the appropriate statistics (

Even though multiple imputation does not seem to be a useful alternative to Type III sum of squares, the results of this study are still important, and may in fact have implications for calculating significance tests for multi-parameter estimates in multiply imputed datasets. Previously it was assumed that it was better to use

It may be recommendable to use

As for statistics

The results of

The results of the current study imply that software packages in which

To conclude, based on the current findings we recommend Type III sum of squares in unbalanced data, and we recommend using either


The authors have no funding to report.

The authors have declared that no competing interests exist.

The authors have no support to report.