A balanced ANOVA design provides an unambiguous interpretation of the F-tests and has more power than an unbalanced design. In earlier literature, multiple imputation was proposed to create balance in unbalanced designs, as an alternative to Type-III sum of squares. In the current simulation study we studied four pooled statistics for multiple imputation, namely D₀, D₁, D₂, and D₃, in unbalanced data, and compared them with Type-III sum of squares. Statistics D₁ and D₂ generally performed best regarding Type-I error rates, and had power rates closest to those of Type-III sum of squares. Additionally, for the interaction, D₁ produced power rates higher than Type-III sum of squares. D₁ and D₂ may therefore be the best methods for pooling results in multiply imputed datasets, and for unbalanced data, D₁ might be a good alternative to Type-III sum of squares regarding the interaction.

In an experiment where two-way analysis of variance is the intended analysis, unforeseen circumstances may cause the design to become unbalanced. Unbalanced data may also occur in non-experimental research, where group sizes may be inherently unequal. One important consequence of imbalance is that due to the resulting multicollinearity

According to

For balancing unbalanced data, multiple imputation may be used as follows. First, cases are added to specific groups until all groups have equal size; the resulting dataset is balanced, but the added cases have missing data on the outcome variable. These missing data are then multiply imputed using factors A and B as categorical predictors of the outcome variable. Procedures for how to generate multiply imputed values for the missing data are described in, for example,
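The balancing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the function names `pad_to_balance` and `impute_once` are hypothetical, and the imputation model is assumed to be a normal cell-means model with a noninformative prior (one mean per A × B cell, common residual variance). The example cell sizes are taken from the 2 × 3 "medium imbalance" column of the cell-size table.

```python
import numpy as np

def pad_to_balance(a, b, y):
    """Append cases with missing y (NaN) until every A x B cell has the largest cell size."""
    a, b, y = np.asarray(a), np.asarray(b), np.asarray(y, dtype=float)
    cells = sorted(set(zip(a.tolist(), b.tolist())))
    target = max(int(np.sum((a == i) & (b == j))) for i, j in cells)
    add_a, add_b = [], []
    for i, j in cells:
        short = target - int(np.sum((a == i) & (b == j)))
        add_a += [i] * short
        add_b += [j] * short
    a_bal = np.concatenate([a, np.array(add_a, dtype=a.dtype)])
    b_bal = np.concatenate([b, np.array(add_b, dtype=b.dtype)])
    y_bal = np.concatenate([y, np.full(len(add_a), np.nan)])
    return a_bal, b_bal, y_bal

def impute_once(a, b, y, rng):
    """One imputed dataset from an assumed normal cell-means model."""
    y = y.copy()
    obs = ~np.isnan(y)
    cells = sorted(set(zip(a.tolist(), b.tolist())))
    means = {c: y[obs & (a == c[0]) & (b == c[1])].mean() for c in cells}
    fitted = np.array([means[c] for c in zip(a.tolist(), b.tolist())])
    df = int(obs.sum()) - len(cells)
    sse = np.sum((y[obs] - fitted[obs]) ** 2)
    sigma2 = sse / rng.chisquare(df)  # posterior draw of the residual variance
    for i, j in cells:
        in_cell = (a == i) & (b == j)
        n_obs = int(np.sum(in_cell & obs))
        # Posterior draw of the cell mean, then draws for the missing outcomes.
        mu = rng.normal(means[(i, j)], np.sqrt(sigma2 / n_obs))
        miss = in_cell & ~obs
        y[miss] = rng.normal(mu, np.sqrt(sigma2), size=int(miss.sum()))
    return y

rng = np.random.default_rng(0)
sizes = {(0, 0): 6, (0, 1): 10, (0, 2): 14, (1, 0): 12, (1, 1): 10, (1, 2): 8}
a = np.concatenate([[i] * n for (i, j), n in sizes.items()])
b = np.concatenate([[j] * n for (i, j), n in sizes.items()])
y = rng.normal(size=a.size)  # toy outcome, no true effects

a_bal, b_bal, y_bal = pad_to_balance(a, b, y)
imputed = [impute_once(a_bal, b_bal, y_bal, rng) for _ in range(5)]
print(a_bal.size, int(np.isnan(y_bal).sum()))
```

Drawing the variance and the cell means before drawing the missing outcomes is what separates this from single stochastic imputation: each of the five datasets reflects uncertainty about the model parameters as well as residual noise.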

Once multiply imputed datasets have been obtained, the ANOVAs can be applied to these datasets, and the results can be combined using specific combination rules. However,

However, one can reformulate the two-way ANOVA model as a regression model, so that the combination rules can be applied to the regression coefficients.
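As an illustration of this reformulation, the sketch below builds an effect-coded design matrix for a two-way layout, fits it by ordinary least squares, and pools coefficient estimates across datasets with Rubin's rules (within-imputation, between-imputation, and total covariance). The function names are hypothetical, and the random outcomes only stand in for five imputed datasets.

```python
import numpy as np

def effect_codes(levels, n_levels):
    """Effect coding: the last level is coded -1 on every column."""
    levels = np.asarray(levels)
    codes = np.zeros((levels.size, n_levels - 1))
    for j in range(n_levels - 1):
        codes[levels == j, j] = 1.0
    codes[levels == n_levels - 1, :] = -1.0
    return codes

def design_matrix(a, b, n_a, n_b):
    """Columns: intercept, A main effect, B main effect, A x B interaction."""
    xa, xb = effect_codes(a, n_a), effect_codes(b, n_b)
    # Interaction columns are all pairwise products of the A and B codes.
    xab = (xa[:, :, None] * xb[:, None, :]).reshape(len(xa), -1)
    return np.column_stack([np.ones(len(xa)), xa, xb, xab])

def ols(x, y):
    """OLS coefficient estimates and their estimated covariance matrix."""
    beta, *_ = np.linalg.lstsq(x, y, rcond=None)
    resid = y - x @ beta
    sigma2 = resid @ resid / (len(y) - x.shape[1])
    return beta, sigma2 * np.linalg.inv(x.T @ x)

def pool(estimates, covariances):
    """Rubin's rules: pooled estimate, within, between, and total covariance."""
    q = np.asarray(estimates)
    m = q.shape[0]
    q_bar = q.mean(axis=0)
    u_bar = np.mean(covariances, axis=0)
    dev = q - q_bar
    between = dev.T @ dev / (m - 1)
    return q_bar, u_bar, between, u_bar + (1 + 1 / m) * between

rng = np.random.default_rng(1)
a = np.repeat([0, 1], 30)                  # factor A, 2 levels
b = np.tile(np.repeat([0, 1, 2], 10), 2)   # factor B, 3 levels
x = design_matrix(a, b, 2, 3)
fits = [ols(x, rng.normal(size=60)) for _ in range(5)]  # toy stand-ins
q_bar, u_bar, between, total = pool([f[0] for f in fits], [f[1] for f in fits])
print(x.shape, q_bar.round(3))
```

With effect coding in a balanced design, every non-intercept column sums to zero, so the main-effect and interaction columns are mutually orthogonal and each effect can be tested on its own block of coefficients.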

For pooling the results of two-way ANOVA

Let

The pooled covariance matrix has two components, namely a within-imputation covariance matrix

The total covariance matrix of the estimate

To test all

which has an approximate

A pooled

Because

where

which, under the assumption that

Despite the assumption of equal

Define

as the Wald statistic of imputed dataset

as the average Wald statistic across imputed datasets, and

as an alternative estimate of the relative increase in variance due to nonresponse. Statistic

As a reference distribution for

An advantage of

Statistic

as the average

Statistic

The reference distribution that is used for

For a specific effect in the model (factor A, factor B, interaction)

Several authors (

However, _{1}, _{2}, and _{3}). The statistics

In short, both valid arguments for balancing unbalanced data using multiple imputation prior to two-way ANOVA, and simulation studies that confirm its usefulness, seem to be lacking. However, the fact that this suggestion has been made in the literature, or even just the fact that unbalanced data are often described as a missing-data problem and that multiple imputation is a highly recommended procedure for dealing with missing data, calls for a simulation study to investigate the usefulness of this suggestion. In the current paper we will carry out such a simulation study. Consequently, the first research question is whether there is some benefit in using multiple imputation for balancing an unbalanced design prior to a two-way ANOVA after all.

Furthermore,

However, the question is to what extent the results by

In a situation which inherently has unequal fractions of missing information across parameter estimates, a statistic assuming equal fractions of missing information across parameter estimates (

Furthermore, although

When fractions of missing information randomly vary across parameters, the fractions of missing information may not be equal within one replication, but the average fractions of missing information across replicated datasets are. Consequently, the negative effect of unequal fractions of missing information may cancel itself out across replications. However, in situations where the differences in fractions of missing information across parameters do not vary across replicated datasets, a statistic might be needed that allows for different fractions of missing data across parameter estimates.
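To make the distinction concrete, the sketch below computes a pooled Wald-type statistic that uses a single averaged relative increase in variance, next to the per-parameter relative increases it averages over. It assumes the commonly cited form of D₁, in which the within-imputation covariance is inflated by one common factor; the function name and the input values are hypothetical, and the reference distribution is left out.

```python
import numpy as np

def d1_statistic(q_bar, u_bar, between, m, idx):
    """Pooled Wald-type statistic for H0: all parameters in `idx` are zero."""
    idx = np.asarray(idx)
    k = idx.size
    q = np.asarray(q_bar)[idx]
    u = np.asarray(u_bar)[np.ix_(idx, idx)]
    b = np.asarray(between)[np.ix_(idx, idx)]
    # Single relative increase in variance, averaged over the k parameters.
    r_avg = (1 + 1 / m) * np.trace(b @ np.linalg.inv(u)) / k
    d1 = q @ np.linalg.solve(u, q) / (k * (1 + r_avg))
    # Per-parameter relative increases (diagonal elements only), for comparison.
    r_each = (1 + 1 / m) * np.diag(b) / np.diag(u)
    return d1, r_avg, r_each

# Hypothetical pooled quantities for two parameters and m = 5 imputations.
q_bar = np.array([0.5, -0.2])
u_bar = np.diag([0.04, 0.04])
between = np.diag([0.01, 0.03])
d1, r_avg, r_each = d1_statistic(q_bar, u_bar, between, m=5, idx=[0, 1])
print(d1, r_avg, r_each)
```

In this toy input the two parameters have relative increases of 0.3 and 0.9, while the statistic uses their average of 0.6 for both, which is the kind of mismatch the paragraph above describes.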

Thus, a second research question is how the different pooling statistics from

In the next section, the setup of the simulation study is described. In the section that follows, results of the simulation study are shown. Finally, in the discussion section conclusions will be drawn about the usefulness of multiple imputation for balancing unbalanced designs, and implications for which statistic to use will be discussed.

Data were simulated according to a two-way ANOVA model in the form of a regression model with effect coded predictors. Some of the properties of the data were held constant while some were varied (discussed next). The properties that were varied resulted in several design cells. Within each design cell 2500 replications were drawn (based on studies by

The simulations were programmed in R (

As in many other simulation studies, decisions regarding properties of the simulation design were to some extent arbitrary. However, prior to running the simulations, some test runs were done to see which properties would make the effects of imbalance and the differences between the statistics most clearly visible, and which were also likely to occur in practice. The properties of the simulation design discussed next are mostly the result of these test runs.

The number of levels of factor A was

The number of levels of factor B was

For each

For

For

Finally, for

Small, medium, and large sample sizes were studied. Because

Four different degrees of imbalance were simulated, along with balanced data for comparison. The degree of imbalance was varied as follows: for a specific design cell, the cell size was either increased or decreased by adding the same number to, or subtracting the same number from, the cell size at the previous degree of imbalance, starting from the balanced case. Cell sizes were increased and decreased such that the total sample size remained the same.
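The construction can be checked with a short sketch. The step pattern used here is read off the rows of the cell-size table for three levels of factor B and a balanced cell size of 10; the names `STEP_2X3` and `cell_sizes` are hypothetical.

```python
import numpy as np

# Per-cell step for the 2 x 3 design, cells ordered 11, 12, 13, 21, 22, 23.
# The steps sum to zero, so the total sample size never changes.
STEP_2X3 = np.array([-2, 0, 2, 1, 0, -1])

def cell_sizes(balanced_n, step, degree):
    """Cell sizes after applying `step` `degree` times to a balanced design."""
    sizes = balanced_n + degree * np.asarray(step)
    if np.any(sizes < 1):
        raise ValueError("degree too large: a cell would become empty")
    return sizes

labels = ["balanced", "small", "medium", "severe", "extra severe"]
for degree, label in enumerate(labels):
    sizes = cell_sizes(10, STEP_2X3, degree)
    print(f"{label:>12}: {sizes.tolist()} (total N = {sizes.sum()})")
```

Because the step vector sums to zero, any multiple of it redistributes cases between cells without changing the total of 60.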

Additionally, to study whether it mattered which cells increased or decreased in size, an additional situation of imbalance was created where the cell sizes of the most severe case of imbalance were randomly redistributed across design cells. The cell sizes for each degree of imbalance are displayed for small

| Cell size | Balanced | Imbalance: small | Imbalance: medium | Imbalance: severe | Imbalance: extra severe | Imbalance: extra severe, order shuffled |
|---|---|---|---|---|---|---|
| No. levels factor B: 3 | | | | | | |
| _{11} | 10 | 8 | 6 | 4 | 2 | 18 |
| _{12} | 10 | 10 | 10 | 10 | 10 | 10 |
| _{13} | 10 | 12 | 14 | 16 | 18 | 2 |
| _{21} | 10 | 11 | 12 | 13 | 14 | 6 |
| _{22} | 10 | 10 | 10 | 10 | 10 | 10 |
| _{23} | 10 | 9 | 8 | 7 | 6 | 14 |
| No. levels factor B: 4 | | | | | | |
| _{11} | 10 | 8 | 6 | 4 | 2 | 10 |
| _{12} | 10 | 10 | 10 | 10 | 10 | 18 |
| _{13} | 10 | 10 | 10 | 10 | 10 | 10 |
| _{14} | 10 | 12 | 14 | 16 | 18 | 2 |
| _{21} | 10 | 11 | 12 | 13 | 14 | 10 |
| _{22} | 10 | 10 | 10 | 10 | 10 | 6 |
| _{23} | 10 | 10 | 10 | 10 | 10 | 10 |
| _{24} | 10 | 9 | 8 | 7 | 6 | 14 |
| No. levels factor B: 5 | | | | | | |
| _{11} | 10 | 8 | 6 | 4 | 2 | 10 |
| _{12} | 10 | 10 | 10 | 10 | 10 | 18 |
| _{13} | 10 | 10 | 10 | 10 | 10 | 10 |
| _{14} | 10 | 10 | 10 | 10 | 10 | 10 |
| _{15} | 10 | 12 | 14 | 16 | 18 | 2 |
| _{21} | 10 | 11 | 12 | 13 | 14 | 10 |
| _{22} | 10 | 10 | 10 | 10 | 10 | 6 |
| _{23} | 10 | 10 | 10 | 10 | 10 | 10 |
| _{24} | 10 | 10 | 10 | 10 | 10 | 10 |
| _{25} | 10 | 9 | 8 | 7 | 6 | 14 |

Nine methods for handling imbalance were used: Type-III sum of squares, and two versions of each of the statistics

For each of the

To get a rough impression of how close the Type-I error rates were to α = .05 under the null hypothesis, it was tested whether the empirical rejection rates differed significantly from 0.05, using an

Eighteen tables were needed to report all the results. Because results showed similar patterns across different

| Method | Balanced | Imbalance: small | Imbalance: medium | Imbalance: severe | Imbalance: extra severe | Imbalance: extra severe, order shuffled |
|---|---|---|---|---|---|---|
| Effect A | | | | | | |
| Type-III | .052 | .049 | .051 | .054 | .055 | .055 |
| | | .049 | .056 | .059^{a} | .061^{a} | .058 |
| | | .051 | .051 | .054 | .054 | .053 |
| | | .047 | .052 | .052 | .052 | .054 |
| | | .052 | .053 | .056 | .060^{a} | .058 |
| | | .045 | .034^{a} | .027^{a} | .019^{a} | .016^{a} |
| | | .044 | .036^{a} | .024^{a} | .006^{a} | .005^{a} |
| Effect B | | | | | | |
| Type-III | .047 | .049 | .054 | .049 | .048 | .048 |
| | | .054 | .062^{a} | .067^{a} | .077^{a} | .082^{a} |
| | | .053 | .059^{a} | .053 | .053 | .053 |
| | | .051 | .046 | .054 | .050 | .054 |
| | | .053 | .058 | .051 | .052 | .053 |
| | | .050 | .055 | .058 | .054 | .057 |
| | | .055 | .060^{a} | .054 | .050 | .051 |
| | | .042 | .034^{a} | .026^{a} | .024^{a} | .023^{a} |
| | | .048 | .042 | .026^{a} | .014^{a} | .018^{a} |
| Effect A × B | | | | | | |
| Type-III | .058 | .048 | .054 | .056 | .045 | .045 |
| | | .055 | .057 | .070^{a} | .083^{a} | .084^{a} |
| | | .056 | .054 | .058 | .050 | .048 |
| | | .047 | .049 | .048 | .053 | .052 |
| | | .057 | .055 | .056 | .050 | .052 |
| | | .048 | .050 | .052 | .056 | .055 |
| | | .058 | .056 | .059^{a} | .046 | .048 |
| | | .041^{a} | .031^{a} | .028^{a} | .021^{a} | .022^{a} |
| | | .048 | .037^{a} | .031^{a} | .012^{a} | .015^{a} |

^{a}Significantly different from theoretical significance level of α = .05.
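With 2500 replications per design cell, the width of the band around α = .05 inside which a rejection rate is not flagged can be worked out directly. The sketch below uses a two-sided normal-approximation test of a proportion. That choice of test is an assumption made for illustration, and the function name is hypothetical.

```python
import math

def differs_from_nominal(rate, n_rep=2500, alpha=0.05, z_crit=1.959964):
    """Two-sided normal-approximation test of H0: true rejection rate = alpha."""
    se = math.sqrt(alpha * (1 - alpha) / n_rep)
    return abs(rate - alpha) / se > z_crit

se = math.sqrt(0.05 * 0.95 / 2500)
print(f"standard error: {se:.5f}")
for rate in (0.041, 0.042, 0.058, 0.059):
    print(rate, differs_from_nominal(rate))
```

Under this approximation the standard error is about .0044, so rates outside roughly .041 to .059 are flagged. The four example rates behave the same way as the corresponding flagged and unflagged entries in the table.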

| Method | Balanced | Imbalance: small | Imbalance: medium | Imbalance: severe | Imbalance: extra severe | Imbalance: extra severe, order shuffled |
|---|---|---|---|---|---|---|
| Effect A | | | | | | |
| Type-III | .764 | .742 | .722 | .686 | .549 | .549 |
| | | .728 | .666^{a} | .585^{a} | .415^{a} | .406^{a} |
| | | .748 | .728 | .680 | .543 | .541 |
| | | .715^{a} | .636^{a} | .547^{a} | .388^{a} | .370^{a} |
| | | .751 | .731 | .687 | .558 | .556 |
| | | .698^{a} | .546^{a} | .370^{a} | .140^{a} | .142^{a} |
| | | .732 | .671^{a} | .555^{a} | .261^{a} | .253^{a} |
| Effect B | | | | | | |
| Type-III | .976 | .972 | .957 | .922 | .826 | .836 |
| | | .966 | .934^{a} | .870^{a} | .729^{a} | .745^{a} |
| | | .976 | .960 | .925 | .829 | .838 |
| | | .963^{a} | .926^{a} | .850^{a} | .645^{a} | .665^{a} |
| | | .974 | .964 | .934^{a} | .830 | .837 |
| | | .944^{a} | .843^{a} | .687^{a} | .426^{a} | .426^{a} |
| | | .976 | .961 | .924 | .807^{a} | .818^{a} |
| | | .947^{a} | .861^{a} | .706^{a} | .384^{a} | .398^{a} |
| | | .972 | .945^{a} | .886^{a} | .643^{a} | .642^{a} |
| Effect A × B | | | | | | |
| Type-III | .179 | .176 | .162 | .150 | .101 | .148 |
| | | .174 | .157 | .160 | .131^{a} | .173^{a} |
| | | .189 | .175 | .152 | .111 | .157 |
| | | .165 | .144^{a} | .126^{a} | .102 | .095^{a} |
| | | .192 | .183^{a} | .169^{a} | .137^{a} | .120^{a} |
| | | .150^{a} | .125^{a} | .115^{a} | .075^{a} | .117^{a} |
| | | .192 | .177 | .149 | .098 | .164^{a} |
| | | .146^{a} | .106^{a} | .077^{a} | .050^{a} | .042^{a} |
| | | .180 | .151 | .107^{a} | .049^{a} | .036^{a} |

^{a}Significantly different from Type-III, assuming Type-III is the “true” power.

Under the null hypothesis (

All methods based on

Finally, when the order of the cell sizes is shuffled, we only see changes in results for the interaction in the alternative model (

The main conclusion of this study is that there may be some benefit in doing multiple imputation for handling unbalanced data in two-way ANOVA after all. When using the appropriate statistics (

Although there seems to be some benefit in multiple imputation over Type-III sum of squares, one may wonder whether this benefit outweighs the costs. Multiple imputation is more work, the benefit seems to concern only the interaction, and it is not even entirely clear when it has higher power rates than Type-III sum of squares.

However, although the benefit of multiple imputation over Type-III sum of squares is relatively small, the results of this study are still important for other reasons. Previously it was only assumed that it was better to use

Furthermore,

As for statistics

The results of

The results of the current study imply that software packages do not need to replace

In conclusion, it may be a bit premature to conclude that multiple imputation is a good alternative to Type-III sum of squares in unbalanced data, given the extra amount of work and the fact that its benefits only seem to show in the interaction. Finally, as most other studies have already indicated, we recommend using either


The authors have no funding to report.

The authors have declared that no competing interests exist.

The authors have no support to report.