
Random effects contain crucial information for understanding the variability of the processes under study in mixed-effects models with crossed random effects (MEMs-CR). Given that model selection makes all-or-nothing decisions regarding the inclusion of model parameters, we evaluated whether model averaging could deal with model uncertainty to recover the random effects of MEMs-CR. Specifically, we analyzed the bias and the root mean squared error (RMSE) of the estimations of the variances of random effects using model averaging with Akaike weights and Bayesian model averaging with BIC posterior probabilities, comparing them with two alternative analytical strategies as benchmarks: AIC and BIC model selection, and fitting a full random structure. A simulation study was conducted manipulating the sample sizes of subjects and items and the variances of random effects. Results showed that model averaging, especially with Akaike weights, can adequately recover random variances, given a minimum sample size in the modeled clusters. Thus, we endorse using model averaging to deal with model uncertainty in MEMs-CR. An empirical illustration is provided to ease the use of model averaging.

Model averaging with Akaike weights uses the AIC index to compute an average estimate of the target parameters. Suppose that a set of R competing models, M_{1}, M_{2}, …, M_{R}, has been fitted to the data set. The procedure starts by ranking all the models and computing the so-called Akaike weights using the AIC index (e.g.,

∆_{r} = AIC_{r} − AIC_{min}

where ∆_{r} represents the difference between the AIC of each model M_{r} and the minimum AIC among the competing models (AIC_{min}). Akaike weights are then calculated as:

ω_{r} = exp(−∆_{r} / 2) / Σ_{s=1}^{R} exp(−∆_{s} / 2)

where ω_{r} is the resulting Akaike weight for each competing model M_{r}, based on its ∆_{r} relative to those of all R models. The average estimate of a target parameter is then obtained by weighting each model-specific estimate by its ω_{r} and summing across models. These averaged estimations have been found to generate estimations of SEs of fixed effects of MEMs-CR with small bias, being a less risky approach than model selection strategies (
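As a minimal sketch (not tied to any particular modeling package), the two formulas above can be implemented in a few lines; the function name `akaike_weights` is ours:

```python
import math

def akaike_weights(aics):
    """Compute Akaike weights from a list of AIC values.

    Each weight is exp(-delta_r / 2) normalized over all models,
    where delta_r = AIC_r - AIC_min.
    """
    aic_min = min(aics)
    deltas = [a - aic_min for a in aics]
    raw = [math.exp(-d / 2.0) for d in deltas]
    total = sum(raw)
    return [w / total for w in raw]

# Example: two models 2 AIC points apart; the better model gets ~73% weight.
weights = akaike_weights([100.0, 102.0])
```

Note that only AIC differences matter: adding a constant to every AIC leaves the weights unchanged.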

BMA shares the same rationale as model averaging with Akaike weights regarding the computation of averaged estimates for target parameters, but it uses a Bayesian framework. Suppose, again, that R competing models, M_{1}, M_{2}, …, M_{R}, have been fitted to the data set. Again, it is necessary to rank all the models, this time using the BIC index:

∆_{r} = BIC_{r} − BIC_{min}

where ∆_{r} represents the difference between the BIC of each model M_{r} and the minimum BIC among the competing models (BIC_{min}). Then, it is possible to approximate the posterior probability of each model as:

P(M_{r} | data) ≈ exp(−∆_{r} / 2) / Σ_{s=1}^{R} exp(−∆_{s} / 2)

where P(M_{r} | data) is the approximate posterior probability of each competing model M_{r}, based on its ∆_{r} relative to those of all R models. These posterior probabilities are then used as weights to compute the averaged estimates.
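A hedged sketch of BIC-based posterior probabilities, and of averaging a parameter across models (weighting models that omit the parameter with a zero estimate), could look like this; the function names and numbers are illustrative only:

```python
import math

def bic_posteriors(bics):
    """Approximate posterior model probabilities from BIC values:
    P(M_r | data) ~ exp(-delta_r / 2) / sum_s exp(-delta_s / 2),
    where delta_r = BIC_r - BIC_min (equal prior model probabilities assumed).
    """
    bic_min = min(bics)
    raw = [math.exp(-(b - bic_min) / 2.0) for b in bics]
    total = sum(raw)
    return [p / total for p in raw]

def model_average(estimates, weights):
    """Weighted average of a parameter across models; models that omit
    the parameter contribute an estimate of 0."""
    return sum(e * w for e, w in zip(estimates, weights))

# Example: three models; the first two omit the random slope (estimate 0).
posts = bic_posteriors([5010.0, 5008.0, 5004.0])
avg_slope_sd = model_average([0.0, 0.0, 95.0], posts)
```

Because the first two models contribute zeros, the averaged estimate is pulled below the estimate of the model that includes the parameter, which is exactly the shrinkage effect discussed in the next section.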

Random effects should be understood as informative parameters of the target processes under study, making them a relevant substantive part of the results report (

In this line, there is a handicap when researchers want to recover the random structure of MEMs-CR using model averaging. The procedure uses all the available information from the competing models but, unfortunately, some models do not include all the parameters of interest. This is the case of random effects. When some of the competing models do not include a given random intercept or random slope, researchers who average that parameter are weighting null information (a zero) in model averaging. Thus, the usefulness of model averaging could be compromised when recovering different random structures. The present study aims to evaluate the bias of averaged crossed random effects in simulation scenarios where the random structure of various competing models is incomplete. For this purpose, we compare the average estimates of the two model averaging strategies introduced previously (Akaike weights and BMA with BIC posterior probabilities) with model selection based on the AIC and BIC indices as benchmarks. We expect relevant differences between the two perspectives because model averaging is supposed to deal with model uncertainty, while model selection always makes an all-or-nothing decision. Given that all the MEMs-CR of the study are nested, the maximal MEM-CR (

An experimental design in psycholinguistics was simulated, including random intercepts for subjects and items (with standard deviations σ_{0s0} and σ_{00i}, respectively) and random slopes for subjects and items (σ_{1s0} and σ_{10i}). Two different variances (null and not-null; 0 and 100 ms in standard deviation terms) were considered for the random slopes of both clusters.

| Simulation conditions | Parameter | Values |
|---|---|---|
| Intercept | γ_{000} | 2,000 ms |
| Within-subject effect | γ_{100} | 0 ms |
| Residual level 1 variance | | 300 ms |
| Random intercepts for subjects | σ_{0s0} | 0; 100; 200 ms |
| Random intercepts for items | σ_{00i} | 0; 100; 200 ms |
| Random slopes for subjects | σ_{1s0} | 0; 100 ms |
| Random slopes for items | σ_{10i} | 0; 100 ms |
| Intercept-slope correlation for subjects and items | ρ_{01} | 0; 0.60 |
| Sample size for subjects | n_{1} | 15; 30; 60; 90 subjects |
| Sample size for items | n_{2} | 5; 10; 30; 60 items |

ρ_{01} represents the standardized intercept-slope covariance for subjects (

We assume that MEMs-CR naturally have random intercepts for both subjects and items. Also, we consider that the most complex MEM-CR will be the one with random intercepts and random slopes for both subjects and items. The former is called here the minimal MEM-CR and the latter the maximal MEM-CR.
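To make the simulated design concrete, a minimal sketch of a crossed random-effects data-generating process with parameter values from the conditions table could look as follows (assuming numpy; the alternating condition assignment is a simplifying assumption of ours, and the intercept-slope correlation is set to zero here):

```python
import numpy as np

rng = np.random.default_rng(2024)

# Simulation parameters (standard deviations, as in the conditions table).
n_subjects, n_items = 30, 10
gamma_000, gamma_100 = 2000.0, 0.0   # intercept and within-subject effect
sigma_e = 300.0                      # level-1 residual SD
sd_u0, sd_u1 = 100.0, 100.0          # subject random intercept / slope SDs
sd_w0, sd_w1 = 100.0, 100.0          # item random intercept / slope SDs

# Crossed random effects: every subject responds to every item.
u0 = rng.normal(0, sd_u0, n_subjects)  # subject intercept deviations
u1 = rng.normal(0, sd_u1, n_subjects)  # subject slope deviations
w0 = rng.normal(0, sd_w0, n_items)     # item intercept deviations
w1 = rng.normal(0, sd_w1, n_items)     # item slope deviations

rows = []
for s in range(n_subjects):
    for i in range(n_items):
        x = i % 2  # dummy-coded within-subject condition (hypothetical)
        y = (gamma_000 + u0[s] + w0[i]
             + (gamma_100 + u1[s] + w1[i]) * x
             + rng.normal(0, sigma_e))
        rows.append((s, i, x, y))
```

Each response thus receives one deviation per subject and one per item, which is what distinguishes crossed random effects from purely nested designs.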

First, model selection performance of the AIC and BIC fit indices was evaluated. It was computed as the proportion of true model selection: if the model used to simulate the data was selected, the selection was considered correct. Otherwise, it was considered incorrect (please note that selecting the

Second, the estimations of model averaging with Akaike weights and BMA with BIC posterior probabilities were computed. For this purpose, we applied the above-mentioned procedures using the four competing models of this simulation study, namely: minimal, subject random slopes, item random slopes, and maximal MEMs-CR. Specifically, we computed the average estimations of the random effects.

Third, the bias of the estimations of the variances of the random effects was computed using different measures. Bias was computed for all the random effects using the following formula:

Bias = (1 / K) Σ_{k=1}^{K} (θ̂_{k} − θ)

where θ̂_{k} is the estimate obtained in replication k, θ is the simulated population value, and K is the number of replications. The RMSE was computed analogously as the square root of the mean of the squared deviations.
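In code, bias and RMSE over simulation replications reduce to a simple aggregation; this sketch uses hypothetical replication values:

```python
import math

def bias_and_rmse(estimates, true_value):
    """Bias = mean(estimate - true); RMSE = sqrt(mean((estimate - true)^2)),
    both taken over simulation replications."""
    k = len(estimates)
    errors = [e - true_value for e in estimates]
    bias = sum(errors) / k
    rmse = math.sqrt(sum(err ** 2 for err in errors) / k)
    return bias, rmse

# Example: three replications of an estimator of a true SD of 100.
b, r = bias_and_rmse([90.0, 100.0, 110.0], 100.0)
```

The example shows why both measures are reported: the estimator is unbiased (errors cancel) but still has a non-zero RMSE reflecting its sampling variance.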

Model selection with AIC and BIC and fitting a full random structure (

| Fit index | Number of subjects | Number of items | Minimal | Subject random slopes | Item random slopes | Maximal |
|---|---|---|---|---|---|---|
| AIC | 15 | 5 | .91 (.28) | .25 (.43) | .20 (.40) | .07 (.25) |
| AIC | 15 | 10 | .90 (.29) | .45 (.50) | .42 (.49) | .22 (.41) |
| AIC | 15 | 30 | .90 (.30) | .81 (.39) | .85 (.37) | .78 (.41) |
| AIC | 15 | 60 | .90 (.30) | .92 (.28) | .92 (.27) | .96 (.18) |
| AIC | 30 | 5 | .91 (.28) | .44 (.50) | .31 (.46) | .18 (.39) |
| AIC | 30 | 10 | .91 (.29) | .72 (.45) | .63 (.48) | .51 (.50) |
| AIC | 30 | 30 | .90 (.31) | .92 (.26) | .92 (.26) | .98 (.14) |
| AIC | 30 | 60 | .89 (.31) | .94 (.23) | .93 (.26) | .99 (.02) |
| AIC | 60 | 5 | .91 (.28) | .67 (.47) | .51 (.50) | .41 (.49) |
| AIC | 60 | 10 | .91 (.29) | .88 (.32) | .82 (.38) | .84 (.36) |
| AIC | 60 | 30 | .89 (.31) | .93 (.25) | .94 (.24) | .99 (.01) |
| AIC | 60 | 60 | .89 (.31) | .93 (.26) | .93 (.26) | 1.00 (.00) |
| AIC | 90 | 5 | .91 (.29) | .77 (.42) | .60 (.49) | .57 (.49) |
| AIC | 90 | 10 | .90 (.30) | .92 (.27) | .88 (.33) | .92 (.27) |
| AIC | 90 | 30 | .88 (.31) | .93 (.25) | .94 (.24) | 1.00 (.00) |
| AIC | 90 | 60 | .88 (.32) | .93 (.25) | .94 (.24) | 1.00 (.00) |
| BIC | 15 | 5 | .99 (.06) | .05 (.22) | .03 (.16) | .00 (.04) |
| BIC | 15 | 10 | .99 (.05) | .11 (.32) | .09 (.28) | .01 (.11) |
| BIC | 15 | 30 | .99 (.02) | .42 (.49) | .52 (.49) | .24 (.42) |
| BIC | 15 | 60 | .99 (.02) | .79 (.41) | .87 (.34) | .69 (.46) |
| BIC | 30 | 5 | .99 (.04) | .15 (.36) | .05 (.21) | .00 (.08) |
| BIC | 30 | 10 | .99 (.02) | .34 (.47) | .20 (.40) | .07 (.26) |
| BIC | 30 | 30 | .99 (.02) | .86 (.34) | .87 (.34) | .75 (.43) |
| BIC | 30 | 60 | .99 (.02) | .99 (.10) | .99 (.08) | .99 (.12) |
| BIC | 60 | 5 | .99 (.03) | .34 (.47) | .11 (.32) | .04 (.20) |
| BIC | 60 | 10 | .99 (.03) | .68 (.47) | .48 (.50) | .34 (.47) |
| BIC | 60 | 30 | .99 (.01) | .99 (.08) | .99 (.08) | .98 (.13) |
| BIC | 60 | 60 | 1.00 (.00) | .99 (.01) | 1.00 (.00) | 1.00 (.00) |
| BIC | 90 | 5 | .99 (.02) | .49 (.50) | .19 (.39) | .10 (.30) |
| BIC | 90 | 10 | .99 (.02) | .83 (.38) | .68 (.47) | .59 (.49) |
| BIC | 90 | 30 | .99 (.01) | .99 (.03) | .99 (.01) | .99 (.03) |
| BIC | 90 | 60 | 1.00 (.00) | 1.00 (.00) | 1.00 (.00) | 1.00 (.00) |

In the first simulation scenario, symmetric variances of random intercepts and slopes (standard deviations of 100 for both) were considered for the two clusters (subjects and items).

[Figure: distributions of the estimates of the random-effect variances across replications, by number of subjects (n_{1}) and number of items (n_{2}). Continuous lines represent the mean; discontinuous lines represent the median.]

In the second simulation scenario, asymmetric variances of random intercepts and slopes (standard deviations of 200 and 100, respectively) were considered for the two clusters (subjects and items). Additionally, we only considered the estimations of not-null variances because their results did not present large differences from those of null symmetric variances of random effects. Thus, for the sake of brevity, we only present in this section the estimations of not-null asymmetric variances of random effects of subjects and items in model averaging and model selection.

First, a virtual absence of bias was found in the estimations of the random variances of the intercepts of subjects and items. In the most demanding conditions, the most severe bias was around 6% (that is, a bias of approximately −12.5 for the estimation of a standard deviation of 200). In addition, we found more bias in the estimations of random intercepts of items than of subjects. The comparison of bias and RMSE of the random intercepts of subjects and items shows that there is a relevant sampling variance of the estimates, which decreases as the sample sizes of both clusters increase.

Second, important bias was found for the estimations of the variances of the random slopes in the more demanding conditions. Again, model averaging and model selection based on the BIC index performed significantly worse than their AIC counterparts (the results of both indices are only comparable when sample sizes are large in both clusters). Given that the simulated random slopes had a standard deviation of 100, the results can be read directly as percentages of bias. Bias was large in all the conditions where the sample sizes of both clusters were small, that is, for 5–10 items and 15–30 subjects. When one of the clusters presented a medium or large sample size (e.g., 60–90 subjects or 30–60 items), the recovery of the variances of both random effects was accurate (bias was around 8%), and the results were near (but above) 10% of bias even in the more extreme conditions of small item sample sizes. In contrast, none of the conditions with 5 items reached an appropriate level of bias (with the bias of subject random slopes being larger than that of item random slopes), and the conditions with 10 items presented a similar pattern with lower bias. Again, the RMSE of the random slopes of subjects and items decreased as the sample sizes of both clusters increased. It is worth mentioning that model averaging with Akaike weights presented significantly lower bias and RMSE than model selection with the AIC index in the most demanding conditions, that is, when sample sizes were small in both clusters. This means that the bias of model averaging with Akaike weights, although relevant, was lower than that of model selection with AIC, and that the sampling variance of the estimates followed the same pattern.

In this section, we present an empirical illustration of the use of model averaging with Akaike weights in one of the simulated data sets of this study. In this data set, 90 subjects responded to 60 items. Given that this is a simulated data set, we know that the population model has a standard deviation of 100 (a variance of 10,000 in ms² terms) for the random slopes of subjects, and a zero variance for the random slopes of items. Then, the correct model that should be fitted to the data is a MEM-CR with not-null variances for the intercepts of both clusters, a not-null variance for the random slopes of subjects, and a null variance for the random slopes of items. But the reality is that applied researchers typically do not know the population model of the random structure underlying their data. Imagine that a researcher aiming to analyze this data set is considering four possible MEMs-CR with different random variances (e.g.,

Then, we can compute the average estimate for all the parameters of the model. For example, consider that we want to compute the average estimate of the variances of the random slopes. To do so, we would multiply the estimate of each of the competing models by its respective Akaike weight, and then sum the resulting weighted estimates. Considering this example, the averaged variance of the subject random slopes would be (in standard deviation terms):

σ_{1s0} = (.000 × .00) + (.753 × 95.80) + (.000 × .00) + (.247 × 95.81) = 95.80

Similarly, the averaged variance of the item random slopes would be:

σ_{10i} = (.000 × .00) + (.753 × .00) + (.000 × 13.95) + (.247 × 11.38) ≈ 2.82

The resulting estimates of random slopes from model averaging with Akaike weights are close to the simulated parameters. In this example, there was important evidence in favor of the model with subject random slopes, and the models with null subject random slopes were very unlikely given the estimated variability in the data set. In more demanding scenarios with less evidence in favor of a single model, Akaike weights would be less extreme, weighting the estimations of all the competing models more evenly. The same procedure would apply to the estimation of any of the parameters of the model, including the fixed effects and their standard errors.
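The computations of this illustration can be reproduced with a short script; the AIC values and model-specific estimates are those reported in the illustration, and the variable names are ours:

```python
import math

# AIC values of the four competing models (from the empirical illustration):
# minimal, subject random slopes, item random slopes, maximal.
aics = [77309.33, 77207.52, 77311.59, 77209.74]

# Akaike weights: exp(-delta/2) normalized over models.
aic_min = min(aics)
raw = [math.exp(-(a - aic_min) / 2.0) for a in aics]
weights = [w / sum(raw) for w in raw]   # ~ [.000, .753, .000, .247]

# Model-specific estimates of the subject random-slope SD (0 when omitted).
sd_slope_subjects = [0.00, 95.80, 0.00, 95.81]
averaged = sum(e * w for e, w in zip(sd_slope_subjects, weights))  # ~ 95.80
```

The two models without subject random slopes receive essentially zero weight, so the averaged estimate is driven almost entirely by the two models that include the parameter.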

Model fit and Akaike weights

| Estimated models | Minimal | Subject rand. slopes | Item rand. slopes | Maximal |
|---|---|---|---|---|
| AIC | 77309.33 | 77207.52 | 77311.59 | 77209.74 |
| AIC_{r} − AIC_{min} | 101.81 | .00 | 104.07 | 2.23 |
| ω_{r} | .000 | .753 | .000 | .247 |

Estimates of competing models

| Random effects | Minimal | Subject rand. slopes | Item rand. slopes | Maximal | Averaged estimates |
|---|---|---|---|---|---|
| σ_{0s0} | 152.80 | 112.79 | 152.80 | 112.80 | 112.79 |
| σ_{00i} | 84.13 | 84.21 | 78.57 | 78.52 | 82.80 |
| σ_{1s0} | .00 | 95.80 | .00 | 95.81 | 95.80 |
| σ_{10i} | .00 | .00 | 13.95 | 11.38 | 2.82 |

n_{1} = 90. n_{2} = 60.

Whilst random effects are usually considered of secondary interest, they contain crucial substantive information to understand the processes that are being modeled in MEMs-CR. For example, random intercepts mean that there are individual differences in the mean process, and random slopes mean that there are individual differences in the target fixed effect of the researchers. Here, we endorse the use of random effects as a confirmatory hypothesis testing approach (see also

We compared the bias of the average estimations of random effects of MEMs-CR using Akaike weights and BIC posterior probabilities, using AIC and BIC model selection and fitting a full random structure (here,

In scenarios with null random effects, no relevant differences were found between model averaging and model selection. Bias was larger in the more demanding conditions (e.g., small sample sizes), but it was not alarming. This means that both model averaging and model selection provide virtually unbiased estimates of population random effects that do not differ from zero, once minimal sample sizes are reached. Presumably, this simulation scenario is not very common in empirical research.

In scenarios with not-null random effects, which are, presumably, the most common empirical scenario, interesting differences were found between model averaging and model selection. First, no relevant differences were found between them when random intercepts were estimated: random intercepts were estimated without relevant bias in almost all conditions of the simulation study, including the presence of symmetrical and asymmetrical variances of random intercepts and slopes. Second, some differences were found between model averaging and model selection when random slopes were simulated, especially for the random slopes of subjects. That is, given a minimum sample size, model averaging could estimate random slopes more accurately than model selection (the differences between the two strategies mainly appeared in the random slopes of subjects, and were less important for the random slopes of items). The main advantage of model averaging was using all the available information, while model selection made all-or-nothing decisions that were incorrect on many occasions (in some scenarios, we even found a median of 100% of bias in random slopes when a null variance was assigned to a cluster). These differences depended on the smaller sample sizes, which probably limited the information available to the fit indices, with model averaging being capable of overcoming this limitation, at least in part.

Some relevant differences were also found between the two versions of model averaging, that is, Akaike weights and BIC posterior probabilities. In general, Akaike weights performed better than BIC posterior probabilities when there were not-null random effects (which is, in fact, the most common scenario in applied research). Again, once minimum sample sizes were reached, Akaike weights could estimate random effects without relevant bias, while BIC posterior probabilities obtained larger bias. In any case, the estimations of both versions of model averaging were affected by small sample sizes in either of the simulated clusters.

Additionally, interesting differences were found between model averaging with Akaike weights and fitting a full random structure (here,

Applying a MEM-CR involves decisions about its parametrization, given that true population models are not known. Model averaging proposals use all the available information of different models to deal with this uncertainty. Two general conclusions can be drawn from our analyses: model averaging and model selection show similar results for recovering null random effects, and model averaging shows less bias than model selection for recovering not-null random effects, especially for Akaike weights and the random slopes of subjects. These conclusions are conditional on minimum sample sizes in both clusters (that is, subjects and items). Thus, we endorse model averaging with Akaike weights as a relevant tool to recover random effects in studies with medium to large sample sizes.

There is consensus about the necessity of including all relevant dimensions in experimental conditions to potentiate ecological validity (

The present study simulated different random structures for MEMs-CR, but only two population parameters (a null vs. a not-null value) were considered to analyze the recovery of random effects. Also, a simple experimental design with two fixed effects (the intercept and the within-subject effect) was simulated because our objective was to evaluate the recovery of random effects using model averaging. However, we think that model averaging would be an affordable tool to recover small random effects, which are more difficult to estimate, avoiding the all-or-nothing decisions of model selection strategies. Also, given the importance of the design of the study in establishing model random structures, it would be interesting to expand these results to other relevant designs like longitudinal studies (see

Model averaging attempts to use all the available information of the competing models to deal with model uncertainty, beyond model selection based on AIC and BIC. Estimates of random effects without relevant bias were found for both model averaging with Akaike weights and BMA with BIC posterior probabilities under conditions with sufficient sample sizes in the target clusters. In general, 60 units per cluster were found to be an appropriate sample size to obtain very accurate estimations of the variances of random effects in the conditions of the present simulation study. But some differences were found in favor of model averaging with Akaike weights in demanding conditions like small sample sizes; it was also capable of estimating variances of random effects without relevant bias if one of the clusters presented a large sample size. This means that a larger number of subjects (e.g., 60 subjects) could compensate for the handicaps of smaller sample sizes of items (e.g., 10 items). Whilst the performance of model averaging is questionable under some simulation conditions, it represents an alternative to deal with model uncertainty even in those scenarios where model selection would require a risky all-or-nothing decision regarding the inclusion of parameters like random slopes. Thus, we recommend using model averaging with Akaike weights, and using both model averaging approaches with small sample sizes to analyze their convergences and divergences.

The authors have no funding to report.

The authors have declared that no competing interests exist.

The authors have no additional (i.e., non-financial) support to report.