Linear mixed effects models (LMEMs) have become a popular method for analyzing nested experimental data, which are often encountered in psycholinguistics and other fields. This approach allows experimental results to be generalized to the greater population of both subjects and experimental stimuli. In an influential paper, Barr and colleagues (2013; https://doi.org/10.1016/j.jml.2012.11.001) recommended specifying the maximal random effect structure allowed by the experimental design, which includes random intercepts and random slopes for all within-subjects and within-items experimental factors, as well as correlations between the random effect components. The goal of this paper is to formally investigate whether their recommendations can be generalized to a wider variety of experimental conditions. The simulation results revealed that complex models (i.e., those with more parameters) lead to a dramatic increase in the non-convergence rate. Furthermore, AIC and BIC were found to select the true model in the majority of cases, although selection accuracy varied by LMEM random effect structure.
In psycholinguistic studies, a common outcome measure is reaction time (RT). For example, subjects might judge whether strings of letters are words or non-words, indicating their decision by pressing a button. The real words represent different categories, constituting the experimental manipulation. The nature of psycholinguistic studies necessitates accounting for variability in the outcome variable caused by particular subjects and items (i.e., by-subject and by-item random effects).
Classical methods for analyzing psycholinguistic data are by-subjects or by-items analysis of variance (ANOVA), known as
The quasi
The recently popularized linear mixed effects model (LMEM) enables simultaneous modeling of by-subject and by-item random effects while handling missing data better than previous ANOVA methods (
Although LMEMs are receiving increased attention in psycholinguistics, there is no widely accepted rule for determining the random effects to include in an LMEM, causing confusion and inconsistent use of LMEMs. To provide practical recommendations concerning the choice of random effect structure,
Based on the results, the authors suggested that LMEMs with the most complex random effect structure justified by the design (i.e., maximal LMEMs) should be implemented, provided the model converges. Their results showed that maximal LMEMs performed well in terms of type I error rate, power, and model selection accuracy, while random-intercept-only models performed worse on all criteria, and usually even worse than separate
Although innovative and informative,
For cases in which a maximal model does not converge,
The purpose of this paper is to systematically investigate the applicability of the
An LMEM can be formally specified as follows,
In Equation (1),
If a hypothetical study were both within-subjects (i.e., each subject sees all items) and within-item (i.e., each item occurs in all experimental conditions), the corresponding LMEM could be expressed as follows:
In Equation (2), the response of a subject to an item is modeled as the fixed-effect intercept (β0) and fixed-effect slope (β1) plus random effects, where “fixed effect” means that the value does not change for different subjects and items. The fixed-effect intercept represents the average response value under one of the experimental conditions, while the fixed-effect slope represents the mean difference between the two experimental conditions (i.e., the overall treatment effect). The random intercepts adjust the baseline response value for each subject (S0s) and item (I0i). The random slopes adjust the treatment effect for each subject (S1s) and item (I1i). Consequently, a different response is predicted for each unique subject–item combination. While this is the conceptual basis of LMEM, the actual model-fitting procedure does not estimate individual subject and item random effects, instead estimating the population variance-covariance matrices of random subject effects (Equation 3) and random item effects (Equation 4), where
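As an illustration, the response-generation process described above can be simulated directly. The following sketch uses hypothetical parameter values and a simple counterbalancing scheme (none of which are taken from this paper), and omits the random correlations for brevity:

```python
import random

random.seed(1)
n_subj, n_item = 24, 24        # matches the simulation design in this paper
beta0, beta1 = 0.8, 0.8        # hypothetical fixed intercept and slope
tau0 = tau1 = 0.5              # hypothetical by-subject random-effect SDs
omega0 = omega1 = 0.5          # hypothetical by-item random-effect SDs
sigma = 1.0                    # hypothetical residual SD

# Draw one random intercept and one random slope per subject (S) and per item (I).
S0 = [random.gauss(0, tau0) for _ in range(n_subj)]
S1 = [random.gauss(0, tau1) for _ in range(n_subj)]
I0 = [random.gauss(0, omega0) for _ in range(n_item)]
I1 = [random.gauss(0, omega1) for _ in range(n_item)]

data = []
for s in range(n_subj):
    for i in range(n_item):
        x = (s + i) % 2  # counterbalanced 0/1 condition: each item occurs in both conditions
        y = (beta0 + S0[s] + I0[i]) + (beta1 + S1[s] + I1[i]) * x + random.gauss(0, sigma)
        data.append((s, i, x, y))

print(len(data))  # one response per subject-item pair: 576
```

Because the design is fully crossed, every subject contributes a response to every item, and both the baseline and the treatment effect are shifted by the subject's and the item's random effects.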
Multiple software packages can implement LMEMs, including SAS (e.g.,
Three evaluation criteria – parameter estimation accuracy, non-convergence rate, and model selection accuracy – were considered to evaluate different random effect structures in LMEMs. We first assessed parameter estimate accuracy of both fixed and random effects under the random effect structures included in the simulation study.
The second criterion was the non-convergence rate. Non-convergence occurs when the model-fitting algorithm fails to reach a solution within the specified number of iterations or stopping criteria. In real-world experiments, non-convergence is closely related to model complexity, especially the complexity of the random effects structure: it becomes more likely as models grow more complex, such as when additional parameters (e.g., item order effects or extra random components) are estimated.
The third evaluation criterion is model selection accuracy. Model selection means selecting a statistical model from candidate models for interpretation of the results. The candidate models should ideally be grounded in sound theory, and thus researchers should develop several theory-based candidate models for comparison using objective model-selection techniques (
A simulation study was designed to investigate non-convergence rates and model selection accuracy in LMEMs. Two factors were varied to generate datasets from nine models (
Random structure complexity | Number of predictors: 1X | 2X | 3X |
---|---|---|---|
Random-intercept only (A) | A1 | A2 | A3 |
No random correlations (B) | B1 | B2 | B3 |
Maximal (C) | C1 | C2 | C3 |
Additionally, the three random effects structures considered in this study – the random-intercept-only model, the model with random intercepts and slopes but no random correlations in the covariance matrix, and the maximal random effects model – are designated A, B, and C, respectively. The letter designations follow the order of increasing model complexity: the intercept-only model A is extended to create the no-correlation model B by adding by-subject and by-item random slopes, and introducing correlations between random intercepts and random slopes gives the maximal model C. These correlation coefficients conceptually represent the pairwise relatedness between random intercepts and slopes, and they appear as the off-diagonal terms of the random effect covariance matrices. In the three-predictor maximal (C3) model, the random correlation coefficients add 56 parameters compared to the three-predictor no-correlation (B3) model.
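The 56-parameter figure can be checked by counting: with three predictors and all their two- and three-way interactions there are 7 slope terms plus an intercept, i.e., 8 random-effect terms per grouping factor, and each pair of terms contributes one correlation:

```python
from math import comb

n_random_terms = 1 + 7                      # intercept + 7 slopes (3 main effects + 4 interactions)
corr_per_factor = comb(n_random_terms, 2)   # pairwise correlations among the 8 terms
extra_params = 2 * corr_per_factor          # by-subject and by-item grouping factors
print(corr_per_factor, extra_params)        # 28 56
```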
The combination of the number of predictors and the random effect structure letter creates a unique identifier for each model. For example, the random-intercept-only model with two predictors is labeled A2. In each of the nine simulation conditions, both the number of subjects and the number of items were 24, equivalent to the larger-sample conditions in
The simulation design included nine combinations of the levels of the two manipulated factors. As in
Parameter | Description | Value |
---|---|---|
β0 | Grand-average intercept | ~ |
β | Grand slopes (X1, X2, etc.) | 0.8 |
τ2 | By-subject random effect variances | ~ |
ω2 | By-item random effect variances | ~ |
λ | Random effect matrix eigenvalues | ~ |
ρ | Correlation between by-subject and by-item random effects | ~ |
σ2 | Residual error | ~ |
After generating 1,000 datasets from each of the nine models A1 through C3, each dataset was fit with four models differing in complexity of the random component. The simplest model had no random component, i.e., a fixed-effects-only regression model. This basic model was included to provide an additional candidate for model selection, and so that all LMEMs considered would have a simpler alternate model. In all fitted models, the fixed-effect structure was specified such that the fitted model included the same number of predictors and interactions as the model from which the data were generated.
From each fitted model, we recorded whether the model converged, along with the model’s AIC and BIC. To evaluate the effects of the experimental factors on non-convergence, a logistic regression was performed. The number of predictors in the simulation was included as two dummy-coded binary predictors, one-predictor (1 = model with 1 predictor) and three-predictors (1 = model with 3 predictors). Likewise, two dummy-coded binary variables - Underfit (1 = fitted model less complex than generated model) and Overfit (1 = fitted model more complex than generated model) - were also included as predictors.
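The dummy coding described above can be sketched as follows; the helper names and reference-level choices are illustrative, not the authors' actual code:

```python
def code_predictor_count(n_pred):
    # Two-predictor models serve as the reference level (both dummies = 0).
    return {"one_predictor": int(n_pred == 1),
            "three_predictors": int(n_pred == 3)}

def code_fit_relation(generated, fitted):
    # Random-structure complexity order A < B < C matches alphabetical order,
    # so plain string comparison works; the true model (generated == fitted)
    # is the reference level with both dummies = 0.
    return {"underfit": int(fitted < generated),
            "overfit": int(fitted > generated)}

print(code_predictor_count(1))        # {'one_predictor': 1, 'three_predictors': 0}
print(code_fit_relation("A", "C"))    # {'underfit': 0, 'overfit': 1}
```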
We also performed logistic regression to evaluate the effects of our experimental factors on whether a fitted model was “selected” using AIC and BIC from the candidate models (i.e., the model had the lowest IC value). In each regression analysis, the outcome variable was a binary variable in which 1 represented selection using the particular IC. For the analysis, a model was first fit containing model match, number of predictors (as two dummy-coded variables), and generated model (as two dummy-coded variables) as predictors.
All LMEMs were fit using the
Model | β0 ( | β1 ( | β2 ( | β3 ( | β4 ( | β5 ( | β6 ( | β7 ( |
---|---|---|---|---|---|---|---|---|
A1 | 0.795 (0.073) | 0.798 (0.014) | - | - | - | - | - | - |
B1 | 0.788 (0.072) | 0.778 (0.068) | - | - | - | - | - | - |
C1 | 0.791 (0.072) | 0.805 (0.077) | - | - | - | - | - | - |
A2 | 0.802 (0.070) | 0.800 (0.011) | 0.802 (0.010) | 0.799 (0.020) | - | - | - | - |
B2 | 0.799 (0.073) | 0.783 (0.071) | 0.784 (0.072) | 0.806 (0.075) | - | - | - | - |
C2 | 0.791 (0.073) | 0.795 (0.076) | 0.808 (0.073) | 0.787 (0.077) | - | - | - | - |
A3 | 0.809 (0.069) | 0.799 (0.008) | 0.799 (0.007) | 0.800 (0.007) | 0.801 (0.014) | 0.800 (0.014) | 0.802 (0.015) | 0.796 (0.030) |
B3 | 0.790 (0.074) | 0.808 (0.071) | 0.798 (0.073) | 0.799 (0.071) | 0.780 (0.071) | 0.798 (0.073) | 0.806 (0.074) | 0.803 (0.076) |
C3 | 0.790 (0.073) | 0.795 (0.079) | 0.806 (0.074) | 0.786 (0.075) | 0.820 (0.074) | 0.813 (0.077) | 0.808 (0.072) | 0.817 (0.074) |
Generated Model | Fitted A | Fitted B | Fitted C |
---|---|---|---|
A1 | 0 | 0.002 | 0.008 |
B1 | 0 | 0 | 0.011 |
C1 | 0 | 0.002 | 0.001 |
A2 | 0 | 0.013 | 0.331 |
B2 | 0 | 0 | 0.016 |
C2 | 0 | 0 | 0.006 |
A3 | 0 | 0.023 | 0.954 |
B3 | 0 | 0.004 | 0.150 |
C3 | 0 | 0 | 0.045 |
Each pair of generated and fitted models can be classified as underfit, overfit, or true. The diagonal cells of the table, where the fitted random effect structure matches the generated one, correspond to fitting the “true model.” Non-convergence rates are relatively low in the true-model scenarios, despite a slight increase as additional predictors are added. Below the diagonal, which represents underfit models (i.e., the model was less complex than the data), non-convergence occurred only twice, so the non-convergence rate is effectively zero regardless of the number of predictors. Almost all cases of non-convergence lie above the diagonal, representing overfit models (i.e., the model was more complex than the data). Overfit models suffered from higher non-convergence rates than other model pairs with the same number of predictors, and also displayed a trend of dramatically increasing non-convergence rates with the inclusion of additional predictors. When displayed visually (
In
Logistic regression analysis reveals that both the number of predictors and the relationship between the generated and fitted models are independently significant predictors of model non-convergence. Models are more likely to result in non-convergence when overfit, as compared to the true model, by a factor of nearly 50 (
In addition to model convergence, another research concern of this paper is using information criteria (AIC and BIC) to select the best-fitting model. In each replication, four models differing only in complexity of the random-effects component were fit to a dataset, one of which corresponded to the “true” model. Theoretically, the true model should be the best fitting of the four candidate models.
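The selection rule itself is simply "lowest information criterion wins." A minimal sketch with made-up log-likelihoods and parameter counts (not values from the simulation) shows how AIC and BIC can disagree, because BIC penalizes extra parameters more heavily when n is large:

```python
from math import log

def aic(loglik, k):
    return 2 * k - 2 * loglik          # AIC = 2k - 2 log L

def bic(loglik, k, n):
    return k * log(n) - 2 * loglik     # BIC = k ln(n) - 2 log L

# Hypothetical (log-likelihood, parameter count) pairs for the four candidate fits
candidates = {
    "fixed-only":    (-450.0, 9),
    "A: intercepts": (-400.0, 11),
    "B: no corr.":   (-390.0, 19),
    "C: maximal":    (-388.0, 75),
}
n = 24 * 24  # observations per simulated dataset (24 subjects x 24 items)

best_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_bic = min(candidates, key=lambda m: bic(*candidates[m], n))
print(best_aic, "|", best_bic)  # AIC picks "B: no corr.", BIC the simpler "A: intercepts"
```

With these illustrative numbers, the maximal model's small log-likelihood gain cannot offset its 56 extra correlation parameters under either criterion, and BIC's ln(576) ≈ 6.4 per-parameter penalty pushes it one step simpler than AIC.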
Using AIC, the true model selection rate was generally high (> 85%) for random-intercept-only (A) and no-correlation (B) models, with the rates for models A3 and B3 reaching 100% and 99.9%, respectively. For maximal (C) models, the rate was lower, ranging from 57.3% to 75.3%. The overall true model selection rate for AIC was 85.8%. Using BIC, the true model selection rate was uniformly high (> 99%) for A and B models. However, the rate for C models was substantially lower, falling from 21% (C1) to 0% (C3). When the true model was not selected in the C model conditions (except in two simulation runs in the C1 condition), BIC selected the corresponding B model. The overall true model selection rate for BIC was 69.1%. Overall, then, AIC and BIC selected the true model in the majority of cases and otherwise tended to select the next-simplest model.
Logistic regression was also implemented to predict model selection using AIC. A logistic regression model was first fitted containing model match, number of predictors (as two dummy-coded variables), and generated model (as two dummy-coded variables) as predictors. The saturated model containing model match and number of predictors as parameters was then subjected to model simplification via stepwise likelihood-ratio tests (LRT), resulting in a final model with significant effects for Model Match, One-Predictor, and their interaction (
Variable | β ( | 95% CI (Lower) | Odds | 95% CI (Upper) |
---|---|---|---|---|
Model Match | 4.98 (0.06) | 130.54 | 146.16*** | 163.94 |
One-Predictor | 0.74 (0.06) | 1.87 | 2.09*** | 2.34 |
Model Match × One-Pred | -1.54 (0.08) | 0.18 | 0.21*** | 0.25 |
***
The final model includes model match and one-predictor and their interaction. Model Match is the primary variable that predicts AIC selection, with the true model being more likely to be selected by a factor of 146.16 (
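The odds ratios and confidence bounds in the table follow directly from the logistic coefficients: exp(β) gives the odds ratio and exp(β ± 1.96·SE) the Wald interval. Taking the reported Model Match coefficient and reading its parenthesized value as a standard error (an assumption) reproduces the tabled values up to rounding of β:

```python
from math import exp

beta, se = 4.98, 0.06   # Model Match row; parenthesized value assumed to be the SE

odds_ratio = exp(beta)
ci_low = exp(beta - 1.96 * se)
ci_high = exp(beta + 1.96 * se)
print(f"{odds_ratio:.1f} [{ci_low:.1f}, {ci_high:.1f}]")  # 145.5 [129.3, 163.6]
```

The small gap between exp(4.98) ≈ 145.5 and the tabled 146.16 reflects rounding of β to two decimals; the interval closely matches the reported [130.54, 163.94].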
The logistic regression results for BIC differed more starkly by generated model than by number of predictors. Therefore, model match and generated model were used as predictors. The saturated model was subjected to model simplification via stepwise LRT, resulting in a final model containing Model Match, C-Generated (a binary dummy variable indicating whether the data were generated from a maximal model), and the Model Match × C-Generated interaction (
Variable | β ( | 95% CI (Lower) | Odds | 95% CI (Upper) |
---|---|---|---|---|
Match | 13.82 (0.49) | 413688 | 1001297*** | 2826149 |
C-Generated | 7.22 (0.44) | 629.54 | 1364.12*** | 3560.81 |
B-Generated | 0.12 (0.49) | 0.430 | 1.13 | 3.00 |
Match × C-Gen | -16.14 (0.49) | 0.000 | 0.000*** | 0.000 |
***
The true model is extremely unlikely to be selected for C datasets (
The primary differences between our study and
The results showed that overall non-convergence rates increase with the addition of predictors. There was no clear pattern in non-convergence rates under the one-predictor conditions. However, when the simulation scenarios were expanded to include two and three predictors, two intertwined patterns of model non-convergence emerged. Furthermore, a significant effect of the number of predictors was found in the logistic regression analysis of the non-convergence data, with models containing fewer predictors being less likely to experience non-convergence.
Upon closer inspection of the non-convergence rates within models containing the same number of predictors, another pattern emerges: when a dataset is fit with its true model, non-convergence is very infrequent. In underfit models, there is essentially no non-convergence. In overfit models, however, non-convergence is considerably more common. This pattern of non-convergence is reflected in the logistic regression analysis by the significant effect of model match, showing that a true model is significantly less likely to suffer non-convergence.
The two patterns of non-convergence also form an interaction. There is no change in the non-convergence rate of underfit models as the number of predictors increases. For true models, there is a small increase in non-convergence with additional predictors. For overfit models, there is a more dramatic increase in non-convergence rates with increasing numbers of predictors. This significant interaction between model overfit and number of predictors is relevant to the recommendations of
Our results show that, overall, AIC selected the true model in 86% of simulated cases, while the success rate for BIC was 69%, indicating that AIC is more consistently accurate overall, with accuracy increasing as the number of predictors increased. However, AIC’s true model selection rate was lower in the maximal model conditions.
BIC’s true model selection rate displayed a different pattern; BIC was extremely accurate at selecting the true model for data generated from random-intercept-only models and no-correlation models, with rates for these conditions all exceeding 99%. Only in the maximal model conditions did BIC perform poorly.
The issue of non-convergence presents a major obstacle to implementing the advice of
In this study, AIC and BIC proved to be useful tools for selecting the optimal random-effects structure under different conditions. Specifically, AIC performs well when the random-effects structure of the fitted model is more complex, while BIC is preferable when the fitted model is relatively simple. These ICs are included in the default output of most statistical software for LMEMs, and are thus a realistic means of selecting the best-fitting model.
Crucially, when fitting LMEMs, studies suggest that a priori knowledge about the random-effects structure is important for gauging the potential risk of overfitting and non-convergence, even though the true random-effects structure is usually “unknown.” We suggest paying careful attention to the methodological literature on current LMEM best practices, substantive knowledge of the research topic, and information from visualization techniques (e.g., residual patterns) and model criticism, as these help in making a more confident decision concerning the appropriate random-effects structure.
There are limitations to the present study, which should be considered when weighing our recommendations. First, the generated simulation conditions may not reflect the wide array of scenarios in empirical studies, and thus researchers should interpret the results with caution and not overgeneralize the findings.
Second, we did not investigate type I error rate and power, so we cannot say whether Barr et al.’s findings of the maximal model’s superior performance on these metrics generalize to models containing more predictors. Furthermore, it is unclear whether our method for generating positive definite random-effects covariance matrices was adequate in approximating the maximal model. Finally, we only considered model selection using AIC and BIC in isolation, without considering other ICs or using multiple ICs simultaneously to select a model. Future research could rectify these shortcomings.
Hsiu-Ting Yu's work is partially supported by the Ministry of Science and Technology, Taiwan. [MOST-108-2401-H-004-100].
The authors have declared that no competing interests exist.
All authors contributed equally to the work.