Estimation Quality and Required Sample Sizes in Three-Level Contextual Analysis Models

In multilevel analysis, Level-1 predictors that also explain variance at a higher level are called contextual predictors. In the multilevel manifest covariate model, the Level-2 component is modeled as the average of the Level-1 predictor scores within a cluster. In the multilevel latent covariate model, the predictor is decomposed into two latent variables at Level-1 and Level-2. Performance conditions of these modeling approaches for three-level models are largely unexplored. We investigate the two approaches’ performance with respect to bias, coverage, and power in a three-level random intercept model. Results reveal differences in estimation quality and required sample sizes. We provide sampling recommendations for both approaches.

Within the context of contextual analysis models, it is important to distinguish between formative and reflective measurement processes (Lüdtke et al., 2008; cf. also Bollen & Lennox, 1991): Formative constructs are directly caused by the aggregate (e.g., the proportion of boys in a class). Hence, the number of observations is finite. For reflective constructs, lower-level observations are manifest realizations of the construct. Hence, the number of observations to measure a reflective construct is potentially infinite. The modeling approach needs to account for the nature of the higher-level construct, especially by correcting for sampling error in the case of reflective constructs.
While there is a strong body of research on contextual predictors in two-level analysis (e.g., Korendijk et al., 2011; Lüdtke et al., 2008, 2011; Marsh et al., 2009), contextual analysis models are frequently applied to three-level data (e.g., Chen & Cui, 2020; Rathmann et al., 2020). However, information on estimation quality in three-level contextual models is still sparse, especially with respect to reflective measurement processes.
In this study, we first describe the two common modeling approaches for contextual variables in three-level models. We then analyze and compare estimation quality for the two approaches by means of Monte Carlo simulations. Finally, we derive concise sampling recommendations.

The Multilevel Manifest Covariate (MMC) Model for Three-Level Data
In contextual analysis models, the most widely used approach to obtain higher-level predictor variables is to compute the average scores of all L1-units in an L2-subcluster or L3-cluster. Modeling averages at higher levels requires centering procedures at the lower levels. We follow the notation by Brincks et al. (2017) and differentiate between grand-mean centering (GMC) and centering-within-context (CWC). For a linear model with $k = 1, \dots, n_3$ L3-clusters, each with $j = 1, \dots, n_2$ L2-subclusters, each with $i = 1, \dots, n_1$ L1-units, outcome $Y_{ijk}$, L1-predictor $X_{ijk}$, L2-predictor $\bar{X}_{.jk}$ (i.e., the subcluster means), and L3-predictor $\bar{X}_{..k}$ (i.e., the cluster means), the MMC model can be formulated as:
$$Y_{ijk} = \gamma_{000} + \gamma_{100} \cdot (X_{ijk} - \bar{X}_{.jk}) + \gamma_{010} \cdot (\bar{X}_{.jk} - \bar{X}_{..k}) + \gamma_{001} \cdot (\bar{X}_{..k} - \bar{X}_{...}) + v_{00k} + u_{0jk} + e_{ijk} \tag{1}$$
On Level-1, $(X_{ijk} - \bar{X}_{.jk})$ is the CWC predictor, obtained by subtracting the subcluster mean from each L1-score. As a result, the regression coefficient $\gamma_{100}$ addresses only the L1-specific influence of the predictor on the outcome. $e_{ijk} \sim N(0, \sigma_e^2)$ is a random effect (prediction error) with variance component $\sigma_e^2$. On Level-2, $(\bar{X}_{.jk} - \bar{X}_{..k})$ is the CWC contextual predictor, obtained by subtracting the respective cluster mean from the subcluster means, with the L2-specific regression coefficient $\gamma_{010}$. $u_{0jk} \sim N(0, \sigma_{u_0}^2)$ is a random effect with variance $\sigma_{u_0}^2$. On Level-3, $(\bar{X}_{..k} - \bar{X}_{...})$ is the GMC contextual predictor, obtained by subtracting the grand mean from the cluster means, with the L3-specific regression coefficient $\gamma_{001}$, and $v_{00k}$ is the corresponding random effect. Note that $\bar{X}_{..k}$ refers to the average of the L1-scores within a cluster (not to the average of the subcluster averages, which yields different results for unbalanced samples).
Lastly, $\gamma_{000}$ is the intercept, and $\gamma_{100}$, $\gamma_{010}$, and $\gamma_{001}$ are the regression coefficients quantifying the level-specific influence of the respective predictor on the outcome (Brincks et al., 2017).
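The centering steps behind Equation 1 can be illustrated with a minimal sketch. It assumes the data arrive as (cluster, subcluster, score) tuples; the function name `mmc_predictors` and the toy data are hypothetical and not taken from the study's scripts. Cluster means are computed from the L1-scores directly, as noted above, so the result differs from averaging subcluster means when the design is unbalanced.

```python
from collections import defaultdict
from statistics import mean


def mmc_predictors(rows):
    """Return (L1, L2, L3) manifest predictor components for each row.

    rows: list of (cluster_id, subcluster_id, x) tuples.
    L1 = x - subcluster mean (CWC)
    L2 = subcluster mean - cluster mean (CWC)
    L3 = cluster mean - grand mean (GMC)
    """
    by_subcluster = defaultdict(list)
    by_cluster = defaultdict(list)
    for k, j, x in rows:
        by_subcluster[(k, j)].append(x)
        by_cluster[k].append(x)  # cluster mean uses all L1-scores in the cluster

    sub_mean = {key: mean(values) for key, values in by_subcluster.items()}
    cluster_mean = {key: mean(values) for key, values in by_cluster.items()}
    grand_mean = mean(x for _, _, x in rows)

    return [
        (x - sub_mean[(k, j)],
         sub_mean[(k, j)] - cluster_mean[k],
         cluster_mean[k] - grand_mean)
        for k, j, x in rows
    ]


# Hypothetical unbalanced toy data: cluster 1 has subclusters of size 2 and 1.
example_rows = [
    (1, 1, 1.0), (1, 1, 3.0), (1, 2, 5.0),
    (2, 1, 2.0), (2, 1, 4.0), (2, 2, 6.0), (2, 2, 8.0),
]
example_components = mmc_predictors(example_rows)
```

For each row, the three components add up to the grand-mean-centered score, which is a quick sanity check on the decomposition.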
This approach has been criticized as insufficient for reflective constructs since it assumes a finite population of L1-units. Sampling a finite set of interchangeable indicators for an unobservable construct (e.g., repeated measures of wellbeing in students) disregards unreliability due to sampling error and can result in considerable bias and underestimated standard errors (e.g., Grilli & Rampichini, 2011; Harker & Tymms, 2004). Furthermore, the assumption of perfect reliability in the MMC approach is also violated for formative constructs if the sampling rate, i.e., the rate of units sampled from the total number of units in a (sub-)cluster, is small (e.g., inhabitants in cities).

The Multilevel Latent Covariate (MLC) Model for Three-Level Data
The alternative approach is the multilevel latent covariate (MLC) approach. It treats the observations at L1 as manifest realizations of an underlying latent variable with variance at each level (Lüdtke et al., 2008). Extending the two-level notation, the observed predictor $X_{ijk}$ and outcome $Y_{ijk}$ with means $\mu_X$ and $\mu_Y$ are decomposed into level-specific latent components, which are related by level-specific regressions, for instance on Level-2:
$$U_{Yjk} = \beta_{L2} \cdot U_{Xjk} + \delta_{jk} \tag{5}$$
$V_{Xk}$, $U_{Xjk}$, and $R_{Xijk}$ are independent, latent representations of predictor $X$. $V_{Xk}$ has a mean $\mu_X$ and a variance expressing the L3-specific deviations from $\mu_X$ in the predictor. $U_{Xjk}$ and $R_{Xijk}$ each have mean zero and variances expressing the L2- and L1-specific deviations. Regression coefficients $\beta_{L1}$, $\beta_{L2}$, and $\beta_{L3}$ express the level-specific effects of the predictor. $\varepsilon_{ijk}$, $\delta_{jk}$, and $\tau_k$ are the residual and random intercepts, respectively. Integrating Equations 3 to 6 yields the combined equation (Equation 7).
The MLC approach accounts for sampling error by treating the measurements as potentially biased realizations of the latent construct and has therefore been shown to produce higher estimation quality for reflective constructs or formative constructs with a low (20%) sampling rate (Lüdtke et al., 2008, 2011).
In this study, we compare the approaches regarding their estimation quality in samples drawn from an infinite population (reflective process), since research fields that commonly employ multilevel modeling oftentimes investigate constructs with a conceptually infinite number of observations. Additionally, correctly specifying a reflective construct might pose a challenge for researchers due to high sampling requirements to obtain sound estimation results for latent variables in three-level models.
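As a reading aid, the latent decomposition can be sketched as follows. Only the Level-2 regression is stated explicitly in this section; the remaining lines are assumptions written by analogy with it and with the two-level notation of Lüdtke et al. (2008), and the intercept symbol $\beta_0$ is hypothetical.

```latex
\begin{align*}
X_{ijk} &= V_{Xk} + U_{Xjk} + R_{Xijk} \\
Y_{ijk} &= V_{Yk} + U_{Yjk} + R_{Yijk} \\
\text{Level-3 (assumed form):}\quad V_{Yk} &= \beta_0 + \beta_{L3} \cdot V_{Xk} + \tau_k \\
\text{Level-2:}\quad U_{Yjk} &= \beta_{L2} \cdot U_{Xjk} + \delta_{jk} \\
\text{Level-1 (assumed form):}\quad R_{Yijk} &= \beta_{L1} \cdot R_{Xijk} + \varepsilon_{ijk} \\
\text{Combined (assumed form):}\quad Y_{ijk} &= \beta_0 + \beta_{L3} \cdot V_{Xk} + \beta_{L2} \cdot U_{Xjk} + \beta_{L1} \cdot R_{Xijk} + \tau_k + \delta_{jk} + \varepsilon_{ijk}
\end{align*}
```

Under these assumptions, substituting the three level-specific regressions into the decomposition of $Y_{ijk}$ gives the combined line, which mirrors the structure of Equation 1 with latent instead of manifest predictor components.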

Evaluating Estimation Quality
The estimation quality of contextual predictors can be assessed in Monte Carlo simulations. Most commonly, estimation quality examinations are based on the point estimates and standard errors.

Parameter Estimation Bias (PEB)
For a true parameter $\theta$ and estimates $\hat{\theta}_1, \dots, \hat{\theta}_n$ in $n$ samples, the relative parameter estimation bias (rPEB) is defined as:
$$\text{rPEB} = \frac{1}{n} \sum_{i=1}^{n} \frac{\hat{\theta}_i - \theta}{\theta}$$
It is interpreted as the average bias rate. Since negative and positive bias values are averaged across samples, the rPEB expresses the direction of bias across samples, but not the average strength of bias, which is expressed using an absolute bias measure. In this study, we use the absolute PEB (aPEB):
$$\text{aPEB} = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{\theta}_i - \theta|}{|\theta|}$$
The aPEB measures the rate of absolute misestimation across samples as an alternative to common measures such as the root mean squared error (RMSE). Both the RMSE and aPEB capture mean bias and variability in estimates, since conditions producing a larger variance in estimates (given the same mean estimate) result in increased RMSE and aPEB values. We argue, however, that the aPEB is better suited for this study, since it is scaled relative to the true value $\theta$, making bias rates more easily comparable across parameter sizes. Additionally, the interpretation of the aPEB (average strength of bias) complements the interpretation of the relative bias (average rate and direction of bias) well.
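The following sketch shows how the three bias measures discussed here could be computed from a list of replicated estimates. It assumes that the rPEB averages the signed deviations divided by the true value and that the aPEB averages the absolute deviations divided by the absolute true value; the function names and the example numbers are illustrative only.

```python
from math import sqrt


def relative_bias(estimates, theta):
    """rPEB: mean signed deviation from theta, scaled by theta."""
    return sum((est - theta) / theta for est in estimates) / len(estimates)


def absolute_bias(estimates, theta):
    """aPEB: mean absolute deviation from theta, scaled by |theta|."""
    return sum(abs(est - theta) for est in estimates) / (len(estimates) * abs(theta))


def rmse(estimates, theta):
    """Root mean squared error, on the scale of the parameter itself."""
    return sqrt(sum((est - theta) ** 2 for est in estimates) / len(estimates))


# Hypothetical estimates scattered symmetrically around a true value of 0.2:
# the signed deviations cancel, the absolute deviations do not.
example_estimates = [0.1, 0.3, 0.2, 0.2]
example_rpeb = relative_bias(example_estimates, 0.2)
example_apeb = absolute_bias(example_estimates, 0.2)
example_rmse = rmse(example_estimates, 0.2)
```

In this toy case the relative bias is essentially zero while the absolute bias is 25% of the true value, which illustrates why the two measures are reported side by side.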

Statistical Power and Coverage
Standard errors are commonly evaluated by means of statistical power and coverage rates (e.g., Muthén & Muthén, 2002). For fixed effects, power of an estimate is the rate of statistically significant Wald-tests (Wald, 1943) across all analyzed samples. The coverage rate expresses the rate of samples with a 95%-confidence interval (CI) that includes $\theta$.
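A minimal sketch of how power and coverage could be tallied across replications is given below. It assumes a two-sided Wald z-test of each estimate against zero and a symmetric normal-theory CI; the function name and example values are hypothetical.

```python
from statistics import NormalDist


def power_and_coverage(estimates, std_errors, theta, alpha=0.05):
    """Return (power, coverage) across replications.

    power: share of replications whose Wald z-test rejects H0: parameter = 0.
    coverage: share of replications whose (1 - alpha) CI contains theta.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    n = len(estimates)
    significant = 0
    covered = 0
    for est, se in zip(estimates, std_errors):
        if abs(est / se) > z_crit:
            significant += 1
        if est - z_crit * se <= theta <= est + z_crit * se:
            covered += 1
    return significant / n, covered / n


# Hypothetical replications for a true effect of 0.5.
example_power, example_coverage = power_and_coverage(
    estimates=[0.5, 0.1, 0.45, 0.9],
    std_errors=[0.1, 0.1, 0.3, 0.1],
    theta=0.5,
)
```

Note that a biased estimator with small standard errors can show high power and low coverage at the same time, because the two rates use the same standard errors for different comparisons.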

Sample Size Recommendations
Sampling advice for three-level models is still sparse (see Kerkhoff & Nussbeck, 2019, for an overview), as is research on the estimation quality in three-level contextual models. Usami (2017) derived power formulas for regression coefficients in three-level contextual analysis models with manifest means. Comparisons between derived and observed power in simulations reveal that observed power may be biased due to unreliability of the mean values, and that increasing both L1 and L2 sample sizes reduces differences between derived and observed power. Regarding three-level models, research has shown that estimation quality is mostly determined by the number of clusters (L3-sample size) and the sample size at the level at which the coefficient of interest is measured (de Jong et al., 2010; Dong et al., 2018; Kerkhoff & Nussbeck, 2019, 2022; Lee & Hong, 2021; Li & Konstantopoulos, 2016). Regarding contextual predictors for reflective constructs in two-level models, Lüdtke et al. (2008, 2011) found that for the MLC approach, bias remains within 10% in most conditions with at least 50 clusters of cluster size 5, while the MMC approach is more heavily biased. Due to narrow CIs, the MMC approach suffers from low coverage rates. In contrast, the absolute bias is higher for the MLC approach due to high variance in estimates.

Aim of This Study
Since contextual analysis in three-level models is of increasing relevance, we investigate the estimation quality for both the MMC and MLC approach to derive answers to the following research questions:
1. How do sample sizes relate to bias, coverage, and power for each modeling approach, i.e., (a) what are influential sample characteristics and (b) what patterns emerge between sampling conditions and estimation quality indicators?
2. What are (a) minimum required sample sizes and (b) advantageous sampling strategies to achieve sound estimation quality for each approach?
We also included one effect size combination with negative regression weights to explore differences between positive and negative coefficients on a small scale (see Table 1, condition SSM−). We furthermore evaluated additional effect size combinations, sample sizes, and unbalanced designs, but respective results are only reported in the Supplementary Materials since they do not meaningfully impact inferences reported below. The calculations to obtain conditional variances and regression weights can be found in the R-script in the Supplementary Materials. For each generated sample, models were fitted according to Equation 1 and Equation 7. Figure 1 shows the conceptual models.
In total, we analyzed 1,728 conditions, each with 1,000 generated samples. Data generation and model estimation were done in Mplus 8 (Muthén & Muthén, 1998-2017) using maximum likelihood estimation with robust standard errors (MLR). Results were imported to R 4.0.5 (R Core Team, 2021) for subsequent analyses. To distinguish between conditions, we abbreviate each combination of sample sizes by $n_3$-size/$n_2$-size/$n_1$-size. For example, 100/5/5 encompasses samples with 100 clusters, each containing 5 subclusters, which in turn contain 5 L1-units. 200/2/• subsumes conditions with 200 clusters with 2 subclusters each and any number of L1-units.
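To make the condition labels concrete, the sketch below parses a label of the form clusters/subclusters/L1-units and derives the total number of observations for balanced designs. The helper names are hypothetical, and range labels such as 30/5-10/25-30 are deliberately not handled.

```python
def parse_condition(label):
    """Split a 'clusters/subclusters/L1-units' label into a tuple.

    The wildcard '•' (any size at that level) is returned as None.
    """
    parts = label.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three sizes, got {label!r}")
    return tuple(None if part == "•" else int(part) for part in parts)


def total_observations(label):
    """Total number of L1-units in a balanced design, or None with a wildcard."""
    n3, n2, n1 = parse_condition(label)
    if None in (n3, n2, n1):
        return None
    return n3 * n2 * n1


# 100 clusters x 5 subclusters x 5 L1-units
example_total = total_observations("100/5/5")
```

Such a helper also makes it straightforward to restrict analyses to conditions below a given total sample size.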

Figure 1
Conceptual Models
Note. $X_{L2}$ refers to the subcluster means, $X_{L3}$ refers to the cluster means. R, U, and V correspond to the latent predictor and outcome components as in Equations 4 to 6. Data levels are arranged vertically and separated by dashed lines, similarly to figures used in Lüdtke et al. (2008, 2011) and the Mplus manual (Muthén & Muthén, 1998-2017).
Following previous recommendations (…, 2004; Muthén & Muthén, 2002), we consider |rPEB| < 0.10, power ≥ 0.8, and 0.91 ≤ coverage ≤ 0.98 to indicate sufficient estimation quality. We further computed analyses of variance (ANOVA) to evaluate how the simulation conditions influence estimation quality, using bias, coverage, or power as outcomes and effect sizes as well as sampling conditions as factors. Due to the large sample size, we only report partial effect sizes $\eta^2$. To keep analyses concise, we primarily focus on samples with a maximum of 10,000 observations (see the Supplementary Materials for full results).
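The three quality thresholds can be applied per condition as in the following sketch. The cut-offs are the ones stated in the text; the function name and the returned dictionary layout are illustrative choices.

```python
def sufficient_quality(rpeb, power, coverage):
    """Flag which of the three estimation quality criteria a condition meets."""
    return {
        "bias": abs(rpeb) < 0.10,              # |rPEB| < 0.10
        "power": power >= 0.8,                 # power of at least 80%
        "coverage": 0.91 <= coverage <= 0.98,  # 95%-CI coverage within range
    }


# Hypothetical condition: unbiased with good coverage, but underpowered.
example_flags = sufficient_quality(rpeb=0.04, power=0.62, coverage=0.94)
```

Keeping the three flags separate, rather than collapsing them into one verdict, matches the way required sample sizes are reported per criterion.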

Results
Three conditions in the MMC approach and 13 conditions in the MLC approach showed convergence issues for at least one sample. Convergence rates did not drop below 99.7% (as observed in 15/5/5 with effect sizes SLS in the MLC approach). Table 2 lists median estimates and standard errors, averaged across values for conditions with up to 10,000 observations. Due to overall high estimation quality of the L1-effect, only results for the higher-level effects are comprehensively reported. On L1, all conditions are unbiased, and power and coverage rates are insufficient only in the smallest sample sizes, e.g., 15/5/10 (see the Supplementary Materials for full results). With respect to research question (1a), ANOVA results in Table 3 show that the number of clusters is the most relevant sampling factor for estimation quality in the MLC approach. Similarly, in the MMC approach, the number of clusters is the most important factor to achieve estimation quality, except for L2 relative bias and coverage. Notably, differences in L2 relative bias are almost exclusively determined by the number of sampled L1-units per subcluster.

Estimation Bias
To answer research question (1b) for the relative and absolute bias, Figure 2 plots the mean regression estimates, with grey areas indicating |rPEB| ≤ 0.10, and Figure 3 plots the aPEB. While regression estimates are rather unaffected by different sample sizes in the MLC approach, estimates in the MMC approach tend to be biased for conditions with fewer L1-units. Moreover, the MMC approach more heavily overestimates the small negative L2-effect than the small positive L2-effect. Whereas the relative bias is higher in the MMC approach, the absolute bias tends to be higher in the MLC approach.
Note. β-size = size of the population coefficient, rPEB = relative bias, aPEB = absolute bias. $n_3$, $n_2$, $n_1$ indicate the number of clusters, subclusters per cluster, and Level-1 units per subcluster. NOBS = total number of observations. MMC = multilevel manifest covariate, MLC = multilevel latent covariate. Due to heterogeneity of variances, values are likely to have positive bias.

Power and Coverage Rates
To answer research question (1b) for coverage and power rates, Figure 4 plots power and Figure 5 plots coverage rates. Figure 4 indicates that power tends to be higher in the MMC approach. Notably, due to overestimation, statistical power is higher for the small positive effects than for small negative effects on L2 in the MMC approach. Strikingly, coverage rates for (negative) small and large effects on L2 in the MMC approach decrease drastically as $n_2$ and $n_3$ increase. This is due to narrower CIs around consistently biased estimates (cf. Table 2).

Figure 2
Mean Estimates According to Sampling Condition
Note. Plots show mean estimates for each Level-3 (upper plot) and Level-2 (lower plot) regression coefficient, in sample sizes with up to 10,000 observations. Shaded areas indicate relative unbiasedness. Plots are grouped according to $n_3$, each x-axis is sorted according to $n_1$ within $n_2$, but only the first condition per $n_2$ is labelled on the x-axis to visually differentiate between $n_2$-sizes, i.e., for $n_3 = 15$ (leftmost plot), 10/5 indicates $n_2 = 10$ with $n_1 = 5$, which is followed by 10/10, 10/15 etc. MLC = multilevel latent covariate, MMC = multilevel manifest covariate.

Required Sample Sizes
To answer the research questions (2a) and (2b), Tables 4 and 5 show quartiles of absolute bias and required sample sizes for relative unbiasedness, sufficient coverage, and sufficient power for the MMC approach (Table 4) and MLC approach (Table 5).

Level-2 MMC Estimates
Small positive effects are unbiased (rPEB) for $n_1 \geq 25$ with $n_2 \geq 10$. Medium and large effects are unbiased for most samples with $n_1 \geq 20$. Conditions with large $n_1$, such as 30/5-10/25-30, ensure sufficient coverage irrespective of effect size. Power for small effects is achieved in large samples, such as 100/25/•. For large effects, power is sufficient (80% or higher) in most conditions. For medium effects, power is sufficient for most conditions with $n_3 \geq 100$ or $n_2 \geq 20$. The average absolute bias for small effects exceeds 50% even for larger samples.

Figure 3
Absolute Bias of Estimates
Note. Plots show absolute bias for each Level-3 (upper plot) and Level-2 (lower plot) regression coefficient, in sample sizes with up to 10,000 observations. Plots are grouped according to $n_3$, each x-axis is sorted according to $n_1$ within $n_2$, but only the first condition per $n_2$ is marked on the x-axis. MLC = multilevel latent covariate, MMC = multilevel manifest covariate.

Level-3 MMC Estimates
Relative unbiasedness for small effects requires $n_2 \geq 30$. Medium-sized effects are unbiased in most conditions with $n_2 \geq 25$ or both $n_1$ and $n_2 \geq 10$. Large effects are unbiased for most conditions with $n_2 \geq 10$. Coverage rates are mostly sufficient when sampling at least 50 clusters. Statistical power is only sufficient for large effects in conditions with at least 100 clusters. Average absolute bias is high even for large effects.

Level-2 MLC Estimates
Medium and large effects are generally unbiased (rPEB). Small effects have sufficiently low relative bias in most conditions with $n_3 \geq 30$. Coverage rates are sufficient in most conditions. Power of small effects is only sufficient in extremely large samples, such as 200/20/•. For medium-sized effects, power is sufficient in most conditions with at least 5,000 observations. Large effects mostly have sufficient power even in small samples. Average absolute bias exceeds 50% for small effects even in larger samples.

Figure 4
Power Rates of Estimates
Note. Plots show power rates for each Level-3 (upper plot) and Level-2 (lower plot) regression coefficient, in sample sizes with up to 10,000 observations. Shaded areas indicate ranges for sufficient power. Plots are grouped according to $n_3$, each x-axis is sorted according to $n_1$ within $n_2$, but only the first condition per $n_2$ is marked on the x-axis. MLC = multilevel latent covariate, MMC = multilevel manifest covariate.

Level-3 MLC Estimates
Large and medium-sized effects have sufficiently low relative bias in most conditions. For small effects, most conditions with $n_3 = 50$ in combination with $n_2 \geq 10$, or $n_3 = 30$ in combination with $n_2 \geq 15$ are unbiased. Samples with at least 50 clusters have sufficient coverage. Power is only sufficient for large effects in samples with $n_3 \geq 150$, or $n_3 = 100$ with $n_2 \geq 10$. Average absolute bias is high for all effect sizes and exceeds 50% even for large effects.

Figure 5
Coverage Rates of Estimates
Note. Plots show coverage rates for each Level-3 (upper plot) and Level-2 (lower plot) regression coefficient, in sample sizes with up to 10,000 observations. Shaded areas indicate ranges for sufficient coverage. Due to very similar values, all lines in the upper plot and lines resulting from the MLC approach in the lower plot are encompassed by a dark ribbon. Plots are grouped according to $n_3$, each x-axis is sorted according to $n_1$ within $n_2$, but only the first condition per $n_2$ is marked on the x-axis. MLC = multilevel latent covariate, MMC = multilevel manifest covariate.

Discussion
Note (Tables 4 and 5). β-size = size of the population coefficient, rPEB = relative bias, aPEB = absolute bias with lower quartile Q1 and upper quartile Q3, $n_3$ = number of clusters, $n_2$ = subclusters per cluster, $n_1$ = Level-1-units per subcluster.
In this study, we investigated the estimation quality of the MMC and MLC approaches in three-level models in order to (1) evaluate how bias, coverage and power rates relate to the sample size at each data level and (2) derive advantageous sampling strategies to achieve sound estimation quality (see Tables 4 and 5). Overall, sampling 100/10/• or 150 clusters ensures sound estimation quality for large effects and, additionally, for medium-sized effects on L2 in the MLC approach. For the MMC approach, sampling 100/•/20 ensures sound estimation quality for large L3-effects and medium-sized L2-effects. Our results extend knowledge on required sample sizes in multilevel modeling and may help researchers make informed sampling decisions. Most notably, for the MMC approach, tendencies to over- or underestimate effects (relative bias) depend on the (sub-)cluster sizes. Since for this approach, estimation bias does not generally improve as the overall sample size increases, but standard errors become smaller, coverage rates deteriorate as the samples get larger. It is therefore highly important to sample a sufficiently high number of lower-level units to avoid biased estimates at the higher levels. In contrast, the MLC approach has higher absolute bias, indicating higher variance in estimates, and slightly lower statistical power, but estimation quality for MLC estimates can be reliably improved by sampling more clusters.
Naturally, our recommendations are based on specified thresholds indicating sufficient estimation quality (esp. Muthén & Muthén, 2002), and might therefore differ if stricter or less strict thresholds are used. For example, Burton et al. (2006) recommend basing coverage rate thresholds on the number of simulation replications. For our analyses, this translates to an admissible coverage range of 93.5% to 96.4%, such that additional conditions (mainly $n_3$ = 15, 30, 50) result in insufficiently low coverage.
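A replication-based coverage range of this kind can be sketched as the nominal rate plus or minus a multiple of its Monte Carlo standard error. The multiplier of 2 used below is an assumption about how the criterion is applied, and the function name is hypothetical.

```python
from math import sqrt


def coverage_bounds(nominal=0.95, replications=1000, multiplier=2.0):
    """Admissible coverage range around the nominal rate.

    Uses the binomial standard error sqrt(p * (1 - p) / B) for nominal
    rate p and B replications; the multiplier is an assumed setting.
    """
    std_error = sqrt(nominal * (1 - nominal) / replications)
    return nominal - multiplier * std_error, nominal + multiplier * std_error


# 1,000 replications per condition, as in this study.
example_lower, example_upper = coverage_bounds()
```

With 1,000 replications and a nominal rate of 95%, this gives roughly 93.6% to 96.4%, close to the range quoted above; the small difference at the lower bound may stem from rounding or from the exact multiplier used.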

Limitations and Future Prospects
Most importantly, our analyses are limited by the simulation conditions. For example, in some research contexts, only two L2 subclusters per L3 cluster might be available. Such samples limit admissible model complexity, but contextual effects might still be reliably estimated. Additional analyses (see the Supplementary Materials) to explore estimation quality for such samples show that, in comparison to conditions with $n_2 = 5$, rPEB is at least twice as high, except for nearly unchanged rPEB values for L2 estimates in the MMC approach. Similarly, $n_2 = 2$ results in at least 30% less power than $n_2 = 5$, except for power on L3 in the MMC approach, which has only about 5% less power.
Moreover, previous studies argue that the variance distribution of the predictor variable across levels influences estimation quality (Lüdtke et al., 2008, 2011; Zitzmann et al., 2015). For reflective constructs measured by the MMC in particular, reliability of the predictor in two-level models is a function of the predictor variance at the respective level (predictor ICC) and the group size (cf. Lüdtke et al., 2008). Our results confirm the importance of the subcluster size for unbiasedness of the MMC approach on L2 and demonstrate the role of the cluster size ($n_2 \times n_1$ and $n_2$) for estimation quality on L3. However, to focus on the interplay between sample size, effect size, and analysis approach, we kept the variance fractions constant with 20% of variance on each higher level. To illustrate how the distribution influences results, we ran additional simulations with 60% of variance at either higher level (see the Supplementary Materials for full results). Results show that coverage rates and power are not meaningfully affected, while absolute bias decreases at the level with the higher variance. For the relative bias, differences between approaches are considerable: While for the MLC approach, differences are marginal, we find that for the MMC approach, the relative bias is consistently smaller at the level with the high variance proportion. These additional analyses suggest that the variance distribution of the predictor needs to be considered in future research to develop more specific sampling recommendations for three-level contextual models.
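The dependence of the manifest mean's reliability on the predictor ICC and the group size can be illustrated for the two-level case with the Spearman-Brown-type formula n · ICC / (1 + (n − 1) · ICC). The sketch below is a two-level illustration only; it is not a result of this study, and the function name is hypothetical.

```python
def aggregate_reliability(icc, group_size):
    """Reliability of an observed group mean in a two-level design.

    icc: share of predictor variance located at the group level.
    group_size: number of sampled L1-units per group.
    """
    return group_size * icc / (1 + (group_size - 1) * icc)


# With 20% of the predictor variance at the higher level, larger groups
# yield a more reliable manifest mean.
example_small = aggregate_reliability(icc=0.2, group_size=5)
example_large = aggregate_reliability(icc=0.2, group_size=30)
```

The same logic suggests why a larger variance share at a given level goes along with less bias in the manifest approach: for a fixed group size, a higher ICC raises the reliability of the aggregated mean.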
Lastly, we limited our analyses to samples generated from an assumed infinite population (reflective process). For two-level models, Lüdtke et al. (2008) showed that for finite samples (formative process), the MMC approach results in smaller bias than the MLC approach with a sampling rate of 50% or higher. We hence consider it promising to extend our research to three-level contextual analysis for finite populations, especially since for three-level data, the sampling rate at both L1 and L2 must be considered.
In conclusion, our results suggest that the MLC approach tends to be advantageous for research where the number of sampled clusters can be more easily increased than the (sub-)cluster sizes. The MMC approach, however, has the advantage of higher power and lower absolute bias (i.e., lower variability in estimates), especially for samples with fewer than 50 clusters. Thus, the MMC approach might be preferable for research where (sub-)cluster sizes can be readily increased for a limited number of clusters.

Funding:
The authors have no funding to report.