In the behavioral research and related fields, researchers increasingly employ multilevel modeling on three-levels to analyze clustered data. These data structures consist of measurement objects at Level-1, which are nested within higher-order Level-2 subclusters, which are nested in Level-3 clusters, such as patients within therapists within clinics (e.g., Firth et al., 2019).
Common estimates of the influence of the clustering are the intraclass correlation coefficients (ICCs), which depict the fraction of variance in the criterion located at Level-2 (ICC2) or Level-3 (ICC3). They thus indicate the extent to which the criterion is shaped by superordinate clusters, which may reveal, e.g., at which level(s) predictors or random effects should be included to reduce unexplained variance, or they inform the sampling procedure, ensuring high estimation quality in cluster-randomized experiments (Hedges & Hedberg, 2013). ICCs further indicate the degree to which standard errors are biased in standard regression analysis (two-level models: e.g., Kreft & de Leeuw, 1988; three-level models: e.g., Cunningham & Johnson, 2016). Therefore, ensuring that a nested structure is accurately characterized is a central issue, and reporting ICCs is a standard analysis step in multilevel modeling (see Hox et al., 2017).
However, research has shown that unbiased estimation and sufficient power in multilevel models depend on an interplay of adequate sample sizes on each level, the effect sizes, and the levels at which the predictors are modelled (more recently Cox & Kelcey, 2019; Kerkhoff & Nussbeck, 2019; LaHuis et al., 2020). To our knowledge, no study has systematically investigated required sample sizes to ensure sufficient estimation quality of the ICCs in three-level models. In this study, we shed light on this relationship and derive sampling recommendations for a wide range of ICC sizes. We limit our analyses to models with continuous responses, since the computation of ICCs in multilevel models with discrete and continuous responses differ (Goldstein et al., 2002; Leckie et al., 2020), resulting in limited comparability of estimation results and interpretation of the ICC sizes.
Variance Decomposition in the Linear Three-Level Model
The linear three-level model with Level-1 units i nested within Level-2 subclusters j nested within Level-3 clusters k decomposes the values of the criterion as follows:
1
The coefficient is the grand mean of Y, ~N(0, ) is a random variable reflecting the cluster-specific deviations from the mean, ~N(0, ) is the random variable of subcluster-specific deviations from the respective cluster mean, and ~N(0, ) is the random variable of deviations of the Level-1 unit values from the respective subcluster mean. The total variance of Y is given by . The ICCs express the fraction of variance at the respective level in relation to the total variance:
2
3
For the ICC2, an alternative approach is to include both higher-level variances in the enumerator ( ). The value then expresses how strongly any two Level-1 units from the same subcluster correlate (Hox et al., 2017). Typical ICC values are most comprehensively reported for educational research, ranging from 0.1 to 0.3 (Dong et al., 2016; Hedges & Hedberg, 2013), and research has demonstrated that even small ICCs can result in meaningful bias if the clustering is ignored (Lai & Kwok, 2015).
Assessing Estimation Quality of the Variance Components
When using Monte Carlo (MC) simulations to investigate estimation quality in multilevel models across many generated samples, the most common measures of estimation quality are the parameter estimation bias (PEB) to evaluate the accuracy of point estimates, and measures of significance to evaluate statistical power.
Parameter Estimation Bias (PEB)
The PEB reflects the percentage of over- or underestimation of a population parameter. We refer to this commonly reported PEB (e.g., Muthén & Muthén, 2002) as relative PEB. For a population parameter , estimated by in i = 1, …, n replications, it is defined as
4
We argue that the relative PEB alone is not sufficient for evaluating bias: By design, under- and overestimation balance out when computing the relative bias across replications. This can result in a relative PEB close to zero, although each replication might be either over- or underestimated. The magnitude of bias for a single sample can more precisely be approximated by the absolute PEB, given by:
5
For the absolute PEB, only the strength of the bias is averaged across replications, and the direction of bias for each sample is ignored. Together, the relative and absolute PEB reflect the expected strength and overall direction of bias.
Statistical Power
Statistical power of a parameter in the context of simulation studies is the rate of replications with a statistically significant point estimate. However, the commonly used Wald-test (Wald, 1943) is unreliable for variance components (Berkhof & Snijders, 2001; Molenberghs & Verbeke, 2004; Raudenbush & Bryk, 2002).
An alternative procedure to test statistical significance of parameters in multilevel models is the χ2-test-based comparison between the full model and a nested model where the parameter of interest is constrained to be zero (Hox et al., 2017). Based on the scaled χ2-test statistic (Satorra & Bentler, 1994), the modified test statistic proposed by Satorra and Bentler (2010, henceforth: SB-test) is computed using the unscaled maximum likelihood χ2-values of the nested ( ) and full ( ) models, the respective scaled χ2-test statistics of and , with degrees of freedom and , and scaling correction factors and :
6
The score is compared to a χ2-distribution with degrees of freedom. As an alternative, the likelihood-ratio test (LRT) follows the same procedural logic but uses the log-likelihood to compute the test statistic (see Herzog et al., 2007, for an overview). To assess the statistical significance of the Level-2 (Level-3) variance, the empty model (Eq. 1) is compared to a nested model where the Level-2 (Level-3) variance is constrained to be zero (Greven et al., 2008) using a one-sided test, since significantly negative variance values are inadmissible (Berkhof & Snijders, 2001). However, this approach has drawbacks. Particularly, both the LRT and SB-test can result in inadmissible, negative test-values, requiring a more complex approach using an auxiliary model (Bryant & Satorra, 2012; Satorra & Bentler, 2010). Further, the distribution of the test statistic has been shown to more closely follow a χ2-mixture distribution, but nevertheless, research suggests that χ2-based tests for the variances still produce sufficiently accurate results (Dominicus et al., 2006; LaHuis & Ferguson, 2009).
Sampling Recommendations
Despite its importance as a measure of cluster influences, there are no comprehensive recommendations for unbiased estimation of the ICC values, and only few studies report on the estimation quality of the variance components. In Kerkhoff and Nussbeck (2019), a small Level-3 intercept variance is heavily overestimated in samples with up to 55 clusters with 5 or 15 units per cluster. For ten or less sampled Level-3 clusters, McNeish and Wentzel (2017) found that the Level-2 variance is underestimated by up to 5% and the Level-3 variance is underestimated by 30% or more. Although these studies do not provide information regarding accurate estimation of the ICCs, they demonstrate that the estimation of variances is not trivial in three-level models, and that substantial bias can occur if the sampling process is not optimized.
Aim of This Study
By means of extensive Monte Carlo simulations, we evaluate how estimation bias and statistical power relate to the overall sample size, the allocation of units on each level, and the ICC sizes. We are particularly interested in minimum required samples sizes and advantageous sampling-strategies for overall sound estimation.
Method
Simulation Setup
For the simulations, we used the empty three-level model (Eq. 1), with and ~N(0, ). In total, we generated 1,125 simulation conditions: 5 (n3 = 5, 10, 50, 100, or 200) × 5 (n2 = 2, 5, 10, 20, or 30) × 5 (n1 = 2, 5, 10, 20, or 30) × 3 ( = 0.1, 0.2, or 0.6) × 3 ( = 0.1, 0.2, or 0.6) × 1 ( = 1). The combinations of variances yielded nine different combinations of ICC3 and ICC2, calculated as in Eq. 2 and 3 (see Table 1). We chose ICC values and sample sizes based on empirical findings in the social and behavioral research (e.g., Dong et al., 2016; Kerkhoff & Nussbeck, 2019).
Table 1
ICC3/ICC2 | Notation | ||
---|---|---|---|
0.1 | 0.1 | .083/.083 | S/S |
0.1 | 0.2 | .077/.154 | S/M |
0.1 | 0.6 | .059/.353 | S/L |
0.2 | 0.1 | .154/.077 | M/S |
0.2 | 0.2 | .143/.143 | M/M |
0.2 | 0.6 | .111/.333 | M/L |
0.6 | 0.1 | .353/.059 | L/S |
0.6 | 0.2 | .333/.111 | L/M |
0.6 | 0.6 | .273/.273 | L/L |
Note. = Level-3 variance, = Level-2 variance, Level-1 residual variance = 1 in all conditions; ICC = Intraclass correlation coefficients for Level-3 (ICC3) and Level-2 (ICC2).
For every condition, we generated 1,000 samples. For each sample, we fitted the empty model using robust maximum likelihood estimation with the expectation maximization algorithm and 500 admissible iterations. Data generation and model estimation was done in Mplus Version 8 (Muthén & Muthén, 1998-2017), and results were imported to R 6.3.1 (R Core Team, 2019) using the MplusAutomation package (Hallquist & Wiley, 2018) for subsequent analyses.
We refer to the total number of observations in a condition (n3 × n2 × n1) as NOBS. We further refer to the specific combination of n3, n2, and n1 in a condition as “allocation” and abbreviate the allocation by n3-size/n2-size/n1-size. For example, 100/20/5 subsumes samples with 100 clusters, each with 20 subclusters, which in turn contain 5 Level-1 units. 100/5/• comprises all conditions with 100 clusters, each with 5 subclusters and any number of Level-1 units. ICC sizes are summarized where appropriate by L = large (ICC = .273, .333, or .353), M = medium (ICC = .111, .143, or .154), and S = small (ICC = .059, .077, or .083). ICC sizes of a condition are abbreviated by ICC3-size/ICC2-size (see Table 1).
Outcome Measures
We report convergence rates, but computed coefficients only across runs that converged normally. To explore the influence of sample sizes on bias (rPEB and aPEB for ICC2 and ICC3), we ran four analyses of variance, and report the partial omega-squared ( ) effect size. We further computed the median estimate, the upper and lower quartiles, and the relative and absolute PEB of the ICCs and the Level-1 variance. We considered -0.10 < rPEB < 0.10 as sufficiently unbiased (see Flora & Curran, 2004; Muthén & Muthén, 2002). There are no established thresholds to indicate absolute unbiasedness, so we considered aPEB < 0.15 to be unbiased, since we argue that in practice, less than 15% over- or underestimation of an ICC does not considerably change statistical inferences. We further calculated the rate of biased runs, which is the percentage of replications in a condition with at least 15% over- or underestimation, as an indicator for the risk of producing a biased estimate.
For the Level-2 and Level-3 variance components, we assessed statistical power by the rate of significant one-sided SB-tests as in Eq. 61. We considered a power of 80% or higher as sufficient (see Cohen, 1992; Muthén & Muthén, 2002).
Results
First, since Level-1 residuals were estimated accurately in most conditions (relative unbiasedness in NOBS > 50, absolute unbiasedness in NOBS > 100), we do not provide detailed information about the estimation quality on Level-1. Complete results are tabulated in the supplementary dataset.
Out of the 1,125 conditions, 299 conditions (all n3 ≤ 10, 26.58% of all conditions) produced at least one replication that did not converge normally. Moreover, convergence problems occurred more frequently for conditions with small ICC3 (see also Table 2).
Table 2
n3/n2/n1 | ICC3/ICC2 | ||
---|---|---|---|
5/2/2 | All | ||
5/2/5 | All | ||
5/2/10 | S/• | M/• | L/S |
5/2/20 | S/• | M/• | L/L |
5/2/30 | S/• | M/M | M/L |
5/5/2 | S/• | M/• | |
5/5/5 | S/• | M/M | M/L |
5/5/10 | S/• | M/L | |
5/5/20 | S/M | S/L | M/L |
5/5/30 | S/M | S/L | M/L |
5/10/2 | S/• | M/M | M/L |
5/10/5 | S/M | S/L | |
5/10/10 | S/L | M/L | |
5/10/20 | S/L | M/L | |
5/10/30 | S/L | ||
5/20/2 | S/L | ||
10/2/2 | S/• | M/S | |
10/2/5 | S/S | M/S |
Note. ICC = Intraclass correlation coefficients for Level-3 (ICC3) and Level-2 (ICC2). n1, n2, n3 indicate the number of clusters (n3), subclusters per cluster (n2), and Level-1 units per subcluster (n1). S, M, L indicate small, medium, and large ICC sizes, respectively.
Results Across Allocations
Median estimates, quartiles of ICC estimates, PEB and power across all sampling conditions are presented in Table 3. Results show that the ICC3 tended to be underestimated, while the ICC2 tended to be overestimated. Analyses of variance (Table 4) show that on Level-3, the number of clusters was most influential on both, relative and absolute PEB. On Level-2, relative bias was mostly influenced by the total sample size, while absolute bias was mostly influenced by the number of clusters.
Table 3
ICC | Level-3
|
Level-2
|
||||||||
---|---|---|---|---|---|---|---|---|---|---|
ICC3 estimate
|
rPEBICC3 |
aPEBICC3 |
Power of
|
ICC2 estimate
|
rPEBICC2 |
aPEBICC2 |
Power of
|
|||
Mdn | [Q1; Q3] | Mdn | Mdn | Mdn | Mdn | [Q1; Q3] | Mdn | Mdn | Mdn | |
.059 | .058 | [.053; .059] | -.016 | .420 | .865 | .060 | [.059; .065] | .021 | .201 | 1.000 |
.077 | .075 | [.067; .076] | -.024 | .275 | .992 | .078 | [.077; .080] | .013 | .166 | 1.000 |
.083 | .081 | [.072; .082] | -.026 | .240 | .982 | .084 | [.084; .085] | .008 | .163 | 1.000 |
.111 | .109 | [.097; .110] | -.020 | .273 | 1.000 | .113 | [.112; .118] | .016 | .155 | 1.000 |
.143 | .139 | [.123; .141] | -.028 | .183 | 1.000 | .144 | [.143; .146] | .007 | .125 | 1.000 |
.154 | .150 | [.131; .152] | -.026 | .163 | 1.000 | .154 | [.154; .155] | .004 | .120 | 1.000 |
.273 | .266 | [.233; .270] | -.026 | .145 | 1.000 | .275 | [.274; .285] | .009 | .107 | 1.000 |
.333 | .325 | [.288; .330] | -.024 | .117 | 1.000 | .334 | [.333; .336] | .002 | .081 | 1.000 |
.353 | .345 | [.308; .349] | -.023 | .111 | 1.000 | .353 | [.350; .354] | .001 | .073 | 1.000 |
Note. rPEBθ = relative PEB of θ; aPEBθ = absolute PEB of θ; Mdn = median; Q1 = lower quartile; Q3 = upper quartile; ICC = Intraclass correlation coefficients for Level-3 (ICC3) and Level-2 (ICC2); = Level-3 variance; = Level-2 variance.
Table 4
Factor | df | Level-3
|
Level-2
|
||
---|---|---|---|---|---|
rPEBICC3 |
aPEBICC3 |
rPEBICC2 |
aPEBICC2 |
||
Population | 2 | .043 | .427 | .072 | .042 |
Population | 2 | .033 | .198 | .087 | .363 |
n3 | 4 | .462 | .814 | .074 | .651 |
n2 | 4 | .074 | .508 | -.002 | .565 |
n1 | 4 | .002 | .106 | .093 | .526 |
NOBS | 34 | .098 | .177 | .142 | .456 |
Note. rPEBθ = relative PEB of θ; aPEBθ = absolute PEB of θ; = partial omega squared; ICC = Intraclass correlation coefficients for Level-3 (ICC3) and Level-2 (ICC2); = Level-3 variance; = Level-2 variance. n1, n2, n3 indicate the number of clusters (n3), subclusters per cluster (n2), and Level-1 units per subcluster (n1). NOBS = total number of observations. Residual df = 1074.
In total, 384 conditions (34.13%) resulted in sufficient power and relative and absolute unbiasedness on all levels. These conditions can be identified in Figure 1 as squares with gray areas, where the shades of the triangles indicate the rate of biased runs for the ICC3 (upper triangle) and ICC2 (lower triangle). For example, in 200/2/2, only the condition with large ICCs was estimated with sufficient power and without bias, yet both ICCs were still biased in more than 30% of replications in this condition. Conditions with 5 or 10 clusters did not result in accurate estimation across levels. Overall, conditions with at least 200/20/2 or 100/20/5 resulted in sufficient estimation quality across levels.
Figure 1
Estimation of Variance Components
In Figure 2, we present the median, upper and lower quartiles for (upper plot) and (lower plot) across allocations. We do not provide information for conditions with more than 10,000 observations because estimation quality did not further improve for larger samples. Since statistical power of all variance sizes was sufficient for a variety of conditions, we do not plot power results, but report requirements for sufficient power in conjunction with Table 5. Comprehensive estimation results are tabulated in columns 13 to 24 in the supplementary dataset.
Figure 2
Table 5
n3 | Level-3
|
Level-2
|
||||||
---|---|---|---|---|---|---|---|---|
ICC3 | rPEBICC3 | aPEBICC3 | Power of | ICC2 | rPEBICC2 | aPEBICC2 | Power of | |
5 | S | nonea | none | 20/10a | S | 5/5a | nonea | n2 × n1 ≥ 100 |
M | nonea | none | n2 × n1 ≥ 60, if n2 ≥ 10 | M | n2 × n1 ≥ 20a | nonea | n2 × n1 ≥ 40, if n1 ≥ 5 | |
L | none | none | n2 × n1 ≥ 20, if n2 ≥ 5 | L | n2 ≥ 5 | 20/5a or 10/20a | n2 × n1 ≥ 20 | |
10 | S | nonea | none | n2 × n1 ≥ 40, if n2 ≥ 10 | S | n2 × n1 ≥ 20, except 10/2 | n2 × n1 ≥ 300, except 10/30 | n2 × n1 ≥ 40, if n1 ≥ 5 |
M | nonea | none | n2 × n1 ≥ 10 | M | anya | n2 × n1 ≥ 150, if n2 ≥ 10a | n2 × n1 ≥ 25, except 20/2 | |
L | none | none | n2 × n1 ≥ 10, except 2/5 | L | n2 × n1 ≥ 10, except 2/5 | n2 × n1 ≥ 40, if n2 ≥ 10 | n2 × n1 ≥ 10 | |
50 | S | anya | none | n2 ≥ 5 | S | n2 × n1 ≥ 10 | n2 × n1 ≥ 100 | n2 × n1 ≥ 20, except 10/2 |
M | any | 30/5a or 20/20a | n2 × n1 ≥ 10 | M | any | n2 × n1 ≥ 50, if n2 ≥ 5 | n2 × n1 ≥ 10 | |
L | any | any except 2/2, 2/5 | any | L | any | n2 ≥ 5 or n1 ≥ 20 | any | |
100 | S | any | 20/2 or 10/10a | n2 × n1 ≥ 10, except 2/5 | S | n2 × n1 ≥ 10 | n2 × n1 ≥ 50, if n1 ≥ 5 | n2 ≥ 10 or n1 ≥ 5 |
M | any | n2 × n1 ≥ 20, if n2 ≥ 5 | n2 × n1 ≥ 10 | M | any | n2 × n1 ≥ 25 | n2 × n1 ≥ 10 | |
L | any | any | any | L | any | n2 × n1 ≥ 10 | any | |
200 | S | any | n2 × n1 ≥ 20, if n2 ≥ 5a | n2 × n1 ≥ 10 | S | any | n2 × n1 ≥ 25 | n2 × n1 ≥ 10 |
M | any | n2 × n1 ≥ 10,a except 2/5 | any | M | any | n2 × n1 ≥ 20 | any | |
L | any | any | any | L | any | any | any |
Note. rPEBθ = relative PEB of θ; aPEBθ = absolute PEB of θ. -0.1 < rPEB < 0.1 and aPEB < 15% indicate relative and absolute unbiasedness, power ≥ 0.8 indicates sufficient power. ICC = Intraclass correlation coefficients for Level-3 (ICC3) and Level-2 (ICC2); = Level-3 variance; = Level-2 variance. S, M, L indicate small, medium, and large ICC sizes, respectively. n1, n2, n3 indicate the number of clusters (n3), subclusters per cluster (n2), and Level-1 units per subcluster (n1). n2 × n1 refers to the multiplied sample size, e.g., n2 × n1 = 10 is realized with n2 = 2 and n1 = 5, or with n2 = 5 an n1 = 2.
awith exceptions, e.g.: 20/20a indicate that most conditions with n2 = n1 = 20 are unbiased or with sufficient power.
Level-3 Variance Component
In general, was consistently underestimated in most conditions, and estimates fluctuated strongly across replications for the smallest samples. Within conditions with varying allocations but equal NOBS (gray areas in Figure 2), conditions with large n3 produced less biased estimates and less fluctuations. Sufficient power for the estimation of Level-3 variances was ensured in most conditions with 7,500 or more observations and some conditions with smaller samples, such as 10/30/•, 50/10/•, or 100/5/•. Sufficient power for medium and large Level-3 variance components was obtained with at least 2,500 observations, and conditions •/30/5, 10/20/•, 50/5/•, •/20/10, or 10/10/5 (see Table 5).
Level-2 Variance Component
Estimates of tended to be underestimated for less than 1,250 observations and, similar to Level-3 findings, estimates fluctuated more heavily across replications with small samples (especially NOBS ≤ 1000). Power was sufficient for most allocations with at least 500 observations and some conditions with less observations, such as 10/•/20. For medium and large Level-2 variances, power was sufficient for conditions with at least 500 observations, or allocations with at least •/5/10, •/10/5, or n1 ≥ 20 (see Table 5).
Intraclass Correlation Coefficients
Allocations to achieve unbiasedness for the ICCs are presented in Table 5. Results for relative and absolute bias are plotted in Figure 3 (Level-3) and Figure 4 (Level-2). Conditions with large NOBS did not result in meaningfully biased estimates and are not presented. Comprehensive bias results are tabulated in columns 25 to 33 in the supplementary dataset.
Figure 3
Figure 4
Results for the ICC3
In most conditions with n3 ≥ 100 and minimum sample sizes of 200/20/• or 100/30/•, we found less than 15% absolute bias. For medium and large ICC3 sizes, sampling less Level-2 units (200/10/• and 100/20/•) or emphasizing the number of Level-3 units (200/5/5) also resulted in absolute unbiasedness. Further, conditions with n3 = 50, even conditions with large NOBS, such as 50/30/30, were considerably biased for small and most medium sized ICC3. Notably, bias of the ICC3 was higher in conditions where the complementing ICC2 was large rather than medium or small.
Results for the ICC2
ICC2 relative bias remained within 10% over-/underestimation for a variety of allocations, such as small n3 (e.g., 10/5/10, 10/10/5) or small n2 and n1 (e.g., 200/2/2, 50/2/5, 50/5/2). Absolute unbiasedness was achieved in most conditions with at least 5,000 observations (e.g., 10/30/20, 50/10/10, 100/5/10, 200/5/5). For medium or large ICC2, conditions with smaller NOBS, such as 100/5/5 or 50/5/10 also resulted in absolute unbiasedness.
Discussion
Our findings extend our knowledge on the estimation quality in three-level modeling by showing that moderate to large samples and an advantageous allocation are needed for overall good estimation quality of the ICCs, and that the size of the ICCs and the number of available clusters greatly influences required sample sizes.
The Role of the Sampling-Strategies
Results demonstrate that required n2/n1 depend on the available number of clusters and ICC scores (see Table 5). Consequently, even large samples may result in biased estimates if allocations are suboptimal. Although the number of clusters (n3) is most important to ensure estimation quality, statistical power and relative unbiasedness on Level-2 can be achieved even with a small number of clusters. Specifically, to achieve sufficient power, it is advantageous to sample n1 ≥ 5 to increase power at Level-2, and n2 ≥ 5 to increase power at Level-3.
Interestingly, we found that the variance components are consistently underestimated. Since is more strongly underestimated than , the resulting ICC2 is consistently overestimated, while the ICC3 is underestimated, which can result in misinterpretation of the three-level structure. Similar patterns of underestimation are visible in previous research (McNeish & Wentzel, 2017) but have, to our knowledge, not yet been systematically investigated.
Further, convergence rates for the smallest samples are considerably low. Research suggests that restricted maximum likelihood (REML) may improve convergence and reduce bias in small samples (McNeish & Wentzel, 2017). However, additional analyses (not reported) show mixed results for our data: Since REML is not implemented in Mplus, we compared convergence rates and ICC bias resulting from REML and regular maximum likelihood (ML) estimation in R using the lme4-package (Bates et al., 2015) for the allocations listed in Table 2 (all ICC-sizes, i.e., 162 conditions). We considered all issues resulting in non-computable ICC values as convergence issues. Across conditions, REML estimation improved convergence rates by Mdn = 3.75 percentage points. REML performed better for medium and large ICC2 (MdnML = -0.44, MdnREML = -0.33, for ICC2 ≥ .143), but worse for small ICC2 (MdnML = 1.28, MdnREML = 1.68, for ICC2 ≤ .111). For the ICC3, differences between estimation methods were small (MdnML = 0.04, MdnREML = 0.01). Absolute bias was comparably high for both ICCs and estimation methods.
The Role of ICC-Sizes
Results show that smaller variance components require considerably larger samples for sufficient estimation quality. For example, small ICCs require at least twice (four times) as many observations as medium (large) ICCs for a given number of clusters for absolute unbiasedness. Similarly, in samples with 5 or 10 clusters, required n2/n1 sample sizes to achieve sufficient power are at least two times (four times) higher for small variance components compared to medium (large) components.
Interestingly, the bias of an ICC estimate is higher if the ICC at the other level is larger. As an example, the ICC3 in M/L was more heavily biased than in M/M or M/S. In additional simulations, we tested if this is a direct consequence of the simulation setup, since, for example, the ICC3 was slightly smaller in S/L (ICC3 = .059) than S/M and S/S (ICC3 = .077, .083, respectively). These additional analyses (100 replications each for n3 = 50, 100; n2 = 5, 10; n1 = 5, 10; ICC2 = .077, .143, .333, ICC3 = .143) indicated, however, that this pattern is still present if the ICC3 size does not vary for different ICC2 sizes. Furthermore, this pattern is similar for statistical power, where small Level-3 variances required larger samples for sufficient power if the respective Level-2 variance was large. This pattern remains unexplained and requires more detailed research on the relationship between Level-3 and Level-2 bias.
Evaluation of Estimation Quality Indicators
Most importantly, our results demonstrate that relative unbiasedness of a simulation condition does not imply that a sample generated from this condition produces unbiased estimates, as indicated by the rate of biased runs and the absolute PEB.
Further, our inferences regarding statistical power are based on the one-sided SB-test. Our findings may therefore not be directly compared to previous research, since there is no single established coefficient assessing the power of variance estimates in multilevel research. Hence, approaches incorporating auxiliary models for the SB- or LRT-test or differing test distributions might suggest different sampling requirements, and we suggest that future studies include and compare different power measures in their simulations.
Concluding Recommendations
As a rule of thumb, overall estimation quality is achieved if samples ensure absolute unbiasedness of Level-3 estimates. If there is no information about ICC sizes, large samples with an emphasis on the number of clusters, such as 200/10/5, or 100/20/5, are recommended. If both ICCs are at least of medium size, required sample sizes reduce to e.g., 100/20/2 or 200/5/5. Achieving sufficient estimation quality with 50 clusters is still possible with at least n2 = 10 in populations with large ICC sizes.
In conclusion, our findings reveal that correctly characterizing a three-level structure through ICC estimates requires an advantageous sampling-strategy, where the number of achievable clusters determines the required numbers of subclusters and Level-1 units. Particular attention must be paid to the ICC3, which will most likely be slightly underestimated, even with moderate sample sizes. Researchers should take advantage of previously reported ICC sizes in their domain to identify a most likely adequate sampling strategy for a feasible overall sample size.