Obtaining Sound Intraclass Correlation and Variance Estimates in Three-Level Models: The Role of Sampling-Strategies

Three-level clustered data commonly occur in social and behavioral research and are prominently analyzed using multilevel modeling. The influence of the clustering on estimation results is assessed with the intraclass correlation coefficients (ICCs), which indicate the fraction of variance in the outcome located at each higher level. However, ICCs are prone to bias due to high requirements regarding the overall sample size and the sample size at each data level. In Monte Carlo simulations, we investigate how these sample characteristics influence the bias of the ICCs and statistical power of the variance components using robust ML-estimation. Results reveal considerable underestimation on Level-3 and the importance of the Level-3 sample size in combination with the ICC sizes. Based on our results, we derive concise sampling recommendations and discuss limits to our inferences.

Common estimates of the influence of the clustering are the intraclass correlation coefficients (ICCs), which depict the fraction of variance in the criterion located at Level-2 (ICC 2 ) or Level-3 (ICC 3 ). They thus indicate the extent to which the criterion is shaped by superordinate clusters, which may reveal, e.g., at which level(s) predictors or random effects should be included to reduce unexplained variance, or they inform the sampling procedure, ensuring high estimation quality in cluster-randomized experiments (Hedges & Hedberg, 2013). ICCs further indicate the degree to which standard errors are biased in standard regression analysis (two-level models: e.g., Kreft & de Leeuw, 1988; three-level models: e.g., Cunningham & Johnson, 2016). Therefore, ensuring that a nested structure is accurately characterized is a central issue, and reporting ICCs is a standard analysis step in multilevel modeling (see Hox et al., 2017).
However, research has shown that unbiased estimation and sufficient power in multi level models depend on an interplay of adequate sample sizes on each level, the effect sizes, and the levels at which the predictors are modelled (more recently Cox & Kelcey, 2019;Kerkhoff & Nussbeck, 2019;LaHuis et al., 2020). To our knowledge, no study has systematically investigated required sample sizes to ensure sufficient estimation quality of the ICCs in three-level models. In this study, we shed light on this relationship and derive sampling recommendations for a wide range of ICC sizes. We limit our analyses to models with continuous responses, since the computation of ICCs in multilevel models with discrete and continuous responses differ (Goldstein et al., 2002;Leckie et al., 2020), resulting in limited comparability of estimation results and interpretation of the ICC sizes.

Variance Decomposition in the Linear Three-Level Model
The linear three-level model with Level-1 units i nested within Level-2 subclusters j nested within Level-3 clusters k decomposes the values of the criterion Y ijk as follows: Y ijk = γ 000 + v 00k + u 0jk + e ijk (1) The coefficient γ 000 is the grand mean of Y, v 00k~N (0, σ v 0 2 ) is a random variable reflecting the cluster-specific deviations from the mean, u 0jk~N (0, σ u 0 2 ) is the random variable of subcluster-specific deviations from the respective cluster mean, and e ijk~N (0, σ e 2 ) is the random variable of deviations of the Level-1 unit values from the respective subcluster mean. The total variance of Y is given by σ Y 2 = σ v 0 2 + σ u 0 2 + σ e 2 . The ICCs express the fraction of variance at the respective level in relation to the total variance: (2) For the ICC 2 , an alternative approach is to include both higher-level variances in the enumerator (σ v 0 2 + σ u 0 2 ). The value then expresses how strongly any two Level-1 units from the same subcluster correlate (Hox et al., 2017). Typical ICC values are most comprehensively reported for educational research, ranging from 0.1 to 0.3 (Dong et al., 2016;Hedges & Hedberg, 2013), and research has demonstrated that even small ICCs can result in meaningful bias if the clustering is ignored (Lai & Kwok, 2015).

Assessing Estimation Quality of the Variance Components
When using Monte Carlo (MC) simulations to investigate estimation quality in multilevel models across many generated samples, the most common measures of estimation quali ty are the parameter estimation bias (PEB) to evaluate the accuracy of point estimates, and measures of significance to evaluate statistical power.

Parameter Estimation Bias (PEB)
The PEB reflects the percentage of over-or underestimation of a population parameter. We refer to this commonly reported PEB (e.g., Muthén & Muthén, 2002) as relative PEB. For a population parameter θ, estimated by θ i in i = 1, …, n replications, it is defined as We argue that the relative PEB alone is not sufficient for evaluating bias: By design, under-and overestimation balance out when computing the relative bias across replica tions. This can result in a relative PEB close to zero, although each replication might be either over-or underestimated. The magnitude of bias for a single sample can more precisely be approximated by the absolute PEB, given by: For the absolute PEB, only the strength of the bias is averaged across replications, and the direction of bias for each sample is ignored. Together, the relative and absolute PEB reflect the expected strength and overall direction of bias.

Statistical Power
Statistical power of a parameter in the context of simulation studies is the rate of replications with a statistically significant point estimate. However, the commonly used Wald-test (Wald, 1943) is unreliable for variance components (Berkhof & Snijders, 2001;Molenberghs & Verbeke, 2004;Raudenbush & Bryk, 2002). An alternative procedure to test statistical significance of parameters in multilevel models is the χ 2 -test-based comparison between the full model and a nested model where the parameter of interest is constrained to be zero (Hox et al., 2017). Based on the scaled χ 2 -test statistic (Satorra & Bentler, 1994), the modified test statistic T d proposed by Satorra and Bentler (2010, henceforth: SB-test) is computed using the unscaled maximum likelihood χ 2 -values of the nested (T 0 ) and full (T 1 ) models, the respective scaled χ 2 -test statistics of T 0 and T 1 , with degrees of freedom r 0 and r 1 , and scaling correction factors c 0 = T 0 /T 0 and c 1 = T 1 /T 1 : The score is compared to a χ 2 -distribution with r 0 − r 1 degrees of freedom. As an alter native, the likelihood-ratio test (LRT) follows the same procedural logic but uses the log-likelihood to compute the test statistic (see Herzog et al., 2007, for an overview). To assess the statistical significance of the Level-2 (Level-3) variance, the empty model (Eq. 1) is compared to a nested model where the Level-2 (Level-3) variance is constrained to be zero (Greven et al., 2008) using a one-sided test, since significantly negative variance values are inadmissible (Berkhof & Snijders, 2001). However, this approach has drawbacks. Particularly, both the LRT and SB-test can result in inadmissible, negative test-values, requiring a more complex approach using an auxiliary model (Bryant & Satorra, 2012;Satorra & Bentler, 2010). Further, the distribution of the test statistic has been shown to more closely follow a χ 2 -mixture distribution, but nevertheless, research suggests that χ 2 -based tests for the variances still produce sufficiently accurate results (Dominicus et al., 2006;LaHuis & Ferguson, 2009).

Sampling Recommendations
Despite its importance as a measure of cluster influences, there are no comprehensive recommendations for unbiased estimation of the ICC values, and only few studies report on the estimation quality of the variance components. In Kerkhoff and Nussbeck (2019), a small Level-3 intercept variance is heavily overestimated in samples with up to 55 clusters with 5 or 15 units per cluster. For ten or less sampled Level-3 clusters, McNeish and Wentzel (2017) found that the Level-2 variance is underestimated by up to 5% and the Level-3 variance is underestimated by 30% or more. Although these studies do not provide information regarding accurate estimation of the ICCs, they demonstrate that the estimation of variances is not trivial in three-level models, and that substantial bias can occur if the sampling process is not optimized.

Aim of This Study
By means of extensive Monte Carlo simulations, we evaluate how estimation bias and statistical power relate to the overall sample size, the allocation of units on each level, and the ICC sizes. We are particularly interested in minimum required samples sizes and advantageous sampling-strategies for overall sound estimation.
For every condition, we generated 1,000 samples. For each sample, we fitted the empty model using robust maximum likelihood estimation with the expectation maximization algorithm and 500 admissible iterations. Data generation and model estimation was done in Mplus Version 8 (Muthén & Muthén, 1998, and results were imported to R 6.3.1 (R Core Team, 2019) using the MplusAutomation package (Hallquist & Wiley, 2018) for subsequent analyses.
We refer to the total number of observations in a condition (n 3 × n 2 × n 1 ) as NOBS. We further refer to the specific combination of n 3 , n 2 , and n 1 in a condition as "allocation" and abbreviate the allocation by n 3 -size/n 2 -size/n 1 -size. For example, 100/20/5 subsumes samples with 100 clusters, each with 20 subclusters, which in turn contain 5 Level-1 units. 100/5/• comprises all conditions with 100 clusters, each with 5 subclusters and any number of Level-1 units. ICC sizes are summarized where appropriate by L = large (ICC = .273,.333,or .353), M = medium (ICC = .111,.143,or .154), and S = small (ICC = .059, .077, or .083). ICC sizes of a condition are abbreviated by ICC 3 -size/ICC 2 -size (see Table  1).

Outcome Measures
We report convergence rates, but computed coefficients only across runs that converged normally. To explore the influence of sample sizes on bias (rPEB and aPEB for ICC 2 and ICC 3 ), we ran four analyses of variance, and report the partial omega-squared (ω P 2 ) effect size. We further computed the median estimate, the upper and lower quartiles, and the relative and absolute PEB of the ICCs and the Level-1 variance. We considered -0.10 < rPEB < 0.10 as sufficiently unbiased (see Flora & Curran, 2004;Muthén & Muthén, 2002). There are no established thresholds to indicate absolute unbiasedness, so we considered aPEB < 0.15 to be unbiased, since we argue that in practice, less than 15% over-or underestimation of an ICC does not considerably change statistical inferences. We further calculated the rate of biased runs, which is the percentage of replications in a condition with at least 15% over-or underestimation, as an indicator for the risk of producing a biased estimate.
For the Level-2 and Level-3 variance components, we assessed statistical power by the rate of significant one-sided SB-tests as in Eq. 6 1 . We considered a power of 80% or higher as sufficient (see Cohen, 1992;Muthén & Muthén, 2002).

Results
First, since Level-1 residuals were estimated accurately in most conditions (relative unbiasedness in NOBS > 50, absolute unbiasedness in NOBS > 100), we do not provide detailed information about the estimation quality on Level-1. Complete results are tabu lated in the supplementary dataset.
Out of the 1,125 conditions, 299 conditions (all n 3 ≤ 10, 26.58% of all conditions) produced at least one replication that did not converge normally. Moreover, convergence problems occurred more frequently for conditions with small ICC 3 (see also Table 2). 1) While we initially computed both LRTs and SB-tests, we chose not to report results for the LRT, since the rate of inadmissible test values was considerably higher.

Conditions With Sufficient Power, Relative Unbiasedness, and Absolute Unbiasedness
Note. Each square represents a condition, ordered by n 3 (grouped columns), n 2 (grouped rows), n 1 (single columns within an n 3 group), and ICC sizes (single rows within an n 2 group). Shaded diagonals show the rate of runs with at least 15% under-or overestimation of the ICC 3 (upper diagonal) and ICC 2 (lower diagonal). White squares indicate that the condition did not yield unbiased estimates or sufficient power at all levels.
In total, 384 conditions (34.13%) resulted in sufficient power and relative and absolute unbiasedness on all levels. These conditions can be identified in Figure 1 as squares with gray areas, where the shades of the triangles indicate the rate of biased runs for the ICC 3 (upper triangle) and ICC 2 (lower triangle). For example, in 200/2/2, only the condition with large ICCs was estimated with sufficient power and without bias, yet both ICCs were still biased in more than 30% of replications in this condition. Conditions with 5 or 10 clusters did not result in accurate estimation across levels. Overall, conditions with at least 200/20/2 or 100/20/5 resulted in sufficient estimation quality across levels.

Estimation of Variance Components
In Figure 2, we present the median, upper and lower quartiles for σ v 0 2 (upper plot) and σ u 0 2 (lower plot) across allocations. We do not provide information for conditions with more than 10,000 observations because estimation quality did not further improve for larger samples. Since statistical power of all variance sizes was sufficient for a variety of conditions, we do not plot power results, but report requirements for sufficient power in conjunction with Table 5. Comprehensive estimation results are tabulated in columns 13 to 24 in the supplementary dataset.

Level-3 Variance Component
In general, σ v 0 2 was consistently underestimated in most conditions, and estimates fluc tuated strongly across replications for the smallest samples. Within conditions with varying allocations but equal NOBS (gray areas in Figure 2), conditions with large n 3 produced less biased estimates and less fluctuations. Sufficient power for the estimation of Level-3 variances was ensured in most conditions with 7,500 or more observations and some conditions with smaller samples, such as 10/30/•, 50/10/•, or 100/5/•. Sufficient power for medium and large Level-3 variance components was obtained with at least 2,500 observations, and conditions •/30/5, 10/20/•, 50/5/•, •/20/10, or 10/10/5 (see Table 5).

Level-2 Variance Component
Estimates of σ u 0 2 tended to be underestimated for less than 1,250 observations and, similar to Level-3 findings, estimates fluctuated more heavily across replications with small samples (especially NOBS ≤ 1000). Power was sufficient for most allocations with at least 500 observations and some conditions with less observations, such as 10/•/20. For medium and large Level-2 variances, power was sufficient for conditions with at least 500 observations, or allocations with at least •/5/10, •/10/5, or n 1 ≥ 20 (see Table 5).

Intraclass Correlation Coefficients
Allocations to achieve unbiasedness for the ICCs are presented in Table 5. Results for rel ative and absolute bias are plotted in Figure 3

Median Estimates, Upper and Lower Quartiles of Higher-Level Variances Across Allocations
Note. Plotted are medians, upper and lower quartiles of σ v 0 2 (upper plot) and σ u 0 2 (lower plot) estimates across sample size conditions with up to 10,000 observations. Lines represent the median estimates, colored ribbons encompass upper and lower quartiles. Colors represent large (blue: σ v 0 2 , σ u 0 2 = 0.6), medium (yellow: σ v 0 2 , σ u 0 2 = 0.2), and small (green: σ v 0 2 , σ u 0 2 = 0.1) population variance sizes. On the x-axis, numbers in parentheses indicate the overall number of observations in this and-if applicable-the following conditions. Shaded areas mark multiple conditions with the same number of observations but differing allocations. Values for each variance component's population parameter have been averaged (e.g., estimates for σ v 0 2 = 0.1 are average scores across σ u 0 2 = 0.1, 0.2 and 0.6) ICC 3 . Notably, bias of the ICC 3 was higher in conditions where the complementing ICC 2 was large rather than medium or small.

Figure 3
Absolute and Relative PEB for the ICC 3 Across Allocations Note. Plotted are the relative (upper plot) and absolute (lower plot) PEB of the ICC 3 in conditions with up to 60,000 observations. ICC 3 sizes are differentiated by color (green = small, yellow = medium, blue = large), ICC 2 sizes are differentiated by line type (dotted = small, line = medium, dashed = large). Areas are colored to facilitate readability. On the x-axis, numbers in parentheses show the number of observations in this and-if applicable-the following conditions. Shaded areas mark multiple conditions with the same number of observations but differing allocations.

Figure 4
Absolute and Relative PEB for the ICC 2 Across Allocations Note. Plotted are the relative (upper plot) and absolute (lower plot) PEB of the ICC 2 in conditions with up to 10,000 observations. ICC 2 sizes are differentiated by color (green = small, yellow = medium, blue = large), ICC 3 sizes are differentiated by line type (dotted = small, line = medium, dashed = large). Areas are colored to facilitate readability. On the x-axis, numbers in parentheses show the number of observations in this and-if applicable-the following conditions. Shaded areas mark multiple conditions with the same number of observations but differing allocations.

Table 5
Minimum Allocations to Achieve Unbiasedness and Sufficient Power, Differentiated by n 3 .

Discussion
Our findings extend our knowledge on the estimation quality in three-level modeling by showing that moderate to large samples and an advantageous allocation are needed for overall good estimation quality of the ICCs, and that the size of the ICCs and the number of available clusters greatly influences required sample sizes.

The Role of the Sampling-Strategies
Results demonstrate that required n 2 /n 1 depend on the available number of clusters and ICC scores (see Table 5). Consequently, even large samples may result in biased estimates if allocations are suboptimal. Although the number of clusters (n 3 ) is most important to ensure estimation quality, statistical power and relative unbiasedness on Level-2 can be achieved even with a small number of clusters. Specifically, to achieve sufficient power, it is advantageous to sample n 1 ≥ 5 to increase power at Level-2, and n 2 ≥ 5 to increase power at Level-3. Interestingly, we found that the variance components are consistently underestima ted. Since σ v 0 2 is more strongly underestimated than σ u 0 2 , the resulting ICC 2 is consistently overestimated, while the ICC 3 is underestimated, which can result in misinterpretation of the three-level structure. Similar patterns of underestimation are visible in previous research (McNeish & Wentzel, 2017) but have, to our knowledge, not yet been systemati cally investigated.
Further, convergence rates for the smallest samples are considerably low. Research suggests that restricted maximum likelihood (REML) may improve convergence and reduce bias in small samples (McNeish & Wentzel, 2017). However, additional analyses (not reported) show mixed results for our data: Since REML is not implemented in Mplus, we compared convergence rates and ICC bias resulting from REML and regular maximum likelihood (ML) estimation in R using the lme4-package (Bates et al., 2015) for the allocations listed in Table 2 (all ICC-sizes, i.e., 162 conditions). We considered all issues resulting in non-computable ICC values as convergence issues. Across conditions, REML estimation improved convergence rates by Mdn = 3.75 percentage points. REML performed better for medium and large ICC 2 (Mdn ML = -0.44, Mdn REML = -0.33, for ICC 2 ≥ .143), but worse for small ICC 2 (Mdn ML = 1.28, Mdn REML = 1.68, for ICC 2 ≤ .111). For the ICC 3 , differences between estimation methods were small (Mdn ML = 0.04, Mdn REML = 0.01). Absolute bias was comparably high for both ICCs and estimation methods.

The Role of ICC-Sizes
Results show that smaller variance components require considerably larger samples for sufficient estimation quality. For example, small ICCs require at least twice (four times) as many observations as medium (large) ICCs for a given number of clusters for absolute unbiasedness. Similarly, in samples with 5 or 10 clusters, required n 2 /n 1 sample sizes to achieve sufficient power are at least two times (four times) higher for small variance components compared to medium (large) components.
Interestingly, the bias of an ICC estimate is higher if the ICC at the other level is larger. As an example, the ICC 3 in M/L was more heavily biased than in M/M or M/S. In additional simulations, we tested if this is a direct consequence of the simulation setup, since, for example, the ICC 3 was slightly smaller in S/L (ICC 3 = .059) than S/M and S/S (ICC 3 = .077, .083, respectively). These additional analyses (100 replications each for n 3 = 50, 100; n 2 = 5, 10; n 1 = 5, 10; ICC 2 = .077, .143, .333, ICC 3 = .143) indicated, however, that this pattern is still present if the ICC 3 size does not vary for different ICC 2 sizes. Furthermore, this pattern is similar for statistical power, where small Level-3 variances required larger samples for sufficient power if the respective Level-2 variance was large. This pattern remains unexplained and requires more detailed research on the relationship between Level-3 and Level-2 bias.

Evaluation of Estimation Quality Indicators
Most importantly, our results demonstrate that relative unbiasedness of a simulation condition does not imply that a sample generated from this condition produces unbiased estimates, as indicated by the rate of biased runs and the absolute PEB.
Further, our inferences regarding statistical power are based on the one-sided SB-test. Our findings may therefore not be directly compared to previous research, since there is no single established coefficient assessing the power of variance estimates in multilevel research. Hence, approaches incorporating auxiliary models for the SB-or LRT-test or differing test distributions might suggest different sampling requirements, and we suggest that future studies include and compare different power measures in their simu lations.

Concluding Recommendations
As a rule of thumb, overall estimation quality is achieved if samples ensure absolute unbiasedness of Level-3 estimates. If there is no information about ICC sizes, large samples with an emphasis on the number of clusters, such as 200/10/5, or 100/20/5, are recommended. If both ICCs are at least of medium size, required sample sizes reduce to e.g., 100/20/2 or 200/5/5. Achieving sufficient estimation quality with 50 clusters is still possible with at least n 2 = 10 in populations with large ICC sizes.
In conclusion, our findings reveal that correctly characterizing a three-level structure through ICC estimates requires an advantageous sampling-strategy, where the number of achievable clusters determines the required numbers of subclusters and Level-1 units. Particular attention must be paid to the ICC 3 , which will most likely be slightly un derestimated, even with moderate sample sizes. Researchers should take advantage of