Three-level clustered data commonly occur in social and behavioral research and are prominently analyzed using multilevel modeling. The influence of the clustering on estimation results is assessed with the intraclass correlation coefficients (ICCs), which indicate the fraction of variance in the outcome located at each higher level. However, ICCs are prone to bias due to high requirements regarding the overall sample size and the sample size at each data level. In Monte Carlo simulations, we investigate how these sample characteristics influence the bias of the ICCs and the statistical power of the variance components using robust ML estimation. Results reveal considerable underestimation on Level 3 and the importance of the Level-3 sample size in combination with the ICC sizes. Based on our results, we derive concise sampling recommendations and discuss limits to our inferences.
The linear three-level model with Level-1 units
The coefficient
For the ICC_{2}, an alternative approach is to include both higher-level variances in the numerator (
When using Monte Carlo (MC) simulations to investigate estimation quality in multilevel models across many generated samples, the most common measures of estimation quality are the parameter estimation bias (
The
We argue that the relative
For the absolute
Statistical power of a parameter in the context of simulation studies is the rate of replications with a statistically significant point estimate. However, the commonly used Wald test (
An alternative procedure to test statistical significance of parameters in multilevel models is the χ^{2}-test-based comparison between the full model and a nested model where the parameter of interest is constrained to be zero (
The score is compared to a χ^{2} distribution with
Despite the importance of the ICC as a measure of cluster influences, there are no comprehensive recommendations for unbiased estimation of the ICC values, and only a few studies report on the estimation quality of the variance components. In
By means of extensive Monte Carlo simulations, we evaluate how estimation bias and statistical power relate to the overall sample size, the allocation of units on each level, and the ICC sizes. We are particularly interested in minimum required sample sizes and advantageous sampling strategies for overall sound estimation.
For the simulations, we used the empty three-level model (
Level-3 variance  Level-2 variance  ICC_{3}/ICC_{2}  Notation
0.1  0.1  .083/.083  S/S
0.1  0.2  .077/.154  S/M
0.1  0.6  .059/.353  S/L
0.2  0.1  .154/.077  M/S
0.2  0.2  .143/.143  M/M
0.2  0.6  .111/.333  M/L
0.6  0.1  .353/.059  L/S
0.6  0.2  .333/.111  L/M
0.6  0.6  .273/.273  L/L
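The ICC values in the table above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that the first two columns are the Level-3 and Level-2 variances and that the Level-1 residual variance is fixed at 1 (an assumption that is not stated in the surviving text, but that reproduces every tabulated value). The function name `iccs` is hypothetical.

```python
def iccs(var3, var2, var1=1.0):
    """Return (ICC_3, ICC_2) as each level's share of the total variance.

    var1 = 1.0 is an assumption inferred from the tabulated values.
    """
    total = var3 + var2 + var1
    return var3 / total, var2 / total


# Reproduce the nine population conditions (S = 0.1, M = 0.2, L = 0.6).
for var3 in (0.1, 0.2, 0.6):
    for var2 in (0.1, 0.2, 0.6):
        icc3, icc2 = iccs(var3, var2)
        print(f"{var3:.1f}  {var2:.1f}  {icc3:.3f}/{icc2:.3f}")
```

Because both ICCs share the same denominator, increasing the variance at one level lowers the ICC at the other level, which is why the ICC_{3} is .083 in S/S but only .059 in S/L.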
For every condition, we generated 1,000 samples. For each sample, we fitted the empty model using robust maximum likelihood estimation with the expectation maximization algorithm and 500 admissible iterations. Data generation and model estimation were done in Mplus Version 8 (
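To make the data-generating step concrete, the following is a minimal sketch of how one balanced sample from an empty three-level model could be drawn. It is not the Mplus setup used in the study: the function name `simulate_three_level`, the normal distributions for all random effects, the intercept of 0, and the Level-1 variance of 1 are assumptions made for illustration.

```python
import random


def simulate_three_level(n3, n2, n1, var3, var2, var1=1.0,
                         intercept=0.0, seed=None):
    """Draw one balanced sample: n3 clusters, n2 subclusters each, n1 units each.

    Each outcome is intercept + Level-3 effect + Level-2 effect + residual.
    Normality, intercept = 0, and var1 = 1 are illustrative assumptions.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n3):
        v = rng.gauss(0.0, var3 ** 0.5)       # Level-3 random effect
        cluster = []
        for _ in range(n2):
            u = rng.gauss(0.0, var2 ** 0.5)   # Level-2 random effect
            units = [intercept + v + u + rng.gauss(0.0, var1 ** 0.5)
                     for _ in range(n1)]      # Level-1 residuals
            cluster.append(units)
        data.append(cluster)
    return data


# One sample with 5 clusters, 2 subclusters per cluster, 10 units per subcluster.
sample = simulate_three_level(5, 2, 10, var3=0.1, var2=0.2, seed=1)
```

Repeating the call with different seeds gives the replications of one condition; each replication would then be passed to the estimation software.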
We refer to the total number of observations in a condition (
We report convergence rates, but computed coefficients only across runs that converged normally.
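The per-condition summaries can be sketched as follows. This is an illustration rather than the authors' code: the function `summarize_condition`, the mean-based definition of relative bias, and the significance level of .05 are assumptions, since the exact definitions are not recoverable from the text. Only replications flagged as converged enter the bias and power computations, in line with the sentence above.

```python
from statistics import mean


def summarize_condition(estimates, converged, p_values, true_value, alpha=0.05):
    """Summarize one simulation condition across its replications.

    estimates  : point estimates of one parameter, one per replication
    converged  : booleans, True if the replication converged normally
    p_values   : p-values of the significance test, one per replication
    true_value : population value used to generate the data
    alpha      : significance level (0.05 is an assumption)
    """
    kept = [(e, p) for e, c, p in zip(estimates, converged, p_values) if c]
    if not kept:
        raise ValueError("no converged replications in this condition")
    mean_estimate = mean(e for e, _ in kept)
    return {
        # Share of all replications that converged normally.
        "convergence_rate": len(kept) / len(estimates),
        # Mean deviation from the population value, relative to that value.
        "relative_bias": (mean_estimate - true_value) / true_value,
        # Rate of converged replications with a significant test.
        "power": mean(1.0 if p < alpha else 0.0 for _, p in kept),
    }


result = summarize_condition(
    estimates=[0.08, 0.10, 0.12, 0.50],
    converged=[True, True, True, False],
    p_values=[0.01, 0.20, 0.03, 0.00],
    true_value=0.1,
)
```

In this toy input the fourth replication is excluded as non-converged, so its extreme estimate affects only the convergence rate.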
To explore the influence of sample sizes on bias (
For the Level-2 and Level-3 variance components, we assessed statistical power by the rate of significant one-sided SB tests as in
While we initially computed both LRTs and SB tests, we chose not to report results for the LRT, since the rate of inadmissible test values was considerably higher.
We considered a power of 80% or higher as sufficient (see
First, since Level-1 residuals were estimated accurately in most conditions (relative unbiasedness in NOBS > 50, absolute unbiasedness in NOBS > 100), we do not provide detailed information about the estimation quality on Level 1. Complete results are tabulated in the
Out of the 1,125 conditions, 299 conditions (all
ICC_{3}/ICC_{2}  

5/2/2  All  
5/2/5  All  
5/2/10  S/•  M/•  L/S 
5/2/20  S/•  M/•  L/L 
5/2/30  S/•  M/M  M/L 
5/5/2  S/•  M/•  
5/5/5  S/•  M/M  M/L 
5/5/10  S/•  M/L  
5/5/20  S/M  S/L  M/L 
5/5/30  S/M  S/L  M/L 
5/10/2  S/•  M/M  M/L 
5/10/5  S/M  S/L  
5/10/10  S/L  M/L  
5/10/20  S/L  M/L  
5/10/30  S/L  
5/20/2  S/L  
10/2/2  S/•  M/S  
10/2/5  S/S  M/S 
Median estimates, quartiles of ICC estimates,
ICC  Level3 
Level2 


ICC_{3} estimate 
Power of 
ICC_{2} estimate 
Power of 

[ 
[ 

.059  .058  [.053; .059]  .016  .420  .865  .060  [.059; .065]  .021  .201  1.000 
.077  .075  [.067; .076]  .024  .275  .992  .078  [.077; .080]  .013  .166  1.000 
.083  .081  [.072; .082]  .026  .240  .982  .084  [.084; .085]  .008  .163  1.000 
.111  .109  [.097; .110]  .020  .273  1.000  .113  [.112; .118]  .016  .155  1.000 
.143  .139  [.123; .141]  .028  .183  1.000  .144  [.143; .146]  .007  .125  1.000 
.154  .150  [.131; .152]  .026  .163  1.000  .154  [.154; .155]  .004  .120  1.000 
.273  .266  [.233; .270]  .026  .145  1.000  .275  [.274; .285]  .009  .107  1.000 
.333  .325  [.288; .330]  .024  .117  1.000  .334  [.333; .336]  .002  .081  1.000 
.353  .345  [.308; .349]  .023  .111  1.000  .353  [.350; .354]  .001  .073  1.000 
Factor  Level3 
Level2 







Population 
2  .043  .427  .072  .042 
Population 
2  .033  .198  .087  .363 
4  .462  .814  .074  .651  
4  .074  .508  .002  .565  
4  .002  .106  .093  .526  
NOBS  34  .098  .177  .142  .456 
In total, 384 conditions (34.13%) resulted in sufficient power and relative and absolute unbiasedness on all levels. These conditions can be identified in
In
Level3 
Level2 


ICC_{3}  Power of 
ICC_{2}  Power of 

5  S  none^{a}  none  20/10^{a}  S  5/5^{a}  none^{a}  
M  none^{a}  none  M  none^{a}  
L  none  none  L  20/5^{a} or 10/20^{a}  
10  S  none^{a}  none  S  
M  none^{a}  none  M  any^{a}  
L  none  none  L  
50  S  any^{a}  none  S  
M  any  30/5^{a} or 20/20^{a}  M  any  
L  any  any except 2/2, 2/5  any  L  any  any  
100  S  any  20/2 or 10/10^{a}  S  
M  any  M  any  
L  any  any  any  L  any  any  
200  S  any  S  any  
M  any  any  M  any  any  
L  any  any  any  L  any  any  any 
^{a}with exceptions, e.g.: 20/20^{a} indicates that most conditions with
In general,
Estimates of
Allocations to achieve unbiasedness for the ICCs are presented in
In most conditions with
ICC_{2} relative bias remained within 10% over/underestimation for a variety of allocations, such as small
Our findings extend current knowledge on estimation quality in three-level modeling by showing that moderate to large samples and an advantageous allocation are needed for overall good estimation quality of the ICCs, and that the size of the ICCs and the number of available clusters greatly influence the required sample sizes.
Results demonstrate that required
Interestingly, we found that the variance components are consistently underestimated. Since
Further, convergence rates for the smallest samples are considerably low. Research suggests that restricted maximum likelihood (REML) may improve convergence and reduce bias in small samples (
Results show that smaller variance components require considerably larger samples for sufficient estimation quality. For example, small ICCs require at least twice (four times) as many observations as medium (large) ICCs for a given number of clusters for absolute unbiasedness. Similarly, in samples with 5 or 10 clusters, required
Interestingly, the bias of an ICC estimate is higher if the ICC at the other level is larger. As an example, the ICC_{3} in M/L was more heavily biased than in M/M or M/S. In additional simulations, we tested whether this is a direct consequence of the simulation setup, since, for example, the ICC_{3} was slightly smaller in S/L (ICC_{3} = .059) than in S/M and S/S (ICC_{3} = .077 and .083, respectively). These additional analyses (100 replications each for
Most importantly, our results demonstrate that relative unbiasedness of a simulation condition does not imply that a sample generated from this condition produces unbiased estimates, as indicated by the rate of biased runs and the absolute
Further, our inferences regarding statistical power are based on the one-sided SB test. Our findings may therefore not be directly comparable to previous research, since there is no single established coefficient for assessing the power of variance estimates in multilevel research. Hence, approaches incorporating auxiliary models for the SB test or LRT, or differing test distributions, might suggest different sampling requirements, and we suggest that future studies include and compare different power measures in their simulations.
As a rule of thumb, overall estimation quality is achieved if samples ensure absolute unbiasedness of Level3 estimates. If there is no information about ICC sizes, large samples with an emphasis on the number of clusters, such as 200/10/5, or 100/20/5, are recommended. If both ICCs are at least of medium size, required sample sizes reduce to e.g., 100/20/2 or 200/5/5. Achieving sufficient estimation quality with 50 clusters is still possible with at least
In conclusion, our findings reveal that correctly characterizing a three-level structure through ICC estimates requires an advantageous sampling strategy, where the number of achievable clusters determines the required numbers of subclusters and Level-1 units. Particular attention must be paid to the ICC_{3}, which will most likely be slightly underestimated, even with moderate sample sizes. Researchers should take advantage of previously reported ICC sizes in their domain to identify a most likely adequate sampling strategy for a feasible overall sample size.
The authors thank Zoran Kovacevic for his support on creating the figures for this work.
For this article, data are freely available (
The Supplementary Materials contain the research data and codebook with estimation results for all conditions and variables analyzed in the article (for access see
The authors have no funding to report.
The authors have declared that no competing interests exist.