Original Article

Power Analyses for Moderator Effects With (Non)Randomly Varying Slopes in Cluster Randomized Trials

Nianbo Dong*1, Jessaca Spybrook2, Benjamin Kelcey3, Metin Bulus4

Methodology, 2021, Vol. 17(2), 92–110, https://doi.org/10.5964/meth.4003

Received: 2019-11-05. Accepted: 2020-05-22. Published (VoR): 2021-06-30.

*Corresponding author at: School of Education, University of North Carolina at Chapel Hill, 116 Peabody Hall, CB 3500, Chapel Hill, NC 27599, USA. Phone: +1 919-843-9553, E-mail: dong.nianbo@gmail.com

This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Researchers often apply moderation analyses to examine whether the effects of an intervention differ conditional on individual or cluster moderator variables such as gender, pretest, or school size. This study develops formulas for power analyses to detect moderator effects in two-level cluster randomized trials (CRTs) using hierarchical linear models. We derive the formulas for estimating statistical power, minimum detectable effect size difference and 95% confidence intervals for cluster- and individual-level moderators. Our framework accommodates binary or continuous moderators, designs with or without covariates, and effects of individual-level moderators that vary randomly or nonrandomly across clusters. A small Monte Carlo simulation confirms the accuracy of our formulas. We also compare power between main effect analysis and moderation analysis, discuss the effects of mis-specification of the moderator slope (randomly vs. non-randomly varying), and conclude with directions for future research. We provide software for conducting a power analysis of moderator effects in CRTs.

Keywords: cluster randomized trials, CRTs, minimum detectable effect size difference, moderator effect, statistical power

A critical consideration in the evaluation of treatment programs is whether those treatment effects are moderated by context or individual characteristics. As a result, an important consideration that emerges in the planning stage is how to design studies that have sufficient power to detect such moderation if it exists. Although there has been a steady pace of advancement in the design of moderation studies in cluster randomized trials (CRTs; Bloom, 2005; Dong, Kelcey, & Spybrook, 2018; Mathieu, Aguinis, Culpepper, & Chen, 2012; Moerbeek & Maas, 2005; Spybrook, Kelcey, & Dong, 2016), extant studies are largely fragmented in that they normally consider only isolated aspects of the design rather than the full assembly of design considerations that are typically encountered in planning such a study. For instance, with the exception of a few studies (e.g., Dong, Kelcey, & Spybrook, 2018), prior literature on the estimation of statistical power for moderation has often limited its analysis to binary moderators or has failed to include additional covariates (i.e., "unconditional designs"; Bloom, 2005; Spybrook, Kelcey, & Dong, 2016). Given the widespread presence of moderators that are continuous in nature (e.g., pretest) and the widespread use of covariate-adjusted designs to improve power and reduce potential bias due to unhappy randomization, it is critical to provide a more general set of tools for power analyses that can readily accommodate such variations (e.g., Bloom, 2006; Bloom, Richburg-Hayes, & Black, 2007; Dong & Maynard, 2013; Moerbeek, 2006; Moerbeek, van Breukelen, & Berger, 2001; Raudenbush, Martinez, & Spybrook, 2007).

Similarly, current multilevel literature is limited in the guidance it offers concerning statistical power when assessing the extent to which treatment effects vary across subgroups defined by an individual-level variable. More specifically, assessments of individual-level moderators are typically operationalized through cross-level interactions between the cluster-level treatments and individual-level moderators (e.g., child’s gender). The result is that the effect of the individual-level variable (i.e., as quantified through the coefficient) can be regarded as randomly or nonrandomly varying across clusters. The nonrandomly varying slope approach assumes that the gender achievement gap does not vary randomly across schools but rather only as an explicit function of cluster-level variables (e.g., the individual-level slope or coefficient for gender varies across clusters only as a function of the treatment status). The randomly varying slope or coefficient model addresses the same moderation question, but allows for the possibility that the gender achievement slope or coefficient randomly varies across schools even after accounting for the treatment effect (e.g., unexplained heterogeneity across schools in terms of the relationship between gender and the outcome). The choice between these approaches ultimately depends on prior knowledge of the effects of the moderator variables and the theory underlying the intervention. However, it is important that design frameworks consider both of these approaches and the implications of designing a study based on one of the frameworks.

Our review of the literature identified only two methodological studies that have examined the power for the randomly varying slope model in moderation analysis (Dong, Kelcey, & Spybrook, 2018; Mathieu, Aguinis, Culpepper, & Chen, 2012). In addition, no studies have examined the trade-offs between the design assumptions, the effects on power when the slope is mis-specified (randomly vs. nonrandomly varying), or the potential inaccuracies that accumulate in power formulas under such mis-specifications. A mis-specification of the slope term potentially undermines the accuracy of the standard error estimates for the moderator effect, which may result in incorrect estimates of statistical power. Investigating the effects of a mis-specified slope can help us understand how much bias in power arises from either type of mis-specification, develop strategies to mitigate such bias, and ultimately design moderation studies that are robust and well-positioned to detect such effects.

A key prior contribution to the literature on designing multilevel moderation studies was Mathieu et al. (2012), who conducted a comprehensive Monte Carlo simulation to estimate the statistical power to detect cross-level interaction effects in multilevel modeling. However, Mathieu et al. (2012) only studied two-level models without adjustment for covariates beyond the moderator, and did not provide closed-form formulas to estimate the statistical power, the minimum detectable effect size difference (MDESD) between moderator subgroups, or the minimum required sample size to detect meaningful effects. Dong, Kelcey, and Spybrook (2018) extended this line of inquiry by developing formulas to calculate statistical power and MDESD that account for the level at which moderators are assessed, the distribution of moderators (binary vs. continuous), the slopes of lower-level moderators (randomly vs. nonrandomly varying), and the level of covariates in three-level CRTs. However, the scope, developments, and analyses in Dong, Kelcey, and Spybrook (2018) did not cover two-level CRTs.

The purpose of this study is to consolidate and extend the literature on power analyses for moderators by developing power formulas that accommodate categorical or continuous moderators, models with or without covariates, same- or cross-level moderator effects, and nonrandomly or randomly varying slopes in two-level CRTs. We then advance the practical application of these results by examining the effects on power when the slope is mis-specified (randomly varying vs. nonrandomly varying slope) to outline the sensitivity of power analysis to such mis-specifications. Because a team planning a CRT may be interested in the power for a moderator effect of a given magnitude, or in the MDESD given the sample size and the desired power, we provide the power formulas as well as the MDESD calculations and their corresponding confidence intervals. We also created a Microsoft Excel-based function, an R function, and an R Shiny app to assist researchers in conducting power analyses for various moderator effects (see Note 1).

The paper is organized as follows. We present the formulas for statistical power and for the MDESD and its confidence intervals, first for a moderator at Level 2 and subsequently for a moderator at Level 1. In each case, we start with a continuous moderator and extend the results to a binary moderator. We also conduct a small Monte Carlo simulation to assess the empirical validity of the formulas in finite samples. We then compare the statistical power and MDESD for moderation effects under different design considerations, followed by a comparison of the MDES for main treatment effects with the MDESD for moderation effects. Finally, we discuss the implications for planning studies to detect moderator effects in two-level CRTs and consider directions for future work.

Statistical Power and Minimum Detectable Effect Size Difference in Two-Level CRTs

We present the key results of the formulas for statistical power and the MDESD and its confidence intervals for different moderator effects in the framework of a two-level hierarchical linear model (HLM; Raudenbush & Bryk, 2002). The detailed derivations are in Supplementary Materials SM1.

Two-Level CRTs With a Moderator at Level 2

We begin with a two-level design that randomly assigns groups/clusters (e.g., schools) to the treatment or control condition and conditions on a cluster-level covariate (e.g., the percentage of students eligible for free or reduced-price lunch) and probes a cluster-level moderator (e.g., school size). The data are generated using a two-level hierarchical linear model (Raudenbush & Bryk, 2002):

Level 1:

1
$$Y_{ij} = \beta_{0j} + \beta_{1j}\left(X_{ij} - \bar{X}_{\cdot j}\right) + r_{ij}, \quad r_{ij} \sim N\left(0, \sigma^2_{|X}\right)$$

Level 2:

2
$$\begin{aligned} \beta_{0j} &= \gamma_{00} + \gamma_{01} S_j + \gamma_{02} T_j + \gamma_{03}\left(S_j \times T_j\right) + \gamma_{04} W_j + \gamma_{05} \bar{X}_{\cdot j} + u_{0j}, \quad u_{0j} \sim N\left(0, \tau^2_{|S,W,\bar{X},T}\right) \\ \beta_{1j} &= \gamma_{10} \end{aligned}$$

$Y_{ij}$ is the outcome measure for observation $i$ ($i = 1,\ldots,n_j$) in cluster $j$ ($j = 1,\ldots,J$); $T_j$ is a binary variable indicating treatment status, coded as $\pm\tfrac{1}{2}$; $S_j$ is a Level 2 continuous moderator, $S_j \sim N(0, S_S^2)$; $X_{ij}$ is a Level 1 covariate and $\bar{X}_{\cdot j}$ is its sample group mean; and $W_j$ is a Level 2 covariate, $W_j \sim N(0, S_W^2)$. $r_{ij}$ is the Level 1 random error, $r_{ij} \sim N(0, \sigma^2_{|X})$, and $u_{0j}$ is the random effect for the intercept, $u_{0j} \sim N(0, \tau^2_{|S,W,\bar{X},T})$. As in single-level regression analysis, centering variables yields desirable statistical properties (Aiken & West, 1991); group-mean centering is used in Equation 1 to gain some computational and derivational advantages. Note that in random intercept models, parameter estimates under group-mean centering, grand-mean centering, and no centering can be equated using simple transformations (e.g., Kreft, de Leeuw, & Aiken, 1995). $\gamma_{02}$ and $\gamma_{03}$ represent the main effect of treatment and the moderator effect, respectively.

We assume that the data are balanced such that each cluster has the same number of observations (nj = n). However, we do not assume the clusters are equally allocated to treatment conditions. Although equal allocation of clusters to the treatment and control conditions typically yields the most sensitive design (i.e., highest power to detect main and moderator effects), such balance is not always possible in reality. For this reason, we considered a more flexible approach that introduces P as the proportion of total clusters that are randomly assigned to the treatment group.

We can test $\gamma_{03}$ using a t-test. Assuming the alternative hypothesis is true, the test statistic follows a noncentral t-distribution, $T'$, and the standardized noncentrality parameter is:

3
$$\lambda_{|S,W,X} = \delta_{2c} \sqrt{\frac{P(1-P)(J-6)}{(1-R_2^2)\rho + (1-R_1^2)(1-\rho)/n}}$$

where $J$ is the total number of clusters, $n$ is the sample size for every cluster (e.g., number of students per school), and $P$ is the proportion of total clusters that are randomly assigned to the treatment group. $R_2^2$ is the proportion of variance at Level 2 that is explained by the Level 2 predictors ($S_j$, $W_j$, $T_j$, $\bar{X}_{\cdot j}$, and $S_j \times T_j$): $R_2^2 = 1 - \tau^2_{|S,W,\bar{X},T}/\tau^2$, where $\tau^2$ is the unconditional Level 2 variance. $R_1^2$ is the proportion of variance at Level 1 that is explained by the Level 1 predictor ($X_{ij} - \bar{X}_{\cdot j}$): $R_1^2 = 1 - \sigma^2_{|X}/\sigma^2$, where $\sigma^2$ is the unconditional Level 1 variance. $\rho$ is the unconditional intraclass correlation, $\rho = \tau^2/(\tau^2 + \sigma^2)$. $\delta_{2c}$ is the standardized coefficient of $S_j \times T_j$ (the subscript indicates a Level 2 continuous moderator): $\delta_{2c} = \hat{\gamma}_{03}\sqrt{S_S^2/(\tau^2 + \sigma^2)}$, where $S_S^2$ is the variance of $S_j$.

The statistical power for a two-sided test is

$$1 - \beta = 1 - P\left[T'(J-6, \lambda_{|S,W,X}) < t_0\right] + P\left[T'(J-6, \lambda_{|S,W,X}) \le -t_0\right]$$

where $t_0 = t_{1-\alpha/2, J-6}$ and the degrees of freedom (see Note 2) are $v = J - 6$.
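This power calculation maps directly onto standard noncentral t routines. The following R sketch is ours, for illustration only; the function and argument names are not part of the published software:

```r
# Power for a Level 2 continuous moderator (Model CRT2-2), Equation 3.
# All function and argument names are illustrative.
power_crt2_l2 <- function(es_diff, J, n, P = 0.5, rho, R2sq, R1sq, alpha = 0.05) {
  df <- J - 6                                  # degrees of freedom, v = J - 6
  lambda <- es_diff * sqrt(P * (1 - P) * df /
              ((1 - R2sq) * rho + (1 - R1sq) * (1 - rho) / n))   # Equation 3
  t0 <- qt(1 - alpha / 2, df)                  # two-sided critical value
  1 - pt(t0, df, ncp = lambda) + pt(-t0, df, ncp = lambda)
}

# Example: J = 40, n = 100, rho = .23, R1^2 = R2^2 = .5, effect size diff = .2
power_crt2_l2(es_diff = 0.2, J = 40, n = 100, rho = 0.23, R2sq = 0.5, R1sq = 0.5)
# ~0.39
```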

The MDESD for the standardized coefficient is:

4
$$\mathrm{MDESD}(\delta_{2c}) = M_v \sqrt{\frac{(1-R_2^2)\rho + (1-R_1^2)(1-\rho)/n}{P(1-P)(J-6)}}$$

where $M_v = t_{\alpha} + t_{1-\beta}$ for one-tailed tests with $v$ degrees of freedom ($v = J - 6$), and $M_v = t_{\alpha/2} + t_{1-\beta}$ for two-tailed tests.

The 100*(1−α)% confidence interval for $\mathrm{MDESD}(\delta_{2c})$ is given by:

5
$$\left(M_v \pm t_{\alpha/2}\right)\sqrt{\frac{(1-R_2^2)\rho + (1-R_1^2)(1-\rho)/n}{P(1-P)(J-6)}}$$
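A companion sketch, again with illustrative names of our own, computes the MDESD of Equation 4 and the confidence interval of Equation 5:

```r
# MDESD (Equation 4) and its 100*(1-alpha)% CI (Equation 5) for a Level 2
# continuous moderator; two-tailed test. Names are illustrative.
mdesd_crt2_l2 <- function(J, n, P = 0.5, rho, R2sq, R1sq,
                          power = 0.80, alpha = 0.05) {
  df <- J - 6
  t_crit <- qt(1 - alpha / 2, df)
  M <- t_crit + qt(power, df)                  # M_v = t_{alpha/2} + t_{1-beta}
  se <- sqrt(((1 - R2sq) * rho + (1 - R1sq) * (1 - rho) / n) /
             (P * (1 - P) * df))               # SE of the standardized effect
  list(MDESD = M * se,
       CI = c(lower = (M - t_crit) * se, upper = (M + t_crit) * se))
}

mdesd_crt2_l2(J = 40, n = 100, rho = 0.23, R2sq = 0.5, R1sq = 0.5)  # MDESD ~0.34
# For the binary moderator of Equation 6 below, divide se by sqrt(Q * (1 - Q)).
```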

When the moderator, $S_j$, is a binary variable with a proportion $Q$ in one moderator subgroup and $(1-Q)$ in the other, the standardized noncentrality parameter is:

6
$$\lambda_{|S,W} = \delta_{2b}\sqrt{\frac{P(1-P)Q(1-Q)(J-6)}{(1-R_2^2)\rho + (1-R_1^2)(1-\rho)/n}}$$

where $\delta_{2b}$ is the effect size (standardized mean difference), $\delta_{2b} = \hat{\gamma}_{03}/\sqrt{\tau^2 + \sigma^2}$.

Table 1 presents the summary of standardized noncentrality parameters, MDESDs and 100*(1−α)% confidence intervals, and degrees of freedom for the t-test for various two-level moderation models. The above results are presented under Model "CRT2-2," which stands for a two-level CRT with a Level 2 moderator and flexible treatment allocation. Note that we assume a fixed slope for the covariate $X_{ij} - \bar{X}_{\cdot j}$ in Equation 2 for simplicity. Because the moderation term is in the equation for the Level 2 intercept, the standard error of the moderator effect is not affected by the slopes of other Level 1 covariates; hence, the power and MDESD formulas also apply to a model with a random slope for $X_{ij} - \bar{X}_{\cdot j}$.

Table 1

Summary of Standardized Noncentrality Parameters, MDESD and 100*(1−α)% Confidence Intervals for Two-Level CRTs

Model CRT2-1N (Level 1 moderator, nonrandomly varying slope)
  HLM:
    Level 1: $Y_{ij} = \beta_{0j} + \beta_{1j} S_{ij} + r_{ij}$, $r_{ij} \sim N(0, \sigma^2_{|S})$
    Level 2: $\beta_{0j} = \gamma_{00} + \gamma_{01} T_j + u_{0j}$; $\beta_{1j} = \gamma_{10} + \gamma_{11} T_j$; $u_{0j} \sim N(0, \tau^2_{|T})$
  Standardized noncentrality parameter (λ):
    Binary moderator: $\delta_{1b}\sqrt{P(1-P)Q(1-Q)Jn \,/\, [(1-R_1^2)(1-\rho)]}$
    Continuous moderator: $\delta_{1c}\sqrt{P(1-P)Jn \,/\, [(1-R_1^2)(1-\rho)]}$
  MDESD:
    Binary moderator: $M_v\sqrt{(1-R_1^2)(1-\rho) \,/\, [P(1-P)Q(1-Q)Jn]}$
    Continuous moderator: $M_v\sqrt{(1-R_1^2)(1-\rho) \,/\, [P(1-P)Jn]}$
  100*(1−α)% confidence interval: $(M_v \pm t_{\alpha/2})$ times the square-root term of the corresponding MDESD
  Degrees of freedom (v): $J(n-1) - 2$

Model CRT2-1R (Level 1 moderator, randomly varying slope)
  HLM:
    Level 1: $Y_{ij} = \beta_{0j} + \beta_{1j} S_{ij} + r_{ij}$, $r_{ij} \sim N(0, \sigma^2_{|S})$
    Level 2: $\beta_{0j} = \gamma_{00} + \gamma_{01} T_j + u_{0j}$; $\beta_{1j} = \gamma_{10} + \gamma_{11} T_j + u_{1j}$; $\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau^2_{00|T} & \tau_{01|T} \\ \tau_{10|T} & \tau^2_{11|T} \end{pmatrix}\right)$
  Standardized noncentrality parameter (λ):
    Binary moderator: $\delta_{1b}\sqrt{P(1-P)J \,/\, [(1-R_{2T}^2)\rho\omega + (1-R_1^2)(1-\rho)/(nQ(1-Q))]}$
    Continuous moderator: $\delta_{1c}\sqrt{P(1-P)J \,/\, [(1-R_{2T}^2)\rho\omega + (1-R_1^2)(1-\rho)/n]}$
  MDESD:
    Binary moderator: $M_v\sqrt{[(1-R_{2T}^2)\rho\omega + (1-R_1^2)(1-\rho)/(nQ(1-Q))] \,/\, [P(1-P)J]}$
    Continuous moderator: $M_v\sqrt{[(1-R_{2T}^2)\rho\omega + (1-R_1^2)(1-\rho)/n] \,/\, [P(1-P)J]}$
  100*(1−α)% confidence interval: $(M_v \pm t_{\alpha/2})$ times the square-root term of the corresponding MDESD
  Degrees of freedom (v): $J - 2$

Model CRT2-2 (Level 2 moderator)
  HLM:
    Level 1: $Y_{ij} = \beta_{0j} + \beta_{1j}(X_{ij} - \bar{X}_{\cdot j}) + r_{ij}$, $r_{ij} \sim N(0, \sigma^2_{|X})$
    Level 2: $\beta_{0j} = \gamma_{00} + \gamma_{01} S_j + \gamma_{02} T_j + \gamma_{03}(S_j \times T_j) + \gamma_{04} W_j + \gamma_{05}\bar{X}_{\cdot j} + u_{0j}$; $\beta_{1j} = \gamma_{10}$; $u_{0j} \sim N(0, \tau^2_{|S,W,\bar{X},T})$
  Standardized noncentrality parameter (λ):
    Binary moderator: $\delta_{2b}\sqrt{P(1-P)Q(1-Q)(J-6) \,/\, [(1-R_2^2)\rho + (1-R_1^2)(1-\rho)/n]}$
    Continuous moderator: $\delta_{2c}\sqrt{P(1-P)(J-6) \,/\, [(1-R_2^2)\rho + (1-R_1^2)(1-\rho)/n]}$
  MDESD:
    Binary moderator: $M_v\sqrt{[(1-R_2^2)\rho + (1-R_1^2)(1-\rho)/n] \,/\, [P(1-P)Q(1-Q)(J-6)]}$
    Continuous moderator: $M_v\sqrt{[(1-R_2^2)\rho + (1-R_1^2)(1-\rho)/n] \,/\, [P(1-P)(J-6)]}$
  100*(1−α)% confidence interval: $(M_v \pm t_{\alpha/2})$ times the square-root term of the corresponding MDESD
  Degrees of freedom (v): $J - 6$

Note. CRT = cluster randomized trial; HLM = hierarchical linear model; MDESD = minimum detectable effect size difference. CRT2-1N and CRT2-1R stand for two-level CRTs with a Level 1 moderator with nonrandomly varying and randomly varying slopes, respectively. CRT2-2 stands for two-level CRTs with a Level 2 moderator.

Two-Level CRTs With a Moderator at Level 1

Under the same design, we next consider individual-level moderators allowing for two different specifications: 1) the randomly varying slope model, which assumes that the effect of the Level 1 moderator varies by the treatment status and varies randomly across the Level 2 units, and 2) the nonrandomly varying slope model, which assumes that the effect of the Level 1 moderator varies by the treatment status but does not vary further across the Level 2 units.

The Randomly Varying Slope Model

The randomly varying slope hierarchical linear model, including one treatment variable, $T_j$, and one Level 1 moderator, $S_{ij}$ ($S_{ij} \sim N(0, S_S^2)$), with a random slope is:

Level 1:

7
$$Y_{ij} = \beta_{0j} + \beta_{1j} S_{ij} + r_{ij}, \quad r_{ij} \sim N\left(0, \sigma^2_{|S}\right)$$

Level 2:

8
$$\begin{aligned} \beta_{0j} &= \gamma_{00} + \gamma_{01} T_j + u_{0j} \\ \beta_{1j} &= \gamma_{10} + \gamma_{11} T_j + u_{1j} \end{aligned}, \quad \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau^2_{00|T} & \tau_{01|T} \\ \tau_{10|T} & \tau^2_{11|T} \end{pmatrix}\right)$$

The Level 2 residuals for the intercept, $u_{0j}$, and the slope, $u_{1j}$, conditional on the treatment status, have a multivariate normal distribution with means of 0; $\tau^2_{00|T}$ and $\tau^2_{11|T}$ are the variances, and $\tau_{01|T}$ is the covariance of $u_{0j}$ and $u_{1j}$ conditional on the treatment status. The parameter of interest for the moderator effect is $\gamma_{11}$. Note that in the context of CRTs, we treat the treatment status ($T_j$) as the focal predictor and $S_{ij}$ as the moderator, and interpret $\gamma_{11}$ as the treatment effect of $T_j$ depending on $S_{ij}$. We may also interpret $\gamma_{11}$ as the effect of $S_{ij}$ on the outcome depending on the treatment status ($T_j$).

We test the moderator effect ($\gamma_{11}$) using a t-test. Based on the formula for the variance of the estimated regression coefficient of a Level 1 variable with a random slope (Snijders, 2001, 2005), we can derive the standardized noncentrality parameter as:

9
$$\lambda_{|S} = \delta_{1c}\sqrt{\frac{P(1-P)J}{(1-R_{2T}^2)\rho\omega + (1-R_1^2)(1-\rho)/n}}$$

$\rho$ is the unconditional intraclass correlation, $\rho = \tau_{00}^2/(\tau_{00}^2 + \sigma^2)$, where $\sigma^2$ and $\tau_{00}^2$ are the residual variances for the Level 1 and Level 2 intercepts in the unconditional model without any predictors. $R_1^2$ is the proportion of variance at Level 1 that is explained by the Level 1 moderator ($S_{ij}$): $R_1^2 = 1 - \sigma^2_{|S}/\sigma^2$. $R_{2T}^2$ is the proportion of the random slope variance (for $S$) explained by the treatment indicator ($T_j$): $R_{2T}^2 = 1 - \tau^2_{11|T}/\tau_{11}^2$. $\omega$ is the ratio of the between-cluster variance in the effect of $S_{ij}$ ($\tau_{11}^2$) to the between-cluster residual variance ($\tau_{00}^2$) when $\tau_{00}^2 > 0$ under the multilevel modeling framework: $\omega = \tau_{11}^2/\tau_{00}^2$. $\omega$ indexes the effect heterogeneity of the Level 1 moderator ($S_{ij}$) across Level 2 units (clusters) in the model that is not conditional on the treatment variable, $T_j$. $P$ is the proportion of clusters in the treatment group. $\delta_{1c}$ is the standardized coefficient, $\delta_{1c} = \hat{\gamma}_{11}\sqrt{S_S^2/(\tau_{00}^2 + \sigma^2)}$, where $S_S^2$ is the variance of $S_{ij}$.

The statistical power for a two-sided test is $1 - \beta = 1 - P\left[T'(J-2, \lambda_{|S}) < t_0\right] + P\left[T'(J-2, \lambda_{|S}) \le -t_0\right]$, where $t_0 = t_{1-\alpha/2, J-2}$ and the degrees of freedom are $v = J - 2$.

The MDESD for the standardized coefficient is:

10
$$\mathrm{MDESD}(\delta_{1c}) = M_v\sqrt{\frac{(1-R_{2T}^2)\rho\omega + (1-R_1^2)(1-\rho)/n}{P(1-P)J}}$$

where $M_v = t_{\alpha} + t_{1-\beta}$ for one-tailed tests with $v$ degrees of freedom ($v = J - 2$), and $M_v = t_{\alpha/2} + t_{1-\beta}$ for two-tailed tests.

The 100*(1−α)% confidence interval for $\mathrm{MDESD}(\delta_{1c})$ is given by:

11
$$\left(M_v \pm t_{\alpha/2}\right)\sqrt{\frac{(1-R_{2T}^2)\rho\omega + (1-R_1^2)(1-\rho)/n}{P(1-P)J}}$$
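The same logic carries over directly to code. A sketch for this model, under the same caveat that the names are ours and illustrative:

```r
# Power (Equation 9) and MDESD (Equation 10) for a Level 1 continuous
# moderator with a randomly varying slope (Model CRT2-1R); v = J - 2.
power_crt2_l1r <- function(es_diff, J, n, P = 0.5, rho, omega,
                           R1sq, R2Tsq = 0, alpha = 0.05) {
  df <- J - 2
  lambda <- es_diff * sqrt(P * (1 - P) * J /
              ((1 - R2Tsq) * rho * omega + (1 - R1sq) * (1 - rho) / n))
  t0 <- qt(1 - alpha / 2, df)
  1 - pt(t0, df, ncp = lambda) + pt(-t0, df, ncp = lambda)
}

mdesd_crt2_l1r <- function(J, n, P = 0.5, rho, omega, R1sq, R2Tsq = 0,
                           power = 0.80, alpha = 0.05) {
  df <- J - 2
  M <- qt(1 - alpha / 2, df) + qt(power, df)   # two-tailed multiplier M_v
  M * sqrt(((1 - R2Tsq) * rho * omega + (1 - R1sq) * (1 - rho) / n) /
           (P * (1 - P) * J))
}

# Example values from the Discussion section:
power_crt2_l1r(0.2, J = 40, n = 100, rho = 0.23, omega = 0.3, R1sq = 0.5)  # ~0.63
mdesd_crt2_l1r(J = 40, n = 100, rho = 0.23, omega = 0.3, R1sq = 0.5)       # ~0.25
```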

The Nonrandomly Varying Slope Model

In the nonrandomly varying slope model the Level 1 model is the same as that in Equation 7. However, the Level 2 model is:

12
$$\beta_{0j} = \gamma_{00} + \gamma_{01} T_j + u_{0j}, \quad u_{0j} \sim N\left(0, \tau^2_{|T}\right); \qquad \beta_{1j} = \gamma_{10} + \gamma_{11} T_j$$

The standardized noncentrality parameter is:

13
$$\lambda_{|S} = \delta_{1c}\sqrt{\frac{P(1-P)Jn}{(1-R_1^2)(1-\rho)}}$$

The degrees of freedom (see Note 3) are $v = J(n-1) - 2$.
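A corresponding sketch for this specification (again with our own illustrative names):

```r
# Power for a Level 1 continuous moderator with a nonrandomly varying slope
# (Model CRT2-1N), Equation 13; v = J(n - 1) - 2.
power_crt2_l1n <- function(es_diff, J, n, P = 0.5, rho, R1sq, alpha = 0.05) {
  df <- J * (n - 1) - 2
  lambda <- es_diff * sqrt(P * (1 - P) * J * n / ((1 - R1sq) * (1 - rho)))
  t0 <- qt(1 - alpha / 2, df)
  1 - pt(t0, df, ncp = lambda) + pt(-t0, df, ncp = lambda)
}

power_crt2_l1n(0.2, J = 40, n = 100, rho = 0.23, R1sq = 0.5)  # ~1.00
```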

Extension to Binary Moderator

When the Level 1 moderator, $S_{ij}$, is a binary variable with a proportion $Q$ in one moderator subgroup and $(1-Q)$ in the other, the standardized noncentrality parameters for the randomly varying slope model and the nonrandomly varying slope model are, respectively:

14
$$\lambda_{|S} = \delta_{1b}\sqrt{\frac{P(1-P)J}{(1-R_{2T}^2)\rho\omega + (1-R_1^2)(1-\rho)/(nQ(1-Q))}}$$

and

15
$$\lambda_{|S} = \delta_{1b}\sqrt{\frac{P(1-P)Q(1-Q)Jn}{(1-R_1^2)(1-\rho)}}$$

where $\delta_{1b}$ is the effect size (standardized mean difference), $\delta_{1b} = \hat{\gamma}_{11}/\sqrt{\tau_{00}^2 + \sigma^2}$.

The standardized noncentrality parameters, the MDESDs (for the standardized regression coefficient, $\delta_{1c}$, of a continuous moderator and the standardized mean difference, $\delta_{1b}$, of a binary moderator), and the corresponding 100*(1−α)% confidence intervals for Level 1 moderators with randomly varying and nonrandomly varying slopes are presented under Models "CRT2-1R" and "CRT2-1N" in Table 1.

Monte Carlo Simulation

To validate the standard error and power formulas we derived, we conducted a small Monte Carlo simulation. The simulation provided initial, albeit limited, evidence of close correspondence between our formula-based standard errors and power (or Type I error) and the corresponding empirical estimates when the analytic model was correctly specified. The detailed procedures and results are presented in Supplementary Materials SM2.

We note one particular finding that emerges from the simulation. For a Level 1 moderator, we varied the effect heterogeneity (ω) of the Level 1 moderator across Level 2 units from 0 to 0.8. For each dataset, we used both the randomly varying slope model and the nonrandomly varying slope model to estimate the moderator effects. When ω is set to 0, the nonrandomly varying slope model is the correctly specified analytic model and the randomly varying slope model is mis-specified. In these simulations, the randomly varying slope model tended to slightly overestimate the standard error, but its 95% CI coverage rate was as good as that of the nonrandomly varying slope model; compared with the nonrandomly varying slope model, it produced slightly lower power. When ω is set to 0.2, 0.4, 0.6, or 0.8, the nonrandomly varying slope model is mis-specified and the randomly varying slope model is correctly specified (see Tables S1-S24 in Supplementary Materials SM2). In these simulations, the randomly varying slope model produced more accurate standard error estimates and better 95% CI coverage than the nonrandomly varying slope model, whose bias in the standard error estimates and 95% CI undercoverage grew as ω increased. The bias in the standard error estimates of mis-specified models is consistent with LaHuis et al.'s (2020) findings. Figure 1 illustrates how the standard error (SE) and the 95% CI coverage rate vary with the heterogeneity coefficient (ω).

Figure 1

Standard Error (SE) and Coverage Rate of 95% CI vs. Heterogeneity Coefficient

Note. Under the assumptions: ρ = 0.2, J = 40, n = 20, $R_1^2$ = 0.4, $R_{2T}^2$ = 0.07, P = 0.5, Q = 0.5, effect size difference = 0.2.

Discussion: Comparisons Among Moderation Designs and Main Effect Designs

Contrasting Moderation Designs

As in the power analysis of the main treatment effect, the power of the moderator effect in two-level CRTs is determined by the noncentrality parameter (λ) and the critical t value ($t_0$). The critical t value ($t_0$) depends on the degrees of freedom (v), the Type I error rate (α), and the choice of a one-tailed or two-tailed test. The noncentrality parameter (λ) is the ratio of the moderator effect estimate to its standard error (SE), which is a function of the total number of clusters (J), the number of individuals per cluster (n), the proportion of clusters in the treatment group (P), the proportions of variance explained by the covariates at Level 1 and Level 2 ($R_1^2$ and $R_2^2$), and the unconditional intraclass correlation (ICC).

If the moderator is a binary variable, the power is also associated with the proportion (Q) of the sample in one moderator subgroup. The MDESD expressed as a standardized mean difference for a binary moderator is $1/\sqrt{Q(1-Q)}$ times as large as the MDESD expressed as a standardized regression coefficient for a continuous moderator when the moderator is at Level 2, or at Level 1 with a nonrandomly varying slope. When the sample is equally allocated between the moderator subgroups (Q = 0.5), the design attains its highest power (smallest MDESD) among all values of Q between 0 and 1.
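The $1/\sqrt{Q(1-Q)}$ inflation is easy to tabulate; a one-line check in R:

```r
# MDESD inflation for a binary relative to a continuous moderator, 1/sqrt(Q(1-Q))
Q <- c(0.1, 0.3, 0.5, 0.7, 0.9)
round(1 / sqrt(Q * (1 - Q)), 2)  # 3.33 2.18 2.00 2.18 3.33; minimized at Q = 0.5
```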

If the moderator is at Level 1 with a randomly varying slope, the power is also associated with the effect heterogeneity (ω) of the Level 1 moderator across Level 2 units: the MDESD increases and power decreases as ω increases. The results for the nonrandomly varying slope model for the Level 1 moderator do not contain the factor related to ω. The degrees of freedom also differ between the two models: $v = J(n-1) - 2$ for the nonrandomly varying slope model and $v = J - 2$ for the randomly varying slope model. This is because the interaction term of the treatment and moderator variables varies among the Level 1 units within each Level 2 cluster in the nonrandomly varying slope model, whereas the Level 2 random term associated with the coefficient of the moderator in the randomly varying slope model (i.e., $u_{1j}$ in Equation 8) varies only among the Level 2 clusters. As a result, when the estimation models are correctly specified for the real data, the randomly varying slope model will yield less precise estimates than the nonrandomly varying slope model. The differences in power and MDESD between the two models decrease as the number of clusters (J) increases and the effect heterogeneity (ω) decreases.

Using a mis-specified analytic model to design a study will result in either overestimating or underestimating the power. Specifically, if the randomly varying slope model is used to design studies where ω = 0, the power will be underestimated; if the nonrandomly varying slope model is used to design studies where ω > 0, the power will be overestimated. The bias in power estimates due to model mis-specification decreases as the number of clusters (J) increases and the effect heterogeneity (ω) decreases.
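The direction of the bias follows from the formulas: setting ω = 0 removes the $(1-R_{2T}^2)\rho\omega$ term from the denominator of λ and enlarges the degrees of freedom. A short R sketch makes the comparison concrete; the parameter values below are illustrative assumptions of ours, not estimates from the paper:

```r
# Design-stage power under the two Level 1 slope specifications for the same
# design; parameter values are illustrative assumptions.
J <- 40; n <- 20; P <- 0.5; rho <- 0.2; omega <- 0.3
R1sq <- 0.5; R2Tsq <- 0; d <- 0.2; alpha <- 0.05

pow <- function(lambda, df) {
  t0 <- qt(1 - alpha / 2, df)
  1 - pt(t0, df, ncp = lambda) + pt(-t0, df, ncp = lambda)
}

lam_r <- d * sqrt(P * (1 - P) * J /
           ((1 - R2Tsq) * rho * omega + (1 - R1sq) * (1 - rho) / n))  # Equation 9
lam_n <- d * sqrt(P * (1 - P) * J * n / ((1 - R1sq) * (1 - rho)))     # Equation 13

c(random = pow(lam_r, J - 2), nonrandom = pow(lam_n, J * (n - 1) - 2))
# If omega > 0 in truth, designing with the nonrandomly varying model overstates
# power; if omega = 0, designing with the random model understates it.
```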

To make these comparisons more concrete, we compare the MDESD and power among the three moderation designs using several examples. Suppose a team of researchers is designing a two-level CRT to test the efficacy of a school-based intervention on student achievement, and is interested in both student-level and school-level moderator effects. They approach the moderator power analyses from two perspectives: 1) what is the MDESD given power of 0.80, and 2) what is the power for a moderation effect size of 0.20. Based on the literature (Bloom, Richburg-Hayes, & Black, 2007; Hedges & Hedberg, 2007, 2013), they assume an intraclass correlation coefficient (ρ) of 0.23 and proportions of variance explained by the covariates at Level 1 and Level 2 of 0.5 ($R_1^2 = R_2^2 = 0.5$). To be conservative, they assume the proportion of variance between schools in the effect of the student-level moderator explained by the school-level predictor to be 0 ($R_{2T}^2 = 0$). The effect heterogeneity (ω) of the student-level moderator across schools is assumed to be 0.3 for the randomly varying slope model, which is equivalent to an effect size variability of 0.069 (= ρ × ω = 0.23 × 0.3). They use a balanced design with equal assignment of schools to the treatment and control groups (P = 0.5) and 100 students per school. They are interested in the results for both a binary and a continuous moderator; for the binary case, they assume half of the sample is in one moderator subgroup (Q = 0.5). Table 2 shows the MDESD and power for a total of J = 40 and J = 80 schools under the above assumptions.

Table 2

MDESD and Statistical Power of Two-Level CRTs

Level of moderator | Slope of lower level moderator | MDESD, binary (J = 40 / J = 80) | MDESD, continuous (J = 40 / J = 80) | Power, binary (J = 40 / J = 80) | Power, continuous (J = 40 / J = 80)
1 | Nonrandomly varying | 0.11 / 0.08 | 0.06 / 0.04 | 1.00 / 1.00 | 1.00 / 1.00
1 | Randomly varying | 0.26 / 0.18 | 0.25 / 0.17 | 0.56 / 0.86 | 0.63 / 0.91
2 | N/A | 0.67 / 0.45 | 0.34 / 0.23 | 0.13 / 0.24 | 0.39 / 0.70

Note. MDESD = minimum detectable effect size difference. Under the assumptions: n = 100, ρ = 0.23, P = 0.5, Q = 0.5, $R_1^2$ = 0.5, $R_2^2$ = 0.5, $R_{2T}^2$ = 0, and ω = 0.3 for the randomly varying slope design; power = 0.8 for the calculation of MDESD; effect size difference = 0.2 for the calculation of power; two-sided test with α = .05.

Four findings from Table 2 stand out. First, a design always has a smaller MDESD, or greater power for a fixed effect size, when the Level 2 sample size is larger. Second, the MDESD is larger, or the power smaller, for a fixed effect size when the moderator is at the school level rather than the student level. Third, when the moderator is at the student level, the nonrandomly varying slope model has a smaller MDESD, or greater power for a fixed effect size, than the randomly varying slope model. Finally, the MDESD defined as the standardized mean difference for the binary moderator with Q = 0.5 is always twice the MDESD defined as the standardized coefficient for the continuous moderator when the moderator is at the school level, or at the student level with a nonrandomly varying slope.

Comparing Moderation Designs With Main Effect Designs

We examine the ratio of the MDESD for the moderator analysis to the minimum detectable effect size (MDES) for the main effect analysis. The MDES formula for a two-level cluster randomized design with one Level 1 covariate and two Level 2 covariates is as follows (Bloom, 2006):

16
$$\mathrm{MDES} = M_{J-4}\sqrt{\frac{\rho(1-R_2^2)}{P(1-P)J} + \frac{(1-\rho)(1-R_1^2)}{P(1-P)Jn}}$$

where the multiplier $M_{J-4} = t_{\alpha/2} + t_{1-\beta}$ with $J - 4$ degrees of freedom.

We use the MDESD formulas for binary moderators in Table 1. The ratio of the MDESD for a Level 2 binary moderator to the MDES of the main effect when there is no Level 1 covariate is:

17
$$\frac{\mathrm{MDESD}_{\mathrm{CRT2\text{-}2}}}{\mathrm{MDES}} = \frac{M_{J-6}}{M_{J-4}}\sqrt{\frac{J}{(J-6)Q(1-Q)}}$$

The result in Equation 17 is consistent with Bloom (2005), except that Equation 17 includes an extra factor $\sqrt{J/(J-6)}$. Bloom (2005) derived the standard error of the moderator effect for the population using the sample size J, while we derived the standard error for the sample by adjusting the degrees of freedom to J − 6 (our Monte Carlo simulation suggested that our formulas work better, especially when the sample size is small). $\mathrm{MDESD}_{\mathrm{CRT2\text{-}2}}/\mathrm{MDES}$ is around 2 for a balanced design (Q = 0.5) with a large sample size ($M_{J-6}/M_{J-4}$ is close to 1 when J is larger than 10, e.g., $M_{J-6}/M_{J-4}$ = 1.01 when J = 11). This result indicates that the MDESD for a Level 2 moderator is about twice as large as the MDES of the main effect using the same set of covariates in the same study. This is analogous to ordinary least squares (OLS) regression analysis of completely randomized trials, which do not involve hierarchical data: the Level 2 moderator effect is more difficult to detect than the main effect, just as in the OLS analysis of completely randomized trials.
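Equation 17 can be evaluated directly; a small R check (with an illustrative function name of ours) shows the ratio approaching 2 for a balanced design as J grows:

```r
# Ratio of Level 2 moderator MDESD to main-effect MDES (Equation 17),
# two-tailed alpha = .05, power = .80.
ratio_mdesd_mdes <- function(J, Q = 0.5, alpha = 0.05, power = 0.80) {
  M <- function(df) qt(1 - alpha / 2, df) + qt(power, df)  # multiplier M_v
  (M(J - 6) / M(J - 4)) * sqrt(J / ((J - 6) * Q * (1 - Q)))
}
sapply(c(20, 40, 80, 200), ratio_mdesd_mdes)  # ~2.41 2.17 2.08 2.03
```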

The situation is different for the analysis of a Level 1 moderator effect, which may have greater power than the main effect analysis. The MDES formula for the main effect in Equation 16 includes an additional component associated with the Level 2 residual variance that does not shrink with the sample size at the individual level (n), whereas the MDESD formula for a Level 1 binary moderator with a nonrandomly varying slope in Table 1 includes only the component associated with the Level 1 residual variance. As a result, n is more influential on the MDESD than on the MDES.

Figure 2 shows the relationship between power and cluster sample size, comparing the main treatment effect analysis with moderation analyses for binary Level 1 and Level 2 moderators.

Figure 2

Power vs. Group Sample Size

Note. Panel A: ρ = 0.20. Panel B: ρ = 0.10. Under the assumptions: n = 20, $R_1^2$ = 0.5, $R_2^2$ = 0.5, P = 0.5, Q = 0.5, $R_{2T}^2$ = 0, and ω = 0.3 for the randomly varying slope design; effect size (standardized mean difference) = 0.2; effect size difference (standardized mean difference) = 0.2; two-sided test with α = .05.

The figure is based on the following assumptions. The intraclass correlation coefficient (ρ) is 0.2 in Figure 2A and 0.1 in Figure 2B. The proportions of variance explained by the covariates at Level 1 and Level 2 for the main effect analysis are 0.5 ($R_1^2 = R_2^2 = 0.5$); for the Level 2 moderation analysis, $R_2^2$ = 0.5 at Level 2, and for the Level 1 moderation analysis, $R_1^2$ = 0.5 at Level 1. The proportion of variance between clusters in the effect of the student-level moderator explained by the school-level predictor is set to 0 ($R_{2T}^2$ = 0). The effect heterogeneity (ω) of the student-level moderator across schools is assumed to be 0.3 for the randomly varying slope model, which is equivalent to an effect size variability of 0.06. We assume a balanced design with equal assignment of schools to the treatment and control groups (P = 0.5) and 20 students per school. In addition, half of the sample is in one moderator subgroup (Q = 0.5). For comparison purposes, we assume the effect size for the main treatment effect and the effect size difference for the moderator effect (standardized mean difference), each to be detected using a two-sided test with α = .05, are both 0.20; the latter is equivalent to effect sizes of 0.3 and 0.1 for the two moderator subgroups. The resulting power curves are for the moderation analyses with a binary Level 2 moderator (grey solid line), a binary Level 1 moderator with a randomly varying slope (long-dashed black line), a binary Level 1 moderator with a nonrandomly varying slope (short-dotted black line), and the main treatment effect analysis (black solid line).

Figures 2A and 2B indicate that the power to detect a binary Level 1 moderator effect increases with the group sample size. The power for a binary Level 1 moderator with a nonrandomly varying slope (short-dotted black line) is greater than that for a binary Level 1 moderator with a randomly varying slope (long-dashed black line), and it also exceeds the power for the main treatment effect analysis (black solid line) in Figure 2A (ρ = 0.20). Comparing Figure 2A (ρ = 0.20) with Figure 2B (ρ = 0.10) shows that the power to detect the effect of a binary Level 1 moderator with a nonrandomly varying slope is greater when the intraclass correlation is larger. This is also apparent in the MDESD formulas, which contain a factor of (1 − ρ); hence, as ρ increases, the MDESD decreases and the power increases. Note that across all scenarios the power for a binary Level 2 moderator effect (grey solid line) is the smallest.
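The trends in Figure 2 can be recomputed from the Table 1 formulas. The following R sketch is ours, not the code used to draw the figure; it evaluates the four power curves at a few values of J under the Panel A assumptions:

```r
# Illustrative recomputation of the Figure 2 power curves (Panel A: rho = .2;
# set rho = .1 for Panel B). All effect sizes are 0.2; Q = P = 0.5.
pow <- function(lambda, df, alpha = 0.05) {
  t0 <- qt(1 - alpha / 2, df)
  1 - pt(t0, df, ncp = lambda) + pt(-t0, df, ncp = lambda)
}
rho <- 0.2; n <- 20; P <- 0.5; Q <- 0.5
R1sq <- 0.5; R2sq <- 0.5; R2Tsq <- 0; omega <- 0.3; d <- 0.2
for (J in c(20, 40, 60, 80)) {
  main <- pow(d * sqrt(P * (1 - P) * J /
            (rho * (1 - R2sq) + (1 - rho) * (1 - R1sq) / n)), J - 4)
  l2   <- pow(d * sqrt(P * (1 - P) * Q * (1 - Q) * (J - 6) /
            ((1 - R2sq) * rho + (1 - R1sq) * (1 - rho) / n)), J - 6)
  l1r  <- pow(d * sqrt(P * (1 - P) * J /
            ((1 - R2Tsq) * rho * omega +
             (1 - R1sq) * (1 - rho) / (n * Q * (1 - Q)))), J - 2)
  l1n  <- pow(d * sqrt(P * (1 - P) * Q * (1 - Q) * J * n /
            ((1 - R1sq) * (1 - rho))), J * (n - 1) - 2)
  cat(sprintf("J = %2d: main %.2f | L2 %.2f | L1 random %.2f | L1 nonrandom %.2f\n",
              J, main, l2, l1r, l1n))
}
```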

Conclusion

The main findings are summarized as follows. First, the effects of the sample sizes at different levels, the level at which the moderator is assessed, the slopes of Level 1 moderators (randomly vs. nonrandomly varying), the distribution of moderators (binary vs. continuous), and the inclusion of covariates on power and MDESD in two-level CRTs are consistent with those in three-level CRTs (Dong, Kelcey, & Spybrook, 2018). For instance, the sample size at the higher level (e.g., Level 2) is more critical than the sample size at the lower level (e.g., Level 1) for increasing the power to detect the effects of a Level 2 moderator and of a Level 1 moderator with a randomly varying slope. However, the sample size at Level 1 is as important as that at Level 2 for increasing the power to detect the effect of a Level 1 moderator with a nonrandomly varying slope. Furthermore, the MDESD is larger, or the power smaller, when the moderator is at the higher level; in other words, studies are more likely to be well-powered to detect Level 1 moderator effects than Level 2 moderator effects. Moreover, the MDESD measured by the standardized mean difference for a binary moderator is always $1/\sqrt{Q(1-Q)}$ times the MDESD measured by the standardized coefficient for a continuous moderator when the moderator is at Level 2, or at Level 1 with a nonrandomly varying slope. In addition, including Level 1 covariates can improve power for both Level 1 and Level 2 moderator effects; including Level 2 covariates may improve power only if the Level 2 covariates are in the intercept model for a Level 2 moderator, or in the slope model to explain the heterogeneity of a Level 1 moderator.

Second, when the estimation models are correctly specified for the real data, the randomly varying slope model will yield less precise estimates than the nonrandomly varying slope model. The differences in power and MDESD between the two models decrease as the number of clusters (J) increases and the effect heterogeneity (ω) decreases.

Lastly, a mismatch between the design model and the real data will result in either overestimating or underestimating the power. Specifically, if the randomly varying slope model is used to design studies where ω = 0, the power will be underestimated; if the nonrandomly varying slope model is used to design studies where ω > 0, the power will be overestimated. The bias in power estimates due to model mismatch decreases as the number of clusters (J) increases and the effect heterogeneity (ω) decreases. However, it is generally preferable to use the randomly varying slope model to design cross-level moderation studies unless there is strong theory or prior knowledge that the slope of the lower-level moderator does not vary across clusters.

This study focused on two-level CRTs; there are many important directions for further work. First, extending the work to other designs is necessary, including multisite randomized trials (MRTs), which are also common designs for evaluating the effectiveness of programs (Spybrook, Shi, & Kelcey, 2016), and longitudinal designs. Second, a well-conducted power analysis relies heavily on accurate empirical estimates of the design parameters. Hence, more empirical studies of design parameters such as the ICC, the effect heterogeneity of Level 1 covariates, and meaningful moderator effect size differences are important as we move forward.

Notes

1) The software can be accessed from the website: https://www.causalevaluation.org/

2) Generally, $v = J - g^* - 4$, where $g^*$ is the number of Level 2 covariates (excluding the treatment variable, the moderator, and the moderator × treatment interaction).

3) Generally, $v = J(n-1) - 2 - g^*$, where $g^*$ is the number of Level 1 covariates (excluding the moderator).

Funding

This project has been funded by the National Science Foundation [1437679, 1437692, 1437745, 1913563, 1552535, 1760884]. The opinions expressed herein are those of the authors and not the funding agency.

Acknowledgments

The authors have no additional (i.e., non-financial) support to report.

Competing Interests

The authors have declared that no competing interests exist.

Supplementary Materials

For this article the following Supplementary Materials are available via the PsychArchives repository (for access see Index of Supplementary Materials below):

  • SM1: Derivations of power and MDESD formulas.

  • SM2: Procedures and results of Monte Carlo simulation (Tables S1-S24).

Index of Supplementary Materials

  • Dong, N., Spybrook, J., Kelcey, B., & Bulus, M. (2021). Supplementary materials to: Power analyses for moderator effects with (non)randomly varying slopes in cluster randomized trials [Formulas, Tables]. PsychOpen GOLD. https://doi.org/10.23668/psycharchives.4947

References

  • Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. New York, NY, USA: SAGE Publications.

  • Bloom, H. S. (2005). Randomizing groups to evaluate place-based programs. In H. S. Bloom (Ed.), Learning more from social experiments: Evolving analytic approaches (pp. 115-172). New York, NY, USA: Russell Sage Foundation.

  • Bloom, H. S. (2006). The core analytics of randomized experiments for social research (MDRC working papers on research methodology). Retrieved from http://www.mdrc.org/publications/437/full.pdf

  • Bloom, H. S., Richburg-Hayes, L., & Black, A. R. (2007). Using covariates to improve precision for studies that randomize schools to evaluate educational interventions. Educational Evaluation and Policy Analysis, 29(1), 30-59. https://doi.org/10.3102/0162373707299550

  • Dong, N., Kelcey, B., & Spybrook, J. (2018). Power analyses of moderator effects in three-level cluster randomized trials. Journal of Experimental Education, 86(3), 489-514. https://doi.org/10.1080/00220973.2017.1315714

  • Dong, N., & Maynard, R. A. (2013). PowerUp!: A tool for calculating minimum detectable effect sizes and minimum required sample sizes for experimental and quasi-experimental design studies. Journal of Research on Educational Effectiveness, 6(1), 24-67. https://doi.org/10.1080/19345747.2012.673143

  • Hedges, L. V., & Hedberg, E. (2007). Intraclass correlation values for planning group randomized trials in education. Educational Evaluation and Policy Analysis, 29(1), 60-87. https://doi.org/10.3102/0162373707299706

  • Hedges, L. V., & Hedberg, E. (2013). Intraclass correlations and covariate outcome correlations for planning two- and three-level cluster-randomized experiments in education. Evaluation Review, 37(6), 445-489. https://doi.org/10.1177/0193841X14529126

  • Kreft, I. G. G., de Leeuw, J., & Aiken, L. S. (1995). The effect of different forms of centering in Hierarchical Linear Models. Multivariate Behavioral Research, 30(1), 1-21. https://doi.org/10.1207/s15327906mbr3001_1

  • LaHuis, D., Jenkin, D. R., Hartman, M. J., Hakoyama, S., & Clark, P. (2020). The effects of misspecifying the random part of multilevel models. Methodology, 16(3), 224-240. https://doi.org/10.5964/meth.2799

  • Mathieu, J. E., Aguinis, H., Culpepper, S. A., & Chen, G. (2012). Understanding and estimating the power to detect cross-level interaction effects in multilevel modeling. The Journal of Applied Psychology, 97(5), 951-966. https://doi.org/10.1037/a0028380

  • Moerbeek, M. (2006). Power and money in cluster-randomized trials: When is it worth measuring a covariate? Statistics in Medicine, 25(15), 2607-2617. https://doi.org/10.1002/sim.2297

  • Moerbeek, M., & Maas, C. J. M. (2005). Optimal experimental designs for multilevel logistic models with two binary predictors. Communications in Statistics. Theory and Methods, 34(5), 1151-1167. https://doi.org/10.1081/STA-200056839

  • Moerbeek, M., van Breukelen, G. J. P., & Berger, M. P. F. (2001). Optimal experimental designs for multilevel models with covariates. Communications in Statistics. Theory and Methods, 30(12), 2683-2697. https://doi.org/10.1081/STA-100108453

  • Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA, USA: Sage Publications.

  • Raudenbush, S. W., Martinez, A., & Spybrook, J. (2007). Strategies for improving precision in group-randomized experiments. Educational Evaluation and Policy Analysis, 29(1), 5-29. https://doi.org/10.3102/0162373707299460

  • Snijders, T. (2001). Sampling. In A. H. Leyland & H. Goldstein (Eds.), Multilevel modeling of health statistics (pp. 159-173). New York, NY, USA: John Wiley.

  • Snijders, T. A. B. (2005). Power and sample size in multilevel linear models. In: B. S. Everitt & D. C. Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 3, pp. 1570–1573). Hoboken, NJ, USA: John Wiley & Sons.

  • Spybrook, J., Shi, R., & Kelcey, B. (2016). Progress in the past decade: An examination of the precision of cluster randomized trials funded by the U.S. Institute of Education Sciences. International Journal of Research & Method in Education, 39(3), 255-267. https://doi.org/10.1080/1743727X.2016.1150454

  • Spybrook, J., Kelcey, B., & Dong, N. (2016). Power for detecting treatment by moderator effects in two and three-level cluster randomized trials. Journal of Educational and Behavioral Statistics, 41(6), 605-627. https://doi.org/10.3102/1076998616655442