Empirical research across a broad range of disciplines aims at assessing the effectiveness of a treatment or an intervention. The interest might lie in average effects on a certain outcome or in conditional effects, given values of categorical or continuous covariates. For example, a researcher might be interested in the average effect of an extended day program on student achievement (Meyer & Van Klaveren, 2013) or the conditional effects of an anxiety intervention on symptoms and functioning measures of females versus males (Grubbs et al., 2015). Currently, most researchers use linear regression or ANOVA (analysis of variance; Rutherford, 2001; Weiss, 2006) to analyze their data. However, these methods yield only separate regression coefficients, whereas the effects of interest are often more complicated functions of these parameters. This makes interpretation of results cumbersome.
The EffectLiteR approach (Mayer & Dietzfelbinger, 2019; Mayer et al., 2016) provides a framework and an R package (R Core Team, 2020) for the estimation of average and conditional effects of a discrete treatment variable on a continuous outcome variable, conditioning on categorical and continuous covariates. This approach uses definitions of average and conditional effects from the causal inference literature (Imbens & Rubin, 2015; Steyer et al., 2014). In addition to the computation of these effects of interest and their standard errors, Wald or F-tests are used to test several hypotheses of interest, for example that (all) average effects are equal to zero, or that (all) conditional effects are equal to zero in the population. A distinctive feature of the EffectLiteR approach is its use of stochastic group weights, which are embedded into the calculation of these effects (see, e.g., Mayer & Thoemmes, 2019). In the behavioral and social sciences, treatment group sizes often vary across samples. Stochastic group sizes should be considered whenever the exact group sizes are not determined before conducting the experiment. For an in-depth discussion of why it is important to include stochastic group sizes in general linear models and structural equation models, see Mayer and Thoemmes (2019).
A limitation of the current EffectLiteR approach is that it does not yet allow for testing informative hypotheses (Hoijtink, 2012; Silvapulle & Sen, 2005). For example, assume there are three groups in a randomized experiment (without further covariates), where one is a control group ($X = 0$), one receives a standard treatment ($X = 1$), and one receives a novel treatment ($X = 2$), and we are interested in the group means of the dependent variable $Y$. The researcher expects that the group mean of the novel treatment, $\mu_2$, is greater than the group mean of the standard treatment, $\mu_1$, and that the group mean of the standard treatment is greater than the group mean of the control group, $\mu_0$. Within the framework of classical null hypothesis testing, a two-step procedure has to be followed. First, we have to test $H_0: \mu_0 = \mu_1 = \mu_2$ against $H_a$: not $H_0$, that is, not all three means are equal. If we can reject $H_0$ in favor of $H_a$, a second step follows, in which we execute pairwise comparisons to determine which means are equal and which means are not equal. Thus, step one alone does not give us the answer we are looking for, and step two entails multiple testing, which brings along the risk of an inflated type I error rate. In an informative test, the a priori expectation of the order of the means is explicitly taken into account. For example, we can test the null hypothesis $H_0: \mu_0 = \mu_1 = \mu_2$ against the ordered hypothesis $H_a: \mu_0 < \mu_1 < \mu_2$ in a single step. The statistical approach that allows for these informative hypothesis tests is called constrained statistical inference (Hoijtink, 2012; Silvapulle & Sen, 2005). It is available in several R packages, including restriktor (Vanbrabant, 2020), ic.infer (Grömping, 2010) and bain (Gu et al., 2020).
The goal of this paper is to integrate informative hypothesis testing into the EffectLiteR approach. We will show how F- and Wald tests can be used in the regular as well as the informative case, where the quantities of interest can be expressed as a linear or non-linear function of the parameters. The paper is organized as follows. First, an introduction to the EffectLiteR framework will be given by means of a simulated data example along the lines of the original EffectLiteR paper (Mayer et al., 2016). It will be shown how (non-informative) hypotheses about effects of interest can currently be tested via standard F- or linear Wald tests. Second, it will be explained how to incorporate informative hypotheses into the EffectLiteR approach. For this purpose, the $\bar{F}$- and generalized linear Wald test as well as two procedures to calculate their p-values will be presented. Finally, stochastic group sizes will be included, which implies two further challenges. First, parameter estimation will be carried out by means of structural equation modeling (SEM) via the R package lavaan (Rosseel, 2012). Second, the generalized non-linear Wald test will be introduced. The paper concludes with a short artificial data example from the constrained statistical inference literature. Accompanying Supplementary Materials are provided and will be referenced throughout the paper.
Introduction Into the EffectLiteR Framework
In the following, the background of the EffectLiteR framework is explained. This will be illustrated by means of a non-randomized experiment, whose data of size are generated along the lines of Mayer et al. (2016). In the Supplementary Materials, the associated R script (“A-exampleI.R”) as well as the data set itself (“B-exampleIData.rds”) and detailed explanations regarding the R script (“C-explanation.pdf”) can be found. Three treatment groups are considered, namely a group receiving innovative therapy ($X = 2$), a group receiving conventional therapy ($X = 1$) and a wait-list control group ($X = 0$). As covariates, the dichotomous variable gender with values male ( ) and female ( ) and the continuous covariate mental health pre-test ($Z$) are considered. The outcome variable is the post-test of mental health ($Y$). Mental health will be treated as a manifest variable. Table 1 shows the relative treatment group frequencies, the means of the pre-test $Z$ and the raw as well as the adjusted means of mental health $Y$.
Table 1
Raw Means of Mental Health Post-Test , Adjusted Means of Mental Health Post-Test , Raw Means of Mental Health Pre-Test and Relative Treatment Group Frequencies
Group | ||||
---|---|---|---|---|
Note. The last column of the table shows the expectations and probabilities where we ignore . The last row shows the expectations and probabilities where we ignore .
When all variables are observed, the EffectLiteR approach first estimates a linear regression model. Typically, this regression model contains not only the main effects of and , but also the interaction effects of , and as well as the three-way interaction effect of . However, when defining the model via the effectLite() command, the researcher can specify any preferences regarding the interactions by means of the interactions= argument. In our case, we fit the regression model including all possible interactions:
1
Table 2
Estimated Regression Parameters (Simulated Data)
1.667 | -0.285 | 0.029 | 0.300 | -0.252 | 0.591 | 0.261 | -0.546 | 0.128 | 0.322 | 0.032 | -0.234 |
All the effects of interest in the EffectLiteR approach will be written as a function of the estimated regression parameters and the (conditional) expectations of the covariates. In the next section, possible hypotheses of interest are discussed.
Formulation of Hypotheses
With a general linear hypothesis test, hypotheses concerning regression parameters like versus can be tested. In contrast, EffectLiteR allows for testing hypotheses concerning average and conditional effects like versus . Here, is the average effect of treatment compared to the control group . In this paper, we aim at combining EffectLiteR and informative hypotheses, which will allow for testing hypotheses like versus . Regarding our illustrative example, one may be interested in the following hypotheses:
2
EffectLiteR Model
The EffectLiteR model is based on intercept and effect functions (see also Steyer, 2021). Considering three treatment groups, a binary variable ( ) with values and as well as a continuous covariate ( ), these can be formulated as follows:
3
To be able to calculate the average effect, we need the unconditional expectation of , the unconditional expectation of and the unconditional expectation of . Note that for binary , the unconditional expectation equals the probability of . The following calculations can also be found in the document “D-manualCalculations.pdf” in the Supplementary Materials, which contains a more exhaustive overview of the calculations done by EffectLiteR including all intermediate steps. The unconditional expectation of can be calculated as follows:
4
5
6
7
8
9
Table 3
Adjusted Means and Average Effects Estimates With Standard Errors in Parentheses (Simulated Data)
Group | Fixed group sizes: Adjusted mean | Fixed group sizes: Average effect | Stochastic group sizes: Adjusted mean | Stochastic group sizes: Average effect |
---|---|---|---|---|
Control ($X = 0$) | 1.601 (0.099) | | 1.601 (0.099) | |
Conventional therapy ($X = 1$) | 1.683 (0.127) | 0.082 (0.161) | 1.683 (0.128) | 0.082 (0.162) |
Innovative therapy ($X = 2$) | 1.875 (0.074) | 0.274 (0.124) | 1.875 (0.075) | 0.274 (0.124) |
Note. The average effect of conventional therapy ($X = 1$ versus $X = 0$) as well as the average effect of innovative therapy ($X = 2$ versus $X = 0$) equal the differences of the respective adjusted means: $1.683 - 1.601 = 0.082$ and $1.875 - 1.601 = 0.274$.
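The step from an effect function to an average effect can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coefficient names (`g00`, `g01`, `g10`, `g11`), the helper functions and all numeric values are hypothetical and are not taken from the simulated data or from the EffectLiteR implementation. The sketch shows that averaging a linear effect function over a binary and a continuous covariate requires only the unconditional expectations of the covariates and of their product, and that the result equals the probability-weighted mean of the conditional effects.

```python
def average_effect(coefs, p_k1, ez_k0, ez_k1):
    """Average a (hypothetical) effect function g(K, Z) over the covariates.

    g(K, Z) = g00 + g01 * Z + g10 * K + g11 * K * Z, with K in {0, 1}.
    """
    g00, g01, g10, g11 = coefs
    e_k = p_k1                                # E(K) = P(K = 1) for binary K
    e_z = (1 - p_k1) * ez_k0 + p_k1 * ez_k1   # E(Z) via the law of total expectation
    e_kz = p_k1 * ez_k1                       # E(K * Z) = P(K = 1) * E(Z | K = 1)
    return g00 + g01 * e_z + g10 * e_k + g11 * e_kz


def conditional_effect(coefs, k, ez_given_k):
    """Effect given K = k, averaged over Z within that stratum."""
    g00, g01, g10, g11 = coefs
    return g00 + g10 * k + (g01 + g11 * k) * ez_given_k


# Hypothetical values, chosen only for illustration.
coefs = (0.30, -0.10, 0.20, 0.05)
p_k1, ez_k0, ez_k1 = 0.4, 1.0, 2.0

ae = average_effect(coefs, p_k1, ez_k0, ez_k1)
ae_from_strata = ((1 - p_k1) * conditional_effect(coefs, 0, ez_k0)
                  + p_k1 * conditional_effect(coefs, 1, ez_k1))
print(round(ae, 3), round(ae_from_strata, 3))
```

Both routes give the same number because the effect function is linear in the covariate terms, so its expectation depends on the covariates only through their expectations.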
To test the hypotheses of interest, EffectLiteR makes use of the F-statistic or the linear Wald test. Both will be explained below.
Hypothesis Testing
The F-test statistic can be calculated as follows (Seber & Lee, 2012, p. 100):
10
Using the same notation, the linear Wald test can be defined as (Fahrmeir et al., 2013, p. 663):
11
12
13
14
15
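For orientation, the textbook forms of the two statistics can be written in generic notation. The symbols here are assumptions and may differ from those used in the numbered equations: $\hat{\beta}$ is the vector of $p$ estimated regression coefficients, $X_d$ the $n \times p$ design matrix, $R$ a $q \times p$ hypothesis matrix with right-hand side $r$, and $\hat{\sigma}^2 = \mathrm{RSS}/(n - p)$ the residual variance estimate.

```latex
\[
F \;=\; \frac{(R\hat{\beta} - r)^{\top}\,\bigl[R\,(X_d^{\top}X_d)^{-1}R^{\top}\bigr]^{-1}(R\hat{\beta} - r)}{q\,\hat{\sigma}^{2}},
\qquad
W \;=\; (R\hat{\beta} - r)^{\top}\,\bigl[R\,\widehat{\operatorname{Cov}}(\hat{\beta})\,R^{\top}\bigr]^{-1}(R\hat{\beta} - r).
\]
```

With $\widehat{\operatorname{Cov}}(\hat{\beta}) = \hat{\sigma}^{2}(X_d^{\top}X_d)^{-1}$, the two statistics are related by $W = qF$. Under the null hypothesis $R\beta = r$ and normally distributed errors, $F$ follows an F-distribution with $q$ and $n - p$ degrees of freedom, while $W$ is asymptotically $\chi^2$-distributed with $q$ degrees of freedom.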
To test against in the EffectLiteR framework, has to be defined as follows. can be written as , where equals:
16
17
18
Integrating Informative Hypotheses into the EffectLiteR Framework
Informative hypotheses reflect prior order expectations regarding means, regression coefficients or combinations thereof. These hypotheses can be constructed by means of different constraints (see, e.g., Hoijtink, 2012). The most important constraints are equality constraints like and inequality constraints like or . Both large-sample test statistics (generalized Wald, score or likelihood ratio) and small-sample test statistics ($\bar{F}$, $\bar{E}^2$) have been developed (Barlow et al., 1972; Robertson et al., 1988; Silvapulle & Sen, 2005). The $\bar{F}$-test was first introduced by Kudô (1963) and generalized by Wolak (1987). The $\bar{F}$-test and the generalized linear Wald test will be explained in more detail below.
$\bar{F}$-Statistic
The $\bar{F}$-statistic can be defined as follows (Silvapulle & Sen, 2005, p. 29):
19
If the constraints are linear, can be found using quadratic programming, for example via the subroutine solve.QP() in the R package quadprog (Turlach & Weingessel, 2019). In a simple regression example with three coefficients, the unrestricted estimates may be . However, if , will be constrained to be greater than zero in our estimation procedure, leading to the restricted estimates that may look like . Note that the estimates of and also change slightly, even though they satisfied the constraints right from the beginning. Furthermore, the restricted estimation will lead to different residuals that are used in the informative test statistic.
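The effect described in this paragraph, where imposing one inequality constraint also shifts the estimates that already satisfied their constraints, can be reproduced with a minimal Python sketch. It assumes a single linear inequality constraint $a^{\top}b \ge 0$, for which the quadratic program has a closed-form solution; a general set of constraints would need a QP solver such as solve.QP(). The function name `restrict_single_inequality`, the estimates and the covariance matrix are hypothetical.

```python
def restrict_single_inequality(b_hat, v, a):
    """Minimise (b - b_hat)' V^{-1} (b - b_hat) subject to a'b >= 0.

    b_hat: unrestricted estimates, v: their covariance matrix (list of rows),
    a: coefficients of the single inequality constraint.
    """
    slack = sum(ai * bi for ai, bi in zip(a, b_hat))
    if slack >= 0:
        # The constraint already holds, so the unrestricted estimate is optimal.
        return list(b_hat)
    size = len(a)
    va = [sum(v[i][j] * a[j] for j in range(size)) for i in range(size)]
    ava = sum(ai * vi for ai, vi in zip(a, va))
    # Active constraint: project b_hat onto the boundary a'b = 0 in the V^{-1} metric.
    return [bi - vi * slack / ava for bi, vi in zip(b_hat, va)]


# Hypothetical unrestricted estimates; the third coefficient violates b3 >= 0.
b_hat = [1.2, 0.5, -0.3]
v = [[0.04, 0.01, -0.02],
     [0.01, 0.05, 0.01],
     [-0.02, 0.01, 0.09]]
a = [0.0, 0.0, 1.0]

b_tilde = restrict_single_inequality(b_hat, v, a)
print([round(x, 4) for x in b_tilde])
```

The third coefficient is moved to the boundary of the constraint, and the first two move as well because their estimates are correlated with the third through the off-diagonal entries of the covariance matrix.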
Generalized Linear Wald Test
The generalized linear Wald statistic is a generalization of the regular Wald test and can be found in Silvapulle and Sen (2005, p. 154):
20
In our example, the corrected Wald test as well as the $\bar{F}$-test yield the same results, as expected: and . In contrast, the uncorrected Wald test produces the slightly different result . In the following, the calculation of p-values for the generalized linear Wald and $\bar{F}$-statistic is explained.
p-Values
To calculate the p-values of the $\bar{F}$- and generalized linear Wald test, two approaches can be followed, which should yield similar results. These procedures are described in detail in the document “E-pValues.pdf” that can be found in the Supplementary Materials. Essentially, the first approach calculates the p-value by means of simulation (Silvapulle & Sen, 2005, p. 98). A large number of normal (or non-normal) random data sets are generated under the null hypothesis and their test statistics are calculated. The proportion of simulated test statistics that exceed the originally observed test statistic then serves as an estimate of the p-value.
The second approach is more economical, as the mixing weights are estimated first, also by means of simulation, and then used directly to calculate the p-value (Silvapulle & Sen, 2005, p. 79). Here, a large number of (normal) random data sets are generated and the parameters of interest are calculated for each. The proportions of times that each possible number of constraints is satisfied serve as estimates of the weights.
Estimating the p-value for the obtained $\bar{F}$-value yields (simulation approach) and (weights approach). This allows us to reject hypothesis in favor of hypothesis . A complete ordering of average effects of the treatment groups therefore seems to be in accordance with our data. This is an illustration of how informative hypothesis testing has greater power compared to classical null hypothesis testing, since the latter could not reject in favor of .
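The two approaches can be compared in a deliberately simplified Python sketch. It is not the procedure from “E-pValues.pdf”: it assumes two parameters with identity covariance and known variance, tests that both are zero against both being non-negative, and uses a chi-bar-square mixture rather than the small-sample distribution of the $\bar{F}$-statistic. All function names and the observed statistic `t_obs` are hypothetical.

```python
import math
import random


def chi2_sf(df, x):
    """P(chi-square with integer df >= x); df = 0 is a point mass at zero."""
    if x <= 0:
        return 1.0
    if df == 0:
        return 0.0
    half = x / 2.0
    if df % 2 == 0:
        total, k = math.exp(-half), 2
    else:
        total, k = math.erfc(math.sqrt(half)), 1
    while k < df:
        # Recurrence from k to k + 2 degrees of freedom.
        total += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return total


def simulate_null(q, n_sim, seed):
    """Simulate the test statistic and the mixing weights under the null."""
    rng = random.Random(seed)
    stats, counts = [], [0] * (q + 1)
    for _ in range(n_sim):
        z = [rng.gauss(0.0, 1.0) for _ in range(q)]
        # With identity covariance, the restricted estimate keeps the positive parts.
        positive = [value for value in z if value > 0]
        stats.append(sum(value * value for value in positive))
        counts[len(positive)] += 1
    weights = [count / n_sim for count in counts]
    return stats, weights


def p_value_simulation(t_obs, stats):
    """Approach 1: share of simulated statistics at least as large as t_obs."""
    return sum(t >= t_obs for t in stats) / len(stats)


def p_value_weights(t_obs, weights):
    """Approach 2: weighted sum of chi-square tail probabilities."""
    return sum(w * chi2_sf(df, t_obs) for df, w in enumerate(weights))


stats, weights = simulate_null(q=2, n_sim=20000, seed=1)
t_obs = 4.0  # hypothetical observed test statistic
p_sim = p_value_simulation(t_obs, stats)
p_mix = p_value_weights(t_obs, weights)
print(weights, p_sim, p_mix)
```

In this special case the true weights are 0.25, 0.5 and 0.25, which gives a direct check on the simulation. With a general covariance matrix the weights have no such simple form, which is why they are estimated by simulation.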
Intermediate Summary
It can be concluded that integrating informative hypotheses into the EffectLiteR framework enriches the testing of hypotheses regarding effects of interest. This is because it allows researchers to directly test hypotheses that correspond to their prior order expectations of the treatment groups. Using this approach, it is no longer necessary to follow a cumbersome two-step procedure with potentially increased type I error rates, as is needed in classical null hypothesis testing. In fact, it is even possible to detect significant results that would not have been detected via classical null hypothesis testing, as was the case in our motivating example. Going one step further, the next section illustrates how to account for stochastic group sizes while testing informative hypotheses in the EffectLiteR framework.
Stochastic Group Sizes
The treatment groups ($X$) and the binary covariate ($K$) divide the sample into several groups, one for each combination of their levels. The group size is the number of observations in a particular group. The group proportions are the group sizes divided by the total sample size. The corresponding model parameters are called group weights.
So far, we have treated the group sizes as fixed. Theoretically, this implies that each time we would replicate the study, the group sizes would remain the same. However, in practice, this is not always the case, and the group sizes may vary from sample to sample. In this case, it is more correct to treat the group sizes as stochastic, and this implies that the group weights become free parameters in our model. Mayer and Thoemmes (2019) showed that failing to take the stochastic nature of group sizes into account may lead to increased type I error rates even in randomized experiments. If we want to account for stochastic group weights, our vector of parameters will not only include the regression coefficients but also the sample proportions:
21
22
23
To apply informative hypothesis testing in the EffectLiteR framework while taking into account stochastic group sizes, two changes are needed. First, a new method is needed to obtain , the vector of restricted regression coefficients and sample proportions, and second, the generalized non-linear Wald statistic has to be used. This is because the effects of interest are no longer a linear combination of the model parameters and quadratic programming cannot be used to find .
Estimation of by Means of SEM
To obtain the vector of restricted estimates, several steps should be followed. The model is specified as a SEM, which is estimated via the R package lavaan (Rosseel, 2012). SEM software is used because it can handle constraints on functions of parameters. The hypotheses are included in the model definition as quantities of interest, but the constraints imposed on them are defined separately. This allows us to first fit an unconstrained SEM. If all quantities of interest already satisfy the constraints, the generalized non-linear Wald test is calculated based on the results of the unconstrained fit. If at least one quantity of interest does not satisfy the constraints, that is, at least one constraint is active, a constrained SEM is fitted and the generalized non-linear Wald test is calculated based on the results of the constrained fit. This procedure is implemented in the R script “A-exampleI.R” and is explained in the document “C-explanation.pdf”, which are both part of the Supplementary Materials. The resulting parameter vector is used in the generalized non-linear Wald test, which is explained in the following section.
Generalized Non-linear Wald Test
The generalized non-linear Wald statistic can be found in Silvapulle and Sen (2005, p. 166) and is defined as follows:
24
In our simulated data example, we obtain the following results. The columns of Table 3 titled “Stochastic group sizes” show the estimates of all adjusted means and all average effects, including their standard errors. It can be observed that the standard errors get slightly larger when stochastic group sizes are considered, compared to when group sizes are assumed to be fixed, which reflects the increased uncertainty. When the sample size is smaller, the differences between the two sets of standard errors become larger. This is illustrated in the R script “F-exampleISmallN.R” in the Supplementary Materials. The data set with is available under the name “G-exampleISmallNData.rds” and the results are presented in the document “H-exampleISmallNResults.pdf”. As for the Wald statistic in our running example, we obtain , depending on whether the simulation or weights approach is used to calculate the p-value. The test statistic is slightly different from the one we obtained without considering stochastic group sizes ( ), while the p-values are nearly equal.
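The observation about the standard errors can be made plausible with a small delta-method sketch in Python. It rests on strong simplifying assumptions that are not taken from the paper: the average effect is a weighted mean of just two conditional effects, the conditional effects are estimated independently of each other and of the group proportion, and the proportion is a binomial estimate. The function name and all numbers are hypothetical, and the sketch does not reproduce the lavaan-based computation.

```python
import math


def se_average_effect(ce0, ce1, se0, se1, p1, n, stochastic):
    """Delta-method standard error of AE = (1 - p1) * ce0 + p1 * ce1.

    ce0, ce1: conditional effects in the two strata, se0, se1: their standard errors,
    p1: proportion of the second stratum, n: total sample size.
    """
    variance = (1 - p1) ** 2 * se0 ** 2 + p1 ** 2 * se1 ** 2
    if stochastic:
        # p1 is itself estimated; its sampling variance is p1 * (1 - p1) / n,
        # and the derivative of AE with respect to p1 is (ce1 - ce0).
        variance += (ce1 - ce0) ** 2 * p1 * (1 - p1) / n
    return math.sqrt(variance)


# Hypothetical inputs.
se_fixed = se_average_effect(0.2, 0.4, 0.10, 0.12, 0.4, 100, stochastic=False)
se_stochastic = se_average_effect(0.2, 0.4, 0.10, 0.12, 0.4, 100, stochastic=True)
print(round(se_fixed, 4), round(se_stochastic, 4))
```

The extra variance term vanishes when the conditional effects are equal and shrinks with $1/n$, which is consistent with the stochastic and fixed standard errors being close in large samples.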
Intermediate Summary
We showed how to account for stochastic group sizes when testing informative hypotheses within the EffectLiteR framework. If group weights are considered as free parameters in the model, uncertainty increases, which manifests itself in increased standard errors of the parameter estimates. This seems to be more evident in small samples than in large samples. This observation is especially important in the social and behavioral sciences, where treatment group sizes often vary across samples and the exact group sizes are not determined before conducting the experiment. However, at this point, considering stochastic group sizes comes with an increased computational cost, as parameter estimation has to be carried out in a two-step procedure by means of lavaan.
Data Example
In this section, the illustrated methods will be applied to an artificial data example from the constrained statistical inference literature (Hoijtink, 2012; Vanbrabant, 2018). The accompanying R script can be found in the Supplementary Materials and is named “I-exampleII.R”. We recommend this R script as a starting point for all researchers who would like to apply our method. The Anger Management data set comes with the restriktor package (Vanbrabant, 2018, 2020). It contains 40 observations on decrease in aggression level over the course of eight weeks and considers the continuous covariate age. Subjects are assigned to one of the following groups: no training ($X = 0$), physical training ($X = 1$), behavioral therapy ($X = 2$) and a combination of physical exercise and behavioral therapy ($X = 3$). All groups have the same size of 10. Since this data set is artificial and group sizes are fixed to 10, we do not consider stochastic group sizes here. Table 4 shows the raw and adjusted means of decrease in aggression level as well as the relative treatment group frequencies.
Table 4
Raw as Well as Adjusted Means of Decrease in Aggression Level and Relative Treatment Group Frequencies
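Group | Raw mean | Adjusted mean | Relative treatment group frequency |
---|---|---|---|
No training ($X = 0$) | -0.200 | -0.920 | 0.250 |
Physical training ($X = 1$) | 0.800 | 0.922 | 0.250 |
Behavioral therapy ($X = 2$) | 3.100 | 3.341 | 0.250 |
Physical training & behavioral therapy ($X = 3$) | 4.100 | 4.237 | 0.250 |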
Possible hypotheses regarding the effects of interest are:
25
The adjusted means and average effects estimates are presented in Table 5.
Table 5
Adjusted Means and Average Effects Estimates With Standard Errors in Parentheses (Anger Management Data)
Group | Adjusted mean | Average effect |
---|---|---|
No training ($X = 0$) | -0.920 (0.777) | |
Physical training ($X = 1$) | 0.922 (0.744) | 1.842 (1.076) |
Behavioral therapy ($X = 2$) | 3.341 (0.706) | 4.261 (1.049) |
Physical training & behavioral therapy ($X = 3$) | 4.237 (0.801) | 5.157 (1.116) |
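The relation between the two columns of Table 5 can be checked with a short Python sketch: each average effect is the adjusted mean of a treatment group minus the adjusted mean of the no-training group. The dictionary keys and helper name are illustrative, and the ordering check at the end is only a descriptive look at the point estimates under an assumed complete ordering, not a hypothesis test.

```python
# Adjusted means as reported in Table 5 (Anger Management data).
adjusted_means = {
    "no training": -0.920,
    "physical training": 0.922,
    "behavioral therapy": 3.341,
    "physical training & behavioral therapy": 4.237,
}


def average_effects(means, reference):
    """Average effect of each group relative to the reference group."""
    return {
        group: round(mean - means[reference], 3)
        for group, mean in means.items()
        if group != reference
    }


effects = average_effects(adjusted_means, "no training")
values = list(effects.values())
# Descriptive check of a complete ordering 0 < effect 1 < effect 2 < effect 3.
is_ordered = values[0] > 0 and all(x < y for x, y in zip(values, values[1:]))
print(effects, is_ordered)
```

The computed differences match the average effects in the table, and the point estimates follow the assumed ordering. Whether that ordering is supported statistically is what the informative hypothesis test addresses.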
Testing hypothesis against , (using either one of the two approaches to calculate the p-value) is obtained. Furthermore, the regular F-test of against yields . Again, we observe that the test statistic of the informative hypothesis test is larger than the test statistic of the classical null hypothesis test, even though both tests yield highly significant p-values.
Discussion
This paper demonstrated how to integrate informative hypotheses into the EffectLiteR framework, while taking into account stochastic group sizes. We provided R scripts and explanatory documents in the Supplementary Materials. The method described makes it possible to directly test hypotheses that represent the researcher’s prior order expectations about the effects of interest in the treatment groups. In contrast to classical null hypothesis testing, it is no longer necessary to follow a cumbersome two-step procedure with potentially increased type I error rates due to multiple testing. Furthermore, we were able to show that testing informative hypotheses about effects of interest has greater power compared to classical null hypothesis testing. Vanbrabant (2018) already showed this in the simpler context of comparing group means instead of effects of interest.
Considering stochastic group sizes in the approach described in this paper does not seem to impact the results substantially if the sample size is large. However, when dealing with small sample sizes, standard errors might be underestimated when erroneously treating stochastic group sizes as fixed. This is because treating them as fixed fails to account for the increased uncertainty that stems from considering group weights as free rather than fixed parameters in the model.
Two approaches to estimate the p-values when testing informative hypotheses about effects of interest were introduced. When the residuals are normally distributed, the weights approach is preferred due to its efficiency. Furthermore, when stochastic group sizes are involved, we also prefer the weights approach, albeit for practical reasons: in particular when some of the group sizes are small, many simulation iterations may fail and must be replaced, making the simulation approach very time-consuming.
The limitations of this paper and the outlook on future research are the following. First, we only considered manifest variables. In the future, the presented methods should be extended to be able to deal with latent variables. Furthermore, the small-sample properties of the generalized non-linear Wald test are unknown. Thus, testing informative hypotheses in the EffectLiteR framework while accounting for stochastic group sizes should be examined further in future research by means of simulation studies. Moreover, we did not examine the consequences when assumptions of the general linear model like homoscedasticity are not fulfilled. When considering stochastic group sizes, it has already been shown that in the two-group context, erroneously assuming equal variances is only critical regarding standard error estimation of effects when group sizes are unequal (see, e.g., Berry, 1993; Mayer & Thoemmes, 2019). This can lead to increased type I error rates. Future research should thus examine the effect of violations of model assumptions when testing informative hypotheses in the EffectLiteR framework taking into account stochastic group sizes.
Finally, the two-step procedure presented to test informative hypotheses about effects of interest while considering stochastic group sizes needs a lot of manual tuning at this point. To make this approach more efficient and accessible for applied researchers, it might be possible to adapt it using the distance statistic (Silvapulle & Sen, 2005, p. 154). This would avoid the need for estimating the (order-)constrained model using SEM, and may result in a more efficient testing procedure.