Consider the goal of comparing independent groups. A broad issue is how many methods are needed to get a good understanding of how, and by how much, groups differ. Certainly, the best-known approach is to use some measure of location such as the mean, median, 20% trimmed mean or even an M-estimator. Classic methods based on means provide some information regarding how groups differ, but it is fairly evident that this approach can miss details that are clinically important. For example, the difference between the means might be relatively small when the difference between the medians is large. Of course, the reverse can happen: the difference between the medians might be small when the difference between the means is large.
A limitation of methods that test hypotheses based on measures of location only is that they ignore the variation within the groups. The goal in this paper is to suggest methods based on a robust measure of effect size that incorporates, in part, a robust measure of dispersion. As is illustrated later in this paper, taking into account the variation within groups can increase power substantially in some situations.
For $J$ independent groups, let $\sigma_j$ ($j = 1, \ldots, J$) denote some measure of dispersion associated with the $j$th group. The better-known measures of effect size assume homoscedasticity, meaning that $\sigma_1 = \cdots = \sigma_J$ is assumed. The measures of effect size used here avoid this assumption. That is, they allow heteroscedasticity. The idea is that comparing groups based on a heteroscedastic measure of effect size provides an alternative perspective that helps provide a deeper understanding of how groups differ.
Two situations are considered. The first is where the goal is to perform all pairwise comparisons among $J$ independent groups. More formally, let $\delta_{jk}$ denote a measure of effect size when comparing groups $j$ and $k$. The goal is to test

$$H_0\colon \delta_{jk} = 0 \tag{1}$$

and to compute a confidence interval for $\delta_{jk}$ for all $j < k$. Typically, as is the case here, $\delta_{jk}$ is based in part on the difference between two measures of location, say $\theta_j - \theta_k$. Consequently, a decision can be made about whether $\delta_{jk}$ is greater than or less than zero by simply testing the hypothesis $H_0\colon \theta_j = \theta_k$. But here, $\delta_{jk}$ is also based on a measure of dispersion. Consequently, the power of a test of $H_0\colon \theta_j = \theta_k$ can differ from the power of a test of (1) simply because the method for testing (1) is sensitive to different features of the data. And even when these two approaches have similar power, the problem of computing a confidence interval for $\delta_{jk}$ remains.
The second goal deals with comparing the levels of the first (or second) factor of a $J$-by-$K$ ANOVA design. To elaborate a bit, momentarily focus on a 2-by-$K$ design where the goal is to compare the first and second levels of the first factor. To be a bit more precise, consider $2K$ independent random variables. Let $\Xi_1$ be some global measure of effect size associated with the first $K$ random variables and let $\Xi_2$ denote the measure of effect size for the remaining $K$ variables. As will be made evident, the choice for $\Xi_1$ and $\Xi_2$ used here is related to the choice for $\delta_{jk}$, but it differs in a fundamental way. The goal is to test

$$H_0\colon \Xi_1 = \Xi_2. \tag{2}$$

And there is the related goal of computing a confidence interval for $\Xi_1 - \Xi_2$. The proposed method is readily extended to a $J$-by-$K$ design where the goal is to make inferences based on all pairs of levels of the first or second factor. The results here extend the results in Wilcox (2022b), which were limited to a two-by-two design.
When testing (1), a seemingly natural guess is that a percentile bootstrap method will perform well in terms of controlling the Type I error probability. Simulations related to this issue are reported here. As for testing (2), a percentile bootstrap method was found to perform poorly; the actual level was estimated to be well below the nominal level. An alternative approach is proposed here and studied via simulations.
The paper is organized as follows. The next section describes the robust heteroscedastic measures of effect size that will be used. Both are based on a simple extension of measures of effect size proposed by Kulinskaya et al. (2008) and Kulinskaya and Staudte (2006). This is followed by the proposed methods for testing (1) and (2). Then some simulation results are reported, followed by two illustrations.
Robust, Heteroscedastic Measures of Effect Size
Momentarily consider two independent groups and let $\bar X_j$ denote the sample mean for the $j$th group ($j = 1, 2$). Let $n_j$ denote the corresponding sample size and let $N = n_1 + n_2$. Let $\mu_1$ and $\mu_2$ denote the population means and let $\sigma_1$ and $\sigma_2$ denote the population standard deviations. Certainly, one of the better-known measures of effect size is

$$\delta = \frac{\mu_1 - \mu_2}{\sigma},$$

where it is assumed that $\sigma_1 = \sigma_2 = \sigma$. That is, homoscedasticity is assumed. Cohen's d (e.g., Cohen, 1988) provides an estimate of $\delta$ that is biased. Hedges's g (Hedges & Olkin, 1985) provides an unbiased estimator of $\delta$. Glass et al. (1981) avoid the homoscedasticity assumption by specifying one of the groups as the control group and then using only the estimate of the variance for the control group to measure effect size. Kulinskaya et al. (2008) avoid the homoscedasticity assumption in the following manner. Note that the standard error of $\bar X_1 - \bar X_2$ can be written as $\tilde\sigma/\sqrt{N q (1 - q)}$, where

$$\tilde\sigma^2 = (1 - q)\sigma_1^2 + q\sigma_2^2$$

and $q = n_1/N$. As a result, they use

$$\delta_{kms} = \frac{\mu_1 - \mu_2}{\tilde\sigma} \tag{3}$$
as a measure of effect size. Kulinskaya and Staudte (2006, p. 101) note that a natural generalization of $\delta$ to the heteroscedastic case does not appear to be possible without taking into account the relative sample sizes. Under normality, and when the population variances are equal, $\delta_{kms} = \delta$. For $J$ independent groups, Kulinskaya and Staudte (2006) generalized this approach by estimating effect size with

$$\hat\xi^2 = \frac{J^2}{J - 1} \sum_{j=1}^{J} \hat q_j \frac{(\bar X_j - \tilde X)^2}{s_j^2}, \tag{4}$$

where $\hat q_j = n_j/N$, $N = \sum n_j$, $\tilde X = \sum \hat q_j \bar X_j$, and $s_j^2$ is the sample variance associated with the $j$th group. Their results include a method for computing an interval estimate of the population analog of (4) assuming normality.
A parameter is said to be non-robust if a small change in a distribution can have a large impact on its value. Formal mathematical methods for characterizing robust parameters are summarized by Hampel et al. (1986), Staudte and Sheather (1990), as well as Huber and Ronchetti (2009). The point here is that the population mean and variance are not robust. In practical terms, when using the measures of effect size just reviewed, even a small departure from a normal distribution can mask a large effect size among the bulk of the participants. Algina et al. (2005) illustrate this point when using an estimate of $\delta$.
Here, robust versions of (3) and (4) are used, which are based on a trimmed mean and a Winsorized variance. This mimics the basic approach used by Algina et al. (2005) to derive a robust version of $\delta$. For notational convenience, momentarily focus on a single random sample $X_1, \ldots, X_n$ and let $X_{(1)} \le \cdots \le X_{(n)}$ denote the values written in ascending order. Let $\gamma$ ($0 \le \gamma < 0.5$) denote some specified constant, and let $g = \lfloor \gamma n \rfloor$, where $\lfloor \gamma n \rfloor$ is the value of $\gamma n$ rounded down to the nearest integer. The sample trimmed mean is computed by removing the $g$ largest and $g$ smallest observations and averaging the values that remain. More formally, the sample trimmed mean is

$$\bar X_t = \frac{1}{n - 2g} \sum_{i=g+1}^{n-g} X_{(i)}.$$
The choice $\gamma = 0.2$ has been studied extensively (e.g., Wilcox, 2022a) and is used here. It has good efficiency under normality, and its standard error can be substantially smaller than the standard error of the mean when dealing with heavy-tailed distributions where outliers are likely to occur.
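To make the computation concrete, the trimmed mean just described can be sketched as follows; this is an illustrative sketch, not code from any particular package, and the function name is my own:

```python
import numpy as np

def trimmed_mean(x, gamma=0.2):
    """Sample trimmed mean: remove the g smallest and g largest order
    statistics, where g = floor(gamma * n), and average what remains."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    g = int(np.floor(gamma * n))
    return x[g:n - g].mean()
```

For example, with the values 1 through 10 and $\gamma = 0.2$, $g = 2$, so the two smallest and two largest values are removed and the mean of 3 through 8 is returned.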
Winsorizing means that the $g$ smallest values are set equal to $X_{(g+1)}$ and the $g$ largest are set equal to $X_{(n-g)}$. The $\gamma$-Winsorized mean, $\bar X_w$, is the mean of the Winsorized values, and the Winsorized sample variance, $s_w^2$, is the usual sample variance based on the Winsorized data. When sampling from a normal distribution and when $\gamma = 0.2$, $s_w^2/0.4121$ estimates the population variance. Let $\bar X_{tj}$ denote the trimmed mean for the $j$th group. Here, the measure of effect size that will be used is estimated with

$$\hat\xi_t^2 = \frac{J^2}{J - 1} \sum_{j=1}^{J} \hat q_j \frac{(\bar X_{tj} - \tilde X_t)^2}{S_{wj}^2}, \tag{5}$$

where $\hat q_j = n_j/N$, $\tilde X_t = \sum \hat q_j \bar X_{tj}$, and $S_{wj}^2$ is the Winsorized variance of the $j$th group rescaled to estimate the variance when dealing with a normal distribution. This will be called the KMS measure of effect size henceforth. The robust version of (3) is estimated with

$$\hat\delta_{kms.t} = \frac{\bar X_{t1} - \bar X_{t2}}{\tilde S_w}, \tag{6}$$

where

$$\tilde S_w^2 = (1 - \hat q) S_{w1}^2 + \hat q S_{w2}^2$$

and $\hat q = n_1/N$.
Consider the case where $J - 1$ of the groups have a common population trimmed mean and the other group has a population trimmed mean that is larger than the other trimmed means by some specified amount. If the term $J^2/(J - 1)$ is excluded from (5), the resulting measure of effect size decreases as $J$ increases. By including this term, $\hat\xi_t$ remains similar to what would be obtained when $J = 2$.
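A sketch of the two-group estimate (6) may help fix ideas. The rescaling constant 0.4121 (the 20% Winsorized variance of a standard normal distribution) is an assumption of this sketch, as are the function names; it is not code from the paper:

```python
import numpy as np

def winsorized_variance(x, gamma=0.2):
    """Winsorized sample variance: pull the g smallest values up to X_(g+1),
    the g largest down to X_(n-g), then take the usual sample variance."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    g = int(np.floor(gamma * n))
    w = x.copy()
    w[:g] = x[g]              # g smallest set equal to X_(g+1)
    w[n - g:] = x[n - g - 1]  # g largest set equal to X_(n-g)
    return w.var(ddof=1)

def trimmed_mean(x, gamma=0.2):
    """20% trimmed mean by default."""
    x = np.sort(np.asarray(x, dtype=float))
    g = int(np.floor(gamma * len(x)))
    return x[g:len(x) - g].mean()

def kms_robust(x1, x2, gamma=0.2, c=0.4121):
    """Robust two-group KMS effect size: difference of trimmed means over a
    weighted, rescaled Winsorized standard deviation (heteroscedastic)."""
    n1, n2 = len(x1), len(x2)
    q = n1 / (n1 + n2)
    s1 = winsorized_variance(x1, gamma) / c  # rescaled to estimate sigma^2
    s2 = winsorized_variance(x2, gamma) / c  # under normality
    s_tilde = np.sqrt((1 - q) * s1 + q * s2)
    return (trimmed_mean(x1, gamma) - trimmed_mean(x2, gamma)) / s_tilde
```

By construction the estimate is zero when the two trimmed means agree, and its sign indicates which group has the larger trimmed mean.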
Testing (1): Method M1
A basic percentile bootstrap method is used to test (1). Momentarily focus on two independent groups. First, generate a bootstrap sample from each group. That is, randomly sample with replacement $n_j$ values from the $j$th group. Based on these bootstrap samples, compute the measure of effect size corresponding to (6) and, for notational convenience, label the result $\hat\delta^*$. Repeat this process $B$ times, yielding $\hat\delta_1^*, \ldots, \hat\delta_B^*$. Let $\hat\delta_{(1)}^* \le \cdots \le \hat\delta_{(B)}^*$ denote the $B$ values written in ascending order. Then a $1 - \alpha$ confidence interval for $\delta$ is

$$\left(\hat\delta_{(\ell+1)}^*,\; \hat\delta_{(u)}^*\right),$$

where $\ell = \alpha B/2$, rounded to the nearest integer, and $u = B - \ell$. Let $A$ denote the number of times $\hat\delta_b^* > 0$ and let $\hat P = A/B$. From Liu and Singh (1997), a (generalized) p-value is $2\min(\hat P, 1 - \hat P)$. This is called method M1 henceforth.
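The bootstrap steps just described can be sketched as follows. The effect-size estimator is passed in as a function, and the default $B$ and the function names are illustrative choices of this sketch:

```python
import numpy as np

def percentile_bootstrap(x1, x2, effect, B=2000, alpha=0.05, seed=None):
    """Percentile bootstrap: resample each group with replacement, recompute
    the effect size B times, and return a 1 - alpha confidence interval plus
    the generalized p-value 2 * min(P, 1 - P)."""
    rng = np.random.default_rng(seed)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    boot = np.empty(B)
    for b in range(B):
        s1 = rng.choice(x1, size=len(x1), replace=True)
        s2 = rng.choice(x2, size=len(x2), replace=True)
        boot[b] = effect(s1, s2)
    boot.sort()
    ell = int(round(alpha * B / 2))      # l = alpha * B / 2, rounded
    ci = (boot[ell], boot[B - ell - 1])  # (delta*_(l+1), delta*_(u)), u = B - l
    P = np.mean(boot > 0)                # proportion of bootstrap values > 0
    return ci, 2 * min(P, 1 - P)
```

Any scalar-valued effect-size estimator can be supplied for `effect`, including an implementation of (6).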
When $J > 2$ groups are compared, the improvement on the Bonferroni method derived by Hochberg (1988) is used to control the familywise error rate (FWE), meaning the probability of one or more Type I errors. Given $C$ p-values, say $p_1, \ldots, p_C$, the computational details are as follows. Put the p-values in descending order, yielding $p_{[1]} \ge p_{[2]} \ge \cdots \ge p_{[C]}$. Suppose the goal is to have FWE equal to $\alpha$. Set $k = 1$.

1. If $p_{[1]} \le \alpha$, reject all $C$ hypotheses.
2. If $p_{[k]} > \alpha/k$, increment $k$ by one. If $p_{[k]} \le \alpha/k$, reject all hypotheses having a p-value less than or equal to $p_{[k]}$.
3. If $p_{[k]} > \alpha/k$, repeat step 2.
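The steps above can be sketched as a direct translation, with a function name of my choosing:

```python
import numpy as np

def hochberg(pvalues, alpha=0.05):
    """Hochberg's step-up procedure: returns a boolean array indicating
    which hypotheses are rejected, controlling the FWE at alpha."""
    p = np.asarray(pvalues, dtype=float)
    # Walk through the p-values in descending order: p_[1] >= ... >= p_[C].
    for k, p_k in enumerate(np.sort(p)[::-1], start=1):
        if p_k <= alpha / k:
            # Reject every hypothesis with a p-value <= p_[k].
            return p <= p_k
    return np.zeros(len(p), dtype=bool)
```

For example, with p-values (0.2, 0.04, 0.01) and $\alpha = 0.05$, only the hypothesis with p = 0.01 is rejected: 0.2 > 0.05, 0.04 > 0.05/2, but 0.01 ≤ 0.05/3.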
Testing (2): Method M2
As is evident, the percentile bootstrap method is readily adapted to testing (2), where $J = 2$ and now $\Xi_\ell$ ($\ell = 1, 2$) is given by (5) based on the $\ell$th level of the first factor and the corresponding $K$ levels of the second factor. However, simulations made it clear that this approach is highly unsatisfactory when the sample sizes are relatively small; the actual level was estimated to be well below the nominal level. A much more satisfactory approach is to use a simulation to estimate the null distribution of $\hat D = \hat\Xi_1 - \hat\Xi_2$. Now let $n_{\ell k}$ denote the sample size corresponding to the $\ell$th level of the first factor and the $k$th level of the second factor. The initial strategy was, for the $\ell$th level of the first factor and the $k$th level of the second, to generate $n_{\ell k}$ observations from a standard normal distribution and then compute $\hat D$, yielding, say, $\hat D^*$. This is repeated $B$ times, yielding $\hat D_1^*, \ldots, \hat D_B^*$, which yields an estimate of the null distribution of $\hat D$. Here, $B = 1000$ is used. However, for small sample sizes, control over the Type I error probability was not quite satisfactory when dealing with a skewed, light-tailed distribution.
Let $Z$ be a random variable having a standard normal distribution. Then

$$X = \frac{\exp(gZ) - 1}{g}\exp(hZ^2/2)$$

has a g-and-h distribution, where $g$ and $h$ are parameters that determine the first four moments (Hoaglin, 1985). When $g = 0$, the expression for $X$ is taken to be

$$X = Z\exp(hZ^2/2).$$
Here, an estimate of the null distribution of $\hat D$ is based on data randomly sampled from a g-and-h distribution rather than from a standard normal distribution.
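Generating g-and-h data from the transformation above is straightforward; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def rgh(n, g=0.0, h=0.0, seed=None):
    """Draw n observations from a g-and-h distribution by transforming
    standard normal variates: X = ((exp(g*Z) - 1)/g) * exp(h * Z**2 / 2),
    with X = Z * exp(h * Z**2 / 2) when g = 0."""
    z = np.random.default_rng(seed).standard_normal(n)
    tail = np.exp(h * z**2 / 2)
    if g == 0.0:
        return z * tail
    return (np.expm1(g * z) / g) * tail
```

For $g = 1$ and $h = 0$ this is a shifted lognormal distribution with mean $e^{0.5} - 1 \approx 0.648$, matching the value reported in the simulation section; the median is zero for every choice of $g$ and $h$.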
Let $\hat D$ denote the estimate of $\Delta = \Xi_1 - \Xi_2$. Let $\hat P = \frac{1}{B}\sum I_b$, where the indicator function $I_b = 1$ if $\hat D_b^* < \hat D$; otherwise $I_b = 0$. A p-value is $2\min(\hat P, 1 - \hat P)$. A confidence interval can be computed by assuming that the null distribution of $\hat D$ is reasonably well approximated as described above. Let $\hat d_\alpha$ be an estimate of the $\alpha$th quantile of the null distribution of $\hat D$ based on $\hat D_1^*, \ldots, \hat D_B^*$. Then a $1 - \alpha$ confidence interval for $\Delta$ is readily shown to be

$$\left(\hat D - \hat d_{1-\alpha/2},\; \hat D - \hat d_{\alpha/2}\right).$$

That is, an estimate of the upper quantile of the distribution of $\hat D$ is used to compute the lower end of the confidence interval. (This result is similar to how confidence intervals are computed via a bootstrap-t method. See Wilcox, 2022a, for details.) When the goal is to perform all pairwise comparisons among more than two levels of the first factor, FWE is controlled via Hochberg's method.
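Method M2's simulation strategy can be sketched as follows. The effect-size difference and the data generator are passed in as functions, since the specific g-and-h calibration is described only in general terms here; all names are illustrative assumptions of this sketch:

```python
import numpy as np

def null_distribution(stat, sizes1, sizes2, gen, B=1000):
    """Estimate the null distribution of D = Xi1 - Xi2: generate every group
    from one common distribution (so the null hypothesis holds), recompute
    the statistic, and repeat B times. `stat(groups1, groups2)` maps two
    lists of samples to a scalar; `gen(n)` draws n observations."""
    d = np.empty(B)
    for b in range(B):
        d[b] = stat([gen(n) for n in sizes1], [gen(n) for n in sizes2])
    return np.sort(d)

def m2_pvalue_ci(d_hat, null, alpha=0.05):
    """Two-sided p-value 2 * min(P, 1 - P), where P is the proportion of the
    estimated null distribution below d_hat, plus the confidence interval
    (d_hat - q_{1-alpha/2}, d_hat - q_{alpha/2})."""
    P = np.mean(null < d_hat)
    lo, hi = np.quantile(null, [alpha / 2, 1 - alpha / 2])
    return 2 * min(P, 1 - P), (d_hat - hi, d_hat - lo)
```

Note the pivot: the upper quantile of the null distribution yields the lower confidence limit, mirroring the bootstrap-t construction described above.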
Some comments about a 2-by-2 design are helpful. For this special case, let $\theta_{\ell k}$ denote some measure of location associated with the $\ell$th level of the first factor and the $k$th level of the second factor. Let $\Delta_\ell = \theta_{\ell 1} - \theta_{\ell 2}$ denote the difference when comparing the groups associated with the $\ell$th level of the first factor and the first and second levels of the second factor. In a similar manner, let $\Lambda_k = \theta_{1k} - \theta_{2k}$ denote the difference when comparing the groups associated with the $k$th level of the second factor and the first and second levels of the first factor. As is evident, when dealing with an interaction, testing

$$H_0\colon \Delta_1 = \Delta_2$$

is the same as testing

$$H_0\colon \Lambda_1 = \Lambda_2.$$

That is, it is irrelevant whether differences are taken within rows rather than within columns. Note that testing

$$H_0\colon \Xi_1 = \Xi_2, \tag{7}$$

where $\Xi_\ell$ is the measure of effect size (5) applied to the two groups in the $\ell$th row, is an analog of testing for an interaction. Rather than comparing measures of effect size based on the difference between measures of location only, a measure of effect size is used that is based in part on the Winsorized variance within each group. But testing (7) is not necessarily the same as testing

$$H_0\colon \Xi_1' = \Xi_2', \tag{8}$$

where $\Xi_k'$ denotes (5) applied to the two groups in the $k$th column. That is, comparing measures of effect size corresponding to the levels of the first factor differs from comparing measures of effect size corresponding to the levels of the second factor.
Simulation Results
Simulations were used to assess the small-sample properties of methods M1 and M2. Data were generated from four types of distributions: normal; symmetric and relatively heavy-tailed; asymmetric and relatively light-tailed; and asymmetric and relatively heavy-tailed, where heavy-tailed roughly means that outliers tend to be common. More specifically, data were generated from four g-and-h distributions: the standard normal ($g = h = 0$), a symmetric heavy-tailed distribution ($g = 0$, $h = 0.2$), a skewed distribution with relatively light tails ($g = 1$, $h = 0$), and a skewed distribution with heavy tails ($g = 1$, $h = 0.2$). Table 1 summarizes the skewness ($\kappa_1$) and kurtosis ($\kappa_2$) of these distributions. All four g-and-h distributions have a median equal to zero. For $g = 0$ and $h = 0.2$, the variance is 2.15. For $g = 1$ and $h = 0$, the mean and variance are 0.648 and 4.67, respectively. For $g = 1$ and $h = 0.2$, the mean and variance are 0.97 and 30.6. Figure 1 shows plots of these four distributions.
Table 1

| $g$ | $h$ | $\kappa_1$ | $\kappa_2$ |
|-----|-----|------------|------------|
| 0.0 | 0.0 | 0.00 | 3.00 |
| 0.0 | 0.2 | 0.00 | 21.46 |
| 1.0 | 0.0 | 6.18 | 113.94 |
| 1.0 | 0.2 | 32.81 | 2295.98 |
Figure 1. The four g-and-h distributions used in the simulations.
The range of distributions used here is motivated by a review of several studies aimed at characterizing the extent to which distributions differ from normality (Wilcox, 2022a, section 4.2). It is noted that the reported skewness and kurtosis for $g = 1$ and $h = 0.2$ are estimates of the actual skewness and kurtosis based on a sample size of one million. Repeating this process yielded even larger estimates. This particular distribution might appear to be extreme. The point here is that if a method performs reasonably well when dealing with this distribution, this provides some assurance that it will perform reasonably well for distributions that are likely to be encountered in practice.
For method M1, simulations were run for $J = 2$ and $J = 4$ groups. For $J = 4$, three sample size configurations were used: (20, 20, 20, 20), (40, 40, 40, 40) and (20, 20, 40, 40). For $J = 2$ the sample sizes were (20, 20), (40, 40) and (20, 40). For convenience, these three sample size configurations are labeled N1, N2 and N3, respectively. The results for N1 and N2 are reported in Table 2 and are based on 1000 replications. The estimates for N3 were very similar and so, for brevity, are not reported. Note that the largest estimate, 0.068, occurs for N2 with $g = 0$ and $h = 0.2$. Increasing the common sample size to 60, the estimate is 0.040. Some additional simulations were run where the first group has a standard deviation four times larger than the other groups. Again, control over the Type I error probability was similar to the results in Table 2. Bradley (1978) has suggested that, as a general guide, when testing at the 0.05 level, the actual level should be between 0.025 and 0.075. All indications are that M1 satisfies this criterion. Replacing the KMS measure with the robust version of $\delta$ derived by Algina et al. (2005) gives very similar results.
Table 2

| $g$ | $h$ | $J = 2$ | $J = 4$ |
|-----|-----|---------|---------|
| **N1** | | | |
| 0.0 | 0.0 | 0.055 | 0.055 |
| 0.0 | 0.2 | 0.053 | 0.051 |
| 1.0 | 0.0 | 0.056 | 0.057 |
| 1.0 | 0.2 | 0.041 | 0.053 |
| **N2** | | | |
| 0.0 | 0.0 | 0.048 | 0.053 |
| 0.0 | 0.2 | 0.047 | 0.068 |
| 1.0 | 0.0 | 0.040 | 0.051 |
| 1.0 | 0.2 | 0.047 | 0.048 |

Note. $\alpha = 0.05$.
There is a well-established heteroscedastic method for performing all pairwise comparisons based on trimmed means (e.g., Wilcox, 2022a, section 7.4.1). For convenience, this method is labeled T1. In some cases, T1 will have about the same amount of power as M1, but the expectation is that the choice of method might make a practical difference simply because the two methods are sensitive to different features of the data. Consider, for example, normal distributions having variance one, and suppose the mean of the first group is 0.5 while the remaining groups have a mean of zero. For N2, the probability of one or more significant results was estimated to be 0.553 for method M1 and 0.543 for method T1; Hochberg's method was also used to control FWE when using T1. Increasing the common sample size to 60, the estimates were 0.734 and 0.693, respectively. If instead the first random variable is divided by 4 and 0.8 is added to the first and second random variables, the estimates are 0.945 and 0.865.
As for method M2, simulations were run for a 2-by-4 design ($J = 2$ and $K = 4$). Four sample size configurations were used. The first three had a common sample size of 20, 50 and 100; now N1, N2 and N3 are used to denote these three sample size configurations, respectively. The fourth, N4, consisted of a common sample size of 20 for the first four groups (corresponding to the first level of the first factor) and a common sample size of 50 for the remaining four groups. The Type I error probability was estimated with 5000 replications. Table 3 shows the estimated probability of a Type I error, $\hat\alpha$, when testing at the $\alpha = 0.05$ level.
Table 3

| $g$ | $h$ | $\hat\alpha$ |
|-----|-----|--------------|
| **N1** | | |
| 0.0 | 0.0 | 0.021 |
| 0.0 | 0.2 | 0.016 |
| 1.0 | 0.0 | 0.071 |
| 1.0 | 0.2 | 0.070 |
| **N2** | | |
| 0.0 | 0.0 | 0.032 |
| 0.0 | 0.2 | 0.029 |
| 1.0 | 0.0 | 0.063 |
| 1.0 | 0.2 | 0.063 |
| **N3** | | |
| 0.0 | 0.0 | 0.043 |
| 0.0 | 0.2 | 0.042 |
| 1.0 | 0.0 | 0.055 |
| 1.0 | 0.2 | 0.057 |
| **N4** | | |
| 0.0 | 0.0 | 0.029 |
| 0.0 | 0.2 | 0.028 |
| 1.0 | 0.0 | 0.065 |
| 1.0 | 0.2 | 0.065 |

Note. $\alpha = 0.05$.
As can be seen, the estimates satisfy Bradley's criterion in all situations except N1 when dealing with a symmetric distribution; there the estimates are less than 0.025, the lowest estimate being 0.016.
Method T1 is readily extended to a two-way ANOVA design where the goal is to perform all pairwise comparisons of the levels of the first or second factor; this extension is called method T2 henceforth. The details are in Wilcox (2022a, section 7.2.1); here, the R function t2way in the R package WRS was used to apply method T2. As was the case for a one-way design, M2 and T2 are sensitive to different features of the data, so which method has more power depends on the situation. A few simulations are reported here to provide some indication of the extent to which the choice of method matters, where again $J = 2$ and $K = 4$ and the goal is to compare the two levels of the first factor. Power was estimated for situations where, excluding the first group (the group corresponding to level 1 of both factors), the remaining groups have a standard normal distribution. The first group also had a normal distribution but with mean $\mu > 0$ and standard deviation $\sigma$; the values used are shown in Table 4. As can be seen, M2 generally has the highest power, but as indicated, this is not always the case. The main point is that the power of the two methods can differ substantially. Moreover, the method that has the most power depends on how the groups differ, which is not known. But perhaps the more important point is that the two methods provide different perspectives on how the groups differ.
Table 4

| $\mu$ | $\sigma$ | M2 | T2 |
|-------|----------|-------|-------|
| **N1** | | | |
| 1.0 | 1.0 | 0.447 | 0.301 |
| 1.0 | 0.5 | 0.798 | 0.229 |
| 1.0 | 2.0 | 0.284 | 0.206 |
| **N3** | | | |
| 0.5 | 1.0 | 0.712 | 0.361 |
| 0.5 | 0.5 | 0.993 | 0.407 |
| 0.5 | 3.0 | 0.290 | 0.213 |
Two Illustrations
Methods M1 and M2 are illustrated using data stemming from a study of an intervention program aimed at improving the physical and mental health of older adults (Clark et al., 2012). The first illustration is based on a measure of depressive symptoms (CESD) taken after intervention. The goal is to compare five groups corresponding to a participant's educational level: less than high school, high school graduate, some college or technical school, four years of college completed, and post-graduate study. The total sample size is 328. Table 5 reports the p-values for methods M1 and T1, adjusted via Hochberg's method. As indicated, these two methods are consistent in terms of whether a significant result is obtained at the 0.05 level. Note, however, that the p-values differ substantially in some cases, indicating that the two methods can paint a decidedly different picture about which groups differ significantly.
Table 5

| Group | Group | M1 | T1 |
|-------|-------|-------|-------|
| 1 | 2 | 0.084 | 0.060 |
| 1 | 3 | 0.000 | 0.002 |
| 1 | 5 | 0.000 | 0.001 |
| 2 | 3 | 0.352 | 0.534 |
| 2 | 4 | 0.260 | 0.325 |
| 2 | 5 | 0.564 | 0.103 |
| 3 | 4 | 0.564 | 0.999 |
| 3 | 5 | 0.564 | 0.825 |
| 4 | 5 | 0.564 | 0.997 |

Note. P-values when comparing education groups based on a measure of depressive symptoms.
Presumably, what constitutes a large effect size can depend on the situation. For illustrative purposes, suppose $\delta$ = 0.2, 0.5 and 0.8 are viewed as small, medium and large effect sizes, respectively, as is sometimes suggested (e.g., Cohen, 1988). Then, under normality and homoscedasticity, $\delta_{kms}$ = 0.2, 0.5 and 0.8 correspond to small, medium and large effect sizes as well. For the three significant results reported here, the estimates of $\delta_{kms.t}$ range between 0.66 and 0.77. That is, the results indicate a rather substantial difference between participants who did not complete high school and those who have some educational training beyond high school.
The second illustration deals with the goal of understanding the association between a measure of meaningful activities (MAPA) and two independent variables: a measure of life satisfaction (LSIZ) and a participant’s cortisol awakening response (CAR), which is the difference between cortisol measured upon awakening and measured again about 30-45 minutes later. The focus here is on measures taken after intervention.
For illustrative purposes, the data are split into four groups based on the medians of the two independent variables, resulting in a two-by-two ANOVA design. Wilcox (2019) demonstrates that this approach can reveal details that are missed by nonparametric regression methods. The median of the CAR values is close to zero, so the two CAR groups approximately reflect groups where cortisol increases versus decreases soon after awakening. Analyzing the data with method T2, for the low versus high LSIZ groups the p-value is 0.001. That is, there is reasonably strong evidence that the low LSIZ group has a lower trimmed mean when CAR is ignored. For the low versus high CAR groups, ignoring LSIZ, the p-value is 0.509. Testing the hypothesis of no interaction, the p-value is 0.004. For low LSIZ, the 20% trimmed means for the low versus high CAR groups are 33.03 and 30.64, respectively; these estimates differ at the 0.05 level, the p-value being 0.014. For the high LSIZ group the estimates are 34.31 and 35.74, which do not differ at the 0.05 level; the p-value is 0.186. That is, the results suggest a disordinal interaction, but the strength of the evidence is not overwhelming.
As for method M2, for low LSIZ, the estimate of $\Xi_1$ when comparing the groups corresponding to low versus high CAR values is 0.248. For high LSIZ scores, the estimate of $\Xi_2$ is 0.129. Testing (7), the p-value is 0.160. In contrast, for low CAR values, comparing the low and high LSIZ groups, the estimate of $\Xi_1'$ is 0.131, and for high CAR values the estimate of $\Xi_2'$ is 0.468. The p-value when testing (8) is 0.002, and the 0.95 confidence interval for $\Xi_1' - \Xi_2'$ is (-0.510, -0.146). That is, testing (8) yields a significant result at the 0.05 level, in contrast to testing (7).
As previously noted, based on trimmed means only, a significant interaction was obtained. The results based on the KMS measure of effect size also indicate an interaction, but with the added benefit of an alternative perspective regarding the nature and relative importance of the interaction. In particular, the results demonstrate that differences within rows can yield a different perspective than differences within columns when using the KMS measure of effect size. Roughly, knowing whether cortisol increases or decreases appears to provide information about how the low and high life satisfaction groups differ in terms of meaningful activities. When cortisol decreases, the results indicate a much more pronounced difference between the high and low LSIZ groups compared to when cortisol increases.
Concluding Remarks
Steegen et al. (2016) discuss and illustrate an extremely important issue: no single method reveals everything of interest when comparing groups. Multiple methods can be required to get a deep and nuanced understanding of data. The goal here is to suggest a method for comparing groups, via a robust, heteroscedastic measure of effect size, that helps achieve such an understanding. Here, the illustration involving a 2-by-2 ANOVA design demonstrates this point.
Perhaps it should be stressed that it is not being suggested that methods based on measures of location only should be abandoned. Surely, they provide useful information about how groups compare. Again, the only suggestion is that comparing groups based on a robust measure of effect size that includes some measure of variation can provide insights about how groups compare.
There are several other ways for dealing with an interaction, based on a heteroscedastic measure of effect size, beyond the approach used here (Wilcox, 2022a, section 7.4.17). One approach is to use a quantile shift measure of effect size and another is to use the notion of explanatory power. Inferences based on these methods are under investigation.