Eta squared ($\eta^2$) is one of the most common effect sizes in the social sciences. It has a long history dating back to Pearson (1923) and Fisher (1925), and it occupies a central place in relation to other effect sizes. For instance, it is equivalent to the coefficient of determination in regression analysis, and the square root of eta squared is the correlation between the predicted and actual outcomes. It is also functionally related to Cohen's effect size ($f$), which plays an important role in statistical power analysis. Cohen's $f$, which represents the standard deviation of the standardized group means, can be easily computed from eta squared:
$$f = \sqrt{\frac{\eta^2}{1 - \eta^2}} \tag{1}$$
Eta squared will remain a staple effect size measure in publications because journals have been increasingly encouraging researchers to include effect size measures. The American Psychological Association recommended reporting effect size measures in addition to statistical tests (Wilkinson & Task Force on Statistical Inference, 1999). The American Educational Research Association (2006) adopted a similar policy on effect size reporting.
Despite its popularity, eta squared is not an unbiased estimate of the population effect size and is known to have a positive bias. The existence of the bias can be deduced from the computation of eta squared:
$$\hat{\eta}^2 = \frac{SSB}{SST} \tag{2}$$
where $SSB$ is the between-group sum of squares and $SST$ is the total sum of squares. The two sums of squares are unbiased estimates on their own, but their ratio is not necessarily an unbiased estimate. The bias of eta squared is well documented in the literature (Keselman, 1975; Okada, 2013). The bias depends on the sample size and the population effect size, and it can be substantial in some cases.
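As a quick illustration of equation (2) (a Python sketch; the article's own appendix code is in R, and the toy data here are ours):

```python
# Eta squared from the one-way ANOVA decomposition: eta^2 = SSB / SST.
import numpy as np

def eta_squared(groups):
    """Compute eta squared for a one-way layout given a list of 1-D arrays."""
    allx = np.concatenate(groups)
    grand = allx.mean()
    # between-group sum of squares: group sizes times squared mean deviations
    ssb = sum(g.size * (g.mean() - grand) ** 2 for g in groups)
    # total sum of squares around the grand mean
    sst = ((allx - grand) ** 2).sum()
    return ssb / sst

groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([4.0, 5.0, 6.0])]
print(eta_squared(groups))  # -> 0.7
```

Here $SSB = 14$ and $SST = 20$, so $\hat{\eta}^2 = .70$: 70% of the outcome variation is associated with group membership.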
The bias of eta squared has not really diminished its popularity because of its straightforward interpretation (Mordkoff, 2019). In the context of one-way ANOVA, eta squared can be interpreted as the proportion of the variation in the outcome associated with group membership (i.e., the treatment factor). Its alternatives (i.e., epsilon squared and omega squared) have not unseated the dominant role of eta squared in social science research. As a matter of fact, eta squared is routinely reported in journal publications and is readily accessible in popular statistical software (e.g., SPSS).
The two alternative effect size measures ($\varepsilon^2$ and $\omega^2$) were created to lessen the bias of eta squared. Epsilon squared includes degrees of freedom to correct the potential bias in estimating the population effect size (Kelley, 1935):
$$\hat{\varepsilon}^2 = 1 - \frac{SSE/(N-g)}{SST/(N-1)} \tag{3}$$
where N is the total sample size, g is the number of treatment groups, and SSE is the within-group sum of squares in ANOVA. It is easy to see that $SSE/(N-g)$ is an unbiased estimate of the within-group variance, and that $SST/(N-1)$ is an unbiased estimate of the total variance. Epsilon squared reduces the positive bias in eta squared, but can result in underestimation. In a similar vein, Hays (1963) derived omega squared to decrease the positive bias in eta squared:
$$\hat{\omega}^2 = \frac{SSB - (g-1)MSE}{SST + MSE} \tag{4}$$
where $MSE = SSE/(N-g)$ is the within-group mean square. Although $\hat{\varepsilon}^2$ and $\hat{\omega}^2$ have less positive bias than $\hat{\eta}^2$, they are not unbiased and can underestimate the population effect size. The three effect size estimates generally follow a descending order, $\hat{\eta}^2 \geq \hat{\varepsilon}^2 \geq \hat{\omega}^2$. It is recommended that $\hat{\varepsilon}^2$ be used in lieu of $\hat{\omega}^2$ (Okada, 2013). Epsilon squared appears to be a compromise choice among the three competing effect size measures because its value sits between $\hat{\eta}^2$ and $\hat{\omega}^2$. However, neither $\hat{\varepsilon}^2$ nor $\hat{\omega}^2$ has gained the same popularity as $\hat{\eta}^2$ in practice.
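The three estimators in equations (2) through (4) and their ordering can be computed side by side (a Python sketch with simulated data of our choosing):

```python
# Compute eta squared, epsilon squared, and omega squared for one-way ANOVA.
import numpy as np

def effect_sizes(groups):
    """Return (eta2, eps2, omega2) following eqs. (2)-(4)."""
    allx = np.concatenate(groups)
    N, g = allx.size, len(groups)
    grand = allx.mean()
    ssb = sum(x.size * (x.mean() - grand) ** 2 for x in groups)
    sst = ((allx - grand) ** 2).sum()
    sse = sst - ssb
    mse = sse / (N - g)
    eta2 = ssb / sst                                # eq. (2)
    eps2 = 1 - (sse / (N - g)) / (sst / (N - 1))    # eq. (3)
    omega2 = (ssb - (g - 1) * mse) / (sst + mse)    # eq. (4)
    return eta2, eps2, omega2

rng = np.random.default_rng(1)
groups = [rng.normal(m, 1.0, size=20) for m in (0.0, 1.0, 2.0)]
eta2, eps2, omega2 = effect_sizes(groups)
print(eta2 > eps2 > omega2)  # the descending order noted above
```

With a clearly nonzero effect, the three estimates land in the stated order because epsilon squared and omega squared share the same numerator, $SSB - (g-1)MSE$, while omega squared has the larger denominator.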
So far, there is no unbiased estimate of the population effect size, as the less biased alternatives ($\hat{\varepsilon}^2$ and $\hat{\omega}^2$) are not unbiased and can underestimate the population effect size. Both epsilon squared and omega squared implicitly use the normality assumption in correcting the bias in eta squared (see Kelley, 1935; Hays, 1963). However, the normality assumption may not always be tenable in practice, where the limited data in a study do not lend support to it. The actual data distribution can be difficult to ascertain, owing to the scarcity of the data in a small study. It would be advisable to remove the bias in eta squared without making distributional assumptions. Non-parametric bootstrapping becomes a very viable way to correct the bias in eta squared in view of the increasing computing capability of modern computers.
Bootstrap Bias Correction
Bootstrapping is often deployed to estimate a complex statistic that is analytically intractable, but it can also be used to estimate bias. Bootstrapping draws repeated samples from the original sample data, which is conceived of as a proxy population. The repeated samples from this proxy population are used to imitate the sampling distribution of the statistic in question, and the statistic can be estimated from that sampling distribution. The inference from the sample to the population is then made analogous to the inference from the bootstrap sample to the original sample. For instance, we can use bootstrapping to estimate the standard error of a ratio, which is analytically difficult to derive. The original sample data serve as a proxy population. We repeatedly draw simple random samples with replacement from the original data or the proxy population. These repeated samples are bootstrap samples, for which the ratios are computed. The ratios from the bootstrap samples form a sampling distribution, from which the standard error can be obtained. In theory, bootstrapping can equally be used to calculate bias. As the bias estimate is just a statistic like the standard error, bootstrapping applies equally well in this situation (Efron & Tibshirani, 1998, Chapter 10).
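The ratio example can be sketched as follows (Python; the data, the sample size, and the number of bootstrap samples are our illustrative choices):

```python
# Bootstrap standard error of a ratio (mean of x over mean of y): the sample
# is treated as a proxy population and resampled with replacement.
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(10.0, 2.0, size=40)
y = rng.normal(5.0, 1.0, size=40)

B = 2000
ratios = np.empty(B)
for b in range(B):
    idx = rng.integers(0, x.size, size=x.size)   # resample paired cases
    ratios[b] = x[idx].mean() / y[idx].mean()    # replicate of the ratio

# The bootstrap replicates form a sampling distribution; its standard
# deviation is the bootstrap estimate of the standard error.
se_boot = ratios.std(ddof=1)
print(se_boot)
```

The same loop, with a different statistic inside it, yields the bias estimate discussed next.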
Efron and Tibshirani (1998) provided the statistical theory behind bootstrap bias correction. It is relatively easy to test empirically whether bootstrap bias estimation works by way of Bessel's correction, which uses the adjusted degrees of freedom in calculating the sample variance. If the sum of squared deviations, $SS$, were divided by the sample size, $n$, the sample variance thus computed would have a negative bias of $-\sigma^2/n$. Bootstrapping can be used to correctly estimate the negative bias in $SS/n$. Such a check confirms the veracity of bootstrap bias estimation in the case of a well-known bias. The flexibility of bootstrap bias estimation can be further attested by the positive bias in eta squared.
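The check can be sketched in a few lines (Python; the normal data, $n$, and $B$ are our illustrative choices):

```python
# Empirical check: bootstrap bias estimation recovers the known negative bias
# of the plug-in variance SS/n, which underestimates sigma^2 by sigma^2/n.
import numpy as np

rng = np.random.default_rng(7)
n = 30
x = rng.normal(0.0, 1.0, size=n)            # population sigma^2 = 1

var_plugin = ((x - x.mean()) ** 2).mean()   # SS / n, the biased estimator

B = 4000
reps = np.empty(B)
for b in range(B):
    xb = rng.choice(x, size=n, replace=True)          # bootstrap sample
    reps[b] = ((xb - xb.mean()) ** 2).mean()          # bootstrap replicate

bias_boot = reps.mean() - var_plugin
# In the proxy population the exact bias of SS/n is -var_plugin / n,
# so the bootstrap estimate should be close to that value.
print(bias_boot, -var_plugin / n)
```

The bootstrap estimate comes out negative and close to $-\hat{\sigma}^2/n$, matching the bias implied by Bessel's correction.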
The bias of an estimator is generally defined as the difference between the expectation of the estimator, $E(\hat{\theta})$, and the parameter itself, $\theta$:
$$\mathrm{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta \tag{5}$$
If the bias is positive, it means that the estimator on average is larger than the population parameter. If the bias is negative, it means that the estimator on average tends to underestimate the population parameter. When the bias is zero, the estimator is said to be an unbiased estimator. Knowing the bias, we can correct an estimator and make it unbiased. The bias-corrected estimator is, therefore, $\hat{\theta} - \mathrm{Bias}(\hat{\theta})$.
The bootstrap estimate of bias ($\widehat{\mathrm{Bias}}_{boot}$) is the expectation of the estimators of the bootstrap samples ($\hat{\theta}^*$) minus the original estimator ($\hat{\theta}$):
$$\widehat{\mathrm{Bias}}_{boot} = E(\hat{\theta}^*) - \hat{\theta} \tag{6}$$
To understand the bootstrap estimate of bias, we can analogize $\hat{\theta}^*$ to $\hat{\theta}$ and $\hat{\theta}$ to $\theta$ in the definition of bias. First, $\hat{\theta}^*$ is called the bootstrap replicate. It is the relevant statistic based on a bootstrap sample, computed in the same way as the estimator in the original sample. We repeatedly generate bootstrap samples of the same size as the original sample. For each bootstrap sample, we calculate the bootstrap replicate, $\hat{\theta}^*_b$. Together, the bootstrap replicates $\hat{\theta}^*_b$ form a sampling distribution, which is analogous to the sampling distribution of $\hat{\theta}$. Thus, $\hat{\theta}^*$ is made comparable to $\hat{\theta}$. The expectation of the bootstrap replicate can be readily computed as the mean of all the bootstrap replicates, based on $B$ bootstrap samples:
$$E(\hat{\theta}^*) \approx \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^*_b \tag{7}$$
where $\hat{\theta}^*_b$ is a bootstrap replicate of the original estimator, based on the $b$th bootstrap sample. Second, bootstrap theory treats the original sample data as the proxy population, from which all bootstrap samples are drawn. In other words, $\hat{\theta}$ is the proxy population parameter, relative to all the bootstrap replicates $\hat{\theta}^*_b$. Thus, $\hat{\theta}$ is made analogous to $\theta$. We can then compute the bootstrap estimate of bias as
$$\widehat{\mathrm{Bias}}_{boot} = E(\hat{\theta}^*) - \hat{\theta} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^*_b - \hat{\theta} \tag{8}$$
The last expression in the equation is the bootstrap estimate of bias, $\widehat{\mathrm{Bias}}_{boot}$.
In our case, the statistic is the eta squared, $\hat{\eta}^2$, and the parameter is the population eta squared, $\eta^2$. Therefore, the bootstrap bias estimate for $\hat{\eta}^2$ is
$$\widehat{\mathrm{Bias}}_{boot}(\hat{\eta}^2) = \frac{1}{B}\sum_{b=1}^{B}\hat{\eta}^{2*}_b - \hat{\eta}^2 \tag{9}$$
where $\hat{\eta}^{2*}_b$ is the replicate statistic from the $b$th bootstrap sample. The bias-corrected eta squared, $\hat{\eta}^2_{bc}$, then becomes twice the original estimator minus the mean of the bootstrap replicates:
$$\hat{\eta}^2_{bc} = \hat{\eta}^2 - \widehat{\mathrm{Bias}}_{boot}(\hat{\eta}^2) = 2\hat{\eta}^2 - \frac{1}{B}\sum_{b=1}^{B}\hat{\eta}^{2*}_b \tag{10}$$
The bias estimator converges to a certain limit as the number of bootstrap samples, $B$, increases to infinity. In practice, we do not need to obtain an infinite number of bootstrap samples and replicates. The number of bootstrap samples, $B$, can be increased gradually until the mean of the bootstrap replicates shows convergence to a certain value. The means of the bootstrap replicates can be lined up against the increasing number of bootstrap replicates, $B$. The limit of convergence can be easily identified on a graph (Efron, 1982; Efron, 1990; Efron & Tibshirani, 1998).
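The procedure in equations (9) and (10), together with the convergence check, can be sketched as follows (Python; the simulated data and the resampling of cases with their group labels are our illustrative assumptions):

```python
# Bias-corrected eta squared via eq. (10), tracking the bias estimate as the
# number of bootstrap samples B grows.
import numpy as np

def eta_squared(y, g):
    """eta^2 for outcome y and integer group labels g."""
    grand = y.mean()
    ssb = sum((y[g == k]).size * ((y[g == k]).mean() - grand) ** 2
              for k in np.unique(g))
    return ssb / ((y - grand) ** 2).sum()

rng = np.random.default_rng(3)
g = np.repeat([0, 1, 2], 15)            # three groups of 15
y = rng.normal(0.3 * g, 1.0)            # modest true group differences

eta2 = eta_squared(y, g)
reps = []
for B in (200, 1000, 5000):
    while len(reps) < B:                # keep accumulating replicates
        idx = rng.integers(0, y.size, size=y.size)
        reps.append(eta_squared(y[idx], g[idx]))
    print(B, np.mean(reps) - eta2)      # bias estimate at each B

eta2_bc = 2 * eta2 - np.mean(reps)      # eq. (10)
```

Printing the bias estimate at several values of $B$ mimics the graphical convergence check described above.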
Example
The example data come from a randomized comparative experiment on the effects of money on human behavior (Vohs, Mead, & Goode, 2006). Money was believed to change human motivation and behavior. When people were mentally reminded of the idea of money (a priming technique), they would act more self-sufficiently and would be less likely to request help. In the experiment, subjects were randomly assigned to three conditions. They were reminded of money (prime condition) or play money (play condition) or not reminded of money (control condition) while they were asked to descramble several words and create sensible phrases using four out of five jumbled words. It was hypothesized that participants who were reminded of money and play money would work longer than participants in the control condition before requesting help. The outcome was time in seconds before asking for help. Table 1 lists the data (see Moore, Notz, & Fligner, 2015, p. 658). The ANOVA analysis confirmed the research hypothesis (F = 3.73, p = .031).
Table 1
| Condition | Time in seconds |
|---|---|
| prime | 609, 444, 242, 199, 174, 55, 251, 466, 443, 531, 135, 241, 476, 482, 362, 69, 160 |
| play | 455, 100, 238, 243, 500, 570, 231, 380, 222, 71, 232, 219, 320, 261, 290, 495, 600, 67 |
| control | 118, 272, 413, 291, 140, 104, 55, 189, 126, 400, 92, 64, 88, 142, 141, 373, 156 |
The eta squared is $\hat{\eta}^2 = .1321$. Despite being a small fraction, it is very close to Cohen's rule of thumb number for a large eta squared, .1379 (Cohen, 1988, p. 287). It should be noted that Cohen's rule of thumb numbers are used here for reference only, and that they do not dictate the magnitude of effect size in a specific field. The eta squared is known to have a positive bias, so the true effect size is almost certainly smaller. The less biased epsilon squared is $\hat{\varepsilon}^2 = .0966$. According to the epsilon squared, the bias is approximately $.1321 - .0966 = .0354$. Since all these numbers are small fractions, it makes sense to gauge the size of the bias as a percentage of the effect size measure. The percentage of the bias in $\hat{\eta}^2$ is $.0354/.1321$, or 26.8%. So there is a sizable bias in the estimated eta squared.
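These statistics can be reproduced directly from Table 1 (a Python sketch; the article's own computations use R):

```python
# Reproduce F, eta squared, and epsilon squared for the money-priming data.
import numpy as np

prime = np.array([609, 444, 242, 199, 174, 55, 251, 466, 443, 531, 135,
                  241, 476, 482, 362, 69, 160], dtype=float)
play = np.array([455, 100, 238, 243, 500, 570, 231, 380, 222, 71, 232,
                 219, 320, 261, 290, 495, 600, 67], dtype=float)
control = np.array([118, 272, 413, 291, 140, 104, 55, 189, 126, 400, 92,
                    64, 88, 142, 141, 373, 156], dtype=float)

groups = [prime, play, control]
allx = np.concatenate(groups)
N, g = allx.size, len(groups)
grand = allx.mean()

ssb = sum(x.size * (x.mean() - grand) ** 2 for x in groups)
sst = ((allx - grand) ** 2).sum()
sse = sst - ssb

F = (ssb / (g - 1)) / (sse / (N - g))         # one-way ANOVA F statistic
eta2 = ssb / sst                              # eq. (2)
eps2 = 1 - (sse / (N - g)) / (sst / (N - 1))  # eq. (3)

print(round(F, 2), round(eta2, 4), round(eps2, 4))  # -> 3.73 0.1321 0.0966
```

The output matches the F statistic, the eta squared, and the epsilon squared reported in the text.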
The epsilon squared is less biased, but it is not unbiased. It is implicitly based on the normality assumption. Without an assumption about the data distribution, it is not feasible to find the expectations of those moment estimates (i.e., sums of squares) from which the epsilon squared is derived. A simple analysis of the data in this experiment suggests that the data are not normally distributed (see Figure 1). The histograms of the data in the three conditions basically confirm that. As ANOVA is fairly robust against the violation of the normality assumption, it can be validly applied. Violation of the distributional assumption, however, affects the accuracy of epsilon squared. Given the limited data, it is difficult, if not impossible, to ascertain the true distribution of the data and, in turn, the expectations of the sums of squares in the effect size measures, be it $\hat{\varepsilon}^2$ or $\hat{\omega}^2$.
Figure 1
Non-parametric bootstrapping becomes a very practical tool to estimate the bias in this situation. The technique does not depend on distributional assumptions. The bootstrap bias estimate converges to .0320 as the number of bootstrap replicates gradually increases (see R code in the Appendix). By the law of large numbers, the mean averaged over a large number of replicates will approach the expectation. The converged bias estimate can be easily identified in Table 2 and Figure 2. Some bias estimates are above the converged value, whereas others are below it (e.g., B = 100 and B = 200).
Figure 2 is similar to a typical graph that illustrates the convergence in the law of large numbers. Some numbers are left out on the horizontal axis to show the pace of convergence. The bootstrap bias-corrected eta squared is, therefore, $.1321 - .0320 = .1001$. Compared with the epsilon squared of .0966, the bootstrap bias-corrected eta squared is slightly larger ($.1001 > .0966$). The epsilon squared implies a bias estimate of .0354, whereas the bootstrap bias estimate is .0320, or 9.6% smaller. It makes sense to obtain a smaller bootstrap bias estimate because $\hat{\varepsilon}^2$ is known to overcorrect the positive bias and underestimate the true effect size.
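The bootstrap bias estimate for the example can be sketched as follows (Python; resampling cases together with their condition labels is our assumption about the scheme, so the result only approximates the .0320 reported in the text):

```python
# Bootstrap bias estimate for eta squared on the money-priming data.
import numpy as np

prime = [609, 444, 242, 199, 174, 55, 251, 466, 443, 531, 135,
         241, 476, 482, 362, 69, 160]
play = [455, 100, 238, 243, 500, 570, 231, 380, 222, 71, 232,
        219, 320, 261, 290, 495, 600, 67]
control = [118, 272, 413, 291, 140, 104, 55, 189, 126, 400, 92,
           64, 88, 142, 141, 373, 156]

y = np.array(prime + play + control, dtype=float)
g = np.repeat([0, 1, 2], [len(prime), len(play), len(control)])

def eta_squared(y, g):
    grand = y.mean()
    ssb = sum((y[g == k]).size * ((y[g == k]).mean() - grand) ** 2
              for k in np.unique(g))
    return ssb / ((y - grand) ** 2).sum()

rng = np.random.default_rng(0)
eta2 = eta_squared(y, g)

B = 5000
reps = np.empty(B)
for b in range(B):
    idx = rng.integers(0, y.size, size=y.size)   # one bootstrap sample
    reps[b] = eta_squared(y[idx], g[idx])        # its eta squared replicate

bias_boot = reps.mean() - eta2                   # eq. (9)
eta2_bc = eta2 - bias_boot                       # eq. (10)
print(bias_boot, eta2_bc)
```

The estimated bias lands near .03, and the corrected estimate near .10, consistent with the values discussed above.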
Table 2
| B | 100 | 200 | 500 | 800 | 1000 | 1500 | 2000 | 5000 | 8000 | 10000 | 20000 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bias estimate | .0354 | .0236 | .0251 | .0353 | .0327 | .0343 | .0318 | .0318 | .0312 | .0320 | .0320 |
Note. B = Number of bootstrap replicates.
Figure 2
Compared to bootstrapping, bias correction through epsilon squared depends more on the size of eta squared and the sample sizes. As a result, epsilon squared can easily return a negative estimate. Simple algebra illustrates this point. Epsilon squared can be expressed in terms of eta squared (Mordkoff, 2019, eq. 5):
$$\hat{\varepsilon}^2 = \hat{\eta}^2 - (1 - \hat{\eta}^2)\frac{g-1}{N-g} \tag{11}$$
The bias estimate implied by $\hat{\varepsilon}^2$ is $(1 - \hat{\eta}^2)(g-1)/(N-g)$. The bias correction is always positive but can exceed $\hat{\eta}^2$. When eta squared is smaller than the ratio $(g-1)/(N-1)$, epsilon squared is negative; that is, $\hat{\eta}^2 < (g-1)/(N-1)$ means $\hat{\varepsilon}^2 < 0$. For instance, in a one-way ANOVA design with three groups of equal sample size 10, we have $N = 30$, $g = 3$, and $(g-1)/(N-1) = 2/29 = .069$. If $\hat{\eta}^2$ is less than .069, then $\hat{\varepsilon}^2$ is negative. It is an overestimate of the potential bias in $\hat{\eta}^2$ because the population effect size cannot be negative. Since the threshold of .069 is more than Cohen's rule of thumb number of .0588 for a medium-sized $\eta^2$, $\hat{\varepsilon}^2$ will be negative for all small and medium-sized $\hat{\eta}^2$ with these limited sample sizes (i.e., $N = 30$ and $n = 10$). Even if the total sample size $N$ increases to 60, $\hat{\varepsilon}^2$ will still be negative for Cohen's small-sized eta squared (i.e., .0099). In other words, $\hat{\varepsilon}^2$ will likely result in a negative estimate in many situations where eta squared or the sample size is not large. This may partially explain why $\hat{\eta}^2$ is still a popular effect size measure: it is never negative, in contrast to its less biased alternatives, $\hat{\varepsilon}^2$ and $\hat{\omega}^2$. The bootstrap bias-corrected estimate is less likely to suffer the same shortcoming because the accuracy of bootstrapping in theory depends less on the magnitude of the effect size and the sample sizes and more on the number of bootstrap replicates.
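The threshold arithmetic is easy to verify in code (a Python sketch of equation (11) and the numbers above):

```python
# Epsilon squared via eq. (11) and the threshold below which it goes negative.
g, N = 3, 30
threshold = (g - 1) / (N - 1)       # eps2 < 0 whenever eta2 < this ratio
print(round(threshold, 3))          # 0.069

def eps2(eta2, N, g):
    """Epsilon squared expressed in terms of eta squared (eq. 11)."""
    return eta2 - (1 - eta2) * (g - 1) / (N - g)

print(eps2(0.0588, 30, 3) < 0)      # medium eta^2, N = 30 -> negative
print(eps2(0.0099, 60, 3) < 0)      # small eta^2, even N = 60 -> negative
```

Setting equation (11) to zero and solving gives $\hat{\eta}^2 = (g-1)/(N-1)$, the stated cutoff.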
The sampling distribution of eta squared can be approximated by the bootstrap replicates of eta squared, which provide clues about the potential bias. The bootstrap procedure shows that the sampling distribution of the eta squared is right skewed in the example: its bias may not solely depend on the sample size. The original eta squared of .1321 is marked on the sampling distribution in Figure 3, and it has a quantile score of .37 in that distribution. In other words, the bootstrap replicates, $\hat{\eta}^{2*}_b$, exceed the original estimate 63% of the time among repeated bootstrap samples, a clear positive bias. The expectation of the bootstrap replicates, $E(\hat{\eta}^{2*})$, can be computed by averaging 20,000 bootstrap replicates, and the average is .16455, marked by the dotted line in Figure 3. There does not appear to be a ready formula that describes the sampling distribution or behavior of eta squared in repeated samples. The bias in eta squared may be too complex to yield a simple analytical solution. Thus, epsilon squared can lessen the bias of the eta squared by using sample sizes, but it cannot eradicate the bias. Bootstrapping appears to be a viable alternative.
Figure 3
Simulations
Simulations can be used to show that the bootstrap bias-corrected eta squared generally performs better than eta squared and epsilon squared under the normality assumption. When the data follow a skewed or mixture normal distribution, the bootstrap bias-corrected eta squared does not show much bias either. Table 3 uses the same parameter setting for effect sizes as in Okada (2013). The three population effect sizes are .26471 (large), .12329 (medium), and .02200 (small). The R code for the simulation is adapted from Okada (2013). A one-way ANOVA with four groups is used with varying group size (n), which goes from small (5) to large (30). As bootstrapping is a computation-intensive method, the number of sample datasets and the number of bootstrap replicates are limited to thirty and one thousand, respectively. The bias estimate is the average difference between the estimate and the population effect size. Table 3 lists the bias estimates for eta squared, epsilon squared, and the bootstrap bias-corrected eta squared. The simulation duplicates the findings about eta squared and epsilon squared in Okada (2013): eta squared is positively biased, and epsilon squared is negatively biased. The bias diminishes with increased sample sizes. The bootstrap bias-corrected eta squared generally shows little bias, compared to eta squared and epsilon squared.
Table 3
| Effect size ($\eta^2$) | n | Bias of $\hat{\eta}^2$ | Bias of $\hat{\varepsilon}^2$ | Bias of $\hat{\eta}^2_{bc}$ |
|---|---|---|---|---|
| .26471 | 5 | .07464 | -.04924 | -.02687 |
| | 10 | .05239 | -.00452 | .00214 |
| | 20 | .02095 | -.00725 | -.00483 |
| | 30 | .01748 | -.00108 | -.00034 |
| .12329 | 5 | .09979 | -.04589 | -.02015 |
| | 10 | .06305 | -.00475 | .00338 |
| | 20 | .02903 | -.00443 | -.00203 |
| | 30 | .02133 | -.00079 | .00004 |
| .02200 | 5 | .13340 | -.02497 | .00229 |
| | 10 | .07037 | -.00527 | .00309 |
| | 20 | .03628 | -.00089 | .00100 |
| | 30 | .02435 | -.00031 | .00033 |
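A scaled-down version of this simulation can be sketched as follows (Python; we use fewer datasets and bootstrap replicates than the text and our own choice of group means, so the numbers are rougher than Table 3):

```python
# Scaled-down simulation: average bias of eta squared vs. its bootstrap
# bias-corrected version over repeated datasets from a known population.
import numpy as np

def eta_squared(y, g):
    grand = y.mean()
    ssb = sum((y[g == k]).size * ((y[g == k]).mean() - grand) ** 2
              for k in np.unique(g))
    return ssb / ((y - grand) ** 2).sum()

rng = np.random.default_rng(11)
means = np.array([-0.5, -0.17, 0.17, 0.5])   # four group means, unit error SD
sb2 = (means ** 2).mean()                    # between-group variance
true_eta2 = sb2 / (sb2 + 1.0)                # population eta squared

n, B, n_datasets = 10, 300, 30
g = np.repeat(np.arange(4), n)
err_eta, err_bc = [], []
for _ in range(n_datasets):
    y = rng.normal(means[g], 1.0)            # one simulated dataset
    e2 = eta_squared(y, g)
    reps = []
    for _ in range(B):                       # bootstrap replicates
        idx = rng.integers(0, y.size, size=y.size)
        reps.append(eta_squared(y[idx], g[idx]))
    err_eta.append(e2 - true_eta2)           # error of raw eta squared
    err_bc.append(2 * e2 - np.mean(reps) - true_eta2)  # error after eq. (10)

print(np.mean(err_eta), np.mean(err_bc))
```

The average error of the raw eta squared is clearly positive, while the bootstrap-corrected version sits much closer to zero, in line with the pattern in Table 3.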
The second simulation in Table 4 shows that the bootstrap bias-corrected eta squared does not have much bias when the data follow a skewed or mixture normal distribution. The simulation uses a gamma distribution for the skewed case; the gamma distribution can take a variety of shapes. The shape parameter is set to 3, and the rate parameter is set to $\sqrt{3}$. The gamma distribution is then centered at its mean, $\sqrt{3}$, so the resulting distribution has a zero mean and unit standard deviation. A similar strategy is used to generate the mixture normal distribution, which has two means (0 and 2) and two standard deviations (1 and 1) with mixing proportions (.5 and .5). It is then standardized to have a zero mean and unit standard deviation. As a non-normal distribution poses its own challenges in ANOVA and effect size interpretation, researchers may choose to transform the original data. For instance, researchers often apply a log transformation to skewed data to make them look normal, which begs the question whether eta squared should be used in this case. So the simulation is limited to the special case of zero effect. In other words, it is more relevant to see whether the bootstrap bias-corrected eta squared can rightfully identify no effect against the background noise of non-normal distributions. The results in Table 4 confirm that the bootstrap bias-corrected eta squared shows very little bias.
Table 4
| Distribution | n | Bias of $\hat{\eta}^2$ | Bias of $\hat{\eta}^2_{bc}$ |
|---|---|---|---|
| skewed | 5 | .15667 | .02687 |
| | 10 | .06877 | -.00121 |
| | 20 | .04898 | .01323 |
| | 30 | .02882 | .00464 |
| mixture normal | 5 | .13642 | .00128 |
| | 10 | .09315 | .02316 |
| | 20 | .04790 | .01182 |
| | 30 | .02449 | .00003 |
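The two error distributions described above can be generated as follows (a Python sketch; the standardization constants follow from the stated parameters):

```python
# Standardized non-normal error distributions used in the second simulation.
import numpy as np

rng = np.random.default_rng(5)

def skewed(size):
    """Gamma(shape = 3, rate = sqrt(3)), centered at its mean sqrt(3)."""
    # shape/rate^2 = 1 gives unit variance; shape/rate = sqrt(3) is the mean
    return rng.gamma(3.0, scale=1.0 / np.sqrt(3.0), size=size) - np.sqrt(3.0)

def mix_normal(size):
    """Half N(0, 1) and half N(2, 1), standardized to mean 0 and SD 1."""
    comp = rng.integers(0, 2, size=size)   # mixing indicator with p = .5
    x = rng.normal(2.0 * comp, 1.0)
    # the raw mixture has mean 1 and variance 2, so standardize accordingly
    return (x - 1.0) / np.sqrt(2.0)

for draw in (skewed, mix_normal):
    x = draw(200_000)
    print(round(x.mean(), 2), round(x.std(), 2))  # approximately 0 and 1
```

Both generators deliver zero-mean, unit-variance noise, so any nonzero eta squared computed on them reflects pure estimation bias.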
Conclusion
The effect size measure $\hat{\eta}^2$ may contain sizable bias in one-way ANOVA. Given the cost of running a scientific study, it is advisable to remove the bias from $\hat{\eta}^2$ and offer a bias-corrected estimate. Bootstrapping is a very economical way to calculate the bias estimate because it involves only a little coding effort and computing time. The bootstrap bias-corrected eta squared can show substantial improvement, as demonstrated in the example. Compared to the less biased effect size measure $\hat{\varepsilon}^2$, the bootstrap bias estimate appears slightly better. In the example, the bootstrap bias estimate is rightly smaller than the bias implied by the epsilon squared, which can overestimate the bias in eta squared.
The advantage of bootstrap bias estimation is that it requires no prior knowledge of the data distribution. When data are limited, as in many situations, they do not usually appear normal, and it is often difficult to ascertain the actual distribution from the limited data. Nevertheless, analysis of variance is still applicable, owing to its robustness, and eta squared is often reported. Non-parametric bootstrapping can be used to estimate the bias in eta squared without prior knowledge of the data distribution. The bias of the eta squared in ANOVA can potentially arise from both the estimation method and the data distribution. Non-parametric bootstrapping offers a nice solution to correct the bias arising from both sources, as it allows the bias to have a complex relation to the contributing factors (e.g., sample size, data distribution, etc.).
Bootstrap bias estimation is also a good alternative when $\hat{\varepsilon}^2$ is negative. When $\hat{\eta}^2$ or the sample size is not large, $\hat{\varepsilon}^2$ can be negative. In practice, a small or medium-sized $\hat{\eta}^2$ is more likely than a large one, and the same can be said about the sample size. So a negative $\hat{\varepsilon}^2$ is likely in some cases. Also, bootstrapping can be used to show the long-run behavior of $\hat{\eta}^2$. Its sampling distribution can be readily obtained from the bootstrap replicates, which can graphically illuminate the size of the potential bias.
Finally, it should be noted that eta squared is a small number. Its bias estimate is often an even smaller number. There are circumstances when eta squared and its bias may not convey a difference of practical importance. Therefore, its use requires nuanced interpretation with reference to the context and the intended audience. Other effect size measures can also be considered: probability of superiority (Ruscio & Gera, 2013), common language effect size (Brooks, Dalal, & Nolan, 2014; McGraw & Wong, 1992; Li, Nesca, Waisman, Cheng, & Tze, 2021; Li & Tze, 2021), and binomial effect size. Although ANOVA is a fixed procedure, a variety of effect size measures can be used and modified to meet different needs. Future studies in this area can focus on effect size measures in non-normal data situations, where a regular effect size such as eta squared needs to be adapted to allow for meaningful interpretation.