Sample size planning is a classic problem in research design: the aim is to achieve high statistical power to reject the null hypothesis while keeping the sample as small as possible for the sake of economy. Moreover, obtaining confidence intervals (CIs) in data analysis is an increasingly important topic because CIs offer substantive advantages over the controversial null hypothesis significance testing by providing informative results and facilitating the accumulation of knowledge from insufficient data (American Psychological Association, 2001; Cumming & Finch, 2001; Meeker et al., 2017; Mendoza & Stafford, 2001; Smithson, 2000; Thompson, 1998; also see the 2019 special issue of the American Statistician). Unfortunately, misunderstandings of CIs are not uncommon, and further guidance regarding their use and interpretation is still needed (Belia et al., 2005; Cumming & Calin-Jageman, 2017; Fidler et al., 2004; Morey et al., 2016). Namely, asking "how much" and "how uncertain" is central to statistical literacy (Calin-Jageman & Cumming, 2019; Trafimow, 2018; Trafimow & Uhalt, 2020). In particular, it should be noted that the sample size calculation for constructing CIs may differ from that for testing hypotheses (Borenstein et al., 2001) because the expected width, or precision, is the concern rather than the statistical power (Kelley et al., 2003). More specifically, a naïve researcher usually ignores the fact that the CI width is itself a random variable; as such, the sample size might be underestimated, and the resulting CI would have only about a 50% probability of achieving the desired width (Grieve, 1989; Liu, 2009). In view of this problem, an integrative method of sample size calculation that incorporates both statistical power and precision remains a continuing concern.
For comparing two-group means, recent developments in sample size planning have heightened the need to integrate the notions of power and precision through probabilistic thinking. An increasingly recognized approach considers three events: rejecting the null hypothesis (event rejection, R), encompassing the true mean difference within a two-sided CI (event validity, V), and achieving the desired width for this CI (event width, W) (Beal, 1989; Cesana & Antonelli, 2010; Jiroutek et al., 2003; Kelley et al., 2003; Liu, 2009, 2012; Pan & Kupper, 1999). For the event rejection, the probability $P(R)$ is the probability of the null hypothesis being rejected. For the event width, the probability $P(W)$ is the probability of the CI width being smaller than, or equal to, a desired width value. Note that this unconditional probability does not depend on whether the CI includes the true parameter value. Kelley and Rausch (2006) proposed a method for sample size planning via the accuracy in parameter estimation (AIPE) approach based on the expected value of the width; thereafter, Lai and Kelley (2012) extended AIPE to mean contrasts. Moreover, Beal (1989) proposed a conditional probability, denoted by $P(W \mid V)$, which is the probability of a random CI width achieving a desired width value, conditional on the CI including the true parameter value. Furthermore, Jiroutek et al. (2003) and Cesana and Antonelli (2010) considered several scenarios, such as the probability $P(W \cap R \mid V)$, the conditional probability that the intersection of events width and rejection occurs given that the event validity has occurred. Liu (2012) proposed a conditional probability $P(W \mid R)$, which is the probability of the event width conditional on the null hypothesis being rejected. Collectively, these scenarios and existing methods play a critical role in helping practitioners and researchers meet their research needs and avoid false interpretation (Ioannidis, 2005).
The sample size calculation methods reviewed so far have mostly been restricted to the case of equal variances. In practice, there is evidence that extreme variance ratios do occur (Wilcox, 1987). When constructing CIs for two-group comparisons of means, if homogeneity of variances cannot be assumed, the sample sizes should be allocated to the groups in a way that takes the variances into account. It has already been noted that the uncertainty about population variances is very important (Grieve, 1991), and so alternative methods, such as Welch’s approximate t test, have been advocated (Lee, 1992). For CI-based inferences under heteroscedasticity, Wang and Kupper (1997) calculated sample sizes that were both unconditional and conditional on the coverage (i.e., inclusion of the true value). However, none of the above-mentioned works dealt with the cost of sampling.
It should be noted that formulas for sample sizes that are needed for interval estimation, especially when taking the cost factor into account, are generally not found in elementary textbooks, and are thus ignored by most researchers (Algina & Olejnik, 2000; Allison et al., 1997; Dumville et al., 2006). Hsu (1994) noted that cost ratios and unbalanced designs are seldom used to determine sample size when the total cost is limited even though the use of unequal randomization ratios offers a number of advantages. In fact, unbalanced designs are the norm rather than the exception in applied settings (Yang et al., 1996). The allocation ratios are then developed for the two-sample trimmed mean case (Guo & Luh, 2009) as well as for heterogeneous-variance group comparisons (Guo & Luh, 2013). For a fixed CI width, Pentico (1981) developed a sample size formula that minimizes the cost related to two-group estimation when variances are known. Jan and Shieh (2011) and Shieh and Jan (2012) developed optimal sample sizes of Welch’s t test for testing hypotheses and for CI width, respectively. Together these studies highlight the need to have high statistical power and good precision of estimation for sample size planning so that the unconditional/conditional probability for the event width, validity, and/or rejection can be maximal if the sampling cost is constrained.
The aforementioned methods suggest that developing easy-to-use computer applications for practitioners and applied researchers is critical. Thus, in the context of two-group mean comparisons where variances are unequal and unknown, the aim of the present study was to develop several R Shiny apps with which practitioners can find the optimal sample size for the probability of an event of interest under two distinct motives of cost effectiveness. These motives are: (a) for a desired probability level, the total sampling cost is minimized; and (b) for a given total sampling cost, the probability of the event is maximized. In an attempt to comprehensively satisfy researchers’ needs and synthesize the literature, the event probabilities (cases) covered in the present study fall into two categories:
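Unconditional: 1. $P(R)$, 2. $P(W)$, 3. $P(W \cap R)$, 4. $P(W \cap V)$, 5. $P(W \cap R \cap V)$.
Conditional: 6. $P(W \mid V)$, 7. $P(W \cap R \mid V)$, 8. $P(W \mid R)$, and 9. $P(W \cap V \mid R)$.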
This comprehensive treatment offers an opportunity to advance our knowledge of event probabilities. Another advantage of our method over those of Jan and Shieh (2011) and Shieh and Jan (2012) is that it is simple yet accurate and in line with textbook methods, so that applied researchers can easily comprehend it.
The remainder of this article is organized as follows. In the section immediately following, sample size planning while considering corresponding probabilities is proposed. In the first sub-section, the Welch test statistic is introduced and the sample size for event W is discussed. In the next sub-section, sampling cost is added on. The next section, 'Method Comparisons', describes an illustrative example and compares extant methods to assess the proposed R Shiny apps. The succeeding section, 'Sample Size Tables and Simulation', presents sample size tables and a simulation study, with the last section concluding and discussing the implications.
Sample Size Planning While Considering Corresponding Probabilities
Test Statistic and Sample Sizes for Event of Interest
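Let $X_{ij}$ be independent and normally distributed with mean (or expectation) $\mu_j$, $i = 1, \ldots, n_j$ (sample size), $j = 1, 2$. Suppose the two variances are $\sigma_1^2$ and $\sigma_2^2$, and their corresponding sample variances are $S_1^2$ and $S_2^2$, respectively, where $S_j^2 = \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2 / (n_j - 1)$ and $\bar{X}_j$ denotes the sample mean of group j. Based on random samples from these populations, $\bar{X}_1 - \bar{X}_2$ is normally distributed with mean (or expectation) $\mu_1 - \mu_2$ and variance $\sigma_1^2/n_1 + \sigma_2^2/n_2$. To test the null hypothesis $H_0\colon \mu_1 = \mu_2$, we can use Welch’s (1938) statistic, defined as

$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}, \tag{1}$$

with approximate Type I error rate $\alpha$ when $H_0$ is rejected if $|T| > t_{1-\alpha/2,\,\hat{\nu}}$, where $t_{1-\alpha/2,\,\hat{\nu}}$ is the $100(1-\alpha/2)$th percentile of the t distribution with degrees of freedom $\hat{\nu}$, which is calculated from the data as

$$\hat{\nu} = \frac{\left(S_1^2/n_1 + S_2^2/n_2\right)^2}{\dfrac{(S_1^2/n_1)^2}{n_1 - 1} + \dfrac{(S_2^2/n_2)^2}{n_2 - 1}}.$$

The probability of rejection is then approximately

$$P(R) \approx 1 - F_{\nu,\lambda}\!\left(t_{1-\alpha/2,\,\nu}\right) + F_{\nu,\lambda}\!\left(-t_{1-\alpha/2,\,\nu}\right),$$

where $T$ follows a non-central t distribution, $F_{\nu,\lambda}(x)$ denotes the cumulative distribution function of a non-central t random variate at x with $\nu$ degrees of freedom ($\hat{\nu}$ with $S_j^2$ replaced by $\sigma_j^2$), and the non-centrality parameter is $\lambda = (\mu_1 - \mu_2)/\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$. The $100(1-\alpha)\%$ CI for the mean difference is

$$(\bar{X}_1 - \bar{X}_2) \pm t_{1-\alpha/2,\,\hat{\nu}} \sqrt{S_1^2/n_1 + S_2^2/n_2}. \tag{2}$$

The CI width is then

$$2\, t_{1-\alpha/2,\,\hat{\nu}} \sqrt{S_1^2/n_1 + S_2^2/n_2}. \tag{3}$$

If the CI width is required to be less than or equal to a desired width value w, which should be sensibly chosen according to the alternative hypothesis, the required group sizes can be determined by

$$n_1 \ge \frac{4\, t_{1-\alpha/2,\,\hat{\nu}}^2 \left(S_1^2 + S_2^2/k\right)}{w^2}, \qquad n_2 = k\, n_1, \tag{4}$$

where $k = n_2/n_1$ is an allocation ratio.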
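As a rough illustration of the quantities just introduced, the following Python sketch computes Welch's statistic, the data-based (Satterthwaite-type) degrees of freedom, and the CI width. It is a minimal sketch, not the authors' apps: the function names and the data are made up, and because the Python standard library has no t quantile function, the critical value `t_crit` is assumed to be supplied by the user from a t table or statistical software.

```python
import math
from statistics import mean, variance


def welch_summary(x, y):
    """Return Welch's T, the data-based degrees of freedom, and the standard error."""
    n1, n2 = len(x), len(y)
    v1 = variance(x) / n1  # S_1^2 / n_1
    v2 = variance(y) / n2  # S_2^2 / n_2
    se = math.sqrt(v1 + v2)
    t_stat = (mean(x) - mean(y)) / se
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df, se


def ci_width(se, t_crit):
    """Full width of the two-sided CI: 2 * t_crit * se (t_crit supplied by the user)."""
    return 2.0 * t_crit * se


# Made-up data, for illustration only.
group1 = [1.0, 2.0, 3.0, 4.0, 5.0]
group2 = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
t_stat, df, se = welch_summary(group1, group2)
```

Because the width depends on the sample variances, repeating the calculation on a new sample of the same size will generally give a different width, which is the point taken up next in the text.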
It should be noted that, in Equation (4), is an estimate of and is random, sometimes underestimating and sometimes overestimating , j = 1,2. Hence, the sample sizes in Equation (4) bring about . As such, it is imperative to find adequate sample sizes such that the probability, , achieves the desired probability level . That is, the group sizes and must be chosen such that they satisfy
Based on Casella and Berger (2002),
It is known that is distributed approximately as distribution with degrees of freedom. Assuming , Equation (5) can be re-written as
where and . Because Equation (7) cannot be solved explicitly, App (I) (see Supplementary Materials) was developed to find the required sample sizes to satisfy , where is the percentile of the distribution with degrees of freedom. In addition to , other unconditional/conditional probabilities (cases) of event are also obtained in Appendix.
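The randomness of the CI width can be made concrete with a small Monte Carlo sketch in Python. This is not App (I): to stay within the standard library it replaces the t percentile with a standard normal percentile, and the planning values (standard deviations of 10, desired width 7, equal group sizes) are hypothetical. Under these assumptions, group sizes obtained by plugging the true variances into the width formula should reach the desired width only about half the time, and somewhat larger groups raise that probability.

```python
import math
import random
from statistics import NormalDist


def sample_variance(rng, n, sigma):
    """Sample variance of n normal draws with standard deviation sigma."""
    xs = [rng.gauss(0.0, sigma) for _ in range(n)]
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def prob_width(n1, n2, sigma1, sigma2, w, alpha=0.05, reps=2000, seed=1):
    """Monte Carlo estimate of P(W): the share of replications with CI width <= w."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # normal stand-in for the t percentile
    hits = 0
    for _ in range(reps):
        s1 = sample_variance(rng, n1, sigma1)
        s2 = sample_variance(rng, n2, sigma2)
        width = 2 * z * math.sqrt(s1 / n1 + s2 / n2)
        hits += width <= w
    return hits / reps


def naive_n(sigma1, sigma2, w, alpha=0.05):
    """Equal group size from the width formula with variances treated as known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(4 * z ** 2 * (sigma1 ** 2 + sigma2 ** 2) / w ** 2)


# Hypothetical planning values: sigma_1 = sigma_2 = 10, desired width w = 7.
n = naive_n(10.0, 10.0, 7.0)
p_naive = prob_width(n, n, 10.0, 10.0, 7.0)
p_larger = prob_width(n + 12, n + 12, 10.0, 10.0, 7.0)
```

A search for the smallest group sizes whose estimated probability reaches the desired level would wrap `prob_width` in a loop over candidate sizes; the apps do this with distributional results rather than simulation.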
Adding on Sampling Costs
In the present study, let C represent the total sampling cost, excluding any fixed overhead costs, and suppose the total cost is a linear function of the sample sizes, for which the cost function is

$$C = c_1 n_1 + c_2 n_2, \tag{8}$$

where $c_j$ is the cost of obtaining an observation from group j. If sampling cost is the concern, we need to find the optimal sample size allocation ratio, $k = n_2/n_1$, so that the total cost (or total sample size, $n_1 + n_2$) can be minimal. Pentico (1981) provided an optimal allocation ratio in a normal case when variances are known, and the resulting sample sizes reflect a minimal total cost. When variances are unknown, his method can be adopted by using some planning values of variance, such as the sample variances, to calculate the allocation ratio

$$k = \frac{S_2}{S_1} \sqrt{\frac{c_1}{c_2}}.$$
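The allocation idea can be sketched in a few lines of Python. The sketch assumes the classical cost-optimal rule in which the ratio $n_2/n_1$ equals the ratio of standard deviations times the square root of the inverse cost ratio, that is, $k = (S_2/S_1)\sqrt{c_1/c_2}$; the function names and all planning values below are hypothetical and chosen only to show how variances and unit costs pull the ratio in different directions.

```python
import math


def allocation_ratio(var1, var2, cost1, cost2):
    """Cost-optimal ratio k = n2/n1 = (sd2/sd1) * sqrt(c1/c2), using planning variances."""
    return math.sqrt(var2 / var1) * math.sqrt(cost1 / cost2)


def total_cost(n1, n2, cost1, cost2):
    """Linear sampling cost C = c1*n1 + c2*n2 (fixed overhead excluded)."""
    return cost1 * n1 + cost2 * n2


# Hypothetical planning values.
k_equal_costs = allocation_ratio(4.0, 1.0, 1.0, 1.0)    # group 2 has the smaller variance
k_costly_group2 = allocation_ratio(4.0, 1.0, 1.0, 4.0)  # group 2 is also more expensive
k_offsetting = allocation_ratio(1.0, 4.0, 1.0, 4.0)     # larger variance, higher cost
```

In the third call the larger variance of group 2 and its higher unit cost offset each other, so the ratio returns to 1.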
For motive (a), achieving a desired probability level for minimal total cost, the provided App (I) can be executed by exhaustion algorithms (Guo & Luh, 2020), the optimal sample size (setting k = 0, see Figure 1) then can be obtained in any of the nine probability cases so that the total sampling cost can be minimal. To achieve optimization, researchers need to form a range of plausible values of by specifying two adjustment values, , to better execute exhaustion algorithms in App (I). It is suggested that and be used to form a narrower search space. If a warning statement “a is too large” or “b is too small” appears, researchers need to adjust the values to enlarge the search space. Several reminders are in order. Researchers can specify one- or two-tailed tests in the apps, and the value of the mean difference can be either positive or negative. If the researcher only considers , the desired width can be plugged in as any positive value. Conversely, if the event of interest is not related to rejection, then the mean difference ( ) can be plugged in as any value. Further, setting the allocation ratio (k = 0) and unit costs (1, 1) can find the minimal total number of sample sizes. Note that when the allocation ratio (k) is pre-specified (e.g., 0.5, 1, 2, 3…) other than 0, App (I) can also be employed to find the needed sample size for any of the nine probabilities. However, in this case, the desired level can still be achieved, but the total cost may not be minimal.
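To illustrate what an exhaustive search for motive (a) might look like, the Python sketch below scans a grid of group sizes and keeps the cheapest pair whose probability of rejection reaches a target. It is only a sketch under stated assumptions, not the algorithm inside App (I): the power is computed with a normal approximation instead of the non-central t distribution, only the rejection event is considered, and the names `approx_power`, `min_cost_design`, `n_min`, and `n_max` as well as the planning values are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(n1, n2, delta, var1, var2, alpha=0.05):
    """Two-sided power under a normal approximation (stand-in for the non-central t)."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    lam = delta / math.sqrt(var1 / n1 + var2 / n2)
    return 1 - STD_NORMAL.cdf(z - lam) + STD_NORMAL.cdf(-z - lam)


def min_cost_design(delta, var1, var2, cost1, cost2, target,
                    alpha=0.05, n_min=2, n_max=200):
    """Exhaustively search n_min..n_max for the cheapest (n1, n2) reaching the target."""
    best = None  # (cost, n1, n2, power)
    for n1 in range(n_min, n_max + 1):
        for n2 in range(n_min, n_max + 1):
            cost = cost1 * n1 + cost2 * n2
            if best is not None and cost >= best[0]:
                break  # cost grows with n2, so nothing cheaper remains for this n1
            power = approx_power(n1, n2, delta, var1, var2, alpha)
            if power >= target:
                best = (cost, n1, n2, power)
                break  # the first feasible n2 is the cheapest one for this n1
    return best


# Hypothetical planning values: mean difference 1, unit variances, unit costs.
best = min_cost_design(delta=1.0, var1=1.0, var2=1.0,
                       cost1=1.0, cost2=1.0, target=0.90)
```

Several designs can tie at the minimal cost; this sketch simply returns the first one it meets. If the search range is too narrow, the function returns `None`, which is the situation in which a search space would need to be enlarged.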
For motive (b), on the other hand, if there is a budget constraint and the total sampling cost is limited, we need to choose an optimal sample size such that a maximal probability of the event can be obtained. Thus, Equation (8) for the cost function can be used to obtain the initial sample size, respectively, as
For this task, App (II) (see Supplementary Materials) is supplied. It should be noted that a floor function is used in computing to ensure that the resulting cost is no more than the given total cost C.
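The budget-constrained search of motive (b) can be sketched in Python as follows. This is an illustrative stand-in, not App (II): instead of maximizing one of the nine event probabilities, it minimizes the variance of the mean difference, which under a normal approximation maximizes power and shortens the expected CI. The floor step mirrors the remark above about keeping the design within the total cost; the function name and the planning values are hypothetical.

```python
import math


def best_design_for_budget(budget, var1, var2, cost1, cost2, n_min=2):
    """Search all n1 that fit the budget; n2 takes whatever the remaining budget allows."""
    best = None  # (variance of the mean difference, n1, n2)
    max_n1 = int((budget - cost2 * n_min) // cost1)
    for n1 in range(n_min, max_n1 + 1):
        # Floor keeps cost1*n1 + cost2*n2 at or below the budget.
        n2 = math.floor((budget - cost1 * n1) / cost2)
        if n2 < n_min:
            continue
        var_diff = var1 / n1 + var2 / n2
        if best is None or var_diff < best[0]:
            best = (var_diff, n1, n2)
    return best


# Hypothetical planning values: variances (4, 1), unit costs (1, 2), budget 60.
var_diff, n1, n2 = best_design_for_budget(60, 4.0, 1.0, 1, 2)
spent = 1 * n1 + 2 * n2
```

Because the criterion differs from the event probabilities used in the apps, the group sizes from this sketch need not coincide with those reported for App (II).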
Method Comparisons
In this section, several comparisons are described in order to see if the proposed approach gives consistent or better results than existing methods for two-tailed test or two-sided CIs. For motive (a), firstly, when variances are equal, in Liu’s study (2012), patients were assigned to receive either psychotherapy to reduce stress in terms of blood pressure or no intervention. The population variances for blood pressure were , the minimal effect size was (mmHg), and the CI width was set to be 7. We set the same condition as Table 1 of Liu (2012) and set the unit cost as . For the case of , we employed App (I) and found the computed sample sizes of (64, 64) with the obtained probability of .80146. For the case of , we found the computed sample sizes of (70, 70) with the obtained probability of .803865. Both cases are consistent with Liu’s result. Secondly, when variances are unequal, we used the condition in Table 1 of Lee (1992) to set , , for two-sided , and . After employing App (I), we obtained sample sizes (21, 10) for power of .8 and (27, 13) for power of .9, compared to his results of (21, 11) and (27, 14), respectively. Taken together, these results indicate consistency in terms of sample sizes. Thirdly, when unequal variances and unit costs are the concern and motive (a) is the focus, we ran all the conditions by employing App (I) as Table 4 of Jan and Shieh (2011) for ; and fourthly, to consider CI width, we ran all the conditions as Table 8 of Shieh and Jan (2012) for . As a result, almost the same values of sample sizes were obtained, and only a few cases of discrepancy were identified, which are summarized in Table 1 for and Table 2 for . It is apparent from these two tables that while the resulting probabilities can be satisfactorily maintained, our total cost is slightly lower than those of their methods.
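Table 1. Cases of discrepancy between the proposed App (I) and Jan and Shieh (2011)

| Variances with unit costs | App (I): total cost | App (I): $n_1$ | App (I): $n_2$ | App (I): probability | Jan & Shieh (2011): total cost | Jan & Shieh (2011): $n_1$ | Jan & Shieh (2011): $n_2$ | Jan & Shieh (2011): probability |
|---|---|---|---|---|---|---|---|---|
| (1,1) with (1,1) | 45 | 22 | 23 | .906142 | 45 | 23 | 22 | .9057 |
| (1,1) with (1,3) | 83 | 29 | 18 | .900254 | 84 | 30 | 18 | .9032 |
| (1/9,1) with (1,1) | 21 | 5 | 16 | .902258 | 22 | 6 | 16 | .9144 |
| (1/9,1) with (1,2) | 36 | 6 | 15 | .900894 | 37 | 7 | 15 | .9086 |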
Note. Setting ; ; .
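Table 2. Cases of discrepancy between the proposed App (I) and Shieh and Jan (2012)

| Variances with unit costs | App (I): total cost | App (I): $n_1$ | App (I): $n_2$ | App (I): probability | Shieh & Jan (2012): total cost | Shieh & Jan (2012): $n_1$ | Shieh & Jan (2012): $n_2$ | Shieh & Jan (2012): probability |
|---|---|---|---|---|---|---|---|---|
| (4,1) with (1,3) | 253 | 130 | 41 | .901892 | 254 | 128 | 42 | .9060 |
| (9,1) with (1,2) | 337 | 225 | 56 | .900217 | 338 | 226 | 56 | .9057 |
| (9,1) with (1,3) | 390 | 240 | 50 | .900709 | 391 | 238 | 51 | .9042 |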
Note. Setting ; ; .
For motive (b), with a fixed total cost, we employed App (II) to run all the conditions in Table 3 of Jan and Shieh (2011) for $P(R)$ and in Table 7 of Shieh and Jan (2012) for $P(W)$. For $P(R)$, we obtained the same group sizes as Jan and Shieh (2011) in every condition, with comparable power. For $P(W)$, three conditions gave group sizes different from those of Shieh and Jan (2012); these are presented in Table 3.
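Table 3. Cases of discrepancy between the proposed App (II) and Shieh and Jan (2012) for a fixed total cost

| Variances with unit costs | Fixed total cost | App (II): $n_1$ | App (II): $n_2$ | App (II): probability | Shieh & Jan (2012): $n_1$ | Shieh & Jan (2012): $n_2$ | Shieh & Jan (2012): probability |
|---|---|---|---|---|---|---|---|
| (1/9,1) with (1,3) | 50 | 8 | 14 | .152415 | 11 | 13 | .1546 |
| (1,1) with (1,3) | 80 | 35 | 15 | .066037 | 38 | 14 | .0679 |
| (4,1) with (1,2) | 180 | 104 | 38 | .470735 | 106 | 37 | .4723 |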
Note. Setting ; .
Sample Size Tables and Simulation
It should be noted that, in addition to the true mean difference and the desired width, the variances and unit costs are also key elements for allocating group sizes. Thus, these four parameters were varied to present the features of sample size planning in the following two sample size tables for the nine probability cases (all sample sizes are rounded up to the nearest integer) for motive (a). In Table 4 for equal variances/unit costs, there are six configurations of $(\mu_1 - \mu_2, w)$, arranged with a small (or large) effect together with a narrow (or wide) desired width, while in Table 5 there are four configurations for unequal variances/unit costs. These tables are quite revealing in several ways. First, the impacts of the true mean difference and the desired width can be seen in Table 4 by taking Case 1, $P(R)$, and Case 2, $P(W)$, as cross-references. In Case 1, as the true mean difference increases from 2 to 8 (see Columns 1-3), the sample sizes decrease from (393, 394) to (26, 26) and are identical in the upper and lower panels, showing that Case 1 is not influenced by the value of w at all.
Table 4. Sample sizes $(n_1, n_2)$ for the nine cases with equal variances and unit costs; column headings give (true mean difference, w)

| Case | (2, 3) | (4, 3) | (8, 3) |
|---|---|---|---|
| 1. | 393, 394 | 99, 100 | 26, 26 |
| 2. | 358, 358 | 358, 358 | 358, 358 |
| 3. | 395, 395 | 358, 358 | 358, 358 |
| 4. | 361, 361 | 361, 361 | 361, 361 |
| 5. | 420, 420 | 361, 361 | 361, 361 |
| 6. | 358, 358 | 358, 358 | 358, 358 |
| 7. | 385, 386 | 358, 358 | 358, 358 |
| 8. | 357, 358 | 358, 358 | 358, 358 |
| 9. | 359, 360 | 361, 361 | 361, 361 |

| Case | (2, 4) | (4, 4) | (8, 4) |
|---|---|---|---|
| 1. | 393, 394 | 99, 100 | 26, 26 |
| 2. | 205, 205 | 205, 205 | 205, 205 |
| 3. | 393, 394 | 205, 206 | 205, 205 |
| 4. | 207, 207 | 207, 207 | 207, 207 |
| 5. | 420, 420 | 207, 207 | 207, 207 |
| 6. | 205, 205 | 205, 205 | 205, 205 |
| 7. | 379, 379 | 205, 205 | 205, 205 |
| 8. | 204, 204 | 204, 205 | 205, 205 |
| 9. | 206, 206 | 206, 206 | 207, 207 |
Note. Setting ; ; ; unit costs (1,1).
These sample sizes are consistent with the results in Table 2 of Liu’s study (2012). In Case 2, by contrast, as the desired width increased from 3 to 4 (see the upper and lower panels), the needed sample sizes were dramatically reduced, regardless of the value of the true mean difference. This independence from the true mean difference can also be observed in Cases 4 and 6. Further, comparing the sample sizes of Cases 1 and 2 reveals clear discrepancies; that is, the sample sizes for hypothesis testing and for constructing CIs are different, as mentioned in Borenstein et al. (2001). Second, the more events that are considered, the greater the required sample sizes. Among Cases 1-5, the greatest sample size is required in Case 5, which is the intersection of events W, R, and V. Third, if the probability is conditional on the event validity, the required sample sizes are somewhat smaller than those of the intersection events (see Case 6 vs. Case 4 and Case 7 vs. Case 5). Note that $P(V) < 1$ and thus $P(W \mid V) = P(W \cap V)/P(V) > P(W \cap V)$. Similarly, if the probability is conditional on the event rejection, the required sample sizes are also smaller than those of the intersection events (see Case 8 vs. Case 3 and Case 9 vs. Case 5). Finally, if the true mean difference is smaller than the desired width, the resulting sample sizes of Cases 2-9 show greater discrepancies than when it is not. The smaller the mean difference relative to the width, the greater the discrepancy: it can be seen that the discrepancies among the sample sizes for the configuration (2, 4) are greater than those for (2, 3). To sum up, the required sample sizes vary considerably among the nine cases under different parameter configurations and events of interest.
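Table 5. Sample sizes $(n_1, n_2)$ for the nine cases under four configurations of variances and unit costs

| Case | Column 1 | Column 2 | Column 3 | Column 4 |
|---|---|---|---|---|
| 1. | 64, 64 | 49, 24 | 63, 17 | 41, 39 |
| 2. | 36, 37 | 29, 15 | 36, 11 | 25, 23 |
| 3. | 64, 64 | 49, 24 | 63, 17 | 41, 39 |
| 4. | 37, 38 | 30, 15 | 39, 11 | 26, 23 |
| 5. | 68, 69 | 52, 26 | 68, 18 | 44, 41 |
| 6. | 36, 37 | 29, 15 | 37, 11 | 26, 20 |
| 7. | 61, 62 | 47, 24 | 62, 16 | 40, 36 |
| 8. | 36, 36 | 28, 14 | 37, 10 | 24, 23 |
| 9. | 36, 37 | 29, 15 | 36, 11 | 25, 23 |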
Note. Setting ; ; ; ; .
In Table 5, we considered unequal variances/unit costs and demonstrate four configurations. Take Column 1 as a reference, where variances and unit costs are both equal. It can be seen from Column 2 that approximately half the sample size is needed for the second group because the variance in that group is reduced. In other words, the allocation k is 0.5. Then, take Column 2 as a reference where unit costs are equal. As can be seen from Column 3, a much smaller sample size is required for the second group due to the higher unit cost in that group. In this situation, the allocation k is 0.25. Finally, Column 4 shows approximately equal group sizes because larger variance and higher cost are both taken into account. Note that although the allocation k is 1 in this situation, the resulting group sizes may not be equal because the algorithms in the apps seek optimization.
To carry out a computer simulation, we chose the optimal sample sizes (26, 20) in Case 6 from Table 5, given a true mean difference of 5 and w = 10, and ran 10,000 replications. The results show that the average mean difference is 5.0049 (SD = 2.2633) and the coverage rate is 0.9503, indicating good coverage. The average width of the CIs that cover the true mean difference is 9.1116, while the average width of those that do not is 8.5862. Accordingly, the empirical probabilities of the nine cases are 1. $P(R)$ = .5787, 2. $P(W)$ = .8100, 3. $P(W \cap R)$ = .4951, 4. $P(W \cap V)$ = .7637, 5. $P(W \cap R \cap V)$ = .4709, 6. $P(W \mid V)$ = .8036, 7. $P(W \cap R \mid V)$ = .4955, 8. $P(W \mid R)$ = .8555, and 9. $P(W \cap V \mid R)$ = .8137. As expected, the resulting probability of Case 6 is very close to the nominal level of .8, whereas Cases 1, 3, 4, 5, and 7 do not achieve the nominal level because the given sample size (26, 20) is insufficient for them (see Column 4 of Table 5).
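The way the nine empirical probabilities relate to one another can be sketched in Python. The function below is a minimal illustration, assuming each simulated replication has been reduced to three indicators for the events W, R, and V; the function name and the toy records are hypothetical. The last lines cross-check the conditional probabilities reported above against the reported joint probabilities, rejection rate, and coverage rate.

```python
def nine_case_probabilities(records):
    """records: iterable of (W, R, V) booleans, one triple per simulated replication.

    Assumes events R and V each occur at least once, so the conditional cases are defined.
    """
    records = list(records)
    n = len(records)

    def share(pred):
        return sum(1 for rec in records if pred(*rec)) / n

    p_r = share(lambda w, r, v: r)
    p_w = share(lambda w, r, v: w)
    p_v = share(lambda w, r, v: v)
    p_wr = share(lambda w, r, v: w and r)
    p_wv = share(lambda w, r, v: w and v)
    p_wrv = share(lambda w, r, v: w and r and v)
    return {
        1: p_r, 2: p_w, 3: p_wr, 4: p_wv, 5: p_wrv,       # unconditional cases
        6: p_wv / p_v, 7: p_wrv / p_v,                    # conditional on validity
        8: p_wr / p_r, 9: p_wrv / p_r,                    # conditional on rejection
    }


# Toy records, for illustration only.
toy = [(True, True, True), (True, False, True), (False, True, True), (True, True, False)]
toy_probs = nine_case_probabilities(toy)

# Cross-check of the reported empirical values (Cases 6-9 from Cases 1, 3, 4, 5 and coverage).
p_r, p_v, p_wr, p_wv, p_wrv = 0.5787, 0.9503, 0.4951, 0.7637, 0.4709
derived = {6: p_wv / p_v, 7: p_wrv / p_v, 8: p_wr / p_r, 9: p_wrv / p_r}
```

Each conditional case is a joint probability divided by the probability of the conditioning event, which is why the conditional cases always exceed their joint counterparts when the conditioning event has probability below 1.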
Conclusions and Discussion
Current discussions of sample size planning aim to fulfill one or more goals, such as power-based statistical tests, precision of estimation, minimal cost, or some other criterion. To avoid the pitfall of treating inferential statistics as descriptive statistics, probabilistic thinking is essential to sample size determination. Along with rapid advances in ready-to-use computer applications, the present study aimed to contribute to this growing area by developing R Shiny App (I), which finds sample sizes that minimize the total sampling cost for a desired probability level, and App (II), which maximizes the probability of an event of interest for a fixed total cost. Any of the nine probability cases can be selected in the apps, depending on the purpose of the research. The present study also showed that the sample size allocation ratio is proportional to the ratio of the standard deviations and inversely proportional to the ratio of the square roots of the unit costs. The idea is to sample more from the group that has the larger variance or the lower unit cost. When the unit costs or the standard deviations are dissimilar across groups, the use of optimal sample sizes can result in substantial savings. The true mean difference, the desired width, the variances, the unit costs, and the events of interest all interact in determining the optimal sample sizes. Their roles are just like a five-way tug of war. This study provides an important opportunity to advance the understanding of sample size planning.
As Lenth (2001) noted, not all sample size problems are the same. Cohen (1990) stated that “I have learned that there is no royal road to statistical induction, that the informed judgment of the investigator is the crucial element in the interpretation of data, and that things take time.” Among other things, a desired width can be sensibly chosen according to the alternative hypothesis (Cesana & Antonelli, 2010) and viable benchmarks for the width among disciplines may be established for improving meta-analysis (Smithson, 2003). Since a larger sample size is needed for a smaller and/or w, one of the issues that emerges from the present findings is to sort out the relation between the mean difference and a desired width. An implication of this is the possibility to consider a suitable range of alternative true mean difference as , setting (geometric mean), i.e., from a small to large effect based on Cohen’s effect size paradigm. Moreover, a suitable range of desired width is suggested as , which will result in a reasonable sample size value from about 20 to 400.
Since the advocacy of CIs is a central theme in statistical reform, additional methods and formulas for sample size determination in regard to various unconditional/conditional probabilities are required to address this need in the future to simultaneously obtain power and precision. The current findings add to a growing body of literature on sample size planning, and there is abundant room for further progress in multiple comparisons (simultaneous confidence intervals) (Liu, 2009; Pan & Kupper, 1999), linear contrasts (Bonett, 2009; Jan & Shieh, 2019; Luh & Guo, 2016), stratified sampling (Snedecor & Cochran, 1989), and multilevel analyses (Liu, 2003). Under certain circumstances, one group size may be pre-fixed to a certain number in a study; thus, the present study also supplied App (III) (see Supplementary Materials) to facilitate the recommended procedures in planning the required size of the other group so that the desired probability level can still be achieved.