Although the literature on the a priori procedure, designed to help researchers determine the sample sizes they should use in their substantive research, is expanding rapidly, there are two important limitations. First, there is a need to expand to new popular distributions, log-normal and gamma distributions, and the present work provides those expansions. Second, there is a need to test the consequences of wrong distributional assumptions; for example, assuming a log-normal distribution when the population follows a gamma distribution, or the reverse. The present work addresses the limitations with respect to estimating population means, and it includes computer simulations, links to free and user-friendly programs that researchers can utilize for their own research, and two examples involving real data sets for illustrations of our main results.

Although the a priori procedure (APP) can be used post-data, it was designed to be used pre-data to determine the sample sizes researchers should collect to simultaneously consider two issues: precision and confidence. The precision issue concerns how close sample statistics are to their corresponding population parameters and the confidence issue concerns the probability of meeting the precision criterion. For example, a researcher might be interested in having 95% confidence of obtaining a sample mean difference that is within one-tenth of a standard deviation of the population mean difference, and an APP equation could provide that answer. APP equations have been devised for a number of purposes such as estimating a single population mean under normality or under skew normality (

But not surprisingly, for a new literature, there are limitations and one of them is the need for expansion to additional distribution families. The present work addresses that limitation with respect to estimating population means under log-normal and gamma distributions. Both distributions are continuous probability distributions that are widely used in different fields of science to model continuous variables that are always positive and have skewed distributions. They can be used to fit the data collected in (i) human behaviors such as the length of comments posted in Internet discussion forums; (ii) biology and medicine such as measures of size of living tissue and blood pressure; (iii) social sciences and demographics such as the household income; (iv) reliability analysis, wireless communications, and computer networks and internet traffic analysis, etc.

An additional limitation is that APP calculations require the researcher to make a distributional assumption, but there has been no APP work exploring the consequences when the distributional assumption is wrong. The present work is the first such exploration. In brief, suppose that a population follows a log-normal distribution and the researcher assumes a gamma distribution, or suppose the population follows a gamma distribution and the researcher assumes a log-normal distribution; either way, what are the consequences of being wrong? One possibility is that the sample size determinations are similar for both log-normal and gamma distributions. In that case, the consequence of choosing the wrong distribution could be considered minor because little is lost by the wrong choice. In contrast, if sample size determinations are dissimilar for the two distributions, then using the wrong distribution could entail major consequences, such as having much less precision than the calculation implies. A caveat is that, because of the newness of the APP, procedures have not yet been developed for distinguishing when a discrepancy is minor or major in an APP context.

In summary, the present work was designed to make three main contributions. First, our goal was to expand the APP to two new distributions: log-normal and gamma. Second, we desired to provide the first exploration of the consequences of wrong distributional assumptions, in an APP context, based on the APP expansion to log-normal and gamma distributions. Third, the simulation results show that the coverage rates matched our specified precision and confidence level very well in both log-normal and gamma distributions.

To derive the APP for estimating the population mean, we need the following results about the log-normal and gamma distributions.

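Definition 1. A random variable X is said to have a log-normal distribution with parameters μ and σ^{2}, denoted X ~ LN(μ, σ^{2}), if log(X) is normally distributed with mean μ and variance σ^{2}, that is, log(X) ~ N(μ, σ^{2}). The mean and variance of X are μ_{x} = exp(μ + σ^{2}/2) and σ_{x}^{2} = (exp(σ^{2}) − 1) exp(2μ + σ^{2}).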

Note that it is easy to show that the

Numerical convolution of log-normal distributions has shown that the sum of such
distributions is a distribution which follows the log-normal law with a fair
approximation (but not exactly). Therefore, it can be assumed that the sum of two
(or more) log-normal distributions is, in a first approximation, another log-normal
distribution. The problem is to find this distribution without the tedious work of
numerical convolution. The basic idea is to find a log-normal distribution which has
the same moments as the exact sum-distribution.
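
The moment-matching idea described above can be sketched numerically. The following is a minimal illustration rather than the authors' code: the function names are hypothetical, and the values μ = 1, σ = 0.75, and n = 10 are arbitrary choices for demonstration. It finds the single log-normal distribution whose mean and variance equal those of a sum of n independent LN(μ, σ^{2}) variables.

```python
import math


def lognormal_moments(mu, sigma2):
    """Mean and variance of LN(mu, sigma2)."""
    mean = math.exp(mu + sigma2 / 2)
    var = (math.exp(sigma2) - 1) * math.exp(2 * mu + sigma2)
    return mean, var


def matched_lognormal_for_sum(mu, sigma2, n):
    """Parameters of the log-normal with the same first two moments
    as the sum of n independent LN(mu, sigma2) variables."""
    mean, var = lognormal_moments(mu, sigma2)
    sum_mean, sum_var = n * mean, n * var
    # Invert the log-normal mean/variance formulas for the sum.
    sigma2_s = math.log(1 + sum_var / sum_mean ** 2)
    mu_s = math.log(sum_mean) - sigma2_s / 2
    return mu_s, sigma2_s


# Illustrative (assumed) inputs.
mu, sigma2, n = 1.0, 0.75 ** 2, 10
mu_s, sigma2_s = matched_lognormal_for_sum(mu, sigma2, n)
single_mean, single_var = lognormal_moments(mu, sigma2)
approx_mean, approx_var = lognormal_moments(mu_s, sigma2_s)
print(mu_s, sigma2_s)
print(approx_mean, n * single_mean)  # the two means agree
print(approx_var, n * single_var)    # the two variances agree
```

The matched distribution reproduces only the first two moments of the sum; higher moments and tail behavior are approximated, which is why the text describes the result as a fair approximation rather than an exact law.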

_{1}, …, _{n},
where each X_{i}^{2}). _{n} and

Let X_{1}, …, X_{n} be a random sample from the log-normal
distribution LN(μ, σ^{2}).

^{2}).

_{n} − log

Density curves of

In order to compare the log-normal distribution and the gamma distribution, we need the following definition and properties.

_{1}, …, _{n} be
a random sample from the gamma distribution Gamma

^{2}/

In this section, we set up the APP procedures for (i) estimating the population mean from a random sample from the log-normal distribution with parameters μ and σ^{2}, and (ii) estimating the population mean from a random sample from the gamma distribution with the shape parameter

In order to determine the necessary sample size _{x} given in Definition 1 for known standard deviation

^{2}). _{x}. _{1}
_{2}
_{x}
_{x}_{x} is minimized,
where

The required sample size _{1} and _{2} such that
_{x} based on Theorem 1 is

Similarly, the necessary sample size _{y} given in Definition 2 for known shape parameter

_{1}, …, _{n} be independent and
identically distributed random variables from the gamma distribution
Gamma_{y}. _{1}
_{2}
_{y}
_{y}_{y} is minimized,
where

The required sample size _{1}, and _{2} can be
obtained simultaneously, given the precision _{y} based on Theorem 2
is

Like any inferential statistical procedure, the APP assumes a statistical model. For example, we assume log-normal and gamma distributions in this paper. What if particular assumptions are wrong? For example, what if we get the distribution wrong? An important question is: How robust is the APP to various sorts of assumption violations? If one could show robustness to at least some sorts of assumption violations, that would be very helpful. Even if one finds some assumption violations to which the APP is not robust, that would still be useful because we would know where it is important to be careful. With this in mind, we first consider sample size. As a simple example, suppose we assume a gamma distribution when the population is in fact log-normal. We want to see what difference this makes to the estimated necessary sample size.
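
One way to explore this question numerically is sketched below. It is a minimal illustration under stated assumptions, not the procedure of Theorems 1 and 2, so its output need not match the tabulated sample sizes. The assumptions are: (a) precision is measured in units of the population standard deviation; (b) for log-normal data the sample mean is approximated by a moment-matched log-normal distribution; (c) for gamma data the sample mean is exactly gamma distributed, and its quantiles are approximated with the Wilson-Hilferty formula; (d) the interval used is the shortest one with the requested coverage, found by a grid search. All function names and the inputs (σ = 0.75, precision 0.2, confidence 0.95) are hypothetical choices for demonstration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def lognormal_mean_quantile(p, n, cv):
    """Quantile of (sample mean - mean) / sd for log-normal data,
    using a moment-matched log-normal for the sample mean (assumption)."""
    s2 = math.log(1 + cv ** 2 / n)
    z = STD_NORMAL.inv_cdf(p)
    ratio = math.exp(-s2 / 2 + math.sqrt(s2) * z)  # quantile of mean ratio
    return (ratio - 1) / cv


def gamma_mean_quantile(p, n, cv):
    """Same quantity for gamma data; the sample mean has shape n / cv^2,
    and its quantile uses the Wilson-Hilferty approximation (assumption)."""
    k = n / cv ** 2
    z = STD_NORMAL.inv_cdf(p)
    ratio = (1 - 1 / (9 * k) + z / (3 * math.sqrt(k))) ** 3
    return (ratio - 1) / cv


def shortest_interval(quantile, n, cv, c, grid=400):
    """Grid search over the lower-tail probability for the shortest
    interval (f1, f2) with coverage c."""
    best = None
    for i in range(grid):
        lower_tail = (1 - c) * (i + 0.5) / grid
        f1 = quantile(lower_tail, n, cv)
        f2 = quantile(lower_tail + c, n, cv)
        if best is None or f2 - f1 < best[1] - best[0]:
            best = (f1, f2)
    return best


def required_n(quantile, cv, f, c, n_max=100000):
    """Smallest n whose shortest interval lies inside [-f, f]."""
    for n in range(2, n_max + 1):
        f1, f2 = shortest_interval(quantile, n, cv, c)
        if max(abs(f1), f2) <= f:
            return n, f1, f2
    raise ValueError("no sample size found up to n_max")


# Both families share the same coefficient of variation (same mean and
# variance), here taken from an assumed log-normal with sigma = 0.75.
cv = math.sqrt(math.exp(0.75 ** 2) - 1)
n_ln, f1_ln, f2_ln = required_n(lognormal_mean_quantile, cv, f=0.2, c=0.95)
n_ga, f1_ga, f2_ga = required_n(gamma_mean_quantile, cv, f=0.2, c=0.95)
print("log-normal assumption:", n_ln, f1_ln, f2_ln)
print("gamma assumption:     ", n_ga, f1_ga, f2_ga)
```

Comparing the two returned sample sizes for the same mean, variance, precision, and confidence level gives a rough sense of how much the distributional assumption matters under these approximations.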

When we determine the necessary sample sizes for estimating the population mean, we assume ^{2} is known in log-normal setting and assume ^{2} and

Precision | Confidence level | Sample size^{a} | Left precision | Right precision |
---|---|---|---|---|
0.1 | 0.95 | 391 | −0.0978204 | 0.0996861 |
0.1 | 0.9 | 275 | −0.0998779 | 0.0989713 |
0.15 | 0.95 | 177 | −0.1453921 | 0.1496078 |
0.15 | 0.9 | 124 | −0.1493012 | 0.1473061 |
0.2 | 0.95 | 105 | −0.1921884 | 0.1998908 |
0.2 | 0.9 | 69 | −0.1967918 | 0.1934145 |
0.25 | 0.95 | 67 | −0.2378956 | 0.2497299 |
0.25 | 0.9 | 48 | −0.2474282 | 0.2422409 |

^{a}To find the sample size needed to have a particular probability that the sample mean will be within a desired distance of the population mean, assuming the population is log-normally distributed, LN(μ, σ^{2}), the lognormal program can be used at

Precision | Confidence level | Sample size^{a} | Left precision | Right precision |
---|---|---|---|---|
0.1 | 0.95 | 392 | −0.0986157 | 0.0998879 |
0.1 | 0.9 | 271 | −0.0998625 | 0.0992504 |
0.15 | 0.95 | 174 | −0.1470710 | 0.1499018 |
0.15 | 0.9 | 124 | −0.1500000 | 0.1486157 |
0.2 | 0.95 | 98 | −0.1940000 | 0.1991398 |
0.2 | 0.9 | 71 | −0.1986367 | 0.1962319 |
0.25 | 0.95 | 66 | −0.2408662 | 0.2487501 |
0.25 | 0.9 | 45 | −0.2487469 | 0.2450630 |

^{a}To find the sample size needed to have a particular probability that the sample mean will be within a desired distance of the population mean, assuming the population is gamma distributed, the gamma program can be used at

From the results given in

In this section, we conduct two simulations. In the first, we examine how large a difference arises when the same sample size is used to estimate the parameters of both models. To compare the two models, we use the log-likelihood, AIC, and BIC values.
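
The two information criteria named above have the standard definitions AIC = 2k − 2 log L and BIC = k log(n) − 2 log L, where k is the number of estimated parameters, n is the sample size, and log L is the maximized log-likelihood. The short sketch below applies them to a log-likelihood of −227.1051, one of the values reported in the simulation results of this section, with k = 2. The sample size n = 105 is an assumption inferred from the reported BIC, not a value stated in the text, and the function names are hypothetical.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion: k log(n) - 2 log L."""
    return k * math.log(n) - 2 * log_lik


log_lik = -227.1051  # reported log-likelihood
k = 2                # two parameters in either model
n = 105              # assumed sample size (inferred, not stated)

aic_value = aic(log_lik, k)
bic_value = bic(log_lik, k, n)
print(round(aic_value, 4))  # close to the reported AIC of 458.2102
print(round(bic_value, 4))  # close to the reported BIC of 463.5181
```

Because both models have two parameters, differences in AIC or BIC between them reduce to twice the difference in log-likelihood; the model with the smaller criterion value is preferred.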

The Akaike information criterion (AIC) is an estimator of prediction error and relative quality of statistical models for a given set of data (see

The simulation results are listed in

Log-normal (True model) |
Gamma |
|||
---|---|---|---|---|

True value | 1.0000 | 0.7500 | 1.3244 | 2.7190 |

Mean | 0.9993 | 0.7450 | 1.9914 | 1.8563 |

|Bias| | 0.0007 | 0.0050 | 0.6666 | 0.8627 |

Std. Dev. | 0.0727 | 0.0514 | 0.2563 | 0.0820 |

Log-L | −227.1051 | |||

AIC | 458.2102 | |||

BIC | 463.5181 | |||

_{AIC} |
0.0832 |

Log-normal |
Gamma (True model) |
|||
---|---|---|---|---|

True value | 1.5183 | 0.4270 | 5.0000 | 1.0000 |

Mean | 1.5057 | 0.4659 | 5.1665 | 0.9813 |

|Bias| | 0.0135 | 0.0372 | 0.1665 | 0.0187 |

Std. Dev. | 0.0470 | 0.0332 | 0.7155 | 0.1507 |

Log-L | −211.4705 | |||

AIC | 426.9410 | |||

BIC | 432.1109 | |||

_{AIC} |
0.1887 |
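
The "True value" entries listed for the non-true model in the two tables above appear consistent with choosing the parameters of that model so that its mean and variance equal those of the true model. That reading is an inference from the numbers rather than a statement in the text, and the sketch below (with hypothetical function names, and assuming the gamma distribution is parameterized by shape and scale) simply checks it.

```python
import math


def gamma_from_lognormal(mu, sigma):
    """Shape and scale of the gamma distribution with the same mean
    and variance as LN(mu, sigma^2)."""
    mean = math.exp(mu + sigma ** 2 / 2)
    cv2 = math.exp(sigma ** 2) - 1  # squared coefficient of variation
    shape = 1 / cv2
    scale = mean * cv2
    return shape, scale


def lognormal_from_gamma(shape, scale):
    """mu and sigma of the log-normal distribution with the same mean
    and variance as a gamma distribution with the given shape and scale."""
    cv2 = 1 / shape
    sigma2 = math.log(1 + cv2)
    mu = math.log(shape * scale) - sigma2 / 2
    return mu, math.sqrt(sigma2)


# True log-normal model with mu = 1 and sigma = 0.75.
shape, scale = gamma_from_lognormal(1.0, 0.75)
print(shape, scale)  # close to the tabulated 1.3244 and 2.7190

# True gamma model with shape 5 and scale 1.
mu, sigma = lognormal_from_gamma(5.0, 1.0)
print(mu, sigma)     # close to the tabulated 1.5183 and 0.4270
```

Under this reading, the bias reported for the non-true model measures how far its maximum likelihood estimates sit from the moment-matched values, not from any parameter of the data-generating distribution.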

In ^{2}) with

Similarly in

For the effectiveness of comparison between two models we use _{AIC}, the proportion of the true model selected by using AIC among _{AIC} = 0.9168 in _{AIC} = 0.8113 indicates the gamma (true model) is more frequently selected model by AIC than the log-normal model.

For investigating the changes of AIC and BIC values to parameter

In order to compare the coverage rates of confidence intervals with specified precision and confidence level, we process the second simulation for the performance of the confidence intervals obtained by using the derived sample sizes obtained. The coverage rate (_{x}) from Theorem 1 with parameters _{y}) from Theorem 2 with parameters

Precision ( |
Confidence level ( |
||||
---|---|---|---|---|---|

0.95 | 388 | 0.9514 | 392 | 0.9524 | |

0.9 | 273 | 0.9012 | 271 | 0.9003 | |

0.95 | 173 | 0.9519 | 178 | 0.9548 | |

0.9 | 121 | 0.9011 | 124 | 0.9063 | |

0.95 | 121 | 0.9011 | 102 | 0.9565 | |

0.9 | 68 | 0.9038 | 70 | 0.9072 | |

0.95 | 63 | 0.9532 | 65 | 0.9564 | |

0.9 | 44 | 0.9036 | 45 | 0.9088 |

Precision ( |
Confidence level ( |
||||
---|---|---|---|---|---|

0.95 | 388 | 0.9518 | 388 | 0.9513 | |

0.9 | 273 | 0.9018 | 273 | 0.9016 | |

0.95 | 173 | 0.9512 | 173 | 0.9519 | |

0.9 | 121 | 0.9003 | 121 | 0.9012 | |

0.95 | 98 | 0.9518 | 98 | 0.9526 | |

0.9 | 69 | 0.9045 | 69 | 0.9029 | |

0.95 | 63 | 0.9531 | 63 | 0.9530 | |

0.9 | 44 | 0.9036 | 44 | 0.9029 |
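
Coverage rates of the kind reported above can be approximated by straightforward Monte Carlo simulation. The sketch below is an illustration under assumptions, not the authors' simulation code: it draws log-normal samples with assumed parameters μ = 1 and σ = 0.75, measures precision in units of the population standard deviation, and uses the sample size 105 with the left and right precision values −0.1921884 and 0.1998908 from the log-normal sample size table. The function name, the number of replications, and the seed are arbitrary.

```python
import math
import random


def coverage_rate(mu, sigma, n, f1, f2, reps=2000, seed=1):
    """Proportion of simulated samples whose standardized sample-mean
    error (xbar - mean) / sd falls inside [f1, f2]."""
    rng = random.Random(seed)
    pop_mean = math.exp(mu + sigma ** 2 / 2)
    pop_sd = pop_mean * math.sqrt(math.exp(sigma ** 2) - 1)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.lognormvariate(mu, sigma) for _ in range(n)) / n
        error = (xbar - pop_mean) / pop_sd
        if f1 <= error <= f2:
            hits += 1
    return hits / reps


cr = coverage_rate(mu=1.0, sigma=0.75, n=105, f1=-0.1921884, f2=0.1998908)
print(cr)
```

With a few thousand replications the Monte Carlo standard error is roughly half a percentage point, so small departures from the nominal confidence level should not be over-interpreted.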

In this section, we analyze two real data sets to investigate the performance of our APP methods under log-normal and gamma assumptions.

The first data set is on the survival time (in months) of 184 patients who had limited stage small-cell lung cancer from

Distribution | Estimator | Log-L | AIC | BIC |
---|---|---|---|---|

First data set: Lung cancer | ||||

log-normal | ||||

gamma | −455.3863 | 914.7727 | 920.4133 | |

Second data set: Faculty salary | ||||

log-normal | −261.4429 | 526.8858 | 532.5102 | |

gamma |

The second data set is the salary data of faculty members in the College of Arts and Sciences,

The values of the log-likelihood, AIC, and BIC resulting from fitting the log-normal and gamma distributions to the two data sets are presented in

An important contribution is that the present work, which includes links to free and user-friendly programs, expands the APP so that it can be used under log-normal and gamma distributions. Thus, researchers who wish to know sample sizes needed to have sample statistics that are good estimators of corresponding population parameters, but who worry about not having normally distributed data, need worry no longer. One reason this is important is that most distributions are skewed (

In addition, however, the present work is, to our knowledge, the first APP work that directly addresses the issue of mistakes in identifying the relevant distribution family. To that end, we have explored the consequences of assuming a log-normal distribution in the presence of a gamma distribution, or assuming a gamma distribution in the presence of a log-normal distribution. The results are nuanced. Although the consequences of being wrong are minimal with respect to sample size computations,

In conclusion, the present equations and links to programs successfully expand the APP to log-normal and gamma distributions. And we have seen that the consequences of making a wrong assumption with respect to which family of distributions to use are often, but not always, minimal. We hope and expect that future research will include more APP expansions to distribution families not addressed here.

(i) If X ~ LN(μ, σ^{2}), then log(X) ~ N(μ, σ^{2}). Given any positive constant a, log(aX) = log(a) + log(X) ~ N(μ + log(a), σ^{2}), thus aX ~ LN(μ + log(a), σ^{2}). (ii) The result follows directly from Lemma 1 and (i) with

Consider the standardized random variable

Note that by (i) and (ii) in Proposition 1, we obtain

Consider the standardized random variable

Supplementary materials include three parts. The first is the R code for the
linked program that computes the required sample size under the gamma distribution.
The second is the R code for the linked program that computes the required sample
size under the log-normal distribution. The third is the R code for the simulations
and the real data analysis. (for access see

This research was partially supported by Shaanxi Province Plan to Improve Public Scientific Literacy of China (NO. 2021PSL122).

The authors have declared that no competing interests exist.

The authors would like to thank the editor and referees for their useful comments and suggestions, which significantly improved the quality of the present paper.