Although the a priori procedure (APP) can be used post-data, it was designed to be used pre-data to determine the sample sizes researchers should collect to simultaneously consider two issues: precision and confidence. The precision issue concerns how close sample statistics are to their corresponding population parameters and the confidence issue concerns the probability of meeting the precision criterion. For example, a researcher might be interested in having 95% confidence of obtaining a sample mean difference that is within one-tenth of a standard deviation of the population mean difference, and an APP equation could provide that answer. APP equations have been devised for a number of purposes such as estimating a single population mean under normality or under skew normality (Trafimow, Wang, & Wang, 2019; Trafimow, Wang, & Wang, 2020), estimating population mean differences for matched or independent samples under normality or under skew normality (Trafimow et al., 2020; Wang, Wang, Trafimow, & Myüz, 2019b; Wang, Wang, Trafimow, & Chen, 2019), estimating population scale values (Wang, Wang, Trafimow, & Myüz, 2019b), estimating population shape values (Wang, Wang, Trafimow, & Myüz, 2019a), and correlation coefficients (Wang et al., 2021).
But, not surprisingly for a new literature, there are limitations, and one of them is the need for expansion to additional distribution families. The present work addresses that limitation with respect to estimating population means under log-normal and gamma distributions. Both are continuous probability distributions that are widely used across the sciences to model continuous variables that are always positive and skewed. They can be used to fit data collected on (i) human behaviors, such as the length of comments posted in Internet discussion forums; (ii) biology and medicine, such as measures of the size of living tissue and blood pressure; (iii) social sciences and demographics, such as household income; and (iv) reliability analysis, wireless communications, computer networks, and Internet traffic analysis.
An additional limitation is that APP calculations require the researcher to make a distributional assumption, but there has been no APP work exploring the consequences when the distributional assumption is wrong. The present work provides the first such exploration. In brief, suppose that a population follows a log-normal distribution and the researcher assumes a gamma distribution, or the population follows a gamma distribution and the researcher assumes a log-normal distribution; either way, what are the consequences of being wrong? One possibility is that the sample size determinations are similar for both log-normal and gamma distributions. In that case, the consequence of choosing the wrong distribution could be considered minor because there is little loss in choosing the wrong distribution. In contrast, if sample size determinations are dissimilar for the two distributions, then using the wrong distribution could entail major consequences, such as having much less precision than the calculation implies. A caveat is that, because of the newness of the APP, procedures have not yet been developed for distinguishing when a discrepancy is minor or major in an APP context.
In summary, the present work was designed to make three main contributions. First, we expand the APP to two new distributions: log-normal and gamma. Second, we provide the first exploration, in an APP context, of the consequences of wrong distributional assumptions, based on the APP expansion to log-normal and gamma distributions. Third, we show via simulation that the coverage rates match the specified precision and confidence levels very well for both log-normal and gamma distributions.
Properties of Log-Normal and Gamma Distributions
To derive the APP for estimating the population mean, we need the following results about the log-normal and gamma distributions.
Definition 1
A positive random variable X is said to be log-normally distributed with parameters μ and σ2, denoted by X ∼ LN(μ, σ2), if the logarithm of X is normally distributed with mean μ and variance σ2, log X ∼ N(μ, σ2). The probability density function (pdf) of X is given by
f(x) = (1/(xσ√(2π))) exp(−(log x − μ)²/(2σ²)), x > 0. (1)
Note that it is easy to show that the kth moment of X exists and is given by E(Xᵏ) = exp(kμ + k²σ²/2). In particular, the mean and variance of X are νx = E(X) = exp(μ + σ²/2) and σx² = Var(X) = (exp(σ²) − 1) exp(2μ + σ²).
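The closed-form moments can be verified against SciPy's log-normal implementation. The snippet below is a small illustration of ours (not part of the paper's programs); it relies on SciPy's parameterization of LN(μ, σ²), which uses shape s = σ and scale = exp(μ).

```python
import numpy as np
from scipy import stats

def lognormal_moment(k, mu, sigma):
    """k-th raw moment of X ~ LN(mu, sigma^2): E[X^k] = exp(k*mu + k^2*sigma^2/2)."""
    return np.exp(k * mu + 0.5 * k**2 * sigma**2)

mu, sigma = 1.0, 0.75
X = stats.lognorm(s=sigma, scale=np.exp(mu))  # SciPy's parameterization of LN(mu, sigma^2)

# Closed-form first and second moments agree with SciPy's numerical moments
assert np.isclose(lognormal_moment(1, mu, sigma), X.moment(1))
assert np.isclose(lognormal_moment(2, mu, sigma), X.moment(2))
```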
Numerical convolution of log-normal distributions has shown that the sum of such distributions follows the log-normal law to a fair approximation (though not exactly). Therefore, it can be assumed that the sum of two (or more) log-normal random variables is, to a first approximation, another log-normal random variable. The problem is to find this distribution without the tedious work of numerical convolution. The basic idea is to find a log-normal distribution that has the same moments as the exact sum-distribution. Fenton (1960) and Schwartz and Yeh (1982) estimate the pdf for a sum of log-normal random variables using another log-normal pdf with the same mean and variance. The Fenton approximation, referred to as the Fenton-Wilkinson (FW) method, is simple to apply and, for a wide range of log-normal parameters, has been shown to be reasonably accurate in comparison to the Schwartz-Yeh (SY) method. We introduce the FW method in the following Lemma, which will be used to derive our main results in the next section.
Lemma 1
Consider the sum of n independent and identically distributed (i.i.d.) log-normal random variables X1, …, Xn, where each Xi ∼ LN(μ, σ²). The Fenton-Wilkinson (FW) approximation of the sum is a log-normal distribution with parameters μn and σn², where

σn² = log((exp(σ²) − 1)/n + 1), μn = μ + log n + (σ² − σn²)/2. (2)
Proposition 1
Let X, X1, …, Xn be a random sample from the log-normal distribution LN(μ, σ2). Then
-
for a positive constant a, aX ∼ LN(μ + log a, σ²).
-
The sample mean X̄ is approximately log-normally distributed with parameters μn − log n and σn², where σn² is given in (2).
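Because the FW parameters are obtained by matching moments, the approximating log-normal reproduces the mean and variance of the sum exactly, and hence those of the sample mean as well. The following Python sketch (our illustration, not from the paper) verifies both facts analytically:

```python
import numpy as np

def fw_params(mu, sigma, n):
    """Fenton-Wilkinson parameters (Lemma 1) for the sum of n i.i.d. LN(mu, sigma^2)."""
    sigma_n2 = np.log((np.exp(sigma**2) - 1.0) / n + 1.0)
    mu_n = mu + np.log(n) + (sigma**2 - sigma_n2) / 2.0
    return mu_n, sigma_n2

def ln_mean_var(mu, sigma2):
    """Mean and variance of LN(mu, sigma^2)."""
    m = np.exp(mu + sigma2 / 2.0)
    v = (np.exp(sigma2) - 1.0) * np.exp(2.0 * mu + sigma2)
    return m, v

mu, sigma, n = 0.0, 0.7, 50
mu_n, sigma_n2 = fw_params(mu, sigma, n)

# The FW log-normal matches the exact mean and variance of the sum ...
m_one, v_one = ln_mean_var(mu, sigma**2)
m_fw, v_fw = ln_mean_var(mu_n, sigma_n2)
assert np.isclose(m_fw, n * m_one)
assert np.isclose(v_fw, n * v_one)

# ... and, after dividing by n (Proposition 1), of the sample mean
m_bar, v_bar = ln_mean_var(mu_n - np.log(n), sigma_n2)
assert np.isclose(m_bar, m_one)
assert np.isclose(v_bar, v_one / n)
```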
Density curves of X̄ for μ = 0, σ = 0.7, with different sample sizes n = 30, 50, and 100 are given in Figure 1, while density curves of X̄ for μ = 0.5, n = 50, with different σ = 0.5, 0.75, and 1 are given in Figure 2. Figure 1 shows that the density shapes become more symmetric as the sample size n increases, while Figure 2 shows that they change considerably as the parameter σ increases.
Figure 1
Figure 2
In order to compare the log-normal distribution and the gamma distribution, we need the following definition and properties.
Definition 2
A random variable Y is said to be gamma distributed with shape parameter k and scale parameter θ, denoted by Y ∼ Gamma(k, θ), if its pdf is given by

f(y) = y^(k−1) exp(−y/θ)/(Γ(k)θᵏ), y > 0, (3)

where Γ(·) is the gamma function.
Lemma 2
Let Y, Y1, …, Yn be a random sample from the gamma distribution Gamma(k, θ). Then
-
the mean and the variance of Y are E(Y) = kθ and Var(Y) = kθ²,
-
the sampling distribution of the sample mean Ȳ is also gamma distributed with the shape parameter nk and the scale parameter θ/n: Ȳ ∼ Gamma(nk, θ/n), and
-
the mean and the variance of Ȳ are given by E(Ȳ) = kθ, Var(Ȳ) = kθ2/n.
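Lemma 2 can be checked numerically. The sketch below (our Python illustration, not part of the paper) compares the exact sampling distribution Gamma(nk, θ/n) of Ȳ against Monte Carlo draws of the sample mean:

```python
import numpy as np
from scipy import stats

k, theta, n = 5.0, 1.0, 40
rng = np.random.default_rng(0)

# Exact sampling distribution of the sample mean (Lemma 2): Ybar ~ Gamma(n*k, theta/n)
ybar_dist = stats.gamma(a=n * k, scale=theta / n)
assert np.isclose(ybar_dist.mean(), k * theta)          # E(Ybar) = k*theta
assert np.isclose(ybar_dist.var(), k * theta**2 / n)    # Var(Ybar) = k*theta^2/n

# Monte Carlo sanity check: simulate many sample means
ybars = rng.gamma(shape=k, scale=theta, size=(100_000, n)).mean(axis=1)
assert abs(ybars.mean() - k * theta) < 0.01
assert abs(ybars.var() - k * theta**2 / n) < 0.01
```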
The APP for Estimating the Population Mean
In this section, we set up the APP procedures for (i) estimating the population mean from a random sample from the log-normal distribution with parameters μ and σ², and (ii) estimating the population mean from a random sample from the gamma distribution with shape parameter k and scale parameter θ.
The Necessary Sample Size for Estimating νx With Known σ2 Under the Log-Normal Setting
In order to determine the necessary sample size n to be c × 100% confident at the given precision, we consider the distribution of X̄, the unbiased estimator of νx = E(X) = exp(μ + σ²/2) given in Definition 1, for known standard deviation σ.
Theorem 1
Suppose that X1, …, Xn forms a random sample from the log-normal distribution LN(μ, σ²). Let c be the confidence level and f be the precision, which are specified such that the error associated with the estimator X̄ is at most fσx. More specifically, if n, f1, and f2 satisfy

P(f1σx ≤ X̄ − νx ≤ f2σx) = c, with −f ≤ f1 < 0 < f2 ≤ f, (4)

then, under the Fenton-Wilkinson approximation of Proposition 1, the required sample size n can be obtained by solving

Φ([log(1 + f2√(exp(σ²) − 1)) + σn²/2]/σn) − Φ([log(1 + f1√(exp(σ²) − 1)) + σn²/2]/σn) = c, (5)

where Φ is the standard normal cumulative distribution function and σn² is given in (2). Note that (5) is free of μ, so the required sample size depends only on σ.
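Under the FW approximation, the coverage probability in Theorem 1 can be evaluated in closed form with the standard normal cdf. The Python sketch below is our illustration (not the paper's Shiny program) and, for simplicity, takes symmetric precision constants f1 = −f and f2 = f, so its output is a close analogue of, not identical to, the tabulated values.

```python
import numpy as np
from scipy import stats

def lognormal_coverage(n, f, sigma):
    """P(-f*sigma_x <= Xbar - nu_x <= f*sigma_x) under the FW approximation.

    Uses Xbar ~ LN(mu_n - log n, sigma_n^2), for which Xbar/nu_x ~ LN(-sigma_n^2/2, sigma_n^2),
    so the coverage does not depend on mu.
    """
    sigma_n2 = np.log((np.exp(sigma**2) - 1.0) / n + 1.0)
    sigma_n = np.sqrt(sigma_n2)
    cv = np.sqrt(np.exp(sigma**2) - 1.0)  # coefficient of variation sigma_x / nu_x
    upper = (np.log(1.0 + f * cv) + sigma_n2 / 2.0) / sigma_n
    lower = (np.log(1.0 - f * cv) + sigma_n2 / 2.0) / sigma_n
    return stats.norm.cdf(upper) - stats.norm.cdf(lower)

# Table 5 reports n = 392 with cr = 0.9524 for f = 0.1, c = 0.95, mu = 1, sigma = 0.5;
# the closed-form coverage at that n should meet the nominal 0.95
cov = lognormal_coverage(392, 0.1, 0.5)
assert 0.95 <= cov <= 0.955
```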
Remark 1
The required sample size n and the precision constants f1 and f2 such that −f ≤ f1 < 0 < f2 ≤ f are obtained simultaneously, given the specified precision f and the confidence level c × 100%. The corresponding confidence interval for νx based on Theorem 1 is (X̄ − f2σx, X̄ − f1σx).
The Necessary Sample Size for Estimating νy With Known k Under the Gamma Setting
Similarly, the necessary sample size n can be obtained for a given confidence level c × 100% and precision f. Consider the distribution of Ȳ, the unbiased estimator of νy = E(Y) = kθ given in Lemma 2, for known shape parameter k. The second main result is given below.
Theorem 2
Let Y1, …, Yn be independent and identically distributed random variables from the gamma distribution Gamma(k, θ). Let c be the confidence level and f be the precision, which are specified such that the error associated with the estimator Ȳ is at most fσy. More specifically, if n, f1, and f2 satisfy

P(f1σy ≤ Ȳ − νy ≤ f2σy) = c, with −f ≤ f1 < 0 < f2 ≤ f, (6)

then, since Ȳ ∼ Gamma(nk, θ/n) by Lemma 2, the required sample size n can be obtained by solving

Fnk(nk + f2n√k) − Fnk(nk + f1n√k) = c, (7)

where Fnk denotes the cumulative distribution function of the Gamma(nk, 1) distribution. Note that (7) is free of θ, so the required sample size depends only on k.
Remark 2
The required sample size n, f1, and f2 can be obtained simultaneously, given the precision f and the confidence level c × 100%. The corresponding confidence interval for νy based on Theorem 2 is (Ȳ − f2σy, Ȳ − f1σy).
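In the gamma case no approximation is needed, because the sampling distribution of Ȳ is gamma exactly. The Python sketch below (ours, again with the simplification f1 = −f, f2 = f) evaluates the coverage from the Gamma(nk, 1) cdf; since θ cancels in the standardization, the same n serves every scale, which is why Table 6 reports identical sample sizes for θ = 1 and θ = 2.

```python
import numpy as np
from scipy import stats

def gamma_coverage(n, f, k):
    """P(-f*sigma_y <= Ybar - nu_y <= f*sigma_y) with Ybar ~ Gamma(n*k, theta/n).

    Standardizing by W = n*Ybar/theta ~ Gamma(n*k, 1) shows the coverage is free of theta.
    """
    a = n * k
    lo = a - f * n * np.sqrt(k)  # n*(k - f*sqrt(k))
    hi = a + f * n * np.sqrt(k)  # n*(k + f*sqrt(k))
    return stats.gamma.cdf(hi, a) - stats.gamma.cdf(lo, a)

# Table 6 reports n = 388 for f = 0.1, c = 0.95, k = 5; the exact coverage
# at that n should meet the nominal 0.95
cov = gamma_coverage(388, 0.1, 5.0)
assert 0.95 <= cov <= 0.96
```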
The Robustness of APP to Some Sorts of Assumption Violations
Like any inferential statistical procedure, the APP assumes a statistical model. For example, we assume log-normal and gamma distributions in this paper. What if particular assumptions are wrong? For example, what if we get the distribution wrong? An important question is: How robust is the APP to various sorts of assumption violations? If one could show robustness to at least some sorts of assumption violations, that would be very helpful. Even if one finds assumption violations to which the APP is not robust, that would still be useful, because we would know where it is important to be careful. Based on this idea, we first consider sample size. As a simple example, suppose we assume a gamma distribution when the truth is a log-normal population. We want to see what difference this makes in the estimated necessary sample size.
When we determine the necessary sample sizes for estimating the population mean, we assume σ² is known in the log-normal setting and k is known in the gamma setting. To see the difference in the necessary sample sizes, we need to control the relationship between σ² and k. A natural idea is to match the first two moments of the sample mean, which estimates the population mean under both models. This yields the following equations:
kθ = exp(μ + σ²/2), kθ² = (exp(σ²) − 1) exp(2μ + σ²), equivalently k = 1/(exp(σ²) − 1), θ = (exp(σ²) − 1) exp(μ + σ²/2). (8)
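Matching the mean and variance of the two populations gives k = 1/(exp(σ²) − 1) and θ = exp(μ + σ²/2)/k. The Python sketch below (our illustration) implements the map and its inverse; as a check, it reproduces the paired "true values" used in Tables 3 and 4.

```python
import numpy as np

def match_gamma_to_lognormal(mu, sigma):
    """Gamma (k, theta) with the same mean and variance as LN(mu, sigma^2)."""
    k = 1.0 / (np.exp(sigma**2) - 1.0)
    theta = np.exp(mu + sigma**2 / 2.0) / k
    return k, theta

def match_lognormal_to_gamma(k, theta):
    """Inverse map: LN(mu, sigma^2) with the same mean and variance as Gamma(k, theta)."""
    sigma2 = np.log(1.0 + 1.0 / k)
    mu = np.log(k * theta) - sigma2 / 2.0
    return mu, np.sqrt(sigma2)

# LN(1, 0.75^2) pairs with Gamma(1.3244, 2.7190), as in Table 3
k, theta = match_gamma_to_lognormal(1.0, 0.75)
assert abs(k - 1.3244) < 1e-3 and abs(theta - 2.7190) < 1e-3

# Gamma(5, 1) pairs with LN(1.5183, 0.4270^2), as in Table 4
mu, sigma = match_lognormal_to_gamma(5.0, 1.0)
assert abs(mu - 1.5183) < 1e-3 and abs(sigma - 0.4270) < 1e-3
```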
Table 1
| Precision (f) | Confidence level (c) | Sample sizeᵃ (n) | Left precision (f1) | Right precision (f2) |
|---|---|---|---|---|
| f = 0.1 | 0.95 | 391 | −0.0978204 | 0.0996861 |
|  | 0.9 | 275 | −0.0998779 | 0.0989713 |
| f = 0.15 | 0.95 | 177 | −0.1453921 | 0.1496078 |
|  | 0.9 | 124 | −0.1493012 | 0.1473061 |
| f = 0.2 | 0.95 | 105 | −0.1921884 | 0.1998908 |
|  | 0.9 | 69 | −0.1967918 | 0.1934145 |
| f = 0.25 | 0.95 | 67 | −0.2378956 | 0.2497299 |
|  | 0.9 | 48 | −0.2474282 | 0.2422409 |
ᵃ To find the sample size needed to have a particular probability that the sample mean will be within a desired distance of the population mean, assuming the population is log-normally distributed LN(μ, σ²), the log-normal program can be used at https://probdiffgamma.shinyapps.io/lognormal/. To use the program it is necessary to make three entries. In the first box, type in the value of σ. In the second box, type in the desired degree of precision (f). In the third box, type in the desired confidence level (c). Then click 'update' to obtain the sample size needed to meet your specifications for precision and confidence.
Table 2
| Precision (f) | Confidence level (c) | Sample sizeᵃ (n) | Left precision (f1) | Right precision (f2) |
|---|---|---|---|---|
| f = 0.1 | 0.95 | 392 | −0.0986157 | 0.0998879 |
|  | 0.9 | 271 | −0.0998625 | 0.0992504 |
| f = 0.15 | 0.95 | 174 | −0.1470710 | 0.1499018 |
|  | 0.9 | 124 | −0.1500000 | 0.1486157 |
| f = 0.2 | 0.95 | 98 | −0.1940000 | 0.1991398 |
|  | 0.9 | 71 | −0.1986367 | 0.1962319 |
| f = 0.25 | 0.95 | 66 | −0.2408662 | 0.2487501 |
|  | 0.9 | 45 | −0.2487469 | 0.2450630 |
ᵃ To find the sample size needed to have a particular probability that the sample mean will be within a desired distance of the population mean, assuming the population is gamma distributed, the gamma program can be used at https://probdiffgamma.shinyapps.io/app-gamma/. To use the program it is necessary to make three entries. In the first box, type in the shape parameter of the population distribution (k). In the second box, type in the desired degree of precision (f). In the third box, type in the desired confidence level (c). Then click 'update' to obtain the sample size needed to meet your specifications for precision and confidence, given the shape parameter k that you entered in the first box.
From the results in Tables 1 and 2, we can see that the sample sizes derived under the two different populations, with the same confidence, the same precision, and moment-matched parameter values, are similar. For example, when f = 0.1 and c = 0.95, we get n = 391 in Table 1 and n = 392 in Table 2. That is, the APP sample size determination is robust to this violation of the distributional assumption, because the required sample sizes are very close in both tables.
Simulation Results
In this section, we conduct two simulations. First, we run a simulation to see how much difference it makes when we use the same sample size to estimate the parameters of both models. To compare the two models, we use the log-likelihood, AIC, and BIC values.
The Akaike information criterion (AIC) is an estimator of prediction error and thereby of the relative quality of statistical models for a given set of data (see Akaike, 1974, and Aho, Derryberry, & Peterson, 2014). Let m be the number of estimated parameters in the model and let L̂ be the maximum value of the likelihood function for the model. Then the AIC value of the model is AIC = 2m − 2 log L̂; analogously, the Bayesian information criterion is BIC = m log n − 2 log L̂, where n is the sample size.
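This model-comparison step can be sketched as follows. The Python illustration below is ours (with freshly simulated rather than the paper's data); it fits both two-parameter models by maximum likelihood, with the location fixed at zero so each model has m = 2 free parameters, and computes AIC and BIC.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.lognormal(mean=1.0, sigma=0.75, size=105)

def aic_bic(logl, m, n):
    """AIC = 2m - 2*log L_hat; BIC = m*log(n) - 2*log L_hat."""
    return 2 * m - 2 * logl, m * np.log(n) - 2 * logl

# Fit both two-parameter models by maximum likelihood (location fixed at 0)
s, _, scale = stats.lognorm.fit(data, floc=0)   # mu_hat = log(scale), sigma_hat = s
ll_ln = np.sum(stats.lognorm.logpdf(data, s, 0, scale))
k_hat, _, theta_hat = stats.gamma.fit(data, floc=0)
ll_g = np.sum(stats.gamma.logpdf(data, k_hat, 0, theta_hat))

aic_ln, bic_ln = aic_bic(ll_ln, m=2, n=len(data))
aic_g, bic_g = aic_bic(ll_g, m=2, n=len(data))
# Lower AIC/BIC indicates the better-fitting model for this sample
```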
The simulation results are listed in Tables 3 and 4, respectively, using the sample size n required for precision f = 0.2 and confidence level c = 0.95.
Table 3
| (n = 105) | Log-normal (true model): μ | Log-normal (true model): σ | Gamma: k | Gamma: θ |
|---|---|---|---|---|
| True value | 1.0000 | 0.7500 | 1.3244 | 2.7190 |
| Mean | 0.9993 | 0.7450 | 1.9914 | 1.8563 |
| \|Bias\| | 0.0007 | 0.0050 | 0.6666 | 0.8627 |
| Std. Dev. | 0.0727 | 0.0514 | 0.2563 | 0.0820 |
| Log-L | −222.7483 |  | −227.1051 |  |
| AIC | 449.4967 |  | 458.2102 |  |
| BIC | 454.8046 |  | 463.5181 |  |
| pAIC | 0.9168 |  | 0.0832 |  |
Table 4
| (n = 98) | Log-normal: μ | Log-normal: σ | Gamma (true model): k | Gamma (true model): θ |
|---|---|---|---|---|
| True value | 1.5183 | 0.4270 | 5.0000 | 1.0000 |
| Mean | 1.5057 | 0.4659 | 5.1665 | 0.9813 |
| \|Bias\| | 0.0135 | 0.0372 | 0.1665 | 0.0187 |
| Std. Dev. | 0.0470 | 0.0332 | 0.7155 | 0.1507 |
| Log-L | −211.4705 |  | −209.8472 |  |
| AIC | 426.9410 |  | 423.6944 |  |
| BIC | 432.1109 |  | 428.8643 |  |
| pAIC | 0.1887 |  | 0.8113 |  |
In Table 3, we first generate the required n = 105 sample data points from a log-normal distribution LN(μ, σ²) with μ = 1 and σ = 0.75, and use this sample to fit both the log-normal and gamma models. We repeat this over M = 10000 simulated data sets and calculate the means, absolute biases, and standard deviations of the estimators, together with the log-likelihood, AIC, and BIC values. From Table 3, we can see that if we use the gamma model to fit the generated data points, both the biases and standard errors of the estimates are larger than those for the fitted log-normal model. The log-likelihood, AIC, and BIC values also support the log-normal model.
Similarly, in Table 4, we generate the required n = 98 random data points from the gamma model, Gamma(k, θ), with shape k = 5 and scale θ = 1. Then we calculate the maximum likelihood estimates of the parameters in both models, together with the log-likelihood, AIC, and BIC values. From Table 4, we can see that the fitted gamma model is better than the fitted log-normal model.
To assess the effectiveness of the comparison between the two models, we use pAIC, the proportion of runs in which the true model is selected by AIC among the M = 10000 simulated data sets. The value pAIC = 0.9168 in Table 3 indicates that the log-normal (true) model is selected by AIC far more frequently than the gamma model. Similarly, pAIC = 0.8113 in Table 4 indicates that the gamma (true) model is selected by AIC more frequently than the log-normal model.
To investigate how the AIC and BIC values change with the parameter σ in the log-normal model and with the scale θ in the gamma model, the results are shown in Figures 3 through 6. Figures 3 and 4 show how the AIC and BIC values change as σ of the log-normal distribution varies with μ = 1 fixed. Similarly, Figures 5 and 6 show how the values change as the scale θ of the gamma distribution varies with shape parameter k = 1 fixed. Here the sample size is 100. We can see that the differences between the AICs, and between the BICs, of the two fitted models grow as σ and θ increase from 1 to 5, respectively, in Figures 3 and 4 (log-normal distribution) and Figures 5 and 6 (gamma distribution).
Figure 3
Figure 4
Figure 5
Figure 6
In order to examine the coverage rates of confidence intervals with specified precision and confidence level, we run a second simulation on the performance of the confidence intervals obtained using the derived sample sizes. The coverage rate (cr) of the interval estimate of the population mean νx from Theorem 1, with parameters μ = 1 and σ = 0.25 or 0.5, is listed in Table 5. The coverage rate of the interval estimate of the population mean νy from Theorem 2, with parameters k = 5 and θ = 1 or 2, is given in Table 6. All results are based on M = 500000 simulation runs. From Tables 5 and 6, we can see that the coverage rates closely match the nominal confidence levels, so our APP methods are very effective.
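A coverage simulation of this kind can be sketched as follows. This is our Python illustration, not the paper's code: M is reduced for speed, and the symmetric criterion |X̄ − νx| ≤ fσx is used in place of the tabulated f1 and f2.

```python
import numpy as np

rng = np.random.default_rng(123)
M, n, f = 20_000, 392, 0.1     # Table 5 setting: f = 0.1, c = 0.95, sigma = 0.5
mu, sigma = 1.0, 0.5

nu_x = np.exp(mu + sigma**2 / 2)                   # population mean of LN(mu, sigma^2)
sigma_x = nu_x * np.sqrt(np.exp(sigma**2) - 1.0)   # population standard deviation

# Draw M samples of size n; record how often Xbar lands within f*sigma_x of nu_x
xbars = rng.lognormal(mean=mu, sigma=sigma, size=(M, n)).mean(axis=1)
cr = np.mean(np.abs(xbars - nu_x) <= f * sigma_x)
assert 0.94 < cr < 0.965  # Table 5 reports cr = 0.9524 for these settings
```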
Table 5
| Precision (f) | Confidence level (c) | n (μ = 1, σ = 0.25) | cr (μ = 1, σ = 0.25) | n (μ = 1, σ = 0.5) | cr (μ = 1, σ = 0.5) |
|---|---|---|---|---|---|
| f = 0.1 | 0.95 | 388 | 0.9514 | 392 | 0.9524 |
|  | 0.9 | 273 | 0.9012 | 271 | 0.9003 |
| f = 0.15 | 0.95 | 173 | 0.9519 | 178 | 0.9548 |
|  | 0.9 | 121 | 0.9011 | 124 | 0.9063 |
| f = 0.2 | 0.95 | 121 | 0.9011 | 102 | 0.9565 |
|  | 0.9 | 68 | 0.9038 | 70 | 0.9072 |
| f = 0.25 | 0.95 | 63 | 0.9532 | 65 | 0.9564 |
|  | 0.9 | 44 | 0.9036 | 45 | 0.9088 |
Table 6
| Precision (f) | Confidence level (c) | n (k = 5, θ = 1) | cr (k = 5, θ = 1) | n (k = 5, θ = 2) | cr (k = 5, θ = 2) |
|---|---|---|---|---|---|
| f = 0.1 | 0.95 | 388 | 0.9518 | 388 | 0.9513 |
|  | 0.9 | 273 | 0.9018 | 273 | 0.9016 |
| f = 0.15 | 0.95 | 173 | 0.9512 | 173 | 0.9519 |
|  | 0.9 | 121 | 0.9003 | 121 | 0.9012 |
| f = 0.2 | 0.95 | 98 | 0.9518 | 98 | 0.9526 |
|  | 0.9 | 69 | 0.9045 | 69 | 0.9029 |
| f = 0.25 | 0.95 | 63 | 0.9531 | 63 | 0.9530 |
|  | 0.9 | 44 | 0.9036 | 44 | 0.9029 |
Real Data Examples
In this section we will analyze two real data sets for investigating the performance of our APP methods under log-normal and gamma assumptions.
The first data set is the survival time (in months) of 184 patients who had limited-stage small-cell lung cancer, from Overduin (2004). We use both the log-normal and gamma distributions to fit this data set; the maximum likelihood estimates are μ̂ and σ̂ for the parameters of the log-normal model and k̂ and θ̂ for those of the gamma model. Entering the values of σ̂ and k̂ with f = 0.15 and c = 0.9 into the programs linked in the notes of Tables 1 and 2, we obtain necessary sample sizes of n = 124 for the log-normal and n = 121 for the gamma. Since the sample sizes are very close, we use the larger n. A random sample of size n = 124 from the lung cancer data set is selected to fit both distributions. The relative histogram and the fitted log-normal and gamma pdfs for the sampled data are plotted in Figure 7. From Table 7, we know that the log-normal model fit is preferable.
Figure 7
Table 7
| Distribution | Estimator | Log-L | AIC | BIC |
|---|---|---|---|---|
| First data set: Lung cancer |  |  |  |  |
| log-normal |  | −450.7487 | 905.4973 | 911.1379 |
| gamma |  | −455.3863 | 914.7727 | 920.4133 |
| Second data set: Faculty salary |  |  |  |  |
| log-normal |  | −261.4429 | 526.8858 | 532.5102 |
| gamma |  | −259.7783 | 523.5566 | 529.1810 |
The second data set is the salary data of faculty in the College of Arts and Sciences, New Mexico State University (2018/19). After the same process as for the first data set, a random sample of the larger size n = 123 (this value is determined by the estimates of σ and k with f = 0.15 and c = 0.9) is selected from the second data set. The relative histogram and the fitted log-normal and gamma pdfs for the sampled data are plotted in Figure 8. From Table 7, we know that the gamma model fit is preferable.
Figure 8
The values of the log-likelihood, AIC and BIC criteria resulted from fitting log-normal distribution and gamma distribution to the two data sets are presented in Table 7.
Discussion
An important contribution is that the present work, which includes links to free and user-friendly programs, expands the APP so that it can be used under log-normal and gamma distributions. Thus, researchers who wish to know sample sizes needed to have sample statistics that are good estimators of corresponding population parameters, but who worry about not having normally distributed data, need worry no longer. One reason this is important is that most distributions are skewed (Blanca, Arnau, López-Montiel, Bono, & Bendayan, 2013; Ho & Yu, 2015; Micceri, 1989), thereby rendering the family of normal distributions less relevant in estimation contexts. In addition, as there are many ways in which skewness can occur, it is desirable to have the possibility of using many distribution families as potential models, rather than just the skew normal family that has been used earlier (Trafimow et al., 2019). Thus, the present expansion of the APP to the log-normal and gamma families is potentially useful. The potential utility is backed by both computer simulations and worked examples based on real data.
In addition, however, the present work is, to our knowledge, the first APP work that directly addresses the issue of mistakes in identifying the relevant distribution family. To that end, we have explored the consequences of assuming a log-normal distribution in the presence of a gamma distribution, or assuming a gamma distribution in the presence of a log-normal distribution. The results are nuanced. Although the consequences of being wrong are minimal with respect to sample size computations, Figures 3 and 4 show that the difference in AIC and BIC increases as σ increases. Thus, the consequences of being wrong vary depending on the researcher's goals. If the goal is sample size determination, the consequences of using the wrong distribution are minimal. In contrast, if the goal is more complex, such that AIC or BIC is relevant, the consequences of using the wrong distribution might matter more. An important caveat is that the present work concerns log-normal and gamma distributions. It is not difficult to imagine arriving at different conclusions with different distribution families.
In conclusion, the present equations and links to programs successfully expand the APP to log-normal and gamma distributions. And we have seen that the consequences of making a wrong assumption with respect to which family of distributions to use are often, but not always, minimal. We hope and expect that future research will include more APP expansions to distribution families not addressed here.