For over a century, researchers have performed multiple regression analyses to obtain regression coefficients that, in turn, provided valuable information about the ability of independent variables to explain dependent variables. However, until the present work, sample size determination has been left to power analyses (e.g., Cohen, 2013). From the point of view of determining sample sizes needed to render good chances of obtaining statistical significance, power analyses make sense. But from the point of view of determining sample sizes needed so that sample statistics can be trusted to estimate corresponding population parameters, power analysis is insufficient (Trafimow et al., 2020; Trafimow & Myüz, 2019). This is because power analysis is a function not only of the sample size, but of the expected effect size too. For example, consider the simple case of a single mean under the typical prescription in psychology that researchers should attempt to detect a ‘medium’ effect size (e.g., Cohen’s d = 0.50) with a 0.80 probability of rejecting the null hypothesis at alpha equals 0.05. In that case only 31 participants are required. However, that sample size implies that the researcher has a probability of 0.95 of obtaining a sample mean that is within 0.35 standard deviations of the population mean it is intended to estimate (Trafimow et al., 2020), which many would consider insufficiently precise. Trafimow (2018) recommended having a probability of 0.95 of having sample statistics be within 0.20 or 0.10 of corresponding population parameters for ‘good’ or ‘excellent’ precision, respectively, though also indicating that such designations could change depending on study contexts. Clearly, an alternative to power analysis is desirable, and the a priori procedure (APP), to be explained presently, provides it. 
The present goal is to expand the APP so that researchers can determine the sample sizes necessary to meet their requirements for obtaining sample regression coefficients that provide good estimates of corresponding population regression coefficients.
To set up the present work, it is useful to briefly consider the issue of regression coefficient size in the context of research that is exploratory or that is beyond exploratory, keeping in mind that the definitions of ‘small’ or ‘large’ regression coefficients depend on substantive areas and researcher goals. In exploratory research, variables with larger regression coefficients are typically considered better candidates for future investigation than are independent variables with smaller regression coefficients. In research that is beyond the initial exploration phase, independent variables with larger regression coefficients are typically considered better candidates for intervention, policy recommendations, or theoretical explanation than are independent variables with smaller regression coefficients. This is because policy makers must have reason not only to believe in the empirical relationship, but also that the relationship is sufficiently large to justify the costs of implementing a policy (Trafimow & Osman, 2022). Even for basic research, large regression coefficients are less susceptible to trivial alternative explanations than are smaller ones, all else being equal. Whether the research is at the exploratory level, or beyond that, researchers must have some reason to believe that the regression coefficients they obtain from their sample generalize to the population of interest; it is necessary to assume that sample regression coefficients are good estimates of corresponding population regression coefficients. Most researchers are aware that, in general, the larger the sample size, the better the estimation. However, at present, there is no way to know the minimum sample size needed to meet criteria for quality of estimation. Our goal is to derive a procedure to accomplish this. The procedure to be described works equally well for those researchers who prefer unstandardized regression coefficients to standardized ones.
Recently, a general methodology—the APP—has been developed for determining required sample sizes so that sample statistics provide good estimates of corresponding population parameters. There are two main criteria that are bullet-listed below:
- Precision: Researchers must specify the distance within which they wish their sample statistics to be of corresponding population parameters.
- Confidence: Researchers must specify the probability they wish to have of meeting the precision specification.
For example, in the case of a single mean, under normality, Trafimow (2017) showed that it is necessary to obtain a sample size of 385 to be 95% confident that the sample mean will be within one-tenth of a standard deviation of the population mean. Note the contrast between the 385 participants required here and the 31 participants sufficient to satisfy a typical power analysis. Even relaxing the criterion to a precision level of 0.20 implies a sample size of 97, which still exceeds 31, thereby exemplifying that the APP is very different from power analysis.
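The single-mean figures quoted above can be reproduced with a short sketch. It assumes the standard normal-theory result that the sample mean lies within f standard deviations of the population mean with probability c when n is at least (z/f)^2, where z is the two-sided critical value for c. The function names are illustrative and are not taken from the APP software.

```python
from math import ceil, sqrt
from statistics import NormalDist


def app_n_single_mean(f: float, c: float) -> int:
    """Smallest n such that P(|sample mean - mu| <= f * sigma) >= c under normality."""
    z = NormalDist().inv_cdf((1 + c) / 2)  # two-sided critical value for confidence c
    return ceil((z / f) ** 2)


def precision_for_n(n: int, c: float) -> float:
    """Precision f (in standard-deviation units) achieved by a sample of size n."""
    z = NormalDist().inv_cdf((1 + c) / 2)
    return z / sqrt(n)


print(app_n_single_mean(0.10, 0.95))        # 385
print(app_n_single_mean(0.20, 0.95))        # 97
print(round(precision_for_n(31, 0.95), 2))  # 0.35
```

The last line recovers the precision of roughly 0.35 standard deviations implied by the 31 participants of the power-analysis example.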
Recent APP work has gone well beyond single means under normality. For example, Trafimow et al. (2019) extended the APP to work for skew normal distributions; Wang et al. (2019a) extended the APP to work for comparisons between independent groups under skew normal distributions; and Wang et al. (2019c) provided an APP extension to two dependent groups (matched data). Moreover, Wang et al. (2020) extended the APP to one-way analysis of variance paradigms. Then, too, there are APP extensions pertaining to standard deviations or scales (e.g., Wang et al., 2022), distribution shapes (Wang et al., 2019b), and correlation coefficients (Wang et al., 2021). There is even a Bayesian APP extension for estimating the normal mean (Wei et al., 2020) and proportion based on skew normal approximations and the Beta-Bernoulli process (Cao et al., 2021). The foregoing citations indicate that a sizable APP literature already exists and that it is growing quickly.
The present goal is to add to the APP literature by extending that literature to address regression coefficients assuming a multivariate normal distribution. Our aim is to derive a procedure by which researchers can specify the desired degree of precision and confidence, as well as the number of independent variables, to determine the minimum sample sizes needed to meet the specifications. In other words, the work to be presented provides a way for researchers to determine the sample size necessary so that they can trust that their sample regression coefficients are good estimates of corresponding population regression coefficients.
Review of Multiple Linear Regression Model
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent (response) variable and one or more independent (explanatory) variables. The most common form of regression analysis is linear regression, in which one finds the linear combination of the independent variables that most closely fits the data according to a specific mathematical criterion, such as least squares.
The multiple linear regression model for n observations can be written as
1
2
Lemma 1
Consider the regression model given in (1) and (2) and assume that , the n-dimensional multivariate normal distribution with mean and covariance matrix . Then the maximum likelihood estimators of , , and are, respectively, given by
3
- .
- , the chi-square distribution with degrees of freedom.
- and are independent.
Let be the sample covariance matrix of and be the vector of sample covariances between and the ’s, where ’s are sample covariance between y and for . It can be shown that
4
We can also express the vector of regression coefficients in terms of sample correlations. Let be the sample correlation matrix between y and ’s and be its corresponding sample covariance matrix as
5
Example 1
Consider the regression model with two independent variables and . We have
Remark 1
Note that the ’s can be compared to each other, whereas the ’s cannot be so compared. The division by is customary but not necessary. That is, the relative values of and are the same as those of and . Therefore, instead of finding the necessary sample size needed to trust the standardized regression weights , we look for the sample size needed for estimating , which is equivalent to estimating because are calculated from the observations of the independent variables . Therefore we will focus on setting our APP method on the estimation of .
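The two-predictor case can be illustrated with a minimal sketch. It assumes the textbook result that the slope vector equals the inverse of the sample covariance matrix of the predictors times the vector of sample covariances between the predictors and y. It also assumes the usual definition of a standardized weight as the slope multiplied by the predictor's standard deviation and divided by the standard deviation of y. These formulas and all names below are our assumptions for illustration; they are not recovered from the displayed equations of this section.

```python
def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Sample covariance with divisor n - 1 (cov(u, u) is the sample variance)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def two_predictor_fit(x1, x2, y):
    """Least-squares fit of y on x1 and x2 computed from sample covariances."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    sy = cov(y, y) ** 0.5
    # Standardized weights (assumed definition): slope * sd(x_j) / sd(y).
    std = (b1 * s11 ** 0.5 / sy, b2 * s22 ** 0.5 / sy)
    return b0, (b1, b2), std


# Noise-free illustrative data generated from y = 1 + 2*x1 - 0.5*x2.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a - 0.5 * b for a, b in zip(x1, x2)]
b0, (b1, b2), (w1, w2) = two_predictor_fit(x1, x2, y)
print(b0, b1, b2)
print(w1 / w2)
```

Because both standardized weights share the same divisor, their ratio is unchanged if that divisor is dropped, which is the sense in which the division is customary but not necessary.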
The Necessary Sample Size Needed for Estimating the Vector of Regression Coefficients
In this section, we establish the APP for estimating , the vector of regression coefficients given in (2), under the normality assumption.
Theorem 1
Assume that , where is of rank and . Then for specified precision f and confidence level c, the necessary sample size needed for estimating regression coefficients (or regression weights) can be obtained by solving the following equation
6
7
The proof of Theorem 1, together with the density of U, is given in the Appendix.
Remark 2
Note that Theorem 1 still holds for finding the necessary sample size needed to trust since
Remark 3
If previous data sets are available, we can use them to obtain so that the random variable U given in (10) has a noncentral F distribution with noncentrality parameter
The density curves of U for different values of p and n are shown in Figures 1 and 2. In Figure 1, the density curves of U are given for and different values of ; they do not change much as n increases. In Figure 2, the density curves of U are graphed for and different values of ; they change substantially when p changes.
Figure 1
Figure 2
If the noncentrality parameter is not 0, the density curves for , , and different values of are shown in Figure 3. We can see that the noncentrality parameter does affect the density curves.
Figure 3
Remark 4
From Theorem 1, we can construct a confidence region for or . With a given confidence level and precision f, the confidence regions for and are given by
8
9
To illustrate the above results for the case where , , the confidence regions of with and are given in Figures 4 and 5, respectively. Here the observations of and are generated from the uniform distribution with mean 0 and variance 1, with true values .
Figure 4
Figure 5
Simulation Study and Real Data Example
In this section, we conduct a simulation study and present a real data analysis to evaluate the performance of the APP proposed above. The necessary sample sizes (n) obtained by using Theorem 1 are provided in Tables 1, 2, 3, and 4 for different values of p when and .
Table 1
f | 0.1 | 0.1 | 0.15 | 0.15 | 0.2 | 0.2 | 0.25 | 0.25 |
c | 0.95 | 0.9 | 0.95 | 0.9 | 0.95 | 0.9 | 0.95 | 0.9 |
n | 308 | 244 | 138 | 114 | 84 | 62 | 58 | 49 |
cr (coverage rate) | 0.95027 | 0.89956 | 0.95006 | 0.89995 | 0.95005 | 0.89968 | 0.94963 | 0.90019 |
Table 2
f | 0.1 | 0.1 | 0.15 | 0.15 | 0.2 | 0.2 | 0.25 | 0.25 |
c | 0.95 | 0.9 | 0.95 | 0.9 | 0.95 | 0.9 | 0.95 | 0.9 |
n | 275 | 222 | 129 | 104 | 74 | 60 | 49 | 38 |
cr (coverage rate) | 0.94976 | 0.90035 | 0.94999 | 0.90042 | 0.94974 | 0.90039 | 0.94991 | 0.90003 |
Table 3
f | 0.1 | 0.1 | 0.15 | 0.15 | 0.2 | 0.2 | 0.25 | 0.25 |
c | 0.95 | 0.9 | 0.95 | 0.9 | 0.95 | 0.9 | 0.95 | 0.9 |
n | 229 | 195 | 113 | 96 | 71 | 51 | 43 | 38 |
cr (coverage rate) | 0.95009 | 0.90000 | 0.94962 | 0.90004 | 0.95003 | 0.89946 | 0.94992 | 0.90030 |
Table 4
f | 0.1 | 0.1 | 0.15 | 0.15 | 0.2 | 0.2 | 0.25 | 0.25 |
c | 0.95 | 0.9 | 0.95 | 0.9 | 0.95 | 0.9 | 0.95 | 0.9 |
n | 192 | 172 | 98 | 78 | 58 | 46 | 38 | 35 |
cr (coverage rate) | 0.95009 | 0.89961 | 0.94995 | 0.89959 | 0.94991 | 0.89957 | 0.94971 | 0.89984 |
The tables indicate the following. First, the required sample size n decreases as the value of f increases (that is, as the precision requirement is loosened) for all , and 10. Second, with runs (samples) for required sample size n, the coverage rates (the percentage of the constructed confidence intervals that include the true parameters) are very close to the corresponding confidence levels . Third, as the number p of independent variables increases, the required sample size n decreases for fixed f and c. This is reasonable because the multiple correlation coefficient increases as p increases, so the required sample size decreases. Fourth, the effect of increasing the number p of independent variables is smaller for low-precision settings (e.g., : , , n = 49; , , , n = ) than for high-precision settings (e.g., : p = 2, , ; , , , n = ).
For calculating the necessary sample sizes needed to estimate the regression coefficients , a freely available online calculator can be found in the Supplementary Materials.
Introduction to the Link
To use the program for finding the sample size needed to estimate the regression coefficients, it is necessary to make three entries, with an optional fourth. In the first box, type in the number (p) of independent variables included in the model. In the second box, type in the desired degree of precision (f). In the third box, type in the desired confidence level (c). The fourth, optional input is the noncentrality parameter ( ), which can be determined by using previous data. The default value of is 0. Then click “update” to obtain the sample size needed to meet your specifications for precision and confidence.
Example 2
The data set was obtained from the R package datarium (Kassambara, 2019). It records the impact of three advertising media (YouTube, Facebook, and newspaper) on sales. The data are the advertising budgets in thousands of dollars along with the sales. The advertising experiment was repeated 200 times. We construct a regression model to predict sales (y) on the basis of the advertising budget spent on YouTube ( ) and newspaper ( ). Using the online calculator described above, we find that the necessary sample size for estimating the standardized regression weights is 138 with precision , confidence level . So we randomly choose a sample of size from the raw data. After calculation, the least-squares estimates of in equation (2) and are and , respectively. Also, the estimate of the standardized regression weights is = . If we use the whole data set as a sample ( ), the estimates of , and are = , and , respectively. For comparison, the difference between and is , which indicates that the estimates obtained with the required n = 138 are consistent with those from the full data set.
Remark 5
The verification of the assumptions of normality, homoscedasticity and influential values is provided in the C section of the Appendix.
Discussion
In the introduction, we explained why the size of regression coefficients, not just whether they are statistically significant, is important especially for applied research. Even if a regression coefficient is statistically significant, it might not be sufficiently large to justify expenditures necessary for a policy change (Trafimow & Osman, 2022). However, once the importance of regression coefficient size is acknowledged, there remains the crucial issue of the accuracy with which sample regression coefficients estimate population regression coefficients. Even a large sample regression coefficient may not justify a policy change if it cannot be trusted to be a good estimator of the corresponding population regression coefficient. Consequently, it is useful to have a procedure to enable valid judgments of the degree of trust consumers of research can place in sample regression coefficients as estimators of corresponding population regression coefficients. The present APP expansion provides that procedure.
In turn, there are two ways the present work, with the free and user-friendly program, can be used. One use concerns the original purpose of the APP, which is to plan sample sizes necessary for achieving researcher goals pertaining to precision and confidence. Secondly, however, the present APP expansion can be used post data collection, such as evaluating an already published regression coefficient. If a researcher reports a seemingly impressive regression coefficient, the trust that regression coefficient deserves can be assessed using the present program. If the reported sample size is less than what is necessary to meet assessors’, reviewers’, or policy makers’ criteria for precision and confidence, the applicability of the sample regression coefficient can be discounted accordingly. Alternatively, if the reported sample size exceeds that which is necessary to meet criteria for precision and confidence, trust in the sample regression coefficient can be augmented accordingly.
Also, we wish to be upfront about an important limitation, which is the assumption of multivariate normality. Future work, which we intend to perform, could start from more general assumptions. For example, instead of assuming a multivariate normal distribution, it would be a further advance to extend the APP to regression coefficients under a multivariate skew normal distribution. In the meantime, the present work is nevertheless useful even if the assumption of multivariate normality is violated. To see why, consider that skewness decreases sample sizes necessary to meet specifications for precision and confidence (e.g., Trafimow et al., 2019; Wang et al., 2019a; Wang et al., 2019c). Thus, when multivariate normality is violated, the present computer program will overestimate necessary sample sizes needed to meet specifications for precision and confidence. Hence, the results the program produces can be considered conservative sample size estimates; if a researcher collects the sample size indicated by the computer program, he or she can be assured that precision and confidence are at the specified level, or better. Of course, in those cases where multivariate normality does apply, the results produced by the computer program should be very accurate and neither conservative nor liberal.
Finally, applied researchers should consider potential applications of their research, and explicitly consider how accurate the estimation needs to be to base an intervention or policy change on the regression coefficients they obtain. They can render specifications for precision and confidence accordingly. Also, the total cost of collecting a sample with required sample size n should be considered in selecting f and c. In our real data example, for and . If we use and with instead, the required sample sizes are 308 and 1014, respectively. Such sample sizes cannot be reached here because the whole data set contains only 200 observations.
In conclusion, we hope and expect that the present contribution provides an alternative to power analysis for researchers who use correlational designs that feature regression coefficients. If the goal is to attain statistical significance, power analysis makes sense; but if the goal is to obtain sample regression coefficients that are trustworthy estimators of corresponding population regression coefficients, the present APP extension is best.