A Quantile Shift Approach To Main Effects And Interactions In A 2-By-2 Design

When comparing two independent groups, shift functions are basically techniques that compare multiple quantiles rather than a single measure of location, the goal being to get a more detailed understanding of how the distributions differ. Various versions have been proposed and studied. This paper deals with extensions of these methods to main effects and interactions in a between-by-between, 2-by-2 design. Two approaches are studied, one that compares the deciles of the distributions, and one that has a certain connection to the Wilcoxon-Mann-Whitney method. For both methods, we propose an implementation using the Harrell-Davis quantile estimator, used in conjunction with a percentile bootstrap approach. We report results of simulations of false and true positive rates.

is to report results on two methods for controlling the family wise error rate (FWER), meaning the probability of one or more Type I errors.
Here, two distinct approaches are considered.The first defines interactions and main effects with means replaced by a collection of quantiles with an emphasis on the deciles.For example, if θ jk denotes the .2quantile corresponding to level j of the first factor and level k of second factor, one goal is to test H 0 : θ 11 − θ 12 = θ 21 − θ 22 , (1) which mimics the usual notion of an interaction in an obvious way.Here, however, the goal is to use multiple quantiles and to assess how well the FWER is controlled.
Of course, a related issue is computing a reasonably accurate confidence interval for θ 11 − θ 12 − θ 21 + θ 22 .As is evident, main effects can be addressed as well.For example, one can test for a collection of quantiles.
The second approach, when dealing with an interaction, has a certain connection to a rank-based method proposed by Patel and Hoel (1973) that in turn has a connection with the classic Wilcoxon-Mann-Whitney test.To describe the Patel-Hoel approach, let X jk denote four independent random variables where X 11 and X 12 correspond to the first level of the first factor in a 2-by-2 design, while X 21 and X 22 correspond to the second level of the first factor.Let p 1 = P(X 11 < X 12 ) and p 2 = P(X 21 < X 22 ).The hypothesis of no interaction is H 0 : p 1 = p 2 . ( And there is the issue of computing a 1 − α confidence interval for p 1 − p 2 .Wilcox (2022, Section 7.9.2) describes a method for making inferences about this measure of effect size that performs well in simulations.For some related rank-based methods, see Gao and Alvo (2005), as well as De Neve and Thas (2017).
Let X ijk denote a random sample from the jth level of the first factor and the kth level of the second factor (i = 1, …, n jk ; j = 1, 2; k=1, 2).For convenience momentarily focus on p 1 and let D iℎ = X i11 − X ℎ12 (i = 1, …, n 11 ; ℎ = 1, …, n 12 ).An estimate of p 1 is simply H 0 : p 1 = .5 (5) is the same as H 0 : θ 1 = 0. (6) The parameter θ 1 is defined based on Level 1 of the first factor.Let θ 2 denote the analog of θ 1 when dealing with Level 2 of the first factor.Then an analog of the Patel-Hoel in teraction is θ 1 − θ 2 .Here, however, the goal is to consider the broader issue of comparing the deciles of these two distributions.More formally, let q 1 and q 2 denote the qth quantile of the distribution D for Level 1 of the first factor and Level 2 of first factor, respectively.The goal is to test and to compute a 1 − α confidence interval for q 1 − q 2 for q = .1,.2,…, .9.A second goal is to control the FWER in a reasonably accurate manner.
To illustrate this point, consider, for example, a situation where for the first level of the first factor, both levels of the second factor have standard normal normal distribu tions, while for the second level of the first factor, the two levels of the second factor have lognormal distributions.Data were generated as just described based on sample sizes of 50 per group and the 50th quantile was estimated.The estimate of q 1 − q 2 was .027.But interchanging the rows and columns, now the estimate was −.088.The code to reproduce this example can be found in the R notebook apd_ex.Rmd, which is available as part of the companion reproducibility package for this article (Wilcox & Rousselet, 2023) and on GitHub (Rousselet, 2023).

The Proposed Methods
First, consider testing Equation (1).The first issue is choosing a reasonable quantile estimator from among the many estimators that might be used.The focus here is on the estimator derived by Harrell and Davis (1982) for two fundamental reasons.The first has to do with tied values.When comparing two independent groups using the usual sample median, tied values can be accommodated using a slight generalization of a standard percentile bootstrap method.However, when comparing other quantiles using an estimator based on only one or two order statistics, such as those summarized by Hyndman and Fan (1996), this approach no longer performs in an adequate manner (e.g., Wilcox, 2022).Simulation results reported in the Simulation Results section indicate that this remains the case for the situation at hand.
In contrast to the estimators considered by Hyndman and Fan (1996), the Harrell-Da vis estimator uses a weighted average of all the order statistics.Moreover, when using the Harrell-Davis estimator in conjunction with a percentile bootstrap method, all in dications are that this approach does perform well when comparing two independent groups and there are tied values.An issue here is whether this continues to be the case when dealing with a two-way design.A second reason for using the Harrell-Davis estimator is that is has better efficiency under normality compared to using a weighted average of only two order statistics.
For completeness, it is noted that comparing quantiles can be accomplished using the quantile regression estimator derived by Koenker and Bassett (1978).However, this is tantamount to using one or two order statistics.When comparing the medians of two independent groups, for example, and the sample sizes are odd, in effect the usual sample median is being used.Evidently, there is no generalization of the Koenker-Bassett meth od that captures the spirit of the Harrell-Davis estimator.
There are several quantile estimators, in addition to the Harrell-Davis estimator, that use all of the order statistics (e.g., Liu et al., 2022;Navruz & Özdemir, 2020).A possible criticism is that these estimators, including the Harrell-Davis estimator, have a breakdown point of only 1/n.That is, the minimum number of order statistics that must be altered to make the estimate arbitrarily large is one.In practice this issue might not be a serious concern because the extreme order statistics get a relatively small weight.In a situation where the breakdown point is an issue, one possibility is to use the trimmed Harrell-Davis estimator derived by Akinshin (2022).
Let U be a random variable having a beta distribution with parameters a = (n + 1)q and b = (n + 1)(1 − q).Let The Harrell-Davis estimate of the qth quantile is where X (1) ≤ ⋯ ≤ X (n) are the values written in ascending order.The beta weights used to calculate the deciles for a sample size of n = 50 are illustrated in Figure 1.As an alternative to the Harrell-Davis estimator, we considered the default quantile estimator in R, called using the command quantile(x, probs = seq(0.1,0.9,0.1),Type = 7), to compute the deciles.This estimator, in ad dition to being widely used, relies on two order statistics (Hyndman & Fan, 1996), and could thus be more robust to outliers than the Harrell-Davis estimator in some situations.
To test Equation (1), we considered a percentile bootstrap method combined with the Harrell-Davis estimator and the quantile(Type = 7) estimator.The percentile bootstrap method begins by generating bootstrap samples by sampling with replacement n jk values from the data associated with the jth level of the first factor and the kth level of second factor yielding X ijk * (i = 1, …, n jk ).Based on these bootstrap samples, compute the qth quantile using the Harrell-Davis estimator, or the quantile(Type = 7) estimator, yielding A p-value, when testing Equation (1), is 2min P, 1 − P .The term D/B is important when dealing with tied values (e.g., Wilcox, 2022).To compute a 1 − α confidence interval, put Ψ 1 *, …, Ψ B * in ascending order yielding Ψ (1) * ≤ ⋯ ≤ Ψ (B) * .Let ℓ = αB/2, rounded to the nearest integer.Let u = n − ℓ.Then a 1 − α confidence interval for Ψ is Here, B = 2000 is used.
Note that the same bootstrap samples were used for each of the quantiles being compared.An alternative approach is to use separate bootstrap samples for each test to be performed (Wilcox, 1995).Both approaches were considered and there is no indication that separate bootstrap samples offer a practical advantage in terms of controlling the Type I error probability.Results of the simulations comparing the two bootstrap approaches are available in the R notebooks sim_fp_b1b9.Rmd and sim_fp_apd_b1b9.Rmd (Wilcox & Rousselet, 2023).Using the same bootstrap sam ples for all of the tests performed considerably reduces execution time, which is why it is assumed henceforth.
When performing C tests, there is the issue of controlling the FWER, meaning the probability of one or more Type I errors.Two approaches are considered here.The first is Hochberg's (1988) improvement on the Bonferroni method.Let p 1 , …, p C be the p-values associated with the C tests.Put these p-values in descending order, and label the results . Set k = 0 and proceed as follows: stop and reject all hypotheses having a p-value less than or equal to p Repeat Steps 1 and 2 until a significant result is obtained or all C hypotheses have been tested.
The second method for controlling the FWER was Benjamini and Hochberg (1995), which is aimed at controlling the false discovery rate (FDR).That is, the goal is to control the expected proportion of false positives among the rejected hypotheses.It is known that there are situations where the Benjamini-Hochberg method ensures that the false discovery rate is less than or equal to α, but it does not necessarily control the FWER (Hommel, 1988).However, the simulations described in the Simulation Results section, below, indicated that when using Hochberg's method, the FWER is always below the nominal level.Consequently, there was interest in how well the Benjamini-Hochberg method performs.It is readily verified that the Benjamini-Hochberg method always has as much or more power than Hochberg's method.Consequently, provided the Benjami ni-Hochberg method controls the FWER for the situation at hand, it has a practical advantage over Hochberg's method.The Benjamini-Hochberg method is applied simply by replacing p As for testing Equation ( 7), again both the Harrell-Davis and the quantile(Type = 7) esti mators were considered, in conjunction with a percentile bootstrap method.Bootstrap samples are generated as before yielding estimates of q 1 and q 2 , which are labelled q 1 * and q 2 *.This process is repeated B times and the results are used to compute a 1 − α confidence interval for q 1 − q 2 , as well as a p-value, in essentially the same manner as done when testing Equation (1).

Simulation Results
Simulations were used to check the FWER when testing Equation (1) and Equation ( 7), as well as how the power of these methods compares to the classic ANOVA F test and a method for comparing 20% trimmed mean, which is described in Wilcox (2022); Section 7.4.3.Data were generated from four continuous distributions as well as three discrete distributions.The discrete distributions were a Poisson distribution having mean 9, and two beta-binomial distributions, one with parameter r = 1, the other with r = 9, and the other parameters set to s = 9, and 10 bins.The three discrete distributions were included as a partial check on the impact of tied values.The four continuous distributions were standard normal, mixed normal, lognormal and mixed lognormal.The distribution of the mixed normal is where Φ(x) is the standard normal distribution.The mixed normal is a symmetric distri bution with heavy tails, roughly meaning that outliers are likely to occur.The mixed lognormal distribution is given by Equation ( 13), but with Φ replaced by the lognormal distribution.
Based on over 1,500 estimates of skewness and kurtosis reported in various journal articles, Cain et al. (2017) report that 99% of the estimates were less than the skewness and kurtosis of a lognormal distribution.This suggests that if a method performs reason ably when dealing with a lognormal distribution, it is highly likely to perform reasonably well in practice.However, a possible concern is that when dealing with heavy-tailed distributions, the standard error of the usual kurtosis estimator can be quite high even when the sample size is fairly large.Moreover, the usual estimate of kurtosis can grossly under estimate the true value.
Consider, for example, a lognormal distribution, which has kurtosis 113.9.Based on a sample of 100,000, the kurtosis of the lognormal distribution was estimated and this process was repeated 1000 times.It was found that 79% of the estimates were less than the true value.The median estimate was 82.This process was repeated using the mixed lognormal distribution which is skewed and very heavy-tailed.The kurtosis of this distribution was estimated based on a sample size of one million.This process was repeated 1000 times yielding estimates ranging between 242 and 16400.The median estimate was 429.Consequently, the mixed lognormal distribution was used here as an additional check on how well the methods perform.The code for these simulations is available in the notebook kurtosis_estimation.Rmd (Wilcox & Rousselet, 2023).
Simulations were performed for sample sizes 20, 30, 40, ..., 100 per group, using 10,000 iterations.This was done using both types of quantile estimators and for the two main effects and the interaction.A p-value was computed for each of the nine quantiles to be compared.In terms of controlling the FWER, both Hochberg's method and the Benjamini-Hochberg method were considered.And as previously indicated, this was done for seven distributions.So in terms of Type I errors there is a total of 3402 results (7 distributions * 2 quantile estimators * 3 contrasts * 9 deciles * 9 sample sizes).
Finally, simulations were performed dealing with power.Various situations were considered, including shifting the four continuous distributions used to estimate Type I errors, varying the skewness of g-and-h distributions (Hoaglin, 1985), and the skew ness of Poisson and beta-binomial distributions.In all situations, the distributions were adjusted to provide high power at the maximal sample size, to better differentiate the various methods.Complete details, including the code that was used, are reported in files available on figshare (Wilcox & Rousselet, 2023) and on GitHub (Rousselet, 2023).The main R packages used to perform the simulations and illustrate the results were Rcpp (Eddelbuettel & Francois, 2011), ggplot2 (Wickham, 2016) and cowplot (Wilke, 2017).
First we consider the results for testing Equation (1), then for testing Equation (7).

Results When Testing Equation (1)
The results regarding the FWER were highly consistent over the conditions considered, making it a simple matter to briefly summarize the relative merits of the methods being considered.All indications are that the actual FWER is less than the nominal .05level using the Hochberg method as well as the Benjamini-Hochberg method.Consequently, the Benjamini-Hochberg method is recommended because it always has as much or more power than Hochberg's method.

Type I Errors
To illustrate some of the Type I error simulation results, Figure 2 contains results for the normal and lognormal populations, separately for the interaction and the two main effects.Results with Hochberg's method are omitted but detailed figures are available in the notebook sim_fp.Rmd.The FWER without correction for multiple comparison (grey lines) is lower than expected if the 9 tests were independent, which in theory would be 1 − (1 − α) 9 = 0.37.The results in Figure 2 and Figure 3 are about half of that expected value, supporting the use of the FDR correction over Hochberg's.As a sanity check, results for the two types of ANOVAs are included, confirming false positives very close to the nominal level when sampling from a normal population.
Although the percentile bootstrap led to conservative estimates for both quantile estima tors, the Harrell-Davis estimator clearly outperforms quantile(Type = 7).
The gap between the two estimators increases when we consider samples from discrete populations, because the performance of quantile(Type = 7) deteriorates while that of Harrell-Davis remains stable (Figure 3).Bradley (1978) suggested, as a general guide, that when testing at the .05level, the actual level should be between .025 and .075.This criterion was met in most situations for the Harrell-Davis estimator but not for the quantile(Type = 7) estimator.
The higher performance of the Harrell-Davis estimator compared to the quan tile(Type = 7) estimator can be observed at individual deciles as well (Figure 4).However, for the first and last deciles, the Type I error rate is higher than the nominal level when using the Harrell-Davis estimator and n = 20, a result that confirms earlier observations (Wilcox et al., 2014).

Power
Not surprisingly, there are situations where inferences based on means or a 20% trimmed have more power.But there are situations where comparing deciles provides more power: no single method dominates.To illustrate, Figure 5 presents results for normal and lognormal populations.In each case, data were generated for the four groups by sampling from the standard normal and lognormal distributions, before shifting the groups by different amounts.When sampling from normal populations, the ANOVA on means dominates other methods.To compare methods, we report familywise power for the decile methods (bootstrap p-values with and without FDR correction), the probability of at least one rejection among the 9 tests.The ANOVA on 20% trimmed means is a bit less powerful, followed by the bootstrap method using the Harrell-Davis estimator, and last the bootstrap method combined with the quantile(Type = 7) estimator.Switching to lognormal populations, the power of both ANOVA tests is dramatically lower than the bootstrap approach.This figure and the next one were generated using the notebook sim_tp.Rmd.
Figure 6 presents results from populations in which tied values were common.In both cases, the ANOVA on means dominates other methods.When using the bootstrap approach, the gap between the Harrell-Davis estimator and the quantile(Type = 7) estimator is larger than what was observed when sampling from continuous populations.

Compare Deciles of Distributions of All Pairwise Differences -Test Equation (7)
To assess the bootstrap method aimed at testing Equation ( 7), we used the same ap proach employed in the previous section.Now only the interaction is considered.The simulations and illustrations of the Type I error rates can be found in the notebook sim_fp_apd.Rmd.For power, see notebook sim_tp_apd.Rmd.

Type I Errors
Again we observed FWERs much lower than expected if the deciles were independent (Figure 7).For continuous distributions (panels A-D in Figure 7), results were very similar for the Harrell-Davis and quantile(Type = 7) estimators.Both methods were conservative.When sampling from distributions in which tied values are likely, now the Harrell-Davis estimator outperforms the quantile(Type = 7) method (Panels E and F in Figure 7).
For continuous distributions, the similarity in performance between the two quantile methods is evident at the level of individual deciles as well (Figure 8).Under normality, all deciles were associated with Type I error rates close to the nominal level, irrespective of the sample size per group (Panel A in Figure 8).In the worst situation tested, when sampling from a mixed lognormal, results are a bit conservative, especially for the ex treme deciles.In both situation, there is very little separating the two quantile methods.When sampling from a Poisson distribution (M = 9 in all groups), the Type I error rates for the Harrell-Davis estimator remain near 0.05, irrespective of sample size (Panel A in Figure 9).However, the quantile(Type = 7) is conservative and the situation deteriorates with increasing sample size.In the most extreme situation considered, when sampling from a beta-binomial distribution with r = 1, s = 9, nbin = 10, the Type I error rates were lower for both quantile estimators relative to the Poisson case (Panel B in Figure 9), or when sampling from a beta-binomial distribution with r = 9 (not illustrated here, but see notebook sim_fp_apd.Rmd).Although the situation got worse with increasing sample size for both estimators, Harrell-Davis outperformed quantile(Type = 7) in all situations.

Power
Under normality, the ANOVA on means performed best, followed by ANOVA on trim med means and finally the bootstrap method (Panel A in Figure 10).When sampling from lognormal distributions, power was low for the ANOVAs relative to the bootstrap method, and much more so when making inferences about means (Panel B in Figure 10).For the mixed normal distributions, again the ANOVA on means performed poorly, but now the ANOVA on trimmed means dominates the bootstrap approach (Panel C in Figure 10).If sampling from mixed lognormal distributions, now the bootstrap method is the most powerful (Panel D in Figure 10).In all these situations, the Harrell-Davis and quantile(Type = 7) estimators gave very similar results.Finally, in the presence of tied values, the ANOVA on means dominated the other approaches, and the Harrell-Davis estimator led to higher power than the quantile(Type = 7) estimator (Panels E and F in Figure 10).

An Illustration
Both methods are illustrated using data dealing with perceived health (PH) among older adults (Clark et al., 2011).The first factor consists of two educational groups: those who did not complete high school and those who have some college or technical training.The other two groups are based on a measure of depressive symptoms (CESD).One group corresponds to participants with a CESD score greater than 15, which is often taken to indicate mild depression or worse.The other level consists of participants with CESD scores less than or equal to 15.The four groups are defined like this: • A 1 B 1 = lower education, lower CESD.
Perceived health results are illustrated for the four groups in Figure 11A.The figure was generated using the notebook examples.Rmd.A 2x2 ANOVA on means returns these p-values: main effect of A (education) = 0.001; main effect of B (depression) < 0.0001; interaction = 0.09.Should we conclude that we have failed to obtain sufficient evidence about the presence of an interaction?This conclusion would be appropriate if the populations were symmetric and differed only in central tendency.However, the plot of marginals suggests differences in skewness and spread (Figure 11A).In keeping with this observation, considering the deciles reveals a more complex picture, as illustrated in Panels B-F of Figure 11, with patterns of non-uni form group differences.Panels B and C illustrate the shift functions comparing B1 and B2 at each level of A: the decile differences between two groups are plotted as a function of the deciles in one group.Panels D and E illustrate the main effects.The values along the x axis (A1 and B1) correspond to the deciles of observations pooled across groups (for instance A1 = (A1B1, A1B2).Computing the average of the deciles leads to very similar graphs.Finally, Panel F illustrates the interaction, which is highly non-linear, growing from the first decile to the median, and then decreasing and reversing sign.
The output of the decinter R function that tests Equation ( 1) is shown in Table 1.As can be seen, the unadjusted p-values suggest that there is an interaction for the .2-.6 quantiles.Shown are the adjusted p-values based on Hochberg's method (using the R function p.adjust) in order to underscore the practical advantage of the Benja mini-Hochberg method.As indicated, no significant difference is found at the .05level using Hochberg's method.But using the Benjamini-Hochberg correction when testing Equation ( 1), the adjusted p-values, when dealing with the .3,.4 and .5 quantiles are all equal to .030.When testing Equation ( 7), plots of the difference scores help provide perspective.Figure 12A illustrates this point.For each level of education (A1 = lower level; A2 = higher level), every participant with a lower CESD score was compared to every participant with a higher CESD score.As previously noted, when two distributions are identical, the distribution of D is symmetric about zero. Figure 12A suggests that for both groups the distribution of the difference scores is shifted to the right, but with a stronger shift for Group 2 (completed high school -lower panel in Figure 12A).A positive difference indicates higher perceived health in not depressed participants relative to depressed par ticipants.For the second group, testing the hypothesis that the median of the difference scores is zero, the estimate is 10.5, with a [4.2, 14.7] 95% confidence interval, p = 0.In the first group the median is 4.21 [0, 8.4], p = 0.0765.Note.(A) Distributions of all pairwise differences between participants with lower and higher CESD scores.
Participants with lower levels of education are shown at the top, participants with higher levels at the bottom.The vertical lines mark the deciles, with a thicker line for the median.(B).Interaction plot; see details in Figure 11.
Table 2 summarizes the results when testing Equation ( 7).Here, the .1,.25,.5, .75 and .9quantiles are used, which is the default for the R function iband that was used.Now the unadjusted p-values indicate an interaction in the upper tails of the two distributions.For example, the estimates of the 0.75 quantile indicate that for the first level of the first factor (did not complete high school), when comparing the not depressed group to the depressed group, there is 25% chance of getting a difference between perceived health values greater than 14.75, while for the second group there is 25% chance of getting a difference greater than 22.But using the Hochberg adjustment shown here, no significant difference is found at the .05level and this remains the case when using the Benjamini-Hochberg method.

Concluding Remarks
The two new methods presented in this article help gain a deeper understanding of where and by how much groups differ in a 2x2 factorial design.With a sample size of at least 30 per group, all indications are that the methods perform well even when dealing with distributions that have a relatively high amount of skewness and kurtosis.These methods are implemented in the functions decinter to test Equation ( 1) and iband to test Equation ( 7).The R functions decinter and iband can be found in the file Rallfun-v41.txt,downloadable from Wilcox (2024).These functions, the R code for the simulation and the figures are available in the reproducibility package for the article (Wilcox & Rousselet, 2023) and on GitHub (Rousselet, 2023).The reproducibility package also contains Rcpp code that is much faster to execute than the base R version (see notebook examples.Rmd).By default, these functions use 2000 bootstrap samples and correct for multiple comparison using the FDR correction from Benjamini & Hochberg (1995), which outperformed the FWER correction of Hochberg (1988) in our simulations.Even with the FDR correction, the two methods presented here remain conservative, so it would be worthwhile to explore other correction strategies.We have already considered several alternative methods, but they do not improve matters (Benjamini et al., 2006;Benjamini & Yekutieli, 2001;Blanchard & Roquain, 2008).Preliminary investigations of yet other approaches to control the FWER, such as using a maximum statistic distribu tion (Nichols & Holmes, 2002), have not revealed any method that could significantly improve power in all situations, making recommendations difficult.While this issue requires further work, for the moment we recommend to use the FDR correction from Benjamini and Hochberg (1995) by default.More generally, the obvious concern with comparing multiple quantiles is that power might be negatively impacted due to controlling the FWER.But power might also be negatively impacted when focusing on a single quantile simply because other quantiles are being ignored.The example in the previous section demonstrated the risk of drawing conclusions from a single measure of central tendency or quantile.It is also worth keeping in mind that there is no free lunch in inference: methods that can reveal more complex patterns in the data, such as those proposed here, necessarily require larger sample sizes to reveal where and by how much distributions differ.For applications to the deciles considered here, given the increased uncertainty associated with the estima tion of the 1st and 9th deciles, we recommend sample sizes of at least 30, echoing earlier recommendations (Wilcox et al., 2014).
Several strategies are worth considering to boost power, starting with testing more specific hypotheses involving only a subset of quantiles.For instance, one could imagine a cross-validation approach in which a large dataset is split between a discovery set and a testing set.Instead of testing each quantile individually, one could also look for specific patterns across quantiles, such as stochastic dominance, which is characterised by all quantile differences having the same sign, or differences in spread, which are characterised by monotonic trends across quantile differences (Rousselet et al. 2017).Other obvious strategies are to consider different quantile estimators and bootstrap methods.Preliminary investigations suggest that using the quantile(Type = 8) estimator recommended by Hyndman and Fan (1996) improves Type I error rates and power rela tive to the quantile(Type = 7) estimator for instance, but is still outperformed by the Har rell-Davis estimator in all situations.As for parametric methods, Goldman and Kaplan (2018) have proposed a fast and powerful extension of the Kolmogorov-Smirnov test to compare all the quantiles of two independent groups (Kolmogorov, 1992;Stephens, 1992).However, their approach assumes no tied values, and it is unclear how it could be generalised to deal with interactions.
A bootstrap approach provides enough flexibility to deal with a variety of experi mental designs.As such, this work could be extended to mixed designs with withinand between-subject factors.A covariate could be handled using different strategies, including non-parametric methods with smoothers (Wilcox, 1997;Wilcox, 2022, Chapter 12).
A referee inquired about how applied researchers might report results based on the methods studied here.We suggest reporting results as done in Table 1 and Table 2. Funding: The authors have no funding to report.

Figure 1
Figure 1 Beta Weights Used to Calculate the Harrell-Davis Estimates of the Deciles With a Sample Size of 50 Let A denote the number of Ψ* values that are less than zero and let D denote the number of Ψ* values that are equal to zero.Let

Figure 2 False
Figure 2False Positive Results for Normal and Lognormal Populations

Figure 3 False
Figure 3 False Positive Results for Poisson and One of the Two Beta-Binomial Populations

Figure 4 False
Figure 4False Positive Results at Individual Deciles for Normal and Beta-Binomial Populations

Figure 5 Power
Figure 5Power Results for Normal and Lognormal Populations

Figure 6 Power
Figure 6Power Results forPoisson and Beta-Binomial Populations

Figure 7 FWER
Figure 7 FWER Results for the Comparison of the Deciles of Distributions of All Pairwise Differences

Figure 9
Figure 9 Type I Error Rates for the Comparison of the Deciles of Distributions of All Pairwise Differences: Discrete Distributions

Figure 10 Power
Figure 10 Power Results for the Comparison of the Deciles of Distributions of All Pairwise Differences

Figure 11 Example:
Figure 11Example: Comparison of the Deciles of Perceived Health

Figure 12 Example:
Figure 12 Example: Comparison of the Deciles of All Pairwise Differences of Perceived Health.

Table 1
Result for Perceived Health When Testing Equation (1)

Table 2
Result for Perceived Health When Testing Equation(7)