A well-established result is that outliers can wreak havoc on Pearson’s correlation, resulting in a poor and misleading understanding of the nature of the association among the bulk of the data (e.g., Kim et al., 2015; Niven & Deutsch, 2012). These results are related to the fact that the usual estimate of Pearson’s correlation has a breakdown point of only 1/n, where n is the sample size and the breakdown point is the minimum proportion of points that must be altered to make the estimate arbitrarily large or small (e.g., Wilcox, 2022). Moreover, the population Pearson correlation, ρ, has an unbounded influence function. This means that even a small departure from a bivariate normal distribution can alter ρ in a manner that masks a strong association, as illustrated in Wilcox (2022).
Numerous estimators have been derived that are aimed at dealing with outliers or extreme values when measuring the strength of an association. Extensive comparisons of 11 such estimators are reported by Li (2022). Several of these estimators are effective at dealing with the limitations of Pearson’s correlation. But all of them have a common feature that can be a practical concern: they do not make a distinction between good and bad leverage points.
Consider the random sample (X_1, Y_1), …, (X_n, Y_n) and assume that for the bulk of these points

Y = β_0 + β_1 X + ε,    (1)

where the slope β_1 and intercept β_0 are unknown and ε is a random variable having some unknown distribution. The point (X_i, Y_i) is called a leverage point if X_i is an outlier among the values X_1, …, X_n. Let b_1 and b_0 be estimates of β_1 and β_0, respectively, and let r_i (i = 1, …, n) denote the residuals. If r_i is an outlier among r_1, …, r_n and simultaneously (X_i, Y_i) is a leverage point, the point (X_i, Y_i) is a bad leverage point. If (X_i, Y_i) is a leverage point, but r_i is not an outlier, the point is a good leverage point. Roughly, a good leverage point is a point that is reasonably consistent with the linear model for the bulk of the data given by (1), as illustrated in Figure 1. Bad leverage points can negatively impact Pearson’s correlation, Kendall’s tau and Spearman’s rho, as illustrated in the Concluding Remarks.
Figure 1
Let b_1 denote the least squares estimator for the slope and consider the squared standard error of b_1:

σ²/Σ(X_i − X̄)²,    (2)

where σ² is the variance of the error term ε in (1) and X̄ is the mean of X_1, …, X_n. Note that a leverage point inflates the denominator, which in turn lowers the standard error. However, even a single bad leverage point can result in a poor fit to the bulk of the points.
Many robust regression estimators have been derived that have a reasonably high breakdown point. However, all of these estimators, including estimators with the highest possible breakdown point, .5, can be unduly impacted by bad leverage points (Wilcox, 2022). What is needed is a method that retains good leverage points and eliminates bad leverage points.
There are two broad classes of robust measures of association. The first consists of measures that deal with outliers among the marginal distributions; the second deals with outliers in a manner that takes into account the overall structure of the data cloud (Wilcox, 2022). But extant versions do not deal with bad leverage points. This is because these measures are not directly tied to any particular regression estimator. Here, the approach is to first assume that (1) is a reasonable model of the association among the bulk of the participants. Next, estimate the slope and intercept in a manner that avoids the deleterious impact of bad leverage points, and then use this fit to compute an analog of Pearson’s correlation.
The paper is organized as follows. The remainder of this section reviews the method for detecting bad leverage points that is used here. The second section, Proposed Measures of Association, describes the proposed measures of association and methods for making inferences about them. The third section, Simulation Results, reports on how well the inferential methods in the second section perform. The fourth section, Some Illustrations, illustrates the methods using data from actual studies. In contrast to the results reported by Li (2022), as well as Yuan and MacKinnon (2014), the simulation results reported here include situations where the distributions are skewed, which will be seen to be an important issue.
A major advance toward the goal of detecting bad leverage points was derived by Rousseeuw and van Zomeren (1990). Their method begins by fitting a regression line using the least median of squares (LMS) estimator. That is, the slope and intercept are estimated by the values b_1 and b_0, respectively, that minimize the median of the squared residuals. This estimator has the highest possible breakdown point, .5, where the breakdown point refers to the minimum proportion of points that must be altered to make the estimates arbitrarily large or small. Essentially, the breakdown point reflects the sensitivity of an estimator to outliers. A natural conclusion at the time was that because the LMS estimator has the highest possible breakdown point, it will provide a reasonable estimate of the regression line given by (1) even when there are leverage points. However, this is not necessarily the case.
Many robust regression estimators have been derived that have a reasonably high breakdown point. Nevertheless, these estimators generally can be substantially influenced by a few bad leverage points (Wilcox, 2022). That is, a few bad leverage points cannot make the estimates arbitrarily large, but they can alter the estimates of the slope and intercept to the point that a poor fit to the bulk of the data is obtained. One consequence is that the method derived by Rousseeuw and van Zomeren can miss bad leverage points.
Wilcox and Xu (2023) suggested a slight modification of the Rousseeuw and van Zomeren method that deals with the issue just described. The method begins by removing all leverage points and estimating the slope and intercept using some robust regression estimator. No single estimator dominates in terms of efficiency, but two that stand out are the MM-estimator derived by Yohai (1987) and the estimator derived by Theil (1950) and Sen (1968). The MM-estimator has the highest possible breakdown point, .5, while the Theil–Sen estimator has a breakdown point of .29. Both of these estimators are more efficient than the LMS estimator. A possible practical concern with the MM-estimator is that situations are encountered where the iterative estimation method fails. The Theil–Sen estimator avoids this problem and is used here with the understanding that there might be situations where the MM-estimator, or even some other robust estimator, offers a practical advantage.
The Theil–Sen estimator is computed as follows. Let
S_ij = (Y_i − Y_j)/(X_i − X_j)    (3)

for every i < j for which X_i ≠ X_j. The estimate of the slope, b_1, is the median of the S_ij values. The intercept is estimated with b_0 = M_y − b_1 M_x, where M_y and M_x are the medians based on Y_1, …, Y_n and X_1, …, X_n, respectively.
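To make the computation concrete, here is a minimal base R sketch of the Theil–Sen fit just described. The function name theil_sen is ours and is not one of the Rallfun functions described in the final section.

```r
# Minimal sketch of the Theil-Sen estimator.
theil_sen <- function(x, y) {
  ij <- combn(length(x), 2)                                        # all pairs i < j
  slopes <- (y[ij[1, ]] - y[ij[2, ]]) / (x[ij[1, ]] - x[ij[2, ]])  # S_ij, equation (3)
  slopes <- slopes[is.finite(slopes)]                              # drop pairs with tied X values
  b1 <- median(slopes)                                             # slope: median of the pairwise slopes
  b0 <- median(y) - b1 * median(x)                                 # intercept: M_y - b1 * M_x
  c(intercept = b0, slope = b1)
}

# Example with artificial data
set.seed(1)
x <- rnorm(30)
y <- 1 + 0.5 * x + rnorm(30)
theil_sen(x, y)
```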
Note that by deleting all leverage points, the deleterious impact of bad leverage points has been eliminated, in which case values for b_1 and b_0, based on a robust regression estimator, provide an estimate of the parameters in (1). Next, based on this fit, compute the residuals using all of the data, yielding r_1, …, r_n. If r_i is an outlier among r_1, …, r_n, and if X_i is a leverage point, decide that (X_i, Y_i) is a bad leverage point.
Here, the MAD-median rule is used to detect outliers, which is a special case of the outlier detection method in Rousseeuw and van Zomeren (1990). To elaborate, let M denote the median of X_1, …, X_n and let MAD denote the median of |X_1 − M|, …, |X_n − M|. Then X_i is declared an outlier if

|X_i − M|/(MAD/.6745) > 2.24.    (4)

Under normality, MAD/.6745 estimates the population standard deviation.
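As a rough illustration of the detection strategy, the following base R sketch flags bad leverage points by combining the MAD-median rule in (4) with the theil_sen function sketched above. The function names are ours.

```r
# MAD-median outlier rule, equation (4).
mad_median_outlier <- function(x, cutoff = 2.24) {
  M <- median(x)
  MAD <- median(abs(x - M))
  abs(x - M) / (MAD / 0.6745) > cutoff
}

# Flag bad leverage points; assumes theil_sen() from the earlier sketch is available.
bad_leverage <- function(x, y) {
  lev <- mad_median_outlier(x)                      # leverage points: outliers among the X values
  fit <- theil_sen(x[!lev], y[!lev])                # robust fit with leverage points removed
  res <- y - fit["intercept"] - fit["slope"] * x    # residuals based on all of the data
  mad_median_outlier(res) & lev                     # bad leverage: outlying residual AND leverage point
}
```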
The Proposed Measures of Association
Let τ²(Ŷ) and τ²(Y) denote some measure of variation associated with Ŷ and Y, respectively, where Ŷ = β_0 + β_1 X. Explanatory power is

η² = τ²(Ŷ)/τ²(Y).    (5)

From basic principles, when using the least squares regression estimator and τ² is taken to be the usual variance, η² reduces to ρ², where ρ is Pearson’s correlation.
A robust version of η² is obtained simply by using a robust regression estimator in conjunction with some choice for τ² that is robust as well. There are many robust measures of variation, comparisons of which are reported by Lax (1985). One that was found to perform relatively well is the percentage bend measure of variation. The version used here has a breakdown point of .2, which is reasonably high.
The percentage bend measure of variation is computed as follows. Compute (1 − β)n + .5 with β = .2, round this value down to the nearest integer and label the result m. Let M denote the median of X_1, …, X_n and let W_i = |X_i − M|, i = 1, …, n. Let W_(1) ≤ ⋯ ≤ W_(n) be the W_i values written in ascending order. Let

ω̂ = W_(m), U_i = (X_i − M)/ω̂,

and let a_i = 1 if |U_i| < 1. If |U_i| ≥ 1, a_i = 0. The percentage bend midvariance is

ζ̂²_pb = n ω̂² Σ{Ψ(U_i)}²/(Σ a_i)²,    (6)

where

Ψ(x) = max[−1, min(1, x)].

Under normality, this estimator provides a very close estimate of the population variance.
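A minimal R sketch of this computation, with β = .2, is given below; the function name pbvar_sketch is ours (the Rallfun library mentioned in the final section contains its own implementation).

```r
# Sketch of the percentage bend midvariance, equation (6), with beta = .2.
pbvar_sketch <- function(x, beta = 0.2) {
  n <- length(x)
  M <- median(x)
  m <- floor((1 - beta) * n + 0.5)    # (1 - beta)n + .5 rounded down
  W <- sort(abs(x - M))
  omega <- W[m]                       # omega-hat: m-th smallest |X_i - M|
  U <- (x - M) / omega
  a <- as.numeric(abs(U) < 1)         # a_i = 1 if |U_i| < 1, 0 otherwise
  psi <- pmax(-1, pmin(1, U))         # Psi(x) = max(-1, min(1, x))
  n * omega^2 * sum(psi^2) / sum(a)^2
}

# Compare with the usual sample variance under normality
set.seed(2)
z <- rnorm(10000)
c(pb = pbvar_sketch(z), var = var(z))
```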
In summary, the proposed robust analog of the coefficient of determination that deals with bad leverage points is computed as follows. Compute b_1 and b_0, estimates of the slope and intercept, respectively, using a robust regression estimator with bad leverage points removed. This leaves K points, which for notational convenience are denoted by (X_1, Y_1), …, (X_K, Y_K). Let Ŷ_i = b_0 + b_1 X_i (i = 1, …, K). Let τ̂²(Ŷ) denote the percentage bend measure of variation based on Ŷ_1, …, Ŷ_K and let τ̂²(Y) denote the percentage bend measure of variation based on Y_1, …, Y_K. An estimate of η² is

η̂² = τ̂²(Ŷ)/τ̂²(Y).    (7)

A robust analog of Pearson’s correlation that deals with bad leverage points is estimated with

η̂ = sign(b_1)√η̂².    (8)
There is a feature of η̂ that should be noted. Its value can depend on which variable is taken to be the dependent variable. This is, of course, in contrast to Pearson’s correlation, Spearman’s rho and Kendall’s tau.
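Putting the pieces together, here is a rough sketch of the estimate; it relies on the bad_leverage, theil_sen and pbvar_sketch functions sketched earlier and is not the corblp function described in the final section.

```r
# Sketch of eta-hat, equations (7) and (8); relies on bad_leverage(), theil_sen()
# and pbvar_sketch() from the earlier sketches.
eta_hat_sketch <- function(x, y) {
  bad <- bad_leverage(x, y)                        # flag bad leverage points
  xg <- x[!bad]
  yg <- y[!bad]
  fit <- theil_sen(xg, yg)                         # robust fit with bad leverage points removed
  yhat <- fit["intercept"] + fit["slope"] * xg     # fitted values
  eta2 <- pbvar_sketch(yhat) / pbvar_sketch(yg)    # equation (7)
  unname(sign(fit["slope"]) * sqrt(eta2))          # equation (8)
}
```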
Inferences About η
There is the issue of testing hypotheses and computing confidence intervals for η, the population parameter being estimated by η̂. Here, the same three bootstrap methods studied by Li (2022) are considered. For two of these methods there are in fact two variations considered here that differ in how bootstrap samples are generated. In effect, five methods are considered. The first variation is related to extant methods for making inferences about the parameters associated with a linear model. An approach that has been studied extensively is to first remove any outliers among the independent variable and then make inferences about the slope and intercept using the remaining data.
The details of the first method are as follows. Again let (X_1, Y_1), …, (X_K, Y_K) denote the remaining data after bad leverage points are removed. Next, generate a bootstrap sample by randomly sampling with replacement K pairs of points from (X_1, Y_1), …, (X_K, Y_K), yielding the bootstrap sample (X*_1, Y*_1), …, (X*_K, Y*_K).
Based on a bootstrap sample, compute the numerator of (7) and use this value in (7) and (8), yielding η̂*. Note that a bootstrap version of the denominator of (7), τ̂²(Y), is not used. If a bootstrap estimate of the denominator is used, simulations indicated that control over the Type I error probability is poor.
Next, repeat the above process B times, yielding η̂*_1, …, η̂*_B. A bootstrap estimate of the squared standard error of η̂ is

S² = (1/(B − 1)) Σ (η̂*_b − η̄*)²,    (9)

where η̄* = Σ η̂*_b/B. Results in Efron (1987) suggest that generally B = 100 suffices, and this choice was found to perform reasonably well in simulations, but at the suggestion of a referee a larger number of bootstrap samples is used here.
The test statistic for testing

H_0: η = 0    (10)

is

T = η̂/S,    (11)

which is assumed to have a standard normal distribution when the null hypothesis is true. A confidence interval for η is taken to be

η̂ ± z_{1−α/2} S,    (12)

where z_{1−α/2} is the 1 − α/2 quantile of a standard normal distribution. Following Li, this method is labeled BSI.
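The BSI calculations are easily expressed in R. The sketch below assumes that η̂ has been computed from the original data and that a vector of B bootstrap estimates, eta_boot, has been generated as described above; the function name is ours.

```r
# Sketch of method BSI given the original estimate eta_hat and B bootstrap estimates eta_boot.
bsi_inference <- function(eta_hat, eta_boot, alpha = 0.05) {
  S <- sd(eta_boot)                            # square root of equation (9)
  T <- eta_hat / S                             # test statistic, equation (11)
  pval <- 2 * (1 - pnorm(abs(T)))              # standard normal reference distribution
  z <- qnorm(1 - alpha / 2)
  ci <- c(eta_hat - z * S, eta_hat + z * S)    # confidence interval, equation (12)
  list(test.stat = T, p.value = pval, conf.int = ci)
}
```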
The next approach is based on two variations of the basic percentile bootstrap method. The first, which is labeled method PBR, proceeds as done by method BSI, meaning that the analysis begins by removing bad leverage points and generating bootstrap samples based on the remaining data. In contrast to BSI, no estimate of the standard error is used. Another difference is that now both the numerator and denominator of (7) are computed based on a bootstrap sample. Put the bootstrap values in ascending order, yielding η̂*_(1) ≤ ⋯ ≤ η̂*_(B). Let p̂* = A/B, where A is the number of bootstrap estimates that are less than zero. A p-value is

2 min(p̂*, 1 − p̂*)    (13)

(Liu & Singh, 1997). A confidence interval is

(η̂*_(ℓ+1), η̂*_(u)),    (14)

where ℓ is αB/2 rounded to the nearest integer and u = B − ℓ.
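In R, the percentile bootstrap p-value and confidence interval are obtained from the sorted bootstrap estimates; the sketch below assumes a vector eta_boot of B bootstrap estimates, and the function name is ours.

```r
# Sketch of the percentile bootstrap p-value (13) and confidence interval (14).
pb_inference <- function(eta_boot, alpha = 0.05) {
  B <- length(eta_boot)
  phat <- mean(eta_boot < 0)         # proportion of bootstrap estimates below zero
  pval <- 2 * min(phat, 1 - phat)    # equation (13)
  srt <- sort(eta_boot)
  ell <- round(alpha * B / 2)        # ell: alpha*B/2 rounded to the nearest integer
  u <- B - ell                       # u = B - ell
  list(p.value = pval, conf.int = c(srt[ell + 1], srt[u]))    # equation (14)
}
```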
Methods BSI and PBR reflect a common approach in the context of a linear model, where leverage points, but not outliers among the dependent variable, are removed and a percentile bootstrap method is performed on the remaining data (Wilcox, 2022). In contrast, when working with some measure of association, bootstrap samples are based on all of the data. Otherwise, given the goal of making inferences about η, the details are exactly the same as method PBR. This alternative approach is labeled PBC.
The final two methods are based on the bias-corrected and accelerated bootstrap (BCa) interval. The main difference from the percentile bootstrap method is that BCa attempts to correct for any bias associated with an estimator as well as any skewness associated with the bootstrap distribution. This is done via two parameters. The first is based on the proportion of bootstrap estimates that are less than the observed value, η̂. The second is aimed at adjusting the confidence interval based on how much the bootstrap distribution is skewed to the right or left. Complete computational details are in Efron (1987) and Li (2022). The practical point here is that the method is readily applied via the R function bcajack2, which can be accessed via the R package bcaboot. Again two variations are used. The first begins by removing bad leverage points (BCaR) and the second uses all of the data when generating bootstrap samples (BCaC).
Simulation Results
Simulations were used to get some sense of how well the methods in the previous section perform when dealing with sample sizes n = 20, 40 and 100. A few simulations were run with n = 200 and 300.
Data were generated based on the linear model given by (1). Three distributions were used for X and the error term ε: standard normal, a symmetric heavy-tailed distribution, and a skewed distribution with relatively high skewness and kurtosis.
An h distribution, which belongs to the family of g-and-h distributions, was used to generate data from a symmetric, heavy-tailed distribution. A value X is generated from this distribution by first generating a value from a standard normal distribution, yielding Z, and computing

X = Z exp(hZ²/2),    (15)

where h > 0 determines how heavy the tails are.
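Generating values from an h distribution in R takes only a couple of lines; the value h = 0.2 in the sketch below is purely an illustrative choice.

```r
# Generate n values from an h distribution: X = Z * exp(h * Z^2 / 2), equation (15).
rh <- function(n, h = 0.2) {
  z <- rnorm(n)
  z * exp(h * z^2 / 2)
}
```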
The skewed distribution that was used is a lognormal distribution, which is motivated in part by various studies summarized in Wilcox (2022) looking at the degree to which distributions might be non-normal. This distribution is also motivated by results reported by Cain et al. (2017), who reviewed estimates of skewness and kurtosis reported in papers published in two journals: Psychological Science and the American Education Research Journal. The skewness and kurtosis of a lognormal distribution are 6.185 and 113.9, respectively. Based on 1,567 estimates collected by Cain et al. (2017), this level of skewness is realistic and relatively extreme. The skewness and kurtosis of a lognormal distribution are larger than 99% of the estimates reported by Cain and colleagues, suggesting that if a method performs reasonably well for this seemingly large departure from a normal distribution, this offers some assurance that it will work well in practice.
Table 1 reports the estimated probability of a Type I error when using BSI to test (10) at the .05 level. The estimates are based on 3000 replications. Bradley (1978) suggested that as a general guide, when testing at the .05 level, the actual level should be between .025 and .075. As can be seen, all of the estimates in Table 1 indicate that this criterion is satisfied. Moreover, for each estimate, a .95 confidence interval for the actual level can be computed using the Agresti and Coull (1998) method, indicating how far an estimate must be from .075, or from .025, before those values are excluded. Note that the largest estimate in Table 1 is .068, which occurred when X has a lognormal distribution, the error term has a standard normal distribution and n = 100. Increasing the sample size to n = 200, the estimate is .074. For n = 300, the estimate is .077, suggesting that BSI might not be asymptotically correct.
Table 1
| X | ε | n = 20 (B = 100) | n = 20 | n = 40 | n = 100 |
| --- | --- | --- | --- | --- | --- |
| N | N | .062 | .054 | .057 | .047 |
| N | LN | .041 | .036 | .023 | .037 |
| N | H | .048 | .051 | .043 | .051 |
| H | N | .065 | .053 | .054 | .058 |
| H | LN | .047 | .048 | .029 | .029 |
| H | H | .057 | .034 | .050 | .047 |
| LN | N | .063 | .049 | .051 | .068 |
| LN | LN | .037 | .040 | .023 | .034 |
| LN | H | .047 | .036 | .043 | .065 |

Note. Testing at the .05 level. N = normal, LN = lognormal, H = h distribution.
As for method PBR, situations were found where it performed in a reasonably accurate manner, but situations were found where it performed poorly. For example, when the independent variable has a standard normal distribution, the estimated Type I error probabilities ranged between .051 and .058 over the three sample sizes and the three distributions used for the error term. However, when the independent variable has a lognormal distribution, it performed poorly. When the error term has a normal distribution, the estimates for the three sample sizes are .072, .092 and .100, respectively. When the error term has a lognormal distribution as well, the estimates are .072, .061 and .095. That is, method PBR deteriorates as the sample size increases. Consequently, this method is not considered further.
Table 2 reports the results for method PBC. A limitation of method PBC is that when n = 20, numerical errors associated with estimates based on a bootstrap sample often occurred. Consequently, results for n = 20 are not reported. In contrast to method PBR, the estimates are close to the nominal level for all of the situations considered. Moreover, the situations where PBR fails as the sample size increases were not a problem when using PBC. For example, when the sample size is increased and both X and the error term have lognormal distributions, the estimate is .045 using PBC. It was previously noted that when X has a lognormal distribution and the error term has a normal distribution, the estimated probability of a Type I error using BSI is .074 when n = 200 and .077 when n = 300. Using PBC, the estimate is .045.
Table 2
| X | ε | n = 40 | n = 100 |
| --- | --- | --- | --- |
| N | N | .041 | .047 |
| N | LN | .041 | .042 |
| N | H | .041 | .045 |
| H | N | .038 | .044 |
| H | LN | .038 | .039 |
| H | H | .036 | .041 |
| LN | N | .038 | .042 |
| LN | LN | .043 | .055 |
| LN | H | .037 | .039 |

Note. Testing at the .05 level. N = normal, LN = lognormal, H = h distribution.
Table 3 reports results when using BCaR and BCaC. A positive feature of BCaR is that it improves on method PBR, but when the independent variable is lognormal, it performed poorly. Indeed, like method PBR, its ability to control the Type I error probability deteriorates as the sample size increases. Notice that when X has a lognormal distribution and the error term has the h distribution, the estimate is .107 when n = 100. Increasing the sample size to n = 200, the estimate is .117.
Table 3
| X | ε | BCaR, n = 20 | BCaR, n = 40 | BCaR, n = 100 | BCaC, n = 40 | BCaC, n = 100 |
| --- | --- | --- | --- | --- | --- | --- |
| N | N | .057 | .051 | .048 | .040 | .042 |
| N | LN | .065 | .061 | .059 | .050 | .047 |
| N | H | .051 | .054 | .052 | .041 | .053 |
| H | N | .055 | .064 | .065 | .032 | .046 |
| H | LN | .054 | .061 | .065 | .042 | .051 |
| H | H | .053 | .056 | .067 | .043 | .040 |
| LN | N | .060 | .082 | .080 | .027 | .045 |
| LN | LN | .062 | .074 | .085 | .045 | .064 |
| LN | H | .077 | .086 | .107 | .030 | .046 |

Note. Testing at the .05 level. N = normal, LN = lognormal, H = h distribution.
Both PBC and BCaC were found to control the probability of a Type I error reasonably well for all situations considered. BSI performed reasonably well when n = 20 and has the advantage of avoiding the computational issues associated with PBC and BCaC when n = 20. But the extent to which these three methods deal with bad leverage points is unclear. The h distribution and the lognormal distribution have a tendency to generate outliers. That is, situations are being considered where leverage points are highly likely to occur, as well as regression outliers, meaning outliers among the residuals. But a criticism is that the likelihood of a bad leverage point is relatively low. To deal with this, the simulations were repeated, only now two leverage points were added to the data, namely (4, 4) and (5, 5). The lognormal distribution was shifted to have a median equal to zero so that the value 4 is an outlier. The resulting estimates of the Type I error probability are reported in Table 4. As can be seen, BCaC always performed about as well as or better than PBC. BSI is a bit better than BCaC in some situations but worse in others. If the goal is to have a Type I error probability reasonably close to, but less than, the nominal level, BCaC is best.
Table 4
| X | ε | BSI (n = 20) | BCaC (n = 40) | PBC (n = 40) |
| --- | --- | --- | --- | --- |
| N | N | .058 | .039 | .041 |
| N | LN | .040 | .045 | .039 |
| N | H | .065 | .044 | .038 |
| H | N | .062 | .035 | .035 |
| H | LN | .039 | .043 | .037 |
| H | H | .062 | .045 | .035 |
| LN | N | .063 | .033 | .030 |
| LN | LN | .037 | .034 | .033 |
| LN | H | .065 | .034 | .028 |

Note. Testing at the .05 level. N = normal, LN = lognormal, H = h distribution.
The next set of simulations focused on the ability to get a reasonably accurate confidence interval for η when η differs from zero. Now the data were generated from (1) with a nonzero slope. The actual value of η was determined by generating a random sample of size 100, computing η̂ and repeating this process 1000 times. The mean of the resulting 1000 estimates is taken to be the true value of η. Another approach is to use one large sample size, say 10,000. The Theil–Sen estimator is easily computed when n = 100, but for n = 10,000 this is no longer the case. In practice, when the sample size is extremely large, some alternative robust estimator can be required. A good choice is the MM-estimator derived by Yohai (1987).
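A sketch of this approximation of the true value of η is given below; it assumes a data-generating function gen_xy(n), returning a list with components x and y generated according to (1), together with the eta_hat_sketch function from the earlier sketches. Both names are ours.

```r
# Approximate the population value of eta by averaging estimates over many samples.
# gen_xy(n) is an assumed data generator for model (1); eta_hat_sketch() is the earlier sketch.
true_eta <- function(gen_xy, n = 100, reps = 1000) {
  ests <- replicate(reps, {
    d <- gen_xy(n)
    eta_hat_sketch(d$x, d$y)
  })
  mean(ests)
}
```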
Table 5 reports the results for methods BSI and BCaR, and Table 6 reports results using methods PBC and BCaC. Method BCaR was included because, in contrast to the results in Table 3, it performed reasonably well even when X has a lognormal distribution. Indeed, in a variety of situations, it performed better than method BSI. The main concern with BSI is that the estimates are less than .025 in some situations and the estimates decrease as n gets large, another indication that BSI is not asymptotically correct in some situations.
Table 5
| X | ε | BSI, n = 20 | BCaR, n = 20 | BSI, n = 40 | BCaR, n = 40 | BSI, n = 100 | BCaR, n = 100 | η |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N | N | .024 | .054 | .027 | .054 | .047 | .068 | .698 |
| N | LN | .038 | .050 | .022 | .051 | .021 | .051 | .645 |
| N | H | .032 | .052 | .019 | .064 | .010 | .059 | .636 |
| H | N | .024 | .058 | .012 | .070 | .047 | .065 | .720 |
| H | LN | .035 | .052 | .025 | .056 | .023 | .056 | .637 |
| H | H | .028 | .060 | .015 | .066 | .015 | .056 | .643 |
| LN | N | .031 | .061 | .025 | .063 | .033 | .068 | .597 |
| LN | LN | .035 | .057 | .027 | .052 | .030 | .067 | .513 |
| LN | H | .032 | .051 | .025 | .060 | .023 | .066 | .520 |

Note. η differs from zero. N = normal, LN = lognormal, H = h distribution.
Table 6
| X | ε | PBC, n = 40 | PBC, n = 100 | BCaC, n = 40 | BCaC, n = 100 | η |
| --- | --- | --- | --- | --- | --- | --- |
| N | N | .012 | .025 | .055 | .054 | .698 |
| N | LN | .017 | .031 | .048 | .049 | .645 |
| N | H | .021 | .034 | .052 | .047 | .636 |
| H | N | .011 | .022 | .056 | .062 | .597 |
| H | LN | .013 | .023 | .053 | .050 | .513 |
| H | H | .014 | .023 | .059 | .056 | .520 |
| LN | N | .011 | .018 | .047 | .067 | .720 |
| LN | LN | .010 | .013 | .044 | .057 | .637 |
| LN | H | .012 | .020 | .046 | .054 | .643 |

Note. η differs from zero. N = normal, LN = lognormal, H = h distribution.
As indicated in Table 6, the estimates, when using PBC, are well below the nominal level when n = 40. For n = 100, the estimates are closer to the nominal level, but the method improves rather slowly as the sample size increases. All indications are that BCaC performs better than PBC.
Some Illustrations
Of course, bad leverage points are not always an issue. But the reality is that situations are encountered where bad leverage points are a serious concern as illustrated here.
The first example is based on data dealing with measures of reading ability. The goal is to understand the association between a measure of speeded naming for letters and a measure of speeded naming for digits. Figure 2 shows a scatterplot of the data. Points marked with o are bad leverage points and points marked with * are good leverage points. The scatterplot clearly suggests that for the bulk of the points there is a positive association. Pearson’s correlation, using all of the data, is .106. Testing the hypothesis that Pearson’s correlation is zero, the p-value is .01. (The bootstrap-t method in Wilcox, 2022, Section 9.1 was used, which compares well to the methods studied by Bishara & Hittner, 2012.) Using Spearman’s rho and Kendall’s tau, the estimates are .452 and .347, respectively, both of which reject at the .05 level. In contrast, η̂ is substantially larger and also rejects at the .05 level. That is, all four measures reject at the .05 level, but η̂ paints a decidedly different picture regarding the strength of the association. The .95 confidence intervals for η are (.465, 1), (.537, 1) and (.506, 1) using BSI, PBC and BCaC, respectively. Method PBC has the shortest confidence interval, which was somewhat unexpected because in the simulations, the estimates in Table 6 are smaller when using PBC compared to BCaC.
Figure 2
The next example is based on data reported by Rousseeuw and Leroy (1987, p. 27) that deal with the logarithm of the effective temperature at the surface of 47 stars versus the logarithm of their light intensity. A scatterplot of the data is shown in Figure 3. Pearson’s correlation is negative, while Spearman’s correlation, Kendall’s tau and η̂ are .295, .250 and .607, respectively. The corresponding p-values are .136, .132 and less than .001. The .95 confidence intervals using BSI, PBC and BCaC are (.352, .861), (.360, .844) and (.326, .795), respectively. In this case, BCaC has the shortest confidence interval.
Figure 3
The next illustration is based on measures of cortisol levels taken upon awakening and measured again 30–45 minutes later. Past studies indicate that Time 2 measures tend to be higher than the Time 1 measures. The extent to which this is the case has been found to be associated with various measures of stress. The goal here is to understand the strength of the association between these two measures.
Figure 4 shows a scatterplot of the data. The sample size is 101. The data stem from a study of an intervention program aimed at improving the emotional and physical wellbeing of older adults. Pearson’s correlation, Spearman’s rho and Kendall’s tau are .490, .494 and .358, respectively. In contrast, η̂ indicates a stronger association. All four measures are significant at the .01 level, but clearly the three bad leverage points have an impact on the first three correlation coefficients. The .95 confidence intervals using BSI, PBC and BCaC are (.359, .806), (.357, .757) and (.369, .763), respectively. BCaC has a slightly shorter confidence interval than PBC.
Figure 4
Concluding Remarks
It would be convenient to have a single method that generally performs about as well as or better than competing techniques. Generally, PBC and BCaC perform relatively well, but computational issues eliminate these two approaches when dealing with a small sample size (n = 20). For n ≥ 40, both PBC and BCaC are good choices, and BCaC performs better than PBC in general. An argument for PBC might be that a p-value is readily computed. A p-value does not indicate the probability of making a correct decision about whether η is positive or negative. But in the context of Tukey’s three-decision rule (e.g., Jones & Tukey, 2000), it reflects the strength of the empirical evidence that a decision can be made. BCaC can be used with any choice for α and, in principle, a p-value can be computed, but this is more easily done using PBC. Also, simulations suggest that BCaC will yield a shorter confidence interval than PBC, but the illustrations demonstrate that this is not necessarily the case.
There are many variations of the method used here, each based on some robust regression estimator. Presumably the use of the Theil–Sen estimator does not dominate all other robust regression estimators in terms of power, and perhaps the ability to control the Type I error probability, when testing (10). The only suggestion here is that the Theil–Sen estimator is a reasonably good choice for general use.
The illustrations suggest that at a minimum, when dealing with a linear model, it is prudent to check whether there are any bad leverage points. Otherwise, there is a realistic chance that the nature of the association for the bulk of the points is completely missed.
Finally, R functions are available for dealing with bad leverage points. The R function outblp checks for bad leverage points. The regression estimator that is used is determined by the argument regfun, which defaults to the Theil–Sen estimator. The function corblp computes η̂ and corblp.ci computes a confidence interval for η using method BSI. The function corblppb performs method PBC and corblp.bca.C performs method BCaC. Access to these functions can be achieved by sourcing the file Rallfun-v42, which can be downloaded from the Supplementary Materials.