Original Article

A Comparative Study of Approximations for Perturbation Analysis of Principal Components

Jacques Bénasséni*1, Alain Mom1

Methodology, 2025, Vol. 21(1), 27–45, https://doi.org/10.5964/meth.15357

Received: 2024-08-20. Accepted: 2025-02-07. Published (VoR): 2025-03-31.

Handling Editor: Tamás Rudas, Eötvös Loránd University, Budapest, Hungary

*Corresponding author at: Université Rennes 2, Place du Recteur Henri Le Moal CS 24307 - 35043 Rennes cedex, France. E-mail: jacques.benasseni@univ-rennes2.fr

This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Principal component analysis is a well-known method for dimension reduction based on the covariance matrix associated with a multivariate data table. Accordingly, a large amount of work has been devoted to analyzing the sensitivity of the eigenstructure of this matrix to influential observations. In order to evaluate the effect of deleting one observation or a small subset of observations, several approximations for the perturbed eigenelements have been proposed. This paper provides a theoretical and numerical comparison of the main approximations. Special emphasis is given to those based on Rayleigh quotients, since they are under-utilized given their excellent performance. A general approach, using refined inequalities, is proposed in order to get a precise evaluation of their accuracy without having to recompute the exact perturbed eigenvalues and eigenvectors. This approach is of specific interest from a computational standpoint. The theoretical developments are illustrated with a numerical study which emphasizes the accuracy of the approximations based on Rayleigh quotients.

Keywords: approximation, eigenvalue and eigenvector, covariance matrix, principal component analysis, perturbation, Rayleigh quotient.

We consider an n × p data matrix X in which the n rows are the observation vectors xi^t ∈ R^p, i = 1, …, n. Letting x¯ = (1/n)∑_{i=1}^n xi denote the mean vector of dimension p, the covariance matrix S = (1/n)∑_{i=1}^n (xi − x¯)(xi − x¯)^t is involved in a wide range of statistical methods, including principal component analysis and multiple regression for example. Since S is known to be highly sensitive to influential observations, sensitivity aspects of principal component analysis have been discussed in several papers including Critchley (1985), Jolliffe (2002), Pack et al. (1988), Prendergast (2008), Prendergast and Li Wai Suen (2011), and Tanaka (1988), among many others. It should be noted that perturbation issues for this method are still the subject of active research, as emphasized by the recent work of Masioti et al. (2023) for example. In order to assess the influence of a small subset I of r observations on the eigenstructure of S, a possible approach consists of studying the effect of removing these r observations on the eigenelements of S. If a significant modification of the eigenelements is observed, we know that these observations are strongly influential on the results. In this framework, letting S~ denote the covariance matrix obtained without the subset of observations indexed by I, several authors have studied the relationship between the eigenelements of S and those of S~. More specifically, providing approximations to the eigenelements of S~ allows one to detect influential subsets of observations without having to recompute the exact modified eigenvalues and eigenvectors. Hadi and Nyquist (1993), Masioti et al. (2023), and Wang and Nyquist (1991) study the effect of deleting a single observation, while Bénasséni (2018), Enguix-González et al. (2005) and Wang and Liski (1993) focus on the general case where I comprises several observations. In these works, approximations to the eigenvalues and eigenvectors are obtained by retaining the first terms in power expansions of these parameters.
Independently of these works, Bénasséni (1987) suggests using approximations based on Rayleigh quotients together with inequalities given in Wilkinson (1988). More generally, approximations for the covariance matrix are also the subject of research in a wider framework as emphasized by other works including for example Enguix-González et al. (2015) which considers the moments of the eigenelements or Enguix-González et al. (2012) dealing with the conditional bias of eigenvalues.

The contribution of this work is twofold. First, some new elements of comparison are provided between approximations derived from power expansions and those based on Rayleigh quotients. This comparison is based on some simple theoretical relations together with the numerical study of a data table which is intended to provide some guidance on the approximations to choose in practice. Second, refined inequalities in Chatelin (2012) are introduced in order to evaluate the accuracy of approximations without having to recompute the eigenelements of S~. It is proved that these inequalities always provide a sharper evaluation than those used in Bénasséni (1987). This is a definite advantage from a computational standpoint.

In this work, the eigenvalues λ1 > λ2 > … > λp ≥ 0 of S are assumed to be simple and associated with the normalized eigenvectors ϕ1, ϕ2, …, ϕp. In the same way, the eigenvalues λ~1 > λ~2 > … > λ~p ≥ 0 of the perturbed matrix S~ are also assumed simple and associated with the normalized eigenvectors ϕ~1, ϕ~2, …, ϕ~p.

Approximations Based On Power Expansions

Theoretical Background On Matrix Perturbations

Referring to Bénasséni (2018) or Enguix-González et al. (2005) and letting x¯I = (1/r)∑_{i∈I} xi, we know that, when the subset of observations indexed by I is deleted, the covariance matrix S is transformed into S~, which can be expressed as:

1
S~ = S + (r/(n−r))[S − (1/r)∑_{i∈I}(xi − x¯)(xi − x¯)^t] − (r/(n−r))²(x¯I − x¯)(x¯I − x¯)^t

We then have a perturbation of the form S~ = S + ϵM + ϵ²N with ϵ = r/(n−r), M = S − (1/r)∑_{i∈I}(xi − x¯)(xi − x¯)^t and N = −(x¯I − x¯)(x¯I − x¯)^t. Following matrix perturbation theory as detailed in Wilkinson (1988) for example, or referring to Sibson (1979), we know that, if ϵ is sufficiently small, for each simple eigenvalue λ of S there is an eigenvalue λ~ of S~ given by a convergent power series:
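As a quick numerical check, decomposition (1) can be compared with a direct recomputation of the covariance matrix after deletion. The following minimal NumPy sketch uses synthetic data; the sample size, dimensions and deleted subset are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 20, 4, 2
X = rng.normal(size=(n, p))

def cov(Y):
    # covariance with divisor n, as in the paper: S = (1/n) sum (x_i - xbar)(x_i - xbar)^t
    Yc = Y - Y.mean(axis=0)
    return Yc.T @ Yc / Y.shape[0]

S = cov(X)
I = [3, 7]                                  # indices of the r deleted observations
S_tilde_direct = cov(np.delete(X, I, axis=0))

# decomposition (1): S~ = S + eps*M + eps^2*N with eps = r/(n-r)
eps = r / (n - r)
xbar = X.mean(axis=0)
D = X[I] - xbar                             # rows (x_i - xbar)^t, i in I
d = (X[I].mean(axis=0) - xbar)[:, None]     # x_bar_I - x_bar, as a column vector
M = S - D.T @ D / r
N = -d @ d.T                                # N enters with a minus sign
S_tilde_formula = S + eps * M + eps**2 * N

print(np.allclose(S_tilde_direct, S_tilde_formula))   # True
```

The agreement is exact up to floating-point rounding, for any data matrix and any deleted subset.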

2
λ~ = λ + γ1ϵ + γ2ϵ² + … + γmϵ^m + O(ϵ^{m+1})

with a corresponding eigenvector which can also be expressed as a convergent power series:

3
ϕ~ = ϕ + ψ1ϵ + ψ2ϵ² + … + ψmϵ^m + O(ϵ^{m+1})

The parameters γ1, γ2, …, γm and ψ1, ψ2, …, ψm are derived by equating the coefficients of ϵ, ϵ², …, ϵ^m in the equation S~ϕ~ = λ~ϕ~. It should also be noted that the perturbation λ~k of λk may not necessarily be the kth largest eigenvalue of S~ if, for example, the subset of observations indexed by I initially has a strong influence on λk. In this case, following Critchley (1985), one can simply assume that the eigenvalues λ~k have been reordered in decreasing order and that their corresponding eigenvectors ϕ~k have been relabeled accordingly. However, the reader is referred to Masioti et al. (2023) for a more comprehensive discussion of this topic.

Formulae for the Approximations Derived From Power Expansions

For any integer m ≥ 1, retaining only the terms of order less than or equal to m in ϵ in (2) and (3) provides the approximations λ~(m) and ϕ~(m) of order m for the eigenvalues and eigenvectors of S~.

This is the general approach suggested in Bénasséni (2018), Enguix-González et al. (2005) and Wang and Liski (1993). Assuming that ϵ = r/(n−r) is sufficiently small to ensure the convergence of the above power series, Enguix-González et al. (2005) provide the following approximations for k = 1, …, p:

4
λ~k(1) = λk + (r/(n−r))(λk − (1/r)∑_{i∈I} αki²)
5
λ~k(2) = λ~k(1) − (r/(n−r))²[αkI² + ∑_{j≠k} (1/(λj − λk))((1/r)∑_{i∈I} αki αji)²]
6
ϕ~k(1) = ϕk + (r/(n−r))∑_{j≠k} ((1/r)∑_{i∈I} αki αji) ϕj/(λj − λk)

where αki = ϕk^t(xi − x¯) for i = 1, …, n, and αkI = (1/r)∑_{i∈I} αki.

Details on the derivation of the above expressions are given in Bénasséni (2018). The reader is referred to Enguix-González et al. (2005) for the formula of ϕ~k(2), which is fairly long and therefore omitted in this paper. It should also be pointed out that a comprehensive study of approximations for the unbiased matrix (n−r)S~/(n−r−1) is provided in this last reference. In the remainder of the paper, we also define the approximations of order zero as λ~(0) = λ and ϕ~(0) = ϕ for notational convenience.
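The approximations (4)-(6) only require the eigendecomposition of the unperturbed matrix S. A minimal NumPy sketch, assuming simple eigenvalues; the function name and data layout are our own illustrative choices, not from the paper:

```python
import numpy as np

def deletion_approximations(X, I):
    """Order-one/two eigenvalue approximations (4)-(5) and order-one
    eigenvector approximation (6) after deleting the rows indexed by I."""
    n, p = X.shape
    r = len(I)
    eps = r / (n - r)
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    lam, Phi = np.linalg.eigh(S)
    lam, Phi = lam[::-1], Phi[:, ::-1]          # eigenvalues in decreasing order
    alpha = Phi.T @ Xc[I].T                     # alpha[k, i] = phi_k^t (x_i - xbar), i in I
    alpha_I = alpha.mean(axis=1)                # alpha_kI = (1/r) sum_i alpha_ki

    lam1 = lam + eps * (lam - (alpha**2).sum(axis=1) / r)        # eq. (4)
    lam2 = lam1.copy()
    Phi1 = Phi.copy()
    for k in range(p):
        acc = 0.0
        for j in range(p):
            if j != k:
                c_kj = alpha[k] @ alpha[j] / r                   # (1/r) sum_i alpha_ki alpha_ji
                acc += c_kj**2 / (lam[j] - lam[k])
                Phi1[:, k] += eps * c_kj / (lam[j] - lam[k]) * Phi[:, j]   # eq. (6)
        lam2[k] -= eps**2 * (alpha_I[k]**2 + acc)                # eq. (5)
    return lam1, lam2, Phi1
```

Applied to a data matrix, lam1 and lam2 correspond to the columns λ~k(1) and λ~k(2) of the numerical study below, and Phi1 gives the (unnormalized) order-one eigenvector approximations.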

Finally, when studying the influence of a single observation, note that the approximations are simply obtained by taking r = 1, I = {i} and ϵ = 1/(n−1) in the previous developments. The reader more specifically interested in this case is referred to a series of papers including Critchley (1985), Hadi and Nyquist (1993), Masioti et al. (2023) and Wang and Nyquist (1991).

Approximations Based On Rayleigh Quotients

Rayleigh Quotients as Approximations to the Perturbed Eigenvalues

Assuming that weights are given to the observations, Bénasséni (1987) studies the effects of modifying these weights on the eigenvalues and eigenvectors of the covariance matrix. Deleting a small subset of observations indexed by I is therefore a particular case of his approach, which consists simply of setting the corresponding weights to zero. As approximations to λ~k for k = 1, …, p, this author suggests using the Rayleigh quotient qk(0) = (ϕ~k(0))^t S~ ϕ~k(0) / (ϕ~k(0))^t ϕ~k(0) of S~ at the initial normalized eigenvector ϕ~k(0), and the Rayleigh quotient qk(1) = (ϕ~k(1))^t S~ ϕ~k(1) / (ϕ~k(1))^t ϕ~k(1) of S~ at the order-one approximation ϕ~k(1) to ϕ~k.
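Since ϕ~k(0) = ϕk is already normalized, qk(0) reduces to ϕk^t S~ ϕk, so only the perturbed matrix S~, and not its eigendecomposition, has to be formed. A minimal NumPy sketch on synthetic data with a well-separated spectrum (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 4
X = rng.normal(size=(n, p)) * np.array([10.0, 3.0, 1.0, 0.5])   # well-separated spectrum
i = 4                                                           # delete one observation

def cov(Y):
    Yc = Y - Y.mean(axis=0)
    return Yc.T @ Yc / Y.shape[0]

S, S_tilde = cov(X), cov(np.delete(X, i, axis=0))
lam, Phi = np.linalg.eigh(S)
lam, Phi = lam[::-1], Phi[:, ::-1]               # decreasing order

# Rayleigh quotients q_k^(0) of S~ at the unperturbed unit eigenvectors phi_k
q0 = np.array([Phi[:, k] @ S_tilde @ Phi[:, k] for k in range(p)])

lam_tilde = np.sort(np.linalg.eigvalsh(S_tilde))[::-1]
print(np.abs(lam_tilde - lam))    # raw perturbation of each eigenvalue
print(np.abs(lam_tilde - q0))     # error of the Rayleigh-quotient approximation
```

The quotient qk(1) is obtained in exactly the same way, replacing ϕk by the normalized order-one approximation ϕ~k(1).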

Error Analysis

From a computational standpoint, it is of major interest to evaluate the accuracy of the approximations without having to recompute the exact eigenelements of S~. In order to do this, Bénasséni (1987) suggests using inequalities provided in Wilkinson (1988). Focusing on λ~k and its corresponding eigenvector ϕ~k and, from now on, assuming without change of notation that ϕ~k(m) has been normalized, let ηk(m) = ‖S~ϕ~k(m) − qk(m)ϕ~k(m)‖2 for m = 0, 1, where ‖.‖2 stands for the 2-norm. Assume that ck(m) is a nonzero positive constant such that |λ~j − qk(m)| > ck(m) for j = 1, …, p with j ≠ k. Then the accuracy of qk(m) as an approximation to λ~k, and of ϕ~k(m) as an approximation to ϕ~k, is analyzed in Bénasséni (1987) using the following inequalities given in Wilkinson (1988, pp. 172–176):

7
‖ϕ~k − ϕ~k(m)‖2² ≤ (ηk(m)/ck(m))²[1 + (ηk(m)/ck(m))²]

and if ηk(m)/ck(m)<1:

8
|λ~k − qk(m)| ≤ [(ηk(m))²/ck(m)] / [1 − (ηk(m)/ck(m))²]

Using ‖ϕ~k‖2 = ‖ϕ~k(m)‖2 = 1, note that (7) can be written in terms of the cosine between ϕ~k and ϕ~k(m) as:

9
1 − [(ηk(m))²/(2(ck(m))²)][1 + (ηk(m)/ck(m))²] ≤ cos(ϕ~k, ϕ~k(m)).

In practice, it is necessary to give the parameter ck(m) a value in the previous inequalities. When studying the influence of a single observation i, it should be noted that (1) can be expressed as

10
S~ = (n/(n−1))[S − (1/(n−1))(xi − x¯)(xi − x¯)^t]

so that we have a rank-one perturbation. In this case, the parameter ck(m) is given a value in Bénasséni (1987) using bounds, derived from the Courant–Fischer theorem, for the eigenvalues λ~j of the symmetric matrix S~, j = 1, …, p, j ≠ k. In the case where several observations are deleted, this author suggests a fairly lengthy procedure in which these observations are removed one after the other, so that we have a series of rank-one perturbations.
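Inequalities (7) and (8) can be checked numerically for the leading eigenpair. In the sketch below (NumPy, synthetic data of our own choosing), the constant c is computed, for verification purposes only, from the exact perturbed spectrum; in practice it would be obtained from the bounds discussed in this section:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 4
X = rng.normal(size=(n, p)) * np.array([10.0, 3.0, 1.0, 0.5])   # well-separated spectrum
i = 7

def cov(Y):
    Yc = Y - Y.mean(axis=0)
    return Yc.T @ Yc / Y.shape[0]

S, S_tilde = cov(X), cov(np.delete(X, i, axis=0))
v = np.linalg.eigh(S)[1][:, -1]            # unit eigenvector of S for the largest eigenvalue
q = v @ S_tilde @ v                        # q_1^(0)
eta = np.linalg.norm(S_tilde @ v - q * v)  # residual norm eta_1^(0)

lam_t, Phi_t = np.linalg.eigh(S_tilde)
c = np.min(np.abs(lam_t[:-1] - q))         # gap to the other perturbed eigenvalues
x = eta / c
assert x < 1                               # condition for Inequality (8)

# Inequality (8): eigenvalue error bound
assert abs(lam_t[-1] - q) <= (eta**2 / c) / (1 - x**2)

# Inequality (7): squared distance between exact and approximate eigenvectors
w = Phi_t[:, -1]
w = w * np.sign(w @ v)                     # fix the arbitrary sign of the computed eigenvector
assert np.sum((w - v)**2) <= x**2 * (1 + x**2)
print(abs(lam_t[-1] - q), (eta**2 / c) / (1 - x**2))
```

With a spectrum this well separated, the ratio x = η/c is small and both bounds hold with a comfortable margin.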

New Developments

Some Relations Between the Approximations

In his work, Bénasséni (1987) only considers the approximation of order one λ~k(1) for the eigenvalues of S~ and provides no comparison of λ~k(2) with the approximations based on Rayleigh quotients. However, it is easy to derive the following simple relations. First note that qk(0) can be written as:

11
qk(0) = λk + (r/(n−r))(λk − (1/r)∑_{i∈I} αki²) − (r/(n−r))² αkI²

using (1). Therefore, we always have λ~k(1) ≥ qk(0). Furthermore, a simple comparison of (11) with (5) shows that:

λ~k(2) − qk(0) = −(r/(n−r))² ∑_{j≠k} (1/(λj − λk))((1/r)∑_{i∈I} αki αji)².

In particular, when focusing on the largest eigenvalue, which plays a central role in PCA, this difference is nonnegative, so that we have λ~1(2) ≥ q1(0). In a similar way, when considering the smallest eigenvalue, we get λ~p(2) ≤ qp(0).
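These relations are exact algebraic identities and can be verified numerically. The sketch below (NumPy, synthetic data of our own choosing) checks (11) in the rearranged form qk(0) = λ~k(1) − (r/(n−r))²αkI², from which λ~k(1) ≥ qk(0) follows at once:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, r = 20, 4, 2
X = rng.normal(size=(n, p)) * np.array([10.0, 3.0, 1.0, 0.5])
I = [2, 9]
eps = r / (n - r)

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n
Xk = np.delete(X, I, axis=0)
Xkc = Xk - Xk.mean(axis=0)
S_tilde = Xkc.T @ Xkc / (n - r)            # exact perturbed covariance matrix

lam, Phi = np.linalg.eigh(S)
lam, Phi = lam[::-1], Phi[:, ::-1]
alpha = Phi.T @ Xc[I].T                    # alpha_ki, i in I
alpha_I = alpha.mean(axis=1)

lam1 = lam + eps * (lam - (alpha**2).sum(axis=1) / r)       # eq. (4)
q0 = np.array([Phi[:, k] @ S_tilde @ Phi[:, k] for k in range(p)])

print(np.allclose(q0, lam1 - eps**2 * alpha_I**2))          # True
print(np.all(lam1 >= q0))                                   # True
```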

We omit the derivation of qk(1), which is tedious and leads to a formula too complicated to be interpreted. However, it should be noted that this approximation involves terms up to order 4 in ϵ = r/(n−r).

Improved Inequalities in Error Analysis

In practice, it is of crucial importance to have bounds as close as possible to the true values of |λ~k − qk(m)| and cos(ϕ~k, ϕ~k(m)) in Inequalities (8) and (9). This allows for an evaluation as precise as possible of the real accuracy of the approximations without having to recompute the perturbed analysis. For this purpose, we introduce below refined inequalities in the study of covariance matrices in order to get an improved error analysis.

Indeed, Inequalities (7), (8) and (9) introduced in the Subsection Error Analysis can be improved using error analysis developed in Chatelin (2012, pp. 180–184). More precisely, it is easily derived from Corollary 4.6.4 in this reference that:

12
|λ~k − qk(m)| ≤ (ηk(m))²/ck(m)

and

13
sin(ϕ~k, ϕ~k(m)) ≤ ηk(m)/ck(m)

under the condition:

14
ηk(m)<ck(m).

It is obvious that (12) is more accurate than (8). A similar remark holds for (13), which improves (9). This last point is easily checked by converting (13) into

15
cos²(ϕ~k, ϕ~k(m)) ≥ 1 − (ηk(m)/ck(m))².

Then, letting a = (ηk(m)/ck(m))², A = 1 − a and B = [1 − (a/2)(1 + a)]², Inequality (9) becomes cos²(ϕ~k, ϕ~k(m)) ≥ B, so that we simply have to prove that A ≥ B. Developing B, we get B = A + (a²/4)(a² + 2a − 3). Thus we have A ≥ B if and only if the polynomial a² + 2a − 3 is non-positive. This is the case when a belongs to [−3, 1]. Therefore the result follows, since 0 ≤ a < 1 from (14).

Furthermore, when dealing with the eigenvector associated with the largest eigenvalue, Chatelin (2012, p. 204) points out that Inequality (13) can also be refined into the following tangent-based inequality:

16
tan(ϕ~1, ϕ~1(m)) ≤ η1(m)/c1(m)

since this eigenvalue is assumed to be simple. More precisely, letting α denote the angle between the two vectors ϕ~1 and ϕ~1(m), we have α ≤ arctan(η1(m)/c1(m)) ≤ arcsin(η1(m)/c1(m)), since 0 ≤ arctan x ≤ arcsin x for all x ∈ [0, 1]. Thus we obtain a better bound on α when using the function arctan rather than the function arcsin, showing that (16) improves (13).
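Both refinements can be observed numerically. In the following sketch (NumPy, synthetic data; c1 is taken from the exact perturbed spectrum for verification purposes only), bound (12) is never larger than bound (8), and the angle bound derived from (16) improves the one derived from (13):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 4
X = rng.normal(size=(n, p)) * np.array([10.0, 3.0, 1.0, 0.5])
i = 11

def cov(Y):
    Yc = Y - Y.mean(axis=0)
    return Yc.T @ Yc / Y.shape[0]

S, S_tilde = cov(X), cov(np.delete(X, i, axis=0))
v = np.linalg.eigh(S)[1][:, -1]            # phi_1, the order-zero approximation
q = v @ S_tilde @ v
eta = np.linalg.norm(S_tilde @ v - q * v)

lam_t, Phi_t = np.linalg.eigh(S_tilde)
c = np.min(np.abs(lam_t[:-1] - q))
x = eta / c
assert x < 1                               # condition (14)

bound_8 = (eta**2 / c) / (1 - x**2)        # Wilkinson-type bound (8)
bound_12 = eta**2 / c                      # refined bound (12): always smaller
assert bound_12 <= bound_8
assert abs(lam_t[-1] - q) <= bound_12

angle = np.arccos(np.clip(abs(v @ Phi_t[:, -1]), 0.0, 1.0))
assert np.tan(angle) <= x                  # Inequality (16)
assert np.arctan(x) <= np.arcsin(x)        # so the tangent bound on the angle is tighter
```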

Improved Value for the Parameter ck(m)

The sharpness of the bounds in Inequalities (12), (13) and (16) depends on the value of the parameter ck(m): the larger ck(m) is, the sharper these inequalities are. In order to get a suitable value for this parameter, some other results of Chatelin (2012, pp. 180–181) or of Wilkinson (1988, pp. 174–176) turn out to be of specific interest, since they often significantly improve the value of the parameter ck(m) obtained through the Courant–Fischer theorem in Bénasséni (1987). More precisely, from these references we know that there is at least one eigenvalue of S~ in each of the intervals defined for j = 1, …, p by [bj(m), Bj(m)] = [qj(m) − ηj(m), qj(m) + ηj(m)], which are often referred to as Krylov–Weinstein intervals. When the interval [bk(m), Bk(m)] is isolated from the p − 1 other ones, we know that it contains precisely one eigenvalue. Then, assuming that the Rayleigh quotients satisfy q1(m) > q2(m) > … > qp(m) (after having been reordered if necessary), a value for ck(m) can easily be derived as ck(m) = min(bk(m) − Bk+1(m), bk−1(m) − Bk(m)) if k ∈ {2, …, p−1}, c1(m) = b1(m) − B2(m) if k = 1, and cp(m) = bp−1(m) − Bp(m) if k = p. Finally, note that these Krylov–Weinstein intervals have also proved useful in studying the effects of small errors in the data table itself, as emphasized by the work of Bénasséni (1988, pp. 303–310).

It should be pointed out that, for very close eigenvalues, giving a value to this parameter can be a real issue. However, once a value satisfying (14) is obtained, we know from (12) that, for k = 1, …, p, the eigenvalues λ~k lie in the intervals [qk(m) − (ηk(m))²/ck(m), qk(m) + (ηk(m))²/ck(m)], which are sharper than the Krylov–Weinstein intervals as soon as (14) holds. These new intervals can be used to obtain a larger value for the parameter ck(m), thus improving (12), (13) and (16). This process could be iterated, but no significant improvement is generally observed.
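The construction of ck(m) from the Krylov–Weinstein intervals uses only the approximate eigenpairs and the perturbed matrix S~, never the perturbed eigenelements themselves. A minimal NumPy sketch (the function name is ours); a non-positive returned constant signals overlapping intervals, that is, eigenvalues too close for the construction to apply:

```python
import numpy as np

def kw_gap_constants(S_tilde, V):
    """Krylov-Weinstein intervals from approximate eigenvectors only.
    V: columns are normalized approximate eigenvectors, ordered so that the
    Rayleigh quotients q are decreasing. Returns q, eta and the constants c;
    a non-positive c[k] means the k-th interval overlaps a neighbouring one."""
    q = np.einsum('ik,ij,jk->k', V, S_tilde, V)        # Rayleigh quotients
    eta = np.linalg.norm(S_tilde @ V - V * q, axis=0)  # residual norms
    b, B = q - eta, q + eta                            # interval endpoints
    p = len(q)
    c = np.empty(p)
    for k in range(p):
        gaps = []
        if k + 1 < p:
            gaps.append(b[k] - B[k + 1])               # gap to the interval below
        if k > 0:
            gaps.append(b[k - 1] - B[k])               # gap to the interval above
        c[k] = min(gaps)
    return q, eta, c
```

Whenever ηk < ck, Inequality (12) then localizes λ~k in [qk − ηk²/ck, qk + ηk²/ck] without any recomputation of the perturbed eigenelements.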

Finally, it was noted in (10) that we have a rank-one perturbation when deleting a single observation. Several inequalities and relations for this restricted-rank perturbation are given in Bénasséni (1987), Hadi and Nyquist (1993), and Wang and Nyquist (1991). The reader is also referred to more recent works by Bénasséni (2011), Cheng et al. (2014), and Ipsen and Nadler (2009), who suggest new bounds for the perturbed eigenvalues. Although the Krylov–Weinstein intervals generally provide a satisfying value for the parameter ck(m), these recent works can also prove useful for determining the largest possible constant ck(m), in order to make Inequalities (12), (13) and (16) sharper.

Numerical Study

The numerical illustration of the results is based on the soil composition data in Kendall (1975), which have already been used in several works including, among others, Bénasséni (2018), Critchley (1985), Enguix-González et al. (2005), Enguix-González et al. (2012), Tanaka (1988), Wang and Liski (1993), and Wang and Nyquist (1991) for sensitivity studies of covariance-based principal component analysis. The data table consists of 20 observations measured on 4 variables. We have the following four eigenvalues for the corresponding covariance matrix: λ1 = 82.30827, λ2 = 6.73891, λ3 = 0.44783, λ4 = 0.24552. In the first subsection, we study the effect of deleting each of the 20 observations on the two largest eigenvalues (which account for more than 99% of the total variation in principal component analysis) and on their corresponding eigenvectors. In the following subsection, we study the effects of deleting subsets of two observations on the largest eigenvalue. These subsets are those considered for the numerical study in Wang and Liski (1993). Finally, a short illustration of the potential effects of removing three observations is given in the last subsection.

Approximations When Deleting One Observation

Table 1 provides, for each subset I = {i}, the perturbed eigenvalue λ~1, its order-one and order-two approximations λ~1(1) and λ~1(2), the Rayleigh quotients q1(0) and q1(1), and the differences between each approximation and the true perturbed eigenvalue λ~1. The last two columns give the bounds on |λ~1 − q1(0)| and |λ~1 − q1(1)| provided by Inequality (12).

Table 1

Approximations and Error Analysis for the Largest Eigenvalue With r = 1

I | λ~1 | λ~1(1) | λ~1(2) | q1(0) | q1(1) | λ~1−λ~1(1) | λ~1−λ~1(2) | λ~1−q1(0) | λ~1−q1(1) | Bound (12) for q1(0) | Bound (12) for q1(1)
1 | 81.59976 | 81.8333 | 81.59769 | 81.58031 | 81.59969 | -0.23355 | 0.00207 | 0.01945 | 6.87×10⁻⁵ | 0.01988 | 6.91×10⁻⁵
2 | 77.18681 | 77.64932 | 77.18511 | 77.17611 | 77.18667 | -0.46251 | 0.00170 | 0.01070 | 1.42×10⁻⁴ | 0.01115 | 1.46×10⁻⁴
3 | 86.52309 | 86.52844 | 86.52306 | 86.52255 | 86.52309 | -0.00535 | 2.53×10⁻⁵ | 5.38×10⁻⁴ | 5.20×10⁻⁹ | 5.40×10⁻⁴ | 5.22×10⁻⁹
4 | 77.34975 | 77.64477 | 77.32399 | 77.17133 | 77.34799 | -0.29503 | 0.02576 | 0.17842 | 1.76×10⁻³ | 0.18922 | 1.78×10⁻³
5 | 86.57878 | 86.58112 | 86.57875 | 86.57801 | 86.57878 | -0.00234 | 2.99×10⁻⁵ | 7.71×10⁻⁴ | 1.06×10⁻⁷ | 7.74×10⁻⁴ | 1.07×10⁻⁷
6 | 76.85039 | 77.33877 | 76.8502 | 76.84922 | 76.85037 | -0.48838 | 1.86×10⁻⁴ | 1.17×10⁻³ | 1.53×10⁻⁵ | 1.28×10⁻³ | 1.66×10⁻⁵
7 | 79.95562 | 80.27544 | 79.95368 | 79.94044 | 79.95552 | -0.31981 | 0.00194 | 0.01518 | 1.02×10⁻⁴ | 0.01544 | 1.02×10⁻⁴
8 | 74.71505 | 75.29919 | 74.71264 | 74.70229 | 74.71477 | -0.58414 | 0.00241 | 0.01276 | 2.72×10⁻⁴ | 0.01326 | 2.79×10⁻⁴
9 | 74.45165 | 75.05167 | 74.4498 | 74.44174 | 74.45144 | -0.60002 | 0.00185 | 9.91×10⁻³ | 2.04×10⁻⁴ | 0.01073 | 2.19×10⁻⁴
10 | 86.02449 | 86.05404 | 86.02442 | 86.02318 | 86.02449 | -0.02955 | 7.21×10⁻⁵ | 1.30×10⁻³ | 3.99×10⁻⁸ | 1.33×10⁻³ | 4.04×10⁻⁸
11 | 85.32593 | 85.38314 | 85.3254 | 85.31697 | 85.32593 | -0.05721 | 5.31×10⁻⁴ | 8.96×10⁻³ | 8.62×10⁻⁷ | 9.06×10⁻³ | 8.63×10⁻⁷
12 | 86.58382 | 86.58643 | 86.58381 | 86.58359 | 86.58382 | -0.00261 | 1.06×10⁻⁵ | 2.26×10⁻⁴ | 2.48×10⁻⁹ | 2.32×10⁻⁴ | 2.53×10⁻⁹
13 | 84.13095 | 84.20755 | 84.12781 | 84.07951 | 84.13094 | -0.07660 | 0.00314 | 0.05143 | 6.97×10⁻⁶ | 0.05293 | 6.99×10⁻⁶
14 | 86.52446 | 86.52855 | 86.52439 | 86.52267 | 86.52446 | -0.00410 | 6.56×10⁻⁵ | 1.78×10⁻³ | 3.46×10⁻⁷ | 1.82×10⁻³ | 3.50×10⁻⁷
15 | 82.72790 | 82.91768 | 82.72732 | 82.72175 | 82.72788 | -0.18978 | 5.84×10⁻⁴ | 6.15×10⁻³ | 1.37×10⁻⁵ | 6.26×10⁻³ | 1.39×10⁻⁵
16 | 86.63917 | 86.63923 | 86.63917 | 86.63917 | 86.63917 | -5.48×10⁻⁵ | 3.36×10⁻⁸ | 6.79×10⁻⁷ | 2.13×10⁻¹³ | 7.21×10⁻⁷ | 2.59×10⁻¹³
17 | 83.27902 | 83.42004 | 83.27668 | 83.25055 | 83.27899 | -0.14101 | 0.00234 | 0.02847 | 3.27×10⁻⁵ | 0.02925 | 3.29×10⁻⁵
18 | 80.80479 | 81.08976 | 80.80396 | 80.79763 | 80.80476 | -0.28497 | 8.28×10⁻⁴ | 7.16×10⁻³ | 3.42×10⁻⁵ | 7.57×10⁻³ | 3.59×10⁻⁵
19 | 78.33933 | 78.75217 | 78.33900 | 78.33701 | 78.33931 | -0.41283 | 3.31×10⁻⁴ | 2.33×10⁻³ | 2.18×10⁻⁵ | 2.53×10⁻³ | 2.37×10⁻⁵
20 | 86.37150 | 86.38477 | 86.37149 | 86.37133 | 86.37150 | -0.01328 | 9.05×10⁻⁶ | 1.73×10⁻⁴ | 1.09×10⁻⁹ | 1.86×10⁻⁴ | 1.17×10⁻⁹

The first comment regarding the results in this table is that λ~1(1) is by far the least accurate approximation in all cases. In contrast, q1(1) always provides extremely sharp approximations: it deviates from λ~1 by 1.76×10⁻³ in the worst case (when deleting Observation 4), and the error is only 2.13×10⁻¹³ when deleting Observation 16. It should be noted that λ~1(2) also provides fairly satisfying approximations, although clearly less accurate than q1(1). The Rayleigh quotient q1(0) is outperformed by λ~1(2) but remains significantly sharper than λ~1(1). Furthermore, it is worth pointing out that λ~1(1) always overestimates the perturbed eigenvalue, while the other three approximations slightly underestimate it. Note also that the results agree with the inequalities λ~1(1) ≥ q1(0) and λ~1(2) ≥ q1(0) given in the Subsection Some Relations Between the Approximations.

Second, Inequality (12) provides bounds sufficiently close to |λ~1 − q1(0)| and |λ~1 − q1(1)| to correctly evaluate the accuracy of the Rayleigh quotients as approximations to λ~1 without having to recompute the perturbed analysis.

Third, it is easily seen from (10) that the maximum possible value for the perturbed eigenvalue is obtained when xi = x¯, with λ~1 = (20/19)λ1 = 86.64028. The highest perturbed eigenvalues are obtained when deleting Observations 3, 5, 10, 12, 14, 16 and 20, which are fairly close to x¯, and in these cases we get the sharpest approximations to λ~1.

Focusing now on the second largest eigenvalue, Table 2 provides results similar to those of Table 1. It turns out that λ~2(1) is the least accurate approximation to λ~2. Except when deleting Observation 13, the Rayleigh quotient q2(1) again provides the best approximation, with very good accuracy: in the worst case, corresponding to this observation, we have λ~2 − q2(1) = 7.69×10⁻⁴. For this observation, λ~2(2) is slightly better, but it is less accurate in all the other cases while still performing fairly well in general. The Rayleigh quotient q2(0) performs in a similar way to q1(0) in Table 1.

Table 2

Approximations and Error Analysis for the Second Largest Eigenvalue with r = 1

I | λ~2 | λ~2(1) | λ~2(2) | q2(0) | q2(1) | λ~2−λ~2(1) | λ~2−λ~2(2) | λ~2−q2(0) | λ~2−q2(1) | Bound (12) for q2(0) | Bound (12) for q2(1)
1 | 6.80760 | 6.83824 | 6.80935 | 6.82480 | 6.80764 | -0.03065 | -0.00175 | -0.01721 | -4.90×10⁻⁵ | 0.22557 | 7.06×10⁻⁴
2 | 7.03262 | 7.04231 | 7.03372 | 7.03961 | 7.03271 | -0.00969 | -0.00110 | -0.00699 | -9.02×10⁻⁵ | 0.08423 | 1.02×10⁻³
3 | 6.73724 | 6.75509 | 6.73721 | 6.73727 | 6.73724 | -0.01785 | 2.51×10⁻⁵ | -3.39×10⁻⁵ | 1.36×10⁻⁶ | 7.24×10⁻³ | 1.53×10⁻⁶
4 | 5.66955 | 5.87923 | 5.68474 | 5.81532 | 5.66891 | -0.20968 | -0.01519 | -0.14577 | 6.36×10⁻⁴ | 2.52388 | 0.02060
5 | 6.10058 | 6.15035 | 6.10049 | 6.10071 | 6.10057 | -0.04977 | 9.24×10⁻⁵ | -1.22×10⁻⁴ | 1.36×10⁻⁵ | 0.01169 | 1.59×10⁻⁵
6 | 7.09264 | 7.0928 | 7.09266 | 7.09276 | 7.09264 | -1.56×10⁻⁴ | -1.90×10⁻⁵ | -1.15×10⁻⁴ | 1.71×10⁻⁶ | 1.23×10⁻³ | 1.82×10⁻⁵
7 | 6.91733 | 6.94011 | 6.91920 | 6.93203 | 6.91743 | -0.02277 | -0.00187 | -0.01470 | -9.66×10⁻⁵ | 0.17207 | 1.12×10⁻³
8 | 7.03111 | 7.04295 | 7.03284 | 7.04029 | 7.03131 | -0.01184 | -0.00173 | -0.00917 | -1.96×10⁻⁴ | 0.09986 | 2.08×10⁻³
9 | 7.07835 | 7.08123 | 7.07877 | 7.08058 | 7.07840 | -0.00288 | -4.22×10⁻⁴ | -2.23×10⁻³ | -4.77×10⁻⁵ | 0.02615 | 5.16×10⁻⁴
10 | 6.95776 | 6.96488 | 6.95776 | 6.95811 | 6.95776 | -0.00712 | 3.17×10⁻⁶ | -3.48×10⁻⁴ | 3.59×10⁻⁷ | 0.01375 | 3.83×10⁻⁷
11 | 6.55595 | 6.59092 | 6.55642 | 6.56447 | 6.55595 | -0.03497 | -4.69×10⁻⁴ | -0.00851 | 2.69×10⁻⁶ | 0.11689 | 1.33×10⁻⁵
12 | 6.85444 | 6.86361 | 6.85422 | 6.85150 | 6.85444 | -0.00917 | 2.21×10⁻⁴ | 0.00293 | 1.79×10⁻⁶ | 0.00536 | 2.30×10⁻⁶
13 | 5.50884 | 5.62489 | 5.50826 | 5.54759 | 5.50807 | -0.11605 | 5.83×10⁻⁴ | -0.03875 | 7.69×10⁻⁴ | 0.84790 | 8.23×10⁻⁴
14 | 6.04911 | 6.07404 | 6.04375 | 6.02038 | 6.04857 | -0.02493 | 0.00536 | 0.02873 | 5.42×10⁻⁴ | 0.05689 | 5.82×10⁻⁴
15 | 6.98255 | 6.99306 | 6.98304 | 6.98777 | 6.98256 | -0.01051 | -4.84×10⁻⁴ | -0.00522 | -1.04×10⁻⁵ | 0.06610 | 1.34×10⁻⁴
16 | 7.08039 | 7.08097 | 7.08038 | 7.08031 | 7.08039 | -5.88×10⁻⁴ | 3.51×10⁻⁶ | 7.55×10⁻⁵ | 1.05×10⁻⁹ | 7.94×10⁻⁵ | 1.15×10⁻⁹
17 | 6.48050 | 6.53032 | 6.48147 | 6.50067 | 6.48043 | -0.04981 | -9.70×10⁻⁴ | -0.02017 | 7.06×10⁻⁵ | 0.36046 | 3.20×10⁻⁴
18 | 7.04966 | 7.05459 | 7.04998 | 7.05253 | 7.04967 | -0.00492 | -3.18×10⁻⁴ | -2.87×10⁻³ | -1.18×10⁻⁵ | 0.03878 | 1.56×10⁻⁴
19 | 7.09124 | 7.09159 | 7.09128 | 7.09148 | 7.09125 | -3.43×10⁻⁴ | -3.48×10⁻⁵ | -2.38×10⁻⁴ | -2.45×10⁻⁶ | 2.68×10⁻³ | 2.73×10⁻⁵
20 | 7.08664 | 7.08696 | 7.08664 | 7.08662 | 7.08664 | -3.23×10⁻⁴ | 1.23×10⁻⁶ | 2.54×10⁻⁵ | 9.25×10⁻¹² | 3.36×10⁻⁴ | 4.31×10⁻⁹

It should be noted that λ~2(1) again always overestimates the perturbed eigenvalue but that, in contrast to Table 1, the other three approximations may slightly underestimate or overestimate λ~2.

Another difference with Table 1 is that the bounds provided for |λ~2 − q2(0)| and |λ~2 − q2(1)| by Inequality (12) are not as close to these quantities as they were previously. In part, this can be explained by the fact that we have smaller values for c2(0) and c2(1) than for c1(0) and c1(1) when considering the largest eigenvalue. Indeed, for m = 0, 1 the gap |λ~3 − q2(m)| is smaller than the gap |λ~2 − q1(m)|. However, except for the case of Observation 4, we know from this bound that |λ~2 − q2(1)| never exceeds 1.02×10⁻³, and this is sufficient for practical interpretation.

Results for the eigenvectors corresponding to the two largest eigenvalues are given in Table 3, which provides the sines sin(ϕ~k, ϕk) and sin(ϕ~k, ϕ~k(1)) for k = 1, 2, together with the bounds provided by Inequality (13).

Table 3

Approximations and Error Analysis for the Eigenvectors Associated to the Two Largest Eigenvalues with r = 1

I | sin(ϕ~1,ϕ1) | sin(ϕ~1,ϕ~1(1)) | Bound (13) for sin(ϕ~1,ϕ1) | Bound (13) for sin(ϕ~1,ϕ~1(1)) | sin(ϕ~2,ϕ2) | sin(ϕ~2,ϕ~2(1)) | Bound (13) for sin(ϕ~2,ϕ2) | Bound (13) for sin(ϕ~2,ϕ~2(1))
1 | 0.01609 | 9.56×10⁻⁴ | 0.01644 | 9.62×10⁻⁴ | 0.01988 | 0.00153 | 0.19303 | 0.01056
2 | 0.01218 | 1.40×10⁻³ | 0.01267 | 1.44×10⁻³ | 0.01213 | 0.00142 | 0.11768 | 0.01250
3 | 0.00259 | 8.07×10⁻⁶ | 0.00261 | 8.09×10⁻⁶ | 0.00912 | 4.62×10⁻⁴ | 0.03408 | 4.94×10⁻⁴
4 | 0.04974 | 4.94×10⁻³ | 0.05280 | 4.99×10⁻³ | 0.08114 | 0.01954 | 0.72248 | 0.06298
5 | 0.00309 | 3.64×10⁻⁵ | 0.00311 | 3.64×10⁻⁵ | 0.01097 | 0.00154 | 0.04563 | 0.00168
6 | 0.00393 | 4.50×10⁻⁴ | 0.00428 | 4.88×10⁻⁴ | 0.01491 | 0.00125 | 0.16533 | 0.01317
8 | 0.01356 | 1.98×10⁻³ | 0.01408 | 2.03×10⁻³ | 0.01310 | 0.00194 | 0.12507 | 0.01782
9 | 0.01169 | 1.68×10⁻³ | 0.01266 | 1.80×10⁻³ | 0.00725 | 0.00107 | 0.06553 | 0.00883
10 | 0.00403 | 2.23×10⁻⁵ | 0.00411 | 2.26×10⁻⁵ | 0.01092 | 2.32×10⁻⁴ | 0.04628 | 2.43×10⁻⁴
11 | 0.01066 | 1.05×10⁻⁴ | 0.01078 | 1.05×10⁻⁴ | 0.01321 | 7.59×10⁻⁴ | 0.13947 | 0.00148
12 | 0.00167 | 5.52×10⁻⁶ | 0.00171 | 5.63×10⁻⁶ | 0.02195 | 5.32×10⁻⁴ | 0.02915 | 5.96×10⁻⁴
13 | 0.02555 | 2.98×10⁻⁴ | 0.02630 | 2.98×10⁻⁴ | 0.05398 | 0.01230 | 0.42319 | 0.01285
14 | 0.00468 | 6.52×10⁻⁵ | 0.00477 | 6.60×10⁻⁵ | 0.07252 | 0.00970 | 0.10386 | 0.01011
15 | 0.00897 | 4.24×10⁻⁴ | 0.00913 | 4.28×10⁻⁴ | 0.01053 | 5.52×10⁻⁴ | 0.10247 | 0.00454
16 | 8.97×10⁻⁵ | 5.16×10⁻⁸ | 9.52×10⁻⁵ | 5.70×10⁻⁸ | 0.00335 | 1.25×10⁻⁵ | 0.00347 | 1.32×10⁻⁵
17 | 0.01918 | 6.50×10⁻⁴ | 0.01971 | 6.55×10⁻⁴ | 0.03624 | 0.00383 | 0.25215 | 0.00729
18 | 0.00962 | 6.66×10⁻⁴ | 0.01017 | 6.97×10⁻⁴ | 0.00999 | 6.86×10⁻⁴ | 0.07882 | 0.00486
19 | 0.00549 | 5.31×10⁻⁴ | 0.00597 | 5.76×10⁻⁴ | 0.00213 | 2.16×10⁻⁴ | 0.02021 | 0.00203
20 | 0.00142 | 3.58×10⁻⁶ | 0.00153 | 3.84×10⁻⁶ | 0.00274 | 7.34×10⁻⁶ | 0.00715 | 2.55×10⁻⁵

First, we consider the eigenvector associated with the largest eigenvalue. It can be noted that the maximum value of the sine between the unperturbed and perturbed eigenvectors is obtained when deleting Observation 4. This value corresponds to an angle of 2.85°. In this case, as in all the other ones, the order-one approximation performs fairly well, since its sine with the perturbed eigenvector is equal to only 0.00494, which corresponds to an angle of 0.28°. Furthermore, the bounds provided by Inequality (13) are always extremely close to the exact value of sin(ϕ~1, ϕ~1(1)).

Focusing now on the eigenvector corresponding to the second largest eigenvalue, the maximum of sin(ϕ~2, ϕ2) is again obtained when deleting Observation 4, with the value 0.08114. Even in this case, the order-one approximation is fairly close to the perturbed eigenvector, since sin(ϕ~2, ϕ~2(1)) = 0.01954. In contrast to the case of the first eigenvector, it is worth pointing out that the bounds provided by Inequality (13) are not always sufficiently close to the true values of the sine to give an exact account of the accuracy of these approximations.

Finally, since the angles between the eigenvectors studied in the table are always very close to zero, we do not provide the tangents of these angles, which deviate from the sines by only an extremely small amount.

Approximations When Deleting Subsets of Two Observations

We now study approximations to the perturbed largest eigenvalue and its corresponding eigenvector when deleting the subsets of two observations considered for the numerical illustration in Wang and Liski (1993). Results similar to those of the previous subsection are provided in Tables 4 and 5.

Table 4

Approximations and Error Analysis for the Largest Eigenvalue with r = 2

I | λ~1 | λ~1(1) | λ~1(2) | q1(0) | q1(1) | λ~1−λ~1(1) | λ~1−λ~1(2) | λ~1−q1(0) | λ~1−q1(1) | Bound (12) for q1(0) | Bound (12) for q1(1)
{8,9} | 64.59310 | 67.25005 | 64.58126 | 64.56084 | 64.59026 | -2.65695 | 0.01184 | 0.03226 | 0.00284 | 0.03616 | 0.00316
{4,9} | 67.51718 | 69.72595 | 67.4481 | 67.32138 | 67.50241 | -2.20877 | 0.06908 | 0.19580 | 0.01476 | 0.20842 | 0.01501
{4,8} | 67.98219 | 69.98722 | 67.85527 | 67.61002 | 67.95618 | -2.00502 | 0.12692 | 0.37218 | 0.02602 | 0.40664 | 0.02660
{6,9} | 69.40391 | 69.40294 | 69.40242 | 69.39558 | 69.40366 | 9.72×10⁻⁴ | 0.00149 | 0.00834 | 2.52×10⁻⁴ | 0.00892 | 2.66×10⁻⁴
{2,9} | 69.74526 | 69.73074 | 69.74115 | 69.72109 | 69.74455 | 0.01451 | 0.00410 | 0.02417 | 7.04×10⁻⁴ | 0.02512 | 7.19×10⁻⁴
{2,8} | 69.98911 | 69.99202 | 69.98832 | 69.98402 | 69.98899 | -0.00291 | 7.86×10⁻⁴ | 0.00509 | 1.21×10⁻⁴ | 0.00565 | 1.33×10⁻⁴
{2,6} | 70.01929 | 72.14491 | 70.01294 | 69.99965 | 70.01808 | -2.12561 | 0.00636 | 0.01964 | 0.00121 | 0.02072 | 0.00126
{3,8} | 78.83587 | 79.36442 | 78.83242 | 78.82489 | 78.83542 | -0.52854 | 0.00345 | 0.01099 | 4.54×10⁻⁴ | 0.01144 | 4.59×10⁻⁴
{2,4} | 72.60356 | 72.46791 | 72.58774 | 72.46791 | 72.60172 | 0.13565 | 0.01582 | 0.13565 | 0.00184 | 0.14414 | 0.00188
{4,7} | 75.53263 | 75.23992 | 75.50596 | 75.22661 | 75.53053 | 0.29271 | 0.02667 | 0.30602 | 0.00210 | 0.33028 | 0.00212
Table 5

Approximations and Error Analysis for the Eigenvector Associated to the Largest Eigenvalue with r = 2

I | sin(ϕ~1,ϕ1) | sin(ϕ~1,ϕ~1(1)) | Bound (13) for sin(ϕ~1,ϕ1) | Bound (13) for sin(ϕ~1,ϕ~1(1))
{8,9} | 0.02254 | 0.00667 | 0.02525 | 0.00745
{4,9} | 0.05637 | 0.01548 | 0.06010 | 0.01574
{4,8} | 0.07724 | 0.02042 | 0.08464 | 0.02089
{6,9} | 0.01128 | 0.00197 | 0.01205 | 0.00207
{2,9} | 0.01950 | 0.00333 | 0.02026 | 0.00340
{2,8} | 0.00856 | 0.00132 | 0.00951 | 0.00146
{2,6} | 0.01740 | 0.00432 | 0.01833 | 0.00449
{3,8} | 0.01220 | 0.00251 | 0.01269 | 0.00253
{2,4} | 0.04477 | 0.00522 | 0.04759 | 0.00532
{4,7} | 0.06609 | 0.00548 | 0.07148 | 0.00553

We note that deleting subsets of two observations can result in larger variations of the eigenvalue of interest than deleting single observations. Indeed, the perturbed eigenvalue is lower than 70 in the first six lines of Table 4.

Furthermore, for all the subsets I studied in this table we observe a decrease of the eigenvalue, while this eigenvalue increases in several cases in Table 1. Despite these significant variations of the eigenvalue, we see that the Rayleigh quotient q1(1) always provides a very accurate approximation to λ~1, since the maximum gap |λ~1 − q1(1)| = 0.02602, observed for I = {4,8}, remains fairly moderate. It should also be noted that q1(1) always performs better than λ~1(2). This point is fairly well illustrated by considering again the case of I = {4,8}, for which we have λ~1 − λ~1(2) = 0.12692. The Rayleigh quotient q1(0) provides less accurate approximations than λ~1(2) but should generally be preferred to λ~1(1), apart from a few cases. Finally, it is worth pointing out that the bounds for |λ~1 − q1(0)| and |λ~1 − q1(1)| provided by Inequality (12) are always very close to the true values of these two differences, thus avoiding the need to recompute the perturbed analysis.

Turning now to the sine values in Table 5, we note the largest variations of the eigenvector when deleting the subsets I = {4,7}, I = {4,8} and I = {4,9}. However, the order-one approximation ϕ~1(1) remains fairly satisfying in all cases, since the maximum value of sin(ϕ~1, ϕ~1(1)), obtained when deleting the subset I = {4,8}, does not exceed 0.02042, which corresponds to an angle of only 1.17°.

Approximations When Deleting Subsets of Three Observations

While the accuracy of q1(1) is fairly good for the removal of two observations despite the small sample size, one might think that the results would be much worse when removing three observations and that the performance of the approximations would deteriorate very quickly as ϵ = r/(n−r) increases. In order to investigate this point, we performed a very brief study of the performance of the approximations for the largest eigenvalue when removing some sets of three observations. Note that in this case we have ϵ = 3/17 ≈ 0.1765, instead of ϵ = 2/18 ≈ 0.1111 when deleting only two observations.

First, when removing the three Observations 3, 4 and 7, the largest eigenvalue, initially equal to λ1 = 82.30827, is moderately decreased to λ~1 = 79.83362. The approximations to λ~1 are then the following: λ~1(1) = 79.5408, λ~1(2) = 79.82602, q1(0) = 79.53947 and q1(1) = 79.83346. We note the excellent accuracy of q1(1), since the error λ~1 − q1(1) = 0.00016 is negligible.

Second, when removing the three Observations 2, 4 and 6, the largest eigenvalue is now significantly decreased to λ~1=65.86825. The approximations to λ~1 are then the following: λ~1(1)=66.33492, λ~1(2)=65.85841, q1(0)=65.7237 and q1(1)=65.86529. The accuracy of q1(1) remains satisfactory, since the error λ~1 − q1(1) equals 0.00295.

Finally, when removing the very influential set {4,8,9}, the largest eigenvalue is drastically decreased to λ~1=55.36930 and we get the following approximations: λ~1(1)=61.15213, λ~1(2)=55.10147, q1(0)=54.87525 and q1(1)=55.26709. Therefore, while the largest eigenvalue is decreased by 26.939 from its initial value, the error λ~1 − q1(1) = 0.10221 remains fairly moderate, although obviously larger than when removing only two observations.

However, additional studies with other data tables would be necessary to obtain more reliable information on the relation between the value of ϵ and the accuracy of the approximations.

Concluding Remarks

The previous numerical study provides some indications of the sharpness of the various approximations in the paper for a particular data set. As a result, it may be useful to provide practitioners with some guidance on the choice of approximations for perturbed covariance matrices.

First, when focusing on eigenvalues, it should be noted that, in this example, the approximations λ~k(1) should be avoided, as should the Rayleigh quotients qk(0), since neither is sufficiently accurate.

In contrast, the Rayleigh quotients qk(1) (based on the perturbed matrix S~ and the order-one approximations ϕ~k(1)) seem to provide reliable approximations to λ~k in all cases. In the study of this data table, they should be preferred to the order-two approximation λ~k(2). Furthermore, their accuracy can be evaluated precisely by Inequality (12) when the eigenvalue of interest is not too close to the other eigenvalues, as emphasized by the results in Table 1 for the largest eigenvalue. This is a definite advantage over the other approximations.

It should also be noted that these Rayleigh quotients perform fairly well even for values of ϵ = r/(n − r) which are not necessarily very close to zero. This point is clear from the first rows of Table 4, where ϵ = 2/18 ≈ 0.11 and the eigenvalue λ~1 is significantly decreased by the perturbation. The Subsection Approximations When Deleting Subsets of Three Observations tends to indicate that these Rayleigh quotients should still perform satisfactorily for even higher values of ϵ. However, more in-depth studies would be necessary to reach reliable conclusions on this point.

Second, when considering eigenvectors, we note the satisfactory accuracy of the order-one approximations ϕ~k(1). Again, when the eigenvalue corresponding to the eigenvector of interest is sufficiently distant from the others, Inequality (13) gives a correct evaluation of this accuracy.

Funding

The authors have no funding to report.

Acknowledgments

The authors are grateful to the reviewers for their careful reading of the paper and their comments.

Competing Interests

The authors have declared that no competing interests exist.

References

  • Bénasséni, J. (1987). Perturbation des poids des unités statistiques et approximation en analyse en composantes principales. RAIRO—Recherche opérationnelle/Operations Research, 21(2), 175-198.

  • Bénasséni, J. (1988). Sensitivity of principal component to data perturbation (pp. 303–310). In E. Diday (Ed.), Data analysis and informatics, V. Elsevier.

  • Bénasséni, J. (2011). Lower bounds for the largest eigenvalue of a symmetric matrix under perturbations of rank one. Linear and Multilinear Algebra, 59(5), 565-569. https://doi.org/10.1080/03081081003709827

  • Bénasséni, J. (2018). A correction of approximations used in sensitivity study of principal component analysis. Computational Statistics, 33, 1939-1955. https://doi.org/10.1007/s00180-017-0790-7

  • Chatelin, F. (2012). Eigenvalues of matrices. Society for Industrial and Applied Mathematics.

  • Cheng, G., Song, Z., Yang, J., & Si, J. (2014). The bounds of the eigenvalues for rank-one modification of Hermitian matrix. Numerical Linear Algebra with Applications, 21(1), 98-107. https://doi.org/10.1002/nla.1867

  • Critchley, F. (1985). Influence in principal component analysis. Biometrika, 72(3), 627-636. https://doi.org/10.1093/biomet/72.3.627

  • Enguix-González, A., Moreno-Rebollo, J. L., & Muñoz Pichardo, J. M. (2015). A better approximation of moments of the eigenvalues and eigenvectors of the sample covariance matrix. Journal of Multivariate Analysis, 142, 133-143. https://doi.org/10.1016/j.jmva.2015.08.002

  • Enguix-González, A., Muñoz Pichardo, J. M., Moreno-Rebollo, J. L., & Barranco-Chamorro, I. (2012). Using conditional bias in principal component analysis for the evaluation of joint influence on the eigenvalues of the covariance matrix. Applied Mathematics and Computation, 218(17), 8937-8950. https://doi.org/10.1016/j.amc.2012.02.054

  • Enguix-González, A., Muñoz Pichardo, J. M., Moreno-Rebollo, J. L., & Pino-Mejías, R. (2005). Influence analysis in principal component analysis through power-series expansions. Communications in Statistics-Theory and Methods, 34(9–10), 2025-2046. https://doi.org/10.1080/03610920500203505

  • Hadi, A., & Nyquist, H. (1993). Further theoretical results and a comparison between two methods for approximating eigenvalues of perturbed covariance matrices. Statistics and Computing, 3, 113-123. https://doi.org/10.1007/BF00147774

  • Ipsen, I. C. F., & Nadler, B. (2009). Refined perturbation bounds for eigenvalues of Hermitian and non-Hermitian matrices. SIAM Journal on Matrix Analysis and Applications, 31(1), 40-53. https://doi.org/10.1137/070682745

  • Jolliffe, I. T. (2002). Principal component analysis (2nd ed.). Springer-Verlag.

  • Kendall, M. G. (1975). Multivariate analysis. Griffin.

  • Masioti, M., Li-Wai-Suen, C., Prendergast, L. A., & Shaker, A. (2023). A note on switching eigenvalues under small perturbations. Communications in Statistics-Theory and Methods, 53(20), 7311-7325. https://doi.org/10.1080/03610926.2023.2263114

  • Pack, P., Jolliffe, I. T., & Morgan, B. J. T. (1988). Influential observations in principal component analysis: A case-study. Journal of Applied Statistics, 15(1), 39-52. https://doi.org/10.1080/02664768800000004

  • Prendergast, L. A. (2008). A note on sensitivity of principal component subspaces and the efficient detection of influential observations in high dimensions. Electronic Journal of Statistics, 2, 454-467. https://doi.org/10.1214/08-EJS201

  • Prendergast, L. A., & Li-Wai-Suen, C. (2011). A new and practical influence measure for subsets of covariance matrix sample principal components with applications to high dimensional datasets. Computational Statistics and Data Analysis, 55(1), 752-764. https://doi.org/10.1016/j.csda.2010.06.022

  • Sibson, R. (1979). Studies in robustness of multidimensional scaling: perturbational analysis of classical scaling. Journal of the Royal Statistical Society: Series B (Methodological), 41(2), 217-229. https://doi.org/10.1111/j.2517-6161.1979.tb01076.x

  • Tanaka, Y. (1988). Sensitivity analysis in principal component analysis: Influence on the subspace spanned by principal components. Communications in Statistics-Theory and Methods, 17(9), 3157-3175. https://doi.org/10.1080/03610928808829796

  • Wang, S.-G., & Liski, E. P. (1993). Effects of observations on the eigensystem of a sample covariance matrix. Journal of Statistical Planning and Inference, 36, 215-226.

  • Wang, S.-G., & Nyquist, H. (1991). Effects on the eigenstructure of a data matrix when deleting an observation. Computational Statistics and Data Analysis, 11(2), 179-188. https://doi.org/10.1016/0167-9473(91)90068-D

  • Wilkinson, J. H. (1988). The algebraic eigenvalue problem. Oxford University Press.