Measurement invariance has become a standard topic in research involving comparisons across cultures (Davidov et al., 2018), organisations (Somaraju et al., 2022), developmental stages (Putnick & Bornstein, 2016) and other subpopulations in which a test may perform differently. The question of whether the test measures the latent trait in the same way in different groups can be investigated either within the framework of factor analysis or item response theory (in the latter case it is referred to as differential item functioning; Wells, 2021). The factor analytic approach seems to have gained acceptance in psychometric practice, possibly because it can be easily integrated into more comprehensive structural equation models and because a generally accepted standard procedure exists.
In the case of complete measurement invariance, the conditional probability distribution of the observed scores X given a latent variable ξ does not depend on the group membership G (Millsap, 2011, p. 52):
$$P(X \mid \xi, G) = P(X \mid \xi) \tag{1}$$
However, this definition is both overly restrictive and difficult to evaluate empirically. In practice, more specific aspects of the score distribution are tested. Typically, a series of increasingly restrictive nested models are tested, two of which are of particular interest:
Scalar invariance stipulates the equality of the expected values of the item scores, conditional on the common factor, across all groups:
$$E(X \mid \xi, G) = E(X \mid \xi) \tag{2}$$
This implies the equality of both the factor loadings and the intercepts between the groups.
Strict invariance additionally requires equality of the residual variances across groups. The prevailing view seems to be that scalar invariance is sufficient in practice (Leitgöb et al., 2023). This is because two persons with the same standing on the measured trait have the same expected test score regardless of group membership (Millsap, 2011, p. 50). This also enables meaningful comparisons of mean test scores across groups: if the factor means are equal across groups, the mean values of the observed scores are equal as well. Nevertheless, some authors have advocated strict invariance as the model that should be achieved in practice. For instance, Meredith (1993) noted that differences in residual variances affect the likelihood of admission in a selection context (p. 530) and concluded that strict (rather than scalar) invariance is essential when comparing individuals (p. 542).
Apart from scalar and strict invariance, it is possible to test the hypotheses regarding the distribution of the common factor(s). The equality of means, variances and possibly covariances of the factors is classified under the generic term structural invariance (Byrne et al., 1989). Structural invariance is not considered a problem of measurement and therefore structural non-invariance is interpreted as a substantive result and not as an indicator of a deficient measurement.
As mentioned earlier, the literature on measurement invariance has focused on group-level comparisons, typically on the conditions for comparability of means between groups. On the other hand, it could be argued that the scalar (or strict) invariance perspective may not be satisfactory for a practitioner conducting individual-level diagnostics. This is because in practical situations a psychologist must base their conclusions on the observed scores and does not know the values of the common factor for particular individuals. Consider a selection or diagnostic situation in which we compare two individuals from different groups who have obtained the same test score. From a practitioner’s perspective, it is more relevant and natural to ask whether individuals with the same observed score have the same expected value of the measured trait than to ask whether individuals with the same level of the measured trait have the same expected test score.
This issue has largely been overlooked in the basic psychometric literature on measurement invariance, possibly due to the focus on measurement in the research setting.
In order to formalise the perspective of diagnostics on an individual level, we introduce the concept of “person comparison invariance”: Under this condition, the expected value of the common factor, conditional on the observed test score S, does not depend on group membership:
$$E(\xi \mid S, G) = E(\xi \mid S) \tag{3}$$
If the person comparison invariance holds, a practitioner can conclude that the person with the higher test score is more likely to have the higher standing on the measured trait as well, compared to a person with the lower test score. In this paper, I define the test score S as the sum of the item scores, which is the common approach to scoring for tests based on either classical test theory or factor analysis.
Researchers often appear to treat scalar invariance (Equation 2) and person comparison invariance (Equation 3) as equivalent conditions. For example, Olino (2020) concluded that “when wanting to meaningfully compare levels of functioning across clients of different demographic characteristics or identities or compare across time within client to monitor changes in functioning, measurement invariance is critical evidence to seek” (p. 728).
Another example is the European Federation of Psychologists’ Associations’ (2013) test review model, which largely focuses on diagnostic tests and explicitly states that with scalar invariance “raw scores have the same meanings and can be compared across groups” (p. 54).
This paper aims to scrutinise such conjectures. In particular, it aims to:
Show that standard models of measurement invariance, especially scalar and strict invariance, are inadequate in the context of person-level comparisons of sum scores.
Describe the conditions under which such comparisons are valid.
Illustrate the extent of non-equivalence in person-level comparisons that can be expected in practice.
Notation and Preliminaries
The following notation will be used: xij is the item score for person i on item j, Si is the sum score for person i, ξi is the common factor value for person i, τj is the factor intercept for item j, λj is the common factor loading for item j, sλj is the completely standardised common factor loading for item j, δij is the residual (or the unique factor value, respectively) of person i on item j, θj is the variance of these residuals, κ is the common factor mean, ϕ is the common factor variance, Λ and τ are p×1 vectors of the common factor loadings and factor intercepts, respectively, and Θ is a p×p diagonal matrix of the residual variances. The subscripts R and F denote the reference and focal group (G), respectively. Although only the case with two groups is explicitly treated, the results can be generalised to any number of groups. The parameter values are assumed to be realistic, that is, 1′Λ > 0, 1′Θ1 > 0, and ϕ > 0.
Consider the standard linear one-factor model (e.g., Kaplan, 2000, p. 68):
$$x_{ij} = \tau_j + \lambda_j \xi_i + \delta_{ij} \tag{4}$$
Let us assume that the linear one-factor model holds and that the item scores can be treated as continuous variables (for a discussion on this topic, see e.g., Rhemtulla et al., 2012). Let us further assume that scalar invariance holds, i.e. ΛR = ΛF and τR = τF. Following standard practice, we set the mean value of the common factor in the reference group to zero, κR = 0.
The sum score on the test for person i can be decomposed as follows:
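$$S_i = \sum_{j=1}^{p} x_{ij} = \sum_j \tau_j + \xi_i \sum_j \lambda_j + \sum_j \delta_{ij} \tag{5}$$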
and its expected value can be written as
$$E(S) = \sum_j \tau_j + \kappa \sum_j \lambda_j \tag{6}$$
If the factor mean is set to 0, the mean test score is simply equal to the sum of intercepts. In matrix form, the one factor model can be written as
$$\mathbf{x}_i = \boldsymbol{\tau} + \boldsymbol{\Lambda}\,\xi_i + \boldsymbol{\delta}_i \tag{7}$$
For further developments, it is useful to note that, in the case of a perfectly fitting one-factor model, the variance of the sum score can be decomposed into the explained and the unexplained part:
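$$\operatorname{var}(S) = \phi\,(\mathbf{1}'\boldsymbol{\Lambda})^2 + \mathbf{1}'\boldsymbol{\Theta}\mathbf{1} = \phi\Big(\sum_j \lambda_j\Big)^2 + \sum_j \theta_j \tag{8}$$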
The covariance between the latent trait and the j-th item score is
$$\operatorname{cov}(\xi, x_j) = \lambda_j \phi \tag{9}$$
and the covariance between the sum score and the latent trait is
$$\operatorname{cov}(S, \xi) = \phi \sum_j \lambda_j \tag{10}$$
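Recall that the squared correlation between the common factor and the sum score is equivalent to McDonald’s omega:
$$\omega = \rho^2_{S\xi} = \frac{\phi\,(\mathbf{1}'\boldsymbol{\Lambda})^2}{\phi\,(\mathbf{1}'\boldsymbol{\Lambda})^2 + \mathbf{1}'\boldsymbol{\Theta}\mathbf{1}} \tag{11}$$
Using Equations 8 and 10, ω can also be written as the standardised covariance:
$$\omega = \frac{[\operatorname{cov}(S, \xi)]^2}{\operatorname{var}(S)\,\operatorname{var}(\xi)} = \frac{\phi\big(\sum_j \lambda_j\big)^2}{\phi\big(\sum_j \lambda_j\big)^2 + \sum_j \theta_j} \tag{12}$$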
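The variance decomposition and the coefficient omega can be checked numerically. The following is a minimal sketch in Python, assuming a hypothetical four-item test; the loadings, residual variances, and the function name `sum_score_moments` are illustrative and are not taken from the paper or its supplementary scripts.

```python
def sum_score_moments(loadings, residual_variances, factor_variance=1.0):
    """Return (var(S), cov(S, xi), omega) for a linear one-factor model."""
    sum_lambda = sum(loadings)
    sum_theta = sum(residual_variances)
    var_s = factor_variance * sum_lambda ** 2 + sum_theta  # Equation 8
    cov_s_xi = factor_variance * sum_lambda                # Equation 10
    # Squared correlation between S and xi, i.e. coefficient omega
    omega = cov_s_xi ** 2 / (var_s * factor_variance)
    return var_s, cov_s_xi, omega


# Hypothetical four-item test with a standardised factor (phi = 1)
loadings = [0.7, 0.6, 0.8, 0.5]
residual_variances = [0.51, 0.64, 0.36, 0.75]

var_s, cov_s_xi, omega = sum_score_moments(loadings, residual_variances)

# Share of var(S) explained by the common factor; it should equal omega
explained_share = sum(loadings) ** 2 / var_s

print(f"var(S) = {var_s:.3f}, cov(S, xi) = {cov_s_xi:.3f}, omega = {omega:.3f}")
```

In this sketch, omega computed as a squared correlation coincides with the explained share of the sum-score variance, which is the identity the two expressions for ω rely on.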
Conditional Expectations and Person Comparison Invariance
The expected sum score of an individual, conditional on the common factor, is obtained by taking the expectation of Equation 5:
$$E(S_i \mid \xi_i) = \sum_j \tau_j + \xi_i \sum_j \lambda_j \tag{13}$$
Alternatively, under the standard assumption of a linear relationship between the common factor and its indicators, we can express this expectation by the basic linear regression equation:
$$E(S_i \mid \xi_i) = a' + b'\,\xi_i \tag{14}$$
The regression slope can be expressed as the product of the correlation and the ratio of the standard deviations. Expressing the correlation as in Equation 12, and the sum score standard deviation as in Equation 8, we obtain:
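$$b' = \rho_{S\xi}\,\frac{\sigma_S}{\sigma_\xi} = \frac{\operatorname{cov}(S, \xi)}{\operatorname{var}(\xi)} = \frac{\phi \sum_j \lambda_j}{\phi} = \sum_j \lambda_j \tag{15}$$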
The regression constant can then be expressed using the means and the slope. After expressing E(S) as in Equation 13 and the slope as in Equation 15, it becomes:
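$$a' = E(S) - b'\,E(\xi) = \sum_j \tau_j + \kappa \sum_j \lambda_j - \kappa \sum_j \lambda_j = \sum_j \tau_j \tag{16}$$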
This illustrates the well-known implication of scalar invariance: if the loadings and intercepts are invariant across groups, individuals with the same value of the common factor will have the same expected test score regardless of the group membership. In our case, however, we are interested in the expectation of the common factor given the sum score. Therefore, we need to find the parameters of the equation
$$E(\xi_i \mid S_i) = a + b\,S_i \tag{17}$$
We can first express the slope in the same way as before:
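$$b = \rho_{S\xi}\,\frac{\sigma_\xi}{\sigma_S} = \frac{\operatorname{cov}(S, \xi)}{\operatorname{var}(S)} = \frac{\phi \sum_j \lambda_j}{\phi\big(\sum_j \lambda_j\big)^2 + \sum_j \theta_j} \tag{18}$$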
Since the denominator is the same as in Equation 12, we can express b as
$$b = \frac{\omega}{\sum_j \lambda_j} \tag{19}$$
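Accordingly, ω is equal to bΣλ. Using these results and Equation 6, and noting that E(ξ) = κ, we express the constant as:
$$a = E(\xi) - b\,E(S) = \kappa - \frac{\omega}{\sum_j \lambda_j}\Big(\sum_j \tau_j + \kappa \sum_j \lambda_j\Big) = \kappa(1 - \omega) - \omega\,\frac{\sum_j \tau_j}{\sum_j \lambda_j} \tag{20}$$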
Note that, when using the conventional approach to identification, the part κ(1 – ω) equals 0 in the reference group.
Therefore, the expected common factor value for a person with a sum score Si is:
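$$E(\xi_i \mid S_i) = \kappa(1 - \omega) + \omega\,\frac{S_i - \sum_j \tau_j}{\sum_j \lambda_j} \tag{21}$$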
Or, alternatively, expressing ω as in Equation 12:
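$$E(\xi_i \mid S_i) = \frac{\kappa \sum_j \theta_j + \phi \sum_j \lambda_j \big(S_i - \sum_j \tau_j\big)}{\phi\big(\sum_j \lambda_j\big)^2 + \sum_j \theta_j} \tag{22}$$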
Under scalar invariance, the value of the scaled test score (Si – Στ)/Σλ = C in Equation 21 does not depend on the group membership. Setting the value of κR to zero, the difference between the expected common factor values for two persons who belong to different groups and have the same test score can be expressed as:
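$$\Delta E(\xi_i \mid S) = E(\xi_i \mid S_i, G = F) - E(\xi_i \mid S_i, G = R) = \kappa_F(1 - \omega_F) + (\omega_F - \omega_R)\,C \tag{23}$$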
Equation 23 is the key result of this paper. It implies that scalar invariance is not sufficient for person comparison invariance: if either the common factor means or the coefficients omega differ across the groups, individuals with the same sum score but belonging to different groups will have different expected common factor values. The value of ΔE(ξi|S) can be used as a measure of the non-invariance effect size. A positive value indicates that an individual from the focal group has a higher expected value of the factor than an individual from the reference group with the same test score. The value is a sum of a constant part, which is the difference between factor means, weighted by the “unreliability” (1 – ωF), and a part depending on the observed score value. In particular, C, that is, the scaled deviation of the test score from the mean of the reference population (cf. Equation 6), is weighted by the difference between the coefficients omega in both groups. The effect of non-invariance is therefore more pronounced for more extreme test scores (with respect to the mean of the reference group). On the other hand, it follows from Equation 23 that, for ωR ≠ ωF, the effect is zero when
$$C = \frac{S_i - \sum_j \tau_j}{\sum_j \lambda_j} = \frac{\kappa_F(1 - \omega_F)}{\omega_R - \omega_F} \tag{24}$$
If the mean factor value is equal in both groups (and therefore κF = κR = 0), this point corresponds to the mean test score. Otherwise, it can be either above or below the mean test score, depending on the value and the sign of the difference between the factor means and between the coefficients omega.
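The expected factor value given a sum score, the bias from Equation 23, and the score at which the bias vanishes can be sketched in a few lines of Python. All parameter values below (sum of intercepts, sum of loadings, factor mean, omegas) are hypothetical, and the function names are illustrative rather than those used in the supplementary R scripts.

```python
def scaled_score(sum_score, sum_tau, sum_lambda):
    """C: deviation of the sum score from the sum of intercepts, per unit loading."""
    return (sum_score - sum_tau) / sum_lambda


def expected_factor(sum_score, sum_tau, sum_lambda, omega, kappa=0.0):
    """Expected common factor value given the sum score (Equation 21)."""
    c = scaled_score(sum_score, sum_tau, sum_lambda)
    return kappa * (1.0 - omega) + omega * c


def bias(c, kappa_f, omega_f, omega_r):
    """Focal-minus-reference difference in expected factor values (Equation 23)."""
    return kappa_f * (1.0 - omega_f) + (omega_f - omega_r) * c


def zero_bias_score(sum_tau, sum_lambda, kappa_f, omega_f, omega_r):
    """Sum score at which the bias is zero; defined only for unequal omegas."""
    if omega_f == omega_r:
        raise ValueError("With equal omegas the bias is constant across scores.")
    c_zero = kappa_f * (1.0 - omega_f) / (omega_r - omega_f)
    return sum_tau + sum_lambda * c_zero


# Hypothetical parameters under scalar invariance, with kappa_R fixed to 0
SUM_TAU, SUM_LAMBDA = 20.0, 5.0
KAPPA_F, OMEGA_F, OMEGA_R = 0.5, 0.75, 0.85

score = 15.0
e_ref = expected_factor(score, SUM_TAU, SUM_LAMBDA, OMEGA_R)
e_foc = expected_factor(score, SUM_TAU, SUM_LAMBDA, OMEGA_F, kappa=KAPPA_F)
delta = bias(scaled_score(score, SUM_TAU, SUM_LAMBDA), KAPPA_F, OMEGA_F, OMEGA_R)
s_zero = zero_bias_score(SUM_TAU, SUM_LAMBDA, KAPPA_F, OMEGA_F, OMEGA_R)

print(f"E(xi | S = {score}): reference {e_ref:.3f}, focal {e_foc:.3f}")
print(f"bias = {delta:.3f}, zero-bias score = {s_zero:.2f}")
```

The difference between the two group-specific expectations reproduces the bias function directly, and at the zero-bias score both groups yield the same expected factor value.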
Special Cases
Let us now consider a special case of structural invariance, where the common factor mean and variance are equal across groups, κR = κF and ϕR = ϕF. In this case, we are free to set their values to the standardised metric (κ = 0 and ϕ = 1), and the expectation of the common factor value (see Equation 22) then simplifies to:
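$$E(\xi_i \mid S_i) = \frac{\sum_j \lambda_j}{\big(\sum_j \lambda_j\big)^2 + \sum_j \theta_j}\Big(S_i - \sum_j \tau_j\Big) = W\Big(S_i - \sum_j \tau_j\Big) \tag{25}$$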
The expected common factor value equals the weighted deviation of the sum score from the mean sum score (note that the mean score is now equal in both groups). The weight W is invariant across groups (WR = WF) if and only if, in addition to scalar invariance, the coefficient omega is invariant. In this particular case, it also follows that the sums of residual variances are equal across groups, because both the factor loadings and factor variances are invariant. We shall denote this condition as “omega invariance”. With invariant factor means and variances, the strict invariance is sufficient but not necessary for the omega invariance, because it is only the sum of the residual variances that matters.
It is also evident from Equation 25 that the value of the weight W is inversely related to the sum of the residual variances. Therefore, among two individuals with the same test score, the person from the population with the lower sum of residual variances (or, equivalently, higher value of coefficient omega) has the larger absolute expected value of the common factor. In other words, the interpretation of the test scores that does not take the group membership into account will be biased against the group in which the measurement is more reliable.
Let us now consider the case where the common factor means are equal across groups (κF = κR), but the common factor variances differ (ϕR ≠ ϕF). After simplifying Equation 22, the expected value of the common factor can now be expressed as:
$$E(\xi_i \mid S_i) = \frac{\sum_j \lambda_j}{\big(\sum_j \lambda_j\big)^2 + \sum_j \theta_j / \phi}\Big(S_i - \sum_j \tau_j\Big) = W\Big(S_i - \sum_j \tau_j\Big) \tag{26}$$
In this case, the person comparison invariance holds if and only if the ratio Σθ/ϕ is invariant across groups, in addition to scalar invariance. Note that also in this case the weight W is invariant across groups if ωR = ωF. The omega invariance is therefore the necessary and sufficient condition for the person comparison invariance in this case as well. In case of non-invariant omegas, a person belonging to the group with the smaller value of Σθ/ϕ will have a larger absolute expected value of the common factor, compared to a person with the same test score belonging to the group with the larger value of the residual-to-factor variance ratio.
Finally, the across-group comparisons of individuals’ scores are not invariant if the means differ across groups: even if omega invariance holds, the expected values of the common factor for individuals with the same sum score will still differ by a constant value of κF(1 – ωF), see Equation 23. In practice it is important that the effect size of the non-invariance is small if either the mean difference κF – κR (which in fact equals κF) is small, or the measurement reliability as reflected in coefficient omega is high, because in both cases the value of κF(1 – ωF) will be close to zero.
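The role of the residual-to-factor variance ratio in the special cases with equal factor means can be illustrated with a short Python sketch. The weight is written here as Σλ / ((Σλ)² + Σθ/ϕ), which follows from the slope of the regression of the factor on the sum score; the numerical values and the function name `weight` are hypothetical.

```python
def weight(sum_lambda, sum_theta, phi=1.0):
    """Weight W applied to the deviation of the sum score from its mean.

    Depends on the residual variances and the factor variance only
    through the ratio sum_theta / phi.
    """
    return sum_lambda / (sum_lambda ** 2 + sum_theta / phi)


SUM_LAMBDA = 5.0  # invariant under scalar invariance (hypothetical value)

# Reference group: sum of residual variances 5, factor variance 1
w_ref = weight(SUM_LAMBDA, sum_theta=5.0, phi=1.0)

# Focal group A: residual and factor variances both doubled, same ratio
w_foc_a = weight(SUM_LAMBDA, sum_theta=10.0, phi=2.0)

# Focal group B: only the residual variances doubled, larger ratio
w_foc_b = weight(SUM_LAMBDA, sum_theta=10.0, phi=1.0)

# Omega implied by each weight (omega = W * sum of loadings)
omegas = [w * SUM_LAMBDA for w in (w_ref, w_foc_a, w_foc_b)]

print(f"W: reference {w_ref:.4f}, focal A {w_foc_a:.4f}, focal B {w_foc_b:.4f}")
print("omega:", [round(o, 4) for o in omegas])
```

Group A shares the reference group's weight despite different residual variances, whereas group B, with the larger ratio, receives a smaller weight and therefore a smaller absolute expected factor value for the same sum score.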
Relations to Predictive Invariance and Classical Test Theory
Person comparison invariance can also be regarded as a special case of predictive invariance (Millsap, 1995, 1997, 2007), in which the common factor is treated as the dependent variable predicted by the test score. In Appendix B of the Supplementary Material (see Sočan, 2026) it is shown that Equation 18 and Equation 20 can also be derived from the more general results presented by Millsap (1997, 2007). Technically speaking, person comparison invariance is therefore a special case of predictive invariance. Millsap’s (1995) duality theorem states that either measurement or predictive invariance can hold empirically, but they hold simultaneously only under constrained conditions. This implies that person comparison invariance can hold even if scalar invariance does not. However, unlike the general case of predictive invariance, in which an external variable is predicted, it is much less likely for person comparison invariance to hold without scalar invariance, because the regression parameters in Equation 17 are functions of the factor parameters. As shown in Appendix B of Sočan (2026), person comparison invariance without scalar invariance requires invariant factor intercepts, and a specific dependence pattern among factor-model parameters. In particular, the ratios of error-to-factor variance in both groups need to be related through a linear transformation with coefficients depending on the factor loadings in both groups:
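$$\frac{\sum_j \theta_{jF}}{\phi_F} = \frac{\mathbf{1}'\boldsymbol{\Lambda}_F}{\mathbf{1}'\boldsymbol{\Lambda}_R}\cdot\frac{\sum_j \theta_{jR}}{\phi_R} + \mathbf{1}'\boldsymbol{\Lambda}_F\big(\mathbf{1}'\boldsymbol{\Lambda}_R - \mathbf{1}'\boldsymbol{\Lambda}_F\big) \tag{27}$$
It is unlikely that this relationship would hold in practice; therefore, the combination of perfect person comparison invariance and violated scalar invariance is unlikely to be observed in real data. Note also that under scalar invariance, where ΛR = ΛF, person comparison invariance requires the equality of error-to-factor variance ratios across groups, as also follows from Equation 18 (see also Equation B7 in Appendix B of Sočan, 2026).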
From the classical test theory viewpoint, the common factor is, under certain conditions, a linear transformation of the true score (Jöreskog, 1971). The difference between the expected true scores (T) for two individuals belonging to different groups and having the same sum score Si can be expressed using the well-known Kelley’s formula (Lord & Novick, 1968, p. 65):
$$\Delta E(T \mid S_i) = (1 - \rho_F)(\mu_F - \mu_R) + (\rho_F - \rho_R)(S_i - \mu_R) \tag{28}$$
where ρ and μ denote the reliability coefficient and the mean observed score in the respective group. The similarity with Equation 23 should not be surprising due to the syntactic similarity between factor analysis and classical test theory. However, the generality of the classical test theory approach is limited because the true score is defined with respect to the observed scores on a particular test (namely, as its expected value for an individual), so its range and scale are determined by the number of items and the response scale; on the other hand, the existence of a common factor is, at least in principle, independent of its indicators.
Testing the Hypothesis of Person Comparison Invariance
After scalar invariance has been established, person comparison invariance can be tested by separately testing the equality of both factor means and coefficients omega across groups. The first hypothesis can be tested using the Wald test that is routinely reported by SEM software, or by constraining the factor means across groups and then testing the difference in the model fit. Omega invariance, on the other hand, can be tested by defining the omegas as new model parameters in the scalar invariance model and then either constraining them to equality or (in the case of two groups) computing a bootstrap confidence interval for the difference between the omegas. The advantage of the first approach is the possibility of combining both hypotheses: since the usual statistical practice is to constrain the slopes first and the intercepts second, I propose to test the model with equal omegas against the scalar invariance model in the first step, and to test the model with equal latent variable means against the omega invariance model in the second step. In addition to difference tests, I recommend using the plots described below to assess the size of the bias; (differences between) fit indices do not seem to be very useful in this respect. R scripts for testing the person comparison invariance and visualising the effects of non-invariance are available as Supplementary Materials (files PCI_LRtest.R, PCI_boot_omega.R, and PCI_graphs.R, see Sočan, 2026).
Illustration on Real Data
I present a re-analysis of the data collected by Kavčič et al. (2023). Among other things, they investigated the measurement invariance of the Connor-Davidson Resilience Scale (CD-RISC) across educational groups (higher vs. lower education). The scale consists of 10 items rated on a 5-point scale; the range of sum scores is from 0 to 40. The confirmatory factor analysis with the MLMV estimator showed a good fit to the scalar invariance model (robust RMSEA = .046, SRMR = .056, robust CFI = .962). The higher education group had a higher mean resilience (standardised factor mean of .342) and a lower coefficient omega (ωHI = .788, ωLO = .840) than the lower education group. Figure 1 shows the effect of these differences on the expected standardised factor values. The solid and dashed lines represent the expected common factor values for individuals with a given sum score belonging to different education groups. The shaded area shows the difference between these expected values, i.e., the bias in person comparisons. The lines intersect at a sum score of 32.8 (this can also be obtained from Equation 24), about one standard deviation above the mean test score in the whole sample. At this point, person comparisons would be invariant. In the range of sum scores below this point, comparisons based on sum scores are biased against individuals with higher education, and vice versa.
Figure 1
Expected Common Factor Values for Two Educational Level Groups
In Figure 2, the bias values (i.e., the differences between the expected values) are plotted. The shaded area shows where the absolute bias values are smaller than 0.2 SD. By analogy with the commonly accepted interpretation of Cohen’s d (see also Nye et al., 2019), this is tentatively considered a negligible bias. Using this criterion, we conclude that comparisons of sum scores across groups are notably biased only among participants with the lowest scores (12 or lower). The positive bias in this case means that an individual from the higher education group has a higher expected factor value than an individual from the lower education group with the same sum score.
Figure 2
Bias in Person Comparisons Across the Two Educational Level Groups
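The reported estimates can be plugged into the bias expression from Equation 23 to locate the zero-bias point and the range of negligible bias on the scaled-score metric C. The Python sketch below treats the higher education group as the focal group, which is an assumption inferred from the reported direction of the bias. Converting C back to raw sum scores would additionally require the sums of the estimated intercepts and loadings, which are not reported in this section, so the results are left on the C metric.

```python
KAPPA_F = 0.342   # standardised factor mean, higher education (assumed focal)
OMEGA_F = 0.788   # omega, higher education group
OMEGA_R = 0.840   # omega, lower education group (assumed reference)
THRESHOLD = 0.2   # tentative cut-off for negligible bias, in factor SD units


def bias(c, kappa_f=KAPPA_F, omega_f=OMEGA_F, omega_r=OMEGA_R):
    """Focal-minus-reference difference in expected factor values."""
    return kappa_f * (1.0 - omega_f) + (omega_f - omega_r) * c


def negligible_range(kappa_f, omega_f, omega_r, threshold):
    """Interval of C in which the absolute bias stays below the threshold."""
    constant = kappa_f * (1.0 - omega_f)
    slope = omega_f - omega_r
    if slope == 0:
        raise ValueError("Equal omegas: the bias does not depend on the score.")
    bounds = [(limit - constant) / slope for limit in (threshold, -threshold)]
    return min(bounds), max(bounds)


# Scaled score at which the two regression lines intersect
c_zero = KAPPA_F * (1.0 - OMEGA_F) / (OMEGA_R - OMEGA_F)
c_low, c_high = negligible_range(KAPPA_F, OMEGA_F, OMEGA_R, THRESHOLD)

print(f"zero bias at C = {c_zero:.2f}")
print(f"|bias| < {THRESHOLD} for {c_low:.2f} < C < {c_high:.2f}")
```

Under these assumptions the bias is positive below the intersection point and exceeds the 0.2 criterion only far below the reference-group mean, which agrees qualitatively with the pattern described for Figures 1 and 2.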
What Size of Bias Should We Expect in Practice?
Figure 3 illustrates the magnitude of bias (y-axis) across standardised test scores on the reference group metric (-2 ≤ z ≤ 2; the x-axis) under scalar invariance and different parameter combinations:
Coefficient omega in the reference group (ωR = .7, .8, or .9).
Factor mean in the focal group (κF = 0, 0.2, or 0.5, following common guidelines for Cohen’s d); the reference-group factor mean was set to zero.
The difference between coefficients omega in the focal and reference group, respectively (Δω: = ωF – ωR = 0, -.05, or -.10).
Figure 3
Bias in Relation to the Sum Score for Different Values of κF, ωR, and ωF –ωR.
The panels correspond to the ωR×κF combinations, and, within each panel, regression lines correspond to different values of Δω. The shaded region again marks absolute bias values smaller than 0.2 SD. In all situations shown, κF ≥ κR = 0, and ωF ≤ ωR. If the values were reversed, the basic shape would remain the same, with the lines mirrored. Figures C1–C3, showing the remaining combinations, appear in the Supplementary Materials (Appendix C); see Sočan (2026). The R script to reproduce and modify the plots is also provided there (file Figure_3.R).
In accordance with Equation 23, a positive value means that a member of the focal group has a higher expected factor value than a member of the reference group who has an equal observed test score. A positive value therefore indicates bias against the focal group.
When omega invariance holds (Δω = 0), the bias lines are horizontal because the bias equals κF(1 – ωF) (see Equation 23), being constant across all score levels. In our range of conditions, the bias never exceeds the threshold, but it would do so with sufficiently low omega values, or sufficiently large latent-mean differences.
When omega coefficients differ across groups, bias also depends on the test score. When |Δω| ≤ .05 and κF ≤ 0.2, the bias is consistently negligible. When κF = 0.5, bias is notable in the lower score range, especially when ω is lower or Δω is larger. The bias remains negligible for standardised scores that are either close to κF, or somewhat higher.
Two general rules can be inferred from the figures:
When κF > 0 (as in Figures 3 and C1) the bias tends to be positive (i.e., against the focal group), and vice versa. This is also evident from Equation 23.
When κF(ωF – ωR) < 0 (as in Figures 3 and C3) the bias is larger in the lower score range, and when κF(ωF – ωR) > 0 (as in Figures C1 and C2) the bias is larger in the higher score range. The difference in factor means pushes all lines upwards (when κF > 0) or downwards (when κF < 0) and thus amplifies the bias due to the difference in omegas in either the upper or the lower score range. Note that, strictly speaking, in a sufficiently extreme range, both high and low scores are biased whenever ωF ≠ ωR.
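The parameter grid behind these observations can be explored with a small Python sketch. It assumes that the common factor is standardised in the reference group (ϕR = 1, κR = 0), so that a standardised test score z on the reference metric corresponds to the scaled score C = z/√ωR; this mapping and the function names are assumptions of the sketch and are not taken from Figure_3.R.

```python
import math

OMEGA_R = (0.7, 0.8, 0.9)
KAPPA_F = (0.0, 0.2, 0.5)
DELTA_OMEGA = (0.0, -0.05, -0.10)
THRESHOLD = 0.2


def bias_at_z(z, kappa_f, omega_r, delta_omega):
    """Bias (Equation 23) at a standardised test score z, reference metric."""
    omega_f = omega_r + delta_omega
    c = z / math.sqrt(omega_r)  # assumes phi_R = 1 and kappa_R = 0
    return kappa_f * (1.0 - omega_f) + delta_omega * c


def max_abs_bias(kappa_f, omega_r, delta_omega, z_limit=2.0):
    """Largest absolute bias for -z_limit <= z <= z_limit.

    The bias is linear in z, so the maximum is reached at an endpoint.
    """
    return max(abs(bias_at_z(z, kappa_f, omega_r, delta_omega))
               for z in (-z_limit, z_limit))


max_bias = {
    (w, k, d): max_abs_bias(k, w, d)
    for w in OMEGA_R for k in KAPPA_F for d in DELTA_OMEGA
}

for (w, k, d), value in sorted(max_bias.items()):
    label = "notable" if value > THRESHOLD else "negligible"
    print(f"omega_R={w:.1f} kappa_F={k:.1f} d_omega={d:+.2f} "
          f"max|bias|={value:.3f} ({label})")
```

Within this grid, the conditions with equal omegas and those with |Δω| ≤ .05 and κF ≤ 0.2 stay below the 0.2 criterion, while κF = 0.5 combined with unequal omegas exceeds it in the lower score range.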
Discussion
The aim of this paper is to offer a new perspective on measurement invariance, focusing on comparisons between individuals rather than groups. The main conclusion is that the standard models of measurement invariance are not an optimal framework for assessing the comparability of individual scores: scalar invariance is neither necessary nor sufficient for person comparison invariance. Assuming a weakened form of scalar invariance (i.e., the sums of factor loadings and the sums of intercepts being invariant across groups), individuals with equal sum scores have equal expected factor values when the factor means are equal and omega coefficients are invariant across groups. In the case of invariant omega coefficients and non-invariant factor means, the effect of person comparison non-invariance is constant across the range of scores and depends on both the magnitude of the mean difference and the size of the omega coefficient. If the omega coefficients are not invariant, differences between their values also affect the size of the non-invariance effect. In this case, the non-invariance effect varies across the score range: it is generally higher at the extremes, and there is a point where the effect is zero (i.e., where the individual scores are comparable).
These results accord with several previous findings. Shealy and Stout (1993) noted that conditioning on observed scores distorts the estimation of bias in a studied subtest when true score means differ across groups. It is also well known that latent-trait estimates may be biased, particularly at the extremes (Feuerstahler, 2018). Finally, Meredith’s (1993) warning concerning the effect of non-invariant residual variances on person comparability was confirmed.
As the effect-size measure of bias, I propose the difference between the expected values of the measured trait (possibly standardised) for two individuals who share the same sum score but belong to different groups (see Equation 23). The calculation is easy to perform, because it requires only parameters routinely reported by SEM software (factor means, loadings, and intercepts) and the omega coefficient, which has become a standard psychometric quality indicator for tests based on factor analysis.
Two limitations of the proposed approach should be noted. First, it relies on the standard factor analysis model with homoscedastic errors, assuming linear relationships between the common factor and the item scores, and treating the item scores as numeric variables. While this reflects a common practice in applied psychometrics, it can be somewhat inaccurate, especially for items with a smaller number of rating categories. Second, I assume that the residuals in the regression equations have a mean value of zero. Although the expected value of random measurement error is zero by definition, this may not be true for the so-called specific factors (see Millsap, 2011, p. 77). However, this limitation also applies to the standard testing of scalar invariance.
Mathematically speaking, person comparison invariance is a special case of Millsap’s (1995, 1997, 2007) predictive invariance. Therefore, it could hold even without scalar invariance, but this would require restrictions unlikely to hold in practice. The added value of this paper relative to the existing literature on predictive invariance is the treatment of the special case where the underlying latent factor is “predicted”, and the derivation of an explicit expression for bias and elucidation of its determinants and their interplay in this special case. I also provide estimated bias values for various likely combinations of parameters.
From the practical viewpoint, person comparison invariance can be conceptualised as an extension of the standard measurement invariance testing in situations where specific persons are to be compared. Even for such tests, scalar invariance should be examined as an initial step in order to identify items that function differently across subpopulations. Person comparison invariance is more difficult to achieve in practice than scalar invariance because a larger set of parameters is constrained and because it partially depends on factors beyond the researcher’s control, such as the means and variances of the common factor. At first glance, this may appear discouraging to test users. Fortunately, as the presented illustrations show, the size of the bias may be small or even negligible in practice, especially if the measurement reliability is high. Indeed, the bias would vanish if coefficients omega equalled 1, regardless of other parameter values. In contrast to some liberal recommendations on reliability that can be found in the literature, these findings underscore the importance of striving for a high omega coefficient in test construction. On the other hand, they point to inherent limitations of the ubiquitously used sum scores as indicators of the measured latent variables.
This is an open access article distributed under the terms of the