When research results are sensitive to the choice of statistical model, they become dependent on researcher discretion, and bias can be introduced (Gelman & Loken, 2013; King & Nielsen, 2019; Simmons et al., 2011; Wicherts et al., 2016). Researcher discretion is a particular challenge in evaluation research on outcomes derived from psychometric instruments, such as educational tests, psychological surveys, and patient-reported outcomes, because of the many approaches to scoring the outcome measures and accounting for measurement error in the results, such as Classical Test Theory (CTT), Item Response Theory (IRT), Factor Analysis (FA), or Latent Variable Models (LVMs). Researcher-designed outcome measures in particular demand many decision points in the analysis process, raising the question of how sensitive results are to model selection and outcome scoring decisions, especially in causal studies investigating intervention impacts on outcomes that aim to provide policy-relevant findings. While reviews in several fields such as medicine, education, and organizational research show a relative lack of attention to issues of measurement in general and call for better measurement practices (Brakenhoff et al., 2018; Cox & Kelcey, 2019; Flake et al., 2017; Flake & Fried, 2020; Pedersen et al., 2025; Spybrook et al., 2016), the implications of measurement principles for causal inference, policy, and program evaluation are less prominent (Shear & Briggs, 2024; Soland, Kuhfeld, & Edwards, 2024; Soland, Edwards, & Talbert, 2024).
For a given causal research question, alternative statistical methods may provide defensible options for analysis, and varying results are expected. For instance, when modeling a binary outcome, logistic regression and the linear probability model may produce different results due to the contrasting assumptions of each model (Timoneda, 2021). Similarly, in the context of multisite randomized trials or meta-analyses, fixed effects and random effects estimators will produce different estimates of treatment effects due to the different estimands targeted by each model (Chan & Hedges, 2022; Miratrix et al., 2021; Skrondal & Rabe-Hesketh, 2004). While such differences in “estimates, estimators, and estimands” (Miratrix et al., 2021) are well understood in causal inference generally, the use of psychometric measures as outcome variables demands additional consideration because observed scores are typically not of interest in themselves but rather as proxies for unobserved latent variables such as academic achievement. Thus, researchers are faced with navigating a range of options for causal analysis of psychometric outcome data and the challenge of interpreting differing results from models that theoretically target the same treatment effect on the latent trait. Furthermore, it is unclear whether some approaches are consistently superior to others or if the tradeoffs of model selection depend on the circumstances (Gilbert, 2024a; Hontangas et al., 2015).
As an example, consider the options for scoring an educational test to estimate a treatment effect on the latent trait of academic achievement imperfectly represented by the observed test score. Both CTT sum scores and IRT- or FA-based scores use item responses to estimate a latent trait score for each student, which is then used in subsequent analysis. IRT- or FA-based methods such as the two-parameter logistic (2PL) model or the congeneric factor model theoretically provide more fine-grained distinctions among students by weighting item responses based on the information (i.e., item discrimination or factor loading) provided by the item, in contrast to sum scores, which treat different sets of correct answers as identical (Camilli, 2018; Hambleton & Van der Linden, 1982; Lord, 1980; Lord & Novick, 1968; Thissen & Wainer, 2001). Alternatively, LVM techniques, such as Structural Equation Modeling (SEM; Kline, 2023; Muthén, 2002) or Explanatory Item Response Modeling (EIRM; Briggs, 2008; De Boeck, 2004; De Boeck & Wilson, 2016; Gilbert, 2024a; Wilson et al., 2008) estimate the measurement and regression models in a single step. Because all test scoring methods and LVMs target the same treatment effect on the latent trait, a key question is the extent to which theoretical differences between these models matter in causal analysis of test score outcome data. Correlations between IRT- and FA-based scores and CTT scores are typically above 0.90 (Lu et al., 2005; Soland, Kuhfeld, & Edwards, 2024), which raises the question of whether the theoretical benefits of IRT- or FA-based scoring methods or LVMs are worth the added complexity, computational power, and interpretational challenges they may pose. Furthermore, no clear guidelines exist on which model researchers should prefer, particularly when the results conflict.
To illustrate the challenge facing the applied researcher, consider two recent publications on the implications of using sum scores versus factor scores in statistical models. On one side, McNeish and Wolf (2020) argue that sum scores can have “adverse effects on validity, reliability, and qualitative classification” compared to FA-based scores because sum scores implicitly assume that each item contributes equally to the estimation of the latent trait, an assumption that is unlikely to be met in many empirical applications. In contrast, Widaman and Revelle (2023) argue that so long as the scale is unidimensional, sum scores “often have a solid psychometric basis and therefore are frequently quite adequate for psychological research”. Such competing claims, expanded in further publications (McNeish, 2022, 2023, 2024; Sijtsma et al., 2024), provide a challenge for the applied researcher working with outcome data derived from psychometric measures.
The purpose of this study is to provide both a concise and accessible review of the conceptual issues at play and practical guidance for evaluation researchers. We explore the consequences of outcome measurement modeling decisions for causal inference and determine which decision points in measurement modeling are most salient for analytic results. Results show that the issue of attenuation bias dominates the issue of scoring weights, and simpler models can perform better even under extreme circumstances. In other words, our results suggest that accounting for measurement error in the outcome variable is a first-order concern in causal inference, in contrast to second-order issues of measurement “model error”, in which an incorrect measurement model is applied to generate scores for the outcome variable (Liu & Pek, 2024), such as using equally-weighted scores when other approaches are a better fit to the data. Our results align with studies showing that the marginal gains to more complex statistical models can be low and may not justify their increased complexity (e.g., Domingue, Kanopka, Kapoor, et al., 2024, in IRT; Castellano & Ho, 2015, in value-added modeling; Widaman & Revelle, 2023, in psychological measurement), and serve as a contrast with other work emphasizing the sensitivity of analytic results to measurement modeling choices in the analysis of psychometric data (McNeish & Wolf, 2020; Soland, Kuhfeld, & Edwards, 2024).
Classical Approaches to Measurement Error in Evaluation Research
Measurement error is a widely studied phenomenon, with work on the reliability of educational and psychological tests going back many decades (Asher, 1974; Bollen, 1989; Borsboom, 2005; Briggs, 2021; Cronbach, 1951; Lord & Novick, 1968), and has well-known consequences in statistical analysis (Fuller & Hidiroglou, 1978; Hutcheon et al., 2010; Liu, 1988). In the case of simple linear regression with two variables, error in independent (X, predictor) variables serves to attenuate regression coefficients toward 0, whereas error in dependent (Y, outcome) variables will not bias estimated regression coefficients, but will decrease precision and reduce statistical power by increasing residual variance, though these general rules of thumb do not always hold in more complex circumstances (Kline, 2023).
Measurement error can be addressed with both classical and modern methods. For example, Errors-in-Variables (EIV) regression models (Carroll et al., 2009; Gillard, 2010) use estimates of reliability to adjust the coefficients of predictor variables, and LVMs (Muthén, 2002) adjust for measurement error by simultaneously estimating the latent variable(s) and the regression model. While both EIV and LVM methods can correct for measurement error, some studies have shown that the LVM approach can provide more robust estimates of uncertainty than EIV methods (Gilbert, 2024a; Lockwood & McCaffrey, 2014).
Measurement error in the dependent variable is sometimes ignored because it does not bias coefficients (Cox & Kelcey, 2019), but LVMs can also be applied to outcome variables and can provide modest benefits to statistical power and more robust estimates of uncertainty than alternative approaches (Christensen, 2006; Rabbitt, 2018; Zwinderman, 1991), though benefits are context dependent (Gilbert, 2024a). However, coefficients are downwardly biased when a dependent variable measured with error is standardized. Attenuation due to standardization is a particular issue in evaluation research because most test scores, psychological surveys, and patient-reported outcomes have no natural scale, and standardization allows for estimates of treatment effect size that can in principle be compared across studies or pooled in meta-analyses (Borenstein et al., 2009) and are often argued to be more interpretable than unstandardized coefficients (Schielzeth, 2010).
Standardization of the dependent variable Y attenuates regression coefficients because measurement error causes overdispersion in the standard deviation of Y, $\sigma_Y$. That is, $\sigma_Y$ will be greater than the SD of the true latent trait scores $\sigma_\theta$ because $\sigma_Y^2$ contains the variation of $\theta$ plus measurement error $E$, as summarized in the CTT variance decomposition $\sigma_Y^2 = \sigma_\theta^2 + \sigma_E^2$ (Brennan, 2010; DeVellis, 2006; Hambleton & Jones, 1993; Jackson, 1973; Lewis, 2006; Traub, 1997). We can precisely estimate the overdispersion of $\sigma_Y$ with the CTT reliability formula, which defines reliability $\rho$ as the ratio of true score variance ($\sigma_\theta^2$) to observed score variance ($\sigma_Y^2$): $\rho = \sigma_\theta^2 / \sigma_Y^2$. Solving for $\sigma_Y$ shows that $\sigma_Y = \sigma_\theta / \sqrt{\rho}$. Therefore, when we standardize an outcome variable such as a test score by dividing by its SD $\sigma_Y$, this value is too large by a factor of $1/\sqrt{\rho}$. Consequently, when measurement error in the outcome is present, standardized regression coefficients will be driven downward, and this bias can be corrected by dividing by $\sqrt{\rho}$. Applying this EIV correction deattenuates the standardized regression coefficient to what it would be if the test were perfectly reliable or of infinite length.
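The attenuation argument above can be illustrated numerically. The following is a minimal sketch, not the simulation reported later: the sample size, the 0.4 SD effect, the reliability of 0.6, and the helper name `observed_scores` are all assumptions chosen for illustration, and reliability here refers to the within-group ratio of latent to observed variance.

```python
import math
import random
import statistics

random.seed(1)

n_per_group = 20000   # ASSUMED: large n so sampling noise is small
true_effect = 0.4     # ASSUMED: effect in latent (within-group) SD units
reliability = 0.6     # ASSUMED: rho = var(theta) / var(Y) within groups

# With var(theta) = 1, rho = 1 / (1 + var_E) implies var_E = 1 / rho - 1.
error_sd = math.sqrt(1 / reliability - 1)

def observed_scores(shift):
    """Hypothetical helper: latent theta ~ N(shift, 1) plus measurement error."""
    return [random.gauss(shift, 1) + random.gauss(0, error_sd)
            for _ in range(n_per_group)]

control = observed_scores(0.0)
treated = observed_scores(true_effect)

# Standardize the mean difference by the pooled SD of the observed scores.
pooled_sd = math.sqrt(
    (statistics.variance(control) + statistics.variance(treated)) / 2
)
d_observed = (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

# EIV correction: divide the standardized coefficient by sqrt(rho).
d_corrected = d_observed / math.sqrt(reliability)

print("attenuated:", round(d_observed, 3))
print("corrected: ", round(d_corrected, 3))
print("expected attenuated value:", round(true_effect * math.sqrt(reliability), 3))
```

Under these assumptions the uncorrected estimate should land near the true effect multiplied by $\sqrt{\rho}$, and the corrected estimate near the true effect itself, up to sampling noise.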
Attenuation due to standardization is not a new insight (Cole & Preacher, 2014; Hedges, 1981; Shear & Briggs, 2024), but it is nonetheless commonly ignored, or reserved for technical discussions (Borenstein et al., 2009) and comparatively less emphasized in practical guides for researchers. For example, in its section on reliability, the Institute of Education Sciences’ (IES) What Works Clearinghouse Standards Handbook lists minimum thresholds for various reliability metrics (in both Version 4.1 and Version 5.0), but makes no mention of attenuation bias, in contrast to its detailed explanation of the bias that arises from other sources, such as non-random attrition or baseline non-equivalence.¹ Crucially, attenuation bias is not solved by IRT or FA scoring procedures, because the resulting scores still contain measurement error. The problem can be further compounded by expected a posteriori (EAP) scoring methods because shrinkage of the empirical Bayes estimation draws the distribution of estimated latent trait scores to the overall mean across treatment and control groups rather than the respective means of each group (Briggs, 2008; Soland, 2022). This problem is less severe but still present when using maximum likelihood (ML) scoring (Soland, Kuhfeld, & Edwards, 2024, p. 11), though ML scoring raises other issues such as undefined scores for respondents with “perfect” scores (i.e., all items answered correctly or incorrectly on an educational test). The solution is to apply an EIV correction by dividing the coefficients by $\sqrt{\rho}$, where $\rho$ can be estimated as the internal consistency of the test (e.g., Cronbach’s $\alpha$ or $\omega$) (Hedges, 1981; Shear & Briggs, 2024), or to employ an LVM that directly adjusts for measurement error in the estimation procedure, as we will demonstrate.
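As a rough sketch of how the correction might be applied in practice, the code below estimates internal consistency with Cronbach’s alpha (one of the options named above) from complete item-level data and divides a standardized coefficient by its square root. The function names `cronbach_alpha` and `eiv_correct`, the toy response matrix, and the coefficient of 0.31 are hypothetical and serve only as illustration.

```python
import math
import statistics

def cronbach_alpha(rows):
    """Cronbach's alpha for complete data.

    rows: list of per-person lists, one score per item.
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])
    item_vars = [statistics.variance([row[i] for row in rows]) for i in range(k)]
    total_var = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def eiv_correct(std_coef, reliability):
    """Deattenuate a standardized coefficient by dividing by sqrt(reliability)."""
    return std_coef / math.sqrt(reliability)

# Toy dichotomous responses: 6 persons x 4 items (ASSUMED data).
toy = [
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
]

alpha = cronbach_alpha(toy)
print("alpha:", round(alpha, 3))
print("corrected coefficient:", round(eiv_correct(0.31, alpha), 3))
```

Alpha is used here only because it is simple to compute by hand; other reliability estimates could be substituted in `eiv_correct` without changing the logic.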
Methods for Estimating Causal Effects on Psychometric Outcome Data
Estimands and Estimators
Consider outcome $Y_j$ for person $j$ ($j = 1, \dots, J$). Under the potential outcomes framework (Imbens & Rubin, 2015; Rubin, 2005), the individual causal effect of binary treatment $T_j$ on person $j$ is $\tau_j = Y_j(1) - Y_j(0)$, where 1 indicates the treatment counterfactual and 0 indicates the control counterfactual.² Because only one counterfactual is observed, $\tau_j$ is unobservable. The target estimand of causal analyses is therefore typically the average treatment effect (ATE), defined as $\tau = E[Y_j(1) - Y_j(0)]$.
Random assignment of the treatment ensures that treatment status is independent of the potential outcomes. Therefore, we can estimate $\tau$ as a difference in means between the treated and control groups. Practically, we can use a simple linear regression model as our estimator for $\tau$, in which $T_j$ is an indicator variable for the treatment status of person $j$, $\beta_0$ is the mean of the control group, $\beta_1$ is the difference in means between the groups, and $\varepsilon_j$ is the error term (Angrist & Pischke, 2009; Imbens & Rubin, 2015; Murnane & Willett, 2010; Rosenbaum, 2017):
$$Y_j = \beta_0 + \beta_1 T_j + \varepsilon_j \tag{1}$$
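To make the link between the regression and the difference in means concrete, the short sketch below fits a simple OLS line by hand to a toy data set with a binary treatment indicator. The data values and the function name `ols_simple` are invented for illustration.

```python
import statistics

def ols_simple(x, y):
    """Closed-form simple OLS: returns (intercept, slope)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Toy data (ASSUMED): three control and three treated persons.
treatment = [0, 0, 0, 1, 1, 1]
outcome = [1, 2, 3, 3, 4, 8]

intercept, slope = ols_simple(treatment, outcome)

control_mean = statistics.mean(y for t, y in zip(treatment, outcome) if t == 0)
treated_mean = statistics.mean(y for t, y in zip(treatment, outcome) if t == 1)

print("intercept vs control mean:", intercept, control_mean)
print("slope vs difference in means:", slope, treated_mean - control_mean)
```

With a binary predictor, the intercept reproduces the control group mean and the slope reproduces the difference in group means.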
When $Y_j$ is observed, the difference in means approach provided by Equation 1 is standard. However, when $Y_j$ represents an unobserved variable, such as mathematical ability, extroversion, or depression, Equation 1 is no longer estimable (Stoetzer et al., 2024). The two primary approaches to estimating causal effects on latent variables are two-step procedures and simultaneous estimation, to which we now turn.
Two-Step Procedures
In a two-step procedure, the latent trait of interest is first estimated for each person and then analyzed as the outcome variable using a standard statistical model such as OLS regression (Christensen, 2006; Ye, 2016). For example, consider the following regression model, in which $\hat{\theta}_j$ represents an estimated latent trait score for person $j$ and $\beta_1$ represents the average treatment effect (ATE):
$$\hat{\theta}_j = \beta_0 + \beta_1 T_j + \varepsilon_j \tag{2}$$
$$\varepsilon_j \sim N(0, \sigma_\varepsilon^2) \tag{3}$$
$\hat{\theta}_j$ may be generated in a CTT or IRT/FA framework. In CTT, a sum or mean score is used, such that the observed score for person $j$ across $I$ items equals the sum of the responses $\sum_{i=1}^{I} Y_{ij}$ or the mean of the responses $\frac{1}{I}\sum_{i=1}^{I} Y_{ij}$. In IRT or FA, the latent trait estimate, denoted $\hat{\theta}_j$, is calculated by maximizing the likelihood of $\theta_j$ given the observed responses and the estimated item parameters (Bock et al., 1997). Generally, the IRT scoring approach has been argued to be superior to CTT approaches because IRT estimates are on an interval rather than ordinal scale (Ferrando & Chico, 2007; Harwell & Gatti, 2001; Jabrayilov et al., 2016; McNeish & Wolf, 2020).³ Furthermore, scores provided by IRT/FA models weight the contributions of item responses to $\hat{\theta}_j$ by their discrimination parameters or factor loadings, thus maximizing the information in $\hat{\theta}_j$ and increasing the reliability of $\hat{\theta}_j$ (Camilli, 2018; Jessen et al., 2018; McNeish, 2023; McNeish & Wolf, 2020; Rhemtulla & Savalei, 2025), and participants with identical sum scores can have different $\hat{\theta}_j$ based on different patterns of item responses, thus providing theoretically more fine-grained distinctions between the respondents. Empirically, however, differences between CTT and IRT/FA scoring are often found to be minor (Lu et al., 2005; Sébille et al., 2010; Xu & Stone, 2012). One limitation of the two-step approach is that, regardless of what type of scoring procedure is used to estimate the latent trait, the outcome variable is treated as known when it contains error, and the resulting regression coefficients will therefore be biased when the outcome is standardized unless the EIV correction is applied, as we will show.
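The contrast between equal and variable weighting can be seen in a toy example. This is a simplified sketch: it uses a loading-weighted sum as a stand-in for model-based scoring (in the 2PL model, the discrimination-weighted sum is the sufficient statistic for the latent trait), not a full IRT or FA scoring routine. The loadings and response patterns are hypothetical.

```python
# ASSUMED discrimination parameters / loadings for four items.
loadings = [0.9, 0.8, 0.4, 0.3]

# Two respondents with the same number of correct answers on different items.
pattern_a = [1, 1, 0, 0]   # correct on the two most discriminating items
pattern_b = [0, 0, 1, 1]   # correct on the two least discriminating items

def sum_score(responses):
    """CTT-style score: every item counts equally."""
    return sum(responses)

def weighted_score(responses, weights):
    """Loading-weighted sum: items with higher loadings count more."""
    return sum(w * r for w, r in zip(weights, responses))

print("sum scores:     ", sum_score(pattern_a), sum_score(pattern_b))
print("weighted scores:", round(weighted_score(pattern_a, loadings), 2),
      round(weighted_score(pattern_b, loadings), 2))
```

The two patterns are indistinguishable under sum scoring but separated under weighting, which is the sense in which weighted scores are described as more fine-grained.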
Simultaneous Estimation With Latent Variable Models (LVMs)
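As an alternative to two-step procedures, LVMs enable the analyst to estimate measurement (psychometric) and regression (structural) models simultaneously and incorporate the effects of measurement error directly into the estimation procedure, for both predictors and outcomes (Bollen, 1989; Kline, 2023; Muthén, 2002). For example, consider the following LVM for the analysis of a treatment effect on test score data,
$$g(Y_{ij}) = b_i + a_i \theta_j + \varepsilon_{ij} \tag{4}$$
$$\theta_j = \beta_0 + \beta_1 T_j + \zeta_j \tag{5}$$
$$\varepsilon_{ij} \sim N(0, \sigma_{\varepsilon_i}^2) \tag{6}$$
$$\zeta_j \sim N(0, \sigma_\zeta^2) \tag{7}$$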
in which the response $Y_{ij}$ to item $i$ for person $j$ is a function of latent person ability $\theta_j$ and item easiness parameters (item intercepts) $b_i$, weighted by item discrimination parameters (factor loadings) $a_i$ and error term $\varepsilon_{ij}$. $\theta_j$ is in turn a function of the control group mean $\beta_0$, the ATE $\beta_1$, and unexplained or residual variance in person ability $\zeta_j$. Thus, the ATE is estimated directly on the latent trait without the need to compute an outcome score in a separate step.
Because $\theta_j$ is unobserved, constraints are necessary to identify the model and provide a scale for $\theta_j$. Two standard approaches to model identification are the unit-loading approach, where $a_i$ is fixed to 1 for a single item (or all items, as in a Rasch model), or the unit-variance approach, where the total variance (or residual variance) of $\theta_j$ is fixed to 1. The item easiness parameters $b_i$ can be identified by excluding one item in a fixed effects approach (so that $\beta_0$ represents the performance of the average control respondent on this item), by fixing the mean of $b_i$ to 0 in a random effects approach (De Boeck, 2008), or by fixing $\beta_0 = 0$.
These constraints resolve the scale indeterminacy of $\theta_j$ but are nonetheless arbitrary. For example, a single $a_i$ could be fixed to 2 instead of 1, yielding different point estimates and standard errors but providing identical fit to the observed data. Thus, when estimating causal effects on latent outcomes, indexing $\beta_1$ to the pooled standard deviation of $\theta_j$ (i.e., $\sigma_\zeta$ in the case without additional covariates in the model) is an attractive strategy to ground the interpretation of the model results. Accordingly, for the purposes of the present study, we target the following latent ATE in our analyses, following the notation of Stoetzer et al. (2024):
$$\tau_\theta = \frac{\beta_1}{\sigma_\zeta} \tag{8}$$
This approach has the important benefit of providing the same point estimate regardless of the chosen identification constraints, and is analogous to estimating a standardized effect size such as Cohen’s d when the outcome variable is observed (Stoetzer et al., 2024).
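The invariance to identification constraints can be verified with a short derivation. This is a sketch under the assumption of a measurement model that is linear in the latent trait, with loadings $a_i$, structural coefficients $\beta_0$ and $\beta_1$, treatment indicator $T_j$, residual $\zeta_j$ with SD $\sigma_\zeta$, and an arbitrary rescaling constant $c > 0$ introduced here for illustration.

```latex
\begin{aligned}
\theta_j^{*} &= c\,\theta_j
  &&\Rightarrow\quad a_i^{*} = a_i / c, \qquad a_i^{*}\theta_j^{*} = a_i\theta_j \\
\theta_j^{*} &= c\,\beta_0 + c\,\beta_1 T_j + c\,\zeta_j
  &&\Rightarrow\quad \beta_1^{*} = c\,\beta_1, \qquad \sigma_{\zeta}^{*} = c\,\sigma_{\zeta} \\
\frac{\beta_1^{*}}{\sigma_{\zeta}^{*}} &= \frac{c\,\beta_1}{c\,\sigma_{\zeta}}
  = \frac{\beta_1}{\sigma_{\zeta}}
\end{aligned}
```

Fixing a loading to 2 instead of 1 corresponds to $c = 1/2$: the raw coefficient and the residual SD are both halved, the item-level predictions are unchanged, and the ratio of coefficient to residual SD is the same under either constraint.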
$g(\cdot)$ is a link function to allow for both linear and non-linear models. When all $a_i = 1$, and the link function is logistic, i.e., $g(\cdot) = \text{logit}(\cdot)$, the LVM is equivalent to the One Parameter Logistic (1PL) Explanatory Item Response Model (EIRM; De Boeck, 2004; De Boeck & Wilson, 2016; Wilson et al., 2008). Note that the variance of the error term $\varepsilon_{ij}$ in the logistic model is fixed at $\pi^2/3$ for model identification (Breen et al., 2018; Mood, 2010). When the $a_i$ are freely estimated and the link function is an identity, i.e., $g(Y_{ij}) = Y_{ij}$, the LVM is a linear Structural Equation Model (SEM). While LVMs such as the EIRM and SEM can be more complex to interpret than two-step approaches, LVMs estimate associations among latent variables theoretically stripped of measurement error. LVMs therefore deattenuate estimates of standardized regression coefficients because, unlike $\hat{\sigma}_Y$, $\hat{\sigma}_\zeta$ is a consistent estimator of the residual SD of $\theta_j$, thus counteracting the effects of measurement error compared to regression on observed scores (Briggs, 2008; Christensen, 2006; Stoetzer et al., 2024; Zwinderman, 1991), suggesting that LVMs may provide more accurate tests of between-group differences such as causal treatment effects.
Model Assumptions
In addition to the standard causal inference assumptions of the stable unit treatment value assumption (SUTVA) and unconfoundedness of the treatment assignment, causal inference in latent variable contexts requires some additional assumptions. First, Equation 4 assumes full measurement invariance between the treatment and control groups. That is, other than the treatment effect on the latent trait, the items function equivalently between the groups. An example violation of this assumption is “response shift,” whereby treatment causes participants to interpret items differently such that differences in post-intervention scores reflect changes to item functioning rather than changes to the latent variable (Olivera-Aguilar & Rikoon, 2023). Stoetzer et al. (2024, p. 5) describe this assumption as “unconfounded measurement”. More flexible models, such as multi-group estimation that allows for heteroskedasticity (Kim & Yoon, 2011) or multidimensional models that allow for response style effects (Deng et al., 2018), can relax these assumptions but are beyond the scope of the present study. Moreover, such approaches are relatively rare in applied causal inference (Soland, Edwards, & Talbert, 2024; Soland & Gilbert, 2025). We emphasize that while measurement invariance between treatment and control groups can and should be tested, such tests require item-level data and are therefore difficult or impossible to conduct in secondary analyses of sum or factor scores.
Figure 1 provides Directed Acyclic Graphs (DAGs) for the two-step and simultaneous estimation approaches and highlights a closely related assumption that is necessary for causal inference with latent variables. Namely, we must assume that the treatment effect on the individual item responses is fully mediated by $\theta_j$, similar to the exclusion restriction assumption in instrumental variables estimation (Halpin & Gilbert, 2024; Stoetzer et al., 2024; VanderWeele & Vansteelandt, 2022). In other words, a treatment that improves $\theta_j$ is statistically equivalent to one that makes the items easier (Gilbert, Miratrix, et al., 2025; San Martín, 2016).⁴
Summary
In sum, the applied researcher faces many choices in model selection when test score data are used as outcomes in a causal inference context: to use a one-step or two-step approach, to weight or not to weight the item responses in the construction of scores, to use CTT or IRT/FA, and so forth, as summarized in Table 1. While exploratory data analysis can shed light on, for example, whether a 1PL or 2PL IRT model is a better fit to the data, to what extent does allowing for varying item discriminations/loadings in the estimation of the latent trait score affect the bias, precision, and power of causal estimates? Are certain models consistently more robust than alternatives? This study seeks to shed light on these questions and leverage measurement principles for better application of causal inference in evaluation research by using Monte Carlo simulation and an empirical application to examine the performance of several models under varying conditions of test score construction and model estimation.
Figure 1
Directed Acyclic Graphs for Two-Step and Simultaneous Estimation
Note. Squares indicate observed variables, hollow circles indicate latent variables. $Y_1, \dots, Y_I$ are item responses, $T$ is the treatment indicator, and $\beta_1$ represents the average treatment effect. $\sigma_\zeta$ is the residual standard deviation of the latent variable. See also Figure 1 in Stoetzer et al. (2024).
Table 1
Summary of Analysis Options for Estimating Treatment Effects on Latent Outcomes
| Measurement Framework | Item Types | ATE Estimation Strategy | Weighting | Implementation |
|---|---|---|---|---|
| CTT | dichotomous, polytomous, continuous | Two-step | Equal | Generate sum score, run regression |
| IRT | dichotomous, polytomous | Two-step | Equal or Variable | Generate IRT score, run regression |
| FA | continuous | Two-step | Equal or Variable | Generate FA score, run regression |
| CTT | dichotomous, polytomous, continuous | Simultaneous | Equal | LMM on item responsesᵃ |
| IRT | dichotomous, polytomous | Simultaneous | Equal or Variable | EIRM on item responses |
| FA | continuous | Simultaneous | Equal or Variable | SEM on item responses |
Note. CTT = Classical Test Theory, IRT = Item Response Theory, FA = Factor Analysis, LMM = Linear Mixed Model, EIRM = Explanatory Item Response Model, SEM = Structural Equation Model.
ᵃ The linear mixed model applicable to simultaneous CTT estimation is equivalent to the FA model with equal loadings when the item parameters are fixed; see Borsboom (2005) for a discussion of underlying equivalencies between models.
Method
Data Generating Process
The simulation and data analysis procedures are implemented in R. In total, we simulate 18,000 data sets (1,000 data sets per 18 data-generating conditions) and apply four analytic models—sum score, factor score, equal loading SEM, and variable loading SEM—to each, for a total of 72,000 results. We use a full factorial design to assess the performance of each model under a range of treatment effect sizes and items of varying discriminating power. To maintain focus on the contrasts between the models and the effects of item characteristics, we fix the number of subjects at 500 and the number of items at 10 to represent a moderate sample size and moderate test length. The latent trait scores are drawn from and the item intercepts are drawn from . The latent variables are converted to continuous observed scores for each item using Equation 4. The residual SD for each item is defined as so that items with higher loadings have lower residual variances. The simulation factors include null, moderate, and large treatment effect sizes (0, 0.2, or 0.4 SDs on the latent trait, based on empirical distributions of effect sizes in education research, e.g., Kraft, 2020), moderate and high average factor loadings (), and constant, moderately variable, or highly variable factor loadings ( where ).
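The structure of the factorial design and a single simulated data set can be sketched as follows. This is an illustrative outline only and not the authors' R code: the sample size, test length, and effect sizes come from the text, while the loading levels, the distributions for the latent trait and item intercepts, the residual SD formula, and the balanced treatment assignment are all placeholder assumptions flagged in the comments.

```python
import itertools
import math
import random

random.seed(42)

N_PERSONS, N_ITEMS = 500, 10             # fixed in the text
EFFECTS = [0.0, 0.2, 0.4]                # effect sizes from the text
MEAN_LOADINGS = [0.5, 0.8]               # ASSUMED placeholder levels
LOADING_HALF_RANGES = [0.0, 0.1, 0.2]    # ASSUMED placeholder levels

# Full factorial design: 3 x 2 x 3 = 18 data-generating conditions.
conditions = list(itertools.product(EFFECTS, MEAN_LOADINGS, LOADING_HALF_RANGES))

def simulate(effect, mean_loading, half_range):
    """Return (treatment, responses) for one data set under assumed distributions."""
    loadings = [random.uniform(mean_loading - half_range, mean_loading + half_range)
                for _ in range(N_ITEMS)]
    intercepts = [random.gauss(0, 1) for _ in range(N_ITEMS)]   # ASSUMED N(0, 1)
    treatment = [j % 2 for j in range(N_PERSONS)]               # ASSUMED balanced arms
    responses = []
    for t in treatment:
        theta = effect * t + random.gauss(0, 1)                 # ASSUMED unit residual SD
        row = []
        for a, b in zip(loadings, intercepts):
            resid_sd = math.sqrt(max(0.0, 1 - a * a))           # ASSUMED residual SD form
            row.append(b + a * theta + random.gauss(0, resid_sd))
        responses.append(row)
    return treatment, responses

treatment, responses = simulate(*conditions[-1])
print(len(conditions), "conditions;", len(responses), "persons x",
      len(responses[0]), "items")
```

Repeating `simulate` 1,000 times per condition and fitting each analytic model to every data set would reproduce the overall shape of the design, though not necessarily its exact parameter values.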
For each simulated data set, we estimate the treatment effect $\beta_1$ and its associated z-statistic and p-value, and record whether the null hypothesis was rejected under each model. The models for the sum score and factor scores are equivalent OLS regression models and the SEMs are estimated using maximum likelihood with fixed item intercepts. In all models, the parameter of interest is the ATE $\beta_1$, and the errors are assumed to be normally distributed with mean 0 and constant variance and uncorrelated with the predictors. We also calculate $\omega$ for each simulated test as an estimate of $\rho$ to assess the effect of applying EIV corrections to the two-step models. To render each ATE comparable, we divide $\hat{\beta}_1$ by the RMSE of the regression model to standardize the coefficients, as the RMSE represents the estimated pooled (i.e., within-group) standard deviation of the latent trait $\theta_j$. Thus, the standardized coefficients are equivalent to Cohen’s d effect size.
We use lm to fit the OLS models, lme4 to fit the FA models with equal loadings (Bates et al., 2015), and lavaan (Rosseel, 2012) to fit the FA models with variable loadings.⁵ For the two-step approach using factor scores as outcomes, we use expected a posteriori (EAP) scores (Chapman, 2022; Lu et al., 2005; Muraki & Engelhard, 1985) as outcome variables. While there are many factor scoring options available in addition to EAP (Grice, 2001), such as maximum a posteriori (MAP), maximum likelihood, and test characteristic curve (TCC, common in IRT; Baker et al., 2017) scoring, we use EAP scoring because it is among the most commonly used approaches and because the Bayesian shrinkage of the EAP estimation reduces the SD of the scores, whose inflation by measurement error is the primary cause of attenuation bias. For the EIV corrected two-step models, we calculate $\omega$ using the R package psych (Revelle & Condon, 2019). We collect the default SE provided by each model and estimate the true SE as the SD of the point estimates across simulated data sets.
Results
Figure 2 shows the mean bias and Monte Carlo 95% confidence intervals for each method across all simulation conditions. We see that when the ATE is 0, bias is negligible across all conditions. However, when the ATE is positive, the two-step procedures are downwardly biased, the bias is proportional to the treatment effect size (as expected given that the standardized effect size is a ratio), and the bias is more severe when the average loadings are lower because lower average loadings translate to lower test reliability. In contrast, the LVMs do not show the same pattern of attenuation and are approximately unbiased across all conditions. Crucially, the performance of the SEM assuming equal factor loadings is indistinguishable from the SEM allowing for variable loadings, even when the range of loadings is high. The factor score allowing for variable weights only slightly outperforms the sum score when the range of loadings is highest, but its performance is nonetheless bested by the equal-weight SEM.
These results clearly illustrate that attenuation bias due to measurement error with standardized outcome variables is a more serious concern than the decision of whether to weight or not to weight the item responses. When we correct the two-step procedures for measurement error by dividing the coefficients by $\sqrt{\omega}$, as shown in Figure 3, we find that the performance of the sum score is indistinguishable from the LVMs.⁶
Figure 2
Estimated Bias by Method, Standardized Scores
Note. The points indicate the bias in standard deviation units, without the EIV correction applied to the two-step models. The bars indicate the Monte Carlo 95% CIs, calculated using the formula $\overline{\text{bias}} \pm 1.96 \cdot s / \sqrt{K}$, where $K$ is the number of simulation trials and $s$ is the standard deviation of the point estimates.
Figure 3
Estimated Bias by Method, EIV Correction Applied to Two-Step Scores
Note. The points indicate the bias in standard deviation units, with the EIV correction applied to the two-step models. The bars indicate the Monte Carlo 95% CIs, calculated using the formula $\overline{\text{bias}} \pm 1.96 \cdot s / \sqrt{K}$, where $K$ is the number of simulation trials and $s$ is the standard deviation of the point estimates.
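The Monte Carlo intervals described in the figure notes can be computed as in the following sketch, which takes the mean bias plus or minus 1.96 times the SD of the point estimates divided by the square root of the number of trials. The function name `monte_carlo_ci` and the four toy estimates are hypothetical.

```python
import math
import statistics

def monte_carlo_ci(estimates, true_value, z=1.96):
    """Return (mean bias, lower, upper) for a set of simulation point estimates."""
    bias = statistics.mean(estimates) - true_value
    # Monte Carlo SE of the mean: SD of the point estimates / sqrt(number of trials).
    mc_se = statistics.stdev(estimates) / math.sqrt(len(estimates))
    return bias, bias - z * mc_se, bias + z * mc_se

# Toy point estimates from four hypothetical simulation trials.
toy_estimates = [0.1, 0.2, 0.3, 0.4]
bias, lower, upper = monte_carlo_ci(toy_estimates, true_value=0.2)
print(round(bias, 3), round(lower, 3), round(upper, 3))
```

Because the interval width shrinks with the square root of the number of trials, 1,000 replications per condition yield intervals far narrower than those in this four-trial toy example.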
We include additional simulation results in Appendix A. In short, results show that differences between all models are minimal in terms of absolute precision (the SD of the point estimates), standard error calibration (the mean model-based SE as a proportion of the true SE), false positive rates, and statistical power, with the EIV correction applied to the two-step models. These results suggest that once the attenuation bias of the two-step models has been corrected, the choice of model does not appear to have strong impacts on the other statistical properties of the ATE. As a final sensitivity check, we rerun analogous simulations using dichotomous responses and IRT models rather than continuous responses and FA models. We find essentially identical results to those reported here, suggesting that our findings are consistent across multiple specifications and outcome item types. We include the full IRT simulation analysis results in our supplement.
Empirical Application
Data Source
To illustrate how the issues of scoring and model selection can play out in practice, we use a selection of empirical datasets from RCTs containing item-level outcome data from Gilbert, Himmelsbach, et al. (2025a). The authors examine 75 datasets from 48 RCTs with item-level outcome data to examine models for item-level heterogeneous treatment effects, in which treatments may uniquely impact each item of an outcome measure. Here, we take a random sample of 10 datasets to illustrate the results of different analytic approaches to estimating average treatment effects across a range of contexts. Table 2 provides a summary of each dataset. The datasets cover a range of geographic regions and outcome measures and show a wide range of estimated reliability, with $\omega$ ranging from 0.43 to 0.95. In contrast to the simulation, we cannot know the true value of the treatment effect on the latent trait in the empirical data. However, we can still examine the results of the different analytic models explored in the simulation and see how sensitive the results are to the modeling choices.
Table 2
Studies Included in Our Empirical Analysis
| ID | Citation | Location | Outcome | N | I | ω |
|---|---|---|---|---|---|---|
| 1 | Bruhn et al. (2016) | Brazil | Financial Literacy | 16852 | 10 | 0.57 |
| 2 | Kim et al. (2024) | USA | Reading Comprehension | 1335 | 29 | 0.86 |
| 3 | Kim et al. (2021) | USA | Vocabulary | 2588 | 24 | 0.79 |
| 4 | Romero et al. (2020) | Liberia | Raven’s Progressive Matrices | 3510 | 10 | 0.63 |
| 5 | Duflo et al. (2024) | Ghana | English | 27201 | 21 | 0.89 |
| 6 | Carpena (2024) | India | Health Knowledge | 839 | 21 | 0.75 |
| 7 | Luo et al. (2019) | China | Parenting Beliefs | 449 | 11 | 0.43 |
| 8 | Lyall et al. (2020) | Afghanistan | Pro-government attitudes | 1853 | 9 | 0.49 |
| 9 | Banerjee et al. (2017) | India | Math | 9193 | 30 | 0.93 |
| 10 | Banerjee et al. (2017) | India | Math | 5356 | 30 | 0.95 |
Note. N = number of subjects, I = number of items, ω = estimated reliability of the outcome measure.
Analytic Models
We apply eight estimators to produce treatment effects from each dataset. Because all item responses are dichotomous, we use logistic IRT models instead of the linear SEM. For two-step approaches, we calculate the mean score, 1PL IRT score, and 2PL IRT score, regress the resulting scores on the treatment indicator, and standardize the results to calculate Cohen’s d. We then apply the EIV correction to these three estimates, dividing the estimated ATE by √ω. (Our supplemental simulations of IRT models show that the classical EIV correction works well even though single-number estimates of reliability are less common in IRT frameworks, where precision varies over the range of the latent trait; see Nicewander, 2018; Raju et al., 2007.) We use mean scores instead of sum scores because there is some missing item response data. For one-step approaches, we estimate 1PL and 2PL EIRMs that allow for a treatment effect directly on the latent trait.
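The two-step mean score estimator with the EIV correction can be sketched in a few lines. This is a minimal illustration on simulated data, not the code used for the analysis: it assumes complete item responses, uses Cronbach's alpha as a stand-in for the reliability estimate ω, computes Cohen's d with the pooled within-group SD, and takes the correction to be division by the square root of reliability. All function names, the simulated 1PL data generating process, and the effect size of 0.3 are hypothetical.

```python
import numpy as np


def cronbach_alpha(items):
    """Cronbach's alpha for an (n_subjects, n_items) array without missing values.

    Used here only as a simple stand-in for the reliability estimate (omega).
    """
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - sum_item_var / total_var)


def cohens_d(score, treat):
    """Standardized mean difference using the pooled within-group SD."""
    y1, y0 = score[treat == 1], score[treat == 0]
    n1, n0 = len(y1), len(y0)
    pooled_var = ((n1 - 1) * y1.var(ddof=1) + (n0 - 1) * y0.var(ddof=1)) / (n1 + n0 - 2)
    return (y1.mean() - y0.mean()) / np.sqrt(pooled_var)


def eiv_correct(d, reliability):
    """Classical errors-in-variables correction: divide by sqrt(reliability)."""
    return d / np.sqrt(reliability)


# Simulate dichotomous responses from a 1PL model (hypothetical parameters).
rng = np.random.default_rng(0)
n, n_items = 2000, 10
treat = rng.integers(0, 2, size=n)
theta = 0.3 * treat + rng.normal(size=n)          # latent trait with a treatment shift
difficulty = np.linspace(-1.5, 1.5, n_items)      # item difficulties
prob = 1.0 / (1.0 + np.exp(-(theta[:, None] - difficulty[None, :])))
items = (rng.random((n, n_items)) < prob).astype(float)

# Step 1: score the outcome. Step 2: estimate and correct the standardized effect.
mean_score = items.mean(axis=1)
rel = cronbach_alpha(items)
d_raw = cohens_d(mean_score, treat)
d_eiv = eiv_correct(d_raw, rel)

print(f"reliability = {rel:.2f}, raw d = {d_raw:.3f}, EIV-corrected d = {d_eiv:.3f}")
```

Because the estimated reliability is below 1, the corrected estimate is always larger in magnitude than the raw one. With missing item responses, the row means would need to be taken over observed items only (for example with `np.nanmean`), and the reliability estimate would need a method that tolerates missingness.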
Results
Figure 4 shows the point estimates and 95% CIs for the standardized treatment effect from the eight estimators applied to our 10 datasets. The datasets are ordered by ω from lowest to highest. When ω is high (datasets 2, 5, 9, 10), differences between the estimators are minimal, as expected. Datasets with moderate ω (datasets 1, 4, 6, 3) most clearly mirror the simulation results, showing estimates from two-step models generally lower than alternative approaches, and minimal differences between EIV-corrected estimates and estimates from the one-step models. In dataset 8, the ATE is near 0 and all estimators are very close to the null value. Only when ω is very low, as in dataset 7 (ω = 0.43), do we see meaningful differences between weighted and unweighted estimators, with the 2PL approaches yielding negative point estimates and the 1PL approaches yielding positive point estimates. Taken as a whole, these results again suggest that once EIV corrections are applied, differences between estimators are likely to be minor in all but the most extreme cases.
Figure 4
Estimated Treatment Effects for 10 RCT Datasets
Note. The points indicate the estimated treatment effect size in standard deviation units. The bars indicate the model-based 95% CIs. 1PL and 2PL indicate that IRT-based scores are derived from one-parameter logistic and two-parameter logistic models, respectively. The EIRM is equivalent to an SEM with a logistic link function.
Discussion
Because psychometric outcome measures are a noisy proxy of a latent trait of interest, they suffer from measurement error, which results in negatively biased treatment effect estimates when outcome variables are standardized. Simulation results across outcome data with different properties show that the bias is substantial when treatment effect sizes are large, as predicted by Classical Test Theory. However, when the EIV correction is applied and the standardized coefficients are divided by √ω, differences in model performance are negligible under most conditions. Thus, the very process that makes varying statistical models comparable to one another (standardization) biases two-step models, and the effect of this bias dominates other features of the data generating process, including the variability of factor loadings used to create scoring weights. When left unaddressed, such bias could lead researchers and policymakers to erroneous conclusions about the efficacy of interventions.
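The source of this attenuation can be sketched with a short classical test theory argument. The notation below is ours and is introduced only for illustration: we assume an observed score X equal to a true score T plus an error E that is independent of both T and treatment assignment, a treatment that shifts T by Δ, and a reliability ω defined as the share of observed score variance due to the true score.

$$
X = T + E, \qquad \omega = \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)}, \qquad
d_{\text{obs}} = \frac{\Delta}{\operatorname{SD}(X)} = \frac{\Delta}{\operatorname{SD}(T)} \cdot \frac{\operatorname{SD}(T)}{\operatorname{SD}(X)} = d_{\text{true}} \sqrt{\omega}
$$

Under these assumptions, dividing the observed standardized effect by the square root of reliability recovers the effect on the true score scale. For example, with a reliability of 0.64, the observed standardized effect would be 80% of the true effect.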
As a concrete example of how attenuation bias could affect substantive results, consider meta-analyses that pool the effects of interventions on test score outcomes across studies. Even if the true underlying effect is equal across all studies, when the outcome measures are of varying reliability, the estimated effect sizes will differ due to attenuation bias, even as the participant and study sample sizes grow to infinity (Borenstein et al., 2009). Thus, failing to adjust standardized effect sizes for measurement error may lead to spurious conclusions about treatment heterogeneity. The degree to which such issues may be related to the ongoing “replication crisis” in psychology and other fields is an open question (Open Science Collaboration, 2015), but it seems plausible that variation in measurement practices may play a role in explaining variation in conclusions across studies (Flake et al., 2017; Flake & Fried, 2020; Pedersen et al., 2025).
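The following sketch makes the meta-analytic point concrete using the reliabilities reported in Table 2. The common true effect of 0.30 SD is a hypothetical value chosen for illustration, and the calculation assumes the classical attenuation relation in which the expected observed effect equals the true effect multiplied by the square root of reliability; sampling error is ignored.

```python
import numpy as np

# Estimated reliabilities (omega) from Table 2, datasets 1-10.
omega = np.array([0.57, 0.86, 0.79, 0.63, 0.89, 0.75, 0.43, 0.49, 0.93, 0.95])

# Hypothetical common true effect on the latent trait, in SD units (an assumption).
true_d = 0.30

# Classical attenuation: expected observed d = true d * sqrt(reliability).
expected_observed_d = true_d * np.sqrt(omega)

# Dividing by sqrt(omega) recovers the common effect in every study.
corrected_d = expected_observed_d / np.sqrt(omega)

for study_id, (w, d_obs) in enumerate(zip(omega, expected_observed_d), start=1):
    print(f"dataset {study_id}: omega = {w:.2f}, expected observed d = {d_obs:.3f}")

# Range of effects across studies before and after the correction.
spread_raw = expected_observed_d.max() - expected_observed_d.min()
spread_corrected = corrected_d.max() - corrected_d.min()
print(f"spread of uncorrected effects: {spread_raw:.3f}")
print(f"spread of corrected effects:   {spread_corrected:.3f}")
```

Even with an identical true effect in every study, the uncorrected effects vary across studies purely as a function of reliability, which a meta-analysis could misread as treatment heterogeneity.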
Our interpretation of these results is that researchers may be overly focused on second-order measurement issues, such as the use of variable factor loadings that function as optimal scoring weights (McNeish, 2022; McNeish & Wolf, 2020), rather than the first-order issue of attenuation of standardized coefficients for measurement error in the outcome variable (Shear & Briggs, 2024; Widaman & Revelle, 2023). That is, when the EIV correction is applied, differences between the simplest standardized sum score model and the more complex LVMs are negligible in terms of bias, precision, and statistical power in the estimation of treatment effects, and this result holds even when the variability of factor loadings is high. Thus, when causal inference at a single time point is the primary goal, the use of sum scores with the EIV correction is likely to be sufficient for many applications in applied program evaluation.
These results should not detract from other uses of LVMs. Clearly, IRT/FA methods are essential for piloting measures, identifying poorly functioning items (Jessen et al., 2018), differential item functioning analysis (Osterlind & Everson, 2009), vertical scaling (Briggs & Domingue, 2013), linking (Lee & Lee, 2018), and addressing missing data (Gilbert, 2024a), and LVMs can easily be expanded to incorporate complex relationships among many latent variables or multidimensional constructs at several time points (Kline, 2023). A particularly valuable use case for LVMs in causal inference would be settings in which treatment may differentially impact individual items and the LVM can provide insights on treatment heterogeneity, such as “item-level heterogeneous treatment effects” that would be masked in a two-step analysis (Ahmed et al., 2024; Gilbert, 2024b; Gilbert, Himmelsbach, et al., 2025a; Gilbert et al., 2023; Sales et al., 2021), differential growth by item type (Briggs, 2021; Gilbert et al., 2024; Naumann et al., 2014), or the appropriate interpretation of interaction effects (Domingue, Kanopka, Trejo, et al., 2024; Gilbert, Miratrix, et al., 2025). However, when all respondents answer the same items at a single time point, and only average treatment effects are of interest, the results appear relatively insensitive to the methods employed when the EIV correction is applied. Therefore, the benefits of interpretability and computational simplicity may favor the EIV-corrected standardized sum score in many straightforward causal inference applications, despite arguments that the sum score can be a suboptimal choice (in general) because the constraint of equal factor loadings imposed by the sum score is rarely met in real data (McNeish & Wolf, 2020).
While the results of this study provide evidence for the importance of EIV corrections in two-step analyses of standardized test score outcome variables, several limitations merit consideration. For example, the data generating process employed in this study examines the simple case of individual randomization with no covariates beyond the treatment indicator. Future work could extend it to explore how measurement model selection may impact the estimation of heterogeneous treatment effects, the effects of predictive covariates, multilevel structures such as multi-site or cluster-randomized trials, or alternative experimental and quasi-experimental contexts such as regression discontinuity, difference-in-differences, instrumental variables, and longitudinal analyses. An emerging literature on the synthesis of latent variable and causal inference methods has begun to shed light on these areas (Gilbert, Himmelsbach, et al., 2025a; Gilbert et al., 2024; Gilbert, Miratrix, et al., 2025; Kuhfeld & Soland, 2022, 2023; Mayer, 2019; Miratrix et al., 2021; Rabbitt, 2018; Soland, 2022, 2023; Soland et al., 2023).
A related issue is measurement error in covariates, which we did not explore in this study. In theory, in an RCT, any bias induced by covariate measurement error will affect treatment and control groups equally and thus should not affect estimation of the ATE (Lockwood & McCaffrey, 2014). In observational studies, however, covariate measurement error can lead to biases when the covariates do not fully control for relevant differences between groups (Cook et al., 2009; Sengewald & Pohl, 2019; Sengewald et al., 2019). Factor models with latent covariates and outcomes are easily estimable in lavaan when the indicators are continuous. However, software options for categorical responses common in the social sciences are currently less widely used in R, though recent developments such as galamm (Sørensen, 2024) and EffectLiteR (Mayer et al., 2016, 2020; Sengewald & Mayer, 2024) may provide attractive options. We view exploration of how measurement error in both covariates and outcomes influences results in experimental and observational contexts to be a promising avenue for future research.
In conclusion, results of causal analyses of psychometric outcome data are sensitive to model selection, and the effects of attenuation bias are much more consequential than the use of scoring weights. When researchers do not adjust for measurement error with EIV corrections or use LVMs, standardized treatment effect estimates will be downwardly biased and thus understate treatment impacts. When the EIV correction is applied, the impact of model selection will be reduced, demonstrating how the application of psychometric principles can improve causal inference in evaluation research.