When research results are sensitive to the choice of statistical model, they become dependent on researcher discretion, and bias can be introduced (<xref ref-type="bibr" rid="Xbib43">Gelman & Loken, 2013</xref>; <xref ref-type="bibr" rid="Xbib68">King & Nielsen, 2019</xref>; <xref ref-type="bibr" rid="Xbib124">Simmons et al., 2011</xref>; <xref ref-type="bibr" rid="Xbib139">Wicherts et al., 2016</xref>). Researcher discretion is a particular challenge in evaluation research on outcomes derived from psychometric instruments, such as educational tests, psychological surveys, and patient-reported outcomes, because of the many approaches to scoring the outcome measures and accounting for measurement error in the results, such as Classical Test Theory (CTT), Item Response Theory (IRT), Factor Analysis (FA), or Latent Variable Models (LVMs). Researcher-designed outcome measures in particular demand many decision points in the analysis process, raising the question of how sensitive results are to model selection and outcome scoring decisions, especially in causal studies investigating intervention impacts on outcomes that aim to provide policy-relevant findings. While reviews in several fields such as medicine, education, and organizational research show a relative lack of attention to issues of measurement in general and call for better measurement practices (<xref ref-type="bibr" rid="Xbib11">Brakenhoﬀ et al., 2018</xref>; <xref ref-type="bibr" rid="Xbib28">Cox & Kelcey, 2019</xref>; <xref ref-type="bibr" rid="Xbib41">Flake et al., 2017</xref>; <xref ref-type="bibr" rid="Xbib40">Flake & Fried, 2020</xref>; <xref ref-type="bibr" rid="Xbib103">Pedersen et al., 2025</xref>; <xref ref-type="bibr" rid="Xbib132">Spybrook et al., 2016</xref>), the implications of measurement principles for causal inference, policy, and program evaluation are less prominent (<xref ref-type="bibr" rid="Xbib122">Shear & Briggs, 2024</xref>; <xref ref-type="bibr" rid="Xbib131">Soland, Kuhfeld, & Edwards, 2024</xref>; <xref ref-type="bibr" rid="Xbib128">Soland, Edwards, & Talbert, 2024</xref>). For a given causal research question, alternative statistical methods may provide defensible options for analysis, and varying results are expected. For instance, when modeling a binary outcome, logistic regression and the linear probability model may produce diﬀerent results due to the contrasting assumptions of each model (<xref ref-type="bibr" rid="Xbib136">Timoneda, 2021</xref>). Similarly, in the context of multisite randomized trials or meta-analyses, fixed eﬀects and random eﬀects estimators will produce diﬀerent estimates of treatment eﬀects due to the diﬀerent estimands targeted by each model (<xref ref-type="bibr" rid="Xbib23">Chan & Hedges, 2022</xref>; <xref ref-type="bibr" rid="Xbib92">Miratrix et al., 2021</xref>; <xref ref-type="bibr" rid="Xbib125">Skrondal & Rabe-Hesketh, 2004</xref>). While such diﬀerences in “estimates, estimators, and estimands” (<xref ref-type="bibr" rid="Xbib92">Miratrix et al., 2021</xref>) are well understood in causal inference generally, the use of psychometric measures as outcome variables demands additional consideration because observed scores are typically not of interest in themselves but rather as proxies for unobserved latent variables such as academic achievement. Thus, researchers are faced with navigating a range of options for causal analysis of psychometric outcome data and the challenge of interpreting diﬀering results from models that theoretically target the same treatment eﬀect on the latent trait. Furthermore, it is unclear whether some approaches are consistently superior to others or if the tradeoﬀs of model selection depend on the circumstances (<xref ref-type="bibr" rid="Xbib44">Gilbert, 2024a</xref>; <xref ref-type="bibr" rid="Xbib58">Hontangas et al., 2015</xref>). As an example, consider the options for scoring an educational test to estimate a treatment eﬀect on the latent trait of academic achievement imperfectly represented by the observed test score. Both CTT sum scores and IRT- or FA-based scores use item responses to estimate a latent trait score for each student, which is then used in subsequent analysis. IRT- or FA-based methods such as the two-parameter logistic (2PL) model or the congeneric factor model theoretically provide more fine-grained distinctions among students by weighting item responses based on the information (i.e., item discrimination or factor loading) provided by the item, in contrast to sum scores, which treat diﬀerent sets of correct answers as identical (<xref ref-type="bibr" rid="Xbib19">Camilli, 2018</xref>; <xref ref-type="bibr" rid="Xbib54">Hambleton & Van der Linden, 1982</xref>; <xref ref-type="bibr" rid="Xbib78">Lord, 1980</xref>; <xref ref-type="bibr" rid="Xbib79">Lord & Novick, 1968</xref>; <xref ref-type="bibr" rid="Xbib135">Thissen & Wainer, 2001</xref>). Alternatively, LVM techniques, such as Structural Equation Modeling (SEM; <xref ref-type="bibr" rid="Xbib69">Kline, 2023</xref>; <xref ref-type="bibr" rid="Xbib96">Muthén, 2002</xref>) or Explanatory Item Response Modeling (EIRM; <xref ref-type="bibr" rid="Xbib14">Briggs, 2008</xref>; <xref ref-type="bibr" rid="Xbib30">De Boeck, 2004</xref>; <xref ref-type="bibr" rid="Xbib32">De Boeck & Wilson, 2016</xref>; <xref ref-type="bibr" rid="Xbib44">Gilbert, 2024a</xref>; <xref ref-type="bibr" rid="Xbib141">Wilson et al., 2008</xref>) estimate the measurement and regression models in a single step. Because all test scoring methods and LVMs target the same treatment eﬀect on the latent trait, a key question is the extent to which theoretical diﬀerences between these models matter in causal analysis of test score outcome data. Correlations between IRT- and FA-based scores and CTT scores are typically above 0.90 (<xref ref-type="bibr" rid="Xbib80">Lu et al., 2005</xref>; <xref ref-type="bibr" rid="Xbib131">Soland, Kuhfeld, & Edwards, 2024</xref>), which raises the question of whether the theoretical benefits of IRT- or FA-based scoring methods or LVMs are worth the added complexity, computational power, and interpretational challenges they may pose. Furthermore, no clear guidelines exist on which model researchers should prefer, particularly when the results conflict. To illustrate the challenge facing the applied researcher, consider two recent publications on the implications of using sum scores versus factor scores in statistical models. On one side, <xref ref-type="bibr" rid="Xbib89">McNeish and Wolf (2020)</xref> argue that sum scores can have “adverse eﬀects on validity, reliability, and qualitative classification” compared to FA-based scores because sum scores implicitly assume that each item contributes equally to the estimation of the latent trait, an assumption that is unlikely to be met in many empirical applications. In contrast, <xref ref-type="bibr" rid="Xbib140">Widaman and Revelle (2023)</xref> argue that so long as the scale is unidimensional, sum scores “often have a solid psychometric basis and therefore are frequently quite adequate for psychological research”. Such competing claims, expanded in further publications (<xref ref-type="bibr" rid="Xbib86">McNeish, 2022</xref>, <xref ref-type="bibr" rid="Xbib87">2023</xref>, <xref ref-type="bibr" rid="Xbib88">2024</xref>; <xref ref-type="bibr" rid="Xbib123">Sijtsma et al., 2024</xref>), provide a challenge for the applied researcher working with outcome data derived from psychometric measures. The purpose of this study is to provide both a concise and accessible review of the conceptual issues at play and practical guidance for evaluation researchers by exploring the consequences of outcome measurement modeling decisions on causal inference by determining which decision points in measurement modeling are most salient for analytic results. Results show that the issue of attenuation bias dominates the issue of scoring weights, and simpler models can perform better even under extreme circumstances. In other words, our results suggest that accounting for measurement error in the outcome variable is a first-order concern in causal inference, in contrast to second-order issues of measurement “model error”, in which an incorrect measurement model is applied to generate scores for the outcome variable (<xref ref-type="bibr" rid="Xbib76">Liu & Pek, 2024</xref>), such as using equally-weighted scores when other approaches are a better fit to the data. Our results align with studies showing that the marginal gains to more complex statistical models can be low and may not justify their increased complexity (e.g., <xref ref-type="bibr" rid="Xbib36">Domingue, Kanopka, Kapoor, et al., 2024</xref> in IRT; <xref ref-type="bibr" rid="Xbib22">Castellano & Ho, 2015</xref>, in value-added modeling; <xref ref-type="bibr" rid="Xbib140">Widaman & Revelle, 2023</xref> in psychological measurement), and serve as a contrast with other work emphasizing the sensitivity of analytic results to measurement modeling choices in the analysis of psychometric data (<xref ref-type="bibr" rid="Xbib89">McNeish & Wolf, 2020</xref>; <xref ref-type="bibr" rid="Xbib131">Soland, Kuhfeld, & Edwards, 2024</xref>). <sec id="s1_1"><title>Classical Approaches to Measurement Error in Evaluation Research

METH

Methodology

1614-1881 1614-2241

PsychOpen

meth.15773

10.5964/meth.15773

Original Article

Data

How Measurement Affects Causal Inference: Attenuation Bias Is (Usually) More Important Than Outcome Scoring Weights

How Measurement Affects Causal Inference

How measurement affects causal inference: Attenuation bias is (usually) more important than outcome scoring weights

https://orcid.org/0000-0003-3496-2710

Gilbert

Joshua B.

*1 Medzihorsky

Juraj

1Harvard Graduate School of Education, Harvard University, Cambridge, MA, USA Durham University, Durham, United Kingdom

*13 Appian Way, Cambridge, MA 02138, USA. joshua_gilbert@g.harvard.edu

30062025

2025

21 2 91 122 05 10 2024 06 05 2025

2025

Gilbert

https://creativecommons.org/licenses/by/4.0/

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC BY) 4.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

When analyzing treatment effects on outcome variables constructed from psychometric instruments (e.g., educational test scores, psychological surveys, or patient reported outcomes), researchers face many choices and competing guidance for scoring the measures and modeling results. This study examines the impact of outcome measure scoring and modeling approaches through simulation and an empirical application. Results show that estimates from multiple methods applied to the same data will vary because two-step models using sum or factor scores provide attenuated standardized treatment effects compared to latent variable models. This bias dominates any other differences between models or features of the data generating process, such as the use of scoring weights. An errors-in-variables (EIV) correction removes the bias from two-step models. An empirical application to 10 datasets from randomized controlled trials demonstrates the sensitivity of the results to model selection. This study shows that the psychometric principles most consequential in causal inference are related to attenuation bias rather than optimal outcome scoring weights.

causal inferencelatent variable modelsfactor analysispsychometricsmeasurement

When research results are sensitive to the choice of statistical model, they become dependent on researcher discretion, and bias can be introduced (<xref ref-type="bibr" rid="Xbib43">Gelman & Loken, 2013</xref>; <xref ref-type="bibr" rid="Xbib68">King & Nielsen, 2019</xref>; <xref ref-type="bibr" rid="Xbib124">Simmons et al., 2011</xref>; <xref ref-type="bibr" rid="Xbib139">Wicherts et al., 2016</xref>). Researcher discretion is a particular challenge in evaluation research on outcomes derived from psychometric instruments, such as educational tests, psychological surveys, and patient-reported outcomes, because of the many approaches to scoring the outcome measures and accounting for measurement error in the results, such as Classical Test Theory (CTT), Item Response Theory (IRT), Factor Analysis (FA), or Latent Variable Models (LVMs). Researcher-designed outcome measures in particular demand many decision points in the analysis process, raising the question of how sensitive results are to model selection and outcome scoring decisions, especially in causal studies investigating intervention impacts on outcomes that aim to provide policy-relevant findings. While reviews in several fields such as medicine, education, and organizational research show a relative lack of attention to issues of measurement in general and call for better measurement practices (<xref ref-type="bibr" rid="Xbib11">Brakenhoﬀ et al., 2018</xref>; <xref ref-type="bibr" rid="Xbib28">Cox & Kelcey, 2019</xref>; <xref ref-type="bibr" rid="Xbib41">Flake et al., 2017</xref>; <xref ref-type="bibr" rid="Xbib40">Flake & Fried, 2020</xref>; <xref ref-type="bibr" rid="Xbib103">Pedersen et al., 2025</xref>; <xref ref-type="bibr" rid="Xbib132">Spybrook et al., 2016</xref>), the implications of measurement principles for causal inference, policy, and program evaluation are less prominent (<xref ref-type="bibr" rid="Xbib122">Shear & Briggs, 2024</xref>; <xref ref-type="bibr" rid="Xbib131">Soland, Kuhfeld, & Edwards, 2024</xref>; <xref ref-type="bibr" rid="Xbib128">Soland, Edwards, & Talbert, 2024</xref>). For a given causal research question, alternative statistical methods may provide defensible options for analysis, and varying results are expected. For instance, when modeling a binary outcome, logistic regression and the linear probability model may produce diﬀerent results due to the contrasting assumptions of each model (<xref ref-type="bibr" rid="Xbib136">Timoneda, 2021</xref>). Similarly, in the context of multisite randomized trials or meta-analyses, fixed eﬀects and random eﬀects estimators will produce diﬀerent estimates of treatment eﬀects due to the diﬀerent estimands targeted by each model (<xref ref-type="bibr" rid="Xbib23">Chan & Hedges, 2022</xref>; <xref ref-type="bibr" rid="Xbib92">Miratrix et al., 2021</xref>; <xref ref-type="bibr" rid="Xbib125">Skrondal & Rabe-Hesketh, 2004</xref>). While such diﬀerences in “estimates, estimators, and estimands” (<xref ref-type="bibr" rid="Xbib92">Miratrix et al., 2021</xref>) are well understood in causal inference generally, the use of psychometric measures as outcome variables demands additional consideration because observed scores are typically not of interest in themselves but rather as proxies for unobserved latent variables such as academic achievement. Thus, researchers are faced with navigating a range of options for causal analysis of psychometric outcome data and the challenge of interpreting diﬀering results from models that theoretically target the same treatment eﬀect on the latent trait. Furthermore, it is unclear whether some approaches are consistently superior to others or if the tradeoﬀs of model selection depend on the circumstances (<xref ref-type="bibr" rid="Xbib44">Gilbert, 2024a</xref>; <xref ref-type="bibr" rid="Xbib58">Hontangas et al., 2015</xref>). As an example, consider the options for scoring an educational test to estimate a treatment eﬀect on the latent trait of academic achievement imperfectly represented by the observed test score. Both CTT sum scores and IRT- or FA-based scores use item responses to estimate a latent trait score for each student, which is then used in subsequent analysis. IRT- or FA-based methods such as the two-parameter logistic (2PL) model or the congeneric factor model theoretically provide more fine-grained distinctions among students by weighting item responses based on the information (i.e., item discrimination or factor loading) provided by the item, in contrast to sum scores, which treat diﬀerent sets of correct answers as identical (<xref ref-type="bibr" rid="Xbib19">Camilli, 2018</xref>; <xref ref-type="bibr" rid="Xbib54">Hambleton & Van der Linden, 1982</xref>; <xref ref-type="bibr" rid="Xbib78">Lord, 1980</xref>; <xref ref-type="bibr" rid="Xbib79">Lord & Novick, 1968</xref>; <xref ref-type="bibr" rid="Xbib135">Thissen & Wainer, 2001</xref>). Alternatively, LVM techniques, such as Structural Equation Modeling (SEM; <xref ref-type="bibr" rid="Xbib69">Kline, 2023</xref>; <xref ref-type="bibr" rid="Xbib96">Muthén, 2002</xref>) or Explanatory Item Response Modeling (EIRM; <xref ref-type="bibr" rid="Xbib14">Briggs, 2008</xref>; <xref ref-type="bibr" rid="Xbib30">De Boeck, 2004</xref>; <xref ref-type="bibr" rid="Xbib32">De Boeck & Wilson, 2016</xref>; <xref ref-type="bibr" rid="Xbib44">Gilbert, 2024a</xref>; <xref ref-type="bibr" rid="Xbib141">Wilson et al., 2008</xref>) estimate the measurement and regression models in a single step. Because all test scoring methods and LVMs target the same treatment eﬀect on the latent trait, a key question is the extent to which theoretical diﬀerences between these models matter in causal analysis of test score outcome data. Correlations between IRT- and FA-based scores and CTT scores are typically above 0.90 (<xref ref-type="bibr" rid="Xbib80">Lu et al., 2005</xref>; <xref ref-type="bibr" rid="Xbib131">Soland, Kuhfeld, & Edwards, 2024</xref>), which raises the question of whether the theoretical benefits of IRT- or FA-based scoring methods or LVMs are worth the added complexity, computational power, and interpretational challenges they may pose. Furthermore, no clear guidelines exist on which model researchers should prefer, particularly when the results conflict. To illustrate the challenge facing the applied researcher, consider two recent publications on the implications of using sum scores versus factor scores in statistical models. On one side, <xref ref-type="bibr" rid="Xbib89">McNeish and Wolf (2020)</xref> argue that sum scores can have “adverse eﬀects on validity, reliability, and qualitative classification” compared to FA-based scores because sum scores implicitly assume that each item contributes equally to the estimation of the latent trait, an assumption that is unlikely to be met in many empirical applications. In contrast, <xref ref-type="bibr" rid="Xbib140">Widaman and Revelle (2023)</xref> argue that so long as the scale is unidimensional, sum scores “often have a solid psychometric basis and therefore are frequently quite adequate for psychological research”. Such competing claims, expanded in further publications (<xref ref-type="bibr" rid="Xbib86">McNeish, 2022</xref>, <xref ref-type="bibr" rid="Xbib87">2023</xref>, <xref ref-type="bibr" rid="Xbib88">2024</xref>; <xref ref-type="bibr" rid="Xbib123">Sijtsma et al., 2024</xref>), provide a challenge for the applied researcher working with outcome data derived from psychometric measures. The purpose of this study is to provide both a concise and accessible review of the conceptual issues at play and practical guidance for evaluation researchers by exploring the consequences of outcome measurement modeling decisions on causal inference by determining which decision points in measurement modeling are most salient for analytic results. Results show that the issue of attenuation bias dominates the issue of scoring weights, and simpler models can perform better even under extreme circumstances. In other words, our results suggest that accounting for measurement error in the outcome variable is a first-order concern in causal inference, in contrast to second-order issues of measurement “model error”, in which an incorrect measurement model is applied to generate scores for the outcome variable (<xref ref-type="bibr" rid="Xbib76">Liu & Pek, 2024</xref>), such as using equally-weighted scores when other approaches are a better fit to the data. Our results align with studies showing that the marginal gains to more complex statistical models can be low and may not justify their increased complexity (e.g., <xref ref-type="bibr" rid="Xbib36">Domingue, Kanopka, Kapoor, et al., 2024</xref> in IRT; <xref ref-type="bibr" rid="Xbib22">Castellano & Ho, 2015</xref>, in value-added modeling; <xref ref-type="bibr" rid="Xbib140">Widaman & Revelle, 2023</xref> in psychological measurement), and serve as a contrast with other work emphasizing the sensitivity of analytic results to measurement modeling choices in the analysis of psychometric data (<xref ref-type="bibr" rid="Xbib89">McNeish & Wolf, 2020</xref>; <xref ref-type="bibr" rid="Xbib131">Soland, Kuhfeld, & Edwards, 2024</xref>). <sec id="s1_1"><title>Classical Approaches to Measurement Error in Evaluation Research

Measurement error is a widely studied phenomenon, with work on the reliability of educational and psychological tests going back many decades (Asher, 1974; Bollen, 1989; Borsboom, 2005; Briggs, 2021; Cronbach, 1951; Lord & Novick, 1968), and has well-known consequences in statistical analysis (Fuller & Hidiroglou, 1978; Hutcheon et al., 2010; Liu, 1988). In the case of simple linear regression with two variables, error in independent (X, predictor) variables serves to attenuate regression coeﬃcients toward 0, whereas error in dependent (Y, outcome) variables will not bias estimated regression coeﬃcients, but will decrease precision and reduce statistical power by increasing residual variance, though these general rules of thumb do not always hold in more complex circumstances (Kline, 2023).

Measurement error can be addressed with both classical and modern methods. For example, Errors-in-Variables (EIV) regression models (Carroll et al., 2009; Gillard, 2010) use estimates of reliability to adjust the coeﬃcients of predictor variables, and LVMs (Muthén, 2002) adjust for measurement error by simultaneously estimating the latent variable(s) and the regression model. While both EIV and LVM methods can correct for measurement error, some studies have shown that the LVM approach can provide more robust estimates of uncertainty than EIV methods (Gilbert, 2024a; Lockwood & McCaﬀrey, 2014).

Measurement error in the dependent variable is sometimes ignored because it does not bias coeﬃcients (Cox & Kelcey, 2019), but LVMs can also be applied to outcome variables and can provide modest benefits to statistical power and more robust estimates of uncertainty than alternative approaches (Christensen, 2006; Rabbitt, 2018; Zwinderman, 1991), though benefits are context dependent (Gilbert, 2024a). However, coeﬃcients are downwardly biased when the dependent variable is standardized. Attenuation due to standardization is a particular issue in evaluation research because most test scores, psychological surveys, and patient reported outcomes have no natural scale, and standardization allows for estimates of treatment eﬀect size that can in principle be compared across studies or pooled in meta-analyses (Borenstein et al., 2009) and are often argued to be more interpretable than unstandardized coeﬃcients (Schielzeth, 2010).

Standardization of the dependent variable Y attenuates regression coeﬃcients because measurement error causes overdispersion in the standard deviation of Y, σY. That is, σY will be greater than the SD of the true latent trait scores σT because σY contains the variation of σT plus measurement error σE, as summarized in the CTT variance decomposition σY2=σT2+σE2 (Brennan, 2010; DeVellis, 2006; Hambleton & Jones, 1993; Jackson, 1973; Lewis, 2006; Traub, 1997). We can precisely estimate the overdispersion of σY with the CTT reliability formula, which defines reliability ρ as the ratio of true score variance (σT2) to observed score variance (σY2): ρ=σT2σY2. Solving for σY shows that σY=σTρ. Therefore, when we standardize an outcome variable such as a test score by dividing by its SD σY, this value is too large by a factor of 1ρ. Consequently, when measurement error in the outcome is present, standardized regression coeﬃcients will be driven downward, and this bias can be corrected by dividing by ρ. Applying this EIV correction deattenuates the standardized regression coeﬃcient to what it would be if the test were perfectly reliable or of infinite length.

Attenuation due to standardization is not a new insight (Cole & Preacher, 2014; Hedges, 1981; Shear & Briggs, 2024), but it is nonetheless commonly ignored, or reserved for technical discussions (Borenstein et al., 2009) and comparatively less emphasized in practical guides for researchers. For example, in its section on reliability, the Institute of Education Sciences’ (IES) What Works Clearinghouse Standards Handbook lists minimum thresholds for various reliability metrics (e.g., α≥.50 in Version 4.1 and α≥.60 in Version 5.0), but makes no mention of attenuation bias, in contrast to detailed explanation of the bias that arises from other sources, such as non-random attrition or baseline non-equivalence.¹1

Current and past WWC Standards Handbooks are available at the following URL: https://ies.ed.gov/ncee/wwc/handbooks.

Crucially, attenuation bias is not solved by IRT or FA scoring procedures, because the resulting scores still contain measurement error. The problem can be further compounded by expected a posteriori (EAP) scoring methods because shrinkage of the empirical Bayes estimation draws the distribution of estimated latent trait scores to the overall mean across treatment and control groups rather than the respective means of each group (Briggs, 2008; Soland, 2022). This problem is less severe but still present when using maximum likelihood (ML) scoring (Soland, Kuhfeld, & Edwards, 2024, p. 11), though ML scoring raises other issues such as undefined scores for respondents with “perfect” scores (i.e., all items answered correctly or incorrectly on an educational test). The solution is to apply an EIV correction by dividing the coeﬃcients by ρ, where ρ can be estimated as the internal consistency of the test (e.g., Cronbach’s α or ω) (Hedges, 1981; Shear & Briggs, 2024), or to employ an LVM that directly adjusts for measurement error in the estimation procedure, as we will demonstrate.

Methods for Estimating Causal Effects on Psychometric Outcome Data Estimands and Estimators

Consider outcome 𝜃j for person j (j=1,...,J). Under the potential outcomes framework (Imbens & Rubin, 2015; Rubin, 2005), the individual causal eﬀect of binary treatment Tj on person j is τj≡𝜃j(1)−𝜃j(0), where 1 indicates the treatment counterfactual and 0 indicates the control counterfactual.²2

We follow the notation of Gilbert, Himmelsbach, et al. (2025a).

Because only one counterfactual is observed, τj is unobservable. The target estimand of causal analyses is therefore typically the average treatment eﬀect (ATE), defined as τ¯=1JΣj=1J(𝜃j(1)−𝜃j(0)).

Random assignment of the treatment ensures that treatment status is independent of the potential outcomes. Therefore, we can estimate τ¯ as a diﬀerence in means between the treated and control groups. Practically, we can use a simple linear regression model as our estimator for τ¯, in which Tj is an indicator variable for the treatment status of person j, β0 is the mean of the control group, β1 is the diﬀerence in means between the groups, and εj is the error term (Angrist & Pischke, 2009; Imbens & Rubin, 2015; Murnane & Willett, 2010; Rosenbaum, 2017):

1𝜃j=β0+β1Tj+εj.

When 𝜃j is observed, the diﬀerence in means approach provided by Equation 1 is standard. However, when 𝜃j represents an unobserved variable, such as mathematical ability, extroversion, or depression, Equation 1 is no longer estimable (Stoetzer et al., 2024). The two primary approaches to estimating causal eﬀects on latent variables are two-step procedures and simultaneous estimation, to which we now turn.

Two-Step Procedures

In a two-step procedure, the latent trait of interest is first estimated for each person and then analyzed as the outcome variable using a standard statistical model such as OLS regression (Christensen, 2006; Ye, 2016). For example, consider the following regression model, in which scorej represents an estimated latent trait score for person j and β1 represents the average treatment eﬀect (ATE):

2scorej=β0+β1treatj+εj

3εj∼N(0,σε).

scorej may be generated in a CTT or IRT/FA framework. In CTT, a sum or mean score is used, such that the observed score across all items for items i=1,...,I equals the sum of the responses ∑i=1Iitemi or the mean of the responses 1I∑i=1Iitemi. In IRT or FA, the latent trait estimate, denoted 𝜃̂, is calculated by maximizing the likelihood of 𝜃̂ given the estimated item parameters (Bock et al., 1997). Generally, the IRT scoring approach has been argued to be superior to CTT approaches because IRT 𝜃̂ estimates are on an interval rather than ordinal scale (Ferrando & Chico, 2007; Harwell & Gatti, 2001; Jabrayilov et al., 2016; McNeish & Wolf, 2020).³3

The ability of IRT to produce an interval scaling is a theoretical ideal that may or may not be met in any given empirical application, see e.g., Briggs and Domingue (2013); Briggs and Weeks (2009); Michell (1994, 1997); Reckase (2004); Schafer (2006) for various perspectives on this issue.

Furthermore, scores provided by IRT/FA models weight the contributions of item responses to 𝜃̂ by their discrimination parameters or factor loadings, thus maximizing the information in and increasing the reliability of 𝜃̂ (Camilli, 2018; Jessen et al., 2018; McNeish, 2023; McNeish & Wolf, 2020; Rhemtulla & Savalei, 2025), and participants with identical sum scores can have diﬀerent 𝜃̂ based on diﬀerent patterns of item responses, thus providing theoretically more fine-grained distinctions between the respondents. Empirically, however, diﬀerences between CTT and IRT/FA scoring are often found to be minor (Lu et al., 2005; Sébille et al., 2010; Xu & Stone, 2012). One limitation of the two-step approach is that, regardless of what type of scoring procedure is used to estimate the latent trait, the outcome variable is treated as known when it contains error and therefore resulting regression coeﬃcients will be biased when the outcome is standardized, unless the EIV correction is applied, as we will show.

Simultaneous Estimation With Latent Variable Models (LVMs)

As an alternative to two-step procedures, LVMs enable the analyst to estimate measurement (psychometric) and regression (structural) models simultaneously and incorporate the eﬀects of measurement error directly into the estimation procedure, for both predictors and outcomes (Bollen, 1989; Kline, 2023; Muthén, 2002). For example, consider the following LVM for the analysis of a treatment eﬀect on test score data,

4f(Yij)=λi(𝜃j+bi)+εij

5𝜃j=β0+β1treatj+ζj

6εij∼N(0,σεi)

7ζj∼N(0,σζ)

in which the response Y to item i for person j is a function of latent person ability 𝜃j and item easiness parameters (item intercepts) bi, weighted by item discrimination parameters (factor loadings) λi and error term εij. 𝜃j is in turn a function of the control group mean β0, the ATE β1, and unexplained or residual variance in person ability ζj. Thus, the ATE β1 is estimated directly on the latent trait without the need to compute an outcome score in a separate step.

Because 𝜃j is unobserved, constraints are necessary to identify the model and provide a scale for 𝜃j. Two standard approaches to model identification are the unit-loading approach, where λi is fixed to 1 for a single item (or all items, as in a Rasch model), or the unit-variance approach, where the total variance (or residual variance) of 𝜃j is fixed to 1. The item easiness parameters bi can be identified by either excluding one item in a fixed eﬀects approach (so that β0 represents the performance of the average control respondent on this item), by fixing the mean of bi to 0 in a random eﬀects approach (De Boeck, 2008), or by fixing β0=0.

These constraints resolve the scale indeterminacy of 𝜃j but are nonetheless arbitrary. For example, a single λi could be fixed to 2 instead of 1, yielding diﬀerent point estimates and standard errors but providing identical fit to the observed data. Thus, when estimating causal eﬀects on latent outcomes, indexing β1 to the pooled standard deviation of 𝜃j (i.e., σζ in the case without additional covariates in the model) is an attractive strategy to ground the interpretation of the model results. Accordingly, for the purposes of the present study, we target the following latent ATE in our analyses, following the notation of Stoetzer et al. (2024):

8E(𝜃j|Tj=1)−E(𝜃j|Tj=0)SD(𝜃j|Tj).

This approach has the important benefit of providing the same point estimate regardless of the chosen identification constraints, and is analogous to estimating a standardized eﬀect size such as Cohen’s d when the outcome variable is observed (Stoetzer et al., 2024).

f() is a link function to allow for both linear and non-linear models. When all λi=1, and the link function is logistic, i.e., f(x)=logit(x), the LVM is equivalent to the One Parameter Logistic (1PL) Explanatory Item Response Model (EIRM; De Boeck, 2004, De Boeck & Wilson, 2016, Wilson et al., 2008). Note that the variance of the error term in the logistic model is fixed at π23 for model identification (Breen et al., 2018; Mood, 2010). When λi are freely estimated and the link function is an identity, i.e., f(x)=x, the LVM is a linear Structural Equation Model (SEM). While LVMs such as the EIRM and SEM can be more complex to interpret than two-step approaches, LVMs estimate associations among latent variables theoretically stripped of measurement error. LMVs therefore deattenuate estimates of standardized regression coeﬃcients because, unlike σY, σζ is a consistent estimator of the residual SD of 𝜃j, thus counteracting the eﬀects of measurement error compared to regression on observed scores (Briggs, 2008; Christensen, 2006; Stoetzer et al., 2024; Zwinderman, 1991), suggesting that LVMs may provide more accurate tests of between-group diﬀerences such as causal treatment eﬀects.

Model Assumptions

In addition to the standard causal inference assumptions of the stable unit treatment value assumption (SUTVA) and unconfoundedness of the treatment assignment, causal inference in latent variable contexts requires some additional assumptions. First, Equation 4 assumes full measurement invariance between the treatment and control groups. That is, other than the treatment eﬀect on the latent trait, the items function equivalently between the groups. An example violation of this assumption could include “response shift,” whereby treatment causes participants to interpret items diﬀerently such that diﬀerences in post-intervention scores reflect changes to item functioning rather than changes to the latent variable (Olivera-Aguilar & Rikoon, 2023). Stoetzer et al. (2024, p. 5) describe this assumption as “unconfounded measurement”. More flexible models, such as multi-group estimation that allow for heteroskedasticity (Kim & Yoon, 2011) or multidimensional models that allow for response style eﬀects (Deng et al., 2018) can relax these assumptions but are beyond the scope of the present study. Moreover, such approaches are relatively rare in applied causal inference (Soland, Edwards, & Talbert, 2024; Soland & Gilbert, 2025). We emphasize that while measurement invariance between treatment and control groups can and should be tested, such tests require item-level data, and therefore are diﬃcult or impossible to assess in secondary analyses of sum or factor scores.

Figure 1 provides Directed Acyclic Graphs (DAGs) for the two-step and simultaneous estimation approaches and highlights a closely related assumption that is necessary for causal inference with latent variables. Namely, we must assume that the treatment eﬀect on the individual item responses is fully mediated by 𝜃j, similar to the exclusion restriction assumption in instrumental variables estimation (Halpin & Gilbert, 2024; Stoetzer et al., 2024; VanderWeele & Vansteelandt, 2022). In other words, a treatment that improves 𝜃j is statistically equivalent to one that makes the items easier (Gilbert, Miratrix, et al., 2025; San Martín, 2016).⁴4

Under certain interpretations, the exclusion restriction assumption can be subsumed by the unconfounded measurement assumption because a direct eﬀect on item performance beyond the latent trait implies that the items are functioning diﬀerentially between the treatment and control groups. However, alternative conceptions allow for item-specific treatment eﬀects without invoking changes to the item parameters, but rather by introducing an additional item-specific treatment sensitivity term to the model (e.g., Gilbert, Himmelsbach, et al., 2025a, Figure F.1.)

Summary

In sum, the applied researcher faces many choices in model selection when test score data are used as outcomes in a causal inference context: to use a one-step or two-step approach, to weight or not to weight the item responses in the construction of scores, to use CTT or IRT/FA, and so forth, as summarized in Table 1. While exploratory data analysis can shed light on, for example, whether a 1PL or 2PL IRT model is a better fit to the data, to what extent does allowing for varying item discriminations/loadings in the estimation of the latent trait score aﬀect the bias, precision, and power of causal estimates? Are certain models consistently more robust than alternatives? This study seeks to shed light on these questions and leverage measurement principles for better application of causal inference in evaluation research by using Monte Carlo simulation and an empirical application to examine the performance of several models under varying conditions of test score construction and model estimation.

Figure 1Directed Acyclic Graphs for Two-Step and Simultaneous Estimation

Note. Squares indicate observed variables, hollow circles indicate latent variables. In are item responses, Tj is the treatment indicator, and β1 represents the average treatment eﬀect. σζ is the residual standard deviation of the latent variable. See also Figure 1 in Stoetzer et al. (2024).

Table 1Summary of Analysis Options for Estimating Treatment Effects on Latent Outcomes

Measurement Framework	Item Types	ATE Estimation Strategy	Weighting	Implementation
CTT	dichotomous, polytomous, continuous	Two-step	Equal	Generate sum score, run regression
IRT	dichotomous, polytomous	Two-step	Equal or Variable	Generate IRT score, run regression
FA	continuous	Two-step	Equal or Variable	Generate FA score, run regression
CTT	dichotomous, polytomous, continuous	Simultaneous	Equal	LMM on item responses^a
IRT	dichotomous, polytomous	Simultaneous	Equal or Variable	EIRM on item responses
FA	continuous	Simultaneous	Equal or Variable	SEM on item responses

Note. CTT = Classical Test Theory, IRT = Item Response Theory, FA = Factor Analysis, LMM = Linear Mixed Model, EIRM = Explanatory Item Response Model, SEM = Structural Equation Model.

^aThe linear mixed model applicable to simultaneous CTT estimation is equivalent to the FA model with equal loadings when the item parameters are fixed; see Borsboom (2005) for a discussion of underlying equivalencies between models.

Method Data Generating Process

The simulation and data analysis procedures are implemented in R. In total, we simulate 18,000 data sets (1,000 data sets per 18 data-generating conditions) and apply four analytic models—sum score, factor score, equal loading SEM, and variable loading SEM—to each, for a total of 72,000 results. We use a full factorial design to assess the performance of each model under a range of treatment eﬀect sizes and items of varying discriminating power. To maintain focus on the contrasts between the models and the eﬀects of item characteristics, we fix the number of subjects at 500 and the number of items at 10 to represent a moderate sample size and moderate test length. The latent trait scores 𝜃j are drawn from N(0+β1treatj,1) and the item intercepts bi are drawn from N(0,1). The latent variables are converted to continuous observed scores for each item using Equation 4. The residual SD for each item σεi is defined as 1−λi2 so that items with higher loadings have lower residual variances. The simulation factors include null, moderate, and large treatment eﬀect sizes (0, 0.2, or 0.4 SDs on the latent trait, based on empirical distributions of eﬀect sizes in education research, e.g., Kraft, 2020), moderate and high average factor loadings (μλ=0.4,0.6), and constant, moderately variable, or highly variable factor loadings (λi∼Unif(μλ−x,μλ+x) where x=0,0.15,0.3).

For each simulated data set, we estimate the treatment eﬀect and associated z-statistic, p-value, and whether the null hypothesis was rejected under each model. The models for the sum score and factor scores are equivalent OLS regression models and the SEMs are estimated using maximum likelihood with fixed item intercepts. In all models, the parameter of interest is the ATE β1, and the errors are assumed to be normally distributed with mean 0 and constant variance and uncorrelated with the predictors. We also calculated ω for each simulated test as an estimate of ρ to assess the eﬀect of applying EIV corrections to the two-step models. To render each ATE comparable, we divide β1 by the RMSE of the regression model to standardize the coeﬃcients, as the RMSE represents the estimated pooled (i.e., within-group) standard deviation of the latent trait 𝜃j. Thus, the standardized coeﬃcients are equivalent to Cohen’s d eﬀect size.

We use lm to fit the OLS models, lme4 to fit the FA models with equal loadings (Bates et al., 2015), and lavaan (Rosseel, 2012) to fit the FA models with variable loadings.⁵5

We use lme4 to fit the equal-loading FA models because the syntax is more convenient than that of lavaan. We use MLE rather than the default REML in lme4 so that the estimation procedures are identical across packages. Extensions to lme4 such as PLmixed and galamm allow for more complex measurement models to be fit with lme4 syntax (Rockwood & Jeon, 2019; Sørensen, 2024).

For the two-step approach using factor scores as outcomes, we use expected a posteriori (EAP) scores (Chapman, 2022; Lu et al., 2005; Muraki & Engelhard, 1985) as outcome variables. While there are many factor scoring options available in addition to EAP (Grice, 2001), such as maximum a posteriori (MAP), maximum likelihood, test characteristic curve (TCC, common in IRT, Baker et al., 2017), we use EAP scoring because it is among the most commonly used approaches and the Bayesian shrinkage of the EAP estimation reduces the SD of the observed scores, which is the primary cause of attenuation bias. For the EIV corrected two-step models, we calculate ω using the R package psych (Revelle & Condon, 2019). We collect the default SE provided by each model and we can estimate the true SE by calculating the SD of the point estimates.

Results

Figure 2 shows the mean bias and Monte Carlo 95% confidence intervals for each method across all simulation conditions. We see that when the ATE is 0, bias is negligible across all conditions. However, when the ATE is positive, the two-step procedures are downwardly biased, the bias is proportional to the treatment eﬀect size (as expected given that the standardized eﬀect size is a ratio), and the bias is more severe when the average loadings are lower because lower average loadings translate to lower test reliability. In contrast, the LVMs do not show the same pattern of attenuation and are approximately unbiased across all conditions. Crucially, the performance of the SEM assuming equal factor loadings is indistinguishable from the SEM allowing for variable loadings, even when the range of loadings is high. The factor score allowing for variable weights only slightly outperforms the sum score when the range of loadings is highest, but its performance is nonetheless bested by the equal-weight SEM.

These results clearly illustrate that attenuation bias due to measurement error with standardized outcome variables is a more serious concern than the decision of whether to weight or not to weight the item responses. When we correct the two-step procedures for measurement error by dividing the coeﬃcients by ω as shown in Figure 3, we find that the performance of the sum score is indistinguishable from the LVMs.⁶6

Note that an EIV correction based on α will potentially overcorrect the factor score when the loadings are extremely variable. This occurs because the calculation of α assumes equal loadings and provides a lower bound on test reliability (McNeish & Wolf, 2020, p. 2292). When this assumption is not met, ρ>α so dividing by α provides an overcorrection for the factor score model. Thus, if the range of loadings is large, a measure of reliability appropriate for models with variable factor loadings such as ω (Hayes & Coutts, 2020) should be applied, as we apply in our analyses. We also note that the concept of reliability is less straightforward in an IRT framework because the conditional standard error of measurement varies across the 𝜃 scale, e.g., Kim and Feldt (2010); Lockwood and McCaﬀrey (2014). However, marginal reliability in IRT provides a conceptually similar quantity that may be used for EIV corrections if desired, and ω functions well even with dichotomous items in moderate sample sizes (Padilla & Divers, 2016).

Figure 2Estimated Bias by Method, Standardized Scores

Note. The points indicate the bias in standard deviation units, without the EIV correction applied to the two-step models. The bars indicate the Monte Carlo 95% CIs, calculated using the formula sn where n is the number of simulation trials and s is the standard deviation of the point estimates.

Figure 3Estimated Bias by Method, EIV Correction Applied to Two-Step Scores

Note. The points indicate the bias in standard deviation units, with the EIV correction applied to the two-step models. The bars indicate the Monte Carlo 95% CIs, calculated using the formula sn where n is the number of simulation trials and s is the standard deviation of the point estimates.

We include additional simulation results in Appendix A. In short, results show that diﬀerences between all models are minimal in terms of absolute precision (the SD of the point estimates), standard error calibration (the mean model-based SE as a proportion of the true SE), false positive rates, and statistical power, with the EIV correction applied to the two-step models. These results suggest that once the attenuation bias of the two-step models has been corrected, the choice of model does not appear to have strong impacts on the other statistical properties of the ATE. As a final sensitivity check, we rerun analogous simulations using dichotomous responses and IRT models rather than continuous responses and FA models. We find essentially identical results to those reported here, suggesting that our findings are consistent across multiple specifications and outcome item types. We include the full IRT simulation analysis results in our supplement.

Empirical Application Data Source

To illustrate how the issues of scoring and model selection can play out in practice, we use a selection of empirical datasets from RCTs containing item-level outcome data from Gilbert, Himmelsbach, et al. (2025a). The authors examine 75 datasets from 48 RCTs with item-level outcome data to examine models for item-level heterogeneous treatment eﬀects, in which treatments may uniquely impact each item of an outcome measure. Here, we take a random sample of 10 datasets to illustrate the results of diﬀerent analytic approaches to estimating average treatment eﬀects across a range of contexts. Table 2 provides a summary of each dataset. The datasets cover a range of geographic regions, outcome measures, and show a wide range of estimated reliability with ω ranging from 0.43 to 0.95. In contrast to the simulation, we cannot know the true value of the treatment eﬀect on the latent trait in the empirical data. However, we can still examine the results of the diﬀerent analytic models explored in the simulation and see how sensitive the results are to the modeling choices.

Table 2Studies Included in Our Empirical Analysis

ID	Citation	Location	Outcome	N	I	ω
1	Bruhn et al. (2016)	Brazil	Financial Literacy	16852	10	0.57
2	Kim et al. (2024)	USA	Reading Comprehension	1335	29	0.86
3	Kim et al. (2021)	USA	Vocabulary	2588	24	0.79
4	Romero et al. (2020)	Liberia	Raven’s Progressive Matrices	3510	10	0.63
5	Duflo et al. (2024)	Ghana	English	27201	21	0.89
6	Carpena (2024)	India	Health Knowledge	839	21	0.75
7	Luo et al. (2019)	China	Parenting Beliefs	449	11	0.43
8	Lyall et al. (2020)	Afghanistan	Pro-government attitudes	1853	9	0.49
9	Banerjee et al. (2017)	India	Math	9193	30	0.93
10	Banerjee et al. (2017)	India	Math	5356	30	0.95

Note. N = number of subjects, I = number of items, ω = estimated reliability of the outcome measure.

Analytic Models

We apply eight estimators to produce treatment eﬀects from each dataset. Because all item responses are dichotomous, we use logistic IRT models instead of the linear SEM. For two-step approaches, we calculate the mean score, 1PL IRT score, and 2PL IRT score, and regress the resulting scores on the treatment indicator and standardize the results to calculate Cohen’s d. We then apply the EIV correction to these three estimates, dividing the estimated ATE by ω. (Our supplemental simulations of IRT models show that the classical EIV correction works well even though single-number estimates of reliability are less common in IRT frameworks where precision varies over the range of the latent trait, see Nicewander, 2018; Raju et al., 2007). We use mean scores instead of sum scores because there is some missing item response data. For one-step approaches, we estimate 1PL and 2PL EIRMs that allow for a treatment eﬀect directly on the latent trait.

Results

Figure 4 shows the point estimates and 95% CIs for the standardized treatment eﬀect from the eight estimators applied to our 10 datasets. The datasets are ordered by ω from lowest to highest. When ω is high (datasets 2, 5, 9, 10), diﬀerences between the estimators are minimal, as expected. Datasets with moderate ω (datasets 1, 4, 6, 3) most clearly mirror the simulation results, showing estimates from two-step models generally lower than alternative approaches, and minimal diﬀerences between EIV corrected estimates and estimates from the one-step models. In dataset 8, the ATE is near 0 and all estimators are very close to the null value. Only when ω is very low, as in dataset 7 (ω=.43), do we see meaningful diﬀerences between weighted and unweighted estimators, with the 2PL approaches yielding negative point estimates and the 1PL approaches yielding positive point estimates. Taken as a whole, these results again suggest that once EIV corrections are applied, diﬀerences between estimators are likely to be minor in all but the most extreme cases.

Figure 4Estimated Treatment Eﬀects for 10 RCT Datasets

Note. The points indicate the estimated treatment eﬀect size in standard deviation units. The bars indicate the model-based 95% CIs. 1PL and 2PL indicate IRT-based scores are derived from one-parameter logistic and two-parameter logistic models, respectively. The EIRM is equivalent to an SEM with a logistic link function.

Discussion

Because psychometric outcome measures are a noisy proxy of a latent trait of interest, they suﬀer from measurement error, which results in negatively biased treatment eﬀect estimates when outcome variables are standardized. Simulation results show that when applied to outcome data with diﬀerent properties, the bias is substantial when treatment eﬀect sizes are high, as predicted by Classical Test Theory. However, when the EIV correction is applied and the standardized coeﬃcients are divided by ω, diﬀerences in model performance are negligible under most conditions. Thus, the very process that makes varying statistical models comparable to one another—standardization—biases two-step models, and the eﬀect of this bias dominates other features of the data generating process, including the variability of factor loadings used to create scoring weights. When left unaddressed, such bias could lead researchers and policymakers to erroneous conclusions about the eﬃcacy of interventions.

As a concrete example of how attenuation bias could aﬀect substantive results, consider meta-analyses that pool the eﬀects of interventions on test score outcomes across studies. Even if the true underlying eﬀect is equal across all studies, when the outcome measures are of varying reliability, the estimated eﬀect sizes will diﬀer due to attenuation bias, even as the participant and study sample sizes grow to infinity (Borenstein et al., 2009). Thus, failing to adjust standardized eﬀect sizes for measurement error may lead to spurious conclusions about treatment heterogeneity. The degree to which such issues may be related to the ongoing “replication crisis” in psychology and other fields is an open question (Open Science Collaboration, 2015), but it seems plausible that variation in measurement practices may play a role in explaining variation in conclusions across studies (Flake et al., 2017; Flake & Fried, 2020; Pedersen et al., 2025).

Our interpretation of these results is that researchers may be overly focused on second-order measurement issues, such as the use of variable factor loadings that function as optimal scoring weights (McNeish, 2022; McNeish & Wolf, 2020), rather than the first-order issue of attenuation of standardized coeﬃcients for measurement error in the outcome variable (Shear & Briggs, 2024; Widaman & Revelle, 2023). That is, when the EIV correction is applied, diﬀerences between the simplest standardized sum score model and the more complex LVMs are negligible in terms of bias, precision, and statistical power in the estimation of treatment eﬀects, and this result holds even when the variability of factor loadings is high. Thus, when causal inference at a single time point is the primary goal, the use of sum scores with the EIV correction is likely to be suﬃcient for many applications in applied program evaluation.

These results should not detract from other uses of LVMs. Clearly, IRT/FA methods are essential for piloting measures, identifying poorly functioning items (Jessen et al., 2018), diﬀerential item functioning analysis (Osterlind & Everson, 2009), vertical scaling (Briggs & Domingue, 2013), linking (Lee & Lee, 2018), and addressing missing data (Gilbert, 2024a), and LVMs can easily be expanded to incorporate complex relationships among many latent variables or multidimensional constructs at several time points (Kline, 2023). A particularly valuable use case for LVMs in causal inference would be settings in which treatment may diﬀerentially impact individual items and the LVM can provide insights on treatment heterogeneity, such as “item-level heterogeneous treatment eﬀects” that would be masked in a two-step analysis (Ahmed et al., 2024; Gilbert, 2024b; Gilbert, Himmelsbach, et al., 2025a; Gilbert et al., 2023; Sales et al., 2021), diﬀerential growth by item type (Briggs, 2021; Gilbert et al., 2024; Naumann et al., 2014), or the appropriate interpretation of interaction eﬀects (Domingue, Kanopka, Trejo, et al., 2024; Gilbert, Miratrix, et al., 2025). However, when all respondents answer the same items at a single time point, and only average treatment eﬀects are of interest, the results appear relatively insensitive to the methods employed when the EIV correction is applied. Therefore, the benefits of interpretability and computational complexity may favor the EIV-corrected standardized sum score in many straightforward causal inference applications, despite arguments that the sum score can be a suboptimal choice (in general) because the constraint of equal factor loadings imposed by the sum score is rarely met in real data (McNeish & Wolf, 2020).

While the results of this study provide evidence for the importance of EIV corrections in two-step analyses of standardized test score outcome variables, several limitations merit consideration. For example, the data generating process employed in this study examines the simple case of individual randomization with no covariates beyond the treatment indicator, and thus may be extended to explore how measurement model selection may impact the estimation of heterogeneous treatment eﬀects, the eﬀects of predictive covariates, multilevel structures such as multi-site or cluster-randomized trials, or alternative experimental and quasi-experimental contexts such as regression discontinuity, diﬀerence-in-diﬀerences, instrumental variables, and longitudinal analyses, though an emerging literature on the synthesis of latent variable and causal inference methods has begun to shed light on these areas (Gilbert, Himmelsbach, et al., 2025a; Gilbert et al., 2024; Gilbert, Miratrix, et al., 2025; Kuhfeld & Soland, 2022, 2023; Mayer, 2019; Miratrix et al., 2021; Rabbitt, 2018; Soland, 2022, 2023; Soland et al., 2023).

A related issue is measurement error in covariates, which we did not explore in this study. In theory, in an RCT, any bias induced by covariate measurement error will aﬀect treatment and control groups equally and thus should not aﬀect estimation of the ATE (Lockwood & McCaﬀrey, 2014). In observational studies, however, covariate measurement error can lead to biases when the covariates do not fully control for relevant diﬀerences between groups (Cook et al., 2009; Sengewald & Pohl, 2019; Sengewald et al., 2019). Factor models with latent covariates and outcomes are easily estimable in lavaan when the indicators are continuous, however, software options for categorical responses common in the social sciences are currently less widely used in R, though recent developments such as galamm (Sørensen, 2024) and EﬀectLiteR (Mayer et al., 2016, 2020; Sengewald & Mayer, 2024) may provide attractive options. We view exploration of how measurement error in both covariates and outcomes influences results in experimental and observational contexts to be a promising avenue for future research.

In conclusion, results of causal analyses of psychometric outcome data are sensitive to model selection, and the eﬀects of attenuation bias are much more consequential than the use of scoring weights. When researchers do not adjust for measurement error with EIV corrections or use LVMs, standardized treatment eﬀect estimates will be downwardly biased and thus understate estimates of treatment impact. When the EIV correction is applied, the impact of model selection will be reduced, demonstrating how the application of psychometric principles can improve causal inference in evaluation research.

References

Ahmed, I., Bertling, M., Zhang, L., Ho, A. D., Loyalka, P., Xue, H., Rozelle, S., & Benjamin, W. (2024). Heterogeneity of item-treatment interactions masks complexity and generalizability in randomized controlled trials, 1–22. Journal of Research on Educational Effectiveness. 10.1080/19345747.2024.2361337

Angrist, J. D., & Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton University Press.

Asher, H. B. (1974). Some consequences of measurement error in survey data. American Journal of Political Science, 18(2), 469–485.

Baker, F. B., & Kim, S.-H. (2017). The test characteristic curve. The basics of Item Response Theory using R (pp. 55–67). Springer.

Banerjee, A., Banerji, R., Berry, J., Duflo, E., Kannan, H., Mukerji, S., Shotland, M., & Walton, M. (2017). From proof of concept to scalable policies: Challenges and solutions, with an application. Journal of Economic Perspectives, 31 (4), 73–102. 10.1257/jep.31.4.73

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67 (1), 1–48. 10.18637/jss.v067.i01

Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34 (3), 197–211. 10.1111/J.1745-3984.1997.Tb00515.X

Bollen, K. A. (1989). Structural equations with latent variables. John Wiley & Sons.

Borenstein, M., Hedges, L. V., Higgins, J. P., & Rothstein, H. R. (2009). Introduction to meta-analysis. John Wiley & Sons.

Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge University Press.

Brakenhoff, T. B., Mitroiu, M., Keogh, R. H., Moons, K. G., Groenwold, R. H., & van Smeden, M. (2018). Measurement error is often neglected in medical literature: A systematic review. Journal of Clinical Epidemiology, 98, 89–97. 10.1016/j.jclinepi.2018.02.023

Breen, R., Karlson, K. B., & Holm, A. (2018). Interpreting and understanding logits, probits, and other nonlinear probability models. Annual Review of Sociology, 44, 39–54. 10.1146/annurev-soc-073117-041429

Brennan, R. L. (2010). Generalizability theory and classical test theory. Applied Measurement in Education, 24 (1), 1–21. 10.1080/08957347.2011.532417

Briggs, D. C. (2008). Using explanatory item response models to analyze group differences in science achievement. Applied Measurement in Education, 21 (2), 89–118. 10.1080/08957340801926086

Briggs, D. C. (2021). Historical and conceptual foundations of measurement in the human sciences: Credos and controversies. Routledge.

Briggs, D. C., & Domingue, B. (2013). The gains from vertical scaling. Journal of Educational and Behavioral Statistics, 38 (6), 551–576. 10.3102/1076998613508317

Briggs, D. C., & Weeks, J. P. (2009). The sensitivity of value-added modeling to the creation of a vertical score scale. Education Finance and Policy, 4 (4), 384–414. https://ideas.repec.org/a/tpr/edfpol/v4y2009i4p384-414.html

Bruhn, M., de Souza Leão, L., Legovini, A., Marchetti, R., & Zia, B. (2016). The impact of high school financial education: Evidence from a large-scale evaluation in Brazil. American Economic Journal: Applied Economics, 8 (4), 256–295. 10.1257/app.20150149

Camilli, G. (2018). IRT scoring and test blueprint fidelity. Applied Psychological Measurement, 42 (5), 393–400. 10.1177/0146621618754897

Carpena, F. (2024). Entertainment-education for better health: Insights from a field experiment in India. Journal of Development Studies, 60 (5, 745–762. 10.1080/00220388.2024.2312832

Carroll, R. J., Delaigle, A., & Hall, P. (2009). Nonparametric prediction in measurement error models. Journal of the American Statistical Association, 104 (487), 993–1003. 10.1198/jasa.2009.tm07543

Castellano, K. E., & Ho, A. D. (2015). Practical differences among aggregate-level conditional status metrics: From median student growth percentiles to value-added models. Journal of Educational and Behavioral Statistics, 40 (1), 35–68. 10.3102/1076998614548485

Chan, W., & Hedges, L. V. (2022). Pooling interactions into error terms in multisite experiments. Journal of Educational and Behavioral Statistics, 47 (6), 639–665. 10.3102/10769986221104800

Chapman, R. (2022). Expected a posteriori scoring in PROMIS®. Journal of Patient-Reported Outcomes, 6, 59. 10.1186/s41687-022-00464-9

Christensen, K. B. (2006). From Rasch scores to regression. Journal of Applied Measurement, 7 (2), 184–191.

Cole, D. A., & Preacher, K. J. (2014). Manifest variable path analysis: potentially serious and misleading consequences due to uncorrected measurement error. Psychological Methods, 19 (2), 300–315. 10.1037/a0033805

Cook, T. D., Steiner, P. M., & Pohl, S. (2009). How bias reduction is affected by covariate choice, unreliability, and mode of data analysis: Results from two types of within-study comparisons. Multivariate Behavioral Research, 44 (6), 828–847. 10.1080/00273170903333673

Cox, K., & Kelcey, B. (2019). Optimal design of cluster-and multisite-randomized studies using fallible outcome measures. Evaluation Review, 43 (3–4), 189–225. 10.1177/0193841X19870878

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16 (3), 297–334. 10.1007/BF02310555

De Boeck, P. (2004). Explanatory item response models: A generalized linear and nonlinear approach. Springer Science & Business Media.

De Boeck, P. (2008). Random item IRT models. Psychometrika, 73 (4), 533–559. 10.1007/s11336-008-9092-x

De Boeck, P., & Wilson, M. R. (2016). Explanatory item response models. In W. J. van der Linden (Ed.), Handbook of item response theory (pp. 593–608). Chapman; Hall/CRC.

Deng, S., E. McCarthy, D., E. Piper, M., B. Baker

, & Bolt, D. M. (2018). Extreme response style and the measurement of intra-individual variability in affect. Multivariate Behavioral Research, 53 (2), 199–218. 10.1080/00273171.2017.1413636

DeVellis, R. F. (2006). Classical test theory. Medical Care, 44 (11 Suppl. 3), S50–S59. 10.1097/01.mlr.0000245426.10853.30

Domingue, B., Braginsky, M., Caffrey-Maffei, L. A., Gilbert, J., Kanopka, K., Kapoor, R., Liu, Y., Nadela, S., Pan, G., Zhang, L., Zhang, S., & Frank, M. C. (2024). Solving the problem of data in psychometrics: An introduction to the Item Response Warehouse (IRW). PsyArXiv. 10.31234/osf.io/7bd54

Domingue, B. W., Kanopka, K., Kapoor, R., Pohl, S., Chalmers, R. P., Rahal, C., & Rhemtulla, M. (2024). The InterModel Vigorish as a lens for understanding (and quantifying) the value of item response models for dichotomously coded items. Psychometrika, 89 (3), 1034–1054. 10.1007/s11336-024-09977-2

Domingue, B. W., Kanopka, K., Trejo, S., Rhemtulla, M., & Tucker-Drob, E. M. (2024). Ubiquitous bias and false discovery due to model misspecification in analysis of statistical interactions: The role of the outcome’s distribution and metric properties. Psychological Methods, 29 (6), 1164–1179. 10.1037/met0000532

Duflo, A., Kiessel, J., & Lucas, A. M. (2024). Experimental evidence on four policies to increase learning at scale. Economic Journal, 134 (661), 1985–2008. 10.1093/ej/ueae003

Ferrando, P. J., & Chico, E. (2007). The external validity of scores based on the two-parameter logistic model: Some comparisons between irt and ctt. Psicologica: International Journal of Methodology and Experimental Psychology, 28 (2), 237–257.

Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8 (4), 370–378. 10.1177/1948550617693063

Flake, J. K., & Fried, E. I. (2020). Measurement schmeasurement: Questionable measurement practices and how to avoid them. Advances in Methods and Practices in Psychological Science, 3 (4), 456–465. 10.1177/2515245920952393

Fuller, W. A., & Hidiroglou, M. A. (1978). Regression estimation after correcting for attenuation. Journal of the American Statistical Association, 73 (361), 99–104. 10.1080/01621459.1978.10480011

Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time (pp. 1–17). Department of Statistics, Columbia University.

Gilbert, J. B. (2024a). Estimating treatment effects with the explanatory item response model. Journal of Research on Educational Effectiveness, 1–19. 10.1080/19345747.2023.2287601

Gilbert, J. B. (2024b). Modeling item-level heterogeneous treatment effects: A tutorial with the glmer function from the lme4 package in R. Behavior Research Methods, 56 (5), 5055–5067. 10.3758/s13428-023-02245-8

Gilbert, J. B. (2025). ResearchBox #2289 – ‘How measurement affects causal inference’ [Replication toolkit]. ResearchBox. https://researchbox.org/2289.

Gilbert, J. B., Himmelsbach, Z., Soland, J., Joshi, M., & Domingue, B. W. (2025a). Estimating heterogeneous treatment effects with item-level outcome data: Insights from Item Response Theory. Journal of Policy Analysis and Management, 1–34. 10.1002/pam.70025

Gilbert, J. B., Himmelsbach, Z., Soland, J., Joshi, M., & Domingue, B. W. (2025b). “Replication data for: Estimating heterogeneous treatment effects with item-level outcome data: Insights from Item Response Theory” [Data, Study Materials]. Harvard Dataverse. 10.7910/DVN/C4TJCA.

Gilbert, J. B., Kim, J. S., & Miratrix, L. W. (2023). Modeling item-level heterogeneous treatment effects with the explanatory item response model: Leveraging large-scale online assessments to pinpoint the impact of educational interventions. Journal of Educational and Behavioral Statistics, 48 (6), 889–913. 10.3102/10769986231171710

Gilbert, J. B., Kim, J. S., & Miratrix, L. W. (2024). Leveraging item parameter drift to assess transfer effects in vocabulary learning. Applied Measurement in Education, 37 (3), 240–257. 10.1080/08957347.2024.2386934

Gilbert, J. B., Miratrix, L. W., Joshi, M., & Domingue, B. W. (2025). Disentangling person-dependent and item-dependent causal effects: Applications of item response theory to the estimation of treatment effect heterogeneity. Journal of Educational and Behavioral Statistics, 50 (1), 72–101. 10.3102/10769986241240085

Gillard, J. (2010). An overview of linear structural models in errors in variables regression. REVSTAT-Statistical Journal, 8 (1), 57–80. 10.57805/revstat.v8i1.90

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6 (4), 430–450. 10.1037/1082-989X.6.4.430

Halpin, P., & Gilbert, J. (2024). Testing whether reported treatment effects are unduly dependent on the specific outcome measure used. ArXiv. 10.48550/ARXIV.2409.03502

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12 (3), 38–47. 10.1111/j.1745-3992.1993.tb00543.x

Hambleton, R. K., & Van der Linden, W. J. (1982). Advances in item response theory and applications: An introduction. Applied Psychological Measurement, 6 (4), 373–378. 10.1177/014662168200600401.

Harwell, M. R., & Gatti, G. G. (2001). Rescaling ordinal data to interval data in educational research. Review of Educational Research, 71 (1), 105–131. 10.3102/00346543071001105

Hayes, A. F., & Coutts, J. J. (2020). Use Omega rather than Cronbach’s Alpha for estimating reliability. but… Communication Methods and Measures, 14 (1), 1–24. 10.1080/19312458.2020.1718629

Hedges, L. V. (1981). Distribution theory for Glass’s estimator of effect size and related estimators. Journal of Educational Statistics, 6 (2), 107–128. 10.2307/1164588

Hontangas, P. M., De La Torre, J., Ponsoda, V., Leenen, I., Morillo, D., & Abad, F. J. (2015). Comparing traditional and IRT scoring of forced-choice tests. Applied Psychological Measurement, 39 (8), 598–612. 10.1177/0146621615585851

Hutcheon, J. A., Chiolero, A., & Hanley, J. A. (2010). Random measurement error and regression dilution bias. BMJ, 340 c2289. 10.1136/bmj.c2289

Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.

Jabrayilov, R., Emons, W. H., & Sijtsma, K. (2016). Comparison of classical test theory and item response theory in individual change assessment. Applied Psychological Measurement, 40 (8), 559–572. 10.1177/0146621616664046

Jackson, P. H. (1973). The estimation of true score variance and error variance in the classical test theory model. Psychometrika, 38 (2), 183–201. 10.1007/BF02291113

Jessen, A., Ho, A. D., Corrales, C. E., Yueh, B., & Shin, J. J. (2018). Improving measurement efficiency of the inner ear scale with item response theory. Otolaryngology–Head and Neck Surgery, 158 (6), 1093–1100. 10.1177/0194599818760528

Kim, J. S., Burkhauser, M. A., Mesite, L. M., Asher, C. A., Relyea, J. E., Fitzgerald, J., & Elmore, J. (2021). Improving reading comprehension, science domain knowledge, and reading engagement through a first-grade content literacy intervention. Journal of Educational Psychology, 113 (1), 3–26. 10.1037/edu0000465

Kim, S., & Feldt, L. S. (2010). The estimation of the IRT reliability coefficient and its lower and upper bounds, with comparisons to CTT reliability statistics. Asia Pacific Education Review, 11 (2), 179–188. 10.1007/s12564-009-9062-8

Kim, J. S., Gilbert, J. B., Relyea, J. E., Rich, P., Scherer, E., Burkhauser, M. A., & Tvedt, J. N. (2024). Time to transfer: Long-term effects of a sustained and spiraled content literacy intervention in the elementary grades. Developmental Psychology, 60 (7), 1279–1297. 10.1037/dev0001710

Kim, E. S., & Yoon, M. (2011). Testing Measurement Invariance: A Comparison of Multiple-Group Categorical CFA and IRT. Structural Equation Modeling: A Multidisciplinary Journal, 18 (2), 212–228. 10.1080/10705511.2011.557337

King, G., & Nielsen, R. (2019). Why propensity scores should not be used for matching. Political Analysis, 27 (4), 435–454. 10.1017/pan.2019.11

Kline, R. B. (2023). Principles and practice of structural equation modeling. Guilford Publications.

Kraft, M. A. (2020). Interpreting effect sizes of education interventions. Educational Researcher, 49 (4), 241–253. 10.3102/0013189X20912798

Kuhfeld, M., & Soland, J. (2022). Avoiding bias from sum scores in growth estimates: An examination of IRT-based approaches to scoring longitudinal survey responses. Psychological Methods, 27 (2), 234–260. 10.1037/met0000367

Kuhfeld, M., & Soland, J. (2023). Scoring assessments in multisite randomized control trials: Examining the sensitivity of treatment effect estimates to measurement choices. Psychological Methods. 10.1037/met0000633

Lee, W.-C., & Lee, G. (2018). IRT linking and equating. In P. Irwing, T. Booth, & D. J. Hughes (Eds.),Wiley handbook of psychometric testing: A multidisciplinary reference on survey, scale and test development (pp. 639–673). Wiley Blackwell. 10.1002/9781118489772.ch21

Lewis, C. (2006). 2 selected topics in classical test theory. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26, pp. 29–43). 10.1016/S0169-7161(06)26002-4

Liu, K. (1988). Measurement error and its impact on partial correlation and multiple linear regression analyses. American Journal of Epidemiology, 127 (4), 864–874. 10.1093/oxfordjournals.aje.a114870

Liu, Y., & Pek, J. (2024). Summed versus estimated factor scores: Considering uncertainties when using observed scores. Psychological Methods. 10.1037/met0000644

Lockwood, J., & McCaffrey, D. F. (2014). Correcting for test score measurement error in ANCOVA models for estimating treatment effects. Journal of Educational and Behavioral Statistics, 39 (1), 22–52. 10.3102/107699861350940

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Routledge.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.

Lu, I. R., Thomas, D. R., & Zumbo, B. D. (2005). Embedding IRT in structural equation models: A comparison with regression based on irt scores. Structural Equation Modeling, 12 (2), 263–277. 10.1207/s15328007sem1202_5

Luo, R., Emmers, D., Warrinnier, N., Rozelle, S., & Sylvia, S. (2019). Using community health workers to deliver a scalable integrated parenting program in rural China: A cluster-randomized controlled trial. Social Science & Medicine, 239, 112545. 10.1016/j.socscimed.2019.112545

Lyall, J., Zhou, Y.-Y., & Imai, K. (2020). Can economic assistance shape combatant support in wartime? Experimental evidence from Afghanistan. American Political Science Review, 114 (1), 126–143. 10.2139/ssrn.3026531

Mayer, A. (2019). Causal effects based on latent variable models. Methodology, 15 (S1, 15–28. 10.1027/1614-2241/a000174

Mayer, A., Dietzfelbinger, L., Rosseel, Y., & Steyer, R. (2016). The EffectLiteR Approach for Analyzing Average and Conditional Effects. Multivariate Behavioral Research, 51 (2–3), 374–391. 10.1080/00273171.2016.1151334

Mayer, A., Zimmermann, J., Hoyer, J., Salzer, S., Wiltink, J., Leibing, E., & Leichsenring, F. (2020). Interindividual differences in treatment effects based on structural equation models with latent variables: An EffectLiteR tutorial. Structural Equation Modeling: A Multidisciplinary Journal, 27 (5), 798–816. 10.1080/10705511.2019.1671196

McNeish, D. (2022). Limitations of the sum-and-alpha approach to measurement in behavioral research. Policy Insights from the Behavioral and Brain Sciences, 9 (2), 196–203. 10.1177/23727322221117144

McNeish, D. (2023). Psychometric properties of sum scores and factor scores differ even when their correlation is 0.98: A response to Widaman and Revelle. Behavior Research Methods, 55 (8), 4269–4290. 10.3758/s13428-022-02016-x

McNeish, D. (2024). Practical implications of sum scores being psychometrics’ greatest accomplishment. Psychometrika, 89 (4), 1148-1169. 10.1007/s11336-024-09988-z

McNeish, D., & Wolf, M. G. (2020). Thinking twice about sum scores. Behavior Research Methods, 52, 2287–2305. 10.3758/s13428-020-01398-0

Michell, J. (1994). Numbers as quantitative relations and the traditional theory of measurement. British Journal for the Philosophy of Science, 45 (2), 389–406. 10.1093/bjps/45.2.389

Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88 (3), 355–383. 10.1111/j.2044-8295.1997.tb02641.x

Miratrix, L. W., Weiss, M. J., & Henderson, B. (2021). An applied researcher’s guide to estimating effects from multisite individually randomized trials: Estimands, estimators, and estimates. Journal of Research on Educational Effectiveness, 14 (1), 270–308. 10.1080/19345747.2020.1831115

Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26 (1), 67–82. 10.1093/esr/jcp006

Muraki, E., & Engelhard, G., Jr. (1985). Full-information item factor analysis: Applications of eap scores. Applied Psychological Measurement, 9 (4), 417–430. 10.1177/01466216850090041

Murnane, R. J., & Willett, J. B. (2010). Methods matter: Improving causal inference in educational and social science research. Oxford University Press.

Muthén, B. O. (2002). Beyond SEM: General latent variable modeling. Behaviormetrika, 29 (1), 81–117. 10.2333/bhmk.29.81

Naumann, A., Hochweber, J., & Hartig, J. (2014). Modeling instructional sensitivity using a longitudinal multilevel differential item functioning approach. Journal of Educational Measurement, 51 (4), 381–399. 10.1111/jedm.12051

Nicewander, W. A. (2018). Conditional reliability coefficients for test scores. Psychological Methods, 23 (2), 351–362. 10.1037/met0000132

Olivera-Aguilar, M., & Rikoon, S. H. (2023). Intervention effect or measurement artifact? Using invariance models to reveal response-shift bias in experimental studies. Journal of Research on Educational Effectiveness, 1–29. 10.1080/19345747.2023.2284768

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349 (6251), aac4716. 10.1126/science.aac4716

Osterlind, S. J., & Everson, H. T. (2009). Differential item functioning. SAGE Publications.

Padilla, M. A., & Divers, J. (2016). A comparison of composite reliability estimators: Coefficient Omega confidence intervals in the current literature. Educational and Psychological Measurement, 76 (3), 436–453. 10.1177/0013164415593776

Pedersen, A. P., Kellen, D., Mayo-Wilson, C., Davis-Stober, C. P., Dunn, J. C., Khan, M. A., Stinchcombe, M. B., Kalish, M. L., Tentori, K., & Haaf, J. (2025). Discourse on measurement. Proceedings of the National Academy of Sciences of the United States of America, 122 (5), e2401229121. 10.1073/pnas.2401229121

Rabbitt, M. P. (2018). Causal inference with latent variables from the Rasch model as outcomes. Measurement, 120, 193–205. 10.1016/j.measurement.2018.01.044

Raju, N. S., Price, L. R., Oshima, T., & Nering, M. L. (2007). Standardized conditional SEM: A case for conditional reliability. Applied Psychological Measurement, 31 (3), 169–180. 10.1177/0146621606291569

Reckase, M. D. (2004). The real world is more complicated than we would like. Journal of Educational and Behavioral Statistics, 29 (1), 117–120.

Revelle, W., & Condon, D. M. (2019). Reliability from α to ω: A tutorial. Psychological Assessment, 31 (12), 1395–1411. 10.1037/pas0000754

Rhemtulla, M., & Savalei, V. (2025). Estimated factor scores are not true factor scores. Multivariate Behavioral Research, 60 (3), 598–619. 10.1080/00273171.2024.2444943

Rockwood, N. J., & Jeon, M. (2019). Estimating complex measurement and growth models using the R package PLmixed. Multivariate Behavioral Research, 54 (2), 288–306. 10.1080/00273171.2018.1516541

Romero, M., Sandefur, J., & Sandholtz, W. A. (2020). Outsourcing education: Experimental evidence from Liberia. American Economic Review, 110 (2), 364–400. 10.1257/aer.20181478

Rosenbaum, P. (2017). Observation and experiment: An introduction to causal inference. Harvard University Press.

Rosseel, Y. (2012). lavaan: An R package for Structural Equation Modeling. Journal of Statistical Software, 48 (2), 1–36. 10.18637/jss.v048.i02

Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100 (469), 322–331. 10.1198/016214504000001880

Sales, A., Prihar, E., Heffernan, N., & Pane, J. F. (2021). The effect of an intelligent tutor on performance on specific posttest problems (pp. 206–215). Proceedings of the 14th International Conference on Educational Data Mining (EDM21), Paris, France, June 29–July 2, 2021. International Educational Data Mining Society.

San Martín, E. (2016). Identification of item response theory models. In

W. J.

van der Linden (Ed.), Handbook of item response theory (Vol. 2, pp. 127–150). Chapman; Hall/CRC.

Schafer, W. D. (2006). Growth scales as an alternative to vertical scales. Practical Assessment, Research, and Evaluation, 11 (1), 4. 10.7275/xjkz-7n67

Schielzeth, H. (2010). Simple means to improve the interpretability of regression coefficients. Methods in Ecology and Evolution, 1 (2), 103–113. 10.1111/j.2041-210X.2010.00012.x

Sébille, V., Hardouin, J.-B., Le Néel, T., Kubis, G., Boyer, F., Guillemin, F., & Falissard, B. (2010). Methodological issues regarding power of classical test theory (CTT) and item response theory (IRT)-based approaches for the comparison of patient-reported outcomes in two groups of patients: A simulation study. BMC Medical Research Methodology, 10, 24. 10.1186/1471-2288-10-24

Sengewald, M.-A., & Mayer, A. (2024). Causal effect analysis in nonrandomized data with latent variables and categorical indicators: The implementation and benefits of EffectLiteR. Psychological Methods, 29 (2), 287–307. 10.1037/met0000489

Sengewald, M.-A., & Pohl, S. (2019). Compensation and amplification of attenuation bias in causal effect estimates. Psychometrika, 84 (2), 589–610. 10.1007/s11336-019-09665-6

Sengewald, M.-A., Steiner, P. M., & Pohl, S. (2019). When does measurement error in covariates impact causal effect estimates? Analytic derivations of different scenarios and an empirical illustration. British Journal of Mathematical and Statistical Psychology, 72 (2), 244–270. 10.1111/bmsp.12146

Shear, B. R., & Briggs, D. C. (2024). Measurement issues in causal inference. Asia Pacific Education Review, 25 (3), 719–731. 10.1007/s12564-024-09942-9

Sijtsma, K., Ellis, J. L., & Borsboom, D. (2024). Recognize the value of the sum score, psychometrics’ greatest accomplishment. Psychometrika, 89 (1), 84–117. 10.1007/s11336-024-09964-7

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological science, 22 (11), 1359–1366. 10.1177/0956797611417632

Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. CRC Press.

Soland, J. (2022). Evidence that selecting an appropriate item response theory–based approach to scoring surveys can help avoid biased treatment effect estimates. Educational and Psychological Measurement, 82 (2), 376–403. 10.1177/00131644211007551

Soland, J. (2023). Item response theory models for difference-in-difference estimates (and whether they are worth the trouble). Journal of Research on Educational Effectiveness, 17 (2), 391–421. 10.1080/19345747.2023.2195413

Soland, J., Edwards, K., & Talbert, E. (2024). When should evaluators lose sleep over measurement? Toward establishing best practices. Journal of Research on Educational Effectiveness, 1–33. 10.1080/19345747.2024.2344011

Soland, J., & Gilbert, J. B. (2025). Does socially desirable responding increase after an intervention? Implications for estimating treatment effects. PsyArXiv. 10.31234/osf.io/ujx4n_v1

Soland, J., Johnson, A., & Talbert, E. (2023). Regression discontinuity designs in a latent variable framework. Psychological Methods, 28 (3), 691–704. 10.1037/met0000453

Soland, J., Kuhfeld, M., & Edwards, K. (2024). How survey scoring decisions can influence your study’s results: A trip through the IRT looking glass. Psychological Methods, 29 (5), 1003–1024. 10.1037/met0000506

Sørensen, Ø. (2024). Multilevel semiparametric latent variable modeling in R with “galamm”. Multivariate Behavioral Research, 59 (5), 1098–1105. 10.1080/00273171.2024.2385336

Spybrook, J., Kelcey, B., & Dong, N. (2016). Power for detecting treatment by moderator effects in two-and three-level cluster randomized trials. Journal of Educational and Behavioral Statistics, 41 (6), 605–627. 10.3102/1076998616655442

Stoetzer, L., Zhou, X., & Steenbergen, M. (2024). Causal inference with latent outcomes. American Journal of Political Science, 62 (2), 624–640. 10.1111/ajps.12871

Thissen, D., & Wainer, H.. (Eds.). (2001). An overview of test scoring. Test scoring (pp. 13–32). Taylor & Francis. 10.4324/9781410604729

Timoneda, J. C. (2021). Estimating group fixed effects in panel data with a binary dependent variable: How the LPM outperforms logistic regression in rare events data. Social Science Research, 93, 102486. 10.1016/j.ssresearch.2020.102486

Traub, R. E. (1997). Classical test theory in historical perspective. Educational Measurement: Issues and Practice, 16 (8), 8–14.

VanderWeele, T. J., & Vansteelandt, S. (2022). A statistical test to reject the structural interpretation of a latent factor model. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84 (5), 2032–2054. 10.1111/rssb.12555

Wicherts, J. M., Veldkamp, C. L., Augusteijn, H. E., Bakker, M., Van Aert, R., & Van Assen, M. A. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 1832. 10.3389/fpsyg.2016.01832

Widaman, K. F., & Revelle, W. (2023). Thinking thrice about sum scores, and then some more about measurement and analysis. Behavior Research Methods, 55 (2), 788–806. 10.3758/s13428-022-01849-w

Wilson, M., De Boeck, P., & Carstensen, C. H. (2008). Explanatory item response models: A brief introduction. In J. Hartig, E. Klieme, & D. Leutner (Eds.), Assessment of competencies in educational contexts (pp. 91–120). Hogrefe & Huber Publishers.

Xu, T., & Stone, C. A. (2012). Using IRT trait estimates versus summated scores in predicting outcomes. Educational and Psychological Measurement, 72 (3), 453–468. 10.1177/0013164411419846

Ye, F. (2016). Latent growth curve analysis with dichotomous items: Comparing four approaches. British Journal of Mathematical and Statistical Psychology, 69 (1), 43–61. 10.1111/bmsp.12058

Zwinderman, A. H. (1991). A generalized Rasch model for manifest predictors. Psychometrika, 56 (4), 589–600. 10.1007/BF02294492

Appendix Additional Simulation Results

All figures include EIV corrections for the two-step scores. The figure for statistical power only includes true eﬀect sizes of 0.2 because near-ceiling levels of power are achieved at an eﬀect size of 0.4.

Figure A.1.Estimated Standard Errors by MethodFigure A.2.Estimated Standard Error Calibration by MethodFigure A.3.Estimated False Positive Rates by MethodFigure A.4.Estimated Power by Method

Data Availability

A replication toolkit is available at Gilbert, 2025. The empirical datasets and study materials from Gilbert, Himmelsbach, et al. (2025b) are available at https://doi.org/10.7910/DVN/C4TJCA, and also in the Item Response Warehouse (IRW, Domingue et al., 2024), with the prefix gilbert_meta: https://redivis.com/datasets/as2e-cv7jb41fd.

Supplementary Materials

For this article, the following Supplementary Materials are available:

Data. (Gilbert, Himmelsbach, et al., 2025b)

Study materials. (Gilbert, Himmelsbach, et al., 2025b)

Replication toolkit. (Gilbert, 2025)

The author has no funding to report.

The author has declared that no competing interests exist.

The author wishes to thank Andrew Ho, Ben Domingue, Derek Briggs, and two anonymous reviewers for their helpful comments on drafts of this paper.