<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article
  PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with MathML3 v1.2 20190208//EN" "JATS-journalpublishing1-3-mathml3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" article-type="research-article" dtd-version="1.2" xml:lang="en">
<front>
<journal-meta><journal-id journal-id-type="publisher-id">METH</journal-id><journal-id journal-id-type="nlm-ta">Methodology</journal-id>
<journal-title-group>
<journal-title>Methodology</journal-title><abbrev-journal-title abbrev-type="pubmed">Methodology</abbrev-journal-title>
</journal-title-group>
<issn pub-type="ppub">1614-1881</issn>
<issn pub-type="epub">1614-2241</issn>
<publisher><publisher-name>PsychOpen</publisher-name></publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">meth.15773</article-id>
<article-id pub-id-type="doi">10.5964/meth.15773</article-id>
<article-categories>
<subj-group subj-group-type="heading"><subject>Original Article</subject></subj-group>

<subj-group subj-group-type="badge">
<subject>Data</subject>
</subj-group>

</article-categories>
<title-group>
<article-title>How Measurement Affects Causal Inference: Attenuation Bias Is (Usually) More Important Than Outcome Scoring Weights</article-title>
	<alt-title alt-title-type="right-running">How Measurement Affects Causal Inference</alt-title>
<alt-title specific-use="APA-reference-style" xml:lang="en">How measurement affects causal inference: Attenuation bias is (usually) more important than outcome scoring weights</alt-title>
</title-group>
<contrib-group>
	<contrib id="author-1" contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid" authenticated="false">https://orcid.org/0000-0003-3496-2710</contrib-id><name name-style="western"><surname>Gilbert</surname><given-names>Joshua B.</given-names></name><xref ref-type="corresp" rid="cor1">*</xref><xref ref-type="aff" rid="aff1">1</xref></contrib>
<contrib contrib-type="editor">
<name>
	<surname>Medzihorsky</surname>
	<given-names>Juraj</given-names>
</name>
<xref ref-type="aff" rid="aff2"/>
</contrib>
	<aff id="aff1"><label>1</label><institution content-type="dept">Harvard Graduate School of Education</institution>, <institution>Harvard University</institution>, <addr-line><city>Cambridge, MA</city></addr-line>, <country country="US">USA</country></aff>
	<aff id="aff2">Durham University, Durham, <country>United Kingdom</country></aff>
</contrib-group>
	<author-notes>
		<corresp id="cor1"><label>*</label>13 Appian Way, Cambridge, MA 02138, USA. <email xlink:href="joshua_gilbert@g.harvard.edu">joshua_gilbert@g.harvard.edu</email></corresp>
	</author-notes>
	
<pub-date pub-type="epub"><day>30</day><month>06</month><year>2025</year></pub-date>
<pub-date pub-type="collection" publication-format="electronic"><year>2025</year></pub-date>
	
<volume>21</volume>
<issue>2</issue>
<fpage>91</fpage>
<lpage>122</lpage>
<history>
<date date-type="received">
<day>05</day>
<month>10</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>06</day>
<month>05</month>
<year>2025</year>
</date>
</history>
<permissions><copyright-year>2025</copyright-year><copyright-holder>Gilbert</copyright-holder><license license-type="open-access" specific-use="CC BY 4.0" xlink:href="https://creativecommons.org/licenses/by/4.0/"><ali:license_ref>https://creativecommons.org/licenses/by/4.0/</ali:license_ref><license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution (CC BY) 4.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p></license></permissions>
	
<abstract>
<p>When analyzing treatment effects on outcome variables constructed from psychometric instruments (e.g., educational test scores, psychological surveys, or patient reported outcomes), researchers face many choices and competing guidance for scoring the measures and modeling results. This study examines the impact of outcome measure scoring and modeling approaches through simulation and an empirical application. Results show that estimates from multiple methods applied to the same data will vary because two-step models using sum or factor scores provide attenuated standardized treatment effects compared to latent variable models. This bias dominates any other differences between models or features of the data generating process, such as the use of scoring weights. An errors-in-variables (EIV) correction removes the bias from two-step models. An empirical application to 10 datasets from randomized controlled trials demonstrates the sensitivity of the results to model selection. This study shows that the psychometric principles most consequential in causal inference are related to attenuation bias rather than optimal outcome scoring weights.</p>
</abstract>
<kwd-group kwd-group-type="author"><kwd>causal inference</kwd><kwd>latent variable models</kwd><kwd>factor analysis</kwd><kwd>psychometrics</kwd><kwd>measurement</kwd></kwd-group>

</article-meta>
</front>
<body>
	<sec sec-type="intro" id="intro"><title/>	
		<p id="S1.PX1.P1">When research results are sensitive to the choice of statistical model, they become dependent on researcher discretion, and bias can be introduced (<xref ref-type="bibr" rid="Xbib43">Gelman &amp; Loken, 2013</xref>; <xref ref-type="bibr" rid="Xbib68">King &amp; Nielsen, 2019</xref>; <xref ref-type="bibr" rid="Xbib124">Simmons et al., 2011</xref>; <xref ref-type="bibr" rid="Xbib139">Wicherts et al., 2016</xref>). Researcher discretion is a particular challenge in evaluation research on outcomes derived from psychometric instruments, such as educational tests, psychological surveys, and patient-reported outcomes, because of the many approaches to scoring the outcome measures and accounting for measurement error in the results, such as Classical Test Theory (CTT), Item Response Theory (IRT), Factor Analysis (FA), or Latent Variable Models (LVMs). Researcher-designed outcome measures in particular demand many decision points in the analysis process, raising the question of how sensitive results are to model selection and outcome scoring decisions, especially in causal studies investigating intervention impacts on outcomes that aim to provide policy-relevant findings. 
While reviews in several fields such as medicine, education, and organizational research show a relative lack of attention to issues of measurement in general and call for better measurement practices (<xref ref-type="bibr" rid="Xbib11">Brakenhoff et al., 2018</xref>; <xref ref-type="bibr" rid="Xbib28">Cox &amp; Kelcey, 2019</xref>; <xref ref-type="bibr" rid="Xbib41">Flake et al., 2017</xref>; <xref ref-type="bibr" rid="Xbib40">Flake &amp; Fried, 2020</xref>; <xref ref-type="bibr" rid="Xbib103">Pedersen et al., 2025</xref>; <xref ref-type="bibr" rid="Xbib132">Spybrook et al., 2016</xref>), the implications of measurement principles for causal inference, policy, and program evaluation are less prominent (<xref ref-type="bibr" rid="Xbib122">Shear &amp; Briggs, 2024</xref>; <xref ref-type="bibr" rid="Xbib131">Soland, Kuhfeld, &amp; Edwards, 2024</xref>; <xref ref-type="bibr" rid="Xbib128">Soland, Edwards, &amp; Talbert, 2024</xref>).</p>
<p id="S1.PX1.P2">For a given causal research question, alternative statistical methods may provide defensible options for analysis, and varying results are expected. For instance, when modeling a binary outcome, logistic regression and the linear probability model may produce different results due to the contrasting assumptions of each model (<xref ref-type="bibr" rid="Xbib136">Timoneda, 2021</xref>). Similarly, in the context of multisite randomized trials or meta-analyses, fixed-effects and random-effects estimators will produce different estimates of treatment effects due to the different estimands targeted by each model (<xref ref-type="bibr" rid="Xbib23">Chan &amp; Hedges, 2022</xref>; <xref ref-type="bibr" rid="Xbib92">Miratrix et al., 2021</xref>; <xref ref-type="bibr" rid="Xbib125">Skrondal &amp; Rabe-Hesketh, 2004</xref>). While such differences in “estimates, estimators, and estimands” (<xref ref-type="bibr" rid="Xbib92">Miratrix et al., 2021</xref>) are well understood in causal inference generally, the use of psychometric measures as outcome variables demands additional consideration because observed scores are typically not of interest in themselves but rather as proxies for unobserved latent variables such as academic achievement. Thus, researchers must navigate a range of options for causal analysis of psychometric outcome data and interpret differing results from models that theoretically target the same treatment effect on the latent trait. Furthermore, it is unclear whether some approaches are consistently superior to others or whether the tradeoffs of model selection depend on the circumstances (<xref ref-type="bibr" rid="Xbib44">Gilbert, 2024a</xref>; <xref ref-type="bibr" rid="Xbib58">Hontangas et al., 2015</xref>).</p>
		<p id="S1.PX1.P3">As an example, consider the options for scoring an educational test to estimate a treatment effect on the latent trait of academic achievement imperfectly represented by the observed test score. Both CTT sum scores and IRT- or FA-based scores use item responses to estimate a latent trait score for each student, which is then used in subsequent analysis. IRT- or FA-based methods such as the two-parameter logistic (2PL) model or the congeneric factor model theoretically provide more fine-grained distinctions among students by weighting item responses based on the information (i.e., item discrimination or factor loading) provided by the item, in contrast to sum scores, which treat different sets of correct answers as identical (<xref ref-type="bibr" rid="Xbib19">Camilli, 2018</xref>; <xref ref-type="bibr" rid="Xbib54">Hambleton &amp; Van der Linden, 1982</xref>; <xref ref-type="bibr" rid="Xbib78">Lord, 1980</xref>; <xref ref-type="bibr" rid="Xbib79">Lord &amp; Novick, 1968</xref>; <xref ref-type="bibr" rid="Xbib135">Thissen &amp; Wainer, 2001</xref>). Alternatively, LVM techniques, such as Structural Equation Modeling (SEM; <xref ref-type="bibr" rid="Xbib69">Kline, 2023</xref>; <xref ref-type="bibr" rid="Xbib96">Muthén, 2002</xref>) or Explanatory Item Response Modeling (EIRM; <xref ref-type="bibr" rid="Xbib14">Briggs, 2008</xref>; <xref ref-type="bibr" rid="Xbib30">De Boeck, 2004</xref>; <xref ref-type="bibr" rid="Xbib32">De Boeck &amp; Wilson, 2016</xref>; <xref ref-type="bibr" rid="Xbib44">Gilbert, 2024a</xref>; <xref ref-type="bibr" rid="Xbib141">Wilson et al., 2008</xref>), estimate the measurement and regression models in a single step. Because all test scoring methods and LVMs target the same treatment effect on the latent trait, a key question is the extent to which theoretical differences between these models matter in causal analysis of test score outcome data. 
Correlations between IRT- and FA-based scores and CTT scores are typically above 0.90 (<xref ref-type="bibr" rid="Xbib80">Lu et al., 2005</xref>; <xref ref-type="bibr" rid="Xbib131">Soland, Kuhfeld, &amp; Edwards, 2024</xref>), which raises the question of whether the theoretical benefits of IRT- or FA-based scoring methods or LVMs are worth the added complexity, computational power, and interpretational challenges they may pose. Furthermore, no clear guidelines exist on which model researchers should prefer, particularly when the results conflict.</p>
<p id="S1.PX1.P4">To illustrate the challenge facing the applied researcher, consider two recent publications on the implications of using sum scores versus factor scores in statistical models. On one side, <xref ref-type="bibr" rid="Xbib89">McNeish and Wolf (2020)</xref> argue that sum scores can have “adverse effects on validity, reliability, and qualitative classification” compared to FA-based scores because sum scores implicitly assume that each item contributes equally to the estimation of the latent trait, an assumption that is unlikely to be met in many empirical applications. In contrast, <xref ref-type="bibr" rid="Xbib140">Widaman and Revelle (2023)</xref> argue that so long as the scale is unidimensional, sum scores “often have a solid psychometric basis and therefore are frequently quite adequate for psychological research”. Such competing claims, expanded in further publications (<xref ref-type="bibr" rid="Xbib86">McNeish, 2022</xref>, <xref ref-type="bibr" rid="Xbib87">2023</xref>, <xref ref-type="bibr" rid="Xbib88">2024</xref>; <xref ref-type="bibr" rid="Xbib123">Sijtsma et al., 2024</xref>), pose a challenge for the applied researcher working with outcome data derived from psychometric measures.</p>
		<p id="S1.PX1.P5">The purpose of this study is to provide both a concise and accessible review of the conceptual issues at play and practical guidance for evaluation researchers. To that end, we explore the consequences of outcome measurement modeling decisions for causal inference and determine which decision points in measurement modeling are most salient for analytic results. Results show that the issue of attenuation bias dominates the issue of scoring weights, and simpler models can perform better even under extreme circumstances. In other words, our results suggest that accounting for measurement error in the outcome variable is a first-order concern in causal inference, in contrast to second-order issues of measurement “model error”, in which an incorrect measurement model is applied to generate scores for the outcome variable (<xref ref-type="bibr" rid="Xbib76">Liu &amp; Pek, 2024</xref>), such as using equally weighted scores when other approaches are a better fit to the data. Our results align with studies showing that the marginal gains from more complex statistical models can be low and may not justify their increased complexity (e.g., <xref ref-type="bibr" rid="Xbib36">Domingue, Kanopka, Kapoor, et al., 2024</xref>, in IRT; <xref ref-type="bibr" rid="Xbib22">Castellano &amp; Ho, 2015</xref>, in value-added modeling; <xref ref-type="bibr" rid="Xbib140">Widaman &amp; Revelle, 2023</xref>, in psychological measurement), and serve as a contrast with other work emphasizing the sensitivity of analytic results to measurement modeling choices in the analysis of psychometric data (<xref ref-type="bibr" rid="Xbib89">McNeish &amp; Wolf, 2020</xref>; <xref ref-type="bibr" rid="Xbib131">Soland, Kuhfeld, &amp; Edwards, 2024</xref>).</p>
<sec id="s1_1"><title>Classical Approaches to Measurement Error in Evaluation Research</title>
<p id="S1.PX1.P6">Measurement error is a widely studied phenomenon, with work on the reliability of educational and psychological tests going back many decades (<xref ref-type="bibr" rid="Xbib3">Asher, 1974</xref>; <xref ref-type="bibr" rid="Xbib8">Bollen, 1989</xref>; <xref ref-type="bibr" rid="Xbib10">Borsboom, 2005</xref>; <xref ref-type="bibr" rid="Xbib15">Briggs, 2021</xref>; <xref ref-type="bibr" rid="Xbib29">Cronbach, 1951</xref>; <xref ref-type="bibr" rid="Xbib79">Lord &amp; Novick, 1968</xref>), and has well-known consequences in statistical analysis (<xref ref-type="bibr" rid="Xbib42">Fuller &amp; Hidiroglou, 1978</xref>; <xref ref-type="bibr" rid="Xbib59">Hutcheon et al., 2010</xref>; <xref ref-type="bibr" rid="Xbib75">Liu, 1988</xref>). In the case of simple linear regression with two variables, error in the independent (<italic>X</italic>, predictor) variable attenuates the regression coefficient toward 0, whereas error in the dependent (<italic>Y</italic>, outcome) variable does not bias the estimated regression coefficient but decreases precision and reduces statistical power by increasing residual variance. These general rules of thumb do not always hold in more complex circumstances (<xref ref-type="bibr" rid="Xbib69">Kline, 2023</xref>).</p>
<p id="S1.PX1.P7">Measurement error can be addressed with both classical and modern methods. For example, Errors-in-Variables (EIV) regression models (<xref ref-type="bibr" rid="Xbib21">Carroll et al., 2009</xref>; <xref ref-type="bibr" rid="Xbib50">Gillard, 2010</xref>) use estimates of reliability to adjust the coefficients of predictor variables, and LVMs (<xref ref-type="bibr" rid="Xbib96">Muthén, 2002</xref>) adjust for measurement error by simultaneously estimating the latent variable(s) and the regression model. While both EIV and LVM methods can correct for measurement error, some studies have shown that the LVM approach can provide more robust estimates of uncertainty than EIV methods (<xref ref-type="bibr" rid="Xbib44">Gilbert, 2024a</xref>; <xref ref-type="bibr" rid="Xbib77">Lockwood &amp; McCaffrey, 2014</xref>).</p>
<p id="S1.PX1.P8">Measurement error in the dependent variable is sometimes ignored because it does not bias coefficients (<xref ref-type="bibr" rid="Xbib28">Cox &amp; Kelcey, 2019</xref>), but LVMs can also be applied to outcome variables and can provide modest benefits to statistical power and more robust estimates of uncertainty than alternative approaches (<xref ref-type="bibr" rid="Xbib25">Christensen, 2006</xref>; <xref ref-type="bibr" rid="Xbib104">Rabbitt, 2018</xref>; <xref ref-type="bibr" rid="Xbib144">Zwinderman, 1991</xref>), though benefits are context dependent (<xref ref-type="bibr" rid="Xbib44">Gilbert, 2024a</xref>). However, coefficients <italic>are</italic> biased toward zero when the dependent variable is standardized. Attenuation due to standardization is a particular issue in evaluation research because most test scores, psychological surveys, and patient-reported outcomes have no natural scale, and standardization allows for estimates of treatment effect size that can in principle be compared across studies or pooled in meta-analyses (<xref ref-type="bibr" rid="Xbib9">Borenstein et al., 2009</xref>) and are often argued to be more interpretable than unstandardized coefficients (<xref ref-type="bibr" rid="Xbib117">Schielzeth, 2010</xref>).</p>
	<p id="S1.PX1.P9">Standardization of the dependent variable <italic>Y</italic> attenuates regression coefficients because measurement error causes overdispersion in the standard deviation of <italic>Y</italic>, <inline-formula><mml:math id="math-1" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>Y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. That is, <inline-formula><mml:math id="math-2" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>Y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> will be greater than the <italic>SD</italic> of the true latent trait scores <inline-formula><mml:math id="math-3" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> because <inline-formula><mml:math id="math-4" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>Y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> reflects both the variation in true scores, <inline-formula><mml:math id="math-5" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and the variation due to measurement error, <inline-formula><mml:math id="math-6" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, as summarized in the CTT variance decomposition <inline-formula><mml:math id="math-7" 
display="inline"><mml:msubsup><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> (<xref ref-type="bibr" rid="Xbib13">Brennan, 2010</xref>; <xref ref-type="bibr" rid="Xbib34">DeVellis, 2006</xref>; <xref ref-type="bibr" rid="Xbib53">Hambleton &amp; Jones, 1993</xref>; <xref ref-type="bibr" rid="Xbib62">Jackson, 1973</xref>; <xref ref-type="bibr" rid="Xbib74">Lewis, 2006</xref>; <xref ref-type="bibr" rid="Xbib137">Traub, 1997</xref>). We can precisely estimate the overdispersion of <inline-formula><mml:math id="math-8" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>Y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> with the CTT reliability formula, which defines reliability <inline-formula><mml:math id="math-9" display="inline"><mml:mi>ρ</mml:mi></mml:math></inline-formula> as the ratio of true score variance (<inline-formula><mml:math id="math-10" display="inline"><mml:msubsup><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula>) to observed score variance (<inline-formula><mml:math id="math-11" display="inline"><mml:msubsup><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula>): <inline-formula><mml:math id="math-12" 
display="inline"><mml:mi>ρ</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:math></inline-formula>. Solving for <inline-formula><mml:math id="math-13" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>Y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> shows that <inline-formula><mml:math id="math-14" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>Y</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>ρ</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:math></inline-formula>. Therefore, when we standardize an outcome variable such as a test score by dividing by its <italic>SD</italic> <inline-formula><mml:math id="math-15" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>Y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, this value is too large by a factor of <inline-formula><mml:math id="math-16" display="inline"><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>ρ</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:math></inline-formula>. Consequently, when measurement error in the outcome is present, standardized regression coefficients will be attenuated toward zero, and this bias can be corrected by dividing by <inline-formula><mml:math id="math-17" display="inline"><mml:msqrt><mml:mrow><mml:mi>ρ</mml:mi></mml:mrow></mml:msqrt></mml:math></inline-formula>. 
Applying this EIV correction deattenuates the standardized regression coefficient to what it would be if the test were perfectly reliable or of infinite length.</p>
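<p>The attenuation argument above can be checked numerically. The following is a minimal sketch rather than the simulation design used in this study: the sample size, effect size, reliability, and the function name <monospace>simulate_attenuation</monospace> are illustrative assumptions, and the reliability is computed from the simulated true scores, which are never available in real data.</p>
<code code-type="example" language="python" xml:space="preserve">
```python
import numpy as np


def simulate_attenuation(n=200_000, tau=0.4, rho=0.6, seed=0):
    """Return (true, observed, corrected) standardized treatment effects."""
    rng = np.random.default_rng(seed)
    treat = rng.integers(0, 2, size=n)          # random assignment (0 or 1)
    theta = tau * treat + rng.normal(size=n)    # latent true scores

    # Choose the error SD so that var(theta) / var(y) is approximately rho.
    error_sd = np.sqrt(theta.var() * (1 - rho) / rho)
    y = theta + rng.normal(scale=error_sd, size=n)  # observed scores

    def std_diff(score):
        # Difference in group means divided by the overall SD of the score.
        diff = score[treat == 1].mean() - score[treat == 0].mean()
        return diff / score.std()

    true_effect = std_diff(theta)
    observed_effect = std_diff(y)

    # Reliability is computable here only because theta is simulated.
    rho_hat = theta.var() / y.var()
    corrected_effect = observed_effect / np.sqrt(rho_hat)
    return true_effect, observed_effect, corrected_effect


true_effect, observed_effect, corrected_effect = simulate_attenuation()
print(true_effect, observed_effect, corrected_effect)
```
</code>
<p>Under these assumptions, the observed standardized effect should be close to the true standardized effect multiplied by the square root of the reliability, and dividing by that square root should approximately recover the true value. In applied work the reliability would have to be estimated, for example from an internal consistency coefficient.</p>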
<p id="S1.PX1.P10">Attenuation due to standardization is not a new insight (<xref ref-type="bibr" rid="Xbib26">Cole &amp; Preacher, 2014</xref>; <xref ref-type="bibr" rid="Xbib57">Hedges, 1981</xref>; <xref ref-type="bibr" rid="Xbib122">Shear &amp; Briggs, 2024</xref>), but it is nonetheless commonly ignored, or reserved for technical discussions (<xref ref-type="bibr" rid="Xbib9">Borenstein et al., 2009</xref>) and comparatively less emphasized in practical guides for researchers. For example, in its section on reliability, the Institute of Education Sciences’ (IES) <italic>What Works Clearinghouse Standards Handbook</italic> lists minimum thresholds for various reliability metrics (e.g., <inline-formula><mml:math id="math-18" display="inline"><mml:mi>α</mml:mi><mml:mo>≥</mml:mo><mml:mo>.</mml:mo><mml:mn>50</mml:mn></mml:math></inline-formula> in Version 4.1 and <inline-formula><mml:math id="math-19" display="inline"><mml:mi>α</mml:mi><mml:mo>≥</mml:mo><mml:mo>.</mml:mo><mml:mn>60</mml:mn></mml:math></inline-formula> in Version 5.0), but makes no mention of attenuation bias, in contrast to detailed explanation of the bias that arises from other sources, such as non-random attrition or baseline non-equivalence.<xref ref-type="fn" rid="x1-3001f1"><sup>1</sup></xref><fn id="x1-3001f1"><label>1</label>
<p id="S1.PX1.P11">Current and past WWC Standards Handbooks are available at the following URL: <ext-link ext-link-type="uri" xlink:href="https://ies.ed.gov/ncee/wwc/handbooks">https://ies.ed.gov/ncee/wwc/handbooks</ext-link>.</p></fn>
	Crucially, attenuation bias is not solved by IRT or FA scoring procedures, because the resulting scores still contain measurement error. The problem can be further compounded by expected a posteriori (EAP) scoring methods because shrinkage in empirical Bayes estimation pulls the estimated latent trait scores toward the overall mean across treatment and control groups rather than toward the respective means of each group (<xref ref-type="bibr" rid="Xbib14">Briggs, 2008</xref>; <xref ref-type="bibr" rid="Xbib126">Soland, 2022</xref>). This problem is less severe but still present when using maximum likelihood (ML) scoring (<xref ref-type="bibr" rid="Xbib131">Soland, Kuhfeld, &amp; Edwards, 2024</xref>, p. 11), though ML scoring raises other issues such as undefined scores for respondents with “perfect” scores (i.e., all items answered correctly or incorrectly on an educational test). The solution is to apply an EIV correction by dividing the coefficients by <inline-formula><mml:math id="math-20" display="inline"><mml:msqrt><mml:mrow><mml:mi>ρ</mml:mi></mml:mrow></mml:msqrt></mml:math></inline-formula>, where <inline-formula><mml:math id="math-21" display="inline"><mml:mi>ρ</mml:mi></mml:math></inline-formula> can be estimated as the internal consistency of the test (e.g., Cronbach’s <inline-formula><mml:math id="math-22" display="inline"><mml:mi>α</mml:mi></mml:math></inline-formula> or <inline-formula><mml:math id="math-23" display="inline"><mml:mi>ω</mml:mi></mml:math></inline-formula>) (<xref ref-type="bibr" rid="Xbib57">Hedges, 1981</xref>; <xref ref-type="bibr" rid="Xbib122">Shear &amp; Briggs, 2024</xref>), or to employ an LVM that directly adjusts for measurement error in the estimation procedure, as we will demonstrate.</p></sec>
<sec id="s1_2"><title>Methods for Estimating Causal Effects on Psychometric Outcome Data</title>
<sec id="s1_2_1"><title>Estimands and Estimators</title>
<p id="S1.PX1.P12">Consider outcome <inline-formula><mml:math id="math-24" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> for person <inline-formula><mml:math id="math-25" display="inline"><mml:mi>j</mml:mi></mml:math></inline-formula> (<inline-formula><mml:math id="math-26" display="inline"><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:mi>J</mml:mi></mml:math></inline-formula>). Under the potential outcomes framework (<xref ref-type="bibr" rid="Xbib60">Imbens &amp; Rubin, 2015</xref>; <xref ref-type="bibr" rid="Xbib113">Rubin, 2005</xref>), the individual causal eﬀect of binary treatment <inline-formula><mml:math id="math-27" display="inline"><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> on person <inline-formula><mml:math id="math-28" display="inline"><mml:mi>j</mml:mi></mml:math></inline-formula> is <inline-formula><mml:math id="math-29" display="inline"><mml:msub><mml:mrow><mml:mi>τ</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>≡</mml:mo><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>−</mml:mo><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, where 1 indicates the treatment counterfactual and 0 indicates the control counterfactual.<xref ref-type="fn" rid="x1-5001f2"><sup>2</sup></xref><fn id="x1-5001f2"><label>2</label>
<p id="S1.PX1.P13">We follow the notation of <xref ref-type="bibr" rid="Xbib46">Gilbert, Himmelsbach, et al. (2025a)</xref>.</p></fn> Because only one counterfactual is observed, <inline-formula><mml:math id="math-30" display="inline"><mml:msub><mml:mrow><mml:mi>τ</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is unobservable. The target estimand of causal analyses is therefore typically the average treatment eﬀect (ATE), defined as <inline-formula><mml:math id="math-31" display="inline"><mml:mover accent="false"><mml:mrow><mml:mi>τ</mml:mi></mml:mrow><mml:mo>¯</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>J</mml:mi></mml:mrow></mml:mfrac><mml:msubsup><mml:mrow><mml:mi mathvariant="normal">Σ</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>J</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>−</mml:mo><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p>
<p id="S1.PX1.P14">Random assignment of the treatment ensures that treatment status is independent of the potential outcomes. Therefore, we can estimate <inline-formula><mml:math id="math-32" display="inline"><mml:mover accent="false"><mml:mrow><mml:mi>τ</mml:mi></mml:mrow><mml:mo>¯</mml:mo></mml:mover></mml:math></inline-formula> as a difference in means between the treated and control groups. Practically, we can use a simple linear regression model as our estimator for <inline-formula><mml:math id="math-33" display="inline"><mml:mover accent="false"><mml:mrow><mml:mi>τ</mml:mi></mml:mrow><mml:mo>¯</mml:mo></mml:mover></mml:math></inline-formula>, in which <inline-formula><mml:math id="math-34" display="inline"><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is an indicator variable for the treatment status of person <inline-formula><mml:math id="math-35" display="inline"><mml:mi>j</mml:mi></mml:math></inline-formula>, <inline-formula><mml:math id="math-36" display="inline"><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the mean of the control group, <inline-formula><mml:math id="math-37" display="inline"><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the difference in means between the groups, and <inline-formula><mml:math id="math-38" display="inline"><mml:msub><mml:mrow><mml:mi mathvariant="normal">ε</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the error term (<xref ref-type="bibr" rid="Xbib2">Angrist &amp; Pischke, 2009</xref>; <xref ref-type="bibr" rid="Xbib60">Imbens &amp; Rubin, 2015</xref>; <xref ref-type="bibr" rid="Xbib95">Murnane &amp; Willett, 2010</xref>; <xref ref-type="bibr" rid="Xbib111">Rosenbaum, 2017</xref>):</p>
<p id="S1.PX1.P15"><disp-formula id="x1-5002r1"><label>1</label><mml:math id="dmath-1" display="block"><mml:mtable columnalign="left"><mml:mtr><mml:mtd columnalign="right"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="normal">ε</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
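<p>As an illustrative aside, the equivalence between the coefficient <italic>β</italic><sub>1</sub> in <xref ref-type="disp-formula" rid="x1-5002r1">Equation 1</xref> and the difference in group means can be checked numerically. The following minimal Python sketch assumes that the outcome is directly observed and simulates a randomized trial. The function names, sample size, effect size, and seed are hypothetical choices made for illustration only; this is not the code used in the study.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
import random
import statistics


def simulate_trial(n=500, beta0=0.0, beta1=0.3, seed=1):
    """Simulate theta_j = beta0 + beta1 * T_j + eps_j under random assignment."""
    rng = random.Random(seed)
    treat = [rng.randint(0, 1) for _ in range(n)]
    theta = [beta0 + beta1 * t + rng.gauss(0.0, 1.0) for t in treat]
    return treat, theta


def ols_slope(x, y):
    """Closed-form simple-regression slope: cov(x, y) / var(x)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def difference_in_means(treat, y):
    """Mean outcome of the treated group minus mean outcome of the control group."""
    treated = [v for t, v in zip(treat, y) if t == 1]
    control = [v for t, v in zip(treat, y) if t == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


treat, theta = simulate_trial()
print(ols_slope(treat, theta), difference_in_means(treat, theta))
```
</preformat>
<p>Because the regressor is binary, the two printed quantities agree up to floating-point error in any sample, not only in expectation.</p>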
<p id="S1.PX1.P16">When <inline-formula><mml:math id="math-39" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is observed, the difference-in-means approach of <xref ref-type="disp-formula" rid="x1-5002r1">Equation 1</xref> is standard. However, when <inline-formula><mml:math id="math-40" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents an unobserved variable, such as mathematical ability, extroversion, or depression, <xref ref-type="disp-formula" rid="x1-5002r1">Equation 1</xref> is no longer estimable (<xref ref-type="bibr" rid="Xbib134">Stoetzer et al., 2024</xref>). The two primary approaches to estimating causal effects on latent variables are two-step procedures and simultaneous estimation, to which we now turn.</p></sec>
<sec id="s1_2_2"><title>Two-Step Procedures</title>
<p id="S1.PX1.P17">In a two-step procedure, the latent trait of interest is first estimated for each person and then analyzed as the outcome variable using a standard statistical model such as OLS regression (<xref ref-type="bibr" rid="Xbib25">Christensen, 2006</xref>; <xref ref-type="bibr" rid="Xbib143">Ye, 2016</xref>). For example, consider the following regression model, in which <inline-formula><mml:math id="math-41" display="inline"><mml:msub><mml:mrow><mml:mtext>score</mml:mtext></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents an estimated latent trait score for person <inline-formula><mml:math id="math-42" display="inline"><mml:mi>j</mml:mi></mml:math></inline-formula> and <inline-formula><mml:math id="math-43" display="inline"><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> represents the average treatment eﬀect (ATE):</p>
<p id="S1.PX1.P18"><disp-formula id="x1-6001r2"><label>2</label><mml:math id="dmath-2" display="block"><mml:mtable columnalign="left"><mml:mtr><mml:mtd columnalign="right"><mml:msub><mml:mrow><mml:mtext>score</mml:mtext></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd columnalign="left"><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mtext>treat</mml:mtext></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="normal">ε</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p id="S1.PX1.P19"><disp-formula id="x1-6002r3"><label>3</label><mml:math id="dmath-3" display="block"><mml:mtable columnalign="left"><mml:mtr><mml:mtd columnalign="right"><mml:msub><mml:mrow><mml:mi mathvariant="normal">ε</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd columnalign="left"><mml:mo>∼</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">ε</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p id="S1.PX1.P20"><inline-formula><mml:math id="math-44" display="inline"><mml:msub><mml:mrow><mml:mtext>score</mml:mtext></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> may be generated in a CTT or IRT/FA framework. In CTT, a sum or mean score is used, such that the observed score across all items for items <inline-formula><mml:math id="math-45" display="inline"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:mi>I</mml:mi></mml:math></inline-formula> equals the sum of the responses <inline-formula><mml:math id="math-46" display="inline"><mml:msubsup><mml:mrow><mml:mo>∑</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mtext>item</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> or the mean of the responses <inline-formula><mml:math id="math-47" display="inline"><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:mfrac><mml:msubsup><mml:mrow><mml:mo>∑</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mtext>item</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. 
In IRT or FA, the latent trait estimate, denoted <inline-formula><mml:math id="math-48" display="inline"><mml:mover accent="true"><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mo>̂</mml:mo></mml:mover></mml:math></inline-formula>, is calculated by maximizing the likelihood of <inline-formula><mml:math id="math-49" display="inline"><mml:mover accent="true"><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mo>̂</mml:mo></mml:mover></mml:math></inline-formula> given the estimated item parameters (<xref ref-type="bibr" rid="Xbib7">Bock et al., 1997</xref>). Generally, the IRT scoring approach has been argued to be superior to CTT approaches because IRT <inline-formula><mml:math id="math-50" display="inline"><mml:mover accent="true"><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mo>̂</mml:mo></mml:mover></mml:math></inline-formula> estimates are on an interval rather than ordinal scale (<xref ref-type="bibr" rid="Xbib39">Ferrando &amp; Chico, 2007</xref>; <xref ref-type="bibr" rid="Xbib55">Harwell &amp; Gatti, 2001</xref>; <xref ref-type="bibr" rid="Xbib61">Jabrayilov et al., 2016</xref>; <xref ref-type="bibr" rid="Xbib89">McNeish &amp; Wolf, 2020</xref>).<xref ref-type="fn" rid="fn3"><sup>3</sup></xref><fn id="fn3"><label>3</label>
<p id="S1.PX1.P21">The ability of IRT to produce an interval scaling is a theoretical ideal that may or may not be met in any given empirical application; see, for example, <xref ref-type="bibr" rid="Xbib16">Briggs and Domingue (2013)</xref>; <xref ref-type="bibr" rid="Xbib17">Briggs and Weeks (2009)</xref>; <xref ref-type="bibr" rid="Xbib90">Michell (1994</xref>, <xref ref-type="bibr" rid="Xbib91">1997)</xref>; <xref ref-type="bibr" rid="Xbib106">Reckase (2004)</xref>; and <xref ref-type="bibr" rid="Xbib116">Schafer (2006)</xref> for various perspectives on this issue.</p></fn> Furthermore, scores provided by IRT/FA models weight the contributions of item responses to <inline-formula><mml:math id="math-51" display="inline"><mml:mover accent="true"><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mo>̂</mml:mo></mml:mover></mml:math></inline-formula> by their discrimination parameters or factor loadings, thus maximizing the information in, and increasing the reliability of, <inline-formula><mml:math id="math-52" display="inline"><mml:mover accent="true"><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mo>̂</mml:mo></mml:mover></mml:math></inline-formula> (<xref ref-type="bibr" rid="Xbib19">Camilli, 2018</xref>; <xref ref-type="bibr" rid="Xbib63">Jessen et al., 2018</xref>; <xref ref-type="bibr" rid="Xbib87">McNeish, 2023</xref>; <xref ref-type="bibr" rid="Xbib89">McNeish &amp; Wolf, 2020</xref>; <xref ref-type="bibr" rid="Xbib108">Rhemtulla &amp; Savalei, 2025</xref>). In addition, participants with identical sum scores can have different <inline-formula><mml:math id="math-53" display="inline"><mml:mover accent="true"><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mo>̂</mml:mo></mml:mover></mml:math></inline-formula> based on different patterns of item responses, which in theory provides more fine-grained distinctions between respondents. 
Empirically, however, differences between CTT and IRT/FA scoring are often found to be minor (<xref ref-type="bibr" rid="Xbib80">Lu et al., 2005</xref>; <xref ref-type="bibr" rid="Xbib118">Sébille et al., 2010</xref>; <xref ref-type="bibr" rid="Xbib142">Xu &amp; Stone, 2012</xref>). One limitation of the two-step approach is that, regardless of which scoring procedure is used to estimate the latent trait, the outcome variable is treated as known even though it contains measurement error. As a result, regression coefficients will be biased when the outcome is standardized unless the EIV correction is applied, as we will show.</p></sec>
<sec id="s1_2_3"><title>Simultaneous Estimation With Latent Variable Models (LVMs)</title>
<p id="S1.PX1.P22">As an alternative to two-step procedures, LVMs enable the analyst to estimate measurement (psychometric) and regression (structural) models simultaneously and incorporate the eﬀects of measurement error directly into the estimation procedure, for both predictors and outcomes (<xref ref-type="bibr" rid="Xbib8">Bollen, 1989</xref>; <xref ref-type="bibr" rid="Xbib69">Kline, 2023</xref>; <xref ref-type="bibr" rid="Xbib96">Muthén, 2002</xref>). For example, consider the following LVM for the analysis of a treatment eﬀect on test score data,</p>
<p id="S1.PX1.P23"><disp-formula id="x1-7001r4"><label>4</label><mml:math id="dmath-4" display="block"><mml:mtable columnalign="left"><mml:mtr><mml:mtd columnalign="right"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd columnalign="left"><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>λ</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="normal">ε</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p id="S1.PX1.P24"><disp-formula id="x1-7002r5"><label>5</label><mml:math id="dmath-5" display="block"><mml:mtable columnalign="left"><mml:mtr><mml:mtd columnalign="right"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd columnalign="left"><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mtext>treat</mml:mtext></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>ζ</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p id="S1.PX1.P25"><disp-formula id="x1-7003r6"><label>6</label><mml:math id="dmath-6" display="block"><mml:mtable columnalign="left"><mml:mtr><mml:mtd columnalign="right"><mml:msub><mml:mrow><mml:mi mathvariant="normal">ε</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd columnalign="left"><mml:mo>∼</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="normal">ε</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p id="S1.PX1.P26"><disp-formula id="x1-7004r7"><label>7</label><mml:math id="dmath-7" display="block"><mml:mtable columnalign="left"><mml:mtr><mml:mtd columnalign="right"><mml:msub><mml:mrow><mml:mi>ζ</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd columnalign="left"><mml:mo>∼</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>ζ</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p id="S1.PX1.P27">in which the response <italic>Y</italic> to item <inline-formula><mml:math id="math-54" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula> for person <inline-formula><mml:math id="math-55" display="inline"><mml:mi>j</mml:mi></mml:math></inline-formula> is a function of latent person ability <inline-formula><mml:math id="math-56" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and item easiness parameters (item intercepts) <inline-formula><mml:math id="math-57" display="inline"><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, weighted by item discrimination parameters (factor loadings) <inline-formula><mml:math id="math-58" display="inline"><mml:msub><mml:mrow><mml:mi>λ</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and error term <inline-formula><mml:math id="math-59" display="inline"><mml:msub><mml:mrow><mml:mi mathvariant="normal">ε</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. 
<inline-formula><mml:math id="math-60" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is in turn a function of the control group mean <inline-formula><mml:math id="math-61" display="inline"><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, the ATE <inline-formula><mml:math id="math-62" display="inline"><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, and unexplained or residual variance in person ability <inline-formula><mml:math id="math-63" display="inline"><mml:msub><mml:mrow><mml:mi>ζ</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Thus, the ATE <inline-formula><mml:math id="math-64" display="inline"><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is estimated directly on the latent trait without the need to compute an outcome score in a separate step.</p>
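<p>To make the data structure implied by <xref ref-type="disp-formula" rid="x1-7001r4">Equations 4</xref> through <xref ref-type="disp-formula" rid="x1-7004r7">7</xref> concrete, the following Python sketch simulates continuous item responses under an identity link with equal unit loadings. It then compares the standardized treatment effect computed on the simulated latent trait, which would be unobservable in practice, with the one computed on the sum score. All numerical settings (sample size, residual <italic>SD</italic>, effect size, seed) and all function names are illustrative assumptions rather than the design or code used in the study.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
import math
import random
import statistics


def simulate_items(n=20000, n_items=10, beta1=0.5, sigma_eps=3.0, seed=7):
    """Simulate theta_j and continuous item responses under an identity link."""
    rng = random.Random(seed)
    lam = [1.0] * n_items                              # equal unit loadings (assumption)
    b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]  # item intercepts
    treat, theta, sum_score = [], [], []
    for _ in range(n):
        t = rng.randint(0, 1)                          # random assignment
        th = beta1 * t + rng.gauss(0.0, 1.0)           # beta0 = 0, sigma_zeta = 1
        items = [lam[i] * (th + b[i]) + rng.gauss(0.0, sigma_eps)
                 for i in range(n_items)]
        treat.append(t)
        theta.append(th)
        sum_score.append(sum(items))
    return treat, theta, sum_score


def standardized_effect(treat, y):
    """Difference in group means divided by the pooled within-group SD."""
    g1 = [v for t, v in zip(treat, y) if t == 1]
    g0 = [v for t, v in zip(treat, y) if t == 0]
    pooled_var = ((len(g1) - 1) * statistics.variance(g1)
                  + (len(g0) - 1) * statistics.variance(g0)) / (len(g1) + len(g0) - 2)
    return (statistics.fmean(g1) - statistics.fmean(g0)) / math.sqrt(pooled_var)


def sum_score_reliability(n_items, sigma_eps, sigma_zeta=1.0):
    """Within-group reliability of the sum score under equal unit loadings."""
    true_var = (n_items * sigma_zeta) ** 2
    return true_var / (true_var + n_items * sigma_eps ** 2)


treat, theta, sum_score = simulate_items()
d_latent = standardized_effect(treat, theta)
d_sum = standardized_effect(treat, sum_score)
reliability = sum_score_reliability(n_items=10, sigma_eps=3.0)
# Latent effect, sum-score effect, and the attenuated value implied by the reliability.
print(round(d_latent, 3), round(d_sum, 3), round(0.5 * math.sqrt(reliability), 3))
```
</preformat>
<p>Under these assumptions, the sum-score effect is expected to be smaller than the latent effect by a factor of approximately the square root of the within-group reliability of the sum score.</p>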
<p id="S1.PX1.P28">Because <inline-formula><mml:math id="math-65" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is unobserved, constraints are necessary to identify the model and provide a scale for <inline-formula><mml:math id="math-66" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Two standard approaches to model identification are the unit-loading approach, in which <inline-formula><mml:math id="math-67" display="inline"><mml:msub><mml:mrow><mml:mi>λ</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is fixed to 1 for a single item (or for all items, as in a Rasch model), and the unit-variance approach, in which the total variance (or residual variance) of <inline-formula><mml:math id="math-68" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is fixed to 1. 
The item easiness parameters <inline-formula><mml:math id="math-69" display="inline"><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> can be identified by either excluding one item in a fixed eﬀects approach (so that <inline-formula><mml:math id="math-70" display="inline"><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> represents the performance of the average control respondent on this item), by fixing the mean of <inline-formula><mml:math id="math-71" display="inline"><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to 0 in a random eﬀects approach (<xref ref-type="bibr" rid="Xbib31">De Boeck, 2008</xref>), or by fixing <inline-formula><mml:math id="math-72" display="inline"><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>.</p>
<p id="S1.PX1.P29">These constraints resolve the scale indeterminacy of <inline-formula><mml:math id="math-73" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> but are nonetheless arbitrary. For example, a single <inline-formula><mml:math id="math-74" display="inline"><mml:msub><mml:mrow><mml:mi>λ</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> could be fixed to 2 instead of 1, yielding diﬀerent point estimates and standard errors but providing identical fit to the observed data. Thus, when estimating causal eﬀects on latent outcomes, indexing <inline-formula><mml:math id="math-75" display="inline"><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> to the pooled standard deviation of <inline-formula><mml:math id="math-76" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> (i.e., <inline-formula><mml:math id="math-77" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>ζ</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in the case without additional covariates in the model) is an attractive strategy to ground the interpretation of the model results. Accordingly, for the purposes of the present study, we target the following latent ATE in our analyses, following the notation of <xref ref-type="bibr" rid="Xbib134">Stoetzer et al. (2024)</xref>:</p>
<p id="S1.PX1.P30"><disp-formula id="x1-7005r8"><label>8</label><mml:math id="dmath-8" display="block"><mml:mtable columnalign="left"><mml:mtr><mml:mtd columnalign="right"><mml:mfrac><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>−</mml:mo><mml:mi mathvariant="double-struck">E</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mtext>SD</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p id="S1.PX1.P31">This approach has the important benefit of providing the same point estimate regardless of the chosen identification constraints, and is analogous to estimating a standardized eﬀect size such as Cohen’s <italic>d</italic> when the outcome variable is observed (<xref ref-type="bibr" rid="Xbib134">Stoetzer et al., 2024</xref>).</p>
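<p>This invariance can be verified by direct calculation in the linear case. The Python sketch below assumes an identity link and hypothetical parameter values. It re-expresses one model with every loading multiplied by a constant, which corresponds to fixing the reference loading to 2 rather than 1, and checks that the model-implied item moments and the standardized effect are unchanged even though <italic>β</italic><sub>1</sub> itself changes. The function and variable names are illustrative and are not taken from the study.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
import math


def implied_moments(p):
    """Model-implied item means by group and within-group covariance (identity link)."""
    lam, b, sz = p["lam"], p["b"], p["sigma_zeta"]
    mean_control = [l * (p["beta0"] + bi) for l, bi in zip(lam, b)]
    mean_treated = [l * (p["beta0"] + p["beta1"] + bi) for l, bi in zip(lam, b)]
    cov = [[lam[i] * lam[k] * sz ** 2 + (p["sigma_eps"][i] ** 2 if i == k else 0.0)
            for k in range(len(lam))] for i in range(len(lam))]
    return mean_control, mean_treated, cov


def rescale(p, c):
    """Same model with the latent scale divided by c, so loadings are multiplied by c."""
    return {
        "lam": [l * c for l in p["lam"]],
        "b": [bi / c for bi in p["b"]],
        "beta0": p["beta0"] / c,
        "beta1": p["beta1"] / c,
        "sigma_zeta": p["sigma_zeta"] / c,
        "sigma_eps": list(p["sigma_eps"]),
    }


def standardized_ate(p):
    """Equation 8 without covariates: beta1 divided by the residual SD of theta."""
    return p["beta1"] / p["sigma_zeta"]


def flatten(moments):
    """Collect all implied means and covariances into one flat list."""
    mean_control, mean_treated, cov = moments
    return mean_control + mean_treated + [v for row in cov for v in row]


# Hypothetical parameter values with the first loading fixed to 1.
unit_loading = {"lam": [1.0, 0.8, 1.2], "b": [0.0, -0.5, 0.5], "beta0": 0.0,
                "beta1": 0.4, "sigma_zeta": 1.0, "sigma_eps": [1.0, 1.0, 1.0]}
# The same model with the first loading fixed to 2 instead.
double_loading = rescale(unit_loading, 2.0)

same_fit = all(math.isclose(x, y, abs_tol=1e-12)
               for x, y in zip(flatten(implied_moments(unit_loading)),
                               flatten(implied_moments(double_loading))))
print(same_fit)
print(unit_loading["beta1"], double_loading["beta1"])
print(standardized_ate(unit_loading), standardized_ate(double_loading))
```
</preformat>
<p>The unstandardized coefficient is halved when the reference loading is doubled, whereas the ratio in <xref ref-type="disp-formula" rid="x1-7005r8">Equation 8</xref> is the same under both parameterizations.</p>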
<p id="S1.PX1.P32">The function <inline-formula><mml:math id="math-78" display="inline"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> in <xref ref-type="disp-formula" rid="x1-7001r4">Equation 4</xref> is a link function that allows for both linear and non-linear models. When all <inline-formula><mml:math id="math-79" display="inline"><mml:msub><mml:mrow><mml:mi>λ</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> and the link function is logistic, i.e., <inline-formula><mml:math id="math-80" display="inline"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mtext>logit</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the LVM is equivalent to the One Parameter Logistic (1PL) Explanatory Item Response Model (EIRM; <xref ref-type="bibr" rid="Xbib30">De Boeck, 2004</xref>; <xref ref-type="bibr" rid="Xbib32">De Boeck &amp; Wilson, 2016</xref>; <xref ref-type="bibr" rid="Xbib141">Wilson et al., 2008</xref>). Note that the variance of the error term in the logistic model is fixed at <inline-formula><mml:math id="math-81" display="inline"><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi>π</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula> for model identification (<xref ref-type="bibr" rid="Xbib12">Breen et al., 2018</xref>; <xref ref-type="bibr" rid="Xbib93">Mood, 2010</xref>). 
When <inline-formula><mml:math id="math-82" display="inline"><mml:msub><mml:mrow><mml:mi>λ</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are freely estimated and the link function is an identity, i.e., <inline-formula><mml:math id="math-83" display="inline"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>x</mml:mi></mml:math></inline-formula>, the LVM is a linear Structural Equation Model (SEM). While LVMs such as the EIRM and SEM can be more complex to interpret than two-step approaches, LVMs estimate associations among latent variables theoretically stripped of measurement error. LVMs therefore deattenuate estimates of standardized regression coefficients because, unlike <inline-formula><mml:math id="math-84" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>Y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula><mml:math id="math-85" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>ζ</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is a consistent estimator of the residual <italic>SD</italic> of <inline-formula><mml:math id="math-86" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, thus counteracting the effects of measurement error compared to regression on observed scores (<xref ref-type="bibr" rid="Xbib14">Briggs, 2008</xref>; <xref ref-type="bibr" rid="Xbib25">Christensen, 2006</xref>; <xref ref-type="bibr" rid="Xbib134">Stoetzer et al., 2024</xref>; <xref ref-type="bibr" rid="Xbib144">Zwinderman, 1991</xref>). This suggests that LVMs may provide more accurate tests of between-group differences such as causal treatment effects.</p></sec>
<sec id="s1_2_4"><title>Model Assumptions</title>
<p id="S1.PX1.P33">In addition to the standard causal inference assumptions, namely the stable unit treatment value assumption (SUTVA) and unconfoundedness of the treatment assignment, causal inference in latent variable contexts requires further assumptions. First, <xref ref-type="disp-formula" rid="x1-7001r4">Equation 4</xref> assumes full measurement invariance between the treatment and control groups. That is, other than the treatment effect on the latent trait, the items function equivalently between the groups. An example violation of this assumption is “response shift,” whereby treatment causes participants to interpret items differently such that differences in post-intervention scores reflect changes to item functioning rather than changes to the latent variable (<xref ref-type="bibr" rid="Xbib99">Olivera-Aguilar &amp; Rikoon, 2023</xref>). <xref ref-type="bibr" rid="Xbib134">Stoetzer et al. (2024</xref>, p. 5) describe this assumption as “unconfounded measurement”. More flexible models, such as multi-group estimation that allows for heteroskedasticity (<xref ref-type="bibr" rid="Xbib64">Kim &amp; Yoon, 2011</xref>) or multidimensional models that allow for response style effects (<xref ref-type="bibr" rid="Xbib33">Deng et al., 2018</xref>), can relax these assumptions but are beyond the scope of the present study. Moreover, such approaches are relatively rare in applied causal inference (<xref ref-type="bibr" rid="Xbib128">Soland, Edwards, &amp; Talbert, 2024</xref>; <xref ref-type="bibr" rid="Xbib129">Soland &amp; Gilbert, 2025</xref>). We emphasize that while measurement invariance between treatment and control groups can and should be tested, such tests require item-level data; invariance is therefore difficult or impossible to assess in secondary analyses of sum or factor scores.</p>
<p id="S1.PX1.P34"><xref ref-type="fig" rid="x1-80021">Figure 1</xref> provides Directed Acyclic Graphs (DAGs) for the two-step and simultaneous estimation approaches and highlights a closely related assumption that is necessary for causal inference with latent variables. Namely, we must assume that the treatment eﬀect on the individual item responses is fully mediated by <inline-formula><mml:math id="math-87" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, similar to the exclusion restriction assumption in instrumental variables estimation (<xref ref-type="bibr" rid="Xbib52">Halpin &amp; Gilbert, 2024</xref>; <xref ref-type="bibr" rid="Xbib134">Stoetzer et al., 2024</xref>; <xref ref-type="bibr" rid="Xbib138">VanderWeele &amp; Vansteelandt, 2022</xref>). In other words, a treatment that improves <inline-formula><mml:math id="math-88" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is statistically equivalent to one that makes the items easier (<xref ref-type="bibr" rid="Xbib49">Gilbert, Miratrix, et al., 2025</xref>; <xref ref-type="bibr" rid="Xbib115">San Martín, 2016</xref>).<xref ref-type="fn" rid="fn4"><sup>4</sup></xref><fn id="fn4"><label>4</label>
<p id="S1.PX1.P35">Under certain interpretations, the exclusion restriction assumption can be subsumed by the unconfounded measurement assumption because a direct effect on item performance beyond the latent trait implies that the items are functioning differentially between the treatment and control groups. However, alternative conceptions allow for item-specific treatment effects without invoking changes to the item parameters, instead introducing an additional item-specific treatment sensitivity term into the model (e.g., <xref ref-type="bibr" rid="Xbib46">Gilbert, Himmelsbach, et al., 2025a</xref>, Figure F.1).</p></fn></p></sec>
<sec id="s1_2_5"><title>Summary</title>
	<p id="S1.PX1.P37">In sum, the applied researcher faces many choices in model selection when test score data are used as outcomes in a causal inference context: to use a one-step or two-step approach, to weight or not to weight the item responses in the construction of scores, to use CTT or IRT/FA, and so forth, as summarized in <xref ref-type="table" rid="x1-90011">Table 1</xref>. While exploratory data analysis can shed light on, for example, whether a 1PL or 2PL IRT model is a better fit to the data, to what extent does allowing for varying item discriminations/loadings in the estimation of the latent trait score aﬀect the bias, precision, and power of causal estimates? Are certain models consistently more robust than alternatives? This study seeks to shed light on these questions and leverage measurement principles for better application of causal inference in evaluation research by using Monte Carlo simulation and an empirical application to examine the performance of several models under varying conditions of test score construction and model estimation.</p><fig id="x1-80021" position="anchor" orientation="portrait"><label>Figure 1</label><caption><title>Directed Acyclic Graphs for Two-Step and Simultaneous Estimation</title>
<p id="S1.PX1.P36"><italic>Note</italic>. Squares indicate observed variables, hollow circles indicate latent variables. <inline-formula><mml:math id="math-89" display="inline"><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are item responses, <inline-formula><mml:math id="math-90" display="inline"><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the treatment indicator, and <inline-formula><mml:math id="math-91" display="inline"><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> represents the average treatment eﬀect. <inline-formula><mml:math id="math-92" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:mi>ζ</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the residual standard deviation of the latent variable. See also Figure 1 in <xref ref-type="bibr" rid="Xbib134">Stoetzer et al. (2024)</xref>.</p></caption><graphic mimetype="image" mime-subtype="png" xlink:href="meth.15773-f1.png" position="anchor" orientation="portrait"/></fig>
	<table-wrap id="x1-90011" position="anchor" orientation="portrait">
<label>Table 1</label><caption><title>Summary of Analysis Options for Estimating Treatment Effects on Latent Outcomes</title></caption>
<table frame="hsides" rules="groups" style="compact-1; striped-#f3f3f3"><colgroup span="1">
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/></colgroup>
<thead>
<tr>
	<th valign="bottom">Measurement Framework</th>
<th valign="bottom">Item Types</th>
	<th valign="bottom">ATE Estimation Strategy</th>
	<th valign="bottom">Weighting</th>
	<th valign="bottom">Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>CTT</td>
<td>dichotomous, polytomous, continuous</td>
<td>Two-step</td>
<td>Equal</td>
<td>Generate sum score, run regression</td>
</tr>
<tr>
<td>IRT</td>
<td>dichotomous, polytomous</td>
<td>Two-step</td>
<td>Equal or Variable</td>
<td>Generate IRT score, run regression</td>
</tr>
<tr>
<td>FA</td>
<td>continuous</td>
<td>Two-step</td>
<td>Equal or Variable</td>
<td>Generate FA score, run regression</td>
</tr>
<tr>
<td>CTT</td>
<td>dichotomous, polytomous, continuous</td>
<td>Simultaneous</td>
<td>Equal</td>
<td>LMM on item responses<sup>a</sup></td>
</tr>
<tr>
<td>IRT</td>
<td>dichotomous, polytomous</td>
<td>Simultaneous</td>
<td>Equal or Variable</td>
<td>EIRM on item responses</td>
</tr>
<tr>
<td>FA</td>
<td>continuous</td>
<td>Simultaneous</td>
<td>Equal or Variable</td>
<td>SEM on item responses</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
	<p id="S1.PX1.P38"><italic>Note</italic>. CTT = Classical Test Theory, IRT = Item Response Theory, FA = Factor Analysis, LMM = Linear Mixed Model, EIRM = Explanatory Item Response Model, SEM = Structural Equation Model.</p> <p><sup>a</sup>The linear mixed model applicable to simultaneous CTT estimation is equivalent to the FA model with equal loadings when the item parameters are fixed; see <xref ref-type="bibr" rid="Xbib10">Borsboom (2005)</xref> for a discussion of underlying equivalencies between models.</p>
</table-wrap-foot>
</table-wrap></sec></sec></sec>
<sec id="s2" sec-type="Methods"><title>Method</title>
<sec id="s2_1"><title>Data Generating Process</title>
<p id="S2.PX1.P1">The simulation and data analysis procedures are implemented in R. In total, we simulate 18,000 data sets (1,000 data sets for each of 18 data-generating conditions) and apply four analytic models (sum score, factor score, equal loading SEM, and variable loading SEM) to each, for a total of 72,000 results. We use a full factorial design to assess the performance of each model under a range of treatment effect sizes and items of varying discriminating power. To maintain focus on the contrasts between the models and the effects of item characteristics, we fix the number of subjects at 500 and the number of items at 10 to represent a moderate sample size and moderate test length. The latent trait scores <inline-formula><mml:math id="math-93" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are drawn from <inline-formula><mml:math id="math-94" display="inline"><mml:mi>N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mtext>treat</mml:mtext></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and the item intercepts <inline-formula><mml:math id="math-95" display="inline"><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are drawn from <inline-formula><mml:math id="math-96" display="inline"><mml:mi>N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The latent variables are converted to continuous observed scores for each item using <xref ref-type="disp-formula" rid="x1-7001r4">Equation 4</xref>.
The residual <italic>SD</italic> for each item <inline-formula><mml:math id="math-97" display="inline"><mml:msub><mml:mrow><mml:mi>σ</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="normal">ε</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:math></inline-formula> is defined as <inline-formula><mml:math id="math-98" display="inline"><mml:msqrt><mml:mrow><mml:mn>1</mml:mn><mml:mo>−</mml:mo><mml:msubsup><mml:mrow><mml:mi>λ</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:msqrt></mml:math></inline-formula> so that items with higher loadings have lower residual variances. The simulation factors include null, moderate, and large treatment effect sizes (0, 0.2, or 0.4 SDs on the latent trait, based on empirical distributions of effect sizes in education research; e.g., <xref ref-type="bibr" rid="Xbib70">Kraft, 2020</xref>), moderate and high average factor loadings (<inline-formula><mml:math id="math-99" display="inline"><mml:msub><mml:mrow><mml:mi>μ</mml:mi></mml:mrow><mml:mrow><mml:mi>λ</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0.4</mml:mn><mml:mo>,</mml:mo><mml:mn>0.6</mml:mn></mml:math></inline-formula>), and constant, moderately variable, or highly variable factor loadings (<inline-formula><mml:math id="math-100" display="inline"><mml:msub><mml:mrow><mml:mi>λ</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>∼</mml:mo><mml:mtext>Unif</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mi>μ</mml:mi></mml:mrow><mml:mrow><mml:mi>λ</mml:mi></mml:mrow></mml:msub><mml:mo>−</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>μ</mml:mi></mml:mrow><mml:mrow><mml:mi>λ</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> where <inline-formula><mml:math id="math-101" display="inline"><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>0.15</mml:mn><mml:mo>,</mml:mo><mml:mn>0.3</mml:mn></mml:math></inline-formula>).</p>
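<p>As a rough illustration of the data-generating process described above, the following Python sketch simulates a single data set. It is not the R code used in this study: the function name <monospace>simulate_dataset</monospace>, the balanced random assignment, and the linear item model (item intercept plus loading times latent trait plus a normal residual, one reading of Equation 4) are assumptions made for illustration only.</p>
<code language="python">
```python
import math
import random


def simulate_dataset(n_subjects=500, n_items=10, beta1=0.2,
                     mean_loading=0.4, half_range=0.15, seed=1):
    """Simulate continuous item responses for one hypothetical data set."""
    rng = random.Random(seed)
    # Item parameters: intercepts b_i ~ N(0, 1), loadings ~ Unif(mu - x, mu + x).
    intercepts = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    loadings = [rng.uniform(mean_loading - half_range, mean_loading + half_range)
                for _ in range(n_items)]
    # Residual SD sqrt(1 - lambda_i^2): higher loadings imply less item noise.
    resid_sds = [math.sqrt(1.0 - lam ** 2) for lam in loadings]

    # Assumption: balanced assignment, shuffled to mimic randomization.
    treatments = [j % 2 for j in range(n_subjects)]
    rng.shuffle(treatments)

    rows = []
    for treat in treatments:
        theta = rng.gauss(beta1 * treat, 1.0)  # theta_j ~ N(beta1 * treat_j, 1)
        items = [b + lam * theta + rng.gauss(0.0, sd)
                 for b, lam, sd in zip(intercepts, loadings, resid_sds)]
        rows.append({"treat": treat, "theta": theta, "items": items})
    return loadings, rows


loadings, rows = simulate_dataset()
print(len(rows), len(rows[0]["items"]))
```
</code>
<p>Setting <monospace>half_range</monospace> to 0, 0.15, or 0.3 and <monospace>mean_loading</monospace> to 0.4 or 0.6 would reproduce the loading conditions of the design; the default values shown here are arbitrary.</p>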
<p id="S2.PX1.P2">For each simulated data set, we estimate the treatment effect under each model and record the associated <italic>z</italic>-statistic, <italic>p</italic>-value, and whether the null hypothesis was rejected. The models for the sum scores and factor scores are OLS regressions of identical form, and the SEMs are estimated using maximum likelihood with fixed item intercepts. In all models, the parameter of interest is the ATE <inline-formula><mml:math id="math-102" display="inline"><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, and the errors are assumed to be normally distributed with mean 0 and constant variance and uncorrelated with the predictors. We also calculate <inline-formula><mml:math id="math-103" display="inline"><mml:mi>ω</mml:mi></mml:math></inline-formula> for each simulated test as an estimate of <inline-formula><mml:math id="math-104" display="inline"><mml:mi>ρ</mml:mi></mml:math></inline-formula> to assess the effect of applying EIV corrections to the two-step models. To render each ATE comparable, we divide <inline-formula><mml:math id="math-105" display="inline"><mml:msub><mml:mrow><mml:mi>β</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> by the RMSE of the regression model to standardize the coefficients, as the RMSE represents the estimated pooled (i.e., within-group) standard deviation of the latent trait <inline-formula><mml:math id="math-106" display="inline"><mml:msub><mml:mrow><mml:mi>𝜃</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Thus, the standardized coefficients are equivalent to Cohen’s <italic>d</italic> effect size.</p>
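<p>The standardization and EIV correction described above can be traced with a small large-sample calculation. The Python sketch below is illustrative only: the helper names <monospace>omega_from_loadings</monospace> and <monospace>expected_sum_score_d</monospace> are hypothetical, and the expression used for <monospace>omega</monospace> assumes a one-factor model with unit factor variance and residual variances equal to one minus the squared loading, as in the simulation design.</p>
<code language="python">
```python
import math


def omega_from_loadings(loadings):
    """Coefficient omega under a one-factor model with unit factor variance
    and residual variances 1 - lambda_i^2 (an assumption of this sketch)."""
    common = sum(loadings) ** 2
    unique = sum(1.0 - lam ** 2 for lam in loadings)
    return common / (common + unique)


def expected_sum_score_d(true_d, loadings):
    """Large-sample standardized effect on the sum score."""
    effect = true_d * sum(loadings)  # mean shift in the sum score
    within_sd = math.sqrt(sum(loadings) ** 2
                          + sum(1.0 - lam ** 2 for lam in loadings))
    return effect / within_sd


loadings = [0.4] * 10  # ten items, constant loading of 0.4
omega = omega_from_loadings(loadings)
observed = expected_sum_score_d(0.4, loadings)
corrected = observed / math.sqrt(omega)  # EIV correction
print(round(omega, 3), round(observed, 3), round(corrected, 3))
# prints: 0.656 0.324 0.4
```
</code>
<p>Under these assumptions, a true effect of 0.4 <italic>SD</italic> on the latent trait shrinks to roughly 0.32 on the standardized sum score, and dividing by the square root of the reliability recovers the original value.</p>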
<p id="S2.PX1.P3">We use <monospace>lm</monospace> to fit the OLS models, <monospace>lme4</monospace> to fit the FA models with equal loadings (<xref ref-type="bibr" rid="Xbib6">Bates et al., 2015</xref>), and <monospace>lavaan</monospace> (<xref ref-type="bibr" rid="Xbib112">Rosseel, 2012</xref>) to fit the FA models with variable loadings.<xref ref-type="fn" rid="fn5"><sup>5</sup></xref><fn id="fn5"><label>5</label>
	<p id="S2.PX1.P4">We use <monospace>lme4</monospace> to fit the equal-loading FA models because the syntax is more convenient than that of <monospace>lavaan</monospace>. We use MLE rather than the default REML in <monospace>lme4</monospace> so that the estimation procedures are identical across packages. Extensions to <monospace>lme4</monospace> such as <monospace>PLmixed</monospace> and <monospace>galamm</monospace> allow for more complex measurement models to be fit with <monospace>lme4</monospace> syntax (<xref ref-type="bibr" rid="Xbib109">Rockwood &amp; Jeon, 2019</xref>; <xref ref-type="bibr" rid="Xbib133">Sørensen, 2024</xref>).</p></fn> For the two-step approach using factor scores as outcomes, we use expected a posteriori (EAP) scores (<xref ref-type="bibr" rid="Xbib24">Chapman, 2022</xref>; <xref ref-type="bibr" rid="Xbib80">Lu et al., 2005</xref>; <xref ref-type="bibr" rid="Xbib94">Muraki &amp; Engelhard, 1985</xref>). While there are many factor scoring options available in addition to EAP (<xref ref-type="bibr" rid="Xbib51">Grice, 2001</xref>), such as maximum a posteriori (MAP), maximum likelihood, and test characteristic curve (TCC, common in IRT, <xref ref-type="bibr" rid="Xbib4">Baker et al., 2017</xref>) scoring, we use EAP scoring because it is among the most commonly used approaches and because the Bayesian shrinkage of EAP estimation reduces the <italic>SD</italic> of the observed scores, whose inflation by measurement error is the primary cause of attenuation bias. For the EIV corrected two-step models, we calculate <inline-formula><mml:math id="math-107" display="inline"><mml:mi>ω</mml:mi></mml:math></inline-formula> using the R package <monospace>psych</monospace> (<xref ref-type="bibr" rid="Xbib107">Revelle &amp; Condon, 2019</xref>). We collect the default <italic>SE</italic> provided by each model and estimate the true <italic>SE</italic> as the <italic>SD</italic> of the point estimates across simulated data sets.</p></sec></sec>
<sec id="s3" sec-type="Results"><title>Results</title>
<p id="S3.PX1.P1"><xref ref-type="fig" rid="x1-120022">Figure 2</xref> shows the mean bias and Monte Carlo 95% confidence intervals for each method across all simulation conditions. We see that when the ATE is 0, bias is negligible across all conditions. However, when the ATE is positive, the two-step procedures are downwardly biased, the bias is proportional to the treatment effect size (as expected given that the standardized effect size is a ratio), and the bias is more severe when the average loadings are lower because lower average loadings translate to lower test reliability. In contrast, the LVMs do not show the same pattern of attenuation and are approximately unbiased across all conditions. Crucially, the performance of the SEM assuming equal factor loadings is indistinguishable from the SEM allowing for variable loadings, even when the range of loadings is high. The factor score allowing for variable weights only slightly outperforms the sum score when the range of loadings is highest, but its performance is nonetheless bested by the equal-weight SEM.</p>
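<p>Bias summaries of the kind plotted in <xref ref-type="fig" rid="x1-120022">Figure 2</xref> can be sketched with a small Monte Carlo experiment. The Python code below is a simplified stand-in for the R simulation, not a reproduction of it: it covers only the uncorrected sum score with constant loadings, uses 200 replications rather than 1,000, and assumes the interval is formed as the mean bias plus or minus 1.96 Monte Carlo standard errors. The function names <monospace>sum_score_d</monospace> and <monospace>mc_bias_ci</monospace> are hypothetical.</p>
<code language="python">
```python
import math
import random
import statistics


def sum_score_d(rng, n=500, n_items=10, beta1=0.4, loading=0.4):
    """One replication: standardized mean difference in sum scores.
    Item intercepts are omitted because constant shifts cancel out."""
    resid_sd = math.sqrt(1.0 - loading ** 2)
    groups = {0: [], 1: []}
    for j in range(n):
        treat = j % 2  # assumption: balanced allocation
        theta = rng.gauss(beta1 * treat, 1.0)
        score = sum(loading * theta + rng.gauss(0.0, resid_sd)
                    for _ in range(n_items))
        groups[treat].append(score)
    diff = statistics.fmean(groups[1]) - statistics.fmean(groups[0])
    # Pooled within-group SD, the analogue of the regression RMSE.
    ss = sum((len(g) - 1) * statistics.variance(g) for g in groups.values())
    pooled_sd = math.sqrt(ss / (n - 2))
    return diff / pooled_sd


def mc_bias_ci(estimates, truth):
    """Mean bias with a Monte Carlo 95% interval, bias +/- 1.96 * s / sqrt(n)."""
    n = len(estimates)
    bias = statistics.fmean(estimates) - truth
    mcse = statistics.stdev(estimates) / math.sqrt(n)
    return bias, bias - 1.96 * mcse, bias + 1.96 * mcse


rng = random.Random(2024)
estimates = [sum_score_d(rng) for _ in range(200)]
bias, lower, upper = mc_bias_ci(estimates, truth=0.4)
print(round(bias, 3), round(lower, 3), round(upper, 3))
```
</code>
<p>With an average loading of 0.4 and a true effect of 0.4 <italic>SD</italic>, the estimated bias in this sketch should be clearly negative, in line with the attenuation pattern described above.</p>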
<p id="S3.PX1.P2">These results clearly illustrate that attenuation bias due to measurement error with standardized outcome variables is a more serious concern than the decision of whether to weight or not to weight the item responses. When we correct the two-step procedures for measurement error by dividing the coefficients by <inline-formula><mml:math id="math-108" display="inline"><mml:msqrt><mml:mrow><mml:mi>ω</mml:mi></mml:mrow></mml:msqrt></mml:math></inline-formula> as shown in <xref ref-type="fig" rid="x1-120033">Figure 3</xref>, we find that the performance of the sum score is indistinguishable from that of the LVMs.<xref ref-type="fn" rid="fn6"><sup>6</sup></xref><fn id="fn6"><label>6</label>
<p id="S3.PX1.P3">Note that an EIV correction based on <inline-formula><mml:math id="math-109" display="inline"><mml:mi>α</mml:mi></mml:math></inline-formula> will potentially overcorrect the factor score when the loadings are extremely variable. This occurs because the calculation of <inline-formula><mml:math id="math-110" display="inline"><mml:mi>α</mml:mi></mml:math></inline-formula> assumes equal loadings and provides a lower bound on test reliability (<xref ref-type="bibr" rid="Xbib89">McNeish &amp; Wolf, 2020</xref>, p. 2292). When this assumption is not met, <inline-formula><mml:math id="math-111" display="inline"><mml:mi>ρ</mml:mi><mml:mo>&gt;</mml:mo><mml:mi>α</mml:mi></mml:math></inline-formula> so dividing by <inline-formula><mml:math id="math-112" display="inline"><mml:msqrt><mml:mrow><mml:mi>α</mml:mi></mml:mrow></mml:msqrt></mml:math></inline-formula> provides an overcorrection for the factor score model. Thus, if the range of loadings is large, a measure of reliability appropriate for models with variable factor loadings such as <inline-formula><mml:math id="math-113" display="inline"><mml:mi>ω</mml:mi></mml:math></inline-formula> (<xref ref-type="bibr" rid="Xbib56">Hayes &amp; Coutts, 2020</xref>) should be applied, as we do in our analyses. We also note that the concept of reliability is less straightforward in an IRT framework because the conditional standard error of measurement varies across the <inline-formula><mml:math id="math-114" display="inline"><mml:mi>𝜃</mml:mi></mml:math></inline-formula> scale; see, e.g., <xref ref-type="bibr" rid="Xbib67">Kim and Feldt (2010)</xref> and <xref ref-type="bibr" rid="Xbib77">Lockwood and McCaffrey (2014)</xref>.
However, marginal reliability in IRT provides a conceptually similar quantity that may be used for EIV corrections if desired, and <inline-formula><mml:math id="math-115" display="inline"><mml:mi>ω</mml:mi></mml:math></inline-formula> functions well even with dichotomous items in moderate sample sizes (<xref ref-type="bibr" rid="Xbib102">Padilla &amp; Divers, 2016</xref>).</p></fn></p><fig id="x1-120022" position="float" orientation="portrait"><label>Figure 2</label><caption><title>Estimated Bias by Method, Standardized Scores</title>
<p id="S3.PX1.P4"><italic>Note</italic>. The points indicate the bias in standard deviation units, without the EIV correction applied to the two-step models. The bars indicate the Monte Carlo 95% CIs, calculated from the Monte Carlo standard error <inline-formula><mml:math id="math-116" display="inline"><mml:mfrac><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:math></inline-formula>, where <inline-formula><mml:math id="math-117" display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula> is the number of simulation trials and <inline-formula><mml:math id="math-118" display="inline"><mml:mi>s</mml:mi></mml:math></inline-formula> is the standard deviation of the point estimates.</p></caption><graphic mimetype="image" mime-subtype="png" xlink:href="meth.15773-f2.png" position="float" orientation="portrait"/></fig><fig id="x1-120033" position="float" orientation="portrait"><label>Figure 3</label><caption><title>Estimated Bias by Method, EIV Correction Applied to Two-Step Scores</title>
<p id="S3.PX1.P5"><italic>Note</italic>. The points indicate the bias in standard deviation units, with the EIV correction applied to the two-step models. The bars indicate the Monte Carlo 95% CIs, calculated from the Monte Carlo standard error <inline-formula><mml:math id="math-119" display="inline"><mml:mfrac><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:math></inline-formula>, where <inline-formula><mml:math id="math-120" display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula> is the number of simulation trials and <inline-formula><mml:math id="math-121" display="inline"><mml:mi>s</mml:mi></mml:math></inline-formula> is the standard deviation of the point estimates.</p></caption><graphic mimetype="image" mime-subtype="png" xlink:href="meth.15773-f3.png" position="float" orientation="portrait"/></fig>
	<p id="S3.PX1.P6">We include additional simulation results in <xref ref-type="app" rid="x1-25000A">Appendix A</xref>. In short, results show that differences between all models are minimal in terms of absolute precision (the <italic>SD</italic> of the point estimates), standard error calibration (the mean model-based <italic>SE</italic> as a proportion of the true <italic>SE</italic>), false positive rates, and statistical power, with the EIV correction applied to the two-step models. These results suggest that once the attenuation bias of the two-step models has been corrected, the choice of model does not appear to have strong impacts on the other statistical properties of the ATE. As a final sensitivity check, we rerun analogous simulations using dichotomous responses and IRT models rather than continuous responses and FA models. We find essentially identical results to those reported here, suggesting that our findings are consistent across multiple specifications and outcome item types. We include the full IRT simulation analysis results in our supplement.</p></sec>
<sec id="s4" sec-type="Empirical|Application"><title>Empirical Application</title>
<sec id="s4_1"><title>Data Source</title>
<p id="S4.PX1.P1">To illustrate how the issues of scoring and model selection can play out in practice, we use a selection of empirical datasets from RCTs containing item-level outcome data from <xref ref-type="bibr" rid="Xbib46">Gilbert, Himmelsbach, et al. (2025a)</xref>. The authors analyze 75 datasets from 48 RCTs to examine models for item-level heterogeneous treatment effects, in which treatments may uniquely impact each item of an outcome measure. Here, we take a random sample of 10 datasets to illustrate the results of different analytic approaches to estimating average treatment effects across a range of contexts. <xref ref-type="table" rid="x1-140012">Table 2</xref> provides a summary of each dataset. The datasets cover a range of geographic regions and outcome measures and show a wide range of estimated reliability, with <inline-formula><mml:math id="math-122" display="inline"><mml:mi>ω</mml:mi></mml:math></inline-formula> ranging from 0.43 to 0.95. In contrast to the simulation, we cannot know the true value of the treatment effect on the latent trait in the empirical data. However, we can still examine the results of the different analytic models explored in the simulation and see how sensitive the results are to the modeling choices.</p>
<table-wrap id="x1-140012" position="float" orientation="portrait">
<label>Table 2</label><caption><title>Studies Included in Our Empirical Analysis</title></caption>
<table frame="hsides" rules="groups">
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col/>
<thead>
<tr>
<th>ID</th>
<th>Citation</th>
<th>Location</th>
<th>Outcome</th>
<th><italic>N</italic></th>
<th><italic>I</italic></th>
<th>ω</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td><xref ref-type="bibr" rid="Xbib18">Bruhn et al. (2016)</xref></td>
<td>Brazil</td>
<td>Financial Literacy</td>
<td>16852</td>
<td>10</td>
<td align="char" char=".">0.57</td>
</tr>
<tr>
<td>2</td>
<td><xref ref-type="bibr" rid="Xbib66">Kim et al. (2024)</xref></td>
<td>USA</td>
<td>Reading Comprehension</td>
<td>1335</td>
<td>29</td>
<td align="char" char=".">0.86</td>
</tr>
<tr>
<td>3</td>
<td><xref ref-type="bibr" rid="Xbib65">Kim et al. (2021)</xref></td>
<td>USA</td>
<td>Vocabulary</td>
<td>2588</td>
<td>24</td>
<td align="char" char=".">0.79</td>
</tr>
<tr>
<td>4</td>
<td><xref ref-type="bibr" rid="Xbib110">Romero et al. (2020)</xref></td>
<td>Liberia</td>
<td>Raven’s Progressive Matrices</td>
<td>3510</td>
<td>10</td>
<td align="char" char=".">0.63</td>
</tr>
<tr>
<td>5</td>
<td><xref ref-type="bibr" rid="Xbib38">Duflo et al. (2024)</xref></td>
<td>Ghana</td>
<td>English</td>
<td>27201</td>
<td>21</td>
<td align="char" char=".">0.89</td>
</tr>
<tr>
<td>6</td>
<td><xref ref-type="bibr" rid="Xbib20">Carpena (2024)</xref></td>
<td>India</td>
<td>Health Knowledge</td>
<td>839</td>
<td>21</td>
<td align="char" char=".">0.75</td>
</tr>
<tr>
<td>7</td>
<td><xref ref-type="bibr" rid="Xbib81">Luo et al. (2019)</xref></td>
<td>China</td>
<td>Parenting Beliefs</td>
<td>449</td>
<td>11</td>
<td align="char" char=".">0.43</td>
</tr>
<tr>
<td>8</td>
<td><xref ref-type="bibr" rid="Xbib82">Lyall et al. (2020)</xref></td>
<td>Afghanistan</td>
<td>Pro-government attitudes</td>
<td>1853</td>
<td>9</td>
<td align="char" char=".">0.49</td>
</tr>
<tr>
<td>9</td>
<td><xref ref-type="bibr" rid="Xbib5">Banerjee et al. (2017)</xref></td>
<td>India</td>
<td>Math</td>
<td>9193</td>
<td>30</td>
<td align="char" char=".">0.93</td>
</tr>
<tr>
<td>10</td>
<td><xref ref-type="bibr" rid="Xbib5">Banerjee et al. (2017)</xref></td>
<td>India</td>
<td>Math</td>
<td>5356</td>
<td>30</td>
<td align="char" char=".">0.95</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p id="S4.PX1.P2"><italic>Note. N</italic> = number of subjects, <italic>I</italic> = number of items, ω = estimated reliability of the outcome measure.</p>
</table-wrap-foot>
</table-wrap></sec>
<sec id="s4_2"><title>Analytic Models</title>
<p id="S4.PX1.P3">We apply eight estimators to produce treatment effects from each dataset. Because all item responses are dichotomous, we use logistic IRT models instead of the linear SEM. For two-step approaches, we calculate the mean score, 1PL IRT score, and 2PL IRT score, regress the resulting scores on the treatment indicator, and standardize the results to calculate Cohen’s <inline-formula><mml:math id="math-123" display="inline"><mml:mi>d</mml:mi></mml:math></inline-formula>. We then apply the EIV correction to these three estimates, dividing the estimated ATE by <inline-formula><mml:math id="math-124" display="inline"><mml:msqrt><mml:mrow><mml:mi>ω</mml:mi></mml:mrow></mml:msqrt></mml:math></inline-formula>. (Our supplemental simulations of IRT models show that the classical EIV correction works well even though single-number estimates of reliability are less common in IRT frameworks, where precision varies over the range of the latent trait; see <xref ref-type="bibr" rid="Xbib98">Nicewander, 2018</xref>; <xref ref-type="bibr" rid="Xbib105">Raju et al., 2007</xref>.) We use mean scores instead of sum scores because there is some missing item response data. For one-step approaches, we estimate 1PL and 2PL EIRMs that allow for a treatment effect directly on the latent trait.</p></sec>
<sec id="s4_3"><title>Results</title>
<p id="S4.PX1.P4"><xref ref-type="fig" rid="x1-160014">Figure 4</xref> shows the point estimates and 95% CIs for the standardized treatment effect from the eight estimators applied to our 10 datasets. The datasets are ordered by <inline-formula><mml:math id="math-125" display="inline"><mml:mi>ω</mml:mi></mml:math></inline-formula> from lowest to highest. When <inline-formula><mml:math id="math-126" display="inline"><mml:mi>ω</mml:mi></mml:math></inline-formula> is high (datasets 2, 5, 9, 10), differences between the estimators are minimal, as expected. Datasets with moderate <inline-formula><mml:math id="math-127" display="inline"><mml:mi>ω</mml:mi></mml:math></inline-formula> (datasets 1, 4, 6, 3) most clearly mirror the simulation results, showing estimates from two-step models generally lower than alternative approaches, and minimal differences between EIV corrected estimates and estimates from the one-step models. In dataset 8, the ATE is near 0 and all estimators are very close to the null value. Only when <inline-formula><mml:math id="math-128" display="inline"><mml:mi>ω</mml:mi></mml:math></inline-formula> is very low, as in dataset 7 (<inline-formula><mml:math id="math-129" display="inline"><mml:mi>ω</mml:mi><mml:mo>=</mml:mo><mml:mo>.</mml:mo><mml:mn>43</mml:mn></mml:math></inline-formula>), do we see meaningful differences between weighted and unweighted estimators, with the 2PL approaches yielding negative point estimates and the 1PL approaches yielding positive point estimates. Taken as a whole, these results again suggest that once EIV corrections are applied, differences between estimators are likely to be minor in all but the most extreme cases.</p><fig id="x1-160014" position="anchor" orientation="portrait"><label>Figure 4</label><caption><title>Estimated Treatment Effects for 10 RCT Datasets</title>
<p id="S4.PX1.P5"><italic>Note</italic>. The points indicate the estimated treatment effect size in standard deviation units. The bars indicate the model-based 95% CIs. 1PL and 2PL indicate that IRT-based scores are derived from one-parameter logistic and two-parameter logistic models, respectively. The EIRM is equivalent to an SEM with a logistic link function.</p></caption><graphic mimetype="image" mime-subtype="png" xlink:href="meth.15773-f4.png" position="anchor" orientation="portrait"/></fig></sec></sec>
<sec id="s5" sec-type="Discussion"><title>Discussion</title>
<p id="S5.PX1.P1">Because psychometric outcome measures are a noisy proxy of a latent trait of interest, they suffer from measurement error, which results in negatively biased treatment effect estimates when outcome variables are standardized. Simulation results show that, across outcome data with different properties, the bias is substantial when treatment effect sizes are large, as predicted by Classical Test Theory. However, when the EIV correction is applied and the standardized coefficients are divided by <inline-formula><mml:math id="math-130" display="inline"><mml:msqrt><mml:mrow><mml:mi>ω</mml:mi></mml:mrow></mml:msqrt></mml:math></inline-formula>, differences in model performance are negligible under most conditions. Thus, standardization, the very process that makes different statistical models comparable to one another, biases two-step models, and the effect of this bias dominates other features of the data generating process, including the variability of factor loadings used to create scoring weights. When left unaddressed, such bias could lead researchers and policymakers to erroneous conclusions about the efficacy of interventions.</p>
<p id="S5.PX1.P2">As a concrete example of how attenuation bias could affect substantive results, consider meta-analyses that pool the effects of interventions on test score outcomes across studies. Even if the true underlying effect is equal across all studies, when the outcome measures are of varying reliability, the estimated effect sizes will differ due to attenuation bias, even as the participant and study sample sizes grow to infinity (<xref ref-type="bibr" rid="Xbib9">Borenstein et al., 2009</xref>). Thus, failing to adjust standardized effect sizes for measurement error may lead to spurious conclusions about treatment heterogeneity. The degree to which such issues may be related to the ongoing “replication crisis” in psychology and other fields is an open question (<xref ref-type="bibr" rid="Xbib100">Open Science Collaboration, 2015</xref>), but it seems plausible that variation in measurement practices may play a role in explaining variation in conclusions across studies (<xref ref-type="bibr" rid="Xbib41">Flake et al., 2017</xref>; <xref ref-type="bibr" rid="Xbib40">Flake &amp; Fried, 2020</xref>; <xref ref-type="bibr" rid="Xbib103">Pedersen et al., 2025</xref>).</p>
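<p>The meta-analytic scenario above can be made concrete with a short calculation. In the Python sketch below, the common true effect of 0.3 <italic>SD</italic> and the four reliability values are hypothetical choices for illustration (the reliabilities fall within the range reported in <xref ref-type="table" rid="x1-140012">Table 2</xref>), and the large-sample relation between the observed and true standardized effect, observed <italic>d</italic> equal to true <italic>d</italic> times the square root of the reliability, is assumed to hold exactly.</p>
<code language="python">
```python
import math

true_d = 0.3  # hypothetical common effect on the latent trait
reliabilities = [0.43, 0.57, 0.75, 0.95]  # illustrative values only

# Attenuated effect each study would report in large samples.
observed = [true_d * math.sqrt(rel) for rel in reliabilities]
# EIV correction: divide each observed effect by sqrt(reliability).
corrected = [d / math.sqrt(rel) for d, rel in zip(observed, reliabilities)]

for rel, obs, cor in zip(reliabilities, observed, corrected):
    print(f"reliability={rel:.2f} observed d={obs:.3f} corrected d={cor:.3f}")
```
</code>
<p>Although every hypothetical study shares the same true effect, the uncorrected estimates spread out in proportion to the square root of each measure's reliability, which a naive synthesis could misread as treatment heterogeneity; the corrected values coincide.</p>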
<p id="S5.PX1.P3">Our interpretation of these results is that researchers may be overly focused on second-order measurement issues, such as the use of variable factor loadings that function as optimal scoring weights (<xref ref-type="bibr" rid="Xbib86">McNeish, 2022</xref>; <xref ref-type="bibr" rid="Xbib89">McNeish &amp; Wolf, 2020</xref>), rather than the first-order issue of attenuation of standardized coefficients due to measurement error in the outcome variable (<xref ref-type="bibr" rid="Xbib122">Shear &amp; Briggs, 2024</xref>; <xref ref-type="bibr" rid="Xbib140">Widaman &amp; Revelle, 2023</xref>). That is, when the EIV correction is applied, differences between the simplest standardized sum score model and the more complex LVMs are negligible in terms of bias, precision, and statistical power in the estimation of treatment effects, and this result holds even when the variability of factor loadings is high. Thus, when causal inference at a single time point is the primary goal, the use of sum scores with the EIV correction is likely to be sufficient for many applications in program evaluation.</p>
<p id="S5.PX1.P4">These results should not detract from other uses of LVMs. Clearly, IRT/FA methods are essential for piloting measures, identifying poorly functioning items (<xref ref-type="bibr" rid="Xbib63">Jessen et al., 2018</xref>), differential item functioning analysis (<xref ref-type="bibr" rid="Xbib101">Osterlind &amp; Everson, 2009</xref>), vertical scaling (<xref ref-type="bibr" rid="Xbib16">Briggs &amp; Domingue, 2013</xref>), linking (<xref ref-type="bibr" rid="Xbib73">Lee &amp; Lee, 2018</xref>), and addressing missing data (<xref ref-type="bibr" rid="Xbib44">Gilbert, 2024a</xref>), and LVMs can easily be expanded to incorporate complex relationships among many latent variables or multidimensional constructs at several time points (<xref ref-type="bibr" rid="Xbib69">Kline, 2023</xref>). A particularly valuable use case for LVMs in causal inference would be settings in which treatment may differentially impact individual items and the LVM can provide insights on treatment heterogeneity, such as “item-level heterogeneous treatment effects” that would be masked in a two-step analysis (<xref ref-type="bibr" rid="Xbib1">Ahmed et al., 2024</xref>; <xref ref-type="bibr" rid="Xbib45">Gilbert, 2024b</xref>; <xref ref-type="bibr" rid="Xbib46">Gilbert, Himmelsbach, et al., 2025a</xref>; <xref ref-type="bibr" rid="Xbib47">Gilbert et al., 2023</xref>; <xref ref-type="bibr" rid="Xbib114">Sales et al., 2021</xref>), differential growth by item type (<xref ref-type="bibr" rid="Xbib15">Briggs, 2021</xref>; <xref ref-type="bibr" rid="Xbib48">Gilbert et al., 2024</xref>; <xref ref-type="bibr" rid="Xbib97">Naumann et al., 2014</xref>), or the appropriate interpretation of interaction effects (<xref ref-type="bibr" rid="Xbib37">Domingue, Kanopka, Trejo, et al., 2024</xref>; <xref ref-type="bibr" rid="Xbib49">Gilbert, Miratrix, et al., 2025</xref>).
However, when all respondents answer the same items at a single time point, and only average treatment effects are of interest, the results appear relatively insensitive to the methods employed when the EIV correction is applied. Therefore, the benefits of interpretability and computational simplicity may favor the EIV-corrected standardized sum score in many straightforward causal inference applications, despite arguments that the sum score can be a suboptimal choice (in general) because the constraint of equal factor loadings imposed by the sum score is rarely met in real data (<xref ref-type="bibr" rid="Xbib89">McNeish &amp; Wolf, 2020</xref>).</p>
<p id="S5.PX1.P5">While the results of this study provide evidence for the importance of EIV corrections in two-step analyses of standardized test score outcome variables, several limitations merit consideration. For example, the data generating process employed in this study examines the simple case of individual randomization with no covariates beyond the treatment indicator. Future work may extend it to explore how measurement model selection affects the estimation of heterogeneous treatment effects, the effects of predictive covariates, multilevel structures such as multi-site or cluster-randomized trials, or alternative experimental and quasi-experimental contexts such as regression discontinuity, difference-in-differences, instrumental variables, and longitudinal analyses. An emerging literature on the synthesis of latent variable and causal inference methods has begun to shed light on these areas (<xref ref-type="bibr" rid="Xbib46">Gilbert, Himmelsbach, et al., 2025a</xref>; <xref ref-type="bibr" rid="Xbib48">Gilbert et al., 2024</xref>; <xref ref-type="bibr" rid="Xbib49">Gilbert, Miratrix, et al., 2025</xref>; <xref ref-type="bibr" rid="Xbib71">Kuhfeld &amp; Soland, 2022</xref>, <xref ref-type="bibr" rid="Xbib72">2023</xref>; <xref ref-type="bibr" rid="Xbib83">Mayer, 2019</xref>; <xref ref-type="bibr" rid="Xbib92">Miratrix et al., 2021</xref>; <xref ref-type="bibr" rid="Xbib104">Rabbitt, 2018</xref>; <xref ref-type="bibr" rid="Xbib126">Soland, 2022</xref>, <xref ref-type="bibr" rid="Xbib127">2023</xref>; <xref ref-type="bibr" rid="Xbib130">Soland et al., 2023</xref>).</p>
	<p id="S5.PX1.P6">A related issue is measurement error in covariates, which we did not explore in this study. In theory, in an RCT, any bias induced by covariate measurement error will affect treatment and control groups equally and thus should not affect estimation of the ATE (<xref ref-type="bibr" rid="Xbib77">Lockwood &amp; McCaffrey, 2014</xref>). In observational studies, however, covariate measurement error can lead to biases when the covariates do not fully control for relevant differences between groups (<xref ref-type="bibr" rid="Xbib27">Cook et al., 2009</xref>; <xref ref-type="bibr" rid="Xbib121">Sengewald &amp; Pohl, 2019</xref>; <xref ref-type="bibr" rid="Xbib019a">Sengewald et al., 2019</xref>). Factor models with latent covariates and outcomes are easily estimable in <monospace>lavaan</monospace> when the indicators are continuous. However, software options for the categorical responses common in the social sciences are currently less widely used in R, though recent developments such as <monospace>galamm</monospace> (<xref ref-type="bibr" rid="Xbib133">Sørensen, 2024</xref>) and <monospace>EffectLiteR</monospace> (<xref ref-type="bibr" rid="Xbib84">Mayer et al., 2016</xref>, <xref ref-type="bibr" rid="Xbib85">2020</xref>; <xref ref-type="bibr" rid="Xbib120">Sengewald &amp; Mayer, 2024</xref>) may provide attractive options. We view exploration of how measurement error in <italic>both</italic> covariates and outcomes influences results in experimental and observational contexts to be a promising avenue for future research.</p>
<p id="S5.PX1.P7">In conclusion, results of causal analyses of psychometric outcome data are sensitive to model selection, and the effects of attenuation bias are much more consequential than the use of scoring weights. When researchers do not adjust for measurement error with EIV corrections or use LVMs, standardized treatment effect estimates will be downwardly biased and thus understate treatment impact. When the EIV correction is applied, the impact of model selection will be reduced, demonstrating how the application of psychometric principles can improve causal inference in evaluation research.</p></sec>
</body>
<back>
<ref-list><title>References</title>
	<ref id="Xbib1"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Ahmed</surname>, <given-names>I.</given-names></string-name>, <string-name name-style="western"><surname>Bertling</surname>, <given-names>M.</given-names></string-name>, <string-name name-style="western"><surname>Zhang</surname>, <given-names>L.</given-names></string-name>, <string-name name-style="western"><surname>Ho</surname>, <given-names>A. D.</given-names></string-name>, <string-name name-style="western"><surname>Loyalka</surname>, <given-names>P.</given-names></string-name>, <string-name name-style="western"><surname>Xue</surname>, <given-names>H.</given-names></string-name>, <string-name name-style="western"><surname>Rozelle</surname>, <given-names>S.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Benjamin</surname>, <given-names>W.</given-names></string-name> (<year>2024</year>). <article-title>Heterogeneity of item-treatment interactions masks complexity and generalizability in randomized controlled trials</article-title>. <source>Journal of Research on Educational Effectiveness</source>, <fpage>1</fpage>–<lpage>22</lpage>. <pub-id pub-id-type="doi">10.1080/19345747.2024.2361337</pub-id></mixed-citation></ref>
<ref id="Xbib2"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Angrist</surname>, <given-names>J. D.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Pischke</surname>, <given-names>J.-S.</given-names></string-name> (<year>2009</year>). <source>Mostly harmless econometrics: An empiricist’s companion</source>. <publisher-name>Princeton University Press</publisher-name>.</mixed-citation></ref>
	<ref id="Xbib3"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Asher</surname>, <given-names>H. B.</given-names></string-name> (<year>1974</year>). <article-title>Some consequences of measurement error in survey data</article-title>. <source>American Journal of Political Science</source>, <volume>18</volume>(<issue>2</issue>), <fpage>469</fpage>–<lpage>485</lpage>.</mixed-citation></ref>
	<ref id="Xbib4"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Baker</surname>, <given-names>F. B.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Kim</surname>, <given-names>S.-H.</given-names></string-name> (<year>2017</year>). <chapter-title>The test characteristic curve</chapter-title>. In <source>The basics of Item Response Theory using R</source> (pp. 55–67). <publisher-name>Springer</publisher-name>.</mixed-citation></ref>
	<ref id="Xbib5"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Banerjee</surname>, <given-names>A.</given-names></string-name>, <string-name name-style="western"><surname>Banerji</surname>, <given-names>R.</given-names></string-name>, <string-name name-style="western"><surname>Berry</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Duflo</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Kannan</surname>, <given-names>H.</given-names></string-name>, <string-name name-style="western"><surname>Mukerji</surname>, <given-names>S.</given-names></string-name>, <string-name name-style="western"><surname>Shotland</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Walton</surname>, <given-names>M.</given-names></string-name> (<year>2017</year>). <article-title>From proof of concept to scalable policies: Challenges and solutions, with an application</article-title>. <source>Journal of Economic Perspectives</source>, <volume>31</volume> (<issue>4</issue>), <fpage>73</fpage>–<lpage>102</lpage>. <pub-id pub-id-type="doi">10.1257/jep.31.4.73</pub-id></mixed-citation></ref>
	<ref id="Xbib6"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Bates</surname>, <given-names>D.</given-names></string-name>, <string-name name-style="western"><surname>Mächler</surname>, <given-names>M.</given-names></string-name>, <string-name name-style="western"><surname>Bolker</surname>, <given-names>B.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Walker</surname>, <given-names>S.</given-names></string-name> (<year>2015</year>). <article-title>Fitting linear mixed-effects models using lme4</article-title>. <source>Journal of Statistical Software</source>, <volume>67</volume> (<issue>1</issue>), <fpage>1</fpage>–<lpage>48</lpage>. <pub-id pub-id-type="doi">10.18637/jss.v067.i01</pub-id></mixed-citation></ref>
	<ref id="Xbib7"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Bock</surname>, <given-names>R. D.</given-names></string-name>, <string-name name-style="western"><surname>Thissen</surname>, <given-names>D.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Zimowski</surname>, <given-names>M. F.</given-names></string-name> (<year>1997</year>). <article-title>IRT estimation of domain scores</article-title>. <source>Journal of Educational Measurement</source>, <volume>34</volume> (<issue>3</issue>), <fpage>197</fpage>–<lpage>211</lpage>. <pub-id pub-id-type="doi">10.1111/j.1745-3984.1997.tb00515.x</pub-id></mixed-citation></ref>
<ref id="Xbib8"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Bollen</surname>, <given-names>K. A.</given-names></string-name> (<year>1989</year>). <source>Structural equations with latent variables</source>. <publisher-name>John Wiley &amp; Sons</publisher-name>.</mixed-citation></ref>
<ref id="Xbib9"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Borenstein</surname>, <given-names>M.</given-names></string-name>, <string-name name-style="western"><surname>Hedges</surname>, <given-names>L. V.</given-names></string-name>, <string-name name-style="western"><surname>Higgins</surname>, <given-names>J. P.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Rothstein</surname>, <given-names>H. R.</given-names></string-name> (<year>2009</year>). <source>Introduction to meta-analysis</source>. <publisher-name>John Wiley &amp; Sons</publisher-name>.</mixed-citation></ref>
<ref id="Xbib10"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Borsboom</surname>, <given-names>D.</given-names></string-name> (<year>2005</year>). <source>Measuring the mind: Conceptual issues in contemporary psychometrics</source>. <publisher-name>Cambridge University Press</publisher-name>.</mixed-citation></ref>
	<ref id="Xbib11"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Brakenhoff</surname>, <given-names>T. B.</given-names></string-name>, <string-name name-style="western"><surname>Mitroiu</surname>, <given-names>M.</given-names></string-name>, <string-name name-style="western"><surname>Keogh</surname>, <given-names>R. H.</given-names></string-name>, <string-name name-style="western"><surname>Moons</surname>, <given-names>K. G.</given-names></string-name>, <string-name name-style="western"><surname>Groenwold</surname>, <given-names>R. H.</given-names></string-name>, &amp; <string-name name-style="western"><surname>van Smeden</surname>, <given-names>M.</given-names></string-name> (<year>2018</year>). <article-title>Measurement error is often neglected in medical literature: A systematic review</article-title>. <source>Journal of Clinical Epidemiology</source>, <volume>98</volume>, <fpage>89</fpage>–<lpage>97</lpage>. <pub-id pub-id-type="doi">10.1016/j.jclinepi.2018.02.023</pub-id></mixed-citation></ref>
	<ref id="Xbib12"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Breen</surname>, <given-names>R.</given-names></string-name>, <string-name name-style="western"><surname>Karlson</surname>, <given-names>K. B.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Holm</surname>, <given-names>A.</given-names></string-name> (<year>2018</year>). <article-title>Interpreting and understanding logits, probits, and other nonlinear probability models</article-title>. <source>Annual Review of Sociology</source>, <volume>44</volume>, <fpage>39</fpage>–<lpage>54</lpage>. <pub-id pub-id-type="doi">10.1146/annurev-soc-073117-041429</pub-id></mixed-citation></ref>
	<ref id="Xbib13"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Brennan</surname>, <given-names>R. L.</given-names></string-name> (<year>2010</year>). <article-title>Generalizability theory and classical test theory</article-title>. <source>Applied Measurement in Education</source>, <volume>24</volume> (<issue>1</issue>), <fpage>1</fpage>–<lpage>21</lpage>. <pub-id pub-id-type="doi">10.1080/08957347.2011.532417</pub-id></mixed-citation></ref>
	<ref id="Xbib14"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Briggs</surname>, <given-names>D. C.</given-names></string-name> (<year>2008</year>). <article-title>Using explanatory item response models to analyze group differences in science achievement</article-title>. <source>Applied Measurement in Education</source>, <volume>21</volume> (<issue>2</issue>), <fpage>89</fpage>–<lpage>118</lpage>. <pub-id pub-id-type="doi">10.1080/08957340801926086</pub-id></mixed-citation></ref>
<ref id="Xbib15"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Briggs</surname>, <given-names>D. C.</given-names></string-name> (<year>2021</year>). <source>Historical and conceptual foundations of measurement in the human sciences: Credos and controversies</source>. <publisher-name>Routledge</publisher-name>.</mixed-citation></ref>
	<ref id="Xbib16"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Briggs</surname>, <given-names>D. C.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Domingue</surname>, <given-names>B.</given-names></string-name> (<year>2013</year>). <article-title>The gains from vertical scaling</article-title>. <source>Journal of Educational and Behavioral Statistics</source>, <volume>38</volume> (<issue>6</issue>), <fpage>551</fpage>–<lpage>576</lpage>. <pub-id pub-id-type="doi">10.3102/1076998613508317</pub-id></mixed-citation></ref>
	<ref id="Xbib17"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Briggs</surname>, <given-names>D. C.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Weeks</surname>, <given-names>J. P.</given-names></string-name> (<year>2009</year>). <article-title>The sensitivity of value-added modeling to the creation of a vertical score scale</article-title>. <source>Education Finance and Policy</source>, <volume>4</volume> (<issue>4</issue>), <fpage>384</fpage>–<lpage>414</lpage>. <ext-link ext-link-type="uri" xlink:href="https://ideas.repec.org/a/tpr/edfpol/v4y2009i4p384-414.html">https://ideas.repec.org/a/tpr/edfpol/v4y2009i4p384-414.html</ext-link>  </mixed-citation></ref>
	<ref id="Xbib18"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Bruhn</surname>, <given-names>M.</given-names></string-name>, <string-name name-style="western"><surname>de Souza Leão</surname>, <given-names>L.</given-names></string-name>, <string-name name-style="western"><surname>Legovini</surname>, <given-names>A.</given-names></string-name>, <string-name name-style="western"><surname>Marchetti</surname>, <given-names>R.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Zia</surname>, <given-names>B.</given-names></string-name> (<year>2016</year>). <article-title>The impact of high school financial education: Evidence from a large-scale evaluation in Brazil</article-title>. <source>American Economic Journal: Applied Economics</source>, <volume>8</volume> (<issue>4</issue>), <fpage>256</fpage>–<lpage>295</lpage>. <pub-id pub-id-type="doi">10.1257/app.20150149</pub-id></mixed-citation></ref>
	<ref id="Xbib19"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Camilli</surname>, <given-names>G.</given-names></string-name> (<year>2018</year>). <article-title>IRT scoring and test blueprint fidelity</article-title>. <source>Applied Psychological Measurement</source>, <volume>42</volume> (<issue>5</issue>), <fpage>393</fpage>–<lpage>400</lpage>. <pub-id pub-id-type="doi">10.1177/0146621618754897</pub-id></mixed-citation></ref>
	<ref id="Xbib20"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Carpena</surname>, <given-names>F.</given-names></string-name> (<year>2024</year>). <article-title>Entertainment-education for better health: Insights from a field experiment in India</article-title>. <source>Journal of Development Studies</source>, <volume>60</volume> (<issue>5</issue>), <fpage>745</fpage>–<lpage>762</lpage>. <pub-id pub-id-type="doi">10.1080/00220388.2024.2312832</pub-id></mixed-citation></ref>
	<ref id="Xbib21"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Carroll</surname>, <given-names>R. J.</given-names></string-name>, <string-name name-style="western"><surname>Delaigle</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Hall</surname>, <given-names>P.</given-names></string-name> (<year>2009</year>). <article-title>Nonparametric prediction in measurement error models</article-title>. <source>Journal of the American Statistical Association</source>, <volume>104</volume> (<issue>487</issue>), <fpage>993</fpage>–<lpage>1003</lpage>. <pub-id pub-id-type="doi">10.1198/jasa.2009.tm07543</pub-id></mixed-citation></ref>
	<ref id="Xbib22"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Castellano</surname>, <given-names>K. E.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Ho</surname>, <given-names>A. D.</given-names></string-name> (<year>2015</year>). <article-title>Practical differences among aggregate-level conditional status metrics: From median student growth percentiles to value-added models</article-title>. <source>Journal of Educational and Behavioral Statistics</source>, <volume>40</volume> (<issue>1</issue>), <fpage>35</fpage>–<lpage>68</lpage>. <pub-id pub-id-type="doi">10.3102/1076998614548485</pub-id></mixed-citation></ref>
	<ref id="Xbib23"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Chan</surname>, <given-names>W.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Hedges</surname>, <given-names>L. V.</given-names></string-name> (<year>2022</year>). <article-title>Pooling interactions into error terms in multisite experiments</article-title>. <source>Journal of Educational and Behavioral Statistics</source>, <volume>47</volume> (<issue>6</issue>), <fpage>639</fpage>–<lpage>665</lpage>. <pub-id pub-id-type="doi">10.3102/10769986221104800</pub-id></mixed-citation></ref>
	<ref id="Xbib24"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Chapman</surname>, <given-names>R.</given-names></string-name> (<year>2022</year>). <article-title>Expected a posteriori scoring in PROMIS®</article-title>. <source>Journal of Patient-Reported Outcomes</source>, <volume>6</volume>, <elocation-id>59</elocation-id>. <pub-id pub-id-type="doi">10.1186/s41687-022-00464-9</pub-id></mixed-citation></ref>
	<ref id="Xbib25"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Christensen</surname>, <given-names>K. B.</given-names></string-name> (<year>2006</year>). <article-title>From Rasch scores to regression</article-title>. <source>Journal of Applied Measurement</source>, <volume>7</volume> (<issue>2</issue>), <fpage>184</fpage>–<lpage>191</lpage>.</mixed-citation></ref>
	<ref id="Xbib26"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Cole</surname>, <given-names>D. A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Preacher</surname>, <given-names>K. J.</given-names></string-name> (<year>2014</year>). <article-title>Manifest variable path analysis: Potentially serious and misleading consequences due to uncorrected measurement error</article-title>. <source>Psychological Methods</source>, <volume>19</volume> (<issue>2</issue>), <fpage>300</fpage>–<lpage>315</lpage>. <pub-id pub-id-type="doi">10.1037/a0033805</pub-id></mixed-citation></ref>
	<ref id="Xbib27"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Cook</surname>, <given-names>T. D.</given-names></string-name>, <string-name name-style="western"><surname>Steiner</surname>, <given-names>P. M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Pohl</surname>, <given-names>S.</given-names></string-name> (<year>2009</year>). <article-title>How bias reduction is affected by covariate choice, unreliability, and mode of data analysis: Results from two types of within-study comparisons</article-title>. <source>Multivariate Behavioral Research</source>, <volume>44</volume> (<issue>6</issue>), <fpage>828</fpage>–<lpage>847</lpage>. <pub-id pub-id-type="doi">10.1080/00273170903333673</pub-id></mixed-citation></ref>
	<ref id="Xbib28"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Cox</surname>, <given-names>K.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Kelcey</surname>, <given-names>B.</given-names></string-name> (<year>2019</year>). <article-title>Optimal design of cluster- and multisite-randomized studies using fallible outcome measures</article-title>. <source>Evaluation Review</source>, <volume>43</volume> (<issue>3–4</issue>), <fpage>189</fpage>–<lpage>225</lpage>. <pub-id pub-id-type="doi">10.1177/0193841X19870878</pub-id></mixed-citation></ref>
	<ref id="Xbib29"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Cronbach</surname>, <given-names>L. J.</given-names></string-name> (<year>1951</year>). <article-title>Coefficient alpha and the internal structure of tests</article-title>. <source>Psychometrika</source>, <volume>16</volume> (<issue>3</issue>), <fpage>297</fpage>–<lpage>334</lpage>. <pub-id pub-id-type="doi">10.1007/BF02310555</pub-id></mixed-citation></ref>
<ref id="Xbib30"><mixed-citation publication-type="book"><string-name name-style="western"><surname>De Boeck</surname>, <given-names>P.</given-names></string-name> (<year>2004</year>). <source>Explanatory item response models: A generalized linear and nonlinear approach</source>. <publisher-name>Springer Science &amp; Business Media</publisher-name>.</mixed-citation></ref>
	<ref id="Xbib31"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>De Boeck</surname>, <given-names>P.</given-names></string-name> (<year>2008</year>). <article-title>Random item IRT models</article-title>. <source>Psychometrika</source>, <volume>73</volume> (<issue>4</issue>), <fpage>533</fpage>–<lpage>559</lpage>. <pub-id pub-id-type="doi">10.1007/s11336-008-9092-x</pub-id></mixed-citation></ref>
<ref id="Xbib32"><mixed-citation publication-type="book"><string-name name-style="western"><surname>De Boeck</surname>, <given-names>P.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Wilson</surname>, <given-names>M. R.</given-names></string-name> (<year>2016</year>). <chapter-title>Explanatory item response models</chapter-title>. In W. J. van der Linden (Ed.), <source>Handbook of item response theory</source> (pp. 593–608). <publisher-name>Chapman &amp; Hall/CRC</publisher-name>.</mixed-citation></ref>
<ref id="Xbib33"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Deng</surname>, <given-names>S.</given-names></string-name>, <string-name name-style="western"><surname>McCarthy</surname>, <given-names>D. E.</given-names></string-name>, <string-name name-style="western"><surname>Piper</surname>, <given-names>M. E.</given-names></string-name>, <string-name name-style="western"><surname>Baker</surname>, <given-names>T. B.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Bolt</surname>, <given-names>D. M.</given-names></string-name> (<year>2018</year>). <article-title>Extreme response style and the measurement of intra-individual variability in affect</article-title>. <source>Multivariate Behavioral Research</source>, <volume>53</volume> (<issue>2</issue>), <fpage>199</fpage>–<lpage>218</lpage>. <pub-id pub-id-type="doi">10.1080/00273171.2017.1413636</pub-id></mixed-citation></ref>
	<ref id="Xbib34"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>DeVellis</surname>, <given-names>R. F.</given-names></string-name> (<year>2006</year>). <article-title>Classical test theory</article-title>. <source>Medical Care</source>, <volume>44</volume> (<issue>11 Suppl. 3</issue>), <fpage>S50</fpage>–<lpage>S59</lpage>. <pub-id pub-id-type="doi">10.1097/01.mlr.0000245426.10853.30</pub-id></mixed-citation></ref>
	<ref id="Xbib35"><mixed-citation publication-type="preprint"><string-name name-style="western"><surname>Domingue</surname>, <given-names>B.</given-names></string-name>, <string-name name-style="western"><surname>Braginsky</surname>, <given-names>M.</given-names></string-name>, <string-name name-style="western"><surname>Caffrey-Maffei</surname>, <given-names>L. A.</given-names></string-name>, <string-name name-style="western"><surname>Gilbert</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Kanopka</surname>, <given-names>K.</given-names></string-name>, <string-name name-style="western"><surname>Kapoor</surname>, <given-names>R.</given-names></string-name>, <string-name name-style="western"><surname>Liu</surname>, <given-names>Y.</given-names></string-name>, <string-name name-style="western"><surname>Nadela</surname>, <given-names>S.</given-names></string-name>, <string-name name-style="western"><surname>Pan</surname>, <given-names>G.</given-names></string-name>, <string-name name-style="western"><surname>Zhang</surname>, <given-names>L.</given-names></string-name>, <string-name name-style="western"><surname>Zhang</surname>, <given-names>S.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Frank</surname>, <given-names>M. C.</given-names></string-name> (<year>2024</year>). <italic>Solving the problem of data in psychometrics: An introduction to the Item Response Warehouse (IRW)</italic>. PsyArXiv. <pub-id pub-id-type="doi">10.31234/osf.io/7bd54</pub-id></mixed-citation></ref>
	<ref id="Xbib36"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Domingue</surname>, <given-names>B. W.</given-names></string-name>, <string-name name-style="western"><surname>Kanopka</surname>, <given-names>K.</given-names></string-name>, <string-name name-style="western"><surname>Kapoor</surname>, <given-names>R.</given-names></string-name>, <string-name name-style="western"><surname>Pohl</surname>, <given-names>S.</given-names></string-name>, <string-name name-style="western"><surname>Chalmers</surname>, <given-names>R. P.</given-names></string-name>, <string-name name-style="western"><surname>Rahal</surname>, <given-names>C.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Rhemtulla</surname>, <given-names>M.</given-names></string-name> (<year>2024</year>). <article-title>The InterModel Vigorish as a lens for understanding (and quantifying) the value of item response models for dichotomously coded items</article-title>. <source>Psychometrika</source>, <volume>89</volume> (<issue>3</issue>), <fpage>1034</fpage>–<lpage>1054</lpage>. <pub-id pub-id-type="doi">10.1007/s11336-024-09977-2</pub-id></mixed-citation></ref>
	<ref id="Xbib37"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Domingue</surname>, <given-names>B. W.</given-names></string-name>, <string-name name-style="western"><surname>Kanopka</surname>, <given-names>K.</given-names></string-name>, <string-name name-style="western"><surname>Trejo</surname>, <given-names>S.</given-names></string-name>, <string-name name-style="western"><surname>Rhemtulla</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Tucker-Drob</surname>, <given-names>E. M.</given-names></string-name> (<year>2024</year>). <article-title>Ubiquitous bias and false discovery due to model misspecification in analysis of statistical interactions: The role of the outcome’s distribution and metric properties</article-title>. <source>Psychological Methods</source>, <volume>29</volume> (<issue>6</issue>), <fpage>1164</fpage>–<lpage>1179</lpage>. <pub-id pub-id-type="doi">10.1037/met0000532</pub-id> </mixed-citation></ref>
	<ref id="Xbib38"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Duflo</surname>, <given-names>A.</given-names></string-name>, <string-name name-style="western"><surname>Kiessel</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Lucas</surname>, <given-names>A. M.</given-names></string-name> (<year>2024</year>). <article-title>Experimental evidence on four policies to increase learning at scale</article-title>. <source>Economic Journal</source>, <volume>134</volume> (<issue>661</issue>), <fpage>1985</fpage>–<lpage>2008</lpage>. <pub-id pub-id-type="doi">10.1093/ej/ueae003</pub-id></mixed-citation></ref>	
<ref id="Xbib39"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Ferrando</surname>, <given-names>P. J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Chico</surname>, <given-names>E.</given-names></string-name> (<year>2007</year>). <article-title>The external validity of scores based on the two-parameter logistic model: Some comparisons between IRT and CTT</article-title>. <source>Psicologica: International Journal of Methodology and Experimental Psychology</source>, <volume>28</volume> (<issue>2</issue>), <fpage>237</fpage>–<lpage>257</lpage>.</mixed-citation></ref>
	<ref id="Xbib41"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Flake</surname>, <given-names>J. K.</given-names></string-name>, <string-name name-style="western"><surname>Pek</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Hehman</surname>, <given-names>E.</given-names></string-name> (<year>2017</year>). <article-title>Construct validation in social and personality research: Current practice and recommendations</article-title>. <source>Social Psychological and Personality Science</source>, <volume>8</volume> (<issue>4</issue>), <fpage>370</fpage>–<lpage>378</lpage>. <pub-id pub-id-type="doi">10.1177/1948550617693063</pub-id></mixed-citation></ref>
	<ref id="Xbib40"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Flake</surname>, <given-names>J. K.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Fried</surname>, <given-names>E. I.</given-names></string-name> (<year>2020</year>). <article-title>Measurement schmeasurement: Questionable measurement practices and how to avoid them</article-title>. <source>Advances in Methods and Practices in Psychological Science</source>, <volume>3</volume> (<issue>4</issue>), <fpage>456</fpage>–<lpage>465</lpage>. <pub-id pub-id-type="doi">10.1177/2515245920952393</pub-id></mixed-citation></ref>
	<ref id="Xbib42"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Fuller</surname>, <given-names>W. A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Hidiroglou</surname>, <given-names>M. A.</given-names></string-name> (<year>1978</year>). <article-title>Regression estimation after correcting for attenuation</article-title>. <source>Journal of the American Statistical Association</source>, <volume>73</volume> (<issue>361</issue>), <fpage>99</fpage>–<lpage>104</lpage>. <pub-id pub-id-type="doi">10.1080/01621459.1978.10480011</pub-id></mixed-citation></ref>
	<ref id="Xbib43"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Gelman</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Loken</surname>, <given-names>E.</given-names></string-name> (<year>2013</year>). <source>The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time</source> (pp. 1–17). Department of Statistics, Columbia University.</mixed-citation></ref>
<ref id="Xbib44"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Gilbert</surname>, <given-names>J. B.</given-names></string-name> (<year>2024a</year>). <article-title>Estimating treatment effects with the explanatory item response model</article-title>. <source>Journal of Research on Educational Effectiveness</source>, <fpage>1</fpage>–<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1080/19345747.2023.2287601</pub-id></mixed-citation></ref>
	<ref id="Xbib45"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Gilbert</surname>, <given-names>J. B.</given-names></string-name> (<year>2024b</year>). <article-title>Modeling item-level heterogeneous treatment effects: A tutorial with the glmer function from the lme4 package in R</article-title>. <source>Behavior Research Methods</source>, <volume>56</volume> (<issue>5</issue>), <fpage>5055</fpage>–<lpage>5067</lpage>. <pub-id pub-id-type="doi">10.3758/s13428-023-02245-8</pub-id></mixed-citation></ref>
	<ref id="Xbib45.5"><mixed-citation publication-type="data"><string-name name-style="western"><surname>Gilbert</surname>, <given-names>J. B.</given-names></string-name> (<year>2025</year>). <italic>ResearchBox #2289 – ‘How measurement affects causal inference’</italic> [Replication toolkit]. ResearchBox. <ext-link ext-link-type="uri" xlink:href="https://researchbox.org/2289">https://researchbox.org/2289</ext-link>.</mixed-citation></ref>
	<ref id="Xbib46"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Gilbert</surname>, <given-names>J. B.</given-names></string-name>, <string-name name-style="western"><surname>Himmelsbach</surname>, <given-names>Z.</given-names></string-name>, <string-name name-style="western"><surname>Soland</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Joshi</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Domingue</surname>, <given-names>B. W.</given-names></string-name> (<year>2025a</year>). <article-title>Estimating heterogeneous treatment effects with item-level outcome data: Insights from item response theory</article-title>. <source>Journal of Policy Analysis and Management</source>, <fpage>1</fpage>–<lpage>34</lpage>. <pub-id pub-id-type="doi">10.1002/pam.70025</pub-id></mixed-citation></ref>

	<ref id="Xbib46.5"><mixed-citation publication-type="data"><string-name name-style="western"><surname>Gilbert</surname>, <given-names>J. B.</given-names></string-name>, <string-name name-style="western"><surname>Himmelsbach</surname>, <given-names>Z.</given-names></string-name>, <string-name name-style="western"><surname>Soland</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Joshi</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Domingue</surname>, <given-names>B. W.</given-names></string-name> (<year>2025b</year>). <italic>“Replication data for: Estimating heterogeneous treatment effects with item-level outcome data: Insights from Item Response Theory”</italic> [Data, Study Materials]. <source>Harvard Dataverse</source>. <pub-id pub-id-type="doi">10.7910/DVN/C4TJCA</pub-id>.</mixed-citation></ref>


	<ref id="Xbib47"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Gilbert</surname>, <given-names>J. B.</given-names></string-name>, <string-name name-style="western"><surname>Kim</surname>, <given-names>J. S.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Miratrix</surname>, <given-names>L. W.</given-names></string-name> (<year>2023</year>). <article-title>Modeling item-level heterogeneous treatment effects with the explanatory item response model: Leveraging large-scale online assessments to pinpoint the impact of educational interventions</article-title>. <source>Journal of Educational and Behavioral Statistics</source>, <volume>48</volume> (<issue>6</issue>), <fpage>889</fpage>–<lpage>913</lpage>. <pub-id pub-id-type="doi">10.3102/10769986231171710</pub-id></mixed-citation></ref>
<ref id="Xbib48"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Gilbert</surname>, <given-names>J. B.</given-names></string-name>, <string-name name-style="western"><surname>Kim</surname>, <given-names>J. S.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Miratrix</surname>, <given-names>L. W.</given-names></string-name> (<year>2024</year>). <article-title>Leveraging item parameter drift to assess transfer effects in vocabulary learning</article-title>. <source>Applied Measurement in Education</source>, <volume>37</volume> (<issue>3</issue>), <fpage>240</fpage>–<lpage>257</lpage>. <pub-id pub-id-type="doi">10.1080/08957347.2024.2386934</pub-id></mixed-citation></ref>
<ref id="Xbib49"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Gilbert</surname>, <given-names>J. B.</given-names></string-name>, <string-name name-style="western"><surname>Miratrix</surname>, <given-names>L. W.</given-names></string-name>, <string-name name-style="western"><surname>Joshi</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Domingue</surname>, <given-names>B. W.</given-names></string-name> (<year>2025</year>). <article-title>Disentangling person-dependent and item-dependent causal effects: Applications of item response theory to the estimation of treatment effect heterogeneity</article-title>. <source>Journal of Educational and Behavioral Statistics</source>, <volume>50</volume> (<issue>1</issue>), <fpage>72</fpage>–<lpage>101</lpage>. <pub-id pub-id-type="doi">10.3102/10769986241240085</pub-id></mixed-citation></ref>
	<ref id="Xbib50"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Gillard</surname>, <given-names>J.</given-names></string-name> (<year>2010</year>). <article-title>An overview of linear structural models in errors in variables regression</article-title>. <source>REVSTAT-Statistical Journal</source>, <volume>8</volume> (<issue>1</issue>), <fpage>57</fpage>–<lpage>80</lpage>. <pub-id pub-id-type="doi">10.57805/revstat.v8i1.90</pub-id></mixed-citation></ref>
	<ref id="Xbib51"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Grice</surname>, <given-names>J. W.</given-names></string-name> (<year>2001</year>). <article-title>Computing and evaluating factor scores</article-title>. <source>Psychological Methods</source>, <volume>6</volume> (<issue>4</issue>), <fpage>430</fpage>–<lpage>450</lpage>. <pub-id pub-id-type="doi">10.1037/1082-989X.6.4.430</pub-id></mixed-citation></ref>
<ref id="Xbib52"><mixed-citation publication-type="web"><string-name name-style="western"><surname>Halpin</surname>, <given-names>P.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Gilbert</surname>, <given-names>J.</given-names></string-name> (<year>2024</year>). <italic>Testing whether reported treatment effects are unduly dependent on the specific outcome measure used</italic>. arXiv. <pub-id pub-id-type="doi">10.48550/ARXIV.2409.03502</pub-id></mixed-citation></ref>
	<ref id="Xbib53"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Hambleton</surname>, <given-names>R. K.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Jones</surname>, <given-names>R. W.</given-names></string-name> (<year>1993</year>). <article-title>Comparison of classical test theory and item response theory and their applications to test development</article-title>. <source>Educational Measurement: Issues and Practice</source>, <volume>12</volume> (<issue>3</issue>), <fpage>38</fpage>–<lpage>47</lpage>. <pub-id pub-id-type="doi">10.1111/j.1745-3992.1993.tb00543.x</pub-id></mixed-citation></ref>
	<ref id="Xbib54"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Hambleton</surname>, <given-names>R. K.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Van der Linden</surname>, <given-names>W. J.</given-names></string-name> (<year>1982</year>). <article-title>Advances in item response theory and applications: An introduction</article-title>. <source>Applied Psychological Measurement</source>, <volume>6</volume> (<issue>4</issue>), <fpage>373</fpage>–<lpage>378</lpage>. <pub-id pub-id-type="doi">10.1177/014662168200600401</pub-id></mixed-citation></ref>
	<ref id="Xbib55"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Harwell</surname>, <given-names>M. R.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Gatti</surname>, <given-names>G. G.</given-names></string-name> (<year>2001</year>). <article-title>Rescaling ordinal data to interval data in educational research</article-title>. <source>Review of Educational Research</source>, <volume>71</volume> (<issue>1</issue>), <fpage>105</fpage>–<lpage>131</lpage>. <pub-id pub-id-type="doi">10.3102/00346543071001105</pub-id></mixed-citation></ref>
	<ref id="Xbib56"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Hayes</surname>, <given-names>A. F.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Coutts</surname>, <given-names>J. J.</given-names></string-name> (<year>2020</year>). <article-title>Use omega rather than Cronbach’s alpha for estimating reliability. But…</article-title> <source>Communication Methods and Measures</source>, <volume>14</volume> (<issue>1</issue>), <fpage>1</fpage>–<lpage>24</lpage>. <pub-id pub-id-type="doi">10.1080/19312458.2020.1718629</pub-id></mixed-citation></ref>
	<ref id="Xbib57"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Hedges</surname>, <given-names>L. V.</given-names></string-name> (<year>1981</year>). <article-title>Distribution theory for Glass’s estimator of effect size and related estimators</article-title>. <source>Journal of Educational Statistics</source>, <volume>6</volume> (<issue>2</issue>), <fpage>107</fpage>–<lpage>128</lpage>. <pub-id pub-id-type="doi">10.2307/1164588</pub-id></mixed-citation></ref>
	<ref id="Xbib58"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Hontangas</surname>, <given-names>P. M.</given-names></string-name>, <string-name name-style="western"><surname>De La Torre</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Ponsoda</surname>, <given-names>V.</given-names></string-name>, <string-name name-style="western"><surname>Leenen</surname>, <given-names>I.</given-names></string-name>, <string-name name-style="western"><surname>Morillo</surname>, <given-names>D.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Abad</surname>, <given-names>F. J.</given-names></string-name> (<year>2015</year>). <article-title>Comparing traditional and IRT scoring of forced-choice tests</article-title>. <source>Applied Psychological Measurement</source>, <volume>39</volume> (<issue>8</issue>), <fpage>598</fpage>–<lpage>612</lpage>. <pub-id pub-id-type="doi">10.1177/0146621615585851</pub-id></mixed-citation></ref>
	<ref id="Xbib59"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Hutcheon</surname>, <given-names>J. A.</given-names></string-name>, <string-name name-style="western"><surname>Chiolero</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Hanley</surname>, <given-names>J. A.</given-names></string-name> (<year>2010</year>). <article-title>Random measurement error and regression dilution bias</article-title>. <source>BMJ</source>, <volume>340</volume>, <elocation-id>c2289</elocation-id>. <pub-id pub-id-type="doi">10.1136/bmj.c2289</pub-id></mixed-citation></ref>
<ref id="Xbib60"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Imbens</surname>, <given-names>G. W.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Rubin</surname>, <given-names>D. B.</given-names></string-name> (<year>2015</year>). <source>Causal inference for statistics, social, and biomedical sciences</source>. <publisher-name>Cambridge University Press</publisher-name>.</mixed-citation></ref>
	<ref id="Xbib61"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Jabrayilov</surname>, <given-names>R.</given-names></string-name>, <string-name name-style="western"><surname>Emons</surname>, <given-names>W. H.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Sijtsma</surname>, <given-names>K.</given-names></string-name> (<year>2016</year>). <article-title>Comparison of classical test theory and item response theory in individual change assessment</article-title>. <source>Applied Psychological Measurement</source>, <volume>40</volume> (<issue>8</issue>), <fpage>559</fpage>–<lpage>572</lpage>. <pub-id pub-id-type="doi">10.1177/0146621616664046</pub-id></mixed-citation></ref>
	<ref id="Xbib62"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Jackson</surname>, <given-names>P. H.</given-names></string-name> (<year>1973</year>). <article-title>The estimation of true score variance and error variance in the classical test theory model</article-title>. <source>Psychometrika</source>, <volume>38</volume> (<issue>2</issue>), <fpage>183</fpage>–<lpage>201</lpage>. <pub-id pub-id-type="doi">10.1007/BF02291113</pub-id></mixed-citation></ref>
	<ref id="Xbib63"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Jessen</surname>, <given-names>A.</given-names></string-name>, <string-name name-style="western"><surname>Ho</surname>, <given-names>A. D.</given-names></string-name>, <string-name name-style="western"><surname>Corrales</surname>, <given-names>C. E.</given-names></string-name>, <string-name name-style="western"><surname>Yueh</surname>, <given-names>B.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Shin</surname>, <given-names>J. J.</given-names></string-name> (<year>2018</year>). <article-title>Improving measurement efficiency of the inner ear scale with item response theory</article-title>. <source>Otolaryngology–Head and Neck Surgery</source>, <volume>158</volume> (<issue>6</issue>), <fpage>1093</fpage>–<lpage>1100</lpage>. <pub-id pub-id-type="doi">10.1177/0194599818760528</pub-id></mixed-citation></ref>
	<ref id="Xbib65"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Kim</surname>, <given-names>J. S.</given-names></string-name>, <string-name name-style="western"><surname>Burkhauser</surname>, <given-names>M. A.</given-names></string-name>, <string-name name-style="western"><surname>Mesite</surname>, <given-names>L. M.</given-names></string-name>, <string-name name-style="western"><surname>Asher</surname>, <given-names>C. A.</given-names></string-name>, <string-name name-style="western"><surname>Relyea</surname>, <given-names>J. E.</given-names></string-name>, <string-name name-style="western"><surname>Fitzgerald</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Elmore</surname>, <given-names>J.</given-names></string-name> (<year>2021</year>). <article-title>Improving reading comprehension, science domain knowledge, and reading engagement through a first-grade content literacy intervention</article-title>. <source>Journal of Educational Psychology</source>, <volume>113</volume> (<issue>1</issue>), <fpage>3</fpage>–<lpage>26</lpage>. <pub-id pub-id-type="doi">10.1037/edu0000465</pub-id> </mixed-citation></ref>
	<ref id="Xbib67"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Kim</surname>, <given-names>S.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Feldt</surname>, <given-names>L. S.</given-names></string-name> (<year>2010</year>). <article-title>The estimation of the IRT reliability coefficient and its lower and upper bounds, with comparisons to CTT reliability statistics</article-title>. <source>Asia Pacific Education Review</source>, <volume>11</volume> (<issue>2</issue>), <fpage>179</fpage>–<lpage>188</lpage>. <pub-id pub-id-type="doi">10.1007/s12564-009-9062-8</pub-id></mixed-citation></ref>
	<ref id="Xbib66"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Kim</surname>, <given-names>J. S.</given-names></string-name>, <string-name name-style="western"><surname>Gilbert</surname>, <given-names>J. B.</given-names></string-name>, <string-name name-style="western"><surname>Relyea</surname>, <given-names>J. E.</given-names></string-name>, <string-name name-style="western"><surname>Rich</surname>, <given-names>P.</given-names></string-name>, <string-name name-style="western"><surname>Scherer</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Burkhauser</surname>, <given-names>M. A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Tvedt</surname>, <given-names>J. N.</given-names></string-name> (<year>2024</year>). <article-title>Time to transfer: Long-term effects of a sustained and spiraled content literacy intervention in the elementary grades</article-title>. <source>Developmental Psychology</source>, <volume>60</volume> (<issue>7</issue>), <fpage>1279</fpage>–<lpage>1297</lpage>. <pub-id pub-id-type="doi">10.1037/dev0001710</pub-id></mixed-citation></ref>
<ref id="Xbib64"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Kim</surname>, <given-names>E. S.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Yoon</surname>, <given-names>M.</given-names></string-name> (<year>2011</year>). <article-title>Testing measurement invariance: A comparison of multiple-group categorical CFA and IRT</article-title>. <source>Structural Equation Modeling: A Multidisciplinary Journal</source>, <volume>18</volume> (<issue>2</issue>), <fpage>212</fpage>–<lpage>228</lpage>. <pub-id pub-id-type="doi">10.1080/10705511.2011.557337</pub-id></mixed-citation></ref>
	<ref id="Xbib68"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>King</surname>, <given-names>G.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Nielsen</surname>, <given-names>R.</given-names></string-name> (<year>2019</year>). <article-title>Why propensity scores should not be used for matching</article-title>. <source>Political Analysis</source>, <volume>27</volume> (<issue>4</issue>), <fpage>435</fpage>–<lpage>454</lpage>. <pub-id pub-id-type="doi">10.1017/pan.2019.11</pub-id></mixed-citation></ref>
<ref id="Xbib69"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Kline</surname>, <given-names>R. B.</given-names></string-name> (<year>2023</year>). <source>Principles and practice of structural equation modeling</source>. <publisher-name>Guilford Publications</publisher-name>.</mixed-citation></ref>
	<ref id="Xbib70"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Kraft</surname>, <given-names>M. A.</given-names></string-name> (<year>2020</year>). <article-title>Interpreting effect sizes of education interventions</article-title>. <source>Educational Researcher</source>, <volume>49</volume> (<issue>4</issue>), <fpage>241</fpage>–<lpage>253</lpage>. <pub-id pub-id-type="doi">10.3102/0013189X20912798</pub-id></mixed-citation></ref>
	<ref id="Xbib71"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Kuhfeld</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Soland</surname>, <given-names>J.</given-names></string-name> (<year>2022</year>). <article-title>Avoiding bias from sum scores in growth estimates: An examination of IRT-based approaches to scoring longitudinal survey responses</article-title>. <source>Psychological Methods</source>, <volume>27</volume> (<issue>2</issue>), <fpage>234</fpage>–<lpage>260</lpage>. <pub-id pub-id-type="doi">10.1037/met0000367</pub-id>  </mixed-citation></ref>
	<ref id="Xbib72"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Kuhfeld</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Soland</surname>, <given-names>J.</given-names></string-name> (<year>2023</year>). <article-title>Scoring assessments in multisite randomized control trials: Examining the sensitivity of treatment effect estimates to measurement choices</article-title>. <source>Psychological Methods</source>. <pub-id pub-id-type="doi">10.1037/met0000633</pub-id></mixed-citation></ref>
	<ref id="Xbib73"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Lee</surname>, <given-names>W.-C.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Lee</surname>, <given-names>G.</given-names></string-name> (<year>2018</year>). <chapter-title>IRT linking and equating</chapter-title>. In P. Irwing, T. Booth, &amp; D. J. Hughes (Eds.), <source>Wiley handbook of psychometric testing: A multidisciplinary reference on survey, scale and test development</source> (pp. 639–673). <publisher-name>Wiley Blackwell</publisher-name>. <pub-id pub-id-type="doi">10.1002/9781118489772.ch21</pub-id></mixed-citation></ref>
	<ref id="Xbib74"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Lewis</surname>, <given-names>C.</given-names></string-name> (<year>2006</year>). <chapter-title>Selected topics in classical test theory</chapter-title>. In C. R. Rao &amp; S. Sinharay (Eds.), <source>Handbook of statistics</source> (Vol. 26, pp. 29–43). <pub-id pub-id-type="doi">10.1016/S0169-7161(06)26002-4</pub-id></mixed-citation></ref>
	<ref id="Xbib75"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Liu</surname>, <given-names>K.</given-names></string-name> (<year>1988</year>). <article-title>Measurement error and its impact on partial correlation and multiple linear regression analyses</article-title>. <source>American Journal of Epidemiology</source>, <volume>127</volume> (<issue>4</issue>), <fpage>864</fpage>–<lpage>874</lpage>. <pub-id pub-id-type="doi">10.1093/oxfordjournals.aje.a114870</pub-id></mixed-citation></ref>
	<ref id="Xbib76"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Liu</surname>, <given-names>Y.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Pek</surname>, <given-names>J.</given-names></string-name> (<year>2024</year>). <article-title>Summed versus estimated factor scores: Considering uncertainties when using observed scores</article-title>. <source>Psychological Methods</source>. <pub-id pub-id-type="doi">10.1037/met0000644</pub-id></mixed-citation></ref>
	<ref id="Xbib77"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Lockwood</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>McCaffrey</surname>, <given-names>D. F.</given-names></string-name> (<year>2014</year>). <article-title>Correcting for test score measurement error in ANCOVA models for estimating treatment effects</article-title>. <source>Journal of Educational and Behavioral Statistics</source>, <volume>39</volume> (<issue>1</issue>), <fpage>22</fpage>–<lpage>52</lpage>. <pub-id pub-id-type="doi">10.3102/107699861350940</pub-id></mixed-citation></ref>
<ref id="Xbib78"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Lord</surname>, <given-names>F. M.</given-names></string-name> (<year>1980</year>). <source>Applications of item response theory to practical testing problems</source>. <publisher-name>Routledge</publisher-name>.</mixed-citation></ref>
<ref id="Xbib79"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Lord</surname>, <given-names>F. M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Novick</surname>, <given-names>M. R.</given-names></string-name> (<year>1968</year>). <source>Statistical theories of mental test scores</source>. <publisher-name>Addison-Wesley</publisher-name>.</mixed-citation></ref>
	<ref id="Xbib80"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Lu</surname>, <given-names>I. R.</given-names></string-name>, <string-name name-style="western"><surname>Thomas</surname>, <given-names>D. R.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Zumbo</surname>, <given-names>B. D.</given-names></string-name> (<year>2005</year>). <article-title>Embedding IRT in structural equation models: A comparison with regression based on IRT scores</article-title>. <source>Structural Equation Modeling</source>, <volume>12</volume> (<issue>2</issue>), <fpage>263</fpage>–<lpage>277</lpage>. <pub-id pub-id-type="doi">10.1207/s15328007sem1202_5</pub-id></mixed-citation></ref>
	<ref id="Xbib81"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Luo</surname>, <given-names>R.</given-names></string-name>, <string-name name-style="western"><surname>Emmers</surname>, <given-names>D.</given-names></string-name>, <string-name name-style="western"><surname>Warrinnier</surname>, <given-names>N.</given-names></string-name>, <string-name name-style="western"><surname>Rozelle</surname>, <given-names>S.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Sylvia</surname>, <given-names>S.</given-names></string-name> (<year>2019</year>). <article-title>Using community health workers to deliver a scalable integrated parenting program in rural China: A cluster-randomized controlled trial</article-title>. <source>Social Science &amp; Medicine</source>, <volume>239</volume>, <elocation-id>112545</elocation-id>. <pub-id pub-id-type="doi">10.1016/j.socscimed.2019.112545</pub-id></mixed-citation></ref>
	<ref id="Xbib82"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Lyall</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Zhou</surname>, <given-names>Y.-Y.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Imai</surname>, <given-names>K.</given-names></string-name> (<year>2020</year>). <article-title>Can economic assistance shape combatant support in wartime? Experimental evidence from Afghanistan</article-title>. <source>American Political Science Review</source>, <volume>114</volume> (<issue>1</issue>), <fpage>126</fpage>–<lpage>143</lpage>. <pub-id pub-id-type="doi">10.2139/ssrn.3026531</pub-id></mixed-citation></ref>
<ref id="Xbib83"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Mayer</surname>, <given-names>A.</given-names></string-name> (<year>2019</year>). <article-title>Causal effects based on latent variable models</article-title>. <source>Methodology</source>, <volume>15</volume> (<issue>S1</issue>), <fpage>15</fpage>–<lpage>28</lpage>. <pub-id pub-id-type="doi">10.1027/1614-2241/a000174</pub-id></mixed-citation></ref>
	<ref id="Xbib84"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Mayer</surname>, <given-names>A.</given-names></string-name>, <string-name name-style="western"><surname>Dietzfelbinger</surname>, <given-names>L.</given-names></string-name>, <string-name name-style="western"><surname>Rosseel</surname>, <given-names>Y.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Steyer</surname>, <given-names>R.</given-names></string-name> (<year>2016</year>). <article-title>The EffectLiteR approach for analyzing average and conditional effects</article-title>. <source>Multivariate Behavioral Research</source>, <volume>51</volume> (<issue>2–3</issue>), <fpage>374</fpage>–<lpage>391</lpage>. <pub-id pub-id-type="doi">10.1080/00273171.2016.1151334</pub-id></mixed-citation></ref>
<ref id="Xbib85"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Mayer</surname>, <given-names>A.</given-names></string-name>, <string-name name-style="western"><surname>Zimmermann</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Hoyer</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Salzer</surname>, <given-names>S.</given-names></string-name>, <string-name name-style="western"><surname>Wiltink</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Leibing</surname>, <given-names>E.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Leichsenring</surname>, <given-names>F.</given-names></string-name> (<year>2020</year>). <article-title>Interindividual differences in treatment effects based on structural equation models with latent variables: An EffectLiteR tutorial</article-title>. <source>Structural Equation Modeling: A Multidisciplinary Journal</source>, <volume>27</volume> (<issue>5</issue>), <fpage>798</fpage>–<lpage>816</lpage>. <pub-id pub-id-type="doi">10.1080/10705511.2019.1671196</pub-id></mixed-citation></ref>
	<ref id="Xbib86"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>McNeish</surname>, <given-names>D.</given-names></string-name> (<year>2022</year>). <article-title>Limitations of the sum-and-alpha approach to measurement in behavioral research</article-title>. <source>Policy Insights from the Behavioral and Brain Sciences</source>, <volume>9</volume> (<issue>2</issue>), <fpage>196</fpage>–<lpage>203</lpage>. <pub-id pub-id-type="doi">10.1177/23727322221117144</pub-id></mixed-citation></ref>
	<ref id="Xbib87"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>McNeish</surname>, <given-names>D.</given-names></string-name> (<year>2023</year>). <article-title>Psychometric properties of sum scores and factor scores differ even when their correlation is 0.98: A response to Widaman and Revelle</article-title>. <source>Behavior Research Methods</source>, <volume>55</volume> (<issue>8</issue>), <fpage>4269</fpage>–<lpage>4290</lpage>. <pub-id pub-id-type="doi">10.3758/s13428-022-02016-x</pub-id></mixed-citation></ref>
	<ref id="Xbib88"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>McNeish</surname>, <given-names>D.</given-names></string-name> (<year>2024</year>). <article-title>Practical implications of sum scores being psychometrics’ greatest accomplishment</article-title>. <source>Psychometrika</source>, <volume>89</volume> (<issue>4</issue>), <fpage>1148</fpage>–<lpage>1169</lpage>. <pub-id pub-id-type="doi">10.1007/s11336-024-09988-z</pub-id></mixed-citation></ref>
	<ref id="Xbib89"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>McNeish</surname>, <given-names>D.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Wolf</surname>, <given-names>M. G.</given-names></string-name> (<year>2020</year>). <article-title>Thinking twice about sum scores</article-title>. <source>Behavior Research Methods</source>, <volume>52</volume>, <fpage>2287</fpage>–<lpage>2305</lpage>. <pub-id pub-id-type="doi">10.3758/s13428-020-01398-0</pub-id></mixed-citation></ref>
	<ref id="Xbib90"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Michell</surname>, <given-names>J.</given-names></string-name> (<year>1994</year>). <article-title>Numbers as quantitative relations and the traditional theory of measurement</article-title>. <source>British Journal for the Philosophy of Science</source>, <volume>45</volume> (<issue>2</issue>), <fpage>389</fpage>–<lpage>406</lpage>. <pub-id pub-id-type="doi">10.1093/bjps/45.2.389</pub-id></mixed-citation></ref>
	<ref id="Xbib91"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Michell</surname>, <given-names>J.</given-names></string-name> (<year>1997</year>). <article-title>Quantitative science and the definition of measurement in psychology</article-title>. <source>British Journal of Psychology</source>, <volume>88</volume> (<issue>3</issue>), <fpage>355</fpage>–<lpage>383</lpage>. <pub-id pub-id-type="doi">10.1111/j.2044-8295.1997.tb02641.x</pub-id></mixed-citation></ref>
	<ref id="Xbib92"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Miratrix</surname>, <given-names>L. W.</given-names></string-name>, <string-name name-style="western"><surname>Weiss</surname>, <given-names>M. J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Henderson</surname>, <given-names>B.</given-names></string-name> (<year>2021</year>). <article-title>An applied researcher’s guide to estimating effects from multisite individually randomized trials: Estimands, estimators, and estimates</article-title>. <source>Journal of Research on Educational Effectiveness</source>, <volume>14</volume> (<issue>1</issue>), <fpage>270</fpage>–<lpage>308</lpage>. <pub-id pub-id-type="doi">10.1080/19345747.2020.1831115</pub-id></mixed-citation></ref>
	<ref id="Xbib93"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Mood</surname>, <given-names>C.</given-names></string-name> (<year>2010</year>). <article-title>Logistic regression: Why we cannot do what we think we can do, and what we can do about it</article-title>. <source>European Sociological Review</source>, <volume>26</volume> (<issue>1</issue>), <fpage>67</fpage>–<lpage>82</lpage>. <pub-id pub-id-type="doi">10.1093/esr/jcp006</pub-id></mixed-citation></ref>
	<ref id="Xbib94"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Muraki</surname>, <given-names>E.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Engelhard</surname>, <given-names>G., Jr.</given-names></string-name> (<year>1985</year>). <article-title>Full-information item factor analysis: Applications of EAP scores</article-title>. <source>Applied Psychological Measurement</source>, <volume>9</volume> (<issue>4</issue>), <fpage>417</fpage>–<lpage>430</lpage>. <pub-id pub-id-type="doi">10.1177/01466216850090041</pub-id></mixed-citation></ref>
<ref id="Xbib95"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Murnane</surname>, <given-names>R. J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Willett</surname>, <given-names>J. B.</given-names></string-name> (<year>2010</year>). <source>Methods matter: Improving causal inference in educational and social science research</source>. <publisher-name>Oxford University Press</publisher-name>.</mixed-citation></ref>
	<ref id="Xbib96"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Muthén</surname>, <given-names>B. O.</given-names></string-name> (<year>2002</year>). <article-title>Beyond SEM: General latent variable modeling</article-title>. <source>Behaviormetrika</source>, <volume>29</volume> (<issue>1</issue>), <fpage>81</fpage>–<lpage>117</lpage>. <pub-id pub-id-type="doi">10.2333/bhmk.29.81</pub-id></mixed-citation></ref>
	<ref id="Xbib97"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Naumann</surname>, <given-names>A.</given-names></string-name>, <string-name name-style="western"><surname>Hochweber</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Hartig</surname>, <given-names>J.</given-names></string-name> (<year>2014</year>). <article-title>Modeling instructional sensitivity using a longitudinal multilevel differential item functioning approach</article-title>. <source>Journal of Educational Measurement</source>, <volume>51</volume> (<issue>4</issue>), <fpage>381</fpage>–<lpage>399</lpage>. <pub-id pub-id-type="doi">10.1111/jedm.12051</pub-id></mixed-citation></ref>
<ref id="Xbib98"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Nicewander</surname>, <given-names>W. A.</given-names></string-name> (<year>2018</year>). <article-title>Conditional reliability coefficients for test scores</article-title>. <source>Psychological Methods</source>, <volume>23</volume> (<issue>2</issue>), <fpage>351</fpage>–<lpage>362</lpage>. <pub-id pub-id-type="doi">10.1037/met0000132</pub-id></mixed-citation></ref>
	<ref id="Xbib99"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Olivera-Aguilar</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Rikoon</surname>, <given-names>S. H.</given-names></string-name> (<year>2023</year>). <article-title>Intervention effect or measurement artifact? Using invariance models to reveal response-shift bias in experimental studies</article-title>. <source>Journal of Research on Educational Effectiveness</source>, <fpage>1</fpage>–<lpage>29</lpage>. <pub-id pub-id-type="doi">10.1080/19345747.2023.2284768</pub-id></mixed-citation></ref>
<ref id="Xbib100"><mixed-citation publication-type="journal"><collab>Open Science Collaboration</collab> (<year>2015</year>). <article-title>Estimating the reproducibility of psychological science</article-title>. <source>Science</source>, <volume>349</volume> (<issue>6251</issue>), <elocation-id>aac4716</elocation-id>. <pub-id pub-id-type="doi">10.1126/science.aac4716</pub-id></mixed-citation></ref>
<ref id="Xbib101"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Osterlind</surname>, <given-names>S. J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Everson</surname>, <given-names>H. T.</given-names></string-name> (<year>2009</year>). <source>Differential item functioning</source>. <publisher-name>SAGE Publications</publisher-name>.</mixed-citation></ref>
<ref id="Xbib102"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Padilla</surname>, <given-names>M. A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Divers</surname>, <given-names>J.</given-names></string-name> (<year>2016</year>). <article-title>A comparison of composite reliability estimators: Coefficient Omega confidence intervals in the current literature</article-title>. <source>Educational and Psychological Measurement</source>, <volume>76</volume> (<issue>3</issue>), <fpage>436</fpage>–<lpage>453</lpage>. <pub-id pub-id-type="doi">10.1177/0013164415593776</pub-id></mixed-citation></ref>
<ref id="Xbib103"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Pedersen</surname>, <given-names>A. P.</given-names></string-name>, <string-name name-style="western"><surname>Kellen</surname>, <given-names>D.</given-names></string-name>, <string-name name-style="western"><surname>Mayo-Wilson</surname>, <given-names>C.</given-names></string-name>, <string-name name-style="western"><surname>Davis-Stober</surname>, <given-names>C. P.</given-names></string-name>, <string-name name-style="western"><surname>Dunn</surname>, <given-names>J. C.</given-names></string-name>, <string-name name-style="western"><surname>Khan</surname>, <given-names>M. A.</given-names></string-name>, <string-name name-style="western"><surname>Stinchcombe</surname>, <given-names>M. B.</given-names></string-name>, <string-name name-style="western"><surname>Kalish</surname>, <given-names>M. L.</given-names></string-name>, <string-name name-style="western"><surname>Tentori</surname>, <given-names>K.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Haaf</surname>, <given-names>J.</given-names></string-name> (<year>2025</year>). <article-title>Discourse on measurement</article-title>. <source>Proceedings of the National Academy of Sciences of the United States of America</source>, <volume>122</volume> (<issue>5</issue>), <elocation-id>e2401229121</elocation-id>. <pub-id pub-id-type="doi">10.1073/pnas.2401229121</pub-id></mixed-citation></ref>
	<ref id="Xbib104"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Rabbitt</surname>, <given-names>M. P.</given-names></string-name> (<year>2018</year>). <article-title>Causal inference with latent variables from the Rasch model as outcomes</article-title>. <source>Measurement</source>, <volume>120</volume>, <fpage>193</fpage>–<lpage>205</lpage>. <pub-id pub-id-type="doi">10.1016/j.measurement.2018.01.044</pub-id></mixed-citation></ref>
<ref id="Xbib105"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Raju</surname>, <given-names>N. S.</given-names></string-name>, <string-name name-style="western"><surname>Price</surname>, <given-names>L. R.</given-names></string-name>, <string-name name-style="western"><surname>Oshima</surname>, <given-names>T.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Nering</surname>, <given-names>M. L.</given-names></string-name> (<year>2007</year>). <article-title>Standardized conditional SEM: A case for conditional reliability</article-title>. <source>Applied Psychological Measurement</source>, <volume>31</volume> (<issue>3</issue>), <fpage>169</fpage>–<lpage>180</lpage>. <pub-id pub-id-type="doi">10.1177/0146621606291569</pub-id></mixed-citation></ref>
<ref id="Xbib106"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Reckase</surname>, <given-names>M. D.</given-names></string-name> (<year>2004</year>). <article-title>The real world is more complicated than we would like</article-title>. <source>Journal of Educational and Behavioral Statistics</source>, <volume>29</volume> (<issue>1</issue>), <fpage>117</fpage>–<lpage>120</lpage>.</mixed-citation></ref>
	<ref id="Xbib107"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Revelle</surname>, <given-names>W.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Condon</surname>, <given-names>D. M.</given-names></string-name> (<year>2019</year>). <article-title>Reliability from α to ω: A tutorial</article-title>. <source>Psychological Assessment</source>, <volume>31</volume> (<issue>12</issue>), <fpage>1395</fpage>–<lpage>1411</lpage>. <pub-id pub-id-type="doi">10.1037/pas0000754</pub-id></mixed-citation></ref>
	<ref id="Xbib108"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Rhemtulla</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Savalei</surname>, <given-names>V.</given-names></string-name> (<year>2025</year>). <article-title>Estimated factor scores are not true factor scores</article-title>. <source>Multivariate Behavioral Research</source>, <volume>60</volume> (<issue>3</issue>), <fpage>598</fpage>–<lpage>619</lpage>. <pub-id pub-id-type="doi">10.1080/00273171.2024.2444943</pub-id></mixed-citation></ref>
	<ref id="Xbib109"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Rockwood</surname>, <given-names>N. J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Jeon</surname>, <given-names>M.</given-names></string-name> (<year>2019</year>). <article-title>Estimating complex measurement and growth models using the R package PLmixed</article-title>. <source>Multivariate Behavioral Research</source>, <volume>54</volume> (<issue>2</issue>), <fpage>288</fpage>–<lpage>306</lpage>. <pub-id pub-id-type="doi">10.1080/00273171.2018.1516541</pub-id></mixed-citation></ref>
	<ref id="Xbib110"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Romero</surname>, <given-names>M.</given-names></string-name>, <string-name name-style="western"><surname>Sandefur</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Sandholtz</surname>, <given-names>W. A.</given-names></string-name> (<year>2020</year>). <article-title>Outsourcing education: Experimental evidence from Liberia</article-title>. <source>American Economic Review</source>, <volume>110</volume> (<issue>2</issue>), <fpage>364</fpage>–<lpage>400</lpage>. <pub-id pub-id-type="doi">10.1257/aer.20181478</pub-id></mixed-citation></ref>
<ref id="Xbib111"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Rosenbaum</surname>, <given-names>P.</given-names></string-name> (<year>2017</year>). <source>Observation and experiment: An introduction to causal inference</source>. <publisher-name>Harvard University Press</publisher-name>.</mixed-citation></ref>
	<ref id="Xbib112"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Rosseel</surname>, <given-names>Y.</given-names></string-name> (<year>2012</year>). <article-title>lavaan: An R package for Structural Equation Modeling</article-title>. <source>Journal of Statistical Software</source>, <volume>48</volume> (<issue>2</issue>), <fpage>1</fpage>–<lpage>36</lpage>. <pub-id pub-id-type="doi">10.18637/jss.v048.i02</pub-id></mixed-citation></ref>
	<ref id="Xbib113"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Rubin</surname>, <given-names>D. B.</given-names></string-name> (<year>2005</year>). <article-title>Causal inference using potential outcomes: Design, modeling, decisions</article-title>. <source>Journal of the American Statistical Association</source>, <volume>100</volume> (<issue>469</issue>), <fpage>322</fpage>–<lpage>331</lpage>. <pub-id pub-id-type="doi">10.1198/016214504000001880</pub-id></mixed-citation></ref>
	<ref id="Xbib114"><mixed-citation publication-type="conference"><string-name name-style="western"><surname>Sales</surname>, <given-names>A.</given-names></string-name>, <string-name name-style="western"><surname>Prihar</surname>, <given-names>E.</given-names></string-name>, <string-name name-style="western"><surname>Heffernan</surname>, <given-names>N.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Pane</surname>, <given-names>J. F.</given-names></string-name> (<year>2021</year>). <italic>The effect of an intelligent tutor on performance on specific posttest problems</italic> (pp. 206–215). Proceedings of the 14th International Conference on Educational Data Mining (EDM21), Paris, France, June 29–July 2, 2021. International Educational Data Mining Society.</mixed-citation></ref>
<ref id="Xbib115"><mixed-citation publication-type="book"><string-name name-style="western"><surname>San Martín</surname>, <given-names>E.</given-names></string-name> (<year>2016</year>). <chapter-title>Identification of item response theory models</chapter-title>. In <string-name name-style="western"><given-names>W. J.</given-names> <surname>van der Linden</surname></string-name> (Ed.), <source>Handbook of item response theory</source> (Vol. 2, pp. 127–150). <publisher-name>Chapman &amp; Hall/CRC</publisher-name>.</mixed-citation></ref>
	<ref id="Xbib116"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Schafer</surname>, <given-names>W. D.</given-names></string-name> (<year>2006</year>). <article-title>Growth scales as an alternative to vertical scales</article-title>. <source>Practical Assessment, Research, and Evaluation</source>, <volume>11</volume> (<issue>1</issue>), <elocation-id>4</elocation-id>. <pub-id pub-id-type="doi">10.7275/xjkz-7n67</pub-id></mixed-citation></ref>
	<ref id="Xbib117"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Schielzeth</surname>, <given-names>H.</given-names></string-name> (<year>2010</year>). <article-title>Simple means to improve the interpretability of regression coefficients</article-title>. <source>Methods in Ecology and Evolution</source>, <volume>1</volume> (<issue>2</issue>), <fpage>103</fpage>–<lpage>113</lpage>. <pub-id pub-id-type="doi">10.1111/j.2041-210X.2010.00012.x</pub-id></mixed-citation></ref>
	<ref id="Xbib118"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Sébille</surname>, <given-names>V.</given-names></string-name>, <string-name name-style="western"><surname>Hardouin</surname>, <given-names>J.-B.</given-names></string-name>, <string-name name-style="western"><surname>Le Néel</surname>, <given-names>T.</given-names></string-name>, <string-name name-style="western"><surname>Kubis</surname>, <given-names>G.</given-names></string-name>, <string-name name-style="western"><surname>Boyer</surname>, <given-names>F.</given-names></string-name>, <string-name name-style="western"><surname>Guillemin</surname>, <given-names>F.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Falissard</surname>, <given-names>B.</given-names></string-name> (<year>2010</year>). <article-title>Methodological issues regarding power of classical test theory (CTT) and item response theory (IRT)-based approaches for the comparison of patient-reported outcomes in two groups of patients: A simulation study</article-title>. <source>BMC Medical Research Methodology</source>, <volume>10</volume>, <elocation-id>24</elocation-id>. <pub-id pub-id-type="doi">10.1186/1471-2288-10-24</pub-id></mixed-citation></ref>
<ref id="Xbib120"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Sengewald</surname>, <given-names>M.-A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Mayer</surname>, <given-names>A.</given-names></string-name> (<year>2024</year>). <article-title>Causal effect analysis in nonrandomized data with latent variables and categorical indicators: The implementation and benefits of EffectLiteR</article-title>. <source>Psychological Methods</source>, <volume>29</volume> (<issue>2</issue>), <fpage>287</fpage>–<lpage>307</lpage>. <pub-id pub-id-type="doi">10.1037/met0000489</pub-id></mixed-citation></ref>
<ref id="Xbib121"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Sengewald</surname>, <given-names>M.-A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Pohl</surname>, <given-names>S.</given-names></string-name> (<year>2019</year>). <article-title>Compensation and amplification of attenuation bias in causal effect estimates</article-title>. <source>Psychometrika</source>, <volume>84</volume> (<issue>2</issue>), <fpage>589</fpage>–<lpage>610</lpage>. <pub-id pub-id-type="doi">10.1007/s11336-019-09665-6</pub-id></mixed-citation></ref>
<ref id="Xbib019a"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Sengewald</surname>, <given-names>M.-A.</given-names></string-name>, <string-name name-style="western"><surname>Steiner</surname>, <given-names>P. M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Pohl</surname>, <given-names>S.</given-names></string-name> (<year>2019</year>). <article-title>When does measurement error in covariates impact causal effect estimates? Analytic derivations of different scenarios and an empirical illustration</article-title>. <source>British Journal of Mathematical and Statistical Psychology</source>, <volume>72</volume> (<issue>2</issue>), <fpage>244</fpage>–<lpage>270</lpage>. <pub-id pub-id-type="doi">10.1111/bmsp.12146</pub-id></mixed-citation></ref>
<ref id="Xbib122"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Shear</surname>, <given-names>B. R.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Briggs</surname>, <given-names>D. C.</given-names></string-name> (<year>2024</year>). <article-title>Measurement issues in causal inference</article-title>. <source>Asia Pacific Education Review</source>, <volume>25</volume> (<issue>3</issue>), <fpage>719</fpage>–<lpage>731</lpage>. <pub-id pub-id-type="doi">10.1007/s12564-024-09942-9</pub-id></mixed-citation></ref>	
<ref id="Xbib123"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Sijtsma</surname>, <given-names>K.</given-names></string-name>, <string-name name-style="western"><surname>Ellis</surname>, <given-names>J. L.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Borsboom</surname>, <given-names>D.</given-names></string-name> (<year>2024</year>). <article-title>Recognize the value of the sum score, psychometrics’ greatest accomplishment</article-title>. <source>Psychometrika</source>, <volume>89</volume> (<issue>1</issue>), <fpage>84</fpage>–<lpage>117</lpage>. <pub-id pub-id-type="doi">10.1007/s11336-024-09964-7</pub-id></mixed-citation></ref>
	<ref id="Xbib124"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Simmons</surname>, <given-names>J. P.</given-names></string-name>, <string-name name-style="western"><surname>Nelson</surname>, <given-names>L. D.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Simonsohn</surname>, <given-names>U.</given-names></string-name> (<year>2011</year>). <article-title>False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant</article-title>. <source>Psychological Science</source>, <volume>22</volume> (<issue>11</issue>), <fpage>1359</fpage>–<lpage>1366</lpage>. <pub-id pub-id-type="doi">10.1177/0956797611417632</pub-id></mixed-citation></ref>
<ref id="Xbib125"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Skrondal</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Rabe-Hesketh</surname>, <given-names>S.</given-names></string-name> (<year>2004</year>). <source>Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models</source>. <publisher-name>CRC Press</publisher-name>.</mixed-citation></ref>
	<ref id="Xbib126"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Soland</surname>, <given-names>J.</given-names></string-name> (<year>2022</year>). <article-title>Evidence that selecting an appropriate item response theory–based approach to scoring surveys can help avoid biased treatment effect estimates</article-title>. <source>Educational and Psychological Measurement</source>, <volume>82</volume> (<issue>2</issue>), <fpage>376</fpage>–<lpage>403</lpage>. <pub-id pub-id-type="doi">10.1177/00131644211007551</pub-id></mixed-citation></ref>
	<ref id="Xbib127"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Soland</surname>, <given-names>J.</given-names></string-name> (<year>2023</year>). <article-title>Item response theory models for difference-in-difference estimates (and whether they are worth the trouble)</article-title>. <source>Journal of Research on Educational Effectiveness</source>, <volume>17</volume> (<issue>2</issue>), <fpage>391</fpage>–<lpage>421</lpage>. <pub-id pub-id-type="doi">10.1080/19345747.2023.2195413</pub-id></mixed-citation></ref>
<ref id="Xbib128"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Soland</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Edwards</surname>, <given-names>K.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Talbert</surname>, <given-names>E.</given-names></string-name> (<year>2024</year>). <article-title>When should evaluators lose sleep over measurement? Toward establishing best practices</article-title>. <source>Journal of Research on Educational Effectiveness</source>, <fpage>1</fpage>–<lpage>33</lpage>. <pub-id pub-id-type="doi">10.1080/19345747.2024.2344011</pub-id></mixed-citation></ref>
	<ref id="Xbib129"><mixed-citation publication-type="author"><string-name name-style="western"><surname>Soland</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Gilbert</surname>, <given-names>J. B.</given-names></string-name> (<year>2025</year>). <source>Does socially desirable responding increase after an intervention? Implications for estimating treatment effects</source>. PsyArXiv. <pub-id pub-id-type="doi">10.31234/osf.io/ujx4n_v1</pub-id></mixed-citation></ref>
	<ref id="Xbib130"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Soland</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Johnson</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Talbert</surname>, <given-names>E.</given-names></string-name> (<year>2023</year>). <article-title>Regression discontinuity designs in a latent variable framework</article-title>. <source>Psychological Methods</source>, <volume>28</volume> (<issue>3</issue>), <fpage>691</fpage>–<lpage>704</lpage>. <pub-id pub-id-type="doi">10.1037/met0000453</pub-id></mixed-citation></ref>
	<ref id="Xbib131"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Soland</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Kuhfeld</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Edwards</surname>, <given-names>K.</given-names></string-name> (<year>2024</year>). <article-title>How survey scoring decisions can influence your study’s results: A trip through the IRT looking glass</article-title>. <source>Psychological Methods</source>, <volume>29</volume> (<issue>5</issue>), <fpage>1003</fpage>–<lpage>1024</lpage>. <pub-id pub-id-type="doi">10.1037/met0000506</pub-id></mixed-citation></ref>
	<ref id="Xbib133"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Sørensen</surname>, <given-names>Ø.</given-names></string-name> (<year>2024</year>). <article-title>Multilevel semiparametric latent variable modeling in R with “galamm”</article-title>. <source>Multivariate Behavioral Research</source>, <volume>59</volume> (<issue>5</issue>), <fpage>1098</fpage>–<lpage>1105</lpage>. <pub-id pub-id-type="doi">10.1080/00273171.2024.2385336</pub-id></mixed-citation></ref>
	<ref id="Xbib132"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Spybrook</surname>, <given-names>J.</given-names></string-name>, <string-name name-style="western"><surname>Kelcey</surname>, <given-names>B.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Dong</surname>, <given-names>N.</given-names></string-name> (<year>2016</year>). <article-title>Power for detecting treatment by moderator effects in two- and three-level cluster randomized trials</article-title>. <source>Journal of Educational and Behavioral Statistics</source>, <volume>41</volume> (<issue>6</issue>), <fpage>605</fpage>–<lpage>627</lpage>. <pub-id pub-id-type="doi">10.3102/1076998616655442</pub-id></mixed-citation></ref>
	<ref id="Xbib134"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Stoetzer</surname>, <given-names>L.</given-names></string-name>, <string-name name-style="western"><surname>Zhou</surname>, <given-names>X.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Steenbergen</surname>, <given-names>M.</given-names></string-name> (<year>2024</year>). <article-title>Causal inference with latent outcomes</article-title>. <source>American Journal of Political Science</source>, <volume>62</volume> (<issue>2</issue>), <fpage>624</fpage>–<lpage>640</lpage>. <pub-id pub-id-type="doi">10.1111/ajps.12871</pub-id></mixed-citation></ref>
	<ref id="Xbib135"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Thissen</surname>, <given-names>D.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Wainer</surname>, <given-names>H.</given-names></string-name> (Eds.). (<year>2001</year>). <chapter-title>An overview of test scoring</chapter-title>. <source>Test scoring</source> (pp. 13–32). Taylor &amp; Francis. <pub-id pub-id-type="doi">10.4324/9781410604729</pub-id></mixed-citation></ref>
	<ref id="Xbib136"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Timoneda</surname>, <given-names>J. C.</given-names></string-name> (<year>2021</year>). <article-title>Estimating group fixed effects in panel data with a binary dependent variable: How the LPM outperforms logistic regression in rare events data</article-title>. <source>Social Science Research</source>, <volume>93</volume>, <elocation-id>102486</elocation-id>. <pub-id pub-id-type="doi">10.1016/j.ssresearch.2020.102486</pub-id></mixed-citation></ref>
	<ref id="Xbib137"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Traub</surname>, <given-names>R. E.</given-names></string-name> (<year>1997</year>). <article-title>Classical test theory in historical perspective</article-title>. <source>Educational Measurement: Issues and Practice</source>, <volume>16</volume> (<issue>4</issue>), <fpage>8</fpage>–<lpage>14</lpage>.</mixed-citation></ref>
	<ref id="Xbib138"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>VanderWeele</surname>, <given-names>T. J.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Vansteelandt</surname>, <given-names>S.</given-names></string-name> (<year>2022</year>). <article-title>A statistical test to reject the structural interpretation of a latent factor model</article-title>. <source>Journal of the Royal Statistical Society Series B: Statistical Methodology</source>, <volume>84</volume> (<issue>5</issue>), <fpage>2032</fpage>–<lpage>2054</lpage>. <pub-id pub-id-type="doi">10.1111/rssb.12555</pub-id></mixed-citation></ref>
	<ref id="Xbib139"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Wicherts</surname>, <given-names>J. M.</given-names></string-name>, <string-name name-style="western"><surname>Veldkamp</surname>, <given-names>C. L.</given-names></string-name>, <string-name name-style="western"><surname>Augusteijn</surname>, <given-names>H. E.</given-names></string-name>, <string-name name-style="western"><surname>Bakker</surname>, <given-names>M.</given-names></string-name>, <string-name name-style="western"><surname>Van Aert</surname>, <given-names>R.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Van Assen</surname>, <given-names>M. A.</given-names></string-name> (<year>2016</year>). <article-title>Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid <italic>p</italic>-hacking</article-title>. <source>Frontiers in Psychology</source>, <volume>7</volume>, <elocation-id>1832</elocation-id>. <pub-id pub-id-type="doi">10.3389/fpsyg.2016.01832</pub-id></mixed-citation></ref>
	<ref id="Xbib140"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Widaman</surname>, <given-names>K. F.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Revelle</surname>, <given-names>W.</given-names></string-name> (<year>2023</year>). <article-title>Thinking thrice about sum scores, and then some more about measurement and analysis</article-title>. <source>Behavior Research Methods</source>, <volume>55</volume> (<issue>2</issue>), <fpage>788</fpage>–<lpage>806</lpage>. <pub-id pub-id-type="doi">10.3758/s13428-022-01849-w</pub-id></mixed-citation></ref>
	<ref id="Xbib141"><mixed-citation publication-type="book"><string-name name-style="western"><surname>Wilson</surname>, <given-names>M.</given-names></string-name>, <string-name name-style="western"><surname>De Boeck</surname>, <given-names>P.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Carstensen</surname>, <given-names>C. H.</given-names></string-name> (<year>2008</year>). <chapter-title>Explanatory item response models: A brief introduction</chapter-title>. In J. Hartig, E. Klieme, &amp; D. Leutner (Eds.), <source>Assessment of competencies in educational contexts</source> (pp. 91–120). Hogrefe &amp; Huber Publishers.</mixed-citation></ref>
	<ref id="Xbib142"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Xu</surname>, <given-names>T.</given-names></string-name>, &amp; <string-name name-style="western"><surname>Stone</surname>, <given-names>C. A.</given-names></string-name> (<year>2012</year>). <article-title>Using IRT trait estimates versus summated scores in predicting outcomes</article-title>. <source>Educational and Psychological Measurement</source>, <volume>72</volume> (<issue>3</issue>), <fpage>453</fpage>–<lpage>468</lpage>. <pub-id pub-id-type="doi">10.1177/0013164411419846</pub-id></mixed-citation></ref>
	<ref id="Xbib143"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Ye</surname>, <given-names>F.</given-names></string-name> (<year>2016</year>). <article-title>Latent growth curve analysis with dichotomous items: Comparing four approaches</article-title>. <source>British Journal of Mathematical and Statistical Psychology</source>, <volume>69</volume> (<issue>1</issue>), <fpage>43</fpage>–<lpage>61</lpage>. <pub-id pub-id-type="doi">10.1111/bmsp.12058</pub-id></mixed-citation></ref>
	<ref id="Xbib144"><mixed-citation publication-type="journal"><string-name name-style="western"><surname>Zwinderman</surname>, <given-names>A. H.</given-names></string-name> (<year>1991</year>). <article-title>A generalized Rasch model for manifest predictors</article-title>. <source>Psychometrika</source>, <volume>56</volume> (<issue>4</issue>), <fpage>589</fpage>–<lpage>600</lpage>. <pub-id pub-id-type="doi">10.1007/BF02294492</pub-id></mixed-citation></ref>
</ref-list>
<app-group>
<app id="x1-240006.4" content-type="app"><title>Appendix</title>
<sec id="x1-25000A"><title>Additional Simulation Results</title>
<p>All figures include EIV corrections for the two-step scores. The figure for statistical power only includes true effect sizes of 0.2 because near-ceiling levels of power are achieved at an effect size of 0.4.</p><fig id="x1-80021a" position="anchor" orientation="portrait"><label>Figure A.1.</label><caption><title>Estimated Standard Errors by Method</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="meth.15773-a1.png" position="anchor" orientation="portrait"/></fig><fig id="x1-250022a" position="anchor" orientation="portrait"><label>Figure A.2.</label><caption><title>Estimated Standard Error Calibration by Method</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="meth.15773-a2.png" position="anchor" orientation="portrait"/></fig><fig id="x1-250033a" position="anchor" orientation="portrait"><label>Figure A.3.</label><caption><title>Estimated False Positive Rates by Method</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="meth.15773-a3.png" position="anchor" orientation="portrait"/></fig><fig id="x1-250044a" position="anchor" orientation="portrait"><label>Figure A.4.</label><caption><title>Estimated Power by Method</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="meth.15773-a4.png" position="anchor" orientation="portrait"/></fig></sec>
</app>
</app-group>
	<sec sec-type="data-availability" id="das"><title>Data Availability</title>
		<p id="S6.PX1.P1">A replication toolkit is available from <xref ref-type="bibr" rid="Xbib45.5">Gilbert (2025)</xref>. The empirical datasets and study materials from <xref ref-type="bibr" rid="Xbib46.5">Gilbert, Himmelsbach, et al. (2025b)</xref> are available at <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.7910/DVN/C4TJCA">https://doi.org/10.7910/DVN/C4TJCA</ext-link> and in the Item Response Warehouse (IRW; <xref ref-type="bibr" rid="Xbib35">Domingue et al., 2024</xref>) under the prefix <monospace>gilbert_meta</monospace>: <ext-link ext-link-type="uri" xlink:href="https://redivis.com/datasets/as2e-cv7jb41fd">https://redivis.com/datasets/as2e-cv7jb41fd</ext-link>.</p>
	</sec>		
	<sec sec-type="supplementary-material" id="sp1"><title>Supplementary Materials</title>
		<p>For this article, the following Supplementary Materials are available:
			<list list-type="bullet">
				<list-item><p>Data. (<xref ref-type="bibr" rid="Xbib46.5">Gilbert, Himmelsbach, et al., 2025b</xref>)</p></list-item>
				<list-item><p>Study materials. (<xref ref-type="bibr" rid="Xbib46.5">Gilbert, Himmelsbach, et al., 2025b</xref>)</p></list-item>
				<list-item><p>Replication toolkit. (<xref ref-type="bibr" rid="Xbib45.5">Gilbert, 2025</xref>)</p></list-item>
			</list></p>
	</sec>		

<fn-group>
<fn fn-type="financial-disclosure"><p>The author has no funding to report.</p></fn>
</fn-group>
	
<fn-group>
<fn fn-type="conflict"><p>The author has declared that no competing interests exist.</p></fn>
</fn-group>
<ack>
	<p id="S6.PX1.P4">The author wishes to thank Andrew Ho, Ben Domingue, Derek Briggs, and two anonymous reviewers for their helpful comments on drafts of this paper.</p>
</ack>
	
	
	
</back>
</article>
