Original Article

Collapsing or Not? A Practical Guide to Handling Sparse Responses for Polytomous Items

Yale Quan*1 , Chun Wang1

Methodology, 2025, Vol. 21(1), 46–73, https://doi.org/10.5964/meth.14303

Received: 2024-04-02. Accepted: 2024-12-12. Published (VoR): 2025-03-31.

Handling Editor: Belén Fernández-Castilla, Universidad Nacional de Educación a Distancia, Madrid, Spain

*Corresponding author at: 312E Miller Hall, 2012 Skagit Ln., College of Education, University of Washington, Seattle, WA 98105, USA. E-mail: yalequan@uw.edu

This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In ordinal data analysis, category collapse is the process of combining adjacent response options to create fewer response categories than were originally measured. When collapsing response categories, researchers need to be aware of inducing data-model misfit and of obtaining biased parameter estimates. Through mathematical derivation we show that category collapse induces data-model misfit in data generated from the Generalized Partial Credit IRT model (GPCM). This data-model misfit is not present in data generated from the Graded Response IRT model (GRM). Using simulation studies, we found that category collapse can indicate better data-model fit in both GRM- and GPCM-generated data. In the case of GPCM data, this result is spurious and can lead practitioners to draw conclusions from models that do not fit the data well. Recovered GPCM item parameters were also significantly biased. Recommendations for practitioners who wish to collapse categories are provided.

Keywords: category collapse, item response theory, generalized partial credit model, parameter recovery, data-model fit

Instruments that use ordinal response data (e.g., Likert scale data or partial credit scoring) are widely used in educational research studies. When using ordinal response data, research practitioners often need to handle potential response sparseness, which occurs when a response option (or score) is rarely, if ever, endorsed (or earned). The common practice for addressing response sparseness is to combine adjacent response categories, either to reduce the number of response options or to dichotomize a polytomous response item so that each category contains a sufficient number of responses (e.g., Outpatient and Ambulatory Surgery CAHPS, 2017), to improve data-model fit (e.g., National Center for Education Statistics, 2008), or to reshape the data to meet specific modeling needs (e.g., Harpe, 2015). Collapsing response categories to create a dichotomous response item from what was originally a polytomous response item can introduce significant data-model misfit (Jansen & Roskam, 1986), along with a loss of power, reduced scale reliability, and spurious statistical significance when determining measurement invariance (Altman & Royston, 2006; MacCallum et al., 2002; Rutkowski et al., 2019). When collapsing categories to improve polytomous data-model fit, the observed improvement in model fit may be spurious and applicable only to the collected sample (Rutkowski et al., 2019).

Utilizing a category collapse approach is also commonly observed in item response theory (IRT) analyses of ordinal response data to address disordered thresholds (e.g., Kim et al., 2010; Linden et al., 2020; Matovu, 2019; Smith et al., 2003). When modeling ordinal response data with an adjacent-categories IRT model, threshold order indicates how well an instrument functions. Disordered thresholds may indicate sparse response data in one (or more) of the response categories (Wind, 2023). Threshold order depends on the number of participants responding to an item, and collapsing the central response option to address threshold ordering (e.g., Rost & Von Davier, 1995) can complicate the understanding of the original scale by mixing respondents whose central response reflects their latent trait with other respondents (Wetzel & Carstensen, 2014). Recent research suggests that when using a Partial Credit IRT model (PCM), disordered thresholds should be addressed by collapsing the response options closest to the disordered threshold category. When the central response option was affected, symmetrically collapsing response categories (collapsing from both ends of the response continuum simultaneously) best addressed the disordered thresholds (Tsai et al., 2024).

The action of collapsing categories changes the observed item response matrix and, in turn, the conditional probabilities of a respondent selecting a response option or earning a particular score (Tsai et al., 2024). This action violates the joining assumption (Jansen & Roskam, 1986) needed to fit adjacent-categories logistic models, such as the Partial Credit Model (PCM). Previously, Harel (2014) and subsequently Harel and Steele (2018) provided an in-depth discussion of this topic when fitting a PCM.

Harel (2014) demonstrated through mathematical derivation that fitting a PCM to data with collapsed categories results in deliberate model misspecification. Harel proposed three category collapse rules based on: (1) the item response function, (2) maximum information, and (3) integrated information. Simulation studies were performed to test the proposed rules, to examine the effect of category collapse on person parameter recovery, and to study how category collapse influenced threshold parameter recovery. However, none of the proposed rules could be used to make a consistent decision on when category collapse is appropriate. Harel and Steele (2018) extended this area of research by proposing an information matrix test (IMT) to assess the degree of data-model misspecification introduced by collapsing the central response category. They concluded that the direction in which the central category was collapsed significantly influenced PCM data-model fit, and that this misspecification was detectable using the proposed information matrix test.

Although Harel and many other researchers have concluded that utilizing collapsed categories negatively impacts statistical analyses, a significant amount of published research utilizes data with collapsed categories. MacCallum et al. (2002) found that 11.50% of articles published between 1998 and 2000 in the Journal of Personality and Social Psychology, Journal of Consulting and Clinical Psychology, and Journal of Counseling Psychology contained at least one instance of dichotomization through category collapse. Additionally, only 20% of these studies provided justification for dichotomizing responses. This finding illustrates the need for a comprehensive guide on when category collapse is justified and for proper guidelines on implementing it.

There are two main open questions concerning category collapse that this paper aims to address. Firstly, despite the significant research performed on collapsing response categories to address threshold order and data-model fit, there is no clear conclusion on whether category collapse introduces bias into polytomous IRT model item parameter estimates. If biased item parameter estimates are obtained, it is also unclear to what extent the ability estimates may be impacted. Therefore, the question remains: by collapsing response categories to address threshold order and/or data model fit, are we inadvertently biasing item and person parameter estimates?

Secondly, the majority of previous research concerning category collapse has focused on the Partial Credit IRT model. In practice, the two primary polytomous IRT models used to model education data are the Generalized Partial Credit Model (GPCM; Muraki, 1992) and the Graded Response Model (GRM; Samejima, 1968). However, to date, no research has examined and compared the effect category collapse has on these data-generating models. Therefore, the question remains whether these results also hold for GRM and GPCM data.

Thus, the purpose of this study is to extend the literature on category collapse by investigating, through rigorous Monte Carlo simulations, if category collapse can be justified when utilizing the Graded Response and Generalized Partial Credit Item Response Theory models. We extend the literature by examining if sample size, the direction of collapse, and the number of responses within a collapsed category have an influence on Graded Response Model and Generalized Partial Credit model parameter estimation and data-model fit. By performing this in-depth investigation, we can develop recommendations for educational research practitioners who want to collapse adjacent response options while introducing the least amount of bias in their results.

We begin by presenting relevant IRT methodology along with discussing statistical methodology used for item parameter estimation. The second section presents our simulation study and results. Lastly, we present an empirical example of category collapse and discuss how our simulation study results are supported by this study. We conclude with recommendations for research practitioners based on our study.

Methodology

Item Response Theory Methodology

Two of the most popular unidimensional polytomous Item Response Theory models used to analyze ordinal response data are the Graded Response and Generalized Partial Credit models. While these two models can be used to analyze ordinal response data, education research practitioners should be aware of the fundamental differences between these two models.

The Graded Response Model

The Graded Response model is used to model the probability of a respondent endorsing a particular response. For example, consider an instrument composed of items that have Likert scale responses coded as k = 0, …, K, where zero is a low endorsement of the latent trait and K is a high endorsement. Then the probability of an examinee with ability level θ endorsing Category k on Item j is defined as:

(1)
P_{jk}(X = k \mid \theta) = P^{*}_{jk}(\theta) - P^{*}_{j,k+1}(\theta) = P(X \geq k) - P(X \geq k+1)

where

P^{*}_{jk}(\theta) = \frac{1}{1 + e^{-(d_{jk} + a_j\theta)}}, \quad k = 1, \ldots, K

with item-specific discrimination parameter a_j and K threshold parameters d_{j1}, d_{j2}, \ldots, d_{jK}. Furthermore, we define P^{*}_{j0}(\theta) \equiv 1 and P^{*}_{j,K+1}(\theta) \equiv 0.
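To make the cumulative-difference structure of the GRM concrete, it can be sketched in a few lines. This is an illustrative sketch, not the paper's estimation code; the function name and the example parameter values are our own.

```python
import math

def grm_category_probs(theta, a, d):
    """GRM probabilities for categories k = 0, ..., K of one item.

    theta : examinee ability level
    a     : item discrimination a_j
    d     : K ordered threshold parameters [d_1, ..., d_K]
    """
    # Cumulative probabilities P*_k of endorsing category k or above,
    # with the boundary definitions P*_0 = 1 and P*_{K+1} = 0.
    p_star = ([1.0]
              + [1.0 / (1.0 + math.exp(-(dk + a * theta))) for dk in d]
              + [0.0])
    # Each category probability is a difference of adjacent cumulative curves.
    return [p_star[k] - p_star[k + 1] for k in range(len(d) + 1)]
```

For example, with a = 1 and thresholds d = (2, 0, −2), an examinee at θ = 0 receives four category probabilities that sum to one, with most mass in the middle categories.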

The Generalized Partial Credit Model

In contrast, the Generalized Partial Credit model is used to model the probability of an examinee earning the kth score out of a possible score of K. Respondents who receive maximum credit on an item earn a score of K, and this score decreases to 0 as respondents make mistakes. A fundamental assumption of the Generalized Partial Credit model is that potential item scores are ordered such that a respondent who earns a score of k has correctly satisfied the requirements to earn a score of k − 1. However, the difficulty of moving from a score of k − 1 to k is not necessarily higher than the difficulty of stepping up from k − 2 to k − 1; hence, the threshold parameters need not be strictly ordered. In contrast, because the GRM is a "difference model," all of its threshold parameters must be strictly ordered. The probability of an examinee with ability level θ earning the kth score on Item j is:

(2)
P_{jk}(\theta) = \frac{\exp\left(\sum_{v=0}^{k} (d_{jv} + a_j\theta)\right)}{\sum_{c=0}^{K} \exp\left(\sum_{v=0}^{c} (d_{jv} + a_j\theta)\right)}

with item-specific discrimination parameter a_j and K_j − 1 threshold parameters d_{j1}, d_{j2}, \ldots, d_{j,K_j-1}. Note that the index c is not associated with any IRT parameter and is only used to index the summations. We also define d_{j0} \equiv 0 and \sum_{k=0}^{K} P_{jk}(\theta) \equiv 1.
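The divide-by-total structure of Equation 2 can be sketched analogously. Again, this is an illustrative sketch with hypothetical names, following the paper's parameterization with d_0 = 0.

```python
import math

def gpcm_category_probs(theta, a, d):
    """GPCM probabilities for scores k = 0, ..., K of one item.

    theta : examinee ability level
    a     : item discrimination a_j
    d     : threshold parameters [d_1, ..., d_K]; d_0 = 0 is implied
    """
    thresholds = [0.0] + list(d)  # prepend d_0 = 0 by definition
    # Numerator for score k is exp of the running sum of (d_v + a*theta).
    running, numerators = 0.0, []
    for d_v in thresholds:
        running += d_v + a * theta
        numerators.append(math.exp(running))
    total = sum(numerators)  # "divide-by-total" normalization
    return [num / total for num in numerators]
```

Unlike the GRM sketch, nothing here requires the thresholds to be ordered, mirroring the discussion above.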

The Influence of Category Collapse on GRM and GPCM Parameter Estimation

Comparing Equations 1 and 2 we can see that, while both the GRM and GPCM are used to model polytomous item response data, they have significantly different functional forms. The GRM belongs to the “difference” model family while the GPCM belongs to the “divide-by-total” family of IRT models (García-Pérez, 2017; Thissen & Steinberg, 1986). This suggests that category collapse may influence each model differently. In this section, we highlight how threshold parameter estimation might be influenced by category collapse.

In Appendix A we derive the complete-data log-likelihood functions and partial derivatives for the GRM and GPCM, respectively. From this derivation we conclude that if a GRM is fit to the observed item responses, the direction of collapse does not influence threshold parameter estimation. Collapsing the highest two categories involves removing the threshold parameter corresponding to the highest response option. Similarly, when collapsing into a lower response category, the threshold parameter for the higher response option is removed. After removing the collapsed threshold, a GRM can still be used to estimate the remaining threshold parameters. The derivation is theoretical, and it alone does not provide a practical guide as to when collapse is needed. Instead, simulation studies must be used to determine sparseness thresholds based on empirical evidence.

In contrast to the GRM, using collapsed categories with the GPCM changes the underlying threshold parameter estimation equation. A threshold parameter in the GRM models the probability of responding in Category k or above. This link between non-adjacent categories allows a threshold parameter to be removed without changing the underlying IRT model. In contrast, a threshold parameter in the GPCM relates only to two adjacent response categories, with no connection to the other response categories. Therefore, collapsing adjacent response categories by removing a threshold parameter influences the estimation of all other response categories, and this collapse process changes the underlying IRT model. Prior research has reached similar conclusions (e.g., Harel, 2014; Harel & Steele, 2018). However, these prior studies did not examine whether the data-model misfit significantly biases parameter estimation. The degree to which parameter estimates are influenced can only be studied through simulation.

Testing Data-Model Fit

Here, we use the generalized S-X2 test for polytomous item response data (Kang & Chen, 2008) to validate two conclusions presented in prior category collapse research: (1) analyzing collapsed GPCM data by fitting a GPCM results in a significant data-model misfit percentage, and (2) analyzing collapsed GRM data by fitting a GRM results in a nominal data-model misfit percentage. If Conclusion (1) holds, we would observe a large percentage of GPCM simulations using collapsed data flagged for data-model misfit compared to baseline data-model fit percentages. If Conclusion (2) holds, we would observe a similar percentage of data-model misfit for all GRM simulations. A brief outline of the S-X2 test is provided below.

The generalized S-X2 test is used to determine how well a proposed IRT model matches the observed item responses. The S-X2 test statistic follows an asymptotic chi-square distribution, and its corresponding p-value can be interpreted similarly to traditional chi-square goodness-of-fit tests. The S-X2 test statistic is calculated as:

(3)
S\text{-}X^2 = \sum_{m=K_j}^{F-K_j} N_m \sum_{k=0}^{K_j} \frac{(O_{ikm} - E_{ikm})^2}{E_{ikm}}

where k is the item response category/score, K_j is the highest possible score/response category of Item j, F is a perfect test score, and N_m is the number of examinees in Group m. Note that the outer summation excludes groups with extreme test scores (e.g., all correct or all incorrect) because these groups have expected proportions of zero. The conditional expected and observed category proportions, E_{ikm} and O_{ikm} respectively, along with the test's degrees of freedom, are calculated using the methodology outlined in Kang and Chen (2008).
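The core of the statistic is a familiar Pearson-type sum. The sketch below assumes the score grouping, the group counts N_m, and the observed and expected proportion tables have already been computed following Kang and Chen (2008); it only accumulates the sum in Equation 3, and the function and variable names are ours.

```python
def s_x2_sum(group_sizes, observed, expected):
    """Accumulate sum_m N_m * sum_k (O_mk - E_mk)^2 / E_mk.

    group_sizes : N_m for each retained score group m
    observed    : observed category proportions, one row per group
    expected    : model-implied proportions, same shape as `observed`
    Extreme score groups (expected proportion zero) must already be excluded.
    """
    stat = 0.0
    for n_m, obs_row, exp_row in zip(group_sizes, observed, expected):
        for o, e in zip(obs_row, exp_row):
            stat += n_m * (o - e) ** 2 / e
    return stat
```

The resulting value would then be referred to a chi-square distribution with the degrees of freedom given by Kang and Chen (2008).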

Simulation Study Methodology

For the simulation study we began by generating two datasets, one using a GRM data-generating model and the other using a GPCM data-generating model. Each dataset contained 12 items with 5 response categories (0, 1, 2, 3, 4). We induced sparseness in response Category 3 by manipulating the data-generating d_3 threshold parameter until we achieved an endorsement rate of approximately 5% for Items 7–9 and approximately 2.5% for Items 10–12. No response categories other than Category 3 contained sparse endorsement rates. Data were generated using the mirt package (Chalmers, 2012) within R Statistical Software v4.3.2 (R Core Team, 2023). Data-generating parameters are provided in Appendix B.

Two endorsement thresholds were used to determine when category collapse is appropriate: (1) when a response category is endorsed by 5% or less of respondents, and (2) when a response category is endorsed by 2.5% or less of respondents. Using the 2.5% endorsement threshold, Items 10–12 would have Category 3 collapsed into an adjacent category; using the 5% endorsement threshold, Items 7–12 would have Category 3 collapsed into an adjacent category.

This collapse process resulted in 6 unique datasets for each data-generating process, as outlined in Table 1 below: (1a & 1b) the Baseline dataset, the original data without any collapsed categories; (2a & 2b) the Collapsed-Up dataset, where responses in Category 3 were recoded into Category 4; and (3a & 3b) the Collapsed-Down dataset, where responses in Category 3 were recoded into Category 2. Datasets 2a and 3a result from using a 5% endorsement threshold, and Datasets 2b and 3b from a 2.5% endorsement threshold. Each of these six datasets was generated using one of six sample sizes: 150, 250, 500, 1000, 1500, and 2000.

Table 1

Data Generation Process

                                2.5% Sparseness Condition                        5% Sparseness Condition
Collapse Direction     Baseline | Collapsed Up | Collapsed Down         Baseline | Collapsed Up | Collapsed Down
Sample Size            150, 250, 500, 1000, 1500, 2000 (every condition)
Dataset Labeling       1a       | 2a           | 3a                     1b       | 2b           | 3b
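The recoding behind Table 1 is mechanical: flag categories endorsed at or below the chosen threshold, then shift those responses into an adjacent category. A minimal sketch with hypothetical helper names (a real analysis would apply this item by item to the full response matrix):

```python
def sparse_categories(responses, n_categories=5, threshold=0.05):
    """Return categories whose endorsement rate is at or below `threshold`."""
    n = len(responses)
    return [k for k in range(n_categories)
            if responses.count(k) / n <= threshold]

def collapse_category(responses, target, direction="up"):
    """Recode `target` into the adjacent category: up (target + 1) or down."""
    shift = 1 if direction == "up" else -1
    return [r + shift if r == target else r for r in responses]
```

For example, with 20 responses of which a single one falls in Category 3 (a 5% endorsement rate), Category 3 is flagged and can be collapsed up into Category 4 or down into Category 2.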

After generating the analysis datasets, a Generalized Partial Credit or Graded Response IRT model was fitted to the data, depending on which data-generating model was used. Parameter estimates were obtained using the standard EM algorithm with fixed quadrature within the mirt package. To determine how closely item parameters recovered from the collapsed datasets match parameters recovered from the baseline dataset, we calculated the relative bias (Hoogland & Boomsma, 1998) of the estimated item parameters. Using the discrimination parameter as an example, let a_j denote the simulated true discrimination parameter and \bar{\hat{a}}_j the mean of the recovered discrimination parameters for the jth item over all simulations. The relative bias of the discrimination parameter for the jth item is then calculated as:

(4)
B(\bar{\hat{a}}_j) = \frac{\bar{\hat{a}}_j - a_j}{a_j}

Relative bias values more extreme than ±0.05 indicate that the item parameter of interest was not well recovered.
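Equation 4 and the 0.05 flagging rule can be expressed directly. An illustrative sketch (function names are ours):

```python
def relative_bias(estimate_mean, true_value):
    """Relative bias of Equation 4: (mean recovered estimate - true) / true."""
    return (estimate_mean - true_value) / true_value

def poorly_recovered(estimate_mean, true_value, cutoff=0.05):
    """Flag a parameter whose |relative bias| exceeds the cutoff."""
    return abs(relative_bias(estimate_mean, true_value)) > cutoff
```

So a discrimination parameter whose mean recovered value is 1.1 against a true value of 1.0 has relative bias 0.10 and would be flagged, while a mean of 1.02 would not.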

When collapsing categories, the number of threshold parameters changes. Figure 1 displays how the threshold parameters change when collapsing categories. The parameters above the number line denote the simulated true data-generating parameters, and the parameters below the number line denote the recovered item parameters. When the data do not contain any collapsed categories (Figure 1a), we can directly compare the recovered parameters, denoted \hat{a}, \hat{d}_1, \hat{d}_2, \hat{d}_3, and \hat{d}_4. When collapsing response Category 3 up into Category 4 (Figure 1b), we are essentially removing the d_4 parameter and directly comparing the remaining item parameters. When collapsing response Category 3 down into Category 2 (Figure 1c), we are removing the d_3 parameter.

Figure 1

Parameter Comparisons

Note. Figure 1a (top) represents the baseline condition. Figure 1b (middle) represents collapsing Category 3 up. Figure 1c (bottom) represents collapsing Category 3 down.

To determine how closely the recovered person parameters matched the simulated true data generating person parameters, we calculated the simulation average bias and Root Mean Squared Error (RMSE) of recovered person parameters. RMSE is calculated using Equation 5 where θ^i is the recovered person parameter of Person i and θi is the simulated true data generating person parameter for Person i.

(5)
\mathrm{RMSE}(\hat{\theta}) = \sqrt{\frac{\sum_{i=1}^{N} (\hat{\theta}_i - \theta_i)^2}{N}}
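Both person-recovery summaries are one-liners. A sketch of the average bias and of Equation 5 (function names are ours):

```python
import math

def average_bias(estimates, truths):
    """Mean of (theta_hat_i - theta_i) across persons."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(truths)

def rmse(estimates, truths):
    """Root mean squared error of recovered person parameters (Equation 5)."""
    return math.sqrt(sum((e - t) ** 2
                         for e, t in zip(estimates, truths)) / len(truths))
```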

To understand if category collapse results in detectable model misfit we calculated the percentage of items that were flagged for model misspecification across all simulations. In summary, our simulation study is a 6×2×3×2 completely crossed design with (a) 6 sample size levels (150, 250, 500, 1000, 1500, 2000), (b) 2 levels of response sparseness for Category 3 (5%, 2.5%), (c) 3 collapse directions (baseline, collapsed-up, collapsed-down), and (d) 2 IRT models (GRM and GPCM).

Simulation Study Results

Category Collapse and Data-Model Fit

Figures 2 and 3 below display the percentage of simulations (out of 500) in which an item containing collapsed categories was flagged for data-model misfit using an alpha level of α = 0.05. When examining GRM data-model fit (Figure 2), we observed slightly inflated Type I error rates for Items 7–12 in the GRM baseline dataset. This suggests that the S-X2 test is sensitive to sparseness in GRM data. Kang and Chen (2008) also observed inflated Type I error rates when using the S-X2 test with sparse response frequencies, and concluded that despite the inflation, the S-X2 test can still be used for determining GRM data-model fit. When fitting a GRM to data containing collapsed categories, data-model fit improved for items containing collapsed categories. In almost all cases, the percentage of flagged simulations decreased after category collapse.

Figure 2

Percentage of Flagged Simulations: Items 7–12. GRM Data

Note. When using a 2.5% endorsement sparseness collapse rule Items 10–12 contained collapsed categories and using a 5% endorsement sparseness collapse rule Items 7–12 contained collapsed categories.

Figure 3

Percentage of Flagged Simulations: Items 7–12. GPCM Data

Note. When using a 2.5% endorsement sparseness collapse rule Items 10–12 contained collapsed categories and using a 5% endorsement sparseness collapse rule Items 7–12 contained collapsed categories.

Conversely, when using GPCM with collapsed categories (Figure 3) the percentage of simulations flagged for data-model misfit increased. This result confirms research performed in Harel (2014) by demonstrating that collapsing categories with GPCM data introduces undesirable data-model misfit. Additionally, unique to our study, we can see from Figure 3 that the power to detect this misfit is only present at larger sample sizes. At smaller sample sizes, collapsing GPCM response categories does not seem to worsen the data-model fit.

Category Collapse and Person Parameter Recovery

The person parameter θ was well recovered when fitting a Graded Response model for all sample size, collapse direction, and sparseness conditions. The average bias of recovered person parameters over 500 simulations was very close to zero in all conditions. Additionally, collapsing response Category 3 decreased the RMSE, suggesting that person parameters were better recovered under the collapsed conditions. These results are presented in Figures 4 and 5. This finding supports the research performed in Jiang (2018): when adjacent response categories were collapsed due to response sparseness, person parameter recovery was not affected. These results suggest that as much as 5% of the total number of responses can be collapsed without influencing GRM person parameter recovery, even with a sample size as small as 150.

Figure 4

Average Bias of Recovered GRM Person Parameters

Figure 5

RMSE of Recovered GRM Person Parameters

Similar results were observed when examining person parameter recovery using the GPCM with collapsed categories (Figures 6 and 7). The average bias was very close to zero for all simulation conditions, and standard errors were consistently around 0.30. The RMSE of recovered person parameters was smallest when no responses were collapsed, and both collapse conditions resulted in similar RMSE values. However, the increase in RMSE was marginal (< .10). These results suggest that despite the intentional model misspecification induced by using the GPCM with collapsed categories, person parameter recovery is not affected.

Figure 6

Average Bias of Recovered GPCM Person Parameters

Figure 7

RMSE of Recovered GPCM Person Parameters

Category Collapse and Item Parameter Recovery

When fitting a Graded Response model using a 5% collapse rule, Items 7–12 contain collapsed categories. The slope parameter a_1 was well recovered for all items containing collapsed categories. The threshold parameters d_1, d_2, and d_3 were also well recovered from items containing collapsed categories. Using a 2.5% collapse rule, Items 10–12 contained collapsed categories. Similar to the 5% collapse results, the slope and threshold parameters were well recovered from all items containing collapsed categories. In general, item parameters were best recovered when response Category 3 was collapsed up into response Category 4. These results are displayed in Figures 8 and 9, respectively.

Figure 8

Relative Bias of Recovered GRM Item Parameters 5% Collapse Condition

Note. When using a 5% endorsement sparseness collapse rule Items 7–12 contained collapsed categories. The dashed horizontal lines indicate the |0.05| threshold for extreme relative bias.

Figure 9

Relative Bias of Recovered GRM Item Parameters 2.5% Collapse Condition

Note. When using a 2.5% endorsement sparseness collapse rule Items 10–12 contained collapsed categories. The dashed horizontal lines indicate the |0.05| threshold for extreme relative bias.

Figures 10 and 11 display the relative bias of recovered item parameters when using a 5% and a 2.5% collapse rule, respectively, for response Category 3 with the GPCM. Using a 5% collapse rule, where Items 7–12 have collapsed categories, we can see from Figure 10 that item parameters are not well recovered from data with collapsed categories. The d_3 parameter displays significant bias regardless of collapse direction; however, collapsing up seems to introduce the least bias into the parameter estimate. In addition, the d_2 and a_1 parameters display significant bias when using collapsed-down data. This suggests that using the GPCM with collapsed categories not only biases the parameters related to the target response category but also influences other response categories.

Figure 10

Relative Bias of Recovered GPCM Item Parameters 5% Collapse Condition

Note. When using a 5% endorsement sparseness collapse rule Items 7–12 contained collapsed categories. The dashed horizontal lines indicate the |0.05| threshold for extreme relative bias.

Figure 11

Relative Bias of Recovered GPCM Item Parameters 2.5% Collapse Condition

Note. When using a 2.5% endorsement sparseness collapse rule Items 10–12 contained collapsed categories. The dashed horizontal lines indicate the |0.05| threshold for extreme relative bias.

Using a 2.5% collapse threshold (Figure 11) resulted in lower (but still significant) relative bias values. From Figure 11 we can also compare parameter estimates between items without collapsed categories (Items 7–9) and items with collapsed categories (Items 10–12). Using collapsed categories significantly increases the relative bias in item parameter estimates. All item parameters display significant relative bias values regardless of collapse direction. Similar to the 5% collapse condition, collapsing response Category 3 up induces lower (but still significant) relative bias values.

Empirical Example of Category Collapse

In this section we analyze two-cohort repeated-measures data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The 1,656 ADNI participants were recruited from 57 sites in the United States and Canada and were between the ages of 55 and 90. Participants responded to a series of initial tests that were repeated at intervals over subsequent years. We used cognitive battery data from 2 ADNI cohorts: ADNI1 and ADNI2/ADNI GO. Specifically, we focus on the memory section (ADNI-MEM) of the cognitive battery.

To address MCMC convergence errors, polytomous response categories in the cognitive battery were combined if fewer than 20 individuals endorsed a particular category, and dichotomous response items were dropped if fewer than 20 individuals endorsed an option. The ADNI1 collapsed categories were used for any items shared with ADNI2/ADNI GO. Table 2 below shows how the items measuring memory were recoded (Gibbons et al., 2012; Wang et al., 2023).

Table 2

Recoded Response Categories ADNI-MEM in Gibbons et al.

Recoded Score

ADNI-MEM Measure           | 0     | 1     | 2     | 3     | 4     | 5      | 6   | 7       | 8       | 9
RAVLT Trial 1              | 0–2^a | 3     | 4     | 5     | 6     | 7      | 8   | 9       | 10      | 11–15^a
RAVLT Trial 2              | 0–2^a | 3     | 4     | 5     | 6     | 7      | 8   | 9       | 10      | 11–15^a
RAVLT Trial 3              | 0–2^a | 3     | 4     | 5–6^a | 7–8^a | 9      | 10  | 11      | 12      | 13–14^a
RAVLT Trial 4              | 0–3^a | 4     | 5–6^a | 7–8^a | 9     | 10     | 11  | 12      | 13      | 14–15^a
RAVLT Trial 5              | 0–3^a | 4     | 5     | 6–7^a | 8–9^a | 10–11^a| 12  | 13      | 14      | 15
Interference               | 0–1^a | 2     | 3     | 4     | 5     | 6      | 7   | 8–15^a  |         |
Immediate recall           | 0     | 1–2^a | 3–4^a | 5–6^a | 7     | 8      | 9   | 10–11^a | 12–13^a | 14–15^a
30-minute delay            | 0     | 1–2^a | 3–4^a | 5–6^a | 7     | 8      | 9   | 10–11^a | 12–13^a | 14–15^a
Recognition                | 0     | 1     | 2–3^a | 4–5^a | 6–7^a | 8–9^a  | 10–11^a | 12–13^a | 14  | 15
ADAS Cog – Trial 1         | 0–1^a | 2     | 3     | 4     | 5     | 6      | 7   | 8–10^a  |         |
ADAS Cog – Trial 2         | 0–2^a | 3     | 4     | 5     | 6     | 7      | 8   | 9       | 10      |
ADAS Cog – Trial 3         | 0–2^a | 3     | 4     | 5     | 6     | 7      | 8   | 9       | 10      |
Recall                     | 0     | 1     | 2     | 3     | 4     | 5      | 6   | 7       | 8       | 9–10^a
Recognition present        | 0–3^a | 4     | 5     | 6     | 7     | 8      | 9   | 10      | 11      | 12
Recognition absent         | 0–4^a | 5–6^a | 7     | 8     | 9     | 10     | 11  | 12      |         |
Logical Memory – Immediate | 0–1^a | 2–3^a | 4–5^a | 6–7^a | 8–9^a | 10–12^a| 13–14^a | 15–16^a | 17–18^a | 19–25^a
Logical Memory – Delay     | 0     | 1–2^a | 3–4^a | 5–8^a | 9–11^a| 12     | 13  | 14–15^a | 16–17^a | 18–25^a
MMSE Ball Recall           | 2     | 1     |       |       |       |        |     |         |         |
MMSE Flag Recall           | 2     | 1     |       |       |       |        |     |         |         |
MMSE Tree Recall           | 2     | 1     |       |       |       |        |     |         |         |

Note. Cell values indicate recoded score values.

^a indicates that adjacent response categories were combined.

A 2-Parameter Logistic (2PL) model was fit to items with binary responses, and a Graded Response Model (GRM) was fit to items with polytomous responses. The GRM was selected for the polytomous responses because of the Likert-type data collected. This fitting was performed twice: once on the original dataset and once on the collapsed dataset. Person parameters were extracted from each model fitting. Bias, Root Mean Square Error (RMSE), and average Standard Error (SE) change were calculated with the person parameters from the original dataset serving as the "true" values.

From Table 3 we can see that collapsing response categories did not have a negative effect on person parameter recovery. The bias between person parameter estimates from the original dataset (without collapsed categories) and estimates from the collapsed data was very close to zero for all four datasets. The RMSE was also very close to zero, indicating that the person parameter was well recovered after collapsing response categories. The average standard error of person parameter estimates increased slightly after collapsing response categories, but the increase was not significant.

Table 3

ADNI-MEM Person Parameter Recovery Summary Statistics

Data             | Bias  | RMSE  | Average SE Change
ADNI1 Baseline   | 0.001 | 0.057 | 0.007
ADNI1 Follow-up  | 0.001 | 0.063 | 0.008
ADNI2 Baseline   | 0.004 | 0.054 | 0.005
ADNI2 Follow-up  | 0.004 | 0.057 | 0.007

Note. Table values indicate the Bias, RMSE, and Average SE Change between the original dataset and the dataset containing collapsed categories.

Practical Implications for Researchers

This study examines the impact category collapse has on IRT parameter recovery and IRT data-model fit. Our study expands the literature by exploring parameter recovery and data-model fit for the Generalized Partial Credit IRT model along with the Graded Response IRT model.

In practice, since the true data generating model is unknown, the candidate IRT models (e.g., GRM or GPCM) are often selected based on item types. For instance, data collected from a Likert-type item would be appropriate for GRM, while scores from a constructed response item would be appropriate for GPCM. We provide the following example for researchers to consider when deciding which candidate IRT model to use.

Consider an assessment scored from 0 to 4 in which students solve a constructed response math item. When scoring student work, a student who earns a 3 has also successfully completed enough work to earn scores of 1 and 2. This cumulative-step assumption leads us to recommend the GPCM as a candidate IRT model. In contrast, consider a Likert-style survey item with responses "Disagree," "Neutral," and "Agree." Here the response reflects a position on an ordered continuum rather than accumulated steps, which leads us to recommend the GRM as a candidate IRT model.
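The distinction between the two candidate models can be made concrete by comparing their category response functions. A minimal sketch, with arbitrary illustrative parameter values; `grm_probs` and `gpcm_probs` are hypothetical helper names, and the GRM intercepts d are assumed to be in decreasing order:

```python
import numpy as np

def grm_probs(theta, a, d):
    """GRM: category probabilities are differences of cumulative ('graded')
    boundary curves P*_k = logistic(a*theta + d_k)."""
    cum = 1.0 / (1.0 + np.exp(-(a * theta + np.asarray(d, float))))
    cum = np.concatenate(([1.0], cum, [0.0]))  # P(X >= 0) = 1; P(X >= K+1) = 0
    return cum[:-1] - cum[1:]

def gpcm_probs(theta, a, d):
    """GPCM: probabilities built from cumulative sums of adjacent-category
    ('partial credit') step logits a*(theta - d_m)."""
    steps = np.concatenate(([0.0], a * (theta - np.asarray(d, float))))
    num = np.exp(np.cumsum(steps))
    return num / num.sum()

# Same number of categories, different response processes
p_grm = grm_probs(0.5, 1.2, [1.0, -0.5])     # 3-category Likert-type item
p_gpcm = gpcm_probs(0.5, 1.2, [-1.0, 0.5])   # 3-category partial-credit item
```

Both functions return a valid probability vector over the three categories, but the curves they trace over theta differ, which is why model choice should follow the item type.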

We caution researchers against fitting both the GRM and the GPCM to their data and simply selecting whichever fits best, because model parameters should be interpretable in terms of the specific item types. Prior to category collapse, we strongly recommend that researchers confirm that their observed item response data fit their proposed IRT model, for example with a statistical test such as the S-X2 item-fit test.

In sum, we provide the following recommendations to practitioners: Firstly, if the observed item response data comes from Likert scale items, research practitioners can fit a GRM to a collapsed dataset. There was no significant data-model misfit introduced for any items containing collapsed categories. Secondly, if GRM is used, practitioners may use either the 2.5% or 5% endorsement threshold when deciding to collapse adjacent response categories. Practitioners should collapse the targeted response category down into the next lowest adjacent category. This combination resulted in the least bias in recovered IRT person and item parameters.

Thirdly, if the observed item response data best fits a GPCM we caution against fitting a GPCM to a collapsed dataset. This process introduced significant data-model misfit for larger sample sizes along with substantial relative bias in recovered person and item parameters. Lastly, if researchers wish to collapse categories when using GPCM we recommend that researchers collapse the target response category up into the next highest response category. This process still produces significant relative bias values in recovered item and person parameters, but the bias is lower when compared to collapsing down.
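The two collapsing directions recommended above amount to a simple recoding of the observed scores. A sketch, assuming scores coded 0, 1, ..., K and a hypothetical `collapse` helper:

```python
import numpy as np

def collapse(scores, target, direction="down"):
    """Merge category `target` into an adjacent category and relabel so
    the remaining scores stay consecutive (0, 1, ..., K-1)."""
    x = np.asarray(scores).copy()
    if direction == "down":       # merge target into the next-lowest score
        x[x == target] = target - 1
    else:                         # "up": merge target into the next-highest
        x[x == target] = target + 1
    x[x > target] -= 1            # close the gap left by the removed score
    return x

down = collapse([0, 1, 2, 3, 4], target=4, direction="down")  # -> 0,1,2,3,3
up = collapse([0, 1, 2, 3, 4], target=2, direction="up")      # -> 0,1,2,2,3
```

The relabeling step matters: after the merge, the remaining categories must again form a consecutive score range before refitting either IRT model.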

Conclusion

The purpose of this study is to extend the literature on category collapse by investigating, through rigorous Monte Carlo simulations, whether category collapse can be justified when using the Graded Response (GRM) and Generalized Partial Credit (GPCM) Item Response Theory (IRT) models. From our extensive simulation study, we concluded that when using the Graded Response Model, adjacent response categories can be combined without biasing IRT parameter estimation. In contrast, when using a Generalized Partial Credit Model with data containing collapsed categories, IRT parameters were not well recovered and significant data-model misfit was introduced into the analysis.

Funding

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grants R305D200015 and R305D240021 to the University of Washington.

Acknowledgments

The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

Competing Interests

The authors have declared that no competing interests exist.

Appendices

Appendix A: GRM and GPCM Parameter Estimation

GRM Threshold Parameters

We begin by deriving the GRM complete-data log-likelihood function and its partial derivatives. Let X_i be the observed item response vector for person i, with i = 1, ..., N and j = 1, ..., n, where N and n denote the sample size and test length, respectively. The joint likelihood is:

A1
L(x_1, \dots, x_N \mid a, d, \theta) = \prod_{j=1}^{n} \prod_{k=0}^{K-1} P_{jk}(x_{ij}=k \mid a_j, d_k, \theta_i)^{I(x_{ij}=k)}

For notational simplicity let P(x_i \mid \theta_i, \epsilon) = \prod_{k=0}^{K-1} P_{jk}(x_{ij}=k \mid a_j, d_k, \theta_i)^{I(x_{ij}=k)}, with \epsilon being the n \times 2 matrix of item parameters. The marginal likelihood of X is:

A2
L = \prod_{i=1}^{N} \int P(x_i \mid \theta_i, \epsilon)\, g(\theta \mid \tau)\, d\theta
where g(\theta \mid \tau) is the normal density. We further simplify the notation by letting P(x_i, \epsilon) = \int \prod_{j=1}^{n} P(x_i \mid \theta, \epsilon)\, g(\theta \mid \tau)\, d\theta. Then, to find the MMLE of d_k, we take the respective partial derivative of the log-likelihood.
A3
\frac{\partial}{\partial d_k} \log L = \sum_{i=1}^{N} \frac{\partial}{\partial d_k} \log P(x_i, \epsilon)
A4
= \sum_{i=1}^{N} \frac{1}{P(x_i, \epsilon)} \int \frac{\partial}{\partial d_k} P(x_i \mid \theta, \epsilon)\, g(\theta \mid \tau)\, d\theta
A5
= \sum_{i=1}^{N} \frac{1}{P(x_i, \epsilon)} \int \left[\frac{\partial}{\partial d_k} \log P(x_i \mid \theta, \epsilon)\right] P(x_i \mid \theta, \epsilon)\, g(\theta \mid \tau)\, d\theta
A6
= \sum_{i=1}^{N} \int \left[\frac{\partial}{\partial d_k} \log P(x_i \mid \theta, \epsilon)\right] \frac{P(x_i \mid \theta, \epsilon)\, g(\theta \mid \tau)}{P(x_i, \epsilon)}\, d\theta
Note that using Bayes' theorem we have \frac{P(x_i \mid \theta, \epsilon)\, g(\theta \mid \tau)}{P(x_i, \epsilon)} = P(\theta \mid x_i, \tau, \epsilon).

A7
\frac{\partial}{\partial d_k} \log L = \sum_{i=1}^{N} \int \left[\frac{\partial}{\partial d_k} \log P(x_i \mid \theta, \epsilon)\right] P(\theta \mid x_i, \tau, \epsilon)\, d\theta

We now focus on solving \frac{\partial}{\partial d_k} \log P(x_i \mid \theta, \epsilon).

A8
\frac{\partial}{\partial d_k} \log P(x_i \mid \theta, \epsilon) = \frac{\partial}{\partial d_k} \log \prod_{k=0}^{K-1} P_{jk}(x_{ij}=k \mid a_j, d_k, \theta_i)^{I(x_{ij}=k)}
A9
= \sum_{k=0}^{K-1} I(x_{ij}=k) \frac{\partial}{\partial d_k} \left[\log P_{jk}(x_{ij}=k \mid a_j, d_k, \theta_i)\right]
A10
= \sum_{k=0}^{K-1} I(x_{ij}=k) \frac{1}{P_{jk}(x_{ij}=k \mid a_j, d_k, \theta_i)} \frac{\partial}{\partial d_k} P_{jk}(x_{ij}=k \mid a_j, d_k, \theta_i)
A11
= \sum_{k=0}^{K-1} I(x_{ij}=k) \frac{P_k^*(\theta)\left[1 - P_k^*(\theta)\right]}{P_k^*(\theta) - P_{k+1}^*(\theta)}

After substituting Equation A11 into Equation A7 we obtain:

A12
\frac{\partial}{\partial d_k} \log L = \sum_{i=1}^{N} \int \sum_{k=0}^{K-1} I(x_{ij}=k) \frac{P_k^*(\theta)\, Q_k^*(\theta)}{P_k^*(\theta) - P_{k+1}^*(\theta)} P(\theta \mid x_i, \tau, \epsilon)\, d\theta

where Q_k^*(\theta) = 1 - P_k^*(\theta).

From Equation A12 we can see that threshold estimation involves only the two adjacent response categories, k and k + 1. Collapsing the highest two response categories removes the threshold parameter corresponding to the highest response option. Similarly, when collapsing a category into a lower adjacent category, the threshold parameter separating the two categories is removed. After removing the collapsed threshold, a GRM can still be used to estimate the remaining threshold parameters because category collapse has not altered their order.
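This adjacency property can also be checked numerically: perturbing a single GRM threshold moves only the probabilities of the two categories that share that boundary. A sketch with arbitrary illustrative parameter values (not the paper's):

```python
import numpy as np

def grm_category_probs(theta, a, d):
    """P(X = k), k = 0..K-1, from boundary curves P*_k = logistic(a*theta + d_k)."""
    cum = 1.0 / (1.0 + np.exp(-(a * theta + np.asarray(d, float))))
    cum = np.concatenate(([1.0], cum, [0.0]))
    return cum[:-1] - cum[1:]

theta, a = 0.3, 1.5
d = np.array([2.0, 0.5, -1.0])          # three thresholds -> four categories
base = grm_category_probs(theta, a, d)

d_pert = d.copy()
d_pert[1] += 1e-6                       # nudge the middle threshold only
moved = np.abs(grm_category_probs(theta, a, d_pert) - base) > 1e-12
# moved is [False, True, True, False]: only categories 1 and 2 respond
```

The probabilities of the two outer categories are bitwise unchanged, consistent with Equation A12 involving only P*_k and P*_{k+1}.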

GRM Discrimination Parameter

Continuing the derivation presented in the previous section, we now focus on finding the MMLE of the item-specific discrimination parameter a_j. To do so we take the respective partial derivative of the log-likelihood.

A13
\frac{\partial}{\partial a_j} \log L = \sum_{i=1}^{N} \frac{\partial}{\partial a_j} \log P(x_i, \epsilon)

Following a similar derivation as for the threshold parameters we obtain

A14
\frac{\partial}{\partial a_j} \log L = \sum_{i=1}^{N} \int \left[\frac{\partial}{\partial a_j} \log P(x_i \mid \theta, \epsilon)\right] P(\theta \mid x_i, \tau, \epsilon)\, d\theta

We now focus on solving \frac{\partial}{\partial a_j} \log P(x_i \mid \theta, \epsilon).

A15
\frac{\partial}{\partial a_j} \log P(x_i \mid \theta, \epsilon) = \frac{\partial}{\partial a_j} \log \prod_{k=0}^{K-1} P_{jk}(x_{ij}=k \mid a_j, d_k, \theta_i)^{I(x_{ij}=k)}
A16
= \sum_{k=0}^{K-1} I(x_{ij}=k) \frac{\partial}{\partial a_j} \left[\log P_{jk}(x_{ij}=k \mid a_j, d_k, \theta_i)\right]
A17
= \sum_{k=0}^{K-1} I(x_{ij}=k) \frac{1}{P_{jk}(x_{ij}=k \mid a_j, d_k, \theta_i)} \frac{\partial}{\partial a_j} P_{jk}(x_{ij}=k \mid a_j, d_k, \theta_i)
A18
= \sum_{k=0}^{K-1} I(x_{ij}=k) \frac{\theta P_k^*(\theta)\, Q_k^*(\theta) - \theta P_{k+1}^*(\theta)\, Q_{k+1}^*(\theta)}{P_k^*(\theta) - P_{k+1}^*(\theta)}

where Q_k^*(\theta) = 1 - P_k^*(\theta). Substituting Equation A18 into Equation A14 we obtain:

A19
\frac{\partial}{\partial a_j} \log L = \sum_{i=1}^{N} \int \sum_{k=0}^{K-1} I(x_{ij}=k) \frac{\theta P_k^*(\theta)\, Q_k^*(\theta) - \theta P_{k+1}^*(\theta)\, Q_{k+1}^*(\theta)}{P_k^*(\theta) - P_{k+1}^*(\theta)} P(\theta \mid x_i, \tau, \epsilon)\, d\theta

As in the case of threshold parameter estimation, the estimation of the GRM discrimination parameter also involves only the two adjacent response categories, k and k + 1. Therefore, in theory, category collapse will not influence the estimation of item-specific GRM discrimination parameters.
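The closed form in Equation A18 can be verified by finite differences for a single observed category; the parameter values below are arbitrary illustrations:

```python
import numpy as np

theta, a = 0.3, 1.5
d_k, d_k1 = 0.5, -1.0                      # boundaries for categories k and k+1

def pstar(a_, d_):
    """Cumulative boundary curve P* = logistic(a*theta + d)."""
    return 1.0 / (1.0 + np.exp(-(a_ * theta + d_)))

def log_pk(a_):
    """log P(X = k) = log[P*_k - P*_{k+1}] for the observed category k."""
    return np.log(pstar(a_, d_k) - pstar(a_, d_k1))

# Closed form from Equation A18 for a single observed category
Pk, Pk1 = pstar(a, d_k), pstar(a, d_k1)
analytic = theta * (Pk * (1 - Pk) - Pk1 * (1 - Pk1)) / (Pk - Pk1)

# Central finite difference of the log-probability in a_j
h = 1e-6
numeric = (log_pk(a + h) - log_pk(a - h)) / (2 * h)
```

The analytic and numeric derivatives agree to several decimal places, and only the two boundary curves adjacent to the observed category enter the expression.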

GPCM Threshold Parameters

Let Xi be the observed item response vector for person i, with N and n denoting the sample size and test length respectively. Let Pijk be the probability that student i is assigned score k on Item j, with all items having a maximum score of K. The joint maximum likelihood is:

A20
L(x_1, \dots, x_N \mid a, d, \theta) = \prod_{j=1}^{n} \prod_{k=0}^{K} P_{ijk}(x_{ij}=k \mid a_j, d_k, \theta_i)^{I(x_{ij}=k)}

with

A21
P_{ijk} = \frac{\exp\left[\sum_{m=0}^{k} a_j(\theta_i - d_{jm})\right]}{\sum_{c=0}^{K} \exp\left[\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right]}

Similar to the GRM derivation, we write the first derivative of the log-likelihood with respect to the target threshold d_{jt} as:

A22
\frac{\partial}{\partial d_{jt}} \log L = \sum_{i=1}^{N} \int \left[\frac{\partial}{\partial d_{jt}} \log P(x_i \mid \theta, \epsilon)\right] P(\theta \mid x_i, \tau, \epsilon)\, d\theta

where P(x_i \mid \theta, \epsilon) = \prod_{k=0}^{K} P_{ijk}(x_{ij}=k \mid a_j, d_k, \theta)^{I(x_{ij}=k)}. Focusing on \frac{\partial}{\partial d_{jt}} \log P(x_i \mid \theta, \epsilon):

A23
\frac{\partial}{\partial d_{jt}} \log P(x_i \mid \theta, \epsilon) = \frac{\partial}{\partial d_{jt}} \log \prod_{k=0}^{K} P_{ijk}(x_{ij}=k \mid a_j, d_k, \theta)^{I(x_{ij}=k)}
A24
= \sum_{k=0}^{K} I(x_{ij}=k) \frac{\partial}{\partial d_{jt}} \log \frac{\exp\left[\sum_{m=0}^{k} a_j(\theta_i - d_{jm})\right]}{\sum_{c=0}^{K} \exp\left[\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right]}
A25
= \sum_{k=0}^{K} I(x_{ij}=k) \frac{\partial}{\partial d_{jt}} \left[\sum_{m=0}^{k} a_j(\theta_i - d_{jm}) - \log \sum_{c=0}^{K} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)\right]

When calculating the first derivative we consider two cases: when an examinee is assigned a score of 0 (i.e., k = 0) and when the assigned score is above zero (i.e., k = 1, 2, \dots, K).

When k = 0 we have:

A26
\frac{\partial}{\partial d_{jt}} \log P(x_i \mid \theta, \epsilon) = \sum_{k=0}^{K} I(x_{ij}=k)\left[0 - \frac{-a_j \sum_{c=t}^{K} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)}{\sum_{c=0}^{K} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)}\right]
A27
= \sum_{k=0}^{K} I(x_{ij}=k)\, a_j \frac{\sum_{c=t}^{K} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)}{\sum_{c=0}^{K} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)}
A28
= \sum_{k=0}^{K} I(x_{ij}=k)\, a_j \left[\frac{\sum_{c=0}^{K} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)}{\sum_{c=0}^{K} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)} - \frac{\sum_{c=0}^{t-1} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)}{\sum_{c=0}^{K} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)}\right]
A29
= \sum_{k=0}^{K} I(x_{ij}=k)\, a_j \left[1 - \sum_{c=0}^{t-1} p_{jc}(\theta)\right]

Similarly, when k = 1, 2, \dots, K we have:

A30
\frac{\partial}{\partial d_{jt}} \log P(x_i \mid \theta, \epsilon) = \sum_{k=0}^{K} I(x_{ij}=k)\left[-a_j - \frac{-a_j \sum_{c=t}^{K} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)}{\sum_{c=0}^{K} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)}\right]
A31
= \sum_{k=0}^{K} I(x_{ij}=k)\left[-a_j + a_j \frac{\sum_{c=t}^{K} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)}{\sum_{c=0}^{K} \exp\left(\sum_{m=0}^{c} a_j(\theta_i - d_{jm})\right)}\right]
A32
= \sum_{k=0}^{K} I(x_{ij}=k)\left[-a_j + a_j\left(1 - \sum_{c=0}^{t-1} p_{jc}(\theta)\right)\right]
A33
= \sum_{k=0}^{K} I(x_{ij}=k)\left[-a_j \sum_{c=0}^{t-1} p_{jc}(\theta)\right]

Combining the results, we obtain the first derivative of the log-likelihood with respect to the threshold parameter of interest as:

A34
\frac{\partial}{\partial d_{jt}} \log P(x_i \mid \theta, \epsilon) = \begin{cases} \sum_{k=0}^{K} I(x_{ij}=k)\, a_j \left[1 - \sum_{c=0}^{t-1} p_{jc}(\theta)\right], & k = 0 \\ \sum_{k=0}^{K} I(x_{ij}=k)\left[-a_j \sum_{c=0}^{t-1} p_{jc}(\theta)\right], & k = 1, 2, \dots, K \end{cases}

After substituting Equation A34 into Equation A22 we obtain two cases for the derivative:

A35
\frac{\partial}{\partial d_{jt}} \log L = \begin{cases} \sum_{i=1}^{N} \int \sum_{k=0}^{K} I(x_{ij}=k)\, a_j \left[1 - \sum_{c=0}^{t-1} p_{jc}(\theta)\right] P(\theta \mid x_i, \tau, \epsilon)\, d\theta, & k = 0 \\ \sum_{i=1}^{N} \int \sum_{k=0}^{K} I(x_{ij}=k)\left[-a_j \sum_{c=0}^{t-1} p_{jc}(\theta)\right] P(\theta \mid x_i, \tau, \epsilon)\, d\theta, & k = 1, 2, \dots, K \end{cases}

In contrast to GRM threshold parameter estimation, we can see that the estimation of GPCM threshold parameters relies on all response categories. Since category collapse involves the removal of a threshold parameter, GPCM threshold parameters estimated from collapsed-category data will not be the same as those estimated from the original dataset. This introduces data-model misfit, since the dataset resulting from category collapse no longer follows a GPCM item response function.
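This consequence can be illustrated numerically. Under the GPCM (Equation A21) the adjacent-category log-odds log(P_k / P_{k-1}) = a_j(theta - d_{jk}) are linear in theta; after two categories are merged, the corresponding log-odds are no longer linear, so no GPCM with one fewer threshold can reproduce the collapsed probabilities exactly. A sketch with arbitrary illustrative parameter values:

```python
import numpy as np

def gpcm_probs(theta, a, d):
    """Equation A21: P_k proportional to exp(sum_{m<=k} a*(theta - d_m))."""
    steps = np.concatenate(([0.0], a * (theta - np.asarray(d, float))))
    num = np.exp(np.cumsum(steps))
    return num / num.sum()

a, d = 1.2, [-1.0, 0.0, 1.0]              # four categories (0..3)
thetas = [-1.0, 0.0, 1.0]                 # equally spaced ability points

orig, coll = [], []
for t in thetas:
    p = gpcm_probs(t, a, d)
    orig.append(np.log(p[2] / p[1]))      # adjacent log-odds, original data
    q = [p[0], p[1], p[2] + p[3]]         # collapse the top category down
    coll.append(np.log(q[2] / q[1]))      # same log-odds after collapse

# Second difference over equally spaced theta: zero iff log-odds are linear
curv_orig = orig[2] - 2 * orig[1] + orig[0]   # ~0, GPCM linearity holds
curv_coll = coll[2] - 2 * coll[1] + coll[0]   # clearly nonzero after collapse
```

The original log-odds lie exactly on a line in theta, while the collapsed log-odds curve away from it, which is the data-model misfit the derivation predicts.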

Appendix B: Data Generating Parameters

Graded Response Model Data Generating Parameters

Item       a          d1         d2          d3           d4
Item 1     1.9682867  3.210208   0.6041567   -1.7253799   -2.78973
Item 2     1.2967538  2.341573   1.261702    -0.2455449   -2.013333
Item 3     3.6240119  6.163613   2.7843951   -1.0279311   -5.787697
Item 4     1.6383015  2.669297   0.6504664   -1.0579291   -2.658706
Item 5     2.6412785  4.14342    0.4016618   -0.9414887   -2.823617
Item 6     1.8363191  3.078195   1.4678899   -1.7952482   -2.747437
Item 7     2.7586827  3.981743   1.7643366   -3.0573067   -3.746977
Item 8     1.6792375  2.140485   0.9788973   -2.3030821   -2.773269
Item 9     1.2427276  2.300461   0.3158956   -1.1612637   -1.459518
Item 10    2.592369   4.816314   1.7304576   -3.0712945   -3.356455
Item 11    0.8634693  1.088787   0.6129386   -0.9727626   -1.102283
Item 12    1.8793678  3.697137   1.7929021   -2.6441636   -2.907275

Generalized Partial Credit Model Data Generating Parameters

Item       a          d1         d2          d3           d4
Item 1     1.9682867  3.210208   0.6041567   -1.7253799   -2.78973
Item 2     1.2967538  2.341573   1.261702    -0.2455449   -2.013333
Item 3     3.6240119  6.163613   2.7843951   -1.0279311   -5.787697
Item 4     1.6383015  2.669297   0.6504664   -1.0579291   -2.658706
Item 5     2.6412785  4.14342    0.4016618   -0.9414887   -2.823617
Item 6     1.8363191  3.078195   1.4678899   -1.7952482   -2.747437
Item 7     2.7586827  3.981743   1.7643366   -1.2641629   -3.746977
Item 8     1.6792375  2.140485   0.9788973   -1.5978024   -2.773269
Item 9     1.2427276  2.300461   0.3158956   -1.0867      -1.459518
Item 10    2.592369   4.816314   1.7304576   -1.6714152   -3.356455
Item 11    0.8634693  1.088787   0.6129386   -2.1384461   -1.102283
Item 12    1.8793678  3.697137   1.7929021   -1.685686    -2.907275